Document Imaging
The computerized reference library will provide a structured and controlled document management environment that can be expanded for use in exchanging documents with clients and vendors.
Justification
The proposed system should not be perceived as an extra cost incurred to make documents available for online access. Rather, the system will replace the present method of creating and updating the documents. Using the system to maintain the documents will give users the added benefits of online access and control.
For all documents that can be maintained on the system, the cost of reproducing and distributing updated hard copies will be eliminated, including the cost of the 3-ring document binders. Over time this will result in fewer bookcases and file cabinets, and in reduced floor space.
Updates to documents will be immediately available to users, which will reduce the rework that now results from the use of outdated reference materials. Because updating will take less time, effort and cost, many types of documents will be kept more current than they are under the present methods.
The structured and readily available method of locating documents will benefit users in two ways: non-productive time spent searching for documents will be reduced, and projects will have improved methods of locating and using documents that were produced on other projects.
A document imaging system will reduce the requirement for hard copies and enhance the benefits of computer-generated documents. The number of computer-generated project documents continually increases, so the benefits of an online reference system for managing and accessing these documents will continually increase as well.
Additionally, if use of the system expands to several departments and to other offices as expected, there will be an increased need for documentation, training and responsive technical support. The need for administration of the related activities would also increase.
Therefore, although the proposed system has the potential to provide a significant positive influence on our methods of executing projects, enhancements and support will be required for the system to deliver this full potential.
State of the Technology
Scanning
OCR
The OCR process can be performed by special hardware cards for PCs, by software running on the PC, or by firmware in special scanners. OCR software typically accepts compressed raster information and must expand it to full raster before the recognition phase can begin. High-end OCR scanners can avoid the compression/decompression phase because their input stream is direct raster.
The output from an OCR process is seldom totally correct. Errors are introduced when characters "bleed" and touch one another, or when the scanner picks up "ghost" images from the reverse side of a document. Success rates vary from 80% to 99%. Output formats include straight ASCII and popular word processing formats, complete with underlining and superscripting.
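To put the quoted success rates in perspective, the short Python sketch below estimates how many characters on a page would be misrecognized, and how likely a keyword is to survive recognition intact. It rests on assumptions that are not in the text above: the success rate is treated as a per-character accuracy, errors are treated as independent, and the page size of 2,000 characters and the function names are purely illustrative.

```python
# Rough arithmetic only. Assumes the quoted success rate is a per-character
# accuracy and that recognition errors are independent of one another.

CHARS_PER_PAGE = 2000  # hypothetical page size, not taken from the source


def expected_errors(chars_per_page, accuracy):
    """Expected number of misrecognized characters on one page."""
    return chars_per_page * (1.0 - accuracy)


def word_intact_probability(accuracy, word_length):
    """Probability that every character of a word is recognized correctly."""
    return accuracy ** word_length


for accuracy in (0.80, 0.99):
    errors = round(expected_errors(CHARS_PER_PAGE, accuracy))
    intact = word_intact_probability(accuracy, 6)
    print(f"accuracy {accuracy:.0%}: about {errors} errors per page, "
          f"6-letter word intact with probability {intact:.2f}")
```

Under these assumptions the two ends of the range differ sharply: roughly 400 errors per page at 80% against roughly 20 at 99%, and a six-letter keyword survives intact only about a quarter of the time at the low end. This matters for any search method that depends on whole words being spelled correctly.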
Full Text Search
Exhaustive search - This is the most primitive method available. A search through the source documents for a particular keyword or phrase is initiated for each request. This method is very simple to implement but is unusable for large stores of information. Some low cost PC packages employ this method.
Inverted keyword index - This method builds an index table based upon selected or typical keywords used in the documents, filtering out common articles like "A" or "THE". The inverted index is popular today but has a major drawback: both the documents and the query statement must be spelled correctly or a keyword will not be found. Some users employ a table of frequently misspelled words to assist the query function. This table will not compensate for the random errors introduced by an OCR process.
N-gram - This is a new technology which reduces the impact of misspelled words. An N-gram is a subdivision of a word: a short sequence of N consecutive characters. By reducing the dependence upon complete words, a higher probability of finding an association between query and source is achieved. An interesting implementation of N-gram search has been produced utilizing neural networking technology. In this method, a document is "learned", or searched for patterns. The index of patterns is stored for later comparison in a manner similar to the way psychologists believe the human brain remembers facts.
The N-gram and the inverted index methods require additional storage for the index files. Overhead size of 30% is common.
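The difference between the inverted keyword index and the N-gram approach can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two sample documents, the function names, the use of character trigrams (N = 3), the Dice overlap score and the 0.4 threshold. It does not describe any particular product, and it does not attempt the neural network variant mentioned above. For brevity the N-gram search scans every document, whereas a real system would store the N-grams in an index.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the"}  # common articles left out of the index


def tokenize(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = []
    for raw in text.lower().split():
        word = "".join(ch for ch in raw if ch.isalnum())
        if word and word not in STOP_WORDS:
            words.append(word)
    return words


def build_inverted_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def ngrams(word, n=3):
    """Return the set of character n-grams in a word."""
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def ngram_search(docs, query, n=3, threshold=0.4):
    """Return ids of documents containing a word that shares enough
    n-grams with the query (Dice coefficient >= threshold)."""
    query_grams = ngrams(query.lower(), n)
    hits = set()
    for doc_id, text in docs.items():
        for word in tokenize(text):
            word_grams = ngrams(word, n)
            shared = len(query_grams & word_grams)
            dice = 2 * shared / (len(query_grams) + len(word_grams))
            if dice >= threshold:
                hits.add(doc_id)
                break
    return hits


# Hypothetical sample data: document 1 carries a typical OCR error,
# "docurnent" in place of "document" (the letter m read as r and n).
docs = {
    1: "The docurnent was scanned and stored on optical disk.",
    2: "Jukebox hardware switches the media.",
}

index = build_inverted_index(docs)
exact_hits = index.get("document", set())     # exact keyword lookup
fuzzy_hits = ngram_search(docs, "document")   # n-gram overlap lookup
print("inverted index:", exact_hits)
print("n-gram search:", fuzzy_hits)
```

In this sketch the exact keyword lookup finds nothing, because the word was never indexed in its correct spelling, while the trigram comparison still associates the query with document 1. The threshold trades missed matches against false matches and would need tuning on real data.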
File Structures
Document Storage
Write Once Read Many (WORM) drives permit a user to place information on a disk by burning it in with a laser and to read the data later. There are no real standards being applied to WORM drives and media. There are 5.25, 8, 10, 12 and 14 inch media systems available. Even two drives that take the same size of media are not guaranteed to read each other's disks, since manufacturers format the media with different methods.
CD-ROM machines take manufactured disks with information pre-written and provide read-only access. The media is virtually the same as the audio Compact Disc (CD), which has been available for over 5 years. The media is inexpensive, but the cost of producing a master may make this alternative unattractive unless a large number of copies is required.
Erasable Optical disks have recently been introduced. The major advantage of this drive is the ability to reuse the media. There are still some concerns about the long term stability of the media, which may preclude the use of this first generation of erasable optical for archival purposes.
In order to fully utilize the large storage capacity of WORM or Erasable Optical drives, a robotic mechanism is required to switch and/or flip the media. These devices are called Jukeboxes because they function much like the audio record changers of days gone by.
File Management/Indexing
Networking
Data Security