---
Document Imaging
---

Introduction
The computerized reference library will provide a structured and controlled document management environment that can be expanded for use in exchanging documents with clients and vendors.

Justification
It should be recognized that the development of such a system will require a long term commitment.

The proposed system should not be perceived as an extra cost incurred just to make documents available online. Rather, the system will replace the present method used for creating and updating the documents. Using the system to maintain the documents will give users the added benefits of online access and control.

For all documents that can be maintained on the system, the cost to reproduce and distribute updated hard copies will be eliminated. Included in these costs are the 3-ring document binders. Over a period of time this will result in fewer bookcases and file cabinets, and in reduced floor space.

Updates to documents will be immediately available to users. This will reduce the amount of rework that now results from the use of outdated reference materials.

Because less time, effort and cost will be needed to update documents, many types of documents will be kept more current than they are under the present methods.

The structured and readily available method of locating documents will benefit users in two ways: non-productive time spent searching for documents will be reduced, and projects will have improved methods of locating and using documents that were produced on other projects.

A document imaging system will reduce the requirement for hard copies and enhance the benefits of computer-generated documents. The number of computer-generated project documents continually increases. Therefore, the benefits of an online reference system for managing and accessing these documents will also continue to increase.

Additionally, if the system use expands to several departments and to other offices as expected, there will be an increased need for documentation, training and responsive technical support. The need for administration of the related activities would also increase. Therefore, although the proposed system has the potential to provide a significant positive influence on our methods of executing projects, enhancements and support will be required for the system to deliver this full potential.

State of the Technology

Scanning
A scanner is a hardware device that takes an input document and produces an electronic stream of information representing the image. Because of the amount of information produced, compression and decompression techniques are used to reduce the storage and communication requirements of this raster data. Issues to consider when selecting a scanner are document size, automatic scanning requirements, speed, output formats and resolution requirements.
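
As a rough illustration of why compression matters for raster data, the sketch below run-length encodes one row of a bilevel (black and white) scan. Run-length encoding is used here only because it is simple to show; the row contents and the function names are assumptions made for this example, not a description of any particular scanner's output format.

```python
from itertools import groupby

def rle_encode(row):
    """Collapse a row of 0/1 pixels into (pixel_value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(row)]

def rle_decode(runs):
    """Expand (pixel_value, run_length) pairs back into the original row."""
    row = []
    for value, length in runs:
        row.extend([value] * length)
    return row

# Hypothetical scan line: mostly white (0) with two short black (1) strokes.
scan_line = [0] * 1000 + [1] * 12 + [0] * 500 + [1] * 8 + [0] * 1030

runs = rle_encode(scan_line)
print(len(scan_line), "pixels stored as", len(runs), "runs")
```

A typical text page is mostly white space, so long runs of identical pixels are common and the encoded form is far smaller than the raw row. Decoding restores the row exactly, which is the expansion step an OCR package must perform before recognition.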

OCR
Optical Character Recognition (OCR) is a process in which a stream of raster information is interpreted into ASCII characters. The process is extremely sensitive to the quality of the raster information presented as well as differences in the fonts used. A high quality scanner will be able to provide quality raster information by eliminating background noise and giving a high contrast image.

The OCR process can be performed by special hardware cards for PCs, by software running on the PC or by firmware in special scanners. OCR software typically accepts compressed raster information and must expand it to full raster before the recognition phase can begin. High-end OCR scanners can avoid the compression/decompression phase since the input stream is direct raster.

The output from an OCR process is seldom totally correct. Errors are introduced when characters "bleed" and touch one another or when the scanner picks up "ghost" images from the reverse side of a document. Success rates vary from 80% to 99%. Output formats include straight ASCII and popular word processing documents complete with underlining and superscripting.
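
To put these success rates in perspective, the sketch below estimates how many character errors would remain on a single page. The figure of 2,000 characters per page is an assumption made purely for illustration; real pages vary widely.

```python
CHARS_PER_PAGE = 2000  # assumed page length, for illustration only

def expected_errors(accuracy, chars=CHARS_PER_PAGE):
    """Expected number of misrecognized characters at a given per-character accuracy."""
    return round(chars * (1 - accuracy))

for accuracy in (0.80, 0.90, 0.99):
    print(f"{accuracy:.0%} accuracy: about {expected_errors(accuracy)} errors per page")
```

Under this assumption, even the top of the quoted range leaves about 20 errors on a page, which is why the choice of text search method matters for OCR output.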

Full Text Search
Once a large bank of ASCII-based information is stored in electronic form, a full text search process is needed in order to find documents based upon their content. Three competing technologies are in use for text search.

Exhaustive search - This is the most primitive method available. A search through the source documents for a particular keyword or phrase is initiated for each request. This method is very simple to implement but is totally unusable for large stores of information. Some low cost PC packages employ this method.

Inverted keyword index - This method builds an index table from selected or typical keywords used in the documents, filtering out common words such as "A" or "THE". The inverted index is popular today but has a major drawback: both the documents and the query statement must be spelled correctly, or a keyword will not be found. Some users employ a table of frequently misspelled words to assist the query function, but this table will not compensate for the random errors introduced by an OCR process.

N-gram - This is a newer technology that reduces the impact of misspelled words. An N-gram is a fragment of a word, N characters long. By reducing the dependence upon complete words, the method achieves a higher probability of finding an association between query and source. An interesting implementation of N-gram search has been produced utilizing neural networking technology. In this method, a document is "learned" or searched for patterns. The index of patterns is stored for later comparison, in a manner similar to the way psychologists believe the human brain remembers facts. Both the N-gram and the inverted index methods require additional storage for the index files; an overhead of 30% is common.
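
The three methods can be compared with a small sketch. The documents, the stop-word list, the trigram size and the 0.5 similarity threshold are all assumptions chosen for illustration, and the N-gram matcher uses a plain set-overlap score rather than the neural network implementation mentioned above. One document deliberately contains an OCR-style error ("c0mpressor").

```python
import re
from collections import defaultdict

# Hypothetical documents; doc2 carries an OCR error ("c0mpressor").
DOCS = {
    "doc1": "the pump compressor manual",
    "doc2": "spare parts for the c0mpressor",
    "doc3": "office floor plan",
}
STOP_WORDS = {"a", "the", "for"}

def words(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def exhaustive_search(term):
    """Scan every document for the term on each request."""
    return sorted(name for name, text in DOCS.items() if term in text.lower())

def build_inverted_index():
    """Map each keyword (minus stop words) to the documents containing it."""
    index = defaultdict(set)
    for name, text in DOCS.items():
        for word in words(text):
            if word not in STOP_WORDS:
                index[word].add(name)
    return index

def ngrams(word, n=3):
    """Split a word into its overlapping fragments of length n."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def ngram_search(term, threshold=0.5):
    """Return documents holding a word that shares enough n-grams with the term."""
    query = ngrams(term)
    hits = set()
    for name, text in DOCS.items():
        for word in words(text):
            grams = ngrams(word)
            if not grams:
                continue  # word shorter than n characters
            overlap = len(query & grams) / len(query | grams)
            if overlap >= threshold:
                hits.add(name)
    return sorted(hits)

index = build_inverted_index()
print("exhaustive:", exhaustive_search("compressor"))
print("inverted:  ", sorted(index.get("compressor", set())))
print("n-gram:    ", ngram_search("compressor"))
```

In this sketch the exhaustive scan and the inverted index both find only doc1, because the garbled word in doc2 does not match the query exactly. The N-gram search also returns doc2, since "compressor" and "c0mpressor" still share six of their trigrams.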

File Structures
With the accumulation of electronic information generated by various software packages, attention is focusing on supporting multiple file structures in their native mode. Advantages include the ability to reuse information as opposed to translating in one direction and merging data.

Document Storage
Three optical disk storage technologies are available.

Write Once Read Many (WORM) drives permit a user to place information on a disk by burning it in with a laser and reading the data later. There are no real standards being applied to WORM drives and media. There are 5.25, 8, 10, 12 and 14 inch media systems available. Even having two drives with the same size does not guarantee readability since manufacturers format the drives with different methods.

CD-ROM machines take manufactured disks with information pre-written and provide read-only access. The media is virtually the same as the audio Compact Disc (CD), which has been available for over 5 years. The media is inexpensive, but the cost of producing a master may make this alternative unattractive unless a large number of copies is required.

Erasable Optical disks have recently been introduced. The major advantage of this drive is the ability to reuse the media. There are still some concerns about the long term stability of the media which may preclude the use of this first generation of erasable optical for archival purposes.

In order to fully utilize the large storage capacity of WORM or Erasable Optical drives, a mechanical robotic mechanism is required to switch and/or flip the media. These devices are called Jukeboxes due to their similar function to the audio record changers of days gone by.

File Management/Indexing
Storing documents and retrieving them directly requires a management and indexing scheme. This method is more desirable than a full text search process for documents of a known content. The costs of this system include the manual creation of the index through operator key-in. In forms processing, typical systems are linked to the direct input system. In this case, selected items are stored in an indexed database with pointers to the page or document containing the entire volume of information. The system is of limited value for free-form documents unless a person knowledgeable in the content of the document supplies the key items to the index.
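
A minimal sketch of such an index, using Python's built-in sqlite3 module, is shown here. The table layout, field names, vendor names and file paths are all hypothetical; they stand in for whatever key items an operator would enter for a given form.

```python
import sqlite3

# In-memory database standing in for the index store.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE form_index (
           invoice_no TEXT,
           vendor     TEXT,
           doc_path   TEXT,
           page       INTEGER
       )"""
)
# Index the key item that users are expected to search on.
conn.execute("CREATE INDEX idx_vendor ON form_index (vendor)")

# Key items as an operator might enter them; doc_path and page are the
# pointers back to the stored image (hypothetical values).
rows = [
    ("INV-001", "Acme", "vol1/0001.tif", 1),
    ("INV-002", "Globex", "vol1/0002.tif", 1),
    ("INV-003", "Acme", "vol1/0003.tif", 2),
]
conn.executemany("INSERT INTO form_index VALUES (?, ?, ?, ?)", rows)

def locate(vendor):
    """Return (doc_path, page) pointers for every form filed under a vendor."""
    cursor = conn.execute(
        "SELECT doc_path, page FROM form_index "
        "WHERE vendor = ? ORDER BY doc_path, page",
        (vendor,),
    )
    return cursor.fetchall()

print(locate("Acme"))
```

The database holds only the keyed items and the pointers; the page images themselves stay on the storage media. A lookup therefore succeeds only for fields that someone chose to key in, which is the limitation noted above for free-form documents.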

Networking
Having a repository of information is convenient but the utility of this information is greatly increased if it is available to all users at their desks. The transport of the information to the users requires a large and stable network. Due to the volume of information which must move, networking schemes are based upon 10Mb/sec Ethernet technology. Another contender for this traffic is IBM's Token Ring. The local implementation of the networking architecture becomes increasingly important as the volume of information increases. In the future, fiber based FDDI with 100Mb/sec data rates will become the backbone of choice. The original Ethernet topology will become the basis for the individual work area or office floor transport scheme.
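
The effect of network speed can be estimated with simple arithmetic. The sketch below assumes an 8.5 x 11 inch page scanned at 300 dots per inch, one bit per pixel, sent uncompressed over a link running at its full rated speed. All of these are illustrative assumptions; real throughput is lower once protocol overhead and competing traffic are counted, and compression reduces the amount of data to move.

```python
def page_bits(width_in=8.5, height_in=11.0, dpi=300):
    """Uncompressed size in bits of a bilevel (1 bit per pixel) page image."""
    return round(width_in * dpi) * round(height_in * dpi)

def transfer_seconds(bits, megabits_per_sec):
    """Ideal transfer time, ignoring protocol overhead and contention."""
    return bits / (megabits_per_sec * 1_000_000)

bits = page_bits()
for name, rate in (("Ethernet, 10 Mb/sec", 10), ("FDDI, 100 Mb/sec", 100)):
    print(f"{name}: {transfer_seconds(bits, rate):.2f} s per page")
```

Under these assumptions a single uncompressed page occupies most of a second on 10 Mb/sec Ethernet, so many users paging through documents at once can quickly saturate a shared segment.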

Data Security
Data security on a network will have increased significance due to the integration of sensitive business information into a variety of documents. A robust security system must be available on each of the servers and clients in the network.

---

Copyright © 1991-2005 Netmation Inc. All Rights Reserved
Site Designed and Hosted by Netmation Inc.