Document Imaging
Introduction
Justification
It is important to note that the proposed system should not be
perceived as an extra cost to have documents available for online
access. Rather, the system will be replacing the present method
used for creating and updating the documents. The use of the
system for maintaining the documents will provide the users with
the added benefits of online access and control.
For all documents that can be maintained on the system, the cost
to reproduce and distribute updated hard copies will be
eliminated. Included in these costs are the 3-ring document
binders. Over a period of time this will result in fewer
bookcases and file cabinets, and in reduced floor space.
Updates to documents will be immediately available to users. This
will reduce the amount of rework that now results from the use of
outdated reference materials.
Because documents will take less time, effort and cost to update,
many types of documents will be kept more current than they are
under the present methods.
The structured and readily available method of locating documents
will benefit users in two ways: non-productive time spent
searching for documents will be reduced, and projects will have
better methods of locating and using documents that were produced
on other projects.
A document imaging system will reduce the requirement for hard
copies and enhance the benefits of computer-generated documents.
The number of computer-generated project documents continually
increases. Therefore, the benefits of an online reference system
for managing and accessing these documents will also continually
increase.
Additionally, if use of the system expands to several departments
and to other offices as expected, there will be an increased need
for documentation, training and responsive technical support. The
need for administration of the related activities would also
increase. Therefore, although the proposed system has the
potential to provide a significant positive influence on our
methods of executing projects, enhancements and support will be
required for the system to deliver this full potential.
State of the Technology
Scanning
OCR
The OCR process can be performed by special hardware cards for
PCs, by software running on the PC, or by firmware in special
scanners. OCR software typically accepts compressed raster
information and must expand it to full raster before the
recognition phase can begin. High-end OCR scanners can avoid the
compression/decompression phase because their input stream is
direct raster.
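The expansion step can be illustrated with a minimal sketch. The text
does not name a compression scheme, so simple run-length encoding is
assumed here purely for illustration; the function name expand_runs and
the (pixel value, run length) layout are hypothetical.

```python
def expand_runs(runs, width):
    """Expand one scan line of (pixel_value, run_length) pairs to full raster.

    Assumption: plain run-length encoding, chosen only for illustration.
    """
    row = []
    for value, length in runs:
        row.extend([value] * length)
    if len(row) != width:
        raise ValueError(f"expected {width} pixels, got {len(row)}")
    return row


# 0 = white, 1 = black; a 16-pixel scan line stored as four runs.
compressed = [(0, 5), (1, 3), (0, 6), (1, 2)]
full = expand_runs(compressed, width=16)
print("".join("#" if pixel else "." for pixel in full))
```

Software OCR has to perform an expansion of this kind for every scan
line before recognition starts, which is the step a direct-raster
scanner skips.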
The output from an OCR process is seldom totally correct. Errors
are introduced when characters "bleed" and touch one another or
when the scanner picks up "ghost" images from the reverse side of
a document. Success rates vary from 80% to 99%. Output formats
include plain ASCII and popular word processing formats, complete
with underlining and superscripts.
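To put these success rates in perspective, the short calculation below
estimates the errors on a page. It is a rough sketch under stated
assumptions: the rate is treated as a per-character accuracy, errors
are treated as independent, and the page size of 2000 characters and
word length of 6 are illustrative values, not figures from this
document.

```python
def expected_errors(chars_per_page, accuracy):
    """Expected number of misread characters on one page."""
    return chars_per_page * (1.0 - accuracy)


def word_correct_probability(word_length, accuracy):
    """Chance that every character of a word is read correctly,
    assuming character errors are independent."""
    return accuracy ** word_length


CHARS_PER_PAGE = 2000  # assumed page size, for illustration only
WORD_LENGTH = 6        # assumed keyword length, for illustration only

for accuracy in (0.80, 0.95, 0.99):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    p_word = word_correct_probability(WORD_LENGTH, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors:.0f} errors per page, "
          f"{p_word:.0%} chance a {WORD_LENGTH}-letter word is intact")
```

Even at the high end of the range, a page of OCR output carries
errors, which matters for any search method that depends on exact
words.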
Full Text Search
Exhaustive search - This is the most primitive method available.
A search through the source documents for a particular keyword or
phrase is initiated for each request. This method is very simple
to implement but is totally unusable for large stores of
information. Some low cost PC packages employ this method.
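A minimal sketch of an exhaustive search follows. The document names
and contents are made up for illustration. Every request rescans every
document, so the cost of a query grows with the total size of the
store.

```python
def exhaustive_search(documents, phrase):
    """Scan the full text of every document for the phrase on each request."""
    needle = phrase.lower()
    return [name for name, text in documents.items() if needle in text.lower()]


# Hypothetical document store: name -> full text.
documents = {
    "spec-001": "Pump specification for the cooling water system",
    "memo-017": "Revised schedule for piping drawings",
    "proc-204": "Procedure for updating the pump specification",
}

print(exhaustive_search(documents, "pump specification"))
```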
Inverted keyword index - This method builds an index table based
upon selected or typical keywords used in the documents, filtering
out common articles like "A" or "THE". The inverted index is
popular today but has a major drawback: words in both the
documents and the query statement must be spelled correctly, or a
keyword will not be found. Some users employ a table of frequently
misspelled words to assist the query function. This table will
not compensate for the random errors introduced by an OCR process.
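The sketch below shows the idea and the drawback. It is a simplified
illustration, not a description of any particular product: the stop
word list, the tokenizing rule and the sample documents are all
assumptions.

```python
import re
from collections import defaultdict

# Assumed stop word list; real systems use longer ones.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "for"}


def tokenize(text):
    """Lowercase the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def build_index(documents):
    """Map each keyword to the set of documents that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def lookup(index, keyword):
    """Return the documents that contain the exact keyword."""
    return sorted(index.get(keyword.lower(), set()))


documents = {
    "spec-001": "The pump specification for the cooling system",
    "memo-017": "Revised purnp schedule",  # OCR misread "pump" as "purnp"
}
index = build_index(documents)

print(lookup(index, "pump"))   # memo-017 is missed because of the OCR error
print(lookup(index, "the"))    # stop words are not indexed
```

The lookup is a single table access rather than a scan, but the
misread word in memo-017 is indexed under "purnp", so a query for
"pump" never reaches it.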
N-gram - This is a new technology which reduces the impact of
misspelled words. An N-gram is a substring of N consecutive
characters taken from a word. By reducing the dependence upon
complete words, a higher probability of finding an association
between query and source is achieved. An interesting
implementation of N-gram search has been produced utilizing neural
networking technology. In this method, a document is "learned" or
searched for patterns. The index of patterns is stored for later
comparison, in a manner similar to the way psychologists believe
the human brain remembers facts. Both the N-gram and the inverted
index methods require additional storage for the index files.
Overhead of 30% of the document size is common.
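A minimal sketch of N-gram matching follows. It uses character
trigrams and a plain set-overlap score (the Dice coefficient) in place
of the neural network implementation mentioned above, which is not
reproduced here. The padding with spaces, the threshold of 0.5 and the
sample words are assumptions chosen for illustration.

```python
def ngrams(word, n=3):
    """Return the set of n-character substrings of a word.

    The word is padded with one space on each side so that its first
    and last letters also appear in n-grams.
    """
    padded = f" {word.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=3):
    """Dice coefficient of the two n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def fuzzy_find(query, words, threshold=0.5):
    """Return the indexed words whose n-grams overlap the query enough."""
    return [word for word in words if similarity(query, word) >= threshold]


# "docurnent" stands in for an OCR misreading of "document".
indexed_words = ["docurnent", "drawing", "documents"]
print(fuzzy_find("document", indexed_words))
```

Because the misread word still shares most of its trigrams with the
query, it is found even though an exact keyword comparison would fail.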
File Structures
Document Storage
Write Once Read Many (WORM) drives permit a user to place
information on a disk by burning it in with a laser and to read
the data back later. No real standards are being applied to WORM
drives and media. There are 5.25-, 8-, 10-, 12- and 14-inch media
systems available. Even two drives that take the same media size
are not guaranteed to read each other's disks, since manufacturers
format the media with different methods.
CD-ROM drives take manufactured disks with the information
pre-written and provide read-only access. The media is virtually
the same as the audio Compact Disc (CD), which has been available
for over 5 years. The media is inexpensive, but the cost of
producing a master may make this alternative unattractive unless a
large number of copies is required.
Erasable Optical disks have recently been introduced. The major
advantage of this type of drive is the ability to reuse the media.
There are still some concerns about the long-term stability of the
media, which may preclude the use of this first generation of
erasable optical for archival purposes.
To fully utilize the large storage capacity of WORM or Erasable
Optical drives, a robotic mechanism is required to switch and/or
flip the media. These devices are called Jukeboxes because they
function much like the audio record changers of days gone by.
File Management/Indexing
Networking
Data Security