Working with Scanned Documents
The OCR Pack provides tools to extract
data from scanned images. Branches can transform, report, validate, and store
the extracted data.
Document Scanning
Once a document has been scanned and converted into a TIFF, the
OCR Pack can accept those images for processing. The method used to
process scanned images depends on whether the image contains a single document
or a batch of multiple documents.
Batch Processing
To process an image containing more than one discrete document, the following
processing steps are available:
Batch Merge |
Scanning equipment and the documents it processes are not perfect. Pages can jam the scanner, resulting in a scanned document that is incomplete.
When this happens, you must combine multiple scanned images to construct
a complete document batch. This is accomplished by selecting the individual,
incomplete scanned documents within Transform Content Center and submitting them to a
remote branch. This branch merges the documents
into a single multi-document image. Typically, this process submits the
newly-created image for bursting and removes the original, incomplete documents
from Transform Content Center storage. |
Bursting |
The OCR engine scans each page within a multi-document
image under the control of a branch and identifies
the first page of each document. It stores this information, which describes how
the batch is to be split into its component documents, in a Scanned Batch (.fsbd) file.
|
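The idea behind bursting can be illustrated with a minimal Python sketch. It assumes the first-page information is available as one boolean flag per page. The function name and the in-memory list are hypothetical and do not represent the .fsbd file format, which is not described here.

```python
from typing import List


def burst(first_page_flags: List[bool]) -> List[List[int]]:
    """Group page indexes into documents.

    A True flag means "this page is the first page of a document".
    Hypothetical illustration only; not the OCR Pack's own logic.
    """
    documents: List[List[int]] = []
    for page_index, is_first in enumerate(first_page_flags):
        # Start a new document at each first-page flag, and also at the
        # very first page so that no leading pages are dropped.
        if is_first or not documents:
            documents.append([])
        documents[-1].append(page_index)
    return documents


flags = [True, False, False, True, False, True]
print(burst(flags))  # [[0, 1, 2], [3, 4], [5]]
```

Each inner list holds the page indexes of one document, so a reviewer changing a flag during Burst Validation would change where the batch is split.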
Burst Validation |
Since initial document scanning is performed by people and bursting is
done under program control, Burst Validation provides an opportunity for
a person to review the results of both the scanning and bursting operations.
This individual can make changes to the order in which the pages appear
and to the page attributes assigned during the bursting operation. This
might be necessary under circumstances such as the following:
- A new employee who is unfamiliar with the scanning hardware loads
the pages incorrectly so that every page is blank.
- Pages are loaded into the scanning hardware in the wrong order so that, for example, the last page is scanned as the first page.
- A burst operation requires a blank page to be inserted between
documents in the batch. One of these blank pages was not included, so two
documents are incorrectly identified as a single document.
- Every document in a batch contains exactly three pages, so the bursting
operation marks every third page as the first page of a document.
Someone accidentally omitted one page of a document so the bursting
is incorrect after the point within the batch where the page is missing.
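The last case above can be reproduced with a short Python sketch. It assumes a fixed-count rule that flags every third page as a first page. The page labels and the function name are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def mark_first_pages(pages: List[str], pages_per_document: int = 3) -> List[Tuple[str, bool]]:
    """Flag every Nth page as the first page of a document (hypothetical rule)."""
    return [(page, index % pages_per_document == 0) for index, page in enumerate(pages)]


# Three documents (A, B, C) of three pages each, but page B2 was omitted.
scanned = ["A1", "A2", "A3", "B1", "B3", "C1", "C2", "C3"]
first_pages = [page for page, is_first in mark_first_pages(scanned) if is_first]
print(first_pages)  # ['A1', 'B1', 'C2']
```

Page C2 is flagged instead of C1, so every boundary after the missing page is shifted. This is the kind of error a reviewer would correct during Burst Validation.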
In environments where accuracy is paramount, Burst Validation makes
certain that only properly validated and reviewed documents are
submitted to subsequent document processing steps.
Once a batch of documents has been burst and validated, the individual
documents are extracted and can be submitted for further OCR processing. The status of the batch file is set to 2 (Finished) and its automatic deletion date is set for 14 days later.
|
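The status and deletion-date update described above amounts to simple date arithmetic. The following Python sketch assumes a hypothetical `BatchRecord` type. Only the status value 2 (Finished) and the 14-day interval come from the text; the field and function names are illustrative.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

STATUS_FINISHED = 2   # status value stated in the text
RETENTION_DAYS = 14   # deletion interval stated in the text


@dataclass
class BatchRecord:
    """Hypothetical stand-in for a stored batch file's metadata."""
    name: str
    status: Optional[int] = None
    delete_on: Optional[date] = None


def finish_batch(batch: BatchRecord, today: date) -> BatchRecord:
    """Mark the batch as Finished and schedule its automatic deletion."""
    batch.status = STATUS_FINISHED
    batch.delete_on = today + timedelta(days=RETENTION_DAYS)
    return batch


batch = finish_batch(BatchRecord("batch-001"), today=date(2024, 1, 1))
print(batch.status, batch.delete_on)  # 2 2024-01-15
```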
Document Storage |
Transform Content Center recognizes the special batch document format used to
handle scanned document batches, so batch documents can be stored. The page
attribute information remains associated with the scanned document pages, so it
is available for subsequent processing if it is needed at a later time. |
Document Processing
When individual documents are scanned or batches of documents have been burst,
the OCR Pack makes available the following processing steps:
Pattern-based data extraction |
When an individual document is submitted for processing, the OCR engine
scans each page and extracts the data that it finds. The system then compares
the extracted data with patterns, which indicate
the location of data elements on a page.
Each pattern is associated with a Transform Content Center category that has been configured to work with scanned documents. When
the data within a scanned document matches a pattern, index data is extracted from the document and the
document is stored within the category associated with the pattern.
|
Index search-based data extraction |
If a scanned document does not match an existing pattern, it is next checked
against the Index searches. An Index search is a text-based search, based on
regular expression matching, that is performed against the data the OCR engine
extracted from the document. Because the search is made after all data has been
extracted from the page, this type of match does not rely upon data appearing at
a particular location on a page. Instead, it compares the text values appearing
within the data and the relationship of those values to nearby values. The
following are examples of the type of regular expression matching that can be
done with an Index search:
- If the string Invoice Date: is followed by ten characters formatted as four
numbers, a hyphen, two numbers, a hyphen, and two numbers, match those ten
characters for use as a date.
- Look for the string Order Code: appearing on a line with no other text
before it. If the next two characters are a number ranging in value from 00 to 50,
the third character is an X, the fourth through seventh characters
are numbers, and there are no other characters on the line following
this text, match those characters for use as an order code.
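The two examples above can be written as regular expressions. The following sketch uses Python's `re` module for illustration only. The regular expression dialect used by Index searches is not specified here, and the tolerance for spaces after the colon and at the end of the line is an assumption.

```python
import re

# "Invoice Date:" followed by four digits, a hyphen, two digits, a hyphen,
# and two digits. Optional whitespace after the colon is an assumption.
# The pattern checks the format only; it does not validate month or day values.
INVOICE_DATE = re.compile(r"Invoice Date:\s*([0-9]{4}-[0-9]{2}-[0-9]{2})")

# "Order Code:" at the start of a line, then 00-50, an X, and four digits,
# with nothing else on the line apart from optional spaces or tabs.
ORDER_CODE = re.compile(
    r"^Order Code:[ \t]*((?:[0-4][0-9]|50)X[0-9]{4})[ \t]*$",
    re.MULTILINE,
)

# Hypothetical text as it might come back from the OCR engine.
ocr_text = (
    "Invoice Date: 2023-04-17\n"
    "Order Code: 42X1234\n"
    "Order Code: 51X0001\n"      # rejected: 51 is outside 00-50
    "Ref Order Code: 10X9999\n"  # rejected: text precedes "Order Code:"
)

date_match = INVOICE_DATE.search(ocr_text)
print(date_match.group(1))           # 2023-04-17
print(ORDER_CODE.findall(ocr_text))  # ['42X1234']
```

The order code expression shows how a value range (00 to 50) and line anchoring can be combined in a single match.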
Regular expression matching can match a static string or text that must
follow a very specific data format and range of values.
Each Index search is associated with a Transform Content Center category.
When an Index search returns a match, index data is extracted
from the document and the document is stored
within the category associated with the Index search.
|
Index correction |
Users with access to the category in which a scanned document has been stored can examine the document to check the extracted data and correct it if necessary.
If the document has been stored without matching a pattern or an Index search, the user can follow the same process to enter the data manually.
|
Create new patterns or Index searches |
If a scanned document does not match any pattern or Index search, Transform Content Center
stores the document in a special Unmatched_Scans
category. A user with access to this category can examine the document and create a new pattern or Index search that matches the document.
The user can then resubmit the document to the OCR system so that the document
is stored in the desired category.
It is also possible to create a pattern from a document submitted specifically for this purpose, or from a document that has been stored in a scanning category without having matched an existing pattern (that is, a document stored either because it matched a regular expression or because the storing branch specified the category).
|
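The order in which a document is checked (patterns first, then Index searches, then the Unmatched_Scans category) can be summarized in a minimal Python sketch. The matcher functions, the category names Invoices and Orders, and the representation of a document as a list of text lines are all hypothetical. Only the matching order and the Unmatched_Scans category name come from the text.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

IndexData = Dict[str, str]
Matcher = Callable[[List[str]], Optional[IndexData]]

UNMATCHED_CATEGORY = "Unmatched_Scans"


def invoice_pattern(lines: List[str]) -> Optional[IndexData]:
    """Location-based stand-in: expects the invoice number on the first line."""
    if lines and lines[0].startswith("INVOICE "):
        return {"invoice_number": lines[0][len("INVOICE "):]}
    return None


def order_code_search(lines: List[str]) -> Optional[IndexData]:
    """Text-based stand-in: finds an order code on any line."""
    for line in lines:
        match = re.fullmatch(r"Order Code: ([0-9]{2}X[0-9]{4})", line)
        if match:
            return {"order_code": match.group(1)}
    return None


def route(
    lines: List[str],
    patterns: List[Tuple[str, Matcher]],
    index_searches: List[Tuple[str, Matcher]],
) -> Tuple[str, IndexData]:
    """Return (category, index data), trying patterns before Index searches."""
    for category, matcher in patterns + index_searches:
        index_data = matcher(lines)
        if index_data is not None:
            return category, index_data
    # Nothing matched: the document goes to the special category.
    return UNMATCHED_CATEGORY, {}


patterns = [("Invoices", invoice_pattern)]
searches = [("Orders", order_code_search)]

print(route(["INVOICE 1001", "Total: 10.00"], patterns, searches))
print(route(["Dear customer", "Order Code: 42X1234"], patterns, searches))
print(route(["Handwritten note"], patterns, searches))
```

In this sketch, creating a new pattern or Index search corresponds to adding an entry to one of the two lists and routing the unmatched document again.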
Document storage |
Transform Content Center recognizes the special scanned document format used to
process these scanned documents, so the scanned documents can be stored. Because
the extracted data remains a part of this document format, that information is
available for any subsequent processing.
Scanned documents are stored in special OCR-enabled categories, known as scanning categories.
|