Working with Scanned Documents

The OCR Pack provides tools to extract data from scanned images. Branches can transform, report, validate, and store the extracted data.

Document Scanning

Once a document has been scanned and converted into a TIFF, the OCR Pack can accept those images for processing. The method used to process scanned images depends on whether the image contains a single document or a batch of multiple documents.

Batch Processing

To process an image containing more than one discrete document, the following processing steps are available:

Batch Merge	Scanning equipment and the documents it processes are not perfect. Pages can jam the scanner, producing a scanned document that is incomplete. When this happens, you must combine multiple scanned images to construct a complete document batch. To do so, select the individual, incomplete scanned documents within Transform Content Center and submit them to a remote branch, which merges the documents into a single multi-document image. Typically, this process then submits the newly created image for bursting and removes the original, incomplete documents from Transform Content Center storage.
Bursting	Under the control of a branch, the OCR engine scans each page within a multi-document image and identifies the first page of each document. It stores the resulting information about how the batch is to be split into its component documents in a Scanned Batch (.fsbd) file.
Burst Validation	Because initial document scanning is performed by people and bursting is done under program control, Burst Validation gives a person the opportunity to review the results of both the scanning and bursting operations. The reviewer can change the order in which the pages appear and the page attributes assigned during the bursting operation. This might be necessary in circumstances such as the following: (1) A new employee who is unfamiliar with the scanning hardware loads the pages incorrectly, so that every page is blank. (2) Pages are loaded into the scanning hardware in the wrong order so that, for example, the last page is scanned as the first page. (3) A burst operation requires a blank page to be inserted between each document in the batch, but one of these was not included, so two documents are incorrectly identified as a single document. (4) Every document in a batch contains exactly three pages, so the bursting operation marks every third page as the first page of a document, but someone accidentally omitted one page of a document, so the bursting is incorrect after the point within the batch where the page is missing. In environments where accuracy is paramount, Burst Validation ensures that only properly reviewed and validated documents are submitted to subsequent document processing steps. Once a batch of documents has been burst and validated, the individual documents are extracted and can be submitted for further OCR processing. The status of the batch file is then set to 2 (Finished) and its automatic deletion date is set to 14 days later.
Document Storage	The special batch document format used to handle scanned document batches is known to the Transform Content Center application, so batch documents can be stored there. The page attribute information stays associated with the scanned document pages, so it remains available for subsequent processing should it be needed later.
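
The bursting rules and the failure cases listed under Burst Validation can be illustrated with a short Python sketch. It is not the OCR Pack's implementation: the Page class and the functions burst_on_blank_pages, burst_fixed_length, and finish_batch are hypothetical names introduced here, and the sketch assumes that a blank page acts purely as a separator. Only the status value 2 (Finished) and the 14-day deletion interval come from the description above.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Page:
    """Hypothetical stand-in for one scanned page in a batch."""
    number: int            # 1-based position within the scanned batch
    is_blank: bool = False


def burst_on_blank_pages(pages):
    """Return the page numbers that start a document.

    Assumption: a blank page separates two documents and is not itself
    part of either document.
    """
    first_pages = []
    at_document_start = True
    for page in pages:
        if page.is_blank:
            # The next non-blank page begins a new document.
            at_document_start = True
        elif at_document_start:
            first_pages.append(page.number)
            at_document_start = False
    return first_pages


def burst_fixed_length(page_count, pages_per_document):
    """Return the first-page numbers when every document has the same length."""
    return list(range(1, page_count + 1, pages_per_document))


STATUS_FINISHED = 2      # status value stated in the text
RETENTION_DAYS = 14      # deletion interval stated in the text


def finish_batch(validated_on):
    """Model the batch-file bookkeeping after burst and validation."""
    return {
        "status": STATUS_FINISHED,
        "delete_on": validated_on + timedelta(days=RETENTION_DAYS),
    }


# Three documents separated by blank pages 3 and 6.
good_batch = [Page(1), Page(2), Page(3, True), Page(4), Page(5),
              Page(6, True), Page(7)]

# The same batch with the second separator left out: the last two
# documents are reported as one, which a reviewer would have to correct.
missing_separator = [Page(1), Page(2), Page(3, True), Page(4), Page(5),
                     Page(6)]
```

With these inputs, burst_on_blank_pages(good_batch) finds three first pages, while the batch with the missing separator yields only two. Likewise, burst_fixed_length(8, 3) still reports a document starting every three pages even though a nine-page batch has lost a page, so every boundary after the missing page is wrong. These are the kinds of results Burst Validation lets a reviewer catch.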

Document Processing

When individual documents are scanned, or after batches of documents have been burst, the OCR Pack makes the following processing steps available:

Pattern-based data extraction	When an individual document is submitted for processing, the OCR engine scans each page and extracts the data that it finds. The system then compares the extracted data with patterns, which indicate the location of data elements on a page. Each pattern is associated with a Transform Content Center category that has been configured to work with scanned documents. When the data within a scanned document matches a pattern, index data is extracted from the document and the document is stored within the category associated with the pattern.
Index search-based data extraction	If a scanned document does not match an existing pattern, it is next checked to see if it matches an Index search, which is a text-based search performed against the data the OCR engine extracted from the document. An Index search is based on regular expression matching. This type of match does not rely on data appearing at a particular location on a page, because the search is made after all data has been extracted from the page. Instead, it compares the text values appearing within the data and the relationship of those values to nearby values. The following are examples of the type of regular expression matching that can be done with an Index search: (1) If the string Invoice Date: is followed by ten characters formatted as four digits, a hyphen, two digits, a hyphen, and two digits, match those ten characters for use as a date. (2) Look for the string Order Code: appearing on a line with no other text before it. If the next two characters are a number ranging in value from 00 to 50, the third character is an X, the fourth through seventh characters are digits, and there are no other characters on the line following this text, match those seven characters for use as an order code. Regular expression matching can therefore match either a static string or text that must follow a very specific format and range of values. Each Index search is associated with a Transform Content Center category. When an Index search returns a match, index data is extracted from the document and the document is stored within the category associated with the Index search.
Index correction	Users with access to the category in which a scanned document has been stored can examine the document to check the extracted data and correct it if necessary. If the document has been stored without matching a pattern or an Index search, the user can follow the same process to enter the data manually.
Create new patterns or Index searches	If a scanned document does not match any pattern or Index search, Transform Content Center stores the document in a special Unmatched_Scans category. A user with access to this category can examine the document and create a new pattern or Index search that matches it. The user can then resubmit the document to the OCR system so that the document is stored in the desired category. It is also possible to create a pattern from a document submitted specifically for this purpose, or from a document that was stored in a scanning category without matching an existing pattern (either because it matched a regular expression or because the category was specified in the storing branch).
Document storage	The special scanned document format used to process these scanned documents is known to the Transform Content Center application, so the scanned documents can be stored there. Because the extracted data remains part of this document format, it is available for any subsequent processing. Scanned documents are stored in special OCR-enabled categories, known as scanning categories.
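
The two Index search examples described above can be written as regular expressions. The following Python sketch is illustrative only and does not show how Index searches are configured in the product: the category names Invoices and Orders, the field names, and the classify function are hypothetical, and allowing optional spaces after the colon is an assumption. It covers only the Index search step (not pattern matching, which is tried first) and falls back to the Unmatched_Scans category when nothing matches.

```python
import re

# Each hypothetical Index search: (category, index field, regular expression).
INDEX_SEARCHES = [
    (
        "Invoices",
        "invoice_date",
        # "Invoice Date:" followed by four digits, a hyphen, two digits,
        # a hyphen, and two digits.
        re.compile(r"Invoice Date:[ \t]*(\d{4}-\d{2}-\d{2})"),
    ),
    (
        "Orders",
        "order_code",
        # "Order Code:" at the start of a line, then a number from 00 to 50,
        # an X, four digits, and nothing else on the line.
        re.compile(r"^Order Code:[ \t]*((?:[0-4]\d|50)X\d{4})$", re.MULTILINE),
    ),
]

UNMATCHED_CATEGORY = "Unmatched_Scans"


def classify(ocr_text):
    """Return (category, index data) for the first Index search that matches.

    The search runs over all text extracted from the document, so the
    position of the data on the page does not matter.
    """
    for category, field, regex in INDEX_SEARCHES:
        match = regex.search(ocr_text)
        if match:
            return category, {field: match.group(1)}
    return UNMATCHED_CATEGORY, {}


invoice_text = "ACME Supplies\nInvoice Date: 2023-04-17\nTotal due: 10.00"
order_text = "Order Code: 42X1234\nShip to: Warehouse 7"
bad_order_text = "Order Code: 51X1234"   # 51 is outside the 00 to 50 range
```

Here classify(invoice_text) selects the Invoices category with the extracted date, classify(order_text) selects Orders with the code 42X1234, and classify(bad_order_text) falls through to Unmatched_Scans, where a user could then create a new pattern or Index search for the document.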