Processing Scanned Documents

Many businesses receive, process, and store large volumes of paper documents or document images. Examples include:

Documents received by mail such as purchase orders and invoices
Documents transmitted by fax, received as graphic images (typically TIFF files)

Often, these documents contain information that must be collected and then submitted to internal business processes. Purchase orders, for example, must be read and the information transferred to an ERP system so that the order can be filled and product shipped to customers. Processing these documents can be very time-consuming. Storing paper documents also requires significant resources: storage space, and people to retrieve documents when they are needed.

The Optical Character Recognition (OCR) Pack provides a set of tools to help your organization extract information from the pages of scanned documents, and store those documents in the Transform Content Center archive for quick and easy retrieval. Once the pages have been converted into image files, typically TIFF or PDF files, they can be submitted to the OCR Pack for processing. These files can be single- or multi-page images, and can contain a single document or a collection of documents.

The OCR Pack software is integrated with the Transform Content Center application, which supplies the user interfaces used during burst validation and document handling.

Processing Batches of Documents

To save time spent handling and scanning documents, batches of different types of documents can be scanned together. The result is one multi-page file that can be separated into individual documents by a process called bursting. If the first page of each document within a batch can be identified, a branch file can automatically burst the batch into its individual component documents. A branch can use any of the following methods to identify the individual documents within a batch:

If a blank sheet of paper is inserted between each document, the OCR Pack can burst the batch at every blank page. The OCR Pack marks the first page in the batch, and each page that follows a blank page, as the first page of a new document. It can mark the blank pages for automatic deletion.
If all documents contain exactly the same number of pages, the OCR Pack can burst a batch by page count.
If each page of a document contains a barcode whose value changes for each new document, the OCR Pack can burst a batch by barcode value.
If every document contains a check as either its first page or its last page, the OCR Pack can burst the batch by check.
If none of these methods is appropriate, you can attach specially encoded barcodes to the first page of each document, and the OCR Pack can burst the batch by barcode attributes.
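
The first three methods above can be illustrated with a short Python sketch. This is not the OCR Pack's implementation or API: the function names, the list-of-pages model, and the is_blank callback are all hypothetical, chosen only to show how each method decides which pages start a new document.

```python
from typing import Callable, List, Sequence, TypeVar

Page = TypeVar("Page")


def first_pages_by_blank(pages: Sequence[Page],
                         is_blank: Callable[[Page], bool]) -> List[int]:
    """Mark page 0, and every page that follows a blank page, as a first page."""
    starts = [0] if pages else []
    for i in range(1, len(pages)):
        if is_blank(pages[i - 1]):
            starts.append(i)
    return starts


def first_pages_by_count(page_total: int, pages_per_document: int) -> List[int]:
    """Every document has the same length, so first pages fall at fixed intervals."""
    if pages_per_document <= 0:
        raise ValueError("pages_per_document must be positive")
    return list(range(0, page_total, pages_per_document))


def first_pages_by_barcode(barcode_values: Sequence[str]) -> List[int]:
    """A new document starts wherever the barcode value differs from the previous page."""
    return [i for i, value in enumerate(barcode_values)
            if i == 0 or value != barcode_values[i - 1]]


def split_batch(pages: Sequence[Page],
                starts: Sequence[int],
                drop: Callable[[Page], bool] = lambda page: False) -> List[List[Page]]:
    """Group pages into documents at the given first-page indexes.

    Pages for which drop() is true (for example, blank separator sheets)
    are deleted; documents left with no pages are discarded.
    """
    bounds = list(starts) + [len(pages)]
    documents = []
    for begin, end in zip(bounds, bounds[1:]):
        document = [page for page in pages[begin:end] if not drop(page)]
        if document:
            documents.append(document)
    return documents


def is_blank(page: str) -> bool:
    # Hypothetical blank test: in this toy model a page is just its text.
    return page == ""


batch = ["A1", "A2", "", "B1", "", "C1", "C2"]
blank_starts = first_pages_by_blank(batch, is_blank)
documents = split_batch(batch, blank_starts, drop=is_blank)
```

In this example, blank_starts is [0, 3, 5] and documents contains three page groups with the blank separators removed. Keeping the first-page decision separate from the grouping step mirrors the description above, where the pages are marked first and grouped into documents afterward.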

When a branch bursts a batch, it can place the batch in a special category called Scanned_Batches to await user input. A user can examine the batch, validate the results of the bursting operation, and make any necessary corrections. When validation is complete, the OCR Pack groups the pages into individual documents according to the bursting instructions and submits each document for additional OCR processing.

Processing Individual Documents

When an individually scanned document or a document created from a bursting operation is submitted for processing, the OCR engine scans it and, if the category is not specified in the storing branch, attempts to match it to a pattern in the Scanning_Patterns category. Each pattern is associated with a specific category, so when a document matches a pattern, the OCR engine can store the document in the appropriate category, with index values supplied from the data it extracts from the document.

If a document fails to match any stored pattern, the OCR engine can use an Index search to identify whether text within the document matches rules that a privileged Transform Content Center user has defined. If a match occurs, the OCR engine stores the document in the category associated with the Index search, with index values supplied from the data extracted from the document.

If a scanned document fails to match any known pattern or index search, it is placed in the Unmatched_Scans category. A Transform Content Center user can create a pattern that matches this new document, and then resubmit the document for processing without having to rescan it.
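
The order of these checks can be sketched in Python as follows. This is an illustration only, not the OCR engine's actual logic: the Rule class, the route function, and the use of regular expressions to stand in for both scanning patterns and index searches are assumptions made for the example, as are the sample category names Invoices and Purchase_Orders.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple

UNMATCHED_CATEGORY = "Unmatched_Scans"  # category name taken from the text above


@dataclass(frozen=True)
class Rule:
    """Hypothetical stand-in for a scanning pattern or an index search."""
    category: str
    regex: str  # named groups in the regex become index values


def route(text: str,
          patterns: Sequence[Rule],
          index_searches: Sequence[Rule],
          branch_category: Optional[str] = None) -> Tuple[str, Dict[str, str]]:
    """Return (category, index values) for the recognized text of one document."""
    # A category specified in the storing branch skips pattern matching.
    # How index values are filled in that case is not described above,
    # so this sketch returns none.
    if branch_category is not None:
        return branch_category, {}
    # Patterns are tried first, then index searches.
    for rule in (*patterns, *index_searches):
        match = re.search(rule.regex, text)
        if match:
            return rule.category, match.groupdict()
    # Nothing matched: hold the document until a user creates a pattern for it.
    return UNMATCHED_CATEGORY, {}


patterns = [Rule("Invoices", r"INVOICE\s+#(?P<invoice_no>\d+)")]
index_searches = [Rule("Purchase_Orders", r"PO Number:\s*(?P<po_number>\w+)")]

invoice = route("INVOICE #1042 Net 30", patterns, index_searches)
order = route("PO Number: A77", patterns, index_searches)
unknown = route("Meeting notes", patterns, index_searches)
```

Because route takes only the recognized text and the current rule lists, an unmatched document can be routed again after a new pattern is added, which corresponds to resubmitting a document without rescanning it.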

Note

The OCR Pack is available only if this module is included in your software license.