April 11, 2014

Everything You Need to Know about Advanced Capture and Recognition

Getting started with a capture implementation can be overwhelming. Currently, most companies with a document capture process take only a straight scan-to-archive approach. But with today’s advanced capture technology, there’s no reason to manually classify documents or re-key data. Learning what to expect and what’s possible with advanced capture is key to a successful implementation.

Advanced Document Capture

7 Steps to a Successful Capture Implementation

1) Requirements Gathering and Design

Gathering the right requirements is a critical step in developing a productive and meaningful capture solution. Start by interviewing business, IT, and other personnel who have a stake in the capture solution. Once all the requirements are gathered, the team should create a design document and have the project sponsor approve and sign it.

2) Document Preparation / Capture

Step two is collecting and preparing documents for ingestion into the capture solution. Documents can be physical paper copies that need to be scanned, or they can arrive electronically from fax machines, email, hot folders, or file shares. The format matters for almost any capture solution, especially if you want to get meaningful data off of the documents: each file has to be a TIFF, JPG, PNG or PDF image at 200-300 dpi.
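As a rough illustration, the format check described above could be sketched in Python like this. The function name, the extension list, and the idea that the dpi value has already been read from the file are all assumptions made for this example; a real capture product will have its own intake checks.

```python
from pathlib import Path

# Assumed mapping of the formats named above to common file extensions.
ACCEPTED_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".png", ".pdf"}
MIN_DPI, MAX_DPI = 200, 300  # range taken from the text above


def is_capture_ready(filename: str, dpi: int) -> bool:
    """Return True if the file type and resolution fit the stated range.

    Hypothetical helper: the dpi is assumed to have been read elsewhere.
    """
    extension = Path(filename).suffix.lower()
    return extension in ACCEPTED_EXTENSIONS and MIN_DPI <= dpi <= MAX_DPI
```

Files that fail a check like this would be converted or rescanned before they go any further.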

Document preparation is also required for digital documents. For example, advanced capture systems can take attachments directly from email and process the documents automatically. This works for most situations, but there are times when manual intervention is required. For example, the system assumes that one file equals one document, but people sometimes send one big file containing several different documents. In a case like this, someone needs to review the file and decide what to do with it, which may mean breaking it apart into separate documents.

Manual review is also commonly needed to eliminate junk emails or faxes and to perform quality control, making sure all files entering the capture system are legitimate documents.

3) Classification

Now that the documents are scanned and the files are ingested into the capture system, you have to figure out what to do with them. This requires some type of logical document separation and/or classification.

As the files come through, the system looks at the pages and identifies the boundaries of each document: where it starts and where it ends. This might have been done during capture with a separator sheet or a barcode on the first page, or it can be done automatically by applying rules. For example, every multipage TIFF that comes in from a fax machine or an email is a new document. Or, in the case of invoice processing, you need to find the start and end of each invoice. One way to do this is to find the keyword “invoice number” and extract the value next to it. Then look at the following page and see whether its invoice number matches the previous page. If it does, append the page to the previous document. Finally, apply a rule that creates a new document whenever a new invoice number is encountered.
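The invoice rule above can be sketched in a few lines of Python. This is only a minimal illustration: it assumes each page has already been OCRed into plain text, the regular expression is a guess at how the keyword might appear, and pages with no invoice number are assumed to belong to the previous document.

```python
import re

# Assumed pattern: the keyword followed by an optional ":" or "#" and a value.
INVOICE_NUMBER = re.compile(r"invoice\s+number\s*[:#]?\s*(\S+)", re.IGNORECASE)


def separate_invoices(pages):
    """Group page indexes into documents based on the invoice number found."""
    documents = []
    current_number = None
    for index, text in enumerate(pages):
        match = INVOICE_NUMBER.search(text)
        number = match.group(1) if match else None
        # Same number, or no number at all: append to the previous document.
        if documents and (number is None or number == current_number):
            documents[-1]["pages"].append(index)
        else:
            # A new invoice number starts a new document.
            documents.append({"invoice_number": number, "pages": [index]})
            current_number = number
    return documents
```

Feeding it three pages where the first two share an invoice number and the third has a different one yields two documents, the first with two pages.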

Another great example of a rules-based approach is loan/mortgage applications where documents might look different and have different formats but there are properties that are always going to be the same. So when the system finds a particular type of data, it can classify that document automatically.
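A rules-based classifier of this kind can be as simple as a keyword lookup. The sketch below is an assumption-heavy illustration: the document types and keywords are invented for the example, and real systems typically use richer rules than plain substring matching.

```python
# Hypothetical rules: each document type lists phrases that must all appear.
RULES = {
    "loan_application": ["loan amount", "borrower"],
    "pay_stub": ["pay period", "gross pay"],
}


def classify(text):
    """Return the first document type whose keywords all occur in the text."""
    lowered = text.lower()
    for doc_type, keywords in RULES.items():
        if all(keyword in lowered for keyword in keywords):
            return doc_type
    return None  # no rule matched; leave it for document review
```

Documents that come back as None are the ones a reviewer would need to classify by hand.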

A different approach to classification, usually called a learning method, trains the system with samples. It requires feeding the system hundreds or thousands of samples for each document type so that it can identify the first page, the last page, and the middle pages that belong between them. Once the document is assembled, the system identifies the document type from the text that OCR recognized on its page(s).

4) Document Review

You have prepped, scanned, separated and classified your documents. Unfortunately, not all documents will make it through this automated process; some will need human intervention. Someone will need to look at them and fix any exceptions that occurred. Most document review systems give you the option to view all the documents and pages, not just the exceptions. This is helpful because sometimes the system flags a single page it cannot place, and the reviewer can see that it needs to be appended to the previous document or moved down to a later one. Seeing all the documents in a linear view lets you scroll down and stop at the document in question.

During the document review stage you can also re-classify a document by changing its document type, or rescan it. When a document didn’t classify appropriately because of skewing or bad scanning, and the original is available, the reviewer can replace it with a new scan.

Skipped or incorrectly ordered pages can be fixed at this point as well. Document capture systems usually allow for drag and drop to reorder pages.

5) Recognition / Data Extraction

Now that classification and document separation have been completed, you can apply rules to determine which index fields and which metadata you want to extract from each document type. You may want to get a certain piece of information off of one document and additional or different information off of another.

For example, an HR department might want to get the employee first name, last name and ID. But you don’t need to get those same fields from every document, since you already got them off of the first document and you know all the following documents are associated with the same employee ID. Therefore, you only need to get the document date or signature from each of the supporting pages.
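One way to picture this is a table of fields per document type, plus a step that carries the employee ID from the first document to the rest of the batch. The document type names, field names, and batch structure below are assumptions made up for this sketch.

```python
# Hypothetical extraction rules: which fields to read from each document type.
FIELDS_BY_TYPE = {
    "application": ["first_name", "last_name", "employee_id"],
    "supporting": ["document_date", "signature"],
}


def fields_to_extract(doc_type):
    """Return the list of fields configured for a document type."""
    return FIELDS_BY_TYPE.get(doc_type, [])


def propagate_employee_id(batch):
    """Copy the employee ID from the first document to the documents after it."""
    employee_id = batch[0]["fields"]["employee_id"]
    for document in batch[1:]:
        document["fields"].setdefault("employee_id", employee_id)
    return batch
```

The supporting documents end up indexed under the same employee without the ID ever being read from them.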

Typically, documents are first recognized with full-page OCR, and the text results are attached to the batch as it travels through the system. This allows you to search for data by looking at the text file.

When documents contain handwriting, either constrained handprint or cursive, ICR (intelligent character recognition) is applied to specific zones. Advanced capture systems are able to read even unstructured handwriting anywhere on a document.

6) Validation

Naturally, you can’t expect OCR or ICR to read all the data with 100% accuracy. This is why you set a confidence threshold. Everything that falls above this threshold gets sent straight through with no human intervention. Everything that falls below it, meaning that the software is not confident it extracted the right information, gets sent to an operator who checks whether the value is correct. The operator can either type the right value or confirm that it looks good and pass it on.
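In code, the threshold rule amounts to a simple split. The sketch below assumes the recognition engine returns a confidence between 0 and 1 for each field, and that a value exactly at the threshold passes; the 0.90 default is an arbitrary placeholder, not a recommendation.

```python
def route_fields(fields, threshold=0.90):
    """Split extracted fields into auto-accepted and operator-review groups.

    `fields` maps a field name to a (value, confidence) pair.
    """
    accepted, needs_review = {}, {}
    for name, (value, confidence) in fields.items():
        if confidence >= threshold:
            accepted[name] = value      # goes straight through
        else:
            needs_review[name] = value  # sent to an operator
    return accepted, needs_review
```

Only the second group ever reaches a person, which is where the savings in manual keying come from.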

Determining the threshold is an art of its own. Usually, companies determine this value by scanning a thousand or more samples and having a data entry person enter all the metadata 100% correctly. They then run the recognition engine against the same samples and compare the two sets of results. The comparison tells you where you need to set your threshold.
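One possible way to turn that comparison into a number is to look for the lowest threshold at which the fields passing straight through are accurate enough. This is only a sketch of one approach, not a standard method: the target accuracy is an assumption, and each sample is assumed to be a pair of the engine's confidence and whether its value matched the manually keyed one.

```python
def pick_threshold(samples, target_accuracy=0.99):
    """Return the lowest confidence cutoff that meets the target accuracy.

    `samples` is a list of (confidence, is_correct) pairs, where is_correct
    means the engine's value matched the manually keyed value.
    """
    for threshold in sorted({confidence for confidence, _ in samples}):
        accepted = [ok for confidence, ok in samples if confidence >= threshold]
        if accepted and sum(accepted) / len(accepted) >= target_accuracy:
            return threshold
    return None  # no cutoff reaches the target with these samples
```

A lower cutoff sends more documents straight through, so picking the lowest one that still meets the target keeps operator work to a minimum.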

7) Output

Advanced capture solutions can send the resulting data to multiple locations. You can send a slimmed-down version (the image and a few searchable metadata fields) to a document management system, for example SharePoint. With advanced capture you are also able to take all the granular metadata and put it to use! You can set up rules that take a particular document type or set of metadata, send it to a specific system, and trigger a workflow that routes it to the right person, eliminating the manual data entry that person would otherwise have had to do. This is very common in accounting departments for invoice or remittance processing, but there is a multitude of options and opportunities here.
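The routing idea can be sketched as a small rule table. Everything named here is hypothetical (the destination names, the search fields, the document structure); the point is only that the document management system receives a slim record while other systems receive the full metadata.

```python
# Hypothetical configuration for this sketch.
SEARCH_FIELDS = ["invoice_number", "vendor"]
EXTRA_ROUTES = {"invoice": ["accounting_workflow"]}


def build_outputs(document):
    """Return (destination, payload) pairs for one processed document."""
    slim_fields = {
        name: document["fields"][name]
        for name in SEARCH_FIELDS
        if name in document["fields"]
    }
    # Every document goes to document management as image plus a few fields.
    outputs = [("document_management",
                {"image": document["image"], "fields": slim_fields})]
    # Some document types also send their full metadata to other systems.
    for destination in EXTRA_ROUTES.get(document["type"], []):
        outputs.append((destination, {"fields": dict(document["fields"])}))
    return outputs
```

An invoice would produce two outputs here, while a document type with no extra route would only be archived.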

And there you have it. Isn’t it amazing all you can accomplish with advanced capture and recognition?
