What Is OCR and How Is It Different from Intelligent Capture?
Optical Character Recognition (OCR)
Optical Character Recognition (OCR) is often confused with intelligent capture solutions, and there is an understandable, if misleading, reason for it. For many, when considering intelligent capture solutions, documents are defined as the traditional paper variety. When paper documents need to be processed, OCR is an integral enabling capability. But OCR is only a starting point.
OCR is the process of converting images of text into machine-readable text. It works by analyzing the blocks of text on a page, separating them into word-like sections, and converting the individual letters into actual computer text. The result is something a knowledge worker can edit or work from. Alternatively, the text can be fed into a search engine index to support the ability to find documents within a document management system.
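That second use, feeding OCR output into a search index, can be illustrated with a minimal sketch in Python. The document IDs and text below are hypothetical stand-ins for real OCR engine output:

```python
from collections import defaultdict

# Hypothetical OCR output: document ID -> text produced by an OCR engine.
ocr_output = {
    "doc1": "Invoice for consulting services rendered in March",
    "doc2": "Contract for consulting engagement terms",
}

# Build a simple inverted index: each word maps to the documents containing it.
index = defaultdict(set)
for doc_id, text in ocr_output.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(term):
    """Return the IDs of documents whose OCR'd text contains the term."""
    return sorted(index.get(term.lower(), set()))

print(search("consulting"))  # ['doc1', 'doc2']
print(search("contract"))    # ['doc2']
```

Real document management systems use far more sophisticated indexing, but the principle is the same: OCR makes the image searchable, nothing more.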
However, if you need specific data from a document, OCR won’t get you there. And in cases where documents are “born digital,” such as Word documents converted to PDF or transactional documents (invoices, remittances, etc.) exported as PDF from a business system, there is no need for OCR at all.
Intelligent Document Processing
Intelligent Document Processing (IDP) takes text, whether it started as digital data or required OCR, and interprets it to identify document types and to locate and export specific data. It also analyzes candidate data in order to provide the best, most accurate result. Many different techniques are involved in interpretation, with machine learning algorithms representing the latest. In most cases, if all the processes involved in taking in a document and interpreting it were measured, OCR would represent only about 10% of the entire process.
IDP is designed to take document-based information, ranging from structured forms to unstructured contracts, and turn it into structured, computer-readable data that can be used in a variety of systems, including case management, line-of-business, and content and knowledge management systems. Since the primary value of intelligent capture is to identify and extract as much data as accurately as possible, there are a lot of subcomponents involved. At a high level, the following are commonly used within an intelligent capture workflow: information ingestion, image handling and optimization, document classification, document separation, data location and extraction, and data validation. Each is detailed below on this page.
How does handwriting recognition work and how is it different from OCR?
Basically, OCR works on fonts while handwriting recognition must use other mechanisms. This is because, while there is a finite number of fonts (e.g., the fonts available in Microsoft Word), there is effectively a distinct “font,” or writing style, for every person who writes. OCR is trained at the individual character level to recognize fonts and font sizes and can then evaluate them to create computer text.
Handwriting recognition also analyzes characters and words, but it must implement different algorithms to perform “best matches” against an inventory of letters. It must accommodate a wide range of variation in letters and words that normal OCR can avoid. As a result, handwriting recognition uses computer vision along with deep learning to create abstract models of letters and words (much as humans do) to reliably resolve handwritten text.
With deep learning, handwriting recognition performance has come a long way in a short amount of time. Still, the wide variation of handwriting styles also means that the performance of handwriting recognition, relative to OCR on machine-print, is lower. Field-level handwriting recognition (e.g., forms) achieves from 60% to 90% accuracy compared to 95%-98% OCR on machine-print. Page-level transcription of handwriting, available only recently, might achieve 60% accuracy compared to 98% accuracy of OCR on machine-print.
What Are the Key Components of Intelligent Capture?
Intelligent Capture encompasses information ingestion, image handling and optimization, document classification, document separation, data location and extraction, and data validation.
Information Ingestion
Intelligent capture provides a variety of ways in which information can be imported into the system. Communication systems integration (email, fax and network peripherals), network integration (FTP and fileshare), and hardware integration (scanners) are supported.
Image Handling and Optimization
For documents that arrive as images (through scanners, faxes or via mobile import), there is often an “image perfection” set of activities designed to manage the wide variety of quality issues typically encountered. These can include differences in the density of the images (measured in DPI or pixels), distortions of the image (stretching of the image or creases on the documents), and contrast problems (such as blur and poor lighting).
Also for images, OCR is employed to convert the pictures of text into machine-readable text. Some solutions selectively perform OCR to reduce latencies typical with this process while other solutions convert the entire document into text.
Document Classification
This process analyzes incoming documents in order to assign document types (e.g., contract, invoice, check payment) to support different types of workflows or subsequent data extraction tasks.
Document classification can employ simple rules such as locating keywords or it can implement machine learning which automatically identifies what are called “features” that distinguish one document type from another.
Document Separation
While many documents exist as single files (think PDF), many times multiple documents are stored together. Document separation is the process of identifying these “document boundaries” so that a PDF containing many different documents can be “burst” into multiple documents, tagged, and then routed through potentially different workflows. Document separation can make use of simple rules, or it can implement machine learning to identify the most reliable features that indicate the first, middle and last pages of a document.
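A rules-based version of that bursting step can be sketched in a few lines of Python. The marker phrase and page texts below are hypothetical; real systems use richer features:

```python
# Hypothetical rule: a page whose text contains "Page 1" is treated as the
# first page of a new document, i.e., a document boundary.
pages = [
    "Invoice Number: 1001 Page 1 of 2",
    "Line items continued... Page 2 of 2",
    "Invoice Number: 1002 Page 1 of 1",
]

def separate(pages, first_page_marker="Page 1"):
    """Burst a flat list of pages into documents at detected boundaries."""
    documents = []
    for page in pages:
        if first_page_marker in page or not documents:
            documents.append([page])    # start a new document
        else:
            documents[-1].append(page)  # continue the current document
    return documents

docs = separate(pages)
print(len(docs))  # 2
```

A machine learning separator replaces the hand-written marker rule with a learned first/middle/last-page classifier, but the bursting logic is the same.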
Data Location and Extraction
This process takes the text of a document and turns it into tagged data that can be stored within a relational database or other structured format, such as XML, for use in other systems. There are many different techniques involved in interpreting text, ranging from specifying the exact location of data in X/Y coordinates (often called “templates”), to use of keyword/value pairs, to regular expressions/pattern recognition, to advanced machine learning algorithms that analyze many bits of information to reliably locate the required data.
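The output side of that process, turning located values into tagged data, can be sketched with Python's standard library. The field names and values are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical fields located and extracted from an invoice.
extracted = {
    "invoice_number": "INV-4501",
    "invoice_date": "2023-04-01",
    "total": "199.00",
}

# Emit the extracted fields as tagged XML for a downstream system.
root = ET.Element("invoice")
for name, value in extracted.items():
    ET.SubElement(root, name).text = value

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The same dictionary could just as easily be written to a relational database row; XML is simply one common interchange format.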
Data Validation
This step involves both automated and manual processes to ensure the data output from an intelligent capture solution is accurate. Automated methods range from use of user-supplied information, such as dictionaries or integrations with third-party data stores, to more complex capabilities such as statistical analysis of output to score the reliability of data at a field level. Manual validation includes workflows and special user interfaces that route suspect output to specific staff who review, approve or make corrections prior to exporting the data to another system.
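A minimal automated-validation sketch in Python, combining a user-supplied dictionary with a format pattern (the vendor list, field names and rules are hypothetical):

```python
import re

# Hypothetical user-supplied dictionary of approved vendor names.
known_vendors = {"Acme Corp", "Globex Inc"}

def validate(record):
    """Return the list of suspect fields to route to a human reviewer."""
    suspect = []
    # Dictionary check: the vendor must be a known value.
    if record.get("vendor") not in known_vendors:
        suspect.append("vendor")
    # Format check: amounts must look like a decimal number, e.g. 1234.56.
    if not re.fullmatch(r"\d+\.\d{2}", record.get("amount", "")):
        suspect.append("amount")
    return suspect

print(validate({"vendor": "Acme Corp", "amount": "120.00"}))  # []
print(validate({"vendor": "Acme C0rp", "amount": "12O.00"}))  # ['vendor', 'amount']
```

Records that come back with an empty suspect list can be exported straight through; the rest are routed to a manual review queue.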
Accuracy in Capture
What Does “Accuracy” Mean and How is Value Measured for Accuracy?
When it comes to defining “accuracy,” there is a lot of confusion and little effort to create real clarity in intelligent capture. Accuracy can mean many different things depending upon what is being measured and what is most important to your organization. Just as many confuse OCR with intelligent capture, system accuracy is also a topic that requires more investigation.
At a very crude level, OCR systems can achieve 98%-99% accuracy at a character or word level. This is fine if your objective is to convert scanned documents so that they can be easily searched or edited. And yet, 99% accuracy at a character or word level means nothing for intelligent capture. The key value proposition is to reduce manual data entry for structured information derived from documents. This means that systems need to provide the largest amount of usable structured data from your documents presented at the highest levels of accuracy. Here two measurements matter: (1) the amount of data that a system can produce and (2) the accuracy of that data.
Without both measurements, a vendor can easily claim to provide 99% accuracy and be completely truthful. But you may only get 5% of your data automated. The “Mars shot” of intelligent capture is 100% data extraction at 100% accuracy. We probably won’t get there in the next several decades, but those are the measurements with which all systems should be compared with the objective of maximizing both numbers as far as possible.
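The two measurements can be combined into a single effective automation rate: the fraction of fields that are both extracted and correct. A minimal sketch in Python (the rates below are illustrative, not benchmarks):

```python
def effective_automation(extraction_rate, accuracy):
    """Fraction of fields that are both extracted and correct."""
    return extraction_rate * accuracy

# 99% accuracy on only 5% of fields automates almost nothing:
print(round(effective_automation(0.05, 0.99), 4))  # 0.0495
# 80% extraction at 95% accuracy automates far more:
print(round(effective_automation(0.80, 0.95), 2))  # 0.76
```

This is why both numbers must be reported together: a high accuracy figure says nothing about how much of your data was actually captured.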
Accuracy & Data Quality
Data Quality Priorities and Intelligent Capture Accuracy
5 Reasons Data Quality Matters
Best Practices to Improve ICR Accuracy
What Are the Options with Document Classification?
There are two major types of data stored within documents: text-based information and visual-oriented information. Text is easy to understand. Visual information can be pictures, presence of logos or other visually-distinct information such as data structured in a tabular format. The most common type of document classification is a rules-based approach where a subject matter expert identifies words that are unique to each document type. From there, rules are encoded that dictate document class assignments based on the presence of one or more of these words. The benefits are that rules-based approaches are fairly straightforward to understand and create. The drawbacks include the amount of time required to analyze and construct the rules as well as the potential that words identified as belonging to one document class might also belong to another. This creates the potential for errors.
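A rules-based classifier of this kind can be sketched in a few lines of Python. The keyword lists below are hypothetical examples of what a subject matter expert might define:

```python
# Hypothetical keyword rules, one list per document class.
rules = {
    "invoice": ["invoice number", "amount due"],
    "contract": ["hereinafter", "governing law"],
}

def classify(text, rules=rules):
    """Assign the first class whose keywords appear in the document text."""
    lowered = text.lower()
    for doc_class, keywords in rules.items():
        if any(keyword in lowered for keyword in keywords):
            return doc_class
    return "unknown"

print(classify("Invoice Number: 1001, Amount Due: $500"))       # invoice
print(classify("This agreement, hereinafter the Contract..."))  # contract
```

The sketch also shows the drawback described above: if a keyword appears in more than one document class, whichever rule is checked first wins, which is exactly where rules-based approaches start to produce errors.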
A more modern approach to classifying documents using text uses a machine learning algorithm that operates on the text to automate the process of identifying words or phrases that are distinctive enough to determine the proper document class. These algorithms identify textual “features” that can go beyond what most humans can identify, and they can operate on a much larger data set to provide more comprehensive coverage. The actual type of machine learning algorithm used is not relevant provided the performance is suitable to the need. In some cases, multiple algorithms or techniques are used depending upon the nature of the information.
The benefits are that the investment in time and effort to construct rules is removed and replaced by “compute time,” with significantly more data going into the analysis to ensure that automation can be more comprehensive. The drawback is that machine learning can be a black box, with little visibility into the process that yields the results.
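One common text-classification algorithm of this kind is naive Bayes, shown here as a minimal pure-Python sketch. The tiny labeled training set is hypothetical; a real project would train on thousands of documents:

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training documents (class, text).
training = [
    ("invoice", "invoice number amount due remit payment total"),
    ("invoice", "bill to invoice date payment terms total due"),
    ("contract", "agreement parties governing law term termination"),
    ("contract", "hereinafter agreement obligations term parties"),
]

# Count word frequencies per class -- the learned textual "features".
word_counts = defaultdict(Counter)
class_totals = Counter()
for label, text in training:
    words = text.split()
    word_counts[label].update(words)
    class_totals[label] += len(words)

vocab = {w for _, text in training for w in text.split()}

def classify(text):
    """Naive Bayes with add-one smoothing over the training vocabulary."""
    scores = {}
    for label in word_counts:
        score = 0.0
        for word in text.lower().split():
            count = word_counts[label][word]
            score += math.log((count + 1) / (class_totals[label] + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("payment due on invoice total"))  # invoice
```

Unlike the rules-based approach, no one wrote keyword lists here: the word statistics are computed from the labeled samples, which is why adding more training data improves coverage without added analysis effort.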
Another means to classify documents is to use the visual information available. This type of classification requires algorithms based on computer vision that enable it to identify key visual features of a document to sort one document type from another. Often, especially with visually-distinct document classification projects, visual classification does not require use of OCR, which can be a time-consuming process.
Data Extraction Options
What Are the Options with Data Extraction, Including Handwriting?
There are two major elements to data extraction that are important to understand. The first is data location: the mechanisms used to locate the required data. The second is data extraction: the mechanisms to convert (if necessary) image-based information into computer-readable, text-based data. Increasingly, the focus is on the first area, as more documents are “born digital” and don’t require OCR or other image-to-text technologies.
Within data location, there are many different techniques that can be used depending upon the type of documents being processed. The simplest is the use of a “template,” which allows a user to establish the specific locations of data on a document for extraction. Location is typically provided by the user drawing boundaries around the data directly on an example document. From there, when the document is encountered, the software only processes the information contained within those boundaries. This method is best used for highly standardized forms (e.g., an application form) where the information always appears in the same place.
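Template-based location reduces to a coordinate test: keep only the recognized words whose positions fall inside the drawn regions. A minimal sketch in Python, with hypothetical word coordinates and field regions:

```python
# Hypothetical OCR output: recognized words with x/y positions (pixels).
words = [
    {"text": "Jane",       "x": 120, "y": 80},
    {"text": "Doe",        "x": 170, "y": 80},
    {"text": "2023-04-01", "x": 120, "y": 140},
]

# The "template": field name -> region the user drew on an example document.
template = {
    "name": {"x0": 100, "y0": 60,  "x1": 300, "y1": 100},
    "date": {"x0": 100, "y0": 120, "x1": 300, "y1": 160},
}

def extract(words, template):
    """Collect the words that fall inside each field's drawn region."""
    fields = {}
    for field, box in template.items():
        inside = [w["text"] for w in words
                  if box["x0"] <= w["x"] <= box["x1"]
                  and box["y0"] <= w["y"] <= box["y1"]]
        fields[field] = " ".join(inside)
    return fields

print(extract(words, template))  # {'name': 'Jane Doe', 'date': '2023-04-01'}
```

This also makes the limitation obvious: if the form layout shifts even slightly, the boxes no longer line up with the data, which is why templates only suit highly standardized forms.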
To deal with scenarios that aren’t as simple as forms, intelligent capture software can use techniques that allow identification of key data based on keywords or patterns of values. For instance, with invoices where the task is to find the invoice number, keywords such as “Invoice Number,” “Invoice No.” and “Invoice #” can be used to locate the region where the value is located. The software then searches around these keywords or searches in a specific location relative to the keywords to identify the actual invoice number. Use of patterns is good for data such as amounts, dates or other well-defined values where the software can parse a page or a portion of a page looking specifically for data that “looks” like the target data. You might hear a term called “regular expression” used when pattern matching is required. Regular expressions use a special vocabulary to describe patterns of information and the acceptable data formats.
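Both ideas, keyword variants and value patterns, can be expressed as regular expressions. A minimal sketch in Python; the page text and the exact value formats are hypothetical:

```python
import re

page = "ACME Corp\nInvoice No. INV-4501\nDate: 2023-04-01\nTotal: $1,250.00"

# Keyword variants ("Invoice Number", "Invoice No.", "Invoice #") followed
# by the value; the value's character set is an assumed format.
invoice_re = re.compile(r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)")

# Pattern matching: data that "looks" like a currency amount.
amount_re = re.compile(r"\$\d{1,3}(?:,\d{3})*\.\d{2}")

print(invoice_re.search(page).group(1))  # INV-4501
print(amount_re.search(page).group())    # $1,250.00
```

The first expression encodes the keyword search described above; the second finds a value anywhere on the page purely by its shape, with no keyword at all.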
Much more recently, machine learning algorithms have been employed to aid with locating targeted data. Machine learning differs from the various rules-based approaches in that the software itself determines the location of data by processing sample data sets. This learning can be done with or without the direct aid of people. While some IDP vendors claim to use machine learning, many use a technique that collects various user rules and input over time to create a knowledge base.
The second part of the equation is data extraction. As mentioned earlier, as more documents are born digital, this portion relies less on actual use of recognizers and more on rules or other mechanisms to validate and uniformly format the data.
When dealing with image-based documents, the two mechanisms to transcribe images into text are OCR and Intelligent Character Recognition (ICR), which operate on text and handwriting respectively.
For text, there are many different ways to use OCR, from converting the entire document to text first and then employing location functions after-the-fact to more-sophisticated approaches which only process a portion of the document.
Most intelligent document processing solutions operate on a word and/or field level which is different from the core output of an OCR engine which just provides a literal transcription of images into text.
Handwriting is a much different story, and therefore the underlying processes involved are different. For instance, handwriting recognition traditionally works much better at the word and field level only, operating on a few words where the expected values can be described.
The handwriting recognizer relies on what is often called context: a list of known values, such as a name database, or the pattern of the value, such as a social security number, date or amount. Using this context, the recognizer can whittle down the large number of potential answers to the most likely correct one.
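The whittling-down step can be illustrated with fuzzy matching against a known-values list, here using Python's standard-library difflib. The name list and the noisy recognizer output are hypothetical:

```python
import difflib

# Hypothetical name database supplying "context" for the recognizer.
known_names = ["Johnson", "Johansen", "Jonson", "Smith"]

# Raw handwriting recognition output is often noisy (here 'i' read as '1').
raw_output = "Sm1th"

# Constrain the noisy output to the closest known value.
matches = difflib.get_close_matches(raw_output, known_names, n=1, cutoff=0.6)
print(matches[0] if matches else raw_output)  # Smith
```

Real recognizers apply the constraint inside the recognition model rather than as a post-hoc string match, but the effect is the same: context turns many plausible readings into one likely answer.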
More recently, using deep learning algorithms, handwriting recognition can operate on larger amounts of information, including a full page, without requiring context. Performance, however, is still nowhere near the level of OCR.