Advanced Capture | Knowledge Base | Definition


Advanced Capture is software that converts complex and variable document-based information into structured data. It uses several techniques to locate and extract data, including pattern matching with regular expressions, definition of keyword/value pairs, and location of tabular data by column headers and rows.


When your needs go beyond transcribing images into text or converting TIFF files into editable Word documents, consider the capabilities typically associated with advanced capture solutions. Let’s review those key capabilities: image processing, document classification, data extraction and data quality.

Image Processing

This capability is needed if you deal with scanned documents or pictures of documents taken with a mobile device. The quality of OCR output depends heavily on the quality of the image. Corrections for distortion, contrast and lighting, along with background/watermark removal, scaling correction and geometry correction, are typically applied to optimize an incoming document before OCR is run. Advanced Capture applies these image perfection functions selectively, based on what each image needs.
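As a rough illustration of why contrast correction matters before OCR, the sketch below stretches the contrast of a tiny grayscale image and then binarizes it. The image is a hypothetical list of pixel rows, the function names and the threshold of 128 are assumptions made for this example, and a real capture product would use a dedicated imaging library rather than plain lists.

```python
def stretch_contrast(pixels):
    """Linearly rescale grayscale values so the darkest becomes 0 and the brightest 255."""
    flat = [value for row in pixels for value in row]
    low, high = min(flat), max(flat)
    if high == low:
        # A completely flat image has no contrast to stretch; return a copy.
        return [row[:] for row in pixels]
    scale = 255 / (high - low)
    return [[round((value - low) * scale) for value in row] for row in pixels]


def binarize(pixels, threshold=128):
    """Map each pixel to 0 (ink) or 255 (paper) using a fixed threshold."""
    return [[0 if value < threshold else 255 for value in row] for row in pixels]


# Hypothetical washed-out scan: "ink" is around 150 and "paper" around 190.
scan = [
    [190, 188, 151, 189],
    [191, 150, 152, 190],
]

without_stretch = binarize(scan)                 # every pixel lands above the threshold
cleaned = binarize(stretch_contrast(scan))       # ink and paper are separated
```

Without the stretch, every pixel in this sample falls above the threshold and the "ink" disappears; after the stretch, the same threshold separates ink from paper.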

Document Classification

Most advanced capture tasks involve more than one type of document, so the software must be able to tell them apart. This is the province of classification. The ability to easily train a system to produce reliable document class assignments is an important capability of Advanced Capture.
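To make "training a system to assign document classes" concrete, here is a minimal sketch of a trainable text classifier. It uses a naive Bayes word model with add-one smoothing, which is an assumption chosen for brevity and not a description of how any particular capture product classifies documents; the class name, labels and training snippets are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


class TinyDocumentClassifier:
    """Naive Bayes over word counts (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training documents
        self.vocabulary = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior plus log likelihood of each token, with add-one smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = TinyDocumentClassifier()
classifier.train("invoice number total due amount payable", "invoice")
classifier.train("invoice date bill to total amount due", "invoice")
classifier.train("purchase order ship to quantity ordered", "purchase_order")
classifier.train("purchase order number requested delivery quantity", "purchase_order")

invoice_guess = classifier.classify("Total amount due on this invoice")
order_guess = classifier.classify("Purchase order quantity 40 units")
```

Adding a new document class in this sketch only requires calling `train` with a few labeled examples, which is the kind of easy training the paragraph above refers to.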

Data Extraction

Here we are not talking about converting an image of a document into text, but about satisfying the business need to turn documents into structured tag-value pairs that can be used by various systems. Advanced Capture offers a broad set of location techniques, from the simplest (supplying field-level X-Y coordinates) to more sophisticated ones such as the presence of keywords, the relative proximity of one data element to another, or pattern matching, so that the right data is accurately extracted.
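The keyword and pattern-matching techniques can be sketched with regular expressions. In the example below, the sample invoice text, the field names and the patterns are all assumptions made for illustration; each pattern anchors on a keyword and captures the value that follows it, producing the tag-value pairs described above.

```python
import re

# Hypothetical OCR text of an invoice; the layout and labels are invented.
SAMPLE_TEXT = """ACME Supplies
Invoice Number: INV-20417
Invoice Date: 2023-04-18
Total Due: $1,284.50
"""

# Each field pairs a keyword with the value pattern expected right after it.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s+Number\s*:\s*([A-Z]+-\d+)"),
    "invoice_date": re.compile(r"Invoice\s+Date\s*:\s*(\d{4}-\d{2}-\d{2})"),
    "total_due": re.compile(r"Total\s+Due\s*:\s*\$([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return a dict of field name -> extracted value (None when not found)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


fields = extract_fields(SAMPLE_TEXT)
```

Because the patterns key on keywords rather than fixed positions, the same definitions keep working when a field moves to a different place on the page, which fixed X-Y coordinates cannot handle.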

Data Quality

This last capability is probably the most difficult to measure and, at the same time, the most important. OCR returns text along with character- and word-level coordinates as well as character-level confidence scores. Each confidence score conveys the estimated probability that the software recognized that character correctly. For instance, for a word like “source”, the output might include values such as “s (45), o (35), u (99), c (85), e (95)”, where the numbers in parentheses represent confidence scores. However, character-level confidence scores provide little value on their own if the real objective is to locate and extract data elements that might consist of several words.

Advanced Capture focuses on data element-level confidence scoring so that it reliably extracts the most data elements, rather than the most characters. Additionally, Advanced Capture provides a number of data validation capabilities to improve the quality of the extracted data.
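One way to picture element-level scoring is to combine the character scores of a field into a single number. The sketch below uses a geometric mean, which is an assumption chosen for illustration and not a description of how any particular product computes its scores; the function names and the review threshold of 80 are likewise hypothetical.

```python
import math


def field_confidence(chars):
    """Combine (character, confidence 0-100) pairs into one field-level score.

    Uses the geometric mean, so a few weak characters pull the whole field down.
    """
    if not chars:
        return 0.0
    # Clamp at 1 to avoid log(0) for characters reported with zero confidence.
    log_sum = sum(math.log(max(conf, 1) / 100.0) for _, conf in chars)
    return 100.0 * math.exp(log_sum / len(chars))


def needs_review(chars, threshold=80.0):
    """Flag a field for manual validation when its combined score is too low."""
    return field_confidence(chars) < threshold


# The character scores quoted in the example above.
weak_field = [("s", 45), ("o", 35), ("u", 99), ("c", 85), ("e", 95)]
# A hypothetical field whose characters were all recognized with high confidence.
strong_field = [("4", 99), ("2", 98)]

weak_score = field_confidence(weak_field)
strong_score = field_confidence(strong_field)
```

With these inputs the weak field scores about 66 and is routed to review, while the strong field passes. A single decision per field is easier to act on than a list of per-character numbers.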