

What Are Your Options for Data Extraction?

Optical Character Recognition (OCR)

Optical Character Recognition (OCR) is often confused with intelligent capture solutions, and there is an understandable, if misleading, reason for it. Many people considering intelligent document processing solutions think of documents as the traditional paper variety, and when paper documents need to be processed, OCR is an essential enabling capability. But OCR is only a starting point.

OCR is the process of converting images of text into machine-readable text. It analyzes the blocks of text on a page, separates them into sections that look like words, and then converts the individual letters into actual computer text. The result is text that a knowledge worker can edit or work from. Alternatively, the text can be fed into a search engine index so that documents can be found within a document management system.
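The segmentation step described above can be illustrated with a toy column-projection sketch. It is not how any particular OCR engine works; it only assumes a line of text given as rows of characters where `#` marks ink, and the helper names (`ink_columns`, `column_runs`, `group_into_words`) and the `word_gap` threshold are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) column range, end exclusive


def ink_columns(rows: List[str]) -> List[bool]:
    """True for each pixel column that contains at least one ink pixel ('#')."""
    width = max(len(r) for r in rows)
    return [any(c < len(r) and r[c] == "#" for r in rows) for c in range(width)]


def column_runs(ink: List[bool]) -> List[Span]:
    """Return spans of consecutive ink columns (candidate glyphs)."""
    runs, start = [], None
    for i, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = i
        elif not has_ink and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(ink)))
    return runs


def group_into_words(runs: List[Span], word_gap: int = 3) -> List[List[Span]]:
    """Group glyph spans into words; a gap of word_gap or more blank columns starts a new word."""
    words: List[List[Span]] = []
    for run in runs:
        if words and run[0] - words[-1][-1][1] < word_gap:
            words[-1].append(run)
        else:
            words.append([run])
    return words


# A toy one-line "scan": two glyphs close together, a wide gap, then one glyph.
LINE = [
    "## ##    ##",
    "## ##    ##",
]
glyphs = column_runs(ink_columns(LINE))
words = group_into_words(glyphs)
```

Each glyph span would then be passed to a character recognizer, which is the part this sketch leaves out.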

However, if you need specific data from a document, OCR alone won’t get you there. And when documents are “born digital”, such as Word documents converted to PDF or transactional documents (invoices, remittances, etc.) exported as PDF from a business system, there is no need for OCR at all.
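The routing decision above (run OCR only when a document carries no usable text of its own) can be sketched as follows. The `Document` class, the `min_chars` threshold, and the `ocr` callable are all hypothetical stand-ins for illustration; a real system would read the text layer with a PDF library and call an actual OCR engine.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Document:
    name: str
    text_layer: Optional[str] = None    # embedded text, if the file carries any
    page_image: Optional[bytes] = None  # scanned pixels, if any


def needs_ocr(doc: Document, min_chars: int = 20) -> bool:
    """Assume OCR is needed when the embedded text is missing or nearly empty."""
    text = (doc.text_layer or "").strip()
    return len(text) < min_chars


def get_text(doc: Document, ocr: Callable[[bytes], str]) -> str:
    """Return the embedded text when present; otherwise fall back to OCR."""
    if needs_ocr(doc):
        if doc.page_image is None:
            raise ValueError(f"{doc.name}: no text layer and no image to recognize")
        return ocr(doc.page_image)
    return doc.text_layer or ""


# A born-digital export keeps its text; a scan has only pixels.
exported = Document("invoice.pdf", text_layer="Invoice No: INV-1001 Total: 250.00")
scanned = Document("scan.tif", page_image=b"\x00")
```

Either way the downstream steps receive plain text, which is why OCR can be treated as an optional front end rather than the core of the process.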

Intelligent Document Processing

Intelligent Document Processing takes text, whether it started as digital data or required OCR, and interprets it to identify document types and to locate and export specific data. It also analyzes candidate data in order to select the most accurate values. Many different techniques are involved in interpretation, with machine learning algorithms being the latest. In most cases, if all the processes involved in taking a document and interpreting it were measured, OCR would represent only about 10% of the entire process.
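To make the interpretation step concrete, here is a deliberately simple rule-based sketch of identifying a document type, locating candidate values, and choosing the best candidate. It stands in for the machine learning techniques mentioned above and does not reproduce any product's method; the keyword lists, the confidence scores (0.9 and 0.4), and the function names are assumptions made for illustration.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical keyword lists per document type.
KEYWORDS = {
    "invoice": ("invoice", "amount due", "bill to"),
    "remittance": ("remittance", "payment advice"),
}

AMOUNT = re.compile(r"\d+(?:,\d{3})*\.\d{2}")


def classify(text: str) -> str:
    """Pick the document type whose keywords appear most often."""
    lowered = text.lower()
    scores = {t: sum(lowered.count(k) for k in kws) for t, kws in KEYWORDS.items()}
    best = max(scores, key=lambda t: scores[t])
    return best if scores[best] > 0 else "unknown"


def amount_candidates(text: str) -> List[Tuple[str, float]]:
    """Collect every amount-like string; amounts on a 'total' line score higher."""
    candidates = []
    for line in text.splitlines():
        near_label = "total" in line.lower()
        for match in AMOUNT.finditer(line):
            candidates.append((match.group(), 0.9 if near_label else 0.4))
    return candidates


def best_amount(text: str) -> Optional[str]:
    """Return the highest-scoring candidate, or None when nothing was found."""
    candidates = amount_candidates(text)
    return max(candidates, key=lambda c: c[1])[0] if candidates else None


SAMPLE = """Invoice No: INV-1001
Bill To: Example Corp
Widget 100.00
Shipping 25.00
Total 125.00"""

doc_type = classify(SAMPLE)
total = best_amount(SAMPLE)
```

The point of scoring candidates rather than taking the first match is that a page usually contains several values that look right, and only context distinguishes the one that is wanted.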

Intelligent Document Processing is designed to take document-based information, ranging from structured forms to unstructured contracts, and turn it into structured, computer-readable data that can be used in a variety of systems, including case management and line-of-business systems as well as content and knowledge management systems. Since the primary value of intelligent capture is to identify and extract as much data as accurately as possible, many subcomponents are involved. At a high level, the following are commonly used within an intelligent capture workflow: information ingestion, image handling and optimization, document classification, document separation, data location and extraction, and data validation. Each is detailed below on this page.
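The workflow stages listed above can be pictured as an ordered chain of functions that each enrich a shared job record. The sketch below follows the stage order from the text, but every stage body is a toy placeholder and all names (`STAGES`, `run`, the job keys) are hypothetical; pages are represented as plain strings to keep the example self-contained.

```python
import re
from typing import Any, Callable, Dict, List, Tuple

Job = Dict[str, Any]


def ingest(job: Job) -> Job:
    job["pages"] = list(job["raw_pages"])  # accept pages from any source
    return job


def optimize(job: Job) -> Job:
    # Stand-in for image cleanup: normalize whitespace in each page.
    job["pages"] = [" ".join(p.split()) for p in job["pages"]]
    return job


def classify(job: Job) -> Job:
    job["page_types"] = [
        "invoice" if "invoice" in p.lower() else "continuation" for p in job["pages"]
    ]
    return job


def separate(job: Job) -> Job:
    # A non-continuation page starts a new document.
    docs: List[List[str]] = []
    for page, page_type in zip(job["pages"], job["page_types"]):
        if page_type != "continuation" or not docs:
            docs.append([page])
        else:
            docs[-1].append(page)
    job["documents"] = docs
    return job


def extract(job: Job) -> Job:
    pattern = re.compile(r"total[:\s]+(\d+\.\d{2})", re.IGNORECASE)
    job["fields"] = []
    for doc in job["documents"]:
        match = pattern.search(" ".join(doc))
        job["fields"].append({"total": match.group(1) if match else None})
    return job


def validate(job: Job) -> Job:
    # Flag documents with a missing total for human review.
    job["needs_review"] = [i for i, f in enumerate(job["fields"]) if f["total"] is None]
    return job


STAGES: List[Tuple[str, Callable[[Job], Job]]] = [
    ("ingestion", ingest), ("image optimization", optimize),
    ("classification", classify), ("separation", separate),
    ("extraction", extract), ("validation", validate),
]


def run(raw_pages: List[str]) -> Job:
    job: Job = {"raw_pages": raw_pages}
    for _name, stage in STAGES:
        job = stage(job)
    return job


result = run(["INVOICE  A-1\nWidget 10.00", "Total: 10.00", "Invoice B-2\nno amounts here"])
```

Keeping the stages as separate functions in a list makes it easy to swap a toy stage for a real one, or to measure how much of the work each stage accounts for.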