Types of Documents Drive Data Extraction
Best practices—when approaching a document management challenge that necessitates data extraction—require fully understanding the types of documents involved. The “nature” of the document is fundamental in determining the most appropriate technologies and techniques to use. For example, OCR cannot provide a comprehensive solution in many cases. Instead, OCR acts as the underlying, supporting technology that aids with producing a final result.
Luckily, a generally accepted set of document definitions has been developed that can be used to establish your initial scope and what is most likely needed. There are three broad document definitions or categories: structured documents, semi-structured documents, and unstructured documents.
Structured documents are commonly referred to as “forms”. With a form, typically the same information of the same type is located in the same place. Let’s take, for instance, a U.S. Passport application. While these forms can change or have variations, the information within any form application type is highly standardized or “structured”.
Here we see a set of fields for last name, first and middle, sex, and place of birth. Even more helpful is the presence of “boxes” which help to separate the handwritten letters into single characters which improves data extraction accuracy. The structured document is probably the easiest document to process due to the highly known state of the information contained within them. Simply put, the data is very easy to locate.
Expanding beyond forms is the category of documents that can include some standardized information but the location or variance of this data can be quite broad. Think of an invoice. While there is always an invoice number, date, subtotal, and total, these data are rarely formatted in the same manner or located in the same place on the document.
Therefore the task of locating the data for each field is difficult and that is before you try to evaluate the data within each field. For semi-structured documents, an entirely different set of technologies need to be employed to provide consistently accurate data extraction.
Confusion often exists between distinguishing semi-structured and unstructured documents. Mostly this is because the level of difficulty is similar in terms of data extraction. What separates the two is the level of standardization and density of the information. Think of a legal agreement, which is an example of an unstructured document. There are always parties, dates, terms, and definitions, but these vary quite a lot depending upon the negotiation process and the different needs of the parties involved. There may be some “field-like” data such as the party names and dates that are standardized, but the location of this information is typically not standard.
Additionally, much of the relevant information beyond that is contained in longer narratives that are incredibly difficult to parse for specific required data. As such, the ability to accurately locate and extract the information relevant to support case management, contract negotiation, or even governance needs is very problematic. For these challenges, another layer of advanced technologies and techniques need to be applied.
In our next set of blog articles on this topic, we delve into the specific technologies and techniques that can be applied on each document type. After reading this series, you’ll agree that this is NOT plain OCR!