If you parse only OCR text to perform data extraction, you will not be successful.
We have come across several cases where a prospective customer or current client wanted full-text OCR output so that they could perform further processing on the text. In each case, the goal was to locate transactional data on invoices or other variably structured documents.
When I asked why they wanted to take that approach, the answer always came back with a similar message: “that is how other software vendors say to do it.”
Parsing OCR Text Approach: The Limitations
There’s a problem here. Pattern matching with regular expressions and keyword spotting can extract some level of data, but nowhere near the level a business needs to achieve meaningful results. The problem is made worse by a lack of understanding of how to approach a recognition or data extraction project, including how to properly measure results.
If you only parse OCR text to perform extraction, you will locate fewer than 40 percent of the fields, and those results will have double-digit error rates.
This means that an invoice with ten fields will, on average, have only four of those fields located, and the error rate on those located fields can be 25 percent or greater. At a real accuracy of only three fields out of ten, the effort immediately looks meaningless.
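The arithmetic behind that claim can be checked with a quick calculation. The 40 percent location rate and 25 percent error rate are the estimates quoted above, not measured values:

```python
# Effective field-level accuracy when parsing raw OCR text,
# using the article's estimates: 40% of fields located, 25% of
# those located fields extracted incorrectly.
fields_total = 10
location_rate = 0.40   # share of fields the parser finds at all
error_rate = 0.25      # share of located fields that are wrong

located = fields_total * location_rate   # 4 fields found
correct = located * (1 - error_rate)     # 3 fields actually right
effective_accuracy = correct / fields_total

print(f"Located: {located:.0f}, correct: {correct:.0f}, "
      f"effective accuracy: {effective_accuracy:.0%}")
```

Three correct fields out of ten is a 30 percent effective accuracy, which is why the effort looks meaningless.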
Why Are the Results So Poor?
The challenges start with the quality of the OCR itself and go downhill from there. Let’s say OCR is 99 percent accurate at the character level. Processing an image with 1,000 characters will yield 990 correct characters, but the ten erroneous ones can, and typically will, be dispersed across the words and values in that image. Assuming an average of six characters per word, that image holds roughly 166 words, so ten word-level errors out of 166 words is a word-level error rate of about 6 percent.
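The worst-case amplification from character errors to word errors can be sketched in a few lines, using the same assumptions as above (each character error lands in a different word):

```python
# How a small character-level error rate becomes a much larger
# word-level error rate when the errors are dispersed.
chars = 1000
char_accuracy = 0.99
chars_per_word = 6  # rough average word length

char_errors = chars * (1 - char_accuracy)  # ~10 bad characters
words = chars / chars_per_word             # ~166 words
word_error_rate = char_errors / words      # ~6% of words damaged

print(f"{char_errors:.0f} character errors spread over {words:.0f} words "
      f"gives a ~{word_error_rate:.0%} word-level error rate")
```

In practice some errors will cluster in the same word, so 6 percent is the pessimistic bound; the point is that even excellent character-level OCR leaves a meaningful share of words damaged.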
Next, let’s apply pattern and/or word matching to that text. That step introduces plenty of error of its own. Patterns and word matching alone leave out context that would help locate the correct data. Take dates, for instance. There is usually more than one date on a transactional document, so how can you discern between a purchase date and a payment date? Pattern matching for dates can get you part of the way there, and if the date has a label, you can further refine the available selections. But what if the labels have OCR errors in them? What if there are no labels? What if the date field uses different formats?
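Both failure modes are easy to demonstrate. The snippet below runs a date pattern and a keyword search over a hypothetical OCR output (the invoice text and the “Invoioe” misread are invented for illustration):

```python
import re

# Hypothetical OCR text: two dates, and one label ("Invoioe Date")
# mangled by a character-level OCR error.
text = """ACME Corp
Invoioe Date: 03/14/2024
Payment Due: 04/13/2024
Total: $1,250.00"""

# A pattern for US-style dates finds *all* dates; it cannot say
# which is the invoice date and which is the payment date.
dates = re.findall(r"\b\d{1,2}/\d{1,2}/\d{4}\b", text)
print(dates)  # ['03/14/2024', '04/13/2024']

# Keyword spotting on the label fails outright because of the
# OCR error: "Invoice" was read as "Invoioe".
labeled = re.search(r"Invoice Date:\s*(\d{1,2}/\d{1,2}/\d{4})", text)
print(labeled)  # None: the label never matches
```

The pattern alone gives two equally plausible candidates, and the label that should disambiguate them is exactly the kind of text OCR errors corrupt.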
Let’s talk about the sheer enormity of using patterns and words to locate the required data. On invoices, even the most basic fields, such as invoice date, vendor name, and total amount, commonly appear under a wide variety of labels and formats. If you are not approaching the project statistically, using a statistically representative set of samples, the error rate will rise dramatically.
What Is the Best Approach?
So now that the problems are at least described at a high level, what is the best approach? The answer depends upon the data type. Taking invoices or receipts as an example, the approach should combine several technologies and techniques:

- Document classification, to first identify whether the image is actually an invoice
- Image analysis to identify logos, which help distinguish invoices from different vendors
- Image analysis to locate key “data blocks” such as addresses, summary data, and line-item data
- Spatial analysis to determine data relationships
- Pattern and vocabulary matching to further validate identified candidate data
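The essence of combining these techniques is that no single signal decides anything; each candidate is scored across signals. The toy sketch below illustrates the idea with invented candidates and weights (a real system would learn these from ground truth data rather than hand-pick them):

```python
# Toy sketch: score candidates for the "total amount" field by
# combining several signals instead of relying on one pattern.
# Candidates, signals, and weights are invented for illustration.
candidates = [
    # (value, matches_amount_pattern, near_total_label, in_summary_block)
    ("$1,250.00",  True,  True,  True),
    ("$500.00",    True,  False, False),  # a line-item price
    ("03/14/2024", False, False, True),   # a date in the summary block
]

WEIGHTS = {"pattern": 0.40, "label": 0.35, "block": 0.25}

def score(matches_pattern, near_label, in_block):
    """Weighted vote across independent location signals."""
    return (WEIGHTS["pattern"] * matches_pattern
            + WEIGHTS["label"] * near_label
            + WEIGHTS["block"] * in_block)

best = max(candidates, key=lambda c: score(*c[1:]))
print(best[0])  # '$1,250.00': the only candidate all signals agree on
```

Note that the line-item price also matches the amount pattern; it is the spatial and label signals, not the pattern, that rule it out. That is the gap pure OCR-text parsing cannot close.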
Results to Live By: Performance of Over 60%
Finally, all of these can be used in conjunction with ground truth data and machine learning algorithms to automate the development of the inferences and rule sets used to actually locate and then recognize/extract the required data. The net result will deliver out-of-the-box performance of greater than 60 percent with single-digit error rates. And since machine learning pulls all of these individual data points together, the inferences, and the resulting performance, can improve as more data is fed into the system.
If all of this sounds like IBM’s Watson, generally speaking, it is very similar. If your data extraction project isn’t using an approach along these lines, you might want to take some time to reconsider it.