In my previous post regarding accuracy, I left you hanging on how to achieve high accuracy rates. Let’s revisit the main theme of Part 1:
- You have the need to locate and extract data in your documents. Great.
- And you want it to be accurate. Ok, gotcha.
- And you don’t want downstream systems to get fed with bad data. Of course not.
- And you want accuracy to be in the 80% to 90% range. Hmm.
So now you are left with a serious set of business needs and you don’t know how you’re going to get there.
With modern document and data capture software, you can certainly avail yourself of some pretty sophisticated, business rules-oriented tools.
Full page OCR? Not going to get you there.
What about using the coordinates of the data and then running OCR? Still not going to get you there. What about data that isn’t always in the same location?
Ok, so what about using keywords and patterns to determine the location of those fields? Ok, you’ve sort of solved the data location bit.
But here’s the problem: finding the data is only half the battle. Extracting it accurately is the part that really matters. And OCR alone isn’t going to get you there either, because it returns what it thinks is the correct answer every time – whether or not that answer is actually right.
What you need are tools for not only locating data but also tools for measuring the accuracy of the extracted data and for identifying erroneous data. Remember, it is one thing to send data for additional review or validation; it is quite another thing to send data to another system when it is just plain incorrect.
Here are some things to consider when taking on a data extraction project:
- Access to recognition confidence. Many recognition technologies have internal evaluation algorithms designed to “vote” for the best result. But these technologies do not always give you access to the rationale for the designated answer. Investigate how much the technology lets you understand not only the answer, but the reason behind the answer.
- Testing real “truth”. It’s one thing to have good access to recognition results and confidence, but it’s quite another to test this capability on your own data. To be really successful with your project and both maximize accuracy and minimize errors, you need what is called “truth data”. This is the set of data on sample forms that represents what the actual recognition answer should be. Once you have this data, you can test the recognition to see how it works across your own forms.
- Arriving at the right balance. Ultimately you will find that some data is located and extracted better than other data. Once you can see where the lines are drawn among data that is located and recognized correctly, data that is located but recognized incorrectly, and data that is not located at all, you can set thresholds that determine which data can go “straight through” with no additional validation and which data needs extra massaging, either through additional data checks or through human review.
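To make the last two points concrete, here is a minimal sketch of scoring extraction results against truth data and routing each field by confidence. The field names, the 0.0–1.0 confidence scale, and the 0.90 threshold are all assumptions for illustration, not the API of any particular capture product:

```python
# Hypothetical sketch: compare extracted fields against truth data,
# then route each field by recognition confidence.
# Confidence scale (0.0-1.0) and the 0.90 threshold are assumed.

def evaluate(extracted, truth, threshold=0.90):
    """extracted: {field: (value, confidence)}; truth: {field: correct_value}.
    Returns per-field routing plus overall accuracy."""
    report = {"straight_through": [], "needs_review": [], "errors": []}
    correct = 0
    for field, right_answer in truth.items():
        value, conf = extracted.get(field, ("", 0.0))
        if value == right_answer:
            correct += 1
        elif conf >= threshold:
            # High confidence but wrong: the dangerous case that feeds
            # bad data straight into downstream systems.
            report["errors"].append(field)
        if conf >= threshold:
            report["straight_through"].append(field)
        else:
            report["needs_review"].append(field)
    report["accuracy"] = correct / len(truth)
    return report

# Invented sample data with typical OCR misreads (O vs 0).
extracted = {
    "invoice_number": ("INV-1001", 0.97),
    "total":          ("1,00.00", 0.95),    # misread, yet confident
    "date":           ("2O17-03-01", 0.41), # low confidence
}
truth = {"invoice_number": "INV-1001", "total": "1,000.00", "date": "2017-03-01"}

report = evaluate(extracted, truth)
```

Note what the sketch surfaces: the “total” field would sail straight through despite being wrong, which is exactly why you need truth data to find those cases before they reach production.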
After reading this, you probably can see that achieving high rates of accuracy isn’t a slam dunk; you need to approach a project methodically and with discipline. You also need to pay attention to the error rates as much as the accuracy rates. Sending bad data to a system or presenting it to an end user creates numerous problems and additional costs.
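One way to see that trade-off between accuracy and error rates is to sweep the confidence threshold over a test set and watch both numbers move. The sample results and thresholds below are invented for illustration:

```python
# Hypothetical sketch: sweep a confidence threshold and measure the
# trade-off between straight-through rate and error rate.
# Sample (value, confidence, truth) tuples are invented.

samples = [
    ("ACME Corp", 0.99, "ACME Corp"),
    ("1,000.00",  0.95, "1,000.00"),
    ("2O17",      0.93, "2017"),     # confident but wrong
    ("PO-448",    0.70, "PO-448"),
    ("Net 3O",    0.35, "Net 30"),   # uncertain and wrong
]

def sweep(samples, thresholds):
    rows = []
    for t in thresholds:
        auto = [s for s in samples if s[1] >= t]        # goes straight through
        errors = [s for s in auto if s[0] != s[2]]      # bad data sent downstream
        rows.append({
            "threshold": t,
            "straight_through_rate": len(auto) / len(samples),
            "error_rate": len(errors) / len(auto) if auto else 0.0,
        })
    return rows

rows = sweep(samples, [0.50, 0.90])
```

Raising the threshold sends less data straight through, but it does not automatically lower the error rate among what remains; a confidently wrong result survives any threshold, which is why truth-data testing matters as much as the threshold itself.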
The good news is that successful data extraction is within your reach and the technology is more than capable.