In every data capture automation project, one of the first criteria everyone focuses on is “what percentage of accuracy will the software provide?” Even at the earliest stages of a project, it is common for stakeholders to have very different takes on what that accuracy number should be. 90%? 100%? 80%? Depending on who the stakeholder is, the answer can vary widely.
The data quality manager tasked with correction and validation operations would happily take a 10% improvement in accuracy in exchange for a 50% reduction in manual review. To him or her, that trade means significant cost savings along with better productivity and quality.
To the executive sponsor counting on a slam-dunk project, nothing less than 90% accuracy is acceptable, regardless of what it takes to get there.
Yet in very few projects will you find any effort to understand what the current accuracy rates are or what it costs to achieve them. Even fewer projects have a good grasp of the cost of sending erroneous data downstream to other business processes and applications.
So what’s the answer? Truth. When testing your production system, it is not enough to have a large, representative sample set of documents. You also need a record of what should be extracted from each sample: the verified correct data that represents 100% accuracy.
In machine learning, this sample set paired with the verified correct values is known as “truth data” (or ground truth). The most common way to create it is to review each sample document and manually record the correct data.
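To make the idea concrete, here is a minimal sketch of what truth data might look like for a document capture project. The file names, field names, and values are purely illustrative assumptions, not from any real project:

```python
# Truth data: for each sample document, the manually verified values
# that the capture software should extract. Reviewing each document
# and recording these values by hand is what produces this structure.
truth_data = {
    "invoice_001.tif": {
        "invoice_number": "INV-4482",  # hypothetical field names/values
        "total": "1,250.00",
        "date": "2021-03-15",
    },
    "invoice_002.tif": {
        "invoice_number": "INV-4483",
        "total": "980.40",
        "date": "2021-03-16",
    },
}
```

Whatever format you use (a spreadsheet, a database, or files like the one above), the essential property is the same: one verified correct value for every field you expect the software to extract.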
At Parascript, creating this truth data is a fundamental task on every project we take on. It can take a team of staff days or even weeks to review each sample and enter the correct data, but having this information is crucial to tuning our software to meet specific accuracy requirements. Without it, we are working blind and have no way to provide any assurance to our customers.
In projects and even in production environments, truth data lets you objectively measure your system and understand exactly how well it performs. Continuously adding to your samples and truth data allows you to track that performance over time.
Even more important, you can start to measure your error rates. The cost of sending bad data into other processes may be small or it may be enormous, but you won’t know the truth until you have “the truth”.
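Once truth data exists, measuring accuracy and error rates is a straightforward comparison. The sketch below, a simplified assumption rather than how any particular capture product scores itself, compares extracted field values against truth data and reports field-level accuracy and error rate:

```python
def field_accuracy(extracted, truth):
    """Compare extracted values against truth data.

    Both arguments map document name -> {field name -> value}.
    Returns (accuracy, error_rate) over every field in the truth set;
    a field missing from the extracted output counts as an error.
    """
    total = 0
    correct = 0
    for doc, fields in truth.items():
        for field, true_value in fields.items():
            total += 1
            if extracted.get(doc, {}).get(field) == true_value:
                correct += 1
    accuracy = correct / total
    return accuracy, 1.0 - accuracy


# Illustrative run: four truth fields, one extraction error.
truth = {
    "doc1.tif": {"total": "100.00", "date": "2021-01-05"},
    "doc2.tif": {"total": "75.50", "date": "2021-01-06"},
}
extracted = {
    "doc1.tif": {"total": "100.00", "date": "2021-01-05"},
    "doc2.tif": {"total": "75.50", "date": "2021-01-08"},  # wrong date
}
accuracy, error_rate = field_accuracy(extracted, truth)
# accuracy is 0.75 and error_rate is 0.25 (3 of 4 fields correct)
```

Real-world scoring is usually more nuanced (per-field weights, normalization of dates and amounts, confidence thresholds), but even this simple comparison turns “how accurate are we?” from a guess into a measurement.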
Next time, we’ll cover how you can go about collecting samples and truth data, and use them to plan your project or tune your production system.