March 17, 2016

Auto-Classification Revisited: Lower Error Rates

Document Classification Automation: Beyond Hype

Automated document classification is a popular topic among both technology vendors and the businesses that use it, and for good reason. The ability to classify documents by their content into various “buckets” not only makes business processes such as mortgage loan origination more efficient, but also enables more rigorous and comprehensive information governance and helps organizations leverage existing knowledge-based assets. These once-costly activities are now approachable with automated document classification using various techniques based upon statistics, linguistics, and image analysis.

Classification Process

Documents come into an organization and are evaluated on a number of characteristics, including the presence of specific text, the potential meaning of that text, the presence of visual characteristics, or other factors. Each document is then placed into a category that is pre-defined by the organization. From there, specific actions can be taken, such as assigning the document to a specific workflow, placing it under a particular access control policy, or simply extracting data and placing both the data and the document into a knowledge repository or content management system. All of this happens without human intervention. (We covered different techniques and concepts associated with document classification in the previous blog.)

What Isn’t Talked About

What isn’t talked about is the performance of these systems. Many organizations do not implement automated classification out of fear that it won’t work, or because they have tried it before and had poor results. Unfortunately, we have come across situations where an organization’s auto-classification system performs poorly because, while it may be designed for document classification, it isn’t designed to control error rates. The result is that error rates can be as high as 25 percent or more.

The answer to both the fear and the bad experiences is to insist on knowing, and understanding, the error rate of any document classification system.

Understanding Error Rates

While not every technology is based upon statistics, practically everything is measured using statistics. Document classification is no exception. No system is 100 percent accurate, so organizations must decide what percentage of error in their auto-classification is acceptable. Many base that target on the error rate of human beings, which is typically between 2 percent and 4 percent for data entry and classification tasks. The error rate represents the incorrect data that is not identified as “erroneous” and is therefore allowed to be used as if it were accurate. Measuring error rates, regardless of whether the measurement is for people or machines, is straightforward:

Gather samples.
Document the expected outcome.
Auto-classify and document the actual results.
Run comparisons.
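
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `error_rate` and the sample labels are hypothetical, and a real measurement would use a much larger sample set.

```python
def error_rate(expected, actual):
    """Fraction of samples whose actual class differs from the expected class."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    if not expected:
        raise ValueError("at least one sample is required")
    errors = sum(1 for e, a in zip(expected, actual) if e != a)
    return errors / len(expected)


# Hypothetical sample set (steps 1 and 2): documents with human-verified classes.
expected_classes = ["invoice", "appraisal", "invoice", "deed", "appraisal"]
# Hypothetical auto-classification output for the same documents (step 3).
actual_classes = ["invoice", "appraisal", "deed", "deed", "appraisal"]

# Step 4: run the comparison.
measured_rate = error_rate(expected_classes, actual_classes)
print(f"error rate: {measured_rate:.0%}")
```

The same function applies whether the "actual" column comes from software or from human operators, which is what makes a like-for-like comparison possible.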

The difference between expected outcomes and real results is the error. Many systems also provide what is called a confidence score. This score is usually an integer whose range varies from system to system, but it can be applied to a sample set to find a threshold that separates the results that should be verified from those that can be considered accurate, albeit with a known error rate.
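
As a rough sketch of how a confidence score might be used on a labeled sample set, the snippet below looks for the lowest threshold at which the results above the threshold meet a target error rate. The function name `pick_threshold`, the 0 to 100 score scale, and the sample data are all assumptions made for illustration, not a description of any particular product.

```python
def pick_threshold(results, target_error_rate):
    """Return the lowest confidence threshold whose accepted set meets the target.

    results: list of (confidence, is_correct) pairs from a labeled sample set.
    Returns None if no threshold meets the target.
    """
    candidates = sorted({conf for conf, _ in results})
    for threshold in candidates:
        # Results at or above the threshold would be treated as accurate.
        accepted = [ok for conf, ok in results if conf >= threshold]
        errors = accepted.count(False)
        if errors / len(accepted) <= target_error_rate:
            return threshold
    return None


# Hypothetical sample: (confidence score, whether the class matched the expected one).
sample_results = [
    (95, True), (90, True), (85, True), (80, False),
    (70, True), (60, False), (55, True), (40, False),
]

strict_threshold = pick_threshold(sample_results, 0.04)  # human-like target
loose_threshold = pick_threshold(sample_results, 0.25)   # more tolerant target
print(strict_threshold, loose_threshold)
```

A stricter target pushes the threshold up, which means fewer documents are accepted automatically and more are sent for verification.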

Achieving Lower Error Rates

Using the above, the result for document auto-classification will be the following:

A certain percentage of documents are classified and treated as accurate, but with a certain known rate of error; and
A certain percentage of documents are classified and treated as inaccurate and are therefore sent for verification, but also with a certain error rate.
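
To make the two outcomes concrete, here is a minimal sketch that splits a labeled sample set at a given confidence threshold and reports the share and error rate of each group. The names `summarize_buckets`, `accepted`, and `verification` are hypothetical, and the error rate reported for the verification group is assumed here to be the classifier's own error rate before any human review.

```python
def summarize_buckets(results, threshold):
    """Split (confidence, is_correct) pairs at a threshold and report each bucket."""
    accepted = [ok for conf, ok in results if conf >= threshold]
    verification = [ok for conf, ok in results if conf < threshold]

    def stats(bucket):
        # Share of all documents in this bucket, and the error rate within it.
        share = len(bucket) / len(results)
        rate = bucket.count(False) / len(bucket) if bucket else 0.0
        return {"share": share, "error_rate": rate}

    return {"accepted": stats(accepted), "verification": stats(verification)}


# Hypothetical sample: (confidence score, whether the class matched the expected one).
labeled_sample = [
    (95, True), (90, True), (85, True), (80, False),
    (70, True), (60, False), (55, True), (40, False),
]

report = summarize_buckets(labeled_sample, threshold=85)
for name, bucket_stats in report.items():
    print(f"{name}: {bucket_stats['share']:.1%} of documents, "
          f"{bucket_stats['error_rate']:.1%} error rate")
```

Tracking both numbers for both groups is what turns the threshold into a measurable objective: the accepted share drives cost savings, and the accepted error rate drives quality.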

Measuring real rates of error and using confidence scores give a document classification project a very clear set of objectives and very measurable results. While the behavior of document classification can be a “black box”, the results can be very transparent. Every document-based process can benefit from automated document classification through both significant cost savings and better control. The key is to rigorously identify and manage error rates.

Parascript offers these tuning services and much more as part of our Accelerator Program for select document-based BPOs. We also train our valued reseller partners to do this on behalf of organizations.

If this article interested you, you might also enjoy this blog: Classification Techniques: How to Select the Right Solution.