Document classification, the automated identification and sorting of documents, is one of the most popular applications of intelligent capture. Many processes involve a large number of different document types (sometimes several hundred) that can be submitted without any organization. These include claims adjudication, loan origination and commercial logistics.
Organizations that are not processing these documents manually (and most still are) typically use a rules-based process that attempts to identify incoming documents from specific, predefined attributes.
For instance, with mortgage documentation, a rules-based approach attempts to mimic the manual process. Instead of looking at the overall document, including its graphical layout, it might use specific keywords to distinguish a document establishing proof of income from a document providing information on assets. A person can easily tell a W-2 from a bank statement at a glance, but the rules-based approach relies on the presence (or absence) of specific words or other textual data.
So rules-based automation might look for instances of “W-2” or “Total Income” for the W-2 document. It also may identify the presence of words like “account balance” along with “account number” and “statement” to establish that a document is a bank statement.
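The keyword rules described above can be sketched in a few lines of Python. This is only an illustration: the rule table, the type names and the function name are assumptions made for the example, not the configuration of any particular capture product.

```python
from typing import Optional

# Hypothetical rule table built from the keywords mentioned above.
# "any": at least one keyword must appear; "all": every keyword must appear.
RULES = {
    "w2": {"any": ["w-2", "total income"], "all": []},
    "bank_statement": {
        "any": [],
        "all": ["account balance", "account number", "statement"],
    },
}


def classify_by_rules(text: str) -> Optional[str]:
    """Return the first document type whose keyword rule matches, else None."""
    lowered = text.lower()
    for doc_type, rule in RULES.items():
        any_ok = not rule["any"] or any(k in lowered for k in rule["any"])
        all_ok = all(k in lowered for k in rule["all"])
        if any_ok and all_ok:
            return doc_type
    return None


print(classify_by_rules("Form W-2 Wage and Tax Statement"))
print(classify_by_rules("Statement: Account Number 123, Account Balance $5"))
print(classify_by_rules("Residential lease agreement"))
```

Note that the sketch returns the first rule that matches, so the order of the rule table silently decides the outcome whenever two rules fire on the same document.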
Where Traditional Classification Falls Short
As you might suspect, the power of rules-based classification is directly tied to the amount of time a Subject Matter Expert (SME) spends reviewing the available data, identifying the key characteristics of each document type, and then encoding the rules. For some needs, where there are only a few document types, a rules-based approach might make sense because it is typically simpler to implement. Where there are many document types, say 30 or more, and where the characteristics of each might overlap, a rules-based approach will fall short.
When 50 or more document types are involved, and where there can be different versions of any particular document type, it is very probable that the rules identified for one type will overlap the rules for another. Analyzing each type and version, verifying that there is no overlap, and then testing and tuning each rule is not practical, mostly because of the time required but also because of the ongoing maintenance.
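To make the overlap problem concrete, here is a minimal sketch that reports every rule that fires on a document and counts how many pairs of types would have to be checked against each other. The rules reuse the W-2 and bank statement keywords from earlier; the function names and the sample text are made up for the example.

```python
# Hypothetical keyword rules, one predicate per document type.
RULES = {
    "w2": lambda t: "w-2" in t or "total income" in t,
    "bank_statement": lambda t: all(
        k in t for k in ("account balance", "account number", "statement")
    ),
}


def matching_types(text):
    """Return every document type whose rule fires, not just the first."""
    lowered = text.lower()
    return [doc_type for doc_type, rule in RULES.items() if rule(lowered)]


def pairs_to_review(type_count):
    """Number of distinct type pairs that could overlap: n * (n - 1) / 2."""
    return type_count * (type_count - 1) // 2


# A made-up summary page that happens to mention both sets of keywords.
summary = "Statement of total income. Account number 42, account balance $10."
print(matching_types(summary))   # both rules fire on the same page
print(pairs_to_review(50))       # 1225 pairs for 50 document types
```

With 50 types there are 1,225 pairs to compare, before counting multiple versions of each type, which is why the manual analysis does not scale.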
The Power of Machine Learning
One of the strongest benefits of machine learning-based solutions, or cognitive systems as the industry increasingly calls them, is the ability to analyze a very large set of sample data. The system identifies and records key attributes (often called “features”) of each document type and compares them against the attributes of other document types to arrive at the most dependable set of features on which to base automation.
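One simple way to picture this comparison of features across document types is to score each term by how much more often it appears in one type than in any other. The sketch below does that with document frequencies. It is an illustrative stand-in, not the method used by any specific cognitive system, and the function name and sample texts are invented for the example.

```python
from collections import Counter


def distinctive_terms(samples, top_n=2):
    """samples maps a label to a list of texts.

    Returns, per label, the terms whose document frequency in that label
    most exceeds their highest document frequency in any other label.
    """
    doc_freq = {}
    for label, texts in samples.items():
        counts = Counter()
        for text in texts:
            counts.update(set(text.lower().split()))  # count each term once per document
        doc_freq[label] = {term: c / len(texts) for term, c in counts.items()}

    result = {}
    for label, freqs in doc_freq.items():
        scores = {}
        for term, freq in freqs.items():
            other = max(
                (doc_freq[o].get(term, 0.0) for o in doc_freq if o != label),
                default=0.0,
            )
            scores[term] = freq - other
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[label] = ranked[:top_n]
    return result


# Toy, made-up samples for illustration only.
samples = {
    "w2": ["wage tax statement wages", "wage tax statement employer"],
    "bank_statement": ["account statement balance", "account statement deposits"],
}
print(distinctive_terms(samples))
```

In this toy data the word "statement" appears in every sample of both types, so it scores zero and is not selected for either, even though a hand-written rule might have relied on it.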
Machine learning systems can detect subtle variations that might go unnoticed by SMEs. In addition, a cognitive system can record a larger number of these key features, along with how frequently each occurs, so that it can use the most reliable inferences to produce high-quality results. This ability reduces the cost, complexity and risk associated with manual analysis and configuration of rules, including upkeep.
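As a rough sketch of how recorded feature frequencies can drive classification, the example below trains a tiny multinomial Naive Bayes classifier using only the Python standard library. Naive Bayes is chosen here purely because it is short and easy to follow; the article does not say which algorithm a given product uses, and the training texts are toy data invented for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(samples):
    """samples: list of (text, label) pairs. Returns the learned counts."""
    word_counts = defaultdict(Counter)  # label -> term frequencies
    label_counts = Counter()            # label -> number of training documents
    vocab = set()
    for text, label in samples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero out a label.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data for illustration only.
SAMPLES = [
    ("wage and tax statement employer wages federal income tax withheld", "w2"),
    ("w-2 employee social security wages tips compensation", "w2"),
    ("account number account balance deposits withdrawals statement period", "bank_statement"),
    ("checking account statement beginning balance ending balance", "bank_statement"),
]

model = train(SAMPLES)
print(predict(model, "ending balance and deposits for the account"))
print(predict(model, "federal income tax withheld by employer"))
```

Nobody writes a keyword rule here: adding a new document type means adding labeled samples and retraining, which is the compute-time exercise rather than the SME-time one.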
Cognitive classification turns potentially several hundred hours of effort into a compute-time exercise. Better, more reliable performance is achieved at a much lower level of effort. Kind of sounds like having your cake…