April 20, 2018

Leveraging AI in Document Classification

Leveraging AI for document classification can still require many human steps, or very few. How much manual processing remains depends largely on how sophisticated the automated classification is.

In previous articles and eBooks, we discussed the different types of classification techniques and the benefits and drawbacks of each. Most of these techniques fall under Artificial Intelligence (AI), a broad category that includes rules-based expert systems (e.g., if a document contains the word “invoice,” classify it as an invoice) as well as more sophisticated machine learning-based techniques that remove the need to provide explicit rules.

Document Classification Challenges

Whatever the approach, rules and machine learning (ML) algorithms make mistakes, and each mistake means a particular document is misclassified. The most common root cause is “confusing” one document type for another. Take, for instance, the rules-based classifier mentioned above that uses the presence of the word “invoice” to determine whether an incoming document is an invoice. This simple rule can be fairly effective. However, what if a purchase order arrives that references “invoice # 12345”? The result would be an incorrect classification. With rules-based techniques, such misclassifications can be observed and corrected.
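The purchase order example can be sketched in a few lines of Python. This is only an illustration: the function names, the sample text, and the refined rule are assumptions made for this sketch, not rules taken from any product.

```python
def classify_by_keyword(text: str) -> str:
    """Naive rule: any document that mentions 'invoice' is an invoice."""
    return "invoice" if "invoice" in text.lower() else "other"


def classify_with_refined_rule(text: str) -> str:
    """Refined rule (hypothetical): look for 'purchase order' before 'invoice'."""
    lowered = text.lower()
    if "purchase order" in lowered:
        return "purchase_order"
    if "invoice" in lowered:
        return "invoice"
    return "other"


# A purchase order that happens to reference an invoice number.
purchase_order = "PURCHASE ORDER 7781\nPlease reference invoice # 12345 on delivery."

print(classify_by_keyword(purchase_order))         # invoice (incorrect)
print(classify_with_refined_rule(purchase_order))  # purchase_order
```

Note that the refined rule introduces its own risk: an invoice that mentions a purchase order would now be misclassified, which is why each correction needs to be checked against the other document types.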

When using a machine learning approach, the reasons for misclassification may not be as straightforward, since the “features” the classifier has learned for a particular document class are not always available for inspection. Even so, the problem is usually resolved by adding more examples of the misclassified document type so that more features can be learned.

As you might suspect, catching all of these misclassifications can be a challenge, and updating the system can be even more so. To refine the system, someone must evaluate all of the misclassifications and identify the most common errors. From there, additional rules must be added for rules-based classifiers, or more examples must be provided for training an ML classifier.
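Identifying the most common errors amounts to counting which document types get mistaken for which. A minimal sketch, assuming the classifier's output and the correct answers are available as two parallel lists (the function name and the sample data are made up for illustration):

```python
from collections import Counter


def most_common_confusions(predicted, actual, top=3):
    """Count (actual, predicted) pairs that disagree; return the most frequent."""
    errors = Counter(
        (truth, guess) for guess, truth in zip(predicted, actual) if guess != truth
    )
    return errors.most_common(top)


# Hypothetical results for six documents.
actual = ["invoice", "purchase_order", "purchase_order", "receipt", "invoice", "purchase_order"]
predicted = ["invoice", "invoice", "invoice", "invoice", "invoice", "purchase_order"]

for (truth, guess), count in most_common_confusions(predicted, actual):
    print(f"{truth} classified as {guess}: {count}")
```

Here the most frequent confusion is purchase orders classified as invoices, so that pair would be the first candidate for a new rule or additional training samples.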

The Ugly Hidden Problem

An ugly hidden problem with automated classification is that the more document classes you have in your process, the more effort it takes to refine performance, because changes to one document class can have adverse effects on all other classes. You cannot simply add a new document type to an existing classifier and retrain, because the new document type might change how the classifier treats existing document classes. The effort to refine performance or correct for what we call “classifier confusion” can grow exponentially with the addition of each new document class. In a mortgage classification problem, for example, that effort can be measured in hundreds of hours.

AI can be used not only in the classifier itself but also in the process of analysis and tuning. If we carefully examine the steps a person might take to analyze and improve the performance of a document classifier, we see that this process, too, can be automated. What is involved? Let’s look at it in the form of a procedure.

Document Classification with AI: Manual Process

Human Step: Train a classifier on N document types using samples. Samples of each document class are imported into the system and a training process is started.
Machine Step: During training, the classifier will take each group of samples used for a document class and examine them for similar characteristics or features. These features can be text-based or visual features. Algorithms are then created to identify incoming documents as being part of one class or another.
Human Step: Test the classifier by feeding it a stream of documents and generating class assignment results.
Human Step: Analyze the output and compare it against the “answer key” (technically called Ground Truth Data), which is the actual class assignment for each document.
Human Step: Note errors where a document assigned to “class a” actually belongs to “class b.”
Human Step: Analyze the examples of errors against the samples used to train the document class. Do they contain new features that need to be trained? Do the documents have a lot of similarities that require adding new samples to each class for training?
Human Step: Add new samples and retrain the classifier.
Human Step: Analyze the new results against what they should be (using ground truth). Does the performance improve for those “tuned” document classes? Is there degradation in other document classes that requires adjustment by adding samples?
Human Step: Repeat until performance is acceptable.
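
One pass through the procedure above can be sketched in Python. The classifier here is a deliberately simple word-count model written for this illustration; it stands in for whatever classifier is actually being tuned, and all names and sample documents are assumptions.

```python
from collections import Counter


def train(samples):
    """Build one word-count profile per class from {label: [texts]}."""
    return {
        label: Counter(word for text in texts for word in text.lower().split())
        for label, texts in samples.items()
    }


def classify(model, text):
    """Return the class whose profile gives the text's words the highest total count."""
    words = text.lower().split()
    return max(model, key=lambda label: sum(model[label][word] for word in words))


def per_class_accuracy(model, ground_truth):
    """ground_truth is a list of (text, true_label) pairs, i.e. the answer key."""
    hits, totals = Counter(), Counter()
    for text, truth in ground_truth:
        totals[truth] += 1
        hits[truth] += int(classify(model, text) == truth)
    return {label: hits[label] / totals[label] for label in totals}


samples = {
    "invoice": ["invoice number total amount due"],
    "purchase_order": ["purchase order ship to quantity"],
}
ground_truth = [
    ("invoice total amount due 400", "invoice"),
    ("invoice number 555 for order 7781", "invoice"),
    ("purchase order reference invoice number 12345 amount due", "purchase_order"),
]

# Train, test, and compare against the answer key.
before = per_class_accuracy(train(samples), ground_truth)

# Add a sample for the class that was misclassified, then retrain and retest.
samples["purchase_order"].append("purchase order reference invoice number")
after = per_class_accuracy(train(samples), ground_truth)

# Check whether tuning one class degraded another.
degraded = [label for label in before if after[label] < before[label]]
print(before)    # {'invoice': 1.0, 'purchase_order': 0.0}
print(after)     # {'invoice': 0.5, 'purchase_order': 1.0}
print(degraded)  # ['invoice']
```

In this toy run, the new purchase order sample fixes the purchase order error but causes one invoice to be misclassified, which is exactly the kind of side effect the later steps are meant to catch.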

Cascade Classifier Automates Human Steps

After review, we can see how a machine learning algorithm can be used to analyze and improve results in much the same way a human would, only much faster. The difference is that this AI helps with the analysis and tuning of another AI: the document classifier itself. The two together are mutually reinforcing.

The result is an optimized document classifier without the difficulties and costs associated with a manual process. We call this the Cascade Classifier, and it is now part of FormXtra 6.4.
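
This article does not describe how the Cascade Classifier works internally, so the following is not its implementation. It is only a minimal sketch of the general idea of automating the tuning loop, assuming a toy word-count classifier, a pool of labeled documents to learn from, and a separate labeled holdout set for checking results. Every name and sample document here is hypothetical.

```python
from collections import Counter


def train(samples):
    """Build one word-count profile per class from {label: [texts]}."""
    return {
        label: Counter(word for text in texts for word in text.lower().split())
        for label, texts in samples.items()
    }


def classify(model, text):
    """Return the class whose profile gives the text's words the highest total count."""
    words = text.lower().split()
    return max(model, key=lambda label: sum(model[label][word] for word in words))


def accuracy(model, labeled_docs):
    """Fraction of (text, true_label) pairs the model classifies correctly."""
    return sum(classify(model, text) == truth for text, truth in labeled_docs) / len(labeled_docs)


def auto_tune(samples, tuning_pool, holdout):
    """Add misclassified pool documents as samples, keeping only those that help."""
    samples = {label: list(texts) for label, texts in samples.items()}
    model = train(samples)
    best = accuracy(model, holdout)
    kept = []
    for text, truth in tuning_pool:
        if classify(model, text) == truth:
            continue  # already classified correctly, nothing to learn
        samples[truth].append(text)
        candidate = train(samples)
        score = accuracy(candidate, holdout)
        if score > best:
            model, best = candidate, score  # the new sample improved results
            kept.append(text)
        else:
            samples[truth].pop()  # no gain on the holdout set, so roll back
    return model, best, kept


samples = {
    "invoice": ["invoice number total amount due", "invoice number payment terms"],
    "purchase_order": ["purchase order ship to quantity"],
}
tuning_pool = [
    ("invoice number payment terms attached to order", "purchase_order"),
    ("purchase order reference invoice number amount", "purchase_order"),
]
holdout = [
    ("invoice total amount due 400", "invoice"),
    ("invoice number 555 payment terms net 30", "invoice"),
    ("purchase order reference invoice number 12345 amount due", "purchase_order"),
]

model, best, kept = auto_tune(samples, tuning_pool, holdout)
print(f"holdout accuracy: {best:.2f}")  # holdout accuracy: 1.00
print(kept)  # ['purchase order reference invoice number amount']
```

The first pool document is tried and rolled back because it does not improve the holdout results, while the second is kept. Requiring a strict improvement is a simplification chosen for this sketch; a real system would more likely weigh per-class results as well.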

For full details about Parascript FormXtra 6.4, download the Feature Guide: