May 23, 2018

One Document Classifier to Rule Them All?

Is there one AI-powered document classifier to rule them all? Most everyone in technology knows about the Gartner Hype Cycle, and if any technology belongs at the “peak of inflated expectations,” it is artificial intelligence (AI) or one of its derivatives, such as “machine learning” or “deep learning.”

Don’t get me wrong, these new AI technologies absolutely show promise for certain applications. The pace of evolution should see AI expanding into more areas of our lives. However, the expectation that AI, inserted into any business application, will perform better than traditional methods is wrong-headed.

Let’s explore why, using document classification as a prime example.

AI Data Automation and Document Classification

Document classification is a hot area in document automation that allows for more straight-through processing (STP) of complex document-oriented processes such as mailroom automation, customer onboarding and loan processing. In these activities, a wide variety of documents are received.

The challenge in document classification is to discern one document from another and ensure that each makes it to the appropriate downstream process. While much has been done to improve this process, believe it or not, most organizations (even large ones) still use antiquated processes that involve manually applying bar codes or inserting separator pages between documents so that document automation software understands each document type.

A natural question many might have is “can AI-based processes do better?” The answer is a qualified “yes.” But if the question is “can deep learning do better than traditional AI?” the answer is much more nuanced. Some machine learning techniques work better than others, and deep learning neural networks, in the case of document classification, are not the cure-all.

AI Options for Document Classification

Here are several AI-based options for document classification:

Keyword or pattern-based matching. The most common way to apply AI to document classification is to use the presence of keywords or common data patterns, such as phone numbers or Social Security numbers, to identify the type of document. In this process, a business analyst reviews the document types, identifies these keywords or data elements, and then encodes rules that allow a system to sort documents. The benefit of this type of AI is that it is very straightforward to create and monitor. If a document is misclassified, an analyst can check whether the document contains the expected data; if not, the rules are updated. The drawback is that creation and maintenance of rules can be time-consuming and error-prone, especially if a large number of document classes are involved.
Statistical-based machine learning classifiers. The next step up is a set of different techniques that allow a system to be trained on document class examples. You simply input class examples along with the information about which class they belong to, and the system automatically identifies key similarities, often referred to as “features,” for each document class. From there, when the classifier encounters a document, it uses these automatically generated rules to assign the document to the appropriate class. The benefits are obvious: no need to spend time encoding and updating rules. The primary drawback is that it is much more of a “black box,” where it isn’t always easy to understand why the output is incorrect. In general, these types of classifiers need relatively few training samples to achieve good levels of performance (e.g., greater than 80% success rates).
Deep-learning classifiers. Technically any neural network is based on statistics, but the statistics are abstracted away. With deep-learning classifiers, the manner in which features are learned and documents are classified is quite different from “traditional” machine learning classifiers. As a result, they need many more training samples than traditional machine learning methods to achieve acceptable rates of performance. And even then, the levels of performance obtained after significant work do not provide a significant improvement. So while it may be forward thinking to apply deep learning to document classification, today you are likely to get more bang for the buck using traditional statistical-based machine learning methods.
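
To make the first option concrete, here is a minimal sketch of keyword and pattern-based matching in Python. The class names, keywords, regular expressions and the min_score threshold are all hypothetical, chosen only for illustration; a real rule set would come out of an analyst's review of actual documents.

```python
import re

# Hypothetical rules: each document class lists keywords and regex patterns
# that an analyst might associate with it. These are illustrative assumptions.
RULES = {
    "loan_application": {
        "keywords": ["loan amount", "borrower", "interest rate"],
        "patterns": [r"\b\d{3}-\d{2}-\d{4}\b"],  # SSN-like number
    },
    "invoice": {
        "keywords": ["invoice", "amount due", "bill to"],
        "patterns": [r"\binv[- ]?\d+\b"],  # invoice-number-like token
    },
}


def classify(text, rules=RULES, min_score=2):
    """Return the class with the most rule hits, or "unclassified"."""
    lowered = text.lower()
    scores = {}
    for doc_class, rule in rules.items():
        keyword_hits = sum(1 for kw in rule["keywords"] if kw in lowered)
        pattern_hits = sum(1 for p in rule["patterns"] if re.search(p, lowered))
        scores[doc_class] = keyword_hits + pattern_hits
    best = max(scores, key=scores.get)
    # Too few hits means no rule set really matched the document.
    return best if scores[best] >= min_score else "unclassified"


print(classify("Invoice INV-1042. Amount due: $300. Bill to: Acme"))
print(classify("Hello, please call me back"))
```

Every decision here can be traced to a specific keyword or pattern, which is the easy-to-monitor benefit described above. The flip side is also visible: each new document class means another hand-written entry in RULES.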

Selecting the Optimal Path

At the end of the day, when tackling a document classification problem, selecting the best path for your business comes down to your requirements. If you have just a few classes with fairly static rules, a rules-based approach may work best. If you have unknown variance within your document classes, a statistical-based approach may be better so that you don’t have to spend significant effort writing and maintaining rules. At this stage, deep learning classifiers are not necessarily the cure-all we want them to be, but time will tell.
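
For comparison with hand-written rules, here is a minimal sketch of a statistical classifier that learns from labeled examples. It uses multinomial naive Bayes, which is just one of many statistical techniques; the article does not name a specific algorithm, and this is not a description of any vendor's method. The class name, training texts and labels are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Tiny multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples):
        # samples: list of (text, label) pairs
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()
        for text, label in samples:
            tokens = tokenize(text)
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods per token.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training examples: text plus the class it belongs to.
TRAINING = [
    ("invoice number amount due payment terms bill to", "invoice"),
    ("please remit payment for invoice total amount due", "invoice"),
    ("loan application borrower income interest rate", "loan_application"),
    ("borrower requests mortgage loan with fixed interest rate", "loan_application"),
]

model = NaiveBayesClassifier().fit(TRAINING)
print(model.predict("payment due on this invoice"))
```

Nobody wrote a rule saying that “payment” suggests an invoice; the word counts learned from the examples play that role. That also hints at the black-box drawback: explaining a wrong answer means inspecting learned probabilities rather than reading a rule.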

###

Parascript Cascade Classifier Automates Human Steps

Parascript offers an optimized document classifier without the difficulties and costs associated with a manual process. We call this the Cascade Classifier, and it is part of FormXtra.
