Platforms that apply machine learning (ML) to document automation are increasingly being compared with generic ML platforms, largely because ML software libraries now abound. As a result, new clients come to us more frequently with questions like, “Can I use Google’s TensorFlow to create advanced capture capabilities such as document classification or data extraction?” This type of question is not asked only about Google. Other large vendors such as Microsoft and Amazon also offer machine learning platforms that allow organizations to import large volumes of data to train these systems. Increasingly, new entrants to the advanced capture space claim to use machine learning to automate data entry tasks.
My answer typically goes something like this: “Yes, you can leverage and train these systems to classify documents or to locate and extract data using OCR, but there are trade-offs, and some of them are significant.”
Considering Cloud Machine Learning Solutions
Let’s take the cloud machine learning offerings. The concept behind a machine learning system is relatively straightforward: you supply input data and let the system learn how to generate the required output. In practice, it is very different. General machine learning does work well for a wide range of tasks. However, the best approach, and the one most widely used, is a combination of human-supplied domain expertise and automated machine learning. This means that to achieve decent results, the platform should have, as background knowledge, an understanding of key features of your data.
With data extraction, a system should be prepared to understand the concepts of tabular data, text blocks, page dimensions, document boundaries, relative positioning of data and so on. Document classification poses a similar challenge. The benefit of adding machine learning to these human-supplied “hints” is that, once armed with the basic nature of a target data set, the system can devour large quantities of data in order to build a robust model to, say, locate and extract invoice line items. Without these hints, performance suffers.
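To make the idea of human-supplied “hints” a little more concrete, here is a minimal sketch of how page dimensions and relative positioning might be turned into inputs for a learning system. The `Token` class, the `layout_features` function and the specific features are hypothetical and chosen only for illustration; they are not taken from any particular capture product.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """A piece of OCR text with its bounding box (hypothetical structure)."""
    text: str
    x: float       # left edge, in page units
    y: float       # top edge, in page units
    width: float
    height: float


def layout_features(token, page_width, page_height):
    """Convert raw coordinates into page-relative, layout-aware features."""
    digits = sum(ch.isdigit() for ch in token.text)
    rel_x = token.x / page_width
    return {
        "rel_x": rel_x,                          # horizontal position, 0 to 1
        "rel_y": token.y / page_height,          # vertical position, 0 to 1
        "rel_width": token.width / page_width,   # width as a share of the page
        "digit_ratio": digits / len(token.text) if token.text else 0.0,
        "in_right_half": rel_x > 0.5,            # amounts often sit on the right
    }


# Illustrative values only: a token near the bottom right of a letter-size page.
total = Token("1,250.00", x=450.0, y=700.0, width=60.0, height=12.0)
features = layout_features(total, page_width=612.0, page_height=792.0)
print(features)
```

Because the features are expressed relative to the page rather than in absolute coordinates, the same description applies to scans of different sizes and resolutions, which is the kind of background knowledge a generic platform would not supply on its own.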
Challenges Faced by Generic Machine Learning Platforms
Another problem with these generic platforms is that any model created from a data set is very specific to that data. Unless the model is given the underlying concepts of the problem, it cannot easily be transferred to another, similar problem. So a machine learning platform trained to classify mortgage documents cannot easily be leveraged to classify other document sets. In some ways, it is similar to custom-coded software. Custom solutions are often highly focused on a given task and therefore very limited. Solution providers offering to apply deep learning to your document automation tasks often take a project-oriented approach in which the deliverable is tantamount to a custom-developed piece of software, and it comes with a limited set of capabilities unless you are willing to issue (and pay for) change orders.
Quality Training Data
Lastly, and probably most important, is the issue of quality training data. Machine learning platforms are data hungry and deep learning versions are especially so. But it isn’t only about data quantity. More crucial is that the training data is representative of the overall data that will be processed. If data analysis is not applied to evaluate the training set, then bias and overfitting will adversely affect outputs. Sometimes the effects are unforeseen with serious consequences. General-purpose machine learning platforms do not typically offer a means to test the quality of training data against specific problem domains. This is a severe limitation that only those well-versed in machine learning algorithms can overcome.
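One simple way to check whether a training set is representative is to compare its mix of document types with a sample of the documents you actually expect to process. The sketch below uses total variation distance for that comparison; the function names and the example counts are made up for illustration, and a real evaluation would look at far more than the label mix.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the total as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(train_labels, production_labels):
    """0.0 means an identical label mix; 1.0 means no overlap at all."""
    train = label_shares(train_labels)
    prod = label_shares(production_labels)
    all_labels = set(train) | set(prod)
    return 0.5 * sum(
        abs(train.get(label, 0.0) - prod.get(label, 0.0))
        for label in all_labels
    )


# Illustrative counts only: the training set is dominated by invoices and
# contains no purchase orders, while the production sample is more varied.
train = ["invoice"] * 90 + ["receipt"] * 10
production = ["invoice"] * 50 + ["receipt"] * 30 + ["purchase_order"] * 20

drift = total_variation_distance(train, production)
print(f"label drift: {drift:.2f}")
```

With these made-up numbers the distance comes out at about 0.4, a sign that the training set over-represents invoices and misses a document type entirely. What threshold should trigger concern is a judgment call that depends on the problem domain.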
Using machine learning to automate complex tasks is a very exciting area of computer science, but you only have to read about the missteps of IBM Watson and other high-profile machine learning offerings to understand that we are a long way from being able to take off-the-shelf ML platforms and use them as we do specially tuned machine learning platforms. Beware of those who offer a machine learning silver bullet. It doesn’t exist.