In a previous article, Part 1, I focused on whether it is possible to enjoy the strengths of machine learning without the requirement to train it.
Is There Machine Learning That Does NOT Require Training?
The underlying rationale for this question is simple: while machine learning can greatly reduce traditional Intelligent Document Processing (IDP) configuration efforts, we exchange the effort of manual configuration for the effort of compiling and tagging reliable training data.
Is there a net benefit to shifting the emphasis to training data? You bet. Even with manual configuration, the best practice is to perform and test that work on the same type of data, and compiling data is a much simpler, less expensive task (easily one-tenth of the cost). The reality is that few organizations go to great lengths to configure IDP systems properly, starting with adequate data collection. But even at a significant cost reduction compared to manual configuration, collecting training data is still a cost.
In Part 1, I explained why training is needed and the traditional tradeoffs between software that comes pre-trained and systems that require training during implementation. The reality is that no system to date works without training of any kind, but there are efforts to reduce the amount of training required to achieve reliable results.
Working Toward the Best of Both Worlds
If pre-trained systems yield optimized performance but are not adaptable to specific customer requirements, and field-trained systems provide loads of adaptability but require effort to get there, the natural path forward is to pursue both: high out-of-the-box performance without the limited adaptability.
One way to progress in that direction is to collect sample data and train systems as part of product development, but with an eye towards allowing customers to take these machine learning results and “fit” them to their own needs, altering existing trained tasks or creating entirely new ones.
How It’s Done: Health Claims
For instance, with health claims, Parascript provides a highly-optimized CMS1500 classification and extraction model that can be used immediately. Because it is trained using the same systems that customers can use, those customers can take this model and build on top of it — customizing it to their own needs.
Need to add support for a new data field? Add it. Need to modify an existing pre-trained function? You can do that, too. The clear, immediate benefit is high optimization from day one, combined with the ability to do some custom tailoring. The tradeoff is that modifying those models still takes time and data to refine.
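The "build on top of a pretrained base" idea can be sketched in a few lines. This is a hypothetical illustration, not Parascript's actual API: the class name, field names, and stub extractors below are all invented for the example, with simple dictionary lookups standing in for learned extraction models.

```python
# Hypothetical sketch of extending a pretrained extraction model.
# PretrainedClaimModel and its field names are illustrative only.

class PretrainedClaimModel:
    """Ships with extractors for standard CMS1500 fields (stubbed here)."""

    def __init__(self):
        # Vendor-trained extractors, keyed by field name. Real extractors
        # would be learned models; dict lookups stand in for them.
        self.extractors = {
            "patient_name": lambda doc: doc.get("patient_name"),
            "total_charge": lambda doc: doc.get("total_charge"),
        }

    def add_field(self, name, extractor):
        """Customer-level customization: register a new field without
        retraining the vendor-supplied ones."""
        self.extractors[name] = extractor

    def extract(self, doc):
        return {name: fn(doc) for name, fn in self.extractors.items()}


model = PretrainedClaimModel()
# Add support for a new data field on top of the pretrained base.
model.add_field("referral_number", lambda doc: doc.get("referral_number"))

claim = {
    "patient_name": "J. Doe",
    "total_charge": "$250.00",
    "referral_number": "RX-1001",
}
print(model.extract(claim))
# {'patient_name': 'J. Doe', 'total_charge': '$250.00', 'referral_number': 'RX-1001'}
```

The design point is the separation of concerns: the vendor ships the trained base, and the customer registers additions against it rather than rebuilding from scratch.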
The Low-Shot Learning Trend
Another interesting trend in reducing training effort lies in an emerging area of research called “low-shot learning” (or, alternatively, “few-shot learning”). The idea is very similar to the product-development pretraining described above, but it attempts to “relax” the preciseness of the use cases.
For instance, if I’m trying to locate and extract data from an Explanation of Benefits (EOB) for a health remittance automation project, one way to reduce the amount of training required by an insurance company is to pretrain the system on elements common within EOB documents, such as tag-value pairs (e.g., “Allowed Amount: $125.00”), service data, amounts, the concept of line-item tabular data and so on. Using these already-trained “primitives” of transactional documents can reduce the amount of sample data required to train on specific use cases.
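To make the tag-value "primitive" concrete, here is a minimal sketch. In a real system this primitive would be learned from many transactional documents; the regular expression below merely stands in for that learned capability, and the function name and sample text are invented for illustration.

```python
import re

# Illustrative stand-in for a learned tag-value primitive: locate pairs
# such as "Allowed Amount: $125.00" in OCR'd EOB text.
TAG_VALUE = re.compile(
    r"(?P<tag>[A-Z][A-Za-z ]+?)\s*:\s*"    # label, e.g. "Allowed Amount"
    r"(?P<value>\$?\d[\d,]*(?:\.\d{2})?)"  # monetary or numeric value
)

def extract_tag_values(text: str) -> dict:
    """Return {tag: value} for every tag-value pair found in the text."""
    return {
        m.group("tag").strip(): m.group("value")
        for m in TAG_VALUE.finditer(text)
    }

sample = "Allowed Amount: $125.00  Patient Responsibility: $25.00"
print(extract_tag_values(sample))
# {'Allowed Amount': '$125.00', 'Patient Responsibility': '$25.00'}
```

Because a primitive like this generalizes across document types, the customer-specific training can focus on which tags matter for their use case rather than on teaching the system what a tag-value pair is.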
The upshot is that organizations spend less time teaching the system about documents and more time fine-tuning it to specific requirements. The Parascript Smart Learning system is an example of this method: it is trained on a variety of document-level aspects that give it general knowledge of typical document attributes, leaving only the training on specific document-level tasks.
Both of these methods still involve some level of training; we are just refining the handoff between product-level pretraining and customer-level refinement training. Ultimately, the goal is to make these systems easy to use without sacrificing the high level of performance expected.