August 9, 2018

Natural Language Processing: What is NLP to Document Automation?

Natural Language Processing (NLP) comes up more often as we engage with more complex document automation projects. This is especially true when unstructured document-based information is involved. The common belief is that NLP is the only way to automate the processing of unstructured data, because the language used must first be interpreted.

Is NLP really needed?

To determine whether NLP is necessary, it helps to first define exactly what NLP is. Then we can discuss practical applications within document automation. NLP is a set of techniques designed to break down text into a lexicon of sorts so that something can be inferred from it. There are two general ways to do this:

Rules-based Approach. The first is a rules-based approach in which staff (typically a large team) identify the particular arguments required for a particular task. They identify key verbs, nouns, and adjectives, provide dictionaries, encode linguistic structures, and so on, all in order to parse the data and organize it appropriately. This process obviously takes a lot of preparation, but it typically yields more precise, if more constrained, results.
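To make the rules-based approach concrete, here is a minimal sketch in Python. The lexicon entries, the single "party followed by obligation verb" rule, and the function names are all illustrative assumptions, not a description of any particular product. A real rule set would be far larger.

```python
import re

# Hypothetical hand-built lexicon; every entry is an illustrative assumption.
OBLIGATION_VERBS = {"shall", "must", "agrees to"}
PARTY_NOUNS = {"supplier", "customer", "licensee", "licensor"}


def tag_sentence(sentence):
    """Return (party, verb) if the sentence matches the rule, else None."""
    lowered = sentence.lower()
    for party in sorted(PARTY_NOUNS):
        for verb in sorted(OBLIGATION_VERBS):
            # Rule: a party noun directly followed by an obligation verb.
            pattern = rf"\b{re.escape(party)}\s+{re.escape(verb)}\b"
            if re.search(pattern, lowered):
                return party, verb
    return None


def find_obligations(text):
    """Split text into sentences and keep those that match the rule."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    results = []
    for sentence in sentences:
        match = tag_sentence(sentence)
        if match:
            results.append((match[0], match[1], sentence))
    return results
```

The sketch shows both sides of the trade-off: a sentence such as "The Supplier shall deliver the goods by June." is matched exactly, while "The Supplier may deliver early." is ignored because "may" was never added to the lexicon.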

Machine Learning Approach. The second approach, where more research is being done, uses statistical machine learning to derive inferences and make sense of text. This approach requires a lot of sample data, along with extensive tagging of key concepts and meaning, in order for the machine learning algorithms to identify and organize text-based data. The benefit is that the bespoke nature of a rules-based approach can be eliminated, which allows a much greater range of data to be covered. The drawback is that the results are less precise because the inferences are not based upon strict rules.
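As a rough illustration of the statistical approach, the sketch below trains a tiny naive Bayes text classifier on tagged samples. Naive Bayes is our choice for illustration only, and the four samples and their labels are made up. A real system would need far more tagged data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tagged samples; real training sets are much larger.
SAMPLES = [
    ("please cancel my policy", "cancellation"),
    ("i want to cancel the contract", "cancellation"),
    ("where is my refund payment", "payment"),
    ("payment was charged twice", "payment"),
]


def tokenize(text):
    return text.lower().split()


def train(samples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        label_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (label_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

No rule says that "cancel" signals a cancellation request. The model infers it from the tagged samples, which is why coverage grows with the data but precision depends on how representative that data is.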

So Just What Does NLP Mean to Document Automation?

In most cases, NLP adds a new arrow to the quiver of automation (RPA or otherwise) by allowing more abstract information to be used within a business process. It is important to note that NLP does not replace other forms of document automation.

For instance, document classification (or document identification) and separation can make use of NLP, but only in cases where it is necessary to organize documents by meaning rather than by type. Automated, statistical classification often performs much better than a linguistic approach simply because document features can be more reliable and do not depend upon linguistic understanding. Software does not need to understand the meaning of an invoice, form or claim to identify what it is.
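The point about document features can be sketched with a toy nearest-centroid classifier that never looks at what the words mean. The two features (share of lines containing a currency amount, share of lines containing a blank fill-in field), the training texts, and the nearest-centroid method are assumptions chosen for brevity. Production classifiers use many more features.

```python
import math
import re

AMOUNT = re.compile(r"\$\d[\d,]*\.\d{2}")   # e.g. $1,250.00
BLANK_FIELD = re.compile(r"_{3,}")          # e.g. Name: ______

# Hypothetical labelled training documents.
TRAINING = [
    ("Widget A  2  $10.00\nWidget B  1  $5.50\nTotal  $25.50", "invoice"),
    ("Acme Corp\nService fee $100.00\nTax $8.00\nAmount due $108.00", "invoice"),
    ("Name: ______\nDate of birth: ______\nSignature: ______", "form"),
    ("Application\nName: ____\nAddress: ____", "form"),
]


def features(text):
    """Describe a document by layout-style features, not by word meaning."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = max(len(lines), 1)
    return (
        sum(bool(AMOUNT.search(ln)) for ln in lines) / n,
        sum(bool(BLANK_FIELD.search(ln)) for ln in lines) / n,
    )


def train_centroids(labelled_docs):
    """Average the feature vectors of each document type."""
    grouped = {}
    for text, label in labelled_docs:
        grouped.setdefault(label, []).append(features(text))
    return {
        label: tuple(sum(col) / len(vecs) for col in zip(*vecs))
        for label, vecs in grouped.items()
    }


def classify(centroids, text):
    """Pick the document type whose centroid is closest to the text."""
    vec = features(text)
    return min(sorted(centroids), key=lambda lb: math.dist(vec, centroids[lb]))
```

A new document such as "Order 1182\nCable $12.99\nAdapter $7.00\nTotal $19.99" shares almost no vocabulary with the training invoices, yet it lands closest to the invoice centroid because its lines are dominated by currency amounts.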

NLP is also not a substitute for data extraction. Here too, the data elements can be identified by features. If an automation task requires parsing bank account statements to find the average daily balance or total expenditures made on credit, understanding the meaning is unnecessary. Machine learning can be applied to identify each element's location, likely data labels, data type and proximity to other features, and so automate this process.
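A simplified sketch of extraction by label, data type and proximity follows. Here the label variants are hard-coded for brevity, whereas the approach described above would learn them from samples. The statement layout and the function name are hypothetical.

```python
import re
from decimal import Decimal

# Data type feature: a currency amount with two decimal places.
AMOUNT = re.compile(r"\$?(\d[\d,]*\.\d{2})")


def extract_field(text, label_variants):
    """Return the first amount that follows a known label on the same line.

    label_variants are lowercase strings; returns None if nothing is found.
    """
    for line in text.splitlines():
        lowered = line.lower()
        for label in label_variants:
            pos = lowered.find(label)
            if pos == -1:
                continue
            # Proximity feature: only look to the right of the label.
            match = AMOUNT.search(line, pos + len(label))
            if match:
                return Decimal(match.group(1).replace(",", ""))
    return None


# Hypothetical statement text, as it might come out of OCR.
STATEMENT = """Account 12345678
Statement period 07/01/2018 - 07/31/2018
Average Daily Balance:   $4,210.37
Total purchases          $1,052.10"""
```

Calling `extract_field(STATEMENT, ["average daily balance", "avg daily balance"])` finds the balance purely from the label text, the data type and the position on the line. Nothing in the code interprets what a balance is.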

Where NLP Excels

Where NLP does excel is at locating and extracting concepts and summarizing more complex unstructured data. Take, for instance, a group of contracts where the task is to summarize the main terms for access to third-party data, something that is very relevant to supporting GDPR. This is the domain of an emerging area known as contract analysis. A corpus of tagged contracts would be used either to construct rules or to train an NLP system to identify the relevant sections of a candidate contract and then extract or summarize them.
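The "identify the relevant sections" step could look roughly like the sketch below, which takes the rules route. The term weights stand in for what would be derived from a tagged corpus. The weights, the threshold, and the assumption that sections start with a number and a period are all hypothetical.

```python
import re

# Hypothetical term weights; in practice these would come from tagged contracts.
TERM_WEIGHTS = {
    "third party": 3,
    "third-party": 3,
    "personal data": 2,
    "disclose": 1,
    "access": 1,
}


def split_sections(contract):
    """Split on lines that begin with a section number such as '2. '."""
    parts = re.split(r"^(?=\d+\.\s)", contract.strip(), flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]


def score(section):
    """Sum the weights of every key term occurrence in the section."""
    lowered = section.lower()
    return sum(weight * lowered.count(term) for term, weight in TERM_WEIGHTS.items())


def relevant_sections(contract, threshold=3):
    """Keep the sections whose score reaches the threshold."""
    return [s for s in split_sections(contract) if score(s) >= threshold]


# Hypothetical three-clause contract.
CONTRACT = """1. Term. This agreement lasts two years.
2. Data Access. The processor shall not disclose personal data to any third party without consent.
3. Fees. Fees are due monthly."""
```

On this sample, only the second clause clears the threshold. A trained NLP system would replace the fixed weights with learned ones and could go on to summarize the selected sections, which this sketch does not attempt.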

The Net

The net: whereas machine learning document classification and data extraction use document-level and data-level features to learn how to automate, NLP-based processing uses linguistic features. An intelligent approach is to understand when to pragmatically apply each to solve a given problem. Using the two in combination will be essential to maximize the use of automation across all document-based processes.