

What Is NLP and What Does NLP Mean for Intelligent Capture?

Natural Language Processing (NLP) essentially deconstructs text, whether it’s a text message from a phone, a social media post or text within a document. NLP is most generally defined as the automated processing or manipulation of natural language (speech or text) by software so that information can be inferred from it.
What we’re focused on here is text within a document, whether that document is born-digital, such as a Word document or one converted into a PDF, or a document that’s scanned and subsequently run through OCR software. A number of different techniques fall under the NLP umbrella.

First, NLP within this context focuses on word and sentence segmentation. The software must be able to determine where a sentence starts and where it ends, as well as identify each individual word. Once sentences and words are defined, we can go to the next stage, which is called part-of-speech tagging, or POS tagging.
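To make the segmentation step concrete, here is a minimal Python sketch. It assumes a deliberately naive rule (a sentence ends at ".", "!" or "?" followed by whitespace and a capital letter); the function names are hypothetical, and real toolkits also handle abbreviations, decimals and quotations.

```python
import re

# Naive boundary rule: ., ! or ? followed by whitespace and a capital letter.
# This is an illustrative assumption, not how any particular toolkit works.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# A word is a run of letters/digits, optionally joined by ', ’ or -.
_WORD = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*")


def split_sentences(text):
    """Split text into sentences at the naive boundary rule."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


def split_words(sentence):
    """Return the individual words of one sentence, dropping punctuation."""
    return _WORD.findall(sentence)


def segment(text):
    """Return a list of sentences, each as a list of words."""
    return [split_words(s) for s in split_sentences(text)]


sample = "The invoice was scanned. It's due on 2024-01-31! Who approved it?"
for words in segment(sample):
    print(words)
```

A sentence such as "Dr. Smith signed." would be split incorrectly by this rule, which is one reason production systems rely on trained segmenters rather than a single pattern.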

NLP software labels, or tags, each word grammatically so that it identifies the nouns, adverbs and adjectives: the grammatical constructs needed to put words in context and understand them. After the words have been tagged to fit within a grammatical structure, the next stage is phrase chunking. Phrase chunking groups the tagged words into larger units such as noun phrases, and looks at a more holistic level at sentences and multiple sentences to understand how they relate to each other.
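The tagging and chunking stages can be sketched in a few lines of Python. This is a toy illustration only: the lexicon, the tag names and both functions are assumptions made for the example, and a real tagger learns its tags from annotated text rather than looking them up in a hand-written dictionary.

```python
# Toy lexicon (hypothetical): word -> part-of-speech tag.
LEXICON = {
    "the": "DET", "a": "DET",
    "signed": "ADJ", "new": "ADJ",
    "vendor": "NOUN", "contract": "NOUN", "data": "NOUN",
    "covers": "VERB", "quickly": "ADV",
}


def tag(words):
    """Attach a tag to each word; unknown words default to NOUN."""
    return [(w, LEXICON.get(w.lower(), "NOUN")) for w in words]


def chunk_noun_phrases(tagged):
    """Group consecutive DET/ADJ/NOUN words into noun phrases.

    A run is kept only if it contains at least one NOUN.
    """
    phrases, current = [], []
    # The sentinel entry forces the final run to be flushed.
    for word, pos in tagged + [("", "END")]:
        if pos in ("DET", "ADJ", "NOUN"):
            current.append((word, pos))
        else:
            if any(p == "NOUN" for _, p in current):
                phrases.append(" ".join(w for w, _ in current))
            current = []
    return phrases


words = ["The", "signed", "vendor", "contract", "covers", "the", "new", "data"]
print(chunk_noun_phrases(tag(words)))
```

Here the verb "covers" separates two noun phrases, which is the kind of structure later stages use to work out who or what a sentence is about.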

Most of the time, what you’re going to be looking at are toolkits that handle word identification, sentence segmentation and phrase chunking. Ultimately, all of this deconstructs text so that you can feed it into machine learning algorithms. These are neural networks, specifically deep learning neural networks, that are used to identify and extract key components of the data.

Since NLP is a set of techniques designed to break down text into a lexicon of sorts so that information can be inferred from it, there are two general ways to do this:

Rules-based Approach. The first is a rules-based approach where staff (typically a large team) identify the particular arguments required for a particular task. They will identify key verbs, nouns and adjectives, provide dictionaries, encode linguistic structures, etc., all to be able to parse data and appropriately organize it. This process obviously takes a lot of preparation, but it does typically result in more precise, if more constrained, results.

Machine Learning Approach. The second approach, where more research is being done, involves the use of statistically based machine learning to derive inferences and make sense of text. This approach requires a lot of sample data along with extensive tagging of key concepts and meaning in order for the machine learning algorithms to identify and organize text-based data. The benefit is that the bespoke nature of a rules-based approach can be eliminated, which allows for a much greater range of data to be covered. The drawback is that the results are less precise because the inferences are not based upon strict rules.
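The contrast between the two approaches can be sketched side by side. The example below is an illustration under stated assumptions: the task (deciding whether a sentence concerns data sharing), the word lists and the four training sentences are all invented, and multinomial Naive Bayes stands in for the statistical approach simply because it fits in a few lines.

```python
import math
from collections import Counter

# --- Rules-based: hand-written dictionaries (hypothetical word lists) ---
SHARING_VERBS = {"share", "shares", "disclose", "discloses", "transfer"}
DATA_NOUNS = {"data", "information", "records"}


def rule_based_is_data_sharing(sentence):
    """True only if a listed verb AND a listed noun both appear."""
    words = set(sentence.lower().replace(".", "").split())
    return bool(words & SHARING_VERBS) and bool(words & DATA_NOUNS)


# --- Machine learning: multinomial Naive Bayes with add-one smoothing ---
def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts, label_counts, vocab = {}, Counter(), set()
    for text, label in examples:
        toks = text.lower().split()
        word_counts.setdefault(label, Counter()).update(toks)
        label_counts[label] += 1
        vocab.update(toks)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented, tagged sample data; a real system needs far more.
TRAINING = [
    ("vendor may share customer data with partners", "sharing"),
    ("supplier will disclose personal information to affiliates", "sharing"),
    ("payment is due within thirty days", "other"),
    ("this agreement renews every year", "other"),
]
model = train(TRAINING)

unseen = "vendor may provide customer data to partners"
print(rule_based_is_data_sharing(unseen))  # no listed verb, so the rule misses it
print(predict(model, unseen))
```

The rule fails on "provide" because nobody listed that verb, while the statistical model still leans toward the sharing label from the surrounding words. That is the trade-off described above: rules are exact within their dictionaries, and learned models cover more ground with less certainty.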

NLP for RPA and Other Automation Processes

In most cases, NLP adds a new arrow to the quiver of automation (RPA or otherwise) by allowing more abstract information to be used within a business process. It is important to note that NLP does not replace other forms of document automation.

For instance, document classification (or document identification) and separation can make use of NLP, but only in cases where it is necessary to organize documents by meaning rather than by type. Automated, statistical classification often performs much better than a linguistic approach simply because document features can be more reliable and do not depend upon linguistic understanding. Software does not need to understand the meaning of an invoice, form or claim to identify what it is.

NLP is also not a substitute for data extraction. Here too, the data elements can be identified by features. If an automation task requires parsing and extraction of bank account statements to find the average daily balance or total expenditures made on credit, understanding the meaning is unnecessary. Machine learning can be applied to automate the identification of location, likely data labels, data type and proximity to other features to automate this process.
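Feature-based extraction of this kind can be sketched without any linguistic analysis. The Python example below assumes a simplified case: the value is a currency amount (the data-type feature) that follows a known label on the same line (the label and proximity features). The statement text, the pattern and the function name are all made up for illustration.

```python
import re

# Data-type feature: a currency amount such as 1,310.00 or $4,250.75.
AMOUNT = re.compile(r"\$?\d{1,3}(?:,\d{3})*\.\d{2}")


def extract_by_label(text, label):
    """Return the first amount that follows `label` on the same line, or None."""
    for line in text.splitlines():
        pos = line.lower().find(label.lower())
        if pos == -1:
            continue  # label feature not present on this line
        # Proximity feature: only look to the right of the label.
        match = AMOUNT.search(line, pos + len(label))
        if match:
            return float(match.group().replace("$", "").replace(",", ""))
    return None


# Invented statement fragment.
STATEMENT = """Account Summary
Average Daily Balance: $4,250.75
Total Purchases on Credit   1,310.00
"""

print(extract_by_label(STATEMENT, "average daily balance"))
print(extract_by_label(STATEMENT, "Total Purchases on Credit"))
```

Nothing here understands what a balance is; the value is found purely from where it sits and what it looks like. A learned system would infer labels and positions from samples instead of having them hard-coded.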

Where NLP Excels

Where NLP does excel is at locating and extracting concepts and summarizing more complex unstructured data. Take, for instance, a group of contracts where the task is to summarize the main terms for access to third-party data, something that is very relevant to supporting GDPR compliance. This is the domain of an emerging area known as contract analysis. A corpus of tagged contracts would be used either to construct rules or to train an NLP system to identify the relevant sections of a candidate contract and then extract or summarize them.
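As a rough sketch of the first option (constructing rules from a tagged corpus), the Python below derives key terms from sections tagged as relevant and then ranks the sections of a candidate contract by how many of those terms they contain. The corpus, the contract text and the function names are invented for the example, and simple word overlap is an assumption standing in for a real contract analysis model.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def learn_key_terms(tagged_sections, top_n=5):
    """Pick words found in relevant sections but never in the others."""
    relevant, other = Counter(), Counter()
    for text, is_relevant in tagged_sections:
        # Count each word once per section (document frequency).
        (relevant if is_relevant else other).update(set(tokens(text)))
    candidates = {w: c for w, c in relevant.items() if other[w] == 0}
    # Most frequent first; ties broken alphabetically for stable output.
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return {w for w, _ in ranked[:top_n]}


def most_relevant_section(sections, key_terms):
    """Return the section that shares the most key terms."""
    return max(sections, key=lambda s: len(key_terms & set(tokens(s))))


# Invented tagged corpus: (section text, tagged as relevant?).
TAGGED = [
    ("The processor may grant third party access to personal data.", True),
    ("Third party access to personal data requires written consent.", True),
    ("Fees are payable within thirty days of the invoice date.", False),
    ("This agreement is governed by the laws of the state.", False),
]

CONTRACT = [
    "1. Term. This agreement lasts two years.",
    "2. Data. Supplier shall not allow third party access to customer data without approval.",
    "3. Payment. Invoices are due in thirty days.",
]

terms = learn_key_terms(TAGGED)
print(most_relevant_section(CONTRACT, terms))
```

The candidate contract uses different wording from the tagged examples ("allow" rather than "grant", "customer" rather than "personal"), yet the shared terms are enough to locate the right section. Summarizing that section is a separate and harder step.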

The Net

The net: whereas machine learning document classification and data extraction use document-level and data-level features to learn how to automate, NLP-based processing uses linguistic features. An intelligent approach is to understand when to pragmatically apply each to solve a given problem. Using the two in combination will be essential to maximize the use of automation across all document-based processes.