December 19, 2019

Key Factors for Assessing STP in Document Automation

Understanding the key factors to assess Straight Through Processing (STP) in document-oriented activities is critical to identifying what most affects a given document automation project’s percentage of STP.

In Part 4 of our STP series, we discussed at a general level how much STP you can achieve with document automation. We also covered the difference between measuring automation for document-based tasks and measuring it for simpler, less-variant tasks.

Automation Without Manual Intervention

Here in Part 5, we delve further into the important aspects of document automation, particularly the documents themselves, that determine the upper limit of automation. Essentially, the benefit of adding document automation to your processes boils down to one question: how much work can I automate that requires no manual intervention?

The coy (and accurate) answer is always, “it depends.” This is typically due to the nature of the task and the attributes of the documents. There are some general rules of thumb—built over years of experience with document automation projects—that you can use. These rules are based upon two common automation tasks: document classification and data extraction.
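For reference, the STP percentage used throughout this series can be expressed as the share of documents that pass through with no manual intervention. The sketch below is only an illustration; the function and parameter names are hypothetical and not taken from any product.

```python
def stp_rate(touchless_documents: int, total_documents: int) -> float:
    """Share of documents processed with no manual intervention (0.0 to 1.0)."""
    if total_documents <= 0:
        raise ValueError("total_documents must be positive")
    if not 0 <= touchless_documents <= total_documents:
        raise ValueError("touchless_documents must be between 0 and total_documents")
    return touchless_documents / total_documents


# Hypothetical batch: 700 of 1,000 documents needed no human touch.
print(f"STP: {stp_rate(700, 1000):.0%}")
```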

Identifying the Possibilities of Document Classification

Most enterprises have the need to organize documents for some reason or another. Sometimes, it is to support the ability to locate information when needed. Other times, document classification is involved within a given process to make it faster and more controllable.

A classic “nightmare” scenario might be the need to manually sift through and categorize a room full of bankers boxes of documents to support pre-trial discovery. However, regardless of the need, the ultimate aim is to remove the necessity for documents to be organized manually because of the associated issues of cost, time and accuracy.

Within document classification, a number of attributes affect any system’s performance. These include the:

Range of documents within the scope;
Degree of variance within the document types; and
Information that can be used to classify documents.

Range of Documents

An organization that only needs to sort through and organize three types of documents will realize a significant difference in performance from an organization that needs to address several hundred different document types. This is because with the introduction of each new document type to a classification project, you introduce the possibility of the classifier mistakenly assigning a document to the wrong document class.

Simply put, the more potential classes involved, the greater the potential for confusion, and ultimately, error. Exactly how much confusion and error is caused by each new document class is hard to calculate, but we do know from experience that the error rate is not linear.

For example, in a mortgage document classification project, we might find that a bank deals with 200 different document types. For each classification task, the classifier evaluates a given document to determine whether it belongs to one of the 200 identified document classes. This is considerably more complex than assigning a document to one of two or three classes because each new document class adds potential for overlap between its characteristics and those of another class or, worse, several other classes.
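One way to get a feel for why the growth in confusion is not linear is to count how many pairs of classes could, in principle, overlap. This is only a proxy chosen for illustration (an assumption, not a measured error model), but it shows how quickly the possibilities multiply between a three-class project and a 200-class one.

```python
from math import comb


def class_pairs(num_classes: int) -> int:
    """Number of distinct pairs of document classes that could be confused."""
    return comb(num_classes, 2)


for n in (3, 200):
    print(f"{n} classes -> {class_pairs(n)} possible pairwise overlaps")
# 3 classes give 3 pairs; 200 classes give 19900 pairs.
```

Most of those pairs will never be confused in practice, since a deed rarely looks like a pay stub, but every added class increases the number of places where overlap can occur.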

Degree of Variance

While there are always differences between two different document types, there are also potential differences between two documents of the same type. For instance, the document class of “Credit Report” can be considered a single type. However, within that type, there are as many variations in terms of data and layout as there are organizations providing credit reports. That is, there is no single format where we can always anticipate the same data.

As a result, different key attributes might indicate a credit report from Experian versus one from TransUnion. Just like the potential for error when we’re dealing with multiple document types, the degree of variance within a document type introduces the possibility of error.

Available Information

Some documents are easy to classify just by looking at them. For instance, receipts have a typical shape and data. Invoices typically include tables somewhere in the middle of the page. Other documents, such as text-heavy agreements, require more analysis to determine the correct document class.

As a general rule, the more attribute-based information that is distinct to a particular document class, the better. When document classes combine many different and distinct attributes, we can realize fairly reliable results.

Examples of Document Types

For instance, an invoice can be distinct based upon the layout (table in the middle, numeric data on the bottom right and address block on the top half), text (presence of the word invoice), and non-text data such as a logo.

For an agreement, we rely much more on text that might be shared with other document types, so the ability to correctly assign the document class is hampered. Generally, classifiers of all types do better when a document class has a distinct set of attributes, and the more distinct attributes, the better.

Most document classification projects, even complex ones such as mortgage classification, can get 70% or more STP with enough sample data and enough time for analysis, configuration and refinement. Generally speaking, your classification results will drop by a fraction of a percentage point with each new document type, but the relationship is not linear. A project with only a few document types can achieve 90% or more, while one with 500 to 800 types may land somewhere around 70% STP.
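To turn these rules of thumb into a ballpark number, you could interpolate between the two reference points mentioned above. The sketch below is purely illustrative: the anchor values (3 types for "a few", 650 as the midpoint of 500 to 800) and the choice of a log-scale curve are assumptions made for this example, not measured results, and the function name is hypothetical.

```python
import math

# Anchor points loosely based on the rules of thumb in the text.
# ASSUMPTIONS: "a few" is taken as 3 types, and 650 is the midpoint of 500-800.
FEW_TYPES, FEW_STP = 3, 0.90
MANY_TYPES, MANY_STP = 650, 0.70


def rough_classification_stp(num_types: int) -> float:
    """Ballpark STP rate, interpolated on a log scale of document type count."""
    if num_types < 1:
        raise ValueError("num_types must be at least 1")
    # Clamp to the anchored range so we never extrapolate beyond the guidance.
    clamped = min(max(num_types, FEW_TYPES), MANY_TYPES)
    position = (math.log(clamped) - math.log(FEW_TYPES)) / (
        math.log(MANY_TYPES) - math.log(FEW_TYPES)
    )
    return FEW_STP + position * (MANY_STP - FEW_STP)


for n in (3, 50, 200, 650):
    print(f"{n:>4} document types -> roughly {rough_classification_stp(n):.0%} STP")
```

Treat the output as a conversation starter for scoping, not a prediction; actual results depend on variance within each type and the information available to distinguish classes.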

Evaluating Possibilities with Data Extraction

When it comes to data extraction, it all comes down to the variance of the data from two major standpoints: data type and data location. Data types can mean differences within the date format such as U.S. vs. common European formats. Or, it may mean variance between typed and handwritten data. Data location typically speaks to whether the document in question is structured (such as a form), semi-structured (such as an invoice or remittance), or unstructured (such as an agreement or contract).
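The date format case is easy to demonstrate. In the sketch below, the same extracted string yields two different dates depending on whether it is read in the U.S. (month first) or a common European (day first) convention. The sample value and function name are made up for illustration.

```python
from datetime import date, datetime
from typing import Tuple


def parse_both_ways(raw: str) -> Tuple[date, date]:
    """Parse a slash-separated date as month-first (U.S.) and as day-first (European)."""
    us = datetime.strptime(raw, "%m/%d/%Y").date()
    eu = datetime.strptime(raw, "%d/%m/%Y").date()
    return us, eu


us_date, eu_date = parse_both_ways("03/04/2019")
print("U.S. reading:    ", us_date.isoformat())  # 2019-03-04
print("European reading:", eu_date.isoformat())  # 2019-04-03
```

Without knowing which convention a given document follows, an extraction system cannot pick the right reading on its own, which is one reason data type variance lowers STP.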

Even structured forms can have variance in data location due to differences in how the document was scanned. If it started as paper, a host of image quality problems can present themselves. Or, the way that the information was manually entered can have an impact. We have all seen examples of forms where the person wrote data well outside of the box.

Just as with document classification, the ability to realize high STP rates depends heavily on variance: higher-variance documents provide less reliable results than their low-variance counterparts. As a rule, structured forms can start with 80% or more STP, rising to 95% or more, while unstructured documents might only start with 40%-50% STP.

STP Performance and What to Examine

There are no exact answers for how much STP you can achieve without doing a good amount of work. And yet, it is definitely possible to put boundaries around potential STP performance by examining, at a high level, the number of document types within a project and the degree of variance within each document type. After that, you can start to put together a framework listing out the documents by type, the estimated number of variants, the document structure and the data types. From there, you can provide rough estimates based upon the guidelines provided above to get a good sense of what is practical in terms of overall project STP objectives.
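Such a framework can start as a very simple inventory. The sketch below is one possible shape, not a prescribed tool: the class, field names, sample entries and variant counts are all hypothetical. The starting STP figures are the lower ends of the ranges quoted above for structured and unstructured documents; no figure was given for semi-structured documents, so none is invented here.

```python
from dataclasses import dataclass
from typing import List, Optional

# Lower-end starting STP for data extraction, from the rules of thumb above.
# No guideline was quoted for semi-structured documents, so it is omitted.
STARTING_STP_FLOOR = {
    "structured": 0.80,
    "unstructured": 0.40,
}


@dataclass
class DocumentTypeEntry:
    name: str
    estimated_variants: int
    structure: str  # "structured", "semi-structured" or "unstructured"
    data_types: List[str]


def starting_stp(entry: DocumentTypeEntry) -> Optional[float]:
    """Return the rule-of-thumb starting STP, or None when no guideline exists."""
    return STARTING_STP_FLOOR.get(entry.structure)


# Hypothetical inventory for illustration only.
inventory = [
    DocumentTypeEntry("Application form", 2, "structured", ["typed", "handwritten"]),
    DocumentTypeEntry("Invoice", 150, "semi-structured", ["typed", "dates", "amounts"]),
    DocumentTypeEntry("Agreement", 40, "unstructured", ["typed"]),
]

for entry in inventory:
    estimate = starting_stp(entry)
    label = f"{estimate:.0%}" if estimate is not None else "needs sampling"
    print(f"{entry.name:<18} variants={entry.estimated_variants:<4} start STP: {label}")
```

Entries marked "needs sampling" are the ones where a proof of concept with real documents will tell you more than any rule of thumb.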
