How Many Samples

How Many Samples Are Necessary for Intelligent Capture to Be Successful?

One of the most common questions from organizations implementing intelligent document processing is, “How many samples do we need?” Two drivers determine how many samples should be used, and there are two areas where samples are necessary.

First, sample data is necessary when the intelligent capture system is being configured. If you’re using machine learning, these samples are typically called the training set or learning set. The second area where samples are used is in measuring system precision. In both configuration and measurement, the real focus of intelligent capture is on optimizing two things: the comprehensiveness of the data that you can get out of unstructured information in documents, and the precision, or accuracy, of the data output.
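To make the measurement side concrete, here is a minimal sketch of how a measurement sample could be scored. It assumes each sample has hand-verified field values to compare against, and it treats "comprehensiveness" as the share of expected fields that were captured correctly. That reading, the function name `score_extraction`, and the invoice fields and values are illustrative assumptions, not part of any particular product.

```python
def score_extraction(expected, extracted):
    """Compare extracted field values with hand-verified ground truth.

    Both arguments map a field name to its value. A field counts as
    correct only when it was extracted and matches the expected value.
    """
    correct = sum(
        1
        for name, value in extracted.items()
        if name in expected and expected[name] == value
    )
    # Precision: of the fields the system returned, how many were right.
    precision = correct / len(extracted) if extracted else 0.0
    # Coverage: of the fields we wanted, how many came back right.
    coverage = correct / len(expected) if expected else 0.0
    return precision, coverage


# Hypothetical invoice sample: three fields expected, two returned, one wrong.
truth = {"invoice_number": "INV-1001", "total": "250.00", "due_date": "2024-05-01"}
output = {"invoice_number": "INV-1001", "total": "260.00"}

precision, coverage = score_extraction(truth, output)
print(f"precision={precision:.2f} coverage={coverage:.2f}")
```

Averaging these two numbers over a measurement set that is kept separate from the training set gives a simple way to track whether adding samples is actually improving results.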

Unlike other systems, the real focus is on data results, which involves a lot of measurement. The number and type of samples you use to configure a system will help determine the outcomes that you can expect. The first driver is the number of document types within your scope. A document type could be contracts, purchase orders, bills of lading, or medical charts; the number of samples you need depends on how many document types fall within the scope of your project. The second driver, just as important, is the variance within each document type. For example, when automating invoice data extraction, if you only have one vendor’s invoice layout, the chances are that you could get by with a few samples. However, if you’re dealing with invoices from many different vendors, say 2,500 or more, then samples from each vendor may be more appropriate.

Even though you have one document type called invoice, you’ve got potentially anywhere from 25 to 2,500 or even 5,000 different variations within “invoices,” so a representative set of samples with their expected data output is what trains the system to produce very accurate results. If your scope involves 10 document types with three variations of each type, the objective would be to get 30 samples, one for each variation (10 × 3).
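The arithmetic above can be sketched in a few lines. This is only a planning aid under the assumption that one sample per variation is the baseline; the function name `minimum_sample_count` and the `samples_per_variation` parameter are hypothetical, and real projects may want more than one sample per variation.

```python
def minimum_sample_count(variations_per_type, samples_per_variation=1):
    """Return the number of samples needed to cover every variation.

    variations_per_type maps each document type to its number of known
    variations (for example, distinct vendor invoice layouts).
    """
    if samples_per_variation < 1:
        raise ValueError("samples_per_variation must be at least 1")
    return sum(variations_per_type.values()) * samples_per_variation


# Ten document types with three variations each, as in the example above.
scope = {f"type_{i}": 3 for i in range(1, 11)}
print(minimum_sample_count(scope))  # 30

# A single document type can still dominate the count.
print(minimum_sample_count({"invoice": 2500}))  # 2500
```

The second call shows why variance matters as much as the number of document types: one type with many variations can require far more samples than ten types with few.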

Quite simply, if you’re dealing with a form that has only one layout, then regardless of the issues that you might have with scanning quality and the like, you’re going to have a very standardized document, and fewer samples are necessary. Whether you’re manually configuring a system or using machine learning, you always need to focus closely on the input data set. With machine learning, this is even more true because you’re ceding the task of configuring the system, which would otherwise be done with rules, to machine learning algorithms. To understand why machine learning algorithms behave the way they do, examine the input data sets. When a system relies on machine learning algorithms instead of rules-based approaches, having a representative data set becomes that much more important.

Deep learning algorithms, like the ones used in our software within smart learning, perform very well on smaller sample sets. Another way to avoid the need for huge sample sets is to use pre-trained models. We collect large sample sets to pre-train intelligent capture models so that they are available out-of-the-box for immediate use. Pre-trained models find specific data on specific document types, and then gradually adapt to your specific data set. So it’s kind of like having your cake and eating it too.

If you’re not using a pre-trained model initially, you can curate data automatically using intelligent capture. Our smart learning system has a data curation function that analyzes production data and outputs, gradually collects samples that fit your production information, and then uses them to learn. Over time, these models become a lot more efficient.

The eBook Intelligent Document Processing: Technology Stack provides a primer on intelligent document processing (IDP) technologies, from the advent of capture to where it is today. It also addresses the differences between intelligent capture and Optical Character Recognition (OCR), when to use OCR and when to use intelligent capture, the interpretation methods, and what matters most depending on your organization’s needs and objectives. Machine learning is an essential component of the most advanced intelligent capture. The eBook reviews techniques that don’t require machine learning, as well as when machine learning is best applied in intelligent document processing. Supervised and unsupervised learning and artificial neural networks are also addressed.

Understanding each technology in the IDP stack, as applied to document preprocessing, data location and extraction, and data verification, shows what the intelligent document processing stack can be for your organization. Building a custom IDP stack, buying out-of-the-box, or leveraging off-the-shelf SDKs / APIs are also explored. These options, along with how to future-proof your document processing, conclude the eBook.
