What is OCR (Optical Character Recognition) software?
OCR software is used to recognize printed or written text characters. OCR software has been around for several decades and focuses primarily on converting text on a scanned page to machine-readable information.
Many OCR software vendors have expanded their core recognition capability to provide conversion of scanned documents into editable Word documents or conversion to searchable PDFs. OCR is different from data extraction, which involves locating specific data on a page.
OCR Software vs Data Extraction
OCR software is used to convert images of documents into text while advanced capture solutions are designed to reliably classify documents, and locate, extract and verify data.
Before and After OCR Software
Let’s look at an image of a document before OCR software is used to convert it to text.
Converted Text Results from OCR
OCR software converts images of documents into text, and results at the character level can be close to 100 percent accurate. That level of accuracy sounds great, but what does it really mean?
As you can see from the example above, the results at the character level (the “C” in OCR) look 100% accurate. Examined at the word level, however, the error rate is approximately 14.5%, calculated by dividing the number of word errors by the total number of words.
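The gap between character-level and word-level accuracy can be illustrated with a minimal Python sketch. The sample sentence and the function names below are assumptions made up for illustration; they are not the document shown above, so the resulting rate differs from the 14.5% figure. The word-level measure is also simplified: it counts reference words that have no exact match in the OCR output and divides by the total number of reference words.

```python
from difflib import SequenceMatcher


def word_error_rate(reference: str, ocr_output: str) -> float:
    """Share of reference words with no exact match in the OCR output."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    matcher = SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return (len(ref_words) - matched) / len(ref_words)


def char_error_rate(reference: str, ocr_output: str) -> float:
    """Share of reference characters (whitespace ignored) that were not matched."""
    ref_chars = "".join(reference.split())
    ocr_chars = "".join(ocr_output.split())
    matcher = SequenceMatcher(None, ref_chars, ocr_chars, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return (len(ref_chars) - matched) / len(ref_chars)


# Hypothetical sample: every character is recognized, but one space is lost.
reference = "change of control shall mean a sale of the company"
ocr_output = "change ofcontrol shall mean a sale of the company"

print(f"character error rate: {char_error_rate(reference, ocr_output):.1%}")
print(f"word error rate:      {word_error_rate(reference, ocr_output):.1%}")
```

In this sample a single missing space leaves every character correct, yet two of the ten reference words ("of" and "control") no longer appear as words in the output.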
For practical applications of OCR software, which include using text to aid with search, these errors are not necessarily a problem. After all, when searching for “change of control”, an instance where “ofcontrol” is not correctly separated may still be found. However, if you need to extract specific data that typically exists as a series of words, then an advanced capture solution is necessary.
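One way a search can tolerate a missing space is to treat the whitespace between the words of the query as optional. The following Python sketch shows that idea only; the function name and sample text are hypothetical, and real search engines may handle such errors quite differently.

```python
import re
from typing import List


def find_phrase_ignoring_spaces(phrase: str, text: str) -> List[str]:
    """Find a phrase even when OCR dropped the spaces between its words."""
    words = [re.escape(word) for word in phrase.split()]
    # Allow zero or more whitespace characters between consecutive words.
    pattern = re.compile(r"\s*".join(words), re.IGNORECASE)
    return [match.group(0) for match in pattern.finditer(text)]


# Hypothetical OCR output with a missing space.
ocr_text = "Upon a change ofcontrol, the agreement terminates."

print("change of control" in ocr_text)  # exact substring search
print(find_phrase_ignoring_spaces("change of control", ocr_text))
```

The exact substring search misses the phrase, while the whitespace-tolerant pattern still returns the damaged instance. The trade-off is that a looser pattern can also produce more false matches.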
To locate required data, advanced capture uses a variety of methods that extend well beyond dictionary-based approaches, including the relative proximity of one data element to another and expected value patterns, among many others.
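As a rough illustration of two of these methods, the Python sketch below combines an expected value pattern (a date format) with relative proximity (the value must appear shortly after a label). Everything in it is an assumption for illustration: the function name, the date format, the 40-character window, and the sample text. It is not how any particular capture product works.

```python
import re
from typing import Optional, Pattern

# Expected value pattern: dates written as MM/DD/YYYY (an assumed format).
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def extract_near_label(
    text: str,
    label: str,
    value_pattern: Pattern[str],
    max_distance: int = 40,
) -> Optional[str]:
    """Return the first value matching the pattern within
    max_distance characters after the label, or None."""
    label_match = re.search(re.escape(label), text, re.IGNORECASE)
    if label_match is None:
        return None
    # Proximity: only look in a short window that follows the label.
    window = text[label_match.end(): label_match.end() + max_distance]
    value_match = value_pattern.search(window)
    return value_match.group(0) if value_match else None


# Hypothetical OCR output containing several dates.
ocr_text = "Order placed 01/02/2023. Invoice Date: 03/15/2023 Due Date: 04/14/2023"

print(extract_near_label(ocr_text, "Invoice Date", DATE_PATTERN))
print(extract_near_label(ocr_text, "Due Date", DATE_PATTERN))
```

The pattern alone would match all three dates in the sample; anchoring the search to a nearby label is what distinguishes the invoice date from the others.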