Improving Read Rates From Captiva Extraction: James Kave
James Kave
Sr. Architect
Dell EMC
jim.kave@dell.com
Table of Contents
Introduction
Basic Work Flow
Sample Light Page
Sample Normal Page
OCR Results for Light Page on Basic Process Flow
Completion Results for Light Page on Basic Process Flow
OCR Results for Normal Page on Basic Process Flow
Completion Results for Normal Page on Basic Process Flow
Enhanced Work Flow
OCR Results for Light Page on Enhanced Process Flow
Completion Results for Light Page on Enhanced Process Flow
OCR Results for Normal Page on Enhanced Process Flow
Completion Results for Normal Page on Enhanced Process Flow
Conclusion
Disclaimer: The views, processes or methodologies published in this article are those of the authors.
They do not necessarily reflect Dell EMC’s views, processes or methodologies.
Processing usually consists of a mixture of automated and manual tasks. The automated tasks
clean up pages, classify them, and then read them. The manual tasks identify pages that were not
automatically classified and correct the metadata that the read engine was not able to pick up. As the
number of fields that do not read goes up, the number of operators required to correct those misreads
goes up, increasing the cost of processing. Any substantial decrease in the number of rejected fields can
reduce the overall cost of processing by reducing the number of hours required to correct the data on
the documents.
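The cost relationship described above can be sketched as simple arithmetic. This is only an illustration: the function name `correction_hours`, its parameters, and every number in the usage lines are assumptions made up to show the calculation, not measurements from any site.

```python
def correction_hours(pages: int, fields_per_page: int,
                     reject_rate: float, seconds_per_fix: float) -> float:
    """Estimate operator hours spent correcting rejected fields.

    All inputs are hypothetical planning figures supplied by the caller.
    """
    rejected_fields = pages * fields_per_page * reject_rate
    return rejected_fields * seconds_per_fix / 3600.0


# Illustrative values only: 10,000 pages, 10 fields per page, 5 seconds per fix.
baseline = correction_hours(10_000, 10, 0.20, 5.0)  # 20 percent of fields rejected
improved = correction_hours(10_000, 10, 0.10, 5.0)  # 10 percent of fields rejected
print(f"baseline: {baseline:.1f} h, improved: {improved:.1f} h")
```

Under this simple model, operator hours scale linearly with the reject rate, so halving the share of rejected fields halves the correction effort.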
Let’s review a normal set up for Classification and Read. Early in a project, the customer is asked to
supply as many samples of the documents being processed as possible. The tuning of Classification (the
identification of page type) and Read will work best when the pages supplied represent what will
actually be processed in production. When the supplied pages are reviewed, we almost always see all
the differences mentioned earlier. The bulk will be what would seem to be average printing with a set
font. The outliers might be lighter pages and sometimes darker or heavily printed pages. When a normal
tune is done, you tune to the most populous sample provided, in many cases letting the outliers, which
are a lower volume of pages, be data corrected. My process flow change addresses just that. This idea
has its roots in a change made circa 1976 to a hardware read system that predates today's software
read systems. An engineer (see note i) found that even though reading was done in real time, there
were still a few microseconds between characters in which a second character decision could be made.
He used that time to slightly change the position of the character video presented to the mask,
significantly improving read rates. While we cannot access the internals of the reader, we do have the
ability to recycle a page through the workflow. I will show a normal workflow processing two pages,
one normal and one of light print. I will then show the enhanced workflow and process the same
normal and light print pages.
The scanner used for all tests is a Fujitsu 5120c set to scan single side, bi-tonal pages at 300 dots per
inch (dpi). The workflows were constructed using Captiva Capture 7.5.
This screen represents the second page of the batch of work processed (light page) with the read engine
returning no results. The reason, of course, is that the process is set up to read most effectively on what
would be a normal quality image.
The completion client used for data completion shows we were looking in the correct location for
the data but, because of the light print, we were not able to read any of the fields.
The extraction results for page one (normal page) show we were able to read all the metadata except
the issue field. The template was set to read average quality print, so it performed much better on this
page than on the light one shown earlier.
The completion client shows all fields were read except the issue field. We would expect this level of
read or better on a normal document. Notice that even with a slight amount of skew induced by the
scanner, we still picked up the fields relatively well.
The second page (light print) read below the 70 percent threshold on the base template, so it was
rerouted and processed by a second template set to read lighter print. The results show all the
fields were picked up.
Even though the print is light, the second template was able to find and read the field data.
The first page (normal print) read well and was processed on the base template. The reading was above
70 percent so it was not rerouted through classification.
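The routing decision described above can be sketched as follows. This is a minimal illustration in Python, not Captiva Capture configuration: the names `read_rate` and `needs_reroute`, the representation of extraction results as a dictionary, and the field names other than the issue field are assumptions made for the example. Treating the read rate as the share of fields that returned a value is also an assumption; only the 70 percent threshold comes from the tests described here.

```python
from typing import Mapping, Optional

REROUTE_THRESHOLD = 0.70  # threshold used in the tests described in this paper


def read_rate(fields: Mapping[str, Optional[str]]) -> float:
    """Return the fraction of fields for which the engine returned a value."""
    if not fields:
        return 0.0
    read = sum(1 for value in fields.values() if value)
    return read / len(fields)


def needs_reroute(fields: Mapping[str, Optional[str]],
                  threshold: float = REROUTE_THRESHOLD) -> bool:
    """True when the page should be sent to another template."""
    return read_rate(fields) < threshold


# Hypothetical results from the base template (field names are illustrative).
normal_page = {"title": "Sample", "date": "2017-01-01", "volume": "12", "issue": None}
light_page = {"title": None, "date": None, "volume": None, "issue": None}

print(needs_reroute(normal_page))  # three of four fields read
print(needs_reroute(light_page))   # no fields read
```

With these stand-in values the normal page stays on the base template and the light page is rerouted, which mirrors the behavior of the two test pages.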
The data shown on the completion screen mirrors what was read by the extraction step. The operator is
presented with correct data for all fields and would not need to correct anything.
It is important to note that while the example above uses one reroute template, in a production
environment the number of rerouting templates would be determined by the mix and quality of pages
being processed. The read rate threshold of 70 percent would also probably need to be set differently,
because it depends on the quality of the pages and the number of fields needing attention. This number
would probably differ from site to site and would be determined by a processing manager.
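A production setup with several rerouting templates and a site-specific threshold could be sketched as an ordered chain of templates. Everything in this sketch is an assumption for illustration: `extract_with_reroute`, the stand-in extractor functions, the field names, and the choice to fall back to the best result seen when no template reaches the threshold. It does not use any Captiva API.

```python
from typing import Callable, Dict, List, Optional, Tuple

Fields = Dict[str, Optional[str]]
Template = Tuple[str, Callable[[str], Fields]]  # (template name, extractor)


def rate(fields: Fields) -> float:
    """Share of fields that returned a value (assumed definition of read rate)."""
    if not fields:
        return 0.0
    return sum(1 for value in fields.values() if value) / len(fields)


def extract_with_reroute(page: str, templates: List[Template],
                         threshold: float) -> Tuple[str, Fields]:
    """Try templates in order; stop at the first one that meets the threshold.

    If none meets it, return the best result seen so the operator has the
    fewest fields to correct.
    """
    best_name, best_fields, best_rate = "", {}, -1.0
    for name, extractor in templates:
        fields = extractor(page)
        current = rate(fields)
        if current > best_rate:
            best_name, best_fields, best_rate = name, fields, current
        if current >= threshold:
            break
    return best_name, best_fields


# Stand-in extractors: the base template only reads normal print.
def base_template(page: str) -> Fields:
    ok = page == "normal"
    return {"title": "T" if ok else None, "date": "D" if ok else None,
            "volume": "V" if ok else None, "issue": None}


def light_template(page: str) -> Fields:
    return {"title": "T", "date": "D", "volume": "V", "issue": "I"}


chain = [("base", base_template), ("light", light_template)]
print(extract_with_reroute("normal", chain, threshold=0.70)[0])
print(extract_with_reroute("light", chain, threshold=0.70)[0])
```

Keeping the threshold and the template list as parameters reflects the point that both would be tuned per site rather than fixed.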
This example shows manipulation of the workflow to present the completion operator with the greatest
number of correct fields. This same functionality could become part of a product instead of a
modification to a workflow and would be a differentiator in selecting a solution for a company’s
front-end processing needs.
i. Laray Freeze, working at Recognition International Corporation (REI)
Dell EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” DELL EMC MAKES NO
REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS
PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE.
Use, copying and distribution of any Dell EMC software described in this publication requires an
applicable software license.
Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.