IMPROVING READ RATES FROM CAPTIVA EXTRACTION
James Kave
Sr. Architect
Dell EMC
jim.kave@dell.com
Table of Contents
Introduction
Basic Work Flow
Sample Light Page
Sample Normal Page
OCR Results for Light Page on Basic Process Flow
Completion Results for Light Page on Basic Process Flow
OCR Results for Normal Page on Basic Process Flow
Completion Results for Normal Page on Basic Process Flow
Enhanced Work Flow
OCR Results for Light Page on Enhanced Process Flow
Completion Results for Light Page on Enhanced Process Flow
OCR Results for Normal Page on Enhanced Process Flow
Completion Results for Normal Page on Enhanced Process Flow
Conclusion

Disclaimer: The views, processes or methodologies published in this article are those of the authors.
They do not necessarily reflect Dell EMC’s views, processes or methodologies.

2016 EMC Proven Professional Knowledge Sharing 2


Introduction
Front-end capture systems have three basic areas: an input mechanism, a processing section, and an
output to a host. The input can come from many different sources (scan, email, file drop, or a web
service). That input can, and usually does, arrive in a wide variety of conditions: light print, dark
print, small or large characters, and different fonts. Regardless of the condition of the input work,
one thing common to all customers is that they want perfection in the output data. In many cases that
need increases the effort in the processing section. While the processing section must serve a
multitude of different needs across the customer base, we will concentrate on the two that are common
to most business solutions: Classify/Read and Correction of Metadata.

Processing is usually a mixture of automated and manual tasks. The automated tasks can clean up pages,
classify them, and then read them. The manual tasks are to identify pages that were not automatically
classified and to correct the metadata that the read engine was not able to pick up. As the number of
fields that do not read goes up, the number of operators required to correct those misreads goes up,
increasing the cost of processing. Any substantial decrease in the number of rejected fields can
reduce the overall cost of processing by reducing the number of hours required to data correct the
documents.
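
The relationship between rejected fields and correction hours can be made concrete with a small
calculation. The sketch below is only an illustration: the function name, the page volume, the
fields per page, the read rates, and the seconds per correction are all hypothetical values chosen
for the example, not measurements from any Captiva installation.

```python
def correction_hours(pages, fields_per_page, read_rate, seconds_per_field):
    """Estimate operator hours needed to key the fields the read engine missed.

    read_rate is the fraction of fields read automatically (0.0 to 1.0).
    All inputs are illustrative assumptions, not measured values.
    """
    if not 0.0 <= read_rate <= 1.0:
        raise ValueError("read_rate must be between 0 and 1")
    rejected_fields = pages * fields_per_page * (1.0 - read_rate)
    return rejected_fields * seconds_per_field / 3600.0


# Hypothetical workload: 10,000 pages, 8 fields per page, 9 seconds per correction.
baseline = correction_hours(10_000, 8, 0.80, 9)   # roughly 40 operator hours
improved = correction_hours(10_000, 8, 0.90, 9)   # roughly 20 operator hours
print(f"hours saved: {baseline - improved:.1f}")
```

Under these assumed numbers, halving the share of rejected fields halves the correction hours,
which is the effect the rest of this article tries to achieve through routing.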

Let’s review a normal setup for Classification and Read. Early in a project, the customer is asked to
supply as many samples of the documents being processed as possible. The tuning of Classification (the
identification of page type) and Read works best when the pages supplied represent what will
actually be processed in production. When the supplied pages are reviewed, we almost always see all
the differences mentioned earlier. The bulk will be what seems to be average printing with a set
font. The outliers might be lighter pages and sometimes darker or heavily printed pages. In a normal
tune, you tune to the most populous sample provided, in many cases letting the outliers, which
are a lower volume of pages, be data corrected. My process flow change addresses just that. The idea
has its roots in a change made, circa 1976, to a hardware read system that predates today's software
read systems. An engineer [i] found that even though reading was real time, there were still a few
microseconds between characters in which a second character decision could be made. He used that time
to slightly change the position of the character video presented to the mask, significantly improving
read rates. While we cannot access the internals of the reader, we do have the ability to recycle via
the workflow. I will show a normal workflow processing two pages, one normal and one of light print. I
will then show the enhanced workflow processing the same normal and light print pages.


Basic Work Flow

In a normal process flow, we identify a page through classification, assigning a template ID to the
page. The page is then sent to extraction to pick up the metadata. The template will have been set up
with each field assigned properties that control how the field is read. Those properties include the
actual read engine used, the location of the data for the field being read, and image quality filters.
If a page is identified as a particular template, it is read with the properties assigned to that
template regardless of the quality of print on the page (light, dark, small or large characters, or
position of the data). The following two pages were sent through the work flow shown in the figure
above. Following the images are the Optical Character Recognition (OCR) results for both pages,
followed by what we would see in the manual tool used to data complete the metadata (Completion).
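
One way to picture the template described above is as a record that maps each field to its read
properties. The sketch below is a hypothetical data model written for illustration; the class names,
field names, engine label, filter name, and zone coordinates are assumptions and do not reflect
Captiva's actual template format or API.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class FieldProperties:
    """Hypothetical per-field settings mirroring the three properties named in the text."""
    read_engine: str                       # which recognition engine reads the field
    zone: Tuple[int, int, int, int]        # left, top, width, height in pixels
    image_filters: Tuple[str, ...] = ()    # image quality filters applied before reading


@dataclass
class Template:
    template_id: str
    fields: Dict[str, FieldProperties] = field(default_factory=dict)


# Illustrative template; every value here is made up for the example.
base_template = Template(
    template_id="BASE",
    fields={
        "title": FieldProperties("ocr_engine_a", (150, 200, 900, 80), ("despeckle",)),
        "issue": FieldProperties("ocr_engine_a", (150, 320, 300, 80), ("despeckle",)),
    },
)


def properties_for(template: Template, field_name: str) -> FieldProperties:
    # The lookup depends only on the template and the field name,
    # never on how light or dark the scanned page happens to be.
    return template.fields[field_name]
```

The point of the sketch is the last function: once a page is classified, the same properties apply
no matter what the print quality is, which is why a single template tuned for average print
struggles with outliers.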

The scanner used for all tests is a Fujitsu 5120c set to scan single side, bi-tonal pages at 300 dots per
inch (dpi). The workflows were constructed using Captiva Capture 7.5.


Sample Light Page


Sample Normal Page


OCR Results for Light Page on Basic Process Flow

This screen represents the second page of the batch of work processed (light page) with the read engine
returning no results. The reason, of course, is that the process is set up to read most effectively on what
would be a normal quality image.


Completion Results for Light Page on Basic Process Flow

The completion client used for data completion shows that we were looking in the correct location for
the data but, because of the light print, were not able to read any of the fields.


OCR Results for Normal Page on Basic Process Flow

The extraction results for page one (normal page) show we were able to read all the metadata except
the issue field. The template was set to read average quality print, so it performed much better on
this page than on the light one shown earlier.


Completion Results for Normal Page on Basic Process Flow

The completion client shows all fields were read except the issue field. We would expect this level of
read or better on a normal document. Notice that even with a slight amount of skew induced by the
scanner, we still picked up the fields relatively well.


Enhanced Work Flow

The enhanced work flow initially reads just as the basic work flow would, but after extraction
(reading of the metadata) it adds a code module. The code in this module determines the field read
rate, and that rate is then used to control routing of the work flow. If the read rate is above 70
percent (just an arbitrary value), the item is sent to completion for data review / completion. If the
read rate is below 70 percent, the page is sent back to classification, where it is assigned a
template that has been set up to read light print. The decision block also tests the number of times a
page has been rerouted so that an endless loop cannot be created. It is important to note that the
testing and rerouting are totally unattended and occur automatically. The results for the same two
pages run through the enhanced work flow follow.
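
The decision logic described above can be sketched in a few lines. This is a minimal illustration in
Python, not the code used in the article's workflow and not Captiva's scripting API; the function
names, the step labels, and the representation of extracted fields as a dictionary are assumptions.
Two further assumptions are labeled in the comments: a page exactly at the threshold is treated as
passing, and a page that has used up its reroutes goes to completion.

```python
from typing import Mapping, Optional

READ_RATE_THRESHOLD = 0.70   # the article calls 70 percent "just an arbitrary value"
MAX_REROUTES = 1             # assumption: one light-print template, so one reroute


def field_read_rate(fields: Mapping[str, Optional[str]]) -> float:
    """Fraction of fields for which extraction returned a non-empty value."""
    if not fields:
        return 0.0
    read = sum(1 for value in fields.values() if value is not None and value.strip())
    return read / len(fields)


def next_step(fields: Mapping[str, Optional[str]],
              reroute_count: int,
              threshold: float = READ_RATE_THRESHOLD,
              max_reroutes: int = MAX_REROUTES) -> str:
    """Return the hypothetical name of the step the page should go to next."""
    # Assumption: a rate exactly equal to the threshold counts as good enough.
    if field_read_rate(fields) >= threshold:
        return "completion"
    # Guard against an endless loop. Assumption: once the reroute limit is
    # reached, the page goes to completion for manual correction.
    if reroute_count >= max_reroutes:
        return "completion"
    return "classification"


light_page = {"title": None, "issue": None, "date": None}
print(next_step(light_page, reroute_count=0))   # classification
print(next_step(light_page, reroute_count=1))   # completion
```

Keeping the threshold and the reroute limit as parameters reflects the point that both are tuning
values rather than fixed properties of the approach.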


OCR Results for Light Page on Enhanced Process Flow

The second page (light print) was not read by the base template. Because its read rate was below 70
percent, it was rerouted and processed by a second template set to read lighter print. The results
show all the fields were picked up.


Completion Results for Light Page on Enhanced Process Flow

Even though the print is light, the second template was able to find and read the field data.


OCR Results for Normal Page on Enhanced Process Flow

The first page (normal print) read well and was processed on the base template. The read rate was
above 70 percent, so it was not rerouted through classification.


Completion Results for Normal Page on Enhanced Process Flow

The data shown on the completion screen mirrors what was read by the extraction step. The operator is
presented with correct data for all fields and would not need to correct anything.


Conclusion
As you can see, adding the ability to reroute pages that do not read well greatly improves the overall
quality of the data presented to the completion step. If fewer fields in completion need attention,
data completion of the form will be much faster, allowing an operator to process more pages in a set
amount of time. Compound this across multiple operators processing more work, and you get an overall
reduction in the man-hours required to process a set quantity of documents.

It is important to note that, while the example above uses one reroute template, in a production
environment the number of rerouting templates would be determined by the mix and quality of pages
being processed. The read rate threshold of 70 percent would also probably need to be set differently,
based on the quality and number of fields needing attention. This number would probably differ from
site to site and would be determined by a processing manager.
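
Extending the idea to several reroute templates and a site-specific threshold might look like the
sketch below. It is an assumption-laden illustration rather than a description of any product
feature: the `Reader` callables stand in for "classify with this template, then extract", the
template names are hypothetical, and the two reader functions simply simulate a base template that
fails on a light page.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Fields = Dict[str, Optional[str]]
Reader = Callable[[str], Fields]   # stand-in for classify-and-extract with one template


def read_rate(fields: Fields) -> float:
    """Fraction of fields with a non-empty value."""
    if not fields:
        return 0.0
    return sum(1 for value in fields.values() if value) / len(fields)


def read_with_fallbacks(page: str,
                        templates: Sequence[Tuple[str, Reader]],
                        threshold: float) -> Tuple[str, Fields, float]:
    """Try each template in order; stop at the first result that meets the threshold.

    The finite template list bounds the number of reroutes. If no template
    meets the threshold, the best result seen is returned for manual completion.
    """
    if not templates:
        raise ValueError("at least one template is required")
    best = None
    for template_id, reader in templates:
        fields = reader(page)
        rate = read_rate(fields)
        if best is None or rate > best[2]:
            best = (template_id, fields, rate)
        if rate >= threshold:
            break
    return best


# Simulated readers, purely for demonstration.
def base_reader(page: str) -> Fields:
    if page == "light":
        return {"title": None, "issue": None}
    return {"title": "Sample", "issue": "12"}


def light_reader(page: str) -> Fields:
    return {"title": "Sample", "issue": "12"}


chain = [("BASE", base_reader), ("LIGHT_PRINT", light_reader)]
light_result = read_with_fallbacks("light", chain, threshold=0.70)
normal_result = read_with_fallbacks("normal", chain, threshold=0.70)
print(light_result[0], normal_result[0])   # LIGHT_PRINT BASE
```

A processing manager would adjust the `threshold` argument and the contents of `chain` per site,
which keeps the tuning decisions outside the routing logic itself.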

This example shows manipulation of the work flow to present the completion operator with the greatest
number of correct fields. This same functionality could become part of a product instead of a
modification to a workflow and would be a differentiator in selecting a solution for a company’s front-
end processing needs.

[i] Laray Freeze, working at Recognition International Corporation (REI)

Dell EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” DELL EMC MAKES NO
REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS
PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE.

Use, copying and distribution of any Dell EMC software described in this publication requires an
applicable software license.

Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.
