OCR Topic Discussion
College of Informatics
Postgraduate Program
ID: GUR/03348/14
1. Introduction
Over the past few decades, optical character recognition (OCR) has become one of the best-known
fields of pattern recognition research. Because of its enormous potential for application, it is
actively researched in both industry and academia. OCR was first researched in the early 1930s.
OCR is a method for converting printed, typewritten, or handwritten text characters into
machine-encoded text; this includes recognizing and reading handwritten characters. It is
typically used as a method of data entry from printed paper records, such as mail printouts,
bills, bank statements, business cards, and passports. OCR is regarded as a branch of image
processing and a significant area of pattern recognition research.
In order to recall and recognize something later, the human brain typically finds some sort of
relation that is mainly pictorial in nature. In a way, it tends to find patterns in handwritten
characters. This was a major motivation for the development of OCR systems. The characters of
most languages are built from lines and curves, so an OCR system can be designed to recognize
them [1].
Optical Character Recognition (OCR) is an interesting and challenging field of research in pattern
recognition, artificial intelligence and machine vision and is used in many real-life applications.
Optical character recognition is a type of document analysis in which a scanned document image
containing either machine-printed or handwritten script is input to an OCR software engine and
translated into an editable, machine-readable digital text format. Automatic processing of tabular
application forms, bank checks, tax forms, census forms, and postal mail has become more
significant as computers have spread through the public and business sectors as well as individual
homes. This automation requires handwritten character and numeral recognition for several
languages or scripts. Machine-printed character recognition and handwritten character recognition
are the two subfields of OCR. Because of its many possible applications, handwriting recognition
is a crucial area of research [2].
Handwriting recognition (or HWR) is the ability of a computer to receive and interpret
intelligible handwritten input from sources such as paper documents, photographs, touch-screens
and other devices. Depending on the manner in which data is acquired, the domain of handwritten
character recognition is divided into two types.
Off-line handwriting recognition: The image of the written text is sensed "off line" from a
piece of paper by optical scanning (optical character recognition). For offline character recognition
the following cases can be considered: recognition of one or a fixed number of fonts (Fixed Font
OCR and Multi-font OCR), any printed font (Omni-font OCR), isolated hand-printed characters
(Handwriting OCR) and unconstrained handwriting (Script Recognition). However, handwritten
character recognition is a challenging task because writing styles vary between writers and
writing environments. The task becomes harder when the quality of the text document is poor or
when the characters are written very close to each other.
On-line handwriting recognition: Online recognition is achieved when a computer recognizes the
characters as they are drawn, as opposed to offline recognition, which typically takes place after
writing or printing is complete. For offline handwriting recognition, the handwritten document is
converted into digital form after the writing is finished. Offline recognition has the benefit that
it can be performed at any time, even years after the document was written. The drawback is that
it cannot be used for instantaneous text input, because it cannot be done as a person writes in
real time [3].
OCR is a member of the family of automatic identification methods for machine recognition.
Automatic identification is a process in which a recognition system recognizes items
automatically, gathers information about them, and puts that information straight into computer
systems, all without the need for human intervention. Through the processing of pictures, sounds,
or movies, external data is gathered. A transducer is used to record data, turning the real image or
sound into a digital file. The file is then saved so that the computer can review it later. [1].
used. The original document's digital image is obtained through the scanning procedure. OCR uses
optical scanners, which combine a transport mechanism with a sensing device that converts light
intensity into grey levels. Printed documents usually consist of black text on a white background,
so the multilevel image is converted into a bi-level black-and-white image during OCR. This
procedure, called thresholding, is performed on the scanner to conserve memory and processing
effort. The thresholding technique is crucial because the quality of the bi-level image has a
significant impact on the recognition results [1]. With a fixed threshold, grey levels below the
threshold are treated as black and levels above it as white. For a handwritten input document, the
output of the scanner is the scanned document image.
Figure 1: Architecture of Optical character recognition
Pre-processing: - Pre-processing is the second part of OCR. Depending on how the data was
collected, the raw data goes through a number of preliminary processing steps before it is used in
the descriptive stages of character analysis. The image produced by the scanning process may
contain some noise [1]. Depending on the scanner resolution and the inherent thresholding, the
characters may be smeared or broken. Some of these defects, which may cause poor recognition
rates, are eliminated by the pre-processor by smoothing the digitized characters. The most
common smoothing technique moves a window across the binary image of the character and applies
certain rules to the contents of the window. Pre-processing also includes normalization, which is
applied to obtain characters of uniform size, slant and rotation. The pre-processing component
thus aims to produce data on which the OCR system can operate accurately.
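The window-based smoothing described above can be sketched as follows. The source does not state which rules are applied to the window, so the two rules used here (delete an ink pixel with no ink neighbours, fill a background pixel whose neighbours are all ink) are assumptions chosen for illustration, as is the convention that ink pixels are 1 and background pixels are 0. The function name `smooth_binary` is hypothetical.

```python
def smooth_binary(image):
    """Slide a 3x3 window over a binary image (1 = ink, 0 = background).

    Assumed rules: an ink pixel with no ink neighbours is removed as noise,
    and a background pixel whose neighbours are all ink is filled.
    """
    rows, cols = len(image), len(image[0])
    result = [row[:] for row in image]
    for r in range(rows):
        for c in range(cols):
            # In-bounds 3x3 neighbourhood around (r, c), excluding the centre.
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            ink = sum(neighbours)
            if image[r][c] == 1 and ink == 0:
                result[r][c] = 0  # isolated speck
            elif image[r][c] == 0 and neighbours and ink == len(neighbours):
                result[r][c] = 1  # one-pixel gap inside a stroke
    return result


noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
cleaned = smooth_binary(noisy)  # the single speck is removed
```

Rules are evaluated against the original image rather than the partially smoothed one, so the result does not depend on the order in which the window visits pixels.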
Feature Extraction: - Feature extraction is the fourth OCR component. The goal of feature
extraction is to identify the fundamental traits of symbols. It is generally acknowledged to be
one of the most challenging problems in pattern recognition. The simplest way to describe a
character is by its actual raster image. Another strategy is to extract the aspects that define
symbols and leave out the less significant properties [1]. In the feature extraction stage, each
character is represented as a feature vector, which becomes its identity. The major goal of
feature extraction is to extract a set of features that maximizes the recognition rate with the
fewest elements and that produces a similar feature set for different instances of the same
symbol. Because handwriting has a high degree of variability and imprecision, obtaining these
features is a difficult task. Feature extraction methods analyze the input document image and
select a set of features that uniquely identifies and classifies the character.
Classification: - The feature extraction stage yields the feature vector that is used for
classification. Classification is the decision-making step in the OCR system; it makes use of the
features extracted in the previous stage. To perform the classification we need a data bank of
stored feature vectors to compare against. A classifier compares the feature vector of the input
with the feature vectors in the data bank. The choice of classifier depends on the training set and
the number of free parameters.
Based on statistical data, the system can detect some typical OCR errors, for example, those related
to the similarity of characters and words. Thus, at these 4 stages, the system corrects flaws in order
to improve the quality of the OCR output.
OCR, commonly referred to as text recognition technology, transforms any type of image
containing written text into text data that can be read by computers. OCR enables rapid and
automatic document digitization without the need for manual data entry [5]. The input to the system
is a printed or handwritten character or numeral, and the input data goes through several stages
of processing [2]. In general, optical character recognition consists of processing the
scanned image of the text. The basic processing steps for character recognition are as follows.
1. Image-acquisition
2. Pre-processing
3. Segmentation
4. Feature-extraction
5. Recognition/classification
6. Post-processing
Let us look at each step in turn and see how it works.
3.1 Image-acquisition
The input to the OCR system is the scanned document image. This image can be acquired using
a scanner, a digital camera or any other suitable digital input device. The input image should
be in a specific format such as .jpeg or .bmp, as shown in figure 2. Apart from scanned documents,
documents already stored in the system can also be taken as input and processed for recognition.
The image is then passed to the next step, pre-processing.
3.2 Pre-processing
Pre-processing covers a number of procedures that prepare the scanned input image for further
processing. It primarily focuses on noise reduction and the handling of irregular data. Although
every step is integral and must perform accurately, pre-processing is the most crucial step in
character recognition because the accuracy of the later steps depends on it. In essence,
pre-processing is used to enhance image quality; contrast stretching is one such enhancement
technique, judged mostly by how the result appears to the viewer. The important steps involved are
filtering out unwanted image content, normalization, binarization, skew correction and slant
removal.
Pre-processing includes the following steps:
1. Filtering
2. Normalization
3. Binarization
4. Skew correction
5. Slant removal
3.2.2 Normalization
This stage removes some of the variations in the image that do not affect the identity of the input
data and provides a tremendous reduction in data size. Thinning extracts the shape information
of the characters[2].
3.2.3 Binarization
The collected image is first converted from grayscale to a binary format. All pixels in the input
image with luminance greater than a threshold level are given the value 1 (white) in the output
binary image, whereas all other pixels are given the value 0 (black), as in figure 4. Here, the
level is the threshold declared by the user, and the output is stored in the system as a matrix.
It is then compared to the stored template.
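The thresholding rule above can be written in a few lines. This is a minimal sketch that assumes the grayscale image is a list of rows of 8-bit luminance values; the function name `binarize` and the default threshold of 128 are illustrative choices, not values taken from the source.

```python
def binarize(gray, threshold=128):
    """Return a binary matrix: 1 (white) where luminance > threshold, else 0 (black)."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in gray]


gray = [
    [250, 240,  30, 245],
    [248,  20,  25, 250],
    [252, 235,  35, 128],
]
binary = binarize(gray, threshold=128)
# binary is [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]]
```

Note that a pixel exactly equal to the threshold (the last value, 128) maps to 0, because the rule is "greater than". With this convention the ink is 0 and the paper is 1; later stages that expect ink to be 1 would need the matrix inverted first.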
One measurable characteristic of different handwriting styles is the slant angle between the
longest strokes in a word and the vertical direction. Slant removal methods are used to normalize
all characters to a standard form.
3.3 Segmentation
Segmentation is by far the most important aspect of the character recognition system. It allows the
recognizer to extract features from each individual character. In the more complicated case of
handwritten text, the segmentation problem becomes much more difficult because letters tend to be
connected to each other, overlapped or distorted. Segmentation separates individual text
lines, words and characters from the input document [2]. If the image consists of several
horizontal lines of words, segmentation first isolates every line in the document.
Next, if a line contains several words, each word is segmented. Finally, each word
contains several letters, so each letter in the word is segmented. Thus, segmentation
obtains isolated characters by decomposing an image of a sequence of characters.
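The line-then-column decomposition described above can be sketched with projection profiles. The source does not say how the lines are located, so splitting at empty rows and empty columns is an assumption made here for illustration, as is the convention that ink pixels are 1. The helper names `split_runs`, `segment_lines` and `segment_columns` are hypothetical.

```python
def split_runs(profile):
    """Return (start, end) index pairs of consecutive non-zero entries in profile."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                  # a run of ink begins
        elif not value and start is not None:
            runs.append((start, i))    # the run ended at the previous index
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(image):
    """Split a page into text lines at rows that contain no ink."""
    row_profile = [sum(row) for row in image]
    return [image[top:bottom] for top, bottom in split_runs(row_profile)]


def segment_columns(block):
    """Split a line (or word) into pieces at columns that contain no ink."""
    col_profile = [sum(col) for col in zip(*block)]
    return [[row[left:right] for row in block] for left, right in split_runs(col_profile)]


page = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
lines = segment_lines(page)          # two lines: rows 0-1 and row 3
pieces = segment_columns(lines[0])   # two pieces in the first line
```

Distinguishing word gaps from character gaps would need an extra gap-width threshold, and touching or overlapping handwritten characters have no empty column between them, which is one reason the text calls the handwritten case much harder.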
Segmentation can therefore be carried out at different levels depending on the problem. It can be
implemented, for example, through labeling in MATLAB.
The pre-defined labeling functions in MATLAB include bwlabel, bwlabeln, and bwconncomp, and
the most suitable one must be chosen. The primary idea behind labeling is to identify neighboring
foreground pixels and give them the same tag. For instance, pixels with the label 0 denote the
background, one character is made up of the pixels with the label 1, another of the pixels with the
label 2, and so forth. As a result, segmentation is carried out using the labels of neighboring
characters. Each segmented character is sent to feature extraction after the segmentation phase is
finished [6].
One may take segmentation as a step-by-step breakdown of a region of text until the desired level
of interest is achieved. In character recognition, segmentation ceases after isolating characters.
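The labeling idea above can be illustrated without MATLAB. The following is a plain-Python sketch of connected-component labeling with 8-connectivity, assuming ink pixels are 1; it is not the implementation of bwlabel, and the order in which labels are assigned may differ from MATLAB's. The function name `label_components` is hypothetical.

```python
from collections import deque


def label_components(image):
    """Label 8-connected groups of ink pixels (1s); background keeps label 0."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    current = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or labels[r][c] != 0:
                continue
            current += 1                      # start a new component
            labels[r][c] = current
            queue = deque([(r, c)])
            while queue:                      # breadth-first flood fill
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and labels[ny][nx] == 0):
                            labels[ny][nx] = current
                            queue.append((ny, nx))
    return labels, current


image = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
labels, count = label_components(image)
# count is 3; labels is [[1, 0, 0, 2], [1, 0, 0, 2], [0, 0, 0, 0], [0, 3, 3, 0]]
```

Each label then identifies one candidate character, provided the characters do not touch. Characters made of several disconnected strokes (such as "i") would receive more than one label and need to be merged in a later step.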
Fig 7: Line-wise segmentation of word ‘The’
Fig 8: Line-wise segmentation of word ‘Best’
Word-wise segmentation is then applied so that the words in each line are isolated, giving ‘The’,
‘Best’, ‘Way’, ‘Of’, ‘UNDERSTANDING’ and ‘LIFE’, as seen in Fig. 7 for the first line, ‘The’.
Character-wise segmentation is then performed to isolate the constituent characters of each word.
The word ‘The’ is segmented into its most elemental components ‘T’, ‘h’ and ‘e’. After the
image has been broken down into its most basic component, the character, it is passed to feature
extraction.
Feature extraction methods are based on three types of features: statistical features, structural
features, and global transformations and moments. Of these, this discussion covers statistical
features.
Statistical features
The following are the major statistical features used for character representation.
Zoning: The frame containing the character is divided into several overlapping or non-overlapping
zones. The densities of the points, or of some features, in the different zones are analyzed to form
the representation.
Example: - Initially the image is divided equally into 3*3, i.e. 9, zones.
Figure 11: - Creation of 9 zones
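A zoning feature for the 3*3 case can be sketched as the fraction of ink pixels in each zone. This sketch covers only non-overlapping zones and assumes that ink pixels are 1 and that the image has at least three rows and three columns; the function name `zone_densities` is hypothetical.

```python
def zone_densities(image, zones_per_side=3):
    """Return the ink density of each zone, listed row by row (top-left zone first)."""
    rows, cols = len(image), len(image[0])
    features = []
    for zr in range(zones_per_side):
        # Integer boundaries so that the zones tile the image without gaps.
        r0 = zr * rows // zones_per_side
        r1 = (zr + 1) * rows // zones_per_side
        for zc in range(zones_per_side):
            c0 = zc * cols // zones_per_side
            c1 = (zc + 1) * cols // zones_per_side
            ink = sum(image[r][c] for r in range(r0, r1) for c in range(c0, c1))
            features.append(ink / ((r1 - r0) * (c1 - c0)))
    return features


# A 6x6 glyph shaped like the letter 'C'.
glyph = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
features = zone_densities(glyph)
# features is [1.0, 0.5, 0.5, 1.0, 0.0, 0.0, 1.0, 0.5, 0.5]
```

The nine values form a compact feature vector: the empty middle and right-middle zones (the fifth and sixth values) are what distinguish this 'C' from a closed shape such as 'O'.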
Euler number: The Euler number can be used to help classify the characters.
The Euler number is defined as the number of objects in the image minus the number of
holes in those objects. Therefore, in handwritten character recognition the Euler number is
the difference between the number of characters in the image and the number of holes present
in those characters.
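The definition above (objects minus holes) can be computed by counting connected components. In this sketch, ink pixels are assumed to be 1, objects are counted with 8-connectivity, and holes are counted as 4-connected background regions that do not reach the image border. Those connectivity choices and the names `count_components` and `euler_number` are assumptions for illustration.

```python
FOUR = [(-1, 0), (1, 0), (0, -1), (0, 1)]
EIGHT = FOUR + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(grid, target, neighbours):
    """Count connected regions of cells equal to target, using the given neighbour offsets."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != target or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            stack = [(r, c)]
            while stack:                       # depth-first flood fill
                y, x = stack.pop()
                for dy, dx in neighbours:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == target and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return count


def euler_number(image):
    """Number of objects minus number of holes in a binary image (1 = ink)."""
    width = len(image[0]) + 2
    # Pad with background so that all outside background forms one region.
    padded = [[0] * width] + [[0] + list(row) + [0] for row in image] + [[0] * width]
    objects = count_components(padded, 1, EIGHT)
    holes = count_components(padded, 0, FOUR) - 1   # subtract the outer background
    return objects - holes


letter_o = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
letter_c = [[1, 1, 1], [1, 0, 0], [1, 1, 1]]
letter_b = [[1, 1, 1], [1, 0, 1], [1, 1, 1], [1, 0, 1], [1, 1, 1]]
# euler_number gives 0 for letter_o, 1 for letter_c and -1 for letter_b
```

A single character therefore yields 1 with no holes, 0 with one hole and -1 with two holes, which splits the alphabet into three coarse groups before finer features are applied.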
Figure 12: - Zones named from T1 to T9
Figure 13: - End-points of character ‘C’
❖ In the above figure we can see the character ‘C’ divided into 9 zones. We can infer from the
figure that the end points of the character ‘C’ lie in zones T3 and T9.
Projections and profiles: - Character input data can be represented by projecting the pixel gray
values onto lines in various directions, which reduces the two-dimensional image to
one-dimensional signals. The basic idea behind projections is that character images, which are
2-D signals, can be represented as 1-D signals. These features, although insensitive to noise and
deformation, depend on rotation. Projection histograms count the number of pixels in each column
and each row of a character image.
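Projection histograms as described above amount to row sums and column sums of the binary character image. A minimal sketch, assuming ink pixels are 1 (the function name `projection_histograms` is hypothetical):

```python
def projection_histograms(image):
    """Return (row_histogram, column_histogram) of ink-pixel counts."""
    row_hist = [sum(row) for row in image]          # horizontal projection
    col_hist = [sum(col) for col in zip(*image)]    # vertical projection
    return row_hist, col_hist


letter_c = [
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 1],
]
row_hist, col_hist = projection_histograms(letter_c)
# row_hist is [3, 1, 3] and col_hist is [3, 2, 2]
```

Rotating the character by 90 degrees swaps the two histograms, which illustrates why these features depend on rotation.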
Crossings and distances: - Crossings are the number of times the contour is crossed by a line
segment in a specified direction. The distance of the line segment from a given boundary can also
be used as a feature. A horizontal threshold line can be placed above, below and through the
center of the script. The feature value is the number of times the script crosses the threshold.
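The crossing count can be sketched by scanning a row and counting background-to-ink transitions. The source only says that the lines lie above, below and through the center of the script, so placing them at one quarter, one half and three quarters of the image height is an assumption, as are the function names and the convention that ink pixels are 1.

```python
def count_crossings(line):
    """Count how many separate ink runs a scan line passes through."""
    crossings, previous = 0, 0
    for pixel in line:
        if pixel == 1 and previous == 0:   # background-to-ink transition
            crossings += 1
        previous = pixel
    return crossings


def horizontal_crossings(image):
    """Crossing counts for three horizontal lines: upper, centre and lower (assumed positions)."""
    height = len(image)
    scan_rows = (height // 4, height // 2, 3 * height // 4)
    return [count_crossings(image[r]) for r in scan_rows]


letter_h = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
]
crossings = horizontal_crossings(letter_h)
# crossings is [2, 1, 2]: two strokes above and below the bar, one continuous run through it
```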
As a result, we extract features from the entire image by concatenating all the rows to form a single
contiguous vector. This feature vector consists of zeros (0s) and ones (1s) representing background and
foreground pixels in the image, respectively.
3.5 Classification
The feature extraction stage yields the feature vector that is used for classification.
Classification is the decision-making step in the OCR system; it makes use of the features
extracted in the previous stage. To perform the classification we need a data bank of stored
feature vectors to compare against. A classifier compares the feature vector of the input with the
feature vectors in the data bank. The choice of classifier depends on the training set and the
number of free parameters. Many classical and soft computing techniques exist for handwriting
recognition.
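The comparison against a data bank can be illustrated with the simplest possible classifier. The source does not name a specific classifier, so the nearest-neighbour rule with Hamming distance used here is an assumption chosen for brevity, and the three stored 3*3 templates are invented for the example.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(features, data_bank):
    """Return the label of the stored vector closest to the input feature vector."""
    return min(data_bank, key=lambda label: hamming_distance(features, data_bank[label]))


# Hypothetical data bank: each 3x3 template is flattened row by row (1 = ink).
data_bank = {
    "I": [0, 1, 0,  0, 1, 0,  0, 1, 0],
    "L": [1, 0, 0,  1, 0, 0,  1, 1, 1],
    "T": [1, 1, 1,  0, 1, 0,  0, 1, 0],
}

noisy_t = [1, 1, 1,  0, 1, 0,  0, 0, 0]   # a 'T' with one missing pixel
predicted = classify(noisy_t, data_bank)   # "T" (distance 1, versus 3 for "I" and 7 for "L")
```

A practical system would store many examples per character and would typically use a trained classifier such as a neural network or support vector machine, which is where the training set and the number of free parameters come into the choice.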
3.6 Post-processing
Post-processing is the final step of optical character recognition. Recognition errors are
corrected using lexicons or spelling checkers, and the recognized characters are output in a
structured text format.
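Lexicon-based correction can be sketched with Python's standard `difflib` module, which returns the lexicon entries most similar to a recognized word. The lexicon below reuses the words from the segmentation example; the function name `correct_word` and the similarity cutoff of 0.6 are illustrative assumptions.

```python
import difflib


def correct_word(word, lexicon, cutoff=0.6):
    """Replace word with its closest lexicon entry, or keep it if nothing is close enough."""
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


lexicon = ["the", "best", "way", "of", "understanding", "life"]

recognized = ["the", "bcst", "way", "of", "understanding", "1ife"]
corrected = [correct_word(word, lexicon) for word in recognized]
# corrected is ["the", "best", "way", "of", "understanding", "life"]
```

Words with no sufficiently similar lexicon entry are left unchanged, so names and numbers that are not in the lexicon are not silently replaced.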
Financial transactions involve a huge amount of data entry. Manual processing of this data takes a
lot of time and effort while digitization of financial documents and extracting the necessary
information from them using OCR makes business processes smooth and optimized. As a result,
the OCR technology improves customer onboarding and enhances the overall customer
experience[7].
Optical character recognition uses in the banking and financial sector include the following:
Client onboarding: - Whatever financial transaction you want to perform, whether it be opening
an account, withdrawing cash or transferring money, you first need to authenticate to prove your
identity. OCR technology provides a fully automated onboarding process that consists of scanning
an identity document (e.g. ID, passport or driver’s license), extracting the necessary data using
OCR (e.g. name, date of birth, gender, photo, signature, etc.) and checking it. For example, the
OCR engine can inspect in real time whether the provided signature matches the signature on the
identity document.
Scan to pay feature: - Manual entry of payment details is error-prone and takes more time
than expected. The scan to pay feature uses optical character recognition to instantly capture
invoice data and automatically process it. The user only needs a smartphone camera to do this (for
example, you may need to take a photo of your credit card). OCR can also act as an extra security
feature when making payments. Usually, users store cardholder data in the application so that they
do not have to enter the card number and other details every time. With OCR, all you need is to
enable the OCR feature, which extracts the data in seconds for each new payment and then removes it.
Loan processing: -OCR and machine learning text recognition tools can speed up the processing
of loan and mortgage applications by up to 70 percent. Automation of data entry makes the process
of reviewing applications and approving or rejecting them much faster and more cost-effective for
the company. AI algorithms can parse the required data from the application to determine if it
should be approved or rejected based on the financial institution’s rules.
❖ Use cases of OCR in finance are not limited to the above. The technology can be used for
processing other financial documents like invoices, contracts, bills, financial reports, etc.
OCR in healthcare
OCR use cases in the healthcare industry are closely related to data management. According to the
World Economic Forum, hospitals produce an average of 50 petabytes of data per year. This data
includes medical reports, prescription forms, claims, laboratory test results, and medical records.
The digitalization of medical documents and the efficient extraction of data from them is a critical
aspect of the functioning of a healthcare institution.
By applying optical character recognition technology hospitals can translate papers into a digital
format much faster and store them as PDF documents that can be easily searched using keywords.
Electronic medical records solve one of the main problems of hospitals, the loss of medical
information about patients. Also, OCR allows data to be pulled from certificates or test results and
sent to hospital information management systems (HIMS) for integration into patient records thus
forming a complete medical history of patients.
Pharmaceutical systems can take advantage of OCR as well. Powered with an OCR module such
systems allow you to scan medical prescriptions and import them into software to check the
presence of the medicine in pharmacy databases or even use it to control picking robots.
OCR in security
Almost any industry can take advantage of OCR as part of its security strategy. Using OCR
powered by machine learning, companies have a chance to build advanced user authentication and
verification systems. Usually, documents are compared manually with the provided personal info and
a selfie to verify the authenticity of the identifier presented by the user. The OCR model
eliminates this manual effort by scanning ID cards, passports or driver’s licenses and checking
their authenticity by comparing them with the info in the database.
In this case, the OCR engine must first recognize the document type. For example, if a user chooses
to authenticate with a driver’s license, the document they upload to the system must conform to
that document format. Then the system should analyze and process uploaded user documents to
get relevant data.
There are many benefits to optical character recognition. Computers can now recognize and
analyze scanned or photographed images of text thanks to this technology. In order to make copies
of a document without having to completely retype it, it is used to convert scanned documents or
PDFs into editable files. Increased output, higher data entry accuracy, lower operational and
storage costs, better compliance, and data recovery are some benefits of optical character
recognition[8].
2. Enhanced Data Entry Accuracy: - Inaccuracy is one of the most difficult aspects of data
entry. Automated data input methods reduce mistakes and inaccuracies, which results in more
efficient data entry. Furthermore, automatic data entry can address issues such as data loss.
Because there is no human intervention, problems such as inadvertently or intentionally entering
incorrect information can be avoided.
3. Reduced Storage Space Costs: - One of the main advantages of optical character
recognition, and one of the primary reasons why firms invest in such solutions, is cost
reduction. Paper documents can require huge physical storage facilities to be preserved and
kept for as long as the company requires them. The cost of storage is greatly reduced
when you digitize documents and store them in the cloud or on your internal servers.
4. Reduced Costs: - There are many ways in which OCR can reduce operational costs.
One way is by automating the process of data entry, which reduces the time
and money spent on manual data entry. Additionally, OCR can help to improve the
accuracy of data entry, which saves the time and money that would otherwise be spent on
correcting errors.
6. Conclusion
Optical Character Recognition (OCR) is an interesting and challenging field of research in pattern
recognition, artificial intelligence and machine vision and is used in many real-life applications.
Optical character recognition is a type of document analysis in which a scanned document image
containing either machine-printed or handwritten script is input to an OCR software engine and
translated into an editable, machine-readable digital text format. The basic processing steps of
optical character recognition are as follows. First, image acquisition provides the scanned
document image that is the input to the OCR system. Second, pre-processing makes the scanned input
image suitable for further processing, using filtering, normalization, binarization, skew
correction and slant removal techniques. Third, segmentation isolates lines, words and characters.
Fourth, feature extraction represents each character as a feature vector, which becomes its
identity. Fifth, classification is the decision-making step that makes use of the features
extracted in the previous stage. The final step is post-processing. The applications of optical
character recognition are numerous: reading postal addresses, bank check amounts and forms, data
entry, text entry and process automation. Furthermore, OCR plays an important role in digital
libraries, allowing textual information in images to be entered into computers through
digitization and recognition methods.
7. References
[1] A. Chaudhuri, G. Tokyo, P. Badelia, and S. K. Ghosh, Optical Character Recognition
Systems for Different Languages with Soft Computing. 2017.
doi: 10.1007/978-3-319-50252-6.
[2] G. Y. Tawde and J. M. Kundargi, “An Overview of Feature Extraction Techniques in
OCR for Indian Scripts Focused on Offline Handwriting,” vol. 3, no. 1, pp. 919–926,
2013.
[3] C. C. Tappert, S. Cha, and E. Systems, “English Language Handwriting Recognition
Interfaces Historical Overview of Consumer Text Entry Technologies,” 2007.
[4] A. K. B. Karamjeet Kaur, “Review on Segmentation of Touching and Broken Characters
for Handwritten Gurmukhi Script.”
[5] K. Takeaways, “Optical Character Recognition Technology Guide for Business Owner.”
[6] D. T. Disha Bhattacharjee, Rubi Debnath, “A Novel Approach for Character Recognition.”
[7] R. R. Herekar, “Handwritten Character Recognition Based on Zoning Using Euler
Number for English Alphabets and Numerals,” vol. 16, no. 4, pp. 75–88, 2014.
[8] “Information Management Simplified,” 2022.