Confluence 2018 8442875


An Algorithmic Approach for Text Recognition from Printed/Typed Text Images


Neha Agrawal
Department of Computer Science and Engineering
Amity School of Engineering and Technology
Amity University Uttar Pradesh
Noida, India
nehaagrwl@hotmail.com

Arashdeep Kaur
Department of Computer Science and Engineering
Amity School of Engineering and Technology
Amity University Uttar Pradesh
Noida, India
akaur@amity.edu

Abstract— Extraction of text from scanned copies of documents and text images is an important task in the recent scenario. Optical Character Recognition (OCR) is used to analyze text in images. The proposed algorithm takes a scanned copy of a document as input and extracts text from the image into a text format, using Otsu's algorithm for segmentation and the Hough transform method for skew detection. The system was confined to recognizing English alphabets (A-Z, a-z) and numerals (0-9). The OCR technique has been implemented to recognize characters. Validation tests were done on screenshots of typed text and images of scanned documents from Internet sources. Experimental results indicate that the proposed algorithm is able to recognize alphabets written in Verdana font style with size 14 and also showed good results with rotated images. The average accuracy in determining the rotation angle correctly was calculated to be 90%, and the overall system accuracy was calculated to be 93%.

Keywords—OCR; Otsu's algorithm; Hough transform; English alphabets; skew detection

I. INTRODUCTION

People deal with chunks of information daily. Much data nowadays is sent in the form of images or scanned documents. Many situations require editing of data sent in the form of images. Thus, we need tools to convert these documents into editable forms. Hence, recognition of text from printed/typed text images is an important task. The purpose of this paper is to identify methods to convert printed text images into editable documents. The primary objective is to extract characters from printed or typed text images. Nowadays, various techniques and algorithms are available to do the same. Optical character recognition is one of the popular techniques used for recognizing text. It is a technique used to convert scanned images of printed or handwritten text into character streams that can be read by machines and then edited for further use.

After converting a printed or typed text image into a machine-readable format, it becomes easier to perform other required tasks. One can search for a word or phrase, edit the content, send it as an email, use it on a web page, etc.

A lot of research has already been done in this area. To recognize characters, techniques like Otsu's method have been used for image segmentation and the Hough transform for skew detection. For the detection of characters, pattern recognition techniques are used.

In 1929, a patent on OCR was granted to Gustav Tauschek in Germany. After him, Paul W. Handel received a US patent in 1933. Later on, Tauschek also obtained a US patent for his work. The machine developed by Tauschek used a photo detector and templates. The work on the first primitive OCR was carried out by RCA engineers in 1949. The purpose was to assist visually impaired people for the US Veterans Administration. However, their device did not only transform printed text to machine language; it also spoke the letters. It was not used further for testing because the costs were very high [1].

A lot of work has also been carried out on document image defect models. To capture a few aspects of the known physics of machine printing and imaging of text, Sarkar et al. [2] used a ten-parameter image degradation model. This work showed that document image decoding (DID) can help one to attain high accuracy with little manual effort, even when image degradation is severe. They performed extensive experiments with all kinds of images, and the results came out with 99% accuracy even on heavily degraded images. This has reduced the manual effort of DID, as it can be used on low-quality images without any manual segmentation.

In order to improve the accuracy of the proposed algorithm, document image enhancement is necessary. Cannon et al. [3] produced a system to enhance and restore images, known as QUARC. The images used were documents prepared using a typewriter. To automatically improve the quality of document images for OCR processing, they used quality measures and image restoration techniques. The quality measures used were small speckle factor, touching character factor, white speckle factor, broken character factor, and font size factor. The system took a document image as input and characterized it using the quality measures. Then an image restoration filter was chosen using a pre-trained linear classifier. It was claimed that this system improved the character error rate and word error rate by 38% and 24%, respectively.

Later on, Summers et al. employed a similar method in their software, called ImageRefiner [4].

978-1-5386-1719-9/18/$31.00 ©2018 IEEE 876
While QUARC addressed only Latin scripts, they improved the system by also considering non-Latin scripts and non-fixed fonts. The measures of size, height, area, width, and aspect ratio were added to the existing quality measures of QUARC. In a similar manner to QUARC, ImageRefiner also determined an optimal restoration method using document image quality measures, but they changed their classifier to employ an error propagation neural net.

When dealing with text recognition, it is very important that text lines are identified in a precise and proper manner. Various methods have been produced for text line finding. Breuel [5] proposed an algorithm that finds the accurate page rotation and then identifies text lines. A branch and bound approach to optimal line finding was used by him. It is easy to implement and has only five parameters.

Bai and Huo [6] proposed a method to extract the target text line from an image captured by a hand-held pen scanner. The method proposed by them was tested and verified by experiments on a testing database of 117 document images. The results were satisfying.

In this paper, an algorithm to convert scanned text images into editable format with maximum efficiency is proposed. The proposed algorithm uses techniques like Otsu's method and the Hough transform, with some modification, to enhance images, perform skew removal operations on images, and extract text lines efficiently to detect and recognize characters. Despite using the simplest techniques, the proposed algorithm showed good results against rotation and scaling after some modification. The proposed algorithm deals with English alphabets (a-z, A-Z) and numerals (0-9).

The rest of the paper is organized as follows.

II. PROPOSED ALGORITHM

The proposed algorithm involves the following steps: preprocessing of images, segmentation of images into individual characters, and recognizing characters. Fig. 1 shows the flow diagram for the proposed algorithm.

Fig. 1: Flow Diagram of Proposed Algorithm

A. Creation of Database

For testing of the algorithm, a database with templates of English characters of size 24 x 42 pixels was created. The database also consisted of 100 test images to test the accuracy of the algorithm.

B. Preprocessing of Images

The scanned image of a document from which text is to be extracted is the input for the proposed algorithm. In the preprocessing step, the image is converted to binary format to make it easily readable. Most document images are skewed and contain a lot of background noise. Thus, skew removal and noise reduction are important steps performed before we perform segmentation of images.

For skew detection, the Hough transformation is used. The Hough transformation helps to find lines in every direction. It helps to identify the presence of a line described by equation (1) using a two-dimensional array. Despite being computationally expensive, the Hough transformation is a robust method to detect lines in images. Detection of lines in images helps to find the line which is at the maximum angle, and hence the angle by which the image has to be rotated.

r = x cos θ + y sin θ      (1)

After rotation of images, noise removal is performed. For this, various filters are used: a median filter is used to remove salt-and-pepper noise, a Wiener filter is used to smoothen the image, and then morphological operators are used to identify text portions to be removed. This image is then subtracted from the original to get an enhanced image.

C. Segmentation of Images

The process of image segmentation to detect characters follows noise removal. It involves the following steps:
i. First of all, the first line of text is cropped from the image.
ii. After the line has been cropped, the line is processed and individual characters are cropped from the line.
iii. The second step is repeated until all characters are cropped out.
iv. The above three steps are repeated until the entire image has been read, up to the last line.

Fig. 2 below shows the various steps used in the process of segmentation of images according to the algorithm proposed above.

Fig. 2: (a) Original image, (b) Image after preprocessing, (c) Cropped lines from image, (d) Cropped characters from each line

D. Extraction and Recognition of Characters

The system is trained with templates of Verdana font type with size 14. Thus, the characters are recognized
2018 8th International Conference on Cloud Computing, Data Science & Engineering (Confluence) 877
based on matching characters with the templates used for training.

In the previous step of segmentation, individual characters were cropped. After a character has been cropped, it is resized to the size of the template. After that, the resized image is compared with the templates using a correlation factor. The template with which the resized image shares the maximum correlation factor is identified as the character represented by the image. The templates used to train the system are shown in Fig. 3, and Fig. 4 shows the templates created after training for the purpose of matching characters.

After a character is detected, it is written into a text file.

Fig. 3: Templates used to train the system

Fig. 4: (a) Templates used for matching characters, (b) Characters cropped from image, (c) Characters resized to match the size of the template

III. EXPERIMENTAL RESULTS

The proposed algorithm was tested and validated on different images from different sources. For validation with the same font type as the templates, screenshots of typed text were used. The accuracy in this case is almost 95-100% for ideal images and 85-100% for non-ideal images. Thus, the average accuracy is found to be 93%. Testing on other images from Internet sources was also done, but the results were not as satisfying: template matching recognizes characters of different font styles with an average accuracy of 70%.

Rotation Accuracy = (No. of Error Bits / Total No. of Bits) × 100%      (2)

The proposed algorithm showed good results against rotation. Table 1 gives the rotation accuracy calculated against images rotated at different angles, using the formula given by equation (2).

Table 1: Rotation accuracy observed against different angles of rotation

Rotation angle    Rotation Accuracy
180°              88%
130°              92%
44°               90%
160°              93%
Average Accuracy  90%

One of the samples before and after skew detection and removal is shown in Fig. 5.

Fig. 5: (a) Original image with skew, (b) Image after skew detection and removal

The recognition accuracy for different images was calculated using the formula given by equation (3).

% Accuracy = (No. of Characters Correctly Identified / Total No. of Characters) × 100%      (3)

The average accuracy for text images with text written in Verdana font of size 14 was found to be 93%. One of the samples with the same font as the template is shown in Fig. 6, along with the extract of text obtained after text recognition by applying the proposed algorithm.

Fig. 6: (a) Original image with same font as the template, (b) Text file created with the content of the image after complete execution

For other images that used a font other than Verdana, the average accuracy was calculated to be 70%. One of the samples with text written in a font other than Verdana, along with the extract of text obtained after text recognition by applying the proposed algorithm, is shown in Fig. 7.

Fig. 7: (a) Original image with a different font than the template, (b) Text file created with the content of the image after complete execution

IV. COMPARISON WITH EXISTING ALGORITHMS

Table 2 gives a quick comparison with some existing papers.
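The main building blocks described above (Otsu binarization, skew-angle search, and correlation-based template matching) can be sketched in Python with NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function names are ours, and the skew estimator scores projection profiles over candidate angles rather than building a full Hough accumulator for equation (1).

```python
import numpy as np

def otsu_threshold(gray):
    """Exhaustive-search Otsu threshold for a uint8 grayscale image.

    Returns the threshold t that maximizes the between-class variance,
    as used here to binarize the scanned page before segmentation.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    mu_total = float(np.dot(np.arange(256), hist))
    best_t, best_var = 0, -1.0
    w0 = 0.0   # cumulative weight of the lower class
    mu0 = 0.0  # cumulative (unnormalized) mean of the lower class
    for t in range(256):
        w0 += hist[t]
        mu0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        # unnormalized w0*w1 only rescales the objective; argmax is unchanged
        var_between = w0 * w1 * (mu0 / w0 - (mu_total - mu0) / w1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def estimate_skew(binary, angles):
    """Pick the candidate angle whose rotated projection of foreground
    pixels is most sharply peaked (text lines collapse into few bins).

    A projection-profile stand-in for the paper's Hough-based search.
    """
    ys, xs = np.nonzero(binary)
    best_angle, best_score = 0.0, -1.0
    for ang in angles:
        th = np.deg2rad(ang)
        # distance of each pixel from the x-axis after undoing a skew of ang
        r = ys * np.cos(th) - xs * np.sin(th)
        hist, _ = np.histogram(r, bins=32)
        score = float((hist.astype(float) ** 2).sum())  # peakedness
        if score > best_score:
            best_score, best_angle = score, float(ang)
    return best_angle

def match_character(char_img, templates):
    """Return the label of the template with the highest normalized
    correlation with char_img (already resized to the template size)."""
    a = char_img.astype(float).ravel()
    a -= a.mean()
    best_label, best_corr = None, -2.0
    for label, tpl in templates.items():
        b = tpl.astype(float).ravel()
        b -= b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        corr = float(a @ b / denom) if denom > 0 else 0.0
        if corr > best_corr:
            best_corr, best_label = corr, label
    return best_label
```

In a complete system the matched labels would then be written to a text file in reading order, as in Section II.D; the template dictionary would hold the 24 x 42 pixel Verdana glyphs from the training database.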

Table 2: Comparison of recognition accuracy between existing algorithms and the proposed algorithm

Existing Techniques   Accuracy
[7]                   90%
[8]                   91%
[9]                   76%
Proposed Algorithm    93%

It can be observed from the table that the proposed algorithm showed good results against existing algorithms for recognizing text.

V. CONCLUSIONS

The proposed algorithm is successfully able to recognize characters from text images with an average accuracy of 93%. It also showed good results against rotation; the average rotation accuracy in correctly rectifying skew in images was calculated to be 90%. The proposed algorithm has also shown good results against scaling and was also able to reduce noise in images to a good extent. These were the results for images that contained text in the same font style as the template. For the remaining images, containing text written in some font other than that of the template, the average recognition accuracy was calculated to be 70%. The proposed algorithm can be improved for other font sizes and font styles by training the system with more font types.

REFERENCES

[1] Pardeep Kaur and Pooja Choudhary, "Review on: English Scanned Documents", International Journal of Engineering Research-Online, Vol. 3, Issue 2, pp. 60-65, 2015.

[2] Prateek Sarkar, Henry S. Baird, and Xiaohu Zhang, "Training on severely degraded text-line images", in Proceedings of the International Conference on Document Analysis and Recognition, volume I, pp. 38-43, August 2003.

[3] Michael Cannon, Judith Hochberg, and Patrick Kelly, "Quality assessment and restoration of typewritten document images", Technical Report LA-UR 99-1233, Los Alamos National Laboratory, 1999.

[4] Kristen Summers, "Document image improvement for OCR as a classification problem", in T. Kanungo, E. H. B. Smith, J. Hu, and P. B. Kantor, editors, Proceedings of SPIE/IS&T Electronic Imaging Conference on Document Recognition & Retrieval X, volume 5010, pp. 73-83, Santa Clara, CA, January 2003. SPIE.

[5] S. P. Chowdhury, S. Mandal, A. K. Das, and Bhabatosh Chanda, "Automated segmentation of math zones from document images", in Proceedings of the International Conference on Document Analysis and Recognition, volume II, pp. 755-759, August 2003.

[6] Zhen-Long Bai and Qiang Huo, "An approach to extracting the target text line from a document image captured by a pen scanner", in Proceedings of the International Conference on Document Analysis and Recognition, volume I, pp. 76-81, August 2003.

[7] Velappa Ganapathy and Charles C. H. Lean, "Optical Character Recognition Program for Images of Printed Text Using a Neural Network", IEEE, pp. 1171-1176, 2006.

[8] Smruti Rekha Panda and Jogeshwar Tripathy, "Odia Offline Typewritten Character Recognition using Template Matching with Unicode Mapping", International Symposium on Advanced Computing and Communication, pp. 109-115, 2015.

[9] P. Iyer, A. Singh, and S. Sanyal, "Optical Character Recognition for Noisy Images in Devanagari Script", UDL Workshop on Optical Character Recognition with workflow and Document Summarization, 2004.

[10] Jonathan J. Hull, "Document Image Skew Detection: Survey and Annotated Bibliography", World Scientific, pp. 40-64, 1998.

[11] Su Chen and Robert M. Haralick, "An automatic algorithm for text skew estimation in document images using recursive morphological transforms", in Proc. of ICIP, pp. 139-143, 1994.

[12] Cheng-Lin Liu, Kazuki Nakashima, Hiroshi Sako, and Hiromichi Fujisawa, "Handwritten digit recognition: investigation of normalization and feature extraction techniques", Pattern Recognition, Vol. 37, No. 2, pp. 265-279, 2004.
