
Assessing and Analyzing Application of Tesseract and Artificial Neural Network to Nepali Script OCR

Sudan Prajapati, Patan Multiple Campus, sudon.san@gmail.com
Aman Maharjan, Amrit Science Campus, aman.maharjan@gmail.com
Prof. Dr. Shashidhar Ram Joshi, IOE, Pulchowk Campus, srsrjoshi@ioe.edu.np
Asst. Prof. Bikash Balami, Tribhuvan University, bikuji@gmail.com

Abstract - This paper focuses on character recognition of printed text in Nepali script. It analyzes
the efficiency of Nepali OCR based on the Tesseract engine and on Artificial Neural Network (ANN)
techniques. The investigation is benchmarked on a dataset created from 69 different fonts, comprising
2,484 samples of Devanagari consonants. With Tesseract, an overall accuracy of 96% was obtained in
the training phase and 69% in the testing phase. Similarly, with ANN, an accuracy of 98% was obtained
in training and 81% in testing.
Index Terms - Optical Character Recognition, Nepali Script, Tesseract, ANN, Nepali Font, Character
Segmentation, Character Recognition, Feedforward Neural Network, Backpropagation.
INTRODUCTION
The recognition of characters by machines has been a research topic for decades. Before the age of
digital computers, not much research was done in the field of character recognition. In early research
works, printed character recognition generally used template matching; low-level image processing
techniques were applied to the binary image to extract feature vectors, which were then fed to statistical
classifiers.
Optical Character Recognition (OCR) is defined as the mechanical or electronic translation of images of
handwritten, typewritten or printed text into machine-editable text. Automatic text recognition using
OCR is the process of converting images containing text into an equivalent string of characters. OCR is
one of the most challenging topics in the field of pattern recognition [1]. OCR technologies are used for
various purposes such as storing documents, searching text, retrieving information from paper-based
documents and documenting library materials. OCR can be categorized into three types: offline
handwritten text recognition, online handwritten text recognition and machine-printed text recognition.
The Tesseract OCR engine is an open source OCR engine developed at HP Labs [2]. It was originally
developed for English but was later extended to recognize various other languages such as Arabic,
French, Italian, Catalan, Czech, Danish, Polish, Bulgarian, Russian, Japanese, Chinese and Devanagari.
Training the Tesseract OCR engine for any new script requires in-depth knowledge of the script and its
character set.
An ANN is a nonlinear, parallel, distributed and highly connected mathematical model. It is adaptive,
self-organized and fault tolerant, and has VLSI implementations that closely resemble the physical
nervous system. ANN systems can perceive and recognize a character based on its topological features
such as shape, symmetry, closed or open areas, and number of pixels. The advantage of such a system is
that it can be trained on some sample data and then used to recognize characters having a similar
(not exact) feature set. An ANN receives its inputs in the form of feature vectors, i.e. every feature or
property is separated and assigned a numerical value [3].
The way OCR techniques are applied in Tesseract is very different from that in ANN. Both methods
need a comprehensive and comparative study to deal with the character recognition problem.

PROBLEM DEFINITION
Devanagari script is used for written Nepali as well as many other languages of Nepal and India.
It originated from the ancient Brahmi script, which played a major historical role in the
development of literature and documentation.

TABLE I
NEPALI CONSONANT CHARACTERS
क ख ग घ ङ च छ ज झ ञ
ट ठ ड ढ ण त थ द ध न
प फ ब भ म य र ल व
श ष स ह क्ष त्र ज्ञ

TABLE II
NEPALI VOWEL CHARACTERS
अ आ इ ई उ ऊ ए ऐ ओ औ अं अः

TABLE III
NEPALI NUMERIC CHARACTERS
० १ २ ३ ४ ५ ६ ७ ८ ९

TABLE IV
NEPALI MODIFIER CHARACTERS
ा ि ी ु ू ृ े ै ो ौ ं ः ँ ्

[Figure: the word अभूतपूर्व with its header line, top strip, core strip and bottom strip labelled]
FIGURE 1
NEPALI WORD

The Nepali language has 12 vowels, 36 consonants, 10 numerals and 14 modifiers (tables I-IV). Unlike
languages such as English, it also has many compound characters, which are formed by combining two
or more basic characters.
Nepali is written from left to right. It is a phonetic and syllabic script and has no lowercase or
uppercase characters. The distinctive feature of Nepali script is the presence of a horizontal line on
top of the characters, known as the header line, shirorekha or diko [4]. Words can typically be divided
into three strips: a core strip (middle zone), a top strip (upper zone), and a bottom strip (lower zone).
When two or more characters appear side by side to form a word, their header lines touch and form a
longer header line (figure 1).
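The role of the header line in segmentation can be illustrated with a small sketch. It assumes a binary word image stored as a list of rows (1 for ink, 0 for background) and uses a common heuristic: take the row with the most ink as the header line, ignore it, and split characters at columns that are otherwise empty. The function names and the toy image are hypothetical, and this is not necessarily the segmentation procedure used in this study.

```python
def find_header_row(word):
    """Return the index of the row with the most foreground (1) pixels."""
    row_sums = [sum(row) for row in word]
    return row_sums.index(max(row_sums))


def split_characters(word):
    """Split a binary word image into half-open (start, end) column ranges.

    The header row is ignored; runs of columns that still contain ink in
    the remaining rows are treated as separate characters.
    """
    header = find_header_row(word)
    width = len(word[0])
    # A column has ink if any row other than the header has a 1 there.
    has_ink = [
        any(word[r][c] for r in range(len(word)) if r != header)
        for c in range(width)
    ]
    ranges, start = [], None
    for c, ink in enumerate(has_ink):
        if ink and start is None:
            start = c
        elif not ink and start is not None:
            ranges.append((start, c))
            start = None
    if start is not None:
        ranges.append((start, width))
    return ranges


# Toy 5x9 word image: a header line on row 0 joins two glyphs.
word = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0],
]
print(find_header_row(word))   # 0
print(split_characters(word))  # [(1, 4), (6, 8)]
```

In scanned text the header line is several pixels thick and modifiers can sit above or below it, so a single-row heuristic like this one is only a starting point.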
Nepali character recognition using OCR faces many problems. One of the major problems is
segmenting characters. This problem arises from modifiers, combined characters, variability of
character size and so on [4].
Various approaches to character recognition can be applied to Nepali OCR, such as gradient features,
template-based formulation, and classification techniques like ANN, HMM and SVM [2][3][4]. ANN
and the Tesseract OCR engine are used for character recognition in this study.

LITERATURE REVIEW
Early attempts at character recognition were limited by the unavailability of powerful computing
devices and digital image capturing devices [2]. Powerful hardware became commonplace during the
1980s, which subsequently accelerated research in both online and offline OCR [1].
Image processing and pattern recognition techniques powered by artificial intelligence and various
intelligent learning algorithms like ANN, HMM and fuzzy set reasoning are commonly used in OCR [5].
The ultimate goal of an OCR system is to convert a printed or handwritten scanned document into
machine-encoded text with negligible error. An electronic device equipped with an OCR system can
improve the speed of input operations and decrease possible human errors. OCR systems can reduce
the use of the keyboard and act as the interface between man and machine to a great extent. They also
help in office automation, leading to a large saving of time and human effort.
The Nepali language is written in the Devanagari script, and hence the unique characteristics of the
Devanagari script reported in [5][6][7] apply to Nepali printed text as well. Research in Devanagari
OCR has been done extensively by many people in the past. Similar achievements have been reported
for the Bangla script, which is similar to the Devanagari script in many respects [6][7].

[Figure: Image Acquisition -> Preprocessing -> Feature Extraction -> Classification]
FIGURE 2
STEPS OF SYSTEM IMPLEMENTATION

This study is focused towards making use of the already available techniques in OCR with slight
modifications necessary for Nepali language.
SYSTEM FUNCTIONING
Character recognition in the proposed system is divided hierarchically into four stages: image
acquisition, preprocessing, feature extraction and classification (figure 2). Each stage is described
below.
Image Acquisition
Images were acquired by typing characters in a word processor, printing them and scanning the
printouts at 300 DPI resolution. Character samples of different fonts were collected in PDF format.
Preprocessing
A raw image may contain some noisy pixels. They can be removed using a median filter. The resulting
image is then ready for segmentation. Segmentation is the process of separating individual character
images from the digitized image. The words in Nepali script can be partitioned into three zones: (1)
an upper zone that contains the header line (diko or shirorekha), (2) a middle zone that covers the
portion of basic and compound characters below the header line, and (3) a lower zone containing some
consonants and vowel modifiers. Segmentation is carried out in three steps: line segmentation, word
segmentation and character segmentation. The segmented characters are then rescaled, normalized,
converted to binary representation and skeletonized [8].
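The first steps of this pipeline can be sketched as follows. The example assumes a grayscale page stored as a list of rows with intensities 0-255 (dark ink on a light background), a 3x3 median window, a fixed binarization threshold of 128 and line segmentation by blank rows. All of these are illustrative choices rather than the parameters used in this study, and rescaling and skeletonization are left out.

```python
from statistics import median


def median_filter(img):
    """3x3 median filter; border pixels are copied unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = median(window)
    return out


def binarize(img, threshold=128):
    """Map dark pixels (below threshold) to 1 (ink) and the rest to 0."""
    return [[1 if p < threshold else 0 for p in row] for row in img]


def segment_lines(binary):
    """Return half-open (top, bottom) row ranges separated by blank rows."""
    ranges, start = [], None
    for y, row in enumerate(binary):
        if any(row) and start is None:
            start = y
        elif not any(row) and start is not None:
            ranges.append((start, y))
            start = None
    if start is not None:
        ranges.append((start, len(binary)))
    return ranges


# Toy page: two dark text bands and one isolated noise pixel on row 4.
W, B = [255] * 5, [0] * 5
page = [W[:], B[:], B[:], W[:], [255, 255, 0, 255, 255], W[:], B[:], B[:]]

print(segment_lines(binarize(page)))                 # [(1, 3), (4, 5), (6, 8)]
print(segment_lines(binarize(median_filter(page))))  # [(1, 3), (6, 8)]
```

Without the filter the noise pixel is reported as a spurious third text line. The same run-finding idea, applied to columns instead of rows, gives a simple word segmentation.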
Feature Extraction
After the image is preprocessed, feature vectors can be extracted for training and testing. Feature
extraction plays an important role in recognition because the features are what distinguish one
character from another.
For Tesseract, the features of a prototype are 4-dimensional (x, y, angle, length), with 10-20 features
in a prototype configuration. The features of an unknown character are 3-dimensional (x, y, angle), and
each character has approximately 50-100 features. The normalized features of the unknown character are
computed by tracing around the outline of the blob twice. The x, y position and the angle between
the tangent line and a vector pointing eastward from the center of the blob are saved as features.
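To make the 3-dimensional (x, y, angle) feature concrete, the sketch below walks a closed polygonal outline and records, for each edge, its midpoint and its direction relative to an eastward (+x) vector. This is a simplified illustration of the idea only. It is not Tesseract's implementation, it applies no normalization, and the function name and the square outline are hypothetical.

```python
import math


def outline_features(outline):
    """Return (x, y, angle) for each edge of a closed polygon outline.

    x, y is the edge midpoint; angle is the edge direction in radians,
    measured counter-clockwise from a vector pointing east (+x).
    """
    feats = []
    n = len(outline)
    for i in range(n):
        (x0, y0), (x1, y1) = outline[i], outline[(i + 1) % n]
        angle = math.atan2(y1 - y0, x1 - x0)
        feats.append(((x0 + x1) / 2, (y0 + y1) / 2, angle))
    return feats


# A 2x2 square traced counter-clockwise gives four features.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
features = outline_features(square)
for f in features:
    print(f)
```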
For ANN, a bounding box with its coordinates and the corresponding Unicode character is created
manually for each character. The bounding box file is read line by line to obtain the coordinates of each
character; the grayscale intensities inside the box are computed and converted into a 50 × 50 matrix.
This matrix is then flattened into a 1-dimensional feature vector of 2,500 elements and used to train
on characters of different typefaces.
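A minimal sketch of this conversion is given below. It assumes the grayscale image is a list of rows with intensities 0-255 and that a box is given as (left, top, right, bottom) with exclusive right and bottom edges. It also assumes nearest-neighbour resampling and scaling to the range 0-1, neither of which is specified in the text, and it does not parse a bounding box file.

```python
def to_feature_vector(gray, box, size=50):
    """Crop `box` from a grayscale image, resample to size x size, flatten.

    gray: list of rows of 0-255 intensities.
    box:  (left, top, right, bottom), right/bottom exclusive (assumed).
    Returns a list of size*size floats in [0, 1].
    """
    left, top, right, bottom = box
    h, w = bottom - top, right - left
    vec = []
    for r in range(size):
        src_y = top + r * h // size       # nearest-neighbour source row
        for c in range(size):
            src_x = left + c * w // size  # nearest-neighbour source column
            vec.append(gray[src_y][src_x] / 255.0)
    return vec


# Toy 4x4 checkerboard "character" occupying the whole image.
gray = [
    [0, 255, 0, 255],
    [255, 0, 255, 0],
    [0, 255, 0, 255],
    [255, 0, 255, 0],
]
vec = to_feature_vector(gray, (0, 0, 4, 4))
print(len(vec))  # 2500
```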
Classification
Classification and recognition on a particular dataset depend on the selection of the features and
of a classifier that can recognize a particular character pattern. The classifier is called the 'heart'
of a pattern recognition system. It takes the feature vectors and carries out the decision-making
process for the recognition of patterns.
EVALUATION AND RESULT
The main aim of this study is to experimentally verify and evaluate the performance of Tesseract and
ANN in optical character recognition of Nepali characters. The experiment is done on the consonants of
Nepali script, assuming the same procedure applies to the vowels and numerical digits. The dataset
was created from 69 typefaces and contains 2,484 individual characters (69 typefaces × 36 consonants)
for training and testing purposes. Training and testing were each carried out 10 times using random
samples from the dataset.
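The repeated random sampling can be sketched as below, assuming the 80% training and 20% testing split reported with Tables V and VI. The `train_and_score` callback is hypothetical: it stands in for training either recognizer on one split and returning its accuracy on the other. The dummy scorer in the usage example only demonstrates the split sizes.

```python
import random


def repeated_holdout(samples, train_and_score, runs=10, train_fraction=0.8, seed=0):
    """Average a score over several random train/test splits.

    train_and_score(train, test) is a caller-supplied (hypothetical) function
    that trains a recognizer on `train` and returns its accuracy on `test`.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        scores.append(train_and_score(shuffled[:cut], shuffled[cut:]))
    return sum(scores) / len(scores)


# Usage with 2,484 placeholder samples and a dummy scorer that returns
# the share of samples that ended up in the test split.
samples = list(range(2484))
share = repeated_holdout(samples, lambda train, test: len(test) / len(samples))
print(round(share, 4))  # 0.2001 (497 of 2,484 samples are held out per run)
```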
From the experiments, it was observed that segmentation and skeletonization play a vital role in
character recognition with both ANN and Tesseract. Due to the peculiarities of Nepali script,
segmentation of some consonants is not effective, i.e. the threshold value for the upper line
(shirorekha) does not work for some consonants. This warrants more intelligent segmentation
techniques for Nepali script. Had a good segmentation algorithm been available, the comparative study
of ANN and Tesseract would have been more effective.
Tables V and VI show the training and testing performance of Tesseract and ANN. Each experiment was
carried out 10 times, and the tables report the average time and accuracy of ANN and Tesseract on the
training and testing data sets.
For ANN, parameters such as the number of epochs, the number of hidden neurons and the learning rate
were tuned empirically. The best result was obtained at 45 epochs, with a network of 2 hidden layers
containing 50 neurons each and a learning rate of 0.01.
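A feedforward network with these reported settings (2 hidden layers of 50 neurons, learning rate 0.01, 45 epochs) can be sketched in plain Python as follows. Several details are assumptions not stated in the text: sigmoid hidden units, a softmax output with cross-entropy loss, per-sample updates and the weight initialization. The toy data set is hypothetical and far smaller than the 2,500-element character vectors, so this shows the training procedure rather than the reported accuracy.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Weights (n_out x n_in) with small random values, and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    return ([[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def forward(layers, x):
    """Return the activations of every layer; the last is class probabilities."""
    acts = [x]
    for i, (W, b) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, acts[-1])) + bi for row, bi in zip(W, b)]
        acts.append(softmax(z) if i == len(layers) - 1 else [sigmoid(v) for v in z])
    return acts


def train_step(layers, x, target, lr):
    """One backpropagation update for a single sample."""
    acts = forward(layers, x)
    # Output error for softmax + cross-entropy: p - one_hot(target).
    delta = [p - (1.0 if k == target else 0.0) for k, p in enumerate(acts[-1])]
    for i in range(len(layers) - 1, -1, -1):
        W, b = layers[i]
        a_in = acts[i]
        if i > 0:
            # Propagate the error backwards before this layer's weights change.
            back = [sum(W[k][j] * delta[k] for k in range(len(delta)))
                    for j in range(len(a_in))]
            next_delta = [g * a * (1.0 - a) for g, a in zip(back, a_in)]
        for k, d in enumerate(delta):
            for j, a in enumerate(a_in):
                W[k][j] -= lr * d * a  # gradient descent step
            b[k] -= lr * d
        if i > 0:
            delta = next_delta


def train(samples, n_in, n_classes, hidden=(50, 50), epochs=45, lr=0.01, seed=0):
    rng = random.Random(seed)
    sizes = [n_in, *hidden, n_classes]
    layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            train_step(layers, x, y, lr)
    return layers


def predict(layers, x):
    probs = forward(layers, x)[-1]
    return probs.index(max(probs))


# Toy data: 4-pixel "images" with two classes (hypothetical).
toy = [([1.0, 1.0, 0.0, 0.0], 0), ([0.9, 1.0, 0.1, 0.0], 0),
       ([0.0, 0.0, 1.0, 1.0], 1), ([0.1, 0.0, 1.0, 0.9], 1)]
layers = train(toy, n_in=4, n_classes=2)
print(forward(layers, toy[0][0])[-1])  # two class probabilities summing to 1
```

For the character data described above, `n_in` would be 2,500 and `n_classes` 36. A pure Python loop would then be slow, so a numerical library would be the practical choice.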
Table V presents the average training time and recognition accuracy of Tesseract and ANN.
80% of the data was used to train both Tesseract and ANN. For ANN, the result shows 98% average
accuracy and an average time of 1.8 minutes for training and verifying the result. Similarly, for
Tesseract it shows 96% average accuracy and an average training time of 2.1 minutes. Empirically, ANN
therefore performs better than Tesseract on the training dataset.
TABLE V
TRAINING ACCURACY AND TIME
Dataset     Study Method   Average Time   Recognition Accuracy
Consonant   Tesseract      2.1 min        96%
Consonant   ANN            1.8 min        98%

TABLE VI
TESTING ACCURACY AND TIME
Dataset     Study Method   Average Time   Recognition Accuracy
Consonant   Tesseract      30 sec         69%
Consonant   ANN            26 sec         81%

Table VI presents the average time and recognition accuracy on the test data. 20% of the data
was used for testing both Tesseract and ANN. The experimental result shows 81% average accuracy and
an average completion time of 26 seconds for ANN, against 69% average accuracy and an average
completion time of 30 seconds for Tesseract. ANN therefore performs better than Tesseract on the test
data in both time and accuracy.
Similarly, individual character recognition accuracy for training data is displayed in figure 3 and for
test data is displayed in figure 4. Both Tesseract and ANN were tested using random samples and the
results were verified with characters from all the fonts used in the dataset. The results show that
character recognition accuracy varied with each individual font but overall, ANN outperformed
Tesseract in this area too.
CONCLUSION
Tesseract follows a blob detection technique: it considers touching foreground pixels to be part of the
same blob. For individual characters, a connected component analysis is used to extract the character
outline, which is very useful because it also allows OCR of images with white text on a black
background. During the training phase, the segments of a polygonal approximation are used as features.
In the recognition phase, features of a small, fixed length (in normalized units) are extracted from the
outline and matched many-to-one against the clustered prototype features of the training data.
The neural network model underlying ANN is an abstract mathematical model inspired by the
biological neurons of the human brain. It learns by gradient descent with backpropagation.
Tesseract and ANN involve different styles of learning and recognizing characters and words. The
application of both tools to Nepali script shows varied accuracy rates. In the case of ANN, the number of
epochs, the segmentation threshold, the quality of the segmentation output, the amount of data
(different fonts) and the choice of transfer function, such as sigmoid or hyperbolic tangent, also
determined the character recognition accuracy rate.
With Tesseract, an overall accuracy of 96% was obtained in the training phase and 69% in the testing
phase. Similarly, with ANN, an accuracy of 98% was obtained in training and 81% in testing. This result
shows ANN to be more accurate than Tesseract.

[Figure: chart of training accuracy for each consonant from क to ज्ञ, with series Tesseract and ANN; vertical axis from 82 to 102]
FIGURE 3
TRAINING ACCURACY FOR INDIVIDUAL CHARACTERS

[Figure: chart of testing accuracy for each consonant from क to ज्ञ, with series Tesseract and ANN; vertical axis from 0 to 120]
FIGURE 4
TESTING ACCURACY FOR INDIVIDUAL CHARACTERS

Character recognition, especially in a language like Nepali, has been a challenging research area for
decades. Much research has been carried out and various techniques have been tried for the Devanagari
script it uses, but due to the peculiarity of the script, the recognition accuracy has not reached 100%.
This can be addressed in future studies. Similarly, research on word-level and sentence-level Natural
Language Processing (NLP) can also be carried out. A corpus of words can be created and used for
translation. Handwritten Devanagari script recognition and multilingual character recognition are
other comprehensive fields of study for future researchers.
REFERENCES
[1] Y. Lu, "Machine Printed Character Segmentation: An Overview," Pattern Recognition, vol. 28,
1995.
[2] Ray Smith, "An Overview of the Tesseract OCR Engine," Proc. of ICDAR 2007, 2007.
[3] M. Gunasekaram and S. Ganeshmoorthy, "OCR Recognition System Using Feed
Forward and Back Propagation Neural Network," Coimbatore.
[4] Bal Krishna Bal, Rajesh Pandey, Sammer Tuladhar, and Shanti Shakya, "Interim
Report on Nepali OCR," Department of Computer Science and Engineering, Kathmandu
University, 2006.
[5] Divakar Yadav, Sonia Sanchez Cuadrado, and Jorge Morato, "Optical Character
Recognition for Hindi Language Using a Neural-network Approach," J Inf Process Syst, vol. 9, no.
1, 2013.
[6] Bal Krishna Bal, "Scripts, Segmentation and OCR II Nepali OCR and Bangala
Collaboration," January 2009.
[7] Bishnu Chaulagain, Brizika Bantawa Rai, and Sharad Kumar Raya, "Final Report on
Nepali Optical Character Recognition," July 2009.
[8] Harry Blum, "A Transformation for Extracting New Descriptors of Shape," in Models for
the Perception of Speech and Visual Form, Wathen-Dunn, ed., 1964, pp. 362-380.
