
ARTICLE IN PRESS
Engineering Applications of Artificial Intelligence 21 (2008) 658–668

www.elsevier.com/locate/engappai
Multilingual OCR system for South Indian scripts and English documents: An approach based on Fourier transform and principal component analysis
V.N. Manjunath Aradhya, G. Hemantha Kumar, S. Noushath
Department of Studies in Computer Science, University of Mysore, Mysore 570006, Karnataka, India
Received 5 May 2006; received in revised form 20 May 2007; accepted 28 May 2007
Available online 12 July 2007
Abstract
Character recognition lies at the core of the discipline of pattern recognition where the aim is to represent a sequence of characters taken from an alphabet [Kasturi, R., Gorman, L.O., Govindaraju, V., 2002. Document image analysis: a primer. Sadhana 27 (Part 1), 3–22]. Though many kinds of features have been developed and their test performances on standard databases have been reported, there is still room to improve the recognition rate by developing improved features. In this paper, we present a multilingual character recognition system for printed South Indian scripts (Kannada, Telugu, Tamil and Malayalam) and English documents. South Indian languages are among the most widely used languages in India and are also spoken around the world. The proposed multilingual character recognition is based on Fourier transform and principal component analysis (PCA), two classical feature extraction and data representation techniques widely used in image processing, pattern recognition and computer vision. Our experimental results show good performance on the data sets considered.
© 2007 Elsevier Ltd. All rights reserved.
Keywords: Document analysis; Multi-lingual character recognition; South Indian languages; Fourier transform; Principal component analysis (PCA)
Corresponding author.
E-mail addresses: mukesh_mysore@rediffmail.com (V.N. Manjunath Aradhya), nawali_naushad@yahoo.co.in (S. Noushath).
0952-1976/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.
doi:10.1016/j.engappai.2007.05.009

1. Introduction

Optical character recognition (OCR) is one of the most prominent applications of automatic pattern recognition. Research in OCR is very popular because of its application potential in banks, post offices, defense organizations, reading aids for the blind, library automation, language processing and multi-media design. India is a multi-lingual, multi-script country, where a single document page (e.g., a passport application form, an examination question paper, a money order form, a bank account opening application form, a train reservation form, etc.) may contain text in two or more language scripts. OCR is of special significance for a multilingual country like India, which has 16 major state languages and over 100 regional languages.

The current challenges in the field of OCR technology are to handle poor quality documents, to recognize characters with various noise types, to recognize multi-lingual characters, and to develop OCR that can handle different fonts and sizes.

For Indian and many other oriental languages, OCR systems are not yet able to successfully recognize printed document images of varying scripts, quality, size, style, and font. In contrast to European languages, Indian languages pose many additional challenges, such as: (i) a large number of vowels, consonants, and conjuncts, (ii) most scripts spread over several zones, (iii) an inflectional nature and complex character graphemes, and (iv) a lack of standard test databases for the Indian languages.

However, most of the available systems work on European scripts which are based on the Roman alphabet (Bozinovic and Srihari, 1989; Casey and Nagy, 1968; Wang et al., 1982; Mori et al., 1992; Kahan et al., 1987). Research reports on oriental language scripts are few, except for Korean, Chinese and Japanese scripts (Govindan and
Shivaprasad, 1990). Focused documentation of the activities in this area may also be seen in Indian language document analysis and understanding (2002).1 These technical articles reveal the academic contributions to the development of OCR systems for Indian scripts and discuss specific issues to be addressed in Indian scripts. One of the major contributions in this area is the Devnagari OCR system developed by Chaudhuri and Pal (1998), which is commercially available. This is the first OCR system among all script forms used in the Indian sub-continent. The performance of the system on documents with varying degrees of noise, and under size and style variations, is also being studied. Another important OCR system proposed in the literature can read two Indian language scripts, Bangla and Devnagari (Hindi), the most popular ones in the Indian sub-continent (Chaudhuri and Pal, 1997). Stroke features are used for character recognition. A feature-based tree classifier is used initially, and an efficient template matching approach is then employed to recognize individual characters. The performance of the system is quite satisfactory on single-font, clear documents. An OCR system to read the South Indian language Telugu is presented in Negi et al. (2001). The simple and most obvious approach for recognizing Telugu characters is to break each character into glyphs and recognize them. They use fringe distance for the comparison of Telugu character binary images. The overall performance of the proposed method is 92%. A multi-font OCR system for printed Telugu text is also described in Vasantha Lakshmi and Patvardhan (2002).

1 For further information please visit the link: www.iiit.net/itrc/index.html.

The character recognition process for printed documents containing Hindi and Telugu text is presented in Jawahar et al. (2003). The bilingual recognizer is based on principal component analysis (PCA) followed by support vector classification. Principal components are used for dimensionality reduction and can give superior performance for font-independent OCR. Support vector machines (SVM) are pair-wise discriminating classifiers with the ability to identify the decision boundary with maximal margin. The overall accuracy of the method is 96.7%. In Hanmandlu et al. (2003) a fuzzy-logic-based approach for unconstrained handwritten character recognition is presented. The binary image of a character is partitioned into a fixed number of sub-images called boxes. The features consist of normalized vector distances and angles from each box. They have devised a new fuzzification function involving parameters that take into account the variations in the fuzzy sets. A recognition rate of almost 99% was achieved with the new fuzzification function.

An OCR system for printed Urdu script is described in Pal and Sarkar (2003). The method uses topological features, contour features as well as features obtained from the concept of water reservoir. The topological features include existence of holes, ratio of hole height to character height, etc. Contour features include characteristics of different profiles obtained from a portion of the contour. Water reservoir based features include number of reservoirs, position of reservoirs, height of reservoir, etc. These features are used for recognition. The method was tested on a variety of printed Urdu documents. The system identifies individual text lines with an accuracy of 98.3% and recognizes basic characters and numerals with an accuracy of about 97.8%. OCR for printed Tamil text using Unicode is presented in Seethalakshmi et al. (2005). The features considered for classification include the character height, width, number of horizontal and vertical lines, etc. Back propagation and SVM based classifiers are used for subsequent classification. Automatic recognition of printed Oriya script is presented in Chaudhuri et al. (2002). The digitized document image is first passed through preprocessing modules like skew correction, line segmentation, character segmentation, etc. Next, individual characters are recognized using a combination of stroke and run-number based features along with features obtained from the concept of water reservoir. In the recognition stage, modifiers are recognized in the first part and the remaining characters are recognized in the second part. The system has been tested on a variety of printed Oriya documents. Recognition accuracy for individual text lines is about 97.5%, whereas character segmentation accuracy is about 97.2%. On average the system recognizes characters with an accuracy of about 96.3%. A font and size independent OCR system for printed Kannada documents is presented in Ashwin and Sastry (2002). The system first extracts words from the document image and then segments these into sub-character level pieces. A set of zone features is extracted after normalization of the characters for recognition. A survey on Indian script character recognition is presented in Pal and Chaudhuri (2004). The paper reviews OCR work done on Indian language scripts and the different methodologies applied in OCR development internationally.

To the best of our knowledge, this is the first report of its kind on character recognition for multiple Indian scripts. The proposed multilingual character recognition system is based on Fourier transform and PCA (Fourier-PCA). Fourier transform is a widely used image processing technique, which is often applied to the enhancement of image description information and visual effect. In this paper, we combine it with the popular PCA to enhance the classification information and improve the recognition effect. We first obtain filtered (pre-processed) images by the selection of appropriate Fourier frequency bands of character images. We then propose to carry out character classification/recognition by using the PCA method.

The organization of this paper is as follows. In Section 2, a brief overview of South Indian scripts is presented. In Section 3, preprocessing procedures such as skew estimation and segmentation are presented. In Section 4,
we describe our proposed technique. Experimental results and a comparative study are reported in Section 5. Finally, conclusions are drawn in Section 6.

2. Brief overview of South Indian scripts

In this section, we give a brief explanation of the properties of South Indian scripts. Figures showing the vowels, consonants, and conjunct consonant characters of the South Indian scripts are given separately in Appendix A.

2.1. Kannada script

Kannada is one of the major Dravidian languages of southern India and one of the earliest languages evidenced epigraphically in India. It is spoken by about 50 million people in the Indian state of Karnataka. The script has 49 characters in its alphasyllabary and is phonemic. The Kannada character set is almost identical to that of other Indian languages. The characters are classified into three categories: swaras (vowels), vyanjanas (consonants) and yogavaahas (part vowel, part consonants).

2.2. Tamil script

Tamil is a Dravidian language spoken predominantly by Tamils in India and Sri Lanka, with speakers in many other countries. It is the official language of the Indian state of Tamil Nadu, also has official status in Sri Lanka and Singapore, and has more than 7 million speakers. Tamil is one of the major languages of the world. The Tamil script has 12 vowels, 18 consonants and five grantha letters. The script, however, is syllabic and not alphabetic. The complete script, therefore, consists of the 31 letters in their independent form, and an additional 216 combinant letters representing every possible combination of a vowel and a consonant.

2.3. Telugu script

Telugu is another Dravidian language, spoken by about 5 million people in the southern Indian state of Andhra Pradesh and neighboring states, and also in Bahrain, Fiji, Malaysia, Mauritius, Singapore and the UAE. Telugu is a syllabic language. Similar to most languages of India, each symbol in Telugu script represents a complete syllable. Officially, there are 18 vowels, 36 consonants, and three dual symbols. Of these, 13 vowels and 35 consonants are in common usage.

2.4. Malayalam script

Malayalam is the language spoken predominantly in the state of Kerala, in southern India. It is one of the 23 official languages of India, spoken by around 37 million people. The language belongs to the family of Dravidian languages. Both the language and its writing system are closely related to Tamil; however, Malayalam has a script of its own. Malayalam language script consists of 51 letters including 16 vowels and 37 consonants. The earlier style of writing is now substituted with a new style, and this new script reduces the number of different letters for typesetting from 900 to less than 90.

3. Preprocessing

Preprocessing plays an important role in any OCR system. In this section we explain two major preprocessing steps that are crucial for the successful development of an OCR system: (1) skew estimation and (2) segmentation.

The digitized images are in gray tone and we have used a histogram based thresholding approach to convert them into two-tone images. For a clear document the histogram shows two prominent peaks corresponding to white and black regions. The threshold value is chosen as the midpoint of the two histogram peaks. The two-tone image is converted into 0-1 labels where the label 1 represents the object and 0 represents the background.

3.1. Skew estimation

Skew angle detection is an important component of any OCR and document analysis system. When a document is fed to a scanner, either mechanically or by a human operator, for digitization, it suffers some degree of skew or tilt. Hence, in this work we have applied the skew detection method described in Nagabhushan et al. (2006), which is specially designed to handle South Indian script documents. The technique is based on boundary growing (BG), nearest neighboring clustering (NNC) and moments to determine the inclination of the scanned document. First, the method excludes those components whose height is very small. In doing so, dots of characters like 'i' and 'j' and punctuation marks like full stop, comma, hyphen, etc. are removed. If the height of a character is greater than the average height, then the character height is reduced to the average height. In this way, the method obtains a height-normalized character document. Using BG, the method extracts the top-left, bottom-right and centroid coordinate points present in each text line. However, BG alone does not give accurate results for South Indian language documents because of the modifiers that often get attached to the characters. Hence the method uses NNC and BG to extract the coordinates of the compound characters by drawing a single boundary. The extracted coordinate points are then passed to moments to estimate the skew angle.

Fig. 1(a) shows sample text lines of four South Indian scripts. Fig. 1(b) shows the results of applying BG alone, and finally the results of BG+NNC are shown in Fig. 1(c). The skew angle obtained for the document shown in Fig. 2(a) is 4.89°, whereas the actual skew angle is 5°. After detecting the skew, it is necessary to deskew the document. To deskew the document, the nearest neighbor interpolation technique is applied to the document. The resulting deskewed document is shown in Fig. 2(b).

Fig. 1. Results of BG+NNC for sample images of Kannada, Malayalam, Tamil and Telugu.
Fig. 2. Input skewed document and its corresponding deskewed document.

3.2. Segmentation

Segmentation of a document into lines, and subsequently extracting individual characters, constitutes an important task in the optical reading of texts. Presently most recognition errors are due to character segmentation errors (Bansal and Sinha, 2002). The proposed system therefore incorporates a technique to segment words into the individual characters of South Indian and English alphabets. As South Indian scripts are non-cursive, the individual characters in a word are isolated, and the spacing between the characters can be used for segmentation. However, because of the presence of conjunct consonants, there will not always be zero-valued valleys between characters. In such cases the usual method of using the vertical projection profile to separate characters fails. To segment characters of these types, we propose a new technique based on their structure. The proposed segmentation technique has two stages as follows:

Stage 1: consider the sample word shown in Fig. 3. The height of the word is divided by three lines: upper, lower and middle. The upper line is drawn such that it passes through the uppermost pixels present in the word. In a similar way, the lower line is drawn such that it passes through the lowermost pixels present in the word. Finally, the middle line is drawn corresponding to the upper and lower lines.

Stage 2: beginning from the middle line, label a connected component by searching in the upward direction. This search is confined to the image area enclosed by the upper and middle lines (refer to Fig. 4). If any component is encountered along the middle line of the image, label the corresponding component. Now, within the image area of the previously labeled component, continue the search as before (i.e., in the upward direction). If any component is encountered within that area, the current component and the previously labeled component are considered as one single component. This labeling process is continued until we reach the end of the middle line. If any component is present between the middle and lower lines, then that component is labeled as a modifier. Fig. 5 shows the final segmentation of the individual characters of the sample word considered, where the numbers indicate the order in which the components are extracted.

Fig. 3. A sample Kannada word is divided into three lines.
Fig. 4. Showing the working procedure of the proposed segmentation method.
Fig. 5. Showing the final segmentation of characters.

This algorithm is general and independent of language and size. The proposed segmentation algorithm was tested on a variety of South Indian scripts and performed well in all conditions. After successfully segmenting words into individual characters, each of these components is scaled to a standard size before feature extraction. The final implementation has 220 classes for Kannada, 200 classes for Tamil, 225 classes for Telugu, 85 classes for Malayalam and 62 classes for English.

4. Overview of PCA

Feature extraction is the identification of appropriate measures to characterize the component images distinctly. There are many popular methods to extract features, among which PCA is one of the state-of-the-art methods for feature extraction and data representation, widely used in the areas of computer vision and pattern recognition. Using this technique, an image can be efficiently represented as a feature vector of low dimensionality. The features in such a subspace provide more salient and richer information for recognition than the raw image. In this section, we briefly explain the PCA method for the sake of continuity and completeness.

4.1. The PCA method

The PCA technique (Turk and Pentland, 1991) regards each image as a feature vector in a high dimensional space by concatenating the rows of the image and using the intensity of each pixel as a single feature. More formally, let us consider a set of $N$ sample training images $\{A_1, A_2, \ldots, A_N\}$ taking values in $n$-dimensional image space, and assume that each belongs to one of $c$ classes $\{B_1, B_2, \ldots, B_c\}$. The average $\bar{A}$ of all training samples is calculated and then subtracted from the original characters $A_i$, and the result is stored in $\Phi_i$:

$$\bar{A} = \frac{1}{N}\sum_{i=1}^{N} A_i, \tag{1}$$

$$\Phi_i = A_i - \bar{A}. \tag{2}$$

In the next step, the covariance matrix $C$ is calculated according to $C = \frac{1}{N}\sum_{i=1}^{N} \Phi_i \Phi_i^{T}$. Now the eigenvectors $U_i$ $(i = 1, \ldots, N)$ and the corresponding eigenvalues $\lambda_i$ $(i = 1, \ldots, N)$ are calculated. From the above $N$ eigenvectors, only $k$ should be chosen, corresponding to the largest eigenvalues. The higher the eigenvalue, the more characteristic features of a character the particular eigenvector describes. Using the $k$ eigenvectors $U_k$, feature extraction by PCA is as follows:

$$F_i = U_k^{T}(A_i - \bar{A}), \quad i = 1, \ldots, N. \tag{3}$$

4.2. Fourier approach description

Suppose that the original image sample set is $X$, where each image matrix is of size $m \times n$ and expressed by $f(x,y)$, with $1 \le x \le m$, $1 \le y \le n$ and $m \ge n$. Assume there are $c$ known pattern classes $(w_1, w_2, \ldots, w_c)$ in $X$. Perform a two-dimensional discrete Fourier transform on each image by

$$F(u,v) = \frac{1}{mn}\sum_{x=1}^{m}\sum_{y=1}^{n} f(x,y)\exp\left[-j2\pi\left(\frac{ux}{m} + \frac{vy}{n}\right)\right], \tag{4}$$

where $j = \sqrt{-1}$, $\exp(\cdot)$ denotes the exponential function, and $F(u,v)$ is of size $m \times n$. Let $F(u_0,v_0)$ indicate the zero-frequency band. Shift $F(u_0,v_0)$ to the center of the image matrix, that is, the point $(m/2, n/2)$. Since the frequency domain is represented in matrix form, we use a square box Box($l$) to represent the $l$th frequency band, where $0 \le l \le m/2$. The four vertices of Box($l$) are $(u_0 - l, v_0 - l)$, $(u_0 + l, v_0 - l)$, $(u_0 - l, v_0 + l)$ and $(u_0 + l, v_0 + l)$, respectively. So, the $l$th frequency band denotes:

$$F(u,v) \in \mathrm{Box}(l).$$

The Fourier spectrum with different frequency bands is shown in Fig. 6. If we select the $l$th frequency band, we retain the values of $F(u,v)$ inside it and set the values of $F(u,v)$ elsewhere to zero. But which principle should we follow to select the appropriate bands? Here, we propose a simple two-dimensional discriminatory criterion to evaluate the discriminatory power of the frequency band and select the appropriate bands:

Fig. 6. Illustration of Fourier frequency spectrums with different frequency bands.

(i) Use the $l$th frequency band:

$$F(u,v) = \begin{cases} \text{retain original values} & \text{if } F(u,v) \in \mathrm{Box}(l), \\ 0 & \text{if } F(u,v) \notin \mathrm{Box}(l), \end{cases} \tag{5}$$

and perform an inverse Fourier transform on the current $F(u,v)$ values as follows:

$$f(x,y) = \frac{1}{mn}\sum_{u=1}^{m}\sum_{v=1}^{n} F(u,v)\exp\left[j2\pi\left(\frac{ux}{m} + \frac{vy}{n}\right)\right]. \tag{6}$$

Hence, for all the images in $X$, we obtain the corresponding filtered images, which construct a new sample set $Y_l$. Note that $Y_l$ has the same number of samples and the same number of classes as $X$. Let $W$ denote the total mean value of $Y_l$. Now with regard to $Y_l$, the scatter matrix is defined as follows:

$$S = E\left[(Y_l - W)^{T}(Y_l - W)\right]. \tag{7}$$

(ii) We evaluate the discriminatory power of $Y_l$, $J(Y_l)$, using the following judgment:

$$J(Y_l) = \mathrm{tr}(S), \tag{8}$$

where $\mathrm{tr}(\cdot)$ represents the trace of the matrix. Those frequency bands which maximize the above criterion will have the better discriminatory power. Hence, we choose those frequency bands that maximize Eq. (8), perform the inverse Fourier transform and obtain the band-pass filtered images.

After deriving these band-pass filtered (i.e., preprocessed) images we apply PCA (as explained in Section 4.1) to build the corresponding eigenspaces for the subsequent recognition process. NNC is used for subsequent classification. We call the method of applying PCA on $Y_l$ Fourier-PCA (F-PCA). Fig. 7 displays four original sample character images and the corresponding filtered images.

5. Experimental results

In this section, we experimentally evaluate our proposed method with a data set containing various characters pertaining to South Indian languages and English. Each experiment is repeated 25 times by varying the number of projection vectors t (where t = 1, …, 20, 25, 30, 35, 40, and 45). Since t has a considerable impact on recognition accuracy, we chose the value that corresponds to the best classification result on the image set. All of our experiments were carried out on a PC with a P4 3 GHz CPU and 512 MB RAM under the Matlab 7.0 platform.

5.1. Experimentation on printed South Indian and English text documents

We have evaluated the recognition performance of the proposed system on different real documents. The characters were collected in a systematic manner from printed pages scanned on an HP 2400 Scanjet scanner. Documents were skew corrected and components were extracted. Subsets of this data set were used for training and testing. The size of the input character is normalized to 50 × 50. In total we conducted four different types of experiments: (1) clear and degraded characters, (2) scale independence, (3) font independence and (4) noisy characters. In the first experiment, we considered the font-specific performance of South Indian languages and English. We considered two fonts for each of these languages and 50,000 samples for this experiment. Results are reported in Table 1. From Table 1 it can be seen that for clear and degraded characters the recognition rate achieved is 99.01%. Some of the misclassifications were due to the distortions created by skew correction.

The performance of the system is usually influenced by the scale and resolution of the characters. For the next experiment we considered samples of printed characters at various sizes, scanned at various scales. In total we collected about 20,000 samples of characters of different sizes, scanned at different resolutions. In this experiment the recognition rate achieved for scale-independent characters is about 95%.

A font is a set of printable or displayable text characters in a specific style and size. Recognizing different fonts when the number of character classes is huge is really interesting and challenging. In this experiment, we considered font-independent characters of South Indian scripts and English. In total, we considered around 37,500 samples for font independence and achieved a recognition rate of around 92.4%.
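The F-PCA recognizer evaluated in these experiments (Section 4) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: Box(l) is taken to be the filled square of half-width l around the centred zero frequency, the classifier is assumed to be a plain Euclidean nearest-neighbour rule, the images are synthetic, and every function name below is hypothetical. NumPy places the 1/mn factor on the inverse transform only, which differs from Eqs. (4) and (6) by a constant scale in the frequency domain but gives the same filtered image.

```python
import numpy as np

def band_mask(shape, l):
    """Boolean mask of Box(l): square of half-width l around the centred zero frequency."""
    m, n = shape
    u0, v0 = m // 2, n // 2
    mask = np.zeros(shape, dtype=bool)
    mask[max(u0 - l, 0):u0 + l + 1, max(v0 - l, 0):v0 + l + 1] = True
    return mask

def band_filter(images, l):
    """Zero all Fourier coefficients outside Box(l) (Eq. 5) and invert (Eq. 6)."""
    spec = np.fft.fftshift(np.fft.fft2(images), axes=(-2, -1))
    spec = spec * band_mask(images.shape[-2:], l)
    return np.fft.ifft2(np.fft.ifftshift(spec, axes=(-2, -1))).real

def band_score(filtered):
    """J(Y_l) = tr(S), S = E[(Y_l - W)^T (Y_l - W)] (Eqs. 7-8); filtered is (N, m, n)."""
    centred = filtered - filtered.mean(axis=0)
    S = np.einsum("kij,kil->jl", centred, centred) / len(filtered)
    return float(np.trace(S))

def pca_fit(X, t):
    """X is (N, d), one flattened image per row. Returns the mean and t leading eigenvectors."""
    mean = X.mean(axis=0)
    # Right singular vectors of the centred data are eigenvectors of the covariance matrix.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:t].T

def pca_project(X, mean, U):
    """Feature extraction of Eq. (3) for every row of X."""
    return (X - mean) @ U

def nearest_neighbour(train_feats, train_labels, test_feats):
    """Assign each test vector the label of its closest training vector (assumed classifier)."""
    dists = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=2)
    return train_labels[dists.argmin(axis=1)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.random((20, 16, 16))   # synthetic stand-ins for character images
    labels = np.arange(20) % 4          # hypothetical class labels
    filtered = band_filter(images, l=4)
    X = filtered.reshape(len(filtered), -1)
    mean, U = pca_fit(X, t=5)           # t projection vectors
    feats = pca_project(X, mean, U)
    print(nearest_neighbour(feats, labels, feats))
```

In an experiment of the kind described above, `t` would be swept over the listed values and the best-scoring setting kept; that loop is omitted here for brevity.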
Fig. 7. Typical character images (Top) and their corresponding filtered images (Bottom).

Table 1
Recognition rate of the proposed system and comparative study with PCA method

Experimental details   Language       Font               F-PCA (%)   PCA (%)
Clear and degraded     Kannada        Kailasam           99.01       96.9
                       Kannada        Kasturi
                       Tamil          BRH Tamil
                       Tamil          BRH Tamil RN
                       Telugu         BRH Telugu
                       Telugu         BRH Telugu RN
                       Malayalam      BRH Malayalam
                       Malayalam      BRH Malayalam RN
                       English        Times New Roman
                       English        Arial
Scale independence     Kannada        Kailasam           95          93.45
                       Kannada        Kasturi
                       Tamil          BRH Tamil
                       Tamil          BRH Tamil RN
                       Telugu         BRH Telugu
                       English        Arial
Font independence      South Indian   N.A                92.4        90.8
                       English        N.A
Noisy data             Kannada        Kailasam           94          92.56
                       Kannada        Kasturi
                       Tamil          BRH Tamil
                       Tamil          BRH Tamil RN
                       Telugu         BRH Telugu
                       English        Arial

Noise plays a very important role in the field of document image processing. To check the robustness of the proposed method, we conducted a series of experiments on noisy characters by varying the noise density from 0.01 to 0.5 in steps of 0.01. For this, we randomly selected one character image from each class and generated 50 corresponding noisy images (here "salt and pepper" noise is considered). We tested our proposed system with approximately 30,000 character images of South Indian scripts and English. We noticed that the proposed system achieves a 94% recognition rate and remains robust to noise up to a noise density of 0.3.
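The generation of noisy test images described above can be sketched as follows. This is an illustrative NumPy version under assumptions, not the routine the authors used: images are assumed to be 0-1 arrays, a noise density d is interpreted as the approximate fraction of pixels overwritten (half with 0, half with 1), and the names `salt_and_pepper` and `noisy_variants` are hypothetical.

```python
import numpy as np

def salt_and_pepper(image, density, rng):
    """Overwrite roughly `density` of the pixels: half become 0 (pepper), half become 1 (salt)."""
    noisy = image.copy()
    r = rng.random(image.shape)
    noisy[r < density / 2] = 0
    noisy[(r >= density / 2) & (r < density)] = 1
    return noisy

def noisy_variants(image, density, count=50, seed=0):
    """Generate `count` independent noisy copies of one character image."""
    rng = np.random.default_rng(seed)
    return np.stack([salt_and_pepper(image, density, rng) for _ in range(count)])

# Noise densities from 0.01 to 0.5 in steps of 0.01, as in the experiment.
densities = np.round(np.arange(1, 51) * 0.01, 2)
```

Fixing the seed makes the noisy set reproducible, which matters when comparing two recognizers on the same degraded images.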
We also compared our proposed method with the Eigen-character (PCA) based technique (Jawahar et al., 2003). Table 1 shows the performance accuracy of the PCA based technique with clear and degraded characters, scale independence, font independence, and noisy characters. From Table 1 it is clear that the proposed Fourier-PCA based technique performs well for all the conditions considered.

5.2. Experimentation on handwritten characters

Handwritten recognition is an active topic in OCR applications and in pattern classification/learning research. In OCR applications, English alphabet recognition is used in postal mail sorting, bank cheque processing, form data entry, etc. For these applications, the performance of handwritten English alphabet recognition is crucial to the overall performance. In this work, we have also extended our experiment to handwritten characters of the English alphabet. For experimentation, we considered samples from 200 individual writers, giving a total of 12,400 characters. Some sample images of handwritten characters are shown in Fig. 8. We trained the system with 50, 75, 125 and 175 training samples per character class, and the remaining samples of each character class were used for testing. Table 2 shows the best recognition accuracy obtained from the proposed system and the PCA based technique for varying numbers of samples. From Table 2 it is clear that the highest recognition rate achieved is 93.8%, when 175 samples are used for training and the remaining 25 samples are used for testing.

Fig. 8. Some of the sample images of English handwritten characters.

Table 2
Best recognition accuracy for handwritten characters

No. of training samples/character class   F-PCA (%)   PCA (%)
50                                        59          57.5
75                                        79.6        76.7
125                                       88.9        86
175                                       93.8        90.8

6. Conclusion

In this paper, the potential of the PCA method is further explored by combining it with Fourier transform. A simple criterion for selecting appropriate Fourier frequency bands is proposed. After choosing the appropriate frequency bands, we perform an inverse Fourier transform to obtain an alternate band-pass filtered (preprocessed) training set. We then apply the conventional PCA scheme for subsequent recognition. This preprocessing step makes the training image set more salient by fading out the unimportant features and enhancing the essential visual information, which is useful for the recognition task. The proposed system recognizes all basic characters, vowel–consonant combinations, and modifiers of South Indian script texts, and upper case, lower case, and numerals in the case of English, with an average recognition accuracy of 95.1%. The system was also tested on English handwritten characters and achieved a recognition accuracy of about 93.8%. The system was tested on a variety of images containing noise, different sizes, multiple fonts and degraded characters. Hence, our proposed work addresses the current challenges in the field of OCR. We have also compared our system with the conventional PCA method and found that the proposed method outperformed the PCA method in all the experiments reported. Further, we plan to extend this work to other Indian scripts.

Appendix A. South Indian scripts

Kannada script: Figs. A.1–A.3; Tamil script: Figs. A.4 and A.5; Telugu script: Figs. A.6–A.8; Malayalam script: Figs. A.9–A.11.
Fig. A.1. Kannada vowels and its diacritics.
Fig. A.2. Kannada consonants.
Fig. A.3. A few examples of conjunct consonants.

Fig. A.4. Tamil vowels and diacritics.
Fig. A.5. Tamil consonants.
Fig. A.6. Telugu vowels and its diacritics.
Fig. A.7. Telugu consonants.

Fig. A.8. Example of some of the conjunct consonants.
Fig. A.9. Malayalam vowels and its diacritics.
Fig. A.10. Malayalam consonants.
Fig. A.11. Example of some of the conjunct consonants and its diacritics.
References

Ashwin, T.V., Sastry, P.S., 2002. A font and size-independent OCR system for printed Kannada documents using support vector machines. Sadhana 27 (Part 1), 35–58.
Bansal, V., Sinha, R.M.K., 2002. Segmentation of touching and fused Devnagari characters. Pattern Recognition 35, 593–875.
Bozinovic, R.M., Srihari, S.N., 1989. Offline cursive script word recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 11, 68–83.
Casey, R., Nagy, G., 1968. Automatic reading machine. IEEE Transactions on Computers 17, 492–503.
Chaudhuri, B.B., Pal, U., 1997. An OCR system to read two Indian language scripts: Bangla and Devnagari (Hindi). Proceedings of ICDAR, 1011–1015.
Chaudhuri, B.B., Pal, U., 1998. A complete Bangla OCR system. Pattern Recognition 31 (5), 531–549.
Chaudhuri, B.B., Pal, U., Mitra, M., 2002. Automatic recognition of printed Oriya script. Sadhana 27 (Part 1), 23–34.
Govindan, V.K., Shivaprasad, A.P., 1990. Character recognition: a survey. Pattern Recognition 23, 671–683.
Hanmandlu, M., Mohan, A., Goyal, C., Roy, D., 2003. Unconstrained handwritten character recognition based on fuzzy logic. Pattern Recognition 36 (3), 603–623.
Kahan, S., Pavlidis, T., Baird, H.S., 1987. On the recognition of printed character of any font and size. IEEE Transactions on Pattern Analysis and Machine Intelligence 9, 274–288.
Kasturi, R., Gorman, L.O., Govindaraju, V., 2002. Document image analysis: a primer. Sadhana 27 (Part 1), 3–22.
Mori, S., Suen, C.Y., Yamamoto, K., 1992. Historical review of OCR research and development. Proceedings of the IEEE 80, 1029–1058.
Nagabhushan, P., Hemantha Kumar, G., Shivakumara, P., Manjunath Aradhya, V.N., 2006. Skew estimation by improved boundary growing for text documents in South Indian languages. Journal of Vivek Special Issue on Document Analysis of Indian Scripts 16 (2), 16–21.
Negi, A., Bhagvati, C., Krishna, B., 2001. An OCR system for Telugu. In: Proceedings of ICDAR, 10–13 September, Seattle, pp. 1110–1113.
Pal, U., Chaudhuri, B.B., 2004. Indian script character recognition: a survey. Pattern Recognition 37, 1887–1899.
Pal, U., Sarkar, A., 2003. Recognition of printed Urdu Script. In: Proceedings of ICDAR, 3–6 August, Edinburgh 2003, pp. 598–602.
Seethalakshmi, R., Sreeranjani, T.R., Balachandar, T., 2005. Optical Character recognition for printed Tamil text using unicode. Journal of Zhejiang University Science 6A (11), 1297–1305.
Turk, M., Pentland, A., 1991. Eigenfaces for recognition. Journal of Cognitive Neuroscience 3 (1), 71–86.
Indian language document analysis and understanding, 2002. Special issue Vasantha Lakshmi, C., Patvardhan, C., 2002. A multi-font OCR system
of Sadhana, February. for printed Telugu text. Proceedings of the Language Engineering
Jawahar, C.V., Pavan Kumar, M.N.S.S.K., Ravi Kiran, S.S., 2003. Conference (LEC), 1–17.
A bilingual OCR for Hindi–Telugu documents and its applications. Wang, K.Y., Casey, R.G., Wahl, F.M., 1982. Document analysis system.
In: Proceedings of ICDAR, 3–6 August, Edinburgh, pp. 656–660. IBM Journal of Research and Devlopment 26, 647–656.
