Professional Documents
Culture Documents
Arabic & Urdu Text Segmentation Challenges & Techniques: Atif Mahmood
Arabic & Urdu Text Segmentation Challenges & Techniques: Atif Mahmood
4, Issue Spl - 1, Jan - March 2013 ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)
Abstract
Text Segmentation is one of the critical and vital step in OCR
system of any language because accuracy of OCR depends
upon correctly segmented characters. Segmentation divide the
text images into its constituent parts (i.e. lines, components or
words and individual characters). As Urdu and Arabic are highly
cursive and context sensitive in nature and have improper space
between words therefore, segmentation is hard as compared to
other languages like English, Hindi, Chinese, etc. This paper
presents a survey of techniques regarding text segmentation of
Urdu and Arabic languages and also discusses various challenges
in segmentation.
Keywords
Segmentation, Challenges, Projection profile, Pixel Count,
Labelling, Clustering
(University)
Fig. 2(a): Cursive & Context Sensitivity in Arabic Language
(University)
Fig. 2(b): Cursive & Context Sensitivity in Urdu Language
T1: maximum size of single shape [10] G. 1. 1. A. M. 2. A. S. Noor Ahmed Shaikh, "Character
T2: minimum high of segmentation point Segmentation of Sindhi, an Arabic Style Scripting Language,
T3: minimum distance before and after circle using Height Profile Vector", 2009.
T4: minimum distance between two segmentation points [11] J. o. I. S. A. A. Z. Ahmad,"Urdu Nastaleeq Optical Character
Recognition", Vol. 32, 2007.
2. Slant-Tolerant Segment Features Method [12] P., B. K. S.Meknavin, Feature based Thai Words Segmentation.
This method extracts the morphological feature to find the junction NLPRS, Incorporating SNLP, 1997.
line between handwritten characters. STSF is based on two [13] W. A. U. B. Z. Rehman,"Challenges in Urdu text tokenization
approaches: Junction-Seeking Approach (JSA) and Recognize- and Sentence Boundary Disambiguation", IJCNLP, pp. 40-
Segment Approach (RSA). 45, 2011.
In JSA, system seeks the segments that lead to produce significant [14] J. S. A. W. J. Geusebroek,"Fast anisotropic Gauss filtering",
fragments of characters on the basis of Horizontal stroke-less IEEE, Trans. Image Process, 2003.
pattern points, low thickness line and small distance to the [15] C. W. O. Lampert,"An optimal nonorthogonal separation of
baseline. the anisotropic Gaussian convolution filter", IEEE Trans.
While in RSA system seeks any segment that matches the Image Process, pp. 3501–3513, 2006.
morphological features of the characters by scanning the word from [16] M. Riley,"Designing time-frequency representations for
right to left. The stops are determined by matching number and speech signals", In IEEE International Conference on
nature of the collected features with features that are standardized Acoustics, Speech, and Signal Processing, Apr. 1987.
[9, 17]. [17] N. D., S. Hussain,"Urdu Word Segmentation", Center for
Research in Urdu Language Processing..
IV. Conclusion and Future Work [18] S. N. H. R., A. F. T. nawaz,"OCR System for Urdu (Nashk
This paper describes the segmentation issues and segmentation Font) Using pattern Matching Technique", Vol. 3, No. 3.
method for Arabic and Urdu text. Although Urdu is derived from [19] J. Sadri,"Application of Support Vector Machines", in
Arabic, but Arabic seems to be a subset of Urdu because Urdu Proceeding of the Second Conference on Machine Vision,
contains all characters of Arabic except one character (i.e. kaaf). Feb.2003.
Thus we can say that all existing segmentation approach for Urdu
could work well for Arabic and vice-versa. Segmentation issues
and methods discussed in the paper could be used to understand Atif Mahmood received B.Tech degree
the segmentation technique and resolve the problem that arises in Information Technology from
during segmentation on the basis of segmentation issues. UNSIET, VBS Purvanchal University,
In future implementation of the existing segmentation techniques Jaunpur(U.P) in 2009 and worked there
will be done and test over standard data set so that performance of as a Guest-Lecturer during 2010 - 2011.
different segmentation technique can be evaluated and compared Presently, he is pursuing M.Tech in
to get the best segmentation method for Arabic and Urdu text. Computer Science & Engineering from
Sharda University, Greater Noida (U.P).
References His area of interests include Image
[1] "A historical Perspective of Urdu Language", [Online]. processing, Soft Computing & Software
Available: http://www.urducouncil.nic.in. Engineering.
[2] "Arabic language", (2010) [Online] Available: http://
en.wikipedia.org/wiki/Arabic_language.
[3] A. Z. B. T. Laurence Likforman-Sulem,"Text Line
Segmentation of Historical Documents: A Survey",
International Journal on Document Analysis and Recognition,
Springer, 2006.
[4] M. A.-A. A. Z., F. H. A.-Z. Said Elaiwat,"A Three Stages
Segmentation Model for a Higher Accurate off-line Arabic
Handwriting Recognition", World of Computer Science and
Information Technology Journal, Vol. 2, pp. 98-104, 2012.
[5] G. S. Lehal,"A Two Stage Word Segmentation System for
Handling Space Insertion Problem in Urdu Script", 2009.
[6] U., A. Sarkar,"Recognition of Printed Urdu Script", in
International Conference on Document Analysis and
Recognition (IEEE), 2003.
[7] F. S. a. T. M. B. Syed Saqib Bukhari,"Layout Analysis of
Arabic Script Documents", Springer-Verlag London, 2012.
[8] U. Iftikhar, Thesis on,"Recognition of Urdu ligatures",
VIBOT Consortium And German Research Center for
Artificial intelligence, 2011.
[9] S. A. ABDULLAH, Thesis on,"OFF-LINE HANDWRITTEN
ARABIC CHARACTERS SEGMENTATION USING
SLANT-TOLERANT SEGMENT FEATURES (STSF)",
April 2007.