Arabic & Urdu Text Segmentation Challenges & Techniques: Atif Mahmood

IJCST Vol.
4, Issue Spl - 1, Jan - March 2013 ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)
Arabic & Urdu Text Segmentation Challenges & Techniques

Atif Mahmood
Dept. of CSE, School of Engineering & Technology, Sharda University, Greater Noida, UP, India
Abstract
Text Segmentation is one of the critical and vital step in OCR
system of any language because accuracy of OCR depends
upon correctly segmented characters. Segmentation divide the
text images into its constituent parts (i.e. lines, components or
words and individual characters). As Urdu and Arabic are highly
cursive and context sensitive in nature and have improper space
between words therefore, segmentation is hard as compared to
other languages like English, Hindi, Chinese, etc. This paper
presents a survey of techniques regarding text segmentation of
Urdu and Arabic languages and also discusses various challenges
in segmentation.
Keywords
Segmentation, Challenges, Projection profile, Pixel Count,
Labelling, Clustering
I. Introduction Fig. 1(b): Urdu Alphabets

Urdu is one of the popular language of South East Asia, spoken
by nearly 150 million people of the world, either as a mother (92%-98%) and (70%-82%) for line, word and character
tongue or as a regional language. Urdu uses an extended Arabic respectively.
adapted script; it has 38 characters against Arabic’s 28 and uses In [3-8] line and word segmentation techniques, and in [9-12]
less diacritics (zabar, zair, pesh, etc.) than Arabic [1]. character segmentation techniques for Arabic and Urdu text
Arabic is spoken by nearly 280 million people in the world and have been reported, to include all such methods in this survey
primarily spoken in the Middle East and North Africa[2].Arabic is not possible. Instead a representation selection to illustrate the
characters serve as scripts for other languages such as Urdu, Farsi different methods that can be used is presented in this paper.
and Uygur. Fig. 1(a) & 1(b) shows the Arabic and Urdu alphabets
respectively. Both Arabic and Urdu are bidirectional languages II. Segmentation Challenges
where characters are written from right to left and numerals from The brief overview of Arabic & Urdu text segmentation challenges
left to right in Nastaliq and Nashk styles normally. Segmentation are discussed in this section.
technique for Arabic text can be used for Urdu text and vice-versa
due to their shape similarity and writing styles [1]. Segmentation A. Cursive & Context Sensitivity
of text into isolated characters take place in three steps (i.e. text Arabic & Urdu are cursive in nature because individual characters
line segmentation, word or sub word segmentation & character join together to form a complete word, thus identifying the
segmentation), each step follow same or different segmentation segmentation points in the words to separate isolated character
techniques depending upon the language characteristics. becomes difficult. Also these two languages are context sensitive,
Segmentation of Arabic and Urdu language has received less character’s shapes change depending upon the preceding or the
attentionas compare to other national languages of the countries. succeeding character in word formation [13]. Thus ambiguity
So far in the literature accuracy for line, word and character arises in identifying the proper character. Fig. 2(a) & 2(b) shows
segmentation is tested using different segmentation methods over how individual character’s shapes change when they join to form
local and standard data set andaverage accuracy reported is the word “university” in Arabic & Urdu language.
(University)
Fig. 2(a): Cursive & Context Sensitivity in Arabic Language
(University)
Fig. 2(b): Cursive & Context Sensitivity in Urdu Language
B. Dots, Diacritics & Diagonal Writing Style

Arabic & Urdu characters are enriched with dots & diacritics (Zabr,
Zair, Paish, Khari Zabr, Juzm, etc.), relative position of dots and
diacritics changes frequently with respect to character to which it
is associated. And diagonal writing style (Nastaliq style) from top
right to bottom left cause overlapping problem in the character,
that results incorrect line or words segmentation [8]. Fig. 3(a) &
3(b) shows word “department” in Arabic & Urdu language with
between(95%-100%), dots and diacritics in diagonal Writing Style.
32 International Journal of Computer Science And Technology w w w. i j c s t. c o m

ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print) IJCST Vol. 4, Issue Spl - 1, Jan - March 2013
Then bounding box for the connected components is drawn to get

the individual text lines [3].
3. Pixel Count Method

Fig. 3(a): Arabic This method can be applied for both line and word (or component)
segmentation. For line segmentation, if sum of black pixel along
a row between two consecutive text lines is found zero, then this
row becomes the separation line, which separates one line from
other. Similarly for word segmentation, if sum of black pixel along
Fig. 3(b): Urdu column between consecutive unconnected components of a row
is found zero then this column becomes separation line which
C. Space Inclusion & Exclusion Problem separates one components from other.
Arabic & Urdu text requires space inclusion to segment the
characters and words from each other, it is difficult to do so but is 4. X-Y Cut Method
vital for segmentation process. In [5] two stage system is presented This method is used to detect text lines. X-Y cut method cuts the
for handling extra space insertion problem in Urdu,where Urdu given input text image into block of lines. For splitting a block,
grammar rules have been applied in first stage and in second stage first its horizontal and vertical projection profiles are computed.
statistical based approach has been used.Space exclusion is another The noise removal thresholds txn and tyn are then used for computing
problem that arises during segmentation process, sometimes it valleys in the projection profiles. The bins of horizontal and
is difficult to identify the complete word due to space between vertical projection profiles are set to zero if they contain values
them (e.g. “atif” is written as to deal with such problem less than linearly scaled threshold txn and tyn respectively, with
it is necessary to avoid these gaps and treat them as a single word respectto the width and height of the block.The valleys of the
[8, 13]. horizontal (vx) andvertical (vy) projection profiles are compared
with the predefined thresholds txand ty respectively. The block
D. Sentence Boundary Identification Issue is split into two blocks at the midpoint of the wider of vx and vy,
Sentence boundary identification issue arises due to absence of which are larger than tx and ty respectively [7].
punctuations and confusion in identifying the start and end of
the sentence in Arabic and Urdu languages[8]. Unlike english 5. Ridge Based Method
where capital and small letters helps in identifying the sentence When large border noises are present in the text image or images
boundary and puncutuations(i.e.’ ,’ , ‘.’, ‘?’, ‘!’, etc.) are used as are skewed, then X-Y cut method produces wrong segmentation.
terminator. In that case Ridge-based text line finding algorithm can be used.
This method follow two image processing techniques.
III. Segmentation Techniques First is Anisotropic Gaussian Filter Bank smoothing which enhance
Text Segmentation is a process of subdividing the text images into text line structure by filling intraline gaps and maintaining interline
its constituent parts (i.e. lines, components or words and individual space [14-15]. And second one is Ridge Detection approach used
characters). It is the first module in design of character recognition for finding the main body of text lines in the smoothed document.
systems. In the literature various methods for Arabic and Urdu Different approaches for ridge detection can be used one such
text segmentation have been proposed, here those segmentation approach is Riley [16], ridge detection approach which detects
techniques are described in brief for line, components (or words) spectral peeks in speech signals.
and character segmentation.
VI. Component Labelling Method
A. Line & Component Segmentation Methods This method is used for component segmentation, this method
Segmentation of Arabic & Urdu text image into single separated is simplest method to separate each object from other objects in
lines and connected components or words can be done through a given input text image. In MATLAB, function bwlabel, can
following methods. be used to compute all the connected components for the binary
text images.
1. Projection Profile Method
Projection profile method is used for line and component B. Character Segmentation Methods
segmentation. A Horizontal projection computes sum of all black Character segmentation deals with the segmentation of word
pixels over each row and construct a histograms. The ditch between or connected components into isolated characters. Some of the
consecutive histogram peaks in the horizontal profile marks the technique that can be used for character segmentation are:
boundary between the text lines. Similarly vertical projection is
applied to each line to compute sum of all black pixels over each 1. Branch Point Estimation Method
columns to separate the unconnected components within a text In this method firstly we find the baseline, baseline is a row having
line. This method gives good results for printed text only because maximum density. Then looking for point, that branching from
printed text have proper space between lines and are straight, for baseline. After that we collect information for each branch point
handwritten text performance degrades [6]. such as position, height, width, maximum point in row, minimum
point in row, maximum point in column and minimum point in
2. Smearing Method column. And checking it out whether these branch points can be
Smearing method is used for line segmentation, in this method, gaps segmentation point or not on the basis of four threshold values
between the consecutive black pixels along horizontal direction is [4].
filled with black pixels if distance is within a pre-defined threshold.
w w w. i j c s t. c o m International Journal of Computer Science And Technology 33

IJCST Vol. 4, Issue Spl - 1, Jan - March 2013 ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)
T1: maximum size of single shape [10] G. 1. 1. A. M. 2. A. S. Noor Ahmed Shaikh, "Character
T2: minimum high of segmentation point Segmentation of Sindhi, an Arabic Style Scripting Language,
T3: minimum distance before and after circle using Height Profile Vector", 2009.
T4: minimum distance between two segmentation points [11] J. o. I. S. A. A. Z. Ahmad,"Urdu Nastaleeq Optical Character
Recognition", Vol. 32, 2007.
2. Slant-Tolerant Segment Features Method [12] P., B. K. S.Meknavin, Feature based Thai Words Segmentation.
This method extracts the morphological feature to find the junction NLPRS, Incorporating SNLP, 1997.
line between handwritten characters. STSF is based on two [13] W. A. U. B. Z. Rehman,"Challenges in Urdu text tokenization
approaches: Junction-Seeking Approach (JSA) and Recognize- and Sentence Boundary Disambiguation", IJCNLP, pp. 40-
Segment Approach (RSA). 45, 2011.
In JSA, system seeks the segments that lead to produce significant [14] J. S. A. W. J. Geusebroek,"Fast anisotropic Gauss filtering",
fragments of characters on the basis of Horizontal stroke-less IEEE, Trans. Image Process, 2003.
pattern points, low thickness line and small distance to the [15] C. W. O. Lampert,"An optimal nonorthogonal separation of
baseline. the anisotropic Gaussian convolution filter", IEEE Trans.
While in RSA system seeks any segment that matches the Image Process, pp. 3501–3513, 2006.
morphological features of the characters by scanning the word from [16] M. Riley,"Designing time-frequency representations for
right to left. The stops are determined by matching number and speech signals", In IEEE International Conference on
nature of the collected features with features that are standardized Acoustics, Speech, and Signal Processing, Apr. 1987.
[9, 17]. [17] N. D., S. Hussain,"Urdu Word Segmentation", Center for
Research in Urdu Language Processing..
IV. Conclusion and Future Work [18] S. N. H. R., A. F. T. nawaz,"OCR System for Urdu (Nashk
This paper describes the segmentation issues and segmentation Font) Using pattern Matching Technique", Vol. 3, No. 3.
method for Arabic and Urdu text. Although Urdu is derived from [19] J. Sadri,"Application of Support Vector Machines", in
Arabic, but Arabic seems to be a subset of Urdu because Urdu Proceeding of the Second Conference on Machine Vision,
contains all characters of Arabic except one character (i.e. kaaf). Feb.2003.
Thus we can say that all existing segmentation approach for Urdu
could work well for Arabic and vice-versa. Segmentation issues
and methods discussed in the paper could be used to understand Atif Mahmood received B.Tech degree
the segmentation technique and resolve the problem that arises in Information Technology from
during segmentation on the basis of segmentation issues. UNSIET, VBS Purvanchal University,
In future implementation of the existing segmentation techniques Jaunpur(U.P) in 2009 and worked there
will be done and test over standard data set so that performance of as a Guest-Lecturer during 2010 - 2011.
different segmentation technique can be evaluated and compared Presently, he is pursuing M.Tech in
to get the best segmentation method for Arabic and Urdu text. Computer Science & Engineering from
Sharda University, Greater Noida (U.P).
References His area of interests include Image
[1] "A historical Perspective of Urdu Language", [Online]. processing, Soft Computing & Software
Available: http://www.urducouncil.nic.in. Engineering.
[2] "Arabic language", (2010) [Online] Available: http://
en.wikipedia.org/wiki/Arabic_language.
[3] A. Z. B. T. Laurence Likforman-Sulem,"Text Line
Segmentation of Historical Documents: A Survey",
International Journal on Document Analysis and Recognition,
Springer, 2006.
[4] M. A.-A. A. Z., F. H. A.-Z. Said Elaiwat,"A Three Stages
Segmentation Model for a Higher Accurate off-line Arabic
Handwriting Recognition", World of Computer Science and
Information Technology Journal, Vol. 2, pp. 98-104, 2012.
[5] G. S. Lehal,"A Two Stage Word Segmentation System for
Handling Space Insertion Problem in Urdu Script", 2009.
[6] U., A. Sarkar,"Recognition of Printed Urdu Script", in
International Conference on Document Analysis and
Recognition (IEEE), 2003.
[7] F. S. a. T. M. B. Syed Saqib Bukhari,"Layout Analysis of
Arabic Script Documents", Springer-Verlag London, 2012.
[8] U. Iftikhar, Thesis on,"Recognition of Urdu ligatures",
VIBOT Consortium And German Research Center for
Artificial intelligence, 2011.
[9] S. A. ABDULLAH, Thesis on,"OFF-LINE HANDWRITTEN
ARABIC CHARACTERS SEGMENTATION USING
SLANT-TOLERANT SEGMENT FEATURES (STSF)",
April 2007.
34 International Journal of Computer Science And Technology w w w. i j c s t. c o m

Arabic & Urdu Text Segmentation Challenges & Techniques: Atif Mahmood

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Arabic & Urdu Text Segmentation Challenges & Techniques: Atif Mahmood

Uploaded by

Copyright:

Available Formats

IJCST Vol.

Arabic & Urdu Text Segmentation Challenges & Techniques

I. Introduction Fig. 1(b): Urdu Alphabets

B. Dots, Diacritics & Diagonal Writing Style

32 International Journal of Computer Science And Technology w w w. i j c s t. c o m

Then bounding box for the connected components is drawn to get

3. Pixel Count Method

w w w. i j c s t. c o m International Journal of Computer Science And Technology 33

34 International Journal of Computer Science And Technology w w w. i j c s t. c o m

You might also like