(Series On Language Processing Pattern Recognition and Intelligent Systems Vol. 4) Neamat El Gayar, Ching Y. Suen - Computational Linguistics, Speech and Image Processing For Arabic Language-World Sci
Editors
Ching Y. Suen
Concordia University, Canada
parmidir@enes.concordia.ca
Lu Qin
The Hong Kong Polytechnic University, Hong Kong
csluqin@comp.polyu.edu.hk
Published
Vol. 1 Digital Fonts and Reading
edited by Mary C. Dyson and Ching Y. Suen
Vol. 2 Advances in Chinese Document and Text Processing
edited by Cheng-Lin Liu and Yue Lu
Vol. 3 Social Media Content Analysis:
Natural Language Processing and Beyond
edited by Kam-Fai Wong, Wei Gao, Wenjie Li and Ruifeng Xu
Vol. 4 Computational Linguistics, Speech and Image Processing for
Arabic Language
edited by Neamat El Gayar and Ching Y. Suen
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy
is not required from the publisher.
ISBN 978-981-3229-38-9
Printed in Singapore
Contents
Preface v
Chapter 1
Arabic Speech Recognition: Challenges and State of the Art 1
Sherif Mahdy Abdou and Abdullah M. Moussa
1. Introduction 2
2. The Automatic Speech Recognition System Components 2
2.1. Pronunciation lexicon 4
2.2. Acoustic model 4
2.3. Language model 8
2.4. Decoding 9
3. Literature Review for Arabic ASR 10
4. Challenges for Arabic ASR Systems 14
4.1. Using non-diacritized Arabic data 15
4.2. Speech recognition for Arabic dialects 16
4.3. Inflection effect and the large vocabulary 19
5. State of the Art Arabic ASR Performance 22
6. Conclusions 24
References 24
Chapter 2
Introduction to Arabic Computational Linguistics 29
Mohsen Rashwan
1. Introduction 29
2. Layers of Linguistic Analysis 30
2.1. Phonological analysis 30
2.2. Morphological analysis 31
2.3. Syntactic analysis 31
2.4. Semantic analysis 31
3. Challenges Facing Human Language Technologies 32
4. Challenges Facing the Arabic Language Processing 32
4.1. Arabic script 33
4.2. Common mistakes 33
4.3. Morphological structure for the Arabic word 34
Chapter 3
Challenges in Arabic Natural Language Processing 59
Khaled Shaalan, Sanjeera Siddiqui, Manar Alkhatib and Azza Abdel Monem
1. Introduction 59
2. Challenges 61
2.1. Arabic orthography 62
2.2. Arabic morphology 69
2.3. Syntax is intricate 72
3. Conclusion 78
References 79
Chapter 4
Arabic Recognition Based on Statistical Methods 85
A. Belaïd and A. Kacem Echi
1. Introduction 85
2. A Challenging Morphology 86
3. Features Extraction Techniques 87
4. Machine Learning Techniques 92
5. Markov Models 94
5.1. Case 1: Decomposition of the shape/label 94
5.2. Case 2: Decomposition by association with a model 96
5.3. Extension of HMM to the Plane 98
5.4. Bayesian Networks 99
5.5. Two Dimensional HMM 101
6. Discriminative Models 103
7. Conclusion 107
References 108
Chapter 5
Arabic Word Spotting Approaches and Techniques 111
Muna Khayyat, Louisa Lam and Ching Y. Suen
1. Word Spotting 111
1.1. Definition 112
1.2. Input queries 113
1.3. Performance measures 114
1.4. Word spotting approaches 115
2. Arabic Word Spotting 116
2.1. Characteristics of Arabic handwriting 116
2.2. Arabic word spotting approaches 118
3. Databases 120
4. Extracted Features 121
5. Concluding Remarks 123
References 123
Chapter 6
A‘rib — A Tool to Facilitate School Children’s Ability to Analyze Arabic Sentences Syntactically 127
Mashael Almedlej and Aqil M Azmi
1. Introduction 127
2. Related Work 130
3. Basic Arabic Sentences Structure 131
4. System Design 132
4.1. Lexical analyzer 134
4.2. Syntactic analyzer 134
4.3. Results builder 138
4.4. Special cases 139
5. Implementation 140
5.1. Lexical analysis 141
5.2. Syntactic analysis 145
5.3. Results builder 151
5.4. Output 152
6. Conclusion and Future Work 152
References 153
Chapter 7
Semi-Automatic Data Annotation, POS Tagging and Mildly Context-Sensitive Disambiguation: The eXtended Revised AraMorph (XRAM) 155
Giuliano Lancioni, Laura Garofalo, Raoul Villano,
Francesca Romana Romani, Marta Campanelli, Ilaria Cicola,
Ivana Pepe, Valeria Pettinari and Simona Olivieri
1. Introduction 155
2. Description of XRAM 156
2.1. Flag-selectable usage markers 157
2.2. Probabilistic mildly context-sensitive annotation 160
2.3. Lexical and morphological XML tagging of texts 161
2.4. Semi-automatic increment of lexical coverage 163
3. Validation and Research Grounds 165
4. Conclusion 166
References 166
Chapter 8
WeightedNileULex: A Scored Arabic Sentiment Lexicon for Improved Sentiment Analysis 169
Samhaa R. El-Beltagy
1. Introduction 169
2. Related Work 170
3. The Base Lexicon 172
4. Assigning Scores to Lexicon Entries 173
4.1. Data collection 173
4.2. Collecting term statistics 174
4.3. Term scoring 174
5. Experiments and Results 178
5.1. The sentiment analysis system 179
5.2. The used datasets 180
5.3. Experimental results 181
6. Conclusion 184
References 184
Chapter 9
Islamic Fatwa Request Routing via Hierarchical Multi-Label Arabic Text Categorization 187
Reda Zayed, Mohamed Farouk and Hesham Hefny
1. Introduction 187
2. Related Work 190
3. Islamic Fatwa Requests Routing System 191
3.1. Text preprocessing 191
3.2. Feature engineering 193
3.3. The HOMER algorithm 194
4. Performance Evaluation 195
4.1. Data description 195
4.2. Methods 197
4.3. Results and Discussion 197
5. Future Work and Conclusion 199
References 200
Chapter 10
Arabic and English Typeface Personas 203
Shima Nikfal and Ching Y. Suen
1. Introduction 203
2. Literature Review of Typeface Personality Studies 204
3. Arabic Typeface Personality Traits 207
3.1. Research methodology 207
3.2. Statistical analyses of survey results 212
4. English Typeface Personality Traits 217
4.1. Research methodology 217
4.2. Statistical analyses of survey results 221
5. Summary of English Typefaces 225
6. Summary of Arabic Typefaces 226
7. Comparison of Both Studies 226
8. Conclusions and Future Work 227
References 228
Chapter 11
End-to-End Lexicon Free Arabic Speech Recognition Using Recurrent Neural Networks 231
Abdelrahman Ahmedy, Yasser Hifny, Khaled Shaalan and Sergio Toral
1. Introduction 231
2. Related Work 232
3. Arabic Speech Recognition System 233
3.1. Acoustic model 234
3.2. Language model 237
3.3. Decoding 237
4. Front-End Preparation 239
4.1. Converting the Arabic text to Latin (transliteration process) 239
4.2. Converting the transcription to alias 240
4.3. Speech features extraction 240
5. Experiments 241
5.1. The 8-hour experiment 241
5.2. The 8-hour results 242
5.3. The 1200-hour experiment 244
5.4. The 1200-hour results 245
6. Conclusion 245
References 246
Chapter 12
Bio-Inspired Optimization Algorithms for Improving Artificial Neural Networks: A Case Study on Handwritten Letter Recognition 249
Ahmed A. Ewees and Ahmed T. Sahlol
1. Introduction 249
2. Neural Networks and Bio-inspired Optimization Algorithms 252
2.1. Neural Networks (NNs) 252
2.2. Particle Swarm Optimization (PSO) 252
2.3. Evolutionary Strategy (ES) 252
2.4. Probability Based Incremental Learning (PBIL) 253
2.5. Moth-Flame Optimization (MFO) 253
3. Swarms Working Mechanism 255
4. The Proposed Approach 257
5. Experiments and Results 258
5.1. Dataset description 258
5.2. Evaluation criteria 259
5.3. Results and discussions 259
6. Conclusion and Future Work 264
References 265
Index 267
Chapter 1
Arabic Speech Recognition: Challenges and State of the Art
Sherif Mahdy Abdou and Abdullah M. Moussa
The Arabic language has many features, such as its phonology and syntax, that make it an easy language for developing automatic speech recognition systems. Many standard techniques for acoustic and language modeling, such as context dependent acoustic models and n-gram language models, can be easily applied to Arabic. Some aspects of the Arabic language, such as the nearly one-to-one letter-to-phone correspondence, make the construction of the pronunciation lexicon even easier than in other languages. The most difficult challenges in developing speech recognition systems for Arabic are the dominance of non-diacritized text material, the several dialects, and the morphological complexity. In this chapter, we review the efforts that have been made to handle the challenges of the Arabic language in developing automatic speech recognition systems. This includes methods for the automatic generation of diacritics for Arabic text and for word pronunciation disambiguation. We also review the approaches used for handling the limited speech and text resources of the different Arabic dialects. Finally, we review the approaches used to deal with the high degree of affixation and derivation that contributes to the explosion of different word forms in Arabic.
1. Introduction
The goal of the ASR system is to find the most probable sequence of
words W = (w1, w2, …) belonging to a fixed vocabulary given some set of
acoustic observations X = (x1, x2, …, xT). Following the Bayesian
approach applied to ASR as shown in Ref. 4, the best estimation for the
word sequence can be given by:

W* = arg max_W P(W|X) = arg max_W P(X|W) P(W) / P(X) = arg max_W P(X|W) P(W)   (1)

where P(X|W) is the acoustic model, P(W) is the language model, and the
evidence P(X) can be dropped since it does not depend on W.
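As a minimal numeric sketch of the decision rule in Eq. (1), the following snippet picks the hypothesis maximizing log P(X|W) + log P(W) in log space; the candidate word sequences and their scores are invented purely for illustration.

```python
# Hypothetical candidates W with (log P(X|W), log P(W)) scores.
hypotheses = {
    ("kataba", "al-walad"): (-120.5, -6.2),
    ("kataba", "al-waladu"): (-118.9, -7.0),
    ("kutiba", "al-walad"): (-123.1, -5.9),
}

def decode(hyps):
    # argmax_W P(X|W) P(W) == argmax_W log P(X|W) + log P(W);
    # P(X) is constant across hypotheses and is dropped.
    return max(hyps, key=lambda w: hyps[w][0] + hyps[w][1])

best = decode(hypotheses)  # ('kataba', 'al-waladu')
```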
Fig. 1. Components of an ASR system: front-end feature extraction converts the input speech frames into a feature vector X, and the search combines the acoustic model P(X|W) and the language model P(W) to output the recognized text.
The most popular acoustic models are the so-called Hidden Markov
Models (HMMs). Each phoneme (unit in general) is modeled using an
HMM. An HMM4 consists of a set of states, transitions, and output
distributions as shown in Fig. 2.
The output distribution of each state j is typically modeled as a mixture of M Gaussians:

b_j(x) = Σ_{m=1}^{M} w_{jm} N(x; μ_{jm}, σ_{jm})   (2)
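A one-dimensional sketch of the state output distribution in Eq. (2); the mixture weights, means and variances are illustrative, not taken from any trained model.

```python
import math

def gaussian(x, mu, sigma):
    # Univariate normal density N(x; mu, sigma)
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def state_output(x, weights, mus, sigmas):
    # b_j(x) = sum_m w_jm * N(x; mu_jm, sigma_jm)  -- Eq. (2), 1-D case
    return sum(w * gaussian(x, mu, s) for w, mu, s in zip(weights, mus, sigmas))

# Two-component mixture for one HMM state (illustrative parameters).
b = state_output(0.5, weights=[0.6, 0.4], mus=[0.0, 1.0], sigmas=[1.0, 0.5])
```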
Fig. 3. Decision tree for classifying the second state of the K-triphone HMM, asking questions such as “Is the right phone a back-R?” and “Is the left phone /s, z, sh, zh/?”, with yes/no branches leading to leaves such as senone 2 and senone 3.
hours of speech are used to train a model. The model together with an
appropriate confidence measure can then be used to automatically
transcribe thousands of hours of data. The new data can then be used to
train a larger model. All the above techniques (and more) are
implemented in the so-called Hidden Markov Model Toolkit (HTK)
developed at Cambridge University.9
P(W) = P(w1) P(w2|w1) P(w3|w1, w2) … P(wN|w1, …, wN−1) = ∏_{i=1}^{N} P(wi|w1, …, wi−1)   (3)
P(wi|w1, w2, …, wi−1) ≈ P(wi|wi−n+1, …, wi−1)   (4)
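Equations (3)-(4) can be sketched with a maximum-likelihood bigram model (n = 2); the toy corpus below is invented for illustration.

```python
from collections import Counter

def train_bigram(corpus):
    # Maximum-likelihood estimate: P(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1})
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])               # history counts
        bigrams.update(zip(toks[:-1], toks[1:]))  # bigram counts
    return lambda prev, w: bigrams[prev, w] / unigrams[prev] if unigrams[prev] else 0.0

corpus = [["the", "boy", "wrote"], ["the", "girl", "wrote"]]
p = train_bigram(corpus)
p("the", "boy")   # 0.5 -- "the" precedes "boy" in one of its two occurrences
```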
2.4. Decoding
Finding the best word (or generally unit) sequence given the speech input
is referred to as the decoding or search problem. Formally, the problem is
reduced to finding the best state sequence in a large state space that
consists of composing the pronunciation lexicon, the acoustic model and
the language model. The solution can be found using the well-known
Viterbi algorithm. Viterbi search is essentially a dynamic programming
algorithm, consisting of traversing a network of HMM states and
maintaining the best possible path score at each state in each frame. It is
a time synchronous search algorithm in that it processes all states
completely at time t before moving on to time t + 1. The abstract
algorithm can be understood with the help of Fig. 5. One dimension
represents the states in the network, and the other dimension represents
the time axis.
Fig. 5. The Viterbi search trellis: one axis holds the network states, from start state to final state, and the other the time frames.
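The time-synchronous Viterbi search just described can be sketched as follows; the two-state toy HMM at the bottom is invented for illustration.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    # Time-synchronous search: process all states completely at time t
    # before moving on to time t + 1, keeping the best path score per state.
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backptr = []
    for o in obs[1:]:
        prev = best
        best, ptr = {}, {}
        for s in states:
            score, arg = max((prev[r] + log_trans[r][s], r) for r in states)
            best[s] = score + log_emit[s][o]
            ptr[s] = arg
        backptr.append(ptr)
    # Trace back from the best final state through the stored pointers.
    state = max(best, key=best.get)
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Toy two-state HMM over a binary observation alphabet.
states = ("S1", "S2")
log_start = {"S1": math.log(0.8), "S2": math.log(0.2)}
log_trans = {"S1": {"S1": math.log(0.6), "S2": math.log(0.4)},
             "S2": {"S1": math.log(0.3), "S2": math.log(0.7)}}
log_emit = {"S1": {0: math.log(0.9), 1: math.log(0.1)},
            "S2": {0: math.log(0.2), 1: math.log(0.8)}}
path = viterbi([0, 1, 1], states, log_start, log_trans, log_emit)  # ['S1', 'S2', 'S2']
```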
The early efforts to develop Arabic ASR systems started with simple
tasks such as digits recognition and small vocabulary of isolated words.
and was trained using 1024 hrs of speech data that consisted of 764 hrs
of supervised data and 260 hrs of lightly supervised data. The gain from the
lightly supervised data part was shown to be marginal and may even result in
performance degradation. That system used state-clustered triphone
models with approximately 7k distinct states and an average of 36
Gaussian components per state, and used an n-gram language model trained
on 1 billion words of text data. It used three decoding stages. The first
stage is a fast decoding run with Gender Independent (GI) models. The
second stage uses Gender Dependent (GD) models adapted using LSLR;
it also applies variance scaling using the first stage output as supervision.
The second stage generates trigram lattices which are expanded using a 4-gram
language model and then rescored in the third stage using GD models
adapted using lattice-MLLR, as discussed in Ref. 21. In that system it was
shown that graphemic models perform at least as well as phonetic
models for conversational data, with very minor degradation for the
news data.
The IBM ViaVoice was one of the first commercial Arabic large
vocabulary systems that was developed for dictation applications.24 A
more advanced system was developed by the speech recognition research
group at IBM: an Arabic broadcast transcription system fielded for the
GALE project. Key advances include improved discriminative training,
the use of subspace Gaussian mixture models (SGMMs) as shown in Ref.
25, neural network acoustic features as shown in Ref. 26, variable frame
rate decoding as shown in Ref. 27, training data partitioning experiments,
a class-based exponential LM and NNLMs with syntactic
features.28 This system was trained on 1800 hrs of transcribed Arabic
broadcasts and text data of 1.6 billion words provided by the
Linguistic Data Consortium (LDC).29 A pruned language model of 7
million n-grams, obtained using entropy pruning as shown in Ref. 30, is used for
the construction of static, finite-state decoding graphs. Another, unpruned
version of the LM, containing 883 million n-grams, is used for lattice
rescoring. This system used a vocabulary of 795K words with more than
2 million pronunciations. This system used 6 decoding passes. The first
pass used a speaker independent grapheme based acoustic model. The
following 5 passes used speaker adapted phoneme based models. All
models have penta-phone cross-word acoustic context. Another 3
The Arabic language poses three major challenges for developing ASR
systems. The first one is the constraint of having to use mostly non-diacritized
texts as recognizer training material, which causes problems
for both acoustic and language modeling. Training accurate acoustic
models for the Arabic vowels without knowing their location in the
signal is difficult. Also, a non-diacritized Arabic word can have several
senses, with the intended word sense to be derived from the word context.
Language models trained on this non-diacritized material may therefore
be less predictive than those trained on diacritized texts.
marks and 10% word error rate for case ending marks. This means more
than 10% of the data will be restored with wrong diacritics, which would
reduce the efficiency of the trained acoustic models. To reduce the
number of errors in the restored diacritics, it was proposed to use the audio
recordings of the text data, besides the linguistic information, to help in
selecting the correct word diacritics.11 In that approach, a forced alignment
is performed between the audio signal and the reference text using a
pronunciation dictionary that includes all the possible diacritization
forms for each word. A morphology analyzer is used to generate these
diacritization forms.36 For the words for which the analyzer fails to find a
possible diacritization form, which usually happens for named entities, a
series of expert rules are used to derive their pronunciations.12 Finally,
for the remaining words, for which all the approaches fail to derive any
diacritization form, it is possible to back off to the graphemic
pronunciation and build a combined system.14
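The three-step fallback just described (morphological analyzer, then expert rules, then graphemic backoff) can be sketched as below; the analyzer, rule set and graphemic converter are hypothetical placeholders passed in as functions, not real resources.

```python
def pronunciations(word, analyze, expert_rules, graphemic):
    # 1. Ask the morphological analyzer for all diacritization forms.
    forms = analyze(word)
    if forms:
        return forms
    # 2. Analyzer failed (common for named entities): try expert rules.
    forms = expert_rules(word)
    if forms:
        return forms
    # 3. Last resort: back off to the graphemic pronunciation.
    return [graphemic(word)]

# Toy stand-ins for the real resources.
analyze = lambda w: {"ktb": ["kataba", "kutiba"]}.get(w, [])
expert_rules = lambda w: {"wSf": ["wasf"]}.get(w, [])
graphemic = lambda w: "-".join(w)

pronunciations("ktb", analyze, expert_rules, graphemic)   # ['kataba', 'kutiba']
pronunciations("xyz", analyze, expert_rules, graphemic)   # ['x-y-z']
```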
Although vowelized acoustic models provide better accuracy, in some
cases, such as dialectal Arabic ASR, grapheme based models are a more
effective approach, since the restoration of diacritics for this type of data
would require resources that do not exist, such as a morphological
analyzer or expert diacritization rules. Also, with a large amount of
training data, the performance of grapheme based and phoneme based
systems becomes very close.37
Whereas MSA data can readily be acquired from various media sources,
only very limited speech corpora of dialectal Arabic are available.
The construction of this type of corpus is even more challenging than
for MSA. Initially the manual annotation has no standard reference, and
the same word can be transcribed in several ways, such as
“بأشكرك، باشكرك، بشكرك”. Some transcription guidelines for Egyptian and
Levantine Dialectal Arabic were proposed to reduce such differences.38
The diacritization of dialectal Arabic is more challenging than for MSA,
since it would require a dialectal Arabic morphological analyzer to
generate the different diacritization forms. Using context based
diacritization would also require a robust language model for dialectal
Fig. 6. Left: An example of Arabic word factorization. Right: Vocabulary growth for the
Arabic language.
The main drawback of that approach is the short duration of the
affixation units, which can be only two phones long, making them
highly susceptible to insertion errors. To avoid these effects, some
enhancements to the approach were proposed: the first was to keep the most
frequent words in full form without decomposition, and the second
was not to decompose the prefix “Al” for words starting
with a solar consonant, since, due to assimilation with the following
consonant, deletion of the prefix was one of the most frequent errors.
This enhanced morphologically based LM provided some reduction in
WER compared with the word based LM.45 Rather than using linguistic
knowledge to derive the morphological decomposition, an unsupervised
technique based on the Minimum Description Length (MDL) principle
was also proposed to provide better coverage for Out-Of-Vocabulary
(OOV) words.46
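A sketch of this enhanced decomposition in Buckwalter-style transliteration; the affix inventory, frequent-word list and solar-consonant set are small illustrative assumptions, not the published system's actual resources.

```python
# Affix decomposition for language modeling, with the two enhancements:
# frequent words kept whole, and "Al" not split before solar consonants.
PREFIXES = ("wAl", "Al", "wa", "bi", "li")
SUFFIXES = ("hA", "hm", "At", "wn")
SOLAR = set("tvdrzsS$DTZln")            # consonants that assimilate "Al"
FREQUENT = {"Allh", "fy", "mn", "ElY"}  # kept whole, no decomposition

def decompose(word):
    if word in FREQUENT:
        return [word]
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            stem = word[len(p):]
            # Do not detach "Al" before a solar consonant: assimilation
            # makes the detached prefix too easy to delete in decoding.
            if p.endswith("l") and stem[0] in SOLAR:
                continue
            return [p + "+"] + decompose(stem)
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            return decompose(word[: -len(s)]) + ["+" + s]
    return [word]

decompose("AlktAb")   # ['Al+', 'ktAb']
decompose("Al$ms")    # ['Al$ms'] -- "Al" kept attached before solar "$"
```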
Fig. 7. Standard backoff path for a 4-gram language model over words (left) and backoff graph for a 4-gram over factors (right).
larger than one million words with processing time close to real-time
performance, as shown in Refs. 28, 50, but this came at the price of
large model sizes of several gigabytes.
6. Conclusions
References
[1] J. Billa, et al. Audio indexing of broadcast news. Proc. of the International
Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. I-3-I-5
(2002).
[2] S. Khurana and A. Ali, QCRI advanced transcription system (QATS) for the Arabic
multi-dialect broadcast media recognition: MGB-2 challenge, IEEE Spoken
Language Technology Workshop, (SLT), pp. 292–298 (2016).
[3] V. Peddinti, D. Povey, and S. Khudanpur, A time delay neural network architecture
for efficient modeling of long temporal contexts, Proc. of the Interspeech Conf., pp.
3214–3218 (2015).
[4] L. Rabiner, A tutorial on hidden Markov models and selected applications in
speech recognition, Proc. IEEE 77(2), pp. 257–286 (1989).
[5] T. Shinozaki, HMM state clustering based on efficient cross-validation, Proc. Int.
Conf. Acoustics Speech and Signal Processing, ICASSP, pp. 1157–1160 (2006).
[6] P. M. Baggenstoss, A modified Baum-Welch algorithm for hidden Markov models
with multiple observation spaces, IEEE Transactions on Speech and Audio
Processing, 9(4), pp. 411–416 (2001).
[7] M. Afify, Extended Baum-Welch reestimation of Gaussian mixture models based on
reverse Jensen inequality, Proc. of the 9th European Conference on Speech
Communication and Technology, Interspeech, pp. 1113–1116 (2005).
[8] Y. A. Alotaibi, M. Alghamdi, F. Alotaiby, Speech Recognition System of Arabic
Digits based on A Telephony Arabic Corpus, Proc. of the International Conference
on Image and Signal Processing, ICISP, pp 245–248 (2010).
[9] M. Alghamdi, Y. O. El Hadj and M. Alkanhal, A Manual System to Segment and
Transcribe Arabic Speech, Proc. of the International Conference on Signal
Processing and Communications, ICSPC, pp. 233–236 (2007).
[10] Y. A. Alotaibi, Comparative Study of ANN and HMM to Arabic Digits
Recognition Systems, Journal of King Abdulaziz University, JKAU, 19(1), pp. 43–
60 (2008).
[11] J. Ma, S. Matsoukas, O. Kimball and R. Schwartz, Unsupervised training on large
amount of broadcast news data, Proc. of the International Conference on Acoustics,
Speech and Signal Processing, ICASSP, pp. 1056–1059 (2006).
[12] A. Messaoudi, J.-L. Gauvain and L. Lamel, Arabic transcription using a one
million word vocalized vocabulary, Proc. of the International Conference on
Acoustics, Speech and Signal Processing, ICASSP, pp. I-1093–I-1096 (2006).
[13] M. Gales, et al. Progress in the CU-HTK broadcast news transcription system,
IEEE Transactions Speech and Audio Processing, 14(5), pp. 1513–1525 (2006).
[14] H. Soltau, G. Saon, B. Kingsbury, H-K. Kuo, L. Mangu, D. Povey and G. Zweig.
The IBM 2006 GALE Arabic ASR system, Proc. of the International Conference
on Acoustics, Speech and Signal Processing, ICASSP, pp. IV-349–IV-352 (2007).
[15] T. Imai, A. Ando and E. Miyasaka, A new method for automatic generation of
speaker-dependent phonological rules, International Conference of Acoustics,
Speech and Signal Processing, ICASSP, vol. 1, pp. 864–867 (1995).
[16] H. Bahi and M. Sellami, A hybrid approach for Arabic speech recognition.
ACS/IEEE international conference on computer systems and applications, pp. 14–
18 (2003).
[17] M. Nofal, E. Abdel Reheem et al., The development of acoustic models for
command and control Arabic speech recognition system. Proc. of International
Conference on Electrical, Electronic and Computer engineering, ICEEC’04, pp.
1023–1026 (2004).
Chapter 2
Introduction to Arabic Computational Linguistics
Mohsen Rashwan
Electronics and Electrical Communications Department,
Faculty of Engineering, Cairo University
Giza, Egypt
mrashwan@RDI-Eg.com
1. Introduction
The higher layers of language analysis depend on the lower ones,
although the layers overlap to some extent. We will
explain the nature of each layer as follows:
Language is a great trust from God. The human ability in language
is inimitable. If you sit chatting with a native speaker for long
hours, you will hardly encounter — among the thousands of words flowing from
his mouth — a word that you do not understand. The Word Error Rate
(WER) in this case is less than 0.1%, while the best WER of the best
spontaneous ASR system is around 10%, i.e. 100 times the WER of
the human auditory system.
The Arabic language has many features that distinguish it. These
features represent elements of its strength, but at the same time
they represent elements of challenge for computation.
We will state some of the challenges that we face when subjecting the
Arabic language to computational processing, through the following points:
for each word. Semantic research is still not sufficient for the Arabic
language.
Also, a type of research has recently appeared that goes beyond texts into
various other media, such as (see Refs. 1, 6):
Search in audio files.
Search in photos.
Search in videos.
The first way is the most common and widely used, while the
successes of the second algorithm are in specific tasks. With the
advances in understanding the models of natural languages,
abstraction algorithms will be used more heavily to give more convincing
summaries.
The goal in both cases is to retrieve similar documents. The documents
will be classified and clustered automatically as follows:
Detecting the correct spelling of Arabic words from their contexts
has achieved a good level of acceptance. However, detecting syntactic and
semantic mistakes is not yet quite mature for Arabic. Semantic
mistakes like saying “he succeeded although he was studying hard”
cannot be easily detected by computers. It should be noted that
recognizing the errors of learners of Arabic as a second language in
spelling, grammar and semantics has shown much evidence of success in some
research works.
5.9. Stylometry
Stylometry is the art of verifying someone’s authorship of a specific
article or book. This technology is just a branch of the document
classification referred to above, see Fig. 5. We can benefit from this
technology in issuing official documentation of a specific article to its
Fig. 5. Stylometry.
These systems will raise machines to much higher levels in terms
of easy interaction with humans, see Ref. 12. They will be used
extensively with robots. These robots will be able to do a lot of work in
homes, serve children, elderly people and patients, and do heavy work
in factories 24 hours a day without fatigue. Perhaps they will be able to
narrate tales and entertain users, relieving them of their concerns
and playing with them intelligently and skillfully.
6.1.4. King Abdul Aziz City for Science and Technology (Computer Research Institute)
The Institute includes the Department of Phonetics and Linguistics,
which is interested in preparing research and solutions to the problems of
Arabic language technologies. It provides consultations and sets up
workshops to follow up the advances in the field.
Headquarters: King Abdul Aziz City for Science and Technology, Riyadh, Saudi Arabia.
Website: http://www.kacst.edu.sa/ar/about/institutes/pages/ce.aspx
Arabic language resources:
Saudi Bank of sounds.
Arabic optical character recognition system.
Syntax analysis for Arabic online texts.
Huge Arabic text corpus.
Automatic Arabic essay scoring.
References
Chapter 3
Challenges in Arabic Natural Language Processing
Khaled Shaalan, Sanjeera Siddiqui, Manar Alkhatib and Azza Abdel Monem
Faculty of Engineering & IT, The British University in Dubai,
Block 11, Dubai International Academic City,
P.O. Box 345015, Dubai, UAE (Shaalan, Siddiqui and Alkhatib)
School of Informatics, University of Edinburgh, UK (Shaalan)
Faculty of Computer and Information Sciences, Ain Shams University,
Abbassia, 11566 Cairo, Egypt (Abdel Monem)
khaled.shaalan@buid.ac.ae, faizan.sanjeera@gmail.com,
Manaralkhatib09@gmail.com and azza_monem@hotmail.com
1. Introduction
2. Challenges
aThe top alveolar ridge is located on the roof of the mouth between the upper teeth and the
hard palate.
and a rich part system. Arabic makes use of many inflections because of
its affixes, which incorporate prepositions and pronouns. Arabic
morphology is perplexing because there are about 10,000 roots that are the
basis for nouns and verbs27. There are 120 patterns in Arabic morphology.
Ref. 28 highlighted the importance of 5000 roots for Arabic morphology.
The word order in Arabic is variable: we have a free choice of which
word to emphasize by putting it at the head of the sentence. Generally,
the syntactic analyzer parses the input tokens produced by the lexical
analyzer and tries to identify the sentence structure using Arabic grammar
rules. The relatively free word order in an Arabic sentence causes syntactic
ambiguities, which require investigating all the possible grammar rules as
well as the agreement between constituents13,24.
In this chapter, we discuss the challenges of the Arabic language with regard
to its characteristics and their related computational problems at the
orthographic, morphological, and syntactic levels. In automating the
analysis of Arabic sentences, there is an overlap between these
levels, as they all help in making sense and meaning of words, and in
disambiguating the sentence.
Table 1. The Hamza diacritic is determined by its own diacritics and the preceding letter.
Verb | Transliteration | Sentence | Change applied to the present form of the verb
دعا | Da-aa | لم يدعُ | Omit the last long vowel “و” and add the present tense letter “ي”
سعى | Sa-aa | لم يسعَ | Omit the last long vowel “ى” and add the present tense letter “ي”
صلى | Sala | لم يصلِ | Omit the last long vowel “ي” and add the present tense letter “ي”
زار | Zara | لم يزرْ | Omit the middle long vowel “ا” and add the present tense letter “ي”
Remedies to resolve this type of ambiguity might not necessarily fix all
problems33,34. For example, consider the sentence “رأيت أمل” (I saw
hope/Amal), which can have either meaning.
2.1.4. Vowels
In written Arabic, there are two types of vowels: diacritical symbols and
long vowels. Arabic text is dominantly written without diacritics, which
leads to major linguistic ambiguities in most cases, as an Arabic word has
different meanings depending on how it is diacritized. A diacritic sign
(Tashkeel or Harakat) is not an orthographic letter. It is formed as a
diacritical mark above or below a consonant to give it a sound. Ref. 35
presented a good survey of recent works in the area of automatic
diacritization. There are three groups of diacritics32,36. The first group
consists of the short vowel diacritics: Fatha ( ◌َ ), Dhamma ( ◌ُ ), and
Kasra ( ◌ِ ). The second group represents the doubled case ending diacritics
(nunation or tanween): Tanween Fatha ( ◌ً ), Tanween Kasra ( ◌ٍ ),
and Tanween Damma ( ◌ٌ ). These are vowels occurring at the end of
nominal words (nouns, adjectives and adverbs) indicating nominal
indefiniteness. The third group is composed of Shadda ( ◌ّ ) and Sukuun ( ◌ْ )
“وسيحضرونها” (and they will bring it, wasayahdurunaha). This word can be
written in this form:
In this example, the lemma “حضر” (hadr) accepts three prefixes: “و” (wa),
“س” (sa), and “ي” (ya), and two suffixes: “ون” (waw and noon), and “ها” (ha).
Thereby, because of the complexity of Arabic morphology, building
an Arabic NLP system is a challenging task.
The early step in analyzing an Arabic text is to identify the words in
the input sentence based on their types and properties, and to output them as
tokens. There might be a problem in segmentation, where some word
fragments that should be part of the lemma of a word are mistaken
to be part of the prefix or suffix of the word and thus are separated from the
rest of the word as a result of tokenization. This problem arises with
Named Entity Recognition, where the ending character n-grams of the
Named Entity are mistaken for objects or personal/possessive anaphora
and are separated by tokenization19. Moreover, the POS tagger used for
the training and test data may have produced some incorrect tags,
incrementing the noise factor even further.
Another morphological challenge, highlighted by Ref. 46, concerns
the relationships between words. The syntactic relationship that a word has
with the other words in the sentence shows itself in its inflectional endings
and not in its position relative to the other words in that sentence. For
example, in “المعلم المخلص يحترمه طلابه” (Al Mu’alim al-mukhlis yahtarimaho
Tulabaho, the faithful teacher is respected by his students), the suffix
pronoun “ـه” (Heh) in the two words “يحترمه” (yahtarima-ho, respected-him)
and “طلابه” (Tulaba-ho, students-his) refers to the word “المعلم” (Al
Mu’alim, teacher-the).
Generally, Arabic computational morphology is challenging because
the morphological structure of Arabic also comprises a predominant
system of clitics. These are morphemes that are grammatically
independent, but morphologically dependent on another word or phrase47.
72 Khaled Shaalan et al.
Consequently, one can naturally conclude that this proportion is higher for Arabic text than for languages with less complex morphology: the same word can be joined to various affixes and clitics, and thus the vocabulary is much larger. The following Arabic
words: "مكتوب" (Maktoob, written), "كتابات" (Kitabat, writings), "كاتب" (Katib, writer), "كتاب" (Kitab, book), "كتب" (Kutob, books), "مكتب" (Maktab, office), "مكتبة" (Maktabah, library), and "كتابة" (Kitabah, writing) are all derived from the same trilateral root of three consonants, the verb "كتب" (Ktb, wrote), and they all refer to the same concept. To extract the stem from a word, there are two types of stemming. The first type is light stemming, which removes affixes (prefixes, infixes, and suffixes) formed from combinations of the letters of the word "سألتمونيها" (sa'altamuniha). The second type is heavy stemming (i.e. root stemming), which extracts the root of the word and implicitly includes light stemming48,49.
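A minimal sketch of light stemming, assuming a toy affix list invented for illustration; real light stemmers such as those of Refs. 48 and 49 use carefully ordered affix tables and more elaborate length and pattern checks:

```python
# Toy light stemmer: strips at most one prefix and one suffix drawn from
# the Arabic affixation letters. Illustrative only, not a production stemmer.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]   # longest first
SUFFIXES = ["ون", "ات", "ها", "ية", "ه", "ة"]

def light_stem(word):
    for p in PREFIXES:
        # Only strip if at least three letters remain (a common safeguard).
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(light_stem("والمعلمون"))   # wa-al-mu'allimun -> stem معلم
```

A heavy (root) stemmer would go further, reducing e.g. the stem معلم to the root علم by matching it against morphological patterns.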
2.2.3. Annexation
Another morphological challenge in the Arabic language is that two words can be joined to form a compound word. Such compounding can involve nouns, verbs, or particles. Although it is not common in traditional Arabic, it is used in Modern Standard Arabic. Usually, the compound word is semantically transparent, such that
the meaning of the compound word is compositional in the sense that the
meaning of the whole is equal to the meaning of parts put together50. For
example, the word "رأسمالية" (rasimalia, capitalism) comes from the compound of the two nouns "رأس المال" (ras almal, capital); the word "مادام" (madam, as long as) comes from the compound of the particle "ما" (ma) and the verb "دام" (dam); and the word "كيفما" (however) comes from the compound of the two particles "كيف" (kayf) and "ما" (ma). The meaning of a compound word is
important for understanding the Arabic text, which is a challenge to POS
tagging and applications that require semantic processing51.
that are unable to capture the effects of inflectional variation. Thus, they
can cause problems in Machine Translation, Information Retrieval, Text
Summarization, among other NLP applications. Such expressions are termed idiomatic multiword expressions. Other multiword expressions are words that co-occur together more often than not but have transparent compositional semantics, such as "رئيس الدولة" (rayiys alddawla, the president of the country). As such, they do not pose a challenge for NLP applications, although they can be of interest if categorized into types, as in Named Entity Recognition, i.e. as contextual cues.
Ambiguous Anaphora
Pronominal anaphora is very widely used in the Arabic language; a pronoun has an empty semantic structure and no meaning independent of its antecedent, the main subject. Such a pronoun can be a third person pronoun, called "ضمير الغائب" (damir alghayib) in Arabic, such as "ها" /hA/ (her/hers/it/its), "ه" /h/ (him/his/it/its), "هم" /hm/ (masculine: them/their), and "هن" /hn/ (feminine: them/their).
Challenges in Arabic Natural Language Processing 75
Hidden Anaphora
Another major kind of anaphora is hidden anaphora. It is restricted to the
subject position when there is no present noun or pronoun acting as the
subject. This is evident in the sentence "الملاحظة على اللوح، معقدة" (The note on the board [is] complicated), where the pronoun "هي" (she/it) is not present in the sentence, i.e. "الملاحظة على اللوح، هي معقدة"; this is called "zero anaphora". The human mind can determine the hidden anaphora (antecedent), but it causes grammatical mistakes in automated NLP systems.
2.3.4. Agreement
Agreement is a major syntactic principle that affects the analysis and generation of an Arabic sentence, and it is very significant for complex NLP applications such as Machine Translation and Question Answering13,47. Agreement in Arabic is full or partial and is sensitive to word-order effects1. An adjective in Arabic usually follows the noun it modifies, "الموصوف" (almawsuf), and fully agrees with it in number, gender, case, and definiteness, e.g. "الولد المجتهد" (alwald almujtahad, the diligent boy) and "الأولاد المجتهدون" (al'awlad almujtahidin, the diligent boys). The verb is marked for agreement depending on the word order of the subject relative to the verb, see Figure 1.
3. Conclusion
In Arabic, there is no one-to-one mapping between the letters of the language and the sounds with which they are associated. An Arabic word does not dedicate letters to represent short vowels. Letter forms change depending on their place in the word, and there is no notion of capitalization. In MSA texts, short vowels are optional, which makes it even more difficult for non-native speakers of Arabic to learn the language and presents challenges in analyzing Arabic words. Morphologically, the word structure is both rich and compact, such that a single word can represent a phrase or a complete sentence.
Syntactically, Arabic sentences are long, with complex syntax. Arabic anaphora increases the ambiguity of the language; in some cases a Machine Translation system fails to identify the correct antecedent because of its ambiguity, and external knowledge is needed to resolve it. Moreover, Arabic sentence constituents can be swapped (free word order) without affecting structure or meaning, which adds more syntactic and semantic ambiguity and requires more complex analysis. Finally, agreement in Arabic is full or partial and is sensitive to word-order effects.
Arabic language differs from other languages because of its complex
and ambiguous structure that the computational system has to deal with at
each linguistic level.
References
Language Resources and Tools, NEMLAR, 22nd–23rd Sept., Egypt, pp. 118–122 (2004).
35. Azmi and R. Almajed, A survey of automatic Arabic diacritization techniques, Natural Language Engineering, Cambridge University Press, UK, 21(3):477–495 (2015).
36. S. Abu-Rabia, The Role of Vowels in Reading Semitic Scripts: Data from Arabic and Hebrew, Reading and Writing: An Interdisciplinary Journal, 14, 39–59 (2001). DOI: 10.1023/A:1008147606320.
37. Farghaly, Three Level Morphology for Arabic, presented at the Arabic Morphology Workshop, Linguistics Summer Institute, Stanford, CA (1987).
38. T. McCarthy, The critical theory of Jurgen Habermas, Studies in Soviet Thought, Springer, Berlin Heidelberg, 23(1):77–79 (1982).
39. Soudi, G. Neumann and A. Bosch, Arabic computational morphology: knowledge-based and empirical methods, vol. 38, Springer, Dordrecht (2007).
40. Shoukry and A. Rafea, Sentence-level Arabic sentiment analysis, 2012 International Conference on Collaboration Technologies and Systems (CTS), Denver, CO, USA, pp. 546–550 (2012). DOI: 10.1109/CTS.2012.6261103.
41. S. S. Al-Fedaghi and F. Al-Anzi, A New Algorithm to Generate Arabic Root-Pattern Forms, In Proceedings of the 11th National Computer Conference and Exhibition, pp. 391–400 (1989).
42. N. De Roeck and W. Al-Fares, A morphologically sensitive clustering algorithm for identifying Arabic roots, In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, pp. 199–206 (2000).
43. S. Mesfar, Towards a cascade of morpho-syntactic tools for Arabic natural language processing, In Computational Linguistics and Intelligent Text Processing, Springer Berlin Heidelberg, pp. 150–162 (2010).
44. Y. Benajiba, M. Diab and P. Rosso, Arabic named entity recognition using optimized feature sets, In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 284–293 (2008).
45. Y. Benajiba, P. Rosso and M. J. Bened, ANERsys: An Arabic Named Entity Recognition system based on Maximum Entropy, In Proc. of CICLing-2007, Springer-Verlag, LNCS series (4394), pp. 143–153 (2007).
46. K. Thakur, Genitive Construction in Hindi, M.Phil. Thesis, University of Delhi, India (1997).
47. K. Shaalan, Arabic GramCheck: A Grammar Checker for Arabic, Software Practice and Experience, John Wiley & Sons Ltd., UK, 35(7):643–665 (2005).
48. M. N. Al-Kabi, S. Kazakzeh, B. Abu Atab, S. Al-Rababah and S. Alsmadi, A Novel Root based Arabic Stemmer, Journal of King Saud University, Computer and Information Sciences, 27(2):94–103 (2015). DOI: 10.1016/j.jksuci.2014.04.001.
49. H. K. AlAmeed, S. O. AlKitbi, A. A. AlKaabi, K. S. AlShebli, N. F. AlShamsi, N. H. AlNuaimi, and S. S. AlMuhairi, Arabic Light Stemmer: A new enhanced approach, In Proceedings of the Second International Conference on Innovations in Information Technology (IIT'05), Dubai, UAE (2005).
50. W. M. Amer, Compounding in English and Arabic: A contrastive study, Technical Report (2010), available online at: http://site.iugaza.edu.ps/wamer/files/2010/02/Compounding-in-English-and-Arabic.pdf
51. S. Elkateb, W. Black, P. Vossen, D. Farwell, H. Rodríguez, A. Pease and M. Alkhalifa, Arabic WordNet and the challenges of Arabic, In Proceedings of Arabic NLP/MT Conference, London, UK (2006).
52. K. Shaalan, An Intelligent Computer Assisted Language Learning System for Arabic Learners, Computer Assisted Language Learning: An International Journal, Taylor & Francis Group Ltd., 18(1 & 2):81–108 (2005).
53. Hammo, A. Moubaiddin, N. Obeid, and A. Tuffaha, Formal Description of Arabic Syntactic Structure in the Framework of the Government and Binding Theory, Computacion y Sistemas, 18(3):611–625 (2014).
54. S. Hammami, L. Belguith and A. Hamadou, Arabic Anaphora Resolution: Corpora Annotation with Co-referential Links, The International Arab Journal of Information Technology, 6(5):481–489 (2009).
55. R. Al-Sabbagh and K. Elghamry, Arabic Anaphora Resolution: A Distributional, Monolingual and Bilingual Approach, Faculty of Al-Alsun, Ain Shams University, Cairo, Egypt (2002).
56. S. Usama, On issues of Arabic syntax: An essay in syntactic argumentation, Brill's Annual of Afroasiatic Languages and Linguistics, pp. 236–280 (2011).
57. M. Shquier and T. Sembok, Word agreement and ordering in English-Arabic machine translation, 2008 International Symposium on Information Technology, IEEE Xplore, Kuala Lumpur, pp. 1–10 (2008). DOI: 10.1109/ITSIM.2008.4631625.
Chapter 4
1. Introduction
The Arabic script has been studied for several decades. Despite the complexity of its morphology, which is due to its cursive aspect and the presence of many diacritic signs, several systems are functional and give very encouraging results, matching those obtained for Latin handwriting. The main objective of this chapter is to show the progress we have obtained on Arabic script recognition over several decades using machine learning techniques. As feature extraction remains the most important step for achieving high recognition performance, we first give a brief overview of the feature extraction techniques we proposed in past work to characterize and recognize the Arabic script.
The remainder of this chapter is organized as follows: Section 2 discusses the characteristics of the Arabic script. Section 3 reviews some feature extraction techniques. Section 5 focuses on Markov models as generative models. Section 6 shows an example of neural networks used in the context of a large vocabulary; it illustrates the combination of three classifiers for the recognition of decomposable words. Section 7 discusses directions for future work and concludes.
2. A Challenging Morphology
The Arabic script has complex morphological properties that make its automatic recognition a constant challenge [1]. The natural attachment of successive letters within a word means that a letter's shape varies depending on the connection type, which also influences its terminal stroke. Moreover, in handwriting, the division of a word into several parts (PAWs) gives more freedom in the writing of each PAW and creates a zigzag in the baseline, which distorts the main guide for feature extraction.
If we consider that there are two main feature families for writing recognition, structural and statistical [2], the structural ones are those that best capture the morphological appearance of the Arabic script, and they received our attention throughout our research on Arabic recognition. The morphological aspect in structural features is exhibited in two elements: regularities and singularities (see Figure 1). The regularities correspond to the flat part in the middle of the word representing the elongations between characters; even though it contains no information, its location is synonymous with the baseline. The singularities, in contrast, are rich in information and contain the real characteristics of the word morphology, such as ascenders, descenders, diacritic signs, loops and accents.
The position of some features, such as the Alif (ا), the descender letters, and the ascenders
segmentation of the word, and merely a convenient way to pass the image to an HMM or a DBN; implicit word segmentation then occurs during decoding. A potential benefit of this word decomposition is that extracting features from the word image's columns and rows allows the visual and sequential aspects of handwriting recognition to be learned together, rather than treated as two separate problems. Structural features such as loops, diacritics, ascenders (stems), and descenders (legs), with their type, number and position in the word, as well as the number and position of PAWs, are then extracted (see Figure 11(a)). Note that Arabic handwritten words are not usually written on a single baseline; the authors extracted a sequence of sub-baselines and formed the entire word baseline by juxtaposition of its PAW baselines (see Figure 3).
more or less long depending on the font used (see Figure 4(c)). In contrast, machine-printed Latin words are composed of successive letters without any ligature between them (see Figure 4(d)). Consequently, horizontal strokes are more frequent in Arabic words than in Latin words. Both scripts use vertical strokes for ascenders.
To capture the coarseness of a texture in specified directions, BRL-based features are used [4]. Recall that a black run is a set of consecutive, co-linear black pixels; the number of pixels in the run is its length. For a given image, a BRL vector P is defined as follows: each element P(i) is the number of black runs of length i in a given direction. The BRL vector's size is M, which corresponds to the maximum run length in the words. An orientation is defined using a displacement vector d(x, y), where x and y are the displacements along the x-axis and y-axis respectively. The typical orientations are horizontal, right diagonal, vertical and left diagonal; computing the run-length encoding for each direction thus produces four BRL vectors, which are then concatenated into a single vector characterizing the word's script. Figure 5 illustrates the proposed feature extraction method with an example. Various texture features are then derived from the BRL vectors, measuring the distribution of short and long runs, the similarities of gray-level values and of run lengths throughout the word's image, and the homogeneity and distribution of runs of the word's image in a specific direction.
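The BRL computation just described can be sketched as follows. This is a simplified version over a small binary image, assuming the image is given as lists of 0/1 rows with 1 = black; it concatenates the four directional run-length histograms into one feature vector.

```python
def run_lengths(seq):
    """Lengths of maximal runs of black (1) pixels in a 1-D sequence."""
    runs, n = [], 0
    for px in seq:
        if px:
            n += 1
        elif n:
            runs.append(n)
            n = 0
    if n:
        runs.append(n)
    return runs

def lines_in_direction(img, d):
    """All maximal pixel lines of the image along one of the four directions."""
    h, w = len(img), len(img[0])
    if d == "horizontal":
        return [row[:] for row in img]
    if d == "vertical":
        return [[img[y][x] for y in range(h)] for x in range(w)]
    if d == "right_diagonal":   # top-left to bottom-right: x - y constant
        return [[img[y][x] for y in range(h) for x in range(w) if x - y == k]
                for k in range(-(h - 1), w)]
    if d == "left_diagonal":    # top-right to bottom-left: x + y constant
        return [[img[y][x] for y in range(h) for x in range(w) if x + y == k]
                for k in range(h + w - 1)]

def brl_features(img):
    """Concatenate the four BRL vectors; element i counts runs of length i+1."""
    h, w = len(img), len(img[0])
    M = max(h, w)               # safe upper bound on the maximum run length
    feats = []
    for d in ("horizontal", "vertical", "right_diagonal", "left_diagonal"):
        P = [0] * M
        for line in lines_in_direction(img, d):
            for r in run_lengths(line):
                P[r - 1] += 1
        feats.extend(P)
    return feats
```

The texture statistics mentioned above (short-run and long-run emphasis, homogeneity, and so on) would then be computed from each directional histogram.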
Being based on shape descriptors, HOG has interesting properties for script characterization [4]. As shown in Figure 6, the HOG descriptor is a histogram that counts gradient orientations at the pixels of a given image. The number of features depends on the number of cells and orientation bins.
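A much-simplified sketch of the idea follows, assuming a single cell covering the whole grayscale image and unsigned gradient orientations; a real HOG implementation adds a grid of cells, block normalization, and bin interpolation.

```python
import math

def hog_histogram(img, bins=8):
    """Global histogram of gradient orientations, weighted by magnitude
    (one cell covering the whole image, unsigned orientations in [0, pi))."""
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]      # central differences
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            ang = math.atan2(gy, gx) % math.pi      # fold to unsigned angle
            hist[int(ang / math.pi * bins) % bins] += mag
    return hist
```

For a word image, each cell's histogram would be computed this way and the per-cell histograms concatenated, which is why the feature count grows with the number of cells and bins.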
Fig. 5. Computation of the BRL vectors: (a) binary image; (b) the four run-length vectors for black pixels.
Fig. 8. (a) Offset in Co-MOG, (b) Co-occurrence of a word image at a given offset, (c)
Vectorization of co-occurrence matrix [5].
Fig. 9. Matching with SC: (a) SC of p in TC1, (b) log-polar histogram for p, (c) SC of q in TC2, (d) log-polar histogram for q, which is similar to that in (b) but different from that of p′ in (f). The best match is between p in TC1 and q in TC2. Black bins correspond to a higher number of pixels in that bin; gray bins contain fewer pixels than the black ones. Log-polar histogram similarity is measured with the χ2 distance.
compared to the previous case, because one only has to capture the difference between the X alternatives. Their advantage is that they are very fast once trained. Their drawbacks are: 1) they interpolate between training examples, and can fail when novel inputs are presented; and 2) they do not easily handle compositionality.
In the Naïve Bayes (NB) classifier, we predict the class Y knowing the feature vector X. If we assume that the features are independent, the joint probability is the product of the probabilities of each vector component conditioned on the class. In a Hidden Markov Model (HMM), X is a data sequence observed in states and Y is the random variable distributed over states; the observation probability is the product, over the whole sequence, of the joint probabilities of states and of observations in those states. Logistic regression maximizes the likelihood of a label (the phenomenon) given explanatory data, assuming a log-linear model. The Conditional Random Field predicts sequences of labels from a sequence of observations, conditioned on the context.
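The Naïve Bayes decision rule just described can be written down directly. The sketch below uses toy binary features and made-up parameters, and works in log space to avoid numerical underflow:

```python
import math

def predict_nb(x, priors, cond):
    """Return argmax_y [log P(y) + sum_i log P(x_i | y)] for binary features,
    where cond[y][i] = P(feature i = 1 | class y)."""
    best, best_score = None, float("-inf")
    for y, p_y in priors.items():
        score = math.log(p_y)
        for i, xi in enumerate(x):
            p = cond[y][i]
            score += math.log(p if xi else 1.0 - p)
        if score > best_score:
            best, best_score = y, score
    return best

# Made-up parameters for two classes over two binary features.
priors = {"A": 0.5, "B": 0.5}
cond = {"A": [0.9, 0.1], "B": [0.2, 0.8]}
```

The per-feature factors multiply (here, add in log space) exactly because of the independence assumption stated above.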
5. Markov Models
We will see later that by combining P(xi, yi) with terms issued from an appropriate decomposition of P(Y), we can achieve Markovian modeling of a certain order. In the example of Figure 10, the system follows a succession of phases in which sub-words are segmented, graphemes are extracted, characters are recognized, and the word is lexically corrected by matching against a dictionary.
Fig. 10. Hidden Markov Models for Word recognition from [7].
of the features depends on the current state and the previous state; this observation is associated with the model transitions. The general system uses an HMM with two levels for online word recognition: a local level describing the features within the letters (loops, peaks and oriented arcs) and a global level observing the letters in the word (extension in relation to
same time the emission likelihood of the lines of the band and the likelihood of its height expressed by the duration: P(d_j + 1 | s_j) is the probability of the duration in super-state j, and P_j(y) is the emission probability of line y, expressed by the number and width of the horizontal run-lengths. K is the number of samples, and d_j^k is the duration in super-state j for sample k.

δ_y(j) = max[δ_{y−1}(j−1) · a_{j−1,j}, δ_{y−1}(j) · P(d_j + 1 | s_j)] · P_j(y)    (6)

where 2 ≤ y ≤ Y and 1 ≤ j ≤ N.
The transition probability between two super-states is equal to:

a_{j−1,j} = (1/K) Σ_{k=1}^{K} d_j^k / (d_{j−1}^k + d_j^k)    (7)
While in Figure 15(b) (DBN2), the parameters π, B and A are equal to:
Fig. 15. Two architectures of DBN from [3]: (a) DBN1, (b) DBN2.
Fig. 16. Non Symmetric Half Plane from [13], where (a) represents the random variables
and (b) an example of a Latin word analysis by NSHP.
6. Discriminative Models
where nn is the number of neighbors and αij is the connection weight between i and j.
During the bottom-up process, features are extracted in the image zones, and the information is propagated for the election of letters, PAWs and words. In case of ambiguity, the zone of interest is identified and a query is sent back to the feature level to compare the Fourier descriptor of that zone.
This has been extended to the recognition of Arabic large vocabulary
by Bencheikh et al. [17]. This work was based on two observations:
7. Conclusion
References
Chapter 5
1. Word Spotting
Originally developed for Automatic Speech Recognition (ASR), word spotting has since been applied to the growing number of handwritten documents for the purpose of indexing. Even though speech is analog in nature while handwritten documents are spatial, word spotting of handwritten documents has been able to adopt the methods of speech recognition. Subsequently, techniques and algorithms specific to the processing of handwritten documents have been developed.
Early indexing work started by applying conventional Optical Character Recognition (OCR) techniques, whose results were passed to special search engines to search for words. However, Manmatha et al. designed the first handwritten word spotting system in 1996,1 and they found that applying traditional OCR techniques to search for words is inadequate. Using OCR for indexing words fails for the following reasons:2,3 1) handwriting analysis suffers from low recognition accuracies; 2) the associated indexing systems are hampered by having to process and recognize all the words of a document and then apply search techniques to the entire result; and 3) training OCR systems requires constructing a huge database for each alphabet.
Word spotting methods are based on two main approaches: template matching and learning-based. Manmatha et al.1 proposed the first indexing or word spotting system for single-writer historical documents; the proposed method was based on matching word pixels. Zhang et al.4 proposed a template matching approach based on extracting features from word images. Dynamic Time Warping (DTW)2,5,6 was successfully applied as an efficient template matching algorithm. Learning-based word spotting systems were introduced to adapt to multiple writers, with promising results. However, sufficiently large databases are needed to train these systems.
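As a reminder of how DTW performs the elastic template matching mentioned above, here is a minimal version over scalar feature sequences; word spotting systems apply the same recursion to per-column feature vectors of word images.

```python
def dtw(a, b):
    """Dynamic Time Warping distance between two feature sequences,
    using absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of: insertion, deletion, or match of the two frames.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Because the warping path may stretch or compress either sequence, two instances of the same word written at different widths can still obtain a small distance.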
This section defines word spotting, and describes different types of in-
put queries to word spotting systems. Then, the performance measures of
word spotting systems are described. Finally, different approaches of word
spotting are discussed.
1.1. Definition
systems suffer from the drawback that they can be applied only on closed
lexicons.12–15
Arabic script is always cursive, even when printed, and it is written horizontally from right to left. In Arabic writing, letter shapes change depending on their location in the word, a fact that distinguishes Arabic writing from many other scripts. In addition, dots, diacritics, and ligatures are special characteristics of Arabic writing. Figure 2 shows two Arabic handwritten documents.
The Arabic handwriting system evolved from a dialect of Aramaic which
has fewer phonemes than Arabic. Aramaic uses only 15 letters but Arabic
uses 28 letters. The letters in Arabic are formed by adding one, two or three
dots above or below the Aramaic letters to generate different sounds.11
Thus, many letters share a primary common shape and only differ in the
number and/or location of dots. This means dots play an important role in
the writing of Arabic and other languages that share the same letters such
as Farsi (Persian) and Urdu. It is also worth mentioning that more than
half of the Arabic letters (15 out of 28) are dotted. In printed documents,
double and triple dots are printed as separate dots, while in handwritten
documents there are different ways to write them, for example Figure 3
shows three different ways of writing double dots.
In addition, the shapes of letters change depending on their position in the word; each Arabic letter therefore has between two and four shapes: isolated (28 letters), beginning (22 letters), middle (22 letters), and ending (28 letters). Arabic letters do not have upper and lower cases. There are six letters in Arabic that connect only from the right side; therefore, when they appear in a word they cause
Content-based retrieval using a codebook has been used for Arabic word
spotting.21,29,30 In these systems, meaningful features are extracted to
represent codes of symbols, characters, or PAWs. Then similarity matching
or distance measure algorithms between the codes and the codebook are
applied to perform the final match.
Latin script is essentially based on two models (character and word), while Arabic script is based on three models: character, PAW and word. All three models are used for Arabic word spotting, with the PAW model the most extensively used, since a line of Arabic text can be viewed as a sequence of PAWs instead of words, and there is no difference between the spaces separating PAWs and those separating words. Nevertheless, a
few segmentation-free systems have been proposed for Arabic handwritten
word spotting, in which segmentation is embedded within the classification
process. These systems are either implemented using HMMs based on the
character model,24 or an over-segmentation is applied based on the PAW
model.31
Attempting to segment Arabic documents into candidate words may not be an appropriate approach for Arabic word spotting systems, because Arabic words are composed of PAWs that are easy to extract while there are no clear boundaries between words; this makes it difficult to segment a document into words. Srihari et al.32 tried to cluster words by segmenting the line into connected components and merging each main component with its diacritics. Nine features were extracted from each pair of clusters and passed to a neural network to decide whether the gap between the pair is a word gap. However, with ten writers each writing ten documents, the rate of correct word segmentation was only 60%, which significantly affected the spotting results.
Many studies favored segmenting documents into PAWs rather than words, due to the absence of clear word boundaries. Sari and Kefali21 preferred to segment the document into major connected components, circumventing the problem of word segmentation in Arabic documents; thus, they chose to process Arabic PAWs instead of words. They converted each PAW into a Word Shape Token (WST) and represented it by global structural features such as loops, ascenders and descenders. Input queries were coded similarly, and a string matching technique was then applied. They validated their word spotting system on both printed and handwritten Arabic manuscripts and historical documents. This approach is promising because it uses open lexicons and
3. Databases
4. Extracted Features
5. Concluding Remarks
References
Chapter 6
1. Introduction
text written fourteen centuries ago. Hardly any living language can claim
such a distinction. Arabic can be classified as Classical or Modern. The
Classical Arabic represents the pure language spoken by Arabs, whereas
Modern Standard Arabic (MSA) is an evolving variety of Arabic with
some borrowing to meet modern challenges, see Ref. 2.
There are 28 basic letters in the Arabic alphabet. In addition, there are 8 basic diacritical marks, which may be combined to form a total of 13 different diacritics. These marks are used to represent the three short vowels (a, i, u), while the letters (ا, و, ي, ى) are used to indicate vocalic length. The diacritical marks are placed either above or below the letters to indicate the phonetic information associated with each letter, which helps clarify the sense and meaning of the word. Unfortunately, MSA texts are typically devoid of diacritical markings.
Arabic language is considered one of the richest languages in terms of
vocabulary and rhetorical structures. It is also quite an intricate language.
Consider the sentence (لا تضرب زيدًا وتضحك), basically meaning: do not hit Zaid and laugh. Based on the diacritical mark on the last letter of the word (وتضحك), there are three distinct meanings: (1) if it is (وتضحكُ), the sentence means you are not allowed to hit Zaid, but you may laugh; (2) if it is (وتضحكْ), it means you are forbidden from both acts (hitting Zaid and laughing); and (3) for (وتضحكَ), you may do either act but not both, i.e. you may hit Zaid but not laugh, or laugh but not hit Zaid.3 The Arabic language presents some other
challenges as well, including long sentences with complex syntax, having
a pro-drop property, and being a free order language.4 The pro-drop
property means the subject may not be explicitly present.2 Arabic
sentences can take any form, VSO (Verb-Subject-Object), SVO, and
VOS.2 This free order property of the Arabic language presents a crucial
challenge for some Arabic NLP applications. Additionally, the lack of
diacritical markings in MSA often leads to ambiguity. For example, the
undiacritized word (علم) has several meanings, including (عِلْم), science, and (عَلَم), flag. This can happen in spoken language as well: an individual may read a sentence while ignoring the end-case diacritics, making all words end with the silence sound (سكون); this has the same impact as an undiacritized sentence in the written form. For example, (أمر المسؤول الموظف) could either mean the person in charge ordered the
A‘rib — A Tool to Analyze Arabic Sentences Syntactically 129
employee, or the employee instructed the person in charge. All the above
should give an idea as to why mastering the Arabic language is very
demanding, even for natives. It also gives credence to why the language
has lagged behind others computationally.
As Islam spread, Arab grammarians were quick to lay down rules to prevent incorrect readings of the Holy Qur'an. They established a completely new science called e'raab (إعراب), the syntactical analysis of Arabic sentences. E'raab is the key to identifying each word's role and thus the surface meaning of a sentence; it is based on the Arabic syntactical rules known as (قواعد الإعراب), which play a major role in understanding the semantics of a sentence. Some grammarians considered it an intellectual exercise to generate different valid e'raab of a sentence; it is said the grammarians were able to generate 147 different e'raab for the sentence (لا رجل في الدار ولا امرأة).3
The problem is how to automate this process so that the computer can analyze Arabic sentences and correctly classify their words into the main Arabic language components. This will help in identifying each word's role in the semantics of the sentence. The diacritical signs in Arabic text certainly help alleviate some of the ambiguity, and their absence surely increases the vagueness. Natives are reasonably good at resolving the ambiguity from context, but this is a truly challenging problem from the computer's perspective.
In this paper we propose a system that aims to automatically analyze Arabic sentences syntactically, i.e. to perform e'raab. We named it A'rib, which is the imperative verb (فعل أمر) of e'raab. It is hoped that such a system will help Arab students with the e'raab process, one of the most dreaded tasks in studying grammar. It will also help those learning Arabic as a second language to better understand the semantics of sentences and to appreciate the language's intricateness. The system can also serve as the nucleus of a more robust machine translation engine.
The proposed system is divided into three phases: lexical analysis, syntactical analysis, and a results builder. The lexical phase takes each word of the input sentence and analyzes it to figure out its role in the sentence; the result of this phase is saved for use in the next phase. In the second phase, we take the tokens from the previous phase and try to determine a matching Arabic rule. Finally, the tokens and the matching
130 M. Almedlej and A. M. Azmi
Arabic grammar rule are used by the third phase, the results builder, to
generate the e‘raab and place the proper diacritical signs on the sentence.
The rest of the paper is organized as follows. In Section 2 we cover
related work. In Section 3 we go over basic Arabic sentence structure. The
system design is covered in Section 4. In Section 5 we go over
implementation details. Finally, in Section 6 we conclude our study.
2. Related Work
A few pioneering researchers have made significant attempts to open the way to automating the structural analysis of Arabic sentences.
One of the earliest attempts was in Ref. 5, where the author proposed a model for a system that tries to analyze the Arabic sentence according to its syntax and explicit structure. The model ignored the semantics. The author used a context-free grammar (CFG), and the system was implemented in Prolog. Ref. 6 is another early attempt, in which the authors highlighted the importance of morphology and syntax in the field of Natural Language Understanding (NLU). This time the authors introduced what they called an ‘end-case analyzer’ that was integrated within an NLP system.
More recently, Ref. 7 developed a parser that processes an Arabic sentence in order to automatically explain the role of each word in the meaning of the sentence. The system is composed of two main parts: the lexical analyzer, which includes a database that stores all Arabic words; and the syntax analyzer, which contains a parser. The recursive parser uses a CFG to parse the sentence structure. One major drawback of the system is that it is limited to verbal sentences ( )ﺟﻤﻠﺔ ﻓﻌﻠﻴﺔwith active verbs only ()ﻓﻌﻞ ﻣﺒﻨﻲ ﻟﻠﻤﻌﻠﻮﻡ.
A somewhat related line of work is the automatic diacritization of Arabic text. MSA texts are often devoid of diacritical markings, and native speakers are hardly affected. However, there is a need for diacritical markings, e.g. for children and those learning Arabic as a second language. Moreover, certain NLP applications such as automatic speech recognition, text-to-speech, machine translation, and information retrieval may all need diacritized texts as a source for learning.8 There are plenty of
A‘rib — A Tool to Analyze Arabic Sentences Syntactically 131
works in this area; Ref. 4 presents a good survey of recent work on automatic diacritization. There is an overlap between the e‘raab process and the diacritization of Arabic sentences, as both are concerned with the semantics of the sentence. As noted, the diacritical markings help in making sense of the words and in disambiguating the sentence. The difference between e‘raab and automatic diacritization is that the former has to justify all of its actions/decisions, whereas in automatic diacritization the program places an appropriate diacritical marking, often stochastically. This is why the e‘raab of an Arabic sentence is a more challenging problem than automatic diacritization.
3. Basic Arabic Sentence Structure
In this section we delve into basic sentence structure and the relations among sentence elements. This should help readers appreciate the level of complexity associated with e‘raab. Readers are advised to consult Ref. 9 for more depth on the subject.
Traditional Arabic grammar divides sentences into two categories: ( )ﺟﻤﻠﺔ ﺍﺳﻤﻴﺔnominal sentences, and ( )ﺟﻤﻠﺔ ﻓﻌﻠﻴﺔverbal sentences. The distinction depends on the nature of the first word in the sentence: whether it is a noun or noun phrase, or a verb, respectively. Nominal sentences consist of a subject or topic ()ﺍﻟﻤﺒﺘﺪﺃ, and a predicate ()ﺍﻟﺨﺒﺮ. That is, the nominal sentence typically begins with a noun phrase or pronoun and is completed by a comment on that noun phrase or pronoun. The predicate or comment may be a complex structure: nouns, adjectives, pronouns, or prepositional phrases. By default, both the subject and the predicate of the nominal sentence are in the nominative case ()ﺣﺎﻟﺔ ﺭﻓﻊ. In the case where the predicate is a noun, pronoun, or adjective, it agrees with the subject in gender and number. Interestingly, it is possible to reverse the order and have the predicate before the subject. This occurs when the subject lacks the definite article, as in the example ( )ﺑﻴﻨﻬﻤﺎ ﺷﺠﺮﺗﺎﻥbetween [the two of] them [are] two trees. The predicate may also be complex; among other possibilities, it could be another nominal sentence, e.g. ( )ﺍﻟﺮﺑﻴﻊ ﻓﻀﻠﻪ ﻛﺒﻴﺮspring's bounty [is] large, or even a verbal sentence, e.g. ( )ﺍﻟﻜﺘﺎﺏ ﻳﻔﻴﺪ ﺍﻟﻘﺎﺭﺋﻴﻦthe book benefits the readers.
The simplest verbal sentence consists of a verb and its pronoun subject, which is incorporated into the verb as part of its inflection. This is what is termed in modern linguistics the ‘pro-drop’ feature. Past tense verbs inflect with a subject suffix; present tense verbs have a subject prefix and a suffix. When the subject noun is specified, it usually follows the verb and is in the nominative case. The verb agrees with the subject in gender, e.g. ( )ﻧﺠﺤﺖ ﺍﻟﻄﺎﻟﺒﺔthe student succeeded (f.), but not always in number. The verb may be either intransitive ()ﻓﻌﻞ ﻏﻴﺮ ﻣﺘﻌﺪﻱ, or transitive ()ﻓﻌﻞ ﻣﺘﻌﺪﻱ. In the former case, it does not take a direct object, but may be complemented by a prepositional phrase, e.g. ( )ﻳﻬﻄﻞ ﺍﻟﺜﻠﺞ ﻋﻠﻰ ﺍﻟﺠﺒﺎﻝsnow falls on the mountains. In the latter, the verb takes a direct object, which is in the accusative case ()ﺣﺎﻟﺔ ﻧﺼﺐ; the object may be a noun, a noun phrase, or a pronoun, e.g. ( )ﺭﻓﻊ ﻳﺪﻩhe raised his hand. If both the subject and the object of the verb are specified, then the order is typically Verb-Subject-Object (VSO); however, the orderings SVO and VOS are also possible under certain conditions. In VSO, if the subject is dual or plural, the verb inflects for gender agreement but not number agreement, e.g. ( )ﻛﺘﺐ ﺍﻟﻄﺎﻟﺒﺎﻥ ﺍﻟﺪﺭﺱthe two students wrote the lesson (m.). Some verbs in Arabic take two objects, with both expressed as nouns, noun phrases, or pronouns, e.g. ( )ﺃﺩﺭﺳﻬﻢ ﺍﻟﺮﻳﺎﺿﻴﺎﺕI teach them mathematics.
Moreover, the verb may be in either the active voice or the passive voice ()ﻣﺒﻨﻲ ﻟﻠﻤﺠﻬﻮﻝ. In the active, the doer of the action is the subject; in the passive, the direct object of the verb becomes the subject, e.g. ( )ﺩُﺭﺳﺖ ﺍﻟﻘﻀﻴﺔthe case was studied.
4. System Design
This work is concerned with designing a system that can automate Arabic syntactical analysis, so as to produce the proper e‘raab results without human intervention. In order to do that, it was necessary to review how humans analyze sentences to accomplish the task.
The normal analysis process goes through three main phases, starting with the sentence and ending with the e‘raab results, as follows:
Break down the target sentence into its main components, identifying each component by its type and properties. This part is handled by the lexical analyzer in the proposed system.
This is the first part of the system, responsible for analyzing the input sentence and identifying its words' properties. The tokens (word + properties) are stored, which in turn will help in the e‘raab process.
To accomplish this task, we start by isolating the words of the sentence from each other, so they are ready to be lexically analyzed. This step includes isolating the words from their prefixes and suffixes, which have their own syntactical position ( )ﻟﻬﺎ ﻣﺤﻞ ﻣﻦ ﺍﻹﻋﺮﺍﺏsuch as ()ﺍﻟﻀﻤﺎﺋﺮ ﺍﻟﻤﺘﺼﻠﺔ. These will each have their own token. Next, the analyzer determines the kind of each word, one of: noun, verb, or particle. Finally, it identifies the set of properties, which depends on the category of each word. For example, for nouns it should identify properties related to type, gender, number, variability, etc. For verbs, it needs to identify tense, effect, passivity, gender, etc. For particles, it only needs to find the type and sign. We classify all this information along with the prefixes and suffixes of that word, which may help in preparing the final results in some cases. Before transmitting the output of the lexical analyzer to the next subsystem, the token tags are converted into English (Figure 2). This simplifies processing in the next phase.
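A minimal sketch of what a lexical token might look like after tag conversion. The class and field names here are illustrative, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    """One analysis of a single word, with tags already converted to English."""
    word: str                                   # surface form as typed by the user
    kind: str                                   # one of: "noun", "verb", "particle"
    props: dict = field(default_factory=dict)   # kind-dependent properties
    prefix: str = ""                            # detached prefix, if any
    suffix: str = ""                            # detached suffix, e.g. an attached pronoun

def tag_word(word, kind, **props):
    """Build a token for one analysis of a word; an ambiguous word yields several."""
    return Token(word=word, kind=kind, props=props)

# The word "ذهب" is ambiguous: it can be a noun (gold) or a past-tense verb
# (went), so the lexical analyzer emits one token per possible analysis.
analyses = [
    tag_word("ذهب", "noun", gender="M", number="Sing"),
    tag_word("ذهب", "verb", tense="Past", gender="M", number="Sing"),
]
```

In this scheme the ambiguity of a word is represented simply as a list of tokens sharing the same surface form.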
In this phase we process the tokens received from the lexical analyzer to find the appropriate grammar rule corresponding to a valid Arabic syntactical structure. For this we use both a grammar and a parser, both of which are components of this subsystem. The first component is simply a set of Arabic sentence structures, expressed formally as a context-free grammar (CFG). The parser's role is to find the matching rule(s) from the CFG for the given set of tokens. In our implementation, the CFG is stored in an external file, and the parser parses the grammar dynamically. Saving the grammar in an external file makes for easy editing, in case of error or future addition of new rules.
Fig. 2. Lexical analysis results on the sample sentence, the successors are happy, with tags
converted into English.
and predicate. The predicate could be a sentence by itself, and in that case
we have to list it in the e‘raab as a sentence playing the role of a nominative
predicate ()ﻓﻲ ﻣﺤﻞ ﺭﻓﻊ ﺧﺒﺮ. These complex and nested structures in Arabic
sentences need to be tracked, and this tag helps in the tracking process. In
later examples we will show how exactly they are used.
Table 1. Sentence Components (SC) grammar for nominal and verbal sentences. The
superscripted pair holds information regarding the role and the case.
grammar should cover all terminal symbols on their right hand side that
were not covered in the SC grammar.
Fig. 4. Parsing the sentence ( )ﺃﺧﺬ ﺃﺣﻤﺪ ﻗﻠﻢ ﺻﺎﻟﺢAhmad took Saleh’s pen.
The results builder is the third and final part of the proposed system; along with the e‘raab of the input sentence, it outputs the appropriate diacritical markings. In Arabic, the diacritics of the internal letters are morphologically determined, while the case-ending diacritics (i.e. the e‘raab) are syntactically determined. This subsystem imitates the regular e‘raab process, and therefore requires the output of the two previous phases: the tokens and the matching syntactic rules.
The builder uses the syntactic structure to determine the role and grammatical judgment of each word in the sentence, e.g. whether it is nominative ( )ﻣﺮﻓﻮﻉor accusative ()ﻣﻨﺼﻮﺏ, etc. To determine the actual sign, the system makes use of the properties that were attached to the tokens. For example, consider the signs used for the nominative, known as ()ﻋﻼﻣﺎﺕ ﺍﻟﺮﻓﻊ. The sign is the diacritic damma ( )ﺍﻟﻀﻤﺔin the case of a singular noun, broken plural, feminine sound plural, or imperfect tense verb ( ;)ﻓﻌﻞ ﻣﻀﺎﺭﻉthe letter waw ( )ﻭin the case of a masculine sound plural or the five nouns ( ﺫﻭ، ﻓﻮ، ﺣﻤﻮ، ﺃﺧﻮ، ;)ﺃﺑﻮthe letter alif in the case of dual nouns; and the letter noon ( )ﻥin the case of the imperfect verb with an attached personal pronoun ()ﺍﻷﻓﻌﺎﻝ ﺍﻟﺨﻤﺴﺔ.
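These nominative-sign rules amount to a lookup from word category to sign. The sketch below encodes them directly; the category labels are our own, not the system's internal names:

```python
# Signs of the nominative case (علامات الرفع), keyed by word category.
NOMINATIVE_SIGN = {
    "singular_noun":          "damma",   # الضمة
    "broken_plural":          "damma",
    "feminine_sound_plural":  "damma",
    "imperfect_verb":         "damma",   # فعل مضارع
    "masculine_sound_plural": "waw",     # الواو
    "five_nouns":             "waw",     # أبو، أخو، حمو، فو، ذو
    "dual_noun":              "alif",
    "five_verbs":             "noon",    # الأفعال الخمسة
}

def nominative_sign(category):
    """Return the diacritic or letter marking the nominative for this category."""
    try:
        return NOMINATIVE_SIGN[category]
    except KeyError:
        raise ValueError(f"unknown word category: {category}")
```

Analogous tables would be needed for the accusative and genitive signs; the results builder would consult the token's properties to pick the category.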
Table 3. E‘raab sentence structure. In the e‘raab format the optional argument is inside
the brackets.
Word E‘raab format Example
Nouns and ﺍﻷﺳﻤﺎء <role> ﻣﺒﺘﺪﺃ ﻣﺮﻓﻮﻉ
verbs with ﻭﺍﻷﻓﻌﺎﻝ <judgment> ﻭﻋﻼﻣﺔ ﺭﻓﻌﻪ
dynamic end- ﺍﻟﻤﻌﺮﺑﺔ <sign> ﺍﻟﻮﺍﻭ )ﻷﻧﻪ ﺟﻤﻊ
cases (<reason>)
(ﻣﺬﻛﺮ ﺳﺎﻟﻢ
Nouns with ﺍﻷﺳﻤﺎء <type> <static ﺿﻤﻴﺮ ﻣﺘﺼﻞ
static case- ﺍﻟﻤﺒﻨﻴﺔ end-case sign> ﻣﺒﻨﻲ ﻋﻠﻰ ﺍﻟﻔﺘﺢ
ending <judgment> ﻓﻲ ﻣﺤﻞ ﺭﻓﻊ
<role>
ﻓﺎﻋﻞ
Perfect tense ﺍﻟﻔﻌﻞ ﺍﻟﻤﺎﺿﻲ <role> <static ﻓﻌﻞ ﻣﺎﺿﻲ ﻣﺒﻨﻲ
and imperative ﻭﺍﻷﻣﺮ end-case sign> ﻋﻠﻰ ﺍﻟﻀﻢ
verbs (<reason>) )ﻻﺗﺼﺎﻟﻪ ﺑﻮﺍﻭ
(ﺍﻟﺠﻤﺎﻋﺔ
Imperfect tense ﺍﻟﻔﻌﻞ ﺍﻟﻤﻀﺎﺭﻉ <role> <static ﻓﻌﻞ ﻣﻀﺎﺭﻉ
verb with static ﺍﻟﻤﺒﻨﻲ end-case sign> ﻣﺒﻨﻲ ﻋﻠﻰ
case-ending <reason> ﺍﻟﺴﻜﻮﻥ ﻻﺗﺼﺎﻟﻪ
<judgment>
ﺑﻨﻮﻥ ﺍﻟﻨﺴﻮﺓ ﻓﻲ
ﻣﺤﻞ ﻧﺼﺐ
One of the main problems an Arabic speaker faces is ambiguity due to multiple meanings of the same written words. This ambiguity cannot be attributed to some kind of imperfection in the Arabic language; rather, it is due to the modern custom of not writing the diacritical signs. These signs fully resolve the ambiguity and define exactly what the writer meant. Consider the simple example ()ﺳﺒﻘﻨﺎ ﺍﻟﻘﻄﺎﺭ. It could mean (ﺳﺒَﻘَﻨﺎ ﺍﻟﻘﻄﺎﺭ), where the subject is ()ﺍﻟﻘﻄﺎﺭ, meaning the train is ahead of us; or (ﺳﺒَ ْﻘﻨﺎ ﺍﻟﻘﻄﺎﺭ), where the subject is ()ﻧﺤﻦ, meaning we are ahead of the train.
There are many ways to handle ambiguity when the user inputs plain text devoid of any signs. The simplest is to ask the user to insert the appropriate signs each time there is an ambiguity. This may be annoying for the user; moreover, the user might be using the program precisely to inquire what diacritical options he/she has and their associated meanings. This opens the door to using the system to automatically guide the user through the possible appropriate signs. For this, the system has to process all possible sentence cases and output all the e‘raab results along with the proper signs in the sentences. It is worth noting that the number of possible sentence structures is huge once the sentence is processed by the lexical analyzer. The main reason is that the lexical analyzer considers all possible types for each word with no regard to the syntactical structure. However, the number of possibilities comes down once the sentence is processed by the syntactical analyzer, where all non-matching types are discarded. Another possible scheme to resolve the ambiguity is through semantic analysis; but since the proposed system focuses on syntactical analysis, we leave this for future work.
In designing the system we allowed for the possibility of errors on the user's side. The errors could be lexical or syntactical, and either way the system should be smart enough to handle them. A lexical error means the lexical analyzer fails to recognize one or more of the input words, whereas a syntactical error means the parser fails to find a matching grammatical rule for the input sentence. In either case, the process should stop when the error is encountered and the user is asked to recheck his/her input. If the user insists on the same input, the system must gracefully terminate with an appropriate message indicating failure to handle the input in its current form.
5. Implementation
The main classes are called upon as soon as the user clicks the button to process the input (Algorithm 1).
The object of the first class receives the user's input text and stores the lexical analysis results as tokens. These tokens are passed to another object, which processes them syntactically, storing the results (matching rules and tokens) in the form of Solutions. An object of the last class takes these Solutions in order to translate and print them in the appropriate format for the user. Next we go over each class in more depth.
Procedure process_input()
{
L = new LexicalAnalyzer()
S = new SyntacticAnalyzer()
R = new ResultsBuilder()
if (text is empty)
print(“Empty input”)
else {
L.Input = text
L.LexicalAnalysis()
S.Tokens = L.Tokens
S.SyntacticAnalysis()
R.Solutions = S.Solutions
R.BuildFinalSentences()
}
}
Algorithm 1. Dataflow between main classes of the system.
to the similarity of its effect on the e‘raab result nor how scarce the case
is. All this leads to ample redundancy and an increase in the final output.
Moreover, the output is geared toward professional human understanding,
summarizing the attributes using as few words as possible.
Fig. 5. Alkhalil’s output results as HTML. For the word ( )ﺫﻫﺐit reports 22 results with
much redundancy. After post processing to remove the redundant results we end up with 7
results only.
The main job of the lexical analysis class is to process these results to produce the desired tokens, which are carried into the next process: the syntactic analysis. Figure 6 shows the lexical analysis activity. There are two problems associated with Alkhalil that we needed to address: (1) Alkhalil outputs its results as an HTML file, which we converted to a more convenient form, strings; and (2) there is too much redundant data. Following some analysis we managed to remove about 70% of the possible redundant cases, by looking at results that repeat the same properties while ignoring judgments and diacritical markings. The remaining 30% of redundant cases turned out to be more challenging; these could not be removed
without some sort of human intervention. The removed tokens were not actually discarded from the system but rather kept for later use. The idea is to retain the complete diacritics of the word from the original result, for when they are called for.
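The first-pass redundancy removal described above, where two results count as duplicates when they share the same properties once judgments and diacritical markings are ignored, can be sketched as follows (the field names are hypothetical):

```python
def dedupe(results, ignore=("judgment", "diacritics")):
    """Keep one result per distinct property set, ignoring the listed fields.

    Each result is a dict mapping property name -> value. Removed results
    are returned separately, so their full diacritics can be recalled later
    rather than being discarded outright.
    """
    seen, kept, removed = set(), [], []
    for r in results:
        key = frozenset((k, v) for k, v in r.items() if k not in ignore)
        if key in seen:
            removed.append(r)
        else:
            seen.add(key)
            kept.append(r)
    return kept, removed
```

Using a `frozenset` of the remaining property pairs as the key makes the comparison independent of field order.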
We need to properly tag each word. The tag depends on the result case, which in turn depends on the type of the word. We created three functions to do the tagging, one for each type: verb, noun, and particle. POSTV handles the verbal analysis process; it identifies the verb's tense, gender, number, doer, activity, transitivity, variability, etc. POSTN handles the nominal analysis process, identifying the noun's type, gender, number, etc. For the particle analysis, POSTP only identifies the particle's type and sign. After this, the tokens are ready for syntactic analysis. As we usually have more than one word with more than one case, all of this is stored in a two-dimensional Token array: the first dimension indexes the words given by the user, while the second dimension specifies the cases for that word. Figure 7 shows some example tokens for the word ( )ﺫﻫﺐfollowing the lexical analysis of Alkhalil's output.
We would like to go briefly over some of the limitations/erroneous behavior we encountered while working with Alkhalil, as these impacted the final e‘raab results.
Even after removing some of the redundant results, the number of remaining cases for each word was still high. This resulted in an exponential number of sentences going into the next stage of processing. For example, for a sentence with four words, where each word has three cases, the total number of possible sentences is 3⁴ = 81.
:myTokens[][]
Field           Token 1 (noun)   Token 2 (verb)
Word            ﺫﻫﺐ              ﺫﻫﺐ
dWord           ﺫَﻫَﺐُ             ﺫَﻫَﺐَ
Type            noun             verb
SubType         StNn             Past
Specialty       #                #
Gender          M                M
Num             Sing             Sing
NPluralType     #                #
isVariable      true             false
InVarSign       #                ﺍﻟﻔﺘﺢ
Prefix          #                #
Suffix          #                #
Indices         01234            16
NDefinitive     Comn             #
VPassivity      #                Actv
VTransitivity   #                0
VDoer           #                Ab
StaticV         false            false
This is the heart of the system: it performs the actual syntactic analysis on the tokens received from the lexical analyzer. It finds the matching Arabic sentence rule for the sequence of tokens. The matching rules are stored in solution objects and used in the final stage, the results builder. The predefined CFG rules are stored in an external XML file and are dynamically parsed using a tree-based algorithm. Below we go over this in more detail.
Fig. 8. A sample CFG rule stored in an XML format. The <cond> tag marks the conditions.
Here we have two conditions (= the number of arguments to the right of →), one for
Subject and another for Predicate. Each condition consists of four arguments: Role,
Judgment, Addition, and Place. The arguments are separated by a “;”, and “#” indicates
N/A.
(CFG) Nominal Sentence → Subject + Predicate
(XML)
<Rule>
<ID>NSnt1</ID>
<Name>NSnt</Name>
<Strc>Subj Pred</Strc>
<Cons>----</Cons>
<Cond>Subj;Nomi;#;# Pred;Nomi;#;#</Cond>
</Rule>
(XML)
<Rule>
<ID>Pred2</ID>
<Name>Pred</Name>
<Strc>Geni</Strc>
<Cons>----</Cons>
<Cond>Pre;Pre;#;#PrRu</Cond>
</Rule>
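Loading such a rule file and splitting each <Cond> field into its four arguments (Role, Judgment, Addition, Place) might look like the sketch below. The actual implementation is not shown in the paper, so function and field names here are illustrative:

```python
import xml.etree.ElementTree as ET

def parse_cond(cond):
    """Split a <Cond> string into one four-field condition per component.

    Conditions are whitespace-separated; within each, the arguments Role,
    Judgment, Addition and Place are separated by ';', with '#' for N/A.
    """
    out = []
    for part in cond.split():
        role, judgment, addition, place = (part.split(";") + ["#"] * 4)[:4]
        out.append({"role": role, "judgment": judgment,
                    "addition": addition, "place": place})
    return out

def load_rules(xml_text):
    """Parse the external CFG file into a list of rule dicts."""
    root = ET.fromstring(xml_text)
    rules = []
    for r in root.iter("Rule"):
        rules.append({
            "id":   r.findtext("ID"),
            "name": r.findtext("Name"),
            "strc": r.findtext("Strc").split(),   # right-hand-side symbols
            "cond": parse_cond(r.findtext("Cond")),
        })
    return rules
```

With the rules in memory as plain dictionaries, the parser can index them by name instead of rescanning the XML file.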
This is the main class where all the syntactic analysis is conducted (see Figure 10). Each process contains several mini-processes that act together to perform a task. It starts by receiving the tokens from the lexical analyzer and creating all possible combinations of the tokens. This is followed by loading the rules from the external XML file, ready for parsing. Next it recursively parses the rules to find a match among the valid structures of possible sentences. In the end it stores the matched rules along with the tokens in the form of solution objects. Below we explain these steps in more detail.
Creating token sentences. This process produces all possible sentences from the given set of tokens. It builds the sentence tree of the input tokens, where each path from the root down to a leaf represents one possible morphological combination for the input sentence; see Figure 11 for an example. Each path is stored as one sentence, an array of tokens. At the end, it produces a two-dimensional array that holds all possible morphological combinations of the tokens. The number of sentences produced is usually vast; however, it is cut down at a later stage, when sentences with no matching rules are removed following the syntactic analysis.
Fig. 11. Generated sentence tree for the sentence ()ﺫﻫﺐ ﺃﺣﻤﺪ ﺇﻟﻰ ﺍﻟﻤﺪﺭﺳﺔ.
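Enumerating every root-to-leaf path of the sentence tree is equivalent to taking the Cartesian product of each word's candidate analyses; for four words with three cases each that gives 3⁴ = 81 candidate sentences. A sketch:

```python
from itertools import product

def token_sentences(cases_per_word):
    """All possible morphological combinations, one token picked per word.

    cases_per_word is a list with one entry per word, each entry being the
    list of candidate analyses for that word; the result corresponds to the
    paths of the sentence tree.
    """
    return [list(path) for path in product(*cases_per_word)]

# Four words, three candidate analyses each -> 3**4 = 81 candidate sentences.
words = [[f"w{i}c{j}" for j in range(3)] for i in range(4)]
sentences = token_sentences(words)
```

This is why trimming redundant analyses early matters so much: the number of candidate sentences grows exponentially with sentence length.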
Loading the rules. This loads the full set of rules into an appropriate object. Though it may consume some storage, it is more efficient than scanning the XML file for matching rules each time.
Finding the matching rules. This process recursively fetches the rules and builds the Arabic grammar derivation for the given tokens. Overall, it explores all possible Arabic grammatical combinations that have the same length as the given tokens. For each token it checks the solution's compatibility to evaluate and test that combination. The search algorithm works in a tree-based fashion, fetching the leftmost component until it encounters a leaf (Figure 12). For each token sentence it fetches the main rule, named “Sent” (for Sentence), with an empty solution, a kind of bootstrap for all the rules. The parser keeps calling itself recursively, each time replacing part of the structure with a new solution. The process is conducted in a depth-first search fashion. For each rule the parser fetches, it performs two main operations: substituting the solution and checking it.
For the process of substituting the solution we need the current solution and the name of the rule to be substituted, along with its structure (the new solution). Here we simply remove the rule from the solution and put the new solution in its place. We need to copy the properties of the new solution (Role, Judgment, Addition, and Place) from the previous solution. See the example in Figure 13, which shows how the parser keeps tracing the placements and the rules. After substituting a plausible solution, the parser checks the new solution. If the check fails, the parser stops navigating the rest of the current solution and moves on to the next. There are two cases in which the check fails: (1) the leftmost terminal does not match the corresponding token in the sentence; or (2) the number of arguments in the solution exceeds the number of tokens in the sentence (see the filled boxes in Figure 12).
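The depth-first matching loop can be sketched as a small recursive function over a toy grammar. This is a simplification: it tracks only symbol names, not the Role/Judgment properties the real parser carries along, and the grammar below is illustrative, not the system's full rule set. It does, however, implement both failure checks from the text:

```python
def matches(symbols, tokens, grammar):
    """Depth-first test: can `symbols` derive exactly the token kinds in `tokens`?

    Non-terminals are expanded leftmost-first via `grammar` (a dict mapping
    name -> list of right-hand sides); terminals must equal the corresponding
    token. Failure check (1): a leftmost terminal mismatch. Failure check (2):
    more symbols than remaining tokens.
    """
    if not symbols:
        return not tokens                  # success only if all tokens consumed
    if len(symbols) > len(tokens):
        return False                       # check (2): too many arguments
    head, rest = symbols[0], symbols[1:]
    if head in grammar:                    # non-terminal: try each expansion
        return any(matches(list(rhs) + rest, tokens, grammar)
                   for rhs in grammar[head])
    # terminal: check (1), then continue with the remaining symbols/tokens
    return head == tokens[0] and matches(rest, tokens[1:], grammar)

# Toy grammar: a nominal sentence is a subject followed by a predicate,
# where the predicate may be a noun or a prepositional (genitive) phrase.
GRAMMAR = {
    "Sent": [["NSnt"]],
    "NSnt": [["Subj", "Pred"]],
    "Subj": [["noun"]],
    "Pred": [["noun"], ["prpP", "GenN"]],
    "GenN": [["noun"]],
}
```

The real system additionally records which rule matched (the solution object), so the results builder can later justify each e‘raab decision.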
Fig. 12. A sample syntax parsing tree process for a sentence with 3 tokens. The underlined word is the non-terminal currently being processed. There are four types of boxes: solid, dotted boundary, dashed boundary, and filled. Solid boxes mean all arguments are non-terminals and further exploration is required; dotted boxes mean the arguments are mixed (terminals and non-terminals) and we need to confirm that the terminals match; dashed boxes mean all arguments are terminals, so we have reached a leaf and there is no more exploration; and filled boxes mean the number of arguments exceeds the number of tokens, so we simply stop exploring.
Legend: NSnt: Nominal Sentence ( ;)ﺟﻤﻠﺔ ﺍﺳﻤﻴﺔSubj: Subject ( ;)ﻣﺒﺘﺪﺃPred: Predicate (;)ﺧﺒﺮ
GenC: Genitive Construction ( ;)ﻣﻀﺎﻑ ﻭﻣﻀﺎﻑ ﺇﻟﻴﻪGeni: Genitive (;)ﺷﺒﻪ ﺟﻤﻠﺔ ﺟﺎﺭ ﻭﻣﺠﺮﻭﺭ
GenN: Genitive Noun ( ;)ﺍﺳﻢ ﻣﺠﺮﻭﺭand prpP: Preposition Particle ()ﺣﺮﻑ ﺟﺮ.
R
1 S0 → Sent(#;#;#;#)
C
R Sent
2 Sent → NSnt(#;#;#;#)
C #;#;#;#
R NSnt
3 NSnt → Subj(Subj;Nomi;#;#) Pred(Pred;Nomi;#;#)
C #;#;#;#
R Subj Pred
4 Subj → noun(Pre;Pre;#;#)
C Subj;Nomi;#;# Pred;Nomi;#;#
R noun Pred
5 Pred → Geni(Pre;Pre;#;#)
C Subj;Nomi;#;# Pred;Nomi;#;#
Fig. 13. Example to explain how the parser keeps tracing the placements and the rules.
Legend: Sent: Sentence; NSnt: Nominal Sentence; Subj: Subject; Pred: Predicate; Geni:
Genitive; GenN: Genitive Noun; PrRu: current component is the place of the Previous role
of the sentence; and prpP: Preposition Particle. The “#” means no value.
This is the final stage of the process; it constructs the e‘raab result in the format shown in Table 3. This class receives the solutions array from the syntax analyzer, containing all the required information: the solutions, to know the role and judgment; and the tokens, to decide the proper diacritic sign. The main function of this class calls the appropriate function for each word of each case, according to the token's type (verb, noun, etc.) and its properties (variable, invariable, etc.). Within these functions the role and the judgment of a word are translated into proper Arabic e‘raab. The sign is deduced from the token's type.
5.4. Output
In the end, the e‘raab results are displayed on the screen. Figure 14 shows A‘rib's GUI displaying the e‘raab for the sentence ()ﺃﻛﻠﺖ ﺍﻟﺨﺒﺰ. Note the multiple outputs, corresponding to all possible results that match the rules in the CFG. The number of outputs is cut down when the sentence is entered with diacritical markings.
Fig. 14. Our system’s GUI showing results for the sentence ( )ﺃﻛﻠﺖ ﺍﻟﺨﺒﺰentered without
diacritical signs. Note we have multiple possible solutions.
References
1. CIA World fact book. Washington DC: Central Intelligence Agency (2008).
2. A. Farghaly and K. Shaalan, Arabic natural language processing: challenges
and solutions. ACM Trans Asian Lang. Inform. Process., 8(4):1-22 (2009).
3. R. Alkhawwam, Applied e‘raab and its applications (in Arabic). Retrieved
Sep 6, 2015, from uqu.edu.sa/page/ar/93207366.
4. A. Azmi and R. Almajed, A survey of automatic Arabic diacritization
techniques. Natural Language Engineering, 21(3):477-495 (2015).
5. M.G. Khayat and S.K. Al-Jabri, Model analysis of the Arabic sentences
structure (in Arabic). Proceedings of the 12th National Computer Conference:
Planning for the Informatics Society, Riyadh, Saudi Arabia, Oct 21-24, pp.
676-691 (1990).
6. A.D. Al-Sawadi and M.G. Khayat, An end-case analyzer for Arabic
sentences. J King Saud University: Computer & Information Sci. 8:21-52
(1996).
7. E. Al-Daoud and A. Basata, A framework to automate the parsing of Arabic
language sentences. Int Arab J Information Technology, 6(2):196-205
(2009).
8. S. Ananthakrishnan, S. Narayanan and S. Bangalore, Automatic
diacritization of Arabic transcripts for automatic speech recognition. Int.
Conf. Natural Lang. Processing (ICON-2005), Kanpur, India (2005).
Chapter 7
Semi-Auto Data Annotation, POS Tagging & Mildly Context-Sensitive Disambiguation
1. Introduction
Applications using it include text analyzers (e.g., BAMAE, see Ref. 2),
ontologies (e.g., Arabic WordNet browser, see Ref. 3), data mining, and
content extraction (e.g., ArMExLeR, see Ref. 4).
However, the original version of AM shows a number of shortcomings, which reduce the coverage of the morphological analyzer and hinder its applicability to a number of genres and text types. In particular, Buckwalter1 focused mainly on contemporary newspaper texts, which makes the analyzer both underrecognize texts from other genres, because of a lack of lexical and morphological coverage, and overrecognize them, by spuriously increasing the amount of ambiguity through the inclusion of historically and linguistically implausible alternatives.
Some of these inconsistencies were tackled by the Revised AM model (RAM) presented in Boella et al.5 However, the need for a structural, as opposed to incremental, revision and expansion of AM is clear from the inability of a merely enlarged version to go beyond a certain level of performance in analyzing, e.g., Classical and modern informal texts.
XRAM presents itself as a structurally revised AM, which alters the basic original structure by adding usage and genre markers, and by augmenting the original, rigidly context-free conception of the analyzer with limited, statistically gathered contextual selection information. These enhancements allow for a considerably higher level of performance (see Section 3).
2. Description of XRAM
XRAM, just like AM and RAM, has the purpose of analyzing texts, but in a much more refined and thorough way.
In order to enhance the accuracy of the analysis, we implemented a flag-selectable usage marker mechanism through the addition of a supplementary field in Buckwalter's analyzer (see Section 2.1).
After selecting a single flag or a set of flags, according to the text
genre, the text is tokenized and all the punctuation and formatting
structure is stripped and factored out. Hence, the program produces a list
of tokens ready to be processed by the XRAM analyzer, which aims to
Semi-Auto Data Annotation, POS Tagging & Mildly Context-Sensitive Disambiguation 157
a By “Informal Colloquial Arabic” we mean intermediate, relatively high-level varieties of spoken Arabic that do not exhibit especially localized features and are relatively common to speakers of different spoken varieties. ICA essentially corresponds to Mitchell's Educated Spoken Arabic7 and Ryding's Formal Spoken Arabic.8
158 G. Lancioni et al.
FLAG FEATURE
XRAM_CA Classical Arabic
XRAM_MSA Modern Standard Arabic
XRAM_ICA Informal Colloquial Arabic
XRAM_SPEC_MED Medical Sublanguage
XRAM_SPEC_ALCH Alchemic Sublanguage
XRAM_SPEC_GRAM Grammatical Sublanguage
XRAM_NE Name Entities
XRAM_FNE Foreign Name Entities
XRAM_CAP Colloquial Aspectual Preverbs
The existing flags reflect the range of text genres included in the corpora and subcorpora available in our research. The system can easily be expanded by adding new flags. Flag selection is usually compounded: for example, when processing a corpus of classical texts, the XRAM_MSA, XRAM_ICA, XRAM_SPEC_MED, XRAM_FNE and XRAM_CAP flags will be deselected in order to optimize the output analysis.
Flags can be easily and efficiently implemented according to standard
IT practices (as XORed bits), which makes genre and text type filtering
quick and consistent.
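A flag set of this kind is typically one integer with one bit per flag, so selecting, deselecting and testing are single bitwise operations. The sketch below uses hypothetical bit values (the paper does not give the actual encoding); deselection of the flags listed in the classical-corpus example is done with XOR, as the text suggests:

```python
from enum import IntFlag

class XramFlag(IntFlag):
    CA        = 1 << 0   # Classical Arabic
    MSA       = 1 << 1   # Modern Standard Arabic
    ICA       = 1 << 2   # Informal Colloquial Arabic
    SPEC_MED  = 1 << 3   # Medical sublanguage
    SPEC_ALCH = 1 << 4   # Alchemic sublanguage
    SPEC_GRAM = 1 << 5   # Grammatical sublanguage
    NE        = 1 << 6   # Named entities
    FNE       = 1 << 7   # Foreign named entities
    CAP       = 1 << 8   # Colloquial aspectual preverbs

ALL_FLAGS = XramFlag((1 << 9) - 1)   # every flag set

# For a corpus of classical texts, XOR away the flags the text lists.
deselected = (XramFlag.MSA | XramFlag.ICA | XramFlag.SPEC_MED
              | XramFlag.FNE | XramFlag.CAP)
selected = ALL_FLAGS ^ deselected

def is_selected(flags, flag):
    """An analysis tagged with `flag` passes the filter only if it is selected."""
    return bool(flags & flag)
```

Because `deselected` is a subset of `ALL_FLAGS`, the XOR simply clears those bits; testing a candidate analysis against the filter is then a single AND.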
WORD: ﺍﻟﻜﺘﺎﺏ
Al+ktAb+
ﺍﻟﻜﺘﺎﺏ
1 Al+kitAb+ ﺍﻝ٭ ِﻛﺘﺎﺏ٭ kitAb_1 []ﻛﺘﺐ
the+book+
Al/DET+Ndu+
2 Al+kut~Ab+ ﺍﻝ٭ ُﻛﺘّﺎﺏ٭ kut~Ab_1 []ﻛﺘﺐ
the+kuttab (village school);Quran school+
Al/DET+N+
3 Al+kut~Ab+ ﺍﻝ٭ ُﻛﺘّﺎﺏ٭ kAtib_1 []ﻛﺘﺐ
the+authors;writers+
Al/DET+N+
b http://www.tei-c.org/index.xml.
(1) kAtab+a/kAtab_1/V
(2) kAtib/kAtib_1/N
(3) kAtib/kAtib_2/A
<w ana="maEa/maEa_1/PREP">ﻣﻊ</w>
<w ana="kAtib/kAtib_1/N">
<note ana="kAtib/kAtib_2/A"/>ﻛﺎﺗﺐ</w>
The fragment shows the unique analysis for ﻣﻊ and the top-ranked analysis for ﻛﺎﺗﺐ encoded in the ‘ana’ attribute of the <w> tag, while the alternative analysis for ﻛﺎﺗﺐ is encoded as a <note>. If the annotator prefers one of the alternative analyses, (s)he adds the attribute ed="correct" to it:
<w ana="kAtib/kAtib_1/N">
<note ana="kAtib/kAtib_2/A" ed="correct"/>ﻛﺎﺗﺐ</w>
and launches the XSLT transformation, which reverses the selection:
<w ana="kAtib/kAtib_2/A">
<note ana="kAtib/kAtib_1/N"/>ﻛﺎﺗﺐ</w>
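The effect of that XSLT transformation, promoting the analysis marked ed="correct" into the <w> element's ana attribute and demoting the previous top analysis into the <note>, can be sketched in a few lines of Python with ElementTree instead of XSLT (an ASCII placeholder stands in for the Arabic word):

```python
import xml.etree.ElementTree as ET

def apply_corrections(xml_text):
    """Swap each <w>'s top analysis with the <note> marked ed="correct"."""
    root = ET.fromstring(xml_text)
    for w in root.iter("w"):
        for note in w.findall("note"):
            if note.get("ed") == "correct":
                # Exchange the two analyses and drop the editorial marker.
                w.attrib["ana"], note.attrib["ana"] = (note.get("ana"),
                                                       w.get("ana"))
                del note.attrib["ed"]
    return ET.tostring(root, encoding="unicode")

doc = ('<s><w ana="kAtib/kAtib_1/N">'
       '<note ana="kAtib/kAtib_2/A" ed="correct"/>katib</w></s>')
fixed = apply_corrections(doc)
```

After the transformation, the corrected analysis sits in the ana attribute and the demoted one survives in the <note>, so no information is lost.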
One of the weak points of AraMorph is the limited range of text genres on which the resource was based: the lexicon files, as well as the compatibility tables included in the program, are mostly based on newspaper texts and other Modern Standard Arabic non-literary texts, which largely make up the LDC Arabic corpus. Not only is the program unbalanced and representative of only a limited part of the Arabic vocabulary; its lists also lack any stylistic and chronological information. Because of this, various problems can arise in the analysis of other textual genres, especially Classical and contemporary (both formal and informal) ones. Analyses conducted on Pre-Islamic and Classical texts, such as Hadith texts,5 reveal that the main weak points of AM are: (i) the rejection or wrong analysis of words such as the ’ā- interrogative prefix, as well as imperative verbs that are not included in AM due to their rare occurrence in AM's target texts. In addition, other errors that occur with classical Arabic corpora, especially pre-Islamic, involve broken plurals as well as certain verb stems (mainly maṣdars,
c As for the Egyptian variety, Rosenbaum10 defines this linguistic phenomenon as
“Egyptianized English”.
existence. Its large list of named entities has already inspired projects
meant to enhance and expand other Arabic lexical resources such as
Arabic WordNet.12 Within the XRAM project, Arabic Wikipedia was
used to align the transcription of foreign words and thus add them to
Buckwalter’s lists.
As regards the most frequent unanalyzed dialect words, the solution
is to manually compile a list to include in AM, since XML resources are not
widely available at the moment aside from a few recently investigated
varieties.13
GENRE                            XRAM (% unknown)   AM (% unknown)
Classical Arabic                       3.4               12.4
Modern Standard Arabic                 1.7                2.5
Informal Colloquial Arabic             7.6               18.5
Medical Sublanguage                    1.3                7.5
Alchemic Sublanguage                   3.5               14.2
Grammatical Sublanguage                2.7                8.6
Named Entities                         6.5                7.6
Foreign Named Entities                14.3               15.6
Colloquial Aspectual Preverbs          6.7               23.4
4. Conclusion
References
Chapter 8
WeightedNileULex: A Scored Arabic Sentiment Lexicon
for Improved Sentiment Analysis
Samhaa R. El-Beltagy
Center for Informatics Science, Nile University,
Juhayna Square, Sheikh Zayed City,
Giza, Egypt
samhaa@computer.org
1. Introduction
Over the past few years there has been an increase in interest in the topic
of Arabic sentiment analysis and opinion mining. The increased interest
in this area is a direct result of the surge in usage of the Arabic language
on various social media platforms, among which are Twitter and
Facebook.1,2,3 Many approaches to sentiment analysis require the
existence of sentiment lexicons, which are currently scarce for the Arabic
language. In previous work, the author presented NileULex,4 a
manually constructed Arabic sentiment lexicon containing approximately
six thousand Arabic terms and phrases, of which 45% are colloquial
(mainly Egyptian). This work extends the previous work by presenting
170 S. R. El-Beltagy
2. Related Work
online. However, the work presented in 4 has shown that the quality of this
translated lexicon is not as high as that of a manually constructed one, and that
sentiment analysis accuracy does suffer when using such lexicons.
EmoLex, however, is the only Arabic lexicon other than NileULex4 that
contains compound terms, but it must be stated that the number of
compound entries in this lexicon is very limited. To the knowledge of the
author, NileULex is the only lexicon that has both Arabic compound
phrases and common idioms as entries.
Since the goal of our work was to assign a strength score to each
positive and negative lexicon entry, we had to obtain a representative set
of tweets for each term. We chose to retrieve 100 unique tweets for
each term using Twitter’s search API.20 There were a few cases,
however, in which the search API was unable to retrieve this number of tweets,
and cases where no tweets were retrieved at all. To ensure that tweets were
in fact unique, near-duplicates were filtered out using the Jaccard similarity
measure.21
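A minimal sketch of this deduplication, assuming word-level Jaccard similarity; the similarity threshold below is an illustrative choice, as the chapter does not state the one used:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the word sets of two tweets."""
    wa, wb = set(a.split()), set(b.split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def dedupe(tweets, threshold=0.7):
    """Keep a tweet only if it is not too similar to any tweet kept so far."""
    kept = []
    for t in tweets:
        if all(jaccard(t, k) < threshold for k in kept):
            kept.append(t)
    return kept

tweets = ["RT great movie great cast", "great movie great cast", "terrible plot"]
unique = dedupe(tweets)   # the near-duplicate second tweet is dropped
```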
a Dialect can be Modern Standard Arabic (MSA), Egyptian (EG) or simply a dialectal
term (DIA) which is not specific to one Arabic-speaking country or region.
After carrying out the data collection step described in the previous sub-
section, each of the collected tweets was processed in order to extract
statistics for lexicon terms. In this processing step, each tweet was scanned
for lexicon terms and negated lexicon terms. A dictionary was created for
each lexicon term to keep track of how many times it occurred:
• in the entire corpus
• with positive terms
• with negative terms
• with only terms that match its polarity
• with a tweet that has negative sentiment (in a negative context)
• with a tweet that has positive sentiment (in a positive context)
• with a tweet that has neutral sentiment (in a neutral context)
The last three indicators were obtained by analyzing the tweet using
NU’s sentiment analyzer.22
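The per-term bookkeeping described above might look like the following sketch; the dictionary keys, the toy lexicon, and the update interface are illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical lexicon: term -> given polarity ('pos' or 'neg').
lexicon = {"رائع": "pos", "سيء": "neg"}

def empty_stats():
    return {"corpus": 0, "with_pos": 0, "with_neg": 0,
            "same_polarity": 0, "pos_ctx": 0, "neg_ctx": 0, "neu_ctx": 0}

stats = defaultdict(empty_stats)

def update_stats(tweet_terms, tweet_sentiment):
    """tweet_terms: lexicon terms found in one tweet;
    tweet_sentiment: 'pos'/'neg'/'neu' as returned by the sentiment analyzer."""
    for t in tweet_terms:
        s = stats[t]
        s["corpus"] += 1
        others = [o for o in tweet_terms if o != t]
        s["with_pos"] += sum(lexicon[o] == "pos" for o in others)
        s["with_neg"] += sum(lexicon[o] == "neg" for o in others)
        s["same_polarity"] += sum(lexicon[o] == lexicon[t] for o in others)
        s[tweet_sentiment + "_ctx"] += 1    # context from the analyzer

update_stats(["رائع"], "pos")
update_stats(["رائع", "سيء"], "neg")
```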
Scoring of each lexicon term was based on these statistics, as described
in the next sub-section.
The main hypothesis behind the presented scoring method is that the
stronger a polar term is, the less likely it is to co-occur with terms of the
opposite polarity or in a context that does not share its polarity. This
hypothesis is validated empirically in Section 5 by comparing sentiment
analysis performance using the lexicon scored with the proposed
method versus using the un-weighted version of the lexicon.
After collecting statistics for each lexicon term, three steps were carried
out to assign strength scores to lexicon terms. In the first step, an initial
score was calculated for each term: a weight that indicates the
likelihood of the term being positive or negative
WeightedNileULex: A Scored Arabic Sentiment Lexicon for Improved Sentiment Analysis 175
based on co-occurrence analysis of this term with other terms and polarity
contexts. It does not take into consideration the strength of the other terms it
co-occurred with, initially assuming that all terms are equally strong. In the
second step, the weights are re-adjusted, taking the initial calculations into
consideration. In the third step, terms that have occurred at a very low rate in
the corpus, or have not occurred at all, are processed.
Terms that have not occurred in the input corpus at all, or with a support
value less than a given threshold, are assigned a default value based on their
given polarity. The details of each step are given below.
First Step:
The initial score assigned to each term in the lexicon (excluding terms
whose occurrence count is less than some given threshold v) is based on
the following equation:

score_t = max(TermCoOccurrenceRatio_t, PolarityRatio_t)

where TermCoOccurrenceRatio measures the extent to which a term co-occurs
with other terms of similar polarity and is calculated as follows:

TermCoOccurrenceRatio_t = (co-occurrenceCnt_t + weight_t) / Total_Count_t

co-occurrenceCnt_t = co-occurrence frequency of term t with terms of
the same polarity as t
weight_t = tf_t * Normalized_idf_t
tf_t = the number of times term t has appeared in the input corpus
Normalized_idf_t = idf_t normalized such that the value is a number
between zero and one. The normalization factor is log2(N), where N is the
number of documents in the collection used to build the idf table.
idf_t = the inverse document frequency23 of term t as obtained from
another corpus built using a set of objective documents. The reason we
used a different, un-opinionated corpus was to penalize polar terms that
appear in a neutral context, as terms that appear in such a context should
have less weight than those that do not. The idf table used to get this value
is the one described in 24.
Total_Count_t = co-occurrenceCnt_t + revCnt_t + weight_t
And where PolarityRatio measures the extent to which a term occurs in an overall
context that is similar to its polarity and is calculated as follows:

PolarityRatio_t = similarContextCnt_t / tweetCnt_t

where
similarContextCnt_t = number of times term t has occurred in tweets
of the same polarity as its given polarity
tweetCnt_t = total number of tweets in which term t has appeared in
the Twitter corpus.
While the TermCoOccurrenceRatio takes into account all polar terms
that have co-occurred with the term for which a score is to be calculated,
the PolarityRatio, takes into account the overall sentiment of all tweets in
which the term has appeared.
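Under one reading of the formulas above (taking TermCoOccurrenceRatio as the same-polarity co-occurrence count plus the term's weight, divided by Total_Count), the first-step score can be sketched as follows; this is an interpretation, not the chapter's exact implementation:

```python
import math

def initial_score(co_cnt, rev_cnt, tf, idf, n_docs, similar_ctx, tweet_cnt):
    """First-step score for one term.
    co_cnt / rev_cnt: co-occurrence counts with same / opposite polarity terms;
    weight_t = tf_t * (idf_t / log2(N)) is the normalized tf-idf weight."""
    weight = tf * (idf / math.log2(n_docs))
    total_count = co_cnt + rev_cnt + weight
    term_ratio = (co_cnt + weight) / total_count if total_count else 0.0
    polarity_ratio = similar_ctx / tweet_cnt if tweet_cnt else 0.0
    return max(term_ratio, polarity_ratio)
```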
All terms with support greater than 1 but less than some given value v
are placed in a list data structure that we will refer to as a ‘weak_list’. The
weak list thus represents a list of terms that have not occurred frequently
enough in the collected Twitter corpus for us to assign accurate scores to.
All terms that have 0 support (have not occurred at all in the input corpus)
are initially placed in another list (“zero_list”), before being placed in the
weak_list. Processing of both the zero_list and the weak_list, is described
in the third step.
Second Step:
In the second step, all scores are revised for all terms to take into
account the strength of terms they co-occurred with. The score for a term
t is calculated as follows in this step:

modifiedScore_t = (newScore_t + score_t) / 2

where

newScore_t = m_t / (m_t + r_t)
Third Step:
In this step, terms in the zero_list and the weak_list are
assigned scores. Terms with 0 support are sometimes just misspelled
versions of existing terms, so before moving them to the weak_list, we first
compare them in terms of similarity to existing terms. Very short terms
tend to incorrectly match with other entries, so they are excluded from this
matching process. The pseudo code for calculating scores for terms that
have zero support is as follows:

For each term t in the zero_list:
a. if the length of t is smaller than 3, move t to the weak_list and proceed
to the next term
b. else get the minimum Levenshtein distance between t and all terms that
have been assigned a score
c. if (min <= 2) score_t = score_matching_term
d. else move t to the weak_list
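The pseudo code above can be made concrete as follows; the Levenshtein implementation is the standard dynamic-programming one, and the data in the usage is illustrative:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def resolve_zero_support(zero_list, scored, weak_list):
    """Steps a-d above: copy the score of a close scored term (edit
    distance <= 2), otherwise move the term to the weak_list."""
    scores = {}
    for t in zero_list:
        if len(t) < 3:                      # a. too short to match reliably
            weak_list.append(t)
            continue
        match = min(scored, key=lambda s: levenshtein(t, s))   # b.
        if levenshtein(t, match) <= 2:      # c. copy score of close match
            scores[t] = scored[match]
        else:                               # d. no close match found
            weak_list.append(t)
    return scores
```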
After the above step is completed, the scores for terms in the weak_list
are calculated as follows:
(1) calculate the “average positive polarity” and the “average
negative polarity” given all scores that have been calculated for
entries in the lexicon
For each term t in the weak_list:
(2) if support_t > 0, get score_t using the equation provided in the first step,
else set score_t = 0
(3) if score_t < 0.5, set score_t = 0.51
(4) adjusted_cnt = log2(term_cnt) + 1
(5) score_t = ((score_t * adjusted_cnt) + polarityAverage) /
(adjusted_cnt + 1)
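Steps (3) to (5) for one weak-list term can be sketched as follows; clamping term_cnt to 1 before the logarithm is an assumption for zero-support terms, which the text does not specify:

```python
import math

def weak_list_score(raw_score, term_cnt, polarity_average):
    """Floor the raw score at 0.51 when it is below 0.5, then pull it
    toward the average polarity score, trusting the raw score more as
    the (log-damped) occurrence count grows."""
    score = 0.51 if raw_score < 0.5 else raw_score
    adjusted_cnt = math.log2(max(term_cnt, 1)) + 1   # clamp is an assumption
    return (score * adjusted_cnt + polarity_average) / (adjusted_cnt + 1)
```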
Elongation Removal: In this step, words that have been elongated are
reduced to their standard canonical form. Elongation is a way to
emphasize certain words. An example of an English elongated word is
“nooooooooooo”; after elongation removal, this word will be converted to
“no”.
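Elongation removal can be sketched with a regular expression; the run-length threshold of three is an illustrative choice, not stated in the chapter:

```python
import re

def remove_elongation(word: str) -> str:
    """Collapse runs of 3 or more repeated characters to a single one,
    keeping legitimate doubled letters intact."""
    return re.sub(r'(.)\1{2,}', r'\1', word)
```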
Matching with Lexicon Entries: This step was only carried out for
experiments that are related to the introduction of the lexicon. In this step,
input tweets/texts are matched against entries in the sentiment lexicon. The
matching process is described in detail in 22. Both the tweets/texts and
lexicon entries are lemmatized and stemmed prior to any matching steps.
An efficient matching algorithm was employed to facilitate matching
between tweet text and lexicon entries. The output of this step is a count
for positive and negative lexicon entries, which are found in the tweet and
which are used as part of the features. Negators are currently handled in a
very simple way: encountering a negator before a sentiment term within a
window w results in the reversal of its polarity. We have observed that in
some cases this is not necessarily valid. For example, the term “ﻻ ﺣﻠﻮ”, in
which the negator “no” appears before the word “nice”, is actually used to
affirm that something is nice. A positive score (posScore) and a negative
score (negScore) are also added as features in experiments involving the
scored lexicon. In our experiments, we have used a very simple technique
for assigning scores. Basically, the score of all positive terms is calculated
as the sum of their individual scores and that of any negated negative term
multiplied by a penalty. The same is done for all negative terms. After
summing all positive scores (allPos) and all negative scores (allNeg), final
positive and negative scores are assigned as shown in Figure 1.
An amplification factor has been introduced to boost the weight of
these two features with respect to other features in the feature vector.
Through experimentation, it was noticed that different datasets favor
different amplification factors. In all experiments presented in the
evaluation section, the amplification factor was optimized using
experiments carried out using 10 fold cross validation. Whatever factor
worked best with these experiments was used on the test dataset. The use
of intensifiers has yet to be explored and is expected to improve the results
presented in the experimentation section.
if (allNeg > allPos) {
    negScore = allNeg - allPos;
    negScore = negScore * amplification_factor;
    posScore = 0;
} else {
    posScore = allPos - allNeg;
    posScore = posScore * amplification_factor;
    negScore = 0;
}
Figure 1. Code snippet representing score calculation.
The Talaat et al. dataset (NU)26: The collection and annotation of this
dataset are described in 26. The dataset contains 3436 unique tweets, mostly
written in Egyptian dialect. These tweets are divided into a training set
consisting of 2746 tweets and a test set containing 683 tweets. The
distribution of training tweets amongst polarity classes is: 1046 positive,
976 negative, and 724 neutral tweets. The distribution of the test dataset
is: 263 positive, 228 negative and 192 neutral. This dataset is available by
request from the author.
The KSA_CSS dataset (KSA)26: This dataset was collected at
a research center in Saudi Arabia under the supervision of Dr. Nasser
Al-Biqami and is also described in 26. The majority of tweets in this
dataset are in Saudi dialect and MSA, but a few are written in Egyptian and other
dialects. The tweets of this dataset have also been divided into a training
set consisting of 9656 tweets and a test set comprised of 1414 tweets. The
training set consists of 2686 positive, 3225 negative, and 3745 neutral
tweets. The test set has 403 positive, 367 negative, and 644 neutral tweets.
The Syria Dataset (SYR)29: This dataset consists of 2000 Syrian tweets,
so most of the tweets in this dataset are in Levantine dialect. The dataset was
collected by Salameh and Mohammad29 and consists of 448 positive
tweets, 1350 negative tweets, and 202 neutral tweets.
Experiment 1: The goal of this first experiment was to examine the effect
of using the scored lexicon on improving the accuracy of the sentiment
analysis task when using tenfold cross validation on all used datasets. The
results of this experiment are shown in Table 2. Looking at these results,
it can be seen that in all cases accuracy increases when using a lexicon
(scored or not). The increase in accuracy seems to be related to the size of
the training dataset, with the largest dataset showing the least improvement
and the smallest showing the most. This shows that using a lexicon does
in fact help classifiers generalize better in the absence of large training
datasets. This hypothesis is further tested by using the lexicon in
conjunction with various test datasets.
Table 2. Results of applying the classifier on the various datasets and testing using 10-fold
cross validation.

                 Accuracy   Fscore   Correctly Identified   Improvement over baseline
NU Data Set (size = 2746), amplification = 14
Baseline           71.34     71.2          1966                     -
Lexicon Counts     72.87     72.6          2000                     1.78%
ScoredLexicon      73.82     73.7          2027                     3.1%
KSA Data Set (size = 9656), amplification = 6
Baseline           78.88     78.9          7613                     -
Lexicon Counts     79.26     79.2          7649                     0.47%
ScoredLexicon      79.31     79.3          7654                     0.53%
BBN Data Set (size = 1199), amplification = 8
Baseline           68.97     68.8           827                     -
Lexicon Counts     71.14     70.7           853                     3.14%
ScoredLexicon      72.20     71.4           864                     4.47%
Syr Data Set (size = 2000), amplification = 16
Baseline           77.45     77.9          1549                     -
Lexicon Counts     78.45     78.8          1569                     1.29%
ScoredLexicon      80.3      80.4          1606                     3.68%
Experiment 2: The results are shown for the datasets for which a separate test dataset
was provided (NU and KSA). The results of this experiment are provided
in Table 3. The results of this experiment re-affirm the conclusion reached
in the first: here also, the use of a lexicon results in improved results, with
the best results obtained when using the scored lexicon.
Table 3. Results of testing on the separate test datasets.

                 Accuracy   Fscore   Correctly Identified   Improvement over baseline
NU Data Set (size = 683)
Baseline           57.40     57.2           392                     -
Lexicon Counts     59.59     59.2           407                     3.82%
ScoredLexicon      61.90     61.10          423                     7.91%
KSA Data Set (size = 1414)
Baseline           69.57     69.4          1125                     -
Lexicon Counts     71.49     71.4          1156                     2.76%
ScoredLexicon      71.8      71.8          1161                     3.20%
Experiment 3: The goal of the third experiment was to examine the ability
of the scored lexicon to improve a sentiment analyzer’s generalization
ability across datasets. In this experiment, the classifier was trained using
the largest available dataset (KSA) and tested using (a) the NU dataset, (b)
the BBN dataset, (c) the Syr dataset. The results of this experiment are
shown in Table 4.
Table 4. Results of training using the KSA data set and testing using various datasets.

                 Accuracy   Fscore   Correctly Identified   Improvement over baseline
NU_Egy Test dataset (size = 683)
Baseline           57.83     57.1           395                     -
Lexicon Counts     60.03     59.2           410                     3.78
ScoredLexicon      61.93     60.9           423                     7.09
BBN Data Set (size = 1199)
Baseline           54.13     54.0           649                     -
Lexicon Counts     56.05     56.4           673                     3.70
ScoredLexicon      58.38     58.6           700                     7.86
Syr Data Set (size = 2000)
Baseline           53.60     58.3          1072                     -
Lexicon Counts     55.90     60.4          1118                     4.29
ScoredLexicon      57.80     62.1          1156                     7.84
It can be noticed from these results that the use of the scored lexicon
increased the ability of the classifier to correctly identify instances by no
less than 7% across all three datasets. While the results for the BBN and
Syr datasets were much lower than those achieved using 10-fold cross
validation on the same datasets, the result for the NU test dataset was
identical to that achieved when training using the NU training dataset. This
can be explained by the fact that the KSA dataset has a subset of Egyptian
dialect tweets, so with the help of the scored lexicon, the classifier built
using KSA data was able to achieve a similar result to that achieved by a
classifier trained specifically for the Egyptian dialect. The same was not
true for the other two datasets, as they contain a completely different
dialect (Levantine).
6. Conclusion
References
1. R.W. Neal, Twitter Usage Statistics: Which Country Has The Most Active Twitter
Population? International Business Times, http://www.ibtimes.com/twitter-
usage-statistics-which-country-has-most-active-twitter-population-1474852 (2013).
2. Facebook Statistics by Country, http://www.socialbakers.com/facebook-statistics/
(2012).
3. D. Farid, Egypt has the largest number of Facebook users in the Arab world. Daily
News Egypt, 23 September 2013, http://www.dailynewsegypt.com/2013/09/25/egypt-
has-the-largest-number-of-facebook-users-in-the-arab-world-report/ (2013).
b https://github.com/NileTMRG/NileULex
4. S.R. El-Beltagy, NileULex: A Phrase and Word Level Sentiment Lexicon for Egyptian
and Modern Standard Arabic. In Proc. of LREC 2016. Portorož, Slovenia (2016).
5. S. Baccianella, A. Esuli and F. Sebastiani, SentiWordNet 3.0: An Enhanced Lexical
Resource for Sentiment Analysis and Opinion Mining. In: Proceedings of the Seventh
International Conference on Language Resources and Evaluation (LREC’10), pp.
2200–2204 (2010).
6. B. Liu, Sentiment Analysis and Subjectivity. In: N. Damerau (ed), Handbook of
Natural Language Processing, Second Edition (2010).
7. T. Wilson, J. Wiebe and P. Hoffmann, Recognizing contextual polarity in phrase-level
sentiment analysis. In Proc. of Human Language Technology Conference and
Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP).
pp. 347–354, Vancouver, Canada (2005).
8. S. Mohammad, and P. Turney, Crowdsourcing a Word-Emotion Association Lexicon.
Comput Intell 29(3), 436–465 (2013).
9. S. M. Mohammad, S. Kiritchenko and X Zhu, NRC-Canada: Building the State-of-the-
Art in Sentiment Analysis of Tweets. In Proc. of the Seventh International Workshop
on Semantic Evaluation (SemEval-2013), Atlanta, Georgia, USA (2013).
10. S. Kiritchenko, X. Zhu and S. Mohammad, Sentiment Analysis of Short Informal
Texts. J Artif Intell Res, 50, 723–762 (2014).
11. R. F. Astudillo, S. Amir, W. Ling, et al., INESC-ID: A Regression Model for Large
Scale Twitter Sentiment Lexicon Induction. In Proc. of the 9th International Workshop
on Semantic Evaluation (SemEval 2015), pp. 613–618 (2015).
12. H. Hamdan, P. Bellot and F. Bechet, lsislif: Feature extraction and label weighting for
sentiment analysis in twitter. In Proc. of the 9th International Workshop on Semantic
Evaluation, pp. 568–573 (2015).
13. F. Wang, Z. Zhang and M. Lan, ECNU at SemEval-2016 task 7: An enhanced
supervised learning method for lexicon sentiment intensity ranking. In Proc. of the
International Workshop on Semantic Evaluation (SemEval-2016), pp. 491–496 (2016).
14. R. Al-Sabbagh and R. Girju, Mining the web for the induction of a dialectical arabic
lexicon. In Proc. LREC 2010, pp. 288–293 (2010).
15. S. R. El-Beltagy and A. Ali, Open Issues in the Sentiment Analysis of Arabic Social
Media: A Case Study. In Proc. of the 9th International Conference on Innovations and
Information Technology (IIT2013), Al Ain, UAE (2013).
16. M. Abdul-Mageed and M. Diab, Toward Building a Large-Scale Arabic Sentiment
Lexicon, In Proc. of the 6th International Global WordNet Conference, Matsue, Japan,
pp. 18–22 (2012).
17. G. Badaro, R. Baly, H. Hajj, et al., A large scale Arabic sentiment lexicon for Arabic
opinion mining. In Proc. of the EMNLP Workshop on Arabic Natural Language
Processing (ANLP), Association for Computational Linguistics, pp. 165–173 (2014).
18. F.H.H. Mahyouba, M. A. Siddiquia and M. Y. Dahaba, Building an Arabic Sentiment
Lexicon Using Semi-supervised Learning. J King Saud Univ - Comput Inf Sci, 26,
417–424 (2014).
19. R. Eskander and O. Rambow, SLSA: A Sentiment Lexicon for Standard Arabic. In
Proc. 2015 Conference on Empirical Methods in Natural Language Processing, pp.
2545–2550 (2015).
20. Twitter. Twitter Search API, https://dev.twitter.com/rest/public/search (2016).
21. J. Leskovec, A. Rajaraman and J.D. Ullman, Mining of Massive Datasets. 2 edition.
Cambridge, UK: Cambridge University Press. Epub ahead of print (2014). DOI:
10.1017/CBO9781139058452.
22. S.R. El-Beltagy, T. Khalil, A. Halaby and M.H. Hammad, Combining Lexical
Features and a Supervised Learning Approach for Arabic Sentiment Analysis. In Proc.
CICLing 2016, Konya, Turkey (2016).
23. G. Salton and C. Buckley, Term-weighting Approaches in Automatic Text Retrieval.
Inf Process Manag, 24(5), 513–523 (1988).
24. S. R. El-Beltagy and A. Rafea, KP-Miner: A keyphrase extraction system for English
and Arabic documents. Inf Syst, 34(1), 132–144 (2009).
25. J.D.M. Rennie, L. Shih, J. Teevan, et al., Tackling the Poor Assumptions of Naive
Bayes Text Classifiers. In Proc. of the Twentieth International Conference on Machine
Learning (ICML 2003), pp. 616–623 (2003).
26. T. Khalil, A. Halaby, M.H. Hammad and S.R. El-Beltagy. Which configuration works
best? An experimental study on Supervised Arabic Twitter Sentiment Analysis. In
Proc. of the First Conference on Arabic Computational Linguistics (ACLing 2015), co-
located with CICLing 2015, pp. 86–93, Cairo, Egypt (2015).
27. S.R. El-Beltagy and A. Rafea, An Accuracy Enhanced Light Stemmer for Arabic Text.
ACM Trans Speech Lang Process, 7(2), 2–23 (2011).
28. S.R. El-Beltagy and A. Rafea, LemaLight: A Dictionary based Arabic Lemmatizer and
Stemmer, Technical Report TR2-11-16, Nile University (2016).
29. M. Salameh, S.Mohammad and S. Kiritchenko, Sentiment after Translation: A Case-
Study on Arabic Social Media Posts. In Proc. of the 2015 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies, pp. 767–777, Denver, Colorado: Association for Computational
Linguistics (2015).
Chapter 9
1. Introduction
more) and each instance belongs to many labels, such as text
categorization,2,3 prediction of gene function4 and protein function prediction.5
The high dimensionality of the label space leads to a number of prob-
lems that a multi-label learning algorithm has to address in an effective
and efficient way. First, the number of training examples belonging to each
particular label will be significantly less than the total number of examples.
This is similar to the class-imbalance problem in single-label classification.6
Second, the computational training complexity of a multi-label classifier
may be strongly affected by the number of labels. Some simple algorithms
such as binary relevance have both linear training and classification com-
plexity with respect to |Ω|, but there are also more advanced methods3
whose complexity is worse. Finally, although the classification complexity
of using a multi-label classifier is linear with respect to |Ω| in the best case,
this may still be inefficient for applications requiring fast response times.
Multi-label learning methods addressing these tasks can be grouped into
two categories:1 problem transformation and algorithm adaptation. The
first group of methods are algorithm independent. They transform the
learning task into one or more single-label classification tasks, for which
a large body of learning algorithms exists. The second group of methods
extend specific learning algorithms in order to deal with multi-label data
directly. There exist extensions of decision tree learners, nearest neighbor
classifiers, neural networks, ensemble methods, support vector machines,
kernel methods and others.
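As an illustration of the problem-transformation route, here is a minimal binary relevance sketch; the toy memorizing base learner and the example data are stand-ins, not the paper's setup:

```python
class MemorizeClf:
    """Toy stand-in base learner that memorizes training examples and
    predicts 0 for anything unseen (a real system would use e.g. naive Bayes)."""
    def fit(self, X, y):
        self.table = {tuple(x): yi for x, yi in zip(X, y)}
    def predict(self, X):
        return [self.table.get(tuple(x), 0) for x in X]

class BinaryRelevance:
    """Problem transformation: one independent binary classifier per label."""
    def __init__(self, make_clf):
        self.make_clf = make_clf
        self.clfs = {}
    def fit(self, X, Y):
        for lab in set().union(*Y):        # Y holds one label set per example
            clf = self.make_clf()
            clf.fit(X, [int(lab in labs) for labs in Y])
            self.clfs[lab] = clf
    def predict(self, x):
        # The label set predicted for x: every label whose classifier fires.
        return {lab for lab, clf in self.clfs.items() if clf.predict([x])[0]}

# Hypothetical toy data (not from the fatwa corpus).
X = [["pray", "travel"], ["fast", "ramadan"], ["pray", "mosque"]]
Y = [{"prayer", "travel"}, {"fasting"}, {"prayer"}]
br = BinaryRelevance(MemorizeClf)
br.fit(X, Y)
pred = br.predict(["pray", "travel"])
```

Both training and prediction here are linear in the number of labels, which is exactly the cost profile discussed above.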
When a Muslim has a question that they need to be answered from an
Islamic point of view, they ask an Islamic scholar this question, and the
answer is known as a fatwa. It is similar to the issue of legal opinions from
courts in common-law systems. A fatwa in the Islamic religion represents
the legal opinion or interpretation that a qualified jurist or mufti can give
on issues related to the Islamic law. Muslim scholars are expected to give
their fatwa based on religious scripture, not just their personal opinions.
The following is an example of a fatwa: Muslims are expected to pray five
times every day at specific times during the day. A person who is going to
be on a 12 hour flight may not be able to perform their prayers on time.
So they might ask a Muslim scholar (mufti) for a fatwa on what is the
appropriate thing to do, or they might look up the answer in a book or
on the internet. The scholar might advise them to perform the prayer to
the best of their ability on the plane, or to delay their prayer until they
land. They would support their opinion with Quranic verses which Muslims
believe to be a revelation from God. The fatwa is not legally binding or
final.
It is worth mentioning that in Islam, there are four sources from which
Muslim scholars extract religious law or rulings, and upon which they base
their fatwa. The first is the Quran, which is the holy book of Islam, and
which Muslims believe is the direct and literal word of God, revealed to
Prophet Mohammad. The second source is the Sunnah, which incorpo-
rates anything that the Prophet Mohammad said, did or approved of. The
third source is the consensus of the scholars, meaning that if the schol-
ars of a previous generation have all agreed on a certain issue, then this
consensus is regarded as representing Islam. Finally, if there is no evidence
found regarding a specific question in the first three sources, then an
Islamic scholar uses his own logic and reasoning to come up with the best
answer according to the best of his ability. All actions in Muslims’ lives
are permissible unless there is evidence from one of
the four sources previously mentioned that proves otherwise. Fatwa Areas
(categories) can be organized into a tree-structured hierarchy where similar
areas share the same parent area. Each scholar could be an expert in one
or more of its branches. To get the best fatwa for a given question, the
request has to be directed to the most relevant mufti.
The main contribution of this paper is to apply an effective and com-
putationally efficient multi-label classification algorithm in a domain with
many labels such as Islamic fatwa requests routing. The algorithm, that
was introduced by Tsoumakas et al. in 2008,7 is called HOMER (Hierarchy
Of Multi-label classifiERs). HOMER constructs a hierarchy of multi-label
classifiers, each one trained to solve a classification problem with a much
smaller set of labels compared to |Ω| and a more balanced example distri-
bution. This leads to improved predictive performance along with linear
training and logarithmic testing complexities with respect to |Ω|. The first
step of HOMER is label hierarchy generation, i.e. the even distribution
of the given set of labels Ω into k disjoint subsets using a balanced
k-means clustering algorithm. That is, similar labels are placed together
and dissimilar ones apart.
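A greedy sketch of the balanced clustering idea, capping each cluster at ceil(n/k) labels; HOMER's actual balanced k-means differs in detail, so this is illustrative only:

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def balanced_label_clustering(label_vecs, k, iters=10):
    """Assign each label to its nearest centroid that still has room
    (at most ceil(n/k) members), then recompute centroids, for a fixed
    number of iterations. label_vecs maps a label to a feature vector."""
    labels = list(label_vecs)
    cap = math.ceil(len(labels) / k)
    centroids = [list(label_vecs[l]) for l in labels[:k]]
    assign = {}
    for _ in range(iters):
        counts = [0] * k
        for lab in labels:
            for c in sorted(range(k),
                            key=lambda c: sq_dist(label_vecs[lab], centroids[c])):
                if counts[c] < cap:          # nearest centroid with room left
                    assign[lab] = c
                    counts[c] += 1
                    break
        for c in range(k):                   # recompute centroids from members
            members = [label_vecs[l] for l in labels if assign[l] == c]
            if members:
                centroids[c] = [sum(col) / len(col) for col in zip(*members)]
    return assign
```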
The remainder of this paper is organized as follows. Section 2 describes
related work and Section 3 presents the proposed routing system and the
HOMER algorithm. Section 4 presents the setup and results
of the experimental work comparing HOMER to binary relevance, which
is the most popular and computationally efficient multi-label classification
method. Finally, Section 5 concludes this paper and points to future work.
2. Related Work
vector evaluation method. The proposed method determines the key words
of the tested document by weighting each of its words, and then comparing
these key words with the key words of the corpus categories.
• Remove punctuation.
• Remove special characters and remove any html tags.
• Remove diacritics (primarily weak vowels).
• Remove non-Arabic letters.
• Replace Arabic letter ALEF with hamza below, Arabic letter ALEF
with madda above, and Arabic letter ALEF with hamza above with
bare Arabic letter ALEF.
• Replace final Arabic letter Farsi YEH with Arabic letter YEH.
• Replace final Arabic letter TEH marbuta with Arabic letter HEH.
• Stop-word removal: we determine the common words in the doc-
uments which are not specific or discriminatory to the different
classes.
• Stemming: different forms of the same word are consolidated into
a single word. For example, singular, plural and different tenses
are consolidated into a single word.
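The character-level normalization steps above can be sketched as a chain of regular-expression substitutions; the exact character coverage and ordering are assumptions:

```python
import re

def normalize_arabic(text: str) -> str:
    """Character-level cleanup following the preprocessing list above."""
    text = re.sub(r'<[^>]+>', ' ', text)                    # remove html tags
    text = re.sub(r'[\u064B-\u0652]', '', text)             # remove diacritics
    text = re.sub(r'[\u0622\u0623\u0625]', '\u0627', text)  # unify ALEF variants
    text = text.replace('\u06CC', '\u064A')                 # Farsi YEH -> YEH
    text = text.replace('\u0629', '\u0647')                 # TEH marbuta -> HEH
    text = re.sub(r'[^\u0621-\u064A\s]', ' ', text)         # drop punctuation,
                                                            # special and non-Arabic chars
    return re.sub(r'\s+', ' ', text).strip()
```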
(1) Remove Arabic letter WAW (and) for Light2, Light3, and Light8
if the remainder of the word is 3 or more characters long. Although
it is important to remove Arabic letter WAW, it is also problematic,
because many common Arabic words begin with this character, hence
the stricter length criterion here than for the definite articles.
(2) Remove any of the definite articles if this leaves 2 or more characters.
(3) Go through the list of suffixes once in the (right-to-left) order indicated
in the figure below, removing any that are found at the end of the word,
if this leaves 2 or more characters. The strings to be removed are
listed in the figure below. The prefixes are actually definite articles and a
conjunction.
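A sketch of the three stemming steps; since the actual prefix and suffix lists appear in a figure not reproduced here, the lists below are illustrative:

```python
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل"]  # illustrative definite articles
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]

def light_stem(word: str) -> str:
    """Steps (1)-(3) above: strip WAW, one definite article, then one
    pass over the suffix list, with the stated length constraints."""
    if word.startswith("و") and len(word) - 1 >= 3:     # (1) leading WAW
        word = word[1:]
    for p in PREFIXES:                                   # (2) definite article
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:                                   # (3) one pass of suffixes
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[:-len(s)]
    return word
```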
4. Performance Evaluation
The dataset used in the experiments was provided by the Egyptian Dar
al-Ifta.a Since it was first established, Dar al-Ifta al-Misriyyah has been
the premier institute to represent Islam and the international flagship for
a http://eng.dar-alifta.org/
Islamic legal research. Dar al-Ifta al-Misriyyah started as one of the
divisions of the Egyptian Ministry of Justice. In view of its consultancy
role, capital punishment sentences, among others, are referred to Dar
al-Ifta al-Misriyyah for the opinion of the Grand Mufti concerning these
punishments. The role of Dar al-Ifta does not stop at this point; it is not
limited by domestic boundaries but extends beyond Egypt covering the
entire Islamic world.
a ranking of all features for that label. We then selected the top 30 features
for each label and concatenated them into a single feature vector after the
removal of redundancy. This led to a reduced vocabulary of 4,486 words.
After the aforementioned preprocessing, and the removal of empty exam-
ples (examples with no features or labels) the final version of the dataset
included 15,539 instances.
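The per-label feature selection described above (rank all features for each label, keep the top 30, and union the lists) might look like the following sketch. The chapter's actual weighting function is not specified, so a simple positive-to-negative frequency ratio stands in for it here, and the cutoff k is a parameter.

```python
from collections import Counter

def top_features_per_label(docs, labels, k=30):
    """docs: list of token lists; labels: list of label sets.
    Ranks words per label by a simple frequency-ratio score (a stand-in
    for the chapter's unspecified weighting), then unions the per-label
    top-k lists, which removes redundant (shared) features."""
    vocab = set()
    all_labels = set().union(*labels)
    for lbl in sorted(all_labels):
        pos, neg = Counter(), Counter()
        for toks, ls in zip(docs, labels):
            (pos if lbl in ls else neg).update(set(toks))
        score = {w: pos[w] / (1 + neg[w]) for w in pos}
        ranked = sorted(score, key=score.get, reverse=True)
        vocab.update(ranked[:k])   # concatenation + redundancy removal
    return vocab
```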
4.2. Methods
We focus on the comparison of the HOMER method and its variations with
the binary relevance (BR) method, since BR is the most widely used multi-label
learning algorithm. Both the training and the testing phases of BR are
already linear with respect to |Ω|. We compare BR against HOMER using
BR as the multi-label classifier at each internal node of the label hierarchy.
To reduce the computational cost of the experiments, we use naive Bayes
as the base classifier for the BR-decomposed binary tasks. We evaluate the
performance of the methods using four-fold cross-validation. HOMER is run
with different numbers of clusters, k = 2, 4, 6, 8, 10. In addition to the
balanced k-means algorithm, we examine two other approaches for the
distribution of the labels of each node into its children. The first variation, called
HOMER-R, distributes evenly but randomly the labels into the k subsets.
The motivation here is to examine the benefits of clustering on top of the
even distribution of labels. The second variation, called HOMER-K, distributes
the labels using the Expectation-Maximization (EM) algorithm
without any constraints on cluster sizes. The motivation in this case is
to examine the benefits of the even-distribution constraint on top of
similarity-based clustering. The default version of HOMER with
balanced k-means is called HOMER-B. The MULAN
Java library13 implementation has been used for learning the multi-label classifiers in these
experiments.
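For concreteness, the balanced clustering at the heart of HOMER-B can be sketched as a capacity-constrained k-means: each label (represented, say, by a vector of example co-occurrence counts) is assigned to the nearest centroid that still has room, so no subset exceeds ⌈n/k⌉ labels. This is an illustrative reconstruction, not the MULAN implementation.

```python
import math, random

def balanced_kmeans(points, k, iters=20, seed=0):
    """Balanced k-means sketch in the spirit of HOMER-B: every cluster
    holds at most ceil(n/k) items. points: list of numeric tuples."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    cap = math.ceil(len(points) / k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # nearest center first, but respect remaining capacity
            order = sorted(range(k),
                           key=lambda c: sum((a - b) ** 2
                                             for a, b in zip(p, centers[c])))
            for c in order:
                if len(clusters[c]) < cap:
                    clusters[c].append(p)
                    break
        # recompute each center as the mean of its cluster
        centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                   else centers[i] for i, cl in enumerate(clusters)]
    return clusters
```

HOMER-R would replace the distance-based assignment with a random even split, and HOMER-K would drop the capacity check entirely.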
Fig. 4. Routing performance of HOMER and its variations compared to Binary Rele-
vance.
HOMER and its variations perform much better than BR. This is
attributed to the skewness of the distribution of examples per label,
an important drawback of BR in domains with a large number of
classes such as fatwa areas. HOMER manages to alleviate this problem
because the BR model trained within each hierarchy node deals with a
much smaller number of labels (<< |Ω|).
We then compare the different HOMER variations. We observe
that for three metrics out of the four, HOMER-K has the best results,
followed by HOMER-B and then HOMER-R. For micro-averaged recall,
HOMER-B is the best, followed by HOMER-K and then HOMER-R. This
shows that, in this domain, similarity-based distribution using simple
clustering is actually more important than clustering constrained by an
even label distribution. A potential reason is that the categories/areas of fatwa
may naturally form balanced clusters. The precision of
HOMER-K seems to be improving with the number of clusters while the re-
call degrades. In terms of Hamming loss, the performance of HOMER-R has
a decreasing trend while HOMER-K and HOMER-B show improvement as
the number of clusters increases. Hence, we conclude that the performance
of both HOMER and its variations increases with the number of clusters.
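The two evaluation measures discussed above can be computed as follows; these are the standard multi-label definitions, with y_true and y_pred given as lists of label sets:

```python
def hamming_loss(y_true, y_pred, n_labels):
    """Fraction of label slots predicted incorrectly (lower is better)."""
    errs = sum(len(t ^ p) for t, p in zip(y_true, y_pred))
    return errs / (len(y_true) * n_labels)

def micro_recall(y_true, y_pred):
    """True positives over all relevant labels, pooled across examples."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)
```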
References
Chapter 10
1. Introduction
the variety of proportions, heights, weights, and styles, each typeface
typeface has its own aesthetic and expressive qualities, as evidenced by
the visual attributes of its letterforms.15 Some typefaces are able to reflect
a selected message, while others can detract from an intended meaning.
Previous researchers on typefaces have illustrated that each typeface
has its own individual identity. For example, in a BBC radio programme
on February 11, 2005, Ian Peacock discussed how the fonts we select
send subtle messages about who we are.2 He argues that the fonts we
select to dress our words are as much of a fashion statement as the clothes
we wear. Moreover, in this program, fonts were illustrated as being
masculine or feminine. Fonts described as being fine, serif, sleek, and
elegant were considered as feminine, whereas fonts characterized as being
blocky and bold were considered as masculine.
This chapter focuses on the visual expression of Arabic and English
typefaces. For both, the relationship between typefaces and their perceived
personas has been investigated. Statistical analysis was used to separately
analyze the data collected from English and Arabic typeface surveys. By
applying statistical analysis on the data, the correlation among fonts and
perceived personas has been extracted. English and Arabic typefaces used
within this study have been grouped according to their personas. This
chapter begins with a literature review of different studies on typefaces'
personality traits, followed by a description of the newly
designed font surveys for both typefaces and the methodologies used
within these studies. Finally, a discussion, conclusions and suggestions on
topics for future exploration are provided, based on research results.
The 20 different typefaces that were selected as test typefaces for this
survey included eight typeface styles available in Microsoft Windows that
supported Arabic scripts and twelve recommended Arabic typefaces
that are commonly used.1,34 These typefaces were selected to represent
design characteristics such as tooth and loop heights, ascenders,
descenders and others. Also, these 20 typefaces are widely used in
different applications. Some of them, such as Kufi, are the standard and
most frequently used in displays for titling and in architectural ornaments.
Others, such as Thuluth, are mostly used for short texts and titles. Naskh
has become the industry standard for text, and Ruqaa is popular for
informal text.8,28 The 20 typefaces used in this study are Advertising Light,
Kufi, Maghrib, Naskh, Tahoma, ae_Ostorah, Ae_Mashq, Courier New,
Diwani Letter, Simplified Arabic Fixed, PakType Tehreer, Microsoft Sans
Serif, Ae_Nada, Times New Roman, DecoType Thuluth, Traditional
Arabic, Andalus, Code2000, Pashtu Breshnik and Ruqaa.
3.1.3. Participants
Native Arabic speakers residing in their home country were asked to fill
out the survey, in addition to the native speakers who currently live in
Montreal, Quebec, Canada. The respondents were recruited through
e-mails and posters at Concordia University. In total, 82 participants
completed the survey, consisting of 41 females and 41 males. There were
55 participants between the ages of 20–29, and 21 participants between
the ages of 30–39. Only two participants were younger than 20 and the
other four participants were older than 40. Regarding educational
background, 38 participants reported having a Bachelor’s degree, 29
participants had a Master’s degree and 10 participants had a Doctorate.
The educational background of the remaining 5 participants included high
school, technical college and junior college.
For each Arabic typeface, groups of Arabic sentences or words in the size
of 12 points were displayed as a test image (see Figure 1). Image (a)
contains the 3 most common sentences used by Arabic speakers. Image
(b) is the most common Arabic pangram and was obtained from Ref. 24,
containing all of the basic letters. Image (c) was taken from sports news.6
Images (d–f) represent the most, average and least frequently used Arabic
words. Image (g) includes the Arabic alphabet and numerals. The test
image was converted to a binary image at 300 × 300 dpi resolution.
210 S. Nikfal and C. Y. Suen
Figure 1. Samples of the text in typeface Tahoma displayed for this font survey.
size because the difference between actual height and average height
showed the percentage increase or decrease in total typeface size. For
example, if a typeface has a small height compared to the average height,
it will have a bigger size after normalization. However, typefaces with
large heights will have a smaller size after normalization. Figure 2
illustrates the result of typeface Courier New before and after
normalization.
Figure 2. Typeface Courier New, size 18 (from top to bottom: before normalization and
after normalization).
3.1.8. Procedures
Our approach for analyzing the survey data included the use of statistical
software SPSS (version 17.0). SPSS is among the most widely used
programs for statistical analysis in the social sciences.
Table 2. Mean values of rating scores for 20 typefaces with their abbreviations related to
5 personality traits.
Arabic and English Typeface Personas 213
of rating scores for each typeface. Analysis revealed that the histograms
of rating scores displayed two commonly shaped distributions: normal and
slightly skewed. The mean values are displayed in Table 2. The top 3
typefaces related to each personality trait are highlighted.
1. The most legible fonts are in group 1, followed by the fonts in group
2; the least legible fonts are in group 3.
2. The most attractive fonts are in group 3, followed by the fonts in
group 2; the least attractive fonts are in group 1.
3. The most formal fonts are in group 1, followed by the fonts in group 2;
the least formal fonts are in group 3.
The 3 groups were labeled by comparing and combining the rating
scores of each personality trait across the 3 groups. The label for each
group reflected its overall persona and distinguished it from the other
groups. Groups 1 and 3 were labeled based on their common personality
traits, as “Directness” and “Creativeness”, respectively. Group 2 was
labeled as “Neutral” because this group did not score extremely high nor
low on any personality traits.
Table 3. Mean values of rating scores for 13 typefaces related to 5 personality traits.
The typefaces used within this particular study are shown in Figure 4. The
complete listing and classification of typefaces is presented in Ref. 13.
4.1.3. Participants
For each typeface in this study, the complete English alphabet in upper
and lower case, as well as numerals, was printed at a size of 18 points.
Two common English pangrams (a pangram is a sentence using every
letter of the alphabet at least once) were also used in the size of 12 points:
“The quick brown fox jumps over a lazy dog” followed by the sentence,
“Please complete the survey to your comfort level”. Figure 5 illustrates
a sample similar to those printed for each of the 24 typefaces.
Figure 5. Sample of the alphabet and text in the font survey. This sample shows the typeface
“Poor Richard”, where (a) includes upper and lower cases of the English alphabet in the
size of 18 points, and numbers in the size of 18 points; and (b) includes two common
English pangrams in the size of 12 points.
This ratio was applied to change the 24 typefaces to the same x-heights.
It reflects the percentage increase or decrease in typeface size.
If a typeface has a small actual x-height compared to the average x-height,
it will have a bigger size after normalization. However, typefaces with
large x-heights will have a smaller size after normalization. Figures
illustrate the result of typefaces Chiller (Figure 6) and Impact (Figure 7)
before and after normalization.
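The normalization described above reduces to a single ratio. The chapter does not print the formula, but the natural reading of the described procedure is:

```python
def normalized_size(base_size, actual_xheight, average_xheight):
    """Scale a typeface so that all faces share the average x-height:
    a face with a small x-height is enlarged, one with a large x-height
    is shrunk (cf. Chiller vs. Impact in Figures 6 and 7)."""
    return base_size * (average_xheight / actual_xheight)
```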
Figure 6. Sample of typeface “Chiller” (from left to right: size 18 before normalization and
after normalization).
Figure 7. Typeface “Impact” (from left to right: size 18 before normalization and after
normalization).
4.1.8. Procedures
It was requested that participants visually examine the typefaces and rate
them on 6 personality traits, demonstrating how well the typeface suited
each personality trait.
The approach used to analyze English survey data employed the statistical
software SPSS (version 17.0). First, a univariate analysis on the English
survey data was performed similar to the one described in the Arabic
typeface survey. Secondly, a correlation analysis on the survey data was
conducted, as described in the Arabic typeface survey. Thirdly, a factor
analysis on the remaining typefaces to group them into smaller sets was
performed similarly to the one described in the Arabic typeface survey.
Lastly, the survey’s demographic data including age, gender and
educational levels were examined to identify their potential effects on
participants’ responses, similarly to the analysis described in the Arabic
typeface survey.
It was found that 16 typefaces exhibited strong correlations with
the 6 personality traits. They are: Impact, Bernard MT Condensed,
Garamond, Centaur, Harry Porter, Times New Roman, Kabel, Berlin Sans
FB, Footlight MT Light, Bauhaus 93, Arial, Helvetica, Rockwell,
Broadway, Cooper Black and Snap ITC. These typefaces were used for
further statistical analysis.
The items represented the typefaces and the factor represented the
independent group. Items that had higher factor loadings were more
representative of the factor than items with lower factor loadings. The
factor analysis results revealed that 3 or 4 independent factors accounted
for 33% and 66.7% of the total variance respectively. Typefaces are
categorized into 4 groups based on their ratings and the values of their
correlations. Typefaces within a group correlated highly with the other
typefaces in that group, and did not correlate highly with typefaces in the
other groups.
All typefaces that correlated positively in group 4 had much higher
property ratings than those in the other three groups; thus, “Legible”,
“Formal” and “Readable” were common properties of typefaces in group
4 and were characteristics that distinguished those typefaces from the
typefaces in the other groups. The following typefaces were grouped
together in group 4: Garamond, Helvetica, Arial, Times New Roman,
Centaur, Rockwell and Footlight MT Light.
The fonts in group 3 shared “Artistic”, “Readable” and “Attractive”
personality traits but they did not score the highest on those traits. Further
mean rating score analysis was done and the following two typefaces were
grouped together: Kabel and Berlin Sans FB.
The fonts in group 2 scored highest on the “Artistic” and “Sloppy”
characteristics. The following five typefaces were grouped together: Snap
ITC, Harry Porter, Broadway, Bauhaus 93 and Cooper Black.
The fonts in group 1 shared the “Legible” and “Formal” personality traits
but did not score the highest on those traits; therefore, further mean
rating score analysis was performed and the following two typefaces were
grouped together: Impact and Bernard MT Condensed.
For the fonts that did not score extremely high or extremely low on any
personality traits, further typographical feature analysis was done, as
shown in Table 6. It was found that Bernard MT Condensed and Impact
had the highest x-height proportion and weight (with the exception of
Broadway), and the smallest ascender and descender proportions (again
excepting Broadway). Therefore, those fonts were
grouped together in group 1.
Mean rating scores for the grouped typefaces were examined (see Table
7 and Table 8). From the mean values of the 16 typefaces and a comparison
of the 4 groups in Table 8 and Table 9, it was found that:
3. The most formal fonts are in group 4, followed by the fonts in group
1 and group 3; the least formal fonts are in group 2.
4. The sloppiest fonts are in group 2, followed by the fonts in group 1
and group 3; the least sloppy fonts are in group 4.
5. The most readable fonts are in group 4, followed by the fonts in
group 3 and group 1; the least readable fonts are in group 2.
6. The most attractive fonts are in group 4, followed by the fonts in
group 3 and group 2; the least attractive fonts are in group 1.
Table 7. Mean values of rating scores for grouped typefaces related to 6 personality traits.
It was not feasible to determine if effects were linked to gender, age and
educational background, as the variability of participants within these
groups was not sufficient for a valid analysis.
The results of the statistical analyses provide strong evidence that there is
a clear and significant relationship between particular typefaces and
perceived personality traits. Participants in this study consistently ascribed
specific personality traits to certain typefaces, which was consistent with
results from earlier studies on typefaces and their personality traits.19,31
After examining our statistical analysis, the total number of studied
typefaces was reduced from 24 to 16. Eight typefaces were eliminated
because they produced statistically insignificant results. Via a series of
1. The only font used in both studies was Times New Roman.
2. Four commonly used personality traits were chosen in both studies:
“Legible”, “Formal”, “Artistic” and “Attractive”.
3. Due to differences in the specific typefaces, personality
traits, rating scales, pangrams, and font sizes, we cannot directly
compare the results of the two studies.
References
1. H. S. Abifares, Arabic Typography: A Comprehensive Sourcebook. Saqi Books,
London (2000).
2. I. S. I. Abuhaiba, Discrete script or cursive language identification from document
images. Journal of King Saud University. 16(1), 253-269 (2004).
3. I. M. Al-Harkan and M. Z. Ramadan, Effects of pixel shape and color, and matrix pixel
density of Arabic digital typeface on characters’ legibility. International Journal of
Industrial Ergonomics. 35(7), 652–66 (2005).
4. M. Almuhajri and C. Y. Suen, Legibility and readability of Arabic fonts on personal
digital assistants. In eds. M. C. Dyson and C. Y. Suen, Digital Fonts and Reading,
pp. 248-265. World Scientific, Singapore (2016).
5. A. Alsumait, A. Al-Osaimi, and H. AlFedaghi, Arab children’s reading preference for
different online fonts. In Proceedings of HCI International Conference, HCI
International, pp. 3-11, San Diego, CA (July 2009).
6. Arabic News. Available from: http://aljazeera.net/portal
7. D. Bartram, Perception of semantic quality in type: differences between designers and
non-designers. Information Design Journal. 3(1), 38-50 (1982).
8. C. H. Baylis, Trends in typefaces. Printer's Ink. 252(5), 44-46 (1955).
9. BBC Radio. http://www.bbc.co.uk/radio4/ (2015).
10. M. L. Bernard, B. S. Chaparro, M. M. Mills, and C. G. Halcomb, Comparing the effects
of text size and format on the readability of computer-displayed Times New Roman
29. H. Spencer, The Visible Word Book, 2nd edn. Lund Humphries, Royal College of Arts,
New York (1969).
30. D. Shaikh and B. Chaparro, Perception of fonts: perceived personality traits and
appropriate uses. In eds. M. C. Dyson and C. Y. Suen, Digital Fonts and Reading,
pp. 226-247. World Scientific, Singapore (2016).
31. A. D. Shaikh, B. S. Chaparro, and D. Fox, Perception of fonts: Perceived personality
traits and uses. Usability News. 8(1), 1-6 (2006).
32. A. D. Shaikh, B. S. Chaparro, and D. Fox, The effect of typeface on the perception of
email, Usability News. 9(1), 1-7 (2007).
33. R. Shushan and D. Wright, Desktop Publishing By Design, 2nd edn. Microsoft
Press, Washington (1994).
34. E. Smitshuijzen, Arabic Font Specimen. Uitgeverij De Buitenkant, Amsterdam (2009).
35. S. A. Sweet and K. G. Martin, Data Analysis with SPSS. 3rd edn. Pearson Education,
Upper Saddle River (2009).
36. C. Y. Suen, S. Nikfal, Y. Li, Y. Zhang, and N. Nobile, Evaluation of typeface legibility
based on human perception and machine recognition. In Proceedings of the ATypI
International Conference, Dublin, Ireland, (2010).
37. J. Tschichold, Graphic Arts And Book Design: Essays on the Morality of Good Design.
Hartley & Marks, Washington (1958).
38. J. V. White, Graphic Design for the Electronic Age: Manual for Traditional and
Desktop Publishing Book. Watson-Guptill Publications and Xerox Press, New York (1988).
Chapter 11
This chapter presents the first end-to-end recipe for an Arabic speech-to-text
transcription system using lexicon-free Recurrent Neural Networks
(RNNs). The developed approach does not depend on Hidden
Markov Models (HMMs), Gaussian Mixture Models (GMMs), or decision
trees. In addition, a character-based decoder is used for searching, to
avoid using a word lexicon. The Connectionist Temporal Classification
(CTC) objective function is used to maximize the probability of the output
character sequences given the acoustic features as input. The recipe was
evaluated using a 1,200-hour corpus of Aljazeera multi-genre broadcast
programs. On the development set, we report a Word Error Rate (WER)
of 12.03% for non-overlapped speech.
1. Introduction
2. Related Work
The study framework compares the performance of lexicon-free RNN
speech recognition with other technologies [7,19]. This study comprises
two experiments: the first experiment concerns parameter estimation and
2 http://www.kaldi-asr.org/
3 http://htk.eng.cam.ac.uk/
4 http://cmusphinx.sourceforge.net/
The BDRNNs acoustic model scores each character given the input data.
Moreover, the CTC objective function (loss function) maximizes the
probabilities of the correct characters. The following sections briefly
discuss the BDRNNs and CTC.
5 http://www.qcri.com/
Fig. 2. Stanford Bidirectional Recurrent Neural Networks for input x and output p(c|x).
h_t^{(f)} = f\big( W^{(j)\top} h_t^{(j-1)} + W^{(b)\top} h_{t+1}^{(b)} + b^{(j)} \big)    (4)

where f(z) = min(max(z, 0), µ) is a rectified linear activation function
clipped to a maximum possible activation of µ to prevent overflow [21].
Rectified linear hidden units have been shown to work well in general for
deep neural networks, as well as for acoustic modeling of speech data [21].
The final layer of the BDRNNs computes the output distribution p(c|x_t)
using a softmax function:

p(c = c_k \mid x_t) = \frac{ e^{-( W_k^{(s)\top} h^{(:)} + b_k^{(s)} )} }{ \sum_{j=1}^{K} e^{-( W_j^{(s)\top} h^{(:)} + b_j^{(s)} )} },    (5)

where W_k^{(s)} is the k'th column of the output weight matrix W^{(s)} and b_k^{(s)}
is a scalar bias term. The vector h^{(:)} is the hidden layer representation of
the final hidden layer in our BDRNN. The set of all expected characters K
includes the blank symbol ( ).
\mathrm{CTC}(X, W) = \sum_{C \in W} \prod_{t=1}^{T} p(c_t \mid X).    (7)
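Eq. (7) can be checked numerically on a toy example by enumerating every frame-level path that collapses to the target sequence. This brute-force sketch (exponential in the number of frames, for illustration only) assumes per-frame character probabilities and the usual CTC collapse rule:

```python
import itertools

def collapse(path, blank="_"):
    """Standard CTC collapse: merge repeats, then drop blanks."""
    out, prev = [], None
    for c in path:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return "".join(out)

def ctc_score(probs, target, blank="_"):
    """Brute-force Eq. (7): sum, over every frame-level path C that
    collapses to `target`, of the product of per-frame probabilities.
    probs: one {char: prob} dict per frame."""
    chars = probs[0].keys()
    total = 0.0
    for path in itertools.product(chars, repeat=len(probs)):
        if collapse(path, blank) == target:
            p = 1.0
            for t, c in enumerate(path):
                p *= probs[t][c]
            total += p
    return total
```

Real implementations replace the enumeration with the forward-backward dynamic program, which is linear in the number of frames.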
3.3. Decoding
The beam search decoder operates at the character level. This offers
advantages over word-level decoding for two reasons. The first reason is
that decoding at the character level is much faster than at the word level
because no lexicon is involved. The search time of a lexicon-based decoder
is a function of the number of words to be searched; for the Arabic language,
the lexicon may contain up to 2M words, so lexicon-based decoders may be
very slow. Character-based decoders, on the other hand, depend only on the
number of characters (e.g. 35) used to train the BDRNNs, and are hence
faster than lexicon-based decoders [24]. The second reason is that
character-based decoding avoids the out-of-vocabulary (OOV) problem of
word decoding [22]. Algorithm 1 illustrates the decoding pseudo code
developed by Stanford [22].
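Since Algorithm 1 itself is not reproduced here, the following sketch shows the simpler best-path variant: pick the most likely symbol per frame, then collapse repeats and blanks. The full beam search keeps several candidate prefixes per frame instead of one.

```python
def greedy_decode(probs, blank="_"):
    """Best-path character decoding: argmax per frame, then collapse.
    A simplified stand-in for the beam search of Algorithm 1."""
    path = [max(frame, key=frame.get) for frame in probs]
    out, prev = [], None
    for c in path:
        # collapse: repeated symbols come from time-shifted alignments,
        # so keep a symbol only when it changes and is not the blank
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return "".join(out)
```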
The collapse function merges the repeated non-blank symbols that arise
when the character alignment shifts in time and produces the same
character again. It also
4. Front-End Preparation
The transliteration process maps each letter from Arabic to the
corresponding Latin character(s). We added spaces between the characters because
the character lengths differ (some characters are transliterated into one
or two Latin characters, as shown in Example 1). We use a hash (#) to
indicate the start of the sentence, a star (*) for the end of the sentence,
and a separator (|) for the spaces between the words. These special
characters help the decoder detect sentence and word boundaries.
Example 1 shows the transliteration of a statement:
[Arabic sentence not reproducible here]
# hh y th | t q w m | a l hh y a t | a l s y a s y t | aa l a | a l t aa d d
y t | a l hh z b y t *
Example 1. Sample of an Arabic statement transliterated into Latin.
# hh y th | t q w m | a l hh y a t *
1 14 37 12 39 11 30 36 33 39 7 32 14 37 7 11 20
Example 2. Numerical transformation.
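The formatting convention of Examples 1 and 2 (space-separated character transliterations, `|` between words, `#`/`*` as sentence markers) can be sketched as below. The transliteration map is hypothetical and covers only the letters of the example, since the full Qamus table is not reproduced here:

```python
# Hypothetical Buckwalter-style map for a few letters only; the chapter
# uses the full Qamus transliteration table, not reproduced here.
TRANSLIT = {"\u062D": "hh", "\u064A": "y", "\u062B": "th",
            "\u062A": "t", "\u0642": "q", "\u0648": "w", "\u0645": "m"}

def format_for_ctc(words):
    """Space-separates every transliterated character (some map to two
    Latin letters), joins words with '|', and adds '#'/'*' markers."""
    body = " | ".join(" ".join(TRANSLIT[ch] for ch in w) for w in words)
    return "# " + body + " *"
```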
The feature extraction in this study is based on filter banks (FB) instead of
Mel Frequency Cepstral Coefficients (MFCC). Our empirical results and
previous work [7] show that FB outperforms MFCC in speech and speaker
recognition technologies [25]. FB acts as a set of bandpass filters applied
to the audio signal in the frequency domain [25]. FB processing consists of projecting
6 http://www.qamus.org/transliteration.htm
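A minimal log mel filter bank front end is sketched below, under the usual assumptions (triangular filters equally spaced on the mel scale, applied to a Hann-windowed power spectrum). The chapter does not give its exact FB configuration, so the frame size, hop and filter count are conventional defaults:

```python
import numpy as np

def log_mel_fbank(signal, sr=16000, n_fft=400, hop=160, n_mels=40):
    """Log mel filter bank features; configuration values are assumed."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # frame, window, and take the power spectrum of each frame
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft]
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    # triangular filters with edges equally spaced on the mel scale
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return np.log(spec @ fb.T + 1e-10)
```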
5. Experiments
grams (the default value). The 15-gram LM is about 64 GB in ARPA format
and 32 GB in binary format. The binary format is faster in the decoding
process and consumes less memory than the ARPA format. 15-gram is the
highest order we could reach because of the memory limitations of the OS.
The experimental setup comprises 24 processor cores, 144 GB RAM, and an
NVIDIA Tesla K80 graphical processing unit (GPU). The Tesla K80 was the
state-of-the-art GPU technology at the time of writing (GPU specification:
24 GB memory cards and 2900 processing cores). The parallel decoding is
performed on the CPUs. The training parameters were set to 50 epochs
and 5 hidden layers, each of size 1840 hidden units. The step size is 1e-5,
and the maximum frame length is 6000 frames.
The HMM/GMM and Tandem acoustic models are built using the Hidden
Markov Model Toolkit (HTK)9 version 3.5. This version improves the
language model support to accept more than 64k words and supports
DNN modeling. The results in Table 3 are based on α = 5, β = 3.8, and a
beam length of 150. We used the HResults command of HTK and built two
output files [27]. The first file contains the word sequence that is compared
to the test set. The second file is generated as a sequence of labels (characters)
and is compared to the test set prepared in the same manner
(label sequence). The results become steady from epoch 30 because of the
small size of the training set (about 50 epochs per week). By increasing
the hidden layer size to 2048 and the beam length to 150, the results
improved by less than 1%.
9 http://htk.eng.cam.ac.uk/
August 13, 2018 9:32 ws-rv9x6-9x6 Book Title 10693-11 page 243
10 The graph is for illustration only and does not present real data.
Table 4. A phrase from the test set and the corresponding transcription output
by different methods (Test Set, GMM/HMM, and CTC). [The Arabic transcriptions
are not reproducible here.]
contributed to the challenge are: QCRI (1200 hours), LIUM (650 hours),
MIT (1200 hours), NDSC (680 hours) and Sevilla University (this study).
The system descriptions are available on the MGB website. Table 5 summarizes
the results and the technologies used in MGB-2 2016.
Table 6 summarizes the results on the development set, since the test set
reference transcription is not publicly available.
The first observation is that the Sevilla CTC lexicon-free setup achieves
a WER of 12.03% for non-overlapped recordings. The second observation
is that the gap between the overlapped and non-overlapped data sets is
around 11%, which is quite a big difference compared to other systems
(3%–6%). This implies that while the CTC lexicon-free approach achieves
competitive results for non-overlapped files, it shows poor robustness to
cross-talking speech as well as to noise. Table 7 illustrates a phrase from
the development set and the corresponding transcription output.
6. Conclusion
Table 7. A phrase from the development set and the corresponding transcription
output (methods: The Truth, Lexicon Free). [The Arabic transcriptions are not
reproducible here.]
Acknowledgments
Special thanks to QCRI for providing the Aljazeera corpus, and to Ziang Xie
from Stanford University.
References
Chapter 12
1. Introduction
Several systems have been introduced to deal with this issue for different
languages such as English, Japanese, and Chinese; by contrast, fewer
achievements exist for the Arabic language. The Arabic alphabet is used
by millions of people in Arab countries, and also for non-Arabic
languages such as Persian, Kurdish, Malay and Urdu. Arabic is also
important as the language of the Holy Quran, the Holy Book of Muslims.
The Arabic alphabet has 28 basic letters. Sixteen of the
Arabic letters have one, two or three dots, and the number and position
of these dots differentiate between otherwise similar letters. Each letter
has three or four shapes depending on its position in the word (beginning,
middle, end and isolated), and these shapes can be totally different. One
of the reasons that make Arabic letter recognition systems very
complicated is that the basic letters can expand to triple or more the
number of different letter shapes. This produces more than 80 different
shapes according to the position of a letter as well as the writing style
(such as Nasekh, Roqa'a, Farisi or others). Many Arabic letters contain
dots, and these can be the only feature that distinguishes one letter from
another.
The secondaries can be written separately or as a dashed line; they can
also be linked to a letter or drawn as one big shape. Several attempts
have been made to develop Arabic letter recognition systems. In Refs.
1–3, classical letter recognition systems for Arabic letters were developed.
No feature selection algorithms were used; however, some features were
selected manually to reduce the processing time. Artificial neural networks
(NNs) and support vector machines (SVM) achieved the best classification
accuracy among the classifiers tested. In Refs. 4–6, classical
feature selection algorithms were applied, such as the Genetic Algorithm
(GA), principal component analysis (PCA) and multi-objective GA,
respectively. The feature selection algorithms aimed at selecting the most
important features while achieving an acceptable recognition rate. Linear
discriminant analysis (LDA) and Support Vector Machines (SVM) were
used to validate the efficiency of the feature selection algorithms. In
Ref. 7, the Bat algorithm with Random Forests achieved the highest
recognition accuracy among the classifiers tested; it also outperformed
GA as a feature selection algorithm. In general, most of the optimization
algorithms performed well when used as feature selectors; nevertheless,
their ability to find the global optimum and their time consumption still
leave room for improvement.
Some recent work on handwritten Arabic letters was proposed in Ref. 8, where
Particle Swarm Optimization (PSO) was used as a feature selection algorithm to
select the most significant features. The selected features were tested with
several classifiers, of which Random Forests (RF) achieved the best accuracy.
Compared to other published works, PSO showed advantages over other swarm
optimization algorithms. More details about the principles and applications of
swarm intelligence in handwritten Arabic letter recognition can be found in
Ref. 9.
Due to the importance of handwritten Arabic letter recognition and the impact
of classification accuracy on it, various classification algorithms have been
applied. One of the most widely used is NNs; however, NNs have drawbacks in
how their weights and biases are generated. Several bio-inspired optimization
algorithms have therefore been applied to training NNs, such as PSO10, GA11,
Probability Based Incremental Learning (PBIL)12, Evolutionary Strategy (ES)13,
Differential Evolution (DE)14, and Grey Wolf Optimization (GWO)15. All of these
algorithms were adopted mainly to enhance the classification efficiency of NNs.
That motivated us to apply a new bio-inspired optimization algorithm,
Moth-Flame Optimization (MFO), to training NNs. MFO has a good ability to
avoid local optima and converges quickly compared to other optimization
techniques.
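MFO (Ref. 20) moves each candidate solution (a "moth") along a logarithmic spiral toward one of the best solutions found so far (the "flames"). A minimal sketch of that update against a generic fitness function is given below; the population size, iteration count and bound handling are illustrative assumptions, not the chapter's settings:

```python
import numpy as np

def mfo_minimize(fitness, dim, n_moths=20, iters=100, lb=-1.0, ub=1.0, b=1.0, seed=0):
    """Minimal Moth-Flame Optimization sketch (after Mirjalili, Ref. 20)."""
    rng = np.random.default_rng(seed)
    moths = rng.uniform(lb, ub, (n_moths, dim))
    best_x, best_f = None, np.inf
    for it in range(iters):
        scores = np.array([fitness(m) for m in moths])
        order = np.argsort(scores)
        flames = moths[order].copy()          # flames: best solutions of this iteration
        if scores[order[0]] < best_f:
            best_f, best_x = scores[order[0]], flames[0].copy()
        # the number of flames decreases linearly over the iterations
        n_flames = max(1, round(n_moths - it * (n_moths - 1) / iters))
        a = -1 + it * (-1 / iters)            # spiral parameter t is drawn from [a, 1]
        for i in range(n_moths):
            j = min(i, n_flames - 1)          # surplus moths all fly toward the last flame
            d = np.abs(flames[j] - moths[i])  # distance to the assigned flame
            t = (a - 1) * rng.random(dim) + 1
            # logarithmic spiral flight toward the flame
            moths[i] = d * np.exp(b * t) * np.cos(2 * np.pi * t) + flames[j]
            moths[i] = np.clip(moths[i], lb, ub)
    return best_x, best_f
```

The spiral term lets a moth both exploit the region around its flame (small t) and explore farther away (t near the lower bound), which is the mechanism behind MFO's local-optima avoidance.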
In this chapter, a handwritten letter recognition approach based on swarm
intelligence is proposed. ES, PBIL, PSO and MFO are used to improve the NNs'
working mechanism by updating their weights and biases; the goal is to find
the optimal values of the weights and biases, which in turn improves the
classification accuracy. This chapter is organized as follows: Section 2
provides more details about NNs and the bio-inspired optimization algorithms
used. Section 3 explains the working mechanism of the swarms. Section 4
presents the proposed approach. Experimental results and discussion are given
in Section 5. Finally, conclusions and future work are provided in Section 6.
252 A. A. Ewees and A. T. Sahlol
ES is based on iterations over parents and offspring and their evolution across
generations. The algorithm mimics the usual real-world relations between
parents and their offspring: the offspring inherit the parents' features, but
in some cases they are mutated, so their features are randomly changed. After
each generation, the best individuals among the parents and the offspring are
selected to form the parent population for the next generation. This sequence
is repeated until a predefined number of fitness-function evaluations is
reached13,18.
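The generation cycle described above can be sketched as a simple (μ + λ) evolution strategy. The population sizes and the fixed mutation strength below are illustrative assumptions (the chapter's Table 1 uses λ = 2, σ = 1):

```python
import random

def es_minimize(fitness, dim, mu=5, lam=10, sigma=1.0, max_evals=2000, seed=1):
    """Minimal (mu + lambda) evolution strategy sketch."""
    rng = random.Random(seed)
    parents = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(mu)]
    evals = 0
    while evals < max_evals:
        offspring = []
        for _ in range(lam):
            p = rng.choice(parents)                       # offspring inherit a parent's features...
            child = [x + rng.gauss(0, sigma) for x in p]  # ...then are randomly mutated
            offspring.append(child)
            evals += 1
        # next generation: the best mu individuals among parents and offspring
        pool = parents + offspring
        pool.sort(key=fitness)
        parents = pool[:mu]
    return parents[0], fitness(parents[0])
```

Because the parents compete with their own offspring for survival, the best solution found so far is never lost (elitism), which matches the selection step described in the text.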
The best neural network parameters (in this chapter: weights and biases) are
initially generated at random by the swarms (ES, PBIL, PSO and MFO). Each
algorithm starts by generating a random population of candidate solutions for
the given optimization problem. Over the iterations, the swarm parameters work
on approximating the probable position of the prey; for each candidate solution
the distance from the prey is updated, and the algorithm terminates once an end
criterion is satisfied. To drive the optimization update strategy, the
classification accuracy of a classifier (NNs in this work) is used as the
fitness function to evaluate all candidate solutions and select the best weight
and bias values.
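This evaluation scheme can be sketched as follows: each candidate solution is a flat vector of NN weights and biases that is decoded into a small network and scored by classification accuracy. The one-hidden-layer network and the `decode` layout are illustrative assumptions, not the chapter's exact implementation:

```python
import numpy as np

def decode(vec, n_in, n_hid, n_out):
    """Unpack a flat candidate solution into weight matrices and bias vectors."""
    i = 0
    W1 = vec[i:i + n_in * n_hid].reshape(n_in, n_hid); i += n_in * n_hid
    b1 = vec[i:i + n_hid]; i += n_hid
    W2 = vec[i:i + n_hid * n_out].reshape(n_hid, n_out); i += n_hid * n_out
    b2 = vec[i:i + n_out]
    return W1, b1, W2, b2

def fitness(vec, X, y, n_in, n_hid, n_out):
    """Fitness = classification accuracy of the decoded network on (X, y)."""
    W1, b1, W2, b2 = decode(vec, n_in, n_hid, n_out)
    h = np.tanh(X @ W1 + b1)                 # hidden layer
    pred = (h @ W2 + b2).argmax(axis=1)      # predicted class per sample
    return float((pred == y).mean())
```

Any of the swarms can then treat `fitness` as a black box: the candidate vector of length `n_in*n_hid + n_hid + n_hid*n_out + n_out` is the moth/particle position, and the swarm searches for the vector that maximizes the accuracy.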
After several hundred trials, an NN with 5 neurons in the hidden layer, a
maximum of 1000 epochs, and scaled conjugate gradient as the training function
was chosen, as it achieved the best fitness value.
The Mean Square Error (MSE) was selected to validate each swarm iteration; it
is calculated by the following equation:

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − xᵢ)²   (9)

where yᵢ is the i-th predicted value, xᵢ is the corresponding actual value, and
n is the number of samples.
The proposed algorithm is trained to find the weights and biases that minimize
the MSE on the training set. The swarm parameters used for the optimizers'
search mechanism are listed in Table 1.
Classifiers   Parameters
NN            Hidden layers: 5; error performance: MSE
PSO           w = 0.3; C1 = 1; C2 = 1
ES            λ = 2; σ = 1
MFO           t ∈ [−1, 1]; b = 1
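Using the PSO settings from Table 1 (w = 0.3, C1 = C2 = 1), one velocity-and-position update can be sketched as below; the clipping of positions to [−1, 1] is an assumption:

```python
import numpy as np

def pso_step(pos, vel, pbest, gbest, w=0.3, c1=1.0, c2=1.0, lb=-1.0, ub=1.0, rng=None):
    """One PSO iteration: pull each particle toward its own and the swarm's best."""
    if rng is None:
        rng = np.random.default_rng()
    r1 = rng.random(pos.shape)
    r2 = rng.random(pos.shape)
    # inertia + cognitive (personal best) + social (global best) components
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lb, ub)
    return pos, vel
```

The small inertia weight w = 0.3 favors exploitation: each particle's previous velocity contributes little, so the swarm is pulled strongly toward the best weight/bias vectors found so far.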
The results of the classifiers are evaluated using the accuracy and F-measure,
as shown in the following equations:

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (10)

recall = TP / (TP + FN)   (13)
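From the counts of a binary confusion matrix, accuracy, precision, recall and the F-measure reduce to the standard definitions; a small sketch:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure
```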
chosen. The training and testing data were chosen randomly: 70% of the data
were assigned to the training task, and the remaining 30% to testing. The
proposed approach was implemented in MATLAB (2014b).
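The chapter's experiments used MATLAB (2014b); purely as an illustration, the random 70/30 split can be sketched in Python like this:

```python
import numpy as np

def train_test_split(X, y, train_frac=0.7, seed=42):
    """Randomly assign train_frac of the samples to training, the rest to testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))       # random order of sample indices
    cut = int(train_frac * len(X))
    tr, te = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[te], y[te]
```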
The results of this chapter are divided into the following subsections.
To validate the performance of the optimized NNs, the MSE (in the testing
phase) was used to measure the distance between the predicted and the actual
output samples. Figure 3 shows the performance of the enhanced NNs (trained by
the ES, PSO, PBIL and MFO optimization algorithms) compared to the classic
NNs.
Fig. 3. The performance of NNs trained by ES, PSO, PBIL, MFO and the classic NNs.
The results in Figure 3 show that the NNs trained by MFO achieved the lowest
error, 0.030, whereas the errors of the NNs trained by PSO, PBIL and ES were
0.0313, 0.033 and 0.0345, respectively. This demonstrates how powerful MFO is
in optimizing NNs.
Figure 4 illustrates the convergence curves of the fitness values of the
objective function for the best three runs of each optimizer. It shows that
the performance of the optimization algorithms improved over the iterations:
for ES and PSO the curves remain steady for many iterations, whereas for PBIL
and MFO the performance keeps improving as the iterations proceed.
Table 3. The accuracy and f-measure results of the proposed optimized NNs
against classic NNs.
Table 4. Comparisons between this work and the performance of previous works on the
same dataset.
(1) Choosing efficient optimization algorithms such as MFO, ES, PBIL and PSO
significantly enhances the neural network's performance and leads to better
classification results. The enhanced NNs are more time-consuming, but this is
the cost of improving the recognition accuracy. A successful optimization
mechanism should find the optimum solution of a problem (here, the best
weights and biases), which was accomplished in this chapter.
(2) NNs achieved the highest performance among the classifiers in our previous
works (Refs. 2 and 27). NNs can be trained to solve nonlinear problems by
tuning the connection (weight) values between elements.
References
13. S. Gielen and B. Kappen, Minimizing the system error in feedforward neural networks
with evolution strategy, In ICANN'93, Springer, pp. 490-493 (1993).
14. J. Ilonen, J.-K. Kamarainen and J. Lampinen, Differential evolution training algorithm
for feed-forward neural networks, Neural Processing Letters 17, no. 1, pp. 93-105
(2003).
15. S. Mirjalili, How effective is the Grey Wolf optimizer in training multi-layer
perceptrons, Applied Intelligence 43, no. 1, pp. 150-161 (2015).
16. R. C. Eberhart and J. Kennedy, A new optimizer using particle swarm theory,
Proceedings of the sixth international symposium on micro machine and human
science. Vol. 1 (1995).
17. N. Sultan, S. M. Shamsuddin, and A. Hassanien, Hybrid learning enhancement of
RBF network with particle swarm optimization, Foundations of Computational,
Intelligence Volume 1. Springer, pp. 381-397 (2009).
18. T. P. Runarsson and X. Yao, Search biases in constrained evolutionary optimization,
IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and
Reviews) 35, no. 2, pp. 233-243 (2005).
19. S. Baluja, Population-based incremental learning: A method for integrating
genetic search based function optimization and competitive learning, Technical
Report CMU-CS-94-163, Carnegie Mellon University, Pittsburgh, PA, Dept. of
Computer Science (1994).
20. S. Mirjalili, Moth-flame optimization algorithm: A novel nature-inspired heuristic
paradigm, Knowledge-Based Systems 89, pp. 228-249 (2015).
21. H. Alamri, J. Sadri, C.Y. Suen, N. Nobile, A novel comprehensive database for Arabic
off-line handwriting recognition, In Proceedings of 11th International Conference on
Frontiers in Handwriting Recognition, ICFHR, vol. 8, pp. 664-669 (2008).
22. A. Hassanien and E. Emary, Swarm intelligence: principles, advances, and
applications, CRC Press (2016).
23. N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Syst.
Man Cybern., vol. 9, pp. 62-66 (Jan, 1979).
24. J. S. Lim, Two-dimensional signal and image processing, Englewood Cliffs, NJ,
Prentice Hall, (1990).
25. A. Rosenfeld and A. C. Kak, Digital picture processing, Academic press (1976).
26. M. Sami, N. El-Bendary, T-h. Kim, and A. E. Hassanien, Using particle swarm
optimization for image regions annotation, Lecture Notes in Computer Science,
vol. 7709, pp. 241-250 (2012).
27. A. T. Sahlol, A. A. Ewees, A. M. Hemdan and A. E. Hassanien, Training feedforward
neural networks using Sine-Cosine algorithm to improve the prediction of liver
enzymes on fish farmed on nano-selenite, In Computer Engineering Conference
(ICENCO), 2016 12th International, pp. 35-40 (2016).
Index
feature extraction, 85, 86, 89, 107, 111–113, 115, 120, 121
filtering, 155
font legibility, 204, 205, 227
free order language, 128
generative and discriminative models, 85
genre selection, 159
handwriting recognition, 113
hidden Markov models, 85, 94, 95
hierarchical classification, 187, 191
Informal Colloquial Arabic, 157
intransitive, 132
intricate, 61, 70, 72
natural language technologies, 29, 34
neural networks, 249, 250, 252, 255, 262, 264
nominal sentence, 131, 135
nominative case, 131, 138
orthography, 62, 63, 65, 66
Particle Swarm Optimization, 249, 251, 252, 264
passive voice, 132
Piece of Arabic Word, 117
POS tagging, 155
Probability Based Incremental Learning, 249, 251, 253, 264
pro-drop property, 128