Computational Linguistics, Speech and Image Processing for Arabic Language
Edited by Neamat El Gayar and Ching Y. Suen
Series on Language Processing, Pattern Recognition, and Intelligent Systems, Vol. 4
World Scientific



Series on Language Processing, Pattern Recognition, and
Intelligent Systems

Editors
Ching Y. Suen
Concordia University, Canada
parmidir@encs.concordia.ca

Lu Qin
The Hong Kong Polytechnic University, Hong Kong
csluqin@comp.polyu.edu.hk

Published
Vol. 1 Digital Fonts and Reading
edited by Mary C. Dyson and Ching Y. Suen
Vol. 2 Advances in Chinese Document and Text Processing
edited by Cheng-Lin Liu and Yue Lu
Vol. 3 Social Media Content Analysis:
Natural Language Processing and Beyond
edited by Kam-Fai Wong, Wei Gao, Wenjie Li and Ruifeng Xu
Vol. 4 Computational Linguistics, Speech and Image Processing for
Arabic Language
edited by Neamat El Gayar and Ching Y. Suen





Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library.

Series on Language Processing, Pattern Recognition, and Intelligent Systems — Vol. 4


COMPUTATIONAL LINGUISTICS, SPEECH AND IMAGE PROCESSING FOR ARABIC LANGUAGE
Copyright © 2019 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,
electronic or mechanical, including photocopying, recording or any information storage and retrieval
system now known or to be invented, without written permission from the publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy
is not required from the publisher.

ISBN 978-981-3229-38-9

Printed in Singapore




Preface

Arabic is a widely spoken Semitic language and differs considerably from other languages because of its complex and ambiguous structure.
The chapters of this book outline the challenging aspects of the Arabic
language phonetically, morphologically, syntactically and semantically.
This book presents state-of-the-art reviews and fundamentals in the
areas of Arabic natural language processing, speech and image analysis. It
also highlights and introduces novel applications and advances
encompassing areas of statistical and machine learning models, word-
spotting, handwriting recognition, multi-labelled classification, sentiment
analysis, text annotation, grammatical analysis, speech and font analysis.
We recommend the book to students and researchers who are interested
in developing their fundamentals and skills in the area of Arabic image,
text, document and speech processing. It is also a resource for scientists
who wish to keep track of most recent research directions and interesting
applications in this area.
The book consists of 12 chapters. Chapters 1 to 5 provide an overview of speech recognition, natural language processing, general human language technologies, word spotting and statistical classification as they apply to the Arabic language. Chapters 6 to 12 present research advances and interesting applications in the field.
Chapter 1 outlines the main building components of an automatic
speech recognition system and reviews the efforts to handle the challenges
of developing such models for the Arabic language. The state-of-the-art
performance for Arabic speech recognition systems is presented and
compared to speech recognition systems developed for the English
language.
Chapter 2 gives a general overview on computational linguistics with
special focus on Arabic human language technologies broadly covering
areas of text, speech and image processing. This chapter also outlines some
important organizations and companies that have contributed to Arabic
technologies.

Chapter 3 focuses on Arabic natural language processing and outlines the complexities and challenges of the Arabic language, giving examples from Modern Standard Arabic (MSA). MSA is
used for TV, newspapers, poetry and in books. It is also a universal
language that is understood by all Arabic speakers.
Chapter 4 summarizes feature extraction techniques for Arabic script
and compares the use of statistical machine learning techniques, mainly
generative and discriminative based models, for Arabic recognition.
Chapter 5 introduces and discusses Arabic word spotting approaches and challenges. It also summarizes the most commonly used performance measures, databases and features related to Arabic word spotting.
In Chapter 6, a system is implemented to automate the E‘raab process, i.e. the syntactic analysis of an Arabic sentence. For many students, E‘raab is the most daunting part of studying Arabic grammar in school.
Chapter 7 presents an enhanced version of one of the most widespread
electronic Arabic lexical and morphological resources. Chapter 8
discusses the problem of Arabic sentiment analysis and presents an
extended Arabic sentiment lexicon containing approximately six thousand
Arabic terms and phrases. The resulting lexicon is available for public use.
Chapter 9 deals with multi-labeled classification in the domain of legal interpretation in the Islamic religion, known as ‘fatwa’. This application is similar to the issuing of legal opinions by courts in common-law systems. The work presents a hierarchical classification
system to automatically route incoming fatwa requests to the most relevant
mufti (i.e. Islamic scholar).
Chapter 10 presents a study to identify personality traits of Arabic and
English typefaces (i.e. Legible, Attractive, Comfortable, Artistic, etc.) and
to obtain typeface groups for typeface design analysis.
Chapter 11 presents a novel end-to-end system for Arabic speech-to-text transcription using lexicon-free Recurrent Neural Networks (RNNs). Finally, Chapter 12 presents an Arabic handwritten letter recognition system based on swarm optimization algorithms combined with neural networks (NNs).

In closing, we would like to express our sincere gratitude to all contributing authors, to Kim Tan and her team at World Scientific Publishing, and to Dr Marleah Bloom from Concordia University.



Contents

Preface v

Chapter 1
Arabic Speech Recognition: Challenges and State of the Art 1
Sherif Mahdy Abdou and Abdullah M. Moussa
1. Introduction 2
2. The Automatic Speech Recognition System Components 2
2.1. Pronunciation lexicon 4
2.2. Acoustic model 4
2.3. Language model 8
2.4. Decoding 9
3. Literature Review for Arabic ASR 10
4. Challenges for Arabic ASR Systems 14
4.1. Using non-diacritized Arabic data 15
4.2. Speech recognition for Arabic dialects 16
4.3. Inflection effect and the large vocabulary 19
5. State of the Art Arabic ASR Performance 22
6. Conclusions 24
References 24

Chapter 2
Introduction to Arabic Computational Linguistics 29
Mohsen Rashwan
1. Introduction 29
2. Layers of Linguistic Analysis 30
2.1. Phonological analysis 30
2.2. Morphological analysis 31
2.3. Syntactic analysis 31
2.4. Semantic analysis 31
3. Challenges Facing Human Language Technologies 32
4. Challenges Facing the Arabic Language Processing 32
4.1. Arabic script 33
4.2. Common mistakes 33
4.3. Morphological structure for the Arabic word 34

4.4. Syntax of the Arabic sentence 35


5. Defining the Human Languages Technologies 36
5.1. Texts search (search engines) 36
5.2. Machine translation 38
5.3. Question answering 39
5.4. Automated essay scoring 39
5.5. Automatic text summarization 40
5.6. Document classification and clustering 40
5.7. Opinion mining 41
5.8. Computer-aided language learning (CALL) 42
5.9. Stylometry 42
5.10. Automatic speech recognition 43
5.11. Text to speech (TTS) 45
5.12. Audio and video search 46
5.13. Language recognition 46
5.14. Computer-aided pronunciation learning 46
5.15. Typewritten optical character recognition (OCR) 47
5.16. Intelligent character recognition 48
5.17. Book reader 48
5.18. Speech to speech translation 49
5.19. Speech-to-sign-language and sign-language-to-speech 49
5.20. Dialog management systems 50
5.21. Advanced information retrieval systems 51
5.22. Text mining (TM) 52
6. Arabic Computational Linguistics Institutions 52
6.1. Academic institutions 52
6.2. Companies interested in computational linguistics 56
7. Summary and Conclusions 57
References 57

Chapter 3
Challenges in Arabic Natural Language Processing 59
Khaled Shaalan, Sanjeera Siddiqui, Manar Alkhatib and Azza Abdel Monem
1. Introduction 59
2. Challenges 61
2.1. Arabic orthography 62
2.2. Arabic morphology 69
2.3. Syntax is intricate 72

3. Conclusion 78
References 79

Chapter 4
Arabic Recognition Based on Statistical Methods 85
A. Belaïd and A. Kacem Echi
1. Introduction 85
2. A Challenging Morphology 86
3. Features Extraction Techniques 87
4. Machine Learning Techniques 92
5. Markov Models 94
5.1. Case 1: Decomposition of the shape/label 94
5.2. Case 2: Decomposition by association with a model 96
5.3. Extension of HMM to the Plane 98
5.4. Bayesian Networks 99
5.5. Two Dimensional HMM 101
6. Discriminative Models 103
7. Conclusion 107
References 108

Chapter 5
Arabic Word Spotting Approaches and Techniques 111
Muna Khayyat, Louisa Lam and Ching Y. Suen
1. Word Spotting 111
1.1. Definition 112
1.2. Input queries 113
1.3. Performance measures 114
1.4. Word spotting approaches 115
2. Arabic Word Spotting 116
2.1. Characteristics of Arabic handwriting 116
2.2. Arabic word spotting approaches 118
3. Databases 120
4. Extracted Features 121
5. Concluding Remarks 123
References 123

Chapter 6
A‘rib — A Tool to Facilitate School Children’s Ability to Analyze Arabic Sentences Syntactically 127
Mashael Almedlej and Aqil M Azmi
1. Introduction 127
2. Related Work 130
3. Basic Arabic Sentences Structure 131
4. System Design 132
4.1. Lexical analyzer 134
4.2. Syntactic analyzer 134
4.3. Results builder 138
4.4. Special cases 139
5. Implementation 140
5.1. Lexical analysis 141
5.2. Syntactic analysis 145
5.3. Results builder 151
5.4. Output 152
6. Conclusion and Future Work 152
References 153

Chapter 7
Semi-Automatic Data Annotation, POS Tagging and Mildly Context-Sensitive Disambiguation: The eXtended Revised AraMorph (XRAM) 155
Giuliano Lancioni, Laura Garofalo, Raoul Villano,
Francesca Romana Romani, Marta Campanelli, Ilaria Cicola,
Ivana Pepe, Valeria Pettinari and Simona Olivieri
1. Introduction 155
2. Description of XRAM 156
2.1. Flag-selectable usage markers 157
2.2. Probabilistic mildly context-sensitive annotation 160
2.3. Lexical and morphological XML tagging of texts 161
2.4. Semi-automatic increment of lexical coverage 163
3. Validation and Research Grounds 165
4. Conclusion 166
References 166

Chapter 8
WeightedNileULex: A Scored Arabic Sentiment Lexicon for Improved Sentiment Analysis 169
Samhaa R. El-Beltagy
1. Introduction 169
2. Related Work 170
3. The Base Lexicon 172
4. Assigning Scores to Lexicon Entries 173
4.1. Data collection 173
4.2. Collecting term statistics 174
4.3. Term scoring 174
5. Experiments and Results 178
5.1. The sentiment analysis system 179
5.2. The used datasets 180
5.3. Experimental results 181
6. Conclusion 184
References 184

Chapter 9
Islamic Fatwa Request Routing via Hierarchical Multi-Label Arabic Text Categorization 187
Reda Zayed, Mohamed Farouk and Hesham Hefny
1. Introduction 187
2. Related Work 190
3. Islamic Fatwa Requests Routing System 191
3.1. Text preprocessing 191
3.2. Feature engineering 193
3.3. The HOMER algorithm 194
4. Performance Evaluation 195
4.1. Data description 195
4.2. Methods 197
4.3. Results and Discussion 197
5. Future Work and Conclusion 199
References 200

Chapter 10
Arabic and English Typeface Personas 203
Shima Nikfal and Ching Y. Suen
1. Introduction 203
2. Literature Review of Typeface Personality Studies 204
3. Arabic Typeface Personality Traits 207
3.1. Research methodology 207
3.2. Statistical analyses of survey results 212
4. English Typeface Personality Traits 217
4.1. Research methodology 217
4.2. Statistical analyses of survey results 221
5. Summary of English Typefaces 225
6. Summary of Arabic Typefaces 226
7. Comparison of Both Studies 226
8. Conclusions and Future Work 227
References 228

Chapter 11
End-to-End Lexicon Free Arabic Speech Recognition Using Recurrent Neural Networks 231
Abdelrahman Ahmedy, Yasser Hifny, Khaled Shaalan and Sergio Toral
1. Introduction 231
2. Related Work 232
3. Arabic Speech Recognition System 233
3.1. Acoustic model 234
3.2. Language model 237
3.3. Decoding 237
4. Front-End Preparation 239
4.1. Converting the Arabic text to Latin (transliteration process) 239
4.2. Converting the transcription to alias 240
4.3. Speech features extraction 240
5. Experiments 241
5.1. The 8-hour experiment 241
5.2. The 8-hour results 242
5.3. The 1200-hour experiment 244
5.4. The 1200-hour results 245
6. Conclusion 245
References 246

Chapter 12
Bio-Inspired Optimization Algorithms for Improving Artificial Neural Networks: A Case Study on Handwritten Letter Recognition 249
Ahmed A. Ewees and Ahmed T. Sahlol
1. Introduction 249
2. Neural Networks and Bio-inspired Optimization Algorithms 252
2.1. Neural Networks (NNs) 252
2.2. Particle Swarm Optimization (PSO) 252
2.3. Evolutionary Strategy (ES) 252
2.4. Probability Based Incremental Learning (PBIL) 253
2.5. Moth-Flame Optimization (MFO) 253
3. Swarms Working Mechanism 255
4. The Proposed Approach 257
5. Experiments and Results 258
5.1. Dataset description 258
5.2. Evaluation criteria 259
5.3. Results and discussions 259
6. Conclusion and Future Work 264
References 265

Index 267



Chapter 1

Arabic Speech Recognition: Challenges and State of the Art

Sherif Mahdy Abdou¹ and Abdullah M. Moussa²

¹Faculty of Computers and Information, Cairo University, Giza 12613, Egypt
s.abdou@fci-cu.edu.eg

²Faculty of Engineering, Cairo University, Giza 12613, Egypt
a.m.moussa@ieee.org

The Arabic language has many features, such as its phonology and syntax, that make it an easy language for developing automatic speech recognition systems. Many standard techniques for acoustic and language modeling, such as context-dependent acoustic models and n-gram language models, can be applied to Arabic directly. Some aspects of the Arabic language, such as the nearly one-to-one letter-to-phone correspondence, make the construction of the pronunciation lexicon even easier than in other languages. The most difficult challenges in developing speech recognition systems for Arabic are the dominance of non-diacritized text material, the several dialects, and the morphological complexity. In this chapter, we review the efforts that have been made to handle these challenges. This includes methods for the automatic generation of diacritics for Arabic text and for word pronunciation disambiguation. We also review the approaches used to handle the limited speech and text resources of the different Arabic dialects. Finally, we review the approaches used to deal with the high degree of affixation and derivation that contributes to the explosion of different word forms in Arabic.

1. Introduction

Speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format. The last decade has witnessed substantial advances in speech recognition technology, which, combined with the increase in computational power and storage capacity, has resulted in a variety of commercial products already on the market.
Arabic is the most widely spoken living Semitic language. Around 300 million people speak Arabic as their first language, and it is the fourth most widely used language in terms of the number of first-language speakers.
Many serious efforts have been made to develop Arabic speech
recognition systems.1,2,3 Many aspects of Arabic, such as the phonology
and the syntax, do not present problems for Automatic Speech
Recognition (ASR). Standard, language-independent techniques for
acoustic and pronunciation modeling, such as context-dependent phones,
can easily be applied to model the acoustic-phonetic properties of
Arabic. Some aspects of recognizer training are even easier than in other
languages, in particular the task of constructing a pronunciation lexicon
since there is a nearly one-to-one letter-to-phone correspondence. The
most difficult problems in developing high-accuracy speech recognition
systems for Arabic are the predominance of non-diacritized text material,
the enormous dialectal variety, and the morphological complexity.
In the following sections of this chapter, we start by describing the main components of ASR systems and the major approaches that have been introduced to develop each of them. Then, we review previous efforts to develop Arabic ASR systems. Finally, we discuss the major challenges of Arabic ASR and the solutions proposed to overcome them, with a summary of the performance of state-of-the-art systems.

2. The Automatic Speech Recognition System Components

The goal of the ASR system is to find the most probable sequence of words $W = (w_1, w_2, \ldots)$ belonging to a fixed vocabulary, given a set of acoustic observations $X = (x_1, x_2, \ldots, x_T)$. Following the Bayesian approach applied to ASR as shown in Ref. 4, the best estimate of the word sequence is given by:

$$\hat{W} = \arg\max_W P(W|X) = \arg\max_W \frac{P(X|W)\,P(W)}{P(X)} \tag{1}$$

To generate an output, the speech recognizer basically has to perform the following operations, as shown in Fig. 1:

• Extract acoustic observations (features) from the spoken utterance.
• Estimate P(W), the probability of an individual word sequence occurring, regardless of the acoustic observations. This is called the language model.
• Estimate P(X|W), the likelihood that the particular set of features originates from a certain sequence of words. This includes both the acoustic model and the pronunciation lexicon; the latter is perhaps the only language-dependent component of an ASR system.
• Find the word sequence that maximizes (1). This is referred to as the search or decoding step.

Fig. 1. The ASR system main architecture: input speech passes through front-end feature extraction to produce the feature vector X; the search combines the acoustic model P(X|W) and the language model P(W) to produce the recognized text.


The two terms P(W) and P(X|W) and the maximization operation constitute the basic ingredients of a speech recognition system. The goal is to determine the best word sequence given a speech input X. Strictly speaking, X is not the raw speech signal but a set of features derived from it; the Mel Frequency Cepstrum Coefficients (MFCC) and Perceptual Linear Prediction (PLP) features are the most widely used. The acoustic and language models and the search operation are discussed below.

2.1. Pronunciation lexicon

The pronunciation lexicon is basically a list in which each word in the vocabulary is mapped to a sequence (or multiple sequences) of phonemes. This allows a large number of words to be modeled using a fixed number of phonemes. Sometimes whole-word models are used, in which case the pronunciation lexicon is trivial. The pronunciation lexicon is language-dependent, and for a large vocabulary (several thousand words) it might require a large effort. We will discuss this for Arabic in the next sections.
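A toy sketch of such a lexicon follows; the Buckwalter-style transliterations and phoneme symbols are illustrative assumptions, not entries from an actual Arabic lexicon:

```python
# Toy pronunciation lexicon: each word maps to one or more phoneme
# sequences. All entries are illustrative.
lexicon = {
    "kitAb":  [["k", "i", "t", "a:", "b"]],      # "book"
    "kataba": [["k", "a", "t", "a", "b", "a"]],  # "he wrote"
    "qalam":  [["q", "a", "l", "a", "m"],        # "pen"
               ["g", "a", "l", "a", "m"]],       # dialectal variant
}

def pronunciations(word):
    # Fall back to a whole-word model when the word is missing.
    return lexicon.get(word, [[word]])
```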

2.2. Acoustic model

The most popular acoustic models are the so-called Hidden Markov Models (HMM). Each phoneme (or unit in general) is modeled using an HMM. An HMM4 consists of a set of states, transitions, and output distributions, as shown in Fig. 2.

Fig. 2. HMM phone model (a left-to-right chain of states S0–S4 with self-loop, forward and skip transition probabilities).



The HMM states are associated with emission probability density functions. These densities are usually given by a mixture of diagonal-covariance Gaussians, as expressed in equation (2):

$$b_s(x) = \sum_j w_{sj}\, N(x;\, \mu_{sj}, \sigma_{sj}^2) \tag{2}$$

where $j$ ranges over the Gaussian densities in the mixture of state $S_s$, and $N(x;\, \mu_{sj}, \sigma_{sj}^2)$ is the value of the $j$-th component Gaussian density at feature vector $x$. The parameters of the model (state transition probabilities and output distribution parameters, e.g. the means and variances of the Gaussians) are automatically estimated from training data. Using only one model per phone is usually not accurate enough, so several models are trained for each phone depending on its context. For example, a tri-phone uses a separate model depending on the immediate left and right contexts of a phone: tri-phone A with left context b and right context n (referred to as /b-A-n/) has a different model than tri-phone A with left context t and right context m (referred to as /t-A-m/). For a total number of phones P, there will be P³ tri-phones, and for N states per model, there will be N·P³ states in total. The idea can be generalized to larger contexts, e.g. quinphones. This typically leads to a large number of parameters, so in practice context-dependent phones are clustered to reduce their number.
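To make equation (2) concrete, the following NumPy sketch evaluates the log emission density of one HMM state under a diagonal-covariance Gaussian mixture; the mixture parameters are made-up toy values:

```python
import numpy as np

def log_gmm_density(x, weights, means, variances):
    """Log of b_s(x) = sum_j w_j N(x; mu_j, diag(sigma_j^2))."""
    # Per-component log densities of a diagonal-covariance Gaussian mixture.
    log_comp = (
        np.log(weights)
        - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
        - 0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    )
    # Log-sum-exp for numerical stability.
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())

# Toy example: a 2-component mixture over 3-dimensional features.
w = np.array([0.4, 0.6])
mu = np.zeros((2, 3)); mu[1] += 1.0
var = np.ones((2, 3))
print(log_gmm_density(np.array([0.5, 0.5, 0.5]), w, mu, var))
```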
Perhaps the most important aspect in designing a speech recognition system is finding the right number of states for the given amount of training data, and extensive research has been done to address this point. Methods vary from very simple phonetic rules to data-driven clustering; the most popular technique is decision tree clustering.5 In this method, both context questions and a likelihood metric are used to cluster the data for each phonetic state, as shown in Fig. 3. The depth of the tree can be used to trade off accuracy against robustness.
Once the context-dependent states are clustered, it remains to assign a
probability distribution to each clustered state. Gaussian mixtures are the
most popular choice in modern speech recognition systems. The
parameters of the Gaussians are estimated to maximize the likelihood of
the training data (the so-called maximum likelihood (ML) estimation).

Fig. 3. Decision tree for classifying the second state of K-triphone HMM: internal nodes ask phonetic context questions (e.g. "Is the left phone a sonorant or nasal?", "Is the right phone voiced?") and the leaves are the clustered states (senones 1–6).

For HMMs, ML estimation is achieved by the so-called forward-backward or Baum-Welch algorithm.
Although ML remained the preferred training method for a long time, discriminative training techniques have recently taken over, since it was demonstrated that they can lead to superior performance. However, this comes at the expense of a more complex training procedure.6 There are several discriminative training criteria, such as Maximum Mutual Information (MMI), Minimum Classification Error (MCE), Minimum Phone Error (MPE) and, most recently, Maximum Margin methods. All these techniques share the idea of using the correct transcription and a set of competing hypotheses, and they estimate the model parameters to "discriminate" the correct hypothesis from the competing ones. The competing hypotheses are usually obtained from a lattice, which in turn requires decoding the training data. Model estimation is most widely done using the so-called extended Baum-Welch estimation (EBW).7

Recently, a better acoustic model was introduced that is a hybrid of HMM and Deep Neural Networks (DNN). The Gaussian Mixture Models (GMM) are replaced with neural networks with a deep stack of hidden layers, as shown in Fig. 4.

Fig. 4. HMM-DNN Model.

DNNs have a higher modeling capacity per parameter than GMMs, and they also have a fairly efficient training procedure that combines unsupervised generative learning for feature discovery with a subsequent stage of supervised learning that fine-tunes the features to optimize discrimination. The Context-Dependent (CD)-DNN-HMM hybrid model, as shown in Ref. 8, has been successfully applied to large vocabulary speech recognition tasks and can cut the word error rate by up to one third on challenging conversational speech transcription tasks compared to discriminatively trained conventional CD-GMM-HMM systems.
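A minimal sketch of the neural-network half of such a hybrid model is given below, in PyTorch; the toolkit, layer sizes and senone count are assumptions for illustration, not the configuration of any system described here:

```python
# Sketch of a hybrid DNN-HMM acoustic model's network (assumed sizes).
import torch
import torch.nn as nn

class DNNAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, context=5, hidden=1024, n_layers=5,
                 n_senones=5000):
        super().__init__()
        layers, d = [], feat_dim * (2 * context + 1)  # spliced input frames
        for _ in range(n_layers):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, n_senones))  # one output per tied state
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Log-posteriors over senones; at decode time these are divided by
        # senone priors to obtain scaled likelihoods replacing the GMM b_s(x).
        return torch.log_softmax(self.net(x), dim=-1)

model = DNNAcousticModel()
frames = torch.randn(8, 40 * 11)  # batch of 8 spliced feature vectors
print(model(frames).shape)        # torch.Size([8, 5000])
```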
While the above summarizes how to train models, it remains to discuss the training data. Of course, using more data allows larger and hence more accurate models, leading to better performance. However, data collection and transcription is a tedious and costly process. For this reason, a technique called unsupervised (or, better, lightly supervised) training is becoming very popular. First, several hundred hours of speech are used to train a model. The model, together with an appropriate confidence measure, can then be used to automatically transcribe thousands of hours of data. The new data can then be used to train a larger model. All the above techniques (and more) are implemented in the so-called Hidden Markov Model Toolkit (HTK) developed at Cambridge University.9

2.3. Language model

A language model (LM) is required in large vocabulary speech recognition for disambiguating between the large set of alternative and confusable words that might be hypothesized during the search. The LM defines the prior probability of a sequence of words. When the language restrictions are well known and all possible combinations between words can be defined, probabilities can be precisely calculated and included in finite state automata that rule the combination of words in a sentence. Unfortunately, this scheme only applies to restricted application domains with small vocabularies. For large vocabularies and more complex sentence configurations, a simple but effective way to represent a sequence of n words is to consider it as an n-th order Markov chain. The LM probability of a sentence (i.e., a sequence of words $w_1, w_2, \ldots, w_n$) is given by:

$$P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2|w_1)\, P(w_3|w_1, w_2) \cdots P(w_n|w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i|w_1, \ldots, w_{i-1}) \tag{3}$$

where $w_1, \ldots, w_{i-1}$ in expressions such as $P(w_i|w_1, \ldots, w_{i-1})$ is the word history for word $w_i$. In practice, one cannot obtain reliable probability estimates given arbitrarily long histories, since that would require enormous amounts of training data. Instead, one usually approximates them in the following way:

$$P(w_i|w_1, w_2, \ldots, w_{i-1}) \approx P(w_i|w_{i-N+1}, \ldots, w_{i-1}) \tag{4}$$

This is the definition of “N-grams”. In several recognition approaches, the number of predecessors considered tends to be reduced, resulting in “bigrams” (for N = 2) and “trigrams” (for N = 3). An important feature of N-grams is that their probabilities can be estimated directly from text examples and therefore do not need explicit linguistic rules, as grammar inference systems do. Estimation of N-grams has to be treated carefully, as for a vocabulary of size V there are as many as V^N probabilities to be estimated in the N-gram model. Many word histories do not occur with enough counts to estimate their probabilities reliably, and many techniques have been proposed to approximate these probabilities.10 For example, a bigram grammar typically lists only the most frequently occurring bigrams and uses a backoff mechanism to fall back on the unigram probability when the desired bigram is not found. In other words, if P(wj|wi) is sought and not found, one falls back on P(wj), but a backoff weight is applied to account for the fact that wj is known not to be one of the bigram successors of wi. Higher-order backoff N-gram grammars can be defined similarly. Ideally, a good LM eases the retrieval of the word sequence present in the speech signal by better focusing the decoding procedure, which represents another relevant step of the search. One of the effective tools for training language models is the SRILM toolkit, which includes most of the state-of-the-art alternatives.11
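The backoff mechanism can be sketched as follows; the toy probabilities and backoff weights are invented for illustration (a real model would be trained with a toolkit such as SRILM):

```python
# Toy backoff bigram: use P(wj|wi) when the bigram was seen, otherwise
# back off to alpha(wi) * P(wj). All numbers here are made up.
import math

unigram = {"<s>": 0.1, "the": 0.3, "cat": 0.2, "sat": 0.2, "</s>": 0.2}
bigram = {("<s>", "the"): 0.6, ("the", "cat"): 0.5, ("cat", "sat"): 0.4}
backoff = {"<s>": 0.5, "the": 0.6, "cat": 0.7, "sat": 1.0}

def log_p(wj, wi):
    if (wi, wj) in bigram:
        return math.log(bigram[(wi, wj)])
    return math.log(backoff[wi]) + math.log(unigram[wj])  # backed off

sentence = ["<s>", "the", "cat", "sat", "</s>"]
print(sum(log_p(w2, w1) for w1, w2 in zip(sentence, sentence[1:])))
```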

2.4. Decoding

Finding the best word (or generally unit) sequence given the speech input
is referred to as the decoding or search problem. Formally, the problem reduces to finding the best state sequence in a large state space formed by composing the pronunciation lexicon, the acoustic model and the language model. The solution can be found using the well-known
Viterbi algorithm. Viterbi search is essentially a dynamic programming
algorithm, consisting of traversing a network of HMM states and
maintaining the best possible path score at each state in each frame. It is
a time synchronous search algorithm in that it processes all states
completely at time t before moving on to time t + 1. The abstract
algorithm can be understood with the help of Fig. 5. One dimension
represents the states in the network, and the other dimension represents
the time axis.

Fig. 5. Viterbi search as dynamic programming: states on one axis (from start state to final state) and time on the other.
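A compact NumPy implementation of this dynamic-programming search is sketched below, assuming the transition, emission and initial scores are already in log space; the toy numbers are illustrative only:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Best state path. log_A[i, j]: transition log-probs,
    log_B[t, j]: emission log-likelihoods, log_pi[j]: initial log-probs."""
    T, N = log_B.shape
    delta = log_pi + log_B[0]          # best score ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # [i, j] = delta[i] + log_A[i, j]
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
        # A beam search would prune states far below delta.max() here.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):      # trace back the best sequence
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta.max())

# Toy 2-state, 3-frame example.
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_B = np.log([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]])
print(viterbi(log_A, log_B, np.log([0.6, 0.4])))
```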

Even for a moderate vocabulary, a full search is prohibitive. The Viterbi beam search is a very popular and simple way to speed up the search.12 Using a beam is not always sufficient, and there are two very popular approaches to the search problem (a small rescoring sketch follows the list):

• Use relatively simple acoustic and language models to generate an N-best list or a lattice, then use more detailed acoustic and/or language models to rescore the reduced search space and find the best word sequence. This is called the multi-pass approach.
• Compose the full search space and use determinization and minimization algorithms to optimize it, then run a Viterbi beam search on the optimized search space to find the best word sequence. We refer to this as the single-pass approach.
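A minimal sketch of the rescoring step of the multi-pass approach; the hypotheses, scores and LM weight are all made up:

```python
# Rescore a first-pass N-best list with a stronger language model.
def rescore(nbest, lm_logprob, lm_weight=0.8):
    # nbest: list of (word_sequence, acoustic_log_likelihood) pairs.
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))

nbest = [(("w1", "w2"), -120.3), (("w1", "w3"), -121.0)]
best = rescore(nbest, lambda words: -2.0 * len(words))  # dummy LM
print(best[0])
```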

A less popular approach is referred to as stack decoding, which avoids visiting the whole search space.13 In addition to optimizing the search space, calculating the Gaussian probabilities is usually time-consuming, especially for large vocabulary speech recognition. Techniques to accelerate the Gaussian computations are also widely used; these techniques mainly rely on Gaussian clustering, quantization and caching.14

3. Literature Review for Arabic ASR

The early efforts to develop Arabic ASR systems started with simple tasks such as digit recognition and small vocabularies of isolated words. Imai et al. in Ref. 15 presented a rule-based speaker-dependent system that uses speaker-dependent phonological rules to model pronunciation variability across speakers, with the objective of decreasing their recognition errors. Bahi and Sellami in Ref. 16 presented a system that combines the vector quantization technique and HMMs to recognize isolated Arabic words. Nofal et al. in Ref. 17 demonstrated an Arabic command-and-control speech recognition system. Elmisery et al. in Ref. 18 implemented a pattern matching algorithm based on HMMs using a Field Programmable Gate Array (FPGA) to recognize isolated Arabic words. Khasawneh et al. in Ref. 19 applied a polynomial classifier to isolated-word speaker-independent Arabic speech and showed that it provides better recognition performance and much faster response than a Dynamic Time Warping (DTW) recognizer. Bourouba et al. in Ref. 20 presented a hybrid HMM/Support Vector Machine (SVM) approach for recognition of isolated spoken Arabic words.
The beginning of the 2000s witnessed major advancements in the state of the art of Arabic ASR systems. This was mainly due to the availability of larger ASR resources, with support provided by DARPA projects such as EARS and its successor GALE, which targeted the development of effective, affordable and reusable Arabic speech recognition. One of the earliest efforts to develop large vocabulary Arabic ASR was by the BBN team for the application of information extraction from broadcast news (the BBN Tides-OnTap system in Ref. 21). The BBN system for Arabic ASR that was submitted for the EARS and GALE project evaluations included two stages. The first stage is Speaker Independent (SI) and the second stage is Speaker Adapted (SA), using a reference from the first stage. Each of these stages included three decoding passes. The first decoding pass is a forward pass that uses simple acoustic models, Phonetically Tied Mixture (PTM) models, and a bigram language model. The second pass is a backward pass that uses the output of the forward pass to guide a Viterbi beam search with more complex acoustic and language models: a state-clustered (using decision trees) within-word quinphone acoustic model (SCTM-NX) and an approximate trigram language model. During the backward pass, an N-best list is generated. This list is rescored using the SCTM-NX model and a 3-gram language model. This system was trained using 100 hrs of broadcast news recordings and 300 million words from newspapers and web sites. The system vocabulary was around 60k. Two types of acoustic models, grapheme based and phoneme based, were developed, and their evaluation results show that the phonetic system gives about a 13% reduction in WER compared to the grapheme system.22
One of the research groups that have contributed to advances in Arabic ASR systems is the Spoken Language Processing Group (TLP) at LIMSI CNRS. Their recognizer makes use of continuous-density tied-state left-to-right CD-HMMs with Gaussian mixture observation densities. Word recognition is performed in multiple passes. The first pass (less than 1xRT) is a cross-word trigram decoding with gender-specific sets of position-dependent triphones (around 5k tied states) and a trigram language model. The trigram lattices are expanded with a 4-gram language model. Then the posterior probabilities of the lattice edges are estimated using the forward-backward algorithm, and the 4-gram lattice is converted to a confusion network with posterior probabilities by iteratively merging lattice vertices and splitting lattice edges until a linear graph is obtained. These hypotheses are used to carry out unsupervised acoustic model adaptation for each segment cluster using the MLLR technique with one regression class. Then a second lattice is generated for each segment using a bigram LM and position-dependent tri-phones with 11500 tied states (32 Gaussians per state). The word graph generated in this second decoding pass is rescored after carrying out unsupervised MLLR acoustic model adaptation using a variable number of regression classes. This system was trained using 1200 hours of Arabic broadcast data and 1.1 billion text words, distributed by LDC for the GALE project,23 and used a vocabulary of 200k words with an average of 8.6 different pronunciations per word. The key contributions of that system are the automatic building of a very large vocalized vocabulary, the use of a language model that includes vocalized components, and the use of morphological decomposition to address the challenge of dealing with the huge lexical variety.
Another prominent large vocabulary Arabic ASR system is the one developed by the Speech, Vision and Robotics Group at Cambridge University. This system included a vocabulary of up to 350k words and was trained using 1024 hrs of speech data, consisting of 764 hrs of supervised data and 260 hrs of lightly supervised data. The gain from the unsupervised data was shown to be marginal and may even result in performance degradation. The system used state-clustered triphone models with approximately 7k distinct states and an average of 36 Gaussian components per state, together with an n-gram language model trained on 1 billion words of text data. It used three decoding stages. The first stage is a fast decoding run with Gender Independent (GI) models. The second stage uses Gender Dependent (GD) models adapted using LSLR, along with variance scaling, using the first-stage output as supervision. The second stage generates trigram lattices, which are expanded using a 4-gram language model and then rescored in the third stage using GD models adapted using lattice-MLLR, as discussed in Ref. 21. In that system it was shown that graphemic models perform at least as well as phonetic models for conversational data, with only very minor degradation on news data.
The IBM ViaVoice was one of the first commercial Arabic large vocabulary systems, developed for dictation applications.24 A more advanced system was developed by the speech recognition research group at IBM: an Arabic broadcast transcription system fielded for the GALE project. Key advances include improved discriminative training, the use of subspace Gaussian mixture models (SGMM) as shown in Ref. 25, neural network acoustic features as shown in Ref. 26, variable frame rate decoding as shown in Ref. 27, training data partitioning experiments, a class-based exponential LM, and NNLMs with syntactic features.28 This system was trained on 1800 hrs of transcribed Arabic broadcasts and text data of 1.6 billion words provided by the Linguistic Data Consortium (LDC).29 A language model of 7 million n-grams, pruned using entropy pruning as shown in Ref. 30, is used for the construction of static, finite-state decoding graphs; another, unpruned version of the LM, containing 883 million n-grams, is used for lattice rescoring. The system used a vocabulary of 795K words with more than 2 million pronunciations, and 6 decoding passes. The first pass used a speaker-independent grapheme-based acoustic model; the following 5 passes used speaker-adapted phoneme-based models. All models have penta-phone cross-word acoustic context. Another 3 rescoring passes using the different LMs produced different decoding hypotheses that were combined in a final combination pass.
Recently, the Multi-Genre Broadcast (MGB) competition, as shown in Ref. 31, has stimulated the research and development of Arabic speech recognition in the domain of broadcast program recognition. MGB is a controlled evaluation using 1,200 hours of audio with lightly supervised transcription. The Qatar Computing Research Institute (QCRI) speech transcription system for the 2016 dialectal Arabic Multi-Genre Broadcast (MGB-2) challenge, a combination of three purely sequence-trained recognition systems, achieved the lowest WER of 14.2% among the nine participating teams.2 Key features of this system are: purely sequence-trained acoustic models using the recently introduced Lattice-free Maximum Mutual Information (L-MMI) modeling framework as shown in Ref. 31; language model rescoring using four-gram and Recurrent Neural Network with Max-Entropy connections (RNNME) language models as shown in Ref. 32; and system combination using the Minimum Bayes Risk (MBR) decoding criterion over three acoustic models trained using a Time Delay Neural Network (TDNN) as shown in Ref. 31, a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) as shown in Ref. 33, and a Bi-directional LSTM. These results match the state-of-the-art performance of English ASR systems on similar domain data, which puts the Arabic language in the same stage as the tier-one languages.

4. Challenges for Arabic ASR Systems

The Arabic language poses three major challenges for developing ASR systems. The first is the constraint of having to use mostly non-diacritized texts as recognizer training material, which causes problems for both acoustic and language modeling. Training accurate acoustic models for the Arabic vowels without knowing their locations in the signal is difficult. Also, a non-diacritized Arabic word can have several senses, with the intended sense derived from the word context. Language models trained on such non-diacritized material may therefore be less predictive than those trained on diacritized texts.

The second challenge for Arabic is the existence of many different Arabic dialects (Egyptian, Levantine, Iraqi, Gulf, etc.) that are only spoken and not formally written. Dialectal variety is a problem primarily because of the current lack of training data for conversational Arabic. Whereas Modern Standard Arabic (MSA) data can readily be acquired from various media sources, there are only very few speech corpora of dialectal Arabic available.
The third challenge of Arabic is its morphological complexity which
is known to present serious problems for speech recognition, in particular
for language modeling. A high degree of affixation, derivation, etc.,
contributes to the explosion of different word forms, making it difficult if
not impossible to robustly estimate language model probabilities. Rich
morphology also leads to high out-of-vocabulary rates and larger search
spaces during decoding, thus slowing down the recognition process. In
the following sections, we review most of the proposed approaches to
overcome these challenges.

4.1. Using non-diacritized Arabic data

Several approaches have been proposed to overcome the lack of diacritized text. One simple approach is to build the acoustic models on grapheme units instead of phonemes, which are the natural units of speech. The term grapheme refers to the smallest meaningful contrastive unit in a writing system. In the grapheme acoustic model, each non-diacritized grapheme is considered an acoustic unit equivalent to a compound consonant-vowel phoneme pair.34 To compensate for the wide variance of these compound units in the acoustic space, a larger number of mixtures is used. Although this type of model eliminates the requirement of restoring the Arabic text diacritics, the use of compound acoustic units can reduce the accuracy of Arabic ASR systems compared with phoneme-based models.
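The grapheme approach makes the lexicon trivial, since the acoustic units are simply the letters of the written word, as in this sketch:

```python
# Grapheme acoustic units: each non-diacritized letter becomes one unit,
# so no diacritizer or phonetic lexicon is needed.
def grapheme_units(word):
    return list(word)

print(grapheme_units("كتاب"))  # ['ك', 'ت', 'ا', 'ب']: four acoustic units
```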
The alternative approach for dealing with non-diacritized text is to restore the missing diacritics. For this task, an automatic Arabic text diacritizer can be used.35 The state-of-the-art performance of such tools is a 4% word error rate for word-internal diacritization marks and a 10% word error rate for case-ending marks. This means more than 10% of the data will be restored with wrong diacritics, which reduces the efficiency of the trained acoustic models. To reduce the number of errors in restored diacritics, it was proposed to use the audio recordings of the text data, besides the linguistic information, to help select the correct word diacritics.11 In that approach, a forced alignment is performed between the audio signal and the reference text using a pronunciation dictionary that includes all the possible diacritization forms for each word. A morphological analyzer is used to generate these diacritization forms.36 For words for which the analyzer fails to find a possible diacritization form, which usually happens for named entities, a series of expert rules is used to derive their pronunciations.12 Finally, for the remaining words, for which all approaches fail to derive any diacritization form, it is possible to back off to their graphemic pronunciation and build a combined system.14
Although vowelized acoustic models provide better accuracy, in some cases, such as dialectal Arabic ASR, grapheme-based models are a more effective approach, since restoring diacritics for this type of data would require resources that do not exist, such as a morphological analyzer or expert diacritization rules. Also, with a large amount of training data, the performance of grapheme-based and phoneme-based systems becomes very close.37

4.2. Speech recognition for Arabic dialects

Whereas MSA data can readily be acquired from various media sources, only very limited speech corpora of dialectal Arabic are available. The construction of this type of corpus is even more challenging than for MSA. For a start, the manual annotation has no standard reference: the same word can be transcribed in several ways, such as "‫ ﺑﺸﻜﺮﻙ‬،‫ﺑﺎﺷﻜﺮﻙ‬،‫"ﺑﺄﺷﻜﺮﻙ‬. Some transcription guidelines for Egyptian and Levantine dialectal Arabic were proposed to reduce such differences.38 Diacritization of dialectal Arabic is more challenging than for MSA, since it requires a dialectal Arabic morphological analyzer to generate the different diacritization forms. Using context-based diacritization would also require a robust language model for dialectal
Arabic, which is not currently available. Dialectal Arabic diacritization using automatic alignment against the audio signal is also harder, due to the larger set of vowels.
To tackle the problem of data sparsity, a cross-lingual approach was
proposed to pool MSA and dialectal speech data to jointly train the
acoustic model.29 Acoustic differences between MSA and Arabic dialects
are smaller than the differences at the language level, and since only a
small amount of acoustic data is currently available for Arabic dialects,
acoustic models might benefit from a larger amount of similar data
that provides more training instances of context-dependent phones.
Moreover, the difference between dialectal and MSA speech is not
necessarily clear-cut; it is a continuum, with speakers varying between
the two ends of the continuum depending on the situational context.
Cross-dialectal data sharing may be helpful in modeling this type of
mixed speech. This approach is similar to sharing acoustic training data
across different languages to build a speech recognition system for a
target under-resourced language using several source languages with
sufficient acoustic data.45 This approach resulted in around 3% relative
reduction in WER for training Egyptian dialectal models by adding MSA
data.39
In another approach to cross-lingual training, it was proposed to modify the optimality criterion for training the Gaussian Mixture Model (GMM) to benefit from the similarity between phonemes in MSA and dialectal speech, which showed improvements in phone classification tasks.24 Also, model adaptation techniques like MLLR and Maximum A Posteriori (MAP) estimation were proposed to adapt existing phonemic MSA acoustic models with a small amount of dialectal Egyptian (ECA) speech data, which resulted in about a 12% relative reduction in WER.42 Acoustic model adaptation can perform better than data pooling when dialectal speech data are very limited compared to the existing MSA data, and adaptation may avoid the masking of dialectal acoustic features by the large amount of MSA data that occurs in the data pooling approach.
The large overlap between the phonetic units of most Arabic dialects and MSA allows the large resources of MSA to help in training the acoustic models. The challenge is harder for language modeling: the large differences between local Arabic dialects and MSA on the morphological, syntactic, and lexical levels make them behave like different languages. However, due to the scarcity of dialect-specific
linguistic resources, some techniques were proposed to make use of
MSA data to improve language modeling of dialectal Arabic. One approach explored mixing an Egyptian language model with an MSA model.43 Although the combined model provided a slight reduction in the perplexity of some held-out data, there was no visible effect on word error rate.
In another technique, it was proposed to combine models using constrained interpolation, whose purpose is to limit the degree by which the MSA model can affect the parameters of the Egyptian model, but this did not yield any improvement either. To overcome the genre difference between the colloquial Arabic corpus and the MSA corpus, which is mainly newswire data, it was proposed to select for model training those sentences in the MSA corpus that are closest in style to conversational speech. This approach did not have a positive effect. An analysis of these experiments showed that adding 300 million words of MSA data to the Egyptian CallHome colloquial data increases the percentage of trigrams in the Egyptian test set that are also found in the language model only from 24.5% to 25%. Performing a similar experiment in English, adding 227 million words of North American Business (NAB) text to the CallHome American English data increased the seen trigrams of the test set from 34.5% to 72%.44
In another approach, rather than simply adding selected text data from MSA, it was proposed to apply linguistic transformations to the MSA data before using it to train language models for dialectal Arabic. Several data transformations were proposed, such as morphological simplification (stemming), lexical transductions, and syntactic transformations. This technique managed to reduce the test perplexity by a factor of up to 82%, but still did not outperform the model built using only the dialectal data in speech recognition results.43
All of these efforts lead to the conclusion that using MSA data does not help improve language modeling for colloquial Arabic, and that the most effective approach is to train the colloquial Arabic language model from colloquial data. Fortunately, the recent surge of social networks has provided rich sources for collecting this type of Arabic data in large quantities, but the data needs extensive cleaning and normalization.

4.3. Inflection effect and the large vocabulary

Several approaches have been proposed to deal with the morphological complexity of the Arabic language when developing Arabic ASR systems. An effective approach is to build the ASR system using morphologically decomposed words. The Arabic word can be decomposed into its main morphological components, the prefix, the suffix and the stem, as shown in Fig. 6. Using this decomposition, the vocabulary size can be greatly reduced. As we see in Fig. 6, for a dataset of 120k words, the number of Arabic full-form words is 14k, while the number of stem units is only 6k, which is comparable to the number of stems for English data of the same size.

Fig. 6. Left: An example of Arabic word factorization. Right: Vocabulary growth for the
Arabic language.
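A naive sketch of such a decomposition is given below; the tiny affix lists are illustrative stand-ins for a real Arabic morphological analyzer:

```python
# Naive affix stripping; affix lists are tiny illustrative samples only.
PREFIXES = ["وال", "ال", "و", "ب", "ف"]   # longest-first matching
SUFFIXES = ["ها", "هم", "ة", "ات"]

def decompose(word):
    prefix = next((p for p in PREFIXES
                   if word.startswith(p) and len(word) > len(p) + 2), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES
                   if rest.endswith(s) and len(rest) > len(s) + 2), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

print(decompose("والكتاب"))  # ('وال', 'كتاب', ''): prefix + stem
```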

The main drawback of this approach is the short duration of the affixation units, which can be only two phones long, making them highly susceptible to insertion errors. To avoid these effects, some enhancements were proposed, such as keeping the most frequent words in full form without decomposition, and not decomposing the prefix "Al" for words starting with a solar consonant, since, due to assimilation with the following consonant, deletion of that prefix was one of the most frequent errors. This enhanced morphologically based LM provided some reduction in WER compared with a word-based LM.45 Rather than using linguistic knowledge to derive the morphological decomposition, an unsupervised technique based on the Minimum Description Length (MDL) principle was also proposed, to provide better coverage of Out-Of-Vocabulary (OOV) words.46

Another type of model is the Factored Language Model (FLM), in which words are viewed as vectors of K factors, so that $w_t \equiv f_t^{1:K}$. Factors represent morphological, syntactic, or semantic word information and can be, e.g., stems, POS tags, etc., in addition to the words themselves. Probabilistic LMs are then constructed over (sub)sets of factors. Using a trigram approximation, this can be expressed as:

$$p\big(f_1^{1:K}, f_2^{1:K}, \ldots, f_T^{1:K}\big) \approx \prod_{t=1}^{T} p\big(f_t^{1:K} \,\big|\, f_{t-1}^{1:K}, f_{t-2}^{1:K}\big) \tag{5}$$
Each word is dependent not only on a single stream of temporally preceding word variables, but also on additional parallel streams of features. Such a representation can be used to back off to factors when the word n-gram has not been observed in the training data, thus improving probability estimates. For instance, a word trigram may not have any counts in the training set, but its corresponding factor combinations (e.g. stems and other morphological tags) may have been observed, since they also occur in other words. This is achieved via a new generalized parallel backoff technique. During standard backoff, the most distant conditioning variable (in this case $w_{t-2}$) is dropped first, followed by the second most distant variable, and so on, until the unigram is reached. This can be visualized as a backoff path, as in Fig. 7(a). If additional conditioning variables are used which do not form a temporal sequence, it is not immediately obvious in which order they should be dropped. In this case, several backoff paths are possible, which can be summarized in a backoff graph, as in Fig. 7(b). Paths in this graph can be chosen in advance based on linguistic knowledge, or at run time based on statistical criteria such as counts in the training set.

Fig. 7. (a) Standard backoff path for a 4-gram language model over words; (b) backoff graph for a 4-gram over factors.
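The following sketch illustrates the idea of parallel backoff over factors; backoff weights and normalization are omitted for brevity, and the tables and factor choice are invented:

```python
# Generalized parallel backoff sketch: if the full factored context is
# unseen, drop one conditioning factor, choosing the better-supported path.
def flm_prob(w, prev_word, prev_stem, trigram, word_bg, stem_bg, unigram):
    if (w, prev_word, prev_stem) in trigram:
        return trigram[(w, prev_word, prev_stem)]
    # Two possible backoff paths: keep the previous word, or its stem.
    candidates = [word_bg.get((w, prev_word)), stem_bg.get((w, prev_stem))]
    candidates = [p for p in candidates if p is not None]
    return max(candidates) if candidates else unigram.get(w, 1e-7)
```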

FLMs have been implemented as an add-on to the widely used SRILM toolkit; further details can be found in Ref. 43. One difficulty in training FLMs is choosing the best combination of design choices, in particular the conditioning factors, the backoff path(s) and the smoothing options. Since the space of different combinations is too large to be searched exhaustively, search algorithms such as Genetic Algorithms (GAs) were proposed to optimize the choice of conditioning factors.47
Another effective approach to dealing with the large vocabulary of the Arabic language is to compile the whole search space into a finite state network that is optimized to the most compact size. The huge size of the search networks of Large Vocabulary Automatic Speech Recognition (LVASR) systems makes it impractical, or even impossible, to expand the whole search network prior to decoding, due to memory limitations. The alternative is to expand the search network on the fly during decoding. But as the vocabulary grows, in conjunction with the use of complex Knowledge Sources (KS) such as context-dependent tri-phone models and cross-word models, the dynamic expansion of the search network becomes very slow and turns out to be impractical. A research team at AT&T, as shown in Refs. 48, 49, managed to compile the search network of LVASR systems into a size compact enough to fit within memory limitations, while also providing fast decoding. Their approach relied on eliminating the redundancy in the search network that results from the approximations used in the integrated networks, such as the state tying of the acoustic model units and the backoff structure of the language model. Consider a practical example of a 64k-word trigram: among the 4 billion possible word bigrams, only 5 to 15 million will be included in the model and, for each of these "seen" word-pair histories, the average number of trigrams will be between 2 and 5. Such a LM would have about 5 to 15 million states and 15 to 90 million arcs, requiring between 100 and 600 MB of storage. This means a reduction by seven orders of magnitude with respect to a plain 64k trigram. Concerning cross-word tri-phones, the number of distinct generalized models is typically one order of magnitude smaller than the full inventory of position-dependent contexts. Using finite-state-based models, some Arabic ASR systems managed to use a vocabulary size
22 S. M. Abdou and A. M. Moussa

larger than one million words with processing time close to real time
performance as shown in Refs. 28, 50 but these was with the price of
large model sizes of several Giga bytes.
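
As a rough sanity check, the back-of-envelope arithmetic behind these figures can be reproduced directly; the counts below are the illustrative numbers quoted above, not measurements of any particular system:

# Size estimate for a pruned 64k-word trigram LM, using the counts quoted above.
vocab = 64_000
full_trigrams = vocab ** 3                # ~2.6e14 parameters in a plain trigram
seen_bigrams = 10_000_000                 # 5 to 15 million "seen" word-pair histories
trigrams_per_history = 3.5                # average between 2 and 5
arcs = int(seen_bigrams * (1 + trigrams_per_history))  # roughly 45 million arcs
bytes_per_arc = 8                         # assumed compact arc encoding
print(f"plain trigram parameters: {full_trigrams:.1e}")
print(f"pruned network arcs:      {arcs:.1e}")
print(f"approx. storage:          {arcs * bytes_per_arc / 1e6:.0f} MB")
# The ratio below is close to the seven orders of magnitude quoted above.
print(f"reduction factor:         {full_trigrams / arcs:.1e}")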

5. State of the Art Arabic ASR Performance

What is the current state of the art in speech recognition? This is a
complex question, because a system's accuracy depends on the
conditions under which it is evaluated. Under sufficiently narrow
conditions almost any system can attain human-like accuracy, but it is
much harder to achieve good accuracy under general conditions. The
conditions of evaluation, and hence the accuracy of any system, can vary
along the following dimensions:

 Vocabulary size and confusability: As a general rule, it is easy to
discriminate among a small set of words, but error rates naturally
increase as the vocabulary size grows.
 Speaker dependence vs. independence: By definition, a speaker-dependent
system is intended for use by a single speaker, while a
speaker-independent system is intended for use by any speaker.
Speaker independence is difficult to achieve because a system's
parameters become tuned to the speaker(s) it was trained on, and
these parameters tend to be highly speaker-specific.
 Task and language constraints: Even with a fixed vocabulary,
performance will vary with the nature of constraints on the word
sequences that are allowed during recognition. Constraints are often
represented by a grammar, which ideally filters out unreasonable
sentences so that the speech recognizer evaluates only plausible
sentences.
 Read vs. spontaneous speech: Systems can be evaluated on speech
that is either read from prepared scripts, or speech that is uttered
spontaneously. Spontaneous speech is vastly more difficult, because it
tends to be peppered with disfluencies like “uh” and “um”, false
starts, incomplete sentences, stuttering, coughing, and laughter; and
moreover, the vocabulary is essentially unlimited, so the system must
be able to deal intelligently with unknown words.
 Adverse conditions: A system's performance can also be degraded by
a range of adverse conditions.

In order to evaluate and compare different systems under well-defined
conditions, a number of standardized databases have been created with
particular characteristics. Such evaluations were mostly based on the
measurement of word (and sentence) error rate as the performance figure
of merit of the recognition system. Furthermore, these evaluations
were conducted systematically over carefully designed tasks
with progressive degrees of difficulty, ranging from the recognition of
continuous speech spoken with stylized grammatical structure to
transcriptions of live (off-the-air) news broadcasts and conversational
speech.
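
The word error rate used in such evaluations is the edit distance between the recognizer output and a reference transcript, normalized by the reference length. A minimal dynamic-programming sketch, a standard textbook implementation rather than code from any evaluation described here:

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ~ 0.167

Over a whole test set, the error counts are typically accumulated across all utterances before normalizing by the total number of reference words.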
There have been several attempts to perform dialectal speech recognition
for Egyptian, Levantine and Iraqi Arabic, but the error rates are relatively
high. On the other hand, MSA has sufficient resources and accordingly
reasonable performance. Table 1 below shows the performance of different
systems for broadcast news transcription in the GALE project and for some
dialectal tasks.

Table 1. State of the art performance for Arabic ASR systems.

Genre            Models                                   Vocabulary size   Acoustic training                   LM size       WER
MSA              unvowelized                              589K              135 hrs + 1000 hrs (unsupervised)   56M 4-gram    17.0%
MSA              vowelized                                589K              135 hrs + 1000 hrs (unsupervised)   56M 4-gram    16.9%
MSA              vowelized + pronunciation probabilities  589K              135 hrs + 1000 hrs (unsupervised)   56M 4-gram    14.0%
MSA + Dialectal  unvowelized                              900K              1200 hrs (NN)                       4-gram        14.2%
Iraqi            unvowelized                              90K               200 hrs                             2M 3-gram     36.0%
Levantine        unvowelized                              64K               100 hrs                             15M 3-gram    39.0%
Egyptian         vowelized                                50K               20 hrs                              150K bigram   56.1%
Table 1 shows the approximate state-of-the-art performance for different
Arabic speech recognition tasks. The performance is closely related to the
available resources. For MSA, the available resources of vowelized
training hours and gigawords of LM training text are close to those of
Latin-script languages, so the state-of-the-art performance for MSA,
around 15% WER, is quite comparable with the 10% WER achieved for
the similar task of Broadcast News ASR for English. We should keep in
mind, however, that the complexity of MSA ASR is much higher, with a
vocabulary size of 589k words compared with the 210k words of the
English Broadcast News vocabulary. The performance of dialectal Arabic
conversational ASR, as shown for the Iraqi, Egyptian and Levantine
tasks, is comparable with equivalent conversational English ASR, with
average WERs in the range of 30%–40%. But we should keep in mind
that dialectal Arabic is much more challenging than conversational
English: the LM training data is very limited, and many required Natural
Language Processing (NLP) tools, such as morphological analyzers,
diacritizers and text normalizers, still need to be developed.

6. Conclusions

In this chapter we reviewed the main building components of ASR
systems and how they can be developed for the Arabic language. We also
reviewed the major challenges in developing Arabic ASR systems,
namely the dominance of non-diacritized text material, the several
dialects, and the morphological complexity. The main efforts and
proposed approaches for handling these challenges were introduced.
Finally, we presented the state-of-the-art performance of Arabic ASR
systems, which shows competitive results even when compared with the
more advanced English ASR systems.

References
[1] J. Billa, et al. Audio indexing of broadcast news. Proc. of the International
Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. I-3-I-5
(2002).
[2] S. Khurana and A. Ali, QCRI advanced transcription system (QATS) for the Arabic
multi-dialect broadcast media recognition: MGB-2 challenge, IEEE Spoken
Language Technology Workshop, (SLT), pp. 292–298 (2016).
[3] V. Peddinti, D. Povey, and S. Khudanpur, A time delay neural network architecture
for efficient modeling of long temporal contexts, Proc. of the Interspeech Conf., pp.
3214–3218 (2015).
[4] L. Rabiner, A tutorial on hidden Markov models and selected applications in
speech recognition, Proc. IEEE 77(2), pp. 257–286 (1989).
[5] T. Shinozaki, HMM state clustering based on efficient cross-validation, Proc. Int.
Conf. Acoustics Speech and Signal Processing, ICASSP, pp. 1157–1160 (2006).
[6] P. M. Baggenstoss, A modified Baum-Welch algorithm for hidden Markov models
with multiple observation spaces, IEEE Transactions on Speech and Audio
Processing, 9(4), pp. 411–416 (2001).
[7] M. Afify, Extended baum-welch reestimation of Gaussian mixture models based on
reverse Jensen inequality, Proc. of the 9th European Conference on Speech
Communication and Technology, Interspeech, pp. 1113–1116 (2005).
[8] Y. A. Alotaibi, M. Alghamdi, F. Alotaiby, Speech Recognition System of Arabic
Digits based on A Telephony Arabic Corpus, Proc. of the International Conference
on Image and Signal Processing, ICISP, pp. 245–248 (2010).
[9] M. Alghamdi, Y. O. El Hadj and M. Alkanhal, A Manual System to Segment and
Transcribe Arabic Speech, Proc. of the International Conference on Signal
Processing and Communications, ICSPC, pp. 233–236 (2007).
[10] Y. A. Alotaibi, Comparative Study of ANN and HMM to Arabic Digits
Recognition Systems, Journal of King Abdulaziz University, JKAU, 19(1), pp. 43–
60 (2008).
[11] J. Ma, S. Matsoukas, O. Kimball and R. Schwartz, Unsupervised training on large
amount of broadcast news data, Proc. of the International Conference on Acoustics,
Speech and Signal Processing, ICASSP, pp. 1056–1059 (2006).
[12] A. Messaoudi, J.-L. Gauvain and L. Lamel, Arabic transcription using a one
million word vocalized vocabulary, Proc. of the International Conference on
Acoustics, Speech and Signal Processing, ICASSP, pp. I-1093–I-1096 (2006).
[13] M. Gales, et al. Progress in the CU-HTK broadcast news transcription system,
IEEE Transactions Speech and Audio Processing, 14(5), pp. 1513–1525 (2006).
[14] H. Soltau, G. Saon, B. Kingsbury, H-K. Kuo, L. Mangu, D. Povey and G. Zweig.
The IBM 2006 GALE Arabic ASR system, Proc. of the International Conference
on Acoustics, Speech and Signal Processing, ICASSP, pp. IV-349–IV-352 (2007).
[15] T. Imai, A. Ando and E. Miyasaka, A new method for automatic generation of
speaker-dependent phonological rules, International Conference of Acoustics,
Speech and Signal Processing, ICASSP, vol. 1, pp. 864–867 (1995).
[16] H. Bahi and M. Sellami, A hybrid approach for Arabic speech recognition.
ACS/IEEE international conference on computer systems and applications, pp. 14–
18 (2003).
[17] M. Nofal, E. Abdel Reheem et al., The development of acoustic models for
command and control Arabic speech recognition system. Proc. of International
Conference on Electrical, Electronic and Computer engineering, ICEEC’04, pp.
1023–1026 (2004).
[18] F. A. Elmisery, A. H. Khalil, et al., An FPGA-based HMM for a discrete Arabic
speech recognition system, Proc. of the 15th International Conference on
Microelectronics, ICM, pp. 205–209 (2003).
[19] M. Khasawneh, K. Assaleh, W. Sweidan and M. Haddad, The application of
polynomial discriminant function classifiers to isolated Arabic speech recognition,
Proc. of IEEE International Joint Conference on Neural Networks, vol. 4, pp.
3077–3081 (2004).
[20] H. Bourouba, R. Djemili, M. Bedda and C. Snani, New Hybrid System (Supervised
Classifier/HMM) for Isolated Arabic Speech Recognition, Proc. of the
International Conference on Information & Communication Technologies, pp.
1264–1269 (2006).
[21] J. Billa, et al., Arabic speech and text in TIDES OnTap, Proc. of the International
Conference on Human Language Technology Research, HLT, pp. 1024–1029 (2002).
[22] M. Afify, L. Nguyen, B. Xiang, S. Abdou and J. Makhoul, Recent progress in
Arabic broadcast news transcription at BBN, Proc. of the Interspeech Conf.,
pp. 1637–1640 (2005).
[23] http://projects.ldc.upenn.edu/gale/index.html, page referenced at April 2017.
[24] https://www-01.ibm.com/software/pervasive/viavoice.html
[25] D. Povey, et al., Subspace Gaussian mixture models for speech recognition, Proc.
of the International Conference on Acoustics, Speech and Signal Processing,
ICASSP, pp. 4330–4333 (2010).
[26] H. Hermansky, D. P. W. Ellis and S. Sharma, Tandem connectionist feature
extraction for conventional HMM systems, Proc. of the International Conference
on Acoustics, Speech and Signal Processing, ICASSP, pp. 1635–1638 (2000).
[27] S. M. Chu and D. Povey, Speaking rate adaptation using continuous frame rate
normalization, Proc. of the International Conference on Acoustics, Speech and
Signal Processing, ICASSP, pp. 4306–4309 (2010).
[28] H.-K. J. Kuo, L. Mangu, A. Emami, I. Zitouni and Y.-S. Lee, Syntactic features
for Arabic speech recognition, IEEE Workshop on Automatic Speech Recognition
& Understanding, ASRU, pp. 327–332 (2009).
[29] https://catalog.ldc.upenn.edu/search
[30] A. Stolcke, Entropy-based pruning of backoff language models, Proc. of DARPA
Broadcast News Transcription and Understanding Workshop, pp. 270–274 (1998).
[31] http://www.mgb-challenge.org/arabic.html, page referenced at April 2017.
[32] D. Povey, et al., Purely sequence-trained neural networks for ASR based on lattice-
free MMI, Proc. of the Interspeech Conf., pp. 2751–2755 (2016).
[33] T. Mikolov, et al., RNNLM – Recurrent Neural Network Language Modeling
Toolkit, IEEE Workshop on Automatic Speech Recognition & Understanding,
ASRU, pp. 125–128 (2011).
[34] H. Sak, A. W. Senior and F. Beaufays, Long short-term memory recurrent neural
network architectures for large scale acoustic modeling, Proc. of the Interspeech
Conf., pp. 338–342 (2014).
[35] J. Billa, M. Noamany, A. Srivastava, D. Liu, R. Stone, J. Xu, J. Makhoul and F.
Kubala, Audio Indexing of Arabic Broadcast News, Proc. of the International
Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. I-5–I-8
(2002).
[36] M. Rashwan, M. Al-Badrashiny, M. Attia, S. Abdou and A. Rafea, A Stochastic
Arabic Diacritizer Based on a Hybrid of Factorized and Un-factorized Textual
Features, IEEE Transactions on Speech and Audio Processing, 19(1), pp. 166–175
(2011).
[37] T. Buckwalter, Arabic Morphology Analysis. A tool in LDC catalog
https://catalog.ldc.upenn.edu/LDC2004L02 (2004).
[38] G. Saon, H. Soltau, et al., The IBM 2008 GALE Arabic speech transcription
system. Proc. of the International Conference on Acoustics, Speech and Signal
Processing, ICASSP, pp. 4378–4381 (2010).
[39] N. Habash, M. Diab and O. Rambow, Conventional Orthography for Dialectal
Arabic (CODA): Principles and Guidelines – Egyptian Arabic, Version 0.7,
Columbia University Academic Commons, http://dx.doi.org/10.7916/D83X8562
(2012).
[40] K. Kirchhoff and D. Vergyri, Cross-dialectal data sharing for acoustic modeling in
Arabic speech recognition. Speech Communication, 46(1), pp. 37–51 (2005).
[41] T. Schultz and A. Waibel, Language independent and language adaptive acoustic
modeling for speech recognition, Speech Communication, 35(1-2), pp. 31–51
(2001).
[42] P.-S. Huang. and M. Hasegawa-Johnson, Cross-dialectal data transferring for
Gaussian mixture model training in Arabic speech recognition. International
Conference on Arabic Language Processing, vol. 1, p. 1 (2012).
[43] M. Elmahdy, R. Gruhn and W. Minker, Novel Techniques for Dialectal Arabic
Speech Recognition, Springer (2012).
[44] K. Kirchhoff, et al. Novel Speech Recognition Models for Arabic. Johns Hopkins
University Summer Research Workshop Final Report (2002).
[45] A. Rozovskaya, R. Sproat and E. Benmamoun, Challenges in Processing Colloquial
Arabic: The challenge of Arabic for NLP/MT, International Conference of the
British Computer Society, pp. 4–14 (2006).
[46] L. Lamel, A. Messaoudi and J. Gauvain, Investigating morphological
decomposition for transcription of Arabic broadcast news and broadcast
conversation data, Proc. of the Interspeech Conf., vol. 1, pp. 1429–1432 (2008).
[47] M. Creutz, et al., Morph-based speech recognition and modeling of out-of-
vocabulary words across languages. ACM Transactions on Speech and Language
Processing, 5(1), pp. 1–29 (2007).
[48] J. Bilmes and K. Kirchhoff, Factored language models and generalized parallel
backoff, Proc. Human Language Technology Conf. of the North American Chapter
of the ACL, vol. 2, pp. 4–6 (2003).
[49] M. Mohri, M. Riley, D. Hindle, A. Ljolje and F. Pereira, Full Expansion of
Context-Dependent Networks in Large Vocabulary Speech Recognition,
International Conference of Acoustics, Speech and Signal Processing, pp. 665–668
(1998).
[50] M. Mohri and M. Riley, Network Optimizations for Large-Vocabulary Speech
Recognition, Speech Communication Journal, 28(1), pp. 1–12 (1999).

Chapter 2

Introduction to Arabic Computational Linguistics

Mohsen Rashwan
Electronics and Electrical Communications Department,
Faculty of Engineering, Cairo University
Giza, Egypt
mrashwan@RDI-Eg.com

In this chapter an introduction is given to Human Language
Technologies (HLT). Over 20 technologies are concisely described,
covering natural language processing (such as information retrieval,
machine translation and text mining), speech processing (including
speech recognition and text to speech) and optical character recognition.
The challenges facing this important area of research are underlined.
Arabic HLT faces more challenges than, for example, English HLT,
due to the features of the Arabic language. The last section of the
chapter provides a useful reference to most of the organizations,
research centers and companies working on Arabic human language
technologies.

1. Introduction

This chapter is concerned with Arabic Computational Linguistics (CL)
and its related concepts, technologies and tools for automated
processing. CL is a relatively modern science that appeared early in the
second half of the twentieth century as an interdisciplinary science
that depends on the computer for studying human languages and
understanding their nature.
This science is also referred to by other titles, the most important of
which are Natural Language Processing (NLP) and Human Language
Technology (HLT).
Computational linguistics is based on three main themes:

 Text Processing: including machine translation, automatic
summarization, text mining, etc.
 Speech Processing: including Automatic Speech Recognition (ASR),
Text to Speech (TTS), etc.
 Image Processing: Optical Character Recognition (OCR) with all its
variants.

2. Layers of Linguistic Analysis

Researchers in CL like to organize the work in different layers, as shown
in Fig. 1 below.

Fig. 1. Layers of linguistic analysis.

The higher layers of language analysis depend on the lower ones.
However, these layers overlap to some extent. We explain the nature of
each layer as follows:

2.1. Phonological analysis


In this stage, the way of uttering the word is decided, taking into account
letters that are not pronounced as they are written (such as elided letters
and places where two consonants meet). In Arabic, short vowels are
written as diacritics on the letters, but in most cases these diacritics are
omitted. This increases the challenge of handling this first layer of the
ladder.
Introduction to Arabic Computational Linguistics 31

2.2. Morphological analysis


Arabic is a derivational language. In this stage the word is analyzed into
its basic elements, either as (prefix, root, pattern and suffix) or as
(prefix, stem and suffix). As in many natural languages, an Arabic word
can have more than one analysis, each carrying its own Part Of Speech
(POS). This ambiguity is resolved through rules or statistical analysis.
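
As a toy illustration of the (prefix, stem, suffix) view, the sketch below enumerates the candidate segmentations of a word by stripping a few common Arabic affixes; the affix lists are tiny illustrative samples, not a real analyzer's inventory:

# Minimal (prefix, stem, suffix) segmentation sketch.
# The affix lists are illustrative samples only; real analyzers such as
# Buckwalter's use thousands of entries plus compatibility tables.
PREFIXES = ["", "ال", "و", "وال", "ب", "بال"]
SUFFIXES = ["", "ة", "ها", "ون", "ات"]

def segmentations(word):
    """Yield every (prefix, stem, suffix) split consistent with the affix lists."""
    for pre in PREFIXES:
        if not word.startswith(pre):
            continue
        for suf in SUFFIXES:
            if not word.endswith(suf):
                continue
            stem = word[len(pre):len(word) - len(suf)] if suf else word[len(pre):]
            if stem:  # require a non-empty stem
                yield pre, stem, suf

for analysis in segmentations("والكتاب"):   # "and the book"
    print(analysis)

Each segmentation corresponds to one candidate morphological analysis; disambiguating among them is exactly the rule-based or statistical step described above.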

2.3. Syntactic analysis


The function of a word in a sentence is determined by its syntactic
position, which helps in understanding the meaning (semantics). The
Arabic language is very flexible regarding word order, which makes
syntactic analysis very challenging. Arabic linguists say that syntax is a
branch of semantics; this means that you cannot make a correct syntactic
analysis unless you understand the meaning.
On the other hand, the form of an Arabic word is not indicative of
its POS. For example, most adjectives in the English language have a
different form from adverbs; this is not true in Arabic, where we
differentiate between them through semantics only.

2.4. Semantic analysis


The semantic analysis has several stages, the lowest of which is
determining the meaning of a word in context, see Refs. 1–3.
It should be noted that the semantic analysis has many sub-stages,
such as:
 Word Sense Disambiguation (WSD): for example, the Arabic word for
"eye" can also mean "a well" or "a spy", etc.
 Anaphora resolution: for example, "the girl loves her sister very
much, so when she meets her, she hurries up to her". The difficulty is
to find out who hurried up to whom: the girl or her sister?
 Mention analysis: for example, "Hisham meets Mohamed; he said to
him: 'our appointment is tomorrow'; he replied: 'no, it is the day
after tomorrow'". Who said what?
 Rhetorical disambiguation: for example, "I saw lions in the battle",
where the intended meaning is "I saw brave soldiers".
 Subject separation: in many cases an article tackles more than one
subject and does not necessarily have sub-headings that separate the
partial subjects.
There are many other issues in semantic analysis that need to be
seriously tackled before the computer can understand a given text. Some
of these subjects have been tackled, but with accuracies far lower than
desired.
As for the overlap between these levels, we can illustrate it through
the following points:
 The phonological analysis level overlaps with the morphological
analysis level: people usually omit the diacritics (short vowels),
which increases the morphological challenge.
 In Arabic, the semantic and syntactic analyses are closely related,
see Ref. 4.

3. Challenges Facing Human Language Technologies

Language is a great trust from God; the human capacity for language
is inimitable. If you sit chatting with a native speaker for long hours,
you will hardly encounter, among the thousands of words flowing from
his mouth, a word that you do not understand. The Word Error Rate
(WER) in this case is less than 0.1%, while the WER of the best
spontaneous ASR system is around 10%, i.e. 100 times the WER of the
human auditory system.

4. Challenges Facing the Arabic Language Processing

The Arabic language has many distinguishing features. These features
are elements of its strength, but at the same time they pose challenges
for computation.
We state some of the challenges faced when subjecting Arabic to
computational processing in the following points:
4.1. Arabic script


 Arabic script is cursive: letters within a word are generally connected,
though some letters do not accept connection, leaving only a relatively
short gap between them and their neighbors in the same word. In fact,
even separating words from each other is not a simple issue, which
adds a real challenge to Arabic optical character recognition.
 Named entities in Arabic are not distinguished by capital letters as in
Latin languages. For example, the word "Cairo" in Arabic has two
meanings, only one of which is a named entity. The absence of
capitalization makes good named entity recognition more challenging.
 The absence of diacritics on the body or the end of the word:
diacritics represent the short vowels of the Arabic word, and Arabs
usually do not add them in their writing. Although this eases Arabic
writing, it adds more challenge to the morphological disambiguation
of the Arabic word.

4.2. Common mistakes


In Arabic, as in many other languages, there are many common writing
mistakes that greatly increase the ambiguity facing CL for Arabic. These
common mistakes include:
 Glottal stop (Hamza) errors, especially on the initial Alef (ا، أ، إ);
the proportion of such errors among people is high.
 Haa and Taa Marbuta: for example, "Cairo" (القاهرة) is frequently
written القاهره, and vice versa, ضربه ("beat") may be written ضربة.
 Yaa and Alef Maksoura: many people are accustomed to not writing
the dots of a final Yaa, which leaves two different words, like على
("on") and علي ("Ali"), that need to be distinguished before many
NLP processing steps.
 Many people who are used to writing English find it difficult to use
the Arabic keyboard, so much Arabic writing on social media uses
Latin characters. For example: "Dadi ana bahebk a'wy", where the
writer means "Dad, I love you so much".
 Due to the neglect of the Arabic language at university level in most
Arab universities, many graduates mix Modern Standard Arabic
(MSA) and dialectal language in their writing. So if we want to
analyze people's opinions on social media, we have to translate their
posts as they are into Modern Standard Arabic.
 Common spelling errors, such as writing شيء ("something") as شئ
or شيئ.
 Other spelling mistakes substitute phonetically close letters, as when
one writes مظاهر ("appearances") as مضاهر (a common mistake in
the Arabian Peninsula and the Arab Maghreb), or وسيط ("mediator")
as وصيط, etc. This type of error occurs in all human languages and is
frequent among learners of Arabic as a second language.

4.3. Morphological structure for the Arabic word


The Arabic word has a deep compositional structure, which makes
Arabic one of the morphologically rich languages. What concerns us
here is that the number of distinct Arabic word forms can run into the
millions, even though these words are built from a very limited number
of lexemes: about 5,000 practically used roots, about 100 patterns
(excluding variants for special cases), about 300 prefixes and about
550 suffixes, for a total inventory of no more than about 6,000 lexemes.
This is a very big advantage that many NLP technologies can exploit.
At the same time, as a morphologically rich language Arabic uses a
very large number of word forms in everyday life. For example, in the
business domain, covering about 99% of the words that people need
requires over 600,000 Arabic word forms, while only 64,000 words are
needed to reach the same coverage in English. Of course, this adds
another dimension of challenge to Arabic NLP.
4.4. Syntax of the Arabic sentence


The Arabic language is very flexible in word order. But what is the
impact on computational linguistics? It increases the difficulty of
syntactic analysis for Arabic.
 Some words may be entirely absent yet implicitly understood: for
example, when talking about "Zaid", the sentence "he gets into the
garden" is written in Arabic simply as "gets into the garden". Where
is the subject who got into the garden? It is an implicit pronoun
meaning "he". This phenomenon exists in other languages as well,
but it increases the challenge for computational linguistics.
 The verb may be entirely missing from the Arabic sentence, in what
is called a nominal sentence, while the English sentence must have a
verb, using an auxiliary verb (the verb "to be") if necessary. For
example, in English one says "The weather is beautiful"; the Arabic
equivalent reads, word for word, "The weather beautiful", without
the word "is". This is fantastic in that it gives the human being a
more flexible use of the language, but at the same time it raises the
challenge of resolving ambiguity a step higher.
 One of the practical challenges we face in processing Arabic
sentences is that a writer can join two sentences with the letter
"waw و" (and), as it is easier to use than a comma or any other
punctuation mark; therefore, the average length of an Arabic sentence
is much larger than that of an English one, even though the total
number of words in a given Arabic text is usually less than in a
corresponding English text with the same meaning (as known through
translation). This adds more challenge to the syntactic analysis of the
Arabic sentence. For example, if a technology reaches an average
accuracy of 90% in processing each word individually, the expected
accuracy for a whole sentence will be as follows:a
 Two words: sentence accuracy = (word accuracy)^2 = (90%)^2 = 81%.
 5 words: sentence accuracy = (word accuracy)^5 = (90%)^5 = 59%.
 100 words: sentence accuracy = (word accuracy)^100 = (90%)^100
≈ 0.0027%, almost zero; this means that the longer the sentence, the
more difficult it is to process and understand.

aThe sentence is counted correct only if all its words are correct.
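
The compounding effect takes one line of arithmetic to verify; the snippet below simply reproduces the numbers above:

# A sentence is counted correct only if every word is correct,
# so sentence accuracy = (word accuracy)^n under this assumption.
word_accuracy = 0.90
for n in (2, 5, 100):
    print(f"{n:3d} words: {word_accuracy ** n:.4%}")
# 2 words: 81%, 5 words: ~59%, 100 words: ~0.0027%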

5. Defining the Human Languages Technologies

The progress in information technology witnessed by the world in
recent decades has had a significant impact on developing tools for
processing natural languages and improving their applications, which
are increasing day after day. These technologies also vary according to
the diversity of work environments, see Ref. 5. We can divide human
language technologies into sections as shown in Fig. 2.
A brief description of many of these technologies follows:

5.1. Texts search (search engines)


Search engines over texts or documents are among the most commonly
used technologies for extracting the information a user requires. The
most widely used engines at the moment are those that search for
documents containing specific keywords. Search engines can be divided
into several types, including:
 Search engines for personal documents on the personal computer.
 Search engines for enterprise documents (Enterprise Search).
 Web search engines (searching for documents online).
Search engines can also be divided, in terms of the type of texts, into:
 General search.
 Search in a specialized field, such as medicine, law, etc.
It should be noted that search in specialized areas needs so-called
semantic search or ontology search, which helps deliver the high
accuracy required in such cases. For example, if we search for the word
"monsters", a keyword engine will not bring back documents that talk
about lions or tigers, etc. This is a shortcoming of the search unless we
support the search engine with semantic information for each word.
Semantic search is still not sufficient for the Arabic language.

Fig. 2. Human language technologies.


Recently, search has also extended beyond text into various other media,
such as (see Refs. 1, 6):
 Search in audio files.
 Search in photos.
 Search in videos.

5.2. Machine translation


Machine translation is considered one of the most important
technologies, and its value has increased with the existence of the
Internet. It is needed to ease communication among people of different
languages and to take advantage of the immense treasures of knowledge
in these languages. Machine translation has several schools, including:
 The school of translation using rules and bilingual dictionaries.
 The school relying on bilingual corpora, as such corpora carry enough
information to train mathematical models. There are two approaches
within this school:
 Example-based translation, used when the bilingual corpus is
small.
 Statistical Machine Translation, which needs a very large corpus
of up to several million sentences to give tangible results; it has
recently become the most common approach among workers in
the field of translation.
 The school of translation via an intermediate language, which has
been developed and has some considerable accomplishments.

Scientists have used several ways to measure the quality of
machine translation, the most famous of which is the BLEU score.
Human translation can score above 80% on this measure, while the best
machine translation engines from Arabic into English record around
50%, see Refs. 7–8. Machine translation engines allow translation from
and into several languages; for example, Google's machine translation
engine allows translation among more than sixty languages.
5.3. Question answering


The simplest way of acquiring knowledge is to ask a question in a
human language. The system should then analyze the question,
understand its meaning, and search for answers among the documents
available on the Internet or in the databases of institutions. It then
extracts the information in a way adequate for answering the question,
and finally phrases this answer in the asker's human language.
This method is complex and needs a number of techniques to get the
answer to the appropriate degree of accuracy. Like many human
language technologies, this complex technology still needs more effort,
especially for the Arabic language.

5.4. Automated essay scoring


Due to the difficulty of assessing exams and assignments in the
various stages of education, new types of questions have appeared with
which the computer can easily deal. These questions have specific
answers (such as multiple choice, matching between given statements,
filling in blanks, and so on). It is easy to ask the computer to evaluate
these types of questions after providing it with the correct answers for
a list of questions.
However, these types of questions do not test the expression skills of
students; therefore, there was a need to develop technology that can
deal with free-form answers. Several international examinations that
utilize this technology have already appeared.
The idea of this technology is based on what we call text similarity
analysis: given two statements or paragraphs, the computer is asked to
measure the closeness between them. So if the first text is the model
answer from the teacher and the second text is the answer of the
student, then we can get a score of how close the student's answer is to
the teacher's.
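
A minimal bag-of-words cosine similarity illustrates the simplest form of such scoring; real essay scorers add synonym handling, morphological analysis and much more (the example answers below are invented):

import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * \
           math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

model_answer = "photosynthesis converts light energy into chemical energy"
student_answer = "plants convert light energy to chemical energy"
print(f"score: {cosine_similarity(model_answer, student_answer):.2f}")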
5.5. Automatic text summarization


With the vast amount of books and research papers, there is a real need
for automatic summarization, see Fig. 3. Summarization has two main
families of algorithms:
 Extraction: selecting the most important sentences from the
document.
 Abstraction: drafting a new, brief text expressing the same concepts.

The first approach is the most common and widely used, while the
successes of the second are limited to specific tasks; a toy extractive
sketch is given after Fig. 3. With advances in the modeling and
understanding of natural language, the abstraction approach will
increasingly deliver more convincing summaries.

Fig. 3. Automatic text summarization.
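
As a toy illustration of the extractive approach, the following sketch scores sentences by the average frequency of their words in the document and keeps the top ones; this is the classic frequency heuristic, not any specific system mentioned here:

from collections import Counter

def extractive_summary(text: str, n_sentences: int = 2) -> str:
    """Pick the n sentences whose words are most frequent in the document."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(w.lower() for s in sentences for w in s.split())
    # Score each sentence by average word frequency (length-normalized).
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w.lower()] for w in s.split()) / len(s.split()),
                    reverse=True)
    keep = set(scored[:n_sentences])
    # Emit kept sentences in their original order.
    return ". ".join(s for s in sentences if s in keep) + "."

doc = ("Arabic is a Semitic language. Arabic morphology is rich. "
       "The weather was nice yesterday. Rich morphology challenges Arabic NLP.")
print(extractive_summary(doc))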

5.6. Document classification and clustering


When using search engines to gain access to important documents, the
user often needs documents similar to the one selected. This can be
done in two ways:
 Document Classification: where there is a pre-defined division of
documents (for example, into politics, economy, etc.), see Fig. 4.
 Document Clustering: where there is no prior division of documents.
This approach is used when we have a set of documents and want to
group the similar ones into clusters.

The goal in both cases is to get the similar documents. The documents
are classified and clustered automatically as follows:
Fig. 4. Document classification and clustering.

In the case of automatic classification, the features of each class
(politics, economy, etc.) are extracted from a given set of manually
classified documents. Then, for a new document, the computer extracts
the document's features, computes a similarity measure between it and
each class, and classifies the document into the nearest class.
In the case of automatic clustering of documents, the computer
collects documents with similar features together into separate groups.
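
A minimal sketch of the nearest-class procedure just described, using raw word-overlap counts as features; the training documents and labels below are invented for illustration:

from collections import Counter

# Tiny hand-labeled training set (illustrative data only).
training = {
    "politics": ["the minister announced new election laws",
                 "parliament voted on the new government"],
    "economy":  ["the central bank raised interest rates",
                 "stock markets fell as inflation rose"],
}

# Class features: aggregate word counts over each class's documents.
class_features = {label: Counter(w for doc in docs for w in doc.split())
                  for label, docs in training.items()}

def classify(document: str) -> str:
    """Assign the document to the class with the largest word overlap."""
    words = Counter(document.split())
    return max(class_features,
               key=lambda label: sum(min(words[w], class_features[label][w])
                                     for w in words))

print(classify("the bank raised rates again"))  # -> economy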

5.7. Opinion mining


It is important to recognize trends of opinion on several levels: at the
political level, to see people's tendencies toward a party or an
individual, and at the economic level, to know people's opinion about a
new product. It is important for any industrial company to know its
customers' opinions of its products. But how can these opinions be
gathered? Traditionally, questionnaires from a carefully selected sample
of users are collected and studied.
With the existence of social media, collecting people's comments on
any subject has become an easy task. Algorithms have evolved that can
classify each statement or comment as positive, negative or neutral.
Machine learning techniques are heavily used in this task, given that we
provide these techniques with some manually annotated comments for
learning, see Ref. 9.
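
As a minimal sketch of such a learned classifier, the following trains a Naive Bayes model with add-one smoothing on a few invented annotated comments (illustrative data only, not from the chapter):

import math
from collections import Counter

# Tiny annotated training set (illustrative only).
labeled = [("i love this car great quality", "positive"),
           ("excellent price and great design", "positive"),
           ("terrible engine i hate the noise", "negative"),
           ("bad quality and awful price", "negative")]

counts = {"positive": Counter(), "negative": Counter()}
for text, label in labeled:
    counts[label].update(text.split())
vocab = set(w for c in counts.values() for w in c)

def sentiment(comment: str) -> str:
    """Naive Bayes with add-one smoothing and uniform class priors."""
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores[label] = sum(math.log((c[w] + 1) / (total + len(vocab)))
                            for w in comment.split())
    return max(scores, key=scores.get)

print(sentiment("great car i love the design"))  # -> positive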
Example: if a car company produces a new model and wants to know
people's opinions about it, then after collecting the material published
on the Internet about this particular model, the company can learn
people's opinions about the price and quality, and what they like and
dislike about it.
This technique can be used on an ongoing basis to give the company's
business developers valuable information through which they can
improve their product continuously.

5.8. Computer-aided language learning (CALL)


Language learning is one of the branches of science, accounting for more
than 20% of the educational materials in pre-university education. It is
divided into learning the native language and learning other languages
as second languages. In general, CALL systems deal with detecting and
correcting errors in:
 Misspelling.
 Syntax.
 Semantics (using the appropriate word in the right place).

Detecting the correct spelling of Arabic words from context has
achieved a good level of acceptance. However, detecting syntactic and
semantic mistakes is not yet mature for Arabic: semantic mistakes, like
saying "he succeeded although he was studying hard", cannot easily be
detected by computers. It should be noted that recognizing the errors of
learners of Arabic as a second language in spelling, grammar and
semantics has shown evidence of success in some research works.

5.9. Stylometry
Stylometry is the art of verifying the authorship of a specific article or
book. This technology is just a branch of the document classification
referred to above, see Fig. 5. We can benefit from this technology in
officially attributing a specific article to its author, as happens when we
attribute children to their father. Glory is to Allah who created us alike
but not identical, so as to be distinguished and to know one another.
This is true not only of genes and fingerprints, but also of writing styles:
the words each one of us uses, their forms, and their collocations are
like the author's fingerprint.

Fig. 5. Stylometry.

5.10. Automatic speech recognition


Speech is the best way of communication between human beings, so
speech recognition technology has gained great importance. This
technology is a step towards easy and convenient communication
between man and machine. Its applications are many and varied,
including:
 Dictation engines, used to dictate an article or a letter and convert it
from speech to text.
 A navigation tool in some applications. This may be important in
certain situations, such as managing a phone call inside a car without
having to find the phone number by eye and hand, which distracts the
driver; the technology enables the user to request a particular phone
number by speech.
 Recording meetings and converting them automatically into minutes
of meeting.
How does speech recognition work? Simply put, almost all speech and
character recognition techniques operate on the same principles.
Figure 6 illustrates how these systems work.

Fig. 6. General model for pattern recognition.



5.11. Text to speech (TTS)


This technology has many practical applications, such as reading books
aloud for the blind and visually impaired, and speaking messages over
the phone to announce a service or give a piece of information, etc.
This technology is composed of other technologies, as shown in Fig. 7.

Fig. 7. TTS technology components.

In Arabic, people usually do not write the short vowels, so for the
computer to know the right pronunciation, an automatic diacritization
engine must be provided.
As for the speech synthesizer, there are two schools. One school relies
on segmenting the recorded training speech into speech units; the
required speech segments are then recalled in similar contexts at
synthesis time. This method is the most common.
The other school depends on generating models for each phoneme,
trained from the training data. Given a sentence and its phonetic
sequence, the technology generates the sequence of models that
corresponds to the sequence of phonemes.
The speech generated by the first school is more natural but suffers
some interruptions in smoothness, whereas the voice resulting from the
second school is smoother but does not sound as natural as the first.
Increasing the training data improves quality in both schools, see
Refs. 10–12.
Research is active on generating expressive speech, so that the
listener can distinguish the expression of joy, sadness, horror, or anger, etc.
5.12. Audio and video search


This technology helps in searching for a spoken word or expression in
audio or video files (the audio track). An engine listens to all the
available audio/video content to find the positions suspected to contain
the word under search. The technology depends on speech recognition,
either in full or in part, see Refs. 10, 11. Its many uses include:
 Searching speeches and sound recordings for topics of interest.
 Radio and television stations need this technology to prepare reports
about a person, an event, or an institution.
 For security reasons, international calls are recorded in all countries,
but listening to them is very hard; this technology is therefore
invoked with a set of alert words (for example: drugs, heroin,
weapons, etc.) so that, if any are found, the concerned staff are
alerted that a fine review is needed.
 Given the increasing volume of audio and video on the Internet now
and in the future, this technology will help researchers find lectures,
speeches, and movies: all you need is to write a few words that you
remember to get a relevant list of items.

5.13. Language recognition


The need for this technology has increased after the emergence of the
Internet and its openness to all cultures and languages. If we get a
recorded speech whose content we would like to know, we first need to
identify its language before passing it to the ASR system for that
language, see Ref. 13.
It is noteworthy that language recognition is also available for text
data, which is much easier than language recognition for speech data.
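
For the text case, even a crude script check already goes a long way; the sketch below counts characters in the Arabic Unicode block, a much simpler shortcut than the character n-gram models real systems use (the ranges and labels are illustrative):

def guess_script(text: str) -> str:
    """Crude text language ID: count characters in the Arabic Unicode block."""
    arabic = sum(1 for ch in text if '\u0600' <= ch <= '\u06FF')
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if arabic > latin:
        return "Arabic"
    return "Latin-script language" if latin else "unknown"

print(guess_script("اللغة العربية جميلة"))       # -> Arabic
print(guess_script("language identification"))  # -> Latin-script language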

5.14. Computer-aided pronunciation learning


Pronunciation is the most difficult skill when learning a foreign
language. How many of us find it difficult to speak to a foreigner for the
first time or to listen to foreign news? This technology therefore has an
important future in facilitating language learning, see Ref. 10.
We can classify this technology as similar (but not identical) to
speech recognition technology. In speech recognition, you need to
recognize the spoken words or sentences, but in this technology the
word or sentence searched for is already known, and you need to verify
its correct pronunciation. The various alternatives for the pronunciation,
whether correct or wrong, are modeled, and the technology decides
whether the pronunciation is closer to the correct one or not.
One notable application that has emerged in this field is the
technology for learning the rules of recitation under the name Hafss,
see Fig. 8, which is believed to be a good and useful example in this
important field.

Fig. 8. “Hafss” technology-aided pronunciation learning.

5.15. Typewritten optical character recognition (OCR)


Humanity is heading toward providing its heritage in digital format, to
deal with it easily in terms of searching, automatic summarization,
pronunciation, etc. Hence the importance of OCR technology has
emerged. The volume of governmental and non-governmental documents
and of scientific theses filling the libraries of hundreds of universities in
the Arab region needs high-quality technology to facilitate their
digitization.
OCR systems deal with images that are scanned or taken by a camera,
as shown in Fig. 9. Different fonts, sizes and styles pose a high
challenge to this technology. Preprocessing steps are needed to find the
text regions, after which binarization, de-skewing and denoising steps
are performed. As the characters in Arabic are mostly connected, Arabic
OCR is much more challenging than Latin OCR.

Fig. 9. Optical Character Recognition of printed letter.

5.16. Intelligent character recognition


After the widespread use of mobile phones with touch screens and
tablet PCs, the need for this technology became urgent, because writing
with a pen is much easier for many people than using a keyboard.
Each of the following applications is composed of two HLTs or even
more.

5.17. Book reader


As shown in Fig. 10, this application is composed of two technologies:
OCR and TTS.
Fig. 10. Book Reader.

5.18. Speech to speech translation


This application is complex to a great extent: it consists of six
technologies as shown in Fig. 11.

Fig. 11. Speech to Speech Translation.

5.19. Speech-to-sign-language and sign-language-to-speech


As seen in Fig. 12, sign language can be converted to speech and vice
versa; thus people with hearing impairments can be connected with the
community.
50 M. Rashwan

Fig. 12. Speech-to-Sign-Language and Sign-Language-to-Speech.

5.20. Dialog management systems


Speech recognition, TTS and automatic dialog technologies are used to
save human time. The system is used to partially or totally substitute
for human operators in answering questions or reserving a ticket, see
Fig. 13.

Fig. 13. Dialog management Systems.

These systems will raise the machine to much higher levels of easy
interaction with humans, see Ref. 12. They will be used extensively
with robots, which will be able to do a lot of work in homes, serve
children, elderly people and patients, and do heavy work in factories
24 hours a day without fatigue. Perhaps they will even be able to
narrate tales and entertain users, relieving them of their concerns and
playing with them intelligently and skillfully.
5.21. Advanced information retrieval systems


These systems not only retrieve information stored in information
containers in its direct form; they also retrieve complex information
from large collections of documents, see Ref. 14. In addition, they are
able to summarize the retrieved information if required, retrieve
information across different languages, and use advanced methods in
the retrieval process, such as voice input via the mobile phone or the
touch-screen tablet, and so on, as shown in Fig. 14.

Fig. 14. Advanced information retrieval systems.



5.22. Text mining (TM)


TM includes many algorithms that serve to extract deep information
from unstructured text to help decision makers, such as (see Ref. 9):
 Text clustering.
 Text classification.
 Sentiment and opinion analysis.
 Summarization.
 Named entity detection.
 Keyword and concept detection, etc.

6. Arabic Computational Linguistics Institutions

There are many computational linguistics institutions that provide
Arabic language technology services. We discuss a number of these
institutions below:

6.1. Academic institutions

6.1.1. Linguistic Data Consortium (LDC)


LDC is a research institution which collects language resources
developed by many researchers in several universities and research
centers. LDC gives particular attention to developing language corpora
for written and spoken language, and language dictionaries, for purposes
of research and development. It is concerned with three languages:
English, Arabic and Chinese.
 Headquarters: University of Pennsylvania - United States of America.
 Website: http://www.ldc.upenn.edu
 Examples of Arabic language resources:
 Buckwalter morphological analyzer.
 Arabic Treebank.
 Egyptian Dialect dictionary.
 Many Arabic dialect language resources (Egyptian and Iraqi).
6.1.2. NLP Team - School of Computing at the University of Leeds


The School of Computing at the University of Leeds is a specialized
academic institution that awards degrees to researchers and oversees
their Masters and PhD theses. Its research team works on natural
language processing and pays special attention to Arabic language
technologies and their linguistic resources, focusing on language
modeling and developing Arabic language corpora.
 Headquarters: University of Leeds - United Kingdom
 Website: http://www.engineering.leeds.ac.uk/computing/postgraduate/
research-degrees/projects/natural-language-processing.shtml
 Arabic language resources:
 Contemporary Arabic language corpus.
 An automatic Part-Of-Speech (POS) tagger.
 A computer model for knowledge representation of the Quran.
 A Quranic text corpus.

6.1.3. The Arab League Educational, Cultural and Scientific Organization (ALECSO)
The Department of Sciences at the Arab League Educational, Cultural
and Scientific Organization (ALECSO) directs its attention to Arabic
CL. It has therefore held a number of international conferences and
forums, and has accomplished, and is still accomplishing, several
projects related to automatic processing of the Arabic language.
ALECSO is keen to provide Arabic HLT in free or open-source form, in
order to make it available to researchers on the one hand, and to keep
developing it and addressing its shortcomings on the other.
 Headquarters: Tunisia
 Website: http://www.alecso.org/
 Arabic Language Resources:
 Morphological analyzer “Al Khalil”.
 Interactive Arabic lexicon.
 Outstanding Projects:
 Syntax analyzer.
 Spell checker for Arabic.
 Automatic Arabic text Diacritizer.
6.1.4. King Abdul Aziz City for Science and Technology (computer
research institute)
The Institute includes the Department of Phonetics and Linguistics,
which is interested in preparing research and solutions for the problems
of Arabic language technologies. It provides consultations and organizes
workshops to follow up advances in the field.
 Headquarters: King Abdul Aziz City for Science and Technology,
Riyadh, Saudi Arabia.
 Website: http://www.kacst.edu.sa/ar/about/institutes/pages/ce.aspx
 Arabic language resources:
 Saudi sound bank.
 Arabic optical character recognition system.
 Syntax analysis for Arabic online texts.
 Huge Arabic text corpus.
 Automatic Arabic essay scoring.

6.1.5. Columbia’s Arabic Dialect Modeling Group (CADIM)


It is a research group in the Center for Computational Learning Systems
(CCLS) at Columbia University. The team is interested in processing
Arabic dialects, based on the standards of Modern Standard Arabic
(MSA). The group has adopted a project on Arabic automatic speech
recognition and is interested in automatic translation from Arabic to
English.
 Headquarters: Columbia University, New York, United States.
 Website: http://www1.ccls.columbia.edu/~cadim/
 Arabic Language Resources:
 Language analysis system MADA + TOKAN, a system for
analyzing written Arabic texts, whose functions include:
 Tokenization.
 Morphological disambiguation.
 POS tagging, stemming and lemmatization.
 Diacritization.
6.1.6. Research team in natural language processing at Stanford University (The Stanford NLP Group)
The Stanford NLP group draws its members from the Departments of
Linguistics and Computer Science; they work together on algorithms
that allow computers to process and understand human languages.
 Headquarters: Stanford University, California, United States.
 Website: http://nlp.stanford.edu
 Some projects:
 Stanford Neural Machine Translation.
 Stanford Natural Language Inference Corpus (SNLI).
 Stanford Open Information Extraction.

6.1.7. Qatar Computing Research Institute (QCRI)


The QCRI focuses on Arabic language technology, high performance
computing and bioinformatics.
 Headquarters: Doha, Qatar.
 Website: http://qcri.org.qa
 Challenges addressed by the Institute in the field of Arabic language
technologies:
 Machine translation for the Arabic language.
 Continuous Arabic language chat systems.
 Arabic content and search.

6.1.8. Egyptian Society for Language Engineering (ESOLE)


ESOLE is interested in CL in general and in the Arabic language in
particular. The Society holds an annual conference concerned with
linguistic issues.
 Headquarters: Ain Shams University, Cairo, Egypt.
 Website: http://www.esole-eg.org.

6.1.9. Arabic Language Technology Center (ALTEC)


ALTEC is a non-profit organization established by a number of
technology companies and academic institutions. It aims to provide
linguistic resources for those involved in Arabic computational linguistics.
 Headquarters: Giza, Egypt.
 Website: http://www.altec-center.org
 Activities of the Center:
 Holding conferences on Arabic computational linguistics and its
techniques.
 Producing a set of systems and language resources for researchers
and developers.
 Arabic Language Resources:
 A database for Arabic typewritten OCR systems (14,000 pages).
 A database for Arabic handwritten OCR systems (1,000 writers).
 A fully automatically diacritized corpus (3 million words).
 A named-entity tagged corpus (3 million words).

6.2. Companies interested in computational linguistics


It is worth mentioning that many companies are concerned with HLTs,
for example:

6.2.1. International Business Machines Corporation (IBM)


 Headquarters: New York, the United States.
 Website: http://www.ibm.com

6.2.2. Microsoft Company


 Headquarters: Washington, United States.
 Website: http://www.microsoft.com

6.2.3. Sakhr Software - Arabic Language Technology


 Headquarters: Cairo, Egypt.
 Website: http://www.sakhr.com

6.2.4. The Engineering Company for the Development of Digital Systems (RDI)
 Headquarters: Giza, Egypt.
 Website: http://www.rdi-eg.com
7. Summary and Conclusions

In this chapter an introduction to computational linguistics in general,
with special focus on the Arabic language, was given. We reviewed
some of the challenges that face Arabic computational linguistics
compared with English. After taking a broad look at what we mean by
Arabic human language technologies, we also reviewed many of the
organizations and companies that have contributed to Arabic HLT.
In conclusion, language technology is still an active area of research
and will remain so for some time, until the research community delivers
much higher levels of performance that can satisfy human needs. When
these technologies mature to serve humanity tirelessly, with higher
safety levels than human work, what will be left to human beings are
the innovative tasks.

References

1. E. Kumar, Natural Language Processing, I. K. International Pvt Ltd (2011).
2. M. Rosner & R. Johnson, Computational Linguistics and Formal Semantics,
Cambridge University Press (1992).
3. A. Tavast & K. Muischnek & M. Koit, Human Language Technologies–the Baltic
Perspective, Proceedings of the Fifth International Conference Baltic HLT 2012–
Frontiers in Artificial Intelligence and Applications, IOS Press (2012).
4. A. A. S. Farghaly, Arabic Computational Linguistics, University of Chicago Press
(2010).
5. A. Przepiórkowski & M. Piasecki & K. Jassem & P. Fuglewicz, Computational
Linguistics: Applications, Springer (2012).
6. J. Benesty & M. M. Sondhi & Y. Huang, Handbook of Speech Processing, Springer
(2007).
7. H. A. Dry & J. Lawler, Using Computers in Linguistics: A Practical Guide,
Routledge (2012).
8. J. Pustejovsky & A. Stubbs, Natural Language Annotation for Machine Learning,
O’Reilly Media (2012).
9. A. Kao & S. R. Poteet, Natural Language Processing and Text Mining, Springer
(2007).
10. L. Dybkjær & H. Hemsen & W. Minker, Evaluation of Text and Speech Systems,
Springer (2007).
11. M. Johnson & S. P. Khudanpur & M. Ostendorf & R. Rosenfeld, Mathematical
Foundations of Speech and Language Processing, Springer (2004).
12. D. Jurafsky & J. H. Martin, Speech and Language Processing: An Introduction to
Natural Language Processing, Computational Linguistics, and Speech
Recognition, Prentice Hall (2009).
13. R. Zhu, Information Engineering and Applications, Springer (2012).
14. R. Mihalcea & D. Radev, Graph-based Natural Language Processing and
Information Retrieval, Cambridge University Press (2011).
15. S. Alansary & M. Nagi & N. Adly, A Suite of Tools for Arabic Natural Language
Processing: A UNL Approach, (ICCSPA’13), Sharjah, UAE (2013).
16. A. Al-Thubaity & M. Khan & M. Al-Mazrua & M. Al-Mousa, New Language
Resources for Arabic: Corpus Containing More Than Two Million Words and a
Corpus Processing Tool, International Conference on Asian Language Processing
(IALP) (2013).
17. A. Clark & C. Fox & S. Lappin, The Handbook of Computational Linguistics and
Natural Language Processing, John Wiley & Sons (2010).
18. M. Dickinson & C. Brew & D. Meurers, Language and Computers, John Wiley &
Sons (2012).
19. C. D. Manning & H. Schütze, Foundations of Statistical Natural Language
Processing, MIT Press (1999).
20. C. D. Manning & M. Surdeanu & J. Bauer & J. Finkel & S. J. Bethard & D.
McClosky, The Stanford CoreNLP Natural Language Processing Toolkit.
Proceedings of 52nd Annual Meeting of the Association for Computational
Linguistics: System Demonstrations, Baltimore, Maryland (2014), pp. 55–60.
21. E. Ovchinnikova, Integration of World Knowledge for Natural Language
Understanding, Atlantis Press (2012).
22. A. Pasha & M. Al-Badrashiny & M. Diab & A. El Kholy & R. Eskander & N.
Habash & M. Pooleery & O. Rambow & R. M. Roth, Madamira: A fast,
comprehensive tool for morphological analysis and disambiguation of Arabic,
Proceedings of the Language Resources and Evaluation Conference (LREC),
Reykjavik, Iceland (2014), pp. 1094–1101.
23. A. Shoufan & S. Al-Ameri, Natural Language Processing for Dialectical Arabic: A
Survey, Proceedings of the Second Workshop on Arabic Natural Language
Processing, Beijing, China (2015), pp. 36–48.
24. Z. Vetulani “Ed”, Human Language Technology. Challenges for Computer Science
and Linguistics, 4th Language and Technology Conference, LTC 2009, Roznan,
Poland, November 6-8, 2009, Revised Selected Papers, Springer (2011).
25. A. Witt & D. Metzing, Linguistic Modeling of Information and Markup
Languages: Contributions to Language Technology, Springer (2010).
59

Chapter 3

Challenges in Arabic Natural Language Processing

Khaled Shaalan1, Sanjeera Siddiqui2, Manar Alkhatib3 and Azza Abdel Monem4
1,2,3Faculty of Engineering & IT, The British University in Dubai,
Block 11, Dubai International Academic City,
P.O. Box 345015, Dubai, UAE
1School of Informatics, University of Edinburgh, UK
4Faculty of Computer and Information Sciences, Ain Shams University,
Abbassia, 11566 Cairo, Egypt
khaled.shaalan@buid.ac.ae1, faizan.sanjeera@gmail.com2,
Manaralkhatib09@gmail.com3 and azza_monem@hotmail.com4

Natural Language Processing (NLP) has gained increasing significance in machine translation and in various other applications such as speech synthesis and recognition, localization, multilingual information systems, and so forth. Arabic Named Entity Recognition, Information Retrieval, Machine Translation and Sentiment Analysis are among the Arabic tools that have shown considerable value to intelligence and security organizations. NLP plays a key role in the processing stage of Sentiment Analysis, Information Extraction and Retrieval, Automatic Summarization, and Question Answering, to name a few. Arabic is a Semitic language, which differs from Indo-European languages phonetically, morphologically, syntactically and semantically. This chapter discusses the different challenges of NLP in Arabic and aims to motivate researchers in this field and beyond to take measures to handle them.

1. Introduction

Natural language processing (NLP) is a domain of computer science that aims at facilitating communication between machines (computers that understand machine or programming languages) and human beings (who communicate in and understand natural languages such as English, Arabic and Chinese). NLP is very important as it makes a huge impact on our daily lives, and many applications these days use concepts from NLP. This chapter discusses the different challenges of NLP in Arabic.
Arabic is the sixth most spoken language in the world. Ref. 1 expressed the significance of Arabic: it is connected with Islam, and more than 200 million Muslims perform their prayers five times daily using this language. Moreover, Arabic is the first language of the Arab world countries, which gives it significant importance worldwide. Arabic is an exceptionally rich language that belongs to a different linguistic family, the Semitic languages, which differs from the Indo-European languages spoken in the West. Arabic is also remarkable in that any person with a slight knowledge of Arabic can read and understand a text written fourteen centuries ago.
Arabic, as a language, is highly derivational and inflectional according to Refs. 2, 3 and 4, and there are no rules for emphasis5,6,7,8. Strictly speaking, there are principles; however, there are no firm rules1.
The Arabic language has a rich and complex grammatical structure9,10. For instance, a noun and its modifiers need to agree in number, gender, case, and definiteness11. Moreover, Arabic has constructions that literally mean "mother of" or "father of" to show ownership, a characteristic, or a property, and it uses gendered pronouns; it has no neutral pronouns12.
Arabic sentences can be nominal (subject–verb), or verbal (verb–
subject) with free order; however, English sentences are fundamentally in
the (subject–verb) order. The free order property of the Arabic language
presents a crucial challenge for some Arabic NLP applications13.
Three varieties characterize Arabic: Classical (Traditional or Quranic) Arabic, Modern Standard Arabic and Dialect Arabic14,15. The Arabic language takes these forms in light of three key parameters: morphology, syntax and lexical combinations1,16,17,18. Classical Arabic is primarily used in Arabic-speaking countries, as opposed to within the diaspora. Classical Arabic is found in religious writings such as the Sunnah and Hadith, and in numerous historical documents19. Diacritic marks (also known as "Tashkil" or short vowels) are commonly used within Classical Arabic as phonetic guides to show the correct pronunciation. On the contrary,
diacritics are considered optional in most other Arabic writing20. Modern
Standard Arabic (MSA) is used for TV, newspapers, poetry and books. Arabic courses at the Arab Academy are also taught in the Modern Standard form. MSA can be transformed to accommodate new words that need to be created because of science or technology. However, the written Arabic script has seen no change in the alphabet, spelling or vocabulary in at least four millennia. Hardly any living language can claim such a distinction.
Dialect Arabic or "colloquial Arabic" is used casually every day by Arabs. It is found in the various nations, and in the various districts of a nation19. It is grouped into Mesopotamian Arabic, Arabian Peninsula Arabic, Syro-Palestinian Arabic, Egyptian Arabic and Maghrebi Arabic. Dialect Arabic is generally used, and mostly written, by Internet users21 and on social media19, and it varies from region to region. In dialectal Arabic, portions of the words are borrowed from MSA22,23. Ref. 1 showed the importance of building native tools that work on both Modern Standard and Dialect Arabic. Ref. 22 presented a hybrid pre-processing approach that has the ability to convert paraphrases of Egyptian dialectal input into MSA such that the available NLP tools can be applied to the converted text. Ref. 24, as well as its enhanced version25, worked on Sentiment Analysis of data containing different Arabic dialects.
In this chapter, illustrative examples are used for clarification. The examples are given in MSA as it represents the bulk of written material, formal TV shows, lectures, and papers. Besides, it is a universal form that is understood by all Arabic speakers.

2. Challenges

Arabic is a highly inflected languagea with unique sounds, especially when pronouncing the letters "ض" (ḍād), "ظ" (ẓā'), and "غ" (ghayn). Arabic grammar has a rich morphology and an intricate sentence structure, and grammarians have described Arabic as the language of ḍād ("لغة الضاد")26,18. Ref. 15 states that Arabic has a greatly rich morphology depicted by a mix of templatic and affixational morphemes, complex morphological rules,

aThe top alveolar ridge is located on the roof of the mouth between the upper teeth and the
hard palate.
and a rich feature system. Arabic makes use of many inflections through its affixes, which include prepositions and pronouns. Arabic morphology is complex because there are about 10,000 roots that form the basis for nouns and verbs27, and there are 120 patterns in Arabic morphology. Ref. 28 highlighted the importance of 5,000 roots for Arabic morphology.
The word order in Arabic is variable: we have a free choice of which word to emphasize by putting it at the head of the sentence. Generally, the syntactic analyzer parses the input tokens produced by the lexical analyzer and tries to identify the sentence structure using Arabic grammar rules. The relatively free word order in an Arabic sentence causes syntactic ambiguities that require investigating all the possible grammar rules as well as the agreement between constituents13,24.
In this chapter, we discuss the challenges of the Arabic language with regard to its characteristics and their related computational problems at the orthographic, morphological, and syntactic levels. In automating the analysis of Arabic sentences, there is an overlap between these levels, as they all help in making sense of words and in disambiguating the sentence.

2.1. Arabic orthography


Within the orthographic patterns of written words, the shape of a letter changes depending on whether it is connected to both a preceding and a following letter, or only to a preceding letter. For example, the shape of the letter "ف" (f), i.e. "فـ"/"ـفـ"/"ف", changes depending on whether it occurs at the beginning, in the middle, or at the end of a word, respectively.
Arabic orthography also includes a set of orthographic symbols, called diacritics, that carry the intended pronunciation of words. This helps clarify the sense and meaning of the word.
As far as the Qur'an is concerned, these vowel signs are absolutely necessary in order for children, and for those who are not well versed in the classical Arabic language, to pronounce the religious text properly. It is worth noting that written copies of the Qur'ān cannot be accredited by the religious institutes or authorities that review them unless the diacritics are included.
The absence of short vowels (i.e. inner diacritics) leads to diverse sorts of ambiguity in Arabic texts (both structural and lexical), because distinct diacritics represent distinct meanings9,20. These ambiguities can only be resolved by contextual features and an adequate knowledge of the language29.
Arabic orthography includes 28 letters20. All letters are consonants except three long vowels, "ا" (alef), "و" (waw), and "ي" (yeh); short vowels are represented by diacritical signs. This specificity brings into existence two forms of spelling: with or without vocalisation. The vowels added to a consonantal skeleton by means of diacritical marks produce a shallow orthography. When vocalisation is missing, the orthography is deep and a word behaves as a homograph that is semantically and phonologically ambiguous. For instance, the unvoweled word "كتب" (ktb) supports several alternatives such as "كَتَبَ" (he wrote, kataba), "كُتِبَ" (it was written, kutiba), "كُتُبٌ" (books, kutubun), etc. Voweled spelling is taught to novice readers, while unvoweled spelling constitutes the standard form and is gradually imposed at later reading literacy stages. Unfortunately, MSA text is typically devoid of diacritical markings, and the restoration of these diacritics is an important task for other NLP applications such as text to speech30.
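To make the homograph problem concrete, the following minimal Python sketch (an illustration added here, not a tool from this chapter) strips the optional diacritics, collapsing several distinct readings onto one ambiguous surface form; the Unicode range U+064B–U+0652 covers the tanween, short vowel, shadda and sukun marks.

    import re

    # Arabic diacritics: tanween (U+064B-U+064D), short vowels (U+064E-U+0650),
    # shadda (U+0651) and sukun (U+0652).
    DIACRITICS = re.compile(r'[\u064B-\u0652]')

    def strip_diacritics(text):
        """Remove the optional diacritical marks, yielding the unvoweled form."""
        return DIACRITICS.sub('', text)

    readings = ['كَتَبَ',   # kataba: he wrote
                'كُتِبَ',   # kutiba: it was written
                'كُتُبٌ']   # kutubun: books
    # All three distinct readings collapse to the same homograph (ktb).
    print({strip_diacritics(w) for w in readings})   # {'كتب'}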

2.1.1. Lack of consistency in orthography


Hamza Spelling
The use of the Hamza letter ("الهمزة", "ء") brings in additional challenges. The Hamza appears as an extra mark at the top or bottom of the letters carrying the sounds of "ا", "و", or "ى", i.e. "أ", "ؤ", or "ئ", respectively. As these rules are confusing even for native speakers, the Hamza is ignored most of the time while typing, and NLP-based systems should handle this assumption.
There are many orthographical forms of the Hamza letter, and "the seat of Al-Hamza" is decided by the diacritics ("Tashkeel") of both the Hamza itself and the letter preceding it, i.e. "Fatha", "Damma", "Kasra" or "Sukun". Exceptionally, when the Hamza comes at the beginning of a word, it is always written over an "Alef", e.g. "أنا" (I, 'ana), or under it, e.g. "إيمان" (faith, 'iiman).
According to its appearance and pronunciation, there are two types of Hamza: "همزة قطع" (Hamza Al-Qata') and "همزة وصل" (Hamza Al-Wasl). Distinguishing each type is a challenge for both text and speech processing. Hamza Al-Qata' is the regular Hamza and is always written and pronounced, e.g. "إيمان" and "أنا".
On the contrary, Hamza Al-Wasl is neither written nor pronounced unless it is at the start of the utterance; a bare Alef is used instead. A simple rule to recognize Hamza Al-Wasl is to add "و" (waw, and) before it and see whether or not it is still pronounced and, hence, written.
For example, the Hamza in "إقرأ الكتاب" (iq-ra' Al-Kitab, read the book) is pronounced and written. However, if we add "و" (waw, and) at the beginning of the sentence, as in "واقرأ الكتاب" (waq-ra' Al-Kitab, and read the book), the Hamza is neither pronounced nor written. A more complicated example is "أخذت ابننا" (a-khadh-tu ibnana, I grabbed our son). In the first word, "أخذت" (grabbed), the Hamza is a glottal stop (pronounced strongly) and should be pronounced, but in the second word, "ابننا", it is neither written nor pronounced.
When the diacritic mark of the Hamza is either "Fatha" or "Damma", the Hamza appears in the middle or at the end of a word and is written over the letter. Table 1 presents some examples with the addition of the Hamza and the challenge it brings in causing orthographic confusion. The Hamza seat follows a hierarchy of vowels in the language: the Kasra has the highest priority, the Damma has medium priority, and the Fatha has the lowest priority.

Table 1. The Hamza seat is determined by the diacritics of the Hamza itself and of the preceding letter.

Example | Tashkeel of Hamza | Tashkeel of letter before Hamza | Pronunciation | Translation
سَأَل | Fatha | Fatha | Sa-ala | Asked
سُئِل | Kasra | Kasra | So-ela | Was asked
سُؤَال | Fatha | Damma | So-aal | Question

If the diacritic of either the Hamza itself or the letter preceding it is Kasra/Fatha/Damma, the Hamza takes the corresponding Kasra/Fatha/Damma seat, respectively. The rules for determining the seat of the Hamza are of notorious complexity.
In transcribing into Arabic, it is difficult to determine the Hamza seat as well as the short vowel that it follows. These types of the Hamza are of a complex nature and need special handling by the computational system.
Al-Hamza orthographic variants are non-standard ways to spell a specific variant of a name, like "الامارات" instead of "الإمارات" (Al-Emarat, Emirates), in which the Hamza is omitted and a bare Alef is used instead. Though the difference between these variants cannot be strictly defined, based on "statistical and linguistic analysis" of Modern Standard Arabic orthography37, both occur frequently. For example, the capital of the United Arab Emirates, "أبوظبي" (Abu Dhabi), can be written in different ways. According to statistics from Google, the most frequent ones are "أبوظبي", "ابوظبي" and "بوظبي", with 13,800,000, 9,400,000, and 1,400,000 occurrences, respectively.
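A common, if lossy, engineering remedy (a hypothetical sketch, not a method proposed in this chapter) is to normalize all Hamza-bearing Alef variants to a bare Alef before matching or indexing, so that variants such as "الإمارات" and "الامارات" compare equal:

    # Map Alef variants (hamza above/below, madda) onto bare Alef (U+0627).
    # This normalization is deliberately lossy: it discards the Hamza
    # distinctions that writers themselves apply inconsistently.
    ALEF_VARIANTS = {
        '\u0622': '\u0627',  # Alef with madda above
        '\u0623': '\u0627',  # Alef with Hamza above
        '\u0625': '\u0627',  # Alef with Hamza below
    }

    def normalize_alef(text):
        return ''.join(ALEF_VARIANTS.get(ch, ch) for ch in text)

    assert normalize_alef('الإمارات') == normalize_alef('الامارات')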

Defective Verb Ambiguity

A defective (weak) verb ("الفعل المعتل") is any verb whose root has a long vowel as one of its three radicals. These long vowels go through a change when the verb is conjugated. For example, consider the case of a negated present tense verb that is preceded by the apocopative particle Lam, "حرف الجزم لم". In Arabic, this particle is used for negating a present tense verb form, which is understood as a negated past form18. It is one of the defining features of Modern Standard Arabic and is not used in any dialect; being able to use it properly and effectively brings the use of the Arabic language to a higher level.
Table 2 presents examples of these verb forms. The negated past tense verb form causes ambiguity and is a frequent source of misspelling31. When the apocopative particle Lam precedes a past tense verb, the verb changes to the present tense form by: 1) attaching a suitable present tense letter, 2) omitting the long vowel in the verb, and 3) adding a short vowel to the last letter. Although the apocopative particle Lam expresses the past tense, it can never be used with the perfective verb itself; rather, it is only used before imperfective verbs.

Table 2. Examples of negated past tense verb forms.

Verb | Transliteration | Sentence | Change applied to obtain the present form
دعا | Da-aa | لم يدعُ | Omit the last long vowel "و" and add the present tense letter "ي"
سعى | Sa-aa | لم يسعَ | Omit the last long vowel "ى" and add the present tense letter "ي"
صلى | Sala | لم يصلِ | Omit the last long vowel "ي" and add the present tense letter "ي"
زار | Zara | لم يزرْ | Omit the middle long vowel "ا" and add the present tense letter "ي"

2.1.2. Nonappearance of capital letters

Arabic has no special orthographic sign, which renders the recognition of a Named Entity (NE) all the more difficult32. English, on the other hand, in line with numerous other Latin-script-based languages, has a particular marker in orthography, namely the upper-casing of the initial letter, showing that a word or sequence of words is a named entity. Arabic does not have capital letters; this characteristic represents a considerable obstacle for the basic task of Named Entity Recognition because, in other languages, capital letters represent a vital feature for identifying proper nouns19. Thus, the issue of identifying proper names is especially troublesome for Arabic. For instance, English uses capital letters, e.g. "Adam", but there is no capital letter in the same name in Arabic, e.g. "آدم".
Another fact to consider about Arabic is that, having no capital letters (e.g. for proper names: the names of people, countries, months, days of the week), it cannot make use of acronyms in the same way. This can lead to confusion, especially during Information Extraction in general and Named Entity Recognition in particular, as it makes it difficult to spot the names of entities. For example, the NE "الامارات العربية المتحدة" has the acronym UAE in English but none in Arabic. Therefore, it is common to resolve the nonappearance of capital letters by analyzing the context surrounding the Named Entity.

2.1.3. Inherent ambiguity in named entities


Most Arabic proper nouns (NEs) are indistinguishable from forms that are common nouns and adjectives (non-NEs), which causes ambiguity. For example, the noun "الجزيرة" (Aljazeera) can be recognized as an organization name or as the common noun corresponding to "island". Arabic names that are derived from adjectives are usually ambiguous, which presents a crucial challenge for Arabic NLP applications such as Arabic Named Entity Recognition. As an example, consider the word "أمل" (Amal), which means "hope" and can be confused with the name of a person. In the following two sentences, the word "Amal" carries two different senses:

1. "الشباب هم أمل البلد", which means: the youth are the hope of the country.
2. "أمل بنت جميلة", which means: Amal is a beautiful girl.

Remedies to resolve this type of ambiguity might not necessarily fix all problems33,34. For example, consider the sentence "رأيت أمل" (I saw hope/Amal), which can have either meaning.

2.1.4. Vowels
In written Arabic, there are two types of vowels: diacritical symbols and long vowels. Arabic text is dominantly written without diacritics, which leads to major linguistic ambiguities in most cases, as an Arabic word has different meanings depending on how it is diacritized. A diacritic sign (Tashkeel or Harakat) is not an orthographic letter; it is formed as a mark above or below a consonant to give it a sound. Ref. 35 presented a good survey of recent work in the area of automatic diacritization. There are three groups of diacritics32,36. The first group consists of the short vowel diacritics: Fatha ( ◌َ ), Damma ( ◌ُ ), and Kasra ( ◌ِ ). The second group represents the doubled case-ending diacritics (nunation or tanween): Tanween Fatha ( ◌ً ), Tanween Kasra ( ◌ٍ ), and Tanween Damma ( ◌ٌ ). These are vowels occurring at the end of nominal words (nouns, adjectives and adverbs), indicating nominal indefiniteness. The third group is composed of the Shadda ( ◌ّ ) and Sukuun ( ◌ْ ) diacritics. The Shadda reflects the doubling of a consonant, whereas the Sukuun indicates the absence of a vowel.
Diacritics can also be classified into two main groups based on their function. The first group includes the lexeme diacritics that determine the part of speech (POS) of a word, as in "كَتَبَ" (wrote, kataba) and "كُتُبْ" (books, kutub), and also the meaning of the word, as in "مَدْرَسَة" (school, madrasa) and "مُدَرِّسَة" (female teacher, mudarrisa). The second category represents the syntactic diacritics that reflect the syntactic function of the word in the sentence. For example, in the sentence "زَارَ الوَلَدُ الحَدِيقَةَ" (The boy visited the garden, zar alwalad alhadiqa), the syntactic diacritic "Fatha" on the word "الحَدِيقَةَ" (the garden, alhadiqa) reflects its "object" role in the sentence, while in the sentence "تَزَيَنَتْ الحَدِيقَةُ" (The garden was spruced up, tazayanat alhadiqa) the same word occurs as a "subject", hence its syntactic diacritic is a "Damma". A text without diacritics adds layers of confusion for novice readers and for automatic computation. For example, the absence of diacritics is a serious obstacle to applications such as text to speech (TTS), intent detection, and automatic understanding in general. Therefore, automatic diacritization is an essential component of many Arabic NLP applications.
The vowels in English, that is "a", "e", "i", "o" and "u", are clearly spelled out in a text, whereas in Arabic they are not. There are no exact matches between English and Arabic vowels; they may differ in quality, and they may behave differently under certain circumstances. All letters of the Arabic alphabet are consonants except three: "ا" (Alef), "و" (Waw), and "ي" (Ya'a), which are used as long vowels or diphthongs and also play a role as weak consonants5. A long vowel can appear at the beginning, in the middle, or at the end of a word, and it has many forms of pronunciation. Table 3 presents a homographic issue with the aid of an example: "قالوا إنه لم يعش، ولكن أمه لم تستسلم" (They said that he did not live, but his mother did not give up, Qalo Anaho lam ya'esh lakin Amah lam tastaslim). The acoustic and language models of speech processing systems should deal with such long vowel issues.
Table 3. Homographic issue for long vowels.

Word | Transliteration | Meaning | Marks
قالوا | Qalo | They said | The letter Alef at the end is not pronounced
لكن | Lakin | But | The letter Alef does not appear in the middle but is pronounced

2.1.5. Lack of uniformity in writing styles

The high level of ambiguity of the Arabic script poses special challenges to developers in NLP areas such as Morphological Analysis, Named Entity Extraction and Machine Translation. These difficulties are exacerbated by the lack of comprehensive lexical resources, such as proper noun databases, and by the multiplicity of ambiguous transcription schemes. The process of automatically transcribing a non-Arabic script into Arabic is called Arabization. For example, transcribing an NE such as the city of Washington into an Arabic NE produces variants such as "واشنطن", "واشنجطن", "واشنغطن" and "وشنطن". Arabizing is very difficult for many reasons; one is that Arabic has more speech sounds than Western European languages, which can ambiguously or erroneously lead to an NE having even more variants. One solution is to retain all versions of the name variants with a possibility of linking them together. Another solution is to normalize each occurrence of a variant to a canonical form; this requires a mechanism (such as string distance calculation) for matching a name variant with its normalized form19.
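The string-distance idea can be sketched with plain Levenshtein distance (an illustrative assumption; the tiny gazetteer below is hypothetical): each observed spelling is mapped to the closest canonical form.

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    CANONICAL = ['واشنطن', 'أبوظبي']  # tiny gazetteer of normalized forms

    def normalize_variant(name):
        """Map a spelling variant to its nearest canonical form."""
        return min(CANONICAL, key=lambda c: levenshtein(name, c))

    print(normalize_variant('واشنجطن'))  # -> واشنطن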

2.2. Arabic morphology

An additional property of Arabic that should be noted is that it is an exceptionally morphologically rich language. Its vocabulary can be easily amplified using a framework that allows for a creative use of roots and morphological patterns4,17,28,37,38,39,40. According to Ref. 41, referred to in Ref. 20, 85% of words come from trilateral roots, and there are around 10,000 free roots. Hence, Arabic is highly derivational, and inflection results in a highly inflected morphology28,31,37,42. Arabic is known for its templatic morphology, where words are built from roots and patterns and fastened together with affixes.
2.2.1. Morphology is intricate

Arabic is a Semitic language that has a powerful morphology and a flexible word order. It is difficult to put a border between a word and a sentence, which yields morpho-syntactic structure combinations for a word along the dimensions of part of speech, inflection, declension and clitics, among other features13,43. Arabic morphology and sentence structure allow a broad number of attachments to be added to each word, which makes for a combinatorial expansion of possible words.
Arabic is highly derivational. All Arabic verbs are derived from a base three- or four-character root verb. Essentially, every adjective derives from a verb, and all of them are derivations too50. Derivations in Arabic are almost always templatic; hence, we can say simply that: Lemma = Root + Pattern. Additionally, in the case of a regular derivation, we can work out the meaning of a lemma if we know the root and the pattern that have been used to derive it44. Table 4 depicts examples of the composite relation "Lemma = Root + Pattern", demonstrating the production of two Arabic verbs from the same classification and their derivation from the same pattern. Notice that the Arabic root is consonantal, whereas the pattern is the vowel(s) attached to a root.

Table 4. Illustration of the Arabic derivational stage.

Lemma | Transliteration | = Root | Transliteration | + Pattern
مفتوح | Maftooh | فتح | Fath | م؟؟و؟
مدروس | Madroos | درس | Daras | م؟؟و؟
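The relation Lemma = Root + Pattern lends itself to a direct implementation. The sketch below is illustrative (it borrows the table's convention of marking root-consonant slots with "؟") and interdigitates a trilateral root into a pattern template:

    def apply_pattern(root, pattern, slot='؟'):
        """Fill the root consonants, in order, into the pattern's slots."""
        radicals = iter(root)
        return ''.join(next(radicals) if ch == slot else ch for ch in pattern)

    PATTERN = 'م؟؟و؟'                      # the passive-participle template
    print(apply_pattern('فتح', PATTERN))   # -> مفتوح (Maftooh)
    print(apply_pattern('درس', PATTERN))   # -> مدروس (Madroos)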

2.2.2. Morphology declension

Arabic is highly inflectional. The prefixes can be articles, prepositions or conjunctions, whereas the suffixes are generally objects or personal/possessive pronouns. As stated by Ref. 45, both prefixes and suffixes may themselves be combinations, and thus a word can have zero or more affixes, i.e. Word = Prefix(es) + Lemma + Suffix(es). Arabic verb morphology is central to the construction of an Arabic sentence because of its richness of form and meaning. A more complicated example would be a word that represents an entire sentence in English, such as "وسيحضرونها" (and they will bring it, wasayahdurunaha). This word can be decomposed as follows:

وسيحضرونها = و + س + ي + حضر + ون + ها
(wa+sa+ya+hdr+una+ha, and+will+bring+they+it)

In this example, the lemma "حضر" (hadr) accepts three prefixes, "و" (wa), "س" (sa), and "ي" (ya), and two suffixes, "ون" (waw-noun) and "ها" (ha). Because of this complexity of Arabic morphology, building an Arabic NLP system is a challenging task.
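A naive affix-stripping segmenter makes the decomposition Word = Prefix(es) + Lemma + Suffix(es) concrete. The affix inventories below are a tiny illustrative subset chosen for this example; real analyzers such as MADAMIRA22 rely on far richer lexica and statistical disambiguation.

    PREFIXES = ['و', 'س', 'ي']   # illustrative subset only
    SUFFIXES = ['ها', 'ون']

    def segment(word):
        """Greedily peel known prefixes and suffixes off a word,
        keeping at least a three-letter core as the candidate lemma."""
        pre, suf = [], []
        changed = True
        while changed:
            changed = False
            for p in PREFIXES:
                if word.startswith(p) and len(word) - len(p) >= 3:
                    pre.append(p); word = word[len(p):]; changed = True
            for s in SUFFIXES:
                if word.endswith(s) and len(word) - len(s) >= 3:
                    suf.insert(0, s); word = word[:-len(s)]; changed = True
        return pre, word, suf

    print(segment('وسيحضرونها'))
    # -> (['و', 'س', 'ي'], 'حضر', ['ون', 'ها'])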
An early step in analyzing an Arabic text is to identify the words in the input sentence, based on their types and properties, and to output them as tokens. A problem can arise in segmentation when word fragments that should be part of the lemma are mistaken for part of the prefix or suffix of the word and are thus separated from the rest of the word during tokenization. This problem arises in Named Entity Recognition when the ending character n-grams of a Named Entity are mistaken for object or personal/possessive pronouns and are separated by tokenization19. Moreover, the POS tagger used for the training and test data may produce some incorrect tags, incrementing the noise factor even further.
Another morphological challenge, highlighted by Ref. 46, concerns the relationships between words. The syntactic relationship that a word has with the other words in the sentence shows itself in its inflectional endings, and not in its position relative to the other words in that sentence. For example, in "المعلم المخلص يحترمه طلابه" (Al Mu'alim al-mukhlis yahtarimaho Tulabaho, the faithful teacher is respected by his students), the suffix pronoun "ـه" (Heh) in the two words "يحترمه" (yahtarima-ho, respected-him) and "طلابه" (Tulaba-ho, students-his) refers to the word "المعلم" (Al Mu'alim, the teacher).
Generally, Arabic computational morphology is challenging because the morphological structure of Arabic also comprises a predominant system of clitics. These are morphemes that are grammatically independent, but morphologically dependent on another word or phrase47. Consequently, since the same word can be joined to various affixes and clitics, the vocabulary is much larger for Arabic data than for languages with less complex morphology. The following Arabic words: "مكتوب" (Maktoob, Written), "كتابات" (Kitabat, Writings), "كاتب" (Katib, Writer), "كتاب" (Kitab, Book), "كتب" (Kutob, Books), "مكتب" (Maktab, Office), "مكتبة" (Maktabah, Library), "كتابه" (Kitabah, Writing) are all derived from the same trilateral origin verb of three consonants, "كتب" (Ktb, Wrote), and they all refer to the same concept. To extract the stem from such words, there are two types of stemming. The first type is light stemming, which removes the affixes (prefixes, infixes, and suffixes) that are formed from combinations of the letters of the word "سألتمونيها" (sa'altamuniha). The second type is called heavy stemming (i.e. root stemming), which extracts the root of the word and implicitly includes light stemming48,49.
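A toy light stemmer in the spirit of this description (illustrative only; production stemmers48,49 use much larger affix inventories and additional guards) could look like this:

    LIGHT_PREFIXES = ['ال', 'و', 'لل']   # illustrative subset
    LIGHT_SUFFIXES = ['ات', 'ها', 'ة']

    def light_stem(word):
        """Strip at most one known prefix and one known suffix,
        keeping a core of at least three letters."""
        for p in LIGHT_PREFIXES:
            if word.startswith(p) and len(word) - len(p) >= 3:
                word = word[len(p):]
                break
        for s in LIGHT_SUFFIXES:
            if word.endswith(s) and len(word) - len(s) >= 3:
                word = word[:-len(s)]
                break
        return word

    print(light_stem('كتابات'))   # -> كتاب
    print(light_stem('المكتبة'))  # -> مكتب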

2.2.3. Annexation
Another morphological challenge in the Arabic language is that we can compose one word with another, forming a compound of two words. The compound can combine nouns, verbs, or particles. Although this is not common in traditional Arabic, it is used in Modern Standard Arabic. Usually, the compound word is semantically transparent, such that its meaning is compositional in the sense that the meaning of the whole is equal to the meaning of the parts put together50. For example, the word "رأسمالية" (capitalism, rasimalia) comes from the compound of the two nouns "رأس المال" (capital, ras almal); the word "مادام" (as long as, madam) comes from the compound of the particle "ما" (ma) and the verb "دام" (dam); and the word "كيفما" (however) comes from the compound of the two particles "كيف" (kayf) and "ما" (ma). The meaning of a compound word is important for understanding Arabic text, which is a challenge for POS tagging and for applications that require semantic processing51.

2.3. Syntax is intricate

Historically, as Islam spread, the Arab grammarians wanted to lay down grammar rules that would prevent the incorrect reading of the Holy Qur'an. Arabic syntax is intricate, and automating the analysis of Arabic sentences is truly a challenging problem from the computational perspective34.
Arabic grammar distinguishes between two types of sentences: verbal and nominal. Verbal sentences usually begin with a verb, and they have at least a verb ("فعل", faeal) and a subject ("فاعل", faeil). The subject, as well as the object, can be indicated by the conjugation of the verb rather than written separately. For example, the conjugated verb "شاهدتك" (I saw you, saw-I-you, shahidtuk) has subject and object suffix pronouns attached to it. Another example of a verbal sentence is "يدرس الولد" (studying the boy, yadrus alwald). This type of sentence does not occur in English: English sentences begin with a subject followed by a verb, for example, "the boy is studying".
In Arabic, a nominal sentence begins with a noun or a pronoun. The nominal sentence has two parts: a subject or topic, "مبتدأ" (mubtada), and a predicate, "خبر" (khabar). Nominal sentences come in two types: with or without a verb. The nominal verbless sentence is a typical noun phrase. When the nominal sentence is about being, some languages, such as English, require the presence of the linking verb "to be" (i.e. a copula) in the sentence; this verb is not expressed in Arabic. Instead, it is implied and understood from the context. For example, "الطقس جميل" (alttaqs jamil) has two nouns without a verb; its English translation is "The weather [is] wonderful". This can be confusing to second-language learners who speak European languages and are used to having a verb in each sentence52,53. Arabic grammar allows complex sentence structure formation, which is discussed in the following subsections.

2.3.1. Multi-word expressions

Multi-word expressions are very important constructs because their total semantics usually cannot be determined by adding up the semantics of their parts. For example, the multi-word expression "بالحديد والنار" (by force, bialhadid walnnar) consists of two words whose literal meanings are "حديد" (iron, hadid) and "نار" (fire, nar). Another example is the medical term "فقر الدم" (anemia, faqar alddam), which consists of two words whose literal meanings are "فقر" (poverty, faqar) and "دم" (blood, dam). These non-decomposable lexicalized phrases are syntactically unalterable units that cannot capture the effects of inflectional variation. Thus, they can cause problems in Machine Translation, Information Retrieval and Text Summarization, among other NLP applications. Such expressions are termed idiomatic multi-word expressions. Other multi-word expressions are words that co-occur together more often than not, but with transparent compositional semantics, such as "رئيس الدولة" (the president of the country, rayiys alddawla). As such, they do not pose a challenge for NLP applications. Such expressions can still be of interest if we categorize them into types, as in Named Entity Recognition, where they serve as contextual cues.
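Because idiomatic multi-word expressions are non-compositional, they are typically handled by lookup before any compositional analysis. A minimal sketch (with a hypothetical two-word idiom gazetteer) that merges known idioms into single tokens:

    IDIOMS = {('بالحديد', 'والنار'): 'by force',
              ('فقر', 'الدم'): 'anemia'}

    def chunk_idioms(tokens):
        """Greedily merge known two-word idioms into single units."""
        out, i = [], 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if pair in IDIOMS:
                out.append('_'.join(pair))   # keep the idiom as one token
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    print(chunk_idioms(['يعاني', 'من', 'فقر', 'الدم']))
    # -> ['يعاني', 'من', 'فقر_الدم']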

2.3.2. Anaphora resolution


Anaphora Resolution is specifically concerned with matching up
particular entities or pronouns with the nouns or names that they refer to.
This is very important since without it a text would not be fully and
correctly understood, and without finding the proper antecedent, the
meaning and the role of the anaphor cannot be realized. Anaphora occurs
very frequently in written texts and spoken dialogues. Almost all NLP
applications such as Machine Translation, Information Extraction,
Automatic Summarization, Question Answering, etc., require successful
identification and resolution of anaphora54.
Anaphora Resolution is classically recognized as a very difficult problem in NLP. It is a challenging task that is very time-consuming and requires significant effort from the human annotator and from the NLP system in order to understand and resolve references to earlier or later items in the discourse.

Ambiguous Anaphora
Pronominal anaphora is a very widely used type in the Arabic language; such pronouns have an empty semantic structure and no meaning independent of their antecedent, the main subject. The pronoun can be a third person pronoun, called "ضمير الغائب" (damir alghayib) in Arabic, such as "ها" /hA/ (her/hers/it/its), "ه" /h/ (him/his/it/its), "هم" /hm/ (masculine: them/their), and "هن" /hn/ (feminine: them/their).
As an example that shows the challenges of pronominal anaphora for NLP tasks, consider the result of using Google Translate© to translate two Arabic sentences into English55:

رأيت القطة، فأعطيتها الطعام
Transliteration: ra'ayt alquttah, fa'aetiatuha alttaeam
• Google translation: I saw the cat, so I gave her food
• Correct translation: I saw the cat, so I gave it food

رأيت الطفلة، فأعطيتها الطعام
Transliteration: ra'ayt alttaflah, fa'aetiatha alttaeam
• Google translation: I saw the little girl, so I gave her food
• Correct translation: I saw the little girl, so I gave her food

The machine translation system fails to identify the correct antecedent indicated by the third person pronoun "ها" /hA/ (her/hers/it/its), and external knowledge is thus needed in order to identify this antecedent correctly. There are differences between the Arabic and English pronominal systems, and Arabic is rich in morphology. Arabic third person pronouns are commonly encliticized, which makes them ambiguous. Arabic pronominals do not differentiate linguistically between the values of the humanity feature, i.e. ±human. As a result, both the -HUMAN FEMININE noun "القطة" (the cat) and the +HUMAN FEMININE noun "الطفلة" (the little girl) cause ambiguity in the translated English sentence.

Hidden Anaphora
Another major kind of anaphora is hidden anaphora. It is restricted to the subject position, when there is no overt noun or pronoun acting as the subject. This is evident in the following sentence: "الملاحظة على اللوح، معقدة" (The note on the board, complex), where the pronoun "هي" (she/it) is not present in the sentence, i.e. "الملاحظة على اللوح، هي معقدة"; this is called "zero anaphora". The human mind can determine the hidden anaphora (antecedent), but it causes grammatical mistakes in automated NLP systems.
2.3.3. Syntactically flexible text sequence

Syntactically flexible expressions exhibit a much wider range of syntactic variability, and the possible types of variation take the form of Verb-Subject-Object constructions13.
Arabic is generally a free word order language. While the basic word order in Classical Arabic and Modern Standard Arabic is verb-subject-object (VSO), they also permit subject-verb-object (SVO), object-subject-verb (OSV) and object-verb-subject (OVS). It is common to use SVO in newspaper headlines, and the Arabic dialects display the SVO order. The word order disparity is depicted in Table 5. This makes sentence generation in Arabic NLP applications a challenge. For example, in a question-answering system, the answer to the question "أين كتاب هدى؟" (Where is Hoda's book?, 'ayn kitab hudaa?) could be any of the sentences shown in Table 5, each indicating that Huda sold the book.

Table 5. Word order disparity.

Example in Arabic | Transliteration (gloss) | English Translation | Order
باعت هدى الكتاب | sold Huda-NOM book-ACC | Huda sold the book | VSO
هدى باعت الكتاب | Huda-NOM sold book-ACC | Huda, she sold the book | SVO
الكتاب هدى باعته | DEF-book-NOM Huda-NOM sold-it | The book, Huda sold it | OSV
الكتاب باعته هدى | DEF-book-NOM sold-it Huda-NOM | The book, Huda sold it | OVS

It is interesting to note the placement of the word "كتاب" (book, kitab) in Table 5. VSO does not topicalize any constituent as old information; a discourse-initial sentence cannot contain new components presented as given. This confirms that VSO does not focus a specific constituent, as opposed to the other orders, which cannot be used discourse-initially precisely because they each focus a specific constituent53.
The Arabic case system sometimes fails to unambiguously mark grammatical arguments. This happens especially when the case marker, which is always attached at the end of the noun, cannot be realized because the noun ends with a long vowel rather than a consonant. When this happens, the interpretation of word order becomes strictly VSO, as opposed to VOS.
Additional evidence comes from a study of syntactic structures in the language, in which we find that VSO has the widest distribution. Embedded clauses, however, may display both SVO and VSO orders56.

2.3.4. Agreement
Agreement is a major syntactic principle that affects the analysis and generation of an Arabic sentence, and it is very significant for difficult NLP applications such as Machine Translation and Question Answering13,47. Agreement in Arabic is full or partial and is sensitive to word order effects1. An adjective in Arabic usually follows the noun it modifies, "الموصوف" (almawsuf), and fully agrees with it with respect to number, gender, case, and definiteness, e.g. "الولد المجتهد" (The diligent boy, alwald almujtahad) and "الأولاد المجتهدون" (The diligent boys, al'awlad almujtahidun). The verb is marked for agreement depending on the word order of the subject relative to the verb; see Fig. 1.

Fig. 1. Agreement patterns in verb-subject vs. subject-verb word order.

The verb in Verb-Subject-Object order agrees with the subject in gender, e.g. "جاء الولد / الأولاد" (came the-boy/the-boys, ja' alwalad / ja' al'awlad) versus "جاءت البنت / البنات" (came the-girl/the-girls, ja'at albint / ja'at albanat). In Subject-Verb-Object (SVO) order, the verb agrees with the subject with respect to both number and gender, e.g. "الولد جاء / الأولاد جاءوا" (came the-boy/the-boys) versus "البنت جاءت / البنات جئن" (came the-girl/the-girls). In Aux-subject-verb word order, the auxiliary agrees only in gender, while the main verb agrees in both gender and number, e.g. "كانت البنت تأكل الطعام / كانت البنات تأكلن الطعام" (the-girl was/the-girls were eating the food, kanat albint takul alttaeam / kanat albanat takulun alttaeam). If the subject precedes the auxiliary, then both verbs agree with it in both gender and number: "البنت كانت تأكل الطعام / البنات كن يأكلن الطعام" (albint kanat takul alttaeam / albanat kunn yakuln alttaeam).
Other agreement relations also exist between numbers and counted nouns57. Number–counted noun agreement is governed by a set of complex rules for determining the literal number form that agrees with the counted noun with respect to gender and definiteness. In Arabic, the literal generation of numbers is classified into the following categories: digits, compounds, decades, and conjunctions. The case markings depend on the number–counted noun expression within the sentence. In the following example, the number "خمسة" (five) and the (broken plural) counted noun "متاحف" (museums [fem.pl]) need to agree in gender and definiteness:

الأولاد زاروا خمسة متاحف
al'awlad zaruu khmst matahif
the-boys visited-they five.fem.sg museum.fem.pl
The boys visited five museums
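As a toy illustration of how the verb agreement patterns above can be operationalized in generation (the feature encoding below is a hypothetical sketch, not the chapter's system), the verb form can be selected from the subject's features and the chosen word order:

    # Past-tense forms of 'came', indexed by (gender, number).
    CAME = {('m', 'sg'): 'جاء', ('f', 'sg'): 'جاءت',
            ('m', 'pl'): 'جاءوا', ('f', 'pl'): 'جئن'}

    def verb_form(gender, number, order):
        """In VSO the verb agrees in gender only (it stays singular);
        in SVO it agrees in both gender and number."""
        if order == 'VSO':
            number = 'sg'
        return CAME[(gender, number)]

    print(verb_form('m', 'pl', 'VSO'))  # -> جاء   (came the-boys)
    print(verb_form('m', 'pl', 'SVO'))  # -> جاءوا (the-boys came)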

3. Conclusion

Arabic as a language is both challenging and interesting. In this chapter, we delved into the basics of word and sentence structure, and into the relationships among sentence elements. This should help readers appreciate the complexity associated with Arabic NLP. The challenges of the Arabic language were depicted by giving examples in MSA. It was found that, although Arabic is a phonetic language in the sense that there is a one-to-one mapping between the letters of the language and the sounds with which they are associated, an Arabic word does not dedicate letters to represent short vowels. Letters change form depending on their place in the word, and there is no notion of capitalization. As for MSA texts, short vowels are optional, which makes it even more difficult for non-native speakers of Arabic to learn the language and presents challenges for analyzing Arabic words. Morphologically, the word structure is both rich and compact, such that a word can represent a phrase or a complete sentence. Syntactically, the Arabic sentence is long, with complex syntax. Arabic anaphora increases the ambiguity of the language, as in some cases a Machine Translation system fails to identify the correct antecedent because of its ambiguity; external knowledge is needed to determine the antecedent correctly. Moreover, Arabic sentence constituents (free word order) can be swapped without affecting structure or meaning, which adds more syntactic and semantic ambiguity and requires more complex analysis. Furthermore, agreement in Arabic is full or partial and is sensitive to word order effects.
The Arabic language differs from other languages because of its complex and ambiguous structure, which the computational system has to deal with at each linguistic level.

References

1. Farghaly and K. Shaalan, Arabic Natural Language Processing: Challenges and


Solutions, ACM Transactions on Asian Language Information Processing (TALIP),
the Association for Computing Machinery, ACM, 8(4):1/22 (2009).
2. R. Al-Shalabi and R. Obeidat, Improving KNN Arabic Text Classification with N-
grams based Document Indexing, In Proceedings of the Sixth International Conference
on Informatics and Systems, Cairo, Egypt, pp. 108/112 (2008).
3. L., Abd El Salam, M. Hajjar and K. Zreik, Classification of Arabic Information
Extraction methods, MEDAR 2009, 2nd International Conference on Arabic Language
Resources and Tools, pp. 71/77, Cairo, Egypt (2009).
4. N. Farra, E. Challita, R. A. Assi and H. Hajj, Sentence-level and document-level
sentiment mining for Arabic texts, In Data Mining Workshops (ICDMW), 2010 IEEE
International Conference on, IEEE, pp. 1114/1119 (2010).
5. K., Dave, S., Lawrence and D. M. Pennock, Mining the peanut gallery: Opinion
extraction and semantic classification of product reviews, In Proceedings of the 12th
international conference on World Wide Web (pp. 519-528), ACM (2003).
80 Khaled Shaalan et al.

6. S. Ghosh, S. Roy and S. Bandyopadhyay, A tutorial review on Text Mining


Algorithms, International Journal of Advanced Research in Computer and
Communication Engineering, 1(4), (2012).
7. F. Harrag, E. El-Qawasmeh and P. Pichappan, Improving Arabic text categorization
using decision trees, In First International Conference on Networked Digital
Technologies, NDT'09, IEEE, pp. 110/115 (2009).
8. J. Wiebe and E. Riloff, Creating subjective and objective sentence classifiers from
unannotated texts, In Computational Linguistics and Intelligent Text Processing, pp.
486-497, Springer, Berlin Heidelberg (2005).
9. N. Y. Habash, Introduction to Arabic natural language processing, Synthesis Lectures
on Human Language Technologies, 3(1), 1/187 (2010).
10. N. Habash, Syntactic preprocessing for statistical machine translation, In Proceedings
of the 11th MT Summit XI, pp. 215/222 (2007).
11. K. Shaalan, M. Magdy and A. Fahmy, Analysis and Feedback of Erroneous Arabic
Verbs. Journal of Natural Language Engineering (JNLE), Cambridge University Press,
UK, 21(2):271/323 (2015).
12. S. Izwaini, Problems of Arabic machine translation: evaluation of three systems. The
British Computer Society (BSC), London, pp. 118/148 (2006).
13. S. Ray and K. Shaalan, A Review and Future Perspectives of Arabic Question
Answering Systems, IEEE Transactions on Knowledge and Data Engineering,
28(12):3169-3190, IEEE, (2016). DOI: 10.1109/TKDE.2016.2607201.
14. M. Korayem, D. Crandall, and M. Abdul-Mageed, Subjectivity and sentiment analysis
of Arabic: A survey, In Advanced Machine Learning Technologies and Applications,
Springer Berlin Heidelberg, pp. 128/139 (2012).
15. N. Habash and O. Rambow, MAGEAD: a morphological analyzer and generator for
the Arabic dialects, In Proceedings of the 21st International Conference on
Computational Linguistics and the 44th annual meeting of the Association for
Computational Linguistics, Association for Computational Linguistics (ACL), pp.
681/688 (2006).
16. Elgibali, Investigating Arabic: current parameters in analysis and learning, Studies in
Semitic Languages and Linguistics series, Vol. 42, Brill. (2005).
17. Abdel Monem, K. Shaalan, A. Rafea, and H. Baraka, Generating Arabic Text in
Multilingual Speech-to-Speech Machine Translation Framework, Machine
Translation, Springer, Netherlands, 22(4): 205/258 (2008).
18. K. Ryding, A Reference Grammar of Modern Standard Arabic. Cambridge University
Press, New York (2005).
19. K. Shaalan, A Survey of Arabic Named Entity Recognition and Classification,
Computational Linguistics, MIT Press, USA, 40(2):469/510 (2014).
20. P. Daniels, The Oxford Handbook of Arabic Linguistics, The Arabic writing system,
Ed. J. Owens (2013).DOI:10.1093/oxfordhb/9780199764136.013.0018.
21. M. N. Al-Kabi, I. M. Alsmadi, A. H. Gigieh, H. A. Wahsheh and M. M. Haidar,
Opinion Mining and Analysis for Arabic Language, IJACSA) International Journal of
Advanced Computer Science and Applications, 5(5), 181/195 (2014).
22. H. Abo Bakr, K. Shaalan and I. Ziedan, and I., A Hybrid Approach for Converting
Written Egyptian Colloquial Dialect into Diacritized Arabic, In the Proceedings of The
Challenges in Arabic Natural Language Processing 81

6th International Conference on Informatics and Systems, INFOS2008, the special


track on Natural Language Processing, 27-29 March, Cairo, Egypt (2008).
23. E. Refaee and V. Rieser, An Arabic twitter corpus for subjectivity and sentiment
analysis, In Proceedings of the Ninth International Conference on Language Resources
and Evaluation (LREC’14), Reykjavik, Iceland, European Language Resources
Association (ELRA) (2014).
24. S. Siddiqui, A. Abdel Monem and K. Shaalan, Sentiment Analysis in rabic, Natural
Language to Information Systems: 21st International Conference on Applications of
Natural Language to Information Systems (NLDB 2016), Eds. E. Métais, F. Meziane,
F. Saraee, V. Sugumaran, and S. Vadera, Lecture Notes in Computer Science (LNCS
9612), Chapter 41, pp. 409/414, Springer, Berlin, Heidelberg (2016).
25. S. Siddiqui, A. Abdel Monem and K. Shaalan, Towards Improving Sentiment Analysis
in Arabic, In Proceedings of the International Conference on Advanced Intelligent
Systems and Informatics 2016, Volume 533 of the Series Advances in Intelligent
Systems and Computing, Eds. A E. Hassanine, K. Shaalan, T. Gaber, A. Ahmad, F.
Tolba, pp. 114/123, Springer (2017).
26. Al-Sughaiyer and I. Al-Kharashi, Arabic Morphological Analysis Techniques: A
Comprehensive Survey, Journal of the American Society for Information Science and
Technology, 55(3):189–213 (2004).
27. Darwish, Building a Shallow Arabic Morphological Analyzer in One Day. In
Proceedings of the ACL Workshop on Computational Approaches to Semitic
Languages, Philadelphia, PA, pp. 1–8 (2002).
28. R. Beesley, Finite-state morphological analysis and generation of Arabic at Xerox
Research: Status and plans in 2001, In ACL Workshop on Arabic Language
Processing: Status and Perspective, Vol. 1, pp. 1/8 (2001).
29. R. Ibrahim, A. Khateb and H. Taha, How Does Type of Orthography Affect Reading
in Arabic and Hebrew as First and Second Languages? Open Journal of Modern
Linguistics, 3(1):40/46 (2013).
30. Said, M. El-Sharqwi, A. Chalabi, and E. Kamal, A Hybrid Approach for Arabic
Diacritization, Natural Language Processing and Information Systems: 18th
International Conference on Applications of Natural Language to Information
Systems, NLDB 2013, pp. 53-64, Salford, UK, 2013, Springer, Berlin, Heidelberg,
June 19-21 (2013).
31. Attia, P. Pecina, Y. Samih, K. Shaalan and J. Van Genabith, Arabic Spelling Error
Detection and Correction. Journal of Natural Language Engineering (JNLE),
Cambridge University Press, UK, 22(5):751/773 (2016).
32. Oudah and K. Shaalan, NERA 2.0: Improving coverage and performance of rule-based
named entity recognition for Arabic, Journal of Natural Language Engineering (JNLE),
23(3):441/472, Cambridge University Press, UK (2017). DOI:
10.1017/S1351324916000097.
33. Oudah and K., Shaalan, Studying the impact of language-independent and language-
specific features on hybrid Arabic Person name recognition, Language Resources &
Evaluation, 51(2):351/378, Springer (2017). DOI:10.1007/s10579-016-9376-1.
34. E. Othman, K. Shaalan and A. Rafea, Towards Resolving Ambiguity in Understanding
Arabic Sentence, In the Proceedings of the International Conference on Arabic
82 Khaled Shaalan et al.

Language Resources and Tools, NEMLAR, 22nd–23rd Sept., Egypt, pp. 118/122
(2004).
35. Azmi and R. Almajed, A survey of automatic Arabic diacritization techniques, Natural
Language Engineering, Cambridge University Press, UK, 21(3):477/495 (2015).
36. S. Abu-Rabia, The Role of Vowels in Reading Semitic Scripts: Data from Arabic and
Hebrew, Reading and Writing: An Interdisciplinary Journal, 14, 39/59 (2001). DOI:
10.1023/A:1008147606320.
37. Farghaly, Three Level Morphology for Arabic, presented at the Arabic Morphology
Workshop, Linguistics Summer Institute, Stanford, CA, (1987).
38. T. McCarthy, The critical theory of Jurgen Habermas, Studies in Soviet Thought,
Springer, Berlin Heidelberg, 23(1):77/79 (1982).
39. Soudi, G. Neumann and A. Bosch, Arabic computational morphology: knowledge-
based and empirical methods, vol. 38, Springer, Dordrecht (2007).
40. Shoukry and A. Rafea, Sentence-level Arabic sentiment analysis, 2012 International
Conference on Collaboration Technologies and Systems (CTS), Denver, CO, USA,
2012, pp. 546/550 (2012). DOI: 10.1109/CTS.2012.6261103.
41. S. S. Al-Fedaghi and F. Al-Anzi., A New Algorithm to Generate Arabic Root-Pattern
forms, In Proceedings of the 11th national Computer Conference and Exhibition, pp.
391/400 (1989).
42. N. De Roeck and W. Al-Fares, A morphologically sensitive clustering algorithm for
identifying Arabic roots, In Proceedings of the 38th Annual Meeting on Association
for Computational Linguistics, Association for Computational Linguistics, pp. 199/206
(2000).
43. S. Mesfar, Towards a cascade of morpho-syntactic tools for Arabic natural language
processing, In Computational Linguistics and Intelligent Text Processing, Springer
Berlin Heidelberg, pp. 150/162 (2010).
44. Y., Benajiba, M. Diab and P. Rosso, Arabic named entity recognition using optimized
feature sets, In Proceedings of the Conference on Empirical Methods in Natural
Language Processing, Association for Computational Linguistics, pp. 284/293 (2008).
45. Y. Benajiba, P. Rosso and M. J. Bened, ANERsys: An Arabic Named Entity
Recognition system based on Maximum Entropy, In Proc. of CICLing-2007, Springer-
Verlag, LNCS series (4394), pp. 143/153 (2007).
46. K. Thakur, Genitive Construction in Hindi. M. Phil Thesis, University of Delhi, India
(1997).
47. K. Shaalan, Arabic GramCheck: A Grammar Checker for Arabic, Software Practice
and Experience, John Wiley & sons Ltd., UK, 35(7):643-665 (2005).
48. M. N. Al-Kabi, S. Kazakzeh, B. Abu Atab, S. Al-Rababah and S. Alsmadi, A Novel
Root based Arabic Stemmer, Journal of King Saud University, Computer and
Information Sciences, 27(2):94–103 (2015). DOI: 10.1016/j.jksuci.2014.04.001
49. H. K. AlAmeed, S. O. AlKitbi, A. A. AlKaabi, K. S. AlShebli, N. F. AlShamsi, N. H.
AlNuaimi, and S. S. AlMuhairi, Arabic Light Stemmer: A new enhanced approach, In
Proceedings of the Second International Conference on Innovations in Information
Technology (IIT'05), Dubai, UAE (2005).
50. W. M. Amer, Compounding in English and Arabic: A contrastive study, Technical Report (2010), available online at: http://site.iugaza.edu.ps/wamer/files/2010/02/Compounding-in-English-and-Arabic.pdf
51. S. Elkateb, W. Black, P. Vossen, D. Farwell, H. Rodríguez, A. Pease and M. Alkhalifa,
Arabic WordNet and the challenges of Arabic, In Proceedings of Arabic NLP/MT
Conference, London, UK (2006).
52. K. Shaalan, An Intelligent Computer Assisted Language Learning System for Arabic Learners, Computer Assisted Language Learning: An International Journal, Taylor & Francis Group Ltd., 18(1 & 2):81–108 (2005).
53. B. Hammo, A. Moubaiddin, N. Obeid and A. Tuffaha, Formal Description of Arabic Syntactic Structure in the Framework of the Government and Binding Theory, Computación y Sistemas, 18(3):611–625 (2014).
54. S. Hammami, L. Belguith and A. Hamadou, Arabic Anaphora Resolution: Corpora Annotation with Co-referential Links, The International Arab Journal of Information Technology, 6(5):481–489 (2009).
55. R. Al-Sabbagh and K. Elghamry, Arabic Anaphora Resolution: A Distributional, Monolingual and Bilingual Approach, Faculty of Al-Alsun, Ain Shams University, Cairo, Egypt (2002).
56. S. Usama, On issues of Arabic syntax: An essay in syntactic argumentation, Brill's Annual of Afroasiatic Languages and Linguistics, pp. 236–280 (2011).
57. M. Shquier and T. Sembok, Word agreement and ordering in English-Arabic machine translation, 2008 International Symposium on Information Technology, IEEE Xplore, Kuala Lumpur, pp. 1–10 (2008). DOI: 10.1109/ITSIM.2008.4631625.
Chapter 4

Arabic Recognition Based on Statistical Methods

A. Belaïd∗ and A. Kacem Echi∗∗



LORIA, Campus scientifique, 54506 Vandoeuvre-lès-Nancy, France
abdel.belaid@loria.fr
∗∗ Université de Tunis-LaTICE, 5 Av. Taha Hussein Montfleury,
Tunis 1008, Bab Menara, Tunisia
ff.Kacem@gmail.com

Arabic recognition is still a major challenge for the scientific community. Several approaches to address this challenge have been attempted in the last ten years, but significant work remains before large-scale, commercially viable systems can be built. In this chapter, we first discuss the characteristics of Arabic script and give a brief overview of the feature extraction techniques proposed in past works to characterize and recognize Arabic script. These techniques attempt to extract the feature vector that will be used in the recognition engine. We then investigate the use of machine learning techniques, mainly generative and discriminative statistical models, for Arabic recognition. As generative methods, we propose one-dimensional, two-dimensional and planar Hidden Markov Models (HMMs). To increase the representational power of the HMM, Dynamic Bayesian Networks (DBNs) are explored. In an attempt to benefit from both the dimensionality and the temporality of the models, a novel approach is proposed which integrates causal Markov Random Fields into two-dimensional modeling with HMMs. We then show different applications of this model for analytic recognition and syntactic analysis. As discriminative methods, we use Transparent Neural Networks (TNNs) to recognize a large vocabulary of Arabic words, based on a cognitive model where learning is replaced by an activation process that considers each node's neighborhood.

1. Introduction

The Arabic script has been studied for several decades. Despite the complexity of its morphology, due to its cursive aspect and the presence of many diacritic signs, several systems are functional and give very encouraging results, matching those obtained for Latin handwriting. The main objective of this chapter is to show the progress that we have obtained on Arabic over several decades using machine learning techniques. As feature extraction remains the most important step for achieving high recognition performance, we first give a brief overview of the feature extraction techniques that we proposed in past works to characterize and recognize the Arabic script.
The remainder of this chapter is organized as follows: Section 2 discusses the characteristics of Arabic script. Section 3 reviews some feature extraction techniques. Section 4 introduces the machine learning framework, contrasting generative and discriminative models. Section 5 focuses on Markov models as generative models. Section 6 shows an example of neural networks used in the context of a large vocabulary; it illustrates the combination of three classifiers for the recognition of decomposable words. Section 7 discusses directions for future work and conclusions.

2. A Challenging Morphology

The Arabic script has complex morphological properties that make its automatic recognition a constant challenge [1]. Because consecutive letters in a word attach naturally to one another, a letter's shape varies depending on the connection type, which also influences its terminal stroke. Moreover, in handwriting, the division of the word into several parts (PAWs) gives more freedom in the writing of each PAW and creates a zigzag in the baseline, which distorts the main guide for feature extraction.
If we consider that there are two main feature families for writing recognition, structural and statistical [2], the structural ones are those that best capture the morphological appearance of the Arabic script, and they received our attention throughout our research on Arabic recognition. The morphological aspect of structural features is exhibited in two elements: regularities and singularities (see Figure 1). The regularities correspond to the flat part in the middle of the word representing the elongations between characters; even though this part contains no information, its location is synonymous with the baseline. The singularities, in contrast, are rich in information and contain the real characteristics of the word morphology, such as ascenders, descenders, diacritic signs, loops and accents.
The positions of certain features, such as the Alif, the descender letters (those whose tails drop below the baseline) and the ascender letters (those that rise above it), are quite informative about their location in a word or a PAW.

Fig. 1. Regularities and Singularities composing the word.


Figure 2 shows that tracking the Alif and the descenders (see Figure 2(a)) may be sufficient to segment a text line almost completely into words. The Alif is often positioned at the beginning of a word (red marks in Figure 2(b)), while a descender is often placed at the end of a PAW or word (green marks in Figure 2(b)). This is why, in most of our works, features are accompanied by their local position in the word. To standardize this position, we generally use the horizontal positions "beginning, middle, end" and the vertical positions "up, down".

Fig. 2. Feature location of the Alif and the descenders.

3. Feature Extraction Techniques

In [3], Khémiri et al. detected the presence of letters without delimiting them, and thus obtained a global view of words while avoiding segmentation problems. In fact, one of the major problems in recognizing unconstrained cursive words is the segmentation process, since poor segmentation contributes heavily to recognition errors. For that reason, the authors “divided” the word image into three columns C1, C2 and C3 and three rows R1, R2 and R3, as an Arabic word is written in horizontal bands from right to left and top to bottom (see Figure 11(b)). This is treated here as a segmentation-free method, since these columns and rows are not really representative of any real segmentation of the word and are merely a convenient way to pass the image to an HMM or a DBN. However, implicit word segmentation occurs during decoding. A potential benefit of this word decomposition is that extracting features from the word image's columns and rows allows the visual and sequential aspects of handwriting recognition to be learned together, rather than treated as two separate problems. Some structural features such as loops, diacritics, ascenders (or stems) and descenders (or legs), considering their type, number and position in the word, as well as the number and position of PAWs, are then extracted (see Figure 11(a)). Note that Arabic handwritten words are not usually written on a single baseline, so the authors extracted a sequence of sub-baselines and formed the entire word baseline by juxtaposition of its PAW baselines (see Figure 3).

Fig. 3. Baseline extraction.

As can be seen, these structural features serve to describe the topological and geometrical characteristics of the words. We believe that words can be represented by this type of feature with high tolerance to distortions and style variations. In addition, this type of representation may also encode some knowledge about word structure, or may provide some indication of what sort of components make up the word.
Recently, Saïdani et al. proposed different methods to discriminate between machine-printed/handwritten and Arabic/Latin words based on texture features, mainly black run lengths (BRL) [4], histograms of oriented gradients (HOG), pyramid HOG (PHOG) and co-occurrence matrices of oriented gradients (Co-MOG) [5]. The idea is to exploit the writing orientation as a discriminative descriptor between Arabic and Latin scripts. In fact, letters in Arabic words, especially handwritten or italic machine-printed ones, being written from right to left, are generally tilted to the left, following the writing direction (see Figure 4(a)). In contrast, letters in Latin script, especially handwritten or italic machine-printed ones, written from left to right, tend to be inclined to the right (see Figure 4(b)). Thus, Arabic letter strokes are generally diagonally down, whereas those written in Latin are diagonally up. Furthermore, machine-printed Arabic words are characterized by the use of horizontal ligatures, whose length depends on the font used (see Figure 4(c)). Conversely, machine-printed Latin words are composed of successive letters without any ligature between them (see Figure 4(d)). Consequently, horizontal strokes are more frequent in Arabic words than in Latin words. Both scripts use vertical strokes for ascenders.

Fig. 4. Machine-printed/Handwritten and Arabic/Latin word identification based on the writing orientation.
To capture the coarseness of a texture in specified directions, BRL-based features are used [4]. Recall that a black run is a set of consecutive, co-linear black pixels; the number of pixels in the run represents its length. For a given image, a BRL vector P is defined as follows: each element P(i) represents the number of black-pixel runs of length i in a given direction. The BRL vector's size is M, which corresponds to the maximum run length in words. An orientation is defined using a displacement vector d(x, y), where x and y are the displacements along the x-axis and y-axis respectively. The typical orientations are horizontal, right diagonal, vertical and left diagonal; calculating the run-length encoding for each direction thus produces four BRL vectors. The four BRL vectors are then concatenated into a single vector characterizing the word's script. Figure 5 illustrates the proposed feature extraction method with an example. Various texture features are then derived from the BRL vectors, which measure the distribution of short and long runs, the similarity of gray-level values and of run lengths throughout the word image, and the homogeneity and distribution of runs of the word image in a specific direction.
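To make the encoding concrete, the following is a minimal Python sketch of BRL extraction (not the authors' implementation; the binary-image convention, the run-length cap and the direction set are illustrative assumptions):

import numpy as np

def brl_vector(img, dx, dy, max_len=32):
    # img: binary image (1 = black pixel). P[i] counts black runs of length i+1
    # along direction (dx, dy); runs longer than max_len vote in the last bin.
    P = np.zeros(max_len, dtype=int)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            # a run starts at a black pixel with no black predecessor in (dx, dy)
            py, px = y - dy, x - dx
            if img[y, x] and not (0 <= py < h and 0 <= px < w and img[py, px]):
                n, cy, cx = 0, y, x
                while 0 <= cy < h and 0 <= cx < w and img[cy, cx]:
                    n += 1
                    cy, cx = cy + dy, cx + dx
                P[min(n, max_len) - 1] += 1
    return P

def brl_features(img):
    # horizontal, right-diagonal, vertical and left-diagonal runs, concatenated
    dirs = [(1, 0), (1, 1), (0, 1), (-1, 1)]
    return np.concatenate([brl_vector(img, dx, dy) for dx, dy in dirs])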
Being a shape descriptor, HOG has interesting properties for script characterization [4]. As shown in Figure 6, the HOG descriptor is a histogram that counts gradient orientations at the pixels of a given image. The number of features depends on the number of cells and orientation bins.
Fig. 5. Computing of BRL vectors: (a) binary image (b) four run-length vectors for
black pixels.

Fig. 6. Overview of HOG calculation.

While HOG counts occurrences of gradient orientations in localized portions of an image, PHOG captures perceptually salient features by taking into account the spatial properties of the local shape while representing the image by HOG. The spatial information is represented by tiling the image into regions at multiple resolutions, based on spatial pyramid matching. Each image is divided into a sequence of increasingly finer spatial grids by repeatedly doubling the number of divisions along each axis. The number of points in each grid cell is then recorded. The number of points in a cell at one level is simply the sum over those contained in the four cells it is divided into at the next level, thus forming a pyramid representation (see Figure 7). PHOG then consists of a HOG over each image sub-region at each resolution level. The distance between two PHOG image descriptors reflects the extent to which the images contain similar shapes that correspond in their spatial layout.

Fig. 7. A schematic illustration of PHOG at each resolution level.
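As an illustration, here is a small NumPy sketch of a PHOG-style descriptor (a simplified reading of the method: unsigned gradient orientations, magnitude-weighted histograms and L1 normalization are assumptions; the bin and level counts are arbitrary choices):

import numpy as np

def phog(img, levels=3, nbins=8):
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx) % np.pi          # unsigned orientation in [0, pi)
    feats = []
    for lv in range(levels):
        n = 2 ** lv                            # n x n grid at this pyramid level
        for ys in np.array_split(np.arange(img.shape[0]), n):
            for xs in np.array_split(np.arange(img.shape[1]), n):
                m = mag[np.ix_(ys, xs)].ravel()
                o = ori[np.ix_(ys, xs)].ravel()
                hist, _ = np.histogram(o, bins=nbins, range=(0, np.pi), weights=m)
                feats.append(hist)
    v = np.concatenate(feats)
    return v / (v.sum() + 1e-9)               # normalized pyramid descriptor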

Co-MOG is finally used to express the distribution of gradient information over an image [5]. It captures more spatial information than PHOG by counting the frequency of co-occurrences of oriented gradients between pairs of pixels. The relative locations are reflected by the offset between two pixels, as shown in Figure 8(a). The offset (Δx, Δy) specifies the distance between the pixel of interest and its neighbor. The yellow pixel in the center is the pixel under study, and the neighboring blue pixels are pixels at different offsets. Each neighboring pixel forms an orientation pair with the center pixel and accordingly votes into the co-occurrence matrix, as illustrated in Figure 8(b). The frequency of the co-occurrences of oriented gradients is captured at each offset via a co-occurrence matrix,
as shown in Figure 8(b).

Fig. 8. (a) Offset in Co-MOG, (b) Co-occurrence of a word image at a given offset, (c) Vectorization of co-occurrence matrix [5].
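A minimal NumPy sketch of the co-occurrence counting (the offset, the orientation quantization and the vectorization details here are illustrative choices, not those of [5]):

import numpy as np

def comog(img, offset=(0, 3), nbins=8):
    # Co-occurrence matrix of quantized gradient orientations between pixel
    # pairs separated by offset = (dy, dx); returned vectorized, cf. Figure 8(c).
    gy, gx = np.gradient(img.astype(float))
    ori = np.arctan2(gy, gx) % np.pi
    q = np.minimum((ori / np.pi * nbins).astype(int), nbins - 1)
    dy, dx = offset
    h, w = q.shape
    a = q[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    b = q[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
    C = np.zeros((nbins, nbins), dtype=int)
    np.add.at(C, (a.ravel(), b.ravel()), 1)  # each orientation pair votes once
    return C.ravel()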


In [6], Aouadi et al. proposed to segment touching characters (TCs) in Arabic manuscripts based on the shape context (SC) descriptor. The idea is to find, using a similarity metric computed from the shape context descriptor, the model most similar to the TC to be segmented, among models stored in a codebook together with their known segmented parts. Finding correspondences between a model and a TC consists in searching, for each point p_i of the TC's contour, for the best matching point q_j on the model's contour by comparing their edge points' shape context histograms, as illustrated in Figure 9.
Note that the shape context descriptor has the advantage of summarizing the global shape in a rich, local descriptor. It greatly simplifies the recovery of correspondences between the points of two given shapes, and it is tolerant to all common shape deformations. As a key advantage, no special landmarks or key-points are necessary.
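For illustration, here is a compact sketch of the log-polar shape context histogram of one contour point and of the χ2 distance used to compare such histograms (the bin counts and radial scaling are assumptions):

import numpy as np

def shape_context(points, idx, nr=5, ntheta=12):
    # points: (N, 2) contour points; returns the log-polar histogram of
    # point idx relative to all other contour points, normalized to sum to 1.
    rel = np.delete(points, idx, axis=0) - points[idx]
    r = np.log1p(np.hypot(rel[:, 0], rel[:, 1]))         # log-radial coordinate
    t = np.arctan2(rel[:, 1], rel[:, 0]) % (2 * np.pi)   # angular coordinate
    hist, _, _ = np.histogram2d(r, t, bins=(nr, ntheta),
                                range=((0, r.max() + 1e-9), (0, 2 * np.pi)))
    return hist.ravel() / len(rel)

def chi2_distance(h1, h2):
    # χ2 distance between two shape context histograms
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-9))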

Fig. 9. Matching with SC: (a) SC of p in TC1, (b) log-polar histogram for p, (c) SC of q in TC2, (d) log-polar histogram for q, which is similar to that in (b) but different from that of p′ in (f). The best match is between p in TC1 and q in TC2. Black bins correspond to a higher number of pixels in that bin; gray bins contain fewer pixels than the black ones. Log-polar histogram similarity is measured according to the χ2 distance.

4. Machine Learning Techniques

In the writing recognition area, there are several approaches using machine learning techniques based on probability theory. Some of them can be broadly characterized as either generative or discriminative, according to whether or not the distribution of the object features is modeled.
Both are prediction models, described by an observed variable X representing, for example, the features of a word image, and a hidden variable Y representing the word class. The objective of both is to predict Ŷ = argmax_Y P(Y/X). Using Bayes' rule, this is equivalent to Ŷ = argmax_Y P(X/Y)P(Y).
Generative approaches focus on the joint probability P(X, Y) and explicitly model the distribution of each class. “Generative” means that the samples are used to generate the observation probabilities. This is not easy, because one needs to estimate P(X) under weak independence hypotheses. The advantage of these approaches is essentially their speed in training, because only the data from class k is needed to learn the k-th model. However, they have several drawbacks: 1) they depend on model quality, 2) they model P(Y/X) only indirectly, and 3) with a lot of data points, they do not perform as well as discriminative methods.
Discriminative approaches focus directly on the conditional probability P(Y/X) and try to model the decision boundary. “Discriminative” means that the functions are estimated to discriminate between the answers. This is easier than in the previous case, because one only has to capture the differences between the X alternatives. Their advantage is that they are very fast once trained. Their drawbacks are: 1) they interpolate between training examples and can fail if novel inputs are presented, and 2) they do not easily handle compositionality.
In the Naïve Bayes (NB) classifier, we predict the class Y knowing the feature vector X. If we assume that the features are independent, the joint probability is the product of the probabilities of each vector component conditioned on the class. In a Hidden Markov Model (HMM), X is a data sequence observed in states and Y is the random variable distributed over the states; the observation probability is the product, over the whole sequence, of the joint probabilities of states and of observations in the states. Logistic regression maximizes the likelihood of a label (phenomenon) given explanatory data, assuming a log-linear model. The Conditional Random Field predicts sequences of labels from a sequence of observations, conditioned on the context.
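The generative/discriminative contrast can be made concrete with scikit-learn on synthetic "word feature" vectors: a Gaussian Naïve Bayes models the class-conditional distribution P(X/Y), while logistic regression models P(Y/X) directly (the data below are purely synthetic):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two word classes with Gaussian-distributed 4-dimensional feature vectors.
X = np.vstack([rng.normal(0.0, 1.0, (100, 4)), rng.normal(2.0, 1.0, (100, 4))])
y = np.repeat([0, 1], 100)

generative = GaussianNB().fit(X, y)              # estimates P(X/Y) and P(Y)
discriminative = LogisticRegression().fit(X, y)  # estimates P(Y/X) directly

x_new = rng.normal(1.0, 1.0, (1, 4))
print(generative.predict_proba(x_new))
print(discriminative.predict_proba(x_new))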

5. Markov Models

Recognition by Markov models operates in the probability domain. In this domain, we recognize a pattern by associating with it the label that maximizes the conditional probability of the label given the description of the pattern (the a posteriori probability of the label). Furthermore, a probabilistic model that incorporates multiple learning samples is capable of synthesizing the assignment probabilities of new patterns, which means that we have the conditional probability of the pattern given the model (the pattern likelihood).
Let X = x1, x2, ..., xn be a pattern to recognize and Y = y1, y2, ..., yn a possible label of X. If we assume that the pattern X is decomposable into sub-patterns, P(X/Y) can have the different interpretations described below.

5.1. Case 1: Decomposition of the shape/label


This corresponds to the case where there is a one-to-one (bi-univocal) relationship between the sub-patterns and their labels, and where these probabilities are independent. In this case, the likelihood of the pattern can be expressed as a simple product of conditional probabilities (see Eq. (1)):

P(X/Y) = \prod_{i=1}^{n} P(x_i, y_i) \qquad (1)
We will see later that by combining P(x_i, y_i) with terms issued from an appropriate decomposition of P(Y), we can achieve Markovian modeling of a certain order. In the example of Figure 10, the system follows successive phases in which sub-words are segmented, graphemes are extracted, characters are recognized, and the word is lexically corrected by matching against a dictionary.

Fig. 10. Hidden Markov Models for Word recognition from [7].
5.2. Case 2: Decomposition by association with a model


In this case, the pattern X = x1, x2, ..., xn is decomposable into sub-patterns and there exists a model λ_Y associated with the label Y. The a posteriori probability of the label becomes that of the model (see Eq. (2), where P(λ_Y) is the a priori probability of the λ_Y model, which can be estimated during learning):

P(Y/X) = P(\lambda_Y/X) \propto P(X/\lambda_Y)\,P(\lambda_Y) \qquad (2)

The idea is to associate with a pattern a state sequence of the model in order to observe its sub-patterns. In the domain of conditional probabilities, P(X/λ_Y) is decomposed into a sum over all state sequences of the model of length n, as depicted in Eqs. (3) and (4):
P(X/\lambda_Y) = P(x_1 \ldots x_n / \lambda_Y) \qquad (3)
= \sum_{q_1 \ldots q_n} P(x_1 \ldots x_n / q_1 \ldots q_n, \lambda_Y)\,P(q_1 \ldots q_n / \lambda_Y) \qquad (4)

That is, P(X/λ_Y) is decomposed into a sum, over all state sequences q_1 ... q_n of length n, of the product of the conditional probability of the pattern given the state sequence and the probability of the sequence given the model. We often assume that only one path, denoted {q_i^*}, contributes substantially to the calculation of P(X/λ_Y). {q_i^*} can be modeled as a stochastic process of order 1, which gives:

P(x_1, x_2, \ldots, x_n / q_1^*, q_2^*, \ldots, q_n^*, \lambda_Y) \qquad (5)

This quantity admits three different developments, according to the independence hypotheses between observed sub-patterns and observation states.
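In practice, the sum of Eq. (4) over all state sequences is computed by the classical forward recursion rather than by enumeration; here is a minimal sketch for a discrete-observation HMM (the matrix layout is an assumption):

import numpy as np

def hmm_likelihood(pi, A, B, obs):
    # pi: (S,) initial state probabilities
    # A:  (S, S) transitions, A[i, j] = P(q_t = j / q_{t-1} = i)
    # B:  (S, V) emissions,   B[j, v] = P(x_t = v / q_t = j)
    # obs: sequence of discrete observation symbols x_1 ... x_n
    alpha = pi * B[:, obs[0]]
    for v in obs[1:]:
        alpha = (alpha @ A) * B[:, v]  # marginalizes over q_1 ... q_{t-1}
    return alpha.sum()                 # P(X / lambda_Y), Eq. (4)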
An example illustrating development 1 is given by Khémiri et al. in [8]. The authors propose an HMM-based system for the recognition of city names. They proceed by extracting the baseline, the upper, lower and central bands, and some structural features such as ascenders, descenders, loops and diacritic dots, considering their number, type and position in the word, which serve as feature observations for the HMM (see Figure 11(a)). The HMM is an association of a horizontal HMM that observes the word in rows and a vertical HMM that observes it in columns (see Figure 11(b)). Depending on the observation, the traveling over time can be made in the states for rows or for columns.
Fig. 11. Development 1: Observations depending on associated states, from [8].

Fig. 12. Development 2: Observations associated with transitions [9].

To illustrate development 2, we can mention the model of Bercu et al. [9] (see Figure 12). The distribution of the observation probability of the features depends on the current state and the previous state; this observation is associated with the model's transitions. The general system uses an HMM with two levels for online word recognition: a local level describing the features within the letters (loops, peaks and oriented arcs), and a global observation of the letters at the word level (extension relative to
the central band). The HMM is described by a triple stochastic process: a Markov chain corresponding to the state sequence, a stochastic process associated with the local observation, and another associated with the global observation.
The case of development 3 occurs frequently in handwriting recognition, where systems use a prior segmentation into graphemes before recognition. In cursive writing such as Arabic, this prior segmentation cuts the letters into parts that are difficult to learn because of the problem of sample selection. Lemarié et al. [10] used a Radial Basis Function (RBF) network, trained on letters and letter segments, to evaluate the densities P(x_i/q_i^*, x_{i-1}). This system is able to link two consecutive segments to estimate the potential presence of a letter. The RBF input is the pair of consecutive segments and the output is the most probable HMM state. Hence, the HMM finds the letter succession in the image, guided by the observation probabilities given by the RBF.

5.3. Extension of HMM to the Plane

The use of HMMs on images is not straightforward. In fact, an HMM is a model for a one-dimensional signal, whereas an image is two-dimensional. Levin and Pieraccini proved that the direct extension is exponential in the dimensions of the image. However, according to the same authors, by applying some constraints to the image alignment problem, e.g. limiting the class of possible distortions, the complexity can be reduced to polynomial. The purpose of Dynamic Planar Warping (DPW) is to pair a reference image with a test image via a mapping function so that the distortion is minimal. If we impose separability of this function in its variables, e.g. horizontal distortions independent of vertical ones, we obtain what is called the Planar Hidden Markov Model (PHMM).
A PHMM is an HMM in which the observation probability of each state is given by a secondary model. The first conception was proposed by Agazzi and Kuo [11], where the image is divided into horizontal zones (found by K-means). Each zone is represented by a super-state, and the horizontal HMMs are correlated vertically. The numbers of states and super-states are determined manually. This model assumes that consecutive lines are independent.
In collaboration with ENIT (Tunisia), N. Ben Amara et al. [12] proposed a PHMM for printed and handwritten Arabic words (see Figure 13). During recognition, the system attempts to locate bands maximizing simultaneously the emission likelihood of the lines of the band and the likelihood of its height, expressed by the duration. In Eqs. (6) and (7), P(d_{j+1}/s_j) is the probability of the duration in super-state j; P_j(y) is the emission probability of line y, expressed by the number and width of the horizontal run lengths; K is the number of samples; and d_k^j is the duration in super-state j for sample k.

Fig. 13. PHMM architecture, from [12].
\delta_y(j) = \max\big[\delta_{y-1}(j-1)\,a_{j-1,j},\; \delta_{y-1}(j)\,P(d_{j+1}/s_j)\big]\,P_j(y) \qquad (6)

where 2 ≤ y ≤ Y and 1 ≤ j ≤ N. The transition probability between two super-states is equal to:

a_{j-1,j} = \frac{1}{K}\sum_{k=1}^{K} \frac{d_k^j}{d_k^{j-1} + d_k^j} \qquad (7)
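A minimal log-domain sketch of the recursion of Eq. (6) follows (not the original implementation; a strict left-to-right super-state topology and precomputed log-probabilities are simplifying assumptions):

import numpy as np

def phmm_band_alignment(log_emit, log_a, log_dur):
    # log_emit[y, j]: log P_j(y), emission log-probability of line y in super-state j
    # log_a[j]:       log a_{j-1,j}, log-probability of entering super-state j from j-1
    # log_dur[j]:     log P(d_{j+1}/s_j), log-probability of staying in super-state j
    Y, N = log_emit.shape
    delta = np.full((Y, N), -np.inf)
    delta[0, 0] = log_emit[0, 0]          # the alignment starts in super-state 0
    for y in range(1, Y):
        for j in range(N):
            stay = delta[y - 1, j] + log_dur[j]
            move = delta[y - 1, j - 1] + log_a[j] if j > 0 else -np.inf
            delta[y, j] = max(stay, move) + log_emit[y, j]
    return delta[-1, -1]                   # best score ending in the last band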

5.4. Bayesian Networks


To tackle the multidimensional problem of image recognition, we turned
to Bayesian networks (BN). BN is a probabilistic graphical model that
represents a set of random variables and their conditional dependencies
via a directed acyclic graph (DAG). It is an ideal modeling of a problem
represented by the conjunction of causalities. In the example of Figure 14,
P (V1 , V2 , ..., Vn ) is as follows where C(Vi ) is the set of causes (parents) of
Vi in the graph.
i=n
Y
P (V1 , V2 , ..., Vn ) = P (Vi /C(Vi )) (8)
i=1
Fig. 14. Bayesian Network example.
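A small numeric sketch of the factorization of Eq. (8) on a hypothetical three-variable DAG A → B, A → C (the probability tables are invented for illustration):

import numpy as np

P_A = np.array([0.6, 0.4])            # P(A)
P_B_A = np.array([[0.9, 0.1],         # P(B/A), rows indexed by the parent A
                  [0.3, 0.7]])
P_C_A = np.array([[0.8, 0.2],         # P(C/A)
                  [0.5, 0.5]])

def joint(a, b, c):
    # Eq. (8): each variable is conditioned only on its causes C(V_i)
    return P_A[a] * P_B_A[a, b] * P_C_A[a, c]

# Sanity check: the factorized joint sums to 1 over all configurations.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # 1.0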

However, while a BN handles multi-dimensionality well, it does not handle the temporal aspect. We therefore use the Dynamic Bayesian Network (DBN), which is the conjunction of a BN and an HMM and in which the temporal evolution of the variables is represented. An HMM is a particular case of a DBN where the observations are made in states and the transitions are between states. A DBN can be seen as a repetition of static BNs called time slices: transitions are between time slices, and observations are inside each slice.
With A. Khémiri in [3], we proposed a first version of a DBN (DBN1) by coupling a V-HMM and an H-HMM, adding direct links between nodes in the graph to represent dependencies between state variables (see Figure 15(a)). Such a configuration can be learned from data or fixed; we chose to fix it for our data set. Another DBN architecture is based on coupling two HMMs by adding a causal link (representing the time dependencies) from one time slice to another. The structure is completely known a priori and all variables are observable from the data (see Figure 15(b)).
In Figure 15(a) (DBN1), the parameters π, B and A are equal to:

\pi_1 = P(q_1 = S_1^1), \quad \pi_2 = P(q_1 = S_1^2)
B = \{b_{j,k}^{(l)}\} = P(O_t^l = k / S_t^l = j), where l = 1..2, t ≥ 2
A = \{a_{j,k}^{(l)}\} = P(S_t^l = k / S_{t-1}^1 = j), where l = 1..2, t ≥ 2

While in Figure 15(b) (DBN2), the parameters π, B and A are equal to:

\pi_1 = P(q_1 = S_1^1), \quad \pi_2 = P(q_1 = S_1^2)
B = \{b_{j,k}^{(l)}\} = P(O_t^l = k / S_t^l = j), where l = 1..2, t ≥ 2
A = \{a_{i,j,k}^{(l)}\} = P(S_t^l = k / S_{t-1}^1 = i, S_{t-1}^2 = j), where l = 1..2, t ≥ 2
Fig. 15. Two architectures of DBN from [3]: (a) DBN1, (b) DBN2.

Parameter learning is performed by the Baum-Welch algorithm, while inference is performed by the Forward algorithm. The data set used is IFN/ENIT; 83 classes were created and the recognition score is 86.07%.

5.5. Two Dimensional HMM

In the thesis of G. Saon [13], we approached two-dimensional models. The model, called NSHP-HMM, acts directly on the binary image by observing its columns, where the observation probability of each pixel is estimated by a Markov Random Field using the local context X_θ. A column probability is calculated as the product of the pixel probabilities and constitutes the observation for a state. The pixel probability depends on a 2D neighborhood taken in the half-plane already analyzed; this is why the system is called NSHP (Non-Symmetric Half-Plane, see Figure 16).
A second adaptation of the NSHP-HMM model to Arabic word recognition was proposed by Boukerma et al. [14] (see Figure 17). In this version, the authors used conditional zone observation probabilities, which in their experiments appear more appropriate than pixel observations. A codebook is generated via the K-means clustering algorithm applied to a set of feature vectors extracted from the zones.

Fig. 16. Non Symmetric Half Plane from [13], where (a) represents the random variables
and (b) an example of a Latin word analysis by NSHP.

Fig. 17. NSHP extension for Arabic recognition from [14].


6. Discriminative Models

For writing recognition, we were influenced by the reading model of McClelland and Rumelhart [15], one of the first connectionist models of the 1980s, in which they tried to model the perception of the word during reading using local units, both visual and acoustic. It is a hierarchical model with parallel activation and interaction. The connections are static and do not change: the model does not learn, but it is dynamic. In the model, perception results from excitatory and inhibitory interactions of detectors for visual features, letters and words. The letter detectors in turn excite detectors for consistent words, while active word detectors mutually inhibit each other and send feedback to the letter level.
S. Maddouri et al. [16] implemented this model as a Transparent Neural Network (TNN) and used it for bank check word recognition. The authors introduced a fourth layer of cells between the letter and word layers, which describes the PAWs (see Figure 18). There is a cell for each letter, PAW and word, and for each feature associated with a given location in the image. Every cell has an activation at each cycle c; at the beginning, all cells are initialized to 0. A cell's activation depends on its current activation and on that of its neighbors, according to the following equation:

A_i(c+1) = (1 - \theta)A_i(c) + E_i(c) \qquad (9)

where A_i(c+1) is the activation of cell i at cycle c+1, θ is a unit-decay constant set to 0.07 by McClelland [15], and E_i(c) is the effect of the neighbors of cell i. This effect is defined as E_i(c) = n_i(c)(1 - A_i(c)), where n_i(c), the excitation of the neighbors of cell i, is defined as:

n_i(c) = \sum_{j=1}^{nn} \alpha_{ij} A_j(c) \qquad (10)

where nn is the number of neighbors and α_{ij} is the connection weight between i and j.
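A compact sketch of one activation cycle implementing Eqs. (9) and (10); the weight matrix, layer sizes and initial activations below are hypothetical:

import numpy as np

def tnn_cycle(A, W, theta=0.07):
    # A: current activations A_i(c); W: connection weights alpha_ij
    # (positive = excitatory, negative = inhibitory).
    n = W @ A                     # neighbor excitation n_i(c), Eq. (10)
    E = n * (1.0 - A)             # neighbor effect E_i(c)
    return (1.0 - theta) * A + E  # new activations A_i(c+1), Eq. (9)

A = np.zeros(5)                   # all cells start at 0
A[:2] = 0.5                       # hypothetical evidence on two feature cells
W = np.random.default_rng(1).uniform(-0.1, 0.2, (5, 5))
for _ in range(10):               # a few perceptive cycles
    A = np.clip(tnn_cycle(A, W), 0.0, 1.0)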
During the bottom-up process, features are extracted in the image zones, and the information is then propagated for the election of letters, PAWs and words. In case of ambiguity, the zone of interest is identified and a request is sent back to the feature level to compare the Fourier descriptors of that zone.
This approach was extended to the recognition of a large Arabic vocabulary by Bencheikh et al. [17]. This work was based on two observations:
Fig. 18. Transparent Neural Network from [16].

• Observation 1: Direct recognition of the whole vocabulary is impossible (60 billion words!). However, a good part of it is decomposable, e.g. words derived from a root. A decomposable word is composed of morphemes: prefix, radical and suffix. The radical (or verbal core) is the derivation of a root according to a given scheme, by introducing “access” letters (see Figure 19).

Fig. 19. Decomposable word.


• Observation 2: The brief (unconjugated) schemes do not exceed 75, while the conjugated schemes number about 1400. To recognize a word, a root and a conjugated scheme are needed.

We proposed a system based on three TNN classifiers: one for the root from which the word derives, one for the scheme that the word follows, and one for the conjugation elements. As seen in Figure 20, from the word primitives all the networks are activated, and the elected root, scheme and conjugation are produced, allowing the reconstruction of the unknown word.

Fig. 20. TNN classifiers for word recognition, from [18].

The recognition process is made possible by good classifier collaboration and perceptive cycles. To recognize a word: (1) the scheme classifier is supplied with the word primitives; (2) perceptive cycles are applied to this classifier to discard bad candidates; (3) the same word primitives are supplied to the root and conjugation classifiers, which are supervised using the retained scheme candidate; (4) perceptive cycles, applied separately to the root and conjugation classifiers, refine the vision to reject bad candidates; (5) linguistic constraints are used to reject other confusions by confronting the outputs of the root and scheme classifiers, since a root does not fit every scheme and vice versa; and (6) the word is reconstituted from the selected root, scheme and conjugation candidates (see Figure 21).
Fig. 21. TNN perceptive cycles, from [18].

To better explain how the perceptive cycles benefit from the network's transparency to decide on the correct output, here is an example of applying perceptive cycles to TNN_R (the root classifier). A first propagation (see (1)) of the word primitives leads to two concurrent root candidates. Since each root is characterized by three determinative letters, a first perceptive cycle seeks details concerning the first letter of the root (see (2)), in order to decide which of the two candidate letters should be retained. For that, the system goes back to the word to check whether the vertical projection of the corresponding zone is flat or presents three acute peaks. In this case, the shape of the projection is flat, so the corresponding letter neuron in the second layer is pruned. In the same way, a second perceptive cycle (see (3)) checks the identity of the second letter and leads to the pruning of the competing letter neuron, since the shape of the letter to be identified is acute (see Figure 22).
Fig. 22. TNN Propagation, from [18].

7. Conclusion

In this chapter, we first discussed the characteristics of Arabic script and gave a brief overview of the feature extraction techniques proposed in previous works to characterize and recognize the Arabic script. We then showed the use of machine learning techniques for Arabic recognition, mainly generative and discriminative methods. For generative models, we tried to unify the probabilistic recognition mechanism used in HMMs. For this, the terms of the Bayes rule are decomposed differently according to the shape dimensionality and the dependence assumptions between sub-patterns and labels. There were two important decomposition cases: 1) the shape relative to the label, a case used in high-level applications such as lexical and syntactic analysis, as it is more specific to 1D HMMs; and 2) the shape relative to a model associated with a label. In the latter case, we associated with the shape a stochastic process representing the state path of the model, which allows us to better observe the sub-patterns. In 1D, the probability of a sub-pattern is conditioned on the current state; on the current and previous states; or on the current state and the previous sub-pattern. In the 2D case, the probability of a sub-pattern is conditioned on the previous sub-pattern along an analysis axis, and the results on this axis were correlated either by another HMM (PHMM) or, for causal random fields, by a two-dimensional neighborhood of sub-patterns. For discriminative models, we focused on Transparent Neural Networks (TNNs), inspired by the McClelland and Rumelhart model, to recognize a large vocabulary of Arabic words. The approach is based on a cognitive model where learning is replaced by an activation process that considers the nodes' neighborhoods.

References

1. V. Margner and H. El-Abed. Arabic word and text recognition - current developments. In Proc. Second International Conference on Arabic Language Resources and Tools, pp. 31–36, Cairo, Egypt (April 2009).
2. L. M. Lorigo and V. Govindaraju, Offline Arabic handwriting recognition: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(5), 712–724 (2006).
3. A. Khemiri, A. Kacem-Echi, A. Belaid, and M. Elloumi. Arabic handwritten words off-line recognition. In International Conference on Document Analysis and Recognition, pp. 51–55, Nancy, France (August 2015).
4. A. Saidani, A. Kacem-Echi, and A. Belaid. Co-occurrence matrix of oriented gradients for word script and nature identification. In Information and Media Technologies, pp. 16–20 (2015).
5. A. Saidani, A. Kacem-Echi, and A. Belaid, Arabic/Latin and machine-printed/handwritten word discrimination using a HOG-based shape descriptor, ELCVIA: Electronic Letters on Computer Vision and Image Analysis, 14(2), 1–23 (2015).
6. N. Aouadi and A. Kacem-Echi, A proposal for touching component segmentation in Arabic manuscripts, Pattern Analysis and Applications (PAA), 20(4), 1005–1027 (2016).
7. G. A. Abandah, F. Jamour, and E. Qaralleh, Recognizing handwritten Arabic words using grapheme segmentation and recurrent neural networks, Int. Journal on Document Analysis and Recognition (IJDAR), 17(3), 275–291 (2014).
8. A. Khemiri, A. Kacem-Echi, and A. Belaid. Towards Arabic handwritten word recognition via probabilistic graphical models. In International Conference on Frontiers of Handwriting Recognition, pp. 144–151, Crete Island, Greece (September 2014).
9. S. Bercu, B. Delyon, and G. Lorette. Segmentation par une méthode de reconnaissance d'écriture cursive en ligne. In CNED, pp. 144–151, Nancy, France (1992).
10. B. Lemarié, M. Gilloux, and M. Leroux. Handwritten word recognition using contextual hybrid radial basis function network/hidden Markov models. In eds. D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Advances in Neural Information Processing Systems 8, pp. 764–770 (1996).
11. O. E. Agazzi and S. Kuo, Hidden Markov model based optical character recognition in the presence of deterministic transformations, Pattern Recognition, 26(12), 1813–1826 (1993).
12. N. Ben Amara and A. Belaid. Printed PAW recognition based on planar hidden Markov models. In 13th International Conference on Pattern Recognition, vol. B, pp. 220–224, Vienna, Austria (August 1996).
13. G. Saon and A. Belaid, High performance unconstrained word recognition system combining HMMs and Markov random fields, International Journal on Pattern Recognition and Artificial Intelligence (IJPRAI), Special Issue on Automatic Bankcheck Processing, S. Impedovo Ed., 11(5), 771–788 (1997).
14. H. Boukerma, A. Benouareth, and N. Farah. NSHP-HMM based on conditional zone observation probabilities for off-line handwriting recognition. In 22nd International Conference on Pattern Recognition, pp. 2961–2965 (2014).
15. J. L. McClelland and D. E. Rumelhart, Distributed memory and the repre-
sentation of general and specific information, Journal of Experimental Psy-
chology: General. pp. 159–188 (1985).
16. S. Snoussi-Maddouri, H. Amiri, A. Belaid, and C. Choisy. Combination of local and global vision modelling for Arabic handwritten word recognition. In International Conference on Frontiers of Handwriting Recognition, pp. 1–14 (2002).
17. I. Bencheikh, A. Belaid, and A. Kacem. A novel approach for the recognition of a wide Arabic handwritten word lexicon. In 19th International Conference on Pattern Recognition, pp. 1–4, Tampa, USA (2008).
18. A. Kacem-Echi, I. BenCheikh, and A. Belaid, Collaborative combination of neuron-linguistic classifiers for large Arabic word vocabulary recognition, IJPRAI, 28(1), 1–39 (2014).
Chapter 5

Arabic Word Spotting Approaches and Techniques

Muna Khayyat, Louisa Lam, and Ching Y. Suen


Center for Pattern Recognition and Machine Intelligence,
Concordia University,
Montreal, Quebec H3G 1M8, Canada,
muna.khayyat@gmail.com, {llam, suen}@encs.concordia.ca

The effective retrieval of information from scanned handwritten documents is becoming essential with the increasing volumes of digitized documents, and therefore developing efficient means for the analysis and recognition of documents is of significant interest. Among these methods is word spotting, which has recently become an active research area. Such systems have been implemented for Latin-based and Chinese languages, while few have been implemented for Arabic handwriting. The fact that Arabic writing is cursive by nature and unconstrained, with no clear white spaces between words, makes the processing of Arabic handwritten documents a more challenging problem.
This chapter introduces and discusses Arabic word spotting approaches and challenges. This includes the definition of word spotting, performance measures and approaches. Then, the characteristics of the Arabic language are introduced, the most commonly used Arabic word spotting databases are summarized, and finally some extracted features for Arabic word spotting are presented.

1. Word Spotting

A great number of handwritten documents have been digitized in order to preserve, analyze, and disseminate them. These documents are of different categories, drawn from fields as diverse as history, commerce, finance, and medicine. As the sheer volume of digitized handwritten documents continues to increase, the need for indexing them becomes vital. Word spotting is an approach that allows a user to search for keywords in spoken or written text. While initially developed for use in Automatic Speech Recognition (ASR), word spotting has since been applied to the growing number of handwritten documents for the purpose of indexing. Even though speech is analog in nature while handwritten documents are spatial, word spotting of handwritten documents has been able to adopt the methods of speech recognition for its use. Subsequently, techniques and algorithms specific to the processing of handwritten documents have been developed.
Early indexing work started by applying conventional Optical Character Recognition (OCR) techniques, with the results passed to special search engines to search for words. However, Manmatha et al. designed the first handwritten word spotting system in 1996,1 having found that applying traditional OCR techniques to search for words is inadequate. Using OCR to index words fails for the following reasons:2,3 1) handwriting analysis suffers from low recognition accuracies; 2) the associated indexing systems are hampered by having to process and recognize all the words of a document, and then apply search techniques to the entire result; and 3) the training of OCR systems requires that a huge database be constructed for each alphabet.
Word spotting methods are based on two main approaches: template matching and learning-based. Manmatha et al.1 proposed the first indexing or word spotting system for single-writer historical documents; the proposed method was based on matching word pixels. Zhang et al.4 proposed a template matching approach based on extracting features from word images. Dynamic Time Warping (DTW)2,5,6 has been successfully applied as an efficient template matching algorithm. Learning-based word spotting systems were introduced to adapt to multiple writers, with promising results; however, sufficiently large databases are needed to train these systems.
This section defines word spotting and describes the different types of input queries to word spotting systems. Then, the performance measures of word spotting systems are described. Finally, different approaches to word spotting are discussed.

1.1. Definition

Handwritten word spotting, also called indexing or searching within documents, is the task of detecting keywords in documents by segmenting the document into word images (clusters) based on their visual appearance. Word spotting systems aim to recognize all occurrences of a specific keyword within a document. The input to a word spotting system is a keyword query, which can be either a query by string or a query by example. A query by string is a string of letters entered on the keyboard, while a query by example uses an image of a word. Initially, most word spotting systems start by clustering documents into words, which can be done using different clustering techniques. Afterwards, the word can be described as a whole, or it can be segmented into a set of components such as letters, strokes or graphemes. Finally, different algorithms and methods are used to spot the words. These methods include learning-based, template matching, and shape code mapping. Figure 1 illustrates the possible steps of a word spotting system, including the different word spotting approaches.

Fig. 1. Word Spotting Systems.

1.2. Input queries

In word spotting systems, both query by string and query by example are used to input keywords, and each approach has its pros and cons. Query by string requires learning the alphabet of the language and then concatenating letters to form the word model for later matching with the words in the document.7–10 These systems alleviate some of the drawbacks of traditional handwriting recognition systems, which require huge databases for training. Such word spotting systems perform well for lexicon-free approaches,11 where there are no restrictions on the size of the lexicon.
On the other hand, for query by example, the pixels or the extracted features of the template image are passed to the system, and the word is then detected in the document using word spotting techniques. These systems suffer from the drawback that they can be applied only to closed lexicons.12–15

1.3. Performance measures


To evaluate any system, some performance metrics are needed. There are
two ways to measure the performance of a word spotting system, either
viewing it from the correctly spotted samples or from the incorrectly spotted
ones. In the former view, both the recall rate and the precision rate are
determined and often the precision-recall curve is plotted to give a visual
representation of the performance.16,17 The following metrics are used to
measure the performance of a word spotting system.
Recall Rate (RR): measures the ratio of actual positives, i.e. the successful retrieval of the relevant target samples,

RR = \frac{TP}{TP + FN} \qquad (1)

TP (True Positives): total number of correctly spotted target samples;
FN (False Negatives): total number of target samples that were not spotted.
Precision Rate (PR): the probability that a retrieved image is a target word,

PR = \frac{TP}{TP + FP} \qquad (2)

FP (False Positives): total number of spotted samples that were mis-recognized.
The precision-recall curve is also used to calculate the Mean Average Precision (MAP), represented by the area under the curve, and the R-Prec, which gives the rate at which the recall and precision graphs intersect.
The other way of measuring the performance is adopted from spoken word spotting.9,12 This approach is based on the error rate, where the following measures are used.
Word Error Rate (WER): the proportion of words that were not recovered exactly as they appear in the manual transcript.
Out Of Vocabulary words (OOV): words that occur only in the testing pages and not in the training pages or words.
False Alarm Rate (FAR): an erroneous image target detection decision, or the percentage of times a word was falsely spotted,

FAR = \frac{FP}{FP + TN} \qquad (3)

TN (True Negatives): total number of the OOV images that were not spotted.
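These measures are straightforward to compute from the spotting counts; a small sketch with hypothetical counts:

def recall_rate(tp, fn):        # Eq. (1)
    return tp / (tp + fn)

def precision_rate(tp, fp):     # Eq. (2)
    return tp / (tp + fp)

def false_alarm_rate(fp, tn):   # Eq. (3)
    return fp / (fp + tn)

# Hypothetical run: 80 spotted targets, 20 missed, 10 false hits, 90 rejected OOV images.
print(recall_rate(80, 20), precision_rate(80, 10), false_alarm_rate(10, 90))
# 0.8 0.888... 0.1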
1.4. Word spotting approaches

Segmenting or clustering the document into words is considered the first step in many word spotting systems. This can be done using state-of-the-art word segmentation techniques. Various techniques have been proposed to establish a threshold on the gap distance between the words of a document, to decide whether a gap is within or between words.12,13,16 Other techniques apply vertical projections and profiles to the lines of the document to find optimal segmentation points, and the document can also be clustered into words using classifiers such as artificial neural networks.17 However, Leydier et al.15 found that it is impossible to achieve accurate line or word segmentation. Thus, many successful segmentation-free approaches have been proposed, in which classifiers integrate segmentation with recognition, such as Hidden Markov Models (HMMs)18 and recurrent neural networks.19
Handwritten word spotting is a process that detects words selected by the user in a document without any syntactic constraints.15 Many methods have been used in the literature to spot words, based on three approaches: template matching, shape code mapping and learning-based.
Similarity matching methods have been applied in many different studies to spot words. These methods have been applied successfully in systems with few writers, and they are also lexicon-free. They measure the similarity or dissimilarity between either the pixels of the images or the features extracted from the images. Manmatha et al.1 proposed the first indexing or word spotting system for single-writer historical documents; the proposed method was based on matching word pixels. Subsequently, different template matching approaches based on features extracted from word images have been proposed.4,6,14,17 Dynamic Time Warping (DTW)2,5,12,20 has been successfully applied as an efficient template matching algorithm based on dynamic programming.
Shape code mapping techniques use character shape codes, in which each character is mapped to a shape code. Ascenders, descenders, loops and other structural descriptors are used to form the shape code. Each word is represented by a sequence of shape codes, and query words are mapped into word shape codes. String matching algorithms can then be applied to perform the mapping and detect words.21
Learning-based word spotting systems were introduced to adapt to multiple writers, with promising results; however, sufficiently large databases are needed to train the system. The HMM is the most common classifier applied in word spotting systems.9,16,22 Other approaches have also been developed; for example, Frinken et al.19 proposed a word spotting system that uses a bidirectional Long Short-Term Memory (LSTM) neural network together with the Connectionist Temporal Classification (CTC) token passing algorithm to spot words, and this system has shown high performance.

2. Arabic Word Spotting

The naturally cursive structure of Arabic writing is more unconstrained than that of other languages. This, coupled with the fact that the boundaries between words are arbitrary and often non-existent, makes word spotting in the Arabic language a challenging problem in need of further research.

2.1. Characteristics of Arabic handwriting

Arabic script is always cursive, even when printed, and it is written horizontally from right to left. In Arabic writing, letter shapes change depending on their location in the word, a fact that distinguishes Arabic writing from many other languages. In addition, dots, diacritics, and ligatures are special characteristics of Arabic writing. Figure 2 shows two Arabic handwritten documents.
The Arabic handwriting system evolved from a dialect of Aramaic, which has fewer phonemes than Arabic: Aramaic uses only 15 letters, but Arabic uses 28. The Arabic letters are formed by adding one, two or three dots above or below the Aramaic letters to generate different sounds.11 Thus, many letters share a primary common shape and differ only in the number and/or location of the dots. This means that dots play an important role in the writing of Arabic and of other languages that share the same letters, such as Farsi (Persian) and Urdu. It is also worth mentioning that more than half of the Arabic letters (15 out of 28) are dotted. In printed documents, double and triple dots are printed as separate dots, while in handwritten documents there are different ways to write them; for example, Figure 3 shows three different ways of writing double dots.
In addition, the shapes of letters change depending on their position in the word; therefore, each Arabic letter has between two and four shapes. Letters can take isolated (28 letters), beginning (22 letters), middle (22 letters), and ending (28 letters) forms; however, Arabic letters do not have upper and lower cases. There are six letters in Arabic that can only be connected from the right side; therefore, when they appear in a word they cause
a disconnection, resulting in sub-words or Pieces of Arabic Words (PAWs). This fact makes word spotting and document segmentation into words more challenging.

Fig. 2. Two Arabic handwritten documents.

Fig. 3. Three different ways of writing double dots.
Ligatures are used to connect Arabic letters, making it difficult to deter-
mine the boundaries of the letters, since ligatures are not added according
to any writing rule. Ligatures in Arabic can only be found on the baseline
because letters are only connected on the baseline, as opposed to Latin-
based languages in which letters can be connected from the ascenders and
descenders.
In Arabic words there are small markings called “diacritical markers”;
these markers represent short vowels, double consonants and other marks23
that are added to the letters. There are no Arabic letters with both upper
and lower diacritics. Adding these diacritics to the Arabic script is not
obligatory, so they are not always added.

2.2. Arabic word spotting approaches


Attempts have been made to construct a language independent word spot-
ting system, but these have encountered problems when handling Arabic
script. Srihari and Ball17 proposed a language independent word spot-
ting system, in which they extracted gradient features from words since
these features are language-independent. However, for Arabic handwritten
word spotting, they found it necessary to apply manual word segmentation
(clustering). In this way, they circumvented a main problem of the Arabic
language, namely that there are no clear boundaries between words. Leydier
et al.15 proposed a segmentation-free language independent word spotting
system which may overcome this problem. However, they faced difficulties
with words from the same root. Even though the system was validated for
Arabic using only one simple query consisting of a single PAW, the precision
rate of 80.00% for Arabic was lower than that of the two Latin databases
that were tested. Similarly, Wshah et al.24 proposed a script independent
segmentation-free word spotting system based on HMMs, and this system
was compared to a concurrent word spotting system22 also utilizing HMMs.
Both studies found that the lowest results were obtained when applying
their systems to the Arabic language.
DTW has been extensively used for word matching in Arabic hand-
written word spotting. Moghaddam and Cheriet25 applied Euclidean dis-
tance enhanced by rotation, together with DTW, to measure the similarity
between two connected components or PAWs of historical documents.
Moreover, Self-Organizing Maps were used to initially cluster PAWs de-
pending on the shape complexity of each PAW. Rodriguez-Serrano and
Perronnin26 proposed a model-based similarity measure between vector se-
quences. Each sequence is mapped to a semicontinuous Hidden Markov
Model, and then a measure of similarity is computed between the HMMs.
This computation of similarity was simplified using DTW. They applied the
measure to handwritten word retrieval in three different datasets including
the IFN/ENIT database of Arabic handwritten words (described in Sec-
tion 3), and concluded that their proposed similarity outperforms DTW
and ordinary continuous HMMs. Saabni and Bronstein27 implemented an
Arabic word matching approach by extracting contour features from PAWs,
then embedding each PAW into a Euclidean space to reduce the complex-
ity; finally, they used Active-DTW28 to determine the final matching
result of a PAW.
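For reference, the DTW distance at the core of these matching approaches can be
sketched as follows. This is a minimal Java illustration of the standard recurrence,
with a plain Euclidean frame distance and none of the band constraints or
normalizations that the systems cited above typically add:

    // Dynamic Time Warping distance between two sequences of feature vectors.
    static double dtw(double[][] a, double[][] b) {
        int n = a.length, m = b.length;
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
        d[0][0] = 0.0;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
                d[i][j] = dist(a[i - 1], b[j - 1])
                        + Math.min(d[i - 1][j - 1], Math.min(d[i - 1][j], d[i][j - 1]));
        return d[n][m];
    }

    // Euclidean distance between two feature vectors (frames).
    static double dist(double[] u, double[] v) {
        double s = 0.0;
        for (int k = 0; k < u.length; k++) s += (u[k] - v[k]) * (u[k] - v[k]);
        return Math.sqrt(s);
    }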

Content-based retrieval using a codebook has been used for Arabic word
spotting.21,29,30 In these systems, meaningful features are extracted to
represent codes of symbols, characters, or PAWs. Then similarity matching
or distance measure algorithms between the codes and the codebook are
applied to perform the final match.
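As a simplified illustration of this final matching step (our own sketch, not the
exact algorithms of the cited systems), a query coded as a string of shape symbols
can be compared against each codebook entry with an edit distance:

    // Levenshtein distance between a query shape-code string and a codebook code;
    // the codebook entry with the smallest distance is the spotted match.
    static int editDistance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++)
            for (int j = 1; j <= t.length(); j++) {
                int sub = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                          Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        return d[s.length()][t.length()];
    }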
Latin script is essentially based on two models (character and word),
while Arabic script is based on three models: Character, PAW and Word
models. All three models are used for Arabic word spotting, with the PAW
model used most extensively, since a line of Arabic text can be viewed as
a sequence of PAWs instead of words, and there are no differences between
the spaces separating PAWs and those separating words. Nevertheless, a
few segmentation-free systems have been proposed for Arabic handwritten
word spotting, in which segmentation is embedded within the classification
process. These systems are either implemented using HMMs based on the
character model,24 or an over-segmentation is applied based on the PAW
model.31
Attempting to segment Arabic documents into candidate words may
not be an appropriate approach for Arabic word spotting systems. This is
because Arabic words are composed of PAWs that are easy to extract, while
there are no clear boundaries between words. This latter aspect would in-
troduce difficulties in segmenting a document into words. Srihari et al.32
tried to cluster words by segmenting the line into connected components
and merging each main component with its diacritics. Nine features were
extracted from each pair of clusters and the features were passed to a neu-
ral network to decide whether the gap between the pairs is a word gap.
However, with ten writers each writing ten documents, only about 60% of
the word segmentations were correct, and this significantly affected the
spotting results.
Many studies favored segmenting documents into PAWs rather than
words due to the problem of not having clear boundaries for words. Sari
and Kefali21 preferred to segment the document into major connected com-
ponents, to circumvent the problem of word segmentation in Arabic doc-
uments. Thus, they decided to favor processing Arabic PAWs instead of
words. They converted the PAW into Word Shape Tokens (WST) and
represented each PAW by global structural features such as loops, ascen-
ders and descenders. Similarly, input queries were coded and then a string
matching technique was applied. They validated their word spotting sys-
tem using both printed and handwritten Arabic manuscripts and historical
documents. This approach is promising because it uses open lexicons and
avoids pre-clustering. Saabni and El-Sana7 also segmented the documents
into PAWs; they used DTW and HMM for matching in two different sys-
tems, and then additional strokes were used by means of a rule-based sys-
tem to determine the final match. Similarly, Khayyat et al.33,34 proposed
a learning-based word spotting system for Arabic handwritten documents;
this system has also favored the PAW model, in which words are spotted
using a hierarchical classifier where PAWs are recognized, and then words
are re-constructed from their PAWs. Language models are incorporated
into this system to represent the contextual information.
In Arabic, word spotting using an analytical approach to segment words
into letters is challenging for several reasons. Firstly, the Arabic lan-
guage has 28 letters but each letter has a different shape (form) depending
on its location within a word. This results in more than 100 shapes of
letters, many of which are extremely similar and only differ in the num-
ber or location of the dots. Secondly, writers may elongate ligatures and
letters in order to highlight a keyword or for aesthetic reasons. Thirdly,
vertical overlapping between letters often occurs. Finally, in Arabic there
are many writing styles in which a letter in the same position of a word can
be written in different ways. These facts make segmenting a document into
characters challenging. Sari et al.35 proposed an analytical approach for
handwritten Arabic letter segmentation. They extracted some structural
features that occur in Arabic letters such as holes, turning points, double
local minima, ascenders, descenders, and one, two and three dots. They
applied their segmentation algorithm to an omni-scriptor database, and
the results show that 5% of the characters were under-segmented, 9% of
the characters were over-segmented and 86% of the characters were well
segmented.
Attempting to spot words after segmenting them into letters, PAWs or
words may increase the error rate, due to segmentation errors. Ball et al.36
over-segmented the words hoping not to have more than one letter in a
segment, then a dynamic programming algorithm was applied to find the
candidate letters. However, because of the difficulties in segmentation, a
segmentation-free approach has been applied to spot Arabic words;24 this
approach has shown promising results in Latin handwritten word spotting.

3. Databases

Many document databases have been used to spot handwritten Arabic
words, but each has typically been used by only one research group. No
publicly available database has been used by all of these groups, which
makes comparing Arabic word spotting systems almost impossible.
The Institute of Communications Technology (IFN) and the École Na-
tionale d’Ingénieurs de Tunis (ENIT) have developed the advanced Arabic
handwritten words database (IFN/ENIT)39 for Arabic word recognition.
This database consists of Tunisian city names; some researchers have also
used it to search for Arabic handwritten words.
The KHATT database40 (KHATT means handwriting in the Arabic
language) consists of 1000 forms written by distinct writers. This database
can be used to evaluate Arabic word spotting systems, and is freely avail-
able to researchers. However, the database has not been used to evaluate
Arabic word spotting thus far.
Handwritten document databases consist of two types: single writer
and multi-writer, with the former usually containing historical documents.
Table 1 summarizes some databases that have been used in the literature for
Arabic word spotting.

4. Extracted Features

Word spotting systems in any language require features to be extracted,
regardless of whether a similarity matching, shape code matching, or
learning-based approach is applied to spot words. In the Arabic
language, features can be extracted from words, sub-words or characters.
The most commonly extracted features for Arabic word spotting systems
are gradient features and structural (geometric) features. Nevertheless,
many other features have been extracted such as Fourier transforms.30
Gradient features have been widely applied to many OCR systems, and
to word spotting systems.15,24,33,34,38 This is because gradient features
are language independent features, and can also result in high recognition
rates. These features are extracted by applying a filter to the image, after
which the gradient strength and direction are calculated on the grayscale
image g as follows:
Direction: $\theta(i,j) = \tan^{-1}\left(\frac{\Delta v}{\Delta u}\right)$   (4)

Strength: $f(i,j) = \sqrt{(\Delta u)^2 + (\Delta v)^2}$   (5)

where $\Delta u$ and $\Delta v$ are the grayscale differences along the two diagonal
directions (a Roberts-type operator), which can be calculated as follows:

$\Delta u = g(i+1, j+1) - g(i, j)$   (6)

$\Delta v = g(i+1, j) - g(i, j+1)$   (7)
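As an illustration only (our own Java sketch, not code from the cited systems),
the per-pixel computation can be written as follows; the smoothing filter and
the usual binning of directions into histogram features are omitted:

    // Compute gradient strength and direction maps from a grayscale image g,
    // using the diagonal differences of Eqs. (4)-(7).
    static double[][][] gradientFeatures(double[][] g) {
        int h = g.length, w = g[0].length;
        double[][] strength  = new double[h - 1][w - 1];
        double[][] direction = new double[h - 1][w - 1];
        for (int i = 0; i < h - 1; i++) {
            for (int j = 0; j < w - 1; j++) {
                double du = g[i + 1][j + 1] - g[i][j];           // Eq. (6)
                double dv = g[i + 1][j] - g[i][j + 1];           // Eq. (7)
                direction[i][j] = Math.atan2(dv, du);            // Eq. (4); atan2 handles du = 0
                strength[i][j]  = Math.sqrt(du * du + dv * dv);  // Eq. (5)
            }
        }
        return new double[][][] { strength, direction };
    }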
Table 1. Arabic handwritten document databases for word spotting.

Authors | Database | Writers | Documents
Khayyat et al.33 | CENPARMI (Centre for Pattern Recognition and Machine Intelligence) Arabic handwritten documents database37 | Multi-writer | 137 documents
Khayyat et al.34,38 | Subset of the CENPARMI Arabic handwritten documents database | Multi-writer | 47 documents
Cheriet and Moghaddam31 | Historical documents | Single writer | 51 pages
Wshah et al.24 | AMA Arabic dataset | Multi-writer | 200 unique documents, consisting of 5000 documents transcribed by 25 writers
Sari and Kefali21 | Arabic manuscripts from different sources covering different fields and diverse queries | n/a | 132 pages
Leydier et al.15 | Arabic manuscript; one query of one PAW | Single writer | n/a
Chan et al.11 | Kitab fi l-fiqh, a 12th century document on Islamic jurisprudence | Single writer | 20 pages


Geometric features are often used for language-dependent word spotting
systems, since each language has a different geometry. Images are usually
preprocessed by extracting contours or skeletons before extracting the
geometric features. These features include intersection points, loops, verti-
cal and horizontal lines, curves, etc. Cheriet and Moghaddam31 extracted
topological features including loops, end and base points, vertical and hor-
izontal centroid, and dots from the connected components of the skeletons.
Additional geometric features were extracted from the topological features,
such as whether the branch associated with an end point is clockwise,
whether the branch is S-shaped, and location-specific information. These features were
extracted after applying several transformations to the connected compo-
nents. Saabni and El-Sana7 extracted structural features that capture lo-
cal, semi-global and global behaviours.41 Toufik Sari and Abderrahmane
Kefali21 extracted diacritics, descenders, ascenders and loops to search for
consecutive word shape tokens within the document. Shahab et al.30 ex-
tracted concentric circle features and width features from sub-words, and
used a similarity measure based on angular separation.

5. Concluding Remarks

The automatic processing, analysis, and recognition of handwritten Arabic
documents are challenging for reasons already mentioned in this chapter.
Nevertheless, different research groups have implemented Arabic word spot-
ting systems and have achieved satisfactory performances. These systems
were based on one of the following three models: character, sub-word or
PAW, and word.
Different databases have been used to evaluate the Arabic word spotting
systems proposed by different research groups. Unfortunately, no publicly
available database has been used by all of these research groups in order to
compare their proposed systems.
Finally, different features have been extracted from characters, PAWs
or words. However, gradient and geometric features are the most commonly
extracted features.

References

1. R. Manmatha, C. Han, and E. M. Riseman, “Word spotting: A new ap-
proach to indexing handwriting,” in Computer Vision and Pattern Recogni-
tion (CVPR) Conf., pp. 631–637, 1996.
2. J. A. Rodrı́guez-Serrano and F. Perronnin, “Local gradient histogram fea-
tures for word-spotting in unconstrained handwritten documents,” in Proc.
of the 11th Int. Conf. on Frontiers in Handwriting Recognition (ICFHR),
pp. 7–12, 2008.
3. J. A. Rodrı́guez-Serrano and F. Perronnin, “Score normalization for HMM-
based word spotting using universal background model,” in Proc. of the 11th
Int. Conf. on Frontiers in Handwriting Recognition (ICFHR), pp. 82–87,
2008.
4. B. Zhang, S. N. Srihari, and C. Huang, “Word image retrieval using binary
features,” in Document Recognition and Retrieval, pp. 45–53, 2004.
5. T. M. Rath and R. Manmatha, “Word image matching using dynamic time
warping,” in Computer Vision and Pattern Recognition (CVPR), pp. 521–
527, 2003.
6. T. Adamek, N. E. O’Connor, and A. F. Smeaton, “Word matching using
single closed contours for indexing handwritten historical documents,” In-
ternational Journal on Document Analysis and Recognition (IJDAR), vol. 9,
pp. 153–165, 2007.
7. R. Saabni and J. El-Sana, “Keyword searching for Arabic handwritten docu-
ments,” in Proc. 11th International Conference on Frontiers in Handwriting
Recognition, (ICFHR), pp. 271–277, 2008.
8. A. Bhardwaj, D. Jose, and V. Govindaraju, “Script independent word spot-
ting in multilingual documents,” in Proceedings of the 2nd International
Workshop on Cross Lingual Information Access, pp. 48–54, 2008.
9. V. Lavrenko, T. M. Rath, and R. Manmatha, “Holistic word recognition for
handwritten historical documents,” in Proceedings of the First International
Workshop on Document Image Analysis for Libraries (DIAL), pp. 278–287,
2004.
10. J. Edwards, Y. W. Teh, D. Forsyth, R. Bock, M. Maire, and G. Vesom,
“Making Latin manuscripts searchable using GHMM’s,” in Proceedings of the
19th Annual Conference on Neural Information Processing Systems, pp. 385–
392, 2005.
11. J. Chan, C. Ziftci, and D. Forsyth, “Searching off-line Arabic documents,” in
Proceedings of the International Conference on Computer Vision and Pattern
Recognition, Computer Vision and Pattern Recognition (CVPR), pp. 1455–
1462, 2006.
12. A. Kolcz, J. Alspector, M. Augusteijn, R. Carlson, and G. V. Popescu, “A
line-oriented approach to word spotting in handwritten documents,” Pattern
Analysis and Applications, vol. 3, no. 2, pp. 153–168, 2000.
13. R. Manmatha, C. Han, E. M. Riseman, and W. B. Croft, “Indexing hand-
writing using word matching,” in Proceedings of the first ACM international
conference on Digital libraries, DL ’96, pp. 151–159, 1996.
14. R. Manmatha and T. Rath, “Indexing handwritten historical documents -
recent progress,” in Proceedings of the Symposium on Document Image Un-
derstanding, SDIUT-03, pp. 77–85, 2003.
15. Y. Leydier, F. Lebourgeois, and H. Emptoz, “Text search for Medieval
manuscript images,” Pattern Recognition, vol. 40, no. 12, pp. 3552–3567,
2007.
16. J. A. Rodrı́guez-Serrano and F. Perronnin, “Handwritten word-spotting us-
ing hidden Markov models and universal vocabularies,” Pattern Recognition,
vol. 42, no. 9, pp. 2106–2116, 2009.
17. S. N. Srihari and G. R. Ball, “Language independent word spotting in
scanned documents,” Lecture Notes in Computer Science - LNCS, vol. 5362,
pp. 134–143, 2008.
18. A. Fischer, A. Keller, V. Frinken, and H. Bunke, “HMM-based word spot-
ting in handwritten documents using subword models,” in Proceedings of
the 20th International Conference on Pattern Recognition (ICPR), pp. 3416–
3419, 2010.
19. V. Frinken, A. Fischer, R. Manmatha, and H. Bunke, “A novel word spotting
method based on recurrent neural networks,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 34, no. 2, pp. 211–224, 2012.
20. K. Khurshid, C. Faure, and N. Vincent, “A novel approach for word spot-
ting using merge-split edit distance,” in Computer Analysis of Images and
Patterns (CAIP) (X. Jiang and N. Petkov, eds.), vol. 5702 of Lecture Notes
in Computer Science - LNCS, pp. 213–220, 2009.
21. T. Sari and A. Kefali, “A search engine for Arabic documents,” in Actes
du dixième Colloque International Francophone sur l’Écrit et le Document,
pp. 97–102, 2008.
22. A. Fischer, A. Keller, V. Frinken, and H. Bunke, “Lexicon-free handwritten
word spotting using character HMMs,” Pattern Recogn. Lett., vol. 33, no. 7,
pp. 934–942, 2012.
23. I. S. I. Abuhaiba, M. J. J. Holt, and S. Datta, “Recognition of off-line
cursive handwriting,” Computer Vision and Image Understanding, vol. 71,
no. 1, pp. 19–38, 1998.
24. S. Wshah, G. Kumar, and V. Govindaraju, “Script independent word spot-
ting in offline handwritten documents based on hidden Markov models,” in
13th Int. Conf. on Frontiers in Handwriting Recognition (ICFHR), pp. 14–18,
2012.
25. R. Moghaddam and M. Cheriet, “Application of multi-level classifiers and
clustering for automatic word spotting in historical document images,” in
Proc. of the 10th Int. Conf. on Document Analysis and Recognition (ICDAR),
pp. 511–515, 2009.
26. J. A. Rodrı́guez-Serrano and F. Perronnin, “A model-based sequence simi-
larity with application to handwritten word-spotting,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 34, no. 11, pp. 2108–2120, 2012.
27. R. Saabni and A. Bronstein, “Fast key-word searching via embedding and
Active-DTW,” in Proc. of the 11th Int. Conf. on Document Analysis and
Recognition (ICDAR), pp. 68–72, 2011.
28. M. Sridha, D. Mandalapu, and M. Patel, “Active-DTW : A Generative Clas-
sifier that combines Elastic Matching with Active Shape Modeling for Online
Handwritten Character Recognition,” in Proc. of the 10th Int. Workshop on
Frontiers in Handwriting Recognition, pp. 193–196, 2006.
29. E. Şaykol, A. K. Sinop, U. Güdükbay, Ö. Ulusoy, and A. E. Çetin, “Content-
based retrieval of historical Ottoman documents stored as textual images,”
IEEE Trans. on Image Processing, vol. 13, no. 3, pp. 314–325, 2004.
30. S. Shahab, W. G. Al-Khatib, and S. A. Mahmoud, “Computer aided index-
ing of historical manuscripts,” in Proceedings of International Conference on
Computer Graphics, Imaging and Vision (CGIV), pp. 151–159, 2006.
31. M. Cheriet and R. F. Moghaddam, Guide to OCR for Arabic Scripts, ch. A
Robust Word Spotting System for Historical Arabic Manuscripts, pp. 453–
484. Springer, 2012.
32. S. Srihari, H. Srinivasan, P. Babu, and C. Bhole, “Handwritten Arabic word
spotting using the CEDARABIC document analysis system,” in Proc. Sym-
posium on Document Image Understanding Technology (SDIUT-05), College
Park, MD, pp. 123–132, 2005.
33. M. Khayyat, L. Lam, and C. Y. Suen, “Learning-based word spotting sys-
tem for Arabic handwritten documents,” Pattern Recognition, vol. 47, no. 3,
pp. 1021–1030, 2014.
34. M. Khayyat, L. Lam, and C. Y. Suen, “Verification of hierarchical classifier
results for handwritten Arabic word spotting,” in Proc. 12th International
Conference on Document Analysis and Recognition (ICDAR), pp. 572–576,
2013.
35. T. Sari, L. Souici, and M. Sellami, “Off-line handwritten Arabic character
segmentation algorithm: ACSA,” in Proceedings of the Eighth International
Workshop on Frontiers in Handwriting Recognition (IWFHR’02), (Washing-
ton, DC, USA), pp. 452–456, 2002.
36. G. Ball, S. N. Srihari, and H. Srinivasan, “Segmentation-based and
segmentation-free approaches to Arabic word spotting,” in Proc. 10th Int.
Workshop on Frontiers in Handwriting Recognition (IWFHR), pp. 53–58,
2006.
37. N. Nobile, M. Khayyat, L. Lam, and C. Y. Suen, “Novel handwritten words
and documents databases of five middle eastern languages,” in 14th In-
ternational Conference on Frontiers in Handwriting Recognition (ICFHR),
pp. 152–157, 2014.
38. M. Khayyat, L. Lam, and C. Y. Suen, “Arabic handwritten word spotting
using language models,” in Proc. of the 13th International Conference on
Frontiers in Handwriting Recognition (ICFHR), pp. 43–48, 2012.
39. M. Pechwitz, S. S. Maddouri, V. Märgner, N. Ellouze, and H. Amiri,
“IFN/ENIT - database of handwritten Arabic words,” in Proceedings of
Colloque International Francophone sur l’Ecrit et le Document (CIFED’02),
pp. 129–136, 2002.
40. S. A. Mahmoud, I. Ahmad, W. G. Al-Khatib, M. Alshayeb, M. T. Parvez,
V. Märgner, and G. A. Fink, “KHATT: an open Arabic offline handwritten
text database,” Pattern Recognition, vol. 47, no. 3, pp. 1096–1112, 2014.
41. F. Biadsy, R. Saabni, and J. El-Sana, “Segmentation-free online Arabic hand-
writing recognition.,” International Journal of Pattern Recognition and Ar-
tificial Intelligence (IJPRAI), vol. 25, no. 7, pp. 1009–1033, 2011.

Chapter 6

A‘rib — A Tool to Facilitate School Children’s Ability
to Analyze Arabic Sentences Syntactically

Mashael Almedlej and Aqil M Azmi


Department of Computer Science, King Saud University,
Riyadh 11543, Saudi Arabia
lcliiio@gmail.com, aqil@ksu.edu.sa

Analyzing Arabic language sentences grammatically is the key to
understanding their meaning. E‘raab is the process of syntactically
analyzing an Arabic sentence, and for many it is the most daunting task
when studying Arabic grammar in school. In this paper we develop and
implement a system that automates the task of analyzing Arabic
sentences syntactically. Our system, which we named A‘rib as it is the
imperative verb of e‘raab, is composed of three subsystems that imitate
the human e‘raab process. These are: the lexical analyzer, the syntactic
analyzer, and the results builder. The lexical analyzer identifies words in
the input sentence based on their type and properties, and outputs them
as tokens. The syntactic analyzer then parses the tokens from the
previous subsystem and tries to identify the sentence structure using
rules expressed in a context-free grammar format. Finally, we combine
the results of both subsystems (tokens and suitable rules), and output the
complete e‘raab of the sentence along with a fully vowelized sentence.
The system is intended for school children up to junior high level.

1. Introduction

Arabic is a Semitic language that is native to over 330 million speakers,1
as well as over a billion and a half Muslims who regularly read the Holy
Qur’an and perform the daily prayers. As a language, it is both challenging
and interesting. The Arabic language is quite old; it actually predates Islam.
Any person with a slight knowledge of Arabic can read and understand a
text written fourteen centuries ago. Hardly any living language can claim
such a distinction. Arabic can be classified as Classical or Modern.
Classical Arabic represents the pure language spoken by Arabs, whereas
Modern Standard Arabic (MSA) is an evolving variety of Arabic with
some borrowing to meet modern challenges, see Ref. 2.
There are 28 basic letters in the Arabic alphabet. In addition, there are
8 basic diacritical marks, which may be combined to form a total of 13
different diacritics. These marks are used to represent the three short
vowels (a, i, u), while the letters (‫ ﻯ‬,‫ ﻱ‬,‫ ﻭ‬,‫ )ﺍ‬are used to indicate vocalic
length. The diacritical marks are placed either above or below the letters
to indicate the phonetic information associated with each letter. This helps
clarify the sense and meaning of the word. Unfortunately, MSA texts are
typically devoid of diacritical markings.
The Arabic language is considered one of the richest languages in terms of
vocabulary and rhetorical structures. It is also quite an intricate language.
Consider the sentence, (‫ )ﻻ ﺗﻀﺮﺏ ﺯﻳﺪًﺍ ﻭﺗﻀﺤﻚ‬basically meaning, do not hit
Zaid and laugh. Based on the diacritical marking on the last letter of the
word (‫)ﻭﺗﻀﺤﻚ‬, there are three distinct meanings: (1) if it is ( ُ‫)ﻭﺗﻀﺤﻚ‬, then
the sentence actually means, you are not allowed to hit Zaid, but can
laugh; (2) and if it is ( ْ‫)ﻭﺗﻀﺤﻚ‬, then it means, you are forbidden from doing
both acts (hitting Zaid and laughing); and (3) for ( َ‫)ﻭﺗﻀﺤﻚ‬, we may do
either act but not both, i.e. we may hit Zaid but not laugh, or may laugh
but without hitting Zaid.3 The Arabic language presents some other
challenges as well, including long sentences with complex syntax, having
a pro-drop property, and being a free order language.4 The pro-drop
property means the subject may not be explicitly present.2 Arabic
sentences can take any form, VSO (Verb-Subject-Object), SVO, and
VOS.2 This free order property of the Arabic language presents a crucial
challenge for some Arabic NLP applications. Additionally, the lack of
diacritical markings in MSA often leads to ambiguity. For example, the
undiacritized word (علم) has several meanings, including (عِلْم) science and
(عَلَم) flag. This can even happen in spoken language as well. An individual
may read a sentence while ignoring the end case diacritics by making all
words end with the silence sound (‫)ﺳﻜﻮﻥ‬. It has the same impact as an
undiacritized sentence in the written form. For example, ( ‫ﺃﻣﺮ ﺍﻟﻤﺴﺆﻭﻝ‬
‫)ﺍﻟﻤﻮﻅﻒ‬, which could either mean the person in charge ordered the
employee, or the employee instructed the person in charge. All the above
should give an idea as to why mastering the Arabic language is very
demanding, even for natives. It also gives credence to why the language
has lagged behind others computationally.
As Islam spread, Arab grammarians were quick to lay down the rules
to prevent incorrect readings of the Holy Qur’an. They established a
completely new science called e‘raab (‫)ﺇﻋﺮﺍﺏ‬, which is the syntactical
analysis of Arabic sentences. E‘raab is the key to identifying the roles of
words and thereby the surface meaning of a sentence, and it is based on
Arabic syntactical rules known as (قواعد الإعراب), which play a major role in understanding
the semantics of a sentence. Some grammarians considered it an
intellectual exercise to generate different valid e‘raab of a sentence. It is
said the grammarians were able to generate 147 different e‘raab for the
sentence, (‫)ﻻ ﺭﺟﻞ ﻓﻲ ﺍﻟﺪﺍﺭ ﻭﻻ ﺍﻣﺮﺃﺓ‬.3
The problem is how to automate this process to make the computer
analyze Arabic sentences, and correctly classify its words into the main
Arabic language components. This will help in identifying the word’s role
in the semantics of the sentence. The diacritical signs in Arabic certainly
help alleviate some of the ambiguity, and their absence surely increases
the vagueness. Natives are somewhat good at resolving the ambiguity
based on the context, but this is truly a challenging problem from the
computer perspective.
In this paper we propose a system which aims to automatically analyze
the Arabic sentences syntactically, the process of e‘raab. We named it
A‘rib which is the imperative verb (‫ )ﻓﻌﻞ ﺃﻣﺮ‬of e‘raab. It is hoped that such
a system will help the Arab students with the e‘raab process, one of the
most dreadful tasks while studying grammar. It will also help those
learning Arabic as a second language to better understand the semantics
of sentences as well as appreciate the language’s intricateness. This system
can also be a nucleus for a more robust machine translation engine.
The proposed system is divided into three phases: a lexical analysis, a
syntactical analysis and a results builder. The lexical phase takes each
word of the input sentence and analyzes it, so as to figure out its role in the
sentence. The result of this phase is saved ready for use in the next phase.
In the second phase, we take the tokens out of the previous phase and try
to determine a matching Arabic rule. Finally, the tokens and the matching
Arabic grammar rule are used by the third phase, the results builder, to
generate the e‘raab and place the proper diacritical signs on the sentence.
The rest of the paper is organized as follows. In Section 2 we cover
related work. In Section 3 we go over basic Arabic sentence structure. The
system design is covered in Section 4. In Section 5 we go over
implementation details. Finally, in Section 6 we conclude our study.

2. Related Work

A few pioneering researchers have made significant attempts to open the
way to automating the structural analysis of Arabic sentences.
One of the earliest attempts was in Ref. 5, where the author proposed a
model for a system that tries to analyze the Arabic sentence according to
its syntax and explicit structure. The model ignored the semantics. The
author used context-free grammar (CFG), and the system was
implemented in Prolog. Ref. 6 is another early attempt, in which the
authors highlighted the importance of the morphology and syntax in the
field of NLU (Natural Language Understanding). This time the authors
introduced what they called an ‘end-case analyzer’ that was integrated
within an NLP system.
More recently, Ref. 7 developed a parser that processes an Arabic
sentence in order to automatically explain the role of each word in the
meaning of a sentence. The system is composed of two main parts: the
lexical analyzer, which includes a database that stores all Arabic words;
and the syntax analyzer, which contains a parser. The recursive parser uses
CFG to parse the sentence structure. One major drawback of the system
which is that it is limited to verbal sentences (‫ )ﺟﻤﻠﺔ ﻓﻌﻠﻴﺔ‬with active verbs
only (‫)ﻓﻌﻞ ﻣﺒﻨﻲ ﻟﻠﻤﻌﻠﻮﻡ‬.
A somewhat related work is the automatic diacritization of Arabic text.
The MSA texts are often devoid of diacritical markings, and native
speakers hardly suffer. However, there is a need for diacritical markings,
e.g. for children and those learning Arabic as a second language.
Moreover, certain NLP applications, such as automatic speech recognition,
text-to-speech, machine translation, and information retrieval, may need
diacritized texts as a source for learning.8 There are plenty of
works in this area. Ref. 4 presented a good survey of recent works in the
area of automatic diacritization. There is an overlap between the e‘raab
process and the diacritization of Arabic sentences, as they both are
concerned with the semantics of the sentence. As noted, the diacritical
markings help in making sense and meaning of the words, and in
disambiguating the sentence. So the difference between e‘raab and the
automatic diacritization is that the former has to justify all its
actions/decisions, while in automatic diacritization the program places an
appropriate diacritical marking often stochastically. This is why e‘raab of
an Arabic sentence is a more challenging problem than the automatic
diacritization.

3. Basic Arabic Sentence Structure

In this Section we will delve into basic sentence structure and relations
among sentence elements. This should help readers appreciate the level of
complexity associated with e‘raab. It is advised that readers consult Ref. 9
for more depth on the subject.
Traditional Arabic grammar divides sentences into two categories:
(‫ )ﺟﻤﻠﺔ ﺍﺳﻤﻴﺔ‬nominal sentences, and (‫ )ﺟﻤﻠﺔ ﻓﻌﻠﻴﺔ‬verbal sentences. The
difference depends on the nature of the first word in the sentence, whether
it is a noun or noun phrase; or verb (respectively). Nominal sentences
consist of a subject or topic (‫)ﺍﻟﻤﺒﺘﺪﺃ‬, and predicate (‫)ﺍﻟﺨﺒﺮ‬. That is, the
nominal sentence typically begins with a noun phrase or pronoun and is
completed by a comment on that noun phrase or pronoun. The predicate
or comment may be a complex structure: nouns, adjectives, pronouns, or
prepositional phrases. By default, both the subject and the predicate of the
nominal sentence are in the nominative case (‫)ﺣﺎﻟﺔ ﺭﻓﻊ‬. And in the case
where the predicate is a noun, pronoun, or adjective, it agrees with the
subject in gender and number. Interestingly, it is possible to reverse the
order and have the predicate before the subject. This occurs when the
subject lacks the definite article, as in the example (‫ )ﺑﻴﻨﻬﻤﺎ ﺷﺠﺮﺗﺎﻥ‬between
[the two of] them [are] two trees. Example of a complex predicate, where
among others, it could be another nominal sentence, e.g. (‫)ﺍﻟﺮﺑﻴﻊ ﻓﻀﻠﻪ ﻛﺒﻴﺮ‬
spring’s bounty [is] large, or even a verbal sentence, e.g. (‫)ﺍﻟﻜﺘﺎﺏ ﻳﻔﻴﺪ ﺍﻟﻘﺎﺭﺋﻴﻦ‬
the book benefits the readers.
The simplest verbal sentence consists of a verb and its pronoun subject,
which is incorporated into the verb as part of its inflection. This is what is
termed in modern linguistics as the ‘pro-drop’ feature. Past tense verbs
inflect with a subject suffix; present tense verbs have a subject prefix and
a suffix. When the subject noun is specified, it usually follows the verb
and is in nominative case. The verb agrees with the subject in gender, e.g.
(‫ )ﻧﺠﺤﺖ ﺍﻟﻄﺎﻟﺒﺔ‬the student succeeded (f.), but not always in number. The
verb could either be intransitive (‫)ﻓﻌﻞ ﻏﻴﺮ ﻣﺘﻌﺪﻱ‬, or transitive (‫)ﻓﻌﻞ ﻣﺘﻌﺪﻱ‬. In
the former case, it does not take a direct object, but may be complemented
by a prepositional phrase, e.g. (‫ )ﻳﻬﻄﻞ ﺍﻟﺜﻠﺞ ﻋﻠﻰ ﺍﻟﺠﺒﺎﻝ‬snow falls on the
mountains. While in the latter, the verb takes a direct object, which is in
the accusative case (‫)ﺣﺎﻟﺔ ﻧﺼﺐ‬, and the object may either be a noun, a noun
phrase, or a pronoun, e.g. (‫ )ﺭﻓﻊ ﻳﺪﻩ‬he raised his hand. If both the subject
and the object of the verb are specified, then the order is typically Verb-
Subject-Object (VSO), however, it is also possible to have the ordering
SVO, or VOS under certain conditions. In VSO, if the subject is dual or
plural, the verb inflects for gender agreement, and not number agreement,
e.g. (‫ )ﻛﺘﺐ ﺍﻟﻄﺎﻟﺒﺎﻥ ﺍﻟﺪﺭﺱ‬the two students wrote the lesson (m.). Some verbs
in Arabic take two objects, with both being expressed as nouns, noun
phrases, or pronouns, e.g. (‫)ﺃﺩﺭﺳﻬﻢ ﺍﻟﺮﻳﺎﺿﻴﺎﺕ‬
ّ ِ I teach them mathematics.
Moreover, the verb could either be in active voice or passive voice
(‫)ﻣﺒﻨﻲ ﻟﻠﻤﺠﻬﻮﻝ‬. In the first case, the doer of the action is the subject; while
in the passive the direct object of the verb becomes the subject, e.g.
(دُرِست القضية) the case was studied.

4. System Design

This work is concerned with designing a system that can automate Arabic
syntactical analysis, so as to produce the proper e‘raab results without
human intervention. In order to do that, it was necessary to review how
humans analyze the sentences to accomplish the task.
The normal analysis process goes through three main phases starting
with the sentence and ending up with the e‘raab results, as follows:
 Break down the target sentence into its main components, identifying
each component by its type and properties. This part is handled by the
lexical analyzer in the proposed system.
 Study the resulting components' relationships to each other, such that
they form a correct Arabic sentence conforming to the known rules
of Arabic syntax. This part will be handled by the syntactical
analyzer in the proposed system.
 Identify the role, e.g. (مبتدأ), and the case, e.g. (مرفوع), of those
components according to suitable grammatical rules. After this, their
signs (علامة الإعراب) are identified according to their kind and
properties. All this will be handled by the results builder.

Fig. 1. A‘rib system design.

According to the above steps, the proposed system is divided into three
subsystems (Figure 1). The lexical analyzer receives the user input
sentence and identifies its composed words by classifying their kind and
properties, through the help of the morphological analyzer. The output of
the lexical analyzer is a stream of tokens. Next, the syntactic analyzer
parses the tokens looking for a matching Arabic syntactical rule among a
list of predefined rules. And finally the results builder combines the results
of both lexical and syntactical analyzers (tokens + suitable rules) and
generates the sentence’s complete e‘raab. We will go over each component
in greater detail.

4.1. Lexical analyzer

This is the first part of the system which is responsible for analyzing the
input sentence and identifying its words’ properties. The tokens (word +
property) are stored, which in turn will help in the e‘raab process.
To accomplish this task, we start by isolating the words of the sentence
from each other, so they are ready to be lexically analyzed. This step
includes isolating the words from their prefixes and suffixes, which have
their own position on the syntax (‫ )ﻟﻬﺎ ﻣﺤﻞ ﻣﻦ ﺍﻹﻋﺮﺍﺏ‬such as (‫)ﺍﻟﻀﻤﺎﺋﺮ ﺍﻟﻤﺘﺼﻠﺔ‬.
These will each have their own token. Next, it determines the kind of each
word: noun, verb, or particle. Finally, it identifies the set of
properties that depends on the category of each word. For example, for
nouns, it should identify properties related to: type, gender, count,
variability … etc. For verbs, it needs to identify a verb’s tense, effect,
passivity, gender … etc. And for particle, it only needs to find its type and
sign. We need to classify all this information along with the prefixes and
suffixes of that word. This may help in preparing the final results in some
cases. Before transmitting the output of the lexical analyzer to the next
subsystem, the token tags are converted into English (Figure 2). This will
simplify processing in the next phase.
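For concreteness, such a token can be pictured as a small record holding the word
plus its tags. The following Java sketch is our own illustration (field names
loosely follow the Token fields shown later in Figure 7), not the exact A‘rib class:

    // Illustrative token record produced by the lexical analyzer.
    class Token {
        String word;           // the word as written in the input
        String dWord;          // the fully diacritized form
        String type;           // "noun", "verb" or "particle"
        String subType;        // e.g. past/imperative for verbs, noun kind for nouns
        String gender;         // M or F
        String num;            // Sing, Dual or Plur
        boolean isVariable;    // variable (معرب) vs. static (مبني) end-case
        String prefix, suffix; // clitics isolated during analysis
    }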

4.2. Syntactic analyzer

In this phase we process the tokens received from the lexical analyzer to
find the appropriate grammar rule corresponding to a valid Arabic
syntactical structure. For this we use both grammar and a parser, both of
which are components of this subsystem. The first component is simply a
set of Arabic language sentence structures, expressed in a formal way
using context-free grammar (CFG). The parser’s role is to find the
matching rule(s) from the CFG for the given set of tokens. In our
implementation, the CFG is stored in an external file, and the parser
dynamically parses the grammar. Saving the grammar in an external file
makes for easy editing, in case of error, or future addition of new rules.
Fig. 2. Lexical analysis results on the sample sentence, the successors are happy, with tags
converted into English.

We have two basic Arabic sentence categories: nominal, and verbal.
Nominal Sentence (NS) is the one that starts with a noun, and the Verbal
Sentence (VS) starts with a verb. The CFG rule for this using Extended
Backus-Naur Form (EBNF) is,
الجملة ← جملة اسمية | جملة فعلية
(i.e., Sentence → Nominal Sentence | Verbal Sentence)
In a CFG we have two disjoint sets of alphabets, terminals and non-
terminals. We further divide the non-terminals into: Sentence Component
(SC), and the Components of the Components (CC). The SC are the main
components which form the sentence, and the CC form small components
of the SC. The terminals, on the other hand, are simply the input tokens
received from the lexical analyzer. Figure 3 is an example on this
classification. The SC are non-terminals, which define the possible
occurrence and order of the main components that build the correct
sentence, e.g. subject + predicate (‫ ﺧﺒﺮ‬+ ‫)ﻣﺒﺘﺪﺃ‬, or verb + subject + object
(‫ ﻣﻔﻌﻮﻝ ﺑﻪ‬+ ‫ ﻓﺎﻋﻞ‬+ ‫)ﻓﻌﻞ‬. Table 1 lists some of the SC grammar for nominal
and verbal sentences. In writing the grammar, we added information,
expressed using superscripted text: the role (‫ﺍﻟﻤﺤﻞ‬, e.g. ‫)ﺍﺳﻢ ﺇﻥ‬, and the
judgment (‫)ﺍﻟﺤﻜﻢ‬, e.g. nominative (‫)ﺭﻓﻊ‬, accusative (‫ )ﻧﺼﺐ‬... etc. This tag is
intended for each main component of the sentence that could have nested
components in its place. The tag proves helpful when building the result
of the e‘raab sentence. Consider a nominal sentence made up of subject
and predicate. The predicate could be a sentence by itself, and in that case
we have to list it in the e‘raab as a sentence playing the role of a nominative
predicate (‫)ﻓﻲ ﻣﺤﻞ ﺭﻓﻊ ﺧﺒﺮ‬. These complex and nested structures in Arabic
sentences need to be tracked, and this tag helps in the tracking process. In
later examples we will show how exactly they are used.

Fig. 3. Example of CFG terminals and non-terminals.

Table 1. Sentence Components (SC) grammar for nominal and verbal sentences. The
superscripted pair holds information regarding the role and the case.

| ‫ ﺟﺎﺭ ﻭﻣﺠﺮﻭﺭ‬+ {‫ ﺭﻓﻊ‬،‫ ﺧﺒﺮ }ﺧﺒﺮ‬+ {‫ ﺭﻓﻊ‬،‫ﺍﻟﺠﻤﻠﺔ ﺍﻻﺳﻤﻴﺔ ← ﻣﺒﺘﺪﺃ }ﻣﺒﺘﺪﺃ‬


| ‫ ﺟﺎﺭ ﻭﻣﺠﺮﻭﺭ‬+ {‫ ﺭﻓﻊ‬،‫ ﺧﺒﺮ }ﺧﺒﺮ ﺍﻟﻔﻌﻞ‬+ {‫ ﺭﻓﻊ‬،‫ ﻣﺒﺘﺪﺃ }ﺍﺳﻢ ﺍﻟﻔﻌﻞ‬+ ‫ﻓﻌﻞ ﻧﺎﺳﺦ‬
‫ ﻣﺒﺘﺪﺃ‬+ ‫ ﺟﺎﺭ ﻭﻣﺠﺮﻭﺭ | ﺣﺮﻑ ﻧﺎﺳﺦ‬+ {‫ ﺭﻓﻊ‬،‫ ﻣﺒﺘﺪﺃ }ﻣﺒﺘﺪﺃ‬+ {‫ ﺭﻓﻊ‬،‫ﺧﺒﺮ }ﺧﺒﺮ‬
‫ ﺟﺎﺭ ﻭﻣﺠﺮﻭﺭ‬+ {‫ ﺭﻓﻊ‬،‫ ﺧﺒﺮ }ﺧﺒﺮ ﺍﻟﺤﺮﻑ‬+ {‫ ﻧﺼﺐ‬،‫}ﺍﺳﻢ ﺍﻟﺤﺮﻑ‬
،‫}ﻣﻔﻌﻮﻝ ﺑﻪ‬
‫ ﻣﻔﻌﻮﻝ ﺑﻪ‬+ {‫ ﺭﻓﻊ‬،‫ ﻓﺎﻋﻞ }ﻓﺎﻋﻞ‬+ {‫ ﺭﻓﻊ‬،‫ﺍﻟﺠﻤﻠﺔ ﺍﻟﻔﻌﻠﻴﺔ ← ﻓﻌﻞ }ﻓﻌﻞ‬
‫ ﻓﺎﻋﻞ‬+ {‫ ﻧﺼﺐ‬،‫ ﻓﻌﻞ ﻣﻀﺎﺭﻉ}ﻓﻌﻞ‬+ ‫ ﺟﺎﺭ ﻭﻣﺠﺮﻭﺭ | ﺣﺮﻑ ﻧﺼﺐ‬+ {‫ﻧﺼﺐ‬
‫ ﻓﻌﻞ‬+ ‫ ﺟﺎﺭ ﻭﻣﺠﺮﻭﺭ | ﺣﺮﻑ ﺟﺰﻡ‬+ {‫ ﻧﺼﺐ‬،‫ ﻣﻔﻌﻮﻝ ﺑﻪ}ﻣﻔﻌﻮﻝ ﺑﻪ‬+ {‫ ﺭﻓﻊ‬،‫}ﻓﺎﻋﻞ‬
‫ ﺟﺎﺭ‬+ {‫ ﻧﺼﺐ‬،‫ ﻣﻔﻌﻮﻝ ﺑﻪ}ﻣﻔﻌﻮﻝ ﺑﻪ‬+ {‫ ﺭﻓﻊ‬،‫ ﻓﺎﻋﻞ }ﻓﺎﻋﻞ‬+ {‫ ﺟﺰﻡ‬،‫ﻣﻀﺎﺭﻉ}ﻓﻌﻞ‬
‫ﻭﻣﺠﺮﻭﺭ‬

Table 2 shows grammar for the second set of non-terminals, the
Components of the Components (CC), which defines the possible
components of the SC, e.g. (‫ ﻣﻀﺎﻑ ﺇﻟﻴﻪ‬+ ‫ ﺍﺳﻢ ﺃﻭ ﺍﺳﻢ ﻧﻜﺮﻩ‬:‫)ﻓﺎﻋﻞ‬. The CC
grammar should cover all terminal symbols on their right hand side that
were not covered in the SC grammar.

Table 2. The Components of the components (CC) grammar.

‫ﻣﺒﺘﺪﺃ ← ﺍﺳﻢ ﻣﻌﺮﻓﺔ | ﺍﺳﻢ ﻧﻜﺮﺓ‬


‫ﺧﺒﺮ ← ﺍﺳﻢ ﻣﻌﺮﻓﺔ | ﺍﺳﻢ ﻧﻜﺮﺓ | ﺟﻤﻠﺔ | ﺟﺎﺭ ﻭﻣﺠﺮﻭﺭ‬
{‫ ﺟﺮ‬،‫}ﺍﺳﻢ ﻣﺠﺮﻭﺭ‬
‫ ﺍﺳﻢ ﻣﺠﺮﻭﺭ‬+ ‫ﺟﺎﺭ ﻭﻣﺠﺮﻭﺭ ← ﺣﺮﻑ ﺟﺮ‬
‫ﺍﺳﻢ ﻣﺠﺮﻭﺭ ← ﺍﺳﻢ ﻣﻌﺮﻓﺔ | ﺍﺳﻢ ﻧﻜﺮﺓ | ﺿﻤﻴﺮ ﻣﺘﺼﻞ‬
‫ﺣﺮﻑ ﺟﺮ ← ﻓﻲ | ﺇﻟﻰ | ﻋﻦ | ﺑـ | ﻟـ‬
‫ﻓﻌﻞ ← ﻓﻌﻞ ﻣﺎﺿﻲ | ﻓﻌﻞ ﻣﻀﺎﺭﻉ | ﻓﻌﻞ ﺃﻣﺮ‬
‫ﻓﺎﻋﻞ ← ﺍﺳﻢ ﻣﻌﺮﻓﺔ | ﺍﺳﻢ ﻧﻜﺮﺓ | ﺿﻤﻴﺮ ﻣﺘﺼﻞ‬
‫ﻣﻔﻌﻮﻝ ﺑﻪ ← ﺍﺳﻢ ﻣﻌﺮﻓﺔ | ﺍﺳﻢ ﻧﻜﺮﺓ | ﺿﻤﻴﺮ ﻣﺘﺼﻞ‬
| ‫ﺍﺳﻢ ﻣﻌﺮﻓﺔ ← ﺍﻟﻌﻠﻢ | ﺿﻤﻴﺮ ﻣﻨﻔﺼﻞ | ﺍﺳﻢ ﺇﺷﺎﺭﺓ | ﻣﺸﺘﻖ ﻣﻌﺮﻑ ﺑﺄﻝ‬
‫ﺍﺳﻢ ﺟﺎﻣﺪ ﻣﻌﺮﻑ ﺑﺄﻝ | ﺟﻤﻠﺔ ﺍﻟﺼﻠﺔ | ﺟﻤﻠﺔ ﺇﺿﺎﻓﺔ‬
{‫ ﺟﺮ‬،‫}ﻣﻀﺎﻑ ﺇﻟﻴﻪ‬
‫ ﺍﺳﻢ ﻣﻌﺮﻓﺔ‬+ {‫ﺟﻤﻠﺔ ﺇﺿﺎﻓﺔ ← ﺍﺳﻢ ﻧﻜﺮﺓ}ﻣﻀﺎﻑ‬
‫ﺍﺳﻢ ﻧﻜﺮﺓ ← ﻣﺼﺪﺭ ﻧﻜﺮﺓ | ﺍﺳﻢ ﺟﺎﻣﺪ | ﻣﺸﺘﻖ ﻧﻜﺮﺓ | ﺍﻷﺳﻤﺎء ﺍﻟﺴﺘﺔ‬
‫ ﺟﻤﻠﺔ‬+ ‫ ﺟﻤﻠﺔ ﺍﺳﻤﻴﺔ | ﺍﺳﻢ ﻣﻮﺻﻮﻝ‬+ ‫ﺟﻤﻠﺔ ﺍﻟﺼﻠﺔ ← ﺍﺳﻢ ﻣﻮﺻﻮﻝ‬
‫ﻓﻌﻠﻴﺔ‬

We use a dynamic parser to parse the predefined grammar to find all
possible rules that match the input sequence tokens received from the
lexical analyzer. The grammar is written in a certain format and is saved
in an external file; this eases the process of editing or updating the
grammar. Constraints on Arabic sentence structure are handled by the
parser through the use of word properties that are given in the tokens.
Some of these constraints include count, gender, or definite (‫ )ﻣﻌﺮﻓﺔ‬and
indefinite (‫ )ﻧﻜﺮﺓ‬articles. These constraints help the parser to detect the
most accurate rule for the input sentence, and narrows down the number
of possible matching rules. The output of the parser is the e‘raab structure
that corresponds to all matched rules. Figure 4 shows an example output
of the parser.
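Schematically, and leaving out the recursive expansion of non-terminals, rule
matching over a token sequence can be pictured as in the following sketch (our
own illustration; ruleRhs is assumed to hold the terminal symbols of a fully
expanded rule, and Token is the record sketched in Section 4.1):

    // Returns true when every symbol of the (expanded) rule accepts the token
    // at the same position; here reduced to a bare type check, while the real
    // parser also verifies the property constraints (count, gender, ...).
    static boolean matches(String[] ruleRhs, Token[] tokens) {
        if (ruleRhs.length != tokens.length) return false;
        for (int k = 0; k < ruleRhs.length; k++)
            if (!ruleRhs[k].equals(tokens[k].type)) return false;
        return true;
    }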
Fig. 4. Parsing the sentence (‫ )ﺃﺧﺬ ﺃﺣﻤﺪ ﻗﻠﻢ ﺻﺎﻟﺢ‬Ahmad took Saleh’s pen.

4.3. Results builder

This is the third and final part of the proposed system; along with the
e‘raab of the input sentence, it outputs appropriate diacritical markings. In
Arabic, the diacritics of the internal letters are morphologically determined
while the case-ending diacritics (or the e‘raab) are syntactically
determined. This subsystem imitates the regular e‘raab process, and
therefore requires the output of the two previous phases, the tokens and
the matching syntactic rules.
The builder uses the syntactic structure to figure out the role of each
token and grammatical judgment for each word in the sentence, e.g.
whether it is nominative (‫)ﻣﺮﻓﻮﻉ‬, accusative (‫ )ﻣﻨﺼﻮﺏ‬... etc. In order to
determine the actual sign the system makes use of the properties that were
attached to the tokens. For example, consider the signs used for the
nominative, known as (‫)ﻋﻼﻣﺎﺕ ﺍﻟﺮﻓﻊ‬. It could be the diacritic damma (‫)ﺍﻟﻀﻤﺔ‬
in case of singular noun, broken plural, feminine sound plural, and the
imperfect tense (‫ ;)ﻓﻌﻞ ﻣﻀﺎﺭﻉ‬it is the letter waw (‫ )ﻭ‬in case of masculine
sound plural and the five nouns (‫ ﺫﻭ‬،‫ ﻓﻮ‬،‫ ﺣﻤﻮ‬،‫ ﺃﺧﻮ‬،‫ ;)ﺃﺑﻮ‬it is the letter alif in
case of dual nouns; and it is the letter noon (‫ )ﻥ‬in case of the imperfect
verb with a personal pronoun (‫)ﺍﻷﻓﻌﺎﻝ ﺍﻟﺨﻤﺴﺔ‬.
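A minimal sketch of how such a sign could be chosen from the token properties
follows (our own illustration; the subType values are assumed labels for the
cases listed above, not A‘rib's actual fields):

    // Select the nominative sign (علامة الرفع) following the rules above.
    static String nominativeSign(Token t) {
        if (t.num.equals("Dual")) return "الألف";                  // dual nouns
        if (t.subType.equals("MascSoundPlural")
                || t.subType.equals("FiveNouns")) return "الواو";  // masc. sound plural, five nouns
        if (t.subType.equals("FiveVerbs")) return "النون";         // imperfect verbs with pronoun suffix
        return "الضمة"; // singular nouns, broken/feminine sound plurals, imperfect verbs
    }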
There is a specific format that grammarians use when writing the
e‘raab. The format differs according to many factors, primarily whether
the type of the word is variable or not. In addition, it may be necessary
to mention the reason that makes the present verb invariable, or the case
where a suffix or a special form of the word changes the sign of that word.
Table 3 lists various e‘raab sentence structures.

Table 3. E‘raab sentence structure. In the e‘raab format the optional argument is inside
the brackets.

Word | E‘raab format | Example
Nouns and verbs with dynamic end-cases (الأسماء والأفعال المعربة) | <role> <judgment> <sign> (<reason>) | مبتدأ مرفوع وعلامة رفعه الواو (لأنه جمع مذكر سالم)
Nouns with static case-ending (الأسماء المبنية) | <type> <static end-case sign> <judgment> <role> | ضمير متصل مبني على الفتح في محل رفع فاعل
Perfect tense and imperative verbs (الفعل الماضي والأمر) | <role> <static end-case sign> (<reason>) | فعل ماضٍ مبني على الضم (لاتصاله بواو الجماعة)
Imperfect tense verb with static case-ending (الفعل المضارع المبني) | <role> <static end-case sign> <reason> <judgment> | فعل مضارع مبني على السكون لاتصاله بنون النسوة في محل نصب

4.4. Special cases

One of the main problems an Arabic speaker faces is the issue of ambiguity
due to multiple meanings of the same words. This ambiguity cannot be
attributed to some kind of imperfection in the Arabic language; rather, it is
due to the modern custom of not writing diacritical signs. These signs fully
resolve the ambiguity and define exactly what the writer meant. Consider
the simple example, (سبقنا القطار). It could either mean (سَبَقَنا القطارُ), where
the subject is (القطار), in which case the train is ahead of us; or (سَبَقْنا القطارَ),
where the implicit subject is (نحن), in which case we are ahead of the train.
There are many ways to handle ambiguity in case the user input of a
plain text is devoid of any signs. The simplest is to ask the user to insert
the appropriate signs each time there is ambiguity. This may be annoying
for the user; moreover, the user might be using the program to inquire what
diacritical options he/she has and their associated meanings. This opens
the door for using the system to automatically guide the user through
possible appropriate signs. For this, the system will have to process all
possible sentence cases, and output all the e‘raab results along with the
proper signs in the sentences. It is worth noting that the number of possible
sentence structures is huge once the sentence is processed by the lexical
analyzer. The main reason is that the system will mull over all possible
types for each word with no regard to its syntactical structure. However,
the number of possibilities will come down when it is processed by the
syntactical analyzer, where all the unsuitable types will be ignored. Another
possible scheme to resolve the ambiguity is through semantic analysis. But
since the proposed system focuses on the syntactical analysis, we will
leave this for future work.
In designing the system we allowed for the possibility of errors on the
user's side. The errors could be lexical or syntactical, and
either way the system should be smart enough to handle them. Lexical
error means the lexical analyzer fails to recognize one or more of the input
words, whereas the syntactical error means the parser fails to find a
matching grammatical rule for the input sentence. In case of either error,
the process should stop when an error is encountered and the user will be
asked to recheck his/her input. If the user insists on the same input then
the system must gracefully terminate with an appropriate message
indicating failure to handle the input in its current form.

5. Implementation

We implemented the system in Java using NetBeans IDE. The system is
composed of three main classes that represent the system components
discussed earlier: LexicalAnalyzer, SyntacticAnalyzer, and ResultsBuilder.
These are called upon as soon as the user clicks the button to process the
input (Algorithm 1).
The object of the first class receives the user input text, and it stores the
lexical analysis results as tokens. These tokens are assigned to another
object which processes them syntactically storing the results (matching
rules and tokens) in forms of Solutions. An object of the last class takes
these Solutions in order to translate and print them in the appropriate
format for the user. Next we go over each class in more depth.

Procedure process_input(text)
{
    LexicalAnalyzer L = new LexicalAnalyzer()
    SyntacticAnalyzer S = new SyntacticAnalyzer()
    ResultsBuilder R = new ResultsBuilder()
    if (text is empty)
        print("Empty input")
    else {
        // Phase 1: tag each word of the input sentence and emit tokens.
        L.Input = text
        L.LexicalAnalysis()

        // Phase 2: match the token sequence against the CFG rules.
        S.Tokens = L.Tokens
        S.SyntacticAnalysis()

        // Phase 3: combine tokens and matched rules into the final e‘raab.
        R.Solutions = S.Solutions
        R.BuildFinalSentences()
    }
}
Algorithm 1. Dataflow between main classes of the system.

5.1. Lexical analysis

For lexical analysis we used Alkhalil,10 a free morphological analyzer. It
is an Arabic Part-Of-Speech (POS) Tagger that is highly acclaimed by
Arabic linguists. The tagger’s process is to take Arabic sentences as input
and output all the possible analysis cases of each word. The results are
written in plain Arabic as an HTML file, unfortunately with plenty of
redundant cases (Figure 5). Alkhalil covers most of the cases of each word,
each with different diacritical option. The results are shown with no regard
to the similarity of their effect on the e‘raab result nor to how rare the case
is. All this leads to ample redundancy and an increase in the final output.
Moreover, the output is geared toward professional human understanding,
summarizing the attributes using as few words as possible.

Fig. 5. Alkhalil’s output results as HTML. For the word (‫ )ﺫﻫﺐ‬it reports 22 results with
much redundancy. After post processing to remove the redundant results we end up with 7
results only.

The main process of the lexical analysis class is to handle these results
to produce the desired tokens, which will be carried forward to the next process:
the syntactic analysis. Figure 6 shows the lexical analysis activity. There
are two problems associated with Alkhalil that we need to address: (1)
Alkhalil outputs the results in the form of an HTML file, which we converted
to a more convenient form (strings); and (2) there is too much redundant data.
Following some analysis we managed to remove about 70% of the possible
redundant cases. We looked at the results that repeat the same properties
while ignoring judgments and diacritical markings. The remaining 30% of
redundant cases turned out to be more challenging. These couldn’t be done
without some sort of human intervention. The removed tokens were not
actually discarded from the system but rather kept for later use. The idea
is to keep the complete diacritics of the word from the original result when
called for.

Fig. 6. Lexical analysis activity diagram.
We need to properly tag each word. The tag depends on the result case,
which in turn depends on the type of the word. We created three functions
to do the tagging, one for each type: verb, noun, and particle. POSTV
handles the verbal analysis process. It identifies the verb tense, gender,
number, doer, activity, transitivity, variability … etc. POSTN is for the
nominal analysis process, where it identifies the noun type, gender,
number … etc. POSTP, for the particle analysis, only identifies the particle
type and sign. After this, the tokens are ready for syntactic analysis. As we
usually have more than one word with more than one case, all this is stored
in a two-dimensional Token array. The first dimension indicates the words
given by the user, while the second dimension specifies the cases for that
word. Figure 7 is an example of some tokens for the word (‫ )ﺫﻫﺐ‬following
the lexical analysis on the output of Alkhalil.
Fig. 7. Examples of Token objects (myTokens[][]) for the word (ذهب): a noun
reading and a past-tense verb reading.

Field | Token (noun) | Token (verb)
Word | ذهب | ذهب
dWord | ذَهَبُ | ذَهَبَ
Type | noun | verb
SubType | StNn | Past
Specialty | # | #
Gender | M | M
Num | Sing | Sing
NPluralType | # | #
isVariable | true | false
InVarSign | # | الفتح
Prefix | # | #
Suffix | # | #
Indices | 01234 | 16
NDefinitive | Comn | #
VPassivity | # | Actv
VTransitivity | # | 0
VDoer | # | Ab
StaticV | false | false

We would like to go briefly over some of the limitations/erroneous
behavior we encountered while working with Alkhalil. These impacted the
final e‘raab results.
 Even after removing some of the redundant results, the number of
remaining cases of each word was still high. This resulted in an
exponential number of sentences that went to the next stage of
processing. For example, for a sentence with four words where each word
has three cases, the total number of possible sentences is 3^4 = 81.

 Strangely some of the common words were not recognized by the
system. Words such as (‫ )ﺑﻨﺖ‬,(‫ )ﺗﻠﻚ‬,(‫ … )ﺃﻭﻟﺌﻚ‬etc.
 Incorrect categorization of some words, e.g. (‫ )ﻫﻨﺎﻙ‬and (‫ )ﺫﻟﻚ‬were not
considered demonstrative pronouns.
 Incomplete analysis. Some of the missing information we came across
are: gender of proper noun, and the number of accusative object of a
given verb. This information is necessary to generate an accurate
e‘raab of a sentence.

5.2. Syntactic analysis

This is the heart of the system that does the actual syntactic analysis for
the tokens received from the lexical analyzer. It finds the matching Arabic
sentence rule for the sequence of tokens. The matching rules are stored in
some form of solution objects and will be used in the final stage, the results
builder. The predefined CFG rules are stored in an external XML file, and
are dynamically parsed using a tree-based algorithm. Below we go over
this in more detail.

5.2.1. XML rules file

We stored the predefined rules in an external XML file to simplify the
process of editing and future addition. Each rule contains information
about the role and judgment of each component of that rule, which will
were used to build the final e‘raab. Figure 8 shows an example of a
grammar rule and its corresponding XML representation. Each CFG rule
is represented within <Rule> tag, where the components of the grammar
are broken down to several related information organized in multiple
internal tags. The left hand side (LHS) of the grammar is stored within
<Name> tag; while the grammar’s right hand side (RHS) (i.e. its
structure’s components) is stored within <Strc> tag separated by spaces.
The non-terminals are distinguished by capitalizing the first letter, while
the terminals are named using all lower case letters. The terminals should
either be a noun, a verb or a particle name, e.g. prpP for proposition
particles.
In case a non-terminal has more than one RHS structure, then we
represent each using a separate rule tag. Their <ID> tag is appended by
numbers to distinguish between them. In Figure 8 we are showing one
RHS for the nominal sentence with ID tag NSnt1. If we have other RHS
for the nominal sentence then their IDs will be tagged NSnt2, NSnt3, …
etc.
In case the structure contains a terminal component then its property is
listed within the <Cons> tag. The e‘raab information (condition of roles,
judgments … etc) of the structure is stored within the <Cond> tag. The
number of conditions equals the number of structure components and is
separated by spaces. The format of the conditions is fixed:

Fig. 8. A sample CFG rule stored in an XML format. The <Cond> tag marks the conditions.
Here we have two conditions (= the number of arguments to the right of →), one for
Subject and another for Predicate. Each condition consists of four arguments: Role,
Judgment, Addition, and Place. The arguments are separated by a “;”, and “#” indicates
N/A.

“Role;Judgment;Addition;Place”. The Role could be either the role's name
(e.g. subject, predicate, verb … etc) or “pre”, indicating it will assume
the previous rule's role. Similarly, the Judgment could be either the
judgment's name (e.g. nominal, accusative … etc) or “pre” to indicate it
will take the previous judgment. We use “#” to mark a judgment that is not
applicable, e.g. the invariable case. Figure 9 illustrates the <Cond> tag
with different Role and Judgment components. The Addition field specifies
whether there are additional phrases that need to be mentioned in the
final e‘raab sentence. For example, the first part of the genitive
construction is treated as a normal noun; however, at the end of the
e‘raab sentence we must mention (وهو مضاف). Finally, the Place field is
used to help track the placement of sentence components. It indicates
whether the component is in the place of another's role or not, e.g. the
genitive is in place of the predicate role (شبه جملة الجار والمجرور في محل رفع خبر).
This field can have one of three possible values, “PrRu”, “PrPl” and “#”,
which stand for (respectively): the current component is in place of the
previous role of the sentence; the current component inherits the previous
placement value; and no placement. A later example will explain these in
more detail.
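As an illustration, the following is a minimal Python sketch, assuming the fixed "Role;Judgment;Addition;Place" layout described above, of how a <Cond> value could be parsed; the names are ours, not the system's.

from dataclasses import dataclass

@dataclass
class Condition:
    role: str        # a role name, "pre" (inherit previous role), or "#"
    judgment: str    # a judgment name, "pre", or "#" (not applicable)
    addition: str    # extra phrase for the final e'raab, or "#"
    place: str       # "PrRu", "PrPl" or "#"

def parse_cond(cond_attr):
    """Split a <Cond> value such as 'Subj;Nomi;#;# Pred;Nomi;#;#'
    into one Condition per structure component."""
    return [Condition(*part.split(";")) for part in cond_attr.split()]

print(parse_cond("Subj;Nomi;#;# Pred;Nomi;#;#"))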

 
(CFG) Nominal Sentence → Subject + Predicate
(XML)

<Rule>
<ID>NSnt1</ID>
<Name>NSnt</Name>
<Strc>Subj Pred</Strc>
<Cons>----</Cons>
<Cond>Subj;Nomi;#;# Pred;Nomi;#;#</Cond>
</Rule>

<!-- the 2nd structure has “Predicate” as its role, and “Nominal” judgment -->

(CFG) Predicate → Genitive

(XML)

<Rule>
<ID>Pred2</ID>
<Name>Pred</Name>
<Strc>Geni</Strc>
<Cons>----</Cons>
<Cond>Pre;Pre;#;PrRu</Cond>
</Rule>

<!-- the structure inherits “Predicate” as its role, and “Nominal” as its judgment -->

Fig. 9. The <Cond> tag with different Role and Judgment.

5.2.2. The syntactic analyzer class

This is the main class where all the syntactic analysis is conducted (see
Figure 10). Each process contains several mini-processes that act together
to perform a task. It starts by receiving the tokens from the lexical
analyzer and creating all possible combinations of the tokens. This is
followed by loading the rules from the external XML file, ready for
parsing. Next it recursively parses the rules to find a match to the valid
structures of possible sentences. In the end it stores the matched rules
along with the tokens in the form of solution objects. Below we explain
these steps in more detail.
Creating token sentences. This process produces all possible sentences
from the given set of tokens. It builds the sentence tree of the input
tokens, where each path from the root down to the leaves represents one
possible morphological combination of the input sentence; see Figure 11
for an example. Each path is stored as one sentence, an array of tokens.
At the end, it produces a two-dimensional array that holds all possible
morphological combinations of the tokens. The number of sentences produced
is usually vast; however, it will be cut down at a later stage, when
sentences with no matching rules are removed following the syntactic
analysis.
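The following minimal Python sketch (ours, not the system's code) shows this combination step as a Cartesian product over the per-word case lists.

from itertools import product

def token_sentences(my_tokens):
    """Each root-to-leaf path of the sentence tree is one element of the
    Cartesian product of the per-word case lists."""
    return [list(path) for path in product(*my_tokens)]

# Four words with three cases each yield 3**4 = 81 candidate sentences.
demo = [[f"w{i}c{j}" for j in range(3)] for i in range(4)]
assert len(token_sentences(demo)) == 81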
 

Fig. 10. The activity diagram of the syntactic analysis.

Fig. 11. Generated sentence tree for the sentence (ذهب أحمد إلى المدرسة).

Loading the rules. This step loads the full set of rules into an
appropriate object. Though it may consume some storage, it is more
efficient than repeatedly scanning the XML file for matching rules.
Finding the matching rules. This process recursively fetches the rules and
builds the Arabic grammar for the given tokens. Overall, it explores all
possible Arabic grammatical combinations that have the same length as the
given tokens. For each token it checks the solution compatibility to
evaluate and test that combination. The search algorithm works in a
tree-based fashion, fetching the leftmost component until it encounters a
leaf (Figure 12). For each token sentence it fetches the main rule named
“Sent” (for Sentence) with an empty solution, a kind of bootstrap for all
the rules. The parser keeps calling itself recursively, each time replacing
the structure with a new solution. The process is conducted in a
depth-first search fashion. For each rule the parser fetches, it performs
two main operations: substituting the solution and checking it.
For the process of substituting the solution we need the current solution
and the name of the rule to be substituted along with its structure (the
new solution). Here we simply remove the rule from the solution and put
the new solution in its place. We copy the properties of the new solution
(Role, Judgment, Addition, and Place) from the previous solution. See the
example in Figure 13, which shows how the parser keeps tracing the
placements and the rules. After substituting a plausible solution, the
parser checks the new solution. If the check fails, the parser stops
navigating the rest of the current solution and moves on to the next.
There are two cases where the check fails: (1) the leftmost terminal does
not match the corresponding token in the sentence; and (2) the number of
arguments in the solution exceeds the number of tokens in the sentence
(see the filled boxes in Figure 12).
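The following is a highly simplified Python sketch of this depth-first matching; the grammar fragment mirrors the legend of Figure 12, but the rule set and data layout are illustrative assumptions, not the system's rule base.

GRAMMAR = {
    "Sent": [["NSnt"]],
    "NSnt": [["Subj", "Pred"]],
    "Subj": [["noun"]],
    "Pred": [["Geni"]],
    "Geni": [["prpP", "noun"]],
}
TERMINALS = {"noun", "prpP"}

def matches(solution, tokens):
    """Expand the leftmost non-terminal depth-first; fail early when a
    terminal mismatches or the solution outgrows the token sentence."""
    if len(solution) > len(tokens):            # the filled boxes in Figure 12
        return False
    if all(s in TERMINALS for s in solution):  # a leaf: all terminals
        return solution == tokens
    i = next(k for k, s in enumerate(solution) if s not in TERMINALS)
    if solution[:i] != tokens[:i]:             # leftmost terminals must match
        return False
    return any(matches(solution[:i] + rhs + solution[i + 1:], tokens)
               for rhs in GRAMMAR.get(solution[i], []))

print(matches(["Sent"], ["noun", "prpP", "noun"]))   # True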

Fig. 12. A sample syntax parsing tree process for a sentence with 3 tokens. The underlined
word is the non-terminal currently being processed. There are four types of boxes: solid,
dotted boundary, dashed boundary and filled. Solid boxes mean all arguments are
non-terminals and further exploration is required; dotted boxes mean arguments are mixed
(terminals and non-terminals) and we need to confirm that the terminals match; dashed boxes
mean all arguments are terminals, we have reached a leaf and there is no more exploration;
and filled boxes mean the number of arguments exceeds the number of tokens, so we simply
stop exploring. Legend: NSnt: Nominal Sentence (جملة اسمية); Subj: Subject (مبتدأ);
Pred: Predicate (خبر); GenC: Genitive Construction (مضاف ومضاف إليه); Geni: Genitive
(شبه جملة جار ومجرور); GenN: Genitive Noun (اسم مجرور); and prpP: Preposition Particle (حرف جر).

Step  Current structure (R) and conditions (C)            Rule applied
1     R: (start)                                          S0 → Sent(#;#;#;#)
2     R: Sent         C: #;#;#;#                          Sent → NSnt(#;#;#;#)
3     R: NSnt         C: #;#;#;#                          NSnt → Subj(Subj;Nomi;#;#) Pred(Pred;Nomi;#;#)
4     R: Subj Pred    C: Subj;Nomi;#;# Pred;Nomi;#;#      Subj → noun(Pre;Pre;#;#)
5     R: noun Pred    C: Subj;Nomi;#;# Pred;Nomi;#;#      Pred → Geni(Pre;Pre;#;#)
6     R: noun Geni    C: Subj;Nomi;#;# Pred;Nomi;#;#      Geni → prpP(prpP;#;#;#) noun(GenN;Geni;#;PrRu)
7     R: noun prpP noun
      C: Subj;Nomi;#;# prpP;#;#;# GenN;Geni;#;PrRu
      Places: Geni  Nomi  Pred

i.e. الجار والمجرور في محل رفع خبر

Fig. 13. Example explaining how the parser keeps tracing the placements and the rules.
Legend: Sent: Sentence; NSnt: Nominal Sentence; Subj: Subject; Pred: Predicate; Geni:
Genitive; GenN: Genitive Noun; PrRu: the current component is in place of the previous role
of the sentence; and prpP: Preposition Particle. The “#” means no value.

5.3. Results builder

This is the final stage of the process; it constructs the e‘raab result in
the format shown in Table 3. This class receives the solutions array from
the syntax analyzer containing all the required information: the solution
to determine the role and judgment, and the token to decide the proper
diacritic sign. The main function of this class calls the appropriate
function for each word of each case according to the token's type (verb,
noun, … etc.) and its properties (variable, invariable, … etc.). Within
these functions the role and the judgment of a word are translated into
proper Arabic e‘raab. The sign is deduced from the token's type.

5.4. Output

In the end the e‘raab results are displayed on the screen. Figure 14 shows
A‘rib's GUI displaying the e‘raab for the sentence (أكلت الخبز). Note the
multiple outputs, corresponding to all possible results that match the
rules in the CFG. The number of outputs is reduced when the sentence is
entered with diacritical markings.

Fig. 14. Our system's GUI showing results for the sentence (أكلت الخبز) entered without
diacritical signs. Note that we have multiple possible solutions.

6. Conclusion and Future Work

Arabic is a sophisticated language. The syntactic analysis of Arabic,
known as e‘raab, is necessary to fully understand a sentence. In this work
we propose A‘rib, a system that automates the syntactic analysis of Arabic
sentences. The system mimics the way humans perform the traditional
Arabic e‘raab, a process based on the Arabic syntax grammar. A‘rib
consists of three components: the lexical analyzer, syntactical analyzer and
results builder. These components work in tandem to produce the correct
e‘raab analysis. For the lexical analyzer we use Alkhalil, a public domain
Arabic morphological analyzer. The output is fed to the syntactical
analyzer, which dynamically parses the tokens from the lexical analyzer to
find a matching Arabic grammar rule stored externally in XML format.
Finally, the results builder writes the result of the e‘raab using proper
Arabic natural language as well as places the appropriate diacritical signs
on the words. The grammar rules are stored externally, making the system
flexible for future addition of new rules. Currently the system handles
grammar rules up to the junior-high school level.
A future improvement is to reduce the number of displayed solutions.
Moreover, the displayed solutions should be ordered according to
relevance, with the more common solutions displayed first.

References
1. CIA World fact book. Washington DC: Central Intelligence Agency (2008).
2. A. Farghaly and K. Shaalan, Arabic natural language processing: challenges
and solutions. ACM Trans Asian Lang. Inform. Process., 8(4):1-22 (2009).
3. R. Alkhawwam, Applied e‘raab and its applications (in Arabic). Retrieved
Sep 6, 2015, from uqu.edu.sa/page/ar/93207366.
4. A. Azmi and R. Almajed, A survey of automatic Arabic diacritization
techniques. Natural Language Engineering, 21(3):477-495 (2015).
5. M.G. Khayat and S.K. Al-Jabri, Model analysis of the Arabic sentences
structure (in Arabic). Proceeding of the 12th National Computer Conference:
Planning for the Informatics Society, Riyadh, Saudi Arabia, Oct 21-24, pp.
676-91 (1990).
6. A.D. Al-Sawadi and M.G. Khayat, An end-case analyzer for Arabic
sentences. J King Saud University: Computer & Information Sci. 8:21-52
(1996).
7. E. Al-Daoud and A. Basata, A framework to automate the parsing of Arabic
language sentences. Int Arab J Information Technology, 6(2):196-205
(2009).
8. S. Ananthakrishnan, S. Narayanan and S. Bangalore, Automatic
diacritization of Arabic transcripts for automatic speech recognition. Int.
Conf. Natural Lang. Processing (ICON-2005), Kanpur, India (2005).
9. K.C. Ryding, A Reference Grammar of Modern Standard Arabic. Cambridge
Univ. Press, pp. 57-72 (2005).
10. ALECSO, Alkhalil morphological system (2nd edition). The Arab League
Educational, Cultural and Scientific Organization (ALECSO). From
www.alecso.org.tn/index.php?option=com_content&task=view&id=1302&
Itemid=956 (2011).

Chapter 7

Semi-Automatic Data Annotation, POS Tagging and
Mildly Context-Sensitive Disambiguation:
The eXtended Revised AraMorph (XRAM)

Giuliano Lancioni†, Laura Garofalo†, Raoul Villano†,
Francesca Romana Romani†, Marta Campanelli‡, Ilaria Cicola‡, Ivana Pepe‡,
Valeria Pettinari‡ and Simona Olivieri§

†Roma Tre University, ‡Sapienza University of Rome, §University of Helsinki
giuliano.lancioni@uniroma3.it; laura.garofalo5@gmail.com;
raoulvillano@gmail.com; francescaromana.romani@gmail.com;
martac184@gmail.com; ilaria.cicola@gmail.com; ivanapepe27@gmail.com;
pettinari.valeria@uniroma1.it; simolivieri@gmail.com

An extended and revised form of Tim Buckwalter's Arabic lexical and
morphological resource AraMorph, named eXtended Revised AraMorph (XRAM),
is presented. A number of weaknesses and inconsistencies of the original
model are addressed by allowing a wider coverage of real-world classical
and contemporary (both formal and informal) Arabic texts. Building upon
previous research, XRAM enhancements include (i) flag-selectable usage
markers, (ii) probabilistic mildly context-sensitive POS tagging,
filtering, disambiguation and ranking of alternative morphological
analyses, and (iii) semi-automatic increments of lexical coverage through
the extraction of lexical and morphological information from existing
lexical resources. Testing XRAM through a front-end Python module showed
a remarkable success level.

1. Introduction

Tim Buckwalter's AraMorph (AM, see Ref. 1) is one of the most widespread
electronic resources for the Arabic lexicon and morphology. Applications
using it include text analyzers (e.g., BAMAE, see Ref. 2), ontologies
(e.g., the Arabic WordNet browser, see Ref. 3), data mining, and content
extraction (e.g., ArMExLeR, see Ref. 4).
However, the original version of AM shows a number of shortcomings, which
reduce the coverage of the morphological analyzer and hinder its
applicability to a number of genres and text types. In particular,
Buckwalter1 focused mainly on contemporary newspaper texts, which makes
the analyzer both underrecognize texts from other genres (because of lack
of lexical and morphological coverage) and overrecognize them (by
spuriously increasing the amount of ambiguity through the inclusion of
historically and linguistically implausible alternatives).
Some of these inconsistencies were tackled by the Revised AM model (RAM)
presented in Boella et al.5 However, the need for a structural, as opposed
to incremental, revision and expansion of AM is clear from the inability
of a merely enlarged version to go beyond a certain level of performance
in analyzing, e.g., Classical and modern informal texts.
XRAM presents itself as a structurally revised AM, which alters the basic
original structure by adding usage and genre markers and by augmenting the
original, rigidly context-free conception of the analyzer with a limited
amount of statistically gathered contextual selection information. These
enhancements allow for a considerably higher level of performance (see
Section 3).

2. Description of XRAM

XRAM, just like AM and RAM, aims at analyzing texts, but in a much more
defined and thorough way.
In order to enhance the accuracy of the analysis, we implemented a
flag-selectable usage markers tool through the addition of a supplementary
field in Buckwalter's analyzer (see Section 2.1).
After selecting a single flag or a set of flags according to the text
genre, the text is tokenized and all the punctuation and formatting
structure is stripped and factored out. The program then produces a list
of tokens ready to be processed by the XRAM analyzer, which aims to create
a list of possible analyses for each token represented in the original
text. Types (distinct tokens) are analyzed and a dictionary of analyses is
created, which assigns to each type a POS and a lemma, in order to reduce
computing time.
As mentioned above, ambiguity is a significant weakness in the original AM
model, and it definitely compromises the correct analysis of the text. The
XRAM RE module intervenes to reduce this ambiguity by filtering candidate
analyses through a limited set of regular expressions. This module
introduces a limited amount of context-sensitiveness into the system.
Analyses that survive the RE module are then ranked through a simple
Language Model (LM) module based upon Buckwalter and Parkinson's6
frequency list. Ranking introduces an order dimension in ambiguous
analyses by assigning decreasing levels of plausibility to POS-lemma
tuples.
XRAM capitalizes on the LM module by producing a semi-automated XML
tagging of the original text according to the TEI P5 standard: the
analysis with the highest rank is proposed as the default analysis, while
the others, lower in rank, are written in the XML output as alternative
analyses.
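The following schematic Python sketch summarizes this pipeline; all callables passed in are placeholders standing for the project's modules, not their actual interfaces.

def xram_pipeline(text, flags, analyzer, re_filter, lm_rank):
    tokens = text.split()                      # tokenization (simplified)
    types = set(tokens)                        # analyze each distinct type once
    analyses = {}
    for t in types:
        candidates = analyzer(t, flags)        # all morphological analyses
        candidates = re_filter(candidates)     # drop implausible candidates
        analyses[t] = sorted(candidates, key=lm_rank, reverse=True)
    return [(tok, analyses[tok]) for tok in tokens]

# Toy call with placeholder components:
result = xram_pipeline(
    "some text", flags=set(),
    analyzer=lambda t, f: [f"{t}/N", f"{t}/V"],
    re_filter=lambda cands: cands,
    lm_rank=lambda a: a.endswith("/N"))        # prefer nouns in this toy demo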

2.1. Flag-selectable usage markers

In order to make XRAM's linguistic analysis even more reliable, markers
are provided for graphemic, morphological and lexical features belonging
to specific language varieties among Classical Arabic (CA), Modern
Standard Arabic (MSA, formal), and Informal Colloquial Arabic (ICA,
informal),a including technical and scientific sublanguage. Inspiration
came from Buckwalter & Parkinson's Frequency Dictionary6: for each
recorded lemma, the dictionary provides morphological, syntactic,
orthographic and phonetic information as well as usage

a By "Informal Colloquial Arabic" we mean intermediate, relatively high-level varieties of
spoken Arabic that do not exhibit especially localized features and are relatively common
to speakers of different spoken varieties. ICA essentially corresponds to Mitchell's
Educated Spoken Arabic7 and Ryding's Formal Spoken Arabic.8
restrictions and register variations, according to the corpus where a
lemma can be found exclusively or most frequently.

Fig. 1. The XRAM pipeline.



Markers are encoded with flags which can be selected or unselected
according to the language variety or genre that the corpus to be processed
is representative of. The genre selection step is currently dependent on
user input, since it is outside the main task of the project, but several
ways to detect the genre (semi-)automatically might be envisaged. This
allows the analyzer to reduce the amount of false positives by discarding
non-relevant genre- and variety-specific features. Flags were specified
according to a number of diaphasic classification criteria, taking into
account lexical expansion and morphological phenomena. Flags are labeled
as follows:

Table 1. Genre flags.

FLAG FEATURE
XRAM_CA Classical Arabic
XRAM_MSA Modern Standard Arabic
XRAM_ICA Informal Colloquial Arabic
XRAM_SPEC_MED Medical Sublanguage
XRAM_SPEC_ALCH Alchemic Sublanguage
XRAM_SPEC_GRAM Grammatical Sublanguage
XRAM_NE Named Entities
XRAM_FNE Foreign Named Entities
XRAM_CAP Colloquial Aspectual Preverbs

Existing flags reflect the range of text genres included in the corpora
and subcorpora available in our research. The system can be easily
expanded by adding new flags. Flag selection is usually compounded: for
example, when processing a corpus of classical texts, the XRAM_MSA,
XRAM_ICA, XRAM_SPEC_MED, XRAM_FNE and XRAM_CAP flags will be deselected
in order to optimize the output analysis.
Flags can be easily and efficiently implemented according to standard IT
practices (as bit flags), which makes genre and text type filtering quick
and consistent.
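A minimal Python sketch of such bit-flag encoding, assuming the flag names of Table 1, could look as follows (this mirrors the standard practice alluded to, not XRAM's code):

from enum import IntFlag

class Genre(IntFlag):
    CA = 1; MSA = 2; ICA = 4
    SPEC_MED = 8; SPEC_ALCH = 16; SPEC_GRAM = 32
    NE = 64; FNE = 128; CAP = 256

# Processing a corpus of classical texts: modern/colloquial flags deselected.
selected = Genre.CA | Genre.SPEC_ALCH | Genre.SPEC_GRAM | Genre.NE
assert not (selected & Genre.MSA)   # MSA-only entries will be discarded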

2.2. Probabilistic mildly context-sensitive annotation

Tokenization, word segmentation and POS tagging are the core tasks AM has
carried out since its inception. Yet, since no syntactic information is
provided to the program, AM shows a high degree of morphological and
lexical ambiguity, particularly when processing unvocalized texts, due to
the homography which characterizes written Arabic, as for instance in:

WORD: الكتاب (Al+ktAb)

1  Al+kitAb+    ال٭كِتاب٭    kitAb_1 [كتب]
   the+book+
   Al/DET+Ndu+
2  Al+kut~Ab+   ال٭كُتّاب٭   kut~Ab_1 [كتب]
   the+kuttab (village school; Quran school)+
   Al/DET+N+
3  Al+kut~Ab+   ال٭كُتّاب٭   kAtib_1 [كتب]
   the+authors;writers+
   Al/DET+N+

To overcome this weakness, the revised version of AraMorph, RAM,5 relied
on the vocalization of hadith texts. Notwithstanding, RAM produces good
results only when processing a restricted range of text genres, i.e. CA
vocalized texts. This is why a further improvement of RAM is needed
through the application of a mildly context-sensitive process of
disambiguation. Specifically, we adopted a pipeline of two different but
complementary approaches: (i) a filtering RE component (the XRAM RE
module) and (ii) a ranking LM module.
On the one hand, the filtering RE component reduces the amount of possible
analyses by filtering out candidate sequences through regular expressions.
E.g., the preposition مع maʿa 'with' unambiguously needs to be followed by
a noun or (marginally) an adjective: the RE component includes a rule
(symbolically represented as [* مع V]) to filter out candidate analyses of
a word as a verb when preceded by مع.
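The following minimal Python sketch illustrates the idea behind the [* مع V] rule; the token/candidate data layout is an assumption for illustration.

def filter_after_maea(tokens, candidates):
    """tokens: surface forms; candidates: one list of POS tags per token.
    Drop the verb reading of any word directly preceded by مع."""
    filtered = []
    for i, cands in enumerate(candidates):
        if i > 0 and tokens[i - 1] == "مع":
            cands = [c for c in cands if c != "V"] or cands
        filtered.append(cands)
    return filtered

print(filter_after_maea(["مع", "كاتب"], [["PREP"], ["V", "N", "A"]]))
# -> [['PREP'], ['N', 'A']]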

On the other hand, the LM component ranks candidate analyses according to
the probability of individual POS-lemma tuples. This is a locally
sensitive disambiguation strategy which guides the ranking of alternative
morphological analyses for each lemma identified by XRAM. The more likely
these word combinations are to occur in the training and testing bases for
this kind of strategy, the higher their ranking level, i.e. they will
occupy top positions in the list of analyses provided by XRAM.
The LM component uses a hybrid approach: an order-3 language model drawn
from a manually corrected sample is combined with frequencies for
individual POS-lemma tuples drawn from Buckwalter & Parkinson.6
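As a toy illustration of tuple-based ranking, consider the following Python sketch; the frequency figures are invented, standing in for the corpus-derived counts.

FREQ = {("kAtib_1", "N"): 950, ("kAtib_2", "A"): 120, ("kAtab_1", "V"): 40}
TOTAL = sum(FREQ.values())

def rank(analyses):
    """Order candidate (lemma, POS) tuples by decreasing plausibility."""
    return sorted(analyses, key=lambda a: FREQ.get(a, 0) / TOTAL, reverse=True)

print(rank([("kAtab_1", "V"), ("kAtib_2", "A"), ("kAtib_1", "N")]))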
This drastically changes the previous versions of AM, giving research in
the fields of Arabic Corpus Linguistics and Arabic Computational
Linguistics a whole new perspective and an even more functional degree of
analysis by creating a morphology-syntax interface.

2.3. Lexical and morphological XML tagging of texts

Aiming to analyze texts taken from Arabic corpora, specific sections of
the study were devoted to designing XML materials using the model
available in the TEI (Text Encoding Initiative)b P5 structure for textual
annotation.9 Textual annotation adopts specific tags, which help users
identify precise information behind markers. The TEI standard, chosen for
its versatility and adaptability to various typologies of texts, fits
these specific purposes well, even if adapted by validators from time to
time, depending on the case.
Morphological and lexical annotations are instead based on results given
by RAM, which provides a precise analysis of each occurring lemma, giving
information about stems, the function of the word and a series of tags
showing morphological features.
A combination of the two systems showed a remarkable success level,
enabling readers to clearly identify all available information on given
materials, including both textual and word-related (morphological and
lexical) information.

b http://www.tei-c.org/index.xml

In fact, in addition to tags and basic information, such a combination
provides general information which clearly identifies the main features
of the texts (such as the average length, the frequency and occurrence of
lemmas, and the identification of specific elements) just by interpreting
the combinations derived from the two overlapped patterns.
By partially re-writing and thus extending RAM's operating range, a
further development will be the semi-automatic annotation of XML texts
modeled on the TEI structure. Thus, by analyzing Arabic annotated texts
employing RAM, the results will provide each word with all possible
readings, giving specific information for every annotated reading.
Furthermore, by splitting the information derived from the RAM analysis,
the combination process is refined by embedding data in XML elements
provided by the TEI standard. In particular, the tag used to identify a
word from the text is <w>, with an additional series of attributes such as
'lemma' or 'type' to distinguish base forms and specific functions.
The system automatically assigns the <w> tag to the top-ranked analysis
selected by the LM component; as a container, <w> cannot be embedded more
than once for one and the same input word. Less likely analyses are marked
with the annotation tag <note>, with the analysis encoded in the 'ana'
attribute in order to distinguish different readings of the same word.
While reviewing the XML output text, the annotator can reverse the default
analysis by adding an attribute ed="correct" to one of the <note>
elements. An XSLT transformation then takes care of promoting the marked
analysis to <w> and demoting the corresponding <w> analysis to a <note>
marker.
A sample derivation is shown for the prepositional phrase مع كاتب 'with a
writer'. The XRAM analyzer outputs one analysis for مع (the XRAM system
reformats subsets of AM information in the form
vocalized_form/lemma/pos): maEa/maEa_1/PREP, while three analyses are
yielded for كاتب:

(1) kAtab+a/kAtab_1/V
(2) kAtib/kAtib_1/N
(3) kAtib/kAtib_2/A

Analysis #1 is filtered out by the RE rule [* مع V], while the LM
component ranks #2 over #3. This is the result fragment in XML notation:

<w ana="maEa/maEa_1/PREP">مع</w>
<w ana="kAtib/kAtib_1/N">
<note ana="kAtib/kAtib_2/A"/>كاتب</w>

The fragment shows the unique analysis for مع and the top-ranked analysis
for كاتب encoded in the 'ana' attribute of the <w> tag, while the
alternative analysis for كاتب is encoded as a <note>. If the annotator
prefers one of the alternative analyses, (s)he adds the attribute
ed="correct" to it:

<w ana="kAtib/kAtib_1/N">
<note ana="kAtib/kAtib_2/A" ed="correct"/>كاتب</w>

and launches the XSLT transformation, which reverses the selection:

<w ana="kAtib/kAtib_2/A">
<note ana="kAtib/kAtib_1/N"/>كاتب</w>
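The following minimal Python sketch reproduces the effect of this swap using ElementTree rather than the project's XSLT transformation.

import xml.etree.ElementTree as ET

w = ET.fromstring('<w ana="kAtib/kAtib_1/N">'
                  '<note ana="kAtib/kAtib_2/A" ed="correct"/>كاتب</w>')
note = w.find("note[@ed='correct']")
if note is not None:
    old_default = w.get("ana")
    w.set("ana", note.get("ana"))   # promote the corrected analysis
    note.set("ana", old_default)    # demote the previous default
    del note.attrib["ed"]
print(ET.tostring(w, encoding="unicode"))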

2.4. Semi-automatic increment of lexical coverage

One of the weak points of AraMorph is the limited range of text genres on
which the resource was based: the lexicon files as well as the
compatibility tables included in the program are mostly based on newspaper
texts and other Modern Standard Arabic non-literary texts, which largely
comprise the LDC Arabic corpus. Not only is the program unbalanced and
representative of a limited part of the Arabic vocabulary; its lists also
lack any stylistic and chronological information. Because of this, various
problems can arise from the analysis of other textual genres, especially
Classical and (both formal and informal) contemporary ones. Analyses
conducted on Pre-Islamic and Classical texts, such as Hadith texts,5
reveal that the main weak points of AM are:
(i) the rejection or wrong analysis of words such as the 'ā- interrogative
prefix, as well as imperative verbs that are not included in AM due to
their rare occurrence in the texts AM targets. In addition, other errors
that occur with classical Arabic corpora, especially pre-Islamic ones,
involve broken plurals as well as certain verb stems (mainly maṣdars,
participles, quadriliteral verbs, jussive verbs, passives), which are
either uncommon, as in the case of the quadriliteral تخنذذ (see Table 2),
or are written in a nonstandard form not recognized by the analyzer, for
example with the sukūn on the last letter. Note that when dealing with
poetry there are other metrical phenomena that are not recognized by the
analyzer, such as the 'alif or the yā' followed by the hā' at the end of
the verse to create a rhyme (this was found when inserting the poetical
works, or Dīwān, of the pre-Islamic poetess Al-Ḫansā' as a corpus in the
analyzer);
(ii) the risk of false positives due to the presence of contemporary named
entities inside the AM lexical lists, which are included in the search
even when a Classical text is analyzed (the same point has already been
approached and partially overcome in the above-mentioned Boella et al.5).
On the other hand, for contemporary formal texts such as newspapers and
novels, as well as contemporary informal texts such as blogs and social
networks, one of the most important problems is the lack of a graphemic
standardization of:
(iii) transliterated foreign words that Arabic borrows nowadays,
especially from English, and arranges phonetically according to dialectal
and idiosyncratic varieties, which influence their transcription.c Among
these types there are not only proper nouns of people and places but also
common nouns (for some examples see Table 2);
(iv) dialect words, which are also exposed to strong idiosyncratic
variation when they are transcribed (for some examples see Table 2).
Thus, the XRAM project aims at enhancing AM through the inclusion of
additional lists of prefixes, stems and suffixes with the corresponding
compatibility tables, in order to address points (i), (iii) and (iv).
Several parts of the above-mentioned lists will be automatically extracted
from Arabic lexical resources currently available in XML format. For
Classical texts, one of the most important resources is Salmoné's
Arabic-English dictionary,11 which is entirely encoded according to TEI
standards and downloadable as an XML file. As for transliterated foreign
words, a solution is proposed by cross-checking the concerned items with
Arabic Wikipedia, which is one of the largest online encyclopedias in

c As for the Egyptian variety, Rosenbaum10 defines this linguistic phenomenon
as "Egyptianized English".
existence. Its large list of named entities has already inspired projects
meant to potentiate and expand other Arabic lexical resources like Arabic
WordNet.12 Inside the XRAM project, the use of Arabic Wikipedia was aimed
at aligning the transcription of foreign words and thus adding them to
Buckwalter's lists.
As for the most frequent unanalyzed dialect words, the solution is to
manually compile a list to include in AM, since XML resources are not
widely available at the moment, aside from a few recently investigated
varieties.13

Table 2. Sample of unrecognized words in AM.

Classical Arabic
  quadriliteral 'become evil'           تخنذذ
  maṣdar III 'thrust of the spear'      الطعان
  ašuğāʿun / ā- inter. + adj. 'brave'   اَشُجاعٌ
Transliterated foreign named entities
  Arizona                               اريزونا
  Youtube                               يوتوب
  Huffington                            هفنجتون
Transliterated foreign common nouns
  aircraft                              إيركرافت
  protocol                              بروتكول
  the autobus                           الأتوبيس
Dialect words
  illī / relative pron.                 اللي
  āntūn / 2nd-person plural             آنتون
  dā / m. s. dem. pron. / adj.          دا

3. Validation and Research Grounds

The evaluation of a tool such as XRAM involves some differences from
standard evaluation methods in lemmatization and POS tagging tasks. First
and foremost, the system outputs, on purpose, all available analyses and
does not yield an analysis (e.g., through tentative reconstruction or
error correction) where the analyzer has found none.
A first evaluation metric is the rate of unrecognized words according to
text genre (see Table 3 below, and also Table 1, Section 2.1):

Table 3. Comparison of recognition rates.

GENRE                          XRAM (% unknown)   AM (% unknown)
Classical Arabic 3.4 12.4
Modern Standard Arabic 1.7 2.5
Informal Colloquial Arabic 7.6 18.5
Medical Sublanguage 1.3 7.5
Alchemic Sublanguage 3.5 14.2
Grammatical Sublanguage 2.7 8.6
Named Entities 6.5 7.6
Foreign Named Entities 14.3 15.6
Colloquial Aspectual Preverbs 6.7 23.4

While the performance of XRAM is only marginally better than AM on MSA
texts, more specific genres show a remarkably higher performance, because
of usage markers and the increased coverage of the lexica.

4. Conclusion

XRAM significantly enhances AM performance, especially for genre-specific
texts. The model can be further enhanced by widening the filtering and
ranking modules and by increasing the coverage of the lexicon, while
keeping ambiguity low through an increasingly refined assignment of usage
markers.
Further development involves integrating current research on formal
grammar (specifically, Combinatory Categorial Grammar, CCG14) within the
ranking module.

References

1. T. Buckwalter, Buckwalter Arabic Morphological Analyzer Version 1.0. Linguistic
Data Consortium, Philadelphia (2002).
2. S. Alansary, BAMAE: Buckwalter Arabic Morphological Analyzer Enhancer. In
Proc. 4th International Conference on Arabic Language Processing, pp. 1–9. Rabat,
Morocco (2012).

3. C. Fellbaum, M. Alkhalifa, W.J. Black, S. Elkateb, A. Pease, H. Rodríguez and P.
Vossen, Introducing the Arabic WordNet project. In Proc. 3rd Global Wordnet
Conference, pp. 295–299, Jeju Island, Korea (2006).
4. G. Lancioni, L. Benassi, M. Campanelli, I. Cicola, I. Pepe, V. Pettinari and A.
Silighini, Arabic Meaning Extraction through Lexical Resources: A General-
Purpose Data Mining Model for Arabic Texts. In Proc. IMMM 2013: The Third
International Conference on Advances in Information Mining and Management, pp.
107–112, Lisbon, Portugal (2013).
5. M. Boella, F. R. Romani, A. Al-Raies, C. Solimando and G. Lancioni, The SALAH
Project: Segmentation and Linguistic Analysis of Hadīth Arabic Texts. Information
Retrieval Technology, Lecture Notes in Computer Science 7097, Springer, Heidelberg,
538–549 (2011).
6. T. Buckwalter and D. Parkinson, A Frequency Dictionary of Arabic. Routledge,
London and New York (2011).
7. M. T. Mitchell, Dimensions of style in a grammar of educated spoken Arabic,
Archivum Linguisticum 11, 89–106 (1980).
8. K. C. Ryding, Proficiency despite diglossia: A new approach for Arabic. Modern
Language Journal 75:2, 212–218 (1991).
9. L. Burnard and S. Bauman, TEI P5: Guidelines for Electronic Text Encoding and
Interchange. Text Encoding Initiative Consortium, Charlottesville, Virginia (2013).
10. G. Rosenbaum, The Growing Influence of English on Egyptian Arabic. ’Alf Lahǧa
wa Lahǧa. In Proc. 9th Aida Conference, pp. 377–384, Lit, Wien, Austria (2014).
11. H. A. Salmoné, An Advanced Learner's Arabic-English Dictionary. Librairie du
Liban, Beirut (1889).
12. M. Alkhalifa, H. Rodríguez, Automatically extending Named Entities coverage of
Arabic Wordnet using Wikipedia. International Journal on Information and
Communication Technologies, 3 (3) (2010).
13. N. Habash, R. Eskander and A. Hawwari, A Morphological Analyzer for Egyptian
Arabic. In Proc. Twelfth Meeting of the Special Interest Group on Computational
Morphology and Phonology (SIGMORPHON2012), pp. 1–9, Montreal, Canada,
(June 2012).
14. M. Steedman, Surface Structure and Interpretation. MIT Press, Cambridge,
Massachusetts, (1996).
15. M. M. El-Zahhar and N. F. El Gayar, A semi-supervised learning approach for
soft labeled data. ISDA 2010, pp. 1136–1141 (2010).
16. N. Habash, Large Scale Lexeme Based Arabic Morphological Generation. In Proc.
Traitement Automatique du Langage Naturel (TALN-04), Fes, Morocco (2004).
17. A. Pasha, M. Al-Badrashiny, M. Diab, A. El Kholy, R. Eskander, N. Habash, M.
Pooleery, O. Rambow and M. R. Roth, MADAMIRA: A Fast, Comprehensive Tool
for Morphological Analysis and Disambiguation of Arabic. In Proc. Ninth
International Conference on Language Resources and Evaluation (LREC'14)
pp. 1094–1101. Reykjavik, Iceland (2014).

18. N. A. Smith, D. A. Smith and R. W. Tromble, Context-Based Morphological
Disambiguation with Random Fields. In Proc. Human Language Technology
Conference and Conference on Empirical Methods in Natural Language Processing
(HLT/EMNLP), pp. 475–482, Vancouver, Canada (2005).
19. Z. Suraj, N. El Gayar and P. Delimata, A Rough Set Approach to Multiple
Classifier Systems. Fundam. Inform. 72 (1-3), 393–406 (2006).

Chapter 8

WeightedNileULex: A Scored Arabic Sentiment Lexicon for
Improved Sentiment Analysis

Samhaa R. El-Beltagy
Center for Informatics Science, Nile University,
Juhayna Square, Sheikh Zayed City,
Giza, Egypt
samhaa@computer.org

Arabic sentiment analysis has been consistently gaining attention over
the past couple of years. While research in the area of English sentiment
analysis has often been aided by the presence of sentiment lexicons, such
lexicons are very scarce for the Arabic language. Furthermore, the
lexicons that do exist are not scored or weighted. This paper describes
the process by which entries in an existing Arabic sentiment lexicon
built by the author were assigned scores. Through a number of experiments
on different datasets, it also shows that the use of the scored lexicon
almost always improves the accuracy of sentiment analysis.

1. Introduction

Over the past few years there has been an increase in interest in the
topic of Arabic sentiment analysis and opinion mining. The increased
interest in this area is a direct result of the surge in usage of the
Arabic language within various social media platforms, amongst which are
Twitter and Facebook.1,2,3 Many approaches to sentiment analysis require
the existence of sentiment lexicons, which are currently scarce for the
Arabic language. In previous work, the author presented NileULex,4 a
manually constructed Arabic sentiment lexicon containing approximately
six thousand Arabic terms and phrases, of which 45% are colloquial
(mainly Egyptian). This work extends the previous work by presenting
a method for automatically assigning strength scores or weights to
NileULex entries, as well as availing the resulting lexicon,
"WeightedNileULex", for public use. Experiments carried out using a very
simple sentiment analysis system over four different datasets show that
using the weighted lexicon always enhances polarity classification over
using the un-weighted lexicon, and that using either of the lexicons
(weighted or un-weighted) always improves the accuracy compared to not
using a lexicon at all.
The rest of this paper is organized as follows: Section 2 gives an
overview of related work, Section 3 presents a description of the baseline
un-weighted lexicon, Section 4 describes the method that was used for
scoring the lexicon, Section 5 presents the experiments conducted in order
to evaluate the usefulness of the weighted lexicon, and finally, Section 6
concludes this paper.

2. Related Work

Sentiment lexicons play an important role in polarity determination for
sentiment analysis systems. Because sentiment lexicons play an integral
role in most sentiment analysis systems, many such lexicons have been
developed for the English language. The most commonly used English
lexicons include: SentiWordNet,5 Bing Liu's opinion lexicon,6 the MPQA
subjectivity lexicon,7 and the NRC Word-Emotion Association Lexicon.8
Recently, English Twitter-specific lexicons have also come into existence
and are increasingly being used. These include the Hashtag Sentiment
Lexicon and the Sentiment140 Lexicon.9,10 Of those lexicons, only
SentiWordNet, the Hashtag Sentiment Lexicon, and the Sentiment140 Lexicon
are scored. However, the importance of assigning a score to the various
entries in sentiment lexicons has recently surfaced and has become subject
to research. In fact, this particular research area was introduced as a
subtask in SemEval 2015 (Task 10) and again in SemEval 2016 (Task 7). In
both years training data was provided. In 2015, the top performing team
for this subtask employed word embeddings to train a logistic regression
model for assigning scores to sentiment terms.11 The second best
performing team used 6 different sentiment lexicons to score input terms
(2 manually created, and 4 automatically created). Basically, input terms
were compared against entries in the lexicons. If a term was found in a
manually constructed lexicon, it was assigned a value of 1 or -1,
depending on its polarity. If it was found in any of the automatically
created lexicons, it was assigned the score found in those lexicons. If it
was not found in any of the used lexicons, it was assigned a default
value.12 In 2016, the best performing system (ECNU) used a supervised
approach which employed Random Forests for ranking terms. The input was
based on sentiment-specific word embeddings generated using a dataset
consisting of 1.6M tweets collected and annotated using supervised
learning. Two existing lexicons were also used.13
Arabic lexicons are much scarcer than their English counterparts. One of
the first attempts to build an Arabic sentiment lexicon was proposed in
14. In this work, the authors presented an approach for building an
Egyptian lexicon and mapping entries within it to their Modern Standard
Arabic (MSA) synonyms. The presented approach was an automatic one that,
when evaluated on a collected set of 1000 entries, produced an F-measure
of 70.9%. The work was primarily focused on acquiring single terms, with
the inclusion of compound phrases cited as an area for future work. Early
work on assigning scores to lexicon entries was presented in 15. In this
work, the authors presented a method for semi-automatically building a
sentiment lexicon that consists of single as well as compound terms. The
authors also proposed two different approaches for assigning scores to
sentiment terms and demonstrated that the introduction of sentiment scores
can increase the accuracy of sentiment analysis by up to 20.6%. There have
been other attempts to build Arabic lexicons, but those have focused
primarily on Modern Standard Arabic and contained only single terms.16-18
More recently, the authors of 19 constructed a sentiment lexicon by
devising a matching algorithm that tries to match entries in the lexicon of
an Arabic morphological analyzer to entries in SentiWordNet.5 When a
match is found, a link is created between the lexicon entry and the
matching entry in SentiWordNet and the scores of the matching term in
SentiWordNet are assigned to that entry.
In an attempt to avail the NRC Word-Emotion Association Lexicon (EmoLex)8
in other languages including Arabic, its authors resorted to automatic
translation of all entries to each target language and availed it online.
However, the work presented in 4 has shown that the quality of this
translated lexicon is not as high as a manually constructed one and that
sentiment analysis accuracy does suffer when using such lexicons. EmoLex,
however, is the only Arabic lexicon other than NileULex4 that contains
compound terms, but it must be stated that the number of compound entries
in this lexicon is very limited. To the knowledge of the author, NileULex
is the only lexicon that has both Arabic compound phrases and common
idioms as entries.

3. The Base Lexicon

As stated in the introduction, this work builds on a previously
constructed lexicon called NileULex. The process of building the lexicon
and evaluating it is presented in 4. The version of the lexicon presented
in 4 contained a total of 5953 unique terms, of which 563 were compound
negative phrases, 416 were compound positive phrases, 3693 were
single-term negative words and 1281 were single-term positive words.
However, since the lexicon is continuously being updated, the version that
has served as the base for this work had an additional 261 terms. While
most of the colloquial terms in this lexicon are Egyptian, a few terms
from other dialects have made their way into the lexicon. Some terms that
are transliterations of English words have also been included; examples
include terms such as كيوت (cute) and لايك (like). Table 1 shows an
example of the various entry types within the lexicon, along with their
translations. Out of the four compound phrases listed in this table, the
polarity of only two entries ("اجهل خلق الله", "ايه الحلاوه دى") can be
determined using some of their constituent words. Individually, the
constituent words of the other two phrases ("الكلمات لا تسعني", "ناس بيئه")
give no indication of their polarity. This is the case with many compound
phrases in the lexicon.

Table 1. An example of some entries in the lexicon.

Term               Type          Dialect a  English translation

الكلمات لا تسعني     compound_pos  MSA        I am lost for words
اجهل خلق الله       compound_neg  MSA        The most ignorant of beings
ايه الحلاوه دى      compound_pos  EG         So beautiful (Wow)
ناس بيئه            compound_neg  EG         People with no class
جميل               Positive      MSA        Beautiful
قبيح               Negative      MSA        Ugly
عسول               Positive      EG         Sweet
اتريق              Negative      EG         Made fun of
زومبي              Negative      DIA        Transliterated word for 'zombie'

4. Assigning Scores to Lexicon Entries

In order to assign strength scores to the input Arabic lexicon, a number
of steps were carried out. These steps can be summarized as follows:
(1) Data collection: collect tweets for each lexicon entry
(2) Collecting term statistics: collect co-occurrence statistics from the
collected tweets
(3) Term scoring: calculate a score for each term
Each of the above steps is explained in the following subsections.

4.1. Data collection

Since the goal of our work was to try to assign a strength score to each
positive and negative lexicon entry, we had to obtain a representative set
of tweets for each term. We chose to retrieve 100 unique tweets for each
term using Twitter's search API.20 There were very few cases, however,
where the search API was unable to retrieve this number of tweets, and
cases where no tweets were retrieved at all. To ensure that tweets were in
fact unique, near-duplicates were filtered out using the Jaccard
similarity measure.21

a Dialect can be Modern Standard Arabic (MSA), Egyptian (EG) or simply a dialectical
term (DIA) which is not specific to one Arabic-speaking country or region.
In total, approximately 500K tweets were used for deriving scores for the
input lexicon. This collection of tweets will henceforth be referred to as
the twitter corpus.
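The following is a minimal Python sketch of near-duplicate filtering with the Jaccard measure over word sets; the 0.8 threshold is our assumption, not a value from the chapter.

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def dedup(tweets, threshold=0.8):
    kept = []
    for t in tweets:
        # keep a tweet only if it is not too similar to any kept tweet
        if all(jaccard(t, k) < threshold for k in kept):
            kept.append(t)
    return kept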

4.2. Collecting term statistics

After carrying out the data collection step described in the previous
subsection, each of the collected tweets was processed in order to extract
statistics for lexicon terms. In this processing step, each tweet was
scanned for lexicon terms and negated lexicon terms. A dictionary was
created for each lexicon term to keep track of how many times it occurred:
• in the entire corpus
• with positive terms
• with negative terms
• with only terms that match its polarity
• within a tweet that has negative sentiment (in a negative context)
• within a tweet that has positive sentiment (in a positive context)
• within a tweet that has neutral sentiment (in a neutral context)

The last three indicators were obtained by analyzing the tweet using NU's
sentiment analyzer.22 The scoring of each lexicon term was based on these
statistics, as described in the next subsection.
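The following minimal Python sketch illustrates the statistics collection; field names are ours and the sentiment analyzer is replaced by a stub parameter.

from collections import Counter

def collect_stats(term, polarity, tweets, lexicon, analyze):
    """lexicon maps term -> 'pos'/'neg'; analyze(tweet) -> 'pos'/'neg'/'neu'."""
    stats = Counter()
    for tweet in tweets:
        if term not in tweet:
            continue
        stats["corpus"] += 1
        others = [p for w, p in lexicon.items() if w != term and w in tweet]
        stats["with_pos"] += sum(p == "pos" for p in others)
        stats["with_neg"] += sum(p == "neg" for p in others)
        if others and all(p == polarity for p in others):
            stats["only_matching"] += 1
        stats["context_" + analyze(tweet)] += 1   # pos / neg / neu context
    return stats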

4.3. Term scoring

The main hypothesis behind the presented scoring method is that the
stronger a polar term is, the less likely it is to co-occur with terms of
an opposite polarity or in a context that does not have the same polarity.
This hypothesis is validated empirically in Section 5, by comparing
sentiment analysis performance using the lexicon scored with the proposed
method versus using the un-weighted version of the lexicon.
After collecting statistics for each lexicon term, three steps were
carried out to assign strength scores to lexicon terms. In the first step,
an initial score was calculated for each term: a weight that indicates the
likelihood of the term being positive or negative based on co-occurrence
analysis of this term with other terms and polarity contexts. It does not
take into consideration the strength of the other terms it co-occurred
with, initially assuming that all terms are equally strong. In the second
step, the weights are re-adjusted, taking the initial calculations into
consideration. In the third step, terms that have occurred with very low
frequency in the corpus, or have not occurred at all, are processed.
Terms that have not occurred in the input corpus at all, or whose support
value is less than a given threshold, are assigned a default value based
on their given polarity. The details of each step are given below.

First Step:
The initial score assigned to each term t in the lexicon (excluding terms
whose occurrence count is less than some given threshold v) is based on
the following equation:

    score_t = max(TermCoOccurrenceRatio_t, PolarityRatio_t)

where TermCoOccurrenceRatio measures the extent to which a term co-occurs
with other terms of similar polarity and is calculated as follows:

    TermCoOccurrenceRatio_t = (co-occurrenceCnt_t + weight_t) / Total_Count_t

co-occurrenceCnt_t = co-occurrence frequency of term t with terms of the
same polarity as t
weight_t = tf_t * Normalized_idf_t
tf_t = the number of times term t has appeared in the input corpus
Normalized_idf_t = idf_t normalized such that the value is a number
between zero and one. The normalization factor is log2 N, where N is the
number of documents in the collection used to build the idf table.
idf_t = the inverse document frequency23 of term t as obtained from
another corpus built using a set of objective documents. The reason we
used a different, un-opinionated corpus was to penalize polar terms that
appear in a neutral context, as terms that appear in such a context should
have less weight than those that do not. The idf table used to get this
value is the one described in 24.
Total_Count_t = co-occurrenceCnt_t + revCnt_t + weight_t
revCnt_t = co-occurrence frequency of term t with terms of the reverse
polarity

And where PolarityRatio measures the extent to which a term occurs in an
overall context that is similar to its polarity, and is calculated as
follows:

    PolarityRatio_t = similarContextCnt_t / tweetCnt_t

where
similarContextCnt_t = number of times term t has occurred in tweets of the
same polarity as its given polarity
tweetCnt_t = total number of tweets in which term t has appeared in the
twitter corpus.
While the TermCoOccurrenceRatio takes into account all polar terms that
have co-occurred with the term for which a score is to be calculated, the
PolarityRatio takes into account the overall sentiment of all tweets in
which the term has appeared.
All terms with support greater than 1 but less than some given value v are
placed in a list data structure that we will refer to as the 'weak_list'.
The weak_list thus represents the terms that have not occurred frequently
enough in the collected twitter corpus for us to assign accurate scores to
them. All terms that have 0 support (i.e. have not occurred at all in the
input corpus) are initially placed in another list (the 'zero_list')
before being moved to the weak_list. The processing of both the zero_list
and the weak_list is described in the third step.
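The following minimal Python sketch computes the first-step score from the quantities defined above; the demo values are invented.

def first_step_score(cooc_cnt, rev_cnt, tf, norm_idf,
                     similar_context_cnt, tweet_cnt):
    weight = tf * norm_idf                  # weight_t = tf_t * Normalized_idf_t
    total = cooc_cnt + rev_cnt + weight     # Total_Count_t
    term_ratio = (cooc_cnt + weight) / total if total else 0.0
    polarity_ratio = similar_context_cnt / tweet_cnt if tweet_cnt else 0.0
    return max(term_ratio, polarity_ratio)

print(first_step_score(cooc_cnt=80, rev_cnt=10, tf=40, norm_idf=0.6,
                       similar_context_cnt=70, tweet_cnt=100))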

Second Step:
In the second step, the scores of all terms are revised to take into
account the strength of the terms they co-occurred with. The score for a
term t is calculated as follows in this step:

    modifiedScore_t = (newScore_t + score_t) / 2

where

    newScore_t = (m + t_weight) / (m + r + t_weight)

and m, the matching polarity, is the combined weight of all terms
(1, …, n) that have co-occurred with term t and have the same polarity as
t. The equation for calculating m is as follows:

    m = Σ_{i=1..n} score_i * cnt_i

and r, the reverse polarity, is the combined weight of all terms (1, …, m)
that have co-occurred with term t and have the reverse polarity of t. The
equation for calculating r is as follows:

    r = Σ_{i=1..m} score_i * cnt_i

and t_weight is the weight of the term under consideration, calculated as
follows:

    t_weight = score_t * log2(cnt_t)

The resulting score is a number between 0 and 1 which reflects the
strength of the term with its allocated polarity. The second step is
repeated n times to ensure that the numbers converge.
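The following minimal Python sketch applies the second-step revision to one term, with co-occurring terms given as (score, count) pairs; the inputs are invented.

import math

def revise_score(score_t, cnt_t, same_polarity, reverse_polarity):
    m = sum(s * c for s, c in same_polarity)      # matching-polarity weight
    r = sum(s * c for s, c in reverse_polarity)   # reverse-polarity weight
    t_weight = score_t * math.log2(cnt_t)
    new_score = (m + t_weight) / (m + r + t_weight)
    return (new_score + score_t) / 2              # the modified score

s = 0.7
for _ in range(5):    # repeat until the scores converge
    s = revise_score(s, cnt_t=50,
                     same_polarity=[(0.8, 30), (0.6, 10)],
                     reverse_polarity=[(0.4, 5)])
print(round(s, 3))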

Third Step:
In this step, terms in the zero_list and the weak_list are assigned
scores. Terms with 0 support are sometimes just misspelled versions of
existing terms, so before moving them to the weak_list, we first compare
them for similarity to existing terms. Very short terms tend to
incorrectly match other entries, so they are excluded from this matching
process. The pseudo code for calculating scores for terms that have zero
support is as follows:
For each term t in the zero_list:
a. if the length of the term is smaller than 3, move it to the weak_list
and proceed to the next term
b. else get the minimum Levenshtein distance min between t and all terms
that have been assigned a score
c. if (min <= 2), score_t = score_matching_term
d. else move t to the weak_list

After the above step is completed, the scores for terms in the weak_list
are calculated as follows:
(1) calculate the "average positive polarity" and the "average negative
polarity" given all the scores that have been calculated for entries in
the lexicon
For each term t in the weak_list:
(2) if support_t > 0, get score_t using the equation provided in the first
step; else set score_t = 0
(3) if score_t < 0.5, set score_t = 0.51
(4) adjusted_cnt = log2(term_cnt) + 1
(5) score_t = ((score_t * adjusted_cnt) + polarityAverage) /
(adjusted_cnt + 1)

In step 3 we assign a score that is just above neutral (0.5), to account
for the fact that a human has annotated this term as polar. Since the
support for all terms in the weak_list is low, we adjust their weights
using the polarity average. Since we do not have very high confidence in
the assigned score because of its low support, we employ a log function to
dampen its effect on the resulting overall score. We then calculate the
final score using both that score and the polarity average.

5. Experiments and Results

The aim of the presented experiments was to demonstrate that using a
weighted sentiment lexicon, even within a simple framework, can improve
sentiment analysis results over not using a lexicon at all, as well as
over using an un-weighted lexicon. The experiments are by no means
optimized to generate the best sentiment analysis results over the
presented datasets; to do so, more features and fine tuning are needed, as
presented in 22. Taking intensifiers (e.g. very, much, etc.) into account
is also expected to improve the results. In the following subsections, we
present the sentiment analysis system used in our experiments, the
datasets that were employed and, finally, the various experiments and
their results.

5.1. The sentiment analysis system

In the series of experiments presented in this work, we followed a machine
learning approach to sentiment analysis. The classifier that we used in
the presented experiments is the Complement Naïve Bayes classifier.25 The
main reason for this choice is that, in earlier work presented in 26, we
observed that with respect to the task of Arabic sentiment analysis this
classifier performs consistently well across datasets. In all experiments,
the text of input tweets that are annotated with sentiment is converted to
a feature vector, where words that appear in the tweet are represented by
their idf weights. When using a lexicon, whether scored or not, an
additional set of features is introduced, as detailed for each experiment.
For all experiments, the following set of pre-processing steps takes place
before converting text to a feature vector:

Character Normalization: In this step, the letters “أ”, “إ” and “آ” are replaced
with “ا”, the letter “ة” is replaced with “ه”, and the letter “ى” is
replaced with “ي”. Diacritics are also removed in this step.

Elongation Removal: In this step, words that have been elongated are
reduced to their standard canonical form. Elongation is a way to
emphasize certain words. An example of an English elongated word is
“nooooooooooo”; after elongation removal, this word will be converted to
“no”.
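As an illustration, the two steps above can be sketched in a few lines of
Python. The character mappings follow the description given here, while the
regular expressions and function names are ours and are not part of the system
being described.

import re

NORM_MAP = {'أ': 'ا', 'إ': 'ا', 'آ': 'ا', 'ة': 'ه', 'ى': 'ي'}
DIACRITICS = re.compile('[\u064B-\u0652]')   # Arabic diacritic (tashkeel) range
ELONGATION = re.compile(r'(.)\1{2,}')        # a character repeated three or more times

def normalize(text):
    text = ''.join(NORM_MAP.get(ch, ch) for ch in text)
    return DIACRITICS.sub('', text)

def remove_elongation(text):
    # collapses runs such as "nooooooooooo" to "no"
    return ELONGATION.sub(r'\1', text)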

Stemming/Lemmatization: In this step, words are very lightly stemmed
or lemmatized. The stemmer we have used is the one presented in Refs. 27
and 28.

Matching with Lexicon Entries: This step was only carried out for
experiments that are related to the introduction of the lexicon. In this step,
input tweets/texts are matched against entries in the sentiment lexicon. The
matching process is described in detail in Ref. 22. Both the tweets/texts and
lexicon entries are lemmatized and stemmed prior to any matching steps.
An efficient matching algorithm was employed to facilitate matching
between tweet text and lexicon entries. The output of this step is a count
for positive and negative lexicon entries, which are found in the tweet and

which are used as part of the features. Negators are currently handled in a
very simple way: encountering a negator before a sentiment term within a
window w results in the reversal of its polarity. We have observed that in
some cases, this is not necessarily valid. For example, the term “‫”ﻻ ﺣﻠﻮ‬, in
which the negator “no” appears before the word “nice”, is actually used to
affirm that something is nice. A positive score (posScore) and a negative
score (negScore) are also added as features in experiments involving the
scored lexicon. In our experiments, we have used a very simple technique
for assigning scores. Basically, the score of all positive terms is calculated
as the sum of their individual scores, plus the score of any negated negative
term multiplied by a penalty. The same is done for all negative terms. After
summing all positive scores (allPos) and all negative scores (allNeg), final
positive and negative scores are assigned as shown in Figure 1.
An amplification factor has been introduced to boost the weight of
these two features with respect to other features in the feature vector.
Through experimentation, it was noticed that different datasets favor
different amplification factors. In all experiments presented in the
evaluation section, the amplification factor was optimized using
experiments carried out using 10-fold cross-validation. Whatever factor
worked best with these experiments was used on the test dataset. The use
of intensifiers has yet to be explored and is expected to improve the results
presented in the experimentation section.

if (allNeg > allPos) {
    // negative sentiment dominates: keep the difference, amplified
    negScore = (allNeg - allPos) * amplification_factor;
    posScore = 0;
} else {
    // positive sentiment dominates (or tie): keep the difference, amplified
    posScore = (allPos - allNeg) * amplification_factor;
    negScore = 0;
}
Figure 1. Code snippet representing score calculation.
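For illustration, the following Python sketch combines the simple negation
handling and the amplified score assignment of Figure 1; the window size,
penalty value and all names are assumptions chosen for readability rather than
the exact implementation used in the experiments.

def lexicon_scores(tokens, lexicon, negators, w=2, penalty=0.5,
                   amplification_factor=14):
    # lexicon maps a term to a (polarity, score) pair, polarity in {'pos', 'neg'}
    all_pos = all_neg = 0.0
    for i, tok in enumerate(tokens):
        if tok not in lexicon:
            continue
        polarity, score = lexicon[tok]
        if any(t in negators for t in tokens[max(0, i - w):i]):
            polarity = 'neg' if polarity == 'pos' else 'pos'   # negator flips polarity
            score *= penalty                                   # negated terms are penalized
        if polarity == 'pos':
            all_pos += score
        else:
            all_neg += score
    if all_neg > all_pos:                                      # as in Figure 1
        return 0.0, (all_neg - all_pos) * amplification_factor
    return (all_pos - all_neg) * amplification_factor, 0.0     # (posScore, negScore)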

5.2. The datasets used

The Talaat et al. dataset (NU)26: The collection and annotation for this
dataset is described in Ref. 26. The dataset contains 3436 unique tweets, mostly

written in Egyptian dialect. These tweets are divided into a training set
consisting of 2746 tweets and a test set containing 683 tweets. The
distribution of training tweets amongst polarity classes is: 1046 positive,
976 negative, and 724 neutral tweets. The distribution of the test dataset
is: 263 positive, 228 negative and 192 neutral. This dataset is available by
request from the author.

The KSA_CSS dataset (KSA)26: This dataset is one that was collected at
a research center in Saudi Arabia under the supervision of Dr. Nasser Al-
Biqami, and which is also described in Ref. 26. The majority of tweets in this
dataset are in the Saudi dialect and MSA, but a few are written in Egyptian and other
dialects. The tweets for this dataset have also been divided into a training
set consisting of 9656 tweets and a test set comprised of 1414 tweets. The
training set consists of 2686 positive, 3225 negative, and 3745 neutral
tweets. The test set has 403 positive, 367 negative, and 644 neutral tweets.

The BBN Dataset (BBN)29: This dataset consists of 1199 Levantine
sentences, selected by the authors of Ref. 29 from LDC’s BBN Arabic-Dialect–
English Parallel Text. The sentences were extracted from social media
posts. The polarity breakdown of the sentences in this dataset is as follows:
498 are positive, 575 are negative, and 126 are neutral.

The Syria Dataset (SYR)29: This dataset consists of 2000 Syrian tweets,
so most of the tweets in this dataset are in Levantine. The dataset was
collected by Salameh and Mohammad29 and consists of 448 positive
tweets, 1350 negative tweets, and 202 neutral tweets.

5.3. Experimental results

Experiment 1: The goal of this first experiment was to examine the effect
of using the scored lexicon on improving the accuracy of the sentiment
analysis task when using 10-fold cross-validation on all used datasets. The
results of this experiment are shown in Table 2. Looking at these results,
it can be seen that in all cases accuracy increases when using a lexicon
(scored or not). The increase in accuracy seems to be related to the size of
the training dataset, with the largest dataset showing the least improvement

and the smallest showing the most. This shows that using a lexicon does
in fact help classifiers generalize better in the absence of large training
datasets. This hypothesis is further tested when using the lexicon in
conjunction with various test datasets.

Table 2. Results of applying the classifier on the various datasets and testing using
10-fold cross-validation.

                   Accuracy   Fscore   Correctly    Improvement
                                       Identified   over baseline
NU Data Set (size = 2746), amplification = 14
Baseline             71.34     71.2       1966            -
Lexicon Counts       72.87     72.6       2000          1.78%
ScoredLexicon        73.82     73.7       2027          3.1%
KSA Data Set (size = 9656), amplification = 6
Baseline             78.88     78.9       7613            -
Lexicon Counts       79.26     79.2       7649          0.47%
ScoredLexicon        79.31     79.3       7654          0.53%
BBN Data Set (size = 1199), amplification = 8
Baseline             68.97     68.8        827            -
Lexicon Counts       71.14     70.7        853          3.14%
ScoredLexicon        72.20     71.4        864          4.47%
Syr Data Set (size = 2000), amplification = 16
Baseline             77.45     77.9       1549            -
Lexicon Counts       78.45     78.8       1569          1.29%
ScoredLexicon        80.3      80.4       1606          3.68%

Experiment 2: The goal of the second experiment was to examine the
effect of using the scored lexicon on improving the accuracy of the
sentiment analysis task when training the classifier using the provided
training datasets and testing it using the supplied test datasets.

The results are shown for the datasets for which a separate test dataset
was provided (NU and KSA). The results of this experiment are provided
in Table 3. The results of this experiment re-affirm the conclusion reached
in the first: here also, the use of a lexicon improves performance, with
the best results obtained when using the scored lexicon.

Table 3. Results of applying the classifier on various test datasets.

                   Accuracy   Fscore   Correctly    Improvement
                                       Identified   over baseline
NU Data Set (size = 683)
Baseline             57.40     57.2        392            -
Lexicon Counts       59.59     59.2        407          3.82%
ScoredLexicon        61.90     61.10       423          7.91%
KSA Data Set (size = 1414)
Baseline             69.57     69.4       1125            -
Lexicon Counts       71.49     71.4       1156          2.76%
ScoredLexicon        71.8      71.8       1161          3.20%

Experiment 3: The goal of the third experiment was to examine the ability
of the scored lexicon to improve a sentiment analyzer’s generalization
ability across datasets. In this experiment, the classifier was trained using
the largest available dataset (KSA) and tested using (a) the NU dataset, (b)
the BBN dataset, (c) the Syr dataset. The results of this experiment are
shown in Table 4.

Table 4. Results of training using the KSA data set and testing using various datasets.

                   Accuracy   Fscore   Correctly    Improvement
                                       Identified   over baseline
NU_Egy Test dataset (size = 683)
Baseline             57.83     57.1        395            -
Lexicon Counts       60.03     59.2        410          3.78
ScoredLexicon        61.93     60.9        423          7.09
BBN Data Set (size = 1199)
Baseline             54.13     54.0        649            -
Lexicon Counts       56.05     56.4        673          3.70
ScoredLexicon        58.38     58.6        700          7.86
Syr Data Set (size = 2000)
Baseline             53.60     58.3       1072            -
Lexicon Counts       55.90     60.4       1118          4.29
ScoredLexicon        57.80     62.1       1156          7.84

It can be noticed from these results that the use of the scored lexicon
increased the ability of the classifier to correctly identify instances by no
less than 7% over all three used datasets. While the results for BBN and

Syr datasets were much lower than those achieved using the 10-fold cross-
validation on the same datasets, the result for the NU test dataset was
identical to that achieved when training using the NU training dataset. This
can be explained by the fact that the KSA dataset has a subset of Egyptian
dialect tweets, so with the help of the scored lexicon, the classifier built
using KSA data was able to achieve a similar result to that achieved by a
classifier trained specifically for the Egyptian dialect. The same was not
true for the other two datasets, as they contain a completely different
dialect (Levantine).

6. Conclusion

This paper presented a method for assigning scores to entries in an Arabic
sentiment lexicon. This scored lexicon has been made publicly available
for research purposes.b The experiments carried out using this lexicon
show that the use of a sentiment lexicon (whether scored or not) improves
sentiment classification results, while the use of the scored lexicon
consistently results in the best classification results. The experiments also
showed that the use of the scored lexicon can increase a sentiment
classifier’s ability to generalize across multiple datasets. We expect that
augmenting the lexicon presented in this work, with other features such as
those presented in Ref. 22 can further improve sentiment classification
results. In the future, we intend to verify this hypothesis through
experimentation.

References
1. R.W. Neal, Twitter Usage Statistics: Which Country Has The Most Active Twitter
Population? International Business Times, 2013, http://www.ibtimes.com/twitter-
usage-statistics-which-country-has-most-active-twitter-population-1474852 (2013).
2. Facebook Statistics by Country, http://www.socialbakers.com/facebook-statistics/
(2012).
3. D. Farid, Egypt has the largest number of Facebook users in the Arab world. Daily
News Egypt, 23 September 2013, http://www.dailynewsegypt.com/2013/09/25/egypt-
has-the-largest-number-of-facebook-users-in-the-arab-world-report/ (2013).

b https://github.com/NileTMRG/NileULex

4. S.R. El-Beltagy, NileULex: A Phrase and Word Level Sentiment Lexicon for Egyptian
and Modern Standard Arabic. In Proc. of LREC 2016. Portorož, Slovenia (2016).
5. S. Baccianella, A. Esuli and F. Sebastiani, SentiWordNet 3.0: An Enhanced Lexical
Resource for Sentiment Analysis and Opinion Mining. In: Proceedings of the Seventh
International Conference on Language Resources and Evaluation (LREC’10), pp.
2200–2204 (2010).
6. B. Liu, Sentiment Analysis and Subjectivity. In: N. Indurkhya and F. J. Damerau (eds),
Handbook of Natural Language Processing, Second Edition (2010).
7. T. Wilson, J. Wiebe and P. Hoffmann, Recognizing contextual polarity in phrase-level
sentiment analysis. In Proc. of Human Language Technology Conference and
Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP).
pp. 347–354, Vancouver, Canada (2005).
8. S. Mohammad, and P. Turney, Crowdsourcing a Word-Emotion Association Lexicon.
Comput Intell 29(3), 436–465 (2013).
9. S. M. Mohammad, S. Kiritchenko and X Zhu, NRC-Canada: Building the State-of-the-
Art in Sentiment Analysis of Tweets. In Proc. of the Seventh International Workshop
on Semantic Evaluation (SemEval-2013), Atlanta, Georgia, USA (2013).
10. S. Kiritchenko, X. Zhu and S. Mohammad, Sentiment Analysis of Short Informal
Texts. J Artif Intell Res, 50, 723–762 (2014).
11. R. F. Astudillo, S. Amir, W. Ling, et al., INESC-ID: A Regression Model for Large
Scale Twitter Sentiment Lexicon Induction. In Proc. of the 9th International Workshop
on Semantic Evaluation (SemEval 2015), pp. 613–618 (2015).
12. H. Hamdan, P. Bellot and F. Bechet, lsislif: Feature extraction and label weighting for
sentiment analysis in twitter. In Proc. of the 9th International Workshop on Semantic
Evaluation, pp. 568–573 (2015).
13. F. Wang, Z. Zhang and M. Lan, ECNU at SemEval-2016 task 7: An enhanced
supervised learning method for lexicon sentiment intensity ranking. In Proc. of
International Workshop on Semantic Evaluation (SemEval-2016), pp. 491–496 (2016).
14. R. Al-Sabbagh and R. Girju, Mining the web for the induction of a dialectical arabic
lexicon. In Proc. LREC 2010, pp. 288–293 (2010).
15. S. R. El-Beltagy and A. Ali, Open Issues in the Sentiment Analysis of Arabic Social
Media : A Case Study. In Proc. of 9th the International Conference on Innovations and
Information Technology (IIT2013), Al Ain, UAE (2013).
16. M. Abdul-Mageed and M. Diab, Toward Building a Large-Scale Arabic Sentiment
Lexicon, In Proc. of the 6th International Global WordNet Conference, Matuse, Japan,
pp. 18–22 (2012).
17. G. Badaro, R. Baly, H. Hajj, et al., A large scale Arabic sentiment lexicon for Arabic
opinion mining. In Proc. of the EMNLP Workshop on Arabic Natural Language
Processing (ANLP), Association for Computational Linguistics, pp. 165–173 (2014).
18. F.H.H. Mahyouba, M. A. Siddiquia and M. Y. Dahaba, Building an Arabic Sentiment
Lexicon Using Semi-supervised Learning. J King Saud Univ - Comput Inf Sci, 26,
417–424 (2014).

19. R. Eskander and O. Rambow, SLSA: A Sentiment Lexicon for Standard Arabic. In
Proc. 2015 Conference on Empirical Methods in Natural Language Processing, pp.
2545–2550 (2015).
20. Twitter. Twitter Search API, https://dev.twitter.com/rest/public/search (2016).
21. J. Leskovec, A. Rajaraman and J.D. Ullman, Mining of Massive Datasets. 2nd edition.
Cambridge, UK: Cambridge University Press. Epub ahead of print (2014). DOI:
10.1017/CBO9781139058452.
22. S.R. El-Beltagy, T. Khalil, A. Halaby and M.H. Hammad, Combining Lexical
Features and a Supervised Learning Approach for Arabic Sentiment Analysis. In Proc.
CICLing 2016, Konya, Turkey (2016).
23. G. Salton and C. Buckley, Term-weighting Approaches in Automatic Text Retrieval.
Inf Process Manag, 24(5), 513–523 (1988).
24. S. R. El-Beltagy and A. Rafea, KP-Miner: A keyphrase extraction system for English
and Arabic documents. Inf Syst, 34(1), 132–144 (2009).
25. J.D.M Rennie, L. Shih, J. Teevan, et al., Tackling the Poor Assumptions of Naive
Bayes Text Classifiers. Proc Twent Int Conf Mach Learn, 20(1973), 616–623 (2003).
26. T. Khalil, A. Halaby, M.H. Hammad and S.R. El-Beltagy. Which configuration works
best? An experimental study on Supervised Arabic Twitter Sentiment Analysis. In
Proc. of the First Conference on Arabic Computational Linguistics (ACLing 2015), co-
located with CICLing 2015, pp. 86–93, Cairo, Egypt (2015).
27. S.R. El-Beltagy and A. Rafea, An Accuracy Enhanced Light Stemmer for Arabic Text.
ACM Trans Speech Lang Process, 7(2), 2–23 (2011).
28. S.R. El-Beltagy and A. Rafea, LemaLight: A Dictionary based Arabic Lemmatizer and
Stemmer, Technical Report TR2-11-16, Nile University (2016).
29. M. Salameh, S.Mohammad and S. Kiritchenko, Sentiment after Translation: A Case-
Study on Arabic Social Media Posts. In Proc. of the 2015 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies, pp. 767–777, Denver, Colorado: Association for Computational
Linguistics (2015).

Chapter 9

Islamic Fatwa Request Routing via Hierarchical
Multi-Label Arabic Text Categorization

Reda Zayed, Mohamed Farouk and Hesham Hefny


Institute of Statistical Studies and Research,
Cairo University, Cairo, Egypt
reda_fcis@yahoo.com, hehefny@ieee.org

Multi-label classification (MLC) is concerned with learning from examples
where each example is associated with a set of labels, as opposed
to traditional single-label classification, where an example typically is as-
signed a single label. MLC problems appear in many areas, including
text categorization, protein function classification, and semantic anno-
tation of multimedia. The religious domain has become an interesting
and challenging area for machine learning and natural language process-
ing. A “fatwa” in the Islamic religion represents the legal opinion or
interpretation that a qualified scholar (mufti) can give on issues related
to the Islamic law. It is similar to the issue of legal opinions from courts
in common-law systems. In this paper, a hierarchical classification sys-
tem is introduced to automatically route incoming fatwa requests to the
most relevant mufti. Each fatwa is associated with multiple categories by
mufti where the categories can be organized in a hierarchy. The re-
sults on fatwa requests routing have confirmed the effective and efficient
predictive performance of hierarchical ensembles of multi-label classifiers
trained using the HOMER method and its variations compared to binary
relevance which simply trains a classifier for each label independently.

1. Introduction

The aim of traditional single-label classification is to learn from a set of
examples that are associated with a single label ω from a set of disjoint
labels or categories Ω where |Ω| > 1. If the number of labels |Ω| = 2,
then the learning task is called binary classification. If |Ω| > 2, then it is
called multi-class classification. In multi-label classification,1 the examples
are associated with a set of labels Y ⊆ Ω. The importance of multi-label
classification appears in domains with a large number of labels (hundreds or

more) and each instance belongs to many labels such as text categoriza-
tion,2,3 prediction of gene function4 and protein function prediction.5
The high dimensionality of the label space leads to a number of prob-
lems that a multi-label learning algorithm has to address in an effective
and efficient way. First, the number of training examples belonging to each
particular label will be significantly less than the total number of examples.
This is similar to the class-imbalance problem in single-label classification.6
Second, the computational training complexity of a multi-label classifier
may be strongly affected by the number of labels. Some simple algorithms
such as binary relevance have both linear training and classification com-
plexity with respect to |Ω|, but there are also more advanced methods3
whose complexity is worse. Finally, although the classification complexity
of using a multi-label classifier is linear with respect to |Ω| in the best case,
this may still be inefficient for applications requiring fast response times.
Multi-label learning methods addressing these tasks can be grouped into
two categories:1 problem transformation and algorithm adaptation. The
first group of methods are algorithm independent. They transform the
learning task into one or more single-label classification tasks, for which
a large body of learning algorithms exists. The second group of methods
extend specific learning algorithms in order to deal with multi-label data
directly. There exist extensions of decision tree learners, nearest neighbor
classifiers, neural networks, ensemble methods, support vector machines,
kernel methods and others.
When a Muslim has a question that they need answered from an
Islamic point of view, they ask an Islamic scholar this question, and the
answer is known as a fatwa. It is similar to the issue of legal opinions from
courts in common-law systems. A fatwa in the Islamic religion represents
the legal opinion or interpretation that a qualified jurist or mufti can give
on issues related to the Islamic law. Muslim scholars are expected to give
their fatwa based on religious scripture, not just their personal opinions.
The following is an example of a fatwa: Muslims are expected to pray five
times every day at specific times during the day. A person who is going to
be on a 12 hour flight may not be able to perform their prayers on time.
So they might ask a Muslim scholar (mufti) for a fatwa on what is the
appropriate thing to do, or they might look up the answer in a book or
on the internet. The scholar might advise them to perform the prayer to
the best of their ability on the plane, or to delay their prayer until they
land. They would support their opinion with Quranic verses which Muslims
believe to be a revelation from God. The fatwa is not legally binding or
final.

It is worth mentioning that in Islam, there are four sources from which
Muslim scholars extract religious law or rulings, and upon which they base
their fatwa. The first is the Quran, which is the holy book of Islam, and
which Muslims believe is the direct and literal word of God, revealed to
Prophet Mohammad. The second source is the Sunnah, which incorpo-
rates anything that the Prophet Mohammad said, did or approved of. The
third source is the consensus of the scholars, meaning that if the schol-
ars of a previous generation have all agreed on a certain issue, then this
consensus is regarded as representing Islam. Finally, if there is no evidence
found regarding a specific question from the three first sources, then an
Islamic scholar uses his own logic and reasoning to come up with the best
answer according to the best of their ability. All actions in Muslims’ lives
are permissible, unless a fatwa, based on evidence from one of the four
sources previously mentioned, establishes otherwise. Fatwa areas
(categories) can be organized into a tree-structured hierarchy where similar
areas share the same parent area. Each scholar could be an expert in one
or more of its branches. To get the best fatwa for a given question, the
request has to be directed to the most relevant mufti.
The main contribution of this paper is to apply an effective and com-
putationally efficient multi-label classification algorithm in a domain with
many labels such as Islamic fatwa requests routing. The algorithm, that
was introduced by Tsoumakas et al. in 2008,7 is called HOMER (Hierarchy
Of Multi-label classifiERs). HOMER constructs a hierarchy of multi-label
classifiers, each one trained to solve a classification problem with a much
smaller set of labels compared to |Ω| and a more balanced example distri-
bution. This leads to improved predictive performance along with linear
training and logarithmic testing complexities with respect to |Ω|. The first
step of HOMER is the label hierarchy generation which is the even distri-
bution of the given set of labels Ω into k disjoint subsets using a balanced
k-means clustering algorithm; that is, similar labels are placed together
and dissimilar ones apart.
The remainder of this paper is organized as follows. Section 2 describes
the related work and Section 3 presents the proposed routing system and the
HOMER algorithm. Section 4 presents the setup and results respectively
of the experimental work comparing HOMER to binary relevance, which
is the most popular and computationally efficient multi-label classification
method. Finally, Section 5 concludes this paper and points to future work.

2. Related Work

Mencia and Fürnkranz3 introduced a multi-label system for possible auto-
mated or semi-automated real-world application for categorizing EU legal
documents into 4000 possible EUROVOC categories. They have shown that
a reformulation of the pairwise decomposition approach into a dual form
is capable of handling very complex problems and can therefore compete
with the approaches that use only one classifier per class.
Arabic is a Central Semitic language, closely related to Aramaic, He-
brew, Ugaritic and Phoenician. Arabic is the mother language of more than
300 million people8 and is spoken by as many as 420 million speakers (na-
tive and non-native) in the Arab world. Unlike Latin-based alphabets, the
orientation of writing in Arabic is from right to left; the Arabic alphabet
consists of 28 letters. Nouns in Literary Arabic have three grammatical
cases (nominative, accusative, and genitive [also used when the noun is
governed by a preposition]); three numbers (singular, dual and plural); two
genders (masculine and feminine); and three states (indefinite, definite, and
construct). A noun has the nominative case when it is subject; accusative
when it is the object of a verb; and the genitive when it is the object of
a preposition. Words are classified into three main parts of speech, nouns
(including adjectives and adverbs), verbs, and particles.
Most of the work in text classification treats documents as a bag-of-
words with the text represented as a vector of a weighted frequency for
each of the distinct words or tokens. Although a simplified representation
of text has been shown to be quite effective for a number of applications,
several attempts studied enhancement of text representation using concepts
or n-grams (multi-word terms).9 Islam Elhalwany et al.10 have proposed
an intelligent Fatwa Questions Answering System that can automate the
answering of requests without human intervention from Muslim scholars.
It responds to a user’s inquiry and provides the answer of the semantically
nearest fatwa request that has been previously answered by a scholar.
El-Kourdi et al.8 used the Naïve Bayes algorithm to automatically
classify Arabic documents.
Ahmed and Tiun11 have investigated the effect of stemming versus no
stemming on the accuracy of Arabic Islamic text clustering. They found
that stemming gives a better impact than no stemming, and that K-means
with the Cosine similarity measure achieves the highest performance. Odeh
et al.12 have introduced a new Arabic text categorization method using

vector evaluation method. The proposed method determines the key words
of the tested document by weighting each of its words, and then compares
these key words with the key words of the corpus categories.

3. Islamic Fatwa Requests Routing System

The architecture of the proposed hierarchical classification system is shown
in Figure 1. The aim of this system is to automatically route incoming
fatwa (legal opinion) requests to the most relevant Muslim scholar (mufti).
Each fatwa is associated with multiple categories (fatwa areas) by a Muslim
scholar. The categories can be organized in a hierarchy because some
fatwa areas are subsets of more generic areas. In the following subsections,
the different steps required to build the routing system will be presented.

Fig. 1. High-level architecture of Fatwa Request Routing System.

3.1. Text preprocessing

The nature of Arabic text is different from that of English text, and
preprocessing of Arabic text is more challenging. A huge number of features
or keywords in the documents leads to poor performance in terms of both
accuracy and time. Therefore, preprocessing is a very important step before
training the text classifiers, in order to extract knowledge from massive
data and reduce the computational complexity. Before the Arabic word
stemming step, we normalize the fatwa request text as follows:

• Remove punctuation.
• Remove special characters and remove any html tags.
• Remove diacritics (primarily weak vowels).
• Remove non-Arabic letters.
• Replace Arabic letter ALEF with hamza below, Arabic letter ALEF
with madda above, and Arabic letter ALEF with hamza above with
the plain Arabic letter ALEF.
• Replace final Arabic letter Farsi YEH with Arabic letter YEH.
• Replace final Arabic letter TEH marbuta with Arabic letter HEH.
• Stop-word removal: we determine the common words in the doc-
uments which are not specific or discriminatory to the different
classes.
• Stemming: different forms of the same word are consolidated into
a single word. For example, singular, plural and different tenses
are consolidated into a single word.

3.1.1. Light stemmer


Although many researchers refer to light stemming, we found no publi-
cation explicitly listing which affixes should be removed. We tried to remove
strings which would be found as affixes far more often than they would be
found as the beginning or end of an Arabic word without affixes. We tried
many versions of light stemming, all of which followed the same steps:

(1) Remove Arabic letter WAW (and) for Light2, Light3, and Light8
if the remainder of the word is 3 or more characters long. Although
it is important to remove Arabic letter WAW, it is also problematic,
because many common Arabic words begin with this character, hence
the stricter length criterion here than for the definite articles.
(2) Remove any of the definite articles if this leaves 2 or more characters.
(3) Go through the list of suffixes once in the (right to left) order indicated
in Figure 2, removing any that are found at the end of the word,
if this leaves 2 or more characters. The strings to be removed are
listed in Figure 2; the prefixes are actually definite articles and a
conjunction. A Python sketch of these steps follows.
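Since the exact affix lists of Figure 2 are not reproduced here, the PREFIXES
and SUFFIXES in the sketch below are common light-stemming affixes used only
as placeholders; the length thresholds follow steps (1)–(3) above.

PREFIXES = ['ال', 'وال', 'بال', 'كال', 'فال', 'لل']   # placeholder definite articles
SUFFIXES = ['ها', 'ان', 'ات', 'ون', 'ين', 'يه', 'ية', 'ه', 'ة', 'ي']

def light_stem(word):
    if word.startswith('و') and len(word) - 1 >= 3:    # step 1: WAW, stricter length rule
        word = word[1:]
    for p in PREFIXES:                                 # step 2: definite articles
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:                                 # step 3: one pass over the suffix list
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[:-len(s)]
    return word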

Fig. 2. Prefixes and suffixes removed by light stemming.

3.2. Feature engineering


Before any classification task, one of the most fundamental tasks that needs
to be accomplished is that of document representation and feature selection.
Classification algorithms cannot deal directly with text instances.
Instead, each text instance has to be represented as a fixed-length numeric
feature vector, whose components are mostly the text words. This kind of text
representation typically leads to a high-dimensional input space, which normally
affects the efficiency of classification algorithms. While feature selection is
also desirable in other classification tasks, it is especially important in text
classification due to the high dimensionality of text features and the existence
of irrelevant (noisy or unimportant) features. Several methods are
used to reduce the dimensionality of the feature space by choosing a subset
of features in order to reduce the classification computational complexity
without sacrificing accuracy. In this paper, the Chi-Squared (χ2) statistic9
is used as a scoring function to rank the features based on their relevance
to the categories.
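As an illustration of this ranking step (applied in Section 4.1 with the top
30 features per label), the following sketch uses scikit-learn's chi2 scoring
function; the variable names are ours, and X and Y are assumed to be dense
non-negative arrays.

import numpy as np
from sklearn.feature_selection import chi2

def top_features_per_label(X, Y, k=30):
    # X: (n_docs, n_features) bag-of-words matrix; Y: (n_docs, n_labels) binary indicators
    selected = set()
    for j in range(Y.shape[1]):
        scores, _ = chi2(X, Y[:, j])                # chi-squared score per feature for label j
        selected.update(np.argsort(np.nan_to_num(scores))[-k:])
    return sorted(selected)                         # the union removes redundant features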
In general, text can be represented in two separate ways. The first is
as a bag of words (Dictionary), in which a document is represented as a
set of words, together with their associated frequency in the document.
Such a representation is essentially independent of the sequence of words in
the document (context independent). The second method is to represent
each document as strings of words (called N-grams such as bigrams and
trigrams), in which each document feature represents a sequence of words
(it takes the context into consideration). In this paper, the bag-of-words
representation is used as it has shown good classification performance.

3.3. The HOMER algorithm

The main idea of HOMER is the transformation of a multi-label classifi-
cation problem with a large set of labels Ω into a tree-shaped hierarchy of
simpler multi-label classification problems, each one dealing with a small
number of labels compared to the full label set (<< |Ω|). The root con-
tains all labels, Ωroot = Ω. There are |Ω| leaves, each one contains a single
label ωj of Ω. Each internal node n of this hierarchy contains a set of
labels Ωn ⊆ Ω that represents the union of the label sets of its children,
Ωn = ∪Ωc, c ∈ children(n). It introduces a new label λn that is called
the meta-label of a node n and it represents the disjunction of the labels
contained in that node, λn ≡ ⋁ωj, ωj ∈ Ωn. A training example can be
considered annotated with meta-label λn if it is annotated with at least
one of the labels in Ωn.
The multi-label classifier fn is assigned to each internal node n of the hi-
erarchy. The task of fn : X → Mn is to discriminate among the meta-labels
of its children where the set of labels for fn is Mn = {λc |c ∈ children(n)}.
In the HOMER classification phase, for a new example x, HOMER starts
with the root node classifier froot and follows a recursive process forwarding
x to the multi-label classifier fc of a child node c only if λc is among the
predictions of fparent(c) . Eventually, this process may lead to the prediction
of one or more single-labels by the multi-label classifier(s) just above the
corresponding leaf(ves). The union of these predicted single-labels is the
output prediction for x, while the empty set is returned otherwise.
In the HOMER training phase, it is assumed that there exists a set
D = {(xi , Yi )|i = 1, . . . , |D|} of multi-labeled training examples, each one
consisting of a feature vector xi and a set of labels Yi ⊆ Ω. HOMER
starts with the construction of the label hierarchy recursively in a top-
down depth-first fashion starting with the root. At each node n, k chil-
dren nodes are first created, unless |Ωn | < k, in which case the number of
children is |Ωn |. Each such child n filters the data of its parent, keeping
only the examples that are annotated with at least one of its own labels:
Dn = {(xi, Yi) | (xi, Yi) ∈ Dparent(n), Yi ∩ Ωn ≠ ∅}. The root uses the whole
training set, Droot = D. Two main steps are then sequentially executed
into each child node that contains more than a single label: a) the labels of
the current node are distributed into k disjoint subsets, one for each child
of the current node, and b) a multi-label classifier is trained to discriminate
among the meta-labels of its children. In the latter step, each internal node
n transforms its examples (xi , Yi ) ∈ Dn into meta-examples (xi , Zi ), where

Zi = {λc | c ∈ children(n), Yi ∩ Ωc ≠ ∅}. These meta-examples are used for
training fn.
The main issue in the former step is how to distribute the labels of Ωn
to the k children. HOMER evenly distributes the labels into k subsets in a
way such that labels belonging to the same subset are as similar as possible.
Such an objective can be accomplished by performing clustering with the
additional constraint of equal cluster sizes, a task considered in the
past in the literature under the name balanced clustering. HOMER can
use any existing balanced clustering algorithm for this step; by default it
uses a balanced k-means algorithm.7 The justification in favor of
similarity-based distribution is that if similar labels of a node n are
placed in the same subset, then only a few (ideally just one) meta-labels
of fn will be predicted and thus the remaining sub-trees will not be
activated. This reduces the classification cost of HOMER. Another expected
benefit is that each child node will probably contain fewer training
examples. The justification in favor of even distribution is that the
multi-label classifiers at each node will deal with a more balanced
distribution of examples for each meta-label. This helps avoid the
class-imbalance problem, which may improve the predictive performance.
To conclude, HOMER’s overall training complexity is O(f(|Ω|) + |Ω|),
where f(|Ω|) is the complexity of the balanced clustering algorithm with
respect to the set of labels Ω. During classification, the cost depends on
the average number of multi-label classifiers that are activated. Assuming
that each example is annotated with a small number of labels compared to |Ω|,
that HOMER outputs a small number of labels as well, and that a different
root-to-leaf path is followed for the prediction of each label, then we can
consider the complexity as O(logk(|Ω|)). Compared to the typical O(|Ω|)
complexity, HOMER offers an important benefit, especially for domains where
online performance is critical.
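The recursive training and prediction just described can be summarized in the
following Python sketch. The make_clusters and make_classifier arguments are
placeholders for the balanced clustering step and the per-node multi-label
learner; the sketch is illustrative and is not the MULAN implementation used
in the experiments below.

def train_homer(node_labels, D, k, make_clusters, make_classifier):
    # node_labels: set of labels at this node; D: list of (x, Y) pairs, Y a label set
    node = {'labels': node_labels}
    if len(node_labels) == 1:
        return node                                   # leaf: a single label, no classifier
    subsets = make_clusters(node_labels, k)           # k (near-)balanced label subsets
    node['children'] = [train_homer(s, [(x, Y) for (x, Y) in D if Y & s],
                                    k, make_clusters, make_classifier)
                        for s in subsets]
    # meta-examples: which children's meta-labels each training example activates
    meta = [(x, {i for i, s in enumerate(subsets) if Y & s}) for (x, Y) in D]
    node['clf'] = make_classifier(meta)               # multi-label classifier f_n
    return node

def predict_homer(node, x):
    if len(node['labels']) == 1:
        return set(node['labels'])                    # a single label is reached
    predicted = set()
    for i in node['clf'].predict(x):                  # indices of predicted meta-labels
        predicted |= predict_homer(node['children'][i], x)
    return predicted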

4. Performance Evaluation

4.1. Data description

The dataset used in the experiments was provided by the Egyptian Dar
al-Ifta.a Since it was first established, Dar al-Ifta al-Misriyyah has been
the premier institute to represent Islam and the international flagship for
a http://eng.dar-alifta.org/

Islamic legal research. Dar al-Ifta al-Misriyyah started as one of the
divisions of the Egyptian Ministry of Justice. In view of its consultancy
role, capital punishment sentences, among others, are referred to Dar al-Ifta
al-Misriyyah seeking the opinion of the Grand Mufti concerning these
punishments. The role of Dar al-Ifta does not stop at this point; it is not
limited by domestic boundaries but extends beyond Egypt covering the
entire Islamic world.

Fig. 3. Part of the hierarchy of fatwa areas.

The dataset contains about 100,000 text instances. Figure 3 presents
part of the fatwa areas hierarchy. The fatwas are assigned to a tree-
structured label hierarchy containing 830 nodes. There are about 89 internal
nodes and 741 categories (fatwa areas) as child nodes. When a fatwa is
annotated with a leaf node category, it is assigned to all the internal nodes
from that leaf up to the root. The label hierarchy is not balanced and the
text instances are not distributed uniformly over all the categories. We
ignore the categories that have fewer than 20 text instances. The remaining
categories after filtration are 221 leaf nodes. The average number of labels
per fatwa is 2.8 (label cardinality). Each fatwa consists of title, question,
answer, fatwa basics and keywords. For routing task, we used only fatwa
questions. The content of a fatwa was represented using the Boolean bag-
of-words model. In order to further reduce the computational complexity
of classification, feature selection was applied in the following way. We used
the χ2 feature ranking method separately for each label in order to obtain

a ranking of all features for that label. We then selected the top 30 features
for each label and concatenated them into a single feature vector after the
removal of redundancy. This led to a reduced vocabulary of 4,486 words.
After the aforementioned preprocessing, and the removal of empty exam-
ples (examples with no features or labels) the final version of the dataset
included 15,539 instances.

4.2. Methods

We focus on the comparison of the HOMER method and its variations with the
binary relevance method (BR), since it is the most widely used multi-label
learning algorithm. Both the training and the testing phases of BR are
already linear with respect to |Ω|. We compare BR against HOMER using
BR as the multi-label classifier at each internal node of the label hierarchy.
To reduce the computational cost of the experiments, we use naive Bayes
as the base classifier for BR decomposed binary tasks. We evaluate the per-
formance of the methods using four-fold cross-validation. HOMER is run
with different numbers of clusters, k = 2, 4, 6, 8, 10. In addition to the
balanced k-means algorithm, we examine two different approaches for the distribu-
tion of the labels of each node into its children. The first variation, called
HOMER-R, distributes evenly but randomly the labels into the k subsets.
The motivation here is to examine the benefits of clustering on top of the
even distribution of labels. The second variation called HOMER-K, dis-
tributes the labels using the Expectation-Maximization (EM) algorithm
without any constraints on cluster sizes. The motivation in this case is
to examine the benefits of even distribution on top of the clustering com-
pared to similarity-based distribution. The default version of HOMER with
balanced k-means is called HOMER-B. The implementation of the MULANb
Java library13 has been used for learning the multi-label classifiers in these
experiments.

4.3. Results and Discussion

There exist a number of metrics to evaluate the predictive performance
of multi-label classifiers.1 We present results based on one example-based
metric and three representative label-based ones: the Hamming loss (lower
is better), micro-averaged precision, micro-averaged recall and the
micro-averaged F-measure.
b http://mulan.sourceforge.net/
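For reference, all four metrics can be computed with scikit-learn as in the
short sketch below, where Y_true and Y_pred are binary label-indicator
matrices (the names are illustrative):

from sklearn.metrics import hamming_loss, precision_score, recall_score, f1_score

def evaluate(Y_true, Y_pred):
    return {
        'hamming_loss':    hamming_loss(Y_true, Y_pred),                     # lower is better
        'micro_precision': precision_score(Y_true, Y_pred, average='micro'),
        'micro_recall':    recall_score(Y_true, Y_pred, average='micro'),
        'micro_f1':        f1_score(Y_true, Y_pred, average='micro'),
    }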

Figure 4 shows the fatwa request routing performance of HOMER and
its variations with respect to the number of clusters. BR trains 310 (= |Ω|)
binary classifiers but HOMER and its variations need to train an additional
number of binary classifiers for each internal node, apart from the root,
which is generally reduced with the number of clusters. The total number
of text instances processed by BR is (|Ω||D|), which is around 4.8 million.
On the other hand, both balancing and clustering help reduce the number
of text instances propagated from parents to children, as HOMER-B leads
to the fewest total instances. For instance, the root node BR processes
a total of k|D| text instances, which is very small compared to
the flat BR case. At classification time, BR activates all 310 trained
classifiers. The advantage of HOMER and its variants is that fewer classifiers
are activated, leading to reduced total testing time. Note that for HOMER-
B and HOMER-R the average number of nodes activated and subsequently
the testing time increases with the number of clusters.
BR has a Hamming loss of 0.022 and a Micro-averaged F-measure of
0.407. Therefore, we first observed that even the worst results of both

Fig. 4. Routing performance of HOMER and its variations compared to Binary
Relevance: (a) Hamming Loss, (b) Micro-F-measure, (c) Micro-Precision,
(d) Micro-Recall.

HOMER and its variations are much better than those of BR. This is
attributed to the skewness of the distribution of the examples for each
label, which is an important drawback of BR in domains with a large number
of classes, such as fatwa areas. HOMER manages to alleviate this problem
because the BR classifier trained within each hierarchy node deals with a
much smaller number of labels (<< |Ω|).
We then compare among the different HOMER variations. We observe
that for three of the four metrics, HOMER-K has the best results,
followed by HOMER-B and then HOMER-R. For micro-averaged recall,
HOMER-B is the best, followed by HOMER-K and then HOMER-R. This
shows that similarity-based distribution using simple clustering is actually
more important than clustering constrained with an even label distribution
in this domain. A potential reason is that the categories/areas of fatwa
may be naturally grouped into clusters that are not necessarily balanced.
The precision of HOMER-K seems to improve with the number of clusters
while the recall degrades. In terms of Hamming loss, the performance of
HOMER-R has a decreasing trend while HOMER-K and HOMER-B show improvement as
the number of clusters increases. Hence, we conclude that the performance
of both HOMER and its variations increases with the number of clusters.

5. Future Work and Conclusion

This paper introduced a new system to automatically route an incoming
fatwa or legal opinion request to the most relevant mufti or jurist who
can provide the best answer. Fatwas are known to belong to multiple
categories out of a large number of labels. An effective and efficient
multi-label classifier has been introduced using the HOMER method and its
variations. In addition, the paper has empirically shown that HOMER
provides more accurate predictions than the popular binary relevance method
in less time. The two key steps in HOMER are how to build the label
hierarchy and how to distribute the labels among its nodes. We
have tried three different ways to automatically build the hierarchy using
either simple clustering or balanced clustering. Within the next steps of
this work, we plan to extend HOMER to exploit the existing hierarchy
defined by Muslim scholars as it reflects the semantic relationship among
legal opinion areas. In this study, BR has been used as multi-label classifier
in each internal node. We also plan to extend HOMER through leveraging
other types of multi-label learners such as Calibrated label ranking.14 At
the feature engineering level, we intend to study the use of term frequency

instead of the Boolean bag-of-words model and to use different features at
each node instead of using the same vocabulary for all the node classifiers.
The best feature set for one classifier is not necessarily the best one for
other classifiers.

References

1. G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In Data
Mining and Knowledge Discovery Handbook, pp. 667–685. Springer (2010).
2. R. Schapire and Y. Singer, Boostexter: a boosting-based system for text
categorization, Machine Learning. 39, 135–168 (2000).
3. E. Mencia and J. Fürnkranz. Efficient pairwise multilabel classification for
large scale problems in the legal domain. In Proceedings of 12th European
Conference on Principles and Practice of Knowledge Discovery in Databases
(PKDD 2008) (2008).
4. Z. Barutcuoglu, R. Schapire, and O. Troyanskaya, Hierarchical multi-label
prediction of gene function, Bioinformatics. 22, 830–836 (2006).
5. G. Yu, H. Rangwala, C. Domeniconi, G. Zhang, and Z. Yu, Protein function
prediction using multilabel ensemble classification, IEEE/ACM Transactions
on Computational Biology and Bioinformatics (TCBB) (2013).
6. N. Chawla, N. Japkowicz, and A. Kotcz, Editorial: special issue on learning
from imbalanced data sets, SIGKDD Explorations. 6, 1–6 (2004).
7. G. Tsoumakas, I. Katakis, and I. Vlahavas. Effective and efficient multilabel
classification in domains with large number of labels. In Proceedings of PKDD
Workshop on Mining Multidimensional Data (MMD’08) (2008).
8. M. El-Kourdi, A. Bensaid, and T. Rachidi. Automatic Arabic document cat-
egorization based on the Naïve Bayes algorithm. In Proceedings of 20th Inter-
national Conference on Computational Linguistics (2004).
9. A. Mesleh, Chi square feature extraction based SVMs Arabic language text
categorization system, Journal of Computer Science. 3(6), 430–435 (2007).
10. I. El-halawany, A. Mohammed, K. Wasfi, and H. Hefney. Enhanced knowl-
edge discovery approach in textual case based reasoning. In Proceedings of
13th Mexican International Conference on Artificial Intelligence (2014).
11. M. H. Ahmed and S. Tiun. K-means based algorithm for islamic document
clustering. In Proceedings of International Conference on Islamic Applica-
tions in Computer Science and Technologies (IMAN 2013) (2013).
12. A. Odeh, A. Abu-Errub, Q. Shambour, and N. Turab, Arabic text catego-
rization algorithm using vector evaluation method, International Journal of
Computer Science and Information Technology (IJCSIT). 6(6) (2014).
13. G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas, MULAN:
A java library for multi-label learning, Journal of Machine Learning Research.
12, 2411–2414 (2011).

14. G. Tsoumakas, E. Mencia, I. Katakis, S.-H. Park, and J. Fürnkranz. On
the combination of two decompositive multi-label classification methods. In
Proceedings of the ECML PKDD 2009 Workshop on Preference Learning
(PL-09), pp. 114–129 (2009).



Chapter 10

Arabic and English Typeface Personas

Shima Nikfal and Ching Y. Suen


Centre for Pattern Recognition and Machine Intelligence,
Concordia University, Montréal, Québec, H3G 1M8, Canada
snikfal@gmail.com, suen@encs.concordia.ca

Most existing studies on typeface design have primarily focused on
typeface legibility and readability. This paper concentrates on the
perception of typefaces and consists of two experimental studies. The
first study investigates the perception of Arabic typefaces and the second
contains an analysis of English typeface design. The Arabic typefaces
study began by designing a font survey. Eighty-two people were asked
to rate their perceptions of 20 Arabic typefaces based on 5 personality
traits (Legible, Attractive, Comfortable, Artistic, and Formal). The
correlation between typefaces and their personality traits were
discovered by performing a series of statistical analyses. Based on the
results, the studied typefaces were categorized into 3 groups according
to their personality traits and typographical features. The second study
focused on English typefaces and their personality traits. A font survey
was designed. Seventy-one people rated their perceptions and
preferences of 24 English typefaces based on six personality traits
(Legible, Artistic, Formal, Sloppy, Readable, and Attractive). The
correlation between typefaces and their personality traits was
discovered by performing a series of statistical analyses. Based on the
results, the number of studied typefaces was reduced from 24 to 16, and
these 16 typefaces were categorized into four groups, according to their
personality traits and typographical features.

1. Introduction

Typography is a key tool in visual communication because typefaces can
evoke human emotions and can affect reading and comprehension. Due to

the variety of proportions, heights, weights, and various styles, etc., each
typeface has its own aesthetic and expressive qualities, as evidenced by
the visual attributes of its letterforms.15 Some typefaces are able to reflect
a selected message, while others can detract from an intended meaning.
Previous researchers on typefaces have illustrated that each typeface
has its own individual identity. For example, during a BBC audio program
on February 11th, 2005, Ian Peacock discussed how the fonts we select
send subtle messages about who we are.2 He argues that the fonts we
select to dress our words are as much of a fashion statement as the clothes
we wear. Moreover, in this program, fonts were illustrated as being
masculine or feminine. Fonts described as being fine, serif, sleek, and
elegant were considered as feminine, whereas fonts characterized as being
blocky and bold were considered as masculine.
This chapter focuses on the visual expression of Arabic and English
typefaces. For both, the relationship between typefaces and their perceived
personas has been investigated. Statistical analysis was used to separately
analyze the data collected from English and Arabic typeface surveys. By
applying statistical analysis on the data, the correlation among fonts and
perceived personas has been extracted. English and Arabic typefaces used
within this study have been grouped according to their personas. This
chapter begins with a literature review of different studies on both
typeface’s personality traits, followed by a description of the newly
designed font surveys for both typefaces and the methodologies used
within these studies. Finally, a discussion, conclusions and suggestions on
topics for future exploration are provided, based on research results.

2. Literature Review of Typeface Personality Studies

Most typeface research has examined the legibility and readability of
English scripts.3,18,36 There are only a few studies that have focused mainly
on legibility and readability pertaining to Arabic typography. Ramadan,
Mohamed, and El-Hariry26 illustrated factors that may have an influence
on the legibility and readability of electronic texts. Results showed that the
13-point Simplified Arabic typeface rated higher both in terms of reading
comprehension and discomfort than the two other typefaces: Traditional
Arabic, and Monotype Koufi with font sizes of 10, 13, and 16 points.

Alsumait, Al-Osaimi and AlFedaghi5 showed that Arab students preferred
using the Simplified Arabic typeface with a 14-point size, which made
their reading easier and faster.
A study was conducted by Hemayssi et al.14 to find adult users’
preferences for Arabic scripts. Results showed that in order to increase
legibility, it is preferable to use bold fonts, colors and clear icons.
Almuhajiri and Suen4 investigated the legibility and readability of Arabic
typefaces on appearance of Personal Digital Assistants (PDAs).
Experimental results showed that Arabic e-readers are more popular among
Arab communities using Apple devices. Uthman Script Hafs and Geeza
Pro fonts can be used in e-books, which require high legibility. Their
results showed that Almohanda, Geeza Pro, and Yakout Reg fonts are
recommended for use when displays are similar to the size of iPads, due
to their better performance at smaller sizes.
In the area of typeface personality traits, most existing studies have
attempted to investigate their connections to English scripts. An earlier
study conducted by Poffenberger et al.,25 identified 5 atmospheric values
such as cheapness, dignity, economy, luxury, and strength for 29
typefaces. The term “atmospheric value” was used to describe the affective
quality of a typeface. Another similar study29 was conducted subsequently,
which concluded that typefaces can be grouped under 3 headings
pertaining to atmospheric value: luxury/refinement, economy/precision
and strength.
The idea that typeface personalities convey messages beyond what can be
expressed within the text is not novel.16 In the 2nd century, serif
typefaces were used as “symbols of the empire” while sans serif typefaces
were used as “symbols of the republic”.12 Tschichold37 found that different
typefaces can contain different personas and that the characteristics of a
typeface should match the meaning of the text. Also, visual characteristics
can have a strong effect that goes beyond the effects of legibility and
readability.16
Quite a few researchers have matched physical characteristics with
typeface personas. For example, round serifs are “friendly” while squared
serifs are “official”.16 Also, typefaces with light weights are “delicate,
gentle and feminine” while typefaces with heavy weights are “aggressive
and masculine”.12

Several researchers have attempted to give particular personas to
specific typefaces. Kostelnick and Roberts17 assigned “bookish and
traditional” to Times New Roman; “dramatic and sophisticated” to Bodoni
MT and “corpulent and jolly” to Goudy. In another research study
conducted by Shushan and Wright,33 Garamond was labeled as “graceful,
refined and confident” and Century Schoolbook was labeled as “serious
yet friendly”. Shaikh, Chaparro, and Fox31 investigated the relationship
between 20 fonts and fifteen personality traits. Experimental results
suggested that the 20 fonts can be represented in 5 groups (serif, sans
serif, display, script/funny, and monospaced), which were labelled based on
personality traits. The use of this data was appropriate for some onscreen
document types. Serif and sans serif typefaces were generally more
appropriate than display and script faces for reading materials. Using a
similar methodology, Bernard et al.11 investigated the aesthetic appeal of
a selected group of typefaces. Participants indicated whether or not the
typeface had personality traits, e.g., elegant, youthful/fun, and/or business-
like, using a 6-point scale that was not bipolar in nature. They reported
that Times New Roman had significantly fewer personality traits than
Comic Sans, Bradley Hand, and Monotype Corsiva. Monotype Corsiva
was also considered more elegant than all other fonts except Bradley Hand.
The most youthful typeface was Comic Sans and the least youthful was
Courier New. Shaikh and Chaparro investigated the personality traits
of onscreen typefaces and the perceived appropriateness of typefaces for
a variety of onscreen document types, including website ads, written
assignments, email, resumes and web pages.30
Recommendations from these studies do not apply well to Arabic
language typography. For example, Arabic typeface styles have a
completely different structure and appearance, and the idea of character
style and size may not be the same as in English. Moreover, little research
has evaluated user perceptions of what fonts may be appropriate for digital
use. With the increased use of the Internet and electronic devices, there is
a mounting need to establish user perceptions of typeface personas and to
develop guidelines for documents delivered in a digital format.
Therefore, in this chapter, the visual expressions of digital Arabic and
English typefaces are examined.
3. Arabic Typeface Personality Traits

3.1. Research methodology

In this study, a survey with 20 different Arabic typefaces and 5 personality
traits was conducted to investigate whether or not readers/viewers
consistently associated particular personality traits with selected
typefaces and whether these typefaces conveyed these traits.

3.1.1. Typefaces used in this study

The 20 different typefaces selected as test typefaces for this
survey included eight typeface styles available in Microsoft Windows that
support Arabic scripts and twelve recommended Arabic typefaces
that are commonly used.1,34 These typefaces were selected to represent
design characteristics such as tooth and loop heights, ascenders,
descenders and others. Also, these 20 typefaces are widely used in
different applications. Some of them, such as Kufi, are the standard and
most frequently used in displays for titling and in architectural ornaments.
Others, such as Thuluth, are mostly used for short texts and titles. Naskh
has become the industry standard for body text and Ruqaa is popular for
informal text.8,28 The 20 typefaces used in this study are Advertising Light,
Kufi, Maghrib, Naskh, Tahoma, ae_Ostorah, Ae_Mashq, Courier New,
Diwani Letter, Simplified Arabic Fixed, PakType Tehreer, Microsoft Sans
Serif, Ae_Nada, Times New Roman, DecoType Thuluth, Traditional
Arabic, Andalus, Code2000, Pashtu Breshnik and Ruqaa.

3.1.2. Typeface personality and rating scales

Typeface personality affects how people interact with a variety of
documents and products including advertisements, textbooks, and other
forms of text-based material. Since the 1920s, a variety of typefaces have
been studied using different methodologies.
The majority of studies have utilized the idea of semantic scales
(bipolar adjective pairs) to evaluate the connotative nature of typefaces, as
proposed by Osgood et al.23 Their work on semantic differential scales greatly
influenced the methods of researching typeface perception. The semantic
differential approach entails presenting participants with a series of paired
opposite terms (masculine/feminine, strong/weak, quick/slow), referred to
as semantic differential scales. For each concept being judged, participants
indicate the point on a seven-point semantic scale that best fits the concept
(“very masculine,” “somewhat weak,” and so on). Other researchers have
demonstrated the appropriateness of this methodology for examining
typefaces.7,27
However, presenting participants with paired attributes is potentially
problematic. Although terms may appear to represent opposite extremes
of a particular attribute, it is difficult to ascertain whether they are
universally viewed as opposites. Additionally, it may be inappropriate to
consider certain attributes as having a neutral point and polar extremes
along a single dimension. The more complex an attribute is, the more
potentially problematic a bipolar approach becomes for investigating
readers’ perceptions.
For example, “hot” and “cold” clearly represent the opposite ends of a
uni-dimensional temperature scale on which the center point is neither hot
nor cold, but rather is neutral. However, “masculine” and “feminine” may
be neither opposite extremes nor points along a uni-dimensional scale. An
object or person may be viewed as having both masculine and feminine
characteristics, but the absence of masculine characteristics does not
necessarily mean the presence of feminine characteristics, and it is not
always clear what the center point between masculine and feminine
represents.
To avoid this complication, rating scales with non-paired attributes
were used. Attributes were selected based on Li and Suen’s work,19 on
terms frequently used to describe typefaces in the literature, and on
previous studies of typeface tone.
Table 1 illustrates the selected typeface personality traits and their
definitions in both Arabic and English.
Purely denotative terms that describe physical characteristics, such as
round, angular, dark, and heavy, were omitted. Instead, the focus was on
connotative attributes. A five-point rating scale was used in this survey to
reflect a range of responses from participants to the 20 typefaces. The 5
rating scales were: 0~20%, 21%~40%, 41%~60%, 61%~80%, and
81%~100%. Participants were given no explanation or description of the
intended meaning of each term prior to the study because such
explanations or descriptions could conceivably bias them to respond in
particular ways.

Table 1. Typeface personality traits in English and with Arabic translation.

English Arabic Translation


Legible: Easy to read ‫ ﺳﻬﻞ ﺍﻟﻘﺮﺍءﺓ‬: ‫ﻣﻘﺮﻭء‬
Attractive: Pleasing through beauty or charm ‫ ﻳﺴﺤﺮ ﺃﻭ ﻳﺴﺮ ﻣﻦ ﺟﻤﺎﻟﻪ‬: ‫ﺟﺬﺍﺏ‬
Comfortable: Free from stress or tension ‫ ﻻ ﻳﺴﺒﺐ ﺍﻹﺟﻬﺎﺩ ﺃﻭ ﺍﻟﺘﻮﺗﺮ‬: ‫ﻣﺮﻳﺢ‬
Artistic: Related to Art ‫ ﻣﺮﺗﺒﻂ ﺑﺎﻟﻔﻦ‬: ‫ﻓﻨﻲ‬
Formal: Designed for official documents ‫ ﻣﺼﻤﻢ ﻟﻠﻮﺛﺎﺋﻖ ﺍﻟﺮﺳﻤﻴﺔ‬: ‫ﺭﺳﻤﻲ‬

3.1.3. Participants

Native Arabic speakers residing in their home country were asked to fill
out the survey, in addition to the native speakers who currently live in
Montreal, Quebec, Canada. The respondents were recruited through
e-mails and posters at Concordia University. In total, 82 participants
completed the survey, consisting of 41 females and 41 males. There were
55 participants between the ages of 20–29, and 21 participants between
the ages of 30–39. Only two participants were younger than 20 and the
other four participants were older than 40. Regarding educational
background, 38 participants reported having a Bachelor’s degree, 29
participants had a Master’s degree and 10 participants had a Doctorate.
The educational background of the remaining 5 participants included high
school, technical college and junior college.

3.1.4. Test images

For each Arabic typeface, groups of Arabic sentences or words in the size
of 12 points were displayed as a test image (see Figure 1). Image (a)
contains the 3 most common sentences used by Arabic speakers. Image
(b) is the most common Arabic pangram and was obtained from Ref. 24,
containing all of the basic letters. Image (c) was taken from sports news.6
Images (d–f) represent the most, average and least frequently used Arabic
words. Image (g) includes the Arabic alphabet and numerals. The test
image was converted to a binary image at 300 × 300 dpi resolution.
Figure 1. Samples of the text in typeface Tahoma displayed for this font survey.

3.1.5. Typeface normalization

Arabic is a cursive script and therefore a word consists of
connected characters. The direction of writing in cursive scripts such as
Arabic and Farsi is usually horizontal, either from right to left or from left
to right. Theoretically, there is no limitation to the number of characters
that can be connected. Therefore, for a specific font type and size, the word
height is more or less fixed while the word width is variable. These
structural features mean that the normalization of cursive script in both
directions, as in the case of a discrete script, can change the structure of
words. For this reason, the vertical normalization based on Abuhaiba’s
work2 was selected.
The way vertical normalization is performed is as follows. First, the
height of each pangram for every typeface in point size 12 was calculated.
Then, the average height of all the images was computed. All of the images
were resized to the same height, which was equal to the average height,
while keeping their widths. This normalization method changed the 20
typefaces to the same height. However, this method affected the typeface
size because the difference between actual height and average height
showed the percentage increase or decrease in total typeface size. For
example, if a typeface has a small height compared to the average height,
it will have a bigger size after normalization. However, typefaces with
large heights will have a smaller size after normalization. Figure 2
illustrates the result of typeface Courier New before and after
normalization.
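A minimal sketch of this vertical normalization is given below, assuming the rendered pangram images are available as files and using the Pillow imaging library; the function name and library choice are illustrative, not part of the original study.

from PIL import Image

def normalize_heights(image_paths):
    # Load the rendered pangram image for each typeface (point size 12).
    images = [Image.open(p) for p in image_paths]
    # Average height over all typeface images.
    avg_height = round(sum(im.height for im in images) / len(images))
    normalized = []
    for im in images:
        # Resize the height to the average while keeping the original
        # width: short typefaces grow, tall typefaces shrink.
        normalized.append(im.resize((im.width, avg_height)))
    return normalized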

Figure 2. Typeface Courier New, size 18 (from top to bottom: before normalization and
after normalization).

3.1.6. Survey questions

The survey contained 23 questions in total, of which 20 questions
addressed the display of 20 fonts and 5 corresponding personality traits,
with the remaining questions relating to the participant’s demographic
information, including age, gender and educational background. Any
effects based on survey order were prevented by randomly distributing the
display of typefaces throughout the survey.

3.1.7. Data collection method

An online survey tool (freeonlinesurvey)22 was used to establish the Arabic
online survey format. The electronic-based approach was used to design
the survey.

3.1.8. Procedures

The participants were asked to visually examine the 20 typefaces on a
computer screen and rate them on 5 personality traits, indicating how
well each typeface suited each personality trait. Participants were allowed
to take as much time as they required to fill out the survey. The survey
was taken anonymously; only demographic information was obtained.
3.2. Statistical analyses of survey results

Our approach for analyzing the survey data included the use of statistical
software SPSS (version 17.0). SPSS is among the most widely used
programs for statistical analysis in the social sciences.

3.2.1. Univariate analysis

Univariate analysis21 was performed on the rating scores for each
personality attribute and typeface of the survey data. Each typeface was
analyzed based on its histogram of rating scores to explore the distribution
of rating scores for each typeface. Analysis revealed that the histograms
of rating scores displayed two commonly shaped distributions: normal and
slightly skewed. The mean values are displayed in Table 2. The top 3
typefaces related to each personality trait are highlighted.

Table 2. Mean values of rating scores for 20 typefaces with their abbreviations related to 5 personality traits.

3.2.2. Correlation analysis

In order to determine the strength of the relationships among the 20
typefaces and their interrelationships with each of the 5 personality traits,
Pearson’s correlation coefficients35 were computed on the survey data. A
high positive correlation coefficient between two typefaces indicated that
participants perceived these two typefaces to have very similar personality
traits.
A correlation coefficient of 0.30 is considered ‘good’ and above 0.40
is a ‘strong’ correlation in the social sciences.32 In order to reduce the
number of typefaces for further analysis, a correlation of 0.65 or more was
set as a “strong” correlation threshold in our study.
Through a comparison between the results from the correlation analysis
and the univariate analysis containing the typeface Rating Scores, 13
typefaces exhibited strong correlations with the 5 personality traits. They
are: Naskh, ae_Ostorah, Ae_Mashq, Courier New, Diwani Letter,
Simplified Arabic Fixed, Microsoft Sans Serif, Ae_Nada, Times New
Roman, DecoType Thuluth, Traditional Arabic, Andalus and Ruqaa.
These typefaces were used for further statistical analysis. The typefaces of
Kufi, AdvertisingLight, Code2000, Maghrib, PakType Tehreer, Pashtu
Breshnik and Tahoma were eliminated because they were found to be
the least associated with the 5 personality traits.
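As a rough illustration of this selection step, the pairwise correlations and the 0.65 threshold could be computed as follows; the data layout and function name are assumptions, since the original analysis was run in SPSS.

import numpy as np

def strongly_correlated_pairs(ratings, names, threshold=0.65):
    # ratings: one row of rating scores per typeface.
    corr = np.corrcoef(ratings)  # Pearson correlation matrix
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if corr[i, j] >= threshold:  # "strong" correlation threshold
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs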

3.2.3. Factor analysis

Factor analysis35 was applied to the remaining typefaces, using
Principal Components Analysis (PCA)35 and Varimax rotation.35 This
method allowed examination of the common underlying factors between
typefaces and personality traits. The two main phases of factor analysis,
sequentially, are as follows: deriving the factors and then rotating them to
enhance their interpretability. This method was used in this study as well
as in the typeface persona studies performed by Bartram.7
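A compact sketch of the two-phase procedure (derive factors, then rotate) is shown below, using sklearn for PCA and a textbook Varimax implementation; the authors used SPSS, so this code and its parameter choices are purely illustrative.

import numpy as np
from sklearn.decomposition import PCA

def varimax(Phi, gamma=1.0, max_iter=100, tol=1e-6):
    # Classic Varimax rotation to make factor loadings more interpretable.
    p, k = Phi.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = Phi @ R
        u, s, vh = np.linalg.svd(
            Phi.T @ (L**3 - (gamma / p) * L @ np.diag(np.diag(L.T @ L))))
        R = u @ vh
        d_new = np.sum(s)
        if d_new < d * (1 + tol):
            break
        d = d_new
    return Phi @ R

def pca_varimax_loadings(ratings, n_factors=3):
    # Phase 1: derive the factors with Principal Components Analysis.
    pca = PCA(n_components=n_factors).fit(ratings)
    loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
    # Phase 2: rotate the loadings to enhance interpretability.
    return varimax(loadings)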
3.2.4. Interpretation of factors

The factor analysis results revealed that 2 or 3 independent
factors accounted for 20% and 80% of the total variance, respectively.
Typefaces were
categorized into 3 groups based on their ratings and the values of their
correlations. Typefaces within a group correlated highly with the other
typefaces in that group, and did not correlate highly with typefaces in the
other groups.
The items represented the typefaces and the factor represented the
independent group. Items that had higher factor loadings were more
representative of the factor than items with lower factor loadings. For
example, all typefaces that correlated positively in group 1 had much
higher personality ratings than those in the other 2 groups; thus, “Legible”,
“Comfortable” and “Formal” were common properties of typefaces in
group 1 and were the characteristics that distinguished those typefaces
from the typefaces in the other groups.
The group 3 fonts scored highest on “Attractive” and “Artistic”. The
group 2 fonts also scored high on “Comfortable”, “Attractive” and
“Artistic”, but did not score the highest on those traits.
The 3 groups were categorized as follows:

Group 1: Times New Roman, Simplified Arabic Fixed, Microsoft Sans
Serif, Courier New, Traditional Arabic, Naskh
Group 2: Andalus, Ae_Mashq, Ae_Nada, ae_Ostorah
Group 3: Diwani Letter, DecoType Thuluth, Ruqaa

3.2.5. Rating scores of personality traits for grouped typefaces

Mean rating scores for the 3 groups of typefaces were examined based on
each personality trait (see Tables 3 to 5). From the mean values of the 13
typefaces and of the 3 groups in Tables 3 to 5, it was found that:

1. The most legible fonts are in group 1, followed by the fonts in group
2; the least legible fonts are in group 3.
2. The most attractive fonts are in group 3, followed by the fonts in
group 2; the least attractive fonts are in group 1.
3. The most comfortable fonts are in group 1, followed by the fonts in
group 2; the least comfortable fonts are in group 3.
4. The most artistic fonts are in group 3, followed by the fonts in group
2; the least artistic fonts are in group 1.

5. The most formal fonts are in group 1, followed by the fonts in group
2; the least formal fonts are in group 3.
The 3 groups were labeled by comparing and combining the rating
scores of each personality trait across the 3 groups. The label for each
group reflected its overall persona and distinguished it from the other
groups. Groups 1 and 3 were labeled based on their common personality
traits, as “Directness” and “Creativeness”, respectively. Group 2 was
labeled as “Neutral” because this group scored neither extremely high nor
extremely low on any personality traits.

Table 3. Mean values of rating scores for 13 typefaces related to 5 personality traits.
Table 4. Three groups and their corresponding typefaces.

Table 5. Comparison: mean values of rating scores for 3 groups.

3.2.6. Demographic factors

A one-way Analysis of Variance (ANOVA) was performed on the survey
data to determine if demographic factors influenced participants’
responses. These factors, namely gender, age and educational background,
served as the independent variables in our analysis.
It was not feasible to determine if effects were linked to age and
educational background, as the variability of participants within these
groups was not sufficient for a valid analysis. For age, the majority of
participants reported being between 20–30 years of age. Among all of
the participants, only a few claimed to have a technical
school/higher vocational school education or a junior college/technical
college education.
To assess the effect of gender on our survey data, further analyses were
performed. These revealed that there was no statistically significant difference between the
responses of male and female participants. Additionally, the difference
between means of each typeface related to the 5 personality traits was
evaluated with a one-way ANOVA. All the results were insignificant
(p > 0.05), indicating that a participant’s gender did not have a
significant or sizeable effect on perceptions of the typefaces’ personality
traits. See Figure 3 for an example of the gender analysis of typeface Kufi.

Figure 3. Histogram of typeface “Kufi” concerning average rating scores of 5 personality traits in male and female groups.
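For readers who wish to reproduce this kind of test outside SPSS, a one-way ANOVA for one typeface/trait pair might be sketched as follows; the function and argument names are ours.

from scipy.stats import f_oneway

def gender_effect(male_scores, female_scores, alpha=0.05):
    # One-way ANOVA comparing male and female rating scores.
    f_stat, p_value = f_oneway(male_scores, female_scores)
    # In the study every comparison gave p > 0.05, i.e. no significant
    # gender effect on perceived typeface personality.
    return f_stat, p_value, p_value > alpha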

4. English Typeface Personality Traits

4.1. Research methodology

The purpose of this section is to examine whether or not English typefaces
have their own personality traits. In addition, we investigate the
relationship between English typefaces and their personality traits.

4.1.1. Typefaces studied

The typefaces used within this particular study are shown in Figure 4. The
complete listing and classification of typefaces is presented in Ref. 13.
Figure 4. Typefaces used in the survey.

4.1.2. Typeface personality traits and rating scales

Attributes based on work by Li and Suen19 were selected. These attributes
are particularly useful because they are not specific to typefaces, but
instead have been used as rating scales for a wide range of concepts,
ensuring their appropriateness for assessing text passage personas. The
selection included Legible, Artistic, Formal, Sloppy, Readable, and
Attractive.

4.1.3. Participants

A total of 71 participants (39% male, 61% female) completed the survey.
Approximately 51% of participants were 20–29 years old, and 40% of
participants were between 30–39. Two percent of participants were above
the age of 40 and the other 7% participants were below 20 years old.

4.1.4. Test samples

For each typeface in this study, the complete English alphabet in upper
and lower case, as well as numerals, was printed at a size of 18 points.
Two common English pangrams (a pangram is a sentence using every
letter of the alphabet at least once) were also used at a size of 12 points:
“The quick brown fox jumps over a lazy dog” followed by the sentence,
“Please complete the survey to your comfort level”. Figure 5 illustrates
a sample similar to those printed for each of the 24 typefaces.

Figure 5. Sample of the alphabet and text in the font survey. This sample shows the typeface
“Poor Richard”, where (a) includes upper and lower cases of the English alphabet in the
size of 18 points, and numbers in the size of 18 points; and (b) includes two common
English pangrams at a size of 12 points.

4.1.5. Typeface normalization

Typographers have shown that the x-height is an important feature of a
typeface. The x-height is the height of the lowercase character ‘x’. It varies
from one typeface to another. Therefore, in this study, the influence of the
differences in x-height was eliminated by normalizing the 24 typefaces to
the same x-height. The x-height normalization was based on the average
x-height of all the typefaces. First, the x-height for every typeface at two
sizes, 12 and 18 points, was calculated. Then, the average x-height was
separately calculated for all the typefaces at the 12 and 18 point sizes.
In order for all of the typefaces to appear with an equivalent x-height,
the average x-height was utilized as a measurement for determining a new
font size. After calculating the average x-height, Equation 1 was used to
estimate the ratio between average x-height and actual x-height for each
typeface in each point size.
Equation 1: Ratio = (average x-height) / (actual x-height)
This ratio was applied to change the 24 typefaces to the same x-height.
It indicates the percentage increase or decrease in typeface size.
If a typeface has a small actual x-height compared to the average x-height,
it will have a bigger size after normalization, while typefaces with
large x-heights will have a smaller size after normalization. Figures 6
and 7 illustrate the results for typefaces Chiller and Impact
before and after normalization.

Figure 6. Sample of typeface “Chiller” (from left to right: size 18 before normalization and
after normalization).

Figure 7. Typeface “Impact” (from left to right: size 18 before normalization and after
normalization).
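Equation 1 translates directly into a small helper for computing the normalized point size; this sketch and its names are ours, not part of the study.

def normalized_point_size(avg_x_height, actual_x_height, point_size):
    ratio = avg_x_height / actual_x_height  # Equation 1
    # ratio > 1 enlarges faces with a small x-height; ratio < 1 shrinks
    # faces with a large x-height, equalizing the apparent size.
    return point_size * ratio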

4.1.6. Survey questions

The survey contained 27 questions in total, of which 24 questions
addressed the display of 24 fonts and six corresponding personality traits,
and three more questions were related to the participant’s demographic
information, including age, gender and educational background. Any
effects based on survey order were prevented by randomly distributing the
display of typefaces throughout the survey.
4.1.7. Data collection method

An online survey tool (freeonlinesurvey)22 was used to establish the
English online survey format. The electronic-based approach was used to
design the survey. The entire online survey’s administrative tasks,
including data collection, data storage, etc., were managed by a survey
tool. Participants of the English online survey were recruited through
email invitations. They followed the survey link and completed the online
survey on the Internet.

4.1.8. Procedures

It was requested that participants visually examine the typefaces and rate
them on 6 personality traits, demonstrating how well the typeface suited
each personality trait.

4.2. Statistical analyses of survey results

The approach used to analyze the English survey data employed the statistical
software SPSS (version 17.0). First, a univariate analysis of the English
survey data was performed, similar to the one described in the Arabic
typeface survey. Secondly, a correlation analysis on the survey data was
conducted, as described in the Arabic typeface survey. Thirdly, a factor
analysis on the remaining typefaces to group them into smaller sets was
performed similarly to the one described in the Arabic typeface survey.
Lastly, the survey’s demographic data including age, gender and
educational levels were examined to identify their potential effects on
participants’ responses, similarly to the analysis described in the Arabic
typeface survey.
It was found that 16 typefaces exhibited strong correlations with
the 6 personality traits. They are: Impact, Bernard MT Condensed,
Garamond, Centaur, Harry Porter, Times New Roman, Kabel, Berlin Sans
FB, Footlight MT Light, Bauhaus 93, Arial, Helvetica, Rockwell,
Broadway, Cooper Black and Snap ITC. These typefaces were used for
further statistical analysis.
4.2.1. Interpretation of factors

The items represented the typefaces and the factor represented the
independent group. Items that had higher factor loadings were more
representative of the factor than items with lower factor loadings. The
factor analysis results revealed that 3 or 4 independent factors accounted
for 33% and 66.7% of the total variance respectively. Typefaces are
categorized into 4 groups based on their ratings and the values of their
correlations. Typefaces within a group correlated highly with the other
typefaces in that group, and did not correlate highly with typefaces in the
other groups.
All typefaces that correlated positively in group 4 had much higher
property ratings than those in the other three groups; thus, “Legible”,
“Formal” and “Readable” were common properties of typefaces in group
4 and were characteristics that distinguished those typefaces from the
typefaces in the other groups. The following typefaces were grouped
together in group 4: Garamond, Helvetica, Arial, Times New Roman,
Centaur, Rockwell and Footlight MT Light.
The fonts in group 3 shared “Artistic”, “Readable” and “Attractive”
personality traits but they did not score the highest on those traits. Further
mean rating score analysis was done and the following two typefaces were
grouped together: Kabel and Berlin Sans FB.
The fonts in group 2 scored highest on the “Artistic” and “Sloppy”
characteristics. The following five typefaces were grouped together: Snap
ITC, Harry Porter, Broadway, Bauhaus 93 and Cooper Black.
The fonts in group 1 shared “Legible” and “Formal” personality traits
but did not score the highest on those traits; therefore, further mean
rating score analysis was performed and the following two typefaces
were grouped together: Impact and Bernard MT Condensed.
For the fonts that did not score extremely high or extremely low on any
personality traits, further typographical feature analysis was done, as
shown in Table 6. It was found that Bernard MT Condensed and Impact
had the highest x-height proportions and weights (with the exception of
Broadway) and the smallest ascender and descender proportions (again
with the exception of Broadway). Therefore, those fonts were
grouped together in group 1.
Table 6. Typographical features of 16 typefaces.

The four groups of categorized typefaces are described as follows:

Group 1: Impact, Bernard MT Condensed
Group 2: Snap ITC, Harry Porter, Broadway, Bauhaus 93 and Cooper
Black
Group 3: Kabel and Berlin Sans FB
Group 4: Garamond, Helvetica, Arial, Times New Roman, Centaur,
Rockwell and Footlight MT Light

4.2.2. Rating scores of personality traits for grouped typefaces

Mean rating scores for the grouped typefaces were examined (see Tables
7 and 8). From the mean values of the 16 typefaces (Table 7) and the
comparison of the 4 groups (Table 8), it was found that:

1. The most legible fonts belong to group 4, followed by the fonts in
group 1 and group 3; the least legible fonts are in group 2.
2. The most artistic fonts are in group 2, followed by the fonts in group
3 and group 4; the least artistic fonts are in group 1.
3. The most formal fonts are in group 4, followed by the fonts in group
1 and group 3; the least formal fonts are in group 2.
4. The sloppiest fonts are in group 2, followed by the fonts in group 1
and group 3; the least sloppy fonts are in group 4.
5. The most readable fonts are in group 4, followed by the fonts in
group 3 and group 1; the least readable fonts are in group 2.
6. The most attractive fonts are in group 4, followed by the fonts in
group 3 and group 2; the least attractive fonts are in group 1.

The 4 groups were labeled by combining the rating scores of each
personality trait. The label for each group reflected its overall persona and
distinguished it from the other groups. Groups 2 and 4 were labeled based
on their common personality traits, as “Creative” and “Directness”,
respectively.

Table 7. Mean values of rating scores for grouped typefaces related to 6 personality traits.
Table 8. Comparison: mean values of rating scores for 4 groups.

Table 9. Four groups and their corresponding typefaces.

Typefaces in group 1 had common typographical features, as explained
earlier. Therefore, the group was labeled “Masculine”, as suggested by Refs. 8, 38.
Group 3 was labeled “Neutral”, because this group scored neither extremely
high nor low on any personality trait.

4.2.3. Demographic differences

It was not feasible to determine if effects were linked to gender, age and
educational background, as the variability of participants within these
groups was not sufficient for a valid analysis.

5. Summary of English Typefaces

The results of the statistical analyses provide strong evidence that there is
a clear and significant relationship between particular typefaces and
perceived personality traits. Participants in this study consistently ascribed
specific personality traits to certain typefaces, which was consistent with
results from earlier studies on typefaces and their personality traits.19,31
After examining our statistical analysis, the total number of studied
typefaces was reduced from 24 to 16. Eight typefaces were eliminated
because they produced statistically insignificant results. Via a series of
statistical analyses on perceived personality traits, these 16 studied
typefaces were classified into four groups. The four groups each contained
typefaces that were related by typographical features. In addition, the list
of typefaces that scored the highest rating for each of the personality traits
was provided.

6. Summary of Arabic Typefaces

The experimental results on Arabic typeface personality traits indicate that
there is a clear and significant relationship between particular Arabic
typefaces and perceived personality traits. These results are consistent
with those from our study on English typeface personality traits.
Seven typefaces were eliminated due to insignificant results. The 13
remaining typefaces were classified into three groups, through a series of
statistical analyses on perceived personality traits. The three groups each
contained typefaces that were related by typographical features.

7. Comparison of Both Studies

A participant’s age and educational background, reading of familiar or
unfamiliar typefaces, and reading time may have affected participants’
responses in these studies. Further investigation is needed in the future.
These findings may provide typeface designers with some useful guidance
in terms of their future typeface choices to suit different purposes.
Additionally, it was found that there was no significant difference in the
participants’ responses based on gender.
The results of study 1 and study 2 suggest that there is a clear and
significant relationship between particular typefaces and perceived
personality traits. The common typeface personality found for each group
of typefaces in study 1 and study 2 should also be taken into account by
designers intending to create a new font with a specific personality.
The determined categories of Arabic typefaces did not match those
identified in the study of English typeface personas. There are several possible
explanations for the dissimilarity. After comparing both studies, it was
found that:

1. The only font used in both studies was Times New Roman.
2. Four commonly used personality traits were chosen in both studies:
“Legible”, “Formal”, “Artistic” and “Attractive”.
3. Due to differences in the specific typefaces, personality traits,
rating scales, pangrams, and font sizes, we cannot directly
compare the results of both studies.

8. Conclusions and Future Work

In comparison with previous research on fonts and personality traits,19,31
this study not only analyzed font survey results, but also identified
the personalities of Arabic and English typefaces and derived
typeface groups.
The results of study 1 and 2 suggest that there is a clear and significant
relationship between particular typefaces and perceived personality traits.
Consideration of the common typeface personality found for each group
of typeface should also be taken into account by designers intending to
create a new font with a specific personality. In addition, the results of
these studies indicate that there are trade-offs between typeface legibility
and the strong visual feelings conveyed by typefaces. In particular,
moderate designs can increase typeface legibility but decrease prominent
responses.
Since a typeface is viewed as having its own personality (study 1), that
personality contributes to the selection and appropriate usage of
typefaces. Typefaces rated as “Legible”, “Formal” and “Comfortable”
tend, conversely, to be unemotional, unimaginative and unattractive;
they can be used for all purposes and are especially appropriate for
official documents, reports and forms.
Typefaces rated as “Attractive” and “Artistic” are generally best for
evoking a pleasant tone in commercial advertisements and in children’s
books. These typefaces are less legible, so they are often printed in large
sizes and are more appropriate for headings than for body text.
Comparison between this study and other research in the area of
psychology and typography design has illustrated that the study offers a
systematic technique of typeface design analysis in terms of the particular
personality traits that typefaces can convey. The current work is a starting
point. However, more research is needed. In future work the following
considerations need to be taken into account:

1. In order to identify personality traits that are more accurate and
specific, the selection of personality traits used in research should
be pilot tested and examined in greater detail.
2. Because of methodological limitations, some issues that may
influence the participants’ responses, including factors such as
participants’ reading comprehension, reading time, and familiarity
with the studied typefaces, must be addressed.
3. To be able to identify effects due to demographic factors,
the distribution of participants based on age and educational
background should be taken into consideration.

References
1. H. S. Abifares, Arabic Typography: A Comprehensive Sourcebook. Saqi Books,
London (2000).
2. I. S. I. Abuhaiba, Discrete script or cursive language identification from document
images. Journal of King Saud University. 16(1), 253-269, (2004).
3. I. M. Al-Harkan and M. Z. Ramadan, Effects of pixel shape and color, and matrix pixel
density of Arabic digital typeface on characters’ legibility. International Journal of
Industrial Ergonomics. 35(7), 652–66 (2005).
4. M. Almuhajri and C. Y. Suen, Legibility and readability of Arabic fonts on personal
digital assistants. In eds. M. C. Dyson and C. Y. Suen, Digital Fonts and Reading,
pp. 248-265. World Scientific, Singapore (2016).
5. A. Alsumait, A. Al-Osaimi, and H. AlFedaghi, Arab children’s reading preference for
different online fonts. In Proceedings of HCI International Conference, HCI
International, pp. 3-11, San Diego, CA (July 2009).
6. Arabic News. Available from: http://aljazeera.net/portal
7. D. Bartram, Perception of semantic quality in type: differences between designers and
non-designers. Information Design Journal. 3(1), 38-50 (1982).
8. C. H. Baylis, Trends in typefaces. Printer’s Ink 252. 5, 44-46 (1955).
9. BBC Radio. http://www.bbc.co.uk/radio4/ (2015).
10. M. L. Bernard, B. S. Chaparro, M. M. Mills, and C. G. Halcomb, Comparing the effects
of text size and format on the readability of computer-displayed Times New Roman
and Arial text. International Journal of Human-Computer Studies. 59(6), 823–835
(2003).
11. M. Bernard, M. Mills, M. Peterson, and K. A. Storrer, Comparison of popular online
fonts: which is best and when? Usability News, 3(2), (2001). http://usabilitynews.org/a-
comparison-of-popular-online-fonts-which-size-and-type-is-best/
12. R. Bringhurst, The Elements of Typographic Style, 2nd edn. Hartley and Marks, Canada
(1996).
13. Microsoft. Guide to Microsoft fonts: Design tutor. Microsoft, UK (2000).
14. H. Hemayssi, E. Sanchez, R. Moll, and C. Field, Designing an Arabic user experience:
Methods and techniques to bridge cultures. In Proceedings of the Conference on
Designing for User Experiences (DUX05), San Francisco, CA (2005).
15. S. C. Hostetler, Integrating typography and motion in visual communication. (2006).
http://www.units.muohio.edu/codeconference/papers/papers/SooHostetler-
2006%20iDMAa%20Full%20Paper.pdf Department of Art, University of Northern
Iowa, Cedar Falls.
16. C. Kostelnick, The rhetoric of text design in professional communication. The
Technical Writing Teacher. 17(3), 189-202 (1990).
17. C. Kostelnick and D. D. Roberts, Designing Visual Language: Strategies for
Professional Communicators. Longman, Boston (1997).
18. D. S. Lee, K. K. Shieh, S. C. Jeng, and I. H. Shen, Effect of character size and lighting
on legibility of electronic papers, Displays. 29(1), 10-17 (2008).
19. Y. Li and C. Y. Suen, Typeface personality traits and their design characteristics. In
Proceedings of the 9th IAPR International Workshop on Document Analysis Systems,
pp. 231-238, Cambridge, MA (June 2010).
20. Microsoft. List of Microsoft Windows fonts.
http://en.wikipedia.org/wiki/List_of_Microsoft_Windows_fonts
21. R. L. Miller, C. Acton, D. A. Fullerton, and J. Maltby, SPSS for Social Scientists.
Palgrave Macmillan, New York (2002).
22. FreeOnlineSurveys. Online font survey. http://www.freeonlinesurveys.com
23. C. E. Osgood, G. J. Suci, and P. Tannenbaum, The Measurement of Meaning.
University of Illinois Press, Champaign-Urbana (1957).
24. Pangram List. http://en.wikipedia.org/wiki/List_of_pangrams
25. A. T. Poffenberger and R. B. Franken, A study of the appropriateness of typefaces.
Journal of Applied Psychology. 7, 312–329 (1923).
26. M. Ramadan, A. Mohamed, and H. El-Hariry, Effects of cathode ray tube display
formats on quality-assurance auditor’s performance. Human Factors in Ergonomics
and Manufacturing. 20(1), 61-72 (2009).
27. C. L. Rowe, The connotative dimensions of selected display typefaces, Information
Design Journal. 3(1), 30-37 (1982).
28. R. Sassoon, Through the eyes of a child: perception and type design, In ed. R. Sassoon,
Computers and typography, pp. 158-165. Intellect Books, Oxford (1993).
29. H. Spencer, The Visible Word Book, 2nd edn. Lund Humphries, Royal College of Arts,
New York (1969).
30. D. Shaikh and B. Chaparro, Perception of fonts: perceived personality traits and
appropriate uses. In eds. M. C. Dyson and C. Y. Suen, Digital Fonts and Reading,
pp. 226-247. World Scientific, Singapore (2016).
31. A. D. Shaikh, B. S. Chaparro, and D. Fox, Perception of fonts: Perceived personality
traits and uses. Usability News. 8(1), 1-6 (2006).
32. A. D. Shaikh, B. S. Chaparro, and D. Fox, The effect of typeface on the perception of
email, Usability News. 9(1), 1-7 (2007).
33. R. Shushan and D. Wright, Desktop Publishing By Design, 2nd edn. Microsoft
Press, Washington (1994).
34. E. Smitshuijzen, Arabic Font Specimen. Uitgeverij De Buitenkant, Amsterdam (2009).
35. S. A. Sweet and K. G. Martin, Data Analysis with SPSS. 3rd edn. Pearson Education,
Upper Saddle River (2009).
36. C. Y. Suen, S. Nikfal, Y. Li, Y. Zhang, and N. Nobile, Evaluation of typeface legibility
based on human perception and machine recognition. In Proceedings of the ATypI
International Conference, Dublin, Ireland, (2010).
37. J. Tschichold, Graphic Arts And Book Design: Essays on the Morality of Good Design.
Hartley & Marks, Washington (1958).
38. J. V. White, Graphic Design for the Electronic Age: Manual for Traditional and
Desktop Publishing Book. Watson-Guptill Publications and Xerox Press, New York (1988).

Chapter 11

End-to-End Lexicon Free Arabic Speech Recognition Using Recurrent Neural Networks

Abdelrahman Ahmed†, Yasser Hifny‡, Khaled Shaalan§ and Sergio Toral¶

†Electronic Engineering Department, University of Seville, Spain
abdahm@alum.us.es
‡Department of Information Technology, University of Helwan, Egypt
yhifny@fci.helwan.edu.eg
§The British University in Dubai, Dubai, UAE
Khaled.shaalan@buid.ac.ae
¶Electronic Engineering Department, University of Seville, Spain
storal@us.es

This chapter presents the first end-to-end recipe for an Arabic speech-to-text
transcription system using lexicon free Recurrent Neural Networks (RNNs).
The developed approach does not depend on Hidden Markov Models (HMMs),
Gaussian Mixture Models (GMMs), or decision trees. In addition, a character
based decoder is used for searching, avoiding the need for a word lexicon.
The Connectionist Temporal Classification (CTC) objective function is used
to maximize the output character sequences given the acoustic features as
input. The recipe was evaluated using a 1200-hour corpus of Aljazeera
multi-genre broadcast programs. On the development set, we report a Word
Error Rate (WER) of 12.03% for non-overlapped speech.

1. Introduction

Arabic is a challenging language and is considered one of the most
morphologically complex languages [1,2]. Automatic Speech Recognition
(ASR) for Arabic has been a research concern over the past decade [3,4]
for many reasons. The first reason is that the Arabic language has limited
development resources in the speech recognition field. The second reason is
that most Arabic language scripts are written without diacritics [5]. There
have been many attempts to present Arabic speech recognition recipes [1].
However, the methods are challenging to learn and it is hard to obtain competitive
results compared to other languages [6]. The motivation of this chapter is to
present an Arabic ASR system using Recurrent Neural Networks (RNNs).
The study presents a recipe for an Arabic transcription system based on
RNNs and the Connectionist Temporal Classification (CTC) objective
function, building on the Stanford CTC source code.1 The next sections briefly
introduce automatic speech recognition methods: Hidden Markov Models (HMMs),
Gaussian Mixture Models (GMMs), Deep Neural Networks (DNNs) and
RNNs. Section 5 introduces two experiments: the first experiment concerns
estimating and optimizing the parameters of the training process based
on 8 hours of the Aljazeera corpus [7]. The second experiment concerns
generating a comprehensive acoustic model using 1200 hours of multi-genre
broadcast recordings from 2005–2015 from the Aljazeera Arabic TV channel.
The results, including the Word Error Rate (WER)/Character Error
Rate (CER) and the experimental observations regarding the training and
decoding processes, are presented in Section 5.2 and Section 5.4. Section 6
concludes the chapter’s findings and proposes research opportunities for future
work.

2. Related Work

Automatic Speech Recognition based on statistical models has led to
significant improvements in speech recognition for different languages [8].
The components of a statistical speech recognizer are:

• A language model, which computes the probability of a word sequence.
• An acoustic model, which defines the relationship between acoustic
observations and the phonemes.
• A lexicon, which is required for the decoder to define the phone
sequence for each word [9].
• A decoder, which finds the most probable sentence given the input
acoustic observations. It can be computed by taking the product
of the two probabilities for each sentence and choosing the sentence
for which this product is the greatest [10].

The HMM/GMM modeling approach still has some drawbacks. HMMs
are developed under the assumption that observations are independent,
which is not consistent with the vocal tract. In addition, it is a generative
model, which gives lower recognition performance compared to discriminative
models [10,11].
1 https://github.com/amaas/stanford-ctc
Neural networks have achieved outstanding results in speech processing
and other fields (e.g. handwriting recognition, visual recognition, etc.) [12].
A Deep Neural Network (DNN) is a conventional multilayer perceptron
(MLP) with a number of hidden layers for data processing [11]. Combining
both methods of DNNs and HMM/GMM creates a new method
called “tandem”, which can outperform the legacy GMM/HMM method.
The tandem method [13] is a feature extraction approach using DNNs in
order to obtain features complementary to Mel Frequency Cepstral
Coefficients (MFCCs) [14]. Another approach to using neural networks in speech
recognition is the hybrid approach, which combines HMMs and Artificial
Neural Networks (ANNs). In the hybrid approach, HMMs are used for
sequential modeling and ANN models are used as flexible discriminant
classifiers to estimate a scaled likelihood, replacing GMM models [15]. This
approach has been developed further with RNNs, Bidirectional Recurrent
Neural Networks (BDRNNs) and deep conditional random fields, which gain
prominent performance improvements [16,17]. The process of data
preparation and integration for the GMM/HMM tandem or hybrid approaches
is very complicated and requires extensive experience in planning, parameter
estimation and optimization.
There have been attempts to present a full recipe for Arabic speech
recognition using speech recognition toolkits such as Kaldi,2 HTK3 and
Sphinx4 [6]. The acoustic model was generated on a morphological
or grapheme basis using lexicon based decoders. Continuing these previous
attempts, this chapter presents an end-to-end speech recognition system
using BDRNNs to map acoustic input to the transcribed text. Connectionist
Temporal Classification (CTC) is the objective function used
in the BDRNN training process [18]. The next sections discuss the training
process in more detail.
2 http://www.kaldi-asr.org/
3 http://htk.eng.cam.ac.uk/
4 http://cmusphinx.sourceforge.net/

3. Arabic Speech Recognition System

The study framework compares the performance of lexicon free RNN
speech recognition with that of other technologies [7,19]. This study comprises
two experiments: the first experiment concerns parameter estimation and
optimization using an 8-hour Aljazeera corpus prepared and collected
by the Qatar Computing Research Institute (QCRI).5 In the second
experiment, an acoustic model was developed and evaluated using a
1200-hour Aljazeera broadcast news TV corpus.
5 http://www.qcri.com/
The conceptual framework of this study is described in Fig. 1.

Fig. 1. The framework of the study.

Our speech-to-text transcription system consists of three main components:
a BDRNN acoustic model, a language model and a character based
decoder (i.e. no lexicon is needed in the decoding process, in contrast to
word level decoders). In addition, the training and decoding processes are
based on Arabic graphemes. The objective function used to train the BDRNNs
is CTC, which removes the need for pre-segmented acoustic observations.
The evaluation on the test set is performed at both the word and character
level in order to validate the results against other word based models. The
next subsections discuss the acoustic model in more detail using BDRNNs/CTC,
the n-gram language model and the lexicon free character based decoder.

3.1. Acoustic model

The BDRNN acoustic model scores each character given the input data.
Moreover, the CTC objective function (loss function) maximizes the probabilities
of the correct characters. The following sections briefly discuss
BDRNNs and CTC.
3.1.1. Bidirectional Recurrent Neural Networks (BDRNNs)


An RNN computes the probability of the output character c given the
input x_t at time t. It consists of a few hidden layers followed by a softmax
output layer [11,20]. The scoring at each layer depends on the current input
x_t and the previous hidden state s_{t-1}. Hence, it does not model information
based on the future acoustic context [16]. To overcome this limitation, a BDRNN
has separate hidden layers for scoring based on the past and future
context. Each layer is computed separately, going forward over t − 1, t
and t + 1 in parallel with a backward computation over t + 1, t and t − 1. Then,
they are summed together as in Fig. 2.

Fig. 2. Stanford Bidirectional Recurrent Neural Networks for input x and output p(c|x).

The first layer scoring is based on Equation (1):

h_t^{(1)} = f(W^{(1)T} x_t + b_1).    (1)

The second layer in Fig. 2 is the BDRNN hidden layer j, the partial sum
of the forward and backward (temporal) layers at time t:

h_t^{(j)} = h_t^{(f)} + h_t^{(b)}.    (2)

The forward and backward hidden layers are computed independently with
weight matrices W^{(f)} and W^{(b)}. The partial hidden layer takes its input from
the previous hidden layer h_t^{(j-1)}. Therefore, the hidden layers h_t^{(f)} and h_t^{(b)} at
time t are computed by the following equations:

h_t^{(f)} = f(W^{(j)T} h_t^{(j-1)} + W^{(f)T} h_{t-1}^{(f)} + b^{(j)}),    (3)

h_t^{(b)} = f(W^{(j)T} h_t^{(j-1)} + W^{(b)T} h_{t+1}^{(b)} + b^{(j)}),    (4)

where f(z) = min(max(z, 0), \mu) is a rectified linear activation function
clipped to a maximum possible activation of \mu to prevent overflow [21].
Rectified linear hidden units have been shown to work well in general for
deep neural networks, as well as for acoustic modeling of speech data [21].
The final layer of the BDRNN computes the output distribution p(c|x_t)
using a softmax function:

p(c = c_k | x_t) = \frac{e^{-(W_k^{(s)T} h^{(:)} + b_k^{(s)})}}{\sum_{j=1}^{K} e^{-(W_j^{(s)T} h^{(:)} + b_j^{(s)})}},    (5)

where W_k^{(s)} is the k'th column of the output weight matrix W^{(s)} and b_k^{(s)}
is a scalar bias term. The vector h^{(:)} is the hidden representation of
the final hidden layer in our BDRNN. The set of all expected characters K
includes the blank symbol (_).
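To make Equations (1)-(5) concrete, the following numpy sketch computes one bidirectional layer and the character softmax. The shapes and names are assumptions for illustration; this is not the Stanford CTC implementation.

import numpy as np

def clipped_relu(z, mu=20.0):
    # f(z) = min(max(z, 0), mu), the clipped rectifier from the text.
    return np.minimum(np.maximum(z, 0.0), mu)

def bdrnn_layer(H_prev, W_j, W_f, W_b, b_j):
    # H_prev: (T, d) outputs h_t^{(j-1)} of the previous layer.
    T = H_prev.shape[0]
    k = b_j.shape[0]
    h_f = np.zeros((T, k))  # forward pass, Equation (3)
    h_b = np.zeros((T, k))  # backward pass, Equation (4)
    for t in range(T):
        prev = h_f[t - 1] if t > 0 else np.zeros(k)
        h_f[t] = clipped_relu(W_j.T @ H_prev[t] + W_f.T @ prev + b_j)
    for t in reversed(range(T)):
        nxt = h_b[t + 1] if t < T - 1 else np.zeros(k)
        h_b[t] = clipped_relu(W_j.T @ H_prev[t] + W_b.T @ nxt + b_j)
    return h_f + h_b  # Equation (2): sum of forward and backward parts

def char_softmax(h_final, W_s, b_s):
    # Output distribution p(c|x_t) over the K characters, Equation (5);
    # note the negative sign on the logits, as in the text.
    logits = -(W_s.T @ h_final + b_s)
    e = np.exp(logits - np.max(logits))  # stabilized exponentials
    return e / e.sum()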

3.1.2. Connectionist Temporal Classification (CTC)


The objective function used to train the BDRNNs is the Connectionist
Temporal Classification (CTC). CTC removes the need for pre-segmented
training data. Given an input sequence X of length T, CTC assumes the
probability of a length T character sequence C is computed as follows:

P(C|X) = \prod_{t=1}^{T} p(c_t | X),    (6)

where the network outputs at different times are conditionally independent
given the input [18]. Afterward, the total probability of any one label
sequence can be found by summing the probabilities of its different
alignments [18]. In particular, the CTC objective function CTC(X, W) is
the likelihood of the correct final transcription W, which requires integrating
over the probabilities of all length T character sequences C [22]:

CTC(X, W) = \sum_{C \in C_W} P(C|X) = \sum_{C \in C_W} \prod_{t=1}^{T} p(c_t | X).    (7)

A CTC collapse function constructs the possible shorter output sequences
from our length T sequence of output characters. It collapses any repeated
characters in the original length T sequence and removes blanks. For example,
the English word “so” is equivalent to any of {sso, soo, _so, s_o, so_}, where
_ denotes the blank. As with HMMs, a dynamic programming algorithm is
used to compute this loss function and its gradient with respect to the
BDRNN parameters.
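The collapse function itself is simple to sketch; the function name and the use of '_' for the blank symbol are our conventions.

def ctc_collapse(sequence, blank='_'):
    # Merge runs of repeated characters, then drop blanks: all of
    # 'sso', 'soo', '_so', 's_o' and 'so_' collapse to 'so'.
    out, prev = [], None
    for ch in sequence:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return ''.join(out)

assert all(ctc_collapse(s) == 'so' for s in ['sso', 'soo', '_so', 's_o', 'so_'])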
3.2. Language model


A language model computes the probability of a word sequence. It is used to
resolve ambiguous utterances during decoding [2,23]. For example, the two
sentences it takes two and it takes too are acoustically confusable. When
the language model scores are combined with the acoustic scores, the
ambiguity may be resolved. As we use characters instead of words during
decoding, Equation (8) gives the n-gram character based prior probability:

p(c_1 \ldots c_m) = \prod_{i=1}^{m} p(c_i | c_1 \ldots c_{i-1}),    (8)

where c_i is the i'th character in the stream of characters. Most speech
recognition systems use 3-gram, 4-gram, or up to 5-gram models. Higher order
models imply higher certainty (lower entropy). As we are targeting characters,
we need to extend the n-grams to the highest possible order to increase
the certainty per word as well as over preceding words. For example, assuming
a word based decoder working at the 4-gram order with words averaging
about 4 letters each, a character based language model would need a
16-gram order (4 words × 4 letters). Limited computational resources may
hinder the possibility of achieving such orders (16-grams). Furthermore,
high orders consume much time in the decoding process, which undervalues
the prominent speed advantage of the lexicon free decoders discussed in
the next section.
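A maximum-likelihood character n-gram model corresponding to Equation (8) can be sketched as follows; the class name is ours and the model is unsmoothed, whereas a practical system would add smoothing and backoff.

from collections import defaultdict

class CharNGramLM:
    def __init__(self, order=5):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        # Count each character given its (order-1)-character history.
        padded = '#' * (self.order - 1) + text
        for i in range(self.order - 1, len(padded)):
            history = padded[i - self.order + 1:i]
            self.counts[history][padded[i]] += 1

    def prob(self, history, char):
        # p(char | last (order-1) characters), one factor of Equation (8).
        h = ('#' * (self.order - 1) + history)[-(self.order - 1):]
        total = sum(self.counts[h].values())
        return self.counts[h][char] / total if total else 0.0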

3.3. Decoding
The beam search decoder operates at the character level. This method
has more advantages than word level decoding for two reasons. The first
reason is that decoding speed at the character level is much higher compared
to the word level because of the lexicon. The search time for a lexicon
based decoder is a function of the number of words to be searched. For
the Arabic language, the lexicon may contain up to 2M words; hence, lexicon
based decoders may be very slow. On the other hand, character
based decoders depend on the number of characters (e.g. 35) used to train
the BDRNNs; hence, they are faster than a lexicon based decoder [24].
The second reason is that character based decoding overcomes the out of
vocabulary (OOV) problem encountered in word decoding [22]. Algorithm 1
illustrates the decoding pseudo code developed by Stanford [22].
The collapse function ignores non-blank symbols repeated due to the time
shift of character alignment, which produces the same character again. It also
Algorithm 1. The decoder inputs are the CTC likelihood of the characters
p_ctc(c|x_t) and the character language model p_clm(c|s). For each time step t
and for each string s in the current hypothesis set Z_{t-1}, a new character
is added to s after handling blanks and repeated characters. Z_0 is initialized
to the empty string (φ). Notation: π is the character set with no blanks (_).
Output strings are concatenated as s := s + c, and |s| is the string length of s.
p_b(s|x_{1:t}) is the probability of s ending in blank conditioned on the input x
up to time t; p_nb(s|x_{1:t}) is the same as p_b but not ending in blank;
p_tot(s|x_{1:t}) = p_b(s|x_{1:t}) + p_nb(s|x_{1:t}).

1: Inputs: p_ctc(c|x_t) and p_clm(c|s)
2: Parameters: language model weight α, insertion bonus β and beam width k
3: Z_0 ← {φ}, p_b(φ|x_{1:0}) ← 1, p_nb(φ|x_{1:0}) ← 0
4: for t = 1 ... T do
5:   Z_t ← {}
6:   for s in Z_{t-1} do
7:     p_b(s|x_{1:t}) ← p_ctc(_|x_t) p_tot(s|x_{1:t-1})
8:     p_nb(s|x_{1:t}) ← p_ctc(c|x_t) p_nb(s|x_{1:t-1})  (c is the last character of s)
9:     Z_t ← Z_t ∪ {s}
10:    for c in π do
11:      S ← s + c
12:      if c ≠ last character of s then
13:        p_nb(S|x_{1:t}) ← p_ctc(c|x_t) p_clm(c|s)^α p_tot(s|x_{1:t-1})
14:      else
15:        p_nb(S|x_{1:t}) ← p_ctc(c|x_t) p_clm(c|s)^α p_b(s|x_{1:t-1})
16:      end if
17:      Z_t ← Z_t ∪ {S}
18:    end for
19:  end for
20:  Z_t ← the k most probable s in Z_t by p_tot(s|x_{1:t})|s|^β
21: end for
22: Return argmax_{s∈Z_T} p_tot(s|x_{1:T})|s|^β

controls whether the hypothesized characters are repeated or reside
between two blanks. Furthermore, taking the sum of the logs of the acoustic
model and language model probabilities avoids underflow (a problem in
processing very small values) and increases the speed of the algorithm. α is the
scaling factor of the language model, used to balance the weighting effect of the
language model probability. For example, when α is set to a small value
(a fraction), the effect of the language model on the overall probability
of the predicted character given the input is small, and vice versa. The
probability of the predicted character is given in Equation (9):

c_\psi = \arg\max_c [\log p(x|c) + \alpha \log p(c)],    (9)

where cψ is the hypothesized character sequence of beam length. β is the insertion bonus, the scaling factor applied when the final character string is inserted: the probability of the hypothesized string is multiplied by the length of the generated string raised to the power β. When β < 1, a reduction factor is applied, lowering the decoder's tendency to insert the hypothesized string (a conservative decoder), and vice versa. The beam length is the length of the hypothesized character sequence processed through the probability calculation. Increasing the beam length increases the decoder accuracy, but the decoding process consumes more time than with a shorter beam length.
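To make Algorithm 1 concrete, the following Python sketch implements the character-level prefix beam search under simplifying assumptions: ctc_probs is a per-frame list of dictionaries mapping each character (and the blank '') to its CTC posterior, the language model is passed in as a callable p_clm(c, s), and probabilities are multiplied directly rather than summed in log space, so it is only suitable for short utterances. All names here are hypothetical and not those of the Stanford code.

from collections import defaultdict

def prefix_beam_search(ctc_probs, alphabet, p_clm, alpha=5.0, beta=3.8, k=150):
    # beams maps a prefix s to the pair (p_blank, p_non_blank).
    blank = ''
    beams = {'': (1.0, 0.0)}
    for frame in ctc_probs:
        new_beams = defaultdict(lambda: (0.0, 0.0))
        for s, (pb, pnb) in beams.items():
            # Extend with blank: prefix unchanged, now ending in blank (line 7).
            b, nb = new_beams[s]
            new_beams[s] = (b + frame[blank] * (pb + pnb), nb)
            # Repeat the last character: prefix unchanged (line 8).
            if s:
                b, nb = new_beams[s]
                new_beams[s] = (b, nb + frame[s[-1]] * pnb)
            # Extend with a new character c (lines 10-17).
            for c in alphabet:
                lm = p_clm(c, s) ** alpha
                b, nb = new_beams[s + c]
                if s and c == s[-1]:
                    # A doubled character is only legal straight after a blank.
                    new_beams[s + c] = (b, nb + frame[c] * lm * pb)
                else:
                    new_beams[s + c] = (b, nb + frame[c] * lm * (pb + pnb))
        # Prune to the k best prefixes, applying the insertion bonus (line 20).
        ranked = sorted(new_beams.items(),
                        key=lambda kv: sum(kv[1]) * max(len(kv[0]), 1) ** beta,
                        reverse=True)
        beams = dict(ranked[:k])
    return max(beams, key=lambda s: sum(beams[s]) * max(len(s), 1) ** beta)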

4. Front-End Preparation

Front-end preparation covers the transliteration of the input text from Arabic to Latin as well as preparing all input features for training. The Latin characters are transcribed into numerical values ready for neural network processing, and audio feature extraction is then applied to build the input data matrix. The following subsections detail each step.

4.1. Converting the Arabic text to Latin


(transliteration process)
The Arabic characters are converted into Latin characters so that the machine can process the text. A sample of the character set is shown in Table 1.

Table 1. Sample of Arabic letters transliteration.

Arabic Letter       ا    د    ش    ل    ك
Transliteration     ga   d    sh   l    k
Equivalent English  A    D    SH   L    K

The transliteration process maps each letter from Arabic to the corresponding Latin character. We added spaces between the characters because their transliterated lengths differ (some characters are transliterated into one Latin character, others into two, as shown in Example 1). We use a hash (#) to indicate the start of a sentence, a star (*) for the end of a sentence and a separator (|) for the spaces between words. These special characters help the decoder detect sentence and word boundaries. Example 1 shows the transliteration of a statement:

[Arabic statement]
# hh y th | t q w m | a l hh y a t | a l s y a s y t | aa l a | a l t aa d d y t | a l hh z b y t *
Example 1. Sample of an Arabic statement transliterated into Latin.

The transliteration process transforms the right-to-left statement (the Arabic writing direction) into left-to-right (Latin). Buckwalter6 is a powerful open-source tool for Arabic-to-Latin transliteration. However, we built our own look-up list, which makes it easier to edit the character set of 39 characters.

6 http://www.qamus.org/transliteration.htm
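A look-up-list transliterator of the kind described above can be sketched in a few lines of Python. Only the five letters of Table 1 are included here (the full list has 39 entries), and the function name and mapping are illustrative, not the actual implementation:

TRANSLIT = {'\u0627': 'ga',   # ا (alif)
            '\u062F': 'd',    # د (dal)
            '\u0634': 'sh',   # ش (shin)
            '\u0644': 'l',    # ل (lam)
            '\u0643': 'k'}    # ك (kaf)

def transliterate(sentence):
    # '#' marks sentence start, '*' sentence end and '|' word boundaries.
    out = ['#']
    for word in sentence.split():
        out.extend(TRANSLIT.get(ch, ch) for ch in word)
        out.append('|')
    out[-1] = '*'  # the last word separator becomes the end-of-sentence mark
    return ' '.join(out)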

4.2. Converting the transcription to alias

The next step is to convert the transcription to the corresponding numbers mapped in the list, as shown in Table 2 and Example 2:

Table 2. Numerical transformation from Latin to numbers.


Transliteration # ga | sh k *
Alias 0 2 39 22 31 20

# hh y th | t q w m | a l hh y a t *
1 14 37 12 39 11 30 36 33 39 7 32 14 37 7 11 20
Example 2. Numerical transformation.
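The alias step is a second look-up, this time from Latin tokens to integer IDs. A minimal sketch, using only the IDs listed in Table 2 (the full mapping over the whole character set plus the special symbols is an assumption):

ALIAS = {'#': 0, 'ga': 2, '|': 39, 'sh': 22, 'k': 31, '*': 20}

def to_alias(transliterated):
    # '# ga sh k *'  ->  [0, 2, 22, 31, 20]
    return [ALIAS[token] for token in transliterated.split()]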

4.3. Speech feature extraction

The feature extraction in this study is based on filter banks (FB) instead of Mel Frequency Cepstral Coefficients (MFCC). Empirical results and previous work [7] show that FB outperforms MFCC in speech and speaker recognition technologies [25]. An FB acts as a set of bandpass filters applied to the audio signal in the frequency domain [25]. FB processing consists of projecting
the features into a higher-dimensional space in which classification can be easier [11].
Our current representation uses a log-Fourier-transform-based filter bank with 40 coefficients (plus energy) distributed on a mel scale, together with their first and second temporal derivatives, resulting in a 123-element feature vector. The features are normalized to zero mean and unit variance, and acoustic context information is added: the context window covers the 10 frames before and after the current frame (21 frames in total). Hence the feature dimensions are 123 × 21.
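The sketch below reproduces this front end under stated assumptions: the chapter does not name a feature-extraction toolkit, so librosa is used here for the mel filter bank and the temporal derivatives, and the edge padding of the context window at utterance boundaries is a choice of this sketch.

import numpy as np
import librosa

def front_end(wav_path, context=10):
    # 40 log mel filter bank coefficients + energy, with first and second
    # temporal derivatives (123 dims), stacked over a +/-10 frame window.
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
    logmel = np.log(mel + 1e-10)                                   # (40, T)
    energy = np.log(mel.sum(axis=0, keepdims=True) + 1e-10)        # (1, T)
    static = np.vstack([logmel, energy])                           # (41, T)
    feats = np.vstack([static,
                       librosa.feature.delta(static),
                       librosa.feature.delta(static, order=2)])    # (123, T)
    # Normalize each coefficient to zero mean and unit variance.
    feats = (feats - feats.mean(axis=1, keepdims=True)) \
            / (feats.std(axis=1, keepdims=True) + 1e-10)
    # Stack 10 frames of context on each side (21 frames in total).
    padded = np.pad(feats, ((0, 0), (context, context)), mode='edge')
    return np.stack([padded[:, t:t + 2 * context + 1].ravel()
                     for t in range(feats.shape[1])])              # (T, 123*21)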

5. Experiments

In this study, we carry out two experiments: 1) BDRNN parameter estimation based on the 8-hour Aljazeera corpus, and 2) training and decoding based on the 1200-hour Aljazeera TV corpus.

5.1. The 8-hour experiment

The corpus consists of 8 hours of Aljazeera broadcasts, collected and transcribed by QCRI7 using an advanced transcription system [26]. The training set is 4270 files, including the test set. The features of the training and test sets are prepared in three files required for the training and decoding process: 1) Alias#, the text transcription converted into numbers; 2) Key#, the number of frames extracted per audio file; and 3) Feat#, the extracted audio features. The experiment uses two types of language models, a Pseudo Language Model (PLM) and a Real Language Model (RLM) [7]. The PLM is collected from the data set itself in order to reduce the language model perplexity and reach the optimum parameter values (α, β). The estimated parameters are α = 5, β = 3.8 and beam length = 150. Afterward, we replaced the PLM with a real language model (RLM) of 980k unique words and more than 110 million running words in the same domain as the Aljazeera broadcasts, prepared by QCRI. Furthermore, we added conversational text obtained by Twitter crawling in the same domain to minimize the out-of-vocabulary rate (OOV < 0.005). We built 7-, 9-, 14- and 15-gram models for both the PLM and the RLM with modified Kneser–Ney smoothing using the KenLM toolkit.8 The KenLM source code was re-compiled to accept more than 9-grams (the default maximum). The 15-gram LM is about 64 GB in ARPA format and 32 GB in binary format; the binary format decodes faster and consumes less memory than the ARPA format. The 15-gram model is the highest order we could reach because of the memory limitations of the OS. The experimental setup comprises 24 processor cores, 144 GB of RAM and an NVIDIA Tesla K80 graphics processing unit (GPU). The Tesla K80 was the state-of-the-art GPU at the time of writing (24 GB of memory and 2900 processing cores). The parallel decoding is performed over the CPUs. The training parameters were set to 50 epochs and 5 hidden layers, each of 1840 hidden units. The learning step is 1e-5, and the maximum frame length is 6000 frames.

7 http://www.qcri.com/
8 https://kheafield.com/code/kenlm/
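Once the ARPA or binary model is built, KenLM's Python wrapper can score hypothesized sequences during decoding. A minimal sketch (the file name is hypothetical; scoring space-separated tokens matches the transliteration format of Section 4):

import kenlm

lm = kenlm.Model('aljazeera_15gram.binary')  # binary loads faster than ARPA

# log10 probability of a space-separated token sequence under the LM,
# with begin- and end-of-sentence symbols added.
score = lm.score('# hh y th | t q w m *', bos=True, eos=True)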

5.2. The 8-hour results


The baselines of HMM/GMM and HMM/GMM/Tandem compared to CTC are shown in Table 3.

Table 3. HMM/GMM/Tandem and CTC results using PLM/RLM.

Model             CER-PLM   WER-PLM   CER-RLM   WER-RLM
HMM/GMM           NA        29%       NA        40%
HMM/GMM/Tandem    NA        10.99%    NA        37%
CTC-7 grams       18%       34%       29%       55%
CTC-9 grams       16%       29%       27%       55%
CTC-14 grams      4.3%      12%       24%       47%
CTC-15 grams      1.3%      3.9%      22%       40.8%

The HMM/GMM and Tandem acoustic models are built using the Hidden Markov Model Toolkit (HTK)9 version 3.5. This version improves the language model support to accept more than 64k words and supports DNN modelling. The results in Table 3 are based on α = 5, β = 3.8 and beam length 150. We used the HResults tool of HTK and built two output files [27]. The first file contains the word sequence, which is compared to the test set. The second file is generated as a sequence of labels (characters) and is compared to the test set prepared in the same manner (label sequences). The results become steady from epoch 30 because of the small size of the training set (50 epochs take about one week). Increasing the network hidden layer size to 2048 with beam length 150 improved the results by less than 1%.
9 http://htk.eng.cam.ac.uk/
The CTC CER at 15-grams is very close to the WER of the HMM/GMM baselines (40.8%, 40%, 37%). Although the decoder processes the utterances with no lexicon, the RNN acoustic and language models succeed in distinguishing characters that have the same pronunciation but different ways of writing. For example, the letter ت, which is equivalent to the letter t in English, is written differently in the middle of a word than at the end of a word (ة). When it comes at the end of the word, it can be spoken or silent, in which case it has a slightly lower probability than a pronounced character [22]. Figure 3 illustrates the BDRNN character probability s over time t and the collapsed output k(s).10 The k function ignores blank symbols due to spaces or characters outside the character set, i.e. noise. It ignores repeated non-blank symbols due to the time shift of the character alignment, which produces the same character again. For example, m nn hathaa may be collapsed to mn hatha.
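A minimal sketch of such a collapse function in Python (the blank symbol '_' is an assumption of this sketch):

def collapse(labels, blank='_'):
    # Merge consecutive repeats, then drop blanks, e.g.
    # collapse('mnn hathaa') == 'mn hatha'.
    out, prev = [], None
    for c in labels:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return ''.join(out)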
The decoder shows the ability to transcribe OOV words. Table 4 shows a phrase from the test set and the corresponding transcription output by the different methods. The word between double brackets in the table is an OOV word. The HMM/GMM failed to deal with this word and sought the most matching labels, producing three words instead of one.

Fig. 3. Collapse function output for blank and non-blank symbols.

10 The graph is only for illustration and it does not present real data.
Table 4. A phrase from the test set and the corresponding transcription output by different methods.

Method        Transcription
Test Set      [Arabic reference transcription, containing one OOV word between double brackets]
GMM/HMM       [Arabic transcription in which the OOV word is replaced by three best-matching words]
CTC           [Arabic transcription in which the OOV word is recovered]

5.3. The 1200-hour experiment


The corpus consists of 1200 hours of the Aljazeera Arabic TV channel, collected and transcribed by QCRI11 using an advanced transcription system [26]. The duration of an episode is typically 20–50 minutes, and the material can be split into three broad categories: conversation (63%), interview (19%) and report (18%) [28]. A prominent feature of the Aljazeera corpus is that it includes multiple dialects and overlapping talkers [26], which is challenging for speech recognition systems [29]. Modern Standard Arabic (MSA) represents 70% of the Aljazeera corpus; the rest is dialectal Arabic, i.e. Egyptian, Gulf, Levantine and North African [28]. According to keyword-tag classification, the corpus contains 12 domain classes, namely: politics, economy, society, culture, media, law, science, religion, education, sports, medicine and military. The QCRI speech recognition baseline is 34% WER for non-overlapped speech (an 8.5-hour development set) and 73% for overlapped speech (1.5 hours) using GMM and DNN acoustic models and a tri-gram (3-gram) language model [19]. The training and decoding of the lexicon-free system use the same parameters as in the previous (8-hour) section, with the frame size changed from 6000 to 10000 frames for the new corpus. The training is performed over the 1200 hours with the 15-gram language model and 20 iterations (epochs). The Aljazeera corpus includes a 10-hour development set for acoustic model optimization and a 10-hour evaluation set. QCRI provided news and conversation text that can be used to build the language model; the data consist of more than 110 million words from the Aljazeera.net website.
The evaluation of the 1200-hour experiment was managed by QCRI through the Multi-Genre Broadcast (MGB) Challenge 2016.12 The systems that contributed to the challenge are: QCRI (1200 hours), LIUM (650 hours), MIT (1200 hours), NDSC (680 hours) and Sevilla University (this study). The system descriptions are available on the MGB website. Table 5 summarizes the results and the technologies used in MGB-2 2016.

11 http://www.qcri.com/
12 http://www.mgb-challenge.org/

Table 5. MGB challenge systems' results.

Acoustic Model               Language Model     System-Affiliation   WER-Overlapped   WER-Non-Overlapped
TDNN, LSTM, BLSTM            Tri-grams-LMRNN    QCRI                 17.3             14.7
DNN-TDNN                     4-grams            LIUM                 19.2             16.7
CNN-TDNN-LSTM-GLSTM-HLSTM    4-grams-LMRNN      MIT                  19.9             17.3
LSTM, TDNN                   LMRNN              NDSC                 23.8             18.2

5.4. The 1200-hour results

Table 6 summarizes the results on the development set, since the test set reference transcriptions are not publicly available.

Table 6. MGB challenge results for the CTC lexicon-free system.

Acoustic Model   Language Model   System-Affiliation   WER-Overlapped   WER-Non-Overlapped
RNN-CTC          15-grams         Sevilla              22.89            12.03

The first observation is that the Sevilla CTC lexicon-free setup reaches a WER of 12.03% for non-overlapped recordings. The second is that the gap between the overlapped and non-overlapped data sets is around 11%, which is quite a big difference compared to the other systems (3%–6%). This implies that while the lexicon-free CTC system achieves competitive results for non-overlapped files, it shows poor immunity to cross-talking speech as well as to noise. Table 7 illustrates a phrase from the development set and the corresponding transcription output.

6. Conclusion

An end-to-end Arabic speech recognition system based on BDRNNs was presented in this chapter. The developed approach does not depend on Hidden Markov Models (HMMs). In addition, a character-based decoder is used for the search, which avoids using a word lexicon. The Connectionist Temporal Classification
Table 7. A phrase from the development set and the corresponding transcription output.

Method          Transcription
The Truth       [Arabic reference transcription]
Lexicon Free    [Arabic transcription produced by the lexicon-free system]

(CTC) objective function is used to maximize the probability of the output character sequences given the acoustic features as input. The recipe was evaluated on the 1200-hour corpus of Aljazeera multi-genre broadcast programs. On the development set, the WER is 12.03% for non-overlapped speech. The experiment used the CTC source code developed by Stanford University with additional development effort, including around 500 lines of code for transliteration, feature extraction, Twitter crawling, parameter optimization and results preparation.
There are some research opportunities for future work. The first is developing an Arabic system using the EESEN13 setup; EESEN is a CTC lexicon/word-based approach that uses Weighted Finite State Transducers (WFSTs), and we intend to compare its results with the lexicon-free CTC. Furthermore, Long Short-Term Memory (LSTM) based acoustic models may improve the results.

13 http://arxiv.org/abs/1507.08240

Acknowledgments

Special thanks to QCRI for providing the Aljazeera corpus and to Ziang Xie from Stanford University.

References

1. A. Ali, Y. Zhang, P. Cardinal, N. Dahak, S. Vogel, and J. Glass. A complete Kaldi recipe for building Arabic speech recognition systems. In Spoken Language Technology Workshop, pp. 525–529, IEEE (2014).
2. E. Othman, K. Shaalan, and A. Rafea. Towards resolving ambiguity in understanding Arabic sentences. In International Conference on Arabic Language Resources and Tools, NEMLAR, pp. 118–122, Citeseer (2004).
3. A. Farghaly and K. Shaalan, Arabic natural language processing: Challenges and solutions, ACM Transactions on Asian Language Information Processing (TALIP). 8(4), 14 (2009). ISSN 1530-0226.
4. F. Diehl, M. J. F. Gales, M. Tomalin, and P. C. Woodland, Morphological decomposition in Arabic ASR systems, Computer Speech and Language. 26(4), 229–243 (2012). ISSN 0885-2308.
5. K. Shaalan, H. M. Abo Bakr, and I. Ziedan. A hybrid approach for building Arabic diacritizer. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, pp. 27–35, Association for Computational Linguistics (2009).
6. V. Radha and C. Vimala, A review on speech recognition challenges and approaches, doaj.org. 2(1), 1–7 (2012).
7. A. Ahmed, Y. Hifny, K. Shaalan, and S. Toral. Lexicon free Arabic speech recognition recipe. In International Conference on Advanced Intelligent Systems and Informatics, pp. 147–159, Springer (2016).
8. P. Motlicek, D. Imseng, B. Potard, P. N. Garner, and I. Himawan, Exploiting foreign resources for DNN-based ASR, EURASIP Journal on Audio, Speech, and Music Processing. 2015(1), 1–10 (2015). ISSN 1687-4722.
9. M. Attia, Y. Samih, K. F. Shaalan, and J. van Genabith. The floating Arabic dictionary: An automatic method for updating a lexical database through the detection and lemmatization of unknown words. In COLING, pp. 83–96 (2012).
10. D. Jurafsky and J. H. Martin, Speech and Language Processing. Pearson (2014). ISBN 1292025433.
11. D. Yu and L. Deng, Automatic Speech Recognition. Springer (2012).
12. S. Raschka, Python Machine Learning. Packt Publishing Ltd (2015).
13. H. Hermansky, D. W. Ellis, and S. Sharma. Tandem connectionist feature extraction for conventional HMM systems. In Acoustics, Speech, and Signal Processing, vol. 3, pp. 1635–1638, IEEE (2000). ISBN 0780362934.
14. G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, and T. N. Sainath, Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Processing Magazine. 29(6), 82–97 (2012). ISSN 1053-5888.
15. H. Bourlard and N. Morgan, Hybrid HMM/ANN Systems for Speech Recognition: Overview and New Research Directions. Springer (1998). ISBN 3540643419.
16. A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649, IEEE (2013). ISSN 1520-6149.
17. Y. Hifny, Unified acoustic modeling using deep conditional random fields, Transactions on Machine Learning and Artificial Intelligence. 3(2), 65 (2015).
18. A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376, ACM (2006). ISBN 1595933832.
19. A. Ali, P. Bell, J. Glass, Y. Messaoui, H. Mubarak, S. Renals, and Y. Zhang, The MGB-2 challenge: Arabic multi-dialect broadcast media recognition, arXiv preprint arXiv:1609.05625 (2016).
20. A. Graves. Supervised sequence labelling. In Supervised Sequence Labelling with Recurrent Neural Networks, pp. 5–13. Springer (2012).
21. X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 315–323 (2011).
22. A. L. Maas, Z. Xie, D. Jurafsky, and A. Y. Ng. Lexicon-free conversational speech recognition with neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2015).
23. S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pp. 310–318, Association for Computational Linguistics (1996).
24. A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning, pp. 1764–1772 (2014).
25. A. Mertins, Signal Analysis: Wavelets, Filter Banks, Time-Frequency Transforms and Applications (1999).
26. A. Ali, Y. Zhang, and S. Vogel. QCRI advanced transcription system (QATS). In Spoken Language Technology Workshop (2014).
27. S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, and D. Povey, The HTK Book (for HTK version 3.5), Cambridge University Engineering Department, UK (2015).
28. H. Mubarak, Data description of the Arabic multi-genre-broadcast challenge (2016).
29. K. Boakye, B. Trueba-Hornero, O. Vinyals, and G. Friedland. Overlapped speech detection for improved speaker diarization in multiparty meetings. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4353–4356 (2008).

Chapter 12

Bio-Inspired Optimization Algorithms for Improving


Artificial Neural Networks: A Case Study on
Handwritten Letter Recognition

Ahmed A. Ewees and Ahmed T. Sahlol

Computer Teacher Preparation Department,


Damietta University, Damietta, Egypt
ewees@du.edu.eg, atsegypt@du.edu.eg

Arabic handwritten letter recognition is a challenging field in pattern recognition due to the different writing styles, the variation in letters and the poor quality of manuscripts. No system of sufficient accuracy has been available until now. In this chapter, a letter recognition system based on swarm optimization algorithms with neural networks (NNs) is presented. The main purpose of the proposed approach is to improve the classification accuracy of NNs for Arabic handwritten letter recognition. Evolutionary Strategy (ES), Probability Based Incremental Learning (PBIL), Particle Swarm Optimization (PSO) and Moth-Flame Optimization (MFO) algorithms were applied to train NNs by generating the optimum weights and biases to be fed into the networks. The proposed approach consists of three main phases: pre-processing, feature extraction and classification by optimized NNs. The approach shows a good ability to train NNs, achieving high accuracy and F-measure values (not less than 93% for accuracy and 96% for F-measure) in all experiments. To the best of our knowledge, these are the highest results achieved by a published Arabic letter recognition system yet. In addition, comparing the proposed approach to classic NNs shows advantages in both accuracy and F-measure.

1. Introduction

Recognition of handwritten letters is a challenging issue, since the same letter varies with different font styles, sizes and noise.

There are several systems that deal with this issue for languages such as English, Japanese and Chinese; by contrast, fewer achievements exist for the Arabic language. The Arabic alphabet is used by millions in Arab countries and also for non-Arabic languages such as Persian, Kurdish, Malay and Urdu. The importance of the Arabic language also stems from its being the language of the Holy Quran, the Holy Book of Muslims. The Arabic alphabet has 28 basic letters. Sixteen of the Arabic letters have one, two or three dots, and the number and position of these dots differentiate between otherwise similar letters. Each letter has three or four shapes depending on its position in the word (beginning, middle, end or isolated), and those shapes can be totally different. One of the reasons that make Arabic letter recognition systems very complicated is that the basic letter shapes can triple or multiply further: this produces more than 80 different shapes according to the position of a letter as well as the writing style (such as Nasekh, Roqa'a, Farisi or others). Many Arabic letters contain dots, which can be the only feature that distinguishes one letter from another. The secondaries can be written separately or as a dashed line; they can also be linked to a letter or drawn as one big shape.

Several attempts have been made to develop Arabic letter recognition systems. In Refs. 1–3, classical recognition systems for Arabic letters were developed. No feature selection algorithms were used; however, some features were selected manually to reduce the processing time. Artificial neural networks (NNs) and support vector machines (SVMs) achieved the best classification accuracy among the classifiers. In Refs. 4–6, classical feature selection algorithms were applied, namely the Genetic Algorithm (GA), principal component analysis (PCA) and multi-objective GA, respectively. These algorithms aimed at selecting the most important features while achieving an acceptable recognition rate; linear discriminant analysis (LDA) and support vector machines (SVMs) were used to validate their efficiency. In Ref. 7, the Bat algorithm with Random Forests achieved the highest recognition accuracy among the tested classifiers, and it also outperformed GA as a feature selection algorithm. In general, most optimization algorithms achieve good performance when used as feature selectors; nevertheless, their ability to find the global optimum and their time consumption still need further improvement.
Some recent work on Arabic handwritten letters was proposed in Ref. 8, where Particle Swarm Optimization (PSO) was used as a feature selection algorithm to select the most significant features. The selected features were tested with several classifiers, and Random Forests (RF) achieved the best accuracy. PSO showed advantages over other swarm optimization algorithms when compared with other published works. More details about the principles and applications of swarm intelligence in handwritten Arabic letter recognition can be found in Ref. 9.
Due to the importance of handwritten Arabic letter recognition and the influence of its classification accuracy, various classification algorithms have been applied. One of the most used is NNs; however, NNs have drawbacks in generating their weights and biases. Several bio-inspired optimization algorithms have therefore been applied to training NNs, such as PSO10, GA11, Probability Based Incremental Learning (PBIL)12, Evolutionary Strategy (ES)13, Differential Evolution (DE)14 and Grey Wolf Optimization (GWO)15. All these algorithms were adopted mainly to enhance the classification efficiency of NNs. That motivated us to apply a new bio-inspired optimization algorithm, Moth-Flame Optimization (MFO), to training NNs. MFO has a good ability to avoid local optima and converges quickly compared to other optimization techniques.

In this chapter, a handwritten letter recognition approach based on swarm intelligence is proposed. ES, PBIL, PSO and MFO are used to improve the NNs' working mechanism by updating their weights and biases; the goal is to find the optimum values for the weights and biases, which accordingly improves the classification accuracy. The chapter is organized as follows: Section 2 provides more details about NNs and the bio-inspired optimization algorithms used. Section 3 explains the working mechanism of the swarms. Section 4 presents the proposed approach. Experimental results and discussion are described in Section 5. Finally, conclusions and future work are provided in Section 6.

2. Neural Networks and Bio-inspired Optimization Algorithms

2.1. Neural Networks (NNs)

Neural networks (NNs) are inspired by biological nervous systems. They can be trained to perform a specific task by adjusting the values of the connections (weights) between their elements, so that a specific input leads to a specific target output. In order to improve a neural network's performance, the weights and biases have to be reset to new values; the number of hidden neurons might also need to be increased. The following subsections provide more details about the bio-inspired optimization algorithms used; the focus is on MFO, as it is the most recent.

2.2. Particle Swarm Optimization (PSO)

PSO was introduced by Kennedy and Eberhart16. It mimics the knowledge evolution of social behavior17 and simulates group communication to share private knowledge when flocking, hunting or migrating. Each member adjusts its position according to its own experience and that of its neighbors. The swarm begins with random members, each of which has a position (xi) and a velocity (vi) in the jth dimension of the search space. Each member is evaluated using the fitness function, and its performance is compared with the local and global best values.
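The canonical PSO update, with the parameter values of Table 1 below (w = 0.3, C1 = C2 = 1), can be sketched as follows; the function name and array shapes are illustrative:

import numpy as np

def pso_step(x, v, pbest, gbest, w=0.3, c1=1.0, c2=1.0):
    # Each particle is pulled toward its own best position (pbest)
    # and the swarm's best position (gbest).
    r1 = np.random.rand(*x.shape)
    r2 = np.random.rand(*x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v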

2.3. Evolutionary Strategy (ES)

ES is based on iterations over parents and offspring and their evolution along generations. The algorithm mirrors the usual real-world relations between parents and their offspring: the offspring inherit the parents' features, but in some cases the offspring are mutated, so their features change randomly. After each generation, individuals from the parent population and the offspring are selected and ordered into the parent population of the next generation. These steps are repeated until a predefined number of fitness function evaluations is reached13,18.

2.4. Probability Based Incremental Learning (PBIL)

PBIL is a method that combines genetic algorithms and competitive learning for function optimization19. It is an extension of the Equilibrium Genetic Algorithm (EGA), obtained by re-examining the performance of the EGA in terms of competitive learning. PBIL uses a probability vector to describe the population of a genetic algorithm. For binary-encoded solution strings, the probability vector specifies the probability of each bit position containing a "1"; the probability of a "0" is obtained by subtracting the probability specified in the vector from 1.0. PBIL attempts to create a probability vector that can be considered a prototype for high-evaluation vectors in the function space being explored.
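The core PBIL update is a small step of the probability vector toward the bits of the best-evaluated member. A minimal sketch with the learning rate of Table 1 (0.05); the names are illustrative:

import numpy as np

def pbil_update(prob_vector, best_member, learning_rate=0.05):
    # Pull each bit probability toward the corresponding bit of the best
    # member; sampling the vector then generates the next population.
    return (1.0 - learning_rate) * prob_vector + learning_rate * best_member

# e.g. pbil_update(np.full(4, 0.5), np.array([1, 0, 1, 1]))
#      -> array([0.525, 0.475, 0.525, 0.525])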

2.5. Moth-Flame Optimization (MFO)

MFO is a recent optimization algorithm proposed by Mirjalili20. MFO mimics real moths in nature: a real moth flies by maintaining a specific angle to the moon, which lets it fly in a straight path, but when it converges on a closer light source this mechanism produces a spiral path around the light. The solutions in the MFO algorithm are called moths, and the problem parameters are treated as the moths' positions in the search space. The elite positions (best positions) found by the moths are kept and called flames.
The following matrix holds the set of moths:

\[ M = \begin{bmatrix} m_{1,1} & \cdots & m_{1,d} \\ \vdots & \ddots & \vdots \\ m_{n,1} & \cdots & m_{n,d} \end{bmatrix} \tag{1} \]

where n is the number of moths and d is the number of variables. The corresponding fitness values of all moths are arranged in a vector:

\[ OM = \begin{bmatrix} OM_1 \\ \vdots \\ OM_n \end{bmatrix} \tag{2} \]

where n is the number of moths.

In addition, there is a second matrix for the flames, represented as follows:

\[ F = \begin{bmatrix} F_{1,1} & \cdots & F_{1,d} \\ \vdots & \ddots & \vdots \\ F_{n,1} & \cdots & F_{n,d} \end{bmatrix} \tag{3} \]

where n is the number of flames and d is the number of variables. As for the moths, there is a vector of the corresponding fitness values of all flames, sorted as follows:

\[ OF = \begin{bmatrix} OF_1 \\ \vdots \\ OF_n \end{bmatrix} \tag{4} \]

where n is the number of flames.
The main MFO algorithm is defined as:

\[ MFO = (P, F, T) \tag{5} \]

where P is a function that initializes a random population of moths and their corresponding fitness values, F is the main function that makes the moths move around the search space, and T is a termination flag. In the main function F, the flames update the positions of the moths through the following logarithmic spiral:

\[ M_i = S(M_i, F_j) = D_i \cdot e^{bt} \cdot \cos(2\pi t) + F_j \tag{6} \]

where b is a constant defining the shape of the spiral (b = 1 in this work, see Table 1), t is a random number in [−1, 1] (Table 1), and D_i is the distance of the ith moth from the jth flame:

\[ D_i = |F_j - M_i| \tag{7} \]

where F_j denotes the jth flame and M_i the ith moth. In addition, the exploitation of the best solutions may degrade when the moths update their locations with respect to n different positions in the search space. To repair this issue, an adaptive scheme reduces the number of flames over the iterations:

\[ flame\_no = round\left(N - c \cdot \frac{N-1}{T}\right) \tag{8} \]

where c is the current iteration number, N is the maximum number of flames, and T is the maximum number of iterations. Algorithm 1 shows the structure of MFO.

Algorithm 1 Pseudocode of the MFO algorithm

1  Update flame_no using Eq. (8)
2  OM = FitnessFunction(M)
3  if iteration == 1 then
4    F = sort(M); OF = sort(OM)
5  else
6    F = sort(M_{t-1}, M_t); OF = sort(OM_{t-1}, OM_t)
7  end
8  for i = 1 : n do
9    for j = 1 : d do
10     Update r and t
11     Calculate D using Eq. (7) with respect to the corresponding moth
12     Update M(i, j) using Eq. (6) with respect to the corresponding moth
13   end for
14 end for
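The moth update of Eqs. (6)–(8) can be sketched in a few lines of numpy; pairing the moths beyond the current flame count with the last surviving flame follows the usual MFO formulation, and the function name is illustrative:

import numpy as np

def mfo_step(moths, flames, iteration, max_iter, b=1.0):
    # Shrink the number of flames linearly over the run, Eq. (8).
    n, d = moths.shape
    flame_no = int(round(n - iteration * (n - 1) / max_iter))
    # t in [-1, 1] as in Table 1, drawn per moth and per dimension.
    t = -1.0 + 2.0 * np.random.rand(n, d)
    # Moths beyond flame_no all spiral around the last surviving flame.
    idx = np.minimum(np.arange(n), flame_no - 1)
    D = np.abs(flames[idx] - moths)                                  # Eq. (7)
    return D * np.exp(b * t) * np.cos(2 * np.pi * t) + flames[idx]   # Eq. (6)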

3. Swarms Working Mechanism

The best neural network parameters (in this chapter, the weights and biases) are generated by the swarms (ES, PBIL, PSO and MFO). Each algorithm starts by generating a random population of candidate solutions for the given optimization problem. Through the iterations, the swarm parameters work on approximating the probable position of the prey: for each candidate solution the distance from the prey is updated, and finally, once an end criterion is satisfied, the algorithm terminates. To design the optimization updating strategy, the classification accuracy of a classifier (NNs in this work) is used to evaluate all possible solutions, i.e. to select the best weight and bias values, so the fitness function is based on the classification accuracy.

After several hundred trials, the NN was configured with 5 neurons in the hidden layer, a maximum of 1000 epochs and scaled conjugate gradient as the training function, as this configuration achieved the best fitness value. The Mean Square Error (MSE) was selected to validate each swarm iteration; it is calculated by the following equation:

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i)^2 \tag{9} \]

where y_i is the ith predicted value, x_i is the corresponding actual value, and n is the number of samples. The proposed algorithm is trained to find the best weights and biases, i.e. those with the minimum MSE on the training set. The swarm parameters used for the optimizers' search mechanisms are listed in Table 1.

Table 1. The experiments' parameters.

Algorithm   Parameters
NN          Hidden layer size: 5; error performance: MSE
PSO         w = 0.3; C1 = 1; C2 = 1
ES          λ: 2; σ: 1
PBIL        Learning rate: 0.05; good population member: 1; bad population member: 0; elitism parameter: 1; mutation probability: 0.1
MFO         t = [-1, 1]; b = 1

All experiments adopt the following parameter values: dimension = 3758, population size = 6, maximum number of generations = 80, upper bound = 10 and lower bound = -10.
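The fitness evaluation that ties the swarms to the network can be sketched as follows: a candidate solution is a flat vector of all weights and biases, decoded into a one-hidden-layer network (5 hidden neurons, as above) whose training MSE, Eq. (9), is the fitness. The activation function and the layout of the vector are assumptions of this sketch:

import numpy as np

def fitness(vec, X, y, n_hidden=5):
    # Decode the flat candidate vector into weights and biases.
    n_in, n_out = X.shape[1], y.shape[1]
    i = 0
    W1 = vec[i:i + n_in * n_hidden].reshape(n_in, n_hidden); i += n_in * n_hidden
    b1 = vec[i:i + n_hidden]; i += n_hidden
    W2 = vec[i:i + n_hidden * n_out].reshape(n_hidden, n_out); i += n_hidden * n_out
    b2 = vec[i:i + n_out]
    hidden = np.tanh(X @ W1 + b1)      # hidden-layer activation (assumed tanh)
    pred = hidden @ W2 + b2
    return np.mean((pred - y) ** 2)    # Eq. (9): mean squared error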

4. The Proposed Approach

The proposed approach begins by preprocessing the letter images, performing normalization, dilation, morphological operations and median filtering. After this phase, the following features are extracted from each letter image: right and left diagonal projections, gradient features, number of holes, vertical and horizontal projections, number and position of secondaries, and height-to-width ratio. The last phase is classification, which applies the swarm optimization algorithms to the NNs to find the optimum values of their weights and biases. The best weight and bias vector is then used to start the testing phase, in which the classification accuracy of the proposed approach is measured. The results of the optimized NNs (by PSO, PBIL, MFO and ES) are compared to the classic NNs. Table 2 and Figure 1 illustrate the phases of the proposed approach; a preprocessing sketch follows Table 2.

Table 2. The phases of the proposed approach.

Phase 1: Preprocessing
  Binarization: Otsu algorithm23
  Noise removal: median filtering24 (removes random noise such as that introduced by scanning or digitization); dilation25 (recovers distorted data areas); morphological noise removal (based on neighborhood pixels)

Phase 2: Feature extraction
  Gradient features: 400 features
  Right and left diagonal projections of letter parts: 64 features
  Vertical and horizontal projections: 248 features
  Special features: number of holes (1 feature), height-to-width ratio (1 feature), number of secondaries (1 feature), position of secondaries (1 feature)

Phase 3: NNs
  Training: apply the swarm (ES, PSO, PBIL, MFO) to find the optimum weights and biases
  Testing: use the optimum weights and biases to improve the neural classification performance
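The preprocessing phase of Table 2 can be sketched with OpenCV; the dilation kernel size is an assumption, as the chapter does not specify it:

import cv2

def preprocess(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Binarization with the Otsu algorithm.
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Median filtering to remove random scanning/digitization noise.
    denoised = cv2.medianBlur(binary, 3)
    # Dilation to recover distorted data areas.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    return cv2.dilate(denoised, kernel)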

Fig. 1. The proposed approach.

5. Experiments and Results

5.1. Dataset description

This chapter uses the CENPARMI dataset of Arabic handwritten letters21. The dataset was created at Concordia University in Canada, where 328 people participated in writing each letter. Because of the style of Arabic letters, they can appear in multiple forms depending on their position (initial, medial and final), so each letter is represented in different forms. This chapter uses only the basic 28 Arabic alphabet letters; Figure 2 shows some of them.

Fig. 2. Variations in letters from CENPARMI dataset.

5.2. Evaluation criteria

The results of the classifiers are evaluated using the accuracy and F-measure, as shown in the following equations:

\[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{10} \]

where TP is the number of true positive samples, TN the true negatives, FP the false positives and FN the false negatives.

\[ F\text{-}measure = 2 \cdot \frac{precision \cdot recall}{precision + recall} \tag{11} \]

where precision and recall are calculated as follows:

\[ precision = \frac{TP}{TP + FP} \tag{12} \]

\[ recall = \frac{TP}{TP + FN} \tag{13} \]

To evaluate the swarm performance, the MSE was used; it is calculated as in Equation (9).
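Both measures are available off the shelf; a minimal sketch with scikit-learn (the label arrays are hypothetical):

from sklearn.metrics import accuracy_score, f1_score

y_true = ['alif', 'ba', 'ta', 'alif']   # hypothetical letter labels
y_pred = ['alif', 'ba', 'ta', 'ba']

acc = accuracy_score(y_true, y_pred)               # Eq. (10)
f1 = f1_score(y_true, y_pred, average='macro')     # Eqs. (11)-(13), macro-averaged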

5.3. Results and discussions

The proposed handwritten letter recognition approach was developed over several runs of the swarms; the best fitness of ten runs was chosen. The training and testing data were chosen randomly: 70% of the data were assigned to the training task and the remaining 30% to testing. The proposed approach was implemented in MATLAB (2014b). The results of this chapter are divided into the following subsections.

5.3.1. Optimized neural networks performance

To validate the performance of the optimized NNs, the MSE (in the testing phase) was used to measure the distance between the predicted and the actual output samples. Figure 3 shows the performance of the enhanced NNs (trained by the ES, PSO, PBIL and MFO optimization algorithms) compared to the classic NNs.

Fig. 3. The performance of NNs trained by ES, PSO, PBIL, MFO and the classic NNs.

The results in Figure 3 show that the NNs trained by MFO achieved the lowest error, 0.030, whereas the errors of the NNs trained by PSO, PBIL and ES were 0.0313, 0.033 and 0.0345, respectively. This shows how powerful MFO is in optimizing NNs. Figure 4 illustrates the convergence curves of the fitness values of the objective function for the best three runs of each optimizer. It shows that the performance of the optimization algorithms improved along the iterations: for ES and PSO it stayed steady for many iterations, whereas for PBIL and MFO the performance kept improving with more iterations.

Fig. 4. The convergence curves of the optimization algorithms.

5.3.2. Performance of the proposed approach

To validate the performance of the proposed approach, two well-known measures were used: accuracy and F-measure. Figures 5 and 6 show the results.

Fig. 5. The accuracy of the proposed approach.



Fig. 6. The F-measure of the proposed optimization approach.

As seen in Figure 5, the NNs trained by the MFO algorithm achieved the best classification accuracy compared to those trained by PSO, PBIL and ES, which reached approximately 94%, 93% and 93% classification accuracy, respectively, compared to 94.13% for MFO (see Table 3). Also, in Figure 6, the NNs trained by MFO outperformed all the other optimization algorithms on the F-measure test, achieving 0.97, compared to 0.96 for the NNs trained by PBIL, PSO and ES. This means that the MFO algorithm has more ability than the other optimization algorithms to find the best weights and biases of the NNs, thanks to its avoidance of getting stuck in local optima.

5.3.3. Comparison with the classic neural networks

In order to validate the performance of the proposed optimization algorithms, the same measures were used to test the NNs without optimization. Table 3 lists the accuracy and F-measure of the classic NNs and of the optimized NNs. From Table 3, it is obvious that all the NNs trained by optimization algorithms achieved higher results in both accuracy and F-measure. The gap is admittedly not large, as the classic NNs already achieve good performance without optimization, but the optimized NNs consistently perform better.

Table 3. The accuracy and f-measure results of the proposed optimized NNs
against classic NNs.

Classifier Accuracy F-measure


Classic-NNs 93.08% 0.9641
ES-NNs 93.21% 0.9648
PSO-NNs 94.04% 0.9691
PBIL-NNs 93.38% 0.9657
MFO-NNs 94.13% 0.9696

5.3.4. Comparison with relevant works

Table 4 lists the most recent published works on Arabic handwritten letters using the same dataset, sorted according to classification accuracy.

Table 4. Comparisons between this work and the performance of previous works on the same dataset.

Previous work            Approach                             Classification accuracy
Ref. 2                   Feed-forward neural network          88%
Ref. 3                   SVM with RBF kernel                  89.20%
Ref. 7                   Random Forest with Bat algorithm     91.59%
Ref. 8                   Random Forest with PSO algorithm     91.66%
The proposed approach    NNs with PBIL                        93.21%
                         NNs with ES                          93.38%
                         NNs with PSO                         94.04%
                         NNs with MFO                         94.13%

It is obvious that the proposed approach outperformed all the other handwritten Arabic recognition systems that have worked on the CENPARMI dataset. To the best of our knowledge, no published Arabic handwritten letter recognition system has achieved a recognition accuracy as high as the one presented in this chapter. We attribute this to two main factors:

(1) Choosing efficient optimization algorithms such as MFO, ES, PBIL and PSO, which significantly enhance the neural network's performance. This leads to better classification results; the enhanced NNs are more time-consuming, but this is the cost of improving the recognition accuracy. A successful optimization mechanism should find the optimum solution (the best weights and biases) of a problem, which was accomplished in this chapter.

(2) NNs proved to achieve the highest performance among the classifiers in our previous works, Refs. 2 and 27. NNs can be trained to solve any nonlinear problem by tuning the connection (weight) values between elements.

6. Conclusion and Future Work

In this chapter, an Arabic letter recognition approach based on enhanced NNs was proposed. The NNs were trained by Evolutionary Strategy (ES), Probability Based Incremental Learning (PBIL), Particle Swarm Optimization (PSO) and Moth-Flame Optimization (MFO), with the aim of finding the optimum values of the network's weights and biases. The proposed approach followed three main phases: preprocessing, feature extraction and, finally, classification by the optimized NNs. Several runs of each optimization algorithm were performed to select the best weight and bias values for improving the classification accuracy. The neural network optimized by the MFO algorithm achieved the highest accuracy in all experiments compared to the other optimizers: a classification accuracy of 94% and an F-measure value of 96% were achieved. These are the highest results yet published for an Arabic letter recognition system. Also, compared to the classic NNs, the proposed optimization showed advantages in both accuracy and F-measure. Our future work may focus on deploying swarm algorithms to enhance other powerful classifiers such as Support Vector Machines (SVMs).

References

1. M. Z. Khedher, G. A. Abandah and A. M. Al-Khawaldeh, Optimizing Feature


Selection for Recognizing Handwritten Arabic Characters, In The 2nd World
Enformatika Conference, 2005 (WEC’05), Vol. 4(2), pp. 81-84, 2005.
2. A. T. Sahlol, C. Y. Suen, M. R. Elbasyoni, and A. A. Sallam, A proposed OCR
Algorithm for cursive Handwritten Arabic Character Recognition, Journal of Pattern
Recognition and Intelligent Systems (PRIS), pp. 90-104 (2014).
3. A. T. Sahlol, C. Y. Suen, M. R. Elbasyoni, and A. A. Sallam, Investigating of
Preprocessing Techniques and Novel Features in Recognition of Handwritten Arabic
Characters, In Artificial Neural Networks in Pattern Recognition, Springer
International Publishing, pp. 264-276 (Oct, 2014).
4. G. Abandah and T. Malas, Feature Selection for Recognizing Handwritten Arabic
Letters, Dirasat: Engineering Sciences, Vol. 37(2) (2011).
5. G. A. Abandah, Kh. S. Younis and M. Z. Khedher, Handwritten Arabic character
recognition using multiple classifiers based on letter form, In Proceedings of the 5th
IASTED International Conference on Signal Processing, Pattern Recognition, and
Applications (SPPRA'08), pp. 128-133 (Feb, 2008).
6. G. Abandah and N. Anssari, Novel moment features extraction for recognizing
handwritten Arabic letters, Journal of Computer Science, Vol. 5(3), pp. 226-232
(2009).
7. A. T. Sahlol, C. Y. Suen, H. M. Zawbaa, A. Hassanien, and M. Abd Elfattah, Bio-
inspired BAT optimization algorithm for handwritten Arabic characters recognition,
In Evolutionary Computation (CEC), 2016 IEEE Congress on, pp. 1749-1756 (Jul,
2016).
8. A. T. Sahlol, M. Abd Elfattah, C. Y. Suen, and A. Hassanien, Particle Swarm
Optimization with Random Forests for Handwritten Arabic Recognition System, In
International Conference on Advanced Intelligent Systems and Informatics, pp. 437-
446 (Oct, 2016).
9. A. T. Sahlol and Aboul Ella Hassanein, Bio-inspired optimization algorithms for
Arabic handwritten characters, Handbook of Research on Machine Learning
Innovations and Trends, IGI Global (2017).
10. R. Mendes, P. Cortez, M. Rocha and J. Neves, Particle swarms for feedforward neural
network training, In Neural Networks, 2002. IJCNN'02. Proceedings of the 2002
International Joint Conference, IEEE, vol. 2, pp. 1895-1899 (2002).
11. Z.-G. Che, T.-A. Chiang, and Z.-H. Che, Feed-forward neural networks training: a
comparison between genetic algorithm and back-propagation learning algorithm,
International Journal of Innovative Computing, Information and Control 7, no. 10, pp.
5839-5850 (2011).
12. E. Galic, M. Höhfeld, Improving the generalization performance of multi-layer-
perceptrons with population-based incremental learning, In International Conference
on Parallel Problem Solving from Nature, Springer, pp. 740-750 (1996).

13. S. Gielen and B. Kappen, Minimizing the system error in feedforward neural networks
with evolution strategy, In ICANN'93, Springer, pp. 490-493 (1993).
14. J. Ilonen, J.-K. Kamarainen and J. Lampinen, Differential evolution training algorithm
for feed-forward neural networks, Neural Processing Letters 17, no. 1, pp. 93-105
(2003).
15. S. Mirjalili, How effective is the Grey Wolf optimizer in training multi-layer
perceptrons, Applied Intelligence 43, no. 1, pp. 150-161 (2015).
16. R. C. Eberhart and J. Kennedy, A new optimizer using particle swarm theory,
Proceedings of the sixth international symposium on micro machine and human
science. Vol. 1 (1995).
17. N. Sultan, S. M. Shamsuddin, and A. Hassanien, Hybrid learning enhancement of
RBF network with particle swarm optimization, Foundations of Computational,
Intelligence Volume 1. Springer, pp. 381-397 (2009).
18. T. P. Runarsson and X. Yao, Search biases in constrained evolutionary optimization,
IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and
Reviews) 35, no. 2, pp. 233-243 (2005).
19. S. Baluja, Population-based incremental learning. a method for integrating genetic
search based function optimization and competitive learning, No. CMU-CS-94-163.
Carnegie-Mellon Univ Pittsburgh Pa Dept of Computer Science (1994).
20. S. Mirjalili, Moth-flame optimization algorithm: A novel nature-inspired heuristic
paradigm, Knowledge-Based Systems 89, pp. 228-249 (2015).
21. H. Alamri, J. Sadri, C.Y. Suen, N. Nobile, A novel comprehensive database for Arabic
off-line handwriting recognition, In Proceedings of 11th International Conference on
Frontiers in Handwriting Recognition, ICFHR, vol. 8, pp. 664-669 (2008).
22. A. Hassanien and E. Emary, Swarm intelligence: principles, advances, and
applications, CRC Press (2016).
23. N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Syst.
Man Cybern., vol. 9, pp. 62-66 (Jan, 1979).
24. J. S. Lim, Two-dimensional signal and image processing, Englewood Cliffs, NJ,
Prentice Hall, (1990).
25. A. Rosenfeld and A. C. Kak, Digital picture processing, Academic press (1976).
26. M. Sami, N. El-Bendary, T-h. Kim, and A. E. Hassanien, Using particle swarm
optimization for image regions annotation, Lecture Notes in Computer Science, vol.
7709, pp. 241-250 (2012).
27. A. T. Sahlol, A. A. Ewees, A. M. Hemdan and A. E. Hassanien, Training feedforward
neural networks using Sine-Cosine algorithm to improve the prediction of liver
enzymes on fish farmed on nano-selenite, In Computer Engineering Conference
(ICENCO), 2016 12th International, pp. 35-40 (2016).

Index

accusative case, 132, 138
active voice, 132
Alkhalil, 141
ambiguity, 129, 139
Arabic acoustic modelling, 1, 4, 10, 11, 14
Arabic fonts, 204–207, 209, 221, 226
Arabic grammar, 127, 130, 131, 149, 153
Arabic handwritten documents, 111, 116, 120, 122, 123
Arabic handwritten letter recognition, 249, 251, 264
Arabic information retrieval, 29
Arabic language modelling, 1, 4, 10, 11, 14
Arabic machine translation, 29, 30, 38, 55
Arabic morphological analyzer, 156
Arabic NLP, 29, 53, 55, 59, 60, 67, 68, 71, 76, 78
Arabic OCR, 48, 56
Arabic Opinion Mining, 169
Arabic script, 85, 86, 107
Arabic speech recognition, 1, 2, 11, 14
Arabic text diacritization, 15, 17
Arabic text mining, 29, 30
Arabic Wikipedia, 164
Arabic WordNet, 165
AraMorph, 155
Buckwalter & Parkinson’s Frequency Dictionary, 157
building sentiment lexicons, 171
challenges in NLP, 59, 60, 67, 74, 75
Classical Arabic, 128, 157
colloquial Arabic speech recognition, 18
Connectionist Temporal Classification, 231–233, 236, 246
context-free grammar, 130
context-sensitive disambiguation, 155, 160
declension, 70
diacritical markings, 128, 131, 138, 142, 152
diacritical marks, 128
digital fonts, 206
dimensionality reduction, 193
disambiguation, 155
Dynamic Bayesian Networks, 85, 100
e‘raab, 129
Evolutionary Strategy, 249, 251, 252, 264
Extended Backus-Naur Form, 135
eXtended Revised AraMorph (XRAM), 155
fatwa classification, 187
feature extraction, 85, 86, 89, 107, 111–113, 115, 120, 121
filtering, 155
font legibility, 204, 205, 227
free order language, 128
generative and discriminative models, 85
genre selection, 159
handwriting recognition, 113
hidden Markov models, 85, 94, 95
hierarchical classification, 187, 191
Informal Colloquial Arabic, 157
intransitive, 132
intricate, 61, 70, 72
k-means clustering, 189
language model, 157, 232, 234, 237, 238, 241–244
learning algorithms, 112, 113
lemmatization, 165
lexical analyzer, 133
linguistic analysis, 157
machine learning techniques, 85, 86, 92, 107
Modern Standard Arabic, 128, 157
morphological analysis, 161
morphological analyzer, 133
morphology, 60, 61, 62, 69, 70, 72, 75
moth-flame optimization, 249, 251, 253, 264
multi-label classification, 187, 189, 194
multi-label learning, 188, 197
natural language technologies, 29, 34
neural networks, 249, 250, 252, 255, 262, 264
nominal sentence, 131, 135
nominative case, 131, 138
orthography, 62, 63, 65, 66
Particle Swarm Optimization, 249, 251, 252, 264
passive voice, 132
Piece of Arabic Word, 117
POS tagging, 155
Probability Based Incremental Learning, 249, 251, 253, 264
pro-drop property, 128
Qur’an, 127
ranking, 155
recurrent neural network, 231–233, 235
Salmoné’s Arabic-English dictionary, 164
segmentation free, 115, 118–120
semi-automatic annotation, 162
sentences analysis, 129, 130
sentiment analysis, 169–172, 174, 178, 179, 181, 182
sentiment lexicons, 169–171, 178, 179, 184
speech recognition, 231–233, 237, 240, 244
syntactical analysis, 129, 132, 140, 142, 143, 145, 147, 148, 152, 153
syntactical analyzers, 133
text categorization, 187, 188, 190
tokenization, 160
transitive, 132
Transparent Neural Networks, 85, 103–105, 108
usage markers, 157
verbal sentence, 131, 135
word spotting, 111–119, 121, 122
word-segmentation, 160
XML tagging, 157
