Arabic Speech Recognition Challenges and State of The Art
Chapter 1
Computational Linguistics, Speech and Image Processing for Arabic Language Downloaded from www.worldscientific.com
by 185.250.46.143 on 11/23/20. Re-use and distribution is strictly not permitted, except for Open Access articles.
The Arabic language has several features, such as its phonology and syntax, that make it an easy language for developing automatic speech recognition (ASR) systems. Many standard techniques for acoustic and language modeling, such as context-dependent acoustic models and n-gram language models, can be applied to Arabic directly. Some aspects of the language, such as the nearly one-to-one letter-to-phone correspondence, make the construction of the pronunciation lexicon even easier than in other languages. The most difficult challenges in developing speech recognition systems for Arabic are the dominance of non-diacritized text material, the many dialects, and the morphological complexity. In this chapter, we review the efforts that have been made to handle these challenges. They include methods for automatically generating the diacritics of Arabic text and for word pronunciation disambiguation. We also review the approaches used to handle the limited speech and text resources of the different Arabic dialects. Finally, we review the approaches used to deal with the high degree of affixation and derivation that contributes to the explosion of different word forms in Arabic.
2 S. M. Abdou and A. M. Moussa
1. Introduction
The goal of the ASR system is to find the most probable sequence of words \(W = (w_1, w_2, \dots)\) belonging to a fixed vocabulary given a set of acoustic observations \(X = (x_1, x_2, \dots, x_T)\). Following the Bayesian
approach applied to ASR, as shown in Ref. 4, the best estimate of the word sequence is given by:
\[
\hat{w} = \arg\max_W P(W|X) = \arg\max_W \frac{P(X|W)\,P(W)}{P(X)} = \arg\max_W P(X|W)\,P(W) \tag{1}
\]
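As a toy illustration of Eq. (1), a decoder scores each hypothesis \(W\) by \(\log P(X|W) + \log P(W)\) and picks the argmax; \(P(X)\) is the same for every hypothesis and can be dropped. The hypothesis list and all log-probabilities below are invented for illustration, not outputs of a real acoustic or language model:

```python
# Toy illustration of Eq. (1): pick the word sequence W maximizing
# P(X|W) * P(W).  All scores are made-up log-probabilities.
hypotheses = {
    "the cat sat": {"log_p_x_given_w": -12.0, "log_p_w": -4.0},
    "the cats at": {"log_p_x_given_w": -11.5, "log_p_w": -7.0},
    "a cat sat":   {"log_p_x_given_w": -13.0, "log_p_w": -4.5},
}

def decode(hyps):
    # argmax over W of log P(X|W) + log P(W); P(X) is constant and dropped
    return max(hyps, key=lambda w: hyps[w]["log_p_x_given_w"] + hyps[w]["log_p_w"])

best = decode(hypotheses)  # "the cat sat" (-16.0 beats -18.5 and -17.5)
```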
Fig. 1. Block diagram of an ASR system: the front-end extracts a feature vector \(X\) from the input speech, and the search module combines the acoustic model \(P(X|W)\) with the language model \(P(W)\) to produce the recognized text.
\(X\) is not the raw speech input but a set of features derived from the speech. The Mel Frequency Cepstrum Coefficients (MFCC) and Perceptual Linear Prediction (PLP) features are the most widely used. The acoustic and language models and the search operation are discussed below.
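A rough sketch of how an MFCC front-end derives \(X\) from the waveform (pre-emphasis, framing and windowing, power spectrum, mel filterbank, log compression, DCT); the parameter values below (frame length, hop, filter count) are common defaults, not ones prescribed by this chapter:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13,
         frame_len=400, hop=160):
    # Pre-emphasis boosts high frequencies before analysis.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice into overlapping frames and apply a Hamming window.
    n_frames = 1 + max(0, (len(sig) - frame_len) // hop)
    frames = np.stack([sig[i * hop : i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank, equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies -> cepstral coeffs.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_mels)
    return log_mel @ dct.T

feats = mfcc(np.random.RandomState(0).randn(16000))  # 1 s at 16 kHz
```

With a 25 ms frame and 10 ms hop, one second of audio yields 98 frames of 13 coefficients each.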
The most popular acoustic models are the so-called Hidden Markov Models (HMMs). Each phoneme (or, in general, each unit) is modeled using an HMM. An HMM, as shown in Ref. 4, consists of a set of states, transitions, and output distributions, as shown in Fig. 2.
\[
b_j(x) = \sum_{m} w_{jm}\, N(x; \mu_{jm}, \sigma_{jm}^2) \tag{2}
\]
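Eq. (2) can be evaluated directly; the following sketch computes the (log) output probability of a one-dimensional, two-component Gaussian mixture with made-up parameters:

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log of Eq. (2): b_j(x) = sum_m w_m * N(x; mu_m, sigma_m^2),
    for a single scalar feature x (one dimension, for illustration)."""
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        norm = 1.0 / math.sqrt(2.0 * math.pi * var)
        total += w * norm * math.exp(-((x - mu) ** 2) / (2.0 * var))
    return math.log(total)

# Two-component mixture with invented parameters.
ll = gmm_log_likelihood(0.5, weights=[0.6, 0.4], means=[0.0, 1.0],
                        variances=[1.0, 0.5])
```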
Fig. 3. Decision tree for classifying the second state of K-triphone HMMs, with questions such as “Is the right phone a back-R?” and “Is the left phone /s, z, sh, zh/?” leading to leaf senones.
hours of speech are used to train a model. The model, together with an appropriate confidence measure, can then be used to automatically transcribe thousands of hours of data. The new data can then be used to train a larger model. All the above techniques (and more) are implemented in the Hidden Markov Model Toolkit (HTK).
\[
P(w) = P(w_1)\,P(w_2|w_1)\,P(w_3|w_1, w_2) \cdots P(w_n|w_1, \dots, w_{n-1}) = \prod_{i=1}^{n} P(w_i|w_1, \dots, w_{i-1}) \tag{3}
\]
\[
P(w_i|w_1, w_2, \dots, w_{i-1}) \approx P(w_i|w_{i-N+1}, \dots, w_{i-1}) \tag{4}
\]
many word histories do not occur with enough counts to yield reliable estimates of their probabilities. Many techniques have been proposed to approximate these probabilities.10 For example, in the case of a bigram grammar, the model typically lists only the most frequently occurring bigrams and uses a backoff mechanism to fall back on the unigram probability when the desired bigram is not found. In other words, if P(wj|wi) is sought and not found, one falls back on P(wj), but a backoff weight is applied to account for the fact that wj is known not to be one of the bigram successors of wi. Higher-order backoff N-gram grammars can be defined similarly. Ideally, a good LM eases the retrieval of the word sequence present in the speech signal by better focusing the decoding procedure, which represents another relevant step of the search. One of the most effective tools for training language models is the SRILM toolkit, which includes most state-of-the-art alternatives.11
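The backoff scheme described above can be sketched as follows; the probabilities and backoff weights here are invented for illustration (a real toolkit such as SRILM estimates the backoff weights so that each conditional distribution sums to one):

```python
import math

# Toy bigram backoff: if P(wj|wi) was not estimated, fall back to
# alpha(wi) * P(wj).  All counts and weights are invented.
bigram_p = {("i", "am"): 0.5, ("am", "here"): 0.3}
unigram_p = {"i": 0.2, "am": 0.1, "here": 0.05, "there": 0.05}
backoff_w = {"i": 0.4, "am": 0.7}  # alpha(wi), one weight per history

def p_bigram(wi, wj):
    if (wi, wj) in bigram_p:
        return bigram_p[(wi, wj)]          # stored bigram
    return backoff_w.get(wi, 1.0) * unigram_p[wj]  # backed-off estimate

p_seen = p_bigram("i", "am")        # stored bigram: 0.5
p_backed = p_bigram("am", "there")  # backoff: 0.7 * 0.05 = 0.035
```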
2.4. Decoding
Finding the best word (or generally unit) sequence given the speech input
is referred to as the decoding or search problem. Formally, the problem reduces to finding the best state sequence in a large state space obtained by composing the pronunciation lexicon, the acoustic model, and the language model. The solution can be found using the well-known
Viterbi algorithm. Viterbi search is essentially a dynamic programming
algorithm, consisting of traversing a network of HMM states and
maintaining the best possible path score at each state in each frame. It is
a time synchronous search algorithm in that it processes all states
completely at time t before moving on to time t + 1. The abstract
algorithm can be understood with the help of Fig. 5. One dimension
represents the states in the network, and the other dimension represents
the time axis.
Fig. 5. Viterbi search trellis: one dimension represents the states in the network (from start state to final state), and the other represents the time axis.
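The time-synchronous Viterbi traversal of the trellis can be sketched as follows, on a toy 2-state HMM with made-up probabilities; all states are processed at frame t before moving to frame t + 1, and the best path is recovered by backtracking:

```python
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """Time-synchronous Viterbi over a toy HMM.
    log_trans[i, j]: log P(state j | state i)
    log_emit[t, j]:  log P(observation at frame t | state j)
    log_init[j]:     log P(start in state j)
    Returns the best state sequence."""
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)   # best path score per state, frame
    back = np.zeros((T, S), dtype=int)  # best predecessor per state, frame
    score[0] = log_init + log_emit[0]
    for t in range(1, T):
        # score all states at frame t before moving to frame t + 1
        cand = score[t - 1][:, None] + log_trans       # shape (S, S)
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(S)] + log_emit[t]
    # trace back from the best final state
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Tiny 2-state example with invented probabilities (3 frames).
lt = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))
le = np.log(np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]))
li = np.log(np.array([0.5, 0.5]))
best_path = viterbi(lt, le, li)  # [0, 1, 1]
```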
The early efforts to develop Arabic ASR systems started with simple tasks such as digit recognition and small-vocabulary isolated-word recognition.
grapheme based and phoneme based, were developed, and their evaluation results show that the phonetic system gives about a 13% reduction in WER. The system was trained using 1024 hrs of speech data, consisting of 764 hrs of supervised data and 260 hrs of lightly supervised data. The gain from the unsupervised data part has been shown to be marginal and may even result in
Gaussian components per state and used an n-gram language model trained on 1 billion words of text data. It used three decoding stages. The first
stage is a fast decoding run with Gender Independent (GI) models. The
second stage uses Gender Dependent (GD) models adapted using LSLR.
It also uses variance scaling using the first stage supervision. The second
stage generates trigram lattices which are expanded using a 4-gram
language model and then rescored in the third stage using GD models
adapted using lattice-MLLR, as discussed in Ref. 21. In that system, it was shown that graphemic models perform at least as well as phonetic models for conversational data, with only a very minor degradation on the news data.
The IBM ViaVoice was one of the first commercial Arabic large-vocabulary systems developed for dictation applications.24 A more advanced system was developed by the speech recognition research group at IBM for an Arabic broadcast transcription system fielded for the GALE project. Key advances include improved discriminative training,
the use of subspace Gaussian mixture models (SGMM) as shown in Ref.
25, neural network acoustic features as shown in Ref. 26, variable frame
rate decoding as shown in Ref. 27, training data partitioning experiments,
class-based exponential LM model and NNLMs with Syntactic
features.28 This system was trained on 1800 hrs of transcribed Arabic
broadcasts and text data of size 1.6 billion words provided by the
Linguistic Data Consortium (LDC).29 A pruned language model of size 7
million n-grams using Entropy pruning as shown in Ref. 30 is used for
the construction of static, finite-state decoding graphs. Another unpruned
version of the LM that contains 883 million n-grams is used for lattice
rescoring. This system used a vocabulary of 795K words with more than
2 million pronunciations. This system used 6 decoding passes. The first
pass used a speaker independent grapheme based acoustic model. The
following 5 passes used speaker adapted phoneme based models. All
models have penta-phone cross-word acoustic context. Another 3
The Arabic language poses three major challenges for developing ASR systems. The first is the constraint of having to use mostly non-diacritized texts as recognizer training material, which causes problems for both acoustic and language modeling. Training accurate acoustic
models for the Arabic vowels without knowing their location in the
signal is difficult. Also, a non-diacritized Arabic word can have several
senses with the intended word sense to be derived from the word context.
Language models trained on this non-diacritized material may therefore
be less predictive than those trained on diacritized texts.
from various media sources, there are only very few speech corpora of
dialectal Arabic available.
The third challenge of Arabic is its morphological complexity which
is known to present serious problems for speech recognition, in particular
for language modeling. A high degree of affixation, derivation, etc.,
contributes to the explosion of different word forms, making it difficult if
not impossible to robustly estimate language model probabilities. Rich
morphology also leads to high out-of-vocabulary rates and larger search
spaces during decoding, thus slowing down the recognition process. In
the following sections, we review most of the proposed approaches to
overcome these challenges.
marks and 10% word error rate for case ending marks. This means more than 10% of the data will be restored with wrong diacritics, which would reduce the efficiency of the trained acoustic models. To reduce the number of errors in the restored diacritics, it was proposed to use the audio recordings of the text data to help select the correct word diacritics.
Whereas MSA data can readily be acquired from various media sources, there are only very limited speech corpora of dialectal Arabic available. The construction of such a corpus is even more challenging than the MSA one. First, the manual annotation has no standard reference; the same word can be transcribed in several ways, such as “ﺑﺄﺷﻜﺮﻙ، ﺑﺎﺷﻜﺮﻙ، ﺑﺸﻜﺮﻙ”. Some transcription guidelines for Egyptian and Levantine Dialectal Arabic were proposed to reduce such differences.38 The diacritization of dialectal Arabic is more challenging than that of MSA, since it requires a dialectal Arabic morphological analyzer to generate the different diacritization forms. Using context-based diacritization would also require a robust language model for dialectal Arabic.
Fig. 6. Left: An example of Arabic word factorization. Right: Vocabulary growth for the
Arabic language.
The main drawback of that approach is the short duration of the affixation units, which can be only two phones long, making them highly susceptible to insertion errors. To avoid these effects, some enhancements to the approach were proposed: keeping the most frequent words in full form without decomposition, and not decomposing the prefix “Al” for words starting with a solar consonant, since, due to assimilation with the following consonant, deletion of that prefix was one of the most frequent errors. This enhanced morphologically based LM provided some reduction in WER compared with a word-based LM.45 Rather than using linguistic knowledge to derive the morphological decomposition, an unsupervised technique based on the Minimum Description Length (MDL) principle was also proposed to provide better coverage of Out-Of-Vocabulary (OOV) words.46
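The decomposition idea, including the two enhancements above (a frequent-word whitelist and keeping “Al” attached before solar consonants), can be sketched as follows on Buckwalter-style transliterations; the affix lists, the whitelist, and the solar-consonant set below are illustrative stand-ins, not the actual lists used in the cited systems:

```python
# Sketch of greedy affix splitting for morphological LM units.
# All lists here are invented examples, not the cited systems' lists.
PREFIXES = ["wAl", "Al", "w", "b", "l"]       # longest first
SUFFIXES = ["hm", "hA", "h", "k", "At"]
SOLAR = set("tvdr0zsS$DTZln")                 # consonants assimilating "Al"
FREQUENT = {"Allh", "AlywM"}                  # kept whole, no decomposition

def decompose(word):
    if word in FREQUENT:
        return [word]
    parts = []
    for p in PREFIXES:
        # keep "Al" attached before a solar consonant (assimilation),
        # since the short detached prefix is easily deleted in decoding
        if p.endswith("Al") and word[len(p):len(p) + 1] in SOLAR:
            continue
        if word.startswith(p) and len(word) - len(p) >= 3:
            parts.append(p + "+")
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            suffix = "+" + s
            word = word[:-len(s)]
            break
    parts.append(word)
    if suffix:
        parts.append(suffix)
    return parts

tokens = decompose("wAlktAbhm")   # -> ["wAl+", "ktAb", "+hm"]
solar = decompose("Al$ms")        # "Al" kept attached -> ["Al$ms"]
```

Splitting stems from attached conjunctions, articles, and pronoun suffixes in this way shrinks the LM vocabulary at the cost of the short-unit insertion errors discussed above.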
and can be e.g. stems, POS tags, etc. in addition to the words themselves.
Probabilistic LMs are then constructed over (sub) sets of factors. Using a
Fig. 7. (a) Standard backoff path for a 4-gram language model over words; (b) backoff graph for a 4-gram over factors.
larger than one million words with processing time close to real-time performance, as shown in Refs. 28, 50, but this came at the price of large model sizes of several gigabytes.
for MSA, which is around 15% WER, is very comparable with the 10% WER achieved for the similar task of Broadcast News ASR for English. But we should keep in mind that the complexity of Arabic MSA ASR is much higher, with a vocabulary size of 560k words compared with the 210k words of the English vocabulary for Broadcast News ASR. The performance for dialectal Arabic, as shown in the Iraqi, Egyptian, and Levantine conversational ASR systems, is comparable with that of equivalent conversational English ASR, with average WER in the range 30%–40%. But we should keep in mind that dialectal Arabic is much more challenging than conversational English: the LM training data is very limited, and many required Natural Language Processing (NLP) tools, such as morphological analyzers, diacritizers, and text normalizers, still need to be developed.
6. Conclusions
References
[1] J. Billa, et al. Audio indexing of broadcast news. Proc. of the International
Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. I-3-I-5
(2002).
[2] S. Khurana and A. Ali, QCRI advanced transcription system (QATS) for the Arabic
multi-dialect broadcast media recognition: MGB-2 challenge, IEEE Spoken
Language Technology Workshop, (SLT), pp. 292–298 (2016).
[3] V. Peddinti, D. Povey, and S. Khudanpur, A time delay neural network architecture
for efficient modeling of long temporal contexts, Proc. of the Interspeech Conf., pp.
3214–3218 (2015).
[20] H. Bourouba, R. Djemili, M. Bedda and C. Snani, New Hybrid System (Supervised
Classifier/HMM) for Isolated Arabic Speech Recognition, Proc. of the
International Conference on Information & Communication Technologies, pp.
1264–1269 (2006).
[21] J. Billa, et al., Arabic speech and text in TIDES OnTAP, Proc. of the International Conference on Human Language Technology Research, HLT, pp. 1024–1029 (2002).
[22] M. Afify, L. Nguyen, B. Xiang, S. Abdou and J. Makhoul, Recent progress in Arabic broadcast news transcription at BBN, Proc. of the Interspeech Conf., pp. 1637–1640 (2005).
[23] http://projects.ldc.upenn.edu/gale/index.html, page referenced at April 2017.
[24] https://www-01.ibm.com/software/pervasive/viavoice.html
[25] D. Povey, et al., Subspace Gaussian mixture models for speech recognition, Proc.
of the International Conference on Acoustics, Speech and Signal Processing,
ICASSP, pp. 4330–4333 (2010).
[26] H. Hermansky, D. P. W. Ellis and S. Sharma, Tandem connectionist feature
extraction for conventional HMM systems, Proc. of the International Conference
on Acoustics, Speech and Signal Processing, ICASSP, pp. 1635–1638 (2000).
[27] S. M. Chu and D. Povey, Speaking rate adaptation using continuous frame rate
normalization, Proc. of the International Conference on Acoustics, Speech and
Signal Processing, ICASSP, pp. 4306–4309 (2010).
[28] H.-K. J. Kuo, L. Mangu, A. Emami, A., I. Zitouni and Y.-S. Lee, Syntactic features
for Arabic speech recognition, IEEE Workshop on Automatic Speech Recognition
& Understanding, ASRU, pp. 327–332 (2009).
[29] https://catalog.ldc.upenn.edu/search
[30] A. Stolcke, Entropy-based pruning of backoff language models, Proc. of DARPA
Broadcast News Transcription and Understanding Workshop, pp. 270–274 (1998).
[31] http://www.mgb-challenge.org/arabic.html, page referenced at April 2017.
[32] D. Povey, et al., Purely sequence-trained neural networks for ASR based on lattice-
free MMI, Proc. of the Interspeech Conf., pp. 2751–2755 (2016).
[33] T. Mikolov, et al., RNNLM – Recurrent Neural Network Language Modeling
Toolkit, IEEE Workshop on Automatic Speech Recognition & Understanding,
ASRU, pp. 125–128 (2011).
[34] H. Sak, A. W. Senior and F. Beaufays, Long short-term memory recurrent neural
network architectures for large scale acoustic modeling, Proc. of the Interspeech
Conf., pp. 338–342 (2014).
[35] J. Billa, M. Noamany, A. Srivastava, D. Liu, R. Stone, J. Xu, J. Makhoul and F.
Kubala, Audio Indexing of Arabic Broadcast News. Proc. of the International
Conference on Acoustics, Speech and Signal Processing, ICASSP, pp I-5–I-8
(2002).
[38] G. Saon, H. Soltau, et al., The IBM 2008 GALE Arabic speech transcription
system. Proc. of the International Conference on Acoustics, Speech and Signal
Processing, ICASSP, pp. 4378–4381 (2010).
[39] N. Habash, M. Diab and O. Rambow, Conventional Orthography for Dialectal
Arabic (CODA): Principles and Guidelines – Egyptian Arabic, Version 0.7,
Columbia University Academic Commons, http://dx.doi.org/10.7916/D83X8562
(2012).
[40] K. Kirchhoff and D. Vergyri, Cross-dialectal data sharing for acoustic modeling in
Arabic speech recognition. Speech Communication, 46(1), pp. 37–51 (2005).
[41] T. Schultz and A. Waibel, Language independent and language adaptive acoustic
modeling for speech recognition, Speech Communication, 35(1-2), pp. 31–51
(2001).
[42] P.-S. Huang. and M. Hasegawa-Johnson, Cross-dialectal data transferring for
Gaussian mixture model training in Arabic speech recognition. International
Conference on Arabic Language Processing, vol. 1, p. 1 (2012).
[43] M. Elmahdy, R. Gruhn and W. Minker, Novel Techniques for Dialectal Arabic
Speech Recognition, Springer (2012).
[44] K. Kirchhoff, et al. Novel Speech Recognition Models for Arabic. Johns Hopkins
University Summer Research Workshop Final Report (2002).
[45] A. Rozovskaya, R. Sproat and E. Benmamoun, Challenges in Processing Colloquial
Arabic: The challenge of Arabic for NLP/MT, International Conference of the
British Computer Society, pp. 4–14 (2006).
[46] L. Lamel, A. Messaoudi and J. Gauvain, Investigating morphological
decomposition for transcription of Arabic broadcast news and broadcast
conversation data, Proc. of the Interspeech Conf., vol. 1, pp. 1429–1432 (2008).
[47] M. Creutz, et al., Morph-based speech recognition and modeling of out-of-
vocabulary words across languages. ACM Transactions on Speech and Language
Processing, 5(1), pp. 1–29 (2007).
[48] J. Bilmes and K. Kirchhoff, Factored language models and generalized parallel
backoff, Proc. Human Language Technology Conf. of the North American Chapter
of the ACL, vol. 2, pp. 4–6 (2003).
[49] M. Mohri, M. Riley, D. Hindle, A. Ljolje and F. Pereira, Full Expansion of
Context-Dependent Networks in Large Vocabulary Speech Recognition,
International Conference of Acoustics, Speech and Signal Processing, pp. 665–668
(1998).
[50] M. Mohri and M. Riley, Network Optimizations for Large-Vocabulary Speech
Recognition, Speech Communication Journal, 28(1), pp. 1–12 (1999).