Arabic Speech Recognition Challenges and State of The Art
Chapter 1
Computational Linguistics, Speech and Image Processing for Arabic Language Downloaded from www.worldscientific.com
by 185.250.46.143 on 11/23/20. Re-use and distribution is strictly not permitted, except for Open Access articles.
The Arabic language has several features, such as its phonology and syntax, that make it an easy language for developing automatic speech recognition (ASR) systems. Many standard techniques for acoustic and language modeling, such as context-dependent acoustic models and n-gram language models, can be applied to Arabic directly. Some aspects of the language, such as the nearly one-to-one letter-to-phone correspondence, make the construction of the pronunciation lexicon even easier than in other languages. The most difficult challenges in developing speech recognition systems for Arabic are the dominance of non-diacritized text material, the many dialects, and the morphological complexity. In this chapter, we review the efforts that have been made to handle these challenges. They include methods for automatically generating the diacritics of Arabic text and for word pronunciation disambiguation. We also review the approaches used to handle the limited speech and text resources of the different Arabic dialects. Finally, we review the approaches used to deal with the high degree of affixation and derivation that contributes to the explosion of different word forms in Arabic.
2 S. M. Abdou and A. M. Moussa
1. Introduction
The goal of the ASR system is to find the most probable sequence of words \(W = (w_1, w_2, \dots)\) belonging to a fixed vocabulary given a set of acoustic observations \(X = (x_1, x_2, \dots, x_T)\). Following the Bayesian
approach applied to ASR, as shown in Ref. 4, the best estimate of the word sequence is given by:
\[
\hat{w} = \arg\max_W P(W|X) = \arg\max_W \frac{P(X|W)\,P(W)}{P(X)} = \arg\max_W P(X|W)\,P(W) \tag{1}
\]
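As a toy illustration of Eq. (1), a decoder scores each hypothesis \(W\) by \(\log P(X|W) + \log P(W)\) and picks the argmax; \(P(X)\) is the same for every hypothesis and can be dropped. The hypothesis list and all log-probabilities below are invented for illustration, not outputs of a real acoustic or language model:

```python
# Toy illustration of Eq. (1): pick the word sequence W maximizing
# P(X|W) * P(W).  All scores are made-up log-probabilities.
hypotheses = {
    "the cat sat": {"log_p_x_given_w": -12.0, "log_p_w": -4.0},
    "the cats at": {"log_p_x_given_w": -11.5, "log_p_w": -7.0},
    "a cat sat":   {"log_p_x_given_w": -13.0, "log_p_w": -4.5},
}

def decode(hyps):
    # argmax over W of log P(X|W) + log P(W); P(X) is constant and dropped
    return max(hyps, key=lambda w: hyps[w]["log_p_x_given_w"] + hyps[w]["log_p_w"])

best = decode(hypotheses)  # "the cat sat" (-16.0 beats -18.5 and -17.5)
```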
Fig. 1. Block diagram of an ASR system: the front-end extracts a feature vector \(X\) from the input speech, and the search module combines the acoustic model \(P(X|W)\) with the language model \(P(W)\) to produce the recognized text.
\(X\) is not the raw speech input but a set of features derived from the speech. The Mel Frequency Cepstrum Coefficients (MFCC) and Perceptual Linear Prediction (PLP) features are the most widely used. The acoustic and language models and the search operation are discussed below.
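A rough sketch of how an MFCC front-end derives \(X\) from the waveform (pre-emphasis, framing and windowing, power spectrum, mel filterbank, log compression, DCT); the parameter values below (frame length, hop, filter count) are common defaults, not ones prescribed by this chapter:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13,
         frame_len=400, hop=160):
    # Pre-emphasis boosts high frequencies before analysis.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice into overlapping frames and apply a Hamming window.
    n_frames = 1 + max(0, (len(sig) - frame_len) // hop)
    frames = np.stack([sig[i * hop : i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank, equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies -> cepstral coeffs.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_mels)
    return log_mel @ dct.T

feats = mfcc(np.random.RandomState(0).randn(16000))  # 1 s at 16 kHz
```

With a 25 ms frame and 10 ms hop, one second of audio yields 98 frames of 13 coefficients each.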
The most popular acoustic models are the so-called Hidden Markov Models (HMMs). Each phoneme (or, in general, each unit) is modeled using an HMM. An HMM, as shown in Ref. 4, consists of a set of states, transitions, and output distributions, as shown in Fig. 2.
\[
b_j(x) = \sum_{m} w_{jm}\, N(x; \mu_{jm}, \sigma_{jm}^2) \tag{2}
\]
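Eq. (2) can be evaluated directly; the following sketch computes the (log) output probability of a one-dimensional, two-component Gaussian mixture with made-up parameters:

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log of Eq. (2): b_j(x) = sum_m w_m * N(x; mu_m, sigma_m^2),
    for a single scalar feature x (one dimension, for illustration)."""
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        norm = 1.0 / math.sqrt(2.0 * math.pi * var)
        total += w * norm * math.exp(-((x - mu) ** 2) / (2.0 * var))
    return math.log(total)

# Two-component mixture with invented parameters.
ll = gmm_log_likelihood(0.5, weights=[0.6, 0.4], means=[0.0, 1.0],
                        variances=[1.0, 0.5])
```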
Fig. 3. Decision tree for classifying the second state of K-triphone HMMs, with questions such as “Is the right phone a back-R?” and “Is the left phone /s, z, sh, zh/?” leading to leaf senones.
hours of speech are used to train a model. The model, together with an appropriate confidence measure, can then be used to automatically transcribe thousands of hours of data. The new data can then be used to train a larger model. All the above techniques (and more) are implemented in the Hidden Markov Model Toolkit (HTK).
\[
P(w) = P(w_1)\,P(w_2|w_1)\,P(w_3|w_1, w_2) \cdots P(w_n|w_1, \dots, w_{n-1}) = \prod_{i=1}^{n} P(w_i|w_1, \dots, w_{i-1}) \tag{3}
\]
\[
P(w_i|w_1, w_2, \dots, w_{i-1}) \approx P(w_i|w_{i-N+1}, \dots, w_{i-1}) \tag{4}
\]
many word histories do not occur with enough counts to yield reliable estimates of their probabilities. Many techniques have been proposed to approximate these probabilities.10 For example, in the case of a bigram grammar, the model typically lists only the most frequently occurring bigrams and uses a backoff mechanism to fall back on the unigram probability when the desired bigram is not found. In other words, if P(wj|wi) is sought and not found, one falls back on P(wj), but a backoff weight is applied to account for the fact that wj is known not to be one of the bigram successors of wi. Higher-order backoff N-gram grammars can be defined similarly. Ideally, a good LM eases the retrieval of the word sequence present in the speech signal by better focusing the decoding procedure, which represents another relevant step of the search. One of the most effective tools for training language models is the SRILM toolkit, which includes most state-of-the-art alternatives.11
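The backoff scheme described above can be sketched as follows; the probabilities and backoff weights here are invented for illustration (a real toolkit such as SRILM estimates the backoff weights so that each conditional distribution sums to one):

```python
import math

# Toy bigram backoff: if P(wj|wi) was not estimated, fall back to
# alpha(wi) * P(wj).  All counts and weights are invented.
bigram_p = {("i", "am"): 0.5, ("am", "here"): 0.3}
unigram_p = {"i": 0.2, "am": 0.1, "here": 0.05, "there": 0.05}
backoff_w = {"i": 0.4, "am": 0.7}  # alpha(wi), one weight per history

def p_bigram(wi, wj):
    if (wi, wj) in bigram_p:
        return bigram_p[(wi, wj)]          # stored bigram
    return backoff_w.get(wi, 1.0) * unigram_p[wj]  # backed-off estimate

p_seen = p_bigram("i", "am")        # stored bigram: 0.5
p_backed = p_bigram("am", "there")  # backoff: 0.7 * 0.05 = 0.035
```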
2.4. Decoding
Finding the best word (or generally unit) sequence given the speech input
is referred to as the decoding or search problem. Formally, the problem reduces to finding the best state sequence in a large state space obtained by composing the pronunciation lexicon, the acoustic model, and the language model. The solution can be found using the well-known
Viterbi algorithm. Viterbi search is essentially a dynamic programming
algorithm, consisting of traversing a network of HMM states and
maintaining the best possible path score at each state in each frame. It is
a time synchronous search algorithm in that it processes all states
completely at time t before moving on to time t + 1. The abstract
algorithm can be understood with the help of Fig. 5. One dimension
represents the states in the network, and the other dimension represents
the time axis.
Fig. 5. Viterbi search trellis: one dimension represents the states in the network (from start state to final state), and the other represents the time axis.
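The time-synchronous Viterbi traversal of the trellis can be sketched as follows, on a toy 2-state HMM with made-up probabilities; all states are processed at frame t before moving to frame t + 1, and the best path is recovered by backtracking:

```python
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """Time-synchronous Viterbi over a toy HMM.
    log_trans[i, j]: log P(state j | state i)
    log_emit[t, j]:  log P(observation at frame t | state j)
    log_init[j]:     log P(start in state j)
    Returns the best state sequence."""
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)   # best path score per state, frame
    back = np.zeros((T, S), dtype=int)  # best predecessor per state, frame
    score[0] = log_init + log_emit[0]
    for t in range(1, T):
        # score all states at frame t before moving to frame t + 1
        cand = score[t - 1][:, None] + log_trans       # shape (S, S)
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(S)] + log_emit[t]
    # trace back from the best final state
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Tiny 2-state example with invented probabilities (3 frames).
lt = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))
le = np.log(np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]))
li = np.log(np.array([0.5, 0.5]))
best_path = viterbi(lt, le, li)  # [0, 1, 1]
```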
The early efforts to develop Arabic ASR systems started with simple tasks such as digit recognition and small-vocabulary isolated-word recognition.
grapheme based and phoneme based, were developed, and their evaluation results show that the phonetic system gives about a 13% reduction in WER. The system was trained using 1024 hrs of speech data, consisting of 764 hrs of supervised data and 260 hrs of lightly supervised data. The gain from the unsupervised data part has been shown to be marginal and may even result in
Gaussian components per state and used an n-gram language model trained on 1 billion words of text data. It used three decoding stages. The first
stage is a fast decoding run with Gender Independent (GI) models. The
second stage uses Gender Dependent (GD) models adapted using LSLR.
It also uses variance scaling using the first stage supervision. The second
stage generates trigram lattices which are expanded using a 4-gram
language model and then rescored in the third stage using GD models
adapted using lattice-MLLR, as discussed in Ref. 21. In that system, it was shown that graphemic models perform at least as well as phonetic models for conversational data, with only a very minor degradation on the news data.
The IBM ViaVoice was one of the first commercial Arabic large-vocabulary systems developed for dictation applications.24 A more advanced system was developed by the speech recognition research group at IBM for an Arabic broadcast transcription system fielded for the GALE project. Key advances include improved discriminative training,
the use of subspace Gaussian mixture models (SGMM) as shown in Ref.
25, neural network acoustic features as shown in Ref. 26, variable frame
rate decoding as shown in Ref. 27, training data partitioning experiments,
class-based exponential LM model and NNLMs with Syntactic
features.28 This system was trained on 1800 hrs of transcribed Arabic
broadcasts and text data of size 1.6 billion words provided by the
Linguistic Data Consortium (LDC).29 A pruned language model of size 7
million n-grams using Entropy pruning as shown in Ref. 30 is used for
the construction of static, finite-state decoding graphs. Another unpruned
version of the LM that contains 883 million n-grams is used for lattice
rescoring. This system used a vocabulary of 795K words with more than
2 million pronunciations. This system used 6 decoding passes. The first
pass used a speaker independent grapheme based acoustic model. The
following 5 passes used speaker adapted phoneme based models. All
models have penta-phone cross-word acoustic context. Another 3
The Arabic language poses three major challenges for developing ASR systems. The first is the constraint of having to use mostly non-diacritized texts as recognizer training material, which causes problems for both acoustic and language modeling. Training accurate acoustic
models for the Arabic vowels without knowing their location in the
signal is difficult. Also, a non-diacritized Arabic word can have several
senses with the intended word sense to be derived from the word context.
Language models trained on this non-diacritized material may therefore
be less predictive than those trained on diacritized texts.
from various media sources, there are only very few speech corpora of
dialectal Arabic available.
The third challenge of Arabic is its morphological complexity which
is known to present serious problems for speech recognition, in particular
for language modeling. A high degree of affixation, derivation, etc.,
contributes to the explosion of different word forms, making it difficult if
not impossible to robustly estimate language model probabilities. Rich
morphology also leads to high out-of-vocabulary rates and larger search
spaces during decoding, thus slowing down the recognition process. In
the following sections, we review most of the proposed approaches to
overcome these challenges.
marks and 10% word error rate for case ending marks. This means more than 10% of the data will be restored with wrong diacritics, which would reduce the efficiency of the trained acoustic models. To reduce the number of errors in the restored diacritics, it was proposed to use the audio recordings of the text data to help select the correct word diacritics.
Whereas MSA data can readily be acquired from various media sources, there are only very limited speech corpora of dialectal Arabic available. The construction of such a corpus is even more challenging than the MSA one. First, the manual annotation has no standard reference; the same word can be transcribed in several ways, such as “ﺑﺄﺷﻜﺮﻙ، ﺑﺎﺷﻜﺮﻙ، ﺑﺸﻜﺮﻙ”. Some transcription guidelines for Egyptian and Levantine Dialectal Arabic were proposed to reduce such differences.38 The diacritization of dialectal Arabic is more challenging than that of MSA, since it requires a dialectal Arabic morphological analyzer to generate the different diacritization forms. Using context-based diacritization would also require a robust language model for dialectal Arabic.
Fig. 6. Left: An example of Arabic word factorization. Right: Vocabulary growth for the
Arabic language.
The main drawback of that approach is the short duration of the affixation units, which can be only two phones long, making them highly susceptible to insertion errors. To avoid these effects, some enhancements to the approach were proposed: keeping the most frequent words in full form without decomposition, and not decomposing the prefix “Al” for words starting with a solar consonant, since, due to assimilation with the following consonant, deletion of that prefix was one of the most frequent errors. This enhanced morphologically based LM provided some reduction in WER compared with a word-based LM.45 Rather than using linguistic knowledge to derive the morphological decomposition, an unsupervised technique based on the Minimum Description Length (MDL) principle was also proposed to provide better coverage of Out-Of-Vocabulary (OOV) words.46
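The decomposition idea, including the two enhancements above (a frequent-word whitelist and keeping “Al” attached before solar consonants), can be sketched as follows on Buckwalter-style transliterations; the affix lists, the whitelist, and the solar-consonant set below are illustrative stand-ins, not the actual lists used in the cited systems:

```python
# Sketch of greedy affix splitting for morphological LM units.
# All lists here are invented examples, not the cited systems' lists.
PREFIXES = ["wAl", "Al", "w", "b", "l"]       # longest first
SUFFIXES = ["hm", "hA", "h", "k", "At"]
SOLAR = set("tvdr0zsS$DTZln")                 # consonants assimilating "Al"
FREQUENT = {"Allh", "AlywM"}                  # kept whole, no decomposition

def decompose(word):
    if word in FREQUENT:
        return [word]
    parts = []
    for p in PREFIXES:
        # keep "Al" attached before a solar consonant (assimilation),
        # since the short detached prefix is easily deleted in decoding
        if p.endswith("Al") and word[len(p):len(p) + 1] in SOLAR:
            continue
        if word.startswith(p) and len(word) - len(p) >= 3:
            parts.append(p + "+")
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            suffix = "+" + s
            word = word[:-len(s)]
            break
    parts.append(word)
    if suffix:
        parts.append(suffix)
    return parts

tokens = decompose("wAlktAbhm")   # -> ["wAl+", "ktAb", "+hm"]
solar = decompose("Al$ms")        # "Al" kept attached -> ["Al$ms"]
```

Splitting stems from attached conjunctions, articles, and pronoun suffixes in this way shrinks the LM vocabulary at the cost of the short-unit insertion errors discussed above.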
and can be e.g. stems, POS tags, etc. in addition to the words themselves.
Probabilistic LMs are then constructed over (sub) sets of factors. Using a
Fig. 7. (a) Standard backoff path for a 4-gram language model over words; (b) backoff graph for a 4-gram over factors.
larger than one million words with processing time close to real-time performance, as shown in Refs. 28, 50, but this came at the price of large model sizes of several gigabytes.
for MSA, which is around 15% WER, is very comparable with the 10% WER achieved for the similar task of Broadcast News ASR for English. But we should keep in mind that the complexity of Arabic MSA ASR is much higher, with a vocabulary size of 560k words compared with the 210k words of the English vocabulary for Broadcast News ASR. The performance for dialectal Arabic, as shown in the Iraqi, Egyptian, and Levantine conversational ASR systems, is comparable with that of equivalent conversational English ASR, with average WER in the range 30%–40%. But we should keep in mind that dialectal Arabic is much more challenging than conversational English: the LM training data is very limited, and many required Natural Language Processing (NLP) tools, such as morphological analyzers, diacritizers, and text normalizers, still need to be developed.
6. Conclusions
References
[1] J. Billa, et al. Audio indexing of broadcast news. Proc. of the International
Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. I-3-I-5
(2002).
[2] S. Khurana and A. Ali, QCRI advanced transcription system (QATS) for the Arabic
multi-dialect broadcast media recognition: MGB-2 challenge, IEEE Spoken
Language Technology Workshop, (SLT), pp. 292–298 (2016).
[3] V. Peddinti, D. Povey, and S. Khudanpur, A time delay neural network architecture
for efficient modeling of long temporal contexts, Proc. of the Interspeech Conf., pp.
3214–3218 (2015).
[20] H. Bourouba, R. Djemili, M. Bedda and C. Snani, New Hybrid System (Supervised
Classifier/HMM) for Isolated Arabic Speech Recognition, Proc. of the
International Conference on Information & Communication Technologies, pp.
1264–1269 (2006).
[21] J. Billa, et al., Arabic speech and text in TIDES OnTAP, Proc. of the International Conference on Human Language Technology Research, HLT, pp. 1024–1029 (2002).
[22] M. Afify, L. Nguyen, B. Xiang, S. Abdou and J. Makhoul, Recent progress in Arabic broadcast news transcription at BBN, Proc. of the Interspeech Conf., pp. 1637–1640 (2005).
[23] http://projects.ldc.upenn.edu/gale/index.html, page referenced at April 2017.
[24] https://www-01.ibm.com/software/pervasive/viavoice.html
[25] D. Povey, et al., Subspace Gaussian mixture models for speech recognition, Proc.
of the International Conference on Acoustics, Speech and Signal Processing,
ICASSP, pp. 4330–4333 (2010).
[26] H. Hermansky, D. P. W. Ellis and S. Sharma, Tandem connectionist feature
extraction for conventional HMM systems, Proc. of the International Conference
on Acoustics, Speech and Signal Processing, ICASSP, pp. 1635–1638 (2000).
[27] S. M. Chu and D. Povey, Speaking rate adaptation using continuous frame rate
normalization, Proc. of the International Conference on Acoustics, Speech and
Signal Processing, ICASSP, pp. 4306–4309 (2010).
[28] H.-K. J. Kuo, L. Mangu, A. Emami, A., I. Zitouni and Y.-S. Lee, Syntactic features
for Arabic speech recognition, IEEE Workshop on Automatic Speech Recognition
& Understanding, ASRU, pp. 327–332 (2009).
[29] https://catalog.ldc.upenn.edu/search
[30] A. Stolcke, Entropy-based pruning of backoff language models, Proc. of DARPA
Broadcast News Transcription and Understanding Workshop, pp. 270–274 (1998).
[31] http://www.mgb-challenge.org/arabic.html, page referenced at April 2017.
[32] D. Povey, et al., Purely sequence-trained neural networks for ASR based on lattice-
free MMI, Proc. of the Interspeech Conf., pp. 2751–2755 (2016).
[33] T. Mikolov, et al., RNNLM – Recurrent Neural Network Language Modeling
Toolkit, IEEE Workshop on Automatic Speech Recognition & Understanding,
ASRU, pp. 125–128 (2011).
[34] H. Sak, A. W. Senior and F. Beaufays, Long short-term memory recurrent neural
network architectures for large scale acoustic modeling, Proc. of the Interspeech
Conf., pp. 338–342 (2014).
[35] J. Billa, M. Noamany, A. Srivastava, D. Liu, R. Stone, J. Xu, J. Makhoul and F.
Kubala, Audio Indexing of Arabic Broadcast News. Proc. of the International
Conference on Acoustics, Speech and Signal Processing, ICASSP, pp I-5–I-8
(2002).
[38] G. Saon, H. Soltau, et al., The IBM 2008 GALE Arabic speech transcription
system. Proc. of the International Conference on Acoustics, Speech and Signal
Processing, ICASSP, pp. 4378–4381 (2010).
[39] N. Habash, M. Diab and O. Rambow, Conventional Orthography for Dialectal
Arabic (CODA): Principles and Guidelines – Egyptian Arabic, Version 0.7,
Columbia University Academic Commons, http://dx.doi.org/10.7916/D83X8562
(2012).
[40] K. Kirchhoff and D. Vergyri, Cross-dialectal data sharing for acoustic modeling in
Arabic speech recognition. Speech Communication, 46(1), pp. 37–51 (2005).
[41] T. Schultz and A. Waibel, Language independent and language adaptive acoustic
modeling for speech recognition, Speech Communication, 35(1-2), pp. 31–51
(2001).
[42] P.-S. Huang. and M. Hasegawa-Johnson, Cross-dialectal data transferring for
Gaussian mixture model training in Arabic speech recognition. International
Conference on Arabic Language Processing, vol. 1, p. 1 (2012).
[43] M. Elmahdy, R. Gruhn and W. Minker, Novel Techniques for Dialectal Arabic
Speech Recognition, Springer (2012).
[44] K. Kirchhoff, et al. Novel Speech Recognition Models for Arabic. Johns Hopkins
University Summer Research Workshop Final Report (2002).
[45] A. Rozovskaya, R. Sproat and E. Benmamoun, Challenges in Processing Colloquial
Arabic: The challenge of Arabic for NLP/MT, International Conference of the
British Computer Society, pp. 4–14 (2006).
[46] L. Lamel, A. Messaoudi and J. Gauvain, Investigating morphological
decomposition for transcription of Arabic broadcast news and broadcast
conversation data, Proc. of the Interspeech Conf., vol. 1, pp. 1429–1432 (2008).
[47] M. Creutz, et al., Morph-based speech recognition and modeling of out-of-
vocabulary words across languages. ACM Transactions on Speech and Language
Processing, 5(1), pp. 1–29 (2007).
[48] J. Bilmes and K. Kirchhoff, Factored language models and generalized parallel
backoff, Proc. Human Language Technology Conf. of the North American Chapter
of the ACL, vol. 2, pp. 4–6 (2003).
[49] M. Mohri, M. Riley, D. Hindle, A. Ljolje and F. Pereira, Full Expansion of
Context-Dependent Networks in Large Vocabulary Speech Recognition,
International Conference of Acoustics, Speech and Signal Processing, pp. 665–668
(1998).
[50] M. Mohri and M. Riley, Network Optimizations for Large-Vocabulary Speech
Recognition, Speech Communication Journal, 28(1), pp. 1–12 (1999).