
Moroccan Dialect Speech Recognition System Based on CMU Sphinx Tools
Abderrahim Ezzine, Hassan Satori, Mohamed Hamidi, Khalid Satori
LISAC Laboratory, Faculty of Sciences Dhar Mahraz, Sidi Mohammed Ben Abdellah University, Fès, Morocco
ezzine2abderrahim@gmail.com, hassan.satori@usmba.ac.ma, mohamed.hamidi.5@gmail.com, khalidsatori@gmail.com

Abstract: The main aim of an Automatic Speech Recognition (ASR) system is to simulate the human listener, based on a learning approach and speech data of a studied language. In this paper, we describe a speech recognition system for the Moroccan dialect (Darija), implemented to recognize the first ten Arabic digits spoken in Darija and collected from 20 speakers, both male and female. The system is designed with the CMU Sphinx tools using the Hidden Markov Model (HMM) method with a small dataset, and Mel-frequency cepstral coefficients (MFCCs) are used in the feature extraction phase. Our best accuracy is 96.27 %, obtained with 8 GMMs.

Keywords: Speech recognition; Moroccan dialect; HMMs; MFCC; CMU Sphinx; Acoustic model; Artificial intelligence.

I. INTRODUCTION

Automatic Speech Recognition (ASR) is defined as a technology that allows a computing device to convert spoken words, captured by a microphone or telephone, into readable text. ASR has a wide range of applications, such as command recognition, interactive voice response and dictation, and it can help people with disabilities interact with the community. It is a technology that makes life easier [1]. The principal aim of ASR research is to allow a computer to identify, in real time and with 100% precision, all words spoken by anybody, independent of vocabulary size, noise and speaker characteristics [2].

To build an ASR system, we need to create the Language Model (LM), the Acoustic Model (AM) and the dictionary for the target language. Unfortunately, designing an acoustic model for a specific language is expensive, unlike the LM and the dictionary, because speech data must be recorded from speakers to make the ASR system speaker-independent [3]. Given the importance of ASR technology, several systems have been implemented for different languages based on the HTK [4], CMU Sphinx [5], Dragon [6] and KALDI [7] toolkits.

H. Satori and F. ElHaoussi [8] implemented an Amazigh speech recognition system using CMU Sphinx tools based on the Hidden Markov Model. The corpus includes 60 Moroccan Amazigh speakers, all native Tarifit speakers, equally divided between male and female. Their system recognizes the digits and alphabet of the Amazigh language, and the best performance achieved is 92.89%, obtained with 16 Gaussian mixture models.

Zealouk et al. [9] presented an automatic speech recognition system based on Sphinx4 that detects people with voice disorders. Their project uses the Amazigh language to differentiate normal from pathological voices. Their findings were measured using combinations of 5-state HMMs with 8 Gaussian mixture distributions.

Hamidi et al. [10] presented an interactive security system based on speech recognition. In their work, ASR-HMM and IVR technologies were combined to allow remote administration tasks to be managed through speech commands, with security identification by biological voiceprint. The findings show that the access rate is more than 80 %, whereas the non-admin recognition rate is less than 6%.

The aim of this paper is to create a Darija dialect ASR system and to explore the changes that must be made to the model to adapt it to Moroccan dialect speech recognition. Our work is based on the hidden Markov model - Gaussian mixture model combination. The proposed system is designed using Carnegie Mellon University (CMU) Sphinx, a statistical, speaker-independent set of tools based on Hidden Markov Models (HMMs).

The paper is organized as follows: Section 2 presents the Moroccan dialect. Section 3 presents the Moroccan dialect speech recognition system. Section 4 shows the experimental results and Section 5 concludes the paper.

978-1-7281-8041-0/20/$31.00 ©2020 IEEE

Authorized licensed use limited to: Auckland University of Technology. Downloaded on October 26,2020 at 12:44:29 UTC from IEEE Xplore. Restrictions apply.
II. MOROCCAN DIALECT

In this paper, we study the Moroccan dialect called Darija, a spoken variety of Arabic. It is considered the informal language of Morocco and is used in daily communication and in media such as TV and radio. Given its direct contact with the Amazigh language, the Moroccan dialect is strongly influenced by Amazigh sound patterns, morphology and lexicon. The Moroccan dialect has 32 basic phonemes, of which four are vowels and 28 are consonants. It has fewer vowels than classical Arabic [11].

III. MOROCCAN DIALECT SPEECH RECOGNITION SYSTEM

This section presents our experience in building and developing a Moroccan dialect speech recognition system using the CMU Sphinx tools. Fig. 1 shows the main elements usually found in a typical ASR system.
Recently, applications of automatic speech recognition for the Moroccan Amazigh language based on CMU Sphinx tools have been targeted by researchers in our lab [12-18].

TABLE I. TEN FIRST DIGITS WITH THEIR SYLLABLES AND THEIR TRANSCRIPTION IN ENGLISH, ARABIC

Digit   English transcription   Moroccan dialect transcription
0       SIFR
1       WAHD
2       JOJ
3       THLATA
4       RABAA
5       KHAMSA
6       STTA
7       SBAA
8       THMANYA
9       TSAAOD
[Figure: block diagram. Blocks: Speech Corpus, Feature Extraction, Acoustic Model, Text Corpus, Language Model, Decoder, Text]
Fig. 1. Block diagram of ASR System

A. System overview

The main aim of our study is to design a Darija automatic speech recognition system based on Mel-frequency cepstral coefficients in the feature extraction phase and the GMM-HMM combination in the training phase.
To build our Darija acoustic model, we used the SphinxTrain tools with the dictionary, the language model and the collected speech data, while the recognition phase relies on the Pocketsphinx decoder. Our dictionary file was created from the first ten Darija digits followed by their transcriptions. Table I lists the first ten Darija digits with their pronunciation.

B. Corpus preparation

The first step of ASR system construction is preparing the corpus, a collection of sound units defined by the words in the vocabulary. The corpus was created from the ten Darija digits. Twenty Moroccan speakers (10 males and 10 females) aged between 14 and 50 years were asked to utter all digits 10 times. Hence, the Moroccan digit database consists of 10 repetitions of every digit produced by each speaker, for a total of 2000 tokens (20 speakers × 10 digits × 10 repetitions). During the recording session, each utterance was played back to ensure that the entire digit was included in the recorded signal. Table II shows the parameters used for corpus preparation.

TABLE II. CORPUS PARAMETERS

Parameter            Value
Sampling rate        16 kHz
Number of bits       16 bits
Wave format          Mono, wav
Corpus               10 Moroccan Arabic digits
Repetitions          10 times
Pronunciation        Darija Moroccan dialect
Condition of noise   Normal life
Speakers             10 male, 10 female

C. Training

The acoustic model is trained using the CMU Sphinx tools, which use an embedded training method based on the Baum-Welch algorithm. Training is the procedure of building the knowledge base by learning the acoustic model and language model used by the system [8]. Fig. 2 presents the training process.
The acoustic model is generated by gathering a set of input data and processing it with the SphinxTrain tool (see Fig. 2). The input data is as follows:
• Speech data.
• Transcript files that contain the words pronounced in each wave file.
• Fileids files that contain the path of each wave file.
• Dictionary and language model files.
• Phone file that lists the set of phonemes used.
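As an illustration of the transcript and fileids inputs listed above, the following Python sketch generates both for a 70/30 split. It assumes, hypothetically, speaker IDs spk01 to spk20, one WAV file per token named <speaker>_<digit>_<repetition>, and a split made by speaker (14 for training, 6 for testing); none of these details are specified in the paper. The line formats follow the usual SphinxTrain conventions: one path without extension per line in the fileids file, and "<s> WORD </s> (utterance_id)" in the transcript file.

```python
from typing import List, Tuple

# The ten digit words from the pronunciation dictionary.
DIGITS = ["SIFR", "WAHD", "JOJ", "THLATA", "RABAA",
          "KHAMSA", "STTA", "SBAA", "THMANYA", "TSAAOD"]


def build_entries(speakers: List[str], repetitions: int = 10) -> List[Tuple[str, str]]:
    """Return one (file_id, word) pair per recorded token."""
    entries = []
    for spk in speakers:
        for word in DIGITS:
            for rep in range(1, repetitions + 1):
                # Hypothetical layout: <speaker>/<speaker>_<digit>_<rep>, no ".wav".
                entries.append((f"{spk}/{spk}_{word.lower()}_{rep:02d}", word))
    return entries


def fileids_lines(entries: List[Tuple[str, str]]) -> List[str]:
    """One relative path per line, without the file extension."""
    return [file_id for file_id, _ in entries]


def transcription_lines(entries: List[Tuple[str, str]]) -> List[str]:
    """One '<s> WORD </s> (utterance_id)' line per token."""
    return [f"<s> {word} </s> ({file_id.split('/')[-1]})" for file_id, word in entries]


# Assumed speaker-wise 70/30 split: 14 training speakers, 6 test speakers.
speakers = [f"spk{i:02d}" for i in range(1, 21)]
train = build_entries(speakers[:14])
test = build_entries(speakers[14:])
```

Writing each list to a text file, one entry per line, would give the training and testing fileids and transcription files. The 1400 training and 600 testing entries together match the 2000-token corpus.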
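[Figure: Pre-emphasis → Framing → Windowing → FFT → Mel frequency filtering → Log 10 → DCT → Acoustic vectors]
Fig. 3. MFCC Architecture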
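The stages named in Fig. 3 can be sketched in a few lines of NumPy. This is only an illustrative outline, not the SphinxTrain front end: the frame length (25 ms), frame shift (10 ms), FFT size (512), number of Mel filters (26), number of cepstral coefficients (13) and pre-emphasis factor (0.97) are common defaults assumed here, not values reported in the paper, and the function names are hypothetical. The input signal is assumed to be at least one frame long.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising slope
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling slope
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank


def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13, preemph=0.97):
    """Return an array of shape (n_frames, n_ceps)."""
    signal = np.asarray(signal, dtype=float)
    # 1) Pre-emphasis: boost high frequencies.
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # 2) Framing: overlapping frames of frame_len samples.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx]
    # 3) Windowing with a Hamming window.
    frames = frames * np.hamming(frame_len)
    # 4) FFT, then power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 5) Mel frequency filtering and 6) log10 compression.
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log10(np.maximum(energies, 1e-10))
    # 7) DCT-II to decorrelate the log filterbank energies.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_energies @ basis.T
```

For a one-second recording at 16 kHz, this produces one 13-dimensional acoustic vector per 10 ms frame shift.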

[Figure: flow diagram. Blocks: Training transcription and fileids files; Testing transcription and fileids files; Dictionary and lm.DMP files; Phone and filler files; Configuration file; Wave files (training files, testing files); MDdigits]
Fig. 2. Process of training phase

1) Feature extraction

Feature extraction transforms the speech waveform into a sequence of feature vectors that contain only the information required to identify a given utterance, and it plays an important role in the performance of a speech recognition system. The most widely used feature extraction method is Mel-Frequency Cepstral Coefficients (MFCC), which is designed to simulate the human ear. Fig. 3 summarizes the extraction of the Mel coefficients (MFCC) [19].

The extraction of MFCCs is a frame-based analysis in which the input speech signal is segmented into a sequence of frames. Each frame undergoes a sinusoidal transform (the Fast Fourier Transform) to obtain spectral parameters, which are then warped onto the perceptual Mel scale and decorrelated. The output is a sequence of feature vectors describing logarithmically compressed amplitude and simplified frequency information.

2) Acoustic model

One of the greatest challenges of automatic speech recognition is accuracy, and the acoustic model plays an important role in improving it. The main goal of the acoustic model is to compute the likelihood of the observed feature vectors given linguistic units (phones, words, subparts of phones) using a statistical method known as the Hidden Markov Model (HMM) with Gaussian mixture densities. For example, a Gaussian Mixture Model (GMM) is used to calculate the likelihood P(O | Q) of a given feature vector O for each HMM state Q, corresponding to a phone or subphone. For recognizing a small number of words, such as the 10 digits, using HMM states to represent a phone is sufficient. The popular configuration is to represent each phone with three emitting HMM states instead of one, plus two non-emitting states at the two ends. This 5-state phone HMM is known as a word model or phone model, as shown in Fig. 4.

3) Language model

The language model (LM) is used by the speech recognition system to guide the search for correct word sequences. There are many types of models for describing a language in the recognition phase, such as grammars, phonetic statistical language models and statistical language models. In this work, we generate the language model of our system with the CMU-Cambridge statistical language modelling toolkit [20].

TABLE III. RECOGNITION RATES FOR THE TEN DARIJA DIGITS WITH THREE HMM STATES AND DIFFERENT GMM VALUES

Moroccan dialect digit   4 GMMs   8 GMMs   16 GMMs   32 GMMs
SIFR                     91.66%   95.00%   91.66%    90.00%
WAHD                     96.66%   96.66%   95.00%    95.00%
JOJ                      96.66%   93.33%   95.00%    95.00%
THLATA                   95.00%   95.00%   93.33%    91.66%
RABAA                    93.33%   90.00%   88.00%    90.00%
KHAMSA                   86.66%   91.66%   91.66%    86.00%
STTA                     76.66%   81.66%   85.00%    75.00%
SBAA                     95.00%   93.33%   93.33%    90.00%
THMANYA                  96.66%   95.00%   93.33%    93.33%
TSAAOD                   96.66%   93.33%   91.66%    93.33%
Average                  95.08%   96.27%   96.10%    94.58%

[Figure: (a) the phoneme "S" as a chain of HMM states 1, 2, 3; (b) the digit "STTA" as the concatenated phone models S, TT and A, each with states 1, 2, 3]
Fig. 4. (a) Representation of the "S" phoneme with 5 HMM states. (b) Representation of the digit "STTA" with HMM states.
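As a rough illustration of the GMM-based state likelihood P(O | Q) described in the acoustic model subsection, the Python sketch below evaluates the log-likelihood of one feature vector under the mixture attached to a single HMM state. It assumes diagonal covariance matrices and uses hypothetical function names; it is not the SphinxTrain or Pocketsphinx implementation, and the mixture parameters would normally come from Baum-Welch training.

```python
import math
from typing import Sequence


def log_gaussian_diag(x: Sequence[float], mean: Sequence[float],
                      var: Sequence[float]) -> float:
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x: Sequence[float],
                       weights: Sequence[float],
                       means: Sequence[Sequence[float]],
                       variances: Sequence[Sequence[float]]) -> float:
    """log P(O | Q) for one HMM state Q modelled by a Gaussian mixture."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp keeps the sum stable when the densities are very small.
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))
```

With 8 GMMs per state, the weights, means and variances arguments would each hold eight entries, and the decoder would evaluate this quantity for every active state at every frame.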

4) Pronunciation dictionary

The dictionary file serves as an intermediary between the AM and the LM. Our pronunciation dictionary file contains the first ten Darija digit words followed by their pronunciations. Fig. 5 shows the phonetic dictionary list used in training [21].

SIFR        SS I F R
WAHD        W A HH D
JOJ         J OU J
THLATA      TH L A TH A
RABAA       R A B AA A
KHAMSA      KH A M S A
STTA        S TT A
STTA(2)     S E TT A
SBAA        S A B AA A
THMANYA     TH M A N Y A
TSAAOD      T S AA OU D
TSAAOD(2)   T E S AA D

Fig. 5. The used phonetic dictionary list.

IV. EXPERIMENTAL RESULTS

In all proposed schemes, the data were partitioned into 70 % for training and 30 % for testing to ensure the speaker-independent aspect. Our system was trained with various numbers of GMMs, ranging from 4 to 32, combined with 3 and 5 states per HMM.

TABLE IV. RECOGNITION RATES FOR THE TEN DARIJA DIGITS WITH FIVE HMM STATES AND DIFFERENT GMM VALUES

Moroccan dialect digit   4 GMMs   8 GMMs   16 GMMs   32 GMMs
SIFR                     98.33%   95.00%   93.33%    81.66%
WAHD                     98.33%   96.66%   93.33%    95.00%
JOJ                      98.33%   98.33%   96.66%    96.66%
THLATA                   98.33%   96.66%   96.66%    95.00%
RABAA                    91.66%   96.66%   80.00%    80.00%
KHAMSA                   88.33%   93.33%   90.00%    91.66%
STTA                     75.00%   81.66%   83.33%    85.00%
SBAA                     95.00%   93.33%   96.66%    98.33%
THMANYA                  98.33%   96.66%   95.00%    96.66%
TSAAOD                   96.66%   96.66%   93.33%    95.00%
Average                  95.25%   95.25%   94.92%    92.71%

In the case of three HMM states, Table III presents the system's recognition rate for each digit as well as the average recognition rate over all digits. The system recognizes 2000 tokens for all ten digits. The system accuracies are 95.08 %, 96.27 %, 96.10 % and 94.58 % with 4, 8, 16 and 32 GMMs, respectively. We observed that 8 GMMs give the best rate of 96.27 %. According to the analysis of the per-digit recognition rates, the least well recognized Darija digits are STTA and KHAMSA.
With five HMM states, the system again tries to recognize 2000 samples of all 10 Moroccan dialect digits. Table IV shows the accuracy of the system. The accuracies are 95.25 %, 95.25 %, 94.92 % and 92.71 % with 4, 8, 16 and 32 GMMs, respectively. At the digit level, the best results were also found with 4 and 8 GMMs. The most frequently misrecognized Moroccan dialect digits are STTA and KHAMSA.

V. CONCLUSION

In this paper, an automatic speech recognition system for the Darija Moroccan dialect was developed. The system is implemented using CMU Sphinx tools based on HMMs with Gaussian mixtures. The corpus used in this work is not large, and the best result obtained is an accuracy of about 96.27 %, which is very encouraging.
In future work, the proposed system can be improved by using a larger vocabulary of the Darija Moroccan dialect, and we will test the performance of the system in a noisy environment.

REFERENCES
[1] Haton, J. P., Cerisara, C., Fohr, D., Laprie, Y., & Smaïli, K. (2006). Reconnaissance automatique de la parole: Du Signal à son Interprétation. Dunod.
[2] Anand, A. V., Devi, P. S., Stephen, J., & Bhadran, V. K. (2012, December). Malayalam Speech Recognition system and its application for visually impaired people. In 2012 Annual IEEE India Conference (INDICON) (pp. 619-624). IEEE.
[3] Jackson, M. (2005). Automatic speech recognition: Human computer interface for kinyarwanda language. A Project Report Submitted in Partial Fulfillment of the Requirements for the Award of the Degree Master of Science in Computer Science of Makerere University August.
[4] Young, S. J., & Young, S. (1993). The HTK hidden Markov model toolkit: Design and philosophy. Cambridge, England: University of Cambridge, Department of Engineering.
[5] Lee, K. F., Hon, H. W., & Reddy, R. (1990). An overview of the SPHINX speech recognition system. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(1), 35-45.
[6] Baker, J. (1975). The DRAGON system--An overview. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1), 24-29.
[7] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., ... & Silovsky, J. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding (No. CONF). IEEE Signal Processing Society.
[8] Satori, H., Hiyassat, H., Harti, M., & Chenfour, N. (2009). Investigation Arabic Speech Recognition using CMU Sphinx System. The International Arab Journal of Information Technology, 6(2), 186–190.
[9] Zealouk, O., Satori, H., Hamidi, M., & Satori, K. (2020). Pathological Detection Using HMM Speech Recognition-Based Amazigh Digits. In Embedded Systems and Artificial Intelligence (pp. 281-289). Springer, Singapore.
[10] Hamidi, M., Satori, H., Zealouk, O., Satori, K., & Laaidi, N. (2018, October). Interactive voice response server voice network administration using hidden markov model speech recognition system. In 2018 Second World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4) (pp. 16-21). IEEE.
[11] Ennaji, M., Makhoukh, A., Es-saidy, H., Moubtassime, M., & Slaoui, S. (2004). A Grammar of Moroccan Arabic. Publications of the Faculty of Letters Dhar El Mehraz, Fès.
[12] Zealouk, O., Satori, H., Hamidi, M., Laaidi, N., & Satori, K. (2018). Vocal parameters analysis of smoker using Amazigh language. International Journal of Speech Technology, 21(1), 85-91.
[13] Zealouk, O., Hamidi, M., Satori, H., & Satori, K. (2020). Amazigh Digits Speech Recognition System Under Noise Car Environment. In Embedded Systems and Artificial Intelligence (pp. 421-428). Springer, Singapore.
[14] Hamidi, M., Satori, H., Zealouk, O., & Satori, K. (2019). Speech coding effect on Amazigh alphabet speech recognition performance. J. Adv. Res. Dyn. Control Syst, 11(2), 1392-1400.
[15] Hamidi, M., Satori, H., Zealouk, O., & Satori, K. (2020). Interactive Voice Application-Based Amazigh Speech Recognition. In Embedded Systems and Artificial Intelligence (pp. 271-279). Springer, Singapore.
[16] Barkani, F., Satori, H., Hamidi, M., Zealouk, O., & Laaidi, N. (2020). Comparative Evaluation of Speech Recognition Systems Based on Different Toolkits. In Embedded Systems and Artificial Intelligence (pp. 33-41). Springer, Singapore.
[17] Hamidi, M., Satori, H., Zealouk, O., & Satori, K. (2020). Amazigh digits through interactive speech recognition system in noisy environment. International Journal of Speech Technology, 23(1), 101-109.
[18] Addarrazi, I., Satori, H., & Satori, K. (2020). A Follow-Up Survey of Audiovisual Speech Integration Strategies. In Embedded Systems and Artificial Intelligence (pp. 635-643). Springer, Singapore.
[19] Gupta, K., & Gupta, D. (2016). An analysis on LPC, RASTA and MFCC techniques in automatic speech recognition system. In 2016 6th International Conference-Cloud System and Big Data Engineering (Confluence) (pp. 493–497). IEEE.
[20] http://www.speech.cs.cmu.edu/tools/lmtool-new.html
[21] Satori, H., Hiyassat, H., Harti, M., & Chenfour, N. (2009). Investigation Arabic Speech Recognition using CMU Sphinx System. The International Arab Journal of Information Technology, 6(2), 186–190.
