RESEARCH ARTICLE

Copyright © 2016 American Scientific Publishers
All rights reserved
Printed in the United States of America

Advanced Science Letters
Vol. 22, 2043-2047, 2016

The Recognition of Hijaiyah Letter Pronunciation using Mel Frequency Cepstral Coefficients and Hidden Markov Model

Rifan Muhamad Fauzi, Adiwijaya, Warih Maharani
School of Computing, Telkom University, Bandung 40257, Indonesia

Learning the Hijaiyyah letters is the first stage in learning to read the Holy Qur'an. This process is usually carried out by a learner together with an advisor who introduces the Hijaiyyah letters and teaches how to read and pronounce them. Speech recognition is a system that processes a voice signal into data so that it can be recognized by a computer. By applying such a system, it is expected that the role of the advisor who introduces and corrects the pronunciation of the Hijaiyyah letters can be replaced, so that learning can be done independently. The problem of recognizing the Hijaiyyah letters is addressed by using Mel Frequency Cepstral Coefficients (MFCC) to extract the characteristics of every voice signal and a Hidden Markov Model (HMM) to model and classify the voices. After testing the system under several scenarios, the best accuracy obtained is 67.75% in recognizing 50 words. This accuracy comes from the test with a 16 kHz sample rate, a codebook size of 64, and 5 HMM states.

Keywords: MFCC, HMM, Speech Recognition, Hijaiyah

1. INTRODUCTION

Hijaiyyah letters are the letters used in Al-Qur'an. Someone needs to study the Hijaiyyah letters and Tajweed to be able to read and understand Al-Qur'an. Studying the Hijaiyyah letters is the very first step in the process of learning Al-Qur'an. In practice, a special advisor who has mastered Al-Qur'an is needed to introduce and teach the Hijaiyyah letters. Speech recognition is a technology applied to recognize a voice and change it into a data representation that is understood by a computer. By using a speech recognition system, it is hoped that the advisor/corrector in Hijaiyyah study can be replaced, so that the process of study can be completed independently and flexibly.

Mel-frequency cepstral coefficients (MFCC) is a feature extraction method. It adopts the workings of the human hearing organ, so it can capture the important characteristics of a sound [11]. A Hidden Markov Model (HMM) is a statistical model in which the modeled system is assumed to be a Markov process with unknown parameters, and the aim is to determine the hidden parameters from the known, observable ones [9]. HMM is applied here in the modeling and recognition processes.

Pronouncing a Hijaiyyah word is slightly different from pronouncing words of other languages. One needs to understand Tajweed, and being able to recite the letters correctly is the foundation of Tajweed,

* Email Address: rifan.refun@gmail.com
2043 Adv. Sci. Lett. Vol. 22, No. 8, 2016 1936-6612/2016/22/2043/005
doi: 10.1166/asl.2016.7769
and this is achieved by knowing where each sound originates (makhraj). This can then help in practising the pronunciation of the letters correctly. Building a speech recognition system for this case is therefore challenging and may differ from other languages. MFCC mimics the human hearing organ and is expected to extract the features of Hijaiyyah word voices well.

2. METHODOLOGY

A. Voice data recording
The list of Hijaiyyah words used as the dataset for this system is taken from the Qronis curriculum. The data consist of 50 words, and every word is spoken by 21 people during the recording.

B. Normalization
The aim of the normalization process is to standardize the maximum amplitude and the sample rate of the voice signal, so that changes in amplitude do not influence the subsequent processes. The normalization applied in this work covers stereo-to-mono conversion, resampling to 16 kHz, centering of the amplitude, and dividing every discrete amplitude by the maximum amplitude value.

C. Feature extraction
Feature extraction is an important process in speech recognition, because the quality of the characteristics extracted from a voice signal largely determines how well the signal can be recognized [7]. The feature extraction method applied in this work is Mel Frequency Cepstral Coefficients (MFCC), pictured in Figure 1. First, the voice signal is filtered using pre-emphasis with parameter 0.95. Next, the pre-emphasized voice signal is divided into frames of 240 samples with an overlap of 160 samples. Third, each frame resulting from frame blocking enters the next process, which is windowing using a Hamming window.

Fig. 1. MFCC Block Diagram

Next, the Fast Fourier Transform (FFT) algorithm is applied to find the magnitude spectrum of the windowed data [11]. The next phase is mel filtering, in which the filter uses a bank of triangular filters. In this work, the filter bank parameters are 13 linear filters with a linear spacing of 66.6666, 27 logarithmic filters, and 12 cepstral MFCC coefficients. Next, the logarithm is applied as a smoothing function to anticipate the loss of information when filtering with the mel filter bank. The last phase of MFCC is the Discrete Cosine Transform (DCT). This algorithm converts the log mel spectrum from the frequency domain to the time domain, resulting in 12 MFCC features. These 12 features are then differentiated using a first-order derivative, producing 12 more features, so the feature vector produced by MFCC has 24 features (12 MFCC features and 12 derivative features).

D. Vector Quantization
Vector Quantization (VQ) as used in this work is divided into two parts: the formation of the codebook and the determination of codebook indices. The codebook is formed during the training process using the K-means clustering algorithm, while codebook-index determination is performed during both training and testing by changing each feature vector into the codebook index with the smallest Euclidean distance.

E. Training
Training is the process of modeling the voice data into a model that can be used in testing. The data used in the training process are the strings of codebook indices produced by vector quantization; these indices serve as the HMM observation symbols. Training is performed using the Baum-Welch algorithm. The result of training is an HMM model λ = (A, B, π), where A is the state transition probability matrix, B is the observation symbol probability matrix, and π is the initial state distribution. In this work, the HMM is a discrete ergodic model: the parameters A, B, and π are generated randomly, with their values normalized to sum to one, and are then re-estimated during training to obtain optimal parameters. The HMM parameter training process is illustrated in Figure 2.

The initialization of the HMM is the initialization of a model λ = (A, B, π) whose values are generated randomly and normalized to sum to one. After that, the forward variable α_t(i) and the backward variable β_t(i) are computed using the forward and backward algorithms; α_t(i) can be computed inductively using the three steps of the forward algorithm [10]:
1. Initialization:

    α_1(i) = π_i b_i(O_1),    1 ≤ i ≤ N                                  (1)

2. Induction:

    α_{t+1}(j) = [ Σ_{i=1..N} α_t(i) a_{ij} ] b_j(O_{t+1}),
                  1 ≤ t ≤ T − 1,  1 ≤ j ≤ N                              (2)

3. Termination:

    P(O | λ) = Σ_{i=1..N} α_T(i)                                         (3)

The backward variable β_t(i) can likewise be computed inductively, using the backward algorithm as follows [10]:

1. Initialization:

    β_T(i) = 1,    1 ≤ i ≤ N                                             (4)

2. Induction:

    β_t(i) = Σ_{j=1..N} a_{ij} b_j(O_{t+1}) β_{t+1}(j),
             t = T − 1, T − 2, …, 1,  1 ≤ i ≤ N                          (5)

Next, the re-estimation of the HMM parameters can be done using these formulas:

    π̂_i = α_1(i) β_1(i) / Σ_{j=1..N} α_1(j) β_1(j)                       (6)

    â_{ij} = Σ_{t=1..T−1} α_t(i) a_{ij} b_j(O_{t+1}) β_{t+1}(j)
             / Σ_{t=1..T−1} Σ_{j=1..N} α_t(i) a_{ij} b_j(O_{t+1}) β_{t+1}(j)   (7)

    b̂_j(k) = Σ_{t=1..T, O_t = v_k} α_t(j) β_t(j) / Σ_{t=1..T} α_t(j) β_t(j)   (8)

Fig. 2. HMM Parameter Training

Mathematical description:
    N : the number of states
    M : the number of distinct observation symbols per state
    A : the state transition probability distribution
    B : the observation symbol probability distribution in a state
    π : the initial state distribution
    O : the observation sequence
    T : the length of the observation sequence

The result of the HMM parameter re-estimation is a new set of values for the elements of the matrices A, B, and π. The re-estimation iteration is stopped when it reaches the maximum number of iterations, or when a minimum-improvement threshold is met (the new model no longer gives a significant change). After the re-estimation process is finished, the system saves the model λ = (A, B, π); this model will be used in the testing process. The number of models saved is equal to the number of words to be recognized.

F. Evaluation
In the testing process, every test datum enters the same normalization, feature extraction, and vector quantization steps as in training. The difference is that, in the testing step, vector quantization consists only of codebook-index determination. After that, the system takes the model data λ = (A, B, π) that were stored during training and performs recognition. This is done by computing the likelihood of the index string resulting from vector quantization against every model λ. The output of this process is the index of the reference signal model with the maximum likelihood; this index is matched with the saved database, so the corresponding text is found.

G. System specification
The application is built in MATLAB 2011 and can recognize voice audio in WAV format. It cannot record and recognize voice in real time, and it cannot handle noise.

3. TEST RESULT

A. Sample Rate
The sample rate is the number of audio signal samples taken in one second during voice recording. Resampling is the process of normalizing the sample rate of a voice signal. The higher the sample rate, the better the audio quality. Resampling is performed to make all signals have the same sample rate, so that the subsequent processes are not influenced by sample rate differences between signals.

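The codebook experiments below build on the vector quantization of Section 2D: K-means codebook formation during training, then nearest-codeword index assignment by Euclidean distance. A toy sketch, assuming plain list-of-lists feature vectors; the names and the fixed iteration count are illustrative, not the paper's code.

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_codebook(vectors, k, iters=20, seed=0):
    """Form a codebook of k codewords from training feature vectors."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)          # initial codewords
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:                       # assign to nearest centroid
            clusters[min(range(k), key=lambda i: dist2(v, centroids[i]))].append(v)
        for i, c in enumerate(clusters):        # recompute centroids
            if c:
                centroids[i] = [sum(col) / len(c) for col in zip(*c)]
    return centroids

def quantize(vectors, codebook):
    """Replace each feature vector by the index of its nearest codeword."""
    return [min(range(len(codebook)), key=lambda i: dist2(v, codebook[i]))
            for v in vectors]
```

The resulting index strings are what the HMMs observe, which is why the codebook size studied next has such a direct effect on accuracy.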
Figure 3. The influence of sample rate on accuracy

Based on the graph in Figure 3, it can be concluded that the best accuracy accomplished by the system is with a 16000 Hz sample rate. This is because a signal with an 8000 Hz sample rate cannot retain all of the voice characteristics that are needed, so the extracted characteristics do not fit the needed ones. But when the sample rate is changed to 44100 Hz (higher), the system does not give any better result. This shows that a higher sample rate is not the best way to improve the system's accuracy: the higher the sample rate, the bigger the bit representation, which makes the system work harder to model a pronunciation. For that reason, the system tends to produce worse accuracy and a longer computation time. Besides, research [7] stated that the best sample rate range for human voice recognition is 8 kHz-20 kHz. So a 16000 Hz sample rate is the best value in this case.

B. Codebook and State
The codebook is a representation of the feature vectors of all voice signals that have passed the clustering process. The following figure shows the accuracy for a number of states and codebook sizes.

Figure 4. The influence of state and codebook on accuracy

Based on the graph in Figure 4, it can be seen that the highest accuracy (67.75%) is achieved with a codebook of 64 and 5 states, while the lowest accuracy (44.62%) occurs with a codebook of 16 and 5 states. It can also be seen that as the codebook size increases, the system accuracy tends to improve. However, it is also important to remember that if the codebook is too big, the codebook degrades: points that should be in one cluster are placed in more than one cluster and separated from one another. This will surely influence the accuracy of the system, because the codebook is used in the vector quantization process to decide the string of HMM observations.

C. Number of Models
In HMM, every word is modeled as one HMM. The more words put into training, the more models are formed. The influence of the number of models on system accuracy can be seen in Figure 5.

Figure 5. The influence of the number of models on accuracy

Based on the graph in Figure 5, it can be seen that the more models in training, the smaller the accuracy tends to be. This is perhaps because some models are alike in their cepstra, so the system misses in doing the classification. A solution for the HMM to achieve good accuracy is to add audio files to every model during training: the more audio files per model, the more observation strings during training. As a consequence, the HMM parameter re-estimation process will be better and will result in a more optimal model. With an optimal model, the system accuracy will be better.

D. MFCC vs LPC
LPC is another feature extraction method that can also be used for a recognition system. A comparison between MFCC and LPC, recognizing the same dataset with the same system, can be seen in Figure 6.

Figure 6. The accuracy comparison between MFCC and LPC

Based on the graph in Figure 6, it can be concluded that MFCC is the better feature extraction method in this case. Whatever codebook size is used, MFCC always has better accuracy.
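For concreteness, the MFCC pipeline of Section 2C (pre-emphasis 0.95, 240-sample frames with 160-sample overlap, Hamming window, magnitude spectrum, triangular filter bank, log, DCT) can be sketched as below. This is a simplified stand-in, not the system's implementation: the filter bank is a generic triangular bank rather than the paper's exact 13-linear/27-log configuration, and a naive DFT replaces the FFT.

```python
import math, cmath

def mfcc(signal, n_filters=20, n_ceps=12):
    # 1. Pre-emphasis with parameter 0.95 over the whole signal.
    pre = [signal[0]] + [signal[n] - 0.95 * signal[n - 1]
                         for n in range(1, len(signal))]
    # 2. Frame blocking: 240-sample frames, 160-sample overlap (hop of 80).
    frames = [pre[i:i + 240] for i in range(0, len(pre) - 239, 80)]
    return [cepstra(f, n_filters, n_ceps) for f in frames]

def cepstra(frame, n_filters, n_ceps):
    N = len(frame)
    # 3. Hamming window.
    w = [s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
         for n, s in enumerate(frame)]
    # 4. Magnitude spectrum (naive DFT; a real system would use the FFT).
    half = N // 2 + 1
    mag = [abs(sum(w[n] * cmath.exp(-2j * math.pi * k * n / N)
                   for n in range(N)))
           for k in range(half)]
    # 5. Triangular filter bank energies, then log as a smoothing function.
    edges = [int(i * (half - 1) / (n_filters + 1)) for i in range(n_filters + 2)]
    logmel = []
    for f in range(n_filters):
        lo, mid, hi = edges[f], edges[f + 1], edges[f + 2]
        e = sum(mag[k] * (k - lo) / max(1, mid - lo) for k in range(lo, mid))
        e += sum(mag[k] * (hi - k) / max(1, hi - mid) for k in range(mid, hi))
        logmel.append(math.log(e + 1e-10))
    # 6. DCT: keep the first n_ceps cepstral coefficients.
    return [sum(logmel[m] * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m in range(n_filters))
            for c in range(1, n_ceps + 1)]
```

The 12 delta features of Section 2C would then be first-order differences of these coefficients across frames, giving the 24-dimensional vectors that enter vector quantization.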
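Tying Sections 2E and 2F together: recognition scores the observation index string against every stored word model λ with the forward algorithm of Eqs. (1)-(3) and picks the maximum-likelihood word. A toy sketch with invented two-state models; the data structures and names are illustrative assumptions, not the system's code.

```python
def forward_likelihood(obs, pi, A, B):
    """P(O | lambda) for a discrete HMM; B[i][k] = P(symbol k | state i)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]            # Eq. (1)
    for o in obs[1:]:                                           # Eq. (2)
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)                                           # Eq. (3)

def recognize(obs, models):
    """Return the word whose stored model maximizes the likelihood of obs."""
    return max(models, key=lambda w: forward_likelihood(obs, *models[w]))
```

A production version would work with log-probabilities (or scaling) to avoid underflow on long observation strings, but the structure of the search over the stored models is the same.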

4. CONCLUSION

Based on the research, it can be concluded that the Hijaiyyah letter pronunciation recognition system developed using Mel Frequency Cepstral Coefficients (MFCC) and a Hidden Markov Model achieves 67.75% accuracy. This accuracy is taken from the test with a codebook size of 64 and a 5-state HMM. The size of the codebook used in the clustering process highly influences the system accuracy, while the best number of states follows from the number of observations made: a great number of states does not always increase system accuracy, and vice versa. For further development, it is considered important to extend the system to recognize the Hijaiyyah letters together with their makhraj, and to make the recognition process run in real time.

ACKNOWLEDGMENTS

This work was supported in part by a Telkom University Research Grant.

REFERENCES

[1] Ardisasmita, M.S., 2003, Sistem Kendali Peralatan Dengan Perintah Suara Menggunakan Model Hidden Markov dan Jaringan Syaraf Tiruan, Risalah Lokakarya Komputasi dalam Sains dan Teknologi Nuklir XIV, Juli 2003.
[2] Becchetti, C., & Lucio, P. R., 1999, Speech Recognition Theory and C++ Implementation, Fondazione Ugo Bordoni: Wiley.
[3] Do, M. N., 1994, DSP Mini-Project: An Automatic Speaker Recognition System.
[4] Gales, M., & Young, S., 2008, The Application of Hidden Markov Models in Speech Recognition, Vol. 1, No. 3 (2007), 195-304.
[5] Kinnunen, T., 2003, Spectral Features for Automatic Text-Independent Speaker Recognition.
[6] Muda, L., Mumtaj, B., & Elamvazuthi, I., 2010, Voice Recognition Algorithm using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques, Journal of Computing, Vol. 2, Issue 3, March 2010.
[7] Rabiner, L., & Juang, B.-H., 1993, Fundamentals of Speech Recognition, Englewood Cliffs: Prentice-Hall International, Inc.
[8] Rabiner, L. R., & Juang, B.-H., 2006, Speech Recognition: Statistical Methods, Elsevier Ltd.
[9] Rabiner, L., & Juang, B.-H., 1991, Hidden Markov Models for Speech Recognition, Technometrics, Vol. 33, No. 3, August 1991.
[10] Rabiner, L., 1989, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-285, February 1989.
[11] Tychtl, Z., & Psutka, J., Speech Production Based on the Mel-Frequency Cepstral Coefficients.
[12] Wildermoth, B. R., 2001, Text-Independent Speaker Recognition Using Source Based Features.
[13] Yulita, I. N., Liong, T. H., & Adiwijaya, 2012, Fuzzy Hidden Markov Models for Indonesian Speech Classification, Journal of Advanced Computational Intelligence and Intelligent Informatics, 16:3, pp. 381-387.

Received: 22 September 2010. Accepted: 18 October 2010
