Structure Evolution of Hidden Markov Models For Audiovisual Arabic Speech Recognition
Abstract. In this paper, we present an Audio-Visual Automatic Speech Recognition (AVASR) system combining acoustic and visual data. The algorithm proposed here for modeling the multimodal data is a Hidden Markov Model (HMM) hybridized with a Genetic Algorithm (GA) that determines its optimal structure. This algorithm is combined with the Baum-Welch algorithm, which allows an effective re-estimation of the probabilities of the HMM. Our experiments have shown an improvement in the performance of the audiovisual system based on the combined GA/HMM model.

Keywords: automatic speech recognition; HMM; hidden Markov model; GA; genetic algorithm; hybridization; audio-visual fusion; computer vision.

hopefully obtain the global optimal solution (Man et al., 1996). In addition, the GA has the potential to optimize both feature subsets and HMM parameters at the same time (Xueying et al., 2007; Pérez et al., 2007; Goh et al., 2010).

Therefore, in this work we propose an alternative solution to find the optimal structure and the candidate structure of the HMM using a GA. It is based on the evolution of a population of individuals, which encode potential solutions to the problem and traverse the fitness landscape by means of genetic operators that are supposed to bias their evolution towards better solutions. Experimental results showed that our GA for HMM training can obtain a more optimized HMM than the Baum-Welch algorithm (which is an Expectation-Maximization (EM) algorithm).

This paper is organized as follows: in Section 2, we briefly review some research literature related to our research interest. In Section 3, we cover the background knowledge needed to understand our proposed approach and explain all the methods used in this work. The performance of the system is evaluated and discussed in Section 4, and the conclusions are drawn in the final section.
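The evolutionary search described above — a population of candidate solutions improved by selection, crossover and mutation — can be illustrated generically. The following is a minimal sketch with a toy fitness function; all names and parameter values are ours, not the paper's:

```python
import random

random.seed(0)  # keep the toy run below reproducible

def genetic_search(fitness, init, mutate, crossover,
                   pop_size=20, n_parents=10, generations=50):
    """Generic GA skeleton: evolve a population towards higher fitness.
    Parents survive unchanged (elitism); the rest of the population is
    regenerated by crossover followed by mutation."""
    population = [init() for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the best-scoring individuals become parents.
        population.sort(key=fitness, reverse=True)
        parents = population[:n_parents]
        # Crossover + mutation regenerate the non-selected individuals.
        children = []
        while len(children) < pop_size - n_parents:
            a, b = random.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = parents + children
    return max(population, key=fitness)

# Toy usage: maximize -(x - 3)^2, whose optimum is x = 3.
best = genetic_search(
    fitness=lambda x: -(x - 3.0) ** 2,
    init=lambda: random.uniform(-10.0, 10.0),
    mutate=lambda x: x + random.gauss(0.0, 0.1),
    crossover=lambda a, b: (a + b) / 2.0,
)
```

In the paper, the individuals are whole HMMs rather than scalars, but the loop structure is the same.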
Once the ROI is isolated, it is recommended to extract the useful information with a minimum number of attributes, to avoid the statistical modeling difficulties caused by a high-dimensional attribute space.

To characterize the video signals, in this work we use the coefficients of the upper-left corner of the DCT of the resulting ROI image, according to the following relation (Gupta and Garg, 2012):
F(u, v) = (1/√(MN)) α(u) α(v) Σx=1..M Σy=1..N f(x, y) cos[(2x+1)uπ / 2M] cos[(2y+1)vπ / 2N]    (1)

where M and N are the dimensions of the image, u is the horizontal spatial frequency, v is the vertical spatial frequency, f(x, y) is the pixel value at coordinates (x, y), F(u, v) is the DCT coefficient of the M×N transform at coordinates (u, v), and α() is defined as follows:

α(w) = 1/√2 if w = 1, and α(w) = 1 otherwise    (2)

Figure 7 Selection process of the DCT coefficients with a sample from: (a) CUAVE database (b) AVARB database

The indices of the most energetic coefficients are obtained on a training set, so that the visual DCT coefficients can subsequently be extracted by applying the DCT to each frame in the video, as shown in Figure 7. Only the corresponding coefficients are retained (in our case, we kept the 100 most energetic coefficients).

3.4 The hybrid GA/HMM modeling

The HMM is a ubiquitous tool for representing probability distributions over sequences of observations, and it has emerged as the predominant technology in speech recognition in recent years. Indeed, structurally optimized HMMs have been shown to be better adapted to the speech recognition domain and to image modeling than the traditional HMM (Fujiwara et al., 2008).

By definition, an HMM is a stochastic process determined by two interrelated mechanisms: a Markov chain, which determines the state at time t, St = st, and a state-dependent process, which generates the observation Ot = ot depending on the current state st.
An HMM is usually represented by three sets of probability distributions, whose definitions are given as follows:

- Π: the initial state probabilities,

πi = P(q1 = i), 1 ≤ i ≤ N    (3)

where N is the number of states in the model.

- A: the transition probability matrix, A = {ai,j},

ai,j = P(qt+1 = j | qt = i), 1 ≤ i, j ≤ N    (4)

with qt the current state and qt+1 the next state.
- B: the emission probabilities, B = {bi(ot)}, where bi(ot) is the probability of observation ot being generated from state i,

bi(j) = P(ot = j | qt = i), 1 ≤ j ≤ M, 1 ≤ i ≤ N    (5)
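Given the triple (Π, A, B) of equations (3)–(5), the likelihood P(o|λ) of an observation sequence — the quantity the GA training below seeks to maximize — can be computed with the standard forward recursion. A minimal sketch (the toy model values are illustrative, not from the paper):

```python
import math

def forward_log_likelihood(pi, A, B, obs):
    """Forward algorithm: log P(o | lambda) for a discrete HMM.

    pi[i]   -- initial probability of state i          (eq. 3)
    A[i][j] -- transition probability from i to j      (eq. 4)
    B[i][k] -- probability of emitting symbol k in i   (eq. 5)
    obs     -- sequence of observed symbol indices
    """
    n = len(pi)
    # alpha[i] = P(o_1..o_t, q_t = i), initialised at t = 1.
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return math.log(sum(alpha))

# Toy 2-state, 2-symbol model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
ll = forward_log_likelihood(pi, A, B, [0, 1, 0])  # P(o|lambda) = 0.10893
```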
cannot satisfy the conditions of the HMM parameters. For this reason, they are not included in the offspring and are replaced by new chromosomes.

The principle of the proposed GA/HMM algorithm is as follows: a GA manipulates a population of HMMs consisting of individuals whose architecture is not unique. The algorithm seeks the optimal HMMs, i.e. the HMMs with the highest probability of generating a given observation O. This probability can be rapidly calculated by the Baum-Welch algorithm. Local optimization is performed by the Baum-Welch algorithm in conjunction with genetic operators specialized for HMMs (Oudelha and Ainon, 2010; Xueying et al., 2007; Goh et al., 2010). Among these operators we find, for example, the crossover operator, which integrates the stochasticity constraints that apply to the HMM. There is also a normalization operator that restores the same constraints after the mutation phase. We give below a frame of the GA/HMM algorithm. A marker named "parent" is used simply so that only the individuals required during optimization and evaluation are treated.

1- Initialization: Randomly create a population of size S. The most natural encoding is to build the chromosome by reorganizing all the coefficients of the HMM; the simplest way is to juxtapose all rows of all matrices. We thus obtain a real-valued coding that respects the constraints related to the HMM. The representation of a population is given in Figure 8.

Figure 8 Chromosome representation method in the GA/HMM training

π1 … πN a1,1 … a1,N a2,1 … aN,N

The score of an individual is the total log-likelihood over the training set:

Pn = Σi=1..M log(P(oi | λi))    (7)

It is proven that the Baum-Welch algorithm leads to a local maximum of the function f(λi). However, it is possible that other, better maxima of f(λi) exist for a given training set. In this paper we try to overcome this problem by using genetic algorithms for the maximization of f(λi) (Goh et al., 2010).

4- Selection: Among all the individuals of the population, select a number S' < S, which will be used as parents to regenerate the S − S' other individuals not selected. The selection is done according to the best scores calculated in step 3. Each selected individual is marked "parent".

5- Crossover/recombination: For each individual not marked "parent", randomly select two individuals from those marked "parent" and cross them. The crossover uses a single crossover point, placed between two rows of the matrices of the HMM, which yields two children. Only one of the two children is retained, at random.

6- Mutation/normalization: To each individual not marked "parent" we apply the mutation operator, which modifies each coefficient of the matrices of the HMM by a small random quantity. Each coefficient is modified according to the value of the mutation probability. After treating an individual, we apply a normalization operator to it, to ensure that the individual still satisfies the constraints of the HMM: we must verify that the matrices of the HMM are stochastic. This operator is applied after the mutation operation because subsequent operations imperatively work on HMMs.

7- Evaluation of the stop condition: If the maximum number of iterations is not reached, return to step 2; otherwise go to step 8.

8- Finally, return the best HMM among the current population.

Note that this algorithm has been adapted to the optimization of vectors of observations. The re-estimation is made so that the HMM has a maximal probability of generating the set of vectors.

3.5 The audio-visual fusion

In decision fusion (or late integration), the visual and auditory information are processed separately and each is transmitted to a classifier, as in (Rogozan et al., 1997). The results at the output of each of the two recognition processes are fused in an integration module which provides the final result.

In this work, match-score-level fusion is used to combine the audio and visual identification outputs. Several decision fusion strategies have been tested (product, sum, minimum, maximum, vote, ...) and all show a significant improvement in results compared to using a single modality, which led us to focus in this work on the separate-fusion model, i.e. fusing the scores from each GA/HMM recognizer. Their sets of log-likelihoods can be combined using weights that reflect the reliability of each particular stream; the combined scores then take the following form (Islam and Rahman, 2010):

log P(o^AV | λ) = wA log P(o^A | λA) + wV log P(o^V | λV)    (8)

where λA and λV are the acoustic and the visual GA/HMMs respectively, and log P(o^A | λA) and log P(o^V | λV) are their log-likelihoods. The reliability of each modality can be calculated by the most appropriate and best-performing measure (Islam and Rahman, 2010), the average difference between the maximum log-likelihood and the other ones:

Rs = (1/(C − 1)) Σi=1..C (maxj log P(o | λj) − log P(o | λi))    (9)

where C is the number of classes considered to measure the reliability of each modality, and s ∈ {A, V}. After that, we can calculate the integration weight wA of the audio modality from its reliability measure by:

wA = RA / (RA + RV)    (10)

where RA and RV are the reliability measures of the outputs of the acoustic and visual GA/HMMs respectively, and the weighting factor of the visual modality can be found by the relation:

wA + wV = 1, for 0 < wA, wV < 1    (11)

4 Results and discussion

In this paper we evaluated the proposed AVASR system on the CUAVE database and our Arabic audiovisual database. In order to build this recognition system we used ⅔ of the data for the learning stage and the remaining ⅓ to test the effectiveness of our system.

The proposed AVASR system, using RASTA-PLP for audio feature extraction, DCT coefficients for visual feature extraction, and a GA/HMM for audio-visual speech modeling, was implemented as described in the previous sections in the Matlab simulation software. By integrating the first and second derivatives of the parameters, we obtain a matrix of 27 parameters for the audio stream after segmenting the sampled speech data into 0.025-second frames with an overlap of 0.010 seconds. The GA/HMM recognizers were built using the Hidden Markov Model Toolkit (HTK) as in (Young et al., 2005).

In Table 1 and Table 2, we present various kinds of instances with different GA control parameters that have been solved with our algorithm to evaluate the performance of the proposed system. We ran each instance 15 times with a different number of clusters, crossover probability values between 0.5 and 0.9, and a mutation probability of 0.01, and obtained the maximum P(o|λ) values after 50 generations.
Table 1 GA parameters training HMM for audio-only: (a) AVARB database (b) CUAVE database

(a) AVARB database
Number of clusters | Pc  | Pm   | Average P(o|λ)
3                  | 0.5 | 0.01 | -2.3630
5                  | 0.6 | 0.01 | -1.5838
7                  | 0.7 | 0.01 | -1.1396
9                  | 0.8 | 0.01 | -3.3185
12                 | 0.9 | 0.01 | -4.0122

(b) CUAVE database
Number of clusters | Pc  | Pm   | Average P(o|λ)
3                  | 0.5 | 0.01 | -3.7416
5                  | 0.6 | 0.01 | -3.2604
7                  | 0.7 | 0.01 | -3.4235
9                  | 0.8 | 0.01 | -3.9134
12                 | 0.9 | 0.01 | -4.3637
Table 2 GA parameters training HMM for video-only: (a) AVARB database (b) CUAVE database

(a) AVARB database
Number of clusters | Pc  | Pm   | Average P(o|λ)
3                  | 0.5 | 0.01 | -7.7629
4                  | 0.5 | 0.01 | -7.0046
7                  | 0.8 | 0.01 | -7.1555
9                  | 0.8 | 0.01 | -7.6595
12                 | 0.9 | 0.01 | -7.8234

(b) CUAVE database
Number of clusters | Pc  | Pm   | Average P(o|λ)
3                  | 0.5 | 0.01 | -5.1860
5                  | 0.6 | 0.01 | -5.2987
7                  | 0.7 | 0.01 | -5.4743
9                  | 0.8 | 0.01 | -5.8747
12                 | 0.9 | 0.01 | -6.0890
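The "Number of clusters" column in the tables comes from a vector quantization phase (cf. Kanungo et al., 2002). A Lloyd-style k-means quantizer can be sketched as follows; this is a generic illustration with toy 2-D data, not the exact procedure used in the experiments:

```python
def kmeans(vectors, k, iters=20):
    """Plain Lloyd k-means over tuples, with farthest-point seeding
    so that the toy example below is deterministic."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Seeding: start from the first vector, then repeatedly add the
    # vector farthest from all centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(vectors, key=lambda v: min(d2(v, c) for c in centroids)))
    for _ in range(iters):
        # Assignment step: attach each vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[min(range(k), key=lambda i: d2(v, centroids[i]))].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids

# Toy data: two well-separated 2-D blobs -> centroids near (0,0) and (5,5).
data = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1),
        (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
centroids = kmeans(data, k=2)
```

Each feature vector is then replaced by the index of its nearest centroid, giving the discrete observation symbols the HMMs are trained on.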
We observe that the results vary according to the training parameters of the GA and to the number of clusters obtained by the vector quantization phase. For example, 7 clusters with Pc=0.7 and Pm=0.01 for the Arabic audio database, and 5 clusters with Pc=0.6 and Pm=0.01 for the visual Arabic database, are superior to all the other settings in all cases; therefore, we use them in our GA/HMM. The same holds for the CUAVE audio database with 4 clusters, Pc=0.6 and Pm=0.01, while for the visual CUAVE database the best performance is obtained with 3 clusters, Pc=0.5 and Pm=0.01.

Figures 9 and 10 give the recognition rate with respect to the number of clusters used in the experiment.
Figure 9 Comparison of recognition rates of audio-only, video-only and audio-visual ASR for the CUAVE database by using: (a) an HMM-based recognizer and (b) a GA/HMM recognizer
Based on Figure 9, we can see that the recognition rates obtained with our GA/HMM are better in most cases than those obtained with the standard HMM. The figures also indicate that the audio-visual system significantly outperforms the single modalities overall, achieving the highest recognition rates.

For the CUAVE database, the results show that the best average recognition rate is 86.8% using the standard HMM recognizer with 5 clusters in the clustering phase, and 98.1% using the GA/HMM recognizer with 3 clusters.
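The audio-visual scores behind these recognition rates are combined with the reliability-weighted fusion of equations (8)–(11); that computation can be sketched as follows (the per-class scores are invented for illustration):

```python
def reliability(log_likelihoods):
    """Eq. (9): average gap between the best class log-likelihood
    and each class log-likelihood (the max term contributes zero)."""
    C = len(log_likelihoods)
    best = max(log_likelihoods)
    return sum(best - ll for ll in log_likelihoods) / (C - 1)

def fuse(audio_ll, visual_ll):
    """Eqs. (8), (10), (11): combine per-class log-likelihoods with
    weights proportional to each modality's reliability."""
    R_A, R_V = reliability(audio_ll), reliability(visual_ll)
    w_A = R_A / (R_A + R_V)   # eq. (10)
    w_V = 1.0 - w_A           # eq. (11)
    return [w_A * a + w_V * v for a, v in zip(audio_ll, visual_ll)]  # eq. (8)

# Toy 3-class scores: the audio stream separates the classes sharply,
# so it receives the larger weight and dominates the fused decision.
audio = [-1.0, -5.0, -6.0]
video = [-2.0, -2.5, -3.0]
fused = fuse(audio, video)
```

The recognized class is the one with the highest fused score.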
Figure 10 Comparison of recognition rates of audio-only, video-only and audio-visual ASR for the AVARB database by using: (a) an HMM-based recognizer and (b) a GA/HMM recognizer
In Figure 10, we note almost the same observations with our AVARB database, i.e. we found best average recognition rates of 93.7% and 97.6% using the standard HMM and the hybrid GA/HMM recognizers respectively, with 7 clusters for both.

More generally, we found a percentage increase varying from almost 5% to 28% across our test results, but this rise in recognition rates is not fixed as the population size increases: the rates obtained can also be worse than or the same as those of the standard HMM system before optimization. This is due to the random nature of the GA method and to this system using the standard generational replacement process.

5 Conclusions

In this preliminary work, we presented a new AVASR system which uses DCT coefficients for visual feature extraction and RASTA-PLP for acoustic feature extraction. The traditional training of HMMs with the Baum-Welch algorithm has two main drawbacks: convergence to the local optimum closest to the starting values of the optimization procedure, and an estimation of the recognizer parameters that requires careful examination. To overcome them, we chose a hybrid GA/HMM algorithm for modeling the audio-visual speech.

Two databases were used to evaluate our AVASR system: the CUAVE database and our AVARB database. From the several test results, we conclude that the systems modeled by HMMs and trained by our GA/HMM training have higher recognition rates than the HMMs trained by the Baum-Welch algorithm.

For future work, we are planning to cover more issues related to improving the performance of the proposed system. Finally, we also intend to test our system with other alternative recognition methods and compare them with our proposed GA/HMM algorithm.

References

Potamianos, G., Neti, C., Gravier, G., Garg, A. and Senior, A.W. (2003) 'Recent advances in the automatic recognition of audiovisual speech', Proceedings of the IEEE, Vol. 91, No. 9, pp.1306–1326.

Man, K.F., Tang, K.S. and Kwong, S. (1996) 'Genetic Algorithms: Concepts and Applications', IEEE Transactions on Industrial Electronics, Vol. 43, No. 5, pp.519–532.

McGurk, H. and MacDonald, J. (1976) 'Hearing lips and seeing voices', Nature, Vol. 264, No. 5588, pp.746–748.

Petajan, E.D. (1984) 'Automatic lipreading to enhance speech recognition', Proceedings of the IEEE Communication Society Global Telecommunications Conference, Atlanta, Georgia.

Elmahdy, M. and Gruhn, R. (2009) 'Modern standard Arabic based multilingual approach for dialectal Arabic speech recognition', Eighth International Symposium on Natural Language Processing, pp.169–174.

Iwano, K., Yoshinaga, T., Tamura, S. and Furui, S. (2007) 'Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images', EURASIP Journal on Audio, Speech, and Music Processing, Vol. 2007, No. 1, pp.4–4.

Zhi-yi, Q., Yu, L., Li-hong, Z. and Ming-xin, S. (2006) 'Hybrid SVM/HMM architectures for speech recognition', Proceedings of the First International Conference on Innovative Computing, Information and Control, Vol. 2, pp.100–104.

Zhao, G., Pietikäinen, M. and Hadid, A. (2007) 'Local Spatiotemporal Descriptors for Visual Recognition of Spoken Phrases', Proceedings of the International Workshop on Human-Centered Multimedia, pp.57–66.

Pao, T.L., Liao, W.Y., Wu, T.N. and Lin, C.Y. (2009) 'Automatic Visual Feature Extraction for Mandarin Audio-Visual Speech Recognition', Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, San Antonio, TX, USA, pp.2936–2940.

Nock, H.J., Iyengar, G. and Neti, C. (2003) 'Speaker localisation using audio-visual synchrony: An empirical study', Proceedings of the International Conference on Image and Video Retrieval (CIVR), Vol. 2728, pp.488–499.

Patterson, E.K., Gurbuz, S., Tufekci, Z. and Gowdy, J.N. (2002) 'Moving-talker, speaker-independent feature study and baseline results using the CUAVE multimodal speech corpus', EURASIP Journal on Applied Signal Processing, Vol. 11, pp.1189–1201.

Reikeras, H., Herbst, B.M., du Preez, J. and Engelbrecht, H. (2010) 'Audio-Visual Automatic Speech Recognition using Dynamic Bayesian Networks', Proceedings of the Twenty-First Annual Symposium of the Pattern Recognition Association of South Africa, pp.233–238.

Kanungo, T., Netanyahu, N.S. and Wu, A.Y. (2002) 'An Efficient k-Means Clustering Algorithm: Analysis and Implementation', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, pp.881–892.

Galatas, G., Potamianos, G. and Makedon, F. (2012) 'Audio-visual speech recognition using depth information from the Kinect in noisy video conditions', Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments (PETRA'12), Crete, Greece, Article No. 2.

Rogozan, A., Deléglise, P. and Alissali, M. (1997) 'Adaptive determination of audio and visual weights for automatic speech recognition', ESCA Workshop on Audio-Visual Speech Processing (AVSP'97), pp.61–64.

Minotto, V.P., Lopes, C.B.O., Scharcanski, J., Jung, C.R. and Lee, B. (2013) 'Audiovisual Voice Activity Detection Based on Microphone Arrays and Color Information', IEEE Journal of Selected Topics in Signal Processing, Vol. 7, No. 1, pp.147–156.

Islam, R. and Rahman, F. (2010) 'Likelihood Ratio Based Score Fusion for Audio-Visual Speaker Identification in Challenging Environment', International Journal of Computer Applications, Vol. 6, No. 7, pp.6–11.

Rabiner, L. and Juang, B. (1993) Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ.

Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X.A., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V. and Woodland, P. (2005) The HTK Book (for HTK Version 3.4).

Xue-ying, Z., Yiping, W. and Zhefeng, Z. (2007) 'A Hybrid Speech Recognition Training Method for HMM Based on Genetic Algorithm and Baum Welch Algorithm', IEEE 2nd International Conference on Innovative Computing, Information and Control (ICICIC'07), p.572.

Pérez, Ó., Piccardi, M. and García, J. (2007) 'Comparison between genetic algorithms and the Baum-Welch algorithm in learning HMMs for human activity classification', Proceedings of EvoWorkshops'07, pp.399–406.

Goh, J., Tang, L. and Al turk, L. (2010) 'Evolving the Structure of Hidden Markov Models for Micro aneurysms Detection', UK Workshop on Computational Intelligence (UKCI), pp.1–6.

Viola, P.A. and Jones, M.J. (2001) 'Rapid Object Detection using a Boosted Cascade of Simple Features', Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'01), Vol. 1, pp.511–518.

Papageorgiou, C.P., Oren, M. and Poggio, T. (1998) 'A General Framework for Object Detection', Sixth International Conference on Computer Vision (ICCV'98), pp.555–562.

Khandait, S.P., Khandait, P.D. and Thool, R.C. (2009) 'An Efficient Approach to Facial Feature Detection for Expression Recognition', International Journal of Recent Trends in Engineering, Vol. 2, No. 1, pp.179–182.

Potamianos, G., Neti, C., Luttin, J. and Matthews, I. (2004) 'Audio-visual automatic speech recognition: an overview', in Bailly, G., Vatikiotis-Bateson, E. and Perrier, P. (Eds.): Issues in Visual and Audio-Visual Speech Processing, MIT Press, Chapter 10.

Harris, F.J. (1978) 'On the use of windows for harmonic analysis with the discrete Fourier transform', Proceedings of the IEEE, Vol. 66, pp.51–83.

Hermansky, H., Morgan, N., Bayya, A. and Kohn, P. (1992) 'RASTA-PLP Speech Analysis', IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp.121–124.

Gupta, M. and Garg, A.K. (2012) 'Analysis of image compression algorithm using DCT', International Journal of Engineering Research and Applications (IJERA), Vol. 2, No. 1, pp.515–521.

Fujiwara, Y., Sakurai, Y. and Yamamuro, M. (2008) 'SPIRAL: efficient and exact model identification for hidden Markov models', Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp.247–255.

Sun, K. and Yu, J. (2007) 'Video Affective Content Recognition Based on Genetic Algorithm Combined HMM', Proceedings of the 6th International Conference on Entertainment Computing, pp.249–254.

Oudelha, M. and Ainon, R.N. (2010) 'HMM parameters estimation using hybrid Baum-Welch genetic algorithm', International Symposium in Information Technology (ITSim), Vol. 2, pp.542–545.