

Structure Evolution of Hidden Markov Models for Audiovisual Arabic Speech Recognition

Abstract. In this paper, we present an Audio-Visual Automatic Speech Recognition (AVASR) system combining acoustic and visual data. The algorithm proposed here for modeling the multimodal data is a Hidden Markov Model (HMM) hybridized with a Genetic Algorithm (GA) that determines its optimal structure. This algorithm is combined with the Baum-Welch algorithm, which allows an effective re-estimation of the probabilities of the HMM. Our experiments have shown an improvement in the performance of the most promising audiovisual system, based on the combined GA/HMM model.

Keywords: automatic speech recognition; HMM; hidden Markov model; GA; genetic algorithm; hybridization; audio-visual fusion; computer vision.

1 Introduction

Speech is one of the most natural ways by which people communicate. Automatic Speech Recognition (ASR) aims to transform the acoustic signal into a sequence of words that ideally corresponds to the sentence pronounced by a speaker. Recognition systems that use only the acoustic signal as input have reached their limits, especially in noisy environments. In this case, using additional information in conjunction with the features extracted from the acoustic signal is a methodology that improves the performance and robustness of ASR.

Numerous works on speech perception have shown the importance of visual information, such as body and facial expressions or lip and tongue movements, in the human recognition process (Potamianos et al., 2003; Iwano et al., 2007; Pao et al., 2009). The use of data on the shape and movement of the speaker's lips is a promising avenue for speech recognition, because the visual signal is both correlated with the acoustic speech signal and contains information complementary to it.

Most models of visual speech perception focus on sensory interaction of the fusion or integration type. At this level, the remaining open question is where and how this fusion of the acoustic and visual modalities happens in humans. Classically, we distinguish two types of fusion in AVASR systems: the fusion of parameters and the fusion of scores.

In order to improve the performance of ASR systems, researchers have explored other paradigms such as neural and Bayesian networks, discriminative training techniques, state duration modeling and the use of support vector machines with HMMs (Zhi-yi et al., 2006).

The GA is a stochastic search method that can perform a global search within the defined search space and hopefully obtain the globally optimal solution (Man et al., 1996). In addition, the GA has the potential to optimize both feature subsets and HMM parameters at the same time (Xueying et al., 2007; Pérez et al., 2007; Goh et al., 2010).

Therefore, in this work we propose an alternative solution that searches for the optimal HMM structure among candidate structures using a GA. It is based on the evolution of a population of individuals which encode potential solutions to the problem and traverse the fitness landscape by means of genetic operators that are supposed to bias their evolution towards better solutions. Experimental results show that our GA-based HMM training can obtain a better-optimized HMM than the Baum-Welch algorithm (which is an Expectation-Maximization (EM) algorithm).

This paper is organized as follows. In Section 2, we briefly review the research literature related to our work. In Section 3, we present the background knowledge needed to understand our proposed approach and explain all the methods used in this work. The performance of the whole system is evaluated and discussed in Section 4, and conclusions are drawn in the final section.

2 Related works

Audio-visual speech recognition has emerged in recent years as an active field, gathering researchers in computer vision, signal and speech processing, and pattern recognition (Potamianos et al., 2003). Humans use visual information subconsciously in order to understand speech, especially in noisy conditions, but also when the audio is clean. The movement of the speaker's lips offers hints about the place of articulation, which are automatically integrated by the brain. The McGurk effect (McGurk and MacDonald, 1976) proves this by showing that, when presented with inconsistent audio and visual stimuli, the brain perceives a sound different from the one that was spoken.

The first foray into visual ASR and AVASR occurred in 1984 with Petajan's doctoral thesis (Petajan, 1984). His system extracted the height, width, perimeter, and area of a speaker's mouth and used these features as inputs to a speech recognition system for visual-only ASR.

In addition, Arabic speech recognition faces many challenges. For example, Arabic has short vowels which are usually omitted in written text, which adds confusion for the ASR decoder. Additionally, Arabic has many dialects in which words are pronounced differently. The work in (Elmahdy and Gruhn, 2009) summarized the main problems in Arabic speech recognition, which include Arabic phonetics, the grapheme-to-phoneme relation, and morphological complexity.

In contrast, automatic lipreading has been minimally investigated, with the exception of a few pioneering efforts (Potamianos et al., 2003; Iwano et al., 2007). For example, the work in Zhao et al. (2007) proposed local spatiotemporal descriptors to represent and recognize spoken isolated phrases based solely on visual input.



An automatic visual feature extraction approach is discussed by Pao et al. (2009) to extract the visual features of the lips that can be used in an AVASR system. These features are important to the recognition system, especially in noisy conditions. They present recognition performance using various visual features to explore their impact on the recognition accuracy.

Nock et al. (2003) tested three different approaches: the mutual information, assuming either joint Gaussian distributions or discrete distributions, and a specific measure adequate to the speaker case, which was based on the modeling of audio-visual features using HMMs. This work represents the first exhaustive evaluation of the performance of several audio and video representations and fusion methods in this domain. The representations that they tested were the Discrete Cosine Transform (DCT), pixel intensities and pixel intensity changes for the video signal, and the Mel-Frequency Cepstral Coefficients (MFCC) for the audio signal. The analyzed sequences belong to the CUAVE audio-visual database (Patterson et al., 2002) and consist of two persons speaking in turns. The authors concluded that the pixel intensity changes together with the Gaussian MI were the most suitable combination for the speaker localization task.

Reikeras et al. (2010) proposed a basic AVASR system that uses MFCCs as acoustic features, Active Appearance Model (AAM) parameters as visual features, and Dynamic Bayesian Networks (DBNs) as probabilistic models of audio-visual speech.

Galatas et al. (2012) built on their earlier work by incorporating facial depth data of a speaker, captured by the Microsoft Kinect device, as a third data stream in an audio-visual automatic speech recognizer. The results obtained in this work showed an improvement of system performance due to the depth modality, as well as a considerable accuracy increase when using both the visual and depth modalities over audio-only speech recognition.

Lip motion and audio analysis provide a large amount of information that can be integrated to produce more robust audiovisual Voice Activity Detection (VAD) schemes, as discussed in the recent work of (Minotto et al., 2013).

The HMM is considered a basic component of speech recognition systems (Rabiner and Juang, 1993). The estimation of the model parameters affects the performance of the recognition process; it is performed so that the recognition error is minimized. HMM parameters are determined during an iterative process called the "training process". One of the conventional methods applied to set the HMM parameter values is the Baum-Welch algorithm. One drawback of this method is that it generally converges to a local optimum. Several stochastic and non-stochastic versions of this algorithm have been proposed, but none of them has resulted in a satisfactory version that allows the exploration of a large search space. A GA computes several solutions simultaneously, instead of the single one handled by the EM algorithm. At each step of the GA, a new population is created by combining individuals of the current population and applying the genetic operators of selection, crossover and mutation. The proposed models as well as the different parts of the system are described in the following paragraphs.

3 Proposed system overview

Human speech perception is bimodal in nature: humans combine audio and visual information in deciding what has been spoken, especially in noisy environments. The benefit of the visual modality to speech intelligibility in noise has been quantified as far back as the works reviewed in (Potamianos et al., 2003; Iwano et al., 2007).

AVASR introduces new and challenging tasks compared to traditional, audio-only ASR (Potamianos et al., 2003). Our system consists of three modules: the acoustic recognition module, the visual recognition module and the fusion module.

The acoustic recognition module uses a stochastic approach based on HMMs that are optimized by a GA. The generic process of this module is based on three stages: the parametrization of the acoustic signal, in our case using the RelAtive SpecTral Analysis-Perceptual Linear Predictive (RASTA-PLP) method; the learning of models based on a genetic search for the optimal parameters of the best model among a heterogeneous population of HMMs (containing different architectures), together with an optimization of the model by a gradient-type algorithm (Baum-Welch); and decoding by the Viterbi algorithm. The visual recognition module uses the same stochastic approach; it differs only in the parameterization phase, which is based on the DCT, as shown in Figure 1.

Figure 1 Architecture of the proposed AVSR system

This section briefly introduces the main components of our proposed system, as well as the methods intended for their implementation.

3.1 The audiovisual databases used

In this paper two databases are used to test our proposed Audio-Visual Automatic Speech Recognition system.

3.1.1 CUAVE database

In order to test our system we used the Clemson University Audio-Visual Experiments (CUAVE) database (Patterson et al., 2002). It was created by the Digital Speech and Audio Processing Group at Clemson University, which
distributes it on one DVD, royalty-free, for research purposes. It is an audio-visual database which contains over 7,000 utterances of both connected and isolated digits, recorded by 36 individual speakers (17 female and 19 male) and 20 pairs of speakers. It is aimed at testing multi-speaker solutions. In addition, it includes both still and moving speakers, in order to support robustness to speaker movement.

The database contains around 3 hours of speech recorded through a Mini DV camera. The video was then compressed into MPEG-2 files (stereo audio at a 44 kHz sampling rate, 16-bit). It also includes audio files checked for synchronization (mono, 16 kHz sampling rate, 16-bit) and annotation files. Figure 2 shows some snapshots from the CUAVE database.

Figure 2 Snapshots from the CUAVE database

Various works have used this database in order to evaluate their proposed systems. For example, in Patterson et al. (2002) the authors introduce this challenging audio-visual database (the CUAVE database), which is flexible and fairly comprehensive. An image-processing-based contour technique, an image-transform method, and a deformable template scheme are also compared in that work to obtain visual features. The work further presents methods and results in an attempt to make these techniques more robust to speaker movement, and initial speaker-independent baseline results are included using all speakers.

The work in (Reikeras et al., 2010) proposes a basic AVASR system that uses DBNs for modeling the audio-visual speech and the CUAVE database in order to increase performance, in particular in noisy acoustic environments. We have therefore chosen this database to test our system and to compare its performance with these works, in order to stimulate research and fuel advances in speech processing.

3.1.2 Audiovisual Arabic database

In this first work, we have also used our own database of Audio-Visual ARaBic speech (the AVARB database). This multi-speaker database was recorded in a real environment (a very noisy classroom). It contains pronunciations of isolated Arabic words, recorded at a sampling frequency of 16 kHz. Video data were captured at a 690x340 pixel resolution and a 30 frames/s frame rate, with variations of pose (profile view and front view), and saved in AVI format. The database was collected from 18 speakers (2 female and 16 male) from different regional dialects, and each speaker pronounces each word 9 times (the Arabic numerals from zero to nine) with different modes of pronunciation (normal, slow and fast). The distance between the camera and the speaker as well as the luminance are adjustable, to add more diversity to the audio and visual streams during learning; the average distance is 16.5 cm. In our basic corpus, which contains only isolated words, the length of each recording is 2 seconds, which is enough time to utter a word slowly in Arabic. Figure 3 shows examples of frames from some individuals in our AVARB data corpus.

Figure 3 Some examples of frames from our AVARB database

3.2 The acoustic feature extraction

Recognizing speech requires extracting a set of acoustic parameters from the recorded signal. This parameterization must be performed on successive frames of speech of short duration (typically 10 to 30 ms), over which the signal can generally be considered quasi-stationary. To improve the analysis and to limit edge effects, the frames are weighted by a temporal window whose extremities are flattened, to minimize discontinuities in the signal that could introduce artifacts in the spectra (Harris, 1978). Finally, successive time windows partly overlap.

In this work, the feature extraction step is accomplished using the RASTA-PLP technique, which is an improvement of the traditional Perceptual Linear Prediction (PLP) method (Hermansky et al., 1992). It consists in a special filtering of the different frequency channels of a PLP analyzer. This filtering is done to make the speech analysis less sensitive to slowly changing or steady-state factors in speech. For the RASTA-PLP features, an additional filtering is applied after decomposition of the spectrum into critical bands. This RASTA filter removes the low modulation frequencies, which are supposed to stem from channel effects rather than from speech characteristics. The process of computing the RASTA-PLP coefficients is described in Figure 4.

Figure 4 Steps of RASTA-PLP analysis
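As an illustration of the framing and windowing stage described above, the following is a minimal NumPy sketch (not the authors' Matlab/HTK implementation). The 25 ms frame length and 10 ms step echo the values reported in Section 4; whether the reported 0.010 s figure is the hop or the overlap is an assumption made here, and the function name is purely illustrative. The RASTA-PLP processing proper (critical-band integration, RASTA filtering, PLP modeling) would then be applied to these frames.

```python
import numpy as np

def frame_signal(signal, fs=16000, frame_len_s=0.025, frame_shift_s=0.010):
    """Cut a speech signal into short, overlapping, Hamming-windowed frames.

    Only the framing/windowing stage of the acoustic analysis is sketched;
    the RASTA-PLP coefficients are computed from such frames.
    """
    frame_len = int(round(frame_len_s * fs))
    frame_shift = int(round(frame_shift_s * fs))
    window = np.hamming(frame_len)            # tapered window to limit edge effects
    n_frames = max(0, 1 + (len(signal) - frame_len) // frame_shift)
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        start = i * frame_shift
        frames[i] = signal[start:start + frame_len] * window
    return frames
```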


3.3 The visual feature extraction

It is well known that the zone of the body most responsible for conveying meaning is undoubtedly the face. The mouth produces speech, eye position provides information on the object or area being observed, and expression wrinkles mirror our emotions. In short, the face is the center of human communication.

In the speech recognition process, the focus is on the lips, because they constitute a large part of our research for AVASR. For the face detection step, the Viola and Jones (2001) face detector algorithm is used. We have chosen this method, based on Haar-like features, because it combines performance and simplicity; indeed, it guarantees an acceptable detection rate with a relatively low error rate. The face detector is applied over the detected skin regions to detect the target face, and a rectangular region is constructed with the same width as the detected face and a height four times the length of the face of the target model, as shown in Figure 5. The Viola-Jones face detector uses Haar-like features (Papageorgiou et al., 1998), which are reminiscent of Haar basis functions, to train the stage classifiers of the cascaded classifier. The Haar-like features are predefined and computed directly on the integral image of the gray-level image.

Figure 5 An example of face detection: (a) original image; (b) skin detection with noise suppression and (c) face detection result

A typical human face follows a set of anthropometric standards, which have been utilized to narrow the search for a particular facial feature to smaller regions of the face (Khandait et al., 2009). The possibility of using the distinct hue of the lips, taking into account the reflected light, is implemented; this point is fetched by a defined hue value. In contrast to other methods (Potamianos et al., 2004), this method is not light independent, so the intensity and direction of the light can influence the results.

In our work we use the following generic steps for facial feature detection and extraction from the localized face image, as shown in Figure 6 (a minimal sketch of these steps is given after the figure caption):

- For a color image, convert it into a gray-scale image and adjust the intensity of both types of images.
- Find the gradient of the region of interest (ROI) in the detected image using the Sobel/Prewitt edge detection operator, then take the lower part of the face and project it vertically to get the mouth localization.
- Draw a rectangular box on each of the detected feature components.

Figure 6 Examples of detected mouth regions from: (a) the CUAVE database and (b) the AVARB database
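The following NumPy sketch illustrates the second of the generic steps above (Sobel gradient of the lower face, then a vertical projection to locate the mouth band). It is an illustrative reading of those steps, not the authors' implementation: the function names, the choice of the strongest-edge row as the mouth center and the band height heuristic are assumptions.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def _filter2d(img, kernel):
    """Tiny 'valid' cross-correlation (the sign flip versus a true convolution
    is irrelevant here, since only the gradient magnitude is used)."""
    h, w = kernel.shape
    out = np.zeros((img.shape[0] - h + 1, img.shape[1] - w + 1))
    for i in range(h):
        for j in range(w):
            out += kernel[i, j] * img[i:i + out.shape[0], j:j + out.shape[1]]
    return out

def locate_mouth(face_gray):
    """Rough mouth localization in a detected gray-level face image.

    Returns (row_start, row_end) of the mouth band, in lower-face coordinates.
    """
    lower = face_gray[face_gray.shape[0] // 2:, :]   # lower part of the face
    gx = _filter2d(lower, SOBEL_X)
    gy = _filter2d(lower, SOBEL_Y)
    magnitude = np.hypot(gx, gy)                     # Sobel edge strength
    profile = magnitude.sum(axis=1)                  # vertical projection
    center = int(np.argmax(profile))                 # strongest edge row
    half_height = max(1, lower.shape[0] // 6)        # heuristic band height (assumption)
    return max(0, center - half_height), min(lower.shape[0], center + half_height)
```

The returned band would then be enclosed in the rectangular box of the third step to form the mouth ROI.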
Once the ROI is isolated, it is recommended to extract the useful information using a minimum number of attributes, to avoid the statistical modeling difficulties caused by a high-dimensional attribute space.

To characterize the video signals, in this work we use the coefficients of the upper-left corner of the DCT of the resulting ROI image, according to the following relation (Gupta and Garg, 2012):

F(u, v) = (1 / sqrt(MN)) alpha(u) alpha(v) SUM_{x=1..M} SUM_{y=1..N} f(x, y) cos[(2x+1)u*pi / 2M] cos[(2y+1)v*pi / 2N]   (1)

where M and N are the dimensions of the image, u is the horizontal spatial frequency, v is the vertical spatial frequency, f(x, y) is the pixel value at coordinates (x, y), F(u, v) is the DCT coefficient at coordinates (u, v) of the MxN transform, and alpha() is defined as follows:

alpha(w) = 1/sqrt(2) if w = 1, and alpha(w) = 1 otherwise.   (2)

Figure 7 Selection process of the DCT coefficients, with a sample from: (a) the CUAVE database and (b) the AVARB database

The indices of the most energetic coefficients are obtained on a training set, and the visual DCT coefficients are subsequently extracted by applying the DCT to each frame of the video, as shown in Figure 7. Only the corresponding coefficients are retained (in our case, we kept the 100 most energetic coefficients).
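A minimal NumPy sketch of equations (1)-(2) and of the coefficient selection follows. It is illustrative only: indices are zero-based (the paper counts from 1), the "most energetic" criterion is taken here as the largest summed squared DCT value over the training ROIs, all ROIs are assumed to share the same size, and the function names are hypothetical.

```python
import numpy as np

def dct2(img):
    """2-D DCT of a gray-level ROI image, following equations (1)-(2)."""
    M, N = img.shape
    x = np.arange(M).reshape(-1, 1)
    y = np.arange(N).reshape(-1, 1)
    u = np.arange(M).reshape(1, -1)
    v = np.arange(N).reshape(1, -1)
    Cx = np.cos((2 * x + 1) * u * np.pi / (2 * M))        # cosine basis along rows
    Cy = np.cos((2 * y + 1) * v * np.pi / (2 * N))        # cosine basis along columns
    alpha_u = np.where(np.arange(M) == 0, 1 / np.sqrt(2), 1.0)
    alpha_v = np.where(np.arange(N) == 0, 1 / np.sqrt(2), 1.0)
    F = Cx.T @ img @ Cy                                    # double sum over x and y
    return (np.outer(alpha_u, alpha_v) * F) / np.sqrt(M * N)

def select_energetic_indices(training_rois, keep=100):
    """Indices of the most energetic DCT coefficients over a training set."""
    energy = sum(dct2(roi) ** 2 for roi in training_rois)
    flat = np.argsort(energy.ravel())[::-1][:keep]
    return np.unravel_index(flat, energy.shape)

def dct_features(roi, indices):
    """Visual feature vector: the retained DCT coefficients of one ROI frame."""
    return dct2(roi)[indices]
```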
3.4 The hybrid GA/HMM modeling

The HMM is a ubiquitous tool for representing probability distributions over sequences of observations, and it has emerged as the predominant technology in speech recognition in recent years. Indeed, it has proved to be well adapted to the speech recognition domain and to image modeling (Fujiwara et al., 2008).

By definition, an HMM is a stochastic process determined by two interrelated mechanisms: a Markov chain, which determines the state at time t, S_t = s_t, and a state-dependent process, which generates the observation O_t = o_t depending on the current state s_t.

An HMM is usually represented by three sets of probability distributions, defined as follows:

- Pi: the initial state probabilities,
  pi_i = P(q_1 = i), 1 <= i <= N   (3)
  where N is the number of states in the model.

- A: the transition probability matrix, A = {a_ij},
  a_ij = P(q_{t+1} = j | q_t = i), 1 <= i, j <= N   (4)
  with q_t the current state and q_{t+1} the next state.

- B: the emission probabilities, B = {b_i(o_t)}, where b_i(o_t) is the probability of the observation o_t being generated from state i,
  b_i(o_t) = P(o_t = j | q_t = i), 1 <= j <= M, 1 <= i <= N   (5)
  where M is the number of observation symbols, found by the K-means algorithm (see (Kanungo et al., 2002) for more details).
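The discrete observation symbols referred to in equation (5) come from vector-quantizing the feature vectors with K-means. The sketch below uses plain Lloyd iterations rather than the more efficient filtering variant of Kanungo et al. (2002); it is an illustrative Python stand-in for the vector quantization phase, with hypothetical function names.

```python
import numpy as np

def kmeans_codebook(features, n_symbols, n_iter=50, seed=0):
    """Build a codebook of n_symbols centers from (T, D) feature vectors."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), n_symbols, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign every feature vector to its nearest center
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned vectors
        for k in range(n_symbols):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return centers

def quantize(features, centers):
    """Map each feature vector to its nearest codebook index, i.e. a symbol o_t."""
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)
```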
There are three fundamental problems related to the HMM: the evaluation problem, the problem of determining the state path, and the learning problem (supervised or unsupervised). Some other problems related to HMMs are: the overflow of number representations in the machine, insufficient data for learning, the updating of models when processes vary over time, the choice of the HMM architecture best adapted to the data, and the choice of a good initial estimate of the probabilities of the HMM. The Forward algorithm allows the calculation of the likelihood of the observations, the Viterbi algorithm finds the optimal (most likely) state path, and the Baum-Welch algorithm performs the learning (re-estimation of the HMM parameters). These algorithms are detailed in (Rabiner and Juang, 1993).
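Since the GA fitness described in Section 3.4 relies on the likelihood P(O|lambda), a minimal sketch of a discrete HMM lambda = (Pi, A, B) and of the scaled Forward recursion is given below. It is a NumPy illustration, not the Matlab/HTK implementation used in the actual system; the class and method names are assumptions.

```python
import numpy as np

class DiscreteHMM:
    """A discrete HMM lambda = (Pi, A, B), as defined by equations (3)-(5)."""

    def __init__(self, n_states, n_symbols, seed=0):
        rng = np.random.default_rng(seed)
        self.pi = self._stochastic(rng.random(n_states))              # eq. (3)
        self.A = self._stochastic(rng.random((n_states, n_states)))   # eq. (4)
        self.B = self._stochastic(rng.random((n_states, n_symbols)))  # eq. (5)

    @staticmethod
    def _stochastic(m):
        """Normalize rows so that each distribution sums to one."""
        m = np.atleast_2d(m)
        m = m / m.sum(axis=1, keepdims=True)
        return m[0] if m.shape[0] == 1 else m

    def log_likelihood(self, obs):
        """Scaled Forward algorithm: log P(O | lambda) for a symbol sequence."""
        alpha = self.pi * self.B[:, obs[0]]
        log_p = 0.0
        for t in range(len(obs)):
            if t > 0:
                alpha = (alpha @ self.A) * self.B[:, obs[t]]
            scale = alpha.sum()
            log_p += np.log(scale)
            alpha = alpha / scale          # rescale to avoid numerical underflow
        return log_p
```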
EM methods like the Baum-Welch algorithm are known to converge to an optimum of the solution space, but nothing proves that this optimum is the global one. In fact, these algorithms tend to converge to the local optimum closest to the starting values of the optimization procedure. The traditional remedy to this problem consists in running the EM algorithm several times, using different sets of random starting values, and keeping the best solution. More advanced versions of the EM algorithm (ECM, SEM, ...) can also be used, but they do not constitute a definitive answer to the local optima problem (Sun and Yu, 2007). In this article, we present a hybrid GA/HMM algorithm to address this problem.

Another reason to present this hybrid algorithm is the slow convergence and high computational power needed by a classical GA, especially when the generated chromosomes cannot satisfy the constraints of the HMM parameters. For this reason, such chromosomes are not included in the offspring and are replaced by new chromosomes.

The principle of the proposed GA/HMM algorithm is as follows: a GA manipulates a population of HMMs consisting of individuals whose architecture is not unique. The algorithm looks for optimal HMMs, i.e. the HMM with the highest probability of generating a given observation O. This probability can be rapidly calculated by the Baum-Welch algorithm. Local optimization is performed by the Baum-Welch algorithm, in conjunction with genetic operators specialized for HMMs (Oudelha and Ainon, 2010; Xueying et al., 2007; Goh et al., 2010). Among these operators we find, for example, the crossover operator, which integrates the stochasticity constraints that apply to the HMM. There is also a normalization operator, used to restore the same constraints after the mutation phase. We give below an outline of the GA/HMM algorithm; the marker "parent" is used simply to process only the individuals required during optimization and evaluation.

1- Initialization: Create a population of size S randomly. The most natural encoding is to build the chromosome by reorganizing all the coefficients of the HMM; the simplest way is to juxtapose all the rows of all the matrices. We thus obtain a real-valued coding while respecting the constraints related to the HMM. The representation of a population is given in Figure 8.

Figure 8 Chromosome representation method in the GA/HMM training

pi_1 ... pi_N | a_1,1 ... a_1,N | a_2,1 ... a_N,N | b_1,1 ... b_1,M | b_2,1 ... b_N,M

2- Optimization: Apply the Baum-Welch algorithm, using the observation O, to each HMM of the population not marked "parent".

3- Evaluation: The quality of an individual (also called its fitness) describes its adequacy to its environment; more precisely, the better a solution the individual is to the problem, the higher its score. In the problem of optimizing an HMM, we want to quantify the ability of an HMM to learn an observation. In our GA, the fitness values are the results of the objective function. The likelihood P(o_j | lambda_i) is an appropriate criterion to use in the objective function to determine the quality of the chromosomes. The probability P(o | lambda) is calculated by the maximum likelihood method or the Baum-Welch algorithm, which maximize the probability that the given HMM lambda_i generated the training utterances o_j:

f(lambda_i) = P_n / SUM_{i=1..N} P_i   (6)

P_n = (1 / M) SUM_{i=1..M} log P(o_i | lambda_i)   (7)

It is proven that the Baum-Welch algorithm leads to a local maximum of the function f(lambda_i). However, it is possible that other, better maxima of f(lambda_i) exist for a given training set. In this paper we try to overcome this problem by using a genetic algorithm for the maximization of f(lambda_i) (Goh et al., 2010).

4- Selection: Among all the individuals of the population, select a number S' < S which will be used as parents to regenerate the S - S' other individuals not selected. The selection is done according to the best scores calculated in step 3. Each selected individual is marked "parent".

5- Crossover/recombination: For each individual not marked "parent", randomly select two individuals from those marked "parent" and cross them. The crossover uses a single crossover point and is realized between two rows of the matrices of the HMM, which yields two children; only one of the two children is retained, at random.

6- Mutation/normalization: To each individual not marked "parent" we apply the mutation operator. It consists in modifying each coefficient of the matrices of the HMM by a small random quantity; each coefficient is modified according to the value of the mutation probability. After treating an individual, we apply a normalization operator to it, to ensure that the individual still satisfies the constraints of the HMM;
we must verify that the matrices of the HMM are stochastic. This operator is applied after the mutation operation because the subsequent operations imperatively work on HMMs.

7- Evaluation of the stop condition: If the maximum number of iterations has not been reached, return to step 2; otherwise go to step 8.

8- Finally, return the best HMM of the current population.

Note that this algorithm has been adapted to the optimization over sets of observation vectors: the re-estimation is made so that the HMM has a maximal probability of generating the whole set of vectors. A compact sketch of this training loop is given below.
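The following sketch traces steps 1-8 above in Python, reusing the DiscreteHMM class from the earlier sketch. Several simplifications are assumptions of this illustration, not of the paper: the Baum-Welch re-estimation of step 2 is only indicated by a comment (HTK plays that role in the actual system), the population is kept homogeneous in its number of states (the paper evolves heterogeneous architectures), and the fitness used here is the average log-likelihood P_n of equation (7) rather than the full ratio of equation (6).

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(hmm, observations):
    """Average log-likelihood of the training utterances, in the spirit of eq. (7)."""
    return np.mean([hmm.log_likelihood(o) for o in observations])

def normalize(hmm):
    """Normalization operator: restore row-stochasticity after mutation (step 6)."""
    hmm.pi = np.abs(hmm.pi) / np.abs(hmm.pi).sum()
    hmm.A = np.abs(hmm.A) / np.abs(hmm.A).sum(axis=1, keepdims=True)
    hmm.B = np.abs(hmm.B) / np.abs(hmm.B).sum(axis=1, keepdims=True)

def mutate(hmm, pm):
    """Mutation operator: perturb each coefficient with probability pm (step 6)."""
    for m in (hmm.pi, hmm.A, hmm.B):
        mask = rng.random(m.shape) < pm
        m += mask * rng.normal(0.0, 0.01, m.shape)
    normalize(hmm)

def crossover(pa, pb):
    """Single-point crossover between matrix rows of two same-sized parents (step 5)."""
    child = DiscreteHMM(pa.A.shape[0], pa.B.shape[1])
    cut = rng.integers(1, pa.A.shape[0])
    child.pi = pa.pi.copy()
    child.A = np.vstack([pa.A[:cut], pb.A[cut:]])
    child.B = np.vstack([pa.B[:cut], pb.B[cut:]])
    return child          # only one of the two possible children is kept

def ga_hmm_train(observations, n_states=5, n_symbols=7, pop_size=20,
                 n_parents=10, pc=0.7, pm=0.01, n_generations=50):
    """Steps 1-8 of the GA/HMM algorithm (Baum-Welch re-estimation left out)."""
    population = [DiscreteHMM(n_states, n_symbols, seed=i) for i in range(pop_size)]  # step 1
    for _ in range(n_generations):
        # step 2 would re-estimate every non-parent HMM with Baum-Welch here
        scores = [fitness(h, observations) for h in population]       # step 3
        order = np.argsort(scores)[::-1]
        parents = [population[i] for i in order[:n_parents]]          # step 4
        offspring = []
        for _ in range(pop_size - n_parents):                         # steps 5-6
            pa, pb = rng.choice(n_parents, 2, replace=False)
            if rng.random() < pc:
                child = crossover(parents[pa], parents[pb])
            else:
                child = crossover(parents[pa], parents[pa])           # clone of one parent
            mutate(child, pm)
            offspring.append(child)
        population = parents + offspring                              # step 7: next generation
    return max(population, key=lambda h: fitness(h, observations))    # step 8
```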
3.5 The audio-visual fusion

In decision fusion (or late integration), the visual and auditory information are processed separately, and each is transmitted to a classifier, as in (Rogozan et al., 1997). The results at the output of the two recognition processes are fused in an integration module which provides the final result.

In this work, match-score-level fusion is used to combine the audio and visual identification outputs. Several decision fusion strategies have been tested (product, sum, minimum, maximum, vote, ...) and all show a significant improvement in results compared to the use of a single modality, which led us to focus in this work on the separate-fusion model, i.e. the fusion of the scores produced by each GA/HMM recognizer. Their sets of log-likelihoods can be combined using weights that reflect the reliability of each particular stream; the combined score then takes the following form (Islam and Rahman, 2010):

log P(o^AV | lambda) = w_A log P(o^A | lambda_A) + w_V log P(o^V | lambda_V)   (8)

where lambda_A and lambda_V are the acoustic and the visual GA/HMMs respectively, and log P(o^A | lambda_A) and log P(o^V | lambda_V) are their log-likelihoods. The reliability of each modality can be calculated by the measure that is most appropriate and performs best (Islam and Rahman, 2010): the average difference between the maximum log-likelihood and the other ones, which can be found as

R_s = (1 / (C - 1)) SUM_{i=1..C} ( max_j log P(o | lambda_j) - log P(o | lambda_i) )   (9)

where C is the number of classes considered to measure the reliability of each modality, and s belongs to {A, V}. After that, we can calculate the integration weight of the audio modality from its reliability measure R_A by:

w_A = R_A / (R_A + R_V)   (10)

where R_A and R_V are the reliability measures of the outputs of the acoustic and visual GA/HMMs respectively, and the weighting factor of the visual modality can be found from the relation:

w_A + w_V = 1, for 0 < w_A, w_V < 1   (11)
difference between the maximum log-likelihood and the generation.
other ones, can be found as,

Table 1 GA parameters training HMM for audio-only: (a) AVARB database (b) CUAVE database

(a) AVARB database
Number of clusters   Pc    Pm     Average P(o|lambda)
3                    0.5   0.01   -2.3630
5                    0.6   0.01   -1.5838
7                    0.7   0.01   -1.1396
9                    0.8   0.01   -3.3185
12                   0.9   0.01   -4.0122

(b) CUAVE database
Number of clusters   Pc    Pm     Average P(o|lambda)
3                    0.5   0.01   -3.7416
5                    0.6   0.01   -3.2604
7                    0.7   0.01   -3.4235
9                    0.8   0.01   -3.9134
12                   0.9   0.01   -4.3637

Table 2 GA parameters training HMM for video-only: (a) AVARB database (b) CUAVE database

(a) AVARB database
Number of clusters   Pc    Pm     Average P(o|lambda)
3                    0.5   0.01   -7.7629
4                    0.5   0.01   -7.0046
7                    0.8   0.01   -7.1555
9                    0.8   0.01   -7.6595
12                   0.9   0.01   -7.8234

(b) CUAVE database
Number of clusters   Pc    Pm     Average P(o|lambda)
3                    0.5   0.01   -5.1860
5                    0.6   0.01   -5.2987
7                    0.7   0.01   -5.4743
9                    0.8   0.01   -5.8747
12                   0.9   0.01   -6.0890
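As an illustration of how such a parameter sweep could be organized, the sketch below loops over (number of clusters, Pc) settings, quantizes the training features, runs the GA training several times and keeps the best average log P(o|lambda). It reuses the hypothetical helpers from the earlier sketches (kmeans_codebook, quantize, ga_hmm_train, fitness); the actual experiments reported in Tables 1-2 were run with Matlab and HTK.

```python
def sweep(train_features, train_utterances, settings, runs=15, pm=0.01):
    """Record the best average log P(o|lambda) for each (clusters, Pc) setting."""
    results = {}
    for n_clusters, pc in settings:                      # e.g. [(3, 0.5), (5, 0.6), ...]
        codebook = kmeans_codebook(train_features, n_clusters)
        obs = [quantize(u, codebook) for u in train_utterances]
        scores = [fitness(ga_hmm_train(obs, n_symbols=n_clusters, pc=pc, pm=pm), obs)
                  for _ in range(runs)]
        results[(n_clusters, pc)] = max(scores)
    return results
```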
We observe that the results vary according to the training parameters of the GA and to the number of clusters obtained by the vector quantization phase. For example, 7 clusters with Pc = 0.7 and Pm = 0.01 for the Arabic audio database, and 5 clusters with Pc = 0.6 and Pm = 0.01 for the Arabic visual database, are superior to all the other settings; we therefore use them in our GA/HMM. The same holds for the CUAVE audio database with 4 clusters, Pc = 0.6 and Pm = 0.01, while for the CUAVE visual database the best performance is obtained with 3 clusters, Pc = 0.5 and Pm = 0.01.

Figures 9 and 10 give the recognition rate with respect to the number of clusters used in the experiment.

Figure 9 Comparison of recognition rates of audio-only, video-only and audio-visual ASR for the CUAVE database by using: (a) an HMM-based recognizer and (b) the GA/HMM

Based on Figure 9, we can see that the recognition rates obtained with our GA/HMM are better in most cases than those obtained with the standard HMM. The figure also indicates that the AVASR system significantly outperforms the single-modality systems overall, achieving the highest recognition rates.

For the CUAVE database, the results show that the best average recognition rate reaches 86.8% using the standard HMM recognizer with 5 clusters for the clustering phase, and 98.1% using the GA/HMM recognizer with 3 clusters.

Figure 10 Comparison of recognition rates of audio-only, video-only and audio-visual ASR for the AVARB database by using: (a) an HMM-based recognizer and (b) the GA/HMM

In Figure 10, we note almost the same observations as before for our AVARB database: the best average recognition rates are 93.7% and 97.6% using the standard HMM and the hybrid GA/HMM recognizers respectively, with 7 clusters for both.

More generally, we found a percentage increase of almost 5% to 28% across our test results. However, this rise in the recognition rate is not guaranteed as the size of the population increases: the rates obtained can be worse than, or the same as, those of the standard HMM system before optimization. This is due to the random character of the GA and to the fact that this system uses the standard general replacement process.

5 Conclusions

In this primary work, we presented a new AVASR system which uses DCT coefficients for the visual feature extraction and RASTA-PLP for the acoustic feature extraction. To overcome the drawbacks of the traditional training of HMMs using the Baum-Welch algorithm, which are mainly the convergence to the local optimum closest to the starting values of the optimization procedure and an estimation of the recognizer parameters that requires careful examination, we chose to use a hybrid GA/HMM algorithm for modeling the audio-visual speech.

Two databases were used to experiment with our AVASR system: the CUAVE database and our AVARB database. From the several test results, we conclude that the systems modeled by HMMs and trained by our GA/HMM training have higher recognition rates than the HMMs trained by the Baum-Welch algorithm.

For future work, we are planning to cover more issues concerning the improvement of the performance of the proposed system. Finally, we also intend to test our system with other alternative recognition methods and compare them with our proposed GA/HMM algorithm.
References

Potamianos, G., Neti, C., Gravier, G., Garg, A. and Senior, A.W. (2003) 'Recent advances in the automatic recognition of audiovisual speech', Proceedings of the IEEE, Vol. 91, No. 9, pp.1306–1326.

Man, K.F., Tang, K.S. and Kwong, S. (1996) 'Genetic Algorithms: Concepts and Applications', IEEE Transactions on Industrial Electronics, Vol. 43, No. 5, pp.519–532.

McGurk, H. and MacDonald, J. (1976) 'Hearing lips and seeing voices', Nature, Vol. 264, No. 5588, pp.746–748.

Petajan, E.D. (1984) 'Automatic lipreading to enhance speech recognition', Proceedings of the IEEE Communication Society Global Telecommunications Conference, Atlanta, Georgia.

Elmahdy, M. and Gruhn, R. (2009) 'Modern standard Arabic based multilingual approach for dialectal Arabic speech recognition', Eighth International Symposium on Natural Language Processing, pp.169–174.

Iwano, K., Yoshinaga, T., Tamura, S. and Furui, S. (2007) 'Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images', EURASIP Journal on Audio, Speech, and Music Processing, Vol. 2007, No. 1, pp.4–4.

Zhi-yi, Q., Yu, L., Li-hong, Z. and Ming-xin, S. (2006) 'Hybrid SVM/HMM architectures for speech recognition', Proceedings of the First International Conference on Innovative Computing, Information and Control, Vol. 2, pp.100–104.

Zhao, G., Pietikäinen, M. and Hadid, A. (2007) 'Local Spatiotemporal Descriptors for Visual Recognition of Spoken Phrases', Proceedings of the International Workshop on Human-Centered Multimedia, pp.57–66.

Pao, T.L., Liao, W.Y., Wu, T.N. and Lin, C.Y. (2009) 'Automatic Visual Feature Extraction for Mandarin Audio-Visual Speech Recognition', Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, San Antonio, TX, USA, pp.2936–2940.

Nock, H.J., Iyengar, G. and Neti, C. (2003) 'Speaker localisation using audio-visual synchrony: an empirical study', Proceedings of the International Conference on Image and Video Retrieval (CIVR), Vol. 2728, pp.488–499.

Patterson, E.K., Gurbuz, S., Tufekci, Z. and Gowdy, J.N. (2002) 'Moving-talker, speaker-independent feature study and baseline results using the CUAVE multimodal speech corpus', EURASIP Journal on Applied Signal Processing, Vol. 11, pp.1189–1201.

Reikeras, H., Herbst, B.M., du Preez, J. and Engelbrecht, H. (2010) 'Audio-Visual Automatic Speech Recognition using Dynamic Bayesian Networks', Proceedings of the Twenty-First Annual Symposium of the Pattern Recognition Association of South Africa, pp.233–238.

Galatas, G., Potamianos, G. and Makedon, F. (2012) 'Audio-visual speech recognition using depth information from the Kinect in noisy video conditions', Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments (PETRA'12), Crete, Greece, Article No. 2.

Minotto, V.P., Lopes, C.B.O., Scharcanski, J., Jung, C.R. and Lee, B. (2013) 'Audiovisual Voice Activity Detection Based on Microphone Arrays and Color Information', IEEE Journal of Selected Topics in Signal Processing, Vol. 7, No. 1, pp.147–156.

Rabiner, L. and Juang, B. (1993) Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ.

Xue-ying, Z., Yiping, W. and Zhefeng, Z. (2007) 'A Hybrid Speech Recognition Training Method for HMM Based on Genetic Algorithm and Baum Welch Algorithm', IEEE 2nd International Conference on Innovative Computing, Information and Control (ICICIC'07), p.572.

Pérez, Ó., Piccardi, M. and García, J. (2007) 'Comparison between genetic algorithms and the Baum-Welch algorithm in learning HMMs for human activity classification', Proceedings of EvoWorkshops'07, pp.399–406.

Goh, J., Tang, L. and Al turk, L. (2010) 'Evolving the Structure of Hidden Markov Models for Micro aneurysms Detection', UK Workshop on Computational Intelligence (UKCI), pp.1–6.

Viola, P.A. and Jones, M.J. (2001) 'Rapid Object Detection using a Boosted Cascade of Simple Features', Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'01), Vol. 1, pp.511–518.

Papageorgiou, C.P., Oren, M. and Poggio, T. (1998) 'A General Framework for Object Detection', Sixth International Conference on Computer Vision (ICCV'98), pp.555–562.

Khandait, S.P., Khandait, P.D. and Thool, R.C. (2009) 'An Efficient Approach to Facial Feature Detection for Expression Recognition', International Journal of Recent Trends in Engineering, Vol. 2, No. 1, pp.179–182.

Potamianos, G., Neti, C., Luttin, J. and Matthews, I. (2004) 'Audio-visual automatic speech recognition: an overview', in Bailly, G., Vatikiotis-Bateson, E. and Perrier, P. (Eds.): Issues in Visual and Audio-Visual Speech Processing, MIT Press, Chapter 10.

Harris, F.J. (1978) 'On the use of windows for harmonic analysis with the discrete Fourier transform', Proceedings of the IEEE, Vol. 66, pp.51–83.

Hermansky, H., Morgan, N., Bayya, A. and Kohn, P. (1992) 'RASTA-PLP Speech Analysis', IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp.121–124.

Gupta, M. and Garg, A.K. (2012) 'Analysis of image compression algorithm using DCT', International Journal of Engineering Research and Applications (IJERA), Vol. 2, No. 1, pp.515–521.

Fujiwara, Y., Sakurai, Y. and Yamamuro, M. (2008) 'SPIRAL: efficient and exact model identification for hidden Markov models', Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp.247–255.

Sun, K. and Yu, J. (2007) 'Video Affective Content Recognition Based on Genetic Algorithm Combined HMM', Proceedings of the 6th International Conference on Entertainment Computing, pp.249–254.

Oudelha, M. and Ainon, R.N. (2010) 'HMM parameters estimation using hybrid Baum-Welch genetic algorithm', International Symposium in Information Technology (ITSim), Vol. 2, pp.542–545.

Kanungo, T., Netanyahu, N.S. and Wu, A.Y. (2002) 'An Efficient k-Means Clustering Algorithm: Analysis and Implementation', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, pp.881–892.

Rogozan, A., Deléglise, P. and Alissali, M. (1997) 'Adaptive determination of audio and visual weights for automatic speech recognition', ESCA Workshop on Audio-Visual Speech Processing (AVSP'97), pp.61–64.

Islam, R. and Rahman, F. (2010) 'Likelihood Ratio Based Score Fusion for Audio-Visual Speaker Identification in Challenging Environment', International Journal of Computer Applications, Vol. 6, No. 7, pp.6–11.

Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X.A., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V. and Woodland, P. (2005) The HTK Book (for HTK Version 3.4).
