2000 - Albesano - Hybrid HMM-NN Modeling of Stationary-Transitional Units for Continuous Speech Recognition
www.elsevier.com/locate/ins
Abstract
This paper describes the benefits in recognition accuracy that can be achieved in a hybrid Hidden Markov Model – Neural Network (HMM–NN) recognition framework by using context-dependent subword units named Stationary–Transitional Units. These units are made up of the stationary parts of the context-independent phonemes plus all the admissible transitions between them; they have good generalization capability and capture wide acoustic detail. These units are well suited to modeling with neural networks, can enhance the performance of hybrid HMM–NN systems, and represent a real alternative to context-independent phonemes. The efficacy of Stationary–Transitional Units is verified for the Italian language on isolated and continuous speech recognition tasks extracted from a real application for telephone vocal access to railway timetables. The results show that a significant improvement is achieved with respect to the use of context-independent phonemes. © 2000 Elsevier Science Inc. All rights reserved.
1. Introduction
* Corresponding author. Fax: +39-011-2286207. E-mail address: dario.albesano@cselt.it (D. Albesano).
0020-0255/00/$ - see front matter © 2000 Elsevier Science Inc. All rights reserved.
PII: S0020-0255(99)00106-1
D. Albesano et al. / Information Sciences 123 (2000) 3–11
to form the words of any given vocabulary. Subword unit modeling is thus mandatory for large vocabulary recognition systems, as well as for flexible vocabulary applications. Phonemes, the basic sounds of a language, behave differently depending on the acoustic context in which they are uttered; for this reason, context-sensitive phonetic models (triphones) are generally used in HMM frameworks to take coarticulation effects into account.
Hybrid HMM–NN models have been investigated by several research teams: Franzini et al. [5] and Haffner et al. [7] introduced Connectionist Viterbi Training to enhance HMM-based continuous speech recognition; Bourlard and Morgan [3] proposed connectionist probability estimation to overcome some limitations of HMM-only recognizers; Robinson [10] and Hochberg et al. [8] described a recurrent neural network/HMM approach to large-vocabulary, speaker-independent speech recognition; and the authors introduced a phoneme-based hybrid HMM–NN model [6] whose training procedure employs an integrated gradual movement of bootstrap speech segmentations.
A relevant difference between the HMM and the hybrid HMM–NN approach to open vocabulary speech recognition is that the former can employ context-dependent acoustic units such as accurate triphones or very general diphones, while the latter is limited to context-independent phonemes: its intrinsically discriminant training cannot model classes of sounds that are not fully separated, such as those modeled with context-dependent units.
To overcome this problem, we propose in this paper the use of Stationary–Transitional Units (STUs), a robust set of context-dependent units for open vocabulary speech recognition recently introduced by Fissore et al. [4] in the HMM framework. The features of these units are well suited to neural network discriminant training and modeling, and can enhance the performance of open vocabulary hybrid HMM–NN models. This hypothesis is verified for the Italian language by experimenting with STUs in the framework of hybrid HMM–NN models for telephone-quality isolated and continuous speech recognition, and by comparing the results with the standard context-independent phoneme approach.
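In hybrid HMM–NN decoding of the kind proposed by Bourlard and Morgan [3], the network's posterior estimates P(Q | X) are commonly turned into scaled likelihoods by dividing by the state priors P(Q) before Viterbi decoding. A minimal sketch of that conversion, assuming a `posteriors` array produced by the network and priors estimated from training-set state frequencies (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def scaled_likelihoods(posteriors, priors, floor=1e-8):
    """Convert NN posteriors P(Q|X) into scaled likelihoods.

    By Bayes' rule P(X|Q) = P(Q|X) * P(X) / P(Q); the P(X) term is
    the same for every state in a frame, so dividing the posterior
    by the prior is sufficient for Viterbi decoding.
    """
    return posteriors / np.maximum(priors, floor)

# Toy example: 3 frames, 2 automaton states.
post = np.array([[0.9, 0.1],
                 [0.6, 0.4],
                 [0.2, 0.8]])
priors = np.array([0.7, 0.3])
lik = scaled_likelihoods(post, priors)
```

The division rewards states that the network activates more strongly than their training-set frequency alone would predict.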
experimentally case by case, depending on the kind of acoustic units that are modeled. For whole-word models, recurrent networks have proved superior, whereas for subword units feedforward MLPs seem preferable [6].
An NNA has an input window comprising some contiguous frames of the sequence, one or more hidden layers, and an output layer where the activation of each unit estimates the probability P(Q | X) of the corresponding automaton state Q given the input window X.
An NNA has many degrees of freedom: the architecture of the MLP, the input window width, and the number of automaton states for the different words or phonemes employed.
The modeling of STUs with an NNA was performed by assigning one MLP output unit to each stationary and each transitional unit. An alternative is to assign two states to transitional units. A typical architecture for an NNA devoted to the Italian STUs is depicted in Fig. 1 and comprises:
· 39 parameters per frame (log energy, 12 cepstral coefficients, and their first- and second-order derivatives);
· a seven-frame-wide input window;
· a first hidden layer with 315 sigmoidal neurons: one central block and two context blocks of 105 neurons each (5 for E, ΔE and ΔΔE; 30 for Cep, ΔCep and ΔΔCep);
· a second hidden layer with 250 sigmoidal neurons, fully connected to the first hidden layer and to the output layer;
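With 39 parameters per frame and a seven-frame window, the network sees 7 × 39 = 273 input values per decision. A sketch of how such context windows could be assembled from a frame sequence; the edge-replication padding is an assumption for illustration, not a detail taken from the paper:

```python
import numpy as np

def context_windows(frames, width=7):
    """Stack each frame with its neighbours into one input vector.

    frames: (T, 39) array of acoustic parameters, one row per frame.
    Returns a (T, width * 39) array; sequence edges are padded by
    replicating the first and last frames.
    """
    half = width // 2
    padded = np.concatenate([np.repeat(frames[:1], half, axis=0),
                             frames,
                             np.repeat(frames[-1:], half, axis=0)])
    return np.stack([padded[t:t + width].ravel()
                     for t in range(len(frames))])

windows = context_windows(np.random.randn(100, 39))
# Each row now holds 7 * 39 = 273 values fed to the first hidden layer.
```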
4. Training NNA
5. Experimental activity
The Arvin data set has been used to train the NNA on both isolated and continuous speech tasks; it was collected over the Public Switched Telephone Network (PSTN) with a 300–3400 Hz band and was sampled at 8 kHz. Arvin consists of read speech, with speakers evenly distributed between males and females from many Italian regions and with different accents. 1136 speakers uttered 4875 continuous sentences and 3653 isolated words, all phonetically balanced. Utterances have been manually transcribed. The isolated component has been used to train the NNA for the Panda task, while both the isolated and continuous components have been used to train the NNA for the Dialogos-95 task.
The Panda test set was collected in the same way as the Arvin database and is made up of 17,444 isolated words from a vocabulary of 475 Italian railway station names, uttered by 1050 different speakers.
The Dialogos-95 test set is a spontaneous speech database collected from phone calls coming in from all over Italy during a field trial of the DIALOGOS® system [1]. It is made up of 2040 sentences from real dialogues in the domain of railway timetable access.
Three different experiments have been run with different sets of subword units or different subword-unit modeling: 27 Italian context-independent phonemes, and 379 Italian STUs with either one or two neurons per diphone-transition unit.
On the Dialogos-95 database the recognition phase uses a language model based on bigrams, with rescoring of the N-best hypotheses with trigrams.
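The bigram-plus-trigram setup is a two-pass scheme: a first pass produces N-best hypotheses under a bigram language model, and a second pass re-ranks them with trigram scores. A toy sketch of the rescoring step; the log-linear score combination, the weight, and all names here are assumptions for illustration, not the paper's actual configuration:

```python
def rescore_nbest(nbest, trigram_logprob, lm_weight=1.0):
    """Re-rank N-best hypotheses with a trigram language model.

    nbest: list of (words, acoustic_logprob) pairs from the first pass.
    trigram_logprob: function mapping a word sequence to an LM log-prob.
    Returns the word sequence with the best combined score.
    """
    rescored = [(words, ac + lm_weight * trigram_logprob(words))
                for words, ac in nbest]
    return max(rescored, key=lambda pair: pair[1])[0]

# Toy LM that simply penalizes each word by 0.5 log-prob units.
toy_lm = lambda words: -0.5 * len(words)
best = rescore_nbest([(["roma"], -10.0),
                      (["roma", "termini"], -9.0)], toy_lm)
```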
The recognition and understanding results obtained for the Panda and Dialogos-95 test sets are reported in Table 1. The WA column contains the Word Accuracy, the SA column the Sentence Accuracy, and the SU column the Sentence Understanding rate.
Table 1 shows the improvements obtained on both test sets. On the isolated Panda database, WA increases from 91.4% to 94.6%, yielding a 37% relative error reduction when two states per diphone-transition unit are used. On the continuous Dialogos-95 database, WA, SA and SU all improve when passing from 27 phonemes to 379 STUs: WA increases from 53.5% to 72.1% (an error reduction of 32%), and SA from 60.0% to 71.6%. After parsing and concept extraction, SU is 66.0% for phonemes and 78.0% for STUs, an error reduction of 35%. It can also be observed that the configuration with two states per diphone-transition unit consistently outperforms the configuration with one state per transition.
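The relative error reductions quoted above follow from comparing error rates (100% minus accuracy) before and after: for Panda, the WA gain from 91.4% to 94.6% means the word error rate falls from 8.6% to 5.4%. A quick check of that arithmetic:

```python
def relative_error_reduction(acc_old, acc_new):
    """Relative reduction of the error rate (100 - accuracy, in %)."""
    err_old, err_new = 100.0 - acc_old, 100.0 - acc_new
    return (err_old - err_new) / err_old

panda = relative_error_reduction(91.4, 94.6)
# → about 0.37, i.e. the 37% relative reduction reported for Panda
```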
Table 1
Recognition results for Panda and Dialogos-95

                                 Panda    Dialogos-95
                                 WA       WA      SA      SU
27 C.I. phonemes, 3 states       91.4     53.5    60.0    66.0
379 STUs, 1 state                93.5     72.1    71.6    78.0
379 STUs, 2 states/diphone       94.6     72.7    72.2    –
Fig. 2. Output probabilities generated by two NNAs, for context-independent phonemes (top) and STUs (bottom).
6. Conclusions
References
[1] D. Albesano, P. Baggia, M. Danieli, R. Gemello, E. Gerbino, C. Rullent, A robust system for human–machine dialogue in telephony-based applications, International Journal of Speech Technology 2 (1997) 101–111.
[2] D. Albesano, F. Mana, R. Gemello, Speeding up neural network execution: an application to speech recognition, in: Proceedings of the IEEE NNSP Workshop, Kyoto, Japan, 1996.
[3] H. Bourlard, N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, Dordrecht, 1993.
[4] L. Fissore, F. Ravera, P. Laface, Acoustic-phonetic modeling for flexible vocabulary speech recognition, in: Proceedings of EUROSPEECH '95, Madrid, Spain, September 1995, pp. 799–802.
[5] M.A. Franzini, K.F. Lee, A. Waibel, Connectionist Viterbi training: a new hybrid method for continuous speech recognition, in: Proceedings of ICASSP 90, Albuquerque, NM, April 1990, pp. 425–428.
[6] R. Gemello, D. Albesano, F. Mana, Context independent phoneme classification for open vocabulary recognition, CSELT Technical Report DTD 95.0230, March 1995.
[7] P. Haffner, M. Franzini, A. Waibel, Integrating time alignment and neural networks for high performance continuous speech recognition, in: Proceedings of ICASSP 91, pp. 105–108.
[8] M.M. Hochberg, S.J. Renals, A.J. Robinson, G.D. Cook, Recent improvements to the ABBOT large vocabulary CSR system, in: Proceedings of ICASSP 95, Detroit, USA, pp. 69–72.
[9] L. Rabiner, B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993.
[10] A.J. Robinson, An application of recurrent nets to phone probability estimation, IEEE Transactions on Neural Networks 5 (2) (1994) 298–305.