Information Sciences 123 (2000) 3–11

www.elsevier.com/locate/ins

Hybrid HMM–NN modeling of stationary–transitional units for continuous speech recognition
Dario Albesano *, Roberto Gemello, Franco Mana
CSELT-Centro Studi e Laboratori Telecomunicazioni, via G. Reiss Romoli, 274-10148 Torino, Italy
Received 5 November 1997; accepted 15 April 1999

Abstract
This paper describes the gains in recognition accuracy that can be achieved in a hybrid Hidden Markov Model – Neural Network (HMM–NN) recognition framework by using context-dependent subword units named Stationary–Transitional Units. These units are made up of the stationary parts of the context-independent phonemes plus all the admissible transitions between them; they have good generalization capability and capture fine acoustic detail. They are well suited to neural network modeling, can enhance the performance of hybrid HMM–NN systems, and represent a real alternative to context-independent phonemes. The efficacy of Stationary–Transitional Units is verified for the Italian language on isolated and continuous speech recognition tasks extracted from a real application for telephone access to railway timetables. The results show that a substantial improvement is achieved with respect to the use of context-independent phonemes. © 2000 Elsevier Science Inc. All rights reserved.

1. Introduction

In open vocabulary speech recognition, words cannot be modeled as a whole, but must be defined in terms of subword units, which are then composed

* Corresponding author. Fax: +39-011-2286207.
E-mail address: dario.albesano@cselt.it (D. Albesano).


to form the words of any given vocabulary. Subword unit modeling is therefore mandatory for large vocabulary recognition systems, as well as for flexible vocabulary applications. Phonemes, the basic sounds of a language, behave differently depending on the acoustic context in which they are uttered; for this reason, context-sensitive phonetic models (triphones) are generally used in HMM frameworks to account for coarticulation effects.
Hybrid HMM–NN models have been investigated by several research teams: Franzini et al. [5] and Haffner et al. [7] introduced Connectionist Viterbi Training to enhance HMM-based continuous speech recognition; Bourlard and Morgan [3] proposed connectionist probability estimation to overcome some limitations of HMM-only recognizers; Robinson [10] and Hochberg et al. [8] described a recurrent neural network/HMM approach to large vocabulary, speaker-independent speech recognition; and the present authors introduced a phoneme-based hybrid HMM–NN model [6] whose training procedure integrates a gradual refinement of bootstrap speech segmentations.
An important difference between the HMM and the hybrid HMM–NN approach to open vocabulary speech recognition is that the former can employ context-dependent acoustic units, such as accurate triphones or very general diphones, while the latter is limited to context-independent phonemes: its intrinsically discriminant training cannot model classes of sounds that are not fully separated, as those modeled with context-dependent units are.
To overcome this problem, we propose in this paper the use of Stationary–Transitional Units (STUs), a robust set of context-dependent units for open vocabulary speech recognition recently introduced by Fissore et al. [4] in the HMM framework. The features of these units make them well suited to neural network discriminant training and modeling, and they can enhance the performance of open vocabulary hybrid HMM–NN models. This hypothesis is verified for the Italian language by experimenting with STUs in the framework of hybrid HMM–NN models for telephone-quality isolated and continuous speech recognition, and by comparing the results with the standard context-independent phoneme approach.

2. Modeling with STUs

Although context-independent subword units are easily trained from fluent speech, generalize to new domains, and are well suited to neural network discriminant classifiers (because they do not overlap and cover the whole output space), they do not adequately represent the spectral and temporal properties of the speech units in all contexts [9]. It therefore becomes very important to equip neural-network-based classifiers with trainable context-dependent units, possibly as general as diphones (in principle, a complete set of diphones able to represent any new vocabulary can be trained from an appropriate database) and as accurate as triphones.
Stationary–Transitional Units seem to combine the generality of diphones with the accuracy of triphones; in addition, they are well suited to neural network modeling. Rather than using the classical diphone or triphone units, phonetic transcriptions of words are represented in terms of stationary context-independent phonemes and diphone-transition coarticulation units. Consider a triphone ⟨LpR⟩: its central states are borrowed from the stationary states of the context-independent unit ⟨p⟩, while the states concerning the R transition are borrowed from a diphone-transition unit ⟨p-R⟩ made up of the final state of phoneme ⟨p⟩ and the first state of phoneme ⟨R⟩. In this way the accuracy of triphones is preserved. With STUs, a sequence of three context-independent phonemes ⟨X⟩⟨P⟩⟨Y⟩ is replaced by the sequence of five STUs ⟨x⟩⟨x-p⟩⟨p⟩⟨p-y⟩⟨y⟩, where ⟨x⟩, ⟨p⟩, ⟨y⟩ are the stationary parts of phonemes ⟨X⟩, ⟨P⟩, ⟨Y⟩ and ⟨x-p⟩, ⟨p-y⟩ are the corresponding transitions between ⟨x⟩ and ⟨p⟩, and between ⟨p⟩ and ⟨y⟩.
The background noise, referred to as '@', is treated like a phoneme, so it has a stationary context-independent part ⟨@⟩ and transitions to and from every phoneme ⟨p⟩ (e.g. ⟨@-p⟩ and ⟨p-@⟩).
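As an illustration of this expansion, the following sketch (ours, not the authors' code) converts a context-independent phoneme sequence into the corresponding STU sequence; the unit names are purely illustrative:

```python
def stu_sequence(phonemes):
    """Expand a context-independent phoneme sequence into STUs:
    each phoneme contributes its stationary part, and each adjacent
    pair <a><b> contributes the diphone-transition unit <a-b>.
    Background noise '@' is treated like any other phoneme."""
    stus = []
    for i, p in enumerate(phonemes):
        stus.append(p)                              # stationary part
        if i + 1 < len(phonemes):
            stus.append(p + "-" + phonemes[i + 1])  # transition unit
    return stus

# <X><P><Y> becomes the five STUs <x><x-p><p><p-y><y>:
print(stu_sequence(["x", "p", "y"]))  # ['x', 'x-p', 'p', 'p-y', 'y']
```

With noise '@' on both sides, a word of N phonemes thus expands into 2N + 3 units.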
The proposed units are robust and have good generalization capability, as each unit is modeled by a sufficient number of frames of the same context. As mentioned above, STUs are very promising for NN implementation because, like context-independent phonemes, they partition the sounds of a language without overlapping, but with more acoustic detail. Finally, they are language dependent but domain independent.

3. Neural Network Automata framework

Hybrid HMM–NN models combine the ability of HMMs to deal with temporal patterns with the pattern classification power of NNs. They inherit from HMMs the modeling of words with left-to-right automata and the Viterbi decoding, delegating to an NN the computation of emission probabilities.
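As a minimal sketch of this decoding scheme, the following Viterbi alignment over a left-to-right automaton with self-loops consumes per-frame state probabilities such as an NN would supply, and ignores transition probabilities; it is our illustration under those assumptions, not the authors' implementation:

```python
import math

def viterbi_left_to_right(emissions):
    """Align T frames to a left-to-right automaton with self-loops.
    emissions[t][q] is the NN-estimated probability P(q | frame t),
    assumed > 0; transition probabilities are ignored. Returns
    (log-score, state path) for the best path that starts in
    state 0 and ends in the last state."""
    T, Q = len(emissions), len(emissions[0])
    NEG = float("-inf")
    score = [[NEG] * Q for _ in range(T)]
    back = [[0] * Q for _ in range(T)]
    score[0][0] = math.log(emissions[0][0])
    for t in range(1, T):
        for q in range(Q):
            stay = score[t - 1][q]                        # self-loop
            move = score[t - 1][q - 1] if q > 0 else NEG  # advance
            best = max(stay, move)
            if best > NEG:
                score[t][q] = best + math.log(emissions[t][q])
                back[t][q] = q if stay >= move else q - 1
    path = [Q - 1]                  # backtrack from the final state
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[T - 1][Q - 1], path
```

Frames where the posterior of the current state is high keep the path in its self-loop; a rising posterior of the next state triggers the advance, which is how new transition points can be proposed during re-segmentation.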
The recognition model we use is a hybrid HMM–NN model devoted to recognizing sequential patterns, named Neural Network Automata (NNA). Each class is described by a left-to-right automaton (with self-loops), as in HMMs. The emission probabilities of the automaton states are estimated by a Multi-Layer Perceptron (MLP) neural network rather than by mixtures of Gaussians, while transition probabilities are not considered. The MLP may be recurrent or feedforward: this architectural choice has to be made experimentally, case by case, depending on the kind of acoustic units being modeled. For whole-word models recurrent networks have proved superior, whereas for subword units a feedforward MLP seems preferable [6].
An NNA has an input window comprising several contiguous frames of the sequence, one or more hidden layers, and an output layer in which the activation of each unit estimates the probability P(Q | X) of the corresponding automaton state Q given the input window X.
An NNA has many degrees of freedom: the architecture of the MLP, the input window width, and the number of automaton states for the different words or phonemes employed.
The modeling of STUs with NNA was performed by assigning one output unit of the MLP to each stationary and each transitional unit. An alternative is to assign two states to the transitional units. A typical architecture for an NNA devoted to the Italian STUs is depicted in Fig. 1 and comprises:
· 39 parameters per frame (log Energy, 12 Cepstral Coefficients, and their first- and second-order derivatives);
· a seven-frame-wide input window;
· a first hidden layer with 315 sigmoidal neurons: one central block and two context blocks of 105 neurons each (5 for E, ΔE and ΔΔE; 30 for Cep, ΔCep and ΔΔCep);
· a second hidden layer with 250 sigmoidal neurons, fully connected to the first hidden layer and to the output layer;

Fig. 1. Architecture of an NNA for STU recognition.


· an output layer modeling 379 STUs: 27 stationary context-independent phonemes (one Softmax output neuron each), 349 diphone-transition coarticulation units (one Softmax output neuron each, or two as shown in Fig. 1), and three models for specific telephone-line noises (one Softmax output neuron each).
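The seven-frame input window can be assembled as sketched below; the replication padding at utterance boundaries is our assumption, since the paper does not state its padding policy:

```python
def input_windows(frames, width=7):
    """Build the NNA input by stacking `width` contiguous feature
    frames around each time step. Edge frames are replicated at the
    utterance boundaries (an assumption, not stated in the paper).
    `frames` is a list of per-frame feature vectors (39 parameters
    each in the paper's setup)."""
    half = width // 2
    padded = [frames[0]] * half + frames + [frames[-1]] * half
    windows = []
    for t in range(len(frames)):
        stacked = []
        for f in padded[t:t + width]:
            stacked.extend(f)   # concatenate the contiguous frames
        windows.append(stacked)
    return windows

# Three 2-dimensional frames, window of 3 -> each window has 6 values
w = input_windows([[1, 1], [2, 2], [3, 3]], width=3)
```

With 39 parameters per frame and a width of 7, each window carries 273 input values.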
The adopted network architecture is quite large, so recall on new data could be too slow; this was solved with the acceleration method described in [2].

4. Training NNA

NNA training integrates an incremental re-segmentation into a single MLP training run, which greatly reduces training time and allows standard workstations to be used.
During NNA training we want, simultaneously:
· to find the best segmentation of words into the employed phonetic units, and of phonetic units into neural states;
· to train the network to discriminate those states.
Training is an iterative procedure:
Initialization:
· initialize the NNA with small random weights;
· create the first segmentation by starting from a bootstrap segmentation of the training utterances into the employed subword units, and segmenting each unit uniformly into the foreseen number of states.
Iterations: for each epoch:
· load the present segmentation;
· train the NNA for one epoch according to that segmentation;
· obtain a new segmentation by applying dynamic programming to each utterance in the training set, re-evaluating the transition points proposed by the NNA;
· update the present segmentation with a function of itself and of the new segmentation:
present_segm = F(present_segm, new_segm),
e.g.:
F(s1, s2) = K·s1 + (1 − K)·s2,
with K starting from 1.0 and decreasing during training.
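Assuming a segmentation is represented by the frame indices of its state boundaries (one plausible reading; the paper does not spell this out), one update step of F can be sketched as:

```python
def blend_segmentations(present, new, k):
    """One update of the incremental re-segmentation: the state
    boundaries of the present segmentation are moved toward the
    boundaries proposed by the network, F(s1, s2) = k*s1 + (1-k)*s2.
    `present` and `new` are lists of boundary frame indices; k
    decreases from 1.0 toward 0.0 during training, so early epochs
    trust the bootstrap segmentation and later ones the network."""
    return [round(k * b1 + (1 - k) * b2) for b1, b2 in zip(present, new)]

# k = 0.75: boundaries move a quarter of the way toward the new ones
blend_segmentations([10, 20, 30], [14, 24, 26], 0.75)  # [11, 21, 29]
```

At k = 1.0 the segmentation is unchanged; as k shrinks, the dynamic-programming alignment progressively takes over.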
To train the NNA, targets are generated frame by frame according to the present segmentation, setting 1.0 for the active state of the correct automaton and 0.0 otherwise. All the automata are trained within a single network, hence performing a discriminant training. The basic MLP learning algorithm is back-propagation; the error function is cross-entropy and the output activation function is Softmax. The bootstrap segmentation of the spoken utterances into context-independent phonemes or STUs is generally obtained by forced alignment with HMMs. The termination criterion is error stabilization, which generally takes no more than 35 epochs. Intermediate weights are tested on a validation set.

5. Experimental activity

Experiments were conducted to evaluate the behaviour of STUs on an isolated word recognition test set (Panda) and on a continuous speech recognition test set (Dialogos-95). To better investigate the capabilities of STUs, we trained the NNA on a phonetically balanced, domain-independent database and tested it on a railway domain. Under these test conditions we compared STUs with context-independent phonemes.

5.1. Speech databases

The Arvin data set was used to train the NNA for both the isolated and continuous speech tasks; it was collected over the Public Switched Telephone Network (PSTN) with a 300–3400 Hz band and sampled at 8 kHz. Arvin consists of read speech, with male and female speakers evenly distributed across many Italian regions and accents: 1136 speakers uttered 4875 continuous sentences and 3653 isolated words, all phonetically balanced. The utterances were manually transcribed. The isolated component was used to train the NNA for the Panda task, while both the isolated and continuous components were used to train the NNA for the Dialogos-95 task.
The Panda test set was collected in the same way as the Arvin database and is made up of 17,444 isolated words belonging to a vocabulary of 475 Italian railway stations, uttered by 1050 different speakers.
The Dialogos-95 test set is a spontaneous speech database collected from phone calls from all over Italy during a field trial of the DIALOGOS® system [1]. It is made up of 2040 sentences from real dialogues in the railway timetable access domain.

5.2. Recognition results

Three experiments were run with different sets of subword units or different subword-unit modeling: 27 Italian context-independent phonemes, and 379 Italian STUs with one or two neurons per diphone-transition unit.

On the Dialogos-95 database, recognition is performed with a bigram language model followed by N-best rescoring with trigrams.
The recognition and understanding results obtained for the Panda and Dialogos-95 test sets are reported in Table 1. The WA column gives Word Accuracy, the SA column Sentence Accuracy, and the SU column Sentence Understanding.
Table 1 shows the performance improvements on both test sets. On the isolated Panda database, WA increases from 91.4% to 94.6%, yielding a 37% error reduction, when two states per diphone-transition unit are used. On the continuous Dialogos-95 database, WA, SA and SU all improve when passing from 27 phonemes to 379 STUs: WA increases from 53.5% to 72.1% (a 32% error reduction) and SA from 60.0% to 71.6%. After parsing and concept extraction, SU is 66.0% for phonemes and 78.0% for STUs, a 35% error reduction. It can also be observed that the configuration with two states per diphone-transition unit works consistently better than the configuration with one state per transition.
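The quoted error reductions follow from treating 100 − accuracy as the error rate; a minimal check of this arithmetic (our reconstruction of how the figures were computed):

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative error reduction between two accuracy figures (in %),
    i.e. the fraction of the residual error that was removed."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before

# Panda WA: 91.4% -> 94.6%
print(round(relative_error_reduction(91.4, 94.6)))  # 37
# Dialogos-95 SU: 66.0% -> 78.0%
print(round(relative_error_reduction(66.0, 78.0)))  # 35
```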

5.3. Behaviour analysis: a case study

An example of the behavioural differences between two NNAs, implementing context-independent phonemes and STUs respectively, is given in Fig. 2.
Starting from the top, the first curve shows the log Energy of the utterance 'pesce' (fish), extracted from the training set; the following six curves show, on a linear scale [0–1], P(Q | X) for the context-independent phonemes involved in the word: ⟨@⟩, ⟨p⟩, ⟨e⟩, ⟨&⟩, ⟨e⟩, ⟨@⟩. The remaining ten curves show, again on a linear [0–1] scale, P(Q | X) for the STUs ⟨@-p⟩, ⟨p⟩, ⟨p-e⟩, ⟨e⟩, ⟨e-&⟩, ⟨&⟩, ⟨&-e⟩, ⟨e⟩, ⟨e-@⟩, ⟨@⟩. The thin dashed vertical lines mark the bootstrap segmentations, while the solid vertical lines mark the final segmentations obtained after the learning process.
Looking at the output probabilities of the diphone-transition units involved in the word shown in Fig. 2, it is possible to see that they take significant values precisely at the transition points of the corresponding context-independent phonemes in the upper curves. This behaviour, highlighted in the case study reported here, has been observed in many other training and test utterances; this allows us to conclude that the transitional units have been properly trained and are able to capture the acoustic transitions.

Table 1
Recognition results for Panda and Dialogos-95

                            Panda   Dialogos-95
                            WA      WA     SA     SU
27 C.I. phonemes, 3 states  91.4    53.5   60.0   66.0
379 STUs, 1 state           93.5    72.1   71.6   78.0
379 STUs, 2 states/diphone  94.6    72.7   72.2   –

Fig. 2. Output probabilities generated by two NNAs, for context-independent phonemes (top) and STUs (bottom).

6. Conclusions

The results show that, with respect to standard context-independent phonemes, important improvements are achieved on continuous speech recognition, while the gains are less marked for isolated word recognition. Owing to its satisfactory recognition performance, the resulting NNA has been embedded in the DIALOGOS® online system for telephone access to railway timetables, as described in [1].

References

[1] D. Albesano, P. Baggia, M. Danieli, R. Gemello, E. Gerbino, C. Rullent, A robust system for human–machine dialogue in telephony-based applications, International Journal of Speech Technology 2 (1997) 101–111.
[2] D. Albesano, F. Mana, R. Gemello, Speeding up neural network execution: an application to speech recognition, in: Proceedings of the IEEE NNSP Workshop, Kyoto, Japan, 1996.
[3] H. Bourlard, N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, Dordrecht, 1993.
[4] L. Fissore, F. Ravera, P. Laface, Acoustic-phonetic modeling for flexible vocabulary speech recognition, in: Proceedings of EUROSPEECH '95, Madrid, Spain, September 1995, pp. 799–802.
[5] M.A. Franzini, K.F. Lee, A. Waibel, Connectionist Viterbi training: a new hybrid method for continuous speech recognition, in: Proceedings of ICASSP '90, Albuquerque, NM, April 1990, pp. 425–428.
[6] R. Gemello, D. Albesano, F. Mana, Context independent phoneme classification for open vocabulary recognition, CSELT Technical Report DTD 95.0230, March 1995.
[7] P. Haffner, M. Franzini, A. Waibel, Integrating time alignment and neural networks for high performance continuous speech recognition, in: Proceedings of ICASSP '91, pp. 105–108.
[8] M.M. Hochberg, S.J. Renals, A.J. Robinson, G.D. Cook, Recent improvements to the ABBOT large vocabulary CSR system, in: Proceedings of ICASSP '95, Detroit, USA, pp. 69–72.
[9] L. Rabiner, B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993.
[10] A.J. Robinson, An application of recurrent nets to phone probability estimation, IEEE Transactions on Neural Networks 5 (2) (1994) 298–305.
