
Philips J. Res. 49 (1995) 381-397

SPEECH RECOGNITION ALGORITHMS FOR VOICE CONTROL INTERFACES

by R. HAEB-UMBACH, P. BEYERLEIN and D. GELLER


Philips GmbH Forschungslaboratorien, Weisshausstrasse 2, D-52066 Aachen, Germany

Abstract
Recognition accuracy has been the primary objective of most speech recog-
nition research, and impressive results have been obtained, e.g. less than
0.3% word error rate on a speaker-independent digit recognition task.
When it comes to real-world applications, robustness and real-time
response might be more important issues. For the first requirement we
review some of the work on robustness and discuss one specific technique,
spectral normalization, in more detail. The requirement of real-time
response has to be considered in the light of the limited hardware resources
in voice control applications, which are due to the tight cost constraints. In
this paper we discuss in detail one specific means to reduce the processing
and memory demands: a clustering technique applied at various levels
within the acoustic modelling.
Keywords: automatic speech recognition; small-vocabulary systems;
robustness; acoustic-phonetic modelling; state and density
clustering.

1. Introduction
Automatic speech recognition has been a topic of research for many years. It
is, however, primarily in the past few years that the technology has matured
enough to be employed in a large range of applications. There are two main
reasons why this has occurred. Firstly, improvements in speech recognition
algorithms have led to more robust and reliable systems which can cope
with real-world and not only laboratory-controlled environments. Secondly,
there has been a sharp decrease in the cost of computation and memory, which is reflected
by the rapid growth in computing capability provided by modern digital signal
processing (DSP) chips.
Today, the cost of the speech recognition feature approaches a range which
makes it accessible to everyday consumer and telecommunications products.


Speech recognition algorithms for voice control interfaces, which operate
with a recognition vocabulary of up to 100 words, can be implemented on a
single DSP. An increasing number of products already include a DSP for other
purposes; e.g. modern mobile telecommunication terminals often employ a
DSP to carry out the complex signal processing tasks of those systems. This
resource can then be shared, and speech recognition comes with only minor
extra cost, primarily due to extra memory.
Voice control means the ability of a machine to react to spoken commands.
The goal of this technology is to provide enhanced access to machines via voice
commands. However, often voice input has to compete with well-established
conventional means of input, e.g. keyboards. A speech recognizer will only
be successful if it meets the following requirements:
• It exploits the unique properties of the speech input mode such that the user
perceives an actual benefit of using it rather than using a conventional input
mode. This topic is extensively dealt with in an accompanying article in this
issue [1].
• It is accurate: the system must achieve a specified level of performance, e.g.
recognition accuracy greater than 95%, so that the user is motivated to con-
tinue using the system.
• It is robust: both with respect to changing environmental conditions and in
the way users interact with the system. The latter is particularly important if
the users are unfamiliar with the technology.
• It provides real-time response: the user must be provided with a timely
system response. Without sufficiently fast feedback the user does not feel
in control of the system.
While the first requirement addresses the user interface point of view, the
other requirements are mainly determined by the recognition algorithms and
their implementation. Recognition accuracy has been the primary objective
of most speech recognition research, and impressive results have been
obtained, e.g. less than 0.3% word error rate for a speaker-independent digit
recognition task [2]. When it comes to real-world applications, robustness
might even be the more important issue. It has thus become an active field
of research in recent years. We will review some of this work and discuss
one specific technique, spectral normalization, in more detail in Section 2.
The last requirement of real-time response has to be considered in the light
of the limited hardware resources in voice control applications, which are a
result of the tight cost constraints. The problem can be approached from two
points of view:


• Develop algorithms to obtain the best performance given a restricted hard-
ware resource (processor and memory limitations).
• Devise or adapt the hardware so that it is optimally suited to solve the given
recognition task.
In this paper we mainly take the first point of view and discuss in detail one
specific means to reduce the processing and memory demands: a clustering
technique applied at various levels within the acoustic modelling. This will
be discussed in Section 3. The results are summarized in Section 4.

2. Towards robust speech recognition


2.1. Acoustical environmental variability

The causes of acoustical environmental variability can be subdivided into
factors which remain constant through the course of an utterance or a session,
such as recording equipment or room acoustics, and factors which may vary
even within a single utterance, e.g. background noise.
The major contributions to acoustical environmental variability are as
follows [3]:
• Changing channel characteristics: this problem arises if the room acoustics
are changed or if the recording equipment is changed (e.g. a different micro-
phone, change of the telephone channel for telephone speech). This kind of
distortion of the speech signal is most often modelled by passing the speech
signal through a linear filter.
• Input level: level changes result from speakers speaking with different
volume or from changes of orientation or distance to the microphone.
• Additive noise: present in many applications. One prominent example is
speech recognition in a car. If training and testing are carried out at the
same noise level, the recognizer can cope quite well even with low signal-
to-noise ratios. If there is a mismatch between training and recognition,
this can cause severe performance degradation.
• Different speaking styles may result in different spectral characteristics. One
example is speech spoken in the presence of noise.
• Extraneous speech by interfering speakers.

One approach to render a speech recognizer more robust is multistyle train-
ing: instead of using a model for the environmental variability, this approach
consists of using a database for training that contains the variability to be
expected in the test conditions. The technique has been successfully used
with hidden Markov models (HMMs) because of their powerful modelling
abilities in different contexts. However, this approach is often impractical due
to the lack of sufficient training data. Therefore a mismatch between training
and recording environment can often not be avoided.
Much research has been devoted to the problem of devising algorithms
which cope with such a mismatch. The approaches can be divided into the
following two broad categories:
• Signal processing techniques in the speech recognition front end. Examples
of this are the use of microphone arrays, speech enhancement techniques
and measures to extract more robust feature vectors.
• Statistical modelling techniques of speech and noise. These include transfor-
mation and adaptation techniques of hidden Markov models of noise-free
speech to the noisy case.
An overview of these techniques may be found in [3, 4]. In the following we
will present one technique of the first category in more detail: spectrum
normalization.

2.2. Spectrum normalization

The speech signal at the speech recognizer input is typically modelled as a
clean speech signal and an additive noise term, observed at the output of a
linear time-varying filter, representing the slowly changing channel transfer
function. In the speech recognizer front end considered here the input signal
is sampled and blocked into overlapping frames. For each frame a Fourier
transform is computed. In the frequency domain the speech signal may thus
be expressed as

$$X(f, t) = \bigl[ N(f, t) + S(f, t) \bigr] \cdot H(f, t) \qquad (1)$$


where S(f, t), N(f, t) and X(f, t) denote the Fourier transform of a block of
the pure speech signal, the noise signal and the input signal to the speech recog-
nizer, respectively; H(f, t) is the transfer function of the recording channel.
Note that we assume non-stationary signals; i.e., the spectral characteristics
are a function of the time t. Further, the channel transfer function is assumed
to be time-varying.
We compute mel-frequency log-spectral coefficients [5]. If we assume that
the speech signal corrupted by noise can be modelled as the maximum of
the speech signal and noise over time for each frequency band [6], then we
obtain from eq. (1) the following equation in the log-spectral domain:

$$x(k, t) \approx T(k, t)\,s(k, t) + \bigl(1 - T(k, t)\bigr)\,n(k, t) + h(k, t) \qquad (2)$$

where

$$T(k, t) = \begin{cases} 1 & \text{if } s(k, t) > n(k, t) \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

s(k, t), n(k, t), h(k, t) and x(k, t) are the logarithms of the power spectral
densities of the pure speech signal, the noise signal, the transfer function
and the compound signal at the input of the recognizer, respectively; k denotes
the frequency subband index and t is the discrete time index.
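To make the step from eq. (1) to eq. (2) explicit, a sketch in our notation: taking the logarithm of the squared magnitude of eq. (1), neglecting the cross term of speech and noise, and applying the max model of [6] gives

$$x(k, t) = h(k, t) + \log\bigl(e^{s(k, t)} + e^{n(k, t)}\bigr) \approx h(k, t) + \max\bigl(s(k, t),\, n(k, t)\bigr),$$

and writing the maximum by means of the indicator T(k, t) of eq. (3) yields eq. (2).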
For most environmental variabilities it is justified to assume that the channel
transfer function varies only slowly with time compared to the rate at which
speech changes; i.e., h(k, t) can be considered a low-pass signal with respect
to the time index t.
The influence of the transfer function can then be eliminated by high-pass
filtering of the spectral subband envelopes. Furthermore, it is well known
that high-pass filtering of the subband envelopes suppresses speaker-specific
characteristics of the speech signal. Different techniques for high-pass filtering
have been applied.

2.2.1. Utterance mean subtraction


In the following we assume that h(k, t) is constant with respect to time t
within one utterance: h(k, t) = h(k). Then for each subband k the mean value
of x(k, t) over an utterance of length T is

$$\bar{x}(k) = \frac{1}{T}\sum_{t=1}^{T} x(k, t) = h(k) + \frac{1}{T}\sum_{t=1}^{T} T(k, t)\,s(k, t) + \frac{1}{T}\sum_{t=1}^{T} \bigl(1 - T(k, t)\bigr)\,n(k, t) \qquad (4)$$

where

$$N_s(k) = \sum_{t=1}^{T} T(k, t) \qquad (5)$$

denotes the number of speech frames in subband k.

Now it is easily seen that $y(k, t) := x(k, t) - \bar{x}(k)$ is independent of the channel
transfer function h(k):

$$y(k, t) = x(k, t) - \bar{x}(k) = T(k, t)\,s(k, t) + \bigl(1 - T(k, t)\bigr)\,n(k, t) - \frac{1}{T}\sum_{\tau=1}^{T} T(k, \tau)\,s(k, \tau) - \frac{1}{T}\sum_{\tau=1}^{T} \bigl(1 - T(k, \tau)\bigr)\,n(k, \tau) \qquad (6)$$

However, it can be observed that y(k, t) depends on the relative amount of
speech frames, $N_s(k)/T$, within an utterance (each utterance is composed of
speech and silence frames). Thus this technique only works well if this amount
is roughly constant for all utterances, both in training and test.
The above utterance mean computation can be considered as a filtering
operation with a finite impulse response filter of length one utterance with
tap weights equal to 1/T. Note that the overall mean subtraction operation
is equivalent to a high-pass filter operation. Further note that the mean com-
putation introduces a processing delay of one utterance for all subsequent
operations.
A similar operation can also be carried out on cepstral features, where the
technique is known as cepstral mean subtraction [3].
The above processing is also helpful to compensate for differences between
the noise conditions in training and test: if the noise statistics can be considered
to be stationary, i.e. $n(k, t) \approx n(k)$, then the noise term is also suppressed by
high-pass filtering.
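As an illustration, a minimal sketch of utterance mean subtraction in code; the function and array names are ours, and the log-spectral features are assumed to be given as a frames-by-subbands array:

    import numpy as np

    def utterance_mean_subtraction(x):
        # x[t, k] is the log-spectral coefficient x(k, t) of frame t,
        # subband k, for one complete utterance of T frames.
        x = np.asarray(x, dtype=float)
        x_bar = x.mean(axis=0)   # per-subband utterance mean, eq. (4)
        return x - x_bar         # y(k, t) = x(k, t) - x_bar(k), eq. (6)

Note that the whole utterance must be available before the mean can be subtracted, which is precisely the processing delay mentioned above.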

2.2.2. Recursive high-pass filter


The processing delay of one utterance of the utterance mean subtraction
technique is unacceptable for a real-time implementation. Therefore we
consider here high-pass filtering of the spectral subband envelopes by means
of a recursive infinite impulse response filter [7]. Here the subband mean is
computed as a weighted running sum of the past samples:

$$\bar{x}(k, t) = \sum_{j=1}^{\infty} a(j)\, x(k, t - j) \qquad (7)$$

One possible initialization for $\bar{x}(k, 0)$ is the overall mean value of the kth sub-
band component of the training data. The filter coefficients a(j) have to be
chosen such that the above represents a low-pass filter; e.g. for a first-order
lowpass filter we have $a(j) \propto a^j$, $0 < a < 1$. The bandwidth of the filter has


to meet contradictory design goals: a large bandwidth is desirable to be able to
adapt quickly to a new channel; however, a small bandwidth results in less dis-
tortion of the speech signal s(k, t). This problem can be solved by a time-vary-
ing filter bandwidth, i.e. a(j) = a(j, t).
Again, the high-pass signal $y(k, t) = x(k, t) - \bar{x}(k, t)$ is then used for the
subsequent classifier.
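A sketch of this normalization with a first-order recursive filter (a minimal illustration under our naming; the time-varying bandwidth mentioned above is omitted and the filter parameter a is kept fixed at an arbitrary value):

    import numpy as np

    def recursive_highpass(x, a=0.95, x0=None):
        # x[t, k]: log-spectral frames; a: pole of the first-order
        # low-pass, 0 < a < 1 (larger a = smaller bandwidth);
        # x0: initial estimate x_bar(k, 0), e.g. training-data means.
        x = np.asarray(x, dtype=float)
        x_bar = np.zeros(x.shape[1]) if x0 is None else np.asarray(x0, float)
        y = np.empty_like(x)
        for t in range(x.shape[0]):
            y[t] = x[t] - x_bar               # high-pass output y(k, t)
            # running mean over past frames only, as in eq. (7)
            x_bar = a * x_bar + (1.0 - a) * x[t]
        return y

Because only past frames enter the running mean, the output is available frame by frame, which is what makes this variant suitable for real-time operation.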

2.2.3. Experimental results

Speaker-dependent tests were conducted on a car database: two sets of data
were collected in a compact-sized car driving on a highway at an average speed
of 120 km/h (75 mph). The first set of data contains speech uttered via a tele-
phone handset (‘HS’), the second in hands-free mode (‘HF’) with a single
microphone attached to the sun visor. Both data collections involved 10
speakers (handset: 7 male + 3 female; hands-free: 5 male + 5 female). Three
male and three female speakers were common to both data collections. The
vocabulary contains the German digits including the two alternative pronun-
ciations of ‘2’: ‘zwei’ and ‘zwo’. Digits were spoken in isolation, as 3-digit
strings, and as 7-digit strings; 44 single digits and 44 (88) 3-digit strings were
used for training in the handset (hands-free) mode. Recognition experiments
were carried out on 100 7-digit strings. For the ‘cross-tests’ all handset data
were used for training and all hands-free data for recognition. The error rates
presented are always average values over all available speakers: 10 for the
handset, 10 for the hands-free experiments and 6 for the cross-tests. The
signal-to-noise ratio of the handset data was between 10 dB and 19 dB and
that of the hands-free data around 0 dB. However, it should be emphasized
that the difference between handset and hands-free data is not only an
increased noise level but also a different acoustic channel and different speak-
ing mode. For more details on the database see [8].
Further, we carried out speaker-independent recognition experiments on a
telephone database (‘MTEL’). The training part consists of isolated utterances
of German digits spoken by 37 male speakers resulting in a total of 897 utter-
ances. The recognition part contains 1892 digit strings of different lengths (up
to 7 digits) which were spoken by another 23 male speakers who were different
from the training speakers.
Finally, the adult speakers’ portion of the Texas Instruments Connected
Digits recognition task [9] (‘TI Digits’) was used as yet another database.
Our speech recognizer employs a connected-word recognition algorithm
which is based on whole-word hidden Markov models, the emission probabil-
ities of which are modelled by continuous Laplacian densities.


TABLE I
Word error rates for handset (HS) and hands-free (HF) databases, cross-tests
(HS: training, HF: recognition) and MTEL telephone database; string error
rates for the TI Digits recognition task are given in parentheses. In all cases:
32-component feature vector

                               Word (string) error rate (%)
Type of normalization        HS    HF    CROSS   MTEL   TI
No spectrum normalization    0.9   2.0   100     3.0    (4.0)
Mean subtraction             1.2   2.0   6.9     1.6    (3.2)
High-pass filter             1.1   2.3   11.4    2.1    (3.9)
Band-pass filter [11]        1.2   2.2   8.3     3.4    -

For all results of Table I we trained single-density emission probabilities
rather than mixtures. We used the Viterbi approximation both in training and
recognition, i.e. the probability of a word is replaced by the probability of its
most likely state sequence.
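In standard HMM notation (transition probabilities $a_{ij}$, emission densities $b_j$, a state sequence $s_1 \ldots s_T$ for the observation sequence $o_1 \ldots o_T$), the Viterbi approximation replaces the sum over all state sequences by its dominant term:

$$p(o_1^T \mid w) = \sum_{s_1^T} \prod_{t=1}^{T} a_{s_{t-1} s_t}\, b_{s_t}(o_t) \;\approx\; \max_{s_1^T} \prod_{t=1}^{T} a_{s_{t-1} s_t}\, b_{s_t}(o_t)$$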
After sampling the input speech signal at 8 kHz and subsequent pre-emphasis
(a = 0.95), 15 cepstrally smoothed log-power spectral intensities are computed
every 12 ms from a Hamming-windowed 32 ms portion of the speech signal.
The energy per frame is subtracted from each intensity and included as an
additional component in a resulting 16-component feature vector. The 16-
component vector is augmented with first-order time differences of each vector
component to obtain a 32-component vector. Details of the signal analysis and
the acoustic modelling can be found in [8, 10].
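The following sketch illustrates the shape of this front end (illustrative only: a plain equal-width grouping of FFT bins stands in for the mel filter bank and cepstral smoothing of [8, 10], and all names are ours):

    import numpy as np

    def front_end(signal, fs=8000, shift_ms=12, win_ms=32,
                  n_bands=15, pre_emph=0.95):
        s = np.asarray(signal, dtype=float).copy()
        s[1:] -= pre_emph * s[:-1]              # pre-emphasis, a = 0.95
        shift = fs * shift_ms // 1000           # 96 samples = 12 ms
        win = fs * win_ms // 1000               # 256 samples = 32 ms
        window = np.hamming(win)
        feats = []
        for start in range(0, len(s) - win + 1, shift):
            power = np.abs(np.fft.rfft(s[start:start + win] * window)) ** 2
            bands = [b.sum() for b in np.array_split(power, n_bands)]
            log_bands = np.log(np.maximum(bands, 1e-10))
            energy = np.log(max(power.sum(), 1e-10))
            # frame energy subtracted from each intensity and appended:
            # 15 + 1 = 16 static components
            feats.append(np.append(log_bands - energy, energy))
        x = np.array(feats)
        # first-order time differences double the size to 32 components
        delta = np.vstack([np.zeros(x.shape[1]), np.diff(x, axis=0)])
        return np.hstack([x, delta])            # (T, 32) feature vectors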
Table I summarizes the results on the databases described above. The high-
pass filter was initialized for each new utterance, according to the scheme out-
lined in Section 2.2.2. After five time frames the filter parameter was fixed to its
steady-state value. For comparison we have included in the table results for the
bandpass filter described in [11]. The table indicates word error rates in %,
except for the TI Digits case, where string error rates are cited in parentheses.
The results show that in the case of a mismatch of training and test condi-
tions (‘CROSS’ in the table) spectrum normalization is essential. If there is no
spectrum normalization the recognizer fails completely. This is due to the many
word insertions caused by the high noise level of the test data. For the speaker-
independent recognition experiments (MTEL, TI), spectrum normalization
also improves performance. This shows that slow changes of the log-spectral
intensities are mainly caused by speaker and acoustic channel variations and
thus should be removed, since they do not bear any information for the

388 Philips
Jouroal
ofResearch Vol. 49 No. 4 1995
Speech recognition algorithms for voice control interfaces

recognition task. Further, it can be observed that the recursive filters perform
somewhat worse than the mean subtraction technique. In the cases of speaker-
dependent recognition and match between training and test environment
(‘HS’, ‘HF’), spectrum normalization does not yield any benefit and can
even deteriorate performance slightly. Spectrum normalization may thus be
viewed as a safeguard measure for cases where one might encounter a test
environment that differs from the training conditions.

3. Clustering techniques for compact acoustic models


In this section we are concerned with the problem of reducing the computa-
tion and memory demands of speech recognition in order to obtain a cost-
effective hardware implementation. In the following we describe clustering
techniques to arrive at compact acoustic representations which achieve the
same recognition performance with a considerably smaller number of param-
eters than the non-clustered system. We assume that the reader is familiar with
hidden Markov modelling; otherwise we refer the reader to [12]. We applied cluster-
ing techniques at different levels of the acoustic modelling and integrated them
into the acoustic-phonetic training procedure of our continuous-density hidden
Markov model (HMM) speech recognizer. The main idea of clustering is to
join models and parameters which are acoustically similar.
We applied the theoretical framework of maximum-likelihood estimation to
define a measure of similarity for our bottom-up hierarchical clustering. This
approach leads to a similarity measure which is composed of an L2 distance of
the model or density means and an observation-count-based weighting factor.
In Section 3.1 we will show that such a clustering criterion fits well into a
maximum-likelihood estimation-based training procedure.
For a continuous-density HMM system, acoustic similarity can be seen at
different levels: at the phone level, at the state (or mixture) level and at the
density level.
Clustering at the first level (phones) leads to model tying and aims at defin-
ing a reduced set of models to be trained. It avoids a duplication of acoustically
similar models, and therefore reduces the number of parameters of a system.
This translates directly into reduced processing and memory demands of the
recognizer. In addition, tying of rarely observed models with similar but
more often observed ones leads to a more reliable estimation of the model
parameters and thus to a more efficient utilization of given data. Similar
reasoning is behind state tying, where individual states rather than whole
models are tied.


The goal of density clustering is to identify similar single emission probability
density functions out of the pool of density functions. The pool may contain
either the component densities of the mixture densities of all models or there
may be separate pools for each phoneme or family of phonemes. As a result
corresponding parameters of different models are shared. Density clustering
is done across models and is independent of the previously mentioned
model-tying.

3.1. A maximum-likelihood approach to clustering

The objective of the applied clustering procedure is to achieve a tying of
models and parameters which produces the minimum possible increase of a
global measure of heterogeneity. We will show that such a measure can be
directly derived from the maximum-likelihood criterion.

3.1.1. The measure of heterogeneity

Assuming Gaussian output probabilities, we can write for the likelihood of
the observation $\vec{o}$, given the model (mixture density) m,

$$p(\vec{o} \mid m) = \sum_{i=1}^{n} p_i\, e^{-d(\vec{o}, \vec{\mu}_i)} \qquad (8)$$

$$d(\vec{o}, \vec{\mu}_i) = \tfrac{1}{2}\,(\vec{o} - \vec{\mu}_i)\, C^{-1} (\vec{o} - \vec{\mu}_i)^T, \qquad p_i = \frac{w_i}{(2\pi)^{D/2}\, |C|^{1/2}}$$

where n is the number of component densities of the mixture, D is the dimen-
sion of the feature space, C is a given pooled covariance matrix, $w_i$ is a given
fixed mixture component weight and $\vec{\mu}_i$ are the component density mean
vectors with the distance $d(\vec{o}, \vec{\mu}_i)$. Due to the exponential decrease of the like-
lihood we can approximate the logarithm of the mixture density likelihood by:

$$-\log p(\vec{o} \mid m) \approx \min_{i=1,\ldots,n} \bigl[ -\log(p_i) + d(\vec{o}, \vec{\mu}_i) \bigr] \qquad (9)$$

Note that $-\log(p_i) \geq 0$. Thus we define a distance $\tilde{d}$ which incorporates the
term $-\log(p_i)$ into d:

$$\tilde{d}(\vec{o}, \vec{\mu}_i) = d(\vec{o}, \vec{\mu}_i) - \log(p_i) \qquad (10)$$


To obtain the log-likelihood for the whole training set we sum over all
observation vectors which occur in the training data. To indicate that m is the
model which corresponds to the observation vector $\vec{o}$ we write $m_{\vec{o}}$; further, we
write $\vec{\mu} \in m$ if $\vec{\mu}$ is a component density mean vector of the model (mixture
density) m, and $\vec{o} \in \vec{\mu}_i$ if $\vec{\mu}_i$ is the closest mean vector (with respect to the
Euclidean distance) to $\vec{o}$. The negative log-likelihood of the training set is then

$$V = \sum_{\vec{o}} \min_{\vec{\mu}_i \in m_{\vec{o}}} \tilde{d}(\vec{o}, \vec{\mu}_i) \qquad (11)$$

Following the maximum-likelihood approach, we have to train an inventory
of models and densities which minimizes the criterion V:

$$\{\vec{\mu}_i\}_{\mathrm{opt}} = \arg\min_{\{\vec{\mu}_i\}} V \qquad (12)$$

After a suitable linear transformation $\vec{o}\,' = T\vec{o}$ of the feature space, with
$T C T^T = I$, we obtain

$$V = \sum_{\vec{o}} \min_{\vec{\mu}_i \in m_{\vec{o}}} \Bigl[ \tfrac{1}{2}\, \|\vec{o}\,' - \vec{\mu}_i'\|^2 - \log(p_i) \Bigr] \qquad (13)$$

Rewriting the double sum over all mixtures and all component densities of
each mixture as a single summation over the pool of all N densities, we get:

$$V = \sum_{i=1}^{N} \sum_{\vec{o} \in \vec{\mu}_i} \tilde{d}(\vec{o}, \vec{\mu}_i), \qquad n_i = \sum_{\vec{o} \in \vec{\mu}_i} 1 \qquad (14)$$

where $n_i$ is the number of observations assigned to density i.

During bottom-up clustering the size of the pool of densities is successively


reduced. This results in an increase of the measure V. To be consistent with
our maximum-likelihood training procedure, the natural choice of a measure
for the heterogeneity is thus V, and the rule to be followed hence is: Merge
those two clusters which result in the smallest possible increase of heterogeneity V.
Since we are joining models and densities, V will increase during clustering.
Hence the maximum-likelihood approach suggests to minimize the increase of
V during clustering. Thus we can assume that V is an optimal measure of
heterogeneity in the framework of maximum-likelihood training.


3.1.2. Ward’s clustering procedure


In this section we derive Ward’s bottom-up clustering procedure, which is
described in [13].
In the beginning, each cluster consists of a single element, in our case a single
density mean vector $\vec{\mu}_i$. Now the number of clusters is successively reduced.
After joining the two clusters p, q to the fused cluster f,

$$\vec{\mu}_f = \frac{n_p \vec{\mu}_p + n_q \vec{\mu}_q}{n_p + n_q} \qquad (15)$$

with $n_f = n_p + n_q$, the increase of the negative log-likelihood,

$$\Delta_{p,q} = \sum_{\vec{o} \in f} \tilde{d}(\vec{o}, \vec{\mu}_f) - \sum_{\vec{o} \in p} \tilde{d}(\vec{o}, \vec{\mu}_p) - \sum_{\vec{o} \in q} \tilde{d}(\vec{o}, \vec{\mu}_q), \qquad (16, 17)$$

can, after some computation, be simplified to

$$\Delta_{p,q} = \frac{n_p n_q}{n_p + n_q}\, \|\vec{\mu}_p - \vec{\mu}_q\|^2 \qquad (18)$$

From this equation it is obvious that clustering can never lead to a decrease of
the negative log-likelihood. Since

$$n_f\, \vec{\mu}_f = n_p \vec{\mu}_p + n_q \vec{\mu}_q \qquad (19)$$

holds, we obtain for an additional fusion of the fused group f and another
group i:

$$\Delta_{i,f} = \frac{n_i n_f}{n_i + n_f}\, \|\vec{\mu}_f - \vec{\mu}_i\|^2 \qquad (20)$$

$$\Delta_{i,f} = \frac{(n_i + n_p)\,\Delta_{i,p} + (n_i + n_q)\,\Delta_{i,q} - n_i\,\Delta_{p,q}}{n_i + n_f} \qquad (21)$$

$\Delta_{i,f}$ can be interpreted as a distance between the clusters i and f; i.e., the
distance (measure of dissimilarity) of two clusters is simply the increase in
negative log-likelihood V if the two clusters are merged. In an implementation
the terms $\Delta_{i,j}$ are computed at the beginning of the clustering. During a
successive clustering step the distances of the clusters i to the new cluster f
(f = fused(p, q)) can be directly computed from the distances $\Delta_{i,p}$, $\Delta_{i,q}$ and
$\Delta_{p,q}$ according to eq. (21). The iterative clustering procedure is summarized


in the following:

1. Compute a dissimilarity matrix from all given densities by eq. (18).
2. Search for the minimum $\Delta_{i,j}$ within the distance matrix and join the
   clusters i, j to a new cluster f = f(i, j).
3. Remove the items i, j from the distance matrix.
4. Add item f to the distance matrix, using eq. (21).
5. Stop if some stopping criterion is met (e.g. number of clusters); otherwise
   go to 2.

The applied clustering technique is successive and agglomerative and hence
can lead to suboptimal and degenerate cluster configurations. To avoid such
undesirable results an additional k-means clustering (see below) has to be
included.
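A compact sketch of the procedure in code (our naming; the features are assumed whitened so that the squared Euclidean distance applies, and, for brevity, pair distances are recomputed from eq. (18) instead of being updated via eq. (21)):

    import numpy as np

    def ward_cluster(means, counts, n_clusters):
        # means: list of N density mean vectors; counts: observation
        # counts n_i per density; merge until n_clusters remain.
        means = [np.asarray(m, dtype=float) for m in means]
        counts = list(map(float, counts))
        labels = list(range(len(means)))
        active = set(labels)

        def delta(i, j):            # eq. (18): increase of V if i, j fuse
            d = means[i] - means[j]
            return counts[i] * counts[j] / (counts[i] + counts[j]) * (d @ d)

        while len(active) > n_clusters:
            # step 2: pair whose fusion increases heterogeneity V least
            p, q = min(((i, j) for i in active for j in active if i < j),
                       key=lambda ij: delta(*ij))
            # steps 3-4: fuse q into p; eq. (15) gives the fused mean
            nf = counts[p] + counts[q]
            means[p] = (counts[p] * means[p] + counts[q] * means[q]) / nf
            counts[p] = nf
            active.remove(q)
            labels = [p if lab == q else lab for lab in labels]

        reps = sorted(active)
        new_id = {r: c for c, r in enumerate(reps)}
        return [means[r] for r in reps], [new_id[lab] for lab in labels]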

3.1.3. k-means clustering procedure

The k-means clustering technique is well known and is used in addition to
Ward's clustering procedure (see Section 3.1.2). It works as follows:

1. Start with an initial set of cluster means.


2. For each element i of each cluster search for the nearest-neighbour
   cluster, i.e. the cluster whose mean has minimum L2 distance to i.
3. Move the density i, if necessary, to the new nearest-neighbour cluster.
4. Re-estimate the cluster means.
5. If no density moves from one cluster to another, stop; otherwise go to 2.

The k-means clustering procedure can be included after each clustering itera-
tion of Section 3.1.2 or, to save computation time, after a certain number of
iterations.
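A sketch in the same vein (our naming; the initial assignment would typically come from the Ward procedure above, and a production version would also guard against clusters becoming empty at initialization):

    import numpy as np

    def kmeans_refine(vectors, assign, n_clusters, max_iter=100):
        # vectors: (N, D) density mean vectors to be clustered;
        # assign: initial cluster index per vector (step 1).
        vectors = np.asarray(vectors, dtype=float)
        assign = np.asarray(assign)
        centers = np.array([vectors[assign == c].mean(axis=0)
                            for c in range(n_clusters)])
        for _ in range(max_iter):
            # steps 2-3: move each element to the nearest cluster mean
            dists = ((vectors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            new_assign = dists.argmin(axis=1)
            if np.array_equal(new_assign, assign):
                break                       # step 5: no moves -> stop
            assign = new_assign
            # step 4: re-estimate the cluster means
            for c in range(n_clusters):
                if np.any(assign == c):
                    centers[c] = vectors[assign == c].mean(axis=0)
        return centers, assign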

3.2. State-tying versus density-tying

Hidden Markov models may share some or all component densities of their
mixture densities if they model acoustically similar events. This similarity can
be modelled on state level and density level. As a result of state-tying (see
Fig. 1), complete states will be tied together, i.e. the tied states will share the
same inventory of component densities.
Density-tying on the other hand allows different models to share common
regions of the acoustic space (see Fig. 1). It is done across HMM states and
is independent of the previously mentioned state-tying. Note that the resulting
configuration is in essence a tied mixture-density system where the degree of
tying is determined by the amount of clustering.

Fig. 1. State-tying versus density-tying. With state-tying, two similar states are modelled by one
and the same mixture density; with density-tying, two similar states are modelled by two different
mixture densities, but some of the component densities are tied.

Note that for single-density emission probabilities there is actually no
difference between state-tying and density-tying; this is convenient for the
implementation, since both kinds of tying are then obtained by the same
clustering procedure.

3.3. Experimental results

We carried out experiments on several small-vocabulary recognition tasks.


Here, small-vocabulary speech recognition is used as a synonym for an acoustic
modelling approach which employs hidden Markov models of words rather
than of phonemes.
For word-model-based small-vocabulary speech recognition, state tying
identifies acoustically similar states within different words. This results in
deciding automatically and in a data-driven way which parts of the words in
the recognition vocabulary are similar and can therefore be modelled with
shared parameters.

TABLE II
String error rate (SER) on TI Digits for various configurations with a small
number of densities

SER (%)   Configuration
3.37      0.6 k non-tied single densities, 19 200 parameters
2.97      0.3 k tied densities, 1 k weights, 10 600 parameters
2.59      1.2 k non-tied densities, 39 600 parameters
2.67      0.3 k tied densities, 1.6 k weights, 11 200 parameters


TABLE III
String error rate (SER) on TI Digits for various configurations with a large
number of densities

SER (%)   Configuration
1.91      2.4 k non-tied densities, 79 200 parameters
1.90      0.8 k tied densities, 3.2 k weights, 28 800 parameters
1.45      4.8 k non-tied densities, 158 400 parameters
1.30      2 k tied densities, 10 k weights, 74 000 parameters
1.16      9.5 k non-tied densities, 304 000 parameters
1.14      5 k tied densities, 18.5 k weights, 178 000 parameters
0.95      19 k non-tied densities, 608 000 parameters
0.97      10.5 k tied densities, 26 k weights, 362 000 parameters

Tables II and III show the effects of clustering on the number of parameters
and on the error rate for experiments on the adult speakers’ portion of the
Texas Instruments Connected Digits recognition task. For details on the
non-tied system, see [2]. In Tables II and III experiments with similar string
error rates are grouped together. It can be seen that the number of model para-
meters could be reduced by a factor of 2 to 3 without increase in error rate. For
a medium error-rate performance range (1.1% to 3% string error rate), the
results can be stated alternatively: given the same number of parameters, the
tied system achieves an error rate up to 30% lower. Similar results have
been obtained on the other databases mentioned in Section 2. Thus the
described clustering techniques are a powerful means to reduce the computing
and memory demands to make the recognition fit on cheap hardware.

4. Summary

When it comes to real-world voice control applications, robustness and real-
time response might even be more important issues than a task-dependent
optimized recognition accuracy. We discussed two techniques to achieve
robust high-accuracy real-time speech recognition in real-world environments:
the spectral normalization technique and clustering techniques.


The reported results show that in the case of a mismatch of training and test
conditions spectrum normalization is essential, since it is able to remove the
negative effects on the error rate of changing transfer channel characteristics
and even of different noise levels of training and test. For the reported
speaker-independent recognition experiments spectrum normalization also
improves performance since it discards speaker-specific spectral characteristics
in the speech spectrum. Further, it can be observed that recursive filters, which
have to be employed to achieve real-time response, perform somewhat worse
than the mean subtraction technique. In the cases of speaker-dependent
recognition and match between training and test environment, spectrum
normalization does not yield any benefit and can even worsen performance
slightly. Spectrum normalization may thus be viewed as a safeguard measure
for cases where the test environment differs from the training conditions.
Clustering techniques have been applied to obtain a compact representation
of acoustic models. This translates directly into reduced computation and
memory demands in an implementation. This is an important factor in voice
control applications which have to live with tight cost constraints. Moreover,
clustering has another beneficial side-effect: a better utilization of the training
data. In this paper we have discussed in detail a maximum-likelihood-based
clustering procedure applied at various levels within the acoustic modelling.
At the state level, clustering allows us to avoid the duplication of acoustically
similar models. A consequence is that rarely seen acoustic events can be
modelled together with more robust ones. At the density level, clustering
allows us to model better the part of the acoustic space that is shared by
different models. A combination of the two clustering techniques leads to a
reduction of the number of parameters by a factor of up to three and to a sig-
nificant error rate reduction on several small-vocabulary speech recognition
tasks.

REFERENCES

[1] S. Gamm and R. Haeb-Umbach, User interface design of voice controlled consumer
    electronics, Philips J. Res., 49(4), 439-454 (1995).
[2] R. Haeb-Umbach, D. Geller and H. Ney, Improvements in connected digit recognition using
    linear discriminant analysis and mixture densities, in Proc. IEEE Int. Conf. on Acoustics,
    Speech, and Signal Processing, Minneapolis, MN, Apr. 1993, pp. II-239-II-242.
[3] A. Acero, Acoustical and Environmental Robustness in Automatic Speech Recognition,
    Kluwer Academic Publishers, Boston, MA, 1993.
[4] B.H. Juang, Speech recognition in adverse environments, Comput. Speech and Lang., 5,
    275-294 (1991).
[5] R. Haeb-Umbach, D. Geller and H. Ney, Improvements in speech recognition for voice
    dialing in the car environment, in Proc. ESCA-ETRW Workshop on Speech Processing in
    Adverse Conditions, Cannes-Mandelieu (France), Nov. 1992, pp. 203-206.
[6] A. Nadas, D. Nahamoo and M. Picheny, Speech recognition using noise-adaptive proto-
    types, IEEE Trans. Acoust. Speech and Signal Processing, 37(10), 1495-1503 (Oct. 1989).
[7] H.G. Hirsch, P. Meyer and H.-W. Ruehl, Improved speech recognition using high-pass
    filtering of subband envelopes, in Proc. European Conf. on Speech Communication and
    Technology, Genova, Sep. 1991, pp. 413-416.
[8] H.W. Ruehl, S. Dobler, J. Weith, P. Meyer, A. Noll, H.H. Hamer and H. Piotrowski, Speech
    recognition in the noisy car environment, Speech Commun., 10(1), 11-22 (Feb. 1991).
[9] R.G. Leonard, A database for speaker-independent digit recognition, in Proc. IEEE Int.
    Conf. on Acoustics, Speech, and Signal Processing, San Diego, CA, Mar. 1984, pp.
    42.11.1-42.11.4.
[10] A. Noll, H.H. Hamer, H. Piotrowski, H.W. Ruehl, S. Dobler and S. Weith, Real-time
    connected-word recognition in a noisy environment, in Proc. IEEE Int. Conf. on Acoustics,
    Speech, and Signal Processing, Glasgow, UK, May 1989, pp. 679-682.
[11] H. Hermansky and N. Morgan, Towards handling the acoustic environment in spoken
    language processing, in Proc. Int. Conf. Spoken Language Processing, Banff, Canada, Oct.
    1992, pp. 85-88.
[12] L.R. Rabiner, Mathematical foundations of hidden Markov models, in H. Niemann, M.
    Lang and G. Sagerer (eds.), Recent Advances in Speech Understanding and Dialog Systems,
    Vol. F46 of NATO ASI Series, Springer, Berlin, 1988, pp. 183-205.
[13] D. Steinhausen and K. Langer, Clusteranalyse, Walter de Gruyter, Berlin, 1977.
