Speaker Recognition Models (Kin Yu, John Mason, John Oglesby)
[...] feature extraction.
The database is divided into training and testing sets. The first ten versions, i.e. the first two collection sessions, are reserved for training, with the remaining fifteen repetitions reserved for testing. An incremental training data set selection is performed until all the data from the training set is exhausted; thus a series of experiments using one through ten training versions is utilised. A subset of speakers is adopted: the data from twenty males, all of approximately the same age, are used, and the vocabulary is reduced to ten digits, 1 through 9 and zero.

Mel-scale warped cepstra from a Hamming window of 32ms, with 50% overlap, are used to parameterise the speech, and pooled inverse variance weighting is applied to each of the 14 cepstral coefficients.

3. MODEL PARAMETERS

In DTW the model parameters are fully determined by the training data and the vocabulary. In contrast, in the cases of VQ and CDHMMs, decisions on model parameters need to be made. For VQ the primary factor is the codebook size. The CDHMM case is less straightforward, since the model topology is also a variable; as a consequence the two primary parameters are the number of states and the number of mixtures in each state. In the following sections we look at recognition performance for VQ and CDHMMs in terms of the respective model parameters, for both TI and TD conditions.

3.1 Text-independent parameters

Results from experiments with various CDHMM topologies trained with 1, 5 and 10 versions are summarised in Figure 2a. From these results it is noticed that the performance of the model correlates highly with the total number of mixtures in the model, i.e. the number of states times the number of mixtures per state. The trend shown here for 10-version training (10vt) has also been observed by others [2][8], also in TI experiments. The profiles for 1vt, and to a lesser extent 5vt, show the effects of insufficient training data on parameter estimation.

Fig. 2. TI parameter variation: %error against (a) total number of mixtures for ergodic CDHMMs, (b) number of training versions for VQ codebooks

Given that near optimal performance lies in the three profiles of the thirty-two mixture equivalent models (2m16s, 8m4s and 32m1s), the 32m1s configuration is chosen for subsequent text-independent experiments, a form which has been used by other researchers [2][8][9]. For a comparison we require the second modelling technique, VQ, to be of a similar size. Figure 2b shows codebook performance as a function of the codebook rate and amount of training data. Noticeable performance differences occur between 16 and 32 codewords, and between 32 and 64 codewords. Above 64 codewords performance improvements are small. A 32-element VQ codebook is chosen, despite its slightly sub-optimal performance, to be similar in size to the CDHMM.

3.2 Text-dependent parameters

Results for the corresponding TD experiments are shown for CDHMM and VQ in Figure 3a and Figure 3b respectively. For VQ, the performance again improves with the codebook size, with little improvement beyond a codebook size of 8. For the text-dependent CDHMM (Figure 3a) the TD results show a clear minimum region when the total number of mixtures is between 8 and 16. Similar curves are observed for different amounts of training data. Within this region the state/mixture combinations which give the best performances are 5m2s, 2m6s and 1m8s, suggesting that performance is little affected by the state transition parameters of the CDHMM.

Hence in the TD case, an 8-element VQ codebook and a constrained 8-state single-mixture CDHMM are chosen to compare with the DTW modelling technique.

4. PERFORMANCE COMPARISON

TI performance: Figure 4a illustrates the identification performance of a 32-element VQ codebook and a 32-mixture single-state CDHMM. For 1 and 2 version training VQ performs better than the CDHMM, but for 7, 8, 9 and 10 version training the CDHMM outperforms the simpler modelling technique. Between [...]
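The VQ-distortion identification compared here can be sketched as follows. This is a minimal illustration, not the authors' implementation: it trains per-speaker codebooks with plain k-means (the paper's codebooks would typically be built with the LBG algorithm [3]) and identifies a test utterance by the lowest average quantisation distortion; all function names are illustrative.

```python
import numpy as np

def train_codebook(frames, k=32, iters=20, seed=0):
    """Train a k-entry codebook over cepstral frames with plain k-means."""
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest codeword.
        d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each codeword to the centroid of its assigned frames.
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = frames[labels == j].mean(axis=0)
    return codebook

def avg_distortion(frames, codebook):
    """Mean distance from each frame to its nearest codeword."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(test_frames, codebooks):
    """Pick the speaker whose codebook yields the lowest average distortion."""
    return min(codebooks, key=lambda spk: avg_distortion(test_frames, codebooks[spk]))
```

Note that, as the paper observes, this scoring discards all time-sequence information: the frames of the test utterance are quantised independently.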
[Fig. 3. TD parameter variation: %error against (a) total number of mixtures for CDHMMs (1m8s = 1 mixture, 8 states), (b) number of training versions for VQ codebooks of 2 to 32 codewords.]
[Fig. 4. Identification %error against number of training versions: (a) TI, 32-element VQ vs 32m1s CDHMM; (b) TD, 8-element VQ, 1m8s CDHMM and DTW.]
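The DTW technique whose performance is compared above can likewise be sketched. This is a hedged illustration of template matching by dynamic time warping, not the authors' exact configuration: a length-normalised Euclidean warping distance with the standard three-way local path, and identification against the nearest reference template.

```python
import numpy as np

def dtw_distance(ref, test):
    """Dynamic time warping distance between two feature sequences
    (frames x coefficients), using the standard symmetric local path."""
    n, m = len(ref), len(test)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(ref[i - 1] - test[j - 1])
            # Best of the three predecessor cells: match, insertion, deletion.
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m] / (n + m)  # length-normalised accumulated distance

def identify_dtw(test_seq, templates):
    """Pick the speaker whose reference template is nearest under DTW."""
    return min(templates, key=lambda spk: dtw_distance(templates[spk], test_seq))
```

Unlike the VQ scoring, the alignment path here preserves the time ordering of the frames, which is the speaker-specific time-sequence information the conclusion credits for DTW's performance.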
Fig. 5. DTW text-dependent digit performance for (a) 2, 3, 4, 7 and zero and (b) 1, 5, 6, 8 and 9

[...]ing versions is also illustrated for digits 1 and 9. The worst performers are digits 4, 8, 6 and 2. A large variation is observed across the digits. For example, Figure 5a shows digit 4 performing badly, while digit zero performs well across all training versions, with a performance difference of 6.3% at their closest point. Hence, in a password system consisting only of digits, judicious choice could significantly improve performance.

5. CONCLUSION

Perhaps the most surprising overall finding presented in this paper is the superior performance of DTW over both VQ and the CDHMM. As mentioned above, the CDHMM performance is likely to improve with certain parameter estimation from pooled data, with only the means being updated on a speaker-specific basis.

This can be viewed as one step in moving the CDHMM towards a DTW or a VQ approach; continuing in this vein, the DTW may be viewed as merely a degenerate case of the CDHMM. In turn the VQ approach may be regarded as a degenerate case of DTW. Considering first the latter pair, the essential difference between VQ and DTW is the inherent time-alignment within DTW, and the results indicate that some speaker-specific time-sequence information within speech, completely lost in VQ, is captured by DTW. In contrast, the lack of recognition sensitivity to the number of CDHMM states suggests that the state transition probabilities do not themselves contribute to discrimination, but serve merely to align speech events to states.

These observations raise the question of how a CDHMM might be customised to harness the time-sequence information, thereby equalling or outperforming the DTW approach. This can be done by assigning each observation in the reference template as a state, and flattening the variances and the transition probabilities. This does not, however, prevent bad parameter estimates with small amounts of training data using existing algorithms. [...] for each speaker. This will utilise the advantages of the DTW approach, which provides good performance with small amounts of training data.

6. ACKNOWLEDGEMENTS

The authors wish to thank BT Labs for the use of [...] for this work.

REFERENCES

[1] D. A. Irvine and F. J. Owens. A comparison of speaker recognition techniques for telephone speech. Proc. Eurospeech-93, 3:2275–2278, 1993.
[2] T. Matsui and S. Furui. Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs. IEEE Trans. Speech and Audio Processing, 2:456–459, 1994.
[3] Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantizer design. IEEE Trans. Communications, 28:84–95, 1980.
[4] F. K. Soong, A. E. Rosenberg, L. R. Rabiner, and B. H. Juang. A vector quantization approach to speaker recognition. Proc. ICASSP-85, 1:387–390, March 1985.
[5] J. T. Buck, D. K. Burton, and J. E. Shore. Text-dependent speaker recognition using vector quantisation. Proc. ICASSP-85, 1:391–394, 1985.
[6] S. Furui. Cepstral analysis technique for automatic speaker verification. IEEE Trans. ASSP-29, pages 254–272, 1981.
[7] S. J. Young and P. C. Woodland. HTK: Hidden Markov model toolkit V1.4 User manual. Cambridge University Engineering Department, Speech Group, 1992.
[8] X. Zhu, Y. Gao, S. Ran, F. Chen, I. Macleod, B. Millar, and M. Wagner. Text-independent speaker recognition using VQ, mixture Gaussian VQ and ergodic HMMs. Proc. ESCA-94, pages 55–58, 1994.
[9] J. de Veth and H. Bourland. Comparison of hidden Markov model techniques for speaker verification. Proc. ESCA-94, 1994.
[10] E. S. Parris and M. J. Carey. Discriminative phonemes for speaker identification. Proc. ICSLP-94, 4:1843–1846, 1994.