Speaker Identification in Spectrograms
Introduction
Pattern Recognition problems are amongst the most challenging and fascinating
areas in speech research. The closed set text-dependent speaker identification problem may be
stated as follows. Out of a total population of N “known” speakers, find the speaker whose
reference pattern has closest resemblance to the sample pattern of the “unknown” speaker
who is assumed to be one of the given set of speakers.
A speaker identification system comprises four basic components: (i) feature extraction, (ii)
speaker modeling, (iii) speaker matching, and (iv) decision logic. The feature extraction
module converts the raw speech waveform of a given sample into a spectrogram. The
spectrograms are then used to build a statistical model of each speaker's voice patterns, and
these models are stored in a database. When an unknown sample arrives, its spectrogram is
matched against those in the database. The decision logic finally makes a one-out-of-N
decision, i.e. selects the speaker with the maximum degree of similarity. Some of the established
modeling techniques, such as vector quantization (VQ) [1] and Gaussian mixture models
(GMM) [2], use the same general framework for speaker identification.
In this study, speaker identification is carried out by comparing speech
spectrograms. This paper is organized as follows. In the Image Comparison section, we
give an overview of a non-parametric statistical procedure that measures the similarity between
two images using a Kolmogorov-Smirnov-type test based on the Hollander-Wolfe statistic.
In the Speaker Identification Algorithm section, we propose a technique for identifying or
authenticating an unknown speaker; its results, along with a comparison against some
established techniques, are stated in a later section. Concluding remarks and future work are
discussed in the last section.
Image Comparison
As can be seen from the spectrograms illustrated in Figure 1, the images appear
dissimilar for different speakers uttering the same word, 'gadget'. Hence, an essential task of
image comparison is to justify this claim. Under the assumption that the images are subject to
random noise, we want to test whether two images are the same (i.e. whether the speech samples
are of the same speaker). We say that two images are the same if they have the same grey-scale
distribution [3]. Clearly, if two images are the same up to a small noise, they should have close grey-scale
distributions. The reverse, though, is not true. Thus, grey-scale distribution analysis is helpful
when images of the same content are compared. One can compute the distribution, or more
specifically the cumulative distribution function, as the probability that a pixel takes a
grey-scale level not exceeding g, where g = 0, ..., 255. If {h_g} is the normalized image
histogram, then

F_g = Σ_{g'=0}^{g} h_{g'}

is the empirical cumulative distribution function. As for any distribution function, F_g is
non-decreasing in g, with F_255 = 1.

Figure 1: The spectrograms of three sample speech signals of the same word 'gadget', the first two uttered
by 'Speaker 1' and the last by 'Speaker 5'.
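As a minimal sketch of this step, the empirical CDF of an 8-bit image band and a distance between two bands might be computed as below. Note that the plain Kolmogorov-Smirnov distance is used here for illustration; the actual statistic used in this paper is the Hollander-Wolfe statistic, which is not reproduced in this sketch, and the function names are ours.

```python
import numpy as np

def grey_cdf(band):
    """Empirical CDF F_g over grey levels g = 0..255 for one image band."""
    hist = np.bincount(band.ravel().astype(np.int64), minlength=256)[:256]
    return np.cumsum(hist) / band.size

def ks_distance(band_a, band_b):
    """Kolmogorov-Smirnov distance: max_g |F_g(a) - F_g(b)|."""
    return np.max(np.abs(grey_cdf(band_a) - grey_cdf(band_b)))

# Two toy 8-bit "bands": identical distributions give distance 0,
# completely disjoint ones give distance 1.
a = np.zeros((4, 4), dtype=np.uint8)
b = np.full((4, 4), 255, dtype=np.uint8)
print(ks_distance(a, a))  # 0.0
print(ks_distance(a, b))  # 1.0
```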
The image comparison between two spectrograms is conducted as follows. Both
images are partitioned into several overlapping bands with a common 'optimal' bandwidth
and overlap. A good success rate and speedy completion of the test may be taken as the
optimality criteria for bandwidth and overlap selection. Splitting the frequency axis into
several bands produces significantly better results than segmenting the time axis, primarily
because of phase shifts along the time axis. The pixel values in these frequency bands may be
interpreted as the energy content of the bands (which uniquely characterizes an
individual) as the speech signal is swept through time.
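The banding step can be sketched as follows; the function name and the example bandwidth/overlap values are illustrative assumptions, since the paper treats these as empirically tuned parameters.

```python
import numpy as np

def frequency_bands(spec, bandwidth, overlap):
    """Split a spectrogram (frequency x time) into overlapping bands along
    the frequency axis. `bandwidth` and `overlap` are in frequency bins;
    'optimal' values are assumed to be chosen empirically, as in the text."""
    step = bandwidth - overlap
    return [spec[i:i + bandwidth, :]
            for i in range(0, spec.shape[0] - bandwidth + 1, step)]

spec = np.random.rand(256, 100)   # hypothetical 256-bin spectrogram, 100 frames
bands = frequency_bands(spec, bandwidth=64, overlap=32)
print(len(bands))                 # 7 bands, starting at bins 0, 32, ..., 192
```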
This fact lays the basis of our methodology to verify and, more importantly, identify
a speaker. Let p be the number of bands into which the frequency axis of a spectrogram is
segmented. We compute the Hollander-Wolfe statistic (denoted here as x_i) between the
i-th (i = 1, ..., p) corresponding frequency bands of the two spectrograms. Hence, we get a
p × 1 vector, X = {x_1, x_2, ..., x_p}, representing the image distances between the p frequency
bands of the spectrograms. An optimal weighted sum of these distances produces a Weighted
Image distance between the spectrograms for a given word. Simple averaging instead of
optimal weighting produces only moderate results. Therefore, the procedure for assigning
optimal weights to the frequency bands should capture both the similarity between samples of a
speaker's utterances of the same word and the variations among them, giving higher
weights to the more stable bands of a particular speaker. Two such methods are discussed below.
Given two different sample spectrograms of a specified word uttered by a particular
speaker, the image distances between corresponding frequency bands of the spectrograms are
stored in a vector X of dimension 1 × p (here p denotes the number of bands). Let the vector X^(i)
denote the i-th row of the matrix M of dimension n × p (here n denotes the number of spectrogram
comparison tests conducted between sample spectrograms of the same word uttered by
that particular speaker). These {X^(i)}, i = 1, ..., n, are used to form an estimate of the covariance
matrix: S = cov(M). Two weighting schemes are considered. (I) To obtain the 'optimal' weight
vector w of dimension p × 1, minimize the objective function f(w) = (1/2) w'Sw subject to the
constraints Σ_{i=1}^{p} w_i = 1 and w_i ≥ 0 for all i. (II) The same minimization problem as
above, except that the weights, which still add up to unity, need not take
only positive values. Given a word, the motive is to gain even from the unstable frequency
bands by assigning negative weights to those bands. Bands of instability may vary with
speaker. [In case a column of S is identically zero (as for a band with consistently zero energy
content), add small-variance positive random noise to that column; this preserves the
non-singularity of the matrix S.]
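For scheme (II), the equality-constrained problem has the closed form w = S⁻¹1 / (1'S⁻¹1) by Lagrange multipliers, which the sketch below uses; scheme (I), with the added non-negativity constraints, requires a quadratic-programming solver and is not shown. The function name is ours, and the singularity guard here ridges the diagonal, a common variant of the noise perturbation mentioned in the text.

```python
import numpy as np

def optimal_weights(M):
    """Scheme (II): minimize (1/2) w'Sw subject to sum(w) = 1, with weights
    free in sign. M is the n x p matrix whose rows are band-distance vectors
    X^(i) from the n comparison tests."""
    S = np.cov(M, rowvar=False)
    # Guard against a singular S (e.g. a consistently zero-energy band) by
    # adding a tiny ridge to the diagonal.
    if np.linalg.matrix_rank(S) < S.shape[0]:
        S = S + 1e-8 * np.eye(S.shape[0])
    ones = np.ones(S.shape[0])
    w = np.linalg.solve(S, ones)     # proportional to S^{-1} 1
    return w / w.sum()               # rescale so the weights sum to 1

rng = np.random.default_rng(0)
M = rng.random((20, 5))              # 20 comparison tests, 5 frequency bands
w = optimal_weights(M)
print(round(float(w.sum()), 6))      # 1.0
```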
While identifying an unknown speaker, given a word, the resultant ŵ (of dimension p × 1)
gives the optimal weights for the frequency bands of both the aforementioned particular
speaker's sample spectrogram and the unknown speaker's spectrogram when computing the
Weighted Image distance between the two images, given by ŵ'X (where X is a vector of image
distances between corresponding frequency bands of the two images).
1. Identifying a speaker
In this study, we used three distinct words (two monosyllables and one disyllable),
namely 'gadget', 'loss' and 'cat', from each speaker in a closed set of 30 speakers, recorded in
a noise-free environment to minimize the loss of speaker-dependent information. Samples
from each speaker were collected in different sessions spread over time, to make our
database as robust as possible. Before computing the spectrograms, any DC offset
present in the signals was removed, the signals were centered around 0 vertically, and the
maximum amplitude was normalized to -3 dB. This ensures a fair comparison of the spectrograms.
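The preprocessing described above can be sketched as follows; the function name is ours, and the -3 dB target is interpreted as a linear peak amplitude of 10^(-3/20) of full scale.

```python
import numpy as np

def preprocess(signal):
    """Remove DC offset and normalize the peak amplitude to -3 dB, as
    described in the text, so spectrograms are compared on equal footing."""
    signal = signal - np.mean(signal)        # centre the waveform around 0
    peak = np.max(np.abs(signal))
    target = 10 ** (-3 / 20)                 # -3 dB is about 0.708 full scale
    return signal * (target / peak) if peak > 0 else signal

x = np.array([0.5, 1.5, 0.5, -0.5])          # toy signal with a DC offset
y = preprocess(x)
print(round(float(np.max(np.abs(y))), 3))    # 0.708
```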
Six utterances of each of the three words are taken from the 30 speakers. After
computing the optimal weights using 4 of the 6 samples as training samples for each word
and each speaker, we select only 3 sample spectrograms (taking 2 samples for each word
produced 99.2% as the best rate of successful identification after considering all optimality
procedures) randomly from among the 4 training samples for each word from every
speaker in the set of 30, for identification purposes. Now, one of the speakers
(randomly chosen from the prescribed set) utters the three words. Let D*_{jkm} denote the
Weighted Image distance between the m-th (1 ≤ m ≤ 3, the 3 sample spectrograms chosen
for each word) sample spectrogram of the k-th (1 ≤ k ≤ 3) word of the j-th (1 ≤ j ≤ 30)
speaker in our database and the unknown speaker's spectrogram of the k-th word. Identify
the unknown speaker as the i-th speaker if

D_{i•} = min_{1 ≤ j ≤ 30} D_{j•}, where D_{j•} = (1/3) Σ_{k=1}^{3} D_{jk•} for all j, and D_{jk•} = min_{1 ≤ m ≤ 3} D*_{jkm}.

For each j, D_{j•} gives a measure of the aggregate distance between the unknown speaker's
spectrograms and those of the j-th speaker in our database.
For a speaker who is in the prescribed set, numerous tests were carried out to check
the values of the spectrogram distances against his/her own set of samples. Let
E_{jm} (j = 1, ..., 30; m = 1, ..., 18; i.e. 3 × (4 choose 2) = 18 tests are conducted, each test
consisting of comparisons with spectrograms of the three words) denote the aggregate distance
that the j-th speaker has with his/her own set of sample spectrograms (corresponding to the
three words).
Our problem is to successfully identify a speaker who is in the set of 30 speakers and reject
those who are not. So, for an unknown speaker, classify the speaker as 'not in the closed set' if
his minimum aggregate distance D is greater than Ē_j + z_α σ_j for all j. Here

Ē_j = (1/18) Σ_{m=1}^{18} E_{jm},   σ_j² = (1/(18 − 1)) Σ_{m=1}^{18} (E_{jm} − Ē_j)²,

and z_α is the 100(1 − α)% quantile of the standard normal distribution for a suitably chosen α.
Experimental results are shown for different values of α (Table 1), obtained by randomly
removing a set of 5 speakers from the database and then choosing a speaker from the original
30 speakers to test for authentication and validation.
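The open-set rejection check above can be sketched as follows; the function name is ours, and z_α is hard-coded for α = 0.05 as an illustrative choice to avoid a statistics-library dependency.

```python
import numpy as np

def reject(D_min, E):
    """Open-set check from the text. E has shape (30, 18): E[j, m] is the
    j-th speaker's aggregate self-distance in test m. Reject ('not in the
    closed set') if D_min exceeds Ebar_j + z_alpha * sigma_j for every j."""
    z_alpha = 1.645                    # standard normal quantile, alpha = 0.05
    Ebar = E.mean(axis=1)
    sigma = E.std(axis=1, ddof=1)      # matches the 1/(18 - 1) form in the text
    return bool(np.all(D_min > Ebar + z_alpha * sigma))

rng = np.random.default_rng(2)
E = rng.random((30, 18))               # toy self-distances in [0, 1)
print(reject(10.0, E))                 # True: far outside every threshold
print(reject(0.0, E))                  # False: within every threshold
```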
Table 1: 'Success Rate' refers to the recognition rate within the closed set of speakers, while 'FA Rates' (False
Alarm Rates) indicate the rate of incorrect rejection or false acceptance for a randomly chosen speaker
who may or may not be within the given set of speakers.
References
1. F.K. Soong, A.E. Rosenberg, B.-H. Juang, and L.R. Rabiner. A vector quantization
approach to speaker recognition. AT&T Technical Journal, 66:14–26, 1987.
2. D.A. Reynolds. Speaker identification and verification using Gaussian mixture
speaker models. Speech Communication, 17:91–108, 1995.
3. E. Demidenko. Mixed Models: Theory and Applications. Wiley, 2004, pp. 596–603.
4. Srinivasan, Srihari, and Beal. Signature verification using the Kolmogorov-Smirnov
statistic. Proc. International Graphonomics Society Conference (IGS), pp. 152–156, June 2005.
¹ Technical and programming support by Gourab Mukherjee (GMM) and Subhadip Mitra (VQ).