
TEXT DEPENDENT SPEAKER IDENTIFICATION USING SIMILAR PATTERNS IN SPECTROGRAMS

Tridibesh Dutta, Gopal Krishna Basak


Indian Statistical Institute, 203 B.T. Road, Kolkata-700108.

Here, we study a new application of image comparison: automated speaker identification in a closed set of speakers. This is done by scrutinizing datasets of spectrograms of the speakers to find the spectral pattern perceptually closest to the spectrogram of the unknown speaker. Using spectrogram segmentation, this paper mainly revolves around understanding the complex patterns of variation in frequency and amplitude over time for an utterance of a given word by an individual. The features used for identifying a speaker rely on the Kolmogorov-Smirnov test for image comparison. Performance of this novel approach on a sample collected from 30 speakers shows that this methodology can be used effectively to produce a 100% identification success rate in a closed set of speakers.

Introduction

Pattern Recognition problems are amongst the most challenging and fascinating
areas in speech research. The closed set text-dependent speaker identification problem may be
stated as follows. Out of a total population of N “known” speakers, find the speaker whose
reference pattern has closest resemblance to the sample pattern of the “unknown” speaker
who is assumed to be one of the given set of speakers.
Speaker Identification tasks include the basic components: (I) feature extraction (II)
speaker modeling (III) speaker matching and (IV) decision logic. The feature extraction
module converts the raw speech waveform in the given sample to a spectrogram.
Spectrograms are then used to build a statistical model of each speaker’s voice patterns, which is stored in a database. Later, when unknown samples arrive, their spectrograms are matched against those in the database. The decision logic finally makes a one-out-of-N
decision, e.g. selects the speaker with maximum degree of similarity. Some of the established
modeling techniques, such as vector quantization (VQ) [1] and Gaussian mixture models
(GMM) [2], use the same general framework for speaker identification.
In this study, speaker identification is carried out by means of comparisons of
speech spectrograms. This paper is organized as follows. In the Image Comparison section, we
give an overview of a non-parametric statistical procedure for measuring the similarity between
two images using the Kolmogorov-Smirnov test based on the Hollander-Wolfe statistic. In the
Speaker Identification Algorithm section, we propose a technique to identify or authenticate an
unknown speaker; the results, along with a comparison against some established techniques, are
stated in a later section. Concluding remarks and future work are discussed in the last section.

Image Comparison

As can be seen from the spectrograms illustrated in Figure 1, the images appear
dissimilar for different speakers uttering the word ‘gadget’. Hence, an essential task of
image comparison is to justify this claim. Under the assumption that the images are subject
to random noise, we want to test whether the images are the same (i.e. whether the speech
samples are of the same speaker). We say that two images are the same if they have the same
grey-scale distribution [3]. Clearly, if two images are the same, up to a small noise, they
should have close grey-scale distributions. The reverse, though, is not true. Thus, grey-scale
distribution analysis is helpful when images of the same content are compared.

Figure 1: The spectrograms of three sample speech signals of the same word ‘gadget’, the first two being uttered
by ‘Speaker 1’ and the last by ‘Speaker 5’.

One can compute the distribution, or more specifically, the cumulative distribution function,
as the sum of probabilities that a pixel takes a grey-scale level up to g, where g = 0, ..., 255.
If {h_g} is the image histogram, then

F_g = Σ_{g'=0}^{g} h_{g'}

is the empirical cumulative distribution function. As for any distribution function, F_g is
non-decreasing and takes values in the interval [0, 1]. Let F^(i) = {F_g^(i), g = 0, ..., 255}
(i = 1, 2) be the grey-scale distribution of the image M^(i) with pixel matrix of size
P_i × Q_i. We compute the maximum distance between the two distributions,
D̂ = max_g |F_g^(1) − F_g^(2)|, and consider the following statistic proposed by Hollander
and Wolfe:

λ_KS = D̂ (√J + 0.12 + 0.11/√J), where J = P_1 Q_1 P_2 Q_2 / (P_1 Q_1 + P_2 Q_2).

The Kolmogorov-Smirnov image test based on the Hollander-Wolfe statistic λ_KS [3], which
takes into account the pixel matrix size of each image, may be thought of as a distance
measure between the grey-scale distributions of two images. The Kolmogorov-Smirnov image
test has already found application in various fields such as signature verification [4] and
histological analysis of breast cancer [3].
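The grey-scale CDF comparison described above can be sketched in a few lines of Python. This is our own minimal illustration, not the authors’ code; the function name and the assumption of 8-bit (0–255) pixel values are ours, and the small-sample adjustment D̂(√J + 0.12 + 0.11/√J) follows the Hollander-Wolfe form given in the text.

```python
import numpy as np

def ks_image_distance(img1, img2):
    """Hollander-Wolfe adjusted Kolmogorov-Smirnov distance between the
    grey-scale distributions of two images with pixel values 0..255.
    A sketch of the statistic described in the text; names are ours."""
    # Empirical CDFs over the 256 grey levels
    f1 = np.cumsum(np.bincount(img1.ravel(), minlength=256) / img1.size)
    f2 = np.cumsum(np.bincount(img2.ravel(), minlength=256) / img2.size)
    d = np.max(np.abs(f1 - f2))        # D-hat: max CDF discrepancy
    n1, n2 = img1.size, img2.size      # pixel counts P1*Q1 and P2*Q2
    j = n1 * n2 / (n1 + n2)
    return d * (np.sqrt(j) + 0.12 + 0.11 / np.sqrt(j))
```

By construction the distance is zero for identical images and grows with the separation between the two grey-scale CDFs.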

Speaker Identification Algorithm

The image comparison between two spectrograms is conducted as follows. Both
images are partitioned into several overlapping bands having a common ‘optimal’ bandwidth
and overlap. A good success rate and speedy completion of the test may be taken as the
optimality criteria for bandwidth and overlap selection. Splitting the frequency axis into
several bands produces significantly better results than segmenting the time axis, primarily
because of phase shifts along the time axis. The pixel values in these frequency bands may be
interpreted as the energy content of the frequency bands (which uniquely characterizes an
individual) as the speech signal is swept through time.
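The overlapping frequency-band partition might be sketched as follows. The helper and its defaults are our own illustration (the default bandwidth 33 and overlap 11 are the frequency-axis values reported in the Results section, assuming a 256-row spectrogram), not the authors’ implementation.

```python
import numpy as np

def frequency_bands(spec, bandwidth=33, overlap=11):
    """Partition a spectrogram (frequency x time array) into overlapping
    horizontal bands along the frequency axis. Consecutive bands start
    (bandwidth - overlap) rows apart; trailing rows that do not fill a
    whole band are dropped."""
    stride = bandwidth - overlap
    bands = []
    start = 0
    while start + bandwidth <= spec.shape[0]:
        bands.append(spec[start:start + bandwidth, :])
        start += stride
    return bands
```

With a 256-row spectrogram this yields the 11 bands of width 33 mentioned later in the paper.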
This fact lays the basis of our methodology to verify and, more importantly, identify
a speaker. Let p be the number of bands into which the frequency axis of a spectrogram is
segmented. We compute the Hollander-Wolfe statistic (denoted here as x_i) between the
i-th (i = 1, ..., p) corresponding frequency bands of the two spectrograms. Hence, we get a
p × 1 vector, X = {x_1, x_2, ..., x_p}, representing the image distances between the p frequency
bands of the spectrograms. An optimal weighted sum of these distances produces a Weighted
Image distance between the spectrograms for a given word. Simple averaging instead of
optimal weighting produces only moderate results. Thereby, the procedure for assigning optimal
weights to the frequency bands should be such that it captures both the similarity between
samples of a speaker’s utterance of the same word and the variations among them, giving
higher weights to the more stable bands of a particular speaker. Two such methods are
discussed below.
Given two different sample spectrograms of a specified word uttered by a particular
speaker, the image distances between corresponding frequency bands of the spectrograms are
stored in a vector X of dimension p × 1 (here p denotes the number of bands). Let the vector
X^(i) denote the i-th row of the matrix M_{n×p} (here n denotes the number of spectrogram
comparison tests conducted between the sample spectrograms of the same word uttered by
that particular speaker). These {X^(i)}_{i=1}^{n} are used to form an estimate of the
covariance matrix: S = cov(M). (I) To obtain the ‘optimal’ weight vector w_{p×1}, our goal
is to minimize the objective function f(w) = (1/2) w'Sw under the constraints
Σ_{i=1}^{p} w_i = 1 and w_i ≥ 0 for all i. (II) The same minimization problem as above,
except that the weights, which still add up to unity, need not be positive. Given a word, the
motive is to gain even from the unstable frequency bands by assigning negative weights to
those bands; the bands of instability may vary with the speaker. [In case a column of S
equals zero (as for a band with consistently zero energy content), positive real-valued random
noise of small variance is incorporated into that column. This maintains the non-singularity
of the matrix S.]
While identifying an unknown speaker, for a given word, the resulting ŵ_{p×1} gives the
optimal weights to be assigned to the frequency bands of both the aforementioned particular
speaker’s sample spectrogram and the unknown speaker’s spectrogram while computing the
Weighted Image distance between the two images, given by ŵ'X (where X is the vector of
image distances between corresponding frequency bands of the two images).
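Weight selection under method (II) can be sketched directly, since the equality-constrained problem min w'Sw subject to Σw_i = 1 has the closed-form solution w = S⁻¹1 / (1'S⁻¹1). The jitter term below mirrors the paper’s remedy of adding small-variance noise for zero-variance bands; all names are our own, and method (I), with the added non-negativity constraint, would instead require a quadratic-programming solver.

```python
import numpy as np

def optimal_weights(M, jitter=1e-8):
    """Weights minimizing w' S w subject to sum(w) = 1, where S = cov(M)
    and M is the n x p matrix of band-wise image distances from n
    training comparisons (method (II): weights may be negative)."""
    S = np.cov(M, rowvar=False) + jitter * np.eye(M.shape[1])
    ones = np.ones(M.shape[1])
    w = np.linalg.solve(S, ones)       # proportional to S^{-1} 1
    return w / w.sum()                 # normalize so the weights sum to 1

def weighted_image_distance(w, x):
    """Weighted Image distance w' X between two spectrograms."""
    return float(w @ x)
```

Since w is the exact minimizer, its variance w'Sw can be no larger than that of the simple-average weights, matching the paper’s observation that averaging gives only moderate results.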

1. Identifying a speaker

In this study, we used three distinct words (two monosyllables and one disyllable),
namely ‘gadget’, ‘loss’ and ‘cat’, from each speaker in a closed set of 30 speakers, recorded in
a noise-free environment, thus ensuring the least loss of speaker-dependent information. Samples
from each speaker were collected in different sessions spread over time, to make our
database as representative as possible. Before computation of the spectrogram, any DC offset
present in the signals was removed, the signals were centered around 0 vertically, and the
maximum amplitude was normalized to -3 dB. This ensures a fair comparison of the spectrograms.
Six utterances of each of the three words are taken from the 30 speakers. After
computing the optimal weights using 4 of the 6 samples as training samples for each word and
each speaker, we randomly select only 3 sample spectrograms per word from the 4 training
samples of every speaker for identification purposes (taking 2 samples per word produced at
best a 99.2% successful identification rate across all optimality procedures). Now, one of the
speakers (randomly chosen from the prescribed set) gives utterances of the three words. Let
D*_jkm denote the Weighted Image distance between the m-th (1 ≤ m ≤ 3, the 3 sample
spectrograms chosen for each word) sample spectrogram of the k-th (1 ≤ k ≤ 3) word
corresponding to the j-th (1 ≤ j ≤ 30) speaker in our database and the unknown speaker’s
spectrogram of the k-th word. Identify the unknown speaker as the i-th speaker if

D_i• = min_{1≤j≤30} D_j•, where D_j• = (1/3) Σ_{k=1}^{3} D_jk• and D_jk• = min_{1≤m≤3} D*_jkm.

For each j, D_j• gives a measure of the aggregate distance between the unknown speaker’s
spectrograms and those of the j-th speaker in our database.
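The decision rule above can be sketched as follows, assuming the distances D*_jkm have already been computed and stored as a 3-D array indexed as (speakers × words × stored samples); the array layout and function name are our assumptions.

```python
import numpy as np

def identify_speaker(D):
    """Closed-set identification: D has shape
    (n_speakers, n_words, n_samples) holding Weighted Image distances
    D*_jkm. Per word, take the minimum over the stored samples
    (D_jk.), average over the words (D_j.), and return the speaker
    index with the smallest aggregate distance."""
    D_jk = D.min(axis=2)      # D_jk. = min over the m stored samples
    D_j = D_jk.mean(axis=1)   # D_j.  = average over the k words
    return int(np.argmin(D_j)), D_j
```

The returned aggregate vector D_j can also feed the open-set threshold test described in the next subsection.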

2. Authentication of a speaker in an open set

For a speaker who is in the prescribed set, numerous tests were carried out to check
the values of the spectrogram distances with his/her own set of samples. Let E_jm
(j = 1, ..., 30; m = 1, ..., 18; i.e. 3 × (4 choose 2) = 18 tests are conducted, each test
consisting of comparisons with spectrograms of the three words) denote the aggregate
distances that the j-th speaker has with his/her own set of sample spectrograms
(corresponding to the three words). Our problem is to successfully identify a speaker who is
in the set of 30 speakers and reject those who are not. So, for an unknown speaker, classify
the speaker as ‘not in the closed set’ if his minimum aggregate distance D is greater than
Ē_j + z_α σ_j for all j. Here

Ē_j = (1/18) Σ_{m=1}^{18} E_jm,  σ_j² = (1/17) Σ_{m=1}^{18} (E_jm − Ē_j)²,

and z_α is the 100(1 − α)% quantile of the standard normal distribution for a suitably chosen
α. Experimental results are shown for different values of α (Table 1), obtained by randomly
eliminating the database of a set of 5 speakers and then choosing a speaker from the original
30 speakers to test for authentication and validation.
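The open-set threshold test might look like this; the interface is hypothetical, with E holding each enrolled speaker’s 18 self-comparison distances E_jm and z_alpha the chosen normal quantile.

```python
import numpy as np

def reject_if_unknown(D_min, E, z_alpha):
    """Open-set decision from the text: declare the claimant
    'not in the closed set' when the minimum aggregate distance D
    exceeds Ebar_j + z_alpha * sigma_j for EVERY enrolled speaker j.
    E has shape (n_speakers, n_tests)."""
    means = E.mean(axis=1)               # Ebar_j
    sds = E.std(axis=1, ddof=1)          # sigma_j with 1/(n-1), as in the text
    return bool(np.all(D_min > means + z_alpha * sds))
```

Raising z_alpha (i.e. lowering α) widens every speaker’s acceptance band, trading fewer false rejections of enrolled speakers for more false acceptances of outsiders.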

Recognition and Authentication Results

A comparison of successful identification rates obtained by splitting the spectrograms’
frequency axis (11 bands with an optimal bandwidth of 33 and overlap of 11) and, similarly,
the time axis (11 bands with an optimal bandwidth of 32 and overlap of 8) is depicted in
Table 1. For closed-set speaker identification, the success rates for the different
weight-selection rules [(1): simple average weights, (2): first optimal weight-selection
procedure, and (3): second optimal weight-selection procedure], taking three samples as a
database for each word and each speaker, are provided in Table 1 (column 2).
Though successful identification of a speaker from just one word may be as low as 30-40%,
by combining results from the 3 words and computing the aggregate distance (as stated in
‘Identifying a speaker’), one can obtain as high as a 100% identification success rate in the
closed-set problem, as is evident from the results in Table 1. Using different values of
α (as defined before), we determined the false alarm rates in identification and
authentication for the methodologies mentioned in the preceding sections. The results in
Table 1 also show that allotting optimal weights to the bands has a boosting effect on
the success rate. Success rates, when it is known that the unknown speaker is from the
closed set, were computed by conducting 240 tests for each procedure (each test comprising
3 test spectrograms corresponding to the three words uttered by a speaker from the closed
set). 100 tests for each of three values of α (for each procedure) gave estimates of the
false alarm rates. Table 1 also supports our hypothesis that segmenting the spectrogram
along the frequency axis gives better results than partitioning along the time axis for
these kinds of images, in which phase shifts in the time domain are likely to occur.
Procedure adopted      Success Rate    α       FA Rates
Time Scale      (1)    74%             0.025   31%
                                       0.05    29%
                                       0.01    28%
                (2)    83%             0.025   25%
                                       0.05    22%
                                       0.01    21%
                (3)    87%             0.025   17%
                                       0.05    16%
                                       0.01    15%
Frequency Scale (1)    86%             0.025   18%
                                       0.05    16%
                                       0.01    15%
                (2)    95%             0.025   13%
                                       0.05    12%
                                       0.01    10%
                (3)    100%            0.025   11%
                                       0.05    10%
                                       0.01    8%

Table 1: ‘Success Rate’ refers to the recognition rate within the closed set of speakers, while ‘FA Rates’ (False
Alarm Rates) indicate the rate of incorrect rejection or false acceptance for a randomly chosen speaker
who may or may not be within the given set of speakers.

Technique used (closed set of 30 speakers)   Time taken for identification   Error Rate
1. Spectrogram matching                      18 sec.                         0.00%
2. GMM¹                                      2.5 sec.                        35.5%
3. VQ¹                                       2 sec.                          39.33%

Table 2: Comparison of our proposed technique with some established Speaker Identification
techniques using the three words for speaker identification.

Applications, Conclusion and Future work

This methodology can be used to identify speakers in password-protected zones,
where a database of speakers’ voices serves as the passwords. The model, if required, can
be made more dynamic by adding the most recent successfully accepted voice sample of a
particular speaker to his/her database of samples, discarding the spectrogram of the
earliest voice sample in the database, and updating the optimal weights. This dynamic model
takes into consideration the change in a particular speaker’s voice over time.
This paper presents a method for text-dependent speaker identification based on
extracting patterns from a spectrogram that are unique to a given speaker and word. The
essence of the technique lies in formulating the speaker-identification problem as pattern
recognition on images and resolving it using statistical tools. Future work will focus on
better selection of words, better bandwidth selection along both the frequency and time axes,
large-scale implementation of the technique, and subsequently reducing its computational
complexity and computation time even further.

References
1. F. K. Soong, A. E. Rosenberg, B.-H. Juang, and L. R. Rabiner. A vector quantization
approach to speaker recognition. AT&T Technical Journal, 66:14-26, 1987.
2. D. A. Reynolds. Speaker identification and verification using Gaussian mixture speaker
models. Speech Communication, 17:91-108, 1995.
3. E. Demidenko. Mixed Models: Theory and Applications. Wiley, 2004, pp. 596-603.
4. Srinivasan, Srihari, and Beal. Signature verification using Kolmogorov-Smirnov statistic.
Proc. International Graphonomics Society Conference (IGS), pp. 152-156, June 2005.

¹ Technical and programming support by Gourab Mukherjee (GMM) and Subhadip Mitra (VQ).
