Partially Supervised Speaker Clustering

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/51574320

Partially Supervised Speaker Clustering

Article  in  IEEE Transactions on Pattern Analysis and Machine Intelligence · August 2011


DOI: 10.1109/TPAMI.2011.174 · Source: PubMed

CITATIONS READS
48 2,024

4 authors, including:

Hao Tang Stephen M Chu


Central South University IBM
101 PUBLICATIONS   2,034 CITATIONS    81 PUBLICATIONS   1,976 CITATIONS   

SEE PROFILE SEE PROFILE

Mark Hasegawa-Johnson
University of Illinois, Urbana-Champaign
431 PUBLICATIONS   8,792 CITATIONS   

SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Acoustic Landmark and Automatic Speech Recognition View project

Collaborative Patient Portals View project

All content following this page was uploaded by Stephen M Chu on 14 April 2015.

The user has requested enhancement of the downloaded file.


IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XXX 20XX 1

Partially Supervised Speaker Clustering


Hao Tang, Member, IEEE, Stephen Chu, Member, IEEE, Mark Hasegawa-Johnson, Senior
Member, IEEE, and Thomas Huang, Life Fellow, IEEE

Abstract—Content-based multimedia indexing, retrieval and processing as well as multimedia databases demand the structuring
of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content to the
individual segments of the signals. In this paper, we specifically address the problem of speaker clustering, the task of assigning
every speech utterance in an audio stream to its speaker. We offer a complete treatment to the idea of partially supervised
speaker clustering, which refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker
clustering process. By means of an independent training data set, we encode the prior knowledge at the various stages of
the speaker clustering pipeline via 1) learning a speaker-discriminative acoustic feature transformation, 2) learning a universal
speaker prior model, and 3) learning a discriminative speaker subspace, or equivalently, a speaker-discriminative distance metric.
We study the directional scattering property of the Gaussian mixture model (GMM) mean supervector representation of
utterances in the high-dimensional space, and advocate exploiting this property by using the cosine distance metric instead of
the Euclidean distance metric for speaker clustering in the GMM mean supervector space. We propose to perform discriminant
analysis based on the cosine distance metric, which leads to a novel distance metric learning algorithm – linear spherical
discriminant analysis (LSDA). We show that the proposed LSDA formulation can be systematically solved within the elegant
graph embedding general dimensionality reduction framework.
Our speaker clustering experiments on the GALE database clearly indicate that 1) our speaker clustering methods based on the
GMM mean supervector representation and vector-based distance metrics outperform traditional speaker clustering methods
based on the “bag of acoustic features” representation and statistical model based distance metrics, 2) our advocated use of
the cosine distance metric yields consistent increases in the speaker clustering performance as compared to the commonly
used Euclidean distance metric, 3) our partially supervised speaker clustering concept and strategies significantly improve the
speaker clustering performance over the baselines, and 4) our proposed LSDA algorithm further leads to the state-of-the-art
speaker clustering performance.

Index Terms—Speaker clustering, partial supervision, distance metric learning.

1 I NTRODUCTION segmentation and clustering, or “who spoke when”,


speaker diarization is the process of partitioning an
C ONTENT- BASED multimedia indexing, retrieval
and processing as well as multimedia databases
are active fields of research in the information era
input audio stream into temporal regions of speech
signal energy contributed from the same speakers. A
[1]. In many situations it is highly demanded that we typical speaker diarization system consists of three
structure the media content (image, audio, video, text, stages. The first is the speech detection stage, where
etc.) so that the identity of the content (face, voice, we find the portions of speech in the audio stream.
keywords, etc.) can be associated with the individual The second is the segmentation stage, where we find
segments of the data. Often, clustering multimedia the locations in the audio stream likely to be change
data is a first step to multimedia content analysis points between speakers. At this stage, we often over-
as well as multimedia database construction, mining, segment the audio stream, resulting in only one single
search, and visualization [2]. speaker in each segment. The last is the clustering
In this paper, the problem of speaker clustering [3], stage, where we associate the segments from the same
[4], [5], [6], [7], [8] is specifically addressed. Speaker speakers together. Figure 1 illustrates the process of
clustering aims to assign every speech utterance in an speaker diarization. In this paper, we mainly focus
audio stream to its respective speaker, and is an es- on the clustering stage, not only because the clus-
sential part of a task known as speaker diarization [9], tering stage is the most important part of speaker
[10], [11], [12], [13], [14]. Also referred to as speaker diarization, but also most techniques developed for
the clustering stage can be readily applied to the
• H. Tang, M. Hasegawa-Johnson, and T. Huang are with the Depart-
segmentation stage (for example, with the help of a
ment of Electrical and Computer Engineering, University of Illinois at sliding window of fixed or variable length).
Urbana-Champaign, Urbana, IL, 61821 USA. Unlike speaker recognition (i.e. identification and
E-mail: {haotang2,jhasegaw,t-huang1}@uiuc.edu.
• S. Chu is with the Human Language Technologies Group at the IBM
verification), where we have training data for the
T. J. Watson Research Center, Yorktown Heights, NY, 10598 USA. speakers and thus training can be done in a super-
E-mail: schu@us.ibm.com. vised fashion, speaker clustering is usually performed
in a completely unsupervised manner. The output of
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XXX 20XX 2

Fig. 1. The process of speaker diarization. A typical


speaker diarization system consists of a speech de-
tection stage, a segmentation stage, and a clustering
stage. Fig. 2. The general speaker clustering pipeline. There
are four essential elements in any speaker clustering
algorithm. This paper transfers knowledge from an in-
speaker clustering is a unique arbitrary code for each dependent training set in order to improve every stage
speaker (e.g., spk1, spk2, etc.) rather than his or her in the speaker clustering pipeline.
real identity (e.g., Tom, Mary, etc.). An interesting
question is: Can we do speaker clustering in a some-
how supervised manner? That is, can we make use of a UBM [17], which can be considered as a universal
all available prior information that may be helpful for speaker prior model. Third, at the distance metric
speaker clustering? stage, we learn a speaker-discriminative distance met-
Our answer to this question is positive. It is worth ric through a novel algorithm – linear spherical dis-
noting that a few researchers in the field of speaker di- criminant analysis (LSDA). Note that at the clustering
arization have already tried to incorporate some prior stage, conventional clustering techniques such as k-
knowledge into their methods and indeed gained means [18] and hierarchical clustering [19] can be
noticeable improvements in the performance. For ex- naturally employed.
ample, the use of a universal background model The contribution of our paper is at least three-fold.
(UBM) for adapted Gaussian mixture model (GMM) First, we propose methods that allow the transfer
based clustering was attempted in [9] and [10], and of learning from an independent training set to the
the GMM mean supervector as the utterance rep- unsupervised clustering problem, and show that these
resentation was recently adopted in [13] and [14]. strategies significantly improve the speaker clustering
However, none of the previous work addresses every performance over the baselines. Our methods out-
facet of the problem. In this paper, we offer a complete perform traditional speaker clustering methods based
treatment to the conceptually new idea of partially on the “bag of acoustic features” representation and
supervised speaker clustering, which refers to the use statistical model based distance metrics. Second and
of our prior knowledge of speakers in general to in particular, we study the directional scattering prop-
assist the unsupervised speaker clustering process. erty of the GMM mean supervector representation of
By means of an independent training data set, we utterances in the high-dimensional space, and advo-
acquire prior knowledge about speakers in general by cate exploiting this property by using the cosine dis-
1) learning a speaker-discriminative acoustic feature tance metric instead of the Euclidean distance metric
transformation, 2) learning a universal speaker prior for GMM mean supervector speaker clustering. Last
model (i.e. a UBM) which is then adapted to the but not least, we propose to perform discriminant
individual utterances to form the GMM mean su- analysis based on the cosine distance metric, which
pervector representation, whose directional scattering leads to a novel distance metric learning algorithm
properties we study and exploit, and 3) learning a – linear spherical discriminant analysis (LSDA). We
discriminative speaker subspace, or equivalently, a show that the proposed LSDA formulation can be
speaker-discriminative distance metric. systematically solved within the elegant graph em-
Figure 2 is a general speaker clustering pipeline. bedding [20] general dimensionality reduction frame-
Basically, there are four critical elements in any work. We demonstrate that the LSDA algorithm leads
speaker clustering algorithm and it is these elements to the state-of-the-art speaker clustering performance.
that make a difference. We incorporate our prior This paper is organized as follows. In Sections
knowledge of speakers into the various stages of 2, 3, 4, and 5, the four stages of the speaker clus-
this pipeline through an independent training data tering pipeline, namely feature extraction, utterance
set. First, at the feature extraction stage, we learn representation, distance metric, and clustering, are
a speaker-discriminative acoustic feature transforma- described. In each section, we first review the cur-
tion based on linear discriminant analysis (LDA) rent state-of-the-art approaches, and then present the
[15]. Second, at the utterance representation stage, strategies that incorporate our partially supervised
we adopt the maximum a posteriori (MAP) adapted speaker clustering concept into the corresponding
GMM mean supervector representation [16] based on stage of the speaker clustering pipeline. In Section
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XXX 20XX 3

6, we describe our experiment setup and protocol, on the current frame. More precisely, the PLP features
introduce the performance evaluation metrics, and of the current frame, those of the KL (e.g., 4) frames
present the experiment results as well as provide a to the left and those of the KR (e.g., 4) frames to
discussion of the results. Finally, we conclude the the right are concatenated to form a high-dimensional
paper in Section 7. feature vector, referred to as the context-expanded
feature vector. In the context-expanded feature vec-
tor space, we learn a speaker-discriminative acoustic
2 F EATURE E XTRACTION
feature transformation by LDA based on the known
2.1 Acoustic Features speaker labels of the independent training data set.
The first stage of the speaker clustering pipeline is The context-expanded feature vectors can then be
feature extraction. Feature extraction is the process of projected onto a low-dimensional (e.g., 40) speaker-
identifying the most important cues from the mea- discriminative feature subspace, which is expected to
sured data while removing unimportant ones for a provide optimal speaker separability. In this way we
specific task or purpose based on domain knowledge. transfer knowledge about the speakers in one corpus
For speaker clustering, the most widely used features to improve clustering of the speakers in a different
are the short-time spectrum envelope based acoustic corpus.
features such as the mel-frequency cepstral coeffi- In the experiment section, we specifically compare
cients (MFCC) and perceptual linear predictive (PLP) the proposed LDA transformed acoustic features with
coefficients [21]. Although MFCC and PLP were not the acoustic features traditionally augmented with the
originally designed for representing the information first and second order derivatives and show that the
relevant to distinguishing among different speakers, LDA transformed acoustic features outperform the
and in fact, their primary use is in speech recognition, traditional acoustic features on speaker clustering un-
they work reasonably well for speaker clustering in der the same clustering conditions. This validates that
practice. The higher-order MFCCs (e.g., 13-19) are the proposed speaker-discriminative acoustic feature
known to correspond to the source characteristics in transformation strategy can provide a better frontend
the source-filter model of speech production [22], and to speaker clustering as compared to traditional ones.
thus convey speaker information. In order to account
for the temporal dynamics of the spectrum, the basic 3 U TTERANCE R EPRESENTATION
MFCC or PLP features are usually augmented by 3.1 “Bag of Acoustic Features” Representation
their first-order derivatives (i.e. the delta coefficients) The second stage of the speaker clustering pipeline is
and second-order derivatives (i.e. the acceleration utterance representation. Utterance representation, as
coefficients). Higher-order derivatives may be used its name suggests, is the task of compactly encoding
too, although that is rarely seen. These derivatives the acoustic features of an utterance. In the literature
incorporate the time-evolving properties of the speech on speaker clustering, the mainstream utterance rep-
signal and are expected to increase the robustness of resentation is the so-called “bag of acoustic features”
the acoustic features. representation where the acoustic feature vectors are
described by a time-independent statistical model
2.2 Speaker-discriminative Acoustic Feature such as a Gaussian or GMM. The rationale behind
Transformation this representation is that in speaker clustering the lin-
guistic content of the speech signal is considered to be
The use of first and second order derivatives of irrelevant and normally disregarded. Thus, temporal
the basic acoustic features introduces the character- independence between inter-frame acoustic features is
ization of the temporal dynamics of the spectrum. assumed.
However, such characterization is completely unsu- Most often, due to its unimodal nature a single
pervised, and thus lacks the potential to discrimi- Gaussian is far from being sufficient to model the
nate between speakers. Using an independent train- probability distribution of the acoustic features of
ing data set, we can simultaneously characterize the an utterance, and a GMM is preferred. The theo-
temporal dynamics of the spectrum and maximize retical property that a GMM can approximate any
the discriminative power of the augmented acoustic continuous probability density function (PDF) arbi-
features based on a discriminative learning frame- trarily closely given a sufficient number of Gaussian
work. Specifically, we first compute 13 PLP features components makes the GMM a popular choice for
for every speech frame with cepstral mean subtraction parametric PDF estimators.
(CMS) and cepstral variance normalization (CVN) to The acoustic features of an utterance are modeled
compensate for the inter-section and inter-channel by an m-component GMM, defined as a weighted sum
variability [23]. Then, instead of augmenting the basic of m component Gaussian densities
PLP features by their first and second order deriva- m
tives, we augment them by the basic PLP features of
X
p(x|λ) = wi N (x|µi , Σi ) (1)
the neighboring frames spanning a window centered i=1
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XXX 20XX 4

where x is a d-dimensional random vector, wi is the E step, the posterior probability of a training vector
ith mixture weight, and N (x|µi , Σi ) is a multivariate falling into every Gaussian component is computed
Gaussian PDF
w0i N (xt |µ0i , Σ0i )
1 T −1
p(i|xt ) = Pm , i = 1, 2, · · · , m
j=1 w0j N (xt |µ0j , Σ0j )
1
N (x|µi , Σi ) = d/2 1/2
e− 2 (x−µi ) Σi (x−µi ) (2)
(2π) |Σi | (3)
Note that Equation 3 is the probability that we
with mean vector µi and covariance matrix Σi . wi can
re-assign the training vector xt to the ith Gaussian
be interpreted as the a priori probability that an obser-
component of the UBM λ0 = {w0i , µ0i , Σ0i }m
i=1 . Based
vation of x comes from the source governed by the ith
on these posterior probabilities, we compute the suf-
Gaussian distribution. Thus it satisfies the properties
ficient statistics of the training data
0 ≤ wi ≤ 1 and m
P
i=1 i = 1. A GMM is completely
w
specified by its parameters λ = {wi , µi , Σi }m
i=1 and the
T
X 1 X
T
estimation of the PDF reduces to finding the proper ni = p(i|xt ), Ei = p(i|xt )xt (4)
ni t=1
values of λ. t=1
A central problem of the GMM is how to es- T
timate the model parameters λ. This problem can 1 X
Ei2 = p(i|xt )x2t (5)
be practically solved by maximum likelihood esti- ni t=1
mation (MLE) techniques such as the expectation-
In the M step, the sufficient statistics of the training
maximization (EM) algorithm [24]. However, it is
data are combined with the prior model sufficient
widely known that MLE easily over-fits with insuffi-
statistics by interpolation. The new model parameters
cient training data. The number of free parameters of a
are obtained as follows
GMM, p, depends on the feature dimension d and the
number of Gaussian components m. More precisely, wi′ = [αi ni /T + (1 − αi )wi ]δ (6)
p = md2 /2+3md/2+m−1, which grows linearly in m
but quadratically in d. In order to alleviate this “curse µ′i = βi Ei + (1 − βi )µi (7)
of dimensionality” [25], diagonal covariance matrices
σi′2 = γi Ei2 + (1 − γi )(σi2 + µ2i ) − µ′2
i (8)
are often used in the Gaussian components. In this
case, p = 2md + m − 1, which grows linearly in both where δ is a scaling factor computed over all new
m and d. mixture weights to ensure that they sum to unity.
The interpolation coefficients in Equations 6-8 are data
dependent and automatically determined for every
3.2 GMM Mean Supervector Representation Gaussian component using the empirical formula νi =
A relatively new utterance representation that has ni /(ni + rν ) where ν ∈ {α, β, γ} and rν is a fixed
emerged in the speaker recognition area is the GMM relevance factor for ν. This empirical formula offers
mean supervector representation, which is obtained a smart mechanism to control the balance between
by concatenating the mean vectors of the Gaussian the new and old sufficient statistics. Figure 3 demon-
components of a GMM trained on the acoustic fea- strates the basic idea of MAP adaptation for a GMM.
tures of a particular utterance [26].
3.2.2 GMM Mean Supervectors
3.2.1 UBM and MAP Adaptation The GMM mean supervector representation of an
When an utterance is short, the number of acoustic utterance is obtained by first MAP adapting the UBM
feature vectors available for training a GMM is small. to the acoustic features of the utterance and then
To avoid over-fitting, we can first train a single GMM concatenating the component mean vectors of the
on an independent training data set, leading to a target GMM to form a long column vector. Figure 4
well-trained GMM known as the UBM in the speaker
recognition literature [27]. Since the amount of data
used to train the UBM is normally large, and the data
is fairly evenly distributed across many speakers, the
UBM is believed to provide a good representation of
speakers in general. Therefore, it can be considered
as a universal speaker prior model in which we en-
code the common characteristics of different speakers.
Given a specific utterance, we can then derive a target
GMM by adapting the UBM to the acoustic features
of the utterance. This is done by MAP adaptation [28]. Fig. 3. The basic idea of MAP adaptation. MAP adap-
MAP adaptation starts with a prior model (i.e. the tation starts with a prior model and iteratively performs
UBM), and iteratively performs EM estimation. In the regularized EM estimation.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XXX 20XX 5

Independent
Input speech
space, where conventional clustering techniques such
training speech
as k-means and the hierarchical clustering can be
naturally applied.
Figure 5 visualizes the GMM mean supervectors of
Feature Extraction Feature Extraction many utterances from five different speakers using
2D scatter plots of their two principal components.
In each plot, the different speakers are shown in
Acoustic features Acoustic features different colors. For each speaker, there are about 150
utterances, denoted by small dots. As one can see,
the data points belonging to the same speaker tend to
MAP Adaption Model Training cluster together. Thus, the Euclidean distance metric
(Mean Only) (GMM) is a reasonable choice for speaker clustering in the
GMM mean supervector space. However, one can also
observe that the data points show very strong direc-
GMM mean
supervector
UBM tional scattering patterns. The directions of the data
points seem to be more informative and indicative
Fig. 4. The generation of a GMM mean supervector. A than their magnitudes. This observation motivated us
GMM mean supervector is obtained by MAP adapting to favor the cosine distance metric over the Euclidean
only the component means of a UBM. distance metric for speaker clustering in the GMM
mean supervector space.
A reasonable explanation as to why the GMM
gives a block diagram that shows how a GMM mean mean supervectors show strong directional scattering
supervector is generated. patterns is that when we perform mean-only MAP
Once a target GMM is obtained for an utterance, its adaptation, only a subset of the UBM component
component means are stacked to form a GMM mean means is adjusted, and the particular subset that is ad-
supervector justed seems to be rather speaker-dependent. Hence,
the speaker-specific information is encoded in those
s = [µ1 T µ2 T ... µm T ]T (9)
component means which are adapted. Therefore, the
It is numerically beneficial to subtract the mean su- utterances from the same speaker tend to yield a
pervector of the UBM from a GMM mean supervector, cluster of GMM mean supervectors that scatter in a
namely, particular direction in the GMM mean supervector
s′ = s − s0 (10) space.
As presented later in the experiment section, our
where s0 is the mean supervector of the UBM. Without experiment results on all speaker clustering tasks
causing any ambiguity, we call s′ (instead of s) a clearly demonstrate that the cosine distance metric
GMM mean supervector. A complete set of GMM consistently outperforms the Euclidean distance met-
mean supervectors forms a high-dimensional space ric when using the GMM mean supervector as the
called the GMM mean supervector space. utterance representation. This strongly supports our
A supervector created by stacking component discovery of the directional scattering property of the
means, as shown in Equation 9, represents local first- GMM mean supervectors and forms the foundation
order differences between the UBM and adapted of our original motivation to perform discriminant
GMM: shifts in local modes and regional centers analysis in the cosine distance metric space.
of mass. Concatenating variances to the supervector
would allow us to also represent local second-order
adaptation, e.g., changes in the local compactness of 4 D ISTANCE M ETRIC
the GMM. The scattering patterns of second-order The third stage of the speaker clustering pipeline is
information in supervector space are very different distance metric. The distance metrics that can be used
from those of first-order information, however, so we for speaker clustering are closely related to the par-
choose to include only first-order information. ticular choice of utterance representations. Two popu-
lar categories of distance metrics, namely likelihood-
3.2.3 Property of GMM Mean Supervectors based distance metrics and vector-based distance met-
The GMM mean supervector is an effective utterance rics, are widely used for the two corresponding utter-
representation that has been applied to speaker recog- ance representations, respectively.
nition. However, it has come to our attention that
the use of the GMM mean supervector representation
for speaker clustering is still rare. The GMM mean 4.1 Likelihood-based Distance Metrics
supervector representation allows us to represent an For the “bag of acoustic features” utterance repre-
utterance as a single data point in a high-dimensional sentation, the distance metric should represent some
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XXX 20XX 6

single data point in a high-dimensional vector space,


the most often used distance metric is the Euclidean
distance metric

d(x, y) = (x − y)T (x − y) (14)

As we discussed earlier, in the GMM mean super-


vector space, the data points belonging to the same
speaker tend to cluster together. Thus, the Euclidean
distance metric is a reasonable choice for speaker clus-
tering in the GMM mean supervector space. However,
we observe that the data points show very strong di-
rectional scattering patterns. The directions of the data
points seem to be more informative and indicative
than their magnitudes. This observation motivated
us to advocate the use of the cosine distance metric
Fig. 5. The property of GMM mean supervectors. The instead of the Euclidean distance metric for speaker
data points show strong directional scattering patterns. clustering in the GMM mean supervector space. The
cosine distance metric is a measure of the angle
between two vectors in the space and is irrelevant to
measure of the distance between two statistical mod- the norms of the vectors. It is defined as
els. A famous likelihood-based distance metric ex- xT y
tensively used for speaker clustering is the Bayesian d(x, y) = 1 − √ p (15)
information criterion (BIC) [29]. For a given utterance, xt x yT y
the BIC value indicates how well a model fits the Our experiments show that the cosine distance met-
utterance and is given by ric consistently outperforms the Euclidean distance
λ metric for speaker clustering in the GMM mean su-
BIC(Mi ) = log L(Xi |Mi ) − ki log(ni ) (11) pervector space.
2
where L(Xi |Mi ) is the likelihood of the acoustic fea-
tures Xi given the model Mi , λ is a design parameter, 4.3 Distance Metric Learning versus Linear Sub-
ki is the number of free parameters in Mi , and ni is the space Learning
number of feature vectors in Xi . The distance between
Although the Euclidean and cosine distance metrics
two utterances Xi and Xj is given by the ∆BIC value.
can be directly used, they are optimal only if the
If we assume Mi and Mj are both Gaussian, then
data points are uniformly distributed in the entire
∆BIC is given by
space. In a high-dimensional space, most often the
∆BIC(Xi , Xj ) = n log Σ − ni log Σi − nj log Σj − λP data points lie in or near a low-dimensional manifold,
(12) or preferably a linear subspace, of the original space.
where n = ni + nj , Σi and Σj are the covariance ma- In this case, it is extremely advantageous if we can
trices of Xi and Xj , respectively, Σ is the covariance learn an optimal distance metric for the data.
matrix of the aggregate of Xi and Xj , and P is a We define a generalized Euclidean distance metric
penalty term given by between two data points x and y as
1 1 d(x, y) = (x − y)T A(x − y) (16)
P = (d + d(d + 1)) log n (13)
2 2
with d being the dimension of the acoustic feature where the positive definite matrix A is aimed to
vectors. compensate for the non-uniform data distribution. If
Other likelihood-based distance metrics include the A coincides with the covariance matrix of the data,
generalized likelihood ratio (GLR), Gish distance, this generalized Euclidean distance metric reduces to
Kullback-Leibler divergence (KLD), divergence shape the Mahalanobis distance [31]. Equation 16 can be re-
distance (DSD), Gaussian divergence (GD), cross BIC written as
(XBIC), cross log likelihood ratio (XLLR), and so forth 1 1 1 1
d(x, y) = (A 2 x − A 2 y)T (A 2 x − A 2 y)
[30]. All these metrics have been proposed for the
“bag of acoustic features” utterance representation. = (W x − W y)T (W x − W y) (17)

That is, the generalized Euclidean distance metric


4.2 Vector-based Distance Metrics between x and y can be re-organized as the Euclidean
For the GMM mean supervector utterance represen- distance metric between two linearly transformed
1
tation, since an utterance can be represented as a data points W x and W y where W = A 2 .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XXX 20XX 7

Similarly, we define a generalized cosine distance


metric between x and y as
xT Ay xd
d(x, y) = 1− √ √
xT Ax yT Ay
(A1/2 x)T (A1/2 y) SDA h-dimensional
= 1− √ √ hypersphere
(A1/2 x)T (A1/2 x) (A1/2 y)T (A1/2 y)
(W T x)T (W T y)
= 1− √ √ (18) x2
yh
(W T x)T (W T x) (W T y)T (W T y)

x1
Likewise, the generalized cosine distance metric d-dimensional LSD
A
between x and y is the cosine distance metric between linear space
y2
two linearly transformed data points W x and W y
1 y1
where W = A 2 . In this sense, it is clear that learning
h-dimensional
an optimal distance metric is equivalent to learning linear space
an optimal linear transformation of the original high-
dimensional space. There exist various linear sub- Fig. 6. The schematic illustration of SDA and LSDA.
space learning methods that can fit into this context. Under two mild conditions, the nonlinear projection can
be linearized and thus SDA reduces to LSDA.

4.4 Distance Metric Learning in Euclidean Space


4.5.1 Linear Spherical Discriminant Analysis
Linear subspace learning can be classified into two
distinct categories: unsupervised learning and su- We coined the phrase “spherical discriminant anal-
pervised learning. For unsupervised linear subspace ysis” (SDA) to denote discriminant analysis in the
learning, principal component analysis (PCA) [15] cosine distance metric space. We define a projection
may be the most early developed and prevailing from a d-dimensional linear space to an h-dimensional
technique, and when applied to speech or speaker hypersphere where h < d
recognition, is known as the eigenvoice approach [32].
WTx
Other more recent unsupervised learning techniques y= (19)
kW T xk
include the locality preserving projection (LPP) [33],
neighborhood preserving embedding (NPE) [34], etc. We note that such a projection is nonlinear. How-
All these techniques may be applied to speaker clus- ever, under two mild conditions, this projection can
tering. However, we are most interested in supervised be linearized. One condition is that the objective
learning since the goal of speaker clustering is re- function for learning the projection only involves the
lated to classification. It is natural that we prefer a cosine distance metric. The other condition is that only
learning technique that is discriminative rather than the cosine distance metric is used in the projected
generative. The most famous technique for supervised space. In this case, the norm of the projected vector
linear subspace learning is Fisher’s LDA. LDA has y has impact on neither the objective function nor
been applied to speaker clustering, and the resulting distance computation in the projected space. Thus,
technique is termed the fishervoice approach [35]. the denominator term of Equation 19 can be safely
The term “fishervoice” is analogous to “fisherface” dropped, leading to a linear projection y = W T x,
in the face recognition literature, where the fisherface which is called “linear spherical discriminant analy-
approach refers to the face recognition method based sis” (LSDA). Figure 6 illustrates the basic ideas of SDA
on LDA while the eigenface approach refers to the and LSDA.
face recognition method based on PCA. Formally speaking, the goal of LSDA is to seek a
linear projection W such that the average within-class
cosine similarity of the projected data is maximized
4.5 Distance Metric Learning in Cosine Space while the average between-class cosine similarity of
Most existing linear subspace learning techniques the projected data is minimized. Assuming that there
(e.g., PCA and LDA) are implicitly based on the are c classes, the average within-class cosine similarity
Euclidean distance metric. As we mentioned ear- is defined to be the average of the class-dependent
lier, due to the directional scattering property of the average cosine similarities between the projected data
GMM mean supervectors, we favor the cosine dis- vectors. It can be written in terms of the unknown
tance metric over the Euclidean distance metric for projection matrix W and original data points x
speaker clustering in the GMM mean supervector c
space. Therefore, we propose to perform discriminant 1X
SW = Si (20)
analysis in the cosine distance metric space. c i=1
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XXX 20XX 8

1 X yjT yk
Si = p √ same vertex set but have different similarity matrices.
|Di ||Di | yjT yj Ty
yk
yj ,yk ∈Di
k
The graph similarity preserving criterion is given by
1 X xT T
j W W xk
X (i) (p)
= p √ (21) W = arg min kf (xi , W ) − f (xj , W )k2 (sij − sij )
|Di ||Di | xT
j
W W T xj xT
k
W W T xk W
xj ,xk ∈Di i6=j
(25)
where |Di | denotes the number of data points in the where f (x, W ) is a general projection with parameters
ith class. Similarly, the average between-class cosine W . Note that the above objective function integrates
similarity is defined to be the average of the average the two aforementioned graphs through the subtrac-
cosine similarities between any two pairs of classes. tion of the similarities in the penalty graph from the
It can be likewise written in terms of W and x similarities in the intrinsic graph. One can easily see
c X c that minimizing this objective function ensures that if
1 X
SB = Smn (m 6= n) (22) the data points xi and xj are close in the sense of the
c(c − 1) m=1 n=1 similarities in S (i) and S (p) then their projections in
the low-dimensional space are close, too.
1 X yjT yk
In Equation 25, if we use a spherical projection
Smn = p √ of the form of Equation 19, we obtain the following
|Dm ||Dn | yjT yj Ty
yk k
yj ∈Dm criterion
yk ∈Dn 2
X W T xi W T xj

1 xT T
j W W xk (i) (p)
X
= p √ (23) W = arg min kW T xi k − kW T xj k (sij − sij )

|Dm ||Dn | xT W W T xj xT W W T xk W
j k i6=j
xj ∈Dm
xk ∈Dn (26)
Although there is no closed-form solution to the
where |Dm | and |Dn | denote the number of data
optimization problem of Equation 26, as shown in
points in the mth and nth classes, respectively.
[37], this problem can be solved using a steepest
The LSDA criterion is to maximize SW while min-
descent algorithm, with the gradient derived as
imizing SB , which can be written in the trace differ-
ence form P n f W T x xT f W T x xT W T (xi xTj +xj xTi ) o (i) (p)
G = 2 i6=j ij f 3 f i i + ij f 3 f j j − fi fj (sij −sij )
i j j i
W = arg max(SW − SB ) (24) (27)
W p q
where fi = T T
xi W W xi , fj = T T
xj W W xj , and
Note that there are various forms of the criterion T T
that may be adopted. We choose the trace difference fij = xi W W xj . If we expand the L2 norm term of
form, which is similar to the work of Ma et al. [36]. Equation 26, by some simple manipulations we obtain
However, we systematically solve our LSDA formu- 2
W T xi W T xj T T
 
lation in an elegant general dimensionality reduction − = 2 1 − xi W W xj
kW T xi k kW T xj k kW T xi k kW T xj k
framework known as graph embedding [20], [37]. (28)
Thus, the criterion in Equation 26 is equivalent to
4.5.2 Graph Embedding Solution to LSDA the following criterion
Graph embedding is a general framework for dimen- X xi T W W T xj (i) (p)
sionality reduction, where, an undirected weighted W = arg max q (sij −sij )
W
p
xTWWTx xTWWTx
graph, G = {X, S}, with vertex set X and similarity i6=j i i j j

matrix S, is used to characterize certain statistical or (29)


geometrical properties of a data set. A vertex xi in By comparing Equation 29 to Equations 20–24, we
X represents a data point in the high-dimensional conclude that the graph embedding criterion of Equa-
space. An entry sij in S, denoted as the weight of tion 26 is equivalent to the LSDA criterion of Equation
the edge connecting xi and xj , represents the sim- 24 if the entries of the similarity matrices S (i) and S (p)
ilarity between the two corresponding data points. are set to proper values, as follows
The purpose of graph embedding is to represent each (i) 1
vertex of the graph as a low dimensional vector that sjk ← if xj , xk ∈ Di , i = 1, ..., c
c|Di ||Di |
preserves the similarities in S. (p) 1
Graph embedding unifies most dimensionality re- sjk ← if xj ∈ Dm , xk ∈ Dn
c(c − 1)|Dm ||Dn |
duction algorithms into a general framework. For a m, n = 1, ..., c, m 6= n (30)
specific dimensionality reduction algorithm, we often
use two graphs: the intrinsic graph {X, S (i) }, which That is, by assigning appropriate values to the
characterizes the data properties that the algorithm weights of the intrinsic and penalty graphs, our LSDA
aims to preserve, and the penalty graph {X, S (p) }, formulation can be systematically solved within the
which characterizes the data properties that the al- elegant graph embedding general dimensionality re-
gorithm aims to avoid. These two graphs share the duction framework.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XXX 20XX 9

5 C LUSTERING TABLE 1
Experiment settings.
The last stage of the speaker clustering pipeline is
clustering. We are interested in conventional cluster- Test set Indep. training set
ing techniques which can be applied to the GMM #speaker 630 498
#utterance 19024 18327
mean supervector space. We mainly focus on two #utt/spk (ave.) 30 ∼ 40 30 ∼ 40
traditional classes of algorithms. One is “flat” clus- utt duration (ave.) 3 ∼ 4s 3 ∼ 4s
tering – clustering by partitioning the data space.
The other is hierarchical clustering, where we try
not to construct a partition but a nested hierarchy of Our experiments are based on a test set of 630
partitions. In most of real-world applications, one of speakers and 19024 utterances extracted from the
these two classes of algorithms is employed. GALE database. In order to implement our partially
The representative flat clustering algorithm is k- supervised speaker clustering strategies at the various
means whose objective is to partition the data space stages of the speaker clustering pipeline, we employ
in such a way that the total intra-cluster variance an independent training set, which was also extracted
is minimized. It iterates between a cluster assigning from the GALE database. Note that the test set and
step and a mean updating step until convergence. the independent training set were chosen in such a
Spherical k-means [38] is an extension of k-means that way that speakers in the independent training set
is based on the cosine distance metric. do not exist in the test set. Table 1 lists the detailed
The representative hierarchical clustering algorithm experiment settings.
is agglomerative clustering [15] whose objective is to To guarantee a statistically significant performance
obtain a complete hierarchy of clusters in the form comparison, we carry out our experiments as follows.
of a dendrogram. The algorithm adopts a bottom-up 1. A case is an experiment associated with a specific
strategy. First, it starts with each data point being a number of test speakers, namely 2, 5, 10, 20, 50,
cluster. Then, it checks which clusters are the closest and 100, respectively;
and merges them into a new cluster. As the algorithm 2. For each case, this number of speakers are drawn
proceeds, it always merges the two closest clusters randomly from the test set, and all the utterances
until there is only one single cluster left. from the selected speakers are used in the exper-
A remarkable question related to agglomerative iment;
clustering is how to determine which clusters are the 3. For each case, 10 independent trials are run, each
closest. There exist several methods that measure the of which involves a random draw of the test
distance between two clusters, for instance, the single speakers;
linkage, complete linkage, average linkage, “ward” 4. For each case, the mean and standard error of the
linkage, and so on [19], [39]. We empirically discover clustering results over the 10 independent trials
that the “ward” linkage yields the best performance are reported.
for speaker clustering. The “ward” linkage is a func-
tion that specifies the distance between two clusters
X and Y by the increase in the error sum of squares 6.2 Performance Evaluation Metrics
(ESS) after the merging of X and Y (Z = X ∪ Y ):
We report our experiment results based on two perfor-
1 X X mance evaluation metrics, namely the clustering accu-
x̄ = x, ESS(X) = |x − x̄|2
|X| racy and the normalized mutual information (NMI)
x∈X x∈X
d(X, Y ) = ESS(Z) − [ESS(X) + ESS(Y )] (31) [41]. These two metrics are standard for evaluating
(general) data clustering results [42]. The clustering
accuracy is given by
6 E XPERIMENTS
N
1 X
6.1 Experiment Setup and Protocol r= [ci = li ] (32)
N i=1
We conduct extensive speaker clustering experiments
on the GALE Mandarin database [40]. The GALE where N denotes the number of test utterances, ci is
database contains about 1900 hours of broadcast news the cluster label of the ith utterance returned by the
speech data collected from various Mandarin TV pro- algorithm, li is the true cluster label, and [v] is an
grams at various times. The waveforms were sampled indication function which returns 1 if v is true and 0
at 16 KHz and quantized at 16 bits per sample, otherwise.
and were automatically over-segmented into short The NMI is another popular, information-
utterances using the BIC criterion, with each utterance theoretically interpreted metric given by
being as pure as possible, namely, each utterance
being from a single speaker. A random sample of the I(C, L)
r= (33)
results were further verified by human listeners. [H(C) + H(L)]/2
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XXX 20XX 10

where I(C, L) is the mutual information acoustic features, and “LDA” for the proposed LDA
T transformed acoustic features.
XX \ |ci lj |
I(C, L) = |ci lj | log (34) Finally, to further demonstrate the statistical signifi-
i j
|ci ||lj | cance of our performance improvements, we perform
a paired t-test (at the default 5% significance level) be-
and H(C) and H(L) are the entropy
tween our proposed method and the ”Gaussian+BIC”
X |ci | |ci | X |li |
|li | method for every case. The p-values of all tests turn
H(C) = − log , H(L) = − log
N N N N out to be so close to zero that Matlab rounds most of
i i
(35) them to zero. This clearly indicates that the results of
our method are significantly different from the results
T
In the above formulas, |ci |, |lj | and |ci lj | areTthe
number of utterances from speaker ci , lj , and ci lj , of the ”Gaussian+BIC” method.
respectively. Next, we perform a step-by-step verification and
The above two metrics are used for utterance-based discussion of the improvements that our partially
evaluations. We extend them to frame-based evalua- supervised speaker clustering strategies have made to
tions by simply replacing the number of utterances in the experiment results, as follows:
the above formulas with the corresponding number of In Tables 2–5, our experiment results show that our
frames. This allows us to investigate how the duration speaker clustering methods based on the GMM mean
of an utterance affects the clustering performance. supervector representation and vector-based distance
metrics significantly outperform traditional speaker
clustering methods based on the “bag of acoustic fea-
6.3 Experiment Results and Discussions tures” representation and statistical model based dis-
Our main experiment results are presented in Tables tance metric such as the BIC (Rows: Orig vs. Baseline).
2–5. Specifically, we conduct speaker clustering 1) in It is worth mentioning that in the “Gaussian+BIC”
the GMM mean supervector space with the Euclidean method, if we utilize agglomerative clustering, the
and cosine distance metrics; 2) in the PCA subspace computational load can become prohibitive as the
with the Euclidean distance metric(i.e. the eigenvoice number of speakers gets larger. This is because at
approach); 3) in the LPP subspace with the Euclidean each iteration, along with a new cluster being formed
distance metric; 4) in the NPE subspace with the by merging the two closest ones, a new statistical
Euclidean distance metric; 5) in the LDA subspace model representing the new cluster has to be re-
with the Euclidean distance metric (i.e. the fishervoice trained, and the distance between the new model
approach); 6) in the kernel-LDA nonlinear manifold and any other model updated. On the contrary, ag-
with the Euclidean distance metric; 7) in the LSDA glomerative clustering can be done very efficiently
subspace with the cosine distance metric. In each in the GMM mean supervector space by using the
experiment, we utilize both k-means (or spherical k- “ward” linkage. In addition, in the GMM mean su-
means) and agglomerative clustering. In order to com- pervector space, although speaker clustering based
pare our methods to the traditional “bag of acoustic on the Euclidean distance metric achieves reasonably
features” methods, we employ the ”Gaussian+BIC” good results, the cosine distance metric consistently
method as the baseline. The experiment results are outperforms the Euclidean distance metric (Rows:
presented in four forms – utterance-based cluster- COS/Orig vs. EU/Orig), thanks to the directional
ing accuracies (Table 2), utterance-based NMIs (Table scattering property of the GMM mean supervectors,
3), frame-based clustering accuracies (Table 4), and which is discussed in Section 3.2.3.
frame-based NMIs (Table 5). For each case, we present In Table 6, our experiment results show that the
the mean of the results over 10 trials√ as well as the LDA transformed acoustic features consistently out-
standard error of the mean, se = s/ n, in parentheses, perform the traditional acoustic features by 1%-3%.
where s is the standard deviation of the results and The proposed speaker-discriminative acoustic feature
n the number of trials (10). In all tables, Orig stands transformation implements one of our partially super-
for the original GMM mean supervector space, k for vised speaker clustering strategies, which can provide
k-means, and h for hierarchical clustering. a better frontend to speaker clustering as compared to
Additionally, we compare the proposed LDA trans- traditional ones.
formed acoustic features with the acoustic features Due to the difficulty of handling high-dimensional
traditionally augmented with the first and second data, and in order to alleviate the “curse of di-
order derivatives. Specifically, 13 basic PLP features mensionality”, linear subspace learning methods are
augmented by their first and second order derivatives used to derive various subspaces in which the final
form a 39-dimensional traditional acoustic feature speaker clustering is performed. From the experiment
vector. Table 6 gives a comparison of the results (clus- results in Tables 2–5, we see that the unsupervised
tering accuracies) of both kinds of acoustic features on methods (i.e. PCA/eigenface approach, LPP, NPE)
speaker clustering under the same clustering condi- more or less improve the performance, but significant
tions. In this table, “traditional” stands for traditional improvements of the performance are achieved by
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XXX 20XX 11

TABLE 2
Performance comparison of speaker clustering based on utterance-based clustering accuracies.

2 spk 5 spk 10 spk 20 spk 50 spk 100 spk


×100% ∼ 60 utt ∼ 150 utt ∼ 300 utt ∼ 600 utt ∼ 1500 utt ∼ 3000 utt
k h k h k h k h k h k h
Orig 94.7 96.0 81.6 85.0 77.3 82.6 70.5 78.1 58.4 69.4 47.2 57.7
(0.28) (0.25) (0.84) (0.81) (1.01) (0.98) (1.07) (1.02) (1.31) (1.24) (1.47) (1.45)
EU PCA 96.6 96.2 84.8 85.5 81.3 82.9 78.5 79.3 69.7 69.9 59.4 58.5
(0.26) (0.26) (0.77) (0.74) (0.82) (0.84) (1.02) (1.00) (1.26) (1.28) (1.43) (1.45)
LPP 98.3 98.1 92.5 92.8 87.9 88.6 85.6 84.7 77.4 77.0 70.1 69.8
(0.19) (0.19) (0.39) (0.38) (0.69) (0.70) (0.75) (0.76) (1.09) (1.08) (1.27) (1.28)
NPE 97.4 97.0 84.3 85.0 83.2 83.9 78.7 79.3 70.4 69.8 58.1 58.3
(0.20) (0.18) (0.78) (0.74) (0.81) (0.82) (1.01) (1.00) (1.25) (1.23) (1.39) (1.45)
LDA 98.3 98.4 94.1 94.0 89.9 90.8 87.2 86.6 79.5 79.6 73.1 72.3
(0.20) (0.18) (0.32) (0.32) (0.49) (0.50) (0.83) (0.84) (1.01) (0.97) (1.19) (1.19)
KLDA 98.2 98.1 93.0 92.8 87.3 87.5 84.6 83.9 75.2 75.3 70.4 70.1
(0.18) (0.19) (0.38) (0.37) (0.58) (0.57) (0.79) (0.82) (1.05) (1.04) (1.26) (1.24)
Orig 99.0 99.1 88.3 90.7 84.1 86.5 80.6 82.2 74.7 77.7 66.4 69.3
COS
(0.13) (0.13) (0.51) (0.51) (0.73) (0.72) (0.99) (0.98) (1.07) (1.09) (1.30) (1.26)
LSDA 99.2 99.1 97.8 98.0 95.0 94.7 90.3 90.0 84.3 85.9 77.9 79.4
(0.12) (0.12) (0.19) (0.20) (0.31) (0.32) (0.51) (0.50) (0.77) (0.77) (1.01) (1.00)
Baseline (Gaussian+BIC) 82.5 83.8 71.6 72.0 58.3 60.5 53.1 52.7 43.2 44.1 35.0 37.4
(0.84) (0.85) (1.19) (1.21) (1.44) (1.42) (1.59) (1.47) (1.72) (1.69) (1.87) (1.91)

TABLE 3
Performance comparison of speaker clustering based on utterance-based NMIs.

2 spk 5 spk 10 spk 20 spk 50 spk 100 spk


×100% ∼ 60 utt ∼ 150 utt ∼ 300 utt ∼ 600 utt ∼ 1500 utt ∼ 3000 utt
k h k h k h k h k h k h
Orig 91.9 93.7 79.0 82.7 74.4 80.3 67.7 75.3 55.9 66.4 44.3 55.3
(0.41) (0.37) (0.96) (0.97) (1.11) (1.05) (1.16) (1.09) (1.27) (1.46) (1.45) (1.36)
EU PCA 93.9 93.5 82.7 83.0 78.9 79.9 76.2 76.7 66.9 67.6 56.9 56.3
(0.37) (0.37) (0.79) (0.79) (0.99) (0.96) (1.08) (1.05) (1.31) (1.32) (1.45) (1.47)
LPP 96.1 95.4 89.9 89.9 85.1 86.2 83.1 82.0 74.8 74.0 67.2 67.3
(0.25) (0.25) (0.61) (0.61) (0.83) (0.84) (0.77) (0.78) (1.09) (1.09) (1.32) (1.27)
NPE 95.3 94.3 81.6 82.3 81.1 81.3 76.0 77.1 67.8 67.0 55.7 55.9
(0.26) (0.25) (0.80) (0.79) (0.90) (0.88) (1.06) (1.03) (1.24) (1.35) (1.43) (1.46)
LDA 96.2 95.8 92.0 91.5 87.5 88.2 85.0 84.0 77.4 77.6 71.1 70.2
(0.25) (0.25) (0.41) (0.41) (0.64) (0.63) (0.83) (0.83) (1.04) (1.09) (1.27) (1.28)
KLDA 96.0 95.6 91.8 91.7 86.0 86.2 83.3 82.9 74.1 74.0 66.9 67.2
(0.25) (0.26) (0.40) (0.43) (0.63) (0.63) (0.90) (0.88) (1.11) (1.09) (1.30) (1.33)
Orig 96.6 96.8 86.0 88.0 81.4 83.6 78.2 79.2 72.1 75.2 63.8 67.1
COS
(0.25) (0.25) (0.60) (0.57) (0.78) (0.79) (1.02) (1.01) (1.12) (1.11) (1.32) (1.27)
LSDA 97.2 97.0 95.3 95.3 92.2 92.6 87.3 87.9 82.3 83.2 75.3 77.4
(0.16) (0.16) (0.22) (0.23) (0.38) (0.38) (0.60) (0.61) (0.85) (0.80) (1.07) (1.08)
Baseline (Gaussian+BIC) 79.9 81.5 69.3 69.5 56.2 58.4 50.9 50.4 41.0 41.5 32.2 35.2
(0.91) (0.90) (1.26) (1.28) (1.46) (1.45) (1.62) (1.58) (1.77) (1.81) (1.93) (1.93)

supervised methods (i.e. LDA/fishervoice approach, is based on an implicit nonlinear mapping from the
LSDA). Notably, our proposed LSDA algorithm leads original data space to a much higher dimensional
to the state-of-the-art speaker clustering performance. feature space where linear projections are found. As
This clearly indicate that LSDA benefits significantly the number of clusters increases, linear separation
from the directional scattering property of the data in in the higher dimensional feature space increasingly
the GMM mean supervector space. comes at the cost of very small margins. In order to
Owing to the non-linearity of the data, a kernel avoid small margins, the kernel LDA algorithm may
discriminant analysis method may be preferable over choose suboptimal clustering strategies.
LDA. In Tables 2–5, we present the results obtained Also noted is that the frame-based performance
via kernel LDA [43] where the affinities used in LSDA is better than the utterance-based performance. The
are used as the Gram matrix in kernel LDA. Our rationale behind this is that longer utterances tend to
experiment results show that kernel LDA is com- be more often correctly classified than shorter utter-
parable to LDA when applied to smaller numbers ances, which is reasonable because longer utterances
of test speakers (2,5,10). When the number of test can provide more speaker-discriminative information
speakers gets larger, kernel LDA performs slightly than shorter ones.
worse than LDA (20,50,100). The reason behind this In every experiment, we employ two clustering
observation may be explained as follows: kernel LDA algorithms, namely k-means and agglomerative clus-
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XXX 20XX 12

TABLE 4
Performance comparison of speaker clustering based on frame-based clustering accuracies.

2 spk 5 spk 10 spk 20 spk 50 spk 100 spk


×100% ∼ 60 utt ∼ 150 utt ∼ 300 utt ∼ 600 utt ∼ 1500 utt ∼ 3000 utt
k h k h k h k h k h k h
Orig 95.9 97.2 83.4 87.7 80.6 85.8 74.4 82.5 63.3 74.7 53.5 64.4
(0.24) (0.22) (0.77) (0.75) (0.91) (0.78) (1.01) (0.99) (1.19) (1.17) (1.38) (1.31)
EU PCA 97.4 97.7 86.2 86.6 84.4 85.7 81.9 83.1 74.5 74.8 66.1 67.2
(0.22) (0.23) (0.70) (0.72) (0.73) (0.74) (0.97) (0.98) (1.19) (1.20) (1.38) (1.30)
LPP 99.1 99.0 94.7 94.8 90.5 90.6 88.7 88.9 82.6 82.3 76.7 76.9
(0.13) (0.13) (0.31) (0.31) (0.56) (0.56) (0.60) (0.58) (0.97) (0.89) (1.19) (1.20)
NPE 98.8 98.5 87.1 87.7 86.1 86.7 82.5 83.6 74.9 74.7 64.5 64.7
(0.16) (0.19) (0.70) (0.68) (0.74) (0.77) (0.93) (0.96) (1.20) (1.20) (1.36) (1.32)
LDA 99.2 99.4 96.2 96.0 92.5 92.6 91.0 90.8 84.4 84.3 80.2 81.7
(0.13) (0.12) (0.26) (0.26) (0.38) (0.40) (0.63) (0.64) (0.89) (0.87) (1.06) (1.13)
KLDA 98.6 98.4 94.8 94.6 90.1 90.4 88.0 87.6 80.3 80.1 76.7 77.4
(0.19) (0.19) (0.35) (0.34) (0.50) (0.54) (0.64) (0.63) (1.04) (1.02) (1.26) (1.20)
Orig 99.4 99.5 89.9 92.5 87.3 88.1 85.1 86.5 79.3 81.2 72.3 74.9
COS
(0.13) (0.13) (0.45) (0.42) (0.73) (0.70) (0.89) (0.86) (1.08) (0.99) (1.21) (1.20)
LSDA 99.7 99.5 98.9 99.2 97.1 96.8 92.5 92.1 87.0 88.2 82.6 83.2
(0.10) (0.09) (0.17) (0.15) (0.22) (0.25) (0.44) (0.44) (0.68) (0.71) (0.93) (0.96)
Baseline (Gaussian+BIC) 84.7 85.1 72.9 73.4 61.6 63.2 57.7 56.9 48.6 49.4 42.7 44.1
(0.77) (0.74) (1.17) (1.13) (1.43) (1.38) (1.46) (1.40) (1.63) (1.57) (1.83) (1.92)

TABLE 5
Performance comparison of speaker clustering based on frame-based NMIs.

2 spk 5 spk 10 spk 20 spk 50 spk 100 spk


×100% ∼ 60 utt ∼ 150 utt ∼ 300 utt ∼ 600 utt ∼ 1500 utt ∼ 3000 utt
k h k h k h k h k h k h
Orig 93.4 95.0 81.2 84.9 78.4 80.1 71.5 80.2 60.7 72.4 51.0 61.6
(0.36) (0.31) (0.95) (0.90) (1.00) (0.93) (1.08) (1.01) (1.27) (1.38) (1.39) (1.36)
EU PCA 95.0 95.6 83.5 84.5 81.9 82.7 78.9 80.1 72.2 72.1 63.4 65.1
(0.31) (0.30) (0.77) (0.76) (0.87) (0.88) (0.98) (1.02) (1.24) (1.35) (1.45) (1.40)
LPP 96.8 96.8 92.4 92.3 87.6 88.2 86.2 86.9 80.1 80.1 74.0 74.6
(0.24) (0.26) (0.47) (0.52) (0.74) (0.67) (0.78) (0.79) (1.04) (0.98) (1.16) (1.22)
NPE 96.2 95.7 84.7 85.0 83.5 84.6 80.2 80.8 72.0 72.5 61.8 62.0
(0.25) (0.29) (0.78) (0.76) (0.84) (0.83) (0.91) (1.01) (1.26) (1.23) (1.42) (1.45)
LDA 97.0 96.5 93.6 93.2 89.5 90.4 88.7 88.0 81.6 81.7 77.2 79.2
(0.22) (0.23) (0.37) (0.37) (0.59) (0.57) (0.72) (0.76) (1.04) (0.99) (1.13) (1.16)
KLDA 96.6 96.1 93.0 92.8 87.1 88.0 85.2 84.9 77.1 77.1 74.3 75.4
(0.25) (0.25) (0.38) (0.38) (0.58) (0.63) (0.86) (0.87) (1.06) (1.09) (1.29) (1.23)
Orig 97.1 97.0 87.8 89.9 84.7 85.8 82.2 84.4 76.8 78.7 69.5 72.6
COS
(0.22) (0.21) (0.61) (0.58) (0.68) (0.68) (0.93) (0.95) (1.06) (1.06) (1.30) (1.25)
LSDA 97.4 97.3 96.1 96.5 94.9 94.5 89.9 89.4 84.5 86.2 79.8 81.0
(0.16) (0.15) (0.22) (0.22) (0.36) (0.37) (0.59) (0.62) (0.83) (0.82) (1.07) (1.10)
Baseline (Gaussian+BIC) 82.5 82.8 70.2 71.3 59.4 60.7 55.4 54.1 46.3 47.0 40.2 41.7
(0.85) (0.82) (1.23) (1.23) (1.39) (1.46) (1.52) (1.54) (1.71) (1.64) (1.78) (1.85)

tering. Although there are pros and cons in each ically finding the number of clusters in a dataset in
algorithm, we observe that in the same subspace, the a completely unsupervised manner is still an open
speaker clustering performance of the two algorithms research problem. Many speaker diarization systems
is comparable. However, k-means is sensitive to ini- deal with this problem through hierarchical clustering
tialization, which means the results across multiple using a BIC-based stopping criterion [11]. A similar
runs may not be identical. Thus, we need to restart method could have been used to determine the num-
the k-means algorithm many times (e.g., 50) with a ber of speakers automatically in our paper. However,
different initialization at each time, and record the best the exact number of speakers is not accurately com-
result. Therefore, k-means is normally much slower puted by this simple method. In general, clustering
than agglomerative clustering. On the other hand, results may vary dramatically for different numbers
agglomerative clustering with the “ward” linkage of speakers determined. In order to eliminate the
method runs very fast. For the case of 100 speakers influence of the number of speakers and single out
(about 3000 utterances), it takes our Matlab program the extent to which the proposed partially supervised
less than one minute to complete the job on a Linux strategies may improve the speaker clustering perfor-
machine with a mainstream configuration. mance, we assume that the number of test speakers
An issue that is not addressed in this paper is the is known a priori, and defer the investigation of this
determination of the number of speakers. Automat- issue to our future work.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XXX 20XX 13

TABLE 6
Performance comparison of the proposed LDA transformed acoustic features with the traditional acoustic
features based on clustering accuracies.

2 spk 5 spk 10 spk 20 spk 50 spk 100 spk


×100% ∼ 60 utt ∼ 150 utt ∼ 300 utt ∼ 600 utt ∼ 1500 utt ∼ 3000 utt
k h k h k h k h k h k h
Traditional 93.5 94.7 80.1 83.7 76.2 81.3 69.0 76.6 57.5 68.4 46.0 56.6
EU
(0.28) (0.28) (0.85) (0.85) (1.07) (1.03) (1.10) (1.04) (1.29) (1.27) (1.55) (1.47)
LDA 94.7 96.0 81.6 85.0 77.3 82.6 70.5 78.1 58.4 69.4 47.2 57.7
(0.29) (0.25) (0.82) (0.79) (1.03) (0.99) (1.05) (1.04) (1.31) (1.27) (1.51) (1.41)
Traditional 97.4 97.7 87.3 89.4 82.9 85.4 79.6 81.2 73.5 76.4 64.9 67.8
COS
(0.16) (0.16) (0.56) (0.58) (0.77) (0.80) (1.02) (1.02) (1.10) (1.10) (1.34) (1.30)
LDA 99.0 99.1 88.3 90.7 84.1 86.5 80.6 82.2 74.7 77.7 66.4 69.3
(0.13) (0.12) (0.49) (0.51) (0.71) (0.72) (1.00) (1.00) (1.06) (1.09) (1.25) (1.28)

7 C ONCLUSIONS T. J. Watson Research Center for their continuous


In this paper, we propose the conceptually new idea encouragement and support.
of partially supervised speaker clustering and of-
fer a complete treatment of the speaker clustering R EFERENCES
problem. By means of an independent training data
set, our strategies are to encode the prior knowl- [1] M. Lew, N. Sebe, C. Djeraba, R. Jain. Content-based multimedia
information retrieval: State of the art and challenges. ACM
edge of speakers in general at the various stages Trans. on Multimedia Computing, Communications, and Ap-
of the speaker clustering pipeline via 1) learning plications, 2(1):1-19, 2006.
a speaker-discriminative acoustic feature transforma- [2] Y. Gong, W. Xu. Machine Learning for Multimedia Content
Analysis. Springer, 2007.
tion, 2) learning a universal speaker prior model, [3] H. Jin, F. Kubala, and R. Schwartz, “Automatic speaker cluster-
and 3) learning a discriminative speaker subspace, or ing,” Proc. DARPA Speech Recognition Workshop’97.
equivalently, a speaker-discriminative distance metric. [4] S. Chen, and P. Gopalakrishnan, “Clustering via the Bayesian
information criterion with applications in speech recognition,”
We discover the directional scattering property of the Proc. ICASSP’98.
GMM mean supervector representation of utterances [5] D. Reynolds, E. Singer, B. Carson, G. O’Leary, J. McLaughlin,
and advocate the use of the cosine distance metric and M. Zissman, “Blind clustering of speech utterances based
on speaker and language characteristics,” Proc. ICSLP’98.
instead of the Euclidean distance metric. We propose [6] A. Solomonoff, A. Mielke, M. Schmidt, H. Gish. Clustering
to perform discriminant analysis based on the cosine speakers by their voices. Proc. ICASSP, 2:757-760, 1998.
distance metric, leading to a novel distance metric [7] R. Faltlhauser, and G. Ruske, “Robust speaker clustering in
eigenspace,” Proc. ASRU’01.
learning algorithm – LSDA. We show that the pro- [8] W. Tsai, S. Cheng, and H. Wang, “Speaker clustering of speech
posed LSDA formulation can be systematically solved utterances using a voice characteristic reference space,” Proc.
within the elegant graph embedding framework. Our ICSLP’04.
[9] M. Ben, M. Betser, F. Bimbot, and G. Gravier. Speaker diariza-
speaker clustering experiments indicate that 1) our tion using bottom-up clustering based on a parameter-derived
speaker clustering methods based on the GMM mean distance between adapted GMMs. Proc. ICSLP, 2004.
supervector representation and vector-based distance [10] Barras, C., Zhu, X., Meignier, S. and Gauvain, J.L., “Multistage
speaker diarization of broadcast news,” IEEE Trans. ASLP. vol.
metrics outperform traditional methods based on the 14 no. 5, pp. 1505-1512, Sept. 2006.
“bag of acoustic features” representation and statisti- [11] S. Tranter, D. Reynolds. An Overview of Automatic Speaker
cal model based distance metrics, 2) our advocated Diarization Systems. IEEE T-ASLP, 14(5):1557-565, 2006.
[12] C. Wooters and M. Huijbregts, “The ICSI RT07s Speaker
use of the cosine distance metric yields consistent Diarization System,” Lecture Notes in Computer Science, 2007.
increases in the speaker clustering performance as [13] P. Kenny. Bayesian analysis of speaker diarization with eigen-
compared to the commonly used Euclidean distance voice priors. CRIM, 2008.
metric, thanks to the directional scattering property [14] D. Reynolds, P. Kenny, F. Castaldo. A study of new approaches
to speaker diarization. Interspeech, 2009.
of the GMM mean supervectors discovered, 3) our [15] R. Duda, P. Hart, D. Stork. Pattern Classification (2nd ed.).
partially supervised speaker clustering concept and Wiley Interscience, 2000.
strategies significantly improve the speaker clustering [16] W. Campbell,D. Sturim, D. Reynolds. Support vector machines
using GMM supervectors for speaker verification. Signal Pro-
performance over the baselines, and 4) our proposed cessing Letters 13(5):308-311, 2006.
LSDA algorithm further leads to the state-of-the-art [17] D. Reynolds, T. Quatieri, R. Dunn. Speaker Verification using
speaker clustering performance. Adapted Gaussian Mixture Models. Digital Signal Processing,
10:19-41, 2000.
[18] J. MacQueen. Some Methods for classification and Analysis of
ACKNOWLEDGMENTS Multivariate Observations. Proc. 5th Berkeley Symposium on
Math. Stat. and Prob. 1:281-297, 1967.
This work was supported in part by the DARPA [19] A. Jain, R. Dubes. Algorithms for clustering data. Prentice-
contract HR0011-06-2-0001 and in part by the NSF Hall, Inc., 1988.
[20] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, S. Lin. Graph
Grant 08-03219. The authors would like to thank Dr. Embedding and Extensions: A General Framework for Dimen-
Lidia Mangu and Dr. Michael Picheny at the IBM sionality Reduction. IEEE T-PAMI 29(1):40-51, 2007.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, XXX 20XX 14

[21] H. Hermansky. Perceptual linear predictive (PLP) analysis of Stephen Mingyu Chu was born in Bei-
speech. J. Acoust. Soc. Amer., 87:1738-1752, 1990. jing, China, in 1970. He studied Physics
[22] G. Fant. Acoustic Theory of Speech Production. Mouton De at Peking University before coming to the
Gruyter, 1970. United States, where he continued his ed-
[23] P. Jain, H. Hermansky. Improved mean and variance normal- ucation and received the M.S. degree and
ization for robust speech recognition. Proc. ICASSP, 2001. the Ph.D. degree in Electrical Engineering
[24] A. Dempster, N. Laird, D. Rubin. Maximum likelihood from from the University of Illinois at Urbana-
incomplete data via the EM algorithm. J. Roy. Stat. Soc. B, Champaign, in 1999 and 2003, respectively.
39(1):1-38, 1977. Since May, 2003, he has been with the
[25] C. Bishop. Pattern Recognition and Machine Learning. speech group at the IBM T. J. Watson Re-
Springer, 2006. search Center. His research interests include
[26] P. Kenny, G. Boulianne, and P. Dumouchel. Eigenvoice mod- speech recognition, multimodal signal processing, graphical models,
eling with sparse training data. IEEE Trans. Speech and Audio and machine learning.
Processing, 13(3):345–354, 2005.
[27] Douglas A. Reynolds and Richard C. Rose. Robust Text-
Independent Speaker Identification Using Gaussian Mixture
Speaker Models. IEEE Trans. Speech and Audio Processing,
3(1):72–83, 1995.
[28] J-L. Gauvain, C.-H. Lee. Maximum a posteriori estimation for
multivariate Gaussian mixture observations of Markov chains.
IEEE Trans. Speech and Audio Processing, 2:291–298, 2004.
[29] G. Schwarz. Estimating the dimension of a model. Annals of Mark Hasegawa-Johnson (M’97-SM’04) re-
Statistics 6(2):461-464, 1978. ceived the S.B., S.M., and Ph.D. degrees in
[30] X. Miro. Robust Speaker Diarization for Meetings. PhD thesis, electrical engineering and computer science
Universitat Politecnica de Catalunya, 2006. from MIT in 1989, 1989, and 1996 respec-
[31] P. Mahalanobis. On the generalised distance in statistics. Na- tively. From 1986-1990 he was a speech cod-
tional Institute of Sciences of India 2(1):49-55, 1936. ing engineer, first at Motorola Labs, Schaum-
[32] R. Kuhn, P. Nguyen, J. Junqua, L. Goldwasser, N. Niedziel- burg, IL, then at Fujitsu Laboratories Lim-
ski, S. Fincke, K. Field, M. Contolini. Eigenvoices for speaker ited, Kawasaki, Japan. From 1996-9 he was
adaptation. Proc. ICSLP, 1771-1774, 1998. a post-doctoral fellow at the Speech Pro-
[33] X. He, P. Niyogi. Locality Preserving Projections. NIPS 2003. cessing and Auditory Physiology Lab, UCLA,
[34] X. He, D. Cai, S. Yan, H. Zhang. Neighborhood Preserving funded by the NIH, and by the 19th annual
Embedding. Proc. ICCV, 2005. F.V. Hunt Post-Doctoral Fellowship of the Acoustical Society of Amer-
[35] S. Chu, H. Tang, T. Huang. Fishervoice and Semi-supervised ica. Since 1999, Prof. Hasegawa-Johnson has been on the faculty of
Speaker Clustering. Proc. ICASSP, 2009. the University of Illinois. He is currently an Associate Professor in the
[36] Y. Ma, S. Lao, E. Takikawa, M. Kawade. Discriminant Analysis Department of Electrical and Computer Engineering, with Affiliate
in Correlation Similarity Measure Space. Proc. ICML, 227:577- appointments in Speech and Hearing Sciences, Computer Science,
84, 2007. and Linguistics, and with a research appointment at the Beckman
[37] Y. Fu, S. Yan, T. Huang. Correlation Metric for Generalized Institute for Advanced Science and Technology.
Feature Extraction. IEEE T-PAMI 30(12):2229-235, 2008.
[38] I. Dhillon, D. Modha. Concept decompositions for large sparse
text data using clustering. Machine Learning, 42(1):143-75, 2001.
[39] A. Jain. Data Clustering: 50 Years Beyond K-Means. Technical
Report TR-CSE-09-11. Under review by the Pattern Recognition
Letters.
[40] S. Chu, H. Kuo, L. Mangu, Y. Liu, Y. Qin, Q. Shi. Recent
advances in the IBM GALE mandarin transcription system.
Proc. ICASSP, 2008.
Thomas Huang (S’61-M’63-SM’76-F’79-
[41] P. Viola, W. Wells III. Alignment by maximization of mutual
LF’01) received the B.S. degree in electrical
information. Int. J. Comput. Vision, 24(2):137-154, 1997.
engineering from National Taiwan University,
[42] J. Moore, E.-H. Han, D. Boley, M. Gini, R. Gross, K. Hastings,
Taipei, Taiwan, China, and the M.S. and
G. Karypis, V. Kumar, B. Mobasher. Web Page Categorization
Sc.D. degrees in electrical engineering from
and Feature Selection Using Association Rule and Principal
the Massachusetts Institute of Technology
Component Clustering. Workshop on Information Technologies
(MIT), Cambridge. He was at the faculty of
and Systems, 1997.
the Department of Electrical Engineering at
[43] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, K. Mullers. Fisher
MIT from 1963 to 1973, and at the faculty
discriminant analysis with kernels. IEEE Neural Networks for
of the School of Electrical Engineering and
Signal Processing Workshop, pp. 41-48, 1999.
Director of its Laboratory for Information
and Signal Processing at Purdue University from 1973 to 1980.
In 1980, he joined the University of Illinois at Urbana-Champaign,
where he is now William L. Everitt Distinguished Professor of
Electrical and Computer Engineering, Research Professor at the
Coordinated Science Laboratory, Head of the Image Formation
Hao Tang received the Ph.D. degree in elec- and Processing Group at the Beckman Institute for Advanced
trical engineering from the University of Illi- Science and Technology, and Co-chair of the Institute’s major
nois at Urbana-Champaign in 2010, the M.S. research theme: human-computer intelligent interaction. Dr. Huang’s
degree in electrical engineering from Rut- professional interests lie in the broad area of information technology,
gers University in 2005, the M.E. and B.E. especially the transmission and processing of multidimensional
degrees, both in electrical engineering, from signals. He is a Member of the National Academy of Engineering,
the University of Science and Technology of a Foreign Member of the Chinese Academies of Engineering and
China, in 2003 and 1998, respectively. Since Science, and a Fellow of the International Association of Pattern
2010, he has been with HP Labs. His broad Recognition, IEEE, and the Optical Society of America.
research interests include statistical pattern
recognition, machine learning, computer vi-
sion, speech and multimedia signal processing.

View publication stats

You might also like