Agglomerative Mean-Shift Clustering via Query Set Compression

Xiao-Tong Yuan Bao-Gang Hu Ran He

∗ Supported in part by NSFC grant #60275025 and MOST of China (No. 2007DFC10740). The authors are with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences.

Abstract

Mean-Shift (MS) is a powerful non-parametric clustering method. Although it can achieve good accuracy, its computational cost is particularly high even on moderate data sets. In this paper, for the purpose of algorithm speedup, we develop an agglomerative MS clustering method called Agglo-MS, along with an analysis of its mode-seeking ability and convergence property. Our method is built upon an iterative query set compression mechanism which is motivated by the quadratic bounding optimization nature of MS. The whole framework can be efficiently implemented in linear running time. Furthermore, we show that pairwise constraint information can be naturally integrated into our framework to derive a semi-supervised non-parametric clustering method. Extensive experiments on toy and real-world data sets validate the speedup advantage and numerical accuracy of our method, as well as the superiority of its semi-supervised version.

1 Introduction

Finding the clusters of a data set sampled from an unknown distribution is important in many machine learning and data mining applications. A probability density estimator may represent the distribution of the data in a given problem, and the modes of that estimator may then be taken as the representatives of the clusters. As a non-parametric method, kernel density estimation is the one most often applied in practice. Given a set of N independent, identically distributed samples X = {x_1, ..., x_N} drawn from a population with density function f(x), x ∈ R^d, the kernel density estimator (KDE) with kernel k(·) is defined by

(1.1)    \hat{f}_k(x) = \sum_{i=1}^{N} p(i)\, p(x \mid i) = \sum_{i=1}^{N} \frac{w_i}{C_i} k\left(M^2(x, x_i, H_i)\right)

where p(i) = w_i is the prior weight or mixing proportion of point x_i (satisfying \sum_{i=1}^{N} w_i = 1), M^2(x, x_i, H_i) = (x - x_i)^T H_i^{-1} (x - x_i) is the Mahalanobis distance from x to x_i with covariance H_i, and C_i is a normalization constant.

The Mean-Shift (MS) algorithm discovered by Fukunaga and Hostetler [12] is a powerful optimization algorithm for the KDE (1.1). It is expressed as the following fixed-point iteration:

(1.2)    x^{l+1} = \left( \sum_{i=1}^{N} \frac{w_i}{C_i} g\left(M^2(x^l, x_i, H_i)\right) H_i^{-1} \right)^{-1} \left( \sum_{i=1}^{N} \frac{w_i}{C_i} g\left(M^2(x^l, x_i, H_i)\right) H_i^{-1} x_i \right)

where g(x) = -k'(x), and k(·) is called the shadow of the profile g(·) [8].

By setting the query set Q = X and the reference set R = X, Naive-MS clustering groups the points in Q according to the modes they converge to via MS conducted on R. It has a wide range of applications, such as discontinuity-preserving image smoothing [8], image segmentation [21] and texture classification [14]. Naive-MS clustering typically requires O(KN^2) kernel evaluations, where K is the average number of MS iterations per query. Even for moderate data sets, such an exhaustive querying mechanism leads to severe demands on computational time and/or storage. Recent years have witnessed a surge of interest in fast MS clustering methods [4][5][14][20][21]. One well-formulated method is Gaussian blurring MS (GBMS) [7], which iteratively sharpens the query and reference set by moving each data point according to Gaussian MS (GMS). Carreira-Perpinan [5] proves that GBMS converges cubically and further provides an accelerated version of GBMS that uses an improved iteration stopping criterion. Yang et al. [21] accelerate GMS to linear running time by using the improved fast Gauss transform (IFGT). Although very efficient for large and high-dimensional databases, IFGT-MS is difficult to generalize to convex kernels other than the Gaussian. For image segmentation applications, Carreira-Perpinan [4] has evaluated four kinds of acceleration strategies for GMS, which are based on the spatial structure of images and on the fact that GMS is an Expectation-Maximization (EM) algorithm [6].
The fastest one is spatial discretisation, which can accelerate GMS by one to two orders of magnitude. However, one limitation of this strategy is that it is specific to image analysis, because it requires spatial dimensions that can be regularly discretised. Based on the dual-tree technique, Wang et al. [20] present DT-MS for fast data clustering. Since a relative error bound is maintained at each iteration, DT-MS is proved to be stable and accurate. However, the computational cost saving of DT-MS over IFGT-MS is more impressive under the simple Epanechnikov kernel than under the Gaussian kernel.

Recently, Yuan and Li [22] pointed out that convex-kernel based MS is equivalent to half-quadratic (HQ) optimization of the KDE function (1.1). The HQ analysis of MS implies that MS is a quadratic bounding (QB) optimization for the KDE, a fact also discovered by Fashing and Tomasi [11]. Motivated by the QB nature of MS, we develop a highly efficient query set compression mechanism to accelerate MS clustering under general kernels. The basic idea is to construct a family of d-dimensional hyperellipsoids covering the current query set Q, guaranteeing that points inside each hyperellipsoid lie in the same basin of the d-dimensional KDE surface and hence converge to the same mode under the MS algorithm. For a given query point, the hyperellipsoid is constructed from a lower QB function of the KDE defined at this point. Empirically, we observe that the number of covering hyperellipsoids is much smaller than the size of Q, in most cases by one or two orders of magnitude. We then take the centers of these hyperellipsoids to form a new query set of dramatically reduced size. Such a query set covering procedure can be run iteratively until convergence is attained. Numerical experiments show that the approximation error introduced by this query set compression mechanism is limited and acceptable under a proper kernel bandwidth. After each iteration of query set compression, clustering can be done by grouping the points in the current query set according to the hyperellipsoids they are assigned to. This naturally leads to an agglomerative clustering framework. We analyze the mode-seeking ability and convergence property of the proposed algorithm, and derive a tight upper bound on the convergent query set size, which guarantees in theory the speedup performance of our algorithm.

Our second contribution is a semi-supervised non-parametric clustering algorithm built inside the proposed query set compression framework. Until now, the MS algorithm has typically been applied in an unsupervised manner. We point out that supervisory information, e.g., pairwise constraints, can be naturally integrated into the proposed query set compression framework, which leads to a novel constrained non-parametric clustering algorithm. Compared to constrained Kmeans (CKmeans) [18] and its variations [2][9][19], our method has the following advantages: 1) no assumption is made about the distribution of the input data; 2) the number of output clusters does not need to be known beforehand; and 3) no initialization is required to start the algorithm. At the same time, if necessary, CKmeans can be used as a postprocessing step to further group the clusters output by our method into the desired number of clusters. Experimental evaluation on UCI and real-world data sets validates the superiority of our method.

The remainder of the paper is structured as follows. In Section 2, we briefly review the QB nature of MS, which forms the basis of this work, from the viewpoint of HQ analysis. In Section 3, we develop an agglomerative MS clustering method based on iterative query set compression and evaluate its numerical performance on toy and real-world clustering problems. In Section 4, by utilizing pairwise constraint information, we extend our unsupervised framework into a semi-supervised version. Finally, we conclude this work in Section 5.

2 Quadratic Bounding Nature of MS

The fact that MS is a QB optimization was originally discovered by Fashing and Tomasi [11], motivated by the relationship between MS and the Newton-Raphson method. Actually, the QB nature of MS can be derived more directly from the HQ optimization viewpoint of MS [22]: when the kernel is convex and monotonically decreasing, the MS algorithm can be interpreted as HQ optimization of \hat{f}_k(x). This can be shown by using the theory of convex conjugate functions [17] to introduce the following augmented energy function with d + N variables:

(2.3)    \hat{F}(x, p) = \sum_{i=1}^{N} \frac{w_i}{C_i} \left( -p_i M^2(x, x_i, H_i) + \varphi(p_i) \right)

where x ∈ R^d is the KDE variable and p = (p_1, ..., p_N) is an N-dimensional vector of auxiliary variables. \varphi(·) is the dual function of k(·) [22]. \hat{F}(x, p) is quadratic w.r.t. x and concave w.r.t. p. For a fixed x, the following relationship holds:

(2.4)    \hat{f}_k(x) = \sup_p \hat{F}(x, p)

and thus

    \max_x \hat{f}_k(x) = \max_{x, p} \hat{F}(x, p),

which means that maximizing \hat{f}_k(x) is equivalent to maximizing the augmented function \hat{F}(x, p) on the extended domain.
A local maximizer (x̂, p̂) of \hat{F} can be computed by the following alternating maximization, started from a point x̂^0:

(2.5)    \hat{p}_i^l = -k'\left( M^2(\hat{x}^{l-1}, x_i, H_i) \right), \quad i = 1, ..., N

(2.6)    \hat{x}^l = \left( \sum_{i=1}^{N} \frac{w_i}{C_i} \hat{p}_i^l H_i^{-1} \right)^{-1} \left( \sum_{i=1}^{N} \frac{w_i}{C_i} \hat{p}_i^l H_i^{-1} x_i \right)

which is exactly the MS algorithm (1.2).

At a fixed point x̂^{l-1} ∈ R^d, a lower QB function for \hat{f}_k(x) is given by

(2.7)    \hat{\rho}^l(x) \triangleq \hat{F}(x, \hat{p}^l)

which, according to (2.4), satisfies \hat{f}_k(x̂^{l-1}) = \hat{\rho}^l(x̂^{l-1}) and \hat{f}_k(x) \geq \hat{\rho}^l(x) for all x. The step (2.6) is then equivalent to solving

(2.8)    \hat{x}^l = \arg\max_x \hat{\rho}^l(x),

which indicates that MS is a QB optimization for the KDE. This QB viewpoint of MS motivates the following acceleration strategy for MS clustering.
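Before moving on, the update (2.5)-(2.6), i.e. the MS iteration (1.2), can be made concrete with a minimal NumPy sketch. This assumes the common special case of a Gaussian kernel with isotropic bandwidth H_i = h^2 I and uniform weights w_i = 1/N; the function names are ours, not part of the paper.

```python
import numpy as np

def mean_shift_step(x, X, h):
    """One fixed-point update of (1.2) for a Gaussian kernel with
    isotropic bandwidth h and uniform weights.
    x : (d,) current query point; X : (N, d) reference set."""
    d2 = np.sum((X - x) ** 2, axis=1) / h ** 2   # M^2(x, x_i, H) with H = h^2 I
    g = np.exp(-0.5 * d2)                        # g = -k' for the Gaussian profile
    return g @ X / g.sum()                       # weighted mean of the reference points

def mean_shift(x, X, h, tol=1e-6, max_iter=500):
    """Iterate the update from x until the shift is below tol; returns the mode."""
    for _ in range(max_iter):
        x_new = mean_shift_step(x, X, h)
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```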
3 Agglomerative MS Clustering

The key idea of our agglomerative MS algorithm is to construct a family of d-dimensional hyperellipsoids covering the current query set Q, guaranteeing that points inside each hyperellipsoid converge to a common local maximum of the KDE via MS. We then use the centers of these hyperellipsoids to form a new query set that acts as a compressor of the original one. Such a set covering mechanism may be run iteratively until it converges. After each iteration level, clustering is done by grouping the current query points according to the hyperellipsoids they are associated with, which leads to a hierarchical clustering. In the following derivation, we assume that the covariance is homogeneous, i.e., H_i = H.

3.1 Query Set Covering  Let us start with Q_0 = R = X. Given a data point x_i ∈ Q_0, we initialize x̂_i^0 ← x_i. From eq. (2.8), the output x̂_i^1 of the first MS iteration is the maximizer of a QB function \hat{\rho}^1(x). Thus we may write

    \hat{\rho}^1(x) = s\, M^2(x, x̂_i^1, H) + C'

where s = -\sum_{i=1}^{N} \frac{w_i \hat{p}_i^1}{C_i} < 0 and C' is a constant term. Taking x̂_i^1 as center, we define the following d-dimensional hyperellipsoid:

    HE(x̂_i^1) = \{ x \mid M^2(x, x̂_i^1, H) \leq M^2(x̂_i^0, x̂_i^1, H) \}.

For every x ∈ HE(x̂_i^1), due to the quadratic bounding property of \hat{\rho}^1(x), we have

    \hat{f}_k(x) \geq \hat{\rho}^1(x) \geq \hat{\rho}^1(x̂_i^0) = \hat{f}_k(x̂_i^0),

which means that any data point x_j ∈ Q_0 ∩ HE(x̂_i^1) could alternatively have been chosen as x̂_i^1 to increase the current QB function value, rather than maximize it. Therefore it is quite hopeful that x_j will converge to the same mode as x_i does. This is close to the essence of generalized EM, in which the M step is allowed merely to increase the expectation obtained in the E step rather than maximize it. We may therefore reasonably claim that points inside Q_0 ∩ HE(x̂_i^1) converge to the same local maximum and can safely be clustered together by MS. Sequentially, we run this hyperellipsoid construction procedure on the query set Q_0 for each point not yet covered by any existing hyperellipsoid, until the whole set is scanned. As a result, we obtain a family S_0 of hyperellipsoids which covers the entire query set, i.e., Q_0 ⊆ \bigcup_{HE(x̂_i^1) ∈ S_0} HE(x̂_i^1). S_0 can thus be viewed as a compressor of Q_0. The size of S_0 generally depends on the data distribution, the kernel bandwidth and the data scanning order used for hyperellipsoid construction. When S_0 is relatively dense, we could alternatively apply the well-known greedy set covering algorithm [13] to find a subset of S_0 that covers Q_0. A formal description of our query set covering mechanism is given as a function in Algorithm 1.

1: Function: Query_Set_Covering(Query data set Q, Reference data set R)
2: Initialization: S = ∅.
3: for each x_i ∈ Q do
4:   if ∃ HE(x̂_j^1) ∈ S such that x_i ∈ HE(x̂_j^1) then
5:     Associate x_i with HE(x̂_j^1).
6:   else
7:     Run one iteration of MS from x_i with reference set R and construct HE(x̂_i^1), as stated in Section 3.1.
8:     S = S ∪ {HE(x̂_i^1)}
9:   end if
10: end for
11: Return S
Algorithm 1: The Query Set Covering Function
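To make Algorithm 1 concrete, the following is a sketch under the same isotropic-Gaussian assumptions as the earlier snippet, so that the hyperellipsoids reduce to balls and the bandwidth factor cancels in the membership test. It reuses mean_shift_step from the sketch in Section 2; all names are ours.

```python
import numpy as np

def query_set_covering(Q, R, h):
    """A sketch of Algorithm 1 for isotropic H = h^2 I.
    Q : (m, d) query points; R : (n, d) reference points.
    Returns ball centers, squared radii, and the index of the ball
    each query point is associated with."""
    centers, radii2, assign = [], [], np.full(len(Q), -1)
    for i, x in enumerate(Q):
        for j, (c, r2) in enumerate(zip(centers, radii2)):
            if np.sum((x - c) ** 2) <= r2:        # x already covered by ball j
                assign[i] = j
                break
        else:
            x1 = mean_shift_step(x, R, h)         # one MS iteration from x
            centers.append(x1)
            radii2.append(np.sum((x - x1) ** 2))  # squared radius ||x^0 - x^1||^2
            assign[i] = len(centers) - 1
    return np.array(centers), np.array(radii2), assign
```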
3.2 Iterative Query Set Compression  Given the currently constructed hyperellipsoid set S_0, we take the centers of its hyperellipsoids to form a compressed query set, i.e., Q_1 = {x̂_i^1 | HE(x̂_i^1) ∈ S_0}. The set covering operation presented above can then be applied directly to Q_1. After a sufficient number of iterations, until convergence, we obtain a sparse enough query set Q_∞. At each iteration level l, we group the points in Q_l according to the hyperellipsoids in S_l they are associated with. Such a query set compression framework naturally leads to an agglomerative clustering of Q_0. A formal description of the proposed method, namely Agglo-MS, is given in Algorithm 2. The computational cost of Agglo-MS is O(\sum_{l=1}^{L} |Q_l| N), where L is an iteration number large enough to guarantee convergence. Since |Q_L| ≤ ... ≤ |Q_1| ≪ N and L ≈ K typically hold (see the algorithm analysis below), Agglo-MS can speed up Naive-MS by a factor of KN / (\sum_{l=1}^{L} |Q_l|).

1: Initial Query Set Covering:
2: Let query set Q_0 = X and reference set R = X.
3: S_0 = Query_Set_Covering(Q_0, R)
4: Iterative Query Set Compression Phase:
5: Set l = 1
6: while the query set size keeps decreasing do
7:   Let query set Q_l = {x̂_i^1 | HE(x̂_i^1) ∈ S_{l-1}}.
8:   S_l = Query_Set_Covering(Q_l, R)
9:   Group the points in query set Q_l according to the hyperellipsoids in S_l they are associated with.
10:  l ← l + 1
11: end while
Algorithm 2: The Agglo-MS Clustering Algorithm
3.3 Algorithm Study  This section gives some analysis of Agglo-MS. The discussion includes: 1) an insight into the mode-seeking ability of Agglo-MS, and 2) the convergence property of the query set size |Q_l|, which is the main concern for speedup performance.

3.3.1 On Mode-Seeking Ability  First, Agglo-MS will in general find no more modes than Naive-MS does. Indeed, each point that survives in the currently compressed query set Q_l is the l-th MS iterate of some initial point in Q_0. Therefore the final set Q_∞ is a subset of the convergent modes returned by Naive-MS on Q_0. This also implies that L ≈ K. Second, Agglo-MS will find no fewer modes than Naive-MS does if the query set covering is done with a proper data scanning order. Suppose that m different modes are located by Naive-MS on Q_0, and let C_1, ..., C_m be the corresponding m clusters. For each cluster C_j, let x_j^* = \arg\max_{x ∈ C_j} \hat{f}_k(x). It is easy to see that a hyperellipsoid constructed from x_j^* covers no point of Q_0 other than x_j^* itself. We may scan from x_j^*, j = 1, ..., m, to do the set covering on Q_0 and compress it into Q_1. Obviously Naive-MS will find at least the same m modes on Q_1. Such a scanning order can be generated repeatedly on Q_l, l ≥ 1, until convergence, and Q_∞ will surely contain at least the m modes returned by Naive-MS.

Based on these two points, we claim that Agglo-MS possesses the same mode-seeking ability as Naive-MS provided that the query set covering is done with a proper data scanning order. Our numerical observation is that this property of Agglo-MS holds under most data scanning orders.

3.3.2 Convergence Property of |Q_l|  We now discuss the convergence of |Q_l|, which is important for the speedup analysis. First, since |Q_1| ≥ |Q_2| ≥ ... and |Q_l| ≥ 1, we have the following convergence proposition:

Proposition 3.1. The sequence {|Q_l|, l = 1, 2, ...} generated by the Agglo-MS algorithm converges.

We now focus on the value of |Q_∞|, which is of more interest to end users. Naive-MS clustering groups the queries according to the modes they separately converge to. In practice, the convergent modes corresponding to one common KDE maximum may not be exactly identical, due to numerical error. We therefore regard modes falling inside a sphere of sufficiently small radius ρ as identical, and as nonidentical otherwise. Here ρ serves as the resolution of Naive-MS clustering: the smaller ρ is, the more clusters are output. For clarity of description, we introduce the following notion of ρ-Mode Number for Naive-MS:

Definition 1. (ρ-Mode Number) Given a resolution parameter ρ, the number of nonidentical modes located by the Naive-MS algorithm on data set X under bandwidth H is referred to as the ρ-Mode Number, denoted M(X, H, ρ).

We aim to relate |Q_l| to M(X, H, ρ). To do so, we slightly modify the set covering mechanism of Algorithm 1 as follows: the condition in line 4 is revised to "∃ HE(x̂_j^1) ∈ S such that x_i ∈ HE(x̂_j^1) ∪ HS(x̂_j^1, 4ρ)", where HS(x̂_j^1, 4ρ) is the hypersphere centered at x̂_j^1 with radius 4ρ. We refer to Algorithm 1 with this modification as the modified query set covering, for which we have the following upper bound on |Q_∞|.

Proposition 3.2. Given a resolution parameter ρ, the sequence {|Q_l|, l = 1, 2, ...} generated by the Agglo-MS algorithm with the modified query set covering mechanism satisfies

    \lim_{l \to \infty} |Q_l| \leq M(X, H, ρ).
Proof. For each x_i^l ∈ Q_l, denote by x_i^∞ its MS convergence point. There exists L > 0 such that if l > L then ||x_i^l - x_i^∞|| < ρ for all i. Moreover, the convergent set {x_i^∞, i = 1, ..., |Q_l|} is covered by M(X, H, ρ) hyperspheres of radius ρ. Based on these preliminaries, we prove the proposition by contradiction.

Assume that \lim_{l \to \infty} |Q_l| = T > M(X, H, ρ). By the discreteness of cardinality, there exists L' > 0 such that if l > L' then |Q_l| = T. Fix an iteration level l > \max(L, L'). Since T > M(X, H, ρ), by the pigeonhole principle there exist at least two points x_i^l and x_j^l whose convergence points x_i^∞ and x_j^∞ fall inside the same covering hypersphere. Consider the (l+1)-th iteration of query set compression from x_i^l (note that once the query set size has converged, a hyperellipsoid is constructed from each point of the query set). By the triangle inequality,

    ||x_i^{l+1} - x_j^l|| \leq ||x_i^{l+1} - x_i^∞|| + ||x_i^∞ - x_j^∞|| + ||x_j^∞ - x_j^l|| < ρ + 2ρ + ρ = 4ρ.

Therefore, according to the modified set covering mechanism, x_j^l will be associated with the hyperellipsoid constructed from x_i^{l+1} in the (l+1)-th level of set compression. This implies |Q_{l+1}| < |Q_l| = T, which is a contradiction. Thus the proposition is proved.

Proposition 3.2 tells us that when ρ is properly chosen and l is large enough, Q_l will be rather sparse. Significant speedup of Agglo-MS over Naive-MS can therefore be achieved.

3.4 Summary of Method  To summarize the development so far, we have derived an accelerated MS clustering method based on iterative query set compression, and analyzed its computational complexity, mode-seeking ability and convergence property. In methodology, Agglo-MS differs from the existing acceleration strategies, including IFGT-MS [21], LSH-MS [14], ms1-MS [4] and DT-MS [20]. IFGT-MS approximates the MS calculation at each query point by taking its nearby points as the reference set and adopting the fast multipole trick; it is only applicable to the Gaussian kernel. LSH-MS likewise performs a fast search of the neighborhood around a query point to approximately compute the MS iteration. For image segmentation tasks, ms1-MS stops the current MS iteration early if the ascent path intersects a previously visited pixel, which is close to the trick used in [8]. DT-MS achieves speedup by recursively considering a region of query points and a region of reference points, represented by nodes of a query tree and a reference tree respectively. The proposed Agglo-MS, in contrast, iteratively compresses the query set until convergence is attained, and the query points are clustered in an agglomerative way during this procedure.

In the following experiments, we choose Naive-MS as the baseline and compare the performance of Agglo-MS, IFGT-MS and ms1-MS (for image segmentation). In our implementation of Agglo-MS, we use the modified query set covering function.

3.5 A Walkthrough Example  We now apply Agglo-MS to a 2D toy data set (shaped as shown in Figure 1(b), size 1,821) to illustrate its working mechanism. In this case an isotropic covariance H = σ^2 I is used, hence the hyperellipsoids degenerate into disks.

We start from the point A(-1.5, 0.6) (the green dot in Figure 1(b)) with bandwidth σ = 0.4 to perform the query set covering. After one iteration of MS, point A shifts to point B(-1.29, 0.62) (the red dot in Figure 1(b)). We then construct the first disk, taking B as its center and ||AB|| as its radius. In Figure 1(a), we plot the mesh of the KDE (in black) and the QB function \hat{\rho}^1(x) (in blue) at A. Any point lying inside this disk increases \hat{\rho}^1(x) and is therefore likely to converge to the same mode as A via MS. After one pass of scanning, a total of 223 disks are obtained to cover the initial query set, as illustrated in Figure 1(c). One worrying aspect of our set covering mechanism is the possible intersection between disks from different clusters. To illustrate this phenomenon more clearly, we select a local window (in green) around one cluster boundary in Figure 1(c) and magnify it in Figure 1(d); the disks in blue straddle clusters. This is also where the numerical error of Agglo-MS comes from; a detailed quantitative evaluation of this aspect is given later. We now enter the iterative query set compression phase. After l = 5 iterations a query set with 104 points is obtained; the corresponding compressed query set and clustering result are shown in Figures 1(e) and 1(f). The convergent (l = 60) compressed query set and the corresponding clustering result are shown in Figures 1(g) and 1(h). The query set size vs. iteration number curves for different bandwidths are given in Figure 2(a). We evaluate the numerical performance of the accelerated MS algorithms using the CPU running time and the following ε-error rate (ε-ER):

(3.9)    ε\text{-ER} = \frac{1}{N} \sum_{i=1}^{N} \delta\!\left( \frac{\| \hat{x}_i^{\infty,\text{X-MS}} - \hat{x}_i^{\infty,\text{Naive-MS}} \|}{\| \hat{x}_i^{\infty,\text{Naive-MS}} \|} > ε \right)

where δ(x) is the indicator function that equals one if the boolean variable x is true and zero otherwise, and x̂_i^{∞,X-MS} and x̂_i^{∞,Naive-MS} are the convergent modes returned by X-MS (X stands for 'Agglo', 'IFGT' or 'ms1' in this work) and Naive-MS, respectively, from an initial query point x_i in Q_0.
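For reference, (3.9) translates directly into the following small NumPy sketch; the argument names are ours.

```python
import numpy as np

def epsilon_error_rate(modes_x, modes_naive, eps=1e-3):
    """Fraction of query points whose X-MS mode deviates from the
    Naive-MS mode by more than eps in relative norm, as in (3.9).
    modes_x, modes_naive : (N, d) arrays of convergent modes."""
    rel = np.linalg.norm(modes_x - modes_naive, axis=1) \
          / np.linalg.norm(modes_naive, axis=1)
    return np.mean(rel > eps)
```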
Figure 1: Illustration of the iterative query set compression working mechanism on a 2D toy data set (plots omitted). See text for the detailed description.

Figure 2: Quantitative evaluation curves for the 2D toy problem (plots omitted): (a) query set size vs. iteration number for σ = 0.1, 0.3, 0.8; (b) speedup ratio vs. bandwidth σ for Agglo-MS and IFGT-MS; (c) ε-ER vs. bandwidth σ for Agglo-MS and IFGT-MS.

We set the parameters ρ = ε = 10^{-3} and use the Gaussian kernel throughout the experiments in this paper. The algorithm code¹ is written mainly in C++ and run on a P4 Core2 2.4 GHz CPU with 2 GB of RAM. For this example, the quantitative results under bandwidth σ = 0.4 are listed in Table 1. The optimal clustering results (|Q_∞| = 5) by Agglo-MS are achieved under σ ∈ (0.25, 0.54). The speedup vs. bandwidth curves of Agglo-MS and IFGT-MS over Naive-MS are given in Figure 2(b), from which we can see that Agglo-MS outperforms IFGT-MS in speedup. The ε-ER vs. bandwidth curves of Agglo-MS and IFGT-MS are shown in Figure 2(c), from which we can see that the approximation errors of the two algorithms are comparable. Note that when the bandwidth is small (σ < 0.3) or large (σ > 0.6), a relatively high ε-ER is introduced in Agglo-MS. This is because a small bandwidth tends to lead to over-clustering by the MS method, which makes Agglo-MS sensitive to intersections between disks from different clusters, while a too large bandwidth leads to large covering disks, which also increases the chance of involving points belonging to different clusters.

¹ The IFGT-MS code is implemented using the IFGT library provided by the authors at http://www.umiacs.umd.edu/~vikas/Software/IFGT/IFGT_code.htm. The important tunable parameters, e.g., p_max (polynomial order), K (number of cells) and r (ratio of the cut-off radius), can be chosen automatically by the program. We further carefully tuned p_max and r to achieve results comparable to Naive-MS.
Table 1: CPU running time (in milliseconds) and ε-ER of Agglo-MS, IFGT-MS (p_max = 5, r = 2) and Naive-MS on the 2D synthetic data set, σ = 0.4.

  Method               |Q_l|   CPU Time   ε-ER
  Agglo-MS (l = 1)      223        40      —
  Agglo-MS (l = 5)      104        53      —
  Agglo-MS (l = 60)       5       117      0.0016
  IFGT-MS                 —       975      0
  Naive-MS                —     1,759      —

Figure 4: Segmentation results under σ = 0.1 (top row) and σ = 0.2 (bottom row) on the hand image (images omitted). For each row, from left to right: Naive-MS, Agglo-MS, IFGT-MS and ms1-MS.

3.6 Real-World Experiments  We now assess the performance of Agglo-MS on some real-world clustering tasks, including image segmentation and high-dimensional data set clustering.

3.6.1 Image Segmentation  The most important application of MS clustering is unsupervised image segmentation. We test the performance of Agglo-MS on image segmentation tasks. We follow the approach in [21], where each datum is represented by the spatial and range features (i, j, L*, u*, v*), with (i, j) the pixel location in the image and (L*, u*, v*) the normalized LUV color feature. Figures 4(b) and 4(f) show the results by Agglo-MS on the color image hand. The speedup vs. bandwidth curves of Agglo-MS, IFGT-MS and ms1-MS over Naive-MS on this image are given in Figure 3(a), from which we can see that Agglo-MS is always much faster than the other two. The ε-ER vs. bandwidth curves of Agglo-MS, IFGT-MS and ms1-MS are plotted in Figure 3(b), from which we can see that the approximation errors are comparable among the three algorithms. Reasonable segmentations are achieved under the bandwidth interval σ ∈ (0.1, 0.22). Figure 4 shows selected segmentation results under bandwidths σ = 0.1 and 0.2 by the different MS clustering methods.

Figure 3: Quantitative evaluation curves for the hand image (plots omitted): (a) speedup vs. bandwidth and (b) ε-ER vs. bandwidth, for Agglo-MS, IFGT-MS and ms1-MS.

Some images from [21] and the Berkeley segmentation dataset² are also used for evaluation. Four selected groups of segmentation results are given in Figure 5. The quantitative comparison between Agglo-MS, IFGT-MS and ms1-MS on these images is listed in Table 2. As expected, Agglo-MS significantly outperforms the other two in speedup on these four test images. The ε-ER achieved by Agglo-MS is overall less than 3% and comparable to the other two.

3.6.2 High Dimensional Cases  To evaluate the speedup and numerical performance of Agglo-MS in high-dimensional data clustering tasks, we apply it to several real-world databases, including the CMU PIE face database³, the MNIST handwritten digit database⁴ and the TDT2 document database⁵. We first briefly describe these data sets and then give the quantitative results.

CMU PIE Data Set  The CMU PIE face database contains 68 subjects with 41,368 face images in total. We follow [15] and use 170 face images per individual in our experiment, for a total of 11,554 points. The size of each cropped gray-scale image is 32 × 32 pixels. We perform clustering in a subspace embedded by spectral regression (SR) [3], with the dimension reduced from 1024 to 67.

MNIST Data Set  The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in fixed-size (28 × 28) bilevel images. Our clustering is done on the training set embedded by SR, with the dimension reduced from 784 to 9.

TDT2 Document Data Set  The TDT2 corpus consists of 11,201 on-topic documents classified into 96 semantic categories.

² http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/
³ http://www.ri.cmu.edu/projects/project_418.html
⁴ http://yann.lecun.com/exdb/mnist/
⁵ http://www.nist.gov/speech/tests/tdt/tdt98/index.htm
Figure 5: Selected image segmentation results (images omitted). For each image group, from left to right: Naive-MS, Agglo-MS, IFGT-MS and ms1-MS.

Table 2: Quantitative results by Agglo-MS, IFGT-MS (p_max = 3, r = 2), ms1-MS and Naive-MS on four test images.

  Images                      House       Base Dive   Hawk        Cowboy
  Sizes                       255 × 192   432 × 294   481 × 321   481 × 321
  σ                           0.1         0.1         0.06        0.1
  CPU Time (s)  Agglo-MS      1.780       3.927       9.313       5.888
                IFGT-MS       5.628       23.863      28.753      27.82
                ms1-MS        8.015       14.607      12.372      18.124
                Naive-MS      116.587     169.290     155.829     717.521
  Speedup       Agglo-MS      65.50       43.11       16.73       121.86
                IFGT-MS       2.64        5.15        2.58        3.66
                ms1-MS        14.55       11.59       12.60       39.59
  ε-ER          Agglo-MS      0.029       0.025       0.001       0.005
                IFGT-MS       0.021       0.054       0.046       0.015
                ms1-MS        0.027       0.024       0.001       0.005
  |Q_∞| / M(X, H, ρ)          5/5         7/7         3/3         5/5

In this experiment we use the top 30 categories, leaving us with 9,394 documents. The clustering is also performed in a subspace embedded by SR, with the dimension reduced from 36,771 to 29.

The provided IFGT-MS code fails on these data sets. The quantitative results of Agglo-MS and Naive-MS are listed in Table 3, from which we can see that Agglo-MS also significantly improves the clustering speed, with acceptable approximation error, in high-dimensional spaces.

Table 3: Performance of Agglo-MS vs. Naive-MS on high-dimensional data sets.

  Datasets                  PIE       MNIST     TDT2
  Dimensions                67        9         29
  Classes                   68        10        30
  Sizes                     11,554    60,000    9,394
  σ                         15        10        0.8
  CPU Time (s)  Agglo-MS    52.748    13.273    4.789
                Naive-MS    505.111   381.646   125.729
  Speedup                   9.58      28.75     26.25
  ε-ER                      0.092     0.065     0.002
  |Q_∞| / M(X, H, ρ)        76/76     113/113   22/22

4 Extension for Constrained Clustering

Our Agglo-MS can be naturally extended into a constrained non-parametric clustering framework, with much improved cluster purity. Constrained clustering typically focuses on the use of background information in the form of instance-level must-link and cannot-link constraints. A must-link constraint enforces that two instances must be placed in the same cluster, while a cannot-link constraint enforces that two instances must not be placed in the same cluster. Previous work on constrained clustering can be divided into two categories: 1) work in which the constraints help the algorithm construct a distortion/distance function for optimization [1][16][19], and 2) work in which the constraints are used as hints to guide the algorithm to search for a feasible solution [9][10][18].
The first type of work assumes that points surrounding a pair of must-link/cannot-link points should be close to/far from each other, while the second type only requires that the two points be in the same/different clusters. Our constrained Agglo-MS (CAgglo-MS) method falls into the second category. The must-link and cannot-link constraints are used to guide the query set covering at each iteration level so as to guarantee the feasibility of the output clusters. At the same time, the constraints themselves are automatically updated as the query set evolves. To further obtain a desired number of clusters, CKmeans-like methods [2][9][18][19] can be applied afterwards to the final compressed query set with the updated constraints.

4.1 Definition of Constraints  We consider the problem of clustering the set X under the following types of constraints:

1. Must-Link Constraints: Each must-link constraint involves a pair of points x_i and x_j (i ≠ j). In any feasible clustering, x_i and x_j must be in the same cluster. Denote by M the set of must-link constraint pairs.

2. Cannot-Link Constraints: Each cannot-link constraint also involves a pair of distinct points x_i and x_j. In any feasible clustering, x_i and x_j must not be in the same cluster. Denote by C the set of cannot-link constraint pairs.

A constrained clustering algorithm is referred to as ML-feasible and CL-feasible if the constraints M and C are satisfied, respectively.

4.2 The Algorithm and Feasibility  The CAgglo-MS clustering method is formally given in Algorithm 4. We show through the following construction that CAgglo-MS is both ML-feasible and CL-feasible. As is well known, must-link constraints are transitive. Therefore, a given collection M of must-link constraints can be transformed into an equivalent collection {M_1, ..., M_r} of point groups by computing the transitive closure of M. In CAgglo-MS, we take M_1, ..., M_r as r singletons and, together with the complementary set X - \bigcup_{i=1}^{r} M_i, form the initial query set. This initialization guarantees that the output clusters at each iteration level l are ML-feasible, owing to the fact that Agglo-MS is an agglomerative clustering algorithm. To guarantee CL-feasibility, we revise Algorithm 1 into a cannot-link constrained query set covering mechanism, formally described in Algorithm 3. The key point is that a cannot-link pair is never allowed to be associated with the same hyperellipsoid during the set covering.

1: Function: CL_Query_Set_Covering(Query data set Q, Reference data set R, Cannot-link constraints C)
2: Initialization: S = ∅.
3: for each x_i ∈ Q do
4:   if ∃ HE(x̂_j^1) ∈ S such that x_i ∈ HE(x̂_j^1) ∪ HS(x̂_j^1, 4ρ) and Violate_Cannot_Link(x_i, HE(x̂_j^1), C) is false then
5:     Associate x_i with HE(x̂_j^1).
6:   else
7:     Run one iteration of MS from x_i with reference set R and construct HE(x̂_i^1), as stated in Section 3.1.
8:     S = S ∪ {HE(x̂_i^1)}
9:   end if
10: end for
11: Return S
12: Function: Violate_Cannot_Link(Data point x_i, Hyperellipsoid HE, Cannot-link constraints C)
13: if ∃ x_j such that (x_i, x_j) ∈ C and x_j is already associated with HE then
14:   Return true
15: else
16:   Return false
17: end if
Algorithm 3: Cannot-Link Constrained Query Set Covering
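The cannot-link test of Algorithm 3 can be sketched as follows, again under the isotropic assumption of the earlier snippets and reusing mean_shift_step; here cannot_link is assumed to be a set of index pairs over Q, and all names are ours rather than the paper's.

```python
import numpy as np

def violates_cannot_link(i, members, cannot_link):
    """True if query index i has a cannot-link partner already
    associated with the candidate hyperellipsoid (its member list)."""
    return any((i, j) in cannot_link or (j, i) in cannot_link for j in members)

def cl_query_set_covering(Q, R, h, cannot_link, rho=1e-3):
    """Sketch of Algorithm 3: like query_set_covering, but a point may
    only join a ball (or its 4*rho hypersphere) if no cannot-link
    partner is already associated with it."""
    centers, radii2, members = [], [], []
    assign = np.full(len(Q), -1)
    for i, x in enumerate(Q):
        for j, (c, r2) in enumerate(zip(centers, radii2)):
            covered = np.sum((x - c) ** 2) <= max(r2, (4 * rho) ** 2)
            if covered and not violates_cannot_link(i, members[j], cannot_link):
                assign[i] = j
                members[j].append(i)
                break
        else:
            x1 = mean_shift_step(x, R, h)       # one MS iteration from x
            centers.append(x1)
            radii2.append(np.sum((x - x1) ** 2))
            members.append([i])
            assign[i] = len(centers) - 1
    return np.array(centers), assign
```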
This construction ensures that, at each iteration level l, the obtained clusters are CL-feasible under the current cannot-link constraints C_l. Here, the cannot-link constraint set C_l is updated according to the following rule: given a pair of cannot-link points in C_{l-1}, the centers of the two hyperellipsoids with which they are separately associated in S_l are labeled as cannot-link in C_l. Therefore, by simple induction on l, the output clusters at each iteration level are CL-feasible under the initial cannot-link constraints C_0 = C. The following proposition summarizes the feasibility issues discussed above.

Proposition 4.1. The output clusters at each level l of the CAgglo-MS algorithm are ML-feasible and CL-feasible under the constraints M and C.

1: Constrained Initial Query Set Covering:
2: Let X' = X - \bigcup_{i=1}^{r} M_i. Set query set Q_0 = X' ∪ {M_i, i = 1, ..., r} and reference set R = X. Set the constraints C_0 = C.
3: S_0 = CL_Query_Set_Covering(Q_0, R, C_0)
4: Constrained Iterative Set Compression Phase:
5: Set l = 1
6: while convergence is not attained do
7:   Let query set Q_l = {x̂_i^1 | HE(x̂_i^1) ∈ S_{l-1}}.
8:   Construct cannot-link constraints C_l so that: for all (x_m, x_n) ∈ C_{l-1} with x_m ∈ HE(x̂_i^1) and x_n ∈ HE(x̂_j^1), we have (x̂_i^1, x̂_j^1) ∈ C_l.
9:   S_l = CL_Query_Set_Covering(Q_l, R, C_l)
10:  Group each point in the query set Q_l according to the hyperellipsoid it is associated with in S_l.
11:  l ← l + 1
12: end while
Algorithm 4: Constrained Agglo-MS Clustering

Typically, CAgglo-MS outputs more clusters than Agglo-MS does on the same data set. This can also be verified as follows: since CAgglo-MS is CL-feasible, we know from [10] that |Q_∞| is bounded below by a number k_min, which is in turn upper bounded by one plus the maximum degree of a node in the undirected graph G_c = {V_c, E_c} whose vertex set is V_c = Q_0 and whose edge set is E_c = C. If necessary, as is done in the following experiments, we may further group the output clusters of CAgglo-MS so that more compact results are obtained.

4.3 Performance Evaluation

4.3.1 Data Preparation  The data sets used in our constrained clustering experiments include four data sets from the UCI repository⁶: Iris, Wine, Sonar and Ionosphere; and two subsets from MNIST and TDT2. For MNIST we choose the digits {3, 6, 8, 9} to form MNIST.Multi4, while for TDT2 we choose the top ten document categories to form TDT2.Multi10. We then randomly sample 5% from MNIST.Multi4 and TDT2.Multi10. Table 4 summarizes the properties of these data sets. The constraints are generated as follows: for each constraint, we pick one pair of data points at random from the input data set (their labels are available for evaluation purposes but unavailable for clustering). If the labels of this pair are the same, we generate a must-link; if the labels are different, a cannot-link is generated. The numbers of constraints are determined by the size and difficulty of the data set.

⁶ http://archive.ics.uci.edu/ml/

4.3.2 Experimental Design  In the first group of experiments, we compare the clustering performance of CAgglo-MS and the original Agglo-MS. We use Precision [2] to evaluate the purity of the clusters output by CAgglo-MS and Agglo-MS. The reason for adopting Precision as the measurement is that the number of clusters output by Agglo-MS and CAgglo-MS may differ from the underlying number of classes.

Our second group of experiments compares CAgglo-MS with two popular constrained K-means clustering methods, MPCKmeans [2] and CKmeans [18], as well as traditional K-means. To output the desired number of clusters in our algorithm, we apply MPCKmeans to further group the clusters output by CAgglo-MS with the updated CL-constraints. We refer to this combined method as CAgglo-MS-Kmeans. The astute reader might notice that during the query set compression in CAgglo-MS, the CL-constraints C_l become stronger than C_{l-1}, since the cannot-link property is propagated from point pairs to cluster pairs. The final CL-constraint set C_∞ will therefore always be over-strengthened, which works against a compact clustering of Q_∞. To alleviate this inherent drawback, in our implementation we run MPCKmeans on the first five Q_l (l ≤ 5) with the corresponding CL-constraints C_l and pick the one that performs best as the final clustering output. The F-Measure score [2] is adopted as the quantitative measure of clustering performance.

4.3.3 Results  The Precision curves of CAgglo-MS and Agglo-MS on the test data sets are plotted in Figure 6, from which we can see that integrating semi-supervision information into the query set compression does help to significantly improve the purity of the output clusters. Figure 7 illustrates the evolution of the cannot-link constraints during the running of CAgglo-MS on the Iris data set with 200 constraints. From Figure 7(c) we can clearly see that the CL-constraints are much strengthened, and at least six clusters are required on Q_150 to satisfy the corresponding C_150.

The F-Measure curves of CAgglo-MS-Kmeans, MPCKmeans, CKmeans and traditional K-means on these data sets are given in Figure 8. CAgglo-MS-Kmeans achieves the best performance on the four UCI data sets and the TDT2.Multi10 subset. On the MNIST.Multi4 subset, CAgglo-MS-Kmeans and MPCKmeans perform comparably, and both are superior to CKmeans when there are fewer than 650 constraints, but inferior to it as the number of constraints increases above 650.
Figure 6: The Precision vs. number of constraints curves of CAgglo-MS and Agglo-MS (plots omitted). Panels: (a) Iris, (b) Wine, (c) Sonar, (d) Ionosphere, (e) MNIST.Multi4, (f) TDT2.Multi10.

Figure 7: Evolution of the cannot-link constraints during query set compression in CAgglo-MS on the UCI Iris data set (plots omitted): (a) original CL-constraints, (b) CL-constraints for Q_20, (c) CL-constraints for Q_150. The points from the three classes are shown as differently colored dots, the cannot-link constraints as red lines, and the current query points as black stars. The first two feature dimensions are plotted for the sake of visualization.

Figure 8: The F-Measure vs. number of constraints curves of CAgglo-MS-Kmeans, MPCKmeans, CKmeans and traditional K-means (plots omitted). Panels: (a) Iris, (b) Wine, (c) Sonar, (d) Ionosphere, (e) MNIST.Multi4 (5%), (f) TDT2.Multi10 (5%).
Table 4: Data sets used for the evaluation of CAgglo-MS.

  Datasets              Sizes   Dimensions   Classes
  Iris                    150        4           3
  Wine                    178       13           3
  Sonar                   208       60           2
  Ionosphere              351       34           2
  MNIST.Multi4 (5%)     1,190        9           4
  TDT2.Multi10 (5%)       372       29          10

5 Conclusion

In this paper, we report our progress on algorithmic improvements to the popular MS non-parametric clustering method. As the first contribution, we develop the Agglo-MS algorithm based on a highly efficient hyperellipsoid query set covering mechanism, and analyze its mode-seeking ability and convergence property. Agglo-MS is applicable to general convex kernels. Another advantage of Agglo-MS over some existing methods, e.g., IFGT-MS and LSH-MS, is that it is free of parameter tuning, and hence more flexible in practice. Extensive evaluation on several toy and real-world clustering tasks validates the time efficiency and numerical accuracy of Agglo-MS in both low- and high-dimensional spaces. The second contribution of this work is to integrate pairwise constraint information into Agglo-MS to develop a semi-supervised non-parametric clustering algorithm, namely CAgglo-MS. The experimental evaluation on UCI and real-world data sets validates the superiority of CAgglo-MS. We expect that, combined with dimensionality reduction techniques, Agglo-MS and CAgglo-MS can achieve competitive solutions in many clustering tasks.

References

[1] S. Basu, M. Bilenko, and R. Mooney. A probabilistic framework for semi-supervised clustering. In Knowledge Discovery and Data Mining. ACM, 2004.
[2] M. Bilenko, S. Basu, and R. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In International Conference on Machine Learning, volume 1, pages 81-88, 2004.
[3] D. Cai, X. He, and J. Han. Spectral regression for efficient regularized subspace learning. In International Conference on Computer Vision. IEEE, 2007.
[4] M. Carreira-Perpinan. Acceleration strategies for Gaussian mean-shift image segmentation. In Computer Vision and Pattern Recognition, volume 1, pages 1160-1167. IEEE, 2006.
[5] M. Carreira-Perpinan. Fast nonparametric clustering with Gaussian blurring mean-shift. In International Conference on Machine Learning, pages 153-160, 2006.
[6] M. Carreira-Perpinan. Gaussian mean-shift is an EM algorithm. IEEE TPAMI, 29(5):767-776, 2007.
[7] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE TPAMI, 17(7):790-799, 1995.
[8] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE TPAMI, 24(5):603-619, May 2002.
[9] I. Davidson and S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. In International Conference on Data Mining. SIAM, 2005.
[10] I. Davidson and S. Ravi. Hierarchical clustering with constraints: theory and practice. In Principles and Practice of Knowledge Discovery in Databases (PKDD), 2005.
[11] M. Fashing and C. Tomasi. Mean shift is a bound optimization. IEEE TPAMI, 27:471-474, Mar. 2005.
[12] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with application in pattern recognition. IEEE Transactions on Information Theory, 21:32-40, 1975.
[13] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, New York, 1979.
[14] B. Georgescu, I. Shimshoni, and P. Meer. Mean shift based clustering in high dimensions: a texture classification example. In International Conference on Computer Vision, volume 1, pages 456-463. IEEE, 2003.
[15] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang. Face recognition using Laplacianfaces. IEEE TPAMI, 27(3):1-13, Mar. 2005.
[16] D. Klein, S. Kamvar, and C. Manning. From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In International Conference on Machine Learning, pages 307-314, 2002.
[17] R. Rockafellar. Convex Analysis. Princeton University Press, 1970.
[18] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. In International Conference on Machine Learning, pages 577-584, 2001.
[19] F. Wang, T. Li, and C. Zhang. Semi-supervised clustering via matrix factorization. In International Conference on Data Mining. SIAM, 2008.
[20] P. Wang, D. Lee, A. Gray, and J. Rehg. Fast mean shift with accurate and stable convergence. In International Conference on Artificial Intelligence and Statistics, volume 2, pages 604-611, 2007.
[21] C. Yang, R. Duraiswami, N. A. Gumerov, and L. Davis. Improved fast Gauss transform and efficient kernel density estimation. In International Conference on Computer Vision, volume 1, pages 664-671. IEEE, 2003.
[22] X. T. Yuan and S. Z. Li. Half quadratic analysis for mean shift: with extension to a sequential data mode-seeking method. In International Conference on Computer Vision. IEEE, 2007.