Professional Documents
Culture Documents
Agglomerative Mean-Shift Clustering Via Query Set Compression
Agglomerative Mean-Shift Clustering Via Query Set Compression
Agglomerative Mean-Shift Clustering Via Query Set Compression
net/publication/220906948
CITATIONS READS
10 121
3 authors:
Ran He
CASIA
207 PUBLICATIONS 4,769 CITATIONS
SEE PROFILE
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Bao-Gang Hu on 16 May 2014.
2 2 1.6
B
1 1 1.4
0 A 0 1.2
−1 −1 1
−2 −2 0.8
−3 −3 0.6
−4 −4 0.4
−2 −1 0 1 2 −2 −1 0 1 2 0.5 1 1.5
3 3 3 3
2 2 2 2
1 1 1 1
0 0 0 0
−1 −1 −1 −1
−2 −2 −2 −2
−3 −3 −3 −3
−4 −4 −4 −4
−2 −1 0 1 2 −2 −1 0 1 2 −2 −1 0 1 2 −2 −1 0 1 2
Figure 1: Illustration of iterative query set compression working mechanism on a 2D toy dataset. See text for the detailed
description.
1000 35 0.4
σ = 0.1 Agglo-MS Agglo-MS
30 0.35
σ = 0.3
800
σ = 0.8 IFGT-MS 0.3 IFGT-MS
Speedup Ratio
25
Query Set Size
600 0.25
20
ε-ER
0.2
400 15
0.15
10
0.1
200
5 0.05
0 0 0
10 20 30 40 50 60 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
Iteration Number σ σ
(a) Query Set Size vs. Iteration Number (b) Speedup vs. bandwidth (c) ε-ER vs. bandwidth
where δ(x) is the delta function that equals one if in table 1. The optimal clustering results (|Q∞ | = 5)
boolean variable x is true and equals zero otherwise, by Agglo-MS are achieved under σ ∈ (0.25, 0.54). The
while x̂∞,X-MS
i and x̂∞,Naive-MS
i are convergent modes speedup vs. bandwidth curves of the Agglo-MS and
returned by X-MS (X stands for ’Agglo’, ’IFGT’ or IFGT-MS over the Naive-MS are given in Figure 2(b),
’ms1’ in this work) and Naive-MS respectively from an from which we can see that Agglo-MS outperforms
initial query point xi in Q0 . We set the parameters ρ = IFGT-MS in speedup performance. We also show the
ε = 10−3 , and choose the Gaussian kernel throughout ε-ER vs. bandwidth curves of the Agglo-MS and IFGT-
the experiments in this paper. The algorithm codes 1 MS in Figure 2(c), from which we can see that the
are basically written in C++, and run on a P4 Core2 approximation error are comparable between these two
2.4G Hz CPU with 2G RAM. For this example, the algorithms. Note that when bandwidth is small (σ<0.3)
quantitative results under bandwidth σ = 0.4 are listed or large (σ>0.6), relatively high ε-ER is introduced in
Agglo-MS. This is because small bandwidth tends to
1 The IFGT-MS code is implemented using lead to over clustering by MS method, which makes
the IFGT library provided by the authors at the Agglo-MS sensitive to the intersection between
http://www.umiacs.umd.edu/~ vikas/Software/IFGT/IFGT_code.htm. disks from different clusters, while too large bandwidth
The important turnable parameters, e.g., p max -polynomial or- will lead to large covering disks, which also increase
der, K-number of cells and r-ratio of the cut off radius, can be
automatically chosen by the program. We further carefully tune
the chance of involving points belonging to different
the parameters p max and r to achieve comparable result to clusters.
Naive-MS.
3.6 Real-World Experiments We now assess the Figure 4: Segmentation results under σ = 0.1 (top row)
performance of Agglo-MS on some real-world clustering and σ = 0.2 (bottom row) on the hand image. For each
tasks, including image segmentation and high dimen- row, from left to right: Naive-MS, Agglo-MS, IFGT-MS and
sional data set clustering. ms1-MS.
300
IFGT-MS
ms1-MS
0.08 IFGT-MS
ms1-MS
subspace embedded by spectral regression (SR) [3], with
Speedup Ratio
200
50
0.02 of handwritten digits has a training set of 60,000 ex-
0
0.2 0.4 0.6
σ
0.8 1
0
0.2 0.4 0.6
σ
0.8 1 amples,and a test set of 10,000 examples. The digits
have been size-normalized and centered in a fixed-size
(a) Speedup vs. bandwidth (b) ε-ER vs. bandwidth
(28 × 28) bilevel image. our clustering is done on the
training set embedded by SR, with dimension reduced
Figure 3: Quantitative evaluation curves for the hand
from 784 to 9.
image.
TDT2 Document Data set The TDT2 corpus con-
Some images from [21] and the Berkeley segmen-
tation dataset2 are also used for evaluation. Four se- 3 http://www.ri.cmu.edu/projects/project_418.html
4 http://yann.lecun.com/exdb/mnist/
2 http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/
5 http://www.nist.gov/speech/tests/tdt/tdt98/index.htm
Figure 5: Selected image segmentation results. For each image group, from left to right: Naive-MS, Agglo-MS, IFGT-MS
and ms1-MS.
Table 2: Quantitative results by Agglo-MS, IFGT-MS (p max = 3, r = 2), ms1-MS and Naive-MS on four test images.
Images House Base Dive Hawk Cowboy
Sizes 255 × 192 432 × 294 481 × 321 481 × 321
σ 0.1 0.1 0.06 0.1
Agglo-MS 1.780 3.927 9.313 5.888
CPU IFGT-MS 5.628 23.863 28.753 27.82
Time (s) ms1-MS 8.015 14.607 12.372 18.124
Naive-MS 116.587 169.290 155.829 717.521
Agglo-MS 65.50 43.11 16.73 121.86
Speedup IFGT-MS 2.64 5.15 2.58 3.66
ms1-MS 14.55 11.59 12.60 39.59
Agglo-MS 0.029 0.025 0.001 0.005
ε-ER IFGT-MS 0.021 0.054 0.046 0.015
ms1-MS 0.027 0.024 0.001 0.005
|Q∞ |/M (X , H, ρ) 5/5 7/7 3/3 5/5
0.96 0.9
0.9
0.94
0.8
Precision
Precision
Precision
0.8
0.92 CAgglo-MS CAgglo-MS CAgglo-MS
0.7
Agglo-MS 0.7
0.9 Agglo-MS Agglo-MS
0.6
0.6
0.88
0.5
0.86 0.5
10 50 90 130 170 200 200 400 600 800 1000 200 400 600 800 1000
Number of Constraints Number of Constraints Number of Constraints
1 1 1
0.9 0.98
0.96
0.97
Precision
Precision
Precision
0.85
CAgglo-MS 0.94
0.96 CAgglo-MS
0.8 CAgglo-MS
Agglo-MS 0.92 0.95
0.75 Agglo-MS Agglo-MS
0.9 0.94
0.7
0.93
0.65 0.88
0.92
200 400 600 800 1000 500 1000 1500 2000 200 400 600 800 1000
Number of Constraints Number of Constraints Number of Constraints
Figure 6: The Precision vs. number of constraints curves of CAgglo-MS and Agglo-MS.
4.5 4.5 4.5
4 4 4
Dimension 2
Dimension 2
Dimension 2
3.5 3.5 3.5
3 3 3
2 2 2
4 5 6 7 8 4 5 6 7 8 4 5 6 7 8
Dimension 1 Dimension 1 Dimension 1
(a) Iris: Original CL-constraints (b) Iris: CL-constraints for Q20 (c) Iris: CL-constraints for Q150
Figure 7: Evolution of cannot-link constraints during query set compression in CAgglo-MS, on the UCI Iris data
set. The points from the three classes are differently colored as dots. The cannot-link constraints are represented
by red lines. The black stars represent the current query points. Here, the first two dimensions of input feature
are plotted for the sake of visualization.
0.96 1 1
CAgglo-MS-Kmeans
0.94
MPCKmeans
0.9 0.9
0.92 CKmeans
Kmeans
F-Measure
F-Measure
F-Measure
1 1 1
F-Measure
F-Measure
Figure 8: The F-Measure vs. number of constraints curves of CAgglo-MS-Kmeans, MPCKmeans, CKmeans and
traditional K-means.