Kernel Clustering
CSE 902
Radha Chitta
Outline
● Data analysis
● Clustering
● Kernel Clustering
● Kernel K-means and Spectral Clustering
● Challenges and Solutions
Data Analysis
Learning Problems
➢ Supervised Learning (Classification)
➢ Semi-supervised learning
➢ Completely unsupervised learning
Example: there are two objects in this set of images; separate them into "two" groups.
Clustering Algorithms
➢ Hierarchical
  ➢ Agglomerative: Single Link, Complete Link
  ➢ Divisive
➢ Partitional
  ➢ Squared-error: K-means, Kernel K-means
  ➢ Graph theoretic: Spectral, MST
  ➢ Mixture models
  ➢ Distribution/density: DBScan
  ➢ Nearest neighbor: 1-NN to k-NN
The K-means problem
Given a data set $D = \{x_1, x_2, \ldots, x_n\}$, find C cluster centers and a cluster assignment matrix U that minimize the sum of squared errors:
$\min_U \sum_{k=1}^{C} \sum_{i=1}^{n} U_{ki} \, \| c_k - x_i \|_2^2$
[Figure: data points $x_1, x_2, \ldots$ grouped around cluster centers $z_1, z_2, z_3$]
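For reference, a minimal NumPy sketch of Lloyd's algorithm, the standard heuristic for this objective (all names and defaults are illustrative):

```python
import numpy as np

def kmeans(X, C, n_iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignments and center updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=C, replace=False)]  # init from data
    for _ in range(n_iters):
        # assign each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        new_centers = np.array(
            [X[labels == k].mean(axis=0) if (labels == k).any() else centers[k]
             for k in range(C)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```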
Kernel mapping $\phi: \mathcal{X} \to \mathcal{H}$
Polynomial example: $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$
Dot product kernels
Euclidean distance in $\mathcal{H}$:
$\| \phi(x) - \phi(y) \|_2^2 = \phi(x)^T \phi(x) - 2\, \phi(x)^T \phi(y) + \phi(y)^T \phi(y)$
Polynomial map:
$\phi([x_1, x_2]) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$
$\phi(x)^T \phi(y) = \begin{pmatrix} x_1^2 & \sqrt{2}\, x_1 x_2 & x_2^2 \end{pmatrix} \begin{pmatrix} y_1^2 \\ \sqrt{2}\, y_1 y_2 \\ y_2^2 \end{pmatrix} = (x^T y)^2$
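A quick numeric check of the identity above (a minimal sketch; the vectors are arbitrary examples):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

def phi(v):
    # explicit polynomial map phi([v1, v2]) = (v1^2, sqrt(2) v1 v2, v2^2)
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

lhs = phi(x) @ phi(y)   # inner product in feature space
rhs = (x @ y) ** 2      # kernel evaluation (x^T y)^2
print(lhs, rhs)         # both print 121.0
```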
Kernel function $\kappa(x, y)$
Mercer kernels: symmetric, positive semi-definite kernel functions, which correspond to inner products in some feature space $\mathcal{H}$.
Distance in $\mathcal{H}$ using only kernel evaluations:
$d_{\mathcal{H}}^2(x, y) = \kappa(x, x) - 2\, \kappa(x, y) + \kappa(y, y)$
Kernels
● Polynomial (linear for p = 1):
  $\kappa(x, y) = (x^T y)^p$
● Gaussian:
  $\kappa(x, y) = \exp(-\lambda \| x - y \|_2^2)$
● Chi-square:
  $\kappa(x, y) = 1 - \sum_{i=1}^{d} \frac{(x_i - y_i)^2}{0.5\,(x_i + y_i)}$
● Histogram intersection:
  $\kappa(x, y) = \sum_{i=1}^{m} \min(\mathrm{hist}_i(x), \mathrm{hist}_i(y))$
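A minimal NumPy sketch of these four kernels (the chi-square and histogram intersection forms assume nonnegative, histogram-like features; the small epsilon guard is an added assumption to avoid division by zero):

```python
import numpy as np

def polynomial_kernel(x, y, p=2):
    # kappa(x, y) = (x^T y)^p; p = 1 gives the linear kernel
    return np.dot(x, y) ** p

def gaussian_kernel(x, y, lam=1.0):
    # kappa(x, y) = exp(-lambda * ||x - y||^2)
    return np.exp(-lam * np.sum((x - y) ** 2))

def chi_square_kernel(x, y, eps=1e-12):
    # kappa(x, y) = 1 - sum_i (x_i - y_i)^2 / (0.5 * (x_i + y_i))
    return 1.0 - np.sum((x - y) ** 2 / (0.5 * (x + y) + eps))

def histogram_intersection_kernel(x, y):
    # kappa(x, y) = sum_i min(x_i, y_i) on histogram features
    return np.sum(np.minimum(x, y))
```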
Kernel K-means
Find C clusters in the Hilbert (feature) space $\mathcal{H}$ that minimize the sum of squared errors:
$\min_U \sum_{k=1}^{C} \sum_{i=1}^{n} U_{ki} \, \| c_k - \phi(x_i) \|_{\mathcal{H}}^2$
Expanding the centers in terms of kernel values:
$\min_U \sum_{i=1}^{n} \sum_{k=1}^{C} U_{ki} \left[ \kappa(x_i, x_i) - \frac{2}{n_k} \sum_{j=1}^{n} U_{kj}\, \kappa(x_i, x_j) + \frac{1}{n_k^2} \sum_{j=1}^{n} \sum_{l=1}^{n} U_{kj} U_{kl}\, \kappa(x_j, x_l) \right]$
$K = [\kappa(x_i, x_j)]_{n \times n}; \quad x_i, x_j \in D$
Distance to a cluster center in terms of kernel values:
$d^2(x_i, c_k) = \kappa(x_i, x_i) - \frac{2}{n_k} \sum_{j=1}^{n} U_{kj}\, \kappa(x_i, x_j) + \frac{1}{n_k^2} \sum_{j=1}^{n} \sum_{l=1}^{n} U_{kj} U_{kl}\, \kappa(x_j, x_l)$
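Putting the pieces together, a minimal sketch of kernel K-means that iterates assignments using only entries of a precomputed kernel matrix (names and defaults are illustrative):

```python
import numpy as np

def kernel_kmeans(K, C, n_iters=100, seed=0):
    """Lloyd-style kernel k-means on a precomputed n x n kernel matrix K."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, C, size=n)   # random initial assignment
    diag = np.diag(K)                     # kappa(x_i, x_i) terms
    for _ in range(n_iters):
        dist = np.empty((n, C))
        for k in range(C):
            members = labels == k
            n_k = max(members.sum(), 1)
            # d^2(x_i, c_k) = K_ii - (2/n_k) sum_j K_ij + (1/n_k^2) sum_jl K_jl
            dist[:, k] = (diag
                          - 2.0 / n_k * K[:, members].sum(axis=1)
                          + K[np.ix_(members, members)].sum() / n_k**2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```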
Reordered kernel K-means
A Large Scale Clustering Scheme for Kernel K-means, Zhang and Rudnicky, ICPR 2002
Reorder the clustering process so that only a small portion of the kernel matrix is required at a time:
Store the full kernel matrix on disk and load part of it into memory.
Update the two center-dependent terms incrementally using the kernel portion available in memory:
$f(x_i, c_k) = -\frac{2}{n_k} \sum_{j=1}^{n} U_{kj}\, \kappa(x_i, x_j)$
$g(c_k) = \frac{1}{n_k^2} \sum_{j=1}^{n} \sum_{l=1}^{n} U_{kj} U_{kl}\, \kappa(x_j, x_l)$
Combine the values to obtain the final partition.
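A minimal sketch of the blockwise accumulation idea, assuming the kernel is stored on disk (e.g., as a NumPy memmap) and processed in row blocks; the function name and block size are illustrative:

```python
import numpy as np

def blockwise_distances(K_disk, U, block=1024):
    """Accumulate f(x_i, c_k) and g(c_k) one block of kernel rows at a time.

    K_disk: n x n kernel on disk (e.g., np.memmap); U: C x n indicator matrix.
    The full distance is d^2(x_i, c_k) = K_ii + f[i, k] + g[k].
    """
    C, n = U.shape
    n_k = np.maximum(U.sum(axis=1), 1)          # cluster sizes
    f = np.zeros((n, C))
    g = np.zeros(C)
    for start in range(0, n, block):
        rows = slice(start, min(start + block, n))
        K_blk = np.asarray(K_disk[rows, :])      # load one block into memory
        # f term for points in this block: -2/n_k * sum_j U_kj K_ij
        f[rows] = -2.0 * (K_blk @ U.T) / n_k
        # g contribution of this block: 1/n_k^2 * sum_{j in block, l} U_kj U_kl K_jl
        g += np.einsum('kj,jl,kl->k', U[:, rows], K_blk, U) / n_k**2
    return f, g
```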
Parallel Kernel K-means
Scaling Up Machine Learning: Parallel and Distributed Approaches, Bekkerman et al., 2012
K-means on 1 billion 2-D points with 2 clusters, run on 2.3 GHz quad-core Intel Xeon processors with 8 GB memory in the HPCC intel07 cluster:

Number of processors | Speedup
                   2 | 1.9
                   3 | 2.5
                   4 | 3.8
                   5 | 4.9
                   6 | 5.2
                   7 | 5.5
                   8 | 6.3
                   9 | 5.5
                  10 | 5.1
➔ Reordering reduces only memory usage, not runtime complexity; parallelization reduces neither runtime nor memory complexity.
➔ Kernel approximation: use sampling to avoid calculating the full kernel matrix.
➔ Express the cluster centers in terms of the sampled points.
Approx. Kernel K-means
Approximate Kernel k-means: Solution to Large Scale Kernel Clustering, KDD 2011
Randomly sample m points $\{y_1, y_2, \ldots, y_m\}$, $m \ll n$, and compute the kernel similarity matrices $K_A$ ($m \times m$) and $K_B$ ($n \times m$):
$(K_A)_{ij} = \phi(y_i)^T \phi(y_j), \qquad (K_B)_{ij} = \phi(x_i)^T \phi(y_j)$
Properties
➔ Only an O(nm)-sized matrix needs to be calculated – almost linear runtime and memory complexity.
➔ Bounded O(1/m) approximation error.
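One common way to realize this scheme is to form Nyström-style features from $K_A$ and $K_B$ and run ordinary K-means on them; a minimal sketch, where `kernel` is an assumed pairwise-kernel function and all names are illustrative:

```python
import numpy as np

def nystrom_features(X, kernel, m, seed=0):
    """Sample m landmark points, build K_A (m x m) and K_B (n x m), and map
    every point to features whose inner products approximate the kernel."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)  # the sampled y_1..y_m
    Y = X[idx]
    K_A = kernel(Y, Y)                 # (K_A)_ij = phi(y_i)^T phi(y_j)
    K_B = kernel(X, Y)                 # (K_B)_ij = phi(x_i)^T phi(y_j)
    w, V = np.linalg.eigh(K_A)         # K_A is symmetric PSD
    w = np.maximum(w, 1e-10)           # guard tiny/negative eigenvalues
    return K_B @ (V * w ** -0.5) @ V.T # K_B K_A^{-1/2}, shape n x m
```

Running plain K-means on the returned rows then approximates kernel K-means with the cluster centers restricted to the span of the sampled points.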
Clustering 80M Tiny Images
[Runtime comparison; recoverable entry: K-means – 6 hours]
[Figure: example clusters]
Best supervised classification accuracy on CIFAR-10 using GIST features: 54.7*
K-means: 26.70
* Ranzato et al., Modeling pixel means and covariances using factorized third-order Boltzmann machines, CVPR 2010
** Fowlkes et al., Spectral grouping using the Nystrom method, PAMI 2004
Matrix approximations
➢ Nyström
  ➢ Randomly select columns
  ➢ $\hat{K} = K_B K_A^{-1} K_B^T$
➢ CUR
  ➢ A special form of Nyström for any matrix K
  ➢ Select the most important columns C and rows R
  ➢ Find the U that minimizes the approximation error $\| K - CUR \|_F$
➢ Sparse approximation
  ➢ Find a matrix N such that $\hat{K} = K + N$ is sparse, with $E[N_{ij}] = 0$ and $\mathrm{var}(N_{ij})$ small
➢ Out-of-sample clustering
➢ Non-linear random projections
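A minimal sketch of the Nyström reconstruction above; note it materializes an n × n matrix, so in practice one works with the low-rank factors instead:

```python
import numpy as np

def nystrom_reconstruction(K_B, K_A):
    """Nystrom low-rank approximation K_hat = K_B K_A^{-1} K_B^T, built from
    the same sampled-column matrices K_A (m x m) and K_B (n x m) as before."""
    return K_B @ np.linalg.pinv(K_A) @ K_B.T  # pinv handles rank-deficient K_A
```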
Non-linear Random Projections
Efficient Kernel Clustering Using Random Fourier Features, ICDM 2012
Random projection for dimensionality reduction: generate a random d × r matrix R and project the data into an r-dimensional space:
$x' = \frac{1}{\sqrt{r}} R^T x, \qquad y' = \frac{1}{\sqrt{r}} R^T y$
Johnson-Lindenstrauss Lemma: if R is orthonormal and r is sufficiently large, distances are preserved in the projected space.
Kernel K-means is K-means in a Hilbert space where the data is linearly separable. The idea (see the sketch below):
Efficiently project the data in the Hilbert space to a low-dimensional space; linear separability is preserved.
Apply K-means in the low-dimensional space.
Obtain the low-dimensional representation of the cluster centers.
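A small illustrative check of the distance-preservation claim; this sketch uses a Gaussian random matrix, which also satisfies JL-type guarantees, and arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 1000, 100, 50
X = rng.standard_normal((n, d))
R = rng.standard_normal((d, r))      # random d x r matrix
X_proj = X @ R / np.sqrt(r)          # x' = (1/sqrt(r)) R^T x, applied row-wise

orig = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
proj = np.linalg.norm(X_proj[:, None] - X_proj[None, :], axis=-1)
ratio = proj[orig > 0] / orig[orig > 0]
print(ratio.min(), ratio.max())      # ratios concentrate near 1
```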
Randomly sample m vectors $\{\omega_1, \omega_2, \ldots, \omega_m\}$, $m \ll n$, from the Fourier transform of the kernel, and project the points using $f(\omega, x)$:
$z(x) = \frac{1}{\sqrt{m}} \left[ \cos(\omega_1^T x) \, \ldots \, \cos(\omega_m^T x) \;\; \sin(\omega_1^T x) \, \ldots \, \sin(\omega_m^T x) \right]$
$H = [\, z(x_1)^T \;\; z(x_2)^T \; \ldots \; z(x_n)^T \,]$
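A minimal sketch of these random Fourier features for the Gaussian kernel $\kappa(x, y) = \exp(-\lambda \|x - y\|_2^2)$, whose Fourier transform is a Gaussian with covariance $2\lambda I$ (names and defaults are illustrative):

```python
import numpy as np

def random_fourier_features(X, m, lam=1.0, seed=0):
    """z(x) = (1/sqrt(m)) [cos(w_j^T x) ... sin(w_j^T x)] with
    w_j ~ N(0, 2*lam*I), the Fourier transform of the Gaussian kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * lam), size=(d, m))  # omega_1..omega_m
    P = X @ W                                             # omega_j^T x_i
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(m)

# z(x)^T z(y) approximates kappa(x, y); run ordinary k-means on these rows.
```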
Unsupervised Non-Parametric Kernel Learning
Unsupervised non-parametric kernel learning algorithm, Liu et al., 2013
➔ Highly flexible; requires the least amount of prior knowledge
➔ Learns the kernel and the labels simultaneously
➔ Maintains the spectrum of the data after projection into the feature space
$\min_{K, U} \; \frac{1}{2} \mathrm{trace}(KL) + \frac{\gamma}{C}\, \mathrm{trace}\!\left( U \left( K + KL + \frac{I}{C} \right)^{-1} U^T \right)$   (squared loss)
s.t.
$K \succeq 0$   (PSD)
$\mathrm{trace}(K^p) \leqslant B$   (spectrum regularization)
$-l \leqslant n_p - n_q \leqslant l$   (avoids overfitting and assigning all points to one cluster)
where $L = I - D^{-1/2} W D^{-1/2}$ is the normalized graph Laplacian.
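The only component of the formulation that is fixed in advance is the graph Laplacian; a minimal NumPy sketch of how it is formed from an affinity matrix W (the epsilon guard is an added assumption for isolated nodes):

```python
import numpy as np

def normalized_laplacian(W):
    """Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}, used in the
    similarity-alignment term trace(KL) of the formulation above.

    W: symmetric n x n affinity matrix with nonnegative entries."""
    d = W.sum(axis=1)                                   # node degrees
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))    # D^{-1/2} diagonal
    return np.eye(W.shape[0]) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
```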