Towards K-Means-Friendly Spaces: Simultaneous Deep Learning and Clustering
Abstract

Most learning approaches treat dimensionality reduction (DR) and clustering separately (i.e., sequentially), but recent research has shown that optimizing the two tasks jointly can substantially improve the performance of both. The premise behind the latter genre is that the data samples are obtained via linear transformation of latent representations that are easy to cluster; but in practice, the transformation from the latent space to the data can be more complicated. In this work, we assume that this transformation is an unknown and possibly nonlinear function. To recover the 'clustering-friendly' latent representations and to better cluster the data, we propose a joint DR and K-means clustering approach in which DR is accomplished via learning a deep neural network (DNN). The motivation is to keep the advantages of jointly optimizing the two tasks while exploiting the DNN's ability to approximate any nonlinear function. This way, the proposed approach can work well for a broad class of generative models. Towards this end, we carefully design the DNN structure and the associated joint optimization criterion, and we propose an effective and scalable algorithm to handle the formulated optimization problem. Experiments using five different real datasets are employed to showcase the effectiveness of the proposed approach.
1 Introduction
Clustering is one of the most fundamental tasks in data mining and machine learning, with an endless list of applications. It is also a notoriously hard task, whose outcome is affected by a number of factors, including data acquisition and representation, the use of preprocessing such as dimensionality reduction (DR), the choice of clustering criterion and optimization algorithm, and initialization [2, 7]. K-means is arguably the bedrock clustering algorithm. Since its introduction in 1957 by Lloyd (published much later, in 1982 [15]), K-means has been used extensively, either alone or together with suitable preprocessing, due to its simplicity and effectiveness. K-means is suitable for clustering data samples that are evenly spread around some centroids (cf. the first subfigure in Fig. 1), but many real-life datasets do not exhibit this 'K-means-friendly' structure. Much effort has therefore been spent on mapping high-dimensional data to a space that is suitable for performing K-means. Earlier ways of achieving this goal include the classical DR algorithms, namely principal component analysis (PCA) and canonical correlation analysis (CCA); linear DR schemes such as nonnegative matrix factorization (NMF) and sparse coding (dictionary learning) have also drawn a lot of attention. In addition to the above approaches that learn linear DR operators (e.g., a projection matrix), nonlinear DR techniques such as those used in spectral clustering [19] and sparse subspace clustering [28, 32] have also been considered.

In recent years, motivated by the success of deep neural networks (DNNs) in supervised learning, unsupervised deep learning approaches are now widely used for DR prior to clustering. For example, the stacked autoencoder (SAE) [27], deep CCA (DCCA) [1], and the sparse autoencoder [18] take insights from PCA, CCA, and sparse coding, respectively, and make use of DNNs to learn nonlinear mappings from the data domain to low-dimensional latent spaces. These approaches treat their DNNs as a preprocessing stage that is designed separately from the subsequent clustering stage. The hope is that the latent representations learned by these DNNs will be naturally suitable for clustering. However, since no clustering-driven objective is explicitly incorporated in the learning process, the learned DNNs do not necessarily output reduced-dimension data that are suitable for clustering, as will be seen in our experiments.

In [30], Yang et al. considered performing DR and clustering jointly. The rationale behind [30] is that if there exists some latent space where the entities nicely fall into clusters, then it is natural to seek a DR transformation that reveals this structure, i.e., one that yields a low K-means clustering cost. This motivates using the K-means cost in the latent space as a prior that helps choose the right DR, pushing DR towards producing K-means-friendly representations. By performing joint DR and K-means clustering, substantially improved clustering results were observed in [30]. The limitation of [30] (see also [6, 20]) is that the observable data are assumed to be generated from the latent clustering-friendly space via a simple linear transformation. While a linear transformation works well in many cases, there are other cases where the generative process is more complex and involves a nonlinear mapping.
[Figure 1: The learned 2-D reduced-dimension data produced by different methods. The observable data is in the 100-D space and is generated from 2-D data (cf. the first subfigure) through the nonlinear transformation in (4.9). The ground-truth cluster labels are indicated using different colors. Panels: Generated, SVD, NMF; LLE, MDS, LapEig; SAE, DCN w/o reconstruction, DCN (proposed).]

Contributions. In this work, we take a step further from [30]: we propose a joint DR and K-means clustering framework in which the DR part is implemented by learning a DNN rather than a linear model. The rationale is that a DNN is capable of approximating any nonlinear continuous function using a reasonable number of parameters [8], and is thus expected to overcome the limitations of the work in [30]. Although implementing this idea is highly non-trivial (much more challenging than [30], where the DR part only needs to learn a linear model), our objective is well-motivated: by better modeling the data transformation process, a much more K-means-friendly latent space can be learned, as we will demonstrate. A sneak peek of the kind of performance that can be expected using our proposed method can be seen in Fig. 1, where we generate four clusters of 2-D data that are well separated in the 2-D Euclidean space and then transform them to a 100-D space using a complex nonlinear mapping [cf. (4.9)] that destroys the cluster structure. We apply several well-appreciated DR methods to the 100-D data and observe the recovered 2-D space (see Section 4.1 for an explanation of the different acronyms). One can see that the proposed algorithm outputs reduced-dimension data that are most suitable for applying K-means. Our specific contributions are as follows:
• Optimization Criterion Design: We propose an optimization criterion for joint DNN-based DR and K-means clustering. The criterion is a combination of three parts, namely dimensionality reduction, data reconstruction, and cluster structure-promoting regularization. It takes insights from the linear generative model case as in [30] but is much more powerful in capturing nonlinear models. We deliberately include the reconstruction part and implement it using a decoding network, which is crucial for avoiding trivial solutions. The criterion is also flexible: it can be extended to incorporate different DNN structures and clustering criteria, e.g., subspace clustering.
• Effective and Scalable Optimization Procedure: The formulated optimization problem is very challenging to handle, since it involves layers of nonlinear activation functions and integer constraints induced by the K-means part. We propose a judiciously designed solution package, including an empirically effective initialization and a novel alternating stochastic gradient algorithm. The algorithmic structure is simple, enables online implementation, and is very scalable.
• Comprehensive Experiments and Validation: We provide a set of synthetic-data experiments and validate the method on five different real datasets, including various document and image corpora. Evident improvement over the respective state of the art is observed for all the datasets that we experimented with.
• Reproducibility: We open-source our code at https://github.com/boyangumn/DCN.
2 Background and Problem Formulation

2.1 K-means Clustering and DR. Given a set of data vectors {x_i}_{i=1,...,N} where x_i ∈ R^M, the task of clustering is to group the N data vectors into K categories. K-means approaches this task by optimizing the following cost function:

$$
\min_{M \in \mathbb{R}^{M \times K},\, \{s_i \in \mathbb{R}^{K}\}} \ \sum_{i=1}^{N} \| x_i - M s_i \|_2^2
\quad \text{s.t. } s_{j,i} \in \{0,1\},\ 1^{T} s_i = 1 \ \forall i, j,
\tag{2.1}
$$

where s_i is the assignment vector of data point i, which has only one non-zero element, s_{j,i} denotes the jth element of s_i, and the kth column of M, i.e., m_k, denotes the centroid of the kth cluster.
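To make the notation concrete, the following minimal sketch (Python/NumPy; the function and variable names are ours and purely illustrative, and centroids are stored as rows of M for convenience) evaluates the cost in (2.1) and performs the two classical alternating updates: assign each x_i to its nearest centroid, then recompute each centroid as the mean of its assigned points.

import numpy as np

def kmeans_cost(X, M, assign):
    """Cost in (2.1): sum_i ||x_i - M s_i||^2, with s_i one-hot encoded by `assign`.
    X: (N, D) data, M: (K, D) centroids (rows), assign: (N,) cluster indices."""
    return np.sum((X - M[assign]) ** 2)

def kmeans_step(X, M):
    """One Lloyd iteration: update assignments, then centroids."""
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
    assign = d2.argmin(axis=1)                            # s_i: index of nearest centroid
    for k in range(M.shape[0]):                           # m_k: mean of assigned points
        if np.any(assign == k):
            M[k] = X[assign == k].mean(axis=0)
    return M, assign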
K-means works well when the data points are evenly scattered around their centroids in the Euclidean space. For example, when the data samples exhibit a geometric distribution as depicted in the first subfigure of Fig. 1, K-means can easily distinguish the four clusters. We therefore consider datasets with a similar structure to be 'K-means-friendly'. However, high-dimensional data are in general not very K-means-friendly. In practice, using a DR pre-processing step, e.g.,
PCA or NMF, to reduce the dimension of x_i to a much lower-dimensional space and then applying K-means usually gives better results. The insight is simple: the data points may exhibit a better K-means-friendly structure in a certain latent feature space, and DR approaches such as PCA and NMF help find this space. In addition to these classic DR methods, which essentially learn a linear generative model from the latent space to the data domain, nonlinear DR approaches such as those used in spectral clustering and DNN-based DR are also widely used as pre-processing before K-means or other clustering algorithms [1, 4, 27].
2.2 Joint DR and Clustering. Instead of using DR as a pre-processing step, joint DR and clustering has also been considered in the literature [6, 20, 30]. This line of work can be summarized as follows. Consider the generative model where a data sample is generated by x_i = W h_i, where W ∈ R^{M×R}, h_i ∈ R^R, and R ≪ M. Assume that the data clusters are well-separated in the latent domain (i.e., where h_i lives) but distorted by the transformation introduced by W. Reference [30] formulated the joint optimization problem as follows:

$$
\min_{M,\{s_i\},W,H} \ \| X - W H \|_F^2 + \lambda \sum_{i=1}^{N} \| h_i - M s_i \|_2^2 + r_1(H) + r_2(W)
\quad \text{s.t. } s_{j,i} \in \{0,1\},\ 1^{T} s_i = 1 \ \forall i, j,
\tag{2.2}
$$

where X = [x_1, ..., x_N], H = [h_1, ..., h_N], and λ ≥ 0 is a parameter for balancing data fidelity and the latent cluster structure. In (2.2), the first term performs DR and the second term performs latent clustering. The terms r_1(·) and r_2(·) are regularizations (e.g., nonnegativity or sparsity) used to prevent trivial solutions such as H → 0 ∈ R^{R×N}; see details in [30].
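For concreteness, the objective in (2.2) for a given factorization can be evaluated as in the NumPy sketch below (the names are ours; r1 and r2 are left as placeholders for whatever regularizers are chosen, e.g., nonnegativity or sparsity penalties).

import numpy as np

def joint_linear_cost(X, W, H, M, assign, lam, r1=lambda H: 0.0, r2=lambda W: 0.0):
    """Objective of (2.2): ||X - W H||_F^2 + lam * sum_i ||h_i - M s_i||^2 + r1(H) + r2(W).
    X: (D, N), W: (D, R), H: (R, N), M: (R, K) centroids (columns), assign: (N,) indices."""
    fit = np.linalg.norm(X - W @ H, 'fro') ** 2      # data-fidelity / DR term
    latent = np.sum((H - M[:, assign]) ** 2)         # latent K-means term
    return fit + lam * latent + r1(H) + r2(W)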
The approach in (2.2) enhances the performance of data clustering substantially in many cases. However, the assumption that the clustering-friendly space can be obtained by a linear transformation of the observable data is restrictive, and in practice the relationship between the observable data and the latent representations can be highly nonlinear, as implied by the success of kernel methods and spectral clustering [19]. How can we transcend the ideas and insights of joint linear DR and clustering to deal with such complex (yet more realistic) cases? In this work, we address this question.
2.3 Proposed Formulation. Our idea is to model the relationship between the observable data x_i and its clustering-friendly latent representation h_i using a nonlinear mapping, i.e.,

$$
h_i = f(x_i; \mathcal{W}), \qquad f(\cdot\,; \mathcal{W}): \mathbb{R}^M \rightarrow \mathbb{R}^R,
$$

where f(·; W) denotes the mapping function and W is a set of parameters that characterize the nonlinear mapping. Many nonlinear mapping functions can be considered, e.g., Gaussian and sinc kernels. In this work, we propose to use a DNN as our mapping function, since DNNs have the ability to approximate any continuous mapping using a reasonable number of parameters [8]. For the rest of this paper, f(·; W) is a DNN and W collects the network parameters, i.e., the weights of the links that connect the neurons between layers and the biases in each hidden layer.

Using the above notation, it is tempting to formulate joint DR and clustering as follows:

$$
\min_{\mathcal{W}, M, \{s_i\}} \ \hat{L} = \sum_{i=1}^{N} \| f(x_i; \mathcal{W}) - M s_i \|_2^2
\quad \text{s.t. } s_{j,i} \in \{0,1\},\ 1^{T} s_i = 1 \ \forall i, j.
\tag{2.3}
$$

The formulation in (2.3) seems intuitive at first glance. However, it is seriously ill-posed and can easily lead to trivial solutions. A globally optimal solution to Problem (2.3) is f(x_i; W) = 0, and the optimal objective value L̂ = 0 can always be achieved by simply letting M = 0. Another trivial solution is to put arbitrary samples into the K clusters, which also leads to a small value of L̂ but can be far from what is desired, since there is no provision for respecting the data samples x_i; see the bottom-middle subfigure in Fig. 1 [Deep Clustering Network (DCN) w/o reconstruction]. In fact, the formulation proposed in the parallel unpublished work [29] is very similar to that in (2.3). The only differences are that the hard 0-1 assignment constraint on s_i is relaxed to a probabilistic soft constraint and that the fitting criterion there is based on the Kullback-Leibler (KL) divergence instead of least squares; unfortunately, such a formulation still has trivial solutions, just as discussed above.

One of the most important components for preventing trivial solutions in the linear DR case is the reconstruction term ||X − WH||_F^2 in (2.2). This term ensures that the learned h_i's can (approximately) reconstruct the x_i's using the basis W. This motivates incorporating a reconstruction term in the joint DNN-based DR and K-means formulation. In the realm of unsupervised DNNs, there are several well-developed approaches for reconstruction; e.g., the stacked autoencoder (SAE) is a popular choice for serving this purpose. To prevent trivial low-dimensional representations such as all-zero vectors, the SAE uses a decoding network g(·; Z) to map the h_i's back to the data domain and requires that g(h_i; Z) and x_i match each other well under some metric, e.g., mutual information or least squares-based measures.

By the above reasoning, we come up with the following cost function:
$$
\min_{\mathcal{W}, \mathcal{Z}, M, \{s_i\}} \ \sum_{i=1}^{N} \left[ \ell\big(g(f(x_i)), x_i\big) + \frac{\lambda}{2} \| f(x_i) - M s_i \|_2^2 \right]
\quad \text{s.t. } s_{j,i} \in \{0,1\},\ 1^{T} s_i = 1 \ \forall i, j,
\tag{2.4}
$$
where we have simplified the notation f(x_i; W) and g(h_i; Z) to f(x_i) and g(h_i), respectively, for conciseness. The function ℓ(·): R^M → R is a certain loss function that measures the reconstruction error. In this work, we adopt the least-squares loss ℓ(x, y) = ||x − y||_2^2; other choices such as ℓ1-norm based fitting and the KL divergence can also be considered. λ ≥ 0 is a regularization parameter which balances the reconstruction error against finding K-means-friendly latent representations.
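As a concrete illustration, the per-sample term of (2.4) can be written down directly once an encoder f and a decoder g are available. The sketch below uses PyTorch with the least-squares reconstruction loss; the function and variable names are ours and do not refer to the released implementation.

import torch

def dcn_sample_loss(x, f, g, M, s_idx, lam):
    """Per-sample cost of (2.4): ||g(f(x)) - x||^2 + (lam/2) * ||f(x) - m_k||^2,
    where m_k is the centroid currently assigned to x (index s_idx).
    x: (D,) tensor, f/g: encoder/decoder callables, M: (K, R) centroid matrix."""
    h = f(x)                                   # latent representation f(x_i)
    recon = torch.sum((g(h) - x) ** 2)         # least-squares reconstruction error
    cluster = torch.sum((h - M[s_idx]) ** 2)   # distance to the assigned centroid
    return recon + 0.5 * lam * cluster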
2.4 Network Structure. Fig. 2 presents the network structure corresponding to the formulation in (2.4). On the left-hand side of the 'bottleneck' layer are the so-called encoding or forward layers that transform raw data to a low-dimensional space. On the right-hand side are the 'decoding' layers that try to reconstruct the data from the latent space. The K-means task is performed at the bottleneck layer. The forward network, the decoding network, and the K-means cost are optimized simultaneously. In our experiments, the structure of the decoding network is a 'mirrored version' of the encoding network, and for both the encoding and decoding networks we use rectified linear unit (ReLU) activation-based neurons [16]. Since our objective is to perform DNN-driven K-means clustering, we will refer to the network in Fig. 2 as the Deep Clustering Network (DCN) in the sequel.

[Figure 2: Structure of the proposed deep clustering network (DCN).]

From a network structure point of view, DCN can be considered an SAE with a K-means-friendly structure-promoting function at the bottleneck layer. We should remark that the proposed optimization criterion in (2.4) and the network in Fig. 2 are very flexible: other types of neuron activation, e.g., the sigmoid and binary step functions, can be used. For the clustering part, other clustering criteria, e.g., K-subspace and soft K-means [2, 12], are also viable options. Nevertheless, we will concentrate on the proposed DCN in the sequel, as our interest is to provide a proof of concept rather than to exhaust the possible combinations.
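A minimal encoder/decoder pair of the kind described above might look as follows (PyTorch; the layer sizes are placeholders, and whether the bottleneck and output layers carry activations is a design choice we make here, not a detail given in the text). The decoder mirrors the encoder, hidden layers use ReLU activations, and the bottleneck output h is where the K-means cost in (2.4) is applied.

import torch.nn as nn

class MirroredAutoencoder(nn.Module):
    """Forward (encoding) network f and mirrored decoding network g."""
    def __init__(self, dims=(784, 500, 250, 10)):   # dims[-1] is the bottleneck size R
        super().__init__()
        enc, dec = [], []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            enc += [nn.Linear(d_in, d_out), nn.ReLU()]
        for d_in, d_out in zip(dims[::-1][:-1], dims[::-1][1:]):
            dec += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.f = nn.Sequential(*enc[:-1])   # drop the last ReLU: linear bottleneck
        self.g = nn.Sequential(*dec[:-1])   # drop the last ReLU: linear reconstruction

    def forward(self, x):
        h = self.f(x)          # latent representation used for K-means
        return h, self.g(h)    # (h_i, reconstruction of x_i)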
3 Optimization Procedure
Optimizing (2.4) is highly non-trivial since both the cost function and the constraints are non-convex. In addition, there are scalability issues that need to be taken into account. In this section, we propose a pragmatic optimization procedure, including an empirically effective initialization method and an alternating optimization based algorithm, for handling (2.4).
3.1 Initialization via Layer-wise Pre-Training. For dealing with hard non-convex optimization problems like that in (2.4), initialization is usually crucial. To initialize the parameters of the network, i.e., (W, Z), we use the layer-wise pre-training method of [3] for training autoencoders. This pre-training technique can often be skipped in large-scale supervised learning tasks; for the proposed DCN, which is completely unsupervised, however, we find that layer-wise pre-training is important regardless of the size of the dataset. The procedure is as follows. We start by using a single-layer autoencoder to train the first layer of the forward network and the last layer of the decoding network. Then, we use the outputs of the trained forward layers as inputs to train the next pair of encoding and decoding layers, and so on. After pre-training, we apply K-means to the outputs of the bottleneck layer to obtain initial values of M and {s_i}.
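A sketch of this greedy procedure is given below (PyTorch; the optimizer, learning rate, and epoch count are our own illustrative choices). Each stage trains one encoding/decoding layer pair as a single-layer autoencoder on the output of the already-trained encoding layers.

import torch

def layerwise_pretrain(enc_layers, dec_layers, X, epochs=50, lr=1e-3):
    """Greedy layer-wise pre-training of paired encoder/decoder layers.
    enc_layers[i] and dec_layers[-(i+1)] form the i-th single-layer autoencoder.
    X: (N, D) data tensor."""
    inp = X
    for i, enc in enumerate(enc_layers):
        dec = dec_layers[len(dec_layers) - 1 - i]
        opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = ((dec(torch.relu(enc(inp))) - inp) ** 2).mean()
            loss.backward()
            opt.step()
        inp = torch.relu(enc(inp)).detach()   # feed trained outputs to the next pair
    return enc_layers, dec_layers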
3.2 Alternating Stochastic Optimization. Even with a good initialization, handling Problem (2.4) is still very challenging. The commonly used stochastic gradient descent (SGD) algorithm cannot be directly applied to jointly optimize W, Z, M and {s_i}, because the block variable {s_i} is constrained to a discrete set. Our idea is to combine the insights of alternating optimization and SGD. Specifically, we propose to optimize the subproblem with respect to (w.r.t.) one of M, {s_i} and (W, Z) while keeping the other two sets of variables fixed.

For fixed (M, {s_i}), the subproblem w.r.t. (W, Z) is similar to training an SAE, but with an additional penalty term on the clustering performance. We can take advantage of the mature tools for training DNNs, e.g., back-propagation based SGD and its variants. To implement SGD for updating the network parameters, we look at the problem w.r.t. the incoming data point x_i:

$$
\min_{\mathcal{W}, \mathcal{Z}} \ L_i = \ell\big(g(f(x_i)), x_i\big) + \frac{\lambda}{2} \| f(x_i) - M s_i \|_2^2 .
\tag{3.5}
$$

The gradient of the above function w.r.t. the network parameters is easily computable, i.e.,

$$
\nabla_{\mathcal{X}} L_i = \frac{\partial \ell\big(g(f(x_i)), x_i\big)}{\partial \mathcal{X}} + \lambda \frac{\partial f(x_i)}{\partial \mathcal{X}} \big( f(x_i) - M s_i \big),
$$

where X = (W, Z) is the collection of network parameters, and the gradients ∂ℓ/∂X and ∂f(x_i)/∂X can be calculated by back-propagation [22] (strictly speaking, what we calculate here is a subgradient w.r.t. X, since the ReLU function is non-differentiable at zero). Then, the network parameters are updated by the gradient step

$$
\mathcal{X} \leftarrow \mathcal{X} - \alpha \nabla_{\mathcal{X}} L_i,
\tag{3.6}
$$

where α > 0 is the learning rate.

Algorithm 1 Alternating SGD
1: Initialization;
2: for t = 1 : T do  % perform T epochs over the data
3:    Update network parameters by (3.6);
4:    Update assignment by (3.7);
5:    Update centroids by (3.8);
6: end for
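The per-sample loop of Algorithm 1 can be sketched as follows (PyTorch). Since the excerpt does not reproduce (3.7) and (3.8), the assignment and centroid updates below are one plausible instantiation of the corresponding K-means subproblems (nearest-centroid assignment and a count-damped update of the assigned centroid); the names and step sizes are our own illustrative choices, not the released code.

import torch

def dcn_epoch(model, X, M, counts, lam, lr=1e-3):
    """One epoch of alternating SGD (cf. Algorithm 1) over samples x_i.
    model: encoder/decoder pair returning (h, x_hat), as in the earlier sketch.
    X: (N, D) data, M: (K, R) centroids, counts: (K,) float tensor of assignment counts."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x in X:
        # (3.6): SGD step on the network parameters for the per-sample loss L_i in (3.5)
        h, x_hat = model(x)
        k = int(torch.argmin(((h.detach() - M) ** 2).sum(dim=1)))   # current assignment
        loss = torch.sum((x_hat - x) ** 2) + 0.5 * lam * torch.sum((h - M[k]) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # (3.7): re-assign x_i to its nearest centroid in the updated latent space
        h = model(x)[0].detach()
        k = int(torch.argmin(((h - M) ** 2).sum(dim=1)))
        # (3.8): move the assigned centroid towards f(x_i), damped by its assignment count
        counts[k] += 1
        M[k] = M[k] - (M[k] - h) / counts[k]
    return M, counts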
[Figure 3: The sizes of 20 clusters in the experiment.]
Table 1: Evaluation on the RCV1-v2 dataset (methods compared: DCN, SAE+KM, KM, XRAY).

Table 3: Evaluation on the raw MNIST dataset.
Methods  DCN   SAE+KM  LCCF  NMF+KM  KM    SC    XRAY  JNKM
NMI      0.48  0.47    0.46  0.39    0.41  0.40  0.19  0.40
ARI      0.34  0.28    0.17  0.17    0.15  0.17  0.02  0.10
ACC      0.44  0.42    0.32  0.33    0.30  0.34  0.18  0.24

Table 5: Evaluation on the Pendigits dataset.
Methods  DCN   SAE+KM  KM (SSC-OMP)
NMI      0.85  0.83    0.85
ARI      0.84  0.82    0.82
ACC      0.93  0.91    0.86
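The tables report normalized mutual information (NMI), adjusted Rand index (ARI), and clustering accuracy (ACC). A sketch of how such numbers can be computed from predicted and true labels is given below (scikit-learn and SciPy; ACC uses the standard Hungarian matching between predicted clusters and ground-truth classes, which is a common convention rather than a detail stated in the excerpt).

import numpy as np
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score
from scipy.optimize import linear_sum_assignment

def clustering_metrics(y_true, y_pred):
    """Return (NMI, ARI, ACC) for integer label arrays of equal length."""
    nmi = normalized_mutual_info_score(y_true, y_pred)
    ari = adjusted_rand_score(y_true, y_pred)
    # ACC: best one-to-one matching of predicted clusters to true classes
    K = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)          # maximize matched counts
    acc = cost[row, col].sum() / len(y_true)
    return nmi, ari, acc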