Co-Clustering Documents and Words Using Bipartite Spectral Graph Partitioning
Inderjit S. Dhillon
inderjit@cs.utexas.edu
ABSTRACT
Both document clustering and word clustering are well studied problems. Most existing algorithms cluster documents and words separately but not simultaneously. In this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. To solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. The spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the NP-complete graph bipartitioning problem. We present experimental results to verify that the resulting co-clustering algorithm works well in practice.
1. INTRODUCTION
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
KDD 2001 San Francisco, California, USA
Copyright 2001 ACM X-XXXXX-XX-X/XX/XX ...$5.00.
The adjacency matrix M of the graph has entries

    M_ij = E_ij,  if there is an edge {i, j},
           0,     otherwise,

and the cut between two vertex subsets V1 and V2 is defined as

    cut(V1, V2) = sum_{i in V1, j in V2} M_ij.        (1)

For k vertex subsets,

    cut(V1, V2, ..., Vk) = sum_{i < j} cut(Vi, Vj).

Given disjoint word clusters W1, ..., Wk, the induced document clusters are

    D_m = { d_j : sum_{i in W_m} A_ij >= sum_{i in W_l} A_ij,  for all l = 1, ..., k },

and the co-clustering problem may be posed as

    cut(W1 ∪ D1, ..., Wk ∪ Dk) = min_{V1, ..., Vk} cut(V1, ..., Vk).        (2)
We now introduce our bipartite graph model for representing a document collection. An undirected bipartite graph is a triple G = (D, W, E) where D = {d1, ..., dn} and W = {w1, ..., wm} are two sets of vertices and E is the set of edges {{di, wj} : di in D, wj in W}. In our case D is the set of documents and W is the set of words they contain. An edge {di, wj} exists if word wj occurs in document di; note that the edges are undirected. In this model, there are no edges between words or between documents.

An edge signifies an association between a document and a word. By putting positive weights on the edges, we can capture the strength of this association. One possibility is to have edge-weights equal term frequencies. In fact, most of the term-weighting formulae used in information retrieval may be used as edge-weights; see [20] for more details.
Consider the m x n word-by-document matrix A such that A_ij equals the edge-weight E_ij. It is easy to verify that the adjacency matrix of the bipartite graph may be written as

    M = [  0    A ]
        [ A^T   0 ],

where we have ordered the vertices such that the first m vertices index the words while the last n index the documents.

3. GRAPH PARTITIONING

Given a graph G = (V, E), the classical graph bipartitioning problem is to find nearly equally-sized vertex subsets V1 and V2 of V that minimize cut(V1, V2).

We now show that the cut between different vertex subsets, as defined in (1) and (2), emerges naturally from our formulation of word and document clustering.
Given disjoint document clusters D1, ..., Dk, the corresponding word clusters may be defined as

    W_m = { w_i : sum_{j in D_m} A_ij >= sum_{j in D_l} A_ij,  for all l = 1, ..., k }.

Thus each of the word clusters is determined by the document clustering. Similarly, given word clusters W1, ..., Wk, each of the document clusters is determined by the word clustering.
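As an illustration (our own, not from the paper), the induced clustering above is a one-line argmax in NumPy. The function name and the toy matrix below are hypothetical; the symmetric word-from-document direction works the same way on A^T.

```python
import numpy as np

def induced_document_clusters(A, word_labels, k):
    """Assign each document d_j to the cluster m maximizing sum_{i in W_m} A_ij."""
    A = np.asarray(A, dtype=float)
    word_labels = np.asarray(word_labels)
    # scores[m, j] = total edge weight between document j and word cluster m
    scores = np.stack([A[word_labels == m].sum(axis=0) for m in range(k)])
    return scores.argmax(axis=0)  # ties resolved toward the lower cluster index
```

For example, with 3 words in clusters [0, 0, 1] and a 3 x 2 matrix, each document is assigned to whichever word cluster contributes more of its edge weight.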
Lemma 1. The generalized partition vector q with elements

    q_i = + sqrt(nu2 / nu1),   i in V1,
    q_i = - sqrt(nu1 / nu2),   i in V2,

where nu1 = weight(V1) and nu2 = weight(V2), satisfies q^T W e = 0 and q^T W q = weight(V).

Proof. Let y = W e; then y_i = weight(i) = W_ii. Thus

    q^T W e = sqrt(nu2/nu1) sum_{i in V1} weight(i) - sqrt(nu1/nu2) sum_{i in V2} weight(i)
            = sqrt(nu1 nu2) - sqrt(nu1 nu2) = 0.

Similarly, q^T W q = sum_{i=1}^{n} W_ii q_i^2 = (nu2/nu1) nu1 + (nu1/nu2) nu2 = nu1 + nu2 = weight(V). []
Theorem 3. The generalized partition vector q of Lemma 1 satisfies

    q^T L q / q^T W q = cut(V1, V2)/weight(V1) + cut(V1, V2)/weight(V2).

Proof. It is easy to show that the generalized partition vector q may be written as

    q = ((nu1 + nu2) / (2 sqrt(nu1 nu2))) p + ((nu2 - nu1) / (2 sqrt(nu1 nu2))) e,

where p is the partition vector of (6). Using part 7 of Theorem 1, we see that

    q^T L q = ((nu1 + nu2)^2 / (4 nu1 nu2)) p^T L p.

Substituting the values of p^T L p and q^T W q, from Theorem 2 and Lemma 1 respectively, proves the result. []
Thus to find the global minimum of (7), we can restrict our attention to generalized partition vectors of the form in Lemma 1. Even though this problem is still NP-complete, the following theorem shows that it is possible to find a real relaxation to the optimal generalized partition vector.

Theorem 4. The problem

    min_{q != 0, q^T W e = 0}  q^T L q / q^T W q

is solved when q is the eigenvector corresponding to the 2nd smallest eigenvalue lambda_2 of the generalized eigenvalue problem L z = lambda W z.

Proof. []
In particular, the choice W = I, i.e., weight(i) = 1 for every vertex, gives weight(V_l) = |V_l| and leads to the ratio-cut objective

    cut(V1, V2)/|V1| + cut(V1, V2)/|V2|,

while the choice W = D, i.e., weight(i) = sum_k E_ik, leads to the normalized-cut objective

    N(V1, V2) = cut(V1, V2) / (sum_{i in V1} sum_k E_ik) + cut(V1, V2) / (sum_{i in V2} sum_k E_ik)
              = 2 - S(V1, V2),

where

    S(V1, V2) = within(V1)/weight(V1) + within(V2)/weight(V2),

and within(V_l) = sum_{i in V_l, j in V_l} M_ij is the total weight of edges internal to V_l.
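The cut and normalized-cut quantities above are straightforward to compute directly. The following NumPy sketch (our own, not from the paper; the adjacency matrix is a small hypothetical example) also checks the identity N(V1, V2) = 2 - S(V1, V2).

```python
import numpy as np

def cut_value(M, mask):
    """cut(V1, V2) = sum of M_ij over i in V1, j in V2 (mask marks V1)."""
    return M[np.ix_(mask, ~mask)].sum()

def normalized_cut(M, mask):
    """cut/weight(V1) + cut/weight(V2), with weight(i) = sum_k M_ik."""
    w = M.sum(axis=1)
    c = cut_value(M, mask)
    return c / w[mask].sum() + c / w[~mask].sum()

# Example: a 4-vertex weighted graph with V1 = {0, 1}, V2 = {2, 3}.
M = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]], dtype=float)
mask = np.array([True, True, False, False])
N = normalized_cut(M, mask)
# S = within(V1)/weight(V1) + within(V2)/weight(V2); N equals 2 - S.
w = M.sum(axis=1)
S = (M[np.ix_(mask, mask)].sum() / w[mask].sum()
     + M[np.ix_(~mask, ~mask)].sum() / w[~mask].sum())
```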
4.
For the bipartite graph, the Laplacian L and the weight matrix D have the form

    L = [  D1   -A ]          D = [ D1   0 ]
        [ -A^T   D2 ],            [  0  D2 ],

where D1 and D2 are diagonal matrices such that D1(i, i) = sum_j A_ij and D2(j, j) = sum_i A_ij. Thus L z = lambda D z may be written as

    [  D1   -A ] [ x ]  =  lambda [ D1   0 ] [ x ]        (9)
    [ -A^T   D2 ] [ y ]           [  0  D2 ] [ y ]

Assuming that both D1 and D2 are nonsingular, this may be rewritten as

    D1^{1/2} x - D1^{-1/2} A y   = lambda D1^{1/2} x,
    D2^{1/2} y - D2^{-1/2} A^T x = lambda D2^{1/2} y.

Letting u = D1^{1/2} x, v = D2^{1/2} y and A_n = D1^{-1/2} A D2^{-1/2}, and after a little algebra, we get

    A_n v   = (1 - lambda) u,
    A_n^T u = (1 - lambda) v.

In particular, for the second eigenvalue lambda_2,

    A_n v2 = sigma_2 u2,    A_n^T u2 = sigma_2 v2,    where sigma_2 = 1 - lambda_2,        (10)

which are precisely the equations defining the second left and right singular vectors of A_n.
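The correspondence between singular values of A_n and generalized eigenvalues of L z = lambda D z can be checked numerically. The sketch below is our own illustration (the random matrix is hypothetical): it compares the spectrum of the symmetric matrix D^{-1/2} L D^{-1/2}, which shares eigenvalues with the generalized problem via z = D^{-1/2} w, against the values 1 -+ sigma_i.

```python
import numpy as np

# Hypothetical word-document matrix (4 words x 3 documents),
# strictly positive so that D1 and D2 are nonsingular.
rng = np.random.default_rng(1)
A = rng.random((4, 3)) + 0.1
m, n = A.shape

d1, d2 = A.sum(axis=1), A.sum(axis=0)
An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]   # A_n = D1^{-1/2} A D2^{-1/2}
s = np.linalg.svd(An, compute_uv=False)                # singular values of A_n

# Normalized Laplacian D^{-1/2} L D^{-1/2} of the full bipartite graph.
M = np.block([[np.zeros((m, m)), A], [A.T, np.zeros((n, n))]])
d = M.sum(axis=1)
Ln = np.eye(m + n) - M / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
lam = np.linalg.eigvalsh(Ln)

# Each sigma_i yields the eigenvalue pair 1 - sigma_i and 1 + sigma_i;
# the |m - n| unmatched dimensions contribute eigenvalues equal to 1.
expected = np.sort(np.concatenate([1 - s, 1 + s, np.ones(abs(m - n))]))
```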
The singular vectors u2 and v2 of A_n give a real approximation to the discrete optimization problem of minimizing the normalized cut. Given u2 and v2, the key task is to extract the optimal partition from these vectors.

The optimal generalized partition vector of Lemma 1 is two-valued. Thus our strategy is to look for a bi-modal distribution in the values of u2 and v2. Let m1 and m2 denote the bi-modal values that we are looking for. From the previous section, the second eigenvector of L is given by

    z2 = [ D1^{-1/2} u2 ]
         [ D2^{-1/2} v2 ].        (11)

Each entry of z2 is then assigned to the nearer of m1 and m2 so as to minimize the sum of squares

    sum_{j=1}^{2} sum_{z2(i) in m_j} (z2(i) - m_j)^2.
The above is exactly the objective function that the classical k-means algorithm tries to minimize [9]. Thus we use the following algorithm to co-cluster words and documents:
Algorithm Bipartition
1. Given A, form A_n = D1^{-1/2} A D2^{-1/2}.
2. Compute the second singular vectors of A_n, u2 and v2, and form the vector z2 as in (11).
3. Run the k-means algorithm on the 1-dimensional data z2 to obtain the desired bipartitioning.
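A minimal NumPy sketch of Algorithm Bipartition follows (our own illustration, not the authors' code; the function name, the Lloyd-style 1-d k-means and its min/max initialization are implementation choices, and empty rows or columns of A are assumed away).

```python
import numpy as np

def spectral_cocluster_bipartition(A):
    """Bipartition words and documents given an m x n word-document matrix A."""
    A = np.asarray(A, dtype=float)
    d1 = A.sum(axis=1)   # word degrees,     D1(i, i) = sum_j A_ij
    d2 = A.sum(axis=0)   # document degrees, D2(j, j) = sum_i A_ij
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]  # D1^{-1/2} A D2^{-1/2}
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    u2, v2 = U[:, 1], Vt[1, :]                            # second singular vectors
    z2 = np.concatenate([u2 / np.sqrt(d1), v2 / np.sqrt(d2)])  # eq. (11)
    # 1-dimensional k-means with k = 2 (Lloyd iterations, min/max init).
    m = np.array([z2.min(), z2.max()])
    for _ in range(100):
        labels = np.abs(z2[:, None] - m[None, :]).argmin(axis=1)
        for j in (0, 1):
            if np.any(labels == j):
                m[j] = z2[labels == j].mean()
    return labels[:A.shape[0]], labels[A.shape[0]:]  # word labels, document labels
```

On a matrix with two strongly connected word-document blocks, the word labels and document labels recover the blocks, with each word cluster co-assigned to its document cluster.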
The surprising aspect of the above algorithm is that we run k-means simultaneously on the reduced representations of both words and documents to get the co-clustering.
For a k-way multipartitioning, we form the l-dimensional reduced representation

    Z = [ D1^{-1/2} U ]
        [ D2^{-1/2} V ],        (12)

where U = [u2, ..., u_{l+1}] and V = [v2, ..., v_{l+1}]. From this reduced-dimensional data set, we look for the best k-modal fit to the l-dimensional points m1, ..., mk by assigning each l-dimensional row Z(i) to mj such that the sum-of-squares

    sum_{j=1}^{k} sum_{Z(i) in m_j} || Z(i) - m_j ||^2

is minimized. This can again be done by the classical k-means algorithm. Thus we obtain the following algorithm.
Algorithm Multipartition(k)
1. Given A, form A_n = D1^{-1/2} A D2^{-1/2}.
2. Compute l = ceil(log2 k) singular vectors of A_n, u2, ..., u_{l+1} and v2, ..., v_{l+1}, and form the matrix Z as in (12).
3. Run the k-means algorithm on the l-dimensional data Z to obtain the desired k-way multipartitioning.
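Algorithm Multipartition(k) admits a similar sketch (again our own illustration; the deterministic farthest-point k-means initialization is an implementation choice, not from the paper).

```python
import numpy as np

def spectral_cocluster_multipartition(A, k, n_iter=100):
    """k-way co-clustering of an m x n word-document matrix A."""
    A = np.asarray(A, dtype=float)
    d1, d2 = A.sum(axis=1), A.sum(axis=0)
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    l = int(np.ceil(np.log2(k)))                 # l = ceil(log2 k) singular vectors
    Z = np.vstack([U[:, 1:l + 1] / np.sqrt(d1)[:, None],
                   Vt[1:l + 1, :].T / np.sqrt(d2)[:, None]])   # eq. (12)
    # k-means with deterministic farthest-point initialization.
    centers = [0]
    for _ in range(k - 1):
        dist = ((Z[:, None, :] - Z[centers][None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(int(dist.argmax()))
    m = Z[centers].copy()
    for _ in range(n_iter):
        labels = ((Z[:, None, :] - m[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                m[j] = Z[labels == j].mean(axis=0)
    return labels[:A.shape[0]], labels[A.shape[0]:]
```

On a matrix with three word-document blocks arranged in a ring, the three blocks are recovered, and corresponding word and document clusters coincide.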
Table 1: Details of our test collections.

Name               # Docs   # Words   # Nonzeros(A)
MedCran              2433      5042         117987
MedCran_All          2433     17162         224325
MedCisi              2493      5447         109119
MedCisi_All          2493     19194         213453
Classic3             3893      4303         176347
Classic3_30docs        30      1073           1585
Classic3_150docs      150      3652           7960
Yahoo_K5             2340      1458         237969
Yahoo_K1             2340     21839         349792
5. EXPERIMENTAL RESULTS
For some of our experiments, we used the popular Medline (1033 medical abstracts), Cranfield (1400 aeronautical systems abstracts) and Cisi (1460 information retrieval abstracts) collections. These document sets can be downloaded from ftp://ftp.cs.cornell.edu/pub/smart. For testing Algorithm Bipartition, we created mixtures consisting of 2 of these 3 collections. For example, MedCran contains documents from the Medline and Cranfield collections. Typically, we removed stop words, and words occurring in < 0.2% and > 15% of the documents. However, our algorithm has an in-built scaling scheme and is robust in the presence of a large number of noise words, so we also formed word-document matrices by including all words, even stop words.
For testing Algorithm Multipartition, we created the Classic3 data set by mixing together Medline, Cranfield and Cisi, which gives a total of 3893 documents. To show that our algorithm works well on small data sets, we also created subsets of Classic3 with 30 and 150 documents respectively.

Our final data set is a collection of 2340 Reuters news articles downloaded from Yahoo in October 1997 [2]. The articles are from 6 categories: 142 from Business, 1384 from Entertainment, 494 from Health, 114 from Politics, 141 from Sports and 60 news articles from Technology. In the preprocessing, HTML tags were removed and words were stemmed using Porter's algorithm. We used 2 matrices from this collection: Yahoo_K5 contains 1458 words while Yahoo_K1 includes all 21839 words obtained after removing stop words. Details on all our test collections are given in Table 1.
Since we know the true class label for each document, the confusion matrix captures the goodness of document clustering. In addition, the measures of purity and entropy are easily derived from the confusion matrix [6].
Table 2 summarizes the results of applying Algorithm Bipartition to the MedCran data set. The confusion matrix at the top of the table shows that the document cluster D0 consists entirely of the Medline collection, while 1400 of the 1407 documents in D1 are from Cranfield. The bottom of Table 2 displays the "top" 7 words in each of the word clusters W0 and W1. The top words are those whose internal edge weights are the greatest. By the co-clustering, the word cluster Wi is associated with document cluster Di. It should be observed that the top 7 words clearly convey the "concept" of the associated document cluster.
Similarly, Table 3 shows that good bipartitions are also obtained on the MedCisi data set. Algorithm Bipartition uses the global spectral heuristic of using singular vectors.

Table 2: Bipartitioning results for MedCran.

        Medline   Cranfield
D0:       1026          0
D1:          7       1400

W0: patients cells blood children hormone cancer renal
W1: shock heat supersonic wing transfer buckling laminar
Table 3: Bipartitioning results for MedCisi.

        Medline   Cisi
D0:        970       0
D1:         63    1460

W0: cells patients blood hormone renal rats cancer
W1: libraries retrieval scientific research science system book
6. CONCLUSIONS
In this paper, we have introduced the novel idea of modeling a document collection as a bipartite graph, using which we proposed a spectral algorithm for co-clustering words and documents. This algorithm has some nice theoretical properties, as it provides the optimal solution to a real relaxation of the NP-complete co-clustering objective. In addition, our experimental results verify that the algorithm works well in practice.
MedCran_All:

        Medline   Cranfield
D0:       1014          0
D1:         19       1400

MedCisi_All:

        Medline   Cisi
D0:        925       0
D1:        108    1460

Classic3:

        Med    Cisi    Cran
D0:     965       0       0
D1:      65    1458      10
D2:       3       2    1390

W0: patients cells blood hormone renal cancer rats
W1: library libraries retrieval scientific science book system
W2: boundary layer heat shock mach supersonic wing

Yahoo news collection:

        Bus     Ent     Hea    Pol    Spo    Tech
D0:     120     113       0      1      0     59
D1:       0    1175       0      0    136      0
D2:      19      95       4     73      5      1
D3:       1       6     217      0      0      0
D4:       0       0     273      0      0      0
D5:       2       0       0     40      0      0

W0: compani stock financi pr busi wire quote
W1: film tv emmi comedi hollywood previou entertain
W2: presid washington bill court militari octob violat
W3: health help pm death famili rate lead
W4: surgeri injuri undergo hospit england recommend discov
W5: senat clinton campaign house white financ republican

Classic3_30docs:

        Med    Cisi    Cran
D0:       9       0       0
D1:       0      10       0
D2:       1       0      10

Classic3_150docs:

        Med    Cisi    Cran
D0:      49       0       0
D1:       0      50       0
D2:       1       0      50

7. REFERENCES