Co-Clustering Documents and Words Using Bipartite Spectral Graph Partitioning
Inderjit S. Dhillon
inderjit@cs.utexas.edu
ABSTRACT
Both document clustering and word clustering are well studied problems. Most existing algorithms cluster documents and words separately but not simultaneously. In this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. To solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. The spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the NP-complete graph bipartitioning problem. We present experimental results to verify that the resulting co-clustering algorithm works well in practice.
1. INTRODUCTION
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
KDD 2001 San Francisco, California, USA
Copyright 2001 ACM X-XXXXX-XX-X/XX/XX ...$5.00.
The adjacency matrix M of the graph has entries

    M_ij = E_ij,  if there is an edge {i, j},
           0,     otherwise,

and the cut between two vertex subsets V1 and V2 is defined as

    cut(V1, V2) = sum_{i in V1, j in V2} M_ij.        (1)

For k vertex subsets,

    cut(V1, V2, ..., Vk) = sum_{i < j} cut(Vi, Vj).

Given disjoint word clusters W1, ..., Wk, the induced document clusters are

    D_m = { d_j : sum_{i in W_m} A_ij >= sum_{i in W_l} A_ij,  for all l = 1, ..., k },

and the co-clustering problem may be posed as

    cut(W1 ∪ D1, ..., Wk ∪ Dk) = min_{V1, ..., Vk} cut(V1, ..., Vk).        (2)
We now introduce our bipartite graph model for representing a document collection. An undirected bipartite graph is a triple G = (D, W, E) where D = {d1, ..., dn} and W = {w1, ..., wm} are two sets of vertices and E is the set of edges {{di, wj} : di in D, wj in W}. In our case D is the set of documents and W is the set of words they contain. An edge {di, wj} exists if word wj occurs in document di; note that the edges are undirected. In this model, there are no edges between words or between documents.

An edge signifies an association between a document and a word. By putting positive weights on the edges, we can capture the strength of this association. One possibility is to have edge-weights equal term frequencies. In fact, most of the term-weighting formulae used in information retrieval may be used as edge-weights; see [20] for more details.
Consider the m x n word-by-document matrix A such that A_ij equals the edge-weight E_ij. It is easy to verify that the adjacency matrix of the bipartite graph may be written as

    M = [  0    A ]
        [ A^T   0 ],

where we have ordered the vertices such that the first m vertices index the words while the last n index the documents.

3. GRAPH PARTITIONING

Given a graph G = (V, E), the classical graph bipartitioning problem is to find nearly equally-sized vertex subsets V1 and V2 of V that minimize cut(V1, V2).

We now show that the cut between different vertex subsets, as defined in (1) and (2), emerges naturally from our formulation of word and document clustering.
Given disjoint document clusters D1, ..., Dk, the corresponding word clusters may be defined as

    W_m = { w_i : sum_{j in D_m} A_ij >= sum_{j in D_l} A_ij,  for all l = 1, ..., k }.

Thus each of the word clusters is determined by the document clustering. Similarly, given word clusters W1, ..., Wk, each of the document clusters is determined by the word clustering.
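As an illustration (our own, not from the paper), the induced clustering above is a one-line argmax in NumPy. The function name and the toy matrix below are hypothetical; the symmetric word-from-document direction works the same way on A^T.

```python
import numpy as np

def induced_document_clusters(A, word_labels, k):
    """Assign each document d_j to the cluster m maximizing sum_{i in W_m} A_ij."""
    A = np.asarray(A, dtype=float)
    word_labels = np.asarray(word_labels)
    # scores[m, j] = total edge weight between document j and word cluster m
    scores = np.stack([A[word_labels == m].sum(axis=0) for m in range(k)])
    return scores.argmax(axis=0)  # ties resolved toward the lower cluster index
```

For example, with 3 words in clusters [0, 0, 1] and a 3 x 2 matrix, each document is assigned to whichever word cluster contributes more of its edge weight.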
Lemma 1. The generalized partition vector q with elements

    q_i = + sqrt(nu2 / nu1),   i in V1,
    q_i = - sqrt(nu1 / nu2),   i in V2,

where nu1 = weight(V1) and nu2 = weight(V2), satisfies q^T W e = 0 and q^T W q = weight(V).

Proof. Let y = W e; then y_i = weight(i) = W_ii. Thus

    q^T W e = sqrt(nu2/nu1) sum_{i in V1} weight(i) - sqrt(nu1/nu2) sum_{i in V2} weight(i)
            = sqrt(nu1 nu2) - sqrt(nu1 nu2) = 0.

Similarly, q^T W q = sum_{i=1}^{n} W_ii q_i^2 = (nu2/nu1) nu1 + (nu1/nu2) nu2 = nu1 + nu2 = weight(V). []
Theorem 3. The generalized partition vector q of Lemma 1 satisfies

    q^T L q / q^T W q = cut(V1, V2)/weight(V1) + cut(V1, V2)/weight(V2).

Proof. It is easy to show that the generalized partition vector q may be written as

    q = ((nu1 + nu2) / (2 sqrt(nu1 nu2))) p + ((nu2 - nu1) / (2 sqrt(nu1 nu2))) e,

where p is the partition vector of (6). Using part 7 of Theorem 1, we see that

    q^T L q = ((nu1 + nu2)^2 / (4 nu1 nu2)) p^T L p.

Substituting the values of p^T L p and q^T W q, from Theorem 2 and Lemma 1 respectively, proves the result. []
Thus to find the global minimum of (7), we can restrict our attention to generalized partition vectors of the form in Lemma 1. Even though this problem is still NP-complete, the following theorem shows that it is possible to find a real relaxation to the optimal generalized partition vector.

Theorem 4. The problem

    min_{q != 0, q^T W e = 0}  q^T L q / q^T W q

is solved when q is the eigenvector corresponding to the 2nd smallest eigenvalue lambda_2 of the generalized eigenvalue problem L z = lambda W z.

Proof. []
In particular, the choice W = I, i.e., weight(i) = 1 for every vertex, gives weight(V_l) = |V_l| and leads to the ratio-cut objective

    cut(V1, V2)/|V1| + cut(V1, V2)/|V2|,

while the choice W = D, i.e., weight(i) = sum_k E_ik, leads to the normalized-cut objective

    N(V1, V2) = cut(V1, V2) / (sum_{i in V1} sum_k E_ik) + cut(V1, V2) / (sum_{i in V2} sum_k E_ik)
              = 2 - S(V1, V2),

where

    S(V1, V2) = within(V1)/weight(V1) + within(V2)/weight(V2),

and within(V_l) = sum_{i in V_l, j in V_l} M_ij is the total weight of edges internal to V_l.
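The cut and normalized-cut quantities above are straightforward to compute directly. The following NumPy sketch (our own, not from the paper; the adjacency matrix is a small hypothetical example) also checks the identity N(V1, V2) = 2 - S(V1, V2).

```python
import numpy as np

def cut_value(M, mask):
    """cut(V1, V2) = sum of M_ij over i in V1, j in V2 (mask marks V1)."""
    return M[np.ix_(mask, ~mask)].sum()

def normalized_cut(M, mask):
    """cut/weight(V1) + cut/weight(V2), with weight(i) = sum_k M_ik."""
    w = M.sum(axis=1)
    c = cut_value(M, mask)
    return c / w[mask].sum() + c / w[~mask].sum()

# Example: a 4-vertex weighted graph with V1 = {0, 1}, V2 = {2, 3}.
M = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]], dtype=float)
mask = np.array([True, True, False, False])
N = normalized_cut(M, mask)
# S = within(V1)/weight(V1) + within(V2)/weight(V2); N equals 2 - S.
w = M.sum(axis=1)
S = (M[np.ix_(mask, mask)].sum() / w[mask].sum()
     + M[np.ix_(~mask, ~mask)].sum() / w[~mask].sum())
```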
4.
For the bipartite graph, the Laplacian L and the weight matrix D have the form

    L = [  D1   -A ]          D = [ D1   0 ]
        [ -A^T   D2 ],            [  0  D2 ],

where D1 and D2 are diagonal matrices such that D1(i, i) = sum_j A_ij and D2(j, j) = sum_i A_ij. Thus L z = lambda D z may be written as

    [  D1   -A ] [ x ]  =  lambda [ D1   0 ] [ x ]        (9)
    [ -A^T   D2 ] [ y ]           [  0  D2 ] [ y ]

Assuming that both D1 and D2 are nonsingular, this may be rewritten as

    D1^{1/2} x - D1^{-1/2} A y   = lambda D1^{1/2} x,
    D2^{1/2} y - D2^{-1/2} A^T x = lambda D2^{1/2} y.

Letting u = D1^{1/2} x, v = D2^{1/2} y and A_n = D1^{-1/2} A D2^{-1/2}, and after a little algebra, we get

    A_n v   = (1 - lambda) u,
    A_n^T u = (1 - lambda) v.

In particular, for the second eigenvalue lambda_2,

    A_n v2 = sigma_2 u2,    A_n^T u2 = sigma_2 v2,    where sigma_2 = 1 - lambda_2,        (10)

which are precisely the equations defining the second left and right singular vectors of A_n.
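The correspondence between singular values of A_n and generalized eigenvalues of L z = lambda D z can be checked numerically. The sketch below is our own illustration (the random matrix is hypothetical): it compares the spectrum of the symmetric matrix D^{-1/2} L D^{-1/2}, which shares eigenvalues with the generalized problem via z = D^{-1/2} w, against the values 1 -+ sigma_i.

```python
import numpy as np

# Hypothetical word-document matrix (4 words x 3 documents),
# strictly positive so that D1 and D2 are nonsingular.
rng = np.random.default_rng(1)
A = rng.random((4, 3)) + 0.1
m, n = A.shape

d1, d2 = A.sum(axis=1), A.sum(axis=0)
An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]   # A_n = D1^{-1/2} A D2^{-1/2}
s = np.linalg.svd(An, compute_uv=False)                # singular values of A_n

# Normalized Laplacian D^{-1/2} L D^{-1/2} of the full bipartite graph.
M = np.block([[np.zeros((m, m)), A], [A.T, np.zeros((n, n))]])
d = M.sum(axis=1)
Ln = np.eye(m + n) - M / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
lam = np.linalg.eigvalsh(Ln)

# Each sigma_i yields the eigenvalue pair 1 - sigma_i and 1 + sigma_i;
# the |m - n| unmatched dimensions contribute eigenvalues equal to 1.
expected = np.sort(np.concatenate([1 - s, 1 + s, np.ones(abs(m - n))]))
```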
The singular vectors u2 and v2 of A_n give a real approximation to the discrete optimization problem of minimizing the normalized cut. Given u2 and v2, the key task is to extract the optimal partition from these vectors.

The optimal generalized partition vector of Lemma 1 is two-valued. Thus our strategy is to look for a bi-modal distribution in the values of u2 and v2. Let m1 and m2 denote the bi-modal values that we are looking for. From the previous section, the second eigenvector of L is given by

    z2 = [ D1^{-1/2} u2 ]
         [ D2^{-1/2} v2 ].        (11)

Each entry of z2 is then assigned to the nearer of m1 and m2 so as to minimize the sum of squares

    sum_{j=1}^{2} sum_{z2(i) in m_j} (z2(i) - m_j)^2.
The above is exactly the objective function that the classical k-means algorithm tries to minimize [9]. Thus we use the following algorithm to co-cluster words and documents:
Algorithm Bipartition
1. Given A, form A_n = D1^{-1/2} A D2^{-1/2}.
2. Compute the second singular vectors of A_n, u2 and v2, and form the vector z2 as in (11).
3. Run the k-means algorithm on the 1-dimensional data z2 to obtain the desired bipartitioning.
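A minimal NumPy sketch of Algorithm Bipartition follows (our own illustration, not the authors' code; the function name, the Lloyd-style 1-d k-means and its min/max initialization are implementation choices, and empty rows or columns of A are assumed away).

```python
import numpy as np

def spectral_cocluster_bipartition(A):
    """Bipartition words and documents given an m x n word-document matrix A."""
    A = np.asarray(A, dtype=float)
    d1 = A.sum(axis=1)   # word degrees,     D1(i, i) = sum_j A_ij
    d2 = A.sum(axis=0)   # document degrees, D2(j, j) = sum_i A_ij
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]  # D1^{-1/2} A D2^{-1/2}
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    u2, v2 = U[:, 1], Vt[1, :]                            # second singular vectors
    z2 = np.concatenate([u2 / np.sqrt(d1), v2 / np.sqrt(d2)])  # eq. (11)
    # 1-dimensional k-means with k = 2 (Lloyd iterations, min/max init).
    m = np.array([z2.min(), z2.max()])
    for _ in range(100):
        labels = np.abs(z2[:, None] - m[None, :]).argmin(axis=1)
        for j in (0, 1):
            if np.any(labels == j):
                m[j] = z2[labels == j].mean()
    return labels[:A.shape[0]], labels[A.shape[0]:]  # word labels, document labels
```

On a matrix with two strongly connected word-document blocks, the word labels and document labels recover the blocks, with each word cluster co-assigned to its document cluster.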
The surprising aspect of the above algorithm is that we run k-means simultaneously on the reduced representations of both words and documents to get the co-clustering.
For a k-way multipartitioning, we form the l-dimensional reduced representation

    Z = [ D1^{-1/2} U ]
        [ D2^{-1/2} V ],        (12)

where U = [u2, ..., u_{l+1}] and V = [v2, ..., v_{l+1}]. From this reduced-dimensional data set, we look for the best k-modal fit to the l-dimensional points m1, ..., mk by assigning each l-dimensional row Z(i) to mj such that the sum-of-squares

    sum_{j=1}^{k} sum_{Z(i) in m_j} || Z(i) - m_j ||^2

is minimized. This can again be done by the classical k-means algorithm. Thus we obtain the following algorithm.
Algorithm Multipartition(k)
1. Given A, form A_n = D1^{-1/2} A D2^{-1/2}.
2. Compute l = ceil(log2 k) singular vectors of A_n, u2, ..., u_{l+1} and v2, ..., v_{l+1}, and form the matrix Z as in (12).
3. Run the k-means algorithm on the l-dimensional data Z to obtain the desired k-way multipartitioning.
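Algorithm Multipartition(k) admits a similar sketch (again our own illustration; the deterministic farthest-point k-means initialization is an implementation choice, not from the paper).

```python
import numpy as np

def spectral_cocluster_multipartition(A, k, n_iter=100):
    """k-way co-clustering of an m x n word-document matrix A."""
    A = np.asarray(A, dtype=float)
    d1, d2 = A.sum(axis=1), A.sum(axis=0)
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    l = int(np.ceil(np.log2(k)))                 # l = ceil(log2 k) singular vectors
    Z = np.vstack([U[:, 1:l + 1] / np.sqrt(d1)[:, None],
                   Vt[1:l + 1, :].T / np.sqrt(d2)[:, None]])   # eq. (12)
    # k-means with deterministic farthest-point initialization.
    centers = [0]
    for _ in range(k - 1):
        dist = ((Z[:, None, :] - Z[centers][None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(int(dist.argmax()))
    m = Z[centers].copy()
    for _ in range(n_iter):
        labels = ((Z[:, None, :] - m[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                m[j] = Z[labels == j].mean(axis=0)
    return labels[:A.shape[0]], labels[A.shape[0]:]
```

On a matrix with three word-document blocks arranged in a ring, the three blocks are recovered, and corresponding word and document clusters coincide.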
Table 1: Details of our test collections.

Name               # Docs   # Words   # Nonzeros(A)
MedCran              2433      5042         117987
MedCran_All          2433     17162         224325
MedCisi              2493      5447         109119
MedCisi_All          2493     19194         213453
Classic3             3893      4303         176347
Classic3_30docs        30      1073           1585
Classic3_150docs      150      3652           7960
Yahoo_K5             2340      1458         237969
Yahoo_K1             2340     21839         349792
5. EXPERIMENTAL RESULTS
For some of our experiments, we used the popular Medline (1033 medical abstracts), Cranfield (1400 aeronautical systems abstracts) and Cisi (1460 information retrieval abstracts) collections. These document sets can be downloaded from ftp://ftp.cs.cornell.edu/pub/smart. For testing Algorithm Bipartition, we created mixtures consisting of 2 of these 3 collections. For example, MedCran contains documents from the Medline and Cranfield collections. Typically, we removed stop words, and words occurring in < 0.2% and > 15% of the documents. However, our algorithm has an in-built scaling scheme and is robust in the presence of a large number of noise words, so we also formed word-document matrices by including all words, even stop words.
For testing Algorithm Multipartition, we created the Classic3 data set by mixing together Medline, Cranfield and Cisi, which gives a total of 3893 documents. To show that our algorithm works well on small data sets, we also created subsets of Classic3 with 30 and 150 documents respectively.

Our final data set is a collection of 2340 Reuters news articles downloaded from Yahoo in October 1997 [2]. The articles are from 6 categories: 142 from Business, 1384 from Entertainment, 494 from Health, 114 from Politics, 141 from Sports and 60 news articles from Technology. In the preprocessing, HTML tags were removed and words were stemmed using Porter's algorithm. We used 2 matrices from this collection: Yahoo_K5 contains 1458 words while Yahoo_K1 includes all 21839 words obtained after removing stop words. Details on all our test collections are given in Table 1.
Since we know the true class label for each document, the confusion matrix captures the goodness of document clustering. In addition, the measures of purity and entropy are easily derived from the confusion matrix [6].
Table 2 summarizes the results of applying Algorithm Bipartition to the MedCran data set. The confusion matrix at the top of the table shows that the document cluster D0 consists entirely of the Medline collection, while 1400 of the 1407 documents in D1 are from Cranfield. The bottom of Table 2 displays the "top" 7 words in each of the word clusters W0 and W1. The top words are those whose internal edge weights are the greatest. By the co-clustering, the word cluster Wi is associated with document cluster Di. It should be observed that the top 7 words clearly convey the "concept" of the associated document cluster.
Similarly, Table 3 shows that good bipartitions are also obtained on the MedCisi data set. Algorithm Bipartition uses the global spectral heuristic of using singular vectors.

Table 2: Bipartitioning results for MedCran.

        Medline   Cranfield
D0:       1026          0
D1:          7       1400

W0: patients cells blood children hormone cancer renal
W1: shock heat supersonic wing transfer buckling laminar
Table 3: Bipartitioning results for MedCisi.

        Medline   Cisi
D0:        970       0
D1:         63    1460

W0: cells patients blood hormone renal rats cancer
W1: libraries retrieval scientific research science system book
6. CONCLUSIONS
In this paper, we have introduced the novel idea of modeling a document collection as a bipartite graph, using which we proposed a spectral algorithm for co-clustering words and documents. This algorithm has some nice theoretical properties, as it provides the optimal solution to a real relaxation of the NP-complete co-clustering objective. In addition, our experimental results verify that the algorithm works well in practice.
MedCran_All:

        Medline   Cranfield
D0:       1014          0
D1:         19       1400

MedCisi_All:

        Medline   Cisi
D0:        925       0
D1:        108    1460

Classic3:

        Med    Cisi    Cran
D0:     965       0       0
D1:      65    1458      10
D2:       3       2    1390

W0: patients cells blood hormone renal cancer rats
W1: library libraries retrieval scientific science book system
W2: boundary layer heat shock mach supersonic wing

Yahoo news collection:

        Bus     Ent     Hea    Pol    Spo    Tech
D0:     120     113       0      1      0     59
D1:       0    1175       0      0    136      0
D2:      19      95       4     73      5      1
D3:       1       6     217      0      0      0
D4:       0       0     273      0      0      0
D5:       2       0       0     40      0      0

W0: compani stock financi pr busi wire quote
W1: film tv emmi comedi hollywood previou entertain
W2: presid washington bill court militari octob violat
W3: health help pm death famili rate lead
W4: surgeri injuri undergo hospit england recommend discov
W5: senat clinton campaign house white financ republican

Classic3_30docs:

        Med    Cisi    Cran
D0:       9       0       0
D1:       0      10       0
D2:       1       0      10

Classic3_150docs:

        Med    Cisi    Cran
D0:      49       0       0
D1:       0      50       0
D2:       1       0      50

7. REFERENCES