
Data Clustering in Life Sciences


Ying Zhao and George Karypis*

*Author to whom all correspondence and reprint requests should be addressed. Department of Computer Science and Engineering, University of Minnesota, and Digital Technology Center and Army HPC Research Center, Minneapolis, MN 55455. E-mail: karypis@cs.umn.edu.

Molecular Biotechnology © 2005 Humana Press Inc. All rights of any nature whatsoever reserved. 1073-6085/2005/31:X/000-000/$30.00

Abstract
Clustering has a wide range of applications in life sciences and over the years has been used in many areas, including the analysis of clinical information, phylogeny, genomics, and proteomics. The primary goal of this article is to provide an overview of the various issues involved in clustering large biological datasets, describe the merits and underlying assumptions of some of the commonly used clustering approaches, and provide insights on how to cluster datasets arising in various areas within life sciences. We also provide a brief introduction to CLUTO, a general purpose toolkit for clustering various datasets, with an emphasis on its applications to problems and analysis requirements within life sciences.

Index Entries: Clustering algorithms; similarity between objects; microarray data; CLUTO.

1. Introduction

Clustering is the task of organizing a set of objects into meaningful groups. These groups can be disjoint, overlapping, or organized in some hierarchical fashion. The key element of clustering is the notion that the discovered groups are meaningful. This definition is intentionally vague, as what constitutes meaningful is, to a large extent, application dependent. In some applications this may translate to groups in which the pairwise similarity between their objects is maximized, and the pairwise similarity between objects of different groups is minimized. In some other applications this may translate to groups that contain objects that share some key characteristics, although their overall similarity is not the highest. Clustering is an exploratory tool for analyzing large datasets, and has been used extensively in numerous application areas.

Clustering has a wide range of applications in life sciences and over the years has been used in many areas, including the analysis of clinical information, phylogeny, genomics, and proteomics. For example, clustering algorithms applied to gene expression data can be used to identify co-regulated genes and provide a genetic fingerprint for various diseases. Clustering algorithms applied on the entire database of known proteins can be used to automatically organize the different proteins into close- and distant-related families, and identify subsequences that are mostly preserved across proteins (1-5). Similarly, clustering algorithms applied to tertiary structural datasets can be used to perform a similar organization and provide insights into the rate of change between sequence and structure (6,7).

The primary goal of this article is to provide an overview of the various issues involved in clustering large datasets, describe the merits and underlying assumptions of some of the commonly used clustering approaches, and provide insights on how to cluster datasets arising in various areas within life sciences. Toward this end, the article is organized, broadly, in three parts.



The first part (see Headings 2. to 4.) describes the various types of clustering algorithms developed over the years, the various methods for computing the similarity between objects arising in life sciences, and methods for assessing the quality of the clusters. The second part (see Heading 5.) focuses on the problem of clustering data arising from microarray experiments and describes some of the commonly used approaches. Finally, the third part (see Heading 6.) provides a brief introduction to CLUTO, a general purpose toolkit for clustering various datasets, with an emphasis on its applications to problems and analysis requirements within life sciences.

2. Types of Clustering Algorithms

The topic of clustering has been extensively studied in many scientific disciplines and a variety of different algorithms have been developed (8-20). Two recent surveys on the topic (21,22) offer a comprehensive summary of the different applications and algorithms. These algorithms can be categorized along different dimensions, based either on the underlying methodology of the algorithm, leading to partitional or agglomerative approaches; the structure of the final solution, leading to hierarchical or nonhierarchical solutions; the characteristics of the space in which they operate, leading to feature or similarity approaches; or the type of clusters that they discover, leading to globular or transitive clustering methods.

2.1. Agglomerative and Partitional Algorithms

Partitional algorithms, such as K-means (9,23), K-medoids (9,11,13), probabilistic (10,24), graph-partitioning based (9,25-27), or spectral based (28), find the clusters by partitioning the entire dataset into either a predetermined or an automatically derived number of clusters.

Partitional clustering algorithms compute a k-way clustering of a set of objects either directly or through a sequence of repeated bisections. A direct k-way clustering is commonly computed as follows. Initially, a set of k objects is selected from the dataset to act as the seeds of the k clusters. Then, for each object, its similarity to these k seeds is computed, and it is assigned to the cluster corresponding to its most similar seed. This forms the initial k-way clustering. This clustering is then repeatedly refined so that it optimizes a desired clustering criterion function. A k-way partitioning through repeated bisections is obtained by recursively applying the above algorithm to compute two-way clusterings (i.e., bisections). Initially, the objects are partitioned into two clusters, then one of these clusters is selected and is further bisected, and so on. This process continues k-1 times, leading to k clusters. Each of these bisections is performed so that the resulting two-way clustering solution optimizes a particular criterion function.

Criterion functions used in partitional clustering reflect the underlying definition of the "goodness" of clusters. Partitional clustering can be considered as an optimization procedure that tries to create high-quality clusters according to a particular criterion function. Many criterion functions have been proposed (9,29,30) and some of them are described later in Heading 6. Criterion functions measure various aspects of intracluster similarity, intercluster dissimilarity, and their combinations. These criterion functions use different views of the underlying collection, by either modeling the objects as vectors in a high-dimensional space or by modeling the collection as a graph.
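
To make the direct k-way procedure above concrete, here is a minimal sketch in Python/NumPy (the language, the random seed selection, and the cosine-based assignment are choices of this example, not prescriptions of the article): k seed objects are selected, every object is assigned to its most similar seed, and the assignment is then refined by recomputing cluster centroids, which is the K-means style of refinement.

```python
import numpy as np

def direct_kway(objects, k, iters=20, seed=0):
    """Direct k-way clustering: select k seeds, assign every object to its
    most similar seed, then repeatedly refine via centroid recomputation."""
    rng = np.random.default_rng(seed)
    # Normalize rows to unit length so dot products equal cosine similarities.
    X = objects / np.linalg.norm(objects, axis=1, keepdims=True)
    # Select k objects from the dataset to act as the seeds of the k clusters.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each object to the cluster of its most similar centroid.
        labels = np.argmax(X @ centroids.T, axis=1)
        # Refinement step: recompute each centroid from its cluster members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.mean(axis=0)
                centroids[j] = c / np.linalg.norm(c)
    return labels
```

A k-way clustering through repeated bisections would instead call this routine with k = 2, pick one of the resulting clusters, and bisect again until k clusters remain.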


Hierarchical agglomerative algorithms find the clusters by initially assigning each object to its own cluster and then repeatedly merging pairs of clusters until a certain stopping criterion is met. Consider an n-object dataset and the clustering solution that has been computed after performing l merging steps. This solution will contain exactly n-l clusters, as each merging step reduces the number of clusters by one. Now, given this (n-l)-way clustering solution, the pair of clusters that is selected to be merged next is the one that leads to an (n-l-1)-way solution that optimizes a particular criterion function. That is, each one of the (n-l) x (n-l-1)/2 pairs of possible merges is evaluated, and the one that leads to a clustering solution that has the maximum (or minimum) value of the particular criterion function is selected. Thus, the criterion function is locally optimized within each particular stage of agglomerative algorithms. Depending on the desired solution, this process continues until either there are only k clusters left, or the entire agglomerative tree has been obtained.

The three basic criteria for determining which pair of clusters is to be merged next are single-link (31), complete-link (32), and group average (UPGMA [unweighted pair group method with arithmetic mean]) (9). The single-link criterion measures the similarity of two clusters by the maximum similarity between any pair of objects from each cluster, whereas the complete-link criterion uses the minimum similarity. In general, both the single-link and the complete-link approaches do not work very well because they either base their decisions on a limited amount of information (single-link) or assume that all the objects in the cluster are very similar to each other (complete-link). On the other hand, the group average approach measures the similarity of two clusters by the average of the pairwise similarities of the objects from each cluster, and does not suffer from the problems arising with single-link and complete-link. In addition to these three basic approaches, a number of more sophisticated schemes have been developed, such as CURE (18), ROCK (19), and CHAMELEON (20), that have been shown to produce superior results.

Finally, hierarchical algorithms produce a clustering solution that forms a dendrogram, with a single all-inclusive cluster at the top and single-point clusters at the leaves. In contrast, in nonhierarchical algorithms there tends to be no relation between the clustering solutions produced at different levels of granularity.

2.2. Feature-Based and Similarity-Based Clustering Algorithms

Another distinction between the different clustering algorithms is whether they operate on the object's feature space or on a derived similarity (or distance) space. K-means-based algorithms are the prototypical examples of methods that operate on the original feature space. In this class of algorithms, each object is represented as a multidimensional feature vector, and the clustering solution is obtained by iteratively optimizing the similarity (or distance) between each object and its cluster centroid. On the other hand, similarity-based algorithms compute the clustering solution by first computing the pairwise similarities between all the objects and then using these similarities to drive the overall clustering solution. Hierarchical agglomerative schemes, graph-based schemes, as well as K-medoids, fall into this category. The advantage of similarity-based methods is that they can be used to cluster a wide variety of datasets, provided that reasonable methods exist for computing the pairwise similarity between objects. For this reason, they have been used to cluster both sequential (1,2) as well as graph datasets (33,34), especially in biological applications. On the other hand, there has been limited work in developing clustering algorithms that operate directly on the sequence or graph datasets (35).

However, similarity-based approaches have two key limitations. First, their computational requirements are high, as they need to compute the pairwise similarity between all the objects that need to be clustered. As a result, such algorithms can only be applied to relatively small datasets (a few thousand objects), and they cannot be effectively used to cluster the datasets arising in many fields within life sciences. The second limitation of these approaches is that by not using the object's feature space and relying only on the pairwise similarities, they tend to produce suboptimal clustering solutions, especially when the similarities are low relative to the cluster sizes. The key reason for this is that these algorithms can only determine the overall similarity of a collection of objects (i.e., a cluster) by using measures derived from the pairwise similarities (e.g., average, median, or minimum pairwise similarities). However, such measures, unless the overall similarity between the members of different clusters is high, are quite unreliable as they cannot capture what is common between the different objects in the collection.
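
All three merging criteria are available in standard toolkits; the following minimal sketch uses SciPy's hierarchical clustering (a library choice made for this illustration, not one the article mandates) to build the full agglomerative tree under each criterion and then cut it at a fixed number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: rows are objects, columns are features.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.1]])

for method in ("single", "complete", "average"):  # "average" is UPGMA
    # Build the agglomerative tree by repeatedly merging the best pair.
    Z = linkage(X, method=method, metric="euclidean")
    # Cut the tree so that exactly three clusters remain.
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, labels)
```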


Clustering algorithms that operate in the object's feature space can overcome both of these limitations. Because they do not require the precomputation of the pairwise similarities, fast partitional algorithms can be used to find the clusters, and because their clustering decisions are made in the object's feature space, they can potentially lead to better clusters by correctly evaluating the similarity between a collection of objects. For example, in the context of clustering protein sequences, the proteins in each cluster can be analyzed to determine the conserved blocks, and only these blocks can be used in computing the similarity between the sequences (an idea formalized by profile HMM approaches; 36,37). Recent studies in the context of clustering large high-dimensional datasets done by various groups (38-41) show the advantages of such algorithms over those based on similarity.

2.3. Globular and Transitive Clustering Algorithms

Besides the operational differences between various clustering algorithms, another key distinction between them is the type of clusters that they discover. There are two general types of clusters that often arise in different application domains. What differentiates these types is the relationship between the cluster's objects and the dimensions of their feature space.

The first type of clusters contains objects that exhibit a strong pattern of conservation along a subset of their dimensions. That is, there is a subset of the original dimensions in which a large fraction of the objects agree. For example, if the dimensions correspond to different protein motifs, then a collection of proteins will form a cluster if there exists a subset of motifs that is present in a large fraction of the proteins. This subset of dimensions is often referred to as a subspace, and the above stated property can be viewed as the cluster's objects and its associated dimensions forming a dense subspace. Of course, the number of dimensions in these dense subspaces, and the density (i.e., how large the fraction of objects sharing the same dimensions is), will be different from cluster to cluster. Exactly this variation in subspace size and density (and the fact that an object can be part of multiple disjoint or overlapping dense subspaces) is what complicates the problem of discovering this type of clusters. There are a number of application areas in which such clusters give rise to meaningful groupings of the objects (i.e., domain experts will tend to agree that the clusters are correct). Such areas include clustering documents based on the terms they contain, clustering customers based on the products they purchase, clustering genes based on their expression levels, and clustering proteins based on the motifs they contain.

The second type of clusters contains objects for which, again, there exists a subspace associated with the cluster. However, unlike the earlier case, in these clusters there will be subclusters that may share a very small number of the subspace's dimensions, but there will be a strong path within the cluster that connects them. By "strong path" we mean that if A and B are two subclusters that share only a few dimensions, then there will be another set of subclusters X1, X2, ..., Xk that belong to the cluster, such that each of the subcluster pairs (A, X1), (X1, X2), ..., (Xk, B) will share many of the subspace's dimensions. What complicates cluster discovery in this setting is that the connections (i.e., shared subspace dimensions) between subclusters within a particular cluster will tend to be of different strength. Examples of this type of clusters include protein clusters with distant homologies or clusters of points that form spatially contiguous regions.

Our discussion so far focused on the relationship between the objects and their feature space. However, these two classes of clusters can also be understood in terms of the object-to-object similarity graph. The first type of clusters will tend to contain objects in which the similarity between all pairs of objects will be high. On the other hand, in the second type of clusters there will be a lot of objects whose direct pairwise similarity will be quite low, but these objects will be connected by many paths that stay within the cluster and traverse high-similarity edges. The names of these two cluster types were inspired by this similarity-based view, and they are referred to as globular and transitive clusters, respectively.
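
One way to make the transitive notion operational is to link each object to its near neighbors and read the clusters off as connected components of the resulting graph; chains of high-similarity edges then hold together objects whose direct pairwise similarity is low. The sketch below is one hedged illustration of this idea (the Euclidean distance and the symmetric k-nearest-neighbor construction are assumptions of the example).

```python
import numpy as np

def knn_graph_components(X, k=2):
    """Cluster by connected components of a symmetric k-nearest-neighbor
    graph: objects joined through chains of near neighbors end up in the
    same (transitive) cluster."""
    n = len(X)
    # Pairwise Euclidean distances; the diagonal is excluded.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    # Link i and j if either one is among the other's k nearest neighbors.
    nn = np.argsort(d, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        adj[i, nn[i]] = True
    adj |= adj.T
    # Depth-first search to label the connected components.
    labels, current = [-1] * n, 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            i = stack.pop()
            if labels[i] == -1:
                labels[i] = current
                stack.extend(j for j in range(n) if adj[i, j] and labels[j] == -1)
        current += 1
    return labels
```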


The various clustering algorithms are in general suited for finding either globular or transitive clusters. In general, clustering criterion-driven partitional clustering algorithms, such as K-means and its variants, and agglomerative algorithms using the complete-link or the group-average method, are suited for finding globular clusters. On the other hand, the single-link method of the agglomerative algorithm and graph-partitioning-based clustering algorithms that operate on a nearest-neighbor similarity graph are suited for finding transitive clusters. Finally, specialized algorithms, called subspace clustering methods, have been developed to explicitly find either globular or transitive clusters by operating directly in the object's feature space (42-44).

3. Methods for Measuring the Similarity Between Objects

In general, the method used to compute the similarity between two objects depends on two factors. The first factor has to do with how the objects are actually being represented. For example, the similarity between two objects represented by a set of attribute-value pairs will be computed in an entirely different way from the similarity between two DNA sequences or between two three-dimensional protein structures. The second factor is much more subjective and has to do with the actual goal of clustering. Different analysis requirements may give rise to entirely different similarity measures and different clustering solutions. This section focuses on discussing various methods for computing the similarity between objects that address both of these factors.

The diverse nature of biological sciences and the sheer complexity of the underlying physicochemical and evolutionary principles that need to be modeled give rise to numerous clustering problems involving a wide range of different objects. The most prominent of them are the following:

Multidimensional Vectors: Each object is represented by a set of attribute-value pairs. The meaning of the attributes (also referred to as variables or features) is application dependent and includes datasets like those arising from various measurements (e.g., gene expression data), or from various clinical sources (e.g., drug response, disease states).

Sequences: Each object is represented as a sequence of symbols or events. The meaning of these symbols or events also depends on the underlying application and includes objects such as DNA and protein sequences, sequences of secondary structure elements, temporal measurements of various quantities such as gene expression, and historical observations of disease states.

Structures: Each object is represented as a two- or three-dimensional structure. The primary examples of such datasets include the spatial distribution of various quantities of interest within various cells, and the three-dimensional geometry of chemical molecules such as enzymes and proteins.

The rest of this section describes some of the most popular methods for computing the similarity for all these types of objects.

3.1. Similarity Between Multidimensional Objects

There are a variety of methods for computing the similarity between two objects that are represented by a set of attribute-value pairs. These methods, to a large extent, depend on the nature of the attributes themselves and the characteristics of the objects that we need to model by the similarity function.

From the point of view of similarity calculations, there are two general types of attributes. The first one consists of attributes whose range of values is continuous. This includes both integer-valued and real-valued variables, as well as attributes whose allowed set of values are thought to be part of an ordered set. Examples of such attributes include gene expression measurements, ages, and disease severity levels. On the other hand, the second type consists of attributes that take values from an unordered set. Examples of such attributes include gender, blood type, and tissue type. We will refer to the first type as continuous attributes and to the second type as categorical attributes.


The primary difference between these two types of attributes is that in the case of continuous attributes, when there is a mismatch in the value taken by a particular attribute in two different objects, the difference of the two values is a meaningful measure of distance, whereas with categorical attributes there is no easy way to assign a distance between such mismatches.

In the rest of this section, we present methods for computing the similarity assuming that all the attributes in the objects are either continuous or categorical. However, in most real applications, objects will be represented by a mixture of such attributes, and therefore the described approaches need to be combined.

3.1.1. Continuous Attributes

When all the attributes are continuous, each object can be considered to be a vector in the attribute space. That is, if n is the total number of attributes, then each object v can be represented by an n-dimensional vector (v_1, v_2, ..., v_n), where v_i is the value of the ith attribute. Given any two objects with their corresponding vector-space representations v and u, a widely used method for computing the similarity between them is to look at their distance as measured by some norm of their vector difference. That is,

\mathrm{dis}_r(v, u) = \|v - u\|_r    (1)

where r is the norm used and \|\cdot\|_r denotes the corresponding vector norm. If the distance is small, then the objects will be similar, and the similarity of the objects will decrease as their distance increases.

The two most commonly used norms are the one-norm and the two-norm. In the case of the one-norm, the distance between two objects is given by

\mathrm{dis}_1(v, u) = \|v - u\|_1 = \sum_{i=1}^{n} |v_i - u_i|    (2)

where |\cdot| denotes absolute value. Similarly, in the case of the two-norm, the distance is given by

\mathrm{dis}_2(v, u) = \|v - u\|_2 = \sqrt{\sum_{i=1}^{n} (v_i - u_i)^2}    (3)

Note that the one-norm distance is also called the Manhattan distance, whereas the two-norm distance is nothing more than the Euclidean distance between the vectors. These distances may become problematic when clustering high-dimensional data, because in such datasets the similarity between two objects is often defined along a small subset of dimensions.

An alternate way of measuring the similarity between two objects in the vector-space model is to look at the angle between their vectors. If two objects have vectors that point in the same direction (i.e., their angle is small), then these objects will be considered similar, and if their vectors point in different directions (i.e., their angle is large), then these objects will be considered dissimilar. This angle-based approach for computing the similarity between two objects emphasizes the relative values that each dimension takes within each vector, and not their overall length. That is, two objects can have an angle of zero (i.e., point in the identical direction), even if their Euclidean distance is arbitrarily large. For example, in a two-dimensional space, the vectors v = (1,1) and u = (1000,1000) will be considered to be identical, as their angle is zero. However, their Euclidean distance is close to 1000\sqrt{2}.

Because the computation of the angle between two vectors is somewhat expensive (requiring inverse trigonometric functions), we do not measure the angle itself but its cosine. The cosine of the angle between two vectors v and u is given by

\mathrm{sim}(v, u) = \cos(v, u) = \frac{\sum_{i=1}^{n} v_i u_i}{\|v\|_2 \|u\|_2}    (4)

This measure will be plus one if the angle between v and u is zero, and minus one if their angle is 180 degrees (i.e., they point in opposite directions). Note that a surrogate for the angle between two vectors can also be computed using the Euclidean distance, but instead of computing the distance between v and u directly, we need to first scale them to be of unit length. In that case, the Euclidean distance measures the chord between the two vectors in the unit hypersphere.

In addition to the above linear-algebra-inspired methods, another widely used scheme for determining the similarity between two vectors uses the Pearson correlation coefficient, which is given by

\mathrm{sim}(v, u) = \mathrm{corr}(v, u) = \frac{\sum_{i=1}^{n} (v_i - \bar{v})(u_i - \bar{u})}{\sqrt{\sum_{i=1}^{n} (v_i - \bar{v})^2} \sqrt{\sum_{i=1}^{n} (u_i - \bar{u})^2}}    (5)

where \bar{v} and \bar{u} are the means of the values of the v and u vectors, respectively. Note that Pearson's correlation coefficient is nothing more than the cosine between the mean-subtracted v and u vectors. As a result, it does not depend on the length of the (v - \bar{v}) and (u - \bar{u}) vectors, but only on their angle.
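
Transcribed into code, Eqs. 2-5 look as follows (NumPy, with v and u given as one-dimensional arrays); the last lines replay the v = (1,1), u = (1000,1000) example, for which the Euclidean distance is large but the cosine similarity is exactly 1.

```python
import numpy as np

def dis_manhattan(v, u):   # Eq. 2: one-norm (Manhattan) distance
    return np.sum(np.abs(v - u))

def dis_euclidean(v, u):   # Eq. 3: two-norm (Euclidean) distance
    return np.sqrt(np.sum((v - u) ** 2))

def sim_cosine(v, u):      # Eq. 4: cosine of the angle between v and u
    return np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u))

def sim_pearson(v, u):     # Eq. 5: cosine of the mean-subtracted vectors
    vc, uc = v - v.mean(), u - u.mean()
    return np.dot(vc, uc) / (np.linalg.norm(vc) * np.linalg.norm(uc))

v = np.array([1.0, 1.0])
u = np.array([1000.0, 1000.0])
print(dis_euclidean(v, u))  # about 1412.8, close to 1000 * sqrt(2)
print(sim_cosine(v, u))     # 1.0: identical direction, zero angle
```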


Our discussion so far on similarity measures for continuous attributes focused on objects in which all the attributes were homogeneous in nature. A set of attributes is called homogeneous if all of them measure quantities that are of the same type. As a result, changes in the values of these variables can be easily correlated across them. Quite often, however, each object will be represented by a set of inhomogeneous attributes. For example, if we would like to cluster patients, then some of the attributes describing each patient can measure things like age, weight, height, and calorie intake. If we use some of these methods to compute the similarity, we will essentially make the assumption that equal-magnitude changes in all variables are identical. However, this may not be the case. A difference of 50 yr in the ages of two patients represents something significantly different from a difference of 50 cal in their calorie intakes. To address these problems, the various attributes need to be first normalized before using any of these similarity measures. Of course, the specific normalization method is attribute dependent, but its goal should be to make differences across different attributes comparable.

3.1.2. Categorical Attributes

If the attributes are categorical, special similarity measures are required, as distances between their values cannot be defined in an obvious manner. The most straightforward way is to treat each categorical attribute individually and define the similarity based on whether two objects contain the exact same value for each categorical attribute. Huang (45) formalized this idea by introducing dissimilarity measures between objects with categorical attributes that can be used in any clustering algorithm. Let X and Y be two objects with m categorical attributes, and let X_i and Y_i be the values of the ith attribute of the two objects. The dissimilarity measure between X and Y is defined to be the number of mismatching attributes of the two objects. That is,

d(X, Y) = \sum_{i=1}^{m} S(X_i, Y_i)

where

S(X_i, Y_i) = \begin{cases} 0 & (X_i = Y_i) \\ 1 & (X_i \neq Y_i) \end{cases}

A normalized variant of the above dissimilarity is defined as follows:

d(X, Y) = \sum_{i=1}^{m} \frac{n_{X_i} + n_{Y_i}}{n_{X_i} n_{Y_i}} S(X_i, Y_i)

where n_{X_i} (n_{Y_i}) is the number of times the value X_i (Y_i) appears in the ith attribute of the entire dataset. If two categorical values are common across the dataset, they will have low weights, so that a mismatch between them will not contribute significantly to the final dissimilarity score. If two categorical values are rare in the dataset, then they are more informative and will receive higher weights according to the formula. Hence, this dissimilarity measure emphasizes mismatches that happen for rare categorical values more than those involving common ones.

One of the limitations of the preceding method is that two values can contribute to the overall similarity only if they are the same. However, different categorical values may contain useful information in the sense that even if the values are different, the objects containing those values are related to some extent. By defining similarities just based on matches and mismatches of values, some useful information may be lost. A number of approaches have been proposed to overcome this limitation (19,46-48) by using additional information between categories or relationships between categorical values.
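
Both the simple matching measure and its frequency-weighted variant are short to implement; the sketch below (plain Python; the toy blood-type data are hypothetical) weighs mismatches on rare values more heavily, as in the normalized formula above.

```python
from collections import Counter

def mismatch_dissimilarity(x, y):
    """Huang's measure: the number of attributes on which x and y differ."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def weighted_dissimilarity(x, y, dataset):
    """Normalized variant: mismatches on rare values contribute more than
    mismatches on common ones."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        if xi == yi:
            continue
        # Frequencies of the two values in the ith attribute of the dataset.
        counts = Counter(obj[i] for obj in dataset)
        n_xi, n_yi = counts[xi], counts[yi]
        total += (n_xi + n_yi) / (n_xi * n_yi)
    return total

patients = [("A", "O+"), ("A", "O-"), ("B", "O+"), ("AB", "O+")]
print(mismatch_dissimilarity(patients[0], patients[3]))             # 1
print(weighted_dissimilarity(patients[0], patients[3], patients))   # 1.5
```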


3.2. Similarity Between Sequences

One of the most important applications of clustering in life sciences is clustering sequences, for example, DNA or protein sequences. Many clustering algorithms have been proposed to enhance sequence database searching, organize sequence databases, generate phylogenetic trees, or guide multiple sequence alignment. In this specific clustering problem, the objects of interest are biological sequences, which consist of a string of symbols that could be nucleotides, amino acids, or secondary structure elements (SSEs). Biological sequences are different from the objects we have discussed so far, in the sense that they are not defined by a collection of attributes. Hence, the similarity measures we discussed so far are not applicable to biological sequences.

Over the years, a number of different approaches have been developed for computing the similarity between two sequences (49). The most common ones are the alignment-based measures, which first compute an optimal alignment between two sequences (either globally or locally), and then determine their similarity by measuring the degree of agreement in the aligned positions of the two sequences. The aligned positions are usually scored using a symbol-to-symbol scoring matrix, and in the case of protein sequences, the most commonly used scoring matrices are PAM (50,51) or BLOSUM (52).

The global sequence alignment (Needleman-Wunsch alignment; 53) aligns the entire sequences using dynamic programming. The recurrence relations are the following (49). Given two sequences S_1 of length n and S_2 of length m, and a scoring matrix S, let score(i, j) be the score of the optimal alignment of the prefixes S_1[1...i] and S_2[1...j]. The base conditions are

\mathrm{score}(0, j) = \sum_{1 \le k \le j} S(\_, S_2(k))

and

\mathrm{score}(i, 0) = \sum_{1 \le k \le i} S(S_1(k), \_)

Then, the general recurrence is

\mathrm{score}(i, j) = \max \begin{cases} \mathrm{score}(i-1, j-1) + S(S_1(i), S_2(j)) \\ \mathrm{score}(i-1, j) + S(S_1(i), \_) \\ \mathrm{score}(i, j-1) + S(\_, S_2(j)) \end{cases}

where '_' represents a space, S is the scoring matrix that specifies the matching score for each pair of symbols, and score(n, m) is the optimal alignment score.

These global similarity scores are meaningful when we compare similar sequences with roughly the same length (e.g., protein sequences from the same protein family). However, when sequences are of different lengths and are quite divergent, the alignment of the entire sequences may not make sense, in which case the similarity is commonly defined on the conserved subsequences. This problem is referred to as the local alignment problem, which seeks to find the pair of substrings of the two sequences that has the highest global alignment score among all possible pairs of substrings. Local alignments can be computed optimally via a dynamic programming algorithm originally introduced by Smith and Waterman (54). The base conditions are score(0, j) = 0 and score(i, 0) = 0, and the general recurrence is given by

\mathrm{score}(i, j) = \max \begin{cases} 0 \\ \mathrm{score}(i-1, j-1) + S(S_1(i), S_2(j)) \\ \mathrm{score}(i-1, j) + S(S_1(i), \_) \\ \mathrm{score}(i, j-1) + S(\_, S_2(j)) \end{cases}

The local sequence alignment score corresponds to the cell of the dynamic programming table that has the highest value. Note that the recurrence for local alignments is very similar to that for global alignments, with only minor changes that allow the alignment to begin from any location (i, j) (49).

Alternatively, local alignments can be computed approximately via heuristic approaches, such as FASTA (55,56) or BLAST (57). The heuristic approaches achieve low time complexities by first identifying promising locations in an efficient way, and then applying a more expensive method on those locations to construct the final local sequence alignment. The heuristic approaches are widely used for searching protein databases because of their low time complexity. A description of these algorithms is beyond the scope of this article, and the interested reader should follow the references.
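
As an illustration of the local-alignment recurrence above, here is a minimal dynamic-programming sketch; for brevity it uses a flat match/mismatch/gap scoring scheme in place of a PAM or BLOSUM matrix (an assumption of this example) and returns only the optimal score, not the alignment itself.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=-1):
    """Local alignment score via the Smith-Waterman recurrence:
    score(i,j) = max(0, diagonal + S(s1[i], s2[j]), up + gap, left + gap)."""
    n, m = len(s1), len(s2)
    # Base conditions: score(0, j) = score(i, 0) = 0.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,  # align s1[i] with s2[j]
                              score[i - 1][j] + gap,    # align s1[i] with a space
                              score[i][j - 1] + gap)    # align s2[j] with a space
            best = max(best, score[i][j])
    # The local alignment score is the highest cell in the table.
    return best

print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))
```

Dropping the 0 from the max and adding the gap-penalized base conditions turns this into the global (Needleman-Wunsch) recurrence.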


Most existing protein clustering algorithms use similarity measures based on the local alignment methods (i.e., Smith-Waterman, BLAST, and FASTA); examples include GeneRage (2) and ProtoMap (58). These clustering algorithms first obtain the pairwise similarity scores of all pairs of sequences. Then they either normalize the scores by the self-similarity scores of the sequences to obtain a percentage value of identicalness (59), or transform the scores to binary values based on a particular threshold (2). Other methods normalize the raw similarity scores by taking into account other sequences in the dataset. For example, ProtoMap (58) first generates the distribution of the pairwise similarities between sequence A and the other sequences in the database. Then the similarity between sequence A and sequence B is defined as the expected value of the similarity score found for A and B, based on the overall distribution. A low expected value indicates a significant and strong connection (similarity).

3.3. Similarity Between Structures

Methods for computing the similarity between the three-dimensional structures of two proteins (or other molecules) are intrinsically different from any of the approaches that we have seen so far for comparing multidimensional objects and sequences. Moreover, unlike the previous data types, for which there are well-developed and widely accepted methods for measuring similarities, the methods for comparing three-dimensional structures are still evolving, and the entire field is an active research area. Providing a comprehensive description of the various methods for computing the similarity between two structures requires a chapter (or a book) of its own, and is far beyond the scope of this article. For this reason, our discussion in the rest of this section will primarily focus on presenting some of the issues involved in comparing three-dimensional structures, in the context of proteins, and outlining some of the approaches that have been proposed for solving them. The reader should refer to the chapter by Johnson and Lehtonen (60), which provides an excellent introduction to the topic.

The general approach, which almost all methods for computing the similarity between a pair of three-dimensional protein structures follow, is to try to superimpose the structure of one protein on top of the structure of the other protein, so that certain key features are mapped very close to each other in space. Once this is done, the similarity between the two structures is computed by measuring the fit of the superposition. This fit is commonly computed as the root mean square deviation (RMSD) of the corresponding features. To some extent, this is similar in nature to the alignment performed for sequence-based similarity approaches, but it is significantly more complicated as it involves three-dimensional structures with substantially more degrees of freedom. There are a number of different variations for performing this superposition that have to do with (1) the features of the two proteins that are sought to be matched, (2) whether or not the proteins are treated as rigid or flexible bodies, (3) how the equivalent set of features from the two proteins is determined, and (4) the type of superposition that is computed.

In principle, when comparing two protein structures we can treat every atom of each amino acid side chain as a feature and try to compute a superposition that matches all of them as well as possible. However, this usually does not lead to good results, because the side chains of different residues will have different numbers of atoms with different geometries. Moreover, even the same amino acid types may have side chains with different conformations, depending on their environment. As a result, even if two proteins have very similar backbones, a superposition computed by looking at all the atoms may fail to identify this similarity. For this reason, most approaches try to superimpose two protein structures by focusing on the Cα atoms of their backbones, whose locations are less sensitive to the actual residue type. Besides these atom-level approaches, other methods focus on SSEs and superimpose two proteins so that their SSEs are geometrically aligned with each other.
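
Once an equivalent set of features (e.g., matched Cα atoms) has been fixed, the optimal rigid-body superposition and its RMSD can be computed in closed form. The sketch below uses the Kabsch algorithm via the SVD, a standard technique named here as this example's choice; the article itself does not single out a particular method.

```python
import numpy as np

def superpose_rmsd(P, Q):
    """RMSD of Q onto P after optimal rigid-body superposition (Kabsch).
    P, Q: (n, 3) arrays of matched coordinates, e.g., equivalent Ca atoms."""
    # Remove the translation by centering both coordinate sets.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the 3 x 3 covariance matrix.
    V, S, Wt = np.linalg.svd(Pc.T @ Qc)
    # Reflection correction keeps the transformation a proper rotation.
    d = np.sign(np.linalg.det(V @ Wt))
    R = V @ np.diag([1.0, 1.0, d]) @ Wt
    # Root mean square deviation of the superposed features.
    diff = Pc - Qc @ R.T
    return np.sqrt((diff ** 2).sum() / len(P))

# Identical structures rotated 90 degrees about z superpose with RMSD ~ 0.
P = np.array([[1.0, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]])
Rz = np.array([[0.0, -1, 0], [1, 0, 0], [0, 0, 1]])
print(superpose_rmsd(P, P @ Rz.T))
```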


Most approaches for computing the similarity between two structures treat them as rigid bodies and try to find the appropriate geometric transformation (i.e., rotation and translation) that leads to the best superposition. Rigid-body geometric transformations are well understood and are relatively easy to compute efficiently. However, by treating proteins as rigid bodies we may get poor superpositions when the protein structures are significantly different, even though they are part of the same fold. In such cases, allowing some degree of flexibility tends to produce better results, but also increases the complexity. In trying to find the best way to superimpose one structure on top of the other, in addition to selecting the features of interest, we must identify the pairs of features from the two structures that will be mapped against each other. There are two general approaches for doing that. The first approach relies on an initial set of equivalent features (e.g., Cα atoms or SSEs) being provided by domain experts. This initial set is used to compute an initial superposition, and then additional features are identified using various approaches based on dynamic programming or graph theory (53,61,62). The second approach tries to automatically identify the correspondence between the various features by various methods, including structural comparisons based on matching Cα-atom contact maps (63), or on the optimal alignment of SSEs (64).

Finally, as was the case with sequence alignment, the superposition of three-dimensional structures can be done globally, where the goal is to superimpose the entire protein structures, or locally, where the goal is to compute a good superposition involving a subsequence of the protein.

4. Assessing Cluster Quality

Clustering results are hard to evaluate, especially for high-dimensional data and without prior knowledge of the objects' distribution, which is quite common in practical cases. However, assessing the quality of the resulting clusters is as important as generating the clusters. Given the same dataset, different clustering algorithms with various parameters or initial conditions will give very different clusters. It is essential to know whether the resulting clusters are valid and how to compare the quality of the clustering results, so that the right clustering algorithm can be chosen and the best clustering results can be used for further analysis.

Another related problem is answering the question, how many clusters are there in the dataset? An ideal clustering algorithm would be one that can automatically discover the natural clusters present in the dataset based on the underlying cluster definition. However, there are no such universal cluster definitions and clustering algorithms suitable for all kinds of datasets. As a result, most existing algorithms require either the number of clusters to be provided as a parameter, as is done in the case of K-means, or a similarity threshold that will be used to terminate the merging process in the case of agglomerative clustering. However, in general, it is hard to know the right number of clusters or the right similarity threshold without prior knowledge of the dataset.

One possible way to automatically determine the number of clusters k is to compute various clustering solutions for a range of values of k, score the resulting clusters based on some particular metric, and then select the solution that achieves the best score. A critical component of this approach is the method used to measure the quality of the clusters. To solve this problem, numerous approaches have been proposed in a number of different disciplines, including pattern recognition, statistics, and data mining. The majority of them can be classified into two groups: external quality measures and internal quality measures.

The approaches based on external quality measures require a priori knowledge of the natural clusters that exist in the dataset, and validate a clustering result by measuring the agreement between the discovered clusters and the known information. For instance, when clustering gene expression data, the known functional categorization of the genes can be treated as the natural clusters, and the resulting clustering solution will be considered correct if it leads to clusters that preserve this categorization. A key aspect of the external quality measures is that they use information other than that used by the clustering algorithms. However, such reliable a priori knowledge is usually not available when analyzing real datasets; after all, clustering is used as a tool to discover such knowledge in the first place.
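
The select-k-by-scoring procedure described above takes only a few lines. The example below uses scikit-learn's KMeans together with the silhouette coefficient, an internal quality measure chosen for this illustration; the article does not endorse a specific metric.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy dataset with three well-separated groups.
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 3.0, 6.0)])

best_k, best_score = None, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    s = silhouette_score(X, labels)  # internal quality: higher is better
    if s > best_score:
        best_k, best_score = k, s
print(best_k)  # expected: 3
```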


The basic idea behind internal quality measures is rooted in the definition of clusters. A meaningful clustering solution should group objects into various clusters, so that the objects within each cluster are more similar to each other than to the objects from different clusters. Therefore, most of the internal quality measures evaluate the clustering solution by looking at how similar the objects are within each cluster and how well the objects of different clusters are separated. For example, the pseudo F statistic suggested by Calinski and Harabasz (65) uses the quotient of the intercluster average squared distance and the intracluster average squared distance. If we have \bar{X} as the centroid (i.e., mean vector) of all the objects, \bar{X}_j as the centroid of the objects in cluster C_j, k as the total number of clusters, n as the total number of objects, and d(x, y) as the squared Euclidean distance between two object-vectors x and y, then the pseudo F statistic is defined as follows:

F = \frac{\left( \sum_{i=1}^{n} d(x_i, \bar{X}) - \sum_{j=1}^{k} \sum_{x \in C_j} d(x, \bar{X}_j) \right) / (k-1)}{\left( \sum_{j=1}^{k} \sum_{x \in C_j} d(x, \bar{X}_j) \right) / (n-k)}

One of the limitations of the internal quality measures is that they often use the same information both in discovering and in evaluating the clusters. Recall from Heading 2. that some clustering algorithms produce clustering results by optimizing various clustering criterion functions. If the same criterion function were used as the internal quality measure, then the overall clustering assessment process would do nothing more than assess how effective the clustering algorithm was in optimizing that particular criterion function, and would provide no independent confirmation about the degree to which the clusters are meaningful.

An alternative way of validating the clustering results is to see how stable they are when adding noise to the data, or when subsampling it (66). This approach performs a sequence of subsamplings of the dataset and uses the same clustering procedure to produce clustering solutions for the various subsamples. These various clustering results are then compared to see the degree to which they agree. The stable clustering solution should be the one that gives similar clustering results across the different subsamples. This approach can also be easily used to determine the correct number of clusters in hierarchical clustering solutions. The stability test of clustering is performed at each level of the hierarchical tree, and the number of clusters k will be the largest value of k that can still produce stable clustering results.

Finally, a recent approach, with applications to clustering gene expression datasets, assesses the clustering results of gene expression data by looking at the predictive power for one experimental condition of the clustering results based on the other experimental conditions (67). The key idea behind this approach is that if one condition is left out, then the clusters generated from the remaining conditions should exhibit lower variation in the left-out condition than randomly formed clusters. Yeung et al. (67) defined the figure of merit (FOM) to be the summation of the intracluster variances for each one of the clustering instances in which one of the conditions was not used during clustering (i.e., the left-out condition). Among the various clustering solutions, they prefer the one that exhibits the least variation, and their experiments showed that in the context of clustering gene expression data, this method works quite well. One limitation of this approach is that it is not applicable to datasets in which all the attributes are independent. Moreover, it is only applicable to low-dimensional datasets, as computing the intracluster variance for each dimension is quite expensive when the number of dimensions is very large.
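
The pseudo F statistic defined earlier in this section transcribes directly into NumPy; the following is one possible sketch, using the squared Euclidean distance d and the centroid definitions given above.

```python
import numpy as np

def pseudo_f(X, labels):
    """Calinski-Harabasz pseudo F statistic: between-cluster dispersion over
    within-cluster dispersion, each scaled by its degrees of freedom."""
    labels = np.asarray(labels)
    n, k = len(X), len(set(labels.tolist()))
    grand = X.mean(axis=0)
    total = ((X - grand) ** 2).sum()  # sum_i d(x_i, X_bar)
    within = 0.0
    for j in set(labels.tolist()):
        members = X[labels == j]
        # Sum over x in C_j of d(x, X_bar_j).
        within += ((members - members.mean(axis=0)) ** 2).sum()
    between = total - within
    return (between / (k - 1)) / (within / (n - k))
```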


5. Case Study: Clustering Gene Expression Data

Recently developed methods for monitoring genomewide mRNA expression changes, such as oligonucleotide chips (68) and cDNA microarrays (69), are especially powerful as they allow us to quickly and inexpensively monitor the expression levels of a large number of genes at different time points and for different conditions, tissues, and organisms. Knowing when and under what conditions a gene or a set of genes is expressed often provides strong clues as to their biological role and function.

Clustering algorithms are used as an essential tool to analyze these datasets and provide valuable insight into various aspects of the genetic machinery. There are four distinct classes of clustering problems that can be formulated from gene expression datasets, each addressing a different biological problem. The first problem focuses on finding co-regulated genes by grouping together genes that have similar expression profiles. These co-regulated genes can be used to identify promoter elements by finding conserved areas in their upstream regions. The second problem focuses on finding distinctive tissue types by grouping together tissues whose genes have similar expression profiles. These tissue groups can then be further analyzed to identify the genes that best distinguish the various tissues. The third clustering problem focuses on finding common inducers by grouping together conditions for which the expression profiles of the genes are similar. Finding such groups of common inducers will allow us to substitute different "trigger" mechanisms that still elicit the same response (e.g., similar drugs, or similar herbicides or pesticides). Finally, the fourth clustering problem focuses on finding organisms that exhibit similar responses over a specified set of tested conditions, by grouping together organisms for which the expression profiles of their genes (in an ortholog sense) are similar. This would allow us to identify organisms with similar responses to chosen conditions (e.g., microbes that share a pathway).

In the rest of this subheading we briefly review the approaches behind cDNA and oligonucleotide microarrays, and discuss various issues related to clustering such gene expression datasets.

5.1. Overview of Microarray Technologies

DNA microarrays measure gene expression levels by exploiting the preferential binding of complementary, single-stranded nucleic acid sequences. cDNA microarrays, developed at Stanford University, are glass slides to which single-stranded DNA molecules are attached at fixed locations (spots) by high-speed robotic printing (70). Each array may contain tens of thousands of spots, each of which corresponds to a single gene. mRNA from the sample and from control cells is extracted, and cDNA is prepared by reverse transcription. Then, the cDNA is labeled with two fluorescent dyes and washed over the microarray so that cDNA sequences from both populations hybridize to their complementary sequences in the spots. The amount of cDNA from both populations bound to a spot can be measured by the level of fluorescence emitted from each dye. For example, the sample cDNA is labeled with a red dye and the control cDNA is labeled with a green dye. Then, if the mRNA from the sample population is in abundance, the spot will be red; if the mRNA from the control population is in abundance, it will be green; if sample and control bind equally, the spot will be yellow; and if neither binds, it will appear black. Thus, the relative expression levels of the genes in the sample and control populations can be estimated from the fluorescent intensities and colors of each spot. After transforming the raw images produced by the microarrays into relative fluorescent intensities with some image processing software, the gene expression levels are estimated as log-ratios of the relative intensities. A gene expression matrix can be formed by combining multiple microarray experiments on the same set of genes but under different conditions, where each row corresponds to a gene and each column corresponds to a condition (i.e., a microarray experiment) (70).
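
As a toy illustration of the last step (the intensities below are hypothetical, and real pipelines involve image processing and quality filtering that this sketch omits), a small expression matrix of log-ratios can be assembled as follows.

```python
import numpy as np

# Hypothetical red (sample) and green (control) intensities:
# rows are genes, columns are experiments (conditions).
R = np.array([[1200.0, 300.0], [480.0, 500.0], [60.0, 240.0]])
G = np.array([[300.0, 310.0], [490.0, 250.0], [250.0, 245.0]])

# Each entry of the gene expression matrix is the log-ratio of the
# relative intensities for one gene under one condition.
M = np.log2(R / G)
print(np.round(M, 2))  # > 0 upregulated, < 0 downregulated, ~ 0 unchanged
```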


The Affymetrix GeneChip oligonucleotide array contains several thousand single-stranded DNA oligonucleotide probe pairs. Each probe pair consists of an element containing oligonucleotides that perfectly match the target (PM probe) and an element containing oligonucleotides with a single base mismatch (MM probe). A probe set consists of a set of probe pairs corresponding to a target gene. As with cDNA microarrays, the labeled RNA is extracted from the sample cells and hybridizes to its complementary sequences. The expression level is measured by determining the difference between the PM and MM probes. Then, for each gene (i.e., probe set), the average difference or log average can be calculated, where the average difference is defined as the average difference between the PM and MM of every probe pair in the probe set, and the log average is defined as the average of the log ratios of the PM/MM intensities for each probe pair in the probe set.

5.2. Data Preparation and Normalization

Many sources of systematic variation may affect the measured gene expression levels in microarray experiments (71). For GeneChip experiments, scaling/normalization must be performed for each experiment before combining experiments, so that they have the same target intensity (TGT). The scaling factor of each experiment is determined by the array intensity of the experiment and the common TGT, where the array intensity is a composite of the average difference intensities across the entire array.

For cDNA microarray experiments, two fluorescent dyes are involved and cause more systematic variation, which makes normalization more important. In particular, this variation could be caused by differences in RNA amounts, differences in labeling efficiency between the two fluorescent dyes, and image acquisition parameters. Such biases can be removed by a constant adjustment to each experiment that forces the distribution of the log-ratios to have a median of zero. Because an experiment corresponds to one column in the gene expression matrix, this global normalization can be done by subtracting the mean/median of the gene expression levels of one experiment from the original values, so that the mean value for this experiment (column) is zero.

However, there are other sources of systematic variation that global normalization may not be able to correct. Yang et al. (71) pointed out that dye biases can depend on spot overall intensity and location on the array. Given the red and green fluorescence intensities (R, G) of all the spots in one slide, they plotted the log intensity ratio M = \log(R/G) vs the mean log intensity A = \log\sqrt{RG}, which shows a clear dependence of the log ratio M on the overall spot intensity A. Hence, an intensity-related normalization was proposed, in which C(A) is subtracted from the original log-ratio, where C(A) is a scatter-plot smoother fit to the M vs A plot using robust locally linear fits.

5.3. Similarity Measures

In most microarray clustering applications our goal is to find clusters of genes or clusters of conditions. A number of different methods have been proposed for computing these similarities, including Euclidean distance-based similarities, correlation coefficients, and mutual information.

The use of correlation-coefficient-based similarities is primarily motivated by the fact that, when clustering gene expression datasets, we are interested in how the expression levels of different genes are related under various conditions. The correlation coefficient values between genes (Eq. 5) can be used directly or transformed to absolute values if genes of both positive and negative correlation are important in the application.

An alternate way of measuring the similarity is to use the mutual information between a pair of genes. The mutual information between two information sources A and B represents how much information the two sources contain about each other. D'Haeseleer et al. (72) used mutual information to define the relationship between two conditions A and B. This was done by initially discretizing the gene expression levels into various bins, and using this discretization to compute the Shannon entropy of conditions A and B as follows:

S_A = -\sum_i p_i \log p_i

where p_i is the frequency of each bin. Given these entropy values, the mutual information between A and B is defined as

M(A, B) = S_A + S_B - S_{A,B}


of a gene, so it is not reliable to use it “as is.” Second, in most cases we are only interested in how the different genes change across the different conditions (i.e., whether they are upregulated or downregulated) and not in the exact amount of this change.

5.4. Clustering Approaches for Gene Expression Data

Since the early days of the development of microarray technologies, a wide range of existing clustering algorithms have been used, and novel approaches have been developed, for clustering gene expression datasets. The most effective traditional clustering algorithms are based either on the group-average variation of the agglomerative clustering methodology, or on the K-means approach applied to unit-length gene or condition expression vectors. Unlike other applications of clustering in life sciences, such as the construction of phylogenetic trees or of guide trees for multiple sequence alignment, there is no biological reason to expect that the correct clustering solution has the structure of a tree. Thus, agglomerative solutions are inherently suboptimal when compared to partitional approaches, which allow for a wider range of feasible solutions at various levels of cluster granularity. Despite this, agglomerative solutions tend to produce reasonable and biologically meaningful results, and they allow for an easy visualization of the relationships between the various genes or conditions in the experiments.

The ease of visualizing the results has also led to the extensive use of self-organizing maps (SOM) for gene expression clustering (73,75). The SOM method starts with a geometry of “nodes” in a simple topology (e.g., a grid or a ring) and a distance function on the nodes. Initially, the nodes are mapped randomly into the gene expression space, in which the ith coordinate represents the expression level in the ith condition. At each subsequent iteration, a data point P (i.e., a gene expression profile) is randomly selected; the data point P attracts nodes to itself, the nearest node Np to P is identified and moved the most, and the other nodes are moved toward P by amounts that depend on their distances to the nearest node Np. The advantage of SOMs is their structured approach, which makes visualization very easy. However, the method requires the user to specify the number of clusters as well as the grid topology, including the dimensions of the grid and the number of clusters in each dimension.
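The following minimal Python sketch shows one such iteration. The hypothetical grid array holds the fixed coordinates of the nodes in the chosen topology, and a practical implementation would also shrink the learning rate and the neighborhood width over time:

    import numpy as np

    def som_iteration(nodes, grid, point, lr=0.1, width=1.0):
        # nodes: k x m positions of the nodes in expression space
        # grid:  k x 2 fixed coordinates of the nodes in the node topology
        # point: one randomly selected expression profile of length m
        nearest = np.argmin(np.linalg.norm(nodes - point, axis=1))  # N_p
        # every node moves toward P; the pull decays with grid distance to N_p
        g = np.linalg.norm(grid - grid[nearest], axis=1)
        pull = np.exp(-g**2 / (2.0 * width**2))
        return nodes + lr * pull[:, None] * (point - nodes)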
From the successes obtained in using K-means and group-average-based clustering algorithms, as well as other similar algorithms (76,77), it appears that clusters in gene expression datasets are globular in nature. This should not be surprising, as researchers are often interested in obtaining clusters whose genes have similar expression patterns/profiles. Such a requirement automatically lends itself to globular clusters, in which the pairwise similarity between most object pairs is quite high. However, as the dimensionality of these datasets continues to increase (primarily by increasing the number of conditions that are analyzed), requiring consistency across the entire set of conditions becomes unrealistic. Many new algorithms have been proposed recently to tackle the problem of clustering gene expression data with high dimensionality, among which global dimension reduction (78,79) and finding clusters in subspaces (i.e., subsets of dimensions) (80–84) are two widely used techniques.

The basic idea of global dimension reduction is to compress the entire gene/condition matrix so as to represent the genes by vectors in a compressed space of low dimensionality, such that the biologically interesting results can be extracted (78,79,85). Horn and Axel (78) and Ding et al. (79) proposed to use singular value decomposition (SVD) to compress the data and then apply traditional clustering algorithms (such as K-means). They also showed that their algorithms can find meaningful clusters in cancer cells (86), a leukemia dataset (87), and a yeast cell cycle dataset (88). Similar ideas have been applied to other high-dimensional clustering problems, such as text clustering, and have been shown to be effective as well (89).
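A minimal sketch of this two-step approach (not the specific algorithms of refs. 78 and 79) might look as follows in Python; the numbers of retained singular components and of clusters are hypothetical parameters:

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def svd_reduce_and_cluster(expr, n_components=5, k=10):
        # expr: genes x conditions matrix
        centered = expr - expr.mean(axis=0)
        u, s, vt = np.linalg.svd(centered, full_matrices=False)
        reduced = u[:, :n_components] * s[:n_components]  # genes in compressed space
        _, labels = kmeans2(reduced, k, minit='++')       # K-means in the reduced space
        return labels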
Finding clusters in subspaces tackles this problem differently, by redefining the problem of clustering as that of finding clusters whose internal similarities become apparent in subspaces, or clusters


that preserve certain expression patterns among the dimensions in subspaces (80–84). The various algorithms differ from one another in how they model the desired clusters, in the optimization and clustering algorithms that generate the desired clusters, and in whether they allow genes to belong to more than one cluster (i.e., overlapping clusters). Cheng and Church (81) assume that each expression value in the matrix is the sum of three components: the background level, the row effect, and the column effect. Thus, they use the minimum mean squared residue as the objective function to find clusters in subspaces that have small deviations with respect to the rows in the cluster, the columns in the subspace, and the background defined by the cluster.
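A minimal Python sketch of this score for a candidate submatrix is shown below; it transcribes only the residue itself, not the full greedy search procedure of ref. 81:

    import numpy as np

    def mean_squared_residue(sub):
        # sub: expression values restricted to one candidate bicluster
        row_mean = sub.mean(axis=1, keepdims=True)  # row effect
        col_mean = sub.mean(axis=0, keepdims=True)  # column effect
        all_mean = sub.mean()                       # background level
        residue = sub - row_mean - col_mean + all_mean
        return (residue ** 2).mean()                # small = coherent bicluster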
The Plaid model (82) assumes that the entire matrix is a sum of overlapping layers (i.e., clusters) and a global background, to better handle overlapping clusters. The goal is then to estimate the mean expression values of each layer and the probabilities of each entry in the matrix belonging to each layer, such that the calculated expression values are most consistent with the observed ones. Both algorithms (81,82) use a greedy algorithm to find clusters one by one. In particular, after finding one cluster, the entries that participate in that cluster are adjusted to eliminate its effect. The algorithm iterates until K (a user-defined parameter) clusters are found or no further significant cluster can be found. The two algorithms just mentioned treat rows and columns equally during the clustering process, and investigators often refer to this type of clustering algorithm as biclustering (81–83). Another type of algorithm for finding clusters in subspaces (projected clustering algorithms; 80,84) treats the objects to be clustered and the relevant dimensions differently, and often finds disjoint clusters. PROCLUS (80) is one such algorithm. Its overall clustering process is similar to the k-medoids method, which assigns each object to its closest cluster medoid. However, when calculating the distance between an object and its medoid, PROCLUS selects the most conserved l (a user-defined parameter) dimensions for each medoid based on its nearest neighbors, and only uses this set of dimensions to calculate the distances. The HARP algorithm is an agglomerative projected clustering algorithm (84). It evaluates the quality of a cluster as the sum of the relevance index of each dimension, where the relevance index is defined by comparing the local variance (i.e., among the objects in the cluster) with the global variance of that dimension. The dimensions with high relevance index values are the “signature” dimensions for identifying the members of the cluster. At each step, HARP merges the two clusters whose union has the highest relevance index sum. In the end, each cluster can be represented by the dimensions with the highest relevance index values.
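One plausible form of such an index (the exact definition is given in ref. 84) compares the variance of each dimension inside the cluster with its variance over the whole dataset, as in the following Python sketch:

    import numpy as np

    def relevance_index(cluster_expr, expr):
        # cluster_expr: the objects of one cluster (rows) by dimensions
        # expr: the full dataset, providing the global variance per dimension
        local_var = cluster_expr.var(axis=0)
        global_var = np.maximum(expr.var(axis=0), 1e-12)
        return 1.0 - local_var / global_var  # close to 1: a "signature" dimension

    def cluster_quality(cluster_expr, expr):
        # HARP-style quality: the sum of the per-dimension relevance indices
        return relevance_index(cluster_expr, expr).sum()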
Finally, the clustering algorithms discussed so far treat the dimensions (i.e., conditions) independently. In other words, if we permute the dimensions arbitrarily, the clustering result remains the same. Sometimes, especially when the conditions are time points, methods that can capture the temporal relationships between the conditions are more suitable (90–93). Filkov et al. (90) showed that using the correlation coefficient as the similarity measure can be misleading in time-series analysis, and proposed to compare two genes by looking at the degree of agreement between the slopes of their expression levels between consecutive time points. Liu and Muller (91) applied the dynamic time-warping technique to gene expression data. There are also model-based clustering algorithms for temporal datasets, which model each cluster as a stochastic model (92,93). Because model-based clustering is itself a complicated topic and not the focus of this article, we do not discuss these algorithms in detail; readers can refer to ref. 94 for a comprehensive review.
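In the spirit of this slope comparison (though not the exact score of ref. 90), one could, for example, measure how often two expression profiles trend in the same direction between consecutive time points:

    import numpy as np

    def slope_agreement(a, b, times):
        # a, b: expression levels of two genes measured at the same time points
        sa = np.sign(np.diff(a) / np.diff(times))  # up/down trend of gene a
        sb = np.sign(np.diff(b) / np.diff(times))  # up/down trend of gene b
        return np.mean(sa == sb)  # fraction of intervals with agreeing trends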
6. CLUTO: A Clustering Toolkit

We now turn our focus to providing a brief overview of CLUTO (release 2.1), a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters, which has been developed by our group and is available at http://www.cs.umn.edu/~karypis/cluto. CLUTO has been developed as


a general purpose clustering toolkit. CLUTO's distribution consists of stand-alone programs (vcluster and scluster) for clustering and for analyzing the resulting clusters, as well as a library through which an application program can directly access the various clustering and analysis algorithms implemented in CLUTO. WCLUTO (95), a web-enabled version of the stand-alone application CLUTO, designed to apply clustering methods to genomic information, is also available at http://cluto.ccgb.umn.edu. To date, CLUTO has been successfully used to cluster datasets arising in many diverse application areas, including information retrieval, commercial datasets, scientific datasets, and biological applications.

CLUTO implements three different classes of clustering algorithms that can operate either directly in the object's feature space or in the object's similarity space. The clustering algorithms provided by CLUTO are based on the partitional, agglomerative, and graph-partitioning paradigms. CLUTO's partitional and agglomerative algorithms are able to find clusters that are primarily globular, whereas its graph-partitioning algorithms and some of its agglomerative algorithms are capable of finding transitive clusters.

A key feature in most of CLUTO's clustering algorithms is that they treat the clustering problem as an optimization process that seeks to maximize or minimize a particular clustering criterion function, defined either globally or locally over the entire clustering solution space. CLUTO provides a total of seven different criterion functions that have been shown to produce high-quality clusters in low- and high-dimensional datasets. The equations of these criterion functions are shown in Table 1, and they are derived and analyzed in refs. 30 and 96. In addition to these criterion functions, CLUTO provides some of the more traditional local criteria (e.g., single-link, complete-link, and group-average) that can be used in the context of agglomerative clustering.

An important aspect of partitional, criterion-driven clustering algorithms is the method used to optimize the criterion function. CLUTO uses a randomized incremental optimization algorithm that is greedy in nature, has low computational requirements, and produces high-quality clustering solutions (30). Moreover, CLUTO's graph-partitioning-based clustering algorithms use high-quality and efficient multilevel graph-partitioning algorithms derived from the METIS and hMETIS graph- and hypergraph-partitioning algorithms (97,98). CLUTO's algorithms have also been optimized for operating on very large datasets, both in terms of the number of objects and in terms of the number of dimensions. This is especially true for CLUTO's partitional clustering algorithms, which can quickly cluster datasets with several tens of thousands of objects and several thousands of dimensions. Moreover, because most high-dimensional datasets are very sparse, CLUTO directly takes this sparsity into account and requires memory that is roughly linear in the input size.

In the rest of this section we present a short description of CLUTO's stand-alone programs, followed by some illustrative examples of how it can be used for clustering biological datasets.

6.1. Usage Overview

The vcluster and scluster programs are used to cluster a collection of objects into a predetermined number of clusters k. The vcluster program treats each object as a vector in a high-dimensional space, and it computes the clustering solution using one of five different approaches. Four of these approaches are partitional in nature, whereas the fifth is agglomerative. The scluster program, on the other hand, operates on the similarity space between the objects, but can compute the overall clustering solution using the same set of five approaches.

Both the vcluster and scluster programs are invoked by providing two required parameters on the command line, along with a number of optional parameters. Their overall calling sequence is as follows:

vcluster [optional parameters] MatrixFile NClusters
scluster [optional parameters] GraphFile NClusters
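For example, a 10-way clustering of a gene expression matrix with the repeated-bisection method, correlation-based similarity, and the I2 criterion function could be requested as shown below. The file names are hypothetical, and the -option=value syntax assumes the parameter names introduced in this section (consult the CLUTO manual for the authoritative syntax):

    vcluster -clmethod=rb -sim=corr -crfun=i2 genes.mat 10
    scluster -clmethod=graph blast_sim.graph 20

The second command illustrates the similarity-space program, here applied to a hypothetical graph file of pairwise sequence similarities.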


MatrixFile is the name of the file that stores the n objects to be clustered. In vcluster, each of these objects is considered to be a vector in an m-dimensional space. The collection of these objects is treated as an n × m matrix, whose rows correspond to the objects and whose columns correspond to the dimensions of the feature space. Similarly, GraphFile is the name of the file that stores the adjacency matrix of the similarity graph between the n objects to be clustered. The second argument for both programs, NClusters, is the number of clusters that is desired.
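As an illustration, a dense MatrixFile for three objects in four dimensions might look as follows. To the best of our recollection of the CLUTO documentation, the first line lists the number of rows and columns (sparse matrices additionally list the number of non-zeros and store column/value pairs), but the distribution's manual should be consulted for the authoritative format:

    3 4
    0.9 0.1 0.0 2.5
    0.8 0.0 0.1 2.7
    0.1 2.3 2.1 0.0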
Figure 1 shows the output of vcluster for clustering a matrix into 10 clusters. From this figure we see that vcluster initially prints information about the matrix, such as its name, the number of rows (#Rows), the number of columns (#Columns), and the number of non-zeros in the matrix (#NonZeros). Next, it prints the values of the various options that it used to compute the clustering, and the number of desired clusters (#Clusters). Once it computes the clustering solution, it displays information regarding the quality of the overall clustering solution, as well as the quality of each cluster, using a variety of internal quality measures. These measures include the average pairwise similarity among the objects of each cluster together with its standard deviation (ISim and ISdev), and the average similarity between the objects of each cluster and the objects in the other clusters together with its standard deviation (ESim and ESdev). Finally, vcluster reports the time taken by the various phases of the program.

6.2. Summary of Biologically Relevant Features

The behavior of vcluster and scluster can be controlled by specifying more than 30 different optional parameters. These parameters can be broadly categorized into three groups. The first


Fig. 1. Output of vcluster for matrix sports.mat and a 10-way clustering.

group controls various aspects of the clustering algorithm, the second group controls the type of analysis and reporting that is performed on the computed clusters, and the third group controls the visualization of the clusters. Some of the most important parameters are shown in Table 2 and are described, in the context of clustering biological datasets, in the rest of this section.

6.3. Clustering Algorithms

The -clmethod parameter controls the type of algorithm to be used for clustering. The first two methods (i.e., “rb” and “direct”) follow the partitional paradigm described in Subheading 2.1. The difference between them is the method they use to compute the k-way clustering solution. In the case of “rb”, the k-way clustering solution is computed via a sequence of repeated bisections, whereas in the case of “direct”, the entire k-way clustering solution is computed in one step. CLUTO's traditional agglomerative algorithm is implemented by the “agglo” option, whereas the “graph” option implements a graph-partitioning-based clustering algorithm that is well-suited for finding transitive clusters. The method used to define the similarity between the objects is specified by the -sim parameter, which supports the cosine (“cos”), the correlation coefficient (“corr”), and a Euclidean distance-derived similarity (“dist”). The clustering criterion function used by the partitional and agglomerative algorithms is controlled by the -crfun parameter.


Table 2
Key Parameters of CLUTO's Clustering Algorithms

Parameter       Values                                        Function
-clmethod       rb, direct, agglo, graph                      Clustering method
-sim            cos, corr, dist                               Similarity measure
-crfun          I1, I2, E1, G1, G'1, H1, H2, slink, wslink,   Criterion function
                clink, wclink, upgma
-agglofrom      (int)                                         Where to start agglomeration
-fulltree                                                     Builds a tree within each cluster
-showfeatures                                                 Displays each cluster's feature signature
-showtree                                                     Builds a tree on top of the clusters
-labeltree                                                    Provides key features for each tree node
-plottree       (filename)                                    Plots the agglomerative tree
-plotmatrix     (filename)                                    Plots the input matrix
-plotclusters   (filename)                                    Plots the cluster-cluster matrix
-clustercolumn                                                Simultaneously clusters the features

The first seven criterion functions (described in Table 1) are used by both the partitional and agglomerative algorithms, whereas the last five (single-link, weighted-single-link, complete-link, weighted-complete-link, and group-average) are only applicable to agglomerative clustering.

A key feature of CLUTO is that it allows you to combine partitional and agglomerative clustering approaches. This is done via the -agglofrom parameter in the following way. The desired k-way clustering solution is computed by first clustering the dataset into m clusters (m > k), and then using an agglomerative algorithm to group some of these clusters to form the final k-way clustering solution. The number of clusters m is the value supplied to -agglofrom. This approach was motivated by the two-phase clustering approach of the CHAMELEON algorithm (20), and was designed to allow the user to compute a clustering solution that uses a different clustering criterion function for the partitioning phase than for the agglomeration phase. One application of such an approach is to allow the clustering algorithm to find nonglobular clusters. In this case, the partitional clustering solution can be computed using a criterion function that favors globular clusters (e.g., “i2”), and these clusters can then be combined using a single-link approach (e.g., “wslink”) to find nonglobular but well-connected clusters.
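For instance, such a two-phase 10-way clustering could be requested as follows, again with a hypothetical input file; here the dataset is first partitioned into 30 clusters, which are then merged down to the final 10:

    vcluster -clmethod=rb -crfun=i2 -agglofrom=30 genes.mat 10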
6.4. Building a Tree for Large Datasets

Hierarchical agglomerative trees are used extensively in life sciences, as they provide an intuitive way to organize and visualize clustering results. However, such trees have two limitations. First, hierarchical agglomerative clustering may not be the optimal way to cluster data for which there is no biological reason to suggest that the objects are related to each other in a tree fashion. Second, hierarchical agglomerative clustering algorithms have high computational and memory requirements, making them impractical for datasets with more than a few thousand objects.

To address these problems, CLUTO provides the -fulltree option, which can be used to produce a complete tree using a hybrid of the partitional and agglomerative approaches. In particular, when -fulltree is specified, CLUTO builds a complete hierarchical tree that preserves the clustering solution that was computed. In this hierarchical clustering solution, the objects of each cluster form a subtree, and the different subtrees are merged to get an all-inclusive cluster at the end. Furthermore, the individual trees are combined in a meaningful way, so that they accurately represent the similarities within each tree.
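A hedged example of such an invocation, with hypothetical file names and assuming PostScript plot output, combines -fulltree with the tree-plotting option from Table 2:

    vcluster -clmethod=rb -fulltree -plottree=tree.ps genes.mat 10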


Figure 2 shows the trees produced on a sample gene expression dataset. The first tree was obtained using the agglomerative clustering algorithm, whereas the second tree was obtained using the repeated-bisecting method with -fulltree specified.

Fig. 2. (A) The clustering solution produced by the agglomerative method. (B) The clustering solution produced by the repeated-bisecting method and -fulltree.
6.5. Analyzing the Clusters

In addition to the core clustering algorithms, CLUTO provides tools to analyze each of the clusters and to identify the features that best describe and discriminate each one of the clusters. To some extent, these analysis methods try to identify the dense subspaces in which each cluster is formed. This is accomplished by the -showfeatures and -labeltree parameters.
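For example, the feature analysis shown in Fig. 3 can be requested with a command of the following form, where proteins.mat is a hypothetical matrix of protein sequences by 5mers:

    vcluster -showfeatures proteins.mat 10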
Figure 3 shows the output produced by vcluster when -showfeatures was specified for a dataset consisting of protein sequences and the 5mers that they contain. Looking at this figure, we can see that the sets of descriptive and discriminating features are displayed right after the table that provides statistics for the various clusters. For each cluster, vcluster displays three lines of information. The first line contains some basic statistics for the cluster, corresponding to the cluster-id (cid), the number of objects in the cluster (Size), the average pairwise similarity between the cluster's objects (ISim), and the average pairwise similarity to the rest of the objects (ESim). The second line contains the five most descriptive features, whereas the third line contains the five most discriminating features. The features in these lists are sorted in decreasing descriptive or discriminating order.

Right next to each feature, vcluster displays a number that, in the case of the descriptive features, is the percentage of the within-cluster similarity that this particular feature can explain. For example, for the 0th cluster, the 5mer “GTSMA” explains 58.5% of the average similarity between the objects of the 0th cluster. A similar quantity is displayed for each one of the discriminating
features; it is the percentage of the dissimilarity between the cluster and the rest of the objects that the feature can explain. In general there is a large overlap between the descriptive and the discriminating features, the only difference being that the percentages associated with the discriminating features are typically smaller than the corresponding percentages of the descriptive features. This is because some of the descriptive features of a cluster may also be present in a small fraction of the objects that do not belong to the cluster.


Fig. 3. Output of vcluster for matrix sports.mat and a 10-way clustering that shows the descriptive and discriminating features of each cluster.


Fig. 4. Various visualizations generated by the -plotmatrix (A) and -plotcluster (B) parameters.


6.6. Visualizing the Clusters

CLUTO's programs can produce a number of visualizations that can be used to see the relationships between the clusters, objects, and features. You have already seen one of them in Fig. 2, which was produced by the -plotmatrix parameter. The same parameter can be used to visualize sparse, high-dimensional datasets. This is illustrated in Fig. 4A for the protein dataset used earlier. As we can see from that plot, vcluster shows the rows of the input matrix reordered in such a way that the rows assigned to each one of the 10 clusters are numbered consecutively. The columns of the displayed matrix are selected to be the union of the most descriptive and discriminating features of each cluster, and are ordered according to the tree produced by an agglomerative clustering of the columns. Also, at the top of each column, the label of each feature is shown. Each non-zero positive element of the matrix is displayed in a shade of red: entries that are bright red correspond to large values, and the brightness of the entries decreases as their values decrease. Note that in this visualization both the rows and the columns have been reordered using a hierarchical tree.

Finally, Fig. 4B shows the type of visualization that can be produced when -plotcluster is specified for a sparse matrix. This plot shows the clustering solution of Fig. 4A by replacing the set of rows in each cluster with a single row that corresponds to the centroid vector of the cluster. The -plotcluster option is particularly useful for displaying very large datasets, as the number of rows in the plot is equal only to the number of clusters.
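Visualizations of this kind can be requested as in the following hedged example, which assumes PostScript output files and reuses the hypothetical proteins.mat from earlier; the option names are those listed in Table 2:

    vcluster -plotmatrix=rows.ps -clustercolumn proteins.mat 10
    vcluster -plotclusters=centroids.ps proteins.mat 10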
7. Future Research Directions in Clustering

Despite the huge body of research in cluster analysis, there are a number of open problems and research opportunities, especially in the context of clustering datasets arising in life sciences. Existing clustering algorithms for sequence and structure datasets operate on the object's similarity space. As discussed in Subheading 2.2., such algorithms are quite limiting, as they cannot scale to very large datasets, cannot be used to find clusters that have conserved features (e.g., sequence or structural motifs), and cannot be used to provide a description, in terms of the objects' native features, of why a set of objects was assigned to the same cluster. The only way to overcome these shortcomings is to develop algorithms that operate directly on either the sequences or the structures. Thus, opportunities for future research can be broadly categorized into three groups: (1) development of computationally efficient and scalable algorithms for large sequence and structure datasets; (2) development of clustering algorithms for sequence and structure datasets that operate directly on the object's native representation; and (3) development of clustering algorithms that can provide concise explanations of the characteristics of the objects that were assigned to each cluster.

References
1. Linial, M., Linial, N., Tishby, J., and Golan, Y. (1997) Global self-organization of all known protein sequences reveals inherent biological structures. J. Mol. Biol. 268, 539–556.
2. Enright, A. J. and Ouzounis, C. A. (2000) GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics 16(5), 451–457.
3. Mian, I. S. and Dubchak, I. (2000) Representing and reasoning about protein families using generative and discriminative methods. J. Comput. Biol. 7(6), 849–862.
4. Spang, R. and Vingron, M. (2001) Limits of homology detection by pairwise sequence comparison. Bioinformatics 17(4), 338–342.
5. Kriventseva, E., Biswas, M., and Apweiler, R. (2001) Clustering and analysis of protein families. Curr. Opin. Struct. Biol. 11, 334–339.
6. Eidhammer, I., Jonassen, I., and Taylor, W. R. (2000) Structure comparison and structure patterns. J. Comput. Biol. 7, 685–716.
7. Shatsky, M., Fligelman, Z. Y., Nussinov, R., and Wolfson, H. J. (2000) Alignment of flexible protein structures, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (Altman, R., et al., eds.), AAAI Press, Menlo Park, CA, pp. 329–343.
research opportunities, especially in the context


8. Lee, R. C. T. (1981) Clustering analysis and its applications, in Advances in Information Systems Science (Toum, J. T., ed.), Plenum, New York.
9. Jain, A. K. and Dubes, R. C. (1988) Algorithms for Clustering Data, Prentice Hall.
10. Cheeseman, P. and Stutz, J. (1996) Bayesian classification (AutoClass): theory and results, in Advances in Knowledge Discovery and Data Mining (Fayyad, U. M., Piatetsky-Shapiro, G., Smith, P., and Uthurusamy, R., eds.), AAAI/MIT Press, pp. 153–180.
11. Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York.
12. Jackson, J. E. (1991) A User's Guide to Principal Components, John Wiley & Sons, New York.
13. Ng, R. and Han, J. (1994) Efficient and effective clustering method for spatial data mining, in Proceedings of the 20th VLDB Conference (Bocca, J. B., Jarke, M., and Zaniolo, C., eds.), San Francisco, CA, pp. 144–155.
14. Berry, M. W., Dumais, S. T., and O'Brien, G. W. (1995) Using linear algebra for intelligent information retrieval. SIAM Rev. 37, 573–595.
15. Zhang, T., Ramakrishnan, R., and Livny, M. (1996) BIRCH: an efficient data clustering method for large databases, in Proceedings of 1996 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 103–114.
16. Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise, in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (Simoudis, E., Han, J., and Fayyad, U., eds.), AAAI Press, Menlo Park, CA, pp. 226–231.
17. Wang, X., Wang, J. T. L., Shasha, D., Shapiro, B., Dikshitulu, S., Rigoutsos, I., and Zhang, K. (1997) Automated discovery of active motifs in three dimensional molecules, in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (Heckerman, D., Mannila, H., Pregibon, D., and Uthurusamy, R., eds.), AAAI Press, Menlo Park, CA, pp. 89–95.
18. Guha, S., Rastogi, R., and Shim, K. (1998) CURE: an efficient clustering algorithm for large databases, in Proceedings of 1998 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 73–84.
19. Guha, S., Rastogi, R., and Shim, K. (1999) ROCK: a robust clustering algorithm for categorical attributes, in Proceedings of the 15th International Conference on Data Engineering, IEEE, Washington-Brussels-Tokyo, pp. 512–521.
20. Karypis, G., Han, E. H., and Kumar, V. (1999) Chameleon: a hierarchical clustering algorithm using dynamic modeling. IEEE Comput. 32(8), 68–75.
21. Jain, A. K., Murty, M. N., and Flynn, P. J. (1999) Data clustering: a review. ACM Comput. Surv. 31(3), 264–323.
22. Han, J., Kamber, M., and Tung, A. K. H. (2001) Spatial clustering methods in data mining: a survey, in Geographic Data Mining and Knowledge Discovery (Miller, H. and Han, J., eds.), Taylor & Francis.
23. MacQueen, J. (1967) Some methods for classification and analysis of multivariate observations, in Proceedings of the 5th Symposium Math. Statist. Prob., pp. 281–297.
24. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. 39.
25. Zahn, K. (1971) Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. C-20, 68–86.
26. Han, E. G., Karypis, G., Kumar, V., and Mobasher, B. (1998) Hypergraph based clustering in high-dimensional data sets: a summary of results. Bull. Tech. Committee Data Eng. 21(1).
27. Strehl, A. and Ghosh, J. (2000) Scalable approach to balanced, high-dimensional clustering of market-baskets, in Proceedings of HiPC, Springer-Verlag, pp. 525–536.
28. Boley, D. (1998) Principal direction divisive partitioning. Data Mining Knowl. Discov. 2(4).
29. Duda, R. O., Hart, P. E., and Stork, D. G. (2001) Pattern Classification, John Wiley & Sons, New York.
30. Zhao, Y. and Karypis, G. (2001) Criterion functions for document clustering: experiments and analysis. Technical report TR #01-40, Department of Computer Science, University of Minnesota, Minneapolis. Available at http://cs.umn.edu/~karypis/publications.
31. Sneath, P. H. and Sokal, R. R. (1973) Numerical Taxonomy, Freeman, London, UK.
32. King, B. (1967) Step-wise clustering procedures. J. Am. Stat. Assoc. 69, 86–101.
33. Johnson, M. S., Sutcliffe, M. J., and Blundell, T. L. (1990) Molecular anatomy: phyletic relationships derived from 3-dimensional structures of proteins. J. Mol. Evol. 30, 43–59.
34. Taylor, W. R., Flores, T. P., and Orengo, C. A. (1994) Multiple protein structure alignment. Protein Sci. 3, 1858–1870.
35. Jonyer, I., Holder, L. B., and Cook, D. J. (2000) Graph-based hierarchical conceptual clustering in structural databases. Int. J. Artific. Intell. Tools.
36. Eddy, S. R. (1996) Hidden Markov models. Curr. Opin. Struct. Biol. 6, 361–365.
37. Eddy, S. R. (1998) Profile hidden Markov models. Bioinformatics 14, 755–763.
38. Aggarwal, C. C., Gates, S. C., and Yu, P. S. (1999) On the merits of building categorization systems by supervised clustering, in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and


Madigan, D., eds.), ACM Press, New York, NY, pp. 352–356.
39. Cutting, D. R., Pedersen, J. O., Karger, D. R., and Tukey, J. W. (1992) Scatter/Gather: a cluster-based approach to browsing large document collections, in Proceedings of the ACM SIGIR, ACM Press, New York, NY, pp. 318–329.
40. Larsen, B. and Aone, C. (1999) Fast and effective text mining using linear-time document clustering, in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and Madigan, D., eds.), ACM Press, New York, NY, pp. 16–22.
41. Steinbach, M., Karypis, G., and Kumar, V. (2000) A comparison of document clustering techniques, in KDD Workshop on Text Mining, ACM Press, New York, NY, pp. 109–110.
42. Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. (1998) Automatic subspace clustering of high dimensional data for data mining applications, in Proceedings of 1998 ACM-SIGMOD International Conference on Management of Data, ACM Press, New York, NY, pp. 94–105.
43. Burdick, D., Calimlim, M., and Gehrke, J. (2001) MAFIA: a maximal frequent itemset algorithm for transactional databases, in Proceedings of the 17th International Conference on Data Engineering, IEEE, Washington-Brussels-Tokyo, pp. 443–452.
44. Nagesh, H., Goil, S., and Choudhary, A. (1999) MAFIA: efficient and scalable subspace clustering for very large data sets. Technical report TR #9906-010, Northwestern University.
45. Huang, Z. (1997) A fast clustering algorithm to cluster very large categorical data sets in data mining, in Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, Arizona.
46. Ganti, V., Gehrke, J., and Ramakrishnan, R. (1999) CACTUS: clustering categorical data using summaries, in Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (Chaudhuri, S. and Madigan, D., eds.), ACM Press, New York, NY, pp. 73–83.
47. Gibson, D., Kleinberg, J., and Raghavan, P. (1998) Clustering categorical data: an approach based on dynamical systems, in Proceedings of the 24th International Conference on Very Large Databases (Gupta, A., Shmueli, O., and Widom, J., eds.), Morgan Kaufmann, San Francisco, CA, pp. 311–323.
48. Ryu, T. W. and Eick, C. F. (1998) A unified similarity measure for attributes with set or bag of values for database clustering, in The 6th International Workshop on Rough Sets, Data Mining and Granular Computing, Research Triangle Park, NC.
49. Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, NY.
50. Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. (1978) A model of evolutionary change in proteins. Atlas Protein Sequence Struct. 5, 345–352.
51. Schwartz, R. M. and Dayhoff, M. O. (1978) Matrices for detecting distant relationships. Atlas Protein Sequence Struct. 5, 353–358.
52. Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10,915–10,919.
53. Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453.
54. Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
55. Lipman, D. J. and Pearson, W. R. (1985) Rapid and sensitive protein similarity searches. Science 227, 1435–1441.
56. Pearson, W. R. and Lipman, D. J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444–2448.
57. Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
58. Yona, G., Linial, N., and Linial, M. (2000) ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res. 28, 49–55.
59. Bolten, E., Schliep, A., Schneckener, S., Schomburg, D., and Schrader, R. (2001) Clustering protein sequences-structure prediction by transitive homology. Bioinformatics 17, 935–941.
60. Johnson, M. S. and Lehtonen, J. V. (2000) Comparison of protein three dimensional structure, in Bioinformatics: Sequences, Structure and Databanks (Higgins, D., ed.), Oxford University Press.
61. Grindley, H. M., Artymiuk, P. J., Rice, D. W., and Willet, P. (1993) Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. J. Mol. Biol. 229, 707–721.
62. Mitchell, E. M., Artymiuk, P. J., Rice, D. W., and Willet, P. (1989) Use of techniques derived from graph theory to compare secondary structure motifs in proteins. J. Mol. Biol. 212, 151–166.
63. Holm, L. and Sander, C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123.
64. Kleywegt, G. J. and Jones, T. A. (1997) Detecting folding motifs and similarities in protein structures. Methods Enzymol. 277, 525–545.
65. Calinski, T. and Harabasz, J. (1974) A dendrite method for cluster analysis. Commun. Stat. 3, 1–27.
66. Ben-Hur, A., Elisseeff, A., and Guyon, I. (2000) A stability based method for discovering structure in clustered data, in Pacific Symposium on Biocomputing, vol. 7, World Scientific Press, Singapore, pp. 6–17.


67. Yeung, K. Y., Haynor, D. R., and Ruzzo, W. L. (2001) Validating clustering for gene expression data. Bioinformatics 17(4), 309–318.
68. Fodor, S. P., Rava, R. P., Huang, X. C., Pease, A. C., Holmes, C. P., and Adams, C. L. (1993) Multiplexed biochemical assays with biological chips. Nature 364, 555–556.
69. Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270.
70. Brazma, A., Robinson, A., Cameron, G., and Ashburner, M. (2000) One-stop shop for microarray data. Nature 403, 699–701.
71. Yang, Y. H., Dudoit, S., Luu, P., and Speed, T. P. (2001) Normalization for cDNA microarray data, in Proceedings of SPIE International Biomedical Optics Symposium (Bittner, M. L., Chen, Y., Dorsel, A. N., and Dougherty, E. R., eds.), vol. 4266, SPIE, Bellingham, WA, pp. 141–152.
72. D'Haeseleer, P., Fuhrman, S., Wen, X., and Somogyi, R. (1998) Mining the gene expression matrix: inferring gene relationships from large scale gene expression data, in Information Processing in Cells and Tissues, Plenum, New York, pp. 203–212.
73. Eisen, M. (2000) Cluster 2.11 and TreeView 1.50. Available at http://rana.lbl.gov/EisenSoftware.htm.
74. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S., and Golub, T. R. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96, 2907–2912.
75. Kohonen, T. (1995) Self-Organizing Maps, Springer-Verlag, Berlin.
76. Ben-Dor, A. and Yakhini, Z. (1999) Clustering gene expression patterns, in Proceedings of the 3rd Annual International Conference on Research in Computational Molecular Biology, ACM Press, New York, NY, pp. 33–42.
77. Shamir, R. and Sharan, R. (2000) CLICK: a clustering algorithm for gene expression analysis, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (Altman, R., et al., eds.), AAAI Press, Menlo Park, CA, pp. 307–316.
78. Zhao, Y. and Karypis, G. (2002) Comparison of agglomerative and partitional document clustering algorithms, in SIAM Workshop on Clustering High-Dimensional Data and Its Applications, Arlington, VA. Also available as technical report #02-014, University of Minnesota.
79. Karypis, G. and Kumar, V. (1998) hMeTiS 1.5: a hypergraph partitioning package. Technical report, Department of Computer Science, University of Minnesota. Available at www-users.cs.umn.edu/~karypis/metis.
80. Karypis, G. and Kumar, V. (1998) MeTiS 4.0: unstructured graph partitioning and sparse matrix ordering system. Technical report, Department of Computer Science, University of Minnesota. Available at www-users.cs.umn.edu/~karypis/metis.
