
Clustering Gene Expression Data

Part 1

CS 838
www.cs.wisc.edu/~craven/cs838.html
Mark Craven
craven@biostat.wisc.edu
April 2001

Clustering Gene Expression Profiles

• given: expression profiles for a set of genes or
experiments/patients (whatever the columns
represent)
• do: organize profiles into clusters such that
– elements in the same cluster are highly similar
to each other
– elements from different clusters have low
similarity to each other

Clustering Gene Expression Profiles
• there are many different clustering algorithms
• other clustering methods have been applied to gene
expression data
– EM with Gaussian clusters [Mjolsness et al. ’99]
– self-organizing maps [Tamayo et al. ’99]
– graph-theoretic algorithms [Ben-Dor & Yakhini ’98,
Hartuv et al. ’99, Sharan & Shamir ’00]
– etc.
• two general types of approaches
– hierarchical
– non-hierarchical

Hierarchical Clustering:
A Dendrogram
• height of bar indicates degree of dissimilarity
within cluster
• leaves represent objects (e.g. genes)

Scotch Whisky Dendrogram

figure from: Lapointe & Legendre, Applied Statistics, 1993

Bottom-Up Hierarchical Clustering


given: a set X = {x_1 ... x_n} of objects
for i := 1 to n do
    c_i := {x_i}    /* each object is initially its own cluster */
C := {c_1 ... c_n}
j := n + 1
while |C| > 1
    /* find most similar pair; argmax is over pairs (c_u, c_v) in C, u ≠ v */
    (c_a, c_b) := argmax sim(c_u, c_v)
    c_j := c_a ∪ c_b    /* create a new cluster for pair */
    C := (C − {c_a, c_b}) ∪ {c_j}
    j := j + 1
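
The loop above can be sketched in Python. This is a minimal illustration, not an efficient implementation: it rescans all pairs on every iteration. The function name `agglomerate`, the integer cluster ids, and the `cluster_sim(A, B, sim)` signature are assumptions made for this example. The toy data uses negative distance as the similarity and single link as the cluster similarity.

```python
from itertools import combinations

def agglomerate(objects, sim, cluster_sim):
    """Bottom-up clustering; returns merges as (id_a, id_b, new_id, score)."""
    # each object is initially its own cluster, keyed by an integer id
    clusters = {i: [x] for i, x in enumerate(objects)}
    next_id = len(objects)
    merges = []
    while len(clusters) > 1:
        # find the most similar pair among the current clusters
        a, b = max(
            combinations(clusters, 2),
            key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]], sim),
        )
        score = cluster_sim(clusters[a], clusters[b], sim)
        # replace the pair with their union under a new id
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, next_id, score))
        next_id += 1
    return merges

# toy example: 1-D points, similarity = negative distance, single link
points = [1.0, 1.2, 5.0, 5.1, 9.0]
neg_dist = lambda x, y: -abs(x - y)
single = lambda A, B, sim: max(sim(x, y) for x in A for y in B)
merges = agglomerate(points, neg_dist, single)
for a, b, new_id, score in merges:
    print(f"merge {a} + {b} -> {new_id} (similarity {score:.2f})")
```

The returned merge list records the order in which clusters were joined, which is the information a dendrogram is drawn from.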

Similarity of Two Clusters
• the similarity of two clusters can be determined in
several ways
– single link: similarity of two most similar
members
– complete link: similarity of two least similar
members
– average link: average similarity between
members
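
The three options above might be written as follows. This is a sketch under two assumptions: clusters are plain lists of members, and "average link" means the average over all cross-cluster pairs (the slide does not say which pairs are averaged). The function names are illustrative.

```python
def single_link(A, B, sim):
    """Similarity of the two most similar members (one from each cluster)."""
    return max(sim(x, y) for x in A for y in B)

def complete_link(A, B, sim):
    """Similarity of the two least similar members (one from each cluster)."""
    return min(sim(x, y) for x in A for y in B)

def average_link(A, B, sim):
    """Average similarity over all cross-cluster pairs (assumed definition)."""
    return sum(sim(x, y) for x in A for y in B) / (len(A) * len(B))

# toy example: 1-D points with similarity = negative distance
sim = lambda x, y: -abs(x - y)
A, B = [0.0, 1.0], [3.0, 7.0]
print(single_link(A, B, sim), complete_link(A, B, sim), average_link(A, B, sim))
```

Single link tends to chain clusters together through close neighbors, while complete link favors compact clusters. Average link sits between the two.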

Genome-Wide Cluster Analysis


• Eisen et al., PNAS 1998
• S. cerevisiae (baker’s yeast)
– all genes (~ 6200) on a single array
– measured during several processes
• human fibroblasts
– 8600 human transcripts on array
– measured at 12 time points during serum
stimulation

The Data
• 79 measurements for yeast data
• collected at various time points during
– diauxic shift (shutting down genes for
metabolizing sugars, activating those for
metabolizing ethanol)
– mitotic cell division cycle
– sporulation
– temperature shock
– reducing shock

The Data
• each measurement $G_i$ represents
$$G_i = \log \frac{\mathrm{red}_i}{\mathrm{green}_i}$$
where red is the test expression level, and green is
the reference level for gene G in the i-th
experiment
• the expression profile of a gene is the vector of
measurements across all experiments
$$(G_1, \ldots, G_n)$$
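
As a small illustration, a profile can be computed from paired intensity readings. The slide does not state the base of the logarithm, so base 2 is an assumption here, chosen so that a doubling maps to +1 and a halving to -1. The function name and the intensity values are made up for the example.

```python
import math

def expression_profile(red, green):
    """Log ratio of test (red) to reference (green) intensity per experiment.

    Base 2 is an assumption; the slide only says "log".
    """
    return [math.log2(r / g) for r, g in zip(red, green)]

# hypothetical intensities for one gene across three experiments
red = [200.0, 100.0, 50.0]
green = [100.0, 100.0, 100.0]
profile = expression_profile(red, green)
print(profile)
```

Taking the log makes up- and down-regulation symmetric around 0, the value for a gene whose expression did not change relative to the reference.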

The Data
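• m genes measured in n experiments
$$\begin{pmatrix}
g_{1,1} & \cdots & g_{1,n} \\
g_{2,1} & \cdots & g_{2,n} \\
\vdots & \ddots & \vdots \\
g_{m,1} & \cdots & g_{m,n}
\end{pmatrix}$$
• each row is the vector for a gene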

The Task

• identify genes w/similar profiles
[figure: matrix of $\log(\mathrm{red}_i / \mathrm{green}_i)$ values, with one axis labeled "experiments"]

Gene Similarity Metric
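• to determine the similarity of two genes X, Y
$$S(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - X_{\mathrm{offset}}}{\Phi_X} \right) \left( \frac{Y_i - Y_{\mathrm{offset}}}{\Phi_Y} \right)$$
$$\Phi_G = \sqrt{\sum_{i=1}^{N} \frac{(G_i - G_{\mathrm{offset}})^2}{N}}$$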
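
A direct transcription of the metric into Python could look like the sketch below. The function names `phi` and `similarity` and the default offsets of 0 are assumptions for illustration. A profile with no deviation from its offset would cause a division by zero, which this sketch does not handle.

```python
import math

def phi(g, offset):
    """Root mean squared deviation of profile g from its offset."""
    return math.sqrt(sum((gi - offset) ** 2 for gi in g) / len(g))

def similarity(x, y, x_offset=0.0, y_offset=0.0):
    """S(X, Y): mean product of the offset-shifted, phi-scaled measurements."""
    n = len(x)
    phi_x, phi_y = phi(x, x_offset), phi(y, y_offset)
    return sum(
        ((xi - x_offset) / phi_x) * ((yi - y_offset) / phi_y)
        for xi, yi in zip(x, y)
    ) / n

x = [1.0, 2.0, 3.0]
print(similarity(x, [2.0, 4.0, 6.0]))     # proportional profiles
print(similarity(x, [-1.0, -2.0, -3.0]))  # mirrored profiles
```

The value lies between -1 and 1. Scaling a profile by a positive constant does not change its similarity to other profiles.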

Gene Similarity Metric


• since there is an assumed reference state (the
gene’s expression level didn’t change), $G_{\mathrm{offset}}$ is
set to 0 for all genes
$$S(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i}{\sqrt{\sum_{j=1}^{N} \frac{X_j^2}{N}}} \right) \left( \frac{Y_i}{\sqrt{\sum_{j=1}^{N} \frac{Y_j^2}{N}}} \right)$$

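
To see what fixing the offset at 0 changes, the sketch below compares it with the alternative of using each gene's mean as its offset, which gives the standard Pearson correlation. The function names and the two toy profiles are assumptions for illustration. With offsets of 0, the 1/N factors cancel and S(X, Y) reduces to the dot product divided by the product of the vector norms.

```python
import math

def uncentered_similarity(x, y):
    """S(X, Y) with both offsets set to 0."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / math.sqrt(sum(xi * xi for xi in x) * sum(yi * yi for yi in y))

def pearson(x, y):
    """Same formula, but with each profile's mean used as its offset."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return uncentered_similarity([xi - mx for xi in x], [yi - my for yi in y])

# same rising shape, but one gene is above the reference and one below it
up = [1.0, 2.0, 3.0]
down = [-3.0, -2.0, -1.0]
print(pearson(up, down))                # shape only
print(uncentered_similarity(up, down))  # shape and direction vs. reference
```

Mean-centering treats the two profiles as identical in shape, while the zero offset also takes into account that one gene is induced and the other repressed relative to the reference state.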
7
Dendrogram for Serum
Stimulation of Fibroblasts

[figure: dendrogram with labeled clusters: signaling & angiogenesis, cell cycle, cholesterol biosynthesis]

Eisen et al. Results


• redundant representations of genes cluster together
– but individual genes can be distinguished from
related genes by subtle differences in
expression
• genes of similar function cluster together
– e.g. 126 genes strongly down-regulated in
response to stress

Eisen et al. Results
• 126 genes down-regulated in response to stress
– 112 of the genes encode ribosomal and other
proteins related to translation
– agrees with previously known result that yeast
responds to favorable growth conditions by
increasing the production of ribosomes

Comments on Gene Clustering


• descriptive approach to analyzing data
• can potentially be used to
– gain insight into a gene’s function
– identify classes of genes
