Gene Expression - Microarrays: Misha Kapushesky
Misha Kapushesky
European Bioinformatics Institute, EMBL
Page 297
Outline: microarray data analysis
Gene expression
Microarrays
Preprocessing
normalization
scatter plots
Inferential statistics
t-test
ANOVA
Page 312
Microarrays: tools for gene expression
ostolop@ebi.ac.uk
How it works
Complementary hybridization:
- Put a part of the gene sequence on the array
- Convert mRNA to cDNA using reverse transcriptase
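The two steps above can be sketched in a few lines of Python. This is a simplified illustration, not a model of real hybridization chemistry: the sequences are hypothetical, and the helper names `reverse_transcribe` and `hybridizes` are made up for this example. It assumes the probe on the array carries part of the gene's sense strand and that hybridization requires perfect complementarity.

```python
DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(dna):
    """Return the reverse complement of a DNA string (read 5'->3')."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna.upper()))

def reverse_transcribe(mrna):
    """Return first-strand cDNA (5'->3') for an mRNA given 5'->3'."""
    return reverse_complement(mrna.upper().replace("U", "T"))

def hybridizes(probe, cdna):
    """True if the probe is perfectly complementary to a stretch of the cDNA."""
    return reverse_complement(probe) in cdna

# Hypothetical transcript and a probe taken from the gene's sense strand.
mrna = "AUGGCCAUUGUAAUGGGCCGC"
probe = "GCCATTGTAATG"

cdna = reverse_transcribe(mrna)
print(cdna, hybridizes(probe, cdna))
```

Real arrays tolerate some mismatches and depend on temperature and wash conditions, which this exact-match check ignores.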
Spotted Arrays
• Two channel technology for comparing two
samples – relative measurements
• Two mRNA samples (reference, test) are reverse
transcribed to cDNA, labeled with fluorescent
dyes (Cy3, Cy5) and allowed to hybridize to array
Spotted Arrays
• Read out two images by scanning array with lasers,
one for each dye
Oligonucleotide Arrays
• One channel technology – absolute measurements
• Instead of putting entire genes on array, put multiple
oligonucleotide probes: short, fixed length DNA
sequences (25-60 nucleotides)
• Oligos are synthesized in situ
• Affymetrix uses a photolithography process,
similar to that used to make semiconductor chips
• Other technologies available (e.g. mirror arrays)
Oligonucleotide Arrays
• For each gene, construct a probeset – a set of
n-mers specific to this gene
Advantages of microarray experiments
[Figure: data acquisition → data analysis → data confirmation → biological insight]
Stage 1: Experimental design
Page 314
Stage 2: RNA preparation
(Note that the terms “probe” and “target” may refer to the
element immobilized on the surface of the microarray, or
to the labeled biological sample; for clarity, it may be
simplest to avoid both terms.)
Stage 4: Image analysis
[Figure: B Crystallin is over-expressed in Rett Syndrome; panels: Control, Rett]
Fig. 8.21
Page 319
Stage 5: Microarray data analysis
Hypothesis testing
• How can arrays be compared?
• Which RNA transcripts (genes) are regulated?
• Are differences authentic?
• What are the criteria for statistical significance?
Clustering
• Are there meaningful patterns in the data (e.g. groups)?
Classification
• Do RNA transcripts predict predefined groups, such as
disease subtypes?
Page 318
Stage 6: Biological confirmation
Page 320
Stage 7: Microarray databases
[Figure: eukaryotic genes → PCR products; many different plates containing different genes; for each plate set, many identical replicas]
Microarray Overview II
[Figure: prepare fluorescently labeled probes (control, test) → hybridize, wash → measure fluorescence in 2 channels (red/green) → analyze the data to identify patterns of gene expression]
Affymetrix GeneChip™ Expression Analysis
[Figure: obtain RNA samples (control, test) → prepare fluorescently labeled probes → hybridize and wash chips → scan chips → analyze; probes labeled PM and MM]
Microarray Expression Analysis
[Figure labels: Differential State/Stage Selection; Tissue Selection; RNA Preparation and Labeling; Competitive Hybridization; Spots on an Array; Fluorescence Intensity; Gene Expression Measurement]
Steps in the Process
Select array elements and annotate them
►experimental design
►microarray design
►sample preparation
►hybridization procedures
►image analysis
►controls for normalization
Visit http://www.mged.org
Interpretation of RNA analyses
Microarray data analysis
[Figure: data matrix; axis label: genes (RNA transcript levels)]
Fig. 9.1
Page 333
Microarray data analysis
Preprocessing
[Figure: scatter plots, Brain vs. Astrocyte and Fibroblast vs. Astrocyte; axis: expression level (sample 2), low to high; regions marked up and down]
[Figure: log ratio (up, down) vs. mean log intensity]
You can make these plots in Excel…
[Figure: x-axis log(Cy3*Cy5), range 7 to 14; y-axis ticks -1 to -3]
1. Intensity-dependent structure
2. Data not mean centered at log2(ratio) = 0
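A minimal sketch of the quantities behind such a plot, with hypothetical intensities. It computes M = log2(Cy5/Cy3) and A = mean log2 intensity for each spot, then subtracts the median M so that the log ratios are centered at zero. This global centering addresses only the second observation above; removing the intensity-dependent structure needs a local fit such as LOWESS, which is not implemented here.

```python
import math
import statistics

def ma_values(cy3, cy5):
    """Return (M, A): M = log2(Cy5/Cy3), A = mean of the two log2 intensities."""
    m = [math.log2(r / g) for g, r in zip(cy3, cy5)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for g, r in zip(cy3, cy5)]
    return m, a

def median_center(m):
    """Shift log ratios so that their median is zero."""
    med = statistics.median(m)
    return [x - med for x in m]

# Hypothetical spot intensities: the Cy5 channel is uniformly 1.5x brighter.
cy3 = [200.0, 400.0, 800.0, 1600.0, 3200.0]
cy5 = [300.0, 600.0, 1200.0, 2400.0, 4800.0]

m, a = ma_values(cy3, cy5)
centered = median_center(m)
print(statistics.median(m), statistics.median(centered))
```

Centering in log space treats a two-fold increase and a two-fold decrease symmetrically, which is why the ratios are log-transformed first.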
Ratio Cy3/Cy5 for the same RNA
[Figure: 11ab raw ratios, sorted from least to most expressed; series 11a:ratio and 11b:ratio; y-axis 0 to 2]
LOWESS Results
Affymetrix Chips
Mismatch (MM) probes
[Figure: plots of M vs. A]
After RMA (a normalization procedure), the median is near zero, and skewing is corrected.
Histograms of raw intensity values for 14 arrays (plotted in R) before and after RMA was applied.
RMA can adjust for the effect of GC content
[Figure: log intensity vs. GC content]
Robust multi-array analysis (RMA)
RMA offers a large increase in precision (relative to
Affymetrix MAS 5.0 software).
[Figure: precision (log expression SD) for MAS 5.0 vs. RMA]
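RMA combines background correction, quantile normalization across arrays, and a robust summarization of probe values into one value per probeset. The sketch below covers only the quantile normalization step, in plain Python with hypothetical values; it is not the RMA implementation, and ties are broken by position rather than averaged.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays.

    arrays: list of equal-length lists, one list of intensities per array.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(values) for values in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for values in arrays:
        order = sorted(range(n), key=lambda i: values[i])
        out = [0.0] * n
        for rank, index in enumerate(order):
            out[index] = reference[rank]
        normalized.append(out)
    return normalized

# Two hypothetical arrays measuring the same three probes.
arrays = [[2.0, 4.0, 6.0], [9.0, 3.0, 6.0]]
normalized = quantile_normalize(arrays)
print(normalized)
```

After this step each array has the same set of values, so differences between arrays in overall brightness no longer affect comparisons.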
Also note that some ratios, such as 2.00, appear to be dramatic while
others are not. Some researchers set a cut-off for changes of interest
such as two-fold.
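A two-fold cut-off can be applied symmetrically by working with log2 ratios, so that a ratio of 0.50 counts as a two-fold decrease just as 2.00 counts as a two-fold increase. A small sketch with hypothetical gene names and ratios:

```python
import math

def passes_fold_cutoff(ratio, fold=2.0):
    """True if the ratio is at least `fold` up or at least `fold` down."""
    return abs(math.log2(ratio)) >= math.log2(fold)

# Hypothetical expression ratios (test / reference).
ratios = {"geneA": 2.00, "geneB": 1.10, "geneC": 0.50, "geneD": 0.80}

selected = sorted(g for g, r in ratios.items() if passes_fold_cutoff(r))
print(selected)
```

A fold-change cut-off says nothing about variability between replicates, which is what the inferential statistics in the following slides address.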
[Figure: two comparisons, one labeled "A significant difference" and one labeled "Probably not"]
Inferential statistics
A t-test is a commonly used test statistic to assess
the difference in mean values between two groups.

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}} $$
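The slide does not say how the standard error is estimated. As one possibility, the sketch below uses the unequal-variance (Welch) form, SE = sqrt(s1²/n1 + s2²/n2); this choice and the example values are assumptions made for illustration.

```python
import math
import statistics

def welch_t(group1, group2):
    """t = (mean1 - mean2) / SE, with SE from the two sample variances."""
    mean_difference = statistics.mean(group1) - statistics.mean(group2)
    standard_error = math.sqrt(
        statistics.variance(group1) / len(group1)
        + statistics.variance(group2) / len(group2)
    )
    return mean_difference / standard_error

# Hypothetical log expression values for one gene in two groups.
control = [1.0, 3.0, 5.0]
test = [2.0, 4.0, 6.0]

t = welch_t(test, control)
print(round(t, 4))
```

Turning t into a p-value additionally requires the degrees of freedom and the t distribution, which are left out of this sketch.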
Questions
Data matrix (20 genes and 3 time points from Chu et al., 1998)
Software: S-PLUS package
[Figure: axis label: genes; time points t=0, t=0.5, t=2.0]
Page 355
Distance Is Defined by a Metric
[Figure: y-axis log2(cy5/cy3), range -3 to 3]
Distance Metric:   Euclidean   Pearson*
D                  1.4         -0.05
D                  6.0         +1.00
Distance is Defined by a Metric
[Figure: y-axis log2(cy5/cy3), tick at -2]
Distance Metric:   Euclidean   Pearson*
D                  1.4         -0.90
D                  4.2         -1.00
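The two metrics in these tables can disagree because Euclidean distance responds to the magnitude of the values, while the Pearson correlation responds only to the shape of the profile. A sketch with hypothetical log-ratio profiles (these are not the profiles from the slides):

```python
import math

def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient between two profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical log2(cy5/cy3) profiles over four conditions.
g1 = [0.5, 1.0, 1.5, 1.0]
g2 = [1.0, 2.0, 3.0, 2.0]      # same shape as g1, twice the amplitude
g3 = [-0.5, -1.0, -1.5, -1.0]  # mirror image of g1

print(euclidean(g1, g2), pearson(g1, g2))
print(euclidean(g1, g3), pearson(g1, g3))
```

Here g1 and g2 are separated in Euclidean terms yet perfectly correlated, and g1 and g3 are perfectly anti-correlated. When a distance is needed for clustering, the correlation r is often converted with 1 − r; whether the slides do so is not stated.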
Distance Matrix
Once a distance metric has been selected, the starting point for all
clustering methods is a “distance matrix”.

        Gene1  Gene2  Gene3  Gene4  Gene5  Gene6
Gene1   0      1.5    1.2    0.25   0.75   1.4
Gene2   1.5    0      1.3    0.55   2.0    1.5
Gene3   1.2    1.3    0      1.3    0.75   0.3
Gene4   0.25   0.55   1.3    0      0.25   0.4
Gene5   0.75   2.0    0.75   0.25   0      1.2
Gene6   1.4    1.5    0.3    0.4    1.2    0
The elements of this matrix are the pair-wise distances. Note that the
matrix is symmetric about the diagonal.
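As an illustration of how such a matrix is built, the sketch below computes all pairwise distances for a few hypothetical profiles and checks the two properties noted above: a zero diagonal and symmetry. The metric is passed in as a function; Euclidean distance is used here only as an example.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(profiles, metric):
    """Return (names, matrix) with matrix[i][j] = metric(profile i, profile j)."""
    names = list(profiles)
    matrix = [[metric(profiles[a], profiles[b]) for b in names] for a in names]
    return names, matrix

def is_symmetric(matrix):
    """True if matrix[i][j] equals matrix[j][i] for every pair."""
    size = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(size) for j in range(size))

# Hypothetical expression profiles over three conditions.
profiles = {
    "Gene1": [1.0, 2.0, 3.0],
    "Gene2": [1.0, 2.0, 4.0],
    "Gene3": [-1.0, 0.0, 1.0],
}

names, matrix = distance_matrix(profiles, euclidean)
for name, row in zip(names, matrix):
    print(name, [round(d, 2) for d in row])
```

Because of the symmetry, only the entries above the diagonal need to be stored or computed in practice.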
Agglomerative clustering
[Figure, built up over three slides; axis 0 1 2 3 4: a and b merge into a,b; d and e merge into d,e; c joins d,e to form c,d,e; a,b and c,d,e merge into a,b,c,d,e]
…tree is constructed
Divisive clustering
[Figure, built up over five slides; axis 4 3 2 1 0: a,b,c,d,e splits into a,b and c,d,e; c,d,e splits into c and d,e; d,e splits into d and e; a,b splits into a and b]
…tree is constructed
[Figure: the same tree over a, b, c, d, e read in both directions: agglomerative (axis 0 1 2 3 4) and divisive (axis 4 3 2 1 0)]
Adapted from Kaufman and Rousseeuw (1990)
Agglomerative and divisive clustering sometimes give conflicting results, as shown here.
[Figure: items labeled 1 to 12]
Agglomerative Linkage Methods
Single Linkage
Average Linkage
Complete Linkage
(HCL-6)
Single Linkage
Cluster-to-cluster distance is defined as the minimum distance
between members of one cluster and members of another
cluster. Single linkage tends to create ‘elongated’ clusters with
individual genes chained onto clusters.
[Figure: cluster-to-cluster distance D_AB]
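A minimal sketch of agglomerative clustering with a selectable linkage rule, run on the six-gene distance matrix shown earlier. Single linkage follows the definition above. The average and complete linkage rules are not defined in the surviving slide text, so the sketch assumes their usual definitions (mean and maximum pairwise distance). The function names are made up for this example.

```python
def linkage_distance(dist, cluster_a, cluster_b, method="single"):
    """Cluster-to-cluster distance from a matrix of pairwise distances."""
    pairwise = [dist[i][j] for i in cluster_a for j in cluster_b]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage method: " + method)

def agglomerate(dist, method="single"):
    """Repeatedly merge the two closest clusters; return (height, members) per merge."""
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage_distance(dist, clusters[x], clusters[y], method)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merged = clusters[x] + clusters[y]
        merges.append((d, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges

# Distance matrix for Gene1..Gene6 from the "Distance Matrix" slide.
dist = [
    [0, 1.5, 1.2, 0.25, 0.75, 1.4],
    [1.5, 0, 1.3, 0.55, 2.0, 1.5],
    [1.2, 1.3, 0, 1.3, 0.75, 0.3],
    [0.25, 0.55, 1.3, 0, 0.25, 0.4],
    [0.75, 2.0, 0.75, 0.25, 0, 1.2],
    [1.4, 1.5, 0.3, 0.4, 1.2, 0],
]

single_heights = [height for height, _ in agglomerate(dist, "single")]
complete_heights = [height for height, _ in agglomerate(dist, "complete")]
print(single_heights)
print(complete_heights)
```

With single linkage the merge heights stay low because one close pair is enough to join two clusters, which is the chaining behaviour described above. Complete linkage yields larger heights for the same data. Ties, such as the two distances of 0.25 here, are broken by scan order in this sketch.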
(HCL-7)
Average Linkage
[Figure: cluster-to-cluster distance D_AB]
(HCL-8)
Complete Linkage
[Figure: cluster-to-cluster distance D_AB]
(HCL-9)
Comparison of Linkage Methods
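[Figure: Euclidean distance, angle distance and chord distance between two points A and B, with A′ and B′ also marked; x1 axis (0.5, 1, 1.5) with coordinates a1, b1, a′1, b′1; second axis with coordinates a2, b2, a′2, b′2]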
K-Means/Medians Clustering – 1
[Figure: genes G1 to G13]
• to reduce dimensionality
• to identify outliers
http://www.okstate.edu/artsci/botany/ordinate/PCA.htm
High-throughput methods beyond microarrays
RNA-seq
• Sequencing technology is making fast progress
• Idea: sequencing is so cheap that we can sequence
mRNA molecules directly
RNA-seq
(a) After two rounds of poly(A) selection, RNA is fragmented to an average length of 200 nt by magnesium-catalyzed hydrolysis and then converted into cDNA by random priming. The cDNA is then converted into a molecular library for Illumina/Solexa 1G sequencing, and the resulting 25-bp reads are mapped onto the genome. Normalized transcript prevalence is calculated with an algorithm from the ERANGE package.

(b) Primary data from mouse muscle RNAs that map uniquely in the genome to a 1-kb region of the Myf6 locus, including reads that span introns. The RNA-Seq graph above the gene model summarizes the quantity of reads, so that each point represents the number of reads covering each nucleotide, per million mapped reads (normalized scale of 0–5.5 reads).

(c) Detection and quantification of differential expression. Mouse poly(A)-selected RNAs from brain, liver and skeletal muscle for a 20-kb region of chromosome 10 containing Myf6 and its paralog Myf5, which are muscle specific. In muscle, Myf6 is highly expressed in mature muscle, whereas Myf5 is expressed at very low levels from a small number of cells. The specificity of RNA-Seq is high: Myf6 expression is known to be highly muscle specific, and only 4 reads out of 71 million total liver and brain mapped reads were assigned to the Myf6 gene model.
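The per-nucleotide graph described in (b) can be approximated as follows. This is a sketch under stated assumptions, not the ERANGE algorithm: reads are taken as ungapped with a fixed length, read positions are hypothetical, and reads spanning introns are not handled.

```python
def coverage_per_million(read_starts, read_length, region_length, total_mapped_reads):
    """Reads covering each nucleotide of a region, per million mapped reads."""
    coverage = [0] * region_length
    for start in read_starts:
        # Clip each read to the region boundaries.
        for position in range(max(start, 0), min(start + read_length, region_length)):
            coverage[position] += 1
    scale = 1_000_000 / total_mapped_reads
    return [count * scale for count in coverage]

# Hypothetical 25-bp reads in a 60-nt region, from a library of 2 million mapped reads.
read_starts = [0, 10, 10, 30]
profile = coverage_per_million(read_starts, read_length=25,
                               region_length=60, total_mapped_reads=2_000_000)
print(profile[12], profile[40], profile[58])
```

Dividing by the number of mapped reads makes profiles from libraries of different sequencing depth comparable.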
Acknowledgements
• This presentation uses slides/graphics from:
J. Pevsner (Johns Hopkins, http://www.bioinfbook.org)
J. Quackenbush (DFCI, Harvard)
C. Dewey (Wisconsin, http://www.biostat.wisc.edu/bmi576)