Gene Expression - Microarrays: Misha Kapushesky

Gene Expression - Microarrays
Misha Kapushesky
European Bioinformatics Institute, EMBL
St. Petersburg, Russia

May 2010
Compare gene expression in this cell type…
• …after viral infection
• …relative to a knockout
• …in samples from patients
• …after drug treatment
• …at a later developmental time
• …in a different body region
Gene expression is context-dependent,
and is regulated in several basic ways
• by region (e.g. brain versus kidney)

• in development (e.g. fetal versus adult tissue)
• in dynamic response to environmental signals
(e.g. immediate-early response genes)
• in disease states
• by gene activity
Page 297
Outline: microarray data analysis
Gene expression
Microarrays
Preprocessing
normalization
scatter plots
Inferential statistics
t-test
ANOVA
Exploratory (descriptive) statistics

distances
clustering
principal components analysis (PCA)
Microarrays: tools for gene expression
A microarray is a solid support (such as a membrane

or glass microscope slide) on which DNA of known
sequence is deposited in a grid-like array.
Page 312
Microarrays: tools for gene expression
The most common form of

microarray is used to measure
gene expression. RNA is isolated
from matched samples
of interest. The RNA is typically
converted to cDNA,
labeled with fluorescence (or
radioactivity), then hybridized
to microarrays in order to
measure the expression levels
of thousands of genes.
Measuring RNA abundances
ostolop@ebi.ac.uk
How it works
Complementary hybridization:
- Put a part of the gene sequence on the array
- convert mRNA to cDNA using reverse transcriptase
Spotted Arrays
• Robot puts little spots of DNA on glass slides

• Each spot is a DNA analog of the mRNA we
want to detect
Spotted Arrays
• Two channel technology for comparing two
samples – relative measurements
• Two mRNA samples (reference, test) are reverse
transcribed to cDNA, labeled with fluorescent
dyes (Cy3, Cy5) and allowed to hybridize to array
Spotted Arrays
• Read out two images by scanning array with lasers,
one for each dye
Oligonucleotide Arrays
• One channel technology – absolute measurements
• Instead of putting entire genes on array, put multiple
oligonucleotide probes: short, fixed length DNA
sequences (25-60 nucleotides)
• Oligos are synthesized in situ
• Affymetrix uses a photolithography process,
similar to that used to make semiconductor chips
• Other technologies available (e.g. mirror arrays)
Oligonucleotide Arrays
• For each gene, construct a probeset – a set of
n-mers specific to this gene
Advantages of microarray experiments
Fast            Data on >20,000 transcripts within weeks
Comprehensive   Entire yeast or mouse genome on a chip
Flexible        Custom arrays can be made to represent genes of interest
Easy            Submit RNA samples to a core facility
Cheap?          Chip representing 20,000 genes for $300

Disadvantages of microarray experiments
Cost               ■ Some researchers can’t afford to do appropriate numbers of controls, replicates
RNA significance   ■ The final product of gene expression is protein
                   ■ “Pervasive transcription” of the genome is poorly understood (ENCODE project)
                   ■ There are many noncoding RNAs not yet represented on microarrays
Quality control    ■ Impossible to assess elements on array surface
                   ■ Artifacts with image analysis
                   ■ Artifacts with data analysis
                   ■ Not enough attention to experimental design
                   ■ Not enough collaboration with statisticians
Sample acquisition → Data acquisition → Data analysis → Data confirmation → Biological insight
Stage 1: Experimental design
Stage 2: RNA and probe preparation
Stage 3: Hybridization to DNA arrays

Stage 4: Image analysis
Stage 5: Microarray data analysis
Stage 6: Biological confirmation

Stage 7: Microarray databases
Stage 1: Experimental design
[1] Biological samples: technical and biological replicates:

determine the data analysis approach at the outset
[2] RNA extraction, conversion, labeling, hybridization:

except for RNA isolation, routinely performed at core facilities
[3] Arrangement of array elements on a surface:

randomization can reduce spatially-based artifacts
Page 314
Stage 2: RNA preparation
For Affymetrix chips, need total RNA (about 5 µg)
Confirm purity by running an agarose gel
Measure A260/A280 to confirm purity, quantity
One of the greatest sources of error in microarray
experiments is artifacts associated with RNA isolation;
an appropriately balanced, randomized experimental
design is necessary.
Stage 3: Hybridization to DNA arrays
The array consists of cDNA or oligonucleotides
Oligonucleotides can be deposited by photolithography
The sample is converted to cRNA or cDNA
(Note that the terms “probe” and “target” may refer to the
element immobilized on the surface of the microarray, or
to the labeled biological sample; for clarity, it may be
simplest to avoid both terms.)
Stage 4: Image analysis
RNA transcript levels are quantitated
Fluorescence intensity is measured with a

scanner.
Differential Gene Expression on a cDNA Microarray
[Figure: control versus Rett syndrome cDNA microarrays; αB-crystallin is over-expressed in Rett syndrome. Fig. 8.21, Page 319]
Stage 5: Microarray data analysis
Hypothesis testing
• How can arrays be compared?
• Which RNA transcripts (genes) are regulated?
• Are differences authentic?
• What are the criteria for statistical significance?
Clustering
• Are there meaningful patterns in the data (e.g. groups)?
Classification
• Do RNA transcripts predict predefined groups, such as
disease subtypes?
Page 318
Stage 6: Biological confirmation
Microarray experiments can be thought of as

“hypothesis-generating” experiments.
The differential up- or down-regulation of specific RNA

transcripts can be measured using independent assays
such as
-- Northern blots
-- reverse transcription polymerase chain reaction (RT-PCR)
-- in situ hybridization
Page 320
Stage 7: Microarray databases
There are two main repositories:
Gene Expression Omnibus (GEO) at NCBI
ArrayExpress at the European Bioinformatics

Institute (EBI)
Microarray Overview I
[Figure: for microbial ORFs, design PCR primers; for eukaryotic genes, select cDNA clones. The resulting PCR products are held in microtiter plates (many different plates containing different genes) and spotted onto microarray slides (with 60,000 or more spotted genes); for each plate set, many identical replicas are made.]
Microarray Overview II
[Figure: prepare fluorescently labeled probes from control and test samples; hybridize and wash; measure fluorescence in 2 channels (red/green); analyze the data to identify patterns of gene expression.]
Affymetrix GeneChip™ Expression Analysis
[Figure: obtain RNA samples (control, test); prepare fluorescently labeled probes; hybridize and wash chips; scan chips; analyze. Probe pairs are labeled PM and MM.]
Microarray Expression Analysis
[Figure: tissue selection and differential state/stage selection; RNA preparation and labeling; competitive hybridization; the fluorescence intensity of spots on an array gives the gene expression measurement.]
Steps in the Process
Select array elements and annotate them
Build a database to manage stuff
Print arrays and manage the lab
Hybridize and analyze images; manage data
Analyze hybridization data and get results

MIAME
In an effort to standardize microarray data presentation

and analysis, Alvis Brazma and colleagues at 17
institutions introduced Minimum Information About a
Microarray Experiment (MIAME). The MIAME framework
standardizes six areas of information:
►experimental design
►microarray design
►sample preparation
►hybridization procedures
►image analysis
►controls for normalization
Visit http://www.mged.org
Interpretation of RNA analyses
The relationship of DNA, RNA, and protein:

DNA is transcribed to RNA. RNA quantities and
half-lives vary. There tends to be a low positive
correlation between RNA and protein levels.
The pervasive nature of transcription:

The Encyclopedia of DNA Elements (ENCODE)
project identified functional features of genomic
DNA, initially in 30 megabases (1% of the human
genome). One of its observations was the
“pervasive nature of transcription”: the vast majority
of DNA is transcribed, although the function is
unknown.
Microarray data analysis
• begin with a data matrix (gene expression values versus samples)
[Figure: data matrix with genes (RNA transcript levels) as rows and samples as columns. Typically, there are many genes (>> 20,000) and few samples (~ 10). Fig. 9.1, Page 333]
• the matrix then passes through preprocessing, followed by inferential statistics and descriptive statistics

Microarray data analysis: preprocessing
Observed differences in gene expression could be

due to transcriptional changes, or they could be
caused by artifacts such as:
• different labeling efficiencies of Cy3, Cy5

• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency
Microarray data analysis: preprocessing
The main goal of data preprocessing is to remove

the systematic bias in the data as completely as
possible, while preserving the variation in gene
expression that occurs because of biologically
relevant changes in transcription.
A basic assumption of most normalization procedures

is that the average gene expression level does not
change in an experiment.
Data analysis: global normalization
Global normalization is used to correct two or more
data sets. In one common scenario, samples are
labeled with Cy3 (green dye) or Cy5 (red dye) and
hybridized to DNA elements on a microarray. After
washing, probes are excited with a laser and detected
with a scanning confocal microscope.
Example: total fluorescence in
Cy3 channel = 4 million units
Cy5 channel = 2 million units
Then the uncorrected ratio for a gene could show
2,000 units versus 1,000 units. This would artifactually
appear to show 2-fold regulation.
Global normalization procedure
Step 1: subtract background intensity values

(use a blank region of the array)
Step 2: globally normalize so that the average ratio = 1

(apply this to 1-channel or 2-channel data sets)
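The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `global_normalize`, the background arguments, and the choice to equalize the total fluorescence of the two channels (one simple way of bringing the typical ratio close to 1) are not taken from the slides.

```python
def global_normalize(cy3, cy5, background_cy3=0.0, background_cy5=0.0):
    """Return Cy3/Cy5 ratios after background subtraction and global scaling.

    Hypothetical helper: cy3 and cy5 are lists of spot intensities,
    one value per gene, in the same gene order.
    """
    # Step 1: subtract the background estimate from every spot.
    g = [v - background_cy3 for v in cy3]
    r = [v - background_cy5 for v in cy5]
    # Step 2: rescale Cy5 so that both channels have the same total signal.
    scale = sum(g) / sum(r)
    r_scaled = [v * scale for v in r]
    return [gi / ri for gi, ri in zip(g, r_scaled)]


# Toy data: the Cy3 channel is twice as bright overall as the Cy5 channel.
cy3 = [2000, 1500, 500]
cy5 = [1000, 500, 500]
ratios = global_normalize(cy3, cy5)
print(ratios)
```

In the toy data the first gene reads 2,000 versus 1,000 units, which looks like 2-fold regulation before correction; after scaling, its ratio is 1, and only the other two genes depart from 1.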
Scatter plots
Useful to represent gene expression values from

two microarray experiments (e.g. control, experimental)
Each dot corresponds to a gene expression value
Most dots fall along a line
Outliers represent up-regulated or down-regulated genes

Differential Gene Expression in Different Tissue and Cell Types
[Figure: scatter plots of expression level (sample 1) versus expression level (sample 2), each axis running from low to high, with panels labeled Brain, Fibroblast, and Astrocyte; regions of the plot are marked "up" and "down". A log-log transformation is applied.]
Scatter plots
Typically, data are plotted on log-log coordinates
Visually, this spreads out the data and offers symmetry
time   behavior      raw ratio value   log2 ratio value
t=0    basal         1.0               0.0
t=1h   no change     1.0               0.0
t=2h   2-fold up     2.0               1.0
t=3h   2-fold down   0.5               -1.0
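The table values can be reproduced with a short Python sketch. The function names `log2_ratio` and `ma_values` are hypothetical; the second function assumes the common convention of plotting the log ratio against the mean log intensity of the two channels.

```python
import math


def log2_ratio(raw_ratio):
    """Log2-transform a raw expression ratio."""
    return math.log2(raw_ratio)


def ma_values(cy5, cy3):
    """Return (log ratio, mean log intensity) for one spot."""
    m = math.log2(cy5) - math.log2(cy3)          # log ratio
    a = 0.5 * (math.log2(cy5) + math.log2(cy3))  # mean log intensity
    return m, a


for behavior, ratio in [("basal", 1.0), ("2-fold up", 2.0), ("2-fold down", 0.5)]:
    print(behavior, ratio, log2_ratio(ratio))
```

On the log2 scale a 2-fold increase and a 2-fold decrease are equally far from zero (+1 and -1), which is the symmetry the raw ratios (2.0 versus 0.5) lack.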
[Figure: log ratio (up, down) plotted against mean log intensity, with expression level running from low to high]
You can make these plots in Excel…
…but for many bioinformatics applications use R.

Visit http://www.r-project.org to download it.
There are limits to what you
can measure
The Limits of log-ratios: The space we explore
Good Data
Bad Data from Parts Unknown
Each “pin group” is colored differently

Gary Churchill
Lowess Normalization
Why LOWESS?
[Figure, panel A: scatter plot against log(Cy3*Cy5), with the vertical axis running from -3 to 3; SD = 0.346]
1. Intensity-dependent structure
2. Data not mean centered at log2(ratio) = 0
Ratio Cy3/Cy5 for the same RNA
[Figure: raw ratios for arrays 11a and 11b (series 11a:ratio and 11b:ratio), sorted from least to most expressed; vertical axis from 0 to 2]
LOWESS Results
Affymetrix Chips
Mismatch (MM) probes
• MM probes are used to measure background

signals due to non-specific sources and
scanner offset.
• Using an MM probe as an estimate of
background is questionable, because the MM
signal is often >= the PM signal
• Some would claim that subtraction of the
mismatch probe adds noise for little gain.
Computing expression summaries: a
three-step process
• Background/Signal adjustment
• Normalization (can happen at the probe-pair or
the probe-set level).
• Summarization of probe-pairs into probe-set or
gene level information
Background/Signal Adjustment
• A method which does some or all of the following
Corrects for background noise, processing effects
Adjusts for cross hybridization
Adjust estimated expression values to fall on proper scale
• Probe intensities are used in background adjustment

to compute correction (unlike cDNA arrays where area
surrounding spot might be used)
Normalization Methods
• Complete data (no reference chip, information

from all arrays used)
Quantile normalization (Bolstad et al. 2003)
• Baseline (normalized using reference chip)
Scaling (Affymetrix)
Non linear (Li-Wong)
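As an illustration of the "complete data" approach, here is a minimal Python sketch of quantile normalization. It assumes every array has the same number of probes and ignores tied intensities (production implementations typically treat ties more carefully); the function name `quantile_normalize` is hypothetical.

```python
def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    arrays: list of equal-length lists, one list of probe intensities per array.
    Returns new lists; the inputs are not modified.
    """
    n = len(arrays[0])
    # For each array, the probe indices ordered from lowest to highest intensity.
    orders = [sorted(range(n), key=lambda i: a[i]) for a in arrays]
    # Reference distribution: the mean across arrays at each rank.
    reference = [
        sum(a[o[rank]] for a, o in zip(arrays, orders)) / len(arrays)
        for rank in range(n)
    ]
    result = []
    for o in orders:
        out = [0.0] * n
        for rank, idx in enumerate(o):
            out[idx] = reference[rank]  # the probe keeps its rank, takes the reference value
        result.append(out)
    return result


normalized = quantile_normalize([[2.0, 4.0, 6.0], [30.0, 10.0, 20.0]])
print(normalized)
```

No reference chip is needed: the reference distribution is built from all arrays, and each probe keeps its rank within its own array.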
Summarization
• Reduce the 11-20 probe intensities on each array to a

single number for gene expression
• Main Approaches
Single chip
• AvDiff (Affymetrix) – no longer recommended for use due to
many flaws
• MAS 5.0 (Affymetrix) – uses a 1-step Tukey biweight to combine the
probe intensities in log scale
Multiple Chip
• MBEI (Li-Wong dChip) – a multiplicative model
• RMA – a robust multi-chip linear model fit on the log scale
Robust multi-array analysis (RMA)
• Developed by Rafael Irizarry (Dept. of Biostatistics), Terry
Speed, and others
• Available at www.bioconductor.org as an R package
• Also available in various software packages (including
Partek, www.partek.com and Iobion Gene Traffic)
• See Bolstad et al. (2003) Bioinformatics 19;
Irizarry et al. (2003) Biostatistics 4
There are three steps:
[1] Background adjustment based on a normal plus

exponential model (no mismatch data are used)
[2] Quantile normalization (nonparametric fitting of signal
intensity data to normalize their distribution)
[3] Fitting a log scale additive model robustly. The model is
additive: probe effect + sample effect
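Step [3] can be illustrated with Tukey's median polish, a robust way to fit an additive model of the form probe effect + sample effect. The sketch below is written for illustration only and is not the Bioconductor implementation; the function name `median_polish` and the fixed number of sweeps are assumptions.

```python
from statistics import median


def median_polish(log_intensities, iterations=10):
    """Fit log intensity = overall + probe effect + sample effect + residual.

    log_intensities: rows are the probes of one probe set, columns are arrays.
    Returns (overall, probe_effects, sample_effects, residuals).
    """
    resid = [row[:] for row in log_intensities]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(iterations):
        # Sweep out row (probe) medians.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep out column (sample) medians.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid


# Toy probe set: 3 probes on 2 arrays, built as probe effect + sample effect.
data = [[10.0, 12.0], [11.0, 13.0], [12.0, 14.0]]
overall, probe_eff, sample_eff, resid = median_polish(data)
expression = [overall + s for s in sample_eff]  # one summary value per array
print(expression)
```

The per-array summary is the overall level plus the sample effect; because medians are used instead of means, a single aberrant probe has limited influence on it.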
GCRMA
• GC-RMA is a modified version of RMA that models intensity

of probe level data as a function of GC-content
• expect to see higher intensity values for probes that are GC
rich due to increased binding
[Figure: plots of M versus A displaying the effects of normalization]
After RMA (a normalization procedure), the median is near zero,
and skewing is corrected.
Scatterplots display the effects of normalization.
vsn: variance stabilizing normalization
• Variance depends on signal
intensity in microarray data
• A transformation can be found

after which the variance is
approximately constant
• Like the logarithm at the upper
end of the intensity range, approximately
linear at the lower end
• Also incorporates the

estimation of "normalization"
parameters (shift and scale)
• Assumes that less than half of

the genes on the arrays are
differentially transcribed across
the experiment.
vsn: post-normalization plot
[Figure: post-normalization plot]
[Figure: log signal intensity for each array. Histograms of raw intensity values for 14 arrays (plotted in R) before and after RMA was applied.]
RMA can adjust for the effect of GC content
[Figure: log intensity versus GC content]
RMA offers a large increase in precision (relative to
Affymetrix MAS 5.0 software).
[Figure: precision; log expression SD versus average log expression for MAS 5.0 and RMA]
RMA offers comparable accuracy to MAS 5.0.
[Figure: accuracy; observed log expression versus log nominal concentration]

Inferential statistics are used to make inferences
about a population from a sample.
Hypothesis testing is a common form of inferential

statistics. A null hypothesis is stated, such as:
“There is no difference in signal intensity for the gene
expression measurements in normal and diseased
samples.” The alternative hypothesis is that there
is a difference.
We use a test statistic to decide whether to accept or
reject the null hypothesis. For many applications,
we set the significance level α to 0.05 (reject when p < 0.05).
Analyzing expression data
Question: for each of my 20,000 transcripts, decide
whether it is significantly regulated in some disease.
control disease
[1] Obtain a matrix of genes (rows) and expression values (columns).

Here there are 20,000 rows of genes of which the first six are shown.
There are three control samples and three disease samples. Calculate
the mean value for each gene (transcript) for the controls and the
disease (experimental) samples.
[2] Calculate the ratios of control versus disease.
Also note that some ratios, such as 2.00, appear to be dramatic while
others are not. Some researchers set a cut-off for changes of interest
such as two-fold.
A significant
difference
Probably
not
A t-test is a commonly used test statistic to assess
the difference in mean values between two groups.
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}} $$
Questions
Is the sample size (n) adequate?

Are the data normally distributed?
Is the variance of the data known?
Is the variance the same in the two groups?
Is it appropriate to set the significance level to p < 0.05?
A t-test is a commonly used test statistic to assess
the difference in mean values between two groups.
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}} $$
Notes
• t is a ratio (it thus has no units)

• We assume the two populations are Gaussian
• The two groups may be of different sizes
• Obtain a P value from t using a table
• For a two-sample t test, the degrees of freedom is N - 2.
• For any value of t, P gets smaller as df gets larger
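The notes above correspond to the pooled-variance (Student) two-sample t-test. A minimal Python sketch follows, assuming equal variances in the two groups; the function name `pooled_t` is hypothetical, and the P value would still be looked up from t and the degrees of freedom, as in the notes.

```python
import math
from statistics import mean, variance


def pooled_t(group1, group2):
    """Return (t, degrees of freedom) for a two-sample t-test with pooled variance."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2  # N - 2, where N is the total number of samples
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # standard error of the difference
    t = (mean(group1) - mean(group2)) / se
    return t, df


# Toy expression values for one transcript: three controls, three disease samples.
control = [1.0, 2.0, 3.0]
disease = [4.0, 5.0, 6.0]
t, df = pooled_t(control, disease)
print(round(t, 3), df)
```

The two groups may differ in size; only the group means, the sample variances, and the counts enter the calculation.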
[3] Perform a t-test. The hypothesis is that the
transcript in the disease group is up (or down)
relative to controls.
[4] Note the results: you can have…
a small p value (<0.05) with a big ratio difference
a small p value (<0.05) with a trivial ratio difference
a large p value (>0.05) with a big ratio difference
a large p value (>0.05) with a trivial ratio difference
Is it appropriate to set the significance level to p < 0.05?
If you hypothesize that a specific gene is up-regulated,
you can set the probability value to 0.05.
You might measure the expression of 10,000 genes and

hope that any of them are up- or down-regulated. But
you can expect to see 5% (500 genes) regulated at the
p < 0.05 level by chance alone. To account for the
thousands of repeated measurements you are making,
some researchers apply a Bonferroni correction.
The level for statistical significance is divided by the
number of measurements, e.g. the criterion becomes:
p < 0.05/10,000, or p < 5 × 10^-6
The Bonferroni correction is generally considered to be too

conservative.
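The arithmetic behind these numbers, as a short Python sketch (the function names are hypothetical):

```python
def expected_false_positives(alpha, n_tests):
    """Number of tests expected to pass the threshold by chance alone."""
    return alpha * n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


print(expected_false_positives(0.05, 10_000))  # about 500 genes by chance
print(bonferroni_threshold(0.05, 10_000))      # about 5e-06
```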
Inferential statistics: false discovery rate
The false discovery rate (FDR) is a popular correction for
multiple comparisons. A false positive (also called a type
I error) is sometimes called a false discovery.
The expected number of false positives equals the p value
threshold of the t-test times the number of genes measured
(e.g. for 10,000 genes and a p value of 0.01, there are 100
expected false positives). The FDR is the expected fraction of
false discoveries among the transcripts called regulated.
You can adjust the false discovery rate. For example:
FDR    # regulated transcripts   # false discoveries
0.1    100                       10
0.05   45                        3
0.01   20                        1
Would you report 100 regulated transcripts of which 10

are likely to be false positives, or 20 transcripts of which
one is likely to be a false positive?
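One widely used way to control the FDR is the Benjamini-Hochberg step-up procedure. The slides do not name a specific procedure, so the choice here is an assumption, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of the tests called significant at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p value
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p value is under its rank-scaled threshold.
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


p_values = [0.001, 0.008, 0.039, 0.041, 0.9]
significant = benjamini_hochberg(p_values, fdr=0.05)
print(significant)
```

Raising the `fdr` argument admits more transcripts at the price of more expected false discoveries, which is the trade-off posed in the question above.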
Inferential statistics: other methods used
• t-test for two sample groups, SAM and t-tests with
permutation testing
• ANOVA for multiple factors
• Linear models with Bayesian moderation of variance

Smyth G. (2004) “Linear Models and Empirical Bayes Methods for
Assessing Differential Expression in Microarray Experiments”
• Simultaneous inference: multivariate t-distributions for

simultaneous confidence intervals
Hsu et al. (1996) “Multiple Comparisons: Theory and Methods”
Hsu et al. (2006) “Screening for Differential Gene Expressions from
Microarray Data”
A volcano plot displays both p values and fold change
[Figure: p value (treated versus control) plotted against log fold change (treated/untreated)]

Descriptive statistics
Microarray data are high-dimensional: there are
many thousands of measurements made from a small
number of samples.
Descriptive (exploratory) statistics help you to find

meaningful patterns in the data.
A first step is to arrange the data in a matrix.

Next, use a distance metric to define the relatedness
of the different data points. Two commonly used
distance metrics are:
-- Euclidean distance
-- Pearson coefficient of correlation
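Both metrics are easy to compute for a pair of expression profiles. The sketch below uses plain Python lists; the function names and the toy profiles are illustrative only.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


gene1 = [1.0, 2.0, 3.0]
gene2 = [2.0, 4.0, 6.0]  # same shape as gene1, larger amplitude
gene3 = [3.0, 2.0, 1.0]  # opposite trend to gene1
print(euclidean(gene1, gene2), pearson(gene1, gene2), pearson(gene1, gene3))
```

The two metrics can disagree: gene1 and gene2 are far apart in Euclidean terms yet perfectly correlated, so the choice of metric determines which genes count as similar.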
What is a cluster?
A cluster is a group that has homogeneity

(internal cohesion) and separation (external
isolation). The relationships between objects
being studied are assessed by similarity or
dissimilarity measures.
[Figure: data matrix of 20 genes (rows) and 3 time points (columns; samples) from Chu et al., 1998, shown with a 3D plot whose axes are t=0, t=0.5 and t=2.0. Software: S-PLUS package.]

Descriptive statistics: clustering
Clustering algorithms offer useful visual descriptions
of microarray data.
Genes may be clustered, or samples, or both.
We will next describe hierarchical clustering.

This may be agglomerative (building up the branches
of a tree, beginning with the two most closely related
objects) or divisive (building the tree by finding the
most dissimilar objects first).
In each case, we end up with a tree having branches

and nodes.
Page 355
Distance Is Defined by a Metric
[Figure: expression profiles plotted as log2(cy5/cy3), vertical axis from -3 to 3]
Distance Metric:   Euclidean   Pearson*
D                  1.4         -0.05
D                  6.0         +1.00
Distance Is Defined by a Metric
[Figure: expression profiles plotted as log2(cy5/cy3)]
Distance Metric:   Euclidean   Pearson(r*-1)
D                  1.4         -0.90
D                  4.2         -1.00
Distance Matrix
Once a distance metric has been selected, the starting point for all
clustering methods is a “distance matrix”
        Gene1   Gene2   Gene3   Gene4   Gene5   Gene6
Gene1   0       1.5     1.2     0.25    0.75    1.4
Gene2   1.5     0       1.3     0.55    2.0     1.5
Gene3   1.2     1.3     0       1.3     0.75    0.3
Gene4   0.25    0.55    1.3     0       0.25    0.4
Gene5   0.75    2.0     0.75    0.25    0       1.2
Gene6   1.4     1.5     0.3     0.4     1.2     0
The elements of this matrix are the pair-wise distances. Note that the
matrix is symmetric about the diagonal.
Agglomerative clustering
[Figure, adapted from Kaufman and Rousseeuw (1990): five objects a, b, c, d, e are merged over steps 0 to 4. First a and b join (a,b), then d and e join (d,e), then c joins d,e (c,d,e), and finally a,b joins c,d,e (a,b,c,d,e). …tree is constructed]
Divisive clustering
[Figure: the same tree is built in the opposite direction over steps 4 to 0. The full set a,b,c,d,e is split successively (c,d,e; then d,e; then a,b) until each object stands alone. …tree is constructed]
[Figure, adapted from Kaufman and Rousseeuw (1990): agglomerative clustering (steps 0 to 4) and divisive clustering (steps 4 to 0) shown on the same tree]
Agglomerative and divisive clustering sometimes give conflicting
results, as shown here
[Figure]
Agglomerative Linkage Methods
Linkage methods are rules or metrics that return

a value that can be used to determine which
elements (clusters) should be linked.
Three linkage methods that are commonly used

are:
Single Linkage
Average Linkage
Complete Linkage
(HCL-6)
Single Linkage
Cluster-to-cluster distance is defined as the minimum distance
between members of one cluster and members of another
cluster. Single linkage tends to create ‘elongated’ clusters with
individual genes chained onto clusters.
$$ D_{AB} = \min_{i,j} \, d(u_i, v_j) $$
where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$
(HCL-7)
Average Linkage
Cluster-to-cluster distance is defined as the average distance

between all members of one cluster and all members of another
cluster. Average linkage has a slight tendency to produce clusters of
similar variance.
$$ D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j) $$
(HCL-8)
Complete Linkage
Cluster-to-cluster distance is defined as the maximum distance
between members of one cluster and members of another
cluster. Complete linkage tends to create clusters of similar size and
variability.
$$ D_{AB} = \max_{i,j} \, d(u_i, v_j) $$
(HCL-9)
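The three linkage rules differ only in how they reduce the set of pairwise distances between two clusters. A minimal Python sketch, in which the function names and the one-dimensional toy clusters are illustrative assumptions:

```python
def pairwise_distances(cluster_a, cluster_b, d):
    """All distances d(u, v) with u in cluster A and v in cluster B."""
    return [d(u, v) for u in cluster_a for v in cluster_b]


def single_linkage(cluster_a, cluster_b, d):
    return min(pairwise_distances(cluster_a, cluster_b, d))


def average_linkage(cluster_a, cluster_b, d):
    distances = pairwise_distances(cluster_a, cluster_b, d)
    return sum(distances) / len(distances)  # divides by N_A * N_B


def complete_linkage(cluster_a, cluster_b, d):
    return max(pairwise_distances(cluster_a, cluster_b, d))


# Toy clusters of one-dimensional "profiles" with absolute difference as the distance.
A = [0.0, 1.0]
B = [3.0, 5.0]
dist = lambda u, v: abs(u - v)
print(single_linkage(A, B, dist), average_linkage(A, B, dist), complete_linkage(A, B, dist))
```

In an agglomerative run, the pair of clusters with the smallest linkage value is merged at each step, so the choice of rule changes which clusters are joined first.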
Comparison of Linkage Methods
Single Average Complete

Two-way
clustering
of genes (y-axis)
and cell lines
(x-axis)
(Alizadeh et al.,
2000)
[Figure: points A = (a1, a2) and B = (b1, b2) and the points A’ = (a’1, a’2) and B’ = (b’1, b’2), plotted on axes x1 and x2, illustrating the Euclidean distance, the angle distance, and the chord distance]
K-Means/Medians Clustering – 1
1. Specify number of clusters, e.g., 5.
2. Randomly assign genes to clusters.

G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13
K-Means/Medians Clustering – 2
3. Calculate mean/median expression profile of each cluster.
4. Shuffle genes among clusters such that each gene is now in the
cluster whose mean expression profile (calculated in step 3) is
the closest to that gene’s expression profile.
G3 G6 G1 G8 G4 G5 G2 G10 G9 G12
G11 G7 G13
5. Repeat steps 3 and 4 until genes cannot be shuffled around any

more, OR a user-specified number of iterations has been
reached.
k-means is most useful when the user has an a priori hypothesis about the
number of clusters the genes should belong to.
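Steps 1 to 5 can be sketched in plain Python. This is a minimal illustration of the K-Means variant only; the function names, the fixed random seed, and the handling of empty clusters (re-seeding from a randomly chosen gene) are assumptions rather than part of the slides.

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_means(profiles, k, max_iter=100, seed=0):
    """Cluster expression profiles; returns one cluster label per gene."""
    rng = random.Random(seed)
    # Step 2: randomly assign genes to clusters.
    labels = [rng.randrange(k) for _ in profiles]
    for _ in range(max_iter):
        # Step 3: mean expression profile of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if not members:
                members = [rng.choice(profiles)]  # assumption: re-seed an empty cluster
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        # Step 4: move each gene to the cluster with the closest mean profile.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        # Step 5: stop when no gene changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels


profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = k_means(profiles, k=2)
print(labels)
```

Cluster numbers are arbitrary: a different seed may swap the labels, which is one reason results are compared across runs by co-membership instead of by label.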
K-Means / K-Medians Support (KMS)
Because of the random initialization of K-Means/K-Medians,
clustering results may vary somewhat between successive runs on
the same dataset. KMS helps us validate the clustering results
obtained from K-Means/K-Medians.
Run K-Means / K-Medians multiple times.
The KMS module generates clusters in which the member genes

frequently group together in the same clusters (“consensus
clusters”) across multiple runs of K-Means / K-Medians.
The consensus clusters consist of genes that clustered together

in at least x% of the K-Means / Medians runs, where x is the
threshold percentage input by the user.
Principal components analysis (PCA)
An exploratory technique used to reduce the

dimensionality of the data set to 2D or 3D
For a matrix of m genes x n samples, create a new

covariance matrix of size n x n
Thus transform some large number of variables into

a smaller number of uncorrelated variables called
principal components (PCs).
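A compact way to see these steps is to build the n x n covariance matrix and extract its leading eigenvector, which is the first principal component. The sketch below uses power iteration in plain Python; it assumes the covariance matrix is not all zeros and that the starting vector is not orthogonal to the leading eigenvector, and the function names are hypothetical. Statistical packages typically use a full eigendecomposition or SVD instead.

```python
import math


def covariance_matrix(data):
    """data: m genes (rows) x n samples (columns). Returns the n x n covariance matrix."""
    m, n = len(data), len(data[0])
    means = [sum(row[j] for row in data) / m for j in range(n)]
    centered = [[row[j] - means[j] for j in range(n)] for row in data]
    return [
        [sum(r[a] * r[b] for r in centered) / (m - 1) for b in range(n)]
        for a in range(n)
    ]


def first_principal_component(cov, iterations=200):
    """Return (variance explained, unit direction) of the first principal component."""
    n = len(cov)
    v = [1.0] * n  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: the variance along the direction v.
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(n)) for a in range(n))
    return eigenvalue, v


# Toy matrix: 4 genes x 2 samples; sample 2 is always twice sample 1.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
cov = covariance_matrix(data)
variance_explained, pc1 = first_principal_component(cov)
print(variance_explained, pc1)
```

In the toy data the two samples are perfectly correlated, so a single component captures all of the variance and the two variables reduce to one.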
Principal components analysis (PCA): objectives
• to reduce dimensionality
• to determine the linear combination of variables
• to choose the most useful variables (features)
• to visualize multidimensional data
• to identify groups of objects (e.g. genes/samples)
• to identify outliers
http://www.okstate.edu/artsci/botany/ordinate/PCA.htm
High-throughput methods beyond microarrays
RNA-seq
• Sequencing technology is making fast progress
• Idea: sequencing is so cheap that we can sequence
mRNA molecules directly
“Digital Gene Expression”
RNA-seq
(a) After two rounds of poly(A) selection, RNA
is fragmented to an average length of 200
nt by magnesium-catalyzed hydrolysis and
then converted into cDNA by random
priming. The cDNA is then converted into a
molecular library for Illumina/Solexa 1G
sequencing, and the resulting 25-bp reads
are mapped onto the genome. Normalized
transcript prevalence is calculated with an
algorithm from the ERANGE package.
(b) Primary data from mouse muscle RNAs
that map uniquely in the genome to a 1-kb
region of the Myf6 locus, including reads
that span introns. The RNA-Seq graph
above the gene model summarizes the
quantity of reads, so that each point
represents the number of reads covering
each nucleotide, per million mapped reads
(normalized scale of 0–5.5 reads).
(c) Detection and quantification of differential
expression. Mouse poly(A)-selected RNAs
from brain, liver and skeletal muscle for a
20-kb region of chromosome 10 containing
Myf6 and its paralog Myf5, which are
muscle specific. In muscle, Myf6 is highly
expressed in mature muscle, whereas Myf5
is expressed at very low levels from a small
number of cells. The specificity of RNA-Seq
is high: Myf6 expression is known to be
highly muscle specific, and only 4 reads out
of 71 million total liver and brain mapped
reads were assigned to the Myf6 gene
model.
RNA-seq
Acknowledgements
• This presentation uses slides/graphics from:
J. Pevsner (Johns Hopkins, http://www.bioinfbook.org)
J. Quackenbush (DFCI, Harvard)
C. Dewey (Wisconsin, http://www.biostat.wisc.edu/bmi576)
