
MicroArrays I:
Introduction to the concept & Background
Genome-wide measurement of gene expression

Joe Assouline
BME
MicroArrays lectures
➢ MicroArrays I: Introduction to the concept & Background
➢ MicroArray II: Analysis, expression, image and normalization strategies
➢ MicroArray III: Probe selection and techniques for utilization
➢ MicroArray IV: Clustering strategies and future applications of arrays (transcriptome)
A growing interest…
Comparing subtractive hybridization and microarray publications in cancer research
Pioneered by Schena et al. 1996 and by Brown, DeRisi, Trent et al. 1996
http://derisilab.ucsf.edu
Traditional Subtractive Hybridization methods
Subtractive Hybridization Method 2
Subtractive Hybridization Method I
SAGE: Serial Analysis of Gene Expression
Serial sequencing determines the absolute abundance of every transcript in the sample.
Gene-specific tags are produced 15 bp at a time and concatenated; the short tags (9 bp) are then isolated and sequenced automatically.
SAGE
➢ Allows quantitative measurement of gene expression for a large number of transcripts
➢ A variety of SAGE libraries are available
• A 9-bp tag can take 4^9 = 262,144 possible sequences
• Tags are mapped to genes using UniGene
• One tag may map to more than one gene; the number of tags in a library is proportional to the amount of mRNA in the biological sample
➢ www.ncbi.nlm.nih.gov/SAGE/
• Allows comparison of gene expression in the various tissues for which SAGE libraries have been generated
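The tag arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names and the toy tag library are hypothetical, and the sequences are made up for illustration.

```python
from collections import Counter


def tag_space(tag_length):
    """Number of distinct tags of a given length over the alphabet A, C, G, T."""
    return 4 ** tag_length


def relative_abundance(tags):
    """Fraction of the library represented by each observed tag."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


# Hypothetical toy library of 9-bp tags (made-up sequences, not real data).
library = ["ACGTACGTA", "ACGTACGTA", "TTTGGGCCC", "ACGTACGTA"]

print(tag_space(9))
print(relative_abundance(library))
```

Because tag counts are proportional to mRNA abundance, the fractions returned by `relative_abundance` stand in for relative transcript levels in the sample.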

[Figure: SAGE data flow. Boxes: library production and sequencing tracking; sequence preprocessing and tag extraction; SAGE library annotation info; SAGE website; tag-to-gene mapping; UniGene project; GenBank; dbEST.]
Other SAGE resources
➢ USAGE is a web-based database for SAGE analyses
• www.cmbi.kun.nl.usage
➢ Human Transcriptome Map, University of Amsterdam (based on 2.4 million SAGE tags)
• www.Bioinfo.amc.uva.nl.HTM
➢ The SAGE site offers data, protocols, SAGE maps, and a bibliography of SAGE publications
• www.sagenet.org
Eukaryotic gene expression domains

A) Co-expression plot for every pair of adjacent genes in a 70-gene region of yeast chromosome 12. Correlation coefficients of expression profiles were measured at 10-minute intervals over 2 mitotic cycles (green = positive; red = negative). Several clusters of co-expressed adjacent genes appear along the diagonal (Cohen et al. 2000).
B) Regions of increased gene expression (RIDGEs) on 3 human chromosomes (see green bars). Expression levels were generated from SAGE analysis of transcripts from 12 tissues and correlated with gene density (black histograms).
Massively Parallel Signature Sequencing (MPSS)
A) Four types of fluorescent oligonucleotides are hybridized to the 5' overhang of DNA strands immobilized on beads
- The most internal base is the most specific
- All possible combinations of A, T, G, C are included at the other positions
- A type IIS endonuclease cleaves the template proximal to the previous cycle, exposing the next nucleotide
B) Beads remain immobilized; a fluorescence image is taken at each of the positions
- Successive images are decoded for tag frequencies
- Output readout: relative abundance of each transcript
Generalities
➢ MicroArrays (MA) measure gene expression
➢ In contrast to EST sequencing projects and SAGE, which allow high-throughput analysis of gene expression, MA are used to assess the differential expression (mRNA abundance) of biological samples.
➢ Functional genomics refers to the large-scale, genome-wide analysis of gene function (in contrast with the study of individual proteins or DNA/RNA). MA hold an important place in functional genomics.
➢ GenBank contains many genes and ESTs of unknown function; detecting them on microarrays offers a potential way to find their function.
➢ However, MA have limitations
MicroArrays Pros & Cons

Advantages        Disadvantages
Fast              Cost
Comprehensive     Unknown significance of RNA
Flexible          Uncertain quality control
Expression Analysis Flow Chart
MicroArray system
Using cDNA 2 dyes: Cy3 and Cy5

Bioinformatics and Data mining


cDNA labelling
Oligonucleotide arrays (Affymetrix)
Image analysis foundations
Oligonucleotide array: Affymetrix
Each gene is measured by comparing hybridization of sample mRNA with probes: 11-20 pairs of oligonucleotides (25 bases each)

PM: perfect match, same as the gene sequence

MM: mismatch, same sequence but with the 13th base changed, which reduces the rate of binding. It controls for experimental variation and nonspecific binding

This results in the reading of two vectors of different intensity (PM and MM)
Definitions I
➢ Probe: a single-stranded DNA oligonucleotide complementary to a specific sequence. GeneChip expression probe arrays use oligo probes that are 25 bases long. The probes are synthesized directly on the surface of the array using photolithography and combinatorial chemistry.
➢ Probe Cell: a single square-shaped feature on an array containing one type of probe. The size varies with the array type, typically 50 or 24 µm. Each probe cell contains millions of probe molecules.
Definitions II
➢ Perfect Match (PM): probes that are designed to be complementary to a reference sequence.
➢ Mismatch (MM): probes that are designed to be complementary to the reference sequence except for a homomeric base mismatch at the central (13th) position. Mismatch probes serve as a control for cross-hybridization.
➢ Probe Pair: two probe cells, a PM and its corresponding MM. On the probe array, a probe pair is arranged with a PM cell directly above the MM cell.
Definitions III
➢ Probe Set: a set of probes designed to detect one transcript. A probe set usually consists of 16-20 probe pairs. For example, a 20 probe pair set is made up of 20 PM and 20 MM cells, for a total of 40 probe cells.
➢ Target: fragmented, biotinylated antisense cRNA prepared from the mRNA to be analyzed. Target molecules are hybridized to the probe array, and the levels of hybridization are measured with the GeneArray scanner after the array is stained with streptavidin-phycoerythrin (SAPE).
Probe Pair Analysis
A probe pair is Positive if:
(1) PM – MM ≥ SDT (statistical difference threshold), and
(2) PM / MM ≥ SRT (statistical ratio threshold)

A probe pair is Negative if:
(1) MM – PM ≥ SDT, and
(2) MM / PM ≥ SRT
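As a rough illustration, the probe pair rule can be written as a small function. The sketch assumes the usual convention that a positive pair has PM exceeding MM by both the difference threshold (SDT) and the ratio threshold (SRT), with the mirror-image test for a negative pair. The function name, the "neutral" label, and the threshold values are hypothetical.

```python
def classify_probe_pair(pm, mm, sdt, srt):
    """Label one probe pair as 'positive', 'negative', or 'neutral'.

    pm, mm: perfect-match and mismatch intensities (assumed > 0).
    sdt, srt: difference and ratio thresholds (illustrative values only).
    """
    if pm - mm >= sdt and pm / mm >= srt:
        return "positive"
    if mm - pm >= sdt and mm / pm >= srt:
        return "negative"
    # Neither rule fired: the pair does not count for or against the transcript.
    return "neutral"


# Hypothetical intensities and thresholds, for illustration only.
print(classify_probe_pair(pm=1000.0, mm=200.0, sdt=100.0, srt=1.5))
print(classify_probe_pair(pm=200.0, mm=1000.0, sdt=100.0, srt=1.5))
print(classify_probe_pair(pm=500.0, mm=480.0, sdt=100.0, srt=1.5))
```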
Absolute Analysis
1. Positive Fraction: # positive probe pairs / # probe pairs used

Positive Fraction = 18 / 20 = 0.9

2. Pos/Neg Ratio: the ratio of positive probe pairs to negative probe pairs in a probe set. It is calculated as follows:

Pos / Neg Ratio = # positive probe pairs / # negative probe pairs

Pos / Neg = 18 / 2 = 9

3. Log Avg Ratio: a metric that describes the hybridization performance of a probe set by determining the ratio of the PM to MM intensities for each probe pair, taking the log of the resulting values, then averaging them across the probe set.

Log Avg Ratio = 10 * [∑log (PM / MM)] / (Pairs in Avg)

Each of the three metrics is used to determine the Absolute Call via a decision matrix, which gives the status of each transcript (Present, Marginal, Absent).
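The three metrics can be computed together in a short sketch. It assumes the log is base 10 (the slide does not state the base) and that every probe pair has already been labeled positive, negative, or neutral. The function name and the intensities are hypothetical; the decision matrix itself is not reproduced here.

```python
import math


def absolute_metrics(calls, pm, mm):
    """Return (positive fraction, pos/neg ratio, log avg ratio) for one probe set.

    calls: per-pair labels, 'positive', 'negative', or 'neutral'.
    pm, mm: per-pair intensities (assumed > 0), same order as calls.
    """
    n_pairs = len(calls)
    n_pos = calls.count("positive")
    n_neg = calls.count("negative")
    positive_fraction = n_pos / n_pairs
    # With no negative pairs the ratio is unbounded; report infinity.
    pos_neg_ratio = n_pos / n_neg if n_neg else float("inf")
    # Assumption: base-10 logarithm.
    log_avg_ratio = 10 * sum(math.log10(p / m) for p, m in zip(pm, mm)) / n_pairs
    return positive_fraction, pos_neg_ratio, log_avg_ratio


# Hypothetical probe set matching the 18-positive, 2-negative example.
calls = ["positive"] * 18 + ["negative"] * 2
pm = [1000.0] * 18 + [100.0] * 2
mm = [100.0] * 18 + [1000.0] * 2
print(absolute_metrics(calls, pm, mm))
```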
Average Difference and Expression Level
The Average Difference (Avg Diff) serves as a relative indicator of the level of expression of a transcript.
➢ It is used to estimate the change in expression of a given gene between two experiments.
➢ It is calculated by taking the difference between the PM and MM of every probe pair and averaging the differences over the entire probe set:

Avg Diff = ∑(PM - MM) / (Pairs in Avg)
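A minimal sketch of the Avg Diff formula follows. The function name and the intensities are hypothetical, and all probe pairs passed in are assumed to be the ones used in the average.

```python
def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs used in the average."""
    diffs = [p - m for p, m in zip(pm, mm)]
    return sum(diffs) / len(diffs)


# Hypothetical intensities for a 4-pair probe set.
pm = [300.0, 500.0, 400.0, 200.0]
mm = [100.0, 100.0, 200.0, 200.0]
print(average_difference(pm, mm))  # 200.0
```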


Increase and Decrease Probe Pairs

Two criteria must be met for a probe pair to show a significant Increase:
(1) (PM - MM)exp - (PM - MM)base > Change Threshold (CT), and
(2) [(PM - MM)exp - (PM - MM)base] / (PM - MM)base > Percent Change Threshold / 100

Decrease:
(1) (PM - MM)base - (PM - MM)exp > Change Threshold (CT), and
(2) [(PM - MM)base - (PM - MM)exp] / (PM - MM)base > Percent Change Threshold / 100
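The two-part test can be sketched as follows. The function name, the return labels, and the threshold values are hypothetical. The handling of a non-positive baseline difference is an assumption of this sketch, since the slide does not say what happens in that case.

```python
def probe_pair_change(pm_exp, mm_exp, pm_base, mm_base,
                      change_threshold, percent_change_threshold):
    """Label one probe pair as 'increase', 'decrease', 'no change', or 'undefined'."""
    d_exp = pm_exp - mm_exp
    d_base = pm_base - mm_base
    if d_base <= 0:
        # Assumption: the percent criterion is not meaningful without a
        # positive baseline difference, so the pair is left undecided.
        return "undefined"
    delta = d_exp - d_base
    fraction = percent_change_threshold / 100
    if delta > change_threshold and delta / d_base > fraction:
        return "increase"
    if -delta > change_threshold and -delta / d_base > fraction:
        return "decrease"
    return "no change"


# Hypothetical intensities and thresholds, for illustration only.
print(probe_pair_change(900.0, 100.0, 300.0, 100.0, 50.0, 80.0))
print(probe_pair_change(250.0, 100.0, 600.0, 100.0, 50.0, 50.0))
```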
Difference Call
The Difference Call Decision Matrix is an algorithm that generates one of five outcomes for every transcript: Increase (I), Marginally Increase (MI), Decrease (D), Marginally Decrease (MD), and No Change (NC). The following four metrics are weighted differently and entered into the Decision Matrix:

1) Max [ Increase / Total , Decrease / Total ]
2) Increase / Decrease Ratio
3) Log Average Ratio Change
4) Dpos-Dneg Ratio
Fold-Change
➢ Because the Avg Diff of a transcript is directly related to its expression level, an estimate of the Fold Change of the transcript between the baseline and experimental samples can be calculated.
➢ The normalized or scaled Avg Diff values are recomputed in both the experimental and baseline data sets to include only probe pairs that are used in both the baseline and experimental arrays.
➢ Then an Avg Diff Change is determined.
➢ The Fold Change is expressed as a positive number when the transcript has increased over its baseline state, and as a negative number when the transcript level declines.
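The slide gives the sign convention for Fold Change but not the formula. The sketch below assumes a plain signed ratio of the two recomputed Avg Diff values, which is one common way to get a positive number for an increase and a negative number for a decrease. It is not presented as the vendor's algorithm; the function name and values are hypothetical.

```python
def signed_fold_change(avg_diff_exp, avg_diff_base):
    """Signed ratio: positive for an increase, negative for a decrease.

    Assumption: both Avg Diff values are positive; the slide does not say
    how zero or negative values are handled.
    """
    if avg_diff_exp <= 0 or avg_diff_base <= 0:
        raise ValueError("this sketch needs positive Avg Diff values")
    if avg_diff_exp >= avg_diff_base:
        return avg_diff_exp / avg_diff_base
    return -(avg_diff_base / avg_diff_exp)


# Hypothetical Avg Diff values.
print(signed_fold_change(400.0, 100.0))   # 4.0
print(signed_fold_change(100.0, 400.0))   # -4.0
```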
Some software available
• Microarray experiments generate lots of data; the collection, management, and analysis of these data is a big challenge
• Commercial and noncommercial solutions exist, including some free open-source software that lets users analyze data with a host of existing tools or develop their own tools.

To name a few:

General
GeneSpring: most comprehensive; clustering, data mining (heat mapping)
GeneLinker Gold: maps/graphics
Spotfire: Venn diagrams, clustering (small stat)

Freeware
dChip: raw data analysis (from Harvard)
GenePublisher: automatic analysis of data from DNA microarrays; normalization, statistics, visualization, transduction pathways, databases
TM4 microarray analysis tools (tigr.org/software/tm4): suite of 4 software packages + MySQL (Spotfinder, MIDAS, MeV, MADAM); Java-based tools developed for 2-color arrays, but many components are useful with Affymetrix GeneChip data

Data mining/management
Affymetrix tools (MAS, MBIE, RMA): for collection, analysis, and more of GeneChip data
GeNet: GeneSpring-integrated database with 4 other bases; management of databases
Stages in an Experiment
Experimental Design (think first)
Choice of the samples and sample size
Assignment of experimental conditions

Signal Extraction
Image analysis
Gene filtering
Probe-level analysis of oligo arrays
Normalization and removal of artifacts (for comparisons across arrays)

Data Analysis
Selection of differentially expressed genes (across experimental conditions)
Clustering and classification of biological samples
Clustering and … of genes

Validation and Interpretation
Comparison across platforms
Use of multiple independent datasets
Experimental paradigm/considerations

Sample Pooling
• Can be used to dilute out individual sample-to-sample variation
• Combining samples may yield enough material to perform the experiment
• Cannot pool if samples were treated differently

Replication
• Always: gives you the ability to do statistics and can aid in data mining
• Reduces the effects of false positives and false negatives
• Costly
Errors and pitfalls
➢ Experimental design is critical
• Adequate number of experimental vs. control samples; use replicates (no magic number)

➢ It is difficult to relate the intensity of gene expression in an experiment to the number of copies of mRNA transcripts in the cell
• Because errors accumulate along the way from sample prep to data acquisition
• (Some attempts at standardization of methods exist, but they are not always applied)

➢ Data analysis: pay attention to global and local background correction
• Each approach has advantages and drawbacks:
  • Unsupervised cluster analysis sacrifices the known classes of samples (e.g., cells derived from patients with different levels of cancer)
  • Supervised analysis makes assumptions about classes, which may be false

➢ Many experimental artifacts
• Skewing of scatter plots because of contaminated samples
• Cluster analysis may show differences between experimental samples (not control vs. experiment) because of the operator's manipulations (day to day)
Power of Replication

➢ Consider 3 experiments in which there are either 0, 1, or 2 replicates.
➢ In each of these experiments there is a treatment and its control.
0 Replicates
Control   Treatment
1         1

➢ 2 chips need to be run to achieve 1 comparison data point
➢ Cannot perform any statistics for significance
➢ False-positive and -negative values are portrayed as genuine
1 Replicate
Control   Treatment
1         1
2         2

➢ 4 chips achieve 4 comparison data points
➢ Can begin to perform statistics for significance
➢ 50% of false-positive and -negative values are considered genuine
2 Replicates
Control   Treatment
1         1
2         2
3         3

➢ 6 chips achieve 9 comparison data points
➢ Can perform more powerful statistics for significance
➢ 33% of false-positive and -negative values are considered genuine
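The chip and comparison counts follow from pairing every control chip with every treatment chip. A short sketch makes the arithmetic explicit; the function name is hypothetical, and "replicates" here means chips run in addition to the first one in each group.

```python
from itertools import product


def comparisons(replicates):
    """All control-vs-treatment chip pairings for a given number of replicates."""
    chips_per_group = replicates + 1
    controls = [f"C{i}" for i in range(1, chips_per_group + 1)]
    treatments = [f"T{j}" for j in range(1, chips_per_group + 1)]
    return [c + t for c, t in product(controls, treatments)]


for r in (0, 1, 2):
    pairs = comparisons(r)
    print(f"{r} replicates: {2 * (r + 1)} chips, {len(pairs)} comparisons")
```

With n chips per group the number of comparisons grows as n × n while the chip count grows only as 2n, which is the sense in which replication pays off.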
Using Replicates To Mine Data

         C1T1  C1T2  C1T3  …  C3T3  No. Inc.  %
Gene 1   I     I     D        D     7         78
Gene 2   I     I     I        I     9         100
Gene 3   I     I     D        D     3         33
Gene 4   D     D     D        D     0         0
Gene 5   I     I     I        D     6         67
Gene 6   I     I     I        I     9         100
Gene 7   I     D     D        I     5         56
Using Replicates To Mine Data (reorganized)

         C1T1  C1T2  C1T3  …  C3T3  No. Inc.  %
Gene 2   I     I     I        I     9         100
Gene 6   I     I     I        I     9         100
Gene 1   I     I     D        D     7         78
Gene 5   I     I     I        D     6         67
Gene 7   I     D     D        I     5         56
Gene 3   I     I     D        D     3         33
Gene 4   D     D     D        D     0         0
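The reorganization amounts to counting Increase calls across the 9 comparisons and sorting. In the sketch below the per-comparison call strings are hypothetical: the slide shows only four of the nine columns, so the strings were chosen only to reproduce the "No. Inc." counts. The function name is also an assumption.

```python
def rank_by_increase(calls_by_gene):
    """Return (gene, number of 'I' calls, percent) sorted from most to fewest."""
    rows = []
    for gene, calls in calls_by_gene.items():
        n_inc = calls.count("I")
        percent = round(100 * n_inc / len(calls))
        rows.append((gene, n_inc, percent))
    # sorted() is stable, so genes with equal counts keep their input order.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical call strings, one character per comparison (C1T1 ... C3T3).
calls_by_gene = {
    "Gene 1": "IIDIIIIID",
    "Gene 2": "IIIIIIIII",
    "Gene 3": "IIDDDDIDD",
    "Gene 4": "DDDDDDDDD",
    "Gene 5": "IIIIDIDID",
    "Gene 6": "IIIIIIIII",
    "Gene 7": "IDDIIDIDI",
}

for gene, n_inc, percent in rank_by_increase(calls_by_gene):
    print(gene, n_inc, percent)
```

Genes called Increase in every comparison rise to the top, while genes with mixed calls, which are more likely to include false positives, sink toward the bottom.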
Overview of data analysis
Assessment of the Gene Expression Profile of Differentiated and Dedifferentiated Human Fetal Chondrocytes by Microarray Analysis
David G. Stokes, Gang Liu, Ibsen B. Coimbra, Sonsoles Piera-Velazquez, Robert M. Crowl, and Sergio A. Jiménez. Arthritis & Rheumatism, Vol. 46, No. 2, February 2002, pp 404–419

Culture conditions: plastic, no coating (dedifferentiated); poly-HEMA (differentiated)

Goals:
➢ 1) Study the changes in patterns of gene expression exhibited by human chondrocytes as they dedifferentiate into fibroblastic cells in culture, in order to better
➢ 2) understand the mechanisms that control this process and its relationship to the phenotypic changes that occur in chondrocytes during the development of osteoarthritis (OA)
Method: culture cells for 11 days, convert mRNA to cDNA, and perform a gene expression analysis using a microarray (UniGEM V) containing 5,000 known human genes and ~3,000 expressed sequence tags (ESTs).
List of 283 genes most frequently expressed
by human fetal epiphyseal chondrocytes in
poly-HEMA culture
Relevant genes that display a >2-fold difference in expression between differentiated and dedifferentiated human fetal chondrocytes (HFCs)

➢ A >2-fold difference in the expression of 62 known genes and 6 ESTs was observed between the two cell types
➢ TWIST and HIF-1 (transcription factor genes) and cadherin 11 (a cellular adhesion protein gene) were markedly regulated in response to differentiation and dedifferentiation.
Northern analysis of mRNA isolated from differentiated (pH) and dedifferentiated (Pl) HFCs.
Chondrocytes (11 days culture)
COL-A gene family and Tenascin involved with dedifferentiation
Application in adult/aging: Osteoarthritis (OA)
In adult normal and OA cartilage and chondrocytes.
At least 3 of the genes regulated in response to dedifferentiation (TWIST, IGF-2, IGFBPs)
Examples of MicroArrays Studies
Sporulation in yeast
During sporulation, genes are induced or repressed

Microarray analysis allows clustering of genes into subclasses
Examples of MicroArrays Studies
Cancer
A) Hierarchical clustering from solid tumor samples
B) Clustering of diffuse B-cell lymphomas
C) Kaplan-Meier plots display survival probability based on the relative level of expression of molecular subtypes
Key Features of Bioinformatics of Microarray
data analysis
Example: GeneSpring software suite
➢ Data Normalization
➢ Data Clustering
➢ 3D Data Visualization
➢ Scripting
➢ Pathway Views
➢ Expression Profile Comparison or Probe Entire Enterprise Repository for Conditions (PEER-C)
➢ Advanced Statistical Tools

A comprehensive suite of normalization options appropriate for different technologies. GeneSpring works with any genomic or proteomic technology that associates numbers with genes. This includes microarrays, Affymetrix chips, Clontech or Research Genetics blots, SAGE and RT-PCR.
Data Normalization
➢ Normalization scenarios are an important part of bioinformatics data management (e.g., GeneSpring has 16 transformations for the creation of scenarios).
➢ Normalization across arrays (a visualization approach over the different samples; creates an artificial reference array)
➢ We used quantile normalization: the intensity distribution of each array is compared with the others and adjusted so that every quantile (e.g., the 50% quantile) has the same value on all arrays.
Affymetrix probe-level data can be analyzed with the affy software package (written in the R statistical language)

➢ Example: in our experiments we want to know the differential expression of RNA in the various experimental conditions
• Normal brain vs. brain tumor
• Glioblastoma vs. astrocytoma
• Brain tumor cells vs. endothelial cells
• From bioreactor experiments

Observed variation includes:
• Sample prep
• Array manufacture and data processing
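Quantile normalization can be sketched in plain Python. This is a simplified illustration, not the routine used by GeneSpring or the affy package: ties are broken by position rather than averaged, and the function name and intensities are hypothetical.

```python
def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    arrays: list of equal-length lists, one list of intensities per array.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity on this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities for 3 probes on 2 arrays.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(arrays))
```

After normalization each array contains the same set of values, so the median and every other quantile agree across arrays, while the rank order of probes within each array is preserved.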
Data Clustering
Algorithms organize genes and samples into groups with similar expression patterns

Hierarchical (nested clusters; tree or dendrogram) and non-hierarchical (finds subgroups) techniques

➢ Uncover patterns in gene expression data
➢ Uncover the relationships between these patterns
➢ Use one or a combination of clustering options to characterize the data: gene trees (hierarchical clustering), experiment trees, self-organizing maps (SOM), and k-means clustering
➢ Use principal components analysis (PCA) to characterize the most significant patterns in a given experiment

Many software packages and tools exist for clustering (software in R and S-Plus available free)
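Of the clustering options listed, k-means is the simplest to sketch. The version below is a minimal illustration: it starts from the first k profiles (an assumption made to keep the run deterministic; real tools usually use random or smarter starts) and uses squared Euclidean distance. Function names and expression profiles are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, iterations=20):
    """Minimal k-means; returns (cluster index per profile, centroids)."""
    # Assumption: deterministic start from the first k profiles.
    centroids = [list(p) for p in profiles[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical expression profiles (rows = genes, columns = conditions).
profiles = [
    [1.0, 2.0, 3.0],   # rising
    [1.1, 2.1, 2.9],   # rising
    [3.0, 2.0, 1.0],   # falling
    [2.9, 2.2, 1.1],   # falling
]
labels, centers = kmeans(profiles, k=2)
print(labels)
```

The two rising profiles end up in one cluster and the two falling profiles in the other, which is the "expression-alike" grouping that the clustering slide describes.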
3D Data Visualization

➢ The 3D scatter plot tool provides in-depth and interactive representations of highly complex data.
➢ Expression data values or multivariate analysis results can be displayed and visualized

Many options are available that use color intensity to represent the expression level
Scripting

➢ Allows custom scripting to automate repetitive analytical tasks. We are currently identifying tasks (e.g., normalization permutations) to assure rapid and reliable routines
➢ Ensures consistency in the analysis process and simplifies data analysis management.
➢ Scripts are interfaced with GeNet and can be combined with basic functions to perform more complex analyses.
Pathway Views

➢ Graphic representation of genes and their expression patterns based on their location within a cellular pathway.
➢ Design pathway diagrams interactively or import them directly
➢ Predict genes associated with discrete steps in the pathway of interest.
Expression Profile Comparison or Probe Entire Enterprise Repository for Conditions (PEER-C)

➢ Search tool designed to explore all of the experiments related to a single genome in the GeNet database. Other databases are available.
➢ Identify target expression patterns and expression profiles within all the normalized sample sets archived in the GeNet database.
➢ Characterize the results of compound screening experiments and patient samples (from various tumors and grown under different conditions)

In our experiments, we are searching and processing microarray data from normal and tumor tissue.
Our goal is to compare gene profiles from various sources with our sampling and analysis strategies.

Note: Our samples are, by and large, collected following enrichment under specific culture conditions.
Advanced Statistical Tools

➢ Analysis of complex data sets.
➢ t-tests and analysis of variance (ANOVA) for reliably identifying differentially expressed genes. Class prediction tools identify genes capable of discriminating between one or more experimental parameters or sample phenotypes.
➢ Groups of genes identified by expression profiling can be further characterized by performing sequence searches for potential regulatory elements.

Lots of software available:
GeneSpring suite
BRB (Array Tools for NCI)
Affymetrix tools
Whitehead Institute Tools
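To make the t-test idea concrete, the sketch below computes a two-sample t statistic for one gene. It uses the Welch (unequal variance) form, which is a choice made for this example and not necessarily what any of the listed packages do. The p-value step is omitted because it needs a t distribution that the Python standard library does not provide. The function name and the expression values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch two-sample t statistic for one gene.

    Each group needs at least two replicates and some within-group spread;
    otherwise the standard error is zero and the statistic is undefined.
    """
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical log-expression values for one gene: control vs. treatment.
control = [1.0, 2.0, 3.0]
treatment = [4.0, 5.0, 6.0]
print(welch_t(control, treatment))
```

A large absolute t value suggests the gene differs between conditions; with thousands of genes tested at once, the resulting p-values would still need a multiple-testing correction.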
Example of clustering with brain tumor samples
➢ Gene expression of at least 10-fold.
➢ The initial evaluation included RNA from HUVEC cells compared to the RNA of tumor sample 1 (560 genes) or tumor sample 2 (508 genes). Compared to one another, there were 280 genes held in common between the two, and 187 of these changes
➢ When 5x over 3000 genes
➢ significance (p-value of 0.05)
[Figure: normalized expression (log scale, 0.01 to 100) by origin: Brain, HUVEC, Tumor-mixed (mixed cell culture), Tumor-pure (pure cell culture).]
Using GeneSpring software by Silicon Genetics for downstream analysis of the microarray data
Genes with significant differential expression
➢ Claudin 5, a protein whose function is critical to the maintenance of the blood-brain barrier (BBB). Our data also demonstrated differences in gene expression of connexin 37, a gap junction protein.

➢ Other genes: endothelial cell-specific molecule 1, endothelial protein C receptor, and von Willebrand factor (known to function specifically in endothelial cells)

➢ Members of the growth factor families and their receptors (e.g., tumor necrosis factor and its receptor, transforming growth factor beta receptor II)

Ongoing work is to determine the levels of expression of key molecules involved in the maintenance of the BBB pathway and of growth factors (TGF, VEGF, TNF)
[Figure: Normalized data from brain tumor samples, clustering. Two plots of normalized expression (log scale, 0.01 to 100) by sample (NB, N Endo, DJM, DJP); Y-axis: NB vs. NEndo vs. tumor; colored by N Endo; gene list: NBvsNEndovstumor (1014).]

[Figure: Venn diagram, assessment of normal brain vs. tumor. Gene lists: NBvsNEndovstumor; NB 5-fold greater than HUVEC or tumor; NB 5-fold less than HUVEC and tumor. Counts shown: 551, 463, and 0 in the remaining regions.]
Repositories for Microarray Data

Name                          Comments                                                                         URL
AMAD                          From Stanford and the University of California at Berkeley and at San Francisco  http://www.microarrays.org/sofrware.html
ArrayExpress                  From Alvis Brazma and colleagues at the EBI                                      http://www.ebi.ac.uk/arrayexpress/
ChipDB                        From the Whitehead Institute                                                     http://young39.wi.mit.edu/chipdb_public/
ExpressDB                     At Harvard; relational database containing yeast RNA expression data             http://arep.med.harvard.edu/ExpressDB/
Gene Director                 From Biodiscovery                                                                http://www.biodiscovery.com
GeNet                         From Silicon Genetics                                                            http://www.sigenetics.com
GeneX                         From NCGR                                                                        http://genex.ncgr.org/
GEO                           Gene Expression Omnibus from NCBI                                                http://www.ncbi.nlm.nih.gov/geo/
GXD                           From the Jackson Laboratory                                                      http://www.informatics.jax.org/
MAdb                          National Cancer Institute                                                        http://madb.nci.nih.gov
MaxdSQL                       University of Manchester                                                         http://www.bioinf.man.ac.uk/microarray/maxd
RAD                           University of Pennsylvania                                                       http://www.cbil.upenn.edu/radZ/servlet
Stanford Microarray Database  Stanford University                                                              http://www.dnachip.org/
Demos
➢ Repositories
➢ AMAD, from Stanford and the University of California at Berkeley and at San Francisco:
http://www.microarrays.org/sofrware.html
http://genome-www5.stanford.edu/cgi-bin/search/QuerySetup.pl
http://genome-www5.stanford.edu/index.shtml
Stanford 43K human cDNA microarray (www.microarray.org/sfgf)

TIGR: http://www.tigr.org/software/

National Cancer Institute: http://nciarray.nci.nih.gov/

The Walter and Eliza Hall Institute of Medical Research, Australia: http://bioinf.wehi.edu.au/folders/suzanne/databases.html
