ExSeq Presentation With Background


Optimization of mRNA sequencing relative to current microarray platforms

May 31, 2011

Why mRNAseq?
There are at least three compelling reasons for performing RNAseq in general, and mRNAseq in particular, vs. the microarray:
Specificity (of what is being measured)
Reduced bias (in batches, in log ratio (FC) estimates, in general)
Sensitivity (on a gene or transcript basis, both detection and differential expression)

Other reasons
Detection of SNVs or other variations
No predetermined transcriptome needs to be available; no probes need to be designed or manufactured
Cost (will soon be equivalent on a per-assay basis with microarrays)

Why mRNAseq?
Specificity

DDR1

Affymetrix Probe set annotation

3' bias of probes creates greater ambiguity in measurement

Why mRNAseq?
Reduced Bias over batches, time, machines, related to processing
CD19 CD8 CD14 CD4 cells

Why mRNAseq?
Reduced Bias related to processing
Client miRNAseq samples were sequenced on 4 different machines at 2 different sites at different times, with no apparent bias in the first principal components
Machines: GAIIx, HS-01, HS-02, HS-IL

Historical Issues in Genome-wide Expression Studies
The MAQC (Microarray Quality Control) effort began in 2005 to address differences seen in differential expression between microarray platforms
The effort was very successful, as many questions were being raised about the reliability/reproducibility of microarray results
The issues with microarray reproducibility in differential expression came down to three main findings:
Poor use/interpretation of statistical methods creates an illusion of discordance
Relying on annotation rather than probe locations is perilous
Each platform has its sweet spot for various performance characteristics

All primary microarray platforms can work well and be generally concordant

Similar Issues with mRNA sequencing


In one sense, next-gen sequencers can be viewed as another platform for (relative) RNA quantitation, but without the bias of predetermined probes
RNA assay performance measures that are important:
Detection and signal (including dynamic range)
Fold change (log ratio or log FC) estimates (biased?)
Differential gene lists (size, uniqueness)
Concordance with reference methods (TaqMan)
Repeatability/reproducibility of these same factors
Technical variability vs. biological variability

Exemplar Sequencing (ExSeq) Experiment


Goal: Compare performance of various mRNA-seq strategies to microarrays in a real-world experimental scenario
Design:
15 breast cancer cell lines, 5 unique lines representing each of 3 breast cancer subtypes
Three independent library preparations of each line using the Illumina TruSeq protocol
45 samples randomized into 7 pools/lanes with 7 barcodes per pool
The 7 pools were run on 2 flow cells (100x100 cycles, i.e., 100PE)
For runs with acceptable quality, reads were randomly collected into sets of 2m, 5m, 10m, 17m, 25m, 33m, and 50m reads
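The pooling step in the design can be sketched in a few lines. The slides do not say how the randomization was performed, so the shuffle-and-deal scheme, the function name `randomize_into_pools`, the sample names, and the seed below are all illustrative assumptions rather than the procedure actually used.

```python
import random

def randomize_into_pools(samples, n_pools=7, max_per_pool=7, seed=0):
    """Shuffle samples, then deal them round-robin into pools (hypothetical scheme)."""
    if len(samples) > n_pools * max_per_pool:
        raise ValueError("more samples than available barcode slots")
    rng = random.Random(seed)          # fixed seed so the layout is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    pools = [[] for _ in range(n_pools)]
    for i, sample in enumerate(shuffled):
        pools[i % n_pools].append(sample)
    return pools

# 15 cell lines x 3 independent library preps = 45 samples (names are made up)
samples = [f"line{line:02d}_prep{prep}" for line in range(1, 16) for prep in range(1, 4)]
pools = randomize_into_pools(samples)
for lane, pool in enumerate(pools, start=1):
    print(f"pool {lane}: {len(pool)} samples")
```

With 45 samples and 7 pools, dealing round-robin leaves every pool with 6 or 7 samples, which fits within the 7 barcodes available per pool.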

ExSeq
Why is our ExSeq experiment designed differently than MAQC?
MAQC had only two biological conditions that were themselves pools from multiple sources / tissues (UHRR and HBRR)
They had technical replicates, not biological replicates

We wanted to test a more realistic scenario using different biological conditions and biological replicates within a condition
Thus, one specimen, one source

Since each specimen is attributable to one source, we can also potentially assess variation (SNV/fusion/etc.) of that source, whereas with MAQC we could not, even if we sequenced the RNA
We can still assess assay repeatability by running independent preps in multiple lanes, but with sequencing we can easily combine output from multiple preps (if there is no bias)

Main Presentation
Sequencing performance relative to sequencing strategy and relative to microarrays using 15 breast cancer cell lines
Illumina HiSeq sequencer (HiSeq)
HG-U133_Plus_2 microarray (Affymetrix)
Human HT-12v4 microarray (Illumina)

Interpretations/Insights from a sequencing experiment over and above microarrays

Raw Sequencing Output - ExSeq


Typically achieving 100-110M PF clusters /lane*

*Illumina HiSeq specification is 60M PF clusters / lane (v2)

EA Pre-processing & QA Methods Implemented


Automated determination of:
Species
Molecule - DNA, RNA or miRNA
Insert size length and variation

Detection of non-uniformity of barcode representation
Alarms for:
percent and number of PF clusters
unexpected base distribution
unusual quality scores by cycle

Automated detection and cleavage of adaptors
Automated detection and trimming of cycles with skewed quality scores or high frequency of Ns
Correction of quality scores based on Phi-X spike-ins (in testing)

Computational Processing overview


Alignment: maximizing unambiguous alignments; alignment of reads that cross exon junctions (Ex: Bowtie, BWA, TopHat)
Abundance estimation: gene or transcript; handling alignments that are ambiguous in the transcriptome (Ex: Cufflinks, MISO, IsoEM, IsoInfer, Rseq, ...)
Normalization of read counts: minimizing bias due to variation in number of clusters available (Ex: total count, upper quartile, quantile, density)
Testing for differential expression: data are not log normal as often observed with microarrays (Ex: Cuffdiff, edgeR, DESeq)
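Two of the normalization options named above, total count and upper quartile, can be illustrated with a small sketch. This is a generic formulation, not the implementation used for the ExSeq analysis; the function names, the rescaling to the mean factor, and the toy count matrix are assumptions made for illustration.

```python
import numpy as np

def total_count_normalize(counts):
    """Rescale each sample (column) so all samples share the same total count."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0)
    return counts * (totals.mean() / totals)

def upper_quartile_normalize(counts):
    """Rescale each sample by the 75th percentile of its counts.

    Transcripts with zero counts in every sample are ignored when computing
    the quartile. Assumes every sample has a non-zero upper quartile.
    """
    counts = np.asarray(counts, dtype=float)
    expressed = counts[counts.sum(axis=1) > 0]
    uq = np.percentile(expressed, 75, axis=0)
    return counts * (uq.mean() / uq)

# Toy matrix (transcripts x samples): sample 2 has exactly twice the depth of sample 1
counts = [[10, 20],
          [30, 60],
          [0, 0],
          [60, 120]]
tc = total_count_normalize(counts)
uq = upper_quartile_normalize(counts)
print(tc)
print(uq)
```

In this toy case both methods remove the depth difference, so the two columns become identical. On real data they differ because a few very abundant transcripts dominate the total count but barely move the upper quartile.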

Alignment
In mRNA-seq and similar sequence counting experiments, the number of unambiguously aligned reads is the driving factor behind most aspects of measurement performance
There are numerous strategies to improve unambiguous alignment:
Longer reads
Paired end vs single end
Alignment strategy - including error tolerance
Reduced complexity of the reference sequence

Alignment Approaches

unaligned ambiguous unambiguous

Default parameters were used for available tools
Reference database and error tolerance were identical for all methods
EA = EA-developed SE alignment strategy
Alignment estimates generated from the Illumina body map data

Unambiguous Alignment in ExSeq


The default TopHat algorithm is based on the SE Bowtie algorithm
EA-TopHat is a hybrid approach which uses the general TopHat algorithm for junction mapping, but is powered by the EA alignment engine
*Alignment estimates generated from the ExSeq data

97%

Detection
Microarray detection is defined by the MAS5 call for Affymetrix and by detection p-value < 0.05 for Illumina
Sequencing detection is defined as greater than 3 counts assigned at the end of abundance estimation
Shared content consists of the set of transcripts (or genes) that are common to all platforms
Unrestricted content allows for any possible detection event under the platform-specific definition

Detection Shared Transcripts

Gray: detected in any sample
Red: detected in >=66% of samples
Detected is defined as >= 3 reads assigned to a transcript

Detection All Transcripts

Gray: detected in any sample
Red: detected in >=66% of samples
Detected is defined as >= 3 reads assigned to a transcript
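The two detection tallies in these figure legends can be computed from a transcript-by-sample count matrix as sketched below. The thresholds (>= 3 reads, >= 66% of samples) come from the legend; the function name, the dictionary keys, and the toy counts are assumptions for illustration only.

```python
import numpy as np

def detection_summary(counts, min_reads=3, min_fraction=0.66):
    """Count transcripts detected in any sample and in >= min_fraction of samples."""
    counts = np.asarray(counts)
    detected = counts >= min_reads            # boolean matrix, transcripts x samples
    fraction = detected.mean(axis=1)          # share of samples detecting each transcript
    return {
        "any_sample": int((fraction > 0).sum()),
        "frequent": int((fraction >= min_fraction).sum()),
    }

# Toy counts: 4 transcripts x 3 samples
counts = [[0, 1, 2],    # never reaches 3 reads
          [3, 0, 0],    # detected in 1 of 3 samples
          [5, 9, 2],    # detected in 2 of 3 samples
          [7, 8, 9]]    # detected in all samples
summary = detection_summary(counts)
print(summary)
```

Here three transcripts are detected in at least one sample (the gray tally) and two are detected in at least 66% of samples (the red tally).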

Transcript Abundance Estimation


There may be few ambiguous alignments to the genome, but a large fraction of ambiguity remains with respect to the transcriptome
Ignoring the ambiguous fraction greatly reduces the read count, and results in greatly reduced repeatability, poorer fold change estimation, and poorer identification of differential expression
Definition of the transcriptome plays an important role, and here we use the UCSC KnownGene table, a combination of RefSeq, GenBank, and UniProt
Many methods are available to intelligently assign ambiguous reads. Results in the remaining slides are from Cufflinks estimation of the KnownGene transcripts.

Magnitude of Fold Change (Log Ratio)


Slope estimates > 1 indicate compression of fold change estimates in array platforms (x-axes) relative to HiSeq (y-axes).
FC (log ratio) estimates are increased for 25M reads (right) relative to 10M reads (left)
r2 values are increased for 25M reads (right) relative to 10M reads (left)

[Figure: scatter plots of log ratio estimates, 10m PE HiSeq (left) and 25m PE HiSeq (right), each against the Affy array and the Illumina array]
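The slope and r2 values discussed for these plots can be obtained from paired log ratio estimates roughly as follows. The slides do not state which regression was used, so ordinary least squares is assumed here, and the function name and toy values are illustrative.

```python
import numpy as np

def compare_log_ratios(array_lfc, seq_lfc):
    """Fit seq_lfc = slope * array_lfc + intercept and return (slope, intercept, r2).

    A slope above 1 means the array log ratios (x) are compressed relative
    to the sequencing log ratios (y).
    """
    x = np.asarray(array_lfc, dtype=float)
    y = np.asarray(seq_lfc, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)     # ordinary least squares, degree 1
    r = np.corrcoef(x, y)[0, 1]                # Pearson correlation
    return slope, intercept, r ** 2

# Toy data: sequencing log ratios are exactly 1.5x the array log ratios
array_lfc = [-2.0, -1.0, 0.0, 1.0, 2.0]
seq_lfc = [1.5 * v for v in array_lfc]
slope, intercept, r2 = compare_log_ratios(array_lfc, seq_lfc)
print(slope, intercept, r2)
```

For this noiseless toy input the slope is 1.5 and r2 is 1; real platform comparisons will show scatter, which lowers r2 without necessarily changing the slope.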

Magnitude of Fold Change

Slopes of the log ratio estimates are observed to be greatly increased in 25M reads versus 10M reads, and modestly increased in PE runs relative to SE runs

Concordance of Fold Change

r2 (r=correlation) values are observed to be modestly increased in 25M reads versus 10M reads

Comparison of Differential Expression


Venn diagrams are used by many to assess concordance of differential expression; however, the choice of threshold for significance varies widely
Concordant/discordant counts were tabulated from 40 combinations of q-value and fold change thresholds (q in 0.01-0.25 and FC in 1.5-8)
For each threshold, the number discordant by platform was normalized to the number concordant

Comparison of Differential Expression


Illustration
With FC > 1.5 and q < 0.01, for a given number of reads:
Platform A: 460    Common: 1000    Platform B: 1270
Summarized values: 0.46, 1, 1.27
Scale the common set to 1
A value of 0 for platform A means that every differential expression call made by A was also made by B
Increasing values indicate an increasing number of significant calls unique to the platform
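The scaling in this illustration is simple arithmetic, sketched below using the slide's own numbers. The helper names `venn_counts` and `normalize_to_common` are hypothetical, and the set-based tally is just one straightforward way to obtain the three Venn counts from two lists of significant transcripts.

```python
def venn_counts(calls_a, calls_b):
    """Return (unique to A, common, unique to B) for two collections of DE calls."""
    a, b = set(calls_a), set(calls_b)
    return len(a - b), len(a & b), len(b - a)

def normalize_to_common(unique_a, common, unique_b):
    """Scale the three Venn counts so that the common set equals 1."""
    if common <= 0:
        raise ValueError("no concordant calls to normalize against")
    return unique_a / common, 1.0, unique_b / common

# Numbers from the illustration: 460 unique to A, 1000 common, 1270 unique to B
scaled = normalize_to_common(460, 1000, 1270)
print(scaled)

# Tiny made-up example of tallying the counts from transcript IDs
print(venn_counts(["t1", "t2", "t3"], ["t2", "t3", "t4", "t5"]))
```

Repeating this for each of the 40 threshold combinations gives one normalized triplet per combination, which is what the later summary plots aggregate.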

Comparison of Differential Expression


For each collection of read depth (10M, 25M, etc.) and strategy (PE, SE), the normalized triplet (Platform A, Common, Platform B) is tabulated for every combination of FC threshold (1.5 to 8) and q threshold (.01, .05, .1, .15, .25), with Common always scaled to 1

[Table layout: one cell per FC and q combination, each holding Platform A, Common = 1, Platform B; values not shown]

Summarization of Differential Expression


(for common content - microarray vs. sequencing)

The number of detected DE transcripts unique to each platform is similar in magnitude to the intersection of both at 10M reads. For 25M reads, sequencing produces a noticeably larger total number of differentially detected transcripts even for common content and the amount unique to each array type (Affy and Illumina) is much smaller than at 10M. Error bars indicate variation across the 40 combinations of FC and q.

HiSeq vs Array Comparative Summary


EA alignment of 10M reads provides performance similar to currently available gene expression microarrays in terms of detection, estimation of fold change, and detection of differential expression for the content assayed by the microarray
With EA alignment of 25M reads, fold change estimates are 75-100% larger, and 2-3x more transcripts are identified as differentially expressed for the content assayed by the microarray
PE provides modest benefits in all aspects of quantification relative to SE
However, mRNAseq also measures RNAs beyond the microarray content: detection is increased 4x relative to microarrays for 25M PF clusters

Effect of Sequencing Parameters


Agreement improves with increasing number of reads (top vs bottom).
25 cycle x 2 is much better than 50 cycle x 1 at recapitulating the results of 50 cycle x 2

[Figure: panels 50b SE vs 50b PE and 25b PE vs 50b PE, at 10m and 25m reads]

Effect of Differential Expression Test

DESeq vs. Cuffdiff
Magnitudes of change are compressed in Cuffdiff relative to DESeq.
DESeq identifies 2-3x more unique transcripts than Cuffdiff across the three comparisons.

Detection of Single Nucleotide Variants

Approximately 1/3 of detected SNVs are not known to dbSNP
At least 1 variant is detected in ~10% of detected transcripts

De novo Transcript Assembly

Cufflinks is used for de novo assembly of transcripts; no transcriptome definition is used
PE provides superior performance to SE in this scenario

Novel Transcript Assembly

Identification of completely novel transcripts: those that are assembled from mRNA, but exist in regions currently annotated as intergenic

Sequencing Parameters and Processing


Good correlation of differential expression is observed between 50 SE and 50 PE. However, 25 PE is superior to 50 SE when 50 PE is the standard.
Differential expression statistics are still evolving, and the major disagreement between current methods is in estimation of error
Single nucleotide variants can be detected, but only for the ~10% most abundant transcripts
De novo assembly is greatly improved with PE information

Experimental Results

25M clusters of 50 x 2
Principal component analysis easily segregates the cell lines from the three known subtypes.
Subtypes: Basal, Claudin, Luminal

Expression of Isoforms
Claudin Basal Luminal

ESR1 is a well-studied gene due to its association with developmental stage in epithelial cells and its use as a biomarker for treatment in breast cancer
11 isoforms of ESR1 are detected, and some may be indicative of isoform-specific differential expression

Differential Expression of Isoforms


Differentially expressed transcripts between Claudin and Luminal cell lines were identified as before
These were filtered for isoforms of the same gene that exhibit opposing directions of change between the groups, and 115 transcripts were identified
Claudin Luminal
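The isoform filter described on this slide can be sketched as a grouping step over the list of significant transcripts. The input layout (transcript ID, gene ID, log fold change), the function name, and the toy records are assumptions; the slides do not describe how the filter was implemented.

```python
from collections import defaultdict

def opposing_isoforms(de_transcripts):
    """Keep DE transcripts whose gene has isoforms changing in opposite directions.

    de_transcripts: iterable of (transcript_id, gene_id, log_fc) tuples,
    already restricted to significantly differentially expressed transcripts.
    """
    by_gene = defaultdict(list)
    for transcript_id, gene_id, log_fc in de_transcripts:
        by_gene[gene_id].append((transcript_id, log_fc))

    selected = []
    for isoforms in by_gene.values():
        has_up = any(log_fc > 0 for _, log_fc in isoforms)
        has_down = any(log_fc < 0 for _, log_fc in isoforms)
        if has_up and has_down:                # at least one isoform in each direction
            selected.extend(tid for tid, _ in isoforms)
    return sorted(selected)

# Made-up records: only gene G1 has isoforms moving in opposite directions
records = [("t1", "G1", 2.0), ("t2", "G1", -1.5),
           ("t3", "G2", 1.0), ("t4", "G2", 0.8),
           ("t5", "G3", -2.2)]
hits = opposing_isoforms(records)
print(hits)
```

Only the two G1 isoforms survive the filter in this example, since G2 changes in one direction only and G3 has a single significant isoform.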

Summary
EA has consistently achieved >100M PF clusters per lane with high quality base calls for 50 x 2 cycle sequencing from TruSeq prepared libraries
With EA alignment of 10M clusters of 50 x 2 cycles, HiSeq provides similar levels of information to microarrays, when limited to transcripts detected by the microarray
With EA alignment of 25M clusters of 50 x 2 cycles, HiSeq provides substantially more information than microarrays, even when limited to transcripts detected by the microarray
At >=10M clusters, HiSeq consistently detects 3-4x more transcripts than microarrays, and at >=25M clusters, HiSeq detects 2x more differentially expressed transcripts

Summary
PE strategies are similar or marginally better than SE related to:
Magnitude of detection of known transcripts
Magnitude of detection of differential expression
Correlation of FC with microarrays
Estimating the magnitude of FC

PE strategies are noticeably better than SE in improving:
Percentage of unambiguously aligned reads

PE strategies are much better than SE in improving:
De novo assembly of transcripts or in detecting novel transcripts

Detection of novel isoforms and SNVs is improved with increased coverage and read depth
Alignment, estimation, and testing for differential expression are as important as the sequencing strategy

www.GenomicKnow-How.com
