ExSeq Presentation With Background

Optimization of mRNA sequencing relative to current microarray platforms
May 31, 2011
Why mRNAseq?
There are at least three compelling reasons for performing RNAseq in general, and mRNAseq in particular, rather than using a microarray:
  Specificity (of what is being measured)
  Reduced bias (in batches, in log ratio (FC) estimates, in general)
  Sensitivity (on a gene or transcript basis, for both detection and differential expression)
Other reasons
  Detection of SNVs or other variations
  No predetermined transcriptome needs to be available; no probes need to be designed or manufactured
  Cost (will soon be equivalent to microarrays on a per-assay basis)
Why mRNAseq?
Specificity
DDR1
Affymetrix Probe set annotation
3' bias of probes creates greater ambiguity in measurement
Why mRNAseq?
Reduced Bias over batches, time, machines, related to processing
CD19 CD8 CD14 CD4 cells
Why mRNAseq?
Reduced Bias related to processing
Client miRNAseq samples were sequenced on 4 different machines at 2 different sites at different times, with no apparent bias in the first principal components
Machines: GAIIx, HS-01, HS-02, HS-IL
Historical Issues in Genome-wide Expression Studies
The MAQC (MicroArray Quality Control) effort began in 2005 to address differences seen in differential expression between microarray platforms
The effort was very successful, coming at a time when many questions were being raised about the reliability and reproducibility of microarray results
The issues with microarray reproducibility in differential expression came down to three main findings:
  Poor use or interpretation of statistical methods creates the illusion of discordance
  Relying on annotation rather than probe locations is perilous
  Each platform has its sweet spot for various performance characteristics
All primary microarray platforms can work well and be generally concordant
Similar Issues with mRNA sequencing

In one sense, next-gen sequencers can be viewed as another platform for (relative) RNA quantitation, but without the bias of predetermined probes
RNA assay performance measures that are important:
  Detection and signal (including dynamic range)
  Fold change (log ratio or log FC) estimates (biased?)
  Differential gene lists (size, uniqueness)
  Concordance with reference methods (TaqMan)
  Repeatability/reproducibility of these same factors
  Technical variability vs. biological variability
Exemplar Sequencing (ExSeq) Experiment

Goal: Compare performance of various mRNA-seq strategies to microarrays in a real-world experimental scenario
Design
  15 breast cancer cell lines: 5 unique lines representing each of 3 breast cancer subtypes
  Three independent library preparations of each line using the Illumina TruSeq protocol
  45 samples randomized into 7 pools/lanes with 7 barcodes per pool
  The 7 pools were run on 2 flow cells (100 x 100 cycles, i.e., 100 PE)
  For runs with acceptable quality, reads were randomly collected into sets of 2M, 5M, 10M, 17M, 25M, 33M, and 50M reads
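The random collection of reads into fixed-depth sets can be sketched as below. This is a minimal illustration, not the procedure used for ExSeq: the function name `subsample_reads`, the seed, and the toy read IDs are hypothetical, and each depth is assumed to be drawn independently (the slides do not say whether the sets were nested). Sampling by read ID keeps both mates of a paired-end read together.

```python
import random

def subsample_reads(read_ids, depths, seed=0):
    """Draw one random subset of read IDs for each requested depth.

    Depths larger than the number of available reads are skipped.
    Each depth is sampled independently (an assumption of this sketch).
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    subsets = {}
    for depth in depths:
        if depth <= len(read_ids):
            subsets[depth] = rng.sample(read_ids, depth)
    return subsets

# Toy example: 1,000 hypothetical read IDs, with depths scaled down
# from the 2M..50M sets described in the design.
reads = [f"read_{i}" for i in range(1000)]
subsets = subsample_reads(reads, depths=[20, 50, 100, 170, 250, 330, 500])
print({depth: len(ids) for depth, ids in subsets.items()})
```

For real data, the selected IDs would then be used to filter the FASTQ files for read 1 and read 2 in parallel.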
ExSeq
Why is our ExSeq experiment designed differently than MAQC?
MAQC had only two biological conditions that were themselves pools from multiple sources / tissues (UHRR and HBRR)
They had technical replicates, not biological replicates
We wanted to test a more realistic scenario using different biological conditions and biological replicates within a condition
Thus, one specimen, one source
Since each specimen is attributable to one source, we can also potentially assess variation (SNV, fusion, etc.) in that source, which would not be possible with MAQC even if the RNA were sequenced
We can still assess assay repeatability by running independent preps in multiple lanes, and with sequencing we can easily combine output from multiple preps (if there is no bias)
Main Presentation
Sequencing performance relative to sequencing strategy and relative to microarrays using 15 breast cancer cell lines
  Illumina HiSeq sequencer (HiSeq)
  HG-U133_Plus_2 microarray (Affymetrix)
  Human HT-12v4 microarray (Illumina)
Interpretations/Insights from a sequencing experiment over and above microarrays
Raw Sequencing Output - ExSeq

Typically achieving 100-110M PF clusters /lane*
*Illumina HiSeq specification is 60M PF clusters / lane (v2)
EA Pre-processing & QA Methods Implemented

Automated determination of
  Species
  Molecule: DNA, RNA, or miRNA
  Insert size length and variation
Detection of non-uniformity of barcode representation
Alarms for
  Percent and number of PF clusters
  Unexpected base distribution
  Unusual quality scores by cycle
Automated detection and cleavage of adaptors
Automated detection and trimming of cycles with skewed quality scores or a high frequency of Ns
Correction of quality scores based on PhiX spike-ins (in testing)
Computational Processing overview

Alignment
  Maximizing unambiguous alignments
  Alignment of reads that cross exon junctions
  Ex: Bowtie, BWA, TopHat
Abundance estimation
  Gene or transcript
  Handling alignments that are ambiguous in the transcriptome
  Ex: Cufflinks, MISO, IsoEM, IsoInfer, Rseq, ...
Normalization of read counts
  Minimizing bias due to variation in number of clusters available
  Ex: Total count, upper quartile, quantile, density
Testing for differential expression
  Data are not log-normal, unlike what is often observed with microarrays
  Ex: Cuffdiff, edgeR, DESeq
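Two of the normalization options named in the overview, total count and upper quartile, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the EA implementation: the function names, the nearest-rank percentile rule, the exclusion of zero counts from the quartile, and the toy counts are all choices made here for the example.

```python
import math

def total_count_factors(counts_by_sample):
    """Scale factor per sample: its total count relative to the mean total."""
    totals = {s: sum(c) for s, c in counts_by_sample.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {s: t / mean_total for s, t in totals.items()}

def upper_quartile(counts):
    """Nearest-rank 75th percentile of the nonzero counts (assumed rule)."""
    nonzero = sorted(c for c in counts if c > 0)
    rank = math.ceil(0.75 * len(nonzero)) - 1
    return nonzero[rank]

def upper_quartile_factors(counts_by_sample):
    """Scale factor per sample: its upper quartile relative to the mean."""
    uq = {s: upper_quartile(c) for s, c in counts_by_sample.items()}
    mean_uq = sum(uq.values()) / len(uq)
    return {s: q / mean_uq for s, q in uq.items()}

def normalize(counts_by_sample, factors):
    """Divide every count by its sample's scale factor."""
    return {s: [c / factors[s] for c in counts]
            for s, counts in counts_by_sample.items()}

# Toy data: sample B was sequenced twice as deeply as sample A.
counts = {"A": [10, 20, 30, 40], "B": [20, 40, 60, 80]}
print(normalize(counts, total_count_factors(counts)))
print(normalize(counts, upper_quartile_factors(counts)))
```

In this toy case both methods bring the two samples onto the same scale; they differ when a few highly abundant transcripts dominate the total count, which is the situation upper-quartile scaling is meant to handle.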
Alignment
In mRNA-seq and similar sequence counting experiments, the number of unambiguously aligned reads is the driving factor behind most aspects of measurement performance
There are numerous strategies to improve unambiguous alignment:
  Longer reads
  Paired end vs. single end
  Alignment strategy, including error tolerance
  Reduced complexity of the reference sequence
Alignment Approaches
unaligned ambiguous unambiguous
Default parameters were used for available tools
Reference database and error tolerance were identical for all methods
EA = EA-developed SE alignment strategy
Alignment estimates generated from the Illumina Body Map data
Unambiguous Alignment in ExSeq

The default TopHat algorithm is based on the SE Bowtie algorithm
EA-TopHat is a hybrid approach that uses the general TopHat algorithm for junction mapping but is powered by the EA alignment engine
*Alignment estimates generated from the ExSeq data
97%
Detection
Microarray detection is defined by the MAS5 call for Affymetrix and by detection p-value < 0.05 for Illumina
Sequencing detection is defined as greater than 3 counts assigned at the end of abundance estimation
Shared content consists of the set of transcripts (or genes) that are common to all platforms
Unrestricted content allows for any possible detection event under the platform-specific definition
Detection Shared Transcripts
Gray: detected in any sample
Red: detected in >=66% of samples
Detected is defined as >= 3 reads assigned to a transcript
Detection All Transcripts
Gray: detected in any sample
Red: detected in >=66% of samples
Detected is defined as >= 3 reads assigned to a transcript
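The two detection summaries used in these figures (detected in any sample, detected in >=66% of samples) can be expressed as a short sketch. The function name `detection_summary`, the dictionary layout, and the toy counts are hypothetical; the thresholds of >= 3 reads and 66% come from the figure legend.

```python
def detection_summary(counts_by_transcript, min_reads=3, min_fraction=0.66):
    """Classify each transcript by how many samples detect it.

    A transcript is detected in a sample when it has >= min_reads reads.
    Returns, per transcript, whether it is detected in any sample and
    whether it is detected in at least min_fraction of samples.
    """
    summary = {}
    for transcript, counts in counts_by_transcript.items():
        calls = [c >= min_reads for c in counts]
        fraction = sum(calls) / len(calls)
        summary[transcript] = {
            "any": any(calls),
            "most": fraction >= min_fraction,
        }
    return summary

# Toy read counts for three hypothetical transcripts across three samples.
toy_counts = {
    "tx1": [0, 5, 7],   # detected in 2 of 3 samples
    "tx2": [0, 0, 4],   # detected in 1 of 3 samples
    "tx3": [1, 2, 0],   # never reaches 3 reads
}
summary = detection_summary(toy_counts)
print(summary)
```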
Transcript Abundance Estimation

Few alignments may be ambiguous with respect to the genome, but a large fraction of ambiguity remains with respect to the transcriptome
Ignoring the ambiguous fraction greatly reduces the read count and degrades repeatability, fold change estimation, and identification of differential expression
Definition of the transcriptome plays an important role; here we use the UCSC KnownGene table, a combination of RefSeq, GenBank, and UniProt
Many methods are available to intelligently assign ambiguous reads. Results in the remaining slides are from Cufflinks estimation of the KnownGene transcripts.
Magnitude of Fold Change (Log Ratio)

Figure panels: 10M PE HiSeq (left) and 25M PE HiSeq (right), each compared with the Affy array and the Illumina array
Slope estimates > 1 indicate compression of fold change estimates in array platforms (x-axes) relative to HiSeq (y-axes).
FC (log ratio) estimates are increased for 25M reads (right) relative to 10M reads (left).
r2 values are increased for 25M reads (right) relative to 10M reads (left).
Magnitude of Fold Change
Slopes of the log ratio estimates are observed to be greatly increased in 25M reads versus 10M reads, and modestly increased in PE runs relative to SE runs
Concordance of Fold Change
r2 (r=correlation) values are observed to be modestly increased in 25M reads versus 10M reads
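The slope and r2 comparisons in the fold change slides can be illustrated with a short sketch. It assumes an ordinary least-squares fit of HiSeq log ratios (y) on array log ratios (x); the slides do not state which fitting method was used, and the function name and toy values are hypothetical.

```python
def slope_and_r2(x, y):
    """Least-squares slope of y on x and the squared correlation r^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, r2

# Toy log ratios: the array reports two thirds of the change seen by HiSeq.
array_lfc = [-2.0, -1.0, 0.0, 1.0, 2.0]
hiseq_lfc = [-3.0, -1.5, 0.0, 1.5, 3.0]
slope, r2 = slope_and_r2(array_lfc, hiseq_lfc)
print(slope, r2)
```

A slope above 1 with the array on the x-axis means the array log ratios are compressed relative to sequencing, which is how the slides read these plots.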
Comparison of Differential Expression

Venn diagrams are used by many to assess concordance of differential expression; however, the choice of threshold for significance varies widely
Concordant/discordant counts were tabulated from 40 combinations of q-value and fold change thresholds (q in 0.01-0.25 and FC in 1.5-8)
For each threshold, the number discordant by platform was normalized to the number concordant

Illustration
With FC > 1.5 and q < 0.01, for a given number of reads:
                    Platform A   Common   Platform B
Significant calls   460          1000     1270
Summarized values   0.46         1        1.27
Scale the common set to 1
A value of 0 for platform A means that all differential expression detected by A was also detected by B
Increasing values indicate an increasing number of significant calls unique to the platform
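The scaling in the illustration (460, 1000, and 1270 calls summarized as 0.46, 1, and 1.27) can be written as a small sketch. The function names and the toy transcript sets are hypothetical; only the normalization rule, dividing each platform-specific count by the common count, comes from the slide.

```python
def summarize_discordance(only_a, common, only_b):
    """Scale the common set to 1 and express unique calls relative to it."""
    if common == 0:
        raise ValueError("no concordant calls; the ratio is undefined")
    return only_a / common, 1.0, only_b / common

def discordance_from_calls(de_a, de_b):
    """Same summary, starting from the sets of DE transcripts per platform."""
    de_a, de_b = set(de_a), set(de_b)
    common = de_a & de_b
    return summarize_discordance(len(de_a - de_b), len(common), len(de_b - de_a))

# Counts from the illustration: FC > 1.5 and q < 0.01.
print(summarize_discordance(460, 1000, 1270))

# Toy sets of hypothetical transcript IDs.
print(discordance_from_calls({"t1", "t2", "t3"}, {"t2", "t3", "t4", "t5"}))
```

Repeating this for each of the 40 threshold combinations gives the spread that the later summary slide shows as error bars.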

For each collection of read depth (10M, 25M, etc.) and strategy (PE, SE), the summarized values are tabulated by FC threshold (1.5 shown) and q threshold (.01, .05, .1, .15, .25)
[Table template: each cell holds three values, Platform A (0.nn), Common (1), and Platform B (n.nn)]
Summarization of Differential Expression

(for common content - microarray vs. sequencing)
The number of detected DE transcripts unique to each platform is similar in magnitude to the intersection of both at 10M reads. For 25M reads, sequencing produces a noticeably larger total number of differentially detected transcripts even for common content and the amount unique to each array type (Affy and Illumina) is much smaller than at 10M. Error bars indicate variation across the 40 combinations of FC and q.
HiSeq vs Array Comparative Summary

EA alignment of 10M reads provides performance similar to currently available gene expression microarrays in terms of detection, estimation of fold change, and detection of differential expression for the content assayed by the microarray
With EA alignment of 25M reads, fold change estimates are 75-100% larger, and 2-3x more transcripts are identified as differentially expressed for the content assayed by the microarray
PE provides modest benefits in all aspects of quantification relative to SE
However, there are other RNAs to measure using mRNAseq, as detection is increased 4x relative to microarrays for 25M PF clusters
Effect of Sequencing Parameters

Figure panels: 50b SE vs 50b PE, and 25b PE vs 50b PE, at 10M reads (top) and 25M reads (bottom)
Agreement improves with increasing number of reads (top vs bottom).
25 cycle x 2 is much better than 50 cycle x 1 at recapitulating the results of 50 cycle x 2
Effect of Differential Expression Test
DESeq vs. Cuffdiff
Magnitudes of change are compressed in Cuffdiff relative to DESeq.
DESeq identifies 2-3x more unique transcripts than Cuffdiff across the three comparisons.
Detection of Single Nucleotide Variants
Approximately 1/3 of detected SNVs are not present in dbSNP
At least 1 variant is detected in ~10% of detected transcripts
De novo Transcript Assembly
Cufflinks was used for de novo assembly of transcripts; no transcriptome definition is used
PE provides superior performance to SE in this scenario
Novel Transcript Assembly
Identification of completely novel transcripts: those that are assembled from mRNA but lie in regions currently annotated as intergenic
Sequencing Parameters and Processing

Good correlation of differential expression is observed between 50 SE and 50 PE. However, 25 PE is superior to 50 SE when 50 PE is the standard.
Differential expression statistics are still evolving, and the major disagreement between current methods is in estimation of error
Single nucleotide variants can be detected, but only for the ~10% most abundant transcripts
De novo assembly is greatly improved with PE information
Experimental Results
25M clusters of 50 x 2
Principal component analysis easily segregates the cell lines from the three known subtypes.
Subtypes: Basal, Claudin, Luminal
Expression of Isoforms
Claudin Basal Luminal
ESR1 is a well-studied gene due to its association with developmental stage in epithelial cells and its use as a biomarker for treatment in breast cancer
11 isoforms of ESR1 are detected, and some may be indicative of isoform-specific differential expression
Differential Expression of Isoforms

Differentially expressed transcripts between Claudin and Luminal cell lines were identified as before
These were filtered for isoforms of the same gene that exhibit opposing directions of change between the groups, and 115 transcripts were identified
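The isoform filter described here can be sketched as follows: group the differentially expressed transcripts by gene and keep genes that have at least one isoform going up and one going down. The function name, the tuple layout, and the gene and transcript labels are hypothetical; the slides do not describe how the filter was implemented.

```python
from collections import defaultdict

def opposing_isoforms(de_transcripts):
    """Keep DE transcripts from genes whose isoforms change in opposite directions.

    de_transcripts: iterable of (gene, transcript, log_fc) tuples, already
    filtered for significance. log_fc > 0 means higher in one group,
    log_fc < 0 means higher in the other.
    """
    by_gene = defaultdict(list)
    for gene, transcript, log_fc in de_transcripts:
        by_gene[gene].append((transcript, log_fc))

    selected = []
    for gene, isoforms in by_gene.items():
        has_up = any(lfc > 0 for _, lfc in isoforms)
        has_down = any(lfc < 0 for _, lfc in isoforms)
        if has_up and has_down:
            selected.extend((gene, t, lfc) for t, lfc in isoforms)
    return selected

# Toy input with hypothetical gene and transcript names.
de = [
    ("GENE_A", "GENE_A.1", 2.1),
    ("GENE_A", "GENE_A.2", -1.4),
    ("GENE_B", "GENE_B.1", 1.8),
    ("GENE_B", "GENE_B.2", 0.9),
    ("GENE_C", "GENE_C.1", -2.5),
]
opposing = opposing_isoforms(de)
print(opposing)
```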
Claudin Luminal
Summary
EA has consistently achieved >100M PF clusters per lane with high-quality base calls for 50 x 2 cycle sequencing from TruSeq-prepared libraries
With EA alignment of 10M clusters of 50 x 2 cycles, HiSeq provides similar levels of information to microarrays when limited to transcripts detected by the microarray
With EA alignment of 25M clusters of 50 x 2 cycles, HiSeq provides substantially more information than microarrays, even when limited to transcripts detected by the microarray
At 10M clusters and above, HiSeq consistently detects 3-4x more transcripts than microarrays, and at 25M clusters and above, HiSeq detects 2x more differentially expressed transcripts
Summary
PE strategies are similar to or marginally better than SE with respect to
  Magnitude of detection of known transcripts
  Magnitude of detection of differential expression
  Correlation of FC with microarrays
  Estimating the magnitude of FC
PE strategies are noticeably better than SE in improving
  Percentage of unambiguously aligned reads
PE strategies are substantially better than SE in improving
  De novo assembly of transcripts and detection of novel transcripts
Detection of novel isoforms and SNVs is improved with increased coverage and read depth
Alignment, estimation, and testing for differential expression are as important as the sequencing strategy
www.GenomicKnow-How.com
