ExSeq Presentation With Background
Why mRNAseq?
There are at least three compelling reasons for performing RNAseq in general, and mRNAseq in particular, rather than microarray analysis:
Specificity (of what is being measured)
Reduced bias (in batches, in log ratio (FC) estimates, in general)
Sensitivity (on a gene or transcript basis, both detection and differential expression)
Other reasons
Detection of SNVs or other variations
No predetermined transcriptome needs to be available; no probes need to be designed or manufactured
Cost (will soon be equivalent to microarray on a per-assay basis)
Why mRNAseq?
Specificity
DDR1
Why mRNAseq?
Reduced Bias over batches, time, machines, related to processing
CD19 CD8 CD14 CD4 cells
Why mRNAseq?
Reduced Bias
related to processing
A client's miRNAseq samples were sequenced on 4 different machines at 2 different sites at different times, with no apparent bias in the first principal components
GAIIx HS-01 HS-02 HS-IL
Historical Issues in Genome-wide Expression Studies
The MAQC (Microarray Quality Control) effort began in 2005 to address differences in differential expression seen between microarray platforms
The effort was very successful, coming at a time when many questions were being raised about the reliability/reproducibility of microarray results
The issues with microarray reproducibility in differential expression came down to three main findings:
Poor use/interpretation of statistical methods creates an illusion of discordance
Relying on annotation rather than probe locations is perilous
Each platform has its sweet spot for various performance characteristics
All primary microarray platforms can work well and be generally concordant
ExSeq
Why is our ExSeq experiment designed differently than MAQC?
MAQC had only two biological conditions that were themselves pools from multiple sources / tissues (UHRR and HBRR)
They had technical replicates, not biological replicates
We wanted to test a more realistic scenario using different biological conditions and biological replicates within a condition
Thus, one specimen, one source
Since each specimen is attributable to one source, we can also potentially assess variation (SNV/fusion/etc.) of that source, whereas with MAQC we could not, even if we sequenced the RNA
We can still assess assay repeatability by running independent preps in multiple lanes, but with sequencing we can easily combine output from multiple preps (if there is no bias)
Main Presentation
Sequencing performance relative to sequencing strategy and relative to microarrays using 15 breast cancer cell lines
Illumina HiSeq sequencer (HiSeq)
HG-U133_Plus_2 microarray (Affymetrix)
Human HT-12v4 microarray (Illumina)
Automated detection and cleavage of adaptors
Automated detection and trimming of cycles with skewed quality scores or high frequency of Ns
Correction of quality scores based on Phi-X spike-ins (in testing)
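The trimming of cycles with a high frequency of Ns can be illustrated with a minimal sketch. The slide does not describe how the pipeline decides which cycles to trim, so the function names, the 0.2 threshold, and the choice to trim only trailing cycles are all assumptions made for illustration.

```python
def n_fraction_per_cycle(reads):
    """Fraction of reads with an N base call at each cycle (position)."""
    length = min(len(r) for r in reads)
    return [
        sum(1 for r in reads if r[i] in "Nn") / len(reads)
        for i in range(length)
    ]


def trim_high_n_tail(reads, max_n_fraction=0.2):
    """Trim trailing cycles whose N fraction exceeds max_n_fraction.

    Assumption: only cycles at the 3' end are trimmed, and the 0.2
    threshold is a hypothetical value, not one taken from the slides.
    """
    fractions = n_fraction_per_cycle(reads)
    keep = len(fractions)
    while keep > 0 and fractions[keep - 1] > max_n_fraction:
        keep -= 1
    return [r[:keep] for r in reads]


# Toy reads: the last two cycles are dominated by N calls.
reads = ["ACGTNN", "ACGTAN", "ACGTNN", "ACGTCN"]
trimmed = trim_high_n_tail(reads)
```

In this toy input the fifth and sixth cycles have N fractions of 0.5 and 1.0, so both are removed and every read is shortened to its first four bases.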
Alignment
In mRNA-seq and similar sequence counting experiments, the number of unambiguously aligned reads is the driving factor behind most aspects of measurement performance
There are numerous strategies to improve unambiguous alignment:
Longer reads
Paired end vs single end
Alignment strategy, including error tolerance
Reduced complexity of the reference sequence
Alignment Approaches
Default parameters were used for available tools
Reference database and error tolerance were identical for all methods
EA = EA-developed SE alignment strategy
Alignment estimates were generated from the Illumina Body Map data
97%
Detection
Microarray detection is defined by the MAS5 call for Affymetrix and by a detection p-value < 0.05 for Illumina
Sequencing detection is defined as greater than 3 counts assigned at the end of abundance estimation
Shared content consists of the set of transcripts (or genes) that are common to all platforms
Unrestricted content allows for any possible detection event under the platform-specific definition
Gray: detected in any sample
Red: detected in >=66% of samples
Detected is defined as >= 3 reads assigned to a transcript
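The detection categories in the caption can be sketched as follows. This is a minimal illustration, assuming a per-transcript list of read counts (one per sample); the function name, the labels, and the toy counts are hypothetical. It uses the caption's ">= 3 reads" rule, and each transcript is given the strictest category it satisfies, so a "red" transcript would also count as "gray" in the figure.

```python
MIN_READS = 3        # "detected" = at least 3 reads assigned (from the caption)
MIN_FRACTION = 0.66  # "red" = detected in >= 66% of samples (from the caption)


def classify_detection(counts_by_transcript,
                       min_reads=MIN_READS,
                       min_fraction=MIN_FRACTION):
    """Label each transcript 'red', 'gray', or 'undetected'."""
    labels = {}
    for transcript, counts in counts_by_transcript.items():
        if not counts:
            labels[transcript] = "undetected"
            continue
        n_detected = sum(1 for c in counts if c >= min_reads)
        if n_detected / len(counts) >= min_fraction:
            labels[transcript] = "red"
        elif n_detected > 0:
            labels[transcript] = "gray"
        else:
            labels[transcript] = "undetected"
    return labels


# Hypothetical counts for three transcripts across three samples.
example_counts = {
    "tx_A": [10, 4, 3],  # detected in 3 of 3 samples
    "tx_B": [5, 0, 1],   # detected in 1 of 3 samples
    "tx_C": [2, 1, 0],   # never reaches 3 reads
}
labels = classify_detection(example_counts)
```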
25m PE HiSeq
Affy array
Illumina array
Slopes of the log ratio estimates are observed to be greatly increased in 25M reads versus 10M reads, and modestly increased in PE runs relative to SE runs
r2 (r=correlation) values are observed to be modestly increased in 25M reads versus 10M reads
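As a rough illustration of the two statistics above, the slope and r2 of a least-squares fit between paired log ratio estimates can be computed as below. The slides do not say which platform is on which axis; this sketch assumes microarray log ratios on x and sequencing log ratios on y, and the function name and toy values are hypothetical.

```python
def slope_and_r2(x, y):
    """Least-squares slope of y on x and squared Pearson correlation."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and have at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, r2


# Toy log2 ratios: sequencing estimates are 1.5x the array estimates.
array_lfc = [-2.0, -1.0, 0.0, 1.0, 2.0]
seq_lfc = [-3.0, -1.5, 0.0, 1.5, 3.0]
slope, r2 = slope_and_r2(array_lfc, seq_lfc)
```

Under this assumed orientation, a slope above 1 means the sequencing log ratios are larger in magnitude than the array log ratios for the same transcripts, while r2 measures how tightly the two sets of estimates agree.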
With FC > 1.5 and q < 0.01, for a given number of reads:
Example counts (unique to one platform, common, unique to the other): 460, 1000, 1270
Summarized values: 0.46, 1, 1.27
Scale the common set to 1
A value of 0 for platform A means that every differential expression call made by A was also made by B
Increasing values indicate an increasing number of significant calls unique to the platform
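The scaling described above can be sketched in a few lines. The function names are hypothetical, and treating the FC cutoff as symmetric on the log2 scale (up or down by more than 1.5-fold) is an assumption; the slide only states "FC > 1.5 and q < 0.01".

```python
import math


def is_de(log2_fc, q, fc_cutoff=1.5, q_cutoff=0.01):
    """Call differential expression; symmetric FC cutoff is an assumption."""
    return abs(log2_fc) > math.log2(fc_cutoff) and q < q_cutoff


def summarize_counts(unique_a, common, unique_b):
    """Scale the common set to 1 and express unique calls relative to it."""
    if common <= 0:
        raise ValueError("common set must be non-empty to scale by it")
    return (unique_a / common, 1.0, unique_b / common)


def scaled_overlap(calls_a, calls_b):
    """Same summary, starting from two sets of DE transcript identifiers."""
    a, b = set(calls_a), set(calls_b)
    return summarize_counts(len(a - b), len(a & b), len(b - a))


# The counts shown on the slide: 460 unique, 1000 common, 1270 unique.
summary = summarize_counts(460, 1000, 1270)
```

With the slide's counts this gives approximately 0.46, 1, and 1.27, matching the summarized values. A first value of 0 would mean that platform has no unique calls, so every call it made was also made by the other platform.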
The number of detected DE transcripts unique to each platform is similar in magnitude to the intersection of both at 10M reads. For 25M reads, sequencing produces a noticeably larger total number of differentially detected transcripts even for common content and the amount unique to each array type (Affy and Illumina) is much smaller than at 10M. Error bars indicate variation across the 40 combinations of FC and q.
25m
DESeq vs. Cuffdiff
Magnitudes of change are compressed in Cuffdiff relative to DESeq
DESeq identifies 2-3x more unique transcripts than Cuffdiff across the three comparisons
Approximately 1/3 of detected SNVs are not present in dbSNP
At least 1 variant is detected in ~10% of detected transcripts
Cufflinks is used for de novo assembly of transcripts; no transcriptome definition is used
PE provides superior performance to SE in this scenario
Identification of completely novel transcripts: those that are assembled from mRNA but lie in regions currently annotated as intergenic
Experimental Results
25M clusters of 50 x 2
Principal component analysis easily segregates the cell lines from the three known subtypes
Basal Claudin Luminal
Expression of Isoforms
Claudin Basal Luminal
ESR1 is a well-studied gene due to its association with developmental stage in epithelial cells and its use as a biomarker for treatment in breast cancer
11 isoforms of ESR1 are detected, and some may be indicative of isoform-specific differential expression
Summary
EA has consistently achieved >100M PF clusters per lane with high quality base calls for 50 x 2 cycle sequencing from TruSeq prepared libraries
With EA alignment of 10M clusters of 50 x 2 cycles, HiSeq provides similar levels of information to microarrays when limited to transcripts detected by the microarray
With EA alignment of 25M clusters of 50 x 2 cycles, HiSeq provides substantially more information than microarrays, even when limited to transcripts detected by the microarray
At >=10M clusters, HiSeq consistently detects 3-4x more transcripts than microarrays, and at >=25M clusters, HiSeq detects 2x more differentially expressed transcripts
Summary
PE strategies are similar to or marginally better than SE with respect to:
Magnitude of detection of known transcripts
Magnitude of detection of differential expression
Correlation of FC with microarrays
Estimating the magnitude of FC
Detection of novel isoforms and SNVs is improved with increased coverage and read depth
Alignment, estimation, and testing for differential expression are as important as the sequencing strategy
www.GenomicKnow-How.com