RNA Sequencing: An Introduction To Efficient Planning and Execution of RNA Sequencing (RNA-Seq) Experiments

THE SWISS DNA COMPANY

White Paper · Next Generation Sequencing

RNA Sequencing
An introduction to efficient planning and execution of RNA sequencing
(RNA-Seq) experiments.

Motivation for RNA Sequencing

RNA sequencing (RNA-Seq) refers to a method that is based on next generation sequencing (NGS) technologies to study transcriptomes. While microarray-based technologies depend on measuring hybridization intensities to predesigned probes, RNA-Seq relies on discrete counting of sequenced molecules. Thus, in contrast to microarrays, RNA-Seq requires no prior knowledge about the genome and can be considered hypothesis free. In addition, the counting of RNA molecules gives RNA-Seq experiments a much higher dynamic range compared to the hybridization intensities measured in microarray experiments. Also considering the falling costs of NGS technologies, it is not surprising that RNA-Seq experiments have become the gold standard in transcriptome studies. Novel isoforms, alternative splice sites, rare transcripts and gene-fusions, non-coding transcripts, and additional, even novel mechanisms, can all be detected in a single experiment.

Usually, the main goal of RNA-Seq studies is to obtain expression profiles, pathways and gene networks linked to the experimental condition studied. A generalized workflow of a typical RNA-Seq study is depicted in Figure 1.

Figure 1. This figure depicts a generalized RNA sequencing workflow that may be used for differential expression analysis. Workflow steps: Experimental Design, Samples, Total RNA Isolation, RNA Library Preparation, Illumina Short Read Sequencing, Data Analysis, Report Generation.

Given the complex organization of genomes together with the huge amount of fragmented data, the interpretation of RNA-Seq experiments may appear a daunting task, especially for eukaryotic organisms [1]. For instance, the most recent human reference assembly (GRCh38) has 244’550 unique exons with a mean length of 330 bases and a total amount of 80 million bases scattered across the 3 billion bases of the human genome. There are 2’879 annotated non-coding micro RNA (miRNA) [2], 52’000 transcripts from 26’475 genes, 22’302 associated gene ontology (GO) terms [3] and 330 KEGG pathways [4] (as of August 2018). Thus, a good experimental strategy is important to reliably identify the desired genes and gene networks, for example the transcription targets of a stressor, or a specific gene or pathway of interest after treatment, or gene-expression differences between different genotypes. This white paper will give an overview on how to handle RNA-Seq data by presenting a selection of workflows.

Microsynth AG, Switzerland


Schützenstrasse 15 · P.O. Box · CH-9436 Balgach · Phone +41 71 722 83 33 · Fax +41 71 722 87 58 · info@microsynth.ch · www.microsynth.ch

Sequencing and Differential Expression Analysis of Coding and Non-coding RNA

Selected Applications of RNA-Seq

Transcriptome studies are well suited to understand disease mechanisms, developmental mechanisms, or response to various stressors. Differential expression analysis of RNA-Seq data relies on the comparison of data sets obtained from experimental conditions (e.g. drug treatments) and controls to determine the difference in transcript abundance. The focus here is on messenger RNA (mRNA). In addition, non-coding miRNAs, which often have gene regulatory purposes, may be used to develop biomarkers specific to a medical condition. Such differentially expressed miRNAs can then be experimentally verified, for instance to develop diagnostic qPCR kits.

Experimental Design

For a successful experiment, many aspects, including experimental setup, sampling, and funding, are to be considered. In addition, the number of biological replicates and the number of reads produced for each replicate are essential parameters to produce valid results [5], especially to detect the maximal number of differentially expressed genes, which includes rare transcripts. As gene expression analysis builds on counting reads from the respective transcription unit, single-end reads of 75 bases in length usually suffice for accurate mapping. However, paired-end sequencing (and in some cases longer reads, for instance as produced by Pacific Biosciences (PacBio) sequencing technologies) is required if highly accurate transcript quantification, determination of gene fusions or novel splice variant detection is envisaged. In contrast to single-end sequencing, paired-end sequencing enables reading both ends of a (c)DNA fragment.

Generally, it is recommended to work at least in triplicates per experimental condition and to sequence 30 million single-end reads per replicate for eukaryotic organisms and 10 million single-end reads per replicate for prokaryotic organisms. For miRNA the read numbers may be halved. It is also worth mentioning that the External RNA Controls Consortium (ERCC) has developed a set of external RNA controls designed to mimic natural eukaryotic mRNA sequences [6]. These sequences may be spiked in after RNA isolation and can be used to estimate the uncertainty in the subsequent measurements.
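
The read-depth recommendations above translate directly into a sequencing budget. The following Python sketch is only an illustration: the function name, its parameters and the three-condition design in the usage line are assumptions made for this example, while the per-replicate read numbers, the halving for miRNA and the 75-base read length are the values quoted in the text.

```python
# Recommended single-end read depth per replicate, as quoted in the text.
READS_PER_REPLICATE = {
    "eukaryote": 30_000_000,
    "prokaryote": 10_000_000,
}


def sequencing_budget(organism, conditions, replicates=3, read_length=75, mirna=False):
    """Return (samples, total_reads, total_bases) for a single-end design.

    Hypothetical helper for planning purposes; not part of any vendor tool.
    """
    if replicates < 3:
        raise ValueError("the text recommends at least triplicates per condition")
    reads = READS_PER_REPLICATE[organism]
    if mirna:
        reads //= 2  # for miRNA the read numbers may be halved
    samples = conditions * replicates
    total_reads = samples * reads
    return samples, total_reads, total_reads * read_length


# Assumed design: three conditions, each in triplicate, eukaryotic mRNA.
samples, total_reads, total_bases = sequencing_budget("eukaryote", conditions=3)
print(samples, total_reads, total_bases)  # 9 270000000 20250000000
```

A helper of this kind makes it easy to compare the cost of adding a condition against adding a replicate before any sample is prepared.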

RNA Isolation

Obtaining high quality RNA is critical. RNA degradation is detrimental to the experiments since it may introduce 3’ biases during polyadenylated (polyA) RNA enrichment or may distort the transcript profile by differentially affecting different RNAs. Thus, great care is needed to preserve the integrity of the input RNA. The acronym GIGO (“garbage in, garbage out”) holds true in this case as well. A notable exception is RNA extracted from formalin-fixed, paraffin-embedded (FFPE) tissue obtained by laser-assisted dissection methods, where a certain amount of degradation is unavoidable. Transcript profiles may also be distorted upon amplification of low amounts of input RNA, using either transcription-based or PCR-based amplification methods. However, with careful controls and a sufficient number of biological replicates, the adverse effects can be minimized.

Sequencing Library Construction

Depending on the desired RNA type (coding or non-coding) and type of organism studied (eukaryote or prokaryote), different sequencing library types are constructed. For instance, to sequence mRNA, a polyA enrichment step is performed for eukaryotes, while a ribosomal RNA depletion step is carried out for prokaryotes. Kits for non-coding RNA library construction may use alternative techniques to enrich the relevant RNA fraction. The constructed libraries are stranded, meaning they retain the strand information of the sequenced molecule, which results in a more reliable quantification of gene expression [7]. One typical method to keep the RNA strandedness makes use of uracil instead of thymine for incorporation during second strand cDNA synthesis. After adapter ligation and before PCR amplification, uracil-DNA glycosylase (UNG) is added to degrade the second strand. As a result, all reads start in the same orientation, allowing the identification of the transcribed strand. A schematic depiction of how total RNA is turned into a sequenceable Illumina cDNA library is shown in Figure 2.


Figure 2. Schematic description of a poly-A enriched RNA Illumina library ready for sequencing. Image: David Corney.

Next Generation Sequencing

Illumina short-read sequencing by synthesis (SBS) technology, as depicted in Figure 3, is especially well suited for RNA-Seq, as it is fast, accurate and cost effective [8]. Sequenced reads are produced in the standard fastq format [9] that incorporates both sequence information and quality scoring and can be further processed in downstream analyses.

Figure 3. Schematic of Illumina's paired-end sequencing workflow on the flow cell. Panels: sequence read 1, flipping and indexing (index read), sequence read 2.
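
To make the fastq format mentioned above more concrete, the following Python sketch reads four-line records (header, sequence, separator, quality string) and converts the quality string into numeric scores. It is a minimal illustration, not a replacement for an established parser: the function name is made up for this example, multi-line records are not handled, and the Phred+33 quality encoding is an assumption that should be checked against the actual sequencer output.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of stream
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]


# Toy record, invented for illustration only.
example = io.StringIO("@read1\nGATTACA\n+\nIIIII#5\n")
for read_id, seq, quals in read_fastq(example):
    print(read_id, seq, quals)  # read1 GATTACA [40, 40, 40, 40, 40, 2, 20]
```

The per-base scores are what quality control and read filtering steps typically operate on.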

Bioinformatics Analysis

As an example, let us consider an RNA-Seq experiment aimed at detecting changes in the gene expression profile of a human cell line after exposure to a drug. With control samples and additional samples taken at two different time points after drug application, each in three replicates, the total sample number would amount to nine. It is strongly recommended not to pool different replicates or conditions into a single sample for sequencing, as this would eliminate all statistical power and the experiment would become useless. Assuming each sample in the outlined RNA-Seq experiment generates 30 million single-end reads of 75 bases in length, the whole experiment produces 270 million reads or about 20 billion bases. This large amount of data requires a dedicated analysis pipeline to extract meaningful information.

Bioinformatics analysis of RNA-Seq data generally consists of: 1) quality control, optional size selection (e.g. to specifically separate non-coding RNA fractions) and filtering of the sequenced reads, 2) splice-aware mapping of the reads to the reference genome, 3) counting of uniquely mapped reads for each gene, 4) normalization of read counts across the experiment and 5) statistical evaluation of the normalized values comparing the different conditions (such as treatment and control) to each other to identify significant fold changes and up- or down-regulation of the genes [10]. Table 1 presents an excerpt of an mRNA differential gene expression analysis. Figure 4 shows how the expression values of replicates group together and differ from the values of the respective controls.

Figure 4. Principal Component Analysis (PCA) plot to visualize grouping of samples in an RNA-Seq experiment. The three conditions depicted are clearly separated, indicating significant, differential gene expression patterns of the three analyzed conditions.
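
Steps 4 and 5 of the analysis outlined above can be illustrated with a deliberately simple Python sketch: read counts are scaled to counts per million (CPM) and the log2 fold change between treatment and control is computed per gene. The source does not state which normalization method is used, so CPM, the pseudocount, the function names and the toy counts are all assumptions for illustration; dedicated differential expression tools use more elaborate normalization and statistical models.

```python
import math


def counts_per_million(counts):
    """Scale one sample's raw gene counts to counts per million counted reads."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(control, treated, pseudocount=1.0):
    """Per-gene log2 ratio of mean treated CPM to mean control CPM.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for genes without any reads in one condition.
    """
    control_cpm = [counts_per_million(sample) for sample in control]
    treated_cpm = [counts_per_million(sample) for sample in treated]
    result = {}
    for gene in control_cpm[0]:
        mean_c = sum(s[gene] for s in control_cpm) / len(control_cpm)
        mean_t = sum(s[gene] for s in treated_cpm) / len(treated_cpm)
        result[gene] = math.log2((mean_t + pseudocount) / (mean_c + pseudocount))
    return result


# Invented toy counts: three replicates per condition, hypothetical gene names.
control = [{"geneA": 100, "geneB": 900}, {"geneA": 110, "geneB": 890}, {"geneA": 90, "geneB": 910}]
treated = [{"geneA": 400, "geneB": 600}, {"geneA": 420, "geneB": 580}, {"geneA": 380, "geneB": 620}]
for gene, lfc in log2_fold_change(control, treated).items():
    print(gene, round(lfc, 2))
```

Keeping the replicates separate, as recommended in the text, is what allows the later statistical evaluation to estimate the variation within each condition.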


Table 1. This excerpt of a table shows the main output of a differential gene expression analysis. In this experiment two conditions with three replicates are compared to each other. The table lists from left to right the gene identifier, boxplots representing expression level distributions of the replicates, the log2 fold change of gene expression between condition 1 and condition 2, the probability value (p-value) of the log2 fold change and the p-value adjusted for multiple testing.
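
The adjustment for multiple testing mentioned in the caption of Table 1 can be sketched as follows. The source does not say which correction was applied; the Benjamini-Hochberg procedure shown here is one commonly used choice and is an assumption of this example, as are the function name and the toy p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented p-values for four hypothetical genes.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 3) for p in benjamini_hochberg(raw)])  # [0.02, 0.04, 0.04, 0.02]
```

Because thousands of genes are tested at once, filtering on the adjusted rather than the raw p-value keeps the share of false positives among the reported genes under control.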

Figure 5. Excerpt from the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway graph “TRANSCRIPTIONAL MISREGULATION IN CANCER”, where colored nodes represent significantly up- or downregulated genes in the selected pathway.

Based on the differential gene expression results and depending on the content of gene information published in databases, gene set and pathway analysis may be carried out to illuminate the larger context of the involved metabolic processes as exemplified in Figure 5. A useful additional analysis in the case of miRNAs comprises a motif search to identify potential miRNA targets and to uncover additional, novel miRNAs [11]. The results of such analyses may be submitted to public databases such as miRNet [12] for further network-based visual analysis. Figure 6 depicts such a motif identified by a miRNA analysis.

Figure 6. A depiction of a significant de novo miRNA motif discovered in a miRNA-Seq analysis. A miRNA motif is a region that is well conserved in many of the analyzed sequences.
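
One way to approach the gene set and pathway analysis mentioned above is an over-representation test: given the number of differentially expressed genes, how likely is it to find at least the observed number of them in a particular pathway by chance? The Python sketch below uses the hypergeometric distribution for this. The source does not specify the statistical method, so the test itself, the function name and the pathway numbers in the usage example are assumptions; only the gene universe of 26’475 genes is taken from the text.

```python
from math import comb


def enrichment_p_value(total_genes, pathway_genes, de_genes, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    total_genes:   size of the gene universe
    pathway_genes: genes annotated to the pathway
    de_genes:      differentially expressed genes
    overlap:       differentially expressed genes that fall in the pathway
    """
    upper = min(pathway_genes, de_genes)
    hits = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, de_genes - k)
        for k in range(overlap, upper + 1)
    )
    return hits / comb(total_genes, de_genes)


# Hypothetical pathway of 180 genes, 500 differentially expressed genes,
# 12 of which lie in the pathway; universe size as quoted in the text.
print(enrichment_p_value(26_475, 180, 500, 12))
```

When many pathways are tested, the resulting p-values need the same kind of multiple-testing adjustment as the per-gene p-values.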


Summary

RNA-Seq is not limited to questions of differential gene expression or identification of miRNA, which have been discussed in the previous sections. Table 2 lists common RNA-Seq applications. The table can serve as a guide for selecting an appropriate approach to a research question. Another application of RNA-Seq technology is, for example, de novo transcriptome assembly and annotation, which is useful when no annotated reference genome is available. In short, RNA is collected from as many different stages and tissues as possible. The entire RNA is then enriched for polyadenylated mRNA. The pool of mRNA, which ideally represents all transcribed genes, is then normalized to reduce abundant mRNAs and enrich rare mRNAs. The normalized transcripts are sequenced, then assembled in a second step and annotated with various databases in a third step, resulting in a ready-to-use de novo transcriptome [13].

RNA-Seq provides a snapshot of the transcriptome in cells and cell populations, making it a very attractive and powerful method. However, the results of RNA-Seq experiments are complex because they produce a large amount of fragmented data. With the right approach, the challenge of extracting knowledge is reduced to a manageable task.

Table 2. This table provides an overview of common scientific questions in the field of RNA-Seq and gives a brief overview of the most important points that need to be considered in an RNA-Seq project. The table is intended as a quick reference guide.

Common scientific question: Explore the influence of a treatment on eukaryotic gene expression
  Analysis method: Differential gene expression in eukaryotes
  Experimental setup: At least two conditions in replicates, no pooling of different conditions
  Material and resources: mRNA and availability of annotated reference genome
  Sample preparation: Total RNA isolation; stranded polyA enriched sequencing library
  Sequencing: 30 million single-end reads, 75 bp length
  Data analysis: Differential gene expression analysis; pathway analysis if pathway database for the organism in question is available

Common scientific question: Explore the influence of a treatment on prokaryotic gene expression
  Analysis method: Differential gene expression in prokaryotes
  Experimental setup: At least two conditions in replicates, no pooling of different conditions
  Material and resources: mRNA and availability of annotated reference genome
  Sample preparation: Total RNA isolation; stranded ribo-depleted sequencing library
  Sequencing: 10 million single-end reads, 75 bp length
  Data analysis: Differential gene expression analysis; pathway analysis if pathway database for the organism in question is available

Common scientific question: Develop biomarkers for specific medical conditions
  Analysis method: Non-coding RNA differential expression analysis in eukaryotes
  Experimental setup: At least two conditions in replicates, no pooling of different conditions
  Material and resources: e.g. miRNA and availability of annotated non-coding RNA and reference genome
  Sample preparation: Total RNA isolation; non-coding RNA enriched sequencing library
  Sequencing: 15 million single-end reads, 75 bp length
  Data analysis: Differential expression analysis; motif search

Common scientific question: Study cancer specific mechanisms
  Analysis method: Alternative splice-sites and gene-fusion detection (novel isoforms)
  Experimental setup: -
  Material and resources: mRNA and availability of annotated reference genome
  Sample preparation: Total RNA isolation; stranded polyA enriched sequencing library
  Sequencing: 50 million paired-end reads, 2 x 150 bp length
  Data analysis: Statistical appraisal of detected alternative splice sites and gene fusions

Common scientific question: Study the transcriptome of a yet uncharted species
  Analysis method: De novo transcriptome assembly
  Experimental setup: Pooling of different tissues, growth stages, etc. to capture the transcriptome in its entirety
  Material and resources: mRNA, missing annotated reference genome
  Sample preparation: Total RNA isolation; normalized mRNA sequencing library
  Sequencing: 20 million paired-end reads, 2 x 300 bp length
  Data analysis: De novo transcriptome assembly and annotation


References
[1] Steven L. Salzberg. Open questions: How many genes do we have? BMC Biology. 2018;16(94). doi:10.1186/s12915-018-0564-x

[2] Sam Griffiths-Jones, Russell J. Grocock, Stijn van Dongen, Alex Bateman, Anton J. Enright; miRBase: microRNA sequences, targets and gene nomenclature, Nucleic Acids Research, Volume 34, Issue suppl_1, 1 January 2006, Pages D140–D144, https://doi.org/10.1093/nar/gkj112

[3] The Gene Ontology Consortium; Expansion of the Gene Ontology knowledgebase and resources, Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D331–D338, https://doi.org/10.1093/nar/gkw1108

[4] Minoru Kanehisa, Miho Furumichi, Mao Tanabe, Yoko Sato, Kanae Morishima; KEGG: new perspectives on genomes, pathways, diseases and drugs, Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D353–D361, https://doi.org/10.1093/nar/gkw1092

[5] Schurch NJ, Schofield P, Gierliński M, et al. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA. 2016;22(6):839-851. doi:10.1261/rna.053959.115

[6] Lemire A, Lea K, Batten D, et al. Development of ERCC RNA Spike-In Control Mixes. Journal of Biomolecular Techniques : JBT. 2011;22(Suppl):S46

[7] Zhao S, Zhang Y, Gordon W, et al. Comparison of stranded and non-stranded RNA-seq transcriptome profiling and investigation of gene overlap. BMC Genomics. 2015;16(1):675. doi:10.1186/s12864-015-1876-7

[8] Online at: https://emea.illumina.com/systems/sequencing-platforms/nextseq/applications.html?langsel=/ch/, accessed 14.09.2018

[9] Online at: http://maq.sourceforge.net/fastq.shtml, accessed 14.09.2018

[10] Sandrine Borgeaud, Lisa C. Metzger, Tiziana Scrignari, Melanie Blokesch, The type VI secretion system of Vibrio cholerae fosters horizontal gene transfer, Science 02 Jan 2015: Vol. 347, Issue 6217, pp. 63-67. doi:10.1126/science.1260064

[11] Bhupesh K. Prusty, Nitish Gulve, Suvagata Roy Chowdhury, Michael Schuster, Sebastian Strempel, Vincent Descamps, Thomas Rudel. HHV-6 encoded small non-coding RNAs define an intermediate and early stage in viral reactivation. npj Genomic Medicine. 2018;3(25). doi:10.1038/s41525-018-0064-5

[12] Yannan Fan, Keith Siklenka, Simran K. Arora, Paula Ribeiro, Sarah Kimmins, Jianguo Xia; miRNet - dissecting miRNA-target interactions and functional associations through network-based visual analysis, Nucleic Acids Research, Volume 44, Issue W1, 8 July 2016, Pages W135–W141, https://doi.org/10.1093/nar/gkw288

[13] Neves, R.C., Guimaraes, J.C., Strempel, S. et al., Transcriptome profiling of Symbion pandora (phylum Cycliophora): insights from a differential gene expression analysis, Org Divers Evol (2017) 17: 111. https://doi.org/10.1007/s13127-016-0315-1
