Comparative Transcriptomics Analysis


Y-h. Taguchi
Department of Physics, Chuo University, Tokyo 112-8551, Japan

Abstract

In this article, I describe the de facto standard strategies for Comparative Transcriptomics Analysis (CTA). CTA is a frequently used strategy for investigating gene expression profiles from a biological point of view, since transcripts expressed differentially between two or more conditions are expected to represent the distinction between those conditions. In spite of the evident necessity of CTA, its execution is not straightforward, since there are no definite criteria for what constitutes differential expression. To address this problem, various statistical models have been proposed as measurement technologies have developed. Specifically, I focus on microarray and high throughput sequencing technologies and introduce applications designed to deal with the output from these technologies.
Keywords: differential expression, fold change, multiple comparison,
microarray, high throughput sequencing

1. Introduction

Biological meaning of Comparative Transcriptomics Analysis. Comparative Transcriptomics Analysis (CTA) is, as its name says, the comparison of the amounts of transcripts between two distinct conditions. Typical pairs of conditions are healthy controls vs. disease patients, a pair of distinct tissues, distinct time points, and even distinct species. CTA is performed so frequently because almost all cells keep the full set of genes, although not all of these genes are expressed. Because cells share the full set of genes, CTA is a primary method for investigating the factors that make cells distinct from one another.
In contrast to the clear purpose of CTA, performing CTA is much harder. This is primarily because the amounts of transcripts are naturally real numbers. Since two real numbers are never exactly equal to each other by chance, we need some criterion for deciding when the amounts of two transcripts can be regarded as different. The simplest solution to this problem is the introduction of statistical models. Based upon a statistical model under the null hypothesis that the amounts of two transcripts are equal, we can identify transcripts associated with distinct values between two conditions by rejecting the null hypothesis. Thus, the point is how to formulate the null hypothesis, that is, which statistical model to assume.
Depending on the assumed statistical model, various strategies of CTA are possible. They are primarily divided into two categories depending on the technology used to measure the amount of transcript. The two major technologies are microarray and high throughput sequencing (HTS). Although these two technologies measure biologically the same variable, i.e., the amount of transcripts, their outcomes differ. In microarray, the measured amounts of transcripts are inevitably real numbers, since microarray measures the amount of transcripts by the amount of light emitted from nucleotide fragments that bind to complementary probes. On the other hand, the amount of transcript measured by HTS must be an integer, since HTS counts the number of fragments cut out from mRNA. Since suitable statistical models for real numbers and integers differ, I discuss the two separately in the following.

Email address: tag@granular.com (Y-h. Taguchi)

Preprint submitted to Encyclopedia of Bioinformatics and Computational Biology: Applications, October 1, 2017

2. Statistical models

2.1. Microarray

As described above, statistical models applied to the amounts of transcripts measured by microarray must be models of real numbers. Most popular statistical models are based on the t distribution, t(n), where n is the number of samples in each class. When aiming to identify genes that have distinct means µ between two classes, the null hypothesis that µ is equal between the two classes is rejected when the P-value falls below some threshold value, e.g., 0.05.
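
The test described above can be illustrated with a minimal sketch. The expression values are hypothetical, and the use of Python with the SciPy function ttest_ind is my assumption for illustration; it is not one of the applications introduced in this article.

```python
import numpy as np
from scipy import stats

# Hypothetical log-scale amounts of one transcript in two classes (5 samples each)
control = np.array([7.9, 8.1, 8.0, 8.2, 7.8])
disease = np.array([9.1, 8.9, 9.3, 9.0, 9.2])

# Two-sample t test of the null hypothesis that both classes share the same mean
t_stat, p_value = stats.ttest_ind(control, disease)

# Reject the null hypothesis when the P-value falls below the chosen threshold
threshold = 0.05
reject_null = p_value < threshold
print(f"t = {t_stat:.2f}, P = {p_value:.2e}, reject null: {reject_null}")
```

In an actual CTA this test would be repeated for every transcript, which leads to the multiple comparison issue discussed in Section 2.5.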
The t test is applicable only to two-class problems. Although the t test can be formally applied to multiple classes by dividing them into a set of pairwise comparisons, this is erroneous for the following reason. Suppose we have K classes. Then the number of pairwise comparisons is as many as K(K − 1)/2. Thus, P = 2/[K(K − 1)] can be achieved by chance. If K is as large as ten, since P = 2/[K(K − 1)] ≈ 0.02, the usual threshold value P = 0.05 is clearly meaningless. If the threshold is corrected, e.g., P < 0.05 × 2/[K(K − 1)], the ability of the t test to detect genes associated with distinct amounts of transcript between any pair of classes will drastically decrease.
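
The arithmetic for K = 10 can be checked with a short sketch. The function name n_pairwise and the variable names are hypothetical, introduced only for this illustration.

```python
from itertools import combinations


def n_pairwise(k: int) -> int:
    """Number of pairwise comparisons among k classes, K(K - 1)/2."""
    return k * (k - 1) // 2


K = 10
pairs = list(combinations(range(K), 2))  # explicit enumeration of all class pairs
n_pairs = n_pairwise(K)                  # 45 comparisons

# P-value expected to appear by chance among that many comparisons
chance_level = 1 / n_pairs               # 2/[K(K - 1)], about 0.022

# Corrected per-comparison threshold, 0.05 x 2/[K(K - 1)]
corrected_threshold = 0.05 / n_pairs     # about 0.0011

print(n_pairs, round(chance_level, 4), round(corrected_threshold, 4))
```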
In order to avoid these problems, other strategies are employed for CTA with multiple classes. These strategies include categorical regression (in other words, analysis of variance (ANOVA) (Upton and Cook, 2008)), the χ² test, and extensions of them. Since details of implementation differ from application to application, they are discussed in the applications section below.
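
As a minimal sketch of the multiple class case, a one-way ANOVA tests all classes at once instead of testing every pair. The values are hypothetical, and the SciPy function f_oneway is my choice for illustration, not one of the applications discussed later.

```python
import numpy as np
from scipy import stats

# Hypothetical log-scale amounts of one transcript in three classes
class_a = np.array([8.0, 8.1, 7.9, 8.2])
class_b = np.array([8.1, 7.9, 8.0, 8.1])
class_c = np.array([9.0, 9.2, 8.9, 9.1])

# One-way ANOVA: null hypothesis that all three classes share the same mean
f_stat, p_value = stats.f_oneway(class_a, class_b, class_c)

print(f"F = {f_stat:.1f}, P = {p_value:.2e}")
```

A single P-value is obtained per transcript, so the number of tests does not grow with the number of class pairs.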

2.2. HTS

HTS is an alternative technology that can outperform microarray, which has many limitations. A microarray must be designed prior to measurement. This means that non-model organisms are hard to test. In HTS, nothing needs to be designed prior to measurement, although the genome sequence must be determined, if it is not already available, before the number of reads attributed to each gene can be counted. The genome sequence can also be determined by HTS.
HTS is roughly divided into two classes: genome sequencing and RNA-seq. In genome sequencing, fragmented genomic DNA is sequenced and the whole genome is assembled from the read sequences. RNA-seq, on the other hand, sequences reads taken from RNAs. In both cases, reads can be single end or paired end. In the latter, reads are generated from both ends of a longer DNA or RNA fragment. This strategy can achieve higher accuracy than single end reads, because paired ends impose the additional constraint that the two reads of a pair must lie within the length of the mRNA or the fragmented DNA.
Assembling fragmented DNA into a whole genome sequence and transforming RNA-seq reads into transcripts while considering splicing are issues of statistical modeling. In other words, they are tasks of deciding the most probable genome sequence, or the most probable mapping of RNA-seq reads to the genome, based on the obtained read sequences. Since the measurement of the amount of transcript itself is outside the scope of this article, refer to other articles for more details.
In contrast to conventional HTS, where the length of read sequences is limited to a few hundred nucleotides, an alternative technology, long read sequencing (e.g., nanopore, https://nanoporetech.com/, and PacBio, http://www.pacb.com/), has recently been rising. In long read technology, each read can be as long as a few thousand nucleotides, which is long enough to measure whole genomes of prokaryotes and individual whole transcripts of eukaryotes. This has several advantages compared with short read technologies. For prokaryote genomes, long read technology obviously allows the assembly step to be omitted when obtaining whole genome sequences. For eukaryote transcripts, the complicated process of identifying splice junctions can be omitted. Thus, long read technology is expected to soon replace short read technologies.

2.3. Normalization

Prior to CTA, the amount of transcript must be normalized, since the total amount of transcript is impossible to control during measurement, regardless of the measurement technology, microarray or HTS. This process can be done either outside of CTA or as a part of CTA. In the former case, researchers can select their preferred methods, while the latter employs a pre-defined strategy for normalization. In any case, the outcome of CTA varies depending on the normalization strategy, and there are no de facto standard techniques for normalization prior to CTA. This is an additional reason why CTA is difficult to perform, since incorrect normalization might result in the misidentification of genes expressed distinctly between two conditions. Various applications for normalization prior to CTA are introduced in the applications section below.

2.4. Fold change


Even if some genes are identified as expressed differently between two conditions based on the criterion given by a statistical test, some of them are unlikely to play critical roles biologically. This is because standard statistical tests often ignore the amount of transcripts itself. For example, in the t test, a very small difference of µ between two classes can be judged as significant when the estimated standard deviation in each class is much smaller than the estimated difference of µ between the two classes. In the extreme case, when the amounts of transcript are exactly the same within each class, i.e., the standard deviation is equal to zero, any tiny difference of µ between the two classes is regarded as significant, since P = 0. Nevertheless, such a judgement is clearly meaningless from a biological point of view.
In order to avoid these somewhat meaningless identifications of genes associated with distinct amounts of transcript between classes, fold change is often considered together with statistical tests. Fold change is, as its name says, the ratio between the amounts of transcripts in two classes. Genes associated with both significantly small P-values and a large enough fold change (e.g., larger than two or smaller than one half) are then regarded as those associated with a significant difference in the amount of transcript between the two classes. Although this strategy is known to work well empirically, there is no systematic way to decide which fold change threshold (larger than twofold, larger than threefold, smaller than one half, or smaller than one third) should be employed in order to get biologically reasonable answers.
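
The combined criterion can be sketched as follows. The function name is_differentially_expressed, the default thresholds, and the expression values are all hypothetical, and the SciPy t test stands in for whatever statistical test is actually employed.

```python
import numpy as np
from scipy import stats


def is_differentially_expressed(control, treated, p_threshold=0.05, fc_threshold=2.0):
    """Require both a small P-value and a large enough fold change.

    Inputs are positive, linear-scale amounts of one transcript in two classes.
    """
    control = np.asarray(control, dtype=float)
    treated = np.asarray(treated, dtype=float)
    p_value = stats.ttest_ind(control, treated).pvalue
    fold_change = treated.mean() / control.mean()
    # Accept either a large increase or a large decrease
    large_change = fold_change >= fc_threshold or fold_change <= 1.0 / fc_threshold
    selected = bool(p_value < p_threshold and large_change)
    return selected, float(p_value), float(fold_change)


# Gene 1: a tiny but very reproducible difference (significant P, fold change near 1)
selected1, p1, fc1 = is_differentially_expressed(
    [100.0, 100.1, 99.9, 100.0], [101.0, 101.1, 100.9, 101.0])

# Gene 2: roughly threefold increase with ordinary variability
selected2, p2, fc2 = is_differentially_expressed(
    [100.0, 120.0, 90.0, 110.0], [300.0, 330.0, 280.0, 310.0])

print(selected1, round(fc1, 2))
print(selected2, round(fc2, 2))
```

Gene 1 passes the t test alone but is discarded by the fold change filter, which is exactly the situation described above.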

2.5. Multiple comparisons


Multiple comparison is another issue of CTA. Any statistical test can attribute to each gene a P-value for rejecting the null hypothesis that two classes are equivalent in some sense. Nevertheless, since the number of genes is huge (ca. 10^4), these P-values are misleading. If the number of genes is N, P = 1/N can happen by chance. This suggests that the frequently used criterion that P < 0.05 is significant is useless.
At the moment, although there is no definite way to address this problem, numerous approaches have been proposed. The simplest one is the Bonferroni correction (Armstrong, 2014), where P is transformed to NP. This apparently simple transformation has the drawback that it often misses significant differences because its criterion is too strict. A more robust way is to assume a uniform distribution for P-values. If the null hypothesis is true, P-values must obey a uniform distribution. Genes whose attributed P-values deviate from the uniform distribution are then regarded as significant. The most frequently used criterion along this line is Benjamini-Hochberg (Benjamini and Hochberg, 1995). The latter strategy is more often employed, since empirically it gives more biologically reasonable (interpretable) results.
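
Both corrections can be written in a few lines. The following is a minimal sketch in plain Python; the function names and the five P-values are hypothetical, and actual analyses would rely on the adjusted P-values reported by the packages introduced in the applications section.

```python
def bonferroni(p_values):
    """Bonferroni correction: P is transformed to N * P (capped at 1)."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values.

    The i-th smallest P-value is compared with i/N, the value expected under
    the uniform distribution, by transforming it to P * N / i.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest P-value to keep the result monotone
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


p_values = [0.001, 0.008, 0.02, 0.03, 0.6]  # hypothetical raw P-values
n_bonferroni = sum(p < 0.05 for p in bonferroni(p_values))
n_bh = sum(p < 0.05 for p in benjamini_hochberg(p_values))
print(n_bonferroni, n_bh)
```

With these values the Bonferroni correction retains fewer genes than Benjamini-Hochberg at the same threshold, illustrating its stricter criterion.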

2.6. Batch effect

Although it is somewhat related to the normalization issue, batch effect can often heavily affect the outcomes of CTA. Batch effects generally mean uncontrollable external factors that can affect the measured amount of transcript. Numerous factors can cause batch effects, e.g., days, time, institutes, persons, and so on. If batch effects are not removed effectively, CTA results in a comparison not between treated and control samples but among batches. Although many packages have been proposed for removing batch effects (Reese et al., 2013; Leek, 2014; Chen et al., 2011), there are no de facto standard methods that can remove batch effects well independent of the experimental situation. The best strategy that researchers can employ is to evaluate outcomes while carefully considering their biological significance. In principle, it is dangerous to mix gene expression profiles taken from different batches.

2.7. Enrichment analysis

Although it is not directly related to CTA, biological interpretation of the selected genes is often required and important. One typical analysis for this purpose is enrichment analysis. In enrichment analysis, a set of genes is given, and the biological terms associated with the gene set with significantly small P-values, computed by some statistical test, e.g., Fisher's exact test, are identified.
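
For a single biological term, the one-sided Fisher's exact test reduces to the upper tail of a hypergeometric distribution. The sketch below assumes that formulation; the function name enrichment_p_value and the gene counts are hypothetical, and the servers listed in this section may use different tests or corrections.

```python
from math import comb


def enrichment_p_value(n_total, n_term, n_selected, n_overlap):
    """One-sided Fisher's exact test for over-representation.

    Probability of observing at least n_overlap term-annotated genes when
    n_selected genes are drawn from n_total genes, n_term of which carry the term.
    """
    max_overlap = min(n_term, n_selected)
    denominator = comb(n_total, n_selected)
    tail = sum(
        comb(n_term, k) * comb(n_total - n_term, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / denominator


# Hypothetical example: 10000 genes, 200 annotated with the term,
# 100 genes selected by CTA, 10 of which are annotated (2 expected by chance)
p_value = enrichment_p_value(10000, 200, 100, 10)
print(f"P = {p_value:.2e}")
```

Since many terms are tested at once, these P-values are also subject to the multiple comparison issue of Section 2.5.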
Here I list some well-known servers used for this purpose: DAVID (Huang et al., 2008), g:Profiler (Reimand et al., 2016), TargetMine (Chen et al., 2016), Enrichr (Kuleshov et al., 2016), and MSigDB (Subramanian et al., 2005). Each has its own pros and cons.

2.8. qPCR

Other than the two major technologies, microarray and HTS, quantitative polymerase chain reaction (qPCR) is also sometimes used to measure the amount of transcript. In spite of the quantitative accuracy of qPCR compared with the other two technologies, qPCR is used less frequently. This is because qPCR cannot measure numerous transcripts simultaneously. For each transcript, experiments must be repeated independently, which is far from cost effective. In addition, qPCR often requires a reference transcript that is not altered between treated and control samples. Since both the identification of unaltered transcripts and the artificial spike-in of reference genes are additional time- and cost-consuming processes, qPCR is mainly used for validating a limited number of transcripts among all transcripts measured by either microarray or HTS.

2.9. Single cell analysis

Although single cell analysis (SCA) (Yuan et al., 2017), which measures transcripts cell by cell, is a rising field, it is not yet developed enough to be included in this encyclopedia as an established field. First of all, because the technology is still developing, some transcripts are often missing. At the moment, there is no way to judge whether a missing transcript is really missing (biologically) or not (technologically missing, i.e., a failure of measurement). Thus, at present the purpose of SCA is often not identifying genes whose transcripts are expressed distinctly between treated and control samples, but clustering (grouping) cells based upon the measured transcripts. Since clustering cells is somewhat outside of CTA, SCA is not discussed here in detail. Nonetheless, SCA will surely replace conventional CTA in the future, once the SCA technology is established.

3. Applications

There are numerous applications for CTA. In this article, those in Bioconductor (Huber et al., 2015) are specifically introduced.

3.1. Normalization for microarray

There are numerous methods for normalization prior to CTA for microarray. There are generally two branches. The first is normalization based upon a single microarray, which means that the amount of transcript is normalized by considering a single measurement only. One of the most frequently used methods of single array based normalization is mas5, which is implemented as the mas5 function in the affy package (Gautier et al., 2004). In single microarray based normalization, the total amount of transcripts is assumed to be constant independent of measurements. The other frequent strategy is normalization based upon multiple arrays. The most popular method along this direction is rma, which is implemented as the rma function, also in the affy package (Gautier et al., 2004). In the multiple array based strategy, the amounts of transcripts that share the same rank among multiple arrays are assumed to take the same value. In practice, there is no way to judge which of the two is better. Generally speaking, multiple array based strategies are more popular since they are less affected by individual measurements. In principle, the choice of the better strategy is highly context dependent and must be evaluated based upon the biological outcomes.
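
The two assumptions can be illustrated with a minimal sketch in plain Python. These functions only mimic the assumptions stated above (a constant total, and equal values for equal ranks); they are not the mas5 or rma algorithms, which include further steps such as background correction and probe summarization. All names and values are hypothetical, and ties are not treated specially.

```python
def scale_to_constant_total(array, target_total=1000.0):
    """Single array based idea: rescale one array so that its total is constant."""
    factor = target_total / sum(array)
    return [x * factor for x in array]


def quantile_normalize(arrays):
    """Multiple array based idea: values sharing a rank take the same value.

    arrays is a list of equal-length lists, one list per microarray.
    """
    n_genes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean over arrays of the values that share the same rank
    rank_means = [
        sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n_genes)
    ]
    normalized = []
    for a in arrays:
        order = sorted(range(n_genes), key=lambda g: a[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized


arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
scaled = scale_to_constant_total(arrays[0])
print(normalized[0])
```

After quantile normalization every array contains exactly the same set of values, and only the assignment of values to genes differs between arrays.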

3.2. Normalization for HTS

Although the amount of transcripts obtained by HTS technology is an integer, its normalization has difficulties different from those of microarray normalization. Since the number of reads is that of RNA fragments, the number of reads mapped to an individual gene is not proportional to the amount of transcripts: longer RNA obviously yields more fragments. This is in contrast to microarray, where the amount of transcripts is measured on a per gene basis. Thus, there are two kinds of normalization strategy. One is to transform the number of reads mapped to each gene into a per gene quantity. The most frequently used definition along this line is RPKM (Reads Per Kilobase per Million mapped reads), which is defined as

\[ \text{raw counts} \times \frac{10^{6}}{\text{all reads}} \times \frac{10^{3}}{\text{gene length}} \]

where raw counts is the number of reads mapped to each gene and all reads is the total number of reads in each measurement. Although this looks reasonable, there is one drawback. When the expression of a single gene drastically increases, all other transcripts are regarded as decreased because all reads appears in the denominator, although this is clearly not reasonable.
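
The definition and its drawback can be reproduced with a minimal sketch. The function name rpkm, the read counts, and the gene lengths are hypothetical, and for simplicity the sketch assumes that all reads equals the sum of the raw counts of the listed genes.

```python
def rpkm(raw_counts, gene_lengths):
    """raw counts x 10^6 / all reads x 10^3 / gene length, for each gene."""
    all_reads = sum(raw_counts)  # assumption: every mapped read falls in a listed gene
    return [
        count * 1e6 / all_reads * 1e3 / length
        for count, length in zip(raw_counts, gene_lengths)
    ]


gene_lengths = [1000, 2000, 500]  # nucleotides

# Reads proportional to gene length: identical RPKM for all three genes
before = rpkm([100, 200, 50], gene_lengths)

# Only the third gene increases drastically; raw counts of the others are unchanged
after = rpkm([100, 200, 1000], gene_lengths)

print([round(x) for x in before])
print([round(x) for x in after])
```

Although the raw counts of the first two genes are identical in both measurements, their RPKM values drop in the second one, because the increase of the third gene inflates all reads.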
Another strategy is to use raw reads without any normalization. It might look strange, but raw reads can be treated as they are if suitable statistical models are employed (see below).

3.3. CTA for real numbers

Since RPKM is a real number analogous to the amount of transcripts measured by microarray, it is discussed here together with microarray. For CTA of real numbers, there is a huge number of applications. Here I introduce the two most frequently used ones. The first is SAM (Significance Analysis of Microarrays) (Tusher et al., 2001), which is implemented as the sam function in the siggenes package (Schwender, 2012). It can deal with both two classes and multiple classes, by employing a modified t test and a χ² test, respectively. The other is limma (Ritchie et al., 2015), which is based upon a linear model within a Bayesian framework. limma can also deal with both two classes and multiple classes. In practice, limma can be adapted to almost all situations, since it employs a design matrix strategy by which users can specify any kind of comparison. sam and limma can also give users adjusted P-values that take the multiple comparison criterion into account. Thus, researchers do not have to apply a multiple comparison correction separately.
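
The design matrix strategy can be illustrated with an ordinary least-squares fit. This is only a sketch of the idea in Python with NumPy, under hypothetical sample labels and values; it is not limma itself, and it omits the Bayesian moderation and the P-value computation that limma performs.

```python
import numpy as np

# Hypothetical log-scale expression of one gene in six samples
expression = np.array([8.0, 8.2, 9.1, 8.9, 10.0, 10.2])
groups = ["control", "control", "treatA", "treatA", "treatB", "treatB"]
levels = ["control", "treatA", "treatB"]

# Design matrix: one indicator column per class, one row per sample
design = np.array(
    [[1.0 if g == level else 0.0 for level in levels] for g in groups]
)

# Least-squares fit; with this design each coefficient is a class mean
coef, *_ = np.linalg.lstsq(design, expression, rcond=None)

# Any comparison is expressed as a contrast of coefficients, here treatB - control
contrast = np.array([-1.0, 0.0, 1.0])
estimated_difference = float(contrast @ coef)

print(np.round(coef, 2), round(estimated_difference, 2))
```

Changing the contrast vector, e.g., to compare treatA with treatB, requires no change to the fit, which is why a design matrix can cover two-class and multiple-class comparisons alike.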

3.4. CTA for integer numbers

Since read counts obtained by HTS are non-negative integers, we need a null hypothesis fitted to this situation. Although the Poisson distribution has long been employed, it has one drawback: since the Poisson distribution has only one parameter, it cannot be fitted to the mean and variance simultaneously. In order to overcome this problem, the negative binomial distribution is more often used. DESeq2 (Love et al., 2014) is the most frequently used package for this purpose. It accepts raw reads as input, and normalization is included in the data processing. It also gives adjusted P-values, so that multiple comparisons need not be considered separately. It is applicable to both two classes and multiple classes, since it employs the same design matrix strategy as limma.
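
The limitation of the Poisson distribution can be seen from the mean and variance of replicated counts. The sketch below uses hypothetical counts and a simple method-of-moments estimate of the dispersion α in the negative binomial relation variance = µ + αµ²; this estimate is my simplification for illustration and is not the estimation procedure of DESeq2.

```python
from statistics import mean, variance

# Hypothetical read counts of one gene across six replicates
counts = [12, 30, 8, 45, 20, 5]

mu = mean(counts)       # sample mean
var = variance(counts)  # sample variance

# A Poisson model has a single parameter, so it forces variance == mean
poisson_variance = mu

# A negative binomial model adds a dispersion parameter alpha:
#   variance = mu + alpha * mu**2
alpha = max(0.0, (var - mu) / mu**2)
nb_variance = mu + alpha * mu**2

print(f"mean = {mu}, variance = {var:.1f}, alpha = {alpha:.3f}")
```

Here the observed variance is far larger than the mean, which a Poisson model cannot reproduce, whereas the additional parameter of the negative binomial model absorbs the excess.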
In spite of the frequent and successful usage of DESeq2 in CTA of HTS, the appearance of the negative binomial distribution is not always guaranteed, since it lacks a convergence theorem like the one the normal distribution has. Because of this weak point, a non-parametric strategy is sometimes employed. NOISeq (Tarazona et al., 2015) is one of the most frequently used non-parametric packages. NOISeq also implements data normalization and multiple comparison correction, and is applicable to both two classes and multiple classes.
CTA often results in distinct outcomes between DESeq2 and NOISeq. As with microarray, there are no definite criteria for deciding the best application.

4. Conclusions

In summary, CTA is an art rather than a science. At the moment, there are no definite methods guaranteed to always work well regardless of the situation considered. CTA must be done with great care in order to avoid obtaining results without any biological meaning. It is not easy, but it must be attempted.

References

Armstrong, R.A., 2014. When to use the Bonferroni correction. Ophthalmic and Physiological Optics 34, 502–508. URL: https://doi.org/10.1111%2Fopo.12131, doi:10.1111/opo.12131.

Benjamini, Y., Hochberg, Y., 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57, 289–300. URL: http://dx.doi.org/10.2307/2346101, doi:10.2307/2346101.

Chen, C., Grennan, K., Badner, J., Zhang, D., Gershon, E., Jin, L., Liu, C., 2011. Removing batch effects in analysis of expression microarray data: An evaluation of six batch adjustment methods. PLoS ONE 6, e17238. URL: https://doi.org/10.1371%2Fjournal.pone.0017238, doi:10.1371/journal.pone.0017238.

Chen, Y.A., Tripathi, L.P., Mizuguchi, K., 2016. An integrative data analysis platform for gene set analysis and knowledge discovery in a data warehouse framework. Database 2016, baw009. URL: https://doi.org/10.1093%2Fdatabase%2Fbaw009, doi:10.1093/database/baw009.

Gautier, L., Cope, L., Bolstad, B.M., Irizarry, R.A., 2004. affy–analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20, 307–315. URL: https://doi.org/10.1093%2Fbioinformatics%2Fbtg405, doi:10.1093/bioinformatics/btg405.

Huang, D.W., Sherman, B.T., Lempicki, R.A., 2008. Systematic and integrative
analysis of large gene lists using DAVID bioinformatics resources. Nature
Protocols 4, 44–57. URL: https://doi.org/10.1038%2Fnprot.2008.211,
doi:10.1038/nprot.2008.211.

Huber, W., Carey, V.J., Gentleman, R., Anders, S., Carlson, M., Carvalho, B.S., Bravo, H.C., Davis, S., Gatto, L., Girke, T., Gottardo, R., Hahne, F., Hansen, K.D., Irizarry, R.A., Lawrence, M., Love, M.I., MacDonald, J., Obenchain, V., Oleś, A.K., Pagès, H., Reyes, A., Shannon, P., Smyth, G.K., Tenenbaum, D., Waldron, L., Morgan, M., 2015. Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods 12, 115–121. URL: https://www.bioconductor.org/, doi:10.1038/nmeth.3252.

Kuleshov, M.V., Jones, M.R., Rouillard, A.D., Fernandez, N.F., Duan, Q., Wang, Z., Koplev, S., Jenkins, S.L., Jagodnik, K.M., Lachmann, A., McDermott, M.G., Monteiro, C.D., Gundersen, G.W., Ma’ayan, A., 2016. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Research 44, W90–W97. URL: https://doi.org/10.1093%2Fnar%2Fgkw377, doi:10.1093/nar/gkw377.

Leek, J.T., 2014. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Research 42, e161–e161. URL: https://doi.org/10.1093%2Fnar%2Fgku864, doi:10.1093/nar/gku864.

Love, M.I., Huber, W., Anders, S., 2014. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15. URL: https://doi.org/10.1186%2Fs13059-014-0550-8, doi:10.1186/s13059-014-0550-8.

Reese, S.E., Archer, K.J., Therneau, T.M., Atkinson, E.J., Vachon, C.M., de Andrade, M., Kocher, J.P.A., Eckel-Passow, J.E., 2013. A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis. Bioinformatics 29, 2877–2883. URL: https://doi.org/10.1093%2Fbioinformatics%2Fbtt480, doi:10.1093/bioinformatics/btt480.

Reimand, J., Arak, T., Adler, P., Kolberg, L., Reisberg, S., Peterson, H., Vilo, J., 2016. g:Profiler—a web server for functional interpretation of gene lists (2016 update). Nucleic Acids Research 44, W83–W89. URL: https://doi.org/10.1093%2Fnar%2Fgkw199, doi:10.1093/nar/gkw199.

Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., Smyth, G.K., 2015. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47–e47. URL: https://doi.org/10.1093%2Fnar%2Fgkv007, doi:10.1093/nar/gkv007.

Schwender, H., 2012. siggenes: Multiple testing using SAM and Efron’s empirical Bayes approaches. URL: https://bioconductor.org/packages/release/bioc/html/siggenes.html. R package version 1.50.0.

Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., Mesirov, J.P., 2005. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102, 15545–15550. URL: https://doi.org/10.1073%2Fpnas.0506580102, doi:10.1073/pnas.0506580102.

Tarazona, S., Furió-Tarı́, P., Turrà, D., Pietro, A.D., Nueda, M.J., Ferrer, A., Conesa, A., 2015. Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Research, gkv711. URL: https://doi.org/10.1093%2Fnar%2Fgkv711, doi:10.1093/nar/gkv711.

Tusher, V.G., Tibshirani, R., Chu, G., 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences 98, 5116–5121. URL: https://doi.org/10.1073%2Fpnas.091062498, doi:10.1073/pnas.091062498.

Upton, G., Cook, I., 2008. A Dictionary of Statistics. Oxford University Press. URL: https://doi.org/10.1093%2Facref%2F9780199541454.001.0001, doi:10.1093/acref/9780199541454.001.0001.

Yuan, G.C., Cai, L., Elowitz, M., Enver, T., Fan, G., Guo, G., Irizarry, R., Kharchenko, P., Kim, J., Orkin, S., Quackenbush, J., Saadatpour, A., Schroeder, T., Shivdasani, R., Tirosh, I., 2017. Challenges and emerging directions in single-cell analysis. Genome Biology 18. URL: https://doi.org/10.1186%2Fs13059-017-1218-y, doi:10.1186/s13059-017-1218-y.
