Comparative Transcriptomics Analysis
Y-h. Taguchi
Department of Physics, Chuo University, Tokyo 112-8551, Japan
Abstract
1. Introduction
2. Statistical models
2.1. Microarray
statistical models assume that the amount of transcript obeys the t distribution,
t(n), where n is the number of samples in each class. When aiming to identify
genes that have distinct means µ between two classes, the null hypothesis that µ
is equivalent between the two classes is rejected when the P-value falls below
some threshold value, e.g., 0.05.
The t test is only applicable to two-class problems. Although the t test can be
formally applied to multiple classes by dividing them into a set of pairwise
comparisons, this is erroneous for the following reason. Suppose we have K
classes. Then the number of pairwise comparisons is as many as K(K − 1)/2. Thus,
P = 2/K(K − 1) can be achieved by chance. If K is as large as ten, since
P = 2/K(K − 1) ≈ 0.02, the usual threshold value P = 0.05 is clearly
meaningless. If P-values are corrected, e.g., P < 0.05 × 2/K(K − 1), the ability
of the t test to detect genes associated with distinct amounts of transcript
between any pair of classes will drastically decrease.
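The arithmetic above can be checked with a short sketch. The function names are hypothetical and chosen only for illustration; the formulas are the ones stated in the text.

```python
def pairwise_comparisons(k: int) -> int:
    """Number of distinct pairs among k classes, K(K - 1)/2 (requires k >= 2)."""
    return k * (k - 1) // 2


def chance_level(k: int) -> float:
    """P-value scale reachable by chance over all pairs, 2/(K(K - 1))."""
    return 1.0 / pairwise_comparisons(k)


def corrected_threshold(alpha: float, k: int) -> float:
    """Per-comparison threshold after correction, alpha * 2/(K(K - 1))."""
    return alpha / pairwise_comparisons(k)


if __name__ == "__main__":
    for k in (3, 5, 10):
        print(k, pairwise_comparisons(k),
              round(chance_level(k), 4),
              round(corrected_threshold(0.05, k), 5))
```

For K = 10 there are 45 pairs, so the corrected per-comparison threshold drops to roughly 0.0011, which illustrates why the detection ability of the pairwise t test decreases.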
In order to avoid these problems, other strategies are employed for CTAs
associated with multiple classes. These include categorical regression (in other
words, analysis of variance (ANOVA) (Upton and Cook, 2008)), the χ2 test, or
extensions of them. Details of implementation differ from application to
application and will be discussed in the applications section below.
2.2. HTS
tries to sequence reads taken from RNAs. In both cases, reads can be single end
or paired end. For the latter, reads are generated from both ends of a longer
fragment of DNA or RNA. This strategy can increase accuracy compared with single
end reads, because paired ends impose the additional constraint that the two
ends must lie within the length of the mRNA or the fragmented DNA.
Assembling fragmented DNA to obtain the whole genome sequence and transforming
RNA-seq reads into transcripts while considering splicing are issues for
statistical models. In other words, they are tasks that decide the most probable
genome sequence, or the most probable mapping of RNA-seq reads to the genome,
based on the obtained read sequences. Since the measurement of the amount of
transcript itself is out of the scope of this article, refer to other articles
for more details.
In contrast to conventional HTS, where the length of read sequences is limited
to a few hundred nucleotides, an alternative technology, long read sequencing
(e.g., nanopore, https://nanoporetech.com/ and PacBio, http://www.pacb.com/),
has recently been rising. In long read technology, each read can be as long as a
few thousand nucleotides, which is long enough to measure the whole genome of
prokaryotes and individual whole transcripts of eukaryotes. This has several
advantages compared with short read technologies. For prokaryote genomes, long
read technology obviously allows assembly to be omitted when obtaining whole
genome sequences. For eukaryote transcripts, the complicated process of
identifying splice junctions can be omitted. Thus, long read technology will
soon completely replace short read technologies.
2.3. Normalization
normalization prior to CTA. This is an additional reason why CTA is difficult to
perform, since incorrect normalization might result in the misidentification of
genes expressed distinctly between two conditions. Various applications aimed at
normalization prior to CTA will be introduced in the applications section below.
equivalent in some sense. Nevertheless, since the number of genes is huge (ca.
10^4), these P-values are misleading. If the number of genes is N, P = 1/N can
happen by chance. This suggests that the frequently used criterion that
P < 0.05 is significant is useless.
At the moment, although there is no definitive way to address this problem,
numerous approaches have been proposed. The simplest one is the Bonferroni
correction (Armstrong, 2014), where P is transformed to NP. This apparently
simple transformation has the drawback that it often misses significant
differences because its criterion is too strict. A more robust way is to assume
a uniform distribution for P-values. If the null hypothesis is true, P-values
must obey the uniform distribution. Then, genes whose attributed P-values
deviate from the uniform distribution are regarded as significant. The most
frequently used criterion along this line is Benjamini-Hochberg (Benjamini and
Hochberg, 1995). The latter strategy has been employed more often, since
empirically it gives more biologically reasonable (interpretable) results.
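The two corrections can be compared with a minimal sketch. The Bonferroni part follows the NP transformation given in the text; the Benjamini-Hochberg part uses the standard step-up adjustment (the sorted P-value of rank r is scaled by N/r, then made monotone from the largest rank downward). The function names and the example P-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: P -> N * P, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, ascending P
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that adjusted values
    # never increase as the raw P-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    p = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-gene P-values
    print(bonferroni(p))
    print(benjamini_hochberg(p))
```

With these four P-values and a 0.05 cutoff, Bonferroni retains two genes while Benjamini-Hochberg retains all four, which illustrates the stricter criterion of the former.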
2.7. Enrichment analysis
2.8. qPCR
Other than the two major technologies, microarray and HTS, quantitative
polymerase chain reaction (qPCR) is also sometimes used to measure the amount of
transcript. In spite of the quantitative accuracy of qPCR compared with the
other two technologies, qPCR is used less frequently. This is because qPCR
cannot measure numerous transcripts simultaneously. For each transcript,
experiments must be repeated independently, which is far from cost effective. In
addition, qPCR often requires a reference transcript that is not altered between
treated and control samples. Since either identifying an unaltered transcript or
artificially spiking in reference genes is an additional time- and
cost-consuming process, qPCR is mainly used for validating a limited number of
transcripts among all transcripts measured by either microarray or HTS.
Although single cell analysis (SCA) (Yuan et al., 2017), which measures
transcripts cell by cell, is a rising field, it is not developed enough to be
included in this encyclopedia as an established field. First of all, because the
technology is still developing, some transcripts are often missing. At the
moment, there is no way to judge whether a missing transcript is really missing
(biologically) or not (technologically missing, i.e., failure of measurement).
Thus, the purpose of SCA at present is often not identifying genes whose
transcripts are expressed distinctly between treated and control samples, but
clustering (grouping) cells based upon the measured transcripts. Since
clustering cells is somewhat outside of CTA, SCA is not discussed here in
detail. Nonetheless, SCA will surely replace conventional CTA in the future once
the SCA technology is established.
3. Applications
There are numerous applications for CTA. In this article, those in Biocon-
ductor (Huber et al., 2015) will be specifically introduced.
There are numerous methods to achieve normalization prior to CTA for microarray.
There are generally two branches along this direction. The first one is
normalization based upon a single microarray. This means that the amount of
transcript is normalized by considering a single measurement. One of the most
frequently used single array based normalization methods is mas5, which is
implemented as the mas5 function included in the affy (Gautier et al., 2004)
package. In single microarray based normalization, the total amount of
transcripts is assumed to be constant, independent of measurements. Another
frequent strategy is normalization based upon multiple arrays. The most popular
one along this direction is rma, which is implemented as the rma function, also
included in the affy (Gautier et al., 2004) package. In the multiple array based
strategy, the amounts of transcripts that share a ranking among multiple arrays
are assumed to take the same values. In practice, there is no way to judge which
of these two is better. Generally speaking, multiple array based strategies are
more popular since they are less affected by individual measurements. In
principle, the choice of the better strategy is highly context dependent. It
must be evaluated based upon the biological outcomes.
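The two assumptions can be made concrete with a minimal sketch. It is not the mas5 or rma algorithm (both involve further steps such as background handling and probe summarization); it only illustrates the stated assumptions: a constant total per array for the single array case, and equal values for equal ranks across arrays (quantile normalization) for the multiple array case. Function names, the target total, and the toy data are assumptions made for this example.

```python
from statistics import fmean


def scale_to_common_total(array, target_total=100.0):
    """Single array: rescale so that the total amount equals target_total."""
    factor = target_total / sum(array)
    return [x * factor for x in array]


def quantile_normalize(arrays):
    """Multiple arrays: values sharing a rank are replaced by the mean over
    arrays of the values at that rank. Ties are broken by position."""
    n_genes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [fmean([s[r] for s in sorted_arrays]) for r in range(n_genes)]

    result = []
    for a in arrays:
        order = sorted(range(n_genes), key=lambda i: a[i])  # gene index by rank
        normalized = [0.0] * n_genes
        for rank, gene in enumerate(order):
            normalized[gene] = rank_means[rank]
        result.append(normalized)
    return result


if __name__ == "__main__":
    print(scale_to_common_total([1.0, 2.0, 7.0]))
    print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```

After quantile normalization every array has exactly the same set of values, differing only in which gene holds which value, so a single outlying measurement in one array has limited influence.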
3.2. Normalization for HTS
\[
\text{raw counts} \times \frac{10^{6}}{\text{all reads}} \times \frac{10^{3}}{\text{gene length}}
\]
where raw counts is the number of reads mapped to each gene and all reads is the
total number of reads in each measurement. Although this looks reasonable, there
is one drawback. When the expression of a single gene drastically increases, all
of the other transcripts are regarded as being decreased because all reads is in
the denominator, although this is clearly not reasonable.
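The drawback can be reproduced with a small sketch of the formula above (commonly known as RPKM). For simplicity it assumes that every read in a measurement maps to one of the listed genes, so that all reads equals the sum of the raw counts; the counts and gene lengths are invented for illustration.

```python
def normalize_by_depth_and_length(raw_counts, gene_lengths):
    """raw counts * 10^6 / all reads * 10^3 / gene length, per gene.
    Assumes all reads == sum(raw_counts) (every read maps to a listed gene)."""
    all_reads = sum(raw_counts)
    return [count * 1e6 / all_reads * 1e3 / length
            for count, length in zip(raw_counts, gene_lengths)]


if __name__ == "__main__":
    lengths = [1000, 1000, 1000]   # hypothetical gene lengths in nucleotides
    control = [100, 100, 100]      # hypothetical raw counts
    treated = [100, 100, 700]      # only the third gene increases

    print(normalize_by_depth_and_length(control, lengths))
    print(normalize_by_depth_and_length(treated, lengths))
```

The raw counts of the first two genes are identical in both measurements, yet their normalized values fall to one third in the treated sample because the third gene inflates the denominator.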
Another strategy is to use raw reads without any normalization. It might look
strange, but raw reads can be treated as they are if suitable statistical models
are proposed (see below).
multiple classes. In practice, limma can be adapted to almost all situations,
since it employs a design matrix strategy by which users can designate any kind
of possible comparison. sam and limma can also give users adjusted P-values that
take a multiple comparison criterion into account. Thus, researchers do not have
to apply a multiple comparison correction themselves.
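The idea behind a design matrix can be sketched in a few lines. This is not limma's API, and limma itself does considerably more than what is shown here; the sketch only illustrates how one indicator column per class, combined with a contrast vector, lets the user express an arbitrary comparison. All names and data are hypothetical.

```python
def design_matrix(labels):
    """One indicator (0/1) column per class, one row per sample."""
    classes = sorted(set(labels))
    matrix = [[1.0 if label == c else 0.0 for c in classes] for label in labels]
    return classes, matrix


def fit_class_means(expression, labels):
    """Least-squares coefficients for the indicator design above.
    For this design the solution is simply the mean of each class."""
    classes, matrix = design_matrix(labels)
    coefficients = []
    for j in range(len(classes)):
        column = [row[j] for row in matrix]
        n_in_class = sum(column)
        total = sum(x * d for x, d in zip(expression, column))
        coefficients.append(total / n_in_class)
    return classes, coefficients


def contrast(coefficients, weights):
    """Weighted combination of coefficients, e.g. [-1, 0, 1] for c minus a."""
    return sum(w * c for w, c in zip(weights, coefficients))


if __name__ == "__main__":
    labels = ["a", "a", "b", "b", "c", "c"]
    gene = [1.0, 3.0, 5.0, 7.0, 10.0, 12.0]   # one gene across six samples
    classes, coefs = fit_class_means(gene, labels)
    print(classes, coefs)
    print(contrast(coefs, [-1.0, 0.0, 1.0]))   # class c versus class a
```

Changing only the weight vector switches the comparison, for example [0, -1, 1] for c versus b, without refitting the model.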
Since read counts from HTS are non-negative integers, we need a null hypothesis
fitted to this situation. Although the Poisson distribution has long been
employed, it has one drawback: since the Poisson distribution has only one
parameter, it cannot be fitted to the mean and variance simultaneously. In order
to overcome this problem, the negative binomial distribution is more often used.
DESeq2 (Love et al., 2014) is the most frequently used package for this purpose.
It accepts raw reads as input, and normalization is included in the data
processing. It also gives adjusted P-values, so multiple comparisons need not be
considered separately. It is suited to both two classes and multiple classes,
since it employs the same design matrix strategy that limma employs.
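The limitation of the one-parameter Poisson model can be illustrated with a short sketch. A Poisson model forces the variance to equal the mean, whereas a negative binomial model in its common parameterization allows variance = µ + αµ², with a dispersion α ≥ 0. The method-of-moments estimate of α below is only an illustration and is not the estimator used by DESeq2; the replicate counts are invented.

```python
from statistics import fmean, variance


def moment_dispersion(counts):
    """Return (mean, sample variance, alpha) with alpha from
    variance = mean + alpha * mean^2, truncated at zero."""
    mu = fmean(counts)
    var = variance(counts)
    alpha = max(0.0, (var - mu) / mu ** 2)
    return mu, var, alpha


if __name__ == "__main__":
    # Hypothetical read counts of one gene across six replicates.
    counts = [10, 30, 5, 55, 20, 60]
    mu, var, alpha = moment_dispersion(counts)
    print("mean:", mu)
    print("variance:", var, "(a Poisson model would require", mu, ")")
    print("dispersion alpha:", round(alpha, 4))
```

Here the sample variance is far larger than the mean, so no single Poisson parameter can match both, while the extra dispersion parameter of the negative binomial absorbs the difference.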
In spite of the frequent and successful usage of DESeq2 in CTA of HTS, the
appearance of the negative binomial distribution is not always guaranteed, since
it lacks a convergence theorem like the one the normal distribution has. Because
of this weak point, a non-parametric strategy is sometimes employed. NOISeq
(Tarazona et al., 2015) is one of the most frequently used non-parametric
packages. NOISeq also implements data normalization and multiple comparison
corrections, and can be adapted to both two classes and multiple classes.
CTA often results in distinct outcomes between DESeq2 and NOISeq. As with
microarray, there are no definite criteria for deciding the best application.
4. Conclusions
In summary, CTA is an art rather than a science. At the moment, there is no
definite way guaranteed to always work well regardless of the situation
considered. CTA must be done with great care in order to avoid obtaining results
without any biological meaning. It is not an easy task, but it must be
attempted.
References
Armstrong, R.A., 2014. When to use the Bonferroni correction. Ophthalmic and
Physiological Optics 34, 502–508. URL: https://doi.org/10.1111%2Fopo.12131,
doi:10.1111/opo.12131.
Benjamini, Y., Hochberg, Y., 1995. Controlling the False Discovery Rate: A
Practical and Powerful Approach to Multiple Testing. Journal of the Royal
Statistical Society. Series B (Methodological) 57, 289–300. URL:
http://dx.doi.org/10.2307/2346101, doi:10.2307/2346101.
Chen, C., Grennan, K., Badner, J., Zhang, D., Gershon, E., Jin, L., Liu, C.,
2011. Removing batch effects in analysis of expression microarray data:
An evaluation of six batch adjustment methods. PLoS ONE 6, e17238.
URL: https://doi.org/10.1371%2Fjournal.pone.0017238,
doi:10.1371/journal.pone.0017238.
Chen, Y.A., Tripathi, L.P., Mizuguchi, K., 2016. An integrative data analysis
platform for gene set analysis and knowledge discovery in a data warehouse
framework. Database 2016, baw009. URL:
https://doi.org/10.1093%2Fdatabase%2Fbaw009, doi:10.1093/database/baw009.
Gautier, L., Cope, L., Bolstad, B.M., Irizarry, R.A., 2004. affy–analysis
of Affymetrix GeneChip data at the probe level. Bioinformatics 20,
307–315. URL: https://doi.org/10.1093%2Fbioinformatics%2Fbtg405,
doi:10.1093/bioinformatics/btg405.
Huang, D.W., Sherman, B.T., Lempicki, R.A., 2008. Systematic and integrative
analysis of large gene lists using DAVID bioinformatics resources. Nature
Protocols 4, 44–57. URL: https://doi.org/10.1038%2Fnprot.2008.211,
doi:10.1038/nprot.2008.211.
Huber, W., Carey, V.J., Gentleman, R., Anders, S., Carlson, M., Carvalho, B.S.,
Bravo, H.C., Davis, S., Gatto, L., Girke, T., Gottardo, R., Hahne, F., Hansen,
K.D., Irizarry, R.A., Lawrence, M., Love, M.I., MacDonald, J., Obenchain,
V., Oleś, A.K., Pagès, H., Reyes, A., Shannon, P., Smyth, G.K., Tenenbaum,
D., Waldron, L., Morgan, M., 2015. Orchestrating high-throughput genomic
analysis with Bioconductor. Nature Methods 12, 115–121. URL:
https://www.bioconductor.org/, doi:10.1038/nmeth.3252.
Kuleshov, M.V., Jones, M.R., Rouillard, A.D., Fernandez, N.F., Duan, Q.,
Wang, Z., Koplev, S., Jenkins, S.L., Jagodnik, K.M., Lachmann, A.,
McDermott, M.G., Monteiro, C.D., Gundersen, G.W., Ma’ayan, A., 2016.
Enrichr: a comprehensive gene set enrichment analysis web server 2016 update.
Nucleic Acids Research 44, W90–W97. URL:
https://doi.org/10.1093%2Fnar%2Fgkw377, doi:10.1093/nar/gkw377.
Leek, J.T., 2014. svaseq: removing batch effects and other unwanted noise
from sequencing data. Nucleic Acids Research 42, e161–e161. URL:
https://doi.org/10.1093%2Fnar%2Fgku864, doi:10.1093/nar/gku864.
Love, M.I., Huber, W., Anders, S., 2014. Moderated estimation of fold
change and dispersion for RNA-seq data with DESeq2. Genome Biology
15. URL: https://doi.org/10.1186%2Fs13059-014-0550-8,
doi:10.1186/s13059-014-0550-8.
Reese, S.E., Archer, K.J., Therneau, T.M., Atkinson, E.J., Vachon, C.M.,
de Andrade, M., Kocher, J.P.A., Eckel-Passow, J.E., 2013. A new
statistic for identifying batch effects in high-throughput genomic data
that uses guided principal component analysis. Bioinformatics 29,
2877–2883. URL: https://doi.org/10.1093%2Fbioinformatics%2Fbtt480,
doi:10.1093/bioinformatics/btt480.
Reimand, J., Arak, T., Adler, P., Kolberg, L., Reisberg, S., Peterson, H., Vilo,
J., 2016. g:profiler—a web server for functional interpretation of gene lists
(2016 update). Nucleic Acids Research 44, W83–W89. URL: https://doi.
org/10.1093%2Fnar%2Fgkw199, doi:10.1093/nar/gkw199.
Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., Smyth, G.K.,
2015. limma powers differential expression analyses for RNA-sequencing and
microarray studies. Nucleic Acids Research 43, e47–e47. URL:
https://doi.org/10.1093%2Fnar%2Fgkv007, doi:10.1093/nar/gkv007.
Schwender, H., 2012. siggenes: Multiple testing using SAM and Efron’s
empirical Bayes approaches. URL:
https://bioconductor.org/packages/release/bioc/html/siggenes.html.
R package version 1.50.0.
Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L.,
Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S.,
Mesirov, J.P., 2005. Gene set enrichment analysis: A knowledge-based
approach for interpreting genome-wide expression profiles. Proceedings of the
National Academy of Sciences 102, 15545–15550. URL:
https://doi.org/10.1073%2Fpnas.0506580102, doi:10.1073/pnas.0506580102.
Tarazona, S., Furió-Tarí, P., Turrà, D., Pietro, A.D., Nueda, M.J., Ferrer, A.,
Conesa, A., 2015. Data quality aware analysis of differential expression in
RNA-seq with NOISeq R/Bioc package. Nucleic Acids Research, gkv711. URL:
https://doi.org/10.1093%2Fnar%2Fgkv711, doi:10.1093/nar/gkv711.
Tusher, V.G., Tibshirani, R., Chu, G., 2001. Significance analysis of
microarrays applied to the ionizing radiation response. Proceedings of the
National Academy of Sciences 98, 5116–5121. URL:
https://doi.org/10.1073%2Fpnas.091062498, doi:10.1073/pnas.091062498.
Yuan, G.C., Cai, L., Elowitz, M., Enver, T., Fan, G., Guo, G., Irizarry,
R., Kharchenko, P., Kim, J., Orkin, S., Quackenbush, J., Saadatpour, A.,
Schroeder, T., Shivdasani, R., Tirosh, I., 2017. Challenges and emerging
directions in single-cell analysis. Genome Biology 18. URL:
https://doi.org/10.1186%2Fs13059-017-1218-y, doi:10.1186/s13059-017-1218-y.