Dynamic Quantitative Trait Locus Analysis of Plant
822 Trends in Plant Science, December 2015, Vol. 20, No. 12 http://dx.doi.org/10.1016/j.tplants.2015.08.012
© 2015 Elsevier Ltd. All rights reserved.
Glossary

False discovery rate (FDR): in multiple hypothesis testing, FDR is the expected proportion of
falsely rejected null hypotheses among the rejected hypotheses.

Family-wise error rate (FWER): the probability of having at least one incorrectly rejected null
hypothesis among all the hypotheses.

Genome-wide association (GWA) mapping: identifies SNPs that are significantly associated with a
quantitative trait among a genome-scale SNP set based on population data. Because a high-density
SNP panel is applied, the detected significant markers should be in high linkage disequilibrium
with QTLs, and can be used to proxy the QTL positions.

Marker-assisted selection (MAS): a molecular strategy to indirectly improve economically relevant
traits during their early development stages, based on selection targeted on a trait-associated
set of markers.

Multiple-split test: a hypothesis-testing method for variable selection. The data are repeatedly
and randomly divided into two parts; the first part is used to perform variable selection and the
second is used to construct a test statistic. The P value of each SNP is averaged over multiple
replicates to reduce the uncertainty. This method can be used to control FWER.

Next-generation sequencing (NGS): utilizes efficient parallel sequencing and imaging techniques
to simultaneously produce thousands to millions of reads at low cost. Advances in NGS facilitate
the genotyping of high-density SNP panels to be used later in genetic studies.

Permutation test: hundreds of datasets are generated by randomly shuffling phenotypes into a
different order, destroying phenotype–genotype relationships. Each shuffled dataset is analyzed
to construct an empirical distribution of SNP test statistics (i.e., null distribution). The
observed SNP test statistic is then tested against this distribution.

Quantitative trait locus (QTL): a segment of a DNA sequence which contributes to the variation of
a quantitative trait by containing or being linked to the genes determining that trait.

High-Throughput Phenotyping
A high-throughput phenotyping platform (HTPP) may be set up either in controlled environments,
such as growth chambers and greenhouses, or in the field. Imaging techniques [12,13] are core
facilities of an HTPP and are used to record phenotypes as pictures or videos, including, for
example, visible and/or digital imaging for traits such as height, shoot biomass, yield traits,
root architecture and morphology, fluorescence imaging for traits such as photosynthetic status,
and 3D imaging for traits such as root structure. These 2D and/or 3D images obtained by HTPPs
may be stored on a high-performance computing infrastructure and be read and processed by
mathematical and computational image-analysis tools to extract various traits. Therefore,
HTPPs rely not only on advanced imaging and remote sensing techniques but also on
high-performance computational tools for processing and managing image data [8].

HTPPs have been successfully developed in some controlled environments (e.g., Australian
Plant Phenomics Facility, www.plantphenomics.org.au), where microenvironmental conditions
such as light and water are automatically adjusted, and the positions of plants can also be
relocated to minimize the environmental heterogeneity between individuals and repeats. Such
environmental homogeneity may also be achieved by efficient use of experimental design and
analysis within the HTPP framework (e.g., choosing suitable block size and/or using appropriate
statistical models for the analysis) at less cost than relocating the plants [14]. However, it has
been reported that the QTLs identified in controlled environments may not contribute to crop
improvement in the field [6]. By contrast, many HTPPs which have been directly developed in the
field cannot adequately monitor environmental factors, such as the temporal effects of climate
and atmospheric variation, or the spatial effect caused by soil variation [8]. The use of these
platforms in the field will require improvements to permit better monitoring of environmental
factors, and application of an appropriate experimental design would also be useful to maintain a
sufficient degree of environmental homogeneity within a field. Another issue with HTPPs is that
they are currently only available for a limited number of plant species, and more generic
phenotyping platforms that are applicable for multiple species will be needed in the future.

High-Throughput Phenotyping Facilitates the Measurement of Developmental Traits
Studying the developmental process (e.g., growth) of the traits is often of interest. Analyzing
the developmental behavior of a trait is only possible if there are repeated measurements of
individual phenotypes over time. Monitoring trait development by traditional phenotyping
approaches is far from simple. For example, obtaining repeated measurements of some
traits such as root architecture is not possible using conventional methods because these
methods would necessitate destroying the plants. By contrast, some HTPPs, relying on various
imaging techniques, are able to more conveniently monitor the dynamic growth of the traits
without damaging the plant [15–20]. Therefore, HTPPs can efficiently bring time as an extra
dimension to the phenotype data, which may potentially facilitate QTL analysis of developmental
and growth-related traits. To efficiently utilize timecourse data generated by HTPPs, advanced
statistical methods are needed [18,19]. Ideally such methods should integrate the phenotypic
information over multiple timepoints, map the dynamic phenotype–genotype relationship, and
account for possible random errors introduced by temporal and/or spatial environmental factors.

Functional QTL Mapping
Analysis of quantitative trait loci involves modeling, estimation, and hypothesis testing. Statistical
approaches for analyzing a quantitative trait and/or multiple correlated traits at a single timepoint
(Box 1) have been well established [21–23]. When phenotype records at multiple timepoints are
available, one may analyze each single timepoint separately and identify QTLs associated with
phenotypes at that particular timepoint. This approach ignores the dependency between
repeated phenotypic measurements. For example, one may expect that the two phenotypic
measurements at neighboring timepoints should have closer values than the two at a greater
distance. An improved approach is functional mapping [24–27], which was developed based
on the assumption of function-valued traits: phenotypic values at discrete timepoints are
‘snapshots’ of a continuous function/curve over time [28]. Functional mapping aims to detect
QTLs associated with the whole developmental process of the traits, instead of being associated
with any single observation. Several modeling strategies for functional mapping that have
been proposed previously are described below.
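The model equation that the following paragraph refers to is missing from this extract. A form consistent with the symbols defined there, and with the residual-normality assumption mentioned later in the text, would be the following. It is a reconstruction under those assumptions, not necessarily the authors' exact Equation 1:

```latex
y_i(t_k) = \beta_0(t_k) + x_{ij}\,\beta_j(t_k) + e_i(t_k), \qquad k = 1, \ldots, m,
\qquad
\bigl(e_i(t_1), \ldots, e_i(t_m)\bigr)^{\top} \sim \mathrm{MVN}(\mathbf{0}, \Sigma_0)
```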
Compared to the univariate regression model (Box 1), here the phenotypes $y_i(t_k)$, population
mean parameters $\beta_0(t_k)$, effect parameters $\beta_j(t_k)$ of the jth single-nucleotide
polymorphism (SNP), and residual errors $e_i(t_k)$ ($k = 1, \ldots, m$) are vectors consisting of
multiple observations/parameters over time, $\Sigma_0$ represents the covariance matrix of residual
errors, and the genotype value $x_{ij}$ is constant over time. Therefore, a distinctive feature of
the varying-coefficient model is that the regression coefficients $\beta_0(t_k)$ and $\beta_j(t_k)$
are not specified to be constant but are allowed to change over time.
The dependency structure of function-valued traits can be described by population effects
$\beta_0(t_k)$, SNP effects $\beta_j(t_k)$, and residual covariance matrix $\Sigma_0$. First,
$\beta_0(t_k)$ and $\beta_j(t_k)$ are modeled as continuous functions/curves [$\beta_0(t)$,
$\beta_j(t)$] [e.g., a linear curve can be specified as $\beta_j(t) = b_0 + b_1 t$],
which describe the temporal trend of the function-valued traits. When the temporal trend is
monotonic, one can simply use a low-degree polynomial curve or a logistic curve (Box 2). When
Polynomial
An alternative choice for fitting non-linear curves is a polynomial function, a linear combination of multiple polynomial
bases: $\beta(t) = b_0 + b_1 t + b_2 t^2 + b_3 t^3 + \cdots$. In practice, the quadratic and cubic polynomials are most
widely used to describe the non-linearity of the data. High-degree polynomials should be used with caution because they
may result in over-fitting. In other words, the curve would describe the random errors instead of the true underlying
process of the datapoints.
Spline
Splines are truncated polynomials, which are used to better describe local behavior of the curve. The truncated bases
are joined smoothly at knots, which are specific locations within the time-interval. For example, a cubic spline with knots
$t_1$ and $t_2$ ($t_1 < t_2$) is expressed as
$\beta(t) = b_0 + b_1 t + b_2 t^2 + b_3 t^3 + b_4 (t - t_1)_+^3 + b_5 (t - t_2)_+^3$,
where $(t - t_1)_+^3 = (t - t_1)^3$ if $t > t_1$, and is equal to zero otherwise. In practice, the number and locations
of knots need to be properly specified to provide an accurate description of the data.
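The polynomial and truncated-power spline bases described above can be sketched in a few lines. This is a minimal illustration, assuming ordinary least squares for the fit; the function names, the simulated logistic data, and the knot positions (7 and 13) are hypothetical choices, not taken from the article.

```python
import numpy as np

def cubic_spline_basis(t, knots=()):
    """Design matrix [1, t, t^2, t^3, (t-k1)^3_+, (t-k2)^3_+, ...].

    With no knots this is the plain cubic polynomial basis.
    """
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t), t, t**2, t**3]
    for k in knots:
        # Truncated cubic term: (t - k)^3 when t > k, zero otherwise.
        cols.append(np.clip(t - k, 0.0, None) ** 3)
    return np.column_stack(cols)

def fit_curve(t, y, knots=()):
    """Least-squares estimate of the basis coefficients and the fitted curve."""
    X = cubic_spline_basis(t, knots)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, X @ coef

# Hypothetical data: a logistic growth curve plus normal noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 20.0, 41)
y = 10.0 / (1.0 + np.exp(-0.5 * (t - 10.0))) + rng.normal(scale=0.3, size=t.size)

coef_poly, fit_poly = fit_curve(t, y)                     # cubic polynomial
coef_spl, fit_spl = fit_curve(t, y, knots=(7.0, 13.0))    # cubic spline, two knots
rss_poly = float(np.sum((y - fit_poly) ** 2))
rss_spl = float(np.sum((y - fit_spl) ** 2))
```

Because the spline basis contains the cubic polynomial basis, its residual sum of squares can never be larger on the same data. A smaller residual alone therefore does not show that the extra knots are warranted, which is the over-fitting concern raised above.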
[Figure I: four panels, (A) Linear, (B) Logistic growth, (C) Cubic polynomial, (D) Cubic spline; y-axis $\beta(t)$, x-axis t from 0 to 20.]
Figure I. Illustration of the Four Different Trend Functions. The red dots represent data simulated from a logistic
growth function plus a normal residual, and the blue lines are curves estimated by (A) a linear function, (B) a logistic
growth function, (C) a cubic polynomial function, and (D) a cubic spline.
Recently, Campbell et al. [32] conducted an association analysis on 360 rice lines with 26 258
SNPs to understand the genetic mechanism of the dynamics of rice responses to salt stress. The
Autoregressive Structure
To describe the temporal correlation in function-valued traits, a first-order autoregressive or AR(1) covariance structure
has been proposed in various studies. In the AR(1) model, the covariance between two observations $y_i(t_h)$ and $y_i(t_k)$
($h = 1, \ldots, m$; $k = 1, \ldots, m$) is expressed as follows:
$\mathrm{COV}[y_i(t_k), y_i(t_h)] = \sigma_0^2 \rho^{|t_k - t_h|}$, where $0 < \rho < 1$. The AR(1) model ensures
that two observations at nearby timepoints are more correlated than two observations further apart, which should
be a valid assumption for most function-valued traits. In addition, the AR(1) covariance has only two parameters,
$\sigma_0^2$ and $\rho$, to be estimated, and therefore can be conveniently applied in practice.
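The AR(1) formula above translates directly into a covariance matrix. The sketch below is illustrative only: the function name and the example timepoints are hypothetical, and the parameter values are arbitrary.

```python
import numpy as np

def ar1_covariance(timepoints, sigma2, rho):
    """Return the m x m matrix with entries sigma2 * rho ** |t_k - t_h|."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    t = np.asarray(timepoints, dtype=float)
    lag = np.abs(t[:, None] - t[None, :])   # |t_k - t_h| for every pair
    return sigma2 * rho ** lag

# Unequally spaced timepoints are handled through the lag |t_k - t_h|.
Sigma0 = ar1_covariance([0, 1, 2, 4], sigma2=2.0, rho=0.5)
```

Here the covariance between the first and second timepoints is 2.0 × 0.5 = 1.0, while that between the first and last is 2.0 × 0.5⁴ = 0.125, showing the decay with distance in time.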
Two-Stage Method
Several two-stage approaches have been presented [20,34–36]. They first fit a linear or a logistic
growth curve at an individual level to phenotypic measurements observed over time, and then
treat the estimated curve parameters for all the individuals as the latent trait values in QTL
mapping, to be analyzed by any QTL mapping tool. Note that the two-stage approach
integrating temporal information over phenotypes is roughly equivalent to the VC model
combining information over QTL effects [36,37]. Unlike the VC model, the two-stage approach
usually does not take the residual covariance into account. This approach can be easily and
quickly implemented without iteration, but may have less power to detect QTLs owing to its
approximate nature [37,38]. As an improvement, Piepho et al. [39] proposed an alternative strategy
that orthogonalizes the covariance matrix in the first stage of the model. This method might
provide a more precise approximation to the one-stage method while maintaining the
computational efficiency of the two-stage strategy, and therefore deserves more investigation
in the future.
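A basic two-stage analysis can be sketched as follows. This is a minimal illustration under stated assumptions: a linear growth curve in stage one, a simple single-marker regression scan in stage two, and simulated data in which hypothetical marker 4 affects the growth rate. It ignores the residual covariance, as the text notes two-stage approaches usually do, and it does not implement the orthogonalization of Piepho et al. [39].

```python
import numpy as np

def stage_one_linear(times, Y):
    """Stage 1: fit y_i(t) = a_i + b_i * t to each individual (rows of Y).

    Returns an (n, 2) array of estimated intercepts and slopes.
    """
    T = np.column_stack([np.ones_like(times), times])
    coef, *_ = np.linalg.lstsq(T, Y.T, rcond=None)   # shape (2, n)
    return coef.T

def stage_two_marker_scan(latent, X):
    """Stage 2: regress the latent trait on each marker; return t statistics."""
    n, p = X.shape
    z = latent - latent.mean()
    stats = np.empty(p)
    for j in range(p):
        x = X[:, j] - X[:, j].mean()
        sxx = x @ x
        beta = (x @ z) / sxx
        resid = z - beta * x
        se = np.sqrt((resid @ resid) / (n - 2) / sxx)
        stats[j] = beta / se
    return stats

# Hypothetical simulation: 120 lines, 30 biallelic markers, 6 timepoints.
rng = np.random.default_rng(1)
n, p = 120, 30
times = np.arange(6, dtype=float)
X = rng.integers(0, 2, size=(n, p)).astype(float)
true_slope = 1.0 + 0.8 * X[:, 4] + rng.normal(scale=0.2, size=n)
Y = 2.0 + true_slope[:, None] * times[None, :] + rng.normal(scale=0.3, size=(n, times.size))

curves = stage_one_linear(times, Y)               # estimated (intercept, slope) per line
scan = stage_two_marker_scan(curves[:, 1], X)     # scan on the estimated slopes
top_marker = int(np.argmax(np.abs(scan)))
```

Any single-timepoint QTL mapping tool could replace the stage-two scan, since the estimated curve parameters are treated as ordinary trait values.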
[Figure I: y-axis, root tip angle (–20 to –140); x-axis, hour (0 to 8).]
Figure I. The Phenotype Trajectories of 162 Arabidopsis RILs Over Time. The data are publicly available at
http://phytomorph.wisc.edu/download/.
[Figure II: four panels; y-axis, QTL effect; x-axis from 0 to 8.]
Figure II. The Estimated QTL Effects from the 2-QTL Model. The estimated QTL effects from the 2-QTL model
based on the SLOD criterion (red unbroken curve) and 3-QTL model based on the MLOD criterion (blue broken curve),
respectively (Figure 4 in [38]; reproduced with permission of The Genetics Society of America). Abbreviation: Chr,
chromosome.
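The multi-locus model referred to below as Equation 2 is missing from this extract. Assuming it extends the single-locus varying-coefficient model by summing the effects of all markers, which is what the definitions that follow suggest, it would read:

```latex
y_i(t_k) = \beta_0(t_k) + \sum_{j=1}^{p} x_{ij}\,\beta_j(t_k) + e_i(t_k), \qquad k = 1, \ldots, m,
```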
where p is the total number of markers, and all the other parameters are defined in the same way
as in Equation 1. When the number of SNPs is small, one may apply a similar type of maximum
likelihood based algorithm as that used in single-locus methods to obtain estimates of Equation
2. However, when the number of SNPs is larger than the number of individuals (which is likely to
be the case in many QTL/association mapping studies), Equation 2 becomes an over-saturated
model and the maximum likelihood method cannot provide a valid solution. In such cases,
variable selection methods are applicable, selecting only a subset of important SNPs for
inclusion in the model and eliminating the irrelevant ones [44]. Note that many variable selection
techniques were originally developed in the context of univariate regression (i.e., for analyzing
quantitative traits at a single timepoint) [42,45–47], and some of them (including stepwise
regression, penalized regression, and Bayesian approaches) have been generalized for the
VC model and/or approximate models for functional mapping.
Stepwise Regression
A straightforward model search strategy operates by simply enumerating all possible combinations
of SNPs and picking the optimal model as judged by model selection criteria [48] such as
cross validation (CV), Akaike information criterion (AIC), or Bayesian information criterion (BIC).
This approach is computationally infeasible for even a moderately sized dataset (e.g., with 100
SNPs, which already gives 2^100 candidate models). A simpler ‘greedy’ alternative is stepwise
regression [49]. The forward selection starts
from a null model (i.e., with only the intercept term), and repeats the following two steps: (i) adds a
A stepwise method usually searches only a low-dimensional space and can be quickly implemented
on a large amount of data, making it a favored choice in many QTL/association mapping
studies [45,50]. In [43], forward/backward selection was utilized in a VC model for functional
mapping. For the mouse behavior data, this method detected a set of QTLs similar to that found by
the single-locus GEE method [33], but at less computational cost. Forward/backward selection
has also been used in some approximate functional mapping models [38].
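Forward selection can be sketched for a single-timepoint trait as below. This is a simplified illustration, not the procedure of [43] or [38]: it assumes ordinary least squares, uses BIC both to rank candidate markers and as the stopping rule, and omits the backward step. The function names and the simulated effects at markers 3 and 17 are hypothetical.

```python
import numpy as np

def bic(y, X):
    """BIC of a Gaussian linear model fitted by least squares (up to a constant)."""
    n = y.size
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    return n * np.log(rss / n) + X.shape[1] * np.log(n)

def forward_selection(y, G, max_terms=10):
    """Greedily add the marker that lowers BIC most; stop when none does."""
    n, p = G.shape
    selected = []
    design = np.ones((n, 1))            # null model: intercept only
    best = bic(y, design)
    while len(selected) < max_terms:
        candidates = [j for j in range(p) if j not in selected]
        if not candidates:
            break
        scores = [bic(y, np.column_stack([design, G[:, j]])) for j in candidates]
        k = int(np.argmin(scores))
        if scores[k] >= best:           # no candidate improves the criterion
            break
        best = scores[k]
        selected.append(candidates[k])
        design = np.column_stack([design, G[:, candidates[k]]])
    return selected

# Hypothetical simulation: 200 individuals, 50 markers, two true effects.
rng = np.random.default_rng(4)
n, p = 200, 50
G = rng.integers(0, 2, size=(n, p)).astype(float)
y = 1.5 * G[:, 3] - 1.2 * G[:, 17] + rng.normal(scale=0.5, size=n)
chosen = forward_selection(y, G)
```

Each step evaluates at most p candidate models, far fewer than exhaustive enumeration would require. The search may still admit a few spurious markers after the true ones, which is one reason hypothesis testing is needed alongside variable selection.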
Penalized Regression
Stepwise regression is based on a discrete model search strategy, which may not be stable for
variable selection because small alterations in the data can cause fairly drastic changes in the
results [49]. Recently, statisticians have paid more attention to penalized regression, a continuous
variable selection approach which is numerically more stable [48,49]. In penalized regression,
the objective function (i.e., the sum of squares function or likelihood function) is combined
with a penalty term $\mathrm{Penal}(\lambda, \beta(t))$, which shrinks the effects of unimportant SNPs
toward zero during estimation and therefore leads to a sparse model. The tuning parameter
$\lambda$ ($\lambda > 0$) determines the number of markers to be incorporated into the model, and can
be selected by a model selection criterion similar to those used in stepwise regression. Popular
choices of penalty functions include LASSO [36,40] and fusion penalty [51].
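The shrinkage behavior described above can be seen in a small LASSO example for a single-timepoint trait. This is a minimal sketch that assumes a cyclic coordinate-descent solver and a fixed, arbitrarily chosen tuning parameter; it is not the varying-coefficient implementation used in the cited studies, and all names and simulated effects are hypothetical.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n) * ||y - X b||^2 + lam * sum_j |b_j| on centred data."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    col_ss = (Xc ** 2).sum(axis=0) / n
    b = np.zeros(p)
    resid = yc.copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue                      # monomorphic marker: nothing to estimate
            # Covariance of marker j with the partial residual (marker j excluded).
            rho = Xc[:, j] @ resid / n + col_ss[j] * b[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            resid += Xc[:, j] * (b[j] - new)  # keep the residual in sync
            b[j] = new
    return b

# Hypothetical simulation with more markers (150) than individuals (100).
rng = np.random.default_rng(3)
n, p = 100, 150
X = rng.integers(0, 2, size=(n, p)).astype(float)
y = 2.0 * X[:, 10] - 2.0 * X[:, 60] + rng.normal(scale=0.5, size=n)

b = lasso_cd(X, y, lam=0.3)
selected = np.flatnonzero(b)                  # markers kept in the sparse model
```

In practice `lam` would be chosen by a criterion such as cross validation rather than fixed in advance. The retained effects are shrunk toward zero, so their magnitudes underestimate the simulated values.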
Penalized regression has a Bayesian interpretation [52]. The penalty term
$\mathrm{Penal}(\lambda, \beta(t))$ can be ‘translated’ to a prior distribution of the SNP effects
$\beta(t)$, that is, the prior knowledge about the distribution of $\beta(t)$ before seeing the data.
The combination of data likelihood and prior leads to a posterior
distribution of model parameters, which can be estimated by the Markov chain Monte Carlo
(MCMC) sampling method [52–54]. A major benefit of Bayesian approaches is that they can
easily provide uncertainty estimates such as standard errors or credible intervals of the SNP
effects [55].
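As one concrete instance of this correspondence, assume the LASSO penalty and scalar effects $\beta_j$ rather than the function-valued $\beta(t)$ of the VC model, and ignore any scaling by the residual variance. The penalty is then the negative log of independent Laplace (double-exponential) priors up to a constant, so the penalized estimate coincides with the posterior mode:

```latex
\mathrm{Penal}(\lambda, \beta) = \lambda \sum_{j=1}^{p} |\beta_j|
\quad\Longleftrightarrow\quad
p(\beta_j \mid \lambda) = \frac{\lambda}{2}\exp\bigl(-\lambda |\beta_j|\bigr),
\qquad
-\log \prod_{j=1}^{p} p(\beta_j \mid \lambda) = \lambda \sum_{j=1}^{p} |\beta_j| - p \log\frac{\lambda}{2}
```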
Penalized regression methods, especially the MCMC-based Bayesian approaches, are often
more computationally expensive than stepwise regression [54]. Therefore, current penalized
estimation techniques can only be employed with smaller datasets in a VC model for analyzing
function-valued traits (Yi Gong, PhD thesis, University of North Carolina at Chapel Hill, 2013).
Alternatively, one can more easily use penalized regression in the two-stage modeling
framework [36].
Hypothesis Testing
Variable selection focuses on a subset of important SNPs with non-zero effects. In reality, it is
often impossible to choose a perfect tuning parameter value by any model selection criterion
(e.g., CV, AIC, BIC) to guarantee that all the selected SNPs represent true QTLs. It is generally the
case that the selected subset of SNPs may include some noisy signals. Therefore, hypothesis
testing needs to be used together with variable selection to formally judge QTLs and control the
false-positive rate. Constructing a test statistic for a SNP selected by variable selection is not
straightforward because the estimates of SNP effects do not asymptotically follow any known
distribution [56]. Current testing approaches such as multiple-split testing [57], stability
selection [58], and phenotype permutation [59] are largely based on sub-sampling or re-
sampling the data, which is time-consuming and sometimes conservative in detecting QTLs
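Phenotype permutation, as defined in the Glossary, can be sketched as follows. This is a minimal illustration under stated assumptions: a single-timepoint trait, the absolute marker–phenotype correlation as the test statistic, and the maximum statistic over markers in each shuffle as the null distribution (one common way to control FWER). The function names and the simulated effect at marker 7 are hypothetical.

```python
import numpy as np

def marker_stats(y, G):
    """Absolute Pearson correlation between the phenotype and each marker."""
    yc = y - y.mean()
    Gc = G - G.mean(axis=0)
    num = Gc.T @ yc
    den = np.sqrt((Gc ** 2).sum(axis=0) * (yc @ yc))
    return np.abs(num / den)

def permutation_threshold(y, G, n_perm=500, alpha=0.05, seed=0):
    """Genome-wide threshold: the (1 - alpha) quantile of the per-shuffle maximum."""
    rng = np.random.default_rng(seed)
    max_stats = np.empty(n_perm)
    for b in range(n_perm):
        # Shuffling phenotypes breaks any phenotype-genotype relationship
        # while keeping the correlation structure among markers intact.
        max_stats[b] = marker_stats(rng.permutation(y), G).max()
    return float(np.quantile(max_stats, 1.0 - alpha))

# Hypothetical simulation: 150 individuals, 40 markers, one true effect.
rng = np.random.default_rng(2)
n, p = 150, 40
G = rng.integers(0, 2, size=(n, p)).astype(float)
y = 1.5 * G[:, 7] + rng.normal(size=n)

observed = marker_stats(y, G)
threshold = permutation_threshold(y, G)
significant = np.flatnonzero(observed > threshold)
```

The cost is one full genome scan per shuffle, which illustrates why the text calls such resampling approaches time-consuming.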
Moreover, microenvironmental effects are usually unknown, and cannot be directly represented
as covariates in a functional QTL model. Note that the application of high-throughput phenotyping
platforms, especially those developed in the field, cannot guarantee that the temporal/spatial
environmental factors are fully controlled and monitored. A possible consequence of
microenvironmental variation is that the distribution of the phenotype data does not follow a normal
distribution [6], and the VC model under the assumption of residual normality may therefore not
be optimal for such data. In that case it might be worth investigating the use of robust statistical
approaches such as hierarchical generalized linear models [60] or least absolute deviation
regression [65]. These models assume that the residuals follow a skewed distribution, and
they might be more appropriate for describing non-normally distributed phenotype data.
The statistical methodologies of functional QTL mapping have been largely developed over the
past 10 years, and the general software tools for implementing these models (such as R/qtl [68],
and TASSEL [69] for QTL mapping of univariate traits) are still lacking. Therefore, it would be
desirable to develop high-performance and user-friendly computer programs for the practical
usage of functional mapping. Furthermore, it would also be interesting to extend the current
methodologies of functional mapping for new purposes (see Outstanding Questions), such as
searching for gene–gene and gene–environment interactions [27,70,71], predicting genomic
breeding values [72], estimating trait heritability [73], and combining functional mapping with
systems biology [74].
Acknowledgments
This work was supported by research funding from Biocenter Oulu. We are grateful to the Editor, three anonymous referees,
and Ashley Last for their valuable comments on the manuscript.
References
1. Mackay, T.F.C. et al. (2009) The genetics of quantitative traits: challenges and prospects. Nat. Rev. Genet. 10, 565–577
2. Collard, B.C. and Mackill, D.J. (2008) Marker-assisted selection: an approach for precision plant breeding in twenty-first century. Philos. Trans. R. Soc. B 363, 557–572
3. He, J. et al. (2014) Genotyping-by-sequencing (GBS), an ultimate marker-assisted selection (MAS) tool to accelerate plant breeding. Front. Plant Sci. 5, 484
4. Huang, X. and Han, B. (2014) Natural variations and genome-wide association studies in crop plants. Annu. Rev. Plant Biol. 65, 531–551
5. Tisné, S. et al. (2013) Phenoscope: an automated large-scale phenotyping platform offering high spatial homogeneity. Plant J. 74, 534–544
6. Cobb, J.N. et al. (2013) Next-generation phenotyping: requirements and strategies for enhancing our understanding of genotype–phenotype relationships and its relevance to crop improvement. Theor. Appl. Genet. 126, 867–887
7. Walter, A. et al. (2012) Advanced phenotyping offers opportunities for improved breeding of forage and turf species. Ann. Bot. 110, 1271–1279
8. Araus, J.L. and Cairns, J.E. (2014) Field high-throughput phenotyping: the new crop breeding frontier. Trends Plant Sci. 19, 52–61
9. Yang, W. et al. (2014) Combining high-throughput phenotyping and genome-wide association studies to reveal natural genetic variation in rice. Nat. Commun. 5, 5087
10. Topp, C.N. et al. (2013) 3D phenotyping and quantitative trait locus mapping identify core regions of the rice genome controlling root architecture. Proc. Natl. Acad. Sci. U.S.A. 110, E1695–E1704
11. Fahlgren, N. et al. (2015) Light, camera, action: high-throughput plant phenotyping is ready for a close-up. Curr. Opin. Plant Biol. 24, 93–99
12. Sozzani, R. et al. (2014) Advanced imaging techniques for the study of plant growth and development. Trends Plant Sci. 19, 304–310
13. Li, L. et al. (2014) A review of imaging techniques for plant phenotyping. Sensors 14, 20078–20111
14. Brien, C.J. et al. (2013) Accounting for variation in designing greenhouse experiments with special reference to greenhouses containing plants on conveyor systems. Plant Methods 9, 5
15. Zhang, X. et al. (2012) Natural genetic variation for growth and development revealed by high-throughput phenotyping in Arabidopsis thaliana. G3 2, 29–34
16. Tessmer, O.L. et al. (2013) Functional approach to high-throughput plant growth analysis. BMC Syst. Biol. 7, S17
17. Moore, C.R. et al. (2013) High-throughput computer vision introduces the time axis to a quantitative trait map of a plant growth response. Genetics 195, 1077–1086
18. Chen, D. et al. (2014) Dissecting the phenotypic components of crop plant growth and drought responses based on high-throughput image analysis. Plant Cell 26, 4636–4655
19. Brown, T.B. et al. (2014) TraitCapture: genomic and environment modelling of plant phenomic data. Curr. Opin. Plant Biol. 18, 73–79
20. Honsdorf, N. et al. (2014) High-throughput phenotyping to detect drought tolerance QTL in wild barley introgression lines. PLoS ONE 9, e97047
21. Foulkes, A.S. (2009) Applied Statistical Genetics with R: For Population-based Association Studies, Springer
22. Lander, E.S. and Botstein, D. (1989) Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121, 185–199
23. Jiang, C. and Zeng, Z-B. (1995) Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics 140, 1111–1127
24. Ma, C. et al. (2002) Functional mapping of quantitative trait loci underlying the character process: a theoretical framework. Genetics 161, 1751–1762
25. Wu, R. and Lin, M. (2006) Functional mapping-how to map and study the genetic architecture of dynamical complex traits. Nat. Rev. Genet. 7, 229–237
26. He, Q. et al. (2010) Mapping genes for plant structure, development and evolution: functional mapping meets ontology. Trends Genet. 26, 39–46
27. Wang, Z. et al. (2013) Modeling phenotypic plasticity in growth trajectories: a statistical framework. Evolution 68, 81–91
28. Pletcher, S.D. and Geyer, C.J. (1999) The genetic analysis of age-dependent traits: modelling the character process. Genetics 151, 825–835
42. Yi, H. et al. (2015) Penalized multimarker vs. single-marker regression methods for genome-wide association studies of quantitative traits. Genetics 199, 205–222
43. Li, Z. and Sillanpää, M.J. (2013) A Bayesian nonparametric approach for mapping dynamic quantitative traits. Genetics 194, 997–1016
44. Sillanpää, M.J. and Corander, J. (2002) Model choice in gene mapping: what and why. Trends Genet. 18, 301–307
45. Segura, V. et al. (2012) An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat. Genet. 44, 825–830
46. Li, Z. and Sillanpää, M.J. (2012) Overview of LASSO-related penalized regression methods for quantitative trait mapping and genomic selection. Theor. Appl. Genet. 125, 419–435
47. O’Hara, R.B. and Sillanpää, M.J. (2009) A review of Bayesian variable selection methods: What, how, and which? Bayesian Anal. 4, 85–118
48. Hastie, T. et al. (2009) Elements of Statistical Learning. (2nd Edn), Springer
49. Izenman, A.J. (2008) Modern Multivariate Statistical Techniques, Springer
50. Broman, K.W. and Speed, T.P. (2002) A model selection approach for the identification of quantitative trait loci in experimental crosses. J. Roy. Statist. Soc. B 64, 641–656
51. Daye, Z.J. et al. (2012) A sparse structured shrinkage estimator for nonparametric varying-coefficient model with an application in genomics. J. Comput. Graph. Stat. 21, 110–133
52. Park, T. and Casella, G. (2008) The Bayesian LASSO. J. Am. Stat. Assoc. 103, 681–686
53. Yang, R. and Xu, S. (2007) Bayesian shrinkage analysis of quantitative trait loci for dynamic traits. Genetics 176, 1169–1185
66. Goulding, E.H. et al. (2008) A robust automated system elucidates mouse home cage behavioral structure. Proc. Natl. Acad. Sci. U.S.A. 105, 20575–20582
67. Schaefer, A.T. and Claridge-Chang, A. (2012) The surveillance state of behavioral automation. Curr. Opin. Neurobiol. 22, 170–176
68. Broman, K.W. and Sen, S. (2009) A Guide to QTL Mapping with R/qtl, Springer
69. Bradbury, P.J. et al. (2007) TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23, 2633–2635
70. Piepho, H.P. (2000) A mixed-model approach to mapping quantitative trait loci in Barley on the basis of multiple environmental data. Genetics 156, 2043–2050
71. Yi, N. (2010) Statistical analysis of genetic interactions. Genet. Res. 92, 443–459
72. Desta, Z.A. and Ortiz, R. (2014) Genomic selection: genome-wide prediction in plant improvement. Trends Plant Sci. 19, 592–601
73. Sillanpää, M.J. (2011) On statistical methods for estimating heritability in wild populations. Mol. Ecol. 20, 1324–1332
74. Sun, L. and Wu, R. (2015) Mapping complex traits as a dynamic system. Phys. Life Rev. 13, 155–185
75. Haley, C.S. and Knott, S.A. (1992) A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69, 315–324
76. Halperin, E. and Stephan, D.A. (2009) SNP imputation in association studies. Nat. Biotechnol. 27, 349–351
77. Marchini, J. and Howie, B. (2010) Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11, 499–511
78. Sen, S. and Churchill, G.A. (2001) A statistical framework for quantitative trait mapping. Genetics 159, 371–387