Review

Dynamic Quantitative Trait Locus Analysis of Plant Phenomic Data
Zitong Li1,2 and Mikko J. Sillanpää1,2,*
Advanced platforms have recently become available for automatic and systematic quantification
of plant growth and development. These new techniques can efficiently produce multiple
measurements of phenotypes over time, and introduce time as an extra dimension to quantitative
trait locus (QTL) studies. Functional mapping utilizes a class of statistical models for
identifying QTLs associated with the growth characteristics of interest. A major benefit of
functional mapping is that it integrates information over multiple timepoints, and therefore
could increase the statistical power for QTL detection. We review the current development of
computationally efficient functional mapping methods which provide invaluable tools for
analyzing large-scale timecourse data that are readily available in our post-genome era.

Trends
High-throughput imaging techniques are capable of measuring time-series of plant phenotypes,
which may potentially facilitate the QTL analysis of developmental and growth related traits.

A major benefit of functional mapping is that it integrates information over multiple
timepoints, and therefore could increase the statistical power for QTL detection.

To handle high-dimensional genotyping and phenotyping data, computational efficiency is the
focus of the novel statistical methods for dynamic QTL analysis.

QTL Mapping and High-Throughput Phenotyping
In plant genetics, quantitative trait locus (QTL) mapping (see Glossary) is often used to
identify QTLs or causal genes associated with phenotypes of interest [1]. QTL mapping is a
crucial step in marker-assisted selection (MAS), which has been successfully applied in
many plant breeding programs [2]. Recent advances in next-generation sequencing (NGS)
techniques have provided fast and inexpensive access to genomic information on a large
scale, which allows the execution of the QTL and/or association mapping based on genome-
wide marker data [3,4] (genome-wide association mapping). In addition to the genotypic
information, QTL mapping also requires high-quality phenotype data. Intuitively, increasing the
sample size in a QTL analysis can improve the power to correctly identify QTLs. From another
perspective, it is also beneficial to perform sample collection of plants under similar or
exchangeable microenvironmental conditions to ensure that environmental variance and noise
are minimized. However, traditional plant phenotyping approaches largely utilize manual
laboratory experiments and visual scoring by experts, and these practices are often time-
consuming and it is difficult to arrange desirable growth conditions for individual phenotypes or
repeats. Consequently, the development of high-throughput and/or automated phenotype
platforms [5–11], involving both automated recording and screening of phenotypes by various
imaging techniques, and effectively allocating and monitoring environmental conditions, has
started to gain more and more attention. This advance greatly eases the measurement of
timecourse phenotype data and will provide new insight into genetic studies of plant growth.

Accordingly, this review describes functional mapping, a class of statistical methods
designed to efficiently integrate temporal information and identify QTLs associated with
phenotypic dynamics. Computational efficiency is an important consideration in some recently
developed approaches to meet the new challenge from high-dimensional genotype and
phenotype data. The review also discusses the limitations of the current methods and
highlights future research directions.

1 Biocenter Oulu, Oulu, Finland
2 Department of Mathematical Sciences and Department of Biology, University of Oulu, 90014 Oulu, Finland
*Correspondence: Mikko.Sillanpaa@oulu.fi (M.J. Sillanpää).

822 Trends in Plant Science, December 2015, Vol. 20, No. 12 http://dx.doi.org/10.1016/j.tplants.2015.08.012
© 2015 Elsevier Ltd. All rights reserved.
Glossary
False discovery rate (FDR): in multiple hypothesis testing, FDR is the expected proportion of
falsely rejected null hypotheses among the rejected hypotheses.
Family-wise error rate (FWER): the probability of having at least one incorrectly rejected null
hypothesis among all the hypotheses.
Genome-wide association (GWA) mapping: identifies SNPs that are significantly associated with a
quantitative trait among a genome-scale SNP set based on population data. Because a high-density
SNP panel is applied, the detected significant markers should be in high linkage disequilibrium
with QTLs, and can be used to proxy the QTL positions.
Marker-assisted selection (MAS): a molecular strategy to indirectly improve economically
relevant traits during their early development stages, based on selection targeted on a
trait-associated set of markers.
Multiple-split test: a hypothesis-testing method for variable selection. The data are repeatedly
and randomly divided into two parts; the first part is used to perform variable selection and
the second is used to construct a test statistic. The P value of each SNP is averaged over
multiple replicates to reduce the uncertainty. This method can be used to control FWER.
Next-generation sequencing (NGS): utilizes efficient parallel sequencing and imaging techniques
to simultaneously produce thousands to millions of reads with low cost. Advances in NGS
facilitate the genotyping of high-density SNP panels to be used later in genetic studies.
Permutation test: hundreds of datasets are generated by randomly shuffling phenotypes into a
different order and destroying phenotype–genotype relationships. Each shuffled dataset is
analyzed to construct an empirical distribution of SNP test statistics (i.e., null
distribution). The observed SNP test statistic is then tested against this distribution.
Quantitative trait locus (QTL): a segment of a DNA sequence which contributes to the variation
of a quantitative trait by containing or being linked to the genes determining that trait.

High-Throughput Phenotyping
A high-throughput phenotyping platform (HTPP) may either be set up in controlled environments,
such as growth chambers and greenhouses, or in the field. Imaging techniques [12,13]
are core facilities of a HTPP and are utilized to record phenotypes as pictures or videos, including
for example visible and/or digital imaging for traits such as height, shoot biomass, yield traits,
root architecture and morphology, fluorescence imaging for traits such as photosynthetic status,
and 3D imaging for traits such as root structure. These 2D and/or 3D images obtained by HTPPs
may be stored on a high-performance computing infrastructure and be read and processed by
mathematical and computational image-analysis tools to extract various traits. Therefore,
HTPPs do not only rely on advanced imaging and remote sensing techniques but also require
high-performance computational tools for processing and managing image data [8].

HTPPs have been successfully developed in some controlled environments (e.g., Australian
Plant Phenomics Facility, www.plantphenomics.org.au), where microenvironmental conditions
such as light and water are automatically adjusted, and the positions of plants can also be
relocated to minimize the environmental heterogeneity between individuals and repeats. Such
environmental homogeneity may also be achieved by efficient use of experimental design and
analysis within the HTPP framework (e.g., choosing suitable block size and/or using appropriate
statistical models for the analysis) with less cost than relocating the plants [14]. However, it has
been reported that the QTLs identified in controlled environments may not contribute to crop
improvement in the field [6]. By contrast, many HTPPs which have been directly developed in the
field cannot adequately monitor environmental factors, such as the temporal effects of climate
and atmospheric variation, or the spatial effect caused by soil variation [8]. The use of these
platforms in the field will require improvements to permit better monitoring of environmental
factors, and application of an appropriate experimental design would also be useful to maintain a
sufficient degree of environmental homogeneity within a field. Another issue with HTPPs is that
they are currently only available for a limited number of plant species, and more generic
phenotyping platforms that are applicable for multiple species will be needed in the future.

High-Throughput Phenotyping Facilitates the Measurement of Developmental Traits
Studying the developmental process (e.g., growth) of the traits is often interesting. Analyzing
developmental behavior of a trait is only possible if there are repeated measurements of
individual phenotypes over time. Monitoring trait development by traditional phenotyping
approaches is far from simple work. For example, obtaining repeated measurements of some
traits such as root architecture is not possible using conventional methods because these
methods would necessitate destroying the plants. By contrast, some HTPPs, relying on various
imaging techniques, are able to more conveniently monitor the dynamic growth of the traits
without damaging the plant [15–20]. Therefore, HTPPs can efficiently bring time as an extra
dimension to the phenotype data, which may potentially facilitate QTL analysis of developmental
and growth-related traits. To efficiently utilize timecourse data generated by HTPPs, advanced
statistical methods are needed [18,19]. Ideally such methods should integrate the phenotypic
information over multiple timepoints, map the dynamic phenotype–genotype relationship, and
account for possible random errors introduced by temporal and/or spatial environmental factors.

Functional QTL Mapping
Analysis of quantitative trait loci involves modeling, estimation, and hypothesis testing. Statistical
approaches for analyzing a quantitative trait and/or multiple correlated traits at a single timepoint
(Box 1) have been well established [21–23]. When phenotype records at multiple timepoints are
available, one may analyze each single timepoint separately and identify QTLs associated with
phenotypes at that particular timepoint. This approach ignores the dependency between
repeated phenotypic measurements. For example, one may expect that the two phenotypic
measurements at neighboring timepoints should have closer values than the two at a greater



Glossary (continued)
Quantitative trait locus (QTL) mapping: identifies QTLs associated with the target trait based
on data collected from progenies of experimental crosses, related individuals in families, or
unrelated individuals in single or multiple populations. If the marker set is not sufficiently
dense a QTL generally has no exact location at any marker, and it is necessary to search for the
QTL location within the interval between markers.
Single-nucleotide polymorphism (SNP): a DNA sequence variation with a change in a single base
pair between individuals of the same species. SNPs are one of the most commonly used genetic
markers in QTL analysis.
Stability selection: a sub-sampling based method for FDR control based on SNP effects estimated
by variable selection. In each run variable selection is applied to the randomly chosen data
for half the number of individuals, and the selected SNPs are recorded. This procedure is
repeated many times, and an empirical frequency of each SNP being selected is calculated.

Box 1. Models for Association Mapping
The simple linear regression on each marker j (j = 1, ..., p; where p is the total number of SNPs) for population-level
association mapping is defined as:

$$y_i = \beta_0 + x_{ij}\beta_j + e_i, \qquad e_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma_0^2), \tag{I}$$

where $y_i$ is the phenotype value of individual i (i = 1, ..., n; n is the total number of individuals), $x_{ij}$ is the genotype value
of individual i and marker j coded as −1, 0, and 1 for the three SNP genotypes AA, AB, and BB respectively, $\beta_0$ is the
overall phenotype mean, $\beta_j$ is the effect of SNP j, and $e_i$ is the residual error, which is assumed to follow a mutually
independent normal distribution with zero mean and unknown variance $\sigma_0^2$. Least-squares estimates of the regression
coefficients ($\beta_0$, $\beta_j$) can be obtained by minimizing the sum of squares function $SSE = \sum_{i=1}^{n} (y_i - \beta_0 - x_{ij}\beta_j)^2$.

Interval Mapping
For datasets with low SNP density it can be useful to search for the location of the putative QTL in the interval between
two markers. For each putative QTL position, one may first calculate $P_{QTL} = \Pr(\mathrm{QTL} \mid \mathrm{SNP}_1, \mathrm{SNP}_2)$, the probability of the QTL
genotype conditional on the genotypes of its two flanking SNPs, and then use regression of phenotype y on $P_{QTL}$ (i.e.,
by replacing $x_{ij}$ with $P_{QTL}$ in Equation I). The probability $P_{QTL}$ can be calculated for a given experimental crossing design
based on expected genotype frequencies [75] or at population level based on linkage disequilibrium [76]. In more recent
times the use of imputation techniques [77] and pseudomarkers [78] has largely replaced this in practice, and thus
association mapping is also applicable here.

QTL Decision Rules
Hypothesis testing can be used to formally judge QTLs. For a SNP, j, we are interested in testing the null hypothesis
$\beta_j = 0$ vs the alternative hypothesis $\beta_j \neq 0$. In detail, a Student t test statistic is first calculated by
$t_j = \hat{\beta}_j / s[\hat{\beta}_j]$, where $\hat{\beta}_j$ is the least squares estimate and $s[\hat{\beta}_j]$ is its standard error. Next, the corresponding P value is
computed based on the assumption that $t_j$ follows a Student t distribution. When the P value is smaller than the
significance threshold $\alpha$ (e.g., $\alpha = 0.05$), the null hypothesis is rejected and the corresponding SNP is declared as
significant. Such hypothesis testing controls the probability of incorrectly rejecting the null hypothesis under the
significance level $\alpha$. For large SNP panels many hypotheses are tested simultaneously, and therefore the chance of
making wrong decisions is increased. Therefore, choosing a more conservative significance level is often necessary to
adjust for multiplicity. This can be done by controlling the family-wise error or false discovery rates.

distance. An improved approach is functional mapping [24–27], which was developed based
on the assumption of function-valued traits: phenotypic values at discrete timepoints are
‘snapshots’ of a continuous function/curve over time [28]. Functional mapping aims to detect
QTLs associated with the whole developmental process of the traits, instead of being associated
with any single observation. Several modeling strategies for functional mapping that have
been proposed previously are described below.
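
Before moving to function-valued traits, the single-timepoint model of Box 1 (Equation I) can be illustrated with a minimal sketch. The function names, the toy data, the normal approximation to the Student t P value, and the use of a Bonferroni cut-off as one way to control the family-wise error rate are all assumptions made for illustration; they are not taken from the reviewed studies.

```python
import math


def single_marker_test(x, y):
    """Fit y_i = beta_0 + x_i * beta_j + e_i by least squares for one marker.

    x: genotype codes (-1, 0, 1); y: phenotype values.
    Returns (beta_0, beta_j, standard error of beta_j, t statistic).
    """
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_j = sxy / sxx
    beta_0 = y_bar - beta_j * x_bar
    # Sum of squared errors and residual variance with n - 2 degrees of freedom.
    sse = sum((yi - beta_0 - xi * beta_j) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = sse / (n - 2)
    se = math.sqrt(sigma2_hat / sxx)
    return beta_0, beta_j, se, beta_j / se


def approx_two_sided_p(t):
    """Normal approximation to the two-sided P value (assumes large n)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the FWER at or below alpha."""
    return alpha / n_tests


# Hypothetical toy data: six individuals, one SNP.
genotypes = [-1, -1, 0, 0, 1, 1]
phenotypes = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
b0, bj, se, t = single_marker_test(genotypes, phenotypes)
```

For small samples the exact Student t distribution with n − 2 degrees of freedom should replace the normal approximation used here.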

Varying-Coefficient (VC) Regression for Functional Mapping


As an extension of the univariate linear regression (Equation I in Box 1), a VC multivariate
regression model [29] for association mapping of function-valued traits is defined by:

$$y_i(t_k) = \beta_0(t_k) + x_{ij}\beta_j(t_k) + e_i(t_k), \qquad e_i = [e_i(t_1), \ldots, e_i(t_m)]^T \overset{\text{i.i.d.}}{\sim} \mathrm{MVN}(0, \Sigma_0). \tag{1}$$

Compared to the univariate regression model (Box 1), here the phenotypes $y_i(t_k)$, population
mean parameters $\beta_0(t_k)$, effect parameters $\beta_j(t_k)$ of the jth single-nucleotide polymorphism
(SNP), and residual errors $e_i(t_k)$ (k = 1, ..., m) are vectors consisting of multiple observations/
parameters over time, $\Sigma_0$ represents the covariance matrix of residual errors, and the genotype
value $x_{ij}$ is constant over time. Therefore, a distinctive feature of the varying-coefficient model
is that the regression coefficients $\beta_0(t_k)$ and $\beta_j(t_k)$ are not specified to be constant but are
allowed to change over time.

The dependency structure of function-valued traits can be described by population effects $\beta_0(t_k)$,
SNP effects $\beta_j(t_k)$, and residual covariance matrix $\Sigma_0$. First, $\beta_0(t_k)$ and $\beta_j(t_k)$ are modeled as
continuous functions/curves [$\beta_0(t)$, $\beta_j(t)$] [e.g., a linear curve can be specified as $\beta_j(t) = \beta_0 + \beta_1 t$],
which describe the temporal trend of the function-valued traits. When the temporal trend is
monotonic, one can simply use a low-degree polynomial curve or a logistic curve (Box 2). When



Box 2. Trend Functions for the Varying-Coefficient (VC) Model
Linear Curve
The linear curve is defined by $\beta(t) = \beta_0 + \beta_1 t$, where $\beta_0$ is the intercept and $\beta_1$ is the slope measuring the growth rate.
Linear curves provide a simple description of the increasing or decreasing trends of the data, but may not be efficient
for describing data with non-linear patterns (Figure I).

Logistic Growth Curve
The logistic curve can be expressed as

$$\beta(t) = \frac{a}{1 + \exp[c(b - t)]},$$

where a is the limiting value of $\beta(t)$ as t tends to infinity, b is
the inflection point, and c is the growth rate. Compared to linear curves, a logistic curve allows the growth speed (i.e., the
derivative of the curve) to change over time: for example to be faster at an early stage and slower at the mature stage, such
that it can be used to better mimic the whole growth dynamic of many plant traits (e.g., height, weight). A drawback
of the logistic curve is that there is a nonlinear relationship between $\beta(t)$ and the curve parameters a, b, and c, which
is inconvenient from a computational standpoint.

Polynomial
An alternative choice for fitting non-linear curves is a polynomial function, a linear combination of multiple polynomial
bases: $\beta(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \ldots$. In practice, the quadratic and cubic polynomials are most widely used to describe
the non-linearity of the data. High-degree polynomials should be used with caution because they may result in over-
fitting. In other words, the curve would describe the random errors instead of the true underlying process of the
datapoints.

Spline
Splines are truncated polynomials, which are used to better describe local behavior of the curve. The truncated bases
are joined smoothly at knots, which are specific locations within the time-interval. For example, a cubic spline with knots
$t_1$ and $t_2$ ($t_1 < t_2$) is expressed as

$$\beta(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \beta_4 (t - t_1)^3_+ + \beta_5 (t - t_2)^3_+,$$

where $(t - t_1)^3_+ = (t - t_1)^3$ if $t > t_1$, and is equal to zero otherwise. In practice, the location and number of knots need to be properly specified
to provide an accurate description of the data.
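
The four trend functions of this box can be written down directly. The sketch below is only an illustration: the function names and argument order are hypothetical, and the spline uses the truncated power basis exactly as written in the cubic-spline expression above.

```python
import math


def linear_curve(t, b0, b1):
    """Linear trend: intercept b0 plus slope b1 times t."""
    return b0 + b1 * t


def logistic_curve(t, a, b, c):
    """Logistic growth: limiting value a, inflection point b, growth rate c."""
    return a / (1.0 + math.exp(c * (b - t)))


def polynomial_curve(t, coefs):
    """Polynomial trend: coefs[d] multiplies t ** d."""
    return sum(coef * t ** d for d, coef in enumerate(coefs))


def truncated_cubic(t, knot):
    """Truncated power basis (t - knot)^3 for t > knot, and zero otherwise."""
    return (t - knot) ** 3 if t > knot else 0.0


def cubic_spline(t, coefs, knots):
    """Cubic spline: four polynomial coefficients followed by one per knot."""
    value = polynomial_curve(t, coefs[:4])
    for coef, knot in zip(coefs[4:], knots):
        value += coef * truncated_cubic(t, knot)
    return value
```

At the inflection point the logistic curve takes half of its limiting value, for example `logistic_curve(10, 12, 10, 0.5)` evaluates to 6.0.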

[Figure I panels: (A) Linear curve, (B) Logistic growth curve, (C) Cubic polynomial, (D) Cubic spline; each panel plots $\beta(t)$ against t for t from 0 to 20.]

Figure I. Illustration of the Four Different Trend Functions. The red dots represent data simulated from a logistic
growth function plus a normal residual, and the blue lines are estimated curves by (A) a linear function, (B) logistic growth
function, (C) cubic polynomial function, and (D) cubic spline.



the trend is more complex it is possible to apply more advanced non-parametric curve and/or
functional data-modeling techniques [30,31] such as splines (Box 2), wavelets, and kernel
methods. Secondly, the residual covariance matrix $\Sigma_0$ describes temporal correlation between
residuals of two observations. In practice, the covariance $\Sigma_0$ can be modeled parametrically by
an autoregressive process or random intercept and slope components (Box 3).

In the VC regression, parameter estimation requires an iterative maximum likelihood algorithm by
updating the parameter values of $\beta_j(t)$ [e.g., updating $\beta_0$ and $\beta_1$, when $\beta_j(t) = \beta_0 + \beta_1 t$] and $\Sigma_0$
sequentially [24]. In hypothesis testing, we may test either the association between a QTL and
the whole growth trajectory [i.e., test $\beta_j(t) = 0$ vs $\beta_j(t) \neq 0$], or the association between a QTL and
one particular parameter of the growth trajectory [i.e., test $\beta_0 = 0$ vs $\beta_0 \neq 0$, and $\beta_1 = 0$ vs $\beta_1 \neq 0$,
when $\beta_j(t)$ is a linear curve].

The VC model is a statistically-precise approach for mapping function-valued traits by
simultaneously estimating the SNP effects $\beta_j(t)$ and covariance $\Sigma_0$ in a single procedure. Because $\beta_j(t)$ is
treated as a curve, the estimation of $\beta_j(t)$ at each single timepoint always ‘borrows information’
from the observations at surrounding timepoints, in this way providing more efficient estimates of
$\beta_j(t)$ and increasing the power to identify QTLs. By contrast, the VC model could be considered
conservative in that it focuses on QTLs associated with the whole developmental/growth
trajectories of the trait, but it may neglect some ‘local’ QTLs that are only associated with
the trait at a particular growth stage.

Recently, Campbell et al. [32] conducted an association analysis on 360 rice lines with 26 258
SNPs to understand the genetic mechanism of the dynamics of rice responses to salt stress. The

Box 3. Parametric Residual Covariance Structures
Diagonal Structure
The simplest choice is a diagonal or independent covariance matrix, which assumes no serial correlation of repeated
phenotype measurements over time. An independent covariance matrix with homoscedastic residual errors (i.e.,
variance constant over time) is defined by

$$\Sigma_0 = \mathrm{COV}[y_i] = \mathrm{COV}[e_i] = \sigma_0^2 \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{bmatrix},$$

where $y_i = [y_1, \ldots, y_m]^T$, and $e_i = [e_1, \ldots, e_m]^T$ (m is the total number of timepoints). The diagonal covariance matrix is not
appropriate for many of the function-valued traits, and the use of it in the VC model may result in the introduction of some
false-positive QTLs [36].

Autoregressive Structure
To describe the temporal correlation in function-valued traits, a first-order autoregressive or AR(1) covariance structure
has been proposed in various studies. In the AR(1) model, the covariance between two observations $y_i(t_h)$ and $y_i(t_k)$ (h = 1,
..., m, k = 1, ..., m) is expressed as follows: $\mathrm{COV}[y_i(t_k), y_i(t_h)] = \sigma_0^2 \rho^{|t_k - t_h|}$, where $0 < \rho < 1$. The AR(1) model ensures
that the two observations at nearby timepoints are more correlated than the two observations further apart, which should
be a valid assumption for most of the function-valued traits. In addition, the AR(1) covariance has only two parameters, $\sigma_0^2$
and $\rho$, to be estimated, and therefore can be conveniently applied in practice.

Random Intercept and Slope Model
The VC model can be generalized as a linear mixed-effect model (LMM). The LMM utilizes random effects to introduce serial
correlation into the model, and the residual covariance $\Sigma_0$ needs only to be specified as a simple diagonal matrix (as
defined above). Specifically, in Equation 1 in the main text, the population mean component $\beta_0(t)$ can be re-parameterized as
$\beta_0(t) + \beta_{i0}(t)$ by adding the term $\beta_{i0}(t)$ as individual-level random effects to describe each individual's departure from the overall
mean. Let $\beta_0(t)$ be a linear curve: $\beta_0(t) = \beta_0 + \beta_1 t$; the random effects $\beta_{i0}(t)$ can then be specified as $\beta_{i0}(t) = \beta_{i0} + \beta_{i1} t$. The random
intercept and slope parameters are assumed to follow a common multivariate normal distribution $[\beta_{i0}, \beta_{i1}] \mid \Sigma_{2 \times 2} \sim \mathrm{MVN}(0, \Sigma_{2 \times 2})$,
with the covariance $\Sigma_{2 \times 2}$ to be estimated from the data. Under such model settings, the marginal variance of a single
observation $y_i(t_k)$ is $\mathrm{Var}[y_i(t_k)] = \sigma_0^2 + \Sigma_{11} + 2 t_k \Sigma_{12} + t_k^2 \Sigma_{22}$, and the covariance of two observations becomes
$\mathrm{Cov}[y_i(t_k), y_i(t_h)] = \Sigma_{11} + (t_k + t_h) \Sigma_{12} + t_k t_h \Sigma_{22}$. Clearly, random intercept and slope terms do not only introduce the covariance
between repeated measurements but also allow for the heterogeneity of variance.
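
As a small illustration of these structures, the sketch below builds the AR(1) covariance matrix and the marginal covariance implied by the random intercept and slope model from the expressions given in this box. The function names and the nested-list representation of matrices are assumptions made for the example.

```python
def ar1_covariance(times, sigma2, rho):
    """AR(1) structure: COV[y(t_k), y(t_h)] = sigma2 * rho ** |t_k - t_h|."""
    return [[sigma2 * rho ** abs(tk - th) for th in times] for tk in times]


def random_intercept_slope_covariance(times, sigma2, s11, s12, s22):
    """Marginal covariance under a random intercept and slope model.

    s11, s12, s22 are the entries of the 2 x 2 random-effect covariance;
    sigma2 is the independent residual variance added on the diagonal.
    """
    cov = []
    for k, tk in enumerate(times):
        row = []
        for h, th in enumerate(times):
            value = s11 + (tk + th) * s12 + tk * th * s22
            if k == h:
                value += sigma2  # residual variance only affects variances
            row.append(value)
        cov.append(row)
    return cov
```

With timepoints 0, 1, 2 the diagonal of the second matrix grows with time whenever `s22` is positive, which is the variance heterogeneity mentioned above.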



phenotype data were prepared using high-throughput imaging techniques over 18 days. VC
modeling of the temporal trend of salinity-induced growth responses as a decreasing logistic
curve was applied in the dynamic QTL analysis. The functional association analysis identified 55
QTLs (with lower P values or higher statistical significance) compared to the 26 QTLs found by
single timepoint QTL analysis. The results clearly indicate that functional mapping by integrating
temporal information has higher statistical power to identify significant QTLs compared to
conventional single timepoint QTL mapping.

The VC models require computationally-expensive iterative algorithms for parameter estimation.
Modern high-throughput phenotyping techniques may introduce high time-resolution data with
a large number of timepoints [17], which could create a computational burden for the VC model.
To increase computational speed, some approximation methods for the VC regression are
applicable. These are described next.

An Estimating Equations Approach


A generalized estimating equation (GEE) approach to mapping function-valued traits has
been proposed [33]. The method first estimates the genetic effect parameters $\beta_j(t)$ by GEE,
which assumes a working independence correlation structure (i.e., $\Sigma_0$ to be an identity
matrix) in Equation 1, and then estimates the covariance $\Sigma_0$ based on the estimates of SNP
effects obtained in the first step. The computation is performed in these two steps without
iteration, and therefore can be faster than the VC models, although, in theory, the GEE
may be less efficient than a VC method with correctly specified covariance. From a high-
resolution mouse behavior dataset with 222 timepoints, this method was able to identify two
QTLs [33].
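
The two non-iterative steps can be sketched as follows. This is a simplified illustration rather than the procedure of [33]: it assumes linear trend curves for both the population mean and the SNP effect, a complete n × m phenotype matrix, and a single SNP. With an identity link and a working independence structure the first step reduces to ordinary least squares on the stacked observations. The function name `two_step_fit` is hypothetical.

```python
import numpy as np


def two_step_fit(Y, x, times):
    """Step 1: least squares under working independence; step 2: covariance.

    Y: (n, m) phenotypes, x: (n,) genotype codes, times: (m,) timepoints.
    Returns the trend coefficients [mean intercept, mean slope,
    SNP intercept, SNP slope] and the (m, m) residual covariance estimate.
    """
    n, m = Y.shape
    t = np.tile(times, n)     # timepoint of each stacked observation
    g = np.repeat(x, m)       # genotype repeated for every timepoint
    design = np.column_stack([np.ones(n * m), t, g, g * t])
    y = Y.reshape(-1)         # row-major: individual by individual
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = (y - design @ coef).reshape(n, m)
    sigma_hat = resid.T @ resid / n   # empirical residual covariance
    return coef, sigma_hat
```

Because no iteration between the two steps is needed, the cost is one least-squares fit plus one matrix product per SNP.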

Two-Stage Method
Several two-stage approaches have been presented [20,34–36]. They first fit a linear or a logistic
growth curve at an individual level to phenotypic measurements observed over time, and then
treat the estimated curve parameters for all the individuals as the latent trait values in QTL
mapping, to be analyzed by any QTL mapping tool. Note that the two-stage approach
integrating temporal information over phenotypes is roughly equivalent to the VC model
combining information over QTL effects [36,37]. Unlike the VC model, the two-stage approach
usually does not take the residual covariance into account. This approach can be easily and
quickly implemented without iteration, but may have less power to detect QTLs owing to its
approximate nature [37,38]. For improvement, Piepho et al. [39] proposed an alternative strategy
by orthogonalizing the covariance matrix in the first stage of the model. This method might be
capable of providing a more precise approximation to the one-stage method while maintaining the
computational efficiency of the two-stage strategy, and therefore deserves more investigation
in the future.
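
A minimal sketch of the basic two-stage idea is given below, assuming a linear growth curve in the first stage and simple marker regression in the second. The function name and the choice of `numpy.polyfit` for both stages are assumptions for illustration; any QTL mapping tool could replace the second stage, and the residual covariance is ignored, as in the basic two-stage approach.

```python
import numpy as np


def two_stage(Y, x, times):
    """Stage 1: one linear curve per individual; stage 2: regress on genotype.

    Y: (n, m) phenotypes, x: (n,) genotype codes, times: (m,) timepoints.
    """
    # Stage 1: polyfit fits every column of Y.T, i.e. one curve per individual.
    coefs = np.polyfit(times, Y.T, 1)
    slopes, intercepts = coefs[0], coefs[1]
    # Stage 2: the curve parameters are treated as latent trait values.
    slope_effect = np.polyfit(x, slopes, 1)[0]
    intercept_effect = np.polyfit(x, intercepts, 1)[0]
    return slopes, intercepts, slope_effect, intercept_effect
```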

A Simple Regression-Based Method


A third approach with a low computational cost has been proposed by Kwak et al. [38]. They
suggest separately mapping the QTLs at each single timepoint, and combining the test statistic
information of each QTL over time such that the information is combined afterwards in the
hypothesis-testing stage. Their method succeeded in identifying three QTLs from a root tip angle
dataset of Arabidopsis thaliana with 241 timepoints. More details of this QTL study can be found
in Box 4. This approach is shown to have notable performance when the function-valued traits
are very smooth, but is less efficient in estimating QTL effects when the phenotype curves are
noisy (e.g., owing to errors introduced by temporal environmental effects) because it does not
take the correlation structure among phenotypes into account. In such cases, pre-smoothing
of phenotype measurements may be needed beforehand, perhaps by employing kernel tech-
niques [40].
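
The combination of per-timepoint evidence can be sketched as below. Two assumptions should be noted: marker regression is used here as a stand-in for the interval mapping applied in [38], and the LOD score is computed with the standard normal-model expression (n/2) log10(RSS0/RSS1). The function names are hypothetical; SLOD and MLOD follow the definitions in Box 4 (average and maximum LOD over time).

```python
import numpy as np


def lod_matrix(Y, X):
    """LOD score of every marker at every timepoint.

    Y: (n, m) phenotypes over m timepoints, X: (n, p) genotype codes.
    Returns a (p, m) array.
    """
    n, m = Y.shape
    p = X.shape[1]
    rss0 = ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)   # null model: mean only
    lod = np.empty((p, m))
    for j in range(p):
        design = np.column_stack([np.ones(n), X[:, j]])
        coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
        rss1 = ((Y - design @ coef) ** 2).sum(axis=0)
        lod[j] = 0.5 * n * np.log10(rss0 / rss1)
    return lod


def slod_mlod(lod):
    """Average (SLOD) and maximum (MLOD) LOD over time for each marker."""
    return lod.mean(axis=1), lod.max(axis=1)
```

Because each timepoint is fitted separately, the loop over markers involves only closed-form least-squares fits, which is what keeps the computational cost low.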



Combining Multiple-Locus Modeling with Functional Mapping
Single-locus methods (e.g., Equation 1) mapping one SNP at a time have been applied in most
of the functional mapping studies so far. In reality, many quantitative traits in plants are controlled
by multiple genes, indicating that single-locus models are inadequate for these complex traits
[41,42], and may not provide accurate estimates of SNP effects and residual covariance. In
addition, the residual covariance has to be repeatedly estimated for every single SNP, making
the single-locus functional method computationally expensive for datasets with thousands of
SNPs. A multiple-locus method, which simultaneously estimates the additive effects of multiple
SNPs as well as the covariance matrix in one computational procedure, may better mimic the
true genetic mechanism under a complex trait and save computational cost [43].

A multiple-locus varying-coefficient (VC) model is defined by:

$$y_i(t_k) = \beta_0(t_k) + \sum_{j=1}^{p} x_{ij}\beta_j(t_k) + e_i(t_k), \qquad e_i = [e_i(t_1), \ldots, e_i(t_m)]^T \overset{\text{i.i.d.}}{\sim} \mathrm{MVN}(0, \Sigma_0), \tag{2}$$

Box 4. A Case Study: Gravitropism in Arabidopsis thaliana


Arabidopsis Datasets
The study ([17,38]) aimed to identify quantitative trait loci influencing the plant root gravitropism in Arabidopsis thaliana
based on three different datasets, including two independently reared sets of 162 recombinant inbred lines (RILs1 and 2),
and 92 near-isogenic lines (NILs). For the RIL and NIL datasets there were 234 and 102 DNA markers on five
chromosomes, respectively. The growth of seedling roots was automatically recorded every 2 minutes for 8 h by video
(see file S1 in [17] for an illustration), and the follow-up imaging analysis derived the angle of the root tip (in degrees) over
241 timepoints as the phenotype data. Next, analysis and results on the RIL1 dataset (Figure I) are illustrated.

Timecourse QTL Analysis


QTL analysis was first conducted separately on each single timepoint by standard interval mapping [22,75]. The LOD
score (log10 likelihood ratio comparing a QTL model and a null model) of each locus was calculated. Second, to integrate
QTL evidence over multiple timepoints, the average and maximum values of the LOD scores over time were calculated
for each locus, denoted the SLOD and MLOD scores, respectively. SLOD has strength to identify QTLs having small
effects over a long range of time, whereas MLOD is more powerful for identifying QTLs having large effects over a
short time-interval. A permutation test was used to derive a threshold for identifying significant loci. Next, multiple
locus analysis was conducted by using a stepwise model search strategy. Consequently, the SLOD criterion derived
a two-QTL model with QTLs on chromosomes 1 (at 61 cM) and 4 (at 76 cM), and the MLOD criterion identified one
extra QTL on chromosome 3 (at 42 cM). The QTLs on chromosomes 3 and 4 have large effects during the early stage
of the root growth, and the QTL on chromosome 1 has a relatively late effect (Figure II).
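
The permutation step can be illustrated with a short sketch. It is not the implementation used in [38]: the per-marker statistic (squared marker-trait correlation averaged over timepoints) is a hypothetical stand-in for SLOD, and the function names, the number of permutations, and the fixed random seed are assumptions. Whole phenotype trajectories are shuffled across individuals, so the temporal correlation within each trajectory is preserved while the phenotype–genotype relationship is destroyed.

```python
import numpy as np


def mean_r2(Y, X):
    """Per-marker statistic: squared correlation averaged over timepoints."""
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    num = (Xc.T @ Yc) ** 2                                   # (p, m)
    den = np.outer((Xc ** 2).sum(axis=0), (Yc ** 2).sum(axis=0))
    return (num / den).mean(axis=1)


def permutation_threshold(Y, X, stat_fn=mean_r2, n_perm=200, alpha=0.05, seed=0):
    """Genome-wide threshold from the null distribution of the maximum statistic."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    max_stats = np.empty(n_perm)
    for b in range(n_perm):
        shuffled = Y[rng.permutation(n)]   # shuffle individuals, keep rows intact
        max_stats[b] = stat_fn(shuffled, X).max()
    return float(np.quantile(max_stats, 1.0 - alpha))
```

A locus is then declared significant when its observed statistic exceeds the returned threshold.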

[Figure I plot: phenotype trajectories of 162 Arabidopsis RILs; root tip angle (degrees) against time (hours, 0 to 8).]

Figure I. The Phenotype Trajectories of 162 Arabidopsis RILs Over Time. The data are publicly available at
http://phytomorph.wisc.edu/download/.



[Figure II panels: (A) Baseline curve, tip angle (degrees) against time (h); (B) Chr 1, 61 cM; (C) Chr 3, 76 cM; (D) Chr 4, 42 cM; panels B to D plot the QTL effect against time (h).]

Figure II. The Estimated QTL Effects from the 2-QTL Model. The estimated QTL effects from the 2-QTL model
based on the SLOD criterion (red unbroken curve) and 3-QTL model based on the MLOD criterion (blue broken curve),
respectively (Figure 4 in [38]; reproduced with permission of The Genetics Society of America). Abbreviation: Chr,
chromosome.

where p is the total number of markers, and all the other parameters are defined in the same way
as in Equation 1. When the number of SNPs is small, one may apply a similar type of maximum
likelihood based algorithm as that used in single-locus methods to obtain estimates of Equation
2. However, when the number of SNPs is larger than the number of individuals (which is likely to
be the case in many QTL/association mapping studies), Equation 2 becomes an over-saturated
model and the maximum likelihood method cannot provide a valid solution. In such cases,
variable selection methods are applicable, selecting only a subset of important SNPs for
inclusion in the model and eliminating the irrelevant ones [44]. Note that many variable selection
techniques were originally developed in the context of univariate regression (i.e., for analyzing
quantitative traits at a single timepoint) [42,45–47], and some of them (including stepwise
regression, penalized regression, and Bayesian approaches) have been generalized for the
VC model and/or approximate models for functional mapping.

Stepwise Regression
A straightforward model search strategy operates by simply enumerating all possible combinations
of SNPs and picking the optimal model as judged by model selection criteria [48] such as
cross validation (CV), Akaike information criterion (AIC), or Bayesian information criterion (BIC).
This approach is computationally infeasible for even a moderately-sized dataset (e.g., 100 SNPs
already give 2^100, or about 1.3 × 10^30, candidate models). A simpler ‘greedy’ alternative is
stepwise regression [49]. The forward selection starts from a null model (i.e., with only the
intercept term), and repeats the following two steps: (i) adds a
SNP best correlated with the estimated phenotypic residuals into the model, and (ii) re-estimates
the residuals according to the newly added SNP. The algorithm should terminate when adding
any new SNP cannot improve the model. Backward elimination may also be applicable by
starting from a full model (with all the SNPs included) and successively deleting the SNP which
is least correlated with the phenotypic residuals. Moreover, one may also combine forward/
backward selection in the same procedure [45].
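
The forward selection loop described above can be sketched for a single timepoint as follows. This is a simplified illustration under stated assumptions rather than the procedure of [43] or [45]: it uses ordinary least squares instead of a VC or mixed model, stops when the best-correlated candidate no longer lowers BIC, and runs on simulated data. All function names are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def residuals(columns, y):
    """Least-squares residuals of y on an intercept plus the given SNP columns."""
    design = [[1.0] * len(y)] + columns
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in design]
    beta = solve(xtx, xty)
    return [yi - sum(b * c[i] for b, c in zip(beta, design)) for i, yi in enumerate(y)]


def bic(res, n_params):
    n = len(res)
    return n * math.log(sum(r * r for r in res) / n) + n_params * math.log(n)


def abs_corr(x, r):
    mx, mr = sum(x) / len(x), sum(r) / len(r)
    sxy = sum((a - mx) * (b - mr) for a, b in zip(x, r))
    sxx = sum((a - mx) ** 2 for a in x)
    srr = sum((b - mr) ** 2 for b in r)
    return abs(sxy) / math.sqrt(sxx * srr) if sxx > 0 and srr > 0 else 0.0


def forward_selection(snps, y):
    selected = []
    res = residuals([], y)              # null model: intercept only
    best_bic = bic(res, 1)
    while len(selected) < len(snps):
        candidates = [j for j in range(len(snps)) if j not in selected]
        # Step (i): pick the SNP best correlated with the current residuals.
        j = max(candidates, key=lambda k: abs_corr(snps[k], res))
        # Step (ii): refit and re-estimate the residuals with that SNP included.
        new_res = residuals([snps[k] for k in selected + [j]], y)
        new_bic = bic(new_res, len(selected) + 2)
        if new_bic >= best_bic:         # no improvement: stop
            break
        selected.append(j)
        res, best_bic = new_res, new_bic
    return selected


# Simulated data (assumption): SNPs 1 and 3 are the only causal markers.
rng = random.Random(0)
n = 80
snps = [[float(rng.randint(0, 1)) for _ in range(n)] for _ in range(6)]
y = [3.0 * snps[1][i] - 2.0 * snps[3][i] + rng.gauss(0, 0.5) for i in range(n)]
selected = forward_selection(snps, y)
print(selected)
```

Each pass fits only one additional model, so the search cost grows roughly linearly with the number of SNPs per step instead of exponentially as in exhaustive enumeration.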

A stepwise method usually searches only a low-dimensional model space, and can be quickly
implemented on a large amount of data, making it a favored choice in many QTL/association mapping
studies [45,50]. In [43], forward/backward selection was utilized in a VC model for functional
mapping. For the mouse behavior data, this method detected a set of QTLs similar to that found by
the single-locus GEE method [33], but at lower computational cost. Forward/backward selection
has also been used in some approximate functional mapping models [38].

Penalized Regression
Stepwise regression is based on a discrete model search strategy, which may not be stable for
variable selection because small alterations in the data can cause fairly drastic changes in the
results [49]. Recently, statisticians have paid more attention to penalized regression, a
continuous variable selection approach which is numerically more stable [48,49]. In penalized
regression, the objective function (i.e., the sum of squares function or likelihood function) is combined
with a penalty term Penal(λ,β(t)), which shrinks the effects of unimportant SNPs toward zero
during estimation and therefore leads to a sparse model. The tuning parameter λ (λ > 0)
determines the number of markers to be incorporated into the model, and can be selected
by a similar model selection criterion as that used in stepwise regression. Popular choices of
penalty functions include LASSO [36,40] and fusion penalty [51].
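
The shrinkage behavior described above can be illustrated for a single timepoint with the LASSO penalty. The sketch below is a toy under stated assumptions, not the algorithm of [36] or [40]: it minimizes (1/2n) × RSS + lam × Σ|b_j| by cyclic coordinate descent on centred data, and the names lasso, soft_threshold and lam are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def centre(v):
    m = sum(v) / len(v)
    return [a - m for a in v]


def lasso(snps, y, lam, n_iter=100):
    """Cyclic coordinate descent for (1/2n) * RSS + lam * sum(|beta_j|)."""
    n = len(y)
    cols = [centre(c) for c in snps]
    res = centre(y)                      # residuals when all effects are zero
    beta = [0.0] * len(cols)
    for _ in range(n_iter):
        for j, x in enumerate(cols):
            norm = sum(a * a for a in x) / n
            if norm == 0.0:              # monomorphic marker: nothing to estimate
                continue
            # Correlation of SNP j with the partial residual that excludes its own effect.
            rho = sum(a * r for a, r in zip(x, res)) / n + norm * beta[j]
            new = soft_threshold(rho, lam) / norm
            if new != beta[j]:
                res = [r - a * (new - beta[j]) for r, a in zip(res, x)]
                beta[j] = new
    return beta


# Toy data (assumption): the phenotype depends only on the first SNP.
snps = [[0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
y = [0.0, 2.0, 0.0, 2.0]
for lam in (0.0, 0.1, 1.0):
    print(lam, lasso(snps, y, lam))
```

Increasing lam sets more effects exactly to zero, which is how the tuning parameter controls the number of markers retained in the model.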

Penalized regression has a Bayesian interpretation [52]. The penalty term Penal(λ,β(t)) can
be ‘translated’ to a prior distribution of the SNP effects β(t), that is, the knowledge about the
distribution of β(t) that is available before seeing the data. The combination of data likelihood
and prior leads to a posterior distribution of model parameters, which can be estimated by the
Markov chain Monte Carlo (MCMC) sampling method [52–54]. A major benefit of Bayesian approaches
is that they can easily provide uncertainty estimates such as standard errors or credible intervals
of the SNP effects [55].
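
As an illustration of this correspondence, consider the simplified single-timepoint case with the LASSO penalty. The symbols y (phenotype vector), X (genotype matrix), β (vector of p SNP effects), σ² (residual variance) and λ (tuning parameter) are notation assumed here for the sketch rather than taken from Equations 1 and 2. Under a normal likelihood, minimizing the penalized sum of squares is equivalent to maximizing a posterior density in which each SNP effect has an independent Laplace (double-exponential) prior:

```latex
\hat{\beta}_{\mathrm{LASSO}}
  = \arg\min_{\beta}\left\{ \frac{1}{2\sigma^{2}} \lVert y - X\beta \rVert_{2}^{2}
      + \lambda \sum_{j=1}^{p} \lvert \beta_{j} \rvert \right\}
\quad\Longleftrightarrow\quad
\hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta}\; \mathcal{N}\!\left(y \mid X\beta,\ \sigma^{2} I\right)
      \prod_{j=1}^{p} \frac{\lambda}{2}\, e^{-\lambda \lvert \beta_{j} \rvert}
```

Taking the negative logarithm of the right-hand side recovers the left-hand objective up to an additive constant, so the LASSO estimate is the posterior mode under this prior, whereas MCMC-based approaches explore the whole posterior distribution.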

Penalized regression methods, especially the MCMC-based Bayesian approaches, are often
more computationally expensive than stepwise regression [54]. Therefore, current penalized
estimation techniques can only be employed with smaller datasets in a VC model for analyzing
function-valued traits (Yi Gong, PhD thesis, University of North Carolina at Chapel Hill, 2013).
Alternatively, one can more easily use penalized regression in the two-stage modeling
framework [36].

Hypothesis Testing
Variable selection focuses on a subset of important SNPs with non-zero effects. In reality, it is
often impossible to choose a perfect tuning parameter value by any model selection criterion
(e.g., CV, AIC, BIC) to guarantee that all the selected SNPs represent true QTLs. It is generally the
case that the selected subset of SNPs may include some noisy signals. Therefore, hypothesis
testing needs to be used together with variable selection to formally judge QTLs and control the
false-positive rate. Constructing a test statistic for a SNP selected by variable selection is not
straightforward because the estimates of SNP effects do not asymptotically follow any known
distribution [56]. Current testing approaches such as multiple-split testing [57], stability
selection [58], and phenotype permutation [59] are largely based on sub-sampling or
re-sampling the data, which is time-consuming and sometimes conservative in detecting QTLs
when the sample size is small [36]. Development of more-efficient parameter tuning or
hypothesis-testing methods for multiple-locus models is still an open research question, and
deserves more attention in the future.

Outstanding Questions
How can the high-throughput imaging and QTL analysis be combined in a single workflow?

Does robust regression provide a good solution to account for microenvironmental variation in
high-throughput phenotyping data?

Are the VC models applicable to estimate the dynamic trajectories of heritabilities?

Can the statistical methodology of functional mapping be used to design novel strategies and
computational tools for predicting plant genomic breeding values?

The HTPPs can simultaneously record multiple correlated traits. Could it be possible to have
multiple phenotype components in a functional mapping model?

Modeling Environmental Effects
In addition to genetics, environmental factors may also contribute to phenotypic variation in
plants. Thus, the QTL analysis needs to take both macroenvironmental and microenvironmental
variation into consideration [60]. The macroenvironmental effects are typically measurable
factors such as time and location. In the VC model, the inclusion of random intercept and
slope covariance structure (Box 3) describes serial correlation and inhomogeneous variances
over time. Thus, such covariance can partially handle the phenotypic heterogeneity as a result
of temporal effects such as temperature and precipitation [36,61]. In addition, the interactions
between gene and temporal effects are also described in the VC model by allowing for changes
in QTL effects over time.

Phenotypic variation is also caused by spatial environmental effects when data are collected
from multiple environments/locations. A valid assumption is that observations at nearby
locations are affected by more similar environmental effects than observations located further apart
[62]. Therefore it might be beneficial to include location/environmental information as extra
covariates, and specify the residual errors to be correlated and inhomogeneous over individuals
in the VC model (Equations 1 and 2) instead of assuming independent errors. However, inclusion
of both spatial and temporal covariance structures in the VC model often involves calculation
of the inverse of a large covariance matrix (i.e., a Kronecker product of two covariances) [63],
which is computationally expensive. A more practical approach is to model the spatial and
temporal covariances separately, which avoids representing them as a Kronecker product [64].
Alternatively, randomized block designs can effectively account for spatial variation
without the need to fit complex spatial models.

Moreover, microenvironmental effects are usually unknown, and cannot be directly represented
as covariates in a functional QTL model. Note that the application of high-throughput phenotyping
platforms, especially those developed in the field, cannot guarantee that the temporal/spatial
environmental factors are fully controlled and monitored. A possible consequence of
microenvironmental variation is that the phenotype data do not follow a normal
distribution [6], and the VC model under the assumption of residual normality may therefore not
be optimal for such data. In that case it might be worth investigating the use of robust statistical
approaches such as hierarchical generalized linear models [60] or least absolute deviation
regression [65]. These models allow the residuals to follow a non-normal (for example, heavy-tailed
or skewed) distribution, and they might be more appropriate for describing non-normally
distributed phenotype data.
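
The difference between a least squares and a least absolute deviation (LAD) fit can be seen in the simplest possible setting. The sketch below is an illustration under stated assumptions, not the method of [65]: it considers one timepoint and one marker with 0/1 genotype codes, in which case the least squares effect is the difference between the two genotype means and the LAD effect is the difference between the two genotype medians. The function name and the data are hypothetical.

```python
from statistics import mean, median


def marker_effect(genotypes, phenotypes, robust=False):
    """Effect of a 0/1 marker: difference in group means (least squares)
    or in group medians (least absolute deviation)."""
    centre = median if robust else mean
    g0 = [y for g, y in zip(genotypes, phenotypes) if g == 0]
    g1 = [y for g, y in zip(genotypes, phenotypes) if g == 1]
    return centre(g1) - centre(g0)


# Hypothetical phenotypes; the last value mimics a microenvironmental outlier.
genotypes = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
phenotypes = [1.0, 1.2, 0.8, 1.1, 0.9, 2.0, 2.1, 1.9, 2.2, 12.0]

ls_effect = marker_effect(genotypes, phenotypes)
lad_effect = marker_effect(genotypes, phenotypes, robust=True)
print(round(ls_effect, 2), round(lad_effect, 2))
```

A single aberrant observation inflates the least squares estimate, while the LAD estimate stays close to the effect suggested by the remaining lines.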

Concluding Remarks and Future Perspectives


Novel high-throughput phenotyping and automated imaging techniques can readily generate
timecourse phenotype data, and thereby facilitate the study of the genetic control of plant
development and/or growth-related traits. We have provided an overview of functional QTL
mapping, which employs an elaborate class of statistical approaches to investigate
phenotype–genotype relationships over the whole growth period instead of at a particular
developmental stage of the plants. Note that high-throughput phenotyping is used not only in plant
genetics but also in many other fields such as animal behavior science [66,67]. The statistical
models described here are general, and can also be used to analyze timecourse data generated by
platforms for organisms other than plants [33,43]. Our special focus has been on some
recently developed computationally efficient multiple-locus methods which address new
statistical and computational challenges arising from large-scale functional data of this
post-genome era.

The choice of statistical method is important and determines the quality of the functional
mapping analysis. We recommend that the analyst start with a visual inspection of
the phenotypic trajectories. If the trajectories are smooth and monotonic, some simple parametric
functions may be sufficient to describe the data. For more complex and noisy trajectories, one
may need to apply advanced non-parametric trend and covariance functions. Computational
expense is also an important concern. For datasets with large numbers of timepoints or markers,
one may prefer to use approximate two-stage methods, instead of precise one-stage methods,
to speed up the computation, although this might sacrifice some estimation accuracy.

The statistical methodologies of functional QTL mapping have been largely developed over the
past 10 years, and general software tools for implementing these models, comparable to R/qtl [68]
and TASSEL [69] for QTL mapping of univariate traits, are still lacking. Therefore, it would be
desirable to develop high-performance and user-friendly computer programs for the practical
usage of functional mapping. Furthermore, it would also be interesting to extend the current
methodologies of functional mapping for new purposes (see Outstanding Questions), such as
searching for gene–gene and gene–environment interactions [27,70,71], predicting genomic
breeding values [72], estimating trait heritability [73], and combining functional mapping with
systems biology [74].

Acknowledgments
This work was supported by research funding from Biocenter Oulu. We are grateful to the Editor, three anonymous referees,
and Ashley Last for their valuable comments on the manuscript.

References
1. Mackay, T.F.C. et al. (2009) The genetics of quantitative traits: challenges and prospects. Nat. Rev. Genet. 5, 565–577
2. Collard, B.C. and Mackill, D.J. (2008) Marker-assisted selection: an approach for precision plant breeding in twenty-first century. Philos. Trans. R. Soc. B 363, 557–572
3. He, J. et al. (2014) Genotyping-by-sequencing (GBS), an ultimate marker-assisted selection (MAS) tool to accelerate plant breeding. Front. Plant Sci. 5, 484
4. Huang, X. and Han, B. (2014) Natural variations and genome-wide association studies in crop plants. Annu. Rev. Plant Biol. 65, 531–551
5. Tisné, S. et al. (2013) Phenoscope: an automated large-scale phenotyping platform offering high spatial homogeneity. Plant J. 74, 534–544
6. Cobb, J.N. et al. (2013) Next-generation phenotyping: requirements and strategies for enhancing our understanding of genotype–phenotype relationships and its relevance to crop improvement. Theor. Appl. Genet. 126, 867–887
7. Walter, A. et al. (2012) Advanced phenotyping offers opportunities for improved breeding of forage and turf species. Ann. Bot. 110, 1271–1279
8. Araus, J.L. and Cairns, J.E. (2014) Field high-throughput phenotyping: the new crop breeding frontier. Trends Plant Sci. 19, 52–61
9. Yang, W. et al. (2014) Combining high-throughput phenotyping and genome-wide association studies to reveal natural genetic variation in rice. Nat. Commun. 5, 5087
10. Topp, C.N. et al. (2013) 3D phenotyping and quantitative trait locus mapping identify core regions of the rice genome controlling root architecture. Proc. Natl. Acad. Sci. U.S.A. 110, E1695–E1704
11. Fahlgren, N. et al. (2015) Light, camera, action: high-throughput plant phenotyping is ready for a close-up. Curr. Opin. Plant Biol. 24, 93–99
12. Sozzani, R. et al. (2014) Advanced imaging techniques for the study of plant growth and development. Trends Plant Sci. 19, 304–310
13. Li, L. et al. (2014) A review of imaging techniques for plant phenotyping. Sensors 14, 20078–20111
14. Brien, C.J. et al. (2013) Accounting for variation in designing greenhouse experiments with special reference to greenhouses containing plants on conveyor systems. Plant Methods 9, 5
15. Zhang, X. et al. (2012) Natural genetic variation for growth and development revealed by high-throughput phenotyping in Arabidopsis thaliana. G3 2, 29–34
16. Tessmer, O.L. et al. (2013) Functional approach to high-throughput plant growth analysis. BMC Syst. Biol. 7, S17
17. Moore, C.R. et al. (2013) High-throughput computer vision introduces the time axis to a quantitative trait map of a plant growth response. Genetics 195, 1077–1086
18. Chen, D. et al. (2014) Dissecting the phenotypic components of crop plant growth and drought responses based on high-throughput image analysis. Plant Cell 26, 4636–4655
19. Brown, T.B. et al. (2014) TraitCapture: genomic and environment modelling of plant phenomic data. Curr. Opin. Plant Biol. 18, 73–79
20. Honsdorf, N. et al. (2014) High-throughput phenotyping to detect drought tolerance QTL in wild barley introgression lines. PLoS ONE 9, e97047
21. Foulkes, A.S. (2009) Applied Statistical Genetics with R: For Population-based Association Studies, Springer
22. Lander, E.S. and Botstein, D. (1989) Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121, 185–199
23. Jiang, C. and Zeng, Z-B. (1995) Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics 140, 1111–1127
24. Ma, C. et al. (2002) Functional mapping of quantitative trait loci underlying the character process: a theoretical framework. Genetics 161, 1751–1762
25. Wu, R. and Lin, M. (2006) Functional mapping – how to map and study the genetic architecture of dynamical complex traits. Nat. Rev. Genet. 7, 229–237
26. He, Q. et al. (2010) Mapping genes for plant structure, development and evolution: functional mapping meets ontology. Trends Genet. 26, 39–46
27. Wang, Z. et al. (2013) Modeling phenotypic plasticity in growth trajectories: a statistical framework. Evolution 68, 81–91
28. Pletcher, S.D. and Geyer, C.J. (1999) The genetic analysis of age-dependent traits: modelling the character process. Genetics 151, 825–835

29. Hastie, T. and Tibshirani, R. (1993) Varying-coefficient models. J. Roy. Statist. Soc. B 55, 757–796
30. Ramsay, J.O. et al. (2009) Functional Data Analysis with R and Matlab, Springer
31. Baert, A. et al. (2012) Functional unfold principal component analysis for automatic plant-based stress detection in grapevine. Funct. Plant Biol. 39, 519–530
32. Campbell, M.T. et al. (2015) Integrating image-based phenomics and association analysis to dissect the genetic architecture of temporal salinity responses in rice. Plant Physiol. 168, 1476–1489
33. Xiong, H. et al. (2011) A flexible estimating equations approach for mapping function valued traits. Genetics 189, 305–316
34. Liu, G.F. et al. (2010) Functional mapping of quantitative trait loci associated with rice tillering. Mol. Genet. Genomics 284, 263–271
35. Hurtado, P.X. et al. (2011) Dynamics of senescence-related QTLs in potato. Euphytica 183, 289–302
36. Li, Z. et al. (2014) Functional multi-locus QTL mapping of temporal trends in Scots pine wood traits. G3 (Bethesda) 4, 2365–2379
37. Sikorska, K. et al. (2012) Fast linear mixed model computations for genome-wide association studies with longitudinal data. Stat. Med. 32, 165–180
38. Kwak, I.Y. et al. (2014) A simple regression-based method to map quantitative trait loci underlying function-valued phenotypes. Genetics 197, 1409–1416
39. Piepho, H.P. et al. (2012) A stage-wise approach for the analysis of multi-environment trials. Biom. J. 54, 844–860
40. Meier, L. and Bühlmann, P. (2007) Smoothing ℓ1-penalized estimators for high-dimensional time-course data. Electron. J. Statist. 1, 597–615
41. Jansen, R.C. (1993) Interval mapping of multiple quantitative trait loci. Genetics 135, 205–211
42. Yi, H. et al. (2015) Penalized multimarker vs. single-marker regression methods for genome-wide association studies of quantitative traits. Genetics 199, 205–222
43. Li, Z. and Sillanpää, M.J. (2013) A Bayesian nonparametric approach for mapping dynamic quantitative traits. Genetics 194, 997–1016
44. Sillanpää, M.J. and Corander, J. (2002) Model choice in gene mapping: what and why. Trends Genet. 18, 301–307
45. Segura, V. et al. (2012) An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat. Genet. 44, 825–830
46. Li, Z. and Sillanpää, M.J. (2012) Overview of LASSO-related penalized regression methods for quantitative trait mapping and genomic selection. Theor. Appl. Genet. 125, 419–435
47. O’Hara, R.B. and Sillanpää, M.J. (2009) A review of Bayesian variable selection methods: What, how, and which? Bayesian Anal. 4, 85–118
48. Hastie, T. et al. (2009) Elements of Statistical Learning. (2nd Edn), Springer
49. Izenman, A.J. (2008) Modern Multivariate Statistical Techniques, Springer
50. Broman, K.W. and Speed, T.P. (2002) A model selection approach for the identification of quantitative trait loci in experimental crosses. J. Roy. Statist. Soc. B 64, 641–656
51. Daye, Z.J. et al. (2012) A sparse structured shrinkage estimator for nonparametric varying-coefficient model with an application in genomics. J. Comput. Graph. Stat. 21, 110–133
52. Park, T. and Casella, G. (2008) The Bayesian LASSO. J. Am. Stat. Assoc. 103, 681–686
53. Yang, R. and Xu, S. (2007) Bayesian shrinkage analysis of quantitative trait loci for dynamic traits. Genetics 176, 1169–1185
54. Sillanpää, M.J. et al. (2012) Simultaneous estimation of multiple quantitative trait loci and growth curve parameters through hierarchical Bayesian modeling. Heredity 108, 134–146
55. Kyung, M. et al. (2010) Penalized regression, standard errors, and Bayesian Lassos. Bayesian Anal. 4, 369–412
56. Bühlmann, P. et al. (2014) High-dimensional statistics with a view towards applications in biology. Annu. Rev. Statist. Appl. 1, 255–278
57. Meinshausen, N. et al. (2009) P-values for high-dimensional regression. J. Am. Stat. Assoc. 104, 1671–1681
58. Meinshausen, N. and Bühlmann, P. (2010) Stability selection. J. Roy. Stat. Soc. B 72, 417–473
59. Li, Z. and Sillanpää, M.J. (2012) Estimation of quantitative trait locus effects with epistasis by variational Bayes algorithms. Genetics 190, 231–249
60. Mulder, H.A. et al. (2013) Estimation of genetic variance for macro- and micro-environmental sensitivity using double hierarchical generalized linear models. Genet. Sel. Evol. 45, 23
61. Fahrmeir, L. and Kneib, T. (2011) Bayesian Smoothing and Regression for Longitudinal, Spatial and Event History Data, Oxford University Press
62. Rousset, F. and Ferdy, J-B. (2014) Testing environmental and genetic effects in the presence of spatial autocorrelation. Ecography 37, 781–790
63. Yap, J.S. et al. (2011) Functional mapping of reaction norms to multiple environmental signals through nonparametric covariance estimation. BMC Plant Biol. 11, 23
64. Smith, A.B. et al. (2007) Varietal selection for perennial crops where data relate to multiple harvests from a series of field trials. Euphytica 157, 253–266
65. Li, Z. et al. (2015) A robust multiple-locus method for quantitative trait locus analysis of non-normally distributed multiple traits. Heredity 115, 556–564
66. Goulding, E.H. et al. (2008) A robust automated system elucidates mouse home cage behavioral structure. Proc. Natl. Acad. Sci. U.S.A. 105, 20575–20582
67. Schaefer, A.T. and Claridge-Chang, A. (2012) The surveillance state of behavioral automation. Curr. Opin. Neurobiol. 22, 170–176
68. Broman, K.W. and Sen, S. (2009) A Guide to QTL Mapping with R/qtl, Springer
69. Bradbury, P.J. et al. (2012) TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23, 2633–2635
70. Piepho, H.P. (2000) A mixed-model approach to mapping quantitative trait loci in Barley on the basis of multiple environmental data. Genetics 156, 2043–2050
71. Yi, N. (2010) Statistical analysis of genetic interactions. Genet. Res. 92, 443–459
72. Desta, Z.A. and Ortiz, R. (2014) Genomic selection: genome-wide prediction in plant improvement. Trends Plant Sci. 19, 592–601
73. Sillanpää, M.J. (2011) On statistical methods for estimating heritability in wild populations. Mol. Ecol. 20, 1324–1332
74. Sun, L. and Wu, R. (2015) Mapping complex traits as a dynamic system. Phys. Life Rev. 13, 155–185
75. Haley, C.S. and Knott, S.A. (1992) A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69, 315–324
76. Halperin, E. and Stephan, D.A. (2009) SNP imputation in association studies. Nat. Biotechnol. 27, 349–351
77. Marchini, J. and Howie, B. (2010) Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11, 499–511
78. Sen, S. and Churchill, G.A. (2001) A statistical framework for quantitative trait mapping. Genetics 159, 371–387
