Professional Documents
Culture Documents
Briefings in Functional Genomics 2010 Jannink 166 77
Briefings in Functional Genomics 2010 Jannink 166 77
Briefings in Functional Genomics 2010 Jannink 166 77
1093/bfgp/elq001
Abstract
We intuitively believe that the dramatic drop in the cost of DNA marker information we have experienced should
have immediate benefits in accelerating the delivery of crop varieties with improved yield, quality and biotic and
abiotic stress tolerance. But these traits are complex and affected by many genes, each with small effect.
Traditional marker-assisted selection has been ineffective for such traits. The introduction of genomic selection
Corresponding author. Jean-Luc Jannink, USDA-ARS, R.W. Holley Center for Agriculture and Health; and Department of Plant
Breeding and Genetics, Cornell University, Ithaca, New York, 14853, USA. E-mail: jeanluc.jannink@ars.usda.gov
Jean-Luc Jannink works with the USDA-ARS, R.W. Holley Center for Agriculture and Health, and also at the Department of Plant
Breeding and Genetics, Cornell University. He develops methods using DNA markers to increase the efficiency of small grain
improvement and facilitate the use of those methods in public small grain breeding programs.
Aaron J. Lorenz is a postdoctoral research associate at the USDA-ARS, R.W. Holley Center for Agriculture and Health studying
association mapping and genomic selection for the U.S. Department of Agriculture Barley Coordinated Agricultural Project.
Hiroyoshi Iwata is with the National Agriculture and Food Science Research Organization, National Agricultural Research Center at
Japan. He applies data mining methods to a number of biological phenomena, from genetic map construction, to analysis of
multi-dimensional shapes, to prediction of performance in breeding.
Published by Oxford University Press 2010. All rights reserved. For permissions, please email: journals.permissions@oxfordjournals.org
Genomic selection in plant breeding 167
The solution to this quandary lies not in seeking Furthermore, there may be a high degree of corre-
single markers associated with single large effects but lation or multicollinearity between the predictors.
in capitalizing on the developing capacity for scoring In so-called ‘large p, small n’ problems, standard
many markers at low cost. Add to this capacity novel multiple linear regression cannot be used without
statistical methods that enable the simultaneous esti- variable selection, which conflicts with the original
mation of all marker effects and you get genomic goal of avoiding marker selection. An important
selection (GS;[16]). GS uses a ‘training population’ danger in the development of a prediction model is
of individuals that have been both genotyped and overfitting: an overfitted model can exaggerate
phenotyped to develop a model that takes genotypic minor fluctuations in the data and will generally
data from a ‘candidate population’ of untested indi- have poor predictive ability. To overcome these
viduals and produces genomic estimated breeding problems, a variety of methods, e.g. best linear unbi-
values (GEBVs). These GEBVs say nothing of the ased prediction [28], ridge regression [29], Bayesian
function of the underlying genes but they are the regression [16], kernel regression [30, 31] and
chemometrics that are useful when the researcher is similar approaches assuming additive gene action and
faced with many variables whose relationships are ill- a heritability of 0.5. Forward-simulation of the pop-
understood, and the object is merely to construct a ulation was performed to reach mutation-drift equi-
good predictive model [40]. In both methods, latent librium under conditions that generated about 50
variables are extracted as linear combinations of the segregating QTL. For both studies, however, the
predictors and are used for response prediction. In expected effective QTL number [in the sense of
PCR, the latent variables are chosen to explain as ref. 12] was low: only 6 and 13 for Meuwissen et
much of the variation in the original predictors as al. and Habier et al. respectively. We believe both
possible. In PLS, the latent variables are chosen so of these numbers are unrealistic to model polygenic
that the relationship between the latent variables and traits. For a training population size of 1000, respec-
response is as strong as possible. The number of latent tive GS accuracies were 0.66 and 0.64 for ridge
variables, generally determined through cross-valida- regression and 0.79 and 0.69 for BayesB. The greater
tion, is much lower than the number of predictors or overall accuracy and greater difference between ridge
Marker type and density marker densities, on the order of eight per Morgan,
Solberg et al. [35] used the simulation conditions of can deliver accuracies close to the maximum
Meuwissen et al. [16] to evaluate the effect of marker observed; (ii) using ridge regression, there was a
density and of SSR-like multiallelic markers versus marker density optimum above which accuracy
SNP-like biallelic markers. They found that similar declined [48]; (iii) accuracy assuming true marker
accuracies were achieved in different populations if variances were known was only marginally higher
the marker density scaled with each population’s (0–8%) than assuming all marker variances were
effective size (Ne). Historic recombination between equal [48]; (iv) GS can out-perform phenotypic
loci scales linearly with Ne so that maintaining a fixed selection even when the biparental population is
amount of recombination between loci also requires composed of very few (e.g. 35) individuals [50].
marker density to scale linearly with Ne [47]. As an overall population improvement strategy, no
As might be expected, accuracy increased with den- study has contrasted performing GS within biparen-
sity though gains for a fixed density increment tal crosses to performing it across a breeding program
directly selected upon, increasing their long-term They derived the correlation between predicted
value. and true breeding value as
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
h2
rg^g ¼
h2 þ 1
THEORETICAL STUDIES
Theoretical studies have yielded important results on where
is the ratio of the number of phenotyped
three topics: (i) the sources of GS accuracy; (ii) accu- individuals, nP, to the number of loci, nG and h2 is
racy formulated as a function of QTL number and the entry-mean basis heritability for the trait.
training population size; and (iii) impacts of GS on Hayes et al. [54] increased the realism of the ana-
long-term response. GS models genetic variance in lysis by modeling locus effects as random and deriv-
two ways [17]. As expected, it uses markers in strong ing an approximation for the effective number of
LD with QTL by estimating associated marker allele independent chromosome segments, which indicates
effects. However, as somewhat of an unanticipated how many effects the GS model must estimate.
that assume many QTL evenly distributed over the different halves) or within families (different individ-
genome (i.e. ridge regression) perform as well as uals within families ending up in the different
methods that assume fewer QTL of varying effect halves). For the split across families, only capturing
(e.g. BayesB); (iii) decreasing marker numbers did LD between marker and QTL will be useful for pre-
not strongly affect GS accuracy; and (iv) GS accuracy diction because the families were weakly related.
increased linearly with training population size. One In contrast, for the split within families, capturing
interpretation of these observations is that the infini- genetic relatedness will also be useful. Interestingly,
tesimal model assumption, ‘an infinite number of for traits of similar heritability, prediction across
loci, all with infinitesimally small effects’, is closer families was more accurate in the Lee et al.
to being correct than an assumption of few QTL (BayesB-like analysis) study than in the Legarra
(where ‘few’ could mean dozens but not hundreds). et al. (ridge regression) study. Conversely, prediction
Alternatively, there may be relatively few loci at within families was more accurate in the Legarra
which variants have a large effect on the phenotype, et al. than the Lee et al. study. We hypothesize that
but these variants are at low frequency so that they
statistics to model development and prediction. Like also be used to define a population of samples similar
GS, the advantage of NIRS is that a large set of enough that it could be predicted using a single
variables is cheap to measure (NIR spectra) and can training set and to identify outlier samples [67, 69].
predict variables that are expensive to measure (wet In GS, a statistic similar to the H distance based on
chemistry measurements). NIRS has been inten- marker data could also relate training population
sively researched for decades [65]. Spectra (absor- diversity to model accuracy.
bance values at each of thousands of wavelengths) Routine use of NIRS involves a continual need
are collected on a large population of samples, and to update the calibration as new variation in the
a subset of samples to be phenotyped is chosen. phenotype is encountered. Several guidelines exist
Prediction typically involves relating phenotypes to for deciding when a particular calibration can be
the spectra through PCR or PLS regression [66]. A used to predict new samples, and which samples
common goal of selecting samples for phenotyping is should be added to the existing training population
to evenly span the range of spectral and phenotypic [70]. We envision similar guidelines for training pop-
variation of the population, while minimizing the ulation maintenance in GS. Because generations fol-
size of the set [65]. One multivariate distance lowing a given selection event will contain only the
metric often used for selecting samples that uni- alleles of the parents in each cycle of selection, it may
formly span the spectral diversity is the be most efficient to update the training population
Mahalanobis distance [H distance; 65, 67]. The H by phenotyping the parents of each selection cycle.
distance accounts for collinearity between predictors Empirical and simulation findings should resolve this
in calculating their distance [68]. The H distance can question. Theory and practice in other areas such as
174 Jannink et al.
Managing short- and long-term population [73]. For GS, it seems sensible that we
selection gain should also take advantage of marker data to manage
If QTL are in complete LD with markers, theory inbreeding and optimize long-term selection gains.
shows that GS should cause less inbreeding or loss For example, Goddard [71] proposed varying the
of genetic diversity than selection on breeding values weight given to marker information as a function
estimated using pedigree information [58]. of the allele frequencies at each marker [19]. It
Obviously, this condition does not hold and, in real- would also be possible to use markers to mimic
ity, GS can fall short of phenotypic selection: (i) GS within-family selection, a practice that reduces the
will not ‘discover’ some QTL and these will drift rate of inbreeding. We have done within-group
rather than be subject to selection [71]; (ii) if selection by using marker data to cluster selection
marker and QTL are not in perfect LD, fixing a candidates and then selecting within clusters
marker will not fix the QTL [72]; (iii) finally, as (Figure 3). Such selection reduces short-term but
we have seen, GS does capture some relationship increases long-term gain. Figure 3 shows that
information increasing the likelihood of co-selection beyond accelerating selection response, marker data
of relatives. For traditional pedigree-based selection, offers wide possibilities for managing short- and
methods have been developed to select while con- long-term gains. Research into these possibilities
straining the rate of increase of relatedness in the has just begun [71, 74].
Genomic selection in plant breeding 175
28. Kolbehdari D, Schaeffer LR, Robinson JA. Estimation of 49. Bernardo R. Genomewide selection for rapid
genome-wide haplotype effects in half-sib designs. J Anim introgression of exotic germplasm in maize. Crop Sci 2009;
Breed Genet 2007;124:356–61. 49:419–25.
29. Hoerl AE, Kennard RW. Ridge regression: biased estima- 50. Wong C, Bernardo R. Genomewide selection in oil palm:
tion for nonorthogonal problems. Technometrics 2000;42: increasing selection gain per unit time and cost with small
80–6. populations. TheorAppl Genet 2008;116:815–24.
30. Bennewitz J, Solberg T, Meuwissen THE. Genomic breed- 51. Habier D, Fernando RL, Dekkers JCM. Genomic
ing value estimation using nonparametric additive regression selection using low-density marker panels. Genetics 2009;
models. Genet Select Evol 2009;41:20. 182:343–53.
31. Gianola D, van Kaam JBCHM. Reproducing Kernel 52. Burdick JT, Chen W-M, Abecasis GR, et al. In silico
Hilbert spaces regression methods for genomic assisted method for inferring genotypes in pedigrees. Nat Genet
prediction of quantitative traits. Genetics 2008;178: 2006;38:1002–4.
2289–303. 53. Yu J, Holland JB, McMullen MD, et al. Genetic design and
32. Long N, Gianola D, Rosa GJM, et al. Machine learning statistical power of nested association mapping in maize.
classification procedure for selecting SNPs in genomic selec- Genetics 2008;178:539–51.
tion: application to early mortality in broilers. J Anim Breed 54. Hayes BJ, Visscher PM, Goddard ME. Increased accuracy of
69. Mark HL, Tunnell D. Qualitative near-infrared reflectance response under alternative trait and genomic parameters.
analysis using Mahalanobis distances. Anal Chem 1985;57: J Anim Breed Genet 2007;124:342–55.
1449–56. 73. Avendaño S, Woolliams JA, Villanueva B. Mendelian sam-
70. Shenk JS, Fales SL, Westerhaus MO. Using near-infrared pling terms as a selective advantage in optimum breeding
reflectance product library files to improve prediction accu- schemes with restrictions on the rate of inbreeding. Genet
racy and reduce calibration costs. Crop Sci 1993;33:578–81. Res 2004;83:55–64.
71. Goddard M. Genomic selection: prediction of accuracy and 74. Li Y, Kadarmideen HN, Dekkers JCM. Selection on multi-
maximisation of long term response. Genetica 2009;136: ple QTL with control of gene diversity and inbreeding for
245–57. long-term benefit. J Anim Breed Genet 2008;125:320–9.
72. Muir WM. Comparison of genomic and traditional
BLUP-estimated breeding value accuracy and selection