Abstract
Recent advances in the application of the polymerase chain reaction make it possible to
score individuals at a large number of loci. The RAPD (random amplified polymorphic
DNA) method is one such technique that has attracted widespread interest. The analysis
of population structure with RAPD data is hampered by the lack of complete genotypic
information resulting from dominance, since this enhances the sampling variance
associated with single loci as well as induces bias in parameter estimation. We present
estimators for several population-genetic parameters (gene and genotype frequencies,
within- and between-population heterozygosities, degree of inbreeding and population
subdivision, and degree of individual relatedness) along with expressions for their
sampling variances. Although completely unbiased estimators do not appear to be
possible with RAPDs, several steps are suggested that will ensure that the bias in
parameter estimates is negligible. To achieve the same degree of statistical power, on the
order of 2 to 10 times more individuals need to be sampled per locus when dominant
markers are relied upon, as compared to codominant (RFLP, isozyme) markers. Moreover,
to avoid bias in parameter estimation, the marker alleles for most of these loci should be
in relatively low frequency. Due to the need for pruning loci with low-frequency null
alleles, more loci also need to be sampled with RAPDs than with more conventional
markers, and some problems of bias cannot be completely eliminated.
Keywords: gene diversity, inbreeding, molecular markers, structure, RAPDs, relatedness
Received 29 March 1993; revision accepted 15 July 1993
Correspondence: Michael Lynch +1-503-346-2364

Introduction

Estimates of genetic variation increasingly are being based upon information at the DNA
level. Whereas some DNA markers, such as restriction fragment length polymorphisms, can be
analysed with traditional methods developed for isozymes, others, such as DNA fingerprints,
need to be analysed in new ways (Lynch 1988, 1990; Stephens et al. 1992). A recent method
that is being assimilated rapidly by empiricists is the RAPD (random amplified polymorphic
DNA) technique (Welsh & McClelland 1990; Williams et al. 1990; Caetano-Anollés et al. 1991;
Deragon 1992; Hadrys et al. 1992).

RAPDs are generated by applying the polymerase chain reaction (PCR) to genomic DNA samples,
using randomly constructed oligonucleotides as primers. Since the technique is relatively
easy to apply to a wide array of plant and animal taxa, and the number of loci that can be
examined is essentially unlimited, RAPDs are viewed as having several advantages over RFLPs
and DNA fingerprints. When the primers are of intermediate size (on the order of 10 base
pairs), multiple amplifiable fragments (from different loci) are usually present for each
set of primers in each genome. The fragments can be separated by size on a standard agarose
gel and visualized by ethidium bromide staining, eliminating the need for radiolabeled
probes. Since the primers consist of random sequences, and do not discriminate between
coding and noncoding regions, it is reasonable to expect the technique to sample the genome
more randomly than conventional methods.

Despite these advantages, RAPD analysis presents some practical problems. As in the case of
DNA fingerprinting where multiple markers appear on the same gel, there can be uncertainty
in assigning markers to
92 M. LYNCH and B. G. MILLIGAN
specific loci in the absence of a preliminary pedigree analysis. In addition, the
possibility that products of different loci will have similar molecular weights, and
therefore be indistinguishable on a gel (because of comigration), is of concern. In some
cases, these problems can be overcome by practical means (Hadrys et al. 1992). However, a
third issue seems to be unavoidable: the dominance of RAPD markers. If one of the "alleles"
at a RAPD site is unamplifiable, then marker/marker homozygotes cannot be distinguished
from marker/null heterozygotes. Provided there is only a single amplifiable allele per
locus, this does not prevent the estimation of allele frequencies necessary for
population-genetic analysis, but it does reduce the accuracy of such estimation relative to
analysis with codominant markers.

Our purpose is to develop the general theory needed for the analysis of population
structure with dominant markers such as RAPDs, and to lay out formally some of the
difficulties that such markers present. Formulae are presented for the estimation of
conventional measures of population structure. All of the estimators take into account bias
due to small sample size and are accompanied by approximate expressions for their sampling
variance.

Assumptions

In order to focus on the salient statistical issues, we start with several assumptions.
First, we make the generous assumption that the interpretation of banding patterns on gels
can be accomplished in a completely unambiguous manner. That is, we assume that marker
alleles from different loci do not comigrate to the same position on a gel, and that the
investigator is fully capable of matching bands from different lanes within and among gels.
Secondly, we assume that each locus can be treated as a two-allele system, with only one of
the alleles per locus being amplifiable by the PCR. The "null" allele may fail to amplify
either because of loss of a primer site or because an insertion has caused the distance
between primer sites to exceed the capacity of the PCR; the specific mechanism is not
relevant to the theory that follows. Without extensive segregation analysis, it is
difficult to validate the two-allele assumption, although a possible way to resolve the
issue is presented in the discussion. Most current users of RAPDs seem to be of the opinion
that the

We will restrict our attention to sexual diploid populations. Letting M and m denote the
marker and null alleles at a locus, and p and q = 1 - p their frequencies, the expected
genotype frequencies in a segregating population are
$P_{MM} = p^2(1 - F_{IS}) + pF_{IS}$, $P_{Mm} = 2pq(1 - F_{IS})$, and
$P_{mm} = q^2(1 - F_{IS}) + qF_{IS}$, where $F_{IS}$ measures the departure from
Hardy-Weinberg equilibrium. Under random mating, $F_{IS} = 0$, and the expected genotype
frequencies follow the Hardy-Weinberg proportions. Under complete inbreeding, $F_{IS} = 1$,
there are no heterozygotes, and the frequencies of the MM and mm homozygotes are p and q,
respectively. $F_{IS} > 0$ is usually interpreted as a measure of the degree of inbreeding
under the assumption that the markers are selectively neutral and in gametic-phase
equilibrium with any selected loci. However, $F_{IS}$ can take on negative values when the
frequency of heterozygotes exceeds the Hardy-Weinberg expectation.

Because the MM and Mm genotypes cannot be discriminated from each other by RAPD analysis,
the marker allele is "dominant" to the null allele on a gel. (In principle, such
discrimination might be possible through the detection of staining intensity differences,
but this degree of resolution would be very difficult to attain without carefully
controlled assays.) Thus, the only observable quantity for a locus is the fraction of
individuals in the population with (1 - x) and without (x) the marker. Our goal is to
demonstrate how observed marker frequencies can be used to estimate conventional measures
of population-genetic structure.

Throughout the paper, we will discriminate estimates from true population parameters by
placing hats (^) on the former. All of the estimators, as well as approximate expressions
for their sampling variances, were obtained by use of second-order Taylor expansions.
Although tedious at times, this approach is fairly straightforward, and we omit the
algebraic details. Similar procedures were used by Nei & Roychoudhury (1974) and Nei (1978)
to obtain some estimators with codominant markers.

In several places, we will require the sampling variance of a mean of parameter estimates.
To save on space, we will simply refer to the following equation, denoting the appropriate
terms to substitute for the sample size s, the individual estimates z(i), and the mean of
those estimates $\bar{z}$,
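Two ingredients of the setup above lend themselves to a short sketch: the expected genotype frequencies for a given null-allele frequency q and F_IS, and the sampling variance of a mean of s estimates. The second function assumes that eqn 1 has the usual form Var(z̄) = Σ[z(i) − z̄]² / [s(s − 1)]; that form, and all function names, are illustrative assumptions rather than the authors' notation.

```python
def genotype_frequencies(q, f_is=0.0):
    """Expected frequencies of MM, Mm and mm for null-allele frequency q.

    f_is is the departure from Hardy-Weinberg equilibrium (0 = random mating).
    """
    p = 1.0 - q
    p_MM = p * p * (1.0 - f_is) + p * f_is
    p_Mm = 2.0 * p * q * (1.0 - f_is)
    p_mm = q * q * (1.0 - f_is) + q * f_is
    return p_MM, p_Mm, p_mm


def observable_null_fraction(q, f_is=0.0):
    """x: expected fraction of individuals lacking the marker (mm only).

    MM and Mm both show the band, so x is all a dominant marker reveals.
    """
    return genotype_frequencies(q, f_is)[2]


def variance_of_mean(estimates):
    """Sampling variance of the mean of several estimates.

    ASSUMPTION: eqn 1 is taken to be sum (z(i) - z_bar)^2 / [s (s - 1)].
    """
    s = len(estimates)
    if s < 2:
        raise ValueError("need at least two estimates")
    z_bar = sum(estimates) / s
    return sum((z - z_bar) ** 2 for z in estimates) / (s * (s - 1))
```

With q = 0.3 and random mating, the three genotype frequencies are 0.49, 0.42 and 0.09, but only the last (x = 0.09) is directly observable on a gel.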
where $\hat{x}$ is the proportion of the N sampled individuals that do not exhibit the
marker, provides a downwardly biased estimate of q. A less biased (and asymptotically
unbiased) estimator is given by eqn 2a, where $\mathrm{Var}(\hat{x}) = \hat{x}(1 - \hat{x})/N$
is the sampling variance of the frequency of null homozygotes. The sampling variance of the
allele frequency is approximately

$$\mathrm{Var}(\hat{q}) = \frac{1 - \hat{x}}{4N}. \qquad (2b)$$

The square root of this quantity provides an estimate of the standard error of $\hat{q}$
and may be used as an approximate indicator of the accuracy of estimation. However, it
should be noted that the construction of

there is no way to evaluate whether $F_{IS} = 0$ with RAPD analysis of a single generation.
If, however, the study population, or a sample of it, can be mated randomly, then
Hardy-Weinberg equilibrium will be attained in the offspring generation, and disequilibrium
in the parent population will be revealed by a change in x across generations. This
approach to testing for Hardy-Weinberg equilibrium may be practically feasible in many
empirical settings. Since the RAPD technique is PCR-based, the offspring generation can be
assayed at a very early stage (e.g. seeds), prior to the operation of selection.

Letting $\hat{x}_o$ be the observed value of x in an offspring population resulting from
random mating, and assuming the genotypic frequencies are not influenced by selection, q
can be estimated by substituting $\hat{x}_o$ into eqn 2a. The deviation from Hardy-Weinberg
equilibrium can then be estimated with

[Figure 1: panels (a) and (b); horizontal axis: expected number of null homozygotes; curves
labelled q = 0.3 and q = 0.7.]

Fig. 1. (a) Expected estimates of the gene frequency, $\hat{q}$, obtained by eqn 2a
relative to the parametric value q, as a function of the expected number of null
homozygotes in the sample, $Nq^2$ (solid lines). Results obtained by numerical solution on
a computer are given for q = 0.3 and 0.7; results for q = 0.1 and 0.2 are essentially
identical to those for q = 0.3. Also plotted (dotted lines) are the expected values of
$\hat{x}^{1/2}/q$. (b) Similar results for the expected estimates of the within-population
gene diversity, $\hat{H}$, obtained by eqn 4a relative to the parametric value, 2q(1 - q).
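The gene-frequency estimator discussed above can be sketched in a few lines. The sketch assumes that eqn 2a takes the second-order Taylor form q̂ = x̂^{1/2} [1 − Var(x̂)/(8x̂²)]^{-1}, which is what expanding x̂^{1/2} about its expectation gives; Var(x̂) = x̂(1 − x̂)/N and eqn 2b are used as stated in the text. The function name and its arguments are hypothetical.

```python
import math


def estimate_null_allele_frequency(n_null, n_total):
    """Return (q_hat, var_q_hat) from counts of marker-less individuals.

    n_null  : number of sampled individuals that lack the marker band
    n_total : total number of sampled individuals, N
    ASSUMPTION: eqn 2a is q_hat = sqrt(x_hat) / (1 - Var(x_hat) / (8 x_hat^2)).
    """
    if n_total <= 0 or not 0 < n_null <= n_total:
        raise ValueError("need 0 < n_null <= n_total")
    x_hat = n_null / n_total
    var_x = x_hat * (1.0 - x_hat) / n_total      # binomial sampling variance
    correction = 1.0 - var_x / (8.0 * x_hat ** 2)  # small-sample bias term
    q_hat = math.sqrt(x_hat) / correction
    var_q = (1.0 - x_hat) / (4.0 * n_total)      # eqn 2b
    return q_hat, var_q


# Example: 9 of 100 individuals lack the band.
q_hat, var_q = estimate_null_allele_frequency(9, 100)
standard_error = math.sqrt(var_q)
```

In this example the naive estimate x̂^{1/2} is 0.3, and the corrected estimate is slightly larger, as expected for an estimator that compensates for downward bias.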
where $\hat{x}_p$ is the estimated value of x in the parental generation, and the term in
large parentheses again accounts for small sample size bias. The sampling variance of
$\hat{F}_{IS}$ at a particular locus is approximately

Weinberg equilibrium, and under the assumptions of our model, can be viewed as the
probability that a random pair of alleles will contain one marker and one null. An
asymptotically unbiased estimator of this quantity is

$$\hat{H}_j(i) = 2\hat{q}_j(i)\,[1 - \hat{q}_j(i)] + 2\,\mathrm{Var}[\hat{q}_j(i)], \qquad (4a)$$

and its sampling variance is approximately
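As a minimal sketch, the single-locus gene-diversity estimator of eqn 4a can be computed from an allele-frequency estimate and its sampling variance (for instance, Var(q̂) = (1 − x̂)/(4N) from eqn 2b). The function name is hypothetical, and the form of eqn 4a assumed here is 2q̂(1 − q̂) + 2Var(q̂), consistent with the parametric value 2q(1 − q) quoted in the caption of Fig. 1.

```python
def gene_diversity(q_hat, var_q_hat):
    """Single-locus gene diversity within a population (assumed eqn 4a).

    The naive plug-in 2 q_hat (1 - q_hat) underestimates 2 q (1 - q) by
    about 2 Var(q_hat), so that amount is added back.
    """
    return 2.0 * q_hat * (1.0 - q_hat) + 2.0 * var_q_hat


# Example: q_hat = 0.3 estimated from N = 100 individuals with x_hat = 0.09.
n_total = 100
x_hat = 0.09
var_q_hat = (1.0 - x_hat) / (4.0 * n_total)   # eqn 2b
h_hat = gene_diversity(0.3, var_q_hat)
```

The correction term is small for large samples: here it shifts the estimate from 0.42 to about 0.4246.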
(7)

Its sampling variance, $\mathrm{Var}(\hat{H}_w)$, is a function of the sampling of finite
numbers of populations, loci, and individuals, all three of which are accounted for when
eqn 1 is used with s = n, z(i) = $\hat{H}_j$, and $\bar{z} = \hat{H}_w$. The square root of
the resultant quantity provides an estimate of the standard error of $\hat{H}_w$. Following
the logic used above, the total sampling variance of $\hat{H}_w$ can be partitioned into
contributions from sampling individuals (I), loci (L), and populations (P),

All of the terms of this expression have been defined above except the last two, and these
are of the form

Such covariance exists because the same individuals from population j are used to estimate
$\hat{H}_j(i)$ and $\hat{H}_{jk}(i)$. Averaging over all loci, the estimated mean gene
diversity between populations j and k is

(13a)

(14a)
same phenotype, and

Estimates of relatedness based on single loci are extremely noisy. A more accurate estimate
is obtained by averaging over all L loci,

Such an analysis is only applied to polymorphic loci. When a large number of markers is
used, the sampling distribution of $\hat{r}_{ab}$ should be roughly normal. Its variance
can be estimated by substituting s = L, z(i) = $\hat{r}_{ab}(i)$, and
$\bar{z} = \hat{r}_{ab}$ in eqn 1.

The utility of RAPD markers for discriminating pairs of individuals related to different
degrees depends on the overlap between the distributions of similarity for different
degrees of relatedness. If there are L informative markers for a pair of individuals, the
mean similarity (over loci) for pairs of individuals with relatedness r should be
approximately binomially distributed with expectation

and sampling variance

These expressions can be used to explore the utility of band-sharing in the estimation of
relatedness.

The degree of overlap in the distributions of S associated with different r depends
strongly on the frequency of markers in the population (Fig. 2). If the markers are present
in high frequency (all x ≈ 0), they will provide little discrimination among relatives,
since nearly all members of the population will exhibit the marker at each locus. The use
of very low frequency markers does not improve the situation. Since the distribution of
similarity depends only on cl(i), which is symmetrical in x(i) and 1 - x(i), it takes on
the same properties for q²(i) = x(i) = 0.1 as for q²(i) = x(i) = 0.9. The best resolution
occurs when markers are at intermediate frequency, i.e. x(i) = 0.5. However, even in this
case, the degree of overlap in distributions of S is very broad, and based on the results
in the figure, it can be seen that the problem is not eliminated until very large numbers
of markers (well in excess of 50) are scored. We conclude that the utility of RAPDs for
relatedness estimation is rather limited.

Discussion

Because of the dominance property of RAPDs, gene frequency estimates for such loci are
necessarily less accurate than those obtained with codominant markers such as isozymes and
RFLPs. In the latter case, the sampling variance of the gene frequency is q(1 - q)/2N.
Comparing this with eqn 2b, the ratio of the sampling variance of gene frequency for RAPDs
relative to a comparable locus with codominance is (1 + q)/(2q), where q is the frequency
of the null allele. Comparing eqn 4b to the result of Nei & Roychoudhury (1974), this can
also be shown to be the ratio for the sampling variance of the

[Fig. 2; horizontal axis: Similarity, from 0.0 to 1.0.] Variation among loci in the
frequency of the marker allele will increase the variance beyond that illustrated.
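The variance ratio quoted in the Discussion, (1 + q)/(2q), can be checked numerically against the two variance expressions it is derived from: (1 − q²)/(4N) for a dominant marker (eqn 2b with x = q²) and q(1 − q)/(2N) for a codominant one. The sketch below also evaluates the ratio at the pruning threshold x = 3/N used in this paper, i.e. at q = (3/N)^{1/2}. Function names are hypothetical.

```python
import math


def variance_ratio(q):
    """Sampling variance of q_hat, dominant relative to codominant marker."""
    return (1.0 + q) / (2.0 * q)


def variance_ratio_from_formulas(q, n_total):
    """Same ratio, computed directly from the two variance expressions."""
    var_dominant = (1.0 - q * q) / (4.0 * n_total)    # eqn 2b with x = q^2
    var_codominant = q * (1.0 - q) / (2.0 * n_total)  # codominant marker
    return var_dominant / var_codominant


def critical_ratio(n_total):
    """Ratio at the smallest acceptable null frequency, x = 3/N."""
    q_min = math.sqrt(3.0 / n_total)
    return variance_ratio(q_min)


ratios = {n: round(critical_ratio(n), 1) for n in (10, 100, 1000)}
```

For N = 10, 100 and 1000 this evaluates to roughly 1.4, 3.4 and 9.6, which is the sense in which on the order of 2 to 10 times more individuals are needed per dominant locus.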
within-population gene diversity using both types of markers. This ratio can be viewed as
the proportional increase in sampling effort that needs to be applied to a RAPD locus in
order to achieve a genetic-parameter estimate as accurate as would be acquired with a locus
with codominant markers with the same frequency.

Recall from above that, to avoid biased parameter estimates, RAPDs for which x < 3/N should
be avoided in population-genetic analysis. Assuming this is done, then the critical value
of the above ratio is $(1 + \sqrt{3/N})/(2\sqrt{3/N})$, which takes on values of 1.4 for a
sample size of N = 10, 3.4 for N = 100, and 9.6 for N = 1000. Thus, for acceptable loci
alone, on the order of 2 to 10 times more individuals should be sampled per RAPD locus than
per RFLP or isozyme locus.

The pruning of loci for which the null phenotype frequency is below 3/N also implies that
more loci need to be sampled in RAPD analysis than in a survey involving more conventional
codominant markers. The number of loci that are rejected using the 3/N criterion will
necessarily depend on the allele-frequency distribution, but it is clear that this number
must increase with decreasing N.

The pruning of loci raises an additional issue. Although our procedure ensures that the
estimates of population parameters will be essentially unbiased for the pool of acceptable
loci, it also tends to reject loci that are highly homozygous. Moreover, the loci that are
rejected may differ from population to population. Consequently, estimates of genome-wide
parameters, such as average within- and between-population gene diversities, derived via
RAPD analysis are likely to be biased in complicated ways whether or not the pruning
procedure is applied. In either case, the obvious way to minimize the problem of bias is to
sample large numbers of individuals per population (ideally at least 100), since this
reduces the fraction of loci that will yield biased parameter estimates. In addition, it
seems quite advisable in comparative studies to equalize the sample sizes in different
populations to minimize the possibility that observed differences between populations are
simply artifacts of bias resulting from sample size differences.

Certainly, if even with large N, 10% or more of the sampled loci still fail to meet the 3/N
criterion, then any population-parameter estimates derived from the data should be
interpreted with an appropriate degree of caution. If it is deemed desirable not to prune
loci from the final analysis, then the best that can be said is that the bias of a
parameter estimate averaged over all loci cannot exceed the worst locus-specific bias.

A critical assumption of the preceding theory is that there are at most two alleles per
locus (one marker and one null). If that is not the case, the approach we have employed
will lead to an overestimate of the frequency of null alleles, and an underestimate of the
within-population gene diversity. The bias resulting from violations of this assumption
depends on the frequency distribution of alleles, but a qualitative assessment can be made
for the simple situation in which n markers at a locus all have equal frequencies,
(1 - q)/n. If the investigator focuses on a single marker and views the heterozygosity as
$2x^{1/2}(1 - x^{1/2})$, where x is the frequency of individuals not exhibiting that
marker, the quantity $2(1 - q)(n - 1 + q)/n^2$ would actually be estimated. The ratio of
this quantity to the actual heterozygosity, $1 - q^2 - n[(1 - q)/n]^2$, is

$$\frac{2(1 - q)(n - 1 + q)/n^2}{1 - q^2 - n[(1 - q)/n]^2} = \frac{2(n - 1 + q)}{n[(n - 1) + (n + 1)q]},$$

which shows that in this extreme case the estimated heterozygosity would be on the order of
(2/n)th or less than the true value.

A simple way to avoid the problem of multiple allelic markers is to focus on markers that
are common. If the sum of the frequencies of individuals in which two markers are uniquely
found exceeds one, then the markers cannot be allelic. So if this criterion is applied to
all pairs of markers, the multiple-marker problem can be eliminated. This, of course, does
not remove the issue of multiple unamplifiable (null) alleles, a problem that seems much
more difficult to resolve. Moreover, by eliminating loci with high allele frequencies, this
procedure raises the additional issues of bias associated with pruning, already discussed
above.

Finally, we note that our emphasis has been on the development of unbiased parameter
estimators. Such estimators do not always minimize the mean squared deviations between
parameters and their estimates. Alternative procedures, such as maximum likelihood, may
provide minimum-variance estimators, although such estimators are not necessarily unbiased.

Acknowledgements

This work has been supported by NSF grants BSR 89-11038 and BSR 90-24977, and PHS grant
GM36827-01 to ML, and by NSF grant BSR 89-06129 to BGM. We are especially grateful to
K. Ritland for pointing out some fundamental problems in an earlier draft of this paper,
and to E. Martins for helpful comments.

References

Caetano-Anollés G, Bassam BJ, Gresshoff PM (1991) DNA amplification fingerprinting using very short arbitrary oligonucleotide primers. Bio/Technology, 9, 553-557.
Clark AG, Lanigan CMS (1993) Prospects for estimating nucleotide divergence with RAPDs. Molecular Biology and Evolution, 10, 1096-1111.
Crowley PH (1992) Resampling methods for computation-intensive data analysis in ecology and evolution. Annual Review of Ecology and Systematics, 23, 405-447.
Deragon JM, Landry BS (1992) RAPD and other PCR-based analyses of plant genomes using DNA extracted from small leaf disks. PCR Methods and Applications, 1, 175-180.
Hadrys H, Balick M, Schierwater B (1992) Applications of random amplified polymorphic DNA (RAPD) in molecular ecology. Molecular Ecology, 1, 55-64.
Lynch M (1988) Estimation of relatedness by DNA fingerprinting. Molecular Biology and Evolution, 5, 584-599.
Lynch M (1990) The similarity index and DNA fingerprinting. Molecular Biology and Evolution, 7, 478-484.
Nei M (1972) Genetic distance between populations. American Naturalist, 106, 283-292.
Nei M (1978) Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics, 89, 583-590.
Nei M, Roychoudhury AK (1974) Sampling variances of heterozygosity and genetic distance. Genetics, 76, 379-390.
Stephens JC, Gilbert DA, Yuhki N, O'Brien SJ (1992) Estimation of heterozygosity for single-probe multilocus DNA fingerprints. Molecular Biology and Evolution, 9, 729-743.
Welsh J, McClelland M (1990) Fingerprinting genomes using PCR with arbitrary primers. Nucleic Acids Research, 18, 7213-7218.
Williams JGK, Kubelik AR, Livak KJ, Rafalski JA, Tingey SV (1990) DNA polymorphisms amplified by arbitrary primers are useful as genetic markers. Nucleic Acids Research, 18, 6531-6535.
Wright S (1951) The genetical structure of populations. Annals of Eugenics, 15, 323-354.

Much of the author's research is concerned with the ecological and evolutionary
determinants of population-genetic structure, and with the use of molecular markers for
quantifying such structure.