
Molecular Ecology (1994) 3, 91-99

Analysis of population genetic structure with RAPD markers
M. LYNCH and B. G. MILLIGAN*
Department of Biology, University of Oregon, Eugene, OR 97403 and *Department of Botany, University of Texas, Austin, TX
78773, USA

Abstract
Recent advances in the application of the polymerase chain reaction make it possible to
score individuals at a large number of loci. The RAPD (random amplified polymorphic
DNA) method is one such technique that has attracted widespread interest. The analysis
of population structure with RAPD data is hampered by the lack of complete genotypic
information resulting from dominance, since this enhances the sampling variance
associated with single loci as well as induces bias in parameter estimation. We present
estimators for several population-genetic parameters (gene and genotype frequencies,
within- and between-population heterozygosities, degree of inbreeding and population
subdivision, and degree of individual relatedness) along with expressions for their
sampling variances. Although completely unbiased estimators do not appear to be
possible with RAPDs, several steps are suggested that will ensure that the bias in
parameter estimates is negligible. To achieve the same degree of statistical power, on the
order of 2 to 10 times more individuals need to be sampled per locus when dominant
markers are relied upon, as compared to codominant (RFLP, isozyme) markers. Moreover,
to avoid bias in parameter estimation, the marker alleles for most of these loci should be
in relatively low frequency. Due to the need for pruning loci with low-frequency null
alleles, more loci also need to be sampled with RAPDs than with more conventional
markers, and some problems of bias cannot be completely eliminated.
Keywords: gene diversity, inbreeding, molecular markers, structure, RAPDs, relatedness
Received 29 March 1993; revision accepted 15 July 1993

Correspondence: Michael Lynch +1-503-346-2364

Introduction

Estimates of genetic variation increasingly are being based upon information at the DNA level.
Whereas some DNA markers, such as restriction fragment length polymorphisms, can be analysed
with traditional methods developed for isozymes, others, such as DNA fingerprints, need to be
analysed in new ways (Lynch 1988, 1990; Stephens et al. 1992). A recent method that is being
assimilated rapidly by empiricists is the RAPD (random amplified polymorphic DNA) technique
(Welsh & McClelland 1990; Williams et al. 1990; Caetano-Anollés et al. 1991; Deragon & Landry
1992; Hadrys et al. 1992).

RAPDs are generated by applying the polymerase chain reaction (PCR) to genomic DNA samples,
using randomly constructed oligonucleotides as primers. Since the technique is relatively easy
to apply to a wide array of plant and animal taxa, and the number of loci that can be examined
is essentially unlimited, RAPDs are viewed as having several advantages over RFLPs and DNA
fingerprints. When the primers are of intermediate size (on the order of 10 base pairs),
multiple amplifiable fragments (from different loci) are usually present for each set of
primers in each genome. The fragments can be separated by size on a standard agarose gel and
visualized by ethidium bromide staining, eliminating the need for radiolabeled probes. Since
the primers consist of random sequences, and do not discriminate between coding and noncoding
regions, it is reasonable to expect the technique to sample the genome more randomly than
conventional methods.

Despite these advantages, RAPD analysis presents some practical problems. As in the case of
DNA fingerprinting where multiple markers appear on the same gel, there can be uncertainty in
assigning markers to
specific loci in the absence of a preliminary pedigree analysis. In addition, the possibility
that products of different loci will have similar molecular weights, and therefore be
indistinguishable on a gel (because of comigration), is of concern. In some cases, these
problems can be overcome by practical means (Hadrys et al. 1992). However, a third issue seems
to be unavoidable: the dominance of RAPD markers. If one of the "alleles" at a RAPD site is
unamplifiable, then marker/marker homozygotes cannot be distinguished from marker/null
heterozygotes. Provided there is only a single amplifiable allele per locus, this does not
prevent the estimation of allele frequencies necessary for population-genetic analysis, but it
does reduce the accuracy of such estimation relative to analysis with codominant markers.

Our purpose is to develop the general theory needed for the analysis of population structure
with dominant markers such as RAPDs, and to lay out formally some of the difficulties that
such markers present. Formulae are presented for the estimation of conventional measures of
population structure. All of the estimators take into account bias due to small sample size
and are accompanied by approximate expressions for their sampling variance.

Assumptions

In order to focus on the salient statistical issues, we start with several assumptions. First,
we make the generous assumption that the interpretation of banding patterns on gels can be
accomplished in a completely unambiguous manner. That is, we assume that marker alleles from
different loci do not comigrate to the same position on a gel, and that the investigator is
fully capable of matching bands from different lanes within and among gels. Secondly, we
assume that each locus can be treated as a two-allele system, with only one of the alleles per
locus being amplifiable by the PCR. The "null" allele may fail to amplify either because of
loss of a primer site or because an insertion has caused the distance between primer sites to
exceed the capacity of the PCR; the specific mechanism is not relevant to the theory that
follows. Without extensive segregation analysis, it is difficult to validate the two-allele
assumption, although a possible way to resolve the issue is presented in the discussion. Most
current users of RAPDs seem to be of the opinion that the existence of multiple amplifiable
alleles at a locus is relatively rare. It is important to realize, however, that this opinion
is usually based on rather circumstantial evidence, and even if true, it does not rule out the
possibility that the pool of unamplifiable alleles is heterogeneous. We will return to the
potential importance of this issue later.

We will restrict our attention to sexual diploid populations. Letting M and m denote the
marker and null alleles at a locus, and $p$ and $q = 1 - p$ their frequencies, the expected
genotype frequencies in a segregating population are $P_{MM} = p^2(1 - F_{IS}) + pF_{IS}$,
$P_{Mm} = 2pq(1 - F_{IS})$, and $P_{mm} = q^2(1 - F_{IS}) + qF_{IS}$, where $F_{IS}$ measures
the departure from Hardy-Weinberg equilibrium. Under random mating, $F_{IS} = 0$, and the
expected genotype frequencies follow the Hardy-Weinberg proportions. Under complete
inbreeding, $F_{IS} = 1$, there are no heterozygotes, and the frequencies of the MM and mm
homozygotes are $p$ and $q$, respectively. $F_{IS} > 0$ is usually interpreted as a measure of
the degree of inbreeding under the assumption that the markers are selectively neutral and in
gametic-phase equilibrium with any selected loci. However, $F_{IS}$ can take on negative
values when the frequency of heterozygotes exceeds the Hardy-Weinberg expectation.

Because the MM and Mm genotypes cannot be discriminated from each other by RAPD analysis, the
marker allele is "dominant" to the null allele on a gel. (In principle, such discrimination
might be possible through the detection of staining intensity differences, but this degree of
resolution would be very difficult to attain without carefully controlled assays.) Thus, the
only observable quantity for a locus is the fraction of individuals in the population with
($1 - x$) and without ($x$) the marker. Our goal is to demonstrate how observed marker
frequencies can be used to estimate conventional measures of population-genetic structure.

Throughout the paper, we will discriminate estimates from true population parameters by
placing hats (^) on the former. All of the estimators, as well as approximate expressions for
their sampling variances, were obtained by use of second-order Taylor expansions. Although
tedious at times, this approach is fairly straightforward, and we omit the algebraic details.
Similar procedures were used by Nei & Roychoudhury (1974) and Nei (1978) to obtain some
estimators with codominant markers.

In several places, we will require the sampling variance of a mean of parameter estimates. To
save on space, we will simply refer to the following equation, denoting the appropriate terms
to substitute for the sample size $s$, the individual estimates $z(i)$, and the mean of those
estimates $\bar{z}$,

$$\mathrm{Var}(\bar{z}) = \frac{\sum_{i=1}^{s} [z(i) - \bar{z}]^2}{s(s - 1)}. \tag{1}$$

Estimation of gene and genotype frequencies

If the population is in Hardy-Weinberg equilibrium, $x = q^2$, and $x^{1/2}$ is the null
allele frequency. However, $\hat{x}^{1/2}$,
where $\hat{x}$ is the proportion of the $N$ sampled individuals that do not exhibit the
marker, provides a downwardly biased estimate of $q$. A less biased (and asymptotically
unbiased) estimator is

$$\hat{q} = \hat{x}^{1/2}\left(1 - \frac{\mathrm{Var}(\hat{x})}{8\hat{x}^2}\right)^{-1}, \tag{2a}$$

where $\mathrm{Var}(\hat{x}) = \hat{x}(1 - \hat{x})/N$ is the sampling variance of the
frequency of null homozygotes. The sampling variance of the allele frequency is approximately

$$\mathrm{Var}(\hat{q}) = \frac{1 - \hat{x}}{4N}. \tag{2b}$$

The square root of this quantity provides an estimate of the standard error of $\hat{q}$ and
may be used as an approximate indicator of the accuracy of estimation. However, it should be
noted that the construction of confidence limits for $q$ based on this variance estimator and
on the assumption of normality may be inappropriate when the sample size is small. Resampling
procedures (Crowley 1992) provide a more general approach. This caveat applies to confidence
limit construction for all of the other estimators discussed below.

The term in parentheses in eqn 2a accounts for much, but not all, of the bias in the
estimation of $q$ due to small sample size and can be substantial when the null allele is rare
($x < 0.1$). If the expected number of null homozygotes in the sample ($Nq^2$) is less than
two, even eqn 2a yields an estimate of $q$ that can be downwardly biased by 5% or more
(Fig. 1). Addition of higher-order terms to the Taylor expansion does not improve the
situation. However, eqn 2a always yields a better estimate of $q$ than $\hat{x}^{1/2}$, and if
$Nq^2 > 3$, it yields an estimate that is essentially unbiased (Fig. 1). The clear implication
of this result is that unbiased estimates of population-genetic parameters can be achieved
with RAPDs provided the analysis is restricted to markers that are not too common.
Specifically, we recommend the restriction of analyses to bands whose observed frequency is
less than $1 - (3/N)$, i.e. markers whose incidence is less than 0.70 for $N = 10$, 0.94 for
$N = 50$, etc.

The situation is more complicated if the population is locally inbred, for then
$x = q^2(1 - F_{IS}) + qF_{IS}$. Contrary to the usual situation when all three genotypes are
observable, there is no way to evaluate whether $F_{IS} = 0$ with RAPD analysis of a single
generation. If, however, the study population, or a sample of it, can be mated randomly, then
Hardy-Weinberg equilibrium will be attained in the offspring generation, and disequilibrium in
the parent population will be revealed by a change in $x$ across generations. This approach to
testing for Hardy-Weinberg equilibrium may be practically feasible in many empirical settings.
Since the RAPD technique is PCR-based, the offspring generation can be assayed at a very early
stage (e.g. seeds), prior to the operation of selection.

Letting $\hat{x}_o$ be the observed value of $x$ in an offspring population resulting from
random mating, and assuming the genotypic frequencies are not influenced by selection, $q$ can
be estimated by substituting $\hat{x}_o$ into eqn 2a. The deviation from Hardy-Weinberg
equilibrium can then be estimated with

[Figure 1. Panels: (a) Estimates of q; (b) Estimates of H. Horizontal axis: expected number of
null homozygotes (0 to 10); curves shown for q = 0.3 and q = 0.7.]

Fig. 1. (a) Expected estimates of the gene frequency, $\hat{q}$, obtained by eqn 2a relative
to the parametric value $q$, as a function of the expected number of null homozygotes in the
sample, $Nq^2$ (solid lines). Results obtained by numerical solution on a computer are given
for $q = 0.3$ and 0.7; results for $q = 0.1$ and 0.2 are essentially identical to those for
$q = 0.3$. Also plotted (dotted lines) are the expected values of $\hat{x}^{1/2}/q$.
(b) Similar results for the expected estimates of the within-population gene diversity,
$\hat{H}$, obtained by eqn 4a relative to the parametric value, $2q(1 - q)$.
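The single-locus calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes eqn 2a to have the second-order Taylor form $\hat{q} = \hat{x}^{1/2}[1 - \mathrm{Var}(\hat{x})/(8\hat{x}^2)]^{-1}$ with $\mathrm{Var}(\hat{x}) = \hat{x}(1 - \hat{x})/N$, uses $(1 - \hat{x})/(4N)$ for the sampling variance of $\hat{q}$, and assumes Hardy-Weinberg proportions. The function names are hypothetical and not part of the original analysis.

```python
import math


def null_allele_frequency(n_null: int, n_total: int) -> tuple[float, float]:
    """Return (q_hat, var_q_hat) for one dominant-marker locus.

    n_null  -- number of sampled individuals lacking the band
    n_total -- number of individuals scored (N)
    Assumes Hardy-Weinberg proportions, so that x = q**2.
    """
    if n_total <= 0 or not 0 < n_null <= n_total:
        raise ValueError("need 0 < n_null <= n_total")
    x_hat = n_null / n_total
    # Binomial sampling variance of the null-homozygote frequency.
    var_x = x_hat * (1.0 - x_hat) / n_total
    # Assumed form of eqn 2a: square root of x_hat with a small-sample correction.
    q_hat = math.sqrt(x_hat) / (1.0 - var_x / (8.0 * x_hat ** 2))
    # Approximate sampling variance of q_hat (eqn 2b).
    var_q = (1.0 - x_hat) / (4.0 * n_total)
    return q_hat, var_q


def passes_pruning_rule(n_null: int, n_total: int) -> bool:
    """Apply the recommended restriction: band frequency below 1 - 3/N.

    Band frequency is (n_total - n_null) / n_total, so the condition is
    equivalent to observing more than three null homozygotes.
    """
    return n_null > 3


if __name__ == "__main__":
    q_hat, var_q = null_allele_frequency(n_null=25, n_total=100)
    print(f"q_hat = {q_hat:.4f}, SE = {math.sqrt(var_q):.4f}")
    print("locus retained:", passes_pruning_rule(25, 100))
```

Working with counts rather than frequencies in the pruning check avoids floating-point comparisons at the boundary; whether the boundary case of exactly three null homozygotes should be retained is a judgment call that the text does not settle.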
where $\hat{x}_p$ is the estimated value of $x$ in the parental generation, and the term in
large parentheses again accounts for small sample size bias. The sampling variance of
$\hat{F}_{IS}$ at a particular locus is approximately

A precise statistical test of the deviation of the parental generation from the Hardy-Weinberg
expectation can be achieved via a $\chi^2$ or log-likelihood ratio test by evaluating the
observed parental frequencies, $\hat{x}_p$ and $(1 - \hat{x}_p)$, against their expectations,
$\hat{x}_o$ and $(1 - \hat{x}_o)$, with one degree of freedom. If there is no compelling
evidence that $F_{IS} \neq 0$, then it is reasonable to pool the estimates of $q$ from both
generations. In the latter case, the sampling variance of the pooled estimate of $q$ is
$[\mathrm{Var}(\hat{q}_o) + \mathrm{Var}(\hat{q}_p)]/4$, provided the sampled members of the
two generations are not related.

A mean estimate of $F_{IS}$ over all sampled loci, $\bar{F}_{IS}$, can be obtained by
averaging the $\hat{F}_{IS}(i)$ acquired for each of the $i = 1, \ldots, L$ markers. The
sampling variance of this mean estimate, $\mathrm{Var}(\bar{F}_{IS})$, is obtained by letting
$s = L$, $z(i) = \hat{F}_{IS}(i)$, and $\bar{z} = \bar{F}_{IS}$ in eqn 1, and this accounts
for sampling of both individuals and loci. Provided the number of loci sampled is fairly
large, the mean estimate $\bar{F}_{IS}$ should be roughly normally distributed around its
expectation with a standard error equal to $[\mathrm{Var}(\bar{F}_{IS})]^{1/2}$. Similar logic
can be used to obtain a mean (and sampling variance) of the $F_{IS}$ estimates acquired for
multiple populations.

In the following, we assume that the population has in some way been analyzed at a time when
the genotype frequencies are in Hardy-Weinberg equilibrium, since no further progress with
RAPD analysis is possible if that is not the case. Our assumption that each locus is biallelic
implies that every observable band in the population sample represents a unique locus. In the
following, $q(i)$ represents the frequency of the null allele at the locus involving the $i$th
marker.

Gene diversity within populations

With the estimates of allele frequencies in hand, it is possible to estimate the gene
diversity within a population. (We use the term gene diversity in an informal sense here since
it is likely that RAPDs will often contain only noncoding DNA.) We employ the conventional
measure of gene diversity, $H_j(i) = 2q_j(i)[1 - q_j(i)]$, which is the probability that two
genes, randomly drawn from population $j$, differ at the $i$th locus. This measure is
equivalent to the expected heterozygosity under Hardy-Weinberg equilibrium, and under the
assumptions of our model, can be viewed as the probability that a random pair of alleles will
contain one marker and one null. An asymptotically unbiased estimator of this quantity is

$$\hat{H}_j(i) = 2\hat{q}_j(i)[1 - \hat{q}_j(i)] + 2\,\mathrm{Var}[\hat{q}_j(i)], \tag{4a}$$

and its sampling variance is approximately

$$\mathrm{Var}[\hat{H}_j(i)] = 4[1 - 2\hat{q}_j(i)]^2\,\mathrm{Var}[\hat{q}_j(i)]. \tag{4b}$$

The results shown in Fig. 1 indicate that, as in the case of our estimator for gene frequency,
eqn 4a yields an essentially unbiased estimate of gene diversity provided $Nq^2 > 3$. The
figure also shows that eqn 4a provides a much better estimate of $H_j(i)$ than the first-order
estimate $2\hat{q}_j(i)[1 - \hat{q}_j(i)]$.

Averaging over all $L$ observed loci, the mean observed gene diversity in the $j$th population
is

$$\hat{H}_j = \frac{1}{L}\sum_{i=1}^{L}\hat{H}_j(i). \tag{5}$$

The sampling variance of $\hat{H}_j$ is a function of the variance due to the sampling of a
finite number of individuals at each locus, described by eqn 4b, and of the variance in gene
diversity among loci. Both sources of variance contribute to the variance among the $L$
observed $\hat{H}_j(i)$ within the population. That is, the total sampling variance of the
gene diversity within a population, $\mathrm{Var}(\hat{H}_j)$, can be obtained by applying
eqn 1 with $s = L$, $z(i) = \hat{H}_j(i)$ and $\bar{z} = \hat{H}_j$. The square root of this
quantity provides an estimate of the standard error of $\hat{H}_j$.

Partitioning the total sampling variance of $\hat{H}_j$ into its two sources can provide
insight into preferred experimental designs for estimating the gene diversity (e.g. Lynch &
Crease 1990). The expected contribution to the sampling variance of the average gene diversity
owing to sampling a finite number of individuals per locus is

$$\mathrm{Var}_I(\hat{H}_j) = \frac{1}{L^2}\sum_{i=1}^{L}\mathrm{Var}[\hat{H}_j(i)], \tag{6a}$$

where the $\mathrm{Var}[\hat{H}_j(i)]$ have been defined in eqn 4b. The sampling variance of
$\hat{H}_j$ due to variation in gene diversity among loci is then

$$\mathrm{Var}_L(\hat{H}_j) = \mathrm{Var}(\hat{H}_j) - \mathrm{Var}_I(\hat{H}_j). \tag{6b}$$

The within-locus sampling variance can be reduced to zero by increasing the number of
individuals sampled, but if there are true interlocus differences in gene diversity, as will
almost always be the case, any further reduction in the variance of $\hat{H}_j$ would require
an increase in the number of loci scored.

Finally, if $n$ populations have been sampled, the mean within-population gene diversity can
be estimated by

$$\hat{H}_W = \frac{1}{n}\sum_{j=1}^{n}\hat{H}_j. \tag{7}$$
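The within-population calculations above can be illustrated with a short Python sketch. It rests on several assumptions that should be checked against the equations: eqn 4a is taken as $2\hat{q}(1 - \hat{q}) + 2\mathrm{Var}(\hat{q})$, eqn 4b as $4(1 - 2\hat{q})^2\mathrm{Var}(\hat{q})$, eqn 1 as the usual sampling variance of a mean, $\sum[z(i) - \bar{z}]^2/[s(s - 1)]$, and the individual-sampling component (eqn 6a) as the sum of the per-locus variances divided by $L^2$. All function names are hypothetical.

```python
def locus_diversity(q_hat: float, var_q: float) -> tuple[float, float]:
    """Gene diversity at one locus and its approximate sampling variance."""
    h = 2.0 * q_hat * (1.0 - q_hat) + 2.0 * var_q       # assumed eqn 4a
    var_h = 4.0 * (1.0 - 2.0 * q_hat) ** 2 * var_q      # assumed eqn 4b
    return h, var_h


def variance_of_mean(values: list[float]) -> float:
    """Sampling variance of the mean of s estimates (assumed form of eqn 1)."""
    s = len(values)
    if s < 2:
        raise ValueError("at least two estimates are required")
    mean = sum(values) / s
    return sum((z - mean) ** 2 for z in values) / (s * (s - 1))


def population_diversity(loci: list[tuple[float, float]]) -> dict[str, float]:
    """Mean gene diversity over loci and a partition of its sampling variance.

    loci -- one (q_hat, var_q_hat) pair per retained locus.
    """
    per_locus = [locus_diversity(q, v) for q, v in loci]
    h_values = [h for h, _ in per_locus]
    n_loci = len(h_values)
    h_mean = sum(h_values) / n_loci                      # eqn 5
    var_total = variance_of_mean(h_values)               # eqn 1 with s = L
    var_ind = sum(v for _, v in per_locus) / n_loci ** 2 # assumed eqn 6a
    var_loci = var_total - var_ind                       # eqn 6b
    return {
        "H": h_mean,
        "var_total": var_total,
        "var_individuals": var_ind,
        "var_loci": var_loci,
    }


if __name__ == "__main__":
    # Two hypothetical loci given as (q_hat, Var(q_hat)).
    print(population_diversity([(0.5, 0.001), (0.3, 0.002)]))
```

With few loci, the among-locus component obtained by subtraction can come out negative purely by chance; the sketch returns the raw difference and leaves any truncation at zero to the user.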
Its sampling variance, $\mathrm{Var}(\hat{H}_W)$, is a function of the sampling of finite
numbers of populations, loci, and individuals, all three of which are accounted for when eqn 1
is used with $s = n$, $z(i) = \hat{H}_j$, and $\bar{z} = \hat{H}_W$. The square root of the
resultant quantity provides an estimate of the standard error of $\hat{H}_W$. Following the
logic used above, the total sampling variance of $\hat{H}_W$ can be partitioned into
contributions from sampling individuals (I), loci (L), and populations (P),

Population subdivision

When data are available from more than one population, it is usually of interest to evaluate
the degree to which the total gene diversity partitions into its within- and
between-population components. The heterozygosity between populations $j$ and $k$ at the $i$th
locus, $H_{jk}(i) = q_j(i)[1 - q_k(i)] + q_k(i)[1 - q_j(i)]$, is the probability that a gene
randomly drawn from population $j$ differs from one randomly drawn from population $k$. An
asymptotically unbiased estimator of this quantity is

and its variance due to sampling a finite number of individuals is approximately

In the absence of population subdivision, the gene frequencies in all populations are the
same, so $H_{jk}(i) = H_j(i) = H_k(i)$. Thus, a more meaningful definition of the
between-population gene diversity is the heterozygosity in excess of that observed within
populations,

$$\hat{H}'_{jk}(i) = \hat{H}_{jk}(i) - \frac{\hat{H}_j(i) + \hat{H}_k(i)}{2},$$

the sampling variance of which is

All of the terms of this expression have been defined above except the last two, and these are
of the form

Such covariance exists because the same individuals from population $j$ are used to estimate
$\hat{H}_{jk}(i)$ and $\hat{H}_j(i)$.

Averaging over all loci, the estimated mean gene diversity between populations $j$ and $k$ is

and its variance, $\mathrm{Var}(\hat{H}_{jk})$, is obtained by letting $s = L$,
$z(i) = \hat{H}_{jk}(i)$, and $\bar{z} = \hat{H}_{jk}$ in eqn 1. The variance of
$\hat{H}'_{jk}$ can be obtained in a parallel manner.

The mean between-population gene diversity is obtained by averaging over all distinct pairs of
populations,

$$\hat{H}_B = \frac{2}{n(n - 1)}\sum_{j < k}\hat{H}'_{jk}. \tag{13a}$$

The sampling variance of $\hat{H}_B$ is given by

(13b)

This expression accounts for individual, locus, and population sampling. It is somewhat more
complicated than the variance formulations given above because not all of the pairwise
population comparisons contributing to $\hat{H}_B$ are independent. Gene diversities between
overlapping pairs of populations (e.g. $\hat{H}'_{12}$ and $\hat{H}'_{13}$) are likely to be
correlated positively. The term $V_B$ represents the variance among diversity measures
involving nonoverlapping population pairs (e.g. $\hat{H}'_{12}$, $\hat{H}'_{34}$,
$\hat{H}'_{56}$, $\hat{H}'_{78}$, etc.). The term $C_B$ represents the covariance among
diversity measures involving overlapping population pairs (e.g. pairs
$[\hat{H}'_{12}, \hat{H}'_{13}]$, $[\hat{H}'_{12}, \hat{H}'_{14}]$, ..., etc.). Both $V_B$ and
$C_B$ are obtained by applying the standard definitions of a variance and covariance to the
$\hat{H}'_{jk}$.

Wright's (1951) measure of population subdivision, $F_{ST} = H_B/H_T$, where
$H_T = H_B + H_W$, takes on extreme values of 0 when all populations have identical gene
frequencies and 1 when all populations are completely homozygous for alternate alleles. An
asymptotically unbiased estimate of $F_{ST}$ can be obtained by using

(14a)
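To make the partition of gene diversity concrete, the following Python sketch computes $H_W$, $H_B$, and $F_{ST} = H_B/(H_B + H_W)$ from a table of null-allele frequencies. It is a plug-in illustration only: the frequencies are treated as known, none of the small-sample corrections or sampling variances discussed in the text are applied, and the excess between-population diversity is assumed to be $H_{jk} - (H_j + H_k)/2$ averaged over all distinct population pairs. The function name and data layout are hypothetical.

```python
from itertools import combinations


def fst_from_null_frequencies(q: list[list[float]]) -> tuple[float, float, float]:
    """Return (F_ST, H_B, H_W) from null-allele frequencies.

    q[j][i] is the null-allele frequency in population j at locus i.
    Plug-in calculation: no bias corrections are applied.
    """
    n_pops = len(q)
    if n_pops < 2:
        raise ValueError("at least two populations are required")
    n_loci = len(q[0])
    if n_loci == 0 or any(len(pop) != n_loci for pop in q):
        raise ValueError("every population needs the same, nonzero number of loci")

    # Within-population diversity H_j = mean over loci of 2 q (1 - q).
    h_within = [sum(2.0 * qi * (1.0 - qi) for qi in pop) / n_loci for pop in q]
    h_w = sum(h_within) / n_pops

    # Between-population diversity in excess of the within-population average.
    excess = []
    for j, k in combinations(range(n_pops), 2):
        h_jk = sum(
            q[j][i] * (1.0 - q[k][i]) + q[k][i] * (1.0 - q[j][i])
            for i in range(n_loci)
        ) / n_loci
        excess.append(h_jk - (h_within[j] + h_within[k]) / 2.0)
    h_b = sum(excess) / len(excess)

    h_t = h_b + h_w
    if h_t == 0.0:
        raise ValueError("no variation within or between populations")
    return h_b / h_t, h_b, h_w


if __name__ == "__main__":
    # Two hypothetical populations scored at three loci.
    fst, h_b, h_w = fst_from_null_frequencies([[0.2, 0.5, 0.7], [0.8, 0.5, 0.6]])
    print(f"F_ST = {fst:.3f}, H_B = {h_b:.3f}, H_W = {h_w:.3f}")
```

The two limiting cases named in the text are reproduced by this sketch: identical frequencies in all populations give a value of 0, and populations fixed for alternate alleles give a value of 1.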
All of the elements of this equation have been defined above except the sampling covariance of
the within- and between-population estimates of gene diversity,

Such covariance exists because the same populations, loci, and individuals are used to
estimate $\hat{H}_B$ and $\hat{H}_W$. The sampling variance of $\hat{F}_{ST}$ is approximately

Genetic distance

Nei (1972) introduced a measure of genetic distance between populations, $D_{jk}$, that
provides an estimate of the mean number of mutations separating the genes from two
populations. Nei's genetic distance accounts for multiple mutations per locus, and in the
ideal case of isolated populations and a constant rate of substitution, is proportional to the
mean coalescence time for genes in different populations. Letting $J_j = 1 - H_j$ be the gene
identity within population $j$, and $J_{jk} = 1 - H_{jk}$ be the gene identity between
populations $j$ and $k$, both averaged over all loci, then
$D_{jk} = -\ln(J_{jk}/\sqrt{J_j J_k})$ is Nei's genetic distance. An unbiased estimator of
$D_{jk}$ is given by

(15a)

and its sampling variance is approximately

All of the elements of this formula have been defined above except

An alternative procedure for estimating nucleotide divergence with RAPDs has been introduced
by Clark & Lanigan (1993). They assume that the time since separation is sufficiently small
that there is a negligible probability of multiple mutational changes per locus.

Relatedness

Since related individuals are expected to have more similar genotypes than nonrelatives, it
stands to reason that the fraction of loci for which two individuals have identical phenotypes
must increase with the degree of relatedness. Assuming noninbred individuals, and letting
$\phi_1$ and $\phi_2$ be the probabilities of sharing one or two alleles identical by descent
at a locus, then the relatedness of individuals $a$ and $b$ is defined to be
$r_{ab} = \phi_{1ab}/2 + \phi_{2ab}$, which takes on values of 0.5 for parents and offspring
and for full sibs, 0.25 for half sibs, etc.

Taking into account the joint probabilities of all genotypes, and assuming a random-mating
population, the probability that two individuals, $a$ and $b$, share the same phenotype (both
have the marker, or both do not) at the $i$th locus is

$$E[S_{ab}(i)] = \theta(i) + r_{ab}[1 - \theta(i)] - \frac{\phi_{1ab}H^2(i)}{4}, \tag{16a}$$

where $\theta(i) = 1 - 2q^2(i)[1 - q^2(i)]$ is the probability that two nonrelatives have
matching phenotypes at the locus and $H(i) = 2q(i)[1 - q(i)]$ is the locus-specific
heterozygosity. This expression has an undesirable property: it is a function of the
relatedness $r_{ab}$, which we want to estimate, and the identity coefficient $\phi_{1ab}$,
which is unknown. Since $\phi_{1ab}$ is a measure of identity by descent, not identity in
state, it cannot be estimated from observational data. Unfortunately, we have been unable to
obtain any expression for $r_{ab}$ that does not have a similar problem. Note, however, that
the final term in eqn 16a can be no greater than 1/16 (this only being the bias in the case of
parent-offspring relationships when $H(i) = 0.5$). Thus, if one is willing to accept a bias of
this magnitude,

$$E[S_{ab}(i)] \approx \theta(i) + r_{ab}[1 - \theta(i)]$$

provides an approximate relationship between band-sharing and $r_{ab}$. Taking into account
uncertainty in the estimation of $\theta$, a relatedness estimator using data from the $i$th
locus is

where $S_{ab}(i) = 1$ or 0 denotes whether $a$ and $b$ have the
same phenotype, and

Estimates of relatedness based on single loci are extremely noisy. A more accurate estimate is
obtained by averaging over all $L$ loci,

$$\hat{r}_{ab} = \frac{1}{L}\sum_{i=1}^{L}\hat{r}_{ab}(i).$$

Such an analysis is only applied to polymorphic loci. When a large number of markers is used,
the sampling distribution of $\hat{r}_{ab}$ should be roughly normal. Its variance can be
estimated by substituting $s = L$, $z(i) = \hat{r}_{ab}(i)$, and $\bar{z} = \hat{r}_{ab}$ in
eqn 1.

The utility of RAPD markers for discriminating pairs of individuals related to different
degrees depends on the overlap between the distributions of similarity for different degrees
of relatedness. If there are $L$ informative markers for a pair of individuals, the mean (over
loci) similarity for pairs of individuals with relatedness $r$ should be approximately
binomially distributed with expectation

and sampling variance

These expressions can be used to explore the utility of band-sharing in the estimation of
relatedness.

The degree of overlap in the distributions of $S$ associated with different $r$ depends
strongly on the frequency of markers in the population (Fig. 2). If the markers are present in
high frequency (all $x \approx 0$), they will provide little discrimination among relatives,
since nearly all members of the population will exhibit the marker at each locus. The use of
very low frequency markers does not improve the situation. Since the distribution of
similarity depends only on $\theta(i)$, which is symmetrical in $x(i)$ and $1 - x(i)$, it
takes on the same properties for $q^2(i) = x(i) = 0.1$ as for $q^2(i) = x(i) = 0.9$. The best
resolution occurs when markers are at intermediate frequency, i.e. $x(i) = 0.5$. However, even
in this case, the degree of overlap in distributions of $S$ is very broad, and based on the
results in the figure, it can be seen that the problem is not eliminated until very large
numbers of markers (well in excess of 50) are scored. We conclude that the utility of RAPDs
for relatedness estimation is rather limited.

Discussion

Because of the dominance property of RAPDs, gene frequency estimates for such loci are
necessarily less accurate than those obtained with codominant markers such as isozymes and
RFLPs. In the latter case, the sampling variance of the gene frequency is $q(1 - q)/2N$.
Comparing this with eqn 2b, the ratio of the sampling variance of gene frequency for RAPDs
relative to a comparable locus with codominance is $(1 + q)/(2q)$, where $q$ is the frequency
of the null allele. Comparing eqn 4b to the result of Nei & Roychoudhury (1974), this can also
be shown to be the ratio for the sampling variance of the

[Figure 2. Panel labels: x = 0.1, 0.9 and x = 0.5; L = 20. Horizontal axis: Similarity
(0.0 to 1.0).]

Fig. 2. Expected distributions of similarity in band sharing for random pairs of individuals
of different degrees of relatedness $r$. It is assumed that individuals of a given
relationship have been scored at $L = 20$ or $L = 100$ markers, each associated with a locus
with the same value of $x$. The mean of each binomial distribution is given by eqn 19a, and
the variance by eqn 19b. Variation among loci in the frequency of the marker allele will
increase the variance beyond that illustrated.
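The kind of comparison summarised in Fig. 2 can be explored numerically with a small Python sketch. It is an approximation under explicit assumptions, not a reproduction of eqns 19a and 19b: the per-locus probability of a phenotype match is taken to be $\theta + r(1 - \theta)$ with $\theta = 1 - 2x(1 - x)$, every locus is given the same $x$, and the number of matching loci out of $L$ is treated as binomial. The overlap measure (the shared probability mass of two distributions) and all function names are hypothetical choices made for illustration.

```python
from math import comb


def match_probability(x: float, r: float) -> float:
    """Approximate probability that two individuals share a phenotype at a locus.

    x -- frequency of individuals lacking the marker (x = q**2 under random mating)
    r -- relatedness of the pair
    Assumes the approximation theta + r * (1 - theta).
    """
    theta = 1.0 - 2.0 * x * (1.0 - x)   # match probability for nonrelatives
    return theta + r * (1.0 - theta)


def similarity_pmf(n_loci: int, x: float, r: float) -> list[float]:
    """Binomial distribution of the number of matching loci out of n_loci."""
    p = match_probability(x, r)
    return [
        comb(n_loci, k) * p ** k * (1.0 - p) ** (n_loci - k)
        for k in range(n_loci + 1)
    ]


def distribution_overlap(n_loci: int, x: float, r1: float, r2: float) -> float:
    """Shared probability mass of the similarity distributions for r1 and r2.

    A value of 1 means the two classes are indistinguishable; 0 means no overlap.
    """
    pmf1 = similarity_pmf(n_loci, x, r1)
    pmf2 = similarity_pmf(n_loci, x, r2)
    return sum(min(a, b) for a, b in zip(pmf1, pmf2))


if __name__ == "__main__":
    for n_loci in (20, 100):
        for x in (0.1, 0.5):
            ov = distribution_overlap(n_loci, x, 0.0, 0.5)
            print(f"L = {n_loci:3d}, x = {x}: overlap(r = 0 vs r = 0.5) = {ov:.3f}")
```

Because all loci share one value of $x$ here, the sketch understates the variance expected with real data, in line with the caveat in the figure legend.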
within-population gene diversity using both types of markers. This ratio can be viewed as the
proportional increase in sampling effort that needs to be applied to a RAPD locus in order to
achieve a genetic-parameter estimate as accurate as would be acquired with a locus with
codominant markers with the same frequency. Recall from above that, to avoid biased parameter
estimates, RAPDs for which $x < 3/N$ should be avoided in population-genetic analysis.
Assuming this is done, then the critical value of the above ratio is
$(1 + \sqrt{3/N})/(2\sqrt{3/N})$, which takes on values of 1.4 for a sample size of $N = 10$,
3.4 for $N = 100$, and 9.6 for $N = 1000$. Thus, for acceptable loci alone, on the order of 2
to 10 times more individuals should be sampled per RAPD locus than per RFLP or isozyme locus.

The pruning of loci for which the null phenotype frequency is below $3/N$ also implies that
more loci need to be sampled in RAPD analysis than in a survey involving more conventional
codominant markers. The number of loci that are rejected using the $3/N$ criterion will
necessarily depend on the allele-frequency distribution, but it is clear that this number must
increase with decreasing $N$.

The pruning of loci raises an additional issue. Although our procedure ensures that the
estimates of population parameters will be essentially unbiased for the pool of acceptable
loci, it also tends to reject loci that are highly homozygous. Moreover, the loci that are
rejected may differ from population to population. Consequently, estimates of genome-wide
parameters, such as average within- and between-population gene diversities, derived via RAPD
analysis are likely to be biased in complicated ways whether or not the pruning procedure is
applied. In either case, the obvious way to minimize the problem of bias is to sample large
numbers of individuals per population (ideally at least 100), since this reduces the fraction
of loci that will yield biased parameter estimates. In addition, it seems quite advisable in
comparative studies to equalize the sample sizes in different populations to minimize the
possibility that observed differences between populations are simply artifacts of bias
resulting from sample size differences.

Certainly, if even with large $N$, 10% or more of the sampled loci still fail to meet the
$3/N$ criterion, then any population-parameter estimates derived from the data should be
interpreted with an appropriate degree of caution. If it is deemed desirable not to prune loci
from the final analysis, then the best that can be said is that the bias of a parameter
estimate averaged over all loci cannot exceed the worst locus-specific bias.

A critical assumption of the preceding theory is that there are at most two alleles per locus
(one marker and one null). If that is not the case, the approach we have employed will lead to
an overestimate of the frequency of null alleles, and an underestimate of the
within-population gene diversity. The bias resulting from violations of this assumption
depends on the frequency distribution of alleles, but a qualitative assessment can be made for
the simple situation in which $n$ markers at a locus all have equal frequencies, $(1 - q)/n$.
If the investigator focuses on a single marker and views the heterozygosity as
$2x^{1/2}(1 - x^{1/2})$, where $x$ is the frequency of individuals not exhibiting that marker,
the quantity $2(1 - q)(n - 1 + q)/n^2$ would actually be estimated. The ratio of this quantity
to the actual heterozygosity, $1 - q^2 - n[(1 - q)/n]^2$, is

$$\frac{2(n - 1 + q)}{n[n - 1 + q(n + 1)]},$$

which shows that in this extreme case the estimated heterozygosity would be on the order of
$(2/n)$th or less than the true value.

A simple way to avoid the problem of multiple allelic markers is to focus on markers that are
common. If the sum of the frequencies of individuals in which two markers are uniquely found
exceeds one, then the markers cannot be allelic. So if this criterion is applied to all pairs
of markers, the multiple-marker problem can be eliminated. This, of course, does not remove
the issue of multiple unamplifiable (null) alleles, a problem that seems much more difficult
to resolve. Moreover, by eliminating loci with high allele frequencies, this procedure raises
the additional issues of bias associated with pruning, already discussed above.

Finally, we note that our emphasis has been on the development of unbiased parameter
estimators. Such estimators do not always minimize the mean squared deviations between
parameters and their estimates. Alternative procedures, such as maximum likelihood, may
provide minimum-variance estimators, although such estimators are not necessarily unbiased.

Acknowledgements

This work has been supported by NSF grants BSR 89-11038 and BSR 90-24977, and PHS grant
GM36827-01 to ML, and by NSF grant BSR 89-06129 to BGM. We are especially grateful to
K. Ritland for pointing out some fundamental problems in an earlier draft of this paper, and
to E. Martins for helpful comments.

References

Caetano-Anollés G, Bassam BJ, Gresshoff PM (1991) DNA amplification fingerprinting using very
short arbitrary oligonucleotide primers. Bio/Technology, 9, 553-557.
Clark AG, Lanigan CMS (1993) Prospects for estimating nucleotide divergence with RAPDs.
Molecular Biology and Evolution, 10, 1096-1111.
Crowley PH (1992) Resampling methods for computation-intensive data analysis in ecology and
evolution. Annual Review of Ecology and Systematics, 23, 405-447.
Deragon JM, Landry BS (1992) RAPD and other PCR-based analyses of plant genomes using DNA
extracted from small leaf disks. PCR Methods and Applications, 1, 175-180.
Hadrys H, Balick M, Schierwater B (1992) Applications of random amplified polymorphic DNA
(RAPD) in molecular ecology. Molecular Ecology, 1, 55-64.
Lynch M (1988) Estimation of relatedness by DNA fingerprinting. Molecular Biology and
Evolution, 5, 584-599.
Lynch M (1990) The similarity index and DNA fingerprinting. Molecular Biology and Evolution,
7, 478-484.
Nei M (1972) Genetic distance between populations. American Naturalist, 106, 283-292.
Nei M (1978) Estimation of average heterozygosity and genetic distance from a small number of
individuals. Genetics, 89, 583-590.
Nei M, Roychoudhury AK (1974) Sampling variances of heterozygosity and genetic distance.
Genetics, 76, 379-390.
Stephens JC, Gilbert DA, Yuhki N, O'Brien SJ (1992) Estimation of heterozygosity for
single-probe multilocus DNA fingerprints. Molecular Biology and Evolution, 9, 729-743.
Welsh J, McClelland M (1990) Fingerprinting genomes using PCR with arbitrary primers. Nucleic
Acids Research, 18, 7213-7218.
Williams JGK, Kubelik AR, Livak KJ, Rafalski JA, Tingey SV (1990) DNA polymorphisms amplified
by arbitrary primers are useful as genetic markers. Nucleic Acids Research, 18, 6531-6535.
Wright S (1951) The genetical structure of populations. Annals of Eugenics, 15, 323-354.

Much of the author's research is concerned with the ecological and evolutionary determinants
of population-genetic structure, and with the use of molecular markers for quantifying such
structure.
