
Botanical Journal of the Linnean Society, 2009, 159, 1–11.

Selection of candidate coding DNA barcoding regions for use on land plants

CAROLINE S. FORD1, KAREN L. AYRES2, NICOLA TOOMEY3, NADIA HAIDER4,
JONATHAN VAN ALPHEN STAHL5, LAURA J. KELLY5, NIKLAS WIKSTRÖM6,
PETER M. HOLLINGSWORTH7, R. JOEL DUFF8, SARAH B. HOOT9,
ROBYN S. COWAN5, MARK W. CHASE5 and MIKE J. WILKINSON1*
1 Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth SY23 3DA, UK
2 School of Biological Sciences, University of Reading, Reading RG6 6FN, UK
3 Department of Biological Sciences, Imperial College, Silwood Park, Ascot SL5 7PY, UK
4 Department of Molecular Biology and Biotechnology, Atomic Energy Commission of Syria (AECS), Damascus, PO Box 6091, Syria
5 Jodrell Laboratory, Royal Botanic Gardens, Kew, Richmond TW9 3DS, UK
6 Department of Systematic Botany, Uppsala University, Norbyvägen 18D, 752 36 Uppsala, Sweden
7 Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh EH3 5LR, UK
8 Department of Plant Biology, Southern Illinois University, Carbondale, IL 62901-6509, USA
9 Department of Biological Sciences, University of Wisconsin, Lapham Hall 181, 3209 N. Maryland Ave., Milwaukee, WI 53211, USA

Received 13 April 2008; accepted for publication 25 September 2008

An in silico screen of 41 of the 81 coding regions of the Nicotiana plastid genome generated a shortlist of 12
candidates as DNA barcoding loci for land plants. These loci were evaluated for amplification and sequence
variation against a reference set of 98 land plant taxa. The deployment of multiple primers and a modified
multiplexed tandem polymerase chain reaction yielded 85–94% amplification across taxa, and mean sequence
differences between sister taxa of 6.1 from 156 bases of accD to 22 from 493 bases of matK. We conclude that loci
should be combined for effective diagnosis, and recommend further investigation of the following six loci: matK,
rpoB, rpoC1, ndhJ, ycf5 and accD. © 2009 The Linnean Society of London, Botanical Journal of the Linnean
Society, 2009, 159, 1–11.

ADDITIONAL KEYWORDS: genetic barcodes – matK – MT-PCR – multi-locus barcode – plant identification
– plastid genes.

*Corresponding author. E-mail: jjw@aber.ac.uk

INTRODUCTION

The proliferation of genome sequence data has allowed paradigm shifts in our understanding of cell
biology and gene function, but has largely failed to enhance species diagnosis. Absolute genome
sequence divergence between sister species is unknown, but will inevitably be highly variable
between species pairs, and will depend on a range of factors, including time since speciation,
mutation rate, extent of divergent selection and efficacy of genome repair. The absence of routine
protocols for rapid, low-cost and reliable genome sequencing of all species means that there is a
growing desire to target shared genomic regions as a common platform for DNA-based species
diagnosis. However, some incongruity is inescapable if identification is based on limited DNA
information, because there can only be a correlative link between the sequences used and true
species identity. To confound matters, there is variability in the mechanisms of speciation and
contexts in which speciation occurs and inconsistency in the taxonomic treatment of


analogous situations across divergent taxa. Despite such limitations, the strategy of using limited
DNA information to aid species identification has considerable attraction for many practical,
commercial and scientific applications (Newmaster, Fazekas & Ragupathy, 2006). This is known as
DNA barcoding.

Barcoding efforts are proceeding for many animal groups using the mitochondrial cytochrome c
oxidase subunit 1 gene (CO1). DNA sequences of this gene contain sufficient variability to
distinguish species of birds (Hebert et al., 2004a), fish (Ward et al., 2005), spiders (Greenstone
et al., 2005) and some butterflies (Hebert et al., 2004b; Hajibabaei et al., 2006), and the gene
has been designated as the current barcoding region for the animal kingdom (Hebert, Ratnasingham &
deWaard, 2003). Moreover, pilot studies are now assessing the utility of CO1 for barcoding fungi
(Seifert et al., 2007), diatoms (Evans, Wortley & Mann, 2007) and red algae (Robba et al., 2006).

The use of this mitochondrial region largely avoids problems of locus and genome duplication that
are commonplace in the nuclear genome, as they only occasionally occur for mitochondrial genes
through horizontal transfer (Timmis et al., 2004). However, in some groups, CO1 is an unsuitable
barcoding marker. Barcoding efforts are less well developed for these taxa, primarily because an
equivalent diagnostic region has yet to be identified. Land plants are the most prominent of these
‘orphan groups’, in which relative sequence uniformity of the mitochondrial genome precludes the
use of CO1 as a barcoding marker (Cho et al., 1998, 2004; Adams & Palmer, 2003; Chase et al.,
2005). The difficulty in developing barcoding markers for land plants is exacerbated by the
extensive evolutionary divergence between taxa since they first appeared c. 475 million years ago
(Wellman, Osterloff & Mohiuddin, 2003), their propensity to undergo both large- and small-scale
nuclear genome rearrangements, the widespread appearance of interspecific and intergeneric hybrids,
and the huge diversity in their life history and breeding behaviour. This complexity has long been
recognized and has even complicated the establishment of universal species concepts across the
group. Thus, the ‘land plant genome’ has great depth and breadth in its variability and presents a
particularly difficult challenge in the search for suitable DNA barcoding loci.

KEY CRITERIA FOR BARCODING LOCI

As an underlying principle of DNA barcoding is the high-throughput standardized identification of
biological samples, the DNA region(s) used for plant barcoding must be: (1) routinely amplifiable;
(2) easily sequenced via single-pass sequencing; (3) sufficiently variable to separate most sister
species pairs and yet also show sufficient sequence consistency to ensure that infraspecific
variation does not confound species assignation; (4) unsusceptible to the amplification of loci
other than the targeted plastid region; and (5) easily annotated for the critical evaluation of
sequence quality and error detection.

The search for a parallel region to CO1 in land plants that matches these criteria has focused on
the plastid genome, which benefits from a high copy number per cell, enabling amplification from
degraded samples, and has a known gene order and function for an increasing number of land plants.
However, it seems likely that the modest levels of sequence divergence of plastid DNA when compared
with CO1 in animals (Chase et al., 2007) will dictate the use of multiple loci to allow species
diagnosis. Several plastid regions have been proposed (Chase et al., 2007), with greatest interest
shown in the protein-encoding rbcL gene (Chase et al., 2005; Newmaster et al., 2006; Kress &
Erickson, 2007) and the predominantly non-coding psbA-trnH intergenic spacer (Kress et al., 2005;
Kress & Erickson, 2007; Shaw et al., 2007). In addition, K. J. Kim
(http://www.barcoding.si.edu/plant_working_group.html) has suggested the combined use of coding and
non-coding loci (either matK + atpF-H + psbK-I or matK + atpF-H + trnH-psbA) as multi-locus
barcodes.

The suggestion of rbcL as a barcoding region relates to its historical popularity as a gene for
deep-level phylogenetic studies, rather than being based on a systematic evaluation of plastid
coding regions for DNA barcoding. There are thousands of rbcL sequences from phylogenetic studies
already deposited in public databases, and routine protocols exist for sequence generation.
However, the absence of vouchers or electropherograms for the great majority of sequences precludes
their use for barcoding, although they may still have utility as an aid to identification. The
length of the rbcL gene (c. 1400 bases) also dictates the use of multiple-pass sequencing, which
further compromises its value as a barcode region if used in its entirety. Its application
therefore depends on the identification of suitable diagnostic sections of the gene (Kress &
Erickson, 2007). In silico tests of a 550–600-base subset of rbcL have shown that the section can
generally resolve samples to the family and genus level (Kress & Erickson, 2007). Despite efforts,
universal primers that amplify short, diagnostic regions of rbcL suitable for barcoding (allowing
single-pass sequencing) have proved elusive (Chase et al., 2007).

The psbA-trnH intergenic spacer is one of the most variable regions of the plastid genome (Shaw
et al., 2007) and, in this respect, has potential as a candidate barcoding region for land plants
(Kress et al., 2005; Kress & Erickson, 2007). However, much of its variability occurs as indels,
with amplicons targeting


the region typically exhibiting considerable variation in size and leading to alignment
difficulties even between closely related taxa (Chase et al., 2007). There are also problems
associated with duplications and deletions in this locus (Sass et al., 2007). The degree to which
sequence alignment is necessary for barcoding purposes is a matter of debate (Chase et al., 2005,
2007; Kress et al., 2005; Newmaster et al., 2006; Kress & Erickson, 2007). Little & Stevenson
(2007) contend that alignment-free methods can perform well in barcoding studies. However, a more
practical issue arising from the use of a non-coding region is that the absence of a reading frame
means that it is impossible for end-users to exploit the translated (amino acid) sequence as an
independent assessment of sequence quality. Also lost is the capacity to use the amino acid
sequence as a strong supporting criterion to infer orthology of genome location and function
between amplicon and target locus, i.e. to confirm whether the amplicon derives from the plastid or
from a pseudogene within the nuclear genome that shares primer binding sites and possibly a common
ancestral origin via horizontal gene transfer.

Although the debate regarding the use of non-coding regions in DNA barcoding is ongoing based on
the high levels of sequence variability observed in such loci, there are clear benefits of using
easily alignable barcode data from coding loci, as these allow independent checks of sequence
quality (via in silico translation) and thereby provide some protection against the inadvertent use
of nuclear pseudogenes. An added benefit of coding regions is that these data are also more
amenable for re-use in phylogenetic studies. Given these potential benefits, a thorough systematic
investigation of the potential plastid coding regions for DNA barcoding in land plants is
important. In this paper, we examine the genic regions of the plastid genome with barcoding
potential. Our search has focused on regions containing areas of conserved amino acid sequence and
hence assumed functional importance to act as targets for primer binding; these flank exons of
200–800 bases and do not contain introns.

MATERIAL AND METHODS

IDENTIFICATION OF GENE REGIONS

We examined 41 of the 81 functionally characterized loci from the 95 protein coding regions of the
Nicotiana tabacum plastid genome (GenBank accession NC_001879) for suitability as barcoding
candidates (Table S1, see Supporting Information). Each coding region, as listed on the Entrez
Genome page of the National Center for Biotechnology Information (NCBI) website
(http://www.ncbi.nlm.nih.gov/genomes/altvik.cgi?gi=13349&db=g&from=0), was individually ‘blast
searched’ using the search function on this page to identify related sequences in the database. All
plant sequences returned were aligned together as amino acids using default settings of ClustalW
(http://www.ebi.ac.uk/clustalw/). These alignments were then employed as the basis for the
elimination of genes from further assessment using the following categories: (1) genes not
conserved across > 98% of species for which complete plastid sequences are available; (2) genes
lacking conserved regions for the development of universal primers; and (3) genes containing
introns or showing extensive size variation (such as those showing > 1.3-fold difference in
length). The remaining genes were examined further to identify regions of the sequence of suitable
amplicon size (180–800 bases) in which variation could be observed between species and that were
flanked by conserved regions that could be utilized as target sites for polymerase chain reaction
(PCR) primers. The remaining 40 loci from the Nicotiana plastid were evaluated in a separate study
(Haider, 2003; N. Haider, pers. comm.), and are not presented here.

UNIVERSAL PRIMER DEVELOPMENT

Primer binding sites were selected in regions of conserved amino acid sequence where the nucleotide
sequence contained 40–60% GC content in a region flanking a 180–800-bp amplicon. Primers were
designed: (1) to be 20–24 bp in length with similar melting temperature (Tm); (2) to lack internal
nucleotide complementarity and redundancy; (3) to be exonic and terminate the 5′ end of the primer
on codon position 2 (forward primers) or 1 (reverse primers); and (4) to have any existing
variation between sequences only at the third position nucleotide in the triplet. Four primers were
designed for each of the 12 loci, two forward and two reverse (Table S2, see Supporting
Information). An additional forward primer was designed for matK to address some of the problems
experienced with universality during testing.

DNA SAMPLES

The ability of primers to generate amplicons was tested empirically using 98 taxa comprising 46
congeneric species pairs and two species triplets from the same genus. The test series comprised
four liverworts, six pteridophytes, six gymnosperms, 28 monocotyledons and 54 other angiosperms
(Table S3, see Supporting Information). An initial screen for sequence polymorphisms between
‘sister species’ was performed, with ten diverse pairs of angiosperm species selected for reliable
amplification in the initial


PCR screen (Table S3). The second round of screening for sequence variability was performed using
all 98 taxa. Nearly all DNA was supplied pre-extracted from the Royal Botanic Gardens, Kew DNA Bank
(http://www.kew.org/data/dnabank/homepage.html). Exceptions were those samples for bryophytes and
some fern allies.

STANDARD PCR AND SEQUENCING

PCR was performed using Bioline BIOTAQ™ DNA polymerase or Bioline BioMix™ in a 20-µL reaction
according to the manufacturer’s instructions. Thermocycling conditions were optimized for all
primer pairs at 94 °C for 1 min, followed by 40 cycles of 94 °C for 30 s, 53 °C for 40 s and 72 °C
for 40 s, with a final extension step of 72 °C for 5 min. Products were submitted for sequence
analysis to either The BioCentre, University of Reading (http://www.biocentre.reading.ac.uk) or
Macrogen (http://www.macrogen.com). Cycle sequencing reactions were carried out in both cases
according to Sanger, Nicklen & Coulson (1977) using BigDye Terminator 3.1 (Applied Biosystems).
Manual editing of raw traces and subsequent alignments of forward and reverse sequences enabled us
to assign edited sequences for most species. The 3′ and 5′ termini were clipped to generate
consensus sequences for each taxon. Nucleotide sequences were then translated into amino acid
sequences using ExPASy (http://www.expasy.ch/tools/dna.html).

MODIFIED MULTIPLEXED TANDEM PCR (MT-PCR)

Taxa showing repeated null amplification for specific loci were subjected to a modified MT-PCR
technique. This method was originally developed to ensure the detection of rare mRNAs from mixed
samples (Stanley & Szewczuk, 2005), and generates high-quality sequences from low levels of cDNA
template (Nolan, Hands & Bustin, 2006; Jex et al., 2008). The same approach was used here to
increase the capacity to detect potentially low-concentration (or quality) template DNA. The nested
approach used by MT-PCR utilizes the two primer pairs designed for each locus, with the outer pair
yielding first-round pre-amplification and the inner pair using the first-round product as template
for second-round amplification. First-round pre-amplification was performed employing the outer
forward primer (usually designated 1, but see Table S2) and the outer reverse primer (usually
designated 4, but see Table S2) in a PCR using Bioline BioMix™ in a 20-µL reaction comprising:
94 °C for 1 min, followed by 15 cycles of 94 °C for 30 s, 51 °C for 40 s and 72 °C for 40 s, with a
final extension step of 72 °C for 5 min. The use of only 15 cycles ensures that the amplification
remains in the exponential phase during pre-amplification. This reduces the incidence of de novo
mutations arising from mispriming during first-round amplification, and so increases the fidelity
of the template for second-round amplification. The resulting products were diluted 1:12.5, and
2.5 µL of this product was used as template for the second amplification employing the inner
forward primer (Table S2) and the inner reverse primer (Table S2) in a PCR using Bioline BioMix™ in
a 20-µL reaction as follows: 94 °C for 1 min, followed by 40 cycles of 94 °C for 30 s, 51 °C for
40 s and 72 °C for 40 s, with a final extension step of 72 °C for 5 min. The resultant amplicons
were submitted to Macrogen for purification and single extension sequencing in the forward and
reverse directions using the inner primer pair.

SEQUENCE ANALYSIS AND VERIFICATION

Consensus sequences were produced for each taxon at each locus by alignment of the forward and
reverse sequences using ClustalW (http://www.ebi.ac.uk/clustalw/). All sequences were blastn
searched (http://www.ncbi.nlm.nih.gov/BLAST/) to verify taxon (or close taxonomic group) and locus.
The sequences were then translated into amino acids using ExPASy to reveal sequence and frame-shift
errors, identified by the presence of stop codons. Errors were removed if possible via comparison
with the sister taxon amino acid sequence and with the original trace data. If ambiguities
remained, sequences were either discarded or re-sequenced. Corrected amino acid sequences were then
verified using a blastp search to known sequences. Sequences that failed to blast (either as a
nucleotide or amino acid sequence) to the expected target were excluded from further analysis.
Sequences remaining after verification were aligned in species pairs using ClustalW and clipped to
equal length. Verified sequences for selected loci were submitted to the European Molecular Biology
Laboratory (EMBL) (Table S4, see Supporting Information).

SEQUENCE VARIABILITY

Comparison of species pairs to determine numbers of polymorphisms and indels for each marker was
performed using a program written in C (Program S1). Variation in size and sequence meant that
global alignments resulted in short read lengths for most of the loci used. We therefore aligned
the sequences in congeneric pairs separately, so that we maximized the read length for each pair.
For species triplets (Erica, Protea), one sequence was discarded to leave a pair. Polymorphisms
were identified whenever a site differed between the two aligned sequences for a species pair. Each
individual indel was treated as a dummy


Table 1. Polymerase chain reaction (PCR) amplification for 12 target loci (one primer pair each)
across 98 taxa and sequence polymorphisms between ten species pairs for 11 loci

        PCR amplification               Sequence polymorphisms
Gene    Species     Percentage   Rank   Combined amplicon   Variable   Percentage of    Rank
        amplified   success             length (bp)         bases      variable bases
        (N/98)                                                         (%)
accD    65          66.3         6      2155                84         3.90             1
matK    40          41.8         11     4548                52         1.14             6
ndhA    63          64.3         8      3171                22         0.69             8
ndhJ    78          79.6         2      3946                63         1.60             4
ndhK    88          89.8         1      1710                5          0.29             10
rpl22   27          27.6         12     –                   –          –                –
rpoB    74          75.5         4      4129                38         0.92             7
rpoC1   78          79.6         2      4690                70         1.49             5
rpoC2   64          65.3         7      4867                32         0.66             9
ycf2    67          68.4         5      3504                9          0.26             11
ycf5    51          52.0         10     2927                66         2.25             3
ycf9    60          61.2         9      1263                29         2.30             2

The amplification of products with the expected size was expressed and ranked by the number of
species (N/98) and percentage success. For each marker, sequence variability is expressed as the
number and percentage of variable bases of total length. Markers ranked (highest to lowest) by
percentage of variable bases.
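
The variable-base counts and derived columns of Table 1 can be illustrated with a short Python sketch. It is not the authors' code (their analysis used C programs, Programs S1 and S2, which are not reproduced here). The names `count_differences`, `percent_variable` and `TABLE1` are hypothetical, and scoring a contiguous gap run as a single difference is an assumed reading of 'each individual indel was treated as a dummy base'. The lengths and counts are copied from the table.

```python
def count_differences(seq_a: str, seq_b: str, gap: str = "-") -> int:
    """Count differing sites between two aligned sequences of equal length.

    Assumption for this sketch: a contiguous run of gap characters in one
    sequence is scored once, as a single dummy base.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    differences = 0
    open_gap = None  # which sequence ("a" or "b") currently carries a gap run
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a == gap and base_b == gap:
            continue  # a column gapped in both sequences carries no information
        if base_a == gap or base_b == gap:
            owner = "a" if base_a == gap else "b"
            if owner != open_gap:
                differences += 1  # a new indel starts here
            open_gap = owner
        else:
            open_gap = None
            if base_a != base_b:
                differences += 1  # substitution
    return differences


def percent_variable(length_bp: int, variable_bases: int) -> float:
    """Percentage of variable bases over the combined amplicon length."""
    return 100.0 * variable_bases / length_bp


# (combined amplicon length in bp, variable bases) per locus, copied from Table 1.
# rpl22 is omitted because no sequence data are reported for it.
TABLE1 = {
    "accD": (2155, 84), "matK": (4548, 52), "ndhA": (3171, 22),
    "ndhJ": (3946, 63), "ndhK": (1710, 5), "rpoB": (4129, 38),
    "rpoC1": (4690, 70), "rpoC2": (4867, 32), "ycf2": (3504, 9),
    "ycf5": (2927, 66), "ycf9": (1263, 29),
}

# Rank loci from most to least variable (rank 1 = highest percentage).
ranked = sorted(TABLE1, key=lambda locus: percent_variable(*TABLE1[locus]), reverse=True)

for rank, locus in enumerate(ranked, start=1):
    print(f"{rank:2d}  {locus:6s} {percent_variable(*TABLE1[locus]):5.2f}%")
```

Sorting by the unrounded percentage reproduces the ranking in the right-hand column of the table; the toy alignment scoring is only meant to show how a multi-base indel contributes one difference under the stated assumption.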

base, and therefore usually resulted in a single nucleotide polymorphism (SNP). Sequence length was
recorded as zero for species pairs that did not produce clean sequences. A further C program
(Program S2) took the resulting number of SNPs and sequence length per species pair, and summed
these over all possible combinations of markers. To maximize the use of the available data, all
species pairs yielding adequate sequences for a given subset of markers were used to obtain results
for that subset, resulting in different numbers of species pairs being used for different marker
subsets. The results were expressed as the mean number and percentage of polymorphic sites over
available pairs.

RESULTS

LOCUS SELECTION

ClustalW alignments of the amino acid sequences of 41 of the 81 functionally characterized coding
regions in the Nicotiana plastid genome (GenBank accession NC_001879) yielded a candidate shortlist
of 12 loci meeting the criteria given in ‘Material and methods’. These top 12 plastid gene regions
were selected for further primer design and empirical evaluation (Tables 1, S1, S2).

INITIAL SCREEN OF THE POTENTIAL FOR WIDESPREAD AMPLIFICATION

The initial screen tested the ability of one primer pair for each locus to yield a single strongly
amplified product of appropriate size across 98 divergent taxa. All loci yielded individual
products of appropriate size in > 40% of species trialled, except for rpl22 (27.6%), with ndhK
(89.8%) amplifying products most consistently (Table 1). Accordingly, 11 loci (all except rpl22)
were carried forward to a second screen to assess the fidelity of amplification and to provide
preliminary information on sequence variability.

AMPLIFICATION FIDELITY AND PRELIMINARY COMPARISON OF DNA SEQUENCES

A second screen was performed to test amplicon orthology and to provide an initial evaluation of
sequence variation. In this case, we used ten congeneric species pairs selected for consistency of
amplification across most loci in the initial screen (Table S3). In this way, we sought to rank
loci on the basis of sequence variability rather than on the ability to amplify. Representative
sequences of all coding regions were obtained from four species pairs (Begonia, Coffea, Hieracium
and Paeonia), whereas four species pairs produced good sequences for all except one locus (Aglaia
and Sophora missing matK; Betula missing rpoC2; Dioscorea missing ycf9). Acorus failed to yield
sequenced amplicons for two loci (matK and ndhA) and Carex failed to amplify for six loci (rpoC1,
matK, ycf2, ycf5, ndhK and ndhA) (Table S5, see Supporting Information). Thus, data for all species
pairs were obtained for accD, ndhJ and rpoB.

In a few cases, in silico translation of raw trace data revealed sequences that included one or
more stop


Table 2. Amplification and sequence success of six plastid markers

Region                       accD   matK   ndhJ   rpoB   rpoC1   ycf5
Species pairs tested         48     48     48     48     48      48
Species pair sequences       35     32     45     42     45      32
Percentage amplification     73     67     94     87     94      67
Mean amplicon length (bp)    212    724    361    388    498     297

The first two species in each of the triplets are counted as a ‘pair’ in this analysis.
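
The translation check used during sequence verification, in which stop codons within a translated amplicon flag possible sequencing or frame-shift errors, can be sketched as follows. This is an illustrative Python stand-in for the ExPASy translation step, not the tool itself. The function names are hypothetical, and the standard genetic code is assumed to be adequate for the plastid coding regions considered.

```python
from itertools import product

# Standard genetic code, built in TCAG codon order (an assumption of this sketch).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence: str, frame: int = 0) -> str:
    """Translate a DNA sequence in frame 0, 1 or 2; unknown codons become 'X'."""
    seq = sequence.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def internal_stops(protein: str) -> list[int]:
    """Positions of stop codons ('*') other than a terminal one."""
    body = protein[:-1] if protein.endswith("*") else protein
    return [i for i, aa in enumerate(body) if aa == "*"]


def clean_frames(sequence: str) -> list[int]:
    """Frames whose translation contains no internal stop codon."""
    return [f for f in range(3) if not internal_stops(translate(sequence, f))]


# Toy sequences, invented for illustration only.
print(translate("ATGGCCAAATTTTAA"))            # an open reading frame
print(internal_stops(translate("ATGTAAGCC")))  # a stop codon inside the read
print(clean_frames("ATGTAAGCC"))
```

An internal stop in the expected frame does not by itself prove an error; in the workflow described in this paper it prompts comparison with the sister taxon sequence and the original trace data.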

codons. There were often modest quality traces around these codons and their appearance was
frequently associated with mononucleotide repeats. We therefore adopted the conservative approach
of deeming these sites to contain possible artefactual, frame-shifting indels. It was often
possible to confirm such artefacts by direct comparison with the equivalent sequence from the
sister species, but only sometimes by careful re-examination of all trace data. For this screen,
manual correction of such instances generally involved a simple frame-shift correction that removed
the stop codon and increased the stretch of amplicon sequence over which there was clear amino acid
and nucleotide sequence concordance with orthologous regions from other species. There was no
obvious pattern to the appearance of these indels, with the abundance of frame-shift indels ranging
from none in ndhA, ndhJ and rpoB to nine in accD (Table S6, see Supporting Information). Corrected
amino acid sequences were then screened for putative homology with known sequences using protein
blast (blastp) searches on the NCBI database. The highest ranking blastp similarity scores matched
the target locus for all genes.

There was considerable variability between the edited sequences of selected loci in the abundance
of SNPs. By far the highest proportion of polymorphic sites (83/2155, 3.8%) was generated in accD,
with a marked drop to ycf9 and ycf5 (2.29% and 2.25%, respectively), ranked second and third
(Tables S5, 2). The least variable locus proved to be ycf2, exhibiting only nine polymorphisms
among the 3500 bases amplified across all species (0.25%). Thus, viewed purely from the perspective
of the percentage of variable sites, the six highest ranking loci were accD, ycf9, ycf5, ndhJ,
rpoC1 and matK (Table S5). However, although ycf9 yielded the second most variable sequences
(proportion of variable sites), the amplicons it generated were both small and highly variable in
size, with the largest amplicon (Coffea, 168 bp) being almost 1.5 times longer than the shortest
(Acorus, 115 bp; Table S5). These features would profoundly complicate the process of demonstrating
that the amplicon derived from the intended plastid locus, as short, variable sequences with
multiple indels provide little scope for establishing sufficient alignment with reference sets to
have confidence in true amplicon origin. Accordingly, ycf9 was replaced by the proportionately less
variable but longer rpoB (ranked 7, Table S5) and taken forward for more detailed sequence
comparisons.

SEQUENCE VARIABILITY ACROSS 98 DIVERGENT SPECIES

Primers targeting the selected six loci (accD, ycf5, ndhJ, rpoB, rpoC1 and matK) were used for a
more detailed examination of sequence variability between sister species across 98 divergent taxa.
All four possible primer combinations for all six loci were used in this screen. This action
enabled sequences to be secured across the vast majority of taxa for ndhJ, rpoC1, rpoB and ycf5
(98%, 98%, 95% and 95% respectively; Table S7, see Supporting Information). The consistency of
amplification was more problematic for matK (70% of species amplified) and accD (86% of species).
Some species were also markedly more prone to consistent failure across several loci than others;
Equisetum telmateia failed to amplify the intended target for any locus, whereas no amplification
was obtained for Diosporos bejaudii and Erica anguliger for three loci (accD, matK and rpoB, and
rpoC1, rpoB and matK, respectively; Table S7). Thus, given that every species, except E. telmateia,
amplified products using at least one primer pair, the quality of DNA extraction was not considered
to be a significant factor in determining PCR amplification. The use of four primer combinations
nevertheless dramatically improved the overall frequency of loci amplified and sequenced in most
species. This is perhaps best illustrated by Alcantarea regina, for which amplification was
recovered from just five of 24 primer combinations tested across all loci, but nevertheless secured
sequence from all loci except matK. The limited coverage for problematic loci was addressed by the
introduction of a new forward primer for matK (matKX) and the generation of amplicons using
modified MT-PCR (improving amplification to 85%; Table S7). The intended locus invariably generated
the highest amino acid and nucleotide hits by blastp and blastn, respectively.
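
The figure of 24 primer combinations follows from pairing each of the two forward primers with each of the two reverse primers at all six loci. A minimal Python sketch of that enumeration is given here. The primer labels are hypothetical placeholders (the actual primer names are in Table S2 and are not reproduced), and the additional matK forward primer (matKX) is deliberately left out.

```python
from itertools import product

LOCI = ["accD", "matK", "ndhJ", "rpoB", "rpoC1", "ycf5"]
FORWARD = ["forward_1", "forward_2"]  # hypothetical labels for the two forward primers
REVERSE = ["reverse_1", "reverse_2"]  # hypothetical labels for the two reverse primers

# Every forward primer paired with every reverse primer: 2 x 2 = 4 per locus.
combinations_per_locus = list(product(FORWARD, REVERSE))

# The same four pairings repeated for each of the six loci: 6 x 4 = 24 in total.
all_combinations = [
    (locus, forward, reverse)
    for locus in LOCI
    for forward, reverse in combinations_per_locus
]

print(len(combinations_per_locus), len(all_combinations))
```

Including matKX as a third forward primer for matK would add two further pairings for that locus.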


No locus successfully produced unambiguous electropherograms across all species pairs, although the
two most consistently amplified loci, ndhJ and rpoC1, only failed to provide sequence for three
pairs (Anastrophyllum, Pinus and Stephanodaphne, and Anastrophyllum, Carex and Equisetum,
respectively) (i.e. 45/48 pairs, 94%; Table S8, see Supporting Information). The rpoB primers
universally produced amplicons for all angiosperms (i.e. 88% of all species groups), but
consistently failed in all other groups, namely Araucaria, Ephedra, Equisetum, Isoetes, Lycopodium
and Mannia. The remaining three loci were notably less consistent and yielded robust sequence in
both/all sister species on just 50% (matK), 67% (ycf5) or 73% (accD) of occasions (Table S8). As
with rpoB, these loci performed markedly better among angiosperms, with the majority of failed
reactions deriving from gymnosperm and spore-producing land plants (Table S8). The adoption of the
MT-PCR technique greatly improved the proportion of taxa from which sequences were secured,
primarily from matK, but also from ndhJ and rpoB. Sequence data for an additional 13 taxa were
obtained via this method for matK (Table S9, see Supporting Information). The use of this
technique, combined with the inclusion of an additional forward primer (matKX), improved the
amplification of the matK primers to 32/48 (67%, Table S8) of sister species examined. The only
robust sequence obtained for E. telmateia from all primer combinations and loci was acquired using
the MT-PCR technique with the ndhJ primers, thereby confirming the utility of the technique for
difficult taxa or DNA.

When all species pair comparisons were made, matK and ndhJ generated the highest mean number of
differences between species pairs (14.0 and 12.3, respectively), followed by rpoC1, rpoB, ycf5 and
accD in that order (Table S8). However, it should be remembered that this ranking is based on
different species arrays for each locus. To compare sequence variability between loci directly, it
is better to consider only the 19 species pairs that amplified across all loci. When this was
performed, matK again generated the highest number of mean sequence differences (20.5 per species
pair), followed by ndhJ (7.0), ycf5 (6.4), rpoC1 (4.5), rpoB (4.0) and accD (3.4) (Table S10, see
Supporting Information). Significantly, no single locus was able to distinguish between all species
pairs included in the study, with some variability between loci in the number and selection of
species pairs that they were unable to diagnose. Loci differed in their ability to discriminate
between species pairs, with ycf5 (33/33 pairs) and

The combined use of two or more markers therefore provides the opportunity to increase the variance
between species pairs and the overall frequency of diagnostic sequence differences. Overall, this
measure increased the mean number of sequence differences between species pairs across all
individual loci from 7.6 to 15.3 bases when pairs of loci were used, and 30.5 when four loci were
used (Table S11, see Supporting Information). The combined use of all loci yielded a mean of 45.8
sequence differences between sister species, but required a combined mean sequence length of over
2500 bases per species. Among the combinations excluding one locus, there was a notable and sharp
drop in mean sequence divergence generated from the combination that lacked matK (25.3 bases
compared with 38.8–42.4 for the remaining combinations). This pattern was maintained for all
shorter combinations that lacked matK. For instance, performance was distinctly bimodal for the
triplet combinations, with the ten combinations including matK yielding 27.9–33.9 mean differences
between species pairs, whereas the ten combinations lacking this locus fell in the range 11.9–17.9
(Table S11).

DISCUSSION

The search for genomic regions to provide appropriate targets for genetic barcoding is a task that
extends beyond a simple quest for sequence variability. Although there is a need for sufficient
polymorphism to allow species-level diagnosis, there is an equal need to consider the requirements
of users and the demands placed upon them when seeking to match experimental data to reference
barcode sequences. A greater level of technical rigour is required to generate reference sequences,
and it is important that appropriate procedural mechanisms are adopted by database providers to
ensure that such data are robust and reliable. If the emergent barcoding resource is to prove
popular, however, it is also vital that the same constraints are not imposed on end-users. One must
balance the desire for universality of amplification, sequence divergence and the provision of
internal procedures to correct sequencing errors.

The volume and taxonomic breadth of plastid genome sequences held in public databases is impressive
and expanding, with 121 eukaryote plastid genomes sequenced in their entirety
(http://www.ncbi.nlm.nih.gov/genomes/ORGANELLES/plastids_tax.html), and information relating to
several loci known for over 1000 species (e.g. http://www.ncbi.nlm.nih.gov). This resource provides
a valuable plat-
matK (31/33) proving to be the best performers, fol- form from which candidate barcoding loci can be
lowed by ndhJ (38/45), rpoC1 (37/46), accD (28/36) sought, although it must be remembered that the
and rpoB (32/43). nucleotide sequence coverage is taxonomically skewed

© 2009 The Linnean Society of London, Botanical Journal of the Linnean Society, 2009, 159, 1–11

and heavily biased towards crop and model species. One key issue is that these data are inevitably insufficient to cover all natural sequence variation (particularly for poorly described loci), and this complicates primer design. We ameliorated this problem by targeting conserved amino acid regions likely to be under functional constraint, such that variation in the corresponding nucleotide sequence should be restricted to synonymous mutations. Our strategy to identify candidate loci was to seek two such regions of sufficient size to allow nested primer design, and in sufficiently close proximity to allow reliable amplification and single-pass sequencing. Experimentally, these precautions appeared to yield reasonable levels of amplification across 98 test taxa, with all loci, except rpl22, producing amplicons in > 40% of taxa with single primer pairs and without PCR optimization. The marked improvement when all possible primer combinations and MT-PCR were applied seemingly validated the approach (success among > 85% of species, > 67% of species pairs).

Data quality is an important determinant of the experimental utility of molecular barcodes and, for this reason, it is widely acknowledged that submissions to public databases as DNA barcodes (i.e. barcode reference sequences) will be subject to several additional requirements, most notably that the raw trace data are used to provide a post-hoc means of sequence validation. Exposing trace data to independent scrutiny will certainly reduce instances of erroneous base calling among reference barcoding sequences, and will permit statistical approaches to characterize base peak and signal baseline fluctuations (for example, Izmailov et al., 2002; Andrade & Manolakos, 2003). Likewise, the users may themselves opt to employ standard quality assurance approaches, such as Phred (Ewing et al., 1998), to verify the quality of electropherogram traces (for example, Lewers et al., 2008). Such measures do not provide protection against the sequencing of heterologous amplicons, although the use of nucleotide similarity searches provides one means of indicating whether the amplicon is likely to originate from the intended plastid locus. In silico translation of nucleotide sequences therefore provides an attractive additional test for the orthology of genic regions, with comparison of amino acid alignments across taxa also revealing those coding regions within the amplicon likely to be under strong selection. We found that all nucleotide and translated sequences of our genic candidates exhibited greatest sequence similarity to their intended targets, although in the preliminary screen the use of translated sequences also uncovered several instances where frame shifts led to the introduction of stop codons in all reading frames. Here, manual correction was relatively simple for the vast majority of instances, implying that most arose from sequencing, annotation or transcriptional artefacts, despite our initial editing of the trace data, utilization of experienced DNA sequence service providers and sequencing in both directions (a small number of sequencing reactions repeatedly failed in one direction). These errors could not be detected or corrected using non-coding sequences as DNA barcodes.

There is now a growing acceptance that the genetic barcoding of land plants requires a multi-locus approach (Chase et al., 2005, 2007; Newmaster et al., 2006). The selection of loci to feature in a multi-locus barcode requires a balance between conflicting factors. Cognizance should be taken of the universality of amplification (with the more variable loci tending also to be the more difficult to amplify across groups), but also whether there is sufficient sequence variation to distinguish between species (with the more variable loci generating greater capacity to effect diagnosis). Newmaster et al. (2006) also considered familiarity and the value of considerable existing sequence resources when suggesting the combination of rbcL with other plastid loci as part of a multi-locus barcode. More recently, Kress & Erickson (2007) evaluated a subset of the loci and primers described here with several other loci on the basis of amplification success and sequence variability. They proposed that psbA-trnH should be combined with part of the rbcL gene (rbcL-a), with the former providing the sequence variation for species identification and the latter providing a less variable taxonomic ‘anchor’. However, inclusion of psbA-trnH exposes the composite barcode to problems of excessive size variation and an absence of corroborative amino acid sequence for end-user sequence annotation. It is the potential corroborative value of the translated sequence that led us to focus on coding regions as potential barcode regions, although reduced sequence variability may necessitate the inclusion of multiple loci.

Previously, Chase et al. (2007) cited preliminary data from the present study to suggest that two triplet loci combinations may be suitable for use as universal barcodes from land plants: rpoC1, rpoB and matK, or rpoC1, matK and psbA-trnH. Comprehensive examination of the data generated provides some support for this combination, but also highlights alternative combinations that may perform a similarly effective role as a multi-locus barcode. When loci were compared individually, matK yielded by far the greatest level of variability. Indeed, the performance of matK was such that there was clear division between combinations that included this locus and those that did not. This feature clearly favoured the inclusion of matK in any locus combination being proposed as the multi-locus barcode, although thought must also be given to its modest performance
when individual matK primers were used, and to the fact that the large size of the amplicon generated also favours its inclusion. The use of nested MT-PCR greatly improved the amplification rates for this locus (85% of species, 67% of species pairs), but still left the prospect of a proportion of taxa for which no sequence was obtained. Although it is possible that a reasonable number of such cases may be resolved by PCR optimization or through the targeted design of taxon-specific primers (Hebert et al., 2004a, b; Ward et al., 2005; Seifert et al., 2007), it is questionable whether this option would be appropriate for certain barcoding applications. For instance, workers interested in establishing community species audits will require broad coverage across all members of the communities being surveyed, and so may be less concerned with resolving problematic groups or with distinguishing between the comparatively infrequent instances of sympatric sister species. Under such circumstances, the additional effort needed to develop new primers or protocols to ‘fill data gaps’ may be unwarranted. Thus, selection of the remaining loci in the multi-locus barcode should ideally aim to complement the deficiency in universality of matK, and yet also maximize the probability of realizing a species diagnosis in the absence of a matK product. Combinations including ndhJ, ycf5 and accD yielded the highest proportion of variable sites, and so appear to be the most natural complements to matK. However, difficulty here resides in taxonomic coverage, with accD known to be systematically absent from grasses (Katayama & Ogihara, 1996), ycf5 from some mosses (Sugiura et al., 2003) and ndhJ from the commercially important genus Pinus (Wakasugi et al., 1994). Given the relatively minor difference in the combined performance of these loci with matK, when compared with the more universally present (and amplifiable) rpoC1 and rpoB, we therefore propose that rpoC1 + rpoB + matK currently represents the most prudent choice for an interim multi-locus coding barcode combination. Although increasing the number of loci increases the number of species pairs that can be distinguished only marginally (from 18 to 19), the expansion in the mean number of differences (20.5 for matK to 29.1 for rpoB + rpoC1 + matK) (Table S11) enhances the capacity to allow for infraspecific variation. Furthermore, given the higher consistency of amplification from rpoB and rpoC1, the use of these markers also provides useful ‘insurance’ for instances in which matK fails to amplify. As supplementary barcoding loci, ndhJ and, to a slightly lesser extent, ycf5 and accD may perform a valuable function for some groups. Although our study encompasses a broad taxonomic spectrum of species pairs, the comparatively limited sampling used restricts our ability to make broad generalizations. This requires a more extensive survey, in which the utility of markers is assessed in greater detail for a number of taxonomically diverse groups. There have only been a small number of such studies to date. In the first, Sass et al. (2007) compared all six loci identified above across 21 species from ten genera of cycads, and achieved greatest sequencing success with rpoC1, which enabled identification to the generic level, but not to species. In concordance with the results presented here, matK offered greater species-level variation in sequence, but performed less well in terms of amplification (Sass et al., 2007). Lahaye et al. (2008) compared the same loci across the taxonomically problematic orchids, and found that, for this group, matK performed best. Similarly, Newmaster et al. (2008) found matK to provide greater resolution at the species level, compared with the other five loci, when applied to a set of 40 taxa (composed of 38 species from one genus) from Myristicaceae known to have a low level of genetic divergence. We await the outcome of similar studies across a wider selection of taxa before finally advocating the optimal composition of a multi-locus barcode for all land plants. In the longer term, we expect that wider access to complete plastid genome sequences will enable finer focusing of loci for use as part of multi-locus barcodes and, as sequencing technology advances for end-users, more complex multi-locus barcodes may become an increasingly attractive proposition, including appropriate loci from the nuclear genome.

ACKNOWLEDGEMENTS

We thank the Alfred P. Sloan Foundation and the Gordon and Betty Moore Foundation for funding.

REFERENCES

Adams KL, Palmer JD. 2003. Evolution of mitochondrial gene content: gene loss and transfer to the nucleus. Molecular Phylogenetics and Evolution 29: 380–395.
Andrade L, Manolakos ES. 2003. Signal background estimation and baseline correction algorithms for accurate DNA sequencing. Journal of VLSI Signal Processing 35: 229–243.
Chase MW, Cowan RS, Hollingsworth PM, van den Berg C, Madriñán S, Petersen G, Seberg O, Jørgensen T, Cameron KM, Carine M, Pedersen N, Hedderson TAJ, Conrad F, Salazar GA, Richardson JE, Hollingsworth ML, Barraclough TG, Kelly L, Wilkinson MJ. 2007. A proposal for a standardised protocol to barcode all land plants. Taxon 56: 295–299.
Chase MW, Salamin N, Wilkinson M, Dunwell JM, Kesanakurthi RP, Haider N, Savolainen V. 2005. Land plants and DNA barcodes: short-term and long-term goals. Philosophical Transactions of the Royal Society, Series B 360: 1889–1895.
Cho Y, Mower JP, Qiu Y-L, Palmer JD. 2004. Mitochondrial substitution rates are extraordinarily elevated and variable in a genus of flowering plants. Proceedings of the National Academy of Sciences of the United States of America 101: 17 741–17 746.
Cho Y, Qiu Y-L, Kuhlman P, Palmer JD. 1998. Explosive invasion of plant mitochondria by a group I intron. Proceedings of the National Academy of Sciences of the United States of America 95: 14 244–14 249.
Evans KM, Wortley AH, Mann DG. 2007. An assessment of potential diatom ‘barcode’ genes (cox1, rbcL, 18S and ITS rDNA) and their effectiveness in determining relationships in Sellaphora (Bacillariophyta). Protist 158: 349–364.
Ewing B, Hillier L, Wendl MC, Green P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research 8: 175–185.
Greenstone MH, Rowley DL, Heimbach U, Lundgren JG, Pfannenstiel RS, Rehner SA. 2005. Barcoding generalist predators by polymerase chain reaction: carabids and spiders. Molecular Ecology 14: 3247–3266.
Haider N. 2003. Development and use of universal primers in plants. PhD Thesis, University of Reading.
Hajibabaei M, Janzen DH, Burns JM, Hallwachs W, Hebert PDN. 2006. DNA barcodes distinguish species of tropical Lepidoptera. Proceedings of the National Academy of Sciences of the United States of America 103: 968–971.
Hebert PDN, Penton EH, Burns JM, Janzen DH, Hallwachs W. 2004b. Ten species in one: DNA barcoding reveals cryptic species in the neotropical butterfly Astraptes fulgerator. Proceedings of the National Academy of Sciences of the United States of America 101: 14 812–14 817.
Hebert PDN, Ratnasingham S, deWaard JR. 2003. Barcoding animal life: cytochrome c oxidase subunit 1 diverges among closely related species. Proceedings of the Royal Society, Series B 270 (Suppl. 1): S96–S99.
Hebert PDN, Stoeckle MY, Zemlak TS, Francis CM. 2004a. Identification of birds through DNA barcodes. PLoS Biology 2: 1657–1663.
Izmailov A, Goloubentzev D, Jin C, Sunay S, Wisco V, Yager TD. 2002. A general approach to the analysis of errors and failure modes in the base-calling function in automated fluorescent DNA sequencing. Electrophoresis 23: 2720–2728.
Jex AR, Smith HV, Monis PT, Campbell BE, Gasser RB. 2008. Cryptosporidium – biotechnological advances in the detection, diagnosis and analysis of genetic variation. Biotechnology Advances 26: 304–317.
Katayama H, Ogihara Y. 1996. Phylogenetic affinities of the grasses to other monocots as revealed by molecular analysis of chloroplast DNA. Current Genetics 29: 572–581.
Kress WJ, Erickson DL. 2007. A two-locus global DNA barcode for land plants: the coding rbcL gene complements the non-coding trnH-psbA spacer region. PLoS One 6: e508.
Kress WJ, Wurdack KJ, Zimmer EA, Weigt LA, Janzen DH. 2005. Use of DNA barcodes to identify flowering plants. Proceedings of the National Academy of Sciences of the United States of America 102: 8369–8374.
Lahaye R, van der Bank M, Bogarin D, Warner J, Pupulin F, Gigot G, Maurin O, Duthoit S, Barraclough TG, Savolainen V. 2008. DNA barcoding the floras of biodiversity hotspots. Proceedings of the National Academy of Sciences of the United States of America 105: 2923–2928.
Lewers KS, Saski CA, Cuthbertson BJ, Henry DC, Staton ME, Main DS, Dhanaraj AL, Rowland LJ, Tomkins JP. 2008. A blackberry (Rubus L.) expressed sequence tag library for the development of simple sequence repeat markers. BMC Plant Biology 8: 69.
Little DP, Stevenson DW. 2007. A comparison of algorithms for the identification of specimens using DNA barcodes: examples from gymnosperms. Cladistics 23: 1–21.
Newmaster SG, Fazekas AJ, Ragupathy S. 2006. DNA barcoding in land plants: evaluation of rbcL in a multigene tiered approach. Canadian Journal of Botany 84: 335–341.
Newmaster SG, Fazekas AJ, Steeves RAD, Janovec J. 2008. Testing candidate plant barcode regions in the Myristicaceae. Molecular Ecology Resources 8: 480–490.
Nolan T, Hands RE, Bustin SA. 2006. Quantification of mRNA using real-time RT-PCR. Nature Protocols 1: 1559–1582.
Robba L, Russell SJ, Barker GL, Brodie J. 2006. Assessing the use of the mitochondrial cox1 marker for use in DNA barcoding of red algae (Rhodophyta). American Journal of Botany 93: 1101–1108.
Sanger F, Nicklen S, Coulson AR. 1977. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America 74: 5463–5467.
Sass C, Little DP, Stevenson DW, Specht CD. 2007. DNA barcoding in the Cycadales: testing the potential of proposed barcoding markers for species identification of cycads. PLoS One 11: e1154.
Seifert KA, Samson RA, deWaard JR, Houbraken J, Levesque CA, Moncalvo J-M, Louis-Seize G, Hebert PDN. 2007. Prospects for fungus identification using CO1 DNA barcodes, with Penicillium as a test case. Proceedings of the National Academy of Sciences of the United States of America 104: 3901–3906.
Shaw J, Lickey EB, Schilling EE, Small RL. 2007. Comparison of whole chloroplast genomes to choose non-coding regions for phylogenetic studies in angiosperms: the tortoise and the hare III. American Journal of Botany 94: 275–288.
Stanley KK, Szewczuk E. 2005. Multiplexed tandem PCR: gene profiling from small amounts of RNA using SYBR Green detection. Nucleic Acids Research 33: e180.
Sugiura C, Kobayashi Y, Aoki S, Sugita C, Sugita M. 2003. Complete chloroplast DNA sequence of the moss Physcomitrella patens: evidence for the loss and relocation of rpoA from the chloroplast to the nucleus. Nucleic Acids Research 31: 5324–5331.
Timmis JN, Ayliffe MA, Huang CY, Martin W. 2004. Endosymbiotic gene transfer: organelle genomes forge eukaryotic chromosomes. Nature Reviews Genetics 5: 123–135.
Wakasugi T, Tsudzuki J, Ito S, Nakashima K, Tsudzuki T, Sugiura M. 1994. Loss of all ndh genes as determined by
sequencing the entire chloroplast genome of the black pine Pinus thunbergii. Proceedings of the National Academy of Sciences of the United States of America 91: 9794–9798.
Ward RD, Zemlak TS, Innes BH, Last PR, Hebert PDN. 2005. DNA barcoding Australia’s fish species. Philosophical Transactions of the Royal Society, Series B 360: 1847–1857.
Wellman CH, Osterloff PL, Mohiuddin U. 2003. Fragments of the earliest land plants. Nature 425: 282–285.

SUPPORTING INFORMATION
Additional Supporting Information may be found in the online version of this article:
Table S1. Gene regions of the Nicotiana plastid genome assessed for their potential as barcoding markers.
Markers in bold selected for primer development and screening
Table S2. Candidate regions selected for development as potential barcoding markers and their primer pairs.
*Primer added after initial screen to improve universality of matK
Table S3. Ninety-eight taxa (46 species pairs and two species triplets) selected for the assessment of candidates
for universal primers for barcoding. All taxa were screened for polymerase chain reaction (PCR) amplification
across 12 candidate loci. *Ten species pairs used in the first round of sequencing to screen polymorphisms across
11 candidate loci. All taxa were screened in the second round of sequencing to identify sequence polymorphisms
across six candidate loci
Table S4. European Molecular Biology Laboratory (EMBL) sequence accession numbers
Table S5. Results of the initial screen for sequence polymorphisms across ten species pairs. Percentage of
variable sites (ranked from most to least variable): calculated as the percentage of single nucleotide polymor-
phisms (SNPs) and indel sites of the total number of bases for each marker. P (ranked from the highest to the
lowest) represents the proportion of species pairs discriminated by each marker
Table S6. Distribution of stop codons in translated sequences of target loci. Stop codons detected in translated
nucleotide sequences during the initial screen for polymorphisms of 11 loci against ten species pairs
Table S7. Primer universality. Results of the application of all four possible primer pairs for the six primary
loci to the full set of 98 taxa. 1, amplification; 0, no amplification; dark shade, failed after both standard and
multiplexed tandem polymerase chain reaction (MT-PCR)
Table S8. Number of single nucleotide polymorphisms (SNPs) and indels per marker across 48 species pairs
(first two species in the two species triplets analysed as pairs). Additional sequences for each locus from Pinus
koraiensis and P. thunbergii were downloaded from GenBank and added to this dataset (accessions NC004677,
NC001631). Only sequence data corresponding to the section amplified by the primer pairs presented here were
aligned and submitted to SNP analysis protocols. The results appear in this table as Pinus2. Num diffs, number
of SNPs and indels between species pair. Seq len, length of sequence generated. *Extra species pair comparisons
secured by multiplexed tandem polymerase chain reaction (MT-PCR) or additional primer (matKX) (see also
Table S9)
Table S9. Sequences generated using multiplexed tandem polymerase chain reaction (MT-PCR)
Table S10. Numbers of single nucleotide polymorphisms (SNPs) and indels per marker across the 19 species
pairs for which sequence data were obtained across all loci. Num diffs, number of SNPs and indels between
species pair. Seq len, length of sequence generated
Table S11. Marker combination analysis for the 19 species pairs for which sequence data were obtained for all
loci
Program S1. barcode.c.
Program S2. barcode2.c.
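
The marker combination analysis summarized in Table S11 can be approximated with a short sketch. The Python below is illustrative only and is not the authors' barcode.c or barcode2.c. It assumes that the mean number of differences for a combination of loci is simply the sum of the per-locus means over the 19 species pairs sequenced at all loci (the Table S10 values quoted in the Results). The names MEAN_DIFFS, combined_means and range_by_locus are hypothetical.

```python
from itertools import combinations

# Mean differences per species pair for the 19 pairs sequenced at all loci,
# as quoted in the Results (Table S10).
MEAN_DIFFS = {
    "matK": 20.5, "ndhJ": 7.0, "ycf5": 6.4,
    "rpoC1": 4.5, "rpoB": 4.0, "accD": 3.4,
}


def combined_means(size):
    """Map every combination of `size` loci to its summed mean difference.

    Assumption: combination scores are additive across loci.
    """
    return {
        combo: round(sum(MEAN_DIFFS[locus] for locus in combo), 1)
        for combo in combinations(MEAN_DIFFS, size)
    }


def range_by_locus(size, locus="matK"):
    """Return ((min, max) with `locus`, (min, max) without `locus`)."""
    scores = combined_means(size)
    with_locus = [v for combo, v in scores.items() if locus in combo]
    without = [v for combo, v in scores.items() if locus not in combo]
    return (min(with_locus), max(with_locus)), (min(without), max(without))


if __name__ == "__main__":
    for size in (3, 5):
        print(size, range_by_locus(size))
```

Under this additive assumption, range_by_locus(3) gives 27.9–33.9 for the ten triplets containing matK and 11.9–17.9 for the ten without it, and range_by_locus(5) gives 38.8–42.4 with matK and 25.3 without, which agree with the ranges quoted in the Results.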
Please note: Wiley-Blackwell are not responsible for the content or functionality of any supporting materials
supplied by the authors. Any queries (other than missing material) should be directed to the corresponding
author for the article.
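
The in silico translation check described in the Discussion (stop codons per locus are tabulated in Table S6) can be sketched as follows. This is a minimal illustration, not the procedure used in the study: it assumes the standard codon assignments (which translation table 11, commonly applied to plastid genes, shares), considers the three forward reading frames only, and counts every stop codon because a barcode amplicon is expected to be an internal fragment of a gene. The function names and example sequences are hypothetical.

```python
from itertools import product

# Standard codon table built in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame=0):
    """Translate complete codons of `seq` starting at offset `frame` (0-2).

    Codons containing ambiguous bases are rendered as 'X'.
    """
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def stop_counts(seq):
    """Number of stop codons in each of the three forward frames."""
    return [translate(seq, frame).count("*") for frame in range(3)]


def needs_review(seq):
    """True when every forward frame contains at least one stop codon,
    the pattern attributed in the text to frame shifts."""
    return all(count > 0 for count in stop_counts(seq))


if __name__ == "__main__":
    for example in ("ATGGCTAAAGGTTGGCCA", "TAAATAGATGAC"):
        print(example, stop_counts(example), needs_review(example))
```

A sequence flagged by needs_review would then be inspected against its trace data, in line with the manual correction described in the Discussion.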
