Plant Genotyping 1

Methods in
Molecular Biology 1245
Jacqueline Batley Editor
Plant
Genotyping
Methods and Protocols
METHODS IN MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK
For further volumes:

http://www.springer.com/series/7651
Plant Genotyping
Methods and Protocols
Edited by
Jacqueline Batley
School of Plant Biology, The University of Western Australia,
Crawley, WA, Australia
ISSN 1064-3745 ISSN 1940-6029 (electronic)

ISBN 978-1-4939-1965-9 ISBN 978-1-4939-1966-6 (eBook)
DOI 10.1007/978-1-4939-1966-6
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2014952462
© Springer Science+Business Media New York 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this
legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for
the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions
for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution
under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and
regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither
the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be
made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Humana Press is a brand of Springer

Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Plant genotyping is a rapidly advancing field. The ability to produce vast amounts of DNA
sequence data has enabled the discovery of molecular markers in a vast array of plant species,
meaning that genotyping, rather than marker development, becomes the rate-limiting factor. This
volume is aimed at plant biologists working on plants ranging from model organisms and crops to
orphan species, and it covers all the different marker types available. The volume would
also be of interest to researchers who would benefit from an introduction to the different
marker systems available for plant research.
Plant genotyping is required for a variety of end uses including marker-assisted selec-
tion, associating phenotype with polymorphism, DNA barcoding, genetic diversity analysis,
conservation genetics, and improving genome assemblies. The most suitable genotyping
system to use depends on the throughput requirements, facilities available, and questions
to be answered. Chapters within this volume focus on the diverse range of genotyping
methods available, with guidelines as to what methods may be suitable for the different
needs of the researchers. Overviews are provided in the early chapters. Given the issues with
polyploidy in some plant species, information is included describing how to handle data from
polyploids. Information is also provided on bioinformatics tools for marker discovery, databases
hosting existing markers, and software for data analysis. Chapters providing details on spe-
cific genotyping methods are then included.
Scientific research progresses rapidly and the technologies for genotyping evolve with
it. In this volume we have covered the different methods available to date, many of which
will continue to increase in throughput as the underlying technologies improve. Researchers are
encouraged to review frequently which method may be the most applicable for their
research.
Crawley, WA, Australia Jacqueline Batley
Contents
Preface
Contributors
1 Advances in Plant Genotyping: Where the Future Will Take Us
Dhwani A. Patel, Manuel Zander, Jessica Dalton-Morgan, and Jacqueline Batley
2 Molecular Marker Applications in Plants
Alice C. Hayward, Reece Tollenaere, Jessica Dalton-Morgan
3 Bioinformatics: Identification of Markers from Next-Generation Sequence Data
Pradeep Ruperao and David Edwards
4 Molecular Marker Databases
Kaitao Lai, Michał Tadeusz Lorenc, and David Edwards
5 Plant Genotyping Using Fluorescently Tagged Inter-Simple Sequence Repeats (ISSRs): Basic Principles and Methodology
Linda M. Prince
6 SSR Genotyping
Annaliese S. Mason
7 Genotyping Analysis Using an RFLP Assay
Shutao Dai and Yan Long
8 DNA Barcoding for Plants
Natasha de Vere, Tim C.G. Rich, Sarah A. Trinder, and Charlotte Long
9 Multiplexed Digital Gene Expression Analysis for Genetical Genomics in Large Plant Populations
Christian Obermeier, Bertha M. Salazar-Colqui, Viola Spamer, and Rod Snowdon
10 SNP Genotyping by Heteroduplex Analysis
Norma Paniego, Corina Fusari, Verónica Lia, and Andrea Puebla
11 Application of the High-Resolution Melting Technique for Gene Mapping and SNP Detection in Plants
David Chagné
12 Challenges of Genotyping Polyploid Species
Annaliese S. Mason
13 Genomic Reduction Assisted Single Nucleotide Polymorphism Discovery Using 454-Pyrosequencing
Peter J. Maughan, Joshua A. Udall, and Eric N. Jellen
14 Inter-SINE Amplified Polymorphism (ISAP) for Rapid and Robust Plant Genotyping
Torsten Wenke, Kathrin M. Seibt, Thomas Döbel, Katja Muders, and Thomas Schmidt
15 Screening of Mutations by TILLING in Plants
Nian Wang and Lei Shi
16 Gene Analysis Using Mass Spectrometric Cleaved Amplified Polymorphic Sequence (MS-CAPS) with Matrix-Assisted Laser Desorption Ionization Time-of-Flight Mass Spectrometry (MALDI-TOF)
Hideyuki Kajiwara
17 Quantitative SNP Genotyping of Polyploids with MassARRAY and Other Platforms
Marcelo Mollinari and Oliver Serang
18 SNP Genotyping Using KASPar Assays
Scott M. Smith and Peter J. Maughan
19 Skim-Based Genotyping by Sequencing
Agnieszka A. Golicz, Philipp E. Bayer, and David Edwards
20 The Restriction Enzyme Target Approach to Genotyping by Sequencing (GBS)
Elena Hilario
21 Methods for the Design, Implementation, and Analysis of Illumina Infinium™ SNP Assays in Plants
David Chagné, Luca Bianco, Cindy Lawley, Diego Micheletti, and Jeanne M.E. Jacobs
22 Use of the Illumina GoldenGate Assay for Single Nucleotide Polymorphism (SNP) Genotyping in Cereal Crops
Shiaoman Chao and Cindy Lawley
Index
Contributors
JACQUELINE BATLEY • School of Agriculture and Food Sciences, University of Queensland,
Brisbane, QLD, Australia; Centre for Integrative Legume Research, University
of Queensland, Brisbane, QLD, Australia; School of Plant Biology, The University of Western
Australia, Crawley, WA, Australia
PHILIPP E. BAYER • School of Agriculture and Food Sciences, University of Queensland,
Brisbane, QLD, Australia; Australian Centre for Plant Functional Genomics, University
of Queensland, Brisbane, QLD, Australia
LUCA BIANCO • Computational Biology Platform-HPC, FEM Research and Innovation
Center, San Michele all’Adige, TN, Italy
DAVID CHAGNÉ • The New Zealand Institute for Plant & Food Research Limited,
Palmerston North Research Centre, Palmerston North, New Zealand
SHIAOMAN CHAO • USDA-ARS, Biosciences Research Lab, Fargo, ND, USA
SHUTAO DAI • National Key Laboratory of Crop Genetic Improvement, Huazhong
Agricultural University, Wuhan, China
JESSICA DALTON-MORGAN • School of Agriculture and Food Sciences, University of
Queensland, Brisbane, QLD, Australia; Centre for Integrative Legume Research,
University of Queensland, Brisbane, QLD, Australia
THOMAS DÖBEL • Department of Dermatology, University of Heidelberg, Heidelberg,
Germany
DAVID EDWARDS • School of Agriculture and Food Sciences, University of Queensland,
Brisbane, Australia; Australian Centre for Plant Functional Genomics, University
of Queensland, Brisbane, Australia
CORINA FUSARI • Instituto de Biotecnología, Centro de Investigación en Ciencias
Veterinarias y Agronómicas (CICVyA), Instituto Nacional de Tecnología Agropecuaria
(INTA), Nicolas Repeto y Los Reseros, Hurlingham, Buenos Aires, Argentina; System
Regulation Group, Metabolic Networks Department, Max Planck Institute of Molecular
Plant Physiology, Hurlingham, Buenos Aires, Argentina
AGNIESZKA A. GOLICZ • School of Agriculture and Food Sciences, University
of Queensland, Brisbane, Australia; Australian Centre for Plant Functional Genomics,
University of Queensland, Brisbane, Australia
ALICE C. HAYWARD • School of Agriculture and Food Sciences, University of Queensland,
ELENA HILARIO • The New Zealand Institute for Plant and Food Research,
Auckland, New Zealand
JEANNE M.E. JACOBS • The New Zealand Institute for Plant & Food Research Ltd.,
Christchurch, New Zealand
ERIC N. JELLEN • 4105B LSB, Department of Plant and Wildlife Sciences,
Brigham Young University, Provo, UT, USA
HIDEYUKI KAJIWARA • National Institute of Agrobiological Sciences, Tsukuba,
Ibaraki, Japan
KAITAO LAI • School of Agriculture and Food Sciences, University of Queensland, Brisbane,
Australia; Australian Centre for Plant Functional Genomics, University of Queensland,
Brisbane, Australia
CINDY LAWLEY • Illumina Inc., Hayward, CA, USA
VERÓNICA LIA • Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
Buenos Aires, Argentina; Facultad de Ciencias Exactas y Naturales. Universidad de
Buenos Aires; Instituto de Biotecnología, Centro de Investigación en Ciencias Veterinarias
y Agronómicas (CICVyA), Instituto Nacional de Tecnología Agropecuaria (INTA),
Buenos Aires, Argentina
CHARLOTTE LONG • National Botanic Garden of Wales, Llanarthne, UK; Institute of
Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
YAN LONG • National Key Laboratory of Crop Genetic Improvement, Huazhong
Agricultural University, Wuhan, China; Institute of Biotechnology, Chinese Academy
of Agricultural Science, Beijing, China
MICHAŁ TADEUSZ LORENC • School of Agriculture and Food Sciences, University of
Queensland, Brisbane, Australia; Australian Centre for Plant Functional Genomics,
University of Queensland, Brisbane, Australia
ANNALIESE S. MASON • School of Agriculture and Food Sciences, University of Queensland,
PETER J. MAUGHAN • 5144 LSB, Department of Plant and Wildlife Sciences,
Brigham Young University, Provo, UT, USA
DIEGO MICHELETTI • IRTA, Center for Research in Agricultural Genomics
CSIC-IRTA-UAB-UB, Bellaterra (Cerdanyola del Vallès), Barcelona, Spain
MARCELO MOLLINARI • University of São Paulo ESALQ, Piracicaba, SP, Brazil
KATJA MUDERS • NORIKA GmbH, Groß Lüsewitz, Germany
CHRISTIAN OBERMEIER • Department of Plant Breeding, Justus Liebig University Giessen,
Giessen, Germany
NORMA PANIEGO • Instituto de Biotecnología, Centro de Investigación en Ciencias
Veterinarias y Agronómicas (CICVyA), Instituto Nacional de Tecnología Agropecuaria
(INTA), Nicolas Repeto y Los Reseros, Hurlingham, Buenos Aires, Argentina; Consejo
Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina
DHWANI A. PATEL • School of Agriculture and Food Sciences, University of Queensland,
Brisbane, QLD, Australia; Centre for Integrative Legume Research, University of
Queensland, Brisbane, QLD, Australia
LINDA M. PRINCE • Department of Botany, The Field Museum, Chicago, IL, USA
ANDREA PUEBLA • Instituto de Biotecnología, Centro de Investigación en Ciencias
Veterinarias y Agronómicas (CICVyA), Instituto Nacional de Tecnología Agropecuaria
(INTA), Nicolas Repeto y Los Reseros, Buenos Aires, Argentina
TIM C.G. RICH • Department of Biodiversity and Systematic Biology, National Museum,
Wales, Cardiff, UK
PRADEEP RUPERAO • School of Agriculture and Food Sciences, University of Queensland,
Brisbane, Australia; Australian Centre for Plant Functional Genomics, University
of Queensland, Brisbane, Australia
BERTHA M. SALAZAR-COLQUI • Department of Plant Breeding, Justus Liebig University
Giessen, Giessen, Germany
THOMAS SCHMIDT • Institute of Botany, Technische Universität Dresden, Dresden, Germany
KATHRIN M. SEIBT • Institute of Botany, Technische Universität Dresden, Dresden, Germany
OLIVER SERANG • Boston Children’s Hospital, Boston, MA, USA; Harvard Medical School,
Boston, MA, USA
LEI SHI • National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural
University, Wuhan, China
SCOTT M. SMITH • Department of Plant and Microbial Biology, North Carolina State
University, Kannapolis, NC 28081, USA
ROD SNOWDON • Department of Plant Breeding, Justus Liebig University Giessen, Giessen,
Germany
VIOLA SPAMER • Department of Plant Breeding, Justus Liebig University Giessen, Giessen,
Germany
REECE TOLLENAERE • School of Agriculture and Food Sciences, University of Queensland,
SARAH A. TRINDER • National Botanic Garden of Wales, Llanarthne, UK
JOSHUA A. UDALL • 5133 LSB, Department of Plant and Wildlife Sciences, Brigham
Young University, Provo, UT, USA
NATASHA DE VERE • National Botanic Garden of Wales, Llanarthne, UK; Institute of
Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
NIAN WANG • Key Laboratory of Plant Germplasm Enhancement and Speciality
Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China
TORSTEN WENKE • Institute of Botany, Technische Universität Dresden, Dresden, Germany
MANUEL ZANDER • School of Agriculture and Food Sciences, University of Queensland,
Brisbane, QLD, Australia; Centre for Integrative Legume Research, University of
Queensland, Brisbane, QLD, Australia
Chapter 1
Advances in Plant Genotyping: Where the Future Will Take Us
Dhwani A. Patel, Manuel Zander, Jessica Dalton-Morgan, and Jacqueline Batley
Abstract
Genetic diversity between individuals can be tracked and monitored using a range of molecular markers.
These markers can detect variation ranging in scale from a single base pair up to duplications and
translocations of entire chromosomal regions. The genotyping of individuals allows the detection of this
variation and has been successfully applied in plant science for many years. The increasing amounts of
sequence data that can be generated using next-generation sequencing (NGS) technologies have produced a
vast expansion in the rate of discovery of polymorphisms, with single nucleotide polymorphisms (SNPs)
predominating as the marker of choice. This increase in polymorphic marker resources through efficient
discovery, coupled with the utility of SNPs, has enabled the shift to high-throughput genotyping assays.
These methods are reviewed and discussed here, alongside the recent innovations allowing increased throughput.
Key words Single nucleotide polymorphisms (SNPs), Next-generation sequencing (NGS),
Genotyping by sequencing (GBS), Bioinformatics
1 Common Molecular Markers and Genotyping Methods
The application to which a genetic marker is best suited depends
on its physical properties and genomic location, the cost involved,
ease of use, and degree of throughput required. In the past the
physical location of a genetic marker was commonly unknown and
not necessary for purposes of diversity and evolutionary analyses
and breeding applications. With advances in genome sequencing
technologies, genetic markers with a known genomic location and
environment are becoming more popular and applicable to an
increasingly diverse and high-throughput range of objectives.
Molecular markers can be classified into groups based
on: (1) requirement for prior sequence information, (2) mode of
transmission (biparental or uniparental; nuclear or organellar inheritance),
(3) number of loci per marker (single or multiple) and
mode of interaction (dominant or codominant), and (4) method of
analysis (hybridization-based, PCR-based, next-generation technology).
In the early days of molecular marker development, specific
sequence information was often unknown and individuals were
distinguished based on random amplification of PCR fragments,
restriction digestion patterns, DNA hybridization, or a combination
of these. With advances in whole genome sequencing technologies
and associated reduced costs, sequence-based molecular
markers such as SNPs (single nucleotide polymorphisms) and,
most recently, approaches such as GBS (genotyping by sequencing) are
becoming more popular. Moreover, these sequence-based markers are
inherently able to capture vast amounts of variation at single-base
resolution, making them particularly useful for the detection of perfect
markers (DNA polymorphisms causally linked to traits of interest)
and for the discovery and analysis of the alleles involved.
Jacqueline Batley (ed.), Plant Genotyping: Methods and Protocols, Methods in Molecular Biology, vol. 1245,
DOI 10.1007/978-1-4939-1966-6_1, © Springer Science+Business Media New York 2015
1.1 Restriction Fragment Length Polymorphisms (RFLPs)
RFLPs are hybridization-based codominant markers that detect
changes in restriction fragment lengths due to DNA variation
(e.g., SNPs or indels) at restriction recognition sites [1]. They
are locus specific and highly heritable. Following restriction digestion,
fragments are separated, hybridized to locus-specific probes,
and visualized. However, the reagents used are toxic and expensive
and the entire process is time consuming [2]. These limitations led
to a loss in their popularity and created a need for more sophisticated
markers.
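The way a single-base change at a recognition site alters the fragment pattern can be illustrated with a small in silico digest. The Python sketch below is not part of the original chapter: the two allele sequences are invented for illustration, and only the EcoRI recognition site (GAATTC, cut after the G) is a real enzyme property.

```python
def digest(seq: str, site: str, cut_offset: int) -> list[int]:
    """Return fragment lengths after cutting seq at every occurrence of site.

    cut_offset is the number of bases of the site left on the upstream fragment.
    """
    cuts = []
    start = 0
    while True:
        i = seq.find(site, start)
        if i == -1:
            break
        cuts.append(i + cut_offset)
        start = i + 1
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Hypothetical alleles: allele_b carries a SNP (A to G) that destroys the first site.
allele_a = "ATGC" * 5 + "GAATTC" + "ATGC" * 10 + "GAATTC" + "ATGC" * 3
allele_b = allele_a.replace("GAATTC", "GAGTTC", 1)

fragments_a = digest(allele_a, "GAATTC", 1)  # [21, 46, 17]
fragments_b = digest(allele_b, "GAATTC", 1)  # [67, 17]
```

A probe hybridizing to the first 21 bases would detect a 21-base fragment for the first allele and a 67-base fragment for the second, which is the length polymorphism an RFLP assay scores.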
1.2 Amplified Fragment Length Polymorphisms (AFLPs)
Restriction digestion combined with polymerase chain reaction
(PCR) technology gave rise to AFLPs, a multilocus fingerprinting
technology that can be applied to DNA of any origin [1]. These markers have
been used to study polymorphisms at multiple loci in germplasm, for
trait mapping, for creating linkage groups in crosses, and for constructing
high-density genetic maps [1, 2]. No prior sequence information
is required, but the physical location of the markers is usually unknown.
AFLPs combine two methods, restriction fragment analysis and
PCR amplification [3]. AFLP markers detect polymorphisms among
amplified genomic restriction fragments. A subset of adapter-ligated
restriction fragments, which samples a fraction of the genome, is
selected by using primers that extend beyond the restriction sites [3]. Several
techniques ranging from agarose gel electrophoresis to automated
genotyping can be used to score these AFLP-PCR products.
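The effect of extending primers beyond the restriction sites can be approximated with simple arithmetic. The sketch below rests on an assumption that is not stated in the chapter: that the four bases are equally frequent, so each selective nucleotide retains roughly one quarter of the fragments. The function names and the fragment count are hypothetical.

```python
def selected_fraction(n_selective_left: int, n_selective_right: int) -> float:
    """Approximate fraction of fragments amplified, assuming equal base frequencies."""
    return 0.25 ** (n_selective_left + n_selective_right)


def expected_bands(total_fragments: int, n_left: int, n_right: int) -> float:
    """Expected number of amplified fragments for a given primer combination."""
    return total_fragments * selected_fraction(n_left, n_right)


# Hypothetical digest yielding 409,600 fragments, three selective bases per primer.
bands = expected_bands(409_600, 3, 3)  # 100.0
```

Under this simplified model, adding one selective nucleotide to either primer reduces the expected number of bands about fourfold, which is how the complexity of the fingerprint can be tuned to genome size.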
1.3 Randomly Amplified Polymorphic DNAs (RAPDs)
RAPD markers are random DNA segments amplified via PCR using short
primers of around 10 bp [3]. Primers this short are likely to find
complementary sequences at multiple sites in the genome, which are
then amplified. Polymorphism arises from length variation between
priming sites and from the presence/absence of
priming sites. The final products can be visualized using agarose
gel electrophoresis. Due to the low selectivity of short primers,
this method increases the chances of nonspecific priming and
therefore artifacts [3]. Another issue with RAPD markers is their
dominance. If an allele at a RAPD site cannot be amplified,
marker/marker homozygotes cannot be differentiated from
marker/null heterozygotes [4].
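The consequence of dominance for scoring can be made explicit with a toy model. In the Python sketch below, which is an illustration rather than a published scoring procedure, "M" denotes an amplifiable allele and "-" a null allele; both symbols are hypothetical labels.

```python
def band_present(genotype: tuple[str, str]) -> bool:
    """A band is visible on the gel if at least one allele can be amplified."""
    return "M" in genotype


genotypes = {
    "marker/marker": ("M", "M"),
    "marker/null": ("M", "-"),
    "null/null": ("-", "-"),
}

# Observed phenotype (band or no band) for each underlying genotype.
phenotypes = {name: band_present(alleles) for name, alleles in genotypes.items()}
```

Three genotypes collapse into two observable classes, so the heterozygote cannot be told apart from the marker homozygote.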
1.4 Simple Sequence Repeats (SSRs)/Microsatellites
SSRs, also known as microsatellites, are tandemly repeated short
DNA stretches that can occur as mono-, di-, tri-, tetra-, penta-, and
hexanucleotides [5]. The number of repeated units is affected by
mutations, which makes SSRs highly polymorphic. These markers
have several beneficial attributes such as their abundance in the
genome, high reproducibility, multiallelic variability, and genetic
codominance. SSRs have a wide range of applications, including
marker-assisted selection (MAS), genetic diversity analysis, genetic
mapping, and phenotype mapping [5]. They also
allow for transferability between species because primers designed
in one species often amplify corresponding loci in related species,
and the information gained can be used for comparative analyses.
Some SSR sequences have been implicated in gene
function and expression, as transcriptional activating elements, and
those SSRs present in noncoding regions may have a functional
significance [5]. Drawbacks such as varying abundance of markers
in different species, reduced frequency of SSRs in plant genomes
relative to animal genomes, and the degree of optimization required in
each new species limit their use [6].
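Locating perfect tandem repeats of one to six nucleotides in a sequence can be sketched with a regular expression. The example below is a minimal illustration, not one of the dedicated SSR discovery tools; the function name, the minimum repeat threshold, and the test sequence are assumptions made for this example.

```python
import re


def find_ssrs(seq: str, min_repeats: int = 4, max_unit: int = 6):
    """Return (start, repeat_unit, repeat_count) for perfect tandem repeats.

    The repeat unit is matched lazily so that the shortest unit is reported.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq.upper())
    ]


# Hypothetical sequence containing an (AG)6 and a (GAT)5 repeat.
sequence = "TTC" + "AG" * 6 + "CCA" + "GAT" * 5 + "AC"
ssrs = find_ssrs(sequence)  # [(3, 'AG', 6), (18, 'GAT', 5)]
```

Two alleles that differ in repeat count, for example (AG)6 and (AG)8, give PCR products that differ in length by a multiple of the unit size, which is what SSR genotyping measures.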
1.5 Restriction-Site Associated DNA (RAD)
RADs are short DNA stretches adjacent to each recognition site of a
given restriction enzyme [7] and are useful in reducing the complexity of a
genome [1]. The latest advances in using RAD markers include
sequencing RAD tags for single nucleotide polymorphism (SNP)
discovery and genotyping. This has proven effective in discovering
polymorphic markers even in organisms with low polymorphism.
Because only a reduced portion of the genome is represented, the nucleotides next
to the recognition sites can be sequenced at high depth for SNP
detection. The user can also adjust the number of markers
through the choice of restriction enzymes. This method can be
used for bulk segregant analysis by genotyping pooled populations
and multiplexed samples [1].
1.6 Allele-Specific Associated Primers (ASAPs)
ASAPs is a method in which at least one PCR primer is chosen so that
it spans a polymorphism (usually at its 3′ end). In
regular PCR-based reactions, by contrast, nonpolymorphic primers are
used to amplify a polymorphic region between them [8]. Under
stringent PCR conditions, matched primers amplify
the required fragment whereas mismatched primers do not allow
amplification. The appearance of an amplicon on an agarose gel thus
resolves the DNA polymorphism as a presence/absence
pattern [8]. The main benefit of this method over similar
methods of its era was the enhanced throughput achievable, as it
involved fewer steps and was more easily applied to a large number
of samples. Cost savings could also be achieved with this method
[8]. Variations of ASAPs are still in use, with a recent study applying
this method in Brassica oleracea, Brassica napus, and Sesamum
indicum [9].
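The presence/absence logic of an allele-specific primer can be sketched as follows. This is a deliberately simplified model, not a primer design tool: the primer is written in the same orientation as the template strand shown, complementarity and melting temperature are ignored, and "stringent conditions" are modeled as requiring a perfect match including the 3′ terminal base. All sequences are invented for illustration.

```python
def primer_extends(primer: str, template: str, start: int) -> bool:
    """Under the stringent model, extension requires a perfect match at the site."""
    site = template[start:start + len(primer)]
    return site == primer


UPSTREAM = "ACGTTGCAAGCTTGGA"   # hypothetical sequence preceding the SNP
DOWNSTREAM = "GGTACCTTAG"       # hypothetical sequence following the SNP

allele_c = UPSTREAM + "C" + DOWNSTREAM
allele_t = UPSTREAM + "T" + DOWNSTREAM

# Allele-specific primer whose 3' terminal base sits on the SNP (C allele).
primer_c = UPSTREAM[1:] + "C"

band_on_c = primer_extends(primer_c, allele_c, 1)  # True: amplicon present
band_on_t = primer_extends(primer_c, allele_t, 1)  # False: amplicon absent
```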
1.7 Single Nucleotide Polymorphisms (SNPs)
SNPs are the most abundant markers present in a genome [2].
They have become the most popular choice of marker for several
genetic analyses. A SNP can be defined as a nucleotide difference
between two individuals at a particular locus [5]. The three forms
of SNPs are transitions (C/T, G/A), transversions (C/G, A/T,
T/G, C/A), and insertions/deletions (indels). C/T SNPs tend to
be more frequent outside of transcribed regions as a result of
increased cytosine methylation and amplified cytosine deamination
(reviewed in [10]).
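The three forms listed above follow from base chemistry: a transition exchanges a purine for a purine (A/G) or a pyrimidine for a pyrimidine (C/T), whereas a transversion exchanges one class for the other. A minimal Python sketch of this classification is given below; the function name and the use of "-" for a missing base are assumptions of this example.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify_snp(ref: str, alt: str) -> str:
    """Classify a variant as 'transition', 'transversion', or 'indel'."""
    ref, alt = ref.upper(), alt.upper()
    # Anything that is not a one-base-for-one-base change is treated as an indel.
    if len(ref) != 1 or len(alt) != 1 or "-" in (ref, alt):
        return "indel"
    pair = {ref, alt}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a valid substitution: {ref}/{alt}")
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"
```

For example, `classify_snp("C", "T")` gives "transition", `classify_snp("C", "G")` gives "transversion", and `classify_snp("A", "-")` gives "indel".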
SNPs have many features that make them the ideal choice as a
molecular marker. They occur abundantly in the genome, are
relatively stable during evolution, and have a low mutation rate.
Such molecular markers are good tools to analyze the various
processes encompassing the population and evolutionary genetics
of an organism. These include mating systems, patterns of specia-
tion and dispersal, and comparative genomics [11]. SNPs are also
excellent genetic markers for high-density genetic map construction
for the genetic and physical mapping of genomes, trait mapping
and association, and linkage disequilibrium (LD) studies. In agri-
culture, these properties enable SNPs to be applied to genetic diag-
nostics, germplasm identification, and marker-assisted selection for
breeding programs.
The usefulness of SNPs for various applications depends on
their genomic location and environment. Genic SNPs are identified
within expressed sequences from available EST databases or
next-generation transcriptome sequencing data [12–17]. These SNPs
can be either synonymous (leaving the encoded amino acid unchanged)
or nonsynonymous (changing it). Nonsynonymous SNPs may be linked directly to gene
function or be “perfect” markers by altering protein structure or
function. Genic SNPs are often selected against, which can be
observed in the lower frequency of nonsynonymous relative to
synonymous base changes in gene regions, and this can lead to an
underestimation of true SNP number and reduced resolution for genetic
diversity studies (reviewed in [18]). Genic SNPs are also limited to
actively transcribed or gene-rich regions of the genome. The existence
of duplicated loci and highly conserved gene family members,
especially in polyploid species, can compromise the applicability
of genic SNPs to downstream applications such as association
mapping and LD studies [12, 19].
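Whether a genic SNP is synonymous or nonsynonymous can be checked by translating the affected codon before and after the change. The sketch below uses the standard genetic code; the function name, the 0-based position convention, and the nine-base coding sequence are assumptions of this example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def snp_effect(cds: str, pos: int, alt: str) -> str:
    """Classify a substitution at 0-based position pos of an in-frame coding sequence."""
    codon_start = pos - pos % 3
    ref_codon = cds[codon_start:codon_start + 3]
    offset = pos % 3
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    same = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same else "nonsynonymous"


cds = "ATGGCTGAA"  # hypothetical coding sequence: Met-Ala-Glu
effect_third_base = snp_effect(cds, 5, "C")  # GCT -> GCC, both Ala
effect_first_base = snp_effect(cds, 6, "A")  # GAA -> AAA, Glu -> Lys
```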
With the recent advances in whole genome sequencing (WGS)
technologies, genomic SNPs are increasing in popularity and
accessibility [20]. Genomic SNPs can be identified from any sequenced
region in the genome, minimizing problems from duplicated genic
regions conserved within and between genomes. Furthermore, the
majority of genomic SNPs are free of selective pressure, making
them evolutionarily neutral and allowing a more complete estimate of
diversity levels [10]. Several high-throughput SNP genotyping
platforms are commercially available today; these are detailed
in later sections.
1.8 Reduced-Representation Libraries (RRLs) and Complexity Reduction of Polymorphic Sequences (CRoPS)
Advances in genome sequencing technologies have paved the way
for significant improvements in the rapid detection of genetic variation
as well as the throughput and wealth of the information
obtained. Using reduced-representation sequencing, which involves
sequencing a targeted subset of genomic regions rather than the entire
genome, individuals can be directly compared for sequence variations.
Partial, but genome-wide, coverage is obtained by digesting
samples from multiple individuals with a frequently
cutting restriction enzyme and pooling them [21]. The fragments of desired size are
then selected and sequenced at high depth at a lower cost than full
genome sequencing. Reads from reduced-representation sequencing
can be mapped to a reference genome for polymorphism detection,
SNP calling, and haplotype analysis (where adjacent SNPs are
inherited as a conserved block of sequence). In the absence of a
reference, paired-end sequencing reads from any second generation
sequencing (SGS) platform, or long reads from the Roche Genome
Sequencer, can be used to assemble the fragments. However, this
method is not suitable for genomes with high ploidy
levels or large repetitive genome fractions [1].
CRoPS was the first method that used sequence identifiers, or
barcodes, to uniquely tag sequence reads of an individual DNA
sample, enabling multiplexing of samples on one lane of any SGS
platform for polymorphism identification and population studies
[21]. Studies in maize have demonstrated the applicability of
CRoPS [22, 23]. Barcodes can also be applied to RRLs as long as
fragment size is selected individually for each sample before
pooling.
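Assigning pooled reads back to individual samples by their barcodes can be sketched in a few lines. The example below is a minimal illustration that assumes exact barcode matches at the start of each read and barcodes of which none is a prefix of another; the barcode sequences, sample names, and reads are hypothetical, and production pipelines typically also tolerate sequencing errors in the barcode.

```python
from collections import defaultdict


def demultiplex(reads: list[str], barcodes: dict[str, str]) -> dict[str, list[str]]:
    """Group reads by sample, trimming the leading barcode from each read."""
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched: keep the read untrimmed for inspection.
            by_sample["unassigned"].append(read)
    return dict(by_sample)


barcodes = {"ACGT": "plant_1", "TGCA": "plant_2"}
reads = ["ACGTGGATTC", "TGCAGGATCC", "ACGTGGATCC", "GGGGGGATCC"]
assigned = demultiplex(reads, barcodes)
```

Here `assigned["plant_1"]` holds `["GGATTC", "GGATCC"]`, so after demultiplexing the two plants can be compared position by position within the same fragment.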
2 Sanger Sequencing
Sanger sequencing is one of the most common, as well as one of
the most accessible, methods of sequencing molecular markers, and
its inception revolutionized genetics. It involves the base-by-base
determination of a DNA sequence using dideoxynucleotides
(ddNTPs) in a chain-terminating reaction [24]. It was the
most widely used sequencing method from its introduction in 1977
until NGS supplanted it in popularity, and it is still widely used for
its affordability and for obtaining long sequence reads of over 500 nucleotides.
Along with sequencing DNA from PCR products, it was used to
sequence the first model organisms, providing a physical map for
molecular marker mapping.
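The principle of chain termination can be shown with a toy model: every synthesized strand stops at a labeled ddNTP, so the pool contains one product per position, and ordering the products by length reads out the sequence. The Python sketch below is an idealized illustration that ignores signal quality and uneven termination; the example sequence is arbitrary.

```python
import random


def terminated_fragments(new_strand: str) -> list[tuple[int, str]]:
    """Return (length, terminal_base) for each chain-terminated product."""
    return [(n, new_strand[n - 1]) for n in range(1, len(new_strand) + 1)]


def read_by_size(fragments: list[tuple[int, str]]) -> str:
    """Sort products by length, as electrophoresis does, and read the terminal bases."""
    return "".join(base for _, base in sorted(fragments))


strand = "GATTACAGGCT"
pool = terminated_fragments(strand)
random.shuffle(pool)          # the reaction tube holds the products in no order
recovered = read_by_size(pool)
```

Whatever the order of the pool, sorting by length recovers the synthesized strand base by base.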
3 High-Throughput Genotyping
Genotyping multiple samples accurately and in a cost-effective
manner has provided researchers with a whole new technical platform
for sophisticated genetic studies [25, 26]. The applicability of this
vast resource, comprising several NGS platforms, depends on the
experiment, sample numbers, and main goal. NGS technology
involves using a single instrument to sequence hundreds of thousands
to millions of cDNA/DNA fragments in a massively parallel
manner [26]. This technology can be applied to resequencing for
SNP detection, de novo sequencing, interaction mapping based on
immunoprecipitation of protein-DNA/RNA complexes, DNA methylation
analysis using bisulfite-mediated cytosine conversion, and transcriptome
sequencing.
SNPs are the most popular marker for use in high-throughput
studies due to their binary (mostly bi-allelic) nature. The most efficient methods for
SNP genotyping are detailed below, including established and
emerging methods.
3.1 TaqMan Assay
Allele-specific hybridization coupled with the 5′ nuclease activity of Taq
polymerase during PCR forms the basis of the TaqMan assay [27]. One pair of
PCR primers and two different probes for one SNP site are used, one probe per allele.
Fluorescence occurs when one of the probes matches a SNP allele:
the bound probe is cleaved during extension, which separates the
fluorescent dye from the quencher.
Life Technologies’ 7900HT Fast Real-Time PCR system can
process eighty-four 384-well plates in up to 4 days. One of the
drawbacks of this assay is the high cost of probes for a low level of
SNP multiplexing. Some recent advances include systems such as the
Biomark HD System [28] and OpenArray [29], which have a small
sample requirement, consume fewer reagents, and have a higher
throughput [27].
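Because each allele-specific probe carries its own dye, a genotype can in principle be called from the relative strength of the two signals. The sketch below is a simplified illustration only: the thresholds are hypothetical values chosen for this example, and instrument software typically calls genotypes by clustering many samples rather than by fixed cutoffs.

```python
def call_genotype(
    signal_a: float,
    signal_b: float,
    min_total: float = 1.0,                       # hypothetical minimum signal
    het_band: tuple[float, float] = (0.35, 0.65)  # hypothetical heterozygote window
) -> str:
    """Call a bi-allelic genotype from two end-point fluorescence signals."""
    total = signal_a + signal_b
    if total < min_total:
        return "no call"          # too little signal, e.g. failed amplification
    fraction_a = signal_a / total
    if fraction_a > het_band[1]:
        return "A/A"
    if fraction_a < het_band[0]:
        return "B/B"
    return "A/B"
```

With these illustrative thresholds, signals of (5.0, 0.2) are called "A/A", (0.3, 4.8) "B/B", (2.6, 2.4) "A/B", and (0.1, 0.2) "no call".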
3.2 iPlex Gold Assay
The iPlex Gold assay (Sequenom, www.sequenom.com) combines multiplex PCR,
single-base extension, and matrix-assisted laser
desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry
(MS) detection. Shrimp alkaline phosphatase
deactivates the remaining nucleotides after PCR, and the single-base
primer extension is then performed. At the SNP site the primer is extended
by one of four terminator nucleotides, and the products are transferred
onto 384-matrix spot chips following desalting, to be
analyzed using MALDI-TOF MS [27]. One 384-well plate can be processed
in less than 10 h. This method is very useful for low-input
samples because it directly analyzes the allele-specific product and
outputs highly accurate data.
3.3 High Resolution Melt (HRM)
HRM uses intercalating fluorescent dyes to monitor the melting
profile (unmelted to melted) of PCR products, with genotyping
performed on the LightScanner gene mutation/genotyping system [30].
This is the first commercial high-throughput system that conducts
rapid gene mutation scanning and genotyping [25].
4 Advances in Plant Genotyping
With advances in SGS technology allowing millions of SNPs to be
identified in plant genomes, plant genotyping methods have
advanced to genotype larger numbers of SNPs at ultra-high
throughput. These methods are detailed below.
4.1 Illumina GoldenGate Assay and Infinium Assay
The Illumina GoldenGate assay is a large-scale genotyping assay
that can analyze 384–3,072 different loci in up to 96 individuals.
It uses allele-specific oligo (ASO) hybridization along with
fluorescently labeled universal amplification primers to differentiate
between genotypes [31]. Previous studies have shown that the
GoldenGate assay can be used to reliably score SNPs for genetic
analysis [32]. Furthermore, it is cost-effective and flexible for
analyzing large numbers of SNPs [5]. The Infinium assay incorporates a
whole-genome amplification step in which the amount of DNA is
increased by up to 1,000-fold. SNP-specific primers on the bead array
capture the fragmented DNA and are then extended with hapten-labeled
nucleotides. Fluorescently labeled antibodies are then added to detect
the incorporated hapten-labeled nucleotides, revealing the SNP
genotype.
The Infinium assay is limited to bi-allelic SNPs and cannot
detect indel mutations or alternative alleles. Deletions of regions or
additional alleles sometimes cause individuals to deviate from the
two-alleles-per-locus design. In such cases, Infinium categorizes
these loci as “no calls” without any further discrimination. Because
many crop plant species have highly polyploid genomes, homoeologous
loci are a real obstacle in designing SNP probes. There are a number
of limitations in designing probes for SNP loci; in addition, a small
percentage (10–12 %) of loci that have passed all design
specifications will fail during the chip manufacturing process, so
specific loci of interest may be eliminated from the final assay.
4.2 Genotyping by Sequencing (GBS)
Genotyping by sequencing (GBS) was first demonstrated in maize
and barley. It is a form of reduced representation sequencing
that uses restriction enzyme-digested samples; however, genotyping
can also be achieved by multiplexing a large number of samples within
the same sequencing lane, in a method termed “skim GBS”. GBS has some
advantages over a “static” SNP panel such as that of the Infinium
assay. A change of focus within the genome can easily be accommodated
by mining the raw GBS data, whereas an entirely new SNP panel would
need to be designed and created within the Infinium paradigm. The
essence of GBS is the digestion of genomic DNA with a frequent cutter
followed by next-generation high-throughput sequencing of all
resulting restriction fragments [1]. GBS has a low per-sample
cost, can be applied to any crop species, and is easy to conduct in
small genomes. To gain sufficient coverage in complex genomes,
complexity reduction or target enrichment can be performed. Compared
to the RAD-seq method, GBS is less complicated: it involves less
sample handling, no size selection of fragments, easier generation of
restriction fragments with adapters, and fewer DNA purification
steps [1]. A single GBS experiment can yield ~25,000 SNPs that can be
used for germplasm characterization, breeding, population studies,
and trait mapping [1]. Furthermore, in the absence of a reference
genome, GBS sequence tags can be used as dominant markers for
kinship analysis. Genomic selection on novel germplasm and analysis
of population structure without prior knowledge of the species are
among several other uses of GBS that are expected to shape future
biological research.
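The digestion step at the heart of GBS can be illustrated with a
small in silico digest. This is a minimal sketch under stated
assumptions: the function name and the toy sequence are hypothetical,
and the EcoRI site (GAATTC, cut after the first G) is used only
because it is well known, not because it is the enzyme used in GBS
protocols.

```python
def digest(sequence, site, cut_offset):
    """Cut `sequence` at every occurrence of the recognition `site`.

    cut_offset is the number of bases of the site that stay on the
    upstream fragment (1 for a cut after the first base of the site).
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])  # tail after the last cut
    return fragments

# Toy "genome" with two recognition sites (illustrative only).
genome = "TTGAATTCAAACCCGAATTCGG"
fragments = digest(genome, "GAATTC", 1)
```

In a real experiment the ends of such fragments are what is
sequenced, so the choice of enzyme determines which fraction of the
genome is sampled.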
5 Bioinformatics Challenges
5.1 Handling Large Volumes of Data
The rapid advances in sequencing technology over the past decade
have led to an explosion of sequence and molecular marker data
[33]. In the early days of sequencing, the growth of sequencing
capabilities and of information technology resources went hand in
hand [34]. In the last decade, however, the emergence of NGS
technology has advanced the field so much that the data output of
individual sequencing runs is outgrowing the capacity to store it in
an efficient and cost-effective way [20]. The genome informatics
ecosystem is at risk of being swamped with data that current storage
capacity cannot absorb: sequence data output is doubling every
5 months (on average), which in turn is dramatically lowering the
cost per DNA base sequenced [34]. This may pose a challenge
in the near future, but alternative options such as cloud computing
are currently under consideration [34, 35]. The rapid increase in
sequencing data has also created the need for new algorithms that
can process this flood of data in a meaningful and effective way.
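To give a feel for what a 5-month doubling time implies, the
following sketch computes the fold increase in data output over a
given period. It simply takes the doubling time quoted above at face
value and assumes it stays constant; the function name is
hypothetical.

```python
def fold_increase(months, doubling_time_months=5.0):
    """Fold increase in output after `months` of steady doubling."""
    return 2.0 ** (months / doubling_time_months)

one_year = fold_increase(12)    # a little over five-fold
five_years = fold_increase(60)  # twelve doublings
```

Under this assumption, output grows about 5.3-fold in one year and
4,096-fold in five years, which illustrates why storage that grows
more slowly is quickly outpaced.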
5.2 Assembly Software
Some of the greatest advancements have come from genome
assembly software, as assembly is one of the most important
applications of NGS data [36]. Early assembly software struggled to
meet the needs of researchers assembling complex genomes such as
those of higher plants and mammals; however, recent advances have
allowed the completion of several eukaryotic genomes [37, 38]. One
significant challenge in genome assembly is the existence of large
repetitive elements within genomes [39]. This can in part be tackled
by increasing read length, which third-generation sequencing
technology aims to achieve, and by using read pairs to bridge assembly
gaps caused by repetitive regions [39].
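Why read length matters for repeats can be shown with a toy example:
a read that lies entirely within a repeat matches more than one
place in the genome, whereas a longer read that extends into unique
flanking sequence matches only once. The sequence and function name
below are hypothetical and chosen purely for illustration.

```python
def count_occurrences(genome, read):
    """Count all (possibly overlapping) exact matches of `read`."""
    count = 0
    pos = genome.find(read)
    while pos != -1:
        count += 1
        pos = genome.find(read, pos + 1)
    return count

repeat = "ACGTAC"
# Two copies of the repeat separated by unique sequence.
genome = "GG" + repeat + "TTT" + repeat + "CA"

short_read = genome[2:6]   # 4 bases, entirely inside the repeat
long_read = genome[1:9]    # 8 bases, spans repeat plus both flanks

short_hits = count_occurrences(genome, short_read)
long_hits = count_occurrences(genome, long_read)
```

The short read cannot be placed unambiguously, while the longer read
can, which is the intuition behind using longer reads or read pairs
to resolve repetitive regions.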
5.3 Alignment Software
The overwhelming volume of sequence data has also led to the
development of new alignment algorithms, as existing tools simply
cannot cope [39]. This applies to traditional dynamic programming
methods, as well as the BLAST family of alignment heuristics.
Current alignment algorithms have addressed this problem by
splitting the alignment problem into two steps: first, candidate
alignment locations are found using a heuristic search; second, the
actual alignment is performed. Examples of this include BLAT,
MAQ, Bowtie, and SOAPaligner/SOAP2 [39].
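The two-step strategy can be sketched generically as "seed and
verify". The code below is an illustrative assumption of how such a
scheme may look, using a simple k-mer hash index and mismatch
counting; it is not the implementation of any of the tools named
above, which use their own index structures and scoring. All names
and parameters are hypothetical.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def align(read, reference, index, k, max_mismatches=1):
    """Return (start, mismatches) of the best placement, or None."""
    # Step 1: heuristic search. Every k-mer shared between read and
    # reference proposes a candidate start position for the read.
    candidates = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # Step 2: verification. Compare the read with the reference
    # window at each candidate and keep the best acceptable one.
    best = None
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (
                best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best

reference = "ACGTTGCATGCCGATAGCTA"
index = build_index(reference, k=4)
hit = align("TGCATGCA", reference, index, k=4)     # one mismatch
no_hit = align("GGGGGGGG", reference, index, k=4)  # no shared k-mer
```

Only the few candidate positions proposed by the index are compared
base by base, which is what makes this approach much faster than
aligning the read against every position in the reference.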
5.4 Polymorphism Discovery for Molecular Markers
Discovering polymorphisms from aligned sequence data is a further
consideration in the effective implementation of current sequencing
technologies. The vast abundance of DNA sequence data allows for the
application of computational algorithms that make it possible to
discover polymorphisms, such as SNPs [20]. A major challenge of in
silico polymorphism identification is distinguishing true biological
variation from artifacts. Computationally predicted polymorphisms may
in fact result from sequencing errors, a problem with next-generation
sequencing platforms that sacrifice data quality for raw quantity of
data output [40].
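One common way to reduce false SNP predictions caused by sequencing
error is to require a minimum read depth and a minimum number of
reads supporting the alternative allele. The sketch below illustrates
that idea only; the function name and all three thresholds are
hypothetical and are not taken from the cited studies.

```python
from collections import Counter

def call_snp(reference_base, pileup, min_depth=8,
             min_alt_reads=3, min_alt_fraction=0.2):
    """Return the alternative base if it passes all filters, else None.

    pileup is a string holding the base observed in each read that
    covers one reference position.
    """
    depth = len(pileup)
    if depth < min_depth:
        return None  # too few reads to trust any difference
    counts = Counter(base for base in pileup if base != reference_base)
    if not counts:
        return None  # every read agrees with the reference
    alt_base, alt_reads = counts.most_common(1)[0]
    # A single discordant read is more likely an error than a SNP.
    if alt_reads >= min_alt_reads and alt_reads / depth >= min_alt_fraction:
        return alt_base
    return None

likely_snp = call_snp("A", "AAAAAGGGGG")    # 5 of 10 reads show G
likely_error = call_snp("A", "AAAAAAAAAG")  # 1 of 10 reads shows G
low_coverage = call_snp("A", "AG")          # only 2 reads
```

Stricter thresholds lower the false-positive rate at the cost of
missing true variants in regions of low coverage.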
6 Conclusion
The advent of PCR and, later, next-generation sequencing has
allowed the development of an increasing range of molecular
markers. Despite the increasing accessibility and improvement of
genome sequencing technologies, molecular markers have remained
essential components of all large-scale genomic analyses. NGS
technologies and molecular markers continue to be inextricably linked:
NGS allows high-throughput marker discovery and is a vital component
of the latest advances in genotyping technologies such as GBS.
Molecular markers have myriad valuable applications in plant science,
discussed in the following chapter, and their use has increased with
the explosion of markers being discovered through advancing
technology and the development of a range of high-throughput
genotyping technologies.
References
1. Mir RR, Varshney RK (2013) Future prospects of molecular markers in plants. In: Henry RJ (ed) Molecular markers in plants. Wiley, New York, pp 169–190
2. Agarwal M, Shrivastava N, Padh H (2008) Advances in molecular marker techniques and their applications in plant sciences. Plant Cell Rep 27:617–631
3. Makosiej A, Nasalski P, Giraud B, Vladimirescu A, Amara A (2008) An innovative sub-32nm SRAM current sense amplifier in double-gate CMOS insensitive to process variations and transistor mismatch. IEEE Int Conf Integr Circuit Design Technol Proc 2008:47–50
4. Lynch M, Milligan BG (1994) Analysis of population genetic-structure with RAPD markers. Mol Ecol 3:91–99
5. Appleby N, Edwards D, Batley J (2009) New technologies for ultra-high throughput genotyping in plants. In: Somers DJ, Langridge P, Gustafson JP (eds) Plant genomics. Humana, Kentucky, pp 19–40
6. Kalia R, Rai M, Kalia S, Singh R, Dhawan AK (2011) Microsatellite markers: an overview of the recent progress in plants. Euphytica 177:309–334
7. Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA, Selker EU, Cresko WA, Johnson EA (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One 3:e3376
8. Gu WK, Weeden NF, Yu J, Wallace DH (1995) Large-scale, cost-effective screening of PCR products in marker-assisted selection applications. Theor Appl Genet 91:465–470
9. Liu J, Huang SM, Sun MY, Liu SY, Liu YM, Wang WX, Zhang XR, Wang HZ, Hua W (2012) An improved allele-specific PCR primer design method for SNP marker analysis and its application. Plant Methods 8:34
10. Edwards D, Forster JW, Chagné D, Batley J (2007) What are SNPs? In: Oraguzie NC, Rikkerink EHA, Gardiner SE, De Silva H (eds) Association mapping in plants. Springer, New York, pp 41–52
11. Giraud T, Enjalbert J, Fournier E, Delmotte F, Dutech C (2008) Population genetics of fungal diseases of plants. Parasite 15:449–454
12. Batley J, Edwards D (2007) SNP applications in plants. In: Oraguzie NC, Rikkerink EHA, Gardiner SE, De Silva H (eds) Association mapping in plants. Springer, New York, pp 95–102
13. Duran C, Appleby N, Edwards D, Batley J (2009) Molecular genetic markers: discovery, applications, data storage and visualisation. Curr Bioinform 4:16–27
14. Erwin T, Jewell E, Love C, Lim G, Li X, Chapman R, Batley J, Stajich J, Mongin E, Stupka E, Ross B, Spangenberg GC, Edwards D (2007) BASC: an integrated bioinformatics system for Brassica research. Nucleic Acids Res 35:D870–D873
15. Love CG, Batley J, Lim G, Robinson AJR, Savage D, Singh D, Spangenberg GC, Edwards D (2004) New computational tools for Brassica genome research. Comp Funct Genomics 5:276–280
16. Love C, Robinson A, Lim G, Hopkins C, Batley J, Barker G, Spangenberg GC, Edwards D (2005) Brassica ASTRA: an integrated database for Brassica genomic research. Nucleic Acids Res 33:W493–W495
17. Love CG, Edwards D (2007) Accessing integrated Brassica genetic and genomic data using the BASC server. In: Edwards D (ed) Plant bioinformatics. Humana Press, USA, pp 229–244
18. Edwards D, Batley J, Cogan NOI, Forster JW, Chagné D (2007) Single Nucleotide Polymorphism discovery. In: Oraguzie NC, Rikkerink EHA, Gardiner SE, De Silva H (eds) Association mapping in plants. Springer, New York, pp 53–76
19. Hayward A, Dalton-Morgan J, Mason A, Zander M, Edwards D, Batley J (2012) SNP discovery and applications in Brassica napus. J Plant Biotechnol 39:1–12
20. Batley J, Edwards D (2009) Genome sequence data: management, storage, and visualization. Biotechniques 46:333–336
21. Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, Blaxter ML (2011) Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat Rev Genet 12:499–510
22. van Orsouw NJ, Hogers RCJ, Janssen A, Yalcin F, Snoeijers S, Verstege E, Schneiders H, van der Poel H, van Oeveren J, Verstegen H, van Eijk MJT (2007) Complexity reduction of polymorphic sequences (CRoPS (TM)): a novel approach for large-scale polymorphism discovery in complex genomes. PLoS One 2:e1172
23. Mammadov J, Chen W, Ren R, Pai R, Marchione W, Yalçin F, Witsenboer H, Greene T, Thompson S, Kumpatla S (2010) Development of highly polymorphic SNP markers from the complexity reduced portion of maize (Zea mays L.) genome for use in marker-assisted breeding. Theor Appl Genet 121:577–588
24. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74:5463–5467
25. Zheng L, Bin L, Yan D, Nongyue H (2011) The state of field of high-throughput SNP genotyping system. In: Bioelectronics and bioinformatics (ISBB), 2011 international symposium, 3–5 Nov 2011, pp 174–177
26. Edenberg HJ, Liu Y (2009) Laboratory methods for high-throughput genotyping. Cold Spring Harbor Protoc 2009, pdb.top62
27. Bayés M, Gut IG (2011) Overview of genotyping. In: Rapley R, Harbron S (eds) Molecular analysis and genome discovery. John Wiley & Sons, Ltd, pp 1–23
28. Fluidigm (2012) Biomark HD system. http://www.fluidigm.com/biomark-hd-system.html
29. LifeTechnologies (2012) OpenArray® Real-Time PCR System. http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/real-time-pcr/real-time-pcr-instruments/openarray-real-time-pcr-system.html
30. Biofire (2012) LightScanner® system mutation discovery, gene scanning and genotyping. Biofire Diagnostics. http://www.biofiredx.com/LightScanner/
31. Tindall EA, Petersen DC, Nikolaysen S, Miller W, Schuster SC, Hayes VM (2010) Interpretation of custom designed Illumina genotype cluster plots for targeted association studies and next-generation sequence validation. BMC Res Notes 3:39
32. Durstewitz G, Polley A, Plieske J, Luerssen H, Graner EM, Wieseke R, Ganal MW (2010) SNP discovery by amplicon sequencing and multiplex SNP genotyping in the allopolyploid species Brassica napus. Genome 53:948–956
33. Thudi M, Li YP, Jackson SA, May GD, Varshney RK (2012) Current state-of-art of sequencing technologies for plant genomics research. Brief Funct Genomics 11:3–11
34. Stein LD (2010) The case for cloud computing in genome informatics. Genome Biol 11:207
35. Dai L, Xin G, Yan G, Jingfa X, Zhang Z (2012) Bioinformatics clouds for big data manipulation. Biol Direct 7:43
36. Edwards D, Batley J (2010) Plant genome sequencing: applications for crop improvement. Plant Biotechnol J 8:2–9
37. Imelfort M, Edwards D (2009) De novo sequencing of plant genomes using second-generation technologies. Brief Bioinform 10:609–618
38. Imelfort M, Batley J, Grimmond S, Edwards D (2009) Genome sequencing approaches and successes. In: Somers DJ, Langridge P, Gustafson JP (eds) Plant genomics. Humana, Kentucky, pp 345–358
39. Lee HC, Lai KT, Lorenc MT, Imelfort M, Duran C, Edwards D (2012) Bioinformatics tools and databases for analysis of next-generation sequence data. Brief Funct Genomics 11:12–24
40. Lai K, Duran C, Berkman PJ, Lorenc MT, Stiller J, Manoli S, Hayden MJ, Forrest KL, Fleury D, Baumann U, Zander M, Mason AS, Batley J, Edwards D (2012) Single nucleotide polymorphism discovery from wheat next-generation sequence data. Plant Biotechnol J 10:743–749
Chapter 2
Molecular Marker Applications in Plants

Alice C. Hayward, Reece Tollenaere, Jessica Dalton-Morgan,
Abstract
Individuals within a population of a sexually reproducing species will have some degree of heritable genomic
variation caused by mutations, insertions/deletions (indels), inversions, duplications, and translocations.
Such variation can be detected and screened using molecular, or genetic, markers. By definition, molecular
markers are genetic loci that can be easily tracked and quantified in a population and may be associated with
a particular gene or trait of interest. This chapter will review the current major applications of molecular
markers in plants.
Key words Molecular markers, SNPs, Association mapping, Genetic diversity, Genetic mapping,
Marker-assisted selection
1 Introduction
Genetic markers can be used to study patterns of heredity, genomic

variation, evolutionary and selection phenomena, allele–allele
linkages, and allele–phenotype associations. The application to
which a genetic marker is best suited depends on its physical prop-
erties and genomic location, the cost involved, ease of use, and
degree of throughput required. Molecular markers have been suc-
cessfully applied in plant science toward the genetic and physical
mapping of genomes, the identification of genes controlling vari-
ous processes and phenotypes (trait association), genetic diversity
and evolutionary analyses, and in marker-assisted breeding for crop
improvement.
In the past the physical location of a genetic marker was com-
monly unknown and not necessary for purposes of diversity and
evolutionary analyses and breeding applications. With advances in
genome sequencing technologies, genetic markers with a known
genomic location and environment are becoming more popular
and applicable to an increasingly diverse and high-throughput
range of objectives. DNA sequence-based markers are inherently
able to capture vast amounts of variation at single-base resolution,

making them particularly useful for the detection of perfect
markers (DNA polymorphisms causally linked to traits of interest)
and discovery and analysis of the alleles involved.
2 Genetic and Association Mapping
2.1 Genetic Linkage Map Construction and QTL Identification
One of the most important applications of genetic markers has been
the construction of genetic linkage maps [1, 2]. These maps are
created by genotyping a large mapping population of segregating
individuals and studying the resulting recombination frequencies
between genetic markers. This enables establishment of linkage
groups of associated markers with an approximate relative position
along a chromosome based on their likelihood of being coinherited.
A linkage group often represents a large proportion of an individual
chromosome, with inferred recombination points. The abundance of
SNPs and their ability to be discovered and genotyped rapidly in a
high-throughput manner make them particularly valuable markers for
genetic mapping [3–5].
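The step from recombination frequencies to map positions can be
sketched as follows. The example assumes a population in which each
individual carries one parental allele per marker (as in a backcross
or doubled-haploid population) and converts the recombination
fraction to map distance with the standard Haldane and Kosambi
mapping functions. The mapping functions are not discussed in this
chapter and are added here as an assumption; the function names and
genotype data are hypothetical.

```python
import math

def recombination_fraction(marker_1, marker_2):
    """Fraction of individuals whose parental origin differs between
    two markers. Each list holds "A" or "B" per individual; missing
    calls (None or "") are skipped."""
    pairs = [(a, b) for a, b in zip(marker_1, marker_2) if a and b]
    recombinants = sum(a != b for a, b in pairs)
    return recombinants / len(pairs)

def haldane_cM(r):
    """Haldane map distance (no crossover interference), in cM."""
    return -50.0 * math.log(1.0 - 2.0 * r)

def kosambi_cM(r):
    """Kosambi map distance (allows for interference), in cM."""
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))

marker_1 = list("AAAAABBBBB")
marker_2 = list("AAAABBBBBA")   # individuals 5 and 10 are recombinant
r = recombination_fraction(marker_1, marker_2)
distance_haldane = haldane_cM(r)
distance_kosambi = kosambi_cM(r)
```

With 2 recombinants among 10 individuals, r is 0.2, which corresponds
to roughly 25.5 cM (Haldane) or 21.2 cM (Kosambi). Mapping software
repeats this for all marker pairs to group and order the markers.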
Importantly, when the same mapping population used to
derive a linkage map is phenotyped for segregating traits of inter-
est, such as seed color or flowering time, the association between
marker patterns and the phenotypic variation can be quantified.
This then enables identification of the genomic regions controlling
traits of interest. Where these traits are quantitative, the associated
genomic region(s) are known as quantitative trait loci (QTL). The
identification of markers closely linked to genetic loci of interest,
including QTL, enables discovery of the underlying, causative
gene(s). Prior to the availability of whole genome sequencing tech-
nologies, this involved map-based cloning, which used the known
sequence of markers directly flanking a locus to amplify and
sequence the intervening region for gene candidate identification.
Depending on the resolution of the genetic map as defined by
marker density and thus distance between flanking markers, this
process was often extremely time and resource intensive.
Nonetheless, it enabled the first identification of developmentally
and agriculturally important genes in many crop and model plant
species. In the crop canola, QTL of importance include those for
oil yield, oil quality, disease resistance, and pod shatter tolerance,
amongst many others [6–9].
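Quantifying the association between a marker and a phenotype can be
illustrated with a single-marker test: individuals are grouped by
marker genotype and the group means are compared. The sketch below
uses a Welch-type t statistic; it is a minimal illustration with
hypothetical names and made-up flowering-time values, not the
interval-mapping methods used by dedicated QTL software.

```python
import math
from statistics import mean, variance

def single_marker_effect(genotypes, phenotypes):
    """Return (difference in means, t statistic) between the "A" and
    "B" genotype classes at one marker."""
    groups = {}
    for genotype, value in zip(genotypes, phenotypes):
        groups.setdefault(genotype, []).append(value)
    a, b = groups["A"], groups["B"]
    difference = mean(a) - mean(b)
    # Standard error of the difference, allowing unequal variances.
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return difference, difference / standard_error

genotypes = list("AAAABBBB")
days_to_flowering = [50, 52, 51, 51, 60, 61, 59, 60]  # illustrative data
effect, t_stat = single_marker_effect(genotypes, days_to_flowering)
```

Here the "A" class flowers 9 days earlier on average, and the large
absolute t statistic indicates that the marker is tightly associated
with the trait in this toy data set. Repeating the test along the
linkage map highlights the region where the association peaks.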
2.2 Genome Assembly, Physical Mapping, and Synteny Mapping
Genetic linkage maps are highly valuable in helping to assemble
contigs of next-generation genome sequencing data into chromosomes.
This is achieved by physically mapping genetic marker
sequences on these contigs and comparing this to their known
relative location on the genetic map. The success of this process
depends on the accuracy and robustness of the genetic linkage
map, as well as the quality of the original contig sequence assembly.

Where markers flanking QTL are physically located on a genome
sequence, this enables direct and rapid analysis of the intervening
region. With the aid of the plethora of in silico sequence analysis,
gene prediction, and annotation tools currently available, candi-
date genes underlying these loci can be rapidly identified [10].
Polymorphisms in the candidate gene regions between individuals
segregating for the trait can further narrow down the causal gene.
Moreover, identification and genotyping of additional SNPs in the
original mapping population enables fine-mapping, or extremely
high density mapping, of the QTL [11]. SNPs found to be causally
associated with trait variation are known as “perfect markers”, and
these, along with the candidate gene, can then be verified in vitro
and applied to marker-assisted breeding programs (see below).
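The use of a genetic map to arrange contigs, described at the start
of this section, can be sketched as a simple lookup and sort. The
data structures, marker names, and contig names below are
hypothetical; real pipelines also use marker orientation and flag
contigs whose markers fall on different linkage groups.

```python
def order_contigs(marker_hits, genetic_map):
    """Group contigs by linkage group and order them along the map.

    marker_hits: marker name -> contig the marker sequence maps to.
    genetic_map: marker name -> (linkage group, position in cM).
    Returns linkage group -> list of contigs in map order.
    """
    positions = {}
    for marker, contig in marker_hits.items():
        if marker not in genetic_map:
            continue  # marker not placed on the genetic map
        group, cm = genetic_map[marker]
        positions.setdefault((group, contig), []).append(cm)
    by_group = {}
    for (group, contig), cms in positions.items():
        # Use the mean position of a contig's markers as its anchor.
        by_group.setdefault(group, []).append((sum(cms) / len(cms), contig))
    return {group: [contig for _, contig in sorted(items)]
            for group, items in by_group.items()}

marker_hits = {"m1": "contig_7", "m2": "contig_7",
               "m3": "contig_2", "m4": "contig_9"}
genetic_map = {"m1": ("LG1", 12.0), "m2": ("LG1", 14.5),
               "m3": ("LG1", 3.1), "m4": ("LG2", 40.0)}
layout = order_contigs(marker_hits, genetic_map)
```

The quality of the resulting layout depends directly on the accuracy
of both inputs, which is the point made in the text above.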
In species descended from a common ancestor, the preserved
order of at least two homologous genes along chromosomes is
known as synteny. Synteny mapping uses the locations of con-
served genetic markers on the genetic maps of different species to
compare interspecies genome organization. This is useful for analy-
ses of gene and genome evolution and in reconstructing ancestral
genomes. During evolution, genome rearrangements, expansion,
gene loss, and mutation occur at increasing frequency with genetic
distance, reducing synteny between distantly related species.
When a region of high synteny between species is identified, this
suggests a high level of selection for preserving genome sequence
and organization in these regions. Such shared synteny is a basic
criterion for establishing the functional orthology of genomic
regions in different species and can facilitate rapid identification of
conserved, agriculturally important gene regions in related crop
species. Furthermore, markers associated with different gene paralogues
enable localization and comparison of the specific members
of multigene families [12, 13]. Synteny mapping studies
were pioneered in grass species [14] but have been conducted in
numerous plant species [15–18].
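The comparison of marker order between two species can be
illustrated by counting how many shared markers appear in the same
relative order on both maps. The sketch below does this with a
longest-increasing-subsequence calculation; it is a deliberately
simple assumption of how collinearity might be scored, it considers
only one orientation, and the marker names are hypothetical.

```python
from bisect import bisect_left

def collinear_marker_count(order_a, order_b):
    """Largest number of shared markers found in the same relative
    order in both species (longest increasing subsequence)."""
    rank_in_b = {marker: i for i, marker in enumerate(order_b)}
    # Ranks in species B, listed in species A's marker order.
    ranks = [rank_in_b[m] for m in order_a if m in rank_in_b]
    tails = []  # tails[i]: smallest end rank of a run of length i + 1
    for rank in ranks:
        i = bisect_left(tails, rank)
        if i == len(tails):
            tails.append(rank)
        else:
            tails[i] = rank
    return len(tails)

species_a = ["g1", "g2", "g3", "g4", "g5"]
species_b = ["g1", "g3", "g2", "g4", "g6", "g5"]  # g2/g3 swapped
shared_in_order = collinear_marker_count(species_a, species_b)
```

In this toy case four of the five shared markers keep their relative
order; the swap of g2 and g3 is the kind of local rearrangement that
reduces synteny between more distantly related species.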
2.3 Association Mapping and Linkage Disequilibrium
Genetic markers that are linked to traits under selection are highly
valuable for identifying genetic loci that contribute to phenotypic
variation based on linkage disequilibrium (LD). LD refers to the
coinheritance of specific genetic markers in ancestrally related indi-
viduals at higher frequencies than expected based on recombina-
tion distances. Regions that are in high LD may be under high
selection pressure for particular allelic combinations, implying a
positive relationship between otherwise physically distinct alleles
and quantifiable traits. LD mapping, or association mapping, refers
to the analysis of statistical associations between genetic markers,
usually individual SNPs or SNP haplotypes, and traits (phenotypes)
in a collection of individuals [5, 19–21]. SNP haplotypes,
which comprise SNP alleles always found in particular allelic
combinations, are found in species with moderate or high levels

of LD and may encompass genes or gene clusters [12]. As such,
a minimal set of the SNPs normally existing as haplotypes can be
used to impute the remainder of the haplotype alleles. This pro-
vides the ability to fast-track screening of regions of agronomic
interest in breeding programs using a minimal genotyping set [4,
5, 22]. In Arabidopsis, identification of linkage disequilibrium
based on high-density SNP maps has significantly advanced evolu-
tionary and association genetics studies [23].
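The strength of LD between two bi-allelic SNPs is commonly
summarized by the coefficient D and the squared correlation r². The
sketch below computes both from phased haplotypes; these statistics
are standard but are not defined in this chapter, so their use here
is an assumption, and the function name and toy data are
hypothetical. Both SNPs must be polymorphic in the sample, otherwise
r² is undefined.

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r_squared) for two bi-allelic SNPs.

    haplotypes is a list of two-character strings giving the allele
    at SNP 1 and SNP 2 on each chromosome, coded "0" or "1".
    """
    n = len(haplotypes)
    p1 = sum(h[0] == "1" for h in haplotypes) / n   # freq of allele 1, SNP 1
    p2 = sum(h[1] == "1" for h in haplotypes) / n   # freq of allele 1, SNP 2
    p11 = sum(h == "11" for h in haplotypes) / n    # freq of haplotype 1-1
    d = p11 - p1 * p2   # deviation from independent assortment
    r_squared = d * d / (p1 * (1 - p1) * p2 * (1 - p2))
    return d, r_squared

# Alleles always inherited together: complete LD.
complete = linkage_disequilibrium(["11"] * 4 + ["00"] * 4)
# All four haplotypes equally frequent: no LD.
none = linkage_disequilibrium(["11", "10", "01", "00"] * 2)
```

When r² is close to 1, genotyping one SNP is enough to predict the
other, which is the basis for selecting a minimal set of SNPs to
represent a haplotype.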
Association studies can either be candidate gene-based or whole-
genome based [21]. In the candidate gene approach, the aim is to
determine correlations between traits of interest and DNA polymor-
phisms (e.g., SNPs) within candidate genes thought to be involved
in those traits. This approach requires prior foresight into the likely
biochemistry and genetics of the trait in order to narrow down gene
candidates. On the other hand, whole-genome association mapping
analyzes the association of densely mapped genetic markers across all
chromosomes with variation in phenotype to identify potential
causal or LD associated loci. Association mapping has become popu-
lar for identifying trait–marker relationships within many species,
particularly for mining new alleles in natural populations or germ-
plasm collections, and/or where the creation of large biparental
mapping populations may be less feasible. In this approach, genetic
markers are screened across natural populations or a diverse collec-
tion of individuals in order to associate alleles with phenotypic traits
of interest [24–26]. Since allelic variation in these populations
depends on historical recombination and linkage disequilibrium,
association studies may produce very high map resolution in species
with low levels of LD [27]. LD-based association mapping has been
applied in many crop and forage species including maize, barley,
wheat, rice, sorghum, sugarcane, sugar beet, soybean, poplar, and
grape (reviewed in [19, 28, 29]). In some crop species, for example
Brassica napus, “diversity fixed foundation sets” have been created,
comprising a small number of homozygous lines thought to capture
a large proportion of the genetic diversity available for the species.
Single nucleotide polymorphisms (SNPs) are currently one of the
most popular markers for the fine mapping of heritable traits [30].
The availability of large-scale sequencing and SNP genotyping tech-
nologies will support genome-wide association studies in important
crop species by enabling screening of large sets of polymorphic
markers, even in complex polyploid species [31]. In maize, for example,
the Illumina GoldenGate SNP genotyping assay was used to
determine the extent of LD in a diverse global maize collection [32].
Similarly, in black poplar (Populus nigra) this same SNP genotyping
platform was used to analyze linkage disequilibrium between
SNP markers and determine their association to cellulose and lig-
nin biosynthesis properties [33]. This narrowed down candidate
genes associated with these traits with the aim of developing a
genomics-based breeding platform for bioethanol production.
3 Genomic-Based Breeding
3.1 Marker-Assisted Selection: Single Marker–Trait Associations
Markers provide the potential to fine map important genetic loci
with high resolution through the use of mapping populations.
Where these populations are phenotyped for traits of agronomic
importance, such as disease resistance, the inheritance of particular
marker loci or haplotypes in the population can be linked to such
phenotypes. Genotyping of markers tightly linked to traits can
then rapidly predict the phenotypes of a large selection of segregat-
ing individuals at an early stage of development, often well before
phenotypic screening would be possible, and at reduced cost.
The application of single marker–trait associations to crop breeding
is known as marker-assisted selection (MAS). MAS enables efficient
selection of breeding lines for the introgression of desirable traits
into commercial crop accessions as well as the high-throughput
screening of the resulting progeny [34, 35].
An effective marker for MAS must generally be located within
1 cM of a desired trait and able to be genotyped at high through-
put and reproducibility [36]. Low polymorphism, poor genomic
distribution, and/or poor reproducibility of marker types includ-
ing RFLPs (Restriction Fragment Length Polymorphisms) and
RAPDs (Randomly Amplified Polymorphic DNA) limit their appli-
cation to MAS. Microsatellites (also known as Simple Sequence
Repeats, SSRs) are highly polymorphic, reproducible alternatives,
but are often poorly linked to genes [36, 37]. Nonetheless, SSRs
and RAPDs have been applied in B. napus (canola) MAS programs
for selection for major gene disease resistance [38], yellow seed
coat color [39], male fertility restorer lines [40], and improvement
of oil quality [41].
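The 1 cM guideline at the start of this paragraph can be made
concrete by estimating how often a marker and the trait allele stay
together through one meiosis. The sketch below inverts Haldane's
mapping function to obtain the recombination fraction from the map
distance; the choice of mapping function and the function names are
assumptions made for illustration.

```python
import math

def recombination_fraction_from_cM(distance_cM):
    """Haldane's mapping function solved for the recombination
    fraction, given a map distance in centimorgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cM / 100.0))

def marker_selection_accuracy(distance_cM):
    """Probability that marker and trait alleles are inherited
    together (no net recombination) in one meiosis."""
    return 1.0 - recombination_fraction_from_cM(distance_cM)

accuracy_1cM = marker_selection_accuracy(1.0)
accuracy_10cM = marker_selection_accuracy(10.0)
```

Under these assumptions a marker 1 cM from the trait locus predicts
the trait allele correctly in about 99 % of progeny per generation,
compared with about 91 % at 10 cM, and the errors compound over
successive generations of selection.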
SNPs are currently the best markers for MAS due to their high
prevalence and polymorphism in the genome and their potential for
strong, or even perfect, linkage to traits of interest [5, 42–44].
Perfect linkage is possible where the polymorphism is directly
responsible for variation in the desired trait. The development of
high-throughput sequencing technologies in recent years has greatly
assisted association studies that utilize SNP markers [45, 46].
3.2 Genome-Wide Marker-Assisted Selection
The availability of phenotypic data along with genotypic data permits
the association of loci, or haplotypes, at a genome-wide scale, which
may be used to mine an entire genome for genotype–phenotype
correlations.
When there are enough markers, spanning the entire genome
in a dense manner, it is expected that the gene, or genes, of interest
will be in linkage disequilibrium with at least one or some of the
markers, leading to marker-assisted selection on a genomic scale
[47]. Genome-wide marker-assisted selection studies will be an
important way of safeguarding global food supplies into the future.
One study performed by Morris and coworkers investigated
agroclimatic traits, such as drought tolerance, within sorghum lines.

The study identified ~265,000 SNPs in 971 worldwide accessions,
adapted to diverse agroclimatic conditions. Genome-wide associa-
tion studies (GWAS) based on the markers identified were then
carried out to identify novel loci underlying variation in agroclimatic
traits [48]. Another study, utilizing restriction-site associated
DNA (RAD) sequencing, identified 8,207 SNP markers across
the lupin genome, which, once filtered, led to the discovery of 38
molecular markers linked to the Lanr1 disease resistance gene.
The sequences analyzed were derived from 20 informative
plants resulting from a cross between a disease-resistant and a
disease-susceptible line [49].
Marker-assisted breeding programs implement the introgression
of genomic fragments to deliver a desired trait. In some instances,
gene pyramiding is utilized, in which two or more genes
conferring a particular trait (e.g., pathogen resistance) are
introgressed into a hybrid line. Analysis of the resulting degrees of
resistance to the pathogen can then be performed. Jiang et al. [50]
carried out marker-assisted gene pyramiding on rice cultivars to
introgress rice blast resistance genes. Results from this study
indicated that the greater the number of resistance genes contained in
the improved lines, the higher the resistance to the pathogen, with a
subsequent growth benefit [50].
Rice, being the major source of caloric intake globally, is critical
to the world’s food supply. GWAS carried out on Oryza sativa aim to
improve the quality, safety, reliability, and sustainability of this
most important crop in a time of population growth, climate
change, and the identification of novel agricultural regions. Rice
varieties with high stress tolerance, resource-use efficiency, and high
productivity will be required, developed using a combined genomics
and plant breeding approach. A study carried out by Zhao et al. [51]
genotyped 44,100 SNP variants across 413 diverse varieties collected
from 82 countries. For these varieties, 34 morphological,
developmental, and agronomic traits were systematically phenotyped
over two consecutive field seasons [51].
Tomato introgression lines (ILs), derived from the hybridization
of wild tomato (Solanum pennellii) and cultivated tomato
(Solanum lycopersicum) to produce fertile offspring, have been
extensively used in the identification of interspecific QTL. These
publicly available ILs have been comprehensively phenotyped for
hundreds of traits thereby allowing the identification of 2,795
QTL [52]. Further analysis of introgression fragments revealed
five genomic regions (BINs, 1C, 2B, 4I, 7H, and 11C) that share
colinearity, spanning 104 QTL associated with fruit carbon primary
metabolism [53, 54], fruit color [55], volatile content [56], and
yield traits linked to metabolite variations found in the fruits [57,
58]. Within these syntenic regions 38 distinct genes with conserva-
tion of genomic ordering, orientation, and gene structure (intron/
exon) between the two species were observed with variation in

intergenic regions disrupting the near perfect colinearity [59].
Sequencing, annotation, and characterization of the genes within
these syntenic regions, along with polymorphism and microsyn-
tenic analysis between the genes have unearthed the basis for
evolutionary change for the five regions [59], a resource for under-
standing the possible future value of these introgression fragments
and the role that they might play in increasing genetic diversity and
availability of desirable traits in crop species. In canola, MAS has
enabled selection of intervarietal substitution lines [60] and enrich-
ment of genomic introgression lines [61].
4 Genetic Diversity Analyses
The development and implementation of molecular marker technology have
paved the way for large-scale analyses of genetic diversity within and
between species. This is valuable for clarifying evolutionary
relationships and taxonomies as well as providing an understanding of
genome change rates within and between different species. Importantly,
the ability to assess genetic diversity in crops also has implications
for crop breeding and sustainability [62, 63].
SSRs and SNPs have been widely applied to crop genetic diver-
sity analyses [64–67]. SNPs, as the most common form of highly
heritable genetic variation across the genome, are superior indica-
tors of genetic diversity and phylogeny, particularly in crop species
with ancient genome duplications. Moreover, genomic SNPs are
most often free of selective pressures, allowing a more complete
estimate of diversity levels based on random genetic drift [68].
This makes them highly useful in identifying regions of LD and
then in tracking chromosome segments to identify recombination
events that break up such regions [42, 69]. In maize, an Illumina
SNP genotyping assay using over 1,000 SNPs was used to estimate
the genetic diversity, population structure, and familial relatedness
across a highly diverse global maize collection from temperate,
tropical, and subtropical public breeding programs [32]. A similar
study in cassava assessed the diversity of 53 varieties from the
Americas and Africa to reveal substructure based on geographical
origin [70]. In the genus Arabidopsis, a genotyping-by-sequencing
approach applied to 80 diverse accessions from different habitats
throughout Eurasia is being used to assess genetic variation contributing
to adaptation to diverse environments [71].
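As a minimal sketch of how biallelic SNP calls feed into the diversity and LD estimates discussed above, the following Python computes expected heterozygosity (gene diversity) at one locus and the pairwise r^2 statistic between two loci. It assumes inbred lines or phased haplotypes coded 0/1; the function names and the toy data are illustrative and are not taken from the cited studies.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Gene diversity at one locus: 1 minus the sum of squared allele frequencies."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def r_squared(locus_a, locus_b):
    """Pairwise LD (r^2) between two biallelic loci coded 0/1 per haplotype."""
    if len(locus_a) != len(locus_b):
        raise ValueError("both loci must be scored on the same haplotypes")
    n = len(locus_a)
    p_a = sum(locus_a) / n
    p_b = sum(locus_b) / n
    p_ab = sum(a * b for a, b in zip(locus_a, locus_b)) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


# Toy data: four inbred lines scored at three SNPs (hypothetical values).
snp1 = [0, 0, 1, 1]
snp2 = [0, 0, 1, 1]  # always inherited together with snp1
snp3 = [0, 1, 0, 1]  # independent of snp1

print(expected_heterozygosity(["A", "A", "G", "G"]))
print(r_squared(snp1, snp2), r_squared(snp1, snp3))
```

In this toy set, snp1 and snp2 are in complete LD while snp1 and snp3 show none; real analyses would also need to handle missing calls and heterozygous genotypes.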
4.1 Crop Breeding
For agriculturally important species, a high level of allelic diversity
provides an essential resource for mining beneficial trait variants
associated with this diversity. In the context of a changing climate,
a diverse germplasm set provides a valuable degree of genetic
plasticity and adaptive potential for breeding-based crop
improvements and future food security. Unfortunately, extensive
artificial selection and inbreeding have severely limited the genetic
diversity in many major crop species [63, 72–74]. Canola (B. napus),
for instance, is a recent allopolyploid that contains only a fraction
of the genetic diversity present in its progenitor species B. rapa
and B. oleracea [75]. Compounding this, inbreeding depression
and the associated large blocks of linkage disequilibrium in rape-
seed breeding populations have created linkage drag, whereby
desirable alleles are inextricably linked to undesirable alleles [25].
As such it has become a priority for many breeders to identify the
degree of genetic diversity in not only commercial germplasm,
but also wild relatives, of crop species through the use of molecular
markers.
Understanding genetic diversity creates great scope for crop improvement
and heterosis via wide hybridization and introgression of genetic
diversity [76]. Blackleg is a fungal disease that devastates canola crops
worldwide. Recently, Yu et al. [77] successfully introgressed two known
blackleg resistance genes, LepR1 and LepR2, from B. rapa subspecies
sylvestris into the related allotetraploid B. napus via interspecific
hybridization. Furthermore, diversity analyses enable the best choice of
lines within germplasm banks for the preservation of genetic diversity
and breeding potential. In collections of black mustard (B. nigra) [66]
and castor bean (Ricinus communis) [63], SSR and SNP markers,
respectively, were used to analyze the diversity within geographically
distinct populations. A similar study using SSR markers in feral and
cultivated alfalfa germplasm concluded that feral alfalfa populations may
provide a source of new germplasm for plant improvement [65].
4.2 Comparative Genomics
The ability to compare genomic properties of various evolutionarily
related individuals can provide a wealth of information regarding
the mechanisms underlying genome evolution, hybridization, poly-
ploidization, and speciation [4, 5]. Molecular markers facilitate
rapid and high-throughput comparative genomics analyses and
enable analysis of presence/absence variation (PAV), copy number
variation (CNV) and, for physically mapped markers such as SNPs,
genomic rearrangements between individuals or species. In addi-
tion to elucidating the mechanisms and patterns of genome evolu-
tion, this information can then be linked to phenotype to better
understand the influence of various selective pressures on genome
stability and phenotype expression [4]. For example, many genomic
regions associated with disease resistance in plants are rapidly evolv-
ing due to constant selective pressure from rapidly evolving patho-
gens. This information can be highly useful for elucidating the
genetic basis for disease resistance and coevolution of the pathogen
and host plant [78]. Another example is comparing genomic
structural change and any associated effects on agricultural vigor in
hybrids expressing heterosis for any trait of interest.
4.3 Taxonomic Classification
Comparing the genetic similarity of related species is the most
accurate method of resolving taxonomic classifications. Various
molecular marker methods provide a fast, high-throughput, and
effective means to determine evolutionary relationships at differing
resolution. Within the Brassicaceae family, phylogenies remain
somewhat confused as a result of recurrent hybridization and poly-
ploidization events [79]. SNPs for high-throughput evolutionary
analysis are being applied to resolve ancestral karyotypes in the
Brassicaceae and the origin and timing of whole genome duplica-
tion and hybridization events [80–82]. The ability to efficiently
classify large numbers of samples into species groups also has appli-
cations for germplasm banks by facilitating routine verification of
stored lines and control of potential contamination. In the study by
Pradhan and coworkers [66], SSR markers with known genomic
locations from each of the three Brassica “A,” “B,” and “C”
genomes were used to confirm species identity in a collection of
B. nigra accessions found to be contaminated with B. juncea and B.
rapa species. Thus, genetic markers with known genomic origin
can be valuable for species classification where identification based
solely on morphological characters is difficult [66].
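The logic of checking species identity with genome-specific markers can be sketched as follows. This is an illustrative outline only, not the procedure used by Pradhan and coworkers: the species table follows the well-known genome relationships among the cultivated Brassicas (the "triangle of U"), while the function names, the input format, and the 0.5 amplification cut-off are assumptions made for the example.

```python
# Diploid and allotetraploid Brassica species keyed by genome composition.
SPECIES_BY_GENOMES = {
    frozenset("A"): "B. rapa",
    frozenset("B"): "B. nigra",
    frozenset("C"): "B. oleracea",
    frozenset("AB"): "B. juncea",
    frozenset("AC"): "B. napus",
    frozenset("BC"): "B. carinata",
}


def genomes_present(marker_calls, min_fraction=0.5):
    """Return the genomes whose markers mostly amplified.

    marker_calls: iterable of (genome, amplified) pairs, one per marker.
    min_fraction: assumed cut-off; a real study would calibrate this.
    """
    scored, amplified = {}, {}
    for genome, ok in marker_calls:
        scored[genome] = scored.get(genome, 0) + 1
        amplified[genome] = amplified.get(genome, 0) + (1 if ok else 0)
    return frozenset(g for g in scored if amplified[g] / scored[g] >= min_fraction)


def check_accession(label, marker_calls):
    """Return the species implied by the markers and whether it matches the label."""
    inferred = SPECIES_BY_GENOMES.get(genomes_present(marker_calls), "unresolved")
    return inferred, inferred == label


# Hypothetical accession labelled B. nigra in which A-genome markers also amplify.
calls = [("A", True), ("A", True), ("B", True), ("B", True), ("C", False), ("C", False)]
print(check_accession("B. nigra", calls))
```

An accession labelled B. nigra that amplifies both A- and B-genome markers would be flagged here as a likely B. juncea contaminant.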
5 Complications Arising from Polyploidy
Because the majority of agriculturally important crop species have
complex polyploid genomes, effective SNP discovery can be hampered by
the misidentification of variation between homoeologous (between genome)
or paralogous (within genome) loci as true SNPs. Polyploidization events
have also resulted in larger genome sizes, with organisms such as maize,
barley, and wheat having genomes comparable in size to, or much larger
than, that of humans [4, 83, 84].
During SNP prediction, the software parameters must be calibrated to
give the best trade-off between admitting false positives and excluding
real polymorphisms. Many studies have addressed this issue by adjusting
the stringency of the read depth required before a polymorphism is
called. Trick et al. [82] presented a direct comparison of SNP detection
rates at varying stringency levels, demonstrating a large degree of
difference. While it is possible to predict a large number of
polymorphic sites, sequencing or read mapping errors can produce
spurious polymorphisms. Validation of a subset of the predicted SNPs is
therefore required to estimate the true rate of variation. This has
traditionally been achieved using Sanger sequencing; however, higher
throughput SNP assays, such as GoldenGate, can also be used for this
purpose [85, 86].
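The trade-off between false positives and missed polymorphisms can be illustrated with a small sketch. The pileup structure, the function name, and both thresholds (`min_depth`, `min_minor_reads`) are hypothetical choices for this example; they do not reproduce the parameters of any particular SNP caller or of the studies cited above.

```python
from collections import Counter


def call_snps(pileup, min_depth, min_minor_reads):
    """Return positions whose second most common base passes both thresholds.

    pileup: dict mapping position -> Counter of base calls from aligned reads.
    """
    called = []
    for position, bases in sorted(pileup.items()):
        depth = sum(bases.values())
        ranked = bases.most_common(2)
        minor_reads = ranked[1][1] if len(ranked) > 1 else 0
        if depth >= min_depth and minor_reads >= min_minor_reads:
            called.append(position)
    return called


# Hypothetical pileup: 101 is a well-supported SNP, 250 a likely sequencing
# error (one discordant read), 377 a real but thinly covered SNP.
pileup = {
    101: Counter({"A": 12, "G": 10}),
    250: Counter({"C": 29, "T": 1}),
    377: Counter({"G": 3, "T": 2}),
}

relaxed = call_snps(pileup, min_depth=4, min_minor_reads=1)
strict = call_snps(pileup, min_depth=8, min_minor_reads=3)
print("relaxed:", relaxed)
print("strict:", strict)
```

With the relaxed settings all three positions are reported, including the probable error at position 250; with the strict settings only position 101 survives, so the thinly covered site at 377 is lost.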
Despite these hurdles, SNP discovery has been performed in the crop
B. napus with a validation rate of 95 %, and these SNPs have then been
used to produce successful SNP assays (J. Batley, pers. comm.). The
approach used recently in the allogamous species Juglans regia was to
detect SNPs within one line and then use this SNP pool to genotype
populations generated by crosses to that line [87]. In sugarcane,
Bundock et al. [88] managed the interference of homologous genes within
polyploid genomes in SNP discovery by directing the discovery effort
toward intergenomic SNPs. Two separate sugarcane lines, the parents of a
mapping population, were sequenced within regions of interest. When used
in conjunction with the analysis of wild or progenitor species related
to the organism of interest, such SNPs can aid in the analysis of
evolutionary relationships and provide information on both the diploid
and polyploid organisms. The five allele combinations that arise from
either tetraploidy or the existence of two paralogous loci can be
accommodated in Illumina’s SNP assay software GenomeStudio; however,
these two scenarios are indistinguishable without prior knowledge.
Higher ploidy levels produce results that cannot be resolved into
discrete allele combinations.
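The five allele combinations mentioned above follow directly from counting allele dosage at a biallelic SNP. The sketch below enumerates the classes for a given ploidy and assigns an observed signal to the nearest class. It is an idealized illustration: the function names are hypothetical, and the evenly spaced B-allele fractions are a simplifying assumption, not a description of how GenomeStudio clusters genotypes.

```python
def dosage_classes(ploidy, alleles=("A", "B")):
    """List the genotype classes of a biallelic SNP at a given ploidy."""
    ref, alt = alleles
    return [ref * (ploidy - k) + alt * k for k in range(ploidy + 1)]


def expected_b_fractions(ploidy):
    """Idealized B-allele signal fraction for each dosage class."""
    return [k / ploidy for k in range(ploidy + 1)]


def nearest_class(b_fraction, ploidy):
    """Assign an observed B-allele fraction to the closest dosage class."""
    k = min(range(ploidy + 1), key=lambda i: abs(i / ploidy - b_fraction))
    return dosage_classes(ploidy)[k]


print(dosage_classes(4))
print(expected_b_fractions(4))
print(nearest_class(0.3, 4))
# Spacing between neighbouring classes shrinks as 1/ploidy.
print(len(dosage_classes(8)), 1 / 8)
```

A tetraploid gives five classes spaced 0.25 apart, whereas an octoploid gives nine classes only 0.125 apart, which suggests why signal clusters become harder to separate as ploidy increases.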
Complexity reduction is a common strategy for dealing with the issues
arising from complex polyploids where high coverage is required. The
complexity of the template to be sequenced can be reduced in a number of
ways, depending on the desired approach. Limiting the sequence to
expressed sequence tags (ESTs) may produce an appropriate amount of
sequence data and can be a useful alternative in gene discovery; it has
also been employed for SNP discovery in crop species lacking a reference
genome [82, 89]. Other complexity reduction methods are based on
enzymatic digestion or AFLP (Amplified Fragment Length Polymorphism)
amplification using the CRoPS system [90].
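How enzymatic digestion reduces template complexity can be shown with a toy in silico digest followed by size selection. This is a sketch under stated assumptions and not the CRoPS protocol: the EcoRI recognition site (G^AATTC) is used only as a familiar example, and the sequence, size window, and function names are hypothetical.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Cut a sequence at every occurrence of a recognition site.

    cut_offset is the cut position within the site (EcoRI cuts G^AATTC, so 1).
    Only the given strand is scanned, which suffices for palindromic sites.
    """
    fragments, start = [], 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments


def size_select(fragments, min_len, max_len):
    """Keep only fragments whose length falls inside the selection window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


genome = "AAAGAATTCCCCCCGAATTCTT"  # toy sequence, not real data
fragments = digest(genome)
kept = size_select(fragments, min_len=5, max_len=10)
retained = sum(len(f) for f in kept) / len(genome)
print([len(f) for f in fragments], kept, round(retained, 2))
```

Only the fragments inside the size window would go forward to sequencing, so the same sequencing effort is concentrated on a reproducible fraction of the genome.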
Although most SNP identification using next-generation sequencing
approaches can be carried out without prior knowledge of the reference
genome, the sequence capture approach enriches for regions of interest
based on predesigned probes, which can describe one contiguous region or
many small regions up to a total size limit. A similar output can be
generated using long amplicons sequenced in multiple individuals, as has
been done for Eucalyptus [91] and rice [92]. Access to a reference
genome for the species of interest is the ideal situation; however, a
number of groups have developed methods to work around its absence. If
the focus is on the transcriptome, ESTs can be assembled into a rough
draft using either newly generated data [80] or, where resources for
sequencing are not available, publicly available EST data [82].
Broadening the focus to the whole genome, an “on the fly” reference
genome has been generated in the wheat progenitor species Aegilops
tauschii [93] by sequencing one individual at a low level of coverage
using the longer reads generated on the Roche 454 platform. The
individuals that were the focus of that study could then be sequenced to
a much greater depth on other platforms and the results compared to the
generated genome. This enabled the successful prediction of nearly half
a million SNPs. A similar approach is to isolate individual chromosome
arms and use these as the template. This is a suitable method for size
and complexity reduction and has already been used to sequence a number
of the large chromosomes of T. aestivum [94–97].
6 Conclusion
Molecular markers offer abundant applications in plant molecular
genomics and breeding. Despite the increasing accessibility and
improvement of genome sequencing technologies, molecular markers remain
essential components of all large-scale genomic analyses, not only by
facilitating genome assembly but also through their demonstrated value
in high-throughput genotyping, comparative and evolutionary genomics,
trait mapping, and plant breeding. As such, molecular markers are likely
to continue to be developed and successfully applied toward advancing
plant genomics for many years to come.
References
1. Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA, Selker EU, Cresko WA, Johnson EA (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One 3:e3376
2. Duran C, Edwards D, Batley J (2009) Genetic maps and the use of synteny. In: Somers DJ, Langridge P, Gustafson JP (eds) Plant genomics: methods and protocols. Humana Press, New York, NY, pp 41–56
3. Duran C, Edwards D, Batley J (2009) Molecular marker discovery and genetic map visualisation. In: Edwards D (ed) Applied bioinformatics. Springer, New York, pp 165–189
4. Edwards D, Wilcox S, Barrero RA, Fleury D, Cavanagh CR, Forrest KL, Hayden MJ, Moolhuijzen P, Keeble-Gagnere G, Bellgard MI, Lorenc MT, Shang CA, Baumann U, Taylor JM, Morell MK, Langridge P, Appels R, Fitzgerald A (2012) Bread matters: a national initiative to profile the genetic diversity of Australian wheat. Plant Biotechnol J 10:703–708
5. Hayward A, Dalton-Morgan J, Mason A, Zander M, Edwards D, Batley J (2012) SNP discovery and applications in Brassica napus. J Plant Biotechnol 39:49–61
6. Kaur S, Cogan NO, Ye G, Baillie RC, Hand ML, Ling AE, McGearey AK, Kaur J, Hopkins CJ, Todorovic M, Mountford H, Edwards D, Batley J, Burton W, Salisbury P, Gororo N, Marcroft S, Kearney G, Smith KF, Forster JW, Spangenberg GC (2009) Genetic map construction and QTL mapping of resistance to blackleg (Leptosphaeria maculans) disease in Australian canola (Brassica napus L.) cultivars. Theor Appl Genet 120:71–83
7. Pilet ML, Delourme R, Foisset N, Renard M (1998) Identification of loci contributing to quantitative field resistance to blackleg disease, causal agent Leptosphaeria maculans (Desm.) Ces. et de Not., in Winter rapeseed (Brassica napus L.). Theor Appl Genet 96:23–30
8. Qiu D, Morgan C, Shi J, Long Y, Liu J, Li R, Zhuang X, Wang Y, Tan X, Dietrich E, Weihmann T, Everett C, Vanstraelen S, Beckett P, Fraser F, Trick M, Barnes S, Wilmer J, Schmidt R, Li J, Li D, Meng J, Bancroft I (2006) A comparative linkage map of oilseed rape and its use for QTL analysis of seed oil and erucic acid content. Theor Appl Genet 114:67–80
9. Smooker AM, Wells R, Morgan C, Beaudoin F, Cho K, Fraser F, Bancroft I (2011) The identification and mapping of candidate genes and QTL involved in the fatty acid desaturation pathway in Brassica napus. Theor Appl Genet 122:1075–1090
10. Tollenaere R, Hayward A, Dalton-Morgan J, Campbell E, Lee JRM, Lorenc MT, Manoli S, Stiller J, Raman R, Raman H, Edwards D,
Batley J (2012) Identification and characterization of candidate Rlm4 blackleg resistance genes in Brassica napus using next-generation sequencing. Plant Biotechnol J 10:709–715
11. Choi SR, Teakle GR, Plaha P, Kim JH, Allender CJ, Beynon E, Piao ZY, Soengas P, Han TH, King GJ, Barker GC, Hand P, Lydiate DJ, Batley J, Edwards D, Koo DH, Bang JW, Park BS, Lim YP (2007) The reference genetic linkage map for the multinational Brassica rapa genome sequencing project. Theor Appl Genet 115:777–792
12. Edwards D, Batley J, Cogan NOI, Forster JW, Chagné D (2007) Single nucleotide polymorphism discovery. In: Oraguzie N, Rikkerink E, Gardiner S, Silva H (eds) Association mapping in plants. Springer, New York, pp 53–76
13. Love C, Logan E, Erwin T, Kaur J, Lim GAC, Hopkins C, Batley J, James N, May S, Spangenberg G, Edwards D (2006) Integrating and interrogating diverse Brassica data within an EnsEMBL structured database. Proceedings of the joint meeting of the fourteenth crucifer genetics workshop and fourth ishs symposium on Brassicas. Acta Hort 706:77–82
14. Bevan M, Murphy G (1999) The small, the large and the wild: the value of comparison in plant genomics. Trends Genet 15:211–214
15. Feuillet C, Keller B (2002) Comparative genomics in the grass family: molecular characterization of grass genome structure and evolution. Ann Bot 89:3–10
16. Galvão VC, Nordstrom KJV, Lanz C, Sulz P, Mathieu J, Pose D, Schmid M, Weigel D, Schneeberger K (2012) Synteny-based mapping-by-sequencing enabled by targeted enrichment. Plant J 71:517–526
17. McClean PE, Mamidi S, McConnell M, Chikara S, Lee R (2010) Synteny mapping between common bean and soybean reveals extensive blocks of shared loci. BMC Genomics 11:184
18. Zhu HY, Kim DJ, Baek JM, Choi HK, Ellis LC, Kuester H, McCombie WR, Peng HM, Cook DR (2003) Syntenic relationships between Medicago truncatula and Arabidopsis reveal extensive divergence of genome organization. Plant Physiol 131:1018–1026
19. Abdurakhmonov IY, Abdukarimov A (2008) Application of association mapping to understanding the genetic diversity of plant germplasm resources. Int J Plant Genomics 2008:574927
20. Gupta PK, Rustgi S, Kulwal PL (2005) Linkage disequilibrium and association studies in higher plants: present status and future prospects. Plant Mol Biol 57:461–485
21. Rafalski JA (2010) Association genetics in crop improvement. Curr Opin Plant Biol 13:174–180
22. Cowling WA, Balázs E (2010) Prospects and challenges for genome-wide association and genomic selection in oilseed Brassica species. Genome 53:1024–1028
23. Atwell S, Huang YS, Vilhjalmsso BJ et al (2010) Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465:627–631
24. Cardon LR, Bell JI (2001) Association study designs for complex diseases. Nat Rev Genet 2:91–99
25. Flint-Garcia SA, Thornsberry JM, Buckler ES (2003) Structure of linkage disequilibrium in plants. Annu Rev Plant Physiol Plant Mol Biol 54:357–374
26. Oraguzie N (2007) An overview of association mapping. In: Oraguzie N, Rikkerink E, Gardiner S, Silva H (eds) Association mapping in plants. Springer, New York, pp 1–9
27. Neale DB, Savolainen O (2004) Association genetics of complex traits in conifers. Trends Plant Sci 9:325–330
28. Waugh R, Jannink JL, Muehlbauer GJ, Ramsay L (2009) The emergence of whole genome association scans in barley. Curr Opin Plant Biol 12:218–222
29. Yu JM, Buckler ES (2006) Genetic association mapping and genome organization of maize. Curr Opin Biotechnol 17:155–160
30. Chagné D, Batley J, Edwards D, Forster JW (2007) Single nucleotide polymorphisms genotyping in plants. In: Oraguzie N, Rikkerink E, Gardiner S, Silva H (eds) Association mapping in plants. Springer, New York, pp 77–94
31. Duran C, Eales D, Marshall D, Imelfort M, Stiller J, Berkman PJ, Clark T, McKenzie M, Appleby N, Batley J, Basford K, Edwards D (2010) Future tools for association mapping in crop plants. Genome 53:1017–1023
32. Yan JB, Shah T, Warburton ML, Buckler ES, McMullen MD, Crouch J (2009) Genetic characterization and linkage disequilibrium estimation of a global maize collection using SNP markers. PLoS One 4:e8451
33. Guerra FP, Wegrzyn JL, Sykes R, Davis MF, Stanton BJ, Neale DB (2013) Association genetics of chemical wood properties in black poplar (Populus nigra). New Phytol 197:162–176
34. Appleby N, Edwards D, Batley J (2009) New technologies for ultra-high throughput genotyping in plants. In: Somers DJ, Langridge P, Gustafson JP (eds) Plant genomics: methods and protocols. Humana Press, New York, NY, pp 19–39
35. Semagn K, Bjornstad A, Ndjiondjop MN (2006) An overview of molecular marker methods for plants. Afr J Biotechnol 5:2540–2568
36. Mohan M, Nair S, Bhagwat A, Krishna TG, Yano M, Bhatia CR, Sasaki T (1997) Genome mapping, molecular markers and marker-assisted selection in crop plants. Mol Breed 3:87–103
37. Hong CP, Piao ZY, Kang TW, Batley J, Yang TJ, Hur YK, Bhak J, Park BS, Edwards D, Lim YP (2007) Genomic distribution of simple sequence repeats in Brassica rapa. Mol Cells 23:349–356
38. Chèvre AM, Barret P, Eber F, Dupuy P, Brun H, Tanguy X, Renard M (1997) Selection of stable Brassica napus-B. juncea recombinant lines resistant to blackleg (Leptosphaeria maculans): identification of molecular markers, chromosomal and genomic origin of the introgression. Theor Appl Genet 95:1104–1111
39. Somers DJ, Rakow G, Prabhu VK, Friesen KRD (2001) Identification of a major gene and RAPD markers for yellow seed coat colour in Brassica napus. Genome 1077–1082
40. Hansen M, Hallden C, Nilsson NO, Sall T (1997) Marker-assisted selection of restored male-fertile Brassica napus plants using a set of dominant RAPD markers. Mol Breed 3:449–456
41. Tanhuanpää PK, Vilkki JP, Vilkki HJ (1995) Association of a RAPD marker with linolenic acid concentration in the seed oil of rapeseed (Brassica napus L). Genome 38:414–416
42. Barker GLA, Edwards KJ (2009) A genome-wide analysis of single nucleotide polymorphism diversity in the world's major cereal crops. Plant Biotechnol J 7:318–325
43. Ching A, Caldwell KS, Jung M, Dolan M, Smith OS, Tingey S, Morgante M, Rafalski AJ (2002) SNP frequency, haplotype structure and linkage disequilibrium in elite maize inbred lines. BMC Genet 3:19
44. Snowdon RJ, Friedt W (2004) Molecular markers in Brassica oilseed breeding: current status and future possibilities. Plant Breed 123:1–8
45. Syvänen AC (2005) Toward genome-wide SNP genotyping. Nat Genet 37:S5–S10
46. Varshney RK, Nayak SN, May GD, Jackson SA (2009) Next-generation sequencing technologies and their implications for crop genetics and breeding. Trends Biotechnol 27:522–530
47. Meuwissen T (2007) Genomic selection: marker assisted selection on a genome wide scale. J Anim Breed Genet 124:321–322
48. Morris GP, Ramu P, Deshpande SP, Hash CT, Shah T, Upadhyaya HD, Riera-Lizarazu O, Brown PJ, Acharya CB, Mitchell SE, Harriman J, Glaubitz JC, Buckler ES, Kresovich S (2013) Population genomic and genome-wide association studies of agroclimatic traits in sorghum. Proc Natl Acad Sci U S A 110:453–458
49. Yang HA, Tao Y, Zheng ZQ, Li CD, Sweetingham MW, Howieson JG (2012) Application of next-generation sequencing for rapid marker development in molecular plant breeding: a case study on anthracnose disease resistance in Lupinus angustifolius L. BMC Genomics 13:318
50. Jiang HC, Feng YT, Bao L, Li X, Gao GJ, Zhang QL, Xiao JH, Xu CG, He YQ (2012) Improving blast resistance of Jin 23B and its hybrid rice by marker-assisted gene pyramiding. Mol Breed 30:1679–1688
51. Zhao K, Tung CW, Eizenga GC, Wright MH, Ali ML, Price AH, Norton GJ, Islam MR, Reynolds A, Mezey J, McClung AM, Bustamante CD, McCouch SR (2011) Genome-wide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa. Nat Commun 2:467
52. Lippman ZB, Semel Y, Zamir D (2007) An integrated view of quantitative trait variation using tomato interspecific introgression lines. Curr Opin Genet Dev 17:545–552
53. Schauer N, Semel Y, Balbo I, Steinfath M, Repsilber D, Selbig J, Pleban T, Zamir D, Fernie AR (2008) Mode of inheritance of primary metabolic traits in tomato. Plant Cell 20:509–523
54. Schauer N, Semel Y, Roessner U, Gur A, Balbo I, Carrari F, Pleban T, Perez-Melis A, Bruedigam C, Kopka J, Willmitzer L, Zamir D, Fernie AR (2006) Comprehensive metabolic profiling and phenotyping of interspecific introgression lines for tomato improvement. Nat Biotechnol 24:447–454
55. Liu YS, Gur A, Ronen G, Causse M, Damidaux R, Buret M, Hirschberg J, Zamir D (2003) There is more to tomato fruit colour than candidate carotenoid genes. Plant Biotechnol J 1:195–207
56. Tieman DM, Zeigler M, Schmelz EA, Taylor MG, Bliss P, Kirst M, Klee HJ (2006) Identification of loci affecting flavour volatile emissions in tomato fruits. J Exp Bot 57:887–896
57. Eshed Y, Zamir D (1995) An introgression line population of Lycopersicon pennellii in the cultivated tomato enables the identification and fine mapping of yield-associated QTL. Genetics 141:1147–1162
58. Semel Y, Nissenbaum J, Menda N, Zinder M, Krieger U, Issman N, Pleban T, Lippman Z,
Gur A, Zamir D (2006) Overdominant quantitative trait loci for yield and fitness in tomato. Proc Natl Acad Sci U S A 103:12981–12986
59. Kamenetzky L, Asis R, Bassi S, de Godoy F, Bermudez L, Fernie AR, Van Sluys MA, Vrebalov J, Giovannoni JJ, Rossi M, Carrari F (2010) Genomic analysis of wild tomato introgressions determining metabolism- and yield-associated traits. Plant Physiol 152:1772–1786
60. Howell PM, Marshall DF, Lydiate DJ (1996) Towards developing intervarietal substitution lines in Brassica napus using marker-assisted selection. Genome 39:348–358
61. Zou J, Zhu JL, Huang SM, Tian ET, Xiao Y, Fu DH, Tu JX, Fu TD, Meng JL (2010) Broadening the avenue of intersubgenomic heterosis in oilseed Brassica. Theor Appl Genet 120:283–290
62. Cowling WA (2007) Genetic diversity in Australian canola and implications for crop breeding for changing future environments. Field Crop Res 104:103–111
63. Foster JT, Allan GJ, Chan AP, Rabinowicz PD, Ravel J, Jackson PJ, Keim P (2010) Single nucleotide polymorphisms for assessing genetic diversity in castor bean (Ricinus communis). BMC Plant Biol 10:13
64. Allan G, Williams A, Rabinowicz PD, Chan AP, Ravel J, Keim P (2008) Worldwide genotyping of castor bean germplasm (Ricinus communis L.) using AFLPs and SSRs. Genet Resour Crop Evol 55:365–378
65. Bagavathiannan MV, Julier B, Barre P, Gulden RH, Van Acker RC (2010) Genetic diversity of feral alfalfa (Medicago sativa L.) populations occurring in Manitoba, Canada and comparison with alfalfa cultivars: an analysis using SSR markers and phenotypic traits. Euphytica 173:419–432
66. Pradhan A, Nelson MN, Plummer JA, Cowling WA, Yan GJ (2011) Characterization of Brassica nigra collections using simple sequence repeat markers reveals distinct groups associated with geographical location, and frequent mislabelling of species identity. Genome 54:50–63
67. Wang J, Kaur S, Cogan NOI, Dobrowolski MP, Salisbury PA, Burton WA, Baillie R, Hand M, Hopkins C, Forster JW, Smith KF, Spangenberg G (2009) Assessment of genetic diversity in Australian canola (Brassica napus L.) cultivars using SSR markers. Crop Pasture Sci 60:1193–1201
68. Edwards D, Forster J, Chagné D, Batley J (2007) What are SNPs? In: Oraguzie N, Rikkerink E, Gardiner S, Silva H (eds) Association mapping in plants. Springer, New York, pp 41–52
69. Fourmann M, Barret P, Froger N, Baron C, Charlot F, Delourme R, Brunel D (2002) From Arabidopsis thaliana to Brassica napus: development of amplified consensus genetic markers (ACGM) for construction of a gene map. Theor Appl Genet 105:1196–1206
70. Ferguson ME, Hearne SJ, Close TJ, Wanamaker S, Moskal WA, Town CD, de Young J, Marri PR, Rabbi IY, de Villiers EP (2012) Identification, validation and high-throughput genotyping of transcribed gene SNPs in cassava. Theor Appl Genet 124:685–695
71. Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, Wang X, Ott F, Muller J, Alonso-Blanco C, Borgwardt K, Schmid KJ, Weigel D (2011) Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet 43:956–U960
72. He GH, Prakash C (2001) Evaluation of genetic relationships among botanical varieties of cultivated peanut (Arachis hypogaea L.) using AFLP markers. Genet Resour Crop Evol 48:347–352
73. Hyten DL, Song QJ, Zhu YL, Choi IY, Nelson RL, Costa JM, Specht JE, Shoemaker RC, Cregan PB (2006) Impacts of genetic bottlenecks on soybean genome diversity. Proc Natl Acad Sci U S A 103:16666–16671
74. Levi A, Thomas CE, Keinath AP, Wehner TC (2001) Genetic diversity among watermelon (Citrullus lanatus and Citrullus colocynthis) accessions. Genet Resour Crop Evol 48:559–566
75. Song K, Osborn TC (1992) Polyphyletic origins of Brassica napus – new evidence based on organelle and nuclear RFLP analyses. Genome 35:992–1001
76. Chen S, Nelson MN, Chevre AM, Jenczewski E, Li ZY, Mason AS, Meng JL, Plummer JA, Pradhan A, Siddique KHM, Snowdon RJ, Yan GJ, Zhou WJ, Cowling WA (2011) Trigenomic bridges for Brassica improvement. Crit Rev Plant Sci 30:524–547
77. Yu FQ, Gugel RK, Kutcher HR, Peng G, Rimmer SR (2013) Identification and mapping of a novel blackleg resistance locus LepR4 in the progenies from Brassica napus x B. rapa subsp. sylvestris. Theor Appl Genet 126:307–315
78. Hayward A, McLanders J, Campbell E, Edwards D, Batley J (2012) Genomic advances will herald new insights into the Brassica: Leptosphaeria maculans pathosystem. Plant Biol 14:1–10
79. Lysak MA, Koch MA (2011) Phylogeny, genome, and karyotype evolution of crucifers (Brassicaceae). In: Schmidt R, Bancroft I (eds)
Genetics and genomics of the Brassicaceae. Springer, New York, pp 1–31
80. Hu Z, Huang S, Sun M, Wang H, Hua W (2012) Development and application of single nucleotide polymorphism markers in the polyploid Brassica napus by 454 sequencing of expressed sequence tags. Plant Breed 131:293–299
81. Schranz ME, Song BH, Windsor AJ, Mitchell-Olds T (2007) Comparative genomics in the Brassicaceae: a family-wide perspective. Curr Opin Plant Biol 10:168–175
82. Trick M, Long Y, Meng JL, Bancroft I (2009) Single nucleotide polymorphism (SNP) discovery in the polyploid Brassica napus using Solexa transcriptome sequencing. Plant Biotechnol J 7:334–346
83. Mayer KFX, Waugh R, Langridge P et al (2012) A physical, genetic and functional sequence assembly of the barley genome. Nature 491:711–716
84. Schnable PS, Ware D, Fulton RS et al (2009) The B73 maize genome: complexity, diversity, and dynamics. Science 326:1112–1115
85. Chagné D, Crowhurst RN, Troggio M, Davey MW, Gilmore B, Lawley C, Vanderzande S, Hellens RP, Kumar S, Cestaro A, Velasco R, Main D, Rees JD, Iezzoni A, Mockler T, Wilhelm L, Van de Weg E, Gardiner SE, Bassil N, Peace C (2012) Genome-wide SNP detection, validation, and development of an 8K array for apple. PLoS One 7:e31745
86. Verde I, Bassil N, Scalabrin S, Gilmore B, Lawley CT, Gasic K, Micheletti D, Rosyara UR, Cattonaro F, Vendramin E, Main D, Aramini V, Blas AL, Mockler TC, Bryant DW, Wilhelm L, Troggio M, Sosinski B, Aranzana MJ, Arus P, Iezzoni A, Morgante M, Peace C (2012) Development and evaluation of a 9K SNP array for peach by internationally coordinated SNP detection and validation in breeding germplasm. PLoS One 7:e35668
87. You FM, Deal KR, Wang J, Britton MT, Fass JN, Lin D, Dandekar A, Leslie CA, Aradhya M, Luo MC, Dvorak J (2012) Genome-wide SNP discovery in walnut with an AGSNP pipeline updated for SNP discovery in allogamous organisms. BMC Genomics 13:354
88. Bundock PC, Eliott FG, Ablett G, Benson AD, Casu RE, Aitken KS, Henry RJ (2009) Targeted single nucleotide polymorphism (SNP) discovery in a highly polyploid plant
89. Iorizzo M, Senalik DA, Grzebelus D, Bowman M, Cavagnaro PF, Matvienko M, Ashrafi H, Van Deynze A, Simon PW (2011) De novo assembly and characterization of the carrot transcriptome reveals novel genes, new markers, and genetic diversity. BMC Genomics 12:389
90. van Orsouw NJ, Hogers RCJ, Janssen A, Yalcin F, Snoeijers S, Verstege E, Schneiders H, van der Poel H, van Oeveren J, Verstegen H, van Eijk MJT (2007) Complexity reduction of polymorphic sequences (CRoPS (TM)): a novel approach for large-scale polymorphism discovery in complex genomes. PLoS One 2:e1172
91. Hendre PS, Kamalakannan R, Varghese M (2012) High-throughput and parallel SNP discovery in selected candidate genes in Eucalyptus camaldulensis using Illumina NGS platform. Plant Biotechnol J 10:646–656
92. Kharabian-Masouleh A, Waters DL, Reinke RF, Henry RJ (2011) Discovery of polymorphisms in starch-related genes in rice germplasm by amplification of pooled DNA and deeply parallel sequencing. Plant Biotechnol J 9:1074–1085
93. You FM, Huo N, Deal KR, Gu YQ, Luo MC, McGuire PE, Dvorak J, Anderson OD (2011) Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence. BMC Genomics 12:59
94. Berkman PJ, Lai KT, Lorenc MT, Edwards D (2012) Next-generation sequencing applications for wheat crop improvement. Am J Bot 99:365–371
95. Berkman PJ, Skarshewski A, Manoli S, Lorenc MT, Stiller J, Smits L, Lai KT, Campbell E, Kubalakova M, Simkova H, Batley J, Dolezel J, Hernandez P, Edwards D (2012) Sequencing wheat chromosome arm 7BS delimits the 7BS/4AL translocation and reveals homoeologous gene conservation. Theor Appl Genet 124:423–432
96. Hernandez P, Martis M, Dorado G, Pfeifer M, Galvez S, Schaaf S, Jouve N, Simkova H, Valarik M, Dolezel J, Mayer KFX (2012) Next-generation sequencing and syntenic integration of flow-sorted arms of wheat chromosome 4A exposes the chromosome structure and gene content. Plant J 69:377–386
97. Lai K, Berkman PJ, Lorenc MT, Duran C, Smits L, Manoli S, Stiller J, Edwards D (2012) WheatGenome.info: an integrated database
species using 454 sequencing. Plant Biotechnol J and portal for wheat genome information.
7:347–354 Plant Cell Physiol 53:e2
Chapter 3
Bioinformatics: Identification of Markers from Next-Generation Sequence Data
Pradeep Ruperao and David Edwards
Abstract
Next-generation sequencing (NGS) technology has revolutionized plant genomics. Combined with
new software tools, NGS enables the discovery, validation, and assessment of genetic markers
on a large scale. Among the different marker systems, simple sequence repeats (SSRs) and
single nucleotide polymorphisms (SNPs) are the markers of choice for genetics and plant
breeding. SSR markers have been widely used for large-scale characterization of germplasm
collections, construction of genetic maps, and QTL identification. SNPs are the most abundant
form of genetic variation and occur at high frequency throughout plant genomes. This chapter
discusses the tools available for genome assembly and focuses on SSR and SNP marker discovery.
Key words Next-generation sequencing (NGS), Genetic markers, SSRs, Microsatellites, SNPs,
Mapping tools, Assembly tools, SSRPrimerII, SGSautoSNP
1 Introduction
The advent of next-generation sequencing (NGS) has revolutionized genomic and transcriptomic
approaches to biology [1–4]. New sequencing tools are also valuable for the discovery,
validation, and assessment of genetic markers in populations [5–8]. Molecular marker
technology has developed rapidly over the last decade, and two forms of sequence-based marker,
simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs), now predominate in
modern genetic analysis, linking phenotype with the underlying genotype [9–12]. NGS has led to
the production of large volumes of data that can be used for genome sequencing and the mining
of SSRs and SNPs [13–18]. These markers may then be applied for diversity analysis, genetic
trait mapping, association studies, and marker-assisted selection [19]. The ability to mine
these data for molecular marker discovery depends on the development of advanced
bioinformatics tools and databases [20–23]. This chapter discusses the application of several
tools for genetic marker discovery from NGS data. Several NGS technologies are available and
each can be applied for the discovery of markers across almost any genome of interest.
NGS technology has enabled the discovery and genotyping of markers at a very high density for
comprehensive genome-wide association studies. Many biological questions can now be answered
with high accuracy, for example, mapping recombination breakpoints for trait association and
characterizing genomic differences between populations, as well as the implementation of
genomic selection for crop improvement. This chapter provides examples of current SNP and SSR
marker discovery from NGS data.
1.1 What Are SSRs?
SSRs, also known as microsatellites, are repeating DNA sequences of 1–6 nucleotides that occur
ubiquitously in all prokaryotic and eukaryotic genomes. The number of repeat units may vary
among individual genotypes, making SSRs useful for genetic analysis. This allelic variability
makes SSR markers more informative per locus than SNPs [24].
The main limitation in the development of SSR markers has been the discovery of sequences
containing SSRs to allow primer design for polymerase chain reaction (PCR) amplification and
genotyping. SSRs in the coding regions of genes may modify gene function. Because most such
modifications are likely to be detrimental, the number of SSRs and polymorphisms within coding
regions is expected to be lower than in noncoding sequences. Hence genomic noncoding regions
are the preferred source of sequence for SSR mining. The isolation of SSRs has traditionally
been a labor-intensive and costly process, yielding a relatively small number of markers. The
process involved the construction of genomic libraries enriched for targeted SSR motifs and
the isolation and sequencing of clones containing the SSR [25]. Additionally, primers from a
single SSR locus should amplify only the target locus and the SSR should show clear
polymorphism. Computational approaches overcome many of the limitations of SSR discovery, and
with the rapid expansion of NGS, there is an increasing abundance of DNA sequence data
suitable for SSR discovery.
1.2 What Are SNPs?
Single nucleotide polymorphisms, frequently called SNPs, are the most common type of genetic
variation among species [26]. A SNP is a single base change in a DNA sequence that can be
classified as one of two types. Transitions are purine–purine (A⇔G) or pyrimidine–pyrimidine
(C⇔T) changes, while transversions are purine–pyrimidine or pyrimidine–purine changes (A⇔C,
A⇔T, G⇔C, G⇔T). The development of high-throughput methods for the discovery and genotyping
of SNPs has led to a revolution in their use as molecular markers [4, 27–29]. In principle, at
each position in a sequence, any of the four possible nucleotide bases can be present;
however, SNPs are usually biallelic. SNPs have a low mutation rate and are abundant in
populations. Because of the resolution they provide, they are often considered the ultimate
genetic marker [26]. Interestingly, SNPs are commonly found associated with SSR sequences [30].
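The transition/transversion classification described in this section can be expressed in a few lines of code. The following is a minimal sketch for illustration only; the function name classify_snp is hypothetical and is not part of any tool discussed in this chapter.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_snp(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a valid SNP: {ref!r} -> {alt!r}")
    # A transition stays within one chemical class (purine or pyrimidine);
    # a transversion switches between the two classes.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"
```

For example, classify_snp("A", "G") gives "transition", while classify_snp("A", "C") gives "transversion".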
2 New Marker Discovery Technology
Among the different NGS technologies available (Table 1), the 454 and Illumina systems are
commonly used for SSR discovery. The 454 pyrosequencing method uses a fragmented nucleic acid
template ligated with adaptor sequences at each end. These adaptors are used as priming sites
for emulsion PCR and pyrosequencing. Illumina technology uses bridge PCR to amplify fragmented
DNA, followed by sequencing by synthesis using fluorescently labeled nucleotides with
reversible terminators. Life Technologies recently introduced the Ion Torrent sequencing
system, a scalable semiconductor technique that uses an integrated circuit to perform
nonoptical sequencing [31].
Table 1
Sequencing technologies
Features                454              Illumina     Ion Torrent
Sequence chemistry      Pyrosequencing   Synthesis    Semiconductor
Amplification approach  EmPCR            BridgePCR    EmPCR
Paired-end support      No               Yes          No
Read length (bp)        350–1,000        100–250      ~200
Table 2
Assembly software
Name                    Technology                           Website
GsAssembler             Sanger, 454                          http://www.horticulture.wisc.edu/node/361
CLC Genomics Workbench  Sanger, 454, Illumina, Ion Torrent   http://www.clcbio.com/index.php?id=575
Velvet                  Sanger, 454, Illumina                http://www.ebi.ac.uk/~zerbino/velvet/
SeqMan NGen             Sanger, 454, Illumina, Ion Torrent   http://www.dnastar.com/t-products-seqman-ngen.aspx
ABySS                   Illumina                             http://www.bcgsc.ca/platform/bioinfo/software/abyss
Euler                   Sanger, 454, Illumina                http://www2.nbcr.net/wordpress2/eular/
SOAPdenovo              Illumina                             http://soap.genomics.org.cn/soapdenovo.html
SaSSY                   Illumina                             https://github.com/minillinim/SaSSY
MIRA                    Sanger, 454, Illumina, Ion Torrent   http://sourceforge.net/apps/mediawiki/mira-assembler/
NextGENe                454, Illumina, Ion Torrent           http://softgenetics.com/NextGENe.html
Newbler                 Sanger, 454                          http://454.com/products/analysis-software/
TMAP                    Ion Torrent                          http://ioncommunity.lifetechnologies.com/
Geneious                Sanger, 454, Illumina, Ion Torrent   http://www.geneious.com/
Based on recent studies, shotgun sequencing of a genome or transcriptome by NGS is the easiest
way to discover SNP or SSR loci. However, the source of sequence for SNP or SSR identification
depends on the researcher’s interest and project goals. Sequence assembly is often the first
step in NGS-based marker discovery, to generate longer DNA sequences or contigs. The choice of
assembly method depends on several factors, including the type of data and the availability of
bioinformatics resources. The longer 454 sequence reads may be used for SSR discovery without
assembly, though assembled sequences are usually longer, assisting PCR primer design and
enabling the identification of candidate polymorphic SSRs [32–34]. Special consideration
should be given to the choice of assembly software. Some of the frequently used software
packages for de novo DNA sequence assembly are listed in Table 2. Each approach has its own
merits. For example, gsAssembler is specifically designed for 454 data, with the possibility
of including Sanger or other FASTA format sequence data. Geneious (from Biomatters Ltd.)
[35, 36], CLC Genomics Workbench, and SeqMan NGen (DNASTAR) are commercially available
software packages for analyzing Sanger, 454, Illumina, and other NGS datasets. Newbler is a
de novo sequence assembler developed for use with 454 sequencing data. Velvet is a de Bruijn
graph-based assembler for de novo assembly of short reads [37]. While it is fairly simple to
set up and run these software packages, significant bioinformatics and genomics knowledge is
often required to obtain optimal results.
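To give a sense of what a de Bruijn graph-based assembler such as Velvet works with, the sketch below builds the graph from a set of reads: every k-mer contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is a toy illustration of the data structure only, not Velvet's implementation; real assemblers add error correction, graph simplification, and paired-end handling. The function name and the choice of k are assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        read = read.upper()
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph
```

Contigs correspond to unbranched paths through this graph, which is why overlapping short reads can be merged without comparing every read against every other read.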
2.1 SSR Discovery
With the revolution in sequencing technology, it is now feasible to screen entire genomes for
the presence of SSRs using bioinformatics tools. The search parameters used for SSR detection
affect which SSRs are discovered. Several computational tools, such as SSRPrimer [38, 39],
also design PCR primers flanking the SSR sequences, and it is now possible to computationally
predict polymorphic SSRs [40].
Table 3
SSR tools
Name                                               Reference/website
STRING (Java search for tandem repeats in genomes) [50]
SSRPrimerII                                        http://www.appliedbioinformatics.com.au/projects/ssrPrimer
MicroSAtellite (MISA)                              http://pgrc.ipk-gatersleben.de/misa/
Sputnik                                            http://espressosoftware.com/sputnik/index.html
BuildSSR                                           [102]
SSR Identification Tool (SSRIT)                    [103]
Tandem Repeats Finder (TRF)                        [46]
Tandem Repeat Occurrence Locator (TROLL)           [56]
Mreps                                              [42]
SSRSEARCH                                          ftp://ftp.gramene.org/pub/gramene/software/scripts/ssr.pl
Msatfinder                                         http://www.genomics.ceh.ac.uk/msatfinder/
RepeatMasker                                       http://www.mendeley.com/research/repeatmasker-open30/
Imperfect Microsatellite Extractor (IMEx)          [52]
Spectral repeat finder (SRF)                       [104]
CENSOR                                             http://www.girinst.org/censor/
There is substantial variation in the algorithms used for SSR discovery. Some of the tools
available for SSR identification are listed in Table 3. They include the Perl script
MicroSAtellite (MISA) [41]; mreps [42], a program also capable of finding imperfect repeats;
the Windows-based SSR Locator [43]; and web tools such as WebSat [44] and Msatfinder 2.0. One
of the most commonly used SSR search algorithms, Sputnik, has the useful feature of allowing
the user to specify the percentage mismatch allowed in SSR discovery. In programs such as
Adplot [45] and Tandem Repeats Finder (TRF) [46], k-tuple match detection is used in
combination with wraparound dynamic programming. Tandem Repeats Analysis Program (TRAP) [47]
classifies, quantifies, and selects candidate microsatellite markers from the output of TRF.
ATRhunter [48] is similar in function to TRF but additionally uses a heuristic approach for
the detection of approximate tandem repeats. In mreps [42], all perfect repeats are first
found and then used as seeds to find imperfect repeats. Dynamic programming combined with
compression algorithms is used to identify approximate tandem repeats in a mining tool called
Search for Tandem Approximate Repeats (STAR). Similarly, dynamic programming is used in the
Advanced Content Matching Engine for Sequences (ACMES) [49] to identify repetitive sequences
from large query files. Another heuristic tool, Search for Tandem Repeats IN Genomes (STRING),
uses dynamic programming to autoalign genomic sequences [50]. Motifs of size n can be detected
using a sliding window approach. This principle has been implemented by a number of
investigators in tools such as Exact Tandem Repeats Analyzer (E-TRA) [51] and Sputnik. Some
tools, such as Imperfect Microsatellite Extractor (IMEx) [52], SciRoKo [53], and Poly [52–54],
tolerate up to k mismatches at each iteration, arising from indels or substitutions. Other
tools are based on dictionary approaches for repeat mining, including RepeatMasker [55],
TROLL [56], MISA [41], TRF [46], REPuter [57], and REPfind [58]. NGS data have increasingly
been used for the development of SSR markers. SSR-finding tools that have commonly been
applied to NGS data include msatfinder [59], E-TRA [60], msatcommander, and MISA [61].
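The simplest case discussed above, finding perfect repeats of a 1–6 bp motif, can be sketched with a regular expression that matches a motif followed by copies of itself. This is a generic illustration and does not reproduce the algorithm of Sputnik, MISA, or any other tool named here. The minimum repeat counts per motif length are assumed example thresholds, and the function name is hypothetical.

```python
import re


def find_perfect_ssrs(seq, min_repeats=None):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    # Assumed example thresholds: motif length -> minimum number of repeat units.
    if min_repeats is None:
        min_repeats = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}
    seq = seq.upper()
    hits = []
    for size, reps in min_repeats.items():
        # A motif of `size` bases followed by at least (reps - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, reps - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT"),
            # since those loci are reported under the shorter motif length.
            if any(motif == motif[:k] * (size // k)
                   for k in range(1, size) if size % k == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)
```

Start positions are 0-based. Imperfect repeats, which tools such as mreps and IMEx handle, would need mismatch-tolerant matching rather than a back-reference.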
2.2 SNP Discovery
SNPs have emerged as the markers of choice in breeding programs because of their abundance
and suitability for high-throughput detection [62]. There is huge potential to apply SNPs in
crop improvement programs, and various methods have been described to detect and genotype
SNPs.
A common way to identify SNPs from NGS data is to first map variety-specific reads to a
reference genome. Algorithms are then applied either to identify differences between the reads
and the reference or to identify sequence differences among the aligned reads, usually
including measures of accuracy to reduce the occurrence of false-positive SNP calls. Many SNP
discovery software programs have been developed. Some are provided together with
next-generation sequencers, such as CASAVA (Consensus Assessment of Sequence And Variation)
for Illumina, and GS Amplicon Variant Analyzer and GS Reference Mapper for the Roche 454
GS-FLX. Commercial software such as NextGENe (http://www.softgenetics.com/), CLC Genomics
Workbench (http://www.clcbio.com/index.php?id=1240), and Biomatters Geneious [63], and
freeware programs such as SNPdetector [64], ACCUSA [65], AGSNP [66], NGS-SNP [67],
AtlasSNP2 [68], PolyScan [69], and SGSautoSNP [70], are also available.
Table 4
SNP tools
Program       Website/reference
SOAP2         http://soap.genomics.org.cn/index.html
Samtools      http://samtools.sourceforge.net/
GATK          http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
MaCH          http://genome.sph.umich.edu/wiki/Thunder
Qcall         ftp://ftp.sanger.ac.uk/pub/rd/QCALL (multi-sample LD)
IMPUTE2       http://mathgen.stats.ox.ac.uk/impute/impute_v2.html
GigaBayes     http://bioinformatics.bc.edu/marthlab/GigaBayes
SNPdetector   [64]
Geneious      http://www.geneious.com/
PolyScan      [69]
SGSautoSNP    [70]
QualitySNP    [105]
The efficiency of variant detection depends on the accuracy of read alignment.
Burrows–Wheeler transform (BWT)-based aligners (Bowtie [71], SOAP2 [72], and BWA [73]) are
fast, memory efficient, and particularly useful for aligning repetitive reads, but are
comparatively less sensitive than hash-based algorithms such as MAQ [74], Novoalign, and
Stampy [75]. MAQ introduced mapping quality, a Phred-like measure of alignment confidence.
BFAST (BLAT-like Fast Accurate Search Tool) is an alignment tool that uses reference genome
indexing to improve alignment speed [76]. TopHat [77] is open source software designed to
align RNA-Seq reads to a reference genome without relying on known splice sites.
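A Phred-like mapping quality, as mentioned above for MAQ, is conventionally the probability that an alignment is wrong expressed on a scale of -10 log10(P). The sketch below shows that conversion in both directions. It illustrates the scale only and says nothing about how any particular aligner estimates the error probability; the function names and the cap of 255 are assumptions.

```python
import math


def mapq_from_error(p_wrong: float, cap: int = 255) -> int:
    """Convert the probability that an alignment is wrong into a Phred-scaled score."""
    if p_wrong <= 0:
        return cap  # a zero error probability would give an infinite score
    return min(cap, round(-10 * math.log10(p_wrong)))


def error_from_mapq(mapq: float) -> float:
    """Convert a Phred-scaled score back into an error probability."""
    return 10 ** (-mapq / 10)
```

On this scale a score of 20 corresponds to a 1 in 100 chance that the read is misplaced, and a score of 30 to 1 in 1,000.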
Sequence alignment is more difficult for regions with higher levels of diversity between the
reference and sequenced genomes or for genomes with significant complexity, polyploidy, or
repeats [9, 16, 78, 79]. Some of these issues can be overcome by using longer reads or paired
reads. SNP calling can proceed by counting alleles at each site and using simple cut-off rules
for when to call a SNP. In some methods, it is possible to incorporate additional information
regarding allele frequencies and/or patterns of linkage disequilibrium (LD). Methods also vary
depending on whether the individuals being sequenced are homozygous or heterozygous. A
selection of commonly used software for SNP discovery is listed in Table 4.
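Calling SNPs by counting alleles at a site and applying simple cut-off rules can be sketched as follows. The thresholds (minimum depth, minimum number of reads supporting the alternate allele, and minimum alternate allele fraction) are illustrative assumptions rather than recommended values, and the function name is hypothetical.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def call_site(bases, ref_base, min_depth=4, min_alt_count=2, min_alt_fraction=0.2):
    """Return the alternate allele at one site if it passes the cut-offs, else None."""
    counts = Counter(b.upper() for b in bases if b.upper() in VALID_BASES)
    depth = sum(counts.values())
    if depth < min_depth:
        return None  # too few reads to make a call
    alt_counts = {b: n for b, n in counts.items() if b != ref_base.upper()}
    if not alt_counts:
        return None  # every read agrees with the reference
    alt, n = max(alt_counts.items(), key=lambda item: item[1])
    if n >= min_alt_count and n / depth >= min_alt_fraction:
        return alt
    return None
```

Here `bases` is the pile of read bases covering one reference position. Probabilistic callers replace these fixed cut-offs with genotype likelihoods that take base and mapping quality into account.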
The SNP discovery software AutoSNP [80–83] has been extended to produce AutoSNPdb
[81, 82, 84], which can integrate both Sanger and Roche 454 pyrosequencing data for SNP
discovery. Another version of autoSNP, SGSautoSNP, calls SNPs from Illumina sequence data.
SGSautoSNP can generate marker assay files for the design of Illumina Infinium and GoldenGate
genotyping arrays. MAQ predicts SNPs by using the alignment quality to score SNPs, but it
requires the user to specify a minimum coverage. Slider II calls SNPs when the confidence
accumulated from the aligned reads is higher than the confidence of the base in the reference
genome. For recalibration of per-base quality scores, programs such as GATK [85] or
SOAPsnp [86] are recommended. SAMtools is a package used to manipulate NGS alignments, which
includes the computation of genotype likelihoods (samtools) and SNP and genotype calling
(bcftools). GATK can be used for realignment of NGS data, SNP and genotype calling
(UnifiedGenotyper), SNP filtering, and SNP quality recalibration (VariantRecalibrator). SNVer
is a statistical tool for calling common and rare variants from pooled or individual NGS data.
3 Case Studies
3.1 SSR Discovery
3.1.1 Genome-Wide Characterization of Simple Sequence Repeats in Cucumber (Cucumis sativus L.)
Cavagnaro et al. [87] performed a genome-wide characterization of SSRs in cucumber. Two
cucumber varieties, “9930” and “Gy14”, were used to develop SSR markers on a large scale.
Gy14 is a North American pickling cucumber line with multiple disease resistance genes and
superior horticultural characteristics. Genome sequencing of Gy14 was performed using the
Roche 454 GS FLX Titanium platform at 36× genome coverage. The sequences were assembled using
Newbler and searched for perfect SSRs with a basic motif of 2–8 bp using MISA.
Oligonucleotide primers were then designed to the SSRs using Primer3 (v. 1.1.4).
A total of 3× coverage Sanger shotgun sequence of inbred line 9930 was used to identify SSRs
with MISA, and PCR primer pairs were designed using Primer3 (v. 1.1.4). Using an in silico
PCR strategy, the PCR primer pairs were mapped onto the Gy14 sequence scaffolds. Genomic
sequence delimited by the primer pairs was extracted, analyzed, and annotated for the presence
and type of SSR using a custom Perl script. The in silico-generated amplicons from Gy14 were
then compared with the expected amplicon size from 9930. SSRs were classified as polymorphic
if amplicons from Gy14 and 9930 differed by at least 2 bp. The results were validated by PCR
amplification with genomic DNA from 9930 and Gy14.
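The final classification step of this case study, comparing in silico amplicon sizes between two lines and calling a locus polymorphic when they differ by at least 2 bp, can be sketched as below. This is not the custom Perl script used by the authors; the function name, the locus identifiers, and the dictionary-based input format are assumptions for illustration.

```python
def classify_ssr_loci(sizes_a, sizes_b, min_diff=2):
    """Label loci present in both lines as polymorphic or monomorphic by amplicon size."""
    result = {}
    # Only loci amplified (in silico) in both lines can be compared.
    for locus in sizes_a.keys() & sizes_b.keys():
        diff = abs(sizes_a[locus] - sizes_b[locus])
        result[locus] = "polymorphic" if diff >= min_diff else "monomorphic"
    return result
```

For instance, a locus giving 150 bp in one line and 156 bp in the other would be labeled polymorphic, whereas 200 bp versus 201 bp would not.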
3.1.2 SSRPrimer and SSR Taxonomy Tree: Biome SSR Discovery
Jewell et al. [38] applied an automated web-based SSR discovery method, SSRPrimer [39], which
combines SSR discovery with PCR primer design for SSR amplification. SSRs are identified using
SPUTNIK, and the results are passed to Primer3 for locus-specific primer design. This approach
was first used for individual species datasets [88–95] but was later applied to the complete
GenBank database, designing PCR amplification primers for 14 million SSRs and representing the
first biome-scale SSR discovery [38]. The resulting SSR Taxonomy Tree tool provides web-based
searching of these data, together with downloading and visualization of SSR amplification
primers.
3.2 SNP Discovery
3.2.1 Single Nucleotide Polymorphism Discovery in Elite North American Potato Germplasm
To increase the number of SNPs available for basic and applied potato genetics, Hamilton
et al. [96] conducted extensive transcriptome sequencing from three relevant potato cultivars
(Atlantic, Premier Russet, and Snowden) using the Illumina platform. Quality filtered reads
were assembled with Velvet [37] and the assemblies compared with Sanger EST collections from
the varieties Bintje, Kennebec, and Shepody. The majority of the Sanger reads were represented
within the Illumina GA2 datasets. MAQ was employed to identify and filter SNPs within the
three Illumina transcriptomes. The Infinium BeadXpress was used to validate and assess allelic
diversity in a diverse set of potato germplasm. This study identified 575,340 SNPs in elite
potato germplasm.
3.2.2 Discovery of Single Nucleotide Polymorphisms in Complex Genomes Using SGSautoSNP
Lorenc et al. [70] developed an approach called SGSautoSNP for SNP prediction, demonstrating
the method by identifying SNPs between four wheat cultivars. Variety-specific reads were
mapped to the reference wheat chromosomes 7A, 7B, and 7D [97–99] using SOAP [37]. The
resulting BAM files were used in SGSautoSNP for SNP discovery. SNPs were called between reads
in the alignment without considering the reference allele. More than 800,000 SNPs were
predicted across the wheat group 7 chromosomes with a validated accuracy of >93 %. The
approach has since been used for SNP discovery in Brassica with an accuracy of 96 % [100].
3.2.3 Coverage-Based Consensus Calling (CBCC) of Short Sequence Reads and Comparison of CBCC
Results to Identify SNPs in Chickpea (Cicer arietinum; Fabaceae), a Crop Species Without a
Reference Genome
Azam et al. [101] established coverage-based consensus calling (CBCC) for SNP calling between
the chickpea genotypes ICC4958 and ICC1882. A total of 15.7 and 22.1 million Illumina reads
for ICC4958 and ICC1882, respectively, were aligned to a chickpea transcriptome assembly using
Maq, Bowtie, Novoalign, and SOAP2. SNPs were discovered by comparing bases at each position
between genotypes from each alignment. Thus, four different sets of predicted SNPs were
compared. More than 4,500 nonredundant SNPs were identified between the two chickpea
genotypes. Among the software packages, Maq alone predicted 50 % true-positive SNPs, whereas
62.5 % of SNPs were accurately predicted by consensus from all four software packages.
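The idea of taking a consensus across SNP sets predicted from several aligners can be sketched with simple set operations. This illustrates only the consensus step, not the coverage-based calling of Azam et al.; representing a SNP as a (contig, position) tuple and the function name are assumptions.

```python
from collections import Counter


def consensus_snps(calls_by_tool, min_tools=None):
    """Return SNPs predicted by at least `min_tools` tools (default: all of them)."""
    if min_tools is None:
        min_tools = len(calls_by_tool)
    # Count how many tools support each SNP; set() guards against duplicates per tool.
    support = Counter(snp for calls in calls_by_tool.values() for snp in set(calls))
    return {snp for snp, n in support.items() if n >= min_tools}
```

Requiring agreement from all four aligners trades sensitivity for accuracy; lowering `min_tools` recovers more candidate SNPs at the cost of more false positives.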
4 Examples
4.1 SSRPrimerII
SSRPrimerII is an automated process to identify SSRs and design PCR primers (see Note 1)
(Fig. 1). It is available over the internet through a web-based graphical user interface
(GUI), and there is also a command line version for local use. The input is one or more DNA
sequences in FASTA format. Primer pairs are designed at least 10 bp from either side of the
identified SSR. The default optimum primer size is 21 bases with a maximum of 23 bases; the
default melting temperature is 55 °C with a minimum of 50 °C and a maximum of 70 °C. The
maximum GC content is set to 70 %.
Fig. 1 SSRPrimerII accepts sequences in FASTA format to find SSR markers with SPUTNIK and
PRIMER3
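The default primer constraints listed above map naturally onto Primer3 input tags. The sketch below writes a Primer3 Boulder-IO record using Primer3 2.x tag names, widening the target by 10 bp on each side of the SSR so that primers cannot be placed closer than that. How SSRPrimerII itself passes these settings to Primer3 is not described in this chapter, so this is an assumption-laden illustration; the function name and the 0-based coordinate convention are also assumptions.

```python
def primer3_record(seq_id, template, ssr_start, ssr_length, flank=10):
    """Build a Primer3 Boulder-IO record keeping primers `flank` bp away from the SSR."""
    # Widen the target so primers must lie at least `flank` bp outside the SSR.
    target_start = max(0, ssr_start - flank)
    target_end = min(len(template), ssr_start + ssr_length + flank)
    tags = {
        "SEQUENCE_ID": seq_id,
        "SEQUENCE_TEMPLATE": template,
        "SEQUENCE_TARGET": f"{target_start},{target_end - target_start}",
        "PRIMER_OPT_SIZE": 21,   # default optimum primer size (bases)
        "PRIMER_MAX_SIZE": 23,   # maximum primer size (bases)
        "PRIMER_OPT_TM": 55.0,   # default melting temperature (deg C)
        "PRIMER_MIN_TM": 50.0,
        "PRIMER_MAX_TM": 70.0,
        "PRIMER_MAX_GC": 70.0,   # maximum GC content (%)
    }
    lines = [f"{key}={value}" for key, value in tags.items()]
    # A line containing only "=" terminates a Boulder-IO record.
    return "\n".join(lines) + "\n=\n"
```

The returned text could be piped to a local primer3_core installation; parsing the primer pairs out of the response is left out of this sketch.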
1. Choose the sequences from which to identify SSR molecular
markers. In this example, we will use wheat genes annotated
with the term drought.
2. First, identify the genes at the National Center for Biotechnology Information (NCBI)
(http://www.ncbi.nlm.nih.gov/). Select the “Nucleotide” database from the dropdown list and
type “wheat [orgn] AND drought” in the search box to find related sequences (see Note 2).
3. Select the first ten wheat genes from the results. Click on the “Send to” dropdown list,
select “File” as the destination and “FASTA” as the format, and click the “Create File” button
to download the sequences in FASTA format (Fig. 2). Alternatively, these sequences may be
downloaded from
http://www.appliedbioinformatics.com.au/projects/ssrPrimer/example-ssrPrimer.fasta.
4. Open the SSRPrimerII Web site (http://www.appliedbioinformatics.com.au/projects/ssrPrimer)
(Fig. 3). Click on “Choose File” to upload the FASTA file or, alternatively, paste FASTA
format sequences in the text box provided, then click on the “Submit to Pipeline” button to
start the SSRPrimer pipeline.
5. The identified SSRs are tabulated in the result table and can be downloaded as a
tab-separated value (TSV) file. The PCR primers designed to amplify each identified SSR are
also listed (Fig. 4). Primer characteristics from the PRIMER3 software are also displayed for
further manipulation.
Fig. 2 Retrieval and downloading of wheat drought related sequences from GenBank
Fig. 3 The sequence entry page for SSRPrimerII
Fig. 4 Example result from SSRPrimerII
4.2 SGSautoSNP (Second-Generation Sequencing AutoSNP)
SGSautoSNP [70] (see Note 3) is specifically designed to identify SNPs from Illumina genome
shotgun data. A reference is used for mapping the reads, and SNPs are then called between
these mapped reads. The SGSautoSNP algorithm uses two steps to call a SNP at each locus.
First, SNP calling requires a SNP redundancy score of at least 2, where the SNP redundancy
score is the minimum number of reads calling the SNP allele at the locus. Second, after this
initial SNP calling, the SNPs are checked to confirm that all bases within each variety at
the SNP locus are the same (Fig. 5), and the locus is ignored if a SNP appears within a
variety. For this reason, this approach is only suitable for homozygous varieties.
Fig. 5 SGSautoSNP calls SNPs between cultivars that are represented by at least two reads.
SNPs within a cultivar are ignored as they are likely to represent mis-mapping in homozygous
species
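The two-step calling rule described above can be sketched for a single locus as follows. This is one reading of the description, not the SGSautoSNP source code: each allele must be supported by at least two reads in total (the redundancy score), and any cultivar showing more than one base at the locus causes the locus to be ignored. The function name and input format are assumptions.

```python
def call_snp(bases_by_cultivar, min_redundancy=2):
    """Return {allele: [cultivars]} if the locus passes both checks, else None."""
    allele_reads = {}      # allele -> total number of supporting reads
    allele_cultivars = {}  # allele -> cultivars carrying that allele
    for cultivar, bases in bases_by_cultivar.items():
        observed = set(bases)
        if not observed:
            continue  # no coverage for this cultivar at this locus
        if len(observed) > 1:
            return None  # variation within a cultivar: likely mis-mapping
        allele = observed.pop()
        allele_reads[allele] = allele_reads.get(allele, 0) + len(bases)
        allele_cultivars.setdefault(allele, []).append(cultivar)
    if len(allele_reads) < 2:
        return None  # no difference between cultivars
    if min(allele_reads.values()) < min_redundancy:
        return None  # an allele is supported by too few reads
    return allele_cultivars
```

Note that the reference base is never consulted, which matches the statement that SNPs are called between the mapped reads rather than against the reference.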
1. Use the Short Oligonucleotide Analysis Package (SOAP) to align paired reads to the
reference (see Note 4). Before aligning reads, build index files for the reference genome.
Syntax: 2bwt-builder reference.fa
2. Align the SGS paired-end reads for each cultivar against the formatted index files,
retaining only reads that align uniquely to the reference (see Note 5).
Syntax: soap -a readsA.fastq -b readsB.fastq -D index.file -o PE_out.soap -2 SE_output -m min_insert_size -x max_insert_size -r 0
3. Use only the aligned paired reads (for better accuracy) and convert them into sorted and
indexed BAM files using the SOAP2BAM.py script available within the SGSautoSNP package
(see Note 6).
Syntax: python SOAP2BAM.py -s PE_out.soap -f reference.fasta -r path/to/dir
4. To allow SGSautoSNP.py to differentiate the reads from each cultivar, each read ID in each
BAM file needs to be modified to include a cultivar reference tag using the
generate_subset_BAM.py script.
Syntax: python generate_subset_BAM.py --bam soapfile.bam --reference reference.fasta --chr_name ChrName --cultivar CultivarName --res_dir path/to/dir
5. Use Picard (see Note 7) to remove duplicate reads from the BAM file.
Syntax: java -Xmx4g -jar MarkDuplicates.jar INPUT=subsetfile.bam OUTPUT=clonesremove.bam METRICS_FILE=filename.stat REMOVE_DUPLICATES=true ASSUME_SORTED=true
6. Finally, the BAM files for each cultivar from each chromosome must be merged using
SAMtools (see Note 8) to produce a single BAM file for each chromosome (see Note 9).
Syntax: samtools merge merge.bam clonesremove1.bam clonesremove2.bam clonesremove3.bam…
7. SGSautoSNP.py uses the merged BAM file, along with the reference, for SNP discovery. On
successful completion, the script produces a statistics file “file.stat” that contains SNP
calling statistics including (a) scaffold name, (b) SNP number, (c) SNP types (transitions
and transversions), and (d) scaffold length. A second file with the extension “.snp” contains
human-readable SNP information. Other output formats, such as VCF, GFF3, “.map”, and
“.extension”, are supported to display SNPs in Geneious, MagicViewer, GBrowse, and Flapjack.
Syntax: python SGSautoSNP.py --bam merge.bam --fasta reference.fasta --snpid_prefix ID --chr_offset offset.gff3 --contig_output SNPcontig.snpCn --chr_output SNPchr.snpChr --cultivars "A,B,C" --pu 4
8. The filter_SNPs.py script parses the text “SNPchr.snp” file to retrieve SNPs between
specific individuals of interest. It also produces a “.matrix” file summarizing the SNPs
between all combinations of cultivars.
Syntax: python filter_snps.py --snps SNPchr.snp --chr_name ChrName --chr_output snpChr.filt --contig_output SNPcontig.filt --dir path/to/dir
9. Bam2ConsensusSequence.py requires BAM format files to generate consensus sequences for
each scaffold. These consensus sequences can be used for downstream analysis.
Syntax: bam2consensus_seqs.py --bam merge.bam --fasta ref.fasta --output path/to/dir
10. To generate Illumina marker assays for designing Illumina Infinium and GoldenGate
genotyping arrays, the SNP2Markers.py script accepts the consensus sequences generated in
step 9 and the “.snp” file from step 7.
Syntax: python SNPs2Markers.py --fasta consSeq.fasta --snp chr.snp --marker_name name --species species_name --germplasm germplasm_name --library library_name --panel panel_name --chr_name ChrName --dir path/to/dir
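When several cultivars are processed for one chromosome, the duplicate removal and merging steps (steps 5 and 6) are repeated per cultivar and are convenient to script. The sketch below only assembles the command lines shown in those steps as argument lists and does not run anything. The file-naming scheme and the function names are hypothetical.

```python
def dedup_command(in_bam, out_bam, metrics):
    """Picard MarkDuplicates command from step 5, as an argument list."""
    return ["java", "-Xmx4g", "-jar", "MarkDuplicates.jar",
            f"INPUT={in_bam}", f"OUTPUT={out_bam}", f"METRICS_FILE={metrics}",
            "REMOVE_DUPLICATES=true", "ASSUME_SORTED=true"]


def merge_command(out_bam, in_bams):
    """samtools merge command from step 6, as an argument list."""
    return ["samtools", "merge", out_bam, *in_bams]


def plan_chromosome(chrom, cultivars):
    """List the commands needed to deduplicate each cultivar and merge the results."""
    commands, deduped = [], []
    for cultivar in cultivars:
        # Hypothetical naming scheme: <chromosome>_<cultivar>.bam
        src = f"{chrom}_{cultivar}.bam"
        dst = f"{chrom}_{cultivar}.dedup.bam"
        commands.append(dedup_command(src, dst, f"{chrom}_{cultivar}.stat"))
        deduped.append(dst)
    commands.append(merge_command(f"{chrom}_merge.bam", deduped))
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the corresponding tools and input files are in place.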
5 Notes
1. SSRPrimer is a pipeline that integrates SPUTNIK, an SSR finder, with Primer3, a PCR primer
design tool.
2. This search currently returns 104 sequences, but the number may increase as the database
grows.
3. The SGSautoSNP pipeline is a robust SNP discovery tool implemented in Python 2.7 for
command line execution on any operating system. The program is multithreaded, which helps when
handling large genomes.
4. SOAP is an alignment tool package used to analyze SGS data, available at
http://soap.genomics.org.cn/.
5. Reads with multiple hits are excluded so that only uniquely aligned reads are used, which
avoids the identification of false-positive SNPs.
6. The Python modules pysam and Biopython are required for SGSautoSNP to convert SAM/BAM
formats.
7. Picard is a set of Java-based command-line utilities that manipulate SAM files and is
available at http://picard.sourceforge.net/.
8. SAMtools provides various utilities for manipulating alignments in SAM format (the most
commonly used alignment format). The SAMtools package is available at
http://samtools.sourceforge.net/.
9. In SGSautoSNP, the reads of each cultivar should be aligned separately to the reference
genome, and the mapped paired reads from all cultivars should then be merged into a single
BAM file for each chromosome.
References
1. Appleby N, Edwards D, Batley J (2009) New Langridge P, Gustafson JP (eds) Plant genom-
technologies for ultra-high throughput geno- ics. Humana, New York, pp 41–56
typing in plants. In: Somers DJ, Langridge P, 13. Edwards D, Wang X (2012) Genome
Gustafson JP (eds) Plant genomics. Humana, Sequencing Initiatives. In: Edwards D, Parkin
Louisville, KY, pp 19–40 IAP, Batley J (eds) Genetics. Genomics and
2. Edwards D, Batley J, Snowdon R (2013) breeding of oilseed Brassicas. Science Publishers
Accessing complex crop genomes with next- Inc., New Hampshire, pp 152–157
generation sequencing. Theor Appl Genet 14. Edwards D, Batley J (2010) Plant genome
126:1–11 sequencing: applications for crop improve-
3. Berkman PJ, Lai K, Lorenc MT, Edwards D ment. Plant Biotechnol J 7:1–8
(2012) Next generation sequencing applica- 15. Imelfort M, Edwards D (2009) De novo
tions for wheat crop improvement. Am J Bot sequencing of plant genomes using
99:365–371 second-generation technologies. Brief
4. Duran C, Eales D, Marshall D, Imelfort M, Bioinform 10:609–618
Stiller J, Berkman PJ, Clark T, McKenzie M, 16. Imelfort M, Duran C, Batley J, Edwards D
Appleby N, Batley J, Basford K, Edwards D (2009) Discovering genetic polymorphisms
(2010) Future tools for association mapping in next-generation sequencing data. Plant
in crop plants. Genome 53:1017–1023 Biotechnol J 7:312–317
5. Lorenc MT, Boskovic Z, Stiller J, Duran C, 17. Nie X, Li B, Wang L, S B, Liu S, Li T, Dolezel
Edwards D (2012) Role of bioinformatics as a J, Edwards D, Luo MC, Weining S (2012)
tool for oilseed Brassica species. In: Edwards Development of chromosome-arm-specific
D, Parkin IAP, Batley J (eds) Genetics. microsatellite markers in Triticum aestivum
Genomics and breeding of oilseed Brassicas. (Poaceae) using NGS technology. Am J Bot
Science Publishers Inc., New Hampshire, 99:e369–e371
pp 194–205 18. Lai K, Duran C, Berkman PJ, Lorenc MT,
6. Duran C, Boskovic Z, Batley J, Edwards D Stiller J, Manoli S, Hayden MJ, Forrest KL,
(2011) Role of bioinformatics as a tool for Fleury D, Baumann U, Zander M, Mason AS,
vegetable Brassica species. In: Sadowski J (ed) Batley J, Edwards D (2012) Single nucleotide
Vegetable Brassicas. Science Publishers, Inc., polymorphism discovery from wheat next-
New Hampshire, pp 406–418 generation sequence data. Plant Biotechnol J
7. Edwards D (2011) Wheat bioinformatics. In: 10:743–749
Bonjean A, Angus W, Van Ginkel M (eds) 19. Duran C, Appleby N, Edwards D, Batley J
The world wheat book. Lavoisier, Paris, (2009) Molecular genetic markers: discovery,
pp 851–875 applications, data storage and visualisation.
8. Batley J, Jewell E, Edwards D (2007) Curr Bioinform 4:16–27
Automated discovery of single nucleotide 20. Lai K, Berkman PJ, Lorenc MT, Duran C,
polymorphism (SNP) and simple sequence Smits L, Manoli S, Stiller J, Edwards D
repeat (SSR) molecular genetic markers. In: (2012) WheatGenome.info: an integrated
Edwards D (ed) Plant bioinformatics. database and portal for wheat genome infor-
Humana, New York, pp 473–494 mation. Plant Cell Physiol 53:1–7
9. Duran C, Edwards D, Batley J (2009) 21. Lai K, Lorenc MT, Edwards D (2012)
Molecular marker discovery and genetic map Genomic databases for crop improvement.
visualisation. In: Edwards D, Hanson D, Agronomy 2:62–73
Stajich J (eds) Applied bioinformatics. 22. Edwards D, Batley J (2008) Bioinformatics:
Springer, New York, pp 165–189 fundamentals and applications in plant genet-
10. Edwards D, Batley J (2004) Plant bioinfor- ics, mapping and breeding. In: Kole C,
matics: from genome to phenome. Trends Abbott AG (eds) Principles and practices of
Biotechnol 22:232–237 plant genomics. Science Publishers Inc, New
11. Batley J, Edwards D (2007) SNP applications Hampshire, pp 269–302
in plants. In: Oraguzie NC, Rikkerink EHA, 23. Edwards D (2007) Bioinformatics and plant
Gardiner SE, De Silva HN (eds) Association genomics for staple crops improvement. In:
mapping in plants. Springer, New York, Kang MS, Priyadarshan M (eds) Breeding major
pp 95–102 food staples. Blackwell, London, pp 93–106
12. Duran C, Edwards D, Batley J (2009) Genetic 24. Hamblin MT, Warburton ML, Buckler ES
maps and the use of synteny. In: Somers DJ, (2007) Empirical comparison of simple
sequence repeats and single nucleotide poly- 35. Meintjes P, Duran C, Kearse M, Moir R,
morphisms in assessment of maize diversity Wilson A, Stones-Havas S, Cheung M,
and relatedness. PLoS One 2:e1367 Sturrock S, Buxton S, Cooper A, Markowitz
25. Edwards KJ, Barker JHA, Daly A, Jones C, S, Thierer T, Ashton B, Heled J (2012)
Karp A (1996) Microsatellite libraries Geneious Basic: an integrated and extendable
enriched for several microsatellite sequences desktop software platform for the organiza-
in plants. Biotechniques 20:758 tion and analysis of sequence data.
26. Edwards D, Forster JW, Chagné D, Batley J Bioinformatics 28:1647–1649
(2007) What are SNPs? In: Oraguzie NC, 36. Drummond AJ, Ashton BSB, Cheung M,
Rikkerink EHA, Gardiner SE, De Silva HN Cooper A, Duran C, Field M, Heled J, Kearse
(eds) Association mapping in plants. Springer, M, Markowitz S, Moir R, Stones-Havas S,
New York, pp 41–52 Sturrock S, Thierer T, Wilson A (2011)
Geneious v5.4. http://www.geneious.com
27. Gupta PK (2008) Single-molecule DNA
sequencing technologies for future genomics 37. Zerbino DR, Birney E (2008) Velvet: algo-
research. Trends Biotechnol 26:602–611 rithms for de novo short read assembly using
de Bruijn graphs. Genome Res 18:821–829
28. Edwards D, Forster JW, Cogan NOI, Batley
J, Chagné D (2007) Single nucleotide poly- 38. Jewell E, Robinson A, Savage D, Erwin T,
morphism discovery. In: Oraguzie NC, Love CG, Lim GA, Li X, Batley J, Spangenberg
Rikkerink EHA, Gardiner SE, De Silva HN GC, Edwards D (2006) SSRPrimer and SSR
(eds) Association mapping in plants. Springer, taxonomy tree: biome SSR discovery. Nucleic
New York, pp 53–76 Acids Res 34:W656–W659
29. Chagné D, Batley J, Edwards D, Forster JW 39. Robinson AJ, Love CG, Batley J, Barker G,
(2007) Single nucleotide polymorphism Edwards D (2004) Simple sequence repeat
genotyping in plants. In: Oraguzie NC, marker loci discovery using SSR primer.
Rikkerink EHA, Gardiner SE, De Silva HN Bioinformatics 20:1475–1476
(eds) Association mapping in plants. Springer, 40. Duran C, Singhania R, Raman H, Batley J,
New York, pp 77–94 Edwards D (2013) Predicting polymorphic
30. Mogg R, Batley J, Hanley S, Edwards D, EST-SSRs in silico. Mol Ecol Resour 13:
O'Sullivan H, Edwards KJ (2002) 538–545
Characterization of the flanking regions of 41. Thiel T, Michalek W, Varshney RK, Graner A
Zea mays microsatellites reveals a large num- (2003) Exploiting EST databases for the
ber of useful sequence polymorphisms. Theor development and characterization of gene-
Appl Genet 105:532–543 derived SSR-markers in barley (Hordeum vul-
31. Rothberg JM, Hinz W, Rearick TM, Schultz gare L.). Theor Appl Genet 106:411–422
J, Mileski W, Davey M, Leamon JH, Johnson 42. Kolpakov R, Bana G, Kucherov G (2003)
K et al (2011) An integrated semiconductor mreps: efficient and flexible detection of
device enabling non-optical genome sequenc- tandem repeats in DNA. Nucleic Acids Res
ing. Nature 475:348–352 31:3672–3678
32. Blanca J, Canizares J, Roig C, Ziarsolo P, Nuez 43. da Maia LC, Palmieri DA, de Souza VQ,
F, Pico B (2011) Transcriptome characteriza- Kopp MM, de Carvalho FI, Costa de Oliveira
tion and high throughput SSRs and SNPs A (2008) SSR locator: tool for simple
discovery in Cucurbita pepo (Cucurbitaceae). sequence repeat discovery integrated with
BMC Genomics 12:104 primer design and PCR simulation. Int J
Plant Genomics 2008:412696
33. Parchman TL, Geist KS, Grahnen JA,
Benkman CW, Buerkle CA (2010) 44. Martins WS, Lucas DC, Neves KF, Bertioli DJ
Transcriptome sequencing in an ecologically (2009) WebSat – a web software for microsat-
important tree species: assembly, annotation, ellite marker development. Bioinformation
and marker discovery. BMC Genomics 11:180 3:282–283
34. Hiremath PJ, Farmer A, Cannon SB, 45. Taneda A (2004) Adplot: detection and visu-
Woodward J, Kudapa H, Tuteja R, Kumar A, alization of repetitive patterns in complete
Bhanuprakash A, Mulaosmanovic B, Gujaria genomes. Bioinformatics 20:701–708
N, Krishnamurthy L, Gaur M, Kavikishor B, 46. Benson G (1999) Tandem repeats finder: a
Shah T, Srinivasan R, Lohse M, Xiao Y, Town program to analyze DNA sequences. Nucleic
CD, Cook DR, May GD, Varshney RK Acids Res 27:573–580
(2011) Large-scale transcriptome analysis in 47. Sobreira TJ, Durham AM, Gruber A (2006)
chickpea (Cicer arietinum L.), an orphan TRAP: automated classification, quantifica-
legume crop of the semi-arid tropics of Asia tion and annotation of tandemly repeated
and Africa. Plant Biotechnol J 9:922–931 sequences. Bioinformatics 22:361–362
48. Wexler Y, Yakhini Z, Kashi Y, Geiger D 63. Kearse M, Moir R, Wilson A, Stones-Havas S,
(2005) Finding approximate tandem repeats Cheung M, Sturrock S, Buxton S, Cooper A,
in genomic sequences. J Comput Biol 12: Markowitz S, Duran C, Thierer T, Ashton B,
928–942 Meintjes P, Drummond A (2012) Geneious
49. Reneker J, Shyu CR, Zeng P, Polacco JC, basic: an integrated and extendable desktop
Gassmann W (2004) ACMES: fast multiple- software platform for the organization and
genome searches for short repeat sequences analysis of sequence data. Bioinformatics
with concurrent cross-species information 28:1647–1649
retrieval. Nucleic Acids Res 32:W649–W653 64. Zhang J, Wheeler DA, Yakub I, Wei S, Sood
50. Parisi V, De Fonzo V, Aluffi-Pentini F (2003) R, Rowe W, Liu PP, Gibbs RA, Buetow KH
STRING: finding tandem repeats in DNA (2005) SNPdetector: a software tool for
sequences. Bioinformatics 19:1733–1738 sensitive and accurate SNP detection. PLoS
51. Karaca M, Bilgen M, Onus AN, Ince AG, Comput Biol 1:e53
Elmasulu SY (2005) Exact tandem repeats 65. Frohler S, Dieterich C (2010) ACCUSA–
analyzer (E-TRA): a new program for DNA accurate SNP calling on draft genomes.
sequence mining. J Genet 84:49–54 Bioinformatics 26:1364–1365
52. Mudunuri SB, Nagarajaram HA (2007) 66. You FM, Deal KR, Wang J, Britton MT, Fass
IMEx: imperfect microsatellite extractor. JN, Lin D, Dandekar AM, Leslie CA, Aradhya
Bioinformatics 23:1181–1187 M, Luo MC, Dvorak J (2012) Genome-wide
53. Kofler R, Schlotterer C, Lelley T (2007) SNP discovery in walnut with an AGSNP
SciRoKo: a new tool for whole genome mic- pipeline updated for SNP discovery in alloga-
rosatellite search and investigation. mous organisms. BMC Genomics 13:354
Bioinformatics 23:1683–1685 67. Grant JR, Arantes AS, Liao X, Stothard P
54. Bizzaro JW, Marx KA (2003) Poly: a quanti- (2011) In-depth annotation of SNPs arising
tative analysis tool for simple sequence repeat from resequencing projects using NGS-
(SSR) tracts in DNA. BMC Bioinformatics SNP. Bioinformatics 27:2300–2301
4:22 68. Shen Y, Wan Z, Coarfa C, Drabek R, Chen L,
55. Tarailo-Graovac M, Chen N (2009) Using Ostrowski EA, Liu Y, Weinstock GM, Wheeler
RepeatMasker to identify repetitive elements DA, Gibbs RA, Yu F (2010) A SNP discovery
in genomic sequences. Curr Protoc method to assess variant allele probability
Bioinformat Chapter 4:Unit 4 10 from next-generation resequencing data.
56. Castelo AT, Martins W, Gao GR (2002) Genome Res 20:273–280
TROLL–tandem repeat occurrence locator. 69. Chen K, McLellan MD, Ding L, Wendl MC,
Bioinformatics 18:634–636 Kasai Y, Wilson RK, Mardis ER (2007)
57. Kurtz S, Schleiermacher C (1999) REPuter: PolyScan: an automatic indel and SNP detec-
fast computation of maximal repeats in com- tion approach to the analysis of human rese-
plete genomes. Bioinformatics 15:426–427 quencing data. Genome Res 17:659–666
58. Betley JN, Frith MC, Graber JH, Choo S, 70. Lorenc MT, Hayashi S, Stiller J, Lee H,
Deshler JO (2002) A ubiquitous and con- Manoli S, Ruperao P, Visendi P, Berkman PJ,
served signal for RNA localization in chor- Lai K, Batley J, Edwards D (2012) Discovery
dates. Curr Biol 12:1756–1761 of single nucleotide polymorphisms in com-
59. Faircloth BC (2008) msatcommander: detec- plex genomes using SGSautoSNP. Biology
tion of microsatellite repeat arrays and auto- 1:370–382
mated, locus-specific primer design. Mol Ecol 71. Langmead B, Trapnell C, Pop M, Salzberg SL
Resour 8:92–94 (2009) Ultrafast and memory-efficient align-
60. Perry JC, Rowe L (2011) Rapid microsatellite ment of short DNA sequences to the human
development for water striders by next- genome. Genome Biol 10:R25
generation sequencing. J Hered 102:125–129 72. Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen
61. Garg R, Patel RK, Tyagi AK, Jain M (2011) K, Wang J (2009) SOAP2: an improved ultra-
De novo assembly of chickpea transcriptome fast tool for short read alignment. Bioinformatics
using short reads for gene discovery and 25:1966–1967
marker identification. DNA Res 18:53–63 73. Li H, Durbin R (2009) Fast and accurate
62. Collins LA, Torrero MN, Franzblau SG short read alignment with Burrows-Wheeler
(1998) Green fluorescent protein reporter transform. Bioinformatics 25:1754–1760
microplate assay for high-throughput screen- 74. Li H, Ruan J, Durbin RM (2008) Mapping
ing of compounds against Mycobacterium short DNA sequencing reads and calling vari-
tuberculosis. Antimicrob Agents Chemother ants using mapping quality scores. Genome
42:344–347 Res 18:1851–1858
75. Lunter G, Goodson M (2011) Stampy: a sta- (2010) Genome-wide characterization of

tistical algorithm for sensitive and fast map- simple sequence repeats in cucumber (Cucumis
ping of Illumina sequence reads. Genome Res sativus L.). BMC Genomics 11:569
21:936–939 88. Hong C, Piao ZY, Kang TW, Batley J, Yang
76. Homer N, Merriman B, Nelson SF (2009) TJ, Hur YK, Bhak J, Park BS, Edwards D,
BFAST: an alignment tool for large scale Lim Y (2007) Genomic distribution of simple
genome resequencing. PLoS One 4:e7767 sequence repeats in Brassica rapa. Mol Cells
77. Trapnell C, Pachter L, Salzberg SL (2009) 23:349–356
TopHat: discovering splice junctions with 89. Hopkins CJ, Cogan NOI, Hand M, Jewell E,
RNA-Seq. Bioinformatics 25:1105–1111 Kaur J, Li X, Lim GAC, Ling AE, Love C,
78. Lee HC, Lai K, Lorenc MT, Imelfort M, Mountford H, Todorovic M, Vardy M,
Duran C, Edwards D (2012) Bioinformatics Spangenberg GC, Edwards D, Batley J (2007)
tools and databases for analysis of next- Sixteen new simple sequence repeat markers
generation sequence data. Brief Funct from Brassica juncea expressed sequences and
Genomics 11:12–24 their cross-species amplification. Mol Ecol
79. Batley J, Edwards D (2009) Mining for single Notes 7:697–700
nucleotide polymorphism (SNP) and simple 90. Ling AE, Kaur J, Burgess B, Hand M,
sequence repeat (SSR) molecular genetic Hopkins CJ, Li X, Love CG, Vardy M,
markers. In: Posada D (ed) Bioinformatics for Walkiewicz M, Spangenberg G, Edwards D,
DNA sequence analysis. Humana, New York, Batley J (2007) Characterization of simple
pp 303–322 sequence repeat markers derived in silico from
80. Batley J, Barker G, O'Sullivan H, Edwards KJ, Brassica rapa bacterial artificial chromosome
Edwards D (2003) Mining for single nucleo- sequences and their application in Brassica
tide polymorphisms and insertions/deletions napus. Mol Ecol Notes 7:273–277
in maize expressed sequence tag data. Plant 91. Burgess B, Mountford H, Hopkins CJ, Love
Physiol 132:84–91 C, Ling AE, Spangenberg GC, Edwards D,
81. Duran C, Appleby N, Clark T, Wood D, Batley J (2006) Identification and character-
Imelfort M, Batley J, Edwards D (2009) ization of simple sequence repeat (SSR) mark-
AutoSNPdb: an annotated single nucleotide ers derived in silico from Brassica oleracea
polymorphism database for crop plants. genome shotgun sequences. Mol Ecol Notes
Nucleic Acids Res 37:D951–D953 6:1191–1194
82. Savage D, Batley J, Erwin T, Logan E, Love 92. Batley J, Hopkins CJ, Cogan NOI, Hand M,
CG, Lim GA, Mongin E, Barker G, Jewell E, Kaur J, Kaur S, Li X, Ling AE, Love
Spangenberg GC, Edwards D (2005) C, Mountford H, Todorovic M, Vardy M,
SNPServer: a real-time SNP discovery tool. Walkiewicz M, Spangenberg GC, Edwards D
Nucleic Acids Res 33:W493–W495 (2007) Identification and characterization of
simple sequence repeat markers from Brassica
83. Barker G, Batley J, O'Sullivan H, Edwards napus expressed sequences. Mol Ecol Notes
KJ, Edwards D (2003) Redundancy based 7:886–889
detection of sequence polymorphisms in
expressed sequence tag data using 93. Keniry A, Hopkins CJ, Jewell E, Morrison B,
autoSNP. Bioinformatics 19:421–422 Spangenberg GC, Edwards D, Batley J
(2006) Identification and characterization of
84. Duran C, Appleby N, Vardy M, Imelfort M, simple sequence repeat (SSR) markers from
Edwards D, Batley J (2009) Single nucleotide Fragaria x ananassa expressed sequences.
polymorphism discovery in barley using Mol Ecol Notes 6:319–322
autoSNPdb. Plant Biotechnol J 7:326–333
94. Mortimer J, Batley J, Love C, Logan E,
85. McKenna A, Hanna M, Banks E, Sivachenko Edwards D (2005) Simple sequence repeat
A, Cibulskis K, Kernytsky A, Garimella K, (SSR) and GC distribution in the Arabidopsis
Altshuler D, Gabriel S, Daly M, DePristo MA thaliana genome. J Plant Biotechnol 7:
(2010) The genome analysis toolkit: a 17–25
MapReduce framework for analyzing next-
generation DNA sequencing data. Genome 95. Hong C, Plaha P, Koo DH, Yang TJ, Choi
Res 20:1297–1303 SR, Lee YK, Uhm T, Bang JW, Edwards D,
Bancrofts I, Park BS, Lee J, Lim Y (2006) A
86. Li R, Li Y, Kristiansen K, Wang J (2008) survey of the Brassica rapa genome by BAC-
SOAP: short oligonucleotide alignment pro- End sequence analysis and comparison with
gram. Bioinformatics 24:713–714 Arabidopsis thaliana. Mol Cells 22:300–307
87. Cavagnaro F, Senalik DA, Yang L, Simon W, 96. Hamilton J, Hansey CN, Whitty BR, Stoffel
Harkins TT, Kodira CD, Huang S, Weng Y K, Massa AN, Van Deynze A, De Jong WS,
Douches DS, Buell CR (2011) Single nucleo- 101. Azam S, Thakur V, Ruperao P, Shah T, Balaji
tide polymorphism discovery in elite North J, Amindala B, Farmer AD, Studholme DJ,
American potato germplasm. BMC Genomics May GD, Edwards D, Jones JD, Varshney RK
12:302 (2012) Coverage-based consensus calling
97. Berkman PJ, Skarshewski A, Lorenc MT, Lai (CbCC) of short sequence reads and com-
K, Duran C, Ling EYS, Stiller J, Smits L, parison of CbCC results to identify SNPs in
Imelfort M, Manoli S, McKenzie M, chickpea (Cicer arietinum; Fabaceae), a crop
Kubalakova M, Simkova H, Batley J, Fleury species without a reference genome. Am J Bot
D, Dolezel J, Edwards D (2011) Sequencing 99:186–192
and assembly of low copy and genic regions of 102. Rungis D, Berube Y, Zhang J, Ralph S,
isolated Triticum aestivum chromosome arm Ritland CE, Ellis BE, Douglas C, Bohlmann
7DS. Plant Biotechnol J 9:768–775 J, Ritland K (2004) Robust simple sequence
98. Berkman PJ, Skarshewski A, Manoli S, Lorenc repeat markers for spruce (Picea sp) from
MT, Stiller J, Smits L, Lai K, Campbell E, expressed sequence tags. Theor Appl Genet
Kubalakova M, Simkova H, Batley J, Dolezel J, 109:1283–1294
Hernandez P, Edwards D (2012) Sequencing 103. Kantety RV, La Rota M, Matthews DE,
wheat chromosome arm 7BS delimits the Sorrells ME (2002) Data mining for simple
7BS/4AL translocation and reveals homoeol- sequence repeats in expressed sequence tags
ogous gene conservation. Theor Appl Genet from barley, maize, rice, sorghum and wheat.
124:423–432 Plant Mol Biol 48:501–510
99. Berkman PJ, Visendi P, Lee HC, Stiller J, 104. Sharma D, Issac B, Raghava G, Ramaswamy R
Manoli S, Lorenc MT, Lai K, Batley J, Fleury (2004) Spectral repeat finder (SRF): identifica-
D, Šimková H, Kubaláková M, Weining S, tion of repetitive sequences using Fourier trans-
Doležel J, Edwards D (2013) Dispersion and formation. Bioinformatics 20:1405–1412
domestication shaped the genome of bread 105. Tang J, Vosman B, Voorrips RE, van der
wheat. Plant Biotechnol J 11:564–571 Linden CG, Leunissen JA (2006) QualitySNP:
100. Hayward A, Dalton-Morgan J, Mason A, a pipeline for detecting single nucleotide
Zander M, Edwards D, Batley J (2012) SNP polymorphisms and insertions/deletions in
discovery and applications in Brassica napus. EST data from diploid and polyploid species.
J Plant Biotechnol 39:1–12 BMC Bioinformatics 7:438
Chapter 4
Molecular Marker Databases

Kaitao Lai, Michał Tadeusz Lorenc, and David Edwards
Abstract
The detection and analysis of genetic variation plays an important role in plant breeding and this role is
increasing with the continued development of genome sequencing technologies. Molecular genetic markers
are important tools to characterize genetic variation and assist with genomic breeding. Processing and
storing the growing abundance of molecular marker data being produced requires the development of specific
bioinformatics tools and advanced databases. Molecular marker databases range from species specific
through to organism wide and often host a variety of additional related genetic, genomic, or phenotypic infor-
mation. In this chapter, we will present some of the features of plant molecular genetic marker databases,
highlight the various types of marker resources, and predict the potential future direction of crop marker
databases.
Key words Molecular marker, Genetic marker, Genetic variation, SNP marker, SSR marker
1 Introduction
The characterization of genetic variation can provide knowledge to
help understand the molecular basis of various biological phenomena
in plants. Phenotype-based genetic markers were used in Gregor
Mendel’s experiments in the nineteenth century. Later, phenotype-
based genetic markers helped establish the theory of genetic linkage.
More recently, DNA-based markers have been developed to over-
come the limitations of phenotype-based genetic markers [1].
While several diverse DNA-based marker types have been
developed, single nucleotide polymorphisms (SNPs) and simple
sequence repeats (SSRs, also known as microsatellites) predomi-
nate and are widely used in plant breeding, genomic research,
and modern genetic analysis [2, 3]. Their applications include the
mapping of genes, quantitative trait loci (QTL) analysis,
phylogenetic studies, comparative genomics, and marker-assisted
breeding [4–6].
Most molecular marker databases host SNP and SSR markers
[7]. Some databases also include other, less commonly used marker
types, such as restriction fragment length polymorphism (RFLP),
amplified fragment length polymorphism (AFLP), random amplification
of polymorphic DNA (RAPD), short tandem repeat (STR), and diversity
arrays technology (DArT).
A SNP is a DNA sequence variation, representing an individual
nucleotide base in the genome that differs between individual
genomes [8]. SNPs are regarded as evolutionarily conserved markers
and have been used as markers for QTL analysis and in association
studies in place of SSRs. There are several approaches to identify
and genotype SNPs in plants [9, 10] and their diverse applications
suggest that they will continue to be the dominant DNA molecular
marker in the foreseeable future [11]. The application of new
sequencing methods is leading to the discovery of large numbers
of SNPs in wheat [12, 13], rice [14, 15], Brassicas [16], and other
crop species [17, 18].
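The definition of a SNP given above can be illustrated with a small script that reports the positions at which aligned sequences differ by a single base. This is a minimal sketch, assuming the sequences are already aligned to equal length; the function name find_snps and the example sequences are hypothetical, and real SNP discovery tools additionally weigh read depth and base quality.

```python
def find_snps(aligned):
    """Return [(position, {name: base})] for columns where bases differ.

    aligned: dict of name -> sequence, all of the same length.
    Columns containing a gap ('-') or an ambiguous base are skipped.
    Positions are 1-based.
    """
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(seq) != length for seq in aligned.values()):
        raise ValueError("sequences must be aligned to the same length")

    snps = []
    for i in range(length):
        column = {name: aligned[name][i].upper() for name in names}
        bases = set(column.values())
        # Keep only columns with unambiguous bases and more than one allele.
        if bases <= set("ACGT") and len(bases) > 1:
            snps.append((i + 1, column))
    return snps


example = {
    "cultivar_A": "ATGCCGTA",
    "cultivar_B": "ATGTCGTA",
    "cultivar_C": "ATGCCG-A",
}
for position, alleles in find_snps(example):
    print(position, alleles)
```

In this example only position 4 is reported; position 7 is skipped because one sequence carries a gap there.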
SSRs are highly polymorphic and informative markers. SSRs
demonstrate a high degree of transferability between different
species and so are regarded as excellent markers for comparative
genetic and genomic analysis. PCR primers designed to an SSR
from one species frequently amplify a corresponding locus in related
species. The mining of SSRs from gene and genome sequence data
is now routine [19], with large numbers of SSRs identified in a
range of species including Brassicas [20, 21], wheat [22], and
strawberry [23]. SSR loci also provide hot spots for SNP discovery
and SSRs may readily be converted to SNP markers [24].
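The mining of SSRs from sequence data can be sketched with a regular expression that looks for perfect tandem repeats. This is an illustrative example only and is not the algorithm used by any of the tools cited here; the function name find_ssrs, the motif length range (2 to 6 bases), and the minimum repeat count (4) are assumptions chosen for the example.

```python
import re


def find_ssrs(sequence, min_motif=2, max_motif=6, min_repeats=4):
    """Return [(start, end, motif, repeat_count)] for perfect tandem repeats.

    Coordinates are 1-based and inclusive. The shortest repeating motif is
    preferred, and homopolymer runs are ignored.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAA
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start() + 1, match.end(), motif, repeats))
    return hits


# Hypothetical sequence with an (AG)6 and a (CAT)4 repeat.
seq = "TTGC" + "AG" * 6 + "CCTA" + "CAT" * 4 + "GG"
for hit in find_ssrs(seq):
    print(hit)
```

The flanking sequence on either side of each reported repeat is what a primer design tool would then use to design PCR primers.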
Advances in genome sequencing technology and the increasing
availability of genome sequences are providing an abundance of
dense molecular markers [25, 26]. For example, sequence poly-
morphisms developed using the Brassica rapa genome sequence
[27] have been used to identify and characterize SNPs and other
polymorphisms in agronomically important genes in canola (B. napus)
[28–30]. In addition, the sequencing of isolated chromosome
arms in wheat [31–33] has led to the identification of large num-
bers of molecular markers [22].
Genetic linkage maps represent the order of known molecular
genetic markers along a given chromosome for a given species.
Comparative mapping is a valuable technique to identify similarities
and differences between species [34]. Many marker databases pro-
vide a CMap map visualization tool or their own customized viewer
tools for displaying data, including chromosomes and genetic mark-
ers with associated mapping locations in the form of genetic linkage
maps or comparative maps. A list of molecular marker databases is
presented in Table 1. In addition, web links and references for
relevant marker databases are presented in Table 2.
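A genetic linkage map can be represented as a mapping from marker names to positions, and a simple comparative view then amounts to finding the markers shared by two maps and checking whether their order is conserved. The following is a minimal sketch; the function names, marker names, and positions (in centimorgans) are hypothetical and are not taken from any of the databases listed.

```python
def shared_markers(map_a, map_b):
    """Return markers present on both maps as (name, pos_a, pos_b).

    Each map is a dict of marker name -> position on one linkage group.
    The result is ordered by position on map_a.
    """
    common = set(map_a) & set(map_b)
    return sorted(((m, map_a[m], map_b[m]) for m in common),
                  key=lambda item: (item[1], item[0]))


def is_collinear(pairs):
    """True if the shared markers occur in the same order on both maps."""
    positions_b = [pos_b for _, _, pos_b in pairs]
    return positions_b == sorted(positions_b)


# Hypothetical linkage groups from two species.
map_a = {"m1": 0.0, "m2": 12.5, "m3": 30.1, "m4": 48.0}
map_b = {"m2": 5.0, "m3": 22.4, "m5": 40.0, "m4": 18.0}

pairs = shared_markers(map_a, map_b)
print(pairs)
print("collinear:", is_collinear(pairs))
```

Here markers m3 and m4 are in reversed order on the second map, which is the kind of difference a comparative map viewer makes visible.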
Table 1
Examples of molecular marker databases with different types of markers
Database name Viewer SNPs SSRs RFLPs RAPDs AFLPs ESTs BACs DArTs DNA probes PCR primers
autoSNPdb * +
Brassica.info + + + + +
Brassica rapa genome database * +
Chickpea root EST database +
Cotton Marker Database (CMD) * + + +
GenBank dbSNP * +
GrainGenes * + + + + + + +
Gramene * + + + + + + + + +
ICRISAT +
Legume Information System (LIS) * + + + + +
MaizeGDB + + + + +
MoccaDB + + +
Panzea * + +
Rice Genome Annotation Project * +
SSR Primer +
SSR taxonomy tree +
SOL Genomics Network (SGN) * + + + + +
SoyBase * + + + + + +
tfGDR Project Website * +
Triticeae Mapped EST DataBase ver.2.0 (TriMEDB) * + +
VegMarks * + + +
Wheat genome information + +
* indicates that the database provides a viewer; + indicates that the database supplies that type of marker
Table 2
Examples of molecular marker databases related to crop improvement
Database name Web link References
autoSNPdb http://autosnpdb.appliedbioinformatics.com.au/ [60, 62, 63]
Brassica.info http://www.brassica.info/resource/markers.php [56]
Brassica rapa genome database http://brassicadb.org/brad/geneticMarker.php [75]
Chickpea root EST database http://www.icrisat.org/what-we-do/biotechnology/Cpest/home.asp [73]
Cotton Marker Database (CMD) http://www.cottonmarker.org/cgi-bin/cmd_search_marker_result.cgi [59]
GenBank dbSNP http://www.ncbi.nlm.nih.gov/projects/SNP/ [76–78]
GrainGenes http://wheat.pw.usda.gov/cgi-bin/graingenes/browse.cgi?class=marker [45–47]
Gramene http://www.gramene.org/db/markers/marker_view [42]
ICRISAT http://www.icrisat.org/ [74]
Legume Information System (LIS) http://www.comparative-legumes.org/ [79, 80]
MaizeGDB http://www.maizegdb.org/probe.php [81–83]
MoccaDB http://moccadb.mpl.ird.fr/index.php?cat=1 [58]
Panzea http://www.panzea.org/db/searches/webform/marker_search [84]
Rice Genome Annotation Project http://rice.plantbiology.msu.edu/annotation_pseudo_putativessr.shtml [52]
SSR Primer 2 http://flora.acpfg.com.au/ssrprimer2/ [68]
SSR taxonomy tree http://appliedbioinformatics.com.au/projects/ssrtaxonomy/php/ [68]
SOL Genomics Network (SGN) http://solgenomics.net/ [57]
SoyBase http://soybase.org/ [85]
tfGDR Project Website http://tfgdr.bioinfo.wsu.edu/ [86]
Triticeae Mapped EST Database ver.2.0 (TriMEDB) http://trimedb.psc.riken.jp/index.pl [50]
VegMarks http://vegmarks.nivot.affrc.go.jp/VegMarks/jsp/page.do?transition=marker
Wheat genome information http://www.wheatgenome.info [65, 67]
2 Molecular Marker Databases
The ever-increasing amount of genetic and genomic information
must be managed to make it available and
accessible to researchers [35, 36]. This includes the development
of custom visualization tools [36–38] and bioinformatics systems to
traverse the genome to phenome divide [39, 40]. Many molecular
marker databases provide various types of markers for a range of
species while some databases provide information on a single type
of marker [41]. The largest single marker database is dbSNP
(http://www.ncbi.nlm.nih.gov/projects/SNP/). dbSNP provides
SNP data mostly for humans and other vertebrates, although it also
includes some plant data.
There are several databases for the grasses. The Gramene data-
base (http://www.gramene.org/) hosts many types of markers
based on the genomes of rice, maize, grape, and Arabidopsis [42].
This website provides a search engine, and users can search for spe-
cific markers. Marker details are displayed in text format, including
database cross-references and map positions linked to chromosomes
in CMap [43]. Sources of SSR markers include the International
Rice Genome Sequencing Project, IRMI (International Rice
Microsatellite Initiative), MaizeGDB, the Cornell SSR library,
and the Indian Agricultural Research Institute. Most of the SSR
markers are from rice and maize. A total of 2,942 SNP markers
in the Gramene database are from barley and relate to
high-throughput SNP genotyping in that species [44].
GrainGenes (http://wheat.pw.usda.gov/cgi-bin/graingenes/)
hosts multiple types of markers for Triticeae and Avena [45–47].
The website also provides comparative map views for wheat, barley,
rye, and oats using CMap. Marker types include SSR, RFLP,
and SNP. Most of the SNP markers are from two sources [44, 48].
An improved SNP-based consensus genetic map has been developed
from 1,133 individuals from ten mapping populations. This
database provides a search panel that accepts a query name or a
list of marker names as input.
MaizeGDB (http://www.maizegdb.org) provides a search
engine to identify ESTs, AFLPs, RAPD probes, and sequence data
for maize. The Legume Information System (LIS) provides access to
markers such as SNPs, SSRs, RFLPs, and RAPDs for diverse legumes,
including peanut, soybean, alfalfa, and common bean.
The Panzea (http://www.panzea.org/) database describes the
genetic architecture of complex traits in maize and teosinte. This
database also provides a marker search interface. Two common
types of marker, SNP and SSR, can be searched for. The search
results display a list of markers with position details related to
different chromosomes. When a marker is selected, the website
displays it in precomputed multiple sequence alignments
using the Look-Align viewer [49].
TriMEDB (Triticeae mapped EST database) [50] provides
information on mapped cDNA markers that are related between
barley and wheat. The current version of TriMEDB provides
map-location data for barley and wheat. These data were retrieved
from three published barley linkage maps, namely the barley SNP
database of SCRI (http://bioinf.scri.ac.uk/barley_snpdb/), the
barley transcript map of IPK
(http://pgrc.ipk-gatersleben.de/transcript_map/), and HarvEST
barley versions 1.63 and 1.68 (http://harvest.ucr.edu/), and from
one diploid wheat map [51]. Users can search the database from the
search markers page using marker and chromosome names. The search
results include the name of each retrieved marker, related linkage
maps, chromosome number, map positions, primer pairs for PCR, EST
contigs for each sequence resource, a link to the cDNA assembly,
and comparative maps for the rice genome. The database can be
accessed at http://trimedb.psc.riken.jp/.
The database of the Rice Genome Annotation Project [52]
hosts putative SSRs in the rice genome pseudomolecules
(http://rice.plantbiology.msu.edu/). The rice genome annotation
project pseudomolecules (Release 7) were used for SSR
identification [53].
This database provides a web interface and displays predicted SSR
markers filtered by type and/or chromosome, as well as a GBrowse
view to display the SSR sequences.
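Filtering predicted SSRs by type and/or chromosome, as such a web interface does, can be sketched in a few lines. This is an illustrative example and not the database's own implementation: the record fields, the class and function names, and the interpretation of "type" as the motif length class (di-, tri-, and so on) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class PutativeSSR:
    """Hypothetical record for one predicted SSR."""
    chromosome: str
    start: int
    motif: str
    repeats: int

    @property
    def ssr_type(self) -> str:
        # Assumed classification by motif length.
        names = {1: "mono", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}
        return names.get(len(self.motif), "other")


def filter_ssrs(ssrs: Iterable[PutativeSSR],
                ssr_type: Optional[str] = None,
                chromosome: Optional[str] = None) -> List[PutativeSSR]:
    """Keep SSRs matching the given type and/or chromosome (None = any)."""
    return [
        s for s in ssrs
        if (ssr_type is None or s.ssr_type == ssr_type)
        and (chromosome is None or s.chromosome == chromosome)
    ]


# Hypothetical records.
records = [
    PutativeSSR("Chr1", 1200, "AG", 12),
    PutativeSSR("Chr1", 5300, "CAT", 6),
    PutativeSSR("Chr2", 880, "GA", 9),
]
print(filter_ssrs(records, ssr_type="di", chromosome="Chr1"))
```

Leaving either argument as None returns all records for that criterion, which mirrors an "and/or" style of filter.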
With the exception of some important species, databases for
nongrass species tend to be more limited in scope. There are a
large number of Brassica molecular markers developed together
with bioinformatics resources [54, 55]. The central Brassica portal
(http://www.brassica.info) provides access to a range of Brassica
molecular markers, including
SNP/InDel, SSR, RFLP, AFLP, and RAPD. This website provides
a summary of available information for Brassica SSRs and provides
a means to exchange and distribute these markers at the Brassica
microsatellite information exchange [56].
The Sol Genomics Network database (SGN; http://solgenomics.net/)
is a clade-oriented database (COD) hosting biological data for species in
the Solanaceae and their close relatives. The data types range from chro-
mosomes and genes to phenotypes and accessions. SGN hosts more
than 20 genetic and physical maps for tomato, potato, pepper, and
tobacco with thousands of markers. Genetic marker types in the database
include SNP, SSR, AFLP, PCR, and RFLP [57].
The SoyBase database (http://soybase.org/) hosts genomic
and genetic data for soybean. The markers include SNP, SSR, RFLP,
RAPD, and AFLP. The markers can be viewed from CMap and
have also been linked to their corresponding location in a GBrowse2
genome viewer. Each marker comes with the genomic sequence,
detection method, and information source.
VegMarks (http://vegmarks.nivot.affrc.go.jp/) is a database
for vegetable genetic markers developed by the National Institute of
Vegetable and Tea Science (NIVTS) in Japan. This database pro-
vides various marker characteristics, including ID number, genetic
map position, nucleotide sequence of the clones/PCR primers,
and polymorphism data among varieties/accessions for Chinese
cabbage, bunching onion, cucumber, eggplant, melon, and tomato.
The markers hosted in this database include SNP, SSR, and RFLP.
Some marker data are restricted to registered users. This database
provides a single map for each chromosome together with
marker position information.
MoccaDB (http://moccadb.mpl.ird.fr/) is an integrative
database for functional, comparative, and diversity studies in the
Rubiaceae family, which includes coffee [58]. It provides easy
access to markers such as SSR, SNP, and RFLP, and to related
information such as PCR assay conditions, cross-amplification
within related species, locus position on different linkage maps, and
diversity parameters. It also provides a search engine for searching
related markers by keywords and downloads of related data in
Microsoft Office Excel format.
The Cotton Microsatellite Database (CMD) (http://www.
cottonmarker.org/) is a curated and integrated web-based relational
database providing centralized access to publicly available cotton
SSRs. CMD contains publication, sequence, primer, mapping,
and homology data for nine major cotton SSR projects, collectively
representing 5,484 SSR markers [59].
In addition to species-specific databases, other databases focus
on specific marker types. The autoSNPdb database [60] is based
on an early pipeline for SNP discovery from EST sequence data
[24, 61]. It provides an interface facilitating a variety of queries to
search for SNPs within known genes from a range of species including
Brassica, rice, barley [62], and wheat [63]. The SNP identification
method was developed based on polymorphisms related to specific
genes identified through keyword, sequence similarity, or compara-
tive genomics approaches. The results provide sequence annotation
and SNP information in tabular and graphical format.
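The redundancy idea behind this kind of EST-based SNP discovery can be illustrated with a minimal sketch. It is not the autoSNP or autoSNPdb implementation; the function name `call_snps`, the `min_minor_count` threshold, and the toy reads are assumptions made only for illustration. A column of an alignment is reported as a candidate SNP only when at least two different bases are each supported by several reads, so that an isolated difference in a single read is treated as a likely sequencing error.

```python
from collections import Counter

def call_snps(aligned_reads, min_minor_count=2):
    """Return (column, allele_counts) for candidate SNP columns.

    aligned_reads: equal-length strings from one alignment; '-' marks a gap.
    A column is reported only if at least two different bases are each
    seen in `min_minor_count` or more reads (the redundancy requirement).
    """
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(r) != length for r in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    snps = []
    for col in range(length):
        counts = Counter(r[col].upper() for r in aligned_reads)
        # Keep only unambiguous bases with enough supporting reads.
        supported = {b: n for b, n in counts.items()
                     if b in "ACGT" and n >= min_minor_count}
        if len(supported) >= 2:
            snps.append((col, supported))
    return snps

# Hypothetical toy alignment (0-based columns).
reads = [
    "ACGTACGT",
    "ACGTACGT",
    "ACGAACGT",
    "ACGAACGT",
    "ACGTACCT",  # lone difference at column 6: not reported
]
print(call_snps(reads))
```

Raising `min_minor_count` makes the call more conservative at the cost of missing polymorphisms in shallow alignments.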
There are an increasing number of bioinformatics resources
available for wheat [64]. WheatGenome.info is an integrated
database resource which supplies a variety of web-based systems
hosting wheat genetic and genomic data. WheatGenome.info [65]
provides a GBrowse2-based wheat genome viewer as well as CMap and
CMap3D comparative genetic map viewers [38, 43]. In the
GBrowse2-based wheat genome viewer, wheat reference genomic
sequences are currently only available for wheat group 7 chromo-
somes [31, 32]. SGSautoSNP (Second Generation Sequencing
autoSNP) software has been used to identify more than 900,000
SNPs between four Australian varieties along this chromosome
group [66]. More SNPs can be expected to be identified between
further wheat cultivars as this project develops.
SSR Primer 2 (http://flora.acpfg.com.au/ssrprimer2/) [67]
provides the real-time discovery of SSRs within submitted DNA
sequences, with the concomitant design of PCR primers for SSR
amplification [68]. The success of this system has been demon-
strated in Brassica [69–71] and strawberry [23].
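As an illustration of what SSR discovery in a submitted sequence involves, the following minimal sketch finds perfect di-, tri-, and tetranucleotide repeats with regular expressions. It is not the SSR Primer code and it does not design primers; the function names and the minimum repeat thresholds are assumptions chosen only for the example.

```python
import re

# Minimum number of tandem copies per motif length (illustrative values only).
DEFAULT_MIN_REPEATS = {2: 6, 3: 4, 4: 3}

def is_primitive(motif):
    """Return True if the motif is not a repeat of a shorter unit (e.g. 'AA', 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif[:k] * (n // k) == motif
                   for k in range(1, n))

def find_ssrs(seq, min_repeats=None):
    """Return sorted (start, motif, copies) tuples for perfect SSRs in seq (0-based start)."""
    thresholds = DEFAULT_MIN_REPEATS if min_repeats is None else min_repeats
    seq = seq.upper()
    hits = []
    for k, min_n in thresholds.items():
        # A k-base motif followed by at least (min_n - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)

# Hypothetical test sequence: (AG)8 and (CAT)5 embedded in unique flanks.
example = "TTGC" + "AG" * 8 + "CCTA" + "CAT" * 5 + "GG"
print(find_ssrs(example))
```

The flanking sequence on either side of each reported repeat is what a primer design step would then use; imperfect or compound repeats are deliberately ignored in this sketch.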
A chickpea (Cicer arietinum L.) root EST database hosted at
ICRISAT (http://www.icrisat.org/) provides access to over
2,800 chickpea ESTs from a library constructed after suppression
subtractive hybridization (SSH) of root tissue from two closely
related chickpea genotypes possessing different sources of
drought avoidance and tolerance [72]. This chickpea root EST
database is a subset of a larger ICRISAT-maintained database.
ICRISAT (http://www.icrisat.org/) also hosts a nonredundant
set of 4,543 SNPs, which were identified between two chickpea
genotypes [73].
3 Conclusions and Future Direction
Molecular marker databases are expanding rapidly as increasing
numbers of markers are developed from the latest high-throughput
DNA sequencing technologies. There is an increasing challenge to
manage and maintain this expanding data as well as integrate
marker data with the growth of available genome sequences.
Finally, the greatest challenge will be to fully integrate genetic
diversity information with heritable trait information, bridging the
genome to phenome divide and providing the tools for more
advanced breeding and crop improvement.
References
1. Duran C, Edwards D, Batley J (2009) Molecular marker discovery and genetic map visualisation. In: Edwards D, Hanson D, Stajich J (eds) Applied bioinformatics. Springer, New York, pp 165–189
2. Edwards D, Batley J (2008) Bioinformatics: fundamentals and applications in plant genetics, mapping and breeding. In: Kole C, Abbott AG (eds) Principles and practices of plant genomics. Science Publishers, Inc., New York, pp 269–302
3. Appleby N, Edwards D, Batley J (2009) New technologies for ultra-high throughput genotyping in plants. In: Somers D, Langridge P, Gustafson J (eds) Plant genomics. Humana, New York, pp 19–40
4. Prasad M, Varshney RK, Roy JK, Balyan HS, Gupta PK (2000) The use of microsatellites for detecting DNA polymorphism, genotype identification and genetic diversity in wheat. Theor Appl Genet 100:592–594
5. Stein N, Graner A (2005) Map-based gene isolation in cereal genomes. In: Gupta P, Varshney R (eds) Cereal genomics. Springer, Amsterdam, pp 331–360
6. Varshney RK, Sigmund R, Börner A, Korzun V, Stein N, Sorrells ME, Langridge P, Graner A (2005) Interspecific transferability and comparative mapping of barley EST-SSR markers in wheat, rye and rice. Plant Sci 168:195–202
7. Batley J, Edwards D (2009) Mining for single nucleotide polymorphism (SNP) and simple sequence repeat (SSR) molecular genetic markers. In: Posada D (ed) Bioinformatics for DNA sequence analysis. Humana, New York, pp 303–322
8. Edwards D, Forster JW, Chagné D, Batley J (2007) What are SNPs? In: Oraguzie NC, Rikkerink EHA, Gardiner SE, Silva HND (eds) Association mapping in plants. Springer, New York, pp 41–52
9. Chagné D, Batley J, Edwards D, Forster JW (2007) Single nucleotide polymorphism genotyping in plants. In: Oraguzie N, Rikkerink E, Gardiner S, De Silva H (eds) Association mapping in plants. Springer, New York, pp 77–94
10. Edwards D, Forster JW, Cogan NOI, Batley J, Chagné D (2007) Single nucleotide polymorphism discovery. In: Oraguzie N, Rikkerink E, Gardiner S, De Silva H (eds) Association mapping in plants. Springer, New York, pp 53–76
11. Batley J, Edwards D (2007) SNP applications in plants. In: Oraguzie N, Rikkerink E, Gardiner S, De Silva H (eds) Association mapping in plants. Springer, New York, pp 95–102
12. Allen AM, Barker GL, Berry ST, Coghill JA, Gwilliam R, Kirby S, Robinson P, Brenchley RC, D'Amore R, McKenzie N, Waite D, Hall A, Bevan M, Hall N, Edwards KJ (2011) Transcript-specific, single-nucleotide polymorphism discovery and linkage analysis in hexaploid bread wheat (Triticum aestivum L.). Plant Biotechnol J 9:1086–1099
13. Winfield MO, Wilkinson PA, Allen AM, Barker GL, Coghill JA, Burridge A, Hall A, Brenchley RC, D'Amore R, Hall N, Bevan MW, Richmond T, Gerhardt DJ, Jeddeloh JA, Edwards KJ (2012) Targeted re-sequencing of the allohexaploid wheat exome. Plant Biotechnol J 10:733–742
14. Kharabian-Masouleh A, Waters DLE, Reinke RF, Henry RJ (2011) Discovery of polymorphisms in starch-related genes in rice germplasm by amplification of pooled DNA and deeply parallel sequencing. Plant Biotechnol J 9:1074–1085
15. Subbaiyan GK, Waters DL, Katiyar SK, Sadananda AR, Vaddadi S, Henry RJ (2012) Genome-wide DNA polymorphisms in elite indica rice inbreds discovered by whole-genome sequencing. Plant Biotechnol J 10:623–634
16. Trick M, Long Y, Meng JL, Bancroft I (2009) Single nucleotide polymorphism (SNP) discovery in the polyploid Brassica napus using Solexa transcriptome sequencing. Plant Biotechnol J 7:334–346
17. Barker GLA, Edwards KJ (2009) A genome-wide analysis of single nucleotide polymorphism diversity in the world's major cereal crops. Plant Biotechnol J 7:318–325
18. Bundock PC, Eliott FG, Ablett G, Benson AD, Casu RE, Aitken KS, Henry RJ (2009) Targeted single nucleotide polymorphism (SNP) discovery in a highly polyploid plant species using 454 sequencing. Plant Biotechnol J 7:347–354
19. Edwards D, Batley J (2010) Plant genome sequencing: applications for crop improvement. Plant Biotechnol J 7:1–8
20. Hong CP, Piao ZY, Kang TW, Batley J, Yang TJ, Hur YK, Bhak J, Park BS, Edwards D, Lim YP (2007) Genomic distribution of simple sequence repeats in Brassica rapa. Mol Cells 23:349–356
21. Burgess B, Mountford H, Hopkins CJ, Love C, Ling AE, Spangenberg GC, Edwards D, Batley J (2006) Identification and characterization of simple sequence repeat (SSR) markers derived in silico from Brassica oleracea genome shotgun sequences. Mol Ecol Notes 6:1191–1194
22. Nie X, Li B, Wang L, Liu P, Biradar SS, Li T, Dolezel J, Edwards D, Luo M, Weining S (2012) Development of chromosome-arm-specific microsatellite markers in Triticum aestivum (Poaceae) using NGS technology. Am J Bot 99:e369–e371
23. Keniry A, Hopkins CJ, Jewell E, Morrison B, Spangenberg GC, Edwards D, Batley J (2006) Identification and characterization of simple sequence repeat (SSR) markers from Fragaria x ananassa expressed sequences. Mol Ecol Notes 6:319–322
24. Batley J, Barker G, O'Sullivan H, Edwards KJ, Edwards D (2003) Mining for single nucleotide polymorphisms and insertions/deletions in maize expressed sequence tag data. Plant Physiol 132:84–91
25. Lee H, Lai K, Lorenc MT, Imelfort M, Duran C, Edwards D (2012) Bioinformatics tools and databases for analysis of next generation sequence data. Brief Funct Genomics 2:12–24
26. Imelfort M, Duran C, Batley J, Edwards D (2009) Discovering genetic polymorphisms in next-generation sequencing data. Plant Biotechnol J 7:312–317
27. Wang X, Wang H, Wang J, Sun R, Wu J, Liu S, Bai Y, Mun J-H, Bancroft I, Cheng F, Huang S, Li X, Hua W, Wang J, Wang X, Freeling M, Pires JC, Paterson AH, Chalhoub B, Wang B, Hayward A, Sharpe AG, Park B-S, Weisshaar B, Liu B, Li B, Liu B, Tong C, Song C, Duran C, Peng C, Geng C, Koh C, Lin C, Edwards D, Mu D, Shen D, Soumpourou E, Li F, Fraser F, Conant G, Lassalle G, King GJ, Bonnema G, Tang H, Wang H, Belcram H, Zhou H, Hirakawa H, Abe H, Guo H, Wang H, Jin H, Parkin IAP, Batley J, Kim J-S, Just J, Li J, Xu J, Deng J, Kim JA, Li J, Yu J, Meng J, Wang J, Min J, Poulain J, Hatakeyama K, Wu K, Wang L, Fang L, Trick M, Links MG, Zhao M, Jin M, Ramchiary N, Drou N, Berkman PJ, Cai Q, Huang Q, Li R, Tabata S, Cheng S, Zhang S, Zhang S, Huang S, Sato S, Sun S, Kwon S-J, Choi S-R, Lee T-H, Fan W, Zhao X, Tan X, Xu X, Wang Y, Qiu Y, Yin Y, Li Y, Du Y, Liao Y, Lim Y, Narusaka Y, Wang Y, Wang Z, Li Z, Wang Z, Xiong Z, Zhang Z (2011) The genome of the mesopolyploid crop species Brassica rapa. Nat Genet 43:1035–1040
28. Hayward A, Dalton-Morgan J, Mason A, Zander M, Edwards D, Batley J (2012) SNP discovery and applications in Brassica napus. J Plant Biotechnol 39:49–61
29. Hayward A, Vighnesh G, Delay C, Samian MR, Manoli S, Stiller J, McKenzie M, Edwards D, Batley J (2012) Second-generation sequencing for gene discovery in the Brassicaceae. Plant Biotechnol J 10:750–759
30. Tollenaere R, Hayward A, Dalton-Morgan J, Campbell E, McLanders J, Lorenc M, Manoli S, Stiller J, Raman R, Raman H, Edwards D, Batley J (2012) Identification and characterisation of candidate Rlm4 blackleg resistance genes in Brassica napus using next generation sequencing. Plant Biotechnol J 10:709–715
31. Berkman BJ, Skarshewski A, Lorenc MT, Lai K, Duran C, Ling EYS, Stiller J, Smits L, Imelfort M, Manoli S, McKenzie M, Kubalakova M, Simkova H, Batley J, Fleury D, Dolezel J, Edwards D (2011) Sequencing and assembly of low copy and genic regions of isolated Triticum aestivum chromosome arm 7DS. Plant Biotechnol J 9:768–775
32. Berkman PJ, Skarshewski A, Manoli S, Lorenc MT, Stiller J, Smits L, Lai K, Campbell E, Kubalakova M, Simkova H, Batley J, Dolezel J, Hernandez P, Edwards D (2012) Sequencing wheat chromosome arm 7BS delimits the 7BS/4AL translocation and reveals homoeologous gene conservation. Theor Appl Genet 124:423–432
33. Hernandez P, Martis M, Dorado G, Pfeifer M, Galvez S, Schaaf S, Jouve N, Simkova H, Valarik M, Dolezel J, Mayer KF (2012) Next-generation sequencing and syntenic integration of flow-sorted arms of wheat chromosome 4A exposes the chromosome structure and gene content. Plant J Cell Mol Biol 69:377–386
34. Duran C, Edwards D, Batley J (2009) Genetic maps and the use of synteny. In: Gustafson JP, Langridge P, Somers DJ (eds) Plant genomics. Humana, New York, pp 41–55
35. Batley J, Edwards D (2009) Genome sequence data: management, storage, and visualization. Biotechniques 46:333–336
36. Duran C, Appleby N, Edwards D, Batley J (2009) Molecular genetic markers: discovery, applications, data storage and visualisation. Curr Bioinform 4:16–27
37. Lim G, Jewell E, Li X, Erwin T, Love C, Batley J, Spangenberg G, Edwards D (2007) A comparative map viewer integrating genetic maps for Brassica and Arabidopsis. BMC Plant Biol 7:40
38. Duran C, Boskovic Z, Imelfort M, Batley J, Hamilton NA, Edwards D (2010) CMap3D: a 3D visualisation tool for comparative genetic maps. Bioinformatics 26:273–274
39. Duran C, Eales D, Marshall D, Imelfort M, Stiller J, Berkman PJ, Clark T, McKenzie M, Appleby N, Batley J, Basford K, Edwards D (2010) Future tools for association mapping in crop plants. Genome 53:1017–1023
40. Edwards D, Batley J (2004) Plant bioinformatics: from genome to phenome. Trends Biotechnol 22:232–237
41. Lai K, Lorenc MT, Edwards D (2012) Genomic databases for crop improvement. Agronomy 2:62–73
42. Youens-Clark K, Buckler E, Casstevens T, Chen C, DeClerck G, Derwent P, Dharmawardhana P, Jaiswal P, Kersey P, Karthikeyan AS, Lu J, McCouch SR, Ren L, Spooner W, Stein JC, Thomason J, Wei S, Ware D (2011) Gramene database in 2010: updates and extensions. Nucleic Acids Res 39:D1085–D1094
43. Youens-Clark K, Faga B, Yap IV, Stein L, Ware D (2009) CMap 1.01: a comparative mapping application for the Internet. Bioinformatics 25:3040–3042
44. Close TJ, Bhat PR, Lonardi S, Wu Y, Rostoks N, Ramsay L, Druka A, Stein N, Svensson JT, Wanamaker S, Bozdag S, Roose ML, Moscou MJ, Chao S, Varshney RK, Szucs P, Sato K, Hayes PM, Matthews DE, Kleinhofs A, Muehlbauer GJ, DeYoung J, Marshall DF, Madishetty K, Fenton RD, Condamine P, Graner A, Waugh R (2009) Development and implementation of high-throughput SNP genotyping in barley. BMC Genomics 10:582
45. O'Sullivan H (2007) GrainGenes – a genomic database for Triticeae and Avena. In: Edwards D (ed) Methods in molecular biology. Humana, Totowa, NJ, pp 301–314
46. Carollo V, Matthews DE, Lazo GR, Blake TK, Hummel DD, Lui N, Hane DL, Anderson OD (2005) GrainGenes 2.0. An improved resource for the small-grains community. Plant Physiol 139:643–651
47. Matthews DE, Carollo VL, Lazo GR, Anderson OD (2003) GrainGenes, the genome database for small-grain crops. Nucleic Acids Res 31:183–186
48. Szűcs P, Blake VC, Bhat PR, Chao S, Close TJ, Cuesta-Marcos A, Muehlbauer GJ, Ramsay L, Waugh R, Hayes PM (2009) An integrated resource for Barley linkage map and malting quality QTL alignment. Plant Gen 2:134–140
49. Canaran P, Stein L, Ware D (2006) Look-Align: an interactive web-based multiple sequence alignment viewer with polymorphism analysis support. Bioinformatics 22:885–886
50. Mochida K, Saisho D, Yoshida T, Sakurai T, Shinozaki K (2008) TriMEDB: a database to integrate transcribed markers and facilitate genetic studies of the tribe Triticeae. BMC Plant Biol 8:72
51. Hori K, Takehara S, Nankaku N, Sato K, Sasakuma T, Takeda K (2007) Barley EST markers enhance map saturation and QTL mapping in diploid wheat. Breed Sci 57:39–45
52. Ouyang S, Zhu W, Hamilton J, Lin H, Campbell M, Childs K, Thibaud-Nissen F, Malek RL, Lee Y, Zheng L, Orvis J, Haas B, Wortman J, Buell CR (2007) The TIGR rice genome annotation resource: improvements and new features. Nucleic Acids Res 35:D883–D887
53. Temnykh S, DeClerck G, Lukashova A, Lipovich L, Cartinhour S, McCouch S (2001) Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential. Genome Res 11:1441–1452
54. Lorenc MT, Boskovic Z, Stiller J, Duran C, Edwards D (2012) Role of Bioinformatics as a tool for oilseed Brassica species. In: Edwards D, Parkin IAP, Batley J (eds) Genetics, genomics and breeding of oilseed Brassicas. Science Publishers Inc., New Hampshire, pp 194–205
55. Duran C, Boskovic Z, Batley J, Edwards D (2011) Role of bioinformatics as a tool for vegetable Brassica species. In: Sadowski J (ed) Vegetable Brassicas. Science Publishers, Inc., New Hampshire, pp 406–418
56. Choi SR, Teakle GR, Plaha P, Kim JH, Allender CJ, Beynon E, Piao ZY, Soengas P, Han TH, King GJ, Barker GC, Hand P, Lydiate DJ, Batley J, Edwards D, Koo DH, Bang JW, Park BS, Lim YP (2007) The reference genetic linkage map for the multinational Brassica rapa genome sequencing project. Theor Appl Genet 115:777–792
57. Bombarely A, Menda N, Tecle IY, Buels RM, Strickler S, Fischer-York T, Pujar A, Leto J, Gosselin J, Mueller LA (2011) The sol genomics network (solgenomics.net): growing tomatoes using Perl. Nucleic Acids Res 39:D1149–D1155
58. Plechakova O, Tranchant-Dubreuil C, Benedet F, Couderc M, Tinaut A, Viader V, De Block P, Hamon P, Campa C, de Kochko A, Hamon S, Poncet V (2009) MoccaDB – an integrative database for functional, comparative and diversity studies in the Rubiaceae family. BMC Plant Biol 9:123
59. Blenda A, Scheffler J, Scheffler B, Palmer M, Lacape JM, Yu JZ, Jesudurai C, Jung S, Muthukumar S, Yellambalase P, Ficklin S, Staton M, Eshelman R, Ulloa M, Saha S, Burr B, Liu S, Zhang T, Fang D, Pepper A, Kumpatla S, Jacobs J, Tomkins J, Cantrell R, Main D (2006) CMD: a cotton microsatellite database resource for Gossypium genomics. BMC Genomics 7:132
60. Duran C, Appleby N, Clark T, Wood D, Imelfort M, Batley J, Edwards D (2009) AutoSNPdb: an annotated single nucleotide polymorphism database for crop plants. Nucleic Acids Res 37:D951–D953
61. Barker G, Batley J, O'Sullivan H, Edwards KJ, Edwards D (2003) Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP. Bioinformatics 19:421–422
62. Duran C, Appleby N, Vardy M, Imelfort M, Edwards D, Batley J (2009) Single nucleotide polymorphism discovery in barley using autoSNPdb. Plant Biotechnol J 7:326–333
63. Lai K, Duran C, Berkman PJ, Lorenc MT, Stiller J, Manoli S, Hayden MJ, Forrest KL, Fleury D, Baumann U, Zander M, Mason AS, Batley J, Edwards D (2012) Single nucleotide polymorphism discovery from wheat next-generation sequence data. Plant Biotechnol J 10:743–749
64. Edwards D (2011) Wheat bioinformatics. In: Bonjean A, Angus W, Van Ginkel M (eds) The world wheat book. Lavoisier, Paris, pp 851–875
65. Lai K, Berkman PJ, Lorenc MT, Duran C, Smits L, Manoli S, Stiller J, Edwards D (2012) WheatGenome.info: an integrated database and portal for wheat genome information. Plant Cell Physiol 53:e2
66. Edwards D, Wilcox S, Barrero RA, Fleury D, Cavanagh CR, Forrest KL, Hayden MJ, Moolhuijzen P, Keeble-Gagnère G, Bellgard MI, Lorenc MT, Shang CA, Baumann U, Taylor JM, Morell MK, Langridge P, Appels R, Fitzgerald A (2012) Bread matters: a national initiative to profile the genetic diversity of Australian wheat. Plant Biotechnol J 10:703–708
67. Jewell E, Robinson A, Savage D, Erwin T, Love CG, Lim GAC, Li X, Batley J, Spangenberg GC, Edwards D (2006) SSRPrimer and SSR taxonomy tree: biome SSR discovery. Nucleic Acids Res 34:W656–W659
68. Robinson AJ, Love CG, Batley J, Barker G, Edwards D (2004) Simple sequence repeat marker loci discovery using SSR primer. Bioinformatics 20:1475–1476
69. Batley J, Hopkins CJ, Cogan NOI, Hand M, Jewell E, Kaur J, Kaur S, Li X, Ling AE, Love C, Mountford H, Todorovic M, Vardy M, Walkiewicz M, Spangenberg GC, Edwards D (2007) Identification and characterization of simple sequence repeat markers from Brassica napus expressed sequences. Mol Ecol Notes 7:886–889
70. Hopkins CJ, Cogan NOI, Hand M, Jewell E, Kaur J, Li X, Lim GAC, Ling AE, Love C, Mountford H, Todorovic M, Vardy M, Spangenberg GC, Edwards D, Batley J (2007) Sixteen new simple sequence repeat markers from Brassica juncea expressed sequences and their cross-species amplification. Mol Ecol Notes 7:697–700
71. Ling AE, Kaur J, Burgess B, Hand M, Hopkins CJ, Li X, Love CG, Vardy M, Walkiewicz M, Spangenberg G, Edwards D, Batley J (2007) Characterization of simple sequence repeat markers derived in silico from Brassica rapa bacterial artificial chromosome sequences and their application in Brassica napus. Mol Ecol Notes 7:273–277
72. Jayashree B, Buhariwalla HK, Shinde S, Crouch JH (2005) A legume genomics resource: the chickpea root expressed sequence tag database. Electron J Biotechnol 8:128–133
73. Azam S, Thakur V, Ruperao P, Shah T, Balaji J, Amindala B, Farmer AD, Studholme DJ, May GD, Edwards D, Jones JD, Varshney RK (2012) Coverage-based consensus calling (CbCC) of short sequence reads and comparison of CbCC results to identify SNPs in chickpea (Cicer arietinum; Fabaceae), a crop species without a reference genome. Am J Bot 99:186–192
74. Cheng F, Liu S, Wu J, Fang L, Sun S, Liu B, Li P, Hua W, Wang X, Cheng F, Liu SY, Wu J, Fang L, Sun SL, Liu B, Li PX, Hua W, Wang XW (2011) BRAD, the genetics and genomics database for Brassica plants. BMC Plant Biol 11:136
75. Karsch-Mizrachi I, Nakamura Y, Cochrane G (2012) The international nucleotide sequence database collaboration. Nucleic Acids Res 40:D33–D37
76. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2009) GenBank. Nucleic Acids Res 37:26–31
77. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E (2008) Database resources of the national center for biotechnology information. Nucleic Acids Res 36:D13–D21
78. Gonzales MD, Gajendran K, Farmer AD, Archuleta E, Beavis WD (2007) Leveraging model legume information to find candidate genes for soybean sudden death syndrome using the legume information system. In: Edwards D (ed) Methods in molecular biology. Humana, Totowa, NJ, pp 245–259
79. Gonzales MD, Archuleta E, Farmer A, Gajendran K, Grant D, Shoemaker R, Beavis WD, Waugh ME (2005) The legume information system (LIS): an integrated information resource for comparative legume biology. Nucleic Acids Res 33:D660–D665
80. Schaeffer ML, Harper LC, Gardiner JM, Andorf CM, Campbell DA, Cannon EK, Sen TZ, Lawrence CJ (2011) MaizeGDB: curation and outreach go hand-in-hand. Database (Oxford) 2011, bar022
81. Lawrence CJ (2007) MaizeGDB – the maize genetics and genomics database. In: Edwards D (ed) Methods in molecular biology. Humana, Totowa, NJ, pp 331–345
82. Lawrence CJ, Schaeffer ML, Seigfried TE, Campbell DA, Harper LC (2007) MaizeGDB's new data types, resources and activities. Nucleic Acids Res 35:D895–D900
83. Canaran P, Buckler ES, Glaubitz JC, Stein L, Sun Q, Zhao W, Ware D (2008) Panzea: an update on new content and features. Nucleic Acids Res 36:D1041–D1043
84. Grant D, Nelson RT, Cannon SB, Shoemaker RC (2010) SoyBase, the USDA-ARS soybean genetics and genomics database. Nucleic Acids Res 38:D843–D846
85. Wegrzyn J, Main D, Figueroa B, Choi M, Yu J, Neale D, Jung S, Lee T, Stanton M, Zheng P, Ficklin S, Cho I, Peace C, Evans K, Volk G, Oraguzie N, Chen C, Olmstead M, Gmitter G, Abbott A (2012) Uniform standards for genome databases in forest and fruit trees. Tree Genet Genomes 8:1–2
86. Tree fruit Genome Database Resources (tfGDR) (2002) Washington State University, Pullman, WA. http://www.tfgdr.org
Chapter 5
Plant Genotyping Using Fluorescently Tagged Inter-Simple Sequence Repeats (ISSRs): Basic Principles and Methodology
Linda M. Prince
Abstract
Inter-simple sequence repeat PCR (ISSR-PCR) is a fast, inexpensive genotyping technique based on length
variation in the regions between microsatellites. The method requires no species-specific prior knowledge
of microsatellite location or composition. Very small amounts of DNA are required, making this method
ideal for organisms of conservation concern, or where the quantity of DNA is extremely limited due to
organism size. ISSR-PCR can be highly reproducible but requires careful attention to detail. Optimization
of DNA extraction, fragment amplification, and normalization of fragment peak heights during fluorescent
detection are critical steps to minimizing the downstream time spent verifying and scoring the data.
Key words ABI Genetic Analyzer, Capillary electrophoresis, Conservation, Fragment, Inexpensive,
Normalization, Population genetics, ISSR-PCR
1 Introduction
ISSR-PCR is a genotyping technique based on length variation in
the regions between microsatellites. The method became popular
in the late 1990s [1–5], along with other fragment-based methods
such as Amplified Fragment Length Polymorphisms (AFLPs),
Restriction Fragment Length Polymorphisms (RFLPs), and
Randomly Amplified Polymorphic DNAs (RAPDs), as an alternative to
microsatellites. All of these methods have the benefit of requiring
little or no prior knowledge of an organism’s genome. ISSR-PCR
requires a single step after DNA isolation, so it is relatively low cost.
The ISSR-PCR procedure is particularly useful for studies of rare
organisms because, unlike AFLPs and RFLPs, which include a restriction
digestion step, it requires very little DNA.
A quick search for ISSR + plant in PubMed [6] returned over
400 ISSR-based publications. ISSRs have successfully been used to
address evolutionary questions such as detection and verification of
hybridization events, detection of clonal variation, sex determination
in seedlings, development of linkage maps, and phylogenetic
estimation among closely related species. They have been used
extensively by crop scientists for cultivar identification and
characterization of basic genetic variation in wild and cultivated taxa.
Conservation geneticists who do not have access to species-specific
microsatellite markers have also taken advantage of the method,
often using ISSRs as a stepping stone to custom microsatellite
development [7–9]. The versatility of this technique makes ISSR
useful for researchers interested in fields as diverse as conservation
biology and cancer research, across a wide spectrum of biological
diversity (plants, animals, fungi, algae).
ISSR-PCR uses a single fluorescently labeled primer to target
the region between identical microsatellites. Sets of 100 ISSR
primers were commercially available through The University of
British Columbia, Vancouver (UBC) as late as 2005 [10]. A list of
the UBC ISSR primers and some inferred primer characteristics
(annealing temperature, primer–primer interaction) is provided
in Table 1. Only some of the primers are suitable for ISSR-PCR
due to primer–primer interactions and extreme annealing tempera-
ture requirements (either very low or very high). A number of
studies use multiple primers or use degenerate primers in the PCR
reaction to increase the number of fragments produced. With the
advent of capillary electrophoresis and laser-induced fluorescence,
there is little need to do either as this method is much more sensi-
tive and offers far higher resolution than older slab-based (either
agarose or acrylamide) methods. An added benefit is increasingly
long read lengths, up to 1,200 bp, thanks to the development of
additional internal lane standards.
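Annealing temperature estimates of the kind listed in Table 1 can be approximated from base composition alone. The sketch below is an illustration, not the procedure used by the Oligo software: for the primers checked, the Tm (3) column agrees with the Wallace rule, 2(A+T) + 4(G+C), and the Tm (2) column agrees with 81.5 + 0.41(%GC) − 675/N with no salt correction, where N is the primer length. The function names are hypothetical, and counting degenerate codes (Y, R) as non-G/C bases is an assumption that happens to reproduce the tabulated values.

```python
def _clean(primer):
    """Remove the spacing used in Table 1 and normalise case."""
    return primer.replace(" ", "").upper()

def wallace_tm(primer):
    """Wallace-rule estimate in degrees C: 2 per non-G/C base, 4 per G or C."""
    seq = _clean(primer)
    gc = sum(seq.count(base) for base in "GC")
    return 2 * (len(seq) - gc) + 4 * gc

def gc_length_tm(primer):
    """GC-content and length estimate in degrees C, without a salt term."""
    seq = _clean(primer)
    gc = sum(seq.count(base) for base in "GC")
    return 81.5 + 41.0 * gc / len(seq) - 675.0 / len(seq)

# Two primers from Table 1: UBC 807 (nondegenerate) and UBC 835 (degenerate).
for name, primer in [("807", "AGA GAG AGA GAG AGA GT"),
                     ("835", "AGA GAG AGA GAG AGA GYC")]:
    print(name, wallace_tm(primer), round(gc_length_tm(primer), 1))
```

Both formulas ignore nearest-neighbor effects and buffer composition, so they are best treated as a starting point for empirical optimization of the annealing temperature.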
Like many anonymous markers, ISSRs exhibit (primarily)
dominant inheritance, thus the amplified regions are scored as dial-
lelic (presence/absence). Variation among individuals within a
population can arise through structural changes to the region
(insertions or deletions) or the loss of primer binding sites. The
method fell out of favor over the past decade due to concerns about
reproducibility. In fact, the method is highly reproducible [11–14]
but requires careful attention to detail. It is standard practice to
run all samples in duplicate, eliminating bands that are detected in
only one of the two replicates. A majority-rule approach based on
triplicates may be preferable. Optimization of DNA
extraction, fragment amplification, and normalization of fragment
peak heights during fluorescent detection are critical steps to mini-
mizing the downstream time spent verifying and scoring the data.
Details presented below are for Applied Biosystems, Inc. capillary
array Genetic Analyzer platforms, but can readily be adapted to
other systems.
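The replicate-based scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not part of any Genetic Analyzer software: the function names are hypothetical, and fragment sizes are assumed to have already been binned to integer base pairs so that the same fragment has the same size in every run. With two replicates the default keeps only bands seen in both runs; with three it applies the majority rule (two of three).

```python
def consensus_profile(replicates, min_support=None):
    """Combine presence/absence calls from replicate runs of one sample.

    replicates: list of sets of fragment sizes (bp), one set per run.
    min_support: number of runs that must show a band; defaults to a
    strict majority (2 of 2, 2 of 3).
    """
    n = len(replicates)
    if n == 0:
        raise ValueError("need at least one replicate")
    if min_support is None:
        min_support = n // 2 + 1
    all_bands = set().union(*replicates)
    return {band for band in all_bands
            if sum(band in rep for rep in replicates) >= min_support}

def score_matrix(profiles):
    """Build a dominant (1 = present, 0 = absent) matrix across samples."""
    loci = sorted(set().union(*profiles.values()))
    matrix = {sample: [int(locus in bands) for locus in loci]
              for sample, bands in profiles.items()}
    return loci, matrix

# Hypothetical fragment sizes for two samples.
duplicate = [{212, 305, 410}, {212, 410, 533}]
triplicate = [{212, 305, 410}, {212, 305}, {212, 410, 533}]
profiles = {"sample1": consensus_profile(duplicate),
            "sample2": consensus_profile(triplicate)}
print(score_matrix(profiles))
```

In this toy example the 305 bp band is discarded under duplicate scoring but retained under the triplicate majority rule, which shows how the two schemes can differ for the same underlying data.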
Table 1
UBC ISSR primer set with estimated annealing temperatures and primer interaction characteristics based upon a review in the computer software
package Oligo [26]
# Base composition Td (1) Tm (2) Tm (3) Primer dimers? Primer hairpins?
Nondegenerate 801 ATA TAT ATA TAT ATA TT 24.5 41.8 34.0 All the way Yes! Tm = 43°
802 ATA TAT ATA TAT ATA TG 24.8 44.2 36.0 All the way Yes! Tm = 43°
803 ATA TAT ATA TAT ATA TC 23.8 44.2 36.0 All the way Yes! Tm = 43°
804 TAT ATA TAT ATA TAT AA 23.3 41.8 34.0 All the way Yes! Tm = 38°
805 TAT ATA TAT ATA TAT AC 21.9 44.2 36.0 All the way Yes! Tm = 38°
806 TAT ATA TAT ATA TAT AG 22.5 44.2 36.0 All the way Yes! Tm = 38°
807 AGA GAG AGA GAG AGA GT 39.6 61.1 50.0 No No
808 AGA GAG AGA GAG AGA GC 44.2 63.5 52.0 1 of 2 bp No
809 AGA GAG AGA GAG AGA GG 44.0 63.5 52.0 No No
810 GAG AGA GAG AGA GAG AT 40.0 61.1 50.0 1 of 2 bp No
811 GAG AGA GAG AGA GAG AC 40.0 63.5 52.0 No No
812 GAG AGA GAG AGA GAG AA 41.3 61.1 50.0 No No
813 CTC TCT CTC TCT CTC TT 40.9 61.1 50.0 No No
814 CTC TCT CTC TCT CTC TA 38.5 61.1 50.0 1 of 2 bp No
815 CTC TCT CTC TCT CTC TG 41.7 63.5 52.0 No No
816 CAC ACA CAC ACA CAC AT 46.1 61.1 50.0 1 of 2 bp No
817 CAC ACA CAC ACA CAC AA 47.6 61.1 50.0 No No
818 CAC ACA CAC ACA CAC AG 46.8 63.5 52.0 No No
819 GTG TGT GTG TGT GTG TA 42.7 61.1 50.0 1 of 2 bp No
820 GTG TGT GTG TGT GTG TC 45.0 63.5 52.0 No No
821 GTG TGT GTG TGT GTG TT 45.3 61.1 50.0 No No
822 TCT CTC TCT CTC TCT CA 42.2 61.1 50.0 No No
823 TCT CTC TCT CTC TCT CC 44.5 63.5 52.0 No No
824 TCT CTC TCT CTC TCT CG 46.0 63.5 52.0 1 of 2 bp No
825 ACA CAC ACA CAC ACA CT 44.4 61.1 50.0 No No
826 ACA CAC ACA CAC ACA CC 48.6 63.5 52.0 No No
827 ACA CAC ACA CAC ACA CG 50.2 63.5 52.0 1 of 2 bp No
828 TGT GTG TGT GTG TGT GA 47.5 61.1 50.0 No No
829 TGT GTG TGT GTG TGT GC 51.2 63.5 52.0 1 of 2 bp No
830 TGT GTG TGT GTG TGT GG 51.0 63.5 52.0 No No
Degenerate 831 ATA TAT ATA TAT ATA TYA 26.6 44.0 36.0 All the way Yes! Tm = 43°
832 ATA TAT ATA TAT ATA TYC 28.2 46.3 38.0 All the way Yes! Tm = 43°
833 ATA TAT ATA TAT ATA TYG 29.1 46.3 38.0 All the way Yes! Tm = 43°
834 AGA GAG AGA GAG AGA GYT 43.5 62.2 52.0 No No
835 AGA GAG AGA GAG AGA GYC 43.2 64.5 54.0 No No
836 AGA GAG AGA GAG AGA GYA 41.2 62.2 52.0 No No
837 TAT ATA TAT ATA TAT ART 26.6 44.0 36.0 All the way Yes! Tm = 38°
838 TAT ATA TAT ATA TAT ARC 25.9 46.3 38.0 All the way Yes! Tm = 38°
839 TAT ATA TAT ATA TAT ARG 26.9 46.3 38.0 All the way Yes! Tm = 38°
840 GAG AGA GAG AGA GAG AYT 43.9 62.2 52.0 No No
841 GAG AGA GAG AGA GAG AYC 43.6 64.5 54.0 No No
842 GAG AGA GAG AGA GAG AYG 44.7 64.5 54.0 No No
843 CTC TCT CTC TCT CTC TRA 42.5 62.2 52.0 No No
844 CTC TCT CTC TCT CTC TRC 44.4 64.5 54.0 No No
845 CTC TCT CTC TCT CTC TRG 45.5 64.5 54.0 No No
846 CAC ACA CAC ACA CAC ART 49.9 62.2 52.0 No No
847 CAC ACA CAC ACA CAC ARC 49.8 64.5 54.0 No No
848 CAC ACA CAC ACA CAC ARG 51.1 64.5 54.0 No No
849 GTG TGT GTG TGT GTG TYA 46.7 62.2 52.0 No No
850 GTG TGT GTG TGT GTG TYC 48.9 64.5 54.0 No No
851 GTG TGT GTG TGT GTG TYG 50.2 64.5 54.0 No No
852 TCT CTC TCT CTC TCT CRA 42.2 62.2 52.0 No No
853 TCT CTC TCT CTC TCT CRT 44.4 62.2 52.0 No No
854 TCT CTC TCT CTC TCT CRG 45.3 64.5 54.0 No No
855 ACA CAC ACA CAC ACA CYT 48.3 62.2 52.0 No No
856 ACA CAC ACA CAC ACA CYA 46.0 62.2 52.0 No No
857 ACA CAC ACA CAC ACA CYG 49.4 64.5 54.0 No No
858 TGT GTG TGT GTG TGT GRT 50.2 62.2 52.0 No No
859 TGT GTG TGT GTG TGT GRC 50.1 64.5 54.0 No No
860 TGT GTG TGT GTG TGT GRA 47.8 62.2 52.0 No No
Nondegenerate 861 ACC ACC ACC ACC ACC ACC 65.4 71.3 60.0 No No
862 AGC AGC AGC AGC AGC AGC 67.6 71.3 60.0 6 of 2 bp No
863 AGT AGT AGT AGT AGT AGT 30.3 57.7 48.0 3 of 1–2 bp No
864 ATG ATG ATG ATG ATG ATG 48.4 57.7 48.0 6 of 2 bp No
865 CCG CCG CCG CCG CCG CCG 90.8 85.0 72.0 6 of 2 bp No
866 CTC CTC CTC CTC CTC CTC 59.4 71.3 60.0 No No
867 GGC GGC GGC GGC GGC GGC 90.1 85.0 72.0 6 of 2 bp No
868 GAA GAA GAA GAA GAA GAA 46.9 57.7 48.0 No No
869 GTT GTT GTT GTT GTT GTT 49.2 57.7 48.0 No No
870 TGC TGC TGC TGC TGC TGC 69.5 71.3 60.0 6 of 2 bp No
871 TAT TAT TAT TAT TAT TAT 33.1 44.0 36.0 6 of 2 bp No
872 GAT AGA TAG ATA GAT A 26.8 49.6 40.0 4 of 2 bp No
873 GAC AGA CAG ACA GAC A 39.7 59.8 48.0 No No
874 CCC TCC CTC CCT CCC T 62.9 70.1 56.0 No No
875 CTA GCT AGC TAG CTA G 39.9 59.8 48.0 All the way Yes! Tm = 66°
876 GAT AGA TAG ACA GAC A 32.8 54.7 44.0 1 of 2 bp No
877 TGC ATG CAT GCA TGC A 61.2 59.8 48.0 All the way Yes! Tm = 93°
878 GGA TGG ATG GAT GGA T 53.2 59.8 48.0 4 of 2 bp No
879 CTT CAC TTC ACT TCA 37.7 52.9 42.0 No No
880 GGA GAG GAG AGG AGA 45.1 61.1 48.0 No No
881 GGG TGG GGT GGG GTG 63.7 69.3 54.0 No No
Degenerate 882 VBV ATA TAT ATA TAT AT 25.7 41.8 34.0 All the way Yes! Tm = 31°
883 BVB TAT ATA TAT ATA TA 26.9 41.8 34.0 All the way Yes! Tm = 31°
884 HBH AGA GAG AGA GAG AG 39.7 58.7 48.0 No No
885 BHB GAG AGA GAG AGA GA 43.3 58.7 48.0 No No
886 VDV CTC TCT CTC TCT CT 41.7 58.7 48.0 No No
887 DVD TCT CTC TCT CTC TC 42.5 58.7 48.0 No No
888 BDB CAC ACA CAC ACA CA 47.7 58.7 48.0 No No
889 DBD ACA CAC ACA CAC AC 43.1 58.7 48.0 No No
890 VHV GTG TGT GTG TGT GT 46.6 58.7 48.0 No No
891 HVH TGT GTG TGT GTG TG 47.8 58.7 48.0 No No
892 TAG ATC TGA TAT CTG AAT TCC C 57.3 65.7 60.0 2 of 2 bp, 1 of 6 bp Yes! Tm = 28°
893 NNN NNN NNN NNN NNN 41.6 36.5 30.0 All the way Yes! Tm = 63°
894 TGG TAG CTC TTG ATC ANN NNN 58.8 63.0 56.0 4 of 1–2 bp; 1 of 5, 1 of 6 No
895 AGA GTT GGT AGC TCT TGA TC 54.9 66.2 58.0 8 of 1–4 bp Yes! Tm = 37°
896 AGG TCG CGG CCG CNN NNN NAT G 75.8 73.2 68.0 1 of 8 bp No
897 CCG ACT CGA GNN NNN NAT GTG G 65.9 69.5 64.0 10 of 1–2 bp; 1 of 6 bp No
898 GAT CAA GCT TNN NNN NAT GTG G 61.1 63.9 58.0 2 of 1 bp; 1 of 4 bp; 1 of 8 bp No
899 CAT GGT GTT GGT CAT TGT TCC A 68.0 69.5 64.0 2 of 3 bp; 2 of 4 bp Yes! Tm = 49°
900 ACT TCC CCA CAG GTT AAC ACA 63.6 68.9 62.0 2 of 1 bp; 2 of 2 bp; 1 of 6 bp No
Tm/Td calculations: method 1: nearest neighbor, method 2: %GC, method 3: 2° × (A + T) + 4° × (G + C)
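As an illustration of method 3 in the footnote above, the following Python sketch computes Tm as 2° × (A + T) + 4° × (G + C). The function name `wallace_tm` is hypothetical. Counting each degenerate IUPAC base as 2° is an assumption: it appears consistent with the tabulated Tm (3) values (e.g., primers 831 and 893), but the source does not state how degenerate positions were handled.

```python
IUPAC_DEGENERATE = set("RYSWKMBDHVN")


def wallace_tm(primer: str) -> int:
    """Estimate Tm (deg C) as 2 x (A + T) + 4 x (G + C)."""
    seq = primer.replace(" ", "").upper()
    tm = 0
    for base in seq:
        if base in "GC":
            tm += 4
        elif base in "AT":
            tm += 2
        elif base in IUPAC_DEGENERATE:
            # Assumption (not stated in the source): a degenerate
            # position is counted like an A/T base, i.e. 2 degrees.
            tm += 2
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return tm


if __name__ == "__main__":
    for number, sequence in [
        ("807", "AGA GAG AGA GAG AGA GT"),
        ("865", "CCG CCG CCG CCG CCG CCG"),
    ]:
        print(number, wallace_tm(sequence))
```

For primers 807 and 865 this gives 50 and 72, matching the Tm (3) column of Table 1.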
2 Materials
2.1 DNA Extraction
1. 2× Hexadecyltrimethyl ammonium bromide (CTAB) extraction
buffer: 100 mM Tris–HCl, 1.4 M NaCl, 30 mM EDTA
(disodium), 2 % (w/v) hexadecyltrimethyl ammonium bromide
(see Note 1). 500 μL per sample.
2. Proteinase K (20 mg/mL). 2.5 μL per sample.
3. β-mercaptoethanol. 2.5 μL per sample.
4. Chloroform. 500 μL per sample, on ice.
5. 95 % Ethanol. 1,000 μL per sample, on ice.
6. 1× TE buffer: 100 mM Tris–HCl, 10 mM EDTA (disodium),
pH 7.5. 100 μL per sample.
7. Gasketed 2 mL screw-cap tubes. Two per sample.
8. Tissue mill (e.g., Bead Beater) for 2 mL tubes and 1–2 mm
diameter ceramic beads (or decontaminated mortars and
pestles).
9. Water baths or heat blocks (50, 65 °C).
10. Table top centrifuge for 2 mL tubes capable of 12,500 rcf
(refrigerated preferred).
11. Rocker or orbital mixer.
12. Fume hood rated for use with chloroform and
β-mercaptoethanol.
13. Optional: Microcon YM-30 column (Millipore, Billerica, MA).
14. NanoVue (or other similar apparatus for quantification).
2.2 PCR Amplification
1. High-fidelity DNA polymerase such as Phusion (see Note 2).
2. dNTPs (2.5 mM each, 10 mM total).
3. Oligonucleotide primers (20 μM; Table 1) labeled with fluo-
rescent dyes specific to your electrophoresis instrumentation.
4. Deionized water.
5. Thin-walled PCR tubes appropriate to your thermal cyclers.
6. Thermal cycler (routine; real-time preferred).
2.3 Capillary Electrophoresis (Using an ABI 3130xl Genetic Analyzer)
1. Separation matrix: POP-7.
2. Separation buffer, 1×: Genetic Analysis Buffer.
3. Capillary Array: 50 cm length.
4. Run Module: Custom ISSR (see Note 3).
5. Size Standard: GeneScan™ 1200 LIZ®.
6. Loading buffer: Hi-Di™ Formamide.
7. Instrument: ABI 3130xl Genetic Analyzer and Data Collection
software.
2.4 Fragment Scoring and Verification
1. GeneMapper® Software v4.1.
2. Spreadsheet software: Microsoft Office Excel 2007.
3 Methods
3.1 Isolation of gDNA
See Notes 1 and 4–7 before beginning. Hexadecyltrimethyl
ammonium bromide (CTAB) extractions [15] produce the cleanest
and highest-yield DNA of several available methods [16]. The method
is effective on a diversity of plant leaf (and green stem) tissues,
including mucilaginous species such as Agave (Agavaceae) and
Stenocereus (Cactaceae).
1. Grind a 0.5 × 1 cm piece (0.5 cm²) of fresh leaf material in a
2.0 mL screw-top microcentrifuge tube using a tissue mill (on high
speed) until it is a fine powder (not more than 1 min 30 s). Store
in a −80 °C freezer if necessary.
2. Add 500 μL of 2× CTAB extraction buffer, 2.5 μL of proteinase
K, 2.5 μL of β-mercaptoethanol, close caps tightly and vortex
gently until well mixed (see Note 1).
3. Incubate 20–30 min in 50 °C water bath, invert every 5 min.
4. Transfer to 65 °C water bath and incubate for another 15 min,
invert every 5 min.
5. Remove from water bath and allow samples to cool slightly.
6. Add 500 μL of ice cold chloroform and shake vigorously to
mix. Degas once. Recap tightly (see Note 4).
7. Gently rock or shake for 15 min at room temperature.
8. Spin for 10 min at 12,500 × g in centrifuge (at 4 °C if refriger-
ated unit is available) (see Note 5).
9. Transfer 400 μL of supernatant to a new 2.0 mL screw-cap
microcentrifuge tube, discard residue in hazardous waste.
10. Add 800–1000 μL of ice cold 95 % ethanol (or 400 μL cold
isopropanol) and incubate 1 h in –20 °C freezer to precipitate
gDNA (2–3 weeks for herbarium material) (see Note 6).
11. Spin for 10 min at 12,500 × g in room temp or refrigerated (4 °C)
centrifuge to pellet precipitate.
12. Pipette off supernatant and discard in hazardous waste.
13. Dry pellet in speed-vac for 20–30 min (longer if necessary).
14. Resuspend gDNA in 100 μL of sterile 1× TE buffer, pH 7.5.
15. Quantify gDNA using a NanoVue (or other similar instrument)
according to the manufacturer’s instructions.
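The PCR step in Subheading 3.2 uses template DNA at 10 ng/μL, so quantified extracts usually need to be diluted. The following is a minimal sketch of that C1V1 = C2V2 calculation. The function name `dilution_volumes` and the 50 μL working-stock volume are assumptions made for illustration and are not part of the protocol.

```python
def dilution_volumes(stock_ng_per_ul: float,
                     target_ng_per_ul: float = 10.0,
                     final_volume_ul: float = 50.0) -> tuple[float, float]:
    """Return (stock volume, diluent volume) in microlitres.

    Uses C1 * V1 = C2 * V2. The 50 uL default final volume is an
    illustrative assumption, not a value given in the protocol.
    """
    if stock_ng_per_ul <= 0:
        raise ValueError("stock concentration must be positive")
    if stock_ng_per_ul < target_ng_per_ul:
        raise ValueError("stock is already more dilute than the target")
    stock_ul = target_ng_per_ul * final_volume_ul / stock_ng_per_ul
    return stock_ul, final_volume_ul - stock_ul


if __name__ == "__main__":
    # Example: an extract measured at 100 ng/uL.
    stock_ul, diluent_ul = dilution_volumes(100.0)
    print(f"{stock_ul:.1f} uL gDNA + {diluent_ul:.1f} uL 1x TE")
```

Extracts that come out below 10 ng/μL raise an error in this sketch; in practice such samples may need to be concentrated or used undiluted.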
3.2 PCR Amplification
See Notes 2 and 8 before beginning. Optimization must be performed
for every species project and each primer. Negative controls
must be run to identify primer-dimers and to assess reagent
contamination. Consistent amplification is critical to reproducibility
of ISSR fragment generation. Reproducibility can be improved
by using very clean template DNA, a proof-reading polymerase,
and the same thermal cycler (and program) for all replicates.
1. A standard Phusion Polymerase ISSR-PCR amplification reac-
tion is 10 μL in volume and includes: 2.0 μL of 5× PCR buffer
(HF Phusion Buffer), 6.0 μL of dH2O, 0.5 μL dNTP (10 mM
total), 0.5 μL Primer (20 μM), 0.05 μL polymerase (Phusion),
and 1.0 μL DNA (10 ng/μL). Prepare a master mix of all
reagents (except DNA) to minimize pipetting errors. Prepare
enough master mix for at least two negative controls.
2. Thermal Cycler Profile: 1 cycle of 98 °C for 5 min; 40 cycles of
98 °C for 0:15 (denaturation), 50 °C for 0:45 (annealing), and
72 °C for 1:00 (extension); 1 cycle of 72 °C for 10:00 (final
extension); and a 4 °C hold. This profile is optimized for Phusion
polymerase. The optimal annealing temperature varies with the
primer, but most primers amplify most robustly at 50 °C.
3. Store amplified products in the dark at 4 °C for up to 1 day.
For longer storage, dilute samples into formamide, but do not
exceed 7 days; fluorescence detection of fragments declines
beyond this time. Ideally, samples should be electrophoresed
within 72 h of amplification. Quality and completeness
of PCR must be verified before proceeding (see Note 8).
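The master mix in step 1 can be scaled with a short script. The sketch below multiplies the per-reaction volumes given above by the number of reactions (samples plus negative controls). The function name `master_mix` and the 10 % pipetting overage are assumptions made for illustration; the protocol itself specifies only the per-reaction volumes and at least two negative controls.

```python
# Per-reaction volumes (uL) from step 1; template DNA (1.0 uL) is
# added to each tube separately and is not part of the master mix.
PER_REACTION_UL = {
    "5x HF Phusion buffer": 2.0,
    "dH2O": 6.0,
    "dNTPs (10 mM total)": 0.5,
    "primer (20 uM)": 0.5,
    "Phusion polymerase": 0.05,
}


def master_mix(n_samples: int,
               n_negative_controls: int = 2,
               overage: float = 0.10) -> dict[str, float]:
    """Return total volume (uL) of each reagent for the master mix.

    The 10 % overage is an assumed allowance for pipetting loss.
    """
    if n_samples < 1:
        raise ValueError("need at least one sample")
    reactions = n_samples + n_negative_controls
    scale = reactions * (1.0 + overage)
    return {name: round(vol * scale, 2)
            for name, vol in PER_REACTION_UL.items()}


if __name__ == "__main__":
    for reagent, volume in master_mix(22).items():
        print(f"{reagent:24s}{volume:8.2f} uL")
```

Each tube then receives 9.05 μL of master mix plus 1.0 μL of template, which is the nominal 10 μL reaction.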
3.3 Fragment Detection
See Notes 8–10 before beginning. Normalization of fragment peak
heights during fluorescent detection is critical. Automated fragment
analysis relies on user-specified thresholds for peak width and peak
height. Samples that amplify weakly and are not improved by cleaning
of the DNA or normalization via addition of more PCR product
will require extensive downstream verification. The procedure below
is specific to an Applied Biosystems, Inc. 3130xl Genetic Analyzer,
but can be adapted to most other capillary platforms.
1. Instrument setup (see Note 9): POP-7 polymer, 50 cm capillary
array, ISSR run module (see Note 3).
2. Sample Preparation: 96-well run plate, 0.5 μL of GeneScan
1200 LIZ size standard, 10 μL of Hi-Di Formamide (see
Note 10), 0.5–5.0 μL of sample (see Note 11), Plate cassette
assembly.
3. Denature prepared samples at 95 °C for 2 min and hold at 4 °C
before loading the plate on the instrument. Store plates at 4 °C
until successful data collection has been confirmed, then discard
in hazardous waste. If necessary, add more PCR product
and rerun within 24 h, repeating the denaturation step. There
is no need to add more 1200 LIZ size standard.
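When PCR product is added to a well to boost weak samples, the total volume should stay within the 30 μL/well limit quoted for the ABI 3130xl in Note 11. The sketch below adds up the components listed in step 2 and checks that limit. The function name `well_volume` is hypothetical.

```python
MAX_WELL_UL = 30.0  # maximum volume per well for the ABI 3130xl (Note 11)


def well_volume(sample_ul: float,
                formamide_ul: float = 10.0,
                size_standard_ul: float = 0.5) -> float:
    """Return the total well volume (uL), or raise if over the limit."""
    total = sample_ul + formamide_ul + size_standard_ul
    if total > MAX_WELL_UL:
        raise ValueError(
            f"{total:.1f} uL exceeds the {MAX_WELL_UL:.0f} uL well limit")
    return total


if __name__ == "__main__":
    print(well_volume(5.0))                     # single primer
    print(well_volume(2.0, formamide_ul=20.0))  # multiplexed well
```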
3.4 Fragment Analysis from 100 to 800 bp
See Note 12. The method below is specific to the ABI 3130xl
instrument, but can be adapted to other platforms. Output (.fsa) files
are imported into GeneMapper v4.0 (or later) software. GeneMapper
can accommodate a large number of samples, but it is easier
to manipulate projects created for each primer separately. New
(Generic) files are analyzed with a modified AFLP Analysis Method.
This method scores each peak above a minimum peak height (50 rfu)
as an allele and applies a label of 1, check, or 0 for the
presence of a peak in a particular bin. The level of background “noise”
is often around 10 rfu, but will vary depending on overall signal
strength. Select appropriate parameters (yours may differ):
1. General Parameters: Description: AFLP tutorial, Instrument:
3130xl.
2. Allele Parameters:
● Analyze dyes = blue (FAM), green (VIC), yellow (NED),
red (PET), orange (LIZ).
● Analysis range = 50–800 bps (allows you to see primer-
dimers but avoids most dye blobs that will cause your anal-
ysis to fail).
● Normalization scope = Project.
● Normalization method = Sum of Signal.
● Panel = Generate panel using samples. Bin width (bp) = 1.0;
Use all samples.
● Allele calling = Name alleles using labels. Useful labels
include 0 (<50), 1 (≥100), and check (≥50 but <100).
3. Peak Detector Parameters.
● Peak detection algorithm = Advanced.
● Analysis range = Partial (3,250–1,950).
● Analysis sizing = All sizes.
● Smoothing = None.
● Size calling method = Local Southern Method.
● Peak amplitude thresholds = 50 for all five colors.
● Min. peak half width = 2 pts.
● Polynomial degree = 3.
● Peak window size = 15 pts.
● Slope threshold = 0.0.
4. Peak Quality Parameters—Use factory defaults.
5. Quality Flags Parameters—Use factory defaults.
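The three allele labels can also be reapplied to exported peak heights outside GeneMapper, for example when rechecking calls in a script. The sketch below assumes cut-offs of 50 and 100 rfu, taken from the minimum peak height and the label definitions above. The function name `label_peak` is hypothetical, and the exact boundaries should be adjusted to match your own analysis method.

```python
def label_peak(height_rfu: float,
               check_min: float = 50.0,
               present_min: float = 100.0) -> str:
    """Assign the label '1', 'check', or '0' to a peak height.

    Assumed cut-offs: >= 100 rfu is present ('1'), 50 to < 100 rfu
    needs manual verification ('check'), and < 50 rfu is absent ('0').
    """
    if height_rfu >= present_min:
        return "1"
    if height_rfu >= check_min:
        return "check"
    return "0"


if __name__ == "__main__":
    for height in (600, 75, 45):
        print(height, label_peak(height))
```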
Select the appropriate size standard (after pruning fragments
<50 and >800 bp, and renaming the 1200 Liz standard) from the
pull-down menu. The Tables feature can be used to export any
number of fragment characteristics such as presence/absence, peak

height, peak size, peak area, etc. Spreadsheet software (Excel) is
used to view the tables and to assess the consistency of allele calling
for replicate ISSR PCR reactions for each primer, sample by sample.
(Warning! Older versions of Excel have a maximum width of
256 columns.) Any given ISSR-PCR may produce >100 fragments.
Fragment data for each individual for each primer are concatenated
into a single list of binary states, checking the original data as
necessary. Excel can also be used to check bin assignment for any
given fragment by calculating the variance of the peak sizes on a
fragment-by-fragment basis. High variance values (>0.03) might
be an indicator of incorrect binning.
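The bin check described above can be scripted as well as done in Excel. The sketch below computes the variance of the called peak sizes within each bin and flags bins above the 0.03 threshold. The function name `flag_suspect_bins` and the input layout (bin label mapped to a list of sizes in bp) are assumptions, as is the use of the sample variance, which corresponds to Excel's VAR function.

```python
from statistics import variance


def flag_suspect_bins(sizes_by_bin: dict[int, list[float]],
                      threshold: float = 0.03) -> list[int]:
    """Return bins whose peak-size variance exceeds the threshold.

    Uses the sample variance (an assumption; the source does not say
    which variance was used). Bins with fewer than two peaks are skipped.
    """
    flagged = []
    for bin_bp, sizes in sorted(sizes_by_bin.items()):
        if len(sizes) >= 2 and variance(sizes) > threshold:
            flagged.append(bin_bp)
    return flagged


if __name__ == "__main__":
    example = {
        121: [120.9, 121.0, 121.1],   # tight cluster
        230: [229.5, 230.0, 230.6],   # spread suggests mixed fragments
    }
    print(flag_suspect_bins(example))
```

A flagged bin is only a prompt to inspect the electropherograms; it does not by itself show that the binning is wrong.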
This is by far the most time-consuming (and potentially most
subjective) step in ISSR data generation. Fragment sizes are gener-
ally consistent from sample to sample and replicate to replicate, espe-
cially for fragments between 100 and 650 bp in size. Peak height of
individual fragments, although quite consistent from replicate to
replicate, can be very different from sample to sample. Generation of
(roughly) uniform height peaks across PCR samples minimizes mis-
scored fragments. In many cases, amplification of any specific primer
will yield a number of fragments that are common to all samples. If
products cannot be normalized prior to the data analysis step, the
peak height of these “universal” fragments can be used to normalize
the data postelectrophoresis. For example, if all samples have a frag-
ment of size 121 bp, and the height of that peak varies from a high
of 600 to a low of 45 depending upon sample, the ability of the
software to accurately score the presence or absence of this fragment
would be limited by the threshold set by the user. The average signal
strength also varies considerably from primer to primer, so a peak
height is a relative score. The default setting for “present” in
GeneMapper is 50. Ideally, GeneMapper would have an option that
allows the user to normalize any given dataset across a number of
these “universal” fragments, reducing the number of false “absent”
calls made by the software. This is something that must be done
manually and somewhat subjectively, based on overall signal
strength and experience. Using a majority-rule approach, only those
samples with a particular fragment detected in 1/3 runs would have
to be verified.
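Two of the manual steps in this paragraph lend themselves to a small script: rescaling peak heights to a "universal" fragment, and applying a majority rule across replicate runs. The sketch below is one possible reading of the text, not a feature of GeneMapper. The function names, the reference height of 100 rfu, and the rule for flagging calls (detected at least once but not in a majority of runs) are all assumptions.

```python
def normalize_to_universal(peaks: dict[int, float],
                           universal_bin: int,
                           reference_height: float = 100.0) -> dict[int, float]:
    """Rescale one sample's peak heights so that its universal
    fragment has the reference height (assumed value: 100 rfu)."""
    universal_height = peaks[universal_bin]
    if universal_height <= 0:
        raise ValueError("universal fragment not detected in this sample")
    factor = reference_height / universal_height
    return {bin_bp: height * factor for bin_bp, height in peaks.items()}


def replicate_consensus(calls: list[int]) -> tuple[int, bool]:
    """Combine 0/1 calls from replicate runs by majority rule.

    Returns (consensus, needs_verification). A fragment seen in at
    least one run but not in a majority (e.g. 1 of 3) is flagged.
    """
    detected = sum(calls)
    consensus = 1 if 2 * detected > len(calls) else 0
    needs_verification = (detected > 0) and (2 * detected <= len(calls))
    return consensus, needs_verification


if __name__ == "__main__":
    weak_sample = {121: 45.0, 230: 30.0}
    print(normalize_to_universal(weak_sample, universal_bin=121))
    print(replicate_consensus([1, 0, 0]))
```

After rescaling, the weak sample's 230 bp peak rises above a 50 rfu threshold, which illustrates how normalization can reduce false "absent" calls.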
The final data matrix can be generated in Excel, or in a simple
text editor, for analysis in your favorite Population Genetics soft-
ware package.
4 Notes
1. Heat gently to dissolve CTAB, bring to volume with sterile DI
water, pH = 8.0. Do not autoclave. Store at 4 °C. Reheat gently
to get CTAB back into solution before use. Solutions of
up to 6× CTAB have been used by other researchers. If CTAB
does not work well, explore alternative extraction buffers and
methods, such as PTB [17]. Many commercial kits exclude
β-mercaptoethanol, but this reagent is crucial for good
extractions. The addition of Proteinase K is thought to improve
yield, especially for herbarium material. Other buffer additions
to consider are DTT as an alternative to β-mercaptoethanol
[18] and PVP at 40 mg/mL of CTAB to remove phenolic
contaminants [19].
2. Phusion DNA polymerase has exceptionally high fidelity and
is much better at handling mononucleotide repeats than other
polymerases [20]. Given the repetitive nature of the primers being
used, it seems prudent to spend a bit extra on the polymerase to
minimize “Taq stutter.”
3. The custom ISSR Run Module is a modification of ABI’s
FragmentAnalysis50_POP7 module. The creation of a larger
size standard (Liz 1200) allows data to be scored over a much
larger size span than previous standards. This module will allow
fragments up to approximately 800 bp to be scored. The Liz
1200 size standard can be used to score fragments up to 1,200 bp
in length, but peaks become weak and broad as the run pro-
gresses, causing many more mis-scored data during the analysis
step. Modifications are as follows:
● Voltage steps = 40 (modified from 30).
● Prerun and run voltages = 10 (not 15).
● Data Delay = 150 (not 200).
● Run time = 5,000 (not 1,800).
4. If the two layers are not miscible (with vigorous shaking), a
resin such as Nucleon Phytopure may be added at this step to
improve DNA extraction results [21]. Phenol/Chloroform
extraction might also improve DNA purity [22].
5. Be sure to check the maximum centrifugation speed for your
brand of tubes! Chloroform compromises the integrity of plastic.
Spinning too fast or for too long will cause the tubes to shatter.
6. This is an excellent place to stop if necessary. Samples may be
left (in alcohol) at −20 °C indefinitely.
7. If necessary, DNAs can be further cleaned via a number of methods
such as sodium acetate reprecipitation (with ethanol rather than
isopropanol, to remove excess salt) [23], RNaseA treatment
[24], or any number of column-based DNA extraction kits.
Microcon YM-30 columns (Millipore) are particularly effective.
8. Prior to fluorescence-detection electrophoresis, it is useful to
ensure consistent amplification of ISSRs. Real-time (quantitative)
PCR (qPCR) simplifies this step greatly, as it allows annealing
temperature optimization and identification of recalcitrant
DNA samples. If qPCR is not available, fragments can be
visualized on a large-format 2 % agarose gel.
9. The on-instrument life of the polymer impacts the quality and
consistency of allele calls; it is suggested not to exceed the
recommended limit of seven days. Similarly, use of older arrays
(>150 runs) may negatively impact the quality and consistency
of allele calls.
10. Formamide degrades over time, especially with freeze–thaw
cycles. Prepare 1 mL aliquots and freeze. Once an aliquot has
been thawed for use, do not refreeze. Formamide is dangerous
and should be used in an approved fume hood. Waste (tubes,
plates, liquid) must be disposed of as hazardous waste.
11. Multiplexing (adding multiple, different primers to a run well)
is a great way to reduce overall cost of the method. Primers must
be labeled with different dyes and, ideally, would not have many
shared fragment sizes. If samples are multiplexed, more for-
mamide (20 μL) should be used. Multiplexing more than two
samples/well is not recommended. Additional template can be
added for weak samples to normalize peak heights across all
samples. Ideally the upstream PCR conditions can be improved
(cleaner DNA, more template DNA, optimized annealing
conditions, etc.) instead. Maximum allowable volume for the
ABI 3130xl is 30 μL/well.
12. GeneMapper is one of the most heavily used and reliable soft-
ware packages for fragment analysis, but it is relatively expensive
at ~$10,000/license. A number of other packages are available,
ranging in price from free to over $13,000.00/license, many
of which were recently reviewed by Meudt and Clarke [25].
The greatest benefit of modern versions of GeneMapper is the
ability to score a large number of fragments using the AFLP
module without a priori bin specification.
Acknowledgements
I am grateful to the team at Applied Biosystems, Inc. (C.J. Davidson

and T. Ingalls) for collaborative efforts to optimize ISSR methods
on the 3500xL Genetic Analyzer, and to Rancho Santa Ana Botanic
Garden for financial support.
References
1. Zietkiewicz E, Rafalski A, Labuda D (1994) Genome fingerprinting by simple sequence repeat (SSR)-anchored polymerase chain reaction amplification. Genomics 20:176–183
2. Gupta M, Chyi Y-S, Romero-Severson J, Owen JL (1994) Amplification of DNA markers from evolutionarily diverse genomes using single primers of simple-sequence repeats. Theor Appl Genet 89:998–1006
3. Salimath SS, de Oliveira AC, Godwin ID, Bennetzen JL (1995) Assessment of genome origins and genetic diversity in the genus Eleusine with DNA markers. Genome 38:757–763
4. Kostia S, Varvio S-L, Vakkari P, Pulkkinen P (1995) Microsatellite sequences in a conifer, Pinus sylvestris. Genome 38:1244–1248
5. Charters YM, Robertson A, Wilkinson MJ, Ramsay G (1996) PCR analysis of oilseed rape cultivars (Brassica napus L. ssp. oleifera) using 5'-anchored simple sequence repeat (SSR) primers. Theor Appl Genet 92:442–447
6. PubMed.gov. US National Library of Medicine/National Institutes of Health. http://www.ncbi.nlm.nih.gov/pubmed. Accessed 12 May 2012
7. Albani MC, Battey NH, Wilkinson MJ (2004) The development of ISSR-derived SCAR markers around the SEASONAL FLOWERING LOCUS (SFL) in Fragaria vesca. Theor Appl Genet 109:571–579
8. Bornet B, Antoine E, Françoise S, Marcaillou-Le Baut C (2005) Development of sequence characterized amplified region markers from inter-simple sequence repeat fingerprints for the molecular detection of toxic phytoplankton Alexandrium catenella (Dinophyceae) and Pseudo-nitzschia pseudodelicatissima (Bacillariophyceae) from French coastal waters. J Phycol 41:704–711
9. Ye Q, Qiu Y-X, Quo Y-Q, Chen J-X, Yang S-Z, Zhao M-S, Fu C-X (2006) Species-specific SCAR markers for authentication of Sinocalycanthus chinensis. J Zhejiang Univ Sci B 7:868–872
10. UBC primer set No. 9, Biotechnology Laboratory, University of British Columbia, Vancouver, Canada
11. Bornet B, Branchard M (2001) Nonanchored inter simple sequence repeat (ISSR) markers: reproducible and specific tools for genome fingerprinting. Plant Mol Biol Rep 19:209–215
12. Monte-Corvo L, Goulão L, Oliveira C (2001) ISSR analysis of cultivars of pear and suitability of molecular markers for clone discrimination. J Am Soc Hort Sci 126:517–522
13. Qian W, Ge S, Hong DY (2001) Genetic variation within and among populations of a wild rice Oryza granulata from China detected by RAPD and ISSR markers. Theor Appl Genet 102:440–449
14. Levi A, Thomas CE, Newman M, Reddy OUK, Zhang X, Xu Y (2004) ISSR and AFLP markers differ among American watermelon cultivars with limited genetic diversity. J Am Soc Hort Sci 129:553–558
15. Doyle JJ, Doyle JL (1987) A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochem Bull 19:11–15
16. Applied Biosystems, Inc. (2010) Application note: ISSR plant genotyping. Publication 106AP31-01. http://tools.invitrogen.com/content/sfs/brochures/cms_079244.pdf. Accessed 10 Dec 2012
17. Kistler L (2012) Ch 10 Ancient DNA extraction from plants. In: Shapiro B, Hofreiter M (eds) Ancient DNA: methods and protocols. Humana, New York, pp 71–79
18. Herzer S (2001) DNA purification. In: Gerstein AS (ed) Molecular biology problem solver: a laboratory guide. Wiley-Liss, Inc. http://onlinelibrary.wiley.com/book/10.1002/0471223905. Accessed 10 Dec 2012
19. John ME (1992) An efficient method for isolation of RNA and DNA from plants containing polyphenolics. Nucleic Acids Res 20:2381
20. Fazekas AJ, Steeves R, Newmaster SG (2010) Improving sequencing quality from PCR products containing long mononucleotide repeats. Biotechniques 48:277–285
21. Amersham Biosciences (2000) Nucleic acid purification: nucleon phytopure. Data File 18-1146-64. https://www.gelifesciences.com. Accessed 10 Dec 2012
22. Bitesize Bio. The basics: how phenol extraction works. http://bitesizebio.com/articles/the-basics-how-phenol-extraction-works/. Accessed 10 Dec 2012
23. Zumbo P (2012) Ethanol precipitation. Weill Cornell Medical College Department of Physiology and Biophysics, Ithaca, NY, p 12
24. http://irc.igd.cornell.edu/Protocols/RNaseProtocol.html. Accessed 10 Dec 2012
25. Meudt HM, Clarke AC (2007) Almost forgotten or latest practice? AFLP applications, analyses and advances. Trends Plant Sci 12:106–117
26. Rychlik W (2002) OLIGO primer analysis software, version 6. Molecular biology insights. Cascade, Inc, Cascade-Chipita Park, CO
Chapter 6
SSR Genotyping
Annaliese S. Mason
Abstract
SSR genotyping involves the use of simple sequence repeats (SSRs) as DNA markers. SSRs, also called
microsatellites, are a type of repetitive DNA sequence ubiquitous in most plant genomes. SSRs contain
repeats of a motif sequence 1–6 bp in length. Due to this structure SSRs frequently undergo mutations,
mainly due to DNA polymerase errors, which involve the addition or subtraction of a repeat unit. Hence,
SSR sequences are highly polymorphic and may be readily used for detection of allelic variation within
populations. SSRs are present within both genic and nongenic regions and are occasionally transcribed,
and hence may be identified in expressed sequence tags (ESTs) as well as more commonly in nongenic
DNA sequences. SSR genotyping involves the design of DNA-based primers to amplify SSR sequences
from extracted genomic DNA, followed by amplification of the SSR repeat region using polymerase chain
reaction, and subsequent visualization of the resulting DNA products, usually using gel electrophoresis.
These procedures are described in this chapter. SSRs have been one of the most favored molecular markers
for plant genotyping in the last 20 years due to their high levels of polymorphism, wide distribution across most
plant genomes, and ease of use and will continue to be a useful tool in many species for years to come.
Key words Simple sequence repeats, PCR-based markers, Molecular markers, Plant genotyping,
Polymorphism, Primer design, Agarose gel electrophoresis
1 Introduction
Simple sequence repeats, commonly known as SSRs or microsatellites,
are a type of repetitive DNA sequence found in eukaryotic
genomes. SSRs consist of repeats of a short (1–6 bp) motif [1],
such as [A]n (mononucleotide motif) or [CT]n (dinucleotide motif).
In plants, there is approximately one SSR every 50 kbp of genomic
sequence [2]. If the SSR sequence is composed entirely of repeats
of one motif, it is termed a “perfect” SSR, whereas “imperfect” or
“compound” SSRs are made up of multiple different repeat unit
motifs. SSRs are widely spread throughout both genic and intergenic
regions of plant genomes, although they are predominantly found
in noncoding regions [3]. SSRs usually arise and mutate through
errors made by the DNA polymerase enzyme during DNA replication,
whereby a repeat unit is either added to or removed from the SSR
sequence [4], although illegitimate recombination during meiosis
may also play a role in SSR expansion and contraction. These
mechanisms increase the mutation rate of SSR sequences to
approximately 10⁻²–10⁻³ per locus per gamete per generation [5],
approximately 10⁶ times the mutation rate of nonrepetitive genomic
DNA [6]. Hence, SSRs are highly polymorphic, and this high level
of polymorphism makes these sequences ideal targets for the devel-
opment of molecular markers [7], although the process of SSR
marker development can be quite involved [8, 9]. SSR marker alleles
usually differ in their number of repeats, and these alleles are
generally codominant, located in noncoding regions, and under neutral
selective pressure [2]. SSR alleles are also highly reproducible, unlike
some other marker types, such as RAPDs, where repeatable results
are difficult to obtain [10]. SSR markers are also relatively cheap to
use, and performing the SSR marker protocol in-house is within the
reach of most molecular laboratories. These aspects of SSRs have
contributed to making SSRs the “workhorse” of molecular marker
studies for the last two decades, and in many plant species SSR
markers will play a significant role for many years to come.
SSRs were first discovered in plants in 1984 [1], and their utility
as a molecular marker was realized only 5 years later in 1989 [5, 11].
Since then, SSRs have been developed and used as molecular
markers in a very wide range of crop species, initially in important
crops such as wheat [12, 13], rice [14], and soybean [15], and
more recently in crops such as cucumber [16], sunflower [17], and
peanut [18]. SSRs have also been used for evolutionary studies in
population genetics [19] and molecular systematics [20], and have
since been studied for their role in genome evolution [21] and
responses to genome stress [22]. SSRs have also demonstrated utility
in marker-assisted selection [23–25] and mapping quantitative
trait loci [23, 25].
The first step in SSR marker genotyping is to design primers
specific to the flanking sequence of the SSR region. In major crop
species and model plants, suitable SSR primers are already available in
public databases, but in other cases the SSR markers must be designed
from scratch. If genomic sequence is available, software programs
can be used to identify SSRs and primers can be designed from the
genomic sequence to amplify SSR regions. However, for many spe-
cies no genomic sequence is available, and if this is the case genomic
sequence libraries must be created, clones putatively containing mic-
rosatellites must be sequenced, primer sequences designed, and prim-
ers confirmed to amplify interpretable polymorphic products
from unique SSR loci [9]. The average attrition from successfully
sequenced clones to production of useful SSR primers has been esti-
mated at 83 % loss [9]. Microsatellite-enriched libraries can be
produced in a month or less at a low cost (estimated at <US$1,500 in
2002) [8], and the allocation of 6 months to develop working prim-
ers from libraries is probably reasonable for most species [9].
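Where genomic sequence is available, identifying candidate SSRs is essentially a pattern search for tandem repeats of a 1–6 bp motif. The following Python sketch illustrates the idea with a regular expression for perfect repeats only. It is not one of the published SSR-finding programs; the function name `find_ssrs` and the minimum of five repeat units are assumptions chosen for illustration.

```python
import re


def find_ssrs(sequence: str,
              min_repeats: int = 5,
              max_motif: int = 6) -> list[tuple[int, str, int]]:
    """Find perfect SSRs; return (start index, motif, repeat count).

    The lazy quantifier makes the regex try the shortest motif first,
    so a run such as CTCTCTCTCT is reported as (CT)5, not (CTCT)2.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeats))
    return hits


if __name__ == "__main__":
    # Toy sequence: seven flanking bases, (CT)6, six flanking bases.
    toy = "GATTACA" + "CT" * 6 + "GGATCC"
    print(find_ssrs(toy))
```

Primers would then be designed in the flanking sequence on either side of each hit. Imperfect and compound SSRs need a more tolerant search than this sketch provides.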
Once SSR primers are available, the PCR-based amplification

of the SSR region can be carried out for genotyping purposes.
Although other methods such as hybridization can be used for
detecting SSRs [1, 5], and next-generation sequencing may obvi-
ate the need for lab-based sequence detection [26], PCR-based
amplification is by far the simplest and most widely used SSR
marker method [11]. PCR primers are oligonucleotides (usually
18–25 bp long) that are specific to either side of the target SSR
sequence: one forward primer specific to sequence in the 5′–3′
direction and one reverse primer specific to sequence in the 3′–5′
reverse complement direction. After amplification of the DNA
fragment, the final step is to visualize the size polymorphisms of
the PCR product. The most common and easily accessible means
of doing this is by agarose gel electrophoresis, whereby the PCR
product is loaded into a well on a solid gel and an electric current
is run through it (Fig. 1). This process separates the negatively
charged DNA fragments over time, as smaller fragments travel
more quickly than larger fragments. Popular alternatives to agarose
gel electrophoresis (AGE) include polyacrylamide gel electropho-
resis (PAGE) and capillary gel electrophoresis. PAGE, while more
technically difficult than AGE, has the advantage of being able to
resolve smaller differences in fragment size, down to a 5 % differ-
ence or less [27]. Capillary gel electrophoresis usually involves the
fluorescent labeling of the DNA fragments and subsequent detec-
tion using a Sanger sequencing machine. In this method DNA is
loaded into capillary tube gels, and after electrophoresis the migra-
tion of the fluorescently labeled DNA fragments is recorded
(Fig. 2). Single base-pair differences in fragment size between
alleles can be resolved using this method. These three methods of
SSR allele visualization provide considerable flexibility in terms of
trade-offs between cost, ease of production, and scoring and
sensitivity.
Fig. 1 Example of agarose gel electrophoresis of PCR products from amplification of an SSR locus,
showing two alleles of approximately 320 and 230 bp in size (Allele 1 and Allele 2, respectively)
in 23 experimental individuals. Samples 4 and 16 represent probable failed amplification of the SSR
locus; Samples 13, 15, and 18 are homozygous for Allele 2; Sample 23 is homozygous for Allele 1; and the
remaining samples are heterozygous (one copy of Allele 1 and one copy of Allele 2)
Fig. 2 Example of capillary gel electrophoresis output of fluorescently labeled
SSR marker fragments, visualized using GeneMapper 1.0 (Applied Biosystems).
Five alleles are shown across two individuals. A1, A2, and A3 are amplified by the
same PCR primer pair (green fluorescent label), with Individual 1 heterozygous at this
locus with alleles A1 (~273 bp) and A2 (~290 bp) and Individual 2 heterozygous
at this locus with alleles A2 (~290 bp) and A3 (~295 bp). Alleles B1 (~276 bp) and
B2 (~283 bp) are amplified by another fluorescently labeled primer pair (red
fluorescent dye), with Individual 1 heterozygous at this locus (both alleles) and
Individual 2 homozygous for allele B2
SSRs may also be used directly as markers in already
well-characterized species by whole genome sequencing of different
genotypes [26]. However, this method is still in development, and
assembling repetitive regions accurately using next-generation
sequencing is difficult [28], which will probably limit the useful-
ness of SSR regions as markers in the current generation of
sequencing technology. In future, bioinformatics-based analysis of
SSR regions may replace laboratory-based genotyping using SSR
markers, but the high utility of PCR-based SSR genotyping will
ensure the use of this method in many species in years to come.
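The relationship between repeat number and PCR product size described above can be illustrated with a minimal in-silico sketch. The flanking sequences, primers, and repeat counts below are invented for illustration only, and the function simply looks for exact primer matches on a single template strand; real primer binding tolerates mismatches and is not modeled here.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def amplicon_length(template, forward, reverse):
    """Length in bp of the product amplified from `template`, or None.

    `forward` matches the top strand as written. `reverse` is written
    5'->3' and binds the top strand, so its reverse complement is the
    sequence searched for downstream of the forward primer.
    """
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None  # forward primer site missing: no product expected
    rev_site = reverse_complement(reverse)
    end = template.find(rev_site, start + len(forward))
    if end == -1:
        return None  # reverse primer site missing: no product expected
    return end + len(rev_site) - start


# Hypothetical flanking sequences, invented for illustration.
LEFT_FLANK = "ATGCCGTTAGCAGGTCCATG"
RIGHT_FLANK = "CCGATTGCAGTCAAGGCTTA"
FORWARD = LEFT_FLANK                       # forward primer on the top strand
REVERSE = reverse_complement(RIGHT_FLANK)  # reverse primer, written 5'->3'

allele_1 = LEFT_FLANK + "GA" * 12 + RIGHT_FLANK  # (GA)12 repeat
allele_2 = LEFT_FLANK + "GA" * 17 + RIGHT_FLANK  # (GA)17 repeat

len_1 = amplicon_length(allele_1, FORWARD, REVERSE)
len_2 = amplicon_length(allele_2, FORWARD, REVERSE)
print(len_1, len_2, len_2 - len_1)
```

In this toy case the two alleles differ by five GA units, so the products differ by 10 bp, which is the kind of size polymorphism that is then resolved by electrophoresis.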
2 Materials
1. PCR master mix components: 20 mM MgCl2, 2.5 mM dNTP
mix, 10 μM forward primer, 10 μM reverse primer, 10× buffer,
5 U/μl Taq DNA polymerase.
2. Purified genomic DNA extracted from plant tissue.
3. Thermocycler machine or three heated water baths.
4. Ice or cold blocks.
5. Molecular biology grade agarose powder.
6. Gel electrophoresis tank.
7. 1× TAE solution: Tris base, glacial acetic acid solution and
0.5 M EDTA (ethylenediamine tetra-acetic acid) sodium salt
solution, pH 8.0. To make up 1 l of 50× TAE stock solution,
combine 242 g Tris base, 57.2 ml glacial acetic acid, and 100 ml
0.5 M EDTA (pH 8.0), then make up to 1 l with deionized H2O.
Dilute stock solution 1 part to 49 parts deionized H2O to make
up 1× TAE.
8. 10 mg/ml Ethidium bromide solution.
9. 6× Loading buffer: 25 mg xylene cyanol and/or bromophenol
blue and 4 g sucrose in 10 ml of H2O.
10. DNA ladder.
3 Methods
3.1 Primer Design (See Note 1)
Primers can be obtained commercially as custom oligonucleotides
(see Note 2) and should be designed from available genomic DNA
sequence to fulfill the following set of criteria:
● Forward primer and reverse primer flank the SSR sequence.
● Primer length 18–25 bp.
● GC content of primer >40 %.
● Annealing temperature of primer >45 °C.
● No strings of repeated mononucleotides >3.
● No repetitive regions or regions which when inverted will bind
to each other.
● No complementary sequences between the forward and reverse
primers.
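The criteria listed above can be screened automatically before ordering. The sketch below is one possible check, not a validated design tool: it treats the temperature criterion as an estimated Tm and uses the simple Wallace rule (2 °C per A/T, 4 °C per G/C) as a rough approximation, and the helper for complementary stretches only reports the longest exact match. All function names and the example sequences are hypothetical.

```python
import re


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough Tm estimate (assumption): 2 °C per A/T plus 4 °C per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def longest_shared_complement(a, b):
    """Length of the longest stretch of `a` that could base-pair with `b`."""
    a, target = a.upper(), reverse_complement(b)
    best = 0
    for i in range(len(a)):
        for j in range(i + 1, len(a) + 1):
            if a[i:j] in target:
                best = max(best, j - i)
            else:
                break  # longer substrings starting at i cannot match either
    return best


def check_primer(primer, min_tm=45.0):
    """Return a list of criteria the primer fails (empty list = passes)."""
    primer = primer.upper()
    problems = []
    if not 18 <= len(primer) <= 25:
        problems.append("length outside 18-25 bp")
    if gc_percent(primer) <= 40:
        problems.append("GC content not above 40 %")
    if wallace_tm(primer) <= min_tm:
        problems.append("estimated Tm not above 45 °C")
    if re.search(r"(.)\1{3,}", primer):
        problems.append("mononucleotide run longer than 3")
    return problems


# Hypothetical primers, invented for illustration.
good = "ATGCCGTTAGCAGGTCCATG"
bad = "AAAATTTTAAAATTTTAT"
print(check_primer(good))
print(check_primer(bad))
print(longest_shared_complement("AAACCCGGG", "CCCGGGTTT"))
```

`longest_shared_complement` can be applied to a primer against itself (self-complementarity) or to the forward primer against the reverse primer; what length of match is acceptable is left to the user, since the criteria above do not give a threshold.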
3.2 Sequence Amplification (See Note 3)
1. Make up PCR mixes of 10–50 μl per reaction (see Note 4) in
0.2 ml PCR tubes on ice (see Note 5), containing:
● 0.125 mM dNTP mix (e.g., 1.6 μl of 2.5 mM solution in
a 20 μl reaction).
● 1× DNA polymerase buffer (e.g., 2 μl of 10× DNA poly-
merase buffer in a 20 μl reaction).
● 2 mM MgCl2 (e.g., 2 μl of 20 mM MgCl2 solution in a 20 μl
reaction; only add MgCl2 if not present in 10× DNA polymerase
buffer).
● 0.5 μM forward primer (e.g., 1 μl of 10 μM solution in a
20 μl reaction).
● 0.5 μM reverse primer (e.g., 1 μl of 10 μM solution in a
20 μl reaction).
● 10–75 ng of genomic DNA (e.g., 5 μl of 10 ng/μl DNA
solution in a 20 μl reaction).
● 1 U of Taq (DNA polymerase enzyme from Thermus
aquaticus; e.g., 0.2 μl of 5 U/μl solution in a 20 μl
reaction).
● Purified deionized water to the appropriate volume (e.g., up
to 20 μl in a 20 μl reaction).
2. Mix thoroughly by flicking the tube (do not vortex).
3. Thermocycler programming (see Note 6)
Heat cycling (see Note 7): Initial denaturation, then 15–35
cycles of denaturation, annealing, and extension, followed by a
final extension. Typical temperatures and times are given for a
product of 500 bp.
(a) Initial denaturation: 94 °C for 5 min.
Then 35 cycles of steps (b)–(d):
(b) Denaturation: 94 °C for 30 s.
(c) Annealing (see Note 8): 50 °C for 60 s.
(d) Extension: 72 °C for 60 s.
Followed by a single final extension step:
(e) Final extension: 72 °C for 10 min.
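When many samples are run, the per-reaction volumes given in step 1 can be scaled into a master mix as described in Note 5. The sketch below assumes the 20 μl example reaction with 5 μl of DNA solution added separately, and one extra reaction to allow for pipetting error; the dictionary keys and function name are illustrative only.

```python
# Per-reaction volumes (ul) taken from the 20 ul example reaction in step 1.
PER_REACTION_UL = {
    "dNTP mix (2.5 mM)": 1.6,
    "10x buffer": 2.0,
    "MgCl2 (20 mM)": 2.0,
    "forward primer (10 uM)": 1.0,
    "reverse primer (10 uM)": 1.0,
    "Taq (5 U/ul)": 0.2,
}
REACTION_UL = 20.0  # total reaction volume
DNA_UL = 5.0        # DNA solution, added to each tube individually


def master_mix(n_samples, extra=1):
    """Return (total volumes for the mix, mix volume to dispense per tube)."""
    n = n_samples + extra  # extra reaction(s) to cover pipetting error
    mix_per_tube = REACTION_UL - DNA_UL
    water_per_tube = mix_per_tube - sum(PER_REACTION_UL.values())
    volumes = {name: round(vol * n, 2) for name, vol in PER_REACTION_UL.items()}
    volumes["water"] = round(water_per_tube * n, 2)
    return volumes, mix_per_tube


volumes, per_tube = master_mix(24)
for reagent, volume in volumes.items():
    print(f"{reagent}: {volume} ul")
print(f"dispense {per_tube} ul of mix per tube, then add {DNA_UL} ul DNA")
```

For 24 samples this gives a 375 μl master mix (25 reactions of 15 μl each), of which 15 μl is dispensed into each tube before the DNA is added.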
3.3 Visualization of Amplified DNA Product Using Agarose
Gel Electrophoresis (See Note 9)
1. Prepare the agarose gel for electrophoresis. Weigh out 1 g of
molecular biology grade agarose powder into a conical flask, add
100 ml of 1× TAE buffer (makes 1 % agarose gel, see Note 10).
Heat the solution to boiling and check to make sure all powder
is dissolved. Cool the outside of the flask under running water
(turning or swirling flask to prevent uneven cooling) until flask
can be held comfortably (~60 °C), and then add 2 μl of 10 mg/ml
ethidium bromide solution (see Note 11).
2. Pour liquid agarose into a gel mold, generally consisting of a
hard plastic tray which has been taped on both ends (the tape is
removed once the gel has set, as both ends of the gel tray must be
open to allow transmission of current through the gel). A comb
which sits in the gel tray is then added to make wells in which
samples may be loaded after the gel has set.
3. Add loading buffer at the appropriate concentration to each of
the tubes containing the PCR product (see Note 12), and mix
by pipetting to create a uniform solution.
4. Remove tape from ends of gel and lower into the electropho-
resis tank, making sure the buffer solution covers the top of the
gel and that the gel tray is positioned in line with the edges of
the tank with the wells toward the negative electrode.
5. Load a DNA ladder into the first and last wells on each row,
and then add samples in order to the remaining wells.
6. Close the lid of the gel tank, making sure the electrodes are
in contact, and then start the electrophoresis. When using
standard gel tanks (see Note 13), set the voltage to 100 V and
the current to 240 mA for 30–60 min, then check at intervals to
track the progression of the loading buffer down the gel.
7. After satisfactory progression of the loading buffer to two-thirds
to three-quarters of the way down the gel, stop the electropho-
resis and remove the gel from the tank. Visualize by placing
the gel into an ultraviolet (UV) transilluminator cabinet or on
to a UV transilluminator box. Make sure to use appropriate
protective equipment to prevent UV burns to exposed skin.
8. Photograph or otherwise record the location of the DNA
bands on the gel relative to the ladder.
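Band sizes can be estimated from the recorded positions by comparison with the ladder. The sketch below assumes that log10(fragment size) is approximately linear in migration distance between two neighboring ladder bands, which is a common working approximation rather than part of this protocol. The ladder sizes and distances are invented example measurements.

```python
import math


def estimate_size_bp(distance_mm, ladder):
    """Estimate fragment size from migration distance.

    `ladder` is a list of (size_bp, distance_mm) pairs measured on the
    same gel. The size is interpolated on a log10 scale between the two
    ladder bands that flank the sample band.
    """
    points = sorted(ladder, key=lambda point: point[1])  # by distance
    for (size_1, dist_1), (size_2, dist_2) in zip(points, points[1:]):
        if dist_1 <= distance_mm <= dist_2:
            fraction = (distance_mm - dist_1) / (dist_2 - dist_1)
            log_size = math.log10(size_1) + fraction * (
                math.log10(size_2) - math.log10(size_1)
            )
            return 10 ** log_size
    raise ValueError("band lies outside the range covered by the ladder")


# Hypothetical ladder measurements: (size in bp, distance from well in mm).
LADDER = [(500, 20.0), (400, 24.0), (300, 29.0), (200, 36.0), (100, 48.0)]

for band_mm in (27.5, 32.5):
    print(band_mm, round(estimate_size_bp(band_mm, LADDER)))
```

Bands that run beyond the largest or smallest ladder fragment are rejected rather than extrapolated, since the approximation is least reliable outside the ladder range.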
3.4 Analyzing Plant SSR Data
1. Score SSR alleles as 1: presence of a band or 0: absence of a
band. One or two alleles per individual should be observed for
a single locus (see Note 14 for more information or if this is not
the case). SSR allele copy number cannot be reliably determined
from agarose gel electrophoresis (see Note 15).
2. Collate scoring data for all SSR markers run on the population,
and sort by genomic location (e.g., chromosome 1, linkage
group 4) or other relationships if known.
3. Data may now be entered into one of a range of free or
commercial software programs in order to perform analyses such as
creation of linkage maps, determination of population linkage
disequilibrium, determination of population genetic diversity,
creation of phylogenetic trees, and population principal
components analysis (PCA). The addition of phenotype or
population structure data (e.g., cultivars, families) may allow
analyses such as detection of quantitative trait loci (QTLs),
association mapping, matching of haplotypes to known phenotypes,
and correlation analyses (see Note 16).
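The 1/0 scoring in step 1 can be turned into genotype calls for a single locus with a few lines of code. This is a minimal sketch using hypothetical sample and allele names; it follows the interpretation given in Note 14 (no band: failed amplification; one band: homozygous or unresolved; two bands: heterozygous; more than two: probably more than one locus amplified).

```python
def call_genotype(band_scores):
    """Classify one individual at one locus from 1/0 band scores.

    `band_scores` maps allele name -> 1 (band present) or 0 (absent).
    """
    present = sorted(a for a, score in band_scores.items() if score == 1)
    if not present:
        return "no amplification"
    if len(present) == 1:
        # One band: homozygous, or two alleles too close in size to resolve.
        return f"homozygous {present[0]}"
    if len(present) == 2:
        return f"heterozygous {present[0]}/{present[1]}"
    # Three or more bands suggest that more than one locus was amplified.
    return "more than two alleles"


# Hypothetical scores for four individuals at one locus with two alleles.
scores = {
    "sample_01": {"allele_1": 1, "allele_2": 1},
    "sample_04": {"allele_1": 0, "allele_2": 0},
    "sample_13": {"allele_1": 0, "allele_2": 1},
    "sample_23": {"allele_1": 1, "allele_2": 0},
}
calls = {sample: call_genotype(bands) for sample, bands in scores.items()}
for sample, call in calls.items():
    print(sample, call)
```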
4 Notes
1. Obtaining useful SSR primers is an involved process, and
primers may fail due to incorrect segment amplification, lack of
polymorphism in the identified SSR, or nonspecific amplification
(particularly in polyploid genomes). The time to develop DNA
libraries and subsequent tests to acquire working SSR primer
pairs has been estimated at 7 months [9].
2. Primers can be ordered as custom oligonucleotides, and com-
panies that provide these commercially (such as Fisher-Biotech
and Finnzymes) often provide web pages that give an estimate
of primer secondary structures, melting temperatures, and
compatibilities between primer pairs based on primer sequences.
Checking these prior to ordering is desirable. Primers can be
shipped at room temperature, and ordered at custom concen-
trations or volumes or as dried down pellets. Generally primers
cost around US$20 each, although the cost varies depending
on the company, country, and characteristics (e.g., length, fluo-
rescent tags) of the primers.
3. PCR failure in inexperienced operators and particularly in teach-
ing labs is most commonly due to mistakes in master mix com-
position, such as failing to add a reagent or adding the wrong
volume of a reagent. Keeping the reagent mix cold and mixing
well (flicking the tube is preferable to inverting, but vortexing is
not acceptable) after addition of all components is also essential
for success. It is also possible to acquire a “bad batch” of dNTPs,
buffer, Taq DNA polymerase, or magnesium chloride from
suppliers, or for any reagent to acquire contaminants which give
false positives in the PCR reactions. PCR failure due to poor
DNA quality (contaminants) is also common, although gener-
ally the process is extremely tolerant of DNA quantity (1–200 ng
will still work in many instances).
4. The minimum recommended PCR volume is usually 10 μl,
and 50 μl is more than sufficient for SSR genotyping purposes.
Lower reaction volumes are more likely to fail, and 20–25 μl
reaction volumes provide a good compromise between success
rates and savings on reagent costs.
5. For more than a few samples, make up a master mix containing
all reagents except the genomic DNA. For example, when
preparing 24 samples of DNA for PCR, multiply the amounts of
all reagents except DNA required for one reaction volume by 25
(+1 for pipetting error), then add the appropriate amount of
master mix (e.g., 15 μl in a 20 μl reaction volume containing
5 μl DNA solution) to individual DNA samples in PCR tubes.
6. The PCR mix should be repeatedly heated and cooled in cycles.
Although this can be achieved by manually transferring tubes
between appropriately heated water baths, commercially
produced thermocyclers, which contain tube-holding blocks
that can be heated and cooled to precise temperatures, are
more commonly used.
7. Most commercially available Taq DNA polymerase will come
with thermocycler protocols optimized for that particular enzyme,
as well as tubes of 10× reaction buffer, MgCl2 solution, and dNTP
mix. The Taq DNA polymerase enzyme patent has now expired, and
this enzyme may hence now be produced by laboratories in-house.
A number of
other modified Taq enzymes are also commercially available,
with “hot start” and “proof reading” capabilities. Hot-start Taq
is inactive until heated, and needs an initial high-temperature
activation step before cycling (follow the supplier's protocol
for temperature and time). Proof-reading Taq makes fewer
sequence errors during replication than regular Taq. Neither
hot-start nor proof-reading Taq is required for SSR amplification
and genotyping procedures, although proof-reading Taq
may provide a more robust means of checking polymorphisms
during sequencing of isolated microsatellite regions in the
primer design validation phase.
8. Melting temperature (Tm) is determined by the composition of
the oligonucleotide primers. Longer primers with higher GC
content will have higher melting temperatures compared to
shorter primers with lower GC content, and the sequence will
also affect the Tm. Online calculators are available to predict
primer Tm through suppliers of Taq and oligonucleotides, and
for each pair of primers, using the lower Tm value to set the
annealing temperature is generally recommended. Lowering the
annealing temperature (the 50 °C step of the cycle) during the
PCR will reduce primer binding specificity, and is hence more
likely to produce a product in recalcitrant reactions and also
to produce multiple, nonspecific products (especially in
polyploids). Increasing the annealing temperature will increase
primer binding specificity, but the reaction may fail if the
annealing temperature is too high relative to the primer Tm.
Generally, annealing temperatures range from 48 to 68 °C, with
two-step PCRs recommended if primer Tm is over 65 °C. Two-step
PCRs combine the annealing and extension stages of the PCR, such
that one longer step at 72 °C replaces the two shorter steps.
9. Alternatives to agarose gel electrophoresis include polyacryl-
amide gel electrophoresis and capillary electrophoresis.
Polyacrylamide gels are made using acrylamide rather than
agarose, but work on a similar principle. Capillary
electrophoresis is rarely done in-house, as this technique
involves the use of a sequencing machine, such as an Applied
Biosystems 3730xl sequencer. PCR products are obtained in the
same fashion as for agarose gel electrophoresis, although a few
extra sample preparation steps (such as addition of formamide,
drying down the PCR products, or the addition of the DNA ladder
to the tubes) are occasionally required depending on the
fragment analysis service. Visualization of the PCR products is
performed via fluorescent labeling, which may take place either
postamplification or by the use of a fluorescently labeled primer
(only one primer of the pair is required to be fluorescently
labeled). Fluorescently labeled oligonucleotides can be
obtained from suppliers such as Applied Biosystems, who have
a custom set of patented dyes (FAM, VIC, PET, NED, etc.).
PCR products from multiple primer pairs may be run in the same
lane, provided that alleles from different primers can be
separated on the basis of expected size or fluorescent dye color.
After separation of the PCR products (optimal length 100–500 bp)
by capillary gel electrophoresis, image files are generated of
the fluorescent peaks. Free software is available to analyze
these data and allows scaling of alleles relative to the internal
standard such that fragment size may be estimated, with
reproducibility of approximately 0.3 bp between runs. Care should
be taken in scoring
alleles that are close together, as a phenomenon known as
“stutter” often causes false additional alleles separated by
0.5–2 bp intervals. These tend to manifest in the same pattern
(e.g., one smaller allele to the right of the primary allele, a
double peak instead of a single peak, or a set of peaks of dimin-
ishing size) for each allele at the same locus and can be pre-
sumed to be false if always observed in conjunction with
another adjacent allele.
10. Standard molecular biology grade agarose should be used to
make 0.8–2.5 % gels for SSR genotyping (i.e., 0.8–2.5 g of
agarose per 100 ml of buffer). Higher concentration gels will
lack resolving power unless higher quality agarose is used. 1×
TAE buffer may also be replaced by 1× TBE buffer (10.8 g Tris
base, 5.5 g boric acid, and 0.5 ml of 0.5 M EDTA, pH 8.0, in
1 l of dH2O). 1× TBE has greater resolving power than 1×
TAE, and hence may be used to differentiate smaller fragments
on high concentration (e.g., 4 %) gels made from high-quality
agarose (e.g., Agarose1000).
11. Ethidium bromide is still commonly used in molecular genetics
laboratories for visualization of DNA on gels, as it intercalates
between the base pairs of DNA and fluoresces under ultraviolet
light. However, ethidium bromide is also a known mutagen and
suspected carcinogen, and may be replaced by commercially
produced fluorescent alternatives such as SYBR Green if desired.
When working with ethidium bromide or in ethidium
bromide-contaminated areas nitrile gloves are recommended, as
latex gloves are permeable to this chemical.
12. PCR products may be stored at this step for later visualization,
although storing before adding loading buffer is preferable. 4 °C
is fine for short-term storage (up to 2 weeks), and −20 °C is
preferable for longer term storage.
13. Gel tanks which run faster and at higher voltages are also
available commercially, and premade agarose gels can also be
bought rather than made from scratch.
14. Microsatellites are codominant molecular markers, simplifying
the analysis of marker data as all alleles produced should be
observable. If only one SSR locus is amplified by the primers
then a maximum of two alleles should be observed for any one
individual. The observation of one allele represents either
homozygosity at that locus or two alleles that are too close in
size to be distinguishable. Absence of alleles in some individuals
can be due to either failed PCRs, polymorphisms in the primer
binding site, or deletion mutations. If multiple loci are ampli-
fied by a single primer (detectable as the presence of three or
more alleles) then scoring alleles can become more complex.
Amplification of multiple loci is common in polyploids, but
may also occur in other species. Increasing the annealing
temperature in the PCR may increase primer specificity and
remove additional bands produced by amplification of secondary
loci. For simple determination of genetic diversity the
allocation of alleles to known SSR loci is not as critical, as in
these analyses the patterns of bands are used to produce a
similarity matrix between individuals for production of
phylogenetic trees, and the location of the SSRs is of secondary
importance.
Likewise, creation of linkage maps does not require prior
knowledge of SSR location, and may in fact be used to allocate
SSRs to linkage groups. However, in other uses of SSRs, such as
haplotype determination and analysis of genetic introgressions,
the location of the SSR and hence the correct identification of
alleles as belonging to each locus is crucial.
15. If capillary electrophoresis is used to separate PCR products,
then allele copy number may be determinable from the relative
heights of the fluorescent peaks. This can only be done if two or
more alleles are amplified by the same PCR primers, as relative
peak height is required for proper assessment of copy number, and
will not be reliable for all markers. To assess the reliability
of allele copy number analysis, the ratio of the peak height of
each allele relative to every other allele amplified by the same
primers should be calculated. If ratios do not fall neatly into
whole number multiples (e.g., 1:1, 1:2), allele copy number
should not be assessed using that marker, but otherwise this
technique may provide some utility for detecting events such as
homoeologous nonreciprocal translocations [29–31].
16. Logistic rather than normal linear regression should be
performed using SSR data, as SSR alleles are binomially rather
than normally distributed. Data cleaning to remove alleles with a
high degree of failed amplification across the population is also
suggested, as these may bias subsequent analyses.
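The data cleaning suggested in Note 16 can be sketched as a simple filter on the scoring table. The 20 % failure threshold, the marker names, and the use of None to mark failed amplification are all assumptions made for this illustration; an appropriate cutoff depends on the population and the downstream analysis.

```python
def failure_rate(scores):
    """Fraction of individuals with no score (None) for one marker."""
    return sum(score is None for score in scores) / len(scores)


def drop_unreliable_markers(markers, max_failure=0.2):
    """Keep only markers whose failure rate is at or below `max_failure`."""
    return {
        name: scores
        for name, scores in markers.items()
        if failure_rate(scores) <= max_failure
    }


# Hypothetical 1/0 scores for ten individuals; None = failed amplification.
markers = {
    "ssr_a": [1, 0, 1, 1, None, 0, 1, 1, 0, 1],              # 1 of 10 failed
    "ssr_b": [None, None, 1, None, 0, None, 1, 0, None, 1],  # 5 of 10 failed
}
cleaned = drop_unreliable_markers(markers)
print(sorted(cleaned))
```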
References
1. Tautz D, Renz M (1984) Simple sequences are ubiquitous repetitive components of eukaryotic genomes. Nucleic Acids Res 12:4127–4138
2. Morgante M, Rafalski A, Biddle P et al (1994) Genetic mapping and variability of 7 soybean simple sequence repeat loci. Genome 37:763–769
3. Cox R, Mirkin SM (1997) Characteristic enrichment of DNA repeats in different genomes. Proc Natl Acad Sci U S A 94:5237–5242
4. Strand M, Prolla TA, Liskay RM et al (1993) Destabilization of tracts of simple repetitive DNA in yeast by mutations affecting DNA mismatch repair. Nature 365:274–276
5. Tautz D (1989) Hypervariability of simple sequences as a general source for polymorphic DNA markers. Nucleic Acids Res 17:6463–6471
6. Wolfe KH, Li WH, Sharp PM (1987) Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proc Natl Acad Sci U S A 84:9054–9058
7. Varshney RK, Graner A, Sorrells ME (2005) Genic microsatellite markers in plants: features and applications. Trends Biotechnol 23:48–55
8. Zane L, Bargelloni L, Patarnello T (2002) Strategies for microsatellite isolation: a review. Mol Ecol 11:1–16
9. Squirrell J, Hollingsworth PM, Woodhead M et al (2003) How much effort is required to isolate nuclear microsatellites from plants? Mol Ecol 12:1339–1348
10. Mohan M, Nair S, Bhagwat A et al (1997) Genome mapping, molecular markers and marker-assisted selection in crop plants. Mol Breed 3:87–103
11. Weber JL, May PE (1989) Abundant class of human DNA polymorphisms which can be typed using the polymerase chain-reaction. Am J Hum Genet 44:388–396
12. Devos KM, Bryan GJ, Collins AJ et al (1995) Application of 2 microsatellite sequences in wheat storage proteins as molecular markers. Theor Appl Genet 90:247–252
13. Plaschke J, Ganal MW, Roder MS (1995) Detection of genetic diversity in closely-related bread wheat using microsatellite markers. Theor Appl Genet 91:1001–1007
14. Yang GP, Maroof MAS, Xu CG et al (1994) Comparative analysis of microsatellite DNA polymorphism in landraces and cultivars of rice. Mol Gen Genet 245:187–194
15. Akkaya MS, Bhagwat AA, Cregan PB (1992) Length polymorphisms of simple sequence repeat DNA in soybean. Genetics 132:1131–1139
16. Chung SM, Staub JE, Chen JF (2006) Molecular phylogeny of Cucumis species as revealed by consensus chloroplast SSR marker length and sequence variation. Genome 49:219–229
17. Tang S, Yu JK, Slabaugh MB et al (2002) Simple sequence repeat map of the sunflower genome. Theor Appl Genet 105:1124–1136
18. Hopkins MS, Casa AM, Wang T et al (1999) Discovery and characterization of polymorphic simple sequence repeats (SSRs) in peanut. Crop Sci 39:1243–1247
19. Goldstein DB, Roemer GW, Smith DA et al (1999) The use of microsatellite variation to infer population structure and demographic history in a natural model system. Genetics 151:797–801
20. Goldstein DB, Pollock DD (1997) Launching microsatellites: A review of mutation processes and methods of phylogenetic inference. J Hered 88:335–342
21. Barrier M, Friar E, Robichaux R et al (2000) Interspecific evolution in plant microsatellite structure. Gene 241:101–105
22. Zou J, Fu DH, Gong HH et al (2011) De novo genetic variation associated with retrotransposon activation, genomic rearrangements and trait variation in a recombinant inbred line population of Brassica napus derived from interspecific hybridization with Brassica rapa. Plant J 68:212–224
23. Zhou WC, Kolb FL, Bai GH et al (2003) Validation of a major QTL for scab resistance with SSR markers and use of marker-assisted selection in wheat. Plant Breed 122:40–46
24. Young ND (1999) A cautiously optimistic vision for marker-assisted breeding. Mol Breed 5:505–510
25. Collard BCY, Mackill DJ (2008) Marker-assisted selection: an approach for precision plant breeding in the twenty-first century. Philos T R Soc B 363:557–572
26. Varshney RK, Nayak SN, May GD et al (2009) Next-generation sequencing technologies and their implications for crop genetics and breeding. Trends Biotechnol 27:522–530
27. Maniatis T, Jeffrey A, Vandesande H (1975) Chain-length determination of small double-stranded and single-stranded DNA molecules by polyacrylamide gel electrophoresis. Biochemistry 14:3787–3794
28. Imelfort M, Edwards D (2009) De novo sequencing of plant genomes using second-generation technologies. Brief Bioinform 10:609–618
29. Mason AS, Nelson MN, Castello M-C et al (2011) Genotypic effects on the frequency of homoeologous and homologous recombination in Brassica napus × B. carinata hybrids. Theor Appl Genet 122:543–553
30. Nelson MN, Mason AS, Castello M-C et al (2009) Microspore culture preferentially selects unreduced (2n) gametes from an interspecific hybrid of Brassica napus L. × Brassica carinata Braun. Theor Appl Genet 119:497–505
31. Nicolas SD, Mignon GL, Eber F et al (2007) Homeologous recombination plays a major role in chromosome rearrangements that occur during meiosis of Brassica napus haploids. Genetics 175:487–503
Chapter 7
Genotyping Analysis Using an RFLP Assay

Shutao Dai and Yan Long
Abstract
RFLP (Restriction Fragment Length Polymorphism) is a commonly used technique for genotyping
nearly all organisms, including plants, animals, and humans. RFLP is widely used in genetic
and genomic research, such as genome mapping and gene identification. The technique involves DNA
digestion, gel electrophoresis, capillary transfer of DNA, and Southern hybridization. In this chapter, we
aim to give a detailed introduction to performing RFLP analysis for identifying genotypes.
Key words RFLP, Molecular marker, Genotyping analysis, Southern blotting
1 Introduction
RFLP (Restriction Fragment Length Polymorphism) is a technique
that exploits variations in homologous DNA sequences. An RFLP is a
difference between homologous DNA molecules that can be detected
as fragments of different lengths after DNA digestion with
specific restriction enzymes. In RFLP analysis, DNA samples are
cut into pieces by restriction enzyme(s) and the resulting
fragments are separated according to their lengths by agarose gel
electrophoresis. RFLP, as a molecular marker, was first used in
1975 to identify DNA sequence polymorphisms for genetic mapping
of a temperature-sensitive mutation of adenovirus serotypes
[1]. It was then used for human genome mapping [2] and plant
genomic research [3, 4].
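The principle can be illustrated with a small in-silico digest. The sketch below uses the EcoRI recognition site (GAATTC, cut after the first G) and two invented alleles that differ by a single base change abolishing one site; the sequences and the function name are hypothetical, and only exact site matches on a linear molecule are considered.

```python
def digest(sequence, site, cut_offset):
    """Fragment lengths after cutting a linear sequence at every `site`.

    `cut_offset` is the number of bases of the site that stay on the
    left-hand fragment (1 for EcoRI, which cuts G^AATTC).
    """
    sequence = sequence.upper()
    cuts = []
    position = sequence.find(site)
    while position != -1:
        cuts.append(position + cut_offset)
        position = sequence.find(site, position + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


ECORI_SITE = "GAATTC"

# Hypothetical alleles: allele_b carries a point mutation in the second site.
allele_a = "ACTG" * 25 + "GAATTC" + "ACTG" * 50 + "GAATTC" + "ACTG" * 30
allele_b = "ACTG" * 25 + "GAATTC" + "ACTG" * 50 + "GAGTTC" + "ACTG" * 30

fragments_a = digest(allele_a, ECORI_SITE, 1)
fragments_b = digest(allele_b, ECORI_SITE, 1)
print(fragments_a)
print(fragments_b)
```

Loss of the second site merges two fragments of allele A into one longer fragment in allele B. On a Southern blot, only those fragments that hybridize to the probe would be visible.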
RFLP analysis was the first DNA profiling technique and has been
widely used in genome mapping and variation analysis, such as
genetic fingerprinting [5, 6], construction of genetic maps [7–9],
identification of candidate gene locations for different traits [10–13],
hereditary disease diagnostics [14–16], and paternity tests [17–19].
In comparison with other techniques for genotyping, RFLPs have
a few advantages. Firstly, the RFLP loci are distributed across the
whole genome and the markers are relatively highly polymorphic.
Different restriction endonucleases can also be used for RFLPs.
Secondly, the RFLP markers are codominantly inherited and are
highly reproducible. Thirdly, the polymorphic loci identified by
RFLP are stably detected for different varieties regardless of
environmental influence and gene interaction. Because of these
characteristics, the method provides the opportunity to
simultaneously screen numerous samples. In addition, DNA blots
can be analyzed repeatedly with different RFLP probes by
stripping and reprobing.
The technique involves DNA digestion with restriction enzymes,
separation of the resulting fragments by agarose gel electrophoresis,
capillary transfer of the fragments to a membrane, and Southern
hybridization with a radioactively labeled DNA probe. These
processes are time consuming, involve expensive and
radioactive/toxic reagents, and require a large quantity of
high-quality genomic DNA. In recent years, these drawbacks have
limited the use of RFLPs.
2 Materials
Prepare all solutions using ultrapure water (prepared by purifying
deionized water to attain a resistivity of 18 MΩ cm at 25 °C) and
analytical grade reagents. Prepare and store all reagents at room
temperature (unless indicated otherwise). Diligently follow all waste
disposal regulations when disposing of waste materials.
2.1 Digesting DNA for Southern Blotting
Restriction enzyme(s) (see Note 1).
2.2 Southern Blotting onto Hybond-N+ Membrane
1. 1× TAE buffer: 40 mM Tris–acetate, 1 mM EDTA, pH 8.0. A
simple method of preparing 1× TAE buffer is to prepare a 50×
stock buffer (2 M Tris–acetate, 50 mM EDTA, pH 8.0): Weigh
242 g Tris and transfer to a 1-l graduated cylinder containing
about 100 ml of water. Mix and make up to 800 ml with water.
Add 57.1 ml of glacial acetic acid and 100 ml of 0.5 mol/l EDTA
(pH 8.0), mix and make up to 1 l with water. Dilute 20 ml of
50× stock buffer to 1 l with water.
2. 0.8 % agarose gel: Dissolve 2 g of agarose in 250 ml of 1× TAE
buffer, add 2.5 μl of 1 % ethidium bromide solution (see Note 2).
3. 6× loading buffer: 30 mM EDTA, 36 % (v/v) glycerol, 0.05 %
(w/v) bromophenol blue, 0.05 % (w/v) xylene cyanol FF,
pH 7.0. Weigh 4.4 g of EDTA, 250 mg of bromophenol blue,
and 250 mg of xylene cyanol FF and transfer to a glass beaker.
Add about 200 ml of water to the glass beaker. Heat and stir
until the powder is fully dissolved. Then add 180 ml glycerol
and adjust to pH 7.0 with NaOH. Make up to 500 ml with
water. Store at 4 °C.
4. DNA marker: λDNA/HindIII or λDNA/EcoRI.

5. 0.25 M HCl: 10.4 ml concentrated hydrochloric acid (12 N)
into 500 ml water.
6. 0.4 M NaOH: Dissolve 16 g of NaOH in 1 l water.
7. 20× SSC: 3.0 M NaCl, 0.3 M trisodium citrate, pH 7.0. Add
about 100 ml water to a 1-l graduated cylinder or a glass
beaker. Weigh 175.3 g of sodium chloride and 88.2 g of triso-
dium citrate and transfer to the cylinder. Add water to a volume
of 900 ml. Mix and adjust pH with HCl (see Note 3). Make up
to 1 l with water.
8. 2× SSC: Dilute 20× SSC ten times with water.
9. Hybond-N+ Membrane.
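The weights and dilution volumes in this list follow from two simple relationships: mass = molarity × volume × molar mass, and C1 × V1 = C2 × V2 for dilutions. The sketch below reproduces several of the figures above. The molar masses are standard values, and it is an assumption here that the trisodium citrate is the dihydrate salt, which is what the 88.2 g figure implies.

```python
# Molar masses in g/mol (trisodium citrate assumed to be the dihydrate).
MOLAR_MASS = {
    "NaCl": 58.44,
    "trisodium citrate dihydrate": 294.10,
    "NaOH": 40.00,
}


def grams_needed(molarity, volume_l, molar_mass):
    """Mass of solute (g) for a solution of given molarity and volume."""
    return molarity * volume_l * molar_mass


def stock_volume_ml(stock_conc, final_conc, final_volume_ml):
    """Volume of stock (ml) to dilute to `final_volume_ml` (C1V1 = C2V2).

    Concentrations may be in any unit (M, N, or x-fold), as long as the
    stock and final concentrations use the same one.
    """
    return final_conc * final_volume_ml / stock_conc


# 20x SSC, 1 l: 3.0 M NaCl and 0.3 M trisodium citrate.
print(round(grams_needed(3.0, 1.0, MOLAR_MASS["NaCl"]), 1))
print(round(grams_needed(0.3, 1.0, MOLAR_MASS["trisodium citrate dihydrate"]), 1))
# 0.4 M NaOH, 1 l.
print(round(grams_needed(0.4, 1.0, MOLAR_MASS["NaOH"]), 1))
# 0.25 M HCl from 12 N concentrated acid, about 500 ml.
print(round(stock_volume_ml(12.0, 0.25, 500.0), 1))
# 1x TAE from 50x stock, 1 l; 2x SSC from 20x SSC, 1 l.
print(stock_volume_ml(50.0, 1.0, 1000.0), stock_volume_ml(20.0, 2.0, 1000.0))
```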
2.3 Radioactively Labeling Probe
1. Probe DNA fragment or plasmid (see Note 4).
2. Random primers d(N)6 (50 ng/μl).
3. 10× dCTP mixture containing unlabeled dGTP, dATP, and
dTTP each at a concentration of 0.2 mM.
4. Klenow Buffer and Klenow Fragment enzyme (5 U/μl).
5. [α-32P]dCTP (10 mCi/ml, sp. act. >3,000 Ci/mmol).
6. Denaturing solution: 80 mM EDTA or hybridization solution.
2.4 Hybridization with Radioactively Labeled Probe
1. Dextran sulphate (Sigma D-6001).
2. 50× Denhardt's solution: 1 % Ficoll 400, 1 % polyvinylpyrrolidone
(PVP), 1 % BSA. Dissolve 1 g of Ficoll 400, 1 g of PVP, and
1 g of BSA in 100 ml of water and filter-sterilize using a
0.45-μm membrane filter, store at −20 °C.
3. 20× SSPE: 3.0 M NaCl, 0.2 M NaH2PO4, 0.02 M EDTA,
pH 7.4. Weigh 175.3 g of sodium chloride and 27.6 g of
sodium dihydrogen phosphate monohydrate and transfer to a
1-l graduated cylinder containing about 100 ml of water. Mix
and make up to 800 ml with water. Add 40 ml of 0.5 mol/l
EDTA, mix and adjust pH to 7.4 with NaOH. Make up to 1 l
with water.
4. 10 % (w/v) SDS.
5. Salmon sperm DNA or Herring Testes DNA (10 mg/ml).
6. Wash A: 2× SSC, 0.1 % SDS.
7. Wash B: 0.2× SSC, 0.1 % SDS.
2.5 Stripping and Reprobing
1. Wash buffer 1: 0.1× SSC, 0.1 % SDS.
2. Wash buffer 2: 0.1 M NaOH, 0.2 % SDS.
3. Wash buffer 3: 0.2 M Tris, 0.1× SSC, 0.2 % SDS.
3 Methods
Carry out all procedures at room temperature unless otherwise
specified.
3.1 Digesting DNA for Southern Blotting
1. Measure the concentration of DNA samples with a fluorometer
and adjust them to a concentration of 0.6–1.5 μg/μl.
2. Mix 15 μg of genomic DNA, 3 μl of 10× buffer, 3 μl of
restriction enzyme (10 U/μl), and autoclaved ddH2O to a total
of 35 μl for each sample (DNA and water together make up
29 μl). Mix thoroughly by tapping the tubes several times,
followed by very brief spinning. Digest at the appropriate
temperature for 8–12 h.
3. After completing the digestion, store at 4 °C, and check the
quality of the digestion with 3.5 μl (1.5 μg) of the digested DNA
on an agarose gel. Good digestion should show an even distri-
bution and clear smear (Fig. 1).
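The digestion volumes can be worked out per sample from the measured DNA concentration. The sketch below assumes that the DNA solution and the water together make up 29 μl of the 35 μl reaction, which is an interpretation of step 2 rather than something stated explicitly; the function name and dictionary keys are illustrative.

```python
def digest_setup(dna_conc_ug_per_ul, dna_ug=15.0, dna_plus_water_ul=29.0):
    """Volumes (ul) for one 35 ul restriction digest.

    Assumes DNA solution plus water total `dna_plus_water_ul`, with
    3 ul of 10x buffer and 3 ul of enzyme (10 U/ul) added on top.
    """
    dna_ul = dna_ug / dna_conc_ug_per_ul
    if dna_ul > dna_plus_water_ul:
        raise ValueError("DNA too dilute to fit 15 ug into the reaction")
    return {
        "DNA": round(dna_ul, 1),
        "water": round(dna_plus_water_ul - dna_ul, 1),
        "10x buffer": 3.0,
        "enzyme (10 U/ul)": 3.0,
    }


for conc in (0.6, 1.0, 1.5):
    setup = digest_setup(conc)
    print(conc, setup, sum(setup.values()))

# DNA content of the 3.5 ul aliquot used to check the digestion (step 3).
aliquot_ug = 15.0 * 3.5 / 35.0
print(aliquot_ug)
```

At the lower end of the recommended concentration range (0.6 μg/μl) the DNA alone takes up 25 μl, which leaves only 4 μl for water; more dilute samples would not fit into the reaction under this assumption.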
3.2 Southern Blotting onto Hybond-N+ Membrane
1. Transfer the appropriate amount of digested DNA to a fresh
microfuge tube. Add 0.15 volume of 6× loading buffer.
Electrophoresis should be carried out in a 0.8 % agarose gel
(10 cm × 20 cm) with 1× TAE at a low voltage (below about 1 V/cm)
for approximately 15 h (see Note 5).
2. Carefully remove the gel frame from the electrophoresis appara-
tus, slide the gel into a glass baking dish containing 400 ml of
0.25 M HCl, then shake gently to depurinate the DNA until
the bromophenol blue turns yellow (about 15 min) (see Note 6).
3. Pour off the depurination solution and rinse the gel with
distilled water.
4. Add 500 ml of 0.4 M NaOH to the glass baking dish. Shake
gently for about 30 min to denature the DNA until the
bromophenol blue changes back to blue again.
Fig. 1 Restricted DNA fragments separated by agarose gel electrophoresis and visualized by UV light
5. Turn the gel upside down, then cut it to the desired size
(usually 19.5 cm × 10 cm), and cut off the upper right-hand
corner of the gel to mark the direction of the electrophoresis.
6. Place a glass plate on a large plastic tray, add enough 0.4 M
NaOH to saturate two sheets of filter papers (22 cm × 30 cm)
with the transfer solution (0.4 M NaOH) and place on the
glass plate to create a wick. Roll out all air bubbles from the
wick with a glass rod.
7. Carefully transfer the gel onto the wick and take care to remove
air bubbles beneath it (see Note 7).
8. Wet the Hybond-N+ membrane (19.5 cm × 10 cm) with the
transfer solution and cut off a corner. Pipette a small volume of
the solution onto the gel, and place the membrane on it,
making sure the position of the cut corner is identical to that
of the gel. Then roll out all air bubbles (see Note 8).
9. Wet two pieces of filter paper (20 cm × 10.5 cm) in the transfer
solution and place them on the top of the membrane and roll
out all air bubbles.
10. Surround the gel with plastic wrap or parafilm and place a stack
of paper towels (20 cm × 10.5 cm) on the filter paper above the
gel (see Note 9). Put a glass plate on the top of these towels
and place a 500-g weight on the glass plate.
11. Allow the transfer of DNA in the capillary transfer system to
proceed for 20–30 h (Fig. 2) (see Note 10).
12. Carefully remove the glass plate, paper towels and filter paper,
and peel the gel from the membrane (see Note 11). Mark the
position of the gel slots on the membrane with a soft lead pencil,
and transfer the membrane into 2× SSC to wash briefly so as to
remove bits of gel or particles.
13. Air dry the membrane on a sheet of filter paper, then sandwich
it between two pieces of filter paper and bake at 80–100 °C
for 2–4 h. Write a brief description on one corner of the
membrane with a pencil, then store at 4 °C or in 2× SSC for
hybridization.
Fig. 2 Capillary system for transferring DNA to membrane
3.3 Radioactively Labeling Probe
1. Set up the following reaction for one tube with 1–2 membranes:
2–3 μl (50–150 ng) of probe DNA, 1.5 μl of random primers
d(N)6, 1 μl of marker DNA, 2 μl of 10× dCTP mixture, 0.5 μl
of Klenow Fragment enzyme, 2 μl of Klenow Buffer, and
autoclaved ddH2O to a volume of 20 μl.
2. Add probe DNA, random Primers, and marker DNA into a
0.5-ml eppendorf tube, heat in a boiling water bath for 5 min,
then quickly cool in wet ice for 5 min, spin briefly and quickly
place back on ice and go to the next stage.
3. Add the appropriate volume of ddH2O, 10× dCTP mixture, Klenow Buffer,
and Klenow Fragment enzyme into the same tube in wet ice,
then mix quickly by tapping the tube and spinning briefly.
4. Take the tube into the hybridization room and add 1–2 μl of
[α-32P]dCTP to the tube, mix thoroughly with a pipette.
5. Leave at room temperature (30 °C) for about 2–4 h.
6. Add 400 μl denaturation solution to each reaction tube, then
put in a boiling water bath for 4 min.
7. Cool in ice until use.
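The water volume in step 1 depends on how much probe DNA is used. The sketch below computes it, together with the enzyme units and the radioactivity added. It assumes that the 20 μl excludes the [α-32P]dCTP added in step 4, which the protocol does not state. The names are illustrative.

```python
REACTION_UL = 20.0

# Fixed component volumes (ul) from step 1.
FIXED_COMPONENTS_UL = {
    "random primers d(N)6": 1.5,
    "marker DNA": 1.0,
    "10x dCTP mixture": 2.0,
    "Klenow Fragment enzyme": 0.5,
    "Klenow Buffer": 2.0,
}


def water_to_add_ul(probe_ul):
    """ddH2O (ul) needed to bring the reaction to 20 ul."""
    return REACTION_UL - probe_ul - sum(FIXED_COMPONENTS_UL.values())


def klenow_units(enzyme_ul=0.5, units_per_ul=5.0):
    """Enzyme units in the reaction (stock is 5 U/ul)."""
    return enzyme_ul * units_per_ul


def label_activity_uci(label_ul, stock_mci_per_ml=10.0):
    """Activity added (uCi); 1 mCi/ml is the same as 1 uCi/ul."""
    return label_ul * stock_mci_per_ml


for probe_ul in (2.0, 3.0):
    print(f"{probe_ul} ul probe -> {water_to_add_ul(probe_ul)} ul ddH2O")
print(f"Klenow: {klenow_units()} U")
print(f"Label: {label_activity_uci(1.0)}-{label_activity_uci(2.0)} uCi")
```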
3.4 Hybridization with Radioactively Labeled Probe
1. Prepare the following prehybridization solution for one tube
with 1–2 membranes: 3 g of dextran sulphate, 17.55 ml of
ddH2O, 6 ml of 20× SSPE, 6 ml of 50× Denhardts, 0.3 ml
of 10 % SDS, 150 μl of Salmon sperm DNA (10 mg/ml) to a
total of 30 ml.
2. Gently dissolve the dextran sulphate in ddH2O (prewarmed at
65 °C) at room temperature, then add 20× SSPE, 50× Denhardts,
and 10 % SDS. After mixing well, heat to 65 °C in a water bath.
3. Add 150 μl of freshly boiled 10 mg/ml Salmon sperm DNA or
Herring testes DNA to the above solution and mix well, keeping
it at 65 °C.
4. Roll the membrane (stored in 2× SSC) into a hybridization
tube, add 30 ml of 2× SSC, and prewarm in the hybridization
oven for about 10–20 min at 65 °C. Then remove the 2× SSC,
add 30 ml of the prehybridization solution to the tube, and
eliminate air bubbles.
5. Place the tube inside a prewarmed hybridization oven to
prehybridize for approximately 8 h at 65 °C.
6. Prepare the hybridization solution in the same way as the
prehybridization solution during prehybridization, and keep it
at 65 °C: 0.6 g of dextran sulphate, 3.5 ml of ddH2O, 1.2 ml
of 20× SSPE, 1.2 ml of 50× Denhardts, 60 μl of 10 % SDS,
30 μl of Salmon sperm DNA (10 mg/ml) to a total of 6 ml.
Fig. 3 A photograph developed with X-ray film. The blots represent the restriction fragments
7. Remove the prehybridization tube from the oven and dispose
of the prehybridization solution.
8. Pipette 6 ml of the hybridization solution and then quickly add
the denatured probe to the tube. Place it back in the oven to
hybridize for more than 12 h at 65 °C.
9. When hybridization is complete, pour off the hybridization
solution into a container suitable for disposal of radioactivity,
then add a small amount of wash buffer A (prewarmed at
65 °C) and rinse the membrane quickly, then add a further
30 ml of wash buffer A to the tube in the oven at 65 °C for
15 min twice.
10. Repeat with 30 ml of wash buffer B in the tube to wash once,
then remove the membrane from the tube into a large plastic
tray. Add 500 ml of wash buffer B to wash twice in a 65 °C
shaker. During this, measure the radioactivity until the counts
are about 500–2,000 cpm (see Note 12).
11. After washing, the membrane is wrapped in cling film and
exposed to X-ray film with an intensifying screen for 4–16 days
at −70 °C or a phosphorimager plate for 1–6 h at room tem-
perature depending on the counts given (see Note 13).
12. Develop photographs (Fig. 3) with X-ray film in a darkroom
or produce images with a phosphorimager plate in cyclone
storage phosphor system.
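The 6 ml hybridization solution in step 6 is the 30 ml prehybridization recipe of step 1 scaled down fivefold. The sketch below reproduces the scaling and derives the working concentrations. These concentrations are calculated here and are not quoted in the protocol; the function names are illustrative.

```python
# Prehybridization recipe for 30 ml (step 1).
PREHYB_30_ML = {
    "dextran sulphate (g)": 3.0,
    "ddH2O (ml)": 17.55,
    "20x SSPE (ml)": 6.0,
    "50x Denhardts (ml)": 6.0,
    "10 % SDS (ml)": 0.3,
    "salmon sperm DNA, 10 mg/ml (ml)": 0.15,
}


def scale_recipe(recipe, base_ml, target_ml):
    """Scale every component of a recipe to a new final volume."""
    factor = target_ml / base_ml
    return {name: amount * factor for name, amount in recipe.items()}


def final_strength(stock_strength, stock_ml, total_ml):
    """Working strength after dilution, in the same units as the stock."""
    return stock_strength * stock_ml / total_ml


hyb_6_ml = scale_recipe(PREHYB_30_ML, 30.0, 6.0)
for name, amount in hyb_6_ml.items():
    print(f"{name}: {amount:.2f}")

print("SSPE:", final_strength(20.0, 6.0, 30.0), "x")
print("Denhardts:", final_strength(50.0, 6.0, 30.0), "x")
print("SDS:", final_strength(10.0, 0.3, 30.0), "%")
print("Carrier DNA:", final_strength(10.0, 0.15, 30.0), "mg/ml")
```

The scaled water volume comes out at 3.51 ml, which step 6 rounds to 3.5 ml.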
3.5 Stripping and Reprobing
Remove the probe and prepare the membrane for hybridization again.
Wash the membrane as follows (see Note 14): wash buffer 1 for
10 min, wash buffer 2 for several minutes (see Note 15), wash
buffer 3 for 20 min. After washing, put the membrane into
2× SSC until use.
4 Notes
1. Complete digestion of genomic DNA is crucial but difficult.
A stable and high-performance restriction enzyme should be selected
to digest genomic DNA. EcoRI and HindIII are excellent at
cutting genomic DNA. However, they may not be the highest
performers in RFLP detection. TaqI, MspI, and XbaI have
good performances for the identification of RFLPs.
2. If the agarose gel does not contain ethidium bromide, stain
the gel with ethidium bromide after electrophoresis is
completed so that it can be photographed.
3. Do not wait until the powder has dissolved completely before
adjusting the pH: the powder is difficult to dissolve before the
pH is adjusted, but dissolves easily at pH 7.0.
4. Probe DNA can be a DNA fragment from a PCR product or from a
restriction-digested plasmid, or it can be a whole plasmid. A DNA
fragment must be separated by agarose gel electrophoresis and
purified from the gel.
5. Bromophenol blue is an indicator. Stop electrophoresis when
the bromophenol blue reaches near 3/4 of the length of the
agarose gel.
6. Bromophenol blue is yellow in acidic conditions. Depurination
is finished when the bromophenol blue turns yellow. Don’t
submerge the gel in depurination solution too long after the
color has changed.
7. Make sure that the reverse side of the gel faces up.
8. Always handle with clean gloves and blunt-ended forceps. Do not
adjust the membrane once it is placed on the gel.
9. Make sure that the plastic wrap or parafilm can prevent the
paper towels from coming in contact with the filter paper
below the gel.
10. Replace the paper towels as they become wet. Try to prevent
the entire stack of paper towels from becoming wet.
11. The gel can be stained in a solution of ethidium bromide
and visualized on a UV transilluminator to check the DNA
transfer.
12. During the washing process, periodically monitor the amount
of radioactivity of the membrane using a Geiger counter.
13. Make sure that one side of the membrane carrying the digested
DNA faces the X-ray film or the phosphorimager plate.
14. Do not allow the membrane to dry at any time prior to remov-
ing the probe, as drying will cause the probe DNA to bind
irreversibly.
15. Periodically check the radioactivity using a Geiger counter.
Don’t rinse the membrane in wash solution too long.
References
1. Grodzicker T, Williams J, Sharp P, Sambrook J (1975) Physical mapping of temperature sensitive mutants of adenovirus. Cold Spring Harbor Symp Quant Biol 39:439–446
2. Botstein D, White R, Skolnick M, Davis R (1980) Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 32:314–331
3. Helentjaris T, Slocum M, Wright S, Schaefer A, Nienhuis J (1986) Construction of genetic linkage maps in maize and tomato using restriction fragment length polymorphisms. Theor Appl Genet 72:761–769
4. Weber D, Helentjaris T (1989) Mapping RFLP loci in maize using B-A translocations. Genetics 121:583–590
5. Ali S, Müller CR, Epplen JT (1986) DNA finger printing by oligonucleotide probes specific for simple repeats. Hum Genet 74:239–243
6. Becker J, Vos P, Kuiper M, Salamini F, Heun M (1995) Combined mapping of AFLP and RFLP markers in barley. Mol Gen Genet 249:65–73
7. Bradshaw HD, Villar M, Watson BD, Otto KG, Stewart S, Stettler RF (1994) Molecular genetics of growth and development in Populus. III. A genetic linkage map of a hybrid poplar composed of RFLP, STS, and RAPD markers. Theor Appl Genet 89:167–178
8. Cregan PB, Jarvik T, Bush AL, Shoemaker RC, Lark KG, Kahler AL, Kaya N, VanToai TT, Lohnes DG, Chung J, Specht JE (1999) An integrated genetic linkage map of the soybean genome. Crop Sci 39:1464–1490
9. Lespinasse D, Rodier-Goud M, Grivet L, Leconte A, Legnate H, Seguin M (2000) A saturated genetic linkage map of rubber tree (Hevea spp.) based on RFLP, AFLP, microsatellite, and isozyme markers. Theor Appl Genet 100:127–138
10. Amer IMB, Worland AJ, Korzun V, Börner A (1997) Genetic mapping of QTL controlling tissue-culture response on chromosome 2B of wheat (Triticum aestivum L.) in relation to major genes and RFLP markers. Theor Appl Genet 94:1047–1052
11. Lark KG, Orf J, Mansur LM (1994) Epistatic expression of quantitative trait loci (QTL) in soybean [Glycine max (L.) Merr.] determined by QTL association with RFLP alleles. Theor Appl Genet 88:486–489
12. Waldron BL, Moreno-Sevilla B, Anderson JA, Stack RW, Frohberg RC (1999) RFLP mapping of QTL for Fusarium head blight resistance in wheat. Crop Sci 39:805–811
13. Zimnoch-Guzowska E, Marczewski W, Lebecka R, Flis B, Schäfer-Pregl R, Salamini F, Gebhardt C (2000) QTL analysis of new sources of resistance to Erwinia carotovora ssp. atroseptica in potato done by AFLP, RFLP, and resistance-gene-like markers. Crop Sci 40:1156–1167
14. Chartier-Harlin M-C, Parfitt M, Legrain S, Pérez-Tur J et al (1994) Apolipoprotein E, epsilon 4 allele as a major risk factor for sporadic early and late-onset forms of Alzheimer’s disease: analysis of the 19q13.2 chromosomal region. Hum Mol Genet 3:569–574
15. Inoue N, Kawashima S, Kanazawa K, Yamada S, Akita H, Yokoyama M (1998) Polymorphism of the NADH/NADPH oxidase p22 phox gene in patients with coronary artery disease. Circulation 97:135–137
16. Shindo Y, Inoko H, Yamamoto T, Ohno S (1994) HLA-DRB1 typing of Vogt-Koyanagi-Harada's disease by PCR-RFLP and the strong association with DRB1*0405 and DRB1*0410. Br J Ophthalmol 78:223–226
17. Allen RW, Bliss B, Pearson A (1989) Characteristics of a DNA probe (pa3′HVR) when used for paternity testing. Transfusion 29:477–485
18. Morling N, Hansen HE (1993) Paternity testing with VNTR DNA systems. Int J Leg Med 105:189–196
19. Smith JC, Newton CR, Alves A, Anwar R, Jenner D, Markham AF (1990) Highly polymorphic minisatellite DNA probes. Further evaluation for individual identification and paternity testing. J Forensic Sci Soc 30:3–18
Chapter 8
DNA Barcoding for Plants

Natasha de Vere, Tim C.G. Rich, Sarah A. Trinder,
and Charlotte Long
Abstract
DNA barcoding uses specific regions of DNA in order to identify species. Initiatives are taking place around
the world to generate DNA barcodes for all groups of living organisms and to make these data publicly
available in order to help understand, conserve, and utilize the world’s biodiversity. For land plants the core
DNA barcode markers are portions of two coding regions within the chloroplast genome, the genes rbcL and
matK. In order to create high quality databases, each plant that is DNA barcoded needs to have a herbarium
voucher that accompanies the rbcL and matK DNA sequences. The quality of the DNA sequences,
the primers used, and trace files should also be accessible to users of the data. Multiple individuals
should be DNA barcoded for each species in order to check for errors and allow for intraspecific variation.
The world’s herbaria provide a rich resource of already preserved and identified material, which can be
used for DNA barcoding alongside fresh samples collected from the wild. These protocols describe the
whole DNA barcoding process, from the collection of plant material from the wild or from the herbarium,
through the extraction and amplification of DNA, to checking the quality of the data after sequencing.
Key words DNA barcoding, Plants, rbcL, matK, Species identification, Plant collection, Herbarium
specimens, DNA extraction, PCR
1 Introduction
DNA barcoding is a tool for species identification that uses internationally
agreed protocols and regions of DNA to create a global
database of living organisms [1]. International initiatives are taking
place across hundreds of countries to DNA barcode the world’s
biodiversity and make these data publicly available to all users [2].
The importance of plant DNA barcoding is underpinned by the
need for accurate species identification to both conserve and utilize
plants. In many parts of the world, however, this may be hindered
by a lack of taxonomic expertise [3]. As well as identifying whole
plants, it is also sometimes useful to be able to identify species
from material such as roots, seeds, and pollen; or in mixtures of
plants sampled from the air, soil, or water [4]. DNA barcoding of
plants is already being employed in a wide variety of applications.
For instance, barcoding approaches have been used for the verifica-
tion of plant products ranging from medicinal plants [5, 6] to
kitchen spices [7], berries [8], olive oil [9], and tea [10]. Ecological
applications have included the identification of invasive species
[11–13], characterization of below-ground plant diversity using
roots [14], and reconstruction of past vegetation and climate from
plant remains in the soil [15]. Genetic sequences obtained in the con-
text of DNA barcoding have also been used to create phylogenetic
trees for use in phylogenetic community ecology [16, 17].
These applications all depend on using regions of DNA that are
able to discriminate between species without being too variable within a
species. Following the evaluation of several candidate markers, the
Plant Working Group (PWG) of the Consortium for the Barcoding of
Life (CBOL) recommended that regions of two plastid genes, rbcL
and matK, be adopted as the standard plant DNA barcodes, with the
recognition that supplementary markers may be required [18, 19].
The use of DNA barcoding as an identification tool is also dependent
on the creation of high-quality reference databases of sequences [19].
Essential to a database is that every DNA sequence should be associ-
ated with the plant specimen from which it came; along with when,
where, and by whom it was collected and identified. This is best done
through the creation of a herbarium voucher alongside each DNA
sample, although sometimes for rare and threatened species a photo-
graph may provide a substitute [4]. The lab procedures through
which a sample is processed should also be recorded, with the primers
used, trace files and quality statistics for its DNA sequence all available
to end users of the data [4]. All data should be publicly available;
GenBank provides a repository for DNA sequences but in addition it
is recommended to deposit data in the Barcode of Life Data System
(BOLD) [20]. BOLD provides a means of managing projects and
allows trace files, scans of herbarium specimens, and photographs to
be stored alongside DNA sequences [20].
With estimates suggesting that there may be around 380,000
land plant species in the world, composed of around 352,000
angiosperms, 1,300 gymnosperms, and 13,000 bryophytes, ferns,
and fern allies, DNA barcoding must use existing resources and
expertise efficiently in order to make an effective contribution to
cataloguing such huge diversity [21]. The herbaria of the world
provide a vast and important source of plant material that is already
identified and preserved, capturing years of taxonomic expertise [4].
Extracting DNA from herbarium specimens, however, can be more
problematic than collecting new samples from the wild.
The usability of samples for DNA extraction in different herbaria
varies according to the conditions in which specimens have been
stored and how they were originally preserved. Newer material
works better than older and certain taxonomic groups work less
well than others [4, 22]. For large DNA barcode campaigns we
recommend a combined approach that uses herbarium specimens
to quickly bulk up the number of samples available for DNA
barcoding and then fills in the gaps with new collections of
material for species that work less well from the herbarium.
All DNA barcodes must undergo rigorous checks after
sequencing, in order to identify incorrect sequences and check the
accuracy and quality of base calls. The sequencing of multiple indi-
viduals per species is of critical importance as it allows comparisons
between the sequences to be made. Each sequence should be man-
ually edited and checked by examining the placement of samples
within phylogenetic trees. This stage is particularly important for
plant species because the markers used, in particular rbcL, are often
quite similar between closely related species [4, 19].
Although the same regions of DNA should be used for DNA
barcoding, there is a wide range of protocols and approaches that
have been used to generate plant DNA barcodes [21]. We describe
here the whole DNA barcoding process; from the collection of
specimens in the field or herbarium, their processing in the lab, and
manual editing and checking post sequencing. The protocols we
present are medium to high throughput, using a 96-well format
but without the use of robots. We have concentrated on the core
DNA barcode loci of rbcL and matK but the protocols presented
can be readily adapted to other markers as required.
2 Materials
2.1 Field Collecting of Plant Samples
1. Self-indicating silica gel (see Note 1).
2. Specimen collection envelopes (see Note 2).
3. Herbarium voucher collection bags (see Note 3).
4. Permanent marker pen and pencil.
5. Jewelry tags.
6. Air-tight sealable box.
7. Tablespoon.
8. Field notebook or laptop (see Note 4).
9. Field press (see Note 5).
10. Flimsies (see Note 6).
11. Drying paper (see Note 7).
12. Scissors.
13. 70 % ethanol.
14. Camera.
15. GPS.
2.2 Preparation of Herbarium Specimens
1. Flimsies.
2. Drying paper.
3. Corrugates (see Note 8).
4. Field press.
5. Drying oven that can provide a steady flow of warm air at
35–45 °C.
6. Acid-free herbarium mounting paper.
7. Herbarium labels.
8. Gummed linen strips.
9. Archival quality PVA glue.
10. Freezer.
11. Insect-proof herbarium cupboards.
2.3 Collecting Samples for DNA Extraction from Herbarium Specimens
1. Laptop.
2. Specimen labels.
3. Plastic ziplock bags for storing herbarium material for DNA
extraction.
4. Forceps.
5. 70 % ethanol.
6. Collection list.
7. A3 scanner.
2.4 DNA Extraction of Herbarium Samples in 96-Well Format
1. Commercial extraction kit, e.g., Qiagen DNeasy 96 Plant Kit.
2. Molecular biology grade 100 % ethanol.
3. Mill for tissue grinding, e.g., Qiagen TissueLyser with 96-well
plate adaptor.
4. 3 mm tungsten carbide milling beads.
5. Centrifuge with deep bucket rotor for 96-well plates, capable
of achieving 6,000 × g.
6. Pipettes; multichannel and single with associated tips.
7. Measuring cylinders and buffer reservoirs.
8. Burner for flaming.
9. Forceps.
10. Water bath set at 65 °C.
11. Proteinase K (see Note 9).
12. DTT (see Note 10).
13. Fridge and freezer.
2.5 PCR Amplification
1. Taq polymerase (see Note 11).
2. Forward and reverse primers.
3. BSA (see Note 12).
4. Molecular biology grade water.
5. DNA.
6. Sterile PCR tubes or 96-well PCR plates.

7. Heat-sealing PCR film.
8. Thermocycler with capacity for 96-well plates.
2.6 Gel Electrophoresis
1. Agarose gel.
2. 1× TAE buffer (can be purchased already made up as 10× TAE).
3. SYBR safe dye (see Note 13).
4. Loading buffer.
5. Size standard.
6. Electrophoresis tank.
7. Gel support and combs.
8. Masking tape.
9. Conical flask.
10. Microwave.
11. Powerpack.
12. UV gel imaging system.
13. Amplified DNA.
14. Pipette and pipette tips.
3 Methods
Throughout specimen collection and processing in the field,
herbarium, and laboratory, procedures must be put in place to
ensure contamination is avoided. In the field, forceps and scissors
that are used for collecting can be cleaned in 70 % ethanol before
and between uses. In the laboratory, 70 % ethanol, DNA
AWAY, ELIMINase, or a similar product should be used for regular
cleaning. Lab coats and nitrile or latex gloves should be worn at all
times. Pipettes, tips, and other consumables should be autoclaved
before use or purchased in sterile packs.
3.1 Naming and Locating Specimens
1. It is essential to correctly identify and consistently name samples
to be used for DNA barcoding. Use a standard reference
guide for plant names; generally the standard flora for that area
or accepted monograph for taxonomic sampling. Take care to
avoid the use of synonyms and alternative spellings, especially
if multiple collectors are involved with field sampling. Be careful
with species names that are double-barreled, as these can often
become truncated during data processing. Having taxonomic
expertise available at the collection stage greatly improves the
accuracy and efficiency of the whole DNA barcoding process.
2. Before each collecting trip prepare a list of target species and
areas to be visited. Plan routes to ensure sampling is efficient.
Regional floras, online databases, and local recorders can all
help to locate target species.
3. The collection of multiple samples per species allows errors
to be spotted and any intraspecific variation to be identified.
A minimum of three samples is recommended for groups that
are taxonomically well known, and 5–10 for groups whose
taxonomy is less well characterized. Multiple samples for each
species should be collected from throughout the geographic
area of focus.
4. Before collecting, ensure that all necessary permissions and
permits have been sought and that copies of paperwork are
available for inspection.
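The naming and sampling checks in steps 1 and 3 are easy to automate once samples are listed in a spreadsheet. The sketch below flags names that do not match a reference checklist exactly (which catches truncated double-barreled names and alternative spellings) and species with fewer than the recommended minimum of samples. The checklist, the sample list, and the function names are hypothetical.

```python
from collections import Counter


def names_not_in_checklist(sample_names, checklist):
    """Names that fail an exact match against the reference checklist."""
    accepted = set(checklist)
    return sorted({name for name in sample_names if name not in accepted})


def undersampled(sample_names, minimum=3):
    """Species with fewer samples than the recommended minimum."""
    counts = Counter(sample_names)
    return {name: n for name, n in counts.items() if n < minimum}


# Hypothetical checklist and sample list for illustration only.
checklist = ["Capsella bursa-pastoris", "Bellis perennis"]
samples = [
    "Capsella bursa-pastoris",
    "Capsella bursa",  # truncated double-barreled epithet
    "Bellis perennis",
    "Bellis perennis",
    "Bellis perennis",
]

print(names_not_in_checklist(samples, checklist))
print(undersampled(samples))
```

For groups whose taxonomy is less well characterized, the same check can be run with `minimum=5`.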
3.2 Field Collecting of Plant Samples (See Note 14)
1. Before going into the field prepare the target list of species and
areas to be covered.
2. Ensure silica gel is fully dehydrated and place a tablespoon of gel
into each specimen bag. Store these in an airtight bag or con-
tainer to keep them dry; these are best prepared before going
into the field.
3. Locate target specimen. Record the species name (if known)
and assign it a collection number. This collection number will
link the DNA sample to its herbarium voucher and collection
information.
4. As a minimum, record the date, the collector’s name(s), the location
(to within at least 100 m, or as latitude and longitude from a GPS),
and the locality name following the spelling on the map. In addition
you may want to collect other information, for example,
life stage, habitat, associated species, topography, and aspect.
5. Collect a sample of the plant that includes all of the features
required for its identification. Often this means that flowering
material is required for angiosperms and spore-bearing fronds
for ferns. Root material may also be necessary for some species.
It is important to familiarize yourself with the key taxonomic
features of a species before collecting.
6. Label a jewelry tag with the collection code and attach it to
the plant stem or other part where it will not become easily
detached.
7. It is best to put tissue samples into silica gel immediately in the
field to dry them quickly to maximize DNA quality, especially
in hot or dry climates. Select a green, undamaged leaf from the
specimen for DNA collection and remove with scissors or by
hand. Cut or tear three 0.5 cm × 0.5 cm sections from the leaf
and place into a sample bag containing one tablespoon of dry
silica gel. Give the bag a quick shake to bury the leaf pieces,
fold the bag closed, and label it with the collection code and,
if known, the species name. Place into an air-tight container or
bag separate from the unused silica gel bags (see Note 15).
8. Place the rest of the plant in a flimsy between drying papers in
a field press or, for bryophytes, within a paper wallet, or retain
in a plastic bag or vasculum for pressing later (see Note 16).
A single plant individual must be used for the herbarium
voucher and the DNA sample. This will act as a voucher for the
DNA barcode so that it is always possible to go back to the
original individual to check its identity.
9. For samples whose species threat status may prevent the col-
lection of a herbarium voucher, photographs can act as a
substitute, but this is not recommended as routine practice.
If photographs are to be used, they must capture the diagnostic
features of the species and show the collecting code in the pic-
tures. For consistency of reference in the herbarium, pictures
can be mounted like specimens.
10. Dip scissors into 70 % ethanol between sampling to avoid con-
tamination and also to reduce the risk of spreading infections
between plants.
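The minimum field record described in steps 3 and 4 can be captured as a small data structure keyed on the collection number, which links the DNA sample, the herbarium voucher, and the collection information. The sketch below is one possible layout; the field names and the example values are hypothetical and do not follow any particular database schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class CollectionRecord:
    collection_number: str        # links DNA sample and voucher
    collection_date: date
    collectors: Tuple[str, ...]
    locality: str                 # spelled as on the map
    latitude: float               # decimal degrees from GPS
    longitude: float
    species: Optional[str] = None  # may be unknown in the field
    has_voucher: bool = True       # False if only photographs exist
    notes: str = ""                # habitat, life stage, aspect, etc.

    def __post_init__(self):
        if not self.collection_number:
            raise ValueError("collection number is required")
        if not self.collectors:
            raise ValueError("at least one collector is required")
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude out of range")


# Hypothetical example record.
record = CollectionRecord(
    collection_number="NDV-0001",
    collection_date=date(2013, 6, 14),
    collectors=("A. Collector",),
    locality="Example Hill",
    latitude=51.84,
    longitude=-4.15,
    species="Bellis perennis",
)
print(record.collection_number, record.species)
```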
3.3 Preparation of Herbarium Specimens
1. On returning from the field, interleave the flimsies containing
plant samples with drying papers and corrugates. Place within
a press and dry within a lukewarm drying oven or in a well-
ventilated area taking care not to overheat.
2. After 24 h, the plant should be easier to work with and can be
arranged to illustrate all of its key taxonomic features if this has
not been done when first pressed. For example, manipulate
specimens so that both upper and lower surfaces of leaves are
shown and flowers or spore structures are clearly displayed.
3. Change the drying papers regularly, normally once a day, so that
the sample dries quickly; this maximizes the chances of getting
further DNA out of the samples if subsequently required. Dry
damp papers before reusing.
4. Once the sample is dried, mount it on to a herbarium sheet
using small gummed linen strips. Use the minimum number of
strips to secure the plant safely without obscuring key features
of the specimen. Other methods of attaching specimens such
as sewing or glue can also be used, but gluing is not recom-
mended as this impedes further use of the specimen for DNA
extraction.
5. Glue a herbarium label containing all of the collection infor-
mation on to the bottom right-hand corner of the sheet.
6. Mounted specimens should be frozen at −20 °C for a minimum
of 72 h to kill any pests that might be found within the plants.
Place specimens in cardboard boxes covered with air-tight
plastic bags during freezing so that condensation forms on the
outside of the bag rather than on the specimens.
7. After freezing, store specimens in an insect-proof herbarium
cupboard in a room where the temperature is maintained at
16–23 °C and humidity at 40–60 % [23].
8. Herbarium specimens should be verified by a taxonomic expert
to ensure correct identification.
9. The voucher can then be accessioned and scanned using an A3
scanner and the scan uploaded with the rest of the DNA barcode
information.
3.4 Collecting Samples for DNA Extraction from Herbarium Specimens
Collecting samples directly from herbarium specimens for DNA
extraction is an efficient way to obtain a large number of verified
samples. The age of the specimen, how it has been preserved and
stored, and the taxonomic group all affect the likelihood of obtaining
useable DNA; it is therefore advisable to conduct a trial before
embarking on a large-scale sampling campaign. We found an approximate
10 % loss of DNA recoverability per decade, so preferentially
sample material less than 30 years old [4].
1. Prepare the target list of species for collecting. It is advisable to
arrange this with reference to the layout of the herbarium col-
lection, so that herbarium cabinets are visited in order and the
collection can be sampled as efficiently as possible.
2. Prepare labels with duplicate collection codes; these can be cut
in half, with one half stuck to the herbarium specimen to indi-
cate that it has been sampled and the other placed in the bag
with the leaf sample. Both labels need to show the collection
code as this is the number linking the DNA sample with the
herbarium specimen. Barcoded herbarium labels (i.e., not
DNA barcodes) can be used if duplicated.
3. Choose the herbarium specimen for sampling. The specimen
must be suitable for having a small section of tissue removed
(approximately 2–4 cm²) without decreasing its scientific value
and should preferably have been determined by an expert in addi-
tion to the collector. More recently collected samples work better.
Do not sample type specimens or historically important material
unless there are compelling reasons to do so. Collect multiple
samples per species from throughout the geographic target area.
4. Remove a small piece of material, around 2–4 cm², and place
this into an airtight ziplock bag using forceps, along with the label
with the species name and collection code. Select samples from
areas of green, thin tissue which will have dried quickly and
retained DNA quality. DNA is more easily extracted from flowers
than leaves for some taxonomic groups such as Orchidaceae
and Hypericaceae. Be careful of mixed species collections on
the same herbarium sheet.
5. Dip forceps into 70 % ethanol between samples and allow to
dry thoroughly.
6. Once the herbarium specimens have been sampled they can be
scanned using an A3 scanner and the collection information
recorded.
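The selection criteria in step 3 and the age guidance above can be expressed as a simple ranking over candidate sheets. The sketch below excludes type specimens, drops sheets 30 years old or older, and then prefers expert-determined and more recent material. Treating the 30-year preference as a hard cut-off is a simplification made here, and all names and example data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Specimen:
    accession: str
    year_collected: int
    is_type: bool = False
    expert_determined: bool = False


def rank_for_sampling(specimens, current_year, max_age_years=30):
    """Eligible sheets, best candidates first."""
    eligible = [
        s for s in specimens
        if not s.is_type
        and current_year - s.year_collected < max_age_years
    ]
    # Expert-determined sheets first, then the most recent collections.
    return sorted(
        eligible,
        key=lambda s: (not s.expert_determined, -s.year_collected),
    )


# Hypothetical candidate sheets for one species.
candidates = [
    Specimen("V.001", 1921, is_type=True),
    Specimen("V.002", 1975),
    Specimen("V.003", 1998),
    Specimen("V.004", 1992, expert_determined=True),
    Specimen("V.005", 2008),
]

ranked = rank_for_sampling(candidates, current_year=2014)
print([s.accession for s in ranked])
```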
3.5 Laboratory Information Management Systems (LIMS)
Keeping track of the collected samples as they pass through the lab
procedures is a nontrivial task, especially for plants, as each sample
will be amplified multiple times to allow for successful amplification
using the two DNA barcode markers. It is possible to keep
track of samples using spreadsheets but we recommend that for
larger scale DNA barcode campaigns some form of LIMS is
used. We use the Biocode PlugIn, a free tool that can be added
into the Geneious Pro bioinformatics software [24].
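For projects tracked in spreadsheets, the core of the bookkeeping is a map from each collection code to a plate and well. The sketch below builds such a map for 96-well plates. It is not how the Biocode PlugIn works; the column-wise fill order (A1, B1, ... H1, A2, ...) and the code format are assumptions made for illustration.

```python
import string

ROWS = string.ascii_uppercase[:8]   # rows A-H
COLUMNS = range(1, 13)              # columns 1-12


def plate_map(collection_codes, wells_per_plate=96):
    """Map each collection code to (plate number, well), filling by column."""
    wells = [f"{row}{col}" for col in COLUMNS for row in ROWS]
    layout = {}
    for index, code in enumerate(collection_codes):
        plate, position = divmod(index, wells_per_plate)
        layout[code] = (plate + 1, wells[position])
    return layout


# Hypothetical collection codes for two full plates (192 samples).
codes = [f"S{i:03d}" for i in range(1, 193)]
layout = plate_map(codes)

print(layout["S001"], layout["S096"], layout["S097"], layout["S192"])
```

Writing this map out alongside the PCR results for rbcL and matK gives one row per sample per marker, which is usually enough to see which wells need re-amplification.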
3.6 DNA Extraction of Herbarium Samples in 96-Well Format
There are many different methods for DNA extraction from plant
material. The method we present here uses a commercial kit
(Qiagen DNeasy 96 Plant Kit) but the protocol has been adapted
for use with herbarium specimens. It uses a 96-well format with
two 96-well plates processed per extraction (see Note 17).
1. Decide on the position of each sample to be extracted for two
96-well plates.
2. Place a 3-mm tungsten carbide bead into each sample tube
within two 96-well sample plates.
3. Add 0.5 cm² of tissue sample to each tube using forceps. Tissue
thickness will vary between samples so consider this and aim for
an even amount of tissue across samples. When placing the
samples into the sample tubes, only open one strip of 8 lids at a
time as the samples can become statically charged and jump
about. Break or find samples of suitable size within the sample
bags to avoid particles of material moving into the lab.
4. Dip forceps in 70 % ethanol or flame between samples.
5. Extract the DNA as per the manufacturer's instructions but
with the following modifications.
6. To the 400 μl of AP1 buffer add 80 μl of DTT at 0.75 mg/ml
and 20 μl of Proteinase K at 1 mg/ml. Make up enough for all
192 samples being extracted and add 400 μl of the mixture to
each sample tube.
7. Disrupt the samples in a mill for 2 min on each side, turning
the plates around after the first 2 min to allow for consistent
sample disruption.
8. After disruption, extend the incubation time in the modified
AP1 buffer to 1 h at 65 °C (see Note 18).
9. At the end of the DNA extraction extend the final incubation
stage with AE buffer to 15 min.
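Step 1 above (deciding the position of each sample across the two plates) can be sketched in code. The following is a minimal illustration only, not part of the kit protocol: the sample identifiers are hypothetical, and filling the plates column by column is our assumption, chosen so that consecutive samples fall under the same strip of eight lids.

```python
from string import ascii_uppercase

ROWS = ascii_uppercase[:8]   # plate rows A-H
COLS = range(1, 13)          # plate columns 1-12

# Well order runs down each column: A1, B1, ... H1, A2, ... H12
WELLS = [f"{row}{col}" for col in COLS for row in ROWS]


def plate_map(sample_ids):
    """Map each sample ID to a (plate number, well) position."""
    layout = {}
    for i, sample in enumerate(sample_ids):
        plate, pos = divmod(i, len(WELLS))
        layout[sample] = (plate + 1, WELLS[pos])
    return layout


# Hypothetical sample codes for one extraction run of 192 samples
samples = [f"S{n:03d}" for n in range(1, 193)]
layout = plate_map(samples)
for sample in ("S001", "S009", "S097", "S192"):
    print(sample, layout[sample])
```

Saving this map alongside the collection codes gives a record that can later be imported into a spreadsheet or LIMS.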
Table 1
rbcL and matK primers commonly used to amplify plant species

Primer          F/R   Sequence 5′–3′                 Reference
rbcLa-F         F     ATGTCACCACAAACAGAGACTAAAGC     [32]
rbcLr590        R     AGTCCACCGCGTAGACATTCAT         [4]
rbcLa-rev       R     GTAAAATCAAGTCCACCRCG           [16]
rbcLajf634R     R     GAAACGGTCTCTCCAACGCAT          [33]
rbcL724R        R     TCGCATGTACCTGCAGTAGC           [34]
matK2.1a        F     ATCCATCTGGAAATCTTAGTTC         [35]
matK2.1F        F     CCTATCCATCTGGAAATCTTAG         [35]
matK_1R_kim     F     ACCCAGTCCATCTGGAAATCTTGGTCC    K.J. Kim, unpub.
MatK_390f       F     CGATCTATTCATTCAATATTTC         [36]
MatK_Xf         F     TAATTTACGATCAATTCATTC          [35]
MatK-3FKIM-r    R     CGTACAGTACTTTTGTGTTTACGAG      K.J. Kim, unpub.
MatK_1326r      R     TCTAGCACACGAAAGTCGAAGT         [36]
MatK_5r         R     GTTCTAGCACAAGAAAGTCG           [35]
matK3.2         R     CTTCCTCTGTAAAGAATTC            [35]

Additional primers, including those for particular orders of flowering plants, are also available [4, 21, 31]
3.7 PCR Amplification
The method described here is for the amplification of the DNA
barcode markers rbcL and matK. It is optimized for use with
herbarium material but also works for freshly collected material
that has been stored in silica gel prior to extraction. Table 1 shows
primers commonly used for rbcL and matK. rbcL primers are generally
universal, working well across a broad taxonomic range; we use
rbcLa-F and rbcLr590 for the first PCR. If this fails we then use a
different reverse primer. matK is more problematic and often
requires more primer combinations, especially when using herbarium
material. For herbarium material we often use primers specific
to the order of flowering plants to which the sample belongs [4].
matK amplification can also sometimes be problematic for nonseed
plants and further primer development is required for these [21].
1. Table 2 details the components required for PCR. Decide on
the number of samples to be amplified and make up a master
mix with each component (except for the DNA) in the quanti-
ties required plus a little extra for pipetting errors. Include a
water control to test for contaminants.
2. Add 18 μl of master mix into each PCR tube either individually
or using a multichannel pipette if using a 96-well PCR plate.
Table 2
PCR components required in order to amplify rbcL and matK

                               Amount per    Amount needed for a 96-well plate
Component                      sample (μl)   (96 + 4 for pipetting errors) (μl)   Details
2× Taq-polymerase master mix   10            1,000                                We use Bioline Biomix
F primer                       0.4           40                                   10 μM, diluted in TE
R primer                       0.4           40                                   10 μM, diluted in TE
BSA                            0.8           80                                   1 mg/ml solution
H2O                            6.4           640                                  Molecular biology grade
DNA                            2.0
3. Add 2 μl of DNA to each tube.
4. If using 96-well PCR plates, place a heat-sealing PCR film over
the plate, or close lids firmly if using PCR tubes.
5. Place into the thermocycler and amplify using: 95 °C for 2 min,
then 45 cycles of 95 °C for 30 s, 50 °C for 1 min 30 s, and then
72 °C for 40 s. Finish with 72 °C for 5 min and then 30 °C for
10 s (see Note 19).
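The master mix in step 1 is a simple scaling of the per-sample volumes. A minimal sketch of that arithmetic follows; the per-reaction volumes are those listed in Table 2, while the function name and the default of four extra reactions are our own illustrative choices.

```python
# Per-reaction volumes in microlitres, as listed in Table 2 (DNA is added separately)
PER_REACTION_UL = {
    "2x Taq polymerase master mix": 10.0,
    "F primer (10 uM)": 0.4,
    "R primer (10 uM)": 0.4,
    "BSA (1 mg/ml)": 0.8,
    "H2O": 6.4,
}


def master_mix(n_samples, extra=4):
    """Scale each component to n_samples plus spare reactions for pipetting errors."""
    n = n_samples + extra
    return {name: round(vol * n, 1) for name, vol in PER_REACTION_UL.items()}


mix = master_mix(96)
for name, vol in mix.items():
    print(f"{name:30s} {vol:8.1f} ul")

# Volume dispensed per tube before adding 2 ul of DNA
per_tube = round(sum(PER_REACTION_UL.values()), 1)
print("Master mix per tube:", per_tube, "ul")
```

The per-tube total of 18 μl matches the volume dispensed in step 2, which is a useful sanity check whenever a component volume is changed during optimization.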
3.8 Gel Electrophoresis
There are many shapes and sizes of gel support and combs; this
method can be used to run a 96-well plate of samples at one time
(see Note 20).
1. Plan the order in which samples will be added to the agarose gel.
2. Assemble walls of gel support or cover with two pieces of
masking tape each side.
3. Make a 1 % gel by weighing out 1.3 g of agarose into a conical
flask and adding 130 ml of 1× TAE buffer. This volume suits a
gel support of 16 × 17 cm.
4. Heat in a microwave on medium power until bubbling
(this should take around 3 min). Check that all of the agarose
has fully dissolved; it should be clear with no thread-like
appearance.
5. Cool by placing the conical flask under a cold running tap,
swirl to allow even cooling but do not shake as this will add in
air bubbles. Cool until comfortable to touch.
6. Add 3 μl of SYBRsafe dye and swirl until incorporated.
7. Pour into gel support, avoiding air bubbles or layers and insert
combs. Four combs with 30 wells each allow a 96-well PCR
plate to be run at one time.
8. Allow gel to set.
9. Use an empty PCR plate as a loading plate for the PCR products.
Add 2 μl of loading buffer per DNA sample to the loading
plate using a multichannel pipette.
10. Add 4 μl of PCR product to each well of the loading plate using a
multichannel pipette.
11. Remove the outside walls of the gel support and place it into
an electrophoresis tank containing 1× TAE buffer. Gently
remove the combs and top up with 1× TAE buffer up to the
fill line (see Note 21).
12. Load 6 μl of DNA and buffer into each well using the same
pipette tip and rinsing between samples in the buffer. A multi-
channel pipette can be used for this but often the samples will
need to be interleaved.
13. Add 4 μl of size standard at the end of each row of samples.
14. Attach the electrodes and plug into the power pack. Run the
gel at 120 V for around 20 min (see Note 22).
15. Visualize the gel on a gel image system under UV light
(see Note 23).
16. Photograph the gel and save the image.
17. Cherry-pick the samples that have amplified successfully for
transfer into sequencing plates. Record the failures for feeding
back into the amplification process.
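For gel supports of a different size, the quantities in steps 3 and 7 can be recalculated. The sketch below is illustrative only; it assumes a weight/volume gel percentage and one size standard per row of wells, as in step 13, and the function names are ours.

```python
def agarose_grams(percent_w_v, buffer_ml):
    """Mass of agarose for a gel of the given % (w/v) in the given buffer volume."""
    return percent_w_v * buffer_ml / 100


def wells_required(n_samples, n_combs, standards_per_row=1):
    """Sample wells plus one size-standard well per comb (row)."""
    return n_samples + n_combs * standards_per_row


grams = agarose_grams(1, 130)            # 1 % gel in 130 ml of 1x TAE
needed = wells_required(96, n_combs=4)   # one 96-well plate on four combs
available = 4 * 30                       # four combs with 30 wells each
print(f"Agarose: {grams} g; wells needed: {needed} of {available}")
```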
3.9 DNA Sequencing
For DNA barcoding, samples should be Sanger sequenced in both
directions so each PCR plate will result in two sequencing plates.
DNA can be sequenced using the same primers that were used for
PCR. DNA sequencing technology is developing rapidly and many
of the applications of DNA barcoding make use of next generation
sequencing approaches. Sanger sequencing is, at the time of writ-
ing, still an appropriate tool for the creation of reference DNA
barcode libraries due to its accuracy and long read length, but this
may change in the future.
We recommend the use of a commercial provider for DNA
sequencing and purification. Shop around for the best price and do
negotiate for discounts when a large number of samples are going
to be sequenced.
3.10 Manual Editing, Alignment, and Data Checks
There are a number of software packages available that can be used
for manual editing, for example, Sequencher, Geneious, and
CodonCode Aligner; we use Sequencher. Rather than give specific
details of methods for particular software packages, we provide
here a guideline to what is required for the manual editing and
checking process.
1. Download the AB1 files from the DNA sequencer into a folder
containing the forward and reverse reads. Depending on how
DNA sequencing was carried out, the files may need to be
renamed to make it easier to construct the contigs.
2. The sequences must be trimmed to remove low quality bases at
the beginning and end of the sequencing read. A standard set of
trim criteria should be used. We use a 25 bp window and remove
sequence with >2 bp showing a quality score (QV) <20.
3. The forward and reverse reads now need to be assembled into
a contig. Sequencher provides methods for assembly by name
allowing the process to be automated.
4. After you have your assembled contigs, manually edit each
one. First, check that one of the reading frames is free of stop
codons. Then check the amount of overlap in the forward and
reverse read; this should be greater than 50 % for DNA barcode
sequences (see Note 24).
5. View the contig consensus and the traces for each read. Search
for contig disagreements and if any are found examine the for-
ward and reverse sequences to see if the reason is clear. If it is,
then change the base to the correct letter. If the disagreement is
not clear then change the consensus to N for that base. Also
search for ambiguous bases. Check each ambiguous base; leave
if the call is acceptable, change to the correct letter or N if not.
6. Trim the primers from the consensus; sometimes, depending
on the amount of trimming, only part of the primer will be left
or none at all (see Note 25).
7. Provide a summary of the quality of the sequences. Quality
statistics can include the amount of bidirectional read; mean
QV of sequences; the percentage of high (QV >30) and low
quality (QV <20) bases, and the number of internal gaps
and substitutions when aligning the forward and reverse reads
(see Note 26).
8. When manual editing is complete, export the successful con-
sensus sequences as FASTA files.
9. Manual editing of individual sequences may not always spot
extra bases inserted or missed during the base-calling process
and may not always spot sequences in reverse complement. We
align the sequences to highlight these errors and also to prepare
sequences for downstream analysis. For rbcL we use Clustal W
in MEGA [25]; for matK we use transAlign [26].
10. Completed alignments may require some manual editing; rbcL
does not contain indels but matK does and manual editing can
improve the final alignment. We use MEGA, Bioedit, and
Mesquite for manually editing alignments (see Note 27).
11. In the final alignment, scan through and check for any
sequences that appear misplaced, often due to a missed or extra
base being called. If the whole sequence appears incorrect then
sometimes it is the reverse complement. Check multiple copies
of the same species; these should be similar.
12. If all the sequences in the alignment used the same forward and
reverse primers then the primers can be trimmed at this stage
instead of individually during manual editing of the contigs.
13. We use the alignment to make guide trees to look for obvi-
ously misplaced taxa as these can represent mislabeling or con-
taminants. Before making the tree, trim the sequences to a
constant length. We use MEGA to make Neighbor-joining
trees using Kimura-2-parameter and 1,000 bootstrap replicates
(see Note 28).
14. View the neighbor-joining tree for obviously misplaced samples.
Samples of the same species should appear close together but
may not necessarily form a monophyletic group. This approach
can help spot contaminants and errors within very different
taxonomic groups but will not help if contaminants are close
relatives.
15. Once the data is checked it can be exported in FASTA format and
uploaded, along with scans of herbarium vouchers, collection
data, and trace files, to BOLD [20].
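The trim criteria in step 2 (a 25 bp window, removing sequence with more than 2 bases of QV <20) can be expressed as a sliding-window scan. The following is a minimal sketch of one reasonable reading of that rule, not the algorithm used by Sequencher or any other package: it keeps the region from the first to the last window that meets the criterion, and it takes a plain list of per-base quality values as input.

```python
def trim_bounds(quals, window=25, max_low=2, min_qv=20):
    """Return (start, end) of the region kept after trimming both ends.

    A window passes if it holds at most `max_low` bases with QV < `min_qv`.
    The kept region runs from the first passing window to the last one.
    """
    n = len(quals)
    if n < window:
        return (0, 0)
    low = [q < min_qv for q in quals]

    # Count low-quality bases in every window using a running total
    running = sum(low[:window])
    counts = [running]
    for i in range(1, n - window + 1):
        running += low[i + window - 1] - low[i - 1]
        counts.append(running)

    passing = [i for i, c in enumerate(counts) if c <= max_low]
    if not passing:
        return (0, 0)          # nothing usable in this read
    return (passing[0], passing[-1] + window)


# Toy read: 10 poor bases, 100 good bases, 10 poor bases
quals = [10] * 10 + [40] * 100 + [10] * 10
start, end = trim_bounds(quals)
print(f"Keep bases {start}..{end - 1} ({end - start} bp)")
```

Under this reading up to two low-quality bases can remain at each end of the kept region, so the ends should still be inspected during manual editing.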
4 Notes
1. Silica gel is a drying agent; wear gloves when handling and use
in a well-ventilated area. We find it preferable to use small balls
rather than powder as this generates less dust. The silica gel must
be fully dehydrated before use and should be stored in an airtight
box or bag to prevent it absorbing moisture from the air.
2. We use glassine paper negative bags in which to collect specimens
as these can be purchased in suitable sizes so that a tablespoon
of silica gel can be added into each bag. They can be purchased
from photographic suppliers. Alternatively plastic ziplock bags
can be used or 2 ml tubes with screw top lids.
3. Traditionally a metal carrying case called a vasculum is used to
collect herbarium samples until they can be pressed. This keeps
them cool and prevents them from becoming damaged. This,
however, is rather bulky to carry and plastic carrier bags may be
used instead.
4. Field collecting notebooks can be used to record specimens
but if a large collecting campaign is planned it is advisable to use
a Field Information Management System (FIMS) [27].
5. Field presses can be purchased but it is very easy to make them.
They consist of two press frames made of latticed wood or
metal; these need to be strong but lightweight. Press straps
hold the contents in place; cloth webbing straps with spiked
buckles work best [23].
6. Flimsies are folded sheets of thin, strong paper slightly smaller
than the field press. Each plant sample is placed within a flimsy
and remains in this after collection and during the drying
process until it is mounted onto a herbarium sheet.
7. Drying papers are thick blotting paper used in the field press
and during the drying process.
8. Corrugates are made of corrugated cardboard or aluminum
with the corrugations running width-wise across the press.
These speed up the drying process by conducting dry air and
improving ventilation [23].
9. Proteinase K is used to digest protein and remove contamina-
tion from DNA extractions. It also helps to inactivate nucleases
that can degrade the DNA.
10. DTT (dithiothreitol) is a strong reducing agent that helps to
reduce the disulphide bonds of proteins.
11. We routinely use a 2× Taq polymerase master mix (Biomix
from Bioline) that contains all of the dNTPs and buffers
required. Use of different polymerase can affect results;
Platinum Taq DNA polymerase (Invitrogen) is recommended
for matK as it improves subsequent sequencing success [21].
12. BSA (bovine serum albumin) is used as a PCR additive, helping
to prevent PCR inhibitors from binding to and inactivating
the Taq DNA polymerase. It helps to scavenge a variety of pos-
sible inhibitors and is particularly useful when amplifying
poorer quality DNA such as that obtained from herbarium
material [28].
13. Ethidium bromide can be used as an alternative to SYBR Safe,
but it is highly toxic and a mutagen so we would recommend
avoiding its use. SYBR Safe has comparable sensitivity and is
not classified as hazardous waste under US federal regulations.
SYBR Safe will degrade in light so should be stored in the dark.
It can be stored in a fridge but it contains DMSO so will be
frozen at this temperature. Allow it to defrost before use.
14. This protocol describes how to optimally collect leaf samples
from plants for DNA extraction. The ability to extract DNA is
highly dependent on how well the sample is collected and
dried. This protocol uses silica gel to dry the leaf samples as fast
as possible. This rapid drying will help preserve the DNA in
the best possible condition.
15. The leaves of some species contain secondary compounds or
are very hairy, which can interfere with PCR; in this case it is
better to collect flowers. Take extra care with bryophytes as
these often have a number of species growing closely together
and it is easy to collect a mixed sample. Aquatic plants can also
be problematic as they are often covered in algae.
16. Sometimes it is not convenient to collect directly into the field
press; if this is the case then place the voucher material into
carrier bags or a vasculum for pressing at the end of the day.
17. For freshly collected material stored in silica gel we use the
Qiagen DNeasy 96 Plant Kit as per the manufacturers’ instruc-
tions and for smaller sample sizes the Qiagen DNeasy Plant
Mini kit. A wide range of nonkit based methods are also available
[21, 29].
18. The herbarium specimens are very dehydrated and extending
the incubation phase allows the cells to become more hydrated,
giving additional time for the AP1, proteinase K, and DTT to
work. During the incubation, place a heavy object on the top
of the sample plates so that the lids do not pop off as they are
heated in the waterbath.
19. These PCR conditions work well for herbarium specimens and
samples collected into silica gel. If PCR is unsuccessful, optimi-
zation can be used to try and improve the PCR conditions.
If the agarose gel shows faint or inconsistent bands then the
annealing temperature can be decreased, concentration of
MgCl2 can be increased, and primer concentration increased.
Diluting the DNA 10× can also help to dilute secondary com-
pounds that interfere with the PCR. If bands on agarose are
fuzzy or multibanded then the annealing temperature can be
increased, MgCl2 concentration decreased, primer concentra-
tion decreased, or the number of PCR cycles reduced. Increasing
and decreasing the annealing temperature should use 0.5–1 °C
intervals; alternatively, a gradient PCR can be used to discover
the optimum annealing temperature. PCR errors can increase as
the number of PCR cycles increases so the number should be
kept to the minimum that allows successful amplification.
20. An alternative to using traditional agarose gels is to use precast
E-gels (Invitrogen). These are quicker and more convenient
but more expensive.
21. Removing the combs when the gel is in the buffer helps to
avoid the walls of sample wells collapsing into themselves.
22. This needs to be long enough to allow the size of the PCR
product to be seen in relation to the size standard but not too
long so that adjacent rows come into contact with each other.
23. Only turn the UV light on when ready to visualize and photo-
graph the gel as the UV light will break down the SYBR safe.
24. As rbcL and matK are coding regions of DNA, stop codons
should not be found within the sequences. A genuine stop
codon within the sequence can indicate that the gene is no
longer functional and therefore a pseudogene has been
sequenced; this is quite rare for rbcL and matK, but if it does
occur that sequence should be removed. More frequently stop
codons are caused by a base being added or missed during base-
calling which throws the reading frame out; this can often be
repaired during the manual editing process.
25. The consensus will be shown in the 5′–3′ direction so the
reverse primer will need to be reverse complemented before
searching for it.
26. The CBOL Plant Working Group defines high-quality sequences
as those in which both the forward and reverse reads have a
minimum length of 100 bp and a mean QV of >30, and the
posttrim lengths are >50 % of the original read length. The
assembled contig should have >50 % overlap in the alignment of
the forward and reverse reads, with <1 % low-quality bases
(<20 QV) and <1 % internal gaps and substitutions when aligning
the forward and reverse reads [18].
27. rbcL typically does not contain indels in the alignment. The
only cases where we have found indels are when the alignment
contains parasitic plants. Plants that are completely parasitic do
not need a functional rbcL gene, so greater sequence variation
is possible due to the relaxation of selection pressure on this
region [30].
28. If using MEGA to make the alignment and Neighbor-joining
trees then note that the final alignment will need to be saved as
a .meg file so that it can be opened in the data explorer module
of MEGA.
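The checks described in Notes 24 and 25 can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a replacement for the editing software: the function names are ours, only the three forward reading frames of the consensus are examined, and the stop codons are TAA, TAG, and TGA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# IUPAC-aware complement table, so that ambiguity codes such as R are handled
COMPLEMENT = str.maketrans("ACGTRYKMSWBDHVN", "TGCAYRMKSWVHDBN")


def reverse_complement(seq):
    """Reverse complement a DNA sequence, e.g. a reverse primer (Note 25)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def frames_without_stops(seq):
    """Return the forward reading frames (0, 1, 2) that contain no stop codon."""
    seq = seq.upper()
    open_frames = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            open_frames.append(frame)
    return open_frames


# Toy consensus: frames 0 and 1 contain stops, frame 2 is open
print(frames_without_stops("ATGAAATAGCC"))

# Reverse primer rbcLr590 from Table 1, as it would appear in the consensus
print(reverse_complement("AGTCCACCGCGTAGACATTCAT"))
```

An empty list from `frames_without_stops` flags a contig for re-examination (Note 24). Primers containing ambiguity codes, such as rbcLa-rev, will not be found by an exact text search and need a pattern match instead.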
References
1. Hebert PDN, Cywinska A, Ball SL et al (2003) Biological identifications through DNA barcodes. Proc R Soc Lond B Biol Sci 270:313–321
2. Hebert PDN, Gregory TR (2005) The promise of DNA barcoding for taxonomy. Syst Biol 54:852–859
3. Chase MW, Fay MF (2009) Barcoding of plants and fungi. Science 325:682–683
4. de Vere N, Rich TCG, Ford CR et al (2012) DNA barcoding the native flowering plants and conifers of Wales. PLoS One 7:e37945
5. Asahina H, Shinozaki J, Masuda K et al (2010) Identification of medicinal Dendrobium species by phylogenetic analyses using matK and rbcL sequences. J Nat Med 64:133–138
6. Chen S, Yao H, Han J et al (2010) Validation of the ITS2 region as a novel DNA barcode for identifying medicinal plant species. PLoS One 5:e8613
7. De Mattia F, Bruni I, Galimberti A et al (2011) A comparative study of different DNA barcoding markers for the identification of some members of Lamiacaea. Food Res Int 44:693–702
8. Jaakola L, Suokas M, Haggman H (2010) Novel approaches based on DNA barcoding and high-resolution melting of amplicons for authenticity analyses of berry species. Food Chem 123:494–500
9. Kumar S, Kahlon T, Chaudhary S (2011) A rapid screening for adulterants in olive oil using DNA barcodes. Food Chem 127:1335–1341
10. Stoeckle MY, Gamble CC, Kirpekar R et al (2011) Commercial teas highlight plant DNA barcode identification successes and obstacles. Sci Rep 1:42
11. Bleeker W, Klausmeyer S, Peintinger M et al (2008) DNA sequences identify invasive alien Cardamine at Lake Constance. Biol Conserv 141:692–698
12. Saunders GW (2009) Routine DNA barcoding of Canadian Gracilariales (Rhodophyta) reveals the invasive species Gracilaria vermiculophylla in British Columbia. Mol Ecol Resour 9:140–150
13. Van De Wiel CCM, Van Der Schoot J, Van Valkenburg JLCH et al (2009) DNA barcoding discriminates the noxious invasive plant species, floating pennywort (Hydrocotyle ranunculoides L.f.), from non-invasive relatives. Mol Ecol Resour 9:1086–1091
14. Kesanakurti PR, Fazekas AJ, Burgess KS et al (2011) Spatial patterns of plant diversity below-ground as revealed by DNA barcoding. Mol Ecol 20:1289–1302
15. Sonstebo JH, Gielly L, Brysting AK et al (2010) Using next-generation sequencing for molecular reconstruction of past Arctic vegetation and climate. Mol Ecol Resour 10:1009–1018
16. Kress WJ, Erickson DL, Jones FA et al (2009) Plant DNA barcodes and a community phylogeny of a tropical forest dynamics plot in Panama. Proc Natl Acad Sci U S A 106:18621–18626
17. Kress WJ, Erickson DL, Swenson NG et al (2010) Advances in the use of DNA barcodes to build a community phylogeny for tropical trees in a Puerto Rican forest dynamics plot. PLoS One 5:e15409
18. CBOL Plant Working Group (2009) A DNA barcode for land plants. Proc Natl Acad Sci U S A 106:12794–12797
19. Hollingsworth PM, Graham SW, Little DP (2011) Choosing and using a plant DNA barcode. PLoS One 6:e19254
20. Ratnasingham S, Hebert PDN (2007) BOLD: the barcode of life data system (www.barcodinglife.org). Mol Ecol Notes 7:355–364
21. Fazekas A, Kuzmina ML, Newmaster SG et al (2012) DNA barcoding methods for land plants. In: Kress WJ, Erickson DL (eds) Springer protocols methods in molecular biology 858 DNA barcodes methods and protocols. Springer, New York, pp 223–252
22. Särkinen T, Staats M, Richardson JE et al (2012) How to open the treasure chest? Optimising DNA extraction from herbarium specimens. PLoS One 7:e43808
23. Bridson D, Forman L (1998) The herbarium handbook, 3rd edn. Royal Botanic Gardens Kew, London
24. Parker M, Stones-Havas S, Starger C et al (2012) Laboratory information management systems for DNA barcoding. In: Kress WJ, Erickson DL (eds) Springer protocols methods in molecular biology 858 DNA barcodes methods and protocols. Humana, New York, pp 269–310
25. Tamura K, Peterson D, Peterson N et al (2011) MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol 28:2731–2739
26. Bininda-Emonds ORP (2005) Transalign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences. BMC Bioinformatics 6:156
27. Deck J, Gross J, Stones-Havas S et al (2012) Field information management systems for DNA barcoding. In: Kress WJ, Erickson DL (eds) Springer protocols methods in molecular biology 858 DNA barcodes methods and protocols. Humana, New York, pp 255–267
28. Kreader CA (1996) Relief of amplification inhibition in PCR with bovine serum albumin or T4 gene 32 protein. Appl Environ Microbiol 62:1102–1106
29. Ivanova NV, Fazekas AJ, Hebert PDN (2008) Semi-automated, membrane-based protocol for DNA isolation from plants. Plant Mol Biol Rep 26:186–198
30. Wolfe AD, dePamphilis CW (1998) The effect of relaxed functional constraints on the photosynthetic gene rbcL in photosynthetic and nonphotosynthetic parasitic plants. Mol Biol Evol 15:1243–1258
31. Dunning LT, Savolainen V (2010) Broad-scale amplification of matK for DNA barcoding plants, a technical note. Bot J Linn Soc 164:1–9
32. Kress WJ, Erickson DL (2007) A two-locus global DNA barcode for land plants: the coding rbcL gene complements the non-coding trnH-psbA spacer region. PLoS One 2:e508
33. Fazekas AJ, Burgess KS, Kesanakurti PR et al (2008) Multiple multilocus DNA barcodes from the plastid genome discriminate plant species equally well. PLoS One 3:e2802
34. Fay MF, Swensen SM, Chase MW (1997) Taxonomic affinities of Medusagyne oppositifolia (Medusagynaceae). Kew Bull 52:111–120
35. Ford CS, Ayres KL, Toomey N et al (2009) Selection of candidate coding DNA barcoding regions for use on land plants. Bot J Linn Soc 159:1–11
36. Cuenoud P, Savolainen V, Chatrou LW et al (2002) Molecular phylogenetics of Caryophyllales based on nuclear 18S rDNA and plastid rbcL, atpB, and matK DNA sequences. Am J Bot 89:132–144
Chapter 9
Multiplexed Digital Gene Expression Analysis for Genetical
Genomics in Large Plant Populations
Christian Obermeier, Bertha M. Salazar-Colqui,
Viola Spamer, and Rod Snowdon
Abstract
Digital gene expression (DGE) analysis is a cost-effective method for large-scale quantitative transcriptome
analysis using second generation sequencing. Here we describe how adaptation of DGE with barcode
indexing in large segregating plant populations of over 100 genotypes can be applied for successful
expression QTL (eQTL) and gene expression network analysis to develop transcript-based markers for
breeding.
Key words Digital gene expression analysis, SAGE, RNA-Seq, Genetical genomics, eQTL
1 Introduction
In the last decade, global transcriptome profiling methods have
evolved rapidly due to the increasing availability and diversity of
cost-effective next-generation sequencing technologies.
Quantitative global transcriptome analysis can assist in marker
development for complex traits by integrating DNA variation and
quantitative gene expression data in segregating mapping popula-
tions. This can help to link expression levels of transcripts to
genomic regions influencing quantitative traits. Such integrative
approaches have been coined “Genetical Genomics” [1]. They aim
to detect the genomic loci that control gene expression differences,
referred to as expression quantitative trait loci (eQTL) [2].
Initially, microarray-based expression platforms were applied
for integration of QTL mapping and quantitative transcriptome
profiling. Such analyses were mainly performed in model organ-
isms, and the high expense of microarray gene expression experi-
ments generally limited studies to a few individuals. Recently,
cost-effective and high-throughput transcriptome quantification
techniques based on second generation sequencing approaches
have begun to supersede microarrays as the method of choice for
global transcriptome analysis. Such techniques, based on sequenc-
ing of complete messenger RNA libraries (mRNA-Seq) or of short
cDNA-tags (digital gene expression; DGE), allow more powerful
eQTL studies in mapping populations with hundreds of individu-
als. Fully quantitative mRNA-Seq is generally still too expensive for
frequent application in large segregating populations. On the other
hand, high-depth sequencing of short, defined tags offers a power-
ful alternative at a fraction of the price which enables the genera-
tion of highly quantitative global expression data in large
populations of individuals.
In DGE, oligo-dT surface-attached beads are used for synthe-
sis of cDNA libraries, resulting in enrichment of the 3′ end of poly-
adenylated mRNAs. These are then used for massively parallel
sequencing of a short tag from the 3′ end of every captured mRNA
molecule. The technique derives from the Serial Analysis of Gene
Expression (SAGE) protocol, in which 13–15 bp fragments were
sequenced by Sanger sequencing of concatenated and cloned tags
[3]. The technique was later refined for sequencing of 21 bp frag-
ments in the LongSAGE protocol [4] and 26–27 bp in the
SuperSAGE protocol [5]. The LongSAGE and SuperSAGE proce-
dures were also adapted to second generation sequencing for
higher throughput. Library production and Illumina short-read
sequencing services are offered by a number of commercial compa-
nies for LongSAGE and SuperSAGE. Services are also offered by
commercial companies with modified protocols to sequence bar-
coded 100 bp 3′-fragment cDNA [6] or 50–500 bp assembled
3′-fragment cDNA [7] using Illumina short-read technology.
The following protocol describes the cost-effective parallel
production of DGE libraries with 21 bp tag length (LongSAGE)
for plant mapping populations with 96 genotypes, applying 8-plex
barcoding for sequencing in 12 flow cells on Illumina systems
(e.g., Genome Analyzer IIx, MiSeq). With current sequencing
outputs the technique can generate more than 25 million tags per
flow cell, or three million tags per individual, giving highly quanti-
tative data even for low-abundance transcripts. The protocol
applies barcoding by using 2 × 8 oligonucleotides for adapter P1
production, enabling parallel sequencing of eight barcoded sam-
ples in one MiSeq or GAIIx flow cell (see Note 1). The protocol
can easily be adapted to the steadily increasing read output
of second generation Illumina sequencing machines, whether from
improvements in the hardware or in the sequencing chemistry. The
number of targeted reads per individual should be based on the
transcriptome size and complexity of the studied organism.
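The multiplexing arithmetic in this paragraph can be checked with a short calculation. The sketch below uses the figures quoted above (96 genotypes, 8-plex barcoding, 25 million tags per flow cell); the function names are illustrative, and an even split of tags between barcodes is an assumption that real runs only approximate.

```python
import math


def flow_cells_needed(n_genotypes, plex):
    """Number of sequencing units needed when `plex` samples share each one."""
    return math.ceil(n_genotypes / plex)


def tags_per_sample(tags_per_flow_cell, plex):
    """Expected tags per sample, assuming barcodes are evenly represented."""
    return tags_per_flow_cell // plex


units = flow_cells_needed(96, 8)
depth = tags_per_sample(25_000_000, 8)
print(f"{units} flow cells, about {depth:,} tags per individual")
```

Rerunning the calculation with a higher plex level shows how far the depth per individual drops, which helps when adapting the protocol to a machine with a different output.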
In DGE-like approaches only one 3′-tag from any polyadenyl-
ated mRNA molecule is sampled after reverse-transcription into
cDNA. In contrast, in RNA-Seq approaches the entire transcrip-
tome is reverse-transcribed into cDNA libraries and randomly
fragmented into pieces a few hundred nucleotides long. Because it
analyzes whole transcripts instead of short 3′ tags, RNA-Seq pro-
vides more information than DGE-like approaches on transcript
structure and variation. On the other hand, however, estimates of
gene expression levels from RNA-Seq data, such as reads per kilo-
base of gene length per million reads (RPKM) based on a reference
genome, are strongly biased in terms of gene length, GC content,
and dinucleotide frequencies. They therefore require complex sta-
tistical approaches and considerable knowledge of the reference
genome for adequate interpretation [8]. Also, compared to DGE,
RNA-Seq requires considerably higher depth of coverage to reach
similar statistical power in detecting low-abundance transcripts.
The increase in statistical power for accurate quantification in
DGE-like approaches comes at the expense of a reduced depth of
mRNA structural information.
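The difference between the two normalisations can be made concrete. RPKM divides the read count by both gene length (in kilobases) and library size (in millions of reads), whereas a tag-based count needs only the library-size term because each transcript contributes a single tag. The sketch below uses invented counts purely for illustration; "tags per million" is a commonly used scaling that we assume here, not one prescribed by this protocol.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def tags_per_million(tag_count, total_tags):
    """Tag count scaled to a library of one million tags (no length term)."""
    return tag_count * 1e6 / total_tags


# Invented example: two genes expressed at the same level but differing in
# length. In RNA-Seq the longer gene yields four times as many reads.
short_gene = rpkm(read_count=500, gene_length_bp=1000, total_mapped_reads=10_000_000)
long_gene = rpkm(read_count=2000, gene_length_bp=4000, total_mapped_reads=10_000_000)
print(short_gene, long_gene)

# In DGE both genes would give the same tag count regardless of length
print(tags_per_million(tag_count=150, total_tags=3_000_000))
```

The length term is where the gene-length bias enters RNA-Seq estimates, since it depends on an accurate gene model in the reference genome.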
Successful eQTL analysis requires sufficiently high throughput
to achieve parallel quantitative expression profiling for hundreds of
thousands of individual transcripts from every individual of a map-
ping population. On the other hand, parallel analysis of hundreds
of individuals from a mapping population is also vital to achieve the
necessary power of eQTL detection. Because they can be applied
to large mapping populations of 100 or more individuals at rela-
tively low cost, DGE-like approaches like the one presented below
are suitable for cost-effective and accurate eQTL analysis for iden-
tification of transcript-based markers for breeding.
2 Materials
This protocol recommends reagent suppliers based on successful
application of the procedure with the named reagents. The authors
have no affiliations with the suppliers of these reagents, and com-
parable reagents from other suppliers might be equally suitable.
2.1 Isolation of Total RNA
1. Autoclaved, deionized water (see Note 2) or DEPC-treated
water (Invitrogen, Carlsbad, CA, USA).
2. TRIzol Reagent (Life Technologies, Carlsbad, CA, USA).
3. RNeasy mini kit (QIAGEN, Hilden, Germany).
4. 5 M NaCl (Carl Roth, Karlsruhe, Germany) autoclaved in
deionized water. Store at room temperature.
5. Chloroform (Carl Roth, Karlsruhe, Germany).
6. Isopropanol (Carl Roth, Karlsruhe, Germany).
7. Ethanol, 70 % (Carl Roth, Karlsruhe, Germany).
8. Plant material.
9. Pestle and mortar.
10. Liquid nitrogen.

11. Microcentrifuge.
12. 2 ml microcentrifuge tube.
13. Spatula.
14. Spectrophotometer.
2.2 DNAse Digestion
1. RQ1 RNAse-free DNAse (Promega, Madison, WI, USA).
2. Phenol/Chloroform/Isoamyl Alcohol (25:24:1, v/v, Life
Technologies, Carlsbad, CA, USA).
3. 3 M sodium acetate, pH 5.5, store at 4 °C (Carl Roth,
Karlsruhe, Germany).
6. Autoclaved deionized water (Merck Millipore, Billerica, MA,
USA).
2.3 Binding of Total RNA to Magnetic Beads
1. Dynabeads OligodT(25) beads: 5 mg/ml in PBS, 0.02 %
sodium azide, to be used with a magnetic bead stand. Store at
4 °C (Life Technologies, Carlsbad, CA, USA).
2. Magnetic stand-96 (Life Technologies, Carlsbad, CA, USA).
3. Nonstick (see Note 3) RNAse-free 1.5 ml microfuge tubes
(Life Technologies, Carlsbad, CA, USA).
4. Lysis/Binding buffer: 100 mM Tris–HCl, pH 7.5, 500 mM
LiCl, 10 mM EDTA, 0.1 % lithium dodecyl sulfate (LiDS,
see Note 4), 5 mM dithiothreitol (DTT) (Carl Roth, Karlsruhe,
Germany). Store at 4 °C and prewarm to room temperature
before use.
5. Wash Buffer B: 10 mM Tris–HCl, pH 7.5, 0.15 M LiCl, 1 mM
EDTA, store at 4 °C (Carl Roth, Karlsruhe, Germany).
2.4 First-Strand cDNA Synthesis
1. First-Strand Buffer (5×): 250 mM Tris–HCl, pH 8.3, 375 mM
KCl, 15 mM MgCl2. Store at −20 °C (Life Technologies,
Carlsbad, CA, USA).
2. RNAseOUT Recombinant Ribonuclease Inhibitor (4 U/μl),
store at −20 °C (Life Technologies, Carlsbad, CA, USA).
3. 0.1 M DTT: Store at −20 °C (Life Technologies, Carlsbad,
CA, USA).
4. Superscript II Reverse Transcriptase (200 U/μl), store at
−20 °C (Life Technologies, Carlsbad, CA, USA).
5. 5 M Betaine, store at 4 °C (see Note 5, Sigma Aldrich,
St. Louis, MO, USA).
6. dNTP Mix (10 mM each of dATP, dTTP, dCTP and dGTP),
store at −20 °C (Fermentas International Inc./Thermo
Scientific, Burlington, Canada).
2.5 Second-Strand cDNA Synthesis
1. E. coli DNA Polymerase (10 U/μl), store at −20 °C (Fermentas
International Inc./Thermo Scientific, Burlington, Canada).
2. E. coli RNase H (5 U/μl), store at −20 °C (Fermentas
International Inc./Thermo Scientific, Burlington, Canada).
3. Mussel glycogen (20 mg/ml), store at −20 °C (Roche
Diagnostics Deutschland GmbH, Mannheim, Germany).
4. Bovine Serum Albumin, BSA (10 mg/ml): Store at −20 °C
(New England Biolabs Inc., Ipswich, MA, USA).
5. Second-Strand Buffer (10×), store at −20 °C (Fermentas
International Inc./Thermo Scientific, Burlington, Canada).
6. 0.5 M EDTA, pH 8.0, store at room temperature (Carl Roth,
Karlsruhe, Germany).
7. Wash Buffer C: 5 mM Tris–HCl, pH 7.5, 0.5 mM EDTA, 1 M
NaCl, 0.1 % SDS (see Note 6), 10 μg/ml mussel glycogen,
store at 4 °C (Carl Roth, Karlsruhe, Germany).
8. Wash Buffer D: 5 mM Tris–HCl, pH 7.5, 0.5 mM EDTA, 1 M
NaCl, 200 μg/ml BSA, store at 4 °C (Carl Roth, Karlsruhe,
Germany).
9. NEBuffer 3 (10×): 1 M NaCl, 500 mM Tris–HCl, 100 mM
MgCl2, 10 mM Dithiothreitol, pH 7.9, store at −20 °C (New
England Biolabs Inc., Ipswich, MA, USA).
10. NEBuffer 4 (10×): 500 mM potassium acetate, 200 mM Tris-
acetate, 100 mM magnesium acetate, 10 mM Dithiothreitol,
pH 7.9, store at −20 °C (New England Biolabs Inc., Ipswich,
MA, USA).
11. NEBuffer 3 (1×): diluted from 10× stock (1 M NaCl, 500 mM
Tris–HCl, 100 mM MgCl2, 10 mM Dithiothreitol, pH 7.9),
store at −20 °C (New England Biolabs Inc., Ipswich, MA, USA).
12. NEBuffer 4 (1×): diluted from 10× stock (500 mM potassium
acetate, 200 mM Tris-acetate, 100 mM magnesium acetate,
10 mM Dithiothreitol, pH 7.9), store at −20 °C (New England
Biolabs Inc., Ipswich, MA, USA).
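2.6 Digestion of cDNA With DpnII
1. LoTE: 3 mM Tris–HCl, pH 7.5, 0.2 mM EDTA, store at 4 °C
(Carl Roth, Karlsruhe, Germany).
2. NEBuffer 3 (10×): 1 M NaCl, 500 mM Tris–HCl, 100 mM
MgCl2, 10 mM Dithiothreitol, pH 7.9, store at −20 °C (New
England Biolabs Inc., Ipswich, MA, USA).
3. NEBuffer 4 (10×): 500 mM potassium acetate, 200 mM Tris-
acetate, 100 mM magnesium acetate, 10 mM Dithiothreitol,
pH 7.9, store at −20 °C (New England Biolabs Inc., Ipswich,
MA, USA).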
4. Bovine Serum Albumin, BSA (10 mg/ml, 100×), store at
−20 °C (New England Biolabs Inc., Ipswich, MA, USA).
5. DpnII (10,000 U/ml), store at −20 °C (New England Biolabs

Inc., Ipswich, MA, USA) (see Note 7).
6. Wash Buffer C: 5 mM Tris–HCl, pH 7.5, 0.5 mM EDTA, 1 M
NaCl, 0.1 % SDS (see Note 6), 10 μg/ml mussel glycogen,
store at 4 °C (Carl Roth, Karlsruhe, Germany; Roche
Germany; New England Biolabs Inc., Ipswich, MA, USA).
2.7 Ligation of GEX1 Adapter to the DpnII-Restricted cDNA Bound to Magnetic Beads
1. Complementary barcoded oligonucleotides, HPLC-purified
and 5′-modified, are mixed in equal concentrations (10 μM of
“a” and “b” oligonucleotide for each barcode) to produce a
GEX1 barcode adapter (Eurofins MWG, Operon, Ebersberg,
Germany). The original GEX1 Illumina adapters are modified
by introducing 4 bp barcodes after the DpnII restriction site and a
6 bp MmeI recognition site. The following 4 bp sequences are used
as barcodes for multiplexing of eight samples for subsequent
pooling: AGCT, GTAC, CATG, TCGA, ATGC, GACT,
CGTA, and TCAG. GEX1 adapter is ligated to the 5′ end of
the DpnII-digested bead-bound cDNA fragments. Barcodes in the
oligos used for GEX adapter 1 production are underlined.
The oligo names include an “a” or “b” for the oligo of the
upper or lower DNA strand, respectively. The oligos and
adapters contain the barcode within the name:
GEX1a_AGCT: 5′-ACAGGTTCAGAGTTCTACAGAGCTTCCGAC-3′
GEX1b_AGCT: 5′-P-GATCGTCGGAAGCTCTGTAGAACTCTGAAC-3′
GEX1a_GTAC: 5′-ACAGGTTCAGAGTTCTACAGGTACTCCGAC-3′
GEX1b_GTAC: 5′-P-GATCGTCGGAGTACCTGTAGAACTCTGAAC-3′
GEX1a_CATG: 5′-ACAGGTTCAGAGTTCTACAGCATGTCCGAC-3′
GEX1b_CATG: 5′-P-GATCGTCGGACATGCTGTAGAACTCTGAAC-3′
GEX1a_TCGA: 5′-ACAGGTTCAGAGTTCTACAGTCGATCCGAC-3′
GEX1b_TCGA: 5′-P-GATCGTCGGATCGACTGTAGAACTCTGAAC-3′
GEX1a_ATGC: 5′-ACAGGTTCAGAGTTCTACAGATGCTCCGAC-3′
GEX1b_ATGC: 5′-P-GATCGTCGGAATGCCTGTAGAACTCTGAAC-3′
GEX1a_GACT: 5′-ACAGGTTCAGAGTTCTACAGGACTTCCGAC-3′
GEX1b_GACT: 5′-P-GATCGTCGGAGACTCTGTAGAACTCTGAAC-3′
GEX1a_CGTA: 5′-ACAGGTTCAGAGTTCTACAGCGTATCCGAC-3′
GEX1b_CGTA: 5′-P-GATCGTCGGACGTACTGTAGAACTCTGAAC-3′
GEX1a_TCAG: 5′-ACAGGTTCAGAGTTCTACAGTCAGTCCGAC-3′
GEX1b_TCAG: 5′-P-GATCGTCGGATCAGCTGTAGAACTCTGAAC-3′
2. T4 DNA Ligase (5 U/μl), store at −20 °C (Life Technologies,
Carlsbad, CA, USA).
3. Ligase Buffer (5×), store at −20 °C (Life Technologies,
Carlsbad, CA, USA).
4. LoTE: 3 mM Tris–HCl, pH 7.5, 0.2 mM EDTA, store at 4 °C
(Carl Roth, Karlsruhe, Germany).
5. Autoclaved deionized water (see Note 2) or DEPC-treated
water (Invitrogen, Carlsbad, CA, USA).
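When choosing or extending a barcode set such as the eight 4 bp barcodes listed in item 1, two properties are commonly inspected: how many positions distinguish any two barcodes, and whether the four bases are evenly represented at each barcode position. The following Python sketch checks both for the barcodes given above. The function names are hypothetical, and the two criteria are assumptions about what a user may want to verify rather than requirements stated in this protocol.

```python
from itertools import combinations

# The eight 4 bp barcodes listed in item 1 of this subheading.
BARCODES = ["AGCT", "GTAC", "CATG", "TCGA", "ATGC", "GACT", "CGTA", "TCAG"]


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def base_counts_per_position(barcodes):
    """For each barcode position, count how often A, C, G and T occur."""
    counts = []
    for pos in range(len(barcodes[0])):
        column = [bc[pos] for bc in barcodes]
        counts.append({base: column.count(base) for base in "ACGT"})
    return counts


# Smallest pairwise distance within the set.
min_distance = min(hamming(a, b) for a, b in combinations(BARCODES, 2))

# Base composition at each of the four barcode positions.
position_counts = base_counts_per_position(BARCODES)

print("minimum pairwise distance:", min_distance)
for pos, counts in enumerate(position_counts, start=1):
    print("position", pos, counts)
```

For this set every base occurs twice at every position, and any two barcodes differ in at least two positions, so a single sequencing error within the barcode cannot convert one valid barcode into another.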
2.8 Cleaving With the Tagging Enzyme MmeI
1. S-adenosylmethionine, SAM (32 mM), store at −20 °C (New
England Biolabs Inc., Ipswich, MA, USA).
2. SAM (10×): freshly diluted to 400 μM from stock (New
England Biolabs Inc., Ipswich, MA, USA).
3. Autoclaved deionized water (see Note 2) or DEPC-treated
water (Invitrogen, Carlsbad, CA, USA).
4. NEBuffer 1 (10×): 100 mM Bis–Tris–Propane–HCl,100 mM
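5. NEBuffer 4 (10×): 500 mM potassium acetate, 200 mM Tris-
acetate, 100 mM magnesium acetate, 10 mM Dithiothreitol,
pH 7.9, store at −20 °C (New England Biolabs Inc., Ipswich,
MA, USA).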
6. NEBuffer 4 (1×): diluted from 10× stock (New England
Biolabs Inc., Ipswich, MA, USA).
7. NEBuffer 4 (1×)/SAM (1×): freshly prepare from 10× stock
(New England Biolabs Inc., Ipswich, MA, USA).
8. LoTE: 3 mM Tris–HCl, pH 7.5, 0.2 mM EDTA, store at 4 °C
(Carl Roth, Karlsruhe, Germany).
9. MmeI (2,000 U/ml): Store at −20 °C (New England Biolabs
Inc., Ipswich, MA, USA).
10. Phenol/chloroform/isoamyl alcohol (25:24:1, v/v), store at
4 °C (Life Technologies, Carlsbad, CA, USA).
11. Mussel glycogen (20 mg/ml), store at −20 °C (Roche
Diagnostics Deutschland GmbH, Mannheim, Germany).

12. 95 % ethanol (Carl Roth, Karlsruhe, Germany).
13. 70 % ethanol, cold (Carl Roth, Karlsruhe, Germany).
14. 3 M sodium acetate, pH 5.5, store at 4 °C (Carl Roth, Karlsruhe,
Germany).
16. Antarctic Phosphatase (5,000 U/ml), store at −20 °C (New
England Biolabs Inc., Ipswich, MA, USA).
17. Phase-Lock Gel (PLG) 2 ml tubes, store at room temperature
(5 PRIME, Darmstadt, Germany).
2.9 Ligation of GEX Adapter 2
1. Two complementary oligonucleotides GEX2a (5′ modified)
and GEX2b (3′ degenerated), synthesized and HPLC purified
by Eurofins MWG, Operon, Ebersberg, Germany. GEX2a and
GEX2b oligonucleotides were used in equal concentrations
(1.5 μM) to produce GEX adapter 2 (1.5 μM).
GEX2a: 5′-P-TCGTATGCCGTCTTCTGCTTG-3′
GEX2b: 5′-CAAGCAGAAGACGGCATACGANN-3′
2. T4 DNA Ligase (5 U/μl), store at −20 °C (Life Technologies,
Carlsbad, CA, USA).
3. Ligase Buffer (5×), store at −20 °C (Life Technologies,
Carlsbad, CA, USA).
2.10 Enrichment of Adapter-Ligated cDNA by PCR and Purification from Gel
1. FINNZYMES Phusion Hot Start DNA Polymerase (2 U/μl),
store at −20 °C (New England Biolabs Inc., Ipswich, MA, USA).
2. Phusion HF buffer (5×), store at −20 °C (New England Biolabs
Inc., Ipswich, MA, USA).
3. dNTP Set Mix, 10 mM Solutions (Fermentas International
Inc./Thermo Scientific, Burlington, Canada).
4. HPLC purified GEX_PCR1 and GEX_PCR2 primers,
(Eurofins MWG, Operon, Ebersberg, Germany):
GEX_PCR1: 5′-CAAGCAGAAGACGGCATACGA-3′
GEX_PCR2: 5′-AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAG-3′
5. Acrylamide 4K-solution (40 %, 37.5:1), store at 4 °C
(AppliChem GmbH, Darmstadt, Germany).
6. TEMED, store at 4 °C (BioRad Laboratories GmbH,
München, Germany).
7. Ammonium persulfate: freshly prepare a 10 % solution in
autoclaved water and store at 4 °C for a short time only (Carl Roth,
Karlsruhe, Germany).
8. SDS (10 %, w/v), store at room temperature (Carl Roth,
Karlsruhe, Germany).
9. Gel running buffer (4×): 1 M Tris–HCl, pH 6.8, store at 4 °C.
10. TBE (10×): 1 M Tris–HCl, 0.9 M boric acid, 0.01 M EDTA,
store at room temperature (Carl Roth, Karlsruhe, Germany).
11. SYBR Green I Nucleic Acid Gel Stain (10,000× in DMSO),
store at −20 °C (Lonza, Rockland, ME, USA).
12. 25 bp DNA ladder (1 μg/μl), store at −20 °C (Life Technologies,
Carlsbad, CA, USA).
13. Loading dye (6×): 30 % (v/v) glycerol, 0.25 % (w/v)
Bromophenol Blue (SERVA Electrophoresis GmbH,
Heidelberg, Germany).
14. Sigma Costar® Spin-X® centrifuge tube filters, cellulose acetate
membrane, pore size 0.45 μm, nonsterile (Sigma Aldrich, St.
Louis, MO, USA).
15. Mussel glycogen (20 mg/ml), store at −20 °C (Roche
Diagnostics Deutschland GmbH, Mannheim, Germany).
16. 3 M sodium acetate, pH 5.5, store at 4 °C (Carl Roth,
Karlsruhe, Germany).
17. 95 % ethanol (Carl Roth, Karlsruhe, Germany).
18. 70 % ethanol, cold (Carl Roth, Karlsruhe, Germany).
19. Gel elution buffer: 5 parts of LoTE buffer (3 mM Tris–HCl,
pH 7.5, 0.2 mM EDTA) to 1 part of 7.5 M ammonium acetate.
20. Qiagen elution buffer: 10 mM Tris–HCl, pH 8.5, store at
room temperature, from Qiagen PCR Purification Kit
(QIAGEN, Hilden, Germany).
2.11 Validation of Libraries
1. Agarose Roti®garose (Carl Roth, Karlsruhe, Germany).
2. Agilent DNA 1000 Kit (Agilent Technologies, Inc., Santa
Clara, CA, USA).
3. Agilent High Sensitivity DNA Assay Kit (Agilent Technologies,
Inc., Santa Clara, CA, USA).
4. Agilent 2100 Bioanalyzer (Agilent Technologies, Inc., Santa
Clara, CA, USA).
5. Chip Priming Station (Agilent Technologies, Inc., Santa Clara,
CA, USA).
6. IKA Vortex Mixer (Agilent Technologies, Inc., Santa Clara,
CA, USA).
2.12 Sequencing and Data Analysis
1. HPLC purified custom sequencing primer:
GEX_seq: 5′-GACAGGTTCAGAGTTCTACAG-3′
(Eurofins MWG, Operon, Ebersberg, Germany).
2. Access to Illumina Genome Analyzer IIx, Cluster Station/cBOT
or MiSeq.
3. Illumina TruSeq SR Cluster Kit or MiSeq Reagent Kits.
3 Methods
The following protocol is a combination of the protocols from

Gowda and Wang [9], Obermeier et al. [10], the manual on “Digital
Gene Expression—Tag profiling with DpnII” [11], Morrissy et al.
[12, 13] and the manual on the “I-SAGE Long Kit” [14].
3.1 Isolation of Total RNA
1. Grind 200 mg of plant material stored at −80 °C to a fine
powder in a precooled mortar with pestle using liquid nitrogen.
2. Transfer the sample into a precooled 2-ml microcentrifuge
tube, using a precooled spatula. Avoid thawing of plant mate-
rial and transfer tubes to −20 °C until a manageable set of sam-
ples is ground.
3. Add 1 ml of cold (4 °C) TRIzol reagent and vortex for 30 s.
4. Incubate the homogenized samples for 5 min at room
temperature.
5. Add 0.2 ml of chloroform and mix by inverting by hand for 15 s.
6. Incubate samples for 2–3 min at room temperature.
7. Centrifuge samples at 12,000 × g for 15 min at 4 °C (see Note 8).
8. Transfer the supernatants (aqueous phases) very carefully into
a fresh set of 2 ml microcentrifuge tubes.
9. Add 0.5 ml of ice cold (−20 °C) isopropanol to samples, mix
well by inversion.
10. Incubate samples at room temperature for 10 min.
11. Centrifuge samples at 12,000 × g for 10 min at 4 °C.
12. Discard the supernatants and wash the RNA pellets once with
1 ml of 75 % ethanol.
13. Centrifuge samples at 12,000 × g for 5 min at 4 °C.
14. Discard the supernatant and let the pellets dry at room tem-
perature for approx. 10 min under a fume hood or use a
SpeedVac.
15. Dissolve the total RNA from each sample in 80 μl of
RNase-free water.
16. Store RNA samples at −80 °C.
17. Check total RNA quality by agarose gel electrophoresis (1 %)
of an aliquot of the samples (10 μl) (see Note 9).
18. Estimate RNA concentration and check quality by using a
Nanodrop spectrophotometer. Extraction of high quality
RNA which can be used for successful downstream enzymatic

processing from some plant tissues, e.g., oil-rich seeds, is
more difficult and might require additional purification steps
(see Note 10).
3.2 DNAse Digestion
According to some protocols no DNAse digestion is necessary as
magnetic oligodT beads are designed to bind poly(A)+ RNA.
However, to ensure exclusive binding of mRNA
molecules, DNAse digestion is recommended.
1. Add RQ1 RNAse-free DNAse to the purified RNA, according
to manufacturer’s recommendations, in a total volume of
200 μl.
2. Add 200 μl Phenol/Chloroform/Isoamyl alcohol mixture and
mix well by vortexing samples.
3. Centrifuge samples at 14,000 rpm (13,000 × g) for 15 min at
room temperature.
4. Carefully transfer the supernatants to new tubes.
5. Add 1/10 volume of 3 M sodium acetate and 2.5 volume 95 %
ethanol. Mix well.
6. Centrifuge samples at 14,000 rpm (13,000 × g) for 15 min at 4 °C.
7. Remove supernatant and add 1 ml of 70 % ethanol.
8. Centrifuge samples at 14,000 rpm (13,000 × g) for 15 min at 4 °C.
9. Discard the supernatant and let the pellets dry at room tem-
perature for approx. 10 min under a fume hood or use a
SpeedVac.
10. Dissolve the total RNA from each sample in 20–30 μl of
RNase-free water.
11. Measure concentration using Nanodrop.
3.3 Binding of Total RNA to Magnetic Beads
1. Resuspend Dynabeads OligodT(25) beads by vortexing and
transfer 100 μl for each total RNA preparation to a new
RNase-free 1.5 ml nonstick microfuge tube (see Note 3).
2. Place the tubes on a magnetic stand for 2 min, then carefully
remove the supernatant and discard it.
3. Add 100 μl of Lysis/Binding buffer to the beads.
4. Remove the tube from the magnetic stand and carefully flick
the tubes with fingers to mix with buffer, then place the tube
back on the magnetic stand.
5. Remove the supernatant and repeat once more.
6. Wash the beads by resuspending them in 500 μl of Lysis/
Binding buffer.
7. Remove the tubes from the stand and mix the contents of the
tube by gently flicking the tubes with fingers.
8. Centrifuge briefly to collect any beads that may stick to the cap
of the tube.
9. Return the tubes to the magnetic stand for 2 min and carefully
remove the supernatant.
10. Repeat steps 4–9 once.
11. Resuspend the beads in 50 μl of Lysis/Binding buffer.
12. Adjust a total amount of 50 μg of the total DNAse-treated
RNA to 50 μl with water.
13. Heat the RNA solutions to 65 °C for 5 min to disrupt any
secondary structure, place on ice.
14. After removing the Lysis/Binding buffer from the equilibrated
oligo dT beads (step 11 above) add 50 μl of total RNA sample
to the beads.
15. Mix the sample well by flicking the tubes and firmly close and
place into a 50-ml Falcon tube stuffed with a Kimwipe.
16. Mix the beads and RNA by slowly rocking the tube on a rock-
ing platform or vortexing the tubes intermittently on a slow
vortex, for 30 min at room temperature. Check in between
that beads do not sediment.
17. Place the tube on a magnetic stand for 2 min, then carefully
remove and discard the supernatant.
18. Wash the beads on the magnetic stand with 200 μl of Wash
Buffer B.
19. Wash the beads a second time with 200 μl of Wash Buffer B.
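Step 12 above requires 50 μg of DNAse-treated total RNA in a final volume of 50 μl, which is only possible if the RNA stock is at least 1 μg/μl. The following Python sketch computes the RNA and water volumes from a measured concentration. The function name and the example concentration are hypothetical and serve only as an illustration.

```python
def rna_input_volumes(conc_ng_per_ul, target_ug=50.0, final_ul=50.0):
    """Return (RNA volume, water volume) in ul for the bead-binding input.

    conc_ng_per_ul: measured RNA concentration, e.g. from a Nanodrop reading.
    target_ug and final_ul default to the values given in step 12.
    """
    if conc_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    rna_ul = target_ug * 1000.0 / conc_ng_per_ul
    if rna_ul > final_ul:
        # The stock is too dilute to fit the target amount into the volume.
        raise ValueError(
            "RNA stock too dilute: %.1f ul needed, only %.1f ul allowed"
            % (rna_ul, final_ul)
        )
    return rna_ul, final_ul - rna_ul


# Hypothetical example: a stock measured at 2,000 ng/ul.
rna_ul, water_ul = rna_input_volumes(2000.0)
print("RNA: %.1f ul, water: %.1f ul" % (rna_ul, water_ul))
```

If the function raises an error, the RNA would have to be concentrated, for example by repeating the precipitation and dissolving the pellet in a smaller volume.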
3.4 First-Strand cDNA Synthesis
1. Wash the beads four times with 100 μl of 1× First-Strand Buffer
(for SuperScript III Reverse Transcriptase). On the fourth
wash do not remove the supernatant.
2. Mix the following reagents for the first-strand cDNA synthesis
with a total volume of 38 μl in a new siliconized (nonstick)
RNase-free 1.5 ml tube on ice: 8 μl First-Strand Buffer (5×),
0.5 μl RNaseOUT (4 U/μl), 4.5 μl DTT (0.1 M), 2 μl dNTP
mix (10 mM), 3 μl betaine (5 M), 20 μl water.
3. Carefully remove the fourth wash from the beads and immedi-
ately add 38 μl of first-strand cDNA synthesis mix to the beads.
Mix gently by flicking the tube without causing the beads to
stick to the upper inner walls or lid of the tubes.
4. Centrifuge briefly to collect any beads that may stick to the
inner cap of the tubes.
5. Place the tubes at 42 °C for 2 min to equilibrate the reagents.
6. Add 2 μl of SuperScript III Reverse Transcriptase (200 U/μl).
7. Incubate on a thermomixer at 42 °C for 1 h, mixing for 30 s,
130 × g at 10 min intervals. Make sure the beads do not
sediment.
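When several samples are processed in parallel, the first-strand mix from step 2 is usually prepared as a master mix. The following Python sketch scales the per-reaction volumes given in step 2 to a number of reactions. The 10 % overage to compensate for pipetting losses and the function name are assumptions made for illustration and are not part of the protocol.

```python
# Per-reaction volumes in ul, as listed in step 2 of this subheading.
FIRST_STRAND_MIX_UL = {
    "First-Strand Buffer (5x)": 8.0,
    "RNaseOUT (4 U/ul)": 0.5,
    "DTT (0.1 M)": 4.5,
    "dNTP mix (10 mM)": 2.0,
    "betaine (5 M)": 3.0,
    "water": 20.0,
}


def scale_master_mix(per_reaction_ul, n_reactions, overage=0.10):
    """Scale each component to n_reactions plus a fractional overage.

    overage=0.10 (10 % extra) is an assumed allowance for pipetting loss.
    """
    if n_reactions < 1:
        raise ValueError("at least one reaction is required")
    factor = n_reactions * (1.0 + overage)
    return {name: round(vol * factor, 2) for name, vol in per_reaction_ul.items()}


# The components should add up to the 38 ul stated in step 2.
per_reaction_total = sum(FIRST_STRAND_MIX_UL.values())

# Hypothetical example: master mix for eight barcoded samples.
mix_for_8 = scale_master_mix(FIRST_STRAND_MIX_UL, 8)
for name, vol in mix_for_8.items():
    print("%-26s %7.2f ul" % (name, vol))
```

The reverse transcriptase is deliberately left out of the dictionary because it is added separately in step 6 after temperature equilibration.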
3.5 Second-Strand cDNA Synthesis
1. To terminate first-strand cDNA synthesis, heat samples at
72 °C for 7 min, then place on ice while setting up the
second-strand synthesis mix.
2. Mix the following reagents for the second-strand cDNA syn-
thesis with a total volume of 115 μl on ice, in a new siliconized
(nonstick) RNase-free 1.5 ml tube: 73 μl water, 31 μl Second-
Strand Buffer (5×), 3.75 μl dNTP mix (10 mM), 2 μl E. coli
DNA polymerase (10 U/μl), and 0.4 μl E. coli RNase H (5 U/μl)
(see Note 11).
3. Add the 115 μl second-strand reaction mix directly into the
tube containing the 40 μl first-strand reaction.
4. Gently vortex the contents of the tube and incubate at 15 °C
for 2.5 h in a thermomixer programmed to mix for 30 s,
130 × g, at 10 min intervals. Do not let the temperature rise
above 15 °C.
5. Heat Wash Buffer C to 75 °C.
6. Place the reaction tubes on ice for 2 min, and then add 22.5 μl
of 0.5 M EDTA to stop the reaction.
7. Place the tube on a magnetic stand for 2 min, then carefully
remove and discard the supernatant. Immediately add 375 μl
of heated Wash Buffer C to the beads.
8. Mix well and incubate the beads at 75 °C for 15 min in a ther-
momixer programmed to mix for 30 s, 130 × g, at 2 min
intervals.
9. Place the tubes on the magnetic stand for 2 min, then carefully
remove and discard the supernatant. Immediately wash again
with 375 μl of warm Wash Buffer C (see Notes 12 and 15).
10. Wash the sample four times with 375 μl of Wash Buffer D,
room temperature (see Note 13).
11. Place the beads for 2 min on the magnetic stand, remove
supernatants into new, RNase-free 1.5 ml nonstick tubes (see
Note 14).
12. Add 100 μl 1× Restriction buffer NEBuffer 3 for use with
restriction enzyme DpnII to the beads.
13. Transfer contents to a new nonstick RNase free 1.5 ml tube
(see Note 14).
14. Wash again with 100 μl of buffer NEBuffer 3 (1×).
15. Immediately add 89 μl of water to the beads and gently resus-
pend. Keep the tubes on ice.
3.6 Digestion of cDNA With DpnII
1. Add the following reagents directly to the tube containing
89 μl of cDNA magnetic bead suspension for a total volume of
100 μl: 10 μl NEBuffer 3 (10×), 1 μl DpnII (10 U/μl).
2. Place the bottle containing Wash Buffer C in a water bath set
at 37 °C (see Note 15).
3. Incubate at 37 °C for 1 h in a thermomixer programmed to

mix for 30 s, 130 × g, at 10 min intervals.
4. Place tubes on magnetic stand for 2 min, then carefully remove
and discard the supernatant.
5. Wash the tube twice with 375 μl of Wash Buffer C, prewarmed
to 37 °C (see Note 15).
6. Wash the tube three times with 375 μl of Wash Buffer D,
prewarmed to room temperature (see Note 13).
7. Resuspend the beads in 375 μl of Wash Buffer D.
8. Continue with ligation or store samples at 4 °C overnight if
necessary.
3.7 Ligation of GEX1 Adapter to the DpnII-Restricted cDNA Bound to Magnetic Beads
1. Prepare double-stranded DNA barcoded adapters GEX1 by
mixing the two complementary oligonucleotides of the eight
barcoded GEX1 Adapters (GEX1a_AGCT, GEX1b_AGCT;
GEX1a_GTAC, GEX1b_GTAC; GEX1a_CATG, GEX1b_CATG;
GEX1a_TCGA, GEX1b_TCGA; GEX1a_ATGC, GEX1b_ATGC;
GEX1a_GACT, GEX1b_GACT; GEX1a_CGTA, GEX1b_CGTA;
and GEX1a_TCAG, GEX1b_TCAG)
in equal concentrations 1:1 (10 μM each single-stranded
oligonucleotide) and anneal by heating to 95 °C for 2 min, then cool
slowly to 4 °C over an hour in a thermal cycler, resulting in a
5 μM double-stranded barcoded adapter GEX 1 solution.
2. Prepare double-stranded DNA adapters GEX1_Mme1 by mix-
ing the two complementary oligonucleotides GEX1a_MmeI
and GEX1b_MmeI in equal concentrations and anneal by
heating to 95 °C for 5 min, cool slowly to 4 °C over an hour
in a thermal cycler.
3. Place sample tubes from above on magnetic stand.
4. Wash the beads twice with 100 μl of 1× Ligase Buffer.
5. Place the tubes on a magnetic stand for 2 min, then carefully
remove and discard the supernatant.
6. Immediately add the following to the beads for a total volume
of 50 μl: 36 μl water, 3 μl 5 μM GEX barcoded adapter 1, 10 μl
Ligase Buffer (5×), and 1 μl T4 DNA Ligase (5 U/μl).
7. Mix gently by flicking the tubes without causing the beads to
splash on the inner walls or lid. Seal the lid with parafilm.
8. Incubate the ligation reactions overnight in a thermomixer at
16 °C programmed to mix for 30 s, 130 × g, 30 min intervals.
3.8 Cleaving with the Tagging Enzyme MmeI
1. Prepare a fresh dilution of 10× SAM (400 μM) from the NEB
supplied SAM (32 mM). 20 μl of 10× SAM is needed per
sample.
2. Prepare a 1× NEBuffer 4/1× SAM solution from the 32 mM

SAM and 1× NEBuffer 4: 1 μl SAM (32 mM), 799 μl NEBuffer
4 (1×).
3. Wash each tube three times with 250 μl of Wash Buffer D
prewarmed to room temperature, and then resuspend the
beads in 250 μl of Wash Buffer D (see Note 13).
4. Place the tubes from the previous step on a magnetic stand
for 2 min.
5. Wash the beads four times with 250 μl of room temperature
Wash Buffer D.
6. Prepare a multiple of the following digestion mix for each
sample to be processed on ice for a total volume of 100 μl:
76 μl water, 10 μl NEBuffer 4 (10×), 10 μl SAM (10×), and
4 μl MmeI (2 U/μl).
7. Wash the tubes twice with 250 μl 1× NEBuffer 4/1×
SAM. Remove the supernatants.
8. Add 100 μl of the digestion mix to the beads. Mix gently by
flicking the tube.
9. Incubate the tubes at 37 °C for 1.5 h on a thermomixer programmed
to mix for 30 s, at 130 × g, at 10 min intervals.
10. Place the tubes on a magnetic stand for 2 min. Do not discard
the supernatants. The supernatants now contain the tags.
Carefully remove the supernatants and transfer to new micro-
centrifuge tubes.
11. Wash the tubes containing the beads with 50 μl of 1× NEBuffer
4. Transfer the supernatants to new tubes to yield a total
volume of 150 μl. Discard the beads.
12. Add 1 μl of Antarctic phosphatase (5 U/μl) to the 150 μl sam-
ple and incubate at 37 °C for 1 h.
13. Microcentrifuge 2-ml PLG tubes 1 min at maximum speed,
14,000 rpm (13,000 × g), at room temperature.
14. Add 50 μl of water to the samples and transfer the 200 μl to the
prespun 2-ml PLG tubes.
15. Add 200 μl of Phenol/Chloroform/Isoamyl alcohol mixture
to the supernatants and mix well by inversion. Centrifuge for
5 min at maximum speed at room temperature.
16. Transfer the top aqueous phase (200 μl) to new 1.5-ml tubes.
17. Add the following precipitation reagents to the samples
(200 μl) for a total volume of 872 μl: 20 μl 3 M sodium ace-
tate, pH 5.5, 2 μl mussel glycogen (20 mg/ml), 650 μl cold
95 % ethanol. Vortex vigorously.
18. Incubate the tubes at −20 °C for at least 30 min and microcen-
trifuge for 30 min at maximum speed at 4 °C.
19. Carefully remove the supernatants.
20. Wash the pellets three times with 1 ml of cold 70 % ethanol,
centrifuge for 15 min.
21. Dry the pellets at room temperature for approx. 10 min under
a fume hood or use a SpeedVac. Do not overdry.
22. Resuspend the pellets in 6 μl of water and incubate for 10 min
to aid in solubilization.
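The SAM dilutions in steps 1 and 2 of this subheading follow the usual relation C1 × V1 = C2 × V2. The following Python sketch reproduces the two dilutions from the 32 mM stock. The function name is hypothetical, and the 80 μl batch volume in the last example is an assumption chosen so that a pipettable 1 μl of stock is used.

```python
def dilution_volumes(stock_conc, final_conc, final_volume):
    """Return (stock volume, diluent volume) from C1*V1 = C2*V2.

    Concentrations must share one unit; volumes are returned in the unit
    of final_volume.
    """
    if stock_conc <= 0 or final_conc <= 0 or final_volume <= 0:
        raise ValueError("all values must be positive")
    if final_conc > stock_conc:
        raise ValueError("cannot dilute to a higher concentration")
    stock_volume = final_conc * final_volume / stock_conc
    return stock_volume, final_volume - stock_volume


STOCK_SAM_UM = 32000.0  # 32 mM stock expressed in uM

# Step 1: 20 ul of 10x SAM (400 uM) per sample.
sam_10x_20ul = dilution_volumes(STOCK_SAM_UM, 400.0, 20.0)

# Assumed larger batch so that 1 ul of stock can be pipetted accurately.
sam_10x_80ul = dilution_volumes(STOCK_SAM_UM, 400.0, 80.0)

# Step 2: 1 ul of 32 mM SAM in 799 ul of 1x NEBuffer 4.
sam_1x_um = STOCK_SAM_UM * 1.0 / (1.0 + 799.0)

print(sam_10x_20ul, sam_10x_80ul, sam_1x_um)
```

The 1× SAM concentration obtained in step 2 equals one tenth of the 400 μM 10× solution, which agrees with adding 10 μl of 10× SAM to the 100 μl digestion mix in step 6.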
3.9 Ligation of GEX Adapter 2
1. Prepare double-stranded DNA adapter GEX2 by mixing two
complementary oligonucleotides in equal concentrations (1:1,
1.5 μM each single-stranded oligonucleotide) and anneal by
heating to 95 °C for 2 min, then cool slowly to 4 °C over an hour
in a thermal cycler, resulting in a 0.75 μM double-stranded
adapter GEX 2 solution.
2. Set up the following adapter ligation on ice directly into the
tube containing the 6 μl of DNA solution (Subheading 3.8)
for a total volume of 10 μl: 1 μl GEX Adapter 2 (0.75 μM),
2 μl Ligase Buffer (5×), 1 μl T4 DNA ligase (5 U/μl).
3. Seal the lid of the tube with parafilm and incubate overnight
at 16 °C.
3.10 Enrichment of Adapter-Ligated cDNA by PCR and Purification from Gel
1. Prepare a PCR Master Mix and distribute in wells of a 96-well
PCR plate. Total volume for one reaction is 25 μl: 16 μl of
water, 5 μl of Phusion HF buffer (5×), 0.25 μl of GEX_PCR1
primer (25 μM), 0.25 μl of GEX_PCR2 primer
(25 μM), 0.75 μl of dNTPs (10 mM), 0.25 μl of Phusion Hot
Start DNA Polymerase (2 U/μl).
2. Add 2.5 μl of GEX Adapter 2-ligated cDNA to each well
(25 μl total volume).
3. Amplify in a thermal cycler using the following program: 30 s
at 98 °C; 13 cycles of 10 s at 98 °C, 30 s at 60 °C, and 15 s at
72 °C; 10 min at 72 °C; hold at 4 °C.
4. When preparing libraries for the first time also include a dilution
series of GEX Adapter 2-ligated cDNA for amplification (1:5,
1:10, 1:20, 1:100, 1:200) and compare amplification patterns
with expected fragment sizes and yield by polyacrylamide gel
electrophoresis. Expected sizes are 93 bp for the targeted GEX1-
tag-GEX2 fragment and smaller sizes for artifacts including
76 bp for the GEX1-GEX2 adapter ligation, 30 bp for the GEX1
adapter, 23 bp for the GEX2 adapter fragment plus PCR primer
dimers. Identify the dilution that gives the highest PCR yield
of the 93 bp fragment relative to the other nontargeted
fragments by polyacrylamide gel electrophoresis and use it for
cutting out and purification of the 93 bp fragments.
5. Prepare a 12 % polyacrylamide gel in TBE buffer (about 16 cm

plate length, 1.5 mm spacer).
6. Mix 25 μl of amplified DNA with 5 μl of 6× DNA loading dye.
7. Load 15 μl each of the amplified DNA plus loading dye mix
into two wells of the 12 % TBE PAGE gel.
8. Load a ladder with small DNA fragments in 25–50 bp steps up
to 300 bp.
9. Run the gel for 30–35 min at 200 V.
10. Remove the gel from the apparatus.
11. Puncture the bottom of a sterile, nuclease-free, 0.5 ml micro-
tube 4–5 times with a 21-G needle.
12. Place the 0.5 ml microtube into a sterile, round-bottom,
nuclease-free, 2 ml microtube.
13. Stain the gel with SYBR Green I Nucleic Acid Gel Stain (50 μl
stock in 500 ml 0.5× TBE buffer) for 15 min in a clean
container.
14. View the gel on a Dark Reader transilluminator to avoid being
exposed to UV light. Use a clean scalpel to cut out the 93 bp
DNA fragments in the sample lanes (see Fig. 1). Be careful not
to contaminate cut out gel slices with smaller PCR fragments.
15. Place the gel slices into the prepared 0.5 ml microtubes.
16. Centrifuge the stacked tubes at full speed for 5 min at 4 °C to
force the gel pieces through the holes into the 2 ml tubes.
Fig. 1 Amplification products from four different DGE-DpnII libraries (1–4) loaded
on a 12 % polyacrylamide gel. Staining was done using SYBR Green I Nucleic
Acid Gel Stain (Lonza). M = 25 bp size marker. Correctly sized fragments containing
a tag should be 93 bp and can be excised from the gel before sequencing
17. Add 100 μl of 1× gel elution buffer to the gel debris in the
2 ml tubes.
18. Elute the DNA by incubating at 65 °C for 1 h.
19. Transfer the eluate and the gel debris to the top of a Spin-X filter.
20. Centrifuge the filter for 5 min at 14,000 rpm (13,000 × g) at
room temperature.
21. Add 2 μl of glycogen, 20 μl of 3 M NaOAc, and 650 μl of cold
95 % ethanol.
22. Mix by vortexing and store the tubes at −20 °C for 5 min.
23. Centrifuge at 14,000 rpm (13,000 × g) for 20 min.
24. Discard the supernatants, leaving the pellets intact.
25. Wash the pellets two times with 1 ml 70 % ethanol, room
temperature.
26. Discard the supernatants leaving the pellets intact.
27. Dry the pellets at room temperature for approx. 10 min under
a fume hood, or use a SpeedVac. Do not overdry.
28. Resuspend the pellets in 10 μl of Qiagen elution buffer
(QIAGEN PCR purification kit).
29. Let the tubes sit for 10 min at room temperature. Store at −20 °C.
3.11 Validation of Libraries
1. Check library quality on an Agilent Technologies 2100
Bioanalyzer (if available) using chips from the Agilent DNA
1000 kit. Load 1 μl of the resuspended DNA following the
manufacturer’s protocol and check the size, purity, and con-
centration of the sample.
2. Determine the concentration of the library by measuring its
absorbance at 260 nm in a UV spectrophotometer (Nanodrop).
The yield from the protocol should be between 500 and
1,000 ng of DNA.
3. Measure the 260/280 ratio. It should be approximately 1.8.
4. Load 10 % of the volume of the library on a polyacrylamide gel
and check that the size is as expected (93 bp).
5. From the measured concentrations calculate the approximate
total yield in ng and the total amount in pmol. For calculation
the average molar mass for one base pair (650 g/mol)
× 93 bp = 60,450 g/mol can be used.
6. Dilute a minimum of 2 μl of the sample to 10 nM in Qiagen
elution buffer (from QIAGEN PCR Purification Kit) supple-
mented with 0.1 % Tween 20 (see Note 16).
7. Check the 10 nM diluted DNA by running a High Sensitivity
DNA Assay Chip on an Agilent Technologies 2100 Bioanalyzer
and determine the final concentration of the diluted sample
(see Note 17).
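The conversions in steps 5 and 6 can be sketched as follows, using the average molar mass of 650 g/mol per base pair and the 93 bp fragment length given in step 5. The function names and the example concentration are hypothetical, so the sketch should be read as an illustration of the arithmetic and not as a validated tool.

```python
AVG_MASS_PER_BP = 650.0   # g/mol per base pair, as given in step 5
LIBRARY_LENGTH_BP = 93    # expected size of the GEX1-tag-GEX2 fragment


def molar_mass(length_bp=LIBRARY_LENGTH_BP):
    """Approximate molar mass of a double-stranded fragment in g/mol."""
    return AVG_MASS_PER_BP * length_bp


def ng_per_ul_to_nM(conc_ng_per_ul, length_bp=LIBRARY_LENGTH_BP):
    """Convert a mass concentration (ng/ul) to a molar concentration (nM)."""
    # 1 ng/ul = 1e-3 g/l; divided by g/mol gives mol/l; times 1e9 gives nM.
    return conc_ng_per_ul * 1e6 / molar_mass(length_bp)


def yield_pmol(total_ng, length_bp=LIBRARY_LENGTH_BP):
    """Convert a total yield in ng to pmol."""
    return total_ng * 1e3 / molar_mass(length_bp)


def buffer_to_add_ul(conc_nM, sample_ul=2.0, target_nM=10.0):
    """Volume of elution buffer needed to dilute sample_ul to target_nM."""
    if conc_nM < target_nM:
        raise ValueError("library is already below the target concentration")
    return sample_ul * conc_nM / target_nM - sample_ul


# Hypothetical example: a library measured at 15 ng/ul.
library_nM = ng_per_ul_to_nM(15.0)
print("library: %.1f nM" % library_nM)
print("buffer for 2 ul: %.1f ul" % buffer_to_add_ul(library_nM))
```

Under these assumptions a 10 nM solution of the 93 bp fragment corresponds to roughly 0.6 ng/μl.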
3.12 Sequencing and Data Analysis
1. For sequencing of the barcoded libraries in 8-plex mixes,
each in one flow cell of an Illumina Genome Analyzer IIx, the
Illumina standard protocol and chemistry have been applied.
The protocol can also be adapted for sequencing on the MiSeq
platform.
2. The required amounts and concentrations of the 8-plex
samples were 20 μl of 10 nM 8-plex sample (0.64 ng/μl) in
Qiagen elution buffer (Tris–HCl, 10 mM, pH 8.5) supple-
mented with 0.1 % Tween 20.
3. Adjust each single library to an equal molarity of 10 nM based
on Agilent DNA 1000 kit measurements on the Agilent 2100
Bioanalyzer.
4. Recheck concentrations and readjust by using the High
Sensitivity DNA Assay (Agilent Technologies, Inc., Santa Clara,
CA, USA).
5. Pool eight libraries with different barcodes by taking 2.5 μl
from each library.
6. A total of 6.5 pmol of DNA should be used for sequencing.
7. A 30 μM solution of the custom sequencing primer GEX_seq has
to be provided when using a commercial service for sequencing.
8. Primary and part of secondary data analysis including image
analysis, base calling, and quality check can be performed with
standard software, e.g., the Illumina Genome Analyzer data
analysis pipeline Real Time Analysis and CASAVA. Further
filtering for quality, adapters and artifacts can be performed
using open source software, e.g., the FASTX-Toolkit.
9. eQTL analysis can be performed using a number of open source
software tools, e.g., FastMap [15] and others [16] to identify
tags colocalizing with QTL for phenotypic traits for further
marker development.
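The filtering mentioned in step 8 includes assigning each read to its sample by barcode and extracting the tag. The following Python sketch shows one way this could be done. It assumes that each read starts directly after the GEX_seq primer, so that the read begins with the 4 bp barcode, followed by the MmeI recognition site TCCGAC and the DpnII site GATC taken from the adapter design in Subheading 2.7. The tag length of 16 nt after GATC, the function names, and the example reads are further assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Barcodes from Subheading 2.7.
BARCODES = ("AGCT", "GTAC", "CATG", "TCGA", "ATGC", "GACT", "CGTA", "TCAG")

# MmeI recognition site followed by the DpnII site (assumed read layout).
ANCHOR = "TCCGAC" + "GATC"


def parse_read(read, tag_length=16):
    """Return (barcode, tag) for a valid read, otherwise None.

    tag_length=16 is an assumed number of tag bases following GATC.
    """
    read = read.upper()
    barcode = read[:4]
    if barcode not in BARCODES:
        return None
    start = 4 + len(ANCHOR)
    if read[4:start] != ANCHOR:
        return None
    tag = read[start:start + tag_length]
    if len(tag) < tag_length or "N" in tag:
        return None
    return barcode, tag


def count_tags(reads, tag_length=16):
    """Count tags per barcode; reads that fail parsing are skipped."""
    counts = defaultdict(Counter)
    for read in reads:
        parsed = parse_read(read, tag_length)
        if parsed is not None:
            barcode, tag = parsed
            counts[barcode][tag] += 1
    return counts


# Hypothetical reads for demonstration.
example_reads = [
    "AGCT" + ANCHOR + "ACGTTGCAAGGCTTAC" + "TCGTAT",
    "AGCT" + ANCHOR + "ACGTTGCAAGGCTTAC" + "TCGTAT",
    "GTAC" + ANCHOR + "TTGACCGTAAGCTAGG" + "TCGTAT",
    "AAAA" + ANCHOR + "ACGTTGCAAGGCTTAC" + "TCGTAT",  # unknown barcode
    "CATG" + ANCHOR + "ACGTNGCAAGGCTTAC" + "TCGTAT",  # ambiguous base in tag
]
tag_counts = count_tags(example_reads)
for barcode, counter in tag_counts.items():
    print(barcode, dict(counter))
```

The resulting per-sample tag counts could then be normalized and passed to the eQTL software named in step 9. Real data would be read from FASTQ files, which this sketch omits.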
4 Notes
1. This protocol involves costs for synthesis of 16 oligonucleotides
(29/30 bp in length) by a commercial service provider
for production of eight barcoded P1 adapters. For higher level
multiplexing, e.g., for parallel sequencing of 64 samples in one
flow cell, costs for synthesis of oligonucleotides would increase
considerably when more variants of adapter P1 are used (costs
for 128 oligonucleotides). This would exceed sequencing costs
per flow cell and is only cost-effective for usage of these adapt-
ers in a larger number of DGE projects. Alternatively, adapter
P2 can be modified to also contain a barcode. Barcoded adapters
P1 and P2 can then be used in different combinations to
reduce oligonucleotide synthesis costs (32 instead of 128
oligonucleotides required to produce eight barcoded P1
adapters and eight barcoded P2 adapters for parallel sequencing
of 64 samples in one flow cell).
2. Some protocols recommend adding DEPC (carcinogenic) to
water before autoclaving to keep it RNase-free, or using
certified nuclease-free water. However, in our experience the
use of deionized autoclaved water is sufficient to ensure
RNase-free solutions.
3. It is recommended to use nonstick instead of regular microfuge
tubes, as the magnetic beads tend to stick to the walls of regu-
lar tubes, causing less efficient resuspension of beads in buffer
and washing solutions.
4. The manufacturer’s protocol recommends 1 % LiDS. However,
0.1 % LiDS gives better results.
5. Betaine was used to reduce formation of secondary structure
in GC regions. It also improves amplification of DNA and
avoids DNA melting during synthesis.
6. Although the original protocol uses 1 % SDS, using 0.1 % SDS
gives better results.
7. As an alternative to the 4 bp cutter restriction enzyme DpnII,
the 4 bp cutter restriction enzyme NlaIII is also often used in
DGE protocols. However, it is very temperature sensitive and
has to be stored at −80 °C. Only aliquots should be taken out
from the freezer. The half-life of NlaIII at −80 °C is only about
6 months. The oligonucleotides listed for production of
adapter GEX1 must be modified when using NlaIII.
8. Following centrifugation the mixture separates into a lower
red phenol–chloroform phase, an interphase, and a colorless
upper aqueous phase. RNA remains exclusively in the aqueous
phase and makes up about 60 % of the volume of TRIzol
reagent used for homogenization.
9. Clean RNA will be relatively stable. To check RNA for RNase
contamination and degradation, leave an aliquot of each RNA
sample overnight at room temperature and compare on an
agarose gel with an aliquot of the same RNA sample stored
overnight at −20 °C. High quality RNA should not show any
degradation compared to the RNA sample stored after extrac-
tion at −20 °C.
10. High contents of secondary metabolites (e.g., samples from
oil-rich seeds) can interfere with the extraction procedure.
Although the quality of total RNA may appear satisfactory
based on the pattern on agarose gels, or the ratio of UV mea-
surement at 260/230 nm in Nanodrop measurements, down-
stream enzymatic reactions of the protocol can be inhibited.
Thus, RNA extracted from such samples should be repurified in
a second step using RNeasy columns (QIAGEN).
11. E. coli DNA Ligase helps to produce longer cDNAs and is
included in the Invitrogen LongSAGE and Morrissy [12]
protocols for second-strand synthesis. However, it was removed
from the protocol (also recommended by the manufacturer
Fermentas) due to high costs when applied to multiple samples.
The E. coli DNA polymerase amount was reduced from 200 to
20 units per reaction to reduce costs in multiplexing. The E. coli
RNase H amount was reduced from 10 to 0.8 units per reaction
to reduce costs in multiplexing according to the concentrations
recommended by the manufacturer (Fermentas).
12. Perform wash steps quickly to prevent precipitation of SDS.
13. Make sure Wash Buffer D is at room temperature to avoid
clumping of beads. If clumping of beads occurs perform addi-
tional wash steps.
14. If the beads stick to the sides of the tube, gently scrape them
off using a pipette tip.
15. Make sure Wash Buffer C is prewarmed to 37 °C to avoid pre-
cipitation of SDS.
16. Use 0.1 % Tween 20 to avoid degradation of DNA and adhe-
sion to the walls of the tubes.
17. The High Sensitivity DNA chips (Agilent), with a detection
limit of 100 pg/μl, are used for this purpose because 10 nM of
the 93 bp product is approximately 0.64 ng/μl, which cannot be
accurately measured using the NanoDrop, which has an
approximate detection limit of 2 ng/μl.
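The conversion behind Note 17 can be sketched in Python as follows. The average mass of 660 g/mol per base pair of double-stranded DNA is a common approximation and an assumption here, as are the function names. With this value, 10 nM of a 93 bp product comes to about 0.61 ng/μl, the same range as the approximate figure quoted in Note 17; the exact number depends on the per-base-pair mass assumed.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (approximation, assumed)


def nM_to_ng_per_ul(conc_nM, length_bp, bp_mass=AVG_BP_MASS):
    """Convert a molar concentration of a dsDNA fragment to ng/µl."""
    # nmol/l * g/mol = ng/l; dividing by 1e6 µl per litre gives ng/µl.
    return conc_nM * length_bp * bp_mass / 1e6


def ng_per_ul_to_nM(conc_ng_ul, length_bp, bp_mass=AVG_BP_MASS):
    """Convert a mass concentration of a dsDNA fragment to nM."""
    return conc_ng_ul * 1e6 / (length_bp * bp_mass)


mass_conc = nM_to_ng_per_ul(10.0, 93)
print(round(mass_conc, 2), "ng/µl")
# Compare with the detection limits quoted in Note 17.
print("above Bioanalyzer chip limit (0.1 ng/µl):", mass_conc > 0.1)
print("below NanoDrop limit (2 ng/µl):", mass_conc < 2.0)
```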
Acknowledgements
The development of this protocol for high-throughput mapping
in large plant populations was funded by the Deutsche
Forschungsgemeinschaft (DFG) grants SN14/7-2, SN14/11-1
and WA 2161/2-1. The authors thank Bashir Hosseini, Anja Pöltl,
and Liane Renno for technical assistance.
References
1. Jansen RC, Nap J-P (2001) Genetical genomics: the added value from segregation. Trends Genet 17:388–391
2. De Koning D-J, Haley CS (2005) Genetical genomics in humans and model organisms. Trends Genet 21:377–381
3. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW (1995) Serial analysis of gene expression. Science 270:484–487
4. Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B, Kinzler KW, Velculescu VE (2002) Using the transcriptome to annotate the genome. Nat Biotechnol 19:508–512
5. Matsumura H, Reich S, Ito A, Saitoh H, Kamoun S, Winter P, Kahl G, Reuter R, Krüger DH, Terauchi R (2003) Gene expression analysis of plant host–pathogen interactions by SuperSAGE. Proc Natl Acad Sci U S A 100:15718–15723
6. Torres T, Metta M, Ottenwälder B, Schlötterer C (2008) Gene expression profiling by massively parallel sequencing. Genome Res 18:172–177
7. Kahl G, Molina C, Rotter B, Jüngling R, Frank A, Krezdorn N, Hoffmeier K, Winter P (2012) Reduced representation sequencing of plant stress transcriptomes. J Plant Biochem Biotechnol. doi:10.1007/s13562-012-0129-y
8. Zheng W, Chung LM, Zhao H (2011) Bias detection and correction in RNA-sequencing data. BMC Bioinformatics 12:290
9. Gowda M, Wang GL (2008) Robust-LongSAGE (RL-SAGE): an improved LongSAGE method for high-throughput transcriptome analysis. Methods Mol Biol 387:25–38
10. Obermeier C, Hosseini B, Friedt W, Snowdon R (2009) Gene expression profiling via LongSAGE in a non-model plant species: a case study in seeds of Brassica napus. BMC Genomics 10:295
11. Illumina (2007) Preparing samples for digital gene expression-tag profiling with DpnII. http://illumina.bioinfo.ucr.edu/ht/documentation/molbiol-docs/DGE-DpnII-Sample-Prep.pdf. Accessed 26 Jun 2012
12. Morrissy S, Zhao Y, Delaney A, Asano J, Dhalla N, Li I, McDonald H, Pandoh P, Prabhu A, Tam A, Hirst M, Marra M (2010) Digital gene expression by tag sequencing on the Illumina Genome Analyzer. Curr Protoc Hum Genet 11(11):1–11.11.36
13. Morrissy AS, Morin RD, Delaney A, Zeng T, McDonald H, Jones S, Zhao Y, Hirst M, Marra MA (2009) Next-generation tag sequencing for cancer gene expression profiling. Genome Res 19:1825–1835
14. Invitrogen (2010) I-SAGE™ Long Kit. For constructing Long SAGE™ (serial analysis of gene expression) libraries. Version D, 19 October 2010, 25-0656. http://tools.invitrogen.com/content/sfs/manuals/isagelong_man.pdf. Accessed 26 Jun 2012
15. Gatti DM, Shabalin AA, Lam T-C, Wright FA, Rusyn I, Nobel AB (2009) FastMap: Fast eQTL mapping in homozygous populations. Bioinformatics 25:482–489
16. Shabalin AA (2012) Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics 28:1353–1358
Chapter 10
SNP Genotyping by Heteroduplex Analysis

Norma Paniego, Corina Fusari, Verónica Lia, and Andrea Puebla
Abstract
Heteroduplex-based genotyping methods have proven to be technologically effective and economically
efficient for low- to medium-range throughput single-nucleotide polymorphism (SNP) determination. In this
chapter we describe two protocols that were successfully applied for SNP detection and haplotype analysis
of candidate genes in association studies. The protocols involve (1) enzymatic mismatch cleavage with
endonuclease CEL1 from celery, associated with fragment separation using capillary electrophoresis
(CEL1 cleavage), and (2) differential retention of the homo/heteroduplex DNA molecules under partial
denaturing conditions on ion pair reversed-phase liquid chromatography (dHPLC). Both methods are
complementary since dHPLC is more versatile than CEL1 cleavage for identifying multiple SNP per target
region, and the latter is easily optimized for sequences with fewer SNPs or small insertion/deletion poly-
morphisms. Besides, CEL1 cleavage is a powerful method to localize the position of the mutation when
fragment resolution is done using capillary electrophoresis.
Key words Heteroduplex analysis, SNP, CEL1 cleavage, Capillary electrophoresis, dHPLC,
Genotyping, Candidate genes
1 Introduction
Recent technological advances allow the generation and/or
interrogation of massive amounts of genotype data. In parallel, geno-
typing platforms such as multiplexed SNPs and more recently
genotyping by sequencing have the potential to provide a timely
and cost-effective way for whole-scale genome analysis [1, 2]. In spite
of that, medium-throughput SNP platforms remain the best choice
for diversity, mapping, and breeding applications associated with
low- and medium-scale projects, which often involve genotyping a
relatively large number of individuals (i.e., 100–1,000) with a
reduced set of loci (i.e., 10–50) [3, 4]. Here we present two of the
most popular methodologies for SNP genotyping based on hetero-
duplex analysis: enzymatic mismatch cleavage using endonuclease 1
from celery [5] followed by fluorescent fragment resolution using
capillary electrophoresis (CEL1 cleavage) and denaturing High
Pressure Liquid Chromatography (dHPLC) [6]. Both techniques
Fig. 1 Outline of SNP detection by heteroduplex analysis. Amplified regions of different alleles (allele A and B)
are mixed in equimolecular proportions and subjected to a heating and cooling process to enable the formation
of homoduplex and heteroduplex molecules. In dHPLC, heteroduplex molecules elute earlier than the homodu-
plex because of their reduced melting temperature. In CEL1, a DNA automatic analyzer enables the detection
of labeled fragments corresponding to cleaved heteroduplex and homoduplex molecules. Reproduced from [9]
with permission from Springer Science + Business Media
rely on heteroduplex formation by denaturation and slow
reannealing of heterozygous DNA molecules (Fig. 1), and they can
be easily applied for automated genotyping of well-characterized
polymorphisms using standard reagents and equipment. CEL1
cleavage involves enzymatic cleavage at the 3′ end of mismatches
and fragment electrophoresis, whereas dHPLC is based on the dif-
ferential retention of the homo- and heteroduplex DNA molecules
on ion-pair reversed-phase high pressure liquid chromatography
supported under partial denaturing conditions [6]. In practice, the
choice of one method over the other will be ultimately determined
by the scale and scope of the study, the requirements of each tech-
nique, and the project budget. From a technical point of view, the
issues to be taken into consideration to select the appropriate geno-
typing method include: (1) the number of variable sites (SNPs)
within the target region, (2) the amplicon length, and (3) whether
prior knowledge of the complete amplicon sequence is available.
2 Materials
2.1 Heteroduplex Formation
1. High quality DNA from plant tissue (see Note 1).
2. A set of individuals of known genotype for each candidate gene:
target amplicons from a small panel of homozygous individuals
of the species of interest are used as reference to make homo-
and heteroduplex samples for the analysis and optimization.
3. Primer pairs for PCR amplification of target regions (10 μM
working dilution) (see Note 2).
4. PCR components: 50 mM MgCl2, 10 mM dNTP, PCR buffer,
high yield high fidelity DNA polymerase or equivalent (see Note 3).
5. A 96/384-well thermocycler with adjustable ramp settings and
touch-down settings.
6. DNA/PCR amplicon quantification system: Hoechst 33258
or Picogreen®, double-stranded DNA standard, microtiter
plate fluorometer (e.g., Gemini from Molecular Probes or
equivalent) (see Note 4).
7. DNA/PCR amplicon quality visualization system: bromophenol
blue/xylene cyanol gel loading buffer, ethidium bromide
stained agarose gel, agarose gel electrophoresis apparatus, and
gel visualization system (see Note 5).
2.2 CEL1 Cleavage
1. CEL1 juice extract (CJE). Partially purified extract obtained
according to the protocol described by Till et al. [7, 8]
(see Note 6).
2. CEL1 reaction buffer 10×: 5 ml 1 M MgSO4, 5 ml 1 M HEPES,
pH = 7.5, 2.5 ml 2 M KCl, 0.1 ml 10 % Triton® X-100, 5 μl
20 mg/ml bovine serum albumin, 37.5 ml deionized water.
3. Heating block.
4. CEL1 stop solution: 0.15 M EDTA, pH = 8.
5. Absolute ethanol and 75 % ethanol.
6. Deionized water.
7. Centrifuge.
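As a check on the 10× CEL1 reaction buffer recipe in item 2, the following Python sketch computes the concentration of each component in the finished buffer from the stock concentrations and volumes listed. It is a plain dilution calculation; the dictionary layout and the function name are illustrative assumptions.

```python
# name: (stock concentration, unit, volume added in ml), taken from item 2
COMPONENTS = {
    "MgSO4": (1000.0, "mM", 5.0),
    "HEPES pH 7.5": (1000.0, "mM", 5.0),
    "KCl": (2000.0, "mM", 2.5),
    "Triton X-100": (10.0, "%", 0.1),
    "BSA": (20.0, "mg/ml", 0.005),   # 5 µl of 20 mg/ml
    "deionized water": (0.0, "", 37.5),
}


def final_concentrations(components):
    """Return (total volume in ml, {name: (concentration, unit)})."""
    total_ml = sum(volume for _, _, volume in components.values())
    result = {
        name: (stock * volume / total_ml, unit)
        for name, (stock, unit, volume) in components.items()
        if unit  # skip the water, which only contributes volume
    }
    return total_ml, result


total_ml, conc_10x = final_concentrations(COMPONENTS)
print(f"total volume: {total_ml:.3f} ml")
for name, (value, unit) in conc_10x.items():
    print(f"{name}: {value:.4g} {unit} in 10x buffer")
```

Under these numbers the 10× buffer contains roughly 100 mM each of MgSO4, HEPES, and KCl, so the 1× reaction is about 10 mM in each.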
2.3 Capillary Electrophoresis
1. CEL1 treated products.
2. Hi-Di formamide (Applied Biosystems).
3. GeneScan 500 (-250) ROX size standard (Applied Biosystems,
CA, USA).
4. Heating Block.
5. 3130xl Genetic Analyzer (Applied Biosystems, CA, USA):
50 cm capillary array and POP-7 polymer or equivalent.
6. GeneMapper application software (Applied Biosystems, CA,
USA) or equivalent.
2.4 dHPLC
1. Agilent series 1100 HPLC system (Agilent Technologies Inc.,
CA, USA) or equivalent including biocompatibility kit, binary
pump equipped with solvent degasser unit, autosampler with
cooling module, column oven, and variable wavelength
detector.
2. dHPLC Column (Varian Helix™ DNA or Transgenomic
DNASep column).
3. Buffer A: 100 mM TEAA, pH 7.0, 0.1 mM EDTA (see Note 7).
4. Buffer B: 100 mM TEAA, pH 7.0, 0.1 mM EDTA, 25 % (v/v)
acetonitrile (see Note 7).
5. pUC18 Hae III digested (Sigma Chemical Company, MO,
USA, Cat. No.: D6293).
6. dHPLC Melt Program available at the Stanford University
webpage: http://insertion.stanford.edu/melt.html. The sensi-
tivity of the heteroduplex analysis using dHPLC is maximized
by maintaining the HPLC column at a temperature that favors
partial strand denaturation in the presence of base-pair mis-
matches. The “dHPLC Melt program” or similar allows opti-
mal temperature selection for mutation detection from an in
silico reference sequence (see Note 8).
7. 384-well microplates (recommended Greiner Bio-One, 384
well microplate, low volume, HiBase, clear, Cat. No.: 784101
compatible with Agilent HPLC autosampler).
8. Application software (HPCHEM Agilent or equivalent).
3 Methods
A priori knowledge on the number of SNPs within a given candidate
gene allows an efficient selection of the optimal heteroduplex
method for genotyping purposes. In addition to the number of
SNPs, the number of haplotypes, i.e., the number of different
combinations of SNP alleles along a region, is also a critical factor
in establishing the most appropriate methodology. For amplicons
containing more than five SNPs or targeted regions presenting few
haplotypes, it is highly recommended to use dHPLC. Meanwhile,
CEL1 cleavage is a better choice for targeted regions comprising a
small number of SNPs (one to three) and/or more than three
haplotypes. Individuals of known genotype (“reference”) are used
to optimize assay conditions in order to distinguish all haplotypes
from each other by their distinctive electropherogram or chro-
matogram profiles [9]. One advantage of heteroduplex analysis is
that it has the ability to detect new sequence polymorphisms within
the samples being interrogated.
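The selection guidance above can be written as a small decision helper. This Python sketch is one possible encoding and not the authors' procedure: the text does not define "few haplotypes" numerically, so the threshold of three used here is an assumption, and the function name is hypothetical. Because the two criteria overlap, the helper returns every method that the guidance would support.

```python
def suggest_method(n_snps, n_haplotypes, few_haplotypes_max=3):
    """Return the heteroduplex method(s) suggested by the guidance.

    few_haplotypes_max is an assumed cut-off for "few haplotypes".
    """
    suggestions = set()
    # dHPLC: more than five SNPs, or few haplotypes in the region.
    if n_snps > 5 or n_haplotypes <= few_haplotypes_max:
        suggestions.add("dHPLC")
    # CEL1 cleavage: one to three SNPs and/or more than three haplotypes.
    if 1 <= n_snps <= 3 or n_haplotypes > 3:
        suggestions.add("CEL1 cleavage")
    return sorted(suggestions)


print(suggest_method(n_snps=8, n_haplotypes=2))
print(suggest_method(n_snps=2, n_haplotypes=5))
print(suggest_method(n_snps=2, n_haplotypes=2))
```

When both methods are returned, the remaining criteria named in the Introduction (amplicon length, prior sequence knowledge, and budget) would decide.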
3.1 Heteroduplex Formation
1. Prepare PCR amplicons from uncharacterized individuals, and
from reference individuals, in a 50 μl volume reaction with
80–100 ng genomic DNA, 2 mM MgCl2, 0.2 mM dNTP,
1 U high fidelity DNA polymerase, and 0.25 μM primer set.
Set touch-down cycling conditions to: 2 min at 94 °C for
initial denaturing, 35 cycles of 30 s at 94 °C, 45 s at 65–58 °C
or 60–55 °C, 1 min at 72 °C and a final extension of 10 min
at 72 °C.
2. Check amplicon qualities and quantify them.
3. Place in separate tubes (a) equimolecular amounts (150 ng
each) of amplicons from uncharacterized and reference indi-
viduals (mix 1), (b) equimolecular amounts (150 ng each)
from two different reference individuals (mix 2), (c) 300 ng of
amplicons from reference individuals (mix 3), and (d) 300 ng
of amplicons from uncharacterized individuals (mix 4).
Depending on the concentration of each amplicon, the final
mix volume can vary from 20 to 100 μl. For a given uncharac-
terized individual there must be as many mix 1 samples as ref-
erence individuals representing the different haplotypes
available for each candidate gene.
4. Place all tubes (or plate) in the thermal cycler and expose
samples to a heating and gradual cooling program to produce
homo- and heteroduplexes (95 °C for 2 min, 95 °C ramping to
85 °C at –2 °C/s, 85 °C ramping to 25 °C at –0.1 °C/s, 4 °C
hold; see Note 9). Use samples immediately or store at −20 °C
until use.
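The duration of the heteroduplex formation program in step 4, and the ramp settings discussed in Note 9, can be estimated with the Python sketch below. It assumes constant ramp rates and ignores the time needed to reach 95 °C and to cool to the final hold; the function names are hypothetical. Under these assumptions the program takes about 12 min, and the two ramps correspond to roughly 44 % and 2 % of a maximum cooling rate of 4.5 °C/s, close to the 45 % and 2 % settings quoted in Note 9.

```python
def ramp_seconds(start_c, end_c, rate_c_per_s):
    """Time needed to move between two temperatures at a constant rate."""
    return abs(end_c - start_c) / abs(rate_c_per_s)


def ramp_percent(rate_c_per_s, max_rate_c_per_s):
    """Ramp setting as a percentage of the cycler's maximum cooling rate."""
    return 100.0 * abs(rate_c_per_s) / abs(max_rate_c_per_s)


program_s = (
    2 * 60                          # 95 °C for 2 min
    + ramp_seconds(95, 85, -2.0)    # fast ramp from 95 to 85 °C
    + ramp_seconds(85, 25, -0.1)    # slow ramp from 85 to 25 °C
)
print(round(program_s / 60, 1), "min before the 4 °C hold")

# Settings for a cycler with a maximum cooling rate of 4.5 °C/s (Note 9).
print(round(ramp_percent(-2.0, -4.5)), "% for the fast ramp")
print(round(ramp_percent(-0.1, -4.5)), "% for the slow ramp")
```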
3.2 CEL1 Cleavage Assay
1. Add 0.2 μl of CJE and 2 μl of 10× CEL1 reaction buffer to each
tube containing mix 1, mix 2, mix 3 and mix 4 (see Note 10).
2. Incubate samples in a heating block at 45 °C for 15 min.
3. Place samples on ice and stop the reaction with 5 μl of CEL1
stop solution.
4. Add 2.5 volumes of absolute ethanol to precipitate DNA mol-
ecules. Incubate at 20 °C for 30 min and then centrifuge at
3,600 × g for 45 min. Wash the pellet with 200 μl of 75 %
ethanol. Repeat centrifugation step. Discard supernatant. Dry
and resuspend the pellet in 5 μl of deionized water.
3.3 Capillary Electrophoresis of CEL1 Treated Products
1. Dilute the final CEL1 treated product (5 μl) with 10 μl of
Hi-Di formamide and 0.25 μl of GeneScan 500 (-250) ROX size
standard.
2. Heat samples at 95 °C for 5 min.
3. Transfer samples to ice for 5 min.
4. Inject samples into 3130xl Genetic Analyzer.
5. Adjust settings for capillary electrophoresis: injection voltage
1.2 kV, injection time 50 s, run voltage 15 kV, run time
2,000 s (see Note 11).
6. Collect data to analyze with GeneMapper application software
or equivalent. For GeneMapper, create a kit (e.g., CEL1 geno-
typing), define panels (i.e., one for each candidate gene), and
markers (i.e., expected fragments according to the number of
SNPs present in the region). Create a new project and add the
sample files. Set the analysis parameters and table settings for the
project. Perform an initial analysis. Create a bin set according
to the peaks seen in the mixes made of reference individuals
(mix 2 and mix 3). Reanalyze the samples in the project and
edit allele calls. Examine the results for allele calling.
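When defining markers and bins in step 6, it can help to list the fragment sizes expected for each known SNP. The Python sketch below is a simplified model and not part of the published protocol: it assumes that SNP positions are counted in base pairs from the 5′ end of the forward primer, that each labeled strand is cut once at a mismatch, and it ignores the small offset that results from cleavage on the 3′ side of the mismatch. All names are hypothetical.

```python
def expected_fragments(amplicon_bp, snp_positions):
    """Approximate labeled fragment sizes after CEL1 cleavage.

    Returns the full-length product plus, for each SNP, the fragment
    carrying the forward-primer label and the fragment carrying the
    reverse-primer label. The two sizes add up to the amplicon length.
    """
    forward, reverse = [], []
    for pos in sorted(snp_positions):
        if not 0 < pos < amplicon_bp:
            raise ValueError(f"SNP position {pos} is outside the amplicon")
        forward.append(pos)
        reverse.append(amplicon_bp - pos)
    return {
        "full_length": amplicon_bp,
        "forward_label": forward,
        "reverse_label": reverse,
    }


# Hypothetical 500 bp amplicon with SNPs at positions 120 and 340.
print(expected_fragments(500, [120, 340]))
```

Observed peak sizes in the reference mixes (mix 2 and mix 3) remain the basis for the final bin set; the calculation only indicates where to look.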
3.4 dHPLC-Based Heteroduplex Analysis
1. Switch on computer and open the software HPCHEM.
2. Start the HPLC equipment: turn on wavelength detector,
sample tray cooler, column oven, and pumps.
3. Perform a column equilibration and performance test: set flow
rate at 0.45 ml/min, sample tray at 10 °C, column oven at
50 °C, and equilibrate the column at 55 % buffer A and 45 %
buffer B until the baseline is stable (approx. 20–30 min).
Test the column performance by injecting 0.3 μg of pUC18
Hae III digest. Baseline resolution of the 257-/267-bp, and the
434-/458-bp fragments should be obtained.
4. Adjust running conditions to test mix 2 and mix 3 (reference
individuals): set flow rate at 0.9 ml/min to reach 5 min run-
ning time per sample and try different combinations of column
temperature and buffer gradient according to the dHPLC
Melt Program suggestions. Set autosampler to inject 3–15 μl
per sample, containing no less than 100 ng DNA. Optimal
running conditions are achieved when heteroduplex samples
resolve as 2–4 peaks and homoduplex samples resolve as a single
peak. Repeat this step for each candidate region to be genotyped
(see Note 12).
5. Run the test samples under the optimal settings for each
candidate gene. Check the column performance every 100–150
injections (see Note 13).
6. Column maintenance: wash the column for 20 min at 30 °C
with 100 % buffer B. Store column at room temperature in
100 % buffer B.
7. Collect data to analyze them with HPCHEM software or
equivalent.
3.5 Genotype Calling
1. The comparison procedure should be done using only samples
from the same candidate gene. Mix 1 samples will produce
heteroduplex profiles similar to those of mix 2 whenever the
corresponding reference and uncharacterized individuals have
different genotypes (Fig. 2). Mix 1 samples composed of
uncharacterized and reference individuals with the same
genotype will have homoduplex profiles similar to those of mix 3.
In addition, assuming that mix 3 samples show only one peak,
mix 4 samples should also show only one peak. Samples made
of a single PCR product which show two or more peaks should
be subjected to further analysis to evaluate heterozygosity or
contamination.
Fig. 2 Detection of SNPs from sunflower candidate genes (1-ACCO, MADSB-TF3, LIM, CPSI, and AALP) with
capillary electrophoresis CEL1 cleavage and dHPLC. (a) Electropherograms illustrating heteroduplex (upper
panel) and homoduplex (bottom panel) profiles for 1-ACCO and MADSB-TF3 genes. Gray bars indicate the
cleavage or homoduplex product/s at the corresponding base-pair position. The y axes are in Fluorescence
Units and x axes are in base pairs. (b) dHPLC elution profiles of LIM, CPSI, and AALP genes. Heteroduplex
(upper chromatogram) and homoduplex molecules (bottom chromatogram) obtained at 0.9 ml/min in 5 min of
running (x axes). y-Axes are in milli-Absorbance Units (mAU) that correlate with milli-Volt Units. Reproduced
from [9] with permission from Springer Science + Business Media
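The comparison logic of Subheading 3.5 can be summarized in a short Python sketch. It is an illustration under simplifying assumptions and not the authors' software: each profile is reduced to a peak count, a single peak is taken as a homoduplex profile (as in the dHPLC optimization criterion of Subheading 3.4, step 4), and the function and label names are hypothetical.

```python
def classify_profile(n_peaks):
    """A single peak is treated as homoduplex, more peaks as heteroduplex."""
    return "homoduplex" if n_peaks == 1 else "heteroduplex"


def call_genotype(mix1_peaks_by_reference, mix4_peaks):
    """Assign a sample to the reference haplotype it matches.

    mix1_peaks_by_reference: {reference name: peak count of the mix 1
    sample made with that reference}.
    mix4_peaks: peak count of the sample on its own (mix 4).
    """
    if mix4_peaks > 1:
        # A single PCR product with several peaks needs further analysis.
        return "check: possible heterozygosity or contamination"
    matches = [
        ref for ref, n_peaks in mix1_peaks_by_reference.items()
        if classify_profile(n_peaks) == "homoduplex"
    ]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        # Heteroduplex with every reference: sequence not in the panel.
        return "candidate new haplotype"
    return "ambiguous: matches " + ", ".join(sorted(matches))


# Hypothetical sample tested against two reference haplotypes.
print(call_genotype({"HapA": 3, "HapB": 1}, mix4_peaks=1))
```

In practice the profiles are compared by shape against mix 2 and mix 3, so a peak count is only a first approximation.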
4 Notes
1. A high quality and high yield of DNA is obtained from
lyophilized young leaves (3-week-old plants grown in a greenhouse)
using a commercial kit, such as NucleoSpin Plant II (Macherey-
Nagel, Germany) or equivalent.
2. Keep primer length between 25 and 35 base pairs, amplicon
lengths between 200 and 850 base pairs, and annealing tem-
perature around 60 °C. Locate the end/start of forward/
reverse primers at least 70 base pairs apart from the first/last
SNP within the region. For CEL1 endonuclease, label either
one or both primers with FAM (strong intensity label) and/or
HEX (weaker intensity) fluorescent dyes.
3. Use of proofreading DNA polymerase can result in loss of
efficiency. Products like Taq Platinum (Life Technologies, CA,
USA) have proved to yield high amounts of PCR product with
low error rates. High yield and fidelity DNA polymerases for
mutation analysis can be purchased from Life Technologies
(Discoverase™) and Transgenomic (Optimase®).
4. Fluorescent dye Hoechst 33258 is inexpensive and more sensitive
than spectrophotometry, allowing quantification down to
3 ng/ml of double-stranded DNA. Top-reading solid black
plates are recommended. TNE/Hoechst working solution:
Hoechst 100 ng/ml and 1× TNE: 10 mM Tris, 200 mM NaCl,
1 mM EDTA, pH = 7.4. Mix sample/working solution is
2–200 μl. Spectrometer is adjusted to an excitation wavelength
(λ) of 365 nm and an emission wavelength (λ) of 468 nm.
Picogreen® is also a very simple test consisting of TE buffer,
PicoGreen® reagent, a standard curve formed by supplied
double strand DNA standards and customer supplied samples
(e.g., Quant-iT™ PicoGreen® and double strand DNA reagent
and kits, Invitrogen, CA, USA).
5. Mix DNA/PCR products (1 μl) with 5 μl of loading buffer
(Bromophenol-blue and/or Xylene-cyanol loading buffer)
and load the samples in ethidium bromide stained 1 % or 2 %
agarose gels, respectively. For CEL1 cleavage, it is highly
recommended to scan the gels with Typhoon Trio Scanner
(GE Healthcare Bio-Sciences) in order to monitor the fluores-
cence intensity of the labeled PCR fragments.
6. Perform all CEL1 purification steps at 4 °C. Homogenize
chilled young celery stalks (0.5 kg) with a juice extractor.
Adjust the juice to 100 mM Tris–HCl, pH = 7.7, 100 μM
PMSF (buffer A). Centrifuge for 20 min at 2,600 × g to pellet
debris. Collect the supernatant and add Ammonium Sulfate
((NH4)2SO4) with stirring to make a 25 % saturated solution.
Mix gently for 30 min. Centrifuge the suspension at 16,000 × g
for 40 min at 4 °C. Adjust the supernatant to 80 % saturation
of ((NH4)2SO4) and stir the suspension for another 30 min.
Centrifuge at 16,000 × g for 90 min. Discard supernatant.
Solubilize pellet in buffer A (0.1× starting volume). Transfer
suspension to a dialysis tube (Spectra/Por® 12–14,000
MWCO) and dialyze against a total of 32 l of buffer A with
four changes over 4 h. Overnight dialysis at 4 °C can be per-
formed without compromising purification. Aliquot and store
at –20 °C. CEL1 juice extract performed well after defrosting/
freezing up to four times.
7. Buffers A and B for dHPLC can be bought or prepared
in-house. The benefits of premixed options are their longer
shelf life and reduced inter-batch variation. In-house buffer
has a 1-week shelf life. Its manufacture requires ultrapure
water (18.2 MΩ-cm resistivity at 25 °C) and HPLC-grade
acetonitrile.
8. The optimal temperature is the temperature at which the tar-
get fragment has begun to denature, but 70–85 % still main-
tains a helical structure. The program output is a set of
conditions that combines buffer elution gradient with column
temperature.
9. Adjust the cooling ramp according to the thermocycler cooling
rate. For the Eppendorf Master Cycler with a cooling rate of
–4.5 °C/s, the cooling ramp was set at 45 % in the first step
and at 2 % in the second step.
10. CJE volume and DNA amount need to be optimized using
reference heteroduplex samples (mix 2) to check for the
expected cleavage pattern according to the number of SNPs in
the candidate region. Typically, CJE volumes ranging from 0.2
to 0.5 μl are enough to digest between 250 and 500 ng DNA
in heteroduplex samples.
11. Capillary electrophoresis running parameters must be adjusted
taking into account sizes and fluorescent signal of the frag-
ments in the sample. Initially, a number of serial dilutions must
be prepared and assayed in order to adjust fluorescent signal of
fragments between 200 and 6,000 RFU (relative fluorescence
units). Fragment lengths will define the appropriate size standard
for each analysis and capillary electrophoresis parameters must
be modified accordingly.
12. Before running the dHPLC method, check that there is
sufficient buffer to prevent the column from running dry.
Order the injection list to run sequentially from the lowest to
the highest temperature, which is the most efficient protocol
for the oven and optimizes running time. Consider performing
a blank injection every time the temperature and gradient
change during the run (to avoid ghost peaks in the chromatogram
due to analytes retained from a previous injection).
13. After approximately 150 runs, sample profiles start to shift
in retention time, and resolution becomes poorer. Check the
column performance by injecting the pUC18 Hae III control.
When peak resolution decreases, wash the column with 100 %
acetonitrile at 30 °C for 20 min, restabilize the column under
the pUC18 test conditions for 40 min, and check the elution
profile again. It is useful to create a separate folder for each
column in the HPLC software to know exactly how many runs
have been done (# runs = # files). The lifetime of the Helix
column (Varian) is about 1,000–1,500 injections, while that of
the DNASep column (Transgenomic) is about 5,000–6,000
analyses.
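The injection list advice in Note 12 can be sketched in Python as follows: samples are sorted from the lowest to the highest column temperature, and a blank is inserted whenever the running conditions change. Placing the blank at the new conditions is an assumption, as are the function name and the tuple layout.

```python
def build_injection_list(samples):
    """Order injections by oven temperature and insert blanks.

    samples: list of (sample name, oven temperature in °C, gradient name).
    Returns a list of (name, temperature, gradient) tuples in which a
    "BLANK" entry precedes the first sample of each new condition.
    """
    ordered = sorted(samples, key=lambda s: (s[1], s[2]))
    injections = []
    previous = None
    for name, temp_c, gradient in ordered:
        condition = (temp_c, gradient)
        if previous is not None and condition != previous:
            injections.append(("BLANK", temp_c, gradient))
        injections.append((name, temp_c, gradient))
        previous = condition
    return injections


# Hypothetical run with two candidate genes at different temperatures.
run = build_injection_list([
    ("LIM_s1", 58.0, "grad1"),
    ("CPSI_s1", 55.0, "grad1"),
    ("LIM_s2", 58.0, "grad1"),
])
for entry in run:
    print(entry)
```

Counting the entries of such a list per column also gives the running total of injections that Note 13 recommends tracking.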
Acknowledgments
We gratefully thank Lic Alberto Maligne for dHPLC technical
assistance and Verónica Nishinakamasu and Pablo Vera for
fluorescence capillary electrophoresis technical support. This research was
supported by ANPCyT/FONCYT, PID 2007 00073, INTA-PRR
AEBIO 245001 and 245005, INTA-PE AEBIO 24554711 and
241351. Drs. V.L. and N.P. are career members of the Consejo
Nacional de Investigaciones Científicas y Técnicas (CONICET).
References
1. Appleby N, Edwards D, Batley J (2009) New technologies for ultra-high throughput genotyping in plants. In: Somers DJ, Langridge P, Gustafson JP (eds) Plant genomics. Humana, New Hampshire, pp 19–40
2. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE (2011) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:10
3. Thomson MJ, Zhao K, Wright M et al (2011) High-throughput single nucleotide polymorphism genotyping for breeding applications in rice using the BeadXpress platform. Mol Breed 29:875–886
4. Fusari CM, Di Rienzo JA, Troglia C et al (2012) Association mapping in sunflower for sclerotinia head rot resistance. BMC Plant Biol 12:93
5. Till BJ, Zerr T, Comai L, Henikoff S (2006) A protocol for TILLING and Ecotilling in plants and animals. Nat Protoc 1:2465–2477
6. Xiao W, Oefner PJ (2001) Denaturing high-performance liquid chromatography: a review. Hum Mutat 17:439–474
7. Till BJ, Colbert T, Tompa R et al (2003) High-throughput TILLING for functional genomics. In: Grotewold E (ed) Plant functional genomics: methods and protocols. Humana, New Hampshire, pp 205–220
8. Till BJ, Burtner C, Comai L, Henikoff S (2004) Mismatch cleavage by single-strand specific nucleases. Nucleic Acids Res 32:2632–2641
9. Fusari CM, Lia VV, Nishinakamasu V, Zubrzycki JE, Puebla AF, Maligne AE, Hopp HE, Heinz RA, Paniego NB (2010) Single nucleotide polymorphism genotyping by heteroduplex analysis in sunflower (Helianthus annuus L.). Mol Breed 28:73–89
Chapter 11
Application of the High-Resolution Melting Technique
for Gene Mapping and SNP Detection in Plants
David Chagné
Abstract
Identifying DNA variations associated with important agronomic traits is a major focus for plant biologists
today. Modern crop breeders use molecular markers widely as tools for selecting new varieties more rapidly
and efficiently. High-Resolution Melting (HRM) is frequently selected as the method of choice to rapidly
and cost effectively detect and genotype SNPs. These SNPs can be used for gene mapping studies and
routinely by breeders.
Key words HRM, SNPs, Gene mapping, Orthologous markers
1 Introduction
A number of crops now possess sufficient genomic resources to
develop tools such as high density single nucleotide polymorphism
(SNP) genotyping arrays from whole genome sequences and rese-
quencing in the germplasm. However, these arrays are often biased
toward a few accessions and give an incomplete figure of the DNA
variations within the species as a whole and of closely related spe-
cies. This is particularly a problem for crops for which selection
schemes are based on the introgression of new genes from wild
species, because these wild accessions are not the ones chosen for
large-scale sequencing initiatives. Therefore, there is a great
demand for techniques that are capable of quickly identifying and
characterizing DNA variations linked to desirable novel characters,
in order to assist in introgressing these traits into new varieties,
without compromising other elements of crop quality. While new
techniques such as genotyping by sequencing [1] bring flexibility
and unbiased SNP genotyping, there is still a great need for fast
and flexible single-locus assays, for example, when a researcher
wants to home in on genetic variations in a defined genomic region
around a gene of interest. The High-Resolution Melting (HRM)
technique [2–4] is the method of choice to detect and genotype
SNPs rapidly and cost effectively, for gene mapping studies and SNP
validation, as well as for developing markers in a form that plant
breeders can use routinely.
2 Principle of the HRM Technique
The attractiveness of HRM lies in its simplicity: PCR fragments
(see Notes 1 and 2) are amplified using unlabeled primer pairs, and
a melting analysis is performed at the end of the PCR reaction in
the presence of a high fidelity dye that binds to double-stranded
DNA (dsDNA). Prior to the high-resolution melting analysis, the
PCR products are denatured at 94–95 °C and then quickly rean-
nealed to 40 °C. This quick melting/reannealing step is the core of
the technique, as it influences the subsequent HRM analysis.
During that critical step, complementary strands for the unique
allele of a homozygous sample reanneal perfectly to form a perfect
complementary dsDNA product (homoduplex). However, in the
case of heterozygous samples that have more than one allele present
in the PCR product, half the alleles reanneal to the complementary
strand of the same allele and the other half reanneal to the comple-
mentary strand of the other allele (Fig. 1). Such nonperfectly
annealed molecules are called heteroduplexes and are less stable
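[Fig. 1 diagram: PCR amplification followed by High Resolution Melting analysis. The PCR product of a homozygous sample (A/T) reanneals as homoduplexes only; that of a heterozygous sample (A/T and G/C at the SNP) reanneals as homoduplexes and mismatched heteroduplexes. Fast melting and re-annealing is followed by slow melting in the presence of the high fidelity dye, then genotype calling.]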
Fig. 1 Principle of the high-resolution melting technique. Fast melting and reannealing promotes the formation
of heteroduplex PCR products for heterozygous individuals. Such heteroduplexes are less stable than homo-
duplexes and melt at a lower temperature
Application of the High-Resolution Melting Technique… 153
than homoduplexes. The last step of the HRM analysis involves
a slow melting of PCR products from 65 to 95 °C while a
high-frequency, high-accuracy fluorescence capture (25 measurements
per 1 °C) is performed. Samples containing heteroduplexes
(heterozygous genotypes) melt at a lower temperature than
homoduplex-containing samples (see Note 3).
Several software programs are available for analyzing HRM data,
usually based on the platform used for HRM, as most real-time PCR
machines capable of HRM have their own HRM analysis module.
However, the general method for the HRM data analysis is the same
whatever the platform. First, the melting profiles for each PCR product
must be normalized, as the amount of PCR product usually varies
across samples (visible as variable fluorescence intensities).
The raw fluorescence values must be normalized in such a way that
dsDNA (i.e. PCR products premelting) and ssDNA (i.e. PCR products
postmelting) are set as 100 and 0 % of bound fluorescent dye,
respectively. Typically normalization is performed by selecting a
region of ~1 °C before and after the dramatic drop in fluorescence,
with special attention paid to selecting regions where all the curves
are parallel and horizontal (Fig. 2a). Following normalization, the
software automatically groups melting curves based on their similarity.
The groupings can be checked visually in three ways. The first is the
position of the normalized melting curves: heterozygous samples usually
melt at a lower temperature than homozygous samples (Fig. 2b). The other
two are the normalized fluorescence difference between each curve and a
reference curve chosen by the user (Fig. 2c) and the melting peaks
calculated from the derivative of this melting curve difference (Fig. 2d).
In the melting peak view, heterozygous samples usually exhibit more than
one melting peak. See Notes 4–9.
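The normalization and the two derived views described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm of any instrument software: the curves are synthetic sigmoids, the window positions and the function names (normalize, difference, melting_peaks) are hypothetical, and the melting peaks are taken here as the conventional negative first derivative of each normalized curve.

```python
import math

def normalize(temps, fluor, pre, post):
    """Rescale a raw melting curve so that the pre-melt window averages 100
    and the post-melt window averages 0 (percent of bound dye)."""
    def window_mean(lo, hi):
        vals = [f for t, f in zip(temps, fluor) if lo <= t <= hi]
        return sum(vals) / len(vals)
    top = window_mean(*pre)
    bottom = window_mean(*post)
    return [100.0 * (f - bottom) / (top - bottom) for f in fluor]

def difference(curve, reference):
    """Difference plot: sample curve minus the user-chosen reference curve."""
    return [c - r for c, r in zip(curve, reference)]

def melting_peaks(temps, curve):
    """Negative first derivative -dF/dT by central differences
    (assumed convention; value i corresponds to temps[i + 1])."""
    return [-(curve[i + 1] - curve[i - 1]) / (temps[i + 1] - temps[i - 1])
            for i in range(1, len(temps) - 1)]

def synthetic_curve(temps, tm, scale):
    """Toy sigmoid melt curve with midpoint tm and arbitrary raw intensity."""
    return [scale / (1.0 + math.exp((t - tm) / 0.4)) + 5.0 for t in temps]

# 75-85 degrees C sampled at 25 readings per degree, as in the text.
temps = [75.0 + i / 25 for i in range(10 * 25 + 1)]
raw_hom = synthetic_curve(temps, 81.0, 40.0)   # hypothetical homozygote
raw_het = synthetic_curve(temps, 80.2, 55.0)   # hypothetical heterozygote

# Hypothetical ~1 degree windows before and after the drop in fluorescence.
pre, post = (76.0, 77.0), (84.0, 85.0)
norm_hom = normalize(temps, raw_hom, pre, post)
norm_het = normalize(temps, raw_het, pre, post)

diff = difference(norm_het, norm_hom)
max_abs_diff = max(abs(d) for d in diff)

peaks_hom = melting_peaks(temps, norm_hom)
tm_hom = temps[1 + peaks_hom.index(max(peaks_hom))]
```

In this toy example the two curves differ in raw intensity, yet after normalization only the shift in melting temperature remains, which is what the difference plot displays.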
3 HRM in a Heterozygous Species: An Example in Apple
Candidate SNPs were selected using a bioinformatics search
of contiguous alignments of public apple ESTs [5]. PCR primers
flanking the SNPs were designed for three ESTs: EB130988,
CN849946, and CO868260, yielding PCR products 119, 131,
and 110 bp long, respectively. Each PCR product spanned two
predicted SNPs.
Template genomic DNA for PCR amplification of the three
markers was purified from a subset of individuals from an apple
segregating population [6], using a CTAB protocol. The subset of
individuals comprised both parents (“M.9” and “Robusta 5”) and 14
full-sib seedlings.
PCR reactions were performed in a total volume of 10 μl,
using the LightCycler® 480 High Resolution Melting Master with
2.5 mM MgCl2, 0.2 μM of each primer, and 2.5 ng of genomic DNA.
Fig. 2 High-Resolution Melting analysis. (a) Normalization of the melting curves. Regions before and after the
PCR product melting were selected to normalize the curves (highlighted in gray). (b) Normalized melting
curves. Melting curves are grouped based on their similarity, which enables identification of genotypes. The
example shows one homozygous (green) and three heterozygous (pink, red, and blue) types. The homozygous
type melted at higher temperature than the heterozygous types because of the presence of unstable hetero-
duplexes in the heterozygous samples. (c) Normalized difference plot. The reference melting curve is the green
homozygous curve. (d) Melting peaks. Heterozygous types present more than one melting peak. The
LightCycler® 480 (Roche) software (gene scanning module) was used for this HRM analysis
All three markers were amplified using the same PCR conditions
(see Note 10). The high-resolution melting analysis was performed
immediately after the PCR amplification, with 25 acquisitions per
degree Celsius.
All three markers were polymorphic (Fig. 3); however, different
segregation types were observed, depending on the genotype of
the parents. In the first example, a backcross type segregation was
observed (Fig. 3a), with two genotypes and melting curves observed
in the progeny. In the second type of segregation, both parents
were heterozygous with the same genotype, yielding three genotypes
in the progeny (Fig. 3b). Both homozygous genotypes were
discriminated with this marker. Most interestingly, the HRM tech-
nique detected genotypic differences resulting from a more com-
plex segregation pattern, where two SNPs were located within the
PCR amplicon, yielding four different genotypes in the progeny
(Fig. 3c). As all three types of segregation are common in outcross-
ing and highly heterozygous species, this experiment demonstrates
the usefulness of the HRM technique for detecting and genotyping
SNPs in plants.
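The three segregation types can be reproduced by enumerating the gametes of each parent. The sketch below is illustrative only: the parental genotypes are hypothetical examples, not the genotypes observed for the apple markers, and progeny_classes is a name chosen here for convenience.

```python
from collections import Counter
from itertools import product

def progeny_classes(parent1, parent2):
    """Count the unordered progeny genotypes expected from two diploid
    parents, each given as a pair of alleles or haplotypes."""
    return Counter(tuple(sorted(pair)) for pair in product(parent1, parent2))

# Backcross type: one heterozygous and one homozygous parent.
backcross = progeny_classes(("A", "G"), ("A", "A"))

# Both parents heterozygous with the same genotype.
intercross = progeny_classes(("A", "G"), ("A", "G"))

# Two SNPs in one amplicon: haplotypes are written as two-letter strings
# (hypothetical), and the parents carry three distinct haplotypes in total.
two_snps = progeny_classes(("AC", "GC"), ("AT", "AC"))

for name, classes in [("backcross", backcross),
                      ("intercross", intercross),
                      ("two SNPs", two_snps)]:
    print(name, len(classes), dict(classes))
```

Each distinct key corresponds to one expected melting-curve group, so the counts of two, three, and four classes mirror the patterns in Fig. 3a–c.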
4 HRM Efficiency in Haploid Samples
Although the HRM technique is very sensitive and discriminates very
well between hetero- and homozygous genotypes, the technique is
less efficient for detecting differences between homozygous genotypes.
This slightly limits the use of the HRM technique for doubled
haploid segregating populations or for studies of haploid
chloroplast markers. One way to increase the efficiency and
accuracy of homozygous-type segregation analysis using HRM is
to spike all the samples in an experiment with the same haploid sample.
Mixing one haplotype with a different one generates
pseudo-heterozygous melting profiles, thereby making use of the
power of HRM for discriminating hetero- versus homozygous
samples as discussed earlier.
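The logic of the spiking approach can be illustrated with a minimal sketch. It assumes that the amplicon sequences are known and of equal length, which is rarely the case in practice; the sequences, sample names, and function names are all hypothetical.

```python
def mismatch_positions(seq_a, seq_b):
    """Positions at which two equal-length amplicon sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("amplicons must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

def spike_with_reference(samples, reference):
    """Predict the melting class of each homozygous or haploid sample after
    mixing with the reference: a mismatch allows heteroduplexes to form."""
    calls = {}
    for name, seq in samples.items():
        if mismatch_positions(seq, reference):
            calls[name] = "pseudo-heterozygous"
        else:
            calls[name] = "homoduplex only"
    return calls

reference = "ACGTTAGCCATG"            # hypothetical spiked-in haplotype
samples = {
    "DH_01": "ACGTTAGCCATG",          # same haplotype as the reference
    "DH_02": "ACGTTGGCCATG",          # differs at one SNP
}
calls = spike_with_reference(samples, reference)
```

Samples carrying the reference haplotype keep a homozygous profile, whereas samples carrying the alternative haplotype acquire a heterozygous-like profile that HRM separates more readily.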
5 Orthologous Markers
SNP markers are not easily transferable between species, as the
same nucleotide is unlikely to be polymorphic in two species, even
if closely related. However, as the HRM technique is not specific
to a particular SNP within a PCR amplicon, orthologous PCR
primer pairs can be used to cross-amplify orthologous loci between
species, provided that SNPs are present within the PCR amplicons.
Therefore, the HRM technique has great potential for orthologous
marker development.
Fig. 3 High-Resolution Melting profile for three SNP markers in a segregating population of apple seedlings.
PCR products were amplified over the two parents (top left-hand wells of the first column for each marker) and
14 individuals of the F1 progeny (“M.9” × “Robusta 5”). Distinctive melting profiles were obtained where the
marker was present in two (a), three (b), or four (c) genotypes in the progeny. The underlying SNP genotypes
of both parents and the progeny are shown alongside each melting profile
6 Conclusion
The HRM technique provides a quick and inexpensive strategy to
develop new markers in plants, with no post-PCR separation, such
as gel electrophoresis, required. Combining this method with a
careful choice of candidate genes, for example, based on expression
profiles in extreme individuals, should make it possible to identify
the polymorphisms responsible for important traits. We have found
that the HRM technique has great potential for improving selec-
tion and understanding genome functioning in crop plants.
7 Notes
1. Keep your PCR products short. HRM is more efficient with
short PCR fragments. Moreover, longer PCR products are
more likely to span more than one SNP,
which will generate complex segregation patterns that have
more than two haplotypes (Fig. 3c). If you must map a specific
candidate gene, you should tile it with multiple primer pairs
yielding short amplicons.
2. For PCR primer design, minimize dimers and hairpins. Dimers
and hairpins will themselves melt and interfere with the melting curve of
the targeted amplicon.
3. You may want to multiplex several PCR products within the same
reaction. I would not recommend this, as the melting peaks are
likely to interact and make the analysis more challenging.
4. You may want to genotype microsatellites using HRM. I have
performed this, but the success rate is much lower than for SNPs.
Remember that PCR products spanning microsatellites display
slippage bands or peaks during electrophoresis. The HRM
profile will be a mix of melting between alleles and slippage frag-
ments, which is likely to make analysis of the melting profile
extremely challenging.
5. Try to use DNA samples that are extracted using the same
technique for your HRM analysis. DNA prepared using different
methods may have different inhibiting compounds present
that will influence the melting profile. I have often observed
very different melting curves between PCR products obtained
from CTAB and column-based extracts.
6. Use a segregating population for evaluating your HRM markers
the first time you try this. Tracing the genetics helps you to
understand complex melting patterns.
7. There is no magic rule that can be applied to HRM profile
analysis; however, when I look at a normalized fluorescence
difference graph, I use a threshold of 3 (e.g. on the Y axis in
Fig. 2c) between melting curves to decide whether a melting
difference is real or not.
8. A good melting curve is a melting curve that is reproducible.
If you have only one sample that displays a specific melting
curve, treat it with caution. This different melting curve may
be due to an artifact influencing the PCR.
9. If you run HRM on a real-time PCR machine, acquire fluores-
cence at each cycle to verify the efficiency of the reaction, as for
a gene expression quantitative PCR analysis. Do not melt frag-
ments that have not reached the plateau phase and conversely
avoid excessive numbers of PCR cycles, to reduce the risk of
PCR artifacts that will be detected by the HRM.
10. PCR conditions for High-Resolution Melting:
Initial denaturation at 95 °C for 5 min.
PCR amplification (40 cycles):
95 °C for 10 s.
Annealing temperature specific to primer pair for 30 s.
72 °C for 15 s (with fluorescence acquisition).
HRM analysis:
95 °C for 1 min.
40 °C for 1 min.
65–95 °C at slow ramping rate (with continuous fluorescence
acquisition).
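As a worked illustration of Note 10, the program can be written down as a small data structure and used for simple arithmetic. This is a sketch, not an instrument file format: the dictionary layout and function names are hypothetical, ramping times are ignored, and the 25 acquisitions per degree Celsius are taken from the apple example above.

```python
# Thermal profile from Note 10 as (temperature in degrees C, hold in seconds).
PROFILE = {
    "initial_denaturation": [(95, 300)],
    "cycles": 40,
    "cycle_steps": [(95, 10), ("primer-specific", 30), (72, 15)],
    "pre_melt": [(95, 60), (40, 60)],
    "melt": {"start_c": 65, "end_c": 95, "acquisitions_per_c": 25},
}

def hold_seconds(profile):
    """Total programmed hold time, excluding ramping and the final melt."""
    once = sum(s for _, s in profile["initial_denaturation"] + profile["pre_melt"])
    per_cycle = sum(s for _, s in profile["cycle_steps"])
    return once + profile["cycles"] * per_cycle

def melt_acquisitions(melt):
    """Approximate number of fluorescence readings over the melting range."""
    return (melt["end_c"] - melt["start_c"]) * melt["acquisitions_per_c"]

total_hold = hold_seconds(PROFILE)               # 300 + 40 * 55 + 120 seconds
n_readings = melt_acquisitions(PROFILE["melt"])  # 30 degrees * 25 readings
```

Under these assumptions the holds add up to 2,620 s (about 44 min) and each well yields roughly 750 readings over the 65–95 °C melt.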
Acknowledgements
I would like to thank Susan E. Gardiner for useful comments on
the manuscript, Roche Applied Science New Zealand for great
technical support, and Plant & Food Research for various sources
of funding.
References
1. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE (2011) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:e19379
2. Liew M, Nelson L, Margraf R, Mitchell S, Erali M, Mao R, Lyon E, Wittwer C (2006) Genotyping of human platelet antigens 1 to 6 and 15 by high-resolution amplicon melting and conventional hybridization probes. J Mol Diagn 8:97–104
3. Liew M, Pryor R, Palais R, Meadows C, Erali M, Lyon E, Wittwer C (2004) Genotyping of single-nucleotide polymorphisms by high-resolution melting of small amplicons. Clin Chem 50:1156–1164
4. Montgomery J, Wittwer CT, Kent JO, Zhou LM (2007) Scanning the cystic fibrosis transmembrane conductance regulator gene using high-resolution DNA melting analysis. Clin Chem 53:1891–1898
5. Newcomb RD, Crowhurst RN, Gleave AP, Rikkerink EHA, Allan AC, Beuning LL, Bowen JH, Gera E, Jamieson KR, Janssen BJ et al (2006) Analyses of expressed sequence tags from apple. Plant Physiol 141:147–166
6. Celton J-M, Tustin DS, Chagné D, Gardiner SE (2009) Construction of a dense genetic linkage map for apple rootstocks using SSRs developed from Malus ESTs and Pyrus genomic sequences. Tree Genet Genomes 5:93–107
Chapter 12
Challenges of Genotyping Polyploid Species

Annaliese S. Mason
Abstract
Most plant species are known to be either ancient or recent polyploids, containing more than one genome
as a result of past interspecific hybridization events (allopolyploidy) and/or genome doubling (autopoly-
ploidy). Genotyping in polyploid species offers a set of unique challenges. Most molecular marker meth-
odologies are made more complex by polyploidy, as multilocus alleles are generally produced when a single
locus is targeted. Genotyping by sequencing is also more challenging in polyploids, with problematic
assemblies of duplicated regions and difficulties in distinguishing between inter- and intragenomic poly-
morphisms. Strategies for identifying and overcoming the challenges of polyploidy in plant genotyping
are proposed.
Key words Polyploidy, Multilocus amplification, Hybridization, Marker genotyping, Homeology
1 Introduction
Polyploidy, or presence of more than one genome (set of chromosomes)
within a single organism, is now known to be extremely
prevalent in plants and particularly angiosperms, with evidence for
many recent polyploidy events and common ancient polyploidy
events in angiosperm and seed plant lineages [1]. Many cultivated
crop species are also polyploid, such as canola, potato, cotton, and
sugarcane, and still others contain remnant homeology as a result
of ancestral polyploidy events, such as rice and maize [2, 3].
Polyploidy occurs either as a result of interspecific hybridization,
when two species hybridize to form a new species with both paren-
tal genomes (allopolyploidy), or as a result of abnormal meiosis or
somatic doubling within a species to create a new cytotype with
two copies of the original genome (autopolyploidy) [4]. The end
result of either form of polyploidy is the duplication of genomic
DNA sequence. This is true of allopolyploids as well as autopolyploids:
if two species are closely enough related to hybridize naturally,
it is likely that their divergence from a common ancestor has
not progressed far enough to completely obliterate the genomic
relationship (homeology) between the two species. Allopolyploids
formed from more closely related species will show greater duplication
of genomic regions (greater sequence homeology) than allopolyploids
formed from species that have diverged over greater
evolutionary time periods. For recent autopolyploids the two
genomes are expected to be identical, although over time DNA
loss due to genome downsizing generally reduces the level of
genome duplication [5]. This genome duplication effect of poly-
ploidization offers some serious challenges to plant genotyping.
Genotyping involves the identification of allelic polymorphisms,
generally in DNA sequence. These polymorphisms are identified
using a combination of molecular markers (PCR-based and/or
restriction enzyme-based) and DNA sequence analysis. Purely
PCR-based molecular markers include RAPDs (Randomly
Amplified Polymorphic DNA) and SSRs (Simple Sequence Repeats).
RFLPs (Restriction Fragment Length Polymorphisms) are the most
popular form of restriction enzyme-based molecular marker, and
markers such as CAPS (Cleaved Amplified Polymorphic Sequences)
and AFLPs (Amplified Fragment Length Polymorphisms) use both
PCR-based amplification and restriction enzyme digests to produce
marker alleles. DNA sequence-based genotyping can take the form
of direct sequencing of gene sequences and genomic regions to
identify differences, with popular genomic regions for this purpose
including ITS (Internal Transcribed Spacer) regions, rDNA loci,
and chloroplasts. SNPs (Single Nucleotide Polymorphisms) can
also be identified in the DNA sequence and used as marker alleles
for genotyping. All genotyping methods that rely on the amplifica-
tion of polymorphic alleles from genomic sequence can be affected
by polyploidy in the target species, with the degree of effect com-
mensurate with the degree of sequence similarity between dupli-
cated genomic regions.
2 Detecting if Polyploidy Is Affecting Marker Genotyping in the Species of Interest
For many species, and especially in important crops and model
plants, the polyploid status of the genome is well known and characterized.
In the crop Brassicas, for example, the multiple ancient
and recent polyploidy events leading to current species karyotypes
(chromosome complements) are known in enough detail to map
the regions of ancient and recent homeology across the genomes
[6, 7], and some genomes have also been sequenced and assem-
bled [8]. However, this level of information is not available for
most species, and as next-generation sequencing and generation of
molecular markers become quicker and easier, it becomes increasingly
likely that species with previously unknown polyploid
lineages will be selected for marker genotyping. How then do we
detect polyploidy from marker genotyping results?
The first, most obvious sign of polyploidy in marker genotyping
is the observation of multiple bands from putative single-locus markers.
If more than two bands are consistently observed in putative
single-locus assays, such as RFLPs or SSRs, this may be an indication
of genome duplication in the target species. In Brassica napus, an
allotetraploid species of recent origin, the majority of SSR markers
discovered amplify both homeologous loci, with two to four bands
(or more) depending on locus heterozygosity per marker [9, 10]. Of
course, whole genome duplications are not necessarily the cause of
multiple bands in genotyping assays. Small-scale sequence duplica-
tions may be due to mechanisms such as nonhomologous transloca-
tions or retrotransposon activation [11], or primer sequence or
enzyme restriction sites may not be unique just by chance. However,
if a large percentage of locus-specific genotyping assays return mul-
tiple genome hits or sequence products in the target species, poly-
ploidy is very likely the cause. For genotyping-by-sequencing assays,
the first indication of polyploidy may be the presence of more than
two haplotypes for single genomic regions, with additional difficulty
in assembling sequence reads [12].
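The screening logic described above can be sketched as a simple tally over a marker panel. The example is hypothetical throughout: the marker names and band counts are invented, and the rule that a marker is flagged when at least half of the individuals show more than two bands is an arbitrary choice made here to represent "consistently".

```python
def multiband_markers(band_counts, max_bands=2, min_share=0.5):
    """Return the markers for which at least min_share of the individuals
    show more than max_bands bands. A diploid individual should show at
    most two alleles at a true single locus."""
    flagged = []
    for marker, counts in band_counts.items():
        share = sum(c > max_bands for c in counts) / len(counts)
        if share >= min_share:
            flagged.append(marker)
    return flagged

# Hypothetical band counts per individual for four putative single-locus SSRs.
band_counts = {
    "ssr_01": [3, 4, 3, 4],
    "ssr_02": [1, 2, 2, 1],
    "ssr_03": [4, 4, 3, 3],
    "ssr_04": [2, 3, 2, 2],
}

flagged = multiband_markers(band_counts)
fraction = len(flagged) / len(band_counts)
```

A single flagged marker may simply reflect a local duplication or a nonunique primer site; it is a high flagged fraction across many independent markers that points toward genome duplication.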
Polyploidy may also be detectable through distinctive pheno-
types when comparing closely related species. Polyploids generally
have larger cell sizes, conventionally detected through stomata
measurements and measurements of pollen size in plants [13, 14].
Polyploids will also fundamentally contain more chromosomes and
DNA than their diploid relatives, and this may be detected via
chromosome counting or through methods such as flow cytome-
try, which provide a rough estimate of nuclear DNA content [15].
Several species naturally occur in either hybrid or polyploid species
complexes, in which members of a species can differ in ploidy level,
or whereby two or more closely related species readily hybridize to
form what appears to be a third species in the same region [16–18].
Variations on this theme are many and complicated [17] and high-
light the importance of investigating related species, particularly
those that share geographic locations, in detecting polyploidy and
hybridization events.
3 Identifying Allelic Variants for Genotyping in Polyploids: Homeologous Loci and False Polymorphisms
The first main challenge of genotyping polyploids is the correct
identification of DNA polymorphisms as allelic variants. As polyploidy
fundamentally results in duplication of DNA sequence, sites
for enzyme restriction and primers for PCR-based sequence ampli-
fication may be present in multiple copies. Identification of allelic
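[Fig. 1 diagram: the two homologues of Ch. 1 carry alleles a B C D and a b c d; the two homologues of Ch. 2 carry alleles A B D and A B d; Ch. 1 and Ch. 2 are homoeologues of each other.]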
Fig. 1 Cartoon of possible locus configurations in a 2n = 4 species with ancestral polyploidy and hence genomic
duplications between Chromosome 1 (Ch. 1) and Chromosome 2 (Ch. 2). Alleles “A” and “a” represent a false
polymorphism, whereby Ch. 1 and Ch. 2 have homeologous homozygous loci, but marker assays may detect
a single polymorphic locus. Alleles “B” and “b” may constitute a useful polymorphic locus on Ch. 1, but only in
codominant marker assays. Alleles “C” and “c” represent a polymorphic chromosome-specific locus, the most
desirable type of locus in a species with genome duplication. Alleles “D” and “d” may only be detectable as
being present at multiple loci due to distorted allele segregation ratios or through sequence validation
variation for genotyping generally relies on some form of DNA
sequencing. However, homeologous regions are difficult to distinguish
from homologous regions in DNA sequence analysis and
assembly [19]. Hence, allelic variation identified in polyploids may
be due to differences between homeologous sequences rather than
between homologous chromosomes: polymorphisms may be
attributable to the presence of two slightly different copies of a
DNA sequence at different genomic locations, rather than an
allelic difference at a single homologous locus (Fig. 1). This
homeologous variation may be fixed within populations, with no
variation between individuals: sequencing methods for detecting
allelic variation will often incorrectly identify homeologous
intergenomic markers as polymorphic. This form of overestima-
tion is highly prevalent in methods such as SNP discovery in poly-
ploids [20, 21], especially when using next-generation sequencing
for SNP discovery.
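One simple screen for such false polymorphisms follows directly from the observation that homeologous variation can be fixed within a population: a putative SNP at which every individual appears heterozygous for the same two bases is suspect. The sketch below assumes genotype calls are available as two-letter strings; the SNP names and calls are hypothetical, and the check is most meaningful in inbred or otherwise largely homozygous material.

```python
def looks_homeologous(genotypes):
    """True when every scored individual shows the same apparent
    heterozygote, as expected for a fixed difference between two
    homeologous loci rather than for alleles segregating at one locus."""
    scored = [g for g in genotypes if g is not None]
    if not scored:
        return False
    first = frozenset(scored[0])
    return len(first) == 2 and all(frozenset(g) == first for g in scored)

# Hypothetical calls for three putative SNPs across four individuals.
snp_calls = {
    "snp_1": ["AG", "AG", "GA", "AG"],   # same heterozygote in all individuals
    "snp_2": ["AA", "AG", "GG", "AG"],   # varies between individuals
    "snp_3": ["CC", "CC", None, "CC"],   # monomorphic, one missing call
}

suspect = [name for name, calls in snp_calls.items() if looks_homeologous(calls)]
```

A flagged SNP is not proven to be homeologous; the result only prioritizes loci for the validation approaches discussed later in this chapter.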
Designing primers for PCR-based marker genotyping is also
complicated by genome duplication: designing primers specific to
one genome can be difficult in polyploids. Primers will often amplify
homeologous regions, as well as the target region, depending on
the degree of divergence in the primer binding sites between homeologous
DNA regions and the primer binding specificity. In relatively
recent polyploids with highly similar genomes, such as wheat,
canola, and hexaploid oats, up to 50 % of SNPs identified from
sequence data turn out to be due to polymorphisms between
homeologous regions rather than between alleles at the
same locus [21–23]. Enzyme restriction sites may also be present in
multiple copies in polyploids but, depending on the marker genotyping
method, may pose less of a problem. Marker genotyping
methods that rely on amplification of alleles that do not have known
genomic locations will be less directly affected by polyploidy.
However, lack of knowledge of the polyploidy event will affect all
inferences made from this data, from false heterozygosity due to
homeology in predicting genetic diversity to segregation issues in
association mapping.
4 Resolving the Challenges Faced by Polyploidy in Genotyping
4.1 Investigating the Origin of the Polyploidy Event Through Phylogenetic Analysis
In some species and marker systems, the effects of polyploidy can
be partially or fully resolved by the addition of different types of
control samples. In some cases, it can be very helpful to include
species related to the species of interest in the marker assays. In
Brassica, the allopolyploid species B. napus (canola, 2n = AACC),
B. juncea (Indian mustard, 2n = AABB), and B. carinata
(2n = BBCC) contain the A, B, and C genomes essentially unrearranged
relative to the genomes of the extant diploid species
B. rapa (2n = AA), B. oleracea (2n = CC), and B. nigra (2n = BB).
Hence, inclusion of the diploid species in assays provides valuable
information about genome location in the allopolyploids. For
example, an allele observed in B. napus (2n = AACC) and B. rapa
(2n = AA) but not B. oleracea (2n = CC) may be presumed to be
located in the Brassica A genome. Progenitor species and hence
genomic controls for bread wheat (2n = AABBDD) also exist, such
as durum wheat (2n = AABB) and D genome grass species.
However, this level of genomic relationship and hence ability to
control for genome location in polyploids is unfortunately uncom-
mon. Regardless, in many species with hitherto unsuspected poly-
ploidy the interrogation of related species will still yield valuable
insights into the nature of the polyploidy event. Many polyploid
species are actually hybrids (or allopolyploids) between related spe-
cies, and tracking these origins will be invaluable for further geno-
typing, for example in designing subgenome specific primers in the
hybrid species. Investigation of more distantly related species may
also yield information as to how recent the polyploidy event was in
the history of the species, and hence what the expected level of
genome duplication is. If a whole genus or family exhibits the same
polyploid phenotype, the polyploidy event may be predicted to be
less recent than if a single species displays genome duplication.
Genotyping using less variable markers, such as RFLPs, may be
particularly useful in elucidating these relationships. More highly
variable markers such as SSRs and SNPs may be of more limited
utility in phylogenetic analyses, but provide far more useful information
within species populations and species complexes.
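The use of diploid progenitors as genome controls, as in the B. napus example above, amounts to a lookup of each allele against the alleles scored in each diploid. The following sketch is illustrative: the allele sizes are invented, and assign_subgenome is a hypothetical helper rather than part of any published pipeline.

```python
def assign_subgenome(allele, diploid_alleles):
    """Assign an allele to a subgenome when it is seen in exactly one
    diploid control; otherwise report it as ambiguous or unassigned."""
    hits = sorted(genome for genome, alleles in diploid_alleles.items()
                  if allele in alleles)
    if len(hits) == 1:
        return hits[0]
    return "ambiguous" if hits else "unassigned"

# Hypothetical SSR allele sizes scored in the diploid controls.
diploid_alleles = {
    "A": {"150bp", "162bp"},   # B. rapa (2n = AA)
    "C": {"171bp", "162bp"},   # B. oleracea (2n = CC)
}

# Hypothetical alleles observed in B. napus (2n = AACC).
napus_alleles = ["150bp", "171bp", "162bp", "180bp"]
assignments = {a: assign_subgenome(a, diploid_alleles) for a in napus_alleles}
```

An allele absent from both controls is left unassigned rather than guessed, since a small panel of diploid accessions cannot capture all of the alleles present in either progenitor species.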
4.2 Mapping
Mapping populations are an extremely useful genotyping tool, and
this is especially true in polyploid species. Production of a mapping
population involves crossing two genetically distinct parent indi-
viduals to produce an F1 individual or population, then detection
of allelic cosegregation in resulting progeny. In plants, the process
can be facilitated in many species by production of homozygous
parent lines and by use of microspore culture on the F1 to
immediately examine gametic segregation in the first generation
using doubled-haploid microspore-derived plants. In the case of
markers that produce multiple alleles at different homeologous
loci, creation of a linkage map can allow each allele to be mapped
to a separate subgenome and genomic location. Segregation of
two alleles in a mapping population also offers definite evidence
that both alleles are present at the same locus. Hence, for species
where little genomic information is present, production of a map-
ping population can be invaluable in resolving complexities caused
by genome duplication.
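The segregation argument can be made concrete for a doubled-haploid population. Under the assumption of a single locus heterozygous in the F1, each doubled-haploid line should carry exactly one of the two alleles, in a 1:1 ratio. The sketch below tests both conditions; the scores are hypothetical, and the critical value of 3.841 is the standard chi-square threshold for one degree of freedom at the 5 % level.

```python
CRITICAL_5PCT_DF1 = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05

def chi_square_1to1(n_first, n_second):
    """Chi-square statistic for a 1:1 expectation."""
    expected = (n_first + n_second) / 2
    return ((n_first - expected) ** 2 + (n_second - expected) ** 2) / expected

def same_locus_evidence(scores):
    """scores holds one (has_allele_1, has_allele_2) pair per doubled-haploid
    line. Returns (consistent_with_one_locus, chi_square or None)."""
    if not scores:
        raise ValueError("no lines scored")
    # Alleles at one locus are mutually exclusive in a doubled haploid.
    if not all(a != b for a, b in scores):
        return False, None
    n_first = sum(1 for a, _ in scores if a)
    n_second = len(scores) - n_first
    chi2 = chi_square_1to1(n_first, n_second)
    return chi2 < CRITICAL_5PCT_DF1, chi2

# Hypothetical populations of 40 doubled-haploid lines.
allelic = [(True, False)] * 22 + [(False, True)] * 18   # one locus, about 1:1
homeologous = [(True, True)] * 40                        # both bands everywhere

allelic_result = same_locus_evidence(allelic)
homeologous_result = same_locus_evidence(homeologous)
```

Passing this test is consistent with a single locus but does not by itself place the locus on a subgenome; that requires linkage to other mapped markers.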
4.3 Validation and Sequencing
Sequencing alleles used for genotyping analysis is a common process
in the production of some marker types, such as SSRs [24] and
SNPs [21]. In this approach, the actual DNA sequence of the allele
and surrounding region is validated by sequencing. Poor match of
surrounding sequences (e.g., flanking sequences of SSRs) between
alleles is indicative of amplification of two homeologous loci, rather
than amplification of two alleles from a single locus. Sequence mis-
matches may be either difficult or easy to detect: in recent poly-
ploids, high conservation of homeologous sequence is expected.
Use of markers localized to noncoding DNA sequences may be
more helpful in this instance, as coding regions of DNA sequence
(such as may be provided by RNAseq technology) may be expected
to be more conserved between homeologous regions. However, as
DNA and RNA sequencing becomes more common and easily
accessible, availability of genomic sequence information against
which to compare sequenced marker alleles is improving. In particu-
lar, genotyping by sequencing approaches provide a ready means of
screening for allelic haplotypes in populations, and hence differenti-
ating “normal” homologous variation between alleles at a single
locus from homeologous variation between duplicated loci.
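The flanking-sequence comparison can be sketched as a percent identity calculation. This assumes the two flanking sequences are already aligned and of equal length; the sequences are invented, and the 98 % cutoff is a hypothetical value chosen for illustration. As noted above, recent polyploids may show such high conservation that no simple cutoff is reliable.

```python
def percent_identity(seq_a, seq_b):
    """Percent of identical positions between two aligned sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned, equal in length and non-empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

def likely_homeologous(flank_a, flank_b, min_identity=98.0):
    """Flag a pair of alleles whose flanking sequences match poorly,
    suggesting two homeologous loci rather than one locus.
    The default cutoff is hypothetical."""
    return percent_identity(flank_a, flank_b) < min_identity

# Hypothetical 20-bp flanking sequences of two SSR alleles.
flank_allele_1 = "ACGTACGTACGTACGTACGT"
flank_allele_2 = "ACGTTCGTACGTACCTACGT"   # two mismatches

identity = percent_identity(flank_allele_1, flank_allele_2)
flagged = likely_homeologous(flank_allele_1, flank_allele_2)
```

In practice the flanks would first be aligned with a dedicated alignment tool, and indels would need to be handled explicitly.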
5 Summary
It is important for researchers to be aware of the effects of polyploidy
and hybridization when investigating previously unknown
genomes. The confounding effects of these common processes
can be significant, causing problems with genotyping assays
and also data analyses, many of which have been developed for
diploid species. Investigation of related species, awareness of the
phenotypic cues of polyploidy, and chromosome counts are all
sound precautionary measures in polyploidy detection. Presence of
multiple bands in marker assays or presence of multiple sequence
haplotypes may provide evidence of polyploidy in the species of
interest. Controlling for the effects of polyploidy and hybridization
can be done through use of related species as controls, generation
of linkage mapping populations, and screening or sequencing
alleles to validate subgenome specificity.
Genotyping polyploid species can be difficult. However, as
sequencing technology becomes increasingly accessible and bioin-
formatics tools evolve, resolving the issues related to genome
duplication events will become easier. Polyploidy and hybridiza-
tion are fascinating evolutionary processes, and investigation of
polyploid or hybrid species populations through genotyping can
yield a wealth of interesting information.
References
1. Jiao YN, Wickett NJ, Ayyampalayam S, Chanderbali AS, Landherr L, Ralph PE et al (2011) Ancestral polyploidy in seed plants and angiosperms. Nature 473:97–100
2. Leitch AR, Leitch IJ (2008) Genomic plasticity and the diversity of polyploid plants. Science 320:481–483
3. Soltis DE, Buggs RJA, Doyle JJ, Soltis PS (2010) What we still don't know about polyploidy. Taxon 59:1387–1403
4. Harlan JR, deWet JMJ (1975) On Ö. Winge and a prayer: the origins of polyploidy. Bot Rev 41:361–390
5. Leitch IJ, Bennett MD (2004) Genome downsizing in polyploid plants. Biol J Linn Soc 82:651–663
6. Schranz ME, Lysak MA, Mitchell-Olds T (2006) The ABC's of comparative genomics in the Brassicaceae: building blocks of crucifer genomes. Trends Plant Sci 11:535–542
7. Nelson MN, Parkin IAP, Lydiate DJ (2011) The mosaic of ancestral karyotype blocks in the Sinapis alba L. genome. Genome 54:33–41
8. Wang XW, Wang HZ, Wang J, Sun RF, Wu J, Liu SY et al (2011) The genome of the mesopolyploid crop species Brassica rapa. Nat Genet 43:1035–1040
9. Mason AS, Nelson MN, Castello M-C, Yan G, Cowling WA (2011) Genotypic effects on the frequency of homoeologous and homologous recombination in Brassica napus × B. carinata hybrids. Theor Appl Genet 122:543–553
10. Nelson MN, Mason AS, Castello M-C, Thomson L, Yan G, Cowling WA (2009) Microspore culture preferentially selects unreduced (2n) gametes from an interspecific hybrid of Brassica napus L. × Brassica carinata Braun. Theor Appl Genet 119:497–505
11. Zou J, Fu DH, Gong HH, Qian W, Xia W, Pires JC et al (2011) De novo genetic variation associated with retrotransposon activation, genomic rearrangements and trait variation in a recombinant inbred line population of Brassica napus derived from interspecific hybridization with Brassica rapa. Plant J 68:212–224
12. Imelfort M, Edwards D (2009) De novo sequencing of plant genomes using second-generation technologies. Brief Bioinform 10:609–618
13. Chen S, Nelson MN, Chèvre AM, Jenczewski E, Li Z, Mason AS et al (2011) Trigenomic bridges for Brassica improvement. Crit Rev Plant Sci 30:524–547
14. Veilleux RE, Lauer FI (1981) Variation for 2n pollen production in clones of Solanum phureja Juz. and Buk. Theor Appl Genet 59:95–100
15. Mason AS, Yan GJ, Cowling WA, Nelson MN (2012) A new method for producing allohexaploid Brassica through unreduced gametes. Euphytica 186:277–287
16. Zohary D, Nur U (1959) Natural triploids in the orchard grass, Dactylis glomerata L., polyploid complex and their significance for gene flow from diploid to tetraploid levels. Evolution 13:311–317
17. Mallet J (2007) Hybrid speciation. Nature 446:279–283
18. Petit C, Bretagnolle F, Felber F (1999) Evolutionary consequences of diploid-polyploid hybrid zones in wild species. Trends Ecol Evol 14:306–311
19. Imelfort M, Duran C, Batley J, Edwards D (2009) Discovering genetic polymorphisms in next-generation sequencing data. Plant Biotechnol J 7:312–317
20. Duran C, Appleby N, Edwards D, Batley J (2009) Molecular genetic markers: discovery, applications, data storage and visualisation. Curr Bioinform 4:16–27
21. Hayward A, Mason AS, Dalton-Morgan J, Zander M, Edwards D, Batley J (2012) SNP discovery and applications in Brassica napus. J Plant Biotechnol 39:49–61
22. Lai KT, Duran C, Berkman PJ, Lorenc MT, Stiller J, Manoli S et al (2012) Single nucleotide polymorphism discovery from wheat next-generation sequence data. Plant Biotechnol J 10:743–749
23. Oliver RE, Lazo GR, Lutz JD, Rubenfield MJ, Tinker NA, Anderson JM et al (2011) Model SNP development for complex genomes based on hexaploid oat using high-throughput 454 sequencing technology. BMC Genomics 12
24. Squirrell J, Hollingsworth PM, Woodhead M, Russell J, Lowe AJ, Gibby M et al (2003) How much effort is required to isolate nuclear microsatellites from plants? Mol Ecol 12:1339–1348
Chapter 13
Genomic Reduction Assisted Single Nucleotide
Polymorphism Discovery Using 454-Pyrosequencing
Peter J. Maughan, Joshua A. Udall, and Eric N. Jellen
Abstract
We report the development of a simple genomic reduction protocol based on 454-pyrosequencing
technology that discovers large numbers of single nucleotide polymorphisms (SNP) from pooled DNA
samples. The method is based on the conservation of restriction endonuclease sites across samples and
biotin separation for genomic reduction and the addition of multiplex identifier (MID) barcodes to each
of the pooled samples to allow for postsequencing deconvolution of the pooled DNA fragments and
SNP discovery.
Key words Single nucleotide polymorphism (SNP), 454-Pyrosequencing, Genomic reduction
1 Introduction
Single nucleotide polymorphisms (SNPs), defined as single-base
changes, are the most abundant type of sequence variation in
eukaryotic genomes [1, 2]. A single SNP can have four alleles, but
most show only two alleles and are regarded as biallelic [3]. SNP
densities vary dramatically across the genome of most species,
especially when considering coding or noncoding regions [4, 5].
The high frequency of SNPs in most species offers the possibility
of constructing extremely dense genetic maps that are particularly
valuable for map-based gene cloning efforts as well as haplotype-
based association studies. SNPs have already been successfully
utilized in a wide array of research areas, including association
studies [6, 7], conservation genetics [8], genetic diversity analysis
[9], and genetic mapping [10].
Unfortunately, the initial discovery of SNPs can be expensive
and time consuming and is often the limiting factor in their utiliza-
tion, especially in nonmodel or nonagricultural species. Genomic
reduction, also known as reduced representation, significantly
reduces the complexity of large genomes [11, 12]. Genomic
reduction, based on restriction-site conservation (GR-RSC),
allows for the effective sampling of identical DNA fragments
across individuals without a priori genome sequence information
[13]. The incorporation of MID-barcodes into specific DNA
sequence fragments allows for the unambiguous assignment of
fragments to specific samples in the sequence pool, thus enabling
the identification of SNPs that will segregate in specific populations.
When linked with second-generation sequencing, genomic reduc-
tion provides a cost-effective means to identify large numbers of
high-confidence SNPs en masse with broad application across
diverse genomes.
The genomic reduction protocol described here for SNP dis-
covery is based on restriction site conservation across individuals,
removal of >90 % of the genome via biotin–streptavidin paramag-
netic bead separation and size selection via gel electrophoresis,
followed by >10-fold sequencing coverage of the remaining
genome via 454-pyrosequencing and the use of incorporated
MID-barcodes for postassembly SNP discovery.
DNA is double digested with restriction endonucleases that
recognize 4-base and 6-base recognition sites, producing three types
of DNA fragments based on their end restriction sites, specifically:
4-base to 4-base, 6-base to 6-base, and 6-base to 4-base fragments.
Subsequently, double-stranded adapters are ligated to the ends of
the digested DNA fragments. The adapter ligated to the end of the
6-base recognition site is end-labeled with a 5′-biotin molecule,
while the adapter on the 4-base recognition site is unlabeled.
Genomic reduction is accomplished by removing the nonlabeled
DNA fragments (4-base to 4-base fragments) from the reaction
using a biotin–streptavidin paramagnetic bead separation (see Note 1).
MID-barcode sequences are then added to the DNA fragments
using PCR primers complementary to the adapter sequences and
that carry a 5′ 10-base MID-barcode. During the initial PCR cycle,
the primers anneal to the adapter sequence and the MID-barcode is
incorporated into the amplified DNA fragment. A high-fidelity poly-
merase with proofreading capabilities is used to avoid introducing
amplification errors. Since the MID-barcode is a 10-base sequence,
several variants of the barcode can be synthesized; thus, each indi-
vidual utilized in the SNP discovery experiment can be labeled with
a specific MID-barcode. Equimolar amounts of each individual PCR
sample are pooled together and size selected (500–650 bp) via
electrophoresis in preparation for 454-pyrosequencing. The size
selection further reduces the genome complexity and the number of
loci sequenced. Since all the samples are sequenced together, only a
single emPCR reaction is required and no space is lost to subgasketing
of the 454 sequencing plate, thus maximizing the number of potential
reads in the pyrosequencing run.
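The three fragment classes produced by the double digest can be illustrated with a small in silico digestion. This is a minimal sketch and not part of the published protocol: the function names and the toy sequence are hypothetical, only the top strand is scanned (both recognition sites are palindromic), and the two terminal pieces of the linear toy sequence, which have only one restriction end, are ignored. Fragments with at least one EcoRI (6-base) end would carry the biotinylated adapter and be retained on the streptavidin beads.

```python
import re

ECORI_SITE = "GAATTC"  # 6-base recognition site, cut as G^AATTC
BFAI_SITE = "CTAG"     # 4-base recognition site, cut as C^TAG


def cut_positions(seq, site, label):
    """Return (cut index, label) for each occurrence of `site` in `seq`.

    Both enzymes cut after the first base of their site, hence the +1.
    """
    return [(m.start() + 1, label) for m in re.finditer(site, seq)]


def double_digest(seq):
    """Classify each internal fragment by the sites at its two ends."""
    cuts = sorted(cut_positions(seq, ECORI_SITE, "6")
                  + cut_positions(seq, BFAI_SITE, "4"))
    fragments = []
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        # "6-6", "6-4" or "4-4", independent of fragment orientation
        end_type = "-".join(sorted((left, right), reverse=True))
        fragments.append((end_type, seq[start:end]))
    return fragments


def retained_on_beads(fragments):
    """Keep fragments with at least one biotin-labeled (EcoRI) end."""
    return [frag for frag in fragments if "6" in frag[0]]


# Hypothetical toy sequence built from spacers and recognition sites.
toy = ("AAAA" + ECORI_SITE + "TTTT" + BFAI_SITE + "CCCC" + BFAI_SITE
       + "GGGG" + ECORI_SITE + "AA" + ECORI_SITE + "AA")

fragments = double_digest(toy)
for end_type, fragment in fragments:
    print(end_type, len(fragment), "bp")
print(len(retained_on_beads(fragments)), "fragments retained")
```

In a real genome the 4-base to 4-base class dominates, which is why removing it accounts for most of the genomic reduction.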
2 Materials
2.1 Restriction Digest
1. 150 ng/μL genomic DNA for each sample. DNA should be
extracted using a standard DNA extraction protocol that produces
high quality DNA (260/280 nm ratio: 1.8–2.0). Care
should be taken to correctly ascertain DNA concentrations
(UV absorbance or fluorometry).
2. Restriction endonucleases. EcoRI (20,000 U/mL); BfaI
(5,000 U/mL) (New England Biolabs, Beverly, MA)
(see Notes 2–3).
3. 10× NEBuffer 4 (500 mM potassium acetate, 200 mM Tris–
acetate, 100 mM Magnesium Acetate, 10 mM Dithiothreitol,
pH 7.9; New England Biolabs, Beverly, MA) (see Note 4).
4. Nuclease free water.
5. Thermal cycler or heat control water bath programmable to
37 °C.
6. PCR strip tubes with caps.
2.2 Adapters
1. EcoRI DS adapter. This double-stranded adapter is comprised
of two single-stranded DNA fragments. Purchase each separately
with the specified 5′ modifications (Integrated DNA
Technologies, Iowa City, IA) and resuspend to a concentration
of 1 μg/μL (see Note 5):
EcoRI adapter Forward: 5′/5BioTEG/-CTCGTAGACTGCGTACC.
EcoRI adapter Reverse: 5′/5Phos/-AATTGGTACGCAGTCTAC.
2. BfaI DS adapter. This double-stranded adapter is comprised of
two single-stranded DNA fragments. Purchase each separately
with specified 5′ modifications (Integrated DNA Technologies,
Iowa City, IA) and resuspend to a concentration of 1 μg/μL
(see Note 5):
BfaI adapter Forward: 5′-GACGATGAGTCCTGAG.
BfaI adapter Reverse: 5′/5Phos/-TACTCAGGACTCA.
3. 10× NEBuffer 4 (500 mM potassium acetate, 200 mM Tris–
acetate, 100 mM Magnesium Acetate, 10 mM Dithiothreitol,
pH 7.9; New England Biolabs, Beverly, MA).
5. Microfuge tube (1.7–2.0 mL).
6. Dry bath programmable to 95 °C (Boekel Scientific,
Feasterville, PA).
2.3 Ligation
1. T4 DNA Ligase (400,000 cohesive end units/mL; see Note 6)
(New England Biolabs, Beverly, MA).
2. 10× T4 DNA Ligase Buffer (500 mM Tris–HCl, 100 mM
MgCl2, 10 mM ATP, 100 mM Dithiothreitol, pH 7.5; New
England Biolabs, Beverly, MA).
3. EcoRI DS adapter (5 μM) and BfaI DS adapter (50 μM) as
prepared in Subheading 3.2.
4. Thermal cycler programmable to 16 °C.
2.4 Size Exclusion via Spin Chromatography
1. Chroma Spin™-400 + TE columns (ClonTech Laboratories,
Mountain View, CA) (see Note 7).
2. Centrifuge (Jouan CR3i, Thermo Scientific, Waltham, MA)
with swing-out rotor and 15 mL inserts (Jouan T40, Thermo
Scientific, Waltham, MA).
3. Microfuge tubes (2.0 mL).
2.5 Biotin–Streptavidin Separation
1. Dynal M-280 beads (Invitrogen, Carlsbad, CA).
2. 2× B&W Buffer (10 mM Tris–HCl, pH 7.5, 1 mM EDTA,
2 M NaCl; autoclaved).
3. Magnetic particle concentrator (MPC), such as the DynaMag-2
magnet (Invitrogen, Carlsbad, CA), for separation of
Dynabeads.
4. 1× TE buffer (10 mM Tris, 1 mM EDTA, pH 8.0; autoclaved).
5. Labroller II rotator (Labnet, Woodbridge, NJ).
2.6 PCR Amplification and MID-Barcode Attachment
1. 50× Advantage®-HF 2 Polymerase (ClonTech Laboratories,
Mountain View, CA). The kit also contains 10× HF 2 PCR
Buffer, 10× HF dNTP Mix, and PCR-grade water (see Note 8).
2. EcoRI Adapter-MIDX and BfaI Adapter-MIDX fusion primer
pairs (Table 1) (Integrated DNA Technologies, Iowa City, IA)
resuspended to a concentration of 10 μM (see Note 9).
3. PCR strip tubes with caps (SnapStrip® ISCBioExpress,
Kaysville, UT).
4. Thermal cycler (ABI 9700, Applied Biosystems, Foster City, CA).
5. FlashGel® Dock, 1.2 % agarose FlashGel® DNA Cassette, 5×
FlashGel® loading dye and FlashGel® DNA 100 bp to 4 kb
marker (Lonza Rockland, Inc., Rockland, ME).
2.7 Fragment Pooling, Size Selection, and Isolation
1. Quant-iT picogreen® dye (Invitrogen, Carlsbad, CA).
2. TBS-380 Mini-fluorometer (Promega, Madison, WI) or similar
fluorometer.
3. Metaphor® agarose (Cambrex BioScience, East Rutherford, NJ).
4. 1× TAE (40 mM Tris-acetate, 1 mM EDTA, pH 8.2).
5. 100-bp DNA ladder (Invitrogen, Carlsbad, CA).
6. Sterile scalpel and 2.0 mL microfuge tubes.
7. Qiaquick Gel Extraction Kit (Qiagen, Germantown, MD).
8. 10:1 TE buffer (10 mM Tris, 1 mM EDTA, pH 8.0;
autoclaved).
Table 1
MID barcode primer pairs
MID-barcode primer pair   EcoRI primer sequence (5′ → 3′)ᵃ   BfaI primer sequence (5′ → 3′)ᵃ
MID1                      ACGAGTGCGTGACTGCGTACCAATTC         ACGAGTGCGTGATGAGTCCTGAGTAG
MID2                      ACGCTCGACAGACTGCGTACCAATTC         ACGCTCGACAGATGAGTCCTGAGTAG
MID3                      AGACGCACTCGACTGCGTACCAATTC         AGACGCACTCGATGAGTCCTGAGTAG
MID4                      AGCACTGTAGGACTGCGTACCAATTC         AGCACTGTAGGATGAGTCCTGAGTAG
MID5                      ATCAGACACGGACTGCGTACCAATTC         ATCAGACACGGATGAGTCCTGAGTAG
MID6                      ATATCGCGAGGACTGCGTACCAATTC         ATATCGCGAGGATGAGTCCTGAGTAG
MID7                      CGTGTCTCTAGACTGCGTACCAATTC         CGTGTCTCTAGATGAGTCCTGAGTAG
MID8                      CTCGCGTGTCGACTGCGTACCAATTC         CTCGCGTGTCGATGAGTCCTGAGTAG
MID9                      TAGTATCAGCGACTGCGTACCAATTC         TAGTATCAGCGATGAGTCCTGAGTAG
MID10                     TCTCTATGCGGACTGCGTACCAATTC         TCTCTATGCGGATGAGTCCTGAGTAG
ᵃ The first 10 bases at the 5′ end of each primer are the MID barcode
3 Methods
3.1 Restriction Digest
1. In a PCR strip tube, prepare a double digestion for each
DNA sample as described in Table 2. If more than one DNA
sample is processed, a cocktail of the nuclease free water,
restriction enzymes, and NEB 4 buffer can be prepared (mixed
thoroughly) and subdivided into the separate PCR strip tubes
containing the DNA for each sample.
2. Cap and gently mix each sample thoroughly by finger flicking.
3. Incubate the samples at 37 °C for 1 h (see Note 10).
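The volumes in Table 2 can be checked, and scaled into a shared cocktail, with a few lines of arithmetic. The following is a minimal sketch under stated assumptions: the dictionary keys and function names are hypothetical, and no pipetting overage is added because the protocol does not specify one.

```python
def units_added(stock_units_per_ml, volume_ul):
    """Enzyme units delivered by `volume_ul` of a stock given in U/mL."""
    return stock_units_per_ml * volume_ul / 1000.0


# Per-reaction volumes (uL) as listed in Table 2.
DIGEST_UL = {
    "genomic DNA (150 ng/uL)": 3.0,
    "EcoRI (20,000 U/mL)": 0.15,
    "BfaI (5,000 U/mL)": 0.60,
    "10x NEB 4 buffer": 4.0,
    "nuclease free water": 32.25,
}


def cocktail_volumes(n_samples):
    """Volumes for a shared cocktail of everything except the DNA."""
    return {name: volume * n_samples
            for name, volume in DIGEST_UL.items()
            if not name.startswith("genomic DNA")}


print("EcoRI units per reaction:", units_added(20_000, 0.15))
print("BfaI units per reaction:", units_added(5_000, 0.60))
print("Reaction volume (uL):", sum(DIGEST_UL.values()))
print("Cocktail for 8 samples:", cocktail_volumes(8))
```

If enzymes from another vendor are used, the same `units_added` calculation gives the volume adjustment needed to keep 3 U per reaction (see Note 2).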
3.2 Adapter Preparation
Adapters can be prepared ahead of time or during the restriction
digestion period.
1. In separate 1.5 mL microfuge tubes prepare the EcoRI and
BfaI double-stranded adapters by combining the components
in Table 3 for each of the adapters.
Table 2
Double digest preparation
Stock solution               1× Reaction
Genomic DNA (150 ng/μL)      3 μL (450 ng)
EcoRI NEB (20,000 U/mL)      0.15 μL (3 U/reaction)
BfaI NEB (5,000 U/mL)        0.60 μL (3 U/reaction)
10× NEB 4 buffer             4 μL
Nuclease free water          32.25 μL
Total                        40 μL
Table 3
Adapter preparation for (a) EcoRI double-stranded adapter and
(b) BfaI double-stranded adapter
(a) EcoRI double-stranded adapter (5 μM)
1 μg/μL EcoRI adapter Forward    1.7 μL
1 μg/μL EcoRI adapter Reverse    1.5 μL
10× NEB 4 buffer                 3.0 μL
Total                            60 μL
(b) BfaI double-stranded adapter (50 μM)
1 μg/μL BfaI adapter Forward     16.0 μL
1 μg/μL BfaI adapter Reverse     14.0 μL
10× NEB 4 buffer                 3.0 μL
Total                            60 μL
2. Cap, label appropriately and mix each sample thoroughly.

3. Incubate the samples (in the capped tubes) at 96 °C for 5 min
in a dry bath. After the 5 min incubation, remove the heating
block from the dry bath and allow it to come to room tem-
perature without removing the sample tubes (see Note 11).
4. Double-stranded (DS) adapters can be stored at −20 °C for
later use. Note that once the DS adapters have been prepared,
they can be used for subsequent experiments without redena-
turing or reannealing.
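The nominal adapter concentrations in Table 3 (5 μM and 50 μM) can be approximated from the oligonucleotide masses. This is a rough sketch, not a vendor calculation: it assumes an average mass of 330 g/mol per nucleotide and ignores the biotin-TEG and phosphate modifications, so exact molecular weights from the oligo supplier should be preferred. Note also that the component volumes listed in Table 3 sum to less than the 60 μL total; the table does not name the diluent that makes up the balance.

```python
AVG_NT_MASS = 330.0  # g/mol per nucleotide; rough ssDNA average (assumption)


def oligo_micromolar(mass_ug, length_nt, final_volume_ul):
    """Approximate concentration (uM) of an oligo in the final volume."""
    moles = mass_ug * 1e-6 / (length_nt * AVG_NT_MASS)
    molar = moles / (final_volume_ul * 1e-6)
    return molar * 1e6


# Masses follow from the 1 ug/uL stocks and the volumes in Table 3.
ecori_forward = oligo_micromolar(mass_ug=1.7, length_nt=17, final_volume_ul=60)
bfai_forward = oligo_micromolar(mass_ug=16.0, length_nt=16, final_volume_ul=60)

print(f"EcoRI adapter Forward: ~{ecori_forward:.1f} uM")
print(f"BfaI adapter Forward: ~{bfai_forward:.1f} uM")
```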
3.3 Ligation
1. After the 1 h restriction digestion incubation, add 10 μL of the
ligation mix to each sample, prepared as outlined in Table 4.
If more than one sample is processed, a cocktail of the nuclease
free water, ligase buffer, adapters, and ligase can be prepared
(mixed thoroughly) and subdivided into the separate PCR
strip tubes containing the digested DNA for each sample.
Table 4
Ligation mix preparation
Ligation mix                 1× Reaction
5 μM EcoRI DS adapter        3.0 μL
50 μM BfaI DS adapter        3.0 μL
10× ligase buffer            1.0 μL
T4 ligase                    0.5 μL
Nuclease free water          2.5 μL
Total                        10 μL
2. The total volume per reaction is now 50 μL. Cap and gently
mix each sample thoroughly.
3. Incubate the samples at 16 °C for 3 h (see Note 10).
4. Following the ligation, add 25 μL of nuclease free water to
each sample to bring the final volume to 75 μL (used in step 5
of Subheading 3.4).
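Table 4 adds 0.5 μL of T4 DNA ligase from a stock of 400,000 cohesive end units/mL. If ligase from a vendor that quotes Weiss units is substituted, the conversion given in Note 6 can be applied as sketched below; the function name is hypothetical.

```python
WEISS_PER_COHESIVE_END_UNIT = 0.015  # conversion stated in Note 6


def ligase_units(stock_ceu_per_ml, volume_ul):
    """Return (cohesive end units, Weiss units) delivered per reaction."""
    ceu = stock_ceu_per_ml * volume_ul / 1000.0
    return ceu, ceu * WEISS_PER_COHESIVE_END_UNIT


ceu, weiss = ligase_units(400_000, 0.5)
print(f"{ceu:.0f} cohesive end units = {weiss:.1f} Weiss units per reaction")
```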
3.4 Size Exclusion via Spin Chromatography
1. Invert a Chroma Spin™-400 column several times to resuspend
the gel matrix completely. One column per sample is
required.
2. Holding the column upright, grasp the breakaway end between
your thumb and index finger and snap it off. Place the end of
the spin column into a 2-mL microcentrifuge tube, and remove
the top cap.
3. Place the column + collection tube in a 15 mL tube and cen-
trifuge at 700 × g in a swing-out rotor for 5 min. After centrifu-
gation, the column matrix will appear semidry. This step purges
the equilibration buffer from the column and reestablishes the
matrix bed.
4. Remove the spin column and collection tube from the centrifuge
rotor, and discard the collection tube.
5. Place the spin column into a clean 2-mL microcentrifuge tube
and carefully apply the 75 μL restriction digestion/ligation
sample to the center of the gel surface.
6. Centrifuge at 700 × g in a swing-out rotor for 5 min.
7. Remove the spin column and collection tube from the rotor.
The purified sample is in the collection tube.
3.5 Biotin–Streptavidin Separation
1. In a new microfuge tube, add 10 μL of Dynal M-280 beads
(thoroughly mix the Dynal beads prior to use) to 61.5 μL of
2× B&W Dynal buffer. Mix by finger flicking (see Note 12).
2. Wash the beads via magnetic separation by placing the tube in
the MPC (DynaMag-2 magnet) for 1 min.
3. While the tube is in the MPC and without disturbing the
magnetic particles, carefully open the tube and pipette out the
wash buffer. The magnetic particles are on the tube wall closest
to the magnet.
4. Repeat the bead wash step 2 more times.
5. Suspend beads in 75 μL of 2× B&W buffer and add to the
purified sample from the spin chromatography
(Subheading 3.4, step 7) (see Note 12).
6. Mix gently and incubate at room temperature for 20 min with
frequent agitation using a microtube rotator set to 5 rpm. If a lab
rotator is not available, mix by finger flicking the tube every 5 min.
7. Wash the beads to remove the nonbiotin labeled DNA frag-
ments via magnetic separation by placing the tube in the MPC
for 1 min, followed by careful removal of the wash buffer with
a pipette.
8. Remove the tube from the MPC and add 150 μL of 1× B&W
buffer to each sample, mix gently and repeat the magnetic bead
separation and wash two more times as described in step 7.
9. After the third wash, resuspend the pellet in 100 μL of 1× TE.
Samples can be stored long term at −20 °C (or stored over-
night at 4 °C).
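When several samples are processed, the bead volumes in steps 1 and 5 scale linearly, as in the five-sample example of Note 12. A minimal sketch of that scaling follows; the function and key names are hypothetical and no overage is included.

```python
BEADS_UL_PER_SAMPLE = 10.0        # Dynal M-280 beads (step 1)
WASH_BUFFER_UL_PER_SAMPLE = 61.5  # 2x B&W buffer for the bead wash (step 1)
RESUSPEND_UL_PER_SAMPLE = 75.0    # 2x B&W buffer for resuspension (step 5)


def bead_cocktail(n_samples):
    """Volumes (uL) for a shared bead preparation covering n_samples."""
    return {
        "beads_ul": BEADS_UL_PER_SAMPLE * n_samples,
        "wash_buffer_ul": WASH_BUFFER_UL_PER_SAMPLE * n_samples,
        "resuspension_buffer_ul": RESUSPEND_UL_PER_SAMPLE * n_samples,
    }


print(bead_cocktail(5))
```

After resuspension, 75 μL of the washed bead suspension is dispensed to each purified sample.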
3.6 PCR Amplification and MID-Barcode Attachment
1. In a PCR strip tube, prepare a PCR reaction for each
DNA sample as outlined in Table 5.
2. Gently mix the PCR reaction and thermal cycle with the
parameters outlined in Table 6.
Table 5
PCR mastermix
Stock solution                               1× Reaction
50× ClonTech Advantage-HF 2 Polymerase       1.0 μL
10× ClonTech HF 2 dNTP                       5.0 μL
10× ClonTech HF 2 Buffer                     5.0 μL
10 μM BfaI-MIDX primerᵃ                      1.0 μL
10 μM EcoRI-MIDX primerᵃ                     1.0 μL
DNA templateᵇ                                1.0 μL
PCR-grade water                              36.0 μL
Total                                        50 μL
ᵃ For each DNA sample use a different EcoRI-MIDX/BfaI-MIDX primer set (see Note 9)
ᵇ Mix beads-template well before addition
Table 6
PCR thermocycling parameters
Cycling step   Temperature (°C)   Time
Step 1         95                 1 min
Step 2         95                 15 s
Step 3         65                 30 s
Step 4         68                 2 min
Repeat steps 2–4 17 additional times for a total of 18 cycles
Step 5         4                  Hold
Fig. 1 Typical FlashGel® electrophoresis results used to determine the optimal
number of cycles needed during PCR. Lane 1 contains 4 μL of the FlashGel® DNA
Marker. Lanes 2–5 show typical results after 18, 20, 22, and 24 cycles. The goal
is to perform sufficient cycles to remain within the exponential phase of PCR
while avoiding overcycling. In this example, 20 is the optimal number of cycles
3. At the end of the thermal cycling program, remove and store
5 μL of the reaction in a microfuge tube.
4. Return the PCR reaction to the thermal cycler and amplify
the sample an additional two cycles (to a total 20 cycles).
On completion of the 20 cycles, remove 5 μL and store the
subsample. Repeat the cycling and subsampling at the 22nd
and 24th cycles.
5. Combine 1 μL of 5× loading dye with 5 μL of each subsample
and electrophorese for 5 min on a 1.2 % agarose FlashGel®
according to the manufacturer’s protocol. An example of
successful amplification is shown in Fig. 1 (see Note 13).
6. From the gel picture, identify the optimal number of cycles that
produces a bright smear through the target range (450–
600 bp) without overcycling. The goal is to perform sufficient
cycles to remain within the exponential phase of PCR. In our
hands, the optimal cycle number is typically between 20 and
22 cycles (see Note 14).
7. Prepare a new 50 μL PCR reaction for each sample according
to the protocol above. Cycle each sample to the optimum
number of cycles determined above.
3.7 Fragment Pooling, Size Selection, and Isolation
1. Fluorometrically determine the DNA concentration of each
sample using a Quant-iT picogreen® dye. Typical yields for
each 50 μL PCR reaction are between 1 and 2 μg.
2. Pool equimolar amounts of each PCR sample to obtain a single
sample containing 5 μg of DNA (see Note 15).
3. Electrophorese the pooled samples in a single lane on a 1.5 %
Metaphor® agarose gel in 0.5× TAE at 40 V for 8 h and visualize
the gel using standard ethidium bromide staining and UV
transillumination.
4. Using a sterile scalpel, cut out a single gel slice representing
DNA fragments ranging from ~450 to 650 bp.
5. Extract the DNA from the gel slice according to the manufac-
turer’s suggested protocol using a Qiaquick Gel Extraction
Kit, resuspending the extracted DNA in 100 μL of EB buffer
(10 mM Tris–Cl, pH 8.5; included in kit). If needed, the
sample can be further concentrated using a standard
speed-vac.
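The pooling in step 2 can be planned from the fluorometric concentrations measured in step 1. This is a minimal sketch with hypothetical sample names and concentrations. It assumes that equal mass approximates equimolar pooling because every sample was prepared identically and therefore has a similar fragment size distribution.

```python
def pooling_volumes(conc_ng_per_ul, total_ug=5.0):
    """Volume (uL) of each sample that contributes an equal share of total_ug."""
    share_ng = total_ug * 1000.0 / len(conc_ng_per_ul)
    return {sample: share_ng / conc for sample, conc in conc_ng_per_ul.items()}


# Hypothetical PicoGreen readings (ng/uL) for four barcoded samples.
concentrations = {"MID1": 30.0, "MID2": 25.0, "MID3": 40.0, "MID4": 50.0}
volumes = pooling_volumes(concentrations)

for sample, volume in volumes.items():
    print(f"{sample}: {volume:.1f} uL")
```

With typical yields of 1 to 2 μg per 50 μL reaction, a sample at the low end may not supply its full share; in that case a second PCR reaction for that sample or a larger pooling panel is needed.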
At this point the pooled DNA sample is ready for standard
454-pyrosequencing (without DNA fragmentation), including the
small-fragment removal step found in the 454-pyrosequencing
General Library Preparation manual and all subsequent steps
(i.e., immobilization, fill-in, single-stranded DNA library isolation,
quality assessment, and quantitation) to produce a single-stranded
library. Following sequencing, the DNA reads are bioinformati-
cally separated into MID-barcode pools representing each of the
DNA samples using the SFF Tool commands in the Genome
Sequencer FLX system data analysis software package. Several bio-
informatic methods exist for SNP discovery from these MID-
barcode pools [13]. Most include assembly of the reads, either
against a reference sequence or as a de novo assembly (e.g., Roche
Newbler assembler) followed by SNP discovery using commercial
software (e.g., CLCBio Workbench, Katrinebjerg, Aarhus N,
Denmark) or custom Perl scripts.
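The separation of reads into MID-barcode pools can be illustrated in a few lines. This sketch is not the SFF Tool workflow named above: it assumes reads are already available as plain sequence strings, requires an exact match to the 10-base barcodes of Table 1, and uses hypothetical function and variable names. Production pipelines usually also tolerate sequencing errors in the barcode.

```python
# 10-base MID barcodes taken from the 5' end of the primers in Table 1.
MID_BARCODES = {
    "MID1": "ACGAGTGCGT", "MID2": "ACGCTCGACA", "MID3": "AGACGCACTC",
    "MID4": "AGCACTGTAG", "MID5": "ATCAGACACG", "MID6": "ATATCGCGAG",
    "MID7": "CGTGTCTCTA", "MID8": "CTCGCGTGTC", "MID9": "TAGTATCAGC",
    "MID10": "TCTCTATGCG",
}
BARCODE_LENGTH = 10


def demultiplex(reads):
    """Assign each read to a sample by its leading barcode and trim it."""
    sample_by_barcode = {seq: name for name, seq in MID_BARCODES.items()}
    pools = {name: [] for name in MID_BARCODES}
    pools["unassigned"] = []
    for read in reads:
        sample = sample_by_barcode.get(read[:BARCODE_LENGTH].upper())
        if sample is None:
            pools["unassigned"].append(read)
        else:
            pools[sample].append(read[BARCODE_LENGTH:])
    return pools


# Hypothetical reads: barcode + adapter-specific sequence + insert.
example_reads = [
    "ACGAGTGCGT" + "GACTGCGTACCAATTC" + "AAA",
    "TCTCTATGCG" + "GATGAGTCCTGAGTAG" + "CCC",
    "NNNNNNNNNN" + "ACGT",
]
pools = demultiplex(example_reads)
print({name: len(reads) for name, reads in pools.items() if reads})
```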
4 Notes
1. Genomes larger than 500 Mb may require the use of an endonuclease
that recognizes an 8-base recognition site in order to
obtain the desired magnitude of genomic reduction. The number
of fragments produced from the digestion can be estimated
by multiplying the genome size of the organism by the predicted
frequency of the recognition site of the rare (6- or 8-bp)
endonuclease. For example, the predicted frequency of an
EcoRI cut site (GAATTC) in an organism with a genome size
of 466 Mb and a predicted GC content of 35 % would be
0.00034. The expected number of fragments remaining after
the biotin–streptavidin reduction would therefore be
159,219, of which only ~20 % are predicted to be in the
sequenced fragment size range (500–650 bp).
2. EcoRI and BfaI were purchased from New England BioLabs,
although any reputable source should work well. If purchased
from a different vendor, verify unit concentrations for each
enzyme and adjust the restriction digest to maintain 3 U/
reaction as a final concentration.
3. Of primary importance to maximizing the number of SNP iden-
tified is the selection of the appropriate restriction endonucle-
ases that will minimize the amount of repetitive DNA content in
the target sequencing size range (500–650 bp), while providing
sufficient genomic reduction to provide adequate sequencing
coverage (≥10-fold) for SNP detection. To determine which
enzymes to use in the genomic reduction, the researcher should
first determine if the two selected enzymes produce an adequate
number of DNA fragments in the target range (should appear as
a smear without any obvious strong bands).
4. If different enzymes are utilized, identify the correct buffer to
use to assure 100 % activity by both enzymes during the double
digestion. New England BioLabs provides a web-based
double digest finder tool at:
http://www.neb.com/nebecomm/DoubleDigestCalculator.asp.
5. Each adapter is composed of two single-stranded DNA mole-
cules that are combined (see Subheading 3.2) to produce a
single double-stranded adapter with an overhang complemen-
tary to the restriction site sequence for EcoRI or BfaI. The
addition of the TEG-Biotin to the EcoRI adapter allows for a
downstream biotin–streptavidin separation—an integral part
of the genomic reduction. The 5′ phosphorylation modifica-
tion ensures double-stranded ligation of the adapters to the
digested DNA.
6. New England BioLabs is unique in that it sells its T4 DNA
ligase in units of cohesive end units, while most other vendors
market their ligase in Weiss units. The conversion is 1 cohesive
end unit equals 0.015 Weiss units. Equivalently, one Weiss unit
equals 67 cohesive end units.
7. Chroma Spin™ columns are designed for rapid spin-column
chromatography and are used for size selection of nucleic
acids within a wide size range. Chroma Spin™-400 columns
effectively remove 90 % of all DNA fragments smaller than
170 bp. Other Chroma Spin™ columns are available if a dif-
ferent DNA target size for sequencing is chosen.
8. The kit contains the Advantage 2 enzyme blend of TITANIUM™
Taq DNA Polymerase and a small amount of proofreading
enzyme that reportedly achieve a 29-fold higher fidelity
(accuracy) than that seen with wild-type Taq DNA Polymerase.
The high fidelity of the polymerase is important for avoiding
the introduction of sequence errors during PCR.
9. We provide the sequence for ten EcoRI Adapter-MIDX and
BfaI Adapter-MIDX fusion primer pairs. Each primer contains
a 5′ MID barcode fused to an adapter specific sequence. Each
sample in the SNP discovery pool is amplified with a unique
primer pair. If additional MID-barcode sequences are needed
(for large discovery panels), more MID-barcode sequences
have been published by 454 Life Sciences (http://454.com).
10. To facilitate the restriction digest and the ligation protocol
(see Subheadings 3.1–3.3), we have programmed a thermal cycler
with the following cycling method: 1 h at 37 °C, followed by
3 h at 16 °C, followed by an indefinite hold at 4 °C. After the completion of
the restriction digest, the sample is allowed to drop to 16 °C, the
thermal cycler is paused, the ligation mix added, and the thermal
cycler restarted.
11. The block and samples should reach room temperature
(~23 °C) in approximately 45–60 min. During this time, the
complementary single-stranded DNA fragments are annealing
to create the double-stranded adapters, specifically:
EcoRI DS adapter:
5′/5BioTEG/-CTCGTAGACTGCGTACC-3′
            3′-CATCTGACGCATGGTTAA-/5Phos/5′
Note that the EcoRI recognition site is: G∇AATTΔC.
BfaI DS adapter:
5′-GACGATGAGTCCTGAG-3′
   3′-CTACTCAGGACTCAT-/5Phos/5′
Note that the BfaI recognition site is: C∇TAΔG.
12. The Dynal M-280 beads are magnetic particles (2.8 μm in
diameter) with covalently bound streptavidin and are used to
bind the biotinylated DNA. According to the manufacturer,
1 mg of beads typically binds 700 pmol of free biotin or 5 pmol
of a 2–4 kb double-stranded DNA fragment. A cocktail of the
Dynal beads can be made if more than one sample is processed.
For example, if five samples are being processed, then 50 μL
of beads should be mixed with 307.5 μL of 2× B&W Dynal
buffer, washed three times and then resuspended in 375 μL of
2× B&W Dynal buffer.
13. According to the manufacturer, the FlashGel® System is 5–20
times more sensitive than gels stained with ethidium bromide
and will detect <0.1 ng DNA/band.
14. Note that good samples appear as a smear with no strong
bands in the target zone (450–650 bp). Also note that the
DNA smear will not be present in the lower cycle samples and
will only become apparent in the later cycles.
15. The pooled samples should represent an equal amount of each
sample. 454 pyrosequencing protocols require approximately
5 μg of starting DNA (http://454.com).
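The estimate in Note 1 can be reproduced with a short calculation. The sketch below assumes, as the note implicitly does, that bases occur independently with G and C each at half the GC content and A and T each at half the remainder; the function names are hypothetical.

```python
def site_frequency(site, gc_content):
    """Expected per-position frequency of `site` under independent bases."""
    base_prob = {
        "G": gc_content / 2, "C": gc_content / 2,
        "A": (1 - gc_content) / 2, "T": (1 - gc_content) / 2,
    }
    frequency = 1.0
    for base in site:
        frequency *= base_prob[base]
    return frequency


def expected_sites(genome_size_bp, site, gc_content):
    """Expected number of recognition sites in a genome of the given size."""
    return genome_size_bp * site_frequency(site, gc_content)


freq = site_frequency("GAATTC", 0.35)
n_sites = expected_sites(466_000_000, "GAATTC", 0.35)
print(f"EcoRI site frequency: {freq:.5f}")
print(f"Expected fragments: {n_sites:,.0f}")
print(f"In sequenced size range (~20 %): {0.20 * n_sites:,.0f}")
```

Substituting an 8-base site shows why a rarer cutter is suggested for genomes above 500 Mb: each additional base multiplies the expected frequency by a further factor of 0.175 or 0.325 at this GC content.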
Acknowledgments
This research was funded by the Ezra Taft Benson Agriculture
and Food Institute. We gratefully acknowledge Dr. Edward
Wilcox (BYU) for his assistance and advice with regard to
454-pyrosequencing.
References
1. Garg K, Green P, Nickerson DA (1999) Identification of candidate coding region single nucleotide polymorphisms in 165 human genes using assembled expressed sequence tags. Genome Res 9:1087–1092
2. Batley J, Barker G, O'Sullivan H, Edwards KJ, Edwards D (2003) Mining for single nucleotide polymorphisms and insertions/deletions in maize expressed sequence tag data. Plant Physiol 132:84–91
3. Krawczak M (1999) Informativity assessment for biallelic single nucleotide polymorphisms. Electrophoresis 20:1676–1681
4. Van K, Hwang EY, Kim MY, Park HJ, Lee SH, Cregan PB (2005) Discovery of SNPs in soybean genotypes frequently used as the parents of mapping populations in the United States and Korea. J Hered 96:529–535
5. Ching A, Caldwell K, Jung M, Dolan M, Smith O, Tingey S, Morgante M, Rafalski A (2002) SNP frequency, haplotype structure and linkage disequilibrium in elite maize inbred lines. BMC Genet 3:19
6. Andrew T, Maniatis N, Carbonaro F, Liew SHM, Lau W, Spector TD, Hammond CJ (2008) Identification and replication of three novel myopia common susceptibility gene loci on chromosome 3q26 using linkage and linkage disequilibrium mapping. PLoS Genet. doi:10.1371/journal.pgen.1000220
7. Zhu C, Gore M, Buckler ES, Yu J (2008) Status and prospects of association mapping in plants. Plant Genome 1:5–20
8. Cramer ERA, Stenzler L, Talaba AL, Makarewich CA, Vehrencamp SL, Lovette IJ (2008) Isolation and characterization of SNP variation at 90 anonymous loci in the banded wren (Thryothorus pleurostictus). Conserv Genet 9:1657–1660
9. Kawuki R, Ferguson M, Labuschagne M, Herselman L, Kim DJ (2009) Identification, characterisation and application of single nucleotide polymorphisms for diversity assessment in cassava (Manihot esculenta Crantz). Mol Breed 23:669–684
10. Rostoks N, Mudie S, Cardle L, Russell J, Ramsay L, Booth A, Svensson JT, Wanamaker SI, Walia H, Rodriguez EM, Hedley PE, Liu H, Morris J, Close TJ, Marshall DF, Waugh R (2005) Genome-wide SNP discovery and linkage analysis in barley based on genes responsive to abiotic stress. Mol Genet Genomics 274:515–527
11. Wiedmann RT, Smith TP, Nonneman DJ (2008) SNP discovery in swine by reduced representation and high throughput pyrosequencing. BMC Genet 9:81
12. Van Tassell CP, Smith TPL, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT, Haudenschild CD, Moore SS, Warren WC, Sonstegard TS (2008) SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Methods 5:247–252
13. Maughan PJ, Yourstone SM, Jellen EN, Udall JA (2009) SNP discovery via genomic reduction, barcoding, and 454-pyrosequencing in amaranth. Plant Genome 2:260–270
Chapter 14
Inter-SINE Amplified Polymorphism (ISAP)
for Rapid and Robust Plant Genotyping
Torsten Wenke, Kathrin M. Seibt, Thomas Döbel, Katja Muders,
and Thomas Schmidt
Abstract
The unambiguous differentiation of crop genotypes is often laborious or expensive. A rapid, robust,
and cost-efficient marker system is required for routine genotyping in plant breeding and marker-assisted
selection. We describe the Inter-SINE Amplified Polymorphism (ISAP) system that is based on standard
molecular methods resulting in genotype-specific fingerprints at high resolution. These markers are derived
from Short Interspersed Nuclear Elements (SINEs) which are dispersed repetitive sequences present in
most if not all plant genomes and can be efficiently extracted from plant genome sequences. The ISAP
method was developed on potato as model plant but is also transferable to other plant species.
Key words Inter-SINE amplified polymorphism, ISAP, Retrotransposon, Short interspersed nuclear
elements, SINE
1 Introduction
We report the Inter-SINE Amplified Polymorphism (ISAP) marker
system for the rapid, robust, and cost-efficient differentiation of
related plant genotypes. The method is based on a class of dis-
persed repetitive DNA sequences designated as Short Interspersed
Nuclear Elements (SINEs). SINEs are non-LTR retrotransposons
and are amplified by reverse transcription. The retrotransposition
occurs via an RNA intermediate and the resulting copy is inte-
grated at another site in the genome while the source copy is pre-
served [1]. Because they are nonautonomous retroelements their
activity relies on proteins encoded by corresponding Long
Interspersed Nuclear Elements [2]. The “copy and paste” mecha-
nism causes the evolutionary amplification and diversification of
SINEs leading to distinct families with copy numbers of up to a few
thousand. SINEs are only approximately 100–500 nucleotides long
and widely distributed along chromosomes. Multiple SINE families
exist simultaneously in a species and the majority of copies is local-
ized within euchromatic regions frequently nearby or within genes
[3–6].
Several studies have shown that SINEs are widespread in flow-
ering plants. They were found in basal and higher angiosperms,
gymnosperms, monocots, and dicots [5–11]. The targeted identi-
fication of SINEs is efficiently feasible for any genome since the
development of the bioinformatic tool SINE-Finder [6]. The only
prerequisite is a sufficient amount of genome sequences.
The high sequence variability among families and widespread
occurrence in combination with their targeted identification make
SINEs a well-suited source for the development of molecular
markers. We developed the ISAP marker system using potato
(Solanum tuberosum) as the model plant [12]. The potato genome
is extensively sequenced and eight SINE families and two subfami-
lies have been described [6, 13]. Worldwide more than 4,500
potato varieties are known and cultivated in over 100 countries
[14]. Agriculturally important cultivars arose from only a few wild
potato species and modern breeding strategies include mainly
crosses of existing cultivars to develop novel varieties with desirable
agronomic traits [15]. Recombination during breeding leads to
mosaic-like genome compositions and genotypes showing specific
SINE arrangements.
The genotype-specific SINE distribution, particularly the
regions between SINE copies in close vicinity, can be analyzed and
displayed using the ISAP method (Fig. 1). Briefly, outward-facing
primers are derived from each SINE family to amplify inter-SINE
regions by PCR. The resulting products are separated according
to their size by standard agarose gel electrophoresis or capillary
electrophoresis on an automated sequencing device. The selective
combination of the primers of different SINE families enables the
generation of numerous polymorphic ISAP patterns. Genotype-specific,
highly polymorphic fingerprints will be obtained and can be
used for the differentiation of varieties, landraces, and wild species.
Fig. 1 Short Interspersed Nuclear Elements (SINEs) and principle of the Inter-SINE Amplified Polymorphism
(ISAP) method. SINEs are typically characterized by a tRNA-derived 5′ region containing the RNA polymerase
III promoter motif (box A and box B), a non-tRNA related region of mostly unknown origin, and an A/T-rich 3′
tail or simple sequence repeat. The flanking target site duplication (TSD) is created during the integration
process. For ISAP analyses genomic DNA between neighboring SINEs is amplified by PCR with outward-facing
SINE reverse (R) and forward primers (F). PCR amplicons are separated by electrophoresis
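The principle illustrated in Fig. 1 can be sketched in silico. The following Python example is a simplified illustration, not part of the published protocol: it assumes that primer binding sites and their extension directions are already known, and it reports the sizes of products formed between converging primer sites within the 100–2,000 bp band-size range mentioned in Subheading 3.1. The coordinates in `example_sites` and the function name `predict_amplicons` are hypothetical.

```python
from typing import List, Tuple

# A primer site is (position_bp, direction): "+" means the primer is
# extended toward higher coordinates, "-" toward lower coordinates.
Site = Tuple[int, str]


def predict_amplicons(sites: List[Site],
                      min_bp: int = 100,
                      max_bp: int = 2000) -> List[int]:
    """Return sorted sizes of products between converging primer sites.

    Simplification: every "+" site is paired with every downstream "-"
    site; competition between overlapping products is ignored.
    """
    plus = sorted(pos for pos, direction in sites if direction == "+")
    minus = sorted(pos for pos, direction in sites if direction == "-")
    sizes = []
    for start in plus:
        for end in minus:
            size = end - start
            if min_bp <= size <= max_bp:
                sizes.append(size)
    return sorted(sizes)


# Hypothetical primer sites on one chromosome segment (illustration only).
example_sites = [(1000, "+"), (1450, "-"), (1600, "+"),
                 (5000, "-"), (5200, "+"), (5900, "-")]
print(predict_amplicons(example_sites))
```

Two genotypes that differ in the presence of a single SINE copy would yield different size lists, which is the basis of the polymorphic banding patterns.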
2 Materials
Prepare all solutions using sterile distilled water. Store all reagents
according to manufacturer’s instructions or at room temperature if
not indicated otherwise.
2.1 Polymerase Chain Reaction (PCR)
1. Template DNA.
2. DreamTaq™ DNA Polymerase (Thermo Scientific), 5 U/μl
(see Note 1).
3. 10× DreamTaq™ Green Buffer (Thermo Scientific).
4. 20 mM dNTPs.
5. 2 mg/ml BSA (bovine serum albumin).
6. 3 μM SINE family-specific primers (see Note 2).
7. Thermocycler with fast heating and cooling rates.
2.2 Agarose Gel Electrophoresis
1. Standard horizontal agarose gel electrophoresis and gel
documentation equipment (see Note 3).
2. 50× TAE buffer: 2 M Tris base, 2 M glacial acetic acid, 50 mM
EDTA dissolved in H2O, pH 8.5. Dilute to 1× TAE with water
prior to use.
3. Standard molecular biology grade LE (low electroendosmosis)
agarose.
4. 1 % ethidium bromide (see Note 4).
5. DNA electrophoresis size standard for amplicon sizing, e.g.,
GeneRuler™ 100 bp Plus DNA Ladder (Thermo Scientific).
2.3 Capillary Electrophoresis
1. Fluorescence-labeled SINE family-specific primers (see Note 5).
2. Automated capillary sequencing device, e.g., CEQ™ 8000
(Beckman Coulter).
3. Size standard (e.g., CEQ™ DNA Size Standard Kit-600,
Beckman Coulter).
3 Methods
3.1 Primer Design
1. For the design of primers for ISAP, align representative copies
of a SINE family using MUSCLE, MAFFT
[16, 17], or similar programs (see Note 6).
2. Design SINE family-specific outward-facing primers from the most
conserved regions (Fig. 2) (see Note 7).
3. Determine the optimal primer annealing temperatures by
standard gradient PCR. Band sizes should range between 100
and 2,000 bp. Single primers or combinations of primers
derived from different SINE families can be used.
4. Determine the marker information content (number of
polymorphic bands) by application of the primers to a set of
genotypes.
5. Extend the SINE family-derived primers at the 5′ end by a
20mer of an arbitrary GC-rich sequence (see Notes 2 and 8).
Fig. 2 Example of an aligned SINE family (SolS-IV) of the potato genome. Gray shaded rectangles indicate
the RNA polymerase III promoter boxes A and B. Poly(A) tails at the 3′ ends are located directly upstream of the
target site duplications which are underlined. Arrows show positions of primers derived for ISAP. Dots indicate
identical nucleotides. Dashes show gaps introduced to optimize the alignment. EMBL accessions are given
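Step 5 can be sketched in a few lines of Python. The 20mer below is the extension given in Note 2, and the SINE-derived part is the 3′ portion of SolS-IIIa-extended-F from Table 1; the function names `extend_primer` and `gc_fraction` are hypothetical helpers introduced only for this illustration.

```python
# 20mer C/G-rich extension from Note 2 (identical for all primers).
EXTENSION = "CTGACGGGCCTAACGGAGCG"


def extend_primer(sine_part: str, extension: str = EXTENSION) -> str:
    """Prepend the common 5' extension to a SINE-derived primer sequence."""
    sine_part = sine_part.upper().replace(" ", "")
    if set(sine_part) - set("ACGT"):
        raise ValueError("primer contains characters other than A, C, G, T")
    return extension + sine_part


def gc_fraction(seq: str) -> float:
    """Fraction of G and C nucleotides in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# SINE-derived part of the SolS-IIIa forward primer (Table 1).
extended = extend_primer("CCTATGTGGTTTGCGAGC")
print(extended)
print(f"GC content of the extension: {gc_fraction(EXTENSION):.0%}")
```

The printed primer should match the SolS-IIIa-extended-F entry in Table 1, which is a convenient check when ordering oligonucleotides for additional SINE families.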
3.2 Polymerase Chain Reaction
The isolation of genomic DNA from plant material can be
conducted according to several protocols (e.g., CTAB protocol by
[18]) or with commercially available kits. The use of young leaves
without signs of degradation ensures high-quality DNA, which is essential
for reliable and reproducible banding patterns. The isolated DNA
should be RNA-free.
1. Prepare template DNA for PCR reactions at a concentration
of 10 ng/μl (see Note 9).
2. Prepare a PCR master mix for all samples on ice. Each reaction
consists of 2 μl of DreamTaq™ Green Buffer, 2 μl of dNTPs,
1 μl of BSA, 1 μl of each primer, 0.1 μl of DreamTaq™ DNA
Polymerase, and 1 μl of template DNA, adjusted to 20 μl with water
(see Note 10).
3. Perform a two-step PCR program as follows (see Notes 8 and 11):
(a) Initial denaturation (5 min at 93 °C).
(b) 3 cycles (denaturation 20 s at 93 °C, annealing 30 s at 60 °C,
elongation 2 min at 72 °C).
(c) 27 cycles (denaturation 20 s at 93 °C and annealing/
elongation 140 s at 72 °C).
(d) Final elongation (5 min at 72 °C).
(e) Storage at 4 °C.
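The master mix in step 2 can be scaled with a short calculation. The sketch below uses the per-reaction volumes given above; the 10 % pipetting overage and the decision to add the template to each tube separately (rather than to the master mix) are assumptions made for this example, not recommendations from the protocol.

```python
# Per-reaction volumes in microliters, taken from step 2.
PER_REACTION_UL = {
    "10x DreamTaq Green Buffer": 2.0,
    "20 mM dNTPs": 2.0,
    "2 mg/ml BSA": 1.0,
    "DreamTaq DNA Polymerase": 0.1,
}
PRIMER_UL = 1.0      # per primer
TEMPLATE_UL = 1.0    # added to each tube separately (assumption)
REACTION_UL = 20.0   # final reaction volume


def master_mix(n_reactions: int, n_primers: int = 2,
               overage: float = 0.10) -> dict:
    """Return master-mix volumes (ul) for n reactions, without template."""
    per_rxn = dict(PER_REACTION_UL)
    per_rxn["primers (3 uM each)"] = PRIMER_UL * n_primers
    # Water fills each reaction up to 20 ul once the template is added.
    per_rxn["water"] = REACTION_UL - TEMPLATE_UL - sum(per_rxn.values())
    factor = n_reactions * (1 + overage)
    return {name: round(vol * factor, 2) for name, vol in per_rxn.items()}


for component, volume in master_mix(24).items():
    print(f"{component:28s}{volume:8.2f} ul")
```

Each tube then receives 19 μl of master mix plus 1 μl of template DNA. For single-primer reactions, call `master_mix(n, n_primers=1)`, which increases the water volume accordingly.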
3.3 Agarose Gel Electrophoresis
For comparable banding patterns, prepare and run gels under constant
conditions (gel composition, voltage, separation time, buffer
composition, and size standard). The conditions described here are
optimized for amplicons of 100–2,000 bp length resulting from a
typical ISAP run.
1. Prepare a 2 % agarose gel with 1× TAE buffer (see Note 12).
2. For staining of the PCR products, add ethidium bromide
(0.05 μl/ml gel) prior to casting.
3. Put the gel into the electrophoresis chamber and fill with fresh
1× TAE until the gel is covered with 1 mm of buffer
(see Note 13).
4. Load the complete 20 μl PCR reaction volume and 1–1.5 μg
of the DNA size marker on the gel (see Note 14).
5. Run the separation at approximately 3.5 V/cm until the
individual bands are clearly resolved.
6. Document and store the resulting banding patterns with a gel
documentation device (Fig. 3).
Fig. 3 ISAP patterns of potato varieties with the primer pair SolS-IIIa-extended-F/SolS-IV-extended-R resolved
on 2 % agarose gel in 1× TAE buffer. Lanes correspond to the varieties “Acapella” (1), “Angela” (2), “Annabelle”
(3), “Arcona” (4), “Arkula” (5), “Arosa” (6), “Atica” (7), “Ballerina” (8), “Bellaprima” (9), “Berber” (10),
“Bonus” (11), “Borwina” (12), “Carlita” (13), “Rita” (14), “Rosara” (15), “Salome” (16), “Solist” (17), “Stefanie”
(18), “Valetta” (19), “Velox” (20), “Verona” (21), “Terrana” (22), “Agave” (23), “Agila” (24), “Aktiva” (25),
“Ampera” (26), and 100 bp Plus DNA Ladder (M)
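The gel parameters above lend themselves to a small planning helper. The following sketch computes the ethidium bromide volume (0.05 μl per ml of gel) and the running voltage (3.5 V/cm), and arranges samples with a size standard in the flanking lanes and after every eighth sample as suggested in Note 14. The gel volume and electrode distance in the example call are hypothetical values, and the voltage calculation assumes that V/cm refers to the distance between the electrodes.

```python
from typing import Dict, List


def gel_setup(gel_volume_ml: float, electrode_distance_cm: float) -> Dict[str, float]:
    """Ethidium bromide volume and running voltage for one gel."""
    return {
        "ethidium_bromide_ul": 0.05 * gel_volume_ml,   # 0.05 ul per ml gel
        "voltage_v": 3.5 * electrode_distance_cm,      # approx. 3.5 V/cm
    }


def lane_layout(samples: List[str], block: int = 8, marker: str = "M") -> List[str]:
    """Place the marker in both flanking lanes and after every `block` samples."""
    lanes = [marker]
    for index, sample in enumerate(samples, start=1):
        lanes.append(sample)
        # Internal marker lane, but never directly before the final marker.
        if index % block == 0 and index != len(samples):
            lanes.append(marker)
    lanes.append(marker)
    return lanes


# Hypothetical gel: 100 ml gel volume, 30 cm between the electrodes.
print(gel_setup(gel_volume_ml=100, electrode_distance_cm=30))
print(lane_layout([f"S{i}" for i in range(1, 11)]))
```

For ten samples the layout needs 13 lanes, which helps when choosing a comb before casting the gel.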
3.4 Capillary Electrophoresis
1. Prepare the PCR reaction mix according to Subheading 3.2,
including at least one 5′ dye-labeled primer (see Note 5).
The PCR conditions are as described in Subheading 3.2.
2. Use an appropriate size standard according to the expected
amplicon size range.
3. Separate the PCR products in a sequencing device according to
the manufacturer’s instructions. Depending on the capillary
sequencer, aliquots of 1 μl of the PCR reaction or less might be
sufficient for signal detection (see Note 15).
4. Separation results are documented as electropherograms
and automatically stored in a file (Fig. 4).
3.5 Data Analysis
Depending on the purpose of the ISAP analysis, different requirements
for management and interpretation of the data exist. In particular,
the analysis of large numbers of samples from several ISAP
experiments requires adequate software. Programs like GelCompar
II or BioNumerics (Applied Maths NV) are suitable for the establishment
of a database. This software allows the storage and normalization
of data from different ISAP experiments which enables
the comparison and combination of ISAP runs (Fig. 5). The software
accepts data generated by both conventional agarose gels and
capillary sequencers; results from gel images and electropherograms
can be loaded directly into the program. The software enables the
filtering of matching ISAP patterns, the construction of similarity
matrices and cluster analyses. Moreover, the combination of ISAP
and other markers is supported, as well as the complementation
with further information including phenotypic data.
Fig. 4 Electropherogram of ISAP fragments generated with primers SolS-IIIa-F/SolS-IV-R separated by capillary
electrophoresis for the potato cultivar “Gala”. For fluorescence signal detection, one primer was labeled at
the 5′ end with Cy5. Amplicons (peaks with size information) were separated on the Beckman CEQ™ 8000
capillary sequencer according to the “Frag4” method including an internal size standard (small peaks)
[Plot axes: Dye Signal (rfu) versus Size (nt); labeled peaks at 210, 291, 451, 458, 509, 533, 596, and 681 nt]
Fig. 5 ISAP banding patterns of potato varieties analyzed using GelCompar II software (Applied Maths NV). Gel
images for each variety from various Experiment data (e.g., different ISAP primer combinations) can be stored
and analyzed. A Comparison allows cluster Analyses based on the electrophoretic separation to determine
pattern Similarities between varieties visualized as color-coded matrix
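The construction of a similarity matrix from banding patterns can be illustrated without commercial software. The sketch below scores each pair of varieties with the Dice coefficient on band presence; this coefficient is chosen here only as a common example and is not stated to be the method used by GelCompar II or BioNumerics. The variety names and band sizes are hypothetical, and the bands are assumed to have already been assigned to common size classes.

```python
from typing import Dict, Set


def dice_similarity(bands_a: Set[int], bands_b: Set[int]) -> float:
    """Dice coefficient: 2 * shared bands / (bands in A + bands in B)."""
    if not bands_a and not bands_b:
        return 1.0
    shared = len(bands_a & bands_b)
    return 2 * shared / (len(bands_a) + len(bands_b))


def similarity_matrix(profiles: Dict[str, Set[int]]) -> Dict[str, Dict[str, float]]:
    """Pairwise Dice similarities for all profiles, rounded to 3 decimals."""
    names = sorted(profiles)
    return {
        a: {b: round(dice_similarity(profiles[a], profiles[b]), 3) for b in names}
        for a in names
    }


# Hypothetical band size classes (bp) for three varieties.
profiles = {
    "Variety A": {210, 291, 451, 596},
    "Variety B": {210, 451, 509, 596},
    "Variety C": {291, 533, 681},
}
matrix = similarity_matrix(profiles)
for name, row in matrix.items():
    print(name, row)
```

A matrix of this kind can be passed to standard clustering routines; combining the band sets from several primer combinations before scoring increases the resolution between closely related varieties.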
4 Notes
1. Among the various DNA polymerases tested, DreamTaq™ polymerase
was highly efficient and yielded well-balanced amplicon intensities.
The use of BSA as an additive is recommended because
DreamTaq™ is sensitive to inhibitors arising from the
template DNA, which depend on the isolation method. Other
polymerases can be used as well but might differ in template or
additive requirements and may influence the intensity of high or
low molecular weight bands.
2. Primers consist of a SINE-derived part that is 5′-extended by
a synthetic 20mer C/G-rich part (e.g., CTG ACG GGC CTA
ACG GAG CG, [12]) enabling high annealing temperatures in
a two-step PCR after three initial cycles (see Subheading 3.2,
step 3). The extension is identical for all primers. Primers
developed for potato are listed in Table 1.
Table 1
ISAP primers used for potato containing the 20 bp 5′ extension (the first 20 nucleotides of each sequence)
Name                   Primer sequence (5′–3′)                      Ta (°C)
SolS-IIIa-extended-F   CTGACGGGCCTAACGGAGCGCCTATGTGGTTTGCGAGC       60
SolS-IIIa-extended-R   CTGACGGGCCTAACGGAGCGTAACCCGCACTAGGCAAG       60
SolS-IV-extended-F     CTGACGGGCCTAACGGAGCGGTCACAGACGGATTTCTCG      60
SolS-IV-extended-R     CTGACGGGCCTAACGGAGCGCCCTTTGGATCAATCACAGC     60
SINE part-specific annealing temperatures (Ta) for ISAP-PCR are provided
3. We recommend the Sub-Cell® GT Agarose Gel Electrophoresis
Systems and the power supply from Bio-Rad for these applications
because of the consistent DNA fragment separation.
4. Disposal of ethidium bromide containing agarose gels and solu-
tions should strictly follow the local waste disposal regulations.
5. The labeling of one primer is sufficient for this application.
Fluorescence labeling dyes depend on the capillary sequencing
device used.
6. The targeted identification of SINE families can be carried out
as described by Wenke et al. [6]. SINEs stored in sequence
databases (e.g., EMBL, GenBank) can also be used.
7. Highly abundant SINE families showing high identity values
between copies are most suitable because they contain more con-
served primer binding sites compared to small and diverse SINE
families. If possible derive primers from the 3′ end of a SINE
family because many family members might be 5′ truncated.
8. The primer extension in combination with an increased anneal-
ing temperature (see Subheading 3.2, step 3) leads to better
balanced amplicon intensities and reduces the background smear
due to increased specificity at higher annealing temperatures.
9. Depending on the abundance and diversity of the SINE families
and appropriate primer binding sites higher or lower concentra-
tions of template DNA might be required. In case of weak band
profiles, a reduction of the template amount could be helpful.
10. The primer amount can influence the PCR product and if
necessary has to be adapted. Reduced primer concentrations
result in a preference of higher molecular weight bands due to
a shift of the primer/template ratio.
11. The first three cycles (program step b) are performed at an
annealing temperature corresponding to the SINE-derived
primer part determined by gradient PCR (see Subheading 3.1,
step 3). For the following 27 cycles (program step c) the
annealing temperature is increased to match the annealing
temperature of the extended primers.
12. A gel thickness of 4 mm or less is recommended because of the
higher resolution of low molecular weight bands and reduced
background fluorescence. Consider the evaporation of water during
heating and solidification of the gel. Use a gel comb with
wide slots (approximately 5 mm with a volume of 25 μl). Low
temperatures during gel solidification can sharpen the bands.
13. Excess agarose gel electrophoresis buffer decreases the DNA
mobility and promotes distorted bands.
14. The DreamTaq™ Green Buffer allows direct loading of the
samples on the gel. For accurate sizing adequate size standards
are required. We recommend a 100 bp DNA ladder loaded in
the flanking lanes and at least after every eighth sample for
software-based pattern normalization.
15. The remaining PCR reaction volume can be separated simulta-
neously by conventional agarose gel electrophoresis.
Acknowledgement
We gratefully acknowledge the funding of the KMU-Innovativ-project
(grant no. 0315425) by the German Federal Ministry of Education
and Research.
References
1. Finnegan DJ (1989) Eukaryotic transposable elements and genome evolution. Trends Genet
5:103–107
2. Ohshima K, Okada N (2005) SINEs and LINEs: symbionts of eukaryotic genomes with
a common tail. Cytogenet Genome Res 110:475–490
3. Lenoir A, Lavie L, Prieto JL, Goubely C, Cote JC, Pelissier T, Deragon JM (2001) The
evolutionary origin and genomic organization of SINEs in Arabidopsis thaliana. Mol Biol Evol
18:2315–2322
4. Ohtsubo H, Cheng C, Ohsawa I et al (2004) Rice retroposon p-SINE1 and origin of
cultivated rice. Breed Sci 54:1–11
5. Xu JH, Osawa I, Tsuchimoto S, Ohtsubo E, Ohtsubo H (2005) Two new SINE elements,
p-SINE2 and p-SINE3, from rice. Genes Genet Syst 80:161–171
6. Wenke T, Döbel T, Sörensen TR, Junghans H, Weisshaar B, Schmidt T (2011) Targeted
identification of short interspersed nuclear element families shows their widespread existence
and extreme heterogeneity in plant genomes. Plant Cell 23:3117–3128
7. Zhang X, Wessler SR (2005) BoS: a large and diverse family of short interspersed elements
(SINEs) in Brassica oleracea. J Mol Evol 60:677–687
8. Deragon JM, Zhang X (2006) Short interspersed elements (SINEs) in plants: origin,
classification, and use as phylogenetic markers. Syst Biol 55:949–956
9. Fawcett JA, Kawahara T, Watanabe H, Yasui Y (2006) A SINE family widely distributed in the
plant kingdom and its evolutionary history. Plant Mol Biol 61:505–514
10. Tsuchimoto S, Hirao Y, Ohtsubo E, Ohtsubo H (2008) New SINE families from rice, OsSN,
with poly(A) at the 3′ ends. Genes Genet Syst 83:227–236
11. Baucom RS, Estill JC, Chaparro C, Upshaw N, Jogi A, Deragon JM, Westerman RP, SanMiguel
PJ, Bennetzen JL (2009) Exceptional diversity, non-random distribution, and rapid evolution
of retroelements in the B73 maize genome. PLoS Genet 5:e1000732
12. Seibt KM, Wenke T, Wollrab C, Junghans H, Muders K, Dehmer KJ, Diekmann K, Schmidt
T (2012) Development and application of SINE-based markers for genotyping of potato
varieties. Theor Appl Genet 125:185–196
13. Huang SW, Xu X, Pan SK, Cheng SF, Zhang B et al (2011) Genome sequence and analysis of
the tuber crop potato. Nature 475:189–195
14. Pieterse L, Hils U (2009) World catalogue of potato varieties 2009/10. Agrimedia GmbH,
Clenze
15. Rodriguez F, Ghislain M, Clausen AM, Jansky SH, Spooner DM (2010) Hybrid origins of
cultivated potatoes. Theor Appl Genet 121:1187–1198
16. Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space
complexity. BMC Bioinformatics 5:1–19
17. Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple
sequence alignment based on fast Fourier transform. Nucleic Acids Res 30:3059–3066
18. Saghai-Maroof MA, Soliman KM, Jorgensen RA, Allard RW (1984) Ribosomal DNA
spacer-length polymorphisms in barley: mendelian inheritance, chromosomal location, and
population dynamics. Proc Natl Acad Sci U S A 81:8014–8018
Chapter 15
Screening of Mutations by TILLING in Plants

Nian Wang and Lei Shi
Abstract
TILLING (Targeting Induced Local Lesions IN Genomes) is a well-known reverse genetics technique
designed to detect unknown SNPs (single nucleotide polymorphisms) in genes of interest using an enzymatic
digestion and is widely employed in plant and animal genomics. The main advantage of this technique is
that it allows for the high-throughput identification of an allelic series of mutants with a range of modified
functions for a particular gene. In this chapter, we aim to give a detailed introduction of how to establish
a TILLING platform for identifying mutants in plants, including generation of a large mutant population,
DNA and seed library preparation, mutation identification based on a LI-COR4300 DNA analyzer, and
confirmation of functions of the mutated genes.
Key words TILLING, Mutant population, Screening of mutations
1 Introduction
The development of high-throughput and low-cost methods for
the discovery of natural and induced mutations has enabled novel
approaches for reverse genetics and germplasm characterization.
Methods that combine enzymatic mismatch cleavage and fluorescence
detection by gel or capillary electrophoresis have been the
primary mutation discovery platforms for the reverse-genetics
strategy known as TILLING (Targeting Induced Local Lesions IN
Genomes). This approach has worked well in combination with
mutagens that cause primarily SNPs or small indels. In 2000,
denaturing high-performance liquid chromatography (DHPLC)
was used in TILLING for SNP identification [1, 2]. In 2003, the
LI-COR 4300 DNA analyzer was adopted instead of DHPLC to screen
mutations in Arabidopsis thaliana, which considerably improved the
efficiency of TILLING [3, 4]. To date, TILLING has
been successfully employed to screen mutations of target genes
in a wide variety of species including maize, legumes, rice,
canola, tomato, potato, Medicago, oat, sunflower, peanut, and wheat
[5–29]. With improving technology, a number of studies
have shown that TILLING can be combined with next-generation
sequencing, which has greatly increased the efficiency of this
technique in rice and wheat [30].
In comparison with other well-known techniques for the production
and screening of mutants in plants, e.g., insertional mutagenesis
and RNA interference, TILLING offers several
advantages [30–32]. Firstly, mutants can easily be obtained in
TILLING projects for most plants without bias. Although insertional
mutagenesis and RNA interference have been used to obtain
reduction-of-function or knockout mutations, most of these approaches rely
on transgenic systems. This makes it difficult to obtain mutations
for a number of plants for which transgenic
approaches are not available. In a TILLING experiment, the chemical mutagenic
treatment of plant seeds, pollen, or explants provides an easy and
cost-effective way to produce mutations in the genome. Secondly, the
construction of a mutant population by mutagenic treatment is
highly efficient in creating mutations in a plant genome. Alkylating
agents, such as ethyl methanesulfonate (EMS), diethyl sulfate (DES),
ethylene imine (EI), N-methyl-N-nitroso urethane (MNU), and
sodium azide (SA), are the most popular chemical mutagens used for
the mutagenic treatment. This treatment can cause random point
mutations or short insertions/deletions (INDELs) at a high density in
the plant genome, and therefore a small population can
meet the requirement of producing “saturated mutation” in a plant
genome. Thirdly, gene regions are targeted for mutation discovery
without bias, as products of PCR amplification are used for screening.
In theory, any region of a gene can be targeted if a suitable
primer pair can be designed for PCR amplification. In addition,
the TILLING technique has other advantages: for example,
the mutants identified can easily be used for subsequent plant
breeding, the whole TILLING procedure saves cost and time,
and mutation screening is highly accurate. All these advantages make
the TILLING approach widely applicable in plant research.
A typical TILLING project comprises several consecutive steps:
generation of a large mutant population,
DNA and seed library preparation, mutation identification, and
final confirmation of changed gene functions (Fig. 1) [33]. In this
chapter, we aim to give a detailed introduction to establishing
a TILLING platform for identifying mutations in plants.
2 Materials
All solutions should be prepared using ultrapure water (18 MΩ cm
at 25 °C) and analytical grade reagents.
2.1 Construction of Mutant Population
1. Seeds used for mutagenesis should be homozygous, e.g., seeds
harvested from a doubled haploid (DH) line.
2. Ethyl methanesulfonate (EMS) solution: EMS (Sigma-Aldrich,
M0880, USA) (see Note 1). Buffer solution (for 1 l): dissolve
21.8 g Na2HPO4·12H2O and 6.08 g NaH2PO4·2H2O in ddH2O
to ~950 ml, adjust the pH to 7.0, and bring the volume to 1 l.
Fig. 1 Outline of the high-throughput TILLING procedure [33]. Seeds are mutagenized by treatment with
alkylating agents, such as EMS, which primarily introduces G/C to A/T transitions; M1 plants are self-fertilized, and
M2 individuals are used to prepare DNA samples for mutational screening, whilst an inventory of their seeds is
established for future and downstream research. For mutation screening, DNAs are pooled to maximize
the efficiency of mutation detection. PCR is performed using 5′-end-labeled gene-specific primers to target
the desired locus, and hetero-duplexes are formed by heating and cooling the PCR products. CEL I nuclease is
used to cleave at base mismatches, and the products representing induced mutations are visualized with
denaturing polyacrylamide gel electrophoresis
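As a quick plausibility check of the buffer recipe in Subheading 2.1, the phosphate concentration can be calculated from the weighed amounts. The molar masses below are standard values for the two hydrated salts and are not given in the chapter; the calculation is a sketch that ignores the small contribution of any acid or base used for the final pH adjustment.

```python
# Molar masses (g/mol) of the hydrated salts; standard values, not from the chapter.
MOLAR_MASS = {
    "Na2HPO4.12H2O": 358.14,
    "NaH2PO4.2H2O": 156.01,
}
# Amounts weighed per liter of buffer (Subheading 2.1, item 2).
GRAMS_PER_LITER = {
    "Na2HPO4.12H2O": 21.8,
    "NaH2PO4.2H2O": 6.08,
}

# Molar concentration of each salt in the final 1 l volume.
molarity = {salt: GRAMS_PER_LITER[salt] / MOLAR_MASS[salt] for salt in MOLAR_MASS}
total_phosphate = sum(molarity.values())

for salt, conc in molarity.items():
    print(f"{salt:15s}{conc * 1000:6.1f} mM")
print(f"Total phosphate {total_phosphate * 1000:6.1f} mM")
```

The result corresponds to a phosphate buffer of roughly 0.1 M with an approximately 61:39 ratio of the dibasic to the monobasic salt.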
2.2 Mutation Detection
1. Solution for CEL1: CEL1 extracted from celery stalk
(see Note 2).
2. Buffer B: 0.1 M Tris–HCl, pH 7.7, 0.5 M KCl, 100 μl PMSF.
3. 10× CEL1 buffer solution: 100 ml 1 M MgSO4, 100 ml 1 M
HEPES, pH 7.0 (see Note 3), 50 ml 2 M KCl, 100 μl 10 %
Triton X-100, 100 μl 20 mg/ml BSA. Mix all reagents with
ddH2O and bring the volume to 1 l.
4. LI-COR 4300 DNA analyzer and the corresponding accessories,
such as 25 cm gel glass plates and 0.25 mm spacers.
5. Gel: KBPlus 6.5 % Gel Matrix (LI-COR Biosciences GmbH,
Germany) (see Note 4), 10 % APS, TEMED, 25 ml syringe and
a 0.2 μm filter, assembled (clean) glass plates, 1 l 0.8× TBE (1×
TBE for 6.5 % gel), formamide loading dye (MWG, Germany).
6. Sephadex G50 (Sigma-Aldrich, S5897, USA), 96-well spin
plates, 96-well column loader (Millipore).
7. Labeled primers and DNA ladders (see Note 5).
3 Methods
3.1 Construction of Mutant Population
1. Dissolve EMS in buffer solution at the required concentration.
Incubate seeds in the EMS solution for 24–36 h (see Note 6).
2. Wash the treated seeds in fresh water for 3–4 h (see Note 7).
3. Sow the mutagenized seeds in the field. The plants are
designated as the M1 generation (see Note 8).
4. Self-pollinate each M1 plant and harvest the seeds.
5. Sow the seeds from M1 plants in the field. These plants are
designated as the M2 generation. Make sure each M1 plant produces
an M2 plant (see Note 9).
6. Extract DNA from each M2 plant and store at −20 °C after the
DNA concentrations have been adjusted to a similar
level. The DNA extraction method will depend on the
species being investigated.
7. Harvest the self-pollinated seeds of the M2 plants and store
under dry and low temperature conditions.
3.2 Mutation Detection
3.2.1 Primer Design
Primers are designed according to the sequence of the candidate
genes being targeted. Ensure that the primers amplify the targeted
genes specifically (see Note 10).
3.2.2 PCR Setup
The DNA templates for the PCR reaction are pooled four- to
eightfold (see Note 11). All reactions should be kept out of the
light as much as possible to protect the fluorescent oligos from
degrading. Use between 10 and 100 ng of genomic DNA as template
for the PCR reaction, which is performed in a 96-well plate.
The components of the PCR mix are listed in Table 1. PCR cycling
conditions are outlined in Table 2.
Table 1
The components of PCR mix
Components                 Single reaction   96-well plate
Template DNA               10–100 ng
Primer mix                 0.02 μl           2
10× PCR buffer             0.5 μl            50
25 mM MgCl2                0.6 μl            60
10 mM dNTPs                0.8 μl            80
(2.5 mM in each dNTP)
Taq polymerase             0.2 U             50
Total reaction volume      10 μl             1 ml
Table 2
PCR cycling conditions, including heteroduplex formation
Temperature                       Time      Step
94 °C                             2 min
Loop 1: 8 cycles
94 °C                             30 s      Touch down
65 °C                             30 s      Increment −1 °C/cycle
72 °C                             30 s
Loop 2: 35 cycles
94 °C                             30 s      Amplify
58 °C                             30 s
72 °C                             30 s
72 °C                             5 min     Final extension
99 °C                             10 min    Denaturation
95 °C                             10 min
Decrease 95–80 °C at 3 °C/min               Renaturation
Decrease 80–55 °C at 1 °C/min
Hold at 55 °C for 20 min
Hold at 25 °C forever
The primer mix contains labeled and unlabeled forward and
reverse primers. The mixture contains a 3:2 ratio of IRD700-labeled
to unlabeled forward primers and 4:1 ratio of IRD800-labeled to
unlabeled reverse primers; the final concentration for the primer
cocktail is 100 μM for each forward and reverse primer sequence.
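The touchdown and renaturation phases of Table 2 can be written out explicitly to check a thermocycler program. The sketch below assumes that the first touchdown cycle anneals at 65 °C and that the temperature is lowered by 1 °C before each subsequent cycle; how a particular instrument applies the increment may differ. The function names are hypothetical.

```python
from typing import List


def touchdown_annealing(start_c: float = 65.0, step_c: float = -1.0,
                        cycles: int = 8) -> List[float]:
    """Annealing temperature of each touchdown cycle (Loop 1 in Table 2)."""
    return [start_c + i * step_c for i in range(cycles)]


def ramp_minutes(from_c: float, to_c: float, rate_c_per_min: float) -> float:
    """Duration of a linear temperature ramp in minutes."""
    return abs(from_c - to_c) / rate_c_per_min


annealing = touchdown_annealing()
print("Touchdown annealing temperatures:", annealing)

# Renaturation for heteroduplex formation: two ramps plus the 20 min hold at 55 C.
renaturation_min = ramp_minutes(95, 80, 3) + ramp_minutes(80, 55, 1) + 20
print("Renaturation phase:", renaturation_min, "min")
```

Under these assumptions the last touchdown cycle anneals at 58 °C, the same temperature used for the 35 amplification cycles of Loop 2, and the renaturation phase takes 50 min in total.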
3.2.3 CEL1 Digestion
This step is performed to cleave the mismatches formed at DNA
mutations. The original crude CEL1 enzyme is extracted from 500 g
celery stalk or can be purchased (see Note 12). CEL1 working
solution is prepared with 4 μl of crude CEL1 and 96 μl of buffer B.
The digestion conditions and procedure are as follows.
1. The 96-well plate with the digestion reaction mix (Table 3)
is incubated for 15 min at 45 °C.
2. Stop the reaction by adding 5 μl of 75 mM EDTA to each well in
the plate and mix thoroughly.
3. The products should be frozen or purified immediately as the
CEL1 enzyme is very active and might continue cutting/
degrading the sample.
Table 3
CEL1 reaction mix composition
                           Single reaction (μl)   96-well reaction (μl)
10× CEL1 buffer            1.5                    150
ddH2O                      8.5                    850
CEL1 stock                 1                      100
Template DNA               5                      500
Total reaction volume      15                     1,500
3.2.4 Sample Purification: Isopropanol Precipitation
This step is performed to remove excess salt and concentrate the
CEL1-digested product. Both Sephadex G50 (cross-linked dextran,
from Pharmacia) and isopropanol precipitation can be used
for sample purification and concentration, and the workflows
of both are given (for Sephadex purification, see Subheading 3.2.5).
1. 15 μl of isopropanol is added to each well in the 96-well plate
which contains the inactivated CEL1-digested products.
2. The mixture in each well is pipetted up and down (manually).
The 96-well plate with samples is spun for 15 min at 4,000 × g
in a microplate centrifuge.
3. The supernatant of each sample is removed, and the plate is
blotted on a paper towel.
4. The precipitate of each sample is washed with 20 μl 70 % ethanol
and spun for 15 min, and the supernatant is decanted.
5. The pellet is dissolved in 5 μl of formamide loading buffer.
Remnants of ethanol are not a concern, as they will evaporate in
the next step.
6. The samples are heated at 85 °C for about 0.5 h until the final
volume of the samples is about 3–5 μl.
3.2.5 Sample Purification: Sephadex Purification
1. Sephadex G50 is loaded in a 96-well spin Sephadex plate and
the excess of resin is removed using a column loader.
2. 300 μl sterile ddH2O is added to each well with an 8-channel
pipette.
3. The 96-well plate with resin is incubated at room temperature
for at least 1.5 h or stored at 4 °C for up to 1 week.
4. Put the Sephadex plate on an alignment frame and an empty
96-well plate. Centrifuge them for 2 min and remove the plate
with the flow-through (water).
5. Fill a new PCR plate with 4 μl formamide loading dye
(see Note 13).
6. The Sephadex plate and alignment frame are put on the PCR
plate with the loading dye.
7. Add all of the samples (the inactivated CEL1 digestion reactions)
directly into the wells of the Sephadex plate and make sure they
are placed over the centers of the columns without touching them
with the pipette tips.
8. Put a new 96-well plate under the Sephadex plate and centrifuge
them for 2 min. The purified products are now in the
flow-through (see Note 14).
9. Heat the samples at 85 °C for about 0.5 h until the final volume
is about 3–5 μl.
3.2.6 Preparing Gels and Electrophoresis
The 25 cm gels are prepared for electrophoresis to identify the
mutants using the LI-COR 4300 DNA analyzer.
1. Mix 20 ml of KBPlus gel matrix with 150 μl of 10 % APS and then add
15 μl of TEMED.
2. Fill the syringe with the above mixture and attach the filter to
the syringe (see Note 15).
3. Immediately pour the mixture by dispensing it through the filter
into the gap between the glass plates. Air bubbles
can be prevented from forming by tapping on the glass
plates at the liquid edge. If air bubbles appear, they can be
removed just after the gel is poured using the bubble catcher
(see Note 16).
4. Insert the comb spacer in the center.
5. Insert the plexiglass pressure plate between the plates and
clamp rails and tighten the screws (see Note 17).
6. Let the gel fully polymerize for at least 1.5 h or store the gel at
4 °C for not more than 1 day (see Note 18).
7. Remove the plexiglass plate and the comb spacer.
8. Remove the excess polyacrylamide in and above the slot with
a pipette.
9. Rinse the outside of the plates with ddH2O and ethanol
(see Note 19).
10. Place the gel on the LI-COR machine and set up all accessories
following the manual of the LI-COR DNA analyzer.
11. Run the gel (see Note 20).
3.3 Scoring and Confirmation of the Mutations
This step is performed to identify which pooled samples
harbor mutations according to the images from the LI-COR 4300
DNA analyzer.
1. Download the IRD 700 and 800 nm images from the LI-COR
4300 DNA analyzer to a personal computer.
2. Check whether any lanes on the 700
and 800 nm images show new bands in addition to the main bands
of the PCR products of the targeted gene and primer dimers.
3. Calculate the sizes of the new bands obtained from step 2 on the
same lanes on the 700 and 800 nm images, respectively.
4. A lane with new bands on both the 700 and 800 nm images, for which
the sizes of the two bands add up to the size of the PCR product
of the targeted gene, can be regarded as indicating a possible mutation
among the pooled samples of that lane.
5. Confirm the mutation by sequencing (see Note 21).
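The size criterion in step 4 can be expressed as a simple check. In the sketch below, new band sizes from the two channels of one lane are paired, and a pair is reported when its sum matches the full-length PCR product within a tolerance. The tolerance of 10 bp, the band sizes, and the amplicon length in the example are hypothetical values chosen for illustration; suitable values depend on the sizing accuracy of the gel.

```python
from typing import List, Tuple


def candidate_mutations(bands_700: List[int], bands_800: List[int],
                        amplicon_bp: int,
                        tolerance_bp: int = 10) -> List[Tuple[int, int]]:
    """Pairs of new bands (700 nm, 800 nm) whose sizes sum to the amplicon size."""
    hits = []
    for size_700 in bands_700:
        for size_800 in bands_800:
            if abs((size_700 + size_800) - amplicon_bp) <= tolerance_bp:
                hits.append((size_700, size_800))
    return hits


# Hypothetical lane: new bands seen in each channel for a 1,000 bp amplicon.
hits = candidate_mutations(bands_700=[350, 620], bands_800=[648, 200],
                           amplicon_bp=1000)
print(hits)
```

Lanes that return no pair are treated as background, whereas lanes with a matching pair are the candidates carried forward to sequencing in step 5.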
4 Notes
1. Other alkylating agents, such as DES, EI, MNU, and SA, also
work well here.
2. Crude CEL1 is extracted from celery stalk following the protocol
reported by Oleykowski et al. [34].
3. This buffer can be purchased or prepared in-house.
4. To reduce costs, the 6.5 % PAGE gel can be prepared in-house.
All reagents should be of high purity (>99.9 %). The mixture
should be filtered before it is used to cast the gel.
5. There are different types of IRD labeled DNA ladder. A proper
one should be selected according to the size of the screened
region being targeted. Both IRD 700 and 800 nm DNA lad-
ders are required.
6. Concentration of EMS and the time treated with EMS of seeds
or explants of different plants varies largely. A pilot experiment
should be performed.
7. Pay attention to the toxic water. Dispose of it appropriately.
8. It is normal to see a number of M1 plants that die or cannot set
seeds. Thus, an additional number of seeds or explants should
be treated.
9. In this step, single seed descent would be the most popular
strategy. However, some researchers also prefer for one M1
plant to produce 3–4 M2 lines. This increases the availability of
mutations created in the EMS treatment, but increases the
quantity of the task.
10. The specificity of the primer pairs is the key point to screen the
mutations. If a primer pair can amplify more than one product,
it will increase the complexity of mutation scoring and doesn’t
work in most cases.
11. The pooling strategy can largely increase the efficiency of
TILLING. A pilot experiment should be done to determine
the proper pooling folds. The new pooling samples and the
original one should be distinguished.
12. Crude CEL1 is extracted from celery stalk.
13. To reduce the cost, the formamide loading dye can be prepared
from the following components: 25 ml of deionized formamide,
500 μl of 0.5 M EDTA, pH 8.0, 6 mg of bromophenol blue, and
67 ml of ddH2O.
14. The used sephadex plate can be cleaned and reused after dry-
ing at 36 °C and washing.
15. This step should be performed as soon as possible to prevent
the gel polymerizing.
16. The glass plate should be placed at a horizontal level. Striking
the plate very gently can prevent the bubbles.
17. Clamp the plates.
18. If the gel is to be kept for a day, its ends should be covered with
a wet (0.8× TBE buffer) tissue and wrapped in a plastic foil.
19. The place where the laser scans the gel should be very clean so
as to prevent strong background from appearing in the final
images.
20. Follow the manual of the LI-COR DNA analyzer to set up
the electrophoresis parameters.
21. Sequence the target PCR products of the samples that
harbor possible mutations. Compare the sequences of the PCR
products of the potential mutants with the wild type to confirm
the mutation.
References
1. McCallum CM, Comai L, Greene EA, Henikoff S (2000) Targeted screening for induced mutations. Nat Biotechnol 18:455–457
2. McCallum CM, Comai L, Greene EA, Henikoff S (2000) Targeting induced local lesions IN genomes (TILLING) for plant functional genomics. Plant Physiol 123:439–442
3. Till BJ, Colbert T, Tompa R, Enns LC, Codomo CA, Johnson JE, Reynolds SH, Henikoff JG, Greene EA, Steine MN, Comai L, Henikoff S (2003) High-throughput TILLING for functional genomics. Methods Mol Biol 236:205–220
4. Till BJ, Reynolds SH, Greene EA, Codomo CA, Enns LC, Johnson JE, Burtner C, Odden AR, Young K, Taylor NE, Henikoff JG, Comai L, Henikoff S (2003) Large-scale discovery of induced point mutations with high-throughput TILLING. Genome Res 13:524–530
5. Jones MO, Piron-Prunier F, Marcel F, Piednoir-Barbeau E, Alsadon AA, Wahb-Allah MA, Al-Doss AA, Bowler C, Bramley PM, Fraser PD, Bendahmane A (2012) Characterisation of alleles of tomato light signalling genes generated by TILLING. Phytochemistry 79:78–86
6. Gady AL, Vriezen WH, Van de Wal MH, Huang P, Bovy AG, Visser RG, Bachem CW (2012) Induced point mutations in the phytoene synthase 1 gene cause differences in carotenoid content during tomato fruit ripening. Mol Breed 29:801–812
7. Chen L, Huang L, Min D, Phillips A, Wang S, Madgwick PJ, Parry MA, Hu YG (2012) Development and characterization of a new TILLING population of common bread wheat (Triticum aestivum L.). PLoS One 7:e41570
8. Sabetta W, Alba V, Blanco A, Montemurro C (2011) sunTILL: a TILLING resource for gene function analysis in sunflower. Plant Methods 7:20
9. Slade AJ, McGuire C, Loeffler D, Mullenberg J, Skinner W, Fazio G, Holm A, Brandt KM, Steine MN, Goodstal JF, Knauf VC (2012) Development of high amylose wheat through TILLING. BMC Plant Biol 12:69
10. Sikora P, Chawade A, Larsson M, Olsson J, Olsson O (2012) Mutagenesis as a tool in plant genetics, functional genomics, and breeding. Int J Plant Genomics 2011:314829
11. Okabe Y, Asamizu E, Saito T, Matsukura C, Ariizumi T, Bres C, Rothan C, Mizoguchi T, Ezura H (2011) Tomato TILLING technology: development of a reverse genetics tool for the efficient isolation of mutants from Micro-Tom mutant libraries. Plant Cell Physiol 52:1994–2005
12. Knoll JE, Ramos ML, Zeng Y, Holbrook CC, Chow M, Chen S, Maleki S, Bhattacharya A, Ozias-Akins P (2011) TILLING for allergen reduction and improvement of quality traits in peanut (Arachis hypogaea L.). BMC Plant Biol 11:81
13. Stephenson P, Baker D, Girin T, Perez A, Amoah S, King GJ, Ostergaard L (2010) A rich TILLING resource for studying gene function in Brassica rapa. BMC Plant Biol 10:62
14. Minoia S, Petrozza A, D'Onofrio O, Piron F, Mosca G, Sozio G, Cellini F, Bendahmane A, Carriero F (2010) A new mutant genetic resource for tomato crop improvement by TILLING technology. BMC Res Notes 3:69
15. Fitzgerald TL, Kazan K, Li Z, Morell MK, Manners JM (2010) A high-throughput method for the detection of homologous gene deletions in hexaploid wheat. BMC Plant Biol 10:264
16. Uauy C, Paraiso F, Colasuonno P, Tran RK, Tsai H, Berardi S, Comai L, Dubcovsky J (2009) A modified TILLING approach to detect induced mutations in tetraploid and hexaploid wheat. BMC Plant Biol 9:115
17. Perry J, Brachmann A, Welham T, Binder A, Charpentier M, Groth M, Haage K, Markmann K, Wang TL, Parniske M (2009) TILLING in Lotus japonicus identified large allelic series for symbiosis genes and revealed a bias in functionally defective ethyl methanesulfonate alleles toward glycine replacements. Plant Physiol 151:1281–1291
18. Morita R, Kusaba M, Iida S, Yamaguchi H, Nishio T, Nishimura M (2009) Molecular characterization of mutations induced by gamma irradiation in rice. Genes Genet Syst 84:361–370
19. Le Signor C, Savois V, Aubert G, Verdier J, Nicolas M, Pagny G, Moussy F, Sanchez M, Baker D, Clarke J, Thompson R (2009) Optimizing TILLING populations for reverse genetics in Medicago truncatula. Plant Biotechnol J 7:430–441
20. Elias R, Till BJ, Mba C, Al-Safadi B (2009) Optimizing TILLING and Ecotilling techniques for potato (Solanum tuberosum L). BMC Res Notes 2:141
21. de Lorenzo L, Merchan F, Laporte P, Thompson R, Clarke J, Sousa C, Crespi M (2009) A novel plant leucine-rich repeat receptor kinase regulates the response of Medicago truncatula roots to salt stress. Plant Cell 21:668–680
22. Wang N, Wang Y, Tian F, King GJ, Zhang C, Long Y, Shi L, Meng J (2008) A functional genomics resource for Brassica napus: development of an EMS mutagenized population and discovery of FAE1 point mutations by TILLING. New Phytol 180:751–765
23. Suzuki T, Eiguchi M, Kumamaru T, Satoh H, Matsusaka H, Moriguchi K, Nagato Y, Kurata N (2008) MNU-induced mutant pools and high performance TILLING enable finding of any gene mutation in rice. Mol Genet Genomics 279:213–223
24. Cooper JL, Till BJ, Laport RG, Darlow MC, Kleffner JM, Jamai A, El-Mellouki T, Liu S, Ritchie R, Nielsen N, Bilyeu KD, Meksem K, Comai L, Henikoff S (2008) TILLING to detect induced mutations in soybean. BMC Plant Biol 8:9
25. Till BJ, Cooper J, Tai TH, Colowit P, Greene EA, Henikoff S, Comai L (2007) Discovery of chemically induced mutations in rice by TILLING. BMC Plant Biol 7:19
26. Horst I, Welham T, Kelly S, Kaneko T, Sato S, Tabata S, Parniske M, Wang TL (2007) TILLING mutants of Lotus japonicus reveal that nitrogen assimilation and fixation can occur in the absence of nodule-enhanced sucrose synthase. Plant Physiol 144:806–820
27. Heckmann AB, Lombardo F, Miwa H, Perry JA, Bunnewell S, Parniske M, Wang TL, Downie JA (2006) Lotus japonicus nodulation requires two GRAS domain regulators, one of which is functionally conserved in a non-legume. Plant Physiol 142:1739–1750
28. Till BJ, Reynolds SH, Weil C, Springer N, Burtner C, Young K, Bowers E, Codomo CA, Enns LC, Odden AR, Greene EA, Comai L, Henikoff S (2004) Discovery of induced point mutations in maize genes by TILLING. BMC Plant Biol 4:12
29. Perry JA, Wang TL, Welham TJ, Gardner S, Pike JM, Yoshida S, Parniske M (2003) A TILLING reverse genetics tool and a web-accessible collection of mutants of the legume Lotus japonicus. Plant Physiol 131:866–871
30. Tsai H, Howell T, Nitcher R, Missirian V, Watson B, Ngo KJ, Lieberman M, Fass J, Uauy C, Tran RK, Khan AA, Filkov V, Tai TH, Dubcovsky J, Comai L (2011) Discovery of rare mutations in populations: TILLING by sequencing. Plant Physiol 156:1257–1268
31. Kurowska M, Daszkowska-Golec A, Gruszka D, Marzec M, Szurman M, Szarejko I, Maluszynski M (2011) TILLING: a shortcut in functional genomics. J Appl Genet 52:371–390
32. Weil CF (2009) TILLING in grass species. Plant Physiol 149:158–164
33. Colbert T, Till BJ, Tompa R, Reynolds S, Steine MN, Yeung AT, McCallum CM, Comai L, Henikoff S (2001) High-throughput screening for induced point mutations. Plant Physiol 126:480–484
34. Oleykowski CA, Bronson Mullins CR, Godwin AK, Yeung AT (1998) Mutation detection using a novel plant endonuclease. Nucleic Acids Res 26:4597–4602
Chapter 16
Gene Analysis Using Mass Spectrometric Cleaved Amplified
Polymorphic Sequence (MS-CAPS) with Matrix-Assisted Laser
Desorption Ionization Time-of-Flight Mass Spectrometry (MALDI-TOF)
Hideyuki Kajiwara
Abstract
Mass spectrometric cleaved amplified polymorphic sequence (MS-CAPS) is a method for detecting genes
using a combination of short PCR and matrix-assisted laser desorption ionization time-of-flight mass
spectrometry (MALDI-TOF MS). MS-CAPS can identify a single nucleotide polymorphism (SNP) in less
than one hour and is suitable for plants, animals, bacteria, and food.
Key words Mass spectrometric cleaved amplified polymorphic sequence (MS-CAPS), Matrix-assisted
laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS), Polymerase
chain reaction (PCR), Restriction enzyme, Single nucleotide polymorphisms (SNPs), Uracil-DNA
glycosylase (UDG)
1 Introduction
Several methods for detecting single nucleotide polymorphisms
(SNPs) [1] have been used for gene analysis. Among them, the
GOOD assay [2–4] is useful for analysis of genes that include
SNPs, and a commercial system to detect SNPs based on this assay
is now available [5]. However, genotyping using the GOOD assay
is considered time consuming and labor intensive. It requires two
PCRs and special primers that are cleaved by ultraviolet radiation
to decrease the mass of the amplicon for analysis. The first PCR
step in the GOOD assay requires purified genomic DNA (gDNA)
because a relatively long sequence must be amplified to produce
the template for the second PCR.
There are several other approaches for analyzing differences at the
SNP level using gel electrophoresis, including cleaved amplified
polymorphic sequence (CAPS) analysis [7] and derived CAPS
analysis [6]. Although these techniques are useful for analyzing
differences among cultivars, the time required to prepare purified
gDNA for PCR and to detect amplicons by gel electrophoresis is
a disadvantage. In addition, it is difficult to analyze many samples
simultaneously, because amplicons must be analyzed manually by
gel electrophoresis.
Rapid methods for gene identification are in demand for
commercial and agricultural purposes [8]. Previously, we reported
a method combining short PCR and MALDI-TOF MS analysis to
detect the transgene in transgenic rice grains [9]. Building on this,
a method combining short PCR and MALDI-TOF MS, designated
mass spectrometric cleaved amplified polymorphic sequence
(MS-CAPS) analysis, was developed to detect SNPs for
discriminating rice cultivars [10] and bacterial strains [11]. With the
addition of a polyphenol oxidase inhibitor, the method is suitable
for various plants [12]. It was further improved by introducing
asymmetric PCR [13] and fast PCR [14, 15] for quick analysis. In
this method, PCR products are amplified using a biotinylated primer
and then digested by a restriction enzyme or treated with uracil-DNA
glycosylase (UDG). The digested DNA is then purified using
streptavidin-coated magnetic beads (MB/SA) and analyzed by
MALDI-TOF MS. Crude extracts containing gDNA can be used as
the experimental material without further purification.
2 Materials
Prepare all solutions using sterilized ultrapure water and analytical
grade reagents. Sterilize all plastic ware and glassware by autoclaving.
2.1 DNA Extraction
1. gDNA extraction solution: pure sterilized water or water
containing 20 mM cysteine: Dissolve 351 mg L-cysteine
hydrochloride monohydrate in 100 ml of water (see Note 1).
2. Plastic or glass rod.
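The 351 mg figure for the cysteine solution can be checked with a short calculation. The sketch below assumes a molar mass of 175.63 g/mol for L-cysteine hydrochloride monohydrate (a value not stated in the protocol); the function name is illustrative.

```python
def mass_mg(concentration_mM, volume_ml, molar_mass_g_per_mol):
    """Mass of solute (mg) needed for a given concentration and volume.

    mM x ml gives micromoles; micromoles x g/mol gives micrograms;
    dividing by 1000 converts micrograms to milligrams.
    """
    return concentration_mM * volume_ml * molar_mass_g_per_mol / 1000.0


# Assumed molar mass of L-cysteine hydrochloride monohydrate (g/mol).
CYSTEINE_HCL_H2O = 175.63

needed = mass_mg(20, 100, CYSTEINE_HCL_H2O)  # 20 mM in 100 ml
```

The same helper can be reused to scale the solution to other volumes.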
2.2 PCR
1. Primer sets to amplify the specific regions (see Note 2).
2. PCR amplification apparatus, such as the 2720 thermal cycler
(Applied Biosystems, Foster City, CA, USA) (see Note 3).
3. DNA polymerase: Tfi DNA polymerase with 10× buffer and
50 mM MgCl2 (Invitrogen, Carlsbad, CA, USA) (see Note 4).
4. dATP, dCTP, dGTP, dTTP, and dUTP: dilute each to 5 mM.
2.3 Enzyme Treatment and the Isolation of Single-Stranded DNA
1. Restriction enzymes (e.g., from New England Biolabs,
Ipswich, MA, USA).
2. Afu UDG (New England Biolabs) (see Note 5).
3. Incubator that can reach 65 °C.
4. Magnetic beads: Magnotex-SA (Takara Bio Inc., Kyoto, Japan).
5. Magnetic stand: Magnesphere technology magnetic separation
stand (Promega) (see Note 6).
6. Alkali denaturation solution: 0.1 M NaOH, 0.1 M NaCl.
7. 25 % ammonia solution.
8. Cap lock (As One Co. Ltd., Tokyo, Japan).
9. Vacuum concentrator: CL-105 centrifugal concentrator (Tomy
Seiko Co. Ltd., Tokyo, Japan).
2.4 Mass Spectrometry
1. Citric acid solution: 0.15 M diammonium hydrogen citrate.
2. Matrix solution: 2′,4′,6′-trihydroxyacetophenone (Wako
Pure Chemical Industries, Ltd., Osaka, Japan) in 50 % acetonitrile
(see Note 7).
3. 50 % acetonitrile.
4. MALDI-TOF MS apparatus: Ultraflex (Bruker Daltonics,
Billerica, MA, USA).
5. Target: Anchor chip (Bruker Daltonics).
3 Methods
Carry out all procedures at room temperature, unless otherwise
stated.
3.1 DNA Extraction and Isolation of Single-Stranded DNA
1. Manually crush pieces of sample (approx. 3–5 mm of plant
leaf) with a rod, then vortex in 0.1 ml water or 20 mM cysteine
(see Note 1) for 1–3 min.
2. Briefly centrifuge the sample, if needed, and use 0.8 μl of the
supernatant for PCR (see Note 8).
3.2 PCR
1. Prepare the solution for asymmetric PCR by mixing 0.8 μl of
extracted crude gDNA, 3.5 μl of 0.1 mM primer, 0.5 μl of
0.1 mM biotinylated primer, 14.8 μl of PCR amplification
mixture, and 0.4 μl of Tfi DNA polymerase in a 0.2 ml PCR tube.
The PCR amplification mixture consists of 0.8 μl of 5 mM dATP,
0.8 μl of 5 mM dCTP, 0.8 μl of 5 mM dGTP, 0.8 μl of 5 mM
dTTP, 0.6 μl of 50 mM MgCl2, 4 μl of 10× PCR buffer solution,
and 7 μl of water. If UDG is to be used, replace dTTP
with dUTP (see Note 9). The final volume should be 20 μl.
2. Amplify the DNA using the following reaction conditions for
fast PCR: denaturation for 1 min at 94 °C; then five cycles of
2 s at 94 °C, 2 s at X °C, and 2 s at 72 °C; then 25 cycles of 2 s
at 85 °C, 2 s at X °C, and 2 s at 72 °C. X °C is the annealing
temperature of the particular primers used (see Note 10).
3. Check the amplicon by acrylamide gel electrophoresis, if
required (see Note 11).
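The volumes in step 1 and the cycling program in step 2 can be sanity-checked with a short script. This is a minimal sketch: the dictionary keys and function names are illustrative, and the time estimate counts only the programmed hold times, not the temperature ramps that, according to Note 3, account for most of the run.

```python
# Volumes (in microliters) as listed in step 1.
AMPLIFICATION_MIX_UL = {
    "dATP_5mM": 0.8, "dCTP_5mM": 0.8, "dGTP_5mM": 0.8, "dTTP_5mM": 0.8,
    "MgCl2_50mM": 0.6, "buffer_10x": 4.0, "water": 7.0,
}
REACTION_UL = {
    "crude_gDNA": 0.8, "primer_0.1mM": 3.5, "biotinylated_primer_0.1mM": 0.5,
    "amplification_mix": 14.8, "Tfi_polymerase": 0.4,
}


def total_volume(components):
    """Sum component volumes, rounded to avoid floating-point noise."""
    return round(sum(components.values()), 3)


def programmed_hold_seconds(initial_s=60, stages=((5, (2, 2, 2)), (25, (2, 2, 2)))):
    """Total programmed hold time: initial denaturation plus all cycle steps.

    stages: (number of cycles, (denature_s, anneal_s, extend_s)) per stage.
    Ramp times between temperatures are instrument dependent and excluded.
    """
    return initial_s + sum(cycles * sum(steps) for cycles, steps in stages)


mix_total = total_volume(AMPLIFICATION_MIX_UL)   # expected to equal 14.8
reaction_total = total_volume(REACTION_UL)       # expected to equal 20.0
hold_time = programmed_hold_seconds()            # seconds of programmed holds
```

If the reaction is scaled, multiplying every entry by the same factor keeps the proportions; Note 3 reports occasional failures below 10 μl.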
3.3 Enzymatic Treatment
1. Mix 20 μl of amplicon, 6 μl of buffer for the restriction enzyme
or UDG, 33 μl of water, and 1 μl of restriction enzyme or
UDG (see Note 12).
2. Incubate the solution at the appropriate temperature for 15 min.
3. Add 2 μl of magnetic beads solution (see Note 13) and pipette
several times to disperse the magnetic beads. The magnetic
beads solution should be washed with 50 μl of water before
addition to the sample.
4. After 1 min, put the tube on the magnetic stand and leave for
30 s. Remove the solution (see Note 14).
5. Remove the tube from the magnetic stand and add 50 μl of
water. Mix by pipetting.
6. Put the tube on the magnetic stand and leave for 30 s. Remove
the solution.
7. Remove the tube from the magnetic stand and add 50 μl of
alkali denaturation solution and mix by pipetting. Leave at
room temperature for at least 5 min.
8. Put the tube on the magnetic stand and leave for 30 s. Remove
the solution.
9. Wash the magnetic beads three times using 100 μl of water
(see Note 15).
10. Add 50 μl of ammonia solution and incubate at 65 °C for at
least 10 min with tight sealing using a cap lock (see Note 16).
11. Put the tube on the magnetic stand and leave for 30 s
(see Note 16). Remove the solution and lyophilize using a
vacuum concentrator.
3.4 Mass Spectrometry
1. Dissolve the single-stranded DNA (ssDNA) in 1 μl of 0.15 M
diammonium hydrogen citrate and spot onto the target.
Immediately add the same volume of 100 mg/ml
2′,4′,6′-trihydroxyacetophenone in 50 % acetonitrile (see Note 7).
2. Add 1 μl of 50 % acetonitrile for recrystallization to the dried
sample spot, if required (see Note 17).
3. Place the target in the MALDI-TOF MS apparatus. Analyze in
linear positive ion mode (see Note 18) (Fig. 1).
4 Notes
1. In most cases, cysteine was not needed. Cysteine inhibits the
browning of plant tissue caused by polyphenol oxidase.
2. The primer design is very important. The following examples
are of primers designed based on SNPs in rice cultivars
(http://rapdb.dna.affrc.go.jp/) [10, 16]. The oligonucleotide may be
synthesized in-house or by a custom synthesis company. The length
of the linker between the 5′-end and the biotin depends on the
company. One of the primers must be biotinylated at the 5′ end. It is
advisable to confirm the specificity of the primer set using
non-biotinylated primers, because biotinylated primers are relatively
expensive. The SNP should be positioned in the region newly
synthesized by DNA polymerase during PCR (Table 1).
Example 1: In the case of the rice SNP gene locus E20943,
DNA from rice cultivar Himenomochi (Hi) would be cleaved by
EaeI, but DNA from cultivar Akitakomachi (Ak) would not.
Tsp509I would digest both DNAs, but the nucleotide composition
would differ, G in Hi and C in Ak. Thus, the SNP could be
detected by the presence or absence of EaeI digestion (Fig. 2a) and
by the mass difference between G and C (Δ = 40.03) (Table 2)
after cleavage by Tsp509I (Fig. 2b). Example 2: In the case of
SNP gene locus R1744, Heiseimochi (He) and Nipponbare (Ni)
differ by a G-to-A transition (Table 1). During PCR, dTTP
was replaced by dUTP, so that UDG could cut at U (T in the
original sequence). Therefore, the mass difference between
samples from He and Ni would be that between G and A
(Δ = 15.97) (Table 2) (Fig. 2c).
[Fig. 1 flow diagram: gDNA containing a SNP; fast PCR and asymmetric PCR with a biotinylated (B) primer; addition of restriction enzyme or UDG; addition of MB/SA and washing with water; alkali denaturation and washing with water; NH3 solution at 65 °C to separate MB/SA from the biotinylated ssDNA]
Fig. 1 Experimental procedure of MS-CAPS. A gene locus containing a SNP was
selected for short PCR amplification. The SNP should lie between the primers,
which were designed based on the sequences surrounding the SNP. After short
PCR, the amplicons were digested by a restriction enzyme to obtain different
products from the amplicons, i.e., digested or not digested, or restriction
fragments of different lengths or nucleotide composition. If there was a T (or an A)
near the SNP, then U could be substituted in place of the T by adding dUTP to the
PCR reaction; the amplicons could then be distinguished after treatment by
UDG. One of the primers must be biotinylated (B) at its 5′-end. After the enzyme
treatment, streptavidin (SA)-coated magnetic beads (MB) were added and mixed
by pipetting. ssDNA was then obtained using an alkali treatment, and the
complementary ssDNA was removed. Ammonia solution was added to dissociate the
MB/SA and biotin. Biotinylated ssDNA and MB/SA in the ammonia solution were
separated using a magnetic stand. Biotinylated ssDNA could then be obtained by
lyophilization
Table 1
Nucleotide sequences of rice SNP gene loci and PCR primers
Gene loci were selected using the database http://rapdb.dna.affrc.go.jp/. A vertical arrow shows
the restriction enzyme cleavage site. The position of the SNP is shown as an upper case red letter
3. Most of the time taken for PCR represents the time taken to
change the temperature. Higher annealing temperatures make
for quicker PCR amplification. When the dissociation time,
annealing time, and extension time were 1 s, amplification
occasionally failed in the 2720 thermal cycler. The reason is
not known. Failure also occasionally occurred if the reaction
volume was less than 10 μl.
[Fig. 2 spectra: x-axis m/z, y-axis intensity (×10^4, a.u.). Panel a: primer, Hi (peak labeled m/z 5687.713), and Ak, m/z 5,000–8,000. Panel b: Hi (m/z 8735.504) and Ak (m/z 8694.644, Δ = 40.860), m/z 7,600–8,800. Panel c: He (m/z 6976.991) and Ni (m/z 6961.424, Δ = 15.576), m/z 6,800–7,400]
Fig. 2 MS-CAPS analysis of rice cultivars. (a) SNP gene locus E20943 was selected for SNP detection and
digested by EaeI. The peak at m/z 5,687 differed significantly between cultivar Akitakomachi (Ak) and cultivar
Himenomochi (Hi). A horizontal arrow shows the peak corresponding to the amplicon and a vertical arrow at
m/z 6570 shows the peak corresponding to unreacted primer. If asymmetric PCR was not used, the peak
derived from the unreacted primer would be higher than that of the PCR products. (b) Analysis of the same
locus using a different approach. Amplicons were digested using Tsp509I. The theoretical mass difference
between Hi and Ak would be 40.03 (Table 2); the observed difference by MS-CAPS analysis was 40.860.
A horizontal arrow shows the observed peak and vertical arrows show postsource decay products during
MALDI-TOF MS analysis. (c) MS-CAPS analysis of amplicons treated with UDG. Biotinylated primer R1744FB
and primer R1744R were used to amplify sequences from cultivars Heiseimochi (He) and Nipponbare (Ni), and
peaks representing the amplicons were detected at m/z 6,976.991 and 6,961.424, respectively. PCR products
after digestion by UDG and alkali treatment are shown. The mass difference detected by MS-CAPS analysis of
He and Ni was 15.576 (theoretical difference between A and G = 16.00 (Table 2)). Vertical arrows show the
degradation products from postsource decay during MALDI-TOF MS analysis
Table 2
Mass differences between nucleosides
Nucleoside (mass)   G   A       T       C
G (151.10)          –   15.97   24.99   40.03
A (135.13)              –       9.02    24.03
T (126.11)                      –       16.00
C (111.10)                              –
The mass of the mono isotope of each nucleotide was used for calculation
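The comparison between an observed peak shift and the expected base difference can be sketched as follows. This is an illustrative heuristic, not the author's procedure: it recomputes the pairwise differences from the masses given in the row labels of Table 2 rather than copying the printed cells, and it assigns an observed shift to the nearest computed difference. The function names and the nearest-match rule are assumptions.

```python
from itertools import combinations

# Masses as printed in the row labels of Table 2.
BASE_MASS = {"G": 151.10, "A": 135.13, "T": 126.11, "C": 111.10}


def pairwise_differences(masses=BASE_MASS):
    """Absolute mass difference for every unordered pair of bases."""
    return {
        (a, b): round(abs(masses[a] - masses[b]), 2)
        for a, b in combinations(masses, 2)
    }


def nearest_substitution(observed_delta, masses=BASE_MASS):
    """Return the base pair whose computed difference is closest to the shift."""
    diffs = pairwise_differences(masses)
    return min(diffs, key=lambda pair: abs(diffs[pair] - observed_delta))


# Peak positions quoted in Fig. 2b and Fig. 2c.
shift_tsp509i = 8735.504 - 8694.644   # Hi vs Ak after Tsp509I digestion
shift_udg = 6976.991 - 6961.424       # He vs Ni after UDG treatment

call_b = nearest_substitution(shift_tsp509i)
call_c = nearest_substitution(shift_udg)
```

Because some pairwise differences lie within about 1 mass unit of each other, a nearest-match call is only as reliable as the mass accuracy of the spectrum; the expected substitution from the known SNP remains the primary reference.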
4. To the best of our knowledge, Tfi DNA polymerase works well
in this method and is the least expensive option. High fidelity
DNA polymerase cannot use dUTP as a substrate.
5. The reaction temperature of Afu UDG is 65 °C. Although
there are UDGs whose reaction temperature is 37 °C, Afu UDG
showed better and quicker results.
6. It was possible to use a 96-well plate using a MagnaBot II
magnetic separation device (Promega). Automatic transfer of
the solutions using a workstation (EDR-24LS compact
workstation, Biotech, Tokyo, Japan) was also confirmed.
7. Water should be added after the solubilization of
2′,4′,6′-trihydroxyacetophenone in acetonitrile.
8. Usually, centrifugation is not required.
9. The PCR amplification mixture can be stored at −20 °C.
10. Conventional PCR cycle is also possible. The running time of
conventional PCR would be 3–10 min longer than that of fast
PCR, depending on the annealing temperature.
11. Amplicons obtained by short PCR can be analyzed on 12 %
polyacrylamide gels (10 ml of 30 % acrylamide (30:0.8), 5 ml
of 5× TBE (54 g of Tris, 27.5 g of boric acid, 20 ml of 0.5 M
EDTA in 1 l), 9.82 ml of water, 175 μl of 10 % ammonium
persulfate, and 50 μl of TEMED) [17].
12. In most cases, the buffer for PCR amplification did not affect
the digestion by restriction enzymes or UDG.
13. According to the manufacturer’s data, there is sufficient binding
capacity in 1 μl of Magnotex-SA for MS-CAPS analysis.
However, if too small a volume of magnetic beads was used,
the brown magnetic beads sometimes disappeared during the
washing process.
14. If the solution is on the wall of the tube, brief centrifugation is
recommended. Do not spin the tube at a high speed for a long
period.
15. This step is very important. If salts remain, MS-CAPS analysis
will fail.
16. Watch out for sudden opening of the cap by ammonia vapor.
Use safety glasses for protection.
17. If good crystals were formed, recrystallization was not essential.
Bad crystals that contained excess salts or nucleotides had a
shiny surface. They required more time for lyophilization and
did not show any spectrum. If a good crystal was not obtained
by recrystallization using 50 % acetonitrile, addition of an
extra 1 μl of matrix solution was sometimes helpful to form
better crystals.
18. Typical analysis conditions were as follows: positive linear
mode; detection range m/z 4,000–10,000; suppress up to
m/z 3,000 with gating; pulse ion extraction 270 ns; a total of
300 shots, changing every 50 shots.
Acknowledgements
This work was supported by the Kieikai Research Foundation.

References
1. Giancola S, McKhann HI, Berard A, Camilleri C, Durand S, Libeau P, Roux F, Reboud X, Gut IG, Brunel D (2006) Utilization of the three high-throughput SNP genotyping methods, the GOOD assay, Amplifluor and TaqMan, in diploid and polyploid plants. Theor Appl Genet 112:1115–1124
2. Sauer S, Lechner D, Berlin K, Plancon C, Heuermann A, Lehrach H, Gut IG (2000) Full flexibility genotyping of single nucleotide polymorphisms by the GOOD assay. Nucleic Acids Res 28:E100
3. Sauer S, Lehrach H, Reinhardt R (2003) MALDI mass spectrometry analysis of single nucleotide polymorphisms by photocleavage and charge-tagging. Nucleic Acids Res 31:e63
4. Smylie KJ, Cantor CR, Denissenko MF (2004) Analysis of sequence variations in several human genes using phosphoramidite bond DNA fragmentation and chip-based MALDI-TOF. Genome Res 14:134–141
5. Pusch W, Kraeuter KO, Froehilch T, Stagies Y, Kostrezewa M (2001) Genotools SNP manager: a new software for automated high-throughput MALDI-TOF mass spectrometry SNP genotyping. Biotechniques 30:210–215
6. Neff MM, Neff JD, Chory J, Pepper AE (1998) dCAPS, a simple technique for the genetic analysis of single nucleotide polymorphisms: experimental applications in Arabidopsis thaliana genetics. Plant J 14:387–392
7. Konieczny A, Ausubel FM (1993) A procedure for mapping Arabidopsis mutations using co-dominant ecotype-specific PCR-based markers. Plant J 4:403–410
8. Auer CA (2003) Tracking genes from seed to supermarket: techniques and trends. Trends Plant Sci 12:591–597
9. Kajiwara H (2011) Detection of specific DNA from crude extracts of rice seed grains using matrix-assisted laser desorption ionization time-of-flight mass spectrometry. Anal Biochem 411:152–154
10. Kajiwara H, Yamaguchi M, Sato H, Shibaike H (2012) Discrimination among rice varieties based on rapid detection of single nucleotide polymorphisms by a newly developed method, mass spectrometric cleaved amplified polymorphic sequence (MS-CAPS) analysis. Plant Omics 5:231–237
11. Kajiwara H, Sato M, Suzuki A (2012) Detection of Acidovorax avenae subsp. citrulli using PCR and MALDI-TOF MS. J Electrophoresis 56:13–17
12. Kajiwara H (2012) Analysis of diverse plant species by mass spectrometric cleaved amplified polymorphic sequence (MS-CAPS) and improvement of PCR efficiency by addition of cysteine. Plant Mol Biol Rep. doi:10.1007/s11105-012-0448-0
13. Saiki RK, Bugawan TL, Horn GT, Mullis KB, Erlich HA (1986) Analysis of enzymatically amplified β-globin and HLA-DQ alpha DNA with allele-specific oligonucleotide probes. Nature 324:163–166
14. Sullivan D, Fahey B, Rirus D (2006) Fast PCR. Bioradiations 7:22–27, https://www.bio-rad.com/webroot/web/pdf/lsr/literature/br118.pdf
15. Yap EPH, McGee JOD (1991) Short PCR product yields improved by lower denaturation temperature. Nucleic Acid Res 19:1713
16. Sato H, Endo T, Shiokai S, Nishio T, Yamaguchi M (2010) Identification of 205 current rice cultivars in Japan by dot-blot-SNP analysis. Breed Sci 60:447–453
17. Sambrook J, Fritsch EF, Maniatis T (1989) Molecular cloning, 2nd edn. Cold Spring Harbor Laboratory Press, New York
Chapter 17
Quantitative SNP Genotyping of Polyploids
with MassARRAY and Other Platforms
Marcelo Mollinari and Oliver Serang
Abstract
Accurate genotyping is essential for building genetic maps and performing genome assembly of polyploid
species. Recent high-throughput techniques, such as Illumina GoldenGate™ and Sequenom iPLEX
MassARRAY®, have made it possible to accurately estimate the relative abundances of different alleles even
when the ploidy of the population is unknown. Here we describe the experimental methods for collecting
these relative allele intensities and then demonstrate the practical concerns for inferring genotypes using
Bayesian inference via the software package SuperMASSA.
Key words Polyploids, MassARRAY, SNP, Quantitative genotyping
1 Introduction
Polyploids are extremely important in agriculture. Even with the
use of modern genomic techniques, it is difficult to precisely
estimate the percentage of polyploid species in plants [1]. Experts
estimate that between 30 and 50 % of all flowering plants were
polyploids at some point in their evolution [1–4]. Despite their
importance, polyploid species do not fully benefit from the use of
molecular markers [5]. Molecular markers are a fundamental ingredi-
ent in modern methods for studying evolution, phylogenetics
(including studies of synteny and diversity), assembly of linkage maps,
quantitative trait loci (QTL) analysis, and association studies.
The use of molecular markers in polyploids has lagged behind
diploids due to several complications: (1) the large number of
genotypic classes; (2) the poorly understood behavior of the
chromosomes of several polyploid species; (3) the lack of molecular and
statistical methods to precisely and efficiently estimate the genotypic
classes; and, in some cases, (4) uncertainty about the ploidy level of
the species, as is the case for modern sugarcane cultivars. Because of
their high ploidy level (≈8–16) and interspecific origin, the ploidy
levels of modern sugarcane cultivars are variable and their
determination is not trivial, even when using modern flow cytometry
techniques [6, 7] or molecular cytogenetics [8, 9]. If we consider a
codominant and multiallelic marker, there are up to $\binom{p}{p/2}$
genotypic classes in a polyploid gamete (where $p$ denotes the ploidy
level). For instance, for an octoploid, there are up to 70 possible
genotypic classes. Without an accurate classification of genotypes in
polyploids, it is infeasible to use approaches that have marked a
revolution in biology in the past hundred years [5].
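The octoploid figure of 70 is the number of ways to choose p/2 homologs out of p. A minimal sketch of that count, assuming an even ploidy level and using an illustrative function name, is:

```python
from math import comb


def max_gamete_classes(ploidy):
    """Upper bound on gamete genotypic classes: choose ploidy/2 of ploidy homologs."""
    if ploidy <= 0 or ploidy % 2:
        raise ValueError("ploidy must be a positive even number")
    return comb(ploidy, ploidy // 2)


octoploid_classes = max_gamete_classes(8)
tetraploid_classes = max_gamete_classes(4)
```

The count grows quickly with ploidy, which is one reason genotype classification becomes hard at the ploidy levels cited for sugarcane.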
To circumvent these problems, the vast majority of genetic
studies in polyploids utilize only single dose markers. Such markers,
also called simplex or Single Dose Restriction Fragment [10], detect
the presence of polymorphisms in just one homologous chromo-
some per homology group. In an F1 population, they segregate in
a 1:1 fashion (if the single dose is present in exactly one parent) or
3:1 (if the single dose is present in both parents). In this case, even
codominant markers such as RFLPs and microsatellites behave like
dominant markers [11]. It has also been possible to detect markers
with higher doses, such as double, triple, and quadruple. The current
approaches used to estimate the dosage of a fragment are based
on segregation patterns in the progeny [12–14]. For instance,
a triple dose marker in an autooctoploid individual with bivalent
pairing without double reduction segregates in a 13:1 fashion
(presence:absence) [12, 13]. These segregation ratios alone do
not consistently provide a reliable genotype estimate, particularly
when the ploidy is not trivially small. Some gametes will not even
be observed in a cross, since the number of individuals is relatively
small.
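The segregation ratios mentioned above (1:1, 3:1, and 13:1) can be reproduced with a short calculation. The sketch below assumes random bivalent pairing without double reduction, so that each gamete is a uniformly random set of p/2 of the parent's p homologs; the function names are hypothetical and are not taken from any cited method.

```python
from fractions import Fraction
from math import comb


def p_absent(ploidy: int, dose: int) -> Fraction:
    """Probability that a gamete carries no copy of a marker present in
    `dose` of the parent's `ploidy` homologs (random p/2 subset assumed)."""
    half = ploidy // 2
    return Fraction(comb(ploidy - dose, half), comb(ploidy, half))


def presence_absence_ratio(ploidy: int, dose_p1: int, dose_p2: int = 0):
    """Expected presence:absence ratio in an F1, returned as the number of
    'present' progeny per 'absent' progeny (None if absence is impossible)."""
    absent = p_absent(ploidy, dose_p1) * p_absent(ploidy, dose_p2)
    if absent == 0:
        return None
    return (1 - absent) / absent


if __name__ == "__main__":
    print(presence_absence_ratio(8, 1))     # single dose, one parent
    print(presence_absence_ratio(8, 1, 1))  # single dose, both parents
    print(presence_absence_ratio(8, 3))     # triple dose, one parent
```

Under these assumptions the three calls return 1, 3, and 13, corresponding to the 1:1, 3:1, and 13:1 ratios described in the text.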
The use of only single dose markers (essentially treating each
allele as a Boolean “present” or “absent” value) imposes serious
limitations to polyploid studies. Using only single dose markers it
is impossible to study the effects of allelic dosage, which have been
shown to be extremely important in gene expression in several spe-
cies. For instance, Guo et al. [15] showed the importance of these
effects in maize, Galitski et al. [16] in Saccharomyces cerevisiae, and
Wang et al. [17] in Arabidopsis. According to Osborn et al. [18],
allele-dosage effects in heterozygous genotypes are correlated to
intermediate gene expression and phenotypic effects as compared
to homozygous genotypes. This phenomenon can be responsible
for intermediate phenotypic classes with selective advantages,
although an expansion of the range of phenotypes has not been
observed. Most importantly, the additional information carried by multidose
markers makes them much more likely to be informative in a
cross, and therefore much more information-dense for linkage
analysis. The inclusion of multiple dose markers is essential to the
adequate genetic study of polyploids.
Although most studies involving species with high levels of
ploidy are based on single-dose dominant markers, some methods
have been proposed to handle multiple dose, dominant (e.g., RAPDs
and AFLPs) and codominant (e.g., RFLPs and microsatellites)
markers, especially for tetraploid species [19]. In this case, since the
number of possible gamete genotypes is relatively small (up to 6), it
is possible to predict the genotype of a pair of parents (for a given
locus) based on the marker information in their progeny. This kind
of information enabled several authors to propose methods for
genetic mapping [20–24] and QTL mapping [25–27], taking into
account the multidose and multiallelic nature of tetraploid species.
More recently, single nucleotide polymorphisms have played
an important role in genetic studies of polyploids. Along with inser-
tions and deletions, they are the most common type of sequence
difference between alleles [28]. The abundance of SNPs makes
them extremely important to the construction of saturated genetic
maps and for QTL analysis and association studies. In polyploids, a
locus may carry multiple doses of a particular nucleotide. The quan-
tification of this dosage is possible when using quantitative SNP
genotyping. Quantitative high-throughput technologies, such as
Illumina GoldenGate™ [29] and Sequenom iPLEX MassARRAY®
[30, 31] provide two signals for each SNP locus. Since SNPs are
mostly biallelic, each one of these signals corresponds to an inten-
sity recorded for one of the two possible alleles. Thus, the expected
value of each signal intensity is proportional to the corresponding
allele dosage [31, 32]. Using these ideas, Serang et al. [5] proposed
a graphical Bayesian model for inferring polyploid SNP genotypes
(i.e., inferring the discrete genotype of each individual for each
locus, identifying the number of copies of each allele and, if neces-
sary, predicting the ploidy level).
In this chapter, we describe the practical aspects of genotyping
polyploids with MassARRAY® and similar platforms. Polyploid
genotyping can be partitioned into two distinct sets of tasks:
The first set of tasks involves estimating the relative abundance of
alleles from each individual in the population. Subheading 1.1
describes how the MassARRAY® platform can be used to estimate
the relative intensities of each allele for each individual in the population.
The second set of tasks uses a scatter plot of relative intensities
to estimate the genotype for each individual in the population.
Subheading 3.2 describes how to estimate genotypes from these
relative abundances; the methods described in this subsection do
not depend on the platform used to estimate the relative
abundances.
1.1 Current Methods for Estimating Relative Abundance of Alleles
Currently, the two important protocols for polyploid genotyping
are Illumina GoldenGate™ and Sequenom MassARRAY®; these
protocols use different methods for estimating the relative abundance
of the two alleles at a particular locus (Fig. 1). Note that
both procedures only provide relative allele abundances; even with
well-designed controls (e.g., starting with the same amount of tis-
sue from each individual), the amplified allele abundances may not
be comparable between two different individuals, because different
individuals may be processed separately at some point, creating
Fig. 1 Alternative methods for measuring relative abundance of alleles. In both methods there are allele-
specific amplifications prior to the quantification step. (a) In fluorescence-based methods such as Illumina
GoldenGate™, each allele hybridizes to its own fluorescent probe. A unique sequence contained in the products
of the specific amplification hybridizes to a specific bead. Thus, the assay products that were in solution
are bound to a solid surface for quantification. The abundance of each allele is determined by quantifying the
fluorescence intensity using a laser scanning confocal microscope (BeadArray Reader) [33]. (b) In mass
spectrometry-based methods such as Sequenom MassARRAY®, the DNA fragments at each allele are ionized and
time-of-flight mass spectrometry (MALDI-TOF mass spectrometer, MassARRAY System) is used to quantify the
relative abundance of each allele. Because the alleles have nonidentical sequences (due to modified-mass ddNTP
terminators in the Sequenom iPLEX reaction, indicated by asterisks), they have slightly different masses;
measuring the total charge of the ions at the time of flight expected for the mass-to-charge ratio for each
allele gives a robust estimate for the relative abundance of each allele in the form of an intensity spectrum
[30]. In both panels, example outputs are shown for the genotypes T/T, T/C, and C/C; in panel (b) intensity is
plotted against m/z
chances for the introduction of technical variation. For example,
one diploid heterozygote may have an estimated intensity of 100
for allele A, while another individual with identical genotype has an
estimated intensity of 200; however, the first individual should
have an estimated intensity near 100 for allele B and the second
individual should have an estimated intensity near 200 for allele
B. In mathematical terms, the proportionality constant between
intensity and abundance does not vary within the same individual,
but may vary between two different individuals.
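Because the proportionality constant is shared by both alleles within an individual, it cancels when the two intensities are expressed as a fraction. The following sketch illustrates this with the example values used above; the function name `allele_fraction` is hypothetical and the snippet is not part of any platform's software.

```python
def allele_fraction(x: float, y: float) -> float:
    """Fraction of total signal attributed to the first allele.

    Any per-individual scale factor c multiplies both x and y,
    so it cancels in x / (x + y).
    """
    total = x + y
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return x / total


if __name__ == "__main__":
    # Two diploid heterozygotes measured with different overall signal.
    print(allele_fraction(100, 100))
    print(allele_fraction(200, 200))
```

Both heterozygotes give the same fraction even though their raw intensities differ twofold, which is why downstream analysis works with relative rather than absolute intensities.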
Illumina GoldenGate™ hybridizes each allele to its own
fluorescent probe and quantifies the abundance of each allele by
measuring the intensity of the fluorescence. The KASPar genotyping system,
from KBioScience, also uses fluorescence to measure the abundance of
allele-specific extension products. Sequenom MassARRAY® ionizes
the DNA fragments at each allele and uses time-of-flight mass
spectrometry (a technology predominantly used for protein and
peptide-based assays) to quantify the relative abundance of each
allele. Because the alleles have nonidentical sequence (due to the
SNP or other polymorphism), they have slightly different masses;
measuring the total charge of the ions at the time of flight expected
for the mass-to-charge ratio for each allele gives a robust estimate
for the relative abundance of each allele (Fig. 1).
Another technology that has been widely used in plant genotyping
is Genotyping by Sequencing (GBS). GBS can also provide
the relative abundance of alleles by counting reads that contain a
particular SNP. One fundamental step to perform GBS is the reduc-
tion of the complexity of the genome using restriction enzymes.
After the reduction of the genome complexity, a series of adapters
are attached to the fragments and those adapters are bound to an
Illumina HiSeq Flow cell. There are several GBS protocols in the
literature allowing the quantification of fragments that have a par-
ticular polymorphism, thereby measuring the relative abundance of
each variant. Two important examples of such protocols used in
plants are described in Baird et al. [34] (known as RADseq) and
Elshire et al. [35].
An ideal protocol for polyploid genotyping is sensitive, efficient
(i.e., low cost), and unbiased (i.e., free of skew). Sensitivity
indicates robust estimation of relative abundances even when the
initial quantity of DNA is small. The cost of processing samples
with Illumina GoldenGate™ and Sequenom MassARRAY® is simi-
lar [36] and there is no dramatic difference in complexity of sample
preparation. Both methods fall under the category of “a few cents”
per data point [37]. Lack of bias (i.e., lack of skew) indicates that
neither allele is preferentially amplified within the same individual
(i.e., a diploid heterozygote with allele A intensity of 100 should
imply an allele B intensity near 100). Bias may be introduced in
two ways: preferential amplification of one allele; and preferential
labeling or detection of one allele. Because both Illumina
GoldenGate™ and Sequenom MassARRAY® amplify in the same
manner (i.e., PCR), both methods have equal, fairly low risk of
bias during amplification. On the other hand, the use of different
fluorophores in Illumina GoldenGate™ can introduce biases by
which one allele is preferentially detected, for example, due to the
complications from secondary structure formation in nucleic acids
[38]. According to Griffin and Smith [38] and Marziali and Akeson
[39], these artifacts are irrelevant to MALDI-TOF analysis: since
the mass-to-charge ratio (m/z) is an intrinsic property of a DNA
strand, it is not affected by secondary structure and, therefore,
the time of flight is not affected. A comparable concern with
Sequenom MassARRAY® would be the possibility of different
ionization efficiencies for each allele, which could affect the
measured peak intensity of each allele. Current preliminary
comparisons indicate lower risks for bias using Sequenom
MassARRAY® [5]; however, a more rigorous side-by-side comparison
would substantially benefit the field. Regardless, future
advances in fluorescence-based and mass spectrometry-based
measurement may result in superior fluorescence-based protocols,
and superior new technologies for measuring the relative abundance
of alleles will likely emerge.
2 Materials
The MassARRAY® iPLEX system is based on a region-specific PCR
followed by an allele-specific single base extension (i.e., iPLEX
genotyping biochemistry) in which products are analyzed in terms
of their masses by matrix-assisted laser desorption/ionization
time-of-flight mass spectrometry (MALDI-TOF MS) [31] (see
Subheading 3.1). Although the purpose of this chapter is to give
details on practical aspects of SNP genotyping inference using
the software package SuperMASSA, we will briefly describe some
wet-lab characteristics of the Sequenom MassARRAY® iPLEX.
A more detailed description of these procedures can be found in
MassARRAY® iPLEX manuals and also in Oeth et al. [30], Gabriel
et al. [40], Oeth et al. [31], and Bradic et al. [41]. More specifically,
for plants, see Irwin [42]. Although there are several protocols
described in the literature, the following description is based on
Sequenom’s technical note [30], which can be used as a basis for
other protocols.
First, proprietary software (called MassARRAY Design) is
used to design both PCR and MassEXTEND iPLEX primers for
multiplex reactions. A PCR amplification is then conducted using
common reagents, such as buffers with MgCl2, dNTPs, primers,
and Taq polymerase. After that, a treatment with Shrimp Alkaline
Phosphatase (SAP) is required in order to dephosphorylate unin-
corporated nucleotides. The MassEXTEND iPLEX reaction is
conducted using a specific kit which contains the iPLEX mass-
modified terminators and the specific primers designed with
MassARRAY Design software. The iPLEX products are desalted
using an ion exchange resin. Finally, a nanodispenser is used to
dispense the products on to a SpectroCHIP containing 384 spots.
To handle the liquids, the MassARRAY® system makes use of
robotic equipment. Once the protocol for a certain species is
established, the whole procedure becomes simple and almost
automatic. At this point, the products are ready for quantitative
analysis. The Methods are partitioned into two sections. Subheading 3.1
describes the methods used to process biological samples (i.e., DNA) using
the MassARRAY® system. The output of this procedure is a scatter
plot (denoted D) of quantitative allele intensities from individuals
$1, 2, \ldots, n$: $D = ((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$.
Subheading 3.2 explains
how to infer the SNP genotypes based on these quantitative scatter
plot data. Thus, to start from prepared scatter plot data, go directly
to Subheading 3.2.
3 Methods
This section will focus on the MassARRAY® genotyping system
and on the bench procedures that transform biological samples
into a scatter plot (denoted D) of allele intensities from individuals
$1, 2, \ldots, n$: $D = ((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$.
However, any SNP genotyping technology that produces two relative
intensities for each individual, one for each allele at a locus,
can be analyzed using
the methods shown in Subheading 3.2.
3.1 Processing the Biological Sample with MassARRAY®
One of the most suitable methods for quantitative SNP genotyping
in polyploids is MALDI-TOF MS. In MALDI-TOF MS the DNA
is deposited in a buffering matrix containing a crystalline structure
(3-hydroxypicolinic acid, in MassARRAY®). This buffering matrix
is responsible for the absorption of a greater part of the energy
applied to the sample, preventing a significant decomposition and
fragmentation of the DNA [31]. Then, the LASER beam is directed
on to the samples causing the desorption/ionization of the DNA.
The ionized molecules (mostly positive ions) pass through a flight
tube with a detector at the end. The flight time increases with
the mass-to-charge ratio (m/z) of the ionized molecule (it is
proportional to the square root of m/z) [42]; however, because MALDI
usually has a strong enrichment for singly charged ions, m/z is
effectively determined by the mass. In the same electric field (and thus the same force field, since
the charges are roughly homogeneous), molecules with higher
masses accelerate more slowly, and consequently have longer time of
flight than molecules with low masses. Figure 2 shows a schematic
representation of MALDI-TOF MS.
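The dependence of flight time on mass can be illustrated with the textbook relation for an idealized linear time-of-flight analyzer, in which an ion of charge z accelerated through a potential U gains kinetic energy zeU and then drifts a length L, so that t = L·sqrt(m / (2zeU)). The sketch below is an assumption-laden illustration only: the accelerating voltage and drift length are hypothetical values, not MassARRAY® instrument specifications, and effects such as delayed extraction are ignored.

```python
import math

DALTON_KG = 1.66053906660e-27          # mass of 1 Da in kilograms
ELEMENTARY_CHARGE_C = 1.602176634e-19  # elementary charge in coulombs


def time_of_flight(mass_da: float, charge: int = 1,
                   voltage_v: float = 20_000.0,
                   drift_length_m: float = 1.0) -> float:
    """Drift time (seconds) in an idealized linear TOF analyzer.

    voltage_v and drift_length_m are hypothetical illustration values.
    """
    mass_kg = mass_da * DALTON_KG
    kinetic_energy_j = charge * ELEMENTARY_CHARGE_C * voltage_v
    velocity = math.sqrt(2.0 * kinetic_energy_j / mass_kg)
    return drift_length_m / velocity


if __name__ == "__main__":
    for mass in (5000.0, 5300.0, 8000.0):
        print(f"{mass:7.0f} Da -> {time_of_flight(mass) * 1e6:.2f} microseconds")
```

In this idealization, quadrupling the mass doubles the flight time, and singly charged ions of different mass are separated purely by their arrival time at the detector.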
The motivating principle behind SNP genotyping using
MALDI-TOF MS is to detect the abundance of the DNA fragments
from each allele by measuring the intensities of the masses corre-
sponding to the two amplified DNA fragments. These fragments
are amplified using target-specific PCR and extensions conducted
at the SNP site. Each allele at the locus has a specific mass, and the
relative intensities of the amplified product at each mass are indicative
of the relative abundances of the alleles.
One of the advantages of SNP genotyping using MALDI-TOF
MS is its capacity for multiplexing. This multiplexing, which allows
multiple loci to be processed in a single analysis, makes this tech-
nology cost-effective when compared to the other high-throughput
technologies available [37]. One of the most used MALDI-TOF
MS genomics platforms is the Sequenom MassARRAY® system
[31]. The first optimized protocol for multiplexing developed by
Fig. 2 Schematic representation of MALDI-TOF MS. The DNA is embedded in a buffering matrix which forms a
crystalline structure. A LASER beam is directed onto the matrix, which absorbs the greater part of the energy
and ionizes the analyte. The ionized molecules pass through a flight tube to a detector at the end. The flight
time increases with the mass-to-charge ratio (m/z) of the ionized molecule (it is proportional to the square
root of m/z). The output is usually given in the form of a three-peak spectrum. The first peak (m+3)
corresponds to an unextended primer. The second corresponds to the low mass allele (m+2) and the third
corresponds to the high mass allele (m+1)
Sequenom is based on a locus-specific PCR followed by a treatment
with shrimp alkaline phosphatase (SAP) in order to dephosphorylate
unincorporated dNTPs added during the amplification reaction.
Then, an oligonucleotide primer that anneals adjacent to the SNP of
interest is extended across the SNP site [37, 43]. If the individual is
heterozygous, there are two possible products for this reaction, differing
only at the SNP site. Usually, mass spectrometers are capable of
detecting differences in the masses of these products. However,
since the difference is too small (varies from 9 Da for A/T extensions
to 40 Da for C/G extensions) it is difficult to establish a high-
throughput multiplex genotyping routine [30]. Thus, instead of
using only one nucleotide to measure differences between alleles,
Sequenom proposed using another base in conjunction with the
SNP base, in order to create large mass separations between allele-
specific products [30, 44]. This procedure is called homogeneous
MassEXTEND (hME) assay. hME uses a normal dNTP nucleotide
(T, in Fig. 3) complementary to one of the alleles of the SNP site
(A, in Fig. 3) along with terminator ddNTPs for the other nucleo-
tide types. The first one passes the SNP site (A), since the comple-
mentary nucleotide is a normal dNTP (not a terminator), and
stops the extension in the next nucleotide. The second one terminates
exactly at the SNP site. This produces alleles with mass
differences corresponding to the mass of a nucleotide (∼300 Da).
Since the range of the resolving capability of MALDI-TOF MS is
approximately 4,500–9,000 Da, the mass differences of 300 Da
allow multiplexing to levels of 15-plex (on optimized protocols
applied to diploid species).

Fig. 3 A schematic representation of MassARRAY® hME and iPLEX assays. The first step is based on a locus-
specific PCR followed by a treatment with shrimp alkaline phosphatase (SAP) for both assays. In the hME
reaction one extra nucleotide is used in order to create larger mass differences between extension products.
In the example, the allele-specific extension is conducted using a normal T nucleotide (which is complementary
to the SNP site A) along with other terminator nucleotides (A, C, and G). Thus, when a nucleotide A is
present as a template, the extension passes the SNP site and stops at the next nucleotide. When the other
allele is present (G in the example), the extension terminates exactly at the SNP site. This produces alleles with
mass differences corresponding to the mass of a nucleotide, which makes the multiplexing
routine feasible. The iPLEX reaction instead uses acyclic mass-modified terminators to perform the extension into
the SNP site. These mass-modified terminators used by iPLEX differentiate the mass of the extension products by
at least 16 Da. Since the mass differences can be detected directly at the SNP site without extra nucleotide
masses, it is possible to establish higher multiplex levels than the ones obtained when using hME [43, 47]
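The 15-plex figure for hME is consistent with a simple back-of-the-envelope calculation: a resolving window of 4,500–9,000 Da divided into slots of about 300 Da. The sketch below makes that arithmetic explicit. It is a rough upper bound under the assumption that each locus reserves one slot of the stated width; the function name `max_plex` is hypothetical, and real assay designs are also limited by primer interactions and other factors not modeled here.

```python
def max_plex(window_da=(4500, 9000), spacing_da=300) -> int:
    """Crude upper bound on loci per reaction: window width / slot width.

    Assumes one slot of `spacing_da` per locus (illustrative simplification).
    """
    low, high = window_da
    if spacing_da <= 0 or high <= low:
        raise ValueError("window and spacing must be positive")
    return (high - low) // spacing_da


if __name__ == "__main__":
    print(max_plex())  # hME-style spacing of about 300 Da
```

With the values quoted in the text, the function returns 15.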
Since the multiplex level is closely related to the cost-effectiveness
of the platform, over the years Sequenom has worked toward the
improvement of these levels, changing both the chemistry and
algorithms used to design primers in order to minimize interactions
and crosstalk that occur in a multiplex PCR reaction [37]. The
iPLEX genotype assay substantially improves the multiplexing capa-
bility, allowing multiplexing levels between 24 and 36, although,
under optimized conditions, it is possible to obtain up to 40-plex
levels [31]. One of the fundamental differences between hME
and iPLEX approaches is that the latter uses acyclic mass-modified
terminators to perform the extension into the SNP site. These
mass-modified terminators used by iPLEX differentiate the mass of
the extension products by at least 16 Da (difference of an A-G SNP),
which is large enough to be directly measured by a MALDI-TOF
MS without extra nucleotide masses. These lower mass differences
allow higher levels of multiplexing (up to 40-plex on optimized
protocols for diploid species) since it is possible to position more
loci within the range of the resolving capability of the MALDI-
TOF MS. It is worth noting that optimizations in PCR conditions
and software-designed primers helped to increase multiplexing
levels when using iPLEX assay.
The output of the SNP genotyping for a specific locus using
MALDI-TOF MS is usually given in the form of a three-peak spec-
trum (Fig. 3). The first peak (m+3, in Fig. 3) corresponds to an
unextended primer. The second corresponds to the low mass allele
(m+2) and the third corresponds to the high mass allele (m+1).
If the ionization efficiency is similar for both alleles, the height of
each peak is proportional to the abundance of each allele [5]. For
example, a particular SNP in a diploid species can be homozygous
by having just one peak for a low mass allele or just one peak for a
high mass allele (plus the unextended primer peak). It can, alterna-
tively, be heterozygous, having peaks for both alleles. Since a typi-
cal heterozygous diploid individual has one copy of each allele,
peaks with the same or very similar heights are expected. Although
deviations from these proportions are observed, in MALDI-TOF
MS using iPLEX assay they are minimal [5]. In experiments con-
ducted by Sequenom [30], the iPLEX assay showed a mean bias of
52:48 (where 50:50 was expected) indicating that the peak heights
created when using iPLEX chemistry can be used as a quantitative
measure of allele dosage.
Unlike diploids, in which heterozygous loci have a single
dose of each allele, in polyploid species the dosage of an allele in
an individual can vary from 0 up to p (from 1 up to p − 1 in
heterozygotes), where p denotes the ploidy level. Accurate measurement of
relative abundances (such as the ones provided by iPLEX) is essen-
tial to distinguish between the subtly different proportions in high-
ploidy organisms [5] (see Note 2). However, some subtle
adaptations to the iPLEX protocol optimize its accuracy for poly-
ploid genotyping. The first adaptation regards the multiplexing
levels: generally, this multiplexing must be smaller than the multi-
plexing level used to analyze diploids. Using a low-plex level, the
probability of interaction between primers decreases and the avail-
ability of dNTPs, ddNTPs, and reagents (and thus the cost of the
assay) increases. In diploids, the amount of reagents and primers is
optimized to amplify loci with two alleles (from two homologous
chromosomes). However, in polyploids there are multiple sets of
chromosomes and consequently multiple alleles (this number is
proportional to the ploidy level). Since the number of alleles that
should be amplified is greater in polyploids, the amount of reagents
per locus should also be greater than the amounts used for diploids.
Moreover, with less interaction between primers, the efficiency
of the PCR reactions can be improved. A second adaptation simply
uses a different concentration of genomic DNA from that used for
diploids. For instance, sugarcane has an estimated genome
size of 10 Gb [45]. When compared to the human genome
(≈3.5 Gb), for a given quantity of DNA, a specific target locus is
about 2.9-fold less represented in sugarcane than in humans. Thus,
for sugarcane, a higher concentration of DNA is needed to perform
the analyses. Bérard et al. [46] did the same comparison between
human and wheat genomes. According to these authors, due to the
size of wheat genome (≈17 Gb for Triticum aestivum L.), a target
locus is fourfold less represented in wheat than in humans for the
same amount of DNA. As examples of protocol adaptations, in dip-
loids, the multiplexing level varies from 24-plex up to 40-plex and
DNA concentrations from 2.5 up to 10 ng/μL [30, 31, 37, 40, 47];
however, in sugarcane these numbers are around 10-plex and
10 ng/μL DNA (used in the sugarcane example data).
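To see why measurement accuracy matters more as ploidy grows, it helps to list the expected allele fractions for each possible dosage. The sketch below assumes an unbiased assay in which the expected fraction of signal from one allele equals its dosage divided by the ploidy; the function names are hypothetical.

```python
from fractions import Fraction


def expected_fractions(ploidy: int) -> list:
    """Expected fraction of signal from one allele for dosages 0..ploidy,
    assuming intensity is exactly proportional to dosage."""
    return [Fraction(k, ploidy) for k in range(ploidy + 1)]


def smallest_gap(ploidy: int) -> Fraction:
    """Difference in expected fraction between adjacent dosage classes."""
    return Fraction(1, ploidy)


if __name__ == "__main__":
    for p in (2, 4, 8, 12):
        classes = ", ".join(str(f) for f in expected_fractions(p))
        print(f"ploidy {p}: gap {smallest_gap(p)} -> {classes}")
```

Adjacent dosage classes are separated by 1/2 in a diploid but only by 1/8 in an octoploid, so a bias such as the 52:48 deviation discussed above consumes a larger share of the available separation at high ploidy.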
Array-based technologies are also widely used for SNP geno-
typing in polyploids. There are two important technologies from
Illumina: GoldenGate™ and Infinium. Both are based on Illumina’s
BeadArray technology which is basically a chip with silica beads in
matrix form. Each polymorphism has its own bead with a specific
capture sequence. The processed DNA hybridizes to the beads and
the products are fluorescently labeled (one fluorescent color or
“channel” for each allele). These fluorophores are detected by a
LASER-scanning confocal microscope, called the BeadArray Reader.
Illumina’s array-based technologies are both high-throughput:
GoldenGate™ allows for high-plex levels (up to 1,536) while
Infinium permits whole genome genotyping. It is worth mention-
ing that both Sequenom’s and Illumina’s technologies have simi-
larities: both are quantitative, and both employ two channels,
Fig. 4 Examples of scatter plots of raw data presented in Serang et al. [5]. In both panels the intensity of
allele 2 is plotted against the intensity of allele 1. (a) Example of an autotetraploid
potato scatter plot of allele intensities in an association panel obtained using the Illumina GoldenGate™ assay.
The annotated scatter plot for this SNP is shown in Fig. 7. (b) Example of a sugarcane scatter plot of allele
intensities in an F1 biparental sugarcane population with unknown ploidy obtained using the Sequenom iPLEX
MassARRAY® technology. The annotated scatter plot for these two SNPs is shown in Fig. 8
where the relative intensities correspond to the amount of DNA at
each allele for the specific locus. Although Illumina’s technologies
allow for higher throughput than Sequenom’s, the MALDI-TOF
procedure generally results in less allele-specific bias, and thus
more accurate measures of relative abundance. Since genotyping
by MALDI-TOF MS is based on the mass-to-charge ratio, which
is an intrinsic property of the DNA strand, it is not susceptible to
complications that arise from secondary structure, as can occur
with hybridization array-based methods [38, 39, 48].
Regardless of their subtle differences, at each locus both the
Sequenom and Illumina protocols result in two relative intensities
for each individual analyzed. These data are easily visualized as a
scatter plot (the series for the locus of interest is denoted D) showing
the allele intensities from individuals $1, 2, \ldots, n$:
$D = ((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$. Example scatter plots
from potato and sugarcane loci are shown in Fig. 4.
3.2 Inferring Genotypes: Statistical Analysis of Quantitative SNP Data
This subsection focuses on using the intensity data
$D = ((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$ from MassARRAY® (Fig. 4).
Given a single locus’ scatter plot data from a MassARRAY®
experiment (or from several replicate experiments), assigning genotypes
becomes a joint clustering and inference problem consisting
of three main goals:
1. Find clusters.
2. Infer the genotype of each cluster.
3. Assign each individual to a cluster (i.e., assign a genotype to
each individual).
Inference on these three subtasks is made difficult by the fact
that each of these goals depends on the outcome of the other two.
For example, assigning genotypes to clusters requires specified
clusters. Likewise, when distinguishing whether a collection of
points consists of two clusters or one cluster, it is useful to evaluate
how well the predicted genotypes would correspond to the two
clusters compared to the predicted genotype of the single cluster.
3.2.1 Iterative Approach
Iterative approaches utilizing mixture models were the first methods
devised to perform joint inference on the clusters and genotypes.
In particular, Voorrips et al. [49] extended the method of Fujisawa
et al. [50] from diploids to autotetraploids. These methods essentially
alternate between assigning points to a cluster and computing
a linear regression on each cluster, which provides the average
slope of the cluster (i.e., it computes m in the regression $y = mx + b$).
This average slope is then used to find an integer solution $Y/X \approx m$,
where Y and X are integers and $Y + X = P$. Points are then reassigned
to the nearest cluster (using the regression from each cluster),
and the process is continued until convergence is reached. Points
with total intensities too small ($x^2 + y^2 < \tau$) are excluded from analysis
because small changes can affect their cluster membership, and can
in turn affect the slope of the cluster, resulting in cumulative errors
of nontrivial magnitude.
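One step of such an iterative scheme can be sketched as follows. This is a simplified illustration, not the implementation of Voorrips et al. [49] or Fujisawa et al. [50]: the regression is forced through the origin (the intercept b is fixed at zero), the nearest integer dosage is chosen by comparing angles, and all function names and the threshold value are hypothetical.

```python
import math


def filter_low_intensity(points, tau):
    """Drop points whose squared total intensity x^2 + y^2 is below tau."""
    return [(x, y) for x, y in points if x * x + y * y >= tau]


def slope_through_origin(points):
    """Least-squares slope m for y = m*x (intercept fixed at 0: simplification)."""
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return math.inf if sxx == 0 else sxy / sxx


def nearest_dosage(m, ploidy):
    """Integer pair (X, Y) with X + Y = ploidy whose ratio Y/X is closest
    to the slope m, compared on the angle scale to handle X = 0."""
    target = math.atan(m)
    candidates = ((ploidy - k, k) for k in range(ploidy + 1))
    return min(candidates, key=lambda xy: abs(math.atan2(xy[1], xy[0]) - target))


if __name__ == "__main__":
    cluster = [(10.0, 30.5), (20.0, 59.0), (1.0, 1.0)]
    kept = filter_low_intensity(cluster, tau=100.0)
    m = slope_through_origin(kept)
    print(m, nearest_dosage(m, ploidy=4))
```

In this toy cluster the low-intensity point is discarded, the fitted slope is close to 3, and the nearest tetraploid dosage is X = 1, Y = 3. A full iterative method would then reassign points to clusters and repeat until the assignments stop changing.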
This iterative approach is intuitive; however, it does not fully
consider the interdependence of the three goals enumerated earlier.
As a result, the clusters are relatively unconstrained. After conver-
gence is reached, a cluster may be assigned to a nonsensical region
between two genotypes, with a slope m that does not correspond
well to any predicted dosage for that ploidy. Furthermore, the model
requires the ploidy to be known in advance (to determine the number
of components in the mixture model). It is not possible to infer
the true ploidy by simply trying several different possible ploidies,
because the unconstrained nature of the model inherently rewards
higher ploidies (consider a model in which each point forms its own
cluster, giving zero error in the regressions).
Essentially, these problems lead to a lack of “identifiability”;
even with an infinite sample size, it is not possible to compare ploidies
that would share common clusters (e.g., 1:3 cluster from a tetraploid
population and 2:6 cluster from an octoploid population).
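The shared-cluster problem can be made concrete by listing the expected allele fractions for two ploidies and intersecting them. The sketch below assumes that a cluster's expected position depends only on the dosage ratio; the function names are hypothetical.

```python
from fractions import Fraction


def cluster_fractions(ploidy: int) -> set:
    """Expected fraction of one allele for every dosage 0..ploidy."""
    return {Fraction(k, ploidy) for k in range(ploidy + 1)}


def shared_clusters(ploidy_a: int, ploidy_b: int) -> set:
    """Cluster positions that two ploidies predict identically."""
    return cluster_fractions(ploidy_a) & cluster_fractions(ploidy_b)


if __name__ == "__main__":
    # A 1:3 tetraploid cluster and a 2:6 octoploid cluster coincide.
    print(Fraction(1, 1 + 3) == Fraction(2, 2 + 6))
    print(sorted(shared_clusters(4, 8)))
```

Every tetraploid cluster position is also an octoploid cluster position, so cluster locations alone cannot favor the tetraploid model over the octoploid one.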
3.2.2 Bayesian Approach
The Bayesian approach used by SuperMASSA [5] starts with an
assumed ploidy and then computes the predicted location of geno-
type clusters and, by using population-level modeling, the predicted
distribution of genotypes in the population. The population-level
information (e.g., a population in Hardy–Weinberg equilibrium at
the locus or the progeny of an F1 cross) can then be used to compute
a likelihood that rewards solutions that not only offer tight
clusters at the predicted ratios (or angles) but also yield
plausible genotype distributions for the population.
Figure 5 shows the graphical model used by the Bayesian method
SuperMASSA: G denotes all of the genotype assignments and
θ denotes population-level parameters (e.g., the parent genotypes
in an F1 or the first allele frequency for a population in Hardy–
Weinberg equilibrium).

Fig. 5 A graphical view of SNP genotyping presented in Serang et al. [5]. A
directed edge from a to b represents a dependency of b on a (in the standard
Bayesian network manner). The population ploidy is denoted P, the true genotypes
of all individuals in the population are denoted G, the distribution of genotypes
in the population is denoted C, the theoretical (i.e., predicted) distribution
of genotypes in the population is denoted T, and the observed scatter plot data
of relative allele intensities for all individuals are denoted D. In the Hardy–
Weinberg model, the theoretical distribution of genotypes in the population T
depends on the allele frequency α. In the F1 model, the theoretical distribution of
genotypes in the population T depends on the parental genotypes ($Q_1$ and $Q_2$,
which have optional observed relative allele intensities $D^{(1)}$ and $D^{(2)}$, respectively).
Performing inference on the ploidy and genotypes while recognizing all of these
dependencies permits comprehensive inference of the ploidies and genotypes;
in contrast, iterative heuristics iteratively ignore some dependencies in order to
achieve simpler inference

The use of population-level information solves the identifiability
problem (when more than one cluster exists): for both the progeny
of an F1 cross and a locus in Hardy–Weinberg equilibrium, the
predicted genotype distribution will be unimodal (i.e., there will not
be successive clusters with many points, followed by a cluster with
few points, and followed by a cluster with many points). Thus, with
clusters observed 2:0, 1:1, 0:2, even though a tetraploid population
could make equivalent clusters 4:0, 2:2, 0:4, it would be highly
unlikely to do so without containing any 3:1 and 1:3 individuals.
When only one cluster exists (making it impossible to distinguish
between other genotypes that could give the same ratio), an Occam’s
razor approach is used, and the smallest ploidy is selected.
Thus, the Bayesian method can be used to infer the ploidy as
well as the genotypes. The ploidy and the population-level param-
eters θ (e.g., parental genotypes in an F1 or the first allele fre-
quency for Hardy–Weinberg) can be estimated by trying several
ploidies in a desired range, and then for each ploidy choosing the
most likely θ. In the case of the F1, data from the parents (i.e., a
scatter plot consisting of replicate samples processed from the two
parental individuals) can be multiplied in as another factor of the
overall likelihood.
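The ploidy search described above, together with the identifiability argument, can be sketched as follows. This is a minimal illustration and not SuperMASSA's implementation: it scores only the population term, it assumes that under Hardy–Weinberg equilibrium the dose of the first allele follows a binomial distribution with allele frequency α, and it assumes that genotype counts have already been proposed for each candidate ploidy. The function names and the example counts are hypothetical.

```python
import math

def hw_log_likelihood(counts, alpha):
    """Multinomial log-likelihood of genotype counts under an assumed
    binomial Hardy-Weinberg model.
    counts[k] = number of individuals with k doses of the first allele."""
    ploidy = len(counts) - 1
    n = sum(counts)
    # Multinomial coefficient, so that different ploidies are comparable.
    total = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    for k, c in enumerate(counts):
        if c:
            p = math.comb(ploidy, k) * alpha**k * (1 - alpha)**(ploidy - k)
            total += c * math.log(p)
    return total

def search_ploidy(counts_by_ploidy, alpha_grid):
    """For each candidate ploidy choose the most likely alpha on the grid,
    then return the ploidy with the highest resulting log-likelihood."""
    results = {}
    for ploidy, counts in counts_by_ploidy.items():
        alpha = max(alpha_grid, key=lambda a: hw_log_likelihood(counts, a))
        results[ploidy] = (alpha, hw_log_likelihood(counts, alpha))
    best = max(results, key=lambda p: results[p][1])
    return best, results

# Three observed clusters (hypothetical counts) read as diploid 2:0, 1:1, 0:2
# or as tetraploid 4:0, 2:2, 0:4 with empty 3:1 and 1:3 classes.
candidates = {2: [25, 50, 25], 4: [25, 0, 50, 0, 25]}
grid = [i / 100 for i in range(1, 100)]
best, results = search_ploidy(candidates, grid)
print(best, results)
```

Under these assumptions the tetraploid reading is penalized because a binomial distribution cannot place many individuals at 4:0, 2:2, and 0:4 while leaving 3:1 and 1:3 empty.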
Efficient inference is not trivial for the Bayesian model; the
latent variable C modeling the true distribution of genotypes in the
population depends on the genotypes of all individuals; thus maxi-
mizing the posterior probability (which is the product of the likeli-
hood and the uniform prior probability of θ) is an optimization
problem on all individuals. A greedy maximum likelihood (ML)
approach disentangles this dependency by assigning each individ-
ual the genotype from the closest expected ratio (e.g., when P = 6
the point (1.05,1.99) would be assigned 2:4). From these pro-
posed genotypes, a likelihood can be computed that rewards tight
clusters close to the predicted ratio and rewards solutions with a
plausible distribution of genotypes for the population.
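The greedy assignment step can be sketched in a few lines. The sketch below assumes that "closest expected ratio" is measured by the angle of the point in the scatter plot, and it uses the convention of the example above, in which the first number of the genotype is the dose of the allele plotted on the x-axis. The function names are hypothetical.

```python
import math

def expected_angles(ploidy):
    """Angle (radians) of the expected ratio k:(ploidy - k) for
    k = 0..ploidy doses of the first (x-axis) allele."""
    return [math.atan2(ploidy - k, k) for k in range(ploidy + 1)]

def greedy_genotype(x, y, ploidy):
    """Return (doses of allele 1, doses of allele 2) for the expected
    ratio whose angle is nearest to the angle of the point (x, y)."""
    angle = math.atan2(y, x)
    angles = expected_angles(ploidy)
    k = min(range(ploidy + 1), key=lambda d: abs(angles[d] - angle))
    return k, ploidy - k

print(greedy_genotype(1.05, 1.99, 6))  # (2, 4)
```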
When a small skew is encountered (e.g., a slightly higher
proportionality constant between x and the amount of the first
allele, which could potentially be caused by slightly different
ionization efficiencies of the DNA fragment in the first allele), the
nearest genotype may actually be incorrect and may not correspond
to the maximum a posteriori (MAP) solution. This is because the
greedy ML solution optimizes only individual genotype assignments,
but does not fully consider the genotype distribution of
the population. Fortunately, the search for the true MAP can be
reformulated as a branch and bound problem, which is efficient
especially when the data quality is good (see Note 3).
The branch and bound procedure is derived by demonstrating
that given the number of individuals with each genotype, the MAP
genotype assignments can be found trivially. Given the number of
individuals with each genotype C, it is never advantageous to assign
individual 1 to genotype cluster A and individual 2 to genotype
cluster B if (x1: y1) is closer to the predicted ratio of cluster B and
(x2: y2) is closer to the predicted ratio of cluster A; indeed, trivially
swapping these individuals’ genotype assignments decreases the
overall distance, while maintaining the same genotype distribution
(after all, genotypes A and B each lose and gain an individual for a
net change of zero). This is illustrated in Fig. 6. By proving that
out-of-order genotype assignments are suboptimal, it can be easily
shown that, given the counts of individuals with each genotype C,
the genotypes themselves can be found by sorting the points along
the arc from (1,0) to (0,1) and distributing the next available gen-
otype to the next sorted individual.
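The sorting argument above can be written out directly. This is a minimal sketch, assuming that points are ordered by their angle from the x-axis (so that the arc runs from (1,0) to (0,1)) and that counts are indexed by the dose of the first allele; the function name and the example data are hypothetical.

```python
import math

def assign_given_counts(points, counts):
    """points: list of (x, y) intensities.
    counts[k]: number of individuals with k doses of the first allele.
    Returns the dose of the first allele for each point, in input order."""
    if sum(counts) != len(points):
        raise ValueError("counts must sum to the number of points")
    ploidy = len(counts) - 1
    # Sort individuals along the arc from (1,0) to (0,1).
    order = sorted(range(len(points)),
                   key=lambda i: math.atan2(points[i][1], points[i][0]))
    doses = [None] * len(points)
    position = 0
    # Angle 0 corresponds to only the first allele, so start at dose = ploidy.
    for k in range(ploidy, -1, -1):
        for _ in range(counts[k]):
            doses[order[position]] = k
            position += 1
    return doses

points = [(0.1, 2.0), (2.0, 0.1), (1.0, 1.1), (1.9, 0.0)]
print(assign_given_counts(points, [1, 1, 2]))  # [0, 2, 1, 2]
```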
Fig. 6 Illustration of a suboptimal genotype configuration presented in Serang et al. [5] (both panels: x-axis, intensity of allele 1). The left panel shows
an inferior genotype configuration, where two individuals are assigned genotypes where the relative intensities
of each individual are closer to the expected relative intensities for the genotype assigned to the other individual.
The right panel shows a superior genotype configuration, which swaps the genotypes of these two
individuals, achieving the same distribution of genotypes in the population, but with a lower distance between
the scatter plot points and the expected relative intensities for the predicted genotypes
For this reason, it is sufficient to perform a branch and bound
search on the genotype counts C, and for each C proposed, trivially
infer the MAP estimate of the individual genotype assignments G.
Searching C (which has $O(n^{P+1})$ possible configurations) is far
simpler than searching G (which has $O((P+1)^n)$ possible
configurations). Given a partially determined value of C, denoted
$C_{\mathrm{pref}}$ (i.e., the counts of only some genotypes are
determined), it is possible to derive an upper bound on the best
solution C consistent with $C_{\mathrm{pref}}$ (i.e., the best solution
that does not alter $C_{\mathrm{pref}}$).
For example, if $C_{\mathrm{pref}}$ specifies that exactly 10 individuals
have genotype 0:6 (and does not specify the number of
individuals with other genotypes), then the MAP implies they must
be the first 10 individuals sorted along the arc from (1,0) to (0,1). If the
next sorted individual has intensities (6.1,0.01), the closest genotype
assignment it can have is 1:5, because including it as a 0:6
genotype would violate the presumption that exactly 10 individuals
have genotype 0:6 (adding another individual would increase
the count to 11). In this case, it is very unlikely that the (6.1,0.01)
point comes from a 1:5 genotype (or further), and so any C
consistent with $C_{\mathrm{pref}}$ has a likelihood with a provably small
upper bound. If that upper bound is lower than the likelihood of an
already observed solution (e.g., seeded by a first pass with the
greedy ML method), then the current partial solution
$C_{\mathrm{pref}}$ can be
aborted. Likewise, a partial solution with a provably poor probability
for the population genotype distribution (e.g., a bimodal distribu-
tion consisting of ten individuals with 0:6, 0 individuals with 1:5,
and ten individuals with 2:4) can also be discarded early. Thus,
the search space can be substantially narrowed, especially when the
data quality is high and the greedy first pass is of high quality (per-
mitting a more stringent bound). Generally, when the data quality
is good (i.e., fairly tight clusters and with no substantial skew) and
the ploidy range considered does not exceed twenty or so, MAP
inference can be performed efficiently (i.e., in a minute or less).
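The gap between the two search spaces can be made concrete with a short calculation. The sketch below counts the genotype count vectors C exactly (the number of ways to split n individuals among P + 1 genotypes, which lies within the polynomial bound quoted above) and the genotype assignments G; the function names and the example sizes are illustrative only.

```python
from math import comb

def num_count_vectors(n, ploidy):
    """Number of distinct count vectors C: ways to distribute n
    individuals among ploidy + 1 genotype classes (stars and bars)."""
    return comb(n + ploidy, ploidy)

def num_assignments(n, ploidy):
    """Number of distinct genotype assignments G: each of the n
    individuals takes one of ploidy + 1 genotypes."""
    return (ploidy + 1) ** n

n, ploidy = 20, 6
print(num_count_vectors(n, ploidy))  # 230230
print(num_assignments(n, ploidy))
```

Even for only 20 individuals at ploidy 6, the space of assignments is many orders of magnitude larger than the space of count vectors.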
The hyper-parameters of the Bayesian method specify the search
space for the ploidy, the model for the distribution of genotypes in
the population, and the search space for the Gaussian width parameter
σ, which is used to model the noise in the intensity measurements.
Specifying a larger Gaussian width parameter σ decreases efficiency
of MAP inference because it permits highly diffuse clusters; thus
configurations that assign a point to a far-away cluster are not sub-
stantially penalized and cannot be aborted as early by the branch
and bound. The Bayesian method is the current method of choice
for genotype inference, particularly if the ploidy is not known or if
information or data concerning parents or population structure are
available.
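The effect of σ on pruning can be illustrated with a small calculation. The sketch below assumes a Gaussian noise model on the deviation between a point and its expected location, and compares how strongly a far-away cluster is penalized relative to a nearby one; the function names and the numbers are illustrative only.

```python
import math

def gaussian_log_density(deviation, sigma):
    """Log-density of a zero-mean Gaussian with width sigma."""
    return (-0.5 * (deviation / sigma) ** 2
            - math.log(sigma * math.sqrt(2 * math.pi)))

def log_penalty_far_vs_near(near, far, sigma):
    """How much log-likelihood is lost by assigning a point to a cluster
    at distance `far` rather than to one at distance `near`."""
    return gaussian_log_density(near, sigma) - gaussian_log_density(far, sigma)

tight = log_penalty_far_vs_near(0.05, 0.3, sigma=0.1)
diffuse = log_penalty_far_vs_near(0.05, 0.3, sigma=0.5)
print(tight, diffuse)
```

With the larger σ the penalty for the far-away assignment is much smaller, so a bound based on it is weaker and fewer partial solutions can be discarded early.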
3.3 Example of Good Loci
Figures 7 and 8 depict examples of annotated loci using the two
models available in the current version of SuperMASSA, i.e., F1
and Hardy–Weinberg. The raw data for these figures are shown in
Fig. 4. The locus in Fig. 7 was obtained from a tetraploid potato
association panel and the Hardy–Weinberg model was used. The
scatter plot from Fig. 8 was obtained from a biparental cross of two
precommercial varieties of sugarcane with unknown ploidy. In that
case, the MAP configuration simultaneously found the ploidy level,
a set of two parents with their respective dosages, and the genotype
assignments for the scatter plot. In both figures, the theoretical
genotype distributions and observed genotype distributions are
nearly identical. Also, it is important to note that the genotype
annotations are extremely close to the predicted angles for each
assigned genotype.
Fig. 7 The annotated scatter plot of allele intensities for the SNP shown in Fig. 4a (a tetraploid potato association
panel). The first graphic shows the annotated scatter plot and the second shows the theoretical distribution of
genotypes in the population and the distribution of individuals assigned to each genotype with σ = 0.10. Data
extracted from Voorrips et al. [49]. Platform used: Illumina GoldenGate
Fig. 8 The annotated scatter plot of allele intensities for the SNP shown in Fig. 4b (an F1 biparental sugarcane
population). Panels: Parents and Progeny (x-axis: intensity of allele 1), and expected versus observed genotype
frequencies (x-axis: doses of allele 1). The ploidy level was searched from 2 to 100 (only even numbers). The first
graphic shows parental data, consisting of 12 replicates of each parent. The second shows the annotated scatter
plot of the F1 population and the third shows the theoretical distribution of genotypes in the population and the
distribution of individuals assigned to each genotype. Extracted from Serang et al. [5]. The posterior probability
associated with the classification was ≈1.00 and the estimated ploidy level was 10 with σ = 0.16. Platform used:
Sequenom MassARRAY with iPLEX chemistry
When the ploidy is unknown (which is the case for commercial
varieties of sugarcane), SuperMASSA searches a specified range for
the ploidy that yields the MAP (see Note 1). In Fig. 8 the estimated
ploidy level was 10 and the posterior probability given by
SuperMASSA was extremely close to 1.00, indicating a good
classification. For a given ploidy level, the quality of the genotype
calling can be inspected by changing the value of SuperMASSA’s
naive posterior report threshold argument. With the posterior
threshold set to 0.0 (the default), all individuals are allocated
to genotypic classes, i.e., SuperMASSA reports all individuals
regardless of their posterior probability for a given configuration.
However, if the posterior threshold is set to a higher level,
individuals that do not reach that level are not shown in the output
of the analysis. Ideally, the more individuals that remain classified as
the posterior threshold approaches 1.00, the better the genotype
calling. For the locus shown in Fig. 7, varying the naive posterior
report threshold in small increments shows that all individuals
studied have a posterior of at least 0.55 (dashed line in Fig. 9)
and more than 80 % of the data have a posterior equal to or higher than
0.90 (dot-dashed line in Fig. 9), indicating a high-quality result.
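The kind of curve summarized by the two thresholds above can be computed from per-individual posteriors as follows. This is a minimal sketch that works on a plain list of posterior values; it does not call SuperMASSA, and the function name and the example values are hypothetical.

```python
def fraction_retained(posteriors, thresholds):
    """For each threshold, return the fraction of individuals whose
    posterior probability is at least that threshold."""
    n = len(posteriors)
    return [sum(p >= t for p in posteriors) / n for t in thresholds]

# Hypothetical posteriors for five individuals at one locus.
posteriors = [0.55, 0.95, 0.99, 0.60, 0.92]
print(fraction_retained(posteriors, [0.0, 0.55, 0.90, 1.0]))
# [1.0, 1.0, 0.6, 0.0]
```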
Fig. 9 Effect of the increment of the naive posterior report threshold on the number
of annotated individuals for the locus analyzed in Fig. 7. All individuals studied
have a posterior of at least 0.55 (dashed line) and more than 80 % of the data
have a posterior equal to or higher than 0.90 (dot-dashed line), which indicates a
good result
However, it is important to note that since SuperMASSA takes
into account the F1 and Hardy–Weinberg models, the reduction of
the number of individuals (caused by a high posterior threshold)
can cause significant changes in the observed distribution.
Consequently, this difference between the expected and the
observed values produces low posterior values. This influence is
most strongly observed in F1 models, which take into account a
hypergeometric distribution that is quite sensitive to changes in
the number of individuals in each class. For example, the expected
segregation of a tetraploid locus with two doses in each parent
in a biparental population is 1:8:18:8:1 (given by the
hypergeometric distribution). Imagine that an observed locus (with
parent and progeny data) fits this segregation well. However, if
one of the observed classes has points that were not well attributed
(i.e., have low posterior), the posterior threshold will eliminate
those points, consequently changing the fit of the model to the
hypergeometric distribution. In other words, the posterior threshold
will eliminate individuals with low posterior, and this can
unbalance the classes of the distribution, yielding low posteriors.
It has been observed that this effect is less pronounced when using
the Hardy–Weinberg model, and the unstructured model does not
suffer from this effect.
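The hypergeometric segregation used by the F1 model can be computed as follows. This is a minimal sketch under stated assumptions: each parent contributes a gamete of half the ploidy drawn without replacement from its chromosomes (polysomic inheritance with no double reduction), and the progeny dose is the sum of the two gamete doses. The function names are hypothetical. Under these assumptions, a tetraploid cross in which each parent carries two doses gives the 1:8:18:8:1 ratio.

```python
from fractions import Fraction
from math import comb

def gamete_distribution(ploidy, dose):
    """Hypergeometric probability of k doses in a gamete of size ploidy/2
    drawn from a parent carrying `dose` copies of the allele."""
    half = ploidy // 2
    total = comb(ploidy, half)
    return [Fraction(comb(dose, k) * comb(ploidy - dose, half - k), total)
            for k in range(half + 1)]

def f1_distribution(ploidy, dose1, dose2):
    """Expected progeny genotype distribution: the convolution of the two
    parental gamete distributions. Index = progeny dose of the allele."""
    g1 = gamete_distribution(ploidy, dose1)
    g2 = gamete_distribution(ploidy, dose2)
    progeny = [Fraction(0)] * (ploidy + 1)
    for i, p in enumerate(g1):
        for j, q in enumerate(g2):
            progeny[i + j] += p * q
    return progeny

dist = f1_distribution(4, 2, 2)
print([int(f * 36) for f in dist])  # [1, 8, 18, 8, 1]
```

Because the outer classes are expected to hold only 1/36 of the progeny each, removing a handful of low-posterior individuals from one class can noticeably change the fit.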
4 Notes
1. When the population ploidy is unknown, SuperMASSA can
use population models (e.g., F1 or Hardy–Weinberg models)
to help infer the ploidy at the same time as genotype inference;
however, lower quality data (i.e., data with high variance or
strong allele-specific bias) may make this difficult. Fortunately,
the SuperMASSA posterior probability gives an indication of
the quality of the results, and whether they are to be trusted.
Figure 10 depicts data from loci with increased levels of noise
and complexity; SuperMASSA awards a lower posterior probability
to the figures with greater noise (and less reasonable results).
Fig. 10 Example of three annotated sugarcane SNPs, shown as panels SNP (a), SNP (b), and SNP (c) (x-axis:
intensity of allele 1), with increasing levels of difficulty to estimate the correct ploidy level (the signal/noise
ratio decreases from left to right). The first two SNPs were genotyped in an F1 biparental sugarcane population.
The third was genotyped in a sugarcane association panel. In all cases the range where the ploidy level was
searched was from 2 to 100 (only even numbers). The first locus yields an estimated ploidy of eight with a high
posterior probability (≈1.00). The second locus yields an estimated ploidy of 10, and a lower posterior
probability (≈0.51). The third locus estimates the ploidy to be 76, with a very low posterior probability (≈0.24).
Platform used: Sequenom MassARRAY with iPLEX chemistry. Courtesy of Dr. Anete P de Souza, Centro de
Biologia Molecular e Engenharia Genética, UNICAMP, Campinas, Brazil
It is important to note that lower posteriors are computed on
lower quality data as the result of other parameter settings giv-
ing similar results. For example, the third panel in Fig. 10 is
best fit by complex parameters; it is almost certain that this
locus is suffering from high variance, and the model is trying to
explain each point in the diffuse clusters by estimating the ploidy
at 76 and essentially giving each point its own cluster (see Note 2).
But importantly, other high ploidies have this same effect (for
example, 74 or 78 ploid), and thus produce similar likelihoods.
As a result, although the 76 ploid model is the most likely, it
has a poor marginal probability (because many other models
have similar likelihood). For this reason, if the estimated ploidy
level is at the upper limit of the ploidy range investigated,
inference should be performed again using a higher maximum
ploidy (e.g., if the estimated ploidy was the maximum allowed,
50, it should be processed again with a maximum ploidy higher
than 50). Because of population modeling, higher and higher
ploidies will result in a lower mean-squared error for the scatter
plot, but will also result in a very poor likelihood for the distri-
bution of genotypes. For practical and computational reasons,
the maximum ploidy is truncated, enabling a grid search; how-
ever, to be accurate, the ploidies not investigated must have van-
ishing likelihoods. Increasing the maximum ploidy until it is
greater than the estimated ploidy ensures that this truncated
grid search does not affect the results. As with all Bayesian meth-
ods, proper “competition” between models that explain the
data results in a reasonably low posterior probability when the
data are not explained well.
2. One of the most important causes of poor results is the large
variance that can be found in some scatter plot data. This large
variance is usually due to protocols that have not been optimized
for polyploid species. The vast majority of SNP genotyping techniques
were developed and optimized for diploid species. When a
polyploid is analyzed, a series of optimization procedures must
be carried out beforehand to implement a genotyping routine.
Factors such as multiplexing level, primer interactions, PCR
conditions, and DNA and reagent concentrations must be taken
into consideration to make the variance as low as possible
(see Subheading 3.1). However, for high ploidy levels this is
not a simple task and, sometimes, the optimization is not
enough to produce distinct clusters. It seems to be an intrinsic
property of complex loci. For lower ploidy levels, such as tetra-
ploids and hexaploids, large variances are less problematic when
compared to species with higher ploidy levels. If a locus has a low
ploidy level, even for the most complex case where all genotypes
are present, the classification is straightforward, since the large
variance of the clusters does not cause their superposition
(see Fig. 7, for instance). However, this effect is more pronounced
for high ploidy levels, since the expected angles are closer. Thus,
in order to have a good classification, one needs very tight clus-
ters, which in some cases are almost impossible to obtain due to
experimental errors intrinsic to every technique.
Moreover, the experimental results we have analyzed so far
indicate that when the allele dosage is high, the clusters tend to
be less distinct when compared to lower dosages. In some cases,
this fact, added to a high ploidy level, prevents SuperMASSA
from producing a reasonable genotype call. Figure 11 shows
an example where both parents have high dosage and the progeny
has a large variance for a high ploidy level. The combination
of these two factors (high ploidy level and high dosage)
produces a very difficult situation to analyze. However,
SuperMASSA indicates that the result is not satisfactory by
providing a posterior probability of 0.47, as explained in Note 1.
Thus, it is very important to inspect the posterior probability
to assess the quality of the genotype calling.
Fig. 11 Annotated scatter plot of a SNP genotyped in an F1 biparental sugarcane population (panels: Parents,
Progeny, and expected versus observed genotype frequencies). The ploidy level was searched from 2 to 100
(only even numbers). When there is too much noise (a low signal/noise ratio), it is difficult to obtain a good
classification. In this case, it can be identified by observing the low posterior probability (≈0.47). The estimated
ploidy level was 24. Platform used: Sequenom MassARRAY with iPLEX chemistry. Courtesy of Dr. Anete P de
Souza, Centro de Biologia Molecular e Engenharia Genética, UNICAMP, Campinas, Brazil
It is important to note that estimation of ploidy and genotype
annotation are closely linked, and good annotations generally
come with good estimates of the ploidy level.
If SuperMASSA cannot produce good results, biological and
genomic information about the loci can help to decide whether a bad
genotype call should be discarded. First, if the ploidy level is
unknown but is the same throughout the genome, the ploidy
estimated for the SNPs should, on average, equal the ploidy of the
species analyzed. However, if the ploidy is not constant within the
analyzed genome (which is the case for sugarcane), one procedure
that can provide more trustworthy results is to verify another SNP
that is located close to the analyzed SNP. This can be done, for
example, by checking which SNPs were amplified in the first
locus-specific PCR (amplification step in Fig. 3). Since those SNPs are
in a specific fragment of genomic DNA, they should share the
ploidy level. If at least one SNP on the fragment has a high
posterior, it can provide evidence to help estimate the ploidy
(and consequently the annotation) of other SNPs contained
in the same fragment.
Importantly, deciding if a locus with high variance and
consequently low MAP should be discarded depends on its
use. For instance, if several loci annotated using SuperMASSA
will be used to assemble a genetic map, some SNPs with low
posterior are not of great concern, since in the construction of
a genetic linkage map all the information about the SNPs is
used jointly within a linkage group and it is expected that the
SNPs with high posterior probabilities provide information to
SNPs which were not well classified.
3. A good genotype call should produce clusters around the
expected angles for a given ploidy (see Fig. 8, for example).
However, sometimes loci with skewed clusters can be found.
Figure 12 depicts a skewed locus with a cluster (formed by
cross symbols) positioned above the expected line for a hexa-
ploid locus. Also it is possible to see skewed genotypes in the
parental scatter plot. One of the most important causes of
skewed loci is the preferential amplification of one of the two
alleles within the same individual in the PCR step, as described
in Subheading 1.1. As explained in Note 2, the genotyping
protocols are established for SNP genotyping in diploid species.
If these protocols are used without specific optimization for
polyploid species, the effect of preferential amplification can be
more pronounced, since the expected angles in the scatter plots
become closer together as the ploidy level increases. Another
source of bias is the preferential labeling of the final amplified
products. Thus, as described in Subheading 1.1, technologies that
suffer less from this effect, such as those based on MALDI-TOF
analysis, should be considered.
Fig. 12 Annotated scatter plot of a SNP genotyped in an F1 biparental sugarcane population (panels: Parents,
Progeny, and expected versus observed genotype frequencies). The ploidy level was searched from 2 to 100
(only even numbers). It is possible to see a skew in parental and progeny scatter plots. Although the skew has
not been modeled, a high posterior was obtained (≈1.00), indicating a high quality of inferred genotypes. The
estimated ploidy level was 6. Platform used: Sequenom MassARRAY with iPLEX chemistry. Courtesy of
Dr. Anete P de Souza, Centro de Biologia Molecular e Engenharia Genética, UNICAMP, Campinas, Brazil
However, when allele-specific bias cannot be ignored, the
MAP procedure in SuperMASSA can be used, because it not
only probabilistically minimizes the deviation between the
expected and observed relative intensities (i.e., the deviation
between the points on the scatter plot and their expected loca-
tions), but also minimizes the deviation between the observed
and expected genotype distributions for the population
(according to some model, e.g., Hardy–Weinberg or F1).
When the observed relative allele intensities are not centered
around reasonable expected relative allele intensities, then
minimizing the deviation between the observed and expected
genotype distributions for the population will gain relative
importance, because the MAP estimate of the parameter σ
(which measures the spread in the observed relative allele
intensities) will become large, thus decreasing the importance
of the deviation between the expected and observed relative
intensities. Allele-specific bias that substantially impedes infer-
ence will currently lower the posterior probability, indicating
uncertainty.
In situations where allele-specific bias results in an
extremely strong skew, it would be possible to consider an
additional latent variable S to model this skew. If the
allele-specific bias makes detection of the first allele S times as
strong as that of the second allele (even when the abundances of
both alleles are equal), then this bias can be divided out trivially
(provided that S is not zero). Inference when the allele-specific bias
parameter S is unknown could be performed by trying several values
of S in a grid search, with S < 1, S = 1 (i.e., the current, uncorrected
model), and S > 1. For each value of S the bias would be divided
out and inference would be performed using the current
inference procedure. Using a prior on S with a single mode at
S = 1 will penalize substantial alteration of the data, and will
thus prefer no correction when it is not particularly helpful.
4. A reasonable result for quantitative SNP genotyping should
produce distinct clusters. However, data can have distinct clusters
and still give poor results, indicated by a low posterior. This usually
happens when, for some reason, one of the genotypic classes is
missing, or has a different number of individuals than expected
for that class. The expected number of individuals is given by
one of the two theoretical distributions of genotypes available
in SuperMASSA: F1 or Hardy–Weinberg. If the numbers differ
significantly from the theoretical distribution, SuperMASSA
will award a low posterior since there are other models com-
peting to explain the data. Figure 13 shows a locus which has
a good genotype annotation with distinct clusters but also has
a low posterior (≈0.67). The right panel shows the superposi-
tion of the genotype and the theoretical distribution in a
Hardy–Weinberg population using the MAP estimate for the
parameter α. For several classes, the observed and theoretical
values are quite different. For this reason, SuperMASSA awards
a low posterior probability for this locus.
Fig. 13 Annotated scatter plot of a SNP genotyped in a sugarcane association panel. The ploidy level was
searched from 2 to 100 (only even numbers). It is possible to see distinct clusters but the posterior probability
indicates a poor result (≈0.67). The estimated ploidy level was 12. Platform used: Sequenom MassARRAY with
iPLEX chemistry. Courtesy of Dr. Anete P de Souza, Centro de Biologia Molecular e Engenharia Genética,
UNICAMP, Campinas, Brazil
One of the factors that can cause this difference between the
expected and the observed distributions is linkage between
paralogous sequences. If paralogous sequences are linked, certain
combinations of genotypes will be more frequent than other
combinations, depending on how closely linked the sequences are.
Another cause of this imbalance can be selective advantages of
certain allelic dosages. As we noted in Subheading 1, allele-dosage
effects in heterozygous genotypes can be correlated with
intermediate gene and phenotypic expression as compared to
homozygous genotypes, and this can result in phenotypic classes
with selective advantages and consequently a higher frequency in
the population [18]. Regardless of whether this different copy is
closely linked or on another chromosome, it will lead to patterns of
inheritance that may result in missing clusters according to the
current population models. In the future, these scenarios could be
modeled using population models that recognize this possibility.
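The way competing models lower the posterior probability, as described in Note 1, can be illustrated with a short calculation. The sketch below assumes a uniform prior over the candidate ploidies and normalizes the best likelihood found for each ploidy; this is one common way to compute such a posterior and is not taken from the SuperMASSA source code. The function name and the log-likelihood values are hypothetical.

```python
import math

def ploidy_posteriors(log_likelihoods):
    """log_likelihoods: dict mapping ploidy -> best log-likelihood.
    Returns posterior probabilities under a uniform prior, normalized
    with the log-sum-exp trick for numerical stability."""
    largest = max(log_likelihoods.values())
    weights = {p: math.exp(v - largest) for p, v in log_likelihoods.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

# One ploidy explains the data far better than the others.
clear = ploidy_posteriors({2: -300.0, 4: -100.0, 6: -250.0})
# Several high ploidies explain noisy data about equally well.
ambiguous = ploidy_posteriors({74: -100.2, 76: -100.0, 78: -100.1})
print(clear[4], ambiguous[76])
```

In the second case the most likely ploidy still receives a low posterior, because neighboring ploidies have nearly the same likelihood.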
Acknowledgements
We are grateful for the help of Dr. Antonio Augusto Franco Garcia
of University of São Paulo ESALQ, Dr. Thiago G Marconi and Dr.
Anete P de Souza, Centro de Biologia Molecular e Engenharia
Genética, UNICAMP for generously sharing their data and expertise.
We would also like to thank Dr. Gary McDowell of Harvard Medical
School and Boston Children’s Hospital and Dr. Ryan Emerson of
Adaptive TCR Corporation for their suggestions.
References
1. Soltis DE, Albert V, Leebens-Mack J, Bell CD, Paterson AH, Zheng C, Sankoff D, Depamphilis CW, Wall PK, Soltis PS (2009) Polyploidy and angiosperm diversification. Am J Bot 96:336–348
2. Darlington CD (1937) Recent advances in cytology. J&A Churchill, Ltd., London
3. Stebbins GL (1950) Variation and evolution in plants. Columbia University Press, New York
4. Hieter P, Griffiths T (1999) Polyploidy – more is more or less. Science 285:210–211
5. Serang O, Mollinari M, Garcia A (2012) Efficient exact maximum a posteriori computation for Bayesian SNP genotyping in polyploids. PLoS One 7:e30906
6. Edme S, Comstock J, Miller J, Tai P (2005) Determination of DNA content and genome size in sugarcane. J Am Soc Sugar Cane Technol 25:1–16
7. Zhang J, Nagai C, Yu Q, Pan Y, Ayala-Silva T, Schnell R, Comstock J, Arumuganathan A, Ming R (2012) Genome size variation in three Saccharum species. Euphytica 185:511–519
8. D’Hont A, Grivet L, Feldmann P, Glaszmann J, Rao S, Berding N (1996) Characterisation of the double genome structure of modern sugarcane cultivars (Saccharum spp.) by molecular cytogenetics. Mol Gen Genet 250:405–413
9. D’Hont A (2005) Unraveling the genome structure of polyploids using FISH and GISH; examples of sugarcane and banana. Cytogenet Genome Res 109:27–33
10. Wu KK, Burnquist W, Sorrells ME, Tew TL, Moore PH, Tanksley SD (1992) The detection and estimation of linkage in polyploids using single-dose restriction fragments. Theor Appl Genet 83:294–300
11. Da Silva JAG, Honeycutt RJ, Burnquist W, Al-Janabi SM, Sorrells M, Tanksley SD, Sobral BWS (1995) Saccharum spontaneum L. ’SES 208’ genetic linkage map combining RFLP- and PCR-based markers. Mol Breed 1:165–179
12. Sorrells ME (1992) Development and application of RFLP in polyploids. Crop Sci 32:1086–1091
13. Ripol MI, Churchill GA, Silva JAGD, Sorrells M (1999) Statistical aspects of genetic mapping in autopolyploids. Gene 235:31–41
14. Baker P, Jackson P, Aitken K (2010) Bayesian estimation of marker dosage in sugarcane and other autopolyploids. Theor Appl Genet 120:1653–1672
15. Guo M, Davis D, Birchler JA (1996) Dosage effects on gene expression in a maize ploidy series. Genetics 142:1349–1355
16. Galitski T, Saldanha AJ, Styles CA, Lander ES, Fink GR (1999) Ploidy regulation of gene expression. Science 285:251–254
17. Wang J, Tian L, Lee HS, Wei NE, Jiang H, Watson B, Madlung A, Osborn TC, Doerge RW, Comai L, Chen ZJ (2006) Genomewide nonadditive gene regulation in Arabidopsis allotetraploids. Genetics 172:507–517
18. Osborn TC, Pires JC, Birchler JA, Auger DL, Chen ZJ, Lee HS, Comai L, Madlung A, Doerge RW, Colot V, Martienssen RA (2003) Understanding mechanisms of novel gene expression in polyploids. Trends Genet 19:141–147
19. Luo ZW, Hackett CA, Bradshaw JE, McNicol JW, Milbourne D (2000) Predicting parental genotypes and gene segregation for tetrasomic inheritance. Theor Appl Genet 100:1067–1073
20. Luo ZW, Hackett CA, Bradshaw JE, McNicol JW, Milbourne D (2001) Construction of a genetic linkage map in tetraploid species using molecular markers. Genetics 157:1369–1385
21. Ma CX, Casella G, Shen ZJ, Osborn TC, Wu R (2002) A unified framework for mapping quantitative trait loci in bivalent tetraploids using single-dose restriction fragments: a case study from Alfalfa. Genome Res 12:1974–1981
22. Luo ZW, Zhang RM, Kearsey MJ (2004) Theoretical basis for genetic linkage analysis in autotetraploid species. Proc Natl Acad Sci U S A 101:7040–7045
23. Luo ZW, Zhang Z, Leach L, Zhang RM, Bradshaw JE, Kearsey MJ (2006) Constructing genetic linkage maps under a tetrasomic model. Genetics 172:2635–2645
24. Leach LJ, Wang L, Kearsey MJ, Luo Z (2010) Multilocus tetrasomic linkage analysis using hidden Markov chain model. Proc Natl Acad Sci U S A 107:4270–4274
autotetraploid species. Genetics 159:1819–1832
27. Wu R, Ma CX, Casella G (2004) A mixed polyploid model for linkage analysis in outcrossing tetraploids using a pseudo-test backcross design. J Comput Biol 11:562–580
28. Rafalski A (2002) Applications of single nucleotide polymorphisms in crop genetics. Curr Opin Plant Biol 5:94–100
29. Fan JB, Oliphant A, Shen R et al (2003) Highly parallel SNP genotyping. Cold Spring Harb Symp Quant Biol 68:69–78
30. Oeth P, Beaulieu M, Park C, Kosman D, de Mistro G, van den Boom D, Jurinke C (2007) iPLEX assay: increased plexing efficiency and flexibility for MassARRAY system through single base primer extension with mass-modified terminators. Sequenom application note. Sequenom, San Diego, CA
31. Oeth P, de Mistro G, Marnellos G, Shi T, van den Boom D (2009) Qualitative and quantitative genotyping using single base primer extension coupled with matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MassARRAY). In: Komar AA (ed) Single nucleotide polymorphisms. Humana, New York, pp 307–343
32. Akhunov E, Nicolet C, Dvorak J (2009) Single nucleotide polymorphism genotyping in polyploid wheat with the Illumina GoldenGate assay. Theor Appl Genet 119:507–517
33. Illumina, Inc. (2006) GoldenGate assay workflow. Technical report. Illumina Inc., San Diego, CA
34. Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA, Selker EU, Cresko WA, Johnson EA (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One 3:e3376
35. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE (2011) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:e19379
36. Bagge M, Lubberstedt T (2008) Functional markers in wheat: technical and economic aspects. Mol Breed 22:319–328
37. Ragoussis J, Elvidge GP, Kaur K, Colella S (2006) Matrix-assisted laser desorption/ionisation, time-of-flight mass spectrometry in genomics research. PLoS Genet 2:e100
38. Griffin TJ, Smith LM (2000) Single-nucleotide
25. Xie C, Xu S (2000) Mapping quantitative trait polymorphism analysis by MALDITOF mass
loci in tetraploid populations. Genet Res spectrometry. Trends Biotechnol 18:77–84
76:105–115 39. Marziali A, Akeson M (2001) New DNA
26. Hackett CA, Bradshaw JE, McNicol JW (2001) sequencing methods. Annu Rev Biomed Eng
Interval mapping of quantitative trait loci in 3:195–223
40. Gabriel S, Ziaugra L, Tabbaa D (2009) SNP 45. Wang J, Roe B, Macmil S et al (2010)
genotyping using the Sequenom MassARRAY Microcollinearity between autopolyploid sug-
iPLEX platform. Curr Protoc Hum Genet arcane and diploid sorghum genomes. BMC
60:1–18 Genomics 11:261
41. Bradic M, Costa J, Chelo IM (2011) 46. Berard A, Le Paslier M, Dardevet M, Exbrayat-
Genotyping with Sequenom. In: Rockman M, Vinson F, Bonnin I, Cenci A, Haudry A, Brunel
Orgogozo V (eds) Molecular methods for D, Ravel C (2009) High-throughput single
evolutionary genetics. Humana, New York, nucleotide polymorphism genotyping in wheat
pp 193–210 (Triticum spp.). Plant Biotechnol J 7:364–374
42. Irwin D (2008) The MassARRAY system for 47. Sequenom (2007) Typer 40 manual.
plant genomics. In: Henry R (ed) Plant geno- Sequenom, San Diego, CA
typing II, SNP technology. CSIRO Publishing, 48. Bonk T, Humeny A (2001) MALDI-TOF-MS
Collingwood, VIC, pp 98–113 analysis of protein and DNA. Neuroscientist
43. Sequenom (2003) Multiplexing the homoge- 7:6–12
neous MassEXTEND assay. Sequenom, San 49. Voorrips RE, Gort G, Vosman B (2011)
Diego, CA Genotype calling in tetraploid species from bi-
44. Storm N, Darnhofer-Patel B, van den Boom D, allelic marker data using mixture models. BMC
Rodi CP (2003) MALDI-TOF mass Bioinformatics 12:172
spectrometry-based SNP genotyping. In: Kwok 50. Fujisawa H, Eguchi S, Ushijima M, Miyata S,
P (ed) Single nucleotide polymorphisms – meth- Miki Y, Muto T, Matsuura M (2004) Genotyping
ods and protocols. Humana, New York, of single nucleotide polymorphism using model-
pp 241–262 based clustering. Bioinformatics 20:718–726
Chapter 18
SNP Genotyping Using KASPar Assays

Scott M. Smith and Peter J. Maughan
Abstract
In a separate chapter we describe a simple method for single nucleotide polymorphism (SNP) discovery
using genomic reduction. Here we describe a scalable and cost-effective SNP genotyping method based
on KBioscience’s competitive allele-specific PCR amplification of target sequences and endpoint
fluorescence genotyping (KASPar™) using a FRET-capable plate reader or Fluidigm’s dynamic array
high-throughput platform.
Key words KASPar™, Single nucleotide polymorphism (SNP), Genotyping, Fluidigm
1 Introduction
Single nucleotide polymorphisms (SNPs) are the most abundant type
of polymorphism found in eukaryotic genomes [1]. SNP markers are
used in a wide variety of applications, including association studies
[2], conservation genetics [3], and genetic diversity analysis [4], and
are fast becoming the marker system of choice in marker-assisted plant
breeding programs [5]. Many of these applications require large
numbers of genotyped SNPs. KASPar chemistry provides a versatile
method of genotyping that can be applied to small- and large-scale
projects. Combining the KASPar genotyping chemistry with the
Fluidigm integrated nano-fluidic circuit (IFC) and EP1 endpoint
fluorescence reader reduces the cost to $0.05 per data point, which is
significantly less expensive than traditional marker systems (e.g.,
AFLPs or SSRs) [6]. A single 96.96 Fluidigm IFC is capable of
producing 9,216 genotypic data points in a single run (~4 h) with
little technical expertise, and since each genotyping reaction is done
on a nanoliter scale, the consumable reagent cost (i.e., Taq polymerase
and primers) is only $0.001 per data point (the remainder of the cost
is the IFC) [6]. If a Fluidigm EP1 endpoint fluorescence reader (a
significant capital investment) is unavailable, or for small-scale
projects, KASPar™ SNP assays can be read on a standard
fluorescence resonance energy transfer (FRET) plate reader.
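The throughput and cost figures quoted above follow from simple arithmetic, sketched here in Python. The per-data-point costs are the values cited from [6]; the variable and function names are illustrative only.

```python
# Figures quoted in the Introduction; names are illustrative.
ASSAYS_PER_IFC = 96
SAMPLES_PER_IFC = 96
COST_PER_DATA_POINT = 0.05       # USD, IFC plus reagents
REAGENT_COST_PER_POINT = 0.001   # USD, Taq polymerase and primers only


def data_points(assays: int, samples: int) -> int:
    """Every assay is run against every sample on the chip."""
    return assays * samples


points = data_points(ASSAYS_PER_IFC, SAMPLES_PER_IFC)
total_cost = points * COST_PER_DATA_POINT
reagent_cost = points * REAGENT_COST_PER_POINT

print(f"{points} data points per 96.96 run")
print(f"about ${total_cost:.2f} per run, of which ${reagent_cost:.2f} is reagents")
```

Under these figures, almost all of the per-run cost is the IFC itself, which is why the plate-reader route can be preferable for small projects.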
Fig. 1 Diagram detailing the KASPar genotyping chemistry. Components consist
of three user-designed primers (two allele-specific forward primers and one
common reverse primer) unique to a single SNP, two universal secondary oligos
with an attached 5′ fluorophore and bound quenchers (included in the KASPar
reagent), and DNA template. In the first round of PCR, only the common reverse
primer and the allele-specific primer that corresponds to the genotype of the
DNA template hybridize and extend, incorporating a 5′ tail into the PCR
product. During the second round of PCR, the common reverse oligo binds the
template made in the first round and extends, producing a complement to the
allele-specific 5′ tail. In the third round of PCR, the secondary oligo with
the attached fluorophore hybridizes to the PCR product, which releases the
fluorophore from its quencher and incorporates it into the final PCR product.
As amplification continues, additional fluorophores are released from their
quenchers, producing a strong allele-specific signal
2 Materials
2.1 High-Throughput Genotyping: KASPar Amplification and Genotyping Using
Fluidigm’s Dynamic Array
1. 2× KASPar reaction mix (contains Taq polymerase, reference dye ROX,
   secondary universal primers, 50 mM MgCl2, DMSO) (KBioscience,
   PN KSB-1004-001) (see Note 1).
2. 1–50 ng/μL Genomic DNA (see Note 2).
3. Competitive allele-specific KASPar SNP primers (100 μM each)
   (see Fig. 1 and Note 3).
4. 2× Assay Loading Reagent (Fluidigm, PN 85000736).
5. GT 20× Sample Loading Reagent (Fluidigm, PN 85000741).
7. Fluidigm chip (Dynamic Array integrated fluidic circuit (IFC))
   (see Note 4).
8. Fluidigm Control Line Fluid (comes with IFCs).
9. EP1 Reader (Fluidigm, PN EP1-EP1).
10. IFC Controller [HX] (Fluidigm, PN IFC-HX).
11. FC1 Cycler (Fluidigm, PN CYC-FC1).
12. Standard microfuge tubes (2.0 mL).
13. Standard 96-well polypropylene PCR plates.
14. Plate seals (Thermo Scientific, PN AB-0812).
15. Microplate plate sealer (Thermo Scientific, ALPS 50V, PN AB-1443).
2.2 Low-Throughput Genotyping: PCR Plate (96 or 384-Well) KASPar
Amplification and FRET Reader
1. 2× KASPar reaction mix (contains Taq polymerase, reference dye ROX,
   secondary universal primers, 50 mM MgCl2, DMSO) (KBioscience,
   PN KSB-1004-001).
2. 1–50 ng/μL Genomic DNA (see Note 2).
3. SNP primers (100 μM each) (see Note 3).
5. Standard 96- or 384-well Skirted PCR plate (see Note 5).
7. Optical plate seal (Thermo Scientific, PN AB-0812).
8. Thermal cycler.
9. FRET-capable plate reader (see Note 6).
10. Microplate plate sealer (Thermo Scientific, ALPS 50V, PN AB-1443).
2.3 Specific Target Amplification (See Note 7)
1. 2× Multiplex PCR Master Mix (Qiagen, PN 206143).
2. 10× STA primers (100 μM) (see Note 8).
3. TE buffer (10 mM Tris, 1 mM EDTA, pH 8.0; autoclave).
5. 1–50 ng/μL Genomic DNA (see Note 2).
7. Standard polypropylene 96-well PCR plates.
8. Thermal cycler.
3 Methods
The KBiosciences KASPar™ genotyping chemistry is based on the
concept of competitive allele-specific PCR (see Fig. 1). In this
protocol, genotype-specific primers (one for each of the SNP
alleles) and fluorophore-labeled oligos are used in a competitive
PCR reaction to produce an allele-specific fluorescent signal. Each
allele-specific primer is complementary to the target DNA template
and carries a base specific to one of the SNP alleles. Attached to
each allele-specific primer is a unique 5′ tail with sequence homology
to a universal secondary oligo labeled with either a FAM or a HEX
fluorophore. Fluorescence from the secondary oligo is initially
suppressed by bound quencher molecules. During the first round of
PCR, only the correct allele-specific primer binds and its 5′ tail is
incorporated into the PCR product. In the second round of PCR, the
reverse primer generates a sequence complementary to the 5′ tail of
the allele-specific primer. This allows the secondary
fluorophore-labeled oligo to bind and become incorporated into the
PCR product during the third round of PCR. Incorporation of the
fluorophore-labeled oligo into the PCR product releases it from its
quencher, allowing it to fluoresce. As PCR continues, the signal
increases. After completion of PCR the fluorescent signal can be read
and a genotype determined (see Fig. 4). If the starting template DNA
is of low quality or quantity, we highly recommend performing a
specific target amplification (STA) step (see Subheading 3.3) prior to
KASPar genotyping. STA reduces the complexity of the template DNA by
targeting and preamplifying the SNP amplicons, which may improve
the results of the subsequent KASPar genotyping.
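The endpoint read-out described above can be illustrated with a minimal Python sketch that turns one sample's FAM and HEX signals into a genotype call. It assumes ROX-normalized intensities, that allele 1 reports in FAM and allele 2 in HEX, and arbitrary angle and signal thresholds. These are illustrative assumptions, not the algorithm used by the Fluidigm or KBiosciences software.

```python
import math


def call_genotype(fam: float, hex_: float,
                  min_signal: float = 0.2,
                  low_deg: float = 30.0,
                  high_deg: float = 60.0) -> str:
    """Classify one sample from normalized FAM (x) and HEX (y) signals.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    # Weak total signal: no amplification (e.g., an NTC or failed reaction).
    if math.hypot(fam, hex_) < min_signal:
        return "no call"
    # Angle from the FAM axis: near 0 is FAM only, near 90 is HEX only.
    angle = math.degrees(math.atan2(hex_, fam))
    if angle < low_deg:
        return "allele 1 homozygote"
    if angle > high_deg:
        return "allele 2 homozygote"
    return "heterozygote"


for signals in [(1.0, 0.05), (0.8, 0.8), (0.05, 1.0), (0.05, 0.05)]:
    print(signals, "->", call_genotype(*signals))
```

In practice, cluster boundaries vary by assay, so the controls described in Note 16 are what anchor the calls rather than fixed thresholds.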
3.1 KASPar Genotyping via Fluidigm’s Dynamic Array
1. Inject control line fluid into the top and bottom control line
   fluid reservoirs of the 96.96 chip (one syringe per reservoir)
   (see Fig. 2).
2. Load chip into IFC Controller [HX] with the barcode facing
   out and “Prime” the IFC. Priming takes approximately 20 min.
3. Prepare working KASPar primer mix: In a 96-well PCR plate
   combine allele-specific primer 1, allele-specific primer 2,
   common reverse primer, and nuclease free water as shown in Table 1
   (see Note 9).
Fig. 2 Diagram of a Fluidigm 96.96 chip layout. Control line fluid is injected into
each of the control line fluid reservoirs and primed using the IFC Controller
(Control line fluid is pressurized causing it to enter the chip allowing control of
various valves). Ninety-six assays and 96 samples are then loaded into their
respective inlets (see Fig. 3). Assays and samples are then forced into the IFC
chip using the IFC controller
Table 1
KASPar primer mix preparation
Component                           Volume (μL)   Final concentration (μM)
Allele-specific primer 1 (100 μM)   3             12
Allele-specific primer 2 (100 μM)   3             12
Common reverse primer (100 μM)      7.5           30
Nuclease free water                 11.5
Total                               25
4. Prepare preassay cocktail: Combine 550 μL 2× Assay Loading
   Reagent and 396 μL nuclease free water into a PCR tube
   (preassay cocktail).
5. Distribute 8.6 μL of the preassay cocktail into each well of a
new 96-well plate containing 1.4 μL of individual working
KASPar primer mix.
6. Prepare presample cocktail: Combine 330 μL of KASPar reagent,
33 μL of GT Sample Loading Reagent, and 22 μL of nuclease
free water into a new PCR tube (presample cocktail).
7. Distribute 3.5 μL of the presample cocktail into each well of a
new 96-well plate containing 2.5 μL of genomic DNA.
8. Seal sample and assay plates.
9. Mix samples and assays well by gently vortexing the sealed plates.
10. Pipette 4 μL of 10× assay mix into each assay inlet using an
    8-channel pipette. Pipette by column the first six columns from
    prepared assay plate to chip assay inlets starting on the top left
    working right filling every other inlet in each of the six inlet
    columns. Pipette the remaining columns from the prepared
    assay plate (columns 7–12) just below the previously pipetted
    assays starting on the left working right (see Fig. 3).
Fig. 3 Diagram of one set of IFC inlets (assay or sample) and how to load them.
Using an 8-channel pipette, pipette assays and samples into their appropriate
inlets by column. A standard 8-channel pipette will pipette into every other inlet.
Pipette column 1 of the prepared 96-well plate of assays into the first column
(every other inlet as indicated by the blue highlighted inlets) then work right with
columns 2, 3, 4, 5, and 6. Pipette column 7 of the prepared 96-well plate of
assays into the first column of inlets (every other inlet as indicated by the purple
highlighted inlets) just below those pipetted previously. Work your way right with
columns 8, 9, 10, 11, and 12 until all inlets are filled. Repeat this pattern for
prepared samples, pipetting them into the sample inlets
11. Pipette 5 μL of samples into each sample inlet using an 8-channel
pipette. Pipette by column the first six columns from prepared
sample plate to chip sample inlets starting on the top left
working right filling every other inlet in each of the six inlet
columns. Pipette the remaining columns from the prepared
sample plate (columns 7–12) just below the previously pipetted
samples starting on the left working right (see Fig. 3).
12. Remove any bubbles in the sample and assay inlets (see Note 10).
13. Remove the blue plastic protector from the bottom of the IFC.
14. Place the chip in the FC1 Cycler with the barcode facing out
and thermal cycle using the touchdown conditions described
in Table 2 (see Notes 11 and 12).
Table 2
Touchdown PCR conditions for KASPar genotyping via Fluidigm’s
dynamic array
Cycle step   Temperature (°C)   Time
1            70                 30 min
             25                 10 min
2            94                 15 min
3            94                 20 s
             65                 1 min
4            94                 20 s
             64.2               1 min
5            94                 20 s
             63.4               1 min
6            94                 20 s
             62.6               1 min
7            94                 20 s
             61.8               1 min
8            94                 20 s
             61.0               1 min
9            94                 20 s
             60.2               1 min
10           94                 20 s
             59.4               1 min
11           94                 20 s
             58.6               1 min
12           94                 20 s
             57.8               1 min
13           94                 20 s
             57.0               1 min
14           Repeat step 13 for an additional 25 cycles
15           20                 30 s
Table 3
Further cycling conditions for KASPar genotyping via Fluidigm’s
dynamic array
Cycle step   Temperature (°C)   Time
1            94                 20 s
             57.0               1 min
2            Repeat step 1 for an additional four cycles
3            20                 30 s
Fig. 4 Example of SNP assays using the KASPar genotyping on the Fluidigm access array. The image was
obtained from Fluidigm’s SNP Genotyping Analysis software and shows a Cartesian graph with three distinct
genotypic clusters. Each dot represents one sample
15. Prepare to read the chip by turning on the EP1 Reader and
opening the EP1 Data Collection software (see Note 13).
16. Remove the chip from the FC1 Cycler and place it in the EP1
Reader with the barcode facing out. Read the chip using EP1
Data Collection software’s on screen directions.
17. Remove the chip from the EP1 Reader and place it back in the
FC1 Cycler and cycle for an additional five cycles using the
conditions outlined in Table 3.
18. Repeat steps 16 and 17 one more time, then read the chip a final
time as in step 16, to obtain reads at 36, 41, and 46 cycles
(see Note 14).
19. Use the Fluidigm SNP Genotyping Analysis software to analyze
the genotyping results. Genotyping results are plotted by SNP
assay on Cartesian graphs with each dot representing a single
sample genotype. Samples with the same genotype should group
together forming distinct genotype-specific clusters (see Fig. 4).
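The volumes and cycle counts used in this subheading can be checked with a short Python sketch. All numbers are taken from Tables 1 to 3, steps 4 to 7, and Note 9. The figure of 110 wells per cocktail (96 plus overage) is inferred from the stated volumes, and the variable names are illustrative.

```python
def final_conc_um(stock_um: float, volume_ul: float, total_ul: float) -> float:
    """Dilution rule C1 * V1 = C2 * V2, solved for C2."""
    return stock_um * volume_ul / total_ul


# Table 1: working primer mix prepared from 100 uM stocks.
table1_ul = {"allele_specific_1": 3.0, "allele_specific_2": 3.0,
             "common_reverse": 7.5, "water": 11.5}
mix_total_ul = sum(table1_ul.values())
conc_um = {name: final_conc_um(100.0, vol, mix_total_ul)
           for name, vol in table1_ul.items() if name != "water"}

# Note 9: each chip uses 1.4 uL of working primer mix per assay (step 5).
chips_per_mix = int(mix_total_ul // 1.4)
data_points = chips_per_mix * 96 * 96

# Steps 4-7: number of wells each cocktail can fill.
preassay_wells = round((550 + 396) / 8.6)
presample_wells = round((330 + 33 + 22) / 3.5)

# Table 2: touchdown annealing temperatures (steps 3-12), then 57 C cycles.
touchdown = [round(65.0 - 0.8 * i, 1) for i in range(10)]
cycles_at_first_read = len(touchdown) + 1 + 25      # steps 3-12, 13, and 14
# Table 3 adds five cycles before each further read.
read_points = [cycles_at_first_read + 5 * i for i in range(3)]

print(conc_um)
print(chips_per_mix, data_points)
print(preassay_wells, presample_wells)
print(touchdown, read_points)
```

The same dilution function can be reused if the primer stocks are supplied at a concentration other than 100 μM.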
3.2 KASPar Genotyping via FRET-Capable Plate Reader
1. Prepare KASPar primer mix: In a 96-well PCR plate combine
   allele-specific primer 1, allele-specific primer 2, common
   reverse primer, and nuclease free water as described in Table 1.
2. Prepare a DNA plate by pipetting 4 μL of DNA into each well
   of a 96 or 384 well plate and dry down the DNA sample in a
   centrifugal evaporator (speed vac) or by leaving the sample
   uncovered for several hours at room temperature in a laminar
   flow hood (see Note 15).
3. Prepare individual KASPar primer master mixes for each SNP
assay by combining the components in Table 4 into individu-
ally labeled microfuge tubes. Dispense KASPar primer master
mixes into each well of the prepared DNA plate (see Note 16).
4. Seal the plate with an optically clear seal, vortex briefly and
centrifuge the plate.
5. Thermal cycle the reaction using the touchdown conditions as
described in step 14 of Subheading 3.1.
6. Capture end-point fluorescence signal using a FRET-capable
plate reader. Genotyping results can be plotted by SNP assay
on Cartesian graphs with each dot representing a single sample
genotype using KBiosciences Kluster Caller or other similar
software packages (see Fig. 4).
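Scaling the per-reaction volumes of Table 4 to a full set of samples (step 3) is a simple multiplication, sketched below in Python. The function name and the default overage of two reactions are illustrative choices; the per-reaction volumes and the rule for omitting water come from Table 4.

```python
# Per-reaction volumes (uL) from Table 4.
PER_REACTION_UL = {
    "KASPar 2X reagent": 4.0,
    "KASPar primer mix": 0.11,
    "nuclease free water": 4.0,
}


def master_mix(n_reactions: int, overage: int = 2,
               dried_dna: bool = True) -> dict:
    """Scale Table 4 to n_reactions plus a number of overage reactions."""
    scale = n_reactions + overage
    mix = {}
    for component, volume in PER_REACTION_UL.items():
        if component == "nuclease free water" and not dried_dna:
            continue  # Table 4 footnote: omit water if DNA is left hydrated
        mix[component] = round(volume * scale, 2)
    mix["total"] = round(sum(mix.values()), 2)
    return mix


# One SNP assay across 11 samples and 1 control, as in Fig. 5.
print(master_mix(12, overage=2))
print(master_mix(12, overage=2, dried_dna=False))
```

One master mix is needed per SNP assay, so the function would be called once for each assay on the plate.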
3.3 Specific Target Preamplification
The specific target amplification (STA) is an optional step. STA
reduces the complexity of the template DNA by targeting and
preamplifying the SNP amplicons. This step is most useful when the
starting template DNA is of low quality or quantity.
1. Prepare 10× STA Primer mix (final solution will contain 500 nM
   of each primer): in a single PCR tube, combine 2 μL of each
   forward primer and 2 μL of each reverse primer and bring the
   final volume up to 400 μL by adding TE Buffer as described in
   Table 5.
Table 4
KASPar master mixes for genotyping via FRET-capable plate reader
Component               Volume (μL)
KASPar 2X reagent       4
KASPar primer mix       0.11
Nuclease free water^a   4
Total                   8.11
The volumes in this table are for one reaction (i.e., calculate the amount of primer
master mix needed for the number of DNA samples in your experiment plus overage)
^a If you do not dry down your DNA samples, omit the nuclease free water from the
primer master mix
Table 5
STA primer mix components
Component                                   Volume (μL) (for 96 assays)
100 μM STA forward (for all 96 assays)      2 each (192 total)
100 μM common reverse (for all 96 assays)   2 each (192 total)
TE buffer                                   16
Total                                       400
Table 6
STA premix components
Component                             Volume per    Volume for 96 samples
                                      sample (μL)   plus overage (μL)
Qiagen 2× Multiplex PCR Master Mix    2.5           275.0
10× STA primer mix (500 nM each)      0.5           55.0
Nuclease free water                   0.75          82.5
Total                                 3.75          412.5
Table 7
STA thermal cycling conditions
Cycle step   Temperature (°C)   Time
1            95                 15 min
2            95                 15 s
3            60                 4 min
4            Repeat steps 2 and 3 for an additional 13 cycles
5            Hold at 4 °C
2. Prepare STA premix: combine Qiagen PCR Master Mix,
   primer mix from step 1, and nuclease free water as described in
   Table 6.
3. Combine STA premix and genomic DNA into 96-well plate:
add 3.75 μL STA premix to each well followed by 1.25 μL
genomic DNA to each well for a total reaction volume of 5 μL.
4. Seal the 96-well plate and mix by vortexing.
5. Thermal cycle under the conditions described in Table 7.
6. Dilute STA products 1:100 by adding 1 μL of STA product to
   99 μL of TE. Use the diluted STA product in place of the
   genomic DNA template in the KASPar Genotyping reactions
   (both Subheadings 3.1 and 3.2; above) with no change in
   reagent volumes.
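The STA volumes and concentrations in Tables 5 and 6 and in steps 1, 3, and 6 can be verified with the following Python sketch. The figure of 110 reactions (96 samples plus overage) is inferred from the totals in Table 6; the variable names are illustrative.

```python
N_ASSAYS = 96
PRIMER_STOCK_UM = 100.0
UL_PER_PRIMER = 2.0
POOL_TOTAL_UL = 400.0

# Table 5: pooled 10x STA primer mix.
primer_ul = 2 * N_ASSAYS * UL_PER_PRIMER          # forward plus reverse
te_ul = POOL_TOTAL_UL - primer_ul
pool_conc_nm = PRIMER_STOCK_UM * UL_PER_PRIMER / POOL_TOTAL_UL * 1000

# Table 6: premix scaled from per-sample volumes (uL).
per_sample_ul = {"2x Multiplex PCR Master Mix": 2.5,
                 "10x STA primer mix": 0.5,
                 "nuclease free water": 0.75}
reactions = 110                                    # inferred: 96 plus overage
premix_ul = {name: vol * reactions for name, vol in per_sample_ul.items()}

# Step 3: 3.75 uL premix plus 1.25 uL genomic DNA per well.
reaction_ul = sum(per_sample_ul.values()) + 1.25
primer_in_reaction_nm = pool_conc_nm * per_sample_ul["10x STA primer mix"] / reaction_ul

# Step 6: 1 uL of STA product into 99 uL of TE.
dilution_factor = (1 + 99) / 1

print(primer_ul, te_ul, pool_conc_nm)
print(premix_ul, sum(premix_ul.values()))
print(reaction_ul, primer_in_reaction_nm, dilution_factor)
```

The tenfold drop from 500 nM in the pool to 50 nM in the reaction is what the "10×" label on the primer mix refers to.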
4 Notes
1. The KASPar reaction mix is sensitive to light and to repeated
   freeze and thaw cycles. Improper storage will lead to poor
   genotyping results. We recommend thawing the KASPar reaction mix
   once and aliquoting single-use volumes into individual
   microcentrifuge tubes. Once aliquoted, wrap tubes in foil and
   store at −20 °C. KASPar reaction mix handled appropriately
   should produce quality results for up to 6 months.
2. Genomic DNA should be extracted using standard DNA
extraction protocols that yield high quality DNA. If high qual-
ity DNA cannot be obtained or if concentrations are lower
than desired the STA step may provide more reliable results.
3. Competitive allele-specific KASPar SNP primers consist of two
allele-specific forward primers and one common reverse primer.
The allele-specific forward primers include a 5′ tail associated
with either FAM or HEX fluorophore-labeled oligos and also
include the polymorphic SNP base as the last (3′) nucleotide.
An example of general structure of the forward allele-specific
primers is shown below (see also Fig. 1). The italicized 5′ por-
tion is the 5′ tail corresponding to the sequence of the fluoro-
phore labeled oligo, the underlined portion is the SNP
assay-specific sequence, and the bolded 3′ nucleotide corre-
sponds to the single nucleotide polymorphism.
Allele specific 1: 5′-GAAGGTGACCAAGTTCATGCT
AAAGCTCATTATTCTTTCTAAAGAAATGATAG
Allele specific 2: 5′-GAAGGTCGGAGTCAACGGATTG
AAAGCTCATTATTCTTTCTAAAGAAATGATAA
KASPar primers can be designed using any number of primer
design software packages, including PrimerPicker
(KBiosciences, 2009) and should be stored at −20 °C.
4. Fluidigm IFCs come in 96.96, 48.48, and 24.128 formats,
where the first number represents the number of assays and the
second number represents the number of samples. The method
described here is specific for the 96.96 format but can be
adjusted to work with other IFC formats.
5. Ensure that the PCR plate is compatible with your plate reader.
6. Any FRET-capable plate reader can be used as long as it can read
emission wavelengths of 520 nm (FAM), 556 nm (HEX), and 610 nm
(ROX). We used a PHERAstar Plus (BMG LabTech GmbH,
Ortenberg, Germany).
8. The STA primers consist of one forward primer and one reverse
primer flanking each SNP. The reverse primer is identical to the
reverse primer used in KASPar allele-specific amplification.
The forward primers are also identical to the allele-specific
primers but do not include the 5′ tail or the polymorphic SNP
base (see Note 3).
9. This protocol makes enough working KASPar primer mix to
run approximately seventeen 96.96 chips (156,672 data
points). Note that primers will autohydrolyze over time (likely
accelerated by repeated freeze/thaw cycles), resulting in
non-allele-specific amplification and poor data point clustering.
Working KASPar primers should be stored at −20 °C in TE
(10 mM Tris, 1 mM EDTA, pH 7.5).
10. Bubbles left in inlets will prevent samples or assays from
loading properly. To eliminate bubbles from inlets, either use a
clean bent pipette tip to gently pull the bubble out or gently
aspirate alcohol vapor over the inlet using a wash bottle (suction
straw removed) containing a small amount of alcohol in the
bottom. The alcohol fumes quickly break the surface tension
of the bubbles, eliminating them. Be careful not to over-aspirate,
as the samples will evaporate.
11. Explanation of thermal cycling conditions:
Step 1: Thermal mixing step, which mixes the sample and assay
of each reaction.
Step 2: Hot start.
Steps 3–12: Touchdown cycles, with a 0.8 °C decrease each
cycle.
Steps 13–14: Amplification.
Step 15: Cool Down.
12. Samples will amplify more quickly if using STA product as the
starting DNA template. To prevent overamplification, reduce
the number of cycles in step 14 of the thermal cycling condi-
tions (Table 2) to 17 additional cycles (see Note 14).
13. Before the chip can be read, the EP1 Reader must be
turned on to allow the camera to cool to the appropriate
operating temperature (approx. 40 min).
14. Not all SNP assays will amplify at the same rate (i.e., some
assays will provide better results with fewer or more cycles than
the average). Obtaining three sets of data at five-cycle intervals
allows for comparison and increases the probability of obtaining
maximal separation of the genotype clusters.
15. Leaving the DNA hydrated is an alternative method. If using
hydrated DNA, omit the nuclease free water from the primer
master mix. We obtain more consistent results with the dry-down
method; we attribute the variability of hydrated samples to
unequal evaporation during DNA distribution with our liquid
handling robot. We have also scaled the reagents
proportionally to create 4 μL reactions, which are successfully
measured by the BMG PHERAstar Plus plate reader.
Fig. 5 Genotyping of 11 samples with eight SNP assays in a 96-well plate. Each row will be filled with a
single SNP assay (row A = SNP 1, row B = SNP 2, etc.). Each column (except the last) will be filled with a single
DNA sample (column 1 = Sample 1, column 2 = Sample 2, etc.). The last column will be used for controls.
Multiple positive as well as multiple negative controls are included. For this example, prepare eight KASPar
primer-specific master mixes. Each master mix should contain enough for 14–16 reactions (11
samples, 1 control, and 2–4 for overage). The table below depicts the plate set up (first and second numbers
in well positions represent SNP assay and sample numbers, respectively)
Row (assay)   1     2     3     4     5     6     7     8     9     10     11     12 (controls)
A (SNP 1)     1/1   1/2   1/3   1/4   1/5   1/6   1/7   1/8   1/9   1/10   1/11   1/A1
B (SNP 2)     2/1   2/2   2/3   2/4   2/5   2/6   2/7   2/8   2/9   2/10   2/11   2/A1
C (SNP 3)     3/1   3/2   3/3   3/4   3/5   3/6   3/7   3/8   3/9   3/10   3/11   3/A2
D (SNP 4)     4/1   4/2   4/3   4/4   4/5   4/6   4/7   4/8   4/9   4/10   4/11   4/A2
E (SNP 5)     5/1   5/2   5/3   5/4   5/5   5/6   5/7   5/8   5/9   5/10   5/11   5/Het
F (SNP 6)     6/1   6/2   6/3   6/4   6/5   6/6   6/7   6/8   6/9   6/10   6/11   6/Het
G (SNP 7)     7/1   7/2   7/3   7/4   7/5   7/6   7/7   7/8   7/9   7/10   7/11   7/NTC
H (SNP 8)     8/1   8/2   8/3   8/4   8/5   8/6   8/7   8/8   8/9   8/10   8/11   8/NTC
16. When designing your plate set up, be sure to include positive
controls, including homozygotes for allele 1 and allele 2, as
well as a control heterozygous sample (a synthetic heterozygote
can be made by mixing equal quantities of the homozygous
samples). Negative controls, including a no template
control (NTC; DNA-free water is substituted for the DNA
template), should also be included. Arraying samples and assays by
rows and columns, with the last row or column reserved for controls,
seems to be the most convenient for setting up. A sample set
up may look as shown in Fig. 5.
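The primer structure described in Note 3 can be made explicit with a small Python sketch that splits each example primer into its 5′ tail, assay-specific sequence, and 3′ SNP base. The sketch assumes that the first 21 nucleotides of each primer form the tail and that the tails pair with FAM and HEX as labeled below; both the tail boundary and the dye assignment are assumptions based on commonly used KASPar tails rather than statements from this protocol.

```python
# Assumed 21-nt universal tails; the dye assignment is an assumption.
TAILS = {
    "FAM": "GAAGGTGACCAAGTTCATGCT",
    "HEX": "GAAGGTCGGAGTCAACGGATT",
}

# Example primers from Note 3, written 5' to 3'.
ALLELE_1 = "GAAGGTGACCAAGTTCATGCTAAAGCTCATTATTCTTTCTAAAGAAATGATAG"
ALLELE_2 = "GAAGGTCGGAGTCAACGGATTGAAAGCTCATTATTCTTTCTAAAGAAATGATAA"


def split_primer(primer: str) -> dict:
    """Split an allele-specific primer into tail, assay-specific part, SNP base."""
    for dye, tail in TAILS.items():
        if primer.startswith(tail):
            body = primer[len(tail):]
            return {"dye": dye, "tail": tail,
                    "assay_specific": body[:-1],   # everything before the SNP
                    "snp_base": body[-1]}          # last (3') nucleotide
    raise ValueError("primer does not start with a recognized 5' tail")


for name, seq in (("allele 1", ALLELE_1), ("allele 2", ALLELE_2)):
    parts = split_primer(seq)
    print(name, parts["dye"], parts["assay_specific"], parts["snp_base"])
```

Under these assumptions the two primers differ in their tail, in the 3′ SNP base (G versus A), and by one extra 5′ nucleotide in the assay-specific part of the second primer.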
Acknowledgments
This research was funded by the McKnight Foundation and the Ezra
Taft Benson Agriculture and Food Institute. We gratefully
acknowledge Dr. Joshua Udall (BYU), Jared Clouse (BYU), and Jamie Rice
(Fluidigm) for their assistance and advice with regard to the
Fluidigm protocols.
References
1. Dou J, Zhao X, Fu X, Jiao W, Wang N, Zhang L, Hu X, Wang S, Bao Z (2012)
   Reference-free SNP calling: improved accuracy by preventing incorrect calls
   from repetitive genomic regions. Biology 7:17
2. Filiault DL, Maloof JN (2012) A genome-wide association study identifies
   variants underlying the Arabidopsis thaliana shade avoidance response.
   PLoS Genet 8:e1002589
3. Ogden R, Baird J, Senn H, Ross M (2012) The use of cross-species
   genome-wide arrays to discover SNP markers for conservation genetics: a
   case study from Arabian and scimitar-horned oryx. Conser Genet Resour
   4:471–473
4. Blair MW, Cortes AS, Penmetsa RV, Farmer A, Carrasquilla-Garcia N, Cook DR
   (2013) A high-throughput SNP marker system for parental polymorphism
   screening, and diversity analysis in common bean (Phaseolus vulgaris L.).
   Theor Appl Genet 126:535–548
5. Foolad MR, Panthee DR (2012) Marker-assisted selection in tomato breeding.
   Crit Rev Plant Sci 31:93–123
6. Maughan PJ, Smith SM, Fairbanks DJ, Jellen EN (2011) Development,
   characterization, and linkage mapping of single nucleotide polymorphisms
   in the grain amaranths (Amaranthus sp.). Plant Genome J 4:92–101
Chapter 19
Skim-Based Genotyping by Sequencing

Agnieszka A. Golicz, Philipp E. Bayer, and David Edwards
Abstract
Genotyping by sequencing (GBS) is a relatively new method used to determine the differences in the
genetic makeup of individuals. Its novelty stems from a combination of two already available methods:
genotyping and next-generation sequencing. Depending on the individual study design, GBS protocols
can take multiple forms; however, most share a sequence of core steps that have to be undertaken. These
include: sequencing of the DNA from the individuals of interest (usually two parents of a mapping population
and their progeny), mapping of the sequencing reads to the reference sequence, SNP calling and filtering,
SNP genotyping and imputation, followed by haplotype identification and downstream analysis. GBS has
a range of applications from general marker discovery, haplotype identification, and recombination charac-
terization to quantitative trait locus (QTL) analysis, genome-wide association studies (GWAS), and genomic
selection (GS). It has already been applied to a range of plant species including: rice, maize, artichoke, and
Arabidopsis thaliana. It is a promising approach which is likely to provide new and important insights into
plant biology.
Key words Genotyping, GBS, Markers, SNPs, SNP calling, Imputation, Haplotype identification,
Recombination
1 Introduction
Genotyping by sequencing, abbreviated to GBS or GbyS, is a
relatively new concept that has been developing over the last few years
and which holds the potential to transform current genetics research.
It stems from a combination of two available and widely applied
methods, namely, genotyping and next-generation sequencing.
The original GBS concept referred to a specific GBS approach [1]
but the term is now used more widely to include all genotyping
methods using next-generation sequencing.
Genotyping is a process of determining the differences in the
genetic makeup (genotype) of individuals by examining their DNA.
The sequence differences (polymorphisms) between individuals
can take multiple forms including: insertions, deletions, and single
nucleotide polymorphisms (SNPs). Molecular markers are indicators
of sequence polymorphisms and can be used to detect differences
in the DNA sequence between individuals. A range of different
molecular markers exist. For plants these include, but are not
limited to: restriction fragment length polymorphisms (RFLPs) [2],
amplified fragment length polymorphisms (AFLPs) [3], and simple
sequence repeats (SSRs) [4] (see Note 1). Albeit historically impor-
tant, and still preferred for certain applications [5], these markers
are being replaced by the use of single nucleotide polymorphisms
(SNPs) [6–8]. The main advantages of SNPs include their rela-
tively high density, evolutionary stability, and wide range of differ-
ent commercially available high-throughput assay protocols
[9–12]. However, a common feature these assays share is that they
require SNP locations to be known prior to the assay. By design, the
assays do not sequence the whole of the individual's DNA but rather
assess certain positions in the sequence and report the alleles encoun-
tered. Genotyping by sequencing (GBS) takes a different approach.
The method involves sequencing the whole (or a portion) of the
genome from multiple individuals and then analyzing the sequences
to genotype the polymorphisms.
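As a rough illustration of genotyping from sequence reads, the Python sketch below calls a single biallelic SNP in one progeny individual from the bases observed in reads covering that position. It assumes the two parental alleles are already known and ignores read alignment, base quality, and sequencing error, all of which a real pipeline must handle. The function name and the A/B/H coding are hypothetical.

```python
from collections import Counter


def call_site(read_bases: str, parent_a: str, parent_b: str) -> str:
    """Call one biallelic SNP from the bases seen in reads covering it.

    Returns 'A' or 'B' when only that parent's allele is observed,
    'H' when both alleles are observed, and '-' when no informative
    read covers the site (common at low sequencing coverage).
    """
    counts = Counter(read_bases.upper())
    seen_a = counts[parent_a] > 0
    seen_b = counts[parent_b] > 0
    if seen_a and seen_b:
        return "H"
    if seen_a:
        return "A"
    if seen_b:
        return "B"
    return "-"


# Parent A carries G and parent B carries T at this position.
for bases in ["GG", "t", "GT", "", "C"]:
    print(repr(bases), "->", call_site(bases, "G", "T"))
```

With very few reads per site, a heterozygous position will often show only one allele, which is one reason low-coverage calls are usually combined across neighboring SNPs.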
Large-scale DNA sequencing has become feasible due to the
introduction of next-generation sequencing technologies [13] and
several crop genomes have now been sequenced using this
technology [14–16], including isolated wheat chromosomes [17–19],
Brassica rapa [20], and chickpea [21].
Next-generation sequencing technologies include several sys-
tems from Illumina [22], Roche 454 [23], and Ion Torrent [24],
and provide cost- and labor-efficient sequencing of vast quantities
of DNA [25]. GBS analysis pipelines heavily rely on bioinformatics
tools for the analysis of the data produced by next-generation
sequencing platforms [26–31]. Continuous improvement of both
the sequencing technologies and computational analysis methods
makes GBS one of the prime candidates for routine genotyping in
the future [32].
One of the goals of genotyping by sequencing is to generate
haplotype maps showing the distribution of haplotype blocks.
There are different definitions of “haplotype blocks”; the most
common is “sizable regions over which there is little evidence for
historical recombination and within which only a few common
haplotypes are observed” [33]. In GBS, usually only one haplotype
is allowed per block.
2 GBS: Recent Applications in Plant Sciences
GBS has a range of applications from general marker discovery,

haplotype identification, and recombination characterization, to
quantitative trait locus (QTL) mapping, genome-wide association studies
(GWAS), and genomic selection (GS). Both whole-genome
sequencing and the reduced representation strategy (sequencing only
Skim-Based Genotyping by Sequencing 259
a part of the genome) can be incorporated in the GBS protocols.

A handful of recent experiments applying GBS as a method of
marker discovery are described below.
Huang et al. [34] performed whole-genome sequencing to
resequence a total of 150 recombinant inbred lines (RILs) devel-
oped from a cross between Oryza sativa ssp. indica and japonica.
The recombinant lines were sequenced to an average coverage of
0.02×, identifying a total of 1,493,461 SNPs, with an average density
of 1 SNP every 40 kilobases. In order to eliminate noise,
which stems from false positive SNP calls (see Note 2), blocks of
consecutive SNPs were used for genotyping. Subsequently, the
established genotypes were used to identify recombination break-
points. The recombination maps were in turn converted into a
skeleton bin map, resulting in a total of 2,334 recombination bins
obtained for the 150 RILs. The bins allowed a linkage map to be
constructed with bins serving as markers. Finally, the map was used
to identify four candidate QTLs linked to plant height.
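The idea of genotyping blocks of consecutive SNPs, rather than trusting each SNP call on its own, can be sketched as follows. This is a minimal illustration and not the procedure of Huang et al. [34]: the window size, the majority-vote rule, and the function names `smooth_calls` and `breakpoints` are assumptions made for the example.

```python
from collections import Counter

def smooth_calls(calls, window=5):
    """Replace each SNP call with the majority parental allele in a centred window.

    calls: list of 'A' (parent 1), 'B' (parent 2) or None (no read at that SNP).
    """
    half = window // 2
    smoothed = []
    for i in range(len(calls)):
        block = [c for c in calls[max(0, i - half): i + half + 1] if c is not None]
        if not block:
            smoothed.append(None)
            continue
        counts = Counter(block).most_common()
        # A tie gives no clear majority, so the position is left uncalled.
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            smoothed.append(None)
        else:
            smoothed.append(counts[0][0])
    return smoothed

def breakpoints(smoothed):
    """Indices at which the smoothed parental origin switches."""
    called = [(i, c) for i, c in enumerate(smoothed) if c is not None]
    return [j for (i, a), (j, b) in zip(called, called[1:]) if a != b]

# Two isolated, probably erroneous calls (indices 2 and 8) are absorbed by their
# neighbours, leaving a single candidate recombination breakpoint.
raw = ['A', 'A', 'B', 'A', 'A', 'A', 'B', 'B', 'A', 'B', 'B', 'B']
smoothed = smooth_calls(raw)
switches = breakpoints(smoothed)
```

A larger window suppresses more false positive calls but also blurs the position of true breakpoints, so the window size would need to be tuned to the marker density of the dataset.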
Yang et al. [35] performed sequencing of 40 F2 Arabidopsis
plants and their parents. In contrast to the work of Huang et al.
[34], the coverage was much higher, and the overall coverage for the
40 F2 plants totaled 824×. The initial set of markers discovered com-
prised 586,231 SNPs and 41,743 deletions. Further filtering reduced
the total to 415,357 markers. The markers were used to identify
haplotype blocks, and the haplotype blocks were in turn used to map
recombination events, both crossovers and gene conversions,
characterizing in detail recombination activity along the genome.
Interestingly, the authors concluded that small gene conversion
tracts represented over 90 % of all the recombination events. The
precise identification of small gene conversions was made possible by
the high density of markers screened in this GBS protocol.
Gore et al. [36] performed targeted sequencing of 20 % of the
maize genome, focusing on low-copy regions, across a panel of 27
inbred lines. Methylation sensitive restriction enzymes were used
to enrich for the nonrepetitive portion of the genome. Over 32
Gbp of sequence data were obtained, mostly at low coverage.
Following mapping of this data to the B73 maize reference
genome, more than 3.3 million SNPs and indels were identified.
The SNPs were identified by constructing a consensus sequence
for each inbred line and comparing this to the reference genome.
The SNPs were subsequently applied to assess genetic diversity,
study recombination rates and haplotype structure.
Scaglione et al. [37] used a previously described “restriction
site-associated DNA” sequencing method (RAD-Seq) [38] to dis-
cover SNP variants in artichoke. This is a good example of a species
which has a poorly explored genome that lacked any SNPs in the
public domain and could be genotyped due to the availability of
second-generation GBS technologies. RAD tags were sequenced
from the genomic DNA of three C. cardunculus mapping
population parents, generating approximately 1 Gbp of sequence

data which was used for de novo reference contig assembly and
SNP calling. A total of 33,784 SNP variants were identified and
genotyped. The study showed that RAD-Seq can be successfully
used to develop markers in poorly understood and highly hetero-
zygous species.
Elshire et al. [1] also adopted a procedure based on next-
generation sequencing of a genomic subset targeted using restric-
tion digestion, but with a protocol involving fewer steps than
RAD-Seq. Again, by choosing appropriate restriction enzymes,
repetitive portions of the genome could be avoided and the enrich-
ment of low copy regions could be achieved. The method has been
tested on 276 RILs from a high resolution maize mapping popula-
tion, identifying and genotyping more than 200,000 maize mark-
ers to enable GWAS in maize.
3 GBS: Method
3.1 Sequencing
The main consideration when adopting a sequencing strategy for a
GBS project involves deciding between whole genome and reduced
representation sequencing. An advantage of reduced representation
sequencing is a reduction in the amount of sequence data required.
However, this should be balanced by the increased complexity of
DNA sequencing library preparation, downstream bioinformatics
analysis, and potential bias of reduced complexity sequencing.
A variety of reduced representation protocols exist, including exon
capture [39], RNA-Seq [40], and RAD-Seq [38, 41] (for an exhaustive
review, see Davey et al. [42]). One of the
most popular methods of genome size reduction for GBS is restric-
tion site-associated DNA sequencing (RAD-Seq). However, while
working efficiently for some species, several sources of bias in RAD-
Seq experiments have been identified [43]. These include restric-
tion fragment bias, restriction site heterozygosity, and PCR GC
content bias, and these should be taken into account during experimental
design. Recently, with the falling cost of Illumina DNA
sequence data generation, whole-genome sequencing has emerged as a
viable alternative to reduced representation. Whole-genome sequencing
[35, 44, 45] decreases the number of steps and cost of library
preparation, reduces the complexity of downstream bioinformatics
analysis, and eliminates biases stemming from the use of restriction
enzymes, but may require more sequence data than reduced repre-
sentation methods. A major advantage of whole-genome sequenc-
ing is that genotyping resolution can be adjusted by generating
different quantities of data, balancing resolution with cost.
Sequencing depth is another major consideration during GBS
experimental design. The optimal mean coverage per locus may
vary depending on the species, experimental goals, and strategies
adopted. It is important to remember that an ideal distribution of

reads spreads them uniformly along the reference sequence.
However, both stochastic and experimental limitations prevent
such an even distribution [46] resulting in some portions of the
genome having higher coverage than the others. In general, lower
coverage per locus results in a lower confidence in the discovery of a
new marker [42]. In extreme cases, where complete coverage of all
the genome is required, at least 30× coverage per locus per individ-
ual is recommended [35, 43]. In recombinant populations with
high quality parental genotypes, genotyping a relatively small num-
ber of markers may be sufficient for trait association when coupled
with imputation techniques. In such cases coverage of below 1× may
suffice [44]. The sequencing of multiple individuals at low coverage
combined with imputation is often referred to as skim sequencing.
For example, sequencing of 517 rice landraces at average coverage of
1× followed by genotyping coupled to imputation was sufficient to
perform successful GWAS [44].
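To give a feel for how mean coverage translates into the fraction of SNP positions that can be genotyped directly, the sketch below uses an idealized model in which the number of reads covering a locus follows a Poisson distribution with mean equal to the coverage. This model is an assumption introduced for illustration; as noted above, real read distributions are less even, so the values are optimistic upper bounds. The function name `prob_covered` is hypothetical.

```python
import math

def prob_covered(mean_coverage, min_reads=1):
    """P(a locus is covered by at least min_reads reads) under a Poisson model."""
    p_fewer = sum(
        math.exp(-mean_coverage) * mean_coverage**i / math.factorial(i)
        for i in range(min_reads)
    )
    return 1.0 - p_fewer

# Coverage levels mentioned in this chapter: 0.02x, 1x and 30x.
for c in (0.02, 1, 30):
    print(f"{c}x: {prob_covered(c):.3f} of loci expected to have at least one read")
```

Under these assumptions roughly 63 % of SNP positions receive at least one read at 1× coverage and about 2 % at 0.02×, which is why low-coverage designs depend on imputation to recover the remaining genotypes.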
3.2 Read Mapping
There are a variety of tools used to map reads to a reference genome
(see Note 3), including but not limited to: SOAP2 [47], Bowtie
[48], and BWA [49]. The specific choice of the mapping software
depends on the application (see Note 4). The characteristics to
be taken into account include speed, sensitivity, ability to discover
indels, availability of computing resources, and personal preference.
Usually, only read pairs that map uniquely to a single position in the
genome should be considered. Discarding read pairs that map
equally well to more than one position reduces the false positive
variant discovery rate [50].
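The effect of discarding ambiguously mapped reads can be sketched as below. The tuple layout (read name, chromosome, position) and the function name `uniquely_mapped` are hypothetical; real aligners report multiple hits in their own ways (for example through flags or mapping quality values), so this is only an illustration of the principle.

```python
from collections import Counter

def uniquely_mapped(alignments):
    """Keep alignments whose read name occurs exactly once.

    alignments: iterable of (read_name, chromosome, position) tuples, one per
    reported hit; a read with several equally good hits appears several times.
    """
    alignments = list(alignments)
    hits = Counter(name for name, _, _ in alignments)
    return [aln for aln in alignments if hits[aln[0]] == 1]

hits = [
    ("read1", "Chr1", 1337),
    ("read2", "Chr1", 5210),   # read2 also maps to Chr3, so it is ambiguous
    ("read2", "Chr3", 88),
    ("read3", "Chr2", 42),
]
unique = uniquely_mapped(hits)
```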
3.3 SNP Calling and Filtering
GBS methods differ in that some, such as RAD-Seq, discover SNPs
during the genotyping process while others require an initial SNP
discovery process prior to genotyping. Many applications in plants
require prior knowledge of the SNP positions. SNP discovery is
usually performed by resequencing of the parental individuals fol-
lowed by read alignment to the available reference sequence and
SNP calling. Different SNP discovery tools are available and the
most appropriate one to use would depend on the species and data
type. Some example SNP discovery tools include: SHORE [51],
samtools mpileup [52], and SGSautoSNP [50]. The main conceptual
difference between SNP calling pipelines involves the use of the
reference sequence. Some use the reference sequence in the SNP
identification process, whereas the others use the reference for read
alignment only and then discover polymorphisms between the
aligned reads. During the SNP discovery process, sequence errors
and misaligned reads may result in false positive SNP calls and
these are considerations SNP discovery software needs to address
to ensure accurate prediction. Depending on the software used,
SNPs can be further filtered using criteria including: total coverage
at SNP position, number of reads supporting the SNP, proportion
of the reads supporting the SNP, and the quality score of bases at
the SNP position. The application of an appropriate choice of fil-
ters should remove a large proportion of false positive SNPs. In the
case of RAD-Seq, if a reference genome is available, the raw reads
can also be aligned to the reference sequence. Alternatively, when
no reference sequence is available, de novo RAD tag analysis is possible.
Very similar sequences that differ only by a small number of
mismatches, and presumably represent the same locus, are clustered
together; SNPs and indels can then be identified between
alleles.
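The four filtering criteria listed above (total coverage, number of supporting reads, proportion of supporting reads, and base quality) can be combined as in the following sketch. The record layout `CandidateSNP`, the function name `passes_filters`, and all threshold values are assumptions chosen for illustration; appropriate thresholds depend on the species, the sequencing depth, and the SNP discovery tool.

```python
from dataclasses import dataclass

@dataclass
class CandidateSNP:
    position: int
    total_depth: int          # reads covering the position
    alt_reads: int            # reads supporting the variant allele
    mean_base_quality: float  # mean quality score of bases at the position

def passes_filters(snp, min_depth=4, min_alt_reads=2,
                   min_alt_fraction=0.9, min_quality=20.0):
    """Apply the four criteria named in the text; thresholds are illustrative."""
    if snp.total_depth < min_depth:
        return False
    if snp.alt_reads < min_alt_reads:
        return False
    if snp.alt_reads / snp.total_depth < min_alt_fraction:
        return False
    return snp.mean_base_quality >= min_quality

candidates = [
    CandidateSNP(1337, total_depth=30, alt_reads=29, mean_base_quality=35.0),
    CandidateSNP(2048, total_depth=3, alt_reads=3, mean_base_quality=36.0),   # too shallow
    CandidateSNP(4096, total_depth=28, alt_reads=9, mean_base_quality=34.0),  # mixed signal
]
kept = [s.position for s in candidates if passes_filters(s)]
```

The high default for `min_alt_fraction` reflects the homozygous material discussed in this chapter; for heterozygous species a much lower value would be needed.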
3.4 SNP Genotyping
Typically, skim-based SNP genotyping involves resequencing of

multiple individuals followed by alignment of the reads to the refer-
ence sequence. For each of the individuals, reads mapping to the
known SNP positions are inspected and the SNP alleles are recorded.
Using this approach, genotype maps for the entire genomes can be
generated, making it possible to discern which part of the genome
was inherited from each of the parental individuals.
3.5 SNP Imputation
For certain applications, SNP imputation may be desired. Imputation
involves inferring missing genotypes to generate a more complete
picture of the genetic makeup. In the skim sequencing applications,
where often only a portion of markers for each individual will be
genotyped, imputation becomes an important step in the data anal-
ysis pipelines. If an appropriate method is chosen, the accuracy of
imputation can be very high [44]. Imputation methods range from
fairly straightforward “filling in of the missing SNP” based on the
known surrounding genotypes in recombinant populations, through a
k-nearest neighbors (KNN) algorithm-based model, which was successfully
applied in the analysis of 500 unrelated rice samples [44],
to sophisticated statistical methods which rely on linkage
disequilibrium structure and haplotype maps [53]. The choice of
imputation procedure requires careful consideration and needs to
be balanced with the data volume and desired resolution of geno-
typing. If too much information is missing from a sample, imputa-
tion may not be accurate and it may be beneficial to remove the
sample before analysis.
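The simplest of these approaches, filling in missing SNPs from the surrounding genotypes in a recombinant population, might look like the sketch below. It assumes calls ordered along one chromosome and fills a gap only when the nearest called SNPs on both sides agree; gaps flanked by different parental alleles are left missing, because a recombination event lies somewhere within them. The function names and this conservative rule are assumptions for illustration, not a published algorithm.

```python
def impute_from_flanks(calls):
    """Fill runs of missing calls (None) when the nearest calls on both sides agree."""
    imputed = list(calls)
    last_idx = None  # index of the most recent called SNP
    for i, call in enumerate(calls):
        if call is None:
            continue
        if last_idx is not None and calls[last_idx] == call:
            for j in range(last_idx + 1, i):
                imputed[j] = call
        last_idx = i
    return imputed

def missing_fraction(calls):
    """Proportion of SNP positions without a genotype call."""
    return sum(c is None for c in calls) / len(calls)

sample = ['A', None, None, 'A', None, 'B', None, 'B', None]
filled = impute_from_flanks(sample)
```

A sample could be screened with `missing_fraction` before imputation and removed if the value exceeds a chosen cut-off; the cut-off itself is a study-specific decision.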
3.6 Haplotype Identification and Further Analysis
The markers obtained from GBS may be used to construct high
density haplotype maps [44], estimate recombination rates [35, 36],
and perform GWAS [44] and genomic selection. The high number
of markers obtained compared to traditional genotyping provides
more precision for the downstream analysis.
Genotyping data can be used to scan for points in which
recombination rates are unusually high or low, i.e., recombination
hot- or cold-spots. SNP density can be measured along and
between the chromosomes. Analysis of SNP density provides infor-
mation regarding sequence conservation; the regions with lowest
SNP density presumably being the most conserved and the regions
with high SNP density being the least conserved. Sequence con-
servation may in turn provide clues regarding functionality and
evolutionary history. GWAS has emerged as a tool that enables
identification of genomic variation underlying complex traits. GWAS
is extremely convenient because it does not require any prior
knowledge about the location of the gene of interest. However, it
is dependent on the quality and the density of the markers used.
GBS has the potential to provide high density marker maps and will
improve the efficiency of GWAS in plants. Figure 1 depicts the
workflow in a sample GBS pipeline.
Fig. 1 An overview of workflow in a sample GBS pipeline

4 Example: Skim-Based GBS in Arabidopsis
This example is based on mock data and a relatively simple genome;
however, the method can be applied to real data and larger, more
complex genomes.
4.1 Experimental Design and Choice of Dataset
The aim of the experiment is to perform de novo SNP discovery
based on NGS data, genotyping of a population by skim sequencing,
and estimation of recombination frequency along Arabidopsis
thaliana chromosome 1.
Two parental individuals (P1 and P2) and 100 offspring doubled
haploid (DH) individuals were selected for sequencing (see Note
5). Illumina sequencing libraries, with an insert size of 500 bp,
were constructed for each individual. Then 100 bp paired reads
were generated using an Illumina HiSeq 2000. The parental indi-
viduals were sequenced to 30× coverage. The progeny individuals
were sequenced with average coverage of 1×.
4.2 Read Mapping
Reads from the parental individuals and offspring were mapped to
the A. thaliana genome downloaded from Phytozome
v9.0 (http://www.phytozome.net/) using SOAPaligner/soap2.
Only reads that mapped uniquely to one position in the genome were
considered, and a broad insert size range of 0–1,000 bp was selected.
SOAP parameters: -m 0 -x 1000 -r 0.
4.3 SNP Calling
SNPs were discovered using SGSautoSNP [50]. SGSautoSNP is a
SNP discovery tool designed specifically for complex, polyploid
plant genomes, though it also works well with simpler genomes.
SGSautoSNP represents a novel approach to SNP discovery, since
it does not consider the reference sequence during SNP calling.
Instead, it uses a reference to align reads from multiple samples
and then finds SNPs between samples by comparison of the mapped
reads. Also, considering the fact that plant populations are often
inbred or doubled haploid and highly homozygous, SGSautoSNP
discards all the SNPs that are heterozygous within a single sample,
deeming them most likely to result from mis-mapping of reads and
to represent false SNP calls.
According to the SGSautoSNP strategy, reads from both paren-
tal individuals (P1 and P2) are aligned to the reference genome and
then polymorphisms between parents are identified. The resulting
SNP file is then used for genotyping of the DH population.
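The strategy described above, in which SNPs are called between samples and positions that are heterozygous within a sample are discarded, can be illustrated with the following sketch. It is not the SGSautoSNP implementation: the pileup dictionaries, the `min_depth` threshold, and the function name `call_parental_snps` are assumptions, and the real tool may treat sequencing errors and coverage differently.

```python
def call_parental_snps(p1_pileup, p2_pileup, min_depth=2):
    """Report positions where two homozygous parents carry different alleles.

    Each pileup maps position -> string of bases observed in the mapped reads.
    The reference base is never consulted.
    """
    snps = {}
    for pos in sorted(p1_pileup.keys() & p2_pileup.keys()):
        bases1, bases2 = p1_pileup[pos], p2_pileup[pos]
        if len(bases1) < min_depth or len(bases2) < min_depth:
            continue
        # More than one allele within a parent is treated as a likely
        # mis-mapping artefact and the position is discarded.
        if len(set(bases1)) != 1 or len(set(bases2)) != 1:
            continue
        if bases1[0] != bases2[0]:
            snps[pos] = (bases1[0], bases2[0])
    return snps

p1 = {1337: "AAAA", 2000: "CCCC", 3000: "GGTG", 4000: "T"}
p2 = {1337: "TTT", 2000: "CCC", 3000: "GGGG", 4000: "AAA"}
snps = call_parental_snps(p1, p2)
```

In this toy input only position 1,337 is reported: position 2,000 is identical in both parents, position 3,000 is heterozygous within P1, and position 4,000 has too few reads in P1.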
4.4 SNP Genotyping
As outlined earlier, the reads from the 100 offspring doubled haploid
individuals were aligned to the same A. thaliana reference. A cus-
tom parser, GenotypeSNPs.pl, compares the SNPs called by
SGSautoSNP to the mappings generated for the offspring and
checks which nucleotides are present at each SNP position.
For example, SNP1 has an A in P1 and a T in P2 at position
1,337 on chromosome 1. Where a read from one of the progeny
maps to this position, the genotype is called as either A or T,
reflecting the genotype of one of the parents.
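The genotyping step in the SNP1 example can be sketched as below. This is an illustrative stand-in, not the GenotypeSNPs.pl parser, whose internals are not described here; the data structures and the rule of leaving conflicting or unexpected bases uncalled are assumptions.

```python
def genotype_progeny(parental_snps, progeny_pileup):
    """Assign each covered SNP to 'P1' or 'P2' from the bases seen in a progeny.

    parental_snps: position -> (P1 allele, P2 allele)
    progeny_pileup: position -> string of bases from the progeny's reads
    Positions without reads, or with conflicting or unexpected bases, stay None.
    """
    genotypes = {}
    for pos, (p1_allele, p2_allele) in parental_snps.items():
        observed = set(progeny_pileup.get(pos, ""))
        if observed == {p1_allele}:
            genotypes[pos] = "P1"
        elif observed == {p2_allele}:
            genotypes[pos] = "P2"
        else:
            genotypes[pos] = None
    return genotypes

parental = {1337: ("A", "T"), 5210: ("G", "C"), 9001: ("C", "T")}
progeny_reads = {1337: "T", 9001: "CC"}   # no read covers position 5210
calls = genotype_progeny(parental, progeny_reads)
```

At 1× coverage many positions resemble 5,210 in this example, with no read and therefore no call, which is the gap that imputation is meant to close.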
4.5 SNP Imputation
Because the DH individuals were sequenced at low coverage, missing
sequence data results in missing genotype calls. SNP imputation
was performed to “fill in” the missing genotypes based on
the haplotype structure of the parents. The principle behind the
imputation method is presented in Fig. 2.
4.6 Recombination Frequency Estimation
Based on the haplotypes of the two parents, the inheritance of
blocks of P1 (paternal homozygosity) and P2 (maternal homozygosity)
haplotypes was determined (see Note 6). Recombination events
Fig. 2 SNP imputation and haplotype identification for chromosome 1. Schematic
representation of a crossover event during meiosis (a) and examples that could be
seen in ten randomly chosen DH individuals, prior to imputation (b) and after
imputation (c). The red and blue bars represent chromosomes of P1 and P2 (F0),
respectively. Each row in (b) and (c) corresponds to a DH individual. White
blocks in (b) represent missing information, where SNPs had to be imputed.
Individuals with too much information missing (3, 9, and 10 from the top in (b))
were removed prior to further analysis. SNPs were imputed using surrounding
called SNP information and knowledge of the parental genotypes
Fig. 3 Recombination frequencies. Recombination frequency for each position along A. thaliana chromosome 1
(x-axis: chromosome position [Mbp]; y-axis: number of crossover (CO) events). The frequency was calculated by
summing the CO events for all the individuals in windows of width 5,000 bp
were identified as switches from P1 to P2 haplotype blocks or from

P2 to P1 haplotype blocks. Blocks of less than 2 kb were ignored,
as those were likely to represent the abundant non-crossover events
resulting from gene conversion [35]. The recombination frequency
was calculated as the number of P1 to P2 or P2 to P1 switches
for a window of 5,000 bp. A sample plot of the number of recom-
bination events against the length of A. thaliana chromosome 1 is
presented in Fig. 3.
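The counting procedure described in this subheading can be sketched as follows. The 2 kb minimum block size and the 5,000 bp window come from the text; everything else is an assumption made for illustration, including the input layout (sorted `(position, parent)` calls per individual), the function names, and the choice to place each switch at the midpoint between the two flanking SNPs.

```python
from collections import Counter

def haplotype_blocks(calls):
    """Collapse sorted (position, parent) calls into [parent, start, end] blocks."""
    blocks = []
    for pos, parent in calls:
        if blocks and blocks[-1][0] == parent:
            blocks[-1][2] = pos
        else:
            blocks.append([parent, pos, pos])
    return blocks

def crossover_positions(calls, min_block=2000):
    """Positions of P1/P2 switches after ignoring blocks shorter than min_block bp."""
    kept = [b for b in haplotype_blocks(calls) if b[2] - b[1] >= min_block]
    switches = []
    for left, right in zip(kept, kept[1:]):
        if left[0] != right[0]:
            # The exchange lies between the flanking SNPs; use the midpoint.
            switches.append((left[2] + right[1]) // 2)
    return switches

def crossovers_per_window(individuals, window=5000):
    """Sum crossover events over all individuals in fixed-width windows."""
    counts = Counter()
    for calls in individuals:
        for pos in crossover_positions(calls):
            counts[(pos // window) * window] += 1
    return dict(sorted(counts.items()))

ind1 = [(1000, "P1"), (4000, "P1"), (9000, "P1"), (9500, "P2"), (10200, "P2"),
        (11000, "P1"), (15000, "P1"), (21000, "P2"), (26000, "P2")]
ind2 = [(2000, "P2"), (8000, "P2"), (16000, "P1"), (24000, "P1")]
per_window = crossovers_per_window([ind1, ind2])
```

In `ind1` the 700 bp P2 block between 9,500 and 10,200 is ignored as a probable gene conversion tract, so only the later P1 to P2 switch is counted as a crossover.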
5 Notes
1. SSRs are also known as microsatellites.

2. False positives may result from sequencing errors and errors in
read alignment—explained in Subheadings 3.2 and 3.3.
3. The availability of a reference genome is a prerequisite for per-
forming skim-based GBS.
4. In our experience, SOAP2 maps fewer reads but with greater
accuracy than BWA. This makes SOAP2 favorable for SNP pre-
diction when designing fixed SNP assays such as the Illumina
Infinium or GoldenGate assays. In contrast, BWA is favored in
skim GBS, as BWA also generates gapped alignments which
SOAP2 cannot generate. This leads to a greater number of
SNPs identified and false predictions are easily eliminated
during the filtering/imputation steps.
5. Use of doubled haploid lines simplifies analysis of the results;
SNPs arising due to heterozygosity within a single individual
will be eliminated.
6. Haplotype blocks of all sizes are observed, ranging from several
base pairs to hundreds of kilobase pairs. The small blocks are
likely to be a result of gene conversion rather than crossing over.
The ratio of large/small haplotype blocks may vary between
species depending on the crossing over and gene conversion
frequency.
References
1. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE (2011) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:e19379
2. Botstein D, White RL, Skolnick M, Davis RW (1980) Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 32:314–331
3. Vos P, Hogers R, Bleeker M, Reijans M, van de Lee T, Hornes M, Frijters A, Pot J, Peleman J, Kuiper M et al (1995) AFLP: a new technique for DNA fingerprinting. Nucleic Acids Res 23:4407–4414
4. Jarne P, Lagoda PJL (1996) Microsatellites, from molecules to populations and back. Trends Ecol Evol 11:424–429
5. Arif IA, Bakir MA, Khan HA, Al Farhan AH, Al Homaidan AA, Bahkali AH, Sadoon MA, Shobrak M (2010) A brief review of molecular techniques to assess plant diversity. Int J Mol Sci 11:2079–2096
6. Edwards D, Forster JW, Chagné D, Batley J (2007) What are SNPs? In: Oraguzie NC, Rikkerink EHA, Gardiner SE, De Silva HN (eds) Association mapping in plants. Springer, New York, pp 41–52
7. Edwards D, Forster JW, Cogan NOI, Batley J, Chagné D (2007) Single nucleotide polymorphism discovery. In: Oraguzie N, Rikkerink E, Gardiner S, De Silva H (eds) Association mapping in plants. Springer, New York, pp 53–76
8. Chagné D, Batley J, Edwards D, Forster JW (2007) Single nucleotide polymorphism genotyping in plants. In: Oraguzie N, Rikkerink E, Gardiner S, De Silva H (eds) Association mapping in plants. Springer, New York, pp 77–94
9. Duran C, Edwards D, Batley J (2009) Molecular marker discovery and genetic map visualisation. In: Edwards D, Hanson D, Stajich J (eds) Applied bioinformatics. Springer, New York, pp 165–189
10. Hayward A, Dalton-Morgan J, Mason A, Zander M, Edwards D, Batley J (2012) SNP discovery and applications in Brassica napus. J Plant Biotechnol 39:1–12
11. Batley J, Edwards D (2009) Mining for single nucleotide polymorphism (SNP) and simple sequence repeat (SSR) molecular genetic markers. In: Posada D (ed) Bioinformatics for DNA sequence analysis. Humana, New York, pp 303–322
12. Appleby N, Edwards D, Batley J (2009) New technologies for ultra-high throughput genotyping in plants. In: Somers D, Langridge P, Gustafson J (eds) Plant genomics. Humana, New York, pp 19–40
13. Edwards D, Batley J, Snowdon R (2013) Accessing complex crop genomes with next-generation sequencing. Theor Appl Genet 126:1–11
14. Edwards D, Wang X (2012) Genome sequencing initiatives. In: Edwards D, Parkin IAP, Batley J (eds) Genetics, genomics and breeding of oilseed Brassicas. Science Publishers Inc., New Hampshire, pp 152–157
15. Imelfort M, Batley J, Grimmond S, Edwards D (2009) Genome sequencing approaches and successes. In: Somers D, Langridge P, Gustafson J (eds) Plant genomics. Humana, New York, pp 345–358
16. Edwards D, Batley J (2010) Plant genome sequencing: applications for crop improvement. Plant Biotechnol J 7:1–8
17. Berkman PJ, Skarshewski A, Lorenc MT, Lai K, Duran C, Ling EYS, Stiller J, Smits L, Imelfort M, Manoli S, McKenzie M, Kubalakova M, Simkova H, Batley J, Fleury D, Dolezel J, Edwards D (2011) Sequencing and assembly of low copy and genic regions of isolated Triticum aestivum chromosome arm 7DS. Plant Biotechnol J 9:768–775
18. Berkman PJ, Skarshewski A, Manoli S, Lorenc MT, Stiller J, Smits L, Lai K, Campbell E, Kubalakova M, Simkova H, Batley J, Dolezel J, Hernandez P, Edwards D (2012) Sequencing wheat chromosome arm 7BS delimits the 7BS/4AL translocation and reveals homoeologous gene conservation. Theor Appl Genet 124:423–432
19. Berkman PJ, Visendi P, Lee HC, Stiller J, Manoli S, Lorenc MT, Lai K, Batley J, Fleury D, Šimková H, Kubaláková M, Weining S, Doležel J, Edwards D (2013) Dispersion and domestication shaped the genome of bread wheat. Plant Biotechnol J 11:564–571
20. Wang X, Wang H, Wang J, Sun R, Wu J, Liu S, Bai Y, Mun J-H, Bancroft I, Cheng F, Huang S, Li X, Hua W, Wang J, Wang X, Freeling M, Pires JC, Paterson AH, Chalhoub B, Wang B, Hayward A, Sharpe AG, Park B-S, Weisshaar B, Liu B, Li B, Liu B, Tong C, Song C, Duran C, Peng C, Geng C, Koh C, Lin C, Edwards D, Mu D, Shen D, Soumpourou E, Li F, Fraser F, Conant G, Lassalle G, King GJ, Bonnema G, Tang H, Wang H, Belcram H, Zhou H, Hirakawa H, Abe H, Guo H, Wang H, Jin H, Parkin IAP, Batley J, Kim J-S, Just J, Li J, Xu J, Deng J, Kim JA, Li J, Yu J, Meng J, Wang J, Min J, Poulain J, Hatakeyama K, Wu K, Wang L, Fang L, Trick M, Links MG, Zhao M, Jin M, Ramchiary N, Drou N, Berkman PJ, Cai Q, Huang Q, Li R, Tabata S, Cheng S, Zhang S, Zhang S, Huang S, Sato S, Sun S, Kwon S-J, Choi S-R, Lee T-H, Fan W, Zhao X, Tan X, Xu X, Wang Y, Qiu Y, Yin Y, Li Y, Du Y, Liao Y, Lim Y, Narusaka Y, Wang Y, Wang Z, Li Z, Wang Z, Xiong Z, Zhang Z (2011) The genome of the mesopolyploid crop species Brassica rapa. Nat Genet 43:1035–1040
21. Varshney RK, Song C, Saxena RK, Azam S, Yu S, Sharpe AG, Cannon SB, Baek J, Tar'an B, Millan T, Zhang X, Rosen B, Ramsay LD, Iwata A, Wang Y, Nelson W, Farmer AD, Gaur PM, Soderlund C, Penmetsa RV, Xu C, Bharti AK, He W, Winter P, Zhao S, Hane JK, Carrasquilla-Garcia N, Condie JA, Upadhyaya HD, Luo M, Singh NP, Lichtenzveig J, Gali KK, Rubio J, Nadarajan N, Thudi M, Dolezel J, Bansal KC, Xu X, Edwards D, Zhang G, Kahl G, Gil J, Singh KB, Datta SK, Jackson SA, Wang J, Cook D (2013) Draft genome sequence of kabuli chickpea (Cicer arietinum): genetic structure and breeding constraints for crop improvement. Nat Biotechnol 31:240–246
22. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, Rasolonjatovo IM, Reed MT, Rigatti R, Rodighiero C, Ross MT, Sabot A, Sankar SV, Scally A, Schroth GP, Smith ME, Smith VP, Spiridou A, Torrance PE, Tzonev SS, Vermaas EH, Walter K, Wu X, Zhang L, Alam MD, Anastasi C, Aniebo IC, Bailey DM, Bancarz IR, Banerjee S, Barbour SG, Baybayan PA, Benoit VA, Benson KF, Bevis C, Black PJ, Boodhun A, Brennan JS, Bridgham JA, Brown RC, Brown AA, Buermann DH, Bundu AA, Burrows JC, Carter NP, Castillo N, Chiara ECM, Chang S, Neil Cooley R, Crake NR, Dada OO, Diakoumakos KD, Dominguez-Fernandez B, Earnshaw DJ, Egbujor UC, Elmore DW, Etchin SS, Ewan MR, Fedurco M, Fraser LJ, Fuentes Fajardo KV, Scott Furey W, George D, Gietzen KJ, Goddard CP, Golda GS, Granieri PA, Green DE, Gustafson DL, Hansen NF, Harnish K, Haudenschild CD, Heyer NI, Hims MM, Ho JT, Horgan AM, Hoschler K, Hurwitz S, Ivanov DV, Johnson MQ, James T, Huw Jones TA, Kang GD, Kerelska TH, Kersey AD, Khrebtukova I, Kindwall AP, Kingsbury Z, Kokko-Gonzales PI, Kumar A, Laurent MA, Lawley CT, Lee SE, Lee X, Liao AK, Loch JA, Lok M, Luo S, Mammen RM, Martin JW, McCauley PG, McNitt P, Mehta P, Moon KW, Mullens JW, Newington T, Ning Z, Ling Ng B, Novo SM, O'Neill MJ, Osborne MA, Osnowski A, Ostadan O, Paraschos LL, Pickering L, Pike AC, Pike AC, Chris Pinkard D, Pliskin DP, Podhasky J, Quijano VJ, Raczy C, Rae VH, Rawlings SR, Chiva Rodriguez A, Roe PM, Rogers J, Rogert Bacigalupo MC, Romanov N, Romieu A, Roth RK, Rourke NJ, Ruediger ST, Rusman E, Sanches-Kuiper RM, Schenker MR, Seoane JM, Shaw RJ, Shiver MK, Short SW, Sizto NL, Sluis JP, Smith MA, Ernest Sohna Sohna J, Spence EJ, Stevens K, Sutton N, Szajkowski L, Tregidgo CL, Turcatti G, Vandevondele S, Verhovsky Y, Virk SM, Wakelin S, Walcott GC, Wang J, Worsley GJ, Yan J, Yau L, Zuerlein M, Rogers J, Mullikin JC, Hurles ME, McCooke NJ, West JS, Oaks FL, Lundberg PL, Klenerman D, Durbin R, Smith AJ (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456:53–59
23. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376–380
24. IT (2013) Ion Torrent. http://www.iontorrent.com/
25. Imelfort M, Duran C, Batley J, Edwards D (2009) Discovering genetic polymorphisms in next-generation sequencing data. Plant Biotechnol J 7:312–317
26. Lorenc MT, Boskovic Z, Stiller J, Duran C, Edwards D (2012) Role of bioinformatics as a tool for oilseed Brassica species. In: Edwards D, Parkin IAP, Batley J (eds) Genetics, genomics and breeding of oilseed Brassicas. Science Publishers Inc., New Hampshire, pp 194–205
27. Duran C, Boskovic Z, Batley J, Edwards D (2011) Role of bioinformatics as a tool for vegetable Brassica species. In: Stiller J (ed) Vegetable Brassicas. Science Publishers Inc., New Hampshire, pp 406–418
28. Edwards D (2011) Wheat bioinformatics. In: Bonjean A, Angus W, Van Ginkel M (eds) The world wheat book. Lavoisier, Paris, pp 851–875
29. Lee H, Lai K, Lorenc MT, Imelfort M, Duran C, Edwards D (2012) Bioinformatics tools and databases for analysis of next generation sequence data. Brief Funct Genomics 2:12–24
30. Berkman PJ, Lai K, Lorenc MT, Edwards D (2012) Next generation sequencing applications for wheat crop improvement. Am J Bot 99:365–371
31. Batley J, Edwards D (2009) Genome sequence data: management, storage, and visualization. Biotechniques 46:333–336
32. Duran C, Eales D, Marshall D, Imelfort M, Stiller J, Berkman PJ, Clark T, McKenzie M, Appleby N, Batley J, Basford K, Edwards D (2010) Future tools for association mapping in crop plants. Genome 53:1017–1023
33. Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D (2002) The structure of haplotype blocks in the human genome. Science 296:2225–2229
34. Huang X, Feng Q, Qian Q, Zhao Q, Wang L, Wang A, Guan J, Fan D, Weng Q, Huang T, Dong G, Sang T, Han B (2009) High-throughput genotyping by whole-genome resequencing. Genome Res 19:1068–1076
35. Yang S, Yuan Y, Wang L, Li J, Wang W, Liu H, Chen J-Q, Hurst LD, Tian D (2012) Great majority of recombination events in Arabidopsis are gene conversion events. Proc Natl Acad Sci. doi:10.1073/pnas.1211827110
36. Gore MA, Chia J-M, Elshire RJ, Sun Q, Ersoz ES, Hurwitz BL, Peiffer JA, McMullen MD, Grills GS, Ross-Ibarra J, Ware DH, Buckler ES (2009) A first-generation haplotype map of maize. Science 326:1115–1117
37. Scaglione D, Acquadro A, Portis E, Tirone M, Knapp SJ, Lanteri S (2012) RAD tag sequencing as a source of SNP markers in Cynara cardunculus L. BMC Genomics 13:3
38. Miller MR, Dunham JP, Amores A, Cresko WA, Johnson EA (2007) Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Res 17:240–248
39. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J (2009) Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461:272–276
40. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M (2008) The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320:1344–1349
41. Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA, Selker EU, Cresko WA, Johnson EA (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One 3:e3376
42. Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, Blaxter ML (2011) Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat Rev Genet 12:499–510
43. Davey JW, Cezard T, Fuentes-Utrilla P, Eland C, Gharbi K, Blaxter ML (2012) Special features of RAD Sequencing data: implications for genotyping. Mol Ecol 22:3151–3164
44. Huang X, Wei X, Sang T, Zhao Q, Feng Q, Zhao Y, Li C, Zhu C, Lu T, Zhang Z, Li M, Fan D, Guo Y, Wang A, Wang L, Deng L, Li W, Lu Y, Weng Q, Liu K, Huang T, Zhou T, Jing Y, Li W, Lin Z, Buckler ES, Qian Q, Zhang Q-F, Li J, Han B (2010) Genome-wide association studies of 14 agronomic traits in rice landraces. Nat Genet 42:961–967
45. Wilkening S, Tekkedil MM, Lin G, Fritsch ES, Wei W, Gagneur J, Lazinski DW, Camilli A, Steinmetz LM (2013) Genotyping 1000 yeast strains by next-generation sequencing. BMC Genomics 14:90
46. Sampson J, Jacobs K, Yeager M, Chanock S, Chatterjee N (2011) Efficient study design for next generation sequencing. Genet Epidemiol 35:269–277
47. Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J (2009) SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25:1966–1967
48. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25
49. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760
50. Lorenc MT, Hayashi S, Stiller J, Lee H, Manoli S, Ruperao P, Visendi P, Berkman PJ, Lai K, Batley J, Edwards D (2012) Discovery of single nucleotide polymorphisms in complex genomes using SGSautoSNP. Biology 1:370–382
51. Ossowski S, Schneeberger K, Clark RM, Lanz C, Warthmann N, Weigel D (2008) Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res 18:2024–2033
52. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079
53. Halperin E, Stephan DA (2009) SNP imputation in association studies. Nat Biotechnol 27:349–351
Chapter 20
The Restriction Enzyme Target Approach to Genotyping by Sequencing (GBS)
Elena Hilario
Abstract
The modified genotyping by sequencing method described here emphasizes verifying the success of each
library ligation by performing individual PCRs, before preparing the pool of barcoded amplicons to be
sequenced. Although this extra step might seem excessive, it will give peace of mind to the researcher knowing
each individual is represented in the data set and avoid additional data imputation at the analysis stage.
Key words Genome partitioning, GBS, Genotyping by sequencing, Next generation sequencing
1 Introduction
Genome partition techniques aim to reduce the amount of genomic
sequence to be analyzed by next generation sequencing platforms.
The complexity and size of the sequence data set to be analyzed is
reduced by sampling only the areas of interest. The partition could
consist of targeting one specific gene family by PCR and capturing
the amplified products by hybridization to a specific probe. When
no a priori sequence information is available, as is the case in
genetic marker development, physical features of the genome
can be used to reduce its complexity and target only certain areas.
The sequence specificity of restriction enzymes allows the targeting
of those recognition sites to explore the flanking sequence to
discover a single nucleotide polymorphism, or simple sequence
repeat, which can be developed as genetic markers. Some restriction
enzymes will also differentiate between methylated and nonmethyl-
ated sites, which are another physical feature of genome architec-
ture often used to identify repetitive elements. Several methods for
partitioning the genome have been reported, and more specific
tools for reducing the genome complexity to develop genetic
markers have been developed [1, 2].
Genotyping by sequencing (GBS) is one of several targeted
approaches developed to reduce the genome complexity and
develop genetic markers [3]. In this approach the genomic DNA
of each individual is digested with the selected restriction enzyme.
Each restriction fragment will have an adaptor ligated to each end:
a barcoded and a common adaptor. These adaptors contain the
annealing sites of the Illumina compatible PCR primers, which
allow direct loading of the amplicon mixture into the flow cell for
cluster formation. The method described in this chapter differs
from that of Elshire and colleagues [3] in the following ways: 1 μg of
genomic DNA is used as starting material; 1 pmol of each adaptor
is used for the adaptor ligation step; the digested DNA and annealed
adaptors are not dried down; the annealing is performed according
to Ko et al. [4]; each individual GBS library is independently amplified
and analyzed before preparing the amplicon pool; and a proofreading
DNA polymerase is used for the amplification step. The
barcoded oligonucleotides were designed by Deena Bioinformatics
[5]. The criteria for selecting the restriction enzyme can be based
on the same principles used by Restriction site Associated DNA
(RAD) Sequencing [6], which takes into account the genome size,
GC content, multiplex factor, and number of reads obtained by the
sequencing platform.
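As a rough illustration of how genome size, GC content, multiplex factor, and read yield interact when choosing an enzyme, the following sketch estimates the expected number of recognition sites and the mean read depth per fragment end. It is only a back-of-the-envelope model: it assumes a random sequence with independent bases, two sequenceable ends per site, and perfectly even sampling, none of which hold in a real genome. The genome size, GC content, and read count in the usage lines are hypothetical placeholders, not values from this protocol.

```python
def site_probability(site: str, gc: float) -> float:
    """Chance that a random position starts `site`, assuming independent
    bases with G = C = gc/2 and A = T = (1 - gc)/2 (a simplification)."""
    base_p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    prob = 1.0
    for base in site.upper():
        prob *= base_p[base]
    return prob


def expected_sites(site: str, genome_bp: float, gc: float) -> float:
    """Expected number of recognition sites in a genome of genome_bp."""
    return genome_bp * site_probability(site, gc)


def reads_per_tag(total_reads: float, n_samples: int, n_sites: float) -> float:
    """Mean reads per fragment end per sample, assuming two ends per site
    and perfectly even sampling (both optimistic assumptions)."""
    return total_reads / (n_samples * 2 * n_sites)


if __name__ == "__main__":
    # Hypothetical inputs: 750 Mbp genome, 36 % GC, 48 samples, 150 M reads.
    sites = expected_sites("GGATCC", 750e6, 0.36)  # GGATCC = BamHI site
    print(f"expected BamHI sites: {sites:,.0f}")
    print(f"reads per tag per sample: {reads_per_tag(150e6, 48, sites):.1f}")
```

If the estimated depth per tag is too low for confident genotype calls, the usual options are a rarer cutter, fewer samples per lane, or more sequencing.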
2 Materials
2.1 Plasticware and Consumables
1. 1 mL deep well plate (NUNC cat# 260252).
2. 1.5 mL screw-capped tubes, sterile.
3. 15 mL screw-capped sterile conical plastic tube.
4. Aluminum tape (NUNC cat# 232698).
5. PCR plates (Thermo-Fast ABGAB-0600).
6. 2 L plastic box.
7. Sterile deionized water.
8. 1× TE: 10 mM Tris–HCl pH 7.5, 1 mM EDTA, sterile.
9. 25 mM MgCl2, sterile.
10. 20 mg/mL dextran blue (dissolved in sterile deionized water).
11. 3 M sodium acetate pH 5.2.
12. 100 % absolute ethanol.
13. 70 % ethanol (freshly made).
14. 2 % agarose gels in 1× TAE buffer: 40 mM Tris, pH 7.6,
20 mM acetic acid, 1 mM EDTA.
15. 0.5 μg/mL ethidium bromide or 1× SYBR Safe® (Life
Technologies) staining solutions in 1× TAE.
2.2 Enzymes and Kits
1. AccuPrime™ Taq DNA polymerase High Fidelity 5 U/μL and
10× AccuPrime™ High Fidelity Buffer I (Life Technologies,
CA, USA).
2. QIAquick® PCR Purification Kit (Qiagen®, Germany, cat# 28104).
3. Restriction enzyme set from New England Biolabs: 10× NEB
3, 100× BSA, BamHI 20 U/μL.
4. T4 DNA ligase and 5× T4 DNA ligase buffer (Life
Technologies™, CA, USA).
5. Agilent High Sensitivity DNA kit (Agilent Technologies cat#
5067-4626).
6. 1 kb plus DNA ladder.
7. Lambda DNA.
8. Uncut pBluescript (Agilent Technologies) or any other clon-
ing vector like pUC19.
2.3 Equipment
1. Water bath at 65 °C.
2. Bench top centrifuge with plate rotor.
3. Bench top centrifuge for microcentrifuge tubes.
4. Horizontal gel electrophoresis unit with power supply.
5. Bioanalyzer (Agilent Technologies).
3 Methods
3.1 DNA Digestion
1. Normalize all your DNA preparations to have the same concentration
to speed up the pipetting steps (see Note 1).
2. Prepare a restriction enzyme master mix and aliquot into a
1 mL deep well plate, per reaction: 5 μL of 10× NEB reaction
buffer 3, 0.5 μL of 100× BSA, 1 μL of 20 U/μL BamHI, 1 μg
genomic DNA, deionized sterile water to 50 μL.
3. Seal the plate with aluminum tape and mix gently by vortexing.
4. Incubate the digestions at 37 °C for 3 h.
5. Spin down and store at −20 °C until ready for the next step.
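The per-reaction digestion recipe above can be scaled to a full plate with a few lines of arithmetic. The sketch below is one way to do it, assuming the DNA has been normalized so that the shared reagents (buffer, BSA, enzyme) can be combined first. The 10 % pipetting overage and the function names are my own illustrative choices, not part of the protocol.

```python
# Shared reagents per 50 uL digestion (volumes from Subheading 3.1, step 2).
DIGEST_PER_REACTION_UL = {
    "10x NEB buffer 3": 5.0,
    "100x BSA": 0.5,
    "BamHI 20 U/uL": 1.0,
}
REACTION_VOLUME_UL = 50.0


def digest_master_mix(n_samples: int, overage: float = 0.10) -> dict:
    """Scale the shared reagents for n_samples plus a pipetting overage
    (the 10 % default is an assumption, adjust to your own practice)."""
    factor = n_samples * (1 + overage)
    return {name: round(vol * factor, 2)
            for name, vol in DIGEST_PER_REACTION_UL.items()}


def dna_and_water_ul(dna_ng_per_ul: float, dna_ng: float = 1000.0) -> tuple:
    """Volume of DNA holding 1 ug, and water needed to reach 50 uL."""
    dna_ul = dna_ng / dna_ng_per_ul
    shared_ul = sum(DIGEST_PER_REACTION_UL.values())
    water_ul = REACTION_VOLUME_UL - shared_ul - dna_ul
    if water_ul < 0:
        raise ValueError("DNA too dilute to fit 1 ug in a 50 uL reaction")
    return round(dna_ul, 2), round(water_ul, 2)


if __name__ == "__main__":
    print(digest_master_mix(48))       # 46 progeny + 2 parents, as in Note 1
    print(dna_and_water_ul(50.0))      # hypothetical stock at 50 ng/uL
```

The check on the water volume flags DNA stocks below roughly 23 ng/μL, which cannot deliver 1 μg within a 50 μL reaction.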
3.2 Adaptor Annealing
1. The barcode adaptor stock solutions should be dissolved at
20 pmol/μL in 1× TE, pH 7.5, in the plates received from the
oligonucleotide synthesis provider (see Note 2), usually in
1 mL deep well plates. The common adaptor (GC) and the
common BamHI (GCB) adaptor stocks should be dissolved at
500 pmol/μL in 1× TE, pH 7.5, in screw-capped tubes.
2. In a new PCR plate, aliquot the barcode adaptors to have a
final concentration of 10 pmol/μL of each oligonucleotide:
10 μL of barcode(+)strand 20 pmol/μL (GBp_1) and 10 μL of
barcode(−)strand 20 pmol/μL (GBn_1).
3. Seal the plate with aluminum tape and mix gently by
vortexing.
4. The common adaptors are annealed in 1.5 mL screw-capped
centrifuge tubes, to a final concentration of 10 pmol/μL of
each oligonucleotide, as follows, per reaction:
1 μL of common adaptor 500 pmol/μL (GC), 1 μL of common
BamHI adaptor 500 pmol/μL (GCB), and 48 μL of 1× TE
pH 7.5.
5. Incubate the barcode adaptor pair plate and common adaptor
pair tube at 65 °C in the water bath for 5 min.
6. Spin down briefly.
7. Add 1.6 μL of 25 mM MgCl2 to each well of the barcode
adaptor pair plate with a multichannel pipette. Seal the plate
and mix gently by vortexing.
8. Add 4 μL of 25 mM MgCl2 to the common adaptor pair tube.
Mix contents gently by tapping the end of the tube.
9. Incubate at 65 °C for 5 min.
10. Remove about 1.5 L of water from the water bath and place it
in the large plastic box (see Note 3).
11. Transfer the barcode adaptor pair plate and the common
adaptor pair tube to the plastic box and let them cool down to
room temperature (~23 °C) for about 2 h. Do not close the
plastic box.
13. Add 20 μL of 1× TE, pH 7.5, to each well of the barcode
adaptor pair plate with a multichannel pipette. Seal the plate
and mix gently by vortexing. The final volume is 40 μL.
14. Add 50 μL of 1× TE, pH 7.5, to the common adaptor pair
tube. Mix contents by tapping the end of the tube. The final
volume is 100 μL.
After annealing, you have 5 pmol/μL of annealed GBp_1 and
GBn_1 (barcode adaptor pair mix), and the same concentration of
annealed GC and GCB (common adaptor mix).
3.3 Annealed GBS Adaptor Plate
1. In a new PCR plate add 6 μL of 1× TE, pH 7.5, to each well.
2. Aliquot 2 μL of the annealed common adaptor mix to each
well.
3. Aliquot 2 μL of each annealed barcoded adaptor pair mix to its
corresponding well. Seal the plate and mix. This is now the
annealed GBS adaptor plate.
The concentration of the annealed GBS adaptor plate is
1 pmol/μL for each oligonucleotide: annealed GBp_1, GBn_1,
GC, and GCB.
For short-term storage (4–5 days), keep at 4 °C; otherwise,
store at −20 °C.
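The adaptor concentrations quoted in Subheadings 3.2 and 3.3 follow from simple dilution arithmetic, which can be traced as in the sketch below. It uses the nominal volumes stated in the text (40 μL and 100 μL after adding TE) and, for comparison, also shows the barcode value when the 1.6 μL of MgCl2 is counted. The function and key names are illustrative only.

```python
def dilute(conc: float, aliquot_ul: float, final_ul: float) -> float:
    """Concentration after bringing aliquot_ul of stock to final_ul."""
    return conc * aliquot_ul / final_ul


def adaptor_trace() -> dict:
    """Follow each oligonucleotide (pmol/uL) through Subheadings 3.2-3.4."""
    barcode_mix = dilute(20, 10, 20)    # 10 uL of each strand, 20 uL total
    common_mix = dilute(500, 1, 50)     # 1 uL of each oligo in 50 uL
    # Nominal volumes after adding TE; the small MgCl2 volume is ignored,
    # as in the protocol text.
    barcode_annealed = dilute(barcode_mix, 20, 40)
    common_annealed = dilute(common_mix, 50, 100)
    # GBS adaptor plate: 2 uL of each annealed mix in a 10 uL well.
    plate_barcode = dilute(barcode_annealed, 2, 10)
    plate_common = dilute(common_annealed, 2, 10)
    return {
        "barcode_mix": barcode_mix,
        "common_mix": common_mix,
        "barcode_annealed": barcode_annealed,
        "common_annealed": common_annealed,
        # Same step with the 1.6 uL of MgCl2 included in the volume.
        "barcode_annealed_with_mgcl2": round(dilute(barcode_mix, 20, 41.6), 2),
        "plate_barcode": plate_barcode,
        "plate_common": plate_common,
        # 1 uL of the plate goes into each digest (Subheading 3.4).
        "pmol_each_adaptor_per_ligation": plate_barcode * 1.0,
    }


if __name__ == "__main__":
    for step, value in adaptor_trace().items():
        print(f"{step}: {value}")
```

The last entry reproduces the 1 pmol of each adaptor per 1 μg of genomic DNA stated in the Introduction.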
3.4 Anneal the Adaptor Pairs to the Digested DNA
1. Aliquot 1 μL from each well of the annealed GBS adaptor plate into the
corresponding well of the DNA restriction enzyme digestion
plate. Seal the plate, spin down briefly, mix gently by vortexing,
and spin down again.
2. Incubate the DNA restriction enzyme digestion plate containing
the annealed GBS adaptors at 65 °C for 5 min in a water bath.
3. Remove about 1.5 L of water from the water bath and place it
in the large plastic box.
4. Transfer the plate to the plastic box and let it cool down to
room temperature (~23 °C) for about 2 h.
3.5 Ligation
1. Prepare a T4 DNA ligase master mix, per reaction: 3 μL of
deionized sterile water, 14 μL of 5× T4 DNA ligase buffer, and
2 μL of T4 DNA ligase 1 U/μL (19 μL in total). This mix is added
to the 51 μL of digested DNA + annealed GBS adaptors already
in each well, for a 70 μL ligation reaction.
2. Aliquot 19 μL of the T4 DNA ligase master mix into each well of the
plate containing the digested DNA and annealed GBS adaptors.
Change tips after each transfer. Seal the plate, mix by gently
vortexing, and spin down briefly.
3. Incubate the ligation reactions at 4 °C (refrigerator) overnight
(see Note 4).
4. Spin down at 500 × g for 5 min at room temperature.
5. Bring the total volume to 100 μL by adding 1 μL of 20 mg/mL
dextran blue, and 29 μL of 1× TE pH 7.5. Spin down briefly
and mix by gently vortexing.
6. Add 10 μL of 3 M sodium acetate pH 5.2, mix by vortexing
and add 200 μL of 100 % absolute ethanol. Seal the plate and
mix gently by vortexing. Incubate at −20 °C for at least 2 h, or
overnight.
7. Spin down at 1,000 × g for 25 min, in the cold room if possible;
otherwise at room temperature.
8. Discard supernatant by inverting the plate over a container.
Add 200 μL of 70 % ethanol. Let it stand at room temperature
for 30–60 min (see Note 5). Spin down as in step 7.
9. Discard the supernatant and blot the plate over two paper tow-
els to remove all liquid.
10. To remove all traces of ethanol solution spin down the plate
inverted over a piece of paper towel for 3–4 s only.
11. Let the plate air dry at room temperature for 10 min.
12. Add 50 μL of 1× TE, pH 7.5. Seal the plate and vortex thor-
oughly. Let the DNA dissolve completely at 4 °C overnight
(see Note 6).
13. For long-term storage, keep at −20 °C.
You now have a cleaned GBS_BamHI ligation plate, called the
GBS_BamHI library plate (stored in a 1 mL deep well plate), that
can be used for producing amplicons up to 20 times, according to
the following amplification protocol.
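The volumes in Subheading 3.5 can be checked with a short ledger, shown here as a sketch. It confirms that the ligase buffer ends up at 1×, reports the approximate sodium acetate concentration and ethanol ratio at the precipitation step, and shows where the figure of 20 amplifications per library comes from (50 μL of library, 2.5 μL per PCR). The key names are illustrative.

```python
def ligation_ledger() -> dict:
    """Volume bookkeeping for one well through Subheading 3.5 (all in uL)."""
    digest_plus_adaptor = 50.0 + 1.0          # digest + 1 uL GBS adaptors
    master_mix = 3.0 + 14.0 + 2.0             # water + 5x buffer + ligase
    ligation = digest_plus_adaptor + master_mix
    diluted = ligation + 1.0 + 29.0           # dextran blue + TE (step 5)
    salted = diluted + 10.0                   # 3 M sodium acetate (step 6)
    return {
        "ligation_ul": ligation,
        "ligase_buffer_x": 5.0 * 14.0 / ligation,
        "volume_before_precipitation_ul": diluted,
        "sodium_acetate_M": round(3.0 * 10.0 / salted, 2),
        "ethanol_volumes": round(200.0 / salted, 2),
        "well_total_ul": salted + 200.0,
        # 50 uL of resuspended library, 2.5 uL used per amplification.
        "amplifications_per_library": int(50.0 // 2.5),
    }


if __name__ == "__main__":
    for item, value in ligation_ledger().items():
        print(f"{item}: {value}")
```

The total of about 310 μL per well during precipitation is the reason the ligation is carried out in a 1 mL deep well plate rather than a PCR plate.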
3.6 Amplification (See Note 7)
1. Dissolve PCR primers PPA and PPB in 1× TE pH 7.5 to have
a stock concentration of 1 nmol/μL. Dilute 1:100 in 1× TE
pH 7.5 to have a working solution of 10 pmol/μL.
2. Make a PCR master mix for the total number of libraries that
need to be amplified (see Note 8), per reaction: 40.3 μL of
deionized sterile water, 5 μL of 10× AccuPrime™ High Fidelity
Buffer I (see Note 9), 1 μL of PPA 10 pmol/μL, 1 μL of PPB
10 pmol/μL, 0.2 μL of AccuPrime™ Taq DNA polymerase
High Fidelity 5 U/μL, and 2.5 μL of GBS_BamHI library
plate (corresponding well).
3. Aliquot 47.5 μL of the PCR master mix in each well of a new
PCR plate labeled as GBS_BamHI amplified library plate.
4. Add 2.5 μL of each GBS_BamHI library plate well to its cor-
responding location in the GBS_BamHI amplified library plate.
Seal the plate and mix gently by vortexing. Spin down the plate
briefly.
5. Run the following PCR profile: 72 °C, 5 min → 94 °C,
1 min → (94 °C, 30 s → 65 °C, 30 s → 68 °C, 30 s) × 25
cycles → 68 °C, 5 min → stop, leave at room temperature.
6. Run 15 μL of each GBS_BamHI amplified library per lane in a
2 % agarose gel (see Note 10), in 1× TAE buffer, to confirm
that every library was successfully prepared and amplified.
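The amplification recipe and cycling profile above lend themselves to a quick consistency check. The sketch below scales the master mix (everything except the 2.5 μL of library), verifies that the aliquot plus template adds up to 50 μL, and sums the programmed hold times for a given cycle number. The buffer is taken as the 10× stock listed in Materials. The 10 % overage is an assumption of mine, and the time estimate ignores ramping between temperatures.

```python
# Master mix components per reaction, in pipetting order (Subheading 3.6).
PCR_MIX_UL = [
    ("deionized sterile water", 40.3),
    ("10x AccuPrime High Fidelity Buffer I", 5.0),
    ("PPA 10 pmol/uL", 1.0),
    ("PPB 10 pmol/uL", 1.0),
    ("AccuPrime Taq High Fidelity 5 U/uL", 0.2),
]
TEMPLATE_UL = 2.5  # GBS_BamHI library, added to each well separately


def pcr_master_mix(n_reactions: int, overage: float = 0.10) -> dict:
    """Scaled volumes (uL); the 10 % overage default is an assumption."""
    factor = n_reactions * (1 + overage)
    return {name: round(vol * factor, 1) for name, vol in PCR_MIX_UL}


def aliquot_ul() -> float:
    """Master mix volume dispensed per well before adding the library."""
    return round(sum(vol for _, vol in PCR_MIX_UL), 1)


def hold_minutes(cycles: int = 25) -> float:
    """Sum of programmed holds: 72 C 5 min, 94 C 1 min, then
    cycles x (3 x 30 s), then 68 C 5 min. Ramp times are not included."""
    return 5 + 1 + cycles * 3 * 0.5 + 5


if __name__ == "__main__":
    print(pcr_master_mix(48))
    print("aliquot per well (uL):", aliquot_ul())
    for n in (18, 23, 25, 30):            # cycle numbers tested in Note 7
        print(n, "cycles:", hold_minutes(n), "min of holds")
```

Each primer is present at 10 pmol per 50 μL reaction, that is 0.2 μM, and the polymerase at 1 U per reaction.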
3.7 GBS Library Pool and Clean Up
1. To prepare the GBS_BamHI library pool: transfer 20–30 μL of
each amplified library from the GBS_BamHI amplified library
plate into one 15 mL Falcon centrifuge tube. Measure the final
volume.
2. If there is any remaining PCR reaction you may want to keep
as backup, seal the GBS_BamHI amplified library plate and
store at −20 °C.
3. Purify the GBS_BamHI library pool with QIAquick® PCR
Purification Kit with the following modifications:
(a) Add 5 volumes of PB buffer to the library pool in the
15 mL Falcon (step 1), mix thoroughly by vortexing.
Incubate for 5 min at room temperature.
(b) Load 4 QIAquick® PCR columns with the PB/library mixture.
Follow the manufacturer’s instructions, except incubate
the column with EB buffer for 10 min at room temperature before
centrifugation for a complete elution. Final total volume is
~200 μL. Store at 4 °C, in a screw-capped 1.5 mL tube.
4. Analyze the cleaned GBS_BamHI library pool fragment size
and concentration with the Bioanalyzer/High Sensitivity
DNA kit. The library concentration should be >20 ng/μL.
There should be no peaks at 129 bp (see Note 11).
The cleaned GBS_BamHI library pool is ready to be loaded in
the HiSeq2000 Illumina sequencer. The sequencing provider will
determine how much to load, so send as much sample as possible.
A stock solution at >400 nM is ideal. To calculate the molarity
assume an average amplicon size equal to 350 bp.
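The conversion between mass concentration and molarity mentioned above can be written out as follows. This sketch assumes the commonly used average of 660 g/mol per base pair of double-stranded DNA, which is not stated in the protocol, together with the 350 bp average amplicon size given in the text.

```python
BP_MASS_G_PER_MOL = 660.0  # assumption: average mass of one dsDNA base pair


def library_molarity_nm(ng_per_ul: float, mean_bp: float = 350.0) -> float:
    """Convert a dsDNA concentration in ng/uL to nM.

    1 ng/uL equals 1e-3 g/L; dividing by the fragment mass in g/mol gives
    mol/L, and multiplying by 1e9 gives nM, hence the factor of 1e6.
    """
    return ng_per_ul * 1e6 / (BP_MASS_G_PER_MOL * mean_bp)


def ng_per_ul_for(target_nm: float, mean_bp: float = 350.0) -> float:
    """Mass concentration (ng/uL) that corresponds to target_nm."""
    return target_nm * BP_MASS_G_PER_MOL * mean_bp / 1e6


if __name__ == "__main__":
    print(f"20 ng/uL at 350 bp = {library_molarity_nm(20):.1f} nM")
    print(f"400 nM at 350 bp = {ng_per_ul_for(400):.1f} ng/uL")
```

Under these assumptions, the 20 ng/μL minimum corresponds to roughly 87 nM, and a 400 nM stock to roughly 92 ng/μL, so a pool that just meets the minimum concentration is well short of the ideal stock.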
4 Notes
1. The protocol detailed here is based on a pilot study of a bin
mapping set of an Actinidia (kiwifruit) population formed of
46 progeny and 2 parents, using the restriction enzyme
BamHI. The volumes specified are per sample; the appropriate
master mixes should be prepared for the number of samples to
be processed in each particular experiment. The DNA concentrations
were estimated by spectrophotometry (A260 nm, A280 nm,
A230 nm). A sample of each DNA was analyzed by agarose gel
electrophoresis. A sample was considered optimal if no DNA
degradation was observed below 10 kbp, the A260/A280 ratio was ≥ 1.8,
and the A260/A230 ratio was > 2.0.
2. Request the minimum scale available, usually 5 nmol normalized,
purified by desalting and delivered as dry pellets. The common adaptors
should be HPLC purified and ordered at the smallest synthesis
scale. The total number of nmoles reported by the oligonucleotide
service provider was used for calculating the amount of
buffer needed for resuspension. No further oligonucleotide
quantification was performed. The resuspended oligonucleotides
can be stored for over a year at 4 °C while the experiment
is in progress. For long-term storage, keep at −20 °C.
3. It is recommended to have one dedicated plastic box for this
purpose marked to 1.5 L to standardize the annealing time.
The estimated time mentioned in the next step corresponds to
a laboratory with a room temperature around 20 °C.
4. Even though there is no cleaning or inactivation step after the
restriction digestion, the ligation will proceed without prob-
lems since the barcoded and common adaptors are designed to
avoid recreating the restriction enzyme target site. It is recom-
mended to perform the ligation at a low temperature even if the
enzyme produces sticky ends. If concerned, you may do the
ligation at 16 °C overnight, instead of 4 °C.
5. The DNA precipitated with dextran blue forms a film in the
round bottom end of each well. Since vortexing will not detach
this pellet and allow the 70 % ethanol to wash the salts out
of the DNA, let diffusion and time do this job.
6. Do not resuspend the DNA by pipetting. Let it dissolve gently
into the buffer solution overnight.
7. This is the most crucial step in the protocol. A high fidelity
enzyme is recommended for the amplification. This protocol is
based on an end-point PCR approach. An optimization step to
determine the ideal number of cycles should be performed,
testing 18, 23, 25, and 30 cycles. The expected amplicon profile should
be a smooth smear from >200 bp up to 1 kbp. The presence of
prominent bands is not desirable, but sometimes unavoidable.
These bands are due to amplification bias or to repetitive elements
in the genome which contain that particular restriction
enzyme target site. To address the first issue, a real-time PCR
approach could reduce amplification bias and ensure uniform
coverage; commercial kits are available (e.g., Kapa Biosystems).
I strongly recommend you perform an individual amplification
of each library and carry out its analysis before pooling them all
for sequencing (Subheading 3.7). This extra step will ensure
that each individual’s genomic DNA was successfully digested,
ligated, and amplified. For optimal visualization of the amplified
libraries, analyze the amplicons produced in the cycle optimization
step on the Bioanalyzer.
Prominent bands due to repetitive elements containing the
restriction enzyme site could be avoided if some sequence information
is known about these problematic elements. However,
in most cases this issue is hard to avoid, but these data points can
be removed bioinformatically at the analysis stage.
8. Pipette the PCR master mix in exactly the order listed in
Subheading 3.6, step 2, to avoid any contamination. After pipetting primer PPB,
close the tube, vortex, spin down briefly, and then add the
DNA polymerase directly into the solution, pipetting a few
times to release all the enzyme into the liquid. This practice is
recommended for any molecular biology procedure. It allows
any stabilizing agents in the reaction buffer (e.g., bovine serum
albumin, polyethylene glycol, glycerol, etc.) to coat the inside
of the plastic tube and minimize the adsorption of the enzyme
to the walls.
9. The amplification also works with 10× AccuPrime High
Fidelity Buffer II, included with the enzyme. The difference
between buffers is the DNA template used: Buffer I is opti-
mized for small DNA fragments, and Buffer II is optimal for
genomic DNA.
10. To prepare a sturdy and transparent 2 % agarose gel, use 1 g
of any standard agarose (molecular biology grade) and 1 g of
high resolution agarose to prepare 100 mL. This mixture
produces a gel that is easy to handle and also saves money on the
expensive high resolution agarose. A 3 % agarose gel produces
even better results but melting this mixture can be difficult.
To prepare 100 mL of 3 % agarose, pour 100 mL of 1× TAE
buffer into a 250 mL conical flask capped with an inverted
plastic beaker. Add 0.5 g standard agarose and 2.5 g high resolution
agarose. Microwave at 50 % power for 1 min, checking
every half minute and swirling the flask gently. Repeat until completely
melted. Pour a thin gel. For best results, stain the gel
with an ethidium bromide or SYBR Safe® (Life Technologies)
solution for 10–15 min and destain in deionized water for
5–10 min.
11. This band corresponds to the barcoded adaptor ligated to the
common adaptor and amplified by PPA and PPB, without any
DNA insert. If the ratio of genomic DNA to each of the adaptors
is kept at 1 μg to 1 pmol, there should be no empty barcoded/
common adaptor amplicons, or a negligible amount that will not
take a significant amount of sequencing reads in the data set.
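The sample acceptance criteria in Note 1 can be expressed as a simple filter, sketched below. The thresholds are those given in the note. The class and field names are hypothetical, and the degradation flag is assumed to be scored by eye from the agarose gel.

```python
from dataclasses import dataclass


@dataclass
class DnaSample:
    name: str
    a260: float
    a280: float
    a230: float
    degraded_below_10kbp: bool  # judged from the agarose gel


def passes_qc(sample: DnaSample) -> bool:
    """Apply the Note 1 criteria: no visible degradation below 10 kbp,
    A260/A280 >= 1.8, and A260/A230 > 2.0."""
    return (
        not sample.degraded_below_10kbp
        and sample.a260 / sample.a280 >= 1.8
        and sample.a260 / sample.a230 > 2.0
    )


if __name__ == "__main__":
    # Hypothetical absorbance readings for illustration only.
    samples = [
        DnaSample("parent_1", 1.00, 0.50, 0.45, False),
        DnaSample("progeny_07", 1.00, 0.50, 0.60, False),
        DnaSample("progeny_12", 1.00, 0.52, 0.45, True),
    ]
    for s in samples:
        print(s.name, "pass" if passes_qc(s) else "re-extract or clean up")
```

Flagging failed samples before digestion is cheaper than discovering a missing individual after sequencing.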
Acknowledgments
I would like to thank Lena Fraser, Lorna Barron, and Anne Gunson
(The New Zealand Institute for Plant and Food Research) for their
valuable comments and corrections to this protocol.
References
1. Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, Blaxter ML (2011) Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat Rev Genet 12:499–510
2. Turner EH, Ng SB, Nickerson DA, Shendure J (2009) Methods for genomic partitioning. Annu Rev Genomics Hum Genet 10:263–284
3. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE (2011) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:e19379. doi:10.1371/journal.pone.0019379
4. Ko W-Y, David RM, Akashi H (2003) Molecular phylogeny of the Drosophila melanogaster species subgroup. J Mol Evol 57:562–573
5. van Gurp T. www.deenabio.com/services/gbs-adapters
6. Davey JW (ed) www.wiki.ed.ac.uk/display/RADSequencing/Home
Chapter 21
Methods for the Design, Implementation, and Analysis of Illumina Infinium™ SNP Assays in Plants
David Chagné, Luca Bianco, Cindy Lawley, Diego Micheletti,
and Jeanne M.E. Jacobs
Abstract
The advent of Next-Generation sequencing-by-synthesis technologies has fuelled SNP discovery, genotyping,
and screening of populations in myriad ways for many species, including various plant species. One tech-
nique widely applied to screening a large number of SNP markers over a large number of samples is the
Illumina Infinium™ assay.
Key words Illumina Infinium™ assay, SNP discovery, SNP selection, SNP genotyping, Consortia
1 Introduction
1.1 What Are SNPs?
Single Nucleotide Polymorphisms (SNPs) are individual nucleotide
base differences between two DNA sequences. SNPs are the
most common type of known DNA variation. In principle each
nucleotide could have four different variants at any particular site;
however, in general SNPs are biallelic and can be categorized
according to the type of nucleotide substitution as either a transi-
tion (C/T or G/A) or a transversion (C/G, A/T, C/A, or T/G).
The disadvantage of biallelic markers, when compared to multial-
lelic markers such as SSRs, is compensated by the relative abun-
dance of SNPs. For example, on average one SNP is found every
29 and 288 bp in the potato and apple genomes, respectively [1, 2].
Consequently, SNPs have replaced microsatellites as the marker of
choice in plant genetics due to their potential for high multiplexing
in one reaction and ease of data analysis and interpretation.
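The transition/transversion distinction described above reduces to asking whether both alleles belong to the same chemical class (purine or pyrimidine). A minimal sketch, with an illustrative function name:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(allele1: str, allele2: str) -> str:
    """Return 'transition' or 'transversion' for a biallelic SNP."""
    a1, a2 = allele1.upper(), allele2.upper()
    alleles = {a1, a2}
    if a1 == a2 or not alleles <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a valid biallelic SNP: {allele1}/{allele2}")
    # Same class on both sides (purine/purine or pyrimidine/pyrimidine).
    same_class = alleles <= PURINES or alleles <= PYRIMIDINES
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    for pair in ("C/T", "G/A", "C/G", "A/T", "C/A", "T/G"):
        print(pair, classify_snp(*pair.split("/")))
```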
1.2 Principle of the Infinium™ Chemistry
The Infinium™ assay (Illumina, Inc) relies upon probes designed to
target a sequence immediately upstream of a target SNP. The probes
are attached to beads and deployed on a fixed glass slide format in
an average of 15× redundancy for each SNP genotype. The assay
has the ability to target 3,000–1 million SNPs for each sample in a
single experiment. The Infinium™ assay involves a single workflow
but a closer look reveals two different assays are used to target
the maximum possible SNPs of interest from a given genome. The
Infinium I assay interrogates each SNP using two allele-specific
probes on two separate bead types. The other Infinium™ chemis-
try (Infinium II) involves a single probe and bead type to query a
SNP with a base extension providing the discriminating allele
information. The 3′ end of the oligonucleotide probe is extended
by a DNA polymerase using labeled ddNTPs (single base exten-
sion). The terminating fluorescent dye corresponds to the two tar-
get alleles, which makes it possible to detect two allelic variants for
a variable site and discriminate heterozygous from homozygous
genotypes. The Infinium™ II assay uses two dyes: one dye for both
adenine and thymine, and another dye for both cytosine and guanine.
Therefore, A/T and C/G transversion SNPs, whose two alleles carry the
same dye, require two bead types (the Infinium I design) to
discriminate between the target alleles. The distinction
between one-bead type and two-bead type assays is most impor-
tant in the design phase, as targeting one bead type SNPs will opti-
mize the space on any given array format.
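The dye rule described above determines how many bead types a SNP consumes, which matters when budgeting array space. The sketch below encodes only that rule (one dye shared by A and T, another shared by C and G). It is an illustration of the logic in this section, not Illumina's design software, and the names are hypothetical.

```python
# Dye groups as described in the text: A and T share one dye, C and G the other.
DYE_GROUP = {"A": "AT", "T": "AT", "C": "CG", "G": "CG"}


def infinium_design(allele1: str, allele2: str) -> tuple:
    """Return (assay type, bead types needed) for a biallelic SNP."""
    a1, a2 = allele1.upper(), allele2.upper()
    if a1 == a2 or a1 not in DYE_GROUP or a2 not in DYE_GROUP:
        raise ValueError(f"not a valid biallelic SNP: {allele1}/{allele2}")
    if DYE_GROUP[a1] != DYE_GROUP[a2]:
        return ("Infinium II", 1)   # alleles distinguishable by dye colour
    return ("Infinium I", 2)        # same dye: two allele-specific beads


def bead_types_needed(snps: list) -> int:
    """Total bead types consumed by a list of (allele1, allele2) pairs."""
    return sum(infinium_design(a1, a2)[1] for a1, a2 in snps)


if __name__ == "__main__":
    panel = [("A", "G"), ("C", "T"), ("A", "T"), ("C", "G"), ("A", "C")]
    for a1, a2 in panel:
        print(f"{a1}/{a2}:", infinium_design(a1, a2))
    print("bead types for panel:", bead_types_needed(panel))
```

On an array with a fixed number of bead types, every A/T or C/G SNP therefore costs the space of two one-bead SNPs.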
1.3 Applications of SNPs for Plant Genetics and Examples of Infinium™ Assays for Plants
SNPs are widely used to understand evolutionary and genetic
relationships between and within species, to identify correlations
to disease status in humans, and to investigate traits of agronomic
interest in high-value livestock and crops. SNPs provide an important
source of molecular markers that are useful in genetic mapping,
map-based positional cloning, detection of marker–trait gene
associations through linkage and linkage disequilibrium mapping,
and the assessment of genetic relationships between individuals.
The low mutation rate of SNPs makes them excellent markers for
studying complex genetic traits as well as genome evolution [3].
The Infinium™ assay has been used for a range of plant species.
Myles et al. [4] have characterized genome-wide patterns of genetic
variation in several hundred cultivars of Vitis vinifera and its wild
relative V. sylvestris using the grape 9000 SNP Infinium™ array [5].
They show that V. vinifera was domesticated from V. sylvestris in
the Near East and have identified parent–offspring and sibling con-
nections, most of them first-degree relationships, between some
well-known varieties. The apple SNP array of 9,000 SNPs [2] was
used for assessing the efficiency of genomic selection for improving
fruit quality in an apple breeding program [6] and to develop a
dense SNP-based linkage map of an apple rootstock progeny [7].
In sunflower, a 10K array was developed and used for diversity
analyses [8] and for the construction of a dense genetic map based
on multiple crosses [9]. In potato (an auto-tetraploid) a 10K array
was designed based on SNPs located in candidate genes, as well as
the potato genome sequence [10]. So far, selected SNPs have
been used for studying the allelic variation in a diversity panel [10],
and for the construction of two diploid genetic linkage maps, each
with the reference potato genome sequence genotype as a parent
[11]. The maps resulted in an improved anchoring of sequence
scaffolds to the potato genome assembly.
1.4 Genotyping Budgets and Advantages of Consortia
Despite the vital importance of plants as a source of food, the use
of the Infinium™ technique for plants has lagged behind its application
for human and major livestock species (e.g., chicken, pig,
cattle, and sheep). The research communities working on plant
species tend to be small and fragmented and not as well resourced
as the human and animal research communities. Nevertheless,
the demand for high-throughput genotyping is high. The major
contributor to recent growth in the ability of plant geneticists to
use the Infinium™ technique has been the development of world-
scale research consortia to enable a concerted design of Infinium™
assays. Consortia offer an opportunity for a research community to
drive the development of an SNP array while sample contributors
drive wide adoption and validation of common SNP content per-
ceived by the community as needed to capture the genome of
interest. Researchers often have overlapping goals that are best
addressed by a combined effort for the development of a single
tool or set of common tools where economies of scale can be lever-
aged. Tool development may include a strategy of SNP selection
that targets haplotype blocks (i.e., “tag SNPs” or tSNPs), even
coverage across the genome or specific gene-rich regions or a combination
of these in a single SNP array tool. Tools that meet multiple
needs often contain at least several thousand markers
[2, 12, 13], although in some cases marker sets as small as
384 SNPs have been widely adopted for targeted common purposes
(e.g., crosses between rice lines Oryza indica vs. O. japonica)
[14], as described in the GoldenGate chapter in this volume. Where
it is undesirable for content to be shared (e.g., by commercial
partners), proprietary marker content can be added to the
base content of a genotyping BeadChip to make a custom version
for that partner alone.
1.5 SNP Discovery and SNP Selection for Infinium™ Assay
SNP detection using next-generation sequencing platforms gives
access to the variation of a species, either for one selected individual
or entire core collections, germplasm sets, breeding lines, or the
diversity set across a species’ natural range. Nevertheless, querying
all SNPs detected in the genome is prohibitive and unnecessary
(provides redundant information), necessitating a strategy for SNP
selection for assay design. For example, a genetic map experiment
would only require a few thousand markers, which is far fewer than
the millions that may be detected originally. Ideally the SNPs to be
selected for a high-throughput assay can be validated to ensure an
optimum conversion rate to polymorphic markers. This is unrealistic
for many species of plants for which no validation dataset is
available. A few tricks can be used to work around this during SNP
discovery. For example, using pedigreed populations for the origi-
nal sequencing for SNP detection enables sorting of true and false
SNPs by looking at their segregation patterns in the population.
This increases the confidence for each SNP converting to a poly-
morphic assay, which is a useful parameter to track and use when
finalizing SNP selection. Other SNP selection criteria can include
focus on specific SNP sites based on their location in the genome
(evenly distributed or in clusters), proximity or affiliation with
gene coding regions, or SNP type (Infinium I or Infinium II).
SNP detection based on whole genome resequencing data can
be done by calling genotypes from pools of different individuals, as
done in the case of the RosBREED apple 9K [2] and RosBREED
peach 9K [15] SNP arrays. Alternatively, genotypes can be obtained
from separate individuals (if high coverage is available for each
individual) followed by merging all the calls at the end, as demon-
strated in the case of the FruitBreedomics apple 20K SNP array
(www.fruitbreedomics.com).
Economical methods of SNP discovery include use of reduced
representation libraries (RRLs) obtained by enzyme digestion of
DNA to increase the local coverage, as in the case of the grapevine
9K array [4], or focusing only on the coding portion of a genome
by sequencing normalized cDNA, as reported in the development
of the SolCAP tomato [16] and potato [10] 8K SNP arrays.
Knowing the specific features of the reference genome is quite
important when identifying the markers to consider for the array.
In particular, SNPs from paralogous regions and repetitive elements
should be avoided as the signal produced from the chip would
most probably be affected by interference produced by those
regions. This is especially an issue for highly heterozygous or poly-
ploid species. For example, for the genome of apple which went
through a whole genome duplication [17], the Infinium™ II 20K
SNP array design included resequencing data of two doubled-
haploid accessions obtained from ‘Golden Delicious’, which is the
cultivar used for the reference genome sequence [17]. The use of
doubled-haploids resulted in the exclusion of SNPs showing a
“heterozygous behavior”, indicating multiple loci within either of
the two doubled-haploids.
In general, SNP selection is the step in the assay design that
requires the most intense and thoughtful input from the user.
Once SNP discovery is complete, a typical SNP selection pipeline
ideally includes (depending upon the status of the reference
genome): Chromosome and coordinate map information, genetic
marker position, estimated minor allele frequencies in a discovery
panel, distance from the target SNP to the closest known adjacent
polymorphism on either side of the SNP, 50 bases of flanking
sequence on both sides of the SNP, target SNP alleles with referenced
strand (e.g., TOP/BOT or FOR/REV), estimate of conversion
rate or “SNP_Score” (available through an online tool like Illumina’s
Assay Design Tool (ADT) http://support.illumina.com/array/
array_software/assay_design_tool.ilmn), and ILMN_ID (where
previously validated designs are included in a new design). By
pooling designs using ILMN_ID (found in the csv version of the
bead pool manifest *.bpm), a score file can be obtained that con-
tains the exact validated probe sequence from previous designs.
This way its corresponding forward strand designation can be pre-
served in a future design. SNP scores obtained via ADT can be used
to make the decision on final SNP selection. (The manufacturer’s
recommendation (Illumina, Inc) is to prioritize designs above a
threshold of 0.6.)
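The flanking-sequence, adjacent-polymorphism, and design-score items listed above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the record fields, the toy reference sequence, and the bracketed allele notation (e.g., [A/G]) are assumptions made for the example and should be checked against the input format that ADT currently expects.

```python
FLANK = 50           # bases of flanking sequence on each side of the SNP
MIN_ADT_SCORE = 0.6  # design-score threshold recommended by the manufacturer


def design_sequence(reference, snp_pos, alleles, flank=FLANK):
    """Return left flank + [allele1/allele2] + right flank for one SNP.

    reference: reference sequence as a string (hypothetical input)
    snp_pos:   0-based position of the target SNP in `reference`
    alleles:   the two target alleles, e.g. ("A", "G")
    """
    if snp_pos < flank or snp_pos + flank >= len(reference):
        raise ValueError("not enough flanking sequence around the SNP")
    left = reference[snp_pos - flank:snp_pos]
    right = reference[snp_pos + 1:snp_pos + 1 + flank]
    return f"{left}[{alleles[0]}/{alleles[1]}]{right}"


def nearest_polymorphism(snp_pos, known_positions):
    """Distance in bases to the closest other known polymorphism, or None."""
    distances = [abs(p - snp_pos) for p in known_positions if p != snp_pos]
    return min(distances) if distances else None


def passes_score(adt_score, threshold=MIN_ADT_SCORE):
    """True when a design score is at or above the prioritization threshold."""
    return adt_score >= threshold


# Toy example: a 120-base reference with the target SNP at position 60.
reference = "ACGT" * 30
record = {
    "sequence": design_sequence(reference, 60, ("A", "G")),
    "nearest_polymorphism": nearest_polymorphism(60, [40, 60, 75]),
}
```

Keeping the distance to the nearest polymorphism in the same record as the design sequence makes it easy to sort or filter candidate SNPs on both criteria at selection time.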
The bioinformatic analysis supporting SNP chip designs
involves the following main steps:
1. SNP calling. Several software packages have been developed to
map resequencing reads to reference genomes to call putative
SNPs. For the apple 20K array a two-tier approach based on
GEM [18] and BFAST [19] in combination with SAMtools
[20] and VCFtools [21] was used. The grapevine 9K array [5]
design used Illumina’s ELAND. SOAP/SOAPsnp [22] was
used for the apple and pear 9K [2], and the cherry 6K [13].
Both SOAP/SOAPsnp and CLC Bio’s CLC Genomics
Workbench were used for the peach 9K array [15]. MOSAIK
aligner (Michael Stromberg, Boston College) and Maq (http://
maq.sourceforge.net/) have been used for developing the sun-
flower 10K [8] and potato 8K [10] SNP arrays, respectively.
Repetitive or paralogous regions can be removed from the
analysis by filtering out multiple-mapping reads or by applying
specific software filtering like those found in cross_match
( http://www.phrap.org/phredphrapconsed.html#block_
phrap), as used in development of the maize 50K array [23].
2. Quality control SNP filtering. This step removes SNPs that are
of low sequence quality, SNP loci with read depths that are too
high (may be in duplicated regions) or too low (may be too
low coverage to reliably identify a variant), and SNPs that,
based upon representation in the samples included in the SNP
discovery data, might be a lower priority due to being low in
minor allele frequency or present in only one line. Many groups
have had success with an iterative approach to SNP filtering of
sequence data. Exact parameters should be empirically deter-
mined based upon the then-current sequencing coverage and
quality; however, a strict, semistrict, and lenient set of filtering
criteria can allow for the identification of SNPs with some
expectation of their relative validation rates. For example, a first
pass filter of sequence data might include requisite minimum
and maximum coverage, a minimum sequencing (phred-like)
Q score at the SNP base (e.g., Q = 30), only two alleles per
targeted SNP, and an observation of each of the two SNP alleles
at least twice in different individuals/samples (not including
the reference) from the sequence data. Less strict filtering
might relax the criteria to a minimum Q score of 20 at the SNP
base, identification of each of the two alleles twice but count-
ing the reference. A lenient criterion might be to have a mini-
mum Q score (e.g., Q = 20) but to count any variant that
appears different from the reference. By doing an iterative fil-
tering, one can identify the maximum number of SNPs but
prioritize the placement of those SNPs by using membership
in each of these pools (strict, less strict, and lenient) within the
SNP selection process.
3. Final selection. Choosing final SNPs for an iSelect panel requires
a balance between prioritizing the highest quality SNPs (most
likely to be polymorphic in desired lines) and optimizing the
usage of the number of bead types supported by the array.
A uniform distribution of SNPs across chromosomes, with
enrichment around some focal points, might be pursued, as in
the case of the apple 20K and 9K arrays. Conversely, only SNPs
in coding portions of the
genome might be selected for the array as in the case of sun-
flower, tomato, potato, cherry, and peach arrays. Hybrid
approaches are also possible, as shown for maize where the pri-
ority was given to SNPs located within genes but entries with
less optimal ADT design scores were added to obtain a relatively
even marker distribution across chromosomes while hitting the
50K target [23]. To prioritize maximum number of SNPs for a
given number of attempted bead types, it is a good idea to pref-
erentially select assays that use the Infinium II method of geno-
typing. Infinium II is a single-bead per SNP method of query
whereas Infinium I is a two-bead per SNP method of query.
Both methods of query use the same chemistry, however priori-
tizing the Infinium II designs is a consideration in the SNP
selection process since manufacturing costs are based upon
number of attempted bead types. Since Infinium II uses the
same color channel for the A nucleotide as the T nucleotide,
and uses a different color for both the C and the G nucleo-
tides, two beads (Infinium I assays) are required to target A/T
or C/G SNPs. As a result, if one prioritizes Infinium™ II SNPs
(information available in the score file output from the ADT
design portal), one can maximize content for a given number of
attempted bead types. An additional criterion that is useful to
track for SNP selection is distance from the targeted SNP to the
nearest adjacent polymorphism. ADT will not design over an
adjacent polymorphism identified with an “N” or other IUPAC
code within the flanking sequence. Empirical data indicate that
hybridization is unlikely to be affected, especially if an adjacent
polymorphism is at least 10 bases away from the target SNP.
Therefore, SNPs with adjacent
polymorphisms that are a high priority for design can still be
included, but would need an unambiguous flanking sequence
(e.g., inserting the major allele for the adjacent polymorphism)
for ADT to assign a design score. Finally, SNPs with ADT scores
below 0.6 should be deprioritized for design as they have a lower
likelihood of being successful based upon the melting tempera-
ture, possibility to create a hairpin reaction or other criteria con-
sidered in the proprietary calculation of ADT design score.
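The iterative filtering in step 2 and the bead-type accounting in step 3 can be sketched as follows. This is a hedged illustration, not the pipeline used for any of the arrays cited: the CandidateSNP fields and the coverage limits are hypothetical, and applying the coverage and two-allele checks to every tier is an assumption. The Q-score cutoffs, the allele-observation rules, and the rule that A/T and C/G SNPs need two bead types (Infinium I) follow the text above.

```python
from dataclasses import dataclass


@dataclass
class CandidateSNP:
    name: str
    alleles: tuple       # e.g. ("A", "G")
    base_q: int          # phred-like Q score at the SNP base
    depth: int           # read depth at the SNP
    obs_excl_ref: tuple  # samples showing each allele, reference not counted
    obs_incl_ref: tuple  # samples showing each allele, reference counted


def filter_tier(snp, min_depth=8, max_depth=100):
    """Assign a SNP to the strict, less strict, or lenient pool."""
    if len(snp.alleles) != 2 or not (min_depth <= snp.depth <= max_depth):
        return "rejected"
    if snp.base_q >= 30 and min(snp.obs_excl_ref) >= 2:
        return "strict"
    if snp.base_q >= 20 and min(snp.obs_incl_ref) >= 2:
        return "less strict"
    if snp.base_q >= 20:
        return "lenient"
    return "rejected"


def bead_types(alleles):
    """A/T and C/G SNPs need two bead types (Infinium I); others need one."""
    return 2 if set(alleles) in ({"A", "T"}, {"C", "G"}) else 1


candidates = [
    CandidateSNP("snp1", ("A", "G"), 35, 20, (3, 2), (3, 3)),
    CandidateSNP("snp2", ("A", "T"), 25, 20, (1, 4), (2, 4)),
    CandidateSNP("snp3", ("C", "T"), 22, 20, (0, 1), (1, 1)),
    CandidateSNP("snp4", ("A", "C"), 40, 500, (5, 5), (5, 6)),
]
tiers = {snp.name: filter_tier(snp) for snp in candidates}
beads_needed = sum(bead_types(s.alleles) for s in candidates
                   if tiers[s.name] != "rejected")
```

Recording the tier alongside the number of bead types lets the final selection fill the array from the strict pool first while keeping track of the bead budget.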
2 Materials
2.1 Reagents
1. Illumina reagents are supplied in the correct amounts for
the ordered assay (Table 1).
2. Genomic DNA (see Notes 1 and 2).
3. 0.1 N NaOH: Dissolve 4 g of NaOH in 1 L water.
4. 100 % 2-propanol.
5. 100 % ethanol.
6. 95 % formamide, 1 mM EDTA. Store at −20 °C.
7. 10 mM Tris–HCl, pH 8.5.

Table 1
Illumina supplied reagents for the Infinium assay
Item  Description                                           Part#
ATM   Anti Stain Two-Color Master Mix                       11208317
FMS   Fragmentation solution                                11203428
MA1   Multisample Amplification 1 Mix                       11202880
MA2   Multisample Amplification 2 Mix                       11203401
MSM   Multisample Amplification Master Mix                  11203410
PB1   Reagent used to prepare BeadChips for hybridization   11191922
PB2   Humidifying buffer used during hybridization          11191130
PM1   Precipitation solution                                11203436
RA1   Resuspension, hybridization, and wash solution        11222442
STM   Superior Two-Color Master Mix                         11288046
TEM   Two-Color Extension Master Mix                        11208309
XC1   Xstain BeadChip solution 1                            11208288

2.2 Equipment
1. Qubit® Fluorometer (Invitrogen, CA, USA).

2. Qubit® dsDNA BR assay kit (Invitrogen, CA, USA).
3. GoldenGate Satellite Kit (Cat# BG-10-105). Contents:
(a) 11140324—SHAKER.MICROPLT,HS,230V x1.
(b) 179477—FASTNER.LOOP, ADH,HIGH TEMP x108.
(c) 179485—FASTNER.HOOK, NYLON x 18.
(d) SE-901-1002—ILLUMINA Hybridisation Oven (220V) x1.
(e) 175724—HYBEX.220V,w,MICROTUBE BLOCK x2.
4. 96-well 0.8 ml microtiter plate.
5. Multichannel pipettes.
6. Cap mat.
7. Large centrifuge capable of accommodating plates.
8. Foil seal.
9. Heat sealer.
10. Heat sealer, combi heat sealing unit.
11. Adapter plate, combi heat sealing unit (96-Well PCR Plate
Carrier).
12. Heat-sealing foil sheets, Thermo-Seal.
13. BeadChip Wash Rack and Glass Tray.
14. Infinium™ Hybridization Chamber and Gasket ×1.
15. Te-Flow Flow-Through Chambers—four per plate.
16. Wash Dish ×2.
17. Wash Rack.
18. Multisample BeadChip Alignment Fixture.
19. Water Circulator.
20. Flow-Through Chamber with Illumina temperature probe.
21. Vacuum desiccator.
22. Self-locking tweezers.
23. Staining rack.
24. Wash dishes.
25. BeadChips.
26. HiScan machine.
3 Methods
3.1 Infinium™ Assay Protocol
While this protocol is written from the perspective of assaying 96
samples using Infinium™ chips, which hold 24 samples each, other
combinations are possible and can be easily accommodated into
the protocol. The 96-well/24-sample chip format provides the
highest throughput possible and currently allows up to 90,000
SNPs to be queried simultaneously in 12× redundancy for over 99 %
call rates on validated SNP assays.
Unless stated, all centrifugation and vortexing steps are for
1 min.
1. Quantitate samples using the Qubit dsDNA BR assay. Normalize
all samples in a 96-well PCR plate to 50 ng/μl by adding
Tris–HCl 10 mM, pH 8.5 (see Notes 1–3).
2. Dispense 20 μl of MA1, followed by 4 μl of DNA sample, and
then 4 μl of 0.1 N NaOH into each well of the 0.8 ml plate and seal
with a cap mat.
3. Vortex at 1,600 rpm and centrifuge at 280 × g, then incubate at
room temperature for 10 min.
4. Dispense 34 μl of MA2 and 38 μl of MSM into each well,
before resealing for vortexing and centrifuging as in step 3.
5. Incubate resealed plate in a 37 °C oven for 20–24 h.
6. Before opening the plate, centrifuge briefly at 50 × g to ensure
all liquid is in the bottom of the wells.
7. Add 25 μl of FMS to each well, reseal and vortex as before.
Centrifuge briefly again at 50 × g to ensure all liquid is in the
bottom of the wells.
8. Incubate on a heating block at 37 °C for 1 h.
9. Add 50 μl of PM1 to each well, seal and vortex as before.
10. Incubate for a further 5 min on the 37 °C heating block.
Centrifuge briefly at 50 × g to ensure all liquid is in the bottom
of the wells.
11. Add 155 μl of 2-propanol to each well, then seal plate with a
second, fresh cap mat.
12. Mix by inverting the plate at least ten times, then incubate at
4 °C for 30 min.
13. Prepare a balance plate before centrifuging at 2,000 × g and 4 °C
for 20 min. This should produce pale blue pellets in the bottom
of the wells (see Note 4).
14. Immediately decant supernatant by smoothly and rapidly
inverting the plate onto an absorbent pad prepared on the bench.
Remove all liquid by tapping the plate firmly for 1 min on the pad.
Ensure the pellets are completely dry by leaving the plate
inverted at room temperature for 1 h.
15. Resuspend pellets in 23 μl of RA1, then seal with a foil seal
using a heat sealer.
16. Incubate in a 48 °C oven for 1 h.
17. Vortex plate at 1,800 rpm then centrifuge briefly at 280 × g to
ensure all liquid is in the bottom of the wells.
18. Prepare and assemble the Hyb chamber as recommended in
the Infinium™ user manual, including adding 400 μl of PB2 to
each of the eight reservoirs.
19. Denature samples by incubating on a 95 °C heating block for
20 min. A further 30-min incubation at room temperature is
then followed by briefly centrifuging at 280 × g to ensure all
liquid is in the bottom of the wells.
20. Prepare four BeadChips by unpackaging and placing in Hyb
Chamber inserts.
21. Dispense 12 μl of each sample onto the inlet ports along each
side of the BeadChips, ensuring that each sample flows to cover
the entire bead stripe.
22. Position the BeadChips in the Hyb Chamber and replace the lid.
Correctly orientate the chamber in the oven and incubate for
16–24 h at 48 °C, with the rocker at setting 5.
23. Prepare XC4 reagent for the following day by adding 330 ml of
100 % ethanol and shaking vigorously to mix. Leave at room
temperature until needed.
24. Before opening the chamber, allow to cool on the bench for
25 min.
25. Cleanly remove the IntelliHyb seals from the BeadChips, one at
a time, before sliding into the prepared wash rack submerged in
the dish containing PB1. It is important that the chips should not
be allowed to dry out before the Flow-Through Chamber is
assembled.
26. Wash the chips by gentle agitation of the wash rack for 1 min
before transferring to a second wash dish containing PB1.
Repeat the 1-min agitation wash step.
27. Prepare Multisample BeadChip Alignment Fixture containing
PB1. Transfer the BeadChips to the Alignment Fixture and
assemble Flow-Through Chamber comprising the clear spac-
ers, glass back plates, and metal clamps, using the Alignment
Bar to correctly align components. Trim excess spacer.
28. Prepare Chamber Rack connected to the Water Circulator so
that the temperature is 44 °C, calibrated with an Illumina®
TeFlow Thermometer Assembly. Place Flow-Through Chamber
assemblies into Chamber rack once the desired temperature
is reached.
29. Perform the Single-Base Extension section of the protocol
without interruption by dispensing the following reagents into
the reservoir of each chamber assembly:
(a) 150 μl of RA1. Incubate for 30 s. Repeat five times.
(b) 450 μl of XC1. Incubate for 10 min.
(c) 450 μl of XC2. Incubate for 10 min.
(d) 200 μl of TEM. Incubate for 15 min.
(e) 450 μl of 95 % formamide/1 mM EDTA. Incubate for
1 min. Repeat once.
(f) Incubate for 5 min.
(g) Begin ramping the chamber rack temperature to the tem-
perature indicated on the STM tube, or to 37 °C if none is
shown.
(h) 450 μl of XC3. Incubate for 1 min. Repeat once.
(i) Wait for the chamber rack to reach the desired temperature
before continuing.
30. Once the second temperature has been reached, continue with
the staining section of the protocol by dispensing the following
reagents into the reservoir of each chamber assembly:
(a) 250 μl of STM and incubate for 10 min.
(b) 450 μl of XC3 and incubate for 1 min. Repeat once, and
then wait 5 min.
(c) 250 μl of ATM and incubate for 10 min.
(d) 450 μl of XC3 and incubate for 1 min. Repeat once, and
then wait 5 min.
(e) 250 μl of STM and incubate for 10 min.
(f) 450 μl of XC3 and incubate for 1 min. Repeat once, and
then wait 5 min.
(g) 250 μl of ATM and incubate for 10 min.
(h) 450 μl of XC3 and incubate for 1 min. Repeat once, and
then wait 5 min.
(i) 250 μl of STM and incubate for 10 min.
(j) 450 μl of XC3 and incubate for 1 min. Repeat once, and then
wait 5 min.
(k) Move the chamber assemblies to the lab bench and place
horizontally.
31. Carefully disassemble the chamber assemblies one at a time and
place the BeadChips in the prepared staining rack submerged
in the wash dish containing PB1. Perform staining by moving
the rack up and down ten times, then leave to incubate for a
further 5 min.
32. Transfer to a second wash dish containing freshly poured XC4.
Repeat the staining process.
33. Remove the staining rack to a tube rack in one smooth, rapid
motion and use self-locking tweezers to slide each BeadChip
from the staining rack to the tube rack.
34. Place the entire tube rack in a vacuum desiccator and start the
vacuum, using at least 508 mm Hg. Dry under vacuum for
50–55 min.
35. Image BeadChips on HiScan system.
36. Import data to GenomeStudio software.
37. Analyze results.
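The normalization in step 1 of the protocol above is a simple dilution (C1V1 = C2V2) and can be sketched in Python. The 50 ng/μl target and the 4 μl of DNA per reaction come from the protocol; the 20 μl final volume and the function name are assumptions chosen for illustration.

```python
TARGET_NG_PER_UL = 50.0    # normalization target from step 1
DNA_PER_REACTION_UL = 4.0  # DNA volume dispensed per well in step 2


def normalization_volumes(stock_ng_per_ul, final_volume_ul=20.0,
                          target=TARGET_NG_PER_UL):
    """Return (DNA volume, Tris-HCl volume) in microliters.

    Returns None when the stock is already below the target concentration
    and therefore cannot be normalized by dilution.
    """
    if stock_ng_per_ul < target:
        return None
    dna_ul = target * final_volume_ul / stock_ng_per_ul
    return round(dna_ul, 2), round(final_volume_ul - dna_ul, 2)


# DNA mass entering each reaction at the target concentration (ng).
input_ng = DNA_PER_REACTION_UL * TARGET_NG_PER_UL

# Hypothetical Qubit readings (ng/μl) for three samples.
readings = {"sample_1": 200.0, "sample_2": 50.0, "sample_3": 30.0}
plan = {name: normalization_volumes(conc) for name, conc in readings.items()}
```

Samples that return None are too dilute to normalize by dilution; Note 3 discusses how far below 50 ng/μl a sample can be before it should be re-extracted or concentrated.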
3.2 Infinium™ Assay Downstream Analysis
Each Infinium™ bead array is hybridized with one DNA sample.
The raw data from an Infinium™ assay consist of fluorescence
intensity in two colors with an average of 15 beads of each bead
type (for Infinium II SNPs) carrying the information of one SNP
locus. The raw data are filtered within the iScan software so that
aberrant outliers, if present, are removed prior to using the remaining
data to identify the correct genotype call for that bead type and
its targeted SNP. The overall dataset contains one such data point
for every SNP in every individual sample in the analysis. Such data
cannot be analyzed manually and
require specialized software such as GenomeStudio to extract and
transform the data into a meaningful and analyzable format. After
a BeadChip is scanned, the data are imported into GenomeStudio
Software for analysis. Input and output files for GenomeStudio are
shown in Fig. 1. Most importantly, as Infinium™ assays have at
least 3,000 markers run simultaneously, the data analysis is a step
change from simplex marker systems, where a lot of emphasis and
attention used to be devoted to troubleshooting every single data
point. A systematic approach must be employed to automate the
analysis as much as possible, which involves using quality metrics to
separate the good from the ambiguous data. Some analysis may be
done manually for data points that are ambiguous if these are
viewed as essential. These can be identified using quality metrics,
although often these loci are so few that they can be excluded to
avoid manual work as much as possible.
In addition to GenomeStudio’s GenCall, a number of algo-
rithms were developed to process the raw signal of the BeadArray
into genotype calls. The three most widely applicable are Illuminus
[24], GenoSNP [25], and CRLMM [26, 27]. The main modeling
differences lie in the normalization method and clustering that can
occur either within sample (GenCall, Illuminus, GenoSNP) or
both within and between samples (CRLMM). In plants most pub-
lications use the GenomeStudio’s proprietary GenCall method.
Fig. 1 Inputs and outputs for GenomeStudio’s Genotyping Module. Two different types of file can be used for
this process: Intensity data files (*.idat) or Genotype Call Files (*.gtc). An optional input into GenomeStudio that
can be generated from the Instrument Control Software is the *.gtc format. The *.gtc format consolidates
information from *.bpm, *.csv, *.idat, *.egt for faster uploading of data into GenomeStudio. During *.gtc file
generation, signal intensity data from *.idat files are combined with information about SNP content on the
array from the bead pool manifest file (*.bpm) and cluster reference information for each locus (*.egt). Outputs
depend upon downstream analysis tool requirements

Fig. 2 Poorly performing samples (encircled) are obvious outliers from the population
of samples when 10 % GC Score (or 50 % GC Score in the case of more
raw data) is plotted against sample call rate

Initial steps for data analysis within GenomeStudio involve a
preliminary sample quality evaluation to determine which samples
may require reprocessing or removal. If a custom cluster file (*.egt)
is required, clustering should be done after removal of failed or
suboptimal samples. Because GenomeStudio is a population-based
genotyping software package, the quickest way to identify problematic
samples is to identify outliers relative to the population performance
using various quality metrics. The key metric for sample
quality is the GenCall score. This score indicates the reliability of
the genotypes called and can range from 0.0 to 1.0. GenCall scores
are calculated using information from the sample clustering algorithm.
Each SNP is evaluated based on the angle of the clusters,
dispersion of the clusters, overlap between clusters, and intensity.
Genotypes with lower GenCall scores are located furthest from the
center of a cluster and have lower reliability. There is no global
interpretation of a GenCall score as it depends on the clustering of
samples at each SNP, which is affected by many different variables,
including the quality of the samples and loci. A good starting point
is to analyze the Infinium™ data with a default no-call threshold of
0.15. A no-call threshold of 0.15 means that genotypes with a
GenCall score lower than 0.15 are not assigned genotypes because
they are considered to be too far from the center of the cluster to
make a reliable genotype call. No-calls on successful DNA samples
at successful loci contribute to lowering the call rate for the overall
project. The standard 0.15 threshold for Infinium™ data was
determined empirically using projects with trio and replicate
information to optimize call rate without compromising reproducibility
or Mendelian consistency. Another way to remove poor-quality
samples within the standard diploid genotyping algorithm
is to use line graph functionality within GenomeStudio to view
the 50th Percentile GenCall Score (50 % GC Score) or 10th
Percentile GC (10 % GC Score) against the call rate for all samples
in the project (Fig. 2).
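The sample-level metrics described above can be illustrated with a short Python sketch that works on exported GenCall scores. It is not GenomeStudio's own calculation: the nearest-rank percentile over all scores, the 0.9 call-rate cutoff used to flag samples, and the toy scores are assumptions made for the example. Only the 0.15 no-call threshold comes from the text.

```python
NO_CALL_THRESHOLD = 0.15  # default no-call threshold for Infinium data


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (pct is an integer between 1 and 100)."""
    ranked = sorted(values)
    rank = max(1, (pct * len(ranked) + 99) // 100)  # ceil(pct * n / 100)
    return ranked[rank - 1]


def sample_metrics(gc_scores, no_call_threshold=NO_CALL_THRESHOLD):
    """Call rate plus 10th and 50th percentile GenCall score for one sample."""
    called = sum(1 for s in gc_scores if s >= no_call_threshold)
    return {
        "call_rate": called / len(gc_scores),
        "gc10": percentile_nearest_rank(gc_scores, 10),
        "gc50": percentile_nearest_rank(gc_scores, 50),
    }


# Hypothetical GenCall scores for two samples across ten SNPs.
scores = {
    "sample_1": [0.9, 0.85, 0.8, 0.95, 0.7, 0.88, 0.92, 0.75, 0.6, 0.83],
    "sample_2": [0.1, 0.05, 0.3, 0.12, 0.4, 0.02, 0.2, 0.14, 0.5, 0.08],
}
metrics = {name: sample_metrics(s) for name, s in scores.items()}

# Flag samples whose call rate falls below an illustrative cutoff of 0.9.
flagged = [name for name, m in metrics.items() if m["call_rate"] < 0.9]
```

Plotting gc10 against call_rate for every sample reproduces the kind of outlier view shown in Fig. 2, with poorly performing samples falling away from the main group.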
Once poorly performing (low quantifying) samples are excluded,
it is necessary to rebuild the clusters before starting the SNP quality
checking. For projects where a standard or community developed
cluster file (*.egt) is available, this is the best starting point for call-
ing genotypes within GenomeStudio. However in some situations,
sample intensities might not overlay perfectly onto the standard
cluster positions. This is especially true when the analyzed datasets
are phylogenetically distant to the dataset used to build the cluster
file. Reclustering some or all SNPs can optimize GenomeStudio’s
ability to call genotypes and results in higher overall call rates. All or
a subset of loci can be reclustered to generate a custom cluster file.
An important consideration in the decision to recluster is that the
GenomeStudio clustering algorithms require a minimum of about
100 samples to predict reliable cluster positions in a diploid genome.
Therefore, projects with fewer than 100 unique samples would be best
served by reliance on the standard or community developed cluster
file for calling genotypes as a starting point.
Some metrics are useful for filtering SNPs in a GenomeStudio
project. The GenTrain score reflects the shapes of the clusters and
their relative distance to each other. The Cluster Sep score indicates
the separation between clusters. The call frequency (Call Freq)
of a SNP corresponds to the number of samples with a genotype
call divided by the total number of samples (calls plus no-calls).
In plants the published studies using Infinium™ SNP arrays
use slightly different thresholds of 10 % GC Score, 50 % GC Score,
and GenTrain score to filter the SNPs. For the 10 % GC Score,
thresholds of 0.15 and 0.2 were reported in Vitis [5], peach [15],
and apple [7]. On the other hand, the first evaluation of the apple
9K array [2] reported a threshold of 0.5 for the 50 % GC Score. In
almost all the publications the SNPs were filtered using a GenTrain
score between 0.4 and 0.6 [2, 4, 7, 15]. However, the cited studies
performed manual checks of the automatic calls made by
GenomeStudio. When the genotypic clusters are too close to one
another and the polymorphisms cannot be scored reliably using
the automated allele calling, manual scoring of the polymorphic
loci is required. For example, in sunflower [9] approximately 30 %
of the SNPs were manually scored to maximize the number of
genotypes returned.
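A threshold-based SNP filter of the kind reported in these studies can be sketched as below. The cutoffs are taken from the ranges quoted above (GenTrain 0.4, 10 % GC Score 0.2), but the choice of these particular values, the function name, and the example SNP metrics are illustrative assumptions rather than recommended settings.

```python
GENTRAIN_MIN = 0.4  # lower end of the 0.4 to 0.6 range reported above
GC10_MIN = 0.2      # one of the reported 10 % GC Score thresholds


def classify_snp(gentrain, gc10, gentrain_min=GENTRAIN_MIN, gc10_min=GC10_MIN):
    """Return 'keep' when both metrics pass, otherwise 'manual check'."""
    if gentrain >= gentrain_min and gc10 >= gc10_min:
        return "keep"
    return "manual check"


# Hypothetical per-SNP metrics exported from a genotyping project.
snps = {
    "snp_A": {"gentrain": 0.82, "gc10": 0.65},
    "snp_B": {"gentrain": 0.35, "gc10": 0.40},
    "snp_C": {"gentrain": 0.55, "gc10": 0.10},
}
decisions = {name: classify_snp(m["gentrain"], m["gc10"])
             for name, m in snps.items()}
to_review = sorted(name for name, d in decisions.items() if d == "manual check")
```

SNPs that fail either threshold are routed to manual inspection rather than discarded, mirroring the manual checks performed in the cited studies.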
3.3 Calling Clusters for Polyploid Genomes
A large number of cultivated plant species are polyploid, such as
potato (tetraploid; see Note 5), wheat (hexaploid), and strawberry
(octoploid). The expected segregation for SNPs using the
Infinium™ technique is therefore more complex and will exhibit
more than the three clusters (AA, AB, and BB) typical of diploid
species. Methods adapted for polyploidy in GenomeStudio software
include an algorithm for automated calling of clusters represented
by polyploid genomes. The automated clustering algorithms
start from an estimated density distribution and are able to detect
meaningful clusters in data with varying density, which is common
in genotyping data. Sensitivity of cluster detection can be adjusted
at the project level by specifying the minimum number of points
required to define a cluster and the minimum cluster distance.
The X-Y coordinates for cluster positions can be exported from
GenomeStudio for downstream data analysis. The automated cluster
calling functionality currently available in GenomeStudio uses
both Density Based Spatial Clustering of Applications with Noise
(DBSCAN) and Ordering Points to Identify the Clustering
Structure (OPTICS; [28]) algorithms.
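To make the role of the two sensitivity parameters concrete, the following is a minimal pure-Python sketch of DBSCAN applied to one-dimensional Norm Theta values for a single SNP. It illustrates the general algorithm only and is not GenomeStudio's implementation; the function name, the eps and min_pts values, and the toy theta values (five groups plus one stray point, as might be seen in a tetraploid) are assumptions.

```python
def dbscan_1d(values, eps, min_pts):
    """Density-based clustering of 1-D values.

    eps:     maximum distance between two points to count as neighbours
    min_pts: minimum neighbours (including the point itself) for a core point
    Returns one label per value; -1 marks noise.
    """
    n = len(values)
    neighbours = [[j for j in range(n) if abs(values[i] - values[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n  # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # expand from core points only
    return labels


# Hypothetical Norm Theta values for one SNP scored in a tetraploid panel.
thetas = [0.04, 0.05, 0.06, 0.27, 0.28, 0.29, 0.49, 0.50, 0.51,
          0.71, 0.72, 0.73, 0.94, 0.95, 0.96, 0.39]
labels = dbscan_1d(thetas, eps=0.03, min_pts=3)
n_clusters = len({label for label in labels if label != -1})
```

Raising min_pts or lowering eps makes cluster detection stricter, so sparse groups are reported as noise instead of being called as separate dosage classes.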
4 Notes
1. Template quality and purity are crucial for most DNA-based
methods. However, while fragmented and degraded DNA can
work for techniques such as PCR, good-quality DNA optimizes
results for an Infinium™ assay experiment. More important
than having nonsheared DNA is having a minimum of 200 ng of
template DNA at a minimum concentration of 50 ng/μl. DNA
purification can be a challenge for many plant species where
sufficient material of high quality can be difficult to access or
only available for a short time in the growing season.
Good-quality DNA is often obtained from young expanding
vegetative tissue collected in spring. Older leaves tend to have
less DNA and more inhibitory molecules, though inhibitory
molecules that coextract with nucleic acids can be an issue with
young expanding leaves too. Compounds such as polyphenols
and polysaccharides can be hard to remove from the extract.

[Figure 3: two cluster plots, panels (a) and (b), each showing Norm R against Norm Theta]
Fig. 3 Comparison between two DNA extraction methods for plant tissue: SNP calling and clustering using
GenomeStudio. Samples from young expanding pear (Pyrus communis) leaves were extracted (a) using the
Macherey Nagel Nucleospin kit and (b) using a CTAB-based technique, and analyzed using the 9 k apple and
pear Infinium™ assay. The individuals from the two experiments belong to the same F1 population grown in
similar conditions. The SNP shown is a pear SNP. The clustering for the CTAB-based extraction is of much lower
quality (i.e., the clusters are more spread out and less separated) than for the column-based extraction kit

2. DNA quantity and quality are often assessed using UV fluorescence
with a dye binding to DNA. The ratio of UV absorbance
at 260 and 280 nm (A260/280) is used as a quality measurement,
with good-quality nuclear DNA having an A260/280
ratio around 1.8. Protein and polyphenol
contaminations tend to decrease this ratio. Nevertheless, the
A260/280 ratio does not always detect inhibiting contami-
nants. Figure 3 depicts an experiment that was carried out using
two different methods for DNA extraction of pear samples: one
method involved CTAB and the second used a column-based
commercial kit. Both gave acceptable A260/280 ratios; however,
the quality of the genotype clusters varies greatly between
the two experiments. The figure shows the same SNP marker
run over the same full sib population extracted using the different
methods. The CTAB method resulted in most SNP clusters
being unresolved. The likely cause is that the CTAB inhibits the
first step of the Infinium protocol, which is whole genome
amplification. This may create an imbalance in the allelic ratio,
generating clusters that form a continuum. It is therefore crucial
to use high quality DNA where a PicoGreen quantification is an
accurate representation of the target sample DNA present to
achieve the best results with the Infinium™ protocol. Based on
our experience in various plant species using different DNA
extraction techniques, we recommend that commercial kits
based on column purification be used.
3. While a minimum of 50 ng/μl, as measured by a PicoGreen
method of quantification, is ideal, concentrations down to
10 ng/μl can be accommodated. The more important property
is quality of the DNA. It is recommended to run the samples on
a 1 % agarose gel before use in this assay to ascertain any levels
of degradation or contamination.
4. If there is any delay before continuing on to step 14, repeat
the centrifugation in step 13.
5. Cultivated potato is auto-tetraploid and highly heterozygous.
At any given locus up to four different alleles may be present.
To fully utilize high-throughput genotyping platforms, such as
the potato 10 k SNP chip, for genetic improvement of potato,
analysis of tetraploid lines is required. In auto-tetraploid
potato, each SNP can potentially manifest itself in five different
clusters: AAAA, AAAB, AABB, ABBB, BBBB. Software such
as GenomeStudio can readily identify the fully homozygous
clusters (AAAA and BBBB). However, the remaining three clus-
ters of genotypes (AAAB, AABB, ABBB) are more difficult to
distinguish and are usually grouped into one heterozygous class.
Recent improvements to the GenomeStudio software mean that
it is possible to improve the clustering in an automated fashion
without an a priori designation of how many clusters are
expected. One strategy to work around this in the case of bipa-
rental crosses is to focus on using SNPs with simple segregation
patterns, e.g., simplex (single-dose) markers in one parent,
segregating 1:1 in the F1 progeny (AAAB × AAAA or
ABBB × BBBB), or duplex markers in one parent
(AABB × AAAA/BBBB) segregating 1:2:1 in the F1 progeny
[29]. Alternatively, the raw data can be clustered with
alternative software such as fitTetra [30], which can be
applied to tetraploid genomes.
References
1. The Potato Genome Sequencing Consortium (2011) Genome sequence and analysis of the tuber crop potato. Nature 475:189–195
2. Chagné D, Crowhurst RN, Troggio M et al (2012) Genome-wide SNP detection, validation, and development of an 8 k SNP array for apple. PLoS One 7:e31745
3. Syvanen AC (2005) Toward genome-wide SNP genotyping. Nat Genet 37:S5–S10
4. Myles S, Boyko AR, Owens CL et al (2011) Genetic structure and domestication history of the grape. Proc Natl Acad Sci U S A 108:3530–3535
5. Myles S, Chia J-M, Hurwitz B et al (2010) Rapid genomic characterization of the genus Vitis. PLoS One 5:e8219
6. Kumar S, Chagné D, Bink MCAM et al (2012) Genomic selection for fruit quality traits in apple (Malus x domestica Borkh.). PLoS One 7:e36674
7. Antanaviciute L, Fernandez-Fernandez F, Jansen J et al (2012) Development of a dense SNP-based linkage map of an apple rootstock progeny using the Malus Infinium™ whole genome genotyping array. BMC Genomics 13:203
8. Bachlava E, Taylor CA, Tang S et al (2012) SNP discovery and development of a high-density genotyping array for Sunflower. PLoS One 7:e29814
9. Bowers JE, Bachlava E, Brunick RL et al (2012) Development of a 10,000 locus genetic map of the sunflower genome based on multiple crosses. G3-Genes Genomes. Genetics 2:721–729
10. Hamilton JP, Hansey CN, Whitty BR et al (2011) Single nucleotide polymorphism discovery in elite north american potato germplasm. BMC Genomics 12:302
11. Felcher KJ, Coombs JJ, Massa AN et al (2012) Integration of two diploid potato linkage maps with the potato genome sequence. PLoS One 7:e36347
12. Matukumalli LK, Lawley CT, Schnabel RD et al (2009) Development and characterization of a high density SNP genotyping assay for cattle. PLoS One 4:e5350
13. Peace C, Bassil N, Main D et al (2012) Development and evaluation of a genome-wide 6K SNP array for diploid sweet cherry and tetraploid sour cherry. PLoS One 7:e48305
17. Velasco R, Zharkikh A, Affourtit J et al (2010) The genome of the domesticated apple (Malus x domestica Borkh.). Nat Genet 42:833–839
18. Marco-Sola S, Sammeth M, Guigo R, Ribeca P (2012) The GEM mapper: fast, accurate and versatile alignment by filtration. Nat Methods 9:1185–1188
19. Homer N, Merriman B, Nelson SF (2009) BFAST: an alignment tool for large scale genome resequencing. PLoS One 4:A95–A106
20. Li H, Handsaker B, Wysoker A et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079
21. Danecek P, Auton A, Abecasis G et al (2011) The variant call format and VCFtools. Bioinformatics 27:2156–2158
22. Li R, Li Y, Kristiansen K, Wang J (2008) SOAP: short oligonucleotide alignment program. Bioinformatics 24:713–714
23. Ganal MW, Durstewiz G, Polley A et al (2011) A large maize (Zea mays L.) SNP genotyping array: development and germplasm genotyping, and genetic mapping to compare with the B73 reference genome. PLoS One 6:e28334
24. Teo YY, Inouye M, Small KS et al (2007) A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics 23:2741–2746
25. Giannoulatou E, Yau C, Colella S et al (2008) GenoSNP: a variational Bayes within-sample SNP genotyping algorithm that does not require a reference population. Bioinformatics 24:2209–2214
26. Carvalho B, Bengtsson H, Speed TP, Irizarry RA (2007) Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics 8:485–499
27. Ritchie ME, Carvalho BS, Hetrick KN et al (2009) R/Bioconductor software for Illumina's Infinium™ whole-genome genotyping BeadChips. Bioinformatics 25:2621–2623
14. Thomson MJ, Zhao K, Wright M et al (2012) 28. Ankerst M, Breunig MM, Kriegel HP, Sander J
High-throughput single nucleotide polymor- (1999) OPTICS: ordering points to identify
phism genotyping for breeding applications in the clustering structure. ACM SIGMOD inter-
rice using the BeadXpress platform. Mol Breed national conference on management of data.
29:875–886 ACM Press, New York, pp 49–60
15. Verde I, Bassil N, Scalabrin S et al (2012) 29. Douches D, Coombs J, Merk HL (2012) How
Development and evaluation of a 9K SNP array to develop SNP-based tetraploid maps for Potato.
for peach by internationally coordinated SNP http://www.extension.org/pages/63187/
detection and validation in breeding germ- how-to-develop-snp-based-tetraploid-maps-for-
plasm. PLoS One 7:e35668 potato. Webinar. Accessed 8 April 2013
16. Sim S-C, Durstewitz G, Plieske J et al (2012) 30. Voorrips RE, Gort G, Vosman B (2011)
Development of a large SNP genotyping array Genotype calling in tetraploid species from
and generation of high-density genetic maps in bi-allelic marker data using mixture models.
tomato. PLoS One 7:e40563 BMC Bioinformatics 12:172
Chapter 22
Use of the Illumina GoldenGate Assay for Single Nucleotide
Polymorphism (SNP) Genotyping in Cereal Crops
Shiaoman Chao and Cindy Lawley
Abstract
Highly parallel genotyping assays, such as the GoldenGate assay developed
by Illumina, capable of interrogating up to 3,072 single nucleotide
polymorphisms (SNPs) simultaneously, have greatly facilitated
genome-wide studies, particularly for crops with large and complex
genome structures. In this report, we provide detailed information and
guidelines regarding genomic DNA preparation, SNP assay design, SNP
assay protocols, and genotype calling using Illumina’s GenomeStudio
software.
Key words DNA marker, High-throughput genotyping, Oligo pool assay, OPA, Single nucleotide
polymorphism, SNP
1 Introduction
Highly multiplexed single nucleotide polymorphism (SNP) genotyping
assay systems capable of interrogating a large number of SNP
markers in parallel have greatly facilitated genome-wide studies,
particularly for crops with large and complex genome structures.
The use of highly parallel assay systems such as the GoldenGate
assay developed by Illumina [1, 2] has been reported with success
in several crops, including maize [3], soybean [4], barley [5, 6],
rice [7], wheat [8], and oat [9], for various genetic and breeding
applications. The highly multiplexed GoldenGate assay was enabled
by the development of the BeadArray platform, on which beads, each
carrying a unique but universal bead type oligo identifier, are
pooled, self-assembled, and randomly arranged [10]. The original
BeadArray technology assembled these beads on fiber optic bundles
and a Sentrix Array Matrix compatible with the 96-well plate format.
Subsequent development moved these universal beads to a fixed-slide,
multisample format (BeadChip). A series of decoding hybridizations
is performed to identify the positions of the universal bead types
located on a particular custom-manufactured array [11]. Each bead
type is replicated on average 30 times to
improve the assay precision [11]. The GoldenGate assay takes a
small amount of genomic DNA (250 ng) and is based on primer
hybridization, extension, and ligation to differentiate and produce
allele-specific products. To interrogate each SNP, three assay
primers are designed using an automated algorithm developed at
Illumina: two allele-specific oligos (ASOs) and one locus-specific
oligo (LSO). The LSO can be up to 20 bases downstream from the
targeted SNP position and is placed based on many factors, including
optimization of melting temperatures (Tm) and position relative to
any annotated polymorphisms adjacent to the targeted SNP. When the
ASOs and LSO are manufactured, the bead type address is included in
the PCR primer sequences to facilitate differentiation of individual
SNPs during highly multiplexed assays. Allele-specific products of
approximately uniform size and Tm derived from the genomic DNA
further help to optimize PCR conditions across all targeted products.
These are then PCR amplified with universal primers fluorescently
labeled with Cy3 and Cy5 and are detected by hybridizing to the
BeadChip arrays through the complementary bead type address
present on both the array and LSO. The fluorescence signals are
read out from the arrays in a scanner, and the resulting intensity
values are processed using the GenomeStudio software developed
by Illumina for allele calling. In this report, we describe methods
for genomic DNA preparation, including DNA extraction and
quantification, and general GoldenGate assay procedures with a
focus on cereal crops, including both diploid and polyploid crops.
2 Materials
2.1 Sample Tissue Preparation
2.1.1 Freeze Drying
1. Leaf tissue collected at the seedling stage.
2. Miracloth.
3. Liquid nitrogen.
4. Freeze dryer.
2.1.2 Silica Gel
1. Leaf tissue collected at the seedling stage.
2. 6–12 mesh plain type silica gel stored in airtight conditions.
2.2 DNA Extraction
1. DNA extraction buffer: 0.1 M Tris–HCl, pH 7.5, 50 mM EDTA,
pH 8.0, 1.25 % SDS. To prepare a liter of buffer, add
100 ml of 1.0 M Tris–HCl pH 7.5, 100 ml of 0.5 M EDTA
pH 8.0, 125 ml of 10 % SDS, and 675 ml of ddH2O. Store the
buffer at room temperature.
2. Tissue grinder.
3. 6 M ammonium acetate stored at 4 °C.
4. Isopropanol stored at –20 °C.
5. 70 % ethanol stored at 4 °C.

6. 1× TE: 10 mM Tris–HCl, pH 8.0, 1 mM EDTA, pH 8.0
stored at room temperature.
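The volumes given for the extraction buffer follow from the dilution relation C1 × V1 = C2 × V2. The short Python sketch below recomputes them and can be reused to scale the batch size; the function name and the dictionary layout are illustrative assumptions and are not part of the protocol.

```python
def stock_volumes_ml(final_volume_ml, components):
    """Return the volume (ml) of each stock, plus water, needed for a buffer.

    components maps a stock name to (stock concentration, final concentration),
    both expressed in the same unit for that component.
    """
    volumes = {}
    for name, (stock, final) in components.items():
        # C1 * V1 = C2 * V2  ->  V1 = C2 * V2 / C1
        volumes[name] = final * final_volume_ml / stock
    # Water makes up the remainder of the final volume.
    volumes["ddH2O"] = final_volume_ml - sum(volumes.values())
    return volumes


# Extraction buffer: 0.1 M Tris-HCl, 50 mM EDTA, 1.25 % SDS
extraction_buffer = {
    "1.0 M Tris-HCl pH 7.5": (1.0, 0.1),   # molar
    "0.5 M EDTA pH 8.0": (0.5, 0.05),      # molar
    "10 % SDS": (10.0, 1.25),              # percent
}

recipe = stock_volumes_ml(1000, extraction_buffer)
for name, ml in recipe.items():
    print(f"{name}: {ml:.0f} ml")
```

For a 1 l batch this reproduces the 100 ml, 100 ml, 125 ml, and 675 ml listed in item 1; passing a different final volume scales all four components together.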
2.3 DNA Quantification
1. PicoGreen dsDNA quantification reagent (Molecular Probes
Cat # P7581) stored at 4 °C.
2. 1× TE: 10 mM Tris–HCl, pH 8.0, 1 mM EDTA, pH 8.0
stored at room temperature.
3. DNA standard, such as lambda DNA, with known concentration.
4. 10× TAE: 400 mM Tris–HCl, pH 7.5, 180 mM glacial acetic
acid, 1 mM EDTA.
5. 0.8 % agarose gel in 1× TAE.
6. Adhesive aluminum seals.
7. Spectrophotometer, such as a NanoDrop (Thermo Scientific,
Wilmington, DE).
8. Spectrofluorometer specific for PicoGreen.
2.4 GoldenGate Assay
1. OPA (oligo pool assay): prepare the final SNP panel list and
submit it to Illumina for OPA synthesis. The SNPs included in the
final list have previously been processed through the Illumina
assay design tool (ADT) pipeline.
2. GoldenGate assay reagent kits from Illumina, including the DNA
activation kit, the BeadChip assay kit, and universal-32 BeadChips
(see Note 1).
3. User-supplied reagents: Titanium Taq DNA polymerase
(Clontech Laboratories, Inc., Mountain View, CA), 0.1 N
NaOH, 70 % ethanol, and 100 % ethanol.
4. User-supplied lab consumables: reagent trough, single and
8-channel manual pipettes, filter tips for 8-channel manual
pipettes, 96-well PCR plates, aluminum heat seal foil, adhesive
plate seal, 96-well 0.45 μm filter plates (EMD Millipore,
Billerica, MA), and 96-well V-bottom plates.
3 Methods
3.1 Sample Tissue Preparation in 96-Well Plate Format
Two methods are described for sample tissue preparation; either
one will suffice.
3.1.1 Freeze Drying
1. Place a 96-deep-well plate on ice, cut a 2-in. piece of leaf blade
at the seedling stage, fold it, and insert it into a well.
2. After a plate of samples is collected in full, wrap the plate with
miracloth, fasten the miracloth with a string, then plunge the
plate in liquid nitrogen.
3. Place the frozen plate in –80 °C freezer, and continue with tissue
collecting.
4. Place all frozen plates in the freeze dryer, and dry the tissues
overnight.
5. Remove the string and miracloth, cover the plate with a plate
mat, and store the plates at 4 °C before extraction.
3.1.2 Silica Gel
1. Fill the 96-deep-well plates with plain type silica gel following
the protocol of Bodo Slotta et al. [12]. Cover the plates with the
plate mat to ensure the silica gel is not exposed to moisture in the air.
2. Remove the plate mat, cut a 2-in. piece of leaf blade at the
seedling stage, fold it, and insert it into a well (see Note 2).
3. Place the mat back onto the plate after all samples are collected
and flip the plate a few times to ensure the leaf tissues are in
contact with silica gel.
4. Store the plates at room temperature in airtight plastic bags for
a week, allowing tissues to dry, then proceed with DNA
extraction.
5. Store the plates at 4 °C if DNA is not extracted immediately.
3.2 DNA Extraction
This method is adapted from the original protocol reported by
Pallotta et al. [13] for extracting DNA in 96-well plates using a
robot. The same method can be used to manually extract DNA
from dried tissues in individual tubes or in strip tubes (see Note 3).
DNA is stored at 4 °C before use or at –20 °C for the longer term.
1. Preheat the extraction buffer to 65 °C.
2. To grind freeze-dried leaf tissues into powder, add a ball bearing
to each well. Silica gel-dried leaf tissues can be ground
using the silica gel present in each well. Load the plates into the
tissue grinder and grind for the length of time specified for
the model used.
3. Add 500 μl of extraction buffer to each well (see Note 4). Seal
the plates with adhesive seals and incubate the plates at 65 °C
for 30 min. Vortex the plates every 5 min during incubation.
4. Cool the plates on ice for 15 min before adding 250 μl of cold
6 M ammonium acetate. Seal the plates with adhesive seals,
mix by vortexing, and incubate the plates on ice for 15 min.
5. Centrifuge the plates for 20 min at 4,000 × g at 10 °C.
6. Add 360 μl chilled isopropanol into each well of new 96
deep-well plates.
7. Transfer 600 μl of the supernatant into new 96 deep-well plates
containing isopropanol (see Note 5). Mix thoroughly and
allow DNA to precipitate for 10 min or longer at 4 °C.
8. Centrifuge the plates for 20 min at 4,000 × g at 10 °C to
pellet the DNA. Pour off and discard the supernatant (see
Note 6).
9. Add 500 μl of chilled 70 % ethanol to wash the DNA pellets.
10. Centrifuge the plates for 20 min at 4,000 × g at 10 °C, and
discard the supernatant. Air dry the DNA pellets for 20 min.
11. Resuspend the DNA pellets in 100 μl of 1× TE and allow the DNA
to dissolve overnight at 4 °C.
12. Centrifuge the plates for 5 min at 4,000 × g at 10 °C, and transfer
90 μl of DNA to new 96-well PCR plates for storage at 4 °C
for the short term or –20 °C for the long term.
3.3 DNA Quantification
The DNA concentration is estimated using a spectrophotometer,
such as a NanoDrop. DNA quality should be checked on randomly
chosen samples by confirming the presence of a high molecular weight
band on a 0.8 % agarose gel (see Note 7). Dilute
DNA in 1× TE to approximately 50 ng/μl, and proceed with
quantifying DNA using PicoGreen (see Note 8).
1. To prepare the lambda DNA standard, dilute lambda DNA to
75 ng/μl in a final volume of 233.3 μl in well A1 of a 96-well
plate. Add 66.7 μl of 1× TE to well B1, and 100 μl to wells C1 to
H1. Make a serial dilution by transferring 133.3 μl of
lambda DNA from well A1 into well B1 and mixing well. Change
tips, transfer 100 μl from well B1 into well C1, and mix
well. Repeat for wells D1 to G1. Well H1 serves as the blank at
0 ng/μl.
2. Prepare 1:200 dilution of PicoGreen into 1× TE. Use 115 μl
PicoGreen and 23 ml 1× TE for one plate, and 215 μl
PicoGreen and 43 ml 1× TE for two plates.
3. Transfer 195 μl PicoGreen/TE dilution into each well of col-
umns 1 and 2 of a plate labeled as standard QDNA plate, add
2 μl of each stock lambda DNA dilution to the standard QDNA
plate, and mix well. Immediately cover the plate with an adhe-
sive aluminum seal.
4. Transfer 195 μl PicoGreen/TE dilution into each well of a
plate labeled as sample QDNA plate, add 2 μl of each DNA sample
from the sample plate to the corresponding well of the sample
QDNA plate, and mix well.
Immediately cover the plate with an adhesive aluminum seal.
5. Measure fluorescence on Spectrofluorometer specific for
PicoGreen according to manufacturer's recommendations.
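The lambda DNA dilution series in step 1 and the conversion of fluorescence readings to concentrations can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: a straight-line standard curve is assumed, the fluorescence readings are invented solely to exercise the functions, and all function and constant names are hypothetical. Because standards and samples are both added as 2 μl into 195 μl of PicoGreen/TE, the curve maps readings directly to the concentration of the DNA that was pipetted.

```python
MIN_NG_PER_UL = 50.0  # threshold discussed in Note 8


def lambda_standard_series(top=75.0, first_transfer=133.3, first_diluent=66.7,
                           n_twofold=5):
    """Concentrations (ng/ul) for wells A1..H1 of the lambda DNA standard."""
    series = [top]
    # A1 -> B1: 133.3 ul of standard into 66.7 ul of TE
    series.append(top * first_transfer / (first_transfer + first_diluent))
    # B1 -> C1 ... F1 -> G1: 100 ul into 100 ul of TE, twofold each time
    for _ in range(n_twofold):
        series.append(series[-1] / 2)
    series.append(0.0)  # H1 is the TE-only blank
    return series


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def concentration(fluorescence, slope, intercept):
    """Invert the standard curve to estimate ng/ul from a fluorescence reading."""
    return (fluorescence - intercept) / slope


standards = lambda_standard_series()

# Invented, perfectly linear readings used only to exercise the functions.
readings = [120.0 * c + 35.0 for c in standards]
slope, intercept = fit_line(standards, readings)

sample_estimate = concentration(6500.0, slope, intercept)
print(f"estimated {sample_estimate:.1f} ng/ul; "
      f"meets threshold: {sample_estimate >= MIN_NG_PER_UL}")
```

The series works out to nominal values of 75, 50, 25, 12.5, 6.25, 3.125, and about 1.56 ng/μl, plus the blank. In keeping with Note 8, the estimate is used here only as a pass/fail check against the 50 ng/μl threshold, not as a guide for dilution.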
3.4 SNP Assay Design
To ensure a high success rate of converting candidate SNPs to
successful assays, the SNP panel is first evaluated using Illumina’s
algorithm for GoldenGate scoring, the Assay Design Tool (ADT).
ADT is a bioinformatic pipeline based on a proprietary algorithm
developed for designing oligo probes for SNP assays using
GoldenGate, as well as Infinium, chemistry.
1. Prepare a list with sequence reads of 50–60 bases flanking the
targeted SNP (e.g., GG…TA[G/A]GT…AT) using the tem-
plate available online for download. It is recommended that all
potential candidate SNPs be evaluated by ADT and the scores
included as part of the final SNP selection criteria.
2. Tech support at Illumina can supply the then-current list of
supported genomes. For genomes with a build incorporated in
ADT, a filter that downgrades scores of SNPs landing in likely
duplicated regions of the genome is included. For nonsup-
ported genomes with little genomic information, users can use
a repeat-masking lower case weighting of sequence data to
indicate high risk regions of the genome, where known. ADT
will preferentially avoid lower case masked regions provided
the filter for lower case weighting is enabled. This is part of the
input parameters in the downloadable ADT input file.
3. Upload the SNP list to the iCom website. Illumina will return
a file with a designability score assigned to each SNP ranging
from 0 to 1. Generally the higher the scores, the better the
chance for the SNP assay to work. The recommended cutoff
score value for optimizing success is 0.6 or higher, although
lower scores can be included where SNPs are in a highly desir-
able region of the genome.
4. Filter the scored SNPs and select the final SNP panel for the
genotyping assay. Submit the final score file to Illumina for OPA
synthesis (see Note 9).
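Before submission, the bracketed SNP notation from step 1 and the score cutoff from step 3 can be checked with a small script. The Python sketch below is illustrative only: it assumes that the 50–60 flanking bases are needed on each side of the SNP, it does not reproduce the actual ADT template or score file layout, and the function names and the (name, score) pairs are hypothetical.

```python
import re

# Flank, bracketed alleles, flank; lowercase is allowed for masked sequence.
SNP_PATTERN = re.compile(r"^([ACGTNacgtn]+)\[([ACGT])/([ACGT])\]([ACGTNacgtn]+)$")
MIN_FLANK = 50      # bases assumed to be required on each side of the SNP
SCORE_CUTOFF = 0.6  # recommended designability cutoff


def parse_snp(sequence):
    """Split 'LEFT[A/B]RIGHT' into (left flank, allele A, allele B, right flank)."""
    match = SNP_PATTERN.match(sequence)
    if match is None:
        raise ValueError(f"not in flank[allele/allele]flank form: {sequence!r}")
    return match.groups()


def has_enough_flank(sequence, min_flank=MIN_FLANK):
    """True when both flanks contain at least min_flank bases."""
    left, _, _, right = parse_snp(sequence)
    return len(left) >= min_flank and len(right) >= min_flank


def select_panel(scored_snps, cutoff=SCORE_CUTOFF, keep=()):
    """Return SNP names whose score meets the cutoff, plus any names in keep.

    scored_snps: iterable of (name, score) pairs taken from the score file.
    keep: names retained regardless of score, e.g. SNPs in a highly
          desirable region of the genome.
    """
    return [name for name, score in scored_snps
            if score >= cutoff or name in keep]


example = "A" * 55 + "[G/A]" + "C" * 52
print(parse_snp(example)[1:3], has_enough_flank(example))

scores = [("snp_1", 0.91), ("snp_2", 0.42), ("snp_3", 0.60)]  # invented values
print(select_panel(scores))
print(select_panel(scores, keep={"snp_2"}))
```

The `keep` argument mirrors the allowance in step 3 for including lower-scoring SNPs that fall in a highly desirable region of the genome.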
3.5 GoldenGate Genotyping Assay
The genotyping assay generally takes about 3 days. Day 1 involves
DNA activation and hybridizing OPA to biotinylated DNA templates
overnight. Day 2 involves extension, ligation, and PCR
amplification of DNA templates containing the targeted SNPs. It is
recommended that all the assays up to the PCR step be carried out in
a pre-PCR clean room. In the post-PCR room, the PCR products
are cleaned up, denatured, and hybridized to the BeadChips
overnight. Day 3 involves BeadChip washing and imaging to generate
hybridization intensity values.
3.5.1 DNA Activation
1. Preheat the heat blocks to 95 °C.
2. Add 5 μl of MS1 and 5 μl of genomic DNA at 50 ng/μl to the
96-well plates, heat-seal the plates, mix well, pulse centrifuge,
and incubate the plates on heat blocks at 95 °C for 30 min
(see Note 10).
3. Pulse centrifuge the plates, add 5 μl of PS1, mix well, then add
15 μl of isopropanol, and mix.
4. Precipitate DNA by spinning the plates at 3,000 × g for 20 min.
Remove the isopropanol by smacking the plates, then spin
the plates upside down at 8 × g for 1 min. Air dry the DNA at room
temperature for 15 min.
5. Dissolve activated DNA in 10 μl of RS1, and proceed with the
next step.
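The DNA input for activation follows from the working concentration: 5 μl at 50 ng/μl gives the 250 ng mentioned in the Introduction. The Python sketch below makes that arithmetic explicit and also computes a dilution to the working concentration. It assumes the stock concentration comes from the spectrophotometer estimate (see Note 8 on why PicoGreen values should not guide dilutions); the function names are hypothetical.

```python
TARGET_NG_PER_UL = 50.0  # working concentration used for DNA activation
INPUT_UL = 5.0           # volume of genomic DNA added per well


def input_mass_ng(conc_ng_per_ul=TARGET_NG_PER_UL, volume_ul=INPUT_UL):
    """Mass of genomic DNA (ng) that goes into one activation reaction."""
    return conc_ng_per_ul * volume_ul


def dilution_volumes_ul(stock_ng_per_ul, final_ul, target=TARGET_NG_PER_UL):
    """Return (stock volume, TE volume) in ul giving final_ul at the target."""
    if stock_ng_per_ul < target:
        raise ValueError("stock is already below the target concentration")
    stock_ul = target * final_ul / stock_ng_per_ul
    return stock_ul, final_ul - stock_ul


print(input_mass_ng())
print(dilution_volumes_ul(200.0, 40.0))
```

For a hypothetical 200 ng/μl stock, 40 μl of working solution needs 10 μl of stock and 30 μl of 1× TE.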
3.5.2 OPA Hybridization
1. Preheat the heat blocks to 70 °C.
2. Add 10 μl of OPA and 30 μl of OB1 to each well of new
96-well plates (the ASE plates). Transfer 10 μl of activated
DNA to the ASE plates, heat-seal the plates, mix well, and
pulse centrifuge.
3. After placing the ASE plates in the heat block, immediately
turn the temperature down to 30 °C, allowing the heat block
to slowly cool down to 30 °C. ASE plates can remain on the
heat block at 30 °C for up to 16 h.
3.5.3 Extension and Ligation
1. Preheat the heat blocks to 45 °C.
2. Place the ASE plates on magnetic stands, pipette and discard
all liquid from the ASE plates. Wash wells twice with 50 μl of
AM1, and twice with 50 μl of UB1.
3. Add 37 μl of MEL to each well, mix well, and incubate the
ASE plates on heat blocks at 45 °C for 15 min.
3.5.4 PCR Amplification
1. Add 64 μl of Titanium Taq polymerase and 50 μl of UDG (see
Note 11) to MMP tubes, and mix well. Aliquot 30 μl of MMP
mixture to the PCR plates, and store the plates in the dark.
2. Preheat the heat blocks to 95 °C.
3. Place the ASE plates on magnetic stands, pipette and discard
all liquid from the ASE plates after 15 min incubation, wash
wells once with 50 μl of UB1.
4. Add 35 μl of IP1 to the ASE plates and incubate at 95 °C for
1 min.
5. Transfer 30 μl of supernatant from the ASE plates on magnetic
stands to the PCR plates, heat-seal the PCR plates, and discard
the ASE plates.
6. Place the PCR plates into the thermal cycler, and run the
program set at 37 °C for 10 min, 95 °C for 3 min, followed by
34 cycles of 95 °C for 35 s, 56 °C for 35 s and 72 °C for 2 min,
then a final extension at 72 °C for 10 min, before holding the
program at 4 °C for 5 min.
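For planning the day, the thermal cycler program in step 6 can be written down as data and its programmed run time summed. The Python sketch below is a minimal illustration; the dictionary layout and function names are assumptions, and ramp times between temperatures, which depend on the instrument, are not included.

```python
# Each step is (temperature in deg C, duration in seconds).
PROGRAM = {
    "before_cycling": [(37, 10 * 60), (95, 3 * 60)],
    "cycle": [(95, 35), (56, 35), (72, 2 * 60)],
    "n_cycles": 34,
    "after_cycling": [(72, 10 * 60), (4, 5 * 60)],
}


def hold_seconds(steps):
    """Total hold time of a list of (temperature, seconds) steps."""
    return sum(seconds for _, seconds in steps)


def programmed_seconds(program):
    """Sum of all hold times in the program, excluding ramping."""
    return (hold_seconds(program["before_cycling"])
            + program["n_cycles"] * hold_seconds(program["cycle"])
            + hold_seconds(program["after_cycling"]))


total = programmed_seconds(PROGRAM)
print(f"{total} s = {total / 60:.1f} min, excluding ramp times")
```

The holds add up to 8,140 s, or about 2 h 16 min, so the actual run will be somewhat longer once ramping is included.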
3.5.5 Clean and Denature PCR Products
1. Add 20 μl of MBP into each well of the PCR plates. Set the
8-channel pipette to 85 μl, pipette all the solution in the PCR
plates up and down several times to mix, then transfer the
mixed solution to the 0.45 μm filter plates. Incubate the filter
plates at room temperature for 1 h in the dark.
2. Place the filter plates on empty 96-well V-bottom plates
(the waste plates), and centrifuge at 1,000 × g for 5 min at 25 °C.
3. Add 50 μl of UB2 to the filter plates, repeat centrifugation at
1,000 × g for 5 min at 25 °C.
4. Add 30 μl of MH1 to clean 96-well V-bottom plates (the INT
plates). Replace the waste plates with the INT plates. Add 30 μl
of 0.1 N NaOH to the filter plates, and centrifuge at 1,000 × g
for 5 min at 25 °C. Discard the filter plates.
5. Gently mix the contents of the INT plates by moving the plates
side to side. Store the INT plates in the dark until ready to
dispense samples onto the BeadChip.
3.5.6 Hybridization of BeadChip
1. Turn on the hybridization oven to 60 °C.
2. Add 200 μl of CHB into the humidifying buffer reservoirs in
the Hyb chamber.
3. Place each BeadChip in a Hyb chamber insert. Pipette samples
up and down in the INT plates and load 15 μl of sample from
the INT plates to the inlet port on the BeadChip.
4. Load the Hyb chamber inserts containing sample-laden
BeadChips to the Hyb chamber. Close and lock the BeadChip
Hyb chamber lid.
5. Place the Hyb chambers into the 60 °C hybridization oven
with the rocker on. After 30 min, adjust the oven temperature
to 45 °C, and incubate between 16 and 18 h at 45 °C.
3.5.7 BeadChip Wash and Imaging
1. Prepare three wash dishes: two filled with 300 ml of PB1,
and a third filled with 300 ml of XC4 reagent mixed with
100 % ethanol.
2. Remove the seals on the BeadChips. Load up to 12 BeadChips
to a wash rack, and immerse the BeadChips in the first wash
dish containing PB1, move up and down ten times.
3. Transfer the wash rack to the second PB1 wash dish and let it
soak for 5 min.
4. Transfer the wash rack to the XC4 wash dish and move the wash
rack slowly up and down ten times, and let it soak for 5 min.
5. Dry the BeadChips in desiccators under vacuum for 1 h or
until dry.
6. Clean the underside of each BeadChip with Kimwipes wetted
with 70 % ethanol to remove excess XC4.
7. Download the dmap files corresponding to each BeadChip through
the Decode File Client application (see Note 12), and load the
BeadChips into the array reader, such as an iScan.
8. The intensity data (.idat) files generated by the reader contain
allele-specific hybridization intensity values.
3.6 Genotype Calling Using GenomeStudio Software
Three files are used to start a new GenomeStudio project:
(1) the intensity data (.idat) files, (2) the OPA manifest (.opa)
containing interrogating probe and bead address sequence
information, and (3) the sample sheet (.csv) containing the OPA name,
the Sentrix barcodes (all BeadChips are barcoded), the sample
names, their corresponding well positions, and other relevant
sample information (Fig. 1). Of these, the sample sheet is optional
(see Note 13). A typical GenomeStudio project contains three major
elements: the SNP Graph, where genotype calling can be manipulated;
the Samples Table, containing sample names and the call rate for
each sample over all SNPs assayed; and the SNP Table, containing the
names of the SNPs used in the genotyping assay and the genotype
clustering statistics of all samples assayed for each SNP (Fig. 2).
Fig. 1 The workflow for generating SNP genotype data using the
GenomeStudio software. Sample sheet and cluster file are optional for
starting a new GenomeStudio project. [Diagram elements: Decode Map
Files (*.dmap); iScan; SNP Manifest (*.opa); Intensity Data (*.idat);
Sample Sheet (*.csv); Cluster File (*.egt); GenomeStudio Software;
Raw Data Report; Visualization Tools]
Fig. 2 Major elements of a GenomeStudio project
Fig. 3 Three genotypes are expected after automated calling using the
algorithm provided by the software
Fig. 4 One of the homozygote clusters is shifted and manually edited
Fig. 5 This SNP detected subclusters and should be eliminated from
further analysis
The software first normalizes and scales the intensity data to
adjust for background noise. It then uses the GenCall (GC) no-call
threshold (0.25 is the recommended lower threshold for the GC score
for the GoldenGate assay) to determine whether a genotype should be
assigned within the call region of a given cluster. If the score is
less than 0.25, the genotype is considered too far from the centroid
of the cluster to be reliably assigned to it and results in a
no-call, or missing data. GenomeStudio software was originally
developed using human data, assuming diploidy and Hardy–Weinberg
equilibrium (HWE), and thus includes metrics that allow easy
screening of loci that deviate from HWE. After automated data
clustering is applied, three genotype clusters are expected (Fig. 3).
The default cluster separation parameters, however, are often not
applicable to self-pollinated crops, where two homozygous genotypes
are expected for most of the SNPs. As a result, manual inspection of
cluster positions is often required (see Note 14). When SNPs are
derived from regions where copy number differences exist among
samples, genotype cluster compression is observed; this occurs more
frequently in polyploid crops [3, 4, 8] (Fig. 4). Sequence variations
adjacent to the targeted SNPs in the different samples genotyped can
cause the formation of subclusters (Fig. 5). In polyploids,
polymorphic SNPs other than the targeted one, present on different
genomes, can further result in the appearance of more than three
clusters; such SNPs should be eliminated from analysis (Fig. 5).
To ascertain the heterozygote cluster position, a heterozygote
control is recommended, prepared by mixing equal amounts of DNA from
the parents used to construct mapping populations or taken from true
F1 individuals if available. This control can also serve to check
genotyping consistency between plates and between genotyping
facilities. The cluster file (.egt) containing the genotype cluster
position information can be exported and applied to different
batches of samples genotyped with the same OPA to maintain genotype
calling consistency (Fig. 1).
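The no-call rule and the per-sample call rate described above are simple to reproduce on exported data, for example when comparing batches outside GenomeStudio. The Python sketch below is a minimal illustration and not GenomeStudio's implementation: the no-call symbol, the function names, and the example scores are assumptions, and the HWE function computes the textbook chi-square statistic, which is not necessarily the metric GenomeStudio reports.

```python
NO_CALL = "--"        # placeholder symbol for a missing genotype (assumption)
GC_THRESHOLD = 0.25   # recommended lower GenCall limit for GoldenGate


def apply_no_call(genotype, gc_score, threshold=GC_THRESHOLD):
    """Keep a genotype only when its GenCall score reaches the threshold."""
    return genotype if gc_score >= threshold else NO_CALL


def call_rate(genotypes):
    """Fraction of genotypes that are not no-calls."""
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if g != NO_CALL)
    return called / len(genotypes)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Textbook chi-square statistic for deviation from HWE proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 0.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Classes with zero expectation (monomorphic loci) contribute nothing.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


calls = [("AA", 0.82), ("AB", 0.31), ("BB", 0.12), ("AA", 0.25)]  # invented
filtered = [apply_no_call(g, s) for g, s in calls]
print(filtered, call_rate(filtered))

# A panel of inbred lines with no heterozygotes deviates strongly from HWE.
print(hwe_chi_square(40, 0, 60))
```

The second example illustrates why HWE-based screening, designed for outbred human samples, flags many perfectly good SNPs in self-pollinated crops: the absence of heterozygotes alone produces a large statistic.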
4 Notes
1. The 32-BeadChip platform, capable of processing 32 samples, is
applicable for SNP assays of up to 3,072-plex.
2. The amount of silica gel in each well is enough to dry only one
piece of leaf tissue. If too many tissues are inserted, they will
become moldy and affect DNA quality.
3. This method is also applicable for extracting DNA from
endosperm tissue. The triploid nature of the endosperm
tissue may affect the SNP genotype cluster position during
genotype calling.
4. Because silica gel absorbs water, 100 μl of ddH2O can be added
to compensate for the liquid absorbed by the silica gel.
5. Care must be taken to avoid transferring debris from the
interface.
6. Allow the remaining fluid to drain off the DNA pellet by
inverting the plates onto paper towels for a few seconds.
7. The presence of RNA in the DNA preps has no effect on the SNP
assay, but RNA should be taken into account when estimating the
DNA concentration. DNA quality does not appear to be critical for
the GoldenGate assay [2]. We once used degraded DNA because of a
limited seed source and obtained good results. However, we
recommend that users prepare high quality DNA for all SNP
genotyping assays.
8. This protocol is accurate for final concentrations between 0
and 50 ng/μl. It is intended to determine whether samples reach
the minimum of 50 ng/μl and should not be used to guide dilutions,
because PicoGreen is nonlinear and less precise (in the dilution
series) for concentrations estimated above 75 ng/μl. If samples
measured between 50 ng/μl and 75 ng/μl are nevertheless to be
diluted, make the dilution conservatively and recheck the final
concentration using PicoGreen. Otherwise, samples that meet the
threshold should be used as is, regardless of the absolute value
from the PicoGreen assay. An upper concentration limit above which
the GoldenGate assay fails has not been determined. Thus, one can
load samples at their maximum concentration to optimize robustness
of the assay.
9. Not included in the score file returned from Illumina is the
information of the probes designed for each SNP, which is
the basis for OPA synthesis. The users may request the probe
information to assist in filtering out SNPs derived from highly
redundant genomic regions remaining after complexity reduc-
tion, particularly when the SNP panel contains a mixture of
cDNA- and genomic-derived SNPs [14].
10. One person can manually process two plates at the same time.
11. UDG, uracil DNA glycosylase, is used to eliminate carry-over DNA
as a precaution to minimize cross-contamination between different
batches of experiments.
12. The dmap files generated from the decoding process contain
the information of the bead types and their positions on each
BeadChip and are required during array scanning.
13. The sample sheet is optional. However, if the sample sheet is
not used, GenomeStudio will assign each sample with a generic
name.
14. Illumina recently released the GenomeStudio Polyploid
Clustering (PC) Module that uses density-based algorithms to
assign genotypes to clusters. It is suitable for polyploid species
for which the standard diploid clustering algorithm imple-
mented in the Genotyping Module is not appropriate. The PC
Module performs cluster assignment, but does not call geno-
types. Manual editing of cluster assignments is still necessary.
Acknowledgements and Disclaimer
This work was supported by USDA-NIFA Grant No. 2009-85606-05701
(“Barley Coordinated Agricultural Project: Leveraging
Genomics, Genetics, and Breeding for Gene Discovery and Barley
Improvement”); USDA-NIFA Grant No. 2009-65300-05638
(“Single Nucleotide Polymorphism (SNP) Markers for
High-Throughput Genotyping to Advance Genomic, Genetic, and
Breeding Research in Wheat”); USDA-NIFA Grant No.
2009-65300-05707 (“Oat SNP Development and Identification of Loci
Affecting Key Traits in North American Oat Germplasm using
Association Mapping”); General Mills, Inc.; and USDA CRIS
Project No. 5442-22000-033-00D (“Improvement of Hard Red
Spring and Durum Wheat for Disease Resistance and Quality using
Genetics and Genomics”). Mention of trade names or commercial
products in this article is solely for the purpose of providing specific
information and does not imply recommendation or endorsement
by the U.S. Department of Agriculture. USDA is an equal oppor-
tunity provider and employer.
References
1. Fan J-B, Oliphant A, Shen R, Kermani BG, Garcia F, Gunderson KL,
Hansen M, Steemers F, Butler SL, Deloukas P, Galver L, Hunt S,
McBride C, Bibikova M, Rubano T, Chen J, Wickham E, Doucet D,
Chang W, Campbell D, Zhang B, Kruglyak S, Bentley D, Hass J,
Rigault P, Zhou L, Stuelpnagel J, Chee MS (2003) Highly parallel
SNP genotyping. Cold Spring Harbor Symp Quant Biol 68:69–78
2. Shen R, Fan J-B, Campbell D, Chang W, Chen J, Doucet D, Yeakley J,
Bibikova M, Garcia EW, McBride C, Steemers F, Garcia F, Kermani BG,
Gunderson K, Oliphant A (2005) High-throughput SNP genotyping on
universal bead arrays. Mutat Res 573:70–82
3. Yan J, Yang X, Shah T, Sanchez-Villeda H, Li J, Warburton M,
Zhou Y, Crouch JH, Xu Y (2010) High-throughput SNP genotyping
with GoldenGate assay in maize. Mol Breed 25:441–451
4. Hyten DL, Song Q, Choi I-Y, Yoon M-S, Specht JE, Matukumalli LK,
Nelson RL, Shoemaker RC, Young ND, Cregan PB (2008)
High-throughput genotyping with the GoldenGate assay in the
complex genome of soybean. Theor Appl Genet 116:945–952
5. Rostoks N, Ramsay L, MacKenzie K, Cardle L, Bhat PR, Roose ML,
Svensson JT, Stein N, Varshney RK, Marshall DF, Graner A, Close TJ,
Waugh R (2006) Recent history of artificial outcrossing
facilitates whole-genome association mapping in elite inbred crop
varieties. Proc Natl Acad Sci U S A 103:18656–18661
6. Close TJ, Bhat PR, Lonardi S, Wu Y, Rostoks N, Ramsay L, Druka A,
Stein N, Svensson JT, Wanamaker S, Bozdag S, Roose ML, Moscou MJ,
Chao S, Varshney RK, Szucs P, Sato K, Hayes PM, Matthews DE,
Kleinhofs A, Muehlbauer GJ, DeYoung J, Marshall DF, Madishetty K,
Fenton RD, Condamine P, Graner A, Waugh R (2009) Development and
implementation of high-throughput SNP genotyping in barley. BMC
Genomics 10:582
7. Zhao K, Wright M, Kimball J, Eizenga G, McClung A, Kovach M,
Tyagi W, Ali ML, Tung C-W, Reynolds A, Bustamante CD, McCouch SR
(2010) Genomic diversity and introgression in O. sativa reveal the
impact of domestication and breeding on the rice genome. PLoS One
5:e10780
8. Akhunov E, Nicolet N, Dvorak J (2009) Single nucleotide
polymorphism genotyping in polyploid wheat with the Illumina
GoldenGate assay. Theor Appl Genet 119:507–517
9. Chao S, Oliver R, Lazo G, Tinker N, Jellen E, Maughan J, Jackson E
(2012) Development of a high-density SNP genotyping panel as a
community resource for genetic analysis in oat. Abstract. Plant
and Animal Genome XX Conference, 14–18 Jan 2012, San Diego, CA
10. Oliphant A, Barker DL, Stuelpnagel JR, Chee MS (2002) BeadArray™
Technology: enabling an accurate, cost-effective approach to
high-throughput genotyping. Biotechniques 32:S56–S61
11. Gunderson KL, Kruglyak S, Graige MS, Garcia F, Kermani BG, Zhao C,
Che D, Dickinson T, Wickham E, Bierle J, Doucet D, Milewski M,
Yang R, Siegmund C, Hass J, Zhou L, Oliphant A, Fan J-B, Barnard S,
Chee MS (2006) Decoding randomly ordered DNA arrays. Genome Res
14:870–877
12. Bodo Slotta TA, Brady L, Chao S (2008) High throughput tissue
preparation for large-scale genotyping experiments. Mol Ecol
Resour 8:83–87
13. Pallotta MA, Warner P, Fox RL, Kuchel H, Jefferies SJ, Langridge P
(2003) Marker assisted wheat breeding in the southern region of
Australia. Proceedings of the tenth international wheat genetics
symposium, Paestum, Italy, pp 789–791
14. Tinker NA, Chao S, Lazo GR, Oliver RE, Huang YF, Poland JA,
Jellen EN, Maughan PJ, Kilian A, Jackson EW (2014) A SNP
genotyping array for hexaploid oat (Avena sativa L.). Plant Genome
doi: 10.3835/plantgenome2014.03.0010.

