Crawford and Oleksiak 2016
doi: 10.1093/bfgp/elw008
Advance Access Publication Date: 4 April 2016
Review paper
Abstract
Marine species live in a wide diversity of environments and yet, because of their pelagic life stages, are thought to be well connected: they have high migration rates that inhibit significant population structure. Recent innovations in sequencing technologies now provide information on nucleotide polymorphisms at thousands to tens of thousands of loci based on whole genomes, reduced representative portions of genomes (0.1–1%) or a majority of expressed mRNAs. Data from these genomic approaches are used to define and quantify single-nucleotide polymorphisms (SNPs). These SNP data tend to agree with data from older technologies (allozymes or microsatellites), which support well-connected populations with few genetic differences among populations. However, these studies also find that a small percentage of SNPs (1–5%) readily distinguish genetic differences among populations on relatively small geographic scales. The magnitudes of the genetic differences (FST values) suggest that the significant differences at hundreds of loci are due to positive selective pressures. Thus, these data suggest that natural selection is effectively altering allele frequencies at hundreds of loci in marine populations. In this manuscript, we provide examples of these studies, the strengths and weaknesses of different genomic approaches as well as important technical aspects associated with genomic approaches.
Importance of genomics
The world’s oceans provide diverse organisms and environmental habitats to better explore ecological genomics: how genomic variation and its expression enhance an organism’s probability of success. Marine habitats range from freezing polar regions to warm tropical waters, from shallow coastal estuaries to deep-sea hydrothermal vents that rely on chemosynthesis rather than photosynthesis, and adaptation to these environments likely requires many different genes in many different pathways. Using genomic approaches, which query the whole genome rather than a targeted gene set, we are more likely to discover the evolved changes responsible for this variation in life than if we explore only a few genes. Additionally, because genomic approaches provide so many loci, many of which are evolving by neutral processes, we can use neutral-demographic variation to define how populations are connected. This is important because many marine organisms have a pelagic larval stage, and the oceans’ currents can transport larvae thousands of kilometers from where they hatched. We also can contrast neutral loci to genetic changes that deviate from the neutral patterns and are most likely evolving by natural selection to better understand local adaptation. Marine environments provide clines along a coast, within an estuary and within a tidal shore, and altered environments at vastly different time and spatial scales to investigate the potential for adaptive change. Further, changing habitats due to factors such as global climate change, anthropogenic pollution and habitat destruction are affecting all the world’s oceans. Genomic approaches allow us to explore the rate of adaptive change on a scale that was not previously possible. Combining genomics with research on marine organisms that inhabit diverse habitats provides opportunities to address important questions concerning how marine organisms
Douglas L. Crawford is a Professor in the Department of Marine Biology and Ecology at the Rosenstiel School of Marine and Atmospheric Science,
University of Miami. His research focuses on evolutionary genomics integrating physiology, mRNA expression and nucleotide polymorphisms.
Marjorie F. Oleksiak is an Associate Professor in the Department of Marine Biology and Ecology at the Rosenstiel School of Marine and Atmospheric
Science, University of Miami. Her research uses genomic tools with physiological and toxicological phenotypes to enhance our understanding of evolu-
tionary processes.
© The Author 2016. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com
Ecological population genomics | 343
and populations interact with their environments and adapt to environmental change, and genomic approaches provide the tools to address the genetic basis of these ecological interactions with a breadth of coverage that has not previously been available.
Genomic studies interrogate across the genome and include whole-genome analyses, quantifying gene expression and analyses of genetic variation (e.g. SNP, single-nucleotide polymorph-

past 5 years, several approaches have been developed to survey 1000s to 10 000s of loci by sequencing whole genomes or reduced portions of the genome (transcriptome sequencing or reduced representative libraries [27]). These approaches are discussed below.
While whole-genome sequencing has been widely used to discover SNPs in model organisms such as humans and flies, there are fewer examples in non-model, marine organisms. A
Term | Description
Whole-genome sequencing | Sequences that capture nearly all of the genome and are assembled into large (10s–100 kb) scaffolds. Because whole genomes can be 0.1 to 10 Gbp, whole-genome approaches require much more sequencing than GBS, RNA-seq or exon-capture. Thus, few individuals are typically analyzed.
RRGS: Reduced representative genome sequencing (RAD, RAD-tag, GBS) | Genome sequencing that only sequences a selected ‘reduced’ portion of the genome. Typically this approach sequences 0.1% to 1% of the genome derived from specific restriction sites. The power of this approach is that the same loci in the genome, each approximately 50–100 bp, are sequenced in many (100s of) individuals.
Transcriptomics and RNA-seq | Transcriptomics is the quantitative and qualitative analysis of RNA expression. RNA-seq is a transcriptomic approach that uses sequences of expressed RNA; it both provides quantitative measures of expression and identifies nucleotide variation in expressed DNA (i.e. in RNA).
Capture sequencing, exon-capture, exomic sequencing, targeted sequencing | Capture sequencing, as the name implies, is sequencing selected (captured) genomic DNA targets. The most general approach is exon-capture or exomic sequencing, which uses PCR, hybridization probes or cDNAs to capture exons or DNA that is expressed as RNA. Yet, the selected targets can be any parts of the genome for which there are probes or primers to specify a portion of the genome.
454 sequencing technology | 454 was one of the initial next-generation sequencing approaches. It captures individual DNA fragments on beads and sequences each of these fragments in separate wells in high-density well plates; 454 sequencing produces 300–500 bp sequences (reads), yielding 0.5 Gbp/run.
Illumina sequencing technology | Illumina sequencing captures DNA fragments on a solid surface and sequences these fragments on the surface. There are many different platforms; the HiSeq platform typically produces 50–200 bp reads and yields 50 to 1000 Gbp per run.
ABI SOLiD sequencing technology | Sequencing by Oligonucleotide Ligation and Detection is a novel approach that does not rely on sequencing by DNA polymerization. Each read is 50 bp, and a run yields 30–50 Gbp.
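The trade-off in the table between the portion of the genome sequenced and the number of individuals analyzed can be illustrated with a rough sequencing budget. The sketch below is only a back-of-the-envelope calculation: the function name, the 1 Gbp genome size and the 100 individuals are hypothetical, and it assumes every sequenced base maps uniformly to the targeted portion of the genome, which real libraries do not achieve.

```python
def mean_depth_per_individual(run_yield_bp, genome_size_bp,
                              fraction_sequenced, n_individuals):
    """Expected mean read depth per targeted base for each individual.

    Assumes (optimistically) that the whole run yield is usable and is
    spread evenly over individuals and over the targeted bases.
    """
    targeted_bp = genome_size_bp * fraction_sequenced
    return run_yield_bp / (targeted_bp * n_individuals)


RUN_YIELD_BP = 50e9   # 50 Gbp: low end of the Illumina range in the table
GENOME_BP = 1e9       # hypothetical 1 Gbp genome
N_INDIVIDUALS = 100   # hypothetical sample size

# Whole-genome sequencing: 100% of the genome in every individual.
wgs_depth = mean_depth_per_individual(RUN_YIELD_BP, GENOME_BP, 1.0, N_INDIVIDUALS)

# RRGS: 1% of the genome (upper end of the 0.1-1% range in the table).
rrgs_depth = mean_depth_per_individual(RUN_YIELD_BP, GENOME_BP, 0.01, N_INDIVIDUALS)

print(f"WGS depth per individual:  {wgs_depth:.1f}x")
print(f"RRGS depth per individual: {rrgs_depth:.1f}x")
```

Under these assumptions the same run gives well under 1x depth per individual for whole genomes but roughly 50x for a 1% reduced representation, which is one way to see why whole-genome studies typically analyze few individuals.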
(Pinctada maxima) [49], California red abalone (Haliotis rufescens) [50] and a marine annelid (Streblospio benedicti) [51]. Both EST and 454 sequencing (Table 2) yield relatively long reads compared with the Illumina and ABI platforms (Table 2), making de novo transcriptome assembly easier and potentially more robust. To compare the utility of Roche 454 (which produces 300–500 bp reads but only 0.5 Gbp/run [52]) and Illumina Genome Analyzer (GAII, which only produces 100 bp reads, but >30 Gbp [52]) sequencing for SNP discovery, both platforms were used for SNP discovery in European hake (Merluccius merluccius). Not surprisingly, the average contig length (set of overlapping DNA sequences) was 3-fold greater for 454 than for Illumina, while the average coverage depth was 7.5-fold greater for Illumina. However, the two approaches yielded similar percentages of polymorphic loci, showing that both platforms are suitable for large-scale SNP discovery [53]. Other marine species using shorter read sequencing platforms such as Illumina and ABI SOLiD platforms for SNP discovery include mammals, fish, mollusks and arthropods [31, 32, 54–60]. De Wit et al. outline a transcriptome assembly approach followed by SNP marker development and genotyping [61]. The idea is relatively straightforward: one assembles a transcriptome from RNA-seq (RNA sequencing or whole transcriptome sequencing) data, then aligns RNA-seq reads to this assembly and finally identifies polymorphic loci. The RNA-seq reads might come from individual transcripts or from transcripts pooled from many individuals.
Focusing SNP discovery on expressed genes has the significant advantage of targeting many functional portions of the genome (i.e. all SNPs come from mRNAs, most of which encode proteins and do not include non-coding genomic regions) [61]. Knowing the function of a locus containing a SNP of interest gives further insight into the putative SNP phenotype and is useful for downstream analyses, for instance gene functional category enrichment analyses. This knowledge can be critically important when looking for signatures of natural selection and trying to determine the functional consequences of variation that seems to be under selection or adaptive.
One shortcoming of targeting expressed genes results from differences in allele-specific expression. Allele-specific expression, where one allele is preferentially expressed, can result in inaccurately calling an individual a homozygote rather than a heterozygote. Perhaps the biggest disadvantage of using
The RRGS data for the genetic variation within and among these and other populations are very similar to previous studies using allozymes and microsatellites [80, 81], which suggests that there is little if any difference among South Florida sailfin molly populations. Yet, among the populations from salt marsh flats in the Florida Keys, there are SNPs with significant genetic diversity among populations (Figure 2B [71]). Specifically, these three similar populations have 18 SNPs with FST values that are unlikely to occur relative to permutation of SNPs with similar heterozygosities (p-value < 0.01, false discovery rate = 0.1; Figure 2B). Using these loci, there is a strong structure among populations (Figure 2C). The meaningful difference between RRGS and previous measures of genetic diversity is the number of loci examined, which allows more precise delineations of population structure and also facilitates identifying loci with excessive FST values that could indicate adaptive divergence. Thus, the power of RRGS studies is that they enhance previous studies and can also distinguish subtle but evolutionarily important genetic differences.
The resolution of genetic differences based on RRGS is found in other studies. In 37 red abalone (Haliotis rufescens), the vast majority of SNPs had little genetic differentiation among populations [55]. Yet, 691 of 22 000 loci had significantly higher FST values that readily distinguish populations along the California Coast [55]. Similarly, among Baltic and Atlantic herring (Clupea harengus), most SNPs support genetic homogeneity, reflecting the lack of obvious dispersal barriers in marine habitats [32, 54]. These SNPs for many or a majority of loci yield similar results to studies using allozymes or microsatellites where population differentiation across large geographic areas is very low (FST values 0.005). Yet, out of 5985 SNPs, there were 117 (2.0%) that showed evidence of substantial divergence, corresponding to FST values = 0.128 [54]. In a similar analysis, when investigating 440 817 SNPs, 3847 SNPs (approximately 1%) showed significantly strong differences among populations, with a few at or near fixation [32]. Similar data are found in temporal isolation of pink salmon (Oncorhynchus gorbuscha) and Atlantic salmon (Salmo salar), where a few out of 1000s of SNPs are associated with migration and reproduction timing [59, 60].
Data from these studies consistently detect a small percentage of SNPs with elevated divergence between groups or populations that exceeds neutral expectations. These studies highlight the power of genomic approaches: discovery and characterization of many thousands of polymorphic loci provide a wide breadth of information where 1–5% of SNPs distinguish important population structure. Many of these informative SNPs have FST values that are indicative of non-neutral processes and thus suggest that adaptive evolution acting on many loci is common.

Technical aspects: alignment, sequence depth, HWE and LD
The big difference between traditional genetic markers and genetic markers from genomic approaches is the number of markers. Genomic approaches can yield hundreds of thousands and even millions of genetic markers. However, except for number, genetic markers generated via genomic approaches are very similar to those generated via less high-throughput approaches and should be treated similarly [31, 71]. This entails filtering the data so that only high-quality and high-confidence genetic markers are used to address the relevant biological questions. An initial filtering step begins with sequence coverage depth. The general idea is that a polymorphism that occurs in multiple sequences
and in multiple individuals is unlikely to be a technical artifact (e.g. a sequencing error). Yet, there are potential technical problems specific to RRGS approaches [27, 64, 65].
Five technical aspects to be aware of are ascertainment bias, alignment method, sequence depth, mis-alignment of paralogs and linkage disequilibrium (LD). SNP discovery should use individuals from across the population range versus genotyping one population and investigating these population-specific polymorphisms. Examining SNPs discovered in only one population leads to ascertainment bias [82]. This ascertainment bias occurs because polymorphisms are population-specific, and defining SNPs in one population will cause one to miss SNPs in a second population, thus creating a false sense of lower nucleotide diversity in the second population.
An important technical aspect of RRGS approaches starts with alignment. In general, short sequence reads or tags are aligned to a reference genome or, alternatively, are assembled into a set of contiguous sequences, to which individual sequence reads are aligned. The alignment method has a large effect on the resulting SNP pool. For instance, the Tassel pipeline [83, 84] for species with a reference genome and the related UNEAK pipeline for species without a reference genome are used to identify SNPs from RRGS data. Tassel and other bioinformatics approaches that rely on aligning short sequence reads have the option of using Bowtie [85] and BWA [86]. However, using the same raw sequencing data, Bowtie and BWA result in different final SNP sets: in our and others’ experience, only approximately 50% of reads produced shared alignments with these two tools [87–89]. Even after filtering to remove SNPs with too few reads and excessive Ho, this difference in which SNPs are identified is not eliminated, although it does greatly reduce the frequencies of alternate SNP identification [72]. It is difficult to suggest the cause or the consequences of identifying different SNPs from the same sequences, but one can compare these two approaches and inquire about the frequencies of paralogs and sequencing depth to choose an objective approach (see below).
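One simple way to compare the SNP sets produced by two alignment tools is to count how many SNPs are shared and how many are unique to each tool. The sketch below is illustrative only: the function name and the SNP identifiers are hypothetical, and it assumes both call sets use the same reference coordinates so that a (contig, position) pair identifies the same site in each set. It measures overlap between final SNP sets, which is not the same quantity as the fraction of reads with shared alignments discussed above.

```python
def snp_set_concordance(snps_a, snps_b):
    """Summarize the overlap between two collections of SNP identifiers."""
    a, b = set(snps_a), set(snps_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Jaccard index: shared SNPs as a fraction of all SNPs called by either tool.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical SNP calls as (contig, position) tuples from the same raw reads.
bowtie_snps = [("tag1", 12), ("tag2", 40), ("tag3", 7), ("tag5", 33)]
bwa_snps = [("tag1", 12), ("tag3", 7), ("tag4", 21), ("tag5", 33)]

result = snp_set_concordance(bowtie_snps, bwa_snps)
print(result)
```

Running the same summary before and after filtering on read depth and heterozygosity would show how much of the disagreement the filters remove.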
Sequencing depth, the number of reads per locus or SNP, defines the probability of identifying allelic variation, defining individual genotypes and avoiding sequencing errors. To identify true allelic variants (versus sequencing errors) and genotypes (where the identity of a SNP variant on both chromosomes is estimated) requires sufficient read depth. If there were an equal chance of sequencing both alleles, the probability of identifying

effect [93], Figure 3B), and these informative SNPs can be used to describe the connectivity and differences within a species.
LD is the significant correlation among SNPs. This statistical association among SNPs when physically close is due to limits of recombination to create independence among sites. A potential advantage for genomic approaches that identify 1000s of SNPs is a more accurate estimate of effective population size
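The read-depth argument in this section can be made concrete with a simple binomial sketch. It is not taken from the source: it assumes each read samples one of a heterozygote's two alleles independently, ignores sequencing errors, and counts an allele as detected if it appears in at least one read (genotype callers usually demand more). The function names and the `allele_fraction` parameter are hypothetical; a value other than 0.5 mimics unequal sampling of the two alleles, as with allele-specific expression in RNA-seq data.

```python
def prob_both_alleles_seen(depth, allele_fraction=0.5):
    """Probability that both alleles of a heterozygote appear among `depth` reads.

    allele_fraction is the chance that any single read carries the first allele.
    """
    if not 0.0 <= allele_fraction <= 1.0:
        raise ValueError("allele_fraction must be between 0 and 1")
    if depth < 2:
        return 0.0  # a single read can never show both alleles
    f = allele_fraction
    # 1 - P(all reads carry allele 1) - P(all reads carry allele 2)
    return 1.0 - f ** depth - (1.0 - f) ** depth


def min_depth_for(target_prob, allele_fraction=0.5, max_depth=10_000):
    """Smallest depth at which both alleles are seen with probability >= target_prob."""
    for depth in range(2, max_depth + 1):
        if prob_both_alleles_seen(depth, allele_fraction) >= target_prob:
            return depth
    return None  # target not reachable within max_depth


print(prob_both_alleles_seen(5))       # equal sampling of both alleles
print(min_depth_for(0.99))             # equal sampling
print(min_depth_for(0.99, 0.8))        # one allele in 80% of reads
```

With equal sampling, five reads detect both alleles about 94% of the time; when one allele dominates the reads, considerably more depth is needed to avoid calling a heterozygote a homozygote.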
because they inform us about the conservation genetics of isolated populations, the genes affecting important phenotypes (e.g. reproductive schedules) and the effectiveness of adaptive change. Currently, many genomic studies suggest that marine species have greater population structure than previously appreciated. Additionally, many of these studies identify many loci that appear to be evolving by natural selection. An exten-

10. Wittkopp PJ. Variable gene expression in eukaryotes: a network perspective. J Exp Biol 2007;210:1567–75.
11. Wray GA, Hahn MW, Abouheif E, et al. The evolution of transcriptional regulation in eukaryotes. Mol Biol Evol 2003;20:1377–419.
12. Allendorf FW, Hohenlohe PA, Luikart G. Genomics and the future of conservation genetics. Nat Rev Genet 2010;11:
34. Narum SR, Buerkle CA, Davey JW, et al. Genotyping-by-sequencing in ecological and conservation genomics. Mol Ecol 2013;22:2841–7.
35. Hodges E, Xuan Z, Balija V, et al. Genome-wide in situ exon capture for selective resequencing. Nat Genet 2007;39:1522–7.
36. Willette DA, Allendorf FW, Barber PH, et al. So, you want to
53. Milano I, Babbucci M, Panitz F, et al. Novel tools for conservation genomics: comparing two high-throughput approaches for SNP discovery in the transcriptome of the European hake. PLoS One 2011;6:e28008.
54. Corander J, Majander KK, Cheng L, et al. High degree of cryptic population differentiation in the Baltic Sea herring Clupea harengus. Mol Ecol 2013;22:2931–40.
73. Milano I, Babbucci M, Cariani A, et al. Outlier SNP markers reveal fine-scale genetic structuring across European hake populations (Merluccius merluccius). Mol Ecol 2014;23:118–35.
74. Beaumont MA, Balding DJ. Identifying adaptive genetic divergence among populations from genome scans. Mol Ecol 2004;13:969–80.
86. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 2009;25:1754–60.
87. Nielsen R, Mattila DK, Clapham PJ, et al. Statistical approaches to paternity analysis in natural populations and applications to the North Atlantic humpback whale. Genetics 2001;157:1673–82.