Reloj Molecular
Opinion
The molecular clock has played an important role in biological research, both as a description of the evolutionary process and as a tool for inferring evolutionary timescales. Genomic data have provided valuable insights into the molecular clock, allowing the patterns and causes of evolutionary rate variation to be characterized in increasing detail. I explain how genome sequences offer exciting opportunities for estimating the timescale of the Tree of Life. I describe the different approaches that have been used to deal with the computational and statistical challenges encountered in molecular clock analyses of genomic data. Finally, I offer a perspective on the future of molecular clocks, highlighting some of the key limitations and the most promising research directions.

Corresponding author: Ho, S.Y.W. (simon.ho@sydney.edu.au).
Keywords: molecular clock; genomic data; rate heterogeneity; pacemaker models; phylogenetic analysis.
0169-5347/© 2014 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tree.2014.07.004

The molecular clock
One of the fundamental goals of biological research is to understand the evolutionary process. By allowing the raw materials of evolution to be analyzed, genetic data have had an immense impact on this endeavor. In this context, the molecular clock has been extremely valuable owing to its dual role as a description of the pattern of molecular evolution and as a tool for estimating evolutionary rates and timescales. The importance of the molecular clock has not diminished over the years, with its role shifting to the analysis of evolutionary patterns and processes on a genomic scale.

The molecular clock hypothesis, which postulates a constancy of evolutionary rates among lineages, was introduced in the early 1960s [1] and played a part in the development of molecular evolutionary theory. The apparent homogeneity of rates among lineages was one of the inspirations for the neutral theory, which proposed that a large proportion of mutations do not alter the fitness of an organism [2]. The neutral theory emphasized the importance of genetic drift and predicted that evolutionary rates depended on rates of spontaneous mutation, independently of population size. According to the neutral theory, however, absolute rates of evolution (per unit time) are expected to have a negative relationship with generation length. This is because most inherited mutations are thought to occur during replication of germline DNA, and species with long generations therefore tend to accumulate fewer average mutations per year. The observation that evolutionary rates in coding sequences were actually independent of generation length was one of the motivations for the development of the nearly-neutral theory shortly afterwards [3]. The nearly-neutral theory gave population size a key role in governing the relative impacts of drift and selection. To this day, the neutral theory and the molecular clock remain important null models in evolutionary analysis.

The molecular clock is familiar to many researchers as a tool for estimating evolutionary rates and timescales. Early molecular clock studies focused on the timescale of hominid evolution [4] and were followed by ambitious efforts to date some of the deepest nodes in the Tree of Life [5,6]. Alongside these studies was a growing recognition of rate variation among lineages, in contradiction to the molecular clock. This motivated the development of more powerful methods of estimating evolutionary timescales, including models that incorporated rate heterogeneity across lineages [7]. As a result, the role of the molecular clock grew dramatically and molecular dating analyses now form an important component of many evolutionary studies [8,9]. Our understanding of molecular evolutionary rates has been aided by the growth in genomic sequence data. These data have brought significant computational challenges but offer a rich source of information for resolving the timescale of the Tree of Life.

Heterogeneity in molecular evolutionary rates
Molecular evolution involves dynamic interactions among the forces of mutation, selection, and drift. As a consequence, rate variation across lineages and across the genome are ubiquitous features of the evolutionary process. Large datasets increase the power of statistical methods to test hypotheses about molecular evolution, including the impacts of different biological and environmental factors that affect evolutionary rates.

The causes of rate variation can be broadly divided into gene effects, lineage effects, and residual effects [10,11]. Gene effects lead to different rates between loci, and have long been recognized as an intrinsic feature of the molecular evolutionary process [12]. They can be caused by differences in the proportion of functionally constrained sites and by regional heterogeneities in mutation rates across the genome [11]. Gene effects represent the only form of rate variation recognized in the simplest clock model, known as the strict or global molecular clock.

Lineage effects refer to factors that act across the whole genome, such as differences in particular physiological and life-history traits. These might include generation time,
Trends in Ecology & Evolution xx (2014) 1–8 1
TREE-1847; No. of Pages 8
Opinion Trends in Ecology & Evolution xxx xxxx, Vol. xxx, No. x
[Figure 1 panels: Universal pacemaker; Multiple pacemaker; Degenerate multiple pacemaker]
Figure 1. Comparison of the three pacemaker models of genome evolution [16], each illustrated using rooted four-taxon trees from five loci. In the Universal Pacemaker
model, loci evolve at different rates but share the same pattern of rate variation across branches. In the Multiple Pacemaker model, loci evolve at different rates but
groups of loci share the same patterns of rate variation across branches. In the Degenerate Multiple Pacemaker model, each locus has its own, distinct pattern of rate
variation across branches. These three models involve different interactions between gene effects and lineage effects, leading to contrasting patterns of rate variation
between loci.
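The contrast drawn in the Figure 1 caption can be made concrete with a small simulation. The following is a minimal sketch, not the method of [16]: the function names, the uniform rate ranges, and the round-robin assignment of loci to pacemakers are all assumptions made for illustration, and branch durations are fixed at one time unit so that a branch length equals its rate. Each locus has its own gene rate, and each pacemaker is a shared vector of per-branch multipliers. Rescaling the branch lengths of a locus to sum to one removes the gene effect, so the number of distinct rescaled profiles shows how many patterns of among-lineage rate variation are present.

```python
import random


def simulate_branch_lengths(n_loci, n_branches, n_pacemakers, seed=0):
    """Return one list of branch lengths per locus (hypothetical toy model).

    Gene effect: each locus draws its own overall rate.
    Lineage effect: each pacemaker is a vector of per-branch multipliers.
    Loci are assigned to pacemakers in round-robin order.
    """
    rng = random.Random(seed)
    pacemakers = [
        [rng.uniform(0.5, 2.0) for _ in range(n_branches)]
        for _ in range(n_pacemakers)
    ]
    loci = []
    for locus in range(n_loci):
        gene_rate = rng.uniform(0.1, 1.0)
        pattern = pacemakers[locus % n_pacemakers]
        # Branch durations are taken as 1, so length = gene rate * multiplier.
        loci.append([gene_rate * m for m in pattern])
    return loci


def profile(branch_lengths):
    """Rescale branch lengths to sum to 1, removing the locus-wide rate."""
    total = sum(branch_lengths)
    return [b / total for b in branch_lengths]


def count_distinct_profiles(loci, tol=1e-9):
    """Count rescaled profiles that differ by more than tol on some branch."""
    distinct = []
    for lengths in loci:
        p = profile(lengths)
        if not any(all(abs(a - b) <= tol for a, b in zip(p, q)) for q in distinct):
            distinct.append(p)
    return len(distinct)


# Five loci on a rooted four-taxon tree (six branches), as in Figure 1.
universal = simulate_branch_lengths(5, 6, n_pacemakers=1)
multiple = simulate_branch_lengths(5, 6, n_pacemakers=2)
degenerate = simulate_branch_lengths(5, 6, n_pacemakers=5)

for name, loci in [("universal", universal), ("multiple", multiple),
                   ("degenerate", degenerate)]:
    print(name, count_distinct_profiles(loci))
```

In this toy setting, one pacemaker corresponds to the Universal Pacemaker model, an intermediate number to the Multiple Pacemaker model, and one pacemaker per locus to the Degenerate Multiple Pacemaker model.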
metabolic rate, body size, and the performance of DNA repair mechanisms [13]. Interactions between lineage and gene effects are known as residual effects and act heterogeneously across the genome. Residual effects can be caused by selection, variation in population size, and other factors [14,15]. By causing the pattern and extent of among-lineage rate heterogeneity to vary across loci, residual effects are particularly important at the genome scale.

The interplay between gene and lineage effects is encapsulated in the pacemaker models of genome evolution (Figure 1) [16]. The Universal Pacemaker model ascribes variation in molecular rates to the presence of both gene and lineage effects. In this model, loci can evolve at distinct rates from one another, but are all governed equally by lineage effects. Notably, this model assumes the absence of residual effects. Studies of archaeal and bacterial genomes [16,17] and of genome-wide collections of trees from Drosophila and yeast [18] found that the Universal Pacemaker model explained much of the observed variation in rates. This stands in contrast with the Degenerate Multiple Pacemaker model, which assumes that each locus has a distinct pattern of rate variation across lineages, and genomic evolution is thereby dominated by residual effects. Between these two models sits the Multiple Pacemaker model, which involves a moderate level of residual effects. In this model there are clusters of genes that share the same lineage effects. This is the most plausible model of genomic evolution and appears to have some statistical support [16]. The growing availability of genomic datasets will enable further comparisons of these models of molecular evolution.

Dating using genome sequences
The molecular clock continues to be an important tool for estimating evolutionary rates and timescales in the genomic era. There has been a steady increase in the size of datasets used for such phylogenetic dating analyses, both inspiring and enabling the development of sophisticated, parameter-rich models of molecular evolution. Much of the progress in molecular clock methods over the past few decades can be described as an effort to account for lineage effects – that is, dealing with rate variation among branches in the phylogeny [7]. With the growing use of large datasets comprising multiple markers, the impacts of residual effects are becoming increasingly important. To appreciate how methodological development has benefited from the increase in genetic data it is worthwhile to
consider how different methods have been designed to account for the three major forms of rate variation. This review focuses on sequence data, which are by far the most widely used type of genome-scale data in molecular clock analyses.

Accounting for rate variation across lineages
For the first few decades of its history, the term ‘molecular clock’ referred to the strict-clock model, which does not account for lineage effects. This still provides a useful null model for testing rate variation among lineages. In addition, use of the strict clock remains commonplace in analyses of datasets with low genetic variation, such as those comprising samples from a single population or from closely related species [19].

Since the late 1990s there has been a proliferation of methods that relax the assumption of rate constancy [7]. These relaxed-clock models allow the rate to vary across lineages such that each branch of the phylogeny can have a distinct evolutionary rate. The various relaxed-clock models make different assumptions about how rates vary throughout the tree, such as the degree of correlation between rates in neighboring branches. These models have been reviewed and compared in detail [7,20], although some uncertainty remains about their relative merits [21–23].

The wide range of molecular clock models has made the estimation of evolutionary rates and timescales a substantially statistical exercise. However, using an inappropriate clock model can produce highly misleading estimates of evolutionary rates and timescales [24–27]. Accordingly, choosing an appropriate model is an important step in any phylogenetic dating analysis, and there are several methods that can be used for model selection. Although the various clock models provide different descriptions of rate variation across lineages, they do not make any statements about absolute rates or node times. In this respect, all molecular clock methods share a reliance on calibrations, which are usually informed by paleontological or geological data (Box 1).

Box 1. Calibrating the molecular clock
Molecular clock models describe the patterns of rate variation across lineages, allowing estimation of the relative ages of nodes in the phylogeny. To place an absolute timescale on the phylogenetic tree, the molecular clock needs to be calibrated. This can be done in one of two ways: by setting the rate to a known value, or by constraining the age of at least one node in the phylogeny. The scarcity of reliable rate estimates means that the latter approach is usually preferred, especially when there is substantial rate variation among lineages. Generally, increasing the number of calibrations leads to an improvement in molecular clock estimation [25,66].

Ages of nodes can be constrained on the basis of fossil or geological information. The earliest fossil representative of a lineage can be used to infer the divergence time of that lineage from its sister lineage [67]. The age constraint can be implemented in several ways, of which the simplest is to fix the age of the node to a point value. However, this ignores any uncertainty in the calibration age, such as that associated with radiometric dating or taxonomic assignment. Instead, a preferable approach is to account for uncertainty by allowing the node age to vary within chosen constraints [68]. In Bayesian phylogenetic analysis this can be done by specifying an informative prior distribution for the node age, typically in the form of a lognormal or exponential distribution [69]. Choosing the parameters of these distributions can be a difficult exercise [70], but there are several formalized methods that allow this process to be informed by the fossil data [71,72]. Alternatively, calibrating fossils can be included in combined analyses of morphological and molecular data such that their associated temporal information is incorporated implicitly [63,73].

If ancient genomes are available, the ages of the sequences can be used for calibration. This is possible in studies based on well-preserved ancient specimens or on rapidly evolving viruses sampled through time [74]. Calibrations based on ancient sequences can be very effective because there is usually no uncertainty in the attachment of dates to nodes in the phylogeny. Moreover, these dates are often known with considerable precision. Any nontrivial uncertainty in the sequence ages can be incorporated into the molecular clock analysis [75,76].

Accounting for rate variation along the sequence
Evolutionary rates can vary across the genome, among regions, among loci, and even between nucleotide sites. In phylogenetic analyses of DNA sequence data, rate heterogeneity among sites has typically been taken into account by assigning sites to a small number of discrete rate categories, based on the gamma distribution [28]. This method can also be used when analyzing genomic datasets, but an alternative approach is to allow each locus or group of loci to have a distinct evolutionary rate [29–31]. For example, a relative rate parameter can be assigned to each locus, with these rates following a gamma [30] or Dirichlet distribution [31]. These methods, however, are only appropriate when gene and lineage effects are present, but not residual effects (conforming to the Universal Pacemaker model of genome evolution; Figure 1).

Accounting for residual effects
A more difficult problem emerges when there are significant residual effects such that different sites or loci show contrasting patterns of rate variation in different branches of the tree. A familiar example of this is associated with the codon structure of protein-coding genes: patterns of rate variation at second codon positions are influenced by selection, whereas rates at third codon sites are more likely to be subject to lineage effects [32]. In this particular example, a simple solution is to assign a separate model of among-lineage rate variation to each codon position [33].

More generally, one can account for residual effects by partitioning the data into subsets based on their pattern of among-lineage rate heterogeneity. An appropriate strategy might be to assign separate molecular clock models to different genomic regions, different loci, or different codon positions in protein-coding genes [33]. Alternatively, a statistical approach can be taken to identify the optimal partitioning scheme for the data. This can be done, for example, by comparing the estimates of branch lengths from the different loci in the dataset and using a clustering method to group loci with similar patterns of rate variation among branches [29,34]. These patterns can be summarized using tree-distance metrics [34] or principal-components analysis [29]. In the molecular clock analysis, a separate model of among-lineage rate variation can then be assigned to each subset of the data. This is analogous to the common practice of partitioning the dataset and assigning a separate substitution model to each data subset [35].
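The idea of grouping loci by their pattern of among-lineage rate variation can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure of [29] or [34]: the function names, the Euclidean distance between rescaled branch-length profiles, the single-linkage grouping rule, the threshold value, and the example branch lengths are all hypothetical. The sketch assumes that every locus has branch-length estimates on the same fixed topology, listed in the same branch order.

```python
from math import sqrt


def profile(branch_lengths):
    """Rescale branch lengths to sum to 1 so that loci evolving at different
    overall rates (gene effects) can still share the same profile."""
    total = sum(branch_lengths)
    return [b / total for b in branch_lengths]


def distance(p, q):
    """Euclidean distance between two rescaled branch-length profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def partition_loci(branch_lengths_by_locus, threshold):
    """Group loci whose profiles lie within `threshold` of each other
    (single linkage, implemented with a small union-find)."""
    names = sorted(branch_lengths_by_locus)
    profiles = {n: profile(branch_lengths_by_locus[n]) for n in names}
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if distance(profiles[a], profiles[b]) <= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Made-up branch lengths for four loci on the same four-branch tree.
example = {
    "locusA": [0.10, 0.20, 0.30, 0.40],
    "locusB": [0.05, 0.10, 0.15, 0.20],  # same pattern as locusA, half the rate
    "locusC": [0.40, 0.30, 0.20, 0.10],
    "locusD": [0.80, 0.60, 0.40, 0.20],  # same pattern as locusC, twice the rate
}
subsets = partition_loci(example, threshold=0.05)
print(subsets)
```

Each resulting subset could then be given its own model of among-lineage rate variation in the dating analysis. In practice the choice of distance measure and of the number of clusters would need statistical justification rather than a fixed threshold.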
Analyzing large datasets
With the drive to base evolutionary inferences on datasets of increasing size, an ongoing feature of sampling has been the trade-off between numbers of loci and taxa. Some studies have focused on estimating evolutionary timescales for very large numbers of taxa, but these have typically been restricted to small numbers of loci or have used supertree methods to merge smaller trees inferred from different markers [36,37]. The emergence of high-throughput sequencing technology has made it possible to assemble large datasets that can be characterized as having (i) a small number of markers for a large number of taxa; (ii) a large number of markers for a small number of taxa; or (iii) a large number of markers for a large number of taxa. As I explain below, these three types of dataset present different challenges for molecular clock analyses.

Datasets comprising small numbers of markers for a very large sample of taxa can be used to answer a variety of evolutionary questions, particularly those associated with macroevolutionary processes such as diversification [36,37]. Several methods can be used to estimate evolutionary timescales for very large phylogenetic trees [38–41] (Box 2). These methods share several features in common. All treat inference of the phylogeny as a separate problem, and thus the topology and branch lengths (measuring the amount of genetic change) are assumed to be known for the dating analysis. Although this leads to a considerable reduction in the computational burden, it also places a limit on model complexity because the sequence data are not always analyzed directly. In addition, most of the methods that can analyze large numbers of taxa are unable to accommodate complex calibrating information. Thus, these methods typically do not handle uncertainty in calibrations and the phylogeny in an ideal manner.

In molecular dating analyses there has been growing use of datasets that comprise large amounts of sequence data from small numbers of taxa. As with many advances in molecular phylogenetics, the first steps in this direction were taken using organellar genomes. Molecular dating using complete sequences of organellar genomes is now relatively common (e.g., [42,43]), whereas few dating studies based on nuclear DNA have taken advantage of the data produced by early genome projects (e.g., [44]). When there are few taxa in the dataset, the number of distinct site patterns in the alignment is small. Sites sharing the same pattern of variation across taxa can be grouped for the purposes of likelihood calculation such that molecular clock analyses of these datasets are computationally tractable even with intensive Bayesian and likelihood methods (Box 2). However, datasets with many loci but few taxa are subject to several disadvantages associated with sparse taxon sampling, with impacts on tree balance, performance of phylogenetic inference, and estimation of macroevolutionary parameters [45]. The addition of taxa often comes at the cost of an increased proportion of missing data, with uncertain impacts on molecular clock analysis [46,47]. Analyses of multilocus data must also deal with incongruence between trees estimated from different loci, the extent of which will depend on the taxonomic scale being investigated (Box 3).

Genome-scale datasets from moderate to large numbers of taxa are still relatively rare. However, with various genome-sequencing initiatives underway, such as the Genome 10K Project [48] and the i5K Project [49], very large datasets will soon become much more common. Analyses of these data, which have the potential to comprise millions of characters from tens to hundreds of taxa, can be handled using rapid likelihood-based methods but remain problematic for most Bayesian phylogenetic methods. In particular, enormous computational demands are made by the calculation of the full likelihood and by the estimation of the posterior using Markov chain Monte Carlo sampling [50]. These analyses will benefit from efforts to improve the computational efficiency of molecular dating analyses. In
[Figure 2 chronogram: time axis marked at 150, 100, and 50 Ma. Tips: Monotremata, Marsupialia, Xenarthra, Afrotheria, Lagomorpha, Rodentia, Scandentia, Primates, Eulipotyphla, Cetartiodactyla, Chiroptera, Perissodactyla, Carnivora. Labeled clades: Theria, Placentalia.]
Figure 2. Evolutionary timescale of ordinal diversification in mammals. Chronogram (dates in Ma) estimated in a Bayesian relaxed-clock analysis of 14 632 nuclear genes,
calibrated using 38 fossil-based constraints on the ages of nodes [29]. Light-blue bars at nodes represent 95% credibility intervals of divergence-time estimates. Triangles
denote clades represented by more than one species in the analysis. Even with genome-scale data, the age estimates for some nodes have considerable uncertainty. The
orange vertical line indicates the timing of the Cretaceous–Paleogene boundary. Most orders of placental mammals diversified in the Paleogene, but the basal divergences
in Placentalia occurred in the Late Cretaceous. The date estimates were robust to a range of factors including data partitioning and various model priors [29,31].
and Distributed Processing Workshops and PhD Forum. pp. 539–548 IEEE Computer Society Washington
54 Ayres, D.L. et al. (2012) BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics. Syst. Biol. 61, 170–173
55 Yang, Z. and Rannala, B. (2006) Bayesian estimation of species divergence times under a molecular clock using multiple fossil calibrations with soft bounds. Mol. Biol. Evol. 23, 212–226
56 Britton, T. (2005) Estimating divergence times in phylogenetic trees without a molecular clock. Syst. Biol. 54, 500–507
57 dos Reis, M. and Yang, Z. (2013) The unbearable uncertainty of Bayesian divergence time estimation. J. Syst. Evol. 51, 30–43
58 Rannala, B. and Yang, Z. (2007) Inferring speciation times under an episodic molecular clock. Syst. Biol. 56, 453–466
59 Wheat, C.W. and Wahlberg, N. (2013) Critiquing blind dating: the dangers of over-confident date estimates in comparative genomics. Trends Ecol. Evol. 28, 636–642
60 Lanfear, R. et al. (2010) Watching the clock: studying variation in rates of molecular evolution between species. Trends Ecol. Evol. 25, 495–503
61 Lartillot, N. and Poujol, R. (2011) A phylogenetic model for investigating correlated evolution of substitution rates and continuous phenotypic characters. Mol. Biol. Evol. 28, 729–744
62 Wang, M. et al. (2011) A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation. Mol. Biol. Evol. 28, 567–582
63 Ronquist, F. et al. (2012) A total-evidence approach to dating with fossils, applied to the early radiation of the Hymenoptera. Syst. Biol. 61, 973–999
64 Wilkinson, R.D. et al. (2011) Dating primate divergences through an integrated analysis of palaeontological and molecular data. Syst. Biol. 60, 16–31
65 Heath, T.A. et al. (2014) The fossilized birth-death process for coherent calibration of divergence-time estimates. Proc. Natl. Acad. Sci. U.S.A. http://dx.doi.org/10.1073/pnas.1319091111
66 Paradis, E. (2013) Molecular dating of phylogenies by likelihood methods: a comparison of models and a new information criterion. Mol. Phylogenet. Evol. 67, 436–444
67 Donoghue, P.C. and Benton, M.J. (2007) Rocks and clocks: calibrating the Tree of Life using fossils and molecules. Trends Ecol. Evol. 22, 424–431
68 Sanderson, M.J. (1997) A nonparametric approach to estimating divergence times in the absence of rate constancy. Mol. Biol. Evol. 14, 1218–1231
69 Ho, S.Y.W. and Phillips, M.J. (2009) Accounting for calibration uncertainty in phylogenetic estimation of evolutionary divergence times. Syst. Biol. 58, 367–380
70 Parham, J.F. et al. (2012) Best practices for justifying fossil calibrations. Syst. Biol. 61, 346–359
71 Heath, T.A. (2012) A hierarchical Bayesian model for calibrating estimates of species divergence times. Syst. Biol. 61, 793–809
72 Nowak, M.D. et al. (2013) A simple method for estimating informative node age priors for the fossil calibration of molecular divergence time analyses. PLoS ONE 8, e66245
73 Lee, M.S.Y. et al. (2009) Phylogenetic uncertainty and molecular clock calibrations: a case study of legless lizards (Pygopodidae, Gekkota). Mol. Phylogenet. Evol. 50, 661–666
74 Drummond, A.J. et al. (2003) Measurably evolving populations. Trends Ecol. Evol. 18, 481–488
75 Shapiro, B. et al. (2011) A Bayesian phylogenetic method to estimate unknown sequence ages. Mol. Biol. Evol. 28, 879–887
76 Molak, M. et al. (2013) Phylogenetic estimation of timescales using ancient DNA: the effects of temporal sampling scheme and uncertainty in sample ages. Mol. Biol. Evol. 30, 253–262
77 Drummond, A.J. et al. (2012) Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol. Biol. Evol. 29, 1969–1973
78 Ronquist, F. et al. (2012) MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol. 61, 539–542
79 Yang, Z. (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591
80 Lartillot, N. et al. (2009) PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics 25, 2286–2288
81 Heled, J. and Drummond, A.J. (2012) Calibrated tree priors for relaxed phylogenetics and divergence time estimation. Syst. Biol. 61, 138–149
82 Hedin, M. et al. (2012) Phylogenomic resolution of paleozoic divergences in harvestmen (Arachnida, Opiliones) via analysis of next-generation transcriptome data. PLoS ONE 7, e42888
83 Stamatakis, A. (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313
84 The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65
85 Martin, M.D. et al. (2013) Reconstructing genome evolution in historic samples of the Irish potato famine pathogen. Nat. Commun. 4, 2172
86 Kubatko, L.S. and Degnan, J.H. (2007) Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst. Biol. 56, 17–24
87 Degnan, J.H. and Rosenberg, N.A. (2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol. 24, 332–340
88 Heled, J. and Drummond, A.J. (2010) Bayesian inference of species trees from multilocus data. Mol. Biol. Evol. 27, 570–580
89 Yang, Z. and Rannala, B. (2010) Bayesian species delimitation using multilocus sequence data. Proc. Natl. Acad. Sci. U.S.A. 107, 9264–9269
90 Shapiro, B. and Ho, S.Y.W. (2014) Ancient hyaenas highlight the old problem of estimating evolutionary rates. Mol. Ecol. 23, 499–501
91 Ho, S.Y.W. et al. (2011) Time-dependent rates of molecular evolution. Mol. Ecol. 20, 3087–3101
92 Pulquério, M.J. and Nichols, R.A. (2007) Dates from the molecular clock: how wrong can we be? Trends Ecol. Evol. 22, 180–184
93 Ho, S.Y.W. and Larson, G. (2006) Molecular clocks: when times are a-changin’. Trends Genet. 22, 79–83
94 Cooper, A. and Penny, D. (1997) Mass survival of birds across the Cretaceous–Tertiary boundary: molecular evidence. Science 275, 1109–1113
95 Meredith, R.W. et al. (2011) Impacts of the Cretaceous Terrestrial Revolution and KPg extinction on mammal diversification. Science 334, 521–524
96 Archibald, J.D. and Deutschman, D.H. (2001) Quantitative analysis of the timing of the origin and diversification of extant placental orders. J. Mamm. Evol. 8, 107–124