Phylogenetic Analysis
A bioinformatics tool
Presented by
Uttam Kr. Patra
2nd sem-Microbiology
© 2007, U.K.Patra 1
Introduction
Greek:
phylon = race or tribe & genetikos = relating to birth or origin
Phylogenetic trees illustrate the
evolutionary relationships among groups
of organisms, or among a family of related
nucleic acid or protein sequences.
That is, they show how the family might have been
derived during evolution.
Ancestor
Milestones
1735. Linnaeus classifies organisms by their similarities and
differences in “Systema Naturae”
1859. Darwin publishes “The Origin of Species…”
Common Phylogenetic Tree Terminology
Terminal nodes (A, B, C, D, E): represent the taxa (genes,
populations, species, etc.) used to infer the phylogeny
Branches or lineages
Internal nodes or divergence points: represent hypothetical
ancestors of the taxa
Ancestral node or ROOT of the tree
Root: the common ancestor of all taxa.
Branches of phylogenetic trees may be
scaled (top panel), in which case their
lengths represent the amount of evolutionary
change, time, or both (when there
is a molecular clock).
Orthologs are homologs resulting from speciation. They are genes in
different species that stem from a single gene in their common ancestor.
Orthologs often have similar functions.
Example: SPO11 (Baudat et al. Mol Cell 2000)
Paralogs are homologs resulting from gene duplication. They are genes
derived from a common ancestral locus that was duplicated within the
genome of an organism. Paralogs tend to have different functions.
Example: CLB1/CLB2 (Brachat et al. Genome Biology 2003).
Orthologs & Paralogs
[Figure: gene tree in which a duplication gives rise to paralogs and a
speciation gives rise to orthologs in Species a and Species b]
Data
Until ~1990, phylogenies were based mostly on morphology
DNA Sequence Evolution
3 million years ago: AAGACTT
2 million years ago: AAGGCCT, TGGACTT
1 million years ago: AGGGCAT, TAGCCCT, AGCACTT
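As a rough illustration of how such sequences become data for tree building, the sketch below counts the positions at which each pair of the three most recent sequences above differs. The function name `hamming` is our own, and the raw count ignores multiple substitutions at the same site, so it is only a first approximation of evolutionary distance.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length (aligned) sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


# The three sequences shown at the most recent time point above.
seqs = ["AGGGCAT", "TAGCCCT", "AGCACTT"]

# Pairwise difference counts: the raw material for a distance matrix.
distances = {(a, b): hamming(a, b) for a, b in combinations(seqs, 2)}

for (a, b), d in distances.items():
    print(f"{a} vs {b}: {d} differences")
```

A table of such pairwise counts is the starting point for distance-based tree building.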
THE WAY
[Flowchart: choose sequences → multiple sequence alignment →
strong similarity? (yes/no) → recognizable similarity? (yes/no)]
[Flowchart: testing whether two sequences are significantly similar]
Choose two sequences.
Are the sequences protein sequences? (yes/no)
Do the sequences encode proteins? (yes/no)
Do the sequences encode proteins and have introns? (yes/no)
Is the alignment of high quality? (yes/no)
Alter parameters: did the alignment improve? (yes/no)
Perform a statistical test of the alignment score.
Sequences are significantly similar.
Maximum parsimony methods
Maximum parsimony methods try to find the most efficient path between
two evolutionary states. This approach is based on finding the minimum
number of mutations to explain the differences among the sequences.
An initial tree topology is specified and each position in the sequence
alignment is examined for its support of each tree. All reasonable
topologies are examined until a tree with the minimal number of changes
is found and designated the “best” tree.
A multiple sequence alignment is needed.
Best suited to small numbers of rather similar sequences.
PHYLIP is a widely used software package that implements maximum
parsimony and other methods:
http://evolution.genetics.washington.edu/phylip.html
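To make "minimum number of changes" concrete, here is a minimal sketch that scores the three possible topologies for four taxa using Fitch's algorithm, one standard way of counting the fewest substitutions needed on a fixed tree. It is an illustration only, not a description of how PHYLIP works internally; the alignment, taxon names and function names are hypothetical.

```python
def fitch_site(tree, states):
    """Return (candidate ancestral states, minimum changes) for one column.

    tree   : a leaf name (str) or a pair (left_subtree, right_subtree)
    states : dict mapping leaf name -> character at this alignment column
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # Children can share a state: no extra change needed here.
        return common, left_cost + right_cost
    # Children disagree: one change must occur on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical aligned sequences for four taxa.
alignment = {"A": "AAGCT", "B": "AAGCC", "C": "ACGTC", "D": "ACGTT"}

# The three possible unrooted topologies for four taxa, written as nested pairs.
topologies = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

scores = {name: parsimony_score(t, alignment) for name, t in topologies.items()}
best = min(scores, key=scores.get)
print(scores)
print("most parsimonious:", best)
```

With four taxa all three topologies can be checked; the number of topologies grows so quickly with the number of taxa that exhaustive scoring soon becomes impractical.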
Maximum parsimony methods
Advantages:
Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
Can be used on molecular and non-molecular (e.g., morphological) data.
Can tease apart types of similarity (shared-derived, shared-ancestral,
homoplasy)
Can be used for character (can infer the exact substitutions) and rate
analysis.
Can be used to infer the sequences of the extinct (hypothetical)
ancestors.
Disadvantages:
Can be fooled by high levels of homoplasy (‘same’ events).
Can become positively misleading in the “Felsenstein Zone”, where long
branches are wrongly grouped together (long-branch attraction).
Maximum likelihood methods
The maximum likelihood method checks every reasonable
tree topology and examines the support for each tree at
every sequence position.
Uses probability calculations to find the tree that best
accounts for the observed sequence variation.
All possible trees are considered, which is time-consuming,
so only a few sequences can be analyzed.
It is possible to evaluate trees with different mutation rates
in different lineages.
Uses explicit evolutionary models of base substitution
(e.g., Jukes-Cantor, Kimura).
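As a small illustration of the probability calculation, the sketch below evaluates the Jukes-Cantor log-likelihood of a pairwise alignment as a function of the evolutionary distance d (expected substitutions per site) and finds the value of d that maximizes it. This is the simplest possible case, two sequences and one branch, not a full tree search. The site counts are hypothetical and the function names are our own.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of a pairwise alignment under Jukes-Cantor at distance d."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site keeps the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # 0.25 is the equilibrium frequency of the base seen in the first sequence.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def jc69_distance(n_sites, n_diff):
    """Closed-form Jukes-Cantor distance for the observed fraction of differences."""
    p = n_diff / n_sites
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical data: 100 aligned sites, 10 of which differ.
n_sites, n_diff = 100, 10

# Crude grid search for the distance with the highest likelihood.
grid = [i / 1000 for i in range(1, 1001)]
best_d = max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

print("grid-search ML distance:", best_d)
print("closed-form JC distance:", round(jc69_distance(n_sites, n_diff), 4))
```

The grid search and the closed-form correction agree to within the grid spacing, and both exceed the raw fraction of differences (0.10) because the model allows for multiple substitutions at the same site.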
Maximum likelihood (ML) methods
Advantages:
Are inherently statistical and evolutionary model-based.
Usually the most ‘consistent’ of the methods available.
Can be used for character (can infer the exact substitutions) and rate
analysis.
Can be used to infer the sequences of the extinct (hypothetical)
ancestors.
Can help account for branch-length effects in unbalanced trees.
Can be applied to nucleotide or amino acid sequences, and other
types of data.
Disadvantages:
Are not as simple and intuitive as many other methods.
Are computationally very intensive (limits the number of taxa and the
length of sequence).
Like parsimony, can be fooled by high levels of homoplasy.
Violations of the assumed model can lead to incorrect trees.
Distance methods
Neighbor Joining (NJ)
Reconstructs an unrooted tree
Calculates branch lengths
Based on star decomposition
At each stage, the two nodes that are closest after correcting for their
average distance to all other nodes are chosen and joined as neighbors.
This is repeated until all of the nodes are paired together.
Advantages
Is fast and thus suited for large datasets and for bootstrap
analysis.
Permits lineages with largely different branch lengths.
Permits correction for multiple substitutions.
Disadvantages
Sequence information is reduced to pairwise distances.
Gives only one possible tree.
Strongly dependent on the model of evolution used.
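The joining procedure can be sketched in a few lines. The version below is a minimal, unoptimized sketch of neighbor joining: it assumes a symmetric distance matrix supplied as a dict of dicts, and the internal node names (`node0`, `node1`, ...) and the five-taxon example matrix are hypothetical. Ties between equally good pairs are broken by order of appearance.

```python
def neighbor_joining(names, dist):
    """Return the edges (node, neighbor, branch_length) of an unrooted NJ tree.

    names : list of taxon labels
    dist  : dict of dicts with symmetric pairwise distances, dist[a][b]
    """
    d = {a: dict(dist[a]) for a in names}   # working copy
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair minimizing the NJ criterion Q.
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to their new parent node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distances from the new node to every remaining node.
        d[u] = {}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Hypothetical distance matrix for five taxa.
taxa = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
matrix = {x: {y: rows[i][j] for j, y in enumerate(taxa)} for i, x in enumerate(taxa)}

tree_edges = neighbor_joining(taxa, matrix)
for child, parent, length in tree_edges:
    print(f"{child} - {parent}: {length}")
```

For this matrix the branch lengths along the path between any two taxa add up to their original distance (for example, a to e: 2 + 3 + 2 + 1 = 8), because the example distances fit a tree exactly. Real data rarely do, and the result is then one approximate tree.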
Conclusion
Tree reconstruction in general, and
parsimony analysis in particular, have come a long
way in the past 10 years. Currently, different
techniques and programs are vying for the attention
of phylogenetic analysts.
Finding the most parsimonious trees is still
a noble goal, but somewhat unrealistic, and
today the name of the game is finding trees that
are at least close to being optimal.
SOME PHYLOGENETIC SOFTWARE
TreeView CAFCA
TreeExplorer Spectronet
TreeMap NEntrez
The (ever expanding) Entrez System
OMIM PubMed
Structure Journals
CDD/CDART Books
GEO/GDS Genome
UniGene UniSTS
Nucleotide SNP
PopSet