Phylogenetic Analysis: A bioinformatics tool
Presented by
Uttam Kr. Patra
2nd sem-Microbiology
© 2007, U.K.Patra 1
Introduction
Greek: phylon = race; genetikos = relating to birth
• Phylogenetic trees illustrate the evolutionary relationships among groups of organisms, or among a family of related nucleic acid or protein sequences, i.e., how this family might have been derived during evolution.
Ancestors
Carolus Linnaeus, Charles Darwin, Willi Hennig
Milestone
1735. Linnaeus classifies organisms by their similarities and differences in “Systema Naturae”.
1859. Darwin publishes “The Origin of Species…”.
1864. E. Haeckel publishes a tree of life (1879) including the Monera (formless clumps, later named bacteria).
1904. Nuttall et al. describe how molecular data can be used in phylogenetics.
1937. Chatton distinguishes prokaryotes (bacteria, which lack nuclei) from eukaryotes (which have nuclei).
1950. Willi Hennig, a German entomologist, develops phylogenetic systematics (cladistics).
1970 The Needleman-Wunsch algorithm for sequence comparison is published.
1972 Dayhoff develops the Protein Sequence Database (PSD).
1975 Sanger and others (Maxam, Gilbert) invent rapid DNA sequencing methods.
1981 The Smith-Waterman algorithm for sequence alignment is published.
1981 IBM introduces its Personal Computer to the market.
1982 The GenBank sequence database is created at Los Alamos National Laboratory.
1985 The FASTP algorithm is published by Lipman and Pearson.
1985 PAUP 3.11 (Swofford)
1987 The neighbor-joining method is published (Saitou and Nei).
1988 Maximum parsimony method (Felsenstein)
1988 The National Center for Biotechnology Information (NCBI) is established at the National Library of Medicine in Bethesda.
1988 The FASTA algorithm for sequence comparison is published by Pearson and Lipman.
1989 HENNIG86 (Farris)
1990 The BLAST program is published by Altschul et al.
1993 PHYLIP (Felsenstein), NONA (Goloboff)
1994 CLUSTALW (Thompson et al.)
1996 TREEVIEW (Page, R.D.M.)
1996 It becomes possible to track ancient genes, such as rRNA and some proteins, back through the tree of life and to discover new organisms based on their sequences (Barns et al.).
1998 Partitioned Bremer Support (PBS) (Baker et al.)
2002 PAUP* (Swofford)
2003 MrBayes for Bayesian likelihood analysis (Ronquist & Huelsenbeck)
2003 MCMC implementation of Bayesian likelihood (Ronquist & Huelsenbeck)
2004 POY (Wheeler et al.) for optimization alignment and fixed-state alignment
2004 The A. gossypii genome as a tool for mapping the ancient S. cerevisiae genome (Dietrich et al.)
2005 Clann (C. J. Creevey et al.)
2007 Phylogenetic signal and functional categories in Proteobacteria genomes (Iñaki Comas et al.)
2007 Phylogenetic distribution of translational GTPases in bacteria (Tõnu Margus et al.)
APPLICATION
• Identification of microorganisms
• Evolution studies
• Systematic biology
• Medical research (e.g., vaccine development) and epidemiology
• Ecology
• Ortholog detection
• Biochemical & pharmaceutical industry
• Following changes occurring in rapidly changing species (e.g., HIV)
• A reference for studying lateral gene transfer
Common Phylogenetic Tree Terminology
[Figure: a rooted tree with five taxa, A to E. Labels: terminal nodes represent the taxa (genes, populations, species, etc.) used to infer the phylogeny; branches or lineages; internal nodes or divergence points represent hypothetical ancestors of the taxa; ancestral node or root of the tree.]
• Root: the common ancestor of all taxa.
• Branch: reflects the relationship between taxa in terms of descent and ancestry.
• Branch length: indicates the number of changes that have occurred along the branch.
• Node: a taxonomic unit representing either an existing or an extinct species.
• Clade: a group of two or more taxa or DNA/protein sequences that includes their common ancestor and all of its descendants.
• Distance scale: a scale that represents the number of differences between organisms or sequences.
• Topology: the branching pattern of the tree.
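The terms above can be made concrete with a small data structure. The following is a minimal sketch, not a standard library API: the `Node` class, the helper names, the topology ((A,B),(C,(D,E))) and the branch lengths are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node of a rooted tree; terminal nodes (leaves) carry a taxon name."""
    name: Optional[str] = None      # taxon label, set only on terminal nodes
    branch_length: float = 0.0      # change along the branch leading to this node
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaves(node):
    """Return the sorted taxon names found at or below a node."""
    if node.is_leaf():
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaves(child))
    return sorted(names)


def clades(node):
    """Return the leaf set below every internal node; each set is one clade."""
    if node.is_leaf():
        return []
    found = [leaves(node)]
    for child in node.children:
        found.extend(clades(child))
    return found


# Hypothetical topology ((A,B),(C,(D,E))) with made-up branch lengths.
tree = Node(children=[
    Node(branch_length=1.0, children=[Node("A", 2.0), Node("B", 2.0)]),
    Node(branch_length=0.5, children=[
        Node("C", 2.5),
        Node(branch_length=1.0, children=[Node("D", 1.5), Node("E", 1.5)]),
    ]),
])

all_clades = clades(tree)
for clade in all_clades:
    print(clade)
```

The root is the top-level `Node`, the topology is the nesting of `children`, and each printed list is a clade: an ancestor together with all of its descendants.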
Branches of phylogenetic trees may be
• scaled (top panel), representing the amount of evolutionary change, time, or both (when there is a molecular clock), or
• unscaled (middle panel), with no direct correspondence to either time or amount of evolutionary change.
Trees may be
• rooted (top and middle panels), so that the position of the common ancestor is specified, or
• unrooted (bottom panels): branching relationships between taxa are specified by the way they are connected to each other, but the position of the common ancestor is not. A fully resolved unrooted tree with five species has seven branches (2n - 3 for n taxa), and it can be rooted on any of them.
Homology Interpretation: Darwin to 21st Century
• Before Darwin: homology was defined morphologically.
• Darwin (1859): homology is a result of descent with modification from a common ancestor.
• Modern genetics: homology is determined by genes.
• Two sequences are homologous if they are similar and share a common ancestor (similarity by itself is not enough).
• Homologs are commonly classified as orthologs, paralogs, or xenologs.
• Orthologs are homologs resulting from speciation: genes in different species that stem from a single gene in their common ancestor. Orthologs often have similar functions. Example: SPO11 (Baudat et al. Mol Cell 2000).
• Paralogs are homologs resulting from gene duplication: genes derived from a common ancestral locus that was duplicated within the genome of an organism. Paralogs tend to have different functions. Example: CLB1/CLB2 (Brachat et al. Genome Biology 2003).
Orthologs & Paralogs
[Figure: gene tree for Species a and Species b; a duplication event produces paralogs and a speciation event produces orthologs.]
Data
• Until ~1990, phylogenies were based mostly on morphology.
• So many DNA sequences and genomic data are now available that phylogenies can be based on both molecular and morphological data.
• Biomolecular sequences: DNA, RNA, or amino acid sequences in a multiple alignment
• Molecular markers (e.g., SNPs, RFLPs, etc.)
• Gene order and content
• Genomics: biologists now have full gene sequences for many single-chromosome organisms and organelles (e.g., mitochondria, chloroplasts) and for a growing number of larger organisms.
These are “character data”: each character is a function mapping the set of taxa to distinct states (equivalence classes), with evolution modelled as a process that changes the state of a character.
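To illustrate "a character is a function mapping taxa to states", here is a minimal sketch that treats each alignment column as one character. It uses the five present-day sequences from the DNA Sequence Evolution slide; the taxon names `t1` to `t5` and the function names are assumptions, and the "informative site" test is the usual parsimony convention rather than anything specific to this presentation.

```python
from collections import Counter

# Present-day sequences from the "DNA Sequence Evolution" slide.
# The taxon names t1..t5 are hypothetical labels.
alignment = {
    "t1": "AGGGCAT",
    "t2": "TAGCCCA",
    "t3": "TAGACTT",
    "t4": "AGCACAA",
    "t5": "AGCGCTT",
}


def characters(aln):
    """Return one character per column, as a dict mapping taxon -> state."""
    taxa = list(aln)
    length = len(aln[taxa[0]])
    return [{taxon: aln[taxon][i] for taxon in taxa} for i in range(length)]


def is_parsimony_informative(character):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(character.values())
    return sum(1 for c in counts.values() if c >= 2) >= 2


chars = characters(alignment)
informative = [i for i, ch in enumerate(chars) if is_parsimony_informative(ch)]
print(informative)
```

Column 4 (0-based) has the same state, C, in every taxon, so it cannot distinguish between trees and is the only column left out of `informative`.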
DNA Sequence Evolution
-3 mil yrs: AAGACTT
-2 mil yrs: AAGGCCT, TGGACTT
-1 mil yrs: AGGGCAT, TAGCCCT, AGCACTT
today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT
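As a rough illustration of how change accumulates along lineages, the sketch below counts the sites at which each present-day sequence differs from the ancestral sequence AAGACTT. The function name is an assumption. The count is a lower bound on the real number of substitutions, because a site that changed more than once is counted at most once.

```python
def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


ancestor = "AAGACTT"  # sequence at -3 mil yrs
today = ["AGGGCAT", "TAGCCCA", "TAGACTT", "AGCACAA", "AGCGCTT"]

differences = {seq: hamming(ancestor, seq) for seq in today}
for seq, n in differences.items():
    print(seq, n)
```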
THE WAY
• Choose the sequences.
• Build a multiple sequence alignment.
• Is there strong similarity? If yes, use maximum parsimony methods.
• If not, is there recognizable similarity? If yes, use distance methods.
• If not, use maximum likelihood methods.
[Flowchart: testing whether two sequences are similar. Choose two sequences. Are the sequences protein sequences? If not, do the sequences encode proteins, and do they encode proteins and have introns (if so, predict the gene structure and translate the sequences)? Perform a local alignment (or an MSA). Is the alignment of high quality? If not, examine the sequences for the presence of repeats or low-complexity sequence, alter the parameters, and check whether the alignment improved. Perform a statistical test of the alignment score. Is the alignment score significant? If yes, the sequences are significantly similar; if no, the sequences are not detectably similar.]
Maximum parsimony methods
• Maximum parsimony methods try to find the most efficient path between evolutionary states: the tree that requires the minimum number of mutations to explain the differences among the sequences. An initial tree topology is specified and each position in the sequences is examined for its support of that tree. All reasonable topologies are examined in this way, and the tree with the minimum number of changes is designated the “best” tree.
• A multiple sequence alignment is needed.
• Used for small numbers of rather similar sequences.
• PHYLIP is a widely used package that implements maximum parsimony and other methods: http://evolution.genetics.washington.edu/phylip.html
• PAUP* is also used for this purpose: http://paup.csit.fsu.edu/index.html
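Counting the minimum number of changes for one fixed topology can be sketched with Fitch's algorithm, a standard way to score a tree under parsimony. It is named here as an illustration and is not necessarily what PHYLIP or PAUP* run internally. The four sequences, the taxon names, and the nested-tuple tree encoding are made up for this example.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    `tree` is a taxon name (leaf) or a pair (left, right) of subtrees.
    `states` maps each taxon name to its state at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change needed here.
        return common, left_cost + right_cost
    # The children disagree: one change is required on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical aligned sequences for four taxa.
alignment = {"a": "AAGT", "b": "AAGC", "c": "CTGT", "d": "CTGC"}

tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))

score_1 = parsimony_score(tree_1, alignment)
score_2 = parsimony_score(tree_2, alignment)
print(score_1, score_2)
```

Here `tree_1` needs 4 changes and `tree_2` needs 5, so `tree_1` is the more parsimonious of the two. A full analysis would repeat this scoring over all reasonable topologies.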
Maximum parsimony methods
Advantages:
• Are simple, intuitive, and logical (many are possible with ‘pencil and paper’).
• Can be used on molecular and non-molecular (e.g., morphological) data.
• Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy).
• Can be used for character analysis (can infer the exact substitutions) and rate analysis.
• Can be used to infer the sequences of the extinct (hypothetical) ancestors.
Disadvantages:
• Can be fooled by high levels of homoplasy (the ‘same’ events arising independently).
• Can become positively misleading in the “Felsenstein Zone”, where long branches separated by short ones are wrongly grouped together (long-branch attraction).
Maximum Likelihood methods
• The maximum likelihood method checks every reasonable tree topology and examines the support that every sequence position gives to each tree.
• Uses probability calculations to find the tree that best accounts for the observed sequence variation.
• All possible trees are considered, which is time-consuming, so only a few sequences can be analyzed.
• Trees with different mutation rates in different lineages can be evaluated.
• Uses explicit evolutionary models of substitution (e.g., Jukes-Cantor, Kimura).
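The probability calculation can be illustrated on the smallest possible case: two aligned sequences joined by a single branch under the Jukes-Cantor model. This is only a sketch of the likelihood idea, not a tree search. The function names and the grid search are assumptions made for the example, and the two sequences are the ancestral and one present-day sequence from the DNA Sequence Evolution slide.

```python
import math


def jc69_log_likelihood(seq1, seq2, d):
    """Log-likelihood of two aligned sequences d substitutions per site apart."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    total = 0.0
    for a, b in zip(seq1, seq2):
        # 0.25 is the equilibrium frequency of the base observed in seq1.
        total += math.log(0.25 * (p_same if a == b else p_diff))
    return total


def ml_distance(seq1, seq2, step=0.001, max_d=3.0):
    """Grid search for the distance that maximizes the likelihood."""
    grid = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(seq1, seq2, d))


def jc69_closed_form(seq1, seq2):
    """Closed-form Jukes-Cantor distance from the proportion of differing sites."""
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


d_grid = ml_distance("AAGACTT", "TAGACTT")
d_exact = jc69_closed_form("AAGACTT", "TAGACTT")
print(round(d_grid, 3), round(d_exact, 3))
```

The grid search and the closed-form correction agree to within the grid step, which shows that the Jukes-Cantor distance is the maximum likelihood estimate for a single pair. On a real tree the same kind of likelihood is evaluated over every branch and every candidate topology, which is why the method is so time-consuming.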
Maximum likelihood (ML) methods
Advantages:
• Are inherently statistical and evolutionary model-based.
• Usually the most ‘consistent’ of the methods available.
• Can be used for character analysis (can infer the exact substitutions) and rate analysis.
• Can be used to infer the sequences of the extinct (hypothetical) ancestors.
• Can help account for branch-length effects in unbalanced trees.
• Can be applied to nucleotide or amino acid sequences, and other types of data.
Disadvantages:
• Are not as simple and intuitive as many other methods.
• Are computationally very intensive (which limits the number of taxa and the length of the sequences).
• Like parsimony, can be fooled by high levels of homoplasy.
• Violations of the assumed model can lead to incorrect trees.
Distance methods
• In distance matrix methods, evolutionary distance is computed by counting the nucleic acid or protein substitutions for every pair of sequences in a multiple alignment.
• The number of changes between each pair in a group of sequences is used to produce a tree of that group.
• Pairs that display the smallest number of changes are called “neighbours”.
• The goal is to position neighbours correctly and to calculate branch lengths that reflect the original data.
• UPGMA (Unweighted Pair Group Method with Arithmetic Mean) employs a sequential clustering algorithm. The distance values are ranked in order of similarity and a tree is built in a stepwise manner. The most closely related sequences are joined by a node, then the next most closely related sequences are added, and so on. As the connected set of sequences accumulates, it is treated as a composite set.
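The clustering steps described for UPGMA can be sketched as follows. This is a minimal illustration, assuming raw difference counts as the distance and using the five present-day sequences from the DNA Sequence Evolution slide. The taxon names `t1` to `t5`, the function names, and the nested-tuple output are choices made for the example.

```python
from itertools import combinations


def differences(a, b):
    """Number of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def upgma(seqs):
    """Cluster sequences by UPGMA; return (nested-tuple tree, list of joins)."""
    clusters = {name: 1 for name in seqs}   # cluster label -> number of leaves
    dist = {frozenset(pair): differences(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    joins = []
    while len(clusters) > 1:
        # Pick the closest pair among the current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        new = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # The new node sits at half the distance between the joined clusters.
        joins.append((a, b, dist[frozenset((a, b))] / 2))
        for other in clusters:
            # Size-weighted mean, so every original sequence counts equally.
            dist[frozenset((new, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[new] = size_a + size_b     # the pair is now a composite set
    return next(iter(clusters)), joins


# Present-day sequences from the slide; names t1..t5 are hypothetical.
sequences = {
    "t1": "AGGGCAT",
    "t2": "TAGCCCA",
    "t3": "TAGACTT",
    "t4": "AGCACAA",
    "t5": "AGCGCTT",
}

tree, joins = upgma(sequences)
print(tree)
```

The first join is `t1` with `t5`, the pair with the fewest differences (2). When two pairs tie for the smallest distance, this sketch simply takes the first one it encounters.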
Neighbor Joining (NJ)
• Reconstructs an unrooted tree
• Calculates branch lengths
• Based on star decomposition
• At each stage, the pair of nodes that most reduces the total branch length of the tree is chosen and defined as neighbors. This is repeated until all of the nodes are paired together.
• Advantages
  • is fast and thus suited for large datasets and for bootstrap analysis
  • permits lineages with largely different branch lengths
  • permits correction for multiple substitutions
• Disadvantages
  • sequence information is reduced
  • gives only one possible tree
  • strongly dependent on the model of evolution used
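A single neighbor-joining step can be sketched as below. In the standard formulation of the method, "nearest" is judged by the Q criterion, which corrects each pairwise distance for how far the two nodes are from everything else, rather than by the raw distance alone. The five-taxon distance matrix and the function name are made up for illustration.

```python
from itertools import combinations


def nj_step(names, dist):
    """One neighbor-joining step.

    Returns the pair (i, j) to join and the branch lengths from i and j
    to the new internal node.
    """
    n = len(names)
    # Sum of distances from each node to all the others.
    total = {i: sum(dist[frozenset((i, j))] for j in names if j != i)
             for i in names}

    def q(pair):
        i, j = pair
        return (n - 2) * dist[frozenset(pair)] - total[i] - total[j]

    i, j = min(combinations(names, 2), key=q)
    d_ij = dist[frozenset((i, j))]
    length_i = d_ij / 2 + (total[i] - total[j]) / (2 * (n - 2))
    return (i, j), (length_i, d_ij - length_i)


# Hypothetical pairwise distances between five taxa.
names = ["a", "b", "c", "d", "e"]
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7,
    ("d", "e"): 3,
}
dist = {frozenset(pair): value for pair, value in raw.items()}

pair, lengths = nj_step(names, dist)
print(pair, lengths)
```

In this matrix `d` and `e` have the smallest raw distance (3), yet the step joins `a` and `b`, with unequal branch lengths of 2 and 3. That is how neighbor joining accommodates lineages with different branch lengths. A complete run would replace the joined pair with the new node, update the distances, and repeat.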
Conclusion
Tree reconstruction in general, and parsimony analysis in particular, has come a long way in the past 10 years. Currently, different techniques and programs are vying for the attention of phylogenetic analysts.
Finding the most parsimonious trees is still a noble goal, but a somewhat unrealistic one; today the name of the game is finding trees that are at least close to optimal.
SOME PHYLOGENETIC SOFTWARE
Tree viewing: TreeView, TreeExplorer
Tree building: CAFCA, Spectronet
Tree comparison & interpretation: MacClade 3.0, TreeMap
Sequence alignment, analysis and searching: MACAW, NEntrez
The (ever expanding) Entrez System
[Figure: databases linked in the Entrez system: OMIM, PubMed, 3D Domains, PubMed Central, Structure, Journals, CDD/CDART, Books, Protein, Taxonomy, GEO/GDS, Genome, UniGene, UniSTS, Nucleotide, SNP, PopSet.]