
Evolutionary Basis of Sequence Alignment and Algorithms

Dr. Aditya Kumar Padhi

Laboratory for Computational Biology & Biomolecular Design (LCBD)


School of Biochemical Engineering, IIT (BHU)
Outline of today’s discussion
• General overview of sequence alignment
• Evolution: its significance and relationship with sequence alignment
• The rationale behind sequence alignment
• Types of sequence alignment
• Protein sequence alignment
• Sequence similarity and sequence identity
• Algorithms of sequence alignment (brief overview)
• A case study
Overview
• Sequence comparison (DNA, RNA, and protein) lies at the heart of bioinformatics analysis.
• It is the first step toward structural and functional analysis of newly determined sequences.
• The most fundamental comparison process is sequence alignment.
  • It searches for common character patterns between sequences.
Evolutionary basis
• DNA and proteins are products of evolution.
• The linear sequences of nucleotide bases and amino acids form the primary structure of DNA and proteins.
• They can be considered molecular fossils that encode the history of millions of years of evolution.
• During this time, they accumulate changes through:
  • Mutations (random changes to the sequence)
  • Selection (which retains or removes those changes)
Evolutionary basis cont…
• Despite these changes, traces of evolution may still remain, which allows common ancestry to be identified.
• This is because residues that perform key functional and structural roles tend to be preserved by natural selection, while other residues tend to mutate more frequently.
• To detect sequence homology, we must first align the sequences.
• Through sequence alignment, patterns of conservation and variation can be identified.
• The degree of sequence conservation in an alignment often reveals the evolutionary relatedness of different sequences.
Why perform sequence alignment?
• It serves as the basis for predicting the structure and function of uncharacterized sequences.
• If significant similarity is found between two sequences, this suggests that they belong to the same family.
• It also provides inference about the relatedness (evolutionary relationship) of the two sequences under study.
• If they share significant similarity, this suggests that they derived from a common evolutionary origin.
Evolutionary Basis for Sequence Alignment and its inference

• Homology: When two sequences are descended from a common evolutionary origin, they are said to have a homologous relationship, or to share homology.
• Similarity: The percentage of aligned residues that are similar in physicochemical properties such as size and charge.
• Identity: A quantity that describes how much two sequences are alike in the strictest terms, that is, the extent to which two sequences are invariant.
• Example: two sequences can share 40% similarity, but homology is not a matter of degree: the two sequences are either homologous or nonhomologous.


Sequence alignment and evolution
• Assume we know the evolutionary history relating sequences q and d from two species, with h as their common ancestor.
• The true alignment can then be found using h as a template:


h : GLVS T
q’: GLISVT
d’: GIV--T
Sequence alignment and evolution

• Given an alignment, several different evolutionary histories can be derived, and they may be equally plausible.
• Example:
  • Alignment:
      q': GLISVT
      d': G-I-VT
  • One possible history:

              H*: GLIVT
                /    \
         ->S   /      \   L->
              /        \
      q: GLISVT      d: GIVT
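
The history drawn above can be replayed as two edits to the ancestral sequence. The following is a minimal sketch; the helper names `insert_residue` and `delete_residue` and the 0-based positions are illustrative assumptions, not part of the slides.

```python
def insert_residue(sequence, position, residue):
    """Return a copy of sequence with residue inserted before the given 0-based position."""
    return sequence[:position] + residue + sequence[position:]


def delete_residue(sequence, position):
    """Return a copy of sequence with the residue at the given 0-based position removed."""
    return sequence[:position] + sequence[position + 1:]


ancestor = "GLIVT"                       # H* in the slide
q = insert_residue(ancestor, 3, "S")     # "->S": an S is gained on the lineage leading to q
d = delete_residue(ancestor, 1)          # "L->": the L is lost on the lineage leading to d
print(q, d)
```

Other histories, such as an ancestor that already contained the S, would produce the same pair of present-day sequences, which is why an alignment alone does not fix the history.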
Evolutionary basis of sequence alignment
Why are there regions of identity?

1) Conserved function: residues participate in the reaction.

2) Structural: residues participate in maintaining the structure of the protein (for example, conserved cysteine residues that form a disulfide linkage).

3) Historical: residues that are conserved solely because they are inherited from a common ancestral gene.
Protein sequence alignment
• Nucleotide sequences consist of only four characters; therefore, two unrelated sequences are expected to be identical at 25% or more of their aligned positions by chance alone.

• For protein sequences, there are 20 possible amino acid residues, so two unrelated sequences can match at about 5% of the residues by random chance. If gaps are allowed, the percentage can increase to 10–20%.

• Sequence length is also a crucial factor.

• The shorter the sequence, the higher the chance that an alignment is attributable to random chance. The longer the sequence, the less likely it is that a match at the same level of similarity is attributable to random chance.
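
The 25% and 5% figures follow from a simple model in which every residue is equally likely at every position and no gaps are allowed. The sketch below checks this by simulation; the function names, the uniform-composition assumption, and the fixed random seed are illustrative choices, not part of the slides.

```python
import random


def expected_identity(alphabet_size):
    """Chance that one ungapped position matches when residues are uniformly distributed."""
    return 1.0 / alphabet_size


def simulated_identity(alphabet, length, trials, seed=0):
    """Average fraction of identical positions over pairs of random ungapped sequences."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(trials):
        a = rng.choices(alphabet, k=length)
        b = rng.choices(alphabet, k=length)
        matches += sum(x == y for x, y in zip(a, b))
    return matches / (trials * length)


print(expected_identity(4), simulated_identity("ACGT", 100, 2000))
print(expected_identity(20), simulated_identity("ACDEFGHIKLMNPQRSTVWY", 100, 2000))
```

Real sequences have uneven residue composition and alignments allow gaps, both of which push the chance level above these baseline values.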
Protein sequence alignment
• To determine a homology relationship between two protein sequences: if both sequences are aligned over their full length (for example, 100 residues), an identity of 30% or higher can be safely regarded as indicating close homology.

The three zones of protein sequence alignments. Two protein sequences can be regarded as homologous if the percentage sequence identity falls in the safe zone (identity of 30% or higher). Sequence identity values below the zone boundary, but above 20%, are considered to be in the twilight zone, where homologous relationships are less certain. The region below 20% is the midnight zone, where homologous relationships cannot be reliably determined.
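
The three zones can be expressed as a small lookup. This is only a sketch using the fixed 30% and 20% cutoffs quoted above for alignments of about 100 residues; the function name `homology_zone` is hypothetical, and assigning exactly 20% to the twilight zone is an assumption, since the text does not say which side that value falls on.

```python
def homology_zone(percent_identity):
    """Classify a percentage identity using the fixed cutoffs quoted in the text."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent_identity must be between 0 and 100")
    if percent_identity >= 30:
        return "safe"
    if percent_identity >= 20:   # assumption: exactly 20% counts as twilight
        return "twilight"
    return "midnight"


for value in (35, 25, 15):
    print(value, homology_zone(value))
```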
Sequence Similarity & Sequence Identity
• Sequence similarity and sequence identity are synonymous for
nucleotide sequences. For protein sequences, however, the two concepts
are very different.

• In a protein sequence alignment, sequence identity refers to the percentage of matches of the same amino acid residues between two aligned sequences.

• Similarity refers to the percentage of aligned residues that have similar physicochemical characteristics and can be more readily substituted for each other.
Sequence Similarity & Sequence Identity
• One way to calculate these values is to use the overall lengths of both sequences.

S = [(Ls ✕ 2) / (La + Lb)] ✕ 100

where S is the percentage sequence similarity, Ls is the number of aligned residues with similar characteristics, and La and Lb are the total lengths of each individual sequence.

I = [(Li ✕ 2) / (La + Lb)] ✕ 100

where I is the percentage sequence identity, Li is the number of aligned residues with the exact same residues, and La and Lb are the total lengths of each individual sequence.
Sequence Similarity & Sequence Identity
> protein-A
ACHKLMGCGLITPNASR

> protein-B
SKTVHRMPGSRAPKLSM
Star: identical residues; one dot: somewhat similar; two dots: very similar; dashes: gaps in sequences

I = [(4 ✕ 2) / (17 + 17)] ✕ 100

Percentage sequence identity (I) = 23.53%

• Although they are of equal length, these two sequences share low sequence identity.

• They may not have a common evolutionary origin.
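
The calculation above can be reproduced in a few lines. This is a minimal sketch: the function names are illustrative, and the count of 4 identical residues is taken from the slide rather than recomputed, because the gapped alignment itself is not reproduced in the text.

```python
def percent_score(shared, len_a, len_b):
    """Percentage identity (or similarity) from the count of shared aligned residues.

    Implements [(L ✕ 2) / (La + Lb)] ✕ 100, where L is Li for identity or Ls for similarity.
    """
    if len_a + len_b == 0:
        raise ValueError("at least one sequence must be non-empty")
    return (shared * 2) / (len_a + len_b) * 100


def count_identical_columns(aligned_a, aligned_b):
    """Count alignment columns where both rows carry the same residue (gaps never count)."""
    return sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")


# Values from the slide: 4 identical residues, both sequences 17 residues long.
print(round(percent_score(4, 17, 17), 2))  # 23.53

# Counting identities directly from a gapped alignment shown earlier in the slides.
print(count_identical_columns("GLISVT", "G-I-VT"))  # 4
```

The same `percent_score` function gives the percentage similarity S when it is passed Ls, the number of aligned residues with similar characteristics.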


Methods of sequence alignment
• Pairwise sequence alignment: compare two sequences
• Multiple sequence alignment: compare three or more sequences together

Each of the above can be performed as:

• Local alignment: compare similar parts of the sequences
• Global alignment: compare the sequences over their whole length

• The different types of alignment rely on different assumptions and methods.
Pairwise vs. Multiple sequence alignment
• Pairwise
  • The process of lining up two sequences to achieve maximal levels of identity/similarity, for the purpose of assessing the degree of similarity and the possibility of homology.
  • Example: it is used to decide whether two genes are structurally or functionally related.

Bar: identical residues, One dot: somewhat similar, Two dots: very similar, Dashes: gaps in sequences
Pairwise vs. Multiple sequence alignment
• Multiple
  • An MSA is an alignment of three or more sequences such that each column of the alignment is an attempt to represent the evolutionary changes in one sequence position, including substitutions, insertions, and deletions.
  • It is believed that, over time, the functional components embedded within the sequences are conserved in order to retain function.
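
Because each column of an MSA represents one sequence position, conservation can be measured column by column. The sketch below is one simple way to do that; the three aligned sequences are a hypothetical toy example built from the short sequences used earlier, and the scoring rule (fraction of rows sharing the most common residue, with gaps never counted as a residue) is an assumption rather than a standard taken from the slides.

```python
from collections import Counter


def column_conservation(aligned):
    """Per-column fraction of sequences that carry the most common non-gap residue."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores


# Hypothetical toy alignment (not from the slides' figure).
msa = ["GLISVT", "G-I-VT", "GLI-VT"]
scores = column_conservation(msa)
conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(conserved)  # [0, 2, 4, 5]
```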

Disease-associated I71V variant/mutant


Local vs. Global alignment
Local alignment
• Aligns segments of the sequences.
• Identifies short conserved regions.
• A complete alignment is not produced.
• May miss some important conserved residues.
• Computationally less intensive and faster than global alignment when heuristic word methods are used.
• Examples: Smith-Waterman (dynamic programming); BLAST and FASTP (heuristic)

Global alignment
• Aligns the entire sequence.
• Identifies all conserved residues.
• Dynamic programming is required.
• Computationally intensive; much slower than heuristic local alignment.
• Example: Needleman-Wunsch method
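
To illustrate how a local alignment focuses on the best-matching segment, here is a minimal score-only sketch of the Smith-Waterman recurrence. The scoring values (match +2, mismatch -1, linear gap -1) and the test sequences are assumptions chosen for illustration; practical tools use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (score only, no traceback)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere; this is what makes it local.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


# The shared segment GLIS (4 matches x 2) is found despite unrelated flanks.
print(smith_waterman_score("AAAGLISTTT", "CCGLISCC"))  # 8
```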
Alignment algorithms
The three primary methods of producing pairwise alignments:

1. Dot matrix method (old method)
2. Dynamic programming (DP) algorithm (advanced method)
3. Word or k-tuple methods


Alignment algorithms
The dot-matrix method:

• The two sequences are written out as the column and row headings of a two-dimensional matrix.

• A dot is placed in the dot-matrix plot at each position where the residues in the two sequences are identical.

• The alignment is defined by a path from the upper-left element to the lower-right element.
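
The dot-matrix construction described above can be sketched as follows. The function names and the text rendering (`*` for a dot, `.` for an empty cell) are illustrative choices; real dot-plot tools usually also apply a sliding window and a threshold to suppress noise, which this sketch omits.

```python
def dot_matrix(seq_rows, seq_cols):
    """Boolean matrix: True where the row residue equals the column residue."""
    return [[r == c for c in seq_cols] for r in seq_rows]


def render(seq_rows, seq_cols):
    """Plain-text dot plot with seq_cols as column headings and seq_rows as row headings."""
    matrix = dot_matrix(seq_rows, seq_cols)
    lines = ["  " + " ".join(seq_cols)]
    for residue, row in zip(seq_rows, matrix):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GLISVT", "GIVT"))
```

Runs of dots along a diagonal mark stretches where the two sequences match; a shift from one diagonal to another corresponds to an insertion or deletion.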
Advantages of Dot-Matrix method

The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene.

The two diagonally oriented parallel lines most probably indicate that a small internal duplication has occurred in the bacterial gene.
Disadvantages of Dot-Matrix method

May not identify the best alignment.


Dynamic programming method
• Global alignment programs are based on the Needleman-Wunsch algorithm, and local alignment programs on the Smith-Waterman algorithm. Both algorithms are derived from the basic dynamic programming algorithm.

• Three steps in dynamic programming:

1. Initialization
2. Matrix fill (scoring)
3. Traceback (alignment)
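
The three steps can be seen in a compact sketch of Needleman-Wunsch global alignment. The scoring scheme (match +1, mismatch -1, linear gap -1) is an assumption made for illustration, and when several alignments tie, this traceback simply returns one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # 1. Initialization: first row and column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # 2. Matrix fill: each cell takes the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # 3. Traceback: walk from the bottom-right cell back to the top-left.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("GLISVT", "GIVT"))  # (2, 'GLISVT', 'G-I-VT')
```

Smith-Waterman follows the same three steps but never lets a cell drop below zero and starts the traceback from the highest-scoring cell instead of the bottom-right corner.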

Word or k-tuple method


• This method is useful in large-scale database searches to find whether there is a significant match to the query sequence.

• The word method is used in database search tools such as the BLAST family.

• These methods identify a series of short subsequences (words, or k-tuples) in the query sequence, which are then matched against the database sequences.

• Details of other algorithms will be explained in the next class.
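
The core idea of a word method can be sketched with a lookup table of query words. This is a simplified illustration only: the function names, the word length k = 3, and the use of every overlapping window as a word are assumptions, and the sketch leaves out the scoring, neighbourhood words, and hit extension that tools such as BLAST perform.

```python
from collections import defaultdict


def index_words(sequence, k):
    """Map each length-k word in the sequence to the 0-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a query word occurs exactly in the subject."""
    index = index_words(query, k)
    hits = []
    for s_pos in range(len(subject) - k + 1):
        for q_pos in index.get(subject[s_pos:s_pos + k], []):
            hits.append((q_pos, s_pos))
    return hits


print(word_hits("GLISVT", "AAGLISAA"))  # [(0, 2), (1, 3)]
```

Hits that fall on the same diagonal (equal difference between subject and query positions) point to a candidate region worth aligning in full, which is what makes the approach fast enough for database searches.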


MSA - a case study

• Gorilla and Chimpanzee are closely related in terms of ANG’s evolution.


• Human is also a closely related species.
• Chicken is distant in terms of evolution when ANG is considered specifically.
Thank you
