
Sequence Alignment and Visualization
Practical: Sequence alignment using BLAST and Clustal W
Unit-III
• Learn the principle, working mechanism, and importance of:
o Local and global sequence alignments (Needleman-Wunsch and Smith-Waterman
algorithms)
o Pair-wise (BLAST and FASTA algorithms) and multiple sequence alignment (Clustal W)
• BLAST (in detail)
o Theory behind BLAST- how Hidden Markov Model (HMM) can be used to model a
family of unaligned sequences or a common motif within a set of unaligned
sequences and further be used for discrimination and multiple alignment,
o BLAST score
• Amino acid substitution matrices
• s-value and e-value, calculating the alignment score and significance of e
and p value
Important terms
• Sequence
• Sequencing
• Sequencing methods
• DNA
• Sanger Method (dideoxy chain termination method)
• Maxam-Gilbert (Chemical degradation method)
• Protein
• Edman Degradation reaction
• Mass Spectrometry
• Sequence Alignment
It is the way of arrangement of DNA/RNA or protein sequences, in order to
identify the regions of similarity among them.
Sequence alignment
• It is the way of arrangement of DNA/RNA or protein sequences, in order to identify
the regions of similarity among them.
• Query & Subject sequence (2 queries/query and database)
• Aligned sequences of nucleotide or amino acid residues are typically represented as
rows within a matrix.
• Gaps are inserted between the residues so that residues with identical or similar
characters are aligned in successive columns.
Representation
• Both graphically and in text format.
• Sequences are written in rows arranged so that aligned residues appear in
successive columns.
• In text formats, aligned columns containing identical or similar characters
are indicated with a system of conservation symbols, e.g. an asterisk or
pipe symbol marks identity between two aligned residues and a colon marks a
conservative substitution.
• Many sequence visualization programs also use colour to display
information (ConSurf)
• Most web-based tools allow a number of input and output formats, such
as FASTA format and GenBank format
• A general conversion program is available at READSEQ.
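The text representation described above can be sketched in a few lines. This is a minimal illustration, not the output format of any particular tool: the function name `match_line` and the example rows are made up here, and only identity (pipe symbol) is marked, because marking similar residues with a colon would additionally need a substitution matrix.

```python
def match_line(row_a: str, row_b: str) -> str:
    """Return a marker line with '|' under columns holding identical residues."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # A gap character ('-') never counts as an identity.
    return "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(row_a, row_b)
    )


# Two hypothetical aligned rows (gaps already inserted).
row_a = "GATTACA-"
row_b = "GA-TACAT"
print(row_a)
print(match_line(row_a, row_b))
print(row_b)
```

The two rows plus the marker line form the usual three-line text view of a pairwise alignment.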
Sequence alignment : Applications
• It is useful for predicting the structure and function of a newly discovered gene.
• The greater the sequence similarity, the greater the chance that the query and subject
sequences share a similar structure or function. (Similarity between two genes
does not always imply a common function.)
• To determine evolutionary relationships between 2 sequences.
• Mutations
If two sequences in an alignment share a common ancestor, mismatches can be interpreted
as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in
one or both lineages in the time since they diverged from one another.
• Conservation
In protein sequence alignment, the degree of similarity between amino acids occupying a
particular position in the sequence can be interpreted as a rough measure of how conserved
a particular region or sequence motif is among lineages
Computational approaches to sequence alignment
generally fall into two categories

1. Global alignment
Calculating a global alignment is a form of global
optimization that "forces" the alignment to span
the entire length of all query sequences.

2. Local alignment
Local alignments identify regions of similarity
within long sequences that are often widely
divergent overall, i.e. they find local regions
with a high level of similarity.
GLOBAL ALIGNMENT
• Alignment is carried out from beginning till end of the sequence to
find out the best possible alignment
• It attempts to align every residue in every sequence
• They are most useful when the sequences in the query set are similar
and of roughly equal size (It does not mean global alignments cannot
end in gaps.)
Local alignment
Global vs. Local Alignment
• Local alignments are often preferable over global alignments, but can
be more difficult to calculate because of the additional challenge of
identifying the regions of similarity.

• A variety of computational algorithms have been applied to the
sequence alignment problem, including slow but formally optimizing
methods like dynamic programming and efficient heuristic or
probabilistic methods designed for large-scale database search.

• With sufficiently similar sequences, there is no difference between local
and global alignments.
https://demonstrations.wolfram.com/GlobalAndLocalSequenceAlignmentAlgorithms/
Methods of sequence alignments
• Dot matrix method
• Dynamic programming method
• Word or k-tuple method
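As an illustration of the first method in this list, the sketch below builds a simple dot matrix: one sequence runs down the rows, the other across the columns, and a cell is marked wherever the two residues are identical. The function names and the example sequences are assumptions made for this sketch; practical dot-plot tools usually add a sliding window and a threshold to suppress noise, which is left out here.

```python
def dot_matrix(seq_a: str, seq_b: str) -> list[list[bool]]:
    """Cell [i][j] is True when residue i of seq_a equals residue j of seq_b."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a: str, seq_b: str) -> str:
    """Draw the dot matrix as text: '*' for a match, '.' otherwise."""
    grid = dot_matrix(seq_a, seq_b)
    lines = ["  " + " ".join(seq_b)]
    for residue, row in zip(seq_a, grid):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GATTACA", "GATCACA"))
```

Runs of matches along a diagonal indicate regions where the two sequences are similar; a shift from one diagonal to another suggests an insertion or deletion.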
Dynamic programming algorithm for sequence alignment
• The method compares every pair of characters in the two sequences and generates an
alignment, which is best or optimal.

• Each alignment has its own similarity score that describes the level of similarity between the two
sequences.

• A positive value or a high score is given for a match and a negative value or a low score is given
for a mismatch or a gap.
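The scoring idea above can be made concrete with a small sketch that adds up the score of an alignment that already exists. The values match = +1, mismatch = -1 and gap = -2 are arbitrary choices for illustration, and the function name `alignment_score` is hypothetical.

```python
def alignment_score(row_a: str, row_b: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum per-column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap        # column contains a gap
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# 6 matches and 2 gap columns: 6 * 1 + 2 * (-2) = 2
print(alignment_score("GATTACA-", "GA-TACAT"))
```

Dynamic programming searches over all possible gap placements for the alignment that maximises this kind of score.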
• Aligning a large number of sequences and scoring them is computationally
demanding.

• Algorithmic improvements and ever-increasing computer capacity make it possible
to align a query sequence against a large database in a few minutes.

• The alignment algorithms depend on a scoring system that is based on the probability
that:
1) a particular nucleotide/amino acid pair is found in alignments of related
genes/proteins;
2) a particular nucleotide/amino acid pair is aligned by chance;
3) the introduction of a gap would be a better choice because it increases the score.

• The logarithm of the ratio of the first two probabilities (the log-odds value) is usually
provided in a scoring matrix.
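A minimal sketch of the log-odds value follows. It assumes that `p_ab` is the probability of seeing the ordered pair (a, b) in alignments of related sequences and that `q_a` and `q_b` are the background frequencies of the two residues, so that `q_a * q_b` is the chance of the pair being aligned at random. The function name, the numbers and the choice of base 2 (scores in bits) are illustrative assumptions.

```python
import math


def log_odds(p_ab: float, q_a: float, q_b: float, base: float = 2.0) -> float:
    """log( P(pair | related) / P(pair | chance) ) in the given base."""
    expected_by_chance = q_a * q_b
    return math.log(p_ab / expected_by_chance, base)


# Pair seen twice as often as chance predicts: positive score (about +1 bit).
print(log_odds(p_ab=0.02, q_a=0.1, q_b=0.1))
# Pair seen half as often as chance predicts: negative score (about -1 bit).
print(log_odds(p_ab=0.005, q_a=0.1, q_b=0.1))
```

A positive value means the pair occurs more often in related sequences than chance would predict; a negative value means it occurs less often.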
Scoring Matrix
• It is a set of values that describes the probability of an amino-acid or nucleotide residue to
be substituted by another in an alignment
• Also known as substitution matrix
• A positive value or a high score is given for a match
• A negative value or a low score is given for a mismatch
Nucleotide/amino acid substitutions with a higher probability of appearing in a
related gene/protein are scored less negatively than those with a lower
probability.
Transition substitutions score less
negative than
transversion substitutions
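Transitions (purine to purine, A↔G, or pyrimidine to pyrimidine, C↔T) are generally observed more often than transversions (purine to pyrimidine or the reverse), so a nucleotide scoring matrix usually penalises them less. The sketch below classifies a substitution and looks up a score; the values +1, -1 and -2 are arbitrary illustrative numbers, not taken from any published matrix.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Illustrative scores only: the transition penalty is milder than the transversion penalty.
SCORES = {"match": 1, "transition": -1, "transversion": -2}


def substitution_type(x: str, y: str) -> str:
    """Classify a nucleotide pair as match, transition or transversion."""
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def nucleotide_score(x: str, y: str) -> int:
    """Look up the score of aligning nucleotide x with nucleotide y."""
    return SCORES[substitution_type(x, y)]


for pair in [("A", "A"), ("A", "G"), ("A", "C")]:
    print(pair, substitution_type(*pair), nucleotide_score(*pair))
```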
Scoring Matrices for amino acid sequence alignments
Purpose
• The most common sequence alignment for protein is to look for similarity between different sequences in
order to infer function or establish evolutionary relationships.
• This helps researchers better understand the origin and function of genes through the nature of homology and
conservation.
Scoring matrix of amino acids is more complicated than nucleotides
• Substituting an amino acid with another from the same physico-chemical category is more likely to have a
smaller impact on the structure and function of a protein than replacement with an amino acid from a
different category.
• As a result, such substitutions (neutral or beneficial) are relatively more likely to persist within the
population.
• Therefore, it was essential that the scoring matrix for amino acids reflect the influence of the physicochemical
properties of amino acid residues on the ancestor probability (the chance of sharing a common ancestor).
Log-odds value of amino acid substitution matrix
• Is derived by analysing the occurrence of an amino-acid pair in sequences of closely related proteins
• A high log-odds value for an amino-acid pair in an alignment reflects a high ancestor probability
Example
• PAM (Point accepted mutation) Matrix
• BLOSUM (Blocks Substitution Matrix)
BLOSUM (BLOcks SUbstitution Matrix)
• Substitution matrix for sequence alignment of proteins
• Used to score alignments between evolutionarily divergent protein sequences
• First BLOSUM Matrix Construction (Steven Henikoff and Jorja Henikoff, 1992)
1. The BLOCKS (sequence pattern) database of very conserved regions of
protein families was scanned.
2. The relative frequencies of amino acids and their substitution
probabilities were counted.
3. A log-odds score was calculated for each of the 210 possible substitution
pairs of the 20 amino acids.
• Thus, the scores in the BLOSUM matrix are log-odds values derived from the frequencies of
substitutions observed in blocks of local alignments of protein sequences
• All BLOSUM matrices are based on observed alignments; unlike the PAM matrices, they are not
extrapolated from comparisons of closely related proteins
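The construction can be sketched on a toy block. The code below counts residue pairs column by column in a small ungapped block, converts the counts to observed pair frequencies, derives the residue background frequencies from them, and rounds twice the base-2 log-odds ratio (half-bit units) to an integer. The three-sequence block and the function names are made up for illustration, and the sketch omits the clustering of similar sequences that the real BLOSUM procedure applies.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(block: list[str]) -> Counter:
    """Count unordered residue pairs within each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def blosum_like_scores(block: list[str]) -> dict:
    """Return rounded half-bit log-odds scores for every observed pair."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}
    # Background frequency of each residue, as implied by the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2
    scores = {}
    for (a, b), freq in observed.items():
        # An unordered pair of different residues can arise in two orders.
        expected = background[a] * background[b] * (1 if a == b else 2)
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


toy_block = ["AAC", "AAC", "ACC"]  # hypothetical block of three sequences
print(pair_counts(toy_block))
print(blosum_like_scores(toy_block))
```

In this toy block the identical pairs receive positive scores and the A/C pair receives a negative score, because A/C is observed less often than the background frequencies predict.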
• Several sets of BLOSUM matrices exist using different alignment databases, named with numbers.
• BLOSUM matrices with
• high numbers are designed for comparing closely related sequences
• low numbers are designed for comparing distantly related sequences.
• For example, BLOSUM80 is used for closely related alignments, and BLOSUM45 is used for more distantly related
alignments.
• The alignment databases of these matrices were created by merging (clustering) all sequences
that were more similar than a given percentage into one single sequence and then comparing
sequences that were all more divergent than the given percentage value only; thus reducing the
contribution of closely related sequences.
• The percentage used was appended to the name, giving for example:
• BLOSUM80 : where sequences that were more than 80% identical were clustered.
• BLOSUM62 : matrix built by clustering sequences with ≥ 62% identity
• Note: BLOSUM62 is the default matrix for protein BLAST. Experimentation has shown that the
BLOSUM62 matrix is among the best for detecting most weak protein similarities.
Sequence alignment algorithms based on the dynamic
programming method use a scoring matrix to find the best alignment

Example:
• Needleman-Wunsch algorithm (Global alignments)
• Smith-Waterman algorithm (local alignments)
There are 3 parts to computing the best alignment using the N-W algorithm:
1. Initialization of the matrix with the scores possible
2. Fill up a matrix (table) T using the recurrence relation
3. The traceback step: use the filled-in matrix T to work out the best
alignment
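The three parts listed above can be sketched as follows. This is a minimal illustration that assumes a simple scoring scheme (match = +1, mismatch = -1, linear gap penalty = -2) instead of a substitution matrix; the function name and parameter values are choices made for this sketch. When several alignments tie for the best score, the traceback here returns only one of them.

```python
def needleman_wunsch(a: str, b: str,
                     match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    # 1. Initialization: first row and column hold cumulative gap penalties.
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        T[i][0] = i * gap
    for j in range(1, m + 1):
        T[0][j] = j * gap

    # 2. Fill: each cell is the best of diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + s,
                          T[i - 1][j] + gap,
                          T[i][j - 1] + gap)

    # 3. Traceback: walk from the bottom-right cell back to the top-left cell.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and T[i][j] == T[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return T[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

For "ACGT" and "AGT" the best global alignment places one gap opposite the C, giving three matches and one gap.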
Example
• It is essential to recognise that several different alignments may have nearly identical scores,
indicating that a dynamic programming algorithm may produce more than one optimal alignment

• Intelligent manipulation of some parameters (scoring function and gap penalties) is important and may
discriminate between alignments with similar scores
• Like the Needleman–Wunsch algorithm, of which it is a variation, Smith–Waterman
is a dynamic programming algorithm.

• The main differences from the Needleman–Wunsch algorithm are:
1. Negative scoring matrix cells are set to zero, which renders the (thus
positively scoring) local alignments visible.
2. The traceback procedure starts at the highest scoring matrix cell and proceeds until
a cell with score zero is encountered, yielding the highest scoring local
alignment.
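The differences described above can be sketched as follows, again assuming a simple scoring scheme (match = +2, mismatch = -1, linear gap penalty = -2) in place of a substitution matrix. The function name and the values are illustrative choices; if several cells share the highest score, this sketch reports only the first one found.

```python
def smith_waterman(a: str, b: str,
                   match: int = 2, mismatch: int = -1, gap: int = -2):
    """Best local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay at zero
    best, best_pos = 0, (0, 0)

    # Fill: same recurrence as the global case, but never below zero.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: start at the highest cell and stop at the first zero.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(smith_waterman("TTACGTT", "GGACGGG"))
```

The two example sequences differ everywhere except for the shared segment ACG, so only that segment is reported; a global alignment would be forced to align the flanking residues as well.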
EXAMPLE
A curse or a blessing?

Large databases are a blessing …


• They are more likely to contain something similar to the query

… and a curse
• Increasing the size of the database decreases the significance of the hits
you get
• Searching huge databases requires fast computers
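The loss of significance with database size can be illustrated with the Karlin-Altschul expression for the expected number of chance hits, E = K · m · n · e^(-λS), where m is the query length, n is the total database length and S is the raw alignment score. In the sketch below the values of `lam` (λ) and `k` (K) are placeholders chosen for illustration; real values depend on the scoring matrix and gap penalties in use.

```python
import math


def expected_hits(raw_score: float, query_len: int, db_len: float,
                  lam: float = 0.267, k: float = 0.041) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Same query and same score, searched against databases of increasing size.
for db_len in (1e6, 1e8, 1e10):
    print(f"database length {db_len:.0e}: E = {expected_hits(80, 250, db_len):.3g}")
```

Because E grows in proportion to the database length, the same alignment score becomes less significant as the database gets larger.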

What requirements does this put on software development?


• The programs must be sped up or database searches will take longer
and longer
• The false positive rate must be reduced so as not to lose specificity
