
2.-SEQUENCE ANALYSIS
When studying a novel gene, or when a gene responsible for a disease is identified within the human
genome, it is useful to determine whether related genes appear in other species. The search method used
for this must be both:

- Sensitive, as it has to pick up very distant relationships.
- Selective, as all the relationships that it reports must be true.

The search is performed by comparing the sequence against related or similar ones. The sequences
identified may correspond to ORFs, regulatory elements and transcribed regions, or reflect the
organization of genes. From them, these methods try to predict the likely function of the protein and the
phenotype it may confer. Two programs that perform these tasks are BLAST (Basic Local Alignment Search
Tool) and FASTA (FAST-All).

Let’s start with picking out genes in genomes. As already seen in the previous unit, an ORF (Open Reading
Frame) is a region of DNA sequence that starts with an initiation codon (ATG) and ends with a stop codon
(TAA, TGA, TAG). It is a potential protein-coding region. There are two possible approaches to detect a
gene in a genome.

- Detection of regions similar to already-known coding regions from other organisms, or similar to
ESTs derived from mRNAs that correspond to genes known to be transcribed. Note that an expressed
sequence tag (EST) is a short sub-sequence of a cDNA sequence. In turn, cDNA is a DNA copy of an
mRNA, produced by reverse transcription.
- Ab initio methods, which seek to identify genes from the properties of the DNA sequence itself. Ab
initio methods rely on prior “knowledge” of what a gene looks like rather than on similarity to
known sequences: they search for patterns in the DNA that satisfy gene-like criteria.

Some of the criteria used to detect sequences via ab initio methods are:

- The initial (5’) exon starts with a transcription initiation point, preceded by a core promoter site
(TATA box) placed about 30 bp upstream. Furthermore, this exon should be free of in-frame stop
codons and end immediately before a GT dinucleotide splice signal. Occasionally a non-coding exon
may precede this initial exon.
- Internal exons, like initial exons, must be free of in-frame stop codons. They begin immediately
after an AG splice signal and end immediately before a GT splice signal.
- The final (3’) exon starts immediately after an AG splice signal and ends with a stop codon, followed
by a polyadenylation signal sequence. Occasionally there may be a non-coding exon after it.

Example: could the following sequence generate a potential ORF of 17 amino acids? And if so, is this ORF
real?

TTTATACTCTCTCGGGCTCGTCTGGGAGATCGCTTGCTCATGTTAGTACGTAAGGTAAAGCGTAT
CGTAGGGCTAGCTAGCGAGTGTAGTTTTAAAGGTCGCGTATATATAGCGCGTATGCCGCGATA
TAGCGTTTCGCAGCGTATCGATACGCTATGCTATGCAGCTATAGTTTTTCCCTGCGTCACATGTAC
GATCGATGGGGTATACGATTAGGCGCGCTATACGATTAGCGATCTAGCGAGAGTTTTTAGAAAT
AAAAGGTCATG

If the sequence is assumed to be mRNA, querying it in ORF finder does not yield the desired 17 amino acid
reading frame. However, if it is assumed to be genomic DNA, several indicators can be found, such as the
TATA box (highlighted in yellow) and a start codon followed by coding sequence up to a GT splice signal
(highlighted in green). Each intron starts with the GT splice signal and ends with the AG splice signal
(highlighted in blue). After removing the introns, the final mRNA would be:

ATG TTA GTA CGT AAG TGT AGT TTT AAA GCT ATA GTT TTT CCC TGC AGT TTT TAG AAA TAA AAG GTC ATG

Querying this sequence yields a 17 amino acid peptide. To find out whether this protein exists in nature,
it is necessary to search the sequence databases. Using the translated sequence and the BLAST tool
provided with ORF finder (note that BLAST performs pairwise alignments), it is possible to determine
whether the protein is present.
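
The translation step of this example can be reproduced with a few lines of Python. This is only a
minimal sketch, not how ORF finder itself is implemented: it assumes the standard genetic code, reads
only the frame of the first ATG on the given strand, and the names CODON_TABLE, SPLICED_MRNA and
translate_first_orf are illustrative.

```python
# Standard genetic code, built from the usual TCAG ordering of the three codon positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Spliced sequence from the example above (spaces only separate codons).
SPLICED_MRNA = (
    "ATG TTA GTA CGT AAG TGT AGT TTT AAA GCT ATA GTT "
    "TTT CCC TGC AGT TTT TAG AAA TAA AAG GTC ATG"
)


def translate_first_orf(sequence):
    """Translate from the first ATG up to the first in-frame stop codon.

    Returns the peptide as one-letter codes, or "" if there is no ATG
    or no in-frame stop codon.
    """
    sequence = sequence.upper().replace(" ", "").replace("U", "T")
    start = sequence.find("ATG")
    if start == -1:
        return ""
    peptide = []
    for pos in range(start, len(sequence) - 2, 3):
        residue = CODON_TABLE[sequence[pos:pos + 3]]
        if residue == "*":  # stop codon closes the ORF
            return "".join(peptide)
        peptide.append(residue)
    return ""


peptide = translate_first_orf(SPLICED_MRNA)
print(peptide, len(peptide))  # MLVRKCSFKAIVFPCSF 17
```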

2.1.-BLASTN
BLAST (Basic Local Alignment Search Tool) is an algorithm for comparing primary biological sequence
information, such as the amino acid sequences of proteins or the nucleotide sequences of DNA. A BLAST
search enables a researcher to compare a query sequence with a library or database of sequences, and to
identify library sequences that resemble the query sequence above a certain threshold. NCBI provides
BLAST programs for both nucleotides (BLASTN) and proteins (BLASTP), as well as several variants. They are
described as follows:

- BLASTN queries a nucleotide sequence against a nucleotide database.
- BLASTP queries an amino acid sequence against a protein sequence database.
- BLASTX translates a nucleotide query in all six reading frames and searches the translations against a
protein sequence database. It is useful for finding proteins similar to those encoded by a nucleotide
query, and is mostly used when it is not known whether the sequence fragment encodes a protein.
- TBLASTN queries an amino acid sequence against a nucleotide database translated in all six reading
frames. It is useful for finding protein homologs in unannotated nucleotide data, that is, in
sequences whose encoded proteins have not yet been described. It is used when BLASTP gives a
negative result.
- TBLASTX translates a nucleotide query and searches it against a translated nucleotide sequence
database. It is useful for identifying novel genes in error-prone nucleotide query sequences.
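
The list above amounts to a small decision table. As a hedged illustration (the function name
choose_blast_program and its arguments are hypothetical and are not part of any NCBI interface), it
could be written as:

```python
def choose_blast_program(query, database, compare_translations=False):
    """Pick a BLAST program from the query type and the database type.

    query and database are either "nucleotide" or "protein".
    compare_translations only matters for nucleotide vs nucleotide searches:
    if True, both sides are compared as translated proteins (TBLASTX).
    """
    if query == "protein" and database == "protein":
        return "BLASTP"
    if query == "nucleotide" and database == "protein":
        return "BLASTX"   # the query is translated
    if query == "protein" and database == "nucleotide":
        return "TBLASTN"  # the database is translated
    if query == "nucleotide" and database == "nucleotide":
        return "TBLASTX" if compare_translations else "BLASTN"
    raise ValueError("query and database must be 'nucleotide' or 'protein'")


print(choose_blast_program("protein", "nucleotide"))  # TBLASTN
```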

NCBI provides a web-based BLAST tool; let’s start with BLASTN. In the tool it is possible to select the
database under the “Choose Search Set” section, which also includes other options such as “Organism”.
Once the parameters have been selected, one has to click “BLAST” to start the search.
The hits are displayed under the “Descriptions” tab, sorted according to their quality. Some of the
reported values are:

- Total score is a numerical value that describes the overall quality of the alignment; it is the sum
of the scores of all aligned segments found in the same database sequence. Higher numbers
correspond to higher similarity. The score scale depends on the scoring system used (substitution
matrix, gap penalty, …).
- Query cover is the percentage of the query length that is covered by the alignment with the
candidate sequence.
- E value (or expectation value) is a correction of the p-value for multiple testing: it is the number
of hits with a score at least this good that one expects to find by chance in a database of this size.
In the context of database searches, the lower the E value, the more significant the score is.
Sequences with E values less than 1e-04 can be considered related with an error rate of less than
0.01 %, and thus can be considered as homologous or related to the queried sequence.

An alternative representation is shown under the “Alignments” tab. Some of the values displayed here
are:

- Bit score, a log-scaled, normalized version of the raw score. Expressed in bits, it lets one estimate
the magnitude of the search space that one would have to look through before expecting to find a
score as good as or better than this one by chance.
- E value, already explained above.
- Identities is the number and percentage of matching amino acids/nucleotides in the alignment.
- Gaps is the number and percentage of gap positions in the alignment.
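
The link between bit score and E value can be made concrete with the usual textbook relation
E = m · n · 2^(-S'), where S' is the bit score and m and n are the lengths of the query and of the
database. The sketch below assumes this simplified form; the function name expected_hits is
illustrative, and the values reported by NCBI differ somewhat because effective (edge-corrected)
lengths are used there.

```python
def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score.

    Simplified relation E = m * n * 2**(-S'), ignoring the length
    corrections applied by real BLAST implementations.
    """
    search_space = query_length * database_length
    return search_space * 2.0 ** (-bit_score)


# Each additional bit halves the number of hits expected by chance.
print(expected_hits(40, 250, 1_000_000))
print(expected_hits(41, 250, 1_000_000))
```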

2.2.-SEQUENCE ALIGNMENT
Given two or more sequences, we wish to:

- Measure their similarity.
- Determine the residue-residue correspondences.
- Observe patterns of conservation and variability.
- Infer evolutionary relationships.

Any assignment of correspondences that preserves the order of the residues within the sequences is an
alignment. Gaps may be introduced. For instance, take the following two strings:

A reasonable way to align them would be:

However, there are several ways to align them correctly. Therefore, criteria must be defined so an
algorithm can choose the best alignment. For the sequences GCTGAACG and CTATAATC we have the
following possibilities:

Uninformative alignment

Gapless alignment

Alignment with gaps

Four main types of sequence alignment can be devised. They are listed as follows:

- Global match consists of the alignment of all of one sequence with all of the other. This illustrates
mismatches, insertions, and deletions.
- Local match: a region of one sequence is aligned with a matching region of the other. For local
matching, overhangs at the ends are not treated as gaps. In addition to mismatches, insertions and
deletions within the matched region are also possible.
- Motif match consists of finding matches of a short sequence in one or more regions internal to
a long one. Mismatches, insertions or deletions may be allowed.
- Multiple alignment consists of a mutual alignment of many sequences.

2.2.1.-THE DOTPLOT
A dotplot is a simple picture that gives an overview of the similarities between two sequences, in the
form of a table or matrix. Rows correspond to the residues of one sequence and columns to the residues
of the other sequence. Positions are left blank if the residues are different and filled if they match.

Some idea of the similarity of the two sequences can be gleaned from the number and length of matching
segments shown in the matrix. Identical sequences will have an unbroken line along the main diagonal of
the matrix. Insertions and deletions between sequences give rise to disruptions in this diagonal. Regions
of local similarity or repetitive sequences give rise to further diagonal matches in addition to the central
diagonal.

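(X = match, . = mismatch)

  G A G T G T T A A T T C G
G X . X . X . . . . . . . X
A . X . . . . . X X . . . .
T . . . X . X X . . X X . .
T . . . X . X X . . X X . .
A . X . . . . . X X . . . .
C . . . . . . . . . . . X .
C . . . . . . . . . . . X .
A . X . . . . . X X . . . .
A . X . . . . . X X . . . .
T . . . X . X X . . X X . .
T . . . X . X X . . X X . .
C . . . . . . . . . . . X .
G X . X . X . . . . . . . X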
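
A dotplot such as the one above is simple to compute. The following is a minimal sketch that compares
single residues only (practical dotplot programs usually filter the matches with a sliding window); the
function name dotplot is illustrative.

```python
def dotplot(row_sequence, column_sequence):
    """Return the dotplot as a list of strings, one string per row residue.

    A cell holds "X" when the row residue equals the column residue,
    and "." otherwise.
    """
    return [
        "".join("X" if r == c else "." for c in column_sequence)
        for r in row_sequence
    ]


rows, columns = "GATTACCAATTCG", "GAGTGTTAATTCG"
print("  " + " ".join(columns))
for residue, line in zip(rows, dotplot(rows, columns)):
    print(residue + " " + " ".join(line))
```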

Any path through the dotplot from upper left to lower right passes through a succession of cells, each of
which either picks out a pair of positions, one from the row and one from the column, that are matched in
the alignment corresponding to that path, or indicates a gap in one of the sequences. The path need not
pass through filled-in positions only. However, the more filled-in points on the path, the more matching
residues in the alignment. The path may only move right, down, or diagonally down and right, as doubling
back would correspond to aligning several residues of one sequence with only one residue of the other, or
to introducing gaps in both sequences at once. For instance, the blue path would read:

G A - T - A - - - - - C -
G A T T A C C A A T T C G
Dotplot showing identities between a repetitive sequence and itself. The repeats appear on several
subsidiary diagonals parallel to the main one.

In the example below there is a dotplot showing identities between a palindromic sequence and itself.
The palindrome reveals itself as a stretch of matches perpendicular to the main diagonal.
This is not just word play: regions of DNA recognized by transcriptional regulators or restriction enzymes
have sequences related to palindromes, crossing from one strand to the other. Within each strand, a
region is followed by its reverse complement. Longer regions of DNA or RNA containing inverted repeats
of this form can form stem-loop structures. In addition, some transposable elements in plants contain
true (approximate) palindromic sequences, that is, inverted repeats of non-complemented sequences on
the same strand.
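
The two kinds of palindrome can be told apart with a short check. This sketch uses the EcoRI recognition
site GAATTC as an example of a site that equals its own reverse complement; the function names are
illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def is_reverse_palindrome(dna):
    """True if the sequence reads the same on both strands (e.g. restriction sites)."""
    return dna == reverse_complement(dna)


def is_true_palindrome(dna):
    """True if the sequence reads the same backwards on the same strand."""
    return dna == dna[::-1]


print(is_reverse_palindrome("GAATTC"), is_true_palindrome("GAATTC"))  # True False
print(is_reverse_palindrome("GATTAG"), is_true_palindrome("GATTAG"))  # False True
```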

2.2.2.-SEQUENCE ALIGNMENT TOOLS


ClustalW2 is a general purpose DNA or protein multiple sequence alignment program for three or more
sequences. However, as of today the ClustalW2 web service has been terminated.

Another interesting tool is Needle (EMBOSS), a pairwise alignment tool (that is, it can only align two
sequences at a time) which creates an optimal global alignment of two sequences using the
Needleman-Wunsch algorithm (explained in later sections of the unit). One can query both nucleotide and
amino acid sequences. Furthermore, it is possible to modify the scoring options (to be reviewed in later
sections).

The output of EMBOSS Needle consists of an identity and a similarity percentage, followed by a gap
percentage and a score (the higher the score, the better). Finally, a visual alignment is provided. Note
that the obtained score depends on the chosen scoring parameters.

MUSCLE, or Multiple Sequence Comparison by Log-Expectation, is a tool used for multiple sequence
alignment. One has to copy all the sequences into the input box, set up the parameters and submit the
job. The output includes a percentage identity matrix between all of the queried sequences.

2.3.-SEQUENCE SIMILARITY
Dotplots do not allow for a quantitative measurement of sequence similarity. In order to obtain one, it is
necessary to define a scoring system that takes into account the distance between character strings. In
information theory there are two commonly used distances:

- Hamming distance between two strings of equal length is the number of positions at which the
corresponding symbols are different (just mismatches). In other words, it measures the minimum
number of substitutions required to change one string into the other, or the minimum number of
errors that could have transformed one string into another.
- Levenshtein or edit distance is defined between two strings of not necessarily equal length. It is
defined as the minimum number of single symbol edits (insertions, deletions or substitutions)
required to change one string into the other.

For the examples below, the pair of strings on the left has a Hamming distance of 2 (and a Levenshtein
distance of 2 as well), while the pair on the right has only a Levenshtein distance, of 3 (the Hamming
distance cannot be measured, as the strings do not have the same length).

Example: calculate the Hamming and Levenshtein distances between the two strings ATGGCTTCG and
CTTCTTCGG, and then between GGAATGG and ATG.

A T G G C T T C G
C T T C T T C G G

Compared position by position, the strings differ at 6 positions, so the Hamming distance is 6. The
Levenshtein distance is smaller, 4, because an insertion and a deletion bring the shared segment CTTCG
into register:

A T G G C T T C G -
C T T - C T T C G G

Note that Hamming distance is ONLY applicable when the string lengths are the same.

In the second case there are different alignments:

G G A A T G G        G G A A T G G
- - - A T G -        - - A T - G -

These alignments require 4 and 5 edits, respectively. The Levenshtein distance is the minimum, 4, so any
alignment that costs more than 4 edits is suboptimal for this case.
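
Both distances are easy to compute. The sketch below is a straightforward implementation with
illustrative function names; the Levenshtein distance is filled in row by row, keeping only the
previous row of the table in memory.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # match or substitute
            ))
        previous = current
    return previous[-1]


print(hamming("ATGGCTTCG", "CTTCTTCGG"))      # 6
print(levenshtein("ATGGCTTCG", "CTTCTTCGG"))
print(levenshtein("GGAATGG", "ATG"))          # 4
```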

However, in biology certain changes are more likely to take place than others, for instance:

- Amino acid substitutions tend to be conservative. A replacement of one amino acid by another
with similar size or physicochemical properties is more common than one which is radically
different.
- Deletion of a succession of contiguous bases or amino acids is a more probable event than the
independent deletion of the same number of bases or amino acids at non-contiguous positions in
the sequence.

In order to account for all of this variability it is necessary to assign variable weights to the different edit
operations (deletions, substitutions, etc…) to obtain an optimal alignment.

2.3.1.-SCORING SCHEME
Four main biological events must be considered during sequence alignment: conservation, substitution,
insertion and deletion. A conservation is recorded when the two compared letters are the same (a match),
a substitution when one letter is aligned with a different one (a mismatch), and an insertion or deletion
when a letter in one of the two aligned sequences is aligned with a gap. The matches, mismatches and gaps
found by an alignment are not guaranteed to reflect the biological truth, since their arrangement depends
on the chosen scoring scheme and the selected alignment algorithm. In order to improve the correlation
between computed sequence alignment and biological similarity, specific combinations of scoring schemes
and alignment algorithms have been developed over the years and are usually adopted for the alignment of
different types of biological sequences.

Given an alignment structure that stores the two sequences and a scoring scheme, the score of the
alignment is computed as the sum of the scores for aligned character pairs plus the sum of the scores for
all gaps. This means that the scoring function must be linear (additive over the alignment). The scoring
function differs from the scoring scheme in that the scoring scheme is the set of rules used to assess the
possible biological events that must be considered during the alignment procedure.

As such, the scoring scheme has to take into account two phenomena separately:

- Matches and mismatches (conservations and substitutions).
- Gap models (insertions and deletions).

The match/mismatch evaluation can be of variable complexity, but there are two main ways to tackle the
problem.

- Simple scoring (also known as the Levenshtein distance model) applies a score of 0 for a match and
-1 for a mismatch, and a penalty of -1 for each gap position representing an insertion or deletion.
- Substitution matrix scoring relies on substitution matrices, which are built on the basis of the
probability that a particular amino acid or nucleotide is replaced with another during the
evolutionary process. They assign to each pair a value that indicates its degree of similarity,
obtained by statistical methods reflecting the frequency of a particular substitution. Take nucleic
acids, for instance: in these sequences transitions are more frequent than transversions, meaning
that A↔G and C↔T are more frequent than, for example, A↔T and G↔C. Thus a good substitution matrix
would take this phenomenon into account as follows:

     A   G   T   C
A   20  10   5   5
G   10  20   5   5
T    5   5  20  10
C    5   5  10  20
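
As an illustration, the matrix above can be used to score a gapless alignment by adding up one matrix
entry per column. This is a minimal sketch; the names MATRIX, substitution_score and score_gapless are
illustrative, and the two example sequences are made up.

```python
BASES = "AGTC"  # row and column order of the matrix above
MATRIX = [
    [20, 10, 5, 5],   # A
    [10, 20, 5, 5],   # G
    [5, 5, 20, 10],   # T
    [5, 5, 10, 20],   # C
]


def substitution_score(x, y):
    """Look up the score for aligning nucleotide x with nucleotide y."""
    return MATRIX[BASES.index(x)][BASES.index(y)]


def score_gapless(a, b):
    """Score two equal-length sequences aligned without gaps."""
    if len(a) != len(b):
        raise ValueError("a gapless alignment needs sequences of equal length")
    return sum(substitution_score(x, y) for x, y in zip(a, b))


# Five identities (20 each) and two transitions, T/C and C/T (10 each).
print(score_gapless("GATTACA", "GACTATA"))  # 120
```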

The gap evaluation makes use of different models that are applied when a gap is detected. There are
three main models to assess the gap penalty:

- Linear gap model: the gap length (g) is multiplied by the extension penalty (d).
P = -g · d
- Affine gap model: it takes into account that the first amino acid or nucleotide inserted or deleted
is more significant, from the biological point of view, than the subsequent ones. It attributes
different costs to the gap opening (c) and the gap extension (d) events, so that the presence of a
gap is penalized more heavily than its length (g).
P = -c - (g - 1) · d
- Dynamic gap model: it can be used to reduce the computational time and the memory requirement
while keeping the alignment scores close to those computed with the affine gap model.
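
The linear and affine formulas translate directly into code. In this sketch the function names are
illustrative, and the penalties c and d are passed as positive numbers, as in the formulas above.

```python
def linear_gap_penalty(gap_length, extension):
    """P = -g * d: every gap position costs the same."""
    return -gap_length * extension


def affine_gap_penalty(gap_length, opening, extension):
    """P = -c - (g - 1) * d: the first position costs c, each further one d."""
    if gap_length == 0:
        return 0
    return -opening - (gap_length - 1) * extension


# One gap of length 3 versus three separate gaps of length 1 (c = 16, d = 4).
print(affine_gap_penalty(3, 16, 4))      # -24
print(3 * affine_gap_penalty(1, 16, 4))  # -48
```

With the affine model a single long gap is cheaper than several short ones of the same total length,
which matches the observation that contiguous deletions are more probable than independent ones.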

In this course the affine gap model will be used. Note that different alignment programs make use of
different substitution matrices and different gap models.

- For DNA sequences, CLUSTALW2 (multiple alignment) recommends the use of the identity matrix
for substitution (+1 for a match, 0 for a mismatch) and gap penalties of 10 for gap initiation and
0.1 for gap extension by one residue.
- For protein sequences, the recommendations are to use the BLOSUM62 or PAM250 matrices (to
be explained in later sections) for substitutions, and gap penalties of 11 for gap initiation and 1
for gap extension by one residue.

Example: for the two sequences ACGACTGGCA and ACTATGGCA, obtain the alignment score with the
following criteria: match +4, mismatch -5, gap opening -16 and gap extension -4.

First it is necessary to obtain the dotplot graph.

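(X = match, . = mismatch)

  A C G A C T G G C A
A X . . X . . . . . X
C . X . . X . . . X .
T . . . . . X . . . .
A X . . X . . . . . X
T . . . . . X . . . .
G . . X . . . X X . .
G . . X . . . X X . .
C . X . . X . . . X .
A X . . X . . . . . X

From here we can obtain several alignments, for example: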

Alignment A:

   A   C   G   A   C   T   G   G   C   A
   A   C   T   A   -   T   G   G   C   A
  +4  +4  -5  +4 -16  +4  +4  +4  +4  +4

Alignment B:

   A   C   G   A   C   T   G   G   C   A
   -   -   -   A   C   T   A   -   -   A
 -16  -4  -4  +4  +4  +4  -5 -16  -4  +4

(Note that each diagonal step in the dotplot is a pairing, while each horizontal or vertical step is a
gap.) Now the alignments have to be scored accordingly, by adding up the values of all columns:

Alignment A = 8 · 4 - 5 - 16 = 11
Alignment B = 4 · 4 - 5 - 2 · 16 - 3 · 4 = -33
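
The same bookkeeping can be automated. The following sketch scores an already-built alignment column by
column with the criteria of this example (match +4, mismatch -5, gap opening -16, gap extension -4); the
function name score_alignment is illustrative, and gaps are written as "-".

```python
def score_alignment(top, bottom, match=4, mismatch=-5, gap_open=-16, gap_extend=-4):
    """Score two aligned rows of equal length under an affine gap model."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    previous_gap = None  # row that held a gap in the previous column, if any
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            current_gap = "top" if x == "-" else "bottom"
            # The first column of a gap pays the opening cost, later ones the extension.
            total += gap_extend if current_gap == previous_gap else gap_open
            previous_gap = current_gap
        else:
            total += match if x == y else mismatch
            previous_gap = None
    return total


print(score_alignment("ACGACTGGCA", "ACTA-TGGCA"))  # 11
print(score_alignment("ACGACTGGCA", "---ACTA--A"))  # -33
```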

2.3.2.-DERIVATION OF SUBSTITUTION MATRICES


One of the first amino acid substitution matrices, the PAM (Point Accepted Mutation) matrix, was
developed in the 1970s. This matrix is calculated by observing the differences in closely related proteins.
The PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had
changed (and therefore two sequences 1 PAM apart have 99% identical residues). The PAM1 matrix is
used as the basis for calculating other matrices by assuming that repeated mutations would follow the
same pattern as those in the PAM1 matrix, and multiple substitutions can occur at the same site. The
matrices are based on amino acid substitution frequencies found in nature, and as such a common
change should score higher than a rare one.

As mentioned, the PAM1 matrix is taken as the basis, and it is constructed from statistical data observed
in nature. To produce matrices appropriate for more widely divergent sequences, powers of the PAM1 matrix
are taken. The PAM250 level corresponds to about 20% overall sequence identity, which is the lowest
sequence similarity for which we can hope to produce a correct alignment by simple pairwise sequence
comparison alone.
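
Taking powers of the PAM1 matrix can be illustrated with a toy example. The sketch below does not use
real PAM data: TOY_PAM1 is a made-up mutation probability matrix for a hypothetical three-letter
alphabet in which each letter stays unchanged with probability 0.99 over one step. It shows only the
extrapolation step; the scoring matrices used in practice are obtained afterwards as log-odds of such
probabilities.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_power(matrix, exponent):
    """Raise a square matrix to a non-negative integer power."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(exponent):
        result = mat_mul(result, matrix)
    return result


# Hypothetical 1-step matrix: 99% chance that a residue is unchanged.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(value, 3) for value in row])
```

After 250 steps the diagonal entries are far below 0.99, because repeated and reverting substitutions at
the same site are taken into account, and every row still sums to 1.
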
The entries are in alphabetical order of the three-letter amino acid names; the example below corresponds
to the Dayhoff-derived PAM250 matrix. Note that the substitution rates are assumed to be symmetrical
(as it is not possible to determine the direction of a substitution from the aligned sequences).

There is another family of matrices, namely BLOSUM (BLOcks SUbstitution Matrix), which is intended to
replace the PAM matrices with matrices that perform better in identifying distant relationships. The
matrices are built from statistical data from evolutionarily divergent proteins. The probabilities used in
the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple
protein alignments. To reduce bias from closely related sequences, segments in a block with a sequence
identity above a certain threshold were clustered, and each such cluster was weighted as a single
sequence. An example of a BLOSUM matrix is presented below.

It is important to distinguish the applications of BLOSUM and PAM matrices. The differences are listed
as follows:

- PAM matrices are based on an explicit evolutionary model (i.e. replacements are counted on the
branches of a phylogenetic tree), whereas BLOSUM matrices are based on an implicit model of
evolution.
- The PAM matrices are based on mutations observed throughout a global alignment, which includes
both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly
conserved regions (to analyze distant genes) in series of alignments that contain no gaps.
- The method used to count the replacements is different: unlike the PAM matrix, the BLOSUM
procedure uses groups of sequences within which not all mutations are counted the same.
- Higher numbers in the PAM matrix naming scheme denote larger evolutionary distances, while
larger numbers in BLOSUM matrix naming scheme denote higher sequence similarity and
therefore smaller evolutionary distance. PAM 150 is used for more distant sequences than PAM
100; BLOSUM62 is used for closer sequences than BLOSUM50.

2.3.3.-QUICK DATABASE SCREENING


All the aforementioned methods are excellent at aligning and matching sequences; however, they are too
slow to compare a sequence against entire databases. For that task, approximate methods have been
designed to complete the search quickly. They can detect close relationships well and quickly but are
inferior to the exact ones in picking up very distant relationships.

In practice, they give satisfactory performance in the many cases in which the probe sequence is fairly
similar to one or more sequences in the databank. For instance, the BLAST algorithm works as follows:

1. BLAST first divides the probe sequence into words of fixed length k. It then identifies all
exact occurrences of these words in the full database (no mismatches, no gaps).
2. Starting with each match, BLAST tries to extend the match in both directions. Still no mismatches,
no gaps allowed.
3. Given the extended matches, BLAST tries to put them together by doing alignments allowing
mismatches and gaps, but only within limited regions containing the preliminary matches.
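
Steps 1 and 2 can be sketched as follows. This is a simplified illustration of the description above and
not the actual BLAST implementation: it uses exact word matches and exact extension only, whereas real
BLAST extends seeds by score and then performs the gapped alignment of step 3. The function name
seed_and_extend and the two example sequences are made up.

```python
def seed_and_extend(query, subject, k=3):
    """Find exact k-letter word matches and extend them without mismatches or gaps.

    Returns a list of (query_start, subject_start, length) tuples,
    longest match first.
    """
    # Step 1: index every word of length k in the subject sequence.
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)

    hits = set()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            # Step 2: extend the seed to the left and to the right.
            q_start, s_start = i, j
            while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
                q_start -= 1
                s_start -= 1
            q_end, s_end = i + k, j + k
            while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
                q_end += 1
                s_end += 1
            hits.add((q_start, s_start, q_end - q_start))
    return sorted(hits, key=lambda hit: (-hit[2], hit[0], hit[1]))


print(seed_and_extend("ACGTTGCA", "TTACGTTGGA"))  # [(0, 2, 6)]
```

Several overlapping seeds (ACG, CGT, GTT, TTG) extend to the same segment, which is why the hits are
collected in a set.
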
2.4.-DYNAMIC PROGRAMMING
In computer science, dynamic programming (also known as dynamic optimization) is a method for
solving a complex problem by breaking it down into a collection of simpler subproblems, solving each of
those subproblems just once, and storing their solutions. The next time the same subproblem occurs,
instead of recomputing its solution, one simply looks up the previously computed solution, thereby saving
computation time at the expense of a (hopefully) modest expenditure in storage space.

The dynamic programming table is generated by selecting in each cell the best value resulting from the
three possible calculations (the smallest when a distance is minimized, the largest when a similarity
score is maximized): the diagonal movement evaluates and scores a match or substitution, and the
horizontal and vertical movements evaluate and score the impact of inserting a gap into one or the other
sequence. An additional table is usually drawn to record the different paths that lead to the solution.
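
A minimal sketch of such a table, assuming the simple edit-distance scoring (0 for a match, 1 for a
mismatch or a gap position), is given below. The second table, called move here, records which of the
three movements produced each cell so that one optimal alignment can be traced back. All names are
illustrative.

```python
def align_min_edit(a, b):
    """Fill the dynamic programming table and trace back one optimal alignment.

    Returns (distance, aligned_a, aligned_b), with "-" marking gaps.
    """
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], move[i][0] = i, "up"
    for j in range(1, m + 1):
        cost[0][j], move[0][j] = j, "left"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]), "diag"),  # match or substitution
                (cost[i - 1][j] + 1, "up"),                             # gap in b
                (cost[i][j - 1] + 1, "left"),                           # gap in a
            ]
            cost[i][j], move[i][j] = min(candidates, key=lambda c: c[0])

    # Trace back from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if move[i][j] == "diag":
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif move[i][j] == "up":
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return cost[n][m], "".join(reversed(top)), "".join(reversed(bottom))


distance, row_a, row_b = align_min_edit("GGAATGG", "ATG")
print(distance)
print(row_a)
print(row_b)
```

When several movements tie, this sketch keeps the first one in the candidate list, so it reports only
one of the possibly many optimal alignments.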
