
Lecture 4: BLAST

Ly Le, PhD
School of Biotechnology
Email: ly.le@hcmiu.edu.vn
Office: Rm 501, HCM International University
Concepts of Sequence Similarity
Searching
• The premise: One sequence by itself is not
informative; it must be analyzed by
comparative methods against existing
sequence databases to develop hypotheses
concerning relatives and function.

BLAST
• Basic Local Alignment Search Tool
• Developed in 1990 and 1997 (S. Altschul)
• A heuristic method for performing local
alignments through searches of high scoring
segment pairs (HSP’s)
– Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990)
“Basic local alignment search tool.” J. Mol. Biol. 215:403-410.
– Altschul SF, Madden TL, Schaeffer AA, Zhang J, Zhang Z,
Miller W, Lipman DJ (1997) “Gapped BLAST and PSI-BLAST:
a new generation of protein database search programs.” NAR
25:3389-3402.

Lecture 3.1 3
What BLAST tells you ...
• BLAST reports surprising alignments
- Different from what is expected by chance
• Assumptions
- Random sequences
- Constant composition
• Conclusions
- Surprising similarities imply evolutionary homology

Evolutionary homology: descent from a common ancestor.
Homology does not always imply similar function.
How Does BLAST Really Work?

• The BLAST programs improved the overall
speed of searches while retaining good
sensitivity (important as databases continue
to grow) by breaking the query and database
sequences into fragments ("words"), and
initially seeking matches between fragments.
• Word hits are then extended in either
direction in an attempt to generate an
alignment with a score exceeding the
threshold of "S".
BLAST Algorithm
• Scoring of matches done using scoring
matrices
• Sequences are split into words (default
word size n=3 for protein searches)
– Speed, computational efficiency
• BLAST algorithm extends the initial
“seed” hit into an HSP
– HSP = high scoring segment pair = Local optimal
alignment
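
The word-and-extend idea above can be sketched in a few lines. This is a simplified illustration, not the BLAST implementation: it assumes exact word matches (real BLAST also accepts "neighborhood" words scoring above a threshold T), a +1/-1 match/mismatch score instead of a substitution matrix, and an ungapped extension that stops once the running score falls a fixed amount (`x_drop`, a name chosen here) below the best score seen.

```python
def words(seq, w=3):
    """Return every overlapping word of length w together with its start offset."""
    return [(seq[i:i + w], i) for i in range(len(seq) - w + 1)]


def find_seeds(query, subject, w=3):
    """Exact word matches between query and subject, as (query_pos, subject_pos)."""
    index = {}
    for word, pos in words(subject, w):
        index.setdefault(word, []).append(pos)
    return [(qpos, spos)
            for word, qpos in words(query, w)
            for spos in index.get(word, [])]


def extend_seed(query, subject, qpos, spos, w=3, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of one seed; returns (score, q_start, q_end, s_start)."""
    def score(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    best = sum(score(query[qpos + k], subject[spos + k]) for k in range(w))
    q_start, q_end = qpos, qpos + w  # half-open interval on the query

    # Extend to the right until the score drops x_drop below the best seen.
    running, i, j = best, qpos + w, spos + w
    while i < len(query) and j < len(subject):
        running += score(query[i], subject[j])
        if running > best:
            best, q_end = running, i + 1
        elif best - running >= x_drop:
            break
        i, j = i + 1, j + 1

    # Extend to the left in the same way.
    running, i, j = best, qpos - 1, spos - 1
    while i >= 0 and j >= 0:
        running += score(query[i], subject[j])
        if running > best:
            best, q_start = running, i
        elif best - running >= x_drop:
            break
        i, j = i - 1, j - 1

    return best, q_start, q_end, spos - (qpos - q_start)


query, subject = "MKIQIYGT", "AAKIQIYGCC"
for qpos, spos in find_seeds(query, subject):
    print((qpos, spos), extend_seed(query, subject, qpos, spos))
```

Each seed here extends to the same segment, which is why BLAST-like tools merge or discard seeds that fall inside an already reported HSP.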

Scoring matrices: PAM (Percent Accepted
Mutation)

Amino acids are grouped according to the chemistry of the side group: (C) sulfhydryl; (STPAG) small
hydrophilic; (NDEQ) acid, acid amide and hydrophilic; (HRK) basic; (MILV) small hydrophobic; and (FYW)
aromatic. Log-odds values: a positive value such as +10 means the pair is seen more often in related
sequences than expected by chance, 0 means the two probabilities are equal, and a negative value such as -4
means the pair is seen less often than expected by chance. Scores add along an alignment: YY/YY scores
10+10=20, whereas YY/TP scores -3-5=-8, a rare and unexpected pairing between homologous sequences.
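
The additive scoring in the YY/YY versus YY/TP example can be written out directly. The sketch below uses only the three matrix entries quoted on this slide; the dictionary name and helper functions are made up for illustration, and a real search would load the full 20 x 20 matrix.

```python
# Only the entries quoted on the slide; a full matrix has 210 distinct pairs.
SLIDE_SCORES = {("Y", "Y"): 10, ("Y", "T"): -3, ("Y", "P"): -5}


def pair_score(a, b, matrix=SLIDE_SCORES):
    """Look up a substitution score; the matrix is symmetric, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def alignment_score(seq1, seq2, matrix=SLIDE_SCORES):
    """Score of an ungapped alignment: the sum of the per-position scores."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(alignment_score("YY", "YY"))  # 20
print(alignment_score("YY", "TP"))  # -8
```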
Scoring matrices: BLOSUM62
(BLOcks amino acid SUbstitution Matrices)

The idea behind BLOSUM is similar, but it is calculated from a very different and much larger set of
proteins: conserved, ungapped aligned regions (blocks) taken from families of related proteins.
PAM versus BLOSUM

PAM
• First useful scoring matrix for proteins
• Assumed a Markov model of evolution (i.e. all sites equally mutable and independent)
• Derived from small, closely related proteins with ~15% divergence

BLOSUM
• Much later entry to the matrix “sweepstakes”
• No evolutionary model is assumed
• Built from PROSITE-derived sequence blocks
• Uses a much larger, more diverse set of protein sequences (30% - 90% ID)

PAM Matrices
• PAM 40 - prepared by raising the PAM 1 mutation
probability matrix to the 40th power
– best for short alignments with high similarity
• PAM 120 - prepared by raising the PAM 1 mutation
probability matrix to the 120th power
– best for general alignment
• PAM 250 - prepared by raising the PAM 1 mutation
probability matrix to the 250th power
– best for detecting distant sequence similarity
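
The extrapolation from PAM 1 to PAM N is repeated matrix multiplication of a mutation probability matrix. The sketch below shows this on a made-up two-state matrix (not real amino acid data) in which each state has a 1% chance of changing per step; the real PAM 1 matrix is 20 x 20, and the result is then converted to log-odds scores.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy stand-in for PAM 1: two states, 1% change per evolutionary step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

for n in (40, 120, 250):
    stay = mat_pow(toy_pam1, n)[0][0]
    print(f"toy PAM {n}: probability of seeing the original state = {stay:.3f}")
```

The probability of observing the original state falls as N grows, which is why higher-numbered PAM matrices suit more distant comparisons.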
BLOSUM Matrices
• BLOSUM 90 - prepared from BLOCKS alignments with
sequences of ≥90% identity clustered together
– best for short alignments with high similarity
• BLOSUM 62 - prepared from BLOCKS alignments with
sequences of ≥62% identity clustered together
– best for general alignment (default)
• BLOSUM 30 - prepared from BLOCKS alignments with
sequences of ≥30% identity clustered together
– best for detecting weak local alignments
Extending the High Scoring Segment
Pair (HSP)

[Figure; labels: Minimum Score (S), Neighborhood Score Threshold (T)]
BLAST Access
• NCBI BLAST
• http://www.ncbi.nlm.nih.gov/BLAST/
• Canadian Bioinformatics Resource BLAST
• http://cbr-rbc.nrc-cnrc.gc.ca/blast/
• European Bioinformatics Institute BLAST
• http://www.ebi.ac.uk/blastall/
• http://www.ebi.ac.uk/blast2/

Different Flavours of BLAST

• BLASTP - protein query against protein DB
• BLASTN - DNA/RNA query against GenBank (DNA)
• BLASTX - 6-frame translated DNA query against protein DB
• TBLASTN - protein query against 6-frame translated GenBank
• TBLASTX - 6-frame translated DNA query against 6-frame translated GenBank
• PSI-BLAST - protein ‘profile’ query against protein DB
• PHI-BLAST - protein pattern against protein DB
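
The "6-frame translation" used by BLASTX, TBLASTN and TBLASTX means translating the DNA in three reading frames on each strand. A minimal sketch, assuming the standard genetic code and plain A/C/G/T input (the function names are chosen here for illustration):

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon from position 0; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the six translations keyed by frame (+1..+3 forward, -1..-3 reverse)."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


print(six_frames("ATGGCC"))
```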
Other BLAST Services
• MEGABLAST - for comparison of large sets of
long DNA sequences
• RPS-BLAST - Conserved Domain Detection
• BLAST 2 Sequences - for performing pairwise
alignments for 2 chosen sequences
• Genomic BLAST - for alignments against
select human, microbial or malarial genomes
• VecScreen - for detecting cloning vector
contamination in sequenced data

Running NCBI BLAST

MT0895
• MMKIQIYGTGCANCQMLEKNAREAVKELG
IDAEFEKIKEMDQILEAGLTALPGLAVDG
ELKIMGRVASKEEIKKILS

Running NCBI BLAST
• Paste in the sequence (FASTA format or raw
sequence), or type in a GI or accession number
>Mysequence MT0895
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS
OR
>
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS
OR
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS
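
The three input forms above (named header, bare ">" header, raw sequence) carry the same sequence. A small parser makes that concrete; this is a sketch for a single-sequence input, with a function name chosen here, and not how the NCBI form itself processes the text.

```python
def parse_query(text):
    """Return (name, sequence) from FASTA text, a bare '>' header, or raw sequence."""
    name = ""
    residues = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()  # everything after '>' is the description
        else:
            residues.append(line.replace(" ", ""))
    return name, "".join(residues).upper()


named = """>Mysequence MT0895
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS"""
bare = """>
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS"""
raw = """KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS"""

for text in (named, bare, raw):
    print(parse_query(text))
```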
Running NCBI BLAST
• Choose a range of interest in the sequence
“set subsequences” (not usually used)
• Select the database from pull-down menu
(usually choose nr = non-redundant)
• Keep CD Search “check box” on
• Leave “Options” unchanged (use defaults)
• Go to “Format” menu and adjust Number of
descriptions and alignments as desired
Running NCBI BLAST

Select Database

Conserved Domain Database
• Contains a collection of pre-identified
functional or structural domains
• Derived from Pfam and Smart databases
as well as other sources
• Uses Reverse Position Specific BLAST
(RPS-BLAST) to perform search
• Query sequence is compared to a PSSM
derived from each of the aligned domains

Running NCBI BLAST

Click BLAST!

Formatting Results

BLAST Format Options

BLAST Output

BLAST Parameters
• Identities - number and % of exact residue matches
• Positives - number and % of similar plus identical matches
• Gaps - number and % of gaps introduced
• Score - summed HSP score (S)
• Bit Score - a normalized score (S’)
• Expect (E) - expected number of chance HSP alignments
• P - probability of getting a score > X by chance
• T - minimum word or k-tuple score (Threshold)
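
The first three values in this list can be counted straight from an aligned pair of sequences. In the sketch below, "positives" needs a notion of similarity; BLAST takes it from positive substitution-matrix scores, whereas here a small hypothetical set of conservative groups stands in for the matrix.

```python
# Hypothetical similarity groups used only for this illustration.
CONSERVATIVE_GROUPS = [set("ILVM"), set("ST")]


def similar(a, b, groups=CONSERVATIVE_GROUPS):
    """True when both residues fall into the same conservative group."""
    return any(a in g and b in g for g in groups)


def alignment_stats(aligned_query, aligned_subject, is_positive=similar):
    """Count identities, positives and gaps over an alignment ('-' marks a gap)."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    identities = positives = gaps = 0
    for a, b in zip(aligned_query, aligned_subject):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            identities += 1
            positives += 1  # identical pairs also count as positives
        elif is_positive(a, b):
            positives += 1
    return {"length": len(aligned_query), "identities": identities,
            "positives": positives, "gaps": gaps}


stats = alignment_stats("KIQ-YGT", "KLQAYGS")
for key in ("identities", "positives", "gaps"):
    pct = 100.0 * stats[key] / stats["length"]
    print(f"{key}: {stats[key]}/{stats['length']} ({pct:.0f}%)")
```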
Getting the Most from
BLAST

BLAST Options
• Composition-based statistics (Yes)
• Sequence Complexity Filter (Yes)
• Expect (E) value (10)
• Word Size (3)
• Substitution or Scoring Matrix (Blosum62)
• Gap Insertion Penalty (11)
• Gap Extension Penalty (1)
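
For readers who run BLAST locally rather than through the web form, the same settings map onto options of the NCBI BLAST+ `blastp` program. The sketch below only builds and prints the command; the file and database names are placeholders, and the mapping of "Composition-based statistics (Yes)" to mode 1 is an assumption about which setting the slide refers to.

```python
import shlex


def blastp_command(query_fasta, database, out_file):
    """Build a BLAST+ blastp command line mirroring the option list above."""
    return [
        "blastp",
        "-query", query_fasta,
        "-db", database,
        "-out", out_file,
        "-comp_based_stats", "1",  # composition-based statistics (assumed mode)
        "-seg", "yes",             # low-complexity filter for protein queries
        "-evalue", "10",           # Expect threshold
        "-word_size", "3",
        "-matrix", "BLOSUM62",
        "-gapopen", "11",
        "-gapextend", "1",
    ]


# Placeholder names; nothing is executed here.
cmd = blastp_command("query.fasta", "nr", "results.txt")
print(shlex.join(cmd))
```

To run the search, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ and the database are installed.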

Composition Statistics
• Recent addition to BLAST algorithm
• Permits calculated E (Expect) values to
account for amino acid composition of
queries and database hits
• Improves accuracy and reduces false
positives
• Effectively conducts a different scoring
procedure for each sequence in database

LCRs (low-complexity regions)
• Watch out for…
– transmembrane or signal peptide regions
– coiled-coil regions
– short amino acid repeats (collagen, elastin)
– homopolymeric repeats
• BLAST uses SEG to mask amino acids
• BLAST uses DUST to mask bases
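
The idea behind complexity masking can be illustrated with a sliding-window Shannon entropy filter. This is not the SEG or DUST algorithm, only a sketch of the same principle; the window of 12 and the 2.2-bit cutoff are illustrative values borrowed from SEG's commonly quoted defaults.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    total = len(window)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every residue that lies in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("ACDEFGHIKLMN" + "Q" * 12))
```

Residues next to the repeat can also be masked, because any window dominated by the repeat falls below the cutoff.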
Scraping the Bottom of
the Barrel with Psi-BLAST

PSI-BLAST Algorithm
• Perform initial alignment with BLAST using
the BLOSUM 62 substitution matrix
• Construct a multiple alignment from the matches
• Prepare a position-specific scoring matrix (PSSM)
• Use the PSSM profile as the scoring matrix for a
second BLAST run against the database
• Repeat the previous three steps until convergence
(no new sequences are found)
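
The PSSM step can be sketched as column-wise log-odds scores computed from a multiple alignment. The version below is deliberately simplified and is not PSI-BLAST's procedure: it assumes a uniform background frequency, a flat pseudocount, and no sequence weighting, all of which the real program handles more carefully.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, background=None, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    if background is None:
        # Assumption: every amino acid is equally likely in the background.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pssm = []
    for column in zip(*alignment):
        counts = Counter(aa for aa in column if aa in background)  # skips gaps
        total = sum(counts.values()) + pseudocount * len(background)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in background
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score an ungapped segment by summing its column-specific scores."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


toy_alignment = ["KIQ", "KLQ", "KIQ"]
pssm = build_pssm(toy_alignment)
print(round(score_with_pssm(pssm, "KIQ"), 2))
print(round(score_with_pssm(pssm, "AAA"), 2))
```

Unlike BLOSUM 62, the score for a residue now depends on the column, so a residue conserved at one position is rewarded there and nowhere else.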

PSI-BLAST

Press Iterate!

PSI-BLAST
• For Protein Sequences ONLY
• Much more sensitive than BLAST
• Slower (iterative process)
• Often yields results that are as good as
many common threading methods
• SHOULD BE YOUR FIRST CHOICE IN
ANALYZING A NEW SEQUENCE

BLAST against PDB

Sequence Similarity
Searching – The statistics
are important

• Discriminating between real and artifactual matches is
done using an estimate of the probability that the match
might occur by chance.
• We’ll talk more about the meaning of the scores (S)
and E-values (E) that are associated with BLAST hits

Where does the score (S) come
from?
• The quality of each pair-wise alignment is
represented as a score and the scores are
ranked.
• Scoring matrices are used to calculate the
score of the alignment base by base (DNA)
or amino acid by amino acid (protein).
• The alignment score will be the sum of the
scores for each position.

What do the Score and the e-value
really mean?
• The quality of the alignment is
represented by the Score (S).
– The score of an alignment is calculated as the sum of substitution and gap
scores. Substitution scores are given by a look-up table (PAM, BLOSUM),
whereas gap scores are assigned empirically.
• The significance of each alignment is
computed as an E value (E).
– Expectation value: the number of different alignments with scores equivalent
to or better than S that are expected to occur in a database search by
chance. The lower the E value, the more significant the score.
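
The link between S, the bit score and E follows the Karlin-Altschul relations E = K·m·n·e^(-λS) and S' = (λS - ln K) / ln 2, so that E = m·n·2^(-S'). The sketch below applies them; the values of λ and K are illustrative figures of the kind reported for gapped BLOSUM 62 searches, the lengths are made up, and real BLAST uses corrected "effective" lengths for m and n.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score S' in bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def e_value_from_bits(bits, query_len, db_len):
    """Same E-value computed from the bit score alone."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance alignment, P = 1 - exp(-E)."""
    return -math.expm1(-e)


LAMBDA, K = 0.267, 0.041          # illustrative statistical parameters
query_len, db_len = 77, 10 ** 9   # hypothetical query and database sizes

for raw in (60, 100, 150):
    e = e_value(raw, query_len, db_len, LAMBDA, K)
    print(f"S={raw}  bits={bit_score(raw, LAMBDA, K):.1f}  E={e:.3g}  P={p_value(e):.3g}")
```

Because m and n enter as a product, the same raw score gives a larger E-value in a larger database.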

Notes on E-values
• Low E-values suggest that sequences
are homologous
– Can’t show non-homology
• Statistical significance depends on both
the size of the alignments and the size of
the sequence database
– Important consideration for comparing results across
different searches
– E-value increases as database gets bigger
– E-value decreases as alignments get longer

Homology: Some Guidelines

• Similarity can be indicative of homology
• Generally, if two sequences are
significantly similar over their entire length,
they are likely homologous
• Low complexity regions can be highly
similar without being homologous
• Homologous sequences are not always
highly similar

Suggested BLAST Cutoffs

• For nucleotide-based searches, one
should look for hits with E-values of 10^-6
or less and sequence identity of 70% or
more
• For protein-based searches, one should
look for hits with E-values of 10^-3 or less
and sequence identity of 25% or more
BLAST - Rules of Thumb
• Expect (E-value) is equal to the number of BLAST
alignments with a given Score or better that are
expected to be seen simply due to chance
• Don’t trust a BLAST alignment with an Expect score
> 0.01 (Grey zone is between 0.01 - 1)
• Expect and Score are related, but Expect contains
more information. Note that %Identities is more
useful than the bit Score
• Recall Doolittle’s Curve (%ID vs. Length, next slide)
%ID > 30 - numres/50
• If uncertain about a hit, perform a PSI-BLAST search

NCBI BLAST TUTORIAL

• http://youtu.be/HXEpBnUbAMo
