Lecture 4: BLAST: Ly Le, PhD
Ly Le, PhD
School of Biotechnology
Email: ly.le@hcmiu.edu.vn
Office: Rm 501, HCM International University
Concepts of Sequence Similarity
Searching
• The premise: One sequence by itself is not
informative; it must be analyzed by
comparative methods against existing
sequence databases to develop hypotheses
concerning relatives and function.
BLAST
• Basic Local Alignment Search Tool
• Developed in 1990 and 1997 (S. Altschul)
• A heuristic method for performing local
alignments by searching for high-scoring
segment pairs (HSPs)
– Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990)
“Basic local alignment search tool.” J. Mol. Biol. 215:403-410.
– Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z,
Miller W, Lipman DJ (1997) “Gapped BLAST and PSI-BLAST:
a new generation of protein database search programs.” NAR
25:3389-3402.
Lecture 3.1 3
What BLAST tells you ...
• BLAST reports surprising alignments
- Different than chance
• Assumptions
- Random sequences
- Constant composition
• Conclusions
- Surprising similarities imply evolutionary homology
Scoring matrices: PAM (Percent Accepted
Mutation)
Amino acids are grouped according to the chemistry of the side group: (C) sulfhydryl, (STPAG) small
hydrophilic, (NDEQ) acid, acid amide and hydrophilic, (HRK) basic, (MILV) small hydrophobic, and (FYW)
aromatic. Log odds values: a positive value such as +10 means that the pair is more likely to derive from a
common ancestor than to align by chance, 0 means that the two probabilities are equal, and a negative value
such as -4 means that the pair is more likely to align by chance. Thus the log odds score of the alignment
YY/YY is 10+10=20, whereas YY/TP scores -3-5=-8, a rare and unexpected alignment between homologous sequences.
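The arithmetic above can be reproduced with a small sketch. It assumes only the three log odds values quoted in the text (Y/Y = 10, Y/T = -3, Y/P = -5); the dictionary and function names are hypothetical and this is not a full PAM table.

```python
# Log odds values quoted in the text above; only the pairs needed for
# this example are listed (not a complete PAM matrix).
LOG_ODDS = {
    ("Y", "Y"): 10,
    ("Y", "T"): -3,
    ("Y", "P"): -5,
}


def pair_score(a, b):
    """Look up the log odds score of a residue pair (the table is symmetric)."""
    if (a, b) in LOG_ODDS:
        return LOG_ODDS[(a, b)]
    return LOG_ODDS[(b, a)]


def alignment_score(seq1, seq2):
    """Sum the per-position log odds scores of two ungapped, equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(alignment_score("YY", "YY"))  # 10 + 10
print(alignment_score("YY", "TP"))  # -3 + -5
```

Because the values are log odds, adding them corresponds to multiplying the underlying odds ratios position by position.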
Scoring matrices: BLOSUM62
(BLOcks amino acid SUbstitution Matrices)
The idea behind BLOSUM is similar, but it is calculated from a very different and much larger set of
proteins, which are much more similar to one another and form blocks of proteins with a similar pattern
PAM versus BLOSUM
PAM
• First useful scoring matrix for protein
• Assumed a Markov Model of evolution (i.e. all sites equally mutable and independent)
• Derived from small, closely related proteins with ~15% divergence
BLOSUM
• Much later entry to matrix “sweepstakes”
• No evolutionary model is assumed
• Built from PROSITE derived sequence blocks
• Uses much larger, more diverse set of protein sequences (30% - 90% ID)
PAM Matrices
• PAM 40 - prepared by multiplying PAM 1 by
itself a total of 40 times
best for short alignments with high similarity
• PAM 120 - prepared by multiplying PAM 1 by
itself a total of 120 times
best for general alignment
• PAM 250 - prepared by multiplying PAM 1 by
itself a total of 250 times
best for detecting distant sequence similarity
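The repeated multiplication described above can be sketched as a matrix power. The 3-letter "PAM 1" below is a hypothetical toy matrix (about 1% change per row), not the real 20 x 20 table, and the helper names are assumptions. Real PAM matrices are additionally converted from probabilities to log odds scores, which this sketch leaves out.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Multiply m by itself `power` times, starting from the identity matrix."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical toy "PAM 1" over a 3-letter alphabet: each row holds the
# probabilities that a residue stays the same (diagonal) or is replaced.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

for n in (40, 120, 250):
    pam_n = mat_pow(TOY_PAM1, n)
    # The diagonal shrinks as n grows: more substitutions are expected.
    print(n, round(pam_n[0][0], 3))
```

The shrinking diagonal is the reason higher PAM numbers suit more distant sequence comparisons.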
BLOSUM Matrices
• BLOSUM 90 - prepared from BLOCKS
alignments after clustering sequences with ≥90% sequence ID
best for short alignments with high similarity
• BLOSUM 62 - prepared from BLOCKS
alignments after clustering sequences with ≥62% sequence ID
best for general alignment (default)
• BLOSUM 30 - prepared from BLOCKS
alignments after clustering sequences with ≥30% sequence ID
best for detecting weak local alignments
BLAST Algorithm
• Scoring of matches done using scoring
matrices
• Sequences are split into words (default
n=3)
- Speed, computational efficiency
• BLAST algorithm extends the initial
“seed” hit into an HSP
- HSP = high scoring segment pair = locally optimal
alignment
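The word-splitting step can be sketched as follows. This is a simplified illustration, assuming exact word matches only; protein BLAST also accepts similar "neighborhood" words, which is not modeled here. All function names are hypothetical.

```python
def split_into_words(sequence, word_size=3):
    """Return every overlapping word of length word_size, in order."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


def index_words(sequence, word_size=3):
    """Map each word to the list of positions where it starts."""
    index = {}
    for pos, word in enumerate(split_into_words(sequence, word_size)):
        index.setdefault(word, []).append(pos)
    return index


def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) for every exact word shared by both."""
    subject_index = index_words(subject, word_size)
    seeds = []
    for qpos, word in enumerate(split_into_words(query, word_size)):
        for spos in subject_index.get(word, []):
            seeds.append((qpos, spos))
    return seeds


print(split_into_words("KIQIYG"))
print(find_seeds("KIQIY", "AAQIYGG"))
```

Indexing words once and looking them up is what gives the speed advantage over aligning every position against every other position.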
Extending the High Scoring Segment
Pair (HSP)
Figure labels: Minimum Score (S); Neighborhood Score Threshold (T)
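A minimal sketch of ungapped seed extension, under stated assumptions: identity scoring (match +5, mismatch -4) stands in for a real substitution matrix, and extension stops once the running score falls more than a hypothetical `x_drop` below the best score seen. The neighborhood threshold T is not modeled; the minimum score S appears only as the final reporting cutoff `MIN_SCORE`, whose value is made up.

```python
def score_pair(a, b, match=5, mismatch=-4):
    """Assumed identity scoring; a real search would use PAM or BLOSUM."""
    return match if a == b else mismatch


def extend_seed(query, subject, qpos, spos, word_size=3, x_drop=10):
    """Extend a seed without gaps in both directions.

    Returns (best_score, q_start, q_end), with q_end exclusive.
    """
    best = sum(score_pair(query[qpos + i], subject[spos + i])
               for i in range(word_size))
    # Extend to the right of the seed.
    q_end = qpos + word_size
    running = best
    i, j = qpos + word_size, spos + word_size
    while i < len(query) and j < len(subject):
        running += score_pair(query[i], subject[j])
        if running > best:
            best, q_end = running, i + 1
        elif best - running > x_drop:
            break
        i, j = i + 1, j + 1
    # Extend to the left, starting from the best score reached so far.
    q_start = qpos
    running = best
    i, j = qpos - 1, spos - 1
    while i >= 0 and j >= 0:
        running += score_pair(query[i], subject[j])
        if running > best:
            best, q_start = running, i
        elif best - running > x_drop:
            break
        i, j = i - 1, j - 1
    return best, q_start, q_end


MIN_SCORE = 20  # hypothetical stand-in for the minimum score S
score, start, end = extend_seed("AAKIQIYGTT", "CCKIQIYGWW", 2, 2)
print(score, start, end, score >= MIN_SCORE)
```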
BLAST Access
• NCBI BLAST
• http://www.ncbi.nlm.nih.gov/BLAST/
• Canadian Bioinformatics Resource BLAST
• http://cbr-rbc.nrc-cnrc.gc.ca/blast/
• European Bioinformatics Institute BLAST
• http://www.ebi.ac.uk/blastall/
• http://www.ebi.ac.uk/blast2/
Different Flavours of BLAST
Running NCBI BLAST
MT0895
• MMKIQIYGTGCANCQMLEKNAREAVKELG
IDAEFEKIKEMDQILEAGLTALPGLAVDG
ELKIMGRVASKEEIKKILS
Running NCBI BLAST
• Paste in sequence (FASTA format, raw
sequence or type in GI or accession number)
>Mysequence MT0895
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS
OR
>
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS
OR
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS
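The three accepted input forms above (FASTA with a description, FASTA with a bare ">" line, raw sequence) can be normalized with a short sketch. The function name `parse_query` is hypothetical and this is not how the NCBI form itself parses input.

```python
def parse_query(text):
    """Return (header, sequence) from FASTA or raw sequence text.

    header is None for raw input and '' for a bare '>' line.
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    header = None
    if lines and lines[0].startswith(">"):
        header = lines[0][1:].strip()
        lines = lines[1:]
    # Sequence lines are joined so that line breaks do not matter.
    sequence = "".join(lines).upper()
    return header, sequence


fasta = """>Mysequence MT0895
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS"""

raw = """KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS"""

print(parse_query(fasta))
print(parse_query(raw))
```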
Running NCBI BLAST
• Choose a range of interest in the sequence
“set subsequences” (not usually used)
• Select the database from pull-down menu
(usually choose nr = non-redundant)
• Keep CD Search “check box” on
• Leave “Options” unchanged (use defaults)
• Go to “Format” menu and adjust Number of
descriptions and alignments as desired
Running NCBI BLAST
Select Database
Conserved Domain Database
• Contains a collection of pre-identified
functional or structural domains
• Derived from Pfam and SMART databases
as well as other sources
• Uses Reverse Position Specific BLAST
(RPS-BLAST) to perform search
• Query sequence is compared to a PSSM
derived from each of the aligned domains
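Comparing a query to a PSSM amounts to sliding the matrix along the sequence and summing one column score per residue. The sketch below uses a hypothetical 4-column toy PSSM and a made-up default score for unlisted residues; it illustrates the scoring idea only, not RPS-BLAST's actual search heuristics.

```python
# Hypothetical toy PSSM: one dict of residue scores per domain position.
TOY_PSSM = [
    {"C": 6, "S": 1},
    {"A": 3, "G": 2},
    {"N": 5, "D": 2},
    {"C": 6, "S": 1},
]
DEFAULT_SCORE = -2  # assumed score for residues not listed in a column


def score_window(window, pssm):
    """Sum the position-specific scores of one window of the query."""
    return sum(column.get(residue, DEFAULT_SCORE)
               for residue, column in zip(window, pssm))


def best_pssm_hit(query, pssm):
    """Return (score, start) of the best-scoring window, or None if too short."""
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        score = score_window(query[start:start + width], pssm)
        if best is None or score > best[0]:
            best = (score, start)
    return best


print(best_pssm_hit("KIQIYGTGCANCQM", TOY_PSSM))
```

Unlike a substitution matrix, the same residue can score differently at each position of the domain.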
Running NCBI BLAST
Click BLAST!
Formatting Results
BLAST Format Options
BLAST Output
BLAST Parameters
• Identities - No. & % of exact residue matches
• Positives - No. & % of similar and identical matches
• Gaps - No. & % of gaps introduced
• Score - Summed HSP score (S)
• Bit Score - a normalized score (S’)
• Expect (E) - Expected no. of chance HSP alignments
• P - Probability of getting a score > X by chance
• T - Minimum word or k-tuple score (Threshold)
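The first three parameters can be counted directly from an aligned pair, as in this sketch. It assumes gaps are written as "-" and that the caller supplies the set of non-identical pairs that count as "similar"; here I/L and S/T are assumed to score positively. The function name is hypothetical.

```python
def alignment_stats(aligned_query, aligned_subject, positive_pairs=frozenset()):
    """Count identities, positives and gaps over the alignment columns."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    length = len(aligned_query)
    if length == 0:
        raise ValueError("alignment is empty")
    identities = positives = gaps = 0
    for a, b in zip(aligned_query, aligned_subject):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            identities += 1
            positives += 1  # identical residues also count as positives
        elif frozenset((a, b)) in positive_pairs:
            positives += 1
    return {
        "length": length,
        "identities": identities,
        "positives": positives,
        "gaps": gaps,
        "percent_identity": round(100.0 * identities / length, 1),
    }


# Assumed similar pairs for this example only.
SIMILAR = {frozenset("IL"), frozenset("ST")}
print(alignment_stats("KIQ-YGT", "KLQIYGS", SIMILAR))
```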
Getting the Most from
BLAST
BLAST Options
• Composition-based statistics (Yes)
• Sequence Complexity Filter (Yes)
• Expect (E) value (10)
• Word Size (3)
• Substitution or Scoring Matrix (Blosum62)
• Gap Insertion Penalty (11)
• Gap Extension Penalty (1)
Composition Statistics
• Recent addition to BLAST algorithm
• Permits calculated E (Expect) values to
account for amino acid composition of
queries and database hits
• Improves accuracy and reduces false
positives
• Effectively conducts a different scoring
procedure for each sequence in database
LCRs (low-complexity regions)
• Watch out for…
– transmembrane or signal peptide regions
– coiled-coil regions
– short amino acid repeats (collagen, elastin)
– homopolymeric repeats
• BLAST uses SEG to mask amino acids
• BLAST uses DUST to mask bases
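To illustrate what masking does, the sketch below replaces low-complexity windows with "X" using a plain Shannon entropy cutoff. This is a swapped-in, much simpler technique, not the SEG or DUST algorithm; the window size and entropy threshold are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(sequence, window=12, min_entropy=2.2, mask_char="X"):
    """Mask every window whose entropy is below min_entropy (assumed cutoff)."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


# A made-up sequence with a homopolymeric run of Q in the middle.
example = "MKIQIYGTGCAN" + "QQQQQQQQQQQQ" + "LEKNAREAVKEL"
print(mask_low_complexity(example))
```

Masked residues do not seed hits, so a repeat-rich query does not collect spurious matches to unrelated repeat-rich proteins.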
Scraping the Bottom of
the Barrel with Psi-BLAST
PSI-BLAST Algorithm
• Perform initial alignment with BLAST using
BLOSUM 62 substitution matrix
• Construct a multiple alignment from matches
• Prepare position specific scoring matrix (PSSM)
• Use PSSM profile as the scoring matrix for a
second BLAST run against database
• Repeat the last three steps (multiple alignment,
PSSM, BLAST run) until convergence
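The PSSM preparation step can be sketched as per-column log odds with a simple pseudocount. This is a deliberately reduced illustration: the 6-letter alphabet, uniform background frequencies and pseudocount are all assumptions, and PSI-BLAST's own sequence weighting and pseudocount scheme are more elaborate.

```python
import math
from collections import Counter


def build_pssm(alignment, background, pseudocount=1.0):
    """Return one dict of log2 odds scores per alignment column.

    alignment  - list of equal-length, gap-free aligned sequences
    background - dict of residue -> background frequency (sums to 1)
    """
    n_seqs = len(alignment)
    pssm = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        scores = {}
        for residue, bg in background.items():
            # Observed frequency, smoothed toward the background.
            freq = (counts[residue] + pseudocount * bg) / (n_seqs + pseudocount)
            scores[residue] = math.log2(freq / bg)
        pssm.append(scores)
    return pssm


# Hypothetical toy alignment and uniform background over six residues.
TOY_ALIGNMENT = ["CANC", "CGNC", "CADC", "SANC"]
TOY_BACKGROUND = {residue: 1.0 / 6 for residue in "ACDGNS"}

pssm = build_pssm(TOY_ALIGNMENT, TOY_BACKGROUND)
print(round(pssm[0]["C"], 2), round(pssm[0]["G"], 2))
```

Residues conserved in a column get positive scores and residues never seen there get negative ones, which is what makes the next search round more sensitive than a fixed BLOSUM 62 lookup.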
PSI-BLAST
Press Iterate!
• For Protein Sequences ONLY
• Much more sensitive than BLAST
• Slower (iterative process)
• Often yields results that are as good as
many common threading methods
• SHOULD BE YOUR FIRST CHOICE IN
ANALYZING A NEW SEQUENCE
BLAST against PDB
Sequence Similarity
Searching – The statistics
are important
Where does the score (S) come
from?
• The quality of each pair-wise alignment is
represented as a score and the scores are
ranked.
• Scoring matrices are used to calculate the
score of the alignment base by base (DNA)
or amino acid by amino acid (protein).
• The alignment score will be the sum of the
scores for each position.
What do the Score and the E-value
really mean?
• The quality of the alignment is
represented by the Score (S).
• The score of an alignment is calculated as the sum of substitution and gap
scores. Substitution scores are given by a look-up table (PAM, BLOSUM)
whereas gap scores are assigned empirically.
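A sketch of this sum for a gapped alignment follows. It assumes an affine gap cost in which a gap of length k costs gap_open + k * gap_extend, using the 11 and 1 listed under BLAST Options; the substitution values are a small illustrative table, not a complete matrix, and the function name is hypothetical.

```python
# Illustrative substitution scores for the residues used below only.
TOY_SUBSTITUTION = {
    ("K", "K"): 5, ("I", "I"): 4, ("Q", "Q"): 5,
    ("Y", "Y"): 7, ("G", "G"): 6, ("I", "L"): 2,
}


def score_alignment(aligned_a, aligned_b, substitution,
                    gap_open=11, gap_extend=1):
    """Sum substitution scores and subtract affine gap costs.

    Simplification: consecutive gap columns are treated as one gap,
    regardless of which sequence carries the gap character.
    """
    score = 0
    in_gap = False
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            # The first gap column pays open + extend, later ones extend only.
            score -= gap_extend if in_gap else gap_open + gap_extend
            in_gap = True
        else:
            if (a, b) in substitution:
                score += substitution[(a, b)]
            else:
                score += substitution[(b, a)]
            in_gap = False
    return score


print(score_alignment("KIQ--YG", "KLQIIYG", TOY_SUBSTITUTION))
```

With these numbers a single two-residue gap costs 13, which cancels more than two average matches; this is why gaps appear only where the flanking alignment is strong.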
Notes on E-values
• Low E-values suggest that sequences
are homologous
– A high E-value cannot show non-homology
• Statistical significance depends on both
the size of the alignments and the size of
the sequence database
– Important consideration for comparing results across
different searches
– E-value increases as database gets bigger
– E-value decreases as alignments get longer
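The dependence on database size can be made concrete with the Karlin-Altschul relation E = K m n e^(-λS), where m and n are the query and database lengths. The λ and K values below are assumed, illustrative numbers; the real ones depend on the scoring matrix and gap penalties, and BLAST additionally applies length corrections not shown here.

```python
import math

# Assumed, illustrative statistical parameters (not taken from the lecture).
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance HSPs with at least this raw score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e_value):
    """Probability of at least one chance HSP, given the expected number."""
    return 1.0 - math.exp(-e_value)


small_db = expect_value(100, 250, 1e8)
large_db = expect_value(100, 250, 2e8)   # same hit, database twice as large
better = expect_value(150, 250, 1e8)     # higher score, same database
print(small_db, large_db, better, p_value(small_db))
```

Doubling the database doubles E for the same score, while a longer alignment of similar quality accumulates a higher score and therefore a lower E.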
Homology: Some Guidelines
Suggested BLAST Cutoffs
BLAST - Rules of Thumb
• Expect (E-value) is equal to the number of BLAST
alignments with a given Score that are expected to
be seen simply due to chance
• Don’t trust a BLAST alignment with an Expect value
> 0.01 (Grey zone is between 0.01 and 1)
• Expect and Score are related, but Expect contains
more information. Note that % Identities is more
useful than the bit Score
• Recall Doolittle’s Curve (%ID vs. Length, next slide)
%ID > 30 - numres/50
• If uncertain about a hit, perform a PSI-BLAST search
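The two numeric rules above can be written as a small sketch. The thresholds are the ones stated on the slide; the function names and category labels are hypothetical.

```python
def classify_expect(e_value):
    """Apply the E-value rule of thumb: trust <= 0.01, grey zone up to 1."""
    if e_value <= 0.01:
        return "trusted"
    if e_value <= 1:
        return "grey zone"
    return "not trusted"


def passes_doolittle(percent_identity, num_residues):
    """Check %ID > 30 - numres/50 for an alignment of num_residues."""
    return percent_identity > 30 - num_residues / 50


print(classify_expect(1e-5), classify_expect(0.5), classify_expect(5))
print(passes_doolittle(28, 150), passes_doolittle(25, 150))
```

For a 150-residue alignment the identity cutoff is 30 - 3 = 27%, so longer alignments need slightly less identity to be considered meaningful.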
NCBI BLAST TUTORIAL
• http://youtu.be/HXEpBnUbAMo