Lecture5 Sequence Comparison-2019
Comparisons
Mike Cherry
Genomics
Genetics 211 - Winter 2019
reference genome
Reads
Contigs
Scaffold
Chromosome
Homo sapiens Reference Genome
GRCh38.p12 Statistics
Number of regions with alternate loci or patches 317
Total sequence length 3,099,706,404
Total ungapped length 2,948,583,725
Gaps between scaffolds 349
Number of scaffolds 472
Scaffold N50 67,794,873
Scaffold L50 16
Number of contigs 998
Contig N50 57,879,411
Contig L50 18
Total number of chromosomes and plasmids 24
Number of component sequences (WGS or clone) 35,613
https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.38/
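The N50 and L50 rows in the statistics above follow the usual assembly definitions: sort the sequences from longest to shortest and walk down the list until half of the total length is covered. N50 is the length of the sequence at which that happens, and L50 is how many sequences were needed. A minimal sketch, assuming those standard definitions; the contig lengths are hypothetical and are not GRCh38 values:

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # 'length' is N50; 'count' sequences were needed, which is L50
            return length, count
    raise ValueError("lengths must be non-empty")


toy_contigs = [80, 70, 50, 40, 30, 20, 10]  # hypothetical lengths, total 300
print(n50_l50(toy_contigs))  # (70, 2)
```

With these toy lengths the two longest contigs (80 + 70) already cover half of the 300 bases, so N50 is 70 and L50 is 2.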
GRCh38
Issues Resolved Summary of GRCh38 updates
Gap Resolution
Schneider et al. Genome Res. (2017) 27:849-864
An algorithmic overview of satellite characterization and linear
representation.
Karen H. Miga et al. Genome Res. 2014;24:697-707
Previous GRC versions simply had a 3Mb gap on each
chromosome to represent the centromeric region.
B. Paten, A. Novak & D. Haussler. (2015) Mapping to a Reference Genome Structure. arXiv. 2014:1–26
Genome Graphs
B. Paten, A. Novak & D. Haussler. (2015) Mapping to a Reference Genome Structure. arXiv. 2014:1–26
Graph-based Variant Calling Evaluation
• If the signal artifact (from uniquely mapping reads) is extremely severe (> 1000-fold), we flag the region.
• If the signal artifact is present in most of the tracks, independent of cell line or experiment type, we flag the region.
• If the region has interspersed high-mappability and low-mappability coordinates, it is more likely to be an artifact region.
• If the region has a known repeat element, it is more likely to be an artifact.
• We check whether the stranded read counts/signal are structured in the UwDNase, UncFAIRE and input/control datasets, i.e. whether we see offset mirror peaks on the + and - strands, as is typically observed in real, functional peaks. If so, we remove these regions from the artifact list.
• If the region exactly overlaps a known gene’s TSS, or is within or in the vicinity of a known gene, it is more likely to be removed from the artifact list. Our intention is to give such regions the benefit of the doubt of being real peaks.
Anshul Kundaje, 2014, A comprehensive collection of signal artifact blacklist regions in the human genome
Blacklist for hg19
Anshul Kundaje (2014) A comprehensive collection of signal artifact blacklist regions in the human genome
Blacklist composition for hg19
Region type bp % of total
centromeric_repeat 8,997,003 77.64%
BSR/Beta 797,511 6.88%
Satellite_repeat 723,464 6.24%
Low_mappability_island 514,885 4.44%
ALR/Alpha 299,365 2.58%
(CATTC)n 145,669 1.26%
telomeric_repeat 25,798 0.22%
chrM 24,608 0.21%
LSU-rRNA_Hsa 20,620 0.18%
High_Mappability_island 12,594 0.11%
TAR1 12,532 0.11%
ACRO1 5,877 0.05%
SSU-rRNA_Hsa 5,595 0.05%
snRNA 2,062 0.02%
(GAGTG)n 422 0.00%
(GAATG)n 267 0.00%
total: 11,588,272 bp
Anshul Kundaje (2014) A comprehensive collection of signal artifact blacklist regions in the human genome
Blacklist for GRCh38
chr1 : 124450730 124450960 chr16 : 34593000 34593590
chr2 : 90397520 90397900 chr16 : 34594490 34594720
chr2 : 90398120 90398760 chr16 : 34594900 34595150
chr3 : 93470260 93470870 chr16 : 34595320 34595570
chr4 : 49118760 49119010 chr16 : 46380910 46381140
chr4 : 49120790 49121130 chr16 : 46386270 46386530
chr5 : 49601430 49602300 chr16 : 46390180 46390930
chr5 : 49657080 49657690 chr16 : 46394370 46395100
chr5 : 49661330 49661570 chr16 : 46395670 46395910
chr10 : 38528030 38529790 chr16 : 46398780 46399020
chr10 : 42070420 42070660 chr16 : 46400700 46400970
chr16 : 34571420 34571640 chr20 : 28513520 28513770
chr16 : 34572700 34572930 chr20 : 31060210 31060770
chr16 : 34584530 34584840 chr20 : 31061050 31061560
chr16 : 34585000 34585220 chr20 : 31063990 31064490
chr16 : 34585700 34586380 chr20 : 31067930 31069060
chr16 : 34586660 34587100 chr20 : 31069000 31069280
chr16 : 34587060 34587660 chr21 : 8219780 8220120
chr16 : 34587900 34588170 chr21 : 8234330 8234620
total: 17,040 bp
Anshul Kundaje (Stanford) & Alan Boyle (University of Michigan), October 2016
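A blacklist like the one above is typically applied by discarding any peak or read interval that overlaps a listed region. The sketch below shows one way to do that with a plain interval-overlap test. It copies only a handful of the GRCh38 intervals from the table, treats coordinates as half-open [start, end) (an assumption, since the slide does not state the convention), and uses hypothetical peaks and function names:

```python
# A few GRCh38 blacklist intervals copied from the table above.
BLACKLIST = {
    "chr1": [(124450730, 124450960)],
    "chr2": [(90397520, 90397900), (90398120, 90398760)],
    "chr21": [(8219780, 8220120), (8234330, 8234620)],
}


def overlaps_blacklist(chrom, start, end, blacklist=BLACKLIST):
    """True if [start, end) overlaps any blacklisted region on chrom."""
    return any(start < b_end and b_start < end
               for b_start, b_end in blacklist.get(chrom, []))


def filter_peaks(peaks, blacklist=BLACKLIST):
    """Keep only the peaks that do not touch a blacklisted region."""
    return [p for p in peaks if not overlaps_blacklist(*p, blacklist=blacklist)]


# Hypothetical peaks: the first overlaps a chr2 region, the other two do not.
peaks = [
    ("chr2", 90397800, 90398000),
    ("chr2", 1000, 2000),
    ("chr21", 8220120, 8220500),
]
print(filter_peaks(peaks))
```

For a full blacklist, an interval tree or a sorted list with binary search would scale better than this linear scan.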
Signal due to unique mapping reads in high-mappability islands
Anshul Kundaje, 2014, A comprehensive collection of signal artifact blacklist regions in the human genome
Signal due to multi-mapping reads in high-mappability islands
Anshul Kundaje, 2014, A comprehensive collection of signal artifact blacklist regions in the human genome
sponge sequence database
K.H. Miga, C. Eisenhart and W.J. Kent. 2015. NAR doi:10.1093/nar/gkv671
pipelines
http://www.nytimes.com/2016/02/05/science/dna-study-of-first-ancient-african-genome-flawed-researchers-report.html?_r=0
DNA Study of First Ancient African Genome
Flawed, Researchers Report
By Carl Zimmer
In October, Dr. Manica and his colleagues reconstructed the first ancient human genome ever found
in Africa, retrieved from the skeleton of a man who lived in Ethiopia 4,500 years ago.
Ancient DNA experts were delighted, because the genome may provide clues about African history
that other kinds of evidence — broken pottery shards, for example, or scraps of ancient manuscripts
— cannot.
“It’s an amazing, amazing, unique, special, incredible, first-of-its-kind data set,” David
Reich, a geneticist at Harvard Medical School who was not involved in the study, said in an interview.
The researchers found that Mota was only distantly related to many people elsewhere in Africa. In
fact, the analysis suggested that most living Africans shared some DNA with
Europeans and Asians that were missing from Mota’s genome.
To explain these intriguing results, Dr. Manica and his colleagues tested out different historical
scenarios. In the best-supported one, a group of people migrated from the Near East back to East
Africa — a so-called backflow — about 3,000 years ago. In subsequent generations, their DNA
spread across Africa.
https://dl.dropboxusercontent.com/u/26978112/Erratum%20with%20figures.pdf
Pipeline Script Mistake
sharing a pipeline
Files at DCC
• FASTQs
• BAM
• QC measures
• Peak calls
• Signal tracks
• Motifs
Processing Steps
3. Peak calling (SPP, GEM, PeakSeq)
Relaxed Peak calls
• Rep1, Rep2
• Rep0
• Rep1.pr1, Rep1.pr2
• Rep2.pr1, Rep2.pr2
• Rep0.pr1, Rep0.pr2
4. IDR
IDR thresholds
• Nt = Rep1 VS Rep2
• Np = Rep0.pr1 VS Rep0.pr2
• N1 = Rep1.pr1 VS Rep1.pr2
• N2 = Rep2.pr1 VS Rep2.pr2
• Optimal = max(Np,Nt)
Thresholded Peak calls (File Format: NarrowPeak)
• SPP (Blacklist filter)
• GEM (Blacklist filter)
• PeakSeq (Blacklist filter)
5. Signal tracks (File Format: BIGWIG)
• ChIP Fold-enrichment over input for Rep1, Rep2, Rep0
6. Motif Discovery
Motifs and motif hits
• Integrated GEM motifs
• Post-peak calling motifs
QC
4a. Self consistency ratio (N1/N2)
4b. Rescue Ratio (Np/Nt)
4c. Fraction of reads in Peaks (FRiP)
4d. Overlap between peak callers
5. Signal Correlation between replicates
Anshul Kundaje
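The IDR-based QC values named in the pipeline (self-consistency ratio N1/N2, rescue ratio Np/Nt, and the optimal set chosen as max(Np, Nt)) are simple arithmetic on peak counts. A small sketch, assuming the four counts have already been produced by IDR; the numbers and the function name are hypothetical:

```python
def idr_qc(n1, n2, np_, nt):
    """Summarize IDR peak counts using the ratios named in the pipeline.

    n1, n2 : peaks passing IDR between pseudoreplicates of Rep1 and of Rep2
    np_    : peaks passing IDR between pooled pseudoreplicates (Rep0.pr1 vs Rep0.pr2)
    nt     : peaks passing IDR between the true replicates (Rep1 vs Rep2)
    """
    return {
        "self_consistency_ratio": n1 / n2,
        "rescue_ratio": np_ / nt,
        "optimal_peak_count": max(np_, nt),
    }


# Hypothetical counts for illustration only.
print(idr_qc(n1=12000, n2=10000, np_=15000, nt=14000))
```

Ratios close to 1 indicate that the replicates behave similarly; any pass/fail cutoff would be a pipeline policy that is not given on the slide.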
What is Galaxy?
A collection of bioinformatics tools for:
• data conversion and manipulation
• statistical analysis
• next-generation sequencing analysis
• integration of useful tools into reusable pipelines that can also be shared
• a unified and consistent interface for easy exploration
Toolbox for:
• retrieving (“get”) data
• manipulating data (liftOver, filter, sort, set
operations, format conversion)
• data analysis (statistics, sequence alignment,
variant calling and annotation)
dozens of tools for different NGS applications
packaged with Galaxy
Galaxy pipelines
managing workflows
Sequence Comparison
Goals of Sequence Comparison:
• Find similarity such that an inference of
homology is justified.
– Similarity = observed with sequence alignment
– Homology = shared evolutionary history
(ancestry)
• Find a new sequence (gene) of interest
• Provide biologically appropriate results.
– Substitutions, insertions and deletions
• Compare as many sequences as fast as
possible.
Local vs. Global Alignment
• Global Alignment
--T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC
| || | || | | | ||| || | | | | |||| |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C
Dynamic Programming
Basics for sequence alignment.
Smith-Waterman method.
A C
A 2
C
Scoring for nucleotides:
Match = 2
Gap = -1
Mismatch = -1
Dynamic Programming
Basics for sequence alignment.
Smith-Waterman method.
A C
A 2 1
C 1 4
Dynamic Programming
Basics for sequence alignment.
Smith-Waterman method.
1 4 1) diagonal
C 2) up
3) left
Scoring for nucleotides:
Match = 2
Gap = -1
Mismatch = -1
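The three moves (diagonal, up, left) and the scores listed above can be written as a single recurrence for filling each cell. The notation below (H for the score matrix, a and b for the two sequences, s for the match/mismatch score) is introduced here for illustration and does not appear on the slides. The floor at 0 is the standard Smith-Waterman rule for local alignment.

```latex
H_{i,0} = H_{0,j} = 0, \qquad
H_{i,j} = \max
\begin{cases}
0 \\
H_{i-1,j-1} + s(a_i, b_j) & \text{(diagonal)} \\
H_{i-1,j} - 1 & \text{(up, gap)} \\
H_{i,j-1} - 1 & \text{(left, gap)}
\end{cases}
\qquad
s(a_i, b_j) =
\begin{cases}
+2 & \text{if } a_i = b_j \\
-1 & \text{if } a_i \neq b_j
\end{cases}
```

For the two-letter example, the A/A cell is 0 + 2 = 2, the A/C cells are 2 - 1 = 1 (a gap move from the A/A cell), and the C/C cell is 2 + 2 = 4 (a diagonal match from the A/A cell).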
Points: match +2, mismatch or gap -1
- T A C T A A C G C
- 0 0 0 0 0 0 0 0 0 0
T 0 2 1 0 2 1 0 0 0 0
G
C
A
C
G
C
T
Points: match +2, mismatch or gap -1
- T A C T A A C G C
- 0 0 0 0 0 0 0 0 0 0
T 0 2 1 0 2 1 0 0 0 0
G 0 1 1 0 1 1 0 0 2 1
C 0 0 0 3 2 1 0 2 1 4
A
C
G
C
T
Points: match +2, mismatch or gap -1
- T A C T A A C G C
- 0 0 0 0 0 0 0 0 0 0
T 0 2 1 0 2 1 0 0 0 0
G 0 1 1 0 1 1 0 0 2 1
C 0 0 0 3 2 1 0 2 1 4
A 0 0 2 2 2 4 3 2 1 3
C 0 0 1 4 3 3 3 5 4 3
G 0 0 0 3 3 2 2 4 7 6
C 0 0 0 2 2 2 1 4 6 9
T 0 2 1 1 4 3 2 3 5 8
Points: match +2, mismatch or gap -1 Order: Diagonal, up, left
- T A C T A A C G C
- 0 0 0 0 0 0 0 0 0 0
T 0 2 1 0 2 1 0 0 0 0
G 0 1 1 0 1 1 0 0 2 1
C 0 0 0 3 2 1 0 2 1 4
A 0 0 2 2 2 4 3 2 1 3
C 0 0 1 4 3 3 3 5 4 3
G 0 0 0 3 3 2 2 4 7 6
C 0 0 0 2 2 2 1 4 6 9
T 0 2 1 1 4 3 2 3 5 8
Points: match +2, mismatch or gap -1 Order: Diagonal, up, left
- T A C T A A C G C
- 0 0 0 0 0 0 0 0 0 0
T 0 2 1 0 2 1 0 0 0 0
G 0 1 1 0 1 1 0 0 2 1
C 0 0 0 3 2 1 0 2 1 4
A 0 0 2 2 2 4 3 2 1 3
C 0 0 1 4 3 3 3 5 4 3
G 0 0 0 3 3 2 2 4 7 6
C 0 0 0 2 2 2 1 4 6 9
T 0 2 1 1 4 3 2 3 5 8
Resulting alignment:
TACTAACGC
|:| | |||
TGC-A-CGCT
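The filled matrix and traceback above can be reproduced with a short Smith-Waterman sketch using the slide's scoring (match +2, mismatch or gap -1) and the stated traceback order (diagonal, up, left). This is a minimal illustration with a linear gap penalty and hypothetical function and variable names, not the lecture's own code. Several alignments tie at the best score of 9, so the gap placement it reports (TGC--ACGC) differs from the one drawn on the slide (TGC-A-CGC).

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of a (matrix rows) against b (matrix columns)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # diagonal
                          H[i - 1][j] + gap,    # up
                          H[i][j - 1] + gap)    # left
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest cell until a 0 is reached,
    # trying diagonal, then up, then left.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(b[j - 1]); bottom.append(a[i - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append("-"); bottom.append(a[i - 1]); i -= 1
        else:
            top.append(b[j - 1]); bottom.append("-"); j -= 1
    return H, best, "".join(reversed(top)), "".join(reversed(bottom))


H, score, top, bottom = smith_waterman("TGCACGCT", "TACTAACGC")
for row in H:
    print(" ".join(str(v) for v in row))
print(score)
print(top)
print(bottom)
```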
NCBI BLAST
The BLAST Search Algorithm
Neighborhood words (neighborhood score threshold T = 13):
PQG 17
PEG 14
PRG 13
PKG 13
PNG 12
PDG 12
PHG 12
PMG 12
PSG 12
PQN 11
PQA 10
etc ...
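The neighborhood-word list above can be generated by scoring every possible word of the same length against the query word and keeping those that score at or above T. The sketch below shows that enumeration with a deliberately tiny, hypothetical alphabet and scoring function. It does not use the matrix behind the scores on the slide, so its numbers will not match the table.

```python
from itertools import product


def neighborhood_words(query_word, alphabet, score, threshold):
    """Return (word, score) pairs scoring >= threshold against query_word."""
    hits = []
    for letters in product(alphabet, repeat=len(query_word)):
        total = sum(score(q, c) for q, c in zip(query_word, letters))
        if total >= threshold:
            hits.append(("".join(letters), total))
    # Highest score first, ties broken alphabetically.
    return sorted(hits, key=lambda h: (-h[1], h[0]))


def toy_score(x, y):
    """Hypothetical scores: identity 5, Q/E substitution 2, anything else -2."""
    if x == y:
        return 5
    if {x, y} == {"Q", "E"}:
        return 2
    return -2


print(neighborhood_words("PQG", alphabet="PQEG", score=toy_score, threshold=11))
# [('PQG', 15), ('PEG', 12)]
```

With the full 20-letter alphabet and three-letter words there are 8,000 candidates per query word, which is small enough to enumerate directly.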
BLAST Scoring System
Raw score (S): Sum of scores for each aligned position and scores for
gaps
S = Σ(match scores) - Σ(mismatch penalties) - Σ(gap penalties)
note: this score varies with the scoring matrix used and thus may not
be meaningfully compared for different searches
Bit score (S’): Version of the raw score that is normalized by the scale
of the scoring matrix (λ) and the scale of the search space (K)
S’ = (λ S – ln(K)) / ln(2)
note: because it is normalized the bit score can be meaningfully
compared across searches
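The bit-score conversion can be checked numerically. The sketch below implements S’ = (λS - ln K) / ln 2 as given above and adds the standard Karlin-Altschul relation E = m · n · 2^(-S’), which is not stated on the slide. The λ and K values are illustrative assumptions, since real values depend on the scoring matrix and gap costs, and the function names are hypothetical.

```python
import math


def bit_score(raw_score, lam, K):
    """Convert a raw alignment score S to a bit score S'."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2 ** (-bits)


# Illustrative parameters only (assumed, not taken from the lecture).
lam, K = 0.267, 0.041
bits = bit_score(100, lam, K)
print(round(bits, 2))
print(expect_value(bits, query_len=250, db_len=1_000_000))
```

Because λ and K are folded into the bit score, the E-value in this form depends only on S’ and the search-space size m · n.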
E p
10 0.99995460
5 0.99326205
2 0.86466472
1 0.63212056
0.1 0.09516258 (about 0.1)
0.05 0.04877058 (about 0.05)
0.001 0.00099950 (about 0.001)
0.0001 0.0001000
Table 4.4
page 107
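The tabulated values are consistent with the usual conversion from an E-value to a p-value, p = 1 - e^(-E): the probability of seeing at least one chance hit when E are expected. A short check, with a hypothetical function name:

```python
import math


def p_from_e(e_value):
    """Probability of at least one chance hit given the expected count E."""
    # expm1 keeps precision when E is very small.
    return -math.expm1(-e_value)


for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{e:<8} {p_from_e(e):.8f}")
```

For small E the two values are nearly identical, which is why the table notes "about 0.1", "about 0.05" and "about 0.001".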
BLAT -- BLAST-Like Alignment Tool
By Jim Kent, UCSC
http://genome.ucsc.edu/cgi-bin/hgBlat
• BLAT is designed to find sequences of >95% similarity of
length >40 bases. Perfect sequence matches of >33
bases are identified.
• Protein BLAT finds sequences of >80% similarity of length
>20 amino acids.
• DNA BLAT works by keeping an index of the entire
genome. The index consists of all non-overlapping 11-mers
except for those in repeats.
• Protein BLAT works in a similar manner, except with 4-mers
rather than 11-mers.
• The index is used to find areas of probable similarity.
Then the sequence for the area of interest is read into
memory for a detailed alignment.
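The indexing idea described above can be sketched in a few lines: store the positions of non-overlapping k-mers from the genome, then look up every overlapping k-mer of the query to find candidate regions. This is a toy illustration, not BLAT's implementation. It skips repeat masking and the detailed alignment stage, uses k = 4 on a made-up sequence so the output is easy to follow, and all names are hypothetical.

```python
from collections import defaultdict


def build_index(genome, k=11):
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return index


def candidate_hits(query, index, k=11):
    """Look up every overlapping query k-mer; return (query_pos, genome_pos) pairs."""
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            hits.append((q, g))
    return hits


genome = "ACGTTGCAAGCTTAGGCTAA"  # made-up sequence
index = build_index(genome, k=4)
hits = candidate_hits("GGTGCAAGCTT", index, k=4)
print(hits)  # [(2, 4), (6, 8)]
```

Both hits lie on the same diagonal (genome position minus query position is 2), which is the kind of clustering that marks an area of probable similarity worth aligning in detail.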
BLAT Indexing
BLAT output includes text formats and browser tracks
Scoring Matrix
Accepted Point Mutations (PAM)
or Percent Accepted Mutations
Creating the PAM1
Schwartz & Dayhoff (1978)
• Studied 34 protein super-families and grouped them
into 71 phylogenetic trees. There were 1,572
changes observed. All sequences were at least 85%
identical. Alignments were scanned with a 100
amino acid window.
• These are observed mutations thus the term
accepted point mutations, accepted by natural
selection and thus the dominant allele in the
species.
• Normalized probability of change:
Pij = (Cij / T) x (1 / Fi)
Cij = number of changes from aai to aaj
Fi = freq of aai in that group of sequences
T = total number of all aa changes in 100 sites
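As a worked example of the normalization above, the sketch below applies Pij = (Cij / T) x (1 / Fi) to a couple of made-up counts. The counts, total and frequency are hypothetical and are not values from the Dayhoff data.

```python
def normalized_change_probability(changes, total_changes, freq):
    """Apply P_ij = (C_ij / T) * (1 / F_i) to every observed (i, j) pair."""
    return {
        (i, j): (count / total_changes) / freq[i]
        for (i, j), count in changes.items()
    }


# Hypothetical numbers for illustration only.
changes = {("A", "S"): 4, ("A", "G"): 2}  # C_ij: observed changes from A
total_changes = 100                        # T
freq = {"A": 0.08}                         # F_i: frequency of A

print(normalized_change_probability(changes, total_changes, freq))
```

Dividing by Fi corrects for how often the original amino acid occurs, so that common residues do not appear more mutable simply because they are seen more often.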
PAM250 log odds scoring matrix
A 2
R -2 6
N 0 0 2
D 0 -1 2 4
C -2 -4 -4 -5 12
Q 0 1 1 2 -5 4
E 0 -1 1 3 -5 2 4
G 1 -3 0 1 -3 -1 0 5
H -1 2 2 1 -3 3 1 -2 6
I -1 -2 -2 -2 -2 -2 -2 -3 -2 5
L -2 -3 -3 -4 -6 -2 -3 -4 -2 -2 6
K -1 3 1 0 -5 1 0 -2 0 -2 -3 5
M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6
F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9
P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6
S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2
T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3
W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17
Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10
V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4
A R N D C Q E G H I L K M F P S T W Y V
Protein family: rate of change (PAMs per 100 residues per 10^8 years)
Immunoglobulin (Ig) kappa chain 37
Kappa casein 33
Luteinizing hormone β 30
Lactalbumin 27
Complement component 3 27
Collagen 1.7
Glucagon 1.2
Ubiquitin 0
PAM1 99% identity
PAM10.7 90% identity
PAM80 50% identity
PAM250 20% identity
(Figure axis: percent identity; region marked “twilight zone”.)
Deriving Substitution Scores BLOSUM
Henikoff & Henikoff, 1992
Protein Family
Block A Block B
BLOSUM 62
scoring matrix
Position Specific Scoring Matrix (PSSM) for LDL (LPB000033) from BLOCKS
Position of match (rows) by residue (columns):
A B C D E F G H I K L M N P Q R S T V W X Y Z * -
1. -27 -28 -30 -30 -4 -30 -33 -24 6 19 -29 -1 -26 -36 1 25 -8 7 -25 31 -14 -27 -1 0 0
2. 7 -65 28 -64 6 -53 -67 -64 37 -64 -45 -45 -67 -69 -63 -66 -60 -56 -36 -66 -42 -60 -33 0 0
3. 6 6 -31 11 26 -40 7 -28 -31 -3 -38 -34 -1 -37 -23 -30 2 -28 4 -42 -13 -40 -1 0 0
4. 13 -5 -26 11 -26 -35 -30 -27 -22 13 -30 -27 -27 -36 -25 21 0 -25 16 -39 -13 -35 -25 0 0
5. 24 7 -29 10 -34 -38 5 -36 -34 -37 7 -32 3 -41 -36 -39 12 -30 -32 -45 -17 -42 -36 0 0
6. -24 -5 -28 -3 -15 -34 16 -32 -30 -11 8 -27 -6 2 -32 -5 20 8 -30 -41 -10 -38 -25 0 0
7. -58 -23 -52 8 -62 -44 -67 -61 -38 -63 31 27 -63 -69 -60 -62 -64 -57 -43 -60 -44 -57 -61 0 0
8. -13 23 -33 24 -21 -34 5 28 -41 -5 -41 -35 22 -39 4 -27 -28 -30 -39 31 -18 -25 -7 0 0
9. -33 0 -42 1 -41 -51 33 -40 -53 -37 -53 -47 -2 -4 -42 10 7 -39 -50 -50 -26 -51 -42 0 0
10. -4 -15 -18 -25 -24 -23 6 -24 -4 14 9 -15 -3 -31 -22 -19 10 9 8 -32 -7 -28 -23 0 0
11. 5 11 -23 23 8 8 -7 12 -26 9 -12 -23 -5 -29 -4 6 -9 -21 -25 -29 -7 -19 1 0 0
12. -37 -42 -41 -44 -39 -42 -45 -34 -3 -2 8 -33 -39 -4 0 33 9 -38 -20 -48 -19 -44 -17 0 0
13. -6 -5 -18 -18 -17 -18 -22 6 -18 14 -14 -4 12 -27 9 8 4 11 0 -23 -5 11 -3 0 0
14. -14 -37 -26 -39 -36 7 -37 -36 -17 -36 7 -23 -35 8 -6 -38 15 8 22 -37 -14 -31 -19 0 0
15. -47 -57 -40 -59 -55 -35 -58 -52 24 8 27 -26 -56 -58 -49 -46 -54 -46 -28 -51 -34 -46 -52 0 0
16. 0 -18 -19 -33 -29 17 -34 -25 13 -29 5 -14 1 -37 12 -30 -29 -24 21 -25 -12 4 -6 0 0
17. -22 -5 -26 -24 -10 -32 8 -1 -34 -25 -21 -30 19 -36 13 -3 18 16 -31 -36 -12 1 3 0 0
18. -8 -7 -21 4 12 15 -27 -21 -21 0 1 -20 -21 -32 -1 -23 16 4 -7 -25 -6 8 5 0 0
19. -33 25 -37 32 -5 -44 -5 -29 -39 -29 -44 -39 17 -43 -1 10 4 -31 -3 -48 -18 -42 -3 0 0
20. -43 -54 -36 -56 -52 -32 -54 -46 19 -42 27 -22 -52 -55 -46 12 -50 -42 5 -46 -29 -42 -49 0 0
21. -19 -21 -22 -20 16 -29 -7 -21 -8 0 -11 -21 -22 -32 10 6 11 8 13 -34 -7 -30 13 0 0
22. -27 -5 -30 -21 12 -24 -30 23 -32 -21 -16 -3 15 -37 13 6 -3 -26 -31 45 -14 -17 12 0 0
23. -81 -91 -91 -89 -87 -96 -91 -88 -94 -86 -95 -90 -93 45 -88 -89 -86 -87 -91 -99 -83 -97 -88 0 0
24. 15 -4 -35 -30 -7 -47 -35 -30 -46 -28 -47 -41 31 -45 -33 23 5 -32 -41 -49 -23 -44 -21 0 0
25. 10 -34 -34 -36 -38 -45 32 -38 -46 9 -46 -41 -32 -43 -36 -33 8 -33 -41 -45 -25 -46 -37 0 0
26. 5 -58 -41 -59 -52 -42 -58 -51 34 -52 6 -33 -56 -59 16 -53 -52 -46 -3 -55 -31 -50 -14 0 0
27. 24 -40 -28 -42 -39 -39 -35 -40 -30 -39 4 -31 -38 11 -39 -40 -8 21 2 -46 -19 -42 -39 0 0
28. -32 -45 -28 -47 -45 -24 6 -40 7 -43 21 -17 -43 -47 -41 -42 -39 -11 19 33 -21 -30 -43 0 0
29. -55 4 -59 39 -39 -63 -53 -50 -58 10 -63 -59 -40 -61 -49 11 -52 -53 -3 -69 -39 -62 -44 0 0
30. -24 -33 -24 -34 -33 19 -33 7 -11 -32 12 1 -31 23 -30 -32 8 -25 -8 -23 -13 17 -32 0 0
31. -20 -3 -18 10 7 -12 -28 1 3 -18 11 18 -20 -9 -8 5 -13 -19 7 -21 -6 12 -2 0 0
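A PSSM like the one above is used by sliding it along a sequence and, at each offset, summing the score of the observed residue in each matrix row. The sketch below does this with only the first three rows of the matrix, copied from the table, to keep it short. The test sequence and the function names are hypothetical.

```python
COLUMNS = "ABCDEFGHIKLMNPQRSTVWXYZ*-"

# Rows 1-3 of the PSSM above, in the column order given by COLUMNS.
PSSM_ROWS = [
    [-27, -28, -30, -30, -4, -30, -33, -24, 6, 19, -29, -1, -26,
     -36, 1, 25, -8, 7, -25, 31, -14, -27, -1, 0, 0],
    [7, -65, 28, -64, 6, -53, -67, -64, 37, -64, -45, -45, -67,
     -69, -63, -66, -60, -56, -36, -66, -42, -60, -33, 0, 0],
    [6, 6, -31, 11, 26, -40, 7, -28, -31, -3, -38, -34, -1,
     -37, -23, -30, 2, -28, 4, -42, -13, -40, -1, 0, 0],
]


def score_window(window, rows=PSSM_ROWS):
    """Sum the PSSM score of each residue at its position in the window."""
    return sum(row[COLUMNS.index(res)] for row, res in zip(rows, window))


def best_window(sequence, rows=PSSM_ROWS):
    """Return (offset, score) of the best-scoring window in the sequence."""
    width = len(rows)
    scored = [(score_window(sequence[i:i + width], rows), i)
              for i in range(len(sequence) - width + 1)]
    score, offset = max(scored)
    return offset, score


print(score_window("KIE"))     # 19 + 37 + 26 = 82
print(best_window("AAKIEGG"))  # (2, 82)
```

Unlike a PAM or BLOSUM matrix, the score for a residue here depends on its position in the motif: I scores 37 at position 2 but -31 at position 3.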
Logos provide a simple visualization of a
PSSM
Crooks GE, Hon G, Chandonia JM, Brenner SE. 2004. WebLogo: A sequence logo generator.
Genome Res., 14:1188-1190 [weblogo.berkeley.edu]
Schneider TD, Stephens RM. 1990. Sequence Logos: A New Way to Display Consensus
Sequences. Nucleic Acids Res. 18:6097-6100
Considerations when making a profile.
PSI-BLAST
Iterative Protein-Protein BLAST
(Repeat until no change or iteration limit)
PSI-BLAST is performed in five steps
(Figure: retinol-binding protein, apolipoprotein D, odorant-binding protein.)
Scoring matrices focus on the big (or small) picture
(Figure panels: retinol-binding protein with PAM250, PAM30, Blosum80 and Blosum45.)
PSI-BLAST generates scoring matrices
more powerful than PAM or BLOSUM
(Figure: retinol-binding protein.)
PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 1
Query: 27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86
V+ENFD ++ G WY + +K P + I A +S+ E G + K ++
Sbjct: 33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82
PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 2
Query: 4 VWALLLLAAWAAAERDCRVSSF--------RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55
V L+ LA A + +F V+ENFD ++ G WY + +K P +
Sbjct: 2 VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60
PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 3
Query: 3 WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54
V L+ LA A + S V+ENFD ++ G WY + K
Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59
PSI-BLAST: the problem of corruption
Progressive alignment
• Advantages
– Biologically reasonable search strategy
– Relatively fast & efficient
• Disadvantages
– Quality deteriorates when sequences are
distantly related
– Strongly dependent upon initial alignments
since early errors are “locked in”
   v1  v2  v3  v4
v1 -
v2 .17 -
v3 .87 .28 -       (.17 means 17% identical)
v4 .59 .33 .62 -
(Guide tree figure; leaves ordered v1, v3, v4, v2)
Calculate:
v1,3 = alignment (v1, v3)
v1,3,4 = alignment((v1,3),v4)
v1,2,3,4 = alignment((v1,3,4),v2)
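The order of the alignments above follows from the pairwise identity table: start with the most similar pair, then repeatedly add the sequence that is most similar to what has already been aligned. The sketch below reproduces that order with a simple greedy rule based on average identity to the current group. That linkage rule is an assumption made for illustration, since the slides do not say how the guide tree is built, and the function names are hypothetical.

```python
# Pairwise fractional identities from the table above.
IDENTITY = {
    ("v1", "v2"): 0.17, ("v1", "v3"): 0.87, ("v2", "v3"): 0.28,
    ("v1", "v4"): 0.59, ("v2", "v4"): 0.33, ("v3", "v4"): 0.62,
}


def identity(a, b):
    """Look up the identity of a pair regardless of order."""
    return IDENTITY[(a, b)] if (a, b) in IDENTITY else IDENTITY[(b, a)]


def progressive_order(names):
    """Return the order in which sequences join the growing alignment."""
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    group = list(max(pairs, key=lambda p: identity(*p)))  # most similar pair
    remaining = [n for n in names if n not in group]
    while remaining:
        # Add the sequence with the highest average identity to the group.
        nxt = max(remaining,
                  key=lambda n: sum(identity(n, m) for m in group) / len(group))
        group.append(nxt)
        remaining.remove(nxt)
    return group


print(progressive_order(["v1", "v2", "v3", "v4"]))  # ['v1', 'v3', 'v4', 'v2']
```

Here v1 and v3 (87% identical) are aligned first, v4 joins next (average identity 0.605 to the pair), and v2 is added last, matching the calculation listed above.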
Slide from JD Wren
Step 3: Progressive Alignment
FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
. . : ** . :.. *:.* * . * **:
Slide from JD Wren
MUSCLE Algorithm