Lecture5 Sequence Comparison-2019

References, Pipelines & Comparisons
Mike Cherry
Genomics
Genetics 211 - Winter 2019
reference genome
Reads
Contigs
Scaffold
Chromosome
Homo sapiens Reference Genome
UCSC Version   Release Date   Release Name
hg38           Dec 2013       GRCh38
hg19           Feb 2009       GRCh37
hg18           Mar 2006       NCBI Build 36.1
hg17           May 2004       NCBI Build 35
hg16           Jul 2003       NCBI Build 34
hg15           Apr 2003       NCBI Build 33
hg13           Nov 2002       NCBI Build 31
hg12           Jun 2002       NCBI Build 30
…
hg8            Aug 2001       UCSC assembled
…
hg1            May 2000       UCSC assembled
GRC = Sanger, WashU, EBI and NCBI
hg19 to GRCh38
• Several thousand corrected bases in both
coding and non-coding regions
• 100 assembly gaps closed or reduced
• satellite repeat modeled
– megabase-sized gaps in centromere regions
• updated mitochondrial reference sequence
• many alternative reference sequences
– how to use in mapping?
GRCh38.p12 Statistics
Number of regions with alternate loci or patches 317
Total sequence length 3,099,706,404
Total ungapped length 2,948,583,725
Gaps between scaffolds 349
Number of scaffolds 472
Scaffold N50 67,794,873
Scaffold L50 16
Number of contigs 998
Contig N50 57,879,411
Contig L50 18
Total number of chromosomes and plasmids 24
Number of component sequences (WGS or clone) 35,613
https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.38/
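The N50 and L50 rows in the statistics above can be made concrete with a short sketch. N50 is the length of the contig (or scaffold) at which the running total of lengths, taken from longest to shortest, first reaches half of the assembly; L50 is how many sequences that took. The function name and the toy lengths below are illustrative assumptions, not values from the GRCh38 assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is N50, 'count' is L50
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy assembly (invented lengths): total 30, so half is 15.
toy_contigs = [10, 6, 5, 4, 3, 2]
print(n50_l50(toy_contigs))
```

Read this way, a contig L50 of 18 means that the 18 longest contigs already cover half of the assembled sequence.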
GRCh38
Issues Resolved Summary of GRCh38 updates
Changes in scaffold N50 length
Gap Resolution
Schneider et al. Genome Res. (2017) 27:849-864
An algorithmic overview of satellite characterization and linear
representation.
Karen H. Miga et al. Genome Res. 2014;24:697-707
Previous GRC versions simply had a 3 Mb gap on each
chromosome to represent the centromeric region.
GRCh38 has modeled the average centromere for
human chromosomes.
Karen Miga and Jim Kent, UCSC
Model Organism Reference Genomes
Statistic                                            Mouse mm10      Fly v6        Worm WBcel235   Yeast r64
Number of regions with alternative loci or patches   84
Total sequence length                                2,800,055,571   143,726,002   100,286,401     12,157,105
Total assembly gap length                            79,356,534      1,152,978
Gaps between scaffolds                               191
Number of scaffolds                                  279             1,870
Scaffold N50                                         52,589,046      25,286,936
Number of contigs                                    780             2,442
Contig N50                                           32,273,079      31,485,538
Total number of chromosomes and plasmids             22              8             7               17
Transitioning between assemblies
• UCSC liftOver software
– start with a BED file
– uses BLAT to identify matching regions, then alters the
coordinates
– a chain file defines the changes needed
between assemblies
• NCBI Genome Remapping Service
– http://www.ncbi.nlm.nih.gov/genome/tools/remap
For processed data, you typically do not want to liftOver; rather,
remap and then rerun the analysis.
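To illustrate what "altering the coordinates" means, here is a minimal sketch of lifting one BED interval through a set of aligned blocks. It is not the liftOver implementation and does not parse the real chain format: the block list, the function name and every coordinate are invented for illustration, and the sketch simply refuses any interval that does not sit inside a single ungapped block.

```python
# Each block is (old_start, new_start, size): an ungapped stretch present in
# both assemblies. All values are hypothetical, for illustration only.
TOY_BLOCKS = [
    (0, 0, 1000),
    (1000, 1200, 500),   # 200 bases were added before this block in the new assembly
    (1600, 1700, 400),   # old bases 1500-1599 have no counterpart in the new assembly
]


def lift_interval(start, end, blocks=TOY_BLOCKS):
    """Lift a half-open BED interval [start, end); None if it is not inside one block."""
    for old_start, new_start, size in blocks:
        if old_start <= start and end <= old_start + size:
            shift = new_start - old_start
            return start + shift, end + shift
    return None


print(lift_interval(1100, 1150))   # inside the second block, shifted by +200
print(lift_interval(1450, 1650))   # spans an unaligned stretch, so it cannot be lifted
```

Intervals that fail to lift are one reason processed results such as peak calls are better regenerated from remapped reads than converted after the fact.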
One reference doesn’t fit all.
B. Paten, A. Novak & D. Haussler. (2015) Mapping to a Reference Genome Structure. arXiv. 2014:1–26
Genome Graphs
Single linear reference genome = “a single monoploid assembly of
the genome of a species”
How to better represent the sequence diversity and complexity?
More accurate sequence variant calling?
AM Novak, et al., 2017, https://doi.org/10.1101/101378

Reference Genome in a Graph Representation
B. Paten, A. Novak & D. Haussler. (2015) Mapping to a Reference Genome Structure. arXiv. 2014:1–26
Graph-based Variant Calling Evaluation

Variant Calling with Genome Graphs

blacklist & sponge
Remove robust, non-biological signals of
enrichment from downstream computational
analysis
Blacklist for hg19
• If the signal artifact (uniquely mapping) is extremely severe (> 1000 fold) we
flag the region.
• If the signal artifact is present in most of the tracks independent of cell-line or
experiment type we flag the region.
• If the region has dispersed high mappability and low mappability coordinates
then it is more likely to be an artifact region.
• If the region has a known repeat element then it is more likely to be an artifact.
• We check whether the stranded read counts/signal is structured in the UwDNase,
UncFAIRE and input/control datasets, i.e. whether we see the offset mirror peaks on the
+ and - strands that are typically observed in real, functional peaks. If so, we remove
these regions from the artifact list.
• If the region exactly overlaps a known gene’s TSS, or is in the vicinity of or
within a known gene, it is more likely to be removed from the artifact list. Our
intention is to give such regions the benefit of the doubt of being real peaks.
Anshul Kundaje, 2014, A comprehensive collection of signal artifact blacklist regions in the human genome
Blacklist for hg19
1. Uniquely mapping signal artifact > 1000 fold
2. Present in most of the tracks independent of cell-line
or experiment type
3. Dispersed high mappability and low mappability
coordinates
4. Known repeat element
5. If read counts/signal is structured in the UwDNase,
UncFAIRE and input/control datasets, as is typically
observed in real, functional peaks, we remove
these regions from the artifact list.
6. If the region exactly overlaps, or is in the vicinity of, a known gene’s
TSS, it is removed from the artifact list.
Anshul Kundaje (2014) A comprehensive collection of signal artifact blacklist regions in the human genome
Blacklist composition for hg19
Region type bp % of total
centromeric_repeat 8,997,003 77.64%
BSR/Beta 797,511 6.88%
Satellite_repeat 723,464 6.24%
Low_mappability_island 514,885 4.44%
ALR/Alpha 299,365 2.58%
(CATTC)n 145,669 1.26%
telomeric_repeat 25,798 0.22%
chrM 24,608 0.21%
LSU-rRNA_Hsa 20,620 0.18%
High_Mappability_island 12,594 0.11%
TAR1 12,532 0.11%
ACRO1 5,877 0.05%
SSU-rRNA_Hsa 5,595 0.05%
snRNA 2,062 0.02%
(GAGTG)n 422 0.00%
(GAATG)n 267 0.00%
total: 11,588,272 bp
Anshul Kundaje (2014) A comprehensive collection of signal artifact blacklist regions in the human genome
Blacklist for GRCh38
chr1 : 124450730 124450960 chr16 : 34593000 34593590
chr2 : 90397520 90397900 chr16 : 34594490 34594720
chr2 : 90398120 90398760 chr16 : 34594900 34595150
chr3 : 93470260 93470870 chr16 : 34595320 34595570
chr4 : 49118760 49119010 chr16 : 46380910 46381140
chr4 : 49120790 49121130 chr16 : 46386270 46386530
chr5 : 49601430 49602300 chr16 : 46390180 46390930
chr5 : 49657080 49657690 chr16 : 46394370 46395100
chr5 : 49661330 49661570 chr16 : 46395670 46395910
chr10 : 38528030 38529790 chr16 : 46398780 46399020
chr10 : 42070420 42070660 chr16 : 46400700 46400970
chr16 : 34571420 34571640 chr20 : 28513520 28513770
chr16 : 34572700 34572930 chr20 : 31060210 31060770
chr16 : 34584530 34584840 chr20 : 31061050 31061560
chr16 : 34585000 34585220 chr20 : 31063990 31064490
chr16 : 34585700 34586380 chr20 : 31067930 31069060
chr16 : 34586660 34587100 chr20 : 31069000 31069280
chr16 : 34587060 34587660 chr21 : 8219780 8220120
chr16 : 34587900 34588170 chr21 : 8234330 8234620
total: 17,040 bp
Anshul Kundaje (Stanford) & Alan Boyle (University of Michigan), October 2016
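Applying a blacklist usually amounts to dropping any peak that overlaps a listed region. The sketch below uses three of the GRCh38 regions from the table above; treating the coordinates as half-open, BED-style intervals is an assumption, and the peak list and function names are invented for illustration.

```python
# A few regions copied from the GRCh38 blacklist table above.
# Assumption: coordinates are half-open [start, end), as in BED files.
BLACKLIST = {
    "chr1": [(124450730, 124450960)],
    "chr2": [(90397520, 90397900), (90398120, 90398760)],
}


def overlaps_blacklist(chrom, start, end, blacklist=BLACKLIST):
    """True if [start, end) shares at least one base with a blacklisted region."""
    return any(start < b_end and b_start < end
               for b_start, b_end in blacklist.get(chrom, []))


def filter_peaks(peaks, blacklist=BLACKLIST):
    """Keep only the (chrom, start, end) peaks that avoid the blacklist."""
    return [p for p in peaks if not overlaps_blacklist(*p, blacklist=blacklist)]


# Hypothetical peak calls.
peaks = [
    ("chr1", 124450800, 124451000),   # overlaps the chr1 region, dropped
    ("chr1", 1000, 1500),             # kept
    ("chr2", 90397900, 90398000),     # starts where a region ends, kept
]
print(filter_peaks(peaks))
```

For genome-wide peak sets an interval tree or a sorted sweep would scale better than this linear scan.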
Signal due to unique mapping reads in high-mappability islands
Signal due to multi-mapping reads in high-mappability islands
sponge sequence database
K.H. Miga, C. Eisenhart and W.J. Kent. 2015. NAR doi:10.1093/nar/gkv671
pipelines
http://www.nytimes.com/2016/02/05/science/dna-study-of-first-ancient-african-genome-flawed-researchers-report.html?_r=0
DNA Study of First Ancient African Genome
Flawed, Researchers Report
By Carl Zimmer
In October, Dr. Manica and his colleagues reconstructed the first ancient human genome ever found
in Africa, retrieved from the skeleton of a man who lived in Ethiopia 4,500 years ago.
Ancient DNA experts were delighted, because the genome may provide clues about African history
that other kinds of evidence — broken pottery shards, for example, or scraps of ancient manuscripts
— cannot.
“It’s an amazing, amazing, unique, special, incredible, first-of-its-kind data set,” David
Reich, a geneticist at Harvard Medical School who was not involved in the study, said in an interview.
The researchers found that Mota was only distantly related to many people elsewhere in Africa. In
fact, the analysis suggested that most living Africans shared some DNA with
Europeans and Asians that were missing from Mota’s genome.
To explain these intriguing results, Dr. Manica and his colleagues tested out different historical
scenarios. In the best-supported one, a group of people migrated from the Near East back to East
Africa — a so-called backflow — about 3,000 years ago. In subsequent generations, their DNA
spread across Africa.
https://dl.dropboxusercontent.com/u/26978112/Erratum%20with%20figures.pdf
Pipeline Script Mistake
bwa ref.fasta sample[0-9].fastq > sample[0-9].bam
samtools mpileup -uf ref.fasta sample[0-9].bam | bcftools view -bvcg - > samples.raw.bcf
vcftools samples.vcf -plink samples   # Missing Step
plink --file samples --genome
plink --file samples --read-genome samples.genome --cluster --mds-plot 2
sharing a pipeline
• ENCODE at UCSC’s Genome Browser

• Roadmap at Baylor’s Genboree
• IHEC at the IHEC Portal
• Materials and Methods
• Galaxy/Globus (Galaxy on the Cloud)
• Seven Bridges Genomics
• tarball of scripts
• DNAnexus
TF ChIP-seq processing pipeline
Input from labs: FASTQ (SE/PE)
• Rep1 (Lane 1), Rep1 (Lane 2)
• Rep2 (Lane 1), Rep2 (Lane 2)
Processing steps
1. Mapping (BWA), producing BAM for Rep1 and Rep2
2. Filter reads: unmapped, multimapping, low quality, duplicates
– Subsampled BED: Rep1, Rep2, Rep0 = Rep1+Rep2
– Pseudo-replicates: Rep1.pr1, Rep1.pr2; Rep2.pr1, Rep2.pr2; Rep0.pr1, Rep0.pr2
3. Peak calling: SPP, GEM, PeakSeq
– Relaxed peak calls: Rep1, Rep2; Rep0; Rep1.pr1, Rep1.pr2; Rep2.pr1, Rep2.pr2; Rep0.pr1, Rep0.pr2
4. IDR, applying IDR thresholds
– Nt = Rep1 vs Rep2
– Np = Rep0.pr1 vs Rep0.pr2
– N1 = Rep1.pr1 vs Rep1.pr2
– N2 = Rep2.pr1 vs Rep2.pr2
– Optimal = max(Np, Nt)
– Thresholded peak calls (NarrowPeak): SPP, GEM, PeakSeq, each with blacklist filter
5. Signal tracks (BIGWIG): ChIP fold-enrichment over input for Rep1, Rep2, Rep0
6. Motif discovery: integrated GEM motifs, post-peak calling motifs; produces motifs and motif hits
QC
1a. Mapping statistics
1b. Library complexity
2. Cross-correlation scores (NSC, RSC)
4a. Self-consistency ratio (N1/N2)
4b. Rescue ratio (Np/Nt)
4c. Fraction of reads in peaks (FRiP)
4d. Overlap between peak callers
5. Signal correlation between replicates
Files at DCC
• FASTQs
• BAM
• QC measures
• Peak calls
• Signal tracks
• Motifs
Anshul Kundaje
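The IDR quality ratios in the pipeline are simple arithmetic on four peak counts. The sketch below follows the ratios exactly as labeled on the slide (N1/N2 and Np/Nt); the function name and the example counts are hypothetical, and no pass/fail cutoff is applied because the slide does not state one.

```python
def idr_summary(n1, n2, n_pooled, n_true):
    """Summarize IDR peak counts.

    n1, n2    peaks passing IDR between the pseudo-replicates of Rep1 and of Rep2
    n_pooled  peaks passing IDR between pooled pseudo-replicates (Np)
    n_true    peaks passing IDR between the true replicates (Nt)
    """
    return {
        "self_consistency_ratio": n1 / n2,        # 4a on the slide
        "rescue_ratio": n_pooled / n_true,        # 4b on the slide
        "optimal_peak_count": max(n_pooled, n_true),
    }


# Invented counts, for illustration only.
print(idr_summary(n1=12000, n2=10000, n_pooled=15000, n_true=14000))
```

Ratios close to 1 suggest the replicates behave alike; strongly skewed ratios point at one replicate being weaker than the other.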
What is Galaxy?
A collection of bioinformatics tools for:
• data conversion and manipulation
• statistical analysis
• next generation sequencing analysis
• provides integration of useful tools into
reuseable pipelines, that can also be shared
• unified and consistent interface for easy
exploration
Toolbox for:
• retrieving (“get”) data
• manipulating data (liftOver, filter, sort, set
operations, format conversion)
• data analysis (statistics, sequence alignment,
variant calling and annotation)
dozens of tools for different NGS applications
packaged with Galaxy
Galaxy pipelines
managing workflows
Sequence
Comparison
Goals of Sequence Comparison:
• Find similarity such that an inference of
homology is justified.
– Similarity = observed with sequence alignment
– Homology = shared evolutionary history
(ancestry)
• Find a new sequence (gene) of interest
• Provide biologically appropriate results.
– Substitutions, insertions and deletions
• Compare as many sequences as fast as
possible.
Local vs. Global Alignment
• Global Alignment
--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |  | |  | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C
• Local Alignment: better alignment to find
conserved segment
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
Dynamic Programming
Basics for sequence alignment.
Smith-Waterman method.
A C
A 2
C
Scoring for nucleotides:
Match = 2
Gap = -1
Mismatch = -1
Dynamic Programming
A C
A 2 1
C 1 4

Match = 2
Gap = -1
Mismatch = -1
Dynamic Programming
Use the scores to complete the matrix, row by row. Each cell adds the
match score to, or subtracts the mismatch or gap penalty from, a
neighboring cell, keeping the highest result. Break ties in this order:
1) diagonal
2) up
3) left
A C
A 2 1
C 1 4
Match = 2
Gap = -1
Mismatch = -1
Points: match +2, mismatch or gap -1
- T A C T A A C G C
- 0 0 0 0 0 0 0 0 0 0
T 0 2 1 0 2 1 0 0 0 0
G
C
A
C
G
C
T
- T A C T A A C G C
- 0 0 0 0 0 0 0 0 0 0
T 0 2 1 0 2 1 0 0 0 0
G 0 1 1 0 1 1 0 0 2 1
C 0 0 0 3 2 1 0 2 1 4
A
C
G
C
T
- T A C T A A C G C
- 0 0 0 0 0 0 0 0 0 0
T 0 2 1 0 2 1 0 0 0 0
G 0 1 1 0 1 1 0 0 2 1
C 0 0 0 3 2 1 0 2 1 4
A 0 0 2 2 2 4 3 2 1 3
C 0 0 1 4 3 3 3 5 4 3
G 0 0 0 3 3 2 2 4 7 6
C 0 0 0 2 2 2 1 4 6 9
T 0 2 1 1 4 3 2 3 5 8
Points: match +2, mismatch or gap -1 Order: Diagonal, up, left
- T A C T A A C G C
- 0 0 0 0 0 0 0 0 0 0
T 0 2 1 0 2 1 0 0 0 0
G 0 1 1 0 1 1 0 0 2 1
C 0 0 0 3 2 1 0 2 1 4
A 0 0 2 2 2 4 3 2 1 3
C 0 0 1 4 3 3 3 5 4 3
G 0 0 0 3 3 2 2 4 7 6
C 0 0 0 2 2 2 1 4 6 9
T 0 2 1 1 4 3 2 3 5 8
Resulting alignment:
TACTAACGC
|:| | |||
TGC-A-CGCT
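The matrix above can be reproduced with a short Smith-Waterman sketch that uses the slide's scoring (match +2, mismatch -1, gap -1) and its tie-breaking order for the traceback (diagonal, up, left). The function names are illustrative. The floor of 0 in each cell is what makes the alignment local.

```python
def smith_waterman_matrix(row_seq, col_seq, match=2, mismatch=-1, gap=-1):
    """Fill the local-alignment score matrix, row by row."""
    rows, cols = len(row_seq) + 1, len(col_seq) + 1
    H = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if row_seq[i - 1] == col_seq[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + step,   # diagonal
                          H[i - 1][j] + gap,        # up
                          H[i][j - 1] + gap)        # left
    return H


def traceback(H, row_seq, col_seq, match=2, mismatch=-1, gap=-1):
    """Walk back from the best cell, preferring diagonal, then up, then left."""
    best = max(max(row) for row in H)
    i, j = next((i, j) for i, row in enumerate(H)
                for j, value in enumerate(row) if value == best)
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if row_seq[i - 1] == col_seq[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            top.append(col_seq[j - 1]); bottom.append(row_seq[i - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append("-"); bottom.append(row_seq[i - 1])
            i -= 1
        else:
            top.append(col_seq[j - 1]); bottom.append("-")
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), best


H = smith_waterman_matrix("TGCACGCT", "TACTAACGC")
for row in H:
    print(" ".join(str(v) for v in row))
print(traceback(H, "TGCACGCT", "TACTAACGC"))
```

The best score is 9, as in the matrix. Several tracebacks reach that score, so a strict diagonal-first walk can place the two gaps differently from the alignment drawn on the slide without changing the score.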
NCBI BLAST
Getting Started with BLAST handout:

ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_BLASTGuide.pdf
Five flavors of BLAST
Program   Query     DB type   Searches (frame combinations)
BLASTN    DNA       DNA       1
BLASTP    protein   protein   1
BLASTX    DNA       protein   6
TBLASTN   protein   DNA       6
TBLASTX   DNA       DNA       36
The BLAST Search Algorithm
query word (W=3)
Query: GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDL
neighborhood words, with neighborhood score threshold (T = 13):
PQG 17
PEG 14
PRG 13
PKG 13
PNG 12
PDG 12
PHG 12
PMG 12
PSG 12
PQN 11
PQA 10
etc ...
Query: 325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA 365

+LA++L+ TP G R++ +W+ P+ D + ER + A
Sbjct: 290 TLASVLDCTVTPKGSRMLKRWLHMPVRDTRVLLERQQTIGA 330
High-scoring Segment Pair (HSP)
(from NCBI Web site)

Raw Scores (S values) from an Alignment
S = (ΣMij) – cO – dG,
where
M = score from a similarity matrix
for a particular pair of amino acids (ij)
c = number of gaps
O = penalty for the existence of a gap
d = total length of gaps
G = per-residue penalty for extending
the gap
Kerfeld and Scott, PLoS Biology 2011

BLAST Parameters and Constants
S = raw score (scoring matrix derived)

S’ = bit score
E = expected number of HSPs with score >= S found by chance
λ = constant based on scoring matrix
K = constant based on gap penalty
n = effective length of database
m = effective length of query
BLAST Scoring System
Raw score (S): Sum of scores for each aligned position and scores for
gaps
S = Σ(match scores) - Σ(mismatch penalties) - Σ(gap penalties)
note: this score varies with the scoring matrix used and thus may not
be meaningfully compared for different searches
Bit score (S’): Version of the raw score that is normalized by the scale
of the scoring matrix (λ) and the scale of the gap penalty (K)
S’ = (λ S – ln(K)) / ln(2)
note: because it is normalized the bit score can be meaningfully
compared across searches
E value: Number of alignments with bit score S’ or better that one
would expect to find by chance in a search of a database of the
same size
$E = m\,n\,2^{-S'}$
m = effective length of query sequence
n = effective length of database
note: E values may change if databases of different sizes are
searched
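The bit score and E value formulas on this slide translate directly into code. In the sketch below the function names are illustrative, and the λ and K values in the example are placeholders chosen only to show the calculation; real values depend on the scoring matrix and gap penalties of a given search.

```python
import math


def bit_score(raw_score, lam, K):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(bits, m, n):
    """E = m * n * 2^(-S'), with m and n the effective query and database lengths."""
    return m * n * 2 ** (-bits)


# Placeholder parameters (assumptions, not taken from a real search).
lam, K = 0.267, 0.041
bits = bit_score(108, lam, K)
print(round(bits, 1))
print(e_value(bits, m=200, n=50_000_000))
```

Because E is proportional to m times n, the same bit score yields a tenfold larger E value against a database ten times larger, which is the point of the note above.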
E values or p values
$p = 1 - e^{-E}$
Very small E values are very similar to p values.

E values of about 1 to 10 are far easier to interpret
than corresponding p values.
E p
10 0.99995460
5 0.99326205
2 0.86466472
1 0.63212056
0.1 0.09516258 (about 0.1)
0.05 0.04877058 (about 0.05)
0.001 0.00099950 (about 0.001)
0.0001 0.0001000
Table 4.4
page 107
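The table follows directly from p = 1 - e^(-E) and can be regenerated in a few lines. The function name is illustrative; `math.expm1` is used only to keep precision when E is very small.

```python
import math


def p_from_e(e_value):
    """Probability of at least one chance hit, given the expected number E."""
    return -math.expm1(-e_value)   # equals 1 - exp(-E)


for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{e:<8} {p_from_e(e):.8f}")
```

For small E the two numbers are nearly equal, while for E of 1 or more the p value saturates toward 1 and stops being informative.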
BLAT -- BLAST-Like Alignment Tool
By Jim Kent, UCSC
http://genome.ucsc.edu/cgi-bin/hgBlat
• BLAT is designed to find sequences of >95% similarity of
length >40 bases. Perfect sequence matches of >33
bases are identified.
• Protein BLAT finds sequences of >80% similarity of length
>20 amino acids.
• DNA BLAT works by keeping an index of the entire
genome. The index consists of all non-overlapping 11-mers
except for those in repeats.
• Protein BLAT works in a similar manner, except with 4-mers
rather than 11-mers.
• The index is used to find areas of probable similarity.
Then the sequence for the area of interest is read into
memory for a detailed alignment.
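The indexing idea can be sketched as follows: index non-overlapping k-mers of the genome, then look up every overlapping k-mer of the query to find candidate regions. This is a toy illustration under stated assumptions, not BLAT's implementation: repeat masking, hit clustering and the detailed alignment stage are left out, and the function names and sequences are invented.

```python
from collections import defaultdict


def build_index(genome, k=11):
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):   # step of k: no overlap
        index[genome[pos:pos + k]].append(pos)
    return index


def seed_hits(query, index, k=11):
    """Return (query_pos, genome_pos) pairs for every query k-mer found in the index."""
    hits = []
    for q_pos in range(len(query) - k + 1):        # step of 1: all overlapping k-mers
        for g_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, g_pos))
    return hits


# Tiny example with k=4 so the sequences stay readable.
toy_index = build_index("ACGTTTGACCCA", k=4)       # k-mers: ACGT, TTGA, CCCA
print(seed_hits("GTTGAC", toy_index, k=4))
```

Indexing the genome without overlap keeps the index small, while scanning every query k-mer ensures that a sufficiently long exact match still covers at least one indexed k-mer.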
BLAT Indexing
BLAT output includes text formats and browser tracks
Scoring Matrix
• Modeled Change in Protein Sequences
– PAM (Accepted Point Mutations)
– Schwartz & Dayhoff (1978)
• Experimentally Derived Matrix
– BLOSUM (BLOCKS Substitution Matrix)
– Henikoff & Henikoff (1992)
Accepted Point Mutations (PAM)
or Percent Accepted Mutations
• Number of individual amino acid
changes occurring per 100 aa residues
as a result of evolution. PAM1 = unit of
evolutionary divergence in which 1% of
the amino acids have been changed.
• PAM of 250, or PAM250, represents
$[\mathrm{PAM1}]^{250}$: the PAM1 matrix
raised to the 250th power.
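Raising a mutation probability matrix to a power can be shown with a toy example. The two-letter alphabet and its 1% change rate below are invented for illustration; the real PAM1 is a 20 x 20 matrix derived from observed substitutions.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM1" for a two-letter alphabet: each letter changes with
# probability 0.01 per step and stays the same with probability 0.99.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

Each row still sums to 1, and the diagonal does not fall to zero after 250 steps, because repeated and reverse changes at the same site are folded into the product. This is why 250 PAMs of divergence still leaves a measurable fraction of identical residues.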
Creating the PAM1
Schwartz & Dayhoff (1978)
• Studied 34 protein super-families and grouped them
into 71 phylogenetic trees. There were 1,572
changes observed. All sequences were at least 85%
identical. Alignments were scanned with a 100
amino acid window.
• These are observed mutations thus the term
accepted point mutations, accepted by natural
selection and thus the dominant allele in the
species.
• Normalized probability of change:
$P_{ij} = (C_{ij} / T) \times (1 / F_i)$
$C_{ij}$ = number of changes from amino acid $i$ to amino acid $j$
$F_i$ = frequency of amino acid $i$ in that group of sequences
$T$ = total number of all amino acid changes in 100 sites
PAM250 log odds scoring matrix
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2  -2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Protein family                      PAMs per 100 residues per 10^8 years
Immunoglobulin (Ig) kappa chain     37
Kappa casein                        33
Luteinizing hormone β               30
Lactalbumin                         27
Complement component 3              27
Collagen                            1.7
Troponin C, skeletal muscle         1.5
Alpha crystallin B chain            1.5
Glucagon                            1.2
Glutamate dehydrogenase             0.9
Histone H2B, member Q               0.9
Ubiquitin                           0
From Dayhoff (1978)
[Figure: percent identity versus differences per 100 residues, with the “twilight zone” marked. Labeled points: PAM1 = 99% identity, PAM10.7 = 90% identity, PAM80 = 50% identity, PAM250 = 20% identity.]
Deriving Substitution Scores BLOSUM
Henikoff & Henikoff, 1992
Protein Family
Block A Block B
BLOSUM 62
scoring matrix
Position Specific Scoring Matrix (PSSM) for LDL (LPB000033) from
BLOCKS
Position
of Match
A B C D E F G H I K L M N P Q R S T V W X Y Z * -
1. -27 -28 -30 -30 -4 -30 -33 -24 6 19 -29 -1 -26 -36 1 25 -8 7 -25 31 -14 -27 -1 0 0
2. 7 -65 28 -64 6 -53 -67 -64 37 -64 -45 -45 -67 -69 -63 -66 -60 -56 -36 -66 -42 -60 -33 0 0
3. 6 6 -31 11 26 -40 7 -28 -31 -3 -38 -34 -1 -37 -23 -30 2 -28 4 -42 -13 -40 -1 0 0
4. 13 -5 -26 11 -26 -35 -30 -27 -22 13 -30 -27 -27 -36 -25 21 0 -25 16 -39 -13 -35 -25 0 0
5. 24 7 -29 10 -34 -38 5 -36 -34 -37 7 -32 3 -41 -36 -39 12 -30 -32 -45 -17 -42 -36 0 0
6. -24 -5 -28 -3 -15 -34 16 -32 -30 -11 8 -27 -6 2 -32 -5 20 8 -30 -41 -10 -38 -25 0 0
7. -58 -23 -52 8 -62 -44 -67 -61 -38 -63 31 27 -63 -69 -60 -62 -64 -57 -43 -60 -44 -57 -61 0 0
8. -13 23 -33 24 -21 -34 5 28 -41 -5 -41 -35 22 -39 4 -27 -28 -30 -39 31 -18 -25 -7 0 0
9. -33 0 -42 1 -41 -51 33 -40 -53 -37 -53 -47 -2 -4 -42 10 7 -39 -50 -50 -26 -51 -42 0 0
10. -4 -15 -18 -25 -24 -23 6 -24 -4 14 9 -15 -3 -31 -22 -19 10 9 8 -32 -7 -28 -23 0 0
11. 5 11 -23 23 8 8 -7 12 -26 9 -12 -23 -5 -29 -4 6 -9 -21 -25 -29 -7 -19 1 0 0
12. -37 -42 -41 -44 -39 -42 -45 -34 -3 -2 8 -33 -39 -4 0 33 9 -38 -20 -48 -19 -44 -17 0 0
13. -6 -5 -18 -18 -17 -18 -22 6 -18 14 -14 -4 12 -27 9 8 4 11 0 -23 -5 11 -3 0 0
14. -14 -37 -26 -39 -36 7 -37 -36 -17 -36 7 -23 -35 8 -6 -38 15 8 22 -37 -14 -31 -19 0 0
15. -47 -57 -40 -59 -55 -35 -58 -52 24 8 27 -26 -56 -58 -49 -46 -54 -46 -28 -51 -34 -46 -52 0 0
16. 0 -18 -19 -33 -29 17 -34 -25 13 -29 5 -14 1 -37 12 -30 -29 -24 21 -25 -12 4 -6 0 0
17. -22 -5 -26 -24 -10 -32 8 -1 -34 -25 -21 -30 19 -36 13 -3 18 16 -31 -36 -12 1 3 0 0
18. -8 -7 -21 4 12 15 -27 -21 -21 0 1 -20 -21 -32 -1 -23 16 4 -7 -25 -6 8 5 0 0
19. -33 25 -37 32 -5 -44 -5 -29 -39 -29 -44 -39 17 -43 -1 10 4 -31 -3 -48 -18 -42 -3 0 0
20. -43 -54 -36 -56 -52 -32 -54 -46 19 -42 27 -22 -52 -55 -46 12 -50 -42 5 -46 -29 -42 -49 0 0
21. -19 -21 -22 -20 16 -29 -7 -21 -8 0 -11 -21 -22 -32 10 6 11 8 13 -34 -7 -30 13 0 0
22. -27 -5 -30 -21 12 -24 -30 23 -32 -21 -16 -3 15 -37 13 6 -3 -26 -31 45 -14 -17 12 0 0
23. -81 -91 -91 -89 -87 -96 -91 -88 -94 -86 -95 -90 -93 45 -88 -89 -86 -87 -91 -99 -83 -97 -88 0 0
24. 15 -4 -35 -30 -7 -47 -35 -30 -46 -28 -47 -41 31 -45 -33 23 5 -32 -41 -49 -23 -44 -21 0 0
25. 10 -34 -34 -36 -38 -45 32 -38 -46 9 -46 -41 -32 -43 -36 -33 8 -33 -41 -45 -25 -46 -37 0 0
26. 5 -58 -41 -59 -52 -42 -58 -51 34 -52 6 -33 -56 -59 16 -53 -52 -46 -3 -55 -31 -50 -14 0 0
27. 24 -40 -28 -42 -39 -39 -35 -40 -30 -39 4 -31 -38 11 -39 -40 -8 21 2 -46 -19 -42 -39 0 0
28. -32 -45 -28 -47 -45 -24 6 -40 7 -43 21 -17 -43 -47 -41 -42 -39 -11 19 33 -21 -30 -43 0 0
29. -55 4 -59 39 -39 -63 -53 -50 -58 10 -63 -59 -40 -61 -49 11 -52 -53 -3 -69 -39 -62 -44 0 0
30. -24 -33 -24 -34 -33 19 -33 7 -11 -32 12 1 -31 23 -30 -32 8 -25 -8 -23 -13 17 -32 0 0
31. -20 -3 -18 10 7 -12 -28 1 3 -18 11 18 -20 -9 -8 5 -13 -19 7 -21 -6 12 -2 0 0
Logos provide a simple visualization of a
PSSM
Crooks GE, Hon G, Chandonia JM, Brenner SE. 2004. WebLogo: A sequence logo generator.
Genome Res., 14:1188-1190 [weblogo.berkeley.edu]
Schneider TD, Stephens RM. 1990. Sequence Logos: A New Way to Display Consensus
Sequences. Nucleic Acids Res. 18:6097-6100
Considerations when making a profile.
• How are missing sequences represented?

• Many sequences are needed to create a
useful alignment, but not too many that
are closely related.
• Where are the gaps located?
PSI-BLAST
Iterative Protein-Protein BLAST
1. BLASTP (first iteration)
2. Analyze output and create PSSM
3. PSSM used to search database
(Repeat steps 2 and 3 until no change or iteration limit)
PSI-BLAST is performed in five steps
[1] Select a query and search it against a protein
database
[2] PSI-BLAST constructs a multiple sequence alignment,
then creates a “profile” or specialized position-specific
scoring matrix (PSSM)
[3] The PSSM is used as a query against the database
[4] PSI-BLAST estimates statistical significance (E values)
[5] Repeat steps [3] and [4] iteratively, typically 5 times.
At each new search, a new profile is used as the query.
Results of a PSI-BLAST search
Iteration   # hits   # hits > threshold
1           104      49
2           173      96
3           236      178
4           301      240
5           344      283
6           342      298
7           378      310
8           382      320
The universe of lipocalins (each dot is a protein)
Figure labels: retinol-binding protein, apolipoprotein D, odorant-binding protein
Scoring matrices focus on the big (or small) picture
Figure labels: retinol-binding protein, your RBP query
Scoring matrices focus on the big (or small) picture
Figure labels: PAM250, PAM30, Blosum80, Blosum45, retinol-binding protein
PSI-BLAST generates scoring matrices
more powerful than PAM or BLOSUM
Figure label: retinol-binding protein
PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 1
Score = 46.2 bits (108), Expect = 2e-04

Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)
Query: 27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86
V+ENFD ++ G WY + +K P + I A +S+ E G + K ++
Sbjct: 33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82
Query: 87 ADMVGTF---------TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137

D GT ++ +PAK +++++ + +WI+ TDY+ YA+ YSC
Sbjct: 83 PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135
Query: 138 ----LLNLDGTCADSYSFVFSRDPNGLPPE 163

L ++D + ++ R+P LPPE
Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158
Score = 140 bits (353), Expect = 1e-32

Query: 4 VWALLLLAAWAAAERDCRVSSF--------RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55
V L+ LA A + +F V+ENFD ++ G WY + +K P +
Sbjct: 2 VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60
Query: 56 NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112

I A +S+ E G + K + D + V ++ +PAK +++++ +
Sbjct: 61 CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL--- 112
Query: 113 LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164

+WI+ TDY+ YA+ YSC L ++D + ++ R+P LPPE
Sbjct: 113 --MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159
Score = 159 bits (404), Expect = 1e-38

Query: 3 WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54
V L+ LA A + S V+ENFD ++ G WY + K
Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59
Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114

+ I A +S+ E G + K V + ++ +PAK +++++ +
Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112
Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164

+WI+ TDY+ YA+ YSC + ++ R+P LPPE
Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159
PSI-BLAST: the problem of corruption
PSI-BLAST is useful to detect weak but biologically
meaningful relationships between proteins.
The main source of false positives is the spurious
amplification of sequences not related to the query.
For instance, a query with a coiled-coil motif may
detect thousands of other proteins with this motif
that are not homologous.
Once even a single spurious protein is included
in a PSI-BLAST search above threshold, it will not
go away.
Progressive alignment
• Advantages
– Biologically reasonable search strategy
– Relatively fast & efficient
• Disadvantages
– Quality deteriorates when sequences are
distantly related
– Strongly dependent upon initial alignments
since early errors are “locked in”
Slide from JD Wren

ClustalW
http://www.ebi.ac.uk/clustalw/
• Most popular multiple alignment tool

• ‘W’ stands for ‘weighted’ (different parts of
alignment are weighted differently).
• Three-step process
1) Construct pairwise alignments
2) Build Guide Tree
3) Progressive alignment built using the tree

Step 1: Pairwise Alignment
• Aligns each sequence against each other, giving
a similarity matrix
• Similarity = exact matches / sequence length
(percent identity)
v1 v2 v3 v4
v1 -
v2 .17 -
v3 .87 .28 - (.17 means 17 % identical)
v4 .59 .33 .62 -

Step 2: Guide Tree (cont’d)
v1 v2 v3 v4
v1 -
v2 .17 -
v3 .87 .28 -
v4 .59 .33 .62 -
Guide tree leaf order: v1, v3, v4, v2
Calculate:
v1,3 = alignment (v1, v3)
v1,3,4 = alignment((v1,3),v4)
v1,2,3,4 = alignment((v1,3,4),v2)
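The joining order above can be derived from the similarity matrix by repeatedly merging the two most similar clusters. The sketch below uses plain average linkage as a simplified stand-in, not ClustalW's own tree-building procedure; the function names are illustrative, and the similarity values are the ones from the slide.

```python
from itertools import combinations

# Pairwise similarities (fraction identical) from the slide.
SIM = {
    frozenset(("v1", "v2")): 0.17,
    frozenset(("v1", "v3")): 0.87,
    frozenset(("v2", "v3")): 0.28,
    frozenset(("v1", "v4")): 0.59,
    frozenset(("v2", "v4")): 0.33,
    frozenset(("v3", "v4")): 0.62,
}


def average_similarity(leaves_a, leaves_b, sim):
    """Mean similarity over all leaf pairs drawn from two clusters."""
    pairs = [(a, b) for a in leaves_a for b in leaves_b]
    return sum(sim[frozenset(p)] for p in pairs) / len(pairs)


def guide_tree(names, sim):
    """Greedily join the most similar clusters; return a nested-tuple tree."""
    clusters = [(name, [name]) for name in names]   # (subtree, leaves)
    while len(clusters) > 1:
        x, y = max(combinations(clusters, 2),
                   key=lambda pair: average_similarity(pair[0][1], pair[1][1], sim))
        clusters = [c for c in clusters if c is not x and c is not y]
        clusters.append(((x[0], y[0]), x[1] + y[1]))
    return clusters[0][0]


print(guide_tree(["v1", "v2", "v3", "v4"], SIM))
```

The nesting reproduces the order used in the calculation: v1 and v3 are joined first (.87), v4 is added next (average .605 to that pair), and v2 is added last.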
Slide from JD Wren
Step 3: Progressive Alignment
• Start by aligning the two most similar sequences
• Following the guide tree, add in the next sequences,
aligning to the existing alignment
• Insert gaps as necessary
FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
. . : ** . :.. *:.* * . * **:
Dots and stars show degree of conservation in a column.
Slide from JD Wren
MUSCLE Algorithm
Edgar, R. C. Nucl. Acids Res. 2004 32:1792-1797

Reading
• Rosenbloom, KR, et al. (2015) The UCSC Genome Browser
database: 2015 update
• Korf, Yandell & Bedell (2003) BLAST: An Essential Guide to
the Basic Local Alignment Search Tool. O’Reilly
• Jonathan Pevsner (2003) Bioinformatics and Functional
Genomics. Wiley-Liss
• Mount (2004) Bioinformatics: Sequence and Genome
Analysis. Cold Spring Harbor Laboratory Press
• Baxevanis & Ouellette (2001) Bioinformatics: A Practical
Guide to the Analysis of Genes and Proteins. Wiley
Interscience
• Jones & Pevzner (2004) An Introduction to Bioinformatic
Algorithms. MIT Press
• Salzberg, Searls & Kasif (1998) Computational Methods in
Molecular Biology. Elsevier
• Waterman (1995) Introduction to Computational Biology.
Chapman & Hall
