Biological Sequence Determination: Protein

Biological Sequence Determination
DNA, RNA, protein
Robert M. Horton, PhD, MS

rmhorton@cybertory.org
artwork: commons.wikimedia.org
Sequencing
context: biological, technological
protein, RNA, DNA
concepts: chemistry, enzymes, physics, computers, microfluidics, microfabrication

old methods
● classical sequencing (Sanger)
  ○ automation, base calling, quality scoring
  ○ shotgun sequencing, assembly, finishing
contemporary
● "next generation"
  ○ methods: pyrosequencing, CRT, SOLiD
  ○ applications: resequencing, epigenetics, RNA-Seq
contemplation of the future
● "third generation"
  ○ SMRT, nanopores, etc.
Protein Sequencing
Why Proteins?
Small
Digestible (pepsin, trypsin, chymotrypsin)
Chemically distinguishable (purifiable)
Important
Insulin
Fred Sanger
Nobel prize, 1958
Classes of RNA
mRNA
● modified bases (cap with m7G, 2'-O-methylation)
● splicing
● polyadenylation
tRNA
● modified bases (GMe, GMe2, CMe, T, ψ, UH2, I, IMe)
rRNA
● prokaryotic: 70S = 50S (5S, 23S) + 30S (16S)
● eukaryotic: 80S = 60S (5S, 5.8S, 28S) + 40S (18S)
7SL
● RNA of Signal Recognition Particle (SRP)
● homologous to Alu SINE (11% of human genome)
snRNA
● spliceosomes (U1, U2, U4, U5, U6)
snoRNA
● pre-rRNA processing (U3)
● guide 2'-O-methylation
● guide pseudouridylation
RNAi
● siRNA (short interfering RNA)
● miRNA (microRNA)
  ○ post-transcriptional gene silencing
  ○ 3' UTR, conserved
● piRNA
  ○ transcriptional silencing of retrotransposons
... et cetera ...
DNA Sequencing
1977
The “modern era” of DNA sequencing begins
Chemical Sequencing of DNA
(Maxam-Gilbert)
February 1977
Two steps:
1. Damage bases (specific, partial)
2. Cleave backbone
Four reactions: A, A+G, C, C+T
Related approaches are still used in specialized applications (e.g., DNA footprinting)
"The major time spent in DNA sequencing is spent in the preparation of the DNA fragments and on the elements of strategy."
-- W. Gilbert, 1980
http://nobelprize.org/nobel_prizes/chemistry/laureates/1980/gilbert-lecture.pdf
Chain Termination Sequencing
(Sanger Sequencing)
2',3'-dideoxy TTP
Sanger F, Nicklen S & Coulson AR. DNA sequencing with chain-terminating inhibitors. PNAS 74:5463-7, December 1977
Primer Extension
Bacterial DNA polymerase I adds nucleotides to the 3' end of the primer to complement the 5'-overhanging template (pyrophosphate released).
Each strand is an ordered sequence with a direction.
Arrows indicate 5' to 3' direction (DNA grows biochemically in this direction).
Sanger sequencing
Individual reactions with one dNTP partially “poisoned” with
dideoxynucleotides (ddATP, ddCTP, ddGTP, ddTTP)
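The four-reaction scheme can be illustrated with a toy simulation. This is a minimal sketch, not a model of the chemistry: it assumes every possible terminated product is present in each reaction, and the function names (`termination_lengths`, `read_ladder`) are hypothetical.

```python
def termination_lengths(new_strand):
    """For each ddNTP reaction, list the lengths of the terminated products.

    new_strand is the sequence synthesized from the primer, 5' to 3'.
    A product of length n ends where a dideoxy base was incorporated
    in place of base n.
    """
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for position, base in enumerate(new_strand, start=1):
        lanes[base].append(position)
    return lanes


def read_ladder(lanes):
    """Read the sequence from the shortest fragment to the longest, as on a gel."""
    by_length = {length: base
                 for base, lengths in lanes.items()
                 for length in lengths}
    return "".join(by_length[n] for n in sorted(by_length))


lanes = termination_lengths("GATTACA")
print(lanes)               # fragment lengths seen in each of the four reactions
print(read_ladder(lanes))  # the synthesized strand, recovered from the ladder
```

Each fragment length appears in exactly one reaction, which is why sorting the fragments by size gives the sequence.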
Decades of improvements
● automated
● fluorescence
● four colors
● one lane
● dye terminators
● one reaction
● capillaries
Automated Sanger sequencing
trace, base calls, quality scores

Quality Score
q = -10 * log10(p)
p = predicted error probability
1/1000 probability of error = q score of 30
uses: data quality monitoring, assembly consensus, finishing criteria
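The quality score formula converts in both directions. A minimal sketch, with hypothetical function names:

```python
import math


def error_probability_to_q(p):
    """q = -10 * log10(p), where p is the predicted error probability."""
    return -10 * math.log10(p)


def q_to_error_probability(q):
    """Inverse of the formula above: p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


for p in (0.1, 0.01, 0.001, 0.0001):
    print(p, round(error_probability_to_q(p)))
```

Every 10 points of q corresponds to a tenfold drop in the predicted error probability.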
Sequencing Strategy
"The major time spent in DNA sequencing is
spent in the preparation of the DNA
fragments and on the elements of strategy."
-- W. Gilbert, 1980
Primer walking (serial)
Shotgun Sequencing (parallel)
Universal Primers
Assembly
read length affects assembly
Next-Generation Sequencing
● Pyrosequencing (454/Roche)
● Cycles of Reversible Termination (Solexa/Illumina)
● Ligation (ABI SOLiD)
"Third-Generation" Sequencing
● SMRT (Pacific Biosciences)
pyrosequencing
pyrophosphate (released by dNTP incorporation) + APS (adenosine 5'-phosphosulfate)
  -> [ATP sulfurylase] -> ATP + sulfate
pyrosequencing
ATP + luciferin + O2 (oxygen)
  -> [firefly luciferase] -> AMP + pyrophosphate + light + oxyluciferin

pyrosequencing
more biochemistry
problem: pyrophosphate recycling
solution: apyrase breaks down ATP to AMP + 2 Pi (or wash out solution)
problem: luciferase can use dATP
solution: use an analog suitable for polymerase but not luciferase (dATPαS)
pyrosequencing
flowgram
Ronaghi M. Genome Res 11:3-11, 2001
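An idealized flowgram can be sketched in a few lines. This is an illustration under stated assumptions: signal is taken to be exactly proportional to the number of bases incorporated, the cyclic flow order "TACG" is an assumed default, and the function name `flowgram` is hypothetical.

```python
def flowgram(synthesized, flow_order="TACG", max_flows=100):
    """Return (nucleotide, signal) pairs for an idealized pyrosequencing run.

    synthesized is the strand being built, 5' to 3'. Each flow offers one
    nucleotide; the signal is the number of bases incorporated, so a
    homopolymer run gives a proportionally taller peak and a nucleotide
    that does not match gives zero.
    """
    flows = []
    position = 0
    for i in range(max_flows):
        if position >= len(synthesized):
            break
        nucleotide = flow_order[i % len(flow_order)]
        run = 0
        while position < len(synthesized) and synthesized[position] == nucleotide:
            run += 1
            position += 1
        flows.append((nucleotide, run))
    return flows


print(flowgram("TTAGGGC"))
```

In real data the peak heights are noisy, so long homopolymer runs are the characteristic error source of this method.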
Emulsion PCR
water droplet in oil
one primer bound to solid bead
individual template molecule
Emulsion PCR
DNA anchored to bead all comes from the same template molecule
"polony" = "PCR colony"

pyrosequencing
Alternatives to chemiluminescence
● heat ("thermosequencing")
● pH change ("Ion Torrent")
Cycles of Reversible Termination
Illumina/Solexa
Helicos
Helicos
Illumina
Metzker M. Sequencing Technologies - The Next Generation. Nature Reviews Genetics 11:31-46, 2010.
Short Read Alignment
FASTQ Format
maq.sourceforge.net/fastq.shtml
$q = chr(($Q<=93? $Q : 93) + 33);
$Q = ord($q) - 33;
Quality characters for Q = 0 through 60:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]
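The two Perl one-liners translate directly. A minimal Python sketch of the same offset-33 encoding, with hypothetical function names:

```python
def quality_to_char(q):
    """Encode a quality score as one character, capping the score at 93."""
    return chr(min(q, 93) + 33)


def char_to_quality(c):
    """Decode one quality character back to its score."""
    return ord(c) - 33


def decode_quality_string(s):
    """Decode the quality line of a FASTQ record into a list of scores."""
    return [char_to_quality(c) for c in s]


print(decode_quality_string("!5I]"))  # [0, 20, 40, 60]
print(quality_to_char(30))            # ?
```

The cap at 93 keeps every encoded character within printable ASCII (33 through 126).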
Paired End Tags
Mme I
TCCRAC
(20/18)
Illumina Genome Analyzer
Library Preparation
Bridge Amplification forms "Polonies"

Cycles of Reversible Termination

Ligation-based Sequencing
SOLiD (ABI)
Complete Genomics
Polonator (Church Lab)
SOLiD
Sequencing by Oligonucleotide Ligation and Detection
3'- ATNNN~ZZZ*-5'
artwork is from the pamphlet Dibase Sequencing and Color Space Analysis
SOLiD
SOLiD: Dibase Encoding
AT AC AA AG
CG CA CC CT
GC GT GG GA
TA TG TT TC
base space color space
Each color sequence can represent four different base sequences.
The base sequence is one unit longer than the color sequence.
You need to know one base to tell which sequence is represented.

A SNP causes two color changes. A single color change is probably an error.
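The dibase table can be reproduced with a small calculation. This sketch assumes the conventional numbering of the four colors as 0 to 3 and a 2-bit code for the bases (A=0, C=1, G=2, T=3), so that the color of a dinucleotide is the XOR of its two base codes; the slide itself shows only the grouping, and the function names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit base codes
BASE = "ACGT"


def encode_colors(bases):
    """One color per adjacent pair of bases."""
    return [CODE[a] ^ CODE[b] for a, b in zip(bases, bases[1:])]


def decode_colors(first_base, colors):
    """Rebuild base space from one known base plus the color sequence."""
    bases = [first_base]
    for color in colors:
        bases.append(BASE[CODE[bases[-1]] ^ color])
    return "".join(bases)


reference = "ATGGCA"
variant = "ATGACA"  # one substituted base
print(encode_colors(reference))
print(encode_colors(variant))
for first in "ACGT":
    print(first, decode_colors(first, encode_colors(reference)))
```

The loop prints the four base sequences that share one color sequence, and comparing the two color sequences shows the single substitution changing two adjacent colors.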
Single-Molecule, Real-Time (SMRT) Sequencing
● High throughput
  ○ Parallelism (small reactions)
  ○ Speed (immediate results)
● Long reads
● Read individual templates from mixtures
  ○ Haplotyping
SMRT Sequencing
Simulated SMRT Sequencing Data
Platform Comparisons
Xu M, Fujita D, and Hanagata N. Perspectives and Challenges of Emerging Single-Molecule DNA Sequencing Technologies. Small 5(23):2638–2649, 2009
Other Technologies
● Mass spectrometry
● TEM
● STM
● nanonozzle probes
● nanopores (protein, graphene)
  ○ ionic current blockage
  ○ transverse tunneling currents
  ○ exonuclease
Targeted Exome Capture
nimblegen.com
Bonus Slides
Selenocysteine
tRNA
Omics
● transcriptome
● exome
● kinome
“Plus and Minus” Method (circa 1975)
"minus": polymerase stops at missing base
"plus": T4 DNA polymerase 3' exonuclease stalled by dNTP
Sanger F, Coulson AR. J Mol Biol. 94(3):441-8, 1975
pyrosequencing
Animation: http://www.pyrosequencing.com/DynPage.aspx?id=7454
Bioinformatics Classics
Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443-453, 1970.
Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol 147:195-197, 1981.
Automated Base Calling
1. identify idealized peak locations
  ○ assume locally even spacing
2. find observed peaks
3. match observed to expected (dynamic programming)
  ○ omit and split as necessary
4. add "good" unmatched peaks

Error Probabilities
predictive
  does not require knowing actual sequence
valid
  the set of bases assigned to probability p should have an actual error rate of p
discriminating
  helps to distinguish correct vs. incorrect base calls

1,000,000 base calls with 10,000 errors (p = 0.01)
better if we can break it into two 500,000 sets:
  ○ p=0.018 in one set (9000 errors)
  ○ p=0.002 in second set (1000 errors)
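The arithmetic behind the discrimination example can be checked directly. A minimal sketch with hypothetical function names; the quality scores use q = -10 * log10(p).

```python
import math


def expected_errors(n_bases, p):
    """Expected number of wrong calls among n_bases calls with error rate p."""
    return n_bases * p


def quality(p):
    """Quality score for error probability p."""
    return -10 * math.log10(p)


pooled = expected_errors(1_000_000, 0.01)
split = [expected_errors(500_000, 0.018), expected_errors(500_000, 0.002)]

print(pooled, sum(split))            # same total number of errors either way
print(round(quality(0.01), 1))       # one score for the pooled set
print(round(quality(0.018), 1), round(quality(0.002), 1))  # two distinct scores
```

The total error count is unchanged, but the split assigns half the bases a noticeably higher quality score, which is what makes the scores useful for choosing between conflicting base calls.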
Error Probability Calibration
'Given a set of parameters and a training set of reads for which it
is known which base-calls are correct and which are errors, find a
way of associating parameter values to error probabilities that
has (near) maximum discrimination power for small r.'
Phred Quality Score Parameters
Empirical.
Small values tend to correspond to more accurate base-calls.
Window-based parameters smooth out error probabilities.
1. Peak spacing (7 peak window)
● largest / smallest peak-to-peak spacing
2. Uncalled/called ratio (7 peak window)
● amplitude of largest uncalled / smallest called peak
3. Uncalled/called ratio (3 peak window)
4. Peak resolution
● -1 * # bases to the next unresolved base
Lookup Table Production
● Select a range of 50 threshold values for each of the 4 parameters.
● These 50 values are chosen so that each increment contains approximately the same number of bases in the training set.
● For each 4-tuple of parameter thresholds (50^4 = 6,250,000):
  ○ find the set of bases defined by these thresholds
  ○ compute empirical error rates
● The parameter set with the lowest error rate goes into the table.
  ○ if multiple 4-tuples give the same rate, choose the largest set
● These bases are removed, and the process is repeated until all bases are represented in the table.
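The greedy procedure above can be sketched at toy scale. This is an illustration only, not Phred's implementation: it uses two parameters with three thresholds each instead of four with fifty, invented training data, and hypothetical function names. It also assumes that a base is scored by the first table row whose thresholds cover all of its parameter values.

```python
from itertools import product


def build_lookup_table(training, thresholds):
    """Greedy table construction.

    training   -- list of (parameter_tuple, is_error) pairs
    thresholds -- one list of candidate cutoffs per parameter
    Returns a list of (cutoff_tuple, error_rate) rows, best rows first.
    """
    remaining = list(training)
    table = []
    while remaining:
        best = None
        for cutoffs in product(*thresholds):
            selected = [b for b in remaining
                        if all(v <= c for v, c in zip(b[0], cutoffs))]
            if not selected:
                continue
            rate = sum(is_error for _, is_error in selected) / len(selected)
            key = (rate, -len(selected))  # lowest rate, then largest set
            if best is None or key < best[0]:
                best = (key, cutoffs, rate)
        if best is None:
            break  # no cutoff tuple covers the remaining bases
        _, cutoffs, rate = best
        table.append((cutoffs, rate))
        remaining = [b for b in remaining
                     if not all(v <= c for v, c in zip(b[0], cutoffs))]
    return table


def lookup(table, params):
    """Return the error rate of the first row whose cutoffs cover params."""
    for cutoffs, rate in table:
        if all(v <= c for v, c in zip(params, cutoffs)):
            return rate
    return None


# Invented training set: small parameter values go with fewer errors.
training = ([((1, 1), False)] * 9 + [((1, 1), True)]
            + [((2, 2), False)] * 3 + [((2, 2), True)]
            + [((3, 3), True)] * 2)
table = build_lookup_table(training, [[1, 2, 3], [1, 2, 3]])
print(table)
print(lookup(table, (2, 1)))
```

Because the cleanest bases are removed first, later rows describe progressively worse bases, and the table reads top to bottom from high quality to low.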
Post-translational Modification
(or co-translational)
● glycosylation (glycoproteins)
  ○ mucin, cellular interaction, structural
  ○ N-linked: asparagine
  ○ O-linked: serine, threonine, hydroxylysine, hydroxyproline
● iodination
  ○ thyroid hormone
● hydroxylation
  ○ hydroxylysine in collagen
● covalently bound enzyme cofactors
  ○ FAD, biotin, etc
● ubiquitination
● acylation (at O, N, or S)
  ○ acetylation (acetate, CH3CO2−)
  ○ myristoylation (myristate, a C14 fatty acid)
  ○ palmitoylation (palmitate, a C16 fatty acid)
● alkylation
  ○ methylation
  ○ isoprenylation
● phosphorylation
  ○ signal transduction
● ADP-ribosylation
  ○ signal transduction
  ○ cholera toxin
... and many more
“Wandering Spot” Method
ca. 1970s
RNA or DNA
partial digestion
2D separation
Horizontal = base composition
Vertical = size
This is an RNase T1 fragment, so it ends in G
Fuke, M., and Busch, H. Nucleic Acids Res. 4:339-352, 1977.

Enzymatic vs Chemical Partial Cleavage of RNA
Sequence-specific RNases
Phy M: A+U
A: pyrimidine-specific (C+U)
U2: A or A+G
T1: degrades after G residues
V1: degrades paired bases
Peattie DA. PNAS 76:1760-1764, 1979.
enzymatic chemical
Modified Nucleotides in tRNA
(post-transcriptional)
● methyl guanine (GMe)
● dimethylguanine (GMe2)
● methylcytosine (CMe)
● ribothymine (T)
● pseudouridine (ψ)
● dihydrouridine (UH2)
● inosine (I)
● methylinosine (IMe)
Nucleotide Ambiguity Codes (IUPAC)
Unambiguous
  A, C, G, T, U
2-fold degenerate
  M = A or C
  R = A or G (puRine)
  W = A or T (Weak)
  S = C or G (Strong)
  Y = C or T (pYrimidine)
  K = G or T
3-fold degenerate
  V = A, C or G (not T)
  H = A, C or T (not G)
  D = A, G or T (not C)
  B = C, G or T (not A)
4-fold degenerate
  X = A, C, G or T
  N = A, C, G or T
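The ambiguity codes map naturally onto a lookup table. A minimal sketch with hypothetical function names; treating U as equivalent to T is an assumption made here so that one table serves DNA comparisons.

```python
from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",  # U treated as T (assumption)
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def matches(pattern, sequence):
    """True if each base of sequence is allowed by the code at that position."""
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC[code] for code, base in zip(pattern, sequence))


def expand(pattern):
    """List every unambiguous sequence a degenerate pattern can stand for."""
    return ["".join(bases) for bases in product(*(IUPAC[c] for c in pattern))]


print(expand("TCCRAC"))             # the Mme I recognition site
print(matches("TCCRAC", "TCCGAC"))  # True
print(matches("TCCRAC", "TCCTAC"))  # False
```

The Mme I site TCCRAC from the paired-end tag slide expands to exactly two sequences because R is 2-fold degenerate.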
Automated Base Calling
Phred
third-party base caller with better accuracy than ABI's
open source(ish)
Ewing B, Hillier L, Wendl MC, Green P. Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment. Genome Res. 8:175-185, 1998
Ewing B and Green P. Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Res. 8:186-194, 1998
Shotgun Sequencing
Staden R. A strategy of DNA sequencing employing computer programs. Nucleic Acids Research 7: 2601-2610, 1979
“With modern fast sequencing techniques and suitable computer programs it is now possible to sequence whole genomes without the need of restriction maps. This paper describes computer programs that can be used to order both sequence gel readings and clones. A method of coding for uncertainties in gel readings is described. These programs are available on request.”
“The whole of the DNA to be sequenced is shotgunned into a suitable vector and
cloned. Ideally the cloned fragments would be of at least 200 bases in length. The
clones are then sequenced and the computer used to collate the data. Collation
involves searching for overlaps in the data.”
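Searching for overlaps can be sketched as a suffix-to-prefix comparison. This is a deliberately simplified illustration, not Staden's programs: it assumes error-free reads on the same strand, already in left-to-right order, and the function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def merge(left, right, min_overlap=3):
    """Join two reads through their overlap, or return None if there is none."""
    n = overlap_length(left, right, min_overlap)
    return left + right[n:] if n else None


reads = ["ATGGCGTAC", "GTACCTTAG", "TTAGGCA"]
contig = reads[0]
for read in reads[1:]:
    contig = merge(contig, read)
print(contig)
```

Real assemblers must also handle sequencing errors, reverse complements, and repeats longer than a read, which is why read length affects assembly.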
2D gel electrophoresis
cybertory.org/exercises/primerDesign
Protein Sequencing
Edman Degradation
phenylisothiocyanate
invented ca. 1950s
automated ca. 1973
proceeds from N-terminus
read 50-70 aa
http://en.wikipedia.org/wiki/Edman_degradation
Mass Spectrometry
Precise determination of molecular weights of peptides
A few amino acids can ID a spot on 2D gel
(Sec)
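The molecular weight of a peptide follows from its sequence. A minimal sketch using standard monoisotopic residue masses (in Da, rounded to five decimals); it ignores modifications and charge state, and the function name `peptide_mass` is hypothetical. U stands for selenocysteine (Sec).

```python
# Monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
    "U": 150.95364,  # selenocysteine (Sec)
}
WATER = 18.01056  # one water is added back for the free N- and C-termini


def peptide_mass(sequence):
    """Monoisotopic mass of an unmodified, uncharged linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


print(round(peptide_mass("GA"), 4))
print(peptide_mass("L") == peptide_mass("I"))  # True
```

Leucine and isoleucine have identical masses, so mass alone cannot tell them apart.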
modified from Wikimedia commons
