Biological Sequence Determination: Protein


Biological Sequence Determination

DNA RNA protein

Robert M. Horton, PhD, MS


rmhorton@cybertory.org

artwork: commons.wikimedia.org
Sequencing: protein, RNA, DNA
Context: technological, biological
Concepts: chemistry, enzymes, physics, computers, microfluidics, microfabrication

Old methods
• classical sequencing (Sanger)
   automation, base calling, quality scoring
   shotgun sequencing, assembly, finishing

Contemporary
• "next generation"
   methods: pyrosequencing, CRT, SOLiD
   applications: resequencing, epigenetics, RNA-Seq

Contemplation of the future
• "third generation"
   SMRT, nanopores, etc.
Protein Sequencing
Why Proteins?

Small
Digestible (pepsin, trypsin, chymotrypsin)
Chemically distinguishable (purifiable)
Important

Insulin
Fred Sanger
Nobel prize, 1958
Classes of RNA

• mRNA
   • modified bases (cap with m7G, 2'-O-methylation)
   • splicing
   • polyadenylation
• tRNA
   • modified bases (GMe, GMe2, CMe, T, ψ, UH2, I, IMe)
• rRNA
   • prokaryotic: 70S = 50S (5S, 23S) + 30S (16S)
   • eukaryotic: 80S = 60S (5S, 5.8S, 28S) + 40S (18S)
• 7SL
   • RNA of Signal Recognition Particle (SRP)
   • homologous to Alu SINE (11% of human genome)
• snRNA
   • spliceosomes (U1, U2, U4, U5, U6)
• snoRNA
   • pre-rRNA processing (U3)
   • guide 2'-O-methylation
   • guide pseudouridylation
• RNAi
   • siRNA (short interfering RNA)
   • miRNA (microRNA)
      • post-transcriptional gene silencing
      • 3' UTR, conserved
   • piRNA
      • transcriptional silencing of retrotransposons

... et cetera ...
DNA Sequencing

1977
The “modern era” of DNA sequencing begins
Chemical Sequencing of DNA
(Maxam-Gilbert)
February 1977
Two steps:
1. Damage bases (specific, partial)
2. Cleave backbone

Four reactions: A, A+G, C, C+T

Related approaches are still used in specialized applications (e.g., DNA footprinting)

"The major time spent in DNA sequencing is spent in the preparation of the DNA
fragments and on the elements of strategy."
-- W. Gilbert, 1980

http://nobelprize.org/nobel_prizes/chemistry/laureates/1980/gilbert-lecture.pdf
Chain Termination Sequencing
(Sanger Sequencing)

2',3'-dideoxy TTP

Sanger F, Nicklen S & Coulson AR
DNA sequencing with chain-terminating inhibitors
PNAS 74:5463-7, December 1977
Primer Extension

Bacterial DNA polymerase I adds nucleotides to the 3' end of the primer to complement the 5'-overhanging template.

Each strand is an ordered sequence with a direction.

Arrows indicate the 5' to 3' direction (DNA grows biochemically in this direction).
(pyrophosphate released)
Sanger sequencing
Individual reactions with one dNTP partially “poisoned” with
dideoxynucleotides (ddATP, ddCTP, ddGTP, ddTTP)
Decades of improvements
• automated
• fluorescence
• four colors
• one lane
• dye terminators
• one reaction
• capillaries
Automated Sanger sequencing

trace base calls quality scores


Quality Score
q = -10 * log10(p)
p = predicted error probability
1/1000 probability of error = q score of 30

uses: data quality monitoring, assembly consensus, finishing criteria
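
A minimal sketch of the quality score conversion in both directions. The function names are illustrative only and do not come from any particular base-calling package.

```python
import math

def error_prob_to_quality(p):
    """Quality score from a predicted error probability: q = -10 * log10(p)."""
    return -10 * math.log10(p)

def quality_to_error_prob(q):
    """Inverse conversion: p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)

# Each factor of 10 in error probability is worth 10 quality units.
for p in (0.1, 0.01, 0.001, 0.0001):
    print(f"p = {p:<7} q = {error_prob_to_quality(p):.0f}")
```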
Sequencing Strategy
"The major time spent in DNA sequencing is
spent in the preparation of the DNA
fragments and on the elements of strategy."
-- W. Gilbert, 1980

Primer walking (serial)
Shotgun Sequencing (parallel)
Universal Primers
Assembly
read length affects assembly
Next-Generation Sequencing
• Pyrosequencing (454/Roche)
• Cycles of Reversible Termination (Solexa/Illumina)
• Ligation (ABI SOLiD)

"Third-Generation" Sequencing
• SMRT (Pacific Biosciences)
pyrosequencing

pyrophosphate (released by dNTP incorporation) + APS (adenosine 5'-phosphosulfate)
→ sulfate + ATP
(enzyme: ATP sulfurylase)
pyrosequencing

ATP + O2 (oxygen) + luciferin
→ AMP + pyrophosphate + light + oxyluciferin
(enzyme: firefly luciferase)


pyrosequencing
more biochemistry

problem: pyrophosphate recycling
solution: apyrase breaks down ATP to AMP + 2 Pi (or wash out solution)

problem: luciferase can use dATP
solution: use an analog suitable for polymerase but not luciferase (dATPαS)
pyrosequencing
flowgram

Ronaghi M. Genome Res 11:3-11, 2001
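
A rough sketch of how an idealized flowgram relates to the sequence being synthesized: nucleotides are flowed one at a time in a fixed cycle, and each flow gives a signal proportional to the number of bases incorporated (zero if the next base does not match). The flow order "TACG" and the function name are assumptions for illustration. Real signals are noisy, which makes long homopolymer runs hard to count.

```python
def ideal_flowgram(read, flow_order="TACG"):
    """Return (nucleotide, signal) pairs for a noise-free pyrosequencing run.

    `read` is the strand being synthesized, written 5' to 3'.
    The signal for each flow is the length of the homopolymer run incorporated.
    """
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    signals = []
    pos = 0
    flow = 0
    while pos < len(read):
        base = flow_order[flow % len(flow_order)]
        run = 0
        # Incorporate every consecutive copy of the flowed base.
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
        flow += 1
    return signals

print(ideal_flowgram("TTACCCG"))
```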
Emulsion PCR

water droplet in oil

one primer bound to solid bead

individual template molecule
Emulsion PCR

DNA anchored to bead all comes from the same template molecule

"polony" = "PCR colony"


pyrosequencing

Alternatives to chemiluminescence
• heat (“thermosequencing”)
• pH change ("Ion Torrent")
Cycles of Reversible Termination

Illumina/Solexa
Helicos

Metzker M. Sequencing Technologies - The Next Generation.
Nature Reviews Genetics 11:31-46, 2010.
Short Read Alignment
FASTQ Format
maq.sourceforge.net/fastq.shtml

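$q = chr(($Q <= 93 ? $Q : 93) + 33);   # quality score -> character (capped at 93)
$Q = ord($q) - 33;                      # character -> quality score

Characters for Q = 0 ('!') through Q = 60 (']'):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]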
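
The same character mapping can be sketched in Python, assuming the offset-33 encoding used in the Perl expressions on this slide. The function names are hypothetical.

```python
def quality_to_char(Q):
    """Encode a quality score as one FASTQ character (scores above 93 are capped)."""
    return chr(min(Q, 93) + 33)

def char_to_quality(q):
    """Decode one FASTQ quality character back to its score."""
    return ord(q) - 33

def decode_quality_string(qualities):
    """Decode the whole quality line of a FASTQ record."""
    return [char_to_quality(c) for c in qualities]

print(decode_quality_string("!5I]"))
```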
Paired End Tags

Mme I
TCCRAC
(20/18)
Illumina Genome Analyzer

Library Preparation
Illumina Genome Analyzer

Bridge Amplification forms "Polonies"


Illumina Genome Analyzer

Cycles of Reversible Termination


Ligation-based Sequencing

SOLiD (ABI)
Complete Genomics
Polonator (Church Lab)
SOLiD
Sequencing by Oligonucleotide Ligation and Detection

3'- ATNNN~ZZZ*-5'

artwork is from the pamphlet Dibase Sequencing and Color Space Analysis
SOLiD
SOLiD: Dibase Encoding

AT AC AA AG
CG CA CC CT
GC GT GG GA
TA TG TT TC
SOLiD: Dibase Encoding
base space color space

Each color sequence can represent four different base sequences.

The base sequence is one unit longer than the color sequence.

You need to know one base to tell which sequence is represented.


SOLiD: Dibase Encoding

SNP causes two color changes
single color change is probably an error
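
A small sketch of dibase encoding, to make the properties above concrete. It assumes colors are numbered 0 to 3 and that each dinucleotide's color is the XOR of its two base indices (A=0, C=1, G=2, T=3). This reproduces the four groups in the dibase table (AT CG GC TA, AC CA GT TG, AA CC GG TT, AG CT GA TC), but the numeric labels themselves are an assumption for illustration.

```python
BASES = "ACGT"
INDEX = {base: i for i, base in enumerate(BASES)}

def encode(seq):
    """Base space to color space: one color per adjacent pair of bases."""
    return [INDEX[a] ^ INDEX[b] for a, b in zip(seq, seq[1:])]

def decode(first_base, colors):
    """Color space to base space; the first base must be known."""
    seq = [first_base]
    for color in colors:
        seq.append(BASES[INDEX[seq[-1]] ^ color])
    return "".join(seq)

colors = encode("ATGGC")
print(colors)                                # one color fewer than bases
print([decode(b, colors) for b in BASES])    # four candidate base sequences

# A single-base change (SNP) alters the two colors that overlap it.
snp_colors = encode("ATCGC")
print(sum(c1 != c2 for c1, c2 in zip(colors, snp_colors)))
```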
Single-Molecule, Real-Time (SMRT) Sequencing

• High throughput
• Parallelism (small reactions)
• Speed (immediate results)
• Long reads
• Read individual templates from mixtures
• Haplotyping
SMRT Sequencing

Simulated SMRT Sequencing Data
Platform Comparisons

Xu M, Fujita D, and Hanagata N.
Perspectives and Challenges of Emerging Single-Molecule DNA Sequencing Technologies.
Small 5(23):2638–2649, 2009
Other Technologies
• Mass spectrometry
• TEM
• STM
• nanonozzle probes
• nanopores (protein, graphene)
   • ionic current blockage
   • transverse tunneling currents
   • exonuclease
Targeted Exome Capture

nimblegen.com
Bonus Slides
Selenocysteine
tRNA
Omics
• transcriptome
• exome
• kinome
“Plus and Minus” Method
(circa 1975)

"minus": polymerase stops at missing base
"plus": T4 DNA polymerase 3' exonuclease stalled by dNTP

Sanger F, Coulson AR. J Mol Biol. 94(3):441-8, 1975
pyrosequencing

Animation: http://www.pyrosequencing.com/DynPage.aspx?id=7454
Bioinformatics Classics

Needleman SB, Wunsch CD. A general method applicable to the search for similarities
in the amino acid sequence of two proteins. J Mol Biol 48:443-453, 1970.

Smith TF, Waterman MS. Identification of common molecular subsequences.
J Mol Biol 147:195-197, 1981.
Automated Base Calling

1. identify idealized peak locations
   • assume locally even spacing
2. find observed peaks
3. match observed to expected (dynamic programming)
   • omit and split as necessary
4. add "good" unmatched peaks


Error Probabilities
predictive

does not require knowing actual sequence

valid

the set of bases assigned to probability p should
have an actual error rate of p

discriminating

helps to distinguish correct vs. incorrect base calls

1,000,000 base calls with 10,000 errors (p = 0.01)

better if we can break it into two 500,000 sets:
• p=0.018 in one set (9000 errors)
• p=0.002 in second set (1000 errors)
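
A quick check of the arithmetic in this example, expressed as quality scores (q = -10 * log10(p)). Splitting the pool does not change the total number of errors; it only tells us which calls to trust more. The helper name is illustrative.

```python
import math

def quality(errors, calls):
    """Quality score for a set of base calls with a known error count."""
    return -10 * math.log10(errors / calls)

pooled = quality(10_000, 1_000_000)   # p = 0.01
worse = quality(9_000, 500_000)       # p = 0.018
better = quality(1_000, 500_000)      # p = 0.002

print(f"pooled set: q = {pooled:.1f}")
print(f"first set:  q = {worse:.1f}")
print(f"second set: q = {better:.1f}")
```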
Error Probability Calibration
'Given a set of parameters and a training set of reads for which it
is known which base-calls are correct and which are errors, find a
way of associating parameter values to error probabilities that
has (near) maximum discrimination power for small r.'
Phred Quality Score Parameters
Empirical.
Small values tend to correspond to more accurate base-calls.
Window-based parameters smooth out error probabilities.

1. Peak spacing (7 peak window)
   ● largest / smallest peak-to-peak spacing
2. Uncalled/called ratio (7 peak window)
   ● amplitude of largest uncalled / smallest called peak
3. Uncalled/called ratio (3 peak window)
4. Peak resolution
   ● -1 * # bases to the next unresolved base
Lookup Table Production
• Select a range of 50 threshold values for each of the 4 parameters.
• These 50 values are chosen so that each increment contains approximately the same number of bases in the training set.
• For each 4-tuple of parameter thresholds (50^4 = 6,250,000):
   • find the set of bases defined by these thresholds
   • compute empirical error rates
• The parameter set with the lowest error rate goes into the table.
   • if multiple 4-tuples give the same rate, choose the largest set
• These bases are removed, and the process is repeated until all bases are represented in the table.
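
The greedy procedure above can be sketched on a toy scale. This is not the Phred implementation: it assumes 2 parameters with 3 thresholds each (instead of 4 and 50), a made-up training set, and that a base falls within a threshold tuple when every parameter is at or below its cutoff. All names are hypothetical.

```python
from itertools import product

def within(params, cutoffs):
    """True if every parameter value is at or below its cutoff."""
    return all(value <= cutoff for value, cutoff in zip(params, cutoffs))

def build_table(bases, thresholds):
    """bases: list of (params, is_error). Returns [(cutoffs, error_rate), ...]."""
    remaining = list(bases)
    table = []
    while remaining:
        best = None
        for cutoffs in product(*thresholds):
            chosen = [is_err for params, is_err in remaining if within(params, cutoffs)]
            if not chosen:
                continue
            rate = sum(chosen) / len(chosen)
            key = (rate, -len(chosen))   # lowest error rate, then largest set
            if best is None or key < best[0]:
                best = (key, cutoffs)
        if best is None:
            break                        # leftover bases exceed every threshold
        (rate, _), cutoffs = best
        table.append((cutoffs, rate))
        remaining = [(p, e) for p, e in remaining if not within(p, cutoffs)]
    return table

def lookup(table, params):
    """Error rate of the first table row whose cutoffs cover these parameters."""
    for cutoffs, rate in table:
        if within(params, cutoffs):
            return rate
    return None

# Made-up training data: (parameter values, was the base call an error?)
training = ([((1, 1), False)] * 10
            + [((2, 1), False)] * 8 + [((2, 1), True)] * 2
            + [((3, 3), False)] * 5 + [((3, 3), True)] * 5)
table = build_table(training, [[1, 2, 3], [1, 2, 3]])
print(table)
```

Because rows are added in order of increasing error rate, a lookup that scans from the top assigns each base the lowest error rate whose thresholds it satisfies.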
Post-translational Modification
(or co-translational)

• glycosylation (glycoproteins)
   • mucin, cellular interaction, structural
   • N-linked
      • asparagine
   • O-linked
      • serine, threonine, hydroxylysine, hydroxyproline
• iodination
   • thyroid hormone
• hydroxylation
   • hydroxylysine in collagen
• covalently bound enzyme cofactors
   • FAD, biotin, etc
• ubiquitination
• acylation (at O, N, or S)
   • acetylation (acetate, CH3CO2−)
   • myristoylation (myristate, a C14 fatty acid)
   • palmitoylation (palmitate, a C16 fatty acid)
• alkylation
   • methylation
   • isoprenylation
• phosphorylation
   • signal transduction
• ADP-ribosylation
   • signal transduction
   • cholera toxin

... and many more
“Wandering Spot” Method
ca. 1970s
RNA or DNA

partial digestion
2D separation
Horizontal = base composition
Vertical = size

This is an RNase T1 fragment, so it ends in G

Fuke, M., and Busch, H. Nucleic Acids Res. 4:339-352, 1977.


Enzymatic vs Chemical Partial Cleavage of RNA

Sequence-specific RNases
Phy M: A+U
A: pyrimidine-specific (C+U)
U2: A or A+G
T1: degrades after G residues

V1: degrades paired bases

Peattie DA. PNAS 76:1760-1764, 1979.

enzymatic chemical
Modified Nucleotides in tRNA
(post-transcriptional)

• methylguanine (GMe)
• dimethylguanine (GMe2)
• methylcytosine (CMe)
• ribothymine (T)
• pseudouridine (ψ)
• dihydrouridine (UH2)
• inosine (I)
• methylinosine (IMe)
Nucleotide Ambiguity Codes
(IUPAC)

Unambiguous
A, C, G, T, U

2-fold degenerate
M = A or C
R = A or G (puRine)
W = A or T (Weak)
S = C or G (Strong)
Y = C or T (pYrimidine)
K = G or T

3-fold degenerate
V = A, C or G (not T)
H = A, C or T (not G)
D = A, G or T (not C)
B = C, G or T (not A)

4-fold degenerate
X = A, C, G or T
N = A, C, G or T
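
As an illustration, the codes on this slide can be held in a lookup table and used to test whether a sequence fits a degenerate pattern, such as the Mme I recognition site TCCRAC from the Paired End Tags slide. The names below are hypothetical, and treating U as equivalent to T is an assumption of this sketch.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}

def matches(pattern, seq):
    """True if the unambiguous DNA sequence `seq` fits the IUPAC `pattern`."""
    if len(pattern) != len(seq):
        return False
    return all(base in IUPAC[code] for code, base in zip(pattern, seq))

for candidate in ("TCCAAC", "TCCGAC", "TCCTAC"):
    print(candidate, matches("TCCRAC", candidate))
```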
Automated Base Calling

Phred
third-party base caller with better
accuracy than ABI's

open source(ish)

Ewing B, Hillier L, Wendl MC, Green P.
Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment.
Genome Res. 8:175-185, 1998

Ewing B and Green P.
Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities.
Genome Res. 8:186-194, 1998
Shotgun Sequencing
Staden R.
A strategy of DNA sequencing employing computer programs,
Nucleic Acids Research 7: 2601-2610, 1979

“With modern fast sequencing techniques and suitable computer
programs it is now possible to sequence whole genomes without
the need of restriction maps. This paper describes computer
programs that can be used to order both sequence gel readings
and clones. A method of coding for uncertainties in gel readings is
described. These programs are available on request.”

“The whole of the DNA to be sequenced is shotgunned into a suitable vector and
cloned. Ideally the cloned fragments would be of at least 200 bases in length. The
clones are then sequenced and the computer used to collate the data. Collation
involves searching for overlaps in the data.”
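
A toy illustration of "searching for overlaps": find the longest suffix of one read that exactly matches a prefix of another, then merge the two. This is only a sketch with hypothetical names. It is not Staden's program, and real assemblers must also handle sequencing errors, both strands, and repeats.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` equal to a prefix of `b`, or 0."""
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def merge(a, b, min_length=3):
    """Join two reads across their overlap; None if they do not overlap."""
    n = overlap(a, b, min_length)
    return a + b[n:] if n else None

print(overlap("ACGTTGCA", "TGCAGGT"))
print(merge("ACGTTGCA", "TGCAGGT"))
```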
2D gel electrophoresis
cybertory.org/exercises/primerDesign
Protein Sequencing
Edman Degradation
phenylisothiocyanate
invented ca. 1950s
automated ca. 1973
proceeds from N-terminus
read 50-70 aa

http://en.wikipedia.org/wiki/Edman_degradation
Mass Spectrometry

Precise determination of molecular weights of peptides
A few amino acids can ID a spot on a 2D gel
(Sec)

modified from Wikimedia commons
