
Quick intro to the MIT CompBio Group

Who we are
Daniel Marbach
Mike Lin
Jason Ernst
Jessica Wu
Rachel Sealfon
Pouya Kheradpour
Manolis Kellis
Chris Bristow
Loyal Goff
Irwin Jungreis
Sushmita Roy
Luke Ward
Louisa DiStefano
Dave Hendrix
Angela Yen
Ben Holmes
Soheil Feizi
Mukul Bansal
Bob Altshuler
Stefan Washietl
Stata 4, Stata 3
What we do: Research synopsis
Why biology in a computer science group?
Fundamental biological questions:
1. Interpreting the human genome.
2. Revealing the logic of gene regulation.
3. Principles of evolutionary change.
Algorithmic/machine learning methods:
Comparative genomics: evolutionary signatures
Regulatory genomics: motifs, networks, models
Epigenomics: chromatin states, dynamics, disease
Phylogenomics: evolution at the genome scale
Defining characteristics of our group:
Learn genomic rules, exploit nature of problems
Interdisciplinary collaborations, high biology impact
(1) Comparative genomics: evolutionary signatures
Protein-coding signatures
1000s new coding exons
Translational readthrough
Overlapping constraints
Non-coding RNA signatures
Novel structural families
Targeting, editing, stability
Structures in coding exons
microRNA signatures:
Novel/expanded miR families
miR/miR* arm cooperation
Sense/anti-sense switches
Regulatory motif signatures
Systematic motif discovery
Regulatory motif instances
TF/miRNA target networks
Single binding-site resolution
(2) Regulatory genomics: circuits, predictive models
Initial annotation of the non-coding genome, from 20% to 70%
Systems biology for an animal genome for the first time possible
Students and postdocs are co-first authors, leadership roles
Predictive models
of gene regulation
Infer networks
Predict function
Predict regulators
Predict gene
expression
ENCODE/modENCODE
4-year effort, dozens of
experimental labs
Integrative analysis
Systematic genome
annotation
Flagship NIH project
(3) Phylogenomics: Bayesian gene tree reconstruction
Two components of gene evolution
Generative model:
1. Family rate: $F_j \sim \mathrm{gamma}(\alpha, \beta)$ (selective pressures on gene function)
2. Species-specific rates: $S_i \sim \mathrm{normal}(\mu_i, \sigma_i)$ (population dynamics of the species)
New phylogenomic pipeline
Bayesian formulation
Sequence likelihood: HKY model (traditional)
Branch length prior: learned $F_j$, $S_i$ distributions
Topology prior: Birth-Death process
Length $l$, Topology $T$, Reconciliation $R$
Alignment data $D$, species-level parameters
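Read together, the terms listed for the Bayesian formulation suggest a posterior over branch lengths, topology, and reconciliation that factors into a sequence likelihood, a branch length prior, and a topology prior. The factorization below is a sketch under that reading, not a formula taken from the slide; the symbol $\theta$ for the species-level parameters and the exact conditioning of each factor are assumptions.

```latex
P(l, T, R \mid D, \theta) \;\propto\;
  \underbrace{P(D \mid T, l)}_{\text{sequence likelihood}}
  \cdot \underbrace{P(l \mid T, R, \theta)}_{\text{branch length prior}}
  \cdot \underbrace{P(T, R \mid \theta)}_{\text{topology prior}}
```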
(4) Vignette on Epigenomics
Using chromatin information to understand human diseases
Jason Ernst
Pouya Kheradpour
Challenge of data integration in many marks/cells
Dozens of chromatin tracks
Understand their function
Reveal their combinations
Annotate systematically
Our approach: learn
common chromatin states
Explicitly model combinations
Unsupervised approach,
probabilistic model
Construct antibodies
pull down chromatin
ChIP-seq tracks
Histone tail
modifications
(marks)
Histones
Histone
tails
Our approach: Multivariate Hidden Markov Model (HMM)
[Figure: binarized chromatin marks along the DNA (called based on a Poisson distribution) in 200 base pair intervals, e.g. H3K4me3, H3K4me1, H3K27ac, and H3K36me3 around a TSS, an enhancer, and a transcribed region; the unobserved most likely hidden state sequence (1 3 4 6 6 6 6 6 5 5 5); and the high-probability chromatin marks in each of states 1 to 6. All probabilities are learned from the data.]
Binarization leads to explicit modeling of mark combinations and interpretable parameters
Emission distribution is a product of independent Bernoulli random variables
Ernst and Kellis, Nat Biotech 2010
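As a rough illustration of a product-of-independent-Bernoullis emission, the sketch below computes the probability of one binarized 200 bp interval given one state. The state name `promoter_like` and its per-mark probabilities are made-up values for illustration, not parameters learned in the cited work.

```python
from math import prod

def emission_probability(observed, mark_probs):
    """P(observed binary marks | state), treating each mark as an
    independent Bernoulli variable with the state's probability."""
    return prod(
        p if observed[mark] else 1.0 - p
        for mark, p in mark_probs.items()
    )

# Hypothetical state: probability that each mark is present in this state.
promoter_like = {"H3K4me3": 0.9, "H3K4me1": 0.2, "H3K36me3": 0.1}

# One binarized interval: 1 = mark called present, 0 = absent.
interval = {"H3K4me3": 1, "H3K4me1": 0, "H3K36me3": 0}

p = emission_probability(interval, promoter_like)  # 0.9 * 0.8 * 0.9
```

Because every mark contributes its own factor, each parameter reads directly as "how often this mark is present in this state", which is the interpretability the slide refers to.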
From chromatin marks to chromatin states
Learn de novo
significant
combinations of
chromatin marks
Reveal functional
elements, even
without looking
at sequence
Use for genome
annotation
Use for studying
regulation
dynamics in
different cell
types
Promoter states
Transcribed states
Active Intergenic
Repressed
Ernst and Kellis, Nat Biotech 2010
ENCODE: Study nine marks in nine human cell lines
9 human cell types, 9 marks
H3K4me1
H3K4me2
H3K4me3
H3K27ac
H3K9ac
H3K27me3
H4K20me1
H3K36me3
CTCF
+WCE
+RNA
HUVEC Umbilical vein endothelial
NHEK Keratinocytes
GM12878 Lymphoblastoid
K562 Myelogenous leukemia
HepG2 Liver carcinoma
NHLF Normal human lung fibroblast
HMEC Mammary epithelial cell
HSMM Skeletal muscle myoblasts
H1 Embryonic
9 cell types x 9 marks = 81 chromatin mark tracks ($2^{81}$ combinations)
Ernst et al, Nature 2011
Learned jointly
across cell
types
(virtual
concatenation)
State definitions
are common
State locations
are dynamic
Brad Bernstein ENCODE Chromatin Group
Chromatin states dynamics across nine cell types
Single annotation track for each cell type
Summarize cell-type activity at a glance
Can study 9-cell activity pattern across
Correlated
activity
Predicted
linking
Multi-cell activity profiles and their correlations
HUVEC
NHEK
GM12878
K562
HepG2
NHLF
HMEC
HSMM
H1
Gene
expression
Chromatin
States
Active TF motif
enrichment
ON
OFF
Active enhancer
Repressed
Motif enrichment
Motif depletion
TF regulator
expression
TF On
TF Off
Dip-aligned
motif biases
Motif aligned
Flat profile
Chromatin state & gene expression link enhancers and target genes
TF motif enrichment & TF expression reveal activators / repressors
Ex2: Gfi1 repressor of
K562/GM cells
Ex1: Oct4 predicted activator
of embryonic stem (ES) cells
Coordinated activity reveals activators/repressors
Enhancer networks: Regulator → enhancer → target gene
Activity signatures for each TF
Enhancer activity
Disease-associated SNPs enriched for enhancers in relevant cell types
E.g. lupus SNP in GM enhancer disrupts Ets1 predicted activator
Revisiting disease-associated variants
Title | Author/Journal | Total #SNPs | Fold | Cell Type | # SNPs in enhancers | FDR
Multiple loci influence erythrocyte phenotypes in the CHARGE Consortium. | Ganesh et al, Nat Genet 2009 | 35 | 17 | K562 | 9 | 0.02
Biological, clinical and population relevance of 95 loci for blood lipids | Teslovich et al, Nature 2010 | 101 | 11 | HepG2 | 13 | 0.02
Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci | Stahl et al, Nat Genet 2010 | 29 | 15 | GM12878 | 7 | 0.03
Genome-wide meta-analyses identify three loci associated with primary biliary cirrhosis | Liu et al, Nat Genet 2010 | 6 | 41 | GM12878 | 4 | 0.03
Chinese Han population identifies nine new susceptibility loci for systemic lupus erythematosus. | Han et al, Nat Genet 2009 | 18 | 21 | GM12878 | 6 | 0.03
Six new loci associated with blood low-density lipoprotein cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. | Kathiresan et al, Nat Genet 2008 | 18 | 24 | HepG2 | 5 | 0.03
Genome-wide association study of hematological and biochemical traits in a Japanese population | Kamatani et al, Nat Genet 2009 | 39 | 12 | K562 | 7 | 0.03
A genome-wide meta-analysis identifies 22 loci associated with eight hematological parameters in the HaemGen consortium. | Soranzo et al, Nat Genet 2009 | 28 | 15 | K562 | 6 | 0.03
Meta-analysis of genome-wide association data identifies four new susceptibility loci for colorectal cancer. | Houlston et al, Nat Genet 2008 | 4 | 66 | HepG2 | 3 | 0.03
Genome-wide association study identifies eight loci associated with blood pressure. | Newton-Cheh et al, Nat Genet 2009 | 9 | 30 | K562 | 4 | 0.04
Regulatory roles revealed for many studies
We work hard
[Figure: publications, labeled by venue: Science, Nature, Nature Biotech, PLoS Genetics, MBE, Genome Research, PLoS Comp. Bio., Genes & Development, PNAS, BMC Evo. Bio., ACM TKDD, RECOMB, J. Comp. Bio.]
We aim to further our understanding of the human genome by computational integration of large-scale functional and comparative genomics datasets.
We use comparative genomics of multiple related species to recognize evolutionary signatures of protein-coding genes, RNA structures, microRNAs, regulatory motifs, and individual regulatory elements.
We use combinations of epigenetic modifications to define chromatin states associated with distinct functions, including promoter, enhancer, transcribed, and repressed regions, each with distinct functional properties.
We develop phylogenomic methods to study differences between species and to uncover evolutionary mechanisms for the emergence of new gene functions.
Our methods have led to numerous new insights on diverse regulatory mechanisms, uncovered evolutionary principles, and provide mechanistic insights for previously uncharacterized disease-associated SNPs.
[Figure: paper thumbnails labeled by venue: Nature, Nature Gen, Genes & Dev, Nature Biotech, WB press, PNAS, G.R., BioChem, Genome Res, Science, RECOMB; one in review]
But we also have fun
Kayaking, whitewater rafting, BBQs, Canadian Thanksgiving, sailing
Want to get started?
6.047/6.878 Computational Biology: Genomes, Networks, Evolution
Challenges in computational biology
[Figure: numbered challenges arranged around DNA, an RNA transcript, and example aligned sequences (TCATGCTAT, TCGTGATAA, TGAGGATAT, TTATCATAT, TTATGATTT)]
1 Gene Finding/HMMs
2 Sequence alignment/DP
3 Rapid lookup: BLAST/Hashing
4 Assembly/BWT
5 Motif discovery/EM
6 Comparative Genomics/model fits
7 Evolutionary Theory and phylogenetics
8 Gene expression/Bayesian classification
9 Cluster discovery/unsupervised learning
10 Motif finding/Gibbs sampling
11 Network analysis/Graph theory
12 Metabolic modeling/systems
13 Predictive models/regression
14 Population Genomics/coalescent theory/selection
4 modules: Genomes, Genes, Networks, Evolution
Foundations and frontiers. Practical/algorithmic problems
Final project: mentoring, milestones
Project planning
Project execution
For more info: compbio.mit.edu
(or talk to any of us)
6.006
Introduction to Algorithms
Lecture 25 2/3
(High-dimensional) geometry
Or: 6.850 Geometric Computing preview
Nearest Neighbor
Given: a set $P$ of $n$ points in $\mathbb{R}^d$
Nearest Neighbor: for any query $q$, returns a point $p \in P$ minimizing the Euclidean distance $\|p - q\|$
The case of d = 1
Build BST
Next-larger(q): finds the next element after element q
Next-smaller(q): analogous
The closer of the two is the nearest neighbor
Performance:
Space: O(n)
Query time: O(log n), assuming balanced BST
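The one-dimensional procedure can be sketched as follows. This version stands a sorted Python list in for the balanced BST (an assumption made for brevity), so queries still take O(log n) via binary search; the function name `nearest_1d` is illustrative.

```python
from bisect import bisect_left

def nearest_1d(sorted_points, q):
    """Return the point closest to q in a non-empty sorted list."""
    i = bisect_left(sorted_points, q)
    candidates = []
    if i > 0:
        candidates.append(sorted_points[i - 1])  # next smaller than q
    if i < len(sorted_points):
        candidates.append(sorted_points[i])      # next larger (or equal)
    # The closer of the two neighbors is the nearest neighbor.
    return min(candidates, key=lambda p: abs(p - q))

points = [1, 4, 9, 16]
closest = nearest_1d(points, 6)
```

A sorted list only suits a static point set; supporting insertions and deletions at the same query cost is what the balanced BST provides.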
The case of d = 2
Compute Voronoi diagram
Given q, perform point location
Performance:
Space: O(n)
Query time: O(log n)
The case of d > 2
Voronoi diagram has size $n^{O(d)}$
Curse of dimensionality
We can also perform a linear scan: O(dn) time
Both are pretty bad if d, n = few million
Why would d be a few million?
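For reference, the linear-scan baseline is only a few lines. This is a minimal sketch assuming points are given as equal-length tuples of numbers; each of the n distance computations costs O(d), giving O(dn) per query.

```python
from math import dist

def nearest_linear_scan(points, q):
    """Exact nearest neighbor: compare q against every point."""
    return min(points, key=lambda p: dist(p, q))

points = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (5.0, 5.0, 5.0)]
closest = nearest_linear_scan(points, (0.9, 1.2, 1.0))
```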
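Example: Vector Space Model
[Salton, Wong, Yang 1975]
Treat each document as a vector of its words
One coordinate for every possible word
Example:
$D_1$ = "the cat"
$D_2$ = "the dog"
Similarity between vectors?
Dot product: $D_1 \cdot D_2$
http://portal.acm.org/citation.cfm?id=361220
[Figure: $D_1$ and $D_2$ drawn as vectors on the axes "the", "cat", and "dog", with coordinates of 1]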
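The two example documents can be turned into vectors and compared in a few lines. This is a minimal sketch that assumes a fixed three-word vocabulary and whitespace tokenization; the helper names `to_vector` and `dot` are illustrative.

```python
from collections import Counter

def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

vocabulary = ["the", "cat", "dog"]      # one coordinate per possible word
d1 = to_vector("the cat", vocabulary)   # [1, 1, 0]
d2 = to_vector("the dog", vocabulary)   # [1, 0, 1]
similarity = dot(d1, d2)                # only "the" is shared
```

With a realistic vocabulary there is one coordinate for every possible word, which is how the dimension d reaches the millions.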

Vector Space Model ctd
[Salton, Wong, Yang 1975]
We have
$\|D_1 - D_2\|^2 = \|D_1\|^2 + \|D_2\|^2 - 2 D_1 \cdot D_2$
If we normalize $D_1$, $D_2$ then
$\|D_1 - D_2\|^2 = 2 - 2 D_1 \cdot D_2$
Minimizing Euclidean distance = Maximizing Dot Product
Many other applications: searching for similar biosequences, similar images, etc.
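The identity for normalized vectors is easy to check numerically. The sketch below uses the word-count vectors (1, 1, 0) and (1, 0, 1) for "the cat" and "the dog" as an assumed example; the helper names are illustrative.

```python
from math import sqrt

def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

d1 = normalize([1, 1, 0])
d2 = normalize([1, 0, 1])

lhs = squared_distance(d1, d2)   # ||D1 - D2||^2
rhs = 2 - 2 * dot(d1, d2)        # 2 - 2 D1 . D2
```

Since the squared distance falls exactly as the dot product rises, ranking unit vectors by smallest distance and by largest dot product gives the same order.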
Approximate Near Neighbor
c-Approximate Nearest Neighbor: build data structure which, for any query $q$, returns $p \in P$ with $\|p - q\| \le cr$, where $r$ is the distance to the nearest neighbor of $q$
Can beat the curse
Example algorithm:
Space: $O(dn + n^{1+1/c})$
Query time: $O(dn^{1/c})$
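The guarantee itself can be stated as a small checker. The sketch below is not the example algorithm with the bounds above; it only tests, by brute force, whether a returned point satisfies the c-approximate condition. The function name `is_c_approximate` is illustrative.

```python
from math import dist

def is_c_approximate(points, q, answer, c):
    """True if answer is within c * r of q, where r is the distance
    from q to its exact nearest neighbor in points."""
    r = min(dist(p, q) for p in points)
    return dist(answer, q) <= c * r

points = [(0.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
q = (1.0, 0.0)                    # exact nearest neighbor is (0, 0), r = 1

ok = is_c_approximate(points, q, (3.0, 0.0), c=2)    # distance 2 <= 2 * 1
bad = is_c_approximate(points, q, (10.0, 0.0), c=2)  # distance 9 >  2 * 1
```

A checker like this is mainly useful for testing an approximate data structure against exact answers on small inputs.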
6.850 Geometric Computing
Next Spring
Coming to lecture room near you!
6.006
Introduction to Algorithms
Lecture 26c: Beyond
Prof. Erik Demaine
Erik's Main Research Areas
Data structures
Graph algorithms
Geometric folding algorithms
Recreational algorithms
6.851: Advanced Data Structures
Dynamic graphs
Integer structures
String structures
Geometric DSs
Reducing space
Time travel
Realistic models: cache, disk, GPUs, ...
Best possible binary search trees
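Integer Data Structures
Store integers in range $\{0, 1, \ldots, u-1\}$ subject to insert, delete, successor, and predecessor in
$O(\lg \lg u)$ time [van Emde Boas]
$O\left(\frac{\lg n}{\lg \lg u}\right)$ time [fusion trees]
$O\left(\sqrt{\frac{\lg n}{\lg \lg n}}\right)$ time [combination]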
Cache-Efficient Data Structures
Memory transfers happen in blocks:
Searching takes $\Theta(\log_B n)$ memory transfers
Sorting takes $\Theta\left(\frac{n}{B} \log_C \frac{n}{B}\right)$ memory transfers
even if you don't know $B$ or $C$!
Approximation Algorithms in Planar Graphs (& more)
Delete every $k$th BFS layer
Rings are width-$k$ trees
Most problems solved in $2^{O(k)} n$
Patch up at a $1 + 1/k$ factor
Approximate better as $k \to \infty$
The Movie
Vanessa Gould, Green Fuse Films
[56 minutes, 2008]
Tomohiro Tachi
Origamizer
[Tachi 2006; Demaine & Tachi 2009-2011]
Natural Cycles
Erik & Martin Demaine
Renwick Gallery, Smithsonian American Art Museum, 2012
Hinged Dissection
[Dudeney 1902]
Hinged Dissection Universality
[Abbott, Abel, Charlton, Demaine, Demaine, Kominers 2008]
Theorem: For any finite set of polygons of equal area, there is a hinged dissection that can fold into any of the polygons, continuously without self-intersection
Complexity of Games & Puzzles
[Demaine, Hearn & many others]
0 players (simulation): P, PSPACE
1 player (puzzle): NP, PSPACE
2 players (game): PSPACE, EXPTIME
team, imperfect info: NEXPTIME (bridge?), Undecidable (Rengo Kriegspiel?)
Constraint Logic
[Hearn & Demaine 2009]
0 players (simulation): P, PSPACE
1 player (puzzle): NP, PSPACE
2 players (game): PSPACE, EXPTIME
team, imperfect info: NEXPTIME, Undecidable
Follow-On Algorithms Classes
6.046: Intermediate Algorithms
6.047: Computational Biology
6.854: Advanced Algorithms
6.849: Geometric Folding Algorithms
6.850: Geometric Computing
6.851: Advanced Data Structures
6.852: Distributed Algorithms
6.853: Algorithmic Game Theory
6.855: Network Optimization
6.856: Randomized Algorithms
6.857: Network and Computer Security
Follow-On Theory Classes
6.045: Automata, Computability, Complexity
6.840: Theory of Computing
6.841: Advanced Complexity Theory
6.842: Randomness & Computation
6.845: Quantum Complexity Theory
