Professional Documents
Culture Documents
Principles and Problems of de Novo Genome Assembly
Principles and Problems of de Novo Genome Assembly
Principles and Problems of de Novo Genome Assembly
assembly
Lex Nederbragt
Norwegian High-Throughput Sequencing Centre (NSC)
and
Centre for Ecological and Evolutionary Synthesis (CEES)
What is this thing called ‘genome
assembly’?
What is a genome assembly?
reads
contigs
scaffolds
Sequence data
reads
Reads contigs
scaffolds
original DNA
fragments
original DNA
fragments
Sequenced ends
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Reads!
reads
contigs
scaffolds
http://www.sciencephoto.com/media/210915/enlarge
Contigs
reads
scaffolds
ACGCGATTCAGGTTACCACG
GCGATTCAGGTTACCACGCG
GATTCAGGTTACCACGCGTA
TTCAGGTTACCACGCGTAGC
CAGGTTACCACGCGTAGCGC
Alig ne d re ads GGTTACCACGCGTAGCGCAT
TTACCACGCGTAGCGCATTA
ACCACGCGTAGCGCATTACA
CACGCGTAGCGCATTACACA
CGCGTAGCGCATTACACAGA
CGTAGCGCATTACACAGATT
TAGCGCATTACACAGATTAG
Co ns e ns us c o ntig ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG
Contigs
reads
scaffolds
Contig orienation?
Contig order?
Collapsed repeat
consensus
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Contigs
reads
scaffolds
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Mate pairs
reads
scaffolds
fragments
Sequenced ends
scaffolds
mate pairs
contigs
reads
ACGCGATTCAGGTTACCACG
GCGATTCAGGTTACCACGCG
scaffolds
Assembly
How to do this?
Algorithms
Algorithms
All are graph-based
Graph-theory!
Algorithms
Hamiltonian path
– a path that contains all the nodes
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Algorithms
Overlap calculation (alignment)
– computationally intensive
Read 1 Read 2
Overlap
ACGCGATTCAGGTTACCACG
GCGATTCAGGTTACCACGCG
GATTCAGGTTACCACGCGTA
TTCAGGTTACCACGCGTAGC
CAGGTTACCACGCGTAGCGC
Alig ne d re ads GGTTACCACGCGTAGCGCAT
TTACCACGCGTAGCGCATTA
ACCACGCGTAGCGCATTACA
CACGCGTAGCGCATTACACA
CGCGTAGCGCATTACACAGA
CGTAGCGCATTACACAGATT
TAGCGCATTACACAGATTAG
Co ns e ns us c o ntig ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG
Algorithms
Path through the graph
contig
ACGCGATTCAGGTTACCACG
GCGATTCAGGTTACCACGCG
GATTCAGGTTACCACGCGTA
TTCAGGTTACCACGCGTAGC
CAGGTTACCACGCGTAGCGC
Alig ne d re ads GGTTACCACGCGTAGCGCAT
TTACCACGCGTAGCGCATTA
ACCACGCGTAGCGCATTACA
CACGCGTAGCGCATTACACA
CGCGTAGCGCATTACACAGA
CGTAGCGCATTACACAGATT
TAGCGCATTACACAGATTAG
Co ns e ns us c o ntig ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG
Algorithms
Many flavors
Abandoned
Greedy extension
http://www.waialuasodaworks.com/images/flavors2009.jpg
Greedy extension
Oldest
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Overlap-Layout-Consensus
Typical for Sanger-type reads
– also used by newbler from 454 Life Sciences
Overlap-Layout-Consensus
Steps
– Overlap computation
– Layout: graph simplification
– Consensus: sequence
Overlap-Layout-Consensus
Overlap phase:
– K-mer seeds initiate overlap
ACGCGATTCAGGTTACCACG
Overlap-Layout-Consensus
1 TCGTTAGCT 1 2 3 4 5 6
2 CGTTAGCTG
3 GTTAGCTGT
4 TTAGCTGTC
5 TAGCTGTCG
6 AGCTGTCGA
7 ATTCGATTT
8 TTCGATTTT
9 TCGATTTTT 11
10 CGATTTTTC
11 GATTTTTCG
12 ATTTTTCGA
Read GACCTACA
GAC de Bruijn graph
ACC
K-mers (K=3) CCT
CTA
TAC
ACA
Hiseq Hiseq X
2000/2500
1000 Hiseq2500 RR
NextSeq 500
100
Gigabses per run (log scale)
10 Proton
SOLiD MiSeq
1 GS FLX
ABI 3730xl
Roche/454 GS
Illumina GA
0.1 PGM
GA II Series4
PacBio RS SOLiD
0.01 GS Junior Illumina MiSeq
PacBio RS
0.001
454 GS Junior
Ion Proton
Hiseq Hiseq X
2000/2500
1000 Hiseq2500 RR
10 Proton
SOLiD MiSeq
1 GS FLX
ABI 3730xl
Roche/454 GS
Illumina GA
0.1 PGM
‘Sanger
‘Sanger like’
GA II Series4
PacBio RS SOLiD
0.01 GS Junior Illumina MiSeq
like’
Ion Torrent PGM
Ion Proton
http://schatzlab.cshl.edu/presentations/2012-01-17.PAG.SMRTassembly.pdf
Why is genome assembly such
a difficult problem?
1) Repeats
Repeat copy 1 Repeat copy 2
Collapsed repeat
consensus
http://www.cbcb.umd.edu/research/assembly_primer.shtml
2) Diploidy
Differences
between sister
* chromosomes
‘heterozygosity’
*
http://commons.wikimedia.org/wiki/File:Chromosome_1.svg
2) Diploidy
Polymorphic region 2
Region 1 Region 4
Polymorphic region 3
http://www.astraean.com/borderwars/wp-content/uploads/2012/04/heterozygoats.jpg
and many other sites
3) Polyploidy
http://en.wikipedia.org/wiki/Polyploidy
4) Many programs to choose from