
Fragment Assembly

Dr. Zoya Khalid


Sequencing by Hybridization (SBH)
• Hybridization data is obtained for the DNA with all possible probes of length l.

• If there are no errors, we obtain every substring of length l that occurs in the DNA.

• We would like to reconstruct the DNA from those substrings.

• We can formalize this as the shortest common superstring (SCS) problem and solve it as before.

• But we can simplify a little bit…


Sequencing by Hybridization (SBH)
• Sequencing by hybridization is a special case of the shortest superstring problem where all fragments in F have the same length l
• An SBH array has probes for all possible k-mers (all possible strings of length k)
• For a given DNA sample, the array tells us whether each k-mer is PRESENT or ABSENT in the sample
• The set of all k-mers present in a string s is called its spectrum
• Example:
• s = ACTGATGCAT
• spectrum(s, 3) = {ACT, ATG, CAT, CTG, GAT, GCA, TGA, TGC}
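The spectrum definition above can be sketched in a few lines of Python. The function name `spectrum` mirrors the slide's notation; everything else is an illustrative assumption, not part of the original lecture.

```python
def spectrum(s: str, k: int) -> set[str]:
    """Return the set of all k-mers (substrings of length k) occurring in s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


# The example string from the slide.
print(sorted(spectrum("ACTGATGCAT", 3)))
```

Because the result is a set, a k-mer that occurs several times in s is recorded only once, which matches the PRESENT/ABSENT readout of an array.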
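Example DNA Array
Sample: ACTGATGCAT
Spectrum (k=4): {ACTG, ATGC, CTGA, GATG, GCAT, TGAT, TGCA}
Rows: first two nucleotides of the 4-mer. Columns: last two nucleotides. X = probe present.

   AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
AA .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
AC .  .  .  .  .  .  .  .  .  .  .  .  .  .  X  .
AG .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
AT .  .  .  .  .  .  .  .  .  X  .  .  .  .  .  .
CA .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
CC .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
CG .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
CT .  .  .  .  .  .  .  .  X  .  .  .  .  .  .  .
GA .  .  .  .  .  .  .  .  .  .  .  .  .  .  X  .
GC .  .  .  X  .  .  .  .  .  .  .  .  .  .  .  .
GG .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
GT .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
TA .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
TC .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
TG .  .  .  X  X  .  .  .  .  .  .  .  .  .  .  .
TT .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .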
SBH Problem
• Given: A set S of k-mers
• Do: Find a string s, such that spectrum(s,k) = S

{ACT, ATG, CAT, CTG, GAT, GCA, TGA, TGC}

?
Hamiltonian Path Approach
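• S = { ATG AGG TGC TCC GTC GGT GCA CAG }
• Each k-mer is a vertex; there is an edge from u to v if the last k-1 characters of u equal the first k-1 characters of v
• Hamiltonian path H: ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
• Spelled sequence: ATGCAGGTCC
• The path visits every VERTEX exactly once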


Hamiltonian Path Approach
• A more complicated graph:

• S={ ATG TGG TGC GTG GGC GCA GCG CGT }

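• Path 1 (Hamiltonian path H): ATG -> TGC -> GCG -> CGT -> GTG -> TGG -> GGC -> GCA, spelling ATGCGTGGCA

• Path 2 (Hamiltonian path H): ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA, spelling ATGGCGTGCA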
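As a rough illustration of the Hamiltonian path approach, the sketch below brute-forces every ordering of the k-mers and keeps those in which consecutive k-mers overlap by k-1 characters. The function name `hamiltonian_spellings` is hypothetical, and trying all orderings is only feasible for toy inputs like the ones on these slides.

```python
from itertools import permutations


def hamiltonian_spellings(kmers):
    """Return every string spelled by a Hamiltonian path in the overlap graph.

    Brute force: each permutation of the k-mers is a candidate path; it is
    valid if every k-mer's last k-1 characters equal the next one's first k-1.
    """
    results = set()
    for order in permutations(kmers):
        if all(a[1:] == b[:-1] for a, b in zip(order, order[1:])):
            # Spell the path: first k-mer plus the last character of each later one.
            results.add(order[0] + "".join(b[-1] for b in order[1:]))
    return results


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(sorted(hamiltonian_spellings(S)))
```

The number of orderings grows as n!, which is one way to see why this formulation does not scale.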
SBH as Eulerian path
• Could use the Hamiltonian path approach, but the Hamiltonian path problem is NP-complete, so it is impractical for large spectra
• Instead, use the Eulerian path approach
• Eulerian path: a path through a graph that traverses every edge exactly once
• Construct a graph with all (k-1)-mers as vertices
• For each k-mer in the spectrum, add an edge from the vertex representing its first k-1 characters to the vertex representing its last k-1 characters
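The graph construction described above might look like the following sketch. The adjacency-list representation and the name `de_bruijn` are assumptions chosen for illustration.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it points to.

    One directed edge is added per k-mer, so a repeated k-mer produces
    parallel edges. Vertices with no outgoing edge appear only as targets.
    """
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


graph = de_bruijn(["ATG", "TGC", "GCG", "CGT", "GTG", "TGG", "GGC", "GCA"])
for node, successors in graph.items():
    print(node, "->", successors)
```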
SBH as Eulerian path
• Genome: AAABBBBA
• k = 3
• 3-mers: AAA, AAB, ABB, BBB, BBB, BBA
• Construct a graph with all (k-1)-mers as vertices
• k-1 = 2, so split each 3-mer into its left and right 2-mers
• (AA, AA) (AA, AB) (AB, BB) (BB, BB) (BB, BB) (BB, BA)
• Create one node per distinct 2-mer: AA, AB, BB, BA
Reconstruct Genome

Now assume we do not know the original sequence. Can we reconstruct the genome?

• One edge per k-mer (for each k-mer there is one directed edge)
• One node per distinct (k-1)-mer
• Reconstruction involves a walk through the graph.
• Move from node to node following the direction of the edges.
• A walk that crosses each edge exactly once is called an Eulerian walk.
Reads -> Genome
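Putting the pieces together, a minimal end-to-end sketch could build the graph from the k-mers and then follow an Eulerian walk. The walk below uses a stack-based variant of Hierholzer's algorithm, which is a choice made for this example rather than something the slides specify; the name `reconstruct` is hypothetical, and the code assumes an Eulerian walk exists without checking.

```python
from collections import Counter, defaultdict


def reconstruct(kmers):
    """Spell a string whose k-mers (with multiplicity) are exactly `kmers`."""
    out_edges = defaultdict(list)
    balance = Counter()  # outdegree minus indegree for each node
    for kmer in kmers:
        left, right = kmer[:-1], kmer[1:]
        out_edges[left].append(right)
        balance[left] += 1
        balance[right] -= 1

    # Start at the node with one extra outgoing edge; if the graph is
    # balanced (Eulerian cycle), any node with an outgoing edge will do.
    start = next((v for v, b in balance.items() if b == 1), kmers[0][:-1])

    stack, walk = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            walk.append(stack.pop())             # dead end: emit node
    walk.reverse()

    return walk[0] + "".join(node[-1] for node in walk[1:])


print(reconstruct(["AAA", "AAB", "ABB", "BBB", "BBB", "BBA"]))
```

When several Eulerian walks exist, this returns just one of them, so the output is one candidate genome rather than a guaranteed original.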
Different sequences – the same spectrum
• Different sequences may have the same spectrum:
Spectrum(GTATCT, 2) = Spectrum(GTCTAT, 2) = {AT, CT, GT, TA, TC}
Eulerian Path Approach
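• S = { ATG, TGC, GCG, CGT, GTG, TGG, GGC, GCA }
• Vertices correspond to (k-1)-mers: { AT, TG, GC, GG, GT, CA, CG }
• Edges correspond to k-mers from S:
AT -> TG (ATG), TG -> GC (TGC), GC -> CG (GCG), CG -> GT (CGT), GT -> TG (GTG), TG -> GG (TGG), GG -> GC (GGC), GC -> CA (GCA)

• An Eulerian path visits every EDGE exactly once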


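• The graph on vertices { AT, TG, GC, GG, GT, CA, CG } has two different Eulerian paths:
Path 1: AT -> TG -> GG -> GC -> CG -> GT -> TG -> GC -> CA, spelling ATGGCGTGCA
Path 2: AT -> TG -> GC -> CG -> GT -> TG -> GG -> GC -> CA, spelling ATGCGTGGCA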
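To see that a spectrum can admit more than one reconstruction, one can enumerate all Eulerian paths by backtracking over unused edges. The sketch below does this for the example spectrum; `eulerian_spellings` is a hypothetical name, and full enumeration can take exponential time on larger graphs.

```python
from collections import defaultdict


def eulerian_spellings(kmers):
    """Return the strings spelled by every Eulerian path of the k-mer graph."""
    out_edges = defaultdict(list)
    for i, kmer in enumerate(kmers):
        out_edges[kmer[:-1]].append((i, kmer[1:]))  # edge i: prefix -> suffix

    used = [False] * len(kmers)
    found = set()

    def extend(node, text, n_used):
        if n_used == len(kmers):          # every edge traversed exactly once
            found.add(text)
            return
        for i, nxt in out_edges[node]:
            if not used[i]:
                used[i] = True
                extend(nxt, text + nxt[-1], n_used + 1)
                used[i] = False           # backtrack

    for start in list(out_edges):         # copy keys: lookups may add entries
        extend(start, start, 0)
    return found


S = ["ATG", "TGC", "GCG", "CGT", "GTG", "TGG", "GGC", "GCA"]
print(sorted(eulerian_spellings(S)))
```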
Properties of Eulerian graphs
• It will be easier to consider Eulerian cycles: Eulerian paths that form a cycle
• Graphs that have an Eulerian cycle are simply called Eulerian
• Theorem: A connected directed graph is Eulerian if and only if each of its vertices is balanced
• A vertex v is balanced if indegree(v) = outdegree(v)
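The balance condition is easy to check mechanically. The following is a small sketch, assuming the graph is given as a list of directed (u, v) edge pairs; it tests only the degree condition and does not verify connectivity.

```python
from collections import Counter


def is_balanced(edges):
    """Return True if indegree(v) == outdegree(v) for every vertex v."""
    indegree, outdegree = Counter(), Counter()
    for u, v in edges:
        outdegree[u] += 1
        indegree[v] += 1
    vertices = set(indegree) | set(outdegree)
    return all(indegree[v] == outdegree[v] for v in vertices)


print(is_balanced([("a", "b"), ("b", "c"), ("c", "a")]))  # a directed triangle
print(is_balanced([("AT", "TG"), ("TG", "GC")]))           # a simple path
```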
Seven Bridges of Königsberg

Euler answered the question: Is there a walk through the city that traverses each bridge exactly once?
Vertices A, B and D have degree 3 and vertex C has degree 5, so all four vertices of this graph have odd degree. A connected undirected graph has an Euler path only if it has zero or two vertices of odd degree, so this graph does not have an Euler path.
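The degree count can be reproduced with a short sketch. The bridge list below is an assumption: it is one assignment of the seven bridges to the labels A, B, C, D that yields the degrees stated on the slide (C as the island with degree 5).

```python
from collections import Counter

# Assumed bridge list: each pair is one bridge between two land masses.
bridges = [("A", "C"), ("A", "C"), ("B", "C"), ("B", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd_vertices = sorted(v for v, d in degree.items() if d % 2 == 1)

# An undirected connected graph has an Euler path only with 0 or 2 odd vertices.
has_euler_path = len(odd_vertices) in (0, 2)
print(dict(degree), odd_vertices, has_euler_path)
```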
Euler Theorem
• A graph is balanced if, for every vertex, the number of incoming edges equals the number of outgoing edges:
in(v) = out(v)
• Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.
Eulerian Cycle
• An Eulerian path in a graph G is a path that passes through every edge of G exactly once.
• An Eulerian cycle is an Eulerian path that starts and ends at the same vertex.
• In a graph with an Eulerian cycle, indegree(v) = outdegree(v) for every vertex v.
Eulerian Path -> Eulerian Cycle
• An Eulerian cycle requires indegree(v) = outdegree(v) at every vertex
• If a graph has an Eulerian path starting at s and ending at t, then
• All vertices must be balanced, except for s and t, which may have |indegree(v) - outdegree(v)| = 1
• If s and t are not balanced, add an edge from t to s to balance them
• The graph now has an Eulerian cycle, which can be converted to an Eulerian path by removing the added edge
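The path-to-cycle reduction starts by locating s and t from the degree imbalance. A possible sketch follows; the function name `path_endpoints` is hypothetical, and the check covers only degrees, not connectivity.

```python
from collections import Counter


def path_endpoints(edges):
    """Return (s, t) for the Eulerian path endpoints, or None if all balanced.

    s has one extra outgoing edge and t one extra incoming edge. Raises
    ValueError if the degree conditions for an Eulerian path are violated.
    """
    balance = Counter()  # outdegree minus indegree
    for u, v in edges:
        balance[u] += 1
        balance[v] -= 1
    unbalanced = {v: b for v, b in balance.items() if b != 0}
    if not unbalanced:
        return None
    starts = [v for v, b in unbalanced.items() if b == 1]
    ends = [v for v, b in unbalanced.items() if b == -1]
    if len(unbalanced) == 2 and len(starts) == 1 and len(ends) == 1:
        return starts[0], ends[0]
    raise ValueError("degree conditions for an Eulerian path are violated")


kmers = ["ATG", "TGC", "GCG", "CGT", "GTG", "TGG", "GGC", "GCA"]
edges = [(kmer[:-1], kmer[1:]) for kmer in kmers]
s, t = path_endpoints(edges)
balanced_edges = edges + [(t, s)]  # the extra edge that closes the path into a cycle
print(s, t)
```

After an Eulerian cycle is found in `balanced_edges`, cutting it at the added edge (t, s) gives a path from s to t.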
Eulerian cycle algorithm
• Start at any vertex v, traverse unused edges until returning to v
• While the cycle is not Eulerian
• Pick a vertex w along the cycle for which there are untraversed outgoing
edges
• Traverse unused edges until ending up back at w
• Join two cycles into one cycle
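The cycle-joining procedure above can be sketched fairly literally: walk until stuck, then repeatedly splice a new cycle into the current one at a vertex that still has unused outgoing edges. The code assumes a connected, balanced directed graph given as (u, v) pairs and does not validate that; the names are illustrative.

```python
from collections import defaultdict


def eulerian_cycle(edges, start):
    """Return an Eulerian cycle as a vertex list that begins and ends at start."""
    out_edges = defaultdict(list)
    for u, v in edges:
        out_edges[u].append(v)

    def walk(v):
        """Follow unused edges from v until stuck; in a balanced graph
        the walk can only get stuck back at v."""
        tour = [v]
        while out_edges[tour[-1]]:
            tour.append(out_edges[tour[-1]].pop())
        return tour

    cycle = walk(start)
    i = 0
    while i < len(cycle):
        w = cycle[i]
        if out_edges[w]:
            # Join: replace the single occurrence of w by the new cycle w ... w.
            cycle[i:i + 1] = walk(w)
        else:
            i += 1
    return cycle


edges = [("a", "b"), ("b", "d"), ("b", "c"), ("c", "a"), ("d", "b")]
print(eulerian_cycle(edges, "a"))
```

List splicing keeps the sketch short but is not the most efficient choice; linked lists or a stack-based formulation avoid the repeated copying.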

Joining cycles

Algorithm for Constructing an Eulerian Cycle

a. Start with an arbitrary vertex v and form an arbitrary cycle with unused edges until a dead end is reached. Since the graph is Eulerian, this dead end is necessarily the starting point, i.e., vertex v.
SBH difficulties
• In practice, sequencing by hybridization is hard
• Arrays are often inaccurate -> incorrect spectra
• False positives/negatives
• False positive: the array reports a k-mer that is not in the sample (analogy: a cancer screening test comes back positive, but you don't have the disease)
• False negative: the array misses a k-mer that is in the sample (analogy: you have a disease, but the test comes back negative)
• Need long probes to deal with repetitive sequence
• But the number of probes needed is exponential in the length of the probes: 4^k probes for length k
• There is a limit to the number of probes per array (currently between 1 and 10 million probes per array)
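The exponential growth is easy to quantify: with four nucleotides there are 4^k distinct probes of length k. The short sketch below computes the longest probe length whose full probe set fits within a given array capacity; the helper names are hypothetical.

```python
def num_probes(k: int) -> int:
    """Number of distinct DNA probes of length k."""
    return 4 ** k


def max_probe_length(capacity: int) -> int:
    """Largest k such that all 4**k probes fit on an array of the given capacity."""
    k = 0
    while num_probes(k + 1) <= capacity:
        k += 1
    return k


for capacity in (1_000_000, 10_000_000):
    k = max_probe_length(capacity)
    print(capacity, k, num_probes(k))
```

Under the 1 to 10 million probe limit quoted above, this puts the usable probe length at roughly 9 to 11 nucleotides.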
Sequence assembly summary
• Two general algorithmic strategies
• Hamiltonian paths in overlap graphs
• Eulerian paths in k-mer graphs
• Biggest challenges
• Repeats!
• Read errors
• Complicate computing read overlaps
• Repeats
• Roughly half of the human genome is composed of repetitive elements
• Repetitive elements can be long (1000s of bp)
• Human genome
• 1 million Alu repeats (~300 bp)
• 200,000 LINE repeats (~1000 bp)
Overlap-Layout-Consensus
• Most common assembler strategy for long reads
1. Overlap: Find all significant overlaps between reads, allowing for errors
2. Layout: Determine a path through the overlapping reads that represents the assembled sequence
3. Consensus: Correct for errors in reads using the layout
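For the overlap step, a minimal sketch is a suffix-prefix match between two reads. Note the simplification: the function below accepts exact matches only, whereas the overlap step described above allows for errors, which real assemblers handle with alignment. The name `overlap` and the `min_length` parameter are illustrative assumptions.

```python
def overlap(a: str, b: str, min_length: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Only exact matches are considered. Returns 0 if the longest match is
    shorter than min_length.
    """
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


print(overlap("ATGCA", "GCATT"))
print(overlap("AAAA", "CCCC"))
```

Computing this for every ordered pair of reads gives the weighted edges of the overlap graph used in the layout step.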

Consensus

Layout                          GTATCGTAGCTGACTGCGCTGC
                             ATCGTCTCGTAGCTGACTGCGCTGC
                        ATCGTATCGAATCGTAG
            TGACTGCGCTGCATCGTATCGTATC

Consensus   TGACTGCGCTGCATCGTATCGTATCGTAGCTGACTGCGCTGC
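One simple way to derive a consensus from a layout is a per-column majority vote. The sketch below applies it to the four reads from the slide. The offsets are an assumption: they were recovered by aligning each read against the consensus string, since the extracted text does not preserve the original spacing.

```python
from collections import Counter


def consensus(layout):
    """Majority-vote consensus. layout is a list of (offset, read) pairs.

    Assumes every column is covered by at least one read.
    """
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(column.most_common(1)[0][0] for column in columns)


# (offset, read) pairs; offsets are assumed, see the note above.
layout = [
    (20, "GTATCGTAGCTGACTGCGCTGC"),
    (17, "ATCGTCTCGTAGCTGACTGCGCTGC"),
    (12, "ATCGTATCGAATCGTAG"),
    (0, "TGACTGCGCTGCATCGTATCGTATC"),
]
print(consensus(layout))
```

The two reads that each contain a single mismatch are outvoted by the other reads covering the same columns.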

Hard vs. Easy
• Eulerian path – visit every edge once (easy)
• Hamiltonian path – visit every node once (hard)
• Superstring (easy)
• Shortest common superstring (hard)
• Aligning 2 sequences (easy)
• Aligning k > 2 sequences (hard)
• Shortest path (easy)
• Longest path (hard)
Kruskal’s Minimum Spanning Tree (MST) Algorithm

• A minimum spanning tree (MST), or minimum weight spanning tree, for a weighted, connected, undirected graph is a spanning tree whose weight is less than or equal to the weight of every other spanning tree.

• Kruskal’s algorithm sorts all edges of the given graph in increasing order of weight. It then adds edges to the MST one at a time, skipping any edge that would form a cycle.

• It considers the minimum-weight edge first and the maximum-weight edge last. It thus makes a locally optimal (greedy) choice in each step, which for this problem yields the optimal solution.
Greedy Algorithm Examples
• Kruskal’s Algorithm for Minimum Spanning Tree
• Minimum spanning tree: a set of n-1 edges that connects a graph of n vertices
without any cycles and that has minimal total weight
• Kruskal’s algorithm adds the edge that connects two components with the
smallest weight at each step without introducing a cycle
• Proven to give an optimal solution
Greedy Algorithm Examples
Edges of the example graph (vertices 0 to 8), sorted by weight:

Weight  Src  Dest
1       7    6
2       8    2
2       6    5
4       0    1
4       2    5
6       8    6
7       2    3
7       7    8
8       0    7
8       1    2
9       3    4
10      5    4
11      1    7
14      3    5
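Kruskal's algorithm can be sketched with a union-find structure to detect cycles. The edge list below is an assumption: it takes the example's rows as (weight, src, dest) on vertices 0 to 8, keeping only the 14 rows that are in sorted weight order. The function name `kruskal` is illustrative.

```python
def kruskal(num_vertices, edges):
    """Return the MST edges of a connected undirected graph.

    edges is a list of (weight, u, v) tuples with vertices 0..num_vertices-1.
    """
    parent = list(range(num_vertices))

    def find(x):
        """Find the representative of x's component, with path halving."""
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for weight, u, v in sorted(edges):      # cheapest edge first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                # different components: no cycle
            parent[root_u] = root_v
            mst.append((weight, u, v))
    return mst


# Assumed edge list (weight, src, dest) for the example graph.
edges = [
    (1, 7, 6), (2, 8, 2), (2, 6, 5), (4, 0, 1), (4, 2, 5), (6, 8, 6), (7, 2, 3),
    (7, 7, 8), (8, 0, 7), (8, 1, 2), (9, 3, 4), (10, 5, 4), (11, 1, 7), (14, 3, 5),
]
mst = kruskal(9, edges)
print(mst, sum(weight for weight, _, _ in mst))
```

The loop could stop as soon as n-1 edges have been accepted; it is left running over all edges here to keep the sketch short.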
References
• Lecture notes of Colin Dewey @ University of Wisconsin-Madison
• Lecture notes of Arne Elofsson @ Stockholm University
• Lecture notes of Yuzhen Ye @ Indiana University
