Professional Documents
Culture Documents
Homologysearch
Homologysearch
Ming Li
Canada Research Chair in Bioinformatics
Computer Science. U. Waterloo
I want to show you how one simple
theoretical idea can make big impact to
a practical field.
Outline
Motivation, market, biology, homology search
Review dynamic programming, BLAST heuristics
Optimized spaced seeds
The idea
How to compute them
Data models, coding regions, HMM, vector seeds
Multiple optimized spaced seeds
PatternHunter program
Mathematical theory of spaced seeds
Why they are better
Complexity of finding optimal spaced seeds
Applications beyond bioinformatics & Open problems
1. Introduction
Biology
Motivation and market
Definition of homology search problem
Biology
DNA Protein
(Genotype) Phenotype
Human: 3 billion bases, 30k genes.
T A
A T
E. coli: 5 million bases, 4k genes
C G
T A
cDNA
reverse transcription
A T transcription translation
G C
C G
mRNA Protein
G (A,C,G,U) (20 amino acids)
C
Years
Internet search
Size limit: 5 billion people x homepage size
Supercomputing power used: ½ million CPU-hours/day
Query frequency: Google --- 112 million/day
Query type: exact keyword search --- easy to do
Homology search
Size limit: 5 billion people x 3 billion basepairs +
millions of species x billion bases
10% (?) of world’s supercomputing power
Query frequency: NCBI BLAST -- 150,000/day,
15% increase/month
Query type: approximate search
Bioinformatics Companies living
on Blast
Paracel (Celera)
TurboGenomics (TurboWorx)
Example:
0001 1 1 011 1 11 1 1 1 11 1001101 11 10
AGCGATGTCACGCGCCCGTATTTCCGTA
G
| | | | | | x| | | | | | | | | | | | ||
TCGGATCTCACGCGCCCGGCTTACCGTG
BLAST Dilemma:
If you want to speed up, have to use a
longer seed. However, we now face a
dilemma:
increasing seed size speeds up, but loses
sensitivity;
decreasing seed size gains sensitivity, but
loses speed.
How do we increase sensitivity & speed
simultaneously? For 20 years, many
tried: suffix tree, better programming ..
3. Optimized Spaced Seeds
Why are they better
How to compute them
Data models, coding regions, HMM
PatternHunter
Multiple spaced seeds
Vector seeds
New thinking
Three lines of existing approaches:
Smith-Waterman exhaustive dynamic programming.
Blast family: find seed matches (11 for Blastn, 28 for
MegaBlast), then extend. Dilemma: increasing seed size speeds
up but loses sensitivity; descreasing seed size gains sensitivity
but loses speed.
Suffix tree: MUMmer, Quasar, REPuter. Only good for precise
matches, highly similar sequences.
Blast approach is the only way to deal with real and
large problems, but we must to solve Blast dilemma: We
need a way to improve sensitivity and speed
simultaneously.
Goals: Improve (a) Blast, (b) Smith-Waterman.
Optimal Spaced Seed
(Ma, Tromp, Li: Bioinformatics, 18:3, 2002, 440-445)
1100111011101101011101101011111011101
111*1**1*1**11*111
Which seed is more likely to hit this region:
BLAST seed: 11111111111
Spaced seed: 111*1**1*1**11*111
Sensitivity: PH weight 11 seed vs Blast 11 & 10
PH 2-hit sensitivity vs Blastn 11, 12 1-hit
Expect Less, Get More
Lemma: The expected number of hits of a weight W
length M seed model within a length L region with
similarity p is
(L-M+1)pW
Proof: The expected number of hits is the sum, over the L-M+1
possible positions of fitting the seed within the region, of the
probability of W specific matches, the latter being pW. ■
BLAST PatternHunter
E.coli vs H.inf 716s 14s/68M
Arabidopsis 2 vs 4 -- 498s/280M
Human 21 vs 22 -- 5250s/417M
Human(3G) vs Mouse(x3=9G)* 19 years 20 days
A. thaliana chr 2 vs 4
E. Coli vs H. influenza
Genome Alignment by PatternHunter (4 seconds)
Prior Literature
Over the years, it turns out, that the mathematicians
know about certain patterns appear more likely than
others: for example, in a random English text, ABC is
more likely to appear than AAA, in their study of
renewal theory.
Random or multiple spaced q-grams were used in the
following work:
FLASH by Califano & Rigoutsos
Multiple filtration by Pevzner & Waterman
LSH of Buhler
Praparata et al
Coding Region & HMM
The Hidden Markov Model is a finite set of states, each of
which is associated with a probability distribution. Transitions
among the states are governed by a set of probabilities called
transition probabilities. In a particular state an outcome or
observation can be generated, according to the associated
probability distribution. It is only the outcome, not the state,
which is visible to an external observer and therefore states are
``hidden'' from the outside; hence the name Hidden Markov
Model. An HMM has the following components satisfying usual
probability laws:
The number of states of the model, N.
The number of observation symbols in the alphabet, M.
A set of state transition probabilities, depending on current state.
A probability distribution of the observable symbol at each of the
states.
Modeling coding region with HMM
PatternHunter original assumption: I is a sequence of N
independent Bernoulli variables X0, … XN-1 with P(Xi)=p.
Coding region third position usually is less conserved. Skipping it
should be a good idea (BLAT 110110110). With spaced seeds, we
can model it using HMM.
Brejova, Brown, Vinar: M(3) HMM: I is a sequence of N independent
Bernoulli random variables X0,X1, … XN-1 where Pr(Xi=1)=pi mod 3. In
particular:
p0 p1 p2
human/mouse 0.82 0.87 0.61
human/fruit fly 0.67 0.77 0.40
DP algorithm naturally extends to M(3),
Opt seed (weight 8) for coding region: 11011011000011
Picture of M(3)
From Brejova-Brown-Vinar
Running PH
Download at: www.BioinformaticsSolutions.com
The biggest problem for Blast was low sensitivity (and low
speed). Massive parallel machines are built to do Smith-
Waterman exhaustive dynamic programming.
Spaced seeds give PH a unique opportunity of using several
optimal seeds to achieve optimal sensitivity, this was not
possible for Blast technology.
Economy --- 2 seeds (½ time slow down) achieve sensitivity of
1 seed of shorter length (4 times speed).
With multiple optimal seeds. PH II approaches Smith-Waterman
sensitivity, and 3000 times faster.
Finding optimal seeds, even one, is NP-hard.
Experiment: 29715 mouse EST, 4407 human EST. Sensitivity
and speed comparison next two slides.
Finding Multiple Seeds (PH II)
Computing the hit probability of given k seeds on a uniformly
distributed random region is NP-hard.
Finding a set of k optimal seeds to maximize the hit probability
cannot be approximated within 1-1/e unless NP=P
Given k seeds, algorithm DP can be adapted to compute their
sensitivity (probability one of the seeds has a hit).
Greedy Strategy
Compute one optimal seed a. S={a}
Given S, find b maximizing hit probability of S U {b}.
Coding region, use M(3), 0.8, 0.8, 0.5 (for mod 3 positions)
We took 12 CPU-days on 3GHz PC to compute 16 weight 11
seeds. General seeds (first 4): 111010010100110111,
111100110010100001011, 11010000110001010111,
1110111010001111.
Sensitivity Comparison with Smith-Waterman (at 100%)
The thick dashed curve is the sensitivity of Blastn, seed weight 11.
From low to high, the solid curves are the sensitivity of PH II using
1, 2, 4, 8 weight 11 coding region seeds, and the thin dashed curves
are the sensitivity 1, 2, 4, 8 weight 11 general purpose seeds,
respectively
Speed Comparison with Smith-Waterman
Define the transition matrix C=(Ci,j). It is a K x K matrix. Let x(n) = (x1(n), x2(n),
… , xK(n)). Then, assuming that M' divides n,
x(n) = x(1) ◦ (CT)n/M’.
Because Ci,j > 0 for all i,j, C is a positive matrix. The row sum of the i-th row is
the probability that a length-(|s|+M') region ending with t i does not have a
hit. The row sum is lower bounded by (1-p)d (1-pW ), where (1-p)d is the
probability of generating d 0's and 1-pW is the probability of generating a
string in T. It is known that the largest eigenvalue of C is positive and
unique, and is lower bounded by the smallest row sum and upper bounded
the largest row sum of C. Let λ1>0 be the largest eigenvalue, and λ2, 0 < ||
λ2|| < λ1 be the second largest. There is a unique eigenvector
corresponding to λ1.
x(n) / ||x(n)|| = x(1) ◦ (CT )n/M’ / ||x(1) ◦ (CT)n/M’||
converges to the eigenvector corresponding to λ1. As λ1n tends to zero, we
can use standard techniques to normalize xn as
y(n) = x(n) / ||x(n)||2
and then x(n+M’) = C y(n) .
Proof Continued …
Then (by power method) the Rayleigh quotient of
converges to λ1. The convergence speed depends on the ratio λ1/ |λ2|, it is
λ2 ≤ λ1 (K-1)/(K+1)
where K= maxi,j,k,l (Ci,j Ck,l / Ci,l Ck,j). For any i,j, we have
where a is the number of 1's in tj, b the number of 0's in tj. K = O(1/(1-p)d ).
As (1 - 1/K)K goes to 1/e, the convergence can be achieved in O(K) = O(1/(1-
p)d ) steps. The time complexity is therefore upper bounded by an
exponential function in |s|. QED
Theorem 2. It is NP hard to find the optimal seed
(Li, Ma, Kisman, Tromp, J. Bioinfo Comput. Biol. 2004)
Proof. Reduction from 3-SAT. Given a 3-SAT instance C 1 … Cn, on Boolean variables x1, … , xm,
where Ci= l1 + l2 + l3, each l is some xi or its complement. We reduce this to seed selection
problem. The required seed has weight W=m+2 and length L=2m+2, and is intended to be
of the form 1(01+10)m1, a straightforward variable assignment encoding. The i-th pair of
bits after the initial 1 is called the code for variable x i. A 01 code corresponds to setting xi
true, while a 10 means false. To prohibit code 11, we introduce, for every 1 ≤ i ≤ m, a
region
Since the seed has m 0s, it cannot bridge the middle 0 m+1 part, and is thus forced to hit at the
start or at the end. This obviously rules out code 11 for variable x i. Code 00 automatically
becomes ruled out as well, since the seed must distribute m 1s among all m codes. Finally,
for each clause Ci, we introduce
Theorem 3. Given a set of seeds, choose k best can be approximated with ratio 1-
1/e.
Theorem 5 [L. Zhang] If the length of a spaced seed is not too long, then it strictly
outperforms consecutive seed, in asymptotic hit probability.
5. Applications beyond bioinformatics
Stock Market Prediction
Zou-Deng-Li: Detecting Market Trends by Ignoring It, Some Days, March, 2005
Learning/inference methods or
complete and comprehensive systems
are beyond this study. They can be
used in conjunction with our proposal.
Background
Hundreds of market indicators are used (esp.
in automated systems). Typical:
Common sense: if the past k days are going up,
then the market is moving up.
Moving average over the last k days. When the
average curve and the price curve intersect,
buy/sell.
Special patterns: a wedge, triangle, etc.
Volume
Hundreds used in automated trading systems.
Note: any method will make money in the
right market, lose money in the wrong market
Problem Formalization
The market movement is modeled as a 0-1 sequence, one bit per day,
with 0 meaning market going down, and 1 up.
S(n,p) is an n day iid sequence where each bit has probability p being
1 and 1-p being 0. If p>0.5, it is an up market
Ik=1k is an indicator that the past k days are 1’s.
I8 has sensitivity 0.397 in S(30,0.7), too conservative
I8 has false positive rate 0.0043 in S(100, 0.3). Good
Iij is an indicator that there are i 1’s in last j days.
I811 has high sensitivity 0.96 in S(30,0.7)
But it is too aggressive at 0.139 false positive rate in S(100, 0.3).
Spaced seeds 1111*1*1111 and 11*11111*11 combine to
have sensitivity 0.49 in S(30,0.7)
False positive rate 0.0032 in S(100, 0.3).
Consider a betting game: A player bets a number k. He wins k dollars
for a correct prediction and o.w. loses k dollars. We say an indicator A
is better than B, A>B, if A always wins more and loses less.
Sleeping on Tuesdays and Fridays
1.E-08
0.1 0.2 0.3 0.4 0.5
1
0.9
0.8
0.7
0.6
0.5
0.4
1111*1*1111; 11*11111*11
0.3
11111111
0.2
9-out-of-11
0.1
0
0.5 0.6 0.7 0.8 0.9 1
Two experiments
We performed two trading experiments
One artificial
One on real data (S&P 500, Nasdaq
indices)
Experiment 1: Artificial data
This simple HMM 4/5
generates a very
artificial simple model
EVEN:
5000 days (bits) P(1)=0.5