STS Research Paper Final
dimensional space to $4^K$ (where $K \ll L$) dimensional space is often used in SimHash. Figure 2 illustrates the concept of the SimHash algorithm in this research.
Figure 2. On the left is the main part of the SimHash algorithm. SimHash traverses the sequence to tally the number of occurrences of the tags, and the hash is obtained through a linear combination of the tally of each tag and its weight.
[Figure 2 content: a window slides along the sequence while a running tally is kept for each tag in the set of tags; the hash is the sum, over all tags, of (number of occurrences of the tag) × Weight(tag).]

In this research, K was chosen to be 6 (which is explained later), giving a total of 4096 ($4^6$) possible tags. Of these tags, 400 were selected to be used in set T for the SimHash function, as they were the length-K subsequences that appeared most frequently in the whole microbial DNA dataset. Set T was then weighted according to the following formula:
$$\text{Weight}(t) = \frac{\#\text{ of occurrences of tag } t}{\text{minimum } \#\text{ of occurrences among all tags}} \times \text{scale factor}$$
(Scale factor of 20)
Thus, using the formula, the 400th tag (the least frequent one) would have a weight equal
to the scale factor, while the first tag (the most frequent one) would have a much larger weight.
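The tag selection, weighting, and hashing steps described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in this research: the function names (`count_kmers`, `select_weighted_tags`, `simhash`) and the toy sequences are hypothetical, and the weight is assumed to be the tag count divided by the smallest count among the selected tags, multiplied by the scale factor.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Tally every overlapping length-k subsequence of one sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def select_weighted_tags(sequences, k, num_tags, scale=20.0):
    """Pick the num_tags most frequent k-mers in the dataset and weight them.

    Assumed weighting: weight(tag) = count(tag) / min_count * scale,
    where min_count is the count of the least frequent selected tag.
    """
    totals = Counter()
    for seq in sequences:
        totals.update(count_kmers(seq, k))
    top = totals.most_common(num_tags)   # sorted from most to least frequent
    min_count = top[-1][1]
    return {tag: scale * count / min_count for tag, count in top}


def simhash(sequence, weights, k):
    """Linear combination of each tag's tally and its weight."""
    tally = count_kmers(sequence, k)     # missing tags count as 0
    return sum(tally[tag] * weight for tag, weight in weights.items())


# Toy data for illustration only (the study used K = 6 and 400 tags).
toy_sequences = ["ATCAATC", "ATCG"]
toy_weights = select_weighted_tags(toy_sequences, k=3, num_tags=2)
toy_hashes = [simhash(seq, toy_weights, k=3) for seq in toy_sequences]
```

With this weighting, the least frequent selected tag always receives a weight equal to the scale factor, which matches the behavior described in the text.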
D) Clustering
The Hamming distance function is used to compute the difference between a pair of SimHash values, which determines whether the two sequences belong to the same cluster.
Defining V to be the set of size N,

$$V = \{v_s\}_{s=1}^{N}$$

where $v_s$ is the SimHash of sequence s.

For any pair ($v_i$ and $v_j$) of SimHash values in V, the pair-wise distance is

$$H_{ij} = \text{hamming\_distance}(v_i, v_j).$$

The output of the SimHash-based clustering algorithm is a cluster set C of size R:

$$C = \{c_r\}_{r=1}^{R}$$

where each cluster c is a subset of SimHash values of size L (L < N):

$$c = \{v_l\}_{l=1}^{L}$$

such that, for any pair of SimHash values in cluster c,

$$H_{ij} < \theta \quad (i = 1, \ldots, L;\ j = 1, \ldots, L)$$

where $\theta$ is the SimHash threshold.
The actual implementation of the sequence clustering algorithm proceeds as follows:
1) Loop through every sequence s in S.
2) If s has already been assigned to a cluster, skip it.
3) If s has not been assigned, assign it to a new cluster c, then
a. Go through every other sequence s′ in S.
b. If s′ has not been assigned, and the similarity metric of s and s′ is below the threshold, assign s′ to cluster c.
4) Continue until all sequences have been assigned to a cluster.
The number of clusters that are created depends on the threshold value. Stricter (smaller) threshold values require that sequences be more similar in order to be assigned to the same cluster, resulting in a larger number of clusters, while large threshold values may be too lax and assign sequences that are extremely different to the same cluster.
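The greedy assignment loop described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `greedy_cluster` is hypothetical, and the distance defaults to the absolute difference between two numeric hash values. A Hamming distance function can be passed through the `distance` parameter instead.

```python
def greedy_cluster(hashes, threshold, distance=lambda a, b: abs(a - b)):
    """Greedy incremental clustering over SimHash values.

    Each unassigned item seeds a new cluster, and every other unassigned
    item whose distance to that seed is below the threshold joins it.
    Returns one integer cluster label per input hash.
    """
    labels = [None] * len(hashes)
    next_label = 0
    for i, seed in enumerate(hashes):
        if labels[i] is not None:
            continue                      # step 2: already assigned, skip
        labels[i] = next_label            # step 3: seed of a new cluster
        # Items before index i are already assigned, so only later
        # items need to be checked against this seed.
        for j in range(i + 1, len(hashes)):
            if labels[j] is None and distance(seed, hashes[j]) < threshold:
                labels[j] = next_label
        next_label += 1
    return labels


# Toy hash values for illustration only.
example_labels = greedy_cluster([100.0, 105.0, 400.0, 98.0, 405.0], threshold=10)
```

Lowering `threshold` in this sketch splits the toy data into more clusters, which mirrors the trade-off described in the paragraph above.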
Results and Discussion
A) Selecting SimHash Parameters
The selection of parameters T (the set of tags used in the hash) and K (the length of each tag) affects the accuracy of the SimHash algorithm. Table 2 displays the effects of the size of T and the value of K on the accuracy of the algorithm. With a larger number of tags included in T, the accuracy of the SimHash algorithm increases, albeit at the cost of runtime. The value of K dictates the total number of possible tags, which is equal to $4^K$, that could be used in a DNA sequence analysis. Increasing K, however, only improves the accuracy of the SimHash algorithm until K reaches a certain threshold. After that point (K = 6 in this study), the accuracy benefit of increasing K becomes negligible. This is because once the number of possible tags is large enough for the most frequent length-K subsequences to be selected, a further increase in K contributes few additional good candidates to the selection of T. The combination of the number of tags and the value of K that produces the most accurate result in this study (512 tags with K = 6, 75.433%) is highlighted in red in Table 2.
Number of tags    K: Length of tags
used in T         4           5           6           7           8
32                41.587 %    50.234 %    55.634 %    54.297 %    54.455 %
64                51.263 %    58.230 %    66.766 %    67.395 %    67.944 %
128               60.917 %    65.529 %    67.592 %    69.259 %    69.592 %
256               62.923 %    69.646 %    70.233 %    71.343 %    72.013 %
512               64.278 %    72.766 %    75.433 %    74.355 %    75.233 %
Table 2. Accuracy of varying values of K and numbers of tags in SimHash. The accuracy values measure
the percentage of exact match between SimHash and Shingling for top 5% similarity of one sequence to
other sequences.
B) Complexity and Runtime
The SimHash algorithm implemented in this research greatly reduces the computational complexity compared to the conventional shingling algorithm. The shingling algorithm is brute force, traversing the set of all possible pairs of sequences; this produces a complexity of $O(N^2)$. For each pair of sequences, the program must then loop through each pair of subsequences in the two original sequences, resulting in a complexity of $O((L-K)^2)$, where L represents the length of the sequence and K represents the length of the subsequence. The entire shingle computation therefore has a complexity of $O(N^2 (L-K)^2)$. By comparison, the SimHash algorithm has a pre-computation step to compute the hashes for all the sequences; this step has a complexity of $O(NL)$. The next step is to calculate the absolute difference for each pair of sequences, resulting in a complexity of $O(N^2)$. Thus the overall complexity of this method is $O(N^2 + NL)$, which is significantly lower than the Big-O complexity of the shingling algorithm. The actual results, displayed in Table 3, indicate that the SimHash method is hundreds of times faster than the shingling algorithm when computing DNA sequence similarities.
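To make the complexity comparison concrete, the following rough cost model counts elementary comparisons for both approaches. It is a sketch under assumptions not stated in the text: every sequence is taken to have the same length, a sequence of length L is assumed to yield L − K + 1 subsequences, and the function names are hypothetical. It models operation counts only and does not predict measured runtimes.

```python
def pair_count(n):
    """Number of unordered pairs among n sequences."""
    return n * (n - 1) // 2


def shingling_cost(n, length, k):
    """Every pair of sequences compares every pair of subsequences."""
    windows = length - k + 1
    return pair_count(n) * windows * windows


def simhash_cost(n, length, k):
    """One pass per sequence to build its hash, then one comparison per pair."""
    windows = length - k + 1
    return n * windows + pair_count(n)


# Hypothetical sizes chosen for illustration only.
ratio = shingling_cost(1000, 100, 6) / simhash_cost(1000, 100, 6)
```

In this model the gap between the two methods widens as the number of sequences grows, because the per-pair work of shingling is multiplied by the quadratic number of pairs.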
In Figure 3, it can be observed that as the number of sequences increases, the
performance gain of SimHash over the shingling algorithm is greater. When analyzing large
datasets containing millions of microbial DNA sequences, as with the data from Sorcerer II,
using SimHash can yield better runtime efficiency.
Sample ID    Number of reads    Runtime for shingling algorithm (sec)    Runtime for SimHash (sec)
53R          13,040             18.66                                    0.53
55R          10,134             14.405                                   0.441
112R         16,087             23.252                                   0.691
115R         16,651             24.189                                   0.678
137          14,147             20.347                                   0.552
138          13,241             19.064                                   0.531
FS312        55,592             81.378                                   2.243
FS396        83,399             123.296                                  3.072
Table 3. Comparison of shingling algorithm and
SimHash runtimes in generating similarity metrics for
all eight samples. All benchmarks were done on a
computer with an Intel Core 2 Duo E8400 @ 3.00
GHz and 4 GB of RAM.
Figure 3. Runtime benchmarks for the shingling algorithm and SimHash methods using increasingly larger sequence datasets. The sequences came from the 112R sample. All benchmarks were done on a computer with an Intel Core 2 Duo E8400 @ 3.00 GHz and 4 GB of RAM.
[Figure 3 plot: runtime (sec) on the vertical axis against number of reads on the horizontal axis, with one series each for Shingling and SimHash.]

C) Accuracy of SimHash Compared with Shingling Algorithm
The accuracy of the SimHash method was determined by comparing its similarity metric output against the metric output of the shingling method. For each DNA sequence, the output from either the SimHash or the shingling method was composed of the similarity metrics between the sequence and all the other DNA sequences, ordered from the most similar to the least similar. Top-N accuracies were then computed for each sequence, as Top-1 accuracy, Top-2 accuracy, Top-3 accuracy, etc., with N representing the number of the most similar DNA sequences included in the computation.
Top-N accuracy was calculated as the number of the N most similar sequences identified by the SimHash method that were also in the set of the N most similar sequences identified by the shingling method, divided by N. A Top-N accuracy of 100% indicates that, for a given sequence, both the shingling- and hashing-based methods identified the same N sequences.
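The Top-N accuracy calculation can be illustrated with a short sketch. The function name `top_n_accuracy` and the sequence identifiers are hypothetical, and each ranking is assumed to be a list of distinct sequence IDs ordered from most to least similar, containing at least N entries.

```python
def top_n_accuracy(simhash_ranking, shingling_ranking, n):
    """Fraction of the SimHash top-n list that also appears in the shingling top-n list."""
    top_simhash = set(simhash_ranking[:n])
    top_shingling = set(shingling_ranking[:n])
    return len(top_simhash & top_shingling) / n


# Toy rankings for one query sequence "s0"; the query itself ranks first in both.
simhash_ranked = ["s0", "s3", "s1", "s2"]
shingling_ranked = ["s0", "s1", "s2", "s3"]
accuracies = [top_n_accuracy(simhash_ranked, shingling_ranked, n) for n in (1, 2, 4)]
```

Because the comparison is made on sets, the order of sequences within the top N does not affect the score; only membership does.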
Figure 4. Top-N accuracy decreases as N increases. However, the accuracy asymptotes at 40% as N
increases to infinity. This graph only shows 40 randomly selected sequences and their top-50 ranked list
of similar sequences. The red line shows the average accuracy of these 40 sequences.
When N = 1, only the sequence itself is selected as the most similar DNA sequence by both the SimHash and the shingling method; thus, Top-1 accuracy is always 100%. When N increases, the Top-N accuracy degrades, as expected. The Top-N accuracy asymptotes at 40% as N increases to infinity. It is essential to note that despite the tendency of Top-N accuracy to decrease as N increases, only the accuracy at the lower values of N influences the result of sequence clustering in this research. This is similar to a Google web search: it is important that the top few search results are very relevant and are good matches, while the later entries, being less significant in the search result set, could have low relevance.
D) Comparing SimHash Sequence Clustering with State-of-the-Art Algorithms
The Rand index (RI) was used to assess the clustering accuracy of the proposed SimHash algorithm. RI measures the percentage of decisions that are correct, with penalties for incorrect decisions. The goal is to assign two sequences to the same cluster if and only if they are similar. A true positive (TP) decision assigns two similar sequences to the same cluster, and a true negative (TN) decision assigns two dissimilar sequences to different clusters. There are two types of incorrect decisions: a false positive (FP) decision assigns two dissimilar sequences to the same cluster, and a false negative (FN) decision assigns two similar sequences to different clusters. RI is calculated using the following formula:

$$RI = \frac{TP + TN}{TP + FP + FN + TN}$$
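A pairwise computation of the Rand index, following the TP, TN, FP, and FN definitions above, might look like the sketch below. The function name `rand_index` and the label lists are hypothetical; the labels are assumed to be cluster IDs given in the same sequence order for the evaluated clustering and the reference clustering, with at least two sequences.

```python
from itertools import combinations


def rand_index(predicted, reference):
    """Share of sequence pairs on which two clusterings agree."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_predicted = predicted[i] == predicted[j]
        same_reference = reference[i] == reference[j]
        if same_predicted and same_reference:
            tp += 1      # similar pair placed in the same cluster
        elif not same_predicted and not same_reference:
            tn += 1      # dissimilar pair placed in different clusters
        elif same_predicted:
            fp += 1      # dissimilar pair placed in the same cluster
        else:
            fn += 1      # similar pair placed in different clusters
    return (tp + tn) / (tp + tn + fp + fn)


# Toy cluster labels for four sequences.
toy_ri = rand_index([0, 0, 1, 1], [0, 0, 0, 1])
```

This direct version enumerates every pair, so its cost grows quadratically with the number of sequences; it is meant to show the definition rather than to scale to a full dataset.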
Reference clusters were generated using the shingling algorithm, which was used to produce ranked similarity metrics. Then, the greedy incremental approach (the same kind of algorithm that CD-HIT and UCLUST use for grouping clusters) was used to partition sequences into clusters, taking the ranked similarity metric as input. Because shingling is a brute force algorithm, it was assumed to be the most accurate, and thus its clusters were treated as the reference dataset for assessing the other algorithms.
The SimHash sequence clustering method was evaluated in comparison with two state-of-the-art methods, UCLUST and CD-HIT. Table 4 shows the comparative analysis for all eight samples. The runtime and Rand index were captured on each sample separately, as well as on the total of all samples. The SimHash algorithm produced more accurate clustering than CD-HIT and UCLUST in all eight samples and in the total sample. In terms of efficiency, SimHash ran much faster than CD-HIT, reducing the time from 36 seconds to approximately 13 seconds when clustering all 222,291 sequences. Although UCLUST was slightly faster than SimHash, the difference between the two was almost negligible, with both taking approximately 12 to 13 seconds to cluster all 222,291 sequences.
Sample ID    # Reads    Algorithm    Parameters        RI Accuracy    Runtime (sec)    # Clusters
53R          13,040     SimHash      threshold: 270    0.9998         0.588            916
                        CD-HIT       identity: 96%     0.9919         1.378            963
                        UCLUST       identity: 94%     0.9796         0.476            971
55R          10,134     SimHash      threshold: 270    0.9996         0.463            845
                        CD-HIT       identity: 96%     0.984          1.07             841
                        UCLUST       identity: 94%     0.9781         0.381            829
112R         16,087     SimHash      threshold: 270    0.9996         0.752            1,122
                        CD-HIT       identity: 92%     0.994          1.989            1,044
                        UCLUST       identity: 86%     0.9821         0.664            1,024
115R         16,651     SimHash      threshold: 270    0.9996         0.773            940
                        CD-HIT       identity: 95%     0.9844         1.611            936
                        UCLUST       identity: 93%     0.9784         0.592            934
137          14,147     SimHash      threshold: 270    0.9997         0.624            853
                        CD-HIT       identity: 96%     0.9883         1.277            889
                        UCLUST       identity: 93%     0.9874         0.482            824
138          13,241     SimHash      threshold: 270    0.9993         0.598            859
                        CD-HIT       identity: 96%     0.9928         1.543            853
                        UCLUST       identity: 94%     0.9852         0.451            882
FS312        55,592     SimHash      threshold: 150    0.9997         2.494            1,887
                        CD-HIT       identity: 90%     0.9926         7.952            1,881
                        UCLUST       identity: 86%     0.9904         2.336            1,965
FS396        83,399     SimHash      threshold: 150    0.9996         3.419            1,956
                        CD-HIT       identity: 94%     0.9844         11.855           1,954
                        UCLUST       identity: 90%     0.9852         3.204            1,981
All          222,291    SimHash      threshold: 50     0.9922         12.92            6,085
                        CD-HIT       identity: 94%     0.9876         36.001           6,148
                        UCLUST       identity: 90%     0.9867         12.289           6,202
Table 4. Accuracy and performance of SimHash, CD-HIT (version 4.5.4), and UCLUST (USEARCH
version 6.0.307). The identity ensures that all pairs of sequences in a cluster must have at least that
percentage of sequence similarity. The numbers in bold highlight the best values among the three methods.
The numbers of clusters were generated to match the numbers of operational taxonomic units (OTUs),
reported by Sogin et al., 2006.
Conclusions and Future Work
In this research, I developed an efficient and accurate metagenome clustering algorithm based on SimHash. SimHash transforms a long and complex DNA sequence into a simple fingerprint, which preserves the compositional makeup of the sequence. These fingerprints are far easier to compare than the original DNA sequences. Based on these fingerprints, a similarity metric between DNA sequences was generated and then fed into a greedy incremental clustering engine to partition the DNA sequences into clusters. In order to assess the effectiveness of the proposed method, a benchmark dataset of 222,291 DNA sequences collected at deep-sea underwater locations with varying temperatures and cell concentrations was used. The assessment was divided into two phases: 1) the runtime and speed of generating a similarity matrix were compared with the shingling algorithm, and 2) the overall runtime and accuracy of clustering were compared with two state-of-the-art methods, CD-HIT and UCLUST. The experiment indicated that SimHash ran much faster (on the order of hundreds of times) than the shingling algorithm in generating a similarity matrix. The overall results revealed that the SimHash algorithm performed better than two state-of-the-art methods, CD-HIT and UCLUST, in accuracy, and showed a significant advantage over CD-HIT in runtime. The proposed method can help biologists speed up their research on the genomes of microbes, with implications for new drug design and discovery.
The clustering algorithm can be improved in the future. The currently used clustering algorithm takes a greedy approach, assuming that the first DNA sequence in a new cluster is at the center of the cluster and represents its typical characteristics. All the other sequences are assigned to the same cluster if they are similar enough to the cluster's first sequence (i.e., similarity metric < the threshold); otherwise, they are assigned to another cluster. However, if the algorithm happens to pick a sequence that is actually at the edge of the cluster as the first sequence, then the subsequent assignment of all the other DNA sequences will not be very accurate. In future work, the accuracy of the DNA sequence classification can be further improved by taking more dimensions of comparison into consideration. For example, the linkage clustering technique offers the possibility of comparing a candidate to its nearby objects in a cluster, instead of just to the center, when evaluating the assignment of the candidate to the cluster. The similarity metric generated by the SimHash function can be used as the distance function in linkage clustering when calculating the degree to which a DNA sequence is similar to its neighboring sequences.
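One possible form of the linkage idea is sketched below, purely as an illustration of the proposed future direction and not as part of the evaluated method. It assumes a single-linkage rule (a candidate joins a cluster if it is close enough to any existing member) and uses the absolute difference between numeric hash values as the distance; the function name `linkage_cluster` is hypothetical.

```python
def linkage_cluster(hashes, threshold):
    """Assign each hash to the first cluster containing a close enough member.

    Unlike the seed-only greedy rule, the candidate is compared against
    every current member of a cluster, not just the first sequence.
    Returns one integer cluster label per input hash.
    """
    clusters = []   # list of member lists, one per cluster
    labels = []
    for value in hashes:
        for label, members in enumerate(clusters):
            if min(abs(value - member) for member in members) < threshold:
                members.append(value)
                labels.append(label)
                break
        else:
            # No existing cluster has a nearby member: start a new one.
            clusters.append([value])
            labels.append(len(clusters) - 1)
    return labels


# Toy hash values: 115 is 15 away from the seed (100) but only 7 away from 108.
linkage_labels = linkage_cluster([100, 108, 115, 300], threshold=10)
```

In this toy case the value 115 joins the first cluster through its neighbor 108, whereas a seed-only comparison against 100 would have started a new cluster. The trade-off is that chains of neighbors can stretch a cluster well beyond the threshold distance from its first member.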
References
Broder, A., Glassman, S., Manasse, M., & Zweig, G. (1997). Syntactic clustering of the web. In 6th International World Wide Web Conference (pp. 393-404).
Charikar, M.S. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (pp. 380-388).
Dalevi, D., Ivanova, N.N., Mavromatis, K., Hooper, S.D., Szeto, E., Hugenholtz, P., Kyrpides, N.C., & Markowitz, V.M. (2008). Annotation of metagenome short reads using proxygenes. Bioinformatics, 24(16), i7-i13.
Edgar, R.C. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26(19), 2460-2461.
Gori, F., Folino, G., Jetten, M., & Marchiori, E. (2011). MTR: taxonomic annotation of short metagenomic reads using clustering at multiple taxonomic ranks. Bioinformatics, 27(2), 196-203.
Hugenholtz, P. & Tyson, W.G. (2008). Microbiology: metagenomics. Nature, 455(7212), 481-483.
Li, W.Z. & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658-1659.
Rasheed, Z., Rangwala, H., & Barbara, D. (2012). Efficient clustering of metagenomic sequences using locality sensitive hashing. In SIAM International Conference in Data Mining (pp. 1023-1034).
Sadowski, C. & Levin, G. (2007). SimHash: hash-based similarity detection. Technical Report, Google.
Sogin, M., Morrison, H., Huber, J., Welch, D., Huse, S., Neal, P., Arrieta, J., & Herndl, G. (2006). Microbial diversity in the deep sea and the underexplored rare biosphere. PNAS, 103(32), 12115-12120.
Tringe, S.G. et al. (2005). Comparative metagenomics of microbial communities. Science, 308(5721), 554-557.
Uddin, S., Roy, C.K., Schneider, K.A., & Hindle, A. (2011). On the effectiveness of simhash for detecting near-miss clones in large scale software systems. Proc. WCRE, 13-22.