STS Research Paper Final

Using Hashing Algorithms for Clustering Metagenomes

Abstract

Metagenomics is a genomic approach that uses culture-independent sequencing to study
the identification, abundance, and interaction of microorganisms under different environments. It
provides a powerful lens through which microbial communities that make life possible can be
observed and examined. However, metagenomic sequencing often yields a massive quantity of
partial and noisy DNA sequences, and therefore, poses tremendous challenges in data analysis. It
is well recognized that metagenomic sequencing technologies have created a tremendous gap
between available sequence data and their biological interpretation, and that this gap will not be
closed without the help of more sophisticated computational approaches. In this research, a
unique SimHash-based sequence clustering algorithm was designed, which reduces the
computational complexity by generating a small fingerprint to represent the compositional
makeup of the entire sequence. These fingerprints, rather than the original long and complex
sequences, were then used in calculating sequence similarities. The partition of DNA sequences
into clusters was implemented using the greedy incremental approach, taking similarity metrics
generated by SimHash as the resemblance measurement between sequence pairs. The proposed
method was evaluated against two state-of-the-art methods, CD-HIT and UCLUST, on a
benchmark dataset of 222,291 DNA sequences collected from eight underwater samples from the
North Atlantic Deep Water and Axial Seamounts. The empirical results revealed that SimHash
produced more accurate clusters than both CD-HIT and UCLUST, and demonstrated that
SimHash was computationally efficient in comparison to both of the other methods.

Introduction

Genomics refers to the determination of the DNA sequences of organisms. In 2003, a thirteen-year
project known as the Human Genome Project was completed, mapping out the entire human
genome. Since then, biologists have begun focusing on the genomes of other organisms. This
shift of interest marked the creation of a rapidly emerging new field called metagenomics, which
applies the power of genomics to an entire community of coexisting microorganisms within an
environmental sample, without cultivating clonal samples of each microorganism separately
(Tringe et al., 2005). Microbes are vital to many of our bodily functions, such as digestion of
food, and are responsible for the recycling of resources throughout our ecosystem. Determining
the content, abundance, and interaction of these different microbes is crucial for understanding
the pathogenic or evolutionary role played by these microbes, and thus, can potentially reveal
ways to meet myriad challenges in biomedicine.
The full metagenomic process begins with an environmental sample containing hundreds
or thousands of species of microbes. Genomic sequencing is performed on the sample, producing
a large set of DNA segments, which are also called reads (Gori et al., 2011). To assign these
voluminous DNA segments correctly to each species of microorganisms, computational
techniques are increasingly becoming indispensable tools. One computational technique,
clustering, is widely used to partition the entire set of sequence data into distinct groups of
sequences that are species-specific. Clustering is useful because it is a crucial initial step in the
rapid analysis of sequence diversity and internal structure, and for identification of the content
and abundance of microbial species within the samples (Dalevi et al., 2008). Existing clustering
methods include the shingling algorithm, CD-HIT, and UCLUST. The shingling algorithm (Broder,
Glassman, Manasse, & Zweig, 1997) is considered the most accurate because it follows a
brute-force approach. However, it is computationally intensive, requiring substantial processing
power and time. Other methods have attempted to improve the process, trading accuracy
for speed. CD-HIT uses a heuristic based on statistical k-mer filtering to speed up clustering (Li
& Godzik, 2006). UCLUST achieves better efficiency by using high-scoring segment
pairs (HSPs) to compare pairs of sequences (Edgar, 2010). Both UCLUST and CD-HIT
use the greedy incremental approach to partition the sequences into clusters. Although CD-HIT
and UCLUST have achieved some success, the following challenges still remain in clustering
metagenomes:
a) The massive quantity of metagenomic data is still a tremendous challenge (Hugenholtz &
Tyson, 2008). For example, the Global Ocean Sampling expedition, undertaken by
Sorcerer II, produced millions of sequences (Sogin et al., 2006).
b) Partial, erroneous, and biased sequence sampling poses another challenge. A
metagenomic sample contains many different microorganisms (bacteria, viruses, and
small eukaryotes), each of which has a different genome structure. Even the most extensive
sequencing will most likely provide only partial sampling of the DNA (Gori et al., 2011).
c) The lack of reference genome databases for intrinsically complex microbial communities
is yet another challenge for metagenomic analysis (Rasheed, Rangwala, & Barbara, 2012).
To address these challenges and to find an accurate and efficient method to enhance
metagenomic analysis, I designed and implemented a SimHash-based sequence clustering
method. Through sequence clustering, a large, redundant dataset can be reduced to smaller,
non-redundant (NR) datasets, and erroneous sequences can be identified or reconstructed by using
the consensus of the sequences within each cluster. SimHash, first proposed by Charikar (2002),
has been used in detecting similar files and classifying documents in large systems (Sadowski &
Levin, 2007), and in detecting near-miss clones in large-scale software systems (Uddin, Roy,
Schneider, & Hindle, 2011).
Materials and Methods

A) Overall Flowcharts

Figure 1. Overall flowchart of this research

DNA sequences in FASTA format are fed into the SimHash algorithm to generate hash
values and a similarity matrix, which are then consumed by the greedy incremental
clustering algorithm that assigns each DNA sequence to a cluster. The efficiency and accuracy of
the SimHash-based algorithm are compared to those of CD-HIT (http://bioinformatics.org/cd-hit/)
and UCLUST (http://www.drive5.com/usearch/download.html), using the reference
clusters produced by the shingling algorithm.
B) Datasets
In this research, the primary data came from underwater samples from the North Atlantic
Deep Water and Axial Seamounts. A total of eight data samples (IDs: 53R, 55R, 112R, 115R,
137, 138, FS312, and FS396) were used, containing a total of 222,291 DNA sequences. These
samples, taken at different sites and depths, varied in temperature from 2.3 °C to 31.2 °C and in
cell density from 3.3 × 10^4 to 1.8 × 10^5 cells per mL of water (Table 1) (Sogin et al., 2006).

Sample ID   Site                Depth (m)   Temperature (°C)   Cells per mL of water
53R         Labrador seawater   1,400       3.5                6.4 × 10^4
55R         Oxygen minimum      500         7.1                1.8 × 10^5
112R        Lower deep water    4,121       2.3                3.9 × 10^4
115R        Oxygen minimum      550         7                  1.5 × 10^5
137         Labrador seawater   1,710       3                  3.3 × 10^4
138         Labrador seawater   710         3.5                5.2 × 10^4
FS312       Bag City            1,529       31.2               1.2 × 10^5
FS396       Marker 52           1,537       24.4               1.6 × 10^5

Table 1. Environmental DNA samples collected from North Atlantic Deep Water and Axial Seamounts at
Juan de Fuca Ridge (Sogin et al., 2006). These eight samples were used as datasets for this research.
C) SimHash
Let $F = \{1, 2, 3, \ldots, t\}$ be the set of all unique DNA tags across the entire collection of DNA
sequences $S$, and let $f(s) \subseteq F$ be the set of tags present in a particular sequence $s \in S$. Each tag $t \in f(s)$
is given a certain weight $w_{t \to s}$, which measures the importance of tag $t$ to sequence $s$. The
combination of $f(s)$ and the weights $\{w_{t \to s}\}_{t \in f(s)}$ represents the tag vector of sequence $s$. The
SimHash value for sequence $s$ can then be computed as a modified dot product between its tag
vector and its specific vector weights:

$$h_s = \sum_{t \in f(s)} c_t \, w_{t \to s}$$

where $c_t$ is the number of occurrences of tag $t$ in sequence $s$.
DNA sequences are made up of four nucleotides, so the total number of possible tags for a
sequence of length $L$ is $4^{L}$. To make computation more reasonable, a one-way mapping function
converting the original $4^{L}$-dimensional space to a $4^{K}$-dimensional space (where $K \ll L$) is often
used in SimHash. Figure 2 illustrates the concept of the SimHash algorithm in this research.
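
The weighted tally can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this research: it assumes tags are counted with a sliding window (overlapping occurrences included), and the function name `simhash_value`, the example sequence, and the example weights are all hypothetical.

```python
from collections import Counter


def simhash_value(sequence, weights):
    """Return the sum of count(tag) * weight(tag) over the tags in `weights`.

    Assumption: every tag in `weights` has the same length K, and occurrences
    are counted with a sliding window over the sequence.
    """
    k = len(next(iter(weights)))  # shared tag length K
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return sum(counts[tag] * weight for tag, weight in weights.items())


# Hypothetical 3-letter tags and weights, chosen only to keep the example short.
example_weights = {"ACG": 2.0, "GTA": 1.5}
print(simhash_value("ACGTACGTAC", example_weights))  # ACG x2, GTA x2 -> 7.0
```

Tags that are absent from the weight table contribute nothing, which mirrors the restriction of the hash to a selected set of tags.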

[Figure 2: worked example of the SimHash tally. The sequence ATCAATCACCTATCAATCCTATAC is traversed
and a running count is kept for the tags ATC, AAT, TAT, and CAC. With the final tallies shown in the
figure (ATC: 3, AAT: 3, TAT: 8, CAC: 6) and tag weights of 12.336, 14.633, 43.234, and 42.439, the hash is
3*12.336 + 3*14.633 + 8*43.234 + 6*42.439 = 681.41.]

Figure 2. The main part of the SimHash algorithm. SimHash traverses the sequence to tally
the number of occurrences of the tags, and the hash is obtained through a linear combination of the tally
of each tag and its weight.
In this research, K was chosen to be 6 (which is explained later), yielding a total
of 4096 ($4^6$) possible tags. Of these tags, 400 were selected for set T of the
SimHash function, as they were the length-K subsequences that appeared most frequently in the
whole microbial DNA dataset. Set T was then weighted according to the following formula:

$$w_t = 20 \times \frac{\text{number of occurrences of tag } t}{\text{minimum number of occurrences among all tags}}$$

where 20 is the scale factor.
Thus, using the formula, the 400th tag (the least frequent one) would have a weight equal
to the scale factor, while the first tag (the most frequent one) would have a much larger weight.
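
A minimal Python sketch of this selection and weighting step follows. It assumes that tag frequencies are counted with a sliding window over every sequence and that the minimum is taken over the selected tags, so that the least frequent selected tag receives exactly the scale factor. The function name `select_weighted_tags` and its parameter names are hypothetical.

```python
from collections import Counter


def select_weighted_tags(sequences, k=6, num_tags=400, scale=20.0):
    """Pick the `num_tags` most frequent length-k tags and weight each one by
    scale * count / (smallest count among the selected tags)."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    top = counts.most_common(num_tags)
    if not top:
        return {}
    min_count = top[-1][1]  # count of the least frequent selected tag
    return {tag: scale * count / min_count for tag, count in top}


# Toy input with k=3 so the counts are easy to check by hand.
toy_weights = select_weighted_tags(["ACGTACGT", "ACGTTT"], k=3, num_tags=3)
print(toy_weights["ACG"], min(toy_weights.values()))  # 60.0 20.0
```

In the toy input, ACG occurs three times and the least frequent selected tag occurs once, so ACG receives three times the scale factor.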
D) Clustering
The Hamming distance function is used to compute the difference between a pair of
SimHash values, which determines whether the two sequences belong to the same cluster.
Defining V to be the set of size N,

$$V = \{h_s\}_{s=1}^{N}$$

where $h_s$ is the SimHash of sequence s.
For any pair ($h_i$ and $h_j$) of SimHash values in V, the pair-wise distance is

$$H_{ij} = \text{hamming\_distance}(h_i, h_j).$$

The output of the SimHash-based clustering algorithm is a cluster set C of size R:

$$C = \{c_r\}_{r=1}^{R}$$

where cluster c is a subset of SimHash values of size L (L < N):

$$c = \{h_j\}_{j=1}^{L}$$

such that, for any pair of SimHash values in cluster c,

$$H_{ij} < \theta \quad (i = 1, \ldots, L;\ j = 1, \ldots, L)$$

where $\theta$ is the SimHash threshold.
The actual implementation of the sequence clustering algorithm is performed as follows:
1) Loop through every sequence s in S.
2) If s has already been assigned to a cluster, skip it.
3) If s has not been assigned, assign it to a new cluster c, then
   a. Go through every other sequence s' in S.
   b. If s' has not been assigned, and the similarity metric between s and s' is below the
      threshold, assign s' to cluster c.
4) Continue until all sequences have been assigned to a cluster.
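
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in this research: the distance function is passed in as a parameter, and the default (absolute difference between numeric hash values) is only one possible choice. The name `greedy_cluster` is hypothetical.

```python
def greedy_cluster(hashes, threshold, distance=lambda a, b: abs(a - b)):
    """Greedy incremental clustering.

    Each unassigned item seeds a new cluster and absorbs every later
    unassigned item whose distance to the seed is below `threshold`.
    Returns one cluster label per input item.
    """
    labels = [None] * len(hashes)
    next_label = 0
    for i, seed in enumerate(hashes):
        if labels[i] is not None:
            continue  # step 2: already assigned, skip
        labels[i] = next_label  # step 3: start a new cluster
        # Items before index i are already assigned, so only later ones are checked.
        for j in range(i + 1, len(hashes)):
            if labels[j] is None and distance(seed, hashes[j]) < threshold:
                labels[j] = next_label
        next_label += 1
    return labels


print(greedy_cluster([100.0, 105.0, 400.0, 98.0, 410.0], threshold=20.0))
# [0, 0, 1, 0, 1]
```

Because every comparison is made against the seed of a cluster, the result depends on the order in which sequences are visited.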
The number of clusters that are created depends on the threshold value. Stricter (smaller) threshold
values require that sequences be more similar in order to be assigned to the same cluster,
resulting in a larger number of clusters, while larger threshold values may be too lax and assign
sequences that are extremely different to the same cluster.
Results and Discussion

A) Selecting SimHash Parameters
The selection of parameters T (the set of tags used in the hash) and K (the length of each
tag) affects the accuracy of the SimHash algorithm. Table 2 displays the actual effects of the
sizes of T and K on the accuracy of the algorithm. With a larger number of tags included in T,
the accuracy of the SimHash algorithm increases, albeit at the cost of runtime. The value of K
(the length of a tag) dictates the total number of possible tags, $4^K$, that could be
used in a DNA sequence analysis. Increasing K, however, only improves the accuracy of the
SimHash algorithm until K reaches a certain threshold. After that point (K = 6 in this study), the
accuracy benefit of increasing K becomes negligible. This is because once the number of
possible tags is large enough for the most frequent length-K subsequences to be selected,
a further increase in K adds few good candidates to the selection of T (the set of tags used)
and therefore does little to improve the accuracy of SimHash.
The combination of the number of tags and the size of K that produces the most accurate result in
this study (512 tags with K = 6) is marked with an asterisk in Table 2.

Number of tags                    K: Length of tags
used in T        4           5           6            7           8
32               41.587 %    50.234 %    55.634 %     54.297 %    54.455 %
64               51.263 %    58.230 %    66.766 %     67.395 %    67.944 %
128              60.917 %    65.529 %    67.592 %     69.259 %    69.592 %
256              62.923 %    69.646 %    70.233 %     71.343 %    72.013 %
512              64.278 %    72.766 %    75.433 % *   74.355 %    75.233 %

Table 2. Accuracy of varying values of K and numbers of tags in SimHash. The accuracy values measure
the percentage of exact match between SimHash and Shingling for top 5% similarity of one sequence to
other sequences.
B) Complexity and Runtime
The SimHash algorithm written in this research greatly reduces the computational
complexity compared to the conventional shingling algorithm. The shingling algorithm is brute
force, traversing the set of all possible pairs of sequences; this produces a complexity of
$O(N^2)$, where N is the number of sequences. For each pair of sequences, the program must then
loop through each pair of subsequences in both of the original sequences, resulting in a
complexity of $O((L-K)^2)$, where L
represents the length of the sequence and K represents the length of the subsequence. The entire
shingle computation has a complexity of $O(N^2 (L-K)^2)$. By comparison, the SimHash algorithm
has a pre-computation step to compute the hashes for all the sequences; this step has a
complexity of $O(NL)$. The next step is to calculate the absolute difference for each pair of
sequences, resulting in a complexity of $O(N^2)$. Thus the overall complexity of this method is
$O(N^2 + NL)$, which is significantly lower than the complexity of the shingling
algorithm. The measured results, displayed in Table 3, indicate that the SimHash method is roughly
33 to 40 times faster than the shingling algorithm when computing DNA sequence similarities on these samples.
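
For contrast, the brute-force pairwise comparison can be sketched as below. This is an assumption-laden illustration: it takes resemblance to be the Jaccard similarity between the sets of length-K subsequences, in the spirit of Broder et al. (1997), which may differ from the exact measure used in this research. The function names are hypothetical, and Python sets make each pairwise comparison cheaper than the nested subsequence loop described above.

```python
from itertools import combinations


def shingles(seq, k):
    """Set of all length-k subsequences (shingles) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def resemblance(a, b, k):
    """Jaccard similarity between the shingle sets of two sequences."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def all_pairs_resemblance(seqs, k):
    """Brute force: one comparison per unordered pair, N*(N-1)/2 in total."""
    return {(i, j): resemblance(seqs[i], seqs[j], k)
            for i, j in combinations(range(len(seqs)), 2)}


sims = all_pairs_resemblance(["ACGTAC", "ACGTTT", "GGGGGG"], k=3)
print(len(sims))  # 3 pairs for 3 sequences
```

The number of comparisons grows quadratically with the number of sequences, which is the cost that a precomputed fingerprint per sequence is meant to reduce.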
In Figure 3, it can be observed that as the number of sequences increases, the
performance gain of SimHash over the shingling algorithm is greater. When analyzing large
datasets containing millions of microbial DNA sequences, as with the data from Sorcerer II,
using SimHash can yield better runtime efficiency.

Sample ID   Number of reads   Runtime for shingling algorithm (sec)   Runtime for SimHash (sec)
53R         13,040            18.66                                   0.53
55R         10,134            14.405                                  0.441
112R        16,087            23.252                                  0.691
115R        16,651            24.189                                  0.678
137         14,147            20.347                                  0.552
138         13,241            19.064                                  0.531
FS312       55,592            81.378                                  2.243
FS396       83,399            123.296                                 3.072

Table 3. Comparison of shingling algorithm and
SimHash runtimes in generating similarity metrics for
all eight samples. All benchmarks were done on a
computer with an Intel Core 2 Duo E8400 @ 3.00
GHz and 4 GB of RAM.
[Figure 3: plot of runtime (sec) against number of reads, with one series for shingling and one for SimHash.]

Figure 3. Runtime benchmarks for shingling
algorithm and SimHash methods using increasingly
larger sequence datasets. The sequences came from the
112R sample. All benchmarks were done on a
computer with an Intel Core 2 Duo E8400 @ 3.00
GHz and 4 GB of RAM.

C) Accuracy of SimHash Compared with Shingling Algorithm
The accuracy of the SimHash method was determined by comparing its similarity metric
output against the metric output of the shingling method. For each DNA sequence, the output
from either the SimHash or the shingling method was composed of the similarity metrics
between the sequence and all the other DNA sequences, ordered from the most similar to the
least similar. Top-N accuracies were then computed for each sequence, as Top-1 accuracy, Top-2
accuracy, Top-3 accuracy, etc., with N representing the number of the most similar DNA
sequences included in the computation.
Top-N accuracy was calculated as the number of the N most similar
sequences identified by the SimHash method that were also in the set of the N most similar sequences
identified by the shingling method, divided by N. A Top-N accuracy of 100% indicates that, for a given sequence,
both the shingling- and hashing-based methods identified the same N sequences.
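
As a minimal sketch, Top-N accuracy for a single query sequence can be computed from two ranked lists of sequence identifiers. The function name `top_n_accuracy` and the identifiers in the example are hypothetical.

```python
def top_n_accuracy(simhash_ranking, shingling_ranking, n):
    """Fraction of the N most similar sequences under SimHash that also appear
    among the N most similar sequences under shingling."""
    top_simhash = set(simhash_ranking[:n])
    top_shingling = set(shingling_ranking[:n])
    return len(top_simhash & top_shingling) / n


# Hypothetical rankings for one query sequence, most similar first.
simhash_ranking = ["s0", "s3", "s1", "s7", "s2"]
shingling_ranking = ["s0", "s1", "s4", "s3", "s2"]
print(top_n_accuracy(simhash_ranking, shingling_ranking, 1))  # 1.0
print(top_n_accuracy(simhash_ranking, shingling_ranking, 5))  # 0.8
```

The measure compares set membership only, so two methods that return the same N sequences in a different order still score 100%.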

Figure 4. Top-N accuracy decreases as N increases. However, the accuracy asymptotes at 40% as N
increases to infinity. This graph only shows 40 randomly selected sequences and their top-50 ranked list
of similar sequences. The red line shows the average accuracy of these 40 sequences.
When N=1, only the sequence itself is selected as the most similar DNA sequence for
both the SimHash and the shingling method; thus, Top-1 is always 100%. When N increases, the
Top-N accuracy degrades, as expected. The Top-N accuracy asymptotes at 40% as N increases to
infinity. It is essential to note that despite the tendency of Top-N accuracy to decrease as N
increases, only the accuracy of the lower N values influences the result of sequence clustering in
this research. This is similar to a Google web search; it is important that the top few search
results are very relevant and are good matches, while the later entries, being less significant in
the search result set, could have low relevance.
D) Comparing SimHash Sequence Clustering with State-of-the-Art Algorithms
The Rand index (RI) was used to assess the clustering accuracy of the proposed SimHash
algorithm. RI measures the percentage of decisions that are correct, with penalties for incorrect
decisions. Basically, we want to assign two sequences to the same cluster if and only if they are
similar. A true positive (TP) decision assigns two similar sequences to the same cluster, and a
true negative (TN) decision assigns two dissimilar sequences to different clusters. There are two
types of incorrect decisions: a false positive (FP) decision assigns two dissimilar sequences to the
same cluster, and a false negative (FN) decision assigns two similar sequences to different
clusters. RI is calculated using the following formula:

$$RI = \frac{TP + TN}{TP + FP + FN + TN}$$
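
A minimal Python sketch of the Rand index, (TP + TN) / (TP + FP + FN + TN), is given below. It assumes that two sequences count as "similar" exactly when the reference clustering places them in the same cluster; the function name `rand_index` and the label lists are hypothetical.

```python
from itertools import combinations


def rand_index(predicted, reference):
    """Rand index over all unordered pairs of items.

    A pair is a correct decision (TP or TN) when both clusterings agree on
    whether the two items belong together.
    """
    agree = total = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_pred = predicted[i] == predicted[j]
        same_ref = reference[i] == reference[j]
        agree += same_pred == same_ref
        total += 1
    return agree / total if total else 1.0


print(rand_index([0, 0, 1, 1, 2], [0, 0, 0, 1, 1]))  # 6 of 10 pairs agree -> 0.6
```

Cluster labels only need to be consistent within each list; the two clusterings may number their clusters differently.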

Reference clusters were generated using the shingling algorithm, which was used to
produce ranked similarity metrics. Then, the greedy incremental approach (the same approach
CD-HIT and UCLUST use for grouping clusters) was used to partition sequences into clusters,
taking the ranked similarity metrics as input. Because shingling is a brute-force algorithm, it was
assumed to be the most accurate, and its clusters were therefore treated as the reference for
assessing the other algorithms.
The SimHash sequence clustering method was evaluated in comparison with two state-
of-the-art methods, UCLUST and CD-HIT. Table 4 shows the comparative analysis for all eight
samples. The runtime and Rand index were captured on each sample separately, as well as on the
total of all samples. Clearly, the SimHash algorithm produced more accurate clustering than
CD-HIT and UCLUST in all eight samples and in the combined dataset. In terms of efficiency, SimHash ran
much faster than CD-HIT, reducing the time from about 36 seconds to about 13 seconds when clustering all
222,291 sequences. Although UCLUST was slightly faster than SimHash, the difference between
the two was small, with UCLUST taking 12.3 seconds and SimHash 12.9 seconds to cluster all 222,291
sequences.

Sample ID   # Reads   Algorithm   Parameters       RI accuracy   Runtime (sec)   # Clusters
53R         13,040    SimHash     threshold: 270   0.9998        0.588           916
                      CD-HIT      identity: 96%    0.9919        1.378           963
                      UCLUST      identity: 94%    0.9796        0.476           971
55R         10,134    SimHash     threshold: 270   0.9996        0.463           845
                      CD-HIT      identity: 96%    0.984         1.07            841
                      UCLUST      identity: 94%    0.9781        0.381           829
112R        16,087    SimHash     threshold: 270   0.9996        0.752           1,122
                      CD-HIT      identity: 92%    0.994         1.989           1,044
                      UCLUST      identity: 86%    0.9821        0.664           1,024
115R        16,651    SimHash     threshold: 270   0.9996        0.773           940
                      CD-HIT      identity: 95%    0.9844        1.611           936
                      UCLUST      identity: 93%    0.9784        0.592           934
137         14,147    SimHash     threshold: 270   0.9997        0.624           853
                      CD-HIT      identity: 96%    0.9883        1.277           889
                      UCLUST      identity: 93%    0.9874        0.482           824
138         13,241    SimHash     threshold: 270   0.9993        0.598           859
                      CD-HIT      identity: 96%    0.9928        1.543           853
                      UCLUST      identity: 94%    0.9852        0.451           882
FS312       55,592    SimHash     threshold: 150   0.9997        2.494           1,887
                      CD-HIT      identity: 90%    0.9926        7.952           1,881
                      UCLUST      identity: 86%    0.9904        2.336           1,965
FS396       83,399    SimHash     threshold: 150   0.9996        3.419           1,956
                      CD-HIT      identity: 94%    0.9844        11.855          1,954
                      UCLUST      identity: 90%    0.9852        3.204           1,981
All         222,291   SimHash     threshold: 50    0.9922        12.92           6,085
                      CD-HIT      identity: 94%    0.9876        36.001          6,148
                      UCLUST      identity: 90%    0.9867        12.289          6,202

Table 4. Accuracy and performance of SimHash, CD-HIT (version 4.5.4), and UCLUST (USEARCH
version 6.0.307). The identity ensures that all pairs of sequences in a cluster must have at least that
percentage of sequence similarity. The numbers in bold highlight the best values among the three methods.
The numbers of clusters were generated to match the numbers of operational taxonomic units (OTUs),
reported by Sogin et al., 2006.
Conclusions and Future Work

In this research, I developed an efficient and accurate metagenome clustering algorithm
based on SimHash. SimHash transforms a long and complex DNA sequence into a simple
fingerprint, which preserves the compositional makeup of the sequence. These fingerprints are
far easier to compare than the original DNA sequences. Based on these fingerprints, a similarity
metric between DNA sequences was generated and then fed into a greedy incremental clustering
engine to partition the DNA sequences into clusters. In order to assess the effectiveness of the
proposed method, a benchmark dataset of 222,291 DNA sequences collected at deep-sea
underwater locations with varying temperatures and cell concentrations was used. The
assessment was divided into two phases: 1) the runtime and speed of generating a similarity
matrix were compared with the shingling algorithm, and 2) the overall runtime and speed of
clustering were compared with two state-of-the-art methods, CD-HIT and UCLUST. The
experiment indicated that SimHash ran much faster (roughly 33 to 40 times on the eight samples in Table 3) than
the shingling algorithm in generating the similarity matrix. The overall results revealed that the
SimHash algorithm performed better than two state-of-the-art methods, CD-HIT and UCLUST,
in accuracy, and showed a significant advantage over CD-HIT in runtime. The proposed method
can help biologists speed up their research in the genomes of microbes, with implications for
new drug design and discovery.
The clustering algorithm can be improved in the future. The currently used clustering
algorithm uses a greedy approach, assuming that the first DNA sequence in a new cluster is at
the center of the cluster, representing its typical characteristics. All the other sequences are
assigned to the same cluster if they are similar enough to the cluster's first sequence (i.e.,
similarity metric < the threshold), or they are assigned to another cluster. However, if the
algorithm happens to pick a sequence that is actually at the edge of the cluster as the first
sequence, then the subsequent assignment of all the other DNA sequences will not be very
accurate. In future work, the accuracy of the DNA sequence classification can be further
improved by taking more dimensions of comparison into consideration. For example, the linkage
clustering technique offers the possibility of comparing a candidate to its nearby objects in a
cluster instead of just to the center when evaluating the assignment of the candidate to the
cluster. The similarity metric generated by the SimHash function can be used as a distance
function in the linkage cluster when calculating the degree to which a DNA sequence is similar
to its neighboring sequences.
References

Broder, A., Glassman, S., Manasse, M., & Zweig, G. (1997). Syntactic clustering of the web. In
6th International World Wide Web Conference (pp. 393–404).
Charikar, M.S. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings
of the 34th Annual ACM Symposium on Theory of Computing (pp. 380–388).
Dalevi, D., Ivanova, N.N., Mavromatis, K., Hooper, S.D., Szeto, E., Hugenholtz, P., Kyrpides,
N.C., & Markowitz, V.M. (2008). Annotation of metagenome short reads using
proxygenes. Bioinformatics, 24(16), i7–i13.
Edgar, R.C. (2010). Search and clustering orders of magnitude faster than BLAST.
Bioinformatics, 26(19), 2460–2461.
Gori, F., Folino, G., Jetten, M., & Marchiori, E. (2011). MTR: taxonomic annotation of short
metagenomic reads using clustering at multiple taxonomic ranks. Bioinformatics, 27(2),
196–203.
Hugenholtz, P. & Tyson, G.W. (2008). Microbiology: metagenomics. Nature, 455(7212),
481–483.
Li, W.Z. & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of
protein or nucleotide sequences. Bioinformatics, 22(13), 1658–1659.
Rasheed, Z., Rangwala, H., & Barbara, D. (2012). Efficient clustering of metagenomic
sequences using locality sensitive hashing. In SIAM International Conference in Data
Mining (pp. 1023–1034).
Sadowski, C. & Levin, G. (2007). SimHash: hash-based similarity detection. Technical Report,
Google.
Sogin, M., Morrison, H., Huber, J., Welch, D., Huse, S., Neal, P., Arrieta, J., & Herndl, G.
(2006). Microbial diversity in the deep sea and the underexplored rare biosphere.
PNAS, 103(32), 12115–12120.
Tringe, S.G. et al. (2005). Comparative metagenomics of microbial communities. Science,
308(5721), 554–557.
Uddin, S., Roy, C.K., Schneider, K.A., & Hindle, A. (2011). On the effectiveness of simhash for
detecting near-miss clones in large scale software systems. Proc. WCRE, 13–22.
