Professional Documents
Culture Documents
2013 Liviu P. Dinu, Radu-Tudor Ionescu, 2012. Clustering Based On Rank Distance With Applications On DNA PDF
2013 Liviu P. Dinu, Radu-Tudor Ionescu, 2012. Clustering Based On Rank Distance With Applications On DNA PDF
2013 Liviu P. Dinu, Radu-Tudor Ionescu, 2012. Clustering Based On Rank Distance With Applications On DNA PDF
Applications on DNA
1 Introduction
Clustering has long played an important role in a wide variety of fields: biol-
ogy, statistics, pattern recognition, information retrieval, machine learning, data
mining, psychology and other social sciences.
There are various clustering algorithms that differ significantly in their notion
of what constitutes a cluster. The appropriate clustering algorithm and parame-
ter settings depend on the individual data set. But not all clustering algorithms
can be applied on a particular data set. For example, clustering methods that
depend on a standard distance function (such as K-means) cannot be applied
on a data set of objects (such as strings) for which a standard distance function
cannot be computed.
The goal of this work is to introduce two clustering algorithms for strings, or
more precisely, that are based on a distance measure for strings. A few distance
measures for strings can be considered, but we focus on using a single distance
that has very good results in terms of accuracy and time for many important
problems. The distance we focus on is termed rank distance [3] and it has appli-
cations in biology [4, 7], natural language processing [2], and many other fields.
The first clustering algorithm presented here is a centroid model that represents
each cluster by a single median string. The second one is a connectivity model
the builds a hierarchy of clusters based on distance connectivity. Both use rank
distance.
Let us describe the paper organization. In section 2 we introduce notation
and mathematical preliminaries. We first introduce the rank distance and then
we define closest string and median string problems. Section 3 gives on overview
of related work regarding sequencing and comparing DNA, rank distance and
clustering methods. The clustering algorithms are described in section 4. The
experiments using mitochondrial DNA sequences from mammals are presented
in section 5. Finally, we draw our conclusion and talk about futher work in
section 6.
2 Preliminaries
Definition 1. Let U = {1, 2, ..., #U} be a finite set of objects, named universe
(we write #U for the cardinality of U). A ranking over U is an ordered list:
τ = (x1 > x2 > ... > xd ), where xi ∈ U for all 1 ≤ i ≤ d, xi 6= xj for all
1 ≤ i 6= j ≤ d, and > is a strict ordering relation on the set {x1 , x2 , ..., xd }.
Definition 2. Given two partial rankings σ and τ over the same universe U,
we define the rank distance between them as:
X
∆(σ, τ ) = |ord(σ, x) − ord(τ, x)|.
x∈σ∪τ
The median string problem (MSP) is similar to the closest string problem,
only that the goal is to minimize the average distance to all the input strings.
3 Related Work
3.1 Sequencing and Comparing DNA
In many important problems in computational biology a common task is to
compare a new DNA sequence with sequences that are already well studied and
annotated. DNA sequence comparison is ranked in the top of two lists with
major open problems in bioinformatics [9, 17].
The standard method used in computational biology for sequence compari-
son is by sequence alignment. Sequence alignment is the procedure of comparing
two sequences (pairwise alignment) or more sequences (multiple alignment) by
searching for a series of individual characters or characters patterns that are in
the same order in the sequences. Algorithmically, the standard pairwise align-
ment method is based on dynamic programming [15].
Although dynamic programming for sequence alignment is mathematically
optimal, it is far too slow for comparing a large number of bases, and too slow
to be performed in a reasonable time. Typical DNA database today contains
billions of bases, and the number is still increasing rapidly.
The standard distances with respect to the alignment principle are edit (Lev-
enshtein) distance [10] or its ad-hoc variants. The study of genome rearrange-
ment [12] was investigated also under Kendall tau distance (the minimum num-
ber of swaps needed to transform a permutation into the other).
(j) (j)
where ∆(xi , cj ) is the rank distance between a string xi in cluster j and the
cluster centroid cj . The objective function is simply an indicator of the distance
of the input strings from their respective cluster centers.
5 Experiments
5.1 Data set
Our clustering methods are tested on a classical problem in bioinformatics: the
phylogenetic analysis of the mammals. In the experiments presented in this paper
we use mitochondrial DNA sequence genome of 22 mammals available in the
EMBL database. The DNA sequence of mtDNA has been determined from a
large number of organisms and individuals, and the comparison of those DNA
sequences represents a mainstay of phylogenetics, in that it allows biologists
to elucidate the evolutionary relationships among species. In mammals, each
double-stranded circular mtDNA molecule consists of 15, 000-17, 000 base pairs.
In our experiments each mammal is represented by a single mtDNA sequence
that comes from a single individual. Note that DNA from two individuals of the
same species differs by only 0, 1%.
House Mouse
Opossum
Harbour Seal
Gray Seal
Cat
Indian Rhinoceros
White Rhinoceros
Donkey
Horse
Rat
Sheep
Fin Whale
Big Whale
Cow
Human
Sumatran Orangutan
Orangutan
Gibbon
Gorilla
Chimpanzee
Pygmy Chimpanzee
7 6 5 4 3 2 1
5
x 10
Fig. 1. Phylogenetic tree obtained from mammalian mtDNA sequences using rank
distance.
Acknowledgment
The contribution of the authors to this paper is equal. Radu Ionescu thanks his
Ph.D. supervisor Denis Enachescu from the University of Bucharest.
Bibliography