
Title:

Objective/Aim:
Technical Details:
Innovativeness & usefulness:
Current status of development:
Review of Literature
A review of the relevant literature, showing the work done previously in the area of the
proposed research, is essential for planning further research effectively. The
information given in the review should be supported by references.
Materials & methods:
Results:

References and Bibliography

What is NCD?
Normalized Compression Distance (NCD) is a family of functions which take as
arguments two objects (literal files, Google search terms) and evaluate a fixed formula
expressed in terms of the compressed versions of these objects, separately and combined.
Hence this family of functions is parametrized by the compressor used. If x and y are the
two objects concerned, and C(x) is the length of the compressed version of x using
compressor C, then

NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)},

where xy denotes the concatenation of x and y.
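
As a concrete illustration of this formula, here is a minimal sketch in Python that uses the
standard-library bz2 module as the compressor C; any real-world compressor could be
substituted, and the file names in the commented usage lines are hypothetical.

import bz2

def compressed_length(data: bytes) -> int:
    # C(x): the length of the bz2-compressed version of the data.
    return len(bz2.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx = compressed_length(x)
    cy = compressed_length(y)
    cxy = compressed_length(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical usage: compare two files on disk.
# with open("object_a.dat", "rb") as f, open("object_b.dat", "rb") as g:
#     print(ncd(f.read(), g.read()))

The closer the result is to 0, the more similar the two objects are; values near 1 indicate
little shared structure.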

The method is the outcome of mathematical developments based on Kolmogorov
complexity.

The family of NCDs includes the

• NID, Normalized Information Distance, when the compressor reaches the
Kolmogorov complexity of the data, so that C(x) = K(x), the Kolmogorov
complexity of x, and we use, so to speak, the Kolmogorov complexity
compressor.
• NGD, Normalized Google Distance, when Google is used to determine a
probability density function over search terms that yields a Shannon-Fano code
length by taking the negative log of the probability of a term. This process of
deriving a code length from a search term can be considered a Google
compressor.
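
The NGD can be written out in the same spirit. A rough sketch following the published
NGD definition, where the hit counts f_x, f_y, f_xy and the index size n are made-up
numbers standing in for real Google page counts:

from math import log2

def ngd(f_x: float, f_y: float, f_xy: float, n: float) -> float:
    # Normalized Google Distance: f_x and f_y are the number of pages
    # containing each term, f_xy the number containing both terms,
    # and n the total number of indexed pages.
    lx, ly, lxy, ln = log2(f_x), log2(f_y), log2(f_xy), log2(n)
    return (max(lx, ly) - lxy) / (ln - min(lx, ly))

# Hypothetical counts, for illustration only:
print(ngd(f_x=9_000_000, f_y=8_000_000, f_xy=4_000_000, n=8_000_000_000))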

A standard PC file-compression program can tell the difference between classical music,
jazz and rock, all without playing a single note. This new-found ability could help
scholars identify the composers of music that until now has remained anonymous.

The technique exploits the ability of off-the-shelf "zip" data-compression software to do
more than just squeeze PC files into manageable sizes. For instance, various zip programs
have already been used to detect the language a piece of text is written in (New Scientist
print edition, 15 December 2001).

To do this, you first take several long text files, each in a known language, and compress
them, noting the file size of each. You then append the unknown file to each of the
uncompressed, known files in turn, and compress them again, noting the difference that
adding the unknown file makes in each case.

The smaller the difference, the more likely the languages are to be the same. That is
because the zip program looks for duplicated sequences in the text to shrink it without
losing information.
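
A minimal sketch of this procedure in Python, using gzip from the standard library in
place of a zip program; the file names are hypothetical placeholders for long reference
texts:

import gzip

def gz_len(data: bytes) -> int:
    # Compressed size in bytes, standing in for the zip program.
    return len(gzip.compress(data))

def closest_language(unknown: bytes, references: dict) -> str:
    # For each known-language reference, measure how much the compressed
    # size grows when the unknown text is appended; the smallest growth
    # points to the most likely language.
    growth = {lang: gz_len(ref + unknown) - gz_len(ref)
              for lang, ref in references.items()}
    return min(growth, key=growth.get)

# Hypothetical usage:
# refs = {"English": open("english.txt", "rb").read(),
#         "Dutch": open("dutch.txt", "rb").read()}
# print(closest_language(open("mystery.txt", "rb").read(), refs))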

Rudi Cilibrasi, Paul Vitanyi and Ronald de Wolf of the Dutch National Research Institute
in Amsterdam wondered if such compression could also help distinguish between
musical genres. So they tried it out on digital files of various pieces, including some from
Beethoven, Miles Davis and Jimi Hendrix.

Rhythm and melody

They stripped out any data unrelated to the actual music, such as digital ID tags, to create a
data string representing only the rhythm and melody of the tune. Using a program called
Bzip2, they followed a procedure similar to the one used with the text files, measuring how
similar each piece was to every other. Then they plotted the results in a way that produces a
tree-shaped pattern, in which similar pieces cluster together on the same branch.
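
A hedged sketch of those two steps, assuming SciPy is available and that each piece has
already been reduced to a byte string of rhythm-and-melody data as described above; the
labels and byte strings are placeholders:

import bz2
from itertools import combinations
from scipy.cluster.hierarchy import dendrogram, linkage

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance with bzip2 as the compressor.
    cx, cy = len(bz2.compress(x)), len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Placeholder data: label -> rhythm/melody byte string for one piece.
pieces = {"classical_1": b"...", "jazz_1": b"...", "rock_1": b"..."}
labels = list(pieces)

# Condensed vector of pairwise distances, in the order linkage() expects.
distances = [ncd(pieces[a], pieces[b]) for a, b in combinations(labels, 2)]

# Average-linkage clustering builds the tree; dendrogram() draws the
# tree-shaped pattern in which similar pieces share a branch
# (displaying it requires matplotlib).
tree = linkage(distances, method="average")
dendrogram(tree, labels=labels)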

In a test with 12 each of jazz, classical and rock pieces, the results were fairly good. Ten
of the jazz, nine of the rock and most of the classical pieces ended up in three distinct
branches of the tree.
When applied to 32 classical pieces, the technique clustered each composer on a separate
branch. Vitanyi thinks the trick could help identify a plausible composer for works of
unknown origin, as long as the candidate composer has written several known works for comparison. It
could also help online music stores, for example by classifying music files.

The technique's elegance lies in the fact that it is tone deaf. Rather than looking for
features such as common rhythms or harmonies, says Vitanyi, "it simply compresses the
files obliviously."

"I would love a technique that can work out who wrote something just by putting the
notes on a page into a computer," says Jeremy Summerly of the Royal Academy of
Music in London, who tries to identify the composers of unattributed fragments of 16th-
century musical scores. The technique is promising, he says, because it detects features of
a piece that the composer does not consciously think about, but which are actually their
hallmark.

Summerly hopes to see what the technique makes of the second half of Mozart's
Requiem, completed by Franz Süssmayr after Mozart's death. The way it clusters among
other works by Mozart and Süssmayr might reveal how much original work Süssmayr
contributed.

The sort of things I’m backing up are: Music (mainly MP3s), Pictures (mainly JPEGs),
Videos (a mixture of MPEGs and AVIs / DIVXs), and Software (Both in the form of
binary files and source code). I have therefore split the test into the following categories:

• Binaries
o 6,092,800 Bytes taken from my /usr/bin directory

• Text Files
o 43,100,160 Bytes taken from kernel source at /usr/src/linux-headers-
2.6.28-15

• MP3s
o 191,283,200 Bytes, a random selection of MP3s

• JPEGs
o 266,803,200 Bytes, a random selection of JPEG photos

• MPEG
o 432,240,640 Bytes, a random MPEG encoded video

• AVI
o 734,627,840 Bytes, a random AVI encoded video

I have tarred each category so that each test is only performed on one file (As far as I’m
aware, tarring the files will not affect the compression tests). Each test has been run from
a script 10 times and an average has been taken to make the results as fair as possible.
The things I’m interested in here are compression and speed: how much smaller are the
compressed files, and how long do I have to wait for them to compress and decompress?

Although there are many compression tools available, I decided to use the five that I
consider the most common: GZIP, BZIP2, ZIP, LZMA, and the Linux tool compress.
One of the reasons for this test is to find the best compression, so where there was an
option, I have chosen to use the most aggressive compression offered by each tool.
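
The original script is not reproduced here, but a rough sketch of the same idea in Python,
with assumed flags and a hypothetical tarball name, would look something like this:

import subprocess
import time

TARBALL = "binaries.tar"   # hypothetical name for one tarred category
RUNS = 10

# Most aggressive compression setting for each tool (an assumption);
# each command reads the tarball and writes the compressed stream to stdout.
tools = {
    "gzip":     ["gzip", "-c", "-9", TARBALL],
    "bzip2":    ["bzip2", "-c", "-9", TARBALL],
    "lzma":     ["lzma", "-c", "-9", TARBALL],
    "zip":      ["zip", "-9", "-", TARBALL],
    "compress": ["compress", "-c", TARBALL],
}

for name, cmd in tools.items():
    sizes, seconds = [], []
    for _ in range(RUNS):
        start = time.time()
        out = subprocess.run(cmd, capture_output=True).stdout
        seconds.append(time.time() - start)
        sizes.append(len(out))
    print(f"{name}: {sizes[-1]} bytes, {sum(seconds) / RUNS:.1f} s on average")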

Here’s what I found:


Source: http://webcache.googleusercontent.com/search?q=cache:BfYvNHVpov8J:blog.terzza.com/linux-compression-comparison-gzip-vs-bzip2-vs-lzma-vs-zip-vs-compress/+zlib+bzip+lzma&cd=4&hl=en&ct=clnk&gl=in&client=firefox-a

Version information:
gzip 1.3.12
bzip2 1.0.5
LZMA 4.32.0beta3
LZMA SDK 4.43

For starters, I threw a 1 GiB file containing nothing but binary zeros at them.

$ dd if=/dev/zero of=test.zero bs=1024M count=1


1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 187.978 s, 5.7 MB/s

Now the fun starts.

GZIP
$ /usr/bin/time -f "%U seconds CPU %P" gzip -c9 test.zero > test.gz
12.36 seconds CPU 99%

BZIP2
$ /usr/bin/time -f "%U seconds CPU %P" bzip2 -c9 test.zero > test.bz2
32.07 seconds CPU 98%

LZMA
$ /usr/bin/time -f "%U seconds CPU %P" lzma -c9 test.zero > test.lzma
873.79 seconds CPU 96%

So what kind of compression ratios are we talking about here?

$ ls -lh test.zero*
-rw-r--r-- 1 kafui kafui 1.0G 2009-03-25 12:01 test.zero
-rw-r--r-- 1 kafui kafui 1018K 2009-03-25 12:51 test.gz
-rw-r--r-- 1 kafui kafui 148K 2009-03-25 13:10 test.lzma
-rw-r--r-- 1 kafui kafui 785 2009-03-25 12:52 test.bz2

Dendrogram - A diagram representing the fusions and divisions at each stage of
a classification analysis. Similar in appearance to a cladogram, but based on
phenetic similarity rather than cladistic similarity.

Tree - Mathematically, a connected, acyclic (cycle-free) graph. Used to represent the
evolutionary history of a set of taxa, with the leaves (or terminal branches)
representing contemporary taxa and the internal branches representing
hypothesised ancestors (see also rooted tree, unrooted tree).

Unrooted tree (network) - A cladogram for which the ancestor (= root) has not
been hypothesized, and which thus does not specify the direction of evolutionary
change among the character-states. An unrooted tree can be rooted on any of its
branches, and so there are many rooted trees that can be derived from a single
unrooted tree (cf. rooted tree).

A rooted phylogenetic tree is a directed tree with a unique node corresponding to the
(usually imputed) most recent common ancestor of all the entities at the leaves of the
tree. The most common method for rooting trees is the use of an uncontroversial
outgroup — close enough to allow inference from sequence or trait data, but far enough
to be a clear outgroup.

Unrooted trees illustrate the relatedness of the leaf nodes without making assumptions
about ancestry at all. While unrooted trees can always be generated from rooted ones by
simply omitting the root, a root cannot be inferred from an unrooted tree without some
means of identifying ancestry; this is normally done by including an outgroup in the input
data or introducing additional assumptions about the relative rates of evolution on each
branch, such as an application of the molecular clock hypothesis.

Both rooted and unrooted phylogenetic trees can be either bifurcating or
multifurcating, and either labeled or unlabeled. A rooted bifurcating tree has exactly
two descendants arising from each interior node (that is, it forms a binary tree), and an
unrooted bifurcating tree takes the form of an unrooted binary tree, a free tree with
exactly three neighbors at each internal node. In contrast, a rooted multifurcating tree
may have more than two children at some nodes and an unrooted multifurcating tree
may have more than three neighbors at some nodes. A labeled tree has specific values
assigned to its leaves, while an unlabeled tree, sometimes called a tree shape, defines a
topology only. The number of possible trees for a given number of leaf nodes depends on
the specific type of tree, but there are always more multifurcating than bifurcating trees,
more labeled than unlabeled trees, and more rooted than unrooted trees. The last
distinction is the most biologically relevant; it arises because there are many places on an
unrooted tree to put the root. For labeled bifurcating trees, there are (2n − 3)!! total rooted
trees and (2n − 5)!! total unrooted trees (double factorials), where n represents the number
of leaf nodes. Among labeled bifurcating trees, the number of unrooted trees with n leaves
is equal to the number of rooted trees with n − 1 leaves.[5]
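
Because these counts are double factorials, they explode very quickly with n, which is one
reason finding an optimal tree is computationally hard (see the Construction section). A
quick numerical check of the two formulas:

from math import prod

def double_factorial(k: int) -> int:
    # k!! = k * (k - 2) * (k - 4) * ... for odd k.
    return prod(range(k, 0, -2))

def rooted_trees(n: int) -> int:      # labeled bifurcating, rooted
    return double_factorial(2 * n - 3)

def unrooted_trees(n: int) -> int:    # labeled bifurcating, unrooted
    return double_factorial(2 * n - 5)

for n in (4, 5, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))

# As stated above, unrooted_trees(n) == rooted_trees(n - 1).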

A dendrogram is a broad term for the diagrammatic representation of a phylogenetic
tree.

A cladogram is a tree formed using cladistic methods. This type of tree only represents
a branching pattern, i.e., its branch lengths do not represent time.

A phylogram is a phylogenetic tree that explicitly represents the number of character
changes through its branch lengths.

A chronogram is a phylogenetic tree that explicitly represents evolutionary time through
its branch lengths.

Construction

Phylogenetic trees among a nontrivial number of input sequences are constructed using
computational phylogenetics methods. Distance-matrix methods such as neighbor-joining
or UPGMA, which calculate genetic distance from multiple sequence alignments, are
simplest to implement, but do not invoke an evolutionary model. Many sequence
alignment methods, such as ClustalW, also create trees using the simpler, distance-based
tree-construction algorithms. Maximum parsimony is another simple
method of estimating phylogenetic trees, but implies an implicit model of evolution (i.e.
parsimony). More advanced methods use the optimality criterion of maximum likelihood,
often within a Bayesian framework, and apply an explicit model of evolution to
phylogenetic tree estimation.[5] Identifying the optimal tree using many of these
techniques is NP-hard[5], so heuristic search and optimization methods are used in
combination with tree-scoring functions to identify a reasonably good tree that fits the
data.
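
To illustrate why the distance-matrix methods are regarded as the simplest to implement,
here is a toy UPGMA sketch in Python (not the algorithm of any particular package); the
four-taxon distance matrix is invented purely for the example:

def upgma(dist):
    # dist maps frozenset({a, b}) -> distance between clusters a and b.
    # Returns the tree as nested tuples, e.g. (('A', 'B'), ('C', 'D')).
    size = {taxon: 1 for pair in dist for taxon in pair}
    while len(size) > 1:
        a, b = min(dist, key=dist.get)         # closest pair of clusters
        merged = (a, b)
        size[merged] = size[a] + size[b]
        # New distances are the size-weighted average of the members' distances.
        for c in [c for c in size if c not in (a, b, merged)]:
            dist[frozenset({merged, c})] = (
                size[a] * dist[frozenset({a, c})]
                + size[b] * dist[frozenset({b, c})]
            ) / (size[a] + size[b])
        del size[a], size[b]
        dist = {p: v for p, v in dist.items() if a not in p and b not in p}
    return next(iter(size))

# Invented distances over four taxa, for illustration only:
d = {frozenset(pair): value for pair, value in [
    (("A", "B"), 2), (("A", "C"), 6), (("A", "D"), 6),
    (("B", "C"), 6), (("B", "D"), 6), (("C", "D"), 4),
]}
print(upgma(d))   # groups A with B and C with D (tuple order may vary)
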
Tree-building methods can be assessed on the basis of several criteria:[6]

• efficiency (how long does it take to compute the answer, how much memory does
it need?)
• power (does it make good use of the data, or is information being wasted?)
• consistency (will it converge on the same answer repeatedly, if each time given
different data for the same model problem?)
• robustness (does it cope well with violations of the assumptions of the underlying
model?)
• falsifiability (does it alert us when it is not good to use, i.e. when assumptions are
violated?)

Tree-building techniques have also gained the attention of mathematicians. Trees can
also be built using T-theory.[7]

Limitations

Although phylogenetic trees produced on the basis of sequenced genes or genomic data
in different species can provide evolutionary insight, they have important limitations.
They do not necessarily represent the species' evolutionary history accurately. The data on
which they are based is noisy; the analysis can be confounded by horizontal gene
transfer[8], hybridisation between species that were not nearest neighbors on the tree
before hybridisation takes place, convergent evolution, and conserved sequences.

Also, there are problems in basing the analysis on a single type of character, such as a
single gene or protein or only on morphological analysis, because such trees constructed
from another unrelated data source often differ from the first, and therefore great care is
needed in inferring phylogenetic relationships among species. This is most true of genetic
material that is subject to lateral gene transfer and recombination, where different
haplotype blocks can have different histories. In general, the output tree of a
phylogenetic analysis is an estimate of the character's phylogeny (i.e. a gene tree) and
not the phylogeny of the taxa (i.e. species tree) from which these characters were
sampled, though ideally, both should be very close. For this reason, serious phylogenetic
studies generally use a combination of genes that come from different genomic sources
(e.g., from mitochondrial or plastid vs. nuclear genomes), or genes that would be
expected to evolve under different selective regimes, so that homoplasy (false homology)
would be unlikely to result from natural selection.

When extinct species are included in a tree, they are terminal nodes, as it is unlikely that
they are direct ancestors of any extant species. Scepticism must apply when extinct
species are included in trees that are wholly or partly based on DNA sequence data, due
to the fact that little useful "ancient DNA" is preserved for longer than 100,000 years, and
except in the most unusual circumstances no DNA sequences long enough for use in
phylogenetic analyses have yet been recovered from material over 1 million years old.

In some organisms, endosymbionts have an independent genetic history from the host.
Phylogenetic networks are used when bifurcating trees are not suitable, due to these
complications which suggest a more reticulate evolutionary history of the organisms
sampled.

Compression

Compression works by finding sequences of data that are repeated. The term "sliding
window" is used; all it really means is that at any given point in the data, there is a record
of what characters went before. A 32K sliding window means that the compressor (and
decompressor) have a record of what the last 32768 (32 * 1024) characters were. When
the next sequence of characters to be compressed is identical to one that can be found
within the sliding window, the sequence of characters is replaced by two numbers: a
distance, representing how far back into the window the sequence starts, and a length,
representing the number of characters for which the sequence is identical.

I realize this is a lot easier to see than to just be told. Let's look at some highly
compressible data:

Blah blah blah blah blah!

Our datastream starts by receiving the following characters: 'B', 'l', 'a', 'h', ' ', and 'b'.
However, look at the next five characters:

vvvvv
Blah blah blah blah blah!
^^^^^

There is an exact match for those five characters in the characters that have already gone
into the datastream, and it starts exactly five characters behind the point where we are
now. This being the case, we can output special characters to the stream that represent a
number for length, and a number for distance.

The data so far:

Blah blah b

The compressed form of the data so far:

Blah b[D=5,L=5]

The compression can still be increased, though taking full advantage of it requires a bit
of cleverness on the part of the compressor. Look at the two strings that we decided were
identical. Compare the character that follows each of them. In both cases, it's 'l' -- so we
can make the length 6, not just 5. But if we continue checking, we find that the next
characters, and the next characters, and the next characters, are still identical -- even if the
so-called "previous" string is overlapping the string we're trying to represent in the
compressed data!

It turns out that the 18 characters that start at the second character are identical to the 18
characters that start at the seventh character. It's true that when we're decompressing and
read the (distance, length) pair that describes this relationship, we don't know what all
those 18 characters will be yet -- but if we put in place the ones that we know, we will
know more, which will allow us to put down more... or, knowing that any length-and-
distance pair where length > distance is going to be repeating (distance) characters again
and again, we can set up the decompressor to do just that.

It turns out our highly compressible data can be compressed down to just this:

Blah b[D=5, L=18]!
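
A small sketch in Python of how a decompressor expands such a pair, including the
overlapping case where length > distance; copying one byte at a time means bytes written
earlier in the copy become available for the later ones, exactly as described above:

def copy_match(output: bytearray, distance: int, length: int) -> None:
    # Copy `length` bytes starting `distance` bytes back in the output;
    # byte-by-byte copying handles the overlapping (length > distance) case.
    start = len(output) - distance
    for i in range(length):
        output.append(output[start + i])

# Reproducing the worked example: literals "Blah b", then [D=5, L=18], then "!".
out = bytearray(b"Blah b")
copy_match(out, distance=5, length=18)
out += b"!"
print(out.decode())   # Blah blah blah blah blah!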
