
Clustering and Categorization Applied to Cryptanalysis
Claudia Oliveira, Jose Antonio Xexeo, and Carlos Andre Carvalho
Departamento de Engenharia de Sistemas, Instituto Militar de Engenharia
Rio de Janeiro, Brazil

Abstract. This paper presents a new application for Information Retrieval techniques. We introduce the use of clustering and categorization in the attack of cryptosystems. In order to clearly present the fundamentals and understand the workings and the implications of this new technique, we developed a procedure for keylength determination in the process of cryptanalysis of polyalphabetic ciphers, the core of any attack on this type of cipher. The basic premises are: first, a cryptogram is a normal document written in an unknown language; secondly, Information Retrieval techniques are extremely useful in detecting string patterns in ordinary texts and might be helpful with cryptograms as well.

1 Introduction

This paper presents a new technique in cryptanalysis. It proposes a new field of investigation which links Information Retrieval (IR) to Cryptology, through the use of text categorization techniques in the attack of cryptosystems. The goal of our work in this initial stage is to argue for the feasibility of using these techniques as cryptanalysis instruments. In order to clearly present the fundamentals and understand the workings and the implications of this IR technique, as an example, we developed a procedure for keylength determination in the process of cryptanalysis of simple polyalphabetic ciphers.
Information Retrieval systems have relied heavily on the bag-of-words
model of document representation [5], which simply encodes the frequency of
each word in a text and disregards word order and any language-specific linguistic knowledge. Even the language in question is not taken into account, although
statistical features highly influence the model. Many IR tasks are solved by assuming that a given text A is conceptually more similar to text B than to text
C, even if their contents are not known. This perspective is valuable in the speculation about the linguistic features that remain after a text has been encrypted.
In the context of this application, a word is a string of symbols, either in a
plaintext or in a ciphertext; in ciphertexts, words can be blocks of a fixed size.
As a consequence, the vocabulary is the set of distinct words generated by a
cipher. The cipher key can be viewed as a linguistic property that determines a
new language (ciphertexts), with its particular vocabulary and word frequency
distribution. Therefore, this work explores the idea that the analysis of linguistic features of ciphertexts, including keylength, requires that they be viewed as regular documents written in an unknown language.
The keylength determination procedure is carried out in three stages:
1. using an IR grouping technique, Hierarchical Clustering [3], a collection of ciphertexts is separated into groups. The clustering target feature used was the
linguistic distance or, conversely, linguistic similarity between texts, given
that our main hypothesis is that large values of this feature are indicators
that the texts are ciphered with the same cipher key;
2. we use the first part of the Kasiski method in order to calculate what we
defined as the Kasiski index for each text. This index is the largest most
frequent prime factor of the distances between repeated n-grams in the ciphertext. The most frequent index in each group is then used to label the
group;
3. for each group, the keylength is determined by using an IR categorization
method, namely k-Nearest Neighbour Categorization [6]. In general, this
method measures the distances between the label of the categorizing group
and the known keylengths of groups previously organized in a training phase.
The output keylength is determined as a function of the nearest groups' keylengths.
The procedure was evaluated using a set of texts extracted from the Bible.
This is a limited domain scenario, but message interception applications often deal with such scenarios. Experimental results indicate that keylength, as a linguistic feature of ciphertexts, has a considerable influence on the level of document similarity between ciphertexts; therefore, measuring similarities with IR techniques is a good instrument for keylength determination.
The remainder of this paper is organized as follows: in Section 2 we present a brief review of the existing approaches to the determination of polyalphabetic keylength. In Section 3 the specific IR techniques that have been employed in this work are described, with the necessary particularizations to the case of ciphertexts. In Section 4 the details of the experiments are presented with tabulated results; Section 5 indicates further experiments involving other types of cryptographic systems, and finally, in Section 6, concluding remarks are presented.

2 The Polyalphabetic Substitution Cipher

A polyalphabetic substitution cipher is a combination of monoalphabetic substitution ciphers. The number of monoalphabetic substitution ciphers is known as
the period or keylength (M). The message is broken into blocks of length M and
each of the M characters in the block is encrypted using a different monoalphabetic key. It should be evident that the ciphertext created by a polyalphabetic
substitution cipher has similar properties to that created by a simple substitution cipher. As a result, if the period is known, polyalphabetic substitution
ciphers are vulnerable to the same cryptanalytic attacks as the simple substitution cipher [1]. Thus, knowing the keylength reduces the problem to cracking

a series of monoalphabetic substitution ciphers. Cryptanalysis of a polyalphabetic substitution cipher is therefore accomplished in three steps: 1) determine
the keylength; 2) break ciphertext into separate pieces, one piece per permutation; 3) solve each piece as a separate monoalphabetic substitution cipher, using
frequency distributions.
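To make the setting concrete, the sketch below implements a minimal Vigenère-style polyalphabetic cipher over the alphabet A-Z in Python; the function name and the decision to pass non-letters through unencrypted are our own illustrative choices, not details taken from the paper.

# A minimal Vigenere-style polyalphabetic cipher over A-Z (illustrative
# sketch; non-letters are passed through and do not advance the key).
def vigenere(text, key, decrypt=False):
    out = []
    i = 0
    for ch in text.upper():
        if not ch.isalpha():
            out.append(ch)
            continue
        shift = ord(key[i % len(key)].upper()) - ord('A')
        if decrypt:
            shift = -shift
        out.append(chr((ord(ch) - ord('A') + shift) % 26 + ord('A')))
        i += 1  # the key advances only on letters
    return ''.join(out)

# Keylength M = 3: every third letter is encrypted with the same alphabet.
ct = vigenere("ATTACK AT DAWN", "KEY")
assert vigenere(ct, "KEY", decrypt=True) == "ATTACK AT DAWN"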
2.1 Keylength determination methods

There are some known methods for determining keylength in polyalphabetic substitution ciphers. The first of those methods to be published is due to Friedrich Kasiski, in Die Geheimschriften und die Dechiffrir-Kunst (Secret Writing and the Art of Deciphering), 1863. It is based on the analysis of gaps between recurring patterns in the ciphertext, assuming that, in a substitution cipher, patterns in the plaintext will manifest themselves in the ciphertext. For example, the tri-grams "THE" and "ING" occur frequently in English. In a polyalphabetic substitution, it is possible, or even likely, that they will be enciphered using the same permutations at multiple points. If the message is long enough, one expects to find multiple occurrences of these patterns. Kasiski's method works on tri-grams (or longer n-grams) as follows: 1) identify repeated patterns of 3 or more letters in the ciphertext; 2) for each pattern, write down the starting positions of all its instances; 3) compute the differences between the starting positions of successive instances; 4) determine all the factors of these differences; 5) the keylength will be one of the factors that appear most often.
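A minimal Python sketch of steps 1-4 above (the function name, the tri-gram default and the bound on tested factors are our assumptions):

from collections import Counter, defaultdict

def kasiski_factor_counts(ciphertext, n=3, max_factor=50):
    # Steps 1-2: record the starting positions of every n-gram.
    positions = defaultdict(list)
    for i in range(len(ciphertext) - n + 1):
        positions[ciphertext[i:i + n]].append(i)
    counts = Counter()
    for pos in positions.values():
        # Step 3: differences between starting positions of successive instances.
        for a, b in zip(pos, pos[1:]):
            gap = b - a
            # Step 4: count every factor of the difference (up to max_factor).
            for f in range(2, max_factor + 1):
                if gap % f == 0:
                    counts[f] += 1
    return counts

# Step 5: the keylength should be among the most frequent factors, e.g.
# kasiski_factor_counts(ciphertext).most_common(5)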
Another traditional keylength determination method is the Index of Coincidence (IC), introduced by William Friedman in 1920. The IC is defined as the probability of two randomly chosen symbols from a ciphertext being the same. The IC varies from 0.038, corresponding to a flat frequency distribution, to 0.066, corresponding to the distribution of English plaintexts. The measure of this variation is related to the keylength (number of alphabets). Hence, the IC for text encrypted with a monoalphabetic cipher is the same as for plaintext in a given language. One major drawback of the IC is that precision decreases sharply as keylength increases, which means that the small differences between neighbouring values of IC are overshadowed by the deviations from the theoretical letter frequencies that are bound to occur in any given ciphertext.
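The IC is straightforward to estimate from letter counts; a small sketch, assuming the standard estimator sum_i f_i(f_i - 1) / (n(n - 1)):

from collections import Counter

def index_of_coincidence(text):
    # Probability that two randomly chosen symbols of the text coincide.
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0
    return sum(f * (f - 1) for f in Counter(letters).values()) / (n * (n - 1))

# English plaintext gives roughly 0.066; a flat distribution over 26
# letters approaches 1/26, roughly 0.038.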
An extensive account of the Kasiski method and the IC method can be found
in [2].

3 Sorting out keylengths with an IR-based method

Our procedure uses two types of IR techniques:

- document clustering, or unsupervised document classification, which attempts to group together documents of similar content, with respect to certain attributes. Grouping texts according to authorship, subject matter and regional dialects are examples of clustering applications.

- document categorization, which attempts to assign documents to two or more pre-defined categories. Normally, categorization involves a training phase, in which examples of pairs ⟨text, attribute⟩ are given to a learning system that will produce the knowledge used by the categorizing algorithm. Labelling each text, in a group of texts of different authors, with its respective author, is an example of categorization.
The application of some notion of similarity between texts is broadly used in
linguistic areas [4] such as dialectology, historical linguistics, stylometry, second-language learning, psycholinguistics and even in theoretical linguistics. Many
learning algorithms employed in Computational Linguistics, including supervised techniques, such as memory-based learning (k-NN) and support-vector
machines, and unsupervised techniques, such as Kohonen maps and clustering,
are fundamentally based on measures of similarity.
The concept of similarity between texts is the basis of the keylength clustering and categorization procedure proposed here. The Vector Space Model is
presented in 3.2 in order to emphasize how blind it is to linguistic features other
than word frequencies. The IR methods that have been used in the experiments
are described in 3.3 and 3.5.
3.1 Overview of the Procedure

The method proposed here is an application of clustering and categorization techniques to cryptanalysis. We illustrate the method with an experiment with polyalphabetic ciphers. This method is appropriate for classifying collections of ciphertexts with different keys and keylengths. In this particular setting (polyalphabetic encryption system), our method goes as far as the determination of the keylength. The block diagram of the keylength determination process is shown in Figure 1.
The input is a collection of ciphertexts. Each of the cryptograms is represented as a vector of words. If spaces are preserved in the ciphertext, then word delimitation is a usual tokenization task; otherwise a standard word length is stipulated.
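A small sketch of this pre-processing step (the default block size of five symbols is an arbitrary illustration, not a value prescribed by the paper):

def tokenize(ciphertext, block=5):
    # Words are space-delimited tokens if spaces survive encryption;
    # otherwise the text is cut into fixed-size blocks.
    if ' ' in ciphertext:
        return ciphertext.split()
    return [ciphertext[i:i + block] for i in range(0, len(ciphertext), block)]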
These word vectors are used to calculate the inter-ciphertext similarity matrix. In this matrix, the cell d_ij is the value of similarity between ciphertexts i and j. After these two pre-processing steps, the keylength determination procedure is carried out in three stages.
In the first stage, the collection of ciphertexts is clustered, based on the similarity matrix. Even though the key is unknown, we expect this phase to produce groups of texts ciphered with the same cipher key. Here we rely on the hypothesis that different cipher keys determine a particular language in the ciphertext, which can be subject to successful clustering.
The second stage begins by constructing inter-similarity matrices for each
cluster of ciphertexts that is output from the clustering stage. Then each cluster
is labelled with its Kasiski index (defined in 3.4), which will narrow down the
set of possible keylengths.

[Figure 1. The block diagram of the keylength determination process: input ciphertexts → representing texts as vectors → calculating similarity matrix → clustering → labelling with Kasiski index → categorizing with the kNN method → keylength.]

At this point the procedure requires a k-Nearest Neighbour (kNN) categorizer that was previously trained. The third stage uses distance metrics to position each labelled cluster in relation to the training space composed of sets of cryptograms of known keylength. This is accomplished by computing the proximity between a categorizing cluster and the clusters of this training space. For the kNN categorizer we consider as neighbourhood for a target cluster only the clusters whose labels are multiples of the target cluster's label. The estimated keylength is a function of this neighbourhood.
The process of training the categorizer is shown in Figure 2. The training corpus of plaintexts has to be encrypted with a variety of keys and keylengths, and the resulting set of ciphertexts must be indexed using the vector space model. Groups of resulting same-key cryptograms are put together, as if they were clusters, and then represented by their similarity matrices. Each group (similarity matrix) is labelled with the corresponding keylength and used as a training sample for the categorizer.
A review of the basic techniques used in the keylength determination procedure is useful to build the theoretical foundations of such an empirical method
and justify the IR approach to the cryptanalysis problem.
3.2 Modelling ciphertexts in a vector space

The vector space model [5] is one of the most widely used models for document
retrieval due to its conceptual simplicity and the clarity of the metaphor of
spatial proximity between documents, represented as word vectors. Each word in the document collection is represented as a dimension in the vector space; the size of the vocabulary contained in the collection determines the number of dimensions of the space. The vector length along any dimension is a function of the number of occurrences of a particular word in the represented document. Similarity between two documents is measured by the cosine of the angle between their respective vectors.

[Figure 2. The block diagram of the training process: training texts → encryption system → encrypted training texts → calculate similarity matrices → labelling with correct keylength → trained kNN categorizer.]
For example, consider the following strings as the only two text documents in a collection:

    a dog and a cat
    and a frog and a cat

The corresponding vector space would be composed of 5 dimensions (a, dog, and, cat, frog) and contain two vectors. Considering raw word frequency as the vector length along a given dimension, the vectors would be:

    (2, 1, 1, 1, 0)
    (2, 0, 2, 1, 1)
Whatever language is under consideration, it can be said that many of the
words in a given document do not describe its content. Grammatical words
(prepositions, pronouns, articles, modal and light verbs, numerals, and adverbs)
and high frequency words are examples of such non-discriminating words. In the case of ciphertexts, it is obviously not possible to detect grammatical words using a linguistic criterion. The high frequency criterion is also difficult to implement for systems other than monoalphabetic ciphers, because the same word might be encrypted into several different strings, corrupting the plaintext language's term frequency distribution.

Term weighting for the vector space model has been largely based on single term statistics. In the case of polyalphabetic ciphertexts, it has to be stressed, the frequencies are not as influential because they are not real language frequencies. Statistics have been disrupted by encryption, and therefore there is great difficulty in assessing the impact of weighting.
Calculating the similarities between two document vectors can be done in a number of ways, using such measures as the Euclidean distance, the Dice coefficient, the Jaccard coefficient, and the cosine measure. The latter is a popular measure of similarity for text clustering, based on the angle between two document vectors. The cosine measure between vectors u and v is given by the quotient between the scalar product of the vectors and the product of their norms:

$$\cos(u, v) = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}$$
For example, the cosine measure of similarity between "a dog and a cat" (2, 1, 1, 1, 0) and "and a frog and a cat" (2, 0, 2, 1, 1) is given by:

$$\frac{2 \cdot 2 + 1 \cdot 0 + 1 \cdot 2 + 1 \cdot 1 + 0 \cdot 1}{\sqrt{2^2 + 1^2 + 1^2 + 1^2 + 0^2}\,\sqrt{2^2 + 0^2 + 2^2 + 1^2 + 1^2}} = \frac{7}{\sqrt{7}\sqrt{10}} \approx 0.84$$

The similarity matrix of n documents is an n × n symmetric matrix, where each cell m_ij contains the similarity measure (e.g. cosine) between documents d_i and d_j.
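The model and the worked example above can be reproduced in a few lines of Python; this sketch (the helper names are ours) builds bag-of-words vectors with Counter and computes the pairwise cosine similarity matrix:

from collections import Counter
from math import sqrt

def cosine(u, v):
    # Only words shared by both vectors contribute to the dot product.
    dot = sum(u[w] * v[w] for w in u)
    norms = sqrt(sum(f * f for f in u.values())) * sqrt(sum(f * f for f in v.values()))
    return dot / norms

def similarity_matrix(texts):
    vecs = [Counter(words) for words in texts]  # bag-of-words vectors
    return [[cosine(a, b) for b in vecs] for a in vecs]

m = similarity_matrix(["a dog and a cat".split(),
                       "and a frog and a cat".split()])
print(round(m[0][1], 2))  # 0.84, matching the worked example above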
3.3 Ciphertext Clustering

Document clustering, or unsupervised document classification, has been used to improve retrieval systems by attempting to group together documents of similar content. The first phase of our procedure consists of clustering the initial set of encrypted documents according to their vector space similarity, hoping to obtain clusters of texts ciphered with the same cipher key. We reiterate our view that, since different cipher keys will determine a particular language in the ciphertext, same-key ciphertexts will present similarities due to a common linguistic base.
Hierarchical clustering is used when the input set S of objects does not have an expected partition into well-separated clusters. Two major types of hierarchical clustering algorithms are currently in use: divisive algorithms, which partition S until singleton sets are obtained; and agglomerative algorithms, which proceed by merging singleton sets until S is reconstituted.
Agglomerative methods are widely used in IR systems. A standard agglomerative algorithm applied to a set of texts takes as input:
1. a set of n texts;
2. an n × n matrix of inter-similarity values.
The algorithm starts out with each individual text as a singleton cluster and
includes the following iterative steps [3]:

1. search the similarity matrix for the two most similar clusters (i.e. the clusters with the least distance between them);
2. the two selected clusters, c_i and c_j, are merged to produce a new cluster, c_ij, that now contains two or more texts;
3. calculate the similarities between this new cluster and all other clusters (only those distances involving the new cluster have changed).
These steps are repeated until all texts are grouped in one large cluster. In order to make this algorithm work, the inter-cluster similarities have to be computed in step 3. At the beginning, when there are only singleton clusters, this computation consists of calculating cosine values (Section 3.2). When non-singleton clusters are grouped with other clusters, singleton or non-singleton, there are different ways to define the highest inter-cluster similarity: single-link, average-link, complete-link or Ward's method (see [3] for detailed explanations).
Single linkage, or nearest neighbour linkage, groups the two clusters with the highest similarity between two of their respective texts. In this work, we empirically chose the single-link policy, because it produced the best clustering results, as shown in Section 4.
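A sketch of this clustering stage using SciPy's agglomerative routines, assuming the cosine similarities are turned into distances as 1 - similarity; the cut height is left as a parameter (the choice of the cutting line is discussed below):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_ciphertexts(sim, cut):
    dist = 1.0 - np.asarray(sim, dtype=float)
    np.fill_diagonal(dist, 0.0)                 # self-distance must be zero
    condensed = squareform(dist, checks=False)  # condensed distance vector
    Z = linkage(condensed, method='single')     # single-link (nearest neighbour)
    # Cutting the dendrogram at height `cut` yields the output clusters.
    return fcluster(Z, t=cut, criterion='distance')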

[Figure 3. Sample dendrogram with leaf nodes a-m and a thick horizontal cutting line.]

The result of the hierarchical clustering approach is usually represented as a hierarchical tree - or dendrogram - in which the leaf nodes represent the input ciphertexts in singleton clusters. By drawing a horizontal line across the dendrogram, and looking at the subtrees below the line, the number of output clusters can be set. Different cluster definitions can be found by moving the horizontal line up or down.

For instance, in Figure 3 the thick horizontal line defines three clusters:
A = {f, g, d, e}
B = {m, h, l, j, i, k}
C = {a, b, c}
The two most similar cryptograms of this example are h and l, since they were the first singletons to be merged, as can be seen in Figure 3.
There is the question of where to position the optimal cutting line. In our case, we would wish to obtain clusters in which (i) only same-key cryptograms are grouped together, and (ii) all same-key cryptograms are grouped in the same cluster. Our experiments showed that condition (i) was always accomplished.
3.4 Labelling the clusters

The Kasiski method proceeds mechanically to determine all factors of the differences between starting points of recurring string patterns, as described in Section 2. The factors that appear frequently have to be checked, and it is reasonable to assume that the cryptanalyst will be looking at many different keylengths to test.
We based our procedure for labelling the clusters before categorization on the Kasiski method. The following definitions were used.
Definition 1. The largest most frequent prime factor of the distances between repeated n-grams in a ciphertext is defined here as the Kasiski index of a document, KI_d.
Definition 2. The largest most frequent value of KI_d in a cluster is defined here as the Kasiski index of a cluster, KI_c.
Labelling is the assignment of KI_c to every cluster. This labelling begins with the calculation of KI_d for each document in each cluster. At this point we expect that the keylength of the ciphertexts in a cluster is a multiple of the cluster's KI_c.
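A sketch of the computation of KI_d as we read Definition 1 (helper names and the tri-gram default are ours):

from collections import Counter, defaultdict

def prime_factors(n):
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors

def kasiski_index(ciphertext, n=3):
    # KI_d: the largest among the most frequent prime factors of the
    # distances between repeated n-grams.
    positions = defaultdict(list)
    for i in range(len(ciphertext) - n + 1):
        positions[ciphertext[i:i + n]].append(i)
    counts = Counter()
    for pos in positions.values():
        for a, b in zip(pos, pos[1:]):
            counts.update(prime_factors(b - a))
    if not counts:
        return 0
    top = max(counts.values())
    return max(f for f, c in counts.items() if c == top)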
3.5 Cluster Categorization

The third phase of the proposed procedure is accomplished by categorizing by keylength the clusters obtained in the first phase and labelled in the second. The categorizer was based on the k-Nearest Neighbour (kNN) algorithm [6], a type of instance-based learning algorithm. Rather than constructing a general explicit description of the learning function when training examples are provided, this class of learners simply stores the training examples; the generalization step is performed only when a new instance must be categorized. The kNN algorithm is robust and effective when provided with a sufficiently large document set.
The training examples in this case were pairs (similarity matrix, keylength). We use a set of matrix distance metrics to compute the nearest neighbours of a new cluster to be categorized.

For instance, if the following training samples are provided:

⟨m1, 66⟩  ⟨m2, 36⟩  ⟨m3, 42⟩
⟨m4, 56⟩  ⟨m5, 84⟩  ⟨m6, 45⟩
⟨m7, 63⟩  ⟨m8, 50⟩  ⟨m9, 23⟩
⟨m10, 27⟩ ⟨m11, 84⟩ ⟨m12, 54⟩

and the following labelled cluster is to be categorized, for k = 3:

⟨m, 7⟩

The measurement space is composed only of the set of training clusters whose labels are multiples of the target cluster's label, hence:

⟨m3, 42⟩  ⟨m4, 56⟩  ⟨m5, 84⟩  ⟨m7, 63⟩  ⟨m11, 84⟩

In the example, the three least distances in

{d(m, m3), d(m, m4), d(m, m5), d(m, m7), d(m, m11)}

indicate the nearest neighbours of ⟨m, 7⟩.
For the final result there are two possibilities: either the most common label pertaining to the k nearest neighbours, for an arbitrary k, is assigned to the cluster, or all the labels are returned, a multi-categorization policy, so that the cryptanalyst can work from a set of possible keylengths.
In the example, suppose the neighbours chosen were {m5, m7, m11}. A single-categorization policy would yield 84 as the keylength; a multi-categorization policy would yield both 84 and 63 as keylengths.
The evaluation of our experiments in Section 4 presents both cases.
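A sketch of this categorization step, assuming the similarity matrices have been brought to a common shape so that the Euclidean matrix distance is defined (the names, and the use of that single metric, are our simplifications):

import numpy as np

def knn_keylengths(target_matrix, target_label, training, k=3):
    # training: list of (similarity_matrix, keylength) pairs.
    # Only clusters whose keylength is a multiple of the target's
    # Kasiski label belong to the measurement space.
    candidates = [(m, kl) for m, kl in training if kl % target_label == 0]
    candidates.sort(key=lambda mk: np.linalg.norm(target_matrix - mk[0]))
    return [kl for _, kl in candidates[:k]]

# Single-categorization: assign the most common returned label.
# Multi-categorization: return the whole label set to the cryptanalyst.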

4 Evaluating the Procedure

The proposed keylength determination procedure has been implemented and tested on a collection of texts. The results of the experiments were evaluated following the recommendations of Yang and Liu [7] for text categorization methods.
The effectiveness of category assignments by the categorizer to the clusters of same-key ciphertexts is evaluated by the standard recall, precision and F1 measures. Recall is defined as the number of correct assignments by the categorizer (cs) divided by the total number of correct assignments (c). Precision is the number of correct assignments by the categorizer (cs) divided by the total number of assignments by the categorizer (na).

These measures can be computed by macro-averaging or micro-averaging. In the first approach, recall (maR) and precision (maP) are calculated per category first and then the average is taken over all categories. Considering n test clusters and m categories, the macro-averaged values for recall and precision are:

$$maR = \frac{1}{m}\sum_{i=1}^{m} \frac{cs_i}{c_i}, \qquad maP = \frac{1}{m}\sum_{i=1}^{m} \frac{cs_i}{na_i}$$

where cs_i is the number of correct system assignments for category i, c_i is the number of correct assignments for category i (i.e. the number of documents in category i) and na_i is the number of system assignments for category i. For micro-averaging, the global values of recall (miR) and precision (miP) are computed instead of those relative to categories, thus

$$cs = \sum_{i=1}^{m} cs_i, \qquad c = \sum_{i=1}^{m} c_i, \qquad na = \sum_{i=1}^{m} na_i$$

and

$$miR = \frac{cs}{c}, \qquad miP = \frac{cs}{na}$$
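These definitions translate directly into code; a small sketch over the per-category counts cs_i, c_i and na_i:

def macro_micro(cs, c, na):
    # cs, c, na: per-category lists of correct system assignments,
    # correct assignments, and system assignments, respectively.
    m = len(cs)
    maR = sum(csi / ci for csi, ci in zip(cs, c)) / m
    maP = sum(csi / nai for csi, nai in zip(cs, na)) / m
    miR = sum(cs) / sum(c)
    miP = sum(cs) / sum(na)
    return maR, maP, miR, miP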

The third phase of the procedure involves training a kNN categorizer and, thus, a collection of texts similar to the ones that the cryptanalyst will have to decipher needs to be available. The cryptanalyst's activity allows him to limit the scope of possible subjects and message sizes, given that the origin of the ciphertext is known. For instance, in a military setup, one can expect to intercept messages requesting resources (e.g. food, arms, ammunition and personnel) that are short. The kNN categorizer should be trained for each well-defined situation; its performance would suffer otherwise. It is important that a corpus of adequate size be available for training, and this is not a problem in many real-life cryptanalysis setups.
4.1 Experiment setup

The training and test corpora used in this study were derived from the books of the Bible, containing 144,000 words, obtainable on the Web at http://www.o-bible.com/bbe.html, as of March 2005. These texts were encrypted using a Vigenère tableau, developed specifically for this project, generating a total of 26,640 cryptograms with an average of 1,000 words each.
The experiment used keys of odd keylength ranging from three to 49 (approximately 5% of the cryptograms' length). For each keylength, 30 random cipher keys were generated to encrypt the training set and seven random cipher keys to encrypt the test set. In total, the categorizer was trained with 21,600 and tested with 5,040 cryptograms. The plaintexts of the training corpus are distinct from the ones of the testing corpus.

4.2 Metrics sensitivity in relation to the keylengths

Before the experiments, we analysed the collection of ciphertexts, with the knowledge of their cipher keys, in order to establish the relation between statistical features and the keylengths. We used this evidence to support the clustering and categorization phases of our procedure.
Table 1 shows the preliminary characterization of the collections, after indexing and computing the pairwise similarity matrix for each same-keylength text collection. Term frequency has been used to weight terms. Each of these matrices was examined with respect to the average values of: average similarity, median similarity (the middle value in a matrix, above and below which lie an equal number of values), and average vocabulary size.
Table 1. Averaging the data

Keylength | Average Similarity | Average Median Similarity | Average Vocabulary
    3     |      0.527795      |         0.526842          |      181.9667
    9     |      0.297273      |         0.295511          |      228.3952
   15     |      0.210244      |         0.209757          |      242.6048
   21     |      0.16716       |         0.165185          |      250.9571
   27     |      0.139001      |         0.136993          |      255.381
   33     |      0.120689      |         0.118314          |      258.3667
   39     |      0.109187      |         0.108249          |      261.5429
   45     |      0.09895       |         0.098297          |      263.3143

As the length of the key is increased it can be observed from Table 1 that the
vocabulary size increases and the similarities decrease. A longer key implies that
there will be several representations for the same plaintext word, depending on
its position in the text, which explains the observed decrease in similarity.
4.3 An experiment in keylength determination

Using the experimental setup of 4.1, we executed the three-phase procedure. The results of each phase are presented separately.
1. Clustering according to cipher key produced perfect results for precision and recall in the 168 clusters that were produced. This excellent result is to be expected, since it is a task similar to language discrimination.
2. In the labelling phase, the computation of the individual KI_d revealed that a correct factor of the keylength is obtained for all ciphertexts of keylength less than 19, hence the clusters were trivially labelled with a correct KI_c. For keylengths from 19 up to 45, errors in the computation of KI_d started to creep in, but were compensated by the policy of calculating KI_c taking into account the majority KI_d in a cluster. Only four out of the 168 clusters were assigned an incorrect KI_c, all of them clusters of keylength 47.
These results support the approach of working within clusters, which boosts significantly the performance of keylength determination algorithms, because in each cluster one is sure to find documents ciphered with the same key.
3. In the categorization phase, as already mentioned, the training samples are pairs (similarity matrix, keylength). In order to measure the distance between two matrices in the kNN categorizing space we used four different metrics: the Euclidean matrix distance and the differences in average similarity, median similarity, and vocabulary size.
Test data was analysed in two different sub-experiments, EX1 and EX2, whose results are shown in Tables 2 and 3:
EX1 a cluster is labelled with category n if n is the category of the majority of the neighbours according to the categorizer;
EX2 a cluster is labelled with all the categories that label more than 6 neighbours according to the categorizer (maximum of 5 categories).

Table 2. Results of the experiment - single categorization (EX1)

Metrics            |   miP    |   miR    |   maP    |   maR
Euclidean distance | 0.993902 | 0.970238 | 0.994792 | 0.970238
Mean               | 0.932927 | 0.910714 | 0.947669 | 0.910714
Median             | 0.902439 | 0.880952 | 0.928241 | 0.880952
Vocabulary size    | 0.97561  | 0.952381 | 0.982292 | 0.952381

miR = micro-recall; miP = micro-precision; maR = macro-recall; maP = macro-precision

Table 3. Results of the experiment - multi categorization (EX2)

Metrics            |   miP    |   miR    |   maP    |   maR
Euclidean distance | 0.896175 | 0.97619  | 0.935185 | 0.97619
Mean               | 0.87234  | 0.97619  | 0.914247 | 0.97619
Median             | 0.845361 | 0.97619  | 0.896983 | 0.97619
Vocabulary size    | 0.901099 | 0.97619  | 0.943182 | 0.97619

miR = micro-recall; miP = micro-precision; maR = macro-recall; maP = macro-precision

In the context of cryptanalysis, the results in Tables 2 and 3 are extremely positive. The best results under the single-categorization evaluation policy were obtained with the Euclidean distance, which reached over 99% precision and over 97% recall. Under the multi-categorization policy, vocabulary size came out on top.

We implemented the Kasiski method and executed it with the same test data. The results are displayed in Table 4. It can be observed that the Kasiski method starts to falter at keylengths less than 1% of the text length. In contrast, using the Euclidean distance, our method maintains perfect results until 4.5% of the text length.
Table 4. Results of the experiment - comparison with the Kasiski method

Keylength | Kasiski Method | Average Similarity | Vocabulary Size | Euclidean Distance
    3     |     1          |       1            |      1          |      1
    5     |     1          |       1            |      1          |      1
    7     |     1          |       1            |      1          |      1
    9     |     0.0095     |       1            |      1          |      1
   11     |     1          |       1            |      1          |      1
   13     |     1          |       1            |      1          |      1
   15     |     0          |       1            |      1          |      1
   17     |     1          |       1            |      1          |      1
   19     |     0.9667     |       1            |      1          |      1
   21     |     0          |       1            |      1          |      1
   23     |     0.9857     |       1            |      1          |      1
   25     |     0.0048     |       1            |      1          |      1
   27     |     0          |       1            |      1          |      1
   29     |     0.8667     |       1            |      1          |      1
   31     |     0.9333     |       1            |      1          |      1
   33     |     0          |       1            |      1          |      1
   35     |     0          |       0.8571       |      0.8571     |      1
   37     |     0.7857     |       1            |      1          |      1
   39     |     0          |       0.2857       |      0.5714     |      1
   41     |     0.7286     |       1            |      1          |      1
   43     |     0.7095     |       1            |      1          |      1
   45     |     0          |       0.2857       |      1          |      0.8571
   47     |     0.5000     |       0.4286       |      0.4286     |      0.4286
   49     |     0.0238     |       1            |      1          |      1

5 Future works

The results obtained with the polyalphabetic experiments suggest that similar experiments could be designed for the cryptanalysis of contemporary cryptographic systems, in an exploratory search for linguistic patterns. The following preliminary experiment has been carried out with the block cipher algorithms DES (Data Encryption Standard) and AES (Advanced Encryption Standard). We selected 11 text sizes for DES encryption and 9 for AES. For each of those sizes, we selected 30 plaintexts to encrypt with 50 randomly chosen keys, generating 1,500 cryptograms for each of the different text sizes.

Table 5 presents the results for the DES and AES ciphering. The results for the DES experiment show that in all 11 clustering processes, every cluster contained only same-key cryptograms, as indicated by the value of precision (1). For cryptograms of 512 bytes or larger, recall is also 1, demonstrating that, for each key, all of them have been correctly placed in the same cluster. The results are identical for all keys, since the plaintexts are the same and each collection of same-key cryptograms is simply an alphabetical variation of the set of plaintexts. The results also show that the procedure requires larger texts in order to perform well for AES ciphers than for DES ciphers, a conclusion that follows logically from the fact that the AES key space is much larger.
Table 5. Text size and clustering - DES and AES

              DES                 |               AES
Text size | Precision | Recall   | Text size | Precision | Recall
  10240   |     1     |   1      |   10240   |     1     |   1
   8192   |     1     |   1      |    8192   |     1     |   1
   6144   |     1     |   1      |    6144   |     1     |   1
   4096   |     1     |   1      |    4096   |     1     |   0.97
   2048   |     1     |   1      |    3072   |     1     |   0.87
   1024   |     1     |   1      |    2560   |     1     |   0.5
    512   |     1     |   1      |    2048   |     1     |   0.33
    256   |     1     |   0.8    |    1536   |     1     |   0.2
    192   |     1     |   0.71   |    1024   |     1     |   0.16
    128   |     1     |   0.4    |           |           |
     64   |     1     |   0.6    |           |           |

6 Concluding Remarks

This article presented a novel approach to the investigation of keylength of ciphertexts, based on elementary Information Retrieval tasks, such as document
clustering and categorization.
Clustering according to cipher key produced perfect results in our experiments. In this process, even though the key is unknown, our procedure groups texts ciphered with the same cipher key, confirming the hypothesis that different cipher keys determine a particular language in the ciphertext, which can be subject to successful clustering.
One of the main advantages of our method over the other available methods is that the proposed procedure is language independent; therefore specific languages' frequency distributions are not required.
In the restricted scope of the application of our method to the case of polyalphabetic systems, we widened the usefulness of the Kasiski method in a major way: the KI_c, being the most frequent KI_d in a cluster, obliterates all the isolated errors in computing Kasiski factors for individual documents in that cluster. It has been observed in this study that vocabulary size and Euclidean distance between similarity matrices are closely linked to the length of the key in polyalphabetic ciphers, since they yielded the two most precise cluster distance metrics.
On a more theoretical note, it was interesting to compare the foundations of the Kasiski method, conceived well before computers were available, to the bag-of-words approach to document retrieval, in order to propose the automation of the first known polyalphabetic key determination method.
With respect to contemporary ciphers, we produced preliminary evidence that the DES and the AES algorithms do not produce text cryptograms totally unrelated to the input.

References
1. A. Clark and E. Dawson. Discrete optimisation: A powerful tool for cryptanalysis? In Proceedings of the First International Conference on the Theory and Applications of Cryptography, Prague, Czech Republic, 1996.
2. D. E. R. Denning. Cryptography and Data Security. Addison-Wesley Publishing Company, USA, 1982.
3. A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264-323, 1999.
4. L. Lebart and M. Rajman. Computing similarity. In R. Dale et al., editors, Handbook of NLP. Dekker: Basel, 2000.
5. C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.
6. T. Mitchell. Machine Learning. McGraw-Hill International Editions, 1997.
7. Y. Yang and X. Liu. A re-examination of text categorization methods. In 22nd Annual International SIGIR, pages 42-49, Berkeley, August 1999.
