Cryptanalysis
Claudia Oliveira, Jose Antonio Xexeo, and Carlos Andre Carvalho
Departamento de Engenharia de Sistemas, Instituto Militar de Engenharia
Rio de Janeiro, Brazil
Abstract. This paper presents a new application for Information Retrieval techniques. We introduce the use of clustering and categorization
in the attack of cryptosystems. In order to clearly present the fundamentals and understand the workings and the implications of this new
technique, we developed a procedure for keylength determination in the
process of cryptanalysis of polyalphabetic ciphers, the core of any attack
of this type of ciphers. The basic premises are: first, a cryptogram is a
normal document written in an unknown language; secondly, Information
Retrieval Techniques are extremely useful in detecting string patterns in
ordinary texts and might be helpful with cryptograms as well.
Introduction
A polyalphabetic substitution cipher is a combination of monoalphabetic substitution ciphers. The number of monoalphabetic substitution ciphers is known as
the period or keylength (M). The message is broken into blocks of length M and
each of the M characters in the block is encrypted using a different monoalphabetic key. It should be evident that the ciphertext created by a polyalphabetic
substitution cipher has similar properties to that created by a simple substitution cipher. As a result, if the period is known, polyalphabetic substitution
ciphers are vulnerable to the same cryptanalytic attacks as the simple substitution cipher [1]. Thus, knowing the keylength reduces the problem to cracking
a series of monoalphabetic substitution ciphers. Cryptanalysis of a polyalphabetic substitution cipher is therefore accomplished in three steps: 1) determine
the keylength; 2) break ciphertext into separate pieces, one piece per permutation; 3) solve each piece as a separate monoalphabetic substitution cipher, using
frequency distributions.
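As a minimal sketch of step 2 (the function name and toy ciphertext are our own illustration, not the paper's), the ciphertext can be split into M interleaved pieces, each of which is an ordinary monoalphabetic cryptogram:

```python
def split_by_period(ciphertext: str, m: int) -> list[str]:
    """Break the ciphertext into m pieces, one per alphabet.

    Characters at positions i, i+m, i+2m, ... were enciphered with the
    same monoalphabetic key, so each piece can then be attacked
    separately with frequency analysis (step 3).
    """
    return [ciphertext[i::m] for i in range(m)]

# With period 3, every third letter lands in the same piece:
print(split_by_period("ABCDEFGHI", 3))  # ['ADG', 'BEH', 'CFI']
```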
2.1 Keylength Determination Methods
There are some known methods for determining keylength in polyalphabetic substitution ciphers. The first of those methods to be published is due to Friedrich
Kasiski, in Die Geheimschriften und die Dechiffrierkunst (Secret writing and
the Art of Deciphering), 1863. It is based on the analysis of gaps between recurring patterns in the ciphertext, assuming that in a substitution cipher, patterns
in the plaintext will manifest themselves in the ciphertext. For example, the
tri-grams THE and ING occur frequently in English. In a polyalphabetic
substitution, it is possible, or even likely, that they will be enciphered using
adjacent permutations at multiple points. If the message is long enough, one
expects to find multiple occurrences of these patterns. Kasiski's method works
on tri-grams (or longer n-grams) as follows: 1) identify repeated patterns of 3
or more letters in the ciphertext; 2) for each pattern, write down the starting
positions for all the instances of the pattern; 3) compute the differences between
the starting positions of successive instances; 4) determine all the factors of these
differences; 5) the keylength will be one of the factors that appears more often.
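The five steps above can be sketched as follows (a hedged Python illustration; the function name and the factor-counting details are our own choices, not the paper's):

```python
from collections import Counter

def kasiski_factor_counts(ciphertext: str, n: int = 3) -> Counter:
    """Tally factors of the gaps between repeated n-grams (steps 1-4);
    the keylength is expected among the most frequent factors (step 5).
    """
    # Steps 1-2: record the starting positions of every n-gram.
    positions: dict[str, list[int]] = {}
    for i in range(len(ciphertext) - n + 1):
        positions.setdefault(ciphertext[i:i + n], []).append(i)

    counts: Counter = Counter()
    for starts in positions.values():
        # Step 3: differences between successive occurrences.
        for a, b in zip(starts, starts[1:]):
            gap = b - a
            # Step 4: count every factor of the gap (factor 1 is skipped).
            for f in range(2, gap + 1):
                if gap % f == 0:
                    counts[f] += 1
    return counts

# "ABCXY" repeats at distance 5, so 5 dominates the factor counts:
print(kasiski_factor_counts("ABCXYABCXY").most_common(1))  # [(5, 3)]
```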
Another traditional keylength determination method is the Index of Coincidence (IC), introduced by William Friedman in 1920. The IC is defined as the
probability of two randomly chosen symbols from a ciphertext being the same.
The IC varies between 0.038, corresponding to a flat frequency distribution, and
0.066, corresponding to the distribution of English plaintexts. The measure of
this variation is related to the keylength (number of alphabets). Hence, the IC
for text encrypted with a monoalphabetic cipher is the same as for plaintext in a
given language. One major drawback of IC is that precision decreases sharply as
keylength increases, which means that the small differences between neighbouring values of IC are overshadowed by the deviations from the theoretical letter
frequencies that are bound to occur in any given ciphertext.
An extensive account of the Kasiski method and the IC method can be found
in [2].
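The IC itself is straightforward to compute from symbol frequencies; this sketch (our own code, following the definition above) does so:

```python
from collections import Counter

def index_of_coincidence(text: str) -> float:
    """Probability that two symbols drawn at random from `text`
    (without replacement) are the same."""
    n = len(text)
    freqs = Counter(text)
    return sum(f * (f - 1) for f in freqs.values()) / (n * (n - 1))

# Two 'a's and two 'b's: 4 of the 12 ordered pairs coincide.
print(round(index_of_coincidence("aabb"), 4))  # 0.3333
```

For English plaintext this value is about 0.066, and for a flat frequency distribution about 0.038, as stated above.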
[Figure 1. Overview of the keylength determination procedure: input ciphertexts are represented as vectors; a similarity matrix is calculated; the texts are clustered; each cluster is labelled with its Kasiski index; and the kNN categorizer estimates the keylength.]
At this point the procedure requires a k-Nearest Neighbour (kNN) categorizer that was previously trained. The third stage uses distance metrics to position each labelled cluster in relation to the training space composed of sets of
cryptograms of known keylength. This is accomplished by computing the proximity between a categorizing cluster and the clusters of this training space. For
the kNN categorizer we consider as neighbourhood for a target cluster only the
clusters whose labels are multiples of the target cluster's label. The estimated
keylength is a function of this neighbourhood.
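The paper does not spell out this function, so the following is only a sketch under our own assumptions: the training space is a list of (keylength label, feature vector) pairs, the neighbourhood is restricted to labels that are multiples of the target cluster's label, and the estimate is the majority label among the k nearest such clusters.

```python
def knn_keylength(target_label, target_vec, training, k=3):
    """Estimate a keylength from the restricted kNN neighbourhood.

    Only training clusters whose keylength label is a multiple of the
    target cluster's label are admissible, as described in the text;
    the majority vote over the k nearest is our own assumption.
    """
    candidates = [(label, vec) for label, vec in training
                  if label % target_label == 0]
    # Rank admissible clusters by squared Euclidean distance.
    candidates.sort(key=lambda lv: sum((a - b) ** 2
                                       for a, b in zip(lv[1], target_vec)))
    nearest = [label for label, _ in candidates[:k]]
    return max(set(nearest), key=nearest.count)

# A cluster labelled 3 is matched only against training clusters whose
# labels are multiples of 3; label 7 is never considered.
print(knn_keylength(3, [0.0], [(6, [0.0]), (6, [0.1]), (9, [0.5]), (7, [0.05])]))  # 6
```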
The process of training the categorizer is shown in Figure 2. The training
corpus of plaintexts has to be encrypted with a variety of keys and keylengths
and the resulting set of ciphertexts must be indexed using the vector space
model. Groups of resulting same key cryptograms were put together, as if they
were clusters, and then represented by their similarity matrices. Each group
(similarity matrix) was labelled with the corresponding keylength, and used as
a training sample for the categorizer.
A review of the basic techniques used in the keylength determination procedure is useful to build the theoretical foundations of such an empirical method
and justify the IR approach to the cryptanalysis problem.
3.2 The Vector Space Model
The vector space model [5] is one of the most widely used models for document
retrieval due to its conceptual simplicity and the clarity of the metaphor of
spatial proximity between documents, represented as word vectors. Each word
[Figure 2. Training the categorizer: training texts are encrypted by the cipher system; similarity matrices are calculated for the encrypted training texts; each matrix is labelled with the correct keylength, producing the trained kNN categorizer.]
Term weighting for the vector space model has been largely based on single
term statistics. In the case of polyalphabetic ciphertexts, it has to be stressed that the frequencies are not as influential, because they are not real language frequencies: the statistics have been disrupted by encryption, and there is therefore great difficulty in assessing the impact of weighting.
Calculating the similarities between two document vectors can be done in a
number of ways, using such measures as the Euclidean distance, the Dice coefficient, the Jaccard coefficient, and the cosine measure. The latter is a popular
measure of similarity for text clustering, based on the angle between two document vectors. The cosine measure between vectors u and v is given by the quotient between the scalar product of the vectors and the product of their norms:

    cos(u, v) = Σ_i (u_i · v_i) / ( √(Σ_i u_i²) · √(Σ_i v_i²) )
For example, the cosine measure of similarity between "a dog and a cat" (2,1,1,1,0) and "a frog and a cat" (2,0,2,1,1) is given by:

    (2·2 + 1·0 + 1·2 + 1·1 + 0·1) / ( √(2² + 1² + 1² + 1² + 0²) · √(2² + 0² + 2² + 1² + 1²) ) = 7 / (√7 · √10) ≈ 0.84
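The computation is easy to reproduce; this small sketch (our own code, not from the paper) checks the worked example:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# The worked example from the text: 7 / (sqrt(7) * sqrt(10))
print(round(cosine([2, 1, 1, 1, 0], [2, 0, 2, 1, 1]), 2))  # 0.84
```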
3.3 Ciphertext Clustering

The texts are grouped by hierarchical agglomerative clustering, which proceeds as follows:
1. search the similarities matrix for the two most similar clusters (i.e. the clusters with the least distance between them);
2. the two selected clusters, ci and cj are merged to produce a new cluster, cij
that now contains two or more texts;
3. calculate the similarities between this new cluster and all other clusters (only those distances involving the new cluster have changed).
These steps are repeated until all texts are grouped in one large cluster.
In order to make this algorithm work, the inter-cluster similarities have to be
computed in step 3. At the beginning, when there are only singleton clusters, this
computation consists of the calculation of cosine values (Section 3.2). When non-singleton clusters are grouped with other clusters, singleton or non-singleton,
there are different ways to define the highest inter-cluster similarity: single-link,
average-link, complete-link or Ward's method (see [3] for detailed explanations).
Single linkage, or nearest-neighbour linkage, groups the two clusters with
the highest similarity between two of their respective texts. In this work, we
empirically chose the single-link policy, because it produced the best clustering
results, as shown in Section 4.
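The three steps, combined with the single-link policy and a cutting line, can be sketched as follows (our own naive O(n³) illustration; a real implementation would update the similarity matrix incrementally in step 3):

```python
def single_link_clustering(sims, threshold):
    """Agglomerative clustering with the single-link policy.

    `sims` is a symmetric similarity matrix. Clusters are merged while
    the best single-link similarity stays above `threshold` (the
    horizontal "cutting line" of the dendrogram).
    """
    clusters = [[i] for i in range(len(sims))]
    while len(clusters) > 1:
        # Step 1: find the two most similar clusters. Under single
        # link, cluster similarity is the max pairwise similarity.
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = max(sims[i][j] for i in clusters[a] for j in clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break  # cutting line reached
        # Step 2: merge; step 3 is implicit, since similarities are
        # recomputed from the original matrix each round.
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two tight pairs (0,1) and (2,3), weakly related to each other:
sims = [[1.0, 0.9, 0.1, 0.1],
        [0.9, 1.0, 0.1, 0.1],
        [0.1, 0.1, 1.0, 0.8],
        [0.1, 0.1, 0.8, 1.0]]
print(single_link_clustering(sims, threshold=0.5))  # [[0, 1], [2, 3]]
```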
[Figure 3. Dendrogram of the clustering example; the thick horizontal line marks the cutting line.]
For instance, in Figure 3 the thick horizontal line defines three clusters:
A = {f, g, d, e}
B = {m, h, l, j, i, k}
C = {a, b, c}
The two most similar cryptograms of this example are h and l, since they were
the first singletons to be merged, as can be seen in Figure 3.
There is a question of where to position the optimal cutting line. In our case,
we would wish to obtain clusters in which i. only same key cryptograms were
grouped together, and ii. all same key cryptograms were grouped in the same
cluster. Our experiments showed that condition i is always accomplished.
3.4 Cluster Labelling
The Kasiski method proceeds mechanically to determine all factors of the differences between starting points of recurring string patterns, as described in Section 2. The factors that appear frequently have to be checked, and it is reasonable
to assume that the cryptanalyst will be looking at many different keylengths to
test.
We based our procedure for labelling the clusters before classification on the
Kasiski method. The following definitions were used.
Definition 1. The largest most frequent prime factor of the distances between repeated n-grams in a ciphertext is defined here as the Kasiski index of a document (KId).

Definition 2. The largest most frequent value of KId in a cluster is defined here as the Kasiski index of a cluster (KIc).
Labelling is the assignment of KIc to every cluster. This labelling begins with the calculation of KId for each document in each cluster. At this point we expect that the keylength of the ciphertexts in a cluster is a multiple of the cluster's KIc.
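Definitions 1 and 2 translate directly into code. In this sketch (our own; counting prime factors with multiplicity is an assumption), the gaps between repeated n-grams are assumed to be already extracted:

```python
from collections import Counter

def kasiski_index(distances):
    """KId: the largest most frequent prime factor of the gaps between
    repeated n-grams in one ciphertext (Definition 1)."""
    counts = Counter()
    for d in distances:
        # Collect the prime factors of each gap by trial division.
        f, x = 2, d
        while f * f <= x:
            while x % f == 0:
                counts[f] += 1
                x //= f
            f += 1
        if x > 1:
            counts[x] += 1
    top = max(counts.values())
    # Break frequency ties in favour of the largest prime.
    return max(p for p, c in counts.items() if c == top)

def cluster_kasiski_index(kids):
    """KIc: the largest most frequent KId in a cluster (Definition 2)."""
    counts = Counter(kids)
    top = max(counts.values())
    return max(k for k, c in counts.items() if c == top)

# Gaps 15, 10 and 5 share the prime factor 5, so KId = 5:
print(kasiski_index([15, 10, 5]))        # 5
# A cluster whose documents have KId values 5, 5 and 7 gets KIc = 5:
print(cluster_kasiski_index([5, 5, 7]))  # 5
```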
3.5 Cluster Categorization
For macro averaging, the values of recall (r) and precision (p) are averaged over the m categories:

    r = (1/m) Σ_{i=1..m} (cs_i / c_i),    p = (1/m) Σ_{i=1..m} (cs_i / na_i)
where cs_i is the number of correct system assignments for category i, c_i is the number of correct assignments for category i (i.e. the number of documents in category i), and na_i is the number of system assignments for category i. For micro averaging, the global values of recall (miR) and precision (miP) are computed instead of those relative to categories; thus
    cs = Σ_{i=1..m} cs_i,    c = Σ_{i=1..m} c_i,    na = Σ_{i=1..m} na_i

and

    r = cs / c,    p = cs / na
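For reference, both averaging schemes can be computed in a few lines (our own sketch; variable names mirror the formulas above):

```python
def macro_micro(cs, c, na):
    """Macro- and micro-averaged recall and precision.

    cs[i]: correct system assignments for category i,
    c[i]:  documents actually in category i,
    na[i]: total system assignments for category i.
    """
    m = len(cs)
    macro_r = sum(csi / ci for csi, ci in zip(cs, c)) / m
    macro_p = sum(csi / nai for csi, nai in zip(cs, na)) / m
    micro_r = sum(cs) / sum(c)
    micro_p = sum(cs) / sum(na)
    return macro_r, macro_p, micro_r, micro_p

# Two categories: cs = [8, 1], c = [10, 2], na = [8, 2]
print(macro_micro([8, 1], [10, 2], [8, 2]))  # (0.65, 0.75, 0.75, 0.9)
```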
The third phase of the procedure involves training a kNN categorizer and,
thus, a collection of texts that are similar to the ones that the cryptanalyst will
have to decipher needs to be available. The cryptanalyst's activity allows him to limit the scope of possible subjects and message sizes, given that the origin of the ciphertext is known. For instance, in a military setting, one can expect to intercept short messages requesting resources (e.g. food, arms, ammunition and personnel). The kNN categorizer should be trained for each well-defined situation; its performance would suffer otherwise. It is important that an adequately sized corpus be available for training, and this is not a problem in many real-life cryptanalysis settings.
4.1 Experiment Setup
The training and test corpora used in this study were derived from
the books of the Bible, containing 144,000 words, obtainable on the Web at
http://www.o-bible.com/bbe.html, as of March 2005. These texts were encrypted
using a Vigenère tableau, developed specifically for this project, generating a
total of 26,640 cryptograms with an average of 1,000 words each.
The experiment used keys of odd keylength ranging from three to 49 (approximately 5% of the cryptograms' length). For each keylength, 30 random cipher
keys were generated to encrypt the training set and seven random cipher keys to
encrypt the test set. In total, the categorizer was trained with 21,600 and tested
with 5,040 cryptograms. The plaintexts of the training corpus are distinct from
the ones of the testing corpus.
4.2 Preliminary Analysis
Before the experiments, we analysed the collection of ciphertexts, with the knowledge of their cipher keys, in order to establish the relation between statistical features and the keylengths. We used this evidence as support for the clustering and categorization phases of our procedure.
Table 1 shows the preliminary characterization of the collections, after indexing and computing the pairwise similarity matrix for each same keylength text
collection. Term frequency has been used to weight terms. Each of these matrices
was examined with respect to the average values of: average similarity, median
(the middle value in a matrix, above and below which lie an equal number of
values), and average vocabulary size.
Table 1. Averaging the data

Keylength   Average Similarity   Average Median Similarity   Average Vocabulary Size
    3            0.527795                0.526842                   181.9667
    9            0.297273                0.295511                   228.3952
   15            0.210244                0.209757                   242.6048
   21            0.167160                0.165185                   250.9571
   27            0.139001                0.136993                   255.381
   33            0.120689                0.118314                   258.3667
   39            0.109187                0.108249                   261.5429
   45            0.098950                0.098297                   263.3143
It can be observed from Table 1 that, as the length of the key increases, the vocabulary size increases and the similarities decrease. A longer key implies that there will be several representations for the same plaintext word, depending on its position in the text, which explains the observed decrease in similarity.
4.3 Results
Using the experiment setup described in Section 4.1, we executed the three-phase procedure. The
results of each phase are presented separately.
1. Clustering according to cipher key has produced perfect results for precision
and recall in the 168 clusters that have been produced. This excellent result
is to be expected, since it is a task similar to language discrimination.
2. In the labelling phase, the computation of the individual KId revealed that a correct factor of the keylength is obtained for all ciphertexts of keylength less than 19, hence the clusters were trivially labelled with a correct KIc. For keylengths up to 45, errors in the computation of KId started to creep in, but were compensated by the policy of calculating KIc taking into account the majority KId in a cluster. Only four out of the 168 clusters were assigned an incorrect KIc, all of them occurring in clusters of keylength 47.
These results support the approach of working within clusters, which will
boost significantly the performance of keylength determination algorithms
because in each cluster one is sure to find documents ciphered with the same
key.
3. In the categorization phase, as already mentioned, the training samples are pairs (similarity matrix, keylength). In order to measure the distance between two matrices in the kNN categorizing space we used several metrics: the Euclidean matrix distance and the differences in average similarity, median, and vocabulary size.
Test data was analysed for two different sub-experiments, and the results of
EX1 and EX2 are shown in Tables 2 and 3:
EX1 a cluster is labelled with category n if n is the category of the majority of
the neighbours according to the categorizer;
EX2 a cluster is labelled with all the categories that label more than 6 neighbours according to the categorizer (maximum of 5 categories).
[Table 2. Results of EX1, maR values: 0.970238, 0.910714, 0.880952, 0.952381]

[Table 3. Results of EX2, maR values: 0.97619, 0.97619, 0.97619, 0.97619]
We implemented the Kasiski method and executed it with the same test data. The results are displayed in Table 4. It can be observed that the Kasiski method starts to falter at keylengths less than 1% of the text length. In contrast, using the Euclidean distance, our method maintains perfect results until 4.5% of the text length.
Table 4. Results of the experiment - comparison with the Kasiski method

Keylength   Kasiski Method   Average Similarity   Vocabulary Size   Euclidean Distance
    3           1                1                    1                 1
    5           1                1                    1                 1
    7           1                1                    1                 1
    9           0.0095           1                    1                 1
   11           1                1                    1                 1
   13           1                1                    1                 1
   15           0                1                    1                 1
   17           1                1                    1                 1
   19           0.9667           1                    1                 1
   21           0                1                    1                 1
   23           0.9857           1                    1                 1
   25           0.0048           1                    1                 1
   27           0                1                    1                 1
   29           0.8667           1                    1                 1
   31           0.9333           1                    1                 1
   33           0                1                    1                 1
   35           0                0.8571               0.8571            1
   37           0.7857           1                    1                 1
   39           0                0.2857               0.5714            1
   41           0.7286           1                    1                 1
   43           0.7095           1                    1                 1
   45           0                0.2857               1                 0.8571
   47           0.5000           0.4286               0.4286            0.4286
   49           0.0238           1                    1                 1
Future Work
The results obtained with the polyalphabetic experiments suggest that similar
experiments could be projected for the cryptanalysis of contemporary cryptographic systems, in an exploratory search for linguistic patterns. The following
preliminary experiment has been carried out with the block cipher algorithms
DES (Data Encryption Standard) and AES (Advanced Encryption Standard).
We selected 11 text sizes for DES encryption and 9 for AES. For each of those
sizes, we selected 30 plaintexts to encrypt with 50 randomly chosen keys, generating 1500 cryptograms for each of the different text sizes.
Table 5 presents the results for the DES and AES ciphering. The results for the DES experiment show that in all 11 clustering processes, every cluster contained only same-key cryptograms, as indicated by the value of precision (1). For cryptograms larger than 512 bytes, recall is also 1, demonstrating that, for each key, all cryptograms have been correctly placed in the same cluster. The results are identical for all keys, since the plaintexts are the same and each collection of same-key cryptograms is simply an alphabetical variation of the set of plaintexts. The results also show that the procedure requires larger texts in order to perform well for AES ciphers than for DES ciphers, a conclusion that follows logically from the fact that the AES key space is much larger.
Table 5. Text size and clustering - DES and AES

          DES                              AES
Text size   Precision   Recall   Text size   Precision   Recall
  10240         1         1        10240         1         1
   8192         1         1         8192         1         1
   6144         1         1         6144         1         1
   4096         1         1         4096         1         0.97
   2048         1         1         3072         1         0.87
   1024         1         1         2560         1         0.5
    512         1         1         2048         1         0.33
    256         1         0.8       1536         1         0.2
    192         1         0.71      1024         1         0.16
    128         1         0.4
     64         1         0.6
Concluding Remarks
This article presented a novel approach to the investigation of the keylength of ciphertexts, based on elementary Information Retrieval tasks, such as document
clustering and categorization.
Clustering according to cipher key has produced perfect results in our experiments. In this process, even though the key is unknown, our procedure groups
texts ciphered with the same cipher key, confirming the hypothesis that each cipher key determines a particular language in the ciphertext, which can be subject to successful clustering.
One of the main advantages of our method over the other available methods is the fact that the proposed procedure is language independent; therefore, language-specific frequency distributions are not required.
In the restricted scope of the application of our method to the case of polyalphabetic systems, we widened the usefulness of the Kasiski method in a major way: the KIc , being the most frequent KId in a cluster, obliterates all the
References
1. A. Clark and E. Dawson. Discrete optimisation: A powerful tool for cryptanalysis?
In Proceedings of the First International Conference on the Theory and Applications
of Cryptography, Prague, Czech Republic, 1996.
2. D. E. R. Denning. Cryptography and Data Security. Addison-Wesley Publishing
Company, USA, 1982.
3. A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264-323, 1999.
4. L. Lebart and M. Rajman. Computing similarity. In R. Dale et al., editors, Handbook of Natural Language Processing. Marcel Dekker, 2000.
5. C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.
6. T. Mitchell. Machine Learning. McGraw-Hill International Editions, 1997.
7. Y. Yang and X. Liu. A re-examination of text categorization methods. In 22nd Annual International SIGIR, pages 42-49, Berkeley, August 1999.