
DISSERTATION FOR THE ATTAINMENT OF THE MASTER'S DEGREE IN

ELECTRICAL AND COMPUTER ENGINEERING

Cryptographic Accelerator in Reconfigurable Hardware


DAVID JOSÉ MARQUES MACHADO
DECEMBER 2008

Abstract

This article presents a general overview of cryptography, highlighting its importance in the protection of
information. The main objective of this work is the implementation of the asymmetric algorithm RSA, using the
Montgomery algorithm for modular multiplication, with keys of 4096 bits. Increasingly, companies and organizations use
advanced technology to facilitate and expedite the exchange of information. Privacy becomes necessary to ensure the
safety of the transmitted data, and cryptography is one of the resources used. Modular multiplication is a central
operation in many application areas, including public-key cryptography. There is therefore a need for algorithms that
are fast on the one hand and area and power efficient on the other. In order to accelerate the encryption process,
the modular exponentiation has been implemented in reconfigurable hardware (FPGA).

Key Words: Cryptography, Montgomery algorithm, Privacy, Modular Multiplication

Introduction

The need to secure information has existed ever since information began to hold value. The growing exchange of information between computers over networks has created strong demand for information security systems. Cryptographic algorithms must therefore be well designed and well implemented, so as to limit unauthorized ways of obtaining information. The main objective of this work is the development of the main components of an asymmetric cryptographic system, RSA, that is efficient and secure, in reconfigurable hardware. Efficiency should be understood as the reduction in processing time achieved by improving the exponentiation and modular multiplication algorithms. The algorithms implemented were Exponentiation by Squaring, for the exponentiation, and the Montgomery algorithm, for the modular multiplication. The security of the system was also a concern: keys of 4096 bits are used in order to prevent undesirable users from obtaining the information. Beyond the advantages offered by a hardware implementation, such as greater security and greater processing speed, the use of reconfigurable circuits such as FPGAs (Field Programmable Gate Arrays) provides further flexibility, adjusting the hardware to the user's requirements. This work is structured as follows: the fundamentals of the cryptographic algorithms needed for the development; the hardware architecture; and the results achieved, leading to the conclusions and proposals for further improvement.

Cryptography

Cryptography converts legible data into unreadable data, with the capacity to recover the original data from the unreadable data. It can be accomplished using codes or ciphers. A code implies the substitution of words or phrases, for example establishing that white is called black, while a cipher relies on the substitution of letters. There are three types of ciphers: concealment, substitution and transposition. A concealment cipher is obtained when the private message is simply hidden. The oldest example of this kind of cipher is the one where a message was written on the head of a bald slave, so that when his
hair grows back the message is hidden. Substituting letters for others, according to a preset key, is known as a substitution cipher. A transposition cipher is obtained by anagrams (from the Greek ana = “to come back” or “to repeat” + graphein = “to write”), that is, a kind of word game in which rearranging the letters of an existing word yields the decoded word. For example, “Elegant Man” and “A Gentleman”, or “Monday” and “Dynamo”, are anagrams.
Since the RSA algorithm uses a substitution cipher with an asymmetric key, only this type of cipher will be considered.

RSA Algorithm

The RSA algorithm is a cryptographic algorithm developed in 1978 by R. L. Rivest, A. Shamir and L. Adleman of MIT. This method consists of generating a public key, used to cipher the data, and a private key, used to decipher the data. As in other methods, the RSA algorithm uses enormous prime numbers to construct the pair of keys, so their generation is hard. Each pair of keys shares the product of two prime numbers, the modulus, and each key has its own exponent. The security of this algorithm increases with the size (in bits) of the prime numbers. Nowadays the expectation is that, with the increase in computational power, it will become possible to factor the product of these enormous prime numbers in less time, and that the generation of keys with even larger numbers (than 4096 bits) will likewise be possible.

Operation Example

The following example ciphers the message “CRIPTOGRAFIA”.

Stage 1: Two random prime numbers are selected: p = 53, q = 61.

Stage 2: A number n is generated by multiplying the previously chosen numbers ($n = p \cdot q$), therefore:
$n = 53 \cdot 61 = 3233$

Stage 3: A number d is chosen so that d is both less than n and relatively prime to $(p-1)(q-1)$. It is enough to choose a random prime number greater than p and q. For this example d = 193 was chosen.

Stage 4: A number e is chosen such that $(ed - 1)$ is divisible by $(p-1)(q-1)$. The Extended Euclidean algorithm is used for this calculation, which gives e = 97.

Therefore, in this example, the important values of the RSA algorithm are:
p = 53, q = 61, n = 3233, d = 193 and e = 97

The Extended Euclidean algorithm is needed to solve $ed - x(p-1)(q-1) = 1$ (x being an integer). The values e and d are called the public and private exponents, respectively. The pair (n, e) is the public key and the pair (n, d) is the private key. The values p and q must always be kept secret or be destroyed.
To carry out the encryption in this example, ASCII code was adopted to represent the letters of the word numerically. The word “CRIPTOGRAFIA” is represented numerically as:

678273808479718265707365

The message to encode cannot be larger than the modulus, in this case 3233, so groups of three digits are used, as follows:

678-273-808-479-718-265-707-365

To cipher each group the following calculation is carried out:
$C = T^e \bmod n$, where C stands for the ciphered message and T for the original message.
We then have the following calculations:
$C = 678^{97} \bmod 3233 = 2580$
$C = 273^{97} \bmod 3233 = 3201$
$C = 808^{97} \bmod 3233 = 0205$
$C = 479^{97} \bmod 3233 = 0041$
$C = 718^{97} \bmod 3233 = 2304$
$C = 265^{97} \bmod 3233 = 0265$
$C = 707^{97} \bmod 3233 = 2611$
$C = 365^{97} \bmod 3233 = 1341$

Thus we obtain the following ciphered message:
25803201020500412304026526111341
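The RSA operation example can be retraced in software. The following is a minimal Python sketch, not part of the original work: the variable names are illustrative assumptions, and Python's built-in `pow` (modular inverse with exponent -1 requires Python 3.8 or later) stands in for the Extended Euclidean algorithm and for the modular exponentiation.

```python
# Toy RSA parameters taken from the example (far too small for real use).
p, q = 53, 61
n = p * q                        # modulus
phi = (p - 1) * (q - 1)          # (p-1)(q-1)
d = 193                          # private exponent chosen in Stage 3
e = pow(d, -1, phi)              # public exponent: inverse of d modulo phi

# Represent the word with ASCII codes and split it into 3-digit groups.
message = "CRIPTOGRAFIA"
digits = "".join(str(ord(ch)) for ch in message)
blocks = [int(digits[i:i + 3]) for i in range(0, len(digits), 3)]

# Cipher each group with the public key, then decipher with the private key.
cipher = [pow(t, e, n) for t in blocks]
recovered = [pow(c, d, n) for c in cipher]

print(n, e)
print(cipher)
print(recovered == blocks)
```

Deciphering with the private pair (n, d) should return the original three-digit groups, which is a quick way to check that e and d are consistent.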
Montgomery Multiplication

The Montgomery multiplication algorithm is a very ingenious method to calculate the modular multiplication. This algorithm replaces the division by a shift and a modular addition, if necessary, resulting in faster processing. Moreover, it is suitable for hardware (FPGA/ASIC) implementation, but it is inadequate for carrying out a single modular multiplication. However, when the goal is to carry out a very large number of multiplications, this process of modular multiplication is extremely efficient.
The Montgomery modular multiplication calculates $MM(X, Y) = X Y R^{-1} \bmod m$, where m is an integer with $2^{n-1} < m < 2^n$, $R = 2^n$ and $\gcd(m, R) = 1$.

Montgomery Multiplication Operational Example

The Montgomery multiplication operation will use the following conditions: X = 23, Y = 20, m = 27, R = 32 and $R^{-1} = 11$.

Number 23 is changed into its image in the Montgomery domain:
$X' = MM(X, R^2) = MM(23, 32^2) = 23 \cdot 32^2 \cdot 11 \bmod 27 = 23 \cdot 32 \bmod 27 = 7$

Step by step, the multiplications of X' by itself are:

$(X')^2 = MM(7, 7) = 7 \cdot 7 \cdot 11 \bmod 27 = 26$
$(X')^3 = MM(7, 26) = 7 \cdot 26 \cdot 11 \bmod 27 = 4$
$(X')^4 = MM(7, 4) = 7 \cdot 4 \cdot 11 \bmod 27 = 11$
$(X')^5 = MM(7, 11) = 7 \cdot 11 \cdot 11 \bmod 27 = 10$

Through the Montgomery multiplication between $(X')^5$ and 1 it is possible to obtain $x^5$ as an integer:
$x^5 \bmod 27 = MM((X')^5, 1) = MM(10, 1) = 10 \cdot 1 \cdot 11 \bmod 27 = 110 \bmod 27 = 2$

Hardware Architecture

The developed architecture is composed of: three components, Montgomery Multiplier, ToMontgomery and Exponent; seven main registers, x_in, y_in, exp_in, x_base, x_inter, multfact, result and z, along with some signals and other auxiliary registers; and two memories, mem_X and mem_Exp.
The design is intended to use operands of 4096 bits, requiring two memories of 4096 bits each: one for the base and another for the exponent.

Montgomery Multiplications

There are two components that carry out different types of Montgomery multiplication: the Montgomery Multiplier, which is the main component of the architecture, and the ToMontgomery component. These components are used in almost all the iterations throughout the cryptographic calculation. The multiplications carried out in these components are performed with fast and simple logical operations, namely logical shifts.
The operands of the components are x_in and y_in, in the case of the Montgomery Multiplier, and x_in, in the case of ToMontgomery. The intended type of multiplication is selected by a select signal, selMM. The two types of multiplication are: a Montgomery multiplication of the two input operands, (x_in*y_in*R^-1) mod m (m is the modulus and R = 2^n, with n the number of bits, the same throughout the whole calculation), and a multiplication of the operand x_in by R, (x_in*R) mod m, which carries out the change to the Montgomery domain. The result is the value of one or the other multiplication, depending on the value of the signal selMM.

Montgomery Multiplication (MM)

This algorithm allows the size of the operands to be modified as desired without affecting the developed specification. The basic component is limited to 64-bit operands per iteration because of hardware area limitations. Together with a fast exponentiation algorithm, this component allows cryptography with keys of 4096 bits with very good performance. The multiplications are performed in sequence, one multiplication per algorithm iteration. This implies that the occupied area and the critical time grow with the number of bits of the operands. This component is combinatorial: it performs a multiplication of nbit bits (in this case nbit = 32) in one clock cycle, so the clock cycle must be long enough to guarantee that the result is available at the output. The critical time increases with the size of the operands and is equal to the latency.
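The two operations described above, (x_in*y_in*R^-1) mod m and (x_in*R) mod m, can be sketched in software to retrace the operational example with m = 27 and R = 32. The Python below is only an illustration under stated assumptions: it uses a common bit-serial shift-and-add formulation of the Montgomery product, which is not necessarily the circuit implemented in this work, and the function names are hypothetical.

```python
def montgomery_multiply(x, y, m, n_bits):
    """Return (x * y * R^-1) mod m for R = 2**n_bits, using additions and shifts.

    Assumes m is odd and 0 <= x, y < m < 2**n_bits.
    """
    acc = 0
    for i in range(n_bits):
        if (x >> i) & 1:      # add y when bit i of x is set
            acc += y
        if acc & 1:           # adding m makes acc even, so the halving is exact
            acc += m
        acc >>= 1             # division by 2 done as a right shift
    if acc >= m:              # a single conditional subtraction finishes the reduction
        acc -= m
    return acc


def to_montgomery(x, m, n_bits):
    """Return (x * R) mod m, the image of x in the Montgomery domain."""
    return (x << n_bits) % m


m, n_bits = 27, 5                      # R = 2**5 = 32
x_m = to_montgomery(23, m, n_bits)     # X'
powers = [x_m]
for _ in range(4):                     # (X')^2 up to (X')^5
    powers.append(montgomery_multiply(x_m, powers[-1], m, n_bits))
result = montgomery_multiply(powers[-1], 1, m, n_bits)   # back to the integer domain

print(powers, result)
```

Multiplying by 1 in the last step removes the remaining factor R, so `result` can be compared directly with an ordinary modular exponentiation of 23.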
Occupation (slices)   Time (ns)
1844                  85.241

Change to the Montgomery's Domain (ToMontgomery)

To perform a Montgomery multiplication, both operands need to be in the Montgomery domain before any calculation is made. At the end it is necessary to change the result back to the integer domain to get the real result. The real result is obtained by multiplying the final value by 1. This calculation can be made using the Montgomery multiplication with the operands 1 and the final value.

$x_{integers} = MM(x_{Montgomery}, 1) = x_{Montgomery} \cdot 1 \cdot R^{-1} \bmod m$

On the other hand, changing to the Montgomery domain presents some difficulties. Its calculation is similar to the previous one, although the value by which it is necessary to multiply has 2n+2 bits. As the Montgomery multiplication component cannot be used, because its operands have a maximum of n bits, a dedicated component was developed.

$x_{Montgomery} = x_{integers} \cdot R^2 \cdot R^{-1} \bmod m = x_{integers} \cdot R \bmod m$, with $R = 2^n$

So as not to affect the performance of the Montgomery Multiplier, this solution makes use of the properties of multiplication by powers of 2, assuming that R is always a power of 2 and taking $R = 2^n$.

$6\ (110_b) \cdot 2^{2} = 24\ (11000_b)$
$100\ (1100100_b) \cdot 2^{-2} = 25\ (11001_b)$

The product of a number by a power of 2 can be calculated through a sequence of bit shifts to the left or to the right, if the power is positive or negative, respectively. With this property the modular product by R is replaced by a sequence of n (number of bits of the operands) iterations, where in each iteration the operand is multiplied by 2 and then, if necessary, reduced by subtracting m (the modulus).
The implemented operation is carried out iteratively, where the value of the operand is the result of the previous iteration. It is not viable to make n shifts to the left of x and only afterwards carry out the operation mod m, because it cannot be known how many subtractions would be necessary, and carrying out the mod operation would take much time and area, since it would involve divisions. As at the beginning A is never greater than m, its double cannot be more than twice m, so a single subtraction of m is enough, if A > m, to obtain a value lower than m. In this way, it is enough to shift the bits to the left (A*2) and subtract m, if A > m, in each iteration. The number of iterations is given by n: as $R = 2^n$, the iteration must be repeated n times. This component has less occupation area and critical time than the Montgomery Multiplier. As with the Montgomery Multiplier, ToMontgomery has a critical time equal to the latency, because it is combinatorial.

Occupation (slices)   Time (ns)
1571                  81.738

Exponentiation by Squaring

An exponentiation is a succession of multiplications. When the exponent has a very high value this operation takes more time. Any algorithm that speeds up the attainment of the final result is therefore an advantage.

To optimize the exponentiation, the Exponentiation by Squaring algorithm, also known as binary exponentiation, was implemented. This algorithm makes use of the binary properties of the exponent and is quite efficient for very high exponent values, significantly reducing the number of necessary multiplications.

As an example, for a simple exponentiation such as $5^{15}$, the flow of data and operations would be the following:
Step   Result        X        exp
0      1             5        15
1      5             5        14
2      5             25       7
3      125           25       6
4      125           625      3
5      78125         625      2
6      78125         390625   1
7      30517578125   390625   0

The result, $5^{15} = 30517578125$, as intended, is obtained faster. The multiplication to carry out depends on the parity of the exponent. When it is even, x is replaced by $x^2$. When it is odd, result is replaced by result*x.

Exponent

This component updates the exponent's value throughout the calculation. The value of the exponent determines the number of multiplications needed to get the final result. In this way, the exponent functions as a signal that controls the end of the modular multiplication operations. The result is available when the exponent reaches zero.

This component has one operand, exp_in, and a select signal, sel. The Exponent performs two types of operation: a subtraction by 1, exp_in - 1, and a division by 2, exp_in/2. The subtraction by 1 is done by replacing the least significant bit by '0'. The division by 2 is done by shifting the bits to the right. The select value depends on the least significant bit of the exponent. When the exponent is odd (least significant bit '1'), exp_out is the result of the subtraction. When the exponent is even (least significant bit '0'), exp_out is the result of the division.

This method requires fewer multiplications, since halving the exponent makes it reach zero more quickly. The Exponent is an auxiliary component that has less occupation area and critical time than the Montgomery Multiplier. This is because it uses few, simple operations: the subtraction is carried out by a bit substitution and the division is replaced by a shift to the right.

Occupation (slices)   Time (ns)
16                    5.593

Memories

For the cryptographic calculation with base and exponent values of 4096 bits it becomes necessary to store the values in blocks of RAM to help the processing. The memories used are of type RAMB16_S36_S36. These memories allow the intended 4096 bits to be stored with fast reading. They allow access to 64 bits of data in each read. When using 32 bits (as in the developed case), only one memory access is needed every two iterations.

Control Unit

The management of the signals and registers throughout the cryptographic calculation is done by a state machine. In general terms, this unit controls the registers and signals according to the Exponentiation by Squaring algorithm, as well as the accesses to the memories.

The calculation runs cyclically, with different operands, through 10 states, each with a different function. The states are: test, comeca, inicio, um, dois, tres, fimIter, montToint, intTomont and fim.

[Figure: control unit state diagram (states teste, comeca, inicio, um, dois, tres, IntTomont, MontToint, FIM)]

Data Flow

As mentioned, each iteration is limited to 64 bits. It therefore becomes necessary to repeat the calculations to carry out operations with 4096 bits. It is necessary to understand how to integrate the different
bits of the exponent and base in the calculation of the final value.

Carrying out the exponentiation with the Exponentiation by Squaring algorithm allows the cryptographic calculation to be made without difficulty. In Exponentiation by Squaring, the type of calculation depends on whether the exponent is odd or even (least significant bit '1' or '0'), shifting the bits to the right whenever it is even. After n (number of bits) right shifts, the rightmost bits will be the following n bits.

Example (n=2):
Exponent: 100101 after n shifts is 001001

As shown in this example, starting from the exponent at the beginning of the calculation, after n right shifts the n rightmost bits are the second n bits of the exponent (the second pair of bits, 01, is now at the end of the exponent), and so on successively. Because only n bits of the exponent are available at a time, it becomes necessary to read the following n bits from memory at the end of an iteration.

Functioning (with n=32 bits)

[Figure: data flow for X^Exp mod n, showing MEM EXP, Exp, X, x_inter and Result]

The calculation finishes when the whole exponent has become zero, not when each group of n bits read from the exponent has become zero. When the bits of a given read are zero, it is still necessary to continue updating the register x_inter for the next iterations.

Example (n=2):
$2^9$ (exponent = 1001b, base = 10b)

Step   result   x_inter   Exponent    contabit
0      1        2         1 -> (01)   0
1      2        2         0           1
2      2        4         0           2
3      2        16        2 -> (10)   0
4      2        256       1           1
5      512      256       0           2

In step 2, although the exponent already reached zero in the previous iteration, it is necessary to calculate x_inter so that the accumulated value is correct in the next iteration.

The value of the base is still limited. To solve this problem the algorithm resorts to the following property of exponents:

$a^b \cdot c^b = (a \cdot c)^b$

From this equation, and because it is not possible to perform calculations with a base larger than 64 bits, when using a value of 4096 bits it is necessary to factor the base into smaller numbers. The number of factors will never exceed 4096/nbit. It is thus enough to reuse the exponentiation architecture for the different values of the base, that is, all its factors.

$a = c \cdot d \cdot e \cdot f \Rightarrow a^b = (c \cdot d \cdot e \cdot f)^b = c^b \cdot d^b \cdot e^b \cdot f^b$

The multiplication of the operands can be performed progressively as the different results are obtained:

$(((c^b \cdot d^b) \cdot e^b) \cdot f^b)$

Example:
$515970^{50} \bmod 610391 = 413136$

Decomposing the base into factors:
$515970 = 81 \cdot 65 \cdot 98$

gives:
$515970^{50} \bmod 610391 = (81 \cdot 65 \cdot 98)^{50} \bmod 610391$
$= (81^{50} \cdot 65^{50} \cdot 98^{50}) \bmod 610391$
$= ((81^{50} \bmod 610391) \cdot (65^{50} \bmod 610391) \cdot (98^{50} \bmod 610391)) \bmod 610391 = 413136$

With these factors the sequence of calculations is the following:
1. Calculate $81^{50} \bmod 610391 = 607196$
2. Calculate $65^{50} \bmod 610391 = 400590$
3. Calculate $98^{50} \bmod 610391 = 87270$
4. $(607196 \cdot 400590) \bmod 610391 = 104877$
5. $(104877 \cdot 87270) \bmod 610391 = 413136$

Separating the base into three factors gives the same result. Instead of carrying out the exponentiation once, it is necessary to carry out one exponentiation for each factor and after that n multiplications between the results of the factors (with
n = (number of factors) - 1). This allows a calculation with operands of fewer bits each, improving the processing.

Tests and Results

The following formula indicates the number of clock cycles necessary to get the encrypted value:

[Formula: number of clock cycles]

Substituting nbit by the test values 64, 32 and 16 bits, the following values are obtained:

nbit   Number of clock cycles
64     536579
32     1097731
16     2293763

Thus, assuming that each clock cycle had the same duration for the three tests, the test with iterations of 64 bits would be the fastest. That assumption does not hold, however: the fewer bits each iteration has, the faster the clock cycle can be. This occurs because the clock cycle is set by the longest path between two flip-flops. With fewer bits this time is lower and consequently the clock period can be shorter.

For each test, the clock period would be:

nbit   Clock period
64     189 ns
32     85 ns
16     39 ns

The execution time of each test (number of cycles * period) is:

nbit   Execution time
64     0.1017 s
32     0.0943 s
16     0.0912 s

The execution times are similar in all the tests, so using iterations with more or fewer bits is not very relevant, although with iterations of 16 bits the execution time is slightly lower.

The occupation area (slices) for each one of the tests is:

nbit   Occupation space (slices)
64     9860
32     2573
16     977

With fewer bits per operation the LUT occupation is smaller and, consequently, fewer slices are occupied, leaving more that can be used to perform other necessary operations. The use of iterations with 32 or 16 bits therefore seems to be a better choice, because there is more freedom to run other applications at the same time.

Studying both parameters separately is a good way to determine the advantages and disadvantages of each option, although by studying both parameters simultaneously it is possible to determine which option is the best one without any doubt.

Using the values above, a graph can be constructed relating both variables, Occupation Space vs. Execution Time.

[Figure: Occupation vs Time. Horizontal axis: Occupation (slices), 0 to 12000; vertical axis: Time (s), 0.09 to 0.104; points labelled 16 bits, 32 bits and 64 bits]
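The relation between cycle count, clock period and execution time in the tables above can be rechecked with a few lines of Python. This is only an arithmetic sketch with illustrative variable names; it uses the clock periods as printed, so the products come out close to, but not exactly equal to, the reported times, which would be consistent with the listed periods being rounded values.

```python
# Values copied from the tables (nbit -> value).
cycles = {64: 536579, 32: 1097731, 16: 2293763}
period_ns = {64: 189, 32: 85, 16: 39}
reported_s = {64: 0.1017, 32: 0.0943, 16: 0.0912}

# Execution time = number of cycles * clock period (converted from ns to s).
exec_time_s = {nbit: cycles[nbit] * period_ns[nbit] * 1e-9 for nbit in cycles}

for nbit in (64, 32, 16):
    diff = abs(exec_time_s[nbit] - reported_s[nbit]) / reported_s[nbit]
    print(f"{nbit:2d} bits: {exec_time_s[nbit]:.4f} s "
          f"(reported {reported_s[nbit]:.4f} s, relative difference {diff:.1%})")
```

The ordering of the three configurations is the same in the recomputed and in the reported values, with 16-bit iterations being marginally the fastest.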
From the graph it is possible to verify that as the total execution time grows the occupation space also increases. This confirms that using iterations with fewer bits reduces the necessary space. Although the number of clock cycles needed to obtain the final result increases, the execution time is lower because of the shorter clock period.
This algorithm carries out the multiplication operations iteratively, functioning almost sequentially. This means there is no execution-time advantage in using more or fewer bits per operand. With operands of fewer bits each multiplication is performed more quickly, but more operations are required, and vice versa.
As the time needed to obtain the result is almost equal, the occupation area in each of the tests plays the more important role in choosing the best option. Thus carrying out the operation with fewer bits is the most viable solution.

Conclusions

With continuing technological evolution, computers have their processing capacity increased periodically, and so it becomes necessary to constantly improve cryptographic systems. Proof of this is the fact that security in computer networks is one of the areas where the continuous search for newer alternatives drives greater development in the technological realm.
Another factor that adds value to asymmetric cryptography is the continuous growth of electronic business, which would be impracticable without a cryptography able to provide total security for the users. The cryptography system currently used is extremely safe: specialists estimate that somebody trying to break it by trial and error would take about 100,000 years using a common PC.
This work proposed a reconfigurable hardware architecture to perform the main operations of cryptography using the RSA algorithm. This architecture performs the encryption and decryption of data of 4096 bits, using iterations of 32 bits, although it has been implemented in such a way that the operand size per iteration can be easily modified. The target technology of this project was FPGA reconfigurable logic devices. The circuit was designed with a view to reducing the amount of resources used, which makes it possible to implement it in a low-cost device. The circuit was specified in VHDL (Very High Speed Integrated Circuit Hardware Description Language), assuring the portability of the architecture to other technologies.

The architecture developed in this project can be modified as necessary. As suggestions for the continuation of the work, some of the following modifications can be tried:
Implementation using other key sizes. This can be done with the introduction of more memories and adjustment of the circuit to perform more iterations to process more data.
Introduction of pipeline stages. Pipelining can increase the performance of the circuit, mainly in the component that carries out the Montgomery multiplication.
Better management of the base factors. The management of the base in this architecture can require a high computation time. Better processing will improve the performance of the circuit.
