
An Embedded RSA processor for encryption and decryption

Yang Qian, Wu Xingiun, Zhou Runde, Lu Ruibing Institute of Microelectronics, Tsinghua University, Beijing 100084, China

Abstract
This paper presents an embedded RSA processor that performs RSA encryption and decryption under the control of an external microprocessor. The coarsely integrated operand scanning (CIOS) method is used to adapt Montgomery modular multiplication to hardware implementation. The window method is used to significantly reduce the number of modular multiplications needed. Pipelined control is used to increase throughput. Sequential access in the register group makes the routing regular and reduces the combinational logic. At a clock rate of 20 MHz, a 1024-bit modular exponentiation takes at most 160 ms on this RSA processor. The hardware size of the processor is approximately equivalent to 26,000 logic gates, which makes the design well suited to embedded systems.

Introduction
Public-key cryptography offers stronger security capabilities than secret-key cryptography, but it is much slower. RSA [1] is the best-known and most widely used public-key algorithm, and it is in essence a large-integer modular exponentiation. A modular exponentiation can be broken into a series of modular multiplications, and the Montgomery algorithm [2] is a widely used modular multiplication algorithm. Based on the Montgomery algorithm, the binary method [3] can be used to compute the modular exponentiation, but the number of modular multiplications it needs is too large. In this paper, improved algorithms are first introduced that are suited to hardware implementation and reduce the number of modular multiplications. An embedded RSA processor is then designed based on these improved algorithms.

Algorithms
CIOS method
A number representation with base W ($W = 2^w$) is used in all algorithms of this paper. The multiplicands a and b can be expressed as follows:

$$a = \sum_{i=0}^{k-1} a[i]\,W^i, \qquad b = \sum_{i=0}^{k-1} b[i]\,W^i$$

The basic Montgomery modular multiplication algorithm [4] is given in (4). n[0]' can be calculated with the extended Euclidean algorithm [5].

    n[0]' = -n[0]^(-1) mod W;
    t = a·b;
    for i = 0 to k-1
    { m = t[i]·n[0]' mod W;
      t = t + m·n·W^i; }
    return t / r;

(4)
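As a rough illustration of (4), the following Python sketch runs the word-serial Montgomery reduction on ordinary Python integers. It assumes r = W^k and an odd modulus n, and it adds the conventional final conditional subtraction, which the listing above does not show. The function name `montgomery_basic` and the small parameters in the usage note are illustrative choices, not part of the design.

```python
def montgomery_basic(a, b, n, w, k):
    """Return a*b*r^(-1) mod n with r = W**k and W = 2**w (n odd; a, b < n < r)."""
    W = 1 << w
    # n[0]' = -n[0]^(-1) mod W; pow(x, -1, m) requires Python 3.8 or later
    n0_inv = (-pow(n % W, -1, W)) % W
    t = a * b
    for i in range(k):
        t_i = (t >> (w * i)) & (W - 1)   # word t[i] of the running sum
        m = (t_i * n0_inv) % W           # m depends only on t[i]
        t += (m * n) << (w * i)          # t = t + m*n*W^i, clears word i of t
    t >>= w * k                          # exact division by r
    if t >= n:                           # assumed final correction step
        t -= n
    return t
```

With w = 4 and k = 2 (so r = 256) and n = 97, `montgomery_basic(45, 76, 97, 4, 2)` agrees with `45 * 76 * pow(256, -1, 97) % 97`. The sketch keeps t as one long integer, which is the storage cost of the basic algorithm.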

0-7803-6677-8/01/$10.00 ©2001 IEEE.

There are some disadvantages to using the basic Montgomery modular multiplication algorithm directly for hardware implementation: 1) it needs many storage units, 2k+1 in total (t[2k], ..., t[0]); 2) it includes the multiplication m·n, where m is w bits wide but n is w·k bits wide; 3) it includes the division t/r, which complicates the design of the datapath. Analyzing the basic Montgomery algorithm above, two points should be noticed: 1) m depends only on t[i]; 2) the lowest w·k bits of t make no contribution to the final result. Based on these observations, the CIOS method [3] was proposed.

CIOS method:

    for i = 0 to k-1
    { c = 0;
      for j = 0 to k-1
      { (c, s) = t[j] + a[j]·b[i] + c;
        t[j] = s; }
      (c, s) = t[k] + c;
      t[k] = s;
      t[k+1] = c;
      c = 0;
      m = t[0]·n[0]' mod W;
      (c, s) = t[0] + m·n[0];
      for j = 1 to k-1
      { (c, s) = t[j] + m·n[j] + c;
        t[j-1] = s; }
      (c, s) = t[k] + c;
      t[k-1] = s;
      t[k] = t[k+1] + c; }

In this algorithm, the division is carried out by shifting the result one word to the right (that is, dividing by W) in the last j loop. The algorithm has several advantages for hardware implementation: 1) the multiplicands are all w bits wide, which eases the design of the datapath; 2) fewer storage units are needed, only k+2 (t[k+1], ..., t[0]); 3) multiply-add is the basic computation, which also eases the design of the datapath.

The window method
Let g be the number of bits in the exponent e. In the worst case, the number of modular multiplications needed is 2g with the binary method (that is, the case in which all bits of e are 1). The number of modular multiplications needed is significantly reduced with the window method [6]. The basic concept of the window method is to precalculate several specific numbers and then scan the exponent e with a window; if the number in the window has been precalculated, it can be used directly, so the number of modular multiplications is reduced. Let h be the length of the window. In the worst case, the number of modular multiplications needed is only g(1+1/h) with the window method. The throughput of the window method is $\frac{2g}{g(1+1/h)} = \frac{2h}{h+1}$ times as large as that of the binary method. However, the number of storage units for the precalculated numbers doubles each time the window length increases by 1. In this design the window length is chosen as 3, a compromise between throughput and the number of storage units. In this case the throughput of the window method is 1.5 times as large as that of the binary method.

Architecture
Introduction of the function of modules
The processor includes four modules: modular exponentiation control logic (called Exp-control), modular multiplication control logic (called Mon-control), the datapath, and the memory with its control logic (called Mem-control) (see figure 1). The modular multiplication control logic controls the dataflow in the datapath and the allocation of the arithmetic operators. It directs the datapath to complete the computation of n[0]' and the modular multiplication with the CIOS method. The func-sel signal determines which computation is performed. The modular exponentiation control logic is the top module, which controls Mon-control and Mem-control. It communicates with Mon-control through handshaking signals (mon-start and mon-finish).


Figure 1. RSA processor

Memory stores constants, initial data, intermediate results and final results. Mem-control arbitrates access to the memory to avoid conflicts, allowing only one module to access the memory at a time.

The design of datapath
The datapath contains one 32-bit multiplier, one 32-bit adder/subtracter, and one adder that adds a 32-bit operand to a 64-bit operand (w and k are both 32) (see figure 2).

Figure 2. The main units of datapath

The datapath also includes a register group T[0:k], which is mapped to the storage units t[0:k] of the CIOS method. In figure 2, the outputs of the adder can be stored directly into the register group, which saves the storage unit s of the CIOS method. Operand registers A and B are mapped to the storage units a[j] and b[i] of the CIOS method. The storage units a[j] and b[i] are only used in the first j loop of the CIOS method, so storage units a[j] and m can share operand register A, and storage units b[i] and n[j] can share operand register B. This saves many registers, since every storage unit is 32 bits wide. Carry register C is mapped to the storage unit c of the CIOS method. Constant register RN0 stores n[0]'.

The design of control logic
The control logic includes two parts, Mon-control and Exp-control. Hardwired control is used in the design of the control logic to simplify the design and increase the efficiency of the circuit. Mon-control and Exp-control are both finite state machines (FSMs). The CIOS method contains continuous sequences of the form fetching operands, processing in the arithmetic units, storing results, such as:

    for j = 0 to k-1
    { (c, s) = t[j] + a[j]·b[i] + c;
      t[j] = s; }

Pipelined control is used in the design of the control logic to increase throughput (see figure 3). Fetching operands, processing in the arithmetic units and storing results can then be done simultaneously, which speeds up the throughput significantly.

Figure 3. Pipelined control (fetch, process, store)

Organization of register group
Intermediate results are stored in the register group, which is convenient for pipelined control and simplifies the timing control. The register group contains 1024 (32 × 32) registers, which occupy a large area. The organization of the register group can be optimized to reduce this area.


The CIOS method has the following characteristic: access to the storage units t[i] is always sequential. Based on this characteristic, sequential access is used in the design of the register group. It makes the routing regular and reduces the combinational logic (see figure 4).
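To illustrate this access pattern, the sketch below runs the CIOS loops on Python lists of w-bit words and logs every index of t that is read. It is a software model only: the names `to_words` and `cios_montmul`, the `trace` argument and the final conditional subtraction are assumptions made for the example and do not describe the processor's actual control logic.

```python
def to_words(x, w, k):
    """Split integer x into k little-endian words of w bits."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(k)]

def cios_montmul(a, b, n, w, k, trace=None):
    """CIOS Montgomery product a*b*W^(-k) mod n (n odd; a, b < n < W**k)."""
    W = 1 << w
    mask = W - 1
    aw, bw, nw = to_words(a, w, k), to_words(b, w, k), to_words(n, w, k)
    n0_inv = (-pow(nw[0], -1, W)) % W        # n[0]' (needs Python 3.8+)
    t = [0] * (k + 2)                        # storage units t[0..k+1]

    def read(idx):                           # log each read of t[idx]
        if trace is not None:
            trace.append(idx)
        return t[idx]

    for i in range(k):
        c = 0
        for j in range(k):                   # first j loop: t += a * b[i]
            cs = read(j) + aw[j] * bw[i] + c
            t[j], c = cs & mask, cs >> w
        cs = read(k) + c
        t[k], t[k + 1] = cs & mask, cs >> w
        t0 = read(0)
        m = (t0 * n0_inv) & mask             # m = t[0]*n[0]' mod W
        c = (t0 + m * nw[0]) >> w            # low word is zero, keep the carry
        for j in range(1, k):                # last j loop: add m*n, shift one word
            cs = read(j) + m * nw[j] + c
            t[j - 1], c = cs & mask, cs >> w
        cs = read(k) + c
        t[k - 1], c = cs & mask, cs >> w
        t[k] = read(k + 1) + c
    result = sum(word << (w * i) for i, word in enumerate(t[:k + 1]))
    return result - n if result >= n else result   # assumed final correction
```

If a list is passed as `trace`, each logged index is either one greater than the previous one or a return to index 0, which is the property a register group with a single access port can exploit. The model reads t[0] once per outer iteration and reuses the value for both m and the first addition; that is a choice of this sketch.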

Figure 4. Sequential access

Sequential access means that the data in the register group can only be accessed through the port register T[0]. To fetch a data word, the register group first shifts it into port register T[0] and then reads it; storing a data word works similarly. Compared with random access (in which each register can be accessed directly), sequential access significantly reduces the routing area and the combinational logic. The hardware size of the register group with sequential access is equivalent to only 12,000 logic gates, whereas that with random access is equivalent to 18,000 logic gates.

Conclusions
The CIOS method is used to adapt Montgomery modular multiplication to hardware implementation, and it achieves a good tradeoff between hardware size and throughput. The window method is used to significantly reduce the number of modular multiplications needed. The window length is chosen as 3, and in this case the throughput is 1.5 times as large as that of the binary method. In the control logic, pipelined control is used to speed up the computation of modular multiplication. The register group is used to support the pipelined control. Sequential access in the register group makes the routing regular and reduces the combinational logic. In this design, Verilog HDL, the behavioral-level simulator Verilog-XL and the Synopsys logic synthesis tools are used to design, simulate and synthesize the circuit respectively, and the gate-level simulation of the processor is done in TSMC 0.35 µm technology. At a clock rate of 20 MHz, a 1024-bit modular exponentiation takes at most 160 ms on this RSA processor. The processor has been integrated into a smart IC card as its coprocessor.

Acknowledgements
The authors wish to express their gratitude to the China National Science Foundation (Project # 59995550-1) and the Tsinghua University 985 Key Research funds for their support.

References
[1] Rivest R L, Shamir A, Adleman L. Communications of the ACM, 21, 2, (120) 1978
[2] Montgomery P L. Mathematics of Computation, 44, 170, (519) 1985
[3] Kaya Koc C, Acar T, Kaliski B S Jr. IEEE Micro, 16, 3, (26) 1996
[4] ZHANG Wujian, LIANG Songhai, ZHOU Runde. J Tsinghua University, 39, S1, (13) 1999, (in Chinese)
[5] Dusse S R, Kaliski B S Jr. A cryptographic library for the Motorola DSP56000. Damgaard. Advances in Cryptology-Eurocrypt 90, Lecture Notes in Computer Science. (Springer-Verlag, New York, 1990), p230
[6] GAI Weixin. The fast algorithms and VLSI implementation of large integer modular exponentiation. (Tsinghua University, Beijing, 1997), (in Chinese)
