Download as doc, pdf, or txt
Download as doc, pdf, or txt
You are on page 1of 10

CS502 2003 Final Project

Jing Lu, Wan Qian

8/22/2013

Implementing a 1024-bit RSA on FPGA


Jing Lu Wan Qian

Reconfigurable Network Group Applied Research Lab Department of Computer Science and Engineering Washington University in St. Louis

Abstract
In this project, we implemented a 1024-bit RSA circuit in VHDL. It is a full-featured RSA circuit including key generation and data encryption. Both RSA key generation component and data encryption component are too big to fit into a single Xilinx Virtex 2000E FPGA on the Field Programmable Port Extender (FPX) platform, so that we are unable to test them in real hardware. However, each sub-component was simulated in ModelSim and proved functionally correct. Although neither RSA key generation nor data encryption component can be tested on FPX, we still implemented all the testing related hardware, including Control Packet Processor (CPP) and Output Packet Processor (OPP). According to the synthesis statistics, the highest frequency of the RSA encryption component is 64.3 MHz. So the estimated throughput is 0.45 Kb/s for 1024-bit d, and the throughput is 137 kb/s for 3-bit e.

I.

Introduction

Nowadays, more and more reconfigurable hardware devices are used in network applications due to their low cost, high performance and flexibility. Such applications include extensible network routers, firewalls [1], and Internet-enable sensors, etc. These reconfigurable hardware devices are usually distributed in a large geographic area and operated over public networks, making on-site configuration inconvenient or infeasible. Therefore, robust security mechanisms for remote control and configuration are highly needed. The RSA algorithm is a secure, high quality, public key algorithm. It can be used in these applications as a method of exchanging secret information such as keys and producing digital signatures. However, the RSA algorithm is very computationally intensive, operating on very large (typically thousands of bits long) integers. The RSA algorithm has been adopted by many commercial software products and is built into current operating systems by Microsoft, Apple, Sun, and Novell [2]. Commercial Application Specific Standard Products (ASSPs) like the security processors offered by several vendors have a much higher RSA performance than software implementation. However, their solution is inflexible and expensive [3]. With the exponential increase in FPGA size over time (Moore's law), it is possible to implement a relatively high performance, user parameterizable RSA at low cost on FPGA [3]. In this project, we are challenged to implement a full-featured 1024-bit RSA circuit on FPGA, including key generation and data encryption. Although high throughput is very desirable, we try to achieve high system frequency, small chip area and low power as well. We gave them higher priority than throughput since the RSA circuit is implemented in a resource-limited FPGA possibly shared by other modules. We implemented a full-featured RSA circuit including key generation and data encryption in VHDL. Each individual component was simulated in

Page 1

CS502 2003 Final Project

Jing Lu, Wan Qian

8/22/2013

ModelSim and worked well. All the simulation waveforms, source code, testbench, and makefile can be found on our course website [11]. The RSA encryption component was synthesized using Synplify Pro. The encryption component takes 79% of total FPGA resource, while primality tester itself takes 84% of total FPGA resource. Therefore, neither the key generation nor the encryption component can fit into the FPX platform. Synthesis result shows that the highest frequency of the encryption component is 64.3 Mhz. The estimated throughput for 1024-bit d is 0.45 kb/s, and 137 kb/s for 3-bit e.

II.

RSA Hardware Implementation

The RSA algorithm was inverted by Rivest, Shamir, and Adleman in 1977 and published in 1978 [5]. It is one of the simplest and most widely used public-key cryptosystems. Figure 1 summarizes the RSA algorithm.

Fig. 1 The RSA algorithm

Page 2

CS502 2003 Final Project

Jing Lu, Wan Qian

8/22/2013

Fig. 2 The system architecture for RSA key generation The system architecture for key generation is shown in Figure 2. A random number generator generates 512-bit pseudo random numbers and stores them in the rand FIFO. Once the FIFO is full, the random number generator stops working until a random number is pulled out by the primality tester. The primality tester takes a random number as input and tests if it is a prime. Confirmed primes are put in the prime FIFO. Similarly to the random number generator, primality tester starts new test only when prime FIFO is not full. A lot of power is saved by using the two FIFOs because computation is performed only when needed. When new key pair is required, the down stream component pulls out two primes from the prime FIFO, and calculates n and (n). N is stored in a register. (n) is sent to the Greatest Common Divider (GCD), where public exponent e is selected such that gcd((n), e) = 1, and private exponent d is obtained by inverting e modulo (n). E and d are also stored in registers. Once n, d, and e are generated, RSA encryption/decryption is simply a modular exponentiation operation. Figure 3 shows the RSA encryption/decryption structure in hardware implementation.

Fig. 3 The RSA encryption/decryption structure The core of the RSA implementation is how efficient the modular arithmetic operations are, which include modular addition, modular subtraction, modular multiplication and modular exponentiation. The RSA also involves some regular arithmetic operations, such as regular addition, subtraction and multiplication used to calculate n and (n), and regular division used in GCD operation. We have implemented most of the mentioned operations, except for the regular subtraction, modular addition and modular subtraction. Instead, we use regular adder to do subtraction by representing the operands in 2's complement format. Omura's method [6] is employed to efficiently compute the modular addition. In his method temporary value is allowed to grow larger than modulus n but less than 2k (2k-1 < n < 2k), correction is performed when the temporary value exceeds 2k. Techniques for modular exponentiation, modular multiplication, modular addition, and addition operations are summarized in [7], which we found very helpful.

Page 3

CS502 2003 Final Project


Fig. 4 A 512-bit adder

Jing Lu, Wan Qian

8/22/2013

Addition
Addition is the basic operation used in all other operations. The speed of addition operation plays an important role on the overall system throughput. Previous experience from CS536 project tells us that an n-bit adder implemented on FPGA using carry look-ahead technique is slow and resource consuming. This is because Xilinx FPGAs use 4-input Look-Up Tables (LUTs) to implement any 4-input combinatorial logic. Carry functions with a lot of inputs are broken down to fit into several 4-input LUTs, making the critical path longer. We used Xilinx Core Generator system to generate a 32-bit adder, and implemented our own 512-bit and 1024-bit adders by concatenating 16 and 32 32-bit adders, respectively. Xilinx Core Generator System makes the best of the Xilinx FPGA structure and constructs a faster 32-bit adder. Figure 4 shows the interface of a 512-bit adder. Ready, start and s_valid are handshaking signals, indicating when the adder is idle, when the computation starts, and when the result becomes valid. Simulation waveforms on our course website show the signals' timing in great detail. A 512-bit adder has 16 clock cycles latency. For a 1024-bit adder, the latency is 32 clock cycles. To subtract b from a using an adder (in this case, input a is always greater than b), we invert b and set c_in to be 1, the resulting s equals (a - b). Modular Multiplication Given the adders, we constructed modular multiplication using shift-add multiplication algorithm. Let A and B are two k-bit positive integers, respectively. Let A i and Bi are the ith bit of A and B, respectively. The algorithm is stated as follows: Input: A, B, n Output: M = A*B mod n P <= A; M <= 0; for i = 0 to k-1 if Bi = 1 M <= (M + P) mod n; end if P <= 2*P mod n; end for return M; For a 512-bit modular multiplier, inputs A and B are both 512 bits. However, B might be a small number with a lot of leading 0s. In the hardware implementation, before getting into the shift-add iterations, we search for the position of the first leading 1 in B, and set (k - 1) to be this position. By doing this, we avoid unnecessary shift and modular operations, making the multiplication faster when B is small. As we mentioned before, we used Omura's method to correct the partial product M and temporary value P when any of them becomes greater than 2 512. The corrected M and P may still greater than n, so before returning M, we will do final correction on M to make sure M is less than n. Figure 5 shows the interface of a 512-bit modular multiplier. A 1024-bit modular multiplier was also implemented in a similar fashion.

Page 4

CS502 2003 Final Project

Jing Lu, Wan Qian

8/22/2013

Fig. 5 A 512-bit modular multiplier Modular Exponentiation The modular exponentiation operation is simply an exponentiation operation where modular multiplication is intensively performed. We implemented the 512-bit and 1024-bit modular exponentiation components using LR binary method, where LR stands for the left-to-right scanning direction of the exponent. The following pseudo code describes the LR binary algorithm. Input: A, B, n Output: E = AB mod n E <= 1; for i = k-1 to 0 if Bi = 1 E <= A*E mod n; end if if i 0 E <= E*E mod n; end if end for return E; Similar to modular multiplication, we search for the position of the first leading 1 in exponent B and set (k - 1) to be the position. This avoids unnecessary modular squaring operations. For small exponent such as the public exponent e, the modular exponentiation is much faster than big exponent such as the private exponent d. Figure 6 shows the interface of a 512-bit modular exponentiation component.

Fig. 6 A 512-bit modular exponentiation component

Page 5

CS502 2003 Final Project Primality Tester

Jing Lu, Wan Qian

8/22/2013

Primality test is the critical part of RSA key generation. We implemented the algorithm developed by Miller and Rabin [8, 9]. It exploits Fermat's theorem that states a n-1 mod n = 1 if n is a prime. Miller and Rabin's primality test algorithm is as follows [10]: Test (n) Find integer k, q, with k > 0, q odd, so that (n - 1 = 2kq); Select a random integer a, 1 < a < n - 1; if aq mod n = 1 or n - 1 return "inconclusive"; end if for j = 1 to k - 1 if aq2^j mod n = n - 1 return "inconclusive"; end if end for return "composite"; In order to make hardware implementation efficient, the above algorithm is stated a little bit different from the one in the textbook. To obtain a prime in high probability (greater than 0.999999), the random number n is counted as a prime only if TEST returns 10 successive "inconclusive" [10]. According to the distribution of primes, on average every 142 trials (0.4ln(2512)) will find a prime. The interface of primality tester is shown in Figure 7. Rd_en and rd_ack are handshaking signals with the rand FIFO. Primality tester asserts rd_en to read a random number from rand FIFO, which then asserts rd_ack when the random number appears on rand_in bus.

Fig. 7 Primality tester GCD After we get (n), we need to find a small number e with gcd((n),e) = 1, which indicates that e is relative prime to (n) and its multiplicative inverse modulo (n) exists. Extended Euclidean algorithm was implemented to find gcd((n),e), if the gcd is 1, e and its multiplicative inverse d are returned. The following pseudo code shows the extended Euclidean algorithm in the textbook [10]. We simplified the algorithm by removing unnecessary variables and computations and made it more suitable for hardware implementation. Extended Euclid (m,b) (A1, A2, A3) <= (1, 0, m); (B1, B2, B3) <= (0, 1, b); loop: if B3 = 0 return A3 = gcd(m,b); end if

Page 6

CS502 2003 Final Project

Jing Lu, Wan Qian

8/22/2013

if B3 = 1 return B3 = gcd(m,b);B2 = b-1 mod m; end if Q <= A3/B3; (T1, T2, T3) <= (A1 - Q*B1, A2 - Q*B2, A3 - Q*B3); (A1, A2, A3) <= (B1, B2, B3); (B1, B2, B3) <= (T1, T2, T3); goto loop; Simplified Extended Euclid (m,b) start: (A2, A3) <= (0, m); (B2, B3) <= (1, b); loop: if B3 = 0 b <= fetch_new_b; goto start; end if if B3 = 1 return (B3, B2); end if Q <= A3/B3; T3 <= A3 mod B3; T2 <= A2 + Q*B2; (A2, A3) <= (B2, B3); (B2, B3) <= (T2, T3); goto loop; Simplified extended Euclidean algorithm keeps trying new b until it finds gcd(m, b) = 1, then it returns b and it multiplicative inverse modulo m. Note in the simplified version, we remove variables A1, B1 and T1 because they have nothing to do with the computation. Instead of letting T2 <= A2 - Q*B2, we use T2 <= A2 + Q*B2. Thus, we guarantee T2 is always positive. This makes the hardware implementation simpler because we don't need to deal with negative numbers. Given the initial values of A2 and B2 are 0 and 1, respective, it is easy to see that T2 computed by the simplified version is the magnitude of the T2 computed by the original version. When B3 equals 1, B2 is definitely a positive number, so keeping its magnitude is enough for correct output. Another difference in the simplified algorithm is that we substitute T3 <= A3 mod B3 for T3 <= A3 - A3/B3 * B. It is obvious that A3 mod B3 = A3 - A3/B3 * B3. We designed a divider that accepts big dividend and small divisor, and returns a big quotient and a small remainder. Because in this particular application where we deal with big m (around 1024 bits) and small b (less than 12 bits, explained later), the divisor is used only in the first division. The rest division operations deal with numbers smaller than b, a simple shift-subtraction algorithm accomplishes the task. To get a valid b faster, we pre-compute all primes less than 2000 excluding 2. Because the product of all these primes is greater than 21024, which means that any m less than 21024 is prime to at least one of these primes. These primes are hard-coded in the on-chip block memory (we call it prime ROM). Fetch_new_b in the simplified algorithm indicates next prime is pulled out upon request. Extended Euclidean algorithm doesn't use any modular arithmetic operations. Instead, it requires a regular multiplier to compute Q*B2. We implemented a 1024-bit multiplier using shiftadd multiplication algorithm. More information about the 1024-bit regular multiplier and the divider with 1024-bit dividend and 12-bit divisor can be found on our project website. The interface of the GCD component using the simplified extended Euclidean algorithm is shown in Figure 8.

Page 7

CS502 2003 Final Project

Jing Lu, Wan Qian

8/22/2013

Fig. 8 GCD component using simplified Extended Euclidean Algorithm Random Number Generator Linear Feedback Shift Register (LFSR) is used to generate pseudo random numbers. In particular, Fibonacci LFSR was implemented because it is more suitable for hardware implementation than Galios LFSR. In theory, an n-bit linear feedback shift register can generate a (2n - 1)-bit long pseudo random sequence before repeating. However, an LFSR with a maximal period must satisfy the following property: the polynomial formed from a tap sequence plus the constant 1 must be a primitive polynomial modulo 2 [10]. We are unable to find the primitive polynomial for a 512-bit LFSR, so we implemented a 521-bit LFSR and used its least significant 512 bits to generate 512-bit pseudo random numbers. The primitive polynomial for a 512 bit LFSR was obtained from [12], which is x512 + x32 + 1. In order to get a 1024-bit n (n = q*p, where p, q are primes), we set the leading two bits of the random numbers to 1s, which ensures that the leading bit of n is always 1 (therefore n is 1024-bit number). Figure 9 shows the structure of the 521-bit LFSR, the 512-bit random number generator implemented using the 521-bit LFSR is shown in Figure 10.

Fig. 9 Structure of the 521-bit LFSR

Fig. 10 A 512-bit random number generator Regular Multiplier

Page 8

CS502 2003 Final Project

Jing Lu, Wan Qian

8/22/2013

To compute n and (n) from p and q, regular multiplier is needed. We implemented a 1024bit multiplier using the simple shift-add algorithm. Details can be found on our course website. FIFO We used Xilinx Core Generator System to generate a 12-entry rand FIFO and a 12-entry prime FIFO. Please refer to our course website for more information.

III.

Conclusion

In this project, we implemented a 1024-bit RSA circuit in VHDL. It is a full-featured RSA circuit including key generation and data encryption. Both RSA key generation component and data encryption component are too big to fit into a single Xilinx Virtex 2000E FPGA on the FPX platform, so that we are unable to test them in real hardware. However, each sub-component was simulated in ModelSim and proved functionally correct. Although neither RSA key generation nor data encryption component can be tested on FPX, we still implemented all the testing related hardware, including Control Packet Processor (CPP) and Output Packet Processor (OPP). According to the synthesis statistics, the highest frequency of the RSA encryption component is 64.3 MHz. So the estimated throughput is 0.45 Kb/s for 1024-bit d, and the throughput is 137 kb/s for 3-bit e. This project makes it very clear that RSA algorithm is not very suitable for implementation on FPGA, because modular exponentiation, primality test and gcd operations need many registers to store intermediate results. In FPGA, register is expensive because every CLB has only two edge-triggered D flip-flops. There may exist more efficient algorithms for RSA hardware implementation. Wed like to continue working on the RSA encryption component. Hopefully it can finally fit into the FPX platform and becomes useful in the remote control and configuration of reconfigurable hardware devices.

Reference
[1] J. W. Lockwood, C. Neely, C. Zuver, J. Moscola, S. Dharmapurikar, and D. Lim, "An Extensible, System-On-Programmable-Chip, Content-Aware Internet Firewall." Field Programmable Logic and Applications, paper 14B, Sep. 2003. [2] http://www.x5.net/faqs/crypto/q19.html [3] J. Fry, and M. Langhammer, "FPGAs Lower Cost for RSA Cryptography." http://www.commsdesign.com/design_corner/OEG2003926S0022 [4] J. W. Lockwood, N. Naufel, J. S. Turner, and D. E. Taylor, "Reprogrammable Network Packet Processing on the Field Programmalbe Port Extender (FPX)." ACM International Symposium on Field Programmable Gate Arrays, Feb. 2001, pp. 87-93. [5] R. Rivest, A. Shamir, and L. Adleman, "A Method for Obtaining Digital Signatures and Public Key Cryptosystems." Communications of the ACM, Feb. 1978. [6] J. K. Omura, "A Public Key Cell Design for Smart Card Chips." International Symposium on Information Theory and its Applications, pp. 983-985, Nov. 1990. [7] C. K. Koc, "RSA Hardware Implementation." RSA laboratories Technical Report TR-801, Ver. 1.0, Apr. 1996. [8] G. Miller, "Riemann's Hypothesis and Tests for Primality." Proceeding sof the 7 th Annual ACM Symposium on the Theory of Computing, May 1975. [9] M. Rabin, "Probabilistic Algorithms for Primality Testing." Journal of Number Theory, Dec. 1980.

Page 9

CS502 2003 Final Project

Jing Lu, Wan Qian

8/22/2013

[10] William Stallings, "Cryptography and Network Security: Principles and Practices." 3 rd edition, 2003. [11] http://www.arl.wustl.edu/~jl1/education/cs502/course_project.htm [12] Bruce Schneier, "Applied Cryptocraphy: Protocols, Algorithms, and Source Code in C." 2 nd edition, 1996.

Page 10

You might also like