An Efficient Reconfigurable Multiplier Architecture For Galois

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

Microelectronics Journal 34 (2003) 975–980

www.elsevier.com/locate/mejo

An efficient reconfigurable multiplier architecture for Galois field GFð2mÞ


P. Kitsos*, G. Theodoridis, O. Koufopavlou
VLSI Design Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rio, Patras 26500, Greece
Received 8 January 2003; revised 26 May 2003; accepted 4 June 2003

Abstract
This paper describes an efficient architecture of a reconfigurable bit-serial polynomial basis multiplier for Galois field GFð2m Þ; where
1 , m # M: The value m; of the irreducible polynomial degree, can be changed and so, can be configured and programmed. The value of M
determines the maximum size that the multiplier can support. The advantages of the proposed architecture are (i) the high order of flexibility,
which allows an easy configuration for different field sizes, and (ii) the low hardware complexity, which results in small area. By using the
gated clock technique, significant reduction of the total multiplier power consumption is achieved.
q 2003 Elsevier Ltd. All rights reserved.
Keywords: Galois field; Polynomial multiplication; Bit-serial; Irreducible polynomial; All-one polynomial; Linear Feedback Shift Register; Low power;
Cryptography; Elliptic curves

1. Introduction which the coefficients of the irreducible polynomial can be


modified [3]. However, all the above fixed field size
Arithmetic operations over GFð2m Þ have many appli- multipliers do not work efficiently in applications with
cations in coding theory [1] and cryptography [2]. As the variable field size requirements. In these applications the
multiplication is very costly in terms of area and delay, a lot multipliers always performs all the operations, which are
of research has been performed in designing small area and needed for the maximum field size calculations. So, in order
high-speed multipliers [3 – 5]. to improve the system performance in multiplication cases
Previous published multipliers over GFð2m Þ can be with field size less than the maximum, a proper and flexible
classified into three categories: the bit-serial multipliers design implementation is required. In the past, multipliers
[3], with OðmÞ area requirement, the bit-parallel multipliers with this feature have been proposed in Refs. [5,12,13].
[4], with Oðm2 Þ area requirement, and the hybrid [5], which Nowadays wireless devices are widely used. Since power
are partially bit-serial and partially bit-parallel. Hybrid consumption determines the time between two successive
multipliers are faster than bit-serial ones, while their area is recharges of such a device and the battery life as well, the
smaller than that of bit-parallel. Another classification can reduction of power dissipation is vital in such devices.
be considered based on the used basis representation, which The main source of power dissipation in a CMOS circuit is
may be polynomial, normal, dual or digit [6 – 8]. The the switching activity of its nodes, which may contribute
multiplier hardware complexity can be reduced if (i) the more than 90% of the total power consumption [14].
irreducible polynomial is an All-One Polynomial (AOP) [9] However, a lot of the performed circuit node transitions are
or a trinomial [10], and (ii) a redundant field representation wasteful regarding the functionality of the circuit. Hence,
is used [11]. avoiding the unnecessary and wasteful transitions is a major
Many of the previous proposed multipliers have fixed task in the low power design.
In this paper, a small area reconfigurable architecture for
field size, and so, if the irreducible polynomial has to change
the Most Significant Bit (MSB)-first, bit-serial, polynomial
the multiplier must be redesigned [4,9,10]. In the recent
basis multiplier over GFð2m Þ is introduced, where 1 , m #
years only few fixed field size multipliers were proposed in
M: m is the degree of the irreducible polynomial and it can
* Corresponding author. Tel.: þ30-2610-997-323; fax: þ 30-2610-994- be easily changed according to the application require-
798. ments. M is the maximum degree of the irreducible
E-mail address: pkitsos@ee.upatras.gr (P. Kitsos). polynomial.
0026-2692/03/$ - see front matter q 2003 Elsevier Ltd. All rights reserved.
doi:10.1016/S0026-2692(03)00172-1
976 P. Kitsos et al. / Microelectronics Journal 34 (2003) 975–980

Compared with the multipliers in Refs. [3,5,12] the Step 2: for i ¼ 0 to m 2 1 do


advantages of the proposed architecture are: (i) the high
order of flexibility, which allows an easy configuration for C i ðxÞ ¼ ½Ci21 ðxÞx þ bm212i AðxÞmod PðxÞ
different field degree m; and (ii) the low hardware
complexity, which results in smaller area. By using the Using the fact that: xm mod PðxÞ ¼ pm21 xm21 þ · · · þ p1 x þ
gated clock technique, significant reduction of the total p0 the Ci ðxÞ in the step 2 can be rewritten as
multiplier power consumption is achieved. The proposed
multiplier is suitable for elliptic curve applications [15,16], Ci ðxÞ ¼½ci21
m21 ðpm21 x
m21
þ · · · þ p1 x þ p0 Þ þ · · · þ c0i21
especially in devices with strict area limitations. þ bm212i AðxÞmod PðxÞ ¼ ðci21 i21
m21 pm21 þ cm22
The paper is organized as follows: in Section 2 a brief
description of the MSB-first, bit-serial, polynomial basis þ bm212i am21 Þxm21 þ ðcm21
i21 i21
pm22 þ cm23
GFð2m Þ multiplier is given. In Section 3, the multiplier þ bm212i am22 Þxm22 þ · · · þ ðci21 i21
m21 p1 þ c0
proposed reconfigurable architecture is presented. Measure- X
m21
i21
ments and comparisons with other multipliers are shown in þ bm212i a1 Þx þ ðcm21 p0 þ bm212i a0 Þ ¼ cij xj
the Section 4. Section 5 concludes the paper. j¼0

ð5Þ

2. MSB-first bit-serial GFð2m Þ multiplier where


cij ¼ ci21 i21 i21
m21 pj þ cj21 þ bm212i aj and c21 ¼ 0 ð6Þ
Two elements, AðxÞ and BðxÞ; over GFð2m Þ can be ex-
pressed as polynomials of degree at most m 2 1 over GF(2): The block diagram of a bit-serial multiplier architecture is
shown in Fig. 1. The main component of the multiplier is a
AðxÞ ¼ am21 xm21 þ am22 xm22 þ · · · þ a1 x þ a0 ; Linear Feedback Shift Register (LFSR). The coefficients of
with ai [ GFð2Þ 0#i#m21 ð1Þ the irreducible polynomial PðxÞ and the multiplicand
polynomial AðxÞ; are stored in PðiÞ and AðiÞ registers,
BðxÞ ¼ bm21 xm21 þ bm22 xm22 þ · · · þ b1 x þ b0 ; respectively. As the coefficients bðiÞ; of multiplier poly-
nomial BðxÞ; are shifted bit-by-bit in the MSB direction, the
with bi [ GFð2Þ 0#i#m21 ð2Þ
partial products bpi AðxÞ; are produced in the Partial Product
component. The LFSR calculates the summary of the bpi AðxÞ
We define the field according to PðxÞ:
and also performs the reduction modulo PðxÞ: After m clock
PðxÞ ¼ xm þ pm21 xm21 þ · · · þ p1 x þ p0 ; with pi [ GFð2Þ cycles, the multiplication product CðxÞ is produced. This
implementation is a MSB-first version because the MSB is
0#i#m21 ð3Þ
the first bit which enters to the multiplier.
A hardware implementation of the above multiplier has
a m-order irreducible polynomial over GF(2). This poly-
been proposed in Ref. [18] (Fig. 2).
nomial is also irreducible over GFð2m Þ [17]. When the
In order to perform the reduction modulo PðxÞ; the non-
coefficients pi ; in the polynomial of Eq. (3), are equal to one,
zero coefficients pðiÞ; of the irreducible polynomial PðxÞ
the irreducible polynomial is named AOP [9].
configures the LFSR, through the upper-series AND gates.
The modulo multiplication CðxÞ involves the carry-free
As the multiplier polynomial BðxÞ is shifted bit-by-bit in the
multiplication between the two polynomials AðxÞ and BðxÞ;
MSB direction, the partial products bpi AðxÞ are produced by
with the interleave reduction modulo PðxÞ:
the lower-series AND gates.
!
X
m21
i
CðxÞ ¼ ðAðxÞBðxÞÞmod PðxÞ ¼ AðxÞ bi x mod PðxÞ
i¼0 3. Proposed multiplier architecture
!
X
m21
¼ ðbi ðxi AðxÞÞ mod PðxÞ The proposed reconfigurable MSB-first multiplier that
i¼0
can be used for variable field degree m is shown in Fig. 3.
¼ ðb0 AðxÞ þ b1 xAðxÞ þ · · · þ bm22 xm22 AðxÞ The proposed hardware implementation consists of a bit-
þ bm21 xm21 AðxÞÞ mod PðxÞ ð4Þ sliced LFSR and is very similar to the conventional bit-
serial multiplier of Fig. 2. It requires M extra demultiplexers
Eq. (4) can be implemented using the following iterative and M extra OR gates. Each slice i; consists of two subfield
algorithm. multipliers (AND gates), one subfield adder (XOR gate),
one 2-output demultiplexer, one OR gate, and 3 one-bit
2.1. Polynomial multiplication in GFð2m Þ algorithm registers ðPðiÞ; AðiÞ and DðiÞÞ: The non-zero coefficients
pðiÞ; of the irreducible polynomial PðxÞ; configure the
Step 1: C21 ðxÞ ¼ 0 LFSR, through the OR and AND gates of the feedback path.
P. Kitsos et al. / Microelectronics Journal 34 (2003) 975–980 977

Fig. 1. Block diagram of a bit-serial multiplier.

The maximum value of the field degree is M; and it is during multiplication. out2 forces the global feedback path
determined by the application requirements. with the proper value through the ORi gate.
Each coefficient aðiÞ; of the multiplicand polynomial When the application requires multiplications with
AðxÞ; is stored in AðiÞ position of the AðiÞ register, while each variable field sizes, the value of the degree m; and the
coefficient pðiÞ; of the irreducible polynomial PðxÞ; is stored set of coefficients of polynomials AðxÞ; BðxÞ; and PðxÞ are
in the PðiÞ position of the PðiÞ register. If an irreducible configured (i.e. programmed) by the Control Unit.
polynomial of degree m; m , M is required, the remaining After m clock cycles the right multiplication result is
PðjÞ and AðjÞ bits of the registers are filled with zeros, where stored in the register DðiÞ: The minimum clock cycle
m , j # M: period is determined by the delays of the feedback path
Signal controlðiÞ selects one of the two demultiplexer and the 2-output demultiplexer. It is equal to 2TAND þ
output. The value of each signal controlðiÞ; is defined as: TXOR þ TNOT þ ðm þ 1ÞTOR ; where TAND ; TXOR ; TNOT ;
( ) and TOR is the delay of the 2-input AND, 3-input XOR,
1 if i # m inverter, and 2-input OR gates, respectively.
controlðiÞ ¼ ð7Þ
0 if m , i # M During computation each one-bit register DðiÞ; is
controlled by the corresponding controlðiÞ signal. So, all
In the positions where controlðiÞ ¼ 1 (out1 is selected) the the one-bit register DðjÞ; are set inactive by discontinue their
slice is an active whilst if controlðiÞ ¼ 0 (out2 is selected) clock signal. Thus, the unnecessary and wasteful transitions
the slice is inactive. Inactive means that this slice is not used at all the one-bit register DðjÞ; are eliminated. This gated

Fig. 2. MSB-first bit-serial multiplier implementation.


978 P. Kitsos et al. / Microelectronics Journal 34 (2003) 975–980

Fig. 3. Proposed reconfigurable bit-serial multiplier.

clock technique results in a significant power dissipation The implementation in Ref. [5] can operate over variable
reduction [14]. Galois field size. It uses a representation of the field GFð2m Þ
as GFðð2n Þk Þ where m ¼ nk. This approach processes all
component-wise multiplications and additions in bit-paral-
4. Measurements and comparisons lel in the sub-field GFð2n Þ and serial processing by a LFSR
for the extension field GFð2k Þ: The resulting implementation
The area hardware resources and the execution time of is general in that the value of k can be changed, leading to a
the proposed multiplier hardware implementation are shown limited degree of flexibility. When m is prime and n . 1 the
in Table 1. For comparison, measurements from other multiplication cannot be operated. The usage of composite
designs are also presented. field is discouraged for elliptic curve cryptosystems. The
In Table 1 measurements for the Control Unit area framework for an attack on elliptic curve cryptosystems that
resources are not considered. The bit-serial polynomial uses composite fields of the form GFðð2n Þk Þ is described in
basis multiplier suggested in Ref. [3], is based on the Ref. [20]. The execution time for one multiplication is m=n;
Programmable Cellular Automata [19]. It requires less lower than the bit-serial one but with higher area
area resources than the proposed implementation, and the complexity.
critical path is shorter. On the other hand, the main The implementation proposed in Ref. [12], is a variable
disadvantage of the multiplier implementation in Ref. [3] Galois field size multiplier. Before performing the multi-
is that it supports only multiplications with fixed field plication, the operands have to be transformed from
size. polynomial into triangular basis or inverse order. In order
P. Kitsos et al. / Microelectronics Journal 34 (2003) 975–980 979

Table 1
Area hardware resources and execution time

Multiplier Ref. [3] Ref. [5] Ref. [12] Proposed


implementation

# AND 3m ð3=4Þmn 3m 2m
# XOR m mðð3=4Þn þ 3 2 3=nÞ 2m m
# REG m mð2 þ 1=nÞ m2 3m
# OR 0 0 0 m21
# (2:1) DEMUX 0 0 0 m
# ðm : 1Þ MUX 0 0 m 0
3mðTXOR þ dlog2 me 2TAND þ TXOR þ TNOT
Critical path TAND þ TXOR –
ðTNOT þ TAND þ TOR ÞÞ þ ðm þ 1ÞTOR
# CLK m m=n 3m m
Reconfigurable No No Yes Yes
(support any
irreducible polynomial)

to support variable field sizes, Serial-In Serial-Out registers, reach high performance, especially if the irreducible
and, m m : 1 multiplexers are used, resulting in an increase polynomial degree m is beyond 200. The Elliptic Curve
of the clock cycles, which are required for performing Cryptosystems with key size of 106 –210 bits can yield the
multiplication. In addition, the usage of the m m : 1 similar security level with the key size of 512 –2048 bits of
multiplexers makes this implementation infeasible for RSA algorithm. The key sizes are considered to be
devices with strict area limitations. The multiplication equivalent strength based on MIPS years needed to recover
result is computed after 3m clock cycles. The critical path is one key [22].
determined by the m : 1 multiplexer delay and the delay of In Table 2 measurements for the above mentioned binary
the 3-input XOR. The total latency is equal to 3mðTXOR þ field size 2m are illustrated. The proposed multiplier is
dlog2 meðTNOT þ TAND þ TOR ÞÞ: implemented in a Field Programmable Gate Array (FPGA).
The implementation proposed in Ref. [13] is highly For system synthesis, the LeonardoSpectrum from Mentor
scalable because a fixed-area multiplier can handle operands Graphics was used. For all the measurements the XILINX
of any field size. In this implementation the Montgomery FPGA XCV1000E-FG1156-8 device was considered [23].
algorithm [21] is used. Moreover, the word size of the In this implementation the maximum field degree M is 210.
processing unit as well as the number of pipeline stages can The polynomial degree m; varies from 106 to 210 bits.
be selected according to the application requirements. The experimental delay measurements of Table 2 are
Before performing Montgomery multiplication, the oper- very close to the expected values produced by the
ands must be transformed into the Montgomery domain, theoretical expressions and the data of Table 1. The slight
while for the completion of multiplication, the final result differences between the experimental and the theoretical
has to be retransformed to GFð2m Þ field. These transform- values are due to for the latest the FPGA internal
ations are accomplished by using pre-computed constants. interconnection wires and buffers delays are not calculated.
Additionally, the high number of full adders and registers
prohibits the use of this multiplier in applications with strict
area limitations. 5. Conclusion
In applications with limited computational power, a
hardware acceleration of the field arithmetic is necessary to A reconfigurable bit-serial Galois field multiplier archi-
tecture is proposed in this paper. The multiplier is
Table 2 reconfigurable because it can perform for variable Galois
Proposed multiplier performance measurements
field degree m: This multiplier can support any arbitrary
Binary field degree FPGA frequency Multiplication irreducible polynomial. The multiplication result is com-
(m) (MHz) execution time (ms) puted after m clock cycles. The advantages of the proposed
architecture are the high order of flexibility, which allows an
106 33.5 3.1
easy configuration for variable field size 2m ; and the low
119 30 3.9
132 26 5 hardware complexity, which results in small area. In
158 22.4 7 addition, the proposed multiplier has low power consump-
163 22.1 7.4 tion features, which are achieved by using the gated clock
174 19.8 8.8 technique. Comparing with previous published implemen-
193 18.5 10.4
tations, the proposed multiplier architecture is suitable for
210 17.1 12.3
devices with limited silicon area.
980 P. Kitsos et al. / Microelectronics Journal 34 (2003) 975–980

References Theory in Public Key Cryptosystems, PKC, Cheju Island, Korea,


February 13–15, 2001.
[1] S. Lin, D. Costello, Error Control Coding: Fundamentals and [12] M.A. Hasan, M. Ebtedaei, Efficient architectures for computations
Applications, Prentice-Hall, Englewood Cliffs, NJ, 1983. over variable dimensional Galois field, IEEE Trans. Circuits Syst.
[2] A.J. Menezes, P.C. van Oorschot, S.A. Vanstone, Handbook of I. Fundam. Theory Appl. 45 (11) (1998).
Applied Cryptography, CRC Press, Boca Raton, FL, 1997. [13] E. Savas, A.F. Tenca, Ç.K. Koc, A scalable and unified multiplier
[3] H. Li, C.N. Zhang, Efficient cellular automata based versatile architecture for finite field GFðpÞ and GFð2m Þ, Proceedings of the
multiplier for GFð2m Þ, J Inform. Sci. Engng. 18 (4) (2002) 479–488. Cryptographic Hardware and Embedded Systems—CHES, LNCS,
[4] Ç.K. Koç, B. Sunar, Low-complexity bit-parallel canonical and vol. 1965, Springer, Berlin, 2000, pp. 277 –292.
normal basis multipliers for a class of finite fields, IEEE Trans. [14] A.P. Chandrakasan, R.W. Brodersen, Low power digital CMOS
Comput. 47 (3) (1998) 353–356. design, Kluwer Academic Publishers, Dordrecht, 1995.
[5] C. Paar, P. Fleischmann, P. Soria-Rodriguez, Fast arithmetic for [15] N. Koblitz, Elliptic curve cryptosystems, Math. Comput. 48 (1987)
public-key algorithms in Galois field with composite exponents, IEEE 203 –209.
Trans. Comput. 48 (10) (1999) 1025– 1034. [16] V. Miller, Uses of elliptic curves in cryptography, Proceedings of the
[6] C. Paar, N. Lange, A comparative VLSI synthesis of finite field Advances in Cryptology—CRYPTO’85, LNCS, vol. 218, Berlin,
multipliers, Proceedings of the Third International Symposium of Germany, 1986, pp. 417– 426.
Communication Theory and Its Applications, Lake District, UK, July [17] R. Lidl, H. Niederreiter, Introduction to finite fields and their
1995. applications, Cambridge University Press, Cambridge, 1986.
[7] L. Song, K.K. Parhi, Low-energy digit-serial/parallel finite field [18] P.A. Scott, S.E. Travares, L.E. Peppard, A fast VLSI multiplier
multipliers, J. VLSI Signal Process. Syst. 19 (2) (1998) 149–166. for GFð2m Þ, IEEE J. Sel. Areas Commun. sac-4 (Jan) (1986)
[8] G. Orlando, C. Paar, A scalable GFðpÞ elliptic curve processor 62– 65.
architecture for programmable hardware, Proceedings of the Crypto- [19] S. Nandi, B.K. Kar, P.P. Chaudhuri, Theory and applications of
graphic Hardware and Embedded Systems—CHES, LNCS, vol. 2162, cellular automata in cryptography, IEEE Trans. Comput. 43 (12)
Springer, Paris, 2001, pp. 348– 363. (1994) 1346–1357.
[9] M.A. Hasan, M. Wang, V.K. Bhargava, Modular construction of low [20] N.P. Smart, How secure are elliptic curves over composite extension
complexity parallel multipliers for a class of finite field GFð2m Þ, IEEE fields?, Proceedings of the Advances in Cryptology—EUROCRYPT,
Trans. Comput. 41 (8) (1992) 962–971. Innsbruck, Austria, May 6–10, 2001, pp. 30 –39.
[10] H. Wu, Low complexity bit-parallel finite field arithmetic using [21] Ç.K. Koc, T. Acar, Montgomery multiplication in GFð2m Þ, Des. Codes
polynomial basis, Proceedings of the cryptographic hardware and Cryptography 14 (1) (April 1998) 57–69.
embedded systems—CHES, LNCS, vol. 1717, Springer, Worcester, [22] A.K. Lenstra, E.R. Verheulc, Selecting cryptographic key sizes,
MA, 1999, pp. 280–291. J. Cryptol. 14 (2) (2001) 255–293.
[11] W. Geiselmann, H. Lukhaub, Redundant representation of finite fields, [23] Xilinx, Virtex, 2.5 V Field Programmable Gate Arrays, Xilinx, San
Proceedings of the Fourth International Workshop on Practice and Jose, CA, 2002, www.xilinx.com

You might also like