Efficient and High-Throughput Implementations of AES-GCM on FPGAs
Gang Zhou, Harald Michalik
Institute of Computer and Communication Network Engineering, Technical University of Braunschweig
{zhou, michalik}@ida.ing.tu-bs.de

László Hinsenkamp
DSI GmbH, Digital Signalprocessing & Information Technology, Bremen
hinsenk@dsi-it.de

Abstract

This paper addresses efficient and high-throughput implementations of AES-GCM optimized for FPGAs. Two main components, the AES engine and the modular multiplication over $GF(2^m)$, are discussed and their complexities on FPGAs are shown. Instead of discussing the complexities by using AND and XOR gates as primitives, we present the complexity analysis directly based on FPGA primitives, e.g., Look-Up-Tables (LUTs). For the modular multiplier, the straightforward multiplication is used to get a speed-efficient design while Karatsuba's algorithm is used to get an area-efficient design. For the AES engine, the composite field approach is adopted and then inner-round pipelining is applied. The estimated resource consumption returned by the complexity analysis provides a good criterion to minimize the influence of technology mapping. By optimizing and balancing the critical delay of sub-components, two high-performance GCM implementations are presented on Xilinx Virtex-4 devices.

1 Introduction

The Galois/Counter Mode (GCM) [1] is an Advanced Encryption Standard (AES) mode of operation proposed by the National Institute of Standards and Technology (NIST) in 2005. It provides not only high-speed authenticated encryption but also protection against bit-flipping attacks, which is difficult to achieve in previous modes of operation, e.g., Electronic Code Book (ECB), Output Feedback Mode (OFB) and Cipher Block Chaining (CBC). Recently reported Application Specific Integrated Circuit (ASIC) designs for GCM already achieved 34 Giga-bit per second (Gbps) throughput based on 0.18 µm CMOS technology in [2] and 42 Gbps throughput based on 0.13 µm CMOS technology in [3]. However, currently no detailed GCM design on Field Programmable Gate Array (FPGA) is found.

Two main components in GCM are an AES engine and a finite field multiplier over $GF(2^{128})$ in the universal hashing function (GHASH). Many previous works have reported their experimental results on different FPGA platforms both for the multiplier and the AES engine. However, most works still discussed the area and time complexities by using XOR and AND gates as primitives. Most modern FPGAs are based on Look-Up-Table (LUT) technology, e.g. 4-input LUT in Xilinx Virtex-4 [4] and 6-input LUT in Virtex-5 [5]. The previous gate-oriented designs may not achieve the optimal performance when applied on FPGAs. The synthesis algorithm/CAD tool will influence the final results when it converts gate-oriented schematics into LUT-oriented schematics.

This paper presents the complexity analysis and high-throughput implementations for AES-GCM on FPGAs. The main contribution of this work is summarized as follows: Instead of discussing the complexities by using AND and XOR gates as primitives, we present the complexity analysis directly based on LUTs. By balancing the delay in sub-components, we present two GCM implementations for encryption and decryption, respectively. The FPGAs are assumed to be based on 4-input LUT technology in this work. The following convention is used during the timing analysis: $T_X$ and $T_A$ denote the delays of an XOR gate and an AND gate, respectively. $T_{LUT}$ denotes the delay of the 4-input LUT in FPGAs.

The document is organized as follows. In the next section we review the GCM for completeness and present the hardware architecture. Then the complexity of the $GF(2^m)$ multiplier is discussed in section 3. The straightforward method and Karatsuba's algorithm are compared and we select an area-efficient and a delay-optimized implementation. In section 4 the pipelined AES engine is presented. Section 5 presents the experimental results of high-throughput GCM designs. Finally, section 6 concludes this work.

2 GCM architecture

The GCM encryption accepts four inputs: a secret key $K$, an initialization vector $IV$, a plaintext $P$ and additional authenticated data $A$, and then produces a ciphertext $C$ and an authentication tag $T$. $P$ and $A$ are segmented into 128-bit blocks $(P_1, P_2, \ldots, P_n^*)$ and $(A_1, A_2, \ldots, A_m^*)$. The last blocks $P_n^*$ and $A_m^*$ may not be complete blocks and are assumed to have only $u$ bits and $v$ bits, where $1 \le u, v \le 128$. The GCM authenticated encryption operation is defined as
1-4244-1472-5/07/$25.00 © 2007 IEEE, FPT 2007

equations in (1):

$$
\begin{aligned}
H &= E(K, 0^{128}) \\
Y_0 &= \begin{cases} IV \,\|\, 0^{31}1 & \text{if } len(IV) = 96 \\ GHASH(H, \{\}, IV) & \text{otherwise} \end{cases} \\
Y_i &= incr(Y_{i-1}) \quad \text{for } i = 1, \ldots, n \\
C_i &= P_i \oplus E(K, Y_i) \quad \text{for } i = 1, \ldots, n-1 \\
C_n^* &= P_n^* \oplus MSB_u(E(K, Y_n)) \\
T &= MSB_t(GHASH(H, A, C) \oplus E(K, Y_0))
\end{aligned}
\tag{1}
$$

where $A \,\|\, B$ denotes the concatenation of two strings $A$ and $B$; $E(K, X)$ denotes the block cipher encryption of the value $X$ with the key $K$; $incr()$ generates successive counter values for AES; $MSB_t(S)$ returns the bit string containing only the most significant $t$ bits of $S$.

The universal GHASH function is defined over a sequence of Galois field operations over $GF(2^{128})$ with irreducible polynomial $F(x) = x^{128} + x^7 + x^2 + x + 1$. It accepts $A$, $C$ and $H$ as inputs, where $C$ and $H$ are variables defined in (1), and generates a 128-bit hash value $X_{m+n+1}$.

$$
X_i = \begin{cases}
0 & \text{for } i = 0 \\
(X_{i-1} \oplus A_i) \cdot H & \text{for } i = 1, \ldots, m-1 \\
(X_{m-1} \oplus (A_m^* \,\|\, 0^{128-v})) \cdot H & \text{for } i = m \\
(X_{i-1} \oplus C_{i-m}) \cdot H & \text{for } i = m+1, \ldots, m+n-1 \\
(X_{m+n-1} \oplus (C_n^* \,\|\, 0^{128-u})) \cdot H & \text{for } i = m+n \\
(X_{m+n} \oplus (len(A) \,\|\, len(C))) \cdot H & \text{for } i = m+n+1
\end{cases}
$$

where $len(A)$ and $len(C)$ return the bit lengths of $A$ and $C$ as 64-bit values, respectively.

Fig. 1 shows the architecture for GCM encryption (a) and decryption (b). The IV length is fixed to 96 bits for efficient implementations as recommended in the standard. The shadow boxes denote registers and the symbols conform with the GCM algorithm. The register Mask is used to indicate the valid bits of the last data block. Both inputs and outputs are registered to minimize the IO influences and the registers OP and E(K, Cnt) are inserted to reduce the critical delay. Therefore the latency of these architectures is $latency_{AES} + 4$. The main sub-components are: $GF(2^{128})$ multiplier, pipelined AES core, Key Expansion (KE) unit and H/E(K,0) unit. In the AES, KE and H units, the SubBytes transformation is one of the basic components. In the GCM architecture, the SubBytes in AES, KE and H units are assumed to have the same architecture to maintain the consistency and balance the critical delay. Both H and expanded keys are stored in registers so that the generation of the next H and the next expanded keys could be executed in parallel to the encryption with the current key. The critical delay is determined by one of the following terms:

1. The delay from OP register to X register, part of which is the multiplier's delay.
2. The critical delay of E(K,0) unit.
3. The critical delay of KE.
4. The critical delay of the pipelined AES core.

Figure 1. GCM hardware architectures: (a) encryption, (b) decryption

3 Bit Parallel Multiplier over $GF(2^m)$

Let $F(x)$ be the irreducible polynomial and two operands are assumed to be $A(x) = \sum_{i=0}^{m-1} a_i x^i$ and $B(x) = \sum_{i=0}^{m-1} b_i x^i$. The objective is to calculate $C(x) = A(x)B(x) \bmod F(x)$ efficiently. The modular multiplication can be divided into two steps: first a classic multiplication and then a modular reduction. The classic multiplication calculates the polynomial with $2m - 1$ terms

$$
D(x) = \sum_{i=0}^{2m-2} d_i x^i, \quad \text{where } d_k = \sum_{i+j=k} a_i b_j, \quad 0 \le i, j \le m-1 \tag{2}
$$

$D(x)$ can be calculated either by using a straightforward method or by iterating Karatsuba's Algorithm (KA) as discussed in the following subsections.

3.1 Straightforward multiplication

The straightforward multiplication to calculate $D(x)$ is shown in Fig. 2 (a). The computations need $m^2$ AND gates and $(m-1)^2$ XOR gates. The critical delay is $\lceil \log_2 m \rceil T_X + T_A$ if the binary tree method for computing the XOR chain is used.
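
As a software illustration only (it is not part of the hardware design), the straightforward multiplication can be modelled coefficient by coefficient while counting the AND and XOR operations it uses. The function names and the list-of-bits operand format are assumptions made for this sketch.

```python
import math


def schoolbook_product(a_bits, b_bits):
    """Compute D(x) = A(x) * B(x) over GF(2) and count the gates used.

    a_bits[i] and b_bits[i] hold the coefficients a_i and b_i (0 or 1).
    Returns (d_bits, and_gates, xor_gates), where d_bits has 2m - 1 entries.
    """
    m = len(a_bits)
    if len(b_bits) != m:
        raise ValueError("both operands must have m coefficients")
    d_bits, and_gates, xor_gates = [], 0, 0
    for k in range(2 * m - 1):
        # d_k sums a_i * b_j over all i + j = k with 0 <= i, j <= m - 1.
        lo, hi = max(0, k - m + 1), min(k, m - 1)
        partial = [a_bits[i] & b_bits[k - i] for i in range(lo, hi + 1)]
        and_gates += len(partial)       # one AND gate per partial product
        xor_gates += len(partial) - 1   # XOR gates needed to add them up
        d_k = 0
        for p in partial:
            d_k ^= p
        d_bits.append(d_k)
    return d_bits, and_gates, xor_gates


def xor_tree_levels(m):
    """XOR levels on the longest path when a binary XOR tree is used."""
    return math.ceil(math.log2(m))


if __name__ == "__main__":
    m = 128
    _, ands, xors = schoolbook_product([1] * m, [1] * m)
    print(ands, xors, xor_tree_levels(m))
```

For $m = 128$ the sketch reports 16384 AND gates, 16129 XOR gates and 7 XOR levels, in line with $m^2$, $(m-1)^2$ and $\lceil \log_2 m \rceil$.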
Figure 2. Straightforward multiplication: (a) straightforward multiplication $D(x) = A(x) \cdot B(x)$; (b) and (c) schematics conversion of individual $d_k$ terms from gates to LUTs

To estimate the resource consumption on FPGAs, we convert the gate-oriented schematics into LUT-oriented schematics. In the ideal case, 3 gates can be replaced by 1 LUT. For example, 3 gates are required to get $d_1$, but only 1 LUT achieves the same functionality in FPGA as shown in Fig. 2 (b). However, the inputs of LUTs may not be fully used as shown in Fig. 2 (c), where 5 gates are replaced by 2 LUTs to get $d_2$.

From the FPGA design point of view, the number of LUTs to get $d_k$ is

$$
\#LUT_{d_k} = \begin{cases}
1 + \left\lceil \dfrac{2(k+1) - 4}{3} \right\rceil, & k = 0, \ldots, m-1 \\
1 + \left\lceil \dfrac{2(2m-k-1) - 4}{3} \right\rceil, & k = m, \ldots, 2m-2
\end{cases}
\tag{3}
$$

The critical delay occurs at the term $d_{m-1}$. It is $\lceil \log_4 2m \rceil T_{LUT}$.

3.2 Karatsuba's Algorithm

The KA can reduce the complexity of the classic school multiplication. In [6] the gate-based analysis was reported on the fully parallel KA for $GF(2^m)$ as:

$$
\begin{aligned}
\#XOR &= \left(\frac{m}{n}\right)^{\log_2 3} (n^2 + 6n - 1) - 8m + 2 \\
\#AND &= \left(\frac{m}{n}\right)^{\log_2 3} n^2 \\
Delay &\le T_A + T_X \left(\log_2 n + 3 \log_2 \frac{m}{n}\right)
\end{aligned}
$$

where $m$ is the bit-depth of operands, $n$ is the bit-depth of the classic multiplier and $m = 2^r n$ holds.

3.2.1 1-step iteration

Suppose $A = A_h x^{m/2} + A_l$ and $B = B_h x^{m/2} + B_l$, where $A_l$, $A_h$, $B_l$, $B_h$ are $(m/2)$-bit terms. The 1-step iteration of KA can be described as:

$$
\begin{aligned}
F_0 &= A_l B_l \\
F_1 &= (A_h + A_l)(B_h + B_l) \\
F_2 &= A_h B_h \\
D &= F_0 + x^{m/2}(F_0 + F_1 + F_2) + x^m F_2
\end{aligned}
$$

Figure 3. 1-step iteration of KA (input stage, sub-multiplier stage, output stage)

Fig. 3 shows the multiplier architecture by applying 1-step iteration of KA. The multiplier includes three stages: input stage, sub-multiplier stage and output stage, where three sub-multipliers operate in parallel. As indicated in Fig. 3, the input stage requires $m$ LUTs and the output stage requires $m - 1$ LUTs. The delay in both input stage and output stage is $T_{LUT}$. Suppose the number of LUTs of the sub-multiplier is denoted as $LUT_{mult}(m/2, n)$ and the number of LUT logic levels is denoted as $T_{mult}(m/2, n)$. The total complexity can be estimated by:

* LUT complexity: $3 LUT_{mult}(m/2, n) + 2m - 1$,
* LUT delay: $T_{mult}(m/2, n) + 2T_{LUT}$.

We should notice that only 2 out of 4 LUT inputs are used in the input stage. This exhibits the opportunity to reduce the delay when more iterations of KA are performed.

3.2.2 2-step iteration

The complexity for 2-step iteration of KA can be simply deduced from the equation discussed before and given by:

* LUT complexity: $3(3 LUT_{mult}(m/4, n) + 2(m/2) - 1) + 2m - 1 = 9 LUT_{mult}(m/4, n) + 5m - 4$,
* LUT delay: $(T_{mult}(m/4, n) + 2T_{LUT}) + 2T_{LUT} = T_{mult}(m/4, n) + 4T_{LUT}$.

Figure 4. 2-step iteration of KA

The implementation with 4-input LUTs allows further delay optimization by utilizing all inputs. As depicted in Fig. 4, the XOR operations P3 use only 2 inputs so that P1 can be added into the same LUT by utilizing the other 2 inputs of the LUT. The same optimization can be performed for block P4. Thus, the delay from inputs A/B to the outputs of P3/P4 becomes $T_{LUT}$ instead of $2T_{LUT}$. The new LUT delay for 2-step iteration is $T_{mult}(m/4, n) + 3T_{LUT}$.

3.2.3 r-step iteration

By applying the same idea, the complexity of r-step iteration of KA is:

* LUT complexity: $3^r LUT_{mult}(m/2^r, n) + 4m\left(\left(\frac{3}{2}\right)^r - 1\right) - \frac{3^r - 1}{2}$,
* LUT delay: $T_{mult}(m/2^r, n) + (2r - \lfloor r/2 \rfloor) T_{LUT}$.

For the sub-multipliers of size $m/2^r = n$, the multiplication complexity is identical to the straightforward multiplication implementation in section 3.1.

3.3 Modular reduction of D(x)

After we get $D(x)$, the most significant $m - 1$ terms of $D(x)$ are iteratively reduced to polynomials with degree less than $m$ by using the irreducible polynomial $F(x)$. A reduction matrix $[Q]$ [7] can be generated iteratively according to $F(x)$ and then the reduction can be achieved by using XOR arrays. Its complexity depends on the irreducible polynomial $F(x)$. For general irreducible polynomials, the count of XOR gates is bounded by $H(Q)$ and the critical delay is $\lceil \log_2(\theta + 1) \rceil T_X$, where $H(Q)$ is the Hamming weight of matrix $Q$ and $\theta$ is the maximum Hamming weight of the column vectors in matrix $Q$. For the irreducible polynomial $x^{128} + x^7 + x^2 + x + 1$ in the GHASH function, the modular reduction consumes 255 LUTs and the delay is $2T_{LUT}$ according to the modular matrix $[Q]$.

Table 1. Complexity summary of the classic multiplication for m = 128

n     r    (XOR, AND, Total)          T_gate          #LUT     T_LUT
1     7    (12100, 2187, 14287)       21 T_X + T_A    9330     12
2     6    (9913, 2916, 12829)        19 T_X + T_A    7143     10
4     5    (8455, 3888, 12343)        17 T_X + T_A    5928     10
8     4    (7969, 5184, 13153)        15 T_X + T_A    5523     8
16    3    (8455, 6912, 15367)        13 T_X + T_A    5820     8
32    2    (9913, 9216, 19129)        11 T_X + T_A    6783     6
64    1    (12415, 12288, 24703)      9 T_X + T_A     8448     6
128   0    (16129, 16384, 32513)      7 T_X + T_A     10923    4

3.4 Complexity of $GF(2^{128})$ multiplier

TABLE 1 summarizes the complexity of classic multiplication by using the straightforward multiplication and iterating KA for different $n$ with $m = 128$. The last row gives the complexity for the straightforward multiplication because $r = 0$ holds. The straightforward multiplication costs the most resources but has the minimum critical delay. The KA with $n = 4$ consumes the least number of gates (12343) while the KA with $n = 8$ consumes the least number of LUTs (5523). Here we see that the optimum solution in ASICs may not be the optimum solution on FPGAs.

After being combined with the modular reduction operation, which consumes 255 LUTs with the critical delay $2T_{LUT}$ as shown in section 3.3, the characteristics of a delay-optimized and an area-efficient modular multiplication over $GF(2^{128})$ are summarized as:

* delay-optimized: straightforward multiplication, #LUTs: 11178, delay: $6T_{LUT}$.
* area-efficient: KA with $n = 8$, #LUTs: 5778, delay: $10T_{LUT}$.

4 AES Engine

In this section, we first discuss the complexity for SubBytes in the composite field approach and for the MixColumns operations. Then two pipelined architectures are presented, which have balanced delay with the previous $GF(2^{128})$ multipliers.

4.1 SubBytes

The composite field approach not only reduces the hardware complexity but also exhibits the advantages of inner-round pipelining [8]. The Galois field $F_1$: $GF(2^8)$ is mapped into the composite field $F_2$: $GF((2^4)^2)$ or sometimes $GF(((2^2)^2)^2)$ [9]. In this paper, we construct the isomorphic composite field by using the fields defined in (4) [10]. The field conversion matrix $\delta$ is given in (5).

$$
\begin{cases}
GF(2) \Rightarrow GF(2^4): & p(x) = x^4 + x + 1 \\
GF(2^4) \Rightarrow GF((2^4)^2): & q(y) = y^2 + y + \phi
\end{cases}
\tag{4}
$$

where $\phi = \{1000\}_2$ or $x^3$.

    δ =  1 0 0 0 0 1 0 1
         0 0 1 0 0 0 0 0
         0 0 1 1 1 1 1 1
         0 0 0 1 1 0 0 0
         0 0 0 0 1 1 1 0
         0 1 1 0 1 1
         0 0 1 1 0 1 0 1
         0 0 0 0 0 1 0 1          (5)

Figure 5. The implementation of SubBytes: (a) SubBytes implementation by composite field approach; (b) Q1&Q3 delay reduction; (c) Q2&Q4 delay reduction

Table 2. Complexity of SubBytes

component     #LUTs   T_LUT   comments
Q1&Q3         7+4     2       transformation ($F_1 \to F_2$) and addition over $GF(2^4)$
Q2&Q4         2       1       squaring and constant multiplication over $GF(2^4)$
Q6            4       1       addition over $GF(2^4)$
Q7            4       1       inversion over $GF(2^4)$
Q5, Q8, Q9    13      2       multiplication over $GF(2^4)$
Q10           9       2       transformation ($F_2 \to F_1$) and affine transformation
Total         69      10      SubBytes transformation

Fig. 5 (a) describes the optimized design of the SubBytes transformation over the composite field based on the work in [9]. It includes 10 sub-components and the shadow components (Q2 to Q9) indicate that the operations are performed over $GF(2^4)$. The number pairs under each component are the estimated number of LUTs and the number of LUT levels. In order to reduce the logic delay and improve the resource utilization, blocks Q1 and Q3 are combined as indicated in Fig. 5 (b) and blocks Q2 and Q4 are combined as indicated in Fig. 5 (c). Detailed descriptions of each component are listed in TABLE 2. Since they are matrix conversion operations and simple arithmetic in a small finite field, we omit the logic equations and truth tables for brevity. The total number of LUTs is 69 and the critical propagation delay is $10T_{LUT}$, which is the sum of the critical path through components Q1&Q3, Q5, Q6, Q7, Q8/Q9, Q10.

4.2 MixColumns

The matrix transformation can be achieved by expanding the equation (5.6) in the AES standard [11] as:

$$
\begin{aligned}
s'_{0,c} &= x s_{0,c} + s_{1,c} + (x s_{1,c} + s_{2,c} + s_{3,c}) \\
s'_{1,c} &= x s_{2,c} + s_{0,c} + (x s_{1,c} + s_{2,c} + s_{3,c}) \\
s'_{2,c} &= x s_{2,c} + s_{3,c} + (x s_{3,c} + s_{0,c} + s_{1,c}) \\
s'_{3,c} &= x s_{0,c} + s_{2,c} + (x s_{3,c} + s_{0,c} + s_{1,c})
\end{aligned}
\tag{6}
$$

where the constant terms $\{02\}_{16}$ and $\{03\}_{16}$ are rewritten as polynomials $x$ and $x + 1$.

Figure 6. Implementation of MixColumns

Fig. 6 shows the implementation of MixColumns on FPGAs by sharing the common expressions in (6). Blocks B1 to B6 achieve the same functionality as indicated on the right of the figure, which is $Z = xU + V + W$, where $U, V, W, Z \in GF(2^8)$. Because $xU = (u_6, u_5, u_4, u_3 + u_7, u_2 + u_7, u_1, u_0 + u_7, u_7)$, $Z$ can be calculated as $Z = (u_6 + v_7 + w_7, u_5 + v_6 + w_6, u_4 + v_5 + w_5, u_3 + u_7 + v_4 + w_4, u_2 + u_7 + v_3 + w_3, u_1 + v_2 + w_2, u_0 + u_7 + v_1 + w_1, u_7 + v_0 + w_0)$. Each bit of $Z$ depends on at most 4 inputs, so each block $B_i$ requires only 8 LUTs; the total number of LUTs for the MixColumns transformation is 48 and the delay is $2T_{LUT}$.

4.3 Pipelined AES Core

The AES algorithm can be easily pipelined by unrolling the round iteration as the architecture shown in Fig. 7 (a). The
critical delay is shown in Fig. 7 (b) if the composite field SubBytes approach is used. The last round can be obtained by simply removing the MixColumns transformation. The composite field SubBytes can be further pipelined by inserting registers into the architecture. For example, 1 level of registers can be added before the components Q8 and Q9 so that the critical delay reduces to $7T_{LUT}$, which originates from the second part in a single round (see Fig. 7 (b)). The delay could be further reduced if we add more registers, e.g., the delay becomes $4T_{LUT}$ if two registers are added. But we won't get a better system performance since the fastest $GF(2^{128})$ bit parallel multiplier has the critical delay $6T_{LUT}$.

Figure 7. The pipelined AES architecture: (a) architecture, with each round consisting of SubBytes (Fig. 5), ShiftRows (wiring), MixColumns (Fig. 6) and AddRoundKey (XORs); (b) the critical delay chain

According to the resource estimation for SubBytes and MixColumns transformations, the number of LUTs for the general rounds and the last round can be calculated by $16 \times LUT_{SubBytes} + 4 \times LUT_{MixColumns} + 128 = 1424$ and $16 \times LUT_{SubBytes} + 128 = 1232$, respectively. In total, the pipelined AES engine requires $128 + 9 \times LUT_{general\ round} + LUT_{last\ round} = 14176$ LUTs.

4.4 H/E(K,0) Unit

The E(K,0) unit can be implemented with an iterative AES architecture as shown in Fig. 8. The SubBytes and MixColumns operations use the same implementations as the pipelined AES core in order to maintain the balanced delay. The input to the iterative engine is simplified to be only $K$ because the plaintext input is 0. Since only 2 of 4 inputs of the LUTs are used in the AddRoundKey operations, MUX2 can be finished by utilizing the same LUTs. However, MUX1 needs 128 LUTs and the critical delay is $1T_{LUT}$. Therefore, the number of LUTs for E(K,0) is estimated as: $128 + 16 \times 69 + 48 \times 4 + 128 = 1552$. The critical delay is $14T_{LUT}$ if no inner-round register is inserted and $7T_{LUT}$ if 1 level of registers is inserted. The control logics for the iterative AES engine cost only a few resources and thus are ignored in the estimation.

Figure 8. Architecture of the H/E(K,0) unit

4.5 Key Expansion (KE)

The KE unit generates the expanded keys for both the pipelined AES engine and the E(K,0) unit as shown in Fig. 9, in which the shadow boxes are registers. Again, the SubBytes use the same implementations as the pipelined AES engine to balance the critical delay. Both the MUX operation and the XOR operations consume 128 LUTs. The critical delay is dominated by the SubBytes transformation, a 5-level XOR chain and a MUX operation. On FPGAs the 5-level XOR chain can be finished by 2 levels of LUTs. Therefore the resource estimation is calculated as: $69 \times 4 + 128 + 128 = 532$ LUTs. The rcon updating and control logics are ignored since they consume only a few resources and have no significant influence on the critical delay.

Figure 9. Architecture of KE

5 High-Throughput GCM Design

5.1 Resource Estimation

TABLE 3 shows the estimated resources and delay for two GCM designs. Each design can be used both for GCM encryption and decryption since they have similar architectures as shown in Fig. 1. The value pairs listed in the table are the estimated number of LUTs and the critical delay in number of LUT levels. We should keep in mind that the multiplier delay is only part of the delay from OP register to X register.

Table 3. Resource estimation of high-throughput GCM designs

GCM    AES          AES latency   E(K,0)      KE         Multiplier   Total LUT   Delay      Feature
GCM1   (14176,13)   11            (1552,14)   (532,13)   (5778,10)    22038       14 T_LUT   area-efficient
GCM2   (14176,7)    21            (1552,7)    (532,7)    (11178,6)    27428       7 T_LUT    speed-efficient
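
The totals in Table 3 can be cross-checked from the per-component pairs. The short sketch below is an illustration only: the dictionary layout and function names are assumptions, and it adds one LUT level to the multiplier delay because the text states that the OP-to-X register path is the multiplier delay plus $T_{LUT}$.

```python
# (LUT count, LUT levels) pairs copied from Table 3.
DESIGNS = {
    "GCM1": {"AES": (14176, 13), "E(K,0)": (1552, 14),
             "KE": (532, 13), "Multiplier": (5778, 10)},
    "GCM2": {"AES": (14176, 7), "E(K,0)": (1552, 7),
             "KE": (532, 7), "Multiplier": (11178, 6)},
}


def total_luts(design):
    """Sum the LUT estimates of all four components of a design."""
    return sum(luts for luts, _ in DESIGNS[design].values())


def critical_levels(design):
    """Largest number of LUT levels among the four candidate paths."""
    levels = {name: lv for name, (_, lv) in DESIGNS[design].items()}
    # The OP-to-X register path is the multiplier delay plus one LUT level.
    levels["Multiplier"] += 1
    return max(levels.values())


if __name__ == "__main__":
    for name in DESIGNS:
        print(name, total_luts(name), critical_levels(name))
```

With these inputs the GCM1 sum reproduces the listed 22038 LUTs and 14 levels, and GCM2 gives 7 levels. The four GCM2 components add up to 27438, ten more than the 27428 listed in the table; the text does not explain this difference.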

Table 4. Synthesis and Place&Route results of GCM designs on Xilinx XC4VLX40ff668-12

design   LUT      AES        GCM       LUT      Slices   Period   Slices   Usage   Period   Freq.   Thr.     Thr./Sli.
         (Est.)   pipeline   latency   (Syn.)   (Syn.)   (Syn.)   (PAR)    (PAR)   (PAR)    (PAR)   (PAR)    (PAR)
         #        stage      cycle     #        #        ns       #        %       ns       MHz     Gbit/s   Mbit/s
GCM1-e   22038    11         15        22596    11298    6.861    13523    73%     8.393    119     15.232   1.126
GCM1-d   22038    11         15        22533    11277    6.861    13505    73%     8.399    119     15.232   1.126
GCM2-e   27428    21         25        27941    13971    3.905    16378    88%     6.189    161     20.608   1.258
GCM2-d   27428    21         25        27544    13722    3.905    16396    88%     6.195    161     20.608   1.257
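
The throughput columns of Table 4 are consistent with one 128-bit block being processed per clock cycle. This is an inference from the listed numbers rather than a statement in the text, and the helper names below are assumptions made for this sketch.

```python
BLOCK_BITS = 128  # assumed: one 128-bit block per clock cycle


def max_freq_mhz(period_ns):
    """Clock frequency in MHz for a given clock period in ns."""
    return 1000.0 / period_ns


def throughput_gbps(freq_mhz):
    """Throughput in Gbit/s at one block per cycle."""
    return freq_mhz * BLOCK_BITS / 1000.0


def throughput_per_slice_mbps(freq_mhz, slices):
    """Throughput in Mbit/s divided by the number of occupied slices."""
    return freq_mhz * BLOCK_BITS / slices


if __name__ == "__main__":
    # (design, PAR period in ns, PAR frequency in MHz, PAR slices) from Table 4.
    rows = [("GCM1-e", 8.393, 119, 13523), ("GCM2-e", 6.189, 161, 16378)]
    for name, period, freq, slices in rows:
        print(name,
              int(max_freq_mhz(period)),
              round(throughput_gbps(freq), 3),
              round(throughput_per_slice_mbps(freq, slices), 3))
```

For the two encryption rows the sketch returns 119 MHz, 15.232 Gbit/s and 1.126 Mbit/s per slice for GCM1-e, and 161 MHz, 20.608 Gbit/s and 1.258 Mbit/s per slice for GCM2-e.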

The relation $delay_{OP \Rightarrow X} = delay_{multiplier} + T_{LUT}$ holds. For GCM1, the SubBytes transformation is implemented by pure combinatorial logic and the area-efficient multiplier (KA until $n = 8$) is selected since its delay is still smaller than that of the other components. The critical delay of GCM1 is determined by the E(K,0) unit, that is $14T_{LUT}$. For GCM2, 1 level of registers is inserted into the composite field SubBytes implementations and the delay-optimized multiplier (straightforward multiplication) is selected, because otherwise the multiplier would slow down the whole system. In GCM2, all paths in AES, E(K,0), KE and the path from OP register to X register have the same critical delay: $7T_{LUT}$.

5.2 Experimental Results

Both GCM1 and GCM2 were implemented in VHDL code for authenticated encryption and decryption, respectively. All implementations were simulated in Modelsim and verified against Matlab outputs. We use Precision RTL Synthesis 2005 from Mentor Graphics as the synthesis tool and Xilinx ISE 8.1 to perform Place And Route (PAR) with default settings. In order to help the synthesis tool to perform the technology mapping, we purposely added some parentheses into the VHDL code and partially added manual technology mapping with LUT primitives from the Xilinx unisim library. TABLE 4 summarizes the resource consumption and performance on a target device Xilinx Virtex-4LX-ff668-12, in which the number of available slices is 18432. The critical delay of PAR is derived from the design by gradually adjusting the timing constraint with resolution 0.1 ns.

The synthesis results consume about 500 more LUTs than estimated because of the control logics in the KE unit, the E(K,0) unit and the top GCM unit. The pipelined AES engine and the multiplier consume exactly the same number of LUTs as the estimated resources. After PAR, GCM1 consumes 73% of the available slices and reaches the throughput of 15.232 Gbit/s. GCM2 consumes 88% of the resources and reaches the throughput of 20.608 Gbit/s. Unfortunately, the maximal running frequency of GCM2 is only 40% higher than that of GCM1 although the critical delay of GCM2 ($7T_{LUT}$) is half of the critical delay of GCM1 ($14T_{LUT}$) according to the estimation. The reason is revealed by utilizing Xilinx Timing Analysis on the critical path as indicated in TABLE 6. In GCM2, the critical delay is dominated by the modular multiplier and most inputs in the straightforward multiplication have big fan-out, which has a non-negligible influence on the routing delay in FPGAs. The logic delay ratio between GCM1 and GCM2 is about 1.182/2.191 ≈ 54%, which conforms with our estimation, while the routing delay ratio increases to 5.008/6.202 ≈ 81%.

Table 6. The logic and routing delay

design   period   skew   logic   per.   routing   per.
         ns       ns     ns      %      ns        %
GCM1-e   8.393    0      2.191   26%    6.202     74%
GCM1-d   8.399    0      2.046   24%    6.353     76%
GCM2-e   6.189    0      1.181   19%    5.008     81%
GCM2-d   6.195    0      1.019   16%    5.176     84%

Nevertheless, the throughput per slice of GCM2 is higher than that of GCM1. When the designs are applied with an IPsec encapsulating security payload mechanism as suggested in [15], GCM1 is suitable for systems dominated by small packets while GCM2 is expected to have better performance for systems dominated by big packets because of the different output latencies.

To the best of our knowledge, this report is the first detailed evaluation of high performance AES-GCM implementations on modern FPGAs. Both Elliptic Semiconductor [16] and Helion Technology [17] have claimed that they can provide FPGA IP cores for AES-GCM, however we could not find the resource and performance data sheets on their web sites. The IP core provided by Jetstream Media Technologies [18] adopts an iterative AES engine. On Virtex-4 its throughput is 2.7 Gbps with the resource consumption of 2090 slices and 10 BlockRAMs according to the data sheets. Furthermore, no information about the implementation technologies was given.

We also built a stand-alone Area Efficient AES Encryption (AE-AES-E) engine by using the pipelined AES architecture and key expansion unit. As shown in TABLE 5, our design

Table 5. Resource comparison of memoryless stand-alone AES engine

Design      [12]       [10]        [9]                   [13]             [14]       AE-AES-E   AE-AES-E
            enc        enc         enc/dec (excl. KE)    enc (excl. KE)   enc/dec    enc        enc
Device      Virtex-E   Virtex-II   Virtex-E              Virtex-II Pro    Virtex-E   Virtex-E   Virtex-4
Reg/Round   7          4           7                     7                7          2          2
Slices      15112      10750       11022                 9446             16693      8070       8035

consumes the least resources compared with the recently published memoryless pipelined AES implementations (the decryption functionality can be achieved with at most 2000 additional slices according to our estimation). Although the implementations in [12], [10], [9], [13] and [14] also adopt the composite field approach for the SubBytes transformation, the authors did not provide detailed complexity analysis for FPGA platforms. So for their implementations one cannot judge whether the synthesis tool performed the technology mapping efficiently or not. The register level per round of our AES engine is only 2 but other designs used 7 or 4 register levels. Therefore we don't compare the speed and throughput with other designs. Increasing the inner-round pipelining level will not improve the GCM system performance because of the critical delay of the modular multiplier.

6 Conclusion

In this paper we have shown two high performance GCM implementations on FPGAs by balancing the critical delay of the AES engine and the modular multiplier. The complexity analysis of the sub-components was directly based on the number of LUTs. Karatsuba's algorithm was adopted to reduce the complexity of the modular multiplier while the composite field approach was used to reduce the complexity of the SubBytes implementation. The estimated resource consumption was used to ensure compact implementations because it is a good criterion to judge the efficiency of the technology mapping. The implementation combining a normal pipelined AES engine with a modular multiplier that iterates Karatsuba's algorithm until $n = 8$ consumes 13523 slices and reaches the throughput of 15 Gbit/s on Xilinx Virtex-4 devices. The implementation combining an inner-round pipelined AES with a modular multiplier that uses the straightforward multiplication reaches the throughput of 20 Gbit/s. The complexity analysis and implementations can be easily extended to support 192-bit and 256-bit key lengths.

References

[1] D. A. McGrew and J. Viega, "The Galois/Counter Mode of Operation (GCM)," Updated submission to NIST, Modes of Operation Process, May 2005.
[2] B. Yang, S. Mishra, and R. Karri, "High Speed Architecture for Galois/Counter Mode of Operation (GCM)," http://eprint.iacr.org/2005/146.pdf, 2005.
[3] A. Satoh, "High-speed hardware architectures for authenticated encryption mode GCM," in Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International Symposium on, May 2006.
[4] Xilinx, "Virtex-4 User Guide, V1.5," Mar. 2006, http://www.xilinx.com.
[5] Xilinx, "Virtex-5 User Guide, V2.1," Oct. 2006, http://www.xilinx.com.
[6] F. Rodriguez-Henriquez and C. K. Koc, "On fully parallel Karatsuba Multipliers for GF(2^m)," in Proceedings of the International Conference on Computer Science and Technology, 2003.
[7] A. Reyhani-Masoleh and A. Hasan, "Low Complexity Bit Parallel Architecture for Polynomial Basis Multiplication Over GF(2^m)," IEEE Transaction on Computers, vol. 53, no. 8, pp. 945-959, Aug. 2004.
[8] V. Rijmen, "Efficient Implementation of the Rijndael S-box," http://www.esat.kuleuven.ac.be/~rijmen/rijndael/sbox.pdf, 2000.
[9] X. Zhang and K. K. Parhi, "High-Speed VLSI Architectures for the AES Algorithm," IEEE Transaction on VLSI, vol. 12, no. 9, pp. 957-967, 2004.
[10] K. U. Jaervinen, M. T. Tommiska, and J. O. Skyttae, "A Fully Pipelined Memoryless 17.8 Gbps AES-128 Encryptor," FPGA, 2003.
[11] NIST, "Advanced Encryption Standard (AES)," FIPS Publication 197, Nov. 26, 2001.
[12] F.-X. Standaert, G. Rouvroy, J.-J. Quisquater, and J.-D. Legat, "Efficient Implementation of Rijndael Encryption in Reconfigurable Hardware: Improvements and Design Tradeoffs," Cryptographic Hardware and Embedded Systems (CHES), 2003.
[13] A. Hodjat and I. Verbauwhede, "A 21.54 Gbits/s Fully Pipelined AES Processor on FPGA," in Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM'04, 2004.
[14] T. Good and M. Benaissa, "AES on FPGA: from the fastest to the smallest," Cryptographic Hardware and Embedded Systems (CHES), 2005.
[15] J. Viega and D. A. McGrew, "The Use of Galois/Counter Mode (GCM) in IPsec Encapsulating Security Payload," RFC 4106, Network Working Group, June 2005.
[16] Elliptic Semiconductor, "http://www.ellipticsemi.com."
[17] Helion Technology, "http://www.heliontech.com."
[18] Jetstream Media Technologies, "http://www.jetsmt.com."
