Information Sciences

Manuscript Draft

Manuscript Number: INS-D-13-1975

Title: A Highly Parallel Area Efficient Byte-Substitution S-Box Architecture for the AES Cryptosystem

Article Type: Full length article

Keywords: Cryptography,
Advanced Encryption Standard (AES),
S-Box Byte Substitution,
Parallel S-Box architecture,
Pipeline S-Box.

A Highly Parallel Area Efficient Byte-Substitution S-Box Architecture for the AES Cryptosystem
Mostafa Abd-El-Barr and Altaf Al-Farhan
Department of Information Science
College of Computing Sciences and Engineering (CCSE)
Kuwait University
mostafa.abdelbarr@gmail.com and eng.altaf@hotmail.com


Abstract—The performance of the S-Box is an important factor affecting the speed, area, and power consumption of the Advanced Encryption Standard (AES) cryptosystem. A number of techniques have been presented in the literature to improve the performance of S-Box byte substitution. In this paper, we classify the S-Box byte-substitution optimization techniques reported in the literature into those based on hardware, software, and combined hardware/software. We then propose a new highly parallel and area-efficient S-Box architecture. We also conduct a performance analysis and compare the proposed architecture with existing techniques. Our comparison is based on the AT² measure, where A is the area and T is the delay. The comparison shows that the proposed architecture outperforms the existing techniques in terms of delay and area. Moreover, the pipelinability of the proposed S-Box architecture is presented, resulting in further time savings, higher throughput, and higher hardware resource utilization.

Index Terms— Cryptography, Advanced Encryption Standard (AES), S-Box Byte Substitution, Parallel S-Box architecture, Pipeline
S-Box.

1. Introduction

The National Institute of Standards and Technology (NIST) organized an open competition in 1997 to find a replacement for the DES cryptosystem. The Rijndael cryptosystem, proposed by Joan Daemen and Vincent Rijmen, was one of the candidate systems [1]. In 2000, NIST adopted a slightly modified version of Rijndael as the new security standard. This standard is now known as the Advanced Encryption Standard (AES). The AES is a symmetric-key block cipher algorithm used to encrypt/decrypt sensitive data worldwide. Researchers have made numerous performance enhancements to the AES in terms of area, delay, and power consumption, see for example [2]-[4].
According to the AES, the data to be encrypted is divided into equally sized blocks, each called a state. Each state has a fixed size of 128 bits arranged in a four-by-four grid of cells, each holding one byte. The algorithm performs a series of mathematical operations on each state based on the Substitution-Permutation Network principle to produce a cipher text. The algorithm starts with an initial step in which it adds the round key to the state. After that, the state goes through a loop of four repeated operations: sub bytes, shift rows, mix columns, and add round key. This is followed by a final iteration, which excludes the mix columns operation, to produce the cipher text. Similar to encryption, the decryption of a cipher text starts with the add round key operation. However, the decryption operations are ordered as follows: inverse shift rows, inverse sub bytes, add round key, and inverse mix columns. The plain text is obtained after a final iteration that excludes the inverse mix columns operation. Fig. 1 provides an illustration of the AES encryption/decryption processes. For the ten-round version of AES considered in this work, the key size is 128 bits, corresponding to four words of four bytes each. The initial key, shown in Fig. 1 as w[0, 3], is used to obtain ten sub-keys fed to the add round key operations. This is done by the key expansion iterative procedure.
Among the four loop operations, the bytes substitution is performed using what is known as the substitution box (S-Box). The
S-Box performs a non-linear transformation on data by replacing each individual byte by a different byte. The main purpose of
the byte substitution is to bring confusion to the data to be encrypted [5]. The replacement bytes can be obtained on the fly by determining the multiplicative inverse of a given byte in the finite field GF(2⁸), followed by an affine transformation in GF(2). Alternatively, the replacement bytes can be pre-calculated and stored in a look-up table (LUT) in the S-Box. It should be noted that the replacement bytes used for encryption are different from those used for decryption.

Fig. 1. AES iterative architecture.

Byte substitution is considered one of the most complex loop operations. Hence, a number of research efforts were devoted to the
optimization of the byte substitution both in time and hardware complexity [6]-[18]. The research work reported in this paper
concentrates on proposing a new method for designing a highly parallel area-efficient S-Box architecture for the AES
cryptosystem. The paper is organized as follows. In Section 2, we provide some background material. In Section 3, we provide a
classification and a brief summary of the existing S-Box realization techniques. In Section 4, we preview the related work
reported in the literature. In Section 5, we introduce the proposed S-Box architecture and algorithm. In Section 6, we illustrate
the Complementary Metal–Oxide–Semiconductor (CMOS) implementation of the proposed architecture. In Section 7, we
provide estimation of the delay and area needed for the proposed architecture. In Section 8, we provide a comparison with
existing techniques. In Section 9 we illustrate the pipelinability of the proposed architecture together with an evaluation of the
resulting improvement in the performance. Section 10 provides a number of concluding remarks.

2. Background Material

The S-Box computation basically involves two steps: the multiplicative inverse and the affine transformation. The multiplicative inverse is complex to perform in GF(2⁸). Thus, it is simpler to perform it using what is known as composite field arithmetic, because of the simplified hardware needed for the inversion and multiplication operations required for the S-Box computation. Fig. 2 illustrates the S-Box computation steps using two composite field approaches, based on the GF(2⁴) and GF(2²) sub-fields, respectively. In this case the input is represented as an element of GF((2⁴)²), see Fig. 2(a), or GF(((2²)²)²), see Fig. 2(b).

(a) Using GF((2⁴)²): map the input into the smaller-degree representation, compute the multiplicative inverse there, re-map the result into degree 8, and perform the affine transformation.

(b) Using GF(((2²)²)²): map the input into the smaller-degree representation, compute the multiplicative inverse there, re-map the result into degree 8, and perform the affine transformation.

Fig. 2. S-Box computation steps using two different composite fields

The substitution values can be pre-computed and stored in one look-up table (LUT). The substitution of a byte {xy} that belongs to GF(2⁸), written in hexadecimal notation, can be obtained using Table 1 [8].

Table 1
S-Box Lookup Table (LUT)

x \ y   0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0 63 7c 77 7b f2 6b 6f c5 30 01 67 2b fe d7 ab 76
1 ca 82 c9 7d fa 59 47 f0 ad d4 a2 af 9c a4 72 c0
2 b7 fd 93 26 36 3f f7 cc 34 a5 e5 f1 71 d8 31 15
3 04 c7 23 c3 18 96 05 9a 07 12 80 e2 eb 27 b2 75
4 09 83 2c 1a 1b 6e 5a a0 52 3b d6 b3 29 e3 2f 84
5 53 d1 00 ed 20 fc b1 5b 6a cb be 39 4a 4c 58 cf
6 d0 ef aa fb 43 4d 33 85 45 f9 02 7f 50 3c 9f a8
7 51 a3 40 8f 92 9d 38 f5 bc b6 da 21 10 ff f3 d2
8 cd 0c 13 ec 5f 97 44 17 c4 a7 7e 3d 64 5d 19 73
9 60 81 4f dc 22 2a 90 88 46 ee b8 14 de 5e 0b db
a e0 32 3a 0a 49 06 24 5c c2 d3 ac 62 91 95 e4 79
b e7 c8 37 6d 8d d5 4e a9 6c 56 f4 ea 65 7a ae 08
c ba 78 25 2e 1c a6 b4 c6 e8 dd 74 1f 4b bd 8b 8a
d 70 3e b5 66 48 03 f6 0e 61 35 57 b9 86 c1 1d 9e
e e1 f8 98 11 69 d9 8e 94 9b 1e 87 e9 ce 55 28 df
f 8c a1 89 0d bf e6 42 68 41 99 2d 0f b0 54 bb 16

Consider, for example, that we need to find the byte substitution for the byte {b9}. Referring to row (b) and column (9) in
Table 1, the substitution value is simply {56}. Alternatively, byte substitution can be performed in two steps. The first step is to
find the multiplicative inverse pre-computed and stored in a table similar to Table 1 [14]. The second step that follows the
multiplicative inverse is the affine transformation. The latter is performed over GF(2) using the following equation:

b'_i = b_i ⊕ b_((i+4) mod 8) ⊕ b_((i+5) mod 8) ⊕ b_((i+6) mod 8) ⊕ b_((i+7) mod 8) ⊕ c_i        (1)

where b_i is the i-th bit of the byte resulting from the multiplicative inverse operation, b'_i is the corresponding affine-transformed bit, and c_i is the i-th bit of the constant c with the value {63} or {0110 0011} [9]. For instance, the multiplicative inverse of byte {b9} is found to be {8e} [14]. Then, the affine transformation of byte {8e} is performed using equation (1). Applying the equation to all eight bits ends up with the same result as above, i.e. {56}.
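For concreteness, the following sketch (in Python; an illustration only, not part of the proposed hardware) applies equation (1) to the multiplicative inverse {8e} of the example byte {b9}; the inverse value is taken from the worked example above rather than computed in GF(2⁸).

```python
# A minimal sketch of equation (1): the GF(2) affine transformation applied to the output
# of the multiplicative-inverse step. The inverse value 0x8e of byte 0xb9 is taken from the
# example above; a full S-Box computation would also derive the inverse in GF(2^8).

def affine_transform(b, c=0x63):
    """Apply the AES affine transformation of equation (1) bit by bit over GF(2)."""
    result = 0
    for i in range(8):
        bit = ((b >> i) ^ (b >> ((i + 4) % 8)) ^ (b >> ((i + 5) % 8)) ^
               (b >> ((i + 6) % 8)) ^ (b >> ((i + 7) % 8)) ^ (c >> i)) & 1
        result |= bit << i
    return result

inv = 0x8e                            # multiplicative inverse of {b9} in GF(2^8) [14]
print(hex(affine_transform(inv)))     # prints 0x56, matching Table 1
```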

3. Classification of the S-Box Realization Techniques

We classify the S-Box realization techniques into three basic ones. Fig. 3 illustrates such classification.

S-Box Realization Techniques:
  3.1 Hardware Techniques: polynomial basis, normal basis, and combined polynomial/normal basis.
  3.2 Software Techniques: serial techniques, parallel techniques, and a blend of serial/parallel techniques.
  3.3 Combined Hardware/Software Techniques.

Fig. 3. Classification of S-Box realization techniques.


3.1 Hardware Techniques

Hardware techniques are those in which the substitution values are computed rather than reading them from a memory. The
representation of the finite field elements has an impact on the hardware in terms of the chip area. Polynomial and normal bases
are the two popular representations. Canright has examined all possible choices of representations in each sub-field including
purely polynomial, purely normal basis, and a combination of both [10]. The total number of isomorphisms was found to be 432
different choices. The case that uses polynomial basis for all sub-fields is the one proposed by Satoh et al. [7]. The optimal
representation, which has the minimum gate count, was obtained using an exhaustive tree-search algorithm. Our findings showed
that the representation made by Canright in [10] is the most compact among the techniques found in the literature. It is
implemented in normal bases, which uses sub-field arithmetic to compute the AES S-Box. In each sub-field both inverse and
multiplication operations are required [10]. Further optimization was achieved by calculating a merged S-Box for both
encryption and decryption. The merged S-Box is 20% more compact than separate computations of S-Box and its inverse.
Canright's implementation is best suited for limited space applications, e.g. mobile applications. Furthermore, it can be
parallelized and pipelined for higher throughput. The main drawback of Canright's method is its relatively long critical path and
high power consumption [11]. Alternatively, Nikova et al. proposed an implementation using normal bases that differs in the
level of decomposition [12]. Both normal bases implementations [10] and [12] are compact, however the choice of which to use
depends on the application and hardware technology used. The polynomial representations by Rijmen [13], Satoh et al. [7], and
Wolkerstorfer et al. [14] were less efficient compared to normal bases representations in terms of chip area.
The implementation that resulted in least power consumption according to [11] is the one proposed by Bertoni et al. [17]. The
proposed solution uses a synthesis methodology composed of three main blocks. The first block is a multi-level decoder
followed by a permutation block, which does the S-Box computation, and ends with an encoder. Another method, using decoders implemented as a twisted binary decision diagram (BDD), was presented in [15]. The idea of merging the sub bytes and mix columns operations was also discussed there and named the T-Box. Later, Hobson et al. [16] proposed a fast BDD approach that resulted in a gate count about one fourth of that achieved in [15]. Table 2 shows a comparison among a number of hardware-based
techniques in terms of Area, Delay, and Power Consumption [11]. It should be noted that all values listed in the table are the
minimum reported in each case.

Table 2
Delay and Area Comparisons

Implementation        Area (GE) A    Critical Path Delay (ns) T    Power Normalized P
Satoh [7]             360            9                             16.75
Wolkerstorfer [14]    382.5          8                             9.5
Canright [10]         281.25         8                             6.125
Bertoni [17]          1608.75        3                             0.375

3.2 Software Techniques

Software implementation is based on using a look-up table (LUT). In this case a RAM is used to store the pre-calculated S-Box values. Techniques under this approach range from the totally serial (slow) to the totally parallel (fast). In the totally serial approach, each byte of the State is substituted by the corresponding S-Box byte one at a time.
Consider, for example, the 128 bit AES case. In this case each of the 16 bytes in a state is replaced by the corresponding S-Box
byte. Assume that the area required for the LUT is A, and the time needed to substitute one byte is T, then the total time needed
to substitute all sixteen State bytes is 16T. The other extreme is the totally parallel approach. In this case there are 16 S-Boxes
each consists of 16 bytes such that all of the 16 bytes of the State are substituted by their respective corresponding S-Box bytes
simultaneously. The total time needed to substitute all sixteen State bytes is T, while the area required is 16A.
The two cases shown in Fig. 4 represent two extreme cases: the most restricted serial and the most flexible parallel ones. In
between, there exist a number of possible intermediate cases like the ones discussed in Section 4. These intermediate cases
exhibit tradeoffs between area and time.

(a) Totally serial S-Box architecture. (b) Totally parallel S-Box architecture.

Fig. 4. Serial versus parallel S-Box architecture.
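The trade-off between the two extremes can be captured by a simple cost model; the sketch below (an illustration only, with A and T as abstract units) shows how area grows and time shrinks as the number of parallel S-Box copies k increases.

```python
# A minimal sketch of the area/time trade-off between the serial and parallel LUT approaches:
# A is the area of one 256-byte LUT, T the time to substitute one byte, and k the number of
# S-Box copies (k = 1 is fully serial, k = 16 fully parallel for a 128-bit state).

def lut_cost(k, A=1.0, T=1.0, state_bytes=16):
    area = k * A                      # k parallel copies of the LUT
    time = (state_bytes // k) * T     # k bytes are substituted per step
    return area, time

for k in (1, 2, 4, 8, 16):
    print(k, lut_cost(k))             # area scales as k*A while time drops to (16/k)*T
```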

3.3 Combined Hardware and Software Techniques

In this case use is made of pipelining (hardware technique) in building the S-Box employing small substitution tables
constructed using LUTs (software technique) [18]. The basic idea of the approach is that the original large truth-table, or S-Box
in our case, of 8-variable function is broken down into a set of smaller size multiplexer-switched truth-table of say n-variable
functions using the Shannon expression. The smaller tables are then mapped into n-LUT of Xilinx FPGA. The approach shows
significant improvement in the overall throughput. Moreover, this approach is pipelinable where the size of the small LUTs is
technology dependent and the number of pipeline stages can vary accordingly. Fig. 5 shows an example of this technique using a
four-stage pipeline S-Box implementation. The first stage implements Shannon's decomposition over four 4-variable functions
representing the 256-bit S-Box. This stage uses sixteen 4-LUTs each having 64-bit string as an input. The results (S1, S2, S3, S4,
S5, S6, S7, S8, S9, S10, S11, S12, S13, S14, S15, and S16) are latched in the registers of the slices. The second stage of the pipeline uses
eight 4-LUTs each performs the selection between two Sum of Products (SOPs) using combinations of the variables X5, and X6.
The results (P1, P2, P3, P4, P5, P6, P7, and P8) are latched in the internal registers connected to the corresponding 4-LUTs. Similar
to the second stage, the third stage has inputs generated from the previous stage besides the two variables X7, and X8 which are
used for multiplexing. The results (Q1, Q2, Q3, and Q4) are latched in the internal registers connected to the corresponding
4-variable OR-function. Equation (2) shows a bottom-up mathematical expression of the function F(X):

F(X) = Q1 ∨ Q2 ∨ Q3 ∨ Q4        (2)

Fig. 5. Illustration of the combined hardware/software technique [18]
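The decomposition idea can also be expressed functionally; the sketch below illustrates the general Shannon expansion of an 8-variable truth table into 16 four-variable cofactors selected by the remaining inputs (an assumption-level illustration, not the exact Xilinx LUT mapping of [18]).

```python
# A minimal functional sketch of the Shannon-decomposition idea behind [18]: one output bit
# of the S-Box is an 8-variable function; its 256-entry truth table is reduced to 16 cofactors
# of the low four variables (stage 1), which are then multiplexed by X5, X6, X7 and X8.

def shannon_eval(truth_table, x):
    """truth_table: 256-entry list of 0/1 for one output bit; x: 8-bit input byte."""
    low, high = x & 0x0F, (x >> 4) & 0x0F
    s = [truth_table[(j << 4) | low] for j in range(16)]    # stage 1: 16 cofactors S1..S16
    p = [s[2 * j + (high & 1)] for j in range(8)]           # stage 2: select with X5 -> P1..P8
    q = [p[2 * j + ((high >> 1) & 1)] for j in range(4)]    # stage 3: select with X6 -> Q1..Q4
    return q[(high >> 2) & 3]                               # stage 4: select with X7 and X8
```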

4. The Related Work

The demand for fast, area- and power-efficient AES implementations is rapidly increasing and is becoming more crucial than the demand for smaller active devices on the chip. The traditional basic lookup table implementations are relatively fast and can
achieve better performance with some modifications [11]. One way to reduce power consumption is to divide the 256 bytes
S-Box into smaller tables with the aim being to reduce the switching activities. The use of smaller LUTs of different sizes
ranging from 16 to 128 bytes was examined in [11]. However, only the result for the 16 bytes size was reported (sub16-lut). The
solution proposed in [9] uses not only small LUTs, but also reduces execution time at the expense of doubling the chip area
required. Unlike the serial approach of [11], the approach in [9] explored the use of 32 parallel small S-Boxes of 16 bytes each. The first group of LUTs (16 LUTs) uses the left-most 4 bits of the State byte to distinguish among 16 tables (Table 0 to Table F), and the right-most 4 bits of the State byte are used as an address to obtain the value in a given table. This is done in parallel for all the bytes of a State. If the four left-most bits of two or more bytes are identical, a conflict occurs. A conflict is resolved by using the second group of 16 S-Box tables, indexed by the right-most 4 bits of the State byte (Table 0' to Table F'). In this case, the left-most 4 bits of the byte are used as the address to obtain the value in a given table. These steps are repeated until all
bytes are substituted. We illustrate the iterations of the technique in Fig. 6.

Fig. 6. Illustration of byte substitution algorithm proposed in [9]
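To make the conflict handling concrete, the following behavioural simulation (an assumed software model, not the hardware arbitration itself) schedules bytes over cycles exactly as described above; sbox denotes the 256-entry LUT of Table 1.

```python
# A minimal behavioural simulation of the conflict-driven scheduling of [9]: each pending byte
# first tries the group-1 table selected by its left nibble; on a conflict it falls back to the
# group-2 table selected by its right nibble; bytes that lose both arbitrations wait a cycle.

def li_substitute(state, sbox):
    out = [None] * len(state)
    pending, cycles = list(range(len(state))), 0
    while pending:
        cycles += 1
        busy1, busy2, deferred = set(), set(), []
        for i in pending:
            left, right = state[i] >> 4, state[i] & 0x0F
            if left not in busy1:          # group-1 table (indexed by the left nibble) is free
                busy1.add(left)
            elif right not in busy2:       # group-2 table (indexed by the right nibble) is free
                busy2.add(right)
            else:                          # both candidate tables are busy this cycle
                deferred.append(i)
                continue
            out[i] = sbox[state[i]]
        pending = deferred
    return out, cycles                     # cycles grows when many bytes share the same nibbles
```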


Based on the approach in [9], the average speedup is expected to be 8 times that of the classical serial implementation. On the other hand, the main drawback is the possibility of conflict in accessing a given table. In this case, the author suggested using a busy bit with each table such that if the busy bit is ON, the table is currently in use and cannot be accessed in the current cycle. This, of course, leads to a slowdown in the rate at which the S-Box substitution is performed. In the next Section, we propose an S-Box architecture which overcomes the conflict problem in addition to achieving higher throughput.

5. Proposed S-Box Architecture

The basic idea of the proposed scheme is to perform byte substitution of an AES state using a number of 2×2 tables that are
organized in groups. Each group has 16 small S-Boxes of size 2×2 organized in a bigger table of four rows and four columns.
The small tables are selected based on row and column values. The size of each group is 64 bytes, which is one fourth the regular
S-Box size. Each byte needs exactly four groups to cover all values of the original S-Box. If we consider the use of 16 groups,
then 4 bytes can be processed simultaneously. To illustrate the idea consider the example shown in Fig. 7 where a state of 16
bytes is considered. Suppose that the byte to be substituted has the hexadecimal value {b9}.

(a) Step 1 (GR): Group Selection

(b) Step 2 (RW): Row selection (c) Step 3 (CL): Column selection (d) Step 4 (LT): S-Box table lookup

Fig. 7. An illustrative example of the proposed S-Box architecture

The first step is group selection which is based on the content of the two left most bits of the byte processed, that is Group 3
(10) in this case. Since each group has 16 tables, we need to know the row and column values in order to choose one table out of
16. The row and column values are specified in the four middle bits of the byte processed, i.e., (11) and (10), respectively, in the
example shown in Fig. 7. The small 2×2 table is identified by the intersection of Row 3 and Col 2 in Group 3 (10). Finally, in the fourth step, a table lookup is performed using the right-most two bits of the byte, i.e., (01), to obtain the substitution value. The substitution value for byte {b9} is {56}. This value is the correct substitution and can be verified from Table 1. The
same idea applies to the rest of the bytes. The steps required in the proposed substitution method are summarized in the
algorithm shown in Fig. 8.
8

Algorithm Parallel Area Efficient Byte Substitution


For i = 0 to n-1 Do // n is the number of iterations required to substitute a state
For all bytes processed simultaneously Do
Step 1 (GR): Use the two left-most bits to select a group of four
Step 2 (RW): Use the next two left-most bits to select a row within a group
Step 3 (CL): Use the next two left-most bits to select a column
Step 4 (LT): Use the two right-most bits as an index to the identified 2×2 lookup table and obtain substitution value
End For
End For
End Algorithm

Fig. 8. The proposed algorithm
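The four steps map directly to a few bit-field selections; the sketch below (an illustration, assuming the 2×2 tables are filled from the standard S-Box of Table 1, denoted sbox) substitutes a single byte.

```python
# A minimal sketch of the proposed byte-substitution steps of Fig. 8, assuming the 2x2 tables
# hold the entries of the standard S-Box (Table 1); `sbox` is that 256-entry LUT.

def substitute_byte(byte, sbox):
    group = (byte >> 6) & 0x3      # Step 1 (GR): two left-most bits select one of four groups
    row   = (byte >> 4) & 0x3      # Step 2 (RW): next two bits select a row within the group
    col   = (byte >> 2) & 0x3      # Step 3 (CL): next two bits select a column
    index =  byte       & 0x3      # Step 4 (LT): two right-most bits index the 2x2 table
    table = [sbox[(group << 6) | (row << 4) | (col << 2) | k] for k in range(4)]
    return table[index]

# Example of Fig. 7: byte 0xb9 -> group (10), row (11), column (10), index (01) -> 0x56
```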

It should be noted that since steps 2 and 3 are independent, they can be performed in parallel. The algorithm uses groups of
small tables which is beneficial as it simplifies table indexing and in turn leads to reduced power consumption. More time saving
can be achieved with pipelining which is discussed in Section 9.

6. Proposed S-Box CMOS Implementation

This Section introduces the hardware implementation of the proposed S-Box algorithm. For each step in the algorithm listed in
Fig. 8, the hardware components required are discussed. The selection of groups, rows, and columns is implemented using
decoders. On the other hand, S-Box LUT is implemented using multiplexers. Fig. 9 shows a general block diagram of the
proposed implementation for a single byte substitution.

Fig. 9. Block diagram for a single byte substitution



In order to choose one group out of four, a 2-to-4 decoder is used. Two more 2-to-4 decoders are required to choose the row
and column within the selected group. We choose to build the decoders from two-input NAND (NAND2) gates to comply with standard CMOS technology. The NAND2 and the 2-to-4 decoder circuits are shown in Fig. 10(a) and Fig. 10(b), respectively. Once
decoding on the group, row, and column levels are done, the LUT to be used is known. The two right most bits of the byte
processed are used as an index to the 2×2 LUT. These two bits are connected to the select lines of a 4-to-1 multiplexer having the
table data as inputs and the S-Box substitution value as the output. All the 4-to-1 multiplexers implemented are constructed using
three 2-to-1 multiplexers for simplicity (Refer to Fig. 10(c)). The following is an explanation of three possible solutions to
implement 2-to-1 multiplexers:

6.1 First Solution: Gate-Level (NAND-NAND) CMOS Implementation

Each CMOS NAND2 gate consists of four transistors. Three NAND2 gates and an inverter can be used as shown in Fig. 10(d) to perform the operation of a 2-to-1 MUX. For our 2×2 LUT we need a 4-to-1 MUX, which is constructed using three 2-to-1 MUXes as shown in Fig. 10(c).

6.2 Second Solution: Pass Transistors Implementation

Pass transistor logic can be used as shown in Fig. 10(e) to implement a 2-to-1 multiplexer. The two inverters added at the
output are used to retain the logic level. The 4-to-1 multiplexer needed for S-Box LUT is constructed using three 2-to-1
multiplexers.

6.3 Third Solution: Transmission Gates Implementation

Transmission gates are simply switches which can act as two-to-one multiplexer as shown in Fig. 10(f). In this case the
number of transistors required is less than the former implementations discussed above. Similarly, 4-to-1 multiplexers are
constructed out of 2-to-1 ones.

(a) CMOS implementation of a two-input NAND gate (b) 2-to-4 Decoder Logic Circuit

(d) NAND-based realization of 2-to-1 MUX


(c) 4-to-1 MUX using three 2-to-1 MUX

(e) Pass transistor based realization of 2-to-1 MUX (f) Transmission gate based realization of 2-to-1 MUX

Fig. 10. The CMOS realization circuits
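As a behavioural cross-check on the datapath of Fig. 9, the decoder and multiplexer building blocks of Fig. 10 can be modelled at gate level; the sketch below captures logic only (not transistor counts or timing, and independent of which of the three MUX realizations is chosen).

```python
# A minimal gate-level model of the Fig. 10 building blocks: a 2-to-4 decoder built from
# NAND2 gates plus inverters, and a 4-to-1 MUX built from three 2-to-1 MUXes (Fig. 10(c)).

def nand2(a, b):
    return 1 - (a & b)

def inv(a):
    return 1 - a

def decoder_2to4(a1, a0):
    # One NAND2 per output followed by an inverter to give active-high outputs
    return [inv(nand2(inv(a1), inv(a0))),   # output 0: selected when a1a0 = 00
            inv(nand2(inv(a1), a0)),        # output 1: 01
            inv(nand2(a1, inv(a0))),        # output 2: 10
            inv(nand2(a1, a0))]             # output 3: 11

def mux2(a, b, s):
    # Behavioural 2-to-1 MUX; realizable as NAND-NAND, pass transistors, or transmission gates
    return b if s else a

def mux4(d0, d1, d2, d3, s1, s0):
    # Three 2-to-1 MUXes arranged as in Fig. 10(c)
    return mux2(mux2(d0, d1, s0), mux2(d2, d3, s0), s1)

print(decoder_2to4(1, 0))       # [0, 0, 1, 0]
print(mux4(0, 1, 0, 1, 1, 0))   # selects d2, i.e. 0
```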



7. Performance Analysis

Based on the proposed architecture and methodology discussed in the previous Sections, we can develop general expressions for the area and delay estimates as follows:

A = b × (c0 + c1 × mA)        (3)

T = i × (k0 + k1 × mT)        (4)

Equation (3) provides an expression for the area A in terms of the number of bytes processed simultaneously (b) and the area of the 2-to-1 multiplexer used (mA), which varies with the implementation. The constants c0 and c1 that appear in (3) represent fixed values related to the number and size of the decoders and multiplexers used per byte.
The delay T is expressed in (4) as a function of the number of iterations (i) required to complete the processing of one state and the delay of the 2-to-1 multiplexer used (mT). The constants k0 and k1 that appear in (4) are fixed values reflecting the delay of the decoder levels and the number of multiplexer levels in the critical path, respectively.
All results are based on using 0.35 µm CMOS technology of AMS Corp. [19]. Since the decoding part for our three proposed
solutions is the same, the associated delay and area are consequently identical. To clarify the results obtained, the case of
processing four bytes in parallel is considered here (without pipelining). This implies that 72 NAND2 gates are required, as
every byte needs three 2-to-4 decoders. Each 2-to-4 decoder requires six NAND2 gates assuming that inverters are implemented
using NAND2 gates. Area is measured using gate equivalents (GE), where one gate equivalent corresponds to one NAND2 gate.
According to the CMOS library used one NAND2 gate has 0.1 ns delay. Given that the decoding of the row and column are done
in parallel, the total decoding time is 1.6 ns (16×0.1) for one AES state.
Our three proposed solutions share the same idea of building 4-to-1 multiplexers out of three 2-to-1 ones. This means that we
need twelve 2-to-1 multiplexers to process four bytes of a state. Multiplying the size of one 2-to-1 multiplexer by 12 gives the
total space requirement for each solution for the multiplexing phase. Referring to the implementation in Fig. 10(c), we can see
that the critical path delay for the 4-to-1 multiplexer is twice the delay of a 2-to-1 multiplexer. Given that every four bytes of a
state are processed simultaneously, the total delay is eight times the delay of a 2-to-1 multiplexer.
For the proposed solution #1, the space requirement of each 2-to-1 multiplexer is 4 GEs as there are three NAND2 gates
besides an inverter, which is overestimated as one NAND2 gate (Fig. 10(d)). Thus, the total area is estimated as 48 GEs (12×4).
On the other hand, the critical path delay of a 2-to-1 multiplexer is three times the delay of a NAND2 which results in a total
delay of 2.4 ns (8×0.3).
For the proposed solution #2, two pass transistors which construct a 2-to-1 multiplexer can be estimated to consume an area
equivalent to ½ of a NAND2 area. We need an inverter for the selector, which has a size of 1 GE, in addition to two cascaded inverters at the output that retain the logic value and consume an area of 2 GEs. The sum is 3.5 GEs for each 2-
to-1 multiplexer yielding a total of 42 GEs (12×3.5). The delay of one 2-to-1 multiplexer is the delay of the transistors plus the
delay of the two inverters, that is 0.25 ns. Therefore, the total delay can be estimated as 2 ns (8×0.25).
For the proposed solution #3, each transmission gate consumes an area equivalent to ½ of a NAND2. The inverter for the
selector consumes 1 GE. The sum is 2 GEs per 2-to-1 multiplexer. This gives a total area of 24 GEs (12×2). The critical path
delay for the data of the 2-to-1 multiplexer is estimated as the delay of one two input NAND gate yielding a total of 0.8 ns
(8×0.1). Table 3 provides an estimation of the total delay and area values for all the proposed implementations summing up the
decoding and multiplexing phases.

Table 3
Delay and Area Estimation of the Proposed Solutions

                          First Proposed Solution             Second Proposed Solution            Third Proposed Solution
                          1-byte 2-byte 4-byte 8-byte 16-byte 1-byte 2-byte 4-byte 8-byte 16-byte 1-byte 2-byte 4-byte 8-byte 16-byte
Decoders Area (GE)        18     36     72     144    288     18     36     72     144    288     18     36     72     144    288
Multiplexers Area (GE)    12     24     48     96     192     10.5   21     42     84     168     6      12     24     48     96
Total Area (GE)           30     60     120    240    480     28.5   57     114    228    456     24     48     96     192    384
Decoders Delay (ns)       6.4    3.2    1.6    0.8    0.4     6.4    3.2    1.6    0.8    0.4     6.4    3.2    1.6    0.8    0.4
Multiplexers Delay (ns)   9.6    4.8    2.4    1.2    0.6     8      4      2      1      0.5     3.2    1.6    0.8    0.4    0.2
Total Delay (ns)          16     8      4      2      1       14.4   7.2    3.6    1.8    0.9     9.6    4.8    2.4    1.2    0.6
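The entries of Table 3 can be regenerated directly from equations (3) and (4); the sketch below uses the per-multiplexer figures derived above and the constants assumed in the analysis (c0 = 18 GE and c1 = 3 multiplexers per byte, k0 = 0.4 ns of decoding and k1 = 2 multiplexer levels per iteration).

```python
# A minimal sketch reproducing the Table 3 estimates from equations (3) and (4), using the
# 2-to-1 MUX figures of Section 7: solution 1 = 4 GE / 0.3 ns, solution 2 = 3.5 GE / 0.25 ns,
# solution 3 = 2 GE / 0.1 ns.

MUX = {"first": (4.0, 0.3), "second": (3.5, 0.25), "third": (2.0, 0.1)}   # (mA in GE, mT in ns)

def estimate(solution, bytes_in_parallel):
    mA, mT = MUX[solution]
    iterations = 16 // bytes_in_parallel          # iterations needed per 16-byte state
    area  = bytes_in_parallel * (18 + 3 * mA)     # equation (3)
    delay = iterations * (0.4 + 2 * mT)           # equation (4)
    return area, delay

print(estimate("third", 4))    # (96.0, 2.4), matching the Table 3 entry
```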

Given a 16-byte state, the flexibility of the architecture allows the number of bytes processed to vary from a single byte at a time to 16 bytes in parallel, in power-of-two increments, i.e., 1, 2, 4, 8, and 16. The delay and area estimates for all combinations are shown in
Table 4. It can be observed that the more bytes processed in parallel, the more area is needed and the less delay is required.

Table 4
Area and Delay Estimation

Number      Area (GE)                       Number of     Delay (ns)                      Area × Delay² (GE·ns²)
of Bytes    First     Second    Third       Iterations    First    Second    Third        First     Second     Third
16          480       456       384         1             1        0.9       0.6          480       369.36     138.24
8           240       228       192         2             2        1.8       1.2          960       738.72     276.48
4           120       114       96          4             4        3.6       2.4          1920      1477.44    552.96
2           60        57        48          8             8        7.2       4.8          3840      2954.88    1105.92
1           30        28.5      24          16            16       14.4      9.6          7680      5909.76    2211.84

8. Comparison

In this Section we compare our proposed solutions with the state-of-the-art substitution box implementations. Delay (T) and area (A) measures for the existing techniques are obtained from the survey conducted in [11]. The minimum values of area and delay are used to ensure a fair comparison. The comparison is conducted using the AT² values. More weight is given to the delay (T) in our comparison, assuming that chip area (A) is available in compact sizes and at affordable cost. The AT² values are shown in Table 5. These measures were normalized with respect to the 16-byte sub-lookup-table implementation, denoted sub16-lut, since it has the largest area-delay-squared product. Surprisingly, the simplest straightforward S-Box hardware lookup table implementation, denoted hw_lut, achieves the lowest AT² value among all previously reported techniques in the literature. Our proposed solutions are at least 29% better than the best previous technique in terms of AT².
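The normalization used in Table 5 is a straightforward computation; the sketch below shows it with sub16-lut (1912.5 GE, 4 ns) as the reference point.

```python
# A minimal sketch of the comparison metric: the area-delay-squared product AT^2, normalized
# with respect to the sub16-lut implementation (A = 1912.5 GE, T = 4 ns).

def normalized_at2(area_ge, delay_ns, ref_area=1912.5, ref_delay=4.0):
    return (area_ge * delay_ns ** 2) / (ref_area * ref_delay ** 2)

print(round(normalized_at2(1203.75, 3.0), 3))   # 0.354 for hw_lut
print(round(normalized_at2(384, 0.6), 3))       # 0.005 for the third solution, 16-byte case
```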

Table 5
Delay and Area Comparisons

Implementation                          Area (GE) A    Critical Path Delay (ns) T    Normalized AT²
sub16-lut [11]                          1912.5         4                             1.000
Satoh [7]                               360            9                             0.953
Wolkerstorfer [14]                      382.5          8                             0.800
Canright [10]                           281.25         8                             0.588
Bertoni [17]                            1608.75        3                             0.473
hw_lut [11]                             1203.75        3                             0.354
First Proposed Solution    1-byte       30             16                            0.251
                           2-byte       60             8                             0.125
                           4-byte       120            4                             0.063
                           8-byte       240            2                             0.031
                           16-byte      480            1                             0.016
Second Proposed Solution   1-byte       28.5           14.4                          0.193
                           2-byte       57             7.2                           0.097
                           4-byte       114            3.6                           0.048
                           8-byte       228            1.8                           0.024
                           16-byte      456            0.9                           0.012
Third Proposed Solution    1-byte       24             9.6                           0.072
                           2-byte       48             4.8                           0.036
                           4-byte       96             2.4                           0.018
                           8-byte       192            1.2                           0.009
                           16-byte      384            0.6                           0.005

The results shown in Table 5 are further illustrated in Fig. 11 and Fig. 12. It should be noted that the AT² measures shown in these figures are normalized with respect to sub16-lut since it has the largest area-delay-squared product. In Fig. 11 we show a comparison among existing techniques. Fig. 12 shows a comparison between our proposed solutions and the best technique found in Fig. 11, i.e. the technique using hw_lut. It should also be noted that all these comparisons are conducted without pipelining. Pipelinability of our proposed architecture is discussed in the next Section.

Fig. 11. Normalized area-delay-squared product (AT²) comparison of the existing implementations (Satoh, Wolkerstorfer, Canright, hw_lut, sub16-lut, and Bertoni)

Fig. 12. Our implementations compared with the best of the other implementations (hw_lut) in terms of the normalized AT²

When comparing our proposed architecture to the architecture reported by Li in [9], we observe that our solution provides a constant delay, unlike Li's technique where the delay varies. Our approach is data independent, whereas the delay of Li's approach is heavily dependent on the data being encrypted. It performs best when all the bytes of the State are different and worst when all the bytes of the State are identical. Unfortunately, the work by Li did not provide specific values for area and delay. However,
we observe that in the best possible case Li's approach can be 16 times faster than the traditional serial approach, while it can be
only two times faster in the worst case. For the purpose of comparison we can approximate the delay and area values for Li’s
approach assuming the use of decoders and multiplexers created with the same CMOS technology used in our work. To decode
one byte, two 4-to-16 decoders are required for the left and right halves of the byte (refer to Fig. 6). A 4-to-16 decoder can be implemented using five 2-to-4 decoders. A 2-to-4 decoder requires 6 GE and takes 0.2 ns. Therefore, a 16-byte state would require an area of 960 GE (16×5×6×2) and a delay of 0.4 ns (0.2×2). Since Li used two LUTs of 16 elements each, two 16-to-1 multiplexers are required. A 16-to-1 multiplexer can be implemented using five 4-to-1 multiplexers. Similar to our solution, the
4-to-1 multiplexer is made of three 2-to-1 multiplexers each consuming an area of 4 GE and requires a delay of 0.3 ns. Thus, a
16-byte state would need 120 GE (2×5×3×4) and takes a time delay of 2.4 ns (0.3×2×2×2). For the best case that completes byte
substitution in one cycle, summing up decoding and multiplexing values, the delay is at least 2.8 ns and the area is at least 1080
GE. These values are to be compared to 480 GE and 1.0 ns as required by our 16-byte solution (See Table 5). This indicates that
our 16-byte solution outperforms Li's solution architecture in terms of area, delay, and hence the AT² value.
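The back-of-the-envelope estimate above can be reproduced as follows; the per-component figures (2-to-4 decoder = 6 GE and 0.2 ns, first-solution 2-to-1 multiplexer = 4 GE and 0.3 ns) are the same ones used in Section 7, and the factor breakdown mirrors the arithmetic given in the text.

```python
# A minimal sketch of the best-case area/delay estimate for Li's architecture [9].

decoder_area  = 16 * 2 * 5 * 6     # 16 bytes x two 4-to-16 decoders x five 2-to-4 decoders x 6 GE
decoder_delay = 2 * 0.2            # two 2-to-4 decoder levels on the critical path
mux_area      = 2 * 5 * 3 * 4      # two 16-to-1 MUXes x five 4-to-1 MUXes x three 2-to-1 MUXes x 4 GE
mux_delay     = 0.3 * 2 * 2 * 2    # cascaded 2-to-1 MUX levels, mirroring the factors in the text

print(decoder_area + mux_area)     # 1080 GE
print(decoder_delay + mux_delay)   # 2.8 ns (best case, no conflicts)
```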

9. Pipelinability of the Proposed S-Box Architecture

The algorithm steps shown in Fig. 8 can be optimized through pipelining. Each step can represent a stage in the pipeline
architecture. Further speedup can be achieved by merging the second and the third steps of the algorithm since they are totally
independent in terms of data and hardware resources required. The optimized pipeline architecture is shown in Fig. 13.

Fig. 13. Optimized pipelined architecture for byte substitution

The pipeline performance can be measured by identifying the number of stages in the pipeline (m), the number of iterations needed to complete the processing of one state substitution (n), and the time of the slowest stage in the pipeline (t). Pipelining speedup, throughput, and efficiency can be computed as discussed in [20] using the following equations:

Speedup S(n) = (n × m) / (m + n − 1)        (5)

Throughput U(n) = n / ((m + n − 1) × t)        (6)

Efficiency E(n) = n / (m + n − 1)        (7)

For all the proposed solutions the value of m is fixed, since there are three stages in the pipeline as illustrated in Fig. 13. In contrast, the values of n and t vary depending on the solution selected. The pipeline performance of our proposed solutions based on the criteria given in equations (5), (6), and (7) is specified in Table 6.
The benefits of pipelining byte substitution can be clearly noticed as the number of bytes processed per iteration decreases.
The optimal speedup, and hence the highest efficiency, is achieved when the state is processed one byte at a time. In this particular
case, the pipelined solution is 8/3 ≈ 2.67 times faster than the non-pipelined one. This is significantly fast as one state completes
its byte substitution in 6 ns rather than 16 ns for the first solution 1-byte case.

Regardless of the solution selected, the intermediate cases (i.e., when 2-bytes, 4-bytes, or 8-bytes are processed per pipeline
stage) represent a compromise between delay and area. Those intermediate cases take advantage of both pipelining and
parallelism to reduce delay, while consuming reasonable hardware resources.

Table 6
Pipelined Architecture Evaluation

                                       First Proposed Solution             Second Proposed Solution            Third Proposed Solution
Comparison Criterion                   1-byte 2-byte 4-byte 8-byte 16-byte 1-byte 2-byte 4-byte 8-byte 16-byte 1-byte 2-byte 4-byte 8-byte 16-byte
Number of Iterations                   16     8      4      2      1       16     8      4      2      1       16     8      4      2      1
Stage 1: Group Decoding Delay (ns)     3.2    1.6    0.8    0.4    0.2     3.2    1.6    0.8    0.4    0.2     3.2    1.6    0.8    0.4    0.2
Stage 2: Row/Col Decoding Delay (ns)   3.2    1.6    0.8    0.4    0.2     3.2    1.6    0.8    0.4    0.2     3.2    1.6    0.8    0.4    0.2
Stage 3: Multiplexers Delay (ns)       9.6    4.8    2.4    1.2    0.6     8      4      2      1      0.5     3.2    1.6    0.8    0.4    0.2
Maximum Stage Delay Normalized         48     24     12     6      3       40     20     10     5      2.5     16     8      4      2      1
Speedup                                2.67   2.4    2      1.5    1       2.67   2.4    2      1.5    1       2.67   2.4    2      1.5    1
Throughput                             0.02   0.04   0.07   0.15   0.30    0.02   0.04   0.09   0.18   0.36    0.06   0.11   0.22   0.44   0.89
Efficiency                             0.89   0.8    0.67   0.5    0.33    0.89   0.8    0.67   0.5    0.33    0.89   0.8    0.67   0.5    0.33
Total Delay (ns)                       16     8      4      2      1       14.4   7.2    3.6    1.8    0.9     9.6    4.8    2.4    1.2    0.6
Total Delay with Pipelining (ns)       5.99   3.33   2      1.33   1       5.39   3      1.8    1.2    0.9     3.60   2      1.2    0.8    0.6
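To make the use of equations (5)-(7) concrete, the sketch below evaluates them for the first solution, 1-byte case; the resulting speedup and efficiency match the corresponding Table 6 entries.

```python
# A minimal sketch of the pipeline metrics of equations (5)-(7) from [20]: m pipeline stages,
# n iterations per state, and t the slowest stage delay per iteration.

def pipeline_metrics(n, t, m=3):
    speedup    = (n * m) / (m + n - 1)      # equation (5)
    throughput = n / ((m + n - 1) * t)      # equation (6): completed iterations per ns
    efficiency = n / (m + n - 1)            # equation (7)
    return speedup, throughput, efficiency

s, u, e = pipeline_metrics(n=16, t=0.6)     # first solution, 1-byte case (slowest stage: MUX)
print(round(s, 2), round(e, 2))             # 2.67 and 0.89, as in Table 6
```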

10. Conclusion
In this work, a literature survey of the state-of-the-art AES byte substitution techniques was conducted. A classification of these techniques into three main categories, namely hardware, software, and combined hardware/software techniques, was introduced. A new S-Box architecture was proposed. The new architecture divides the S-Box into groups, each containing a number of two-by-two LUTs. A new byte substitution algorithm was proposed, specifying the steps required to obtain substitution values using the proposed architecture. Three different hardware implementation solutions were examined. These three solutions were analyzed in terms of delay and area and were also compared to other well-known previously reported techniques. The analysis and comparison results show that all of the proposed solutions significantly outperform the existing ones. In particular, the third proposed solution, which is based on the use of transmission-gate multiplexers, achieves the lowest AT² value among all implementations discussed. Our proposed S-Box technique is based on the use of small S-Boxes, which leads to simplified implementations. The proposed S-Box architecture is modular, compact, and efficient with negligible
additional space requirements. In addition, we have also shown how to pipeline our S-Box architecture in order to maximize
throughput and reduce delay.

Acknowledgment

The authors would like to acknowledge the financial support received from Kuwait University in the form of the funded
research project WI 04/10.

Biography
Mostafa Abd-El-Barr received his PhD degree in Computer Engineering from the University of Toronto, Canada, in 1986. He is with the Department of Information Science, College of Computing Sciences and Engineering (CCSE), Kuwait University since 2003, and an Adjunct Professor with the ECE Department, University of Victoria (UVic), BC, Canada. His research interests include Design and Analysis of Reliable & Fault-Tolerant Computer Systems, Computer Networks Optimization, Parallel Processing/Algorithms, Information Security, Multiple-Valued Logic (MVL) Design & Analysis, VLSI System Design, and Digital Systems Testing. He is the author and/or co-author of more than 170 scientific papers published in journals and conference proceedings/symposia. He has three books published (two are also translated to the Chinese language). He is also an official IEEE/EAC/ABET evaluator. Professor Abd-El-Barr is a Senior IEEE Member and a member of the International Association of Engineers (IAENG). He is a Senior International Associate Editor of the International Journal of Information Technology & Web Engineering and a member of the Editorial Board of the International Journal of Computer Science and Security (IJCSS). He is a Certified Professional Engineer in Ontario, Canada.

Altaf Y. Al-Farhan received her Bachelor of Science and Master of Science degrees in Computer Engineering from Kuwait University, Kuwait, in 2004 and 2007, respectively. She worked at the State Audit Bureau of Kuwait as a Systems Engineer from 2004 to 2008. She is currently a Teaching Assistant at the Information Science Department, College of Computing Sciences and Engineering, Kuwait University, Kuwait, since 2008. Her research interests include Information Security, Multiple-Valued Logic, Intelligent Systems, and Human Computer Interaction.

References

[1] J. Daemen, V. Rijmen, AES proposal: Rijndael, First Advanced Encryption Standard (AES), 1998. Available: http://www.nist.gov/aes.
[2] H. Samiee, R. E. Atani, H. Amindavar, A novel area-throughput optimized architecture for the AES algorithm, Int. Conf. on Electronic Devices, Systems
and Applications, ICEDSA, 2011, pp. 29-32.
[3] T. Rahman, S. Pan, Q. Zhang, Design of a High Throughput 128-bit AES (Rijndael Block Cipher), Proc. of the Int. Multi-Conf. of Engineers and Computer
Scientists, 2010, vol. 2, pp. 1-5.
[4] A. Hodjat, I. Verbauwhede, Area-throughput trade-offs for fully pipelined 30 to 70 Gbits/s AES processors, IEEE Transactions on Computers, vol. 55
(2006), no. 4, pp. 366-372.
[5] C. E. Shannon, Communication theory of secrecy systems, Bell System Technical Journal, vol. 28 (1949), no. 4, pp. 656-715.
[6] D. Chen, G. Shou, Y. Hu, and Z. Guo, Efficient architecture and implementations of AES, 3rd Int. Conf. on Advanced Computer Theory and Engineering,
ICACTE, 2010, vol. 6, pp. 295-298.
[7] A. Satoh, S. Morioka, K. Takano, S. Munetoh, A compact Rijndael hardware architecture with S-box optimization, Advances in
Cryptology, ASIACRYPT, 2001, LNCS vol. 2248, pp. 239–254.
[8] O. Maslennikow, M. Rajewska, R. Berezowski, Hardware Realization of the AES Algorithm S-Block Functions in the Current-Mode Gate Technology,
Proc. of the 9th Int. Experience of Designing and Application of CAD Systems in Microelectronics, CADSM, 2007, IEEE Catalog Number 07EX1594, pp.
211-217.
[9] H. Li, A parallel S-box architecture for AES byte substitution, Int. Conf. on Communications, Circuits and Systems, ICCCAS, 2004, vol. 1, pp. 1-3.
[10] D. Canright, A very compact S-Box for AES, Cryptographic Hardware and Embedded Systems, CHES, Springer Verlag, 2005, vol. 3659 of Lecture Notes
in Computer Science, pp. 441–455.
[11] S. Tillich, M. Feldhofer, J. Großschädl, Area, delay, and power characteristics of standard-cell implementations of the AES S-box, Embedded Computer
Systems: Architectures, Modeling, and Simulation, Springer Verlag, 2006, vol. 4017 of Lecture Notes in Computer Science, pp. 457-466.
[12] S. Nikova, V. Rijmen, M. Schläffer, Using Normal Bases for Compact Hardware Implementations of the AES S-Box, Security and Cryptography for Networks, Springer Verlag, 2008, vol. 5229 of Lecture Notes in Computer Science, pp. 236-245.
[13] V. Rijmen, Efficient Implementation of the Rijndael S-box, [online]. Available: http://www.esat.kuleuven.ac.be/rijmen/rijndael/sbox.pdf
[14] J. Wolkerstorfer, E. Oswald, M. Lamberger, An ASIC implementation of the AES S-Boxes, Topics in Cryptology, CT-RSA, Springer Verlag, 2002, pp. 67-78.
[15] S. Morioka, A. Satoh, A 10-Gbps full-AES crypto design with a twisted BDD S-box architecture, IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, 2004, 12(7), pp. 686-691.
[16] R. Hobson, S. Wakelin, An area-efficient high-speed AES S-Box method, Proc. 5th Int. Workshop on System-on-Chip for Real-Time Applications, 2005,
pp. 376-379.
[17] G. Bertoni, M. Macchetti, L. Negri, P. Fragneto, Power-efficient ASIC Synthesis of Cryptographic S-boxes, Proc. of the 14th ACM Great Lakes
symposium on VLSI, GLSVLSI, ACM Press, 2004, pp. 277-281.
[18] V. Tomashau, Pipeline AES S-box Implementation Starting with Substitution Table, LC Engineers Inc., USA, [Online]. Available: https://www.design-reuse.com/articles/30375/pipeline-aes-s-box-implementation-starting-with-substitution-table.html
[19] AMS Corp., Specialty Process 0.35µm CMOS Application Notes, Austria, [Online]. Available: http://www.ams.com/eng/Products/Full-Service-Foundry/Process-Technology/CMOS/0.35-m-CMOS-Application-Notes
[20] M. Abd-El-Barr, H. El-Rewini, Fundamentals of computer organization and architecture, Wiley, Hoboken, N.J, 2005.
