
Software Implementation of Cryptographic Schemes

Prof. Dr.-Ing. Christof Paar


Chair for Embedded Security
Ruhr University Bochum
Lecture Notes, Version 1.3 June 2014
Table of Contents

1 Introduction  1

2 Efficient Modular Arithmetic  3
  2.1 Special Primes in Cryptography  3
    2.1.1 Introduction  3
    2.1.2 Modulo Reduction with Mersenne Primes  5
    2.1.3 Modulo Reduction with Pseudo Mersenne Primes  8
    2.1.4 Modulo Reduction with Generalized Mersenne Primes  10
  2.2 An Accelerated Version of the Extended Euclidean Algorithm  13
    2.2.1 Greatest Common Divisor  13
    2.2.2 The Euclidean Algorithm  13
    2.2.3 The Extended Euclidean Algorithm  16
    2.2.4 The Binary Extended Euclidean Algorithm  19

3 Block Cipher Implementation  29
  3.1 Software Implementation of Block Ciphers  29
    3.1.1 Structure of SP-Networks  30
    3.1.2 Table Look-Up (TLU) Based Permutations  31
    3.1.3 S-Box Implementation via Table Look-Up  34
  3.2 Bit Slicing  35
  3.3 The Block Cipher PRESENT  39

4 Finite Field Arithmetic in Cryptography  43
  4.1 Introduction  43
  4.2 Some Mathematics: A Very Brief Introduction to Finite Fields  45
    4.2.1 Definition of a Finite Field  45
    4.2.2 Element Representation in Extension Fields GF(p^m)  48
  4.3 Addition and Subtraction in GF(p^m)  49
  4.4 Multiplication in GF(p^m)  50
  4.5 Inversion in Finite Fields  53
    4.5.1 Introduction  53
    4.5.2 Example Modular Inversions  55

5 Arithmetic in Galois Fields GF(2^m) in Software  59
  5.1 Field Element Representation  59
  5.2 Field Addition in GF(2^m)  60
  5.3 Field Multiplication in GF(2^m)  60
    5.3.1 Shift-and-Add Multiplication  61
    5.3.2 Right-to-Left Comb Method  64
    5.3.3 Left-to-Right Comb Method  66
    5.3.4 Window-Based Multiplication  67
    5.3.5 Modular Reduction  70
    5.3.6 Squaring  72

6 Arithmetic in Galois Fields GF(2^m) in Hardware  73
  6.1 Addition  73
  6.2 Multiplication  75
    6.2.1 Bit-Serial Multiplication  76
    6.2.2 Digit-Serial Multiplication  82

References  90
Index  92
Chapter 1
Introduction
In the course Implementation of Cryptographic Schemes we looked at algorithms and methods for speeding up exponentiation, multiplication and modulo reduction with long numbers, which is of great importance for implementing public-key algorithms. We also introduced various side-channel and fault-injection attacks.
In the course at hand, Software Implementation of Cryptographic Schemes, we will address more advanced topics from the field of cryptographic implementations. First, we introduce fast algorithms for modulo reduction with special primes and for fast inversion. Next, fast and secure software implementation techniques with a focus on block ciphers are covered. Then, for the cryptanalysis of ciphers faster than brute force, Time-Memory Trade-Off (TMTO) schemes are introduced. The last part deals with fast software (and hardware) algorithms for arithmetic in Galois fields of the form GF(2^m). These finite fields are of special importance when implementing elliptic curve cryptography. Some of the algorithms detailed in this course may seem exotic, i.e., some of the methods are not widely known. However, we carefully selected algorithms and techniques which are very useful for implementing modern applied cryptography!
Please keep in mind that the guiding theme of the course is implementation. We believe that a good way of learning implementation techniques is, well, to implement the techniques introduced in the course. Hence it is highly recommended that you take your preferred programming language and try to implement the treated algorithms by yourself. This will help you in understanding and memorizing the presented topics (and help you pass the exam in case you take this course as part of your university curriculum). Of course, coding the algorithms will also help to improve your programming skills. It is good to keep the quote of the famous algorithm researcher Donald E. Knuth in mind:
"You don't understand anything unless you try to implement it."[1]

The course will also help you to better grasp selected topics in number theory and abstract algebra.

[1] Freely translated from the 05/2002 issue of the German computer magazine c't
If you have feedback or found a mistake in this script, please send an email to skripte@crypto.rub.de with the tag "[IKV]" in the subject. Thanks.
We hope you enjoy reading and implementing cryptographic schemes with the techniques introduced in our course!
Chapter 2
Efficient Modular Arithmetic
In this chapter we introduce two techniques which are of relevance for implementing public-key cryptography. First, we show how prime numbers of special form can lead to an enormous speed-up when computing a modular reduction. In practice, such primes are often used for elliptic curve cryptography. The second algorithm has much broader applicability: a fast version of the extended Euclidean algorithm (EEA), namely the binary EEA, is presented. We recall that the EEA is the standard method for finding the modular inverse of an integer and is thus used in many public-key libraries. The reader should be familiar with modular arithmetic as well as the EEA. If you do not have the background, [PP10, Section 1.4] provides an introduction to modular arithmetic as needed here, and [PP10, Section 6.3] introduces the extended Euclidean algorithm.
2.1 Special Primes in Cryptography
Arithmetic modulo a prime is widely used in public-key cryptography. There are three families of asymmetric algorithms of practical relevance: RSA, discrete logarithm and elliptic curve schemes. Often, asymmetric schemes are based on arithmetic modulo a prime. When executing the algorithms, computing the modular reduction is a major computational burden which may consume around 50% of the total execution time of the scheme. In this section we learn about primes of a special form with which modular reduction in software becomes considerably more efficient.
2.1.1 Introduction
Many asymmetric cryptographic schemes rely on prime fields Z_p, in which the most costly computation is usually the modular multiplication C = A · B mod p. This holds especially for elliptic curve and discrete logarithm schemes such as the Diffie-Hellman key exchange or DSA. For general primes, the complexity of one Z_p multiplication consists of the cost of one integer multiplication and one modulo reduction, with an approximate ratio of 50%:

cost of a modular multiplication = cost(int MUL) + cost(mod RED)
complexity: 50 : 50

O(mod MUL) ≈ c_1 · l^2 + c_2 · l^2

where l = ⌊log_2 p⌋ + 1 is the bit length of the modulus p and c_1, c_2 are constants depending on the implementation. Thus, generally a modulo reduction (e.g., using Montgomery arithmetic) of an operand with a length of 2l bit is roughly as costly as one integer multiplication of two operands with a length of l bit each.
However, if special primes are used, the modulo reduction can have much lower computational demands. In the following two examples we see that the first one is quite challenging to do by hand even though the numbers are small (867 and 13), whereas the second one is easy even though a very large number is involved:

Example 2.1

867 mod 13 ≡ 9    (2.1)
833179342206789811245 mod 100 ≡ 45    (2.2)

We notice that it is a lot easier for us to perform the modular reduction with a modulus that is in some way related to the decimal number system (in the 2nd example above it is the modulus 100). Similarly, a PC can handle operands fitting its internal word size, e.g., 32 bit, more efficiently. On the basis of this idea we now investigate some types of special primes with good binary arithmetic properties, in order to reduce the complexity of a modulo reduction. We differentiate between the following types of special primes.
1. Mersenne and Mersenne-like primes
   p = 2^n − 1 (Mersenne)
   p = 2^n + 1 (Mersenne-like)

2. Pseudo Mersenne primes
   p = 2^n − a, 0 < a < 2^{n/2}

3. Generalized Mersenne primes
   p = 2^{d_0·k} ± 2^{d_1·k} ± 2^{d_2·k} ± ... ± 1

A Mersenne prime is an n-bit integer with a binary representation in which all bits are 1, while a Mersenne-like prime is an (n+1)-bit integer which begins and ends with a 1, while all other bits are set to 0. The binary representation of generalized Mersenne primes and pseudo Mersenne primes depends on the choice of the parameters, i.e., d_i, n, a and k.
Example 2.2 The table below shows some integers (which are not necessarily prime) in order to illustrate the various binary representations that we consider in the remainder of this chapter.

Number                                    Binary representation
Mersenne (2^16 − 1):                      1111 1111 1111 1111_2
Mersenne-like (2^16 + 1):                 1 0000 0000 0000 0001_2
Pseudo MP (2^16 + 89):                    1 0000 0000 0101 1001_2
Pseudo MP (2^16 − 89):                    1111 1111 1010 0111_2
Generalized MP (2^16 − 2^8 − 2^4 − 1):    1111 1110 1110 1111_2
Generalized MP (2^16 − 2^8 + 2^4 − 1):    1111 1111 0000 1111_2
Generalized MP (2^16 − 2^8 + 2^4 + 1):    1111 1111 0001 0001_2
2.1.2 Modulo Reduction with Mersenne Primes

We start with the definition:

Definition 2.1.1 Mersenne Primes
Primes of the form p = 2^n − 1 are called Mersenne primes.

Mersenne primes are quite rare; in particular, most integers of the form 2^n − 1 are not prime. Table 2.1 shows examples of Mersenne primes. Today only 48 Mersenne primes are known. The largest known one, 2^57885161 − 1, was found in January 2013 and is conjectured to be the 48th Mersenne prime. Because of the complexity of computing and proving primality, it is possible that Mersenne primes smaller than the 48th one have not been found yet.
The following theorem states a necessary condition for the existence of Mersenne primes.

Theorem 2.1.1 If p = 2^n − 1 is prime, then n is also prime.
Table 2.1: Some Mersenne primes

Mersenne prime no.    n           2^n − 1
1                     2           3
2                     3           7
3                     5           31
4                     7           127
5                     13          8191
...                   ...         ...
48                    57885161    ...
Proof 2.1 of Theorem 2.1.1
We prove the theorem by contradiction. Assume that 2^n − 1 is prime and that n is composite, i.e., n = a · b with a, b ∈ N and a, b ≠ 1. The number 2^{ab} − 1 can now be written as the product of two non-trivial factors:

2^{ab} − 1 = (2^a − 1) · (1 + 2^a + 2^{2a} + ... + 2^{(b−1)a})

Hence, 2^n − 1 is not prime if n is composite.
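The factorization used in the proof is easy to verify numerically. The following sketch (our own illustration, not part of the original notes) builds both factors of 2^{ab} − 1 for a composite exponent:

```python
# Check the factorization from Proof 2.1: for composite n = a*b,
# 2^n - 1 = (2^a - 1) * (1 + 2^a + 2^(2a) + ... + 2^((b-1)a)).

def mersenne_factor(a: int, b: int) -> tuple[int, int]:
    """Return the two non-trivial factors of 2^(a*b) - 1."""
    f1 = (1 << a) - 1                          # 2^a - 1
    f2 = sum(1 << (i * a) for i in range(b))   # 1 + 2^a + ... + 2^((b-1)a)
    return f1, f2

# n = 6 = 2*3: 2^6 - 1 = 63 = 3 * 21, so 63 is not prime
f1, f2 = mersenne_factor(2, 3)
assert (f1, f2) == (3, 21) and f1 * f2 == (1 << 6) - 1
```

This also illustrates why, in Table 2.1, only prime values of n appear as exponents.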
Reduction modulo a Mersenne prime. Next, we want to compute the modular multiplication C = A · B mod p, where A, B, C ∈ Z_p and p = 2^n − 1. Note that in general A and B have a maximal bit length of n, thus their product C' = A · B has a maximal bit length of 2n and may need to be reduced modulo p to obtain C = C' mod p. For the computation, C' is split up into a higher part C_h and a lower part C_l, so that C_h begins at bit position n. Note that both C_h and C_l are n-bit words.

C' = A · B
C' = C_h · 2^n + C_l    where 0 ≤ C_h, C_l < 2^n

2^n = Q · p + R
2^n = [1] · (2^n − 1) + [1]
2^n ≡ 1 mod p

C = C_h · 2^n + C_l mod p
C ≡ C_h · 1 + C_l mod p

The key idea of the method is that we reduce 2^n modulo the prime. Due to the special structure of the prime (2^n − 1), a reduction of 2^n is equivalent to a multiplication by 1.
Therefore we can compute C with a simple addition of C_h and C_l. There is a chance that a carry bit occurs during the addition of C_h and C_l. In this case, the result is greater than p but smaller than 2·p. Hence, subtracting p suffices to perform the final modulo reduction and obtain C = A · B mod p. These findings lead to the following algorithm:
Algorithm 2.1 Mersenne Modulo Reduction
Input: C' and a modulus p = 2^n − 1
Output: C ≡ C' mod p, i.e., 0 ≤ C < 2^n − 1
        C' = C_h · 2^n + C_l
(i)  C = C_h + C_l
     IF C ≥ 2^n
(ii)     C = C − p
     RETURN(C)
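As a sketch, Algorithm 2.1 can be written in a few lines of Python; the function name is ours, and Python's arbitrary-precision integers stand in for the long-number arithmetic:

```python
# Sketch of Algorithm 2.1 (Mersenne modulo reduction) for p = 2^n - 1.

def mersenne_reduce(c: int, n: int) -> int:
    """Reduce c (up to 2n bits) modulo p = 2^n - 1.

    As in Algorithm 2.1, the result lies in [0, 2^n); it may equal
    p itself, which is congruent to 0.
    """
    p = (1 << n) - 1
    c_h = c >> n            # higher part, starting at bit position n
    c_l = c & p             # lower n bits
    c = c_h + c_l           # 2^n ≡ 1 (mod p), so C ≡ C_h + C_l
    if c >= (1 << n):       # step (ii): a carry occurred, subtract p once
        c -= p
    return c

# Example 2.3: p = 7, A = 5, B = 6, C' = 30
assert mersenne_reduce(30, 3) == 2
```

Note that only shifts, a mask and at most one subtraction are needed, matching the "one or two additions" complexity discussed below.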
In conclusion, we observe that a reduction modulo a Mersenne prime is achieved with one or two additions. This complexity is in practice negligible compared to the complexity of an integer multiplication. Let's look at an example with small numbers, using the Mersenne prime p = 7, which is the 2nd Mersenne prime.
Example 2.3 Let p = 2^3 − 1 = 7, A = 5 and B = 6.

C' = A · B = 30 = 011_2 110_2   (C_h = 3, C_l = 6)

C' = C_h · 2^n + C_l (mod p)
C' = 3 · 2^3 + 6 mod 7
C' ≡ 3 · 1 + 6 mod 7
C' = 9

Since C' > p:

C ≡ 9 − 7 = 2 mod 7

We notice that in the last line the conditional subtraction C' − p is performed.
Remarks:
1. Numbers of the form 2^n + 1 have the same low modulo reduction complexity. It is easy to show that the modular reduction is achieved by computing C ≡ C_l − C_h. An example of such a prime number is p = 2^8 + 1 = 257.
2. The modulo reduction trick shown here can be performed for all numbers 2^n ± 1, even if they are not prime. For instance, the IDEA block cipher requires fast reduction modulo 2^16 + 1.
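For the 2^n + 1 case of Remark 1, a minimal sketch (our own function name) looks as follows; here 2^n ≡ −1 (mod p), so the higher part is subtracted rather than added:

```python
# Sketch of the reduction trick for moduli of the form p = 2^n + 1:
# 2^n ≡ -1 (mod p), hence C ≡ C_l - C_h (mod p).

def mersenne_like_reduce(c: int, n: int) -> int:
    """Reduce c (up to 2n bits) modulo p = 2^n + 1."""
    p = (1 << n) + 1
    c_h = c >> n
    c_l = c & ((1 << n) - 1)
    r = c_l - c_h        # may be negative ...
    if r < 0:
        r += p           # ... then one addition of p brings it into [0, p)
    return r

# p = 257 = 2^8 + 1, as in the remark
assert mersenne_like_reduce(60000, 8) == 60000 % 257
```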
2.1.3 Modulo Reduction with Pseudo Mersenne Primes
A generalization of Mersenne primes are pseudo Mersenne primes which are dened as
follows.
Denition 2.1.2 Pseudo Mersenne Prime
A prime p of the form p = 2
n
a, 0 < a < 2
n/2
, is a pseudo Mersenne
prime.
They were introduced for the use in cryptography, especially in the context of elliptic
curves [BP98]. For a = 1 we have a Mersenne or Mersenne-like number.
We will now show how a number with a bit length of 2n can be reduced modulo a
pseudo Mersenne prime. Without loss of generality we will consider a modulus of the
form p = 2
n
a.
Modulo reduction with p = 2^n − a

Let C' = A · B with 0 ≤ C' ≤ 2^{2n} − 1. It holds that

2^n ≡ a mod p

since 2^n = [1] · (2^n − a) + [a]. We obtain

C' = C_h · 2^n + C_l

where 0 ≤ C_h, C_l ≤ 2^n − 1. We now reduce the term 2^n modulo (2^n − a):

C'' ≡ C_h · a + C_l mod p    (2.3)

When computing C_h · a + C_l we obtain a new long integer C'' with a length of about 3n/2 bit that can again be split into a higher part C''_h and a lower part C''_l:

C'' = C''_h · 2^n + C''_l mod p

where 0 ≤ C''_l < 2^n and 0 ≤ C''_h < 2^{n/2+1}. We can reduce C'' further by applying the trick one more time:

C''' ≡ C''_h · a + C''_l mod p    (2.4)

Since both C''_h and a have a length of about n/2, the result C''' has roughly a length of n bit. However, due to carries, C''' may still need to be reduced. This can be achieved by subtracting the modulus: C''' − p.
Figure 2.1: Graphical representation of the reduction modulo a pseudo Mersenne prime
Figure 2.1 shows a graphical representation of the algorithm. After the first iteration of the algorithm, the result C_h · a + C_l has a size of approximately 2^{n+log_2(a)}, which is greater than p. We now apply the reduction trick a second time. The final result after the second iteration, C''_h · a + C''_l, can still be greater than p; hence the modulus p may have to be subtracted once again.
The main computational complexity of the modulo reduction lies in the multiplications by a in Equations (2.3) and (2.4). Thus, roughly speaking, the modulo reduction of an operand C' with a bit length of 2n is achieved by means of two multiplications of an n-bit operand by a and a few additions. Since the two multiplications are often faster than one division, the algorithm can be an advantage over standard methods for modulo reduction. Let's look at an example with small numbers.
Example 2.4 Let a = 5 and p = 2^8 − 5 = 251. We note that 2^8 ≡ 5 mod p. The operands we consider are A = 213 and B = 197. The algorithm proceeds as follows.

(0)      C'   = A · B = 41961
              = 163 · 2^8 + 233
              = 10100011 11101001_2
MUL (1)  C''  = 163 · 5 + 233 = 1048 ≡ C' mod 251
              = 100 00011000_2
         C''  = 4 · 2^8 + 24
MUL (2)  C''' = 4 · 5 + 24 = 44 ≡ C' mod 251
              = 101100_2

Note that the main computational steps (after the actual operand multiplication) are the two multiplications in (1) and (2).
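The two-step reduction of this section can be sketched compactly in Python (function name ours); each loop iteration is one application of the substitution 2^n ≡ a (mod p):

```python
# Sketch of pseudo Mersenne reduction for p = 2^n - a: apply
# C <- (C >> n) * a + (C mod 2^n) twice, then subtract p if needed.

def pseudo_mersenne_reduce(c: int, n: int, a: int) -> int:
    """Reduce c (up to 2n bits) modulo p = 2^n - a."""
    p = (1 << n) - a
    mask = (1 << n) - 1
    for _ in range(2):                 # the two MUL steps (2.3) and (2.4)
        c = (c >> n) * a + (c & mask)
    while c >= p:                      # final conditional subtraction(s)
        c -= p
    return c

# Example 2.4: a = 5, p = 251, A = 213, B = 197
assert pseudo_mersenne_reduce(213 * 197, 8, 5) == 44
```

The only multiplications are by the small constant a, matching the complexity analysis above.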
Remarks
1. A formal description of the method is given in [Men97, Algorithm 14.47].
2. The method can also be of interest for fast computations in Galois fields of the form GF(p^m), where p is a pseudo Mersenne prime. In this case, p is chosen such that it just fits into one computer word, e.g., p = 2^63 − 259 is a good fit for 64-bit CPUs. For more information see [BP98].
2.1.4 Modulo Reduction with Generalized Mersenne Primes

The National Institute of Standards and Technology (NIST) has standardized the following generalized Mersenne primes in July 1999 for use in elliptic curve cryptography[1]:

p_1 = 2^192 − 2^64 − 1
p_2 = 2^224 − 2^96 + 1
p_3 = 2^256 − 2^224 + 2^192 + 2^96 − 1
p_4 = 2^384 − 2^128 − 2^96 + 2^32 − 1

When we choose a bit size for the prime of an ECC scheme defined over GF(p), we have to pay attention to the fact that generic attacks on ECC have a complexity of approximately √p. For instance, attacking a 192-bit ECC scheme requires about √(2^192) = 2^{192/2} = 2^96 steps. That means that an elliptic curve with 192 bit is as secure as a block cipher (that does not have mathematical weaknesses) with a 96-bit key.

[1] see http://cscr.nist.gov/encryption/dss/ecdsa/NISTReCur.doc, Appendix 1 "Implementation of Modular Arithmetic"
Reducing an integer modulo any of the above primes can be done without any multiplication, using only additions. Let's first look at an example.
Example 2.5
If we do arithmetic in Z_p, the crucial modular reduction operation occurs after a multiplication of two integers. Let

p = 2^192 − 2^64 − 1
A, B ∈ Z_p, i.e., ⌊log_2 A⌋ + 1, ⌊log_2 B⌋ + 1 ≤ 192
C' = A · B

The bit length of C' is ⌊log_2 C'⌋ + 1 < 384 = 2 · 192. Now, we split C' into six blocks of 64 bits each, where each block is in the range 0 ≤ c'_i < 2^64, cf. Figure 2.2.

(1) C' = c'_5 · 2^320 + c'_4 · 2^256 + c'_3 · 2^192 + c'_2 · 2^128 + c'_1 · 2^64 + c'_0
    where 0 ≤ c'_i < 2^64

Figure 2.2: Splitting of a 384-bit integer into 64-bit blocks
Mathematically speaking, we represent C' in radix-2^64 notation. Note that this is the natural representation of C' on a 64-bit computer. To perform the reduction C ≡ C' mod p, 0 ≤ C < p, efficiently, we now derive reduced expressions for 2^192, 2^256 and 2^320:

(2)  2^192 = [1] · p + [2^64 + 1] ≡ 2^64 + 1 mod p
(3)  2^256 = 2^64 · 2^192 ≡ 2^128 + 2^64 mod p
(4a) 2^320 = 2^64 · 2^256 ≡ 2^64 · (2^128 + 2^64) = 2^192 + 2^128 mod p
(4b) 2^320 ≡ 2^128 + 2^64 + 1 mod p

Now we use Eqns (2), (3) and (4b) to substitute the terms in Eqn (1):

C' ≡ c'_5 · (2^128 + 2^64 + 1) + c'_4 · (2^128 + 2^64) + c'_3 · (2^64 + 1) + c'_2 · 2^128 + c'_1 · 2^64 + c'_0 mod p
After reordering and substitution we get

c_0 = c'_0 + c'_3 + c'_5
c_1 = c'_1 + c'_3 + c'_4 + c'_5
c_2 = c'_2 + c'_4 + c'_5

with the result:

C = C' mod p = c_2 · 2^128 + c_1 · 2^64 + c_0

Note that, when adding the coefficients c'_i (i = 1, ..., 5), carry bits can occur, such that the coefficients c_0, c_1 and c_2 can be equal to or greater than 2^64.

As shown in the example, the reduction of

C' = Σ_{i=0}^{5} c'_i · 2^{64i}

to

C = Σ_{j=0}^{2} c_j · 2^{64j}

is a linear operation which can be described as

[c_0]   [1 0 0 | 1 0 1]   [c'_0]
[c_1] = [0 1 0 | 1 1 1] · [c'_1]
[c_2]   [0 0 1 | 0 1 1]   [c'_2]
          I       R       [c'_3]
                          [c'_4]
                          [c'_5]

where I is the identity matrix and R is the reduction matrix. The modulo reduction can be accomplished with 7 additions of 64-bit coefficients and no multiplications. The number of additions is given by the Hamming weight of the reduction matrix:

#add = HW(R)

For more information about modulo reduction with generalized Mersenne primes see [Sol99].
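The reduction derived in Example 2.5 can be sketched directly in Python (constant and function names are ours). The six radix-2^64 digits are combined exactly according to the rows of the reduction matrix R, and a few conditional subtractions of p absorb the carries:

```python
# Sketch of reduction modulo the NIST prime p = 2^192 - 2^64 - 1,
# following the coefficient formulas of Example 2.5.

P192 = (1 << 192) - (1 << 64) - 1
M64 = (1 << 64) - 1

def reduce_p192(c: int) -> int:
    """Reduce a 384-bit integer c modulo 2^192 - 2^64 - 1."""
    w = [(c >> (64 * i)) & M64 for i in range(6)]   # digits c'_0 .. c'_5
    c0 = w[0] + w[3] + w[5]                         # rows of [I | R]
    c1 = w[1] + w[3] + w[4] + w[5]
    c2 = w[2] + w[4] + w[5]
    r = c0 + (c1 << 64) + (c2 << 128)
    while r >= P192:                                # absorb carry overflow
        r -= P192
    return r

x = (P192 - 5) * (P192 - 7)   # a full-size 384-bit product
assert reduce_p192(x) == x % P192
```

Only word additions and at most a handful of subtractions of p are performed, in line with #add = HW(R) = 7.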
2.2 An Accelerated Version of the Extended Euclidean Algorithm

The extended Euclidean algorithm is one of the most important number-theoretic algorithms in public-key cryptography, because it is the most efficient method for computing the multiplicative inverse modulo an integer. (The EEA can also be used for inversion in extension fields, e.g., in Galois fields GF(2^m).) The EEA is widely used in asymmetric cryptosystems, e.g., for computing RSA keys [PP10, Chapter 7] or performing point addition in elliptic curve cryptography [PP10, Chapter 9]. In this section we will learn an important version of the EEA which considerably accelerates the algorithm in software: it replaces divisions by bit shifts.
2.2.1 Greatest Common Divisor

The greatest common divisor (GCD) of two integers r_0 and r_1 is denoted by

gcd(r_0, r_1)

and is the largest positive number that divides both r_0 and r_1. For instance, gcd(27, 6) = 3. For small numbers, the GCD is easy to calculate by factoring both numbers and finding the highest common factor.

Example 2.6 Let r_0 = 84 and r_1 = 30. Factoring yields

r_0 = 84 = 2 · 2 · 3 · 7, and
r_1 = 30 = 2 · 3 · 5

Hence, the highest common factor is the product of the common prime factors:

2 · 3 = 6 = gcd(30, 84)

For larger numbers (with sizes used in public-key cryptography), however, factoring is often not possible, and a much more efficient algorithm is used for GCD computations: the Euclidean algorithm.
2.2.2 The Euclidean Algorithm

The Euclidean algorithm is based on the simple observation that

gcd(r_0, r_1) = gcd(r_0 − r_1, r_1),    (2.5)
where we assume that r_0 > r_1 and that both numbers are positive integers.

Proof 2.2 Let gcd(r_0, r_1) = g. Since g obviously divides both r_0 and r_1, we can write r_0 = g · x and r_1 = g · y, where x and y are coprime integers and x > y. Now either

gcd(r_0 − r_1, r_1) = gcd(g · (x − y), g · y) = g    (2.6)

or

gcd(r_0 − r_1, r_1) = gcd(g · (x − y), g · y) = f · g, where f > 1.    (2.7)

Case (2.7), however, cannot occur. If it were true, we would have x − y = f · z (for some integer z) and y = f · y'. From this follows:

x − y = f · z
x − f · y' = f · z
x = f · (z + y')

We have now established that both y and x are divisible by f. Hence gcd(r_0, r_1) = gcd(g · x, g · y) = f · g, which is a contradiction to the assumption that gcd(r_0, r_1) = g. Hence, Eqn (2.5) holds.
Let's verify this property with the numbers from the previous example.

Example 2.7 Let again r_0 = 84 and r_1 = 30. We now look at the GCD of (r_0 − r_1) and r_1:

r_0 − r_1 = 54 = 2 · 3 · 3 · 3
r_1 = 30 = 2 · 3 · 5

The largest common factor is still 2 · 3 = 6 = gcd(30, 54) = gcd(30, 84).
It also follows immediately that we can apply the process iteratively:

gcd(r_0, r_1) = gcd(r_0 − r_1, r_1) = gcd(r_0 − 2·r_1, r_1) = ... = gcd(r_0 − i·r_1, r_1)

as long as (r_0 − i·r_1) > 0. The algorithm is most efficient if we choose the largest possible value for i, since we approach the termination criterion much faster. This is the case if we compute

gcd(r_0, r_1) = gcd(r_0 mod r_1, r_1).

Since the first term (r_0 mod r_1) is smaller than the second term r_1, we usually swap them:

gcd(r_0, r_1) = gcd(r_1, r_0 mod r_1).
The core observation from this process is that we can reduce the problem of finding the GCD of two given long numbers to that of finding the GCD of two relatively small numbers. This process can be applied recursively until we finally obtain gcd(r_m, 0) = r_m. Since each iteration preserves the GCD of the previous iteration step, this final GCD is the GCD of the original problem, i.e.,

gcd(r_0, r_1) = ... = gcd(r_m, 0) = r_m.
We first show some examples for finding the GCD using the Euclidean algorithm and then discuss the algorithm a bit more formally.

Example 2.8 Let r_0 = 22 and r_1 = 6. The Euclidean algorithm in Figure 2.3 gives somewhat of a feeling for the algorithm by showing how the lengths of the parameters are reduced in every iteration. The shaded part in every iteration is the current remainder, which forms the new input for the next iteration.

gcd(22, 6) = gcd(3·6 + 4, 6) = gcd(6, 4)
gcd(6, 4)  = gcd(1·4 + 2, 4) = gcd(4, 2)
gcd(4, 2)  = gcd(2·2 + 0, 2) = gcd(2, 0) = 2

Figure 2.3: Example of the Euclidean algorithm

It is also helpful to look at the Euclidean algorithm with slightly larger numbers. This happens in the next example.

Example 2.9 Let r_0 = 973 and r_1 = 301. The GCD is then computed as

973 = 3 · 301 + 70
301 = 4 · 70 + 21
70 = 3 · 21 + 7
21 = 3 · 7 + 0

gcd(973, 301) = gcd(301, 70) = gcd(70, 21) = gcd(21, 7) = gcd(7, 0) = 7.
We can now give a more formal description of the algorithm.

Algorithm 2.2 The Euclidean Algorithm [Men97, Alg. 2.104]
Input: positive integers r_0, r_1 with r_0 ≥ r_1
Output: g = gcd(r_0, r_1)
1. WHILE r_1 ≠ 0 DO:
   1.1 Set g ← r_0 mod r_1, r_0 ← r_1 and r_1 ← g
2. RETURN (r_0)
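Algorithm 2.2 transcribes almost verbatim into Python (function name ours); arbitrary-precision integers play the role of the long numbers:

```python
# Direct transcription of Algorithm 2.2 (the Euclidean algorithm).

def euclid(r0: int, r1: int) -> int:
    """Return gcd(r0, r1) for positive integers r0 >= r1."""
    while r1 != 0:
        r0, r1 = r1, r0 % r1   # g <- r0 mod r1, then shift the pair
    return r0

assert euclid(973, 301) == 7   # Example 2.9
assert euclid(84, 30) == 6     # Example 2.6
```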
Note that the algorithm terminates when a remainder g with the value 0 is computed. The remainder derived in the previous iteration is then denoted by r_0 in our notation and is the GCD of the original problem.
It is important to note that the Euclidean algorithm is very efficient. Even in the case of the very long numbers typically involved in public-key cryptography, the algorithm is quick. The number of iterations is proportional to the logarithm of the operands. That means, for instance, that the number of iterations of a GCD involving 1024-bit numbers is 1024 times a constant. In fact, it can be shown that efficient implementations of the algorithm need fairly close to 1024 iterations. Of course, algorithms with a few thousand iterations can easily be executed on today's computers (or smartphones).
2.2.3 The Extended Euclidean Algorithm

So far we have seen that the EA allows us to find the GCD of two integers r_0, r_1. It can be shown that the resulting GCD can always be written as a linear combination of the form

gcd(r_0, r_1) = r_m = s · r_0 + t · r_1

where s and t are integer coefficients. This equation is often referred to as a Diophantine equation. It turns out that in cryptography the coefficients s and t of the linear combination are much more important than the GCD itself.
The question now is: if the two coefficients s and t always exist, how do we find them? For this we use the extended Euclidean algorithm. The idea behind the EEA is simple: we execute the EA as before, but in every iteration we express the current remainder r_i as a linear combination of the form

r_i = s_i · r_0 + t_i · r_1.

Note that these linear combinations always involve the initial values r_0 and r_1. In the last iteration we end up with

r_m = gcd(r_0, r_1) = s_m · r_0 + t_m · r_1 = s · r_0 + t · r_1.

This means that the coefficients s_m and t_m computed in the last iteration are the parameters s and t we are looking for.
Example 2.10 We consider the extended Euclidean algorithm with the same values as in the previous example, r_0 = 973 and r_1 = 301. On the left-hand side, we compute the standard EA, i.e., we compute new remainders r_2, r_3, .... Also, we have to compute the integer quotient q_{i−1} in every iteration. On the right-hand side we compute the coefficients s_i and t_i such that r_i = s_i · r_0 + t_i · r_1. The coefficients are always shown in brackets.

i   r_{i−2} = q_{i−1} · r_{i−1} + r_i    r_i = [s_i]·r_0 + [t_i]·r_1
2   973 = 3 · 301 + 70                   70 = [1]·r_0 + [−3]·r_1
3   301 = 4 · 70 + 21                    21 = r_1 − 4 · (1·r_0 − 3·r_1)
                                            = [−4]·r_0 + [13]·r_1
4   70 = 3 · 21 + 7                      7 = (1·r_0 − 3·r_1) − 3 · (−4·r_0 + 13·r_1)
                                           = [13]·r_0 + [−42]·r_1
5   21 = 3 · 7 + 0

The algorithm computed the three parameters gcd(973, 301) = 7, s = 13 and t = −42. The correctness can be verified by:

gcd(973, 301) = 7 = [13] · 973 + [−42] · 301 = 12649 − 12642
Watch carefully the algebraic steps taking place in the right column of the example above. In particular, observe that the current linear combination can always be constructed from the two previous linear combinations. With the help of this example, we can give a more formal description of the EEA in terms of iterative formulae for the computation of s_i and t_i. Note that the iteration counter starts with the index 2. The initial values s_0, s_1 and t_0, t_1 are also given. The computation is finished when the remainder has the value r_i = 0. The coefficients s and t of the solution are then given by s_{i−1} and t_{i−1}, respectively.
Algorithm 2.3 Extended Euclidean Algorithm (EEA) [Men97, Alg. 2.107]
Input: positive integers r_0, r_1 with r_0 > r_1
Output: g = gcd(r_0, r_1) and the integer coefficients s and t such that g = s · r_0 + t · r_1
1. IF r_1 = 0
   g ← r_0, s ← 1, t ← 0 and GOTO Step 5
2. Initialize s_0 = 1, s_1 = 0, t_0 = 0, t_1 = 1, a = r_0 and b = r_1
3. WHILE b > 0
   3.1 q ← ⌊a/b⌋, r ← a − q·b, s ← s_0 − q·s_1 and t ← t_0 − q·t_1
   3.2 a ← b, b ← r, s_0 ← s_1, s_1 ← s, t_0 ← t_1 and t_1 ← t
4. g ← a, s ← s_0 and t ← t_0
5. RETURN (g, s, t)
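As a sketch, Algorithm 2.3 can be coded in Python with the same variable names as the pseudocode (the function name is ours):

```python
# Sketch of Algorithm 2.3 (extended Euclidean algorithm).

def eea(r0: int, r1: int) -> tuple[int, int, int]:
    """Return (g, s, t) with g = gcd(r0, r1) = s*r0 + t*r1."""
    if r1 == 0:
        return r0, 1, 0
    s0, s1, t0, t1 = 1, 0, 0, 1
    a, b = r0, r1
    while b > 0:
        q = a // b
        a, b = b, a - q * b            # step 3.1/3.2: advance remainders
        s0, s1 = s1, s0 - q * s1       # ... and both coefficient pairs
        t0, t1 = t1, t0 - q * t1
    return a, s0, t0

g, s, t = eea(973, 301)
assert (g, s, t) == (7, 13, -42)       # Example 2.10
assert s * 973 + t * 301 == 7
```

As shown in the text, the t coefficient of eea(r0, r1) directly yields the modular inverse of r1 mod r0 whenever the GCD is 1.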
One of the most important applications of the EEA is computing the inverse modulo an integer. This is straightforward, as will be shown now. Let's assume we want to compute the inverse of r_1 mod r_0, where r_1 < r_0. Note that the inverse only exists if gcd(r_0, r_1) = 1. When we compute the EEA we obtain an expression of the form

gcd(r_0, r_1) = 1 = s · r_0 + t · r_1.

If we take this equation modulo r_0 we obtain

1 ≡ t · r_1 mod r_0.

This is exactly the definition of the inverse of r_1! That means that t itself is the inverse of r_1:

t ≡ r_1^{−1} mod r_0

We now consider an example for computing the modular inverse.
Example 2.11 Our goal is to compute 12^{−1} mod 67. The values 12 and 67 are relatively prime, i.e., gcd(67, 12) = 1. With the EEA we compute the coefficients s and t in gcd(67, 12) = 1 = s · 67 + t · 12. Starting with the values r_0 = 67 and r_1 = 12, the algorithm proceeds as follows:

i   q_{i−1}   r_i   s_i   t_i
2   5         7     1     −5
3   1         5     −1    6
4   1         2     2     −11
5   2         1     −5    28

This gives us the linear combination

−5 · 67 + 28 · 12 = 1.

As shown above, the inverse of 12 follows from here as

12^{−1} ≡ 28 mod 67.

This result can easily be verified:

28 · 12 = 336 ≡ 1 mod 67.
For fast software implementations a variant called the binary extended Euclidean algorithm is more efficient. The most costly step in the EEA is the computation of the quotient q in each iteration, which requires a division (with very long numbers). The binary EEA uses divisions and multiplications by powers of 2, which can be realized as simple shifts in software. Its description is given in the next section.
2.2.4 The Binary Extended Euclidean Algorithm

Even though the EEA has a low theoretical complexity (it has a logarithmic number of iterations), its drawback in practice is that it requires a long-number division in every iteration. In the following, we develop a variant of the EEA which replaces divisions by binary shifts.

The Binary Euclidean Algorithm

We start by first considering the Euclidean algorithm, which computes the GCD of two integers. We recall that the standard EA is based on the recursion:
gcd(r_i, r_(i-1)) = gcd(r_(i-1), r_i mod r_(i-1)). A simple example is:

gcd(22, 6) = gcd(6, 22 mod 6)
           = gcd(6, 4)
Broadly speaking, through the above iteration the EA reduces one GCD computation to another one with smaller parameters. We already saw in Eqn (2.5) that the reduction of the type r_i mod r_(i-1), which is used in the standard EA, is not the only possible one. For a generalization of the EA the following definition is useful.
Definition 2.2.1 GCD Preserving Reduction
Given two integers r0, r1, where r0 > r1, we call the computation of r2 such that

gcd(r0, r1) = gcd(r2, r1)

a GCD preserving reduction.
For the binary EA we make use of three GCD preserving reductions.

1. u, v both even:
   gcd(u, v) = 2 · gcd(u/2, v/2)
   Proof: With u = 2u′ and v = 2v′ we have gcd(u, v) = gcd(2u′, 2v′) = 2 · gcd(u′, v′).

2. u even, v odd:
   gcd(u, v) = gcd(u/2, v)
   Proof: Since 2 ∤ v but 2 | u, the factor 2 of u cannot be part of the GCD, hence gcd(u, v) = gcd(u/2, v).

3. u, v both odd:
   gcd(u, v) = gcd((u - v)/2, v)
   Proof: The proof can be split in two parts.
   (i) gcd(u, v) = gcd(u - v, v), as shown in Section 2.2.2.
   (ii) Since (u - v) is even, reduction 2 from above applies:
        gcd(u - v, v) = gcd((u - v)/2, v)
It is crucial to observe that all reductions only require division by 2, which is a simple shift in software, or a subtraction. Both operations are considerably faster than multiplication or division, especially if long numbers are involved. In general, the binary GCD is calculated through divisions by 2 and subtractions and stops when one operand divides the other, i.e., when the remainder r_i mod r_(i-1) is zero.
Notice that case 1 of the GCD preserving reductions does not usually occur if we want to compute the inverse of a number: the common factor 2 would mean that gcd(u, v) ≠ 1 and, as a consequence, the inverse u^(-1) does not exist.
Algorithm 2.4 Binary Euclidean Algorithm
[Men97, Alg. 14.54]
Input: positive integers u, v with u ≥ v
Output: g = gcd(u, v)
1. g ← 1
2. WHILE both u and v are even
       u ← u/2, v ← v/2, g ← 2g
3. WHILE u ≠ 0
   3.1 WHILE u is even
           u ← u/2
   3.2 WHILE v is even
           v ← v/2
   3.3 t ← |u - v| / 2
   3.4 IF u ≥ v
           u ← t
       ELSE
           v ← t
4. RETURN(g · v)
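Algorithm 2.4 translates almost line by line into code (a sketch; the function name is ours):

```python
def binary_gcd(u, v):
    """Binary Euclidean algorithm [Men97, Alg. 14.54]: gcd via shifts and subtractions only."""
    g = 1
    while u % 2 == 0 and v % 2 == 0:   # Step 2: factor out common powers of 2
        u //= 2
        v //= 2
        g *= 2
    while u != 0:                      # Step 3
        while u % 2 == 0:              # 3.1
            u //= 2
        while v % 2 == 0:              # 3.2
            v //= 2
        t = abs(u - v) // 2            # 3.3: here both u and v are odd, so u - v is even
        if u >= v:                     # 3.4
            u = t
        else:
            v = t
    return g * v                       # Step 4

print(binary_gcd(42, 18))  # 6
```

In a real long-number library the divisions by 2 would be right shifts of the multi-word representation.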
We now look at an example that compares the binary GCD algorithm with the standard EA.

Example 2.12
We want to compute the GCD of 18 and 42. The standard EA proceeds as follows:

gcd(42, 18) = gcd(18, 6)
gcd(18, 6) = gcd(6, 0)
gcd(6, 0) = 6

The binary EA computes the same result with the following steps:

gcd(42, 18) = 2 · gcd(21, 9)
            = 2 · gcd(6, 9)
            = 2 · gcd(3, 9)
            = 2 · gcd(3, 3)
            = 2 · gcd(0, 3)
            = 2 · 3 = 6
□
From the example we see that the binary GCD needs more iterations than the standard EA. On the other hand, each iteration only uses very cheap operations, namely divisions by 2 and subtractions, whereas the standard EA needs an expensive modulo operation in every iteration. In practice, the software runtime of the binary EA is therefore smaller.
The Binary Extended Euclidean Algorithm
In order to compute the inverse of an integer modulo p (p prime) or modulo m (m
composite) we can use the binary EEA. The algorithm computes for input parameters
u and v the gcd(u, v) and two integer coecients s and t such that
s v + t u = gcd(u, v)
In case of gcd(u, v) = 1, the inverse v
1
mod u exist (we assume here without loss of
generality that u > v).
The binary EEA is given in the following algorithm.
Algorithm 2.5 Binary Extended Euclidean Algorithm
[Men97, Alg. 14.61]
Input: two positive integers u and v
Output: integers s, t and w such that t·u + s·v = w, where w = gcd(u, v)
1. g ← 1
2. WHILE u and v are both even
       u ← u/2, v ← v/2, g ← 2g
3. t ← u, w ← v, A ← 1, B ← 0, C ← 0, D ← 1
4. WHILE t is even
   4.1 t ← t/2
   4.2 IF A ≡ B ≡ 0 (mod 2)
           A ← A/2, B ← B/2
       ELSE
           A ← (A + v)/2, B ← (B - u)/2
5. WHILE w is even
   5.1 w ← w/2
   5.2 IF C ≡ D ≡ 0 (mod 2)
           C ← C/2, D ← D/2
       ELSE
           C ← (C + v)/2, D ← (D - u)/2
6. IF t ≥ w
       t ← (t - w), A ← (A - C), B ← (B - D)
   ELSE
       w ← (w - t), C ← (C - A), D ← (D - B)
7. IF t = 0
       t ← C, s ← D
       RETURN(t, s, g · w)
   ELSE
       GOTO Step 4
At its core, the algorithm performs the three GCD-preserving reductions which were discussed at the beginning of this subsection. The two variables for which the GCD is being computed are t and w, i.e., throughout the algorithm the following equality is being maintained:

gcd(u, v) = gcd(t, w)

The values of t and w are decreased in every step until t = 0. In Step 2 they are divided by 2 if both values are even. In Step 4, t is divided by 2 if it is even, and the same happens with w in Step 5. In Step 6, the difference between t and w is computed and assigned to either t (if t was originally larger) or w (otherwise). The crucial difference from the binary EA which was introduced before is that throughout the algorithm the following linear combinations are being maintained:

t = A · u + B · v   (2.8)
w = C · u + D · v   (2.9)

This means that every time t or w is reduced in Step 4, 5 or 6, new values for A, B, C and D are computed as needed. We now prove that Equalities (2.8) and (2.9) are indeed maintained in those steps.
Proof 2.3 Step 4 divides t once or several times by 2. At the beginning of the step, (2.8) and (2.9) are true. The values w, C and D are not changed, hence (2.9) is certainly still true after the step. If A ≡ B ≡ 0 (mod 2), i.e., A and B are both even, the step updates t, A and B as follows:

t/2 = A/2 · u + B/2 · v   (2.10)

If we multiply this equation by 2 it is clear that expression (2.8) is still maintained. The more interesting case happens when A and B are not both even. In this case, we cannot simply divide A and B by 2 as in Eqn (2.10). Instead, A is replaced by (A + v)/2 and B by (B - u)/2. These two substitutions maintain (2.8) because:

t/2 = (A + v)/2 · u + (B - u)/2 · v
t/2 = A/2 · u + uv/2 + B/2 · v - uv/2
t/2 = A/2 · u + B/2 · v

Again, if we multiply this equation by 2 it is clear that expression (2.8) is maintained.

Step 5 updates the variables w, C and D and maintains (2.9). It proceeds completely analogously to Step 4.

Finally, we have to show that Step 6 maintains (2.8) and (2.9). In the first case, i.e., t ≥ w, the three substitutions t ← (t - w), A ← (A - C), B ← (B - D) take place. These are valid transformations since:

t - w = (A - C)·u + (B - D)·v
t - w = (A·u + B·v) - (C·u + D·v)

which is exactly the expression we obtain if we subtract (2.9) from (2.8). In case t < w, analogous computations take place.
□
The algorithm terminates when t reaches the value 0. At this point the GCD computation has been reduced to

gcd(u, v) = gcd(t, w) = gcd(0, w)

Trivially, gcd(0, w) has the value w. Hence the value g · w is returned in Step 7. (g is 1, or a power of 2 if both u and v were initially even.) Importantly, at the moment the algorithm terminates, it also holds that:

w = C · u + D · v = gcd(u, v)

Thus, the parameters t and s of the EEA are exactly the values of C and D, which are returned in Step 7.
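Algorithm 2.5 translates directly into code. The following sketch keeps the variable names of the algorithm (the Python function name is ours):

```python
def binary_eea(u, v):
    """Binary extended Euclidean algorithm [Men97, Alg. 14.61].
    Returns (t, s, w) with t*u + s*v = w = gcd(u, v)."""
    g = 1
    while u % 2 == 0 and v % 2 == 0:          # Step 2
        u //= 2; v //= 2; g *= 2
    t, w, A, B, C, D = u, v, 1, 0, 0, 1       # Step 3
    while True:
        while t % 2 == 0:                     # Step 4: halve t, keep t = A*u + B*v
            t //= 2
            if A % 2 == 0 and B % 2 == 0:
                A //= 2; B //= 2
            else:
                A = (A + v) // 2; B = (B - u) // 2
        while w % 2 == 0:                     # Step 5: halve w, keep w = C*u + D*v
            w //= 2
            if C % 2 == 0 and D % 2 == 0:
                C //= 2; D //= 2
            else:
                C = (C + v) // 2; D = (D - u) // 2
        if t >= w:                            # Step 6
            t, A, B = t - w, A - C, B - D
        else:
            w, C, D = w - t, C - A, D - B
        if t == 0:                            # Step 7
            return C, D, g * w

# Inverse of 10 modulo 39 (cf. Example 2.13): s = 4
t, s, w = binary_eea(39, 10)
print(t, s, w)   # -1 4 1
```

All divisions are exact, so the integer divisions by 2 are simple right shifts; no long-number division is needed anywhere.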
We show the algorithm in an example, in which we compute the inverse of 10 mod 39. Equations in square brackets only serve to show the correctness of the calculations. For better understanding, the step numbers below correspond to the step numbers of the binary EEA. The index "new" marks a variable that receives a new value, for better distinction of old and new values.
Example 2.13
u = 39, v = 10
1. g ← 1
2. no operation
3. t = 39, w = 10, A = 1, B = 0, C = 0, D = 1
   [ t = A·u + B·v: 39 = 1·39 + 0·10     w = C·u + D·v: 10 = 0·39 + 1·10 ]
4. no operation, since t is odd
5. w is even
   (a) w_new = w/2 = 10/2 = 5
       Notice that 5 is the new value for w in the gcd:
       gcd(39, 10) = gcd(39, 5)
   (b) C_new = (C + v)/2 = (0 + 10)/2 = 5
       D_new = (D - u)/2 = (1 - 39)/2 = -19
       [ w_new = C_new·u + D_new·v: 5 = 5·39 + (-19)·10 = 195 - 190 ]    gcd(39, 5)
   w is odd, therefore end of loop
6. t ≥ w, (39 > 5)
   t_new = t - w = 39 - 5 = 34
   A_new = A - C = 1 - 5 = -4
   B_new = B - D = 0 - (-19) = 19
   Notice that 34 is the new value for t in the gcd:
   [ t_new = A_new·u + B_new·v: 34 = (-4)·39 + 19·10 = -156 + 190 ]    gcd(34, 5)
7. t ≠ 0, go to algorithm Step 4
4. t is even
   t_new = t/2 = 34/2 = 17
   A_new = (A + v)/2 = (-4 + 10)/2 = 3
   B_new = (B - u)/2 = (19 - 39)/2 = -10
   [ 17 = 3·39 + (-10)·10 ]    gcd(17, 5)
   t is odd, therefore end of loop
5. no operation, since w is odd
6. t ≥ w, (17 > 5)
   t_new = t - w = 17 - 5 = 12
   A_new = A - C = 3 - 5 = -2
   B_new = B - D = -10 + 19 = 9
   [ 12 = (-2)·39 + 9·10 ]    gcd(12, 5)
7. t ≠ 0, go to algorithm Step 4
4. t is even
   t_new = t/2 = 12/2 = 6
   A_new = (A + v)/2 = (-2 + 10)/2 = 4
   B_new = (B - u)/2 = (9 - 39)/2 = -15
   [ 6 = 4·39 + (-15)·10 ]    gcd(6, 5)
   t is even, therefore next loop iteration
   t_new = t/2 = 6/2 = 3
   A_new = (A + v)/2 = (4 + 10)/2 = 7
   B_new = (B - u)/2 = (-15 - 39)/2 = -27
   [ 3 = 7·39 + (-27)·10 ]    gcd(3, 5)
   t is odd, therefore end of loop
5. no operation (NOP), since w is odd
6. t < w, (3 < 5)
   w_new = w - t = 5 - 3 = 2
   C_new = C - A = 5 - 7 = -2
   D_new = D - B = -19 + 27 = 8
   [ 2 = (-2)·39 + 8·10 ]    gcd(3, 2)
7. t ≠ 0, go to algorithm Step 4
4. NOP, since t is odd
5. w is even
   w_new = w/2 = 2/2 = 1
   C_new = C/2 = -2/2 = -1
   D_new = D/2 = 8/2 = 4
   [ 1 = (-1)·39 + 4·10 ]    gcd(3, 1)
   w is odd, therefore end of loop
6. t ≥ w, (3 > 1)
   t_new = t - w = 3 - 1 = 2
   A_new = A - C = 7 - (-1) = 8
   B_new = B - D = -27 - 4 = -31
   [ 2 = 8·39 + (-31)·10 ]    gcd(2, 1)
7. t ≠ 0, go to algorithm Step 4
4. t is even
   t_new = t/2 = 2/2 = 1
   A_new = (A + v)/2 = (8 + 10)/2 = 9
   B_new = (B - u)/2 = (-31 - 39)/2 = -35
   [ 1 = 9·39 + (-35)·10 ]    gcd(1, 1)
   t is odd, therefore end of loop
5. NOP, since w is odd
6. t ≥ w, (1 = 1)
   t_new = t - w = 1 - 1 = 0
   A_new = A - C = 9 - (-1) = 10
   B_new = B - D = -35 - 4 = -39
   [ 0 = 10·39 + (-39)·10 ]    gcd(0, 1)
7. t = 0
   t_new = C = -1
   s_new = D = 4
   g·w = 1·1 = 1
   t·u + s·v = 1
   (-1)·39 + 4·10 = -39 + 40 = 1
   s·v ≡ 1 mod u
   s ≡ v^(-1) mod u
   4 ≡ 10^(-1) mod 39
At the end we have successfully computed the inverse, 4, of 10 modulo 39 with the binary EEA.
□
Chapter 3
Block Cipher Implementation
DES was designed by IBM and the NSA in the early/mid 1970s and later became the first official standard cipher. It was designed with a strong focus on hardware efficiency, which unfortunately implies that software implementations are rather inefficient.
A straightforward (= naïve) software implementation approach is to implement a block cipher component-wise, which leads to extremely poor performance. The next two sections present some more advanced techniques for more efficient block cipher implementations in software.
3.1 Software Implementation of Block Ciphers
Remark:
1. Block ciphers are the dominant symmetric algorithm type for encryption and
message authentication.
2. In their structure they are all iterative!
Figure 3.1 shows the general structure of block ciphers. Therein K denotes the key, K_i the round key, x the plaintext, and y the ciphertext. One can see the iterative structure through the feedback arrow on the right side of the figure. At the end of the loop the resulting ciphertext y is provided.
The major aspect of efficient block cipher implementation is the implementation of the round function. There exist many different designs of round functions, but they all take into account the two principles of Shannon, which are required for a block cipher to be secure:
1. Confusion
2. Diffusion
In theory, there exist many different ways of realizing confusion and diffusion, but in practice only a few atomic operations are used. Confusion is mostly done by substitution
Figure 3.1: The General Structure of a Block Cipher
(S-Boxes) or by arithmetic operations (as in IDEA). Diffusion is achieved by permutations on bit or word level or by more complex mixing operations (e.g., the matrix multiplications in AES).
3.1.1 Structure of SP-Networks
Substitution-permutation networks (SPN) are, besides Feistel networks, one of the two major design strategies for block ciphers. Examples of SPNs are the AES or PRESENT (see Sect. 3.3 on page 39). Figure 3.2 shows the building blocks of an SPN.
The key addition is usually done by an XOR between the text and the round key, while the substitution layer consists of S-Boxes which substitute input bits by output bits. Recall that good S-Boxes should on average change at least half of the output bits (strict avalanche criterion). In such a round function the substitution layer is the only non-linear part of the block cipher and prevents linear and differential cryptanalysis. The permutation layer can be realized through one or more P-Boxes, whose function is to mix the bits across the S-Boxes.
In comparison to a Feistel network, e.g., as used in DES, an SPN needs only half the number of rounds to achieve the same cryptographic strength. This is due to the fact that an SPN processes the whole data block per round, while a Feistel network processes only a part (usually one half) of the data block per round.
Naïve Implementation
The two following implementation approaches for the substitution and the permutation layers are the most common because they are straightforward:
Substitution through individual table look-ups.
Figure 3.2: The Structure of an SP-Network
Permutations realized through loops.
However, both approaches have the shortcoming that they are very slow. Therefore, some common acceleration techniques are presented subsequently.
3.1.2 Table Look-Up (TLU) Based Permutations
Despite the fact that table look-ups are slow, they are faster than loops involving bit testing for implementing a permutation. The basic idea is to realize atomic operations through tables (a time-memory trade-off). In the following we aim for a fast implementation of an arbitrary permutation. We start by looking at an exemplary 32-bit to 32-bit permutation.
The straightforward approach would be to use a single table as in Figure 3.3. The input i is interpreted as the address of the table row, and the entry of row i is the permuted output p(i) of the input. To store this table we would need

2^32 · 32 bit = 2^32 · 2^5 bit = 2^37 bit = 128 Gbit = 16 GByte.

Storing this table poses no problem on modern PCs, but surely there must exist methods to perform a permutation more efficiently.
The next idea is to halve the input size of the table of A and to process the upper 16 bits of the input, i_h, separately from the lower 16 bits, i_l, as depicted in Fig. 3.4.
Figure 3.3: Realizing a permutation with a simple look-up table (approach A)
The input size of each of the two tables of B is 16 bit, and we need

2 · 2^16 · 32 bit = 2^22 bit = 2^19 Byte = 0.5 MByte

of storage space for the tables. Note that this is a decrease of the table size by a factor of 2^37 / 2^22 = 2^15 compared to solution A. We can further decrease the table size and end up with Fig. 3.5.
With approach C we have four different tables that require

4 · 2^8 · 32 bit = 2^15 bit = 4 kByte

of memory. The aim is to decrease the table size such that it is small enough to fit into the cache of the CPU. If this can be achieved, the maximum performance speed-up is to be expected. It is obvious that the TLU method trades area (i.e., memory) for computing time. Table 3.1 summarizes the memory requirements and the number of additional operations for our exemplary 32-bit to 32-bit permutation.
Approach   Size of Tables   Amount of Operations
A          2^37 bit         0
B          2^22 bit         1
C          2^15 bit         3

Table 3.1: Comparing all three approaches
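Approach C can be sketched in code as follows. The permutation used here is an arbitrary example (bit reversal), not one from the text; the point is the four 256-entry tables and the three combining operations of Table 3.1:

```python
# Example permutation on 32 bits: p(i) gives the target position of input bit i.
# Bit reversal serves as a stand-in for an arbitrary fixed permutation.
def p(i):
    return 31 - i

# Precompute four 256-entry tables (approach C): table k handles input bits 8k..8k+7.
TABLES = []
for k in range(4):
    table = []
    for byte in range(256):
        word = 0
        for j in range(8):
            if (byte >> j) & 1:
                word |= 1 << p(8 * k + j)
        table.append(word)
    TABLES.append(table)

def permute(x):
    """32-bit permutation via 4 table look-ups and 3 ORs (cf. Table 3.1, approach C)."""
    return (TABLES[0][x & 0xFF]
            | TABLES[1][(x >> 8) & 0xFF]
            | TABLES[2][(x >> 16) & 0xFF]
            | TABLES[3][(x >> 24) & 0xFF])

# Reference implementation: move the bits one by one inside a loop (the naive way)
def permute_slow(x):
    out = 0
    for i in range(32):
        if (x >> i) & 1:
            out |= 1 << p(i)
    return out

assert permute(0x12345678) == permute_slow(0x12345678)
```

The tables are computed once; afterwards every permutation costs four look-ups instead of a 32-iteration bit loop.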
The same technique can be used for efficient implementations of an arbitrary permutation such as the expansion operation in DES. One question that might arise with
Figure 3.4: A first improvement of realizing a permutation with look-up tables (approach B: the upper and lower 16 input bits i_h, i_l address two tables P_h, P_l, whose 32-bit outputs are combined into P[i])
Figure 3.5: A further improvement of the look-up tables (approach C: four tables with 8-bit inputs and 32-bit outputs)
the expansion function of DES is: how do these tables look when there are fewer input bits than output bits? In order to show the similarity with the approach introduced above, the DES expansion function serves as an example. The expansion prescription for the first few input bits is given in Table 3.2 and is also depicted in Figure 3.6.
Figure 3.6: An illustration of the expansion
Eight different tables are needed, because the 32 input bits are segmented into 4-bit digits. Every table has a size of 2^4 · 48 bit = 768 bit. With each of the eight tables, the four input bits are stored in expanded form as output, and all other bits which are not touched by the expansion of the respective input bits are set to zero. An OR or XOR of all eight table outputs then gives the entire expanded result, as shown in Fig. 3.7.

Original Bit-Position   Expanded Bit-Position
1                       2, 48
2                       3
3                       4
4                       5, 7
5                       6, 8
...                     ...

Table 3.2: An exemplary expansion
Figure 3.7: General structure of look-up tables for Expansion or Permutation
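The table structure of Fig. 3.7 can be sketched in code. The expansion rule below is an arbitrary stand-in (the real DES E-box table is not reproduced here); the point is the segmentation into eight 4-bit digits, the zero-padded partial outputs, and the final OR:

```python
# Hypothetical 32 -> 48 expansion: EXP[j] is the input bit that appears at output bit j.
# (An arbitrary stand-in, not the real DES E-box.)
EXP = [(7 * j + 3) % 32 for j in range(48)]

# Eight tables, one per 4-bit digit of the input; untouched output bits stay 0.
TABLES = []
for k in range(8):
    table = []
    for nib in range(16):
        word = 0
        for j in range(48):
            src = EXP[j]
            if 4 * k <= src < 4 * k + 4 and (nib >> (src - 4 * k)) & 1:
                word |= 1 << j
        table.append(word)
    TABLES.append(table)

def expand(x):
    """48-bit expansion of a 32-bit value via 8 look-ups OR-ed together (cf. Fig. 3.7)."""
    out = 0
    for k in range(8):
        out |= TABLES[k][(x >> (4 * k)) & 0xF]
    return out

# Naive reference: bit-by-bit expansion
def expand_slow(x):
    return sum(((x >> EXP[j]) & 1) << j for j in range(48))

assert expand(0xDEADBEEF) == expand_slow(0xDEADBEEF)
```

Each table holds 16 entries of 48 bit, matching the 768 bit per table computed above.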
3.1.3 S-Box Implementation via Table Look Up
There are design strategies that take the opposite way for optimization: build larger tables from smaller ones. One benefit is to make use of the full word length of the computer architecture.
In Fig. 3.8 a simple transition from a specified S-Box to a look-up table is given. It is important to mention that it is not necessary to have different S-Boxes; hence, S1 and S2 in Fig. 3.8 could be the same S-Box.
Because one S-Box usually does not use the full word length of the datapath of a computer, two or more S-Boxes are joined as depicted in Fig. 3.9, depending on the word length and the available storage space for the table.
Note that the size of table B is 2^16 · 16 bit = 128 kByte, which is 512 times larger than table A, which only requires 2^8 · 8 bit = 256 Byte.
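The grouping of Fig. 3.9 can be sketched as follows, with two placeholder 8-bit S-Boxes (hypothetical, not taken from any real cipher):

```python
# Two placeholder 8-bit S-Boxes (any fixed tables would do).
S1 = [(x * 7 + 3) % 256 for x in range(256)]
S2 = [(x * 11 + 5) % 256 for x in range(256)]

# Grouped table: entry i||j holds S1[i] || S2[j]  (2^16 entries of 16 bit = 128 kByte).
S12 = [(S1[i] << 8) | S2[j] for i in range(256) for j in range(256)]

def sub_grouped(x16):
    """One 16-bit look-up instead of two 8-bit look-ups plus re-assembly of the results."""
    return S12[x16]

i, j = 0xAB, 0xCD
assert sub_grouped((i << 8) | j) == (S1[i] << 8) | S2[j]
```

The price for saving one look-up and the shift/OR re-assembly is the 512-fold table size noted above.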
Figure 3.8: A simple S-Box transition to a look-up table

Figure 3.9: An example of grouped S-Boxes

It sounds paradoxical that first smaller tables bring an improvement and then applying the opposite design strategy again leads to a further improvement. To understand this, a comparison with the naïve implementation methods is helpful: the naïve implementation of a permutation is done as individual bit shifts within loops, which is very complex and expensive in software. A look-up in a table replaces these many operations, under the condition that the table look-up can be done fast. This is possible with small tables that can be stored in fast accessible memories such as the cache. Note that the same holds for the expansion operation. The substitution layer, in contrast, consists of many small table look-ups, which can be grouped together to save some look-ups and, hence, time.
3.2 Bit Slicing
The techniques presented so far accelerate the execution of block ciphers when implemented in the conventional way: the cipher is executed in an iterative approach until, for example, the ciphertext for a given plaintext is computed; then the next plaintext is processed. In contrast, the method presented in the following processes several plaintexts in parallel.
In 1997, Eli Biham presented a new implementation strategy called bit slicing [Bih97] and applied this totally new concept to DES. The main idea behind bit slicing is a rearrangement of the order of bits inside the CPU registers such that one obtains a virtual parallel processing of more than one data block per computation. The CPU is then viewed as many one-bit processors which compute one-bit operations simultaneously and act together as a SIMD (Single-Instruction-Multiple-Data) processor. The maximum number of concurrently processable data blocks depends on the CPU register size; e.g., with the 64-bit registers of an Intel Core 2 Duo, 64 data blocks can be handled at one time.
The rearrangement is relatively complex, and the required time to rearrange grows linearly with the register size (O(n)). Advantageous is the complexity of the round function implementation, which is independent of the register size and therefore constant (O(1)). In the following we assume for simplicity that we have a 32-bit CPU and a 64-bit block cipher like DES or PRESENT.
Figure 3.10: Conventional representation and processing of Data
Figure 3.10 depicts the conventional sequence of processing data block-wise, with labeled bit positions. Note that a 64-bit data block does not fit into a 32-bit register.
Since the data size is 64 bit, we need 64 registers for the rearrangement (see Figure 3.11), and we can process 32 data blocks at a time. From the first data block the bit at position 0 is taken and stored at position 0 of register 1. Then the bit at position 1 of the first data block is stored at position 0 of register 2, and so on. The bits of the second data block are then stored at position 1 of the corresponding registers.
This order of rearrangement is only one possibility; in fact, one is free to choose where and in which order to store the bits for internal processing, because at the end the rearrangement is reversed and the original data block representation is restored. This ensures the independence and interoperability of different implementations. The virtual data object formed by these rearranged bits is called a slice and can be seen as a rotation of the original data blocks by 90 degrees.
Also note that we no longer have 32 sequential processes but only one process for 32 data blocks at the same time. The pointers point to the registers and are needed later in the permutation layer.
Figure 3.11: The Data Representation after Rearrangement for Bit Slicing and for One Process
With some slight accommodations inside the feedback modes of block ciphers, like the Cipher Feedback mode, it is possible to use them with bit slicing, too. One has to pay attention that if a feedback mode other than the Electronic Code Book mode is used, one can no longer mix the conventional and the bit-slicing method for processing the data: because of the parallel processing, block 1 does not influence the processing of block 2, but it influences the processing of blocks 33 to 64, given a register size of 32.
The power of this technique is the parallel processing and the extremely fast computation of the permutation and substitution layers of the round function. While the permutation layer can be realized by a re-ordering of registers (in practice a pointer permutation, see Fig. 3.12), the S-Boxes are ideally given by their Boolean equations, e.g., in the Algebraic Normal Form (ANF). Usually an S-Box is implemented by a table look-up; because of the new data representation this would be a very inefficient solution with bit slicing. It would require sequentially collecting and composing the S-Box inputs from different registers, and finally the S-Box output would have to be reintegrated into the registers.
With the Boolean representation of the S-Boxes the substitution layer is implementable bit by bit, hence it is suitable for a bit-sliced approach:

Register1 = Register1 XOR ((Register2 AND Register3) OR Register4)
Register2 = Register2 AND (Register3 NAND Register4)

Boolean operations (such as AND, OR, NOT, XOR) usually require only one CPU cycle on modern CPUs. The cost of basic integer operations is (mostly) similar to the cost of Boolean operations.
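The following sketch scales the idea down to eight parallel blocks: after the rearrangement, a single sequence of Boolean register operations evaluates a toy 4-bit S-Box (our own invention, for illustration only) for all eight blocks simultaneously.

```python
def transpose(blocks, nbits):
    """Bit-slice rearrangement: register r collects bit r of every block."""
    regs = [0] * nbits
    for pos, block in enumerate(blocks):
        for r in range(nbits):
            regs[r] |= ((block >> r) & 1) << pos
    return regs

def untranspose(regs, nblocks):
    """Reverse the rearrangement, restoring the original block representation."""
    blocks = [0] * nblocks
    for r, reg in enumerate(regs):
        for pos in range(nblocks):
            blocks[pos] |= ((reg >> pos) & 1) << r
    return blocks

# A toy 4-bit "S-Box layer" given by Boolean equations on the slices:
# y0 = x0 XOR (x1 AND x2), y1 = x1 XOR x3, y2 = x2, y3 = x3 XOR (x0 OR x1)
def sbox_sliced(regs):
    x0, x1, x2, x3 = regs
    return [x0 ^ (x1 & x2), x1 ^ x3, x2, x3 ^ (x0 | x1)]

blocks = [0x0, 0x5, 0xA, 0xF, 0x3, 0x6, 0x9, 0xC]   # eight 4-bit blocks
out = untranspose(sbox_sliced(transpose(blocks, 4)), len(blocks))

# Reference: apply the same Boolean function block by block
def sbox_plain(x):
    x0, x1, x2, x3 = (x >> 0) & 1, (x >> 1) & 1, (x >> 2) & 1, (x >> 3) & 1
    return (x0 ^ (x1 & x2)) | ((x1 ^ x3) << 1) | (x2 << 2) | ((x3 ^ (x0 | x1)) << 3)

assert out == [sbox_plain(b) for b in blocks]
```

The four register operations in `sbox_sliced` replace eight separate S-Box evaluations; with 32- or 64-bit registers the same four operations would serve 32 or 64 blocks.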
Figure 3.12: The Permutation of Bit Slicing
For methods to represent a given S-Box in Boolean terms, look for the Walsh spectrum or Walsh transformation, De Morgan's rules, or Karnaugh-Veitch diagrams in appropriate math books, or use an Internet search engine of your choice. Also, software such as espresso helps to find an optimized Boolean representation of a given truth table.
The actual speed-up achieved by bit slicing depends on the cipher and on the grade of optimization of the implementation that is used for comparison. The more bit-wise operations, such as permutations, are used in the cipher, the more the application of bit slicing pays off. Biham's bit-sliced implementation of DES speeds up the execution of DES by a factor of 2 to 10.
In the last sections we have presented a selection of techniques to speed up a software implementation. Now we will present a block cipher that uses rather simple basic building blocks, called PRESENT. Since the implementation of PRESENT is not really challenging, it is ideally suited as an example for applying the techniques presented above. It is recommended to implement the cipher the first time in a straightforward manner without any improvements. Then one has a comparison base and can measure the performance gain after each implementation optimization. That will emphasize how much performance gain the techniques can provide.
3.3 The Block Cipher PRESENT
The PRESENT block cipher was published at CHES 2007 [BKL+07]. It was designed with the following properties in mind:
1. simplicity,
2. small parameters (64-bit block size and 80-bit key size) to reduce the chip area in hardware,
3. an SP-network re-using the same S-Box in all 31 rounds.
Note that a Feistel network would require twice the number of rounds of an SP-network.
Figure 3.13: A block diagram of PRESENT-80 (64-bit input and output, 80-bit key)
PRESENT was designed as an ultra-lightweight block cipher for use in extremely constrained environments, such as passive RFID tags or other pervasive devices. It is secure, very efficient in hardware, and can be implemented with a lower gate count than today's leading compact stream ciphers. For more details, design issues, security analyses, and implementation results the interested reader is referred to [BKL+06, RPLP08].
PRESENT supports two key lengths, 80 and 128 bits. For the applications PRESENT was developed for, the version with 80-bit keys (from now on referred to as PRESENT-80) is recommended. This is a more than adequate security level for the low-security applications typically required in tag-based deployments; just as importantly, it matches the design goals of the hardware-oriented stream ciphers in the eSTREAM project and allows a fairer comparison.
Each of the 31 rounds consists of an XOR operation to introduce a round key K_i for 1 ≤ i ≤ 31, a non-linear substitution layer, and a linear bitwise permutation. The final round only consists of the XOR operation with the final round key K_32. The non-linear layer uses a single 4-bit S-Box S which is applied 16 times in parallel in each round. The cipher is described in pseudo-code in Figure 3.14, and each stage is now specified in turn. The design rationale is given in [BKL+06, Section 4]. Throughout the remainder of this chapter we number bits starting from zero, with bit zero on the right-hand side of a block or word.
generateRoundKeys()
FOR i = 1 TO 31 DO
    addRoundKey(STATE, K_i)
    sBoxLayer(STATE)
    pLayer(STATE)
END FOR
addRoundKey(STATE, K_32)

Figure 3.14: A top-level algorithmic description of PRESENT Encryption
Round Key Addition.
Given round key K_i = κ^i_63 ... κ^i_0 for 1 ≤ i ≤ 32 and current STATE b63 ... b0, addRoundKey consists of the operation, for 0 ≤ j ≤ 63,

b_j ← b_j ⊕ κ^i_j.
Substitution Layer.
The S-Box used in PRESENT is a 4-bit to 4-bit S-Box S : F_2^4 → F_2^4. The action of this box in hexadecimal notation is given by Table 3.3.

x    | 0 1 2 3 4 5 6 7 8 9 A B C D E F
S[x] | C 5 6 B 9 0 A D 3 E F 8 4 7 1 2

Table 3.3: The S-Box of PRESENT
For the substitution layer the current STATE b63 ... b0 is considered as sixteen 4-bit words w15 ... w0, where w_i = b_(4i+3) || b_(4i+2) || b_(4i+1) || b_(4i) for 0 ≤ i ≤ 15, and the output nibble S[w_i] provides the updated state values in the obvious way. The substitution layer eases the serialization, because the same S-Box is used 16 times.
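The substitution layer translates directly into code when the state is held in a 64-bit word (a straightforward, unoptimized sketch):

```python
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]   # Table 3.3

def sbox_layer(state):
    """Apply the PRESENT S-Box to each of the sixteen 4-bit words w15...w0."""
    out = 0
    for i in range(16):
        out |= SBOX[(state >> (4 * i)) & 0xF] << (4 * i)
    return out

print(hex(sbox_layer(0x0123456789ABCDEF)))   # 0xc56b90ad3ef84712
```

A faster variant would group nibbles into bytes and use a 256-entry table of paired S-Box outputs, exactly as described in Section 3.1.3.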
Permutation Layer.
The bit permutation used in PRESENT is given in Table 3.4. Bit i of the state is
moved to the bit position P(i) given in the table.
i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
P(i) 0 16 32 48 1 17 33 49 2 18 34 50 3 19 35 51
i 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
P(i) 4 20 36 52 5 21 37 53 6 22 38 54 7 23 39 55
i 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
P(i) 8 24 40 56 9 25 41 57 10 26 42 58 11 27 43 59
i 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
P(i) 12 28 44 60 13 29 45 61 14 30 46 62 15 31 47 63
Table 3.4: PRESENT's Permutation Layer.
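When implementing pLayer one can avoid storing Table 3.4: for i < 63 the entries satisfy the closed form P(i) = 16·i mod 63, with P(63) = 63. (This closed form is an observation about the table, not stated in the text; the sketch below cross-checks it.)

```python
def P(i):
    """PRESENT bit permutation: bit i of the state moves to position P(i)."""
    return 63 if i == 63 else (16 * i) % 63   # closed form of Table 3.4

def p_layer(state):
    out = 0
    for i in range(64):
        out |= ((state >> i) & 1) << P(i)
    return out

# Cross-check the closed form against the first row of Table 3.4:
assert [P(i) for i in range(16)] == [0, 16, 32, 48, 1, 17, 33, 49,
                                     2, 18, 34, 50, 3, 19, 35, 51]
```

In an optimized software implementation this bit loop would itself be replaced by the table look-up technique of Section 3.1.2.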
The permutation layer is very regular and simple, and in hardware it is virtually for free. The carefully chosen S-Box and the permutation layer prevent linear and differential cryptanalysis attacks. Figure 3.15 shows an excerpt of the combination of PRESENT's substitution and permutation layers. The input of one S-Box is formed from four different S-Box outputs, and the output of one S-Box is mapped to four different S-Boxes, for a maximal avalanche effect. If we change one input bit of an S-Box and assume that this affects all four output bits, then after three rounds one modified input bit would have affected all 64 output bits.
Figure 3.15: The S/P network for PRESENT.
The Key Schedule for PRESENT-80.
The user-supplied key is stored in a key register K and represented as k79 k78 ... k0. At round i the 64-bit round key K_i = κ63 κ62 ... κ0 consists of the 64 leftmost bits of the current contents of register K. Thus at round i we have that

K_i = κ63 κ62 ... κ0 = k79 k78 ... k16.

After extracting the round key K_i, the key register K = k79 k78 ... k0 is updated as follows.

1. [k79 k78 ... k1 k0] = [k18 k17 ... k20 k19]
2. [k79 k78 k77 k76] = S[k79 k78 k77 k76]
3. [k19 k18 k17 k16 k15] = [k19 k18 k17 k16 k15] ⊕ round counter

Thus, the key register is rotated by 61 bit positions to the left, the left-most four bits are passed through the PRESENT S-Box, and the round counter value i is exclusive-ored with bits k19 k18 k17 k16 k15 of K, with the least significant bit of the round counter on the right. The round counter, which counts from 1 to 31, prevents slide attacks.
Figure 3.16 depicts a schematic summary of the PRESENT-80 key schedule.
Figure 3.16: A Scheme of PRESENT-80 Key Schedule
The key schedule for 128-bit keys is presented in [BKL+06, Appendix II].
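Putting the layers and the key schedule together, a complete reference-style PRESENT-80 encryption fits in a few lines (an unoptimized sketch; the self-check at the end uses the all-zero test vector published with the cipher):

```python
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def sbox_layer(s):
    return sum(SBOX[(s >> (4 * i)) & 0xF] << (4 * i) for i in range(16))

def p_layer(s):
    out = 0
    for i in range(64):
        out |= ((s >> i) & 1) << (63 if i == 63 else (16 * i) % 63)
    return out

def round_keys(key80):
    """PRESENT-80 key schedule: 32 round keys of 64 bit each."""
    ks = []
    for i in range(1, 32):
        ks.append(key80 >> 16)                                      # 64 leftmost bits
        key80 = ((key80 << 61) | (key80 >> 19)) & ((1 << 80) - 1)   # rotate left by 61
        key80 = (SBOX[key80 >> 76] << 76) | (key80 & ((1 << 76) - 1))  # S-Box on k79..k76
        key80 ^= i << 15                                            # counter into k19..k15
    ks.append(key80 >> 16)
    return ks

def encrypt(plaintext, key80):
    ks = round_keys(key80)
    state = plaintext
    for r in range(31):
        state = p_layer(sbox_layer(state ^ ks[r]))
    return state ^ ks[31]

# Test vector from the PRESENT paper: all-zero plaintext and key
assert encrypt(0, 0) == 0x5579C1387B228445
```

This straightforward version is the recommended comparison base; the acceleration techniques of Sections 3.1 and 3.2 (grouped S-Box tables, bit slicing) can then be applied and measured against it.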
Chapter 4
Finite Field Arithmetic in
Cryptography
A finite field, sometimes also called a Galois field (named after Évariste Galois), is a set of q elements for which two basic operations (addition and multiplication) are defined so that the set becomes a commutative division ring. Clearly, subtraction, inversion and division can be derived from these two operations. Finite fields are denoted by F_q or GF(q) and have several important properties, some of which are briefly mentioned here:
- Each finite field has q = p^m elements for some prime p and a positive integer m. Moreover, for any prime p and positive integer m there is a finite field with q = p^m elements.
- Two finite fields with the same number of elements are isomorphic. In other words, one can speak about the field F_q with q = p^m.
Though we will provide some details on finite fields in this chapter, we refer to the following excellent books for more background:
- Lidl/Niederreiter: Finite Fields (Encyclopedia of Mathematics and Its Applications), Vol. 20, Cambridge Univ. Press
- Handbook of Applied Cryptography, [Men97, Sec. 2.6, p. 80]
- McEliece: Finite Fields for Computer Scientists and Engineers. Kluwer, 1987
4.1 Introduction
Finite fields are widely used in cryptography, especially in the following applications:
- The multiplicative groups of finite fields can be used directly to build cryptographic schemes based on the discrete logarithm problem, such as public-key encryption, digital signatures or key exchange;
- Finite fields provide the algebraic structure for curves (e.g., elliptic and hyperelliptic curves) as well as for other cryptographically interesting algebraic varieties;
- S-boxes and linear diffusion layers of block ciphers, especially in AES, are sometimes expressed in terms of finite fields;
- Stream ciphers often rely on linear and nonlinear feedback shift registers, which can be extensively analyzed using techniques from finite fields;
- For secret sharing, Shamir's threshold scheme over finite fields can be used.
Although almost any finite field can be considered for application in cryptography (many types of finite fields have been proposed for use in cryptography since the early 1990s; see Figure 4.1 for a graphic representation), in practice either large prime fields F_p ≅ Z_p with p prime or finite fields of characteristic 2, also referred to as binary fields, are used. Such binary fields with 2^m elements are denoted by F_{2^m}.
[Figure: finite fields FF = GF(p^m) divide into prime fields GF(p) (m = 1), built from general primes or special primes (Mersenne primes, pseudo Mersenne primes, generalized Mersenne primes), and extension fields GF(p^m) (m > 1): characteristic-2 fields GF(2^m) and composite fields GF((2^n)^m), as well as fields with p > 2, including Optimized Extension Fields (OEF).]
Figure 4.1: Overview of Finite Field Subdivision
4.2 Some Mathematics: A Very Brief Introduction to Finite Fields
4.2.1 Definition of a Finite Field
To be able to define a finite field, we need two basic notions from algebra: groups and rings. While only one binary operation is defined for groups, rings have two binary operations.
Definition 4.2.1 Group (from [Men97, p. 75])
A group (G, ·) consists of a set G with a binary operation · on G satisfying the following axioms.
1. If a and b are two elements in G, then the product a · b is also in G.
2. The group operation is associative. That is, a · (b · c) = (a · b) · c for all a, b, c ∈ G.
3. There is an element 1 ∈ G, called the identity element, such that a · 1 = 1 · a = a for all a ∈ G.
4. For each a ∈ G there exists an element a^(−1) ∈ G, called the inverse of a, such that a · a^(−1) = a^(−1) · a = 1.
5. A group G is abelian (or commutative) if, furthermore, a · b = b · a for all a, b ∈ G.
Note that multiplicative group notation has been used for the group operation. If the group operation is addition, then the group is said to be an additive group, the identity element is denoted by 0, and the inverse of a is denoted by −a.
The notion of the order of an element in a group is crucial for understanding many cryptographic applications.
Definition 4.2.2 Order of a Group Element
The order of an element a ∈ (G, ·) is the smallest positive integer ℓ such that a · a · ... · a (ℓ times) = a^ℓ = e, where e is the identity element of the group.
Definition 4.2.3 Ring (from [Men97, p. 76/77])
A ring (R, +, ·) consists of a set R with two binary operations arbitrarily denoted + (addition) and · (multiplication) on R, satisfying the following axioms:
1. (R, +) is an abelian group with identity denoted 0.
2. The operation · is associative. That is, a · (b · c) = (a · b) · c for all a, b, c ∈ R.
3. There is a multiplicative identity denoted 1, with 1 ≠ 0, such that 1 · a = a · 1 = a for all a ∈ R.
4. The operation · is distributive over +. That is, a · (b + c) = (a · b) + (a · c) and (b + c) · a = (b · a) + (c · a) for all a, b, c ∈ R.
The ring is a commutative ring if a · b = b · a for all a, b ∈ R.
Definition 4.2.4 Finite Field
A finite field is a finite commutative ring in which all elements except the element 0 possess a multiplicative inverse.
A finite field with q elements only exists if q is a prime power, i.e., q = p^m for some positive integer m and prime integer p. p is called the characteristic of the finite field; q is called the order of the finite field.
Theorem 4.2.1 There exists one finite field of order q for every q of the form
q = p^m with p prime, and m an integer, m ≥ 1.
Example 4.1 Are there finite fields GF(q) with q non-prime (e.g., q = 8, 9, 10)?
- Since q = 8 = 2^3, GF(8) = GF(2^3) exists.
- Since q = 9 = 3^2, GF(9) = GF(3^2) exists.
- Since q = 10 = 2 · 5 is not a prime power, GF(10) does not exist.
The most intuitive examples of finite fields are fields of prime order, i.e., fields with m = 1. Elements of the field GF(p) can be represented by the integers 0, 1, ..., p − 1. The two operations of the field are integer addition modulo p and integer multiplication modulo p.
Theorem 4.2.2 Let p be a prime. The integer ring Z_p is denoted as GF(p) and referred to as a prime field, i.e., a Galois field with a prime number of elements. All non-zero elements of GF(p) have an inverse. Arithmetic in GF(p) is done modulo p.
In order to do arithmetic in a prime field, we have to follow the rules for integer rings: addition and multiplication are done modulo p, the additive inverse of any element a is given by −a mod p, and the multiplicative inverse of any non-zero element a is defined by a · a^(−1) = 1. Let's have a look at an example of a very small prime field:
Example 4.2 Arithmetic tables for the finite field GF(3) = {0, 1, 2}

Addition:
 +  | 0  1  2
 0  | 0  1  2
 1  | 1  2  0
 2  | 2  0  1

Additive inverse:
−0 = 0, −1 = 2, −2 = 1

Multiplication:
 ·  | 1  2
 1  | 1  2
 2  | 2  1

Multiplicative inverse:
1^(−1) = 1; 2^(−1) = 2, since 2 · 2 ≡ 1 mod 3
⋄
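The GF(3) tables above follow directly from the modular rules; a small Python sketch (helper names ours) reproduces them for any prime p:

```python
def gf_add(a, b, p):
    return (a + b) % p

def gf_mul(a, b, p):
    return (a * b) % p

def gf_neg(a, p):
    # additive inverse: -a mod p
    return (-a) % p

def gf_inv(a, p):
    # multiplicative inverse of a != 0 for prime p,
    # via Fermat's little theorem: a^(p-2) = a^(-1) mod p
    return pow(a, p - 2, p)
```

For instance, gf_inv(2, 3) returns 2, matching the table (2 · 2 ≡ 1 mod 3).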
In AES the finite field contains 256 elements and is denoted as GF(2^8). Each of the elements of this field can be represented by one byte. For the S-box and MixColumn transforms, AES treats every byte of the internal data path as an element of the field GF(2^8) and manipulates the data by performing arithmetic in this finite field. However, if the order of a finite field is not prime (and 2^8 is clearly not a prime), the addition and multiplication operations cannot be represented by addition and multiplication of integers modulo p^m. Such fields with m > 1 are called extension fields. What is needed here is (1) a different notation for field elements and (2) different rules for performing arithmetic with the elements. We will see in the following that elements of extension fields can be represented as polynomials and that computation in the extension field is achieved by performing polynomial arithmetic.
4.2.2 Element Representation in Extension Fields GF(p^m)
In extension fields GF(p^m), elements are not represented as integers but as polynomials with coefficients in GF(p). The polynomials have a maximum degree of m − 1, so that there are m coefficients in total for every element. In the field GF(2^8), which is used in AES, each element A ∈ GF(2^8) is thus represented as
A(x) = a_7 x^7 + ... + a_1 x + a_0,   a_i ∈ GF(2) = {0, 1}.
Note that there are exactly 256 = 2^8 such polynomials. The set of these 256 polynomials is the finite field GF(2^8). It is also important to observe that every polynomial can simply be stored in digital form as an eight-bit vector
A = (a_7, a_6, a_5, a_4, a_3, a_2, a_1, a_0).
In particular, we do not have to store the factors x^7, x^6, etc. It is clear from the positions of the bits to which factors x^i the coefficients belong.
Theorem 4.2.3 The elements of every finite field GF(q^m) can be represented by polynomials of maximum degree m − 1 with coefficients from GF(q):
A(x) = a_{m−1} x^{m−1} + ... + a_1 x + a_0,   A ∈ GF(q^m), a_i ∈ GF(q),
where the element
0 · x^{m−1} + ... + 0 · x + 0 = 0
is the additive identity and
0 · x^{m−1} + ... + 0 · x + 1 = 1
is the multiplicative identity.
Note that polynomials with coefficients from GF(p) are called polynomials over GF(p). Moreover, the set of all polynomials over GF(p) is denoted by Z_p[x], if p is prime.
Example 4.3 Consider GF(2^3), i.e., GF(q) = GF(2) = {0, 1} and m = 3:
A(x) = a_2 x^2 + a_1 x + a_0

Nr. | a_2 a_1 a_0 | polynomial          | FF element
 1  |  0   0   0  | 0·x^2 + 0·x + 0     | 0
 2  |  0   0   1  | 0·x^2 + 0·x + 1     | 1
 3  |  0   1   0  | 0·x^2 + 1·x + 0     | x
 4  |  0   1   1  | 0·x^2 + 1·x + 1     | x + 1
 5  |  1   0   0  | 1·x^2 + 0·x + 0     | x^2
 6  |  1   0   1  | 1·x^2 + 0·x + 1     | x^2 + 1
 7  |  1   1   0  | 1·x^2 + 1·x + 0     | x^2 + x
 8  |  1   1   1  | 1·x^2 + 1·x + 1     | x^2 + x + 1

GF(2^3) = {0, 1, x, x + 1, x^2, x^2 + 1, x^2 + x, x^2 + x + 1}
As one can see, eight different elements (polynomials) exist for GF(2^3).
⋄
4.3 Addition and Subtraction in GF(p^m)
When one looks at addition and subtraction in extension fields, it turns out that these operations are straightforward. They are simply performed in a coefficient-wise manner: we add or subtract coefficients with equal powers of x. The actual addition or subtraction is done in the underlying field GF(p).
Definition 4.3.1 Extension field addition and subtraction
Let A(x), B(x) ∈ GF(p^m). The sum of the two elements is then computed according to
C(x) = A(x) + B(x) = Σ_{i=0}^{m−1} c_i x^i,   c_i ≡ a_i + b_i mod p,
and the difference is computed according to
C(x) = A(x) − B(x) = Σ_{i=0}^{m−1} c_i x^i,   c_i ≡ a_i + (−b_i) mod p.
Let's have a look at an example in the field GF(2^8) which is used in AES.
Example 4.4 Here is how the sum C(x) = A(x) + B(x) of two elements from GF(2^8) is computed:
A(x) = x^7 + x^6 + x^4 + 1
B(x) =             x^4 + x^2 + 1
C(x) = x^7 + x^6       + x^2
That is, the addition is done coefficient-wise modulo 2.
⋄
Note that if p = 2 we perform modulo-2 addition (or subtraction) with the coefficients, and that addition modulo 2 is equivalent to a bit-wise XOR. Moreover, addition and subtraction are the same operation in this case. Hence, if we compute the difference of the two polynomials A(x) − B(x) from the example above, we obtain the same result as for the sum. Note also that no carries appear for addition in binary finite fields.
Remark: Subtraction in GF(2^m) is identical to addition:
C(x) = A(x) − B(x)
c_i ≡ a_i − b_i mod 2
c_i ≡ a_i − b_i + (b_i + b_i) mod 2
c_i ≡ a_i + (−b_i + b_i) + b_i ≡ a_i + b_i mod 2
Example 4.5
A(x) − B(x) = (x^7 + x^6 + x^4 + 1) − (x^4 + x^2 + 1)
            = x^7 + x^6 + (1 − 1)·x^4 + x^2 + (1 − 1)
            = x^7 + x^6 + x^2
⋄
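On the bit-vector representation, this coefficient-wise mod-2 addition/subtraction is a single XOR. Reusing the polynomials of Example 4.4 in Python:

```python
A = 0b11010001          # A(x) = x^7 + x^6 + x^4 + 1
B = 0b00010101          # B(x) = x^4 + x^2 + 1
C = A ^ B               # coefficient-wise addition (= subtraction) mod 2
assert C == 0b11000100  # C(x) = x^7 + x^6 + x^2
```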
4.4 Multiplication in GF(p^m)
Multiplication in GF(2^8) is the core operation of the MixColumn transformation of AES. In a first step, two elements (represented by their polynomials) of a finite field GF(p^m) are multiplied using the standard polynomial multiplication rule, with coefficient arithmetic done in GF(p):
A(x) · B(x) = (a_{m−1} x^{m−1} + ... + a_0) · (b_{m−1} x^{m−1} + ... + b_0)
C'(x) = c'_{2m−2} x^{2m−2} + ... + c'_0,
where
c'_0 = a_0 b_0 mod p
c'_1 = a_0 b_1 + a_1 b_0 mod p
  ...
c'_{2m−2} = a_{m−1} b_{m−1} mod p.
In general, this polynomial will have a degree higher than m − 1 and has to be reduced. In order to change the representative of the equivalence class in GF(p)[x], we take a reduction approach similar to what we did in the case of multiplication in prime fields: in GF(p), we multiply the two integers, divide the result by a prime, and consider only the remainder. This is what we are doing here too: the product is divided by an irreducible polynomial and we consider only the remainder after the polynomial division. Irreducible polynomials are polynomials with coefficients from GF(p) which do not factor (except for the trivial factor involving 1) into smaller polynomials over GF(p). Thus, every finite field GF(p^m) is constructed by an irreducible polynomial P(x) of degree m with coefficients over GF(p). That is, P(x) is needed for doing actual arithmetic in the finite field.
Definition 4.4.1 Extension field multiplication
Let A(x), B(x) ∈ GF(p^m) and let
P(x) = Σ_{i=0}^{m} p_i x^i,   p_i ∈ GF(p),
be an irreducible polynomial, i.e., one that does not factor into lower-degree polynomials with coefficients in GF(p) (except for the trivial factor 1). Finite field multiplication of the two elements A(x), B(x) is performed as
C(x) = A(x) · B(x) mod P(x).
Roughly speaking, irreducible polynomials can be thought of as analogues of prime numbers. Note that the coefficient p_m can always be chosen to be 1 by normalization. Note further that p_0 is always non-zero, because otherwise the reduction polynomial could be factored as x · (x^{m−1} + p_{m−1} x^{m−2} + ... + p_2 x + p_1).
Example 4.6 We want to multiply the two polynomials
A(x) = x^3 + x^2 + 1
and
B(x) = x^2 + x
in the field GF(2^4). The irreducible polynomial of this Galois field is given as
P(x) = x^4 + x + 1.
The plain polynomial product is computed as
C'(x) = A(x) · B(x) = x^5 + x^3 + x^2 + x.
We can now divide C'(x) by P(x) using the polynomial division method we learned in school (or at the beginning of college). However, sometimes it is easier to reduce each of the leading terms x^4 and x^5 individually:
x^4 = 1 · P(x) + (x + 1)
x^4 ≡ x + 1 mod P(x)
x^5 ≡ x^2 + x mod P(x).
Now we only have to insert the reduced expression for x^5 into the intermediate result C'(x):
C(x) ≡ C'(x) mod P(x)
C(x) ≡ (x^2 + x) + (x^3 + x^2 + x) = x^3
A(x) · B(x) ≡ x^3.
Note: Recall that the polynomials are normally stored as bit vectors in computers. If we look at the element representation for the multiplication from this example, the following operation is being performed on the bit level:
A · B = C
(x^3 + x^2 + 1) · (x^2 + x) = x^3
(1 1 0 1) · (0 1 1 0) = (1 0 0 0)
This demonstration should make clear that finite field multiplication is entirely different from conventional integer multiplication.
⋄
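The reduce-as-you-go strategy of this example can be turned into a generic GF(2^m) multiplication routine. The following Python sketch (function name ours) stores polynomials as integer bit vectors:

```python
def gf2m_mul(a, b, m, p):
    """Multiply a and b in GF(2^m); a, b, p are integer bit vectors,
    p encoding the irreducible polynomial P(x) including its x^m term."""
    c = 0
    while b:
        if b & 1:
            c ^= a          # add the current multiple of a
        b >>= 1
        a <<= 1             # a(x) = x * a(x)
        if a >> m:          # degree reached m: subtract (XOR) P(x)
            a ^= p
    return c

# Example 4.6: (x^3 + x^2 + 1)(x^2 + x) mod (x^4 + x + 1) = x^3
assert gf2m_mul(0b1101, 0b0110, 4, 0b10011) == 0b1000
```

As a cross-check, the routine also reproduces the well-known AES example {57} · {83} = {C1} in GF(2^8) with P(x) = x^8 + x^4 + x^3 + x + 1.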
Note that not all polynomials are irreducible. Here is a counterexample: the polynomial x^4 + x^3 + x + 1 is reducible, since
x^4 + x^3 + x + 1 = (x^2 + x + 1) · (x^2 + 1),
and, hence, cannot be used to construct the extension field GF(2^4).
Remarks:
1. In practice, the irreducible polynomial can often be taken from tables.
2. For factoring polynomials over finite fields, one can use Berlekamp's Q-matrix algorithm, which is presented on page 124 of [Men97].
3. In practice, two special polynomial forms are often used:
   (a) Trinomial: x^m + x^n + 1, m > n > 0. For m = 8·y, y ∈ N, a trinomial is never irreducible.
   (b) Pentanomial: x^m + x^{n_3} + x^{n_2} + x^{n_1} + 1, m > n_3 > n_2 > n_1 > 0.
Note that a polynomial with an even number of non-zero coefficients is never irreducible because it has the root 1.
4.5 Inversion in Finite Fields
The zero element does not have a multiplicative inverse. However, in the case of AES, the SubBytes transformation maps the zero element to itself.
4.5.1 Introduction
Computing the multiplicative inverse of an element of some finite field F_q occurs frequently in cryptography. Two cases are of particular importance.
1. The inverse of a (non-zero) element a of F_p, i.e., an integer a^(−1) such that
a · a^(−1) ≡ 1 mod p.
This operation is needed in almost every public-key scheme, such as the Digital Signature Algorithm or elliptic curve protocols.
2. The inverse of a (non-zero) element of an arbitrary finite field GF(p^m) with corresponding irreducible reduction polynomial P(x). The inverse A^(−1) of A ∈ GF(p^m), where A ≠ 0, is defined by
A^(−1)(x) · A(x) ≡ 1 mod P(x).
This operation is particularly relevant in modern cryptography (i) if an elliptic curve system is defined over GF(2^m) and (ii) for the computation of S-boxes in block ciphers such as AES.
There are several methods for computing inverses, each with its own advantages. Here is an overview:
Extended Euclidean Algorithm and its Variants
This is a general method and often, especially for software implementations and large finite fields, the fastest inversion algorithm. The method computes a linear combination of an element a and the modulus m such that
s · a + t · m = 1,
from which it follows that the inverse of a is given as a^(−1) ≡ s mod m. The following section will deal with this method in detail.
Fermat's Theorem
Fermat's theorem (often also referred to as Fermat's little theorem) achieves inversion through exponentiation. It requires many more computations than the approaches based on the Euclidean algorithm but has a much simpler structure. Thus, it is sometimes preferred in situations where code size is limited (e.g., on small smartcard processors) or when hardware resources are limited. Inversion in this case is based on Fermat's theorem, according to which
A^(q−1) = 1
for all A ∈ F_q^*. The inverse now follows from
A · A^(q−2) = 1
as A^(−1) = A^(q−2).
Table Look-Up
A straightforward method is to compute a table which contains the inverse of every element a ∈ F_q^*. Inversion is then accomplished by a single table look-up. This is an extremely fast method. Unfortunately, the number of table entries is equal to the field order, which grows exponentially with the bit length of the field order. In practice this method is only applied to small fields with up to approximately 2^16 elements. An important example is the inversion in the AES S-box, which is virtually always realized through the table method.
Specialized Methods
Particularly for extension fields GF(2^m) there exist specific inversion methods which can be advantageous in certain situations. Three of these methods are:
Itoh-Tsujii Inversion
This method computes special addition chains for Fermat's theorem [GP01].
Subfield Inversion
If a field F_q has subfields, this method reduces inversion in the large field F_q to inversion in a smaller subfield. This method is, for instance, sometimes used for hardware implementations of the AES S-boxes [GP01].
Direct Inversion
In GF(2^m), inversion can be performed by solving a system of equations. If this system is precomputed, inversion can be done through closed formulae. Unfortunately, these closed expressions are very complex even for moderately large fields. Hence, in practice this method can only be used for very small fields, e.g., fields with up to 16 elements.
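As a concrete illustration of the Fermat approach, here is a Python sketch (function names ours) that inverts via A^(q−2) with square-and-multiply; a simple bit-vector multiplication is included to keep the sketch self-contained:

```python
def gf2m_mul(a, b, m, p):
    # bit-vector multiplication in GF(2^m) with reduce-as-you-go;
    # p encodes the irreducible polynomial including its x^m term
    c = 0
    while b:
        if b & 1:
            c ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= p
    return c

def gf2m_inv_fermat(a, m, p):
    """Inverse of a != 0 in GF(2^m) as a^(2^m - 2), square-and-multiply."""
    result, e = 1, (1 << m) - 2        # exponent q - 2
    while e:
        if e & 1:
            result = gf2m_mul(result, a, m, p)
        a = gf2m_mul(a, a, m, p)       # repeated squaring
        e >>= 1
    return result
```

For instance, gf2m_inv_fermat(0b100, 3, 0b1011) returns 0b111, i.e., (x^2)^(−1) = x^2 + x + 1 in GF(2^3) with P(x) = x^3 + x + 1, matching Example 4.7 below.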
4.5.2 Example Modular Inversions
The extended Euclidean algorithm (EEA) can be used to compute the inverse of non-zero elements of
(1) prime fields F_p,
(2) integer rings Z_m,
(3) extension fields GF(2^m) and, more generally, any extension field GF(q^m).
Modular integer inversion, i.e., the two former cases, is of utmost importance for public-key schemes such as RSA, elliptic curves or Diffie-Hellman key exchange.
We will give an example of how to find the inverse with the extended Euclidean algorithm and with the table look-up method. The inverse A^(−1) can be found using the extended Euclidean algorithm with A(x) and P(x) as input:
s(x) · A(x) + t(x) · P(x) = gcd(P(x), A(x)) = 1
s(x) · A(x) ≡ 1 mod P(x)
s(x) = A^(−1)(x)
Example 4.7 Inverse of x^2 ∈ GF(2^3), with P(x) = x^3 + x + 1.

GCD Equation                          | Inverse Calculation
                                      | t_0 = 0, t_1 = 1
x^3 + x + 1 = [x] · x^2 + (x + 1)     | t_2 = t_0 − q_1 t_1 = 0 − q_1 · 1 = −x = x
x^2 = [x + 1] · (x + 1) + 1           | t_3 = t_1 − q_2 t_2 = 1 − (x + 1) · x = x^2 + x + 1
x + 1 = [x + 1] · 1 + 0               |
Proof: t_3 is the inverse of x^2 modulo P(x), since
t_3 · x^2 = (x^2 + x + 1) · x^2 = x^4 + x^3 + x^2
          ≡ (x^2 + x) + (x + 1) + x^2
          = 2 · x^2 + 2 · x + 1
          ≡ 1 mod P(x),
where x^3 ≡ x + 1 mod P(x) and x^4 ≡ x^2 + x mod P(x).
⋄
Remark: In every iteration of the Euclidean algorithm, division is used (which is not shown above) to uniquely determine q_i and r_i.
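For the integer case, the same EEA bookkeeping (only the t_i column is needed) looks as follows in Python (function name ours):

```python
def eea_inverse(a, m):
    """Inverse of a modulo m via the extended Euclidean algorithm.
    t_i tracks the coefficient of a in s_i*m + t_i*a = r_i."""
    r0, r1 = m, a % m
    t0, t1 = 0, 1
    while r1 != 0:
        q = r0 // r1                    # division step determines q_i, r_i
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
    if r0 != 1:
        raise ValueError("a is not invertible modulo m")
    return t0 % m
```

For instance, eea_inverse(5, 7) returns 3, since 5 · 3 ≡ 1 mod 7.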
The field GF(2^8) used for AES is so small that pre-computed look-up tables with all 256 inverses are readily available. Table 4.1 shows such a table with all inverse values in GF(2^8) modulo P(x) = x^8 + x^4 + x^3 + x + 1, in hexadecimal notation.
X\Y  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
 0  00 01 8D F6 CB 52 7B D1 E8 4F 29 C0 B0 E1 E5 C7
 1  74 B4 AA 4B 99 2B 60 5F 58 3F FD CC FF 40 EE B2
 2  3A 6E 5A F1 55 4D A8 C9 C1 0A 98 15 30 44 A2 C2
 3  2C 45 92 6C F3 39 66 42 F2 35 20 6F 77 BB 59 19
 4  1D FE 37 67 2D 31 F5 69 A7 64 AB 13 54 25 E9 09
 5  ED 5C 05 CA 4C 24 87 BF 18 3E 22 F0 51 EC 61 17
 6  16 5E AF D3 49 A6 36 43 F4 47 91 DF 33 93 21 3B
 7  79 B7 97 85 10 B5 BA 3C B6 70 D0 06 A1 FA 81 82
 8  83 7E 7F 80 96 73 BE 56 9B 9E 95 D9 F7 02 B9 A4
 9  DE 6A 32 6D D8 8A 84 72 2A 14 9F 88 F9 DC 89 9A
 A  FB 7C 2E C3 8F B8 65 48 26 C8 12 4A CE E7 D2 62
 B  0C E0 1F EF 11 75 78 71 A5 8E 76 3D BD BC 86 57
 C  0B 28 2F A3 DA D4 E4 0F A9 27 53 04 1B FC AC E6
 D  7A 07 AE 63 C5 DB E2 EA 94 8B C4 D5 9D F8 90 6B
 E  B1 0D D6 EB C6 0E CF AD 08 4E D7 E3 5D 50 1E B3
 F  5B 23 38 34 68 46 03 8C DD 9C 7D A0 CD 1A 41 1C
Table 4.1: Multiplicative inverse table in GF(2^8) for byte XY
Example 4.8 From Table 4.1 the inverse of
x^6 + x^2 + 1 = (01000101)_2 = (45)_hex = XY
is given by the fifth row (row number 4) and the sixth column (column number 5):
(31)_hex = (00110001)_2 = x^5 + x^4 + 1.
This is true since
(x^6 + x^2 + 1) · (x^5 + x^4 + 1) ≡ 1 (mod P(x)).
⋄
Chapter 5
Arithmetic in Galois Fields GF(2^m) in Software
This chapter introduces software algorithms for arithmetic in binary Galois fields, i.e., in fields of the form GF(2^m). Binary finite fields are particularly relevant in practice for elliptic curve cryptosystems. The bit lengths needed for ECC are typically in the range of m ≈ 160–512. For background on finite fields, see Chapter 4.
5.1 Field Element Representation
Remember that the elements of GF(2^m) are denoted by the following polynomial representation:¹
A(x) = Σ_{i=0}^{m−1} a_i x^i = a_{m−1} x^{m−1} + ... + a_1 x + a_0;   a_i ∈ GF(2) = {0, 1}
Consider a digital computer with a register width of R bits. Obviously, if R is smaller than m, a GF(2^m) element does not fit into a single computer register. Thus, field elements of GF(2^m) are represented by an array of (at least) t = ⌈m/R⌉ registers that stores all the coefficients. The position of a coefficient in a computer word corresponds to its position in the polynomial representation.
In most cases, the division m/R yields a non-zero remainder. Hence, the last register A[⌈m/R⌉ − 1] is not completely utilized and is filled with zeroes, as illustrated in Fig. 5.1.
¹ Other representations, referred to as bases, are possible. In particular, there are a normal basis and a dual basis representation.
Figure 5.1: Representation of a GF(2^m) element in a computer as an array of t registers with width R
5.2 Field Addition in GF(2^m)
The addition in GF(2^m) can be realized through a simple bit-wise XOR operation of the corresponding registers:
C[i] = A[i] XOR B[i]
In software this is done by a loop with t iterations, as shown in Alg. 5.1. The algorithm has a linear complexity of t = ⌈m/R⌉.
Algorithm 5.1 Addition of two Elements in GF(2^m)
Input: A, B
Output: C = A + B
1. FOR i = 0 TO ⌈m/R⌉ − 1
   1.1 C[i] ← A[i] XOR B[i]
2. RETURN(C)
Note that each individual register can be treated separately because no carries occur. Depending on the register width, R bits are processed in each iteration of the for-loop.
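Algorithm 5.1 is a one-liner on word arrays; a Python sketch (in C, the XOR would act on machine words directly):

```python
def gf2m_add(A, B):
    """Addition in GF(2^m): word-wise XOR of two t-word arrays (Alg. 5.1)."""
    return [a ^ b for a, b in zip(A, B)]
```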
5.3 Field Multiplication in GF(2^m)
While addition in GF(2^m) is straightforward and not costly in software, multiplication is the crucial operation for implementing asymmetric cryptosystems such as elliptic curves. In the following, several software algorithms for field multiplication in GF(2^m) are presented. We start with the shift-and-add method, which forms the basis for the advanced comb and window algorithms introduced subsequently.
5.3.1 Shift-and-Add Multiplication
This algorithm exists in two versions, namely least significant bit (LSB) first and most significant bit (MSB) first multiplication. We focus on the LSB variant here.
We consider two field elements A and B:
A(x) = a_{m−1} x^{m−1} + ... + a_0,   a_i ∈ {0, 1}
B(x) = b_{m−1} x^{m−1} + ... + b_0,   b_i ∈ {0, 1}
and rewrite the multiplication as follows:
A(x) · B(x) mod P(x) = (a_{m−1} x^{m−1} + ... + a_0) · B(x) mod P(x)
= a_0 B + a_1 xB + a_2 x^2 B + a_3 x^3 B + ... + a_{m−1} x^{m−1} B mod P(x)
= a_0 B
+ a_1 (xB mod P(x))
+ a_2 (x[xB mod P(x)] mod P(x))
+ a_3 (x[x^2 B mod P(x)] mod P(x))
  ...
+ a_{m−1} (x[x^{m−2} B mod P(x)] mod P(x))
In every iteration, the shift-and-add algorithm computes one term of the form
a_i (x[x^{i−1} B mod P(x)] mod P(x))    (5.1)
It is important to note that the expression in brackets, i.e., [x^{i−1} B mod P(x)], is always the result from the previous iteration, i.e., it has already been computed. Let's have a detailed look at what has to be done in order to realize Eqn. (5.1).
- First, we multiply the expression [x^{i−1} B mod P(x)] by x, followed by a reduction modulo P(x). The result, i.e., the value (x[x^{i−1} B mod P(x)] mod P(x)), is stored so that it is available as input in the next iteration. This is the shift part of the algorithm, since multiplying a polynomial by x can be viewed as a shift of the polynomial coefficients by one position.
- The second operation we have to perform is to multiply (x[x^{i−1} B mod P(x)] mod P(x)) by the coefficient a_i. Since a_i is a bit, it can be either one or zero. Hence, this multiplication corresponds to the decision whether the polynomial (x[x^{i−1} B mod P(x)] mod P(x)) needs to be added to the intermediate result or not. This second operation is the add part of the algorithm.
The pseudo code of the method is shown in Algorithm 5.2.
Algorithm 5.2 Shift-and-Add Multiplication according to [DH00]
Input: A(x), B(x) ∈ GF(2^m), with the irreducible polynomial P(x)
Output: C(x) = A(x) · B(x) mod P(x)
1. IF a_0 = 1
     C(x) ← B(x)
   ELSE
     C(x) ← 0
2. FOR i = 1 TO m − 1
   2.1 B(x) ← x · B(x) mod P(x)
   2.2 IF a_i = 1 THEN
         C(x) ← C(x) + B(x)
3. RETURN C(x)
Step 2.1 performs the shift part, and Step 2.2 the add part of the multiplication method. We saw above that additions are inexpensive in software; hence Step 2.1 is the most costly one of this algorithm. How is this shift-and-reduce step implemented in software? Let's review what happens in the polynomial representation when we multiply B(x) by x:
multiply B(x) by x:
B(x) = b
0
+ b
1
x + + b
m1
x
m1
x B(x) = b
0
x + b
1
x
2
+ + b
m2
x
m1
+ b
m1
x
m
Since x
m
is not in eld GF(2
m
) in a polynomial representation, we have to reduce the
term b
m1
x
m
modulo P(x). If the irreducible polynomial P(x) is given as:
P(x) = x
m
+
m1

i=0
p
i
x
i
, p
i
0, 1
the term x
m
can be modulo reduced to:
x
m

m1

i=0
p
i
x
i
mod P(x)
and hence:
x · B(x) mod P(x) ≡ p_0 b_{m−1} + [b_0 + p_1 b_{m−1}] x + ... + [b_{m−2} + p_{m−1} b_{m−1}] x^{m−1}
We see that, due to the reduction, the coefficients of the irreducible polynomial can potentially influence each power of x, i.e., each coefficient of the shifted B(x). In order to keep the computational complexity of the modulo reduction low, it is advantageous to choose so-called sparse irreducible polynomials, for which most coefficients are zero. In practical cryptographic applications, the reduction polynomial is a trinomial or a pentanomial². An irreducible trinomial has three non-zero coefficients:
P(x) = x^m + x^t + 1,   0 < t < m
A pentanomial likewise has five non-zero coefficients. Choosing these special polynomials as P(x) minimizes the cost of the modular reduction; e.g., for the above trinomial we obtain
x · B(x) mod P(x) = b_{m−1} + b_0 x + ... + [b_{t−1} + b_{m−1}] x^t + ... + b_{m−2} x^{m−1}
for the operation x · B(x) mod P(x). It is conjectured that either an irreducible trinomial or an irreducible pentanomial exists for every finite field GF(2^m). Somewhat unfortunately, for finite fields where 8 | m, i.e., m = 8, 16, 24, 32, 40, ..., irreducible trinomials do not exist. However, irreducible pentanomials exist.
What remains, and what is the major disadvantage of realizing the shift-and-add approach in software, are the shifts by just one bit: as depicted in Fig. 5.2, shifting by just one bit affects all registers of the array representing the element B(x). We note that each register has to be shifted, and the MSB of each register, i.e., bits b_31, b_63, etc. on a 32-bit CPU, has to be introduced as carry into the LSB position of the register with the next higher index. This overhead for processing a single bit is repeated in each of the m iterations of the algorithm and constitutes a bottleneck. The resulting complexity of Alg. 5.2 for a particular computer depends on the number of R-bit registers required to represent one element, t = ⌈m/R⌉. For one multiplication, m · t shifts and additions are needed. The complexity of the entire algorithm is approximately c · m · ⌈m/R⌉, where the factor c is a constant, implying that the complexity of the shift-and-add multiplication algorithm grows quadratically, according to m²/R.
The principle of the shift-and-add algorithm is well suited for hardware implementation (cf. Subsection 6.2.1), where shifting an element by one bit is inexpensive, but it is very costly in a software implementation. Next, we introduce improvements of the algorithm which take this handicap into account.
² We note that binomials of the form P(x) = x^m + x^i, 0 ≤ i ≤ m − 1, are always reducible since they are either divisible by x or x + 1.
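Step 2.1, the one-bit shift with reduction, can be sketched on word arrays as follows (Python standing in for register-level C; function name ours, and for simplicity the sketch assumes m is not a multiple of R):

```python
R = 32                      # assumed register width
MASK = (1 << R) - 1

def shift_reduce(B, m, P):
    """Compute x*B(x) mod P(x). B and P are little-endian R-bit word arrays;
    P holds only the low coefficients p_0 .. p_{m-1} of the irreducible
    polynomial. Assumes m % R != 0, so bit m stays inside the array."""
    carry = 0
    for j in range(len(B)):                  # one-bit shift of every word,
        B[j], carry = ((B[j] << 1) | carry) & MASK, B[j] >> (R - 1)
    if (B[m // R] >> (m % R)) & 1:           # b_{m-1} moved up to x^m:
        B[m // R] ^= 1 << (m % R)            # drop the x^m term and
        for j in range(len(P)):
            B[j] ^= P[j]                     # add x^m mod P(x) = sum p_i x^i
    return B
```

For example, with P(x) = x^4 + x + 1 (low part 0b0011), shifting B(x) = x^3 + x^2 + 1 yields x^4 + x^3 + x ≡ x^3 + 1.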
Figure 5.2: Shifting B(x) to x · B(x) on a 32-bit CPU: a one-bit shift affects all registers of the array and is thus costly.
5.3.2 Right-to-Left Comb Method
Unlike the shift-and-add algorithm from the previous section, the comb algorithm does not perform the reduction modulo P(x). It is just a method for performing the plain polynomial multiplication A(x) · B(x). The result of this operation then has to be reduced modulo P(x); an algorithm for the reduction is described in Section 5.3.5.
The core idea of the algorithm is as follows. The shift-and-add algorithm processes the bits of A(x) in the natural order a_0, a_1, ..., a_{m−1}. It turns out that we can dramatically reduce the bit shifts if we process the bits in the order a_0, a_32, a_64, ..., followed by a_1, a_33, a_65, ..., until a_31, a_63, a_95, ...
In the following we assume a register width of R = 32, operand A(x) = Σ_{i=0}^{m−1} a_i x^i, operand B(x) = Σ_{i=0}^{m−1} b_i x^i, and let t = ⌈m/32⌉ be the number of registers needed for storing one field element. We start by rewriting the multiplication:
A(x) · B(x) = (a_0 + a_1 x + ... + a_{(t−1)32+31} x^{(t−1)32+31}) · B(x)
A(x) · B(x) = (a_0 + a_32 x^32 + a_64 x^64 + ... + a_{(t−1)32} x^{(t−1)32}) · B(x)
            + (a_1 + a_33 x^32 + a_65 x^64 + ... + a_{(t−1)32+1} x^{(t−1)32}) · x B(x)
            + (a_2 + a_34 x^32 + a_66 x^64 + ... + a_{(t−1)32+2} x^{(t−1)32}) · x^2 B(x)
              ...
            + (a_31 + a_63 x^32 + ...) · x^31 B(x)
As stated above, the algorithm processes the expression above in a row-like fashion, i.e., first (a_0 + a_32 x^32 + a_64 x^64 + ... + a_{(t−1)32} x^{(t−1)32}) · B(x) is computed, then the second row, etc. Here is how the algorithm unfolds:
IF a_0 = 1 THEN C(x) ← C(x) + B(x)
IF a_32 = 1 THEN C(x) ← C(x) + x^32 B(x)
IF a_64 = 1 THEN C(x) ← C(x) + x^64 B(x)
  ...
IF a_{(t−1)32} = 1 THEN C(x) ← C(x) + x^{(t−1)32} B(x)
B(x) ← x · B(x)
IF a_1 = 1 THEN C(x) ← C(x) + B(x)
IF a_33 = 1 THEN C(x) ← C(x) + x^32 B(x)
  ...
IF a_{(t−1)32+1} = 1 THEN C(x) ← C(x) + x^{(t−1)32} B(x)
B(x) ← x · B(x)
  ...
At this point, we should ask ourselves why this somewhat strange reordering of oper-
ations is computationally better than the straightforward shift-and-add method. The
reason is that the vast majority of operations above are of the form
C(x) + x
i 32
B(x)
This means we have to add a shifted version of B(x) to C(x). However, B(x) is shifted
by multiples of 32 bit, which can be achieved by simply reordering of registers, as shown
in Figure 5.3. It is crucial to compare this with the bit shifts from the shift-and-add
algorithm shown in Figure 5.2. In actual software implementation, the situation is even
better: We never have to do the actual shifts by 32 bits within the array that holds B!
Instead, when computing C(x) +x
i 32
B(x) we simply add each word B[j] to the word
C[j + i].
However, every so often we do have to perform actual bit shifts. These are the steps of the form

B(x) ← x · B(x)

in the expression above. They are, just as in the shift-and-add algorithm, slow, because they require actually shifting all words of the B-array by one position and taking care of the carries between words. However, we only have to perform such a bit shift 32 times for the entire polynomial multiplication, whereas the shift-and-add algorithm required m such shifts. In elliptic curve cryptography, one of the main application areas for GF(2^m) arithmetic, m is typically in the range of 160-512 bits.

Let's now look at the pseudo code of the comb method. Note that a_i, b_i, c_i ∈ {0, 1}.
Figure 5.3: 32-bit word shifting of B(x) to x^32 · B(x). [Figure: the words (b_31 ... b_0), (b_63 ... b_32), ..., (b_{(t-1)·32+31} ... b_{(t-1)·32}) of B stored in registers R_0, R_1, ..., R_{t-1}; for x^32 · B(x) the same words move up by one register position, with R_0 filled with zeros.]
Algorithm 5.3 Right-To-Left Comb Multiplication Method
Input: A(x) = Σ_{i=0}^{m-1} a_i x^i; B(x) = Σ_{i=0}^{m-1} b_i x^i
Output: C(x) = A(x) · B(x) = Σ_{i=0}^{2m-2} c_i x^i
1. C(x) ← 0
2. For k = 0 To 31 Do:
   2.1 For j = 0 To t-1 Do:
       If a_{j·32+k} = 1 Then
           C(x) ← C(x) + x^{j·32} · B(x)
   2.2 If k ≠ 31 Then
       B(x) ← x · B(x)
3. Return C(x)
We note that in every iteration, still only one bit of A is processed. There are about m iterations (m ≈ t · 32), where each iteration consists of about m/R elementary steps for the addition of the two arrays in C(x) + x^{j·32} · B(x). Hence, the total complexity is similar to that of the shift-and-add algorithm; only the constant c has changed: c_1 · m^2/R, where c_1 < c.
5.3.3 Left-to-Right Comb Method
Similar to the right-to-left comb method, a serial computation of the left-to-right comb method would look like:

IF a_31 = 1 THEN C(x) ← C(x) + B(x)
IF a_63 = 1 THEN C(x) ← C(x) + x^32 · B(x)
...
IF a_{(t-1)·32+31} = 1 THEN C(x) ← C(x) + x^{(t-1)·32} · B(x)
C(x) ← x · C(x)
IF a_30 = 1 THEN C(x) ← C(x) + B(x)
IF a_62 = 1 THEN C(x) ← C(x) + x^32 · B(x)
...
IF a_{(t-1)·32+30} = 1 THEN C(x) ← C(x) + x^{(t-1)·32} · B(x)
C(x) ← x · C(x)
...
This, then, results in the following left-to-right comb algorithm.
Algorithm 5.4 Left-To-Right Comb Multiplication Method
Input: A(x) = Σ_{i=0}^{m-1} a_i x^i; B(x) = Σ_{i=0}^{m-1} b_i x^i
Output: C(x) = A(x) · B(x) = Σ_{i=0}^{2m-2} c_i x^i
1. C(x) ← 0
2. For k = 31 DownTo 0 Do:
   2.1 For j = 0 To t-1 Do:
       If a_{j·32+k} = 1 Then
           C(x) ← C(x) + x^{j·32} · B(x)
   2.2 If k ≠ 0 Then
       C(x) ← x · C(x)
3. Return C(x)
The algorithm has the same complexity as the right-to-left comb method.
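The left-to-right variant differs only in that the accumulator, not B(x), is shifted; a matching Python sketch (same integer encoding as before, our own code):

```python
def comb_ltr_mul(a, b, m, R=32):
    """Left-to-right comb multiplication in GF(2)[x] (Algorithm 5.4),
    without reduction.  Instead of shifting B(x), the accumulator is
    multiplied by x once per outer iteration."""
    t = -(-m // R)
    c = 0
    for k in range(R - 1, -1, -1):      # k = 31 down to 0
        for j in range(t):
            if (a >> (j * R + k)) & 1:
                c ^= b << (j * R)       # C(x) <- C(x) + x^(j*R) * B(x)
        if k != 0:
            c <<= 1                     # C(x) <- x * C(x)
    return c
```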
5.3.4 Window-Based Multiplication
The following question motivates window-based multiplication:

    Can we somehow reduce the number of iterations to about m/w (similar to the window methods for exponentiation)?

The idea is to form digits or windows of w bits. However, this approach requires some precomputation. Consider (a_i + a_{i+1} x + ... + a_{i+w-1} x^{w-1}) · B(x), where w is called the window size. The window can take 2^w different values depending on the bits a_i. Hence, we have to precompute a table with all possible window polynomials times the operand B(x). The table is shown in Figure 5.4.
Figure 5.4: Precomputation table of B(x) times all possible window polynomials. [Figure: 2^w entries R_0, ..., R_{2^w - 1} of m + w bits each: 0, B(x), x · B(x), (x+1) · B(x), ..., (x^{w-1} + ... + 1) · B(x).]
Note that each table entry has a length of (m + w) bits. For instance, for GF(2^163) and w = 4, each entry consists of 167 bits. In the following, we use this notation:

GF(2^m)    : binary finite Galois field
R = 32     : register width of the processor
t = ⌈m/32⌉ : number of 32-bit words for one field element
w = 4      : digit size (or window size)
s = ⌈m/4⌉  : number of 4-bit digits for one field element
We can view operand A(x) as a vector of w-bit (here: w = 4) digits such that

A = (a_{m-1} a_{m-2} a_{m-3} a_{m-4}) ... (a_7 a_6 a_5 a_4) (a_3 a_2 a_1 a_0)
         digit a_{s-1}                        a_1                a_0

Then, 8 digits a_i form a 32-bit word such that

A = (a_{s-1} ... a_{s-8}) ... (a_15 ... a_8) (a_7 ... a_0)
       32 bit, A[t-1]             A[1]           A[0]

The basic idea of the algorithm is to multiply the digits a_i with B(x) in the following order:

a_7 · B(x), a_15 · B(x), ..., a_{s-1} · B(x), a_6 · B(x), ...

Note that the digit a_{s-1} may consist only of zeros, because m is prime, e.g., 163. If m = 163, then the word A[t-1] is sparse, because only its first three bits are used.
In Table 5.1 we show the first few iterations of the algorithm.

Iteration   Arithmetic
0)          C(x) ← 0
0 a)        (a_31 x^3 + a_30 x^2 + a_29 x + a_28) · B(x) = B_{a_7}
0 b)        C(x) ← C(x) + B_{a_7}
1 a)        (a_63 x^3 + a_62 x^2 + a_61 x + a_60) · B(x) = B_{a_15}
1 b)        C(x) ← C(x) + x^32 · B_{a_15}
...
(t-1) a)    (a_{(t-1)·32+31} x^3 + a_{(t-1)·32+30} x^2 + a_{(t-1)·32+29} x + a_{(t-1)·32+28}) · B(x) = B_{a_{s-1}}
(t-1) b)    C(x) ← C(x) + x^{(t-1)·32} · B_{a_{s-1}}
            C(x) ← C(x) · x^4
t a)        (a_27 x^3 + a_26 x^2 + a_25 x + a_24) · B(x) = B_{a_6}
t b)        C(x) ← C(x) + B_{a_6}
(t+1) a)    (a_59 x^3 + a_58 x^2 + a_57 x + a_56) · B(x) = B_{a_14}
(t+1) b)    C(x) ← C(x) + x^32 · B_{a_14}
...
(2t-1) a)   (a_{(t-1)·32+27} x^3 + a_{(t-1)·32+26} x^2 + a_{(t-1)·32+25} x + a_{(t-1)·32+24}) · B(x) = B_{a_{s-2}}
(2t-1) b)   C(x) ← C(x) + x^{(t-1)·32} · B_{a_{s-2}}
            C(x) ← C(x) · x^4

Table 5.1: Window-based multiplication example
Remarks:
(i) The iterations numbered [0a] to [(t-1)b] form the inner loop.
(ii) After step (t-1)b in Table 5.1 all most significant digits have been processed, and we now have to shift by w = 4 bit positions.
(iii) There aren't any real operations in the [a] iterations; these are merely table look-ups.
(iv) In iteration [0b], for instance, the value B_{a_7} needs to be shifted by 4·7 = 28 bit positions to the left within the product C(x). This is achieved by the operation C(x) · x^4, which is applied once at the end of each inner loop, i.e., seven times in total.
Algorithm 5.5 Window-Based Multiplication for GF(2^m)
(for processors with R = 32 and window/digit size w = 4)
Input: A(x) = Σ_{i=0}^{m-1} a_i x^i; B(x) = Σ_{i=0}^{m-1} b_i x^i; t = ⌈m/32⌉
Output: C(x) = A(x) · B(x) = Σ_{i=0}^{2m-2} c_i x^i
1. Precompute a look-up table B_u = u(x) · B(x) for all polynomials u(x) of degree at most 3 (i.e., w - 1).
2. C(x) ← 0
3. For k = 7 DownTo 0 Do:
   3.1 For j = 0 To t-1 Do:
       u ← a_{j·8+k}
       C(x) ← C(x) + B_u · x^{j·32}
   3.2 If k ≠ 0 Then
       C(x) ← C(x) · x^4
4. Return C(x)
The complexity of the window-based multiplication method can be determined as follows.

(i) During the precomputation phase (Step 1 of the algorithm) we need about 2^w non-trivial multiplications of the form

    digit · polynomial

where the digit consists of w bits and the polynomial of m bits, but no accumulation additions.

(ii) In Step 3.1 of the algorithm we have

    (32/w) · ⌈m/32⌉ ≈ m/w

iterations, where each iteration involves one table access and one multi-word addition. Furthermore, we need 32/w shifts by w positions.

If we treat the precomputation as a non-recurring cost, the window method is less complex than both comb methods and the shift-and-add method by a factor of w.
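Algorithm 5.5 can be sketched in Python as follows (our own model with R = 32 and w = 4; operands are again encoded as integers, so digit a_{j·8+k} lives at bit offset j·32 + 4k):

```python
def window_mul(a, b, m, w=4, R=32):
    """Window-based multiplication in GF(2)[x] (Algorithm 5.5), no
    reduction.  Step 1 precomputes B_u = u(x)*B(x) for all 2^w window
    polynomials u(x); the main loop then only performs table look-ups,
    word-aligned additions, and 32/w shifts by w positions."""
    t = -(-m // R)
    table = [0] * (1 << w)              # B_u for u = 0 .. 2^w - 1
    for u in range(1, 1 << w):
        acc = 0
        for bit in range(w):
            if (u >> bit) & 1:
                acc ^= b << bit
        table[u] = acc
    c = 0
    for k in range(R // w - 1, -1, -1): # k = 7 down to 0
        for j in range(t):
            u = (a >> (j * R + k * w)) & ((1 << w) - 1)
            c ^= table[u] << (j * R)    # C(x) <- C(x) + B_u * x^(j*R)
        if k != 0:
            c <<= w                     # C(x) <- C(x) * x^w
    return c
```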
5.3.5 Modular Reduction
The previous multiplication algorithms only performed the multiplication itself, without modulo reduction. We now provide an efficient algorithm for the case that the field polynomial is sparse (recall that for all fields GF(2^m), 1 < m < 10000, irreducible trinomials or pentanomials exist). In the following explanations we work in GF(2^163) and use the irreducible polynomial P(x) = x^163 + x^7 + x^6 + x^3 + 1 such that

x^163 ≡ x^7 + x^6 + x^3 + 1 mod P(x)   (5.2)
where all 4 reduction coefficients lie within one computer word. Now, let

C(x) = c_324 x^324 + ... + c_163 x^163 + c_162 x^162 + ... + c_0

then C(x) can be represented as an array of 32-bit words such that

C(x) = x^320 C[10] + x^288 C[9] + ... + x^32 C[1] + C[0]   (5.3)

where each C[i] is a 32-bit polynomial. Let's look at the reduction of the word C[9] = c_288 + c_289 x + ... + c_319 x^31:

x^288 C[9] = c_288 x^288 + c_289 x^289 + ... + c_319 x^319 = x^163 (c_288 x^125 + ... + c_319 x^156)
Now we use equation (5.2):

x^288 C[9] ≡ (x^7 + x^6 + x^3 + 1) (c_288 x^125 + ... + c_319 x^156) mod P(x)
           ≡ x^132 (c_288 + c_289 x + ... + c_319 x^31)
           + x^131 (c_288 + c_289 x + ... + c_319 x^31)
           + x^128 (c_288 + c_289 x + ... + c_319 x^31)
           + x^125 (c_288 + c_289 x + ... + c_319 x^31) mod P(x)

and

x^288 C[9] ≡ x^132 C[9] + x^131 C[9] + x^128 C[9] + x^125 C[9] mod P(x)   (5.4)
If we compare equations (5.3) and (5.4), we see that the term x^288 C[9] can be reduced by adding the word C[9] to the lower part of the polynomial C(x) at the positions x^132, x^131, x^128 and x^125. We can develop a similar procedure for the words C[10], C[8], C[7] and C[6]. Note that C[5] is a special case:

x^160 C[5] = c_191 x^191 + ... + c_163 x^163 + c_162 x^162 + c_161 x^161 + c_160 x^160

The last three coefficients do not have to be reduced. Thus we can define

x^160 C'[5] = x^160 (c_191 x^31 + ... + c_163 x^3)
where C'[5] can be reduced as described above. Modulo reduction has linear complexity for low-weight irreducible polynomials and is much faster than multiplication.
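The folding described by equations (5.2)-(5.4) can be illustrated at the bitstring level (our own Python sketch; it folds all coefficients above x^162 at once rather than word by word, but it uses exactly the substitution of equation (5.2)):

```python
M = 163
R_POLY = (1 << 7) | (1 << 6) | (1 << 3) | 1   # x^7 + x^6 + x^3 + 1
LOW = (1 << M) - 1                            # mask for degrees < 163

def reduce163(c):
    """Reduce a polynomial of degree <= 324 modulo
    P(x) = x^163 + x^7 + x^6 + x^3 + 1.

    Every pass replaces the high part H(x)*x^163 by
    H(x)*(x^7 + x^6 + x^3 + 1), per equation (5.2); because the
    reduction polynomial is sparse, two passes always suffice."""
    while c >> M:
        high = c >> M
        c = (c & LOW) ^ high ^ (high << 3) ^ (high << 6) ^ (high << 7)
    return c
```

For instance, reduce163(1 << 163) returns exactly the four-term pattern x^7 + x^6 + x^3 + 1.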
5.3.6 Squaring
Squaring in GF(2^m) is entirely different from general multiplication and of lower complexity, as we will see now. Let's start with an example.

Example 5.1 A(x) ∈ GF(2^3)

A(x) = a_2 x^2 + a_1 x + a_0
A(x)^2 = [a_2 a_2] x^4 + [a_1 a_2 + a_2 a_1] x^3 + [a_0 a_2 + a_2 a_0 + a_1 a_1] x^2
       + [a_0 a_1 + a_1 a_0] x + [a_0 a_0]

Since a_i a_j + a_j a_i = 2 a_i a_j = 0 in GF(2), it follows that

A(x)^2 = a_2^2 x^4 + a_1^2 x^2 + a_0^2.

Since a_i^2 = a_i in GF(2),

A(x)^2 = a_2 x^4 + a_1 x^2 + a_0.
This property, i.e., raising an element A(x) ∈ GF(q^m) to the q-th power, is called the Frobenius automorphism. Thus, in general, for a squaring of A(x) ∈ GF(2^m):

A(x)^2 = a_{m-1} x^{2(m-1)} + ... + a_1 x^2 + a_0

The squaring itself is thus just a spreading of the coefficients, without a full polynomial multiplication. Nonetheless, a subsequent modulo reduction is required. Squaring is much faster than multiplication, which is important whenever exponentiations are performed, since these consist mostly of squarings. The spreading can be accelerated through a simple table look-up: a table of 512 bytes can be precomputed that converts an 8-bit value into its expanded 16-bit counterpart.
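The coefficient spreading with an 8-to-16-bit table can be sketched as follows (our own Python model; the table maps each byte to the value with its bits interleaved with zeros):

```python
# 256-entry table: bit i of a byte moves to bit 2i (the "spread" value).
SQR_TAB = []
for v in range(256):
    s = 0
    for i in range(8):
        if (v >> i) & 1:
            s |= 1 << (2 * i)
    SQR_TAB.append(s)

def gf2_square(a, nbytes):
    """Square A(x) in GF(2)[x] by spreading its coefficients byte by
    byte: each term a_i x^i contributes a_i x^(2i).  A modular
    reduction must still follow this step."""
    c = 0
    for j in range(nbytes):
        c |= SQR_TAB[(a >> (8 * j)) & 0xFF] << (16 * j)
    return c
```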
Chapter 6
Arithmetic in Galois Fields GF(2^m) in Hardware

In this chapter we explain how Galois field arithmetic is realized in hardware. Some of the discussed circuits were developed and used by NASA (National Aeronautics and Space Administration) for the flights to the moon. Even though several improvements exist by now, you can still find some of these circuits in your CD player. The following explanations are restricted to GF(2^m), because these fields are the ones mostly used in cryptography. First we explain the basic hardware elements which are used for the implementation of addition and multiplication. One will see that the full arithmetic implementation, including the modulo reduction, is a simple combination of these elements.

We begin with the necessary hardware elements and their symbols. Note that we consider only fields GF(2^m); thus all coefficients in polynomial representation are elements of GF(2). One will see that we can use Boolean operations and components to realize the arithmetic in hardware.

In the next sections we introduce the symbols which are needed for the circuit descriptions. First of all we need a storage element in which we can save one bit. Typically this memory is a flip-flop. Figure 6.1 shows in (1) the symbol typically used in cipher descriptions, where a is the saved bit, and in (2) one possible hardware D-flip-flop symbol. Note that this flip-flop must be internally connected in such a way that the start value is read in only at a Reset event.
6.1 Addition
Note that we are in characteristic 2, and therefore we do not have any carries in an addition. Table 6.1 shows the arithmetic. One can see that it is exactly the XOR operation; thus the addition of two elements is implemented with an XOR gate.

Figure 6.1: The symbols for a D-flip-flop element. [Figure: (1) the box symbol typically used in cipher descriptions, storing the bit a, with one input and one output; (2) a D-flip-flop with inputs D, Clock, Reset and Start-Value, and outputs Q (Output) and negated Q.]

a b | a + b | a XOR b
0 0 |   0   |    0
0 1 |   1   |    1
1 0 |   1   |    1
1 1 |   0   |    0

Table 6.1: Basic addition in GF(2)

Figure 6.2 shows in (1) the symbol mostly used in cryptography and in (2) the IEC 60617-12 symbol for an addition element. For clarity, we will present the circuits with the symbols of (1).

Figure 6.2: The symbols for an addition element. [Figure: (1) an XOR symbol with inputs a, b and output c; (2) the IEC 60617-12 '=1' gate with inputs a, b and output c.]
In cryptography we very seldom want to add only two single coefficients, but rather an expression of the form

C(x) = A(x) + B(x) = Σ_{i=0}^{m-1} a_i x^i + Σ_{i=0}^{m-1} b_i x^i = Σ_{i=0}^{m-1} (a_i + b_i) x^i = Σ_{i=0}^{m-1} c_i x^i.
We'll need to place several addition elements side by side to obtain a complete GF(2^m) adder. Remember that we are in characteristic 2 and do not have any carries: we can therefore construct an adder simply from m parallel XOR gates, as illustrated in Figure 6.3.

The area complexity of this adder is m XOR gates, and it needs one clock cycle for one GF(2^m) addition. One can see that this operation is very fast. In hardware, the GF(2) addition is roughly four times faster or, in other words, needs roughly four times less area than a normal integer addition in hardware. That's because integer addition must treat the carries.
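The software counterpart of this XOR-gate array is a single XOR on machine words (a trivial but instructive Python sketch of our own):

```python
def gf2m_add(a, b):
    """Addition (and, identically, subtraction) in GF(2^m):
    coefficient-wise XOR, with no carries to propagate."""
    return a ^ b
```

For example, (x^3 + x + 1) + (x^2 + x) = x^3 + x^2 + 1.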
Figure 6.3: A GF(2^m) adder. [Figure: m parallel XOR gates; gate i has inputs a_i, b_i and output c_i, for i = 0, ..., m-1.]
6.2 Multiplication
From Table 6.2 one can see that the multiplication in GF(2) is the Boolean AND operation.

a b | a · b | a AND b
0 0 |   0   |    0
0 1 |   0   |    0
1 0 |   0   |    0
1 1 |   1   |    1

Table 6.2: Basic multiplication in GF(2)
The symbols for the hardware element are given in Figure 6.4, where again (1) is the symbol mostly used and (2) the one defined in IEC 60617-12.

Figure 6.4: The symbols for a multiplication element. [Figure: (1) an AND symbol with inputs a, b and output c; (2) the IEC 60617-12 '&' gate with inputs a, b and output c.]
There are several GF(2^m) multiplier architectures. A good classification is according to the time complexity. Table 6.3 gives this classification in order from the fastest to the slowest architecture, where TA stands for time-area (product), D for the digit size, and R for the register width. In the last line of the table we give, as a comparison to the software world, the complexity of a software algorithm such as shift-and-add, here called super-serial.

For the bit-serial and digit-serial architectures two different implementation forms exist. Both can be further subdivided according to the starting position into least significant bit/digit first (LSB/LSD) and most significant bit/digit first (MSB/MSD). The parameter D of the digit-serial architecture can be chosen freely and determines the digit size. This architecture was presented in 1998.
Multiplier    | #CLK   | Gate complexity | TA product | Relevant
bit-parallel  | 1      | O(m^2)          | O(m^2)     | m small, m ≈ 8; mainly for block ciphers, AES
digit-serial  | m/D    | O(D·m)          | O(m^2)     | for public key
bit-serial    | m      | O(m)            | O(m^2)     | for public key, especially ECC
super-serial  | m^2/R  | O(R)            | O(m^2)     |

Table 6.3: Multiplier architecture classification.
6.2.1 Bit-Serial Multiplication
There are two related architectures:
(i) LSB (least significant bit first) multiplier
(ii) MSB (most significant bit first) multiplier

Least Significant Bit Multiplier
For the least significant bit multiplier we start with an explanation of the multiplication by x:

C(x) = x · A(x) = x · (a_{m-1} x^{m-1} + ... + a_1 x + a_0)
     = a_{m-1} x^m + ... + a_1 x^2 + a_0 x
Note that only the term a_{m-1} x^m must be reduced. The reduction is done through the reduction polynomial P(x):

P(x) = x^m + p_{m-1} x^{m-1} + ... + p_1 x + p_0
x^m ≡ p_{m-1} x^{m-1} + ... + p_1 x + p_0 mod P(x)

C(x) = x · A(x) = a_{m-1} (p_{m-1} x^{m-1} + ... + p_1 x + p_0) + a_{m-2} x^{m-1} + ... + a_0 x
     = (a_{m-2} + a_{m-1} p_{m-1}) x^{m-1} + ... + (a_0 + a_{m-1} p_1) x + (a_{m-1} p_0)
     = c_{m-1} x^{m-1} + ... + c_1 x + c_0

where c_0 = a_{m-1} p_0 and c_i = a_{i-1} + a_{m-1} p_i for i = 1, 2, ..., m-1.
Jorge Guajardo: special case of a trinomial (see also [GP01]):

P(x) = x^m + x^t + 1  ⟹  d_0 = a_{m-1}
                          d_t = a_{t-1} + a_{m-1}
                          d_i = a_{i-1},  i ≠ 0, t
Remark:
(i) In cryptography, P(x) is almost always a trinomial or a pentanomial; hence, besides the leading term, there are only 2 or 4 non-zero coefficients p_i, respectively.
(ii) p_0 is always 1, since P(x) must be irreducible.
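In software, one step of this kind, i.e., C(x) = x · A(x) mod P(x) with the formulas above, looks as follows (our own Python sketch; p is P(x) without the leading x^m term):

```python
def mulx(a, p, m):
    """One multiply-by-x step: return x * A(x) mod P(x) in GF(2^m).

    a encodes A(x) (degree < m) as an integer; p encodes
    p_{m-1}x^{m-1} + ... + p_1 x + p_0, i.e., the reduction
    polynomial without its x^m term."""
    a <<= 1                     # multiply by x
    if a >> m:                  # a_{m-1} spilled into the x^m position
        a = (a ^ (1 << m)) ^ p  # replace x^m by p_{m-1}x^{m-1}+...+p_0
    return a
```

With P(x) = x^4 + x + 1 (so p = 0b0011), mulx(0b1000, 0b0011, 4) gives 0b0011, i.e., x · x^3 = x^4 ≡ x + 1.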
The multiplication is realized in hardware through a shift register. By connecting the shift register with a feedback circuit, the modulo reduction is achieved as well. An example is presented in Figure 6.5. Starting with the contents a_0, a_1, ..., a_{m-1}, the flip-flop contents after one clock cycle are d_0, d_1, ..., d_{m-1}.
Figure 6.5: A shift register with feedback circuit. [Figure: flip-flops a_0, a_1, ..., a_{m-2}, a_{m-1} chained together, with feedback taps p_0, p_1, ..., p_{m-1} from the last stage back into each position.]
In the special case of GF(2^m), the coefficients p_i are elements of GF(2), i.e., of the set {0, 1}. For the feedback circuit this means that the connection is closed if p_i = 1 (Figure 6.6 (1)) and open if p_i = 0 (Figure 6.6 (2)).
Figure 6.6: Feedback circuit connection for the special case GF(2^m). [Figure: (1) p_i = 1: the feedback line into the XOR between a_{i-1} and a_i is connected; (2) p_i = 0: the line is open.]
Remark:
(i) Importantly, for trinomials or pentanomials with known, fixed coefficients, only 2 or 4 of the feedback slices are actually present.
(ii) A linear feedback shift register (LFSR) thus performs a multiplication by x together with the modulo reduction.

We now develop the entire multiplier. We start by writing the multiplication in Horner's form:
C(x) = A(x) · B(x) mod P(x)
     = A(x) · Σ_{i=0}^{m-1} b_i x^i mod P(x)
     = b_0 A(x) + b_1 x A(x) + b_2 x^2 A(x) + ... + b_{m-1} x^{m-1} A(x) mod P(x)

C(x) = b_0 A(x)
     + b_1 (x · A(x) mod P(x))
     + b_2 (x · [x · A(x)] mod P(x))
     + b_3 (x · [x^2 · A(x)] mod P(x))
     ...
     + b_{m-1} (x · [x^{m-2} · A(x)] mod P(x))
In every iteration we have to perform the following steps:
1. Multiply the m-bit polynomial (x^{i-1} A(x) mod P(x)) by x and reduce it modulo P(x). This result must be stored for the next iteration.
2. Multiply this result (x^i A(x) mod P(x)) by b_i.
3. Accumulate, i.e., add and store, all results.

For a hardware circuit of the LSB multiplier we have to extend the LFSR circuit from above; see Figure 6.7.
Figure 6.7: General LSB multiplier A · B = C for GF(2^m). [Figure: Step 1 is the LFSR with flip-flops a_0, ..., a_{m-1} and feedback taps p_0, ..., p_{m-1}; Step 2 multiplies each flip-flop output by the serial input B; Step 3 accumulates into the registers c_0, ..., c_{m-1} forming C.]
To give an impression of how such a network looks in IEC 60617-12 notation, we present it in Figure 6.8. Remember that the start values of the flip-flops are read in when, and only when, the signal Reset is set.
Figure 6.8: LSB multiplier in the IEC 60617-12 symbol standard. [Figure: the circuit of Figure 6.7 drawn with D-flip-flops (inputs D, S, R, Clock; outputs Q and negated Q), '=1' XOR gates and '&' AND gates, with global Reset and Clock signals, inputs A and B, and outputs c_0, ..., c_{m-1} forming C.]
Step 1 represents the linear feedback shift register. This part of the circuit drives the signal, and because of its size one has to respect the fan-out effect. One can picture the feedback network as a long series connection of resistor-capacitor networks, which produces a high signal delay.

Step 2 realizes the multiplication by b_i in ascending order b_0, b_1, ..., b_{m-1}.

The last step (Step 3) adds up the partial terms. The accumulation registers are initialized with 0 and contain the result c_0, c_1, ..., c_{m-1} after m clock cycles.
Remark:
The shift registers of the LFSR become initialized with the coecients a
0
, a
1
, . . . , a
m1
.
The other registers become initialized with 0s and will contain the product
coecients c
0
, c
1
, . . . , c
m1
after m clock cycles.
79
6.2 Multiplication
The coecients of operand B(x) have to be fed into the multiplier in a bit-serial
fashion, LSB rst.
Note that the feedback path will be dramatically less complex for a xed trinomial or
pentanomial.
Example 6.1 LSB multiplier
C(x) = A(x) · B(x) mod P(x), P(x) = x^4 + x + 1, GF(2^4)

x · A(x) mod P(x) = p_0 a_{m-1} + Σ_{i=1}^{m-1} (p_i a_{m-1} + a_{i-1}) x^i

here: x · A(x) = a_2 x^3 + a_1 x^2 + (a_3 + a_0) x + a_3 mod P(x)

The LSB multiplier network is given in Figure 6.9. The FF_i are initialized with the a_i, i.e., FF_i = a_i for i = 0, 1, 2, 3.
Figure 6.9: LSB multiplier for P(x) = x^4 + x + 1 in GF(2^4). [Figure: flip-flops FF_0, FF_1, FF_2, FF_3 with feedback from FF_3 into FF_0 and FF_1; serial input b_0, b_1, b_2, b_3; accumulator outputs c_0, c_1, c_2, c_3.]
In Table 6.4 we show the states of every iteration of the LSB multiplier while computing the product A(x) · B(x). We get the following result:

C(x) = (b_0 a_3 + b_1 a_2 + b_2 a_1 + b_3 a_0 + b_3 a_3) x^3
     + (b_0 a_2 + b_1 a_1 + b_2 a_0 + b_2 a_3 + b_3 a_2 + b_3 a_3) x^2
     + (b_0 a_1 + b_1 a_0 + b_1 a_3 + b_2 a_2 + b_2 a_3 + b_3 a_1 + b_3 a_2) x
     + (b_0 a_0 + b_1 a_3 + b_2 a_2 + b_3 a_1)
Most Significant Bit Multiplier

Having discussed the LSB multiplier, we now turn to the most significant bit multiplier; see also [OP99].
clock cycle (j) | FF_0(j)     | FF_1(j)        | FF_2(j)        | FF_3(j)
1               | a_0         | a_1            | a_2            | a_3
2               | a_3         | (a_0 + a_3)    | a_1            | a_2
3               | a_2         | (a_3 + a_2)    | (a_0 + a_3)    | a_1
4               | a_1         | (a_2 + a_1)    | (a_3 + a_2)    | (a_0 + a_3)

c.c.(j) | FF_0(j)·B(x) | FF_1(j)·B(x)     | FF_2(j)·B(x)     | FF_3(j)·B(x)
1       | b_0 a_0      | b_0 a_1          | b_0 a_2          | b_0 a_3
2       | b_1 a_3      | b_1 (a_0 + a_3)  | b_1 a_1          | b_1 a_2
3       | b_2 a_2      | b_2 (a_3 + a_2)  | b_2 (a_0 + a_3)  | b_2 a_1
4       | b_3 a_1      | b_3 (a_2 + a_1)  | b_3 (a_3 + a_2)  | b_3 (a_0 + a_3)

c.c.(j) | c_0(j)    | c_1(j)            | c_2(j)            | c_3(j)
1       | b_0 a_0   | b_0 a_1           | b_0 a_2           | b_0 a_3
2       | +b_1 a_3  | +b_1 (a_0 + a_3)  | +b_1 a_1          | +b_1 a_2
3       | +b_2 a_2  | +b_2 (a_3 + a_2)  | +b_2 (a_0 + a_3)  | +b_2 a_1
4       | +b_3 a_1  | +b_3 (a_2 + a_1)  | +b_3 (a_3 + a_2)  | +b_3 (a_0 + a_3)

Table 6.4: States of the example LSB multiplier.
The equation for the MSB multiplier is unchanged in comparison to the LSB multiplier:

C(x) = A(x) · B(x) mod P(x)

But while we began the multiplication with the least significant bit, we now start the multiplication with the most significant bit of B(x). Therefore we represent B(x) for the MSB multiplier in the following way:

B(x) = b_{m-1} x^{m-1} + b_{m-2} x^{m-2} + b_{m-3} x^{m-3} + ... + b_1 x + b_0
     = (b_{m-1} x + b_{m-2}) x^{m-2} + b_{m-3} x^{m-3} + ... + b_1 x + b_0
     = ((b_{m-1} x + b_{m-2}) x + b_{m-3}) x^{m-3} + ... + b_1 x + b_0
     ...
     = (...((b_{m-1} x + b_{m-2}) x + b_{m-3}) x + ... + b_1) x + b_0
The equation is then evaluated as follows:

C(x) = [(...((b_{m-1} x + b_{m-2}) x + b_{m-3}) x + ... + b_1) x + b_0] · A(x) mod P(x)
     = (...((b_{m-1} A(x)) · x + b_{m-2} A(x)) · x + b_{m-3} A(x)) · x + ... + b_0 A(x) mod P(x)

where the innermost term b_{m-1} A(x) is computed in clock cycle 1, the next bracket (multiply by x, add b_{m-2} A(x)) in clock cycle 2, the next in clock cycle 3, and so on.
In Algorithm 6.1 we define, for educational reasons, the object c[-1], such that the objects c[0] up to c[m-1] correspond to the terms c_0, c_1, ..., c_{m-1}.

Algorithm 6.1 MSB Multiplication Algorithm
Input: A(x), B(x), P(x)
Output: C(x) = A(x) · B(x) mod P(x)
1. c[-1] ← 0
2. c[i] ← 0, i = 0, 1, ..., m-1
3. For i = m-1 DownTo 0 Do:
   3.1 c[j] ← c[j-1] for j = m-1, ..., 0, i.e., C(x) ← x · C(x)
   3.2 C(x) ← C(x) mod P(x)
   3.3 If b_i = 1 Then
       C(x) ← C(x) + A(x)
4. Return C(x)

Note that Step 3.1 of Algorithm 6.1 is a left shift.

One can see in Figure 6.10 that the MSB circuit needs less area than the LSB circuit, because only m flip-flops are needed, in contrast to the 2m of the LSB circuit.
Figure 6.10: General MSB multiplier A · B = C for GF(2^m). [Figure: accumulator flip-flops c_0, c_1, c_2, ..., c_{m-1}; AND gates combining the serial input b_{m-1}, b_{m-2}, ..., b_0 with a_0, ..., a_{m-1}; feedback taps p_0, p_1, p_2, ..., p_{m-1} for the reduction.]
All flip-flops are initialized with 0s. In Table 6.5 we show the contents of the flip-flops named c_0, c_1, ..., c_{m-1}.
6.2.2 Digit-Serial Multiplication
The goal of a digit-serial multiplier is to perform one GF(2^m) multiplication in fewer than m clock cycles.
Clock cycle | c_0                     | c_1                     | ... | c_{m-1}
1           | a_0 b_{m-1}             | a_1 b_{m-1}             | ... | a_{m-1} b_{m-1}
2           | p_0 (a_{m-1} b_{m-1})   | a_0 b_{m-1}             | ... | a_{m-2} b_{m-1}
            | + a_0 b_{m-2}           | + p_1 (a_{m-1} b_{m-1}) |     | + p_{m-1} (a_{m-1} b_{m-1})
            |                         | + a_1 b_{m-2}           |     | + a_{m-1} b_{m-2}
...
m           | p_0 (a_1 b_{m-1}        | p_0 (a_2 b_{m-1}        | ... | a_0 b_{m-1}
            | + p_{m-1} (...))        | + p_{m-1} (...))        |     | + p_{m-1} (a_1 b_{m-2}
            | + ... + a_0 b_0         | + p_1 (...)             |     | + p_{m-2} (...))
            |                         | + ... + a_1 b_0         |     | + ... + a_{m-1} b_0

Table 6.5: General MSB multiplication table
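The MSB multiplier performs one Horner step per clock cycle; a compact software model (our own Python sketch, with integer-encoded polynomials as before):

```python
def msb_multiply(a, b_bits, p, m):
    """Simulate MSB-first bit-serial multiplication (Algorithm 6.1).

    a and p are integers (p = P(x) without its x^m term); b_bits are
    the coefficients of B(x), fed in most significant bit first."""
    c = 0
    for b in b_bits:            # b_{m-1}, b_{m-2}, ..., b_0
        c <<= 1                 # Step 3.1: C(x) <- x * C(x)
        if c >> m:              # Step 3.2: reduce mod P(x)
            c = (c ^ (1 << m)) ^ p
        if b:                   # Step 3.3: add A(x)
            c ^= a
    return c
```

With P(x) = x^4 + x + 1, A = B = x + 1 gives x^2 + 1, as expected for (x + 1)^2 in GF(2^4).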
Basic idea: group D bits of one of the operands, say B, into a digit. In each clock cycle we multiply those D bits with all bits of the other operand A in parallel. Thus, it is expected that the multiplication finishes in approximately ⌈m/D⌉ clock cycles.

This basic idea is similar to the window algorithm of the previous chapter. As with bit-serial multiplication, two principal architectures exist:
(i) LSD (least significant digit first) multiplier
(ii) MSD (most significant digit first) multiplier

In the following, we will develop the LSD multiplier. First we develop the digit representation of B(x). Let D be the digit size (e.g., D = 2, 4, 8, 16 or any other value). B(x) then consists of d = ⌈m/D⌉ digits.
B(x) = 0 + 0 + ... + b_{m-1} x^{m-1} + ...          (digit B_{d-1}(x) · x^{(d-1)D})
     + ...
     + b_{2D-1} x^{2D-1} + ... + b_D x^D            (digit B_1(x) · x^D)
     + b_{D-1} x^{D-1} + ... + b_1 x + b_0          (digit B_0(x))

Note that in general D ∤ m (i.e., D does not divide m); hence the most significant digit B_{d-1}(x) has to be filled with zero coefficients at the leading positions.

An interesting fact is that the values 2, 4, 8 and all other powers of 2 are not an optimal choice for D. This has to do with the fact that we later have to add up D + 1 values. In order to make optimal use of an XOR tree, values such as 3, 7, 15, etc. are best for D.
Each digit B_i is again a polynomial of maximum degree D - 1:

B_i(x) = Σ_{j=0}^{D-1} b_{D·i+j} x^j,           0 ≤ i ≤ d-2
B_i(x) = Σ_{j=0}^{m-1-D(d-1)} b_{D·i+j} x^j,    i = d-1
We look now at the multiplication:

C(x) = A(x) · B(x) mod P(x) = A(x) · Σ_{i=0}^{d-1} B_i x^{iD} mod P(x)

C(x) = A(x) B_{d-1} x^{(d-1)D} + ... + A(x) B_2 x^{2D} + A(x) B_1 x^D + A(x) B_0 mod P(x)   (6.1)
Similar to the case of the bit-serial multiplier, we can rewrite (6.1), yielding a Horner-like expression:

C(x) = [ B_0(x) · A(x)                                        (6.2)
       + B_1(x) · (A(x) · x^D mod P(x))
       + B_2(x) · (A(x) x^D · x^D mod P(x))
       + B_3(x) · (A(x) x^{2D} · x^D mod P(x))
       ...
       + B_{d-1}(x) · (A(x) x^{(d-2)D} · x^D mod P(x)) ] mod P(x)
Note that each line of Equation (6.2) is performed in one clock cycle. Per clock cycle we have to perform the following operations:
(1) Multiply A(x) x^{(i-1)D} by x^D, i.e., shift A(x) x^{(i-1)D} by D positions. The result is reduced modulo P(x) and stored.
(2) Multiply B_i(x) by (A(x) x^{iD} mod P(x)), i.e., a multiplication of a D-bit polynomial by an m-bit polynomial. This operation is very expensive and takes approximately 95% of the whole complexity.
(3) Accumulate, i.e., add-and-store, the intermediate results from Step 2.
(4) Note that a final reduction mod P(x) is necessary after the last iteration.
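The four operations above can be combined into a small behavioural model of the LSD multiplier (our own Python sketch of Equation (6.2); all names are ours, and operands are integer-encoded polynomials):

```python
def lsd_multiply(a, b, p, m, D=2):
    """Least-significant-digit-first multiplication, per Equation (6.2).

    Per clock cycle: Step 2 multiplies the current image of A by the
    next D-bit digit of B, Step 3 accumulates, Step 1 then shifts the
    image by x^D and reduces; Step 4 is the final reduction."""
    P = (1 << m) | p                    # full reduction polynomial
    d = -(-m // D)                      # number of digits
    img = a                             # A(x) * x^(iD) mod P(x)
    acc = 0
    for i in range(d):
        digit = (b >> (i * D)) & ((1 << D) - 1)
        for bit in range(D):            # Step 2: digit times polynomial
            if (digit >> bit) & 1:
                acc ^= img << bit       # Step 3: accumulate
        if i != d - 1:                  # Step 1 for the next cycle
            for _ in range(D):
                img <<= 1
                if img >> m:
                    img ^= P
    while acc >> m:                     # Step 4: acc has degree <= m+D-2
        acc ^= P << (acc.bit_length() - 1 - m)
    return acc
```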
Using Equation (6.2) we can now develop a block diagram for an LSD multiplier, which is presented in Figure 6.11. The multiplier needs the digits B_0, B_1, ..., B_{d-1} as serial inputs, least significant digit first. The multiplier terminates after d clock cycles and is thus faster than the LSB multiplier by a factor of approximately D.
Figure 6.11: LSD multiplier diagram in GF(2^m). [Figure: Step 1 computes A · x^{iD} mod P(x) (m bits; a MUX selects A(x) itself only for the first clock cycle); Step 2 forms B_i · (A(x) x^{iD} mod P(x)) (m+D-2 bits) from the serial inputs B_{d-1}, ..., B_1, B_0; Step 3 is the accumulator; Step 4 is the final mod P(x) operation, to which a DEMUX routes the output C(x) after the last clock cycle.]
Note that Step 2 involves the multiplication of an m-bit polynomial by a D-bit polynomial. The highest degree of the result is (m-1) + (D-1) = m + D - 2. This step is the most costly one in terms of gate complexity.

We will now look at the hardware realization of Steps 1-4. We start with Step 2. Here is an example of a digit-times-polynomial multiplication.

Example 6.2 Step 2
GF(2^4), D = 2
B_0(x) = b_0 + b_1 x
A(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3

B_0 · A(x) = (b_0 + b_1 x) · (a_0 + a_1 x + a_2 x^2 + a_3 x^3)
           = b_0 a_0 + b_0 a_1 x + b_0 a_2 x^2 + b_0 a_3 x^3
           + b_1 a_0 x + b_1 a_1 x^2 + b_1 a_2 x^3 + b_1 a_3 x^4

B_0 · A(x) = [b_0 a_0] + [b_0 a_1 + b_1 a_0] x + [b_0 a_2 + b_1 a_1] x^2 + [b_0 a_3 + b_1 a_2] x^3 + [b_1 a_3] x^4
This arithmetic is accomplished by the circuit in Figure 6.12.

Figure 6.12: Example Step 2 hardware realization of the LSD multiplier. [Figure: inputs a_0, ..., a_3 and b_0, b_1; an AND/XOR network producing the outputs b_0 a_0, b_0 a_1 + b_1 a_0, b_0 a_2 + b_1 a_1, b_0 a_3 + b_1 a_2 and b_1 a_3.]
The hardware complexity for this example of Step 2 is:
- number of multiplications: 2 · 4 = 8 AND gates
- number of additions: 1 · 3 = 3 XOR gates

In general, Step 2 requires m · D AND gates and (D-1)(m-1) XOR gates. The critical path of this circuit is τ_AND + ⌈log_2(D)⌉ · τ_XOR, where τ denotes the signal pass-through time of the respective component.

Why does the factor ⌈log_2(D)⌉ appear in the critical path? We will clear that up with a little example.

Example 6.3 XOR of 4 elements
(1) Linear array: f = ((a + b) + c) + d, built as a chain of XOR gates. This solution needs 3 XORs and has a delay of 3 τ_XOR.

(2) Binary tree: f = (a + b) + (c + d). This improvement also needs 3 XORs, but only a delay of 2 τ_XOR.
Remark: In general we need N - 1 XORs for N elements.

Theorem 6.2.1 (Critical path length)
For digit values D ≤ m - k, the reduction in Step 1 of the LSD multiplier consists of an XOR binary tree, which has a critical path of ⌈log_2(D)⌉.
Now we take a look at the realization of Step 1, which computes A^{(D-1)}(x) · x^D mod P(x), where A^{(D-1)}(x) is some intermediate polynomial of maximum degree m - 1. Let's start with an example.

Example 6.4 Step 1: A^{(D-1)}(x) · x^D mod P(x)
GF(2^4), D = 2
A(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3
x^2 · A(x) = a_0 x^2 + a_1 x^3 + a_2 x^4 + a_3 x^5

To show the importance of the irreducible polynomial, we look at Step 1 with two different irreducible polynomials.
(i) P(x) = x^4 + x + 1

x^4 ≡ x + 1 mod P(x)
x^5 ≡ x^2 + x mod P(x)

x^2 A(x) ≡ a_0 x^2 + a_1 x^3 + a_2 (x + 1) + a_3 (x^2 + x) mod P(x)
         ≡ [a_2] + [a_2 + a_3] x + [a_0 + a_3] x^2 + [a_1] x^3 mod P(x)

Figure 6.13: Circuit for the irreducible polynomial P(x) = x^4 + x + 1 in GF(2^4). [Figure: inputs a_0, ..., a_3; outputs (x^0), ..., (x^3); two parallel '=1' XOR gates forming a_2 + a_3 and a_0 + a_3.]

In Figure 6.13 one can see that the circuit for the irreducible polynomial P(x) = x^4 + x + 1 needs two XOR elements and has a delay of one τ_XOR.
(ii) P(x) = x^4 + x^3 + 1

x^4 ≡ x^3 + 1 mod P(x)
(N)  x^5 ≡ x^4 + x ≡ (x^3 + 1) + x = x^3 + x + 1 mod P(x)

x^2 A(x) ≡ a_0 x^2 + a_1 x^3 + a_2 (x^3 + 1) + a_3 (x^3 + x + 1) mod P(x)
         ≡ [a_2 + a_3] + [a_3] x + [a_0] x^2 + [a_1 + (a_2 + a_3)] x^3 mod P(x)

Note that the sum a_2 + a_3 is needed at the positions x^0 and x^3, and needs to be computed before the complete sum a_1 + (a_2 + a_3) at x^3 can be formed.

This irreducible polynomial is unfavorable for a hardware implementation; see Figure 6.14. Although again only two XOR elements are needed, the delay in this circuit amounts to two times τ_XOR.
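Both cases of Example 6.4 can be checked numerically with a tiny Python sketch (ours) that performs the two shift-and-reduce steps:

```python
def mul_x2_mod(a_bits, P):
    """Compute x^2 * A(x) mod P(x) in GF(2^4).

    a_bits lists the coefficients a_0..a_3; P is the full 5-bit
    irreducible polynomial (e.g. 0b10011 for x^4 + x + 1)."""
    a = sum(bit << i for i, bit in enumerate(a_bits))
    for _ in range(2):          # two multiply-by-x steps
        a <<= 1
        if a >> 4:              # reduce whenever x^4 appears
            a ^= P
    return [(a >> i) & 1 for i in range(4)]
```

For A(x) = x^3 + x (a_bits = [0, 1, 0, 1]), the formulas above predict [a_2, a_2+a_3, a_0+a_3, a_1] = [0, 1, 1, 1] in case (i) and [a_2+a_3, a_3, a_0, a_1+(a_2+a_3)] = [1, 1, 0, 0] in case (ii), and the sketch confirms both.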
As we can see from the example, the time complexity of Step 1 depends on the position of the second-highest non-zero coefficient of P(x). This coefficient was at position 1 in the first example and at position 3 in the second example. In the second example, two iterated modulo reductions occurred in equation (N).

Figure 6.14: Circuit for the irreducible polynomial P(x) = x^4 + x^3 + 1 in GF(2^4). [Figure: inputs a_0, ..., a_3; outputs (x^0), ..., (x^3); two chained '=1' XOR gates: first a_2 + a_3, then a_1 + (a_2 + a_3).]
Theorem 6.2.2 (from [SP98])
Let

P(x) = x^m + p_k x^k + Σ_{j=0}^{k-1} p_j x^j

be the irreducible polynomial, with p_k = 1. For t ≤ m - 1 - k the power x^{m+t} can be reduced mod P(x) in one step.

Proof 6.1

x^{m+t} = x^t · x^m ≡ x^t [p_k x^k + Σ_{j=0}^{k-1} p_j x^j] ≡ p_k x^{t+k} + Σ_{j=0}^{k-1} p_j x^{j+t}   (1)

Since t ≤ m - 1 - k, the degree of (1) is t + k ≤ m - 1. This implies that x^{m+t} as shown in (1) is already completely reduced. □
References

[Bih97] Eli Biham. A Fast New DES Implementation in Software. Technical report, Computer Science Department, Technion - Israel Institute of Technology, 1997.

[BKL+06] A. Bogdanov, L. R. Knudsen, G. Leander, C. Paar, A. Poschmann, M. J. B. Robshaw, Y. Seurin, and C. Vikkelsoe. PRESENT: An Ultra-Lightweight Block Cipher. 2006.

[BLK+07] Andrey Bogdanov, Gregor Leander, Lars R. Knudsen, Christof Paar, Axel Poschmann, Matthew J. B. Robshaw, Yannick Seurin, and Charlotte Vikkelsoe. PRESENT - An Ultra-Lightweight Block Cipher. In CHES 2007, number 4727 in LNCS, pages 450–466. Springer, 2007.

[BP98] D. V. Bailey and C. Paar. Optimal Extension Fields for Fast Arithmetic in Public-Key Algorithms. In H. Krawczyk, editor, Advances in Cryptology - CRYPTO '98, volume 1462 of LNCS, pages 472–485, Berlin, Germany, 1998. Springer-Verlag.

[DH00] D. Hankerson, J. L. Hernandez, and A. Menezes. Software Implementation of Elliptic Curve Cryptography over Binary Fields. In Cryptographic Hardware and Embedded Systems - CHES 2000: Second International Workshop, Worcester, MA, USA, August 2000, Proceedings, volume 1965 of LNCS, pages 243–267. Springer, 2000.

[GP01] Jorge Guajardo and Christof Paar. Itoh-Tsujii Inversion in Standard Basis and Its Application in Cryptography and Codes. Designs, Codes and Cryptography, 25(2):207–216, February 2001.

[Men97] A. Menezes, P. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. CRC Press, 1997. Full text version available at www.cacr.math.uwaterloo.ca/hac.

[OP99] Gerardo Orlando and Christof Paar. A Super-Serial Galois Field Multiplier for FPGAs and Its Application to Public-Key Algorithms. In Kenneth L. Pocek and Jeffrey Arnold, editors, IEEE Symposium on FPGAs for Custom Computing Machines, pages 232–239, Los Alamitos, CA, 1999. IEEE Computer Society Press.

[PP10] Christof Paar and Jan Pelzl. Understanding Cryptography - A Textbook for Students and Practitioners. Springer, 2010. http://www.crypto-textbook.com/.

[RPLP08] Carsten Rolfes, Axel Poschmann, Gregor Leander, and Christof Paar. Security for 1000 GE. In ECRYPT Workshop SECSI - Secure Component and System Identification, 2008.

[Sol99] Jerome A. Solinas. Generalized Mersenne Numbers. Technical report, CACR, 1999.

[SP98] Leilei Song and Keshab K. Parhi. Low-Energy Digit-Serial/Parallel Finite Field Multipliers. J. VLSI Signal Process. Syst., 19(2):149–166, 1998.
Index

Diophantine Equation, 17
EA, see Euclidean algorithm
Euclidean algorithm, 13
extension field
    addition, 49
    irreducible polynomial, 51
    multiplication, 51
    polynomial, 47
    polynomial arithmetic, 47
    reduction, 51
    subtraction, 49
extension field GF(p^m), 48
field
    characteristic, 46
    extension, see extension field
    prime, 47
gcd, see greatest common divisor
greatest common divisor, 13
irreducible polynomial, see extension field