
Chapter 10

RNS in Cryptography

The need to access information securely and to protect it from unauthorized persons is well recognized. These needs can be met by using encryption algorithms and authentication algorithms [1, 2]. Encryption can be achieved using block ciphers or stream ciphers. A block cipher maps fixed-size blocks of data, e.g. 64-bit or 128-bit blocks, into blocks of the same size under the control of a key. Stream ciphers, on the other hand, generate a random sequence of bits which is used to mask the plain-text bit stream (by performing a bit-wise exclusive-OR operation). Several techniques exist for block and stream cipher implementation. These are called symmetric-key systems, since the receiver uses for decryption the same key used for the block or stream cipher at the transmitter. This key has to be made available to the receiver somehow, by previous arrangement or by using key exchange algorithms, e.g. Diffie-Hellman key exchange. The other requirement, authentication of a source, is met using several techniques. The most notable among these is based on the RSA (Rivest-Shamir-Adleman) algorithm, the workhorse of public-key cryptography. As against symmetric-key systems, in this case two keys are needed, known as the public key and the private key. The strength of these systems is derived from the difficulty of factoring large numbers which are products of two big primes.
The RSA algorithm is briefly described next. First, Alice chooses two primes p and q. The product of the primes n = p × q is made public, since it is extremely difficult to find p and q given n. Next, Alice defines φ(n) = (p − 1) × (q − 1), which gives the number of integers less than n and relatively prime to n. Next, Alice chooses a value e, denoted as the encryption key (public key). Alice next computes a decryption key (private key) d such that e × d ≡ 1 mod φ(n). The private key d is not disclosed to anybody. Alice can now encrypt a message m with her private key by obtaining C = m^d mod n, where C stands for the cipher text corresponding to m. Anybody (say Bob, the intended recipient, or anybody else) who has knowledge of the public key of Alice can now find m by computing C^e mod n = m. This method is useful for confirming that only Alice could have sent this message, since a meaningful message could be obtained by decryption.

© Springer International Publishing Switzerland 2016
P.V. Ananda Mohan, Residue Number Systems, DOI 10.1007/978-3-319-41385-3_10

This is the problem of authentication. The mathematical operations performed by Alice and the receiver are exponentiation mod n operations. We can employ the key pair in another way too. Suppose Bob wants to send a message to Alice; he can use Alice's public key e and send C = m^e mod n. Then Alice can use her private key and compute C^d mod n to get m. This implies that Alice, the only owner of the private key, can decrypt the message and nobody else.
With the computing power available for factorization of n, it is recommended that p and q typically shall be 1024–2048 bits. The exponentiation operation is realized as successive squaring and multiplication operations. An example will illustrate the complexity of the procedure. Let us choose p = 13 and q = 17 so that n = 221. We note that φ(221) = (13 − 1) × (17 − 1) = 192. Choose e = 7 such that gcd(e, φ(n)) = 1, where gcd is the greatest common divisor. Next, d can be computed as 55 since e × d mod φ(n) = 7 × 55 mod 192 = 1. Thus e = 7 is the encryption key and d = 55 is the decryption key. Since we need n also, we usually denote the encryption key together with n as the public key = (e, n) = (7, 221) and similarly the private key = (d, n) = (55, 221). Consider a message m = 11. Encryption needs computation of c = m^e mod n = 11^7 mod 221 = 54. Next, decryption using the private key yields m = c^d mod n = 54^55 mod 221 = 11. Alternatively, signing the message m = 7 with the private key yields the signature s = m^d mod n = 7^55 mod 221 = 97, and verification using the public key yields 97^7 mod 221 = 7. We have used a calculator to perform the exponentiation mod n. In practice, 54^55 mod 221 is computed by successive squaring and optional multiplication, all modulo n. Expressing the exponent in binary form, for example 55 = 110111, we note that 54^55 = 54^32 × 54^16 × 54^4 × 54^2 × 54. Hence, we need to square 54 to obtain 54^2 mod 221, square 54^2 mod 221 to obtain 54^4 mod 221, and so on. Next, scanning the exponent to find the bits which are one, we multiply these intermediate results together, skipping a squared term wherever the corresponding exponent bit is zero. For illustration, we need not multiply 54^7 by 54^8; instead, in the next step we multiply by 54^16 to obtain 54^23. In successive steps, we obtain 54, 54^3, 54^7, 54^23 and 54^55, all mod n. Thus, in general, for a 1024-bit exponent, 1023 modulo squaring operations need to be done compulsorily, whereas, in the worst case, 1023 optional modulo multiplications need to be carried out. It is therefore essential to speed up the basic operation: modulo multiplication (A × B) mod n. Some techniques for modulo multiplication have already been described in Chapter 4. We explore other techniques in detail in this chapter.
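The square-and-multiply procedure just described can be sketched in a few lines; the following minimal Python rendition reproduces the chapter's numbers (Python's built-in pow(a, b, n) does the same thing more efficiently):

```python
def mod_exp(base, exponent, n):
    """Right-to-left binary exponentiation: successive squaring with an
    optional multiplication for every 1-bit of the exponent, all mod n."""
    result = 1
    square = base % n
    while exponent:
        if exponent & 1:                     # multiply in the current power of two
            result = (result * square) % n
        square = (square * square) % n       # successive squaring
        exponent >>= 1
    return result

# RSA toy example from the text: p = 13, q = 17, n = 221, e = 7, d = 55
assert mod_exp(11, 7, 221) == 54    # encryption of m = 11
assert mod_exp(54, 55, 221) == 11   # decryption recovers m
assert mod_exp(7, 55, 221) == 97    # signature on m = 7
assert mod_exp(97, 7, 221) == 7     # verification recovers m
```

For the exponent 55 = 110111 the loop performs five squarings and, on the 1-bits, the multiplications that produce the chain 54, 54^3, 54^7, 54^23, 54^55 described above.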
In cryptographic applications such as RSA encryption, Diffie-Hellman key exchange, elliptic curve cryptography, etc., modulo multiplication and modulo exponentiation of large numbers, of bit lengths varying between 160 and 2048 bits, are typically required. Two popular techniques are based on Barrett reduction and Montgomery multiplication. However, to perform the operation (XY) mod N for a single modulus, an RNS using several moduli of small word length can be employed. This topic has recently received considerable attention. We deal with both RNS-based and non-RNS-based (i.e. using only one modulus) implementations in the following sections. In this chapter, we also consider applications of RNS in elliptic curve cryptography processors and for the implementation of pairing protocols.

10.1 Modulo Multiplication Using Barrett’s Technique

We first consider Barrett's technique [3, 4] for computing r = x mod M, given x and M in base b, where x = x_{2k−1} ... x_1 x_0 and M = m_{k−1} ... m_1 m_0 with m_{k−1} ≠ 0. Note that the radix b is typically chosen to be the word length of the processor. We assume b > 3 herein. Barrett's technique requires pre-computation of a parameter μ = ⌊b^{2k}/M⌋. First, we find consecutively q_1 = ⌊x/b^{k−1}⌋, q_2 = q_1 × μ and q_3 = ⌊q_2/b^{k+1}⌋. Next, we compute r_1 = x mod b^{k+1}, r_2 = (q_3 × M) mod b^{k+1} and r = r_1 − r_2. If r < 0, then r = r + b^{k+1}, and while r ≥ M, r = r − M. Note that the divisions are simple right shifts of the base-b representation. In the multiplication q_2 = q_1 × μ, the (k + 1) least significant digits are not required to determine q_3, except for determining the carry from position (k + 1) to (k + 2). Hence, the (k − 1) least significant digits of q_2 need not be computed. Similarly, r_2 = q_3 × M can also be simplified to a partial multiple-precision multiplication which evaluates only the least significant (k + 1) digits of q_3 × M; r_2 can thus be computed using at most k(k + 1)/2 + k single-precision multiplications. Since μ and q_1 have at most (k + 1) digits each, determining q_3 needs at most (k² + 5k + 2)/2 single-precision multiplications. Note that q_2 is needed only for computing q_3. It can be shown that 0 ≤ r < 3M before the final corrections, so at most two subtractions of M are needed.
As an illustration, consider finding (121) mod 13 with b = 2. Evidently k = 4. We obtain μ = ⌊256/13⌋ = 19, q_1 = ⌊121/8⌋ = 15, q_2 = 15 × 19 = 285 and q_3 = ⌊285/32⌋ = 8. Thus, r_1 = 121 mod 32 = 25, r_2 = 104 mod 32 = 8 and r = 17. Since r > 13, we have r = r mod 13 = 4.
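The steps above translate directly into code; a minimal Python sketch (the radix b and digit count k are passed in, and μ is recomputed inside for clarity, though in practice it is precomputed once per modulus):

```python
def barrett_reduce(x, M, b, k):
    """Barrett reduction: r = x mod M for a 2k-digit x in radix b,
    using only the precomputed mu = floor(b^(2k)/M), shifts and subtractions."""
    mu = b**(2 * k) // M               # precomputed once per modulus
    q1 = x // b**(k - 1)               # right shift by k-1 digits
    q2 = q1 * mu
    q3 = q2 // b**(k + 1)              # right shift by k+1 digits
    r1 = x % b**(k + 1)
    r2 = (q3 * M) % b**(k + 1)
    r = r1 - r2
    if r < 0:
        r += b**(k + 1)
    while r >= M:                      # at most two subtractions, since r < 3M
        r -= M
    return r

# Worked example from the text: b = 2, k = 4, mu = floor(256/13) = 19
assert barrett_reduce(121, 13, 2, 4) == 4
```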
Barrett's algorithm estimates the quotient q for b = 2 in the general case as

q̂ = ⌊X/M⌋ ≈ ⌊ ⌊X/2^{k+β}⌋ × ⌊2^{k+α}/M⌋ / 2^{α−β} ⌋     (10.1a)

where α and β are two parameters. The value μ = ⌊2^{k+α}/M⌋ can be pre-computed and stored. Several attempts have been made to overcome the last modulo reduction operation. Dhem [5] has suggested α = w + 3, β = −2 for radix 2^w, so that the maximum error in computing q is 1. Barrett has used α = n, β = −1. The classical modular multiplication algorithm to find (X × Y) mod M is presented in Figure 10.1, where multiplication and reduction are integrated. Note that step 4 uses (10.1a). Quisquater [7] and other authors [8] have suggested writing the quotient as

q̂ = ⌊X/M⌋ ≈ ⌊ X × ⌊2^{k+c}/M⌋ / 2^{k+c} ⌋     (10.1b)

and the result is T = X − q̂M.



Figure 10.1 High Radix classical Modulo multiplication algorithm (adapted from [6]
©IEEE2010)

Knezevic et al. [6] have observed that the performance of Barrett reduction can be improved by choosing moduli of the form (2^n − Δ) in a set S_1 or (2^{n−1} + Δ) in a set S_2, where in each case Δ is a sufficiently small positive integer (the exact bounds on Δ in terms of n and α are derived in [6]). In such cases, the value of q̂ in (10.1a) can be computed as

q̂ = ⌊Z/2^n⌋ if M ∈ S_1 or q̂ = ⌊Z/2^{n−1}⌋ if M ∈ S_2     (10.1c)

This modification does not need any computation, unlike (10.1b). Since many recommendations such as SEC (Standards for Efficient Cryptography), NIST (National Institute of Standards and Technology) and ANSI (American National Standards Institute) use such primes, the above method will be useful.
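For a modulus of the special form 2^n − Δ, the quotient estimate is just a right shift; the following sketch uses a generic correction loop, and the test modulus 2^8 − 5 = 251 is an arbitrary choice for illustration:

```python
def reduce_special(z, n, delta):
    """Reduce z mod M for M = 2^n - delta: the quotient estimate is
    simply z >> n, so no precomputed mu or long multiplication is needed."""
    M = (1 << n) - delta
    q = z >> n                 # quotient estimate by a plain right shift
    r = z - q * M
    while r >= M:              # correction subtractions (few when delta is small)
        r -= M
    return r

# e.g. M = 2^8 - 5 = 251, z < M^2
assert reduce_special(62250, 8, 5) == 62250 % 251
```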
Brickell [9, 10] has introduced a concept called the carry-delayed adder. This comprises a conventional carry-save adder whose carry and sum outputs are added in another level of CSA comprising half-adders. The result in carry-save form has the interesting property that a sum bit and the next carry bit are never both '1'. As an illustration, consider the following example:

A = 40   101000
B = 25   011001
C = 20   010100
S = 37   100101
C = 48   0110000
T = 21   010101
D = 64   1000000

The output (D, T) is called a carry-delayed number or carry-delayed integer. It may be checked that T_i × D_{i+1} = 0 for all i = 0, ..., k − 1.
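The two-level structure can be simulated on integers; a minimal sketch, in which bitwise operators play the role of the full- and half-adder arrays:

```python
def carry_delayed_add(a, b, c):
    """CSA stage followed by a row of half-adders (Brickell's carry-delayed
    adder). Returns (T, D) with a + b + c = T + D and T_i * D_(i+1) = 0."""
    s = a ^ b ^ c                               # CSA sum bits
    cy = ((a & b) | (a & c) | (b & c)) << 1     # CSA carry bits, shifted
    t = s ^ cy                                  # half-adder sum bits
    d = (s & cy) << 1                           # half-adder carry bits, shifted
    return t, d

# Example from the text: 40 + 25 + 20 = 85 = 21 + 64
t, d = carry_delayed_add(40, 25, 20)
assert (t, d) == (21, 64)
assert t & (d >> 1) == 0    # the delayed-carry property T_i * D_(i+1) = 0
```

The property holds by construction: d >> 1 equals s AND cy, while t equals s XOR cy, and a bit cannot be set in both.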

Brickell [9] has used this concept to perform modular multiplication. Consider computing P = AB mod M where A is a carry-delayed integer:

A = Σ_{i=0}^{k−1} (T_i + D_i) 2^i

Then P = AB can be computed by summing the terms

(T_0 B + D_0 B)2^0 + (T_1 B + D_1 B)2^1 + (T_2 B + D_2 B)2^2 + ... + (T_{k−1} B + D_{k−1} B)2^{k−1}

Rearranging, noting that D_0 = 0, we have

2^0 T_0 B + 2^1 D_1 B + 2^1 T_1 B + 2^2 D_2 B + 2^2 T_2 B + 2^3 D_3 B + ... + 2^{k−2} T_{k−2} B + 2^{k−1} D_{k−1} B + 2^{k−1} T_{k−1} B

Since either T_i or D_{i+1} is zero due to the delayed-carry adder, each step requires a shift of B and the addition of at most one term to the carry-delayed accumulator:

either (P_d, P_t) = (P_d, P_t) + 2^i T_i B or (P_d, P_t) = (P_d, P_t) + 2^{i+1} D_{i+1} B

After k steps, P = (P_d, P_t) is obtained.

Brickell suggests adding terms until P exceeds 2^k, and only then is a correction of value (2^k − M) added. Brickell shows that, 11 steps after the multiplication starts, the algorithm starts subtracting multiples of M, since P is a carry-delayed integer of k + 11 bits, which needs to be reduced mod M.

10.2 Montgomery Modular Multiplication

The Montgomery multiplication (MM) algorithm for processor-based implementations [11] uses two approaches: separated or integrated multiplication and reduction. In separated multiplication and reduction, the multiplication of A and B, each of s words, is performed first, and then Montgomery reduction is performed. On the other hand, in the integrated MM algorithm, these two operations alternate. The integration can be coarse or fine grained (meaning how often we switch between multiplication and reduction: after processing an array of words or after processing just one word). The next option concerns the general form of the multiplication and reduction steps. One form is operand scanning, in which the outer loop moves through the words of one operand. In the other form, known as product scanning, the outer loop moves through the product itself. Note that operand scanning or product scanning is independent of whether multiplication and reduction are integrated or separated. In addition, the

multiplication can take one form and reduction can take another form even in
integrated approach.
As such, we have five techniques: (a) separated operand scanning (SOS), (b) coarsely integrated operand scanning (CIOS), (c) finely integrated operand scanning (FIOS), (d) finely integrated product scanning (FIPS) and (e) coarsely integrated hybrid scanning (CIHS). The word multiplications needed in all these techniques are (2s² + s), whereas the word additions for FIPS are (6s² + 4s + 2), for SOS, CIOS and CIHS are (4s² + 4s + 2) and for FIOS are (5s² + 3s + 2).
In the SOS technique, we first obtain the product (A × B) as a 2s-word integer t. Next, we compute u = (t + mn)/r, where m = (t × n′) mod r, n′ = (−1/n) mod r and r = 2^{sw}. We first take u = t and add mn to it using a standard multiplication routine. We divide the result by 2^{sw}, which we accomplish by ignoring the least significant s words. The reduction actually proceeds word by word using n_0′ = n′ mod 2^w. Each time, the result is shifted right by one word, implying division by 2^w. The number of word multiplications is (2s² + s).
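A word-serial rendering of the SOS reduction loop may be sketched as follows; the operand sizes chosen for the check are arbitrary illustrative values:

```python
def sos_montmul(a, b, n, w, s):
    """Separated operand scanning: full product first, then Montgomery
    reduction one w-bit word at a time. Returns a*b*2^(-s*w) mod n."""
    W = 1 << w
    n0_prime = (-pow(n, -1, W)) % W        # n0' = -n^(-1) mod 2^w
    t = a * b                              # full 2s-word product
    for _ in range(s):                     # clear one low word per pass
        m = (t * n0_prime) & (W - 1)       # quotient digit for this word
        t = (t + m * n) >> w               # low word is now zero; shift it off
    return t - n if t >= n else t          # single conditional subtraction

# check against the definition: result == a*b*2^(-s*w) mod n
a, b, n, w, s = 123456789, 987654321, 2**32 - 5, 8, 4
assert sos_montmul(a, b, n, w, s) == (a * b * pow(2, -s * w, n)) % n
```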
The CIOS technique [11, 12] improves on the SOS technique by integrating the multiplication and reduction steps. Here, instead of computing the complete product (A × B) and then reducing it, we alternate between the iterations of the outer loop for multiplication and reduction. Consider an example with A and B each comprising four words, a_3, a_2, a_1, a_0 and b_3, b_2, b_1, b_0 respectively. First, a_0b_0 is computed and we denote the result as cout0 and tout00, where tout00 is the least significant word and cout0 is the most significant word. In the second cycle, two operations are performed simultaneously: we multiply tout00 with n_0′ to get m_0, and we also compute a_1b_0, adding cout0 to obtain cout1, tout01. At this stage, we know the multiple of N to be
added to make the least significant word zero. In the third cycle, a2b0 is computed
and added to cout1 to obtain cout2, tout02 and in parallel m0n0 is computed and added
to tout00 to obtain cout3. In the fourth cycle, a3b0 is computed and added with cout2
to get cout4 and tout03 and simultaneously m0n1 is computed and added with cout3
and tout01 to obtain cout5 and tout10. Note that the multiplication with b0 is
completed at this stage, but reduction is lagging behind by two cycles. In the fifth
cycle, a0b1 is computed and added with tout10 to get cout7 and tout20 and simul-
taneously m0n2 is computed and added with cout5 and tout02 to obtain cout6 and
tout11. In addition, cout4 is added to get tout04 and tout05. In the sixth cycle, a1b1 is
computed and added with cout7, tout11 to get cout9 and tout21 and simultaneously
m0n3 is computed and added with cout6 and tout03 to obtain cout8 and tout12. In
addition, tout2 is multiplied with n0 to get m1. In this way, the computation proceeds
and totally 18 cycles are needed.
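A straightforward (non-pipelined) rendering of CIOS, following the description in [11], is sketched below; the word size w = 8 and s = 4 words used in the check are arbitrary illustrative choices:

```python
def cios_montmul(a, b, n, w, s):
    """Coarsely integrated operand scanning: the multiplication and the
    Montgomery reduction alternate once per word of b.
    Returns a*b*2^(-s*w) mod n for odd n occupying s w-bit words."""
    W = 1 << w
    n0_prime = (-pow(n, -1, W)) % W            # -n^(-1) mod 2^w
    A = [(a >> (w * j)) & (W - 1) for j in range(s)]
    N = [(n >> (w * j)) & (W - 1) for j in range(s)]
    t = [0] * (s + 2)
    for i in range(s):
        bi = (b >> (w * i)) & (W - 1)
        C = 0
        for j in range(s):                     # multiplication pass for word bi
            x = t[j] + A[j] * bi + C
            t[j], C = x & (W - 1), x >> w
        x = t[s] + C
        t[s], t[s + 1] = x & (W - 1), x >> w
        m = (t[0] * n0_prime) & (W - 1)        # quotient digit
        C = (t[0] + m * N[0]) >> w             # t[0] + m*N[0] has a zero low word
        for j in range(1, s):                  # reduction pass, shifting words down
            x = t[j] + m * N[j] + C
            t[j - 1], C = x & (W - 1), x >> w
        x = t[s] + C
        t[s - 1], C = x & (W - 1), x >> w
        t[s], t[s + 1] = t[s + 1] + C, 0
    u = sum(t[j] << (w * j) for j in range(s + 1))
    return u - n if u >= n else u

a, b, n, w, s = 0x12345678, 0x9ABCDEF1, 2**32 - 5, 8, 4
assert cios_montmul(a, b, n, w, s) == (a * b * pow(2, -s * w, n)) % n
```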
The FIOS technique integrates the two inner loops of the CIOS method by computing both the addition and the multiplication in the same loop. In each iteration, X_0Y_i is calculated and the result is added to Z. Using Z_0, we calculate T as T = ((Z_0 + X_0Y_i) × (−1/M) mod r) mod r. Next, we add MT to Z. The least significant word Z_0 of Z will then be zero, and hence division by r is exact and performed by a simple right shift. The number of word multiplications in each step is (2s + 1), and hence totally

(2s² + s) word multiplications are needed and (2s² + s) cycles are needed on a w-bit processor. The addition operations need additional cycles.
Note that in the CIHS technique, the right half of the partial product summation of the conventional n × n multiplier is performed first, and the carries flowing beyond the s words are saved. In the second loop, the least significant word t_0 is multiplied by n_0′ to obtain the value of m_0. Next, the modulus word n_0 is multiplied with m_0 and added to t_0. This will make the LSBs zero. The multiplication of m_0 with n_1, n_2, etc. and addition with t_1, t_2, t_3, etc. will be carried out in the next few cycles. Simultaneously, the multiplications needed for forming the partial products beyond s words are carried out, and the result is added to the carries obtained and saved in the first step, as well as to the words obtained by multiplying m_i with n_j. The m_i values are computed as soon as the needed information is available. Thus the CIHS algorithm integrates the multiplication with the addition of mn. For a 4 × 4 word multiplication, the first loop takes 7 cycles and the second loop takes 19 cycles. The reader may refer to [13] for a complete description of the operation.
In the FIPS algorithm also, the computations of ab and mn are interleaved. There are two loops. The first loop computes one part of the product ab and then adds mn to it. Each iteration of the inner loop executes two multiply-accumulate operations of the form a × b + S, i.e. the products a_j b_{i−j} and m_j n_{i−j} are added to a cumulative sum. The cumulative sum is stored in three single-precision words t[0], t[1] and t[2], where the triple (t[0], t[1], t[2]) represents t[2]2^{2w} + t[1]2^w + t[0]. These registers are thus used as a partial product accumulator for the products ab and mn. This loop computes the words of m using n_0′ and then adds the least significant word of mn to t. The second loop completes the computation by forming the final result u word by word in the memory space of m.
Walter [14] has suggested a technique for computing (A × B × r^{−n}) mod M where A < 2M, B < 2M and 2M < r^{n−1}, with r the radix and r ≥ 2, so that S < 2M for all possible outputs S. (Note that n is the upper bound on the number of digits in A, B and M, and that a_{n−1} = 0.) Each step computes S = (S + a_i × B + q_i × M) div r, where q_i = ((s_0 + a_i b_0) × (−m_0^{−1})) mod r. It can be verified that S < (M + B) until the last but one step; thus, the final output is bounded: S < 2M. Note that in the last step of exponentiation, a multiplication by 1 is needed, since scaling by 2^n mod M must be removed; a Montgomery step can achieve this. Here, S × r^n = Ã + QM with Q at most r^n − 1, where à = (A × r^n) mod M is the Montgomery form of A. Therefore, since à < 2M, we have S × r^n < (r^n + 1)M and hence S ≤ M, needing no final subtraction. The advantage here is that the cycle time is independent of the radix.
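Walter's iteration can be sketched as follows; the radix r = 16 and digit count n = 6 are arbitrary choices satisfying the stated bounds 2M < r^{n−1} and a_{n−1} = 0:

```python
def walter_montmul(A, B, M, r, n):
    """High-radix Montgomery multiplication with the quotient-digit rule
    q_i = (s_0 + a_i*b_0) * (-m_0^(-1)) mod r.
    Returns S = A*B*r^(-n) mod M with S < 2M under the stated bounds."""
    m0_inv_neg = (-pow(M % r, -1, r)) % r      # -m_0^(-1) mod r, precomputed
    S = 0
    for i in range(n):
        ai = (A // r**i) % r
        qi = ((S % r + ai * (B % r)) * m0_inv_neg) % r
        S = (S + ai * B + qi * M) // r         # exact: the low digit is zero
    return S

A, B, M, r, n = 12345, 67890, 99991, 16, 6
S = walter_montmul(A, B, M, r, n)
assert S % M == (A * B * pow(r, -n, M)) % M and S < 2 * M
```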
Orup [15] has suggested a technique for avoiding the modulo multiplication needed to obtain the q value in high-radix Montgomery modulo multiplication. Orup suggests scaling the modulus M to M̃ = M × M′, where M′ = (−1/M) mod 2^k, considering radix 2^k, so that q is obtained as q_i = (S_i + b_i A) mod 2^k, since (−1/M̃) mod 2^k = 1. Thus only (b_i A) mod 2^k needs to be added to the k LSBs of S_i:

S_{i+1} = (S_i + q_i M̃ + b_i A) div 2^k     (10.2a)

This leads to M̃ with a dynamic range greater than that of the original modulus M by at most k bits. The addition operation in the determination of the quotient q can also be avoided by replacing A by 2^k A. Then the expression q_i = (S_i + b_i A) mod 2^k becomes q_i = S_i mod 2^k, and the update of S_{i+1} becomes

S_{i+1} = (S_i + q_i M̃) div 2^k + b_i A     (10.2b)

The number of iterations is increased by one to compensate for the extra factor 2^k.
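A sketch of the scaled-modulus iteration (10.2a); the modulus and digit parameters used in the check are arbitrary illustrative values:

```python
def orup_montmul(A, B, M, k, n):
    """Orup's trick: scale M to Mt = M*M' with M' = -M^(-1) mod 2^k, so that
    Mt = -1 (mod 2^k) and the quotient digit is just the k LSBs, with no
    modular multiplication. Returns a value congruent to A*B*2^(-k*n) mod M."""
    R = 1 << k
    M_prime = (-pow(M, -1, R)) % R
    Mt = M * M_prime                      # Mt mod 2^k == 2^k - 1
    S = 0
    for i in range(n):                    # scan B in radix-2^k digits
        bi = (B >> (k * i)) & (R - 1)
        qi = (S + bi * A) & (R - 1)       # quotient digit: just the k LSBs
        S = (S + qi * Mt + bi * A) >> k   # exact: the low k bits cancel
    return S

A, B, M, k, n = 123456, 654321, 1000003, 8, 4
S = orup_montmul(A, B, M, k, n)
assert S % M == (A * B * pow(2, -k * n, M)) % M
```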
McIvor et al. [16] have suggested a modification of Montgomery modular multiplication (A × B × 2^{−k} mod M) using 5:2 and 4:2 carry-save adders. Note that A, B and S are considered to be in carry-save form, denoted by the vectors A1, A2, B1, B2, S1 and S2. Specifically, the q_i determination and the estimation of the sum S are based on the following equations:

q_i = (S1[i]_0 + S2[i]_0 + A_i(B1_0 + B2_0)) mod 2     (10.3a)

and

S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + A_i(B1 + B2) + q_i M) div 2     (10.3b)

Note that S1[0] = 0 and S2[0] = 0. In other words, the sum is in redundant or carry-save form (CSR). The second step uses a 5:2 CSA. In an alternative algorithm, the q_i computation is the same as in (10.3a), but it needs only a 4:2 CSA. We have, for the four cases of (A_i, q_i) being (0, 0), (0, 1), (1, 0) and (1, 1), the following expressions:

A_i = 0, q_i = 0:

S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + 0 + 0) div 2     (10.4a)

A_i = 1, q_i = 0:

S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + B1 + B2) div 2     (10.4b)

A_i = 0, q_i = 1:

S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + M + 0) div 2     (10.4c)

A_i = 1, q_i = 1:

S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + D1 + D2) div 2     (10.4d)

where D1, D2 = CSR(B1 + B2 + M + 0) is pre-computed.

The advantage of this technique is that the lengthy and costly conventional additions are avoided, thereby reducing the critical path. Only (n + 1) cycles are needed in the case of (10.3a) and (10.3b), and (n + 2) cycles are needed in the case of (10.4a)–(10.4d). The critical path in the case of (10.3a) and (10.3b) is 3Δ_FA + 2Δ_XOR + Δ_AND, whereas in the case of (10.4a)–(10.4d) it is 2Δ_FA + Δ_4:1MUX + 2Δ_XOR + Δ_AND. Note that k steps are needed, where k is the number of bits in M, A and B.
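The carry-save iteration (10.3a) and (10.3b) can be simulated on integers, chaining three 3:2 compressions to realize the 5:2 compression; a sketch (B is fed in as the trivial carry-save pair (B, 0) for simplicity):

```python
def csa(a, b, c):
    """3:2 carry-save addition on Python integers (bitwise full adders)."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def mcivor_montmul(A, B, M, k):
    """Montgomery multiplication with the running sum kept as a carry-save
    pair (S1, S2): each iteration is a 5:2 compression followed by an exact
    halving of both vectors. S1 + S2 == A*B*2^(-k) (mod M) at the end."""
    B1, B2 = B, 0                 # B may itself arrive in carry-save form
    S1, S2 = 0, 0
    for i in range(k):
        ai = (A >> i) & 1
        qi = (S1 ^ S2 ^ (ai & (B1 ^ B2))) & 1   # eq. (10.3a): parity of the LSBs
        s, c = csa(S1, S2, ai * B1)
        s, c = csa(s, c, ai * B2)
        s, c = csa(s, c, qi * M)
        S1, S2 = s >> 1, c >> 1   # both LSBs are zero, so halving is exact
    return S1, S2

A, B, M, k = 123456, 654321, 1000003, 20
S1, S2 = mcivor_montmul(A, B, M, k)
assert (S1 + S2) % M == (A * B * pow(2, -k, M)) % M
```

The choice of q_i forces the XOR of the five operands' LSBs to zero (M is odd), so the sum vector and the carry vector are both even and the div-2 loses no information.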
Nedjah and Mourelle [17] have described three hardware architectures for binary Montgomery multiplication and exponentiation. The sequential architecture uses two Systolic Array Montgomery Modular Multipliers (SAMMM) to perform multiplication followed by squaring, whereas the parallel architecture uses two systolic modular multipliers in parallel to perform squaring and multiplication (see Figure 10.2a, b).
In the sequential architecture, two SAMMMs are used, each needing five registers. The controller controls the number of iterations needed depending on the exponent. Note, however, that one of the multipliers is not necessary. In the parallel architecture, the hardware is larger, since multiplication and squaring use different hardware blocks, and this needs eight registers. The systolic linear architecture using m e-PEs (E-cells) shown in Figure 10.2c, where m is the number of bits in M, contains two SAMMMs, one of which performs squaring and the other multiplication. These e-PEs together perform left-to-right binary modular exponentiation.
Note that a front-end and a back-end SAMMM are needed to do the pre-computation and post-computation required in the Montgomery algorithm (see Figure 10.2d). The front-end multiplies the operands by 2^{2n} mod M, and the post-Montgomery multiplication multiplies by '1' to get rid of the factor 2^n from the result. The basic PE realizes the Montgomery step of computing R + a_iB + q_iM, where q_i = (r_0 + a_ib_0) mod 2. Note that depending on a_i and q_i, four possibilities exist: (1) a_i = 1, q_i = 1: add M + B; (2) a_i = 1, q_i = 0: add B; (3) a_i = 0, q_i = 1: add M; (4) a_i = 0, q_i = 0: no addition.
The authors suggest pre-computation of M + B only once, denoting it as MB. Thus, using a 4:1 multiplexer, either MB, B, M or 0 is selected to be added to R. This reduces the cell hardware to a full-adder, a 4:1 MUX and a few gates to control the 4:1 MUX. Some of the cells on the border can be simplified (see Figure 10.2e, showing the systolic architecture which uses the general PE in Figure 10.2f). The authors show that sequential exponentiation needs the least area, whereas systolic exponentiation needs the most. Sequential exponentiation takes the highest time, whereas systolic exponentiation takes the least computation time. The AT (area-time) product is lowest for the systolic implementation and highest for the parallel implementation.
Shieh et al. [18] have described a new algorithm for Montgomery modular multiplication. They extend the technique of Yang et al. [19], in which AB is first computed. This 2k-bit word is considered as M_H × 2^k + M_L. Hence (AB)2^{−k} mod N = (M_H + M_L × 2^{−k}) mod N. The second part can be computed and the result added to M_H to
Figure 10.2 (a) Parallel, (b) sequential and (c) systolic linear architectures for Montgomery multiplier; (d) architecture of the exponentiator; (e) systolic architecture and (f) basic PE architecture (adapted from [17] ©IEEE 2006)

obtain the result. Denoting M_L(i) as the ith bit of M_L, the reduction process in each step of the Montgomery algorithm finds

q(i) = (S + M_L(i)) mod 2, S = (S + M_L(i) + q(i)N)/2     (10.5a)

iteratively for i = 0, ..., k − 1. The algorithm starts from the LSB of M_L. Yang et al. [19] represent q(i) and S in carry-save form as

q(i) = (S_C + S_S + M_L(i)) mod 2, (S_C, S_S) = (S_C + S_S + M_L(i) + q(i)N)/2     (10.5b)

Shieh et al. [18] suggest computing S as

S = (S + M_L(i) + 2q(i)N′)/2     (10.5c)

where N′ = (N + 1)/2. The advantage is that 2q(i)N′ and M_L(i) can be concatenated as a single operand, thus decreasing the number of terms to be added to 3 instead of 4. The authors also suggest "quotient pipelining", deferring the use of the computed q(i) to the next iteration. Thus, we need to modify (10.5c) as


S = (S + M_L(i) + 2q(i − 1)N″)/2     (10.5d)

where N″ = N′/2 if N′[0] = 0 and N″ = (N′ + N)/2 if N′[0] = 1. Since N′ = (N + 1)/2, we have N″ = (N + 1)/4 or (3N + 1)/4, depending on the LSB of N′. (Note that [i] stands for the ith bit.) This technique needs extension of A, B and S by two bits (0 ≤ A, B, S < 4N), and (AB × 2^{−(n+4)}) mod N is computed. The advantage of using (10.5a) and (10.5b) is that these two computations can be performed in a pipelined fashion.
The authors have also shown that the partial product addition and modulo reduction can be merged into one step, adding two more iterations. This method also needs extension of B by four bits. The output is S = (AB)2^{−(k+4)} mod N. The loop in this case computes

M = M/2 + AB(i), M_L(i) = M mod 2, S = S/2 + M_L(i − 1) + 2q(i − 2)N″, q(i − 1) = S mod 2     (10.6)

The authors have described an array architecture comprising (k − 1) PE cells. Each PE has one PPA (partial product addition) and one MRE (modulo reduction) unit. They realize these in carry-save form, denoting M = (M_c, M_s) and S = (S_c, S_s). They show that the critical path is affected by one full-adder only.
Word-based Montgomery modular multiplication has also been considered in the literature [11, 20–28]. In the MWR2MM (multiple-word radix-2 Montgomery multiplication) algorithm for computing X × Y × 2^{−n} mod M due to Tenca and Koc [20], Y and M are considered to be split into e w-bit words. M, Y and S are extended to (e + 1) words by a most significant zero word: M = (0, M^(e−1), ..., M^(1), M^(0)), Y = (0, Y^(e−1), ..., Y^(1), Y^(0)) and S = (0, S^(e−1), ..., S^(1), S^(0)). The algorithm is given in pseudocode in Figure 10.3. The arithmetic is performed in w-bit precision. Based on the value of x_i, x_iY^(0) + S^(0) is computed, and if its LSB is 1, then M is added so that the LSB becomes zero. A shift-right operation must be performed in each of the inner loops; a shifted S^(j−1) word is available only when the LSB of the new S^(j) is obtained.

1   S = 0
2   for i = 0 to n − 1
3     (Ca, S^(0)) := x_i·Y^(0) + S^(0)
4     if S^(0)_0 = 1 then
5       (Cb, S^(0)) := S^(0) + M^(0)
6       for j = 1 to e
7         (Ca, S^(j)) := Ca + x_i·Y^(j) + S^(j)
8         (Cb, S^(j)) := Cb + M^(j) + S^(j)
9         S^(j−1) := (S^(j)_0, S^(j−1)_(w−1..1))
10      end for
11    else
12      for j = 1 to e
13        (Ca, S^(j)) := Ca + x_i·Y^(j) + S^(j)
14        S^(j−1) := (S^(j)_0, S^(j−1)_(w−1..1))
15      end for
      end if
16    S^(e) = 0
    end for

Figure 10.3 Pseudocode of the MWR2MM algorithm (adapted from [20] ©IEEE 2003)

Basically, the algorithm has two steps: (a) add one word from each of the vectors S, x_iY and M (the addition of M depending on a test), and (b) a one-bit right shift of an S word. An architecture is shown in Figure 10.4a, containing a pipelined kernel of p w-bit PEs (processing elements), for a total of wp bit cells. In one kernel cycle, p bits of X are processed. Hence, k = n/p kernel cycles are needed to do the entire computation. Each PE contains two w-bit adders and two banks of w AND gates to conditionally add x_i × Y^(j) and add "odd" M^(j) to S^(j), with registers holding the results (see Figure 10.4b). (Note that "odd" is true if the LSB of S is '1'.) Note that S is renamed here as Z, and Z is stored in carry-save redundant form. A PE must wait two cycles after its predecessor before kicking off, until Z_0 is available, because Z_1 must first be computed and shifted. Note that the FIFO needs to store the results of each PE in carry-save redundant form, requiring 2w bits for each entry. These need to be stored until PE1 becomes available again.
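At the integer level, MWR2MM evaluates the standard radix-2 Montgomery recurrence; the word-wise organization only changes how the arithmetic is scheduled. A sketch:

```python
def mwr2mm_value(X, Y, M, n):
    """The bit-level recurrence that MWR2MM evaluates word-by-word:
    add x_i*Y, make the sum even by conditionally adding (odd) M, halve."""
    S = 0
    for i in range(n):
        S += ((X >> i) & 1) * Y
        if S & 1:
            S += M                 # M is odd, so this clears the LSB
        S >>= 1
    return S                       # S < 2M and S == X*Y*2^(-n)  (mod M)

X, Y, M, n = 123456, 654321, 999983, 20
S = mwr2mm_value(X, Y, M, n)
assert S % M == (X * Y * pow(2, -n, M)) % M and S < 2 * M
```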
A pipeline diagram of the Tenca-Koc architecture [20] is shown in Figure 10.5a for two cases of PEs: (a) case 1, e > 2p − 1, with e = 4 and p = 2, and (b) case 2, e ≤ 2p − 1, with e = 4 and p = 4, indicating which bits are processed in each cycle. There are two
Figure 10.4 (a) Scalable Montgomery multiplier architecture and (b) schematic of PE (adapted from [21] ©IEEE 2005)

dependencies for PE1 to begin a kernel cycle, indicated by the gray arrows. PE1 must be finished with the previous cycle, and the Z_{w−1:0} result of the previous kernel cycle must be available at PE p. Assuming a two-cycle latency to bypass the result from PE p, to account for the FIFO and routing, the computation time in clock cycles is

k(e + 1) + 2(p − 1)   if e > 2p − 1 (case I)
k(2p + 1) + e − 2     if e ≤ 2p − 1 (case II)     (10.7)

The first case corresponds to a large number of words. Each kernel cycle needs e + 1 clock cycles for the first PE to handle one bit of X. The output of PE p must be queued until the first PE is ready again. There are k kernel cycles. Finally, 2(p − 1) cycles are required for the subsequent PEs to complete on the last kernel cycle.
The second case corresponds to a small number of words. Each kernel cycle takes 2p clock cycles before the final PE produces its first word, and one more cycle to bypass the result back. k kernel cycles are needed. Finally, e − 2 cycles are needed to obtain the more significant words at the end of the first kernel cycle.
The case of Harris et al. [21] is presented for comparison in Figure 10.5b. Harris et al. [21] have suggested that the results be stored in the FIFO in non-redundant form to save FIFO area, requiring only w bits for each entry instead of the 2w bits in [20]. They also note that, instead of waiting for the LSBs of the previous word to be shifted to the right, M and Y can be left-shifted, thus saving a latency of one clock cycle. This means that as soon as the LSB of Z is available, we can start the next step for another x_i. The authors have considered the cases e = p and e > p (the number of PEs p equal to the number of words e, or less than the number of words e). Note that in this case, (10.7) changes to

(k + 1)(e + 1) + p − 2   if e > p (case I)
k(p + 1) + 2e − 1        if e ≤ p (case II)     (10.8)

Kelley and Harris [22] have extended the Tenca-Koc algorithm to high radix 2^v using a w × v-bit multiplier. They have also suggested using Orup's technique [15] for
[Figure 10.5a pipeline diagram: Case 1: e > 2p − 1 (e = 4, p = 2); Case 2: e ≤ 2p − 1 (e = 4, p = 4); kernel cycles 1 and 2 are shown, with a kernel stall in Case 2.]

Figure 10.5 Pipeline diagrams corresponding to (a) Tenca and Koc technique and (b) Harris et al. technique (adapted from [21] ©IEEE2005)
10.2 Montgomery Modular Multiplication 279

[Figure 10.5b pipeline diagram: Case 1: e > p (e = 4, p = 2); Case 2: e ≤ p (e = 4, p = 4); kernel cycles 1 and 2 are shown, with a kernel stall in Case 2.]

Figure 10.5 (continued)

avoiding multiplication in computing q by scaling the modulus, and also by pre-scaling X by 2^v to allow the multiplications needed for computing qM + x_i·Y to occur in parallel.
Jiang and Harris [23] have extended the Harris et al. [21] radix-2 design by using a parallel modification of the Montgomery algorithm. Here, the computation performed is Z = Z/2 + q·M̂ + x_i·Y, where q = Z mod 2, M′ is such that R·R′ − M·M′ = 1 with R = 2^n, and M̂ = ((M′ mod 2)·M + 1)/2. Note that the parallel multiplications need just ANDing, since q and x_i are single bits.
Pinckney and Harris [24] and Kelly and Harris [25] have described a radix-4 parallelized design which left-shifts the operands and parallelizes the multiplications within the PE. Note that pre-computed values of 3M̂ and 3Y are employed, where M̂ = ((M′ mod 2^2)·M + 1)/2^2. Orup's technique [15] has been used to avoid multiplications.
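A minimal Python sketch of the radix-2 recurrence (small illustrative numbers, our own construction). Because the x_i·Y term is added after the shift, the loop accumulates 2·X·Y·2^(−n) mod M, i.e. one extra factor of 2 relative to the standard Montgomery product; the published designs account for this when shifting the operands, and here we simply verify the resulting congruence:

```python
# Parallel radix-2 step Z = Z/2 + q*M_hat + x_i*Y with q = Z mod 2.

M = 239                                   # odd modulus (illustrative)
n = 8                                     # number of bits of X scanned
R = 1 << n
M_prime = (-pow(M, -1, R)) % R            # R*R' - M*M' = 1  =>  M' = -M^(-1) mod R
M_hat = ((M_prime % 2) * M + 1) // 2      # = (M + 1)/2 for odd M

X, Y = 123, 200
Z = 0
for i in range(n):
    x_i = (X >> i) & 1
    q = Z & 1                             # q depends only on Z, not on x_i*Y
    Z = Z // 2 + q * M_hat + x_i * Y      # multiplications by bits q, x_i are ANDs

# Unrolling gives 2^n * Z = M*(sum q_j 2^j) + 2*X*Y, hence Z*2^(n-1) = X*Y (mod M)
assert (Z * pow(2, n - 1, M)) % M == (X * Y) % M
```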
Huang et al. [26] have suggested modifications to the Tenca-Koc algorithm to perform Montgomery multiplication in n clock cycles. In order to achieve this, they suggest pre-computing the partial results using two possible assumptions for the MSB of the previous word. PE1 can take the w − 1 MSBs of S^(0) (i = 0) from PE0 at the beginning of clock 1, do a right shift and prepend both 1 and 0, based on the two different assumptions about the MSB of this word at the start of the computation, and compute S^(1) (i = 1). At the beginning of clock cycle 2, since the correct bit will be available as the LSB of S^(1) (i = 0), one of the two pre-computed versions of S^(0) (i = 1) is chosen. Since the w − 1 LSBs are the same, the parallel hardware can share the same LSB adding hardware, and the other portions can be handled using small additional adders. The same pattern of computations repeats in the subsequent clock cycles. Thus, the resource requirement is only marginally increased. The computation time in clock cycles is T = n + e − 1 if e ≤ p, and T = n + k(e − p) + e − 1 otherwise, where k = ⌊n/p⌋.
In another technique, each PE processes the complete computation of a specific word in S. However, all PEs can scan different bits of the operand X at the same time. The data dependency graphs of both these cases are presented in Figure 10.6a, b. Note that the second architecture, however, has a fixed size (i.e., e PEs, which cannot be reduced). The first technique has been shown to outperform the Tenca-Koc design by about 23% in terms of the product of latency time and area when implemented on FPGAs. The second technique achieves an improvement of 50%.
The authors have also described a high-radix implementation [26] while preserving the speed-up factor of two over the corresponding technique of Tenca and Koc [20]. In this, for example, considering radix 4, two bits are scanned at a time, taking ((n/2) + e − 1) clock cycles to produce an n-bit Montgomery multiplication. The multiplication by 3 that is needed can be done on the fly, or avoided by using Booth's algorithm, which then needs to handle negative operands [26].
Shieh and Lin [27] have suggested rewriting the recurrence equations in the MM algorithm,

q_i = (S_i + A·B_i) mod 2
S_{i+1} = (S_i + A·B_i + q_i·N)/2        (10.9)

as

q_i = (SR_i + SM_{i−1}/2 + A·B_i) mod 2
SR_{i+1} + SM_{i+1} = (SR_i + SM_{i−1}/2 + A·B_i + q_i·N)/2        (10.10)

with SR_0 = SM_0 = SM_{−1} = 0, for i = 0, …, (k − 1). Note that A, B and N are k-bit words. This helps in deferring the accumulation of the MSB of each word of the intermediate result to the next iteration of the algorithm. Note that the intermediate result S_i in (10.9) is decomposed into two parts, SM and SR: the word SM contains only the MSB followed by zeroes, and the word SR comprises the remaining LSBs. They also observe that in (10.10) the number of terms can be reduced to three, taking advantage of the several zero bits in SR_i and SM_{i−1}/2. Further, by considering A as two words AP and AR (for example, for W = 4, AP = 0 a10 0 0 0 a6 0 0 0 a2 0 0 and AR = a11 0 a9 a8 a7 0 a5 a4 a3 0 a1 a0), (10.10) changes to

[Figure 10.6a, b: data dependency graphs showing PEs #0–#3 processing the words S^(j), Y^(j), M^(j) across iterations i, with {x, q, C} values passed between PEs.]

Figure 10.6 Data dependency graphs of (a) optimized architecture and (b) alternative architecture of MWR2MM algorithm (adapted from [26] ©IEEE2011)

q_i = (SR′_i + 2AP·B_{i+1} + SM′_{i−1}/2 + AR·B_i) mod 2 = (OP1_i + OP2_i) mod 2

SR′_{i+1} + SM′_{i+1} = (SR′_i + 2AP·B_{i+1} + SM′_{i−1}/2 + AR·B_i + q_i·N)/2
                      = (OP1_i + OP2_i + OP3_i)/2        (10.11)

where OP1_i = SR′_i + 2AP·B_{i+1}, OP2_i = SM′_{i−1}/2 + AR·B_i and OP3_i = q_i·N. The pair (SR′, SM′) is used in place of (SR, SM), since the value of the intermediate result changes due to the rearrangement of the operands to be added in each iteration. Note that a post-processing operation S_k = SR′_k + SM′_k + SM′_{k−1}/2 is required to obtain the final result. Thus, the data dependency between the MSB of S_j in the (i + 1)th iteration and S_{j+1} in the ith iteration can be relaxed using this technique. The reader is referred to [27] for more information on the implementation.
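As a sanity check, the deferred-MSB bookkeeping can be replayed against the plain recurrence (10.9) on small numbers. The split position (bit k − 1) and the final fix-up below are our own illustrative construction:

```python
# Deferred-MSB recurrence (10.10) versus the plain recurrence (10.9).
# SM holds the bits at position k-1 and above (so SM is even and SM/2 is
# exact); SR holds the remaining LSBs.

def montgomery_ref(A, B, N, k):
    """Plain bit-serial recurrence (10.9)."""
    S = 0
    for i in range(k):
        b_i = (B >> i) & 1
        q = (S + A * b_i) & 1
        S = (S + A * b_i + q * N) // 2
    return S

def montgomery_deferred(A, B, N, k):
    """Recurrence (10.10): the MSB part of each partial sum is consumed
    one iteration late, relaxing the MSB data dependency."""
    SR, SM_cur, SM_prev = 0, 0, 0         # SR_0, SM_0, SM_{-1}
    low = (1 << (k - 1)) - 1              # mask for bits below position k-1
    for i in range(k):
        b_i = (B >> i) & 1
        base = SR + SM_prev // 2          # deferred MSB joins here
        q = (base + A * b_i) & 1
        half = (base + A * b_i + q * N) // 2
        SM_prev, SM_cur = SM_cur, half & ~low   # split: SM = high part,
        SR = half & low                         #        SR = low part
    return SR + SM_cur + SM_prev // 2     # post-processing fix-up

A, B, N, k = 123, 177, 239, 8
assert montgomery_deferred(A, B, N, k) == montgomery_ref(A, B, N, k)
```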
Knezevic et al. [6] have described Barrett and Montgomery modulo reduction techniques for special moduli (generalized Mersenne primes) used in elliptic curve cryptography (ECC). They have presented an interleaved modulo multiplier suitable for both Barrett and Montgomery reduction and observe that the Montgomery technique is faster. The algorithm for Montgomery reduction is presented in Figure 10.7a. Two unified architectures for interleaved Barrett and Montgomery reduction are presented in Figure 10.7b, c for the classical technique and the modified technique, respectively. In these, the blocks π1 and π2 are multiple-precision multipliers which perform the multiplications in lines 3 and 5 of the flowcharts in Figures 10.1 and 10.7a, whereas π3 is a single-precision multiplier used to perform the multiplication in step 4. An additional adder Σ is also required.
In the case of Barrett reduction, the pre-computed value μ is λ = w + 4 bits long, whereas in the case of the Montgomery algorithm, the pre-computed value M′ is λ = w bits long. In the case of Barrett reduction, π2 uses the most significant λ bits of the product calculated by π3, whereas in the case of the Montgomery algorithm, it uses the least significant λ bits of the same product. The authors show that an improvement in speed of about 50% can be achieved using a digit size of 32 bits and moduli of 192–512 bits. In Montgomery reduction, the processing of Y starts from right to left, whereas in the case of Barrett reduction it starts from left to right. The authors also observe that in the case of a special structure for the modulus M (see (10.1c)), the architecture can be simplified as shown in Figure 10.7c, needing only two multipliers. This does not need any multiplication with pre-computed values. The critical path, shown in bold in Figure 10.7c, is reduced to one multiplier and one adder only.
Miyamoto et al. [28] have explored the full design space of RSA processors using high-radix Montgomery multipliers. They have considered four aspects: (a) algorithm design, (b) radix design, (c) architecture design and (d) arithmetic

Figure 10.7 (a) Algorithms for modulo multipliers using Montgomery and Barrett reduction, (b) original architecture and (c) modified architecture (adapted from [6] ©IEEE2010)

[Figure 10.7b, c: datapath diagrams with multipliers π1, π2, π3 and adder Σ operating on X, Y, M (n bits) and μ or M′ (λ bits); for Barrett, λ = w + 4 and the MS bits of the π3 product are used, while for Montgomery, λ = w and the LS bits are used; the simplified datapath (c) omits π3.]

Figure 10.7 (continued)

component design. They have considered four types of exponentiation algorithms based on two variants of binary methods, (a) the left-to-right binary method and (b) the square-and-multiply-always exponentiation method, which have different resistance to simple power analysis (SPA)-based attacks, each with and without the use of CRT.
The square-and-multiply-always exponentiation method starts at the LSB and works upwards. The left-to-right binary method requires lower hardware resources. The m-ary window method reduces the number of multiplication operations using 2^{m−1} pre-computed values, but more memory resources are needed. Hence the authors use the left-to-right binary method. The square-and-multiply-always method has been used since it prevents SPA attacks by using dummy operations even for zero bits of the exponent. CRT also reduces the clock cycles by almost ¾. This, however, requires extra hardware for pre-processing and post-processing.
First, an algorithm for modulo exponentiation is selected considering the trade-off between RSA computation time and tamper resistance. The authors suggest that the radix needs to be chosen next. Circuit area and delay time increase exponentially with the radix; however, the area and time increase in different ways, and the decrease in the number of cycles may compensate for the increase in critical path. Data-path architectures of three types have been considered, which hold the intermediate results in (1) single form (type I), (2) semi-carry-save form (type II) and (3) carry-save form (type III). The authors have observed that 85, 73 and 84 different variants of the type I, II and III data-path architectures, respectively, are possible. The RSA time is largest for type I and least for type III, whereas the area is least for type I and largest for type III.
All three algorithms (types I, II and III) are presented in Figure 10.8a–c for completeness. We wish to compute Z = (X·Y·2^{−rm}) mod N, where X, Y, Z, N are k-bit integers. These k-bit operands are considered as m blocks of r bits each. We define w = −N^{−1} mod 2^r and t_i = (z_0 + x_i·y_0)·w mod 2^r. The original high-radix Montgomery algorithm needs the computation of q = z_j + x_i·y_j + t_i·n_j + c. They suggest storing the temporary variable q of 2r bits as z_{j−1} and c, where c = ⌊q/2^r⌋ and z_{j−1} = q mod 2^r.
The computation of q is realized in two steps in type I. In the first step, z_j + x_i·y_j + c_a is computed to yield the sum z_j and carry c_a. The next step computes z_j + t_i·n_j + c_b as sum z_{j−1} and carry c_b. A CPA is used to perform the needed summation. In type II, these two steps are modified to preserve the carry-save form for only the intermediate carry. We compute in the first step

z_j + x_i·y_j + cs1_a + cs2_a + ec_a = (cs1_a + cs2_a + ec_a, z_j)        (10.12a)

In the second step, we evaluate

z_j + t_i·n_j + cs1_b + cs2_b + ec_b = (cs1_b + cs2_b + ec_b, z_{j−1})        (10.12b)

Note that the cs_a and cs_b signals are r bits wide, whereas ec is a 1-bit carry. The lower r-bit output and the 1-bit ec are given by the CPA operation, whereas the rest are obtained by partial product addition using a CSA.
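Abstracting away the carry-save bookkeeping of types I–III, the word-serial recurrence they all realize can be sketched as follows (modulus and operands are our own small choices, and per-word carries are collapsed into full-width Python integers):

```python
# High-radix Montgomery recurrence: Z = X*Y*2^(-r*m) mod N, scanning X in
# m digits of r bits, with w = -N^(-1) mod 2^r and t_i = (z_0 + x_i*y_0)*w mod 2^r.

r, m = 8, 4                               # radix 2^8, m digits => 32-bit operands
N = 0xF1234567                            # odd modulus (illustrative)
X = 0x12345678 % N
Y = 0x9ABCDEF1 % N
w = (-pow(N, -1, 1 << r)) % (1 << r)
mask = (1 << r) - 1

Z = 0
for i in range(m):
    x_i = (X >> (r * i)) & mask
    Z = Z + x_i * Y                       # the low word of this is z_0 + x_i*y_0
    t_i = ((Z & mask) * w) & mask
    Z = (Z + t_i * N) >> r                # exact: Z + t_i*N = 0 mod 2^r

Z %= N                                    # final correction
assert Z == (X * Y * pow(pow(2, r * m, N), -1, N)) % N
```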

Figure 10.8 Montgomery multipliers using (a) single form (Type I), (b) using semi carry-save
form (Type II) and (c) using carry-save form (Type III) (adapted from [28] ©IEEE2011)
10.3 RNS Montgomery Multiplication and Exponentiation 287

In the third algorithm, carry-save form is used for both the intermediate sum and carry, where cs1 and cs2 are intermediate carry signals and zs1 and zs2 are intermediate sum signals. The two steps in this case are modified as

z_j + x_i·y_j + cs1_a + cs2_a = (cs1_a + cs2_a, zs1 + zs2)
zs1 + zs2 + t_i·n_j + cs1_b + cs2_b + ec = (cs1_b + cs2_b, zs1 + zs2)        (10.12c)
zs1 + zs2 = (ec, z_{j−1})

The CPA operation is performed at the end of the inner loop to obtain z_j. The third approach needs more steps due to the extra additions.
The computation time of the CPA significantly affects the critical path. In types I and II, the CPA widths are 2r and r bits, respectively, whereas in type III the CPA is not used in every cycle. The numbers of cycles needed for the arithmetic core in types I, II and III are 2m² + 4m + 1, 2m² + 5m + 1 and 2m² + 6m + 2, respectively. The authors have considered a variety of final CPAs based on the Kogge-Stone, Brent-Kung, Han-Carlson and Ladner-Fischer types. The partial product addition also used a variety of algorithms: Dadda tree, 4:2 compressor tree, (7, 3) counter trees and (3, 2) counters. The radix was also varied from 8 to 64 bits.
The authors report results ranging from the smallest area of 861 gates using a Type I radix-2^8 processor to the shortest operating time of 0.67 ms at 421.94 MHz with a Type III radix-2^128 processor. The highest hardware efficiency (RSA time × area) of 83.12 s-gates was achieved with a Type II radix-2^32 processor.

10.3 RNS Montgomery Multiplication and Exponentiation


 
Posch and Posch [29] first suggested Montgomery reduction, i.e. computing (ab/M) mod N, in RNS using two RNSs, namely RNS1 (base B) and RNS2 (base B′), which have dynamic ranges M and M′, respectively. The algorithm [30] is presented in Figure 10.9a for computing Z = (ab/M) mod N. First, t = ab is computed in both RNSs. Next, q = (−t/N) mod M is computed in RNS1. In RNS2, we compute ab + q̂N (where q̂ is obtained by base extension of q from RNS1 to RNS2) and divide it by M. (Note that the inverse of M exists in RNS2 and not in RNS1.) This result is next base-extended to RNS1. Evidently, two base extensions are needed: one from RNS1 to RNS2 and another from RNS2 to RNS1.
Posch and Posch have made certain assumptions on RNS1 and RNS2:

N + Δ < M < N + N/3   with Δ < N/6

b
Set1 M Set2 Mʹ
Moduli 3 5 7 11 13 17
a =10 1 0 3 10 10 10
b = 25 1 0 4 3 12 8
a×b=250 1 0 5 Mod (105) 8 3 12 (a×b)mod (2431)
= 250
t 2 2 3
=(-1/37) mod 105
=17
t×(a×b) mod 105 2 0 1 Base Extend 6 11 16 S = t×(a×b) mod 105 =
=50 to M′ 50

4 11 3 N = 37
2 4 14 s×N = 1850
10 7 9 a×b+s×N = 2100
2 1 6 1/M = 1/105
Result = 20 2 0 6 Base 9 7 3 2100/105 = 20
Extension to
M

Figure 10.9 (a) Bajard’s and Posch and Posch algorithm for Montgomery multiplication using
RNS (adapted from [31] ©IEEE2004) (b) an example of Montgomery multiplication using RNS

4N ≤ M′ ≤ (4 + ε_{M′/N})·N   with ε_{M′/N} < 1/12
a, b < N + Δ < N + N/6        (10.13)

Only the base extension algorithm is different. The CRT expansion is approximated as W*_int = ⌊Σ_{i=1}^{n} X_i·w*_i⌋, where the X_i are the residues and w*_i = w_i − δ_i, with δ_i ≪ 1 denoting the error between the approximate weight and w_i = |M_i^{−1}|_{m_i}/m_i. Base extension from RNS1 to RNS2 results in some t* = t or t + M1, according as W*_int = W_int or W_int − 1 (approximation off by one). It can be shown that t* < M1 or t* < M1 + Δ in these two cases, respectively. Further, note that N/2 < y < 3N + 3Δ, where y = (x − ((x·N^{−1}) mod M)·N)·M^{−1} + 2N.
Bajard et al. [30] have described an RNS Montgomery modular multiplication algorithm in which B and N are given in RNS form and A is given in mixed-radix (MRS) form. In this method only one RNS base is used, and the condition 0 ≤ N ≤ M/(3·max_{i∈(1,…,n)} m_i) needs to be satisfied. The algorithm is executed in n steps, where n is the number of moduli. In each step, an MRS digit q′_i of a number Q is computed and a new value of R, where R = (AB/M) mod N, is determined using a′_i and q′_i in RNS, where a′_i is the mixed-radix digit of A. Next, since R is a multiple of m_i and the moduli are relatively prime, R is multiplied by the multiplicative inverse of m_i. This process cannot, however, be carried out in the last step, since m_i is not prime to itself. Hence, another RNS shall be used for expressing the result and reconstructing the residue after it is lost. Since the result is available in the new RNS base, it needs to be extended to the original base in the end. The authors also suggest another technique where the missing residue is recovered using base extension employing a redundant modulus, following the Shenoy and Kumaresan technique [32]. Note, however, that for systems based on MRC, fully parallel computation cannot be realized, and thus they are slower. An example of Montgomery multiplication using RNS to compute (AB/M) mod N, where A = 10, B = 25, M = 105 and N = 37, is presented in Figure 10.9b.
The Bajard et al. [31] approach using two moduli sets is similar to the other techniques [33, 34], but the base extension steps use different algorithms; these, however, have the same complexity as that of Posch and Posch [29]. Note that q shall be extended to base B′ as explained before. This can be obtained by using CRT, but the multiple of M that needs to be subtracted must be known. The application of CRT in base B yields

q = Σ_{i=1}^{k} σ_i·M_i − α·M        (10.14a)

where the base B contains k moduli, σ_i = (q_i·M_i^{−1}) mod m_i and α < k. We need not compute the exact value of q in B′; instead, we extend q to B′ as

q̂ = q + α·M = Σ_{i=1}^{k} σ_i·M_i        (10.14b)

evaluated modulo m_j for j = k + 1, …, 2k. We compute next

r̂ = (ab + q̂·N)/M        (10.14c)

Note that the computed value r̂ < M′, so it has a valid representation in base B′. From (10.14b) and (10.14c), we get

r̂ mod N = ((ab + (q + αM)·N)/M) mod N = (a·b·M^{−1}) mod N        (10.14d)

Thus, there is no need to compute α. Instead, q̂ is computed directly in B′. Once r̂ is estimated, it needs to be extended back to B, which can be done using the Shenoy and Kumaresan [32] technique, for which a redundant residue modulus m_r is used. It may be noted that since in CRT α < k, q < M and ab < MN, we have q̂ < (k + 1)M and hence r̂ < (k + 2)N < M′. The condition ab < MN implies that if we want to use r̂ in the next step, as needed in exponentiation algorithms (say squaring), we obtain the condition (k + 2)²·N² < MN, i.e. (k + 2)²·N < M. Thus, if N is a 1024-bit number and we use 32-bit moduli, we need a base B of size k ≥ 33.
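The base-extension-without-α idea of (10.14a)–(10.14d) can be replayed at the integer level on the numbers of Figure 10.9b (Python's pow(·, −1, m) stands in for the pre-computed channel inverses):

```python
# Base B = {3, 5, 7} (M = 105), N = 37, a = 10, b = 25; the result
# (a*b/M) mod N should come out as 20, as in Figure 10.9b.

B = [3, 5, 7]
M = 105
N = 37
a, b = 10, 25

# q = (-a*b*N^(-1)) mod M, computed channel-wise in practice
q = (-a * b * pow(N, -1, M)) % M                      # = 50

# sigma_i = (q_i * M_i^(-1)) mod m_i as in (10.14a)
sigmas = []
for m_i in B:
    M_i = M // m_i
    sigmas.append((q % m_i) * pow(M_i, -1, m_i) % m_i)

# Base extension without computing alpha, Eq. (10.14b):
# q_hat = q + alpha*M is what the B' channels actually see.
q_hat = sum(s * (M // m_i) for s, m_i in zip(sigmas, B))

r_hat = (a * b + q_hat * N) // M                      # exact division, (10.14c)
assert (a * b + q_hat * N) % M == 0
assert r_hat % N == (a * b * pow(M, -1, N)) % N       # (10.14d)
print(r_hat)                                          # 20, as in Figure 10.9b
```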
Note that lines 1, 2 and 4 of the algorithm in Figure 10.9a need 5k modular multiplications. The first and second base extensions (steps 3 and 5) each need (k² + 2k) modular multiplications, thus needing (2k² + 9k) modular multiplications overall. If the redundant modulus is chosen as a power of 2, the first base extension needs only k² + k multiplications.
Kawamura et al. [33, 34] have described a Cox-Rower architecture. They stipulate that a, b < 2N, so that

w = (ab + tN)/M < ((2N)² + MN)/M = (4N/M)·N + N ≤ 2N if 4N ≤ M        (10.15)

In this method, the base extension algorithm is executed in parallel by plural “rower” units controlled by a “cox” unit. Each rower unit is a single-precision modular multiplier-accumulator, whereas the cox unit is typically a 7-bit adder. The algorithm is the same as that in Figure 10.9a except for the base extension steps. Referring to CRT-based RNS-to-binary conversion, it is clear that the value of α (i.e., the multiple of M that needs to be subtracted) has to be determined so that base extension can be carried out as discussed in Chapter 6. Kawamura et al. [34] have suggested computing α in a different way. Noting that
x = Σ_{i=1}^{n} |x_i·M_i^{−1}|_{m_i}·M_i − α·M = Σ_{i=1}^{n} σ_i·M_i − α·M        (10.16a)

where

σ_i = (x_i·M_i^{−1}) mod m_i        (10.16b)

we have

α + x/M = Σ_{i=1}^{n} σ_i/m_i        (10.16c)

Since x < M, Σ_{i=1}^{n} σ_i/m_i lies between α and α + 1. Thus α = ⌊Σ_{i=1}^{n} σ_i/m_i⌋, and 0 ≤ α < n holds. The value of α can be recursively estimated in the “cox” unit by approximating each m_i in the denominator by 2^r, in order to avoid division by m_i. Note that r is assumed to be common to all moduli in spite of the m_i being different in general, and computing
α̂ = ⌊Σ_{i=1}^{n} trunc(σ_i)/2^r + α₀⌋        (10.17)

where trunc(σ_i) = σ_i ∧ (1…10…0)₂. (The number of ones is q, the number of zeroes is r − q, and ∧ stands for bit-wise AND.) Thus σ_i is approximated by its q most significant bits as trunc(σ_i). The parameter α₀ is an offset value to take into account the error caused by the approximation. Note that α̂ in (10.17) is computed bit by bit, with an initial value λ₀ = α₀, as

λ_i = λ_{i−1} + trunc(σ_i)/2^r,  α_i = ⌊λ_i⌋,  λ_i := λ_i − α_i,  for i = 1, 2, …, n        (10.18)

Note that each α_i is a bit, and if it is 1, the rower unit subtracts M. Note that the error is transferred to the next step, and only in the last step is there a residual error. Kawamura et al. [34] have suggested the use of α₀ = 0 and α₀ = 0.5 for the first and second base extensions, respectively. Note that n clock cycles are needed to obtain the n values α_i. It can be seen that n² + 2n modulo multiplications are needed for each base extension, and 5n other modulo multiplication operations are needed for a complete modulo multiplication. The Cox-Rower architecture is presented in Figure 10.10.
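A small sketch of (10.16a)–(10.17): the exact α from the CRT identity, plus the cox unit's truncated estimate. The moduli, channel width r, truncation width q and the offset are our own illustrative choices:

```python
# Exact alpha of (10.16c) versus the cox estimate of (10.17).
from math import floor

m = [255, 253, 251]                       # pairwise coprime, close to 2^r
r, q_bits, alpha0 = 8, 4, 0.5
M = m[0] * m[1] * m[2]
x = 1234567 % M

sigmas = [(x % mi) * pow(M // mi, -1, mi) % mi for mi in m]

# Exact: x = sum sigma_i*M_i - alpha*M with alpha = floor(sum sigma_i/m_i)
alpha = floor(sum(s / mi for s, mi in zip(sigmas, m)))
assert sum(s * (M // mi) for s, mi in zip(sigmas, m)) - alpha * M == x

# Cox estimate: keep only the top q bits of each sigma, divide by 2^r
def trunc(s):
    return s & ~((1 << (r - q_bits)) - 1)

alpha_hat = floor(sum(trunc(s) / (1 << r) for s in sigmas) + alpha0)
assert abs(alpha_hat - alpha) <= 1        # off by at most one here
```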
Gandino et al. [35] have suggested reorganization allowing pre-computation of
certain constants of the algorithms due to Bajard et al. [31] and Kawamura
et al. [34]. In these algorithms, several multiplications of partial results with
pre-computed values exist. By exploiting the commutative property, a sequence
of multiplications of a partial result by a pre-computed value is replaced by a single
multiplication. In addition, the authors use the commutative, distributive and
associative properties to rearrange the operations in a more effective way.

[Figure 10.10: n Rower units (single-precision multiplier-accumulators mod a_i/b_i, each with RAM/ROM), fed r bits at a time through the I/O bus, with the Cox unit (a small adder operating on truncated q-bit values) generating the k_i correction bits.]

Figure 10.10 Cox-Rower architecture for Montgomery algorithm in RNS (adapted from [34] ©Eurocrypt2000)

The original RNS MM algorithm and the reorganized RNS MM algorithm are presented in Figure 10.11a, b to illustrate the optimizations that can be carried out. First, the inputs x and y are multiplied by A_j^{−1} and are denoted x̂ and ŷ. Note that the two consecutive steps for computing q from u, by multiplying sequentially with −N^{−1} and B_i^{−1} in the original algorithm, are replaced with a single multiplication of u by (−N^{−1}·B_i^{−1}). Steps 3, 5, 6 and 7 in the original algorithm are moved into the first base extension step, together with the multiplication by A_j that is required to correct the input. There is thus no need to sequentially compute û, u, t, v, w; instead, ŵ can be computed directly. Similar modifications have been suggested in the reorganized first and second base extension algorithms for both Bajard's technique and Kawamura's technique. Further, they have shown that the exponentiation algorithms can also be reorganized. The reader is urged to refer to their work for more information.
Schinianakis and Stouraitis [36] have suggested the use of MRC for both base extension operations in place of the approximate CRT-based techniques (refer to Jullien's technique of Figure 6.1a in Chapter 6). They have also considered the binary-to-RNS and RNS-to-binary conversions needed at the front and back ends. They have used a radix-2^r representation of the given integer x, and all the binary-to-residue converters for the various moduli use L steps to compute

X mod m_i = |Σ_{j=0}^{L−1} x_j·|2^{rj}|_{m_i}|_{m_i}  for all i        (10.19)

Figure 10.11 (a, b) The original and reorganized RNS MM algorithms (adapted from [35] ©IEEE2012)

The constants |2^{rj}|_{m_i} are pre-computed and stored. Thus, L parallel units can convert the given binary word X into the L residues in L steps. The MRC technique is used for RNS-to-binary conversion, where each add/multiply unit weighs the mixed-radix digit appropriately and computes the result. The general architecture for all these functions for RNS Montgomery multiplication (RMM) is presented in Figure 10.12. They denote the two RNS bases as K and Q and compute (A·B·Q^{−1}) mod N, choosing the scaling factor as the product of the moduli in the first moduli set. The authors observe that the BE algorithm using MRC needs only (L² + L − 2)/2 operations, whereas the techniques due to Bajard et al. [31] and Kawamura et al. [34] need 2L² + 3L and 2L² + 4L operations, respectively. Consequently, a 1024-bit exponentiation using 33 32-bit moduli and a clock frequency of 100 MHz could work with a throughput of 3 Mb/s in 0.35–0.18 μm CMOS technology, as against earlier methods [34] whose throughput is 890 kb/s.
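The binary-to-RNS front end of (10.19) can be sketched directly: x is split into L digits of r bits and each channel accumulates against a small pre-computed table (parameters are illustrative):

```python
# Binary-to-RNS conversion per (10.19) using tables of 2^(r*j) mod m_i.

r, L = 8, 4
moduli = [239, 251, 253]
tables = {mi: [pow(2, r * j, mi) for j in range(L)] for mi in moduli}

def to_rns(x):
    digits = [(x >> (r * j)) & ((1 << r) - 1) for j in range(L)]
    return [sum(d * t for d, t in zip(digits, tables[mi])) % mi
            for mi in moduli]

x = 0x12AB34CD
assert to_rns(x) == [x % mi for mi in moduli]
```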
Jie et al. [37] have suggested a reformulation of the Bajard et al. technique [31] which uses pre-computations to reduce the number of steps from 2n² + 8n to 2n² + 5n, whereas the technique of Kawamura et al. [34] needs 2n² + 9n steps. In this approach, a modulus of the form 2^n is used for easy base extension.
Schinianakis and Stouraitis [38] have suggested using MRC following the Yassine and Moore technique [39] discussed in Chapter 5. The moduli in the RNS need to be selected in a proper manner in this method so that the computation is simpler.

Figure 10.12 An RNS Montgomery multiplication architecture due to Schinianakis (adapted from [36] ©IEEE2011)

In Yassine and Moore's technique, (L − 2) multiplications are needed for one base extension, as compared with the L(L − 1)/2 needed for other techniques. The authors have also unified the hardware to cater for both conventional RNS and polynomial RNS. They show that in the RNS case, the number of multiplications can be reduced compared with the use of conventional MRC as in [36]. The authors have designed the hardware for dual-field addition/subtraction, multiplication, modular reduction and MAC operation, to cater for both the fields GF(p) and GF(2^m). They have considered various options, viz., the number of moduli, the use of several MACs (β in number) in parallel, and a selectable radix 2^r.
Ciet et al. [40] have suggested an FPGA implementation of 1024-bit RSA with RNS following a similar approach to Bajard et al. [31] and Kawamura et al. [34]. They have suggested that the nine moduli needed for each of the bases (RNS moduli sets) can be selected from a pool of generalized Mersenne primes of the form 2^{k₁} ± 2^{k₂} ± 1. Thus, C(63, 9)·C(54, 9) possible combinations exist for 58 ≤ k₁ ≤ 64, 0 ≤ k₂ ≤ (k₁ + 1)/2. Ciet et al. [40] have also suggested solutions for signing using the private key d with the RSA algorithm, in which the use of CRT is considered [41].
Denoting the hash of the message to be signed using the private key d as μ, compute μ_p = μ mod p and μ_q = μ mod q. Choosing two random bases as mentioned above, μ_p and μ_q can be represented in the two RNS bases. In order to avoid differential power analysis (DPA) attacks, the authors suggest adding randomization to both the exponent and the message. Next, μ_p^D mod N and μ_q^D mod N can be computed using RNS, followed by CRT to perform the reverse conversion to obtain μ^D mod N. Note that μ_p^D mod N can be computed as μ_p^{D_p} mod N, where D_p = D mod (p − 1), and similarly μ_q^D mod N = μ_q^{D_q} mod N, where D_q = D mod (q − 1).
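The CRT shortcut described here can be checked with textbook-sized RSA parameters (toy values, for illustration only):

```python
# CRT-based signing: with D_p = D mod (p-1) and D_q = D mod (q-1), two
# half-size exponentiations are recombined by CRT.

p, q = 61, 53
N = p * q                                 # 3233
phi = (p - 1) * (q - 1)
e = 17
D = pow(e, -1, phi)                       # private exponent

mu = 1234 % N                             # "hash" of the message (toy value)
mu_p, mu_q = mu % p, mu % q
D_p, D_q = D % (p - 1), D % (q - 1)

s_p = pow(mu_p, D_p, p)                   # = mu^D mod p by Fermat
s_q = pow(mu_q, D_q, q)

# Garner-style CRT recombination
h = ((s_q - s_p) * pow(p, -1, q)) % q
signature = s_p + p * h
assert signature == pow(mu, D, N)
```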



Szerwinski and Guneysu [42] have studied the application of Graphics Processing Units (GPUs) for RNS, owing to their inherently parallel architecture. GPUs can effectively run a group of threads called a warp in a SIMD (Single Instruction Multiple Data) fashion. They suggest that CIOS (coarsely integrated operand scanning) [11] is suitable for implementing Montgomery modular multiplication on GPUs. They have reviewed the various techniques of base extension considered before: that due to Szabo and Tanaka using MRC, and the CRT-based techniques of Shenoy and Kumaresan [32], Bajard et al. [31] and Kawamura et al. [34] for the first and second base extensions. They conclude that the throughput is maximum when using Bajard's method [31] for the first base extension and the Shenoy and Kumaresan technique [32] for the second base extension. For exponentiation as well, they have shown that RNS yields lower throughput (number of encryptions per second) but lower latency than the CIOS technique. Considering 1024-bit modular exponentiation, they observe 439.8 operations/s and 813 operations/s, and latencies of 144 ms as against 6930 ms, for the RNS- and CIOS-based techniques, respectively.

10.4 Montgomery Inverse

The Montgomery representation of the modular inverse is b^{−1}·2^n mod a [43]. The first phase of the evaluation computes b^{−1}·2^k mod a, where k is the number of iterations; its output is known as the Almost Montgomery Inverse (AMI). This first phase is based on computing gcd(a, b), where gcd stands for greatest common divisor. The second phase halves this value (k − n) times modulo a and negates the result to yield b^{−1}·2^n mod a.
The pseudocode is presented in Figure 10.13a. Note that u, v, r and s are such
that us + vr ¼ a, s  1, u  1, 0  v  b. Note that r, s, u, and v are between 0 and
k1
2a  1. The number of iterations k are such that aþb 2 2  ab. The following
k k
invariants can be verified: br ¼ u2 moda and bs ¼ v2 moda. An example for
a ¼ 17, b ¼ 10 and n ¼ 5 is presented in Figure 10.13b.
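The two phases can be sketched in Python as follows (a minimal sketch of Kaliski's algorithm; variable names are ours and need not match Figure 10.13a exactly):

```python
def almost_mont_inverse(b, a):
    """Phase 1: return (x, k) with x = b^(-1) * 2^k mod a (Almost Montgomery Inverse)."""
    u, v, r, s, k = a, b, 0, 1, 0
    while v > 0:                          # binary-gcd style iteration
        if u % 2 == 0:
            u, s = u // 2, 2 * s
        elif v % 2 == 0:
            v, r = v // 2, 2 * r
        elif u > v:
            u, r, s = (u - v) // 2, r + s, 2 * s
        else:
            v, s, r = (v - u) // 2, s + r, 2 * r
        k += 1
    if r >= a:
        r -= a
    return a - r, k                       # negation: a - r = b^(-1) * 2^k mod a

def mont_inverse(b, a, n):
    """Phase 2: halve (k - n) times mod a to obtain b^(-1) * 2^n mod a."""
    x, k = almost_mont_inverse(b, a)
    for _ in range(k - n):
        x = x // 2 if x % 2 == 0 else (x + a) // 2
    return x
```

For the example of Figure 10.13b (a = 17, b = 10, n = 5), phase 1 runs k = 6 iterations and phase 2 one halving, returning 10, which indeed equals 10^(-1)·2^5 mod 17.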
The Montgomery inverse can be effectively used to reduce the number of steps
in exponentiation mod a, as needed in RSA and other algorithms. For example, if the
exponent is 119 = (1110111)_2, it can be recoded in signed-digit form as
(1 0 0 0 1̄ 0 0 1̄)_2, where 1̄ has weight -1 (i.e. 119 = 128 - 8 - 1). Thus only three
multiplications need to be done instead of 5, whereas the number of squarings is the
same in both cases.
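One standard way to obtain such a recoding is the non-adjacent form (NAF); a sketch (our illustration, not an algorithm from [43]):

```python
def naf(n):
    """Non-adjacent form: signed digits in {-1, 0, 1}, least significant first."""
    digits = []
    while n > 0:
        if n % 2:
            d = 2 - (n % 4)   # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits
```

Applied to 119, this yields the digits 1 0 0 0 -1 0 0 -1 (most significant first), i.e. the recoding quoted above.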
Savas and Koc [44] have suggested defining the New Montgomery inverse as
x = b^(-1)·2^(2m) mod a, so that the Montgomery inverse of a number already in the
Montgomery domain is computed:

x = NewMonInv(b·2^m) = (b·2^m)^(-1)·2^(2m) mod a = b^(-1)·2^m mod a    (10.20)

Note that the MonInv algorithm in Figure 10.13a cannot compute the New Montgomery
inverse directly; this needs two steps:

Figure 10.13 (a) Algorithm for computing the Montgomery inverse b^(-1)·2^m mod a and (b) example
a = 17, b = 10, n = 5 (adapted from [43] ©IEEE1995)

c = MonInv(b·2^m) = (b·2^m)^(-1)·2^m mod a = b^(-1) mod a
x = MonPro(c, 2^(2m)) = (b^(-1)·2^(2m)·2^(-m)) mod a = b^(-1)·2^m mod a    (10.21a)

where the Montgomery product (MonPro) is defined as MonPro(x, y) = (xy·2^(-m)) mod a.
Alternatively, we can use

v = MonPro(b·2^m, 1) = (b·2^m·2^(-m)) mod a = b mod a
x = MonInv(b) = b^(-1)·2^m mod a    (10.21b)

This will be useful in ECC computation if the intermediate results are already in
Montgomery domain and when division is needed e.g. in computation of point
addition or doubling.
Gutub et al. [45] have described a VLSI architecture for GF(p) Montgomery
modular inverse computation. They observe that two parallel subtractors for finding
(u - v), (v - u) and (r - a) are required so as to speed up the computation (see
Figure 10.13a). They have suggested a scalable architecture which takes w bits at a
time and performs a scalable operation such as addition/subtraction in ceil(n/w)
cycles. The area of the scalable design has been found to be on average 60% smaller
than the fixed one.
Savas [46] has used redundant signed digit (RSD) representation so that the carry
propagation in addition and subtraction is avoided. This enables fast computation of
multiplicative inverse in GF( p).
Bucek and Lorencz [47] have suggested a subtraction-free AMI technique. It
computes (u + v) instead of (u - v), where one of the operands must always be
negative. By keeping u always negative and v positive, an equivalent of the
differences in the original algorithm can be computed without subtraction. Note
that the values of v, r and s are the same as in the original algorithm, but u takes the
opposite sign. The authors have considered the original AMI design with two subtractors
as well as with one subtractor, and show that the AT (area x time) product is lower for the
subtraction-free AMI, whereas AMI with one subtractor is slower than AMI using
two subtractors. Note that the initial value of u shall be -p instead of p in this
approach. The algorithm is presented in Figure 10.14.

Figure 10.14 Bucek and Lorencz subtraction-free AMI algorithm pseudocode (adapted from [47] ©IEEE2006)

10.5 Elliptic Curve Cryptography Using RNS

Schinianakis et al. [48] have realized ECC using RNS. In order to reduce the division
operations of the affine representation, they have used Jacobian coordinates. Consider
the elliptic curve

y^2 = x^3 + ax + b over F_p    (10.22a)

where a, b ∈ F_p and 4a^3 + 27b^2 ≠ 0 mod p, together with a special point O called the
point at infinity. Substituting x = X/Z^2, y = Y/Z^3, i.e. using the Jacobian coordinate
representation, (10.22a) changes to

E(F_p): Y^2 = X^3 + aXZ^4 + bZ^6    (10.22b)

The point at infinity is given by {0, 0, 0}. The addition of two points P0 = (X0, Y0,
Z0) and P1 = (X1, Y1, Z1) ∈ E(F_p) thus will yield the sum P2 = (X2, Y2, Z2) = P0 + P1
∈ E(F_p) given by
P2 = P0 + P1:  X2 = R^2 - TW^2,  2Y2 = VR - MW^3,  Z2 = Z0·Z1·W    (10.22c)

where

W = X0·Z1^2 - X1·Z0^2,  R = Y0·Z1^3 - Y1·Z0^3,  T = X0·Z1^2 + X1·Z0^2,
M = Y0·Z1^3 + Y1·Z0^3,  V = TW^2 - 2X2.

The doubling of point P1 is given as

P2 = 2P1:  X2 = M^2 - 2S,  Y2 = M(S - X2) - T,  Z2 = 2·Z1·Y1    (10.23)

where M = 3X1^2 + aZ1^4, S = 4X1·Y1^2, T = 8Y1^4.
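A direct Python transcription of (10.22c) and (10.23) follows (a sketch; the small curve y^2 = x^3 + 2x + 3 over F_97 and the base point (3, 6) are our own test values, not taken from [48]):

```python
def jac_double(P, a, p):
    # (10.23): point doubling in Jacobian coordinates
    X1, Y1, Z1 = P
    M = (3 * X1 * X1 + a * pow(Z1, 4, p)) % p
    S = 4 * X1 * Y1 * Y1 % p
    T = 8 * pow(Y1, 4, p) % p
    X2 = (M * M - 2 * S) % p
    Y2 = (M * (S - X2) - T) % p
    Z2 = 2 * Z1 * Y1 % p
    return (X2, Y2, Z2)

def jac_add(P0, P1, p):
    # (10.22c): addition of distinct points in Jacobian coordinates
    X0, Y0, Z0 = P0
    X1, Y1, Z1 = P1
    U0, U1 = X0 * Z1 * Z1 % p, X1 * Z0 * Z0 % p
    S0, S1 = Y0 * pow(Z1, 3, p) % p, Y1 * pow(Z0, 3, p) % p
    W, R = (U0 - U1) % p, (S0 - S1) % p
    T, M = (U0 + U1) % p, (S0 + S1) % p
    X2 = (R * R - T * W * W) % p
    V = (T * W * W - 2 * X2) % p
    Y2 = (V * R - M * pow(W, 3, p)) * pow(2, -1, p) % p   # from 2*Y2 = VR - MW^3
    Z2 = Z0 * Z1 * W % p
    return (X2, Y2, Z2)

def to_affine(P, p):
    X, Y, Z = P
    zi = pow(Z, -1, p)
    return (X * zi * zi % p, Y * pow(zi, 3, p) % p)
```

On the test curve, doubling P = (3, 6, 1) and converting back gives the affine point (80, 10), matching the classical affine doubling formula.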


Note that the computation is intensive in multiplications and additions, while
division is avoided. The scalar multiplication, i.e. the operation kP, follows the binary
algorithm, in which successive doublings and additions are performed depending on the
value of the bits (1 or 0) of the multiplier k. All the operations are mod p, thus
necessitating modulo adders and multipliers. If the field characteristic is 160 bits
long, the equivalent RNS range to compute (10.22c) and (10.23) is about 660 bits.
Hence, the authors use 20 moduli, each of about 33-bit length. In the case of p being
192 bits, the 20 moduli will each be 42 bits. The authors

used an extended RNS with one redundant modulus to perform residue-to-binary
conversion using CRT. The conversion from projective coordinates to affine coordinates
is done using x = X/Z^2, y = Y/Z^3.
The RNS adder, subtractor and multiplier architectures are as shown in
Figure 10.15a, b for the point adder (ECPA) and point doubler (ECPD). The point
multiplier (ECPM) is shown in Figure 10.15c. Note that the RNS adder, multiplier
and subtractor are shared for all the computations in (10.22c) and (10.23). Note also
that the modulo-p reduction is performed after RNS-to-binary conversion using CRT.
The authors have shown that all the operations are significantly faster than those
using conventional hardware. A 160-bit point multiplication takes approximately
2.416 ms on a Xilinx Virtex-E (V1000E-BG-560-8). The authors also observe that the
cost of conversion from residue to binary is negligible.
Schinianakis et al. [49] have further extended their work on ECC using RNS. They
observe that for p of 192-bit length, the equivalent RNS dynamic range is 840 bits;
as such, 20 moduli of 42 bits each have been suggested. The implementation of
(10.22c) and (10.23) can take advantage of the parallelism between multiplication,
addition and subtraction operations for both point addition and point doubling.
They observe that 13 steps will be required for each (see Figure 10.16). Note that
for ECPA, 17 multiplications, 5 subtractions and 2 additions are required, whereas
for ECPD, 15 multiplications, one addition and 3 subtractions are required, sharing
thus a multiplier/adder/subtractor. They, however, do not use a separate squaring
circuit. The RNS uses one extra redundant modulus and employs an extended RNS for
RNS-to-binary conversion based on CRT. A special serial implementation was used
for the multiplication of an nf-bit word by an f-bit word needed in the CRT computation,
considering f bits at a time, where n is the number of moduli and f is the word length of
each modulus. The modulo reduction after CRT is carried out using a serial modulo
multiplier with 1 as one of the operands. The projective-to-affine coordinate conversion
needs division by Z^2 and Z^3, which needs finding the multiplicative inverse. It needs
one modular inversion and four modulo multiplications:

T1 = 1/Z,  T2 = T1^2,  x = X·T2,  T3 = T1·T2,  y = Y·T3    (10.24)

The authors use the technique of [47] for this purpose. The authors also consider
the effect of the choice of the number of moduli and the word length of the moduli on the
performance. They observe that the area decreases as the number of moduli increases,
since the bit length of each modulus decreases. The moduli used are
presented in Table 10.1 for a 192-bit implementation for illustration. The authors
have described an FPGA implementation using a Xilinx Virtex-E XCV 1000E, FG
680 FPGA device. Typically, the time needed for point multiplication is 4.84,
4.08, 3.54 and 2.35 ms for 256-, 224-, 192- and 160-bit implementations, respectively.
Esmaeildoust et al. [50] have described Elliptic curve point multiplication based
on Montgomery technique using two RNS bases. The authors use moduli of the
[Figure 10.15a, b here: datapaths of the ECPA and ECPD, each built around a shared RNS multiplier, RNS adder and RNS subtractor, with multiplexers and decoders routing the intermediate operands (W, R, T, M, V, etc.) between registers.]
Figure 10.15 Architectures of ECPA (a), ECPD (b) and ECPM (c) (adapted from [48] ©IEE2006)

[Figure 10.15c here: the ECPM, in which a counter and shift register scan the l bits of k (MSB to LSB) and multiplexers feed the ECPA and ECPD units to accumulate [k]P, starting from O and P.]
Figure 10.15 (continued)

form 2^k, 2^k - 1 and 2^k - 2^(t_i) - 1 in order to have efficient arithmetic operations and
efficient binary-to-RNS and RNS-to-binary conversions. The first base uses three or four
moduli of the type 2^k - 2^(t_i) - 1, where t_i < k/2, depending on the field length:
160 bits (three moduli), 192 bits (three or four moduli), 224 and 256 bits (four moduli
each). The second base uses either the three-moduli set {2^k, 2^k - 1, 2^(k+1) - 1} [51]
for 160- and 192-bit field lengths or the four-moduli set {2^k, 2^k - 1, 2^(k+1) - 1,
2^(k-1) - 1} [52] for field lengths 192, 224 and 256 bits. The various arithmetic
operations, such as modulo addition, modulo subtraction and modulo multiplication, are
simpler for these moduli. As an illustration, consider a modulus of the form
(2^k - 2^(t_i) - 1). The reduction of a 2k-bit number w (the product of two k-bit
numbers), written as w = w_h·2^k + w_l, can be realized as a mod (2^k - 2^(t_i) - 1)
addition of four operands: since 2^k = 2^(t_i) + 1 (mod 2^k - 2^(t_i) - 1), we have
w = w_h·2^(t_i) + w_h + w_l, and writing w_h·2^(t_i) = w'_hh·2^k + w'_hl and applying
the congruence once more,

w = <w'_hh·(2^(t_i) + 1) + w'_hl + w_h + w_l> mod (2^k - 2^(t_i) - 1).

After the MRC digits are found for base 1, conversion to base 2 needs the computation of
x_j = <v_1 + m_1(v_2 + m_2(v_3 + m_3·v_4))> mod m_j, where the v_i are the mixed-radix
digits [50]. The MRC digit computation needs modulo subtraction and modulo
multiplication by multiplicative inverses; due to the particular form of the moduli,
these operations are simple. We will consider this in more detail later. The
advantage of the choice of the moduli set {2^n, 2^n - 1, 2^(n+1) - 1} is that the MRC
digits can be easily found (see Chapter 5). The conversion from the second base to the
first base is also performed in a similar way. Thus, using shifters and adders, the
various modulo operations can be performed.
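The shift-and-add reduction for a modulus m = 2^k - 2^t - 1 can be sketched as follows (the parameters k = 16, t = 5 used in the check are our own illustrative choice):

```python
def reduce_special(w, k, t):
    """Reduce w modulo m = 2^k - 2^t - 1 using only shifts and adds,
    exploiting 2^k = 2^t + 1 (mod m)."""
    m = (1 << k) - (1 << t) - 1
    while w.bit_length() > k:          # fold the high part down
        w_h, w_l = w >> k, w & ((1 << k) - 1)
        w = (w_h << t) + w_h + w_l     # w_h*2^k + w_l = w_h*(2^t + 1) + w_l (mod m)
    while w >= m:                      # final conditional subtraction
        w -= m
    return w
```

Each folding step strictly shrinks w (since 2^t + 1 < 2^k), and after folding at most one subtraction remains, because 2^k - 1 = m + 2^t.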
The authors have employed a four-stage pipeline comprising one mod
(2^k - 2^(t_i) - 1) multiplier, one reconfigurable modular (RM) multiplier, one RM
adder and two base extension units with adder-based structures. The RM structures
cater for operations on the four types of moduli needed: 2^k, 2^k - 1, 2^(k+1) - 1 and
2^(k-1) - 1. A six-stage pipeline has also been suggested which can achieve higher
speed; in this version, the conversion from one RNS base to the other is performed in
two stages, RNS to MRS and MRS to RNS. The designs were implemented on Xilinx
Virtex-E, Xilinx Virtex-2 Pro and Altera Stratix II devices.

[Figure 10.16a here: data flow graph for point addition, scheduling the multiplications, additions and subtractions of (10.22c) over time steps t1-t13, with reserved registers A = X0, B = Y0, C = Z0, D = Z0^2, G = X1, H = Y1, I = Z1.]
Figure 10.16 (a, b) Data flow graphs (DFGs) for the point addition and point doubling algorithms (adapted from [49] ©IEEE2011)

[Figure 10.16b here: data flow graph for point doubling, scheduling the operations of (10.23) over time steps t1-t13, with reserved registers A = X1, B = Y1, C = Z1, D = X1^2.]
Figure 10.16 (continued)

Typically, on a Xilinx Virtex-E, a 192-bit ECPM takes 2.56 ms and needs 20,014
LUTs, while for a 160-bit field length the ECPM needs 1.83 ms and 15,448
LUTs. The reader is urged to refer to [50] for more details.
The difference in complexity between addition and doubling leads to simple power
analysis (SPA) attacks [73]; hence, unified addition formulae need to be used.

Table 10.1 RNS base modulus set for the 192-bit implementation (adapted from [49] ©IEEE2006):
2446268224217  2446268224261  2446268224273  2446268224289
2446268224321  2446268224381  2446268224409  2446268224427
2446268224441  2446268224447  2446268224451  2446268224453
2446268224457  2446268224481  2446268224493  2446268224513
2446268224579  2446268224601  2446268224639  2446268224657

The Montgomery ladder can be used for the scalar multiplication algorithm (addition and
doubling performed in each step). Several solutions have been suggested to provide
leak resistance for different types of elliptic curves. As an illustration, for the Hessian
form [53], the curve equation is given by

x^3 + y^3 + z^3 = 3dxyz    (10.25a)

where d ∈ F_p and is not a third root of unity. For the Jacobi model [54], we have the
curve equation

y^2 = ε·x^4 - 2δ·x^2·z^2 + z^4    (10.25b)

where ε and δ are constants in F_p, and for the short Weierstrass form [55], the curve
equation is given by

y^2·z = x^3 + a·x·z^2 + b·z^3    (10.25c)

These require 12, 12 and 18 field multiplications for addition/doubling, respectively.
Note that Montgomery's technique [56] proposes to work only on x coordinates. The curve
equation is given by

B·y^2 = x^3 + A·x^2 + x    (10.25d)

Both addition and doubling take the time of only three multiplications and two
squarings, and both are performed for each bit of the multiplier; the cost of finding
kG is thus about 10·log2(k) multiplications.
Bajard et al. [73] have also shown that the formulae for point addition and
doubling can be rewritten to minimize the number of modular reductions needed. As an
illustration, for the Hessian-form elliptic curve, the original equations for the addition
of two points (X1, Y1, Z1) and (X2, Y2, Z2) are

X3 = Y1^2·X2·Z2 - Y2^2·X1·Z1
Y3 = X1^2·Y2·Z2 - X2^2·Y1·Z1    (10.26a)
Z3 = Z1^2·X2·Y2 - Z2^2·X1·Y1

The cost of multiplication is negligible compared to the cost of reduction in
RNS. The authors consider RNS bases with moduli of the type m_i = 2^k - c_i, where c_i
is small and sparse and c_i < 2^(k/2). Several co-prime moduli can be found: e.g., for
m_i < 2^32, c_i = 2^(t_i) - 1 with t_i = 0, 1, . . ., 16, and c_i = 2^(t_i) + 1 with
t_i = 1, . . ., 15. If more co-primes are needed, c_i of the form 2^(t_i) ± 2^(s_i) ± 1
can be used. The reduction mod m_i in these cases needs a few shift and add
operations; the reduction part then costs about 10% of the cost of a multiplication. Thus,
an RNS digit product is equivalent to about 1.1 word products (where a word is k bits) and
an RNS multiplication needs only 2n RNS digit products, or 2.2n word products.
The authors have shown that in RNS the number of modular reductions needed can be
lowered. The advantage of RNS becomes apparent if we count multiplications and
modular reductions separately. Hence, (10.26a) can be rewritten as

A = Y1·X2, B = Y1·Z2, C = X1·Y2, D = Y2·Z1, E = X1·Z2, F = X2·Z1,
X3 = AB - CD, Y3 = EC - FA, Z3 = FD - EB    (10.26b)

Thus, only nine reductions and 12 multiplications are needed. Similar results can be
obtained for the Weierstrass and Montgomery ladder cases.
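Since (10.26a) and (10.26b) are polynomial identities, the rewriting can be checked on arbitrary inputs; a sketch (the prime 10007 and the sample triples are our own illustrative values):

```python
p = 10007  # illustrative prime

def hessian_add_direct(P1, P2, p):
    # (10.26a): reduce after every partial product
    X1, Y1, Z1 = P1
    X2, Y2, Z2 = P2
    X3 = (Y1 * Y1 % p * X2 % p * Z2 - Y2 * Y2 % p * X1 % p * Z1) % p
    Y3 = (X1 * X1 % p * Y2 % p * Z2 - X2 * X2 % p * Y1 % p * Z1) % p
    Z3 = (Z1 * Z1 % p * X2 % p * Y2 - Z2 * Z2 % p * X1 % p * Y1) % p
    return (X3, Y3, Z3)

def hessian_add_lazy(P1, P2, p):
    # (10.26b): 12 multiplications but only 9 reductions
    # (Z3 written as FD - EB so that the sign agrees with (10.26a))
    X1, Y1, Z1 = P1
    X2, Y2, Z2 = P2
    A, B, C = Y1 * X2 % p, Y1 * Z2 % p, X1 * Y2 % p
    D, E, F = Y2 * Z1 % p, X1 * Z2 % p, X2 * Z1 % p
    return ((A * B - C * D) % p, (E * C - F * A) % p, (F * D - E * B) % p)
```

The lazy version performs six reductions for A..F and three for the outputs, nine in total, matching the count quoted above.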
The RNS base extension needed in Montgomery reduction, using MRC first followed
by Horner evaluation, has been considered by Bajard et al. [73]. Expressing the
reconstruction from the residues in base B using MRC as

A = a_1 + m_1(a_2 + m_2(a_3 + · · · + m_(n-1)·a_n) · · ·)    (10.27a)

we need to compute, for base extension to base B' with moduli m_j for j = n + 1, . . ., 2n,

a_j = <a_1 + m_1(a_2 + m_2(a_3 + · · · + m_(n-1)·a_n) · · ·)> mod m_j    (10.27b)

The number of multiplications by the constants <m_i^(-1)> mod m_j in the MRC digit
computation is (n^2 - n)/2 digit products. The conversion from MRS to RNS corresponds
to a few shifts and adds: assuming moduli of the form 2^k - 2^(t_i) - 1, it needs the
computation of <a + b·m_i> mod m_j = <a + 2^k·b - 2^(t_i)·b - b> mod m_j, which can be
done in two additions (since a + 2^k·b is just a concatenation), while the reduction
mod m_j requires three additions. Thus, the evaluation of each a_j in base B' needs
5n word additions. The MRS-to-RNS conversion needs (n^2 - n)/5 RNS digit products,
since five word additions are equivalent to 1/5 of an RNS digit product. Hence, for the
two base extensions we need
(n^2 - n) + (2/5)(n^2 - n) + 3n = (7/5)n^2 + (8/5)n RNS digit products, which compares favorably with other O(n^2) digit-product methods.
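The MRC-then-Horner base extension can be sketched as follows (the small moduli in the check are our own arbitrary co-prime choices):

```python
def mrc_digits(residues, base):
    """Mixed-radix digits v_1..v_n of the integer represented by `residues`
    in the RNS `base` (pairwise co-prime moduli)."""
    v = []
    for i, m in enumerate(base):
        x = residues[i]
        for j in range(i):
            # multiply by the constant <m_j^-1> mod m_i, as in the text
            x = (x - v[j]) * pow(base[j], -1, m) % m
        v.append(x)
    return v

def extend(residues, base, new_modulus):
    """Base-extend one residue via Horner evaluation of (10.27b)."""
    v = mrc_digits(residues, base)
    acc = v[-1] % new_modulus
    for i in range(len(base) - 2, -1, -1):   # a = v_i + m_i * (...)
        acc = (v[i] + base[i] * acc) % new_modulus
    return acc
```

Evaluating the Horner form without any modulus recovers the integer itself, which the first assertion below exploits.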

10.6 Pairing Processors Using RNS

Considerable attention has been paid to the design of special-purpose processors
and software algorithms for efficient implementation of pairing protocols. The
pairing computation can be broken down into multiplications and additions in the
underlying fields. Pairing has applications in three-way key exchange [57], identity-
based encryption [58], identity-based signatures [59] and non-interactive zero-
knowledge proofs [60].
The name bilinear pairing indicates that it takes a pair of vectors as input and
returns a number, and that it performs a linear transformation on each of its input
variables. These operations depend on elliptic or hyper-elliptic curves. A pairing is a
mapping e: G1 x G2 -> G3, where G1 is a curve group defined over a finite field F_q and
G2 is another curve group over the extension field F_(q^k), and G3 is a subgroup of the
multiplicative group F*_(q^k). If G1 and G2 are the same group, then e is called a
symmetric pairing; if G1 ≠ G2, then e is called an asymmetric pairing. The map is
linear in each component and hence useful for constructing cryptographic protocols.
Several pairings exist: the Weil pairing [61], Tate pairing [62], ate pairing [63],
R-ate pairing [64] and optimal pairing [65].
Let F_p be the prime field with characteristic p, let E(F_p) be an elliptic curve
y^2 = x^3 + a_4·x + a_6, and let #E(F_p) be the number of points on the elliptic curve.
Let ℓ be a prime divisor of #E(F_p) = p + 1 - t, where t is the trace of the Frobenius
map on the curve. The embedding degree k of E with respect to ℓ is the smallest integer
such that ℓ divides p^k - 1. This means that the full ℓ-torsion is defined over the
field F_(p^k). For any integer m and ℓ-torsion point P, f_(m,P) is the function defined
on the curve whose divisor is

div f_(m,P) = m(P) - ([m]P) - (m - 1)(O)    (10.28)

We define E(k)[r] as the k-rational r-torsion group of the curve. Let G1 = E(F_p)[r],
G2 = E(F_(p^k))/rE(F_(p^k)) and G3 = μ_r ⊂ F*_(p^k) (the r-th roots of unity). Let
P ∈ G1, Q ∈ G2; then, the reduced Tate pairing is defined as

e_T: E(F_p)[ℓ] x E(F_(p^k)) -> F*_(p^k)/(F*_(p^k))^ℓ    (10.29a)

e(P, Q) = f_(ℓ,P)(Q)^((p^k - 1)/ℓ)    (10.29b)

The first step is to evaluate the function f_(ℓ,P)(Q) at Q using the Miller loop [61].
Pseudocode for the Miller loop is presented in Figure 10.17; it uses the classical
square-and-multiply algorithm. The Miller loop is the core of all pairing protocols.

Figure 10.17 Algorithm for Miller loop (adapted from [66] ©2011)

In this, g_(A,B) is the equation of the line passing through the points A and B (or tangent
at A if A = B) and ν_A is the equation of the vertical line passing through A, so that
g_(A,B)/ν_(A+B) is the function on E involved in the addition of A and B. The values of
the line and vertical functions g_(A,B) and ν_(A+B) are the distances calculated between
the fixed point Q and the lines that arise when adding B to A on the elliptic curve in
the standard way. Considering the affine coordinate representations of A and A + B
as (x_j, y_j) and (x_(j+1), y_(j+1)), and the coordinates of Q as (x_Q, y_Q), we have

l_(A,B)(Q) = y_Q - y_j - λ_j(x_Q - x_j)
v_(A+B)(Q) = x_Q - x_(j+1)

Miller [61] proposed an algorithm that constructs f_(ℓ,P)(Q) in stages using the
double-and-add method. The second step is to raise f to the power (p^k - 1)/ℓ.
The length of the Miller loop can be reduced to half of that of the Tate pairing,
because t - 1 ≈ sqrt(ℓ), by swapping P and Q in the ate pairing [63]. Here, we
define G1 = E(F_p)[r] and G2 = E(F_(p^k))[r] ∩ Ker(π_p - [p]), where π_p is the p-th
power Frobenius endomorphism, i.e. π_p: E -> E: (x, y) -> (x^p, y^p). Let P ∈ G1,
Q ∈ G2 and let t = p + 1 - #E(F_p) be the trace of Frobenius. Then, the ate pairing
is defined as

e_A: (E(F_(p^k)) ∩ Ker(π - p)) x E(F_p)[ℓ] -> F*_(p^k)/(F*_(p^k))^ℓ    (10.30a)

e_A(Q, P) = f_(t-1,Q)(P)^((p^k - 1)/ℓ)    (10.30b)

Note that for Q ∈ Ker(π - p), π(Q) = (t - 1)Q.


In the case of R-Ate pairing [64], if l is the parameter used to construct the BN
curves [67], b ¼ 6l + 2, it is defined as
308 10 RNS in Cryptography

   ‘
 
eR : E F k \ Kerðπ  pÞ  E Fp ½‘ ! F* k F* k ð10:31aÞ
p p p
k
  p  p ‘1
Ra ðQ; PÞ ¼ f ðb;QÞ ðPÞ: f ðb;QÞ ðPÞ:gðbQ;QÞ ðPÞ gðπðbþ1ÞQ, bQÞ ðPÞ ð10:31bÞ

pffiffi
The length of the Miller loop is 4 ‘ and hence is reduced by 4 compared to Tate
pairing.
The MNT curves [68] have an embedding degree k of 6. These are ordinary
elliptic curves over F_p such that p = 4l^2 + 1 and t = 1 ± 2l, where p is a large prime
such that #E(F_p) = p + 1 - t is a prime [69].
Parameterized elliptic curves due to Barreto and Naehrig [67] are well suited for
asymmetric pairings. These are defined by E: y^2 = x^3 + a_6, a_6 ≠ 0, over F_p,
where p = 36u^4 - 36u^3 + 24u^2 - 6u + 1 and the order n of E is
n = 36u^4 - 36u^3 + 18u^2 - 6u + 1, for some u such that p and n are primes; only a
u that generates primes p and n will suffice. BN curves have an embedding degree k = 12,
which means that n divides p^12 - 1 but not p^k - 1 for 0 < k < 12. Note that
t = 6u^2 + 1 is the trace of Frobenius. The value of t is also parameterized and must be
chosen large enough to meet a given security level. For efficiency of computation,
u and t must have small Hamming weight. As an example, for a_6 = 3,
u = 0x6000 0000 0000 1F2D (hex) gives 128-bit security. Since t, n and p are
parameterized, the parameter u alone suffices to be stored or transmitted. This
yields two primes n and p of 256 bits with Hamming weights 91 and 87, respectively.
The field size of F_(p^k) is then 256 x k = 3072 bits. This allows a faster exponentiation
method.
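The parameterization can be checked quickly (using the u quoted above; the assertions verify only the p + 1 - t relation, the 256-bit size, and the embedding-degree divisibility n | p^12 - 1, all of which hold by construction):

```python
u = 0x6000000000001F2D

# BN parameterization as given in the text
p = 36*u**4 - 36*u**3 + 24*u**2 - 6*u + 1   # field characteristic
n = 36*u**4 - 36*u**3 + 18*u**2 - 6*u + 1   # curve (group) order
t = 6*u**2 + 1                               # trace of Frobenius
```

The identity n(x) · n(-x) = Φ12(6x^2), where Φ12(y) = y^4 - y^2 + 1, is what guarantees the embedding degree 12.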
An advantage of BN curves is their degree-6 twist. Considering two elliptic curves
E and Ẽ defined over F_q, the degree of the twist is the degree d of the smallest
extension F_(q^d) of F_q over which the isomorphism ψ_d between E and Ẽ is defined.
For a BN curve, E is isomorphic over F_(p^12) to a curve Ẽ defined by
y^2 = x^3 + a_6/ν, where ν is an element of F_(p^2) which is neither a cube nor a
square. Thus, we can define twisted versions of the pairings on Ẽ(F_(p^2)) x E(F_p)[ℓ].
This means that the coordinates of Q can be written as (x_Q·ν^(1/3), y_Q·ν^(1/2)),
where x_Q, y_Q are in F_(p^2):

ψ_6: Ẽ -> E, (x, y) -> (x·ν^(1/3), y·ν^(1/2))    (10.32)

Note that computing g, v, 2T and T + Q (needed in the Miller loop of Figure 10.17)
requires only F_(p^2) arithmetic, but the result remains in F_(p^12). The denominators
v_(2T) and v_(T+Q) are wiped out by the final exponentiation.

For the implementation of pairing protocols, special hardware will be required, such
as large-operand multipliers based on a variety of techniques: Karatsuba,
Toom-Cook, arithmetic in extension fields, etc. It will be helpful to consider these
first, before discussing pairing processor implementation using RNS. The reader is
urged to consult [70, 71] for a tutorial introduction to pairing.

Large Operand Multipliers

Bajard et al. [72] have considered the choice of 64-bit moduli with low Hamming
weight for the moduli sets. The advantage is that the multiplicative inverses needed in
MRC then also have low Hamming weight, simplifying those multiplications to a few
additions; for base extension to another RNS as well, as explained before, such moduli
are useful. These moduli are of the type 2^k - 2^(t_i) - 1, where t_i < k/2. As an
illustration, two six-moduli sets are {2^64 - 2^10 - 1, 2^64 - 2^16 - 1, 2^64 - 2^19 - 1,
2^64 - 2^28 - 1, 2^64 - 2^20 - 1, 2^64 - 2^31 - 1}, all of Hamming weight 3, and
{2^64 - 2^22 - 1, 2^64 - 2^13 - 1, 2^64 - 2^29 - 1, 2^64 - 2^30 - 1, 2^64 - 1, 2^64},
with Hamming weights 3, 3, 3, 3, 2 and 1. The inverses in this case have Hamming
weights ranging between 2 and 20.
Multiplication mod (2^224 - 2^96 + 1), which is the NIST prime P-224, can be easily
carried out [42]. The product consists of 14 32-bit words. Denoting these
as r13, r12, r11, . . ., r2, r1, r0, the reduction can be carried out by computing
(t1 + t2 + t3 - t4 - t5) mod P-224, where the t_i are the following concatenations of
32-bit words (most significant first):
t1 = (r6, r5, r4, r3, r2, r1, r0)
t2 = (r10, r9, r8, r7, 0, 0, 0)
t3 = (0, r13, r12, r11, 0, 0, 0)
t4 = (0, 0, 0, 0, r13, r12, r11)
t5 = (r13, r12, r11, r10, r9, r8, r7)
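A sketch of this word-level reduction (the final `% P224` absorbs the small over/undershoot of the signed five-term sum):

```python
P224 = 2**224 - 2**96 + 1
MASK32 = (1 << 32) - 1

def reduce_p224(c):
    """Reduce c < P224^2 modulo P224 using the five-term word recombination."""
    r = [(c >> (32 * i)) & MASK32 for i in range(14)]   # words r0 .. r13

    def words(ws):  # build an integer from 32-bit words, most significant first
        v = 0
        for w in ws:
            v = (v << 32) | w
        return v

    t1 = words([r[6], r[5], r[4], r[3], r[2], r[1], r[0]])
    t2 = words([r[10], r[9], r[8], r[7], 0, 0, 0])
    t3 = words([0, r[13], r[12], r[11], 0, 0, 0])
    t4 = words([0, 0, 0, 0, r[13], r[12], r[11]])
    t5 = words([r[13], r[12], r[11], r[10], r[9], r[8], r[7]])
    return (t1 + t2 + t3 - t4 - t5) % P224
```

The congruence follows from folding each high word with 2^224 = 2^96 - 1 (mod P-224); the positive contributions collect into t2 and t3 and the negative ones into t4 and t5.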
Multiplication of large numbers can be carried out using the Karatsuba formula [74],
using fewer multiplications of smaller numbers at the expense of more additions. This
can be viewed as multiplication of polynomials. Two linear polynomials of two
terms can be multiplied as follows using only three multiplications:

(a0 + a1·x)(b0 + b1·x) = a0·b0 + (a0·b1 + a1·b0)x + a1·b1·x^2
= a0·b0 + ((a0 + a1)(b0 + b1) - a0·b0 - a1·b1)x + a1·b1·x^2    (10.33a)

Thus a0·b0, a1·b1 and (a0 + a1)(b0 + b1) are the three needed multiplications.
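A recursive integer version of (10.33a), splitting at a chosen bit boundary (a sketch; the 32-bit base-case threshold is an arbitrary choice of ours):

```python
def karatsuba(a, b, bits=32):
    """Multiply non-negative integers using three half-size products per level."""
    if a < (1 << bits) or b < (1 << bits):
        return a * b                         # base case: native multiply
    h = max(a.bit_length(), b.bit_length()) // 2
    a1, a0 = a >> h, a & ((1 << h) - 1)      # a = a1*2^h + a0
    b1, b0 = b >> h, b & ((1 << h) - 1)
    p0 = karatsuba(a0, b0, bits)             # a0*b0
    p2 = karatsuba(a1, b1, bits)             # a1*b1
    p1 = karatsuba(a0 + a1, b0 + b1, bits) - p0 - p2   # middle term of (10.33a)
    return (p2 << (2 * h)) + (p1 << h) + p0
```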


Extension to three terms [75] is as follows:

(a0 + a1·x + a2·x^2)(b0 + b1·x + b2·x^2) = a0·b0·(C + 1 - x - x^2)
+ a1·b1·(C - x + x^2 - x^3)
+ a2·b2·(C - x^2 - x^3 + x^4)
+ (a0 + a1)(b0 + b1)(-C + x)
+ (a0 + a2)(b0 + b2)(-C + x^2)
+ (a1 + a2)(b1 + b2)(-C + x^3)
+ (a0 + a1 + a2)(b0 + b1 + b2)·C    (10.33b)

for an arbitrary polynomial C with integer coefficients. A proper choice of C can
reduce the number of multiplications: for example, C = x^2 makes the multiplier of
(a0 + a2)(b0 + b2) vanish, avoiding the need to compute that product. Thus, only six
multiplications are needed instead of the nine multiplications of the schoolbook
algorithm.
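The C = x^2 instance of (10.33b) reads off directly as six products (a sketch; coefficient lists are little-endian):

```python
def mul3(a, b):
    """Multiply degree-2 polynomials a, b (lists [c0, c1, c2]) with 6 products."""
    p00, p11, p22 = a[0] * b[0], a[1] * b[1], a[2] * b[2]
    q01 = (a[0] + a[1]) * (b[0] + b[1])
    q12 = (a[1] + a[2]) * (b[1] + b[2])
    q = (a[0] + a[1] + a[2]) * (b[0] + b[1] + b[2])
    # coefficients read off (10.33b) with C = x^2
    return [p00,
            q01 - p00 - p11,
            q - q01 - q12 + 2 * p11,
            q12 - p11 - p22,
            p22]
```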
Montgomery [75] has extended this technique to products of quartic, quintic and
sextic polynomials, which are presented below for completeness. The quartic case
(see (10.34)) needs 13 multiplications and 22 additions/subtractions by taking
advantage of the common sub-expressions

a0 + a1, a0 - a4, a3 + a4, (a0 + a1) - (a3 + a4), (a0 + a1) + a2, a2 + (a3 + a4),
(a0 + a1) + (a2 + a3 + a4), a0 - (a2 + a3 + a4), (a0 + a1 + a2) - a4,
(a0 - a2 - a3 - a4) + a4, (a0 + a1 + a2 - a4) - a0

and similarly with the b's. Other optimizations are also possible by considering
repeated sub-expressions.
  
a0 þ a1 x þ a2 x2 þ a3 x3 þ a4 x4 b0 þ b1 x þ b2 x2 þ b3 x3 þ b4 x4
 
¼ ða0 þ a1 þ a2 þ a3 þ a4 Þðb0 þ b1 þ b2 þ b3 þ b4 Þ x5  x4 þ x3
 
þða0  a2  a3  a4 Þðb0  b2  b3  b4 Þ x6  2x5 þ 2x4  x3
 
þða0 þ a1 þ a2  a4 Þðb0 þ b1 þ b2  b4 Þ x5 þ 2x4  2x3 þ x2
 5 
þða0 þ a1  a3  a4 Þðb0 þ b1  b3  b4 Þ x  2x4 þ x3
 
þða0  a2  a3 Þðb0  b2  b3 Þ x6 þ 2x5  x4
 
þða1 þ a2  a4 Þðb1 þ b2  b4 Þ x4 þ 2x3  x2
   
þða3 þ a4 Þðb3 þ b4 Þ x7  x6 þ x4  x3 þ ða0 þ a1 Þðb0 þ b1 Þ x5 þ x4  x2 þ x
 
þða0  a4 Þðb0  b4 Þ x6 þ 3x5  4x4 þ 3x3  x2
 
þa4 b4 x8  x7 þ x6  2x5 þ 3x4  3x3 þ x2
   
þa3 b3 x7 þ 2x6  2x5 þ x4 þ a1 b1 x4  2x3 þ 2x2  x
 
þa0 b0 x6  3x5 þ 3x4  2x3 þ x2  x þ 1
ð10:34Þ
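The thirteen products of (10.34) and their fixed polynomial multipliers can be checked mechanically against schoolbook convolution (a sketch; coefficient lists are little-endian):

```python
def school(a, b):
    """Schoolbook polynomial product (convolution of coefficient lists)."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

def mont5(a, b):
    """Montgomery's 13-multiplication formula (10.34) for 5-term polynomials."""
    s = lambda xs, idx: sum(xs[i] for i in idx[0]) - sum(xs[i] for i in idx[1])
    terms = [  # ((plus indices, minus indices), multiplier poly x^0..x^8)
        (([0, 1, 2, 3, 4], []), [0, 0, 0, 1, -1, 1, 0, 0, 0]),
        (([0], [2, 3, 4]),      [0, 0, 0, -1, 2, -2, 1, 0, 0]),
        (([0, 1, 2], [4]),      [0, 0, 1, -2, 2, -1, 0, 0, 0]),
        (([0, 1], [3, 4]),      [0, 0, 0, 1, -2, 1, 0, 0, 0]),
        (([0], [2, 3]),         [0, 0, 0, 0, -1, 2, -1, 0, 0]),
        (([1, 2], [4]),         [0, 0, -1, 2, -1, 0, 0, 0, 0]),
        (([3, 4], []),          [0, 0, 0, -1, 1, 0, -1, 1, 0]),
        (([0, 1], []),          [0, 1, -1, 0, 1, -1, 0, 0, 0]),
        (([0], [4]),            [0, 0, -1, 3, -4, 3, -1, 0, 0]),
        (([4], []),             [0, 0, 1, -3, 3, -2, 1, -1, 1]),
        (([3], []),             [0, 0, 0, 0, 1, -2, 2, -1, 0]),
        (([1], []),             [0, -1, 2, -2, 1, 0, 0, 0, 0]),
        (([0], []),             [1, -1, 1, -2, 3, -3, 1, 0, 0]),
    ]
    c = [0] * 9
    for idx, poly in terms:
        prod = s(a, idx) * s(b, idx)   # one of the 13 multiplications
        for i, coef in enumerate(poly):
            c[i] += coef * prod
    return c
```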

Similarly, the quintic polynomial multiplication can be done with 17 multiplications,
and the sextic case needs 22 base-ring multiplications. Bounds on the needed number of
multiplications for products with up to 18 terms are given in [75].
The hardware implementation of F_p arithmetic for pairing-friendly curves,
e.g. Barreto-Naehrig (BN) curves, can be intensive in modular multiplications.
These can be realized using the polynomial Montgomery reduction technique
[76, 77], with either a parallel or a digit-serial implementation. In this hybrid
modular multiplication (HMM) technique, polynomial reduction is carried out
using the Montgomery technique while coefficient reduction uses division. In the
parallel version [76], four steps are involved (see the flow chart in Figure 10.18a):
(a) polynomial multiplication, (b) coefficient reduction mod z, (c) polynomial
reduction and (d) coefficient reduction.
We wish to compute r(z) = a(z)b(z)z^(-n) mod p(z), where a(z) = Σ_(i=0..n-1) a_i·z^i,
b(z) = Σ_(i=0..n-1) b_i·z^i and p(z) = Σ_(i=1..n-1) p_i·z^i + 1, with p = f(z) the
modulus. The polynomial multiplication in the first step leads to

c(z) = a(z)b(z) = Σ_(i=0..2n-2) c_i·z^i    (10.35a)

In this step, the coefficient reduction is carried out by finding c_i mod z and c_i div z;
the quantity c_i div z is added to c_(i+1). In the polynomial reduction based on the
Montgomery technique, first q(z) is found as

q(z) = (c(z) mod z^n)·g(z) mod z^n    (10.35b)

where g(z) = (-f(z))^(-1) mod z^n. Next, we compute (c(z) + q(z)f(z))/z^n. A last step
is coefficient reduction. The computation yields a(z)b(z)z^(-5) mod p in the case of BN
curves. The expressions for q(z), h(z) and v(z) in the case of BN curves are as follows:

q(z) = Σ_(i=0..4) q_i·z^i = (-c4 + 6(c3 - 2c2 - 6(c1 - 9c0)))z^4
+ (-c3 + 6(c2 - 2c1 - 6c0))z^3 + (-c2 + 6(c1 - 2c0))z^2    (10.35c)
+ (-c1 + 6c0)z - c0

and
h(z) = Σ_(i=0..3) h_i·z^i = 36q4·z^3 + 36(q4 + q3)z^2
+ 12(2q4 + 3(q3 + q2))z + 6(q4 + 4q3 + 6(q2 + q1))    (10.35d)

[Figure 10.18 here: (a) the parallel HMM datapath: five multipliers (four 65 x 65 and one 65 x 32) form the polynomial product c8..c0, five Mod-1 blocks perform the first coefficient reduction, a polynomial-reduction stage computes v3..v0, and Mod-2 blocks with a final adder row produce r4..r0; (b) detail of the Mod-1 and Mod-2 reduction cells with their multiply-accumulate structure.]
Figure 10.18 (a) Parallel hybrid modular multiplication algorithm for BN curves (b) F_p multiplier using HMMB (adapted from [76] ©IEEE2012)

v(z) = c(z)/z^5 + h(z)    (10.35e)

Next, coefficient reduction is done on v(z) to obtain r(z).

An example [78] will be illustrative. Consider a = 35z^4 + 36z^3 + 7z^2 + 6z + 103 and
b = 5z^4 + 136z^3 + 34z^2 + 9z + 5 with z = 137 and f(z) = 36z^4 + 36z^3 + 24z^2 + 6z + 1.
Note that g(z) = (-f(z))^(-1) mod z^5 = 324z^4 - 36z^3 - 12z^2 + 6z - 1. We need to find
r(z) = a(z)b(z)/z^5 mod f(z).
First, we consider the non-polynomial (integer) form. We have A = 12,422,338,651,
B = 2,111,720,197 and p = 12,774,932,983. We find A·B = 26,232,503,423,290,434,247
and 137^5 = 48,261,724,457. We also compute α = (A·B) mod 137^5 = 41,816,018,411 and
β = g(137) = 114,044,423,849. Hence, γ = (αβ) mod 137^5 = 33,251,977,638. Finally, we
compute (A·B + γp)/137^5 = 451,024,289,300,955,068,401/48,261,724,457 = 9,345,382,793.
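This integer-level Montgomery step can be reproduced directly (a sketch; β is computed here as (-p)^(-1) mod 137^5 rather than read off g(z), which gives the same γ since the two values are congruent):

```python
z = 137
a = [103, 6, 7, 36, 35]           # a(z), little-endian coefficients
b = [5, 9, 34, 136, 5]            # b(z)
f = [1, 6, 24, 36, 36]            # f(z) = 36z^4 + 36z^3 + 24z^2 + 6z + 1

val = lambda poly: sum(c * z**i for i, c in enumerate(poly))
A, B, p = val(a), val(b), val(f)
R = z**5                          # the Montgomery radix 137^5

beta = pow(-p, -1, R)             # congruent to g(137) mod 137^5
gamma = (A * B % R) * beta % R    # makes A*B + gamma*p divisible by R
r = (A * B + gamma * p) // R      # exact division
if r >= p:
    r -= p                        # final conditional correction
```

The result r equals A·B·(137^5)^(-1) mod p, i.e. the Montgomery product of A and B.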
The following steps obtain the result when we compute in polynomial form:

c(z) = a(z)b(z) = z^9 + 74z^8 + 52z^7 + 111z^6 + 70z^5 + 118z^4 + 96z^3 + 36z^2 + z + 104

after reducing the coefficients mod 137. Thus

q(z) = ((c(z) mod z^5)·g(z)) mod z^5 = 33686z^4 - 3636z^3 - 1278z^2 + 623z - 104

Next, multiplying this by f(z) and adding c(z) mod z^5 yields 1,212,696z^8 +
1,081,800z^7 + 631,560z^6 + 91,272z^5 + 0z^4 + 0z^3 + 0z^2 + 0z + 0. Note that the
least significant five terms are zero. Next, adding the most significant part of c(z)
divided by z^5, viz. (z^9 + 74z^8 + 52z^7 + 111z^6 + 70z^5)/z^5, to this quantity
divided by z^5 yields

z^4 + 1,212,770z^3 + 1,081,852z^2 + 631,671z + 91,342

which after coefficient reduction mod 137 (propagating carries) gives 65z^5 + 6z^4 +
30z^3 + 57z^2 + 82z + 100. Note that this needs to be reduced mod p to obtain the
actual result

26z^4 + 72z^3 + 57z^2 + 117z + 129 = 9,345,382,793.

The same example can be worked out using serial multiplication due to Fan
et al. [77] which leads to smaller intermediate coefficient values. Note that instead
of computing a(z)b(z) fully, we take terms of b one term at a time and reduce the
product mod p. The results after each step of partial product addition, coefficient
reduction and scaling by z are as follows:
10z4 + 3z3 + 4z2 + 6z + 95 after adding 5A and 33p
21z4 + 133z3 + 127z2 + 65z + 101 after adding 9A and 74p
34z4 + 44z3 + 37z2 + 72z + 50 after adding 34A and 96p
49z4 + 39z3 + 14z2 + 78z + 76 after adding 136A and 53p
26z4 + 72z3 + 57z2 + 117z + 129 after adding 5A and 94p
The coefficients in this case can be seen to be smaller than in the previous case.
We will illustrate the first step as follows: After multiplication of A with 5 we obtain
175z4 + 180z3 + 35z2 + 30z + 515. Reducing the terms mod 137 and adding the carry
to the previous term, we obtain z5 + 39z4 + 43z3 + 35z2 + 33z + 104. Evidently, we
need to add 33p to make the least significant digit zero yielding (z5 + 39z4 + 43z3
+ 35z2 + 33z + 104) + 33(36z4 + 36z3 + 24z2 + 6z + 1) which after reducing the terms
mod 137 as before and dividing by z since z0 term becomes zero gives 10z4 + 3z3
+ 4z2 + 6z + 95.
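The serial flow can be sketched in plain integers. Since p(z) has constant term 1, p ≡ 1 (mod z), so the multiple of p to add in each step is simply the negated least significant base-z digit of the accumulator. The sketch below (an illustration, not the authors' implementation) reproduces the example:

```python
# Digit-serial Montgomery flow in plain integers for the running example.
z = 137
p = 12_774_932_983                 # f(137)
A = 12_422_338_651                 # a(137)
b_digits = [5, 9, 34, 136, 5]      # b(z) = 5z^4+136z^3+34z^2+9z+5, LSB first
acc, qs = 0, []
for d in b_digits:
    acc += d * A
    q = (-acc) % z                 # multiple of p that clears the z^0 digit
    qs.append(q)
    acc = (acc + q * p) // z       # exact division by z
assert qs == [33, 74, 96, 53, 94]  # the multiples of p added in each step
assert acc == 9_345_382_793        # 26z^4 + 72z^3 + 57z^2 + 117z + 129
```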
In the digit-serial hybrid multiplication technique, the multiplication and reduction/scaling are carried out together in each step.
Fan et al.'s [76] architecture was based on a hybrid Montgomery multiplier (HMM) where multiplication and reduction are interleaved. The multiplier architecture for z = 2⁶³ + s, where s = 857 = 2⁵(2⁴ + 2³) + 2⁶ + (2⁴ + 2³) + 1, for 128-bit security is shown in Figure 10.18b. Four 65×65 multipliers and one 65×32 multiplier are used to carry out the polynomial multiplication. Each 65×65 multiplier is implemented using the two-level Karatsuba method. Five "Mod-1" blocks are used for the first coefficient reduction step. The Mod-1 block is shown in Figure 10.18b. Partial products are immediately reduced. Multiplication by s is realized using four additions since s = 2⁵(2⁴ + 2³) + 2⁶ + (2⁴ + 2³) + 1. The outputs of the "Mod-1" blocks can be at most 78 bits. These outputs, corresponding to the various "bi" computed in five successive cycles, are next accumulated and shifted in the accumulator. Once the partial products are ready, in phase III, polynomial reduction is performed with only shifts and additions, e.g. 6α = 2²α + 2α, 9α = 2³α + α, 36α = 2⁵α + 2²α. The values of ci are less than (i + 1)·2⁷⁷ for 0 ≤ i ≤ 4. It can be shown that the vi are less than 92 bits. The "Mod-2" block is similar to the "Mod-1" block but its input is only 93 bits (see Figure 10.18b). The resulting ri are such that |ri| ≤ 2⁶³ + 2⁴¹ for 0 ≤ i ≤ 3 and |r4| ≤ 2³⁰.
The negative coefficients in r(z) are made positive by adding the following polynomial:

l(z) = (36v − 2)z⁴ + (36v + 2z − 2)z³ + (24v + 2z − 2)z² + (6v + 2z − 2)z + (v + 2z)   (10.36)

where v = 2²⁵ and z = 2⁶³ + s.


The authors have used a 16-stage pipeline to achieve a high clock frequency and
one polynomial multiplication takes five iterations. One multiplier has a delay of
20 cycles. The multiplier finishes one multiplication every five cycles. The authors
have used XILINX Virtex-6 FPGAs (XC6VLX240) and could achieve a maximum
frequency of 210 MHz using 4014 slices, 42 DSP48E1s and 5 block RAMs
(RAMB36E1).
The digit-serial implementation [77] will be described next. Note that p(z) ≡ 1 mod z, which means that p⁻¹(z) mod zⁿ has integer coefficients. The polynomial reduction uses Montgomery reduction, which needs division by z. Since z = 2ᵐ + s where s is small, the division is transferred to a multiplication by the small s. The algorithm for modular reduction for BN curves is presented in Figure 10.19a. Note that five steps are needed in the first loop to add a(z)bj to the old result and divide by z mod p. The authors prove that the output is bounded under the conditions 0 ≤ |ai|, |bi| < 2^(m/2) for i = 4 and 0 ≤ |ai|, |bi| < 2^(m+1) for 0 ≤ i ≤ 3, such that 0 ≤ |ri| < 2^(m/2) for i = 4 and 0 ≤ |ri| < 2^(m+1) for 0 ≤ i ≤ 3. Note that for realizing 256-bit BN curves, the digit size is 64 bits. Four 64-bit words and one 32-bit word will be needed. It can be seen that in the first loop, in step 3, one 32×64 and four 64×64 multiplications are needed. In step 4, one ⌈log₂s⌉×⌈log₂μ⌉ multiplication, where μ < 2^(m+6), is needed. The last iteration takes four 32×64 and one 32×32 multiplications. In total, the first loop takes one 32×32, eight 32×64, sixteen 64×64 and five ⌈log₂s⌉×⌈log₂μ⌉ multiplications where μ < 2^(m+6). The coefficient reduction phase requires eight ⌈log₂s⌉×⌈log₂μ⌉ multiplications. It can be shown that μ < 2^(k+6) in the for loop (steps 8–10) and μ < s·2⁶ in step 12 in the second for loop. On the other hand, in the Barrett and Montgomery algorithms, thirty-six 64×64 multiplications are needed.
The design was implemented on a 130 nm ASIC and needed 183K gates for Ate and R-Ate pairing, worked at a 204 MHz frequency, and the times taken are 4.22 and 2.91 ms, respectively. The architecture of the multiplier together with the accumulator and the γ, μ calculation blocks is shown in Figure 10.19b. Step 3 is performed by a row of 64×16 and 32×16 multipliers needing four iterations. The partial product is next reduced by the Mod_t block, which comprises a multiplier and a subtractor. This block generates μ and γ from ci. Note that μ = ci div 2ᵐ and γ = (ci mod 2ᵐ) − sμ in all mod blocks except the one below rc0, which instead computes γ = sμ − (rc0 mod 2ᵐ). The second loop re-uses the mod z blocks.
Chung and Hasan [79] have suggested the use of LWPFI (low-weight polynomial form integers) for performing modular multiplications efficiently. These are similar to GMN (generalized Mersenne numbers) f(t), where t is not a power of 2 and |fi| ≤ 1:

f(t) = tⁿ + f_{n−1}t^{n−1} + ⋯ + f₁t + f₀   (10.37)

Since f(t) is monic (the leading coefficient is unity), the polynomial reduction phase is efficient. A pseudocode is presented in Figure 10.20. The authors use Barrett's reduction algorithm for performing the divisions required in phase III. When the moduli are large, the Chung–Hasan algorithm is more efficient than the traditional Barrett or Montgomery reduction algorithms. The authors later extended this technique to the case [80] where |fi| ≤ s with s ≪ z. Note that the polynomial reduction phase is efficient only when f(z) is monic.
Corona et al. [81] have described a 256-bit prime field multiplier for application in bilinear pairing using BN curves with parameter 2⁶¹ + 2¹⁵ + 1, using an asymmetric divide-and-conquer approach based on the five-term Karatsuba technique, which used 12 DSP48 slices on Virtex-6. It needed fourteen 64×64 partial sub-products. This, however, needs a lot of additions. These additions, however, have a certain pattern that can be exploited to reduce the number of clock cycles needed from the 23 of the Karatsuba technique to 15 by proper scheduling. The 512-bit product is reduced to a 256-bit integer using the polynomial variant of Montgomery reduction of Fan et al. [77]. A 65×65-bit multiplier has been realized using asymmetric tiling [82]. One operand A was split into three 24-bit words A0, A1 and A2, and B was split into four 17-bit words B0, B1, B2 and B3, so that a 72×68 multiplier can be realized using the built-in DSP48 slices. This consumes 12 DSP48 slices and requires 12 products and 5 additions. This design achieved a clock frequency of 223.7 MHz on Virtex-6 with a 40-cycle latency and takes 15 cycles per product.
Figure 10.20 Chung–Hasan multiplication algorithm (adapted from [76] ©IEEE2012)

Brinci et al. [83] have suggested a 258-bit multiplier for BN curves. The authors observe that the Karatsuba technique cannot efficiently exploit the full performance of the DSP blocks in FPGAs; hence alternative techniques need to be explored. They use a Montgomery quartic polynomial multiplier needing 13 sub-products using the Montgomery technique [75], realized using 65×65-bit multipliers, 7×65-bit multipliers and one 7×7-bit multiplier, and it needs 22 additions. With non-standard tiling, eleven DSP blocks suffice: eight multipliers are 17×24 whereas three are 24×17. The value of z used in the BN curves is 2⁶³ + 857. A frequency of 208 MHz was achieved on Virtex-6 using 11 DSP48 blocks and 4 block RAMs, taking 11 cycles per product.

Extension Field Arithmetic

When computing pairings, we need to construct a representation for the finite field F_{p^k}, where k is the embedding degree. The finite field F_{p^k} is implemented as Fp[X]/f(X), where f(X) is an irreducible polynomial of degree k over Fp. The elements of F_{p^k} are represented using the polynomial basis [1, X, X², ..., X^{k−1}], where X is a root of the irreducible polynomial over Fp. In large prime fields, pairing involves arithmetic in small-degree extensions of the base field. Hence, optimization of extension field arithmetic is required. We need algorithms for multiplication, squaring, finding the multiplicative inverse and computing the Frobenius endomorphism. These are considered in detail in this section.
Multiplication is computed as a multiplication of polynomials followed by a reduction modulo the irreducible polynomial f(X), which can be built into the formula for multiplication. For a multiplication in F_{p^k}, at least k reductions are required as the result has k coefficients. For F_{p^12}, twelve reductions are required. Lazy reduction (accumulation before reduction) can be used to decrease the number of reductions in the extension field, as will be explained later.
Several techniques can be used to perform computations in the quadratic, cubic, quartic and sextic extension fields [84].
A. Multiplication and squaring
 
A quadratic extension can be constructed using F ¼ Fp ½X= X2  β where β is a
p2
quadratic non-residue in Fp. An element α 2 F 2 is represented as α0 þ α1 X where
p
αi 2 Fp .
The school book method of multiplication c ¼ ab yields

c ¼ ðao þ Xa1 Þðbo þ Xb1 Þ ¼ ðao bo þ βa1 b1 Þ þ Xða1 bo þ ao b1 Þ ¼ ðco þ Xc1 Þ


ð10:38aÞ

where v0 = a0b0, v1 = a1b1; this costs 4M + 2A + B, where M, A and B stand for multiplication, addition and multiplication by a constant, respectively. Using Karatsuba's formula [70], we have

c0 = v0 + βv1, c1 = (a0 + a1)(b0 + b1) − v0 − v1   (10.38b)

which costs 3M + 5A + B. For squaring, we have the formulae for the respective cases of schoolbook and Karatsuba as

c0 = a0² + βa1², c1 = 2a0a1  and  c0 = a0² + βa1², c1 = (a0 + a1)² − v0 − v1   (10.38c)

where v0 = a0², v1 = a1². Thus, the operation counts in these two cases are M + 2S + 2A + B and 3S + 4A + B, where S stands for squaring.
In another technique known as complex squaring, c = a² is computed as

c0 = (a0 + a1)(a0 + βa1) − v0 − βv0, c1 = 2v0   (10.39)

where v0 = a0a1. This needs 2M + 4A + 2B operations.
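The three quadratic-extension formulas above can be cross-checked exhaustively over a toy field. The choice p = 13 and β = 2 (a quadratic non-residue mod 13) is an assumption for this illustration only:

```python
# Toy check of (10.38a), (10.38b) and (10.39) in Fp[X]/(X^2 - beta).
p, beta = 13, 2

def schoolbook(a, b):                       # (10.38a): 4M
    return ((a[0]*b[0] + beta*a[1]*b[1]) % p, (a[1]*b[0] + a[0]*b[1]) % p)

def karatsuba(a, b):                        # (10.38b): 3M
    v0, v1 = a[0]*b[0], a[1]*b[1]
    return ((v0 + beta*v1) % p, ((a[0]+a[1])*(b[0]+b[1]) - v0 - v1) % p)

def complex_sqr(a):                         # (10.39): 2M
    v0 = a[0]*a[1]
    return (((a[0]+a[1])*(a[0]+beta*a[1]) - v0 - beta*v0) % p, (2*v0) % p)

for a0 in range(p):
    for a1 in range(p):
        a = (a0, a1)
        assert karatsuba(a, a) == schoolbook(a, a) == complex_sqr(a)
assert karatsuba((3, 5), (7, 11)) == schoolbook((3, 5), (7, 11))
```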


In the case of cubic extensions F_{p^3} = Fp[X]/(X³ − β), an element α ∈ F_{p^3} is represented as α0 + α1X + α2X², where αi ∈ Fp and β is a cubic non-residue in Fp.
The schoolbook type of multiplication yields c = ab as

c = (a0 + a1X + a2X²)(b0 + b1X + b2X²)
  = a0b0 + X(a1b0 + a0b1) + X²(a2b0 + a1b1 + a0b2) + X³(a2b1 + a1b2) + X⁴a2b2
  = (a0b0 + β(a2b1 + a1b2)) + X(a1b0 + a0b1 + βa2b2) + X²(a2b0 + a1b1 + a0b2)
  = c0 + c1X + c2X²   (10.40a)

This costs 9M + 6A + 2B. For squaring, we have

c0 = a0² + 2βa1a2, c1 = 2a0a1 + βa2², c2 = a1² + 2a0a2   (10.40b)
which needs 3M + 3S + 6A + 2B operations. The Karatsuba technique for multiplication yields

c0 = v0 + β((a1 + a2)(b1 + b2) − v1 − v2)
c1 = (a0 + a1)(b0 + b1) − v0 − v1 + βv2   (10.40c)
c2 = (a0 + a2)(b0 + b2) − v0 + v1 − v2

which costs 6M + 15A + 2B, where v0 = a0b0, v1 = a1b1 and v2 = a2b2. For squaring, we have

c0 = v0 + β((a1 + a2)² − v1 − v2)
c1 = (a0 + a1)² − v0 − v1 + βv2   (10.40d)
c2 = (a0 + a2)² − v0 + v1 − v2

which requires 6S + 13A + 2B operations, where v0 = a0², v1 = a1², v2 = a2². In the Toom–Cook-3 [85, 86] method, we have to pre-compute

v0 = a(0)b(0) = a0b0,
v1 = a(1)b(1) = (a0 + a1 + a2)(b0 + b1 + b2),
v2 = a(−1)b(−1) = (a0 − a1 + a2)(b0 − b1 + b2),   (10.41)
v3 = a(2)b(2) = (a0 + 2a1 + 4a2)(b0 + 2b1 + 4b2),
v4 = a(∞)b(∞) = a2b2

where the five values vi are the evaluations of a(X)b(X) at X = 0, 1, −1, 2 and ∞. This needs 5M + 14A operations. Next, interpolation needs to be performed to compute c = 6ab (the factor of 6 eliminates the division by 6):

c0 = 6v0 + β(3v0 − 3v1 − v2 + v3 − 12v4),
c1 = −3v0 + 6v1 − 2v2 − v3 + 12v4 + 6βv4,   (10.42a)
c2 = −6v0 + 3v1 + 3v2 − 6v4

The total computation requirement is 5M + 40A + 2B operations. If β = 2, the cost is reduced to 5M + 35A. For squaring, we have
v0 = (a(0))² = a0², v1 = (a(1))² = (a0 + a1 + a2)², v2 = (a(−1))² = (a0 − a1 + a2)², v3 = (a(2))² = (a0 + 2a1 + 4a2)², v4 = (a(∞))² = a2²   (10.42b)

which needs 5S + 7A operations. Next, interpolation is performed as before using (10.42a). Thus, the Toom–Cook method needs fewer multiplications but more additions.
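The cubic-extension Karatsuba formulas (10.40c) can likewise be checked against the schoolbook product (10.40a) over a toy field. Again, p = 13 and β = 2 are assumptions for illustration only:

```python
# Toy check of (10.40a) vs (10.40c) in Fp[X]/(X^3 - beta).
import itertools

p, beta = 13, 2

def mul3(a, b):                              # (10.40a), schoolbook, 9M
    c0 = a[0]*b[0] + beta*(a[2]*b[1] + a[1]*b[2])
    c1 = a[1]*b[0] + a[0]*b[1] + beta*a[2]*b[2]
    c2 = a[2]*b[0] + a[1]*b[1] + a[0]*b[2]
    return (c0 % p, c1 % p, c2 % p)

def mul3_karatsuba(a, b):                    # (10.40c), 6M
    v0, v1, v2 = a[0]*b[0], a[1]*b[1], a[2]*b[2]
    c0 = v0 + beta*((a[1]+a[2])*(b[1]+b[2]) - v1 - v2)
    c1 = (a[0]+a[1])*(b[0]+b[1]) - v0 - v1 + beta*v2
    c2 = (a[0]+a[2])*(b[0]+b[2]) - v0 + v1 - v2
    return (c0 % p, c1 % p, c2 % p)

for a in itertools.product(range(5), repeat=3):
    for b in itertools.product(range(5), repeat=3):
        assert mul3(a, b) == mul3_karatsuba(a, b)
```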
For squaring, three other techniques due to Chung and Hasan [87] are useful. For degree-2 polynomials, these need two steps. In the pre-computation step in method 1 (CH-SQR1), we have

s0 = a0², s1 = 2a0a1, s2 = (a0 + a1 − a2)(a0 − a1 − a2), s3 = 2a1a2, s4 = a2²   (10.43a)

In the next step, the square is computed as

c0 = s0 + βs3, c1 = s1 + βs4, c2 = s1 + s2 + s3 − s0 − s4   (10.43b)

Thus, the total cost is 3M + 2S + 11A + 2B operations. For the second technique, s2 = (a0 − a1 + a2)², while the other si are the same as in (10.43a), and the final step is the same as in (10.43b). The total cost is 2M + 3S + 10A + 2B operations. For the third method, the pre-computation is given by

s0 = a0², s1 = (a0 + a1 + a2)², s2 = (a0 − a1 + a2)², s3 = 2a1a2, s4 = a2², t1 = (s1 + s2)/2   (10.43c)

and finally we compute

c0 = s0 + βs3, c1 = s1 − s3 − t1 + βs4, c2 = t1 − s0 − s4   (10.43d)

The total cost is 1M + 4S + 11A + 2B + 1D2 operations, where D2 indicates division by 2. To avoid this division, C = 2a² can be computed:

c0 = 2s0 + 2βs3, c1 = s1 − s2 − 2s3 + 2βs4, c2 = s1 + s2 − 2s0 − 2s4   (10.43e)

The total cost is 1M + 4S + 14A + 2B operations.


In the case of direct quartic extensions, an element α 2 F is represented as
p4
α0 þ α1 X þ α2 X2 þ α3 X3 where αi 2 Fp . We can construct a quartic extension as
 
F 4 ¼ Fp ½X= X4  β where β is a quadratic non-residue in Fp. We can also
p   pffiffiffi
construct a quartic extension as F 4 ¼ F 2 ½Y = Y 2  γ where γ ¼ β is a
p p
10.6 Pairing Processors Using RNS 321

quadratic non-residue to F 2 . An element in F 4 can be represented as α0 þ α1 γ


p p
where αi 2 F 2 .
p
The schoolbook type of multiplication in the direct quartic extension yields c = ab as

c0 = a0b0 + β(a1b3 + a3b1 + a2b2)
c1 = a0b1 + a1b0 + β(a2b3 + a3b2)   (10.44a)
c2 = a0b2 + a1b1 + a2b0 + βa3b3
c3 = a0b3 + a1b2 + a2b1 + a3b0

needing 16M + 12A + 3B operations, whereas squaring is by the equations

c0 = a0² + β(2a1a3 + a2²), c1 = 2(a0a1 + βa2a3), c2 = 2a0a2 + a1² + βa3², c3 = 2(a0a3 + a1a2)   (10.44b)

needing 6M + 4S + 10A + 3B operations. Note that the Toom–Cook method can also be used, which reduces the number of multiplications at the expense of other operations.
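Formula (10.44a) is just "multiply the polynomials, then substitute X⁴ = β", and can be checked against a generic multiply-then-reduce routine. The values p = 13 and β = 2 are assumptions for the example only:

```python
# Cross-check of (10.44a) against generic reduction mod X^4 - beta.
p, beta, k = 13, 2, 4

def mul_generic(a, b):
    d = [0] * (2*k - 1)
    for i in range(k):
        for j in range(k):
            d[i+j] += a[i] * b[j]
    for i in range(2*k - 2, k - 1, -1):      # substitute X^k = beta
        d[i-k] += beta * d[i]
    return tuple(x % p for x in d[:k])

def mul_quartic(a, b):                       # (10.44a)
    c0 = a[0]*b[0] + beta*(a[1]*b[3] + a[3]*b[1] + a[2]*b[2])
    c1 = a[0]*b[1] + a[1]*b[0] + beta*(a[2]*b[3] + a[3]*b[2])
    c2 = a[0]*b[2] + a[1]*b[1] + a[2]*b[0] + beta*a[3]*b[3]
    c3 = a[0]*b[3] + a[1]*b[2] + a[2]*b[1] + a[3]*b[0]
    return tuple(x % p for x in (c0, c1, c2, c3))

a, b = (3, 1, 4, 1), (5, 9, 2, 6)
assert mul_quartic(a, b) == mul_generic(a, b)
```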
Note, however, that a quadratic extension of a quadratic extension can also be used to obtain quartic extensions. Note also that either the Karatsuba technique or the schoolbook technique can be used in F_{p^2}, leading to four options. The cost of the operations needed depends upon the choice of multiplication method for the bottom quadratic extension field. The operations needed are summarized in Tables 10.2 and 10.3 for multiplication and squaring [84].
Sextic extensions are possible using quadratic over cubic, cubic over quadratic or direct sextic constructions. In the case of the direct sextic extension, F_{p^6} is constructed as F_{p^6} = Fp[X]/(X⁶ − β), where β is both a quadratic and a cubic non-residue in Fp.
Table 10.2 Summary of multiplication costs for quartic extensions as quadratic over quadratic (adapted from [84] ©2006)

p² method    | School book (>Linear / Linear) | Karatsuba (>Linear / Linear)
School book  | 16M / 12A + 5B                 | 12M / 16A + 4B
Karatsuba    | 12M / 24A + 5B                 | 9M / 25A + 4B

Table 10.3 Summary of squaring costs for quartic extensions as quadratic over quadratic (adapted from [84] ©2006)

p² method          | School book (>Linear / Linear) | Karatsuba (>Linear / Linear) | Complex (>Linear / Linear)
School book        | 6M + 4S / 10A + 4B             | 3M + 6S / 14A + 5B           | 8M / 12A + 4B
Karatsuba          | 3M + 6S / 17A + 6B             | 9S / 20A + 8B                | 6M / 18A + 4B
Karatsuba/Complex  | -                              | 7M / 17A + 6B                | 6M / 20A + 8B
An element α ∈ F_{p^6} is represented as α0 + α1X + α2X² + α3X³ + α4X⁴ + α5X⁵, where αi ∈ Fp. The schoolbook method computes c = ab as [84]

c0 = a0b0 + β(a1b5 + a2b4 + a3b3 + a4b2 + a5b1)
c1 = a0b1 + a1b0 + β(a2b5 + a3b4 + a4b3 + a5b2)
c2 = a0b2 + a1b1 + a2b0 + β(a3b5 + a4b4 + a5b3)   (10.45a)
c3 = a0b3 + a1b2 + a2b1 + a3b0 + β(a4b5 + a5b4)
c4 = a0b4 + a1b3 + a2b2 + a3b1 + a4b0 + βa5b5
c5 = a0b5 + a1b4 + a2b3 + a3b2 + a4b1 + a5b0

The total costs of multiplication for the direct sextic extension for the schoolbook, Montgomery and Toom–Cook-6X techniques are 36M + 30A + 5B, 17M + 143A + 5B and 11M + 93Mz + 236A + 5B operations, respectively. For squaring, the corresponding costs are 15M + 6S + 22A + 5B, 17S + 123A + 5B and 11S + 79Mz + 163A + 5B, where Mz stands for multiplication by a small word-size integer.
In the case of squaring c = a², we have for the schoolbook method of the sextic extension

c0 = a0² + β(2(a1a5 + a2a4) + a3²)
c1 = 2(a0a1 + β(a2a5 + a3a4))
c2 = 2a0a2 + a1² + β(2a3a5 + a4²)   (10.45b)
c3 = 2(a0a3 + a1a2 + βa4a5)
c4 = 2(a0a4 + a1a3) + a2² + βa5²
c5 = 2(a0a5 + a1a4 + a2a3)

An example of a degree-6 extension built as a quadratic extension of a cubic is illustrated next [69]. Note that F_{p^3} = Fp[X]/(X³ − 2) = Fp(α) and F_{p^6} = F_{p^3}[Y]/(Y² − α) = F_{p^3}(β), where α is a cube root of 2. For using lazy reduction, the complete arithmetic shall be unrolled. Hence, letting A = a0 + a1α + a2α² + β(a3 + a4α + a5α²) and B = b0 + b1α + b2α² + β(b3 + b4α + b5α²) be two elements of F_{p^6}, using Karatsuba on the quadratic extension leads to

AB = (a0 + a1α + a2α²)(b0 + b1α + b2α²) + α(a3 + a4α + a5α²)(b3 + b4α + b5α²)
   + [((a0 + a3) + (a1 + a4)α + (a2 + a5)α²)((b0 + b3) + (b1 + b4)α + (b2 + b5)α²)
   − (a0 + a1α + a2α²)(b0 + b1α + b2α²)
   − (a3 + a4α + a5α²)(b3 + b4α + b5α²)]β   (10.46)

Using Karatsuba once again to compute each of the three products gives

AB = {a0b0 + 2[a4b4 + (a1 + a2)(b1 + b2) − a1b1 − a2b2 + (a3 + a5)(b3 + b5) − a3b3 − a5b5]}
+ {a3b3 + (a0 + a1)(b0 + b1) − a0b0 − a1b1 + 2[a2b2 + (a4 + a5)(b4 + b5) − a4b4 − a5b5]}α
+ {a1b1 + 2a5b5 + (a0 + a2)(b0 + b2) − a0b0 − a2b2 + (a3 + a4)(b3 + b4) − a3b3 − a4b4}α²
+ {(a0 + a3)(b0 + b3) − a0b0 − a3b3 + 2[(a1 + a2 + a4 + a5)(b1 + b2 + b4 + b5) − (a1 + a4)(b1 + b4) − (a2 + a5)(b2 + b5) − (a1 + a2)(b1 + b2) + a1b1 + a2b2 − (a4 + a5)(b4 + b5) + a4b4 + a5b5]}β
+ {(a0 + a1 + a3 + a4)(b0 + b1 + b3 + b4) − (a0 + a3)(b0 + b3) − (a1 + a4)(b1 + b4) − (a0 + a1)(b0 + b1) + a0b0 + a1b1 − (a3 + a4)(b3 + b4) + a3b3 + a4b4 + 2[(a2 + a5)(b2 + b5) − a2b2 − a5b5]}αβ
+ {(a1 + a4)(b1 + b4) − a1b1 − a4b4 + (a0 + a2 + a3 + a5)(b0 + b2 + b3 + b5) − (a0 + a3)(b0 + b3) − (a2 + a5)(b2 + b5) − (a0 + a2)(b0 + b2) + a0b0 + a2b2 − (a3 + a5)(b3 + b5) + a3b3 + a5b5}α²β   (10.47)

It can be seen that 18M + 56A + 8B operations in Fp are required. It requires only six reductions. Note that each component of AB can lie between 0 and 44p². Thus, Bⁿ in the Montgomery representation and M in the RNS representation must be greater than 44p to perform lazy reduction in this degree-6 field.
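Since β = Y with β² = α and α³ = 2, this tower is isomorphic to Fp[W]/(W⁶ − 2) under β = W, α = W², so the nested Karatsuba product can be checked against a direct product mod W⁶ − 2. The following sketch does that for a toy prime p = 13 (an assumption for the example):

```python
# Tower Fp3 = Fp[alpha]/(alpha^3 - 2), Fp6 = Fp3[beta]/(beta^2 - alpha),
# checked against Fp[W]/(W^6 - 2) with beta = W, alpha = W^2.
p = 13

def mul_fp3(x, y):                            # Karatsuba with alpha^3 = 2
    v0, v1, v2 = x[0]*y[0], x[1]*y[1], x[2]*y[2]
    return ((v0 + 2*((x[1]+x[2])*(y[1]+y[2]) - v1 - v2)) % p,
            ((x[0]+x[1])*(y[0]+y[1]) - v0 - v1 + 2*v2) % p,
            (v1 + (x[0]+x[2])*(y[0]+y[2]) - v0 - v2) % p)

def mul_fp6(a, b):                            # Karatsuba with beta^2 = alpha
    a0, a1, b0, b1 = a[:3], a[3:], b[:3], b[3:]
    t0, t1 = mul_fp3(a0, b0), mul_fp3(a1, b1)
    t2 = mul_fp3([x+y for x, y in zip(a0, a1)], [x+y for x, y in zip(b0, b1)])
    lo = [(t0[i] + (2*t1[2], t1[0], t1[1])[i]) % p for i in range(3)]  # + alpha*t1
    hi = [(t2[i] - t0[i] - t1[i]) % p for i in range(3)]
    return tuple(lo + hi)

def mul_w6(a, b):                             # direct in Fp[W]/(W^6 - 2)
    d = [0]*11
    for i in range(6):
        for j in range(6):
            d[i+j] += a[i]*b[j]
    for i in range(10, 5, -1):
        d[i-6] += 2*d[i]
    return tuple(x % p for x in d[:6])

def to_w(a):   # a0 + a1*alpha + a2*alpha^2 + beta*(a3 + a4*alpha + a5*alpha^2)
    return (a[0], a[3], a[1], a[4], a[2], a[5])

A6, B6 = (1, 2, 3, 4, 5, 6), (6, 5, 4, 3, 2, 1)
assert to_w(mul_fp6(A6, B6)) == mul_w6(to_w(A6), to_w(B6))
```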
For sextic extensions, if M < 20A, Devegili et al. [84] have suggested constructing the extension as quadratic over cubic, with Karatsuba over Karatsuba for multiplication and complex over Karatsuba for squaring. For M ≥ 20A, cubic over quadratic, Toom–Cook-3x over Karatsuba for multiplication, and either complex, Chung–Hasan SQR3 or SQR3x over Karatsuba/complex for squaring have been recommended.
The extension field F_{p^12} is defined by the following tower of extensions [88]:

F_{p^2} = Fp[u]/(u² + 2)
F_{p^6} = F_{p^2}[v]/(v³ − ξ) where ξ = u − 1   (10.48)
F_{p^12} = F_{p^6}[w]/(w² − v)

Note that the representation F_{p^12} = F_{p^2}[W]/(W⁶ − ξ), where W = w, is also possible. The tower has the advantage of efficient multiplication for the canonical polynomial base. Hence, an element α ∈ F_{p^12} can be represented in any of the following three ways:
α = a0 + a1w where a0, a1 ∈ F_{p^6}
α = (a0,0 + a0,1v + a0,2v²) + (a1,0 + a1,1v + a1,2v²)w where ai,j ∈ F_{p^2}   (10.49)
α = a0,0 + a1,0W + a0,1W² + a1,1W³ + a0,2W⁴ + a1,2W⁵

Hankerson et al. [88] have recommended the use of Karatsuba for multiplication and the complex method for squaring for F_{p^12} extensions. A quadratic on top of a cubic on top of a quadratic tower of extensions needs to be used. A multiplication using Karatsuba's method needs 54 multiplications and 12 modular reductions, whereas squaring using the complex method for squaring in F_{p^12} and Karatsuba for multiplication in F_{p^6} and F_{p^2} needs 36 multiplications and 12 modular reductions [69].
A complete multiplication in F_{p^k} requires k^λ multiplications in Fp with 1 < λ ≤ 2, and note that lazy reduction can be used in Fp. A multiplication in F_{p^k} then requires k reductions since the result has k coefficients. Multiplication in Fp needs n² word multiplications and reduction requires (n² + n) word multiplications. If p ≡ 3 mod 4, multiplications by β = −1 can be computed as simple subtractions in F_{p^2}. A multiplication in F_{p^k} thus needs (k^λ + k)n² + kn word multiplications in radix representation and approximately 1.1(7k^λ/5)n² + ((10k^λ + 8k)/5)n word multiplications if RNS is used [69].
The schoolbook type of multiplication is preferred for F_{p^2} since in the Karatsuba method the dynamic range is increased from 2p² to 6p² [89].
Yao et al. [89] have observed that for F_{p^12} multiplication the schoolbook method also provides an elegant solution. Using lazy reduction, only 12 reductions will be needed. The evaluation of f·g can proceed as

f·g = Σ_{j+k<6} f_j g_k W^{j+k} + Σ_{j+k≥6} f_j g_k ζ W^{j+k−6}   (10.50)

where f, g ∈ F_{p^2}[W]/(W⁶ − ζ) and f_j, g_k ∈ F_{p^2}, 0 ≤ j, k ≤ 5, are the coefficients of f and g, respectively. The coefficients of the intermediate results f_j g_k are less than 2p² and the coefficients of f_j g_k ζ are less than 4p². Considering f_j = f0 + f1 i and g_k = g0 + g1 i, we have

f_j g_k ζ = (f0g0 − f1g1 − f0g1 − f1g0) + (f0g0 − f1g1 + f0g1 + f1g0)i   (10.51)

which corresponds to ζ = 1 + i with i² = −1. Since four products are needed to compute both components, two accumulators can easily handle this requirement.
B. Inversion
Three other operations are needed in pairing computation: inversion, Frobenius computation and squaring of unit elements. For the quadratic case [90], assuming an irreducible polynomial x² + n, we have the formula for inversion as

1/(a + bx) = (a − bx)/(a² + nb²)   (10.52a)

This needs 1 inversion, 2 squarings, 2 multiplications and 3 reductions in Fp. The inversion of d in Fp is performed as d⁻¹ = d^{p−2} mod p. For the cubic case, assuming an irreducible polynomial x³ + n, we have [90]

1/(a + bx + cx²) = (A + Bx + Cx²)/F   (10.52b)

where A = a² + nbc, B = −nc² − ab, C = b² − ac, F = aA − nbC − ncB. This needs one inversion, 9 multiplications, 3 squarings and 7 reductions in Fp [69].
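Both inversion formulas can be exercised over a toy field. The choice p = 13, n = 2 (both x² + 2 and x³ + 2 are irreducible mod 13) is an assumption made for this illustration:

```python
# Toy check of (10.52a) and (10.52b).
p, n = 13, 2

def mul2(u, v):                                  # mod x^2 + n  (x^2 = -n)
    return ((u[0]*v[0] - n*u[1]*v[1]) % p, (u[0]*v[1] + u[1]*v[0]) % p)

def inv2(a, b):                                  # (10.52a)
    f = pow(a*a + n*b*b, p - 2, p)               # 1/(a^2 + n*b^2) via Fermat
    return ((a*f) % p, (-b*f) % p)

def mul3(u, v):                                  # mod x^3 + n  (x^3 = -n)
    c0 = u[0]*v[0] - n*(u[1]*v[2] + u[2]*v[1])
    c1 = u[0]*v[1] + u[1]*v[0] - n*u[2]*v[2]
    c2 = u[0]*v[2] + u[1]*v[1] + u[2]*v[0]
    return (c0 % p, c1 % p, c2 % p)

def inv3(a, b, c):                               # (10.52b)
    A, B, C = a*a + n*b*c, -n*c*c - a*b, b*b - a*c
    f = pow(a*A - n*b*C - n*c*B, p - 2, p)       # 1/F via Fermat
    return ((A*f) % p, (B*f) % p, (C*f) % p)

assert mul2((3, 5), inv2(3, 5)) == (1, 0)
assert mul3((3, 5, 7), inv3(3, 5, 7)) == (1, 0, 0)
```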
Inversion in F 6 built as a quadratic extension of a cubic requires one
p
inversion, 9 multiplications, 3 squarings and 7 reductions in F 2 [88]. Alterna-
p
tively, we can see that 1 inversion, 36 multiplications and 16 reductions in Fp are
required. Inversion in F 12 requires 1 inversion, 97 multiplications and 35 reduc-
p
tions in Fp or 1 inversion, 2 multiplications and 2 squarings in F 6 [69, 88].
p
C. Frobenius computation
Frobenius action, i.e. raising an extension field element to the power of the modulus, is always very cheap [90]. Consider (a + ib)^p = a^p + b^p i^p = (a − ib) mod p, since all the terms which are multiples of p in the expansion of (a + ib)^p vanish and i^{p−1} = (i²)^{(p−1)/2} = −1, as i² is a quadratic non-residue.
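The conjugation property can be verified exhaustively for a toy field. Here p = 11 (so p ≡ 3 mod 4 and −1 is a non-residue) and i² = −1 are assumptions for the example:

```python
# Frobenius in Fp[i]/(i^2 + 1) is conjugation: (a + ib)^p = a - ib.
p = 11

def mul(u, v):                                   # i^2 = -1
    return ((u[0]*v[0] - u[1]*v[1]) % p, (u[0]*v[1] + u[1]*v[0]) % p)

def power(u, e):                                 # square-and-multiply
    r = (1, 0)
    while e:
        if e & 1:
            r = mul(r, u)
        u, e = mul(u, u), e >> 1
    return r

for a in range(p):
    for b in range(p):
        assert power((a, b), p) == (a, (-b) % p)
```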
Raising an element f ∈ F_{p^12} = F_{p^6}(w)/(w² − v) to the pth power can be efficiently carried out [91]. This is needed in the final exponentiation in the Ate pairing. The field extension F_{p^12} can also be represented as a sextic extension of a quadratic field, f ∈ F_{p^12} = F_{p^2}(W)/(W⁶ − u) with W = w. Hence, we can write f = g + hw ∈ F_{p^12} with g, h ∈ F_{p^6} such that g = g0 + g1v + g2v² and h = h0 + h1v + h2v², where gi, hi ∈ F_{p^2} for i = 0, 1, 2. Equivalently, we have

f = g + hw = g0 + h0W + g1W² + h1W³ + g2W⁴ + h2W⁵   (10.53a)

Recall that the pth power of an element in F_{p^2} can be calculated free of cost. Denoting by ḡi, h̄i the conjugates of gi and hi for i = 0, 1, 2 and using the identity W^p = u^{(p−1)/6}W, we can write (W^i)^p = γ1,i W^i with γ1,i = u^{i(p−1)/6},
γ2,i = γ1,i·γ̄1,i and γ3,i = γ1,i·γ2,i, which need to be pre-computed and stored for i = 1, ..., 5. Hence, we compute f^p as

f^p = (g0 + h0W + g1W² + h1W³ + g2W⁴ + h2W⁵)^p
    = ḡ0 + h̄0W^p + ḡ1W^{2p} + h̄1W^{3p} + ḡ2W^{4p} + h̄2W^{5p}
    = ḡ0 + h̄0γ1,1W + ḡ1γ1,2W² + h̄1γ1,3W³ + ḡ2γ1,4W⁴ + h̄2γ1,5W⁵   (10.53b)

Thus, five multiplications in F_{p^2} and five conjugations in F_{p^2} are needed. A similar procedure can be used for computing f^{p²} and f^{p³} as well.

Pairing Algorithms

The final exponentiation in pairing algorithms, needed in (10.29b), (10.30b) and (10.31b), is considered next [92]. For BN curves, we have

(p¹² − 1)/ℓ = (p⁶ − 1)(p² + 1)·((p⁴ − p² + 1)/ℓ)   (10.54)

where ℓ is the order. Thus, three steps are required. The exponentiation by the first term can be performed by a conjugation (which realizes the power p⁶) followed by an inversion (taking care of the −1). The exponentiation corresponding to p² + 1 needs a Frobenius computation (f^{p²}) followed by multiplication with f. These two are known as the "easy part" of the final exponentiation step, whereas the operation corresponding to the third term is known as the "hard part". Devegili et al. [93] have suggested that in the case of BN curves, the expression (p⁴ − p² + 1)/ℓ can be written in terms of the parameters of the BN curve p and x as

p³ + (6x² + 1)p² + (36x³ − 18x² + 12x + 1)p + (36x³ − 30x² + 18x − 2)
Using the method of multi-exponentiation combined with Frobenius, this needs the computation of

f^{p³}·(f^{p²})^{6x²+1}·(f^p)^{36x³−18x²+12x+1}·f^{36x³−30x²+18x−2}

This can be computed as a = f^{6x−5}, b = a^p, b = ab, followed by

f^{p³}·[b·(f^p)²·f^{p²}]^{6x²+1}·b·(f^p·f)⁹·a·f⁴
Note that a^p, f^p, f^{p²} and f^{p³} are computed using Frobenius. Later, Scott et al. [92] gave a better systematic method of computing the hard part, taking into account the parameters of the BN curves, as
m^p·m^{p²}·m^{p³}·[1/m]²·[(m^{x²})^{p²}]⁶·[1/(m^x)^p]¹²·[1/(m^x·(m^{x²})^p)]¹⁸·[1/m^{x²}]³⁰·[1/(m^{x³}·(m^{x³})^p)]³⁶

The terms in the brackets are next computed using four multiplications (inversion is just a conjugation), leading to a calculation of the form

y0·y1²·y2⁶·y3¹²·y4¹⁸·y5³⁰·y6³⁶

The authors next suggest that using the Olivos algorithm [94], the computation can be done using just two registers as follows:

T0 ← y6², T0 ← T0·y4, T0 ← T0·y5, T1 ← y3·y5, T1 ← T1·T0, T0 ← T0·y2, T1 ← T1², T1 ← T1·T0, T1 ← T1², T0 ← T1·y1, T1 ← T1·y0, T0 ← T0², T0 ← T0·T1

This can be carried out using 9 multiplications and 4 squarings.
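The two-register sequence can be verified with ordinary modular integers standing in for F_{p^12} elements, since only the group identities matter. The modulus and sample values below are arbitrary choices for the illustration:

```python
# Verifying that the Olivos-style sequence computes
# y0 * y1^2 * y2^6 * y3^12 * y4^18 * y5^30 * y6^36.
q = 2**31 - 1                      # arbitrary modulus
y = [3, 5, 7, 11, 13, 17, 19]      # arbitrary sample values y0..y6

T0 = (y[6] * y[6]) % q
T0 = (T0 * y[4]) % q
T0 = (T0 * y[5]) % q
T1 = (y[3] * y[5]) % q
T1 = (T1 * T0) % q
T0 = (T0 * y[2]) % q
T1 = (T1 * T1) % q
T1 = (T1 * T0) % q
T1 = (T1 * T1) % q
T0 = (T1 * y[1]) % q
T1 = (T1 * y[0]) % q
T0 = (T0 * T0) % q
T0 = (T0 * T1) % q

target = 1
for yi, e in zip(y, [1, 2, 6, 12, 18, 30, 36]):
    target = (target * pow(yi, e, q)) % q
assert T0 == target
```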

For MNT curves, the final exponentiation [92] is realized as (p⁶ − 1)/ℓ = (p³ − 1)(p + 1)·((p² − p + 1)/ℓ), where p = 4l² + 1 and ℓ = 4l² − 2l + 1. The hard part of the final exponentiation is (p² − p + 1)/ℓ. Since p = x² + 1 and ℓ = x² − x + 1 with x = 2l, we have the hard part as (x⁴ + x² + 1)/(x² − x + 1) = x² + x + 1 = p + x. Hence, the hard part is m^p·m^x, which is achieved by a Frobenius and an exponentiation to the power of x, where m is the value to be exponentiated.
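The identity (p² − p + 1)/ℓ = p + x is a polynomial fact and can be checked over a range of integer parameters:

```python
# MNT hard-part identity: for p = x^2 + 1 and l = x^2 - x + 1,
# (p^2 - p + 1)/l = x^2 + x + 1 = p + x.
for x in range(2, 200, 2):
    p = x*x + 1
    l = x*x - x + 1
    assert (p*p - p + 1) % l == 0
    assert (p*p - p + 1) // l == p + x
```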

Pairing Processor Implementations Using RNS

Yao et al. [89] suggested a set of base moduli to reduce the complexity of modulus reduction in RNS, presented an efficient Fp Montgomery modular multiplier, and described a high-speed pairing processor using this multiplier. They suggested selecting the moduli of the two RNS bases used in base extension close to each other, so that the (bk − cj) values are small, where bk and cj are the moduli in the two bases. Thus, the bit lengths of the operands needed in base extension are small.
They have suggested the use of eight-moduli sets for a 256-bit dynamic range for 128-bit security as

B = {2^w − 1, 2^w − 9, 2^w + 3, 2^w + 11, 2^w + 5, 2^w + 9, 2^w − 31, 2^w + 15}
C = {2^w, 2^w + 1, 2^w − 3, 2^w + 17, 2^w − 13, 2^w − 21, 2^w − 25, 2^w − 33}

where w = 32, so that the bit lengths of ∏_{k=1,k≠j}^{s}(b_k − c_j) are as small as possible (<25 bits).
Yao et al. [95] have suggested that the maximal length in bits of bi − cj shall be minimized to v bits, so that multiplications will be v×w words rather than w×w words. They have also suggested a systematic procedure for RNS parameter selection that results in a lower complexity. They observe that for a 16-bit machine, n = 4, where n is the number of moduli, is optimum. They suggest two techniques for choosing moduli: (a) multiple plus prime and (b) first come first selected improved. In the first method, we start with the set of the first 2n primes. The product of all of these, ∏_{i=1}^{2n} p_i, is denoted Θ, and M is a multiple of Θ. Then the M + p_i are all pairwise coprime, hence the name MPP (multiple plus prime). The second method selects only pseudo-Mersenne primes. As an illustration, two RNS bases with moduli {2⁶⁴ − 33, 2⁶⁴ − 15, 2⁶⁴ − 7, 2⁶⁴ − 3} and {2⁶⁴ − 17, 2⁶⁴ − 11, 2⁶⁴ − 9, 2⁶⁴ − 5} can be obtained, yielding the various weights Bij representable by at most 14 bits down to 5 bits. Thus, v reduces to 14 for one RNS and 8 for the other. Thus, 64×64 multiplications can be replaced with 64×14 and 64×8 multiplications. The first method is attractive when n is very small; note that the multiplications are not performed as additions.
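The MPP construction can be illustrated directly: any prime dividing p_j − p_i is smaller than p_{2n} and therefore divides M, so M + p_i and M + p_j share no factor. A minimal sketch, assuming n = 4 and M = Θ as example choices:

```python
# "Multiple plus prime" (MPP): moduli M + p_i are pairwise coprime
# when M is a multiple of the product of the first 2n primes.
from math import gcd, prod

primes = [2, 3, 5, 7, 11, 13, 17, 19]          # first 2n primes, n = 4
theta = prod(primes)
M = theta                                       # any multiple of theta works
moduli = [M + p for p in primes]
assert all(gcd(a, b) == 1
           for i, a in enumerate(moduli) for b in moduli[i+1:])
```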
Montgomery reduction in RNS has higher complexity than ordinary Montgomery reduction. However, this overhead of slow reduction can be partially removed by reducing the number of reductions, a technique known as lazy reduction. In computing ab + cd with a, b, c, d in Fp, where p is an n-word prime number, we need 4n² + 2n word operations, since each modular multiplication needs 2n² + n word products using the digit-serial Montgomery algorithm [11]. In lazy reduction, first ab + cd is computed and then reduced, needing only 3n² + n word products [69]. Lazy reduction performs one reduction per group of multiplications. This is possible for expressions like AB ± CD ± EF in Fp. In RNS [89], it takes 2s² + 11s word multiplications, while it takes 4s² + s using digit-serial Montgomery modular multiplication [11]. The actual operating range should be large for lazy reduction to be economical (e.g. 22p² is needed for the computation of (10.51)). Around 10,000 modular multiplications are needed for a pairing processor.
Yao et al. [89] have described a high-speed pairing co-processor using RNS and lazy reduction. They have used homogeneous coordinates [96] for realizing the Ate and optimal Ate pairings. The algorithm for optimal Ate pairing is presented in Figure 10.21. The formulas used for point doubling, point addition and line evaluation, together with the operations needed in the Miller loop, are presented in Table 10.4. Note that S, M and R stand for squaring, multiplication and reduction in F_{p^2}, and m and r stand for multiplication and reduction in Fp. The cost of squaring in Fp is also counted as m. Note that M = 4m, S = 3m and R = 2r. The schoolbook method has been employed to avoid more additions. The final addition is carried out in steps 10 and 11, whereas the final exponentiation is carried out in step 12. The operation count is presented in Table 10.5. This includes the Frobenius endomorphism of Q1 and Q2. Note that the computation T − Q2 is skipped because this point is not needed further.

Figure 10.21 Algorithm for optimal Ate pairing (adapted from [89] ©2011)
The algorithm for the computation of f^{p^6−1} in F_{p^12} [89] is shown in Figure 10.22. The hard part is computed following Devegili et al. [93] as shown in Table 10.5. The inversion needed in Figure 10.22 is carried out as d^{−1} = d^{p−2} mod p.
The architecture of the pairing coprocessor due to Yao et al. [89] is presented in Figure 10.23. It uses eight PEs in the Cox-Rower architecture (see Figure 10.23c). Each PE caters for one channel of B and one channel of C. Four DSP blocks can realize a 35 × 35 multiplier, whereas two DSPs can realize a 35 × 25 multiplier. A dual-mode multiplier (see Figure 10.23a), which can perform two element multiplications at a time, is used. It uses two accumulators to perform the four multiplications in parallel and accumulate the results in parallel. Each PE performs two element multiplications at a time using the dual multiplier. Thus, the Cox-Rower algorithm is modified accordingly and needs fewer loop iterations. The Cox unit is an accumulator that provides result-correction information to all PEs in the base extension operation. The register receives ξ values from every PE and delivers two ξ values to all PEs at a time. The internal design of the PE is shown in Figure 10.23b, comprising a dual-mode multiplier, an adder, 2 RAMs for multiplier inputs, 2 RAMs for adder inputs, 3 accumulators and one channel reduction module. Two accumulators are used for polynomial multiplication and one is used for matrix multiplication of the base extension step. The authors have
Table 10.4 Pipeline design and operation count of Miller loop (adapted from [89] ©2011)

Condition r_i = 0:
  Step 1 (3S + 2M + 5R): A = y1², B = 3b′z1², C = 2x1y1, D = 3x1², E = 2y1z1
  Step 2 (2S + 3M + 4m + 5R): x3 = (A − 3B)C, y3 = A² + 6AB − 3B², z3 = 4AE, l0 = (B − A)ζ, l3 = yP E, l4 = xP D
  Step 3 (6S + 15M + 6R): f = f²
  Step 4 (18M + 6R): f = f · l

Condition r_i = 1:
  Step 1 (5S + 4M + 6R): A = y1², B = 3b′z1², C = 2x1y1, D = 3x1², E = 2y1z1, f′0 = (f²)0
  Step 2 (2S + 6M + 4m + 6R): x3 = (A − 3B)C, y3 = A² + 6AB − 3B², z3 = 4AE, l0 = (B − A)ζ, l3 = yP E, l4 = xP D, f′1 = (f²)1
  Step 3 (4S + 12M + 4m + 6R): A = y3 − yQ z3, B = x3 − xQ z3, f2,3,4,5 = (f²)2,3,4,5
  Step 4 (18M + 6R): f = f · l
  Step 5 (2S + 2M + 4m + 5R): C = A², D = B², l0 = (xQ A − yQ B)ζ, l3 = yP B, l4 = xP A
  Step 6 (12M + 6R): C = z3 C, E = BD, D = x3 D, f0,1,2 = (f · l)0,1,2
  Step 7 (13M + 6R): x3 = B(E + C − 2D), y3 = A(3D − E − C) − y3 E, z3 = z3 E, f3,4,5 = (f · l)3,4,5
Table 10.5 Operation count of final steps (adapted from [89] ©2011). Here S̃, M̃ and R̃ denote squaring, multiplication and reduction in F_{p^12}.

FA1:
  Q1 ← π_p(Q): M + 2m + R
  T ← T + Q1 and l_{T,Q1}(P): 2S + 11M + 8m + 13R
  f ← f · l_{T,Q1}(P): 18M + 6R
FA2:
  Q2 ← π_p(Q1): M + 2m + R
  l_{T,Q2}(P): 4M + 8m + 5R
  f ← f · l_{T,Q2}(P): 18M + 6R
f^{p^6−1}:
  Before d^{−1}: 9S + 12M + 2m + 7R + r
  d^{−1} = d^{p−2}: 294m + 294r
  After d^{−1}: 6S + 36M + 16R
f^{p^2+1}:
  f ← f^{p^2+1}: M̃ + R̃ + 8m + 4R
f^{(p^4−p^2+1)/n}:
  a ← f^{6|u|−5}: 64S̃ + 4M̃ + 68R̃
  b ← a^{p+1}: M̃ + R̃ + 3M + 4m + 5R
  f^p, f^{p^2}, f^{p^3}: 6M + 12m + 12R
  T ← b · (f^p)² · f^{p^2}: 2M̃ + S̃ + 3R̃
  T ← T^{6u²+1}: 126S̃ + 12M̃ + 138R̃
  f ← f^{p^3} · T · b · (f^{p+1})^9 · a · f^4: 7M̃ + 5S̃ + 12R̃

Figure 10.22 Algorithm for computation of f^{p^6−1} in F_{p^12} (adapted from [89] ©2011)

implemented using a Virtex-6 XC6VLX240T-2 FPGA. The performance figures are as follows: the Ate and optimal Ate pairing design uses 7032 slices and 32 DSP48E1s at a frequency of 250 MHz, with 229,109 cycles and a delay of 0.916 ms for Ate pairing, and 166,027 cycles and a delay of 0.664 ms for optimal Ate pairing.
332 10 RNS in Cryptography

Figure 10.23 Pairing coprocessor hardware architecture (adapted from [89] ©2011)

Table 10.6 Number of operations and cycles per computation in the Miller loop (adapted from [89] ©2011)

Operation | 2T and l_{T,T}(P) | T + Q and l_{T,Q}(P) | f² | f · l | Ate | Optimal
#Multiplications | 39 | 54 | 78 | 72 | – | –
#Reductions | 20 | 26 | 12 | 12 | – | –
#Cycles | 340 | 456 | 313 | 301 | 128,531 | 64,084

The authors have given in detail the number of operations and cycles required for the Miller loop (see Table 10.6) and for the computation of the final steps in Table 10.7. One pairing computation can be completed in about 1 ms. The speed advantage can be seen from the fact that a 256 × 256 multiplication in RNS needs 16 32 × 32-bit multiplications, whereas Karatsuba needs 27 multiplications.
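The counts quoted above can be reproduced directly (a sketch that only tallies word products; it ignores the base extensions that RNS needs around each reduction):

```python
# 256-bit operands as 8 words of 32 bits:
# - schoolbook multiplication needs 8*8 = 64 word products,
# - Karatsuba needs K(8) = 27, with K(n) = 3*K(n/2) and K(1) = 1,
# - RNS with 8 channels per base and 2 bases needs 2*8 = 16
#   independent 32x32 products (one per channel, carry-free).

def karatsuba_mults(words):
    return 1 if words == 1 else 3 * karatsuba_mults(words // 2)

def rns_mults(channels, bases=2):
    return channels * bases

assert 8 * 8 == 64
assert karatsuba_mults(8) == 27
assert rns_mults(8) == 16
```
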
Duquesne and Guillermin [66] described an FPGA implementation of optimal Ate pairing using RNS for the 128-bit security level in large characteristic. The authors have used BN curves, with u = −(2^62 + 2^55 + 1) for 126-bit security and u = −(2^63 + 2^22 + 2^8 + 2^7 + 1) for 128-bit security. The base extension using MRC described earlier [52] needs (7/5)n² + (8/5)n RNS digit multiplications and cannot be parallelized, as it uses MRC. On the other hand, the Kawamura et al. technique [34] has an overall complexity of 2n² + 3n, which has been enhanced
Table 10.7 Number of operations and cycles per computation in final steps (adapted from [89] ©2011)

Step | Operation | #Cycles | #Idle cycles | Occupation rate (%)
Final addition | FA1 | 799 | 7 | 99.1
Final addition | FA2 | 566 | 50 | 91.2
f^{p^6−1} | d^{p−2} mod p | 25,537 | 21,127 | 17.3
f^{p^6−1} | Others | 1333 | 240 | 82.0
f^{p^2+1} | f^{p^2+1} | 573 | 9 | 98.4
f^{(p^4−p^2+1)/n} | f^{6|u|−5} | 21,812 | 68 | 99.7
f^{(p^4−p^2+1)/n} | T^{6u²+1} | 44,778 | 138 | 99.7
f^{(p^4−p^2+1)/n} | Others | 6905 | 47 | 99.3
Total | Ate | 100,578 | 21,629 | 78.5
Total | Optimal | 101,943 | 21,686 | 78.7

Figure 10.24 Algorithm for optimal Ate pairing (adapted from [67] ©2012)

by using n parallel rowers to achieve one RNS digit multiplication and accumulation per cycle. Hence, a full-length multiplication can be done in two cycles, one over base B and one over base B′, an addition in 4 cycles, a subtraction in 6 cycles and a whole reduction in 2n + 3 cycles. They have used lazy reduction. As such, for F_{p^k} only k reductions are needed, and thus (2kn + 3k) cycles overall. The algorithm implemented is shown in Figure 10.24. The authors use projective coordinates [96, 97]. The doubling and addition steps, together with line raising, are presented in detail in Figure 10.25a–c, where it may be noted that the classical formulae are rearranged to highlight reduction and the inherent parallelism in local

Figure 10.25 (a–c). Algorithms for doubling step, addition step and hard part of the final
exponentiation (adapted from [66] ©2012)

variables. The F_{p^12} inversion is based on the Hankerson et al. [88] formulae. The final exponentiation used the Scott et al. multi-addition chain [92] described before. The squaring in F_{p^12} uses the technique proposed in [98]. Specifically, for a = Σ_{i=0}^{5} a_i γ^i with a_i ∈ F_{p^2}, the coefficients of A = a² are given by

A0 = 3a0² + 3(1 + i)a3² − 2a0    A1 = 6(1 + i)a2a5 + 2a1
A2 = 3a1² + 3(1 + i)a4² − 2a2    A3 = 6a0a3 + 2a3        (10.55)
A4 = 3a2² + 3(1 + i)a5² − 2a4    A5 = 6a1a4 + 2a5
The authors give details of the operation count for the BN126 and BN128 curves, which saves cycles at each step using RNS as compared to the earlier design of Guillermin [99]. The saving is due to exploitation of the inherent parallelism in the algorithms dbl (see Figure 10.25a) and add, as well as in operations in F_{p^12}. The pipeline depth could be up to 8 and still avoid idle states. In F_{p^2}, the authors use multiplication and subtraction with additional hardware in parallel to avoid cascaded operations, thus saving 4 cycles at the expense of increased hardware. A similar technique has been used for F_{p^12} as well. The authors have implemented the design on Altera Cyclone II, Stratix II and Stratix III devices. The results are as follows:
BN126 on Cyclone II EP2C35: 91 MHz frequency, size 14,274 LCs, time 1.94 ms
BN126 on Stratix II EP2S30: 154 MHz frequency, size 4227 ALMs, time 1.14 ms
BN126 on Stratix III EP3S50: 165 MHz frequency, size 4233 ALMs, time 1.07 ms
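The quoted parameters can be checked against the standard BN parametrization p(u) = 36u⁴ + 36u³ + 24u² + 6u + 1 and r(u) = 36u⁴ + 36u³ + 18u² + 6u + 1 (a sketch that only computes sizes; the security estimates themselves are taken from the text, not recomputed):

```python
# Sizes of the BN primes generated by the two u values quoted above.

def bn_p(u):
    return 36*u**4 + 36*u**3 + 24*u**2 + 6*u + 1

def bn_r(u):  # prime group order
    return 36*u**4 + 36*u**3 + 18*u**2 + 6*u + 1

u126 = -(2**62 + 2**55 + 1)                 # "126-bit security" parameter
u128 = -(2**63 + 2**22 + 2**8 + 2**7 + 1)   # "128-bit security" parameter

assert bn_p(u126).bit_length() == 254
assert bn_r(u126).bit_length() == 254
assert bn_p(u128).bit_length() == 258
```
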
Duquesne [69] has considered the application of RNS for fast pairing computation using BN curves and MNT curves, considering Tate, Ate and R-Ate pairings and employing lazy reduction and efficient arithmetic in extension fields. Duquesne has presented a very detailed description of every computation needed for all three pairings.
We consider Tate pairing first. Jacobian coordinates have been used for P and affine coordinates for Q in this approach. The algorithm for Tate pairing for MNT curves is presented in Figure 10.26, where P = (xP, yP) ∈ E(Fp)[ℓ] and Q = (xQ, yQβ) with xQ, yQ ∈ F_{p^3}. It is assumed that F_{p^6} is built as a quadratic extension of F_{p^3}: F_{p^6} = F_{p^3}[Y]/(Y² − υ) = F_{p^3}[β]. The authors consider multiplication and squaring to be of the same complexity. Lines 1–4 of the algorithm in Figure 10.26 perform doubling of the point T ∈ E(Fp) in Jacobian coordinates and need 10 multiplications in Fp and 8 modular reductions, considering lazy reduction. Lines 7–10 use mixed addition of T and P and require 11 multiplications and 10 modular reductions in Fp. Since xQ and yQ are in F_{p^3}, line 5 requires 9 multiplications in Fp and 8 modular reductions. Line 11 needs 7 multiplications and 6 reductions in Fp. In line 6, a multiplication and a squaring in F_{p^6} are needed, which require 30 multiplications and 12 modular reductions in Fp. For line 12, we need 18 multiplications and 6 reductions. In line 13, the exponentiation by p³ is free by conjugation, whereas the inversion requires 36 multiplications, 16 reductions and one inversion in Fp, and the remaining multiplication in F_{p^6} requires 18 multiplications and 6 reductions. This step in total needs 54 multiplications, 22 modular reductions and one inversion in Fp. In line 14, a multiplication in F_{p^6} and a Frobenius computation are needed. The

Figure 10.26 Algorithm for Tate pairing for MNT curves (adapted from [69])

latter needs 5 modular multiplications in Fp; overall, the second step of the final exponentiation needs 23 multiplications and 11 reductions in Fp.
The hard part involves one Frobenius (five modular multiplications), one multiplication in F_{p^6} and one exponentiation by 2ℓ. Each step of this exponentiation requires either 12 multiplications and 6 reductions or 18 multiplications and 6 reductions, depending on whether a multiplication is needed in that step. For 96-bit security, ℓ has a bit length of 192, implying that lines 1–6 are done 191 times and lines 7–12 around 96 times. In total, the Miller loop needs 191 × (10 + 9 + 30) + 96 × (11 + 7 + 8) = 12,815 multiplications and 191 × (8 + 8 + 12) + 96 × (10 + 6 + 6) = 7460 reductions. The easy part of the final exponentiation needs 1 inversion, 77 multiplications and 33 reductions in Fp. Considering that 2l is 96 bits long, the hard part can perform the exponentiation using a sliding window of 3 for computing f^{2l}. This needs 96 squarings in F_{p^6}, 24 multiplications in F_{p^6} and three

pre-computations. Thus, the hard part needs 5 + 18 + 97 × 12 + 27 × 18 = 1673 multiplications and 5 + 6 + 97 × 6 + 27 × 6 = 755 reductions, and the full Tate pairing needs 14,565 multiplications and 8248 reductions. A radix implementation, on the other hand, needs 14,565 × 6² + 8248 × (6² + 6) = 870,756 word multiplications, whereas RNS needs 1.1 × (14,565 × 2 × 6 + 8248 × ((7/5) × 6² + (8/5) × 6)) = 736,626 word multiplications, indicating a gain of 15.4 %.
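The word-level cost model behind these figures can be written out explicitly (a sketch of the accounting in [69]: n² word products per radix multiplication, n² + n per radix reduction, 2n per RNS multiplication and (7/5)n² + (8/5)n per RNS reduction, with a factor 1.1 for RNS overheads):

```python
def radix_cost(mults, reds, n):
    # n-word Montgomery (radix) arithmetic
    return mults * n**2 + reds * (n**2 + n)

def rns_cost(mults, reds, n):
    # n-channel RNS arithmetic with lazy reduction
    return round(1.1 * (mults * 2 * n + reds * (7 * n * n + 8 * n) / 5))

# MNT Tate pairing at 96-bit security: n = 6 words/channels
m, r = 14565, 8248
assert radix_cost(m, r, 6) == 870756
assert rns_cost(m, r, 6) == 736626
assert round(100 * (1 - rns_cost(m, r, 6) / radix_cost(m, r, 6)), 1) == 15.4
```
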
In the case of Ate pairing, lines 1–4 are done in F_{p^3}, requiring 10 multiplications and 8 reductions in F_{p^3}, i.e. 60 multiplications and 24 reductions in Fp. Similarly, lines 7–10 require 11 multiplications and 10 reductions in F_{p^3}, or 66 multiplications and 30 reductions in Fp. If the coordinates of T are (XT, YTβ, ZT), lines 5 and 11 must be replaced by

5′: g = Z_{2T} Z_T² yP β + A(XT − Z_T² xP) − 2νY_T²
11′: g = Z_{T+P} yQ β + Z_{T+P} yP − F(xP − xQ)        (10.56)

where Z_{2T}, Z_T², A = 3X_T² − a4 Z_T⁴, Y_T², Z_{T+P} and F = YT − yQ Z_T³ are computed in F_{p^3} in the previous steps. The first requires 18 multiplications and 12 reductions in Fp, whereas the second requires 15 multiplications and 6 reductions.
Finally, since t − 1 has a bit length of 96 and a Hamming weight of about 48, the Miller loop requires 95 × (60 + 18 + 30) + 47 × (66 + 15 + 18) = 14,913 multiplications and 95 × (24 + 12 + 12) + 47 × (30 + 6 + 6) = 6534 reductions. The final exponentiation is the same as in Tate pairing, and thus the full Ate pairing needs 16,663 multiplications and 7322 reductions. In radix representation, this means that 907,392 word multiplications are needed, whereas in RNS only 703,204 word multiplications are needed. The gain is thus 22.5 %.
In the case of BN curves, the flow chart is presented in Figure 10.27. Due to the twist of order 6 for BN curves, some improvements can be made. The author considers that F_{p^12} is built as a quadratic extension of a cubic extension of F_{p^2}, which is compatible with the use of the twist of order 6. Due to the twist defined by v, the second input of Tate pairing can be written as Q = (xQ γ² + yQ γ³) with xQ, yQ ∈ F_{p^2}. As seen earlier in the case of MNT curves, lines 1–4 of the algorithm need 7 multiplications in Fp and 6 modular reductions, whereas lines 7–10 require 11 multiplications and 10 modular reductions in Fp. Since xQ and yQ are in F_{p^2}, line 5 requires 8 modular multiplications in Fp, and lazy reduction cannot be used. In line 11, 6 multiplications and only 5 modular reductions are needed, since lazy reduction can be used on the constant term. Line 6 involves both a squaring and a multiplication in F_{p^12}. The squaring requires 36 multiplications and 12 reductions. Furthermore, the multiplication by g needs 39 multiplications and 12 reductions. Thus, the total complexity of line 6 is 75 multiplications and 24 modular reductions in Fp. The case of line 12 is similar, and it needs 39 modular multiplications and

Figure 10.27 Algorithm for Tate pairing for BN curves (adapted from [69])

12 reductions in Fp. Line 13 computes f^{p^6}/f, where the computation of f^{p^6} is free by conjugation. Hence, one multiplication and one inversion are needed in F_{p^12}. This inversion needs one inversion, 97 multiplications and 35 reductions in Fp. The first step of the exponentiation thus requires 151 multiplications, 47 modular reductions and one inversion in Fp. Line 14 involves one multiplication in F_{p^12} and one powering to p². The Frobenius map and its iterations need 11 modular multiplications in Fp. This step thus needs 65 multiplications and 23 reductions in Fp.
The hard part, given in line 15, involves one Frobenius (11 modular multiplications), one multiplication in F_{p^12} (54 multiplications and 12 reductions) and one exponentiation. Since for BN curves l can be chosen sparse, a classical square-and-multiply can be used. Since in line 13 f has been raised to the power (p⁶ − 1), it

is a unit and can be squared with only 2 squarings and 2 reductions in F_{p^2} per coefficient, i.e. 24 multiplications and 12 reductions in Fp. Thus, the cost is only 24 multiplications and 12 reductions for most steps. For steps corresponding to the non-zero bits of the exponent, 54 additional multiplications and 12 additional reductions are necessary.
In line 16, four applications of the Frobenius map, 9 multiplications and 6 squarings in F_{p^12} (i.e. 674 multiplications and 224 reductions in Fp) are needed. It also needs an exponentiation which is similar to that of line 15 but two times larger. Considering Hamming weights of 11 for l and 90 for ℓ, we observe that lines 1–6 are done 255 times and lines 7–12 are done 89 times for a 128-bit security level. Thus, the Miller loop needs 255 × (7 + 8 + 75) + 89 × (11 + 6 + 39) = 27,934 multiplications and 255 × (6 + 8 + 24) + 89 × (10 + 5 + 12) = 12,093 reductions. The easy part of the final exponentiation requires one inversion, 216 multiplications and 70 reductions in Fp. The hard part involves exponentiation by 6l − 5, which has a Hamming weight of 11, and by 6l² + 1, which has a Hamming weight of 28. The second exponentiation can be split into two parts, l and 6l [88], both having a Hamming weight of 11. This leads to 21 multiplications. Lines 15 and 16 require 11 + 54 + 65 × 24 + 9 × 54 + 674 + 127 × 24 + 21 × 54 = 6967 multiplications and 11 + 12 + 65 × 12 + 9 × 12 + 224 + 127 × 12 + 21 × 12 = 2911 reductions. Thus, the full Tate pairing needs 35,117 multiplications but only 15,074 reductions. For a radix implementation using 8 (32-bit) words, we need 35,117 × 8² + 15,074 × (8² + 8) = 3,332,816 word multiplications, whereas RNS needs 1.1 × (35,117 × 2 × 8 + 15,074 × ((7/5) × 8² + (8/5) × 8)) = 2,315,994 word multiplications. This is a gain of 30.5 %.
In the case of Ate pairing, lines 1–4 are done in F_{p^2}, requiring 3 multiplications, 4 squarings and 6 reductions in F_{p^2}, i.e. 17 multiplications and 12 reductions in Fp. Similarly, lines 7–10 require 8 multiplications, 3 squarings and 10 reductions in F_{p^2}, or 30 multiplications and 20 reductions in Fp. If the coordinates of T are (XT γ², YT γ³, ZT), lines 5 and 11 must be replaced by

5′: g = Z_{2T} Z_T² yP − A Z_T² xP γ + (A XT − 2Y_T²) γ³
11′: g = Z_{T+P} yP − F xP γ + (F xQ − Z_{T+P} yQ) γ³        (10.57)

where Z_{2T}, A = 3X_T², Y_T², Z_{T+P} and F = YT − yQ Z_T³ were computed in the previous steps. The first requires 15 multiplications and 12 reductions in Fp, whereas the second requires 10 multiplications and 6 reductions. Note further that the value g obtained has only terms in γ and γ³ and a constant term, so that a multiplication by g requires only 39 multiplications instead of 54.
Next, since t − 1 has a bit length of 128 and a Hamming weight of 29, the total cost of the Miller loop is 127 × (17 + 15 + 36 + 39) + 28 × (30 + 10 + 39) = 15,801 multiplications and 127 × (12 + 12 + 24) + 28 × (20 + 6 + 12) = 7160 reductions. The final exponentiation is the same as in Tate pairing, and thus the full Ate pairing needs 22,984 multiplications and 10,241 reductions. In radix representation, this means that 2,208,328 word multiplications are needed, whereas in RNS only 1,558,065 word multiplications are needed. The gain is thus 29.5 %.
In the case of R-Ate pairing, while the Miller loop is the same, an additional step is necessary at the end: the computation of f ← (f · g_{(T,Q)}(P))^p · g_{(π(T+Q),T)}(P), where T = (6l + 2)Q is computed in the Miller loop and π is the Frobenius map on the curve. The following operations are needed in this computation. One step of addition as in the Miller loop (computation of T + Q and g_{(T+Q)}(P)) needs 40 multiplications and 26 reductions in Fp. As p ≡ 1 mod 6 for BN curves, one application of the Frobenius map is needed, which requires 2 multiplications in F_{p^2} by pre-computed values. Next, one non-mixed addition step (computation of g_{(π(T+Q),T)}(P)) needs 60 multiplications and 40 reductions in Fp. Two multiplications by the results of the two previous steps require 39 multiplications and 12 reductions in Fp each. Next, a Frobenius needs 11 modular multiplications and, finally, one full multiplication in F_{p^12} requires 54 multiplications and 12 reductions in Fp. Thus, in total this step requires 249 multiplications and 117 reductions in Fp. Considering that 6l + 2 has 66 bits and a Hamming weight of 9, the cost of the Miller loop is 65 × (17 + 15 + 36 + 39) + 8 × (30 + 10 + 39) = 7587 multiplications and 65 × (12 + 12 + 24) + 8 × (20 + 6 + 12) = 3424 reductions. The final exponentiation is the same as for Tate pairing. Hence, for the complete R-Ate pairing, we need 15,019 multiplications and 6405 reductions. This means that 1,422,376 word multiplications in radix representation and 985,794 in the case of RNS will be required, thus saving 30.7 %.
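The Miller-loop tallies for the BN Ate and R-Ate cases follow the same pattern and can be checked mechanically (per-iteration costs as quoted above):

```python
# doubling part: 17 + 15 + 36 + 39 mults, 12 + 12 + 24 reductions;
# addition part: 30 + 10 + 39 mults, 20 + 6 + 12 reductions.

DBL = (17 + 15 + 36 + 39, 12 + 12 + 24)   # (mults, reductions)
ADD = (30 + 10 + 39, 20 + 6 + 12)

def miller_cost(doublings, additions):
    return (doublings * DBL[0] + additions * ADD[0],
            doublings * DBL[1] + additions * ADD[1])

# Ate: t - 1 has 128 bits and Hamming weight 29
assert miller_cost(127, 28) == (15801, 7160)
# R-Ate: 6l + 2 has 66 bits and Hamming weight 9
assert miller_cost(65, 8) == (7587, 3424)
```
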
Kammler et al. [100] have described an ASIP (application-specific instruction-set processor) for BN curves. They consider the NIST-recommended prime group order of 256 bits for E(Fp) and 3072 bits for the finite field F_{p^k} (256 × 12 = 3072, since k = 12). This ASIP is programmable for all pairings. They keep the points in Jacobian coordinates throughout the pairing computation, and thus field inversion can be avoided almost entirely. Inversion is accomplished by exponentiation with (p − 2). All values are kept in Montgomery form throughout the pairing computation.
The authors have used the scalable Montgomery modular multiplier architecture (see Figure 10.28a) due to Nibouche et al. [101], which can be segmented and pipelined. In this technique, for computing ABR⁻¹ mod M, the algorithm is split into two multiplication operations that can be performed in parallel. It uses carry-save number representation. The actual multiplication is carried out in the left half (see Figure 10.28a) and the reduction is carried out in the right half simultaneously. The left half is a conventional multiplier built up of gated full adders, and the right half is a multiplier with special cells for the LSBs. These LSB cells are built around half adders. Due to area constraints, subsets of the regular structure of the multiplier have been used and the computation is performed in multiple cycles. They have used multi-cycle multipliers for W × H (W is the word length and H is the number of words) of three different sizes: 32 × 8, 64 × 8 and 128 × 8 bits. For example, for a 256-bit multiplier,

Figure 10.28 (a) Montgomery multiplier based on the Nibouche et al. technique and (b) multi-cycle Montgomery multiplier (MMM) (adapted from [100] ©2009)

H = 8 and W = 32 can be used. Thus, A is taken 8 bits at a time and B 32 bits at a time, needing 256 cycles for the multiplication with partial reduction and addition (see Figure 10.28a). This approach makes the design adaptable to the desired computation performance and allows trading off area against the execution time of the multiplication.
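A word-serial Montgomery multiplication can be sketched as follows (a plain sequential model for clarity; the Nibouche et al. design instead splits the multiply and reduce halves into parallel carry-save arrays, which this model does not capture):

```python
W = 32                # word size
BASE = 1 << W

def montgomery_mul(a, b, m, n_words):
    """Return a * b * 2^(-W*n_words) mod m, for odd m and a, b < m."""
    m_prime = (-pow(m, -1, BASE)) % BASE     # -m^(-1) mod 2^W, precomputed
    t = 0
    for i in range(n_words):
        ai = (a >> (W * i)) & (BASE - 1)
        t += ai * b                          # multiply: add one word of a
        q = ((t & (BASE - 1)) * m_prime) & (BASE - 1)
        t = (t + q * m) >> W                 # reduce: clear the low word
    return t - m if t >= m else t            # final conditional subtraction

m = (1 << 255) - 19                          # example odd 8-word modulus
a, b = 0x1234567890ABCDEF, 0xFEDCBA0987654321
expected = (a * b * pow(1 << (W * 8), -1, m)) % m
assert montgomery_mul(a, b, m, 8) == expected
```
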
The structure of the multi-cycle Montgomery multiplier (MMM) is shown in Figure 10.28b. The two's complementer is included in the multiplication unit. The result is stored in the registers holding the temporary carry-save values CM, SM, SR and CR. The authors have used a multi-cycle adder unit for modular addition and subtraction. In addition, an enhanced memory architecture has been employed: transparent interleaved memory segmentation. Basically, the number of ports to the memory system is extended to increase the throughput, and these memory banks can be accessed in parallel. The authors mention that in 130 nm standard cell technology, an optimal

Table 10.8 Number of operations needed in various pairing computations (adapted from [100] ©IACR 2009)

Number of | Opt. Ate | Ate | η | Tate | Comp. η | Comp. Tate
Multiplications | 17,913 | 25,870 | 32,155 | 39,764 | 75,568 | 94,693
Additions | 84,956 | 121,168 | 142,772 | 174,974 | 155,234 | 193,496
Inversions | 3 | 2 | 2 | 2 | 0 | 0

Ate pairing needed 15.8 ms at a frequency of 338 MHz. The number of operations needed for different pairing applications is presented in Table 10.8 in order to illustrate the complexity of a pairing processor.
Barenghi et al. [102] described an FPGA co-processor for Tate pairing over Fp which used the BKLS algorithm [62], followed by Lucas laddering [103] for the final exponentiation by (p^k − 1)/r:

f_P(D_Q)^{(p²−1)/r} = ((c + id)^{p−1})^m = ((c − id)²)^m = (a + ib)^m

where m = (p + 1)/r, a = c² − d² and b = −2cd. Note that (a + ib)^m = V_m(2a)/2 + ib U_m(2a),
where U_m and V_m are the mth terms of the Lucas sequences. The prime p is a 512-bit number and k = 2 has been used. They have designed a block for modular addition/subtraction using three 512-bit adders; the adders compute A + B, A + B − M and A − B + M. Modular multiplication uses the Montgomery algorithm based on the CIOS technique. The architecture comprises a microcontroller, a program ROM, an Fp multiplier and adder/subtractor, a register file and an input/output buffer. The microcontroller realizes Miller's loop by calling the corresponding subroutines. The ALU can execute multiplication and addition/subtraction in parallel. A Virtex-2 8000 (XC2V8000-5FF1152) was used, which needed 33,857 slices at a frequency of 135 MHz and a time of 1.61 ms.
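The Lucas-sequence identity used above can be verified numerically in Python (a toy sketch with a small prime p ≡ 3 mod 4 and naive O(m) loops; a real implementation would use a logarithmic Lucas ladder):

```python
# For a unitary a + ib in F_p[i] (a^2 + b^2 = 1 mod p, i^2 = -1):
# (a + ib)^m = V_m(2a)/2 + i*b*U_m(2a), where U and V satisfy
# X_{n+1} = P*X_n - X_{n-1} with P = 2a, U_0, U_1 = 0, 1 and V_0, V_1 = 2, P.

p = 10007                     # example prime with p % 4 == 3, so i^2 = -1 works

def lucas(P, m):
    u0, u1, v0, v1 = 0, 1, 2, P % p
    for _ in range(m):
        u0, u1 = u1, (P * u1 - u0) % p
        v0, v1 = v1, (P * v1 - v0) % p
    return u0, v0             # U_m, V_m

def cpow(a, b, m):            # (a + ib)^m in F_p[i], by repeated multiplication
    x, y = 1, 0
    for _ in range(m):
        x, y = (x * a - y * b) % p, (x * b + y * a) % p
    return x, y

# find a point on the "unit circle": b^2 = 1 - a^2 must be a square mod p
a = b = None
for cand in range(2, 100):
    root = pow((1 - cand * cand) % p, (p + 1) // 4, p)
    if (root * root) % p == (1 - cand * cand) % p:
        a, b = cand, root
        break

assert (a * a + b * b) % p == 1
U, V = lucas((2 * a) % p, 1000)
half = pow(2, -1, p)
assert cpow(a, b, 1000) == ((V * half) % p, (b * U) % p)
```
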

References

1. W. Stallings, Cryptography and Network Security, Principles and Practices, 6th edn. (Pear-
son, Upper Saddle River, 2013)
2. B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C (Wiley,
New York, 1996)
3. P. Barrett, Implementing the Rivest-Shamir-Adleman Public Key algorithm on a standard
Digital Signal Processor, in Proceedings of Annual Cryptology Conference on Advances in
Cryptology (CRYPTO '86), pp. 311–323 (1986)
4. A. Menezes, P. van Oorschot, S. Vanstone, Handbook of Applied Cryptography (CRC, Boca
Raton, 1996)
5. J.-F. Dhem, Modified version of the Barrett Algorithm, Technical report (1994)
6. M. Knezevic, F. Vercauteren, I. Verbauwhede, Faster interleaved modular multiplication
based on Barrett and Montgomery reduction methods. IEEE Trans. Comput. 59, 1715–1721
(2010)

7. J.-J. Quisquater, Encoding system according to the so-called RSA method by means of a
microcontroller and arrangement implementing the system, US Patent #5,166,978, 24 Nov
1992
8. C.D. Walter, Fast modular multiplication by operand scanning, Advances in Cryptology,
LNCS, vol. 576 (Springer, 1991), pp. 313–323
9. E.F. Brickell, A fast modular multiplication algorithm with application to two key cryptog-
raphy, Advances in Cryptology Proceedings of Crypto 82 (Plenum Press, New York, 1982),
pp. 51–60
10. C.K. Koc. RSA Hardware Implementation. TR 801, RSA Laboratories, (April 1996)
11. C.K. Koc, T. Acar, B.S. Kaliski Jr., Analyzing and comparing Montgomery Multiplication
Algorithms, in IEEE Micro, pp. 26–33 (1996)
12. M. McLoone, C. McIvor, J.V. McCanny, Coarsely integrated Operand Scanning (CIOS)
architecture for high-speed Montgomery modular multiplication, in IEEE International
Conference on Field Programmable Technology (ICFPT), pp. 185–192 (2004)
13. M. McLoone, C. McIvor, J.V. McCanny, Montgomery modular multiplication architecture
for public key cryptosystems, in IEEE Workshop on Signal Processing Systems (SIPS),
pp. 349–354 (2004)
14. C.D. Walter, Montgomery exponentiation needs no final subtractions. Electron. Lett. 35,
1831–1832 (1999)
15. H. Orup, Simplifying quotient determination in high-radix modular multiplication, in Pro-
ceedings of IEEE Symposium on Computer Arithmetic, pp. 193–199 (1995)
16. C. McIvor, M. McLoone, J.V. McCanny, Modified Montgomery modular multiplication and
RSA exponentiation techniques, in Proceedings of IEE Computers and Digital Techniques,
vol. 151, pp. 402–408 (2004)
17. N. Nedjah, L.M. Mourelle, Three hardware architectures for the binary modular exponenti-
ation: sequential, parallel and systolic. IEEE Trans. Circuits Syst. I 53, 627–633 (2006)
18. M.D. Shieh, J.H. Chen, W.C. Lin, H.H. Wu, A new algorithm for high-speed modular
multiplication design. IEEE Trans. Circuits Syst. I 56, 2009–2019 (2009)
19. C.C. Yang, T.S. Chang, C.W. Jen, A new RSA cryptosystem hardware design based on
Montgomery’s algorithm. IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 45,
908–913 (1998)
20. A. Tenca, C. Koc, A scalable architecture for modular multiplication based on Montgomery’s
algorithm. IEEE Trans. Comput. 52, 1215–1221 (2003)
21. D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, S. Hsu, An improved unified scalable
radix-2 Montgomery multiplier, in IEEE Symposium on Computer Arithmetic, pp. 172–175
(2005)
22. K. Kelly, D. Harris, Very high radix scalable Montgomery multipliers, in Proceedings of
International Workshop on System-on-Chip for Real-Time Applications, pp. 400–404 (2005)
23. N. Jiang, D. Harris, Parallelized Radix-2 scalable Montgomery multiplier, in Proceedings of
IFIP International Conference on Very Large-Scale Integration (VLSI-SoC 2007),
pp. 146–150 (2007)
24. N. Pinckney, D. Harris, Parallelized radix-4 scalable Montgomery multipliers. J. Integr.
Circuits Syst. 3, 39–45 (2008)
25. K. Kelly, D. Harris, Parallelized very high radix scalable Montgomery multipliers, in Pro-
ceedings of Asilomar Conference on Signals, Systems and Computers, pp. 1196–1200 (2005)
26. M. Huang, K. Gaj, T. El-Ghazawi, New hardware architectures for Montgomery modular
multiplication algorithm. IEEE Trans. Comput. 60, 923–936 (2011)
27. M.D. Shieh, W.C. Lin, Word-based Montgomery modular multiplication algorithm for
low-latency scalable architectures. IEEE Trans. Comput. 59, 1145–1151 (2010)
28. A. Miyamoto, N. Homma, T. Aoki, A. Satoh, Systematic design of RSA processors based on
high-radix Montgomery multipliers. IEEE Trans. VLSI Syst. 19, 1136–1146 (2011)
29. K.C. Posch, R. Posch, Modulo reduction in residue Number Systems. IEEE Trans. Parallel
Distrib. Syst. 6, 449–454 (1995)

30. C. Bajard, L.S. Didier, P. Kornerup, An RNS Montgomery modular multiplication Algo-
rithm. IEEE Trans. Comput. 47, 766–776 (1998)
31. J.C. Bajard, L. Imbert, A full RNS implementation of RSA. IEEE Trans. Comput. 53,
769–774 (2004)
32. A.P. Shenoy, R. Kumaresan, Fast base extension using a redundant modulus in RNS. IEEE
Trans. Comput. 38, 293–297 (1989)
33. H. Nozaki, M. Motoyama, A. Shimbo, S. Kawamura, Implementation of RSA Algorithm
Based on RNS Montgomery Multiplication, in Cryptographic Hardware and Embedded
Systems—CHES, ed. by C. Paar (Springer, Berlin, 2001), pp. 364–376
34. S. Kawamura, M. Koike, F. Sano, A. Shimbo, Cox-Rower architecture for fast parallel
Montgomery multiplication, in Proceedings of International Conference on Theory and
Application of Cryptographic Techniques: Advances in Cryptology, (EUROCRYPT 2000),
pp. 523–538 (2000)
35. F. Gandino, F. Lamberti, G. Paravati, J.C. Bajard, P. Montuschi, An algorithmic and
architectural study on Montgomery exponentiation in RNS. IEEE Trans. Comput. 61,
1071–1083 (2012)
36. D. Schinianakis, T. Stouraitis, A RNS Montgomery multiplication architecture, in Proceed-
ings of ISCAS, pp. 1167–1170 (2011)
37. Y.T. Jie, D.J. Bin, Y.X. Hui, Z.Q. Jin, An improved RNS Montgomery modular multiplier, in
Proceedings of the International Conference on Computer Application and System Modeling
(ICCASM 2010), pp. V10-144–147 (2010)
38. D. Schinianakis, T. Stouraitis, Multifunction residue architectures for cryptography. IEEE
Trans. Circuits Syst. 61, 1156–1169 (2014)
39. H.M. Yassine, W.R. Moore, Improved mixed radix conversion for residue number system
architectures, in Proceedings of IEE Part G, vol. 138, pp. 120–124 (1991)
40. M. Ciet, M. Neve, E. Peeters, J.J. Quisquater, Parallel FPGA implementation of RSA with
residue number systems—can side-channel threats be avoided?, in 46th IEEE International
MW Symposium on Circuits and Systems, vol. 2, pp. 806–810 (2003)
41. J.-J. Quisquater, C. Couvreur, Fast decipherment algorithm for RSA public key cryptosystem.
Electron. Lett. 18, 905–907 (1982)
42. R. Szerwinski, T. Guneysu, Exploiting the power of GPUs for Asymmetric Cryptography.
Lect. Notes Comput. Sci. 5154, 79–99 (2008)
43. B.S. Kaliski Jr., The Montgomery inverse and its applications. IEEE Trans. Comput. 44,
1064–1065 (1995)
44. E. Savas, C.K. Koc, The Montgomery modular inverse—revisited. IEEE Trans. Comput. 49,
763–766 (2000)
45. A.A.A. Gutub, A.F. Tenca, C.K. Koc, Scalable VLSI architecture for GF(p) Montgomery
modular inverse computation, in IEEE Computer Society Annual Symposium on VLSI,
pp. 53–58 (2002)
46. E. Savas, A carry-free architecture for Montgomery inversion. IEEE Trans. Comput. 54,
1508–1518 (2005)
47. J. Bucek, R. Lorencz, Comparing subtraction free and traditional AMI, in Proceedings of
IEEE Design and Diagnostics of Electronic Circuits and Systems, pp. 95–97 (2006)
48. D.M. Schinianakis, A.P. Kakarountas, T. Stouraitis, A new approach to elliptic curve
cryptography: an RNS architecture, in IEEE MELECON, Benalmádena (Málaga), Spain,
pp. 1241–1245, 16–19 May 2006
49. D.M. Schinianakis, A.P. Fournaris, H.E. Michail, A.P. Kakarountas, T. Stouraitis, An RNS
implementation of an Fp elliptic curve point multiplier. IEEE Trans. Circuits Syst. I Reg. Pap.
56, 1202–1213 (2009)
50. M. Esmaeildoust, D. Schinianakis, H. Javashi, T. Stouraitis, K. Navi, Efficient RNS implementation of elliptic curve point multiplication over GF(p). IEEE Trans. Very Large Scale Integration (VLSI) Syst. 21, 1545–1549 (2013)
References 345
51. P.V. Ananda Mohan, RNS to binary converter for a new three moduli set {2^(n+1) − 1, 2^n, 2^n − 1}. IEEE Trans. Circuits Syst. II 54, 775–779 (2007)
52. M. Esmaeildoust, K. Navi, M. Taheri, A.S. Molahosseini, S. Khodambashi, Efficient RNS to binary converters for the new 4-moduli set {2^n, 2^(n+1) − 1, 2^n − 1, 2^(n−1) − 1}. IEICE Electron. Exp. 9(1), 1–7 (2012)
53. J.C. Bajard, S. Duquesne, M. Ercegovac, Combining leak-resistant arithmetic for elliptic curves defined over Fp and RNS representation, Cryptology ePrint Archive, Report 2010/311 (2010)
54. M. Joye, J.J. Quisquater, Hessian elliptic curves and side channel attacks. CHES, LNCS
2162, 402–410 (2001)
55. P.Y. Liardet, N. Smart, Preventing SPA/DPA in ECC systems using Jacobi form. CHES,
LNCS 2162, 391–401 (2001)
56. E. Brier, M. Joye, Weierstrass elliptic curves and side channel attacks. Public Key Cryptography, LNCS 2274, 335–345 (2002)
57. P.L. Montgomery, Speeding the Pollard and elliptic curve methods of factorization. Math.
Comput. 48, 243–264 (1987)
58. A. Joux, A one round protocol for tri-partite Diffie-Hellman, Algorithmic Number Theory,
LNCS, pp. 385–394 (2000)
59. D. Boneh, M.K. Franklin, Identity based encryption from the Weil Pairing, in Crypto 2001,
LNCS, vol. 2139, pp. 213–229 (2001)
60. D. Boneh, B. Lynn, H. Shacham, Short signatures for the Weil pairing. J. Cryptol. 17, 297–319
(2004)
61. J. Groth, A. Sahai, Efficient non-interactive proof systems for bilinear groups, in 27th Annual
International Conference on Advances in Cryptology, Eurocrypt 2008, pp. 415–432 (2008)
62. V.S. Miller, The Weil pairing and its efficient calculation. J. Cryptol. 17, 235–261 (2004)
63. P.S.L.M. Barreto, H.Y. Kim, B. Lynn, M. Scott, Efficient algorithms for pairing based
cryptosystems, in Crypto 2002, LNCS 2442, pp. 354–369 (Springer, Berlin, 2002)
64. F. Hess, N.P. Smart, F. Vercauteren, The eta pairing revisited. IEEE Trans. Inf. Theory 52,
4595–4602 (2006)
65. E. Lee, H.S. Lee, C.M. Park, Efficient and generalized pairing computation on abelian
varieties, Cryptology ePrint Archive, Report 2008/040 (2008)
66. F. Vercauteren, Optimal pairings. IEEE Trans. Inf. Theory 56, 455–461 (2010)
67. S. Duquesne, N. Guillermin, A FPGA pairing implementation using the residue number system, Cryptology ePrint Archive, Report 2011/176 (2011), http://eprint.iacr.org/
68. S. Duquesne, RNS arithmetic in Fpk and application to fast pairing computation, Cryptology
ePrint Archive, Report 2010/55 (2010), http://eprint.iacr.org
69. P. Barreto, M. Naehrig, Pairing-friendly elliptic curves of prime order, in SAC 2005, LNCS 3897, 319–331 (2005)
70. A. Miyaji, M. Nakabayashi, S. Takano, New explicit conditions of elliptic curve traces for
FR-reduction. IEICE Trans. Fundam. 84, 1234–1243 (2001)
71. B. Lynn, On the implementation of pairing-based cryptography, Ph.D. Thesis, Stanford University (2007); PBC Library, https://crypto.stanford.edu/~blynn/
72. C. Costello, Pairings for Beginners, www.craigcostello.com.au/pairings/PairingsForBeginners.pdf
73. J.C. Bajard, M. Kaihara, T. Plantard, Selected RNS bases for modular multiplication, in 19th
IEEE International Symposium on Computer Arithmetic, pp. 25–32 (2009)
74. A. Karatsuba, The complexity of computations, in Proceedings of Steklov Institute of
Mathematics, vol. 211, pp. 169–183 (1995)
75. P.L. Montgomery, Five-, six- and seven term Karatsuba like formulae. IEEE Trans. Comput.
54, 362–369 (2005)
76. J. Fan, F. Vercauteren, I. Verbauwhede, Efficient hardware implementation of Fp-arithmetic
for pairing-friendly curves. IEEE Trans. Comput. 61, 676–685 (2012)
77. J. Fan, F. Vercauteren, I. Verbauwhede, Faster Fp-Arithmetic for cryptographic pairings on
Barreto Naehrig curves, in CHES, vol. 5747, LNCS, pp. 240–253 (2009)
78. J. Fan, http://www.iacr.org/workshops/ches/ches2009/presentations/08_Session_5/CHES2009_fan_1.pdf
79. J. Chung, M.A. Hasan, Low-weight polynomial form integers for efficient modular multipli-
cation. IEEE Trans. Comput. 56, 44–57 (2007)
80. J. Chung, M. Hasan, Montgomery reduction algorithm for modular multiplication using low
weight polynomial form integers, in IEEE 18th Symposium on Computer Arithmetic,
pp. 230–239 (2007)
81. C.C. Corona, E.F. Moreno, F.R. Henriquez, Hardware design of a 256-bit prime field
multiplier for computing bilinear pairings, in 2011 International Conference on
Reconfigurable Computing and FPGAs, pp. 229–234 (2011)
82. S. Srinath, K. Compton, Automatic generation of high-performance multipliers for FPGAs
with asymmetric multiplier blocks, in Proceedings of 18th Annual ACM/Sigda International
Symposium on Field Programmable Gate Arrays, FPGA ‘10, New York, pp. 51–58 (2010)
83. R. Brinci, W. Khmiri, M. Mbarek, A.B. Rabaa, A. Bouallegue, F. Chekir, Efficient multipliers for pairing over Barreto-Naehrig curves on Virtex-6 FPGA, IACR Cryptology ePrint Archive (2013)
84. A.J. Devegili, C. OhEigertaigh, M. Scott, R. Dahab, Multiplication and squaring on pairing
friendly fields, in Cryptology ePrint Archive, vol. 71 (2006)
85. A.L. Toom, The complexity of a scheme of functional elements realizing the multiplication
of integers. Sov. Math. 4, 714–716 (1963)
86. S.A. Cook, On the minimum computation time of functions, Ph.D. Thesis, Harvard Univer-
sity, Department of Mathematics, 1966
87. J. Chung, M.A. Hasan, Asymmetric squaring formulae, Technical Report, CACR 2006-24,
University of Waterloo (2006), http://www.cacr.uwaterloo.ca/techreports/2006/cacr2006-24.
pdf
88. D. Hankerson, A. Menezes, M. Scott, Software Implementation of Pairings, in Identity Based
Cryptography, Chapter 12, ed. by M. Joye, G. Neven (IOS Press, Amsterdam, 2008),
pp. 188–206
89. G.X. Yao, J. Fan, R.C.C. Cheung, I. Verbauwhede, A high speed pairing Co-processor using
RNS and lazy reduction, eprint.iacr.org/2011/258.pdf
90. M. Scott, Implementing cryptographic pairings, in Pairing-Based Cryptography, Pairing 2007, ed. by T. Takagi, T. Okamoto, E. Okamoto, T. Okamoto, LNCS, vol. 4575, pp. 177–196 (2007)
91. J.L. Beuchat, J.E. Gonzalez-Diaz, S. Mitsunari, E. Okamoto, F. Rodriguez-Henriquez, T. Teruya, High-speed software implementation of the optimal ate pairing over Barreto-Naehrig curves, in Pairing 2010, ed. by M. Joye, A. Miyaji, A. Otsuka, LNCS 6487, pp. 21–39 (2010)
92. M. Scott, N. Benger, M. Charlemagne, L.J.D. Perez, E.J. Kachisa, On the final exponentiation
for calculating pairings on ordinary elliptic curves, Cryptology ePrint Archive, Report 2008/
490(2008), http://eprint.iacr.org/2008/490.pdf
93. A.J. Devegili, M. Scott, R. Dahab, Implementing cryptographic pairings over Barreto-
Naehrig curves, in Pairing 2007, LNCS, vol. 4575 (Springer, Berlin, 2007), pp. 197–207
94. J. Olivos, On vectorial addition chains. J. Algorithm 2, 13–21 (1981)
95. G.X. Yao, J. Fan, R.C.C. Cheung, I. Verbauwhede, Novel RNS parameter selection for fast
modular multiplication. IEEE Trans. Comput. 63, 2099–2105 (2014)
96. C. Costello, T. Lange, M. Naehrig, Faster pairing computations on curves with high degree
twists, ed. by P. Nguyen, D. Pointcheval, PKC 2010, LNCS, vol. 6056, pp. 224–242 (2010)
97. D. Aranha, K. Karabina, P. Longa, C.H. Gebotys, J. Lopez, Faster explicit formulae for
computing pairings over ordinary curves, Cryptology ePrint Archive, Report 2010/311
(2010), http://eprint.iacr.org/
98. R. Granger, M. Scott, Faster squaring in the cyclotomic subgroups of sixth degree extensions,
PKC-2010, 6056, pp. 209–223 (2010)
99. N. Guillermin, A high speed coprocessor for elliptic curve scalar multiplications over Fp,
CHES, LNCS (2010)
100. D. Kammler, D. Zhang, P. Schwabe, H. Scharwaechter, M. Langenberg, D. Auras,
G. Ascheid, R. Leupers, R. Mathar, H. Meyr, Designing an ASIP for cryptographic pairings
over Barreto-Naehrig curves, in CHES 2009, LNCS 5747 (Springer, Berlin, 2009),
pp. 254–271
101. D. Nibouche, A. Bouridane, M. Nibouche, Architectures for Montgomery’s multiplication, in
Proceedings of IEE Computers and Digital Techniques, vol. 150, pp. 361–368 (2003)
102. A. Barenghi, G. Bertoni, L. Breveglieri, G. Pelosi, A FPGA coprocessor for the cryptographic
Tate pairing over Fp, in Proceedings of Fifth International Conference on Information
Technology: New Generations, ITNG 2008, pp. 112–119 (April 2008)
103. M. Scott, P.S.L.M. Barreto, Compressed pairings, in CRYPTO, Lecture Notes in Computer
Science, vol. 3152, pp. 140–156 (2004)

Further Reading

E. Savas, M. Nasser, A.A.A. Gutub, C.K. Koc, Efficient unified Montgomery inversion with multibit shifting, in Proceedings of IEE Computers and Digital Techniques, vol. 152, pp. 489–498 (2005)
A.F. Tenca, G. Todorov, C.K. Koc, High radix design of a scalable modular multiplier, in
Proceedings of Third International Workshop on Cryptographic Hardware and Embedded
Systems, CHES, pp. 185–201 (2001)
