
This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TC.2016.2623609, IEEE
Transactions on Computers
IEEE TRANSACTIONS ON COMPUTERS, VOL. 14, NO. 8, AUGUST 2016 1

Elliptic Curve Cryptography with Efficiently Computable Endomorphisms
and Its Hardware Implementations for the Internet of Things
Zhe Liu, Johann Großschädl, Zhi Hu, Kimmo Järvinen, Husen Wang and Ingrid Verbauwhede

Abstract—Verification of an ECDSA signature requires a double scalar multiplication on an elliptic curve. In this work, we study the
computation of this operation on a twisted Edwards curve with an efficiently computable endomorphism, which allows reducing the
number of point doublings by approximately 50% compared to a conventional implementation. In particular, we focus on a curve
defined over the 207-bit prime field $\mathbb{F}_p$ with $p = 2^{207} - 5131$. We develop several optimizations to the operation and we describe two
hardware architectures for computing the operation. The first architecture is a small processor implemented in 0.13 µm CMOS ASIC
and is useful in resource-constrained devices for the Internet of Things (IoT) applications. The second architecture is designed for fast
signature verifications by using FPGA acceleration and can be used in the server-side of these applications. Our designs offer various
trade-offs and optimizations between performance and resource requirements and they are valuable for IoT applications.

Index Terms—VLSI designs, Internet-of-Things, signature verification, elliptic curve cryptography, multiple-precision arithmetic.

• Z. Liu is with College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China; Institute for Quantum Computing and Department of Combinatorics and Optimization, University of Waterloo, Canada.
  Email: z446liu@uwaterloo.ca
• J. Großschädl and H. Wang are with University of Luxembourg, Luxembourg.
  Email: {johann.groszschaedl, husen.wang}@uni.lu
• Z. Hu is with School of Mathematics and Statistics, Central South University, Changsha, 410083 Hunan, P.R. China.
  Email: huzhi math@csu.edu.cn
• K. Järvinen is with Department of Computer Science, Aalto University, Finland. Part of this work was done when he was a FWO Pegasus Marie Curie Fellow in ESAT/COSIC and iMinds, KU Leuven, Belgium.
  Email: kimmo.jarvinen@aalto.fi
• I. Verbauwhede is with ESAT/COSIC and iMinds, KU Leuven, Belgium.
  Email: ingrid.verbauwhede@esat.kuleuven.be
Date of the manuscript: October 21, 2016.

1 INTRODUCTION

THE INTERNET OF THINGS (IoT) is a paradigm in which objects, such as Radio Frequency IDentification (RFID) tags, sensors, mobile phones, appliances, etc., are provided with unique identifiers and the ability to communicate with each other over a network to reach common goals without requiring human interaction [1]. IoT has been a promising approach to many diverse applications (i.e. civilian types) and is playing a major role in the upcoming age of intelligent networking. With the increase in popularity of such networks, cryptographic protocols must be widely used to protect their security. Due to the resource, computing, and environmental constraints, it is a challenging task to efficiently implement cryptographic protocols for the IoT. These constraints mean that cryptographic implementations in IoT applications must be fast and compact but still provide security levels comparable to more traditional systems [2]. This has attracted the attention of many researchers, and the topic is an active area of research.

Digital signatures are an indispensable component of modern security protocols such as Transport Layer Security (TLS) [3], which is used to authenticate servers and clients. The Datagram TLS (DTLS) [4], a variant of TLS optimized for connectionless datagram transport (i.e. UDP), is widely considered as the future standard protocol for securing the IoT [5]. The signature algorithms supported by the most recent version (i.e. version 1.2) of TLS are RSA [6], DSA [7], as well as ECDSA [8] through a separate RFC [9]. The elliptic curve cryptography used by ECDSA (Elliptic Curve Digital Signature Algorithm) is usually considered to be more applicable for low-end devices than RSA, since it requires relatively small key sizes and operand lengths [10]. In the state-of-the-art implementation, a 255-bit ECDSA signature (matching the security of 128-bit AES) has a size of merely 64 bytes when it is compressed [11], i.e., less than one sixth of the RSA signature size at the same security level.

However, an inherent problem with ECDSA signatures is that, despite their small size, the verification process is relatively computation intensive. This problem is emphasized in heavily-loaded servers which may require thousands of verifications per second and, thus, benefit from hardware accelerators. The verification of an ECDSA signature requires a double scalar multiplication, an operation of the form $k \cdot G + l \cdot Q$, where $G$ is a point on an elliptic curve $E$ that generates a large group of prime order $r$, $Q$ is an (arbitrary) element of this group, and $k$ and $l$ are two integers in the range of $[1, r - 1]$ [8]. Normally, $k \cdot G + l \cdot Q$ is computed in a simultaneous fashion (i.e. with joint doublings) so that at most $m$ doublings need to be executed in total, where $m$ is the bitlength of $r$ [12].

Most previous attempts to reduce the execution time of this operation fall into one of two categories, namely, (a) approaches that aim at minimizing the cost of a single point addition or doubling and (b) techniques to reduce the number of these operations. An example of (a) is EdDSA [11], which is a signature scheme based on a twisted Edwards curve [13] that allows more efficient implementations of

0018-9340 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

point arithmetic than a basic Weierstrass curve [14]. Window methods to reduce the number of point additions in a double scalar multiplication (for example, as described in [12, p. 109]) fall into the category (b). Another option to cut down the number of point operations is to exploit an efficiently computable endomorphism on a special curve, such as a Gallant-Lambert-Vanstone (GLV) curve, as explained in [15] for variable-base scalar multiplication. Also a combination of the above approaches, namely using the twisted Edwards addition law on so-called Galbraith-Lin-Scott (GLS) [16] and GLV-GLS [17] curves (both of which are defined over $\mathbb{F}_{p^2}$ and possess endomorphisms), has been investigated in [18] and [19].

In this paper we introduce families of twisted Edwards curves with an efficiently computable endomorphism $\phi$ and demonstrate how such an endomorphism can be used to speed up the ECDSA verification process. We focus particularly on hardware implementation of double scalar multiplication for IoT applications. We study the implementation properties of the twisted Edwards curve over a prime field defined as follows:

$$E_T/\mathbb{F}_p : -x^2 + y^2 = 1 + x^2 y^2 \tag{1}$$

(i.e. $a = -1$ and $d = 1$), which is birationally equivalent over $\mathbb{F}_p$ to a GLV curve [15] of the form $E_W : y^2 = x^3 + ax$. Gallant et al. [15] first described how an efficiently computable endomorphism $\phi$ can be used to speed up a variable-base scalar multiplication on such curves. In order to accelerate a scalar multiplication $k \cdot P$, the scalar $k$ is split into two parts $k_1$ and $k_2$ of about half the length of the original $k$ (as explained in e.g. [15]). Then the scalar multiplication is computed as $k \cdot P = k_1 \cdot P + k_2 \cdot \phi(P)$ in a simultaneous fashion, which saves roughly 50% of the point doublings compared to a straightforward computation of $k \cdot P$. While most of the previous work on exploiting endomorphisms has focused primarily on variable-base scalar multiplication (such as needed in ECDH key exchange), we direct our attention to the double scalar multiplication carried out in the verification of an ECDSA signature. When taking advantage of the endomorphism $\phi$, an $m$-bit double scalar multiplication $k \cdot G + l \cdot Q$ can be performed via four simultaneous half-length (i.e. roughly $m/2$-bit) scalar multiplications of the form $k_1 \cdot G + k_2 \cdot \phi(G) + l_1 \cdot Q + l_2 \cdot \phi(Q)$ as shown by Galbraith et al. in [16].

The real-world benefit of our setting is that it supports a multitude of implementation options and trade-offs between execution time and silicon area (when thinking about hardware implementation) or memory footprint (in the context of software implementation)¹. Our curve allows a designer to fine-tune an implementation according to the requirements at hand. When resources are constrained, one can perform a double scalar multiplication in the straightforward fashion by computing two simultaneous $m$-bit scalar multiplications, which is very economic in terms of memory. But if more resources are available, our curve allows the designer to trade memory or area (depending on whether the implementation is in software or hardware) for performance by exploiting the efficiently computable endomorphism. Moreover, it is possible to achieve further speed-ups by combining the endomorphism with a window method for simultaneous scalar multiplication when given plenty of memory resources.

¹ Such options and trade-offs are particularly important for cryptographic schemes for the IoT since IoT devices come in all shapes and sizes, and have, therefore, varying resource constraints. At one end of the spectrum are devices with extreme restrictions (e.g. RFID tags, sensor nodes) where every single gate and byte counts. At the other end of the spectrum are devices with plenty of resources that are equipped with powerful 32-bit or 64-bit processors or FPGAs.

The main contributions of our work include:

• We extensively study the computation of double scalar multiplication on twisted Edwards curves with an efficiently computable endomorphism that allows reducing the number of point doublings by approximately 50% compared to a conventional implementation. In particular, we focus on a curve defined over the 207-bit prime field $\mathbb{F}_p$ with $p = 2^{207} - 5131$, which offers a roughly 100-bit security level.
• We develop several optimizations to the operation and describe two hardware architectures for computing the operation. The first architecture is a small processor implemented in 0.13 µm CMOS ASIC, which has an overall silicon area of only 5821 gate equivalents and is useful in resource-constrained devices for IoT applications. The second architecture is designed for fast signature verifications by using FPGA acceleration and can be used in the server-side of these applications. These architectures demonstrate that our methods offer various trade-offs and optimizations between performance and resource requirements and that they are valuable for IoT applications.

The paper is organized as follows. In Sect. 2, we recap the background and present how to compute an endomorphism on a twisted Edwards curve. In Sect. 3, we describe how to generate such curves and give an example curve that is used in our implementations. Sect. 4 reviews several approaches for computing the double scalar multiplication and presents how to speed up the operation by using an endomorphism. We describe the architectures and give the implementation results for the small processor for signature generation and verification and for the high-speed core for verifications in Sects. 5 and 6, respectively. Finally, we draw the conclusions in Sect. 7.

2 TWISTED EDWARDS CURVES WITH ENDOMORPHISMS

2.1 Twisted Edwards Curves

Twisted Edwards curves were introduced to cryptography by Bernstein et al. [13] in 2008 and they are currently considered to be one of the most efficient models for ECC implementation. Let $\mathbb{F}_p$ be a prime field with $p > 3$. A twisted Edwards curve over $\mathbb{F}_p$ can be defined as

$$E_{a,d} : ax^2 + y^2 = 1 + dx^2 y^2 \tag{2}$$

where $a$ and $d$ satisfy $ad(a - d) \neq 0$. As specified in [13], the $j$-invariant of $E_{a,d}$ is

$$j(E_{a,d}) = \frac{16(a^2 + 14ad + d^2)^3}{ad(a - d)^4}.$$
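As a quick sanity check of the $j$-invariant formula, the following Python sketch evaluates it modulo $p$ for the curve (1), i.e. $a = -1$ and $d = 1$. The helper name `j_invariant` is ours and not part of the paper, and Python 3.8 or later is assumed for modular inverses via `pow`.

```python
# Sketch: evaluate j(E_{a,d}) = 16(a^2 + 14ad + d^2)^3 / (a d (a - d)^4) in F_p.
# `j_invariant` is a hypothetical helper name; pow(x, -1, p) needs Python 3.8+.

P = 2**207 - 5131  # field modulus used throughout the paper


def j_invariant(a: int, d: int, p: int) -> int:
    """Return the j-invariant of the twisted Edwards curve E_{a,d} over F_p."""
    if (a * d * (a - d)) % p == 0:
        raise ValueError("parameters must satisfy a*d*(a - d) != 0 in F_p")
    num = 16 * pow(a * a + 14 * a * d + d * d, 3, p)
    den = (a * d * pow(a - d, 4, p)) % p
    return num * pow(den, -1, p) % p


if __name__ == "__main__":
    # For a = -1, d = 1 the numerator is 16 * (-12)^3 and the denominator -16,
    # so the quotient is 1728.
    print(j_invariant(-1, 1, P))
```

For $a = -1$, $d = 1$ this yields 1728, which is the $j$-invariant of the GLV curves of the form $y^2 = x^3 + ax$ mentioned in the introduction.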


There is a remarkable addition law on twisted Edwards curves which is complete when $a$ is a square and $d$ a non-square in $\mathbb{F}_p$ [13]. Here completeness means that the addition produces a correct result for any two points on $E_{a,d}$ without exception (even if one of the points is the neutral element $O = (0, 1)$).

In the rest of this work, we adopt the twisted Edwards model for our desired curve, which provides very efficient elliptic curve group arithmetic and high security. On the one hand, the group formulae of twisted Edwards curves are usually more efficient than those of other curve models, i.e., they require fewer finite field arithmetic operations [13], [14]. On the other hand, the complete group law of twisted Edwards curves admits a more secure execution pattern, and thus an implementation of scalar multiplication on such a curve can resist certain side-channel attacks [14].

2.2 GLV Method

In 2001, Gallant, Lambert, and Vanstone [15] described a new method, now known as the GLV method, for speeding up scalar multiplication on certain classes of elliptic curves with efficiently computable endomorphisms. Let $E$ be an elliptic curve over a finite field $\mathbb{F}_p$ and let $G \in E(\mathbb{F}_p)$ have prime order $r$. Assume that there exists an efficiently computable endomorphism $\phi$ on $E$ such that $\phi(G) = \lambda \cdot G \in \langle G \rangle$. The GLV method replaces the computation $k \cdot G$ by a multiscalar multiplication of the form $k_1 \cdot G + k_2 \cdot \phi(G)$, where the sub-scalars $k_1$ and $k_2$ have approximately half the length of the original scalar $k$. These two scalar multiplications can be computed simultaneously by using the so-called Shamir's trick [12, p. 109], which iterates over the scalars so that corresponding bits from the two scalars are processed simultaneously. This halves the number of doublings and, hence, the GLV method potentially gives significant speedups in scalar multiplications on these elliptic curves.

Gallant et al. described in [15] several families of curves featuring an efficiently computable endomorphism derived from special complex multiplication (CM). Let $\phi$ be a complex number and $K$ be the extension field $\mathbb{Q}(\phi)$. If such an elliptic curve admits complex multiplication by $\phi$, then by [20, Thm 10.14] we obtain an endomorphism

$$\phi(x, y) = \left(\phi^{-2}\,\frac{f(x)}{g(x)},\; y\,\phi^{-3}\left(\frac{f(x)}{g(x)}\right)'\right) \quad \text{and} \quad \phi(O) = O,$$

where $f, g$ are polynomial functions over $\mathbb{Q}$ with $\deg f = N_{K/\mathbb{Q}}(\phi)$ and $\deg g = N_{K/\mathbb{Q}}(\phi) - 1$ (here $N_{K/\mathbb{Q}}(\cdot)$ is the norm function from $K$ to $\mathbb{Q}$).

2.3 Efficient Endomorphism on Twisted Edwards Curve

The twisted Edwards curve $E_{a,d} : ax^2 + y^2 = 1 + dx^2 y^2$ is birationally equivalent to a short Weierstrass curve $E_s : y^2 = x^3 + a_s x + b_s$, where the birational equivalence map can be given as

$$\psi : E_{a,d} \to E_s, \quad (x_t, y_t) \mapsto (x_s, y_s) = \left(\frac{c_1(1 + y_t)}{1 - y_t} + c_2,\; \frac{c_1(1 + y_t)}{x_t(1 - y_t)}\right),$$
$$\psi^{-1} : E_s \to E_{a,d}, \quad (x_s, y_s) \mapsto (x_t, y_t) = \left(\frac{x_s - c_2}{y_s},\; \frac{x_s - c_3}{x_s + c_4}\right), \tag{3}$$

where $c_1 = (a - d)/4$, $c_2 = (a + d)/6$, $c_3 = (5a - d)/12$, and $c_4 = (a - 5d)/12$.

The original GLV method works on some elliptic curves in Weierstrass model with special complex multiplication (such as CM discriminant $D = -3, -4, -7, -8$, etc.). If there exists an efficient endomorphism $\phi$ on the elliptic curve $E_s$, then we can obtain an efficient endomorphism $\phi_t$ on $E_{a,d}$ as $\psi^{-1}\phi\psi$. Thus the GLV method is also applicable on twisted Edwards curves with some efficient endomorphism.

Usually the computation of endomorphisms on the short Weierstrass model is considerably simpler than on the twisted Edwards model. Here we take the most common cases of "GLV friendly" curves with $j$-invariant 0 and 1728 as examples.

2.3.1 j-invariant 0

This class of elliptic curves has CM discriminant $D = -3$, and can be given by a Weierstrass equation of the form

$$E_{\tilde{b}} : y^2 = x^3 + \tilde{b} \tag{4}$$

over a prime field $\mathbb{F}_p$ with $p \equiv 1 \bmod 3$, which means $\mathbb{F}_p$ contains an element $\beta$ of order 3. In this case, the map $\phi : E_{\tilde{b}} \to E_{\tilde{b}}$ given by $(x, y) \mapsto (\beta x, y)$ and $O \mapsto O$ is an endomorphism defined over $\mathbb{F}_p$. If $G \in E_{\tilde{b}}(\mathbb{F}_p)$ is a point of prime order $r$, then $\phi(G) = \lambda \cdot G = (\beta x, y)$, where $\lambda$ is an integer satisfying $\lambda^2 + \lambda + 1 \equiv 0 \bmod r$. There are only six possible group orders for such curves when $p$ is fixed.

Alternatively, we can find a twisted Edwards curve birationally equivalent to the GLV curve $E_{\tilde{b}}$ with help of the equation for the $j$-invariant: $j(E_{a,d}) = 0$ requires $a^2 + 14ad + d^2 = 0$, and when we fix $a$ to $-1$ then $d = -7 \pm 4\sqrt{3}$. Thus we can obtain an endomorphism on its birationally equivalent twisted Edwards curve $E_{a,d}$ as

$$\phi_t(x, y) = \left(\frac{x(c_5 y + c_6)}{y + 1},\; \frac{c_7 y + c_8}{y + c_9}\right), \tag{5}$$

where $c_5 = \frac{5d\beta - 2d + \beta + 2}{3(d+1)}$, $c_6 = \frac{d\beta + 2d + 5\beta - 2}{3(d+1)}$, $c_7 = \frac{5d\beta + d + \beta + 5}{(5d+1)(\beta - 1)}$, $c_8 = \frac{5 + d}{5d + 1}$ and $c_9 = \frac{d\beta + 5d + 5\beta + 1}{(5d+1)(\beta - 1)}$.

2.3.2 j-invariant 1728

Elliptic curves with $j$-invariant 1728 have CM discriminant $D = -4$, and can be defined by a Weierstrass equation of the form

$$E_{\tilde{a}} : y^2 = x^3 + \tilde{a}x \tag{6}$$

over a prime field $\mathbb{F}_p$ with $p \equiv 1 \bmod 4$, i.e. it is guaranteed that $\mathbb{F}_p$ contains an element $\alpha$ of order 4. In this case, the map $\phi : E_{\tilde{a}} \to E_{\tilde{a}}$ given by $(x, y) \mapsto (-x, \alpha y)$ and $O \mapsto O$ is an endomorphism defined over $\mathbb{F}_p$. When $G \in E_{\tilde{a}}(\mathbb{F}_p)$ is a point of prime order $r$, then $\phi(G) = \lambda \cdot G = (-x, \alpha y)$, where $\lambda$ is an integer satisfying $\lambda^2 + 1 \equiv 0 \bmod r$. There are only four possible group orders for such curves when $p$ is fixed.

As before, by setting $a = -1$, $j(E_{a,d}) = 1728$ requires $d = 1$ or $d = 17 \pm 12\sqrt{2}$. Then by the above method, we obtain an endomorphism $\phi_t$ on the corresponding twisted Edwards model $E_{a,d} : -x^2 + y^2 = 1 + dx^2 y^2$ as

$$\phi_t(x, y) = \left(-\frac{x((7d - 1)y + (7 - d))}{3\alpha(d + 1)(y + 1)},\; \frac{(2d - 2)y + 5 + d}{(5d + 1)y + (2 - 2d)}\right). \tag{7}$$
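The conjugation $\psi^{-1}\phi\psi$ can be checked numerically. The sketch below does so for $a = -1$, $d = 1$ by composing the maps of Eq. (3) with the Weierstrass endomorphism $(x, y) \mapsto (-x, \alpha y)$ and comparing the result with $(\alpha x, 1/y)$. The toy field $p = 17$ with $\alpha = 4$ is our assumption for illustration only (it is not the paper's field), and all function names are hypothetical.

```python
# Sketch: conjugate the Weierstrass endomorphism (x, y) -> (-x, alpha*y) by the
# birational map psi of Eq. (3) and compare with (alpha*x, 1/y) on E_{-1,1}.
# Toy parameters (assumptions, not the paper's curve): p = 17, alpha = 4.

P, A, D, ALPHA = 17, -1, 1, 4      # 4^2 = 16 = -1 (mod 17)


def inv(v: int) -> int:
    """Modular inverse in F_P (Python 3.8+); raises ValueError for v = 0."""
    return pow(v % P, -1, P)


C1 = (A - D) * inv(4) % P
C2 = (A + D) * inv(6) % P
C3 = (5 * A - D) * inv(12) % P
C4 = (A - 5 * D) * inv(12) % P


def on_curve(x: int, y: int) -> bool:
    return (A * x * x + y * y - 1 - D * x * x * y * y) % P == 0


def psi(x: int, y: int):
    """Map E_{a,d} -> E_s as in Eq. (3)."""
    u = C1 * (1 + y) * inv(1 - y) % P
    return (u + C2) % P, C1 * (1 + y) * inv(x * (1 - y)) % P


def psi_inv(xs: int, ys: int):
    """Map E_s -> E_{a,d} as in Eq. (3)."""
    return (xs - C2) * inv(ys) % P, (xs - C3) * inv(xs + C4) % P


def phi_weierstrass(xs: int, ys: int):
    return -xs % P, ALPHA * ys % P


def phi_t(x: int, y: int):
    """The conjugated endomorphism psi^{-1} o phi o psi."""
    return psi_inv(*phi_weierstrass(*psi(x, y)))


# Affine points on which psi, psi_inv and 1/y are all defined.
POINTS = [(x, y) for x in range(1, P) for y in range(2, P - 1) if on_curve(x, y)]

if __name__ == "__main__":
    for x, y in POINTS:
        assert phi_t(x, y) == (ALPHA * x % P, inv(y))
    print(len(POINTS), "points checked")
```

Points with $x = 0$ or $y \in \{0, \pm 1\}$ are skipped because the affine maps in Eq. (3) are not defined there; a full implementation would treat them separately.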


If $d = 1$, then $\phi_t$ has the simpler formula $\phi_t(x, y) = (\alpha x, 1/y)$.

Note that explicit formulae for endomorphisms on twisted Edwards curves have also been exploited in [16] and [17].

3 CURVE GENERATION

3.1 CM Method

Let $E/\mathbb{F}_p$ be our desired elliptic curve with CM discriminant $D$. The group order of $E/\mathbb{F}_p$ is $\#E(\mathbb{F}_p) = p + 1 - t$, where $t$ is the Frobenius trace. It is well known that $p$ and $t$ also satisfy the CM equation $4p = t^2 - Ds^2$, where $s \in \mathbb{Z}$. Note that the $j$-invariant of such a curve is also determined, and there are only 2, 4, or 6 possible group orders for a desired curve. Thus the goal of the curve generation is not to find curve parameters (since we have them already), but rather to find a prime field $\mathbb{F}_p$, and then a twisted Edwards curve defined over $\mathbb{F}_p$ (given by $a = -1$ and some fixed $d$), which contains a large cyclic subgroup and meets other security requirements. This contrasts with the "traditional" approach for curve generation where the field $\mathbb{F}_p$ is fixed and one has to find suitable curve parameters.

3.2 Example Curve

We choose an elliptic curve with CM discriminant $D = -4$ as our example. If we fix $a = -1$ for efficiency reasons [13], then by the analysis in Sect. 2.3, the possible values of $d$ are 1 or $17 \pm 12\sqrt{2}$. We choose $d = 1$ since the endomorphism on $E_{-1,1}$ has a very simple formula in this case, as discussed before.

Our example curve is

$$E_{-1,1}/\mathbb{F}_p : -x^2 + y^2 = 1 + x^2 y^2,$$

where the prime $p = 2^{207} - 5131$. Note that $p \equiv 1 \bmod 4$, which implies that $E_{-1,1}$ is ordinary. The group order is $\#E_{-1,1}(\mathbb{F}_p) = 8 \cdot r$, where r = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFE090B67A2AE9D8EC7DD7009F95 is a 204-bit prime. Then under the general ECDLP algorithms (such as Pollard's rho attack with computational complexity $O(\sqrt{r})$), our desired curve is at around the 100-bit security level. Moreover, the embedding degree of $E_{-1,1}/\mathbb{F}_p$ with respect to $r$ is $r - 1$, which means that it is resistant to the FR-MOV attack².

² It should be pointed out that $E_{-1,1}/\mathbb{F}_p$ is not twist secure. However, since our implementations do not execute in the "x-coordinate only" pattern, twist security is not required.

There is an efficient endomorphism $\phi_t$ on $E_{-1,1}$ given by

$$\phi_t(x, y) = (\alpha \cdot x, 1/y), \tag{8}$$

where α = 0x5135DD9F4EBC5D1835EFB3D377F3A4A1FCB1E2DEC2911FF2B59A satisfies $\alpha^2 + 1 \equiv 0 \bmod p$. We can check that $\phi_t(G) = \lambda \cdot G$ for $G \in E_{-1,1}(\mathbb{F}_p)$ with λ = 0xA1D776BEDB1ECFFCE5ABB8F12F8223CC0F494D461EC0F724D06, where $\lambda^2 + 1 \equiv 0 \bmod r$.

4 DOUBLE SCALAR MULTIPLICATION

As mentioned before, double scalar multiplication is the most time-consuming operation of ECDSA signature verification and, therefore, deserves efficient implementation and optimization. Formally, double scalar multiplication is an operation of the form $k \cdot G + l \cdot Q$ and computes the sum of two scalar products, where $G$ is fixed and $Q$ is an arbitrary point. In the following, we review several approaches for performing the double scalar multiplication and describe how to speed up this operation by exploiting an endomorphism. For convenience, we assume that both scalars $k$ and $l$ are exactly $m$ bits long.

4.1 Two Single Scalar Multiplications

The most straightforward method to perform the double scalar multiplication is to compute the two single scalar multiplications separately and then add up the results. The first scalar multiplication $k \cdot G$ takes a fixed and a-priori-known point as an input, and can be efficiently performed through the fixed-base comb method as described in [12, Sect. 3.3.2]. This single scalar multiplication requires roughly $m/w$ point doublings and $\frac{m(2^w - 1)}{w \cdot 2^w}$ point additions when using $2^w - 1$ pre-computed points, where $w$ is the window size. The second scalar multiplication, $l \cdot Q$, is performed with an arbitrary base point $Q$ not known in advance. The simplest option for its computation is the binary method. In that case, the arbitrary-base scalar multiplication requires $m$ point doublings and $m/2$ point additions on average. In total, the double scalar multiplication requires $m + m/w$ point doublings and $m/2 + \frac{m(2^w - 1)}{w \cdot 2^w}$ point additions on average. Window methods (e.g., width-$w$ non-adjacent form (NAF) [12, Sect. 3.3.1]) allow reducing the number of point additions in arbitrary-base scalar multiplications by using precomputations, but they still require $m$ point doublings.

4.2 Interleaving Method

A method to speed up the computation of $k \cdot G + l \cdot Q$ is to perform the two scalar multiplications in a simultaneous (or interleaved) fashion by using Shamir's trick [12, p. 109]. This method first computes the sum of $G$ and $Q$, i.e. $S = G + Q$. Then, the scalars $k$ and $l$ are scanned simultaneously starting from the most significant bit. One adds $G$ if $k_i = 1$ and $l_i = 0$, $Q$ if $k_i = 0$ and $l_i = 1$, and $S$ if $k_i = l_i = 1$. This method reduces the number of point doublings so that a double scalar multiplication requires $m$ point doublings and $3m/4$ point additions on average.

4.3 Joint Sparse Form

Solinas [21] proposed a joint sparse form (JSF) representation for a pair of integers which minimizes the joint Hamming weight by using signed-binary representations for $k$ and $l$. Hence, this representation leads to speedups in double scalar multiplication $k \cdot G + l \cdot Q$ when $S = G + Q$ and $T = G - Q$ are precomputed. The method works analogously to the above interleaving method but also uses point subtractions for negative digits. A double scalar multiplication performed in an interleaved fashion with JSF requires $m$ point doublings and only $m/2$ point additions (resp. subtractions) on average [12, Sect. 3.3.3].

4.4 (Sliding) Window Method

Another approach to reduce the number of point additions in a double scalar multiplication is to use a window method.


Given the fixed window width $w$, a double scalar multiplication first generates a look-up table with points $i \cdot G + j \cdot Q$ for all $i, j \in [0, 2^w - 1]$, and then scans $w$ columns of the scalars $k$ and $l$ at a time. This method requires storage of $2^{2w} - 1$ points. A double scalar multiplication can be performed with $(m/w - 1) \cdot w$ point doublings and $\frac{m(2^{2w} - 1)}{w \cdot 2^{2w}} - 1$ point additions on average. The window method can be further improved by using a "sliding" window, where only $2^{2w} - 2^{2(w-1)}$ points are needed for the look-up table, and the number of point additions is reduced to $\frac{m}{w + (1/3)}$ [12].

4.5 Double Scalar Multiplication with Endomorphism

When applying the above described approaches to the double scalar multiplication, a maximum of $m$ point doublings could be saved compared to two single scalar multiplications. Motivated by the work of Galbraith et al. [16], we present a strategy to further reduce the number of point doublings by some 50% using an efficiently computable endomorphism as follows.

Algorithm 1 Double scalar multiplication using an endomorphism
Input: Two m-bit scalars k and l, the fixed base point G and an arbitrary point Q on the curve E(F_p) with endomorphism φ.
Output: Double scalar multiplication k · G + l · Q.
 1: Use [12, Algorithm 3.74] to find (k1; k2) of k and (l1; l2) of l;
 2: Compute φ(G), φ(Q) using G and Q;
 3: G = (k1 > 0) ? G : −G; φ(G) = (k2 > 0) ? φ(G) : −φ(G); Q = (l1 > 0) ? Q : −Q; φ(Q) = (l2 > 0) ? φ(Q) : −φ(Q);
 4: Generate look-up table T with 15 points such that T[i − 1] = [(i ≫ 3) & 1] · φ(Q) + [(i ≫ 2) & 1] · Q + [(i ≫ 1) & 1] · φ(G) + (i & 1) · G for 1 ≤ i ≤ 15;
 5: Let k1 = |k1|, k2 = |k2|, l1 = |l1|, l2 = |l2| and let h be the bit length of max{k1, k2, l1, l2};
 6: R = O;
 7: for i from h − 1 down to 0 do
 8:   R ← 2R;
 9:   s ← 8 · (l2)_i + 4 · (l1)_i + 2 · (k2)_i + (k1)_i;
10:   if s > 0 then
11:     R ← R + T[s − 1];
12:   end if
13: end for
14: return R.

The main idea is to compute a double scalar multiplication, i.e., $k \cdot G + l \cdot Q$, through four simultaneous scalar multiplications $k_1 \cdot G$, $k_2 \cdot \phi(G)$, $l_1 \cdot Q$ and $l_2 \cdot \phi(Q)$, where $k_1$, $k_2$, $l_1$ and $l_2$ are roughly $m/2$ bits long. Algorithm 1 shows the computation of double scalar multiplication exploiting an efficiently computable endomorphism. We first split the scalar $k$ into two parts $k_1$ and $k_2$ using [12, Algorithm 3.74], where $k_1$ and $k_2$ have roughly half of the bitlength of $k$; the second scalar $l$ can be decomposed into $l_1$ and $l_2$ in the same way. Then, we calculate the points $\phi(G)$ and $\phi(Q)$ from $G$ and $Q$ by using (8) from Sect. 3. These can be computed with only one inversion and a few multiplications by utilizing the so-called Montgomery's trick [12, p. 44] that relies on the fact that $1/x = (1/(xy)) \cdot y$ and $1/y = (1/(xy)) \cdot x$. After that, we generate the look-up table with 15 points (line 4). Finally, the four scalar multiplications needed for computing $k_1 \cdot G + k_2 \cdot \phi(G) + l_1 \cdot Q + l_2 \cdot \phi(Q)$ are performed simultaneously, i.e. in an interleaved fashion. A double scalar multiplication using Algorithm 1 requires approximately $m/2$ point doublings and $15m/32 + 11$ point additions including the overhead for the generation of the look-up table.

4.6 Comparison and Trade-offs between Performance and Memory

Table 1 reports the execution times and RAM requirements of double scalar multiplications for several different approaches outlined above, as well as a combination of endomorphism and window methods. Compared to the implementation using (a), a double scalar multiplication using a combination of (a) and (b) requires the same number of point doublings while it saves approximately 1/4 of the point additions. The number of point additions can be further reduced by using a combination of (a) and (c) with a look-up table of $2^{2w} - 1$ points. Taking the window width $w = 2$ as an example, one can save roughly 1/16 of the point additions compared to the implementation with a combination of (a) and (b). In relation to a combination of (a) and (c), the number of point doublings can be further reduced by some 50% using the technique of (d) with the same RAM occupation. A small number of point additions may potentially be saved by using a combination of (c) and (d). However, the look-up table will grow exponentially and a combination of (c) and (d) is only able to save point additions when n is big enough. For example, given $w = 2$, a double scalar multiplication using a combination of (c) and (d) requires a look-up table of 255 points and even requires more point doublings and point additions. Taking both performance and RAM requirements into account, the technique (d) (i.e. double scalar multiplication with endomorphisms from Sect. 4.5) is the best choice to speed up the double scalar multiplication on resource-constrained platforms.

In the following, we demonstrate the flexibility of our scheme based on double scalar multiplications with endomorphisms by designing two hardware implementations aimed at different target applications within the IoT framework. We present two different architectures: a small architecture for signature generation and verification for ASICs in Sect. 5 and a high-speed verification core for FPGAs in Sect. 6. The former targets resource-constrained devices such as RFID tags, sensor nodes, etc. The latter is designed for the server-side where speed of signature verifications is important and FPGAs can be used for fast parallel computations.

5 SMALL PROCESSOR ARCHITECTURE FOR SIGNATURE GENERATION AND VERIFICATION

In this section, we describe a small processor architecture for signature generation and verification. It is targeted mainly at resource-constrained devices of the IoT, where small resource requirements (area, memory, power, and energy) are the first priority. We begin with the architecture for $\mathbb{F}_p$ arithmetic and other architectural design decisions in Sect. 5.1 and after that present the results on 130 nm CMOS in Sect. 5.2.
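Before turning to the hardware, the control flow of Algorithm 1 (Sect. 4.5) can be illustrated in software. The following Python sketch is not the paper's implementation: it assumes a toy model in which the cyclic order-$r$ subgroup is replaced by the integers modulo a small prime $r$, so that a "point" is an integer, point addition is addition modulo $r$, and $\phi(P) = \lambda \cdot P$. The parameters $r = 1009$, the hard-coded lattice basis, and all function names are assumptions for illustration; the scalar decomposition uses simple rounding against that basis rather than deriving it with the extended Euclidean algorithm as [12, Algorithm 3.74] does.

```python
# Sketch of Algorithm 1 in a toy model. Assumption: the order-r subgroup is
# cyclic, so it is modelled by the integers mod r ("point" = integer, point
# addition = addition mod r, phi(P) = lambda * P). Parameters are illustrative.

R = 1009                          # toy prime group order, 1009 = 28^2 + 15^2
LAM = 28 * pow(15, -1, R) % R     # lambda with lambda^2 = -1 (mod r)
V1, V2 = (28, -15), (15, 28)      # short basis of {(a, b): a + b*lambda = 0 mod r}


def add(p, q):
    return (p + q) % R


def neg(p):
    return -p % R


def phi(p):
    return LAM * p % R


def decompose(k):
    """Split k into short (k1, k2) with k = k1 + k2*lambda (mod r)."""
    c1 = (2 * V2[1] * k + R) // (2 * R)      # round(b2 * k / r)
    c2 = (-2 * V1[1] * k + R) // (2 * R)     # round(-b1 * k / r)
    return k - c1 * V1[0] - c2 * V2[0], -c1 * V1[1] - c2 * V2[1]


def double_scalar_mul(k, l, G, Q):
    """Compute k*G + l*Q via four interleaved half-length multiplications."""
    scalars = [*decompose(k), *decompose(l)]             # k1, k2, l1, l2
    bases = [G, phi(G), Q, phi(Q)]
    # Step 3: fold the sign of each sub-scalar into its base point.
    bases = [b if s >= 0 else neg(b) for b, s in zip(bases, scalars)]
    scalars = [abs(s) for s in scalars]
    # Step 4: table[s] is the sum of the bases selected by the bits of s.
    table = [0] * 16
    for s in range(1, 16):
        low = s & -s                                     # lowest set bit of s
        table[s] = add(table[s ^ low], bases[low.bit_length() - 1])
    # Steps 6-13: joint left-to-right scan, one doubling per bit position.
    acc = 0                                              # neutral element O
    for i in reversed(range(max(s.bit_length() for s in scalars))):
        acc = add(acc, acc)
        s = sum(((x >> i) & 1) << j for j, x in enumerate(scalars))
        if s:
            acc = add(acc, table[s])
    return acc


if __name__ == "__main__":
    k, l, G, Q = 777, 345, 5, 123
    assert double_scalar_mul(k, l, G, Q) == (k * G + l * Q) % R
```

In this model the main loop runs over at most 5 bit positions instead of the 10 bits of $r$, which mirrors the halving of point doublings; on a real curve `add`, `neg` and `phi` would be replaced by twisted Edwards point arithmetic and the map (8).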


TABLE 1
Comparison of execution time (including the generation of look-up table) and RAM requirements of double scalar multiplication using different approaches.

Method      Storage         Point Doublings                              Point Additions
(a)         3               m                                            1 + 3m/4
(a) + (b)   4               m                                            2 + m/2
(a) + (c)   $2^{2w} - 1$    $(2^{2(w-1)} - 2^{w-1}) + m - w$             $(3 \cdot 2^{2(w-1)} - 2^{w-1} - 1) + \frac{m \cdot (2^{2w} - 1)}{w \cdot 2^{2w}}$
(d)         $2^4 - 1$       m/2 − 1                                      11 + 15m/32
(c) + (d)   $2^{4w} - 1$    $(2^{4(w-1)} - 2^{w-1}) + m/2 - w$           $(15 \cdot 2^{4(w-1)} + 2^{w-1} - 5) + \frac{m \cdot (2^{4w} - 1)}{2w \cdot 2^{4w}}$

(a): Interleaved (Sect. 4.2); (b): JSF (Sect. 4.3); (c): Window (Sect. 4.4); (d): Endomorphism (Sect. 4.5)
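To get a feeling for the entries of Table 1, the expressions can be evaluated for concrete parameters. The sketch below is only an illustration: the function name `table1_counts` and the sample values m = 208 and w = 2 are our assumptions, and the formulas are transcribed from the table as read here.

```python
from fractions import Fraction


def table1_counts(m, w):
    """Return {method: (storage, doublings, additions)} following Table 1."""
    m = Fraction(m)  # exact arithmetic, since several entries are fractional
    return {
        "(a)": (3, m, 1 + 3 * m / 4),
        "(a)+(b)": (4, m, 2 + m / 2),
        "(a)+(c)": (
            2**(2 * w) - 1,
            (2**(2 * (w - 1)) - 2**(w - 1)) + m - w,
            (3 * 2**(2 * (w - 1)) - 2**(w - 1) - 1)
            + m * (2**(2 * w) - 1) / (w * 2**(2 * w)),
        ),
        "(d)": (2**4 - 1, m / 2 - 1, 11 + 15 * m / 32),
        "(c)+(d)": (
            2**(4 * w) - 1,
            (2**(4 * (w - 1)) - 2**(w - 1)) + m / 2 - w,
            (15 * 2**(4 * (w - 1)) + 2**(w - 1) - 5)
            + m * (2**(4 * w) - 1) / (2 * w * 2**(4 * w)),
        ),
    }


if __name__ == "__main__":
    # m = 208 and w = 2 are sample values chosen for illustration only.
    for method, (storage, dbl, add) in table1_counts(208, 2).items():
        print(f"{method:8s} storage={storage:4d} "
              f"doublings={float(dbl):7.1f} additions={float(add):7.1f}")
```

As a consistency check on this reading of the table, setting w = 1 makes the addition counts of the two windowed rows coincide with those of rows (a) and (d), respectively.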

5.1 Architecture for Fp Arithmetic
The prime $p = 2^{207} - 5131$ is a pseudo-Mersenne prime of the form $p = 2^n - c$, where c fits into one word of the target platform since we select a 16-bit datapath. The basic idea of fast reduction using a pseudo-Mersenne prime is to apply the congruence relation $2^n \equiv c \bmod p$ repeatedly during the reduction process. Suppose $z = z_H 2^n + z_L$ is a 2n-bit integer, such as obtained as the result of a multiplication of two n-bit integers. We can reduce z with respect to p as follows:

$z = z_H 2^n + z_L \bmod p \equiv z_H c + z_L \bmod p$    (9)

Now z is already only slightly longer than n bits since c is small. To complete the reduction z mod p, we reduce z again using (9) and then at most one subtraction of p is needed to get a result that is at most n bits long.

We use the following notation:
• n: the operand size (i.e. n = 207).
• W: the word size of the datapath (i.e. W = 16).
• m: the bitlength of the scalars k, l (≈ the bitlength of the prime group order), while m/2 roughly denotes the bitlength of the sub-scalars ki, li.
• A, B: two operands; A[i : j] represents bits at position i to j of operand A.
• R: product A · B, which is twice as long as operand A or B.

Our implementation adopts the idea of incomplete modular reduction as described, for example, in [22], which means the arithmetic functions described in the following subsections do not necessarily reduce the result to an integer in the range of [0, p − 1], but only ensure that the result is smaller than $2^n$ so that it fits into $\lceil n/W \rceil = \lceil 207/16 \rceil = 13$ words. Also, all arithmetic functions accept incompletely reduced inputs of $\lceil n/W \rceil$ words.

Note that all arithmetic operations (except Montgomery inverse) we discuss in the following can be easily implemented in a regular way without conditional statements so that their execution time is independent of the values of the operands. Such constant execution time helps to thwart certain side-channel attacks. Even though signature verification does not involve any secret values (and can, therefore, not leak any secrets), it still makes sense to implement the underlying field arithmetic in a regular way so that it can also be used for signature generation.

5.1.1 Modular Multiplication and Squaring
The modular multiplication is performed in three basic steps as shown in Algorithm 2. First, a conventional multi-precision multiplication is performed in a word-wise fashion based on the product-scanning technique [12]. Then, we multiply the most significant 209 bits of the product by c and add the result to the least significant 207 bits, which yields a result of (at most) 226 bits. Finally, we multiply the most significant 19 bits by c and add the product to the least significant 207 bits; the result is now at most 208 bits and, therefore, fits into 13 words. In order to achieve constant execution time, we always execute both reduction steps, even when the result is already fully reduced after the first step.

Algorithm 2 Modular multiplication for $p = 2^{207} - c$
Input: Two integers A[207 : 0], B[207 : 0], and modulus p
Output: R = A · B mod p
1: R = A · B
2: R = R[415 : 207] · c + R[206 : 0] {The 1st reduction}
3: R = R[225 : 207] · c + R[206 : 0] {The 2nd reduction}

A modular squaring can be done more efficiently thanks to the symmetry of partial products. Thus, it is possible to save the computation of (nearly) half of the partial products.

5.1.2 Modular Inversion
Modular inversion is the most time-consuming field arithmetic operation. Traditionally, the Extended Euclidean Algorithm (EEA) [12] and the Montgomery modular inversion algorithm [23], [24] are used to compute an inverse. Our inversion is mainly based on the Montgomery modular inverse, but has been optimized for the pseudo-Mersenne prime $p = 2^n - c$.

As shown in Algorithm 3, our inversion consists of two phases: phase I and phase II. In phase I, we perform two additions, and then update the variables {u, v, r, s, k} according to the sign flag of x. The trailing zero detection (DET) and the right-shift operation $x \gg tlz_x$ can be done in parallel with the addition u + v. Furthermore, the left-shift operations $s \ll tlz_x$ and $r \ll tlz_x$ can be done in parallel with the addition y = r + s. In phase II, we perform two ordinary multiplications to get the modular inverse. The input a is set to be odd, but even if initially a is even, it can be easily changed to be odd via a modular subtraction a − p. The core idea behind our optimized inversion is to remove all trailing zeros of (u + v) in every iteration, which keeps u and v always odd so that (u + v) converges to zero quickly.

Compared to the Multibit Shifting method proposed by Savaş et al. in [25], we remove all those iterations for shift operation (i.e. the iterations when u or v is even in [23,


Algorithm MONTINVER]) and adopt the idea from [26] to avoid a complex comparison step by using the sign flag of x. More specifically, the number of total iterations in Phase I of [23] is in the range of [n, 2n], with 50 % for shift operations. The number of iterations of our algorithm is in the range of [0.5n, n] since such shift-operation iterations are not required. Furthermore, the optimized inversion can be made even faster by keeping track of the lengths of variables {u, v, r, s}. This saves cycles for additions because the word lengths decrease linearly with the number of iterations.

Algorithm 3 Optimized Montgomery Modular Inversion for $2^n - c$
Input: $a \in [1, 2^n)$ and is odd, p > 2 is an n-bit prime, precomputed $T = 2^{-2n} \bmod p$;
Output: $R \in [1, 2^n)$, where $R = a^{-1} \bmod p$.
1: //Phase I
2: u = −p, v = a, r = 0, s = 1, k = 0;
3: while (1) do
4:   x = u + v; {Both u and v are always odd numbers}
5:   y = r + s;
6:   $tlz_x$ = DET(x); {Trailing zero detection}
7:   if x == 0 then
8:     break;
9:   else if x < 0 then
10:    $u = x \gg tlz_x$; {Right-shift operation can be done in parallel with u + v}
11:    r = y;
12:    $s = s \ll tlz_x$; {Left-shift operation can be done in parallel with r + s}
13:  else
14:    $v = x \gg tlz_x$;
15:    s = y;
16:    $r = r \ll tlz_x$;
17:  end if
18:  $k = k + tlz_x$;
19: end while
20: //Phase II
21: $s = s \cdot 2^{(2n-k)} \bmod p$;
22: $s = s \cdot T \bmod p$;
23: return s.

5.1.3 Modular Addition and Subtraction
An addition modulo $p = 2^{207} - c$ can be performed in three steps. First, a conventional multi-precision addition R = A + B is performed in a word-wise fashion. Then, for reduction, we reduce the 209-bit result to 208 bits by using Equation (10). To ensure constant execution time, we perform the addition step and the reduction step for all possible inputs, even if no reduction is required.

$R \equiv R[209:207] \cdot 2^{207} + R[206:0] \bmod p \equiv R[209:207] \cdot c + R[206:0]$    (10)

For modular subtraction, a conventional multi-precision subtraction R = A − B is performed through word-wise subtract-with-borrow operations. As the 208-bit input B can be bigger than 2p, the result of the subtraction may be smaller than −2p and, thus, up to two addition steps will be needed. As shown in Equation (11), the first addition step will guarantee that R > −p. If R is still negative, another addition step as shown in Equation (12) will make R a positive number in the range $[0, 2^{208})$. To ensure constant execution time, we perform one subtraction and two additions for all possible inputs, but when R is positive after the subtraction, the words of the operand are masked out (i.e. set to 0) so that the value of R does not change.

$R \equiv R + 2p \bmod p$    (11)
$R \equiv R + p \bmod p$    (12)

5.1.4 Hardware Architecture
The hardware architecture, as shown in Fig. 1, consists of a micro-controller, a program ROM, an Fp-coprocessor, which we call Prime-Field Arithmetic Unit (i.e. PFAU), and two dual-port SRAMs. The program ROM stores the command sequences that execute high-level functions such as pre-computations, point addition, point doubling, etc. This section focuses on the ALU.

[Figure: block diagram showing the Controller, the Program ROM (sample program: MULT R3, R1, R2; ADD R4, R6, R2; MSQ R5, R1; SUB R4, R5, R3; INV R4, R5), SRAM1, SRAM2, and the F207 coprocessor with a 16×16 multiplier and an adder/subtractor]
Fig. 1. Hardware architecture

The architecture of the ALU and other important modules is shown in Fig. 2, where one (16 × 16)-bit multiplier, one 3-input adder, a trailing-zero detection module (tlz), a left-shifting module (lshifter), and a right-shifting module (rshifter) are depicted. We decided to implement a 16-bit datapath since previous research has shown that this allows one to achieve a good trade-off between performance and silicon area. The ALU supports the word-level instructions needed for modular multiplication, modular squaring, modular inversion, modular addition and modular subtraction. The critical path goes from the input registers of the multiplier to the output registers of the adder. The input from mult to the adder is 33 bits long due to the fact that we need to double some partial products when performing a modular squaring.

The optimized modular inversion requires the tlz, lshifter, and rshifter modules. Using the implementation technique from [27], the tlz module can output the number of trailing zeros of a word (16 bits) in one clock cycle. To obtain the trailing zeros in a 208-bit operand, we can perform a zero detection word by word. If the number of trailing zeros exceeds one word, the detection process will take more than one cycle, but the probability is only $2^{-15}$ (because x is always even in Algorithm 3). The lshifter and rshifter receive the output of tlz and perform the corresponding number of shifts on the 16-bit input. As


mentioned before, the shift operation in the modular inverse can be done in parallel with the addition.

[Figure: ALU datapath with input registers, a 16×16 multiplier, an adder, and the rshifter, tlz and lshifter modules]
Fig. 2. ALU architecture

5.2 Implementation Results
We implemented the arithmetic processor in Verilog and synthesized it with Design Compiler 2013.12 using the UMC 130 nm 1P8M Low Leakage Standard Cell Library with typical values (i.e. voltage of 1.2 V and temperature of 25 °C). The area (in gate equivalents, GE) after placement and routing is calculated by dividing the overall area by the area of a single two-input NAND gate. The design has been synthesized for a clock frequency of 50 MHz, which is more than sufficient for common IoT devices such as RFID tags or sensor nodes.

5.2.1 Execution Time of Field Arithmetic
As mentioned in the previous section, we implemented the multiplication, squaring, addition and subtraction to have constant execution time. Constant execution time (and, thus, constant pattern of operations) gives protection against simple side-channel attacks that target a single side-channel trace. Protection against more elaborate attacks relying on statistical analysis of one or more traces is left for future work. Table 2 summarizes the execution times of the five basic arithmetic operations modulo the prime $2^{207} - 5131$. The modular addition takes exactly 30 cycles, which is faster than the modular subtraction. Our constant-time modular multiplication executes in exactly 192 cycles, whereas the modular squaring has an execution time of 120 clock cycles, which means the squaring requires merely 60% of the multiplication cycles. Thanks to the optimized Montgomery modular inversion proposed in Algorithm 3, our inversion requires 4452 clock cycles on average, which corresponds to only 23 multiplications. The execution time of modular inversion is evaluated based on the average number of Phase I iterations with two additions per iteration and two modular multiplications in Phase II.

TABLE 2
Execution times of field arithmetic operations (in clock cycles)

Operation   Mul   Sqr   Inv    Add   Sub
Time        198   120   4452   30    43

5.2.2 Double Scalar Multiplication: High-Speed vs. Memory-Efficient
To demonstrate the trade-offs between performance and RAM requirements that our small processor architecture offers, we designed two versions of the implementation: the first one is optimized for performance and the second for low RAM footprint.

Speed-Optimized: The speed-optimized implementation requires a look-up table containing 15 points, of which 11 points (except G, Q, φ(G) and φ(Q)) will be generated by a sequence of point additions. In order to take advantage of the efficient point addition formula on a twisted Edwards curve (i.e. the 7M mixed addition formulae based on [14]), we store these points using extended affine coordinates of the form (U, V, W), where U = (x + y)/2, V = (y − x)/2, W = xy (in our case d = 1). A straightforward method to get the affine form of these points would require 11 inversions. To reduce the number of inversions, we perform the 11 inversions by using Montgomery's trick [12, page 44]: With the help of three temporary variables, the 11 inversions can be computed with only one inversion and 83 multiplications. Given an affine point, the extended affine coordinates (U, V, W) can be obtained by performing one addition, one subtraction and one multiplication. In the main loop, a pre-computed point given in extended affine coordinates is used as an operand in each iteration (i.e. in line 11 of Algorithm 1). As a result, our speed-optimized double scalar multiplication requires an execution time of 365,082 clock cycles with a RAM footprint of 1612 bytes.

Memory-Optimized: A look-up table with 15 extended affine points requires 45 field elements to be stored in RAM. Instead of generating a look-up table with extended affine points, the memory-optimized implementation generates a look-up table with standard affine coordinates (x, y) and reduces the RAM requirements to only 30 field elements. In the process of look-up table generation, we adopt the point addition formula with Z1 = 1 and Z2 = 1 [14, Sect. 3.1] and directly convert the projective representation into standard affine representation for each point. In total, the look-up table generation requires 11 point additions, 11 inversions and 22 multiplications. We still use the efficient point addition formula for twisted Edwards curves (i.e. the 7M mixed addition formulae based on [14]) in the main loop of double scalar multiplication. Thus, we compute the extended affine representation of an affine point on-the-fly, which requires one multiplication, one addition and one subtraction for each iteration. As a consequence, our memory-optimized double scalar multiplication requires an execution time of 415,392 clock cycles with a RAM consumption of only 1222 bytes, which corresponds to a saving of 33 % for the look-up table (780 instead of 1170 bytes) and 24 % in total (i.e. 1222 instead of 1612 bytes), by only sacrificing roughly 12 % in performance.

For comparison, a double scalar multiplication without exploiting the endomorphism (i.e. by using interleaving with JSF) has an execution time of 454,179 cycles. This shows that the endomorphism yields a speed-up of roughly 8.5–19.6% (i.e., memory optimized version and high speed


version).

5.2.3 Comparison with Other Implementations
Table 3 shows our implementation results and a comparison with related work over prime fields. All related implementations (except Lai et al. [29]) use a 16-bit datapath similarly to our work. There are no implementations available in the literature with exactly the same security level as our implementations. Hence, direct comparison is impossible, but we highlight the size of the prime field in Table 3 (the 'order' column) in order to make comparison as easy and fair as possible.

We perform the fixed-base scalar multiplication (needed for signature generation) on the chosen twisted Edwards curve using a constant-time comb method with w = 4 as described in [35], which only needs to store 8 points from precomputation³. Note that signature generation requires a constant-time inversion, which we compute using Fermat's theorem with an addition chain that can be evaluated by computing 206 modular squarings and 14 modular multiplications. Our implementation requires an execution time of 182,653 and 365,082 clock cycles for (constant-time) scalar multiplication and speed-optimized double scalar implementation, respectively, and consumes an area of 5821 GEs. On the other hand, the memory-oriented implementation needs 415,392 clock cycles, while consuming only 1.2 kB of RAM.

3. The efficient endomorphism can also be used to accelerate the computation of k · G via GLV method, but it costs more time under the same RAM occupation or needs much more storage to improve performance. Due to the extreme resource constraints of IoT applications, we did not apply the GLV method to this computation.

Since most of the previous implementations only reported the execution time of signature generation, we estimate the cycle count of verification (i.e. double scalar multiplication) by simply multiplying the generation time by two. As shown in Table 3, our implementation is at least three times faster than all the previous works using the same word size. In terms of area, the implementations from [30] and [29] support both prime and binary field arithmetic and, thus, have a large area. On the other hand, the authors of [32], [33], [34] optimized their implementation with the help of a microcode-programmable structure (for field arithmetic) and, thus, their implementations require extra instruction decoding modules and have higher ROM consumption in order to save area in the control logic. Our implementation does not include the area for SRAM since it varies for different process technologies and depends significantly on whether one has a RAM generator available or not. Besides, in some applications the SRAM can be shared with other modules in the device and, in such cases, it does not incur further costs [36].

6 HIGH-SPEED VERIFICATION CORE
We also provide an architecture tailored for fast signature verifications that require double scalar multiplications k · G + l · Q. The architecture is designed primarily for FPGA devices which have embedded memory blocks and multipliers and it can be used in the server-side for achieving very high throughputs for signature verifications by utilizing multiple parallel cores. Because these computations operate on public data, there is no need for side-channel countermeasures. We begin with a description of the architecture of Fp arithmetic in Sect. 6.1, discuss latencies of operations using the architecture in Sect. 6.2, and end with implementation results on a Xilinx Virtex-7 FPGA and discussion in Sect. 6.3.

[Figure: (a) computation core with RAM 1, RAM 2, ALU 1 and ALU 2; (b) ALU with a pipelined multiplier (high and low outputs), three adders, and registers reg2, reg1 and reg0]
Fig. 3. The architectural diagrams of the verification engine. (a) The high-level diagram of the computation core and (b) the architecture of the ALUs.

6.1 Architecture for Fp Arithmetic
The core for computing double scalar multiplications is depicted in Fig. 3. The high-level diagram given in Fig. 3(a) shows that the core consists of two parallel ALUs and two dual-port RAMs.

The ports of the RAMs are arranged as follows. The A-port is used for both writing the output of the corresponding ALU into the RAM and reading the contents of the RAM. The B-port of the RAM is dedicated only for reading during the operation of the core, but it is used also by the external interface to write data into the RAMs. The architecture allows the ALUs to take inputs from both ports of both RAMs, but an ALU can only write to the corresponding RAM. If the core computes with only two values from


different RAMs (e.g., A + B in ALU 1 and A − B in ALU 2 or A × A in ALU 1 and B × B in ALU 2), then reading the operands and writing the results can be done concurrently. If more operands need to be read (e.g., A × B in ALU 1 and C × D in ALU 2), then additional delays occur because reading and writing must occur in different clock cycles. The external interface allows writing and reading both RAMs.

TABLE 3
Comparison of execution times, areas and RAM consumptions with related works over prime fields. Most of the works use a 16-bit datapath.

Implementation          Order   Sign. (cycles)   Ver. (cycles)   ALU (GE)   SRAM (bytes)
Chen et al. [28]        256     562,000          1,124,000¹      n.a.       n.a.
Lai et al. [29]²        176     93,399           186,798¹        n.a.       n.a.
Lai et al. [29]²        256     252,067          504,134¹        n.a.       n.a.
Satoh et al. [30]       192     1,362,906        2,725,812¹      9,456      n.a.
Satoh et al. [30]       224     2,048,166        4,096,332¹      10,800     n.a.
Furbass et al. [31]³    192     502,000          1,004,000¹      21,769     n.a.
Hutter et al. [32]³     192     859,188          1,718,376¹      2,371⁴     256
Wenger et al. [33]      192     1,377,000        2,645,000       4,354⁴     422
Plos et al. [34]        192     863,109          1,726,218¹      3,608⁴     256
This work⁵ (HS)         204     182,653          365,082         5,821      1,612
This work⁵ (ME)         204     182,653          415,392         5,821      1,222

1: Estimated results from the execution time of signature implementation.
2: Four 32-bit multipliers used.
3: 0.35 µm technology library used.
4: Microcode-based architecture used, more ROM is required.
5: Fixed-base scalar multiplication (for signature generation) and double scalar multiplication (for verification).

The ALU depicted in Fig. 3(b) has a W-bit datapath that supports integer multiplication, addition, and subtraction. In our case W = 52 and each element of Fp splits into four words. However, instead of restricting values to Fp, we allow an extended range $[0, 2^{208} - 1]$ to simplify the arithmetic. The ALU is built around a pipelined (6 stages) W × W-bit multiplier and a pipelined (3 stages) accumulator. The multiplier is constructed by using a Xilinx IP Core so that it uses the hardwired multipliers of DSP48E1 blocks.

Multiplications are computed using the product-scanning (Comba) algorithm [37] that computes all subproducts of a result word successively starting from the least-significant word. The subproducts are accumulated into two 52-bit registers reg0 and reg1 for the lower and higher word of the multiplication result and a 5-bit register reg2 for the overflowing bits. When a word of the result is ready, the accumulator is shifted to the right (reg0 ← reg1, reg1 ← reg2, reg2 ← 0). Additions and subtractions of 52-bit words are computed using the adder on the right. The carry input to the adder can be set to either zero, one, or to the carry from the previous addition. This allows efficient computation of multiprecision additions and subtractions. Two words are subtracted by inverting all bits and setting the carry to one. The ALU also supports dividing words in reg0 by two (right shifts) in a way that allows implementing multiprecision divisions by two. Division by two is performed so that p is first added to the dividend if its lsb is one.

Both point addition and point doubling require 7 multiplications and 6 additions/subtractions in Fp by using the formulae from [14], but they can be computed with a latency of 4 multiplications and 3 additions/subtractions by utilizing the two parallel ALUs. We observe that it is possible to interleave the computation of point addition and point doubling so that the combined latency becomes only 7 multiplications and 6 additions/subtractions. Algorithm 4 gives the operation sequence. The subscripts of the variables denote the RAM in which the variable is located (e.g., X1 is in RAM 1 and Y2 is in RAM 2).

Algorithm 4 Interleaved point addition and point doubling
Input: P1 = (X1, Y2, Z2, E1, H2), P2 = (U1, V2, W2)
Output: (X1, Y2, Z2, E1, H2) = 2(P1 + P2)
A1 ← Y2 + X1;   A2 ← Y2 − X1;
B1 ← E1 × H2;   B2 ← A1 × V2;
C1 ← B1 × W2;   Y2 ← A2 × U1;
E1 ← Y2 + B2;   H2 ← Y2 − B2;
A1 ← Z2 + C1;   C2 ← Z2 − C1;
X1 ← E1 × A1;   Y2 ← C2 × H2;
B1 ← A1 × C2;   B2 ← X1 × X1;
A1 ← Y2 × Y2;   A2 ← B1 × B1;
X1 ← X1 + Y2;   C2 ← B2 − A1;
B1 ← A2 + A2;   H2 ← B2 + A1;
E1 ← X1 × X1;   Y2 ← C2 × H2;
E1 ← H2 + E1;   B2 ← C2 − B1;
X1 ← E1 × B2;   Z2 ← C2 × B2;

6.2 Latencies
The ALU computes field operations with the following latencies: multiplication 61 or 63 clock cycles, addition/subtraction 7–18 clock cycles depending on the required reductions (average 11.5), and addition/addition 7–11 clock cycles (average 9). Division by two requires 12 or 17 clock cycles if the lsb is zero or one, respectively. Hence, when two divisions are performed in parallel, the average latency is 15.75 clock cycles. A Fermat-based inversion in Fp takes on average (206 + 14) · 62 = 13,640 clock cycles. One iteration of Alg. 4 requires 7 · 62 + 5 · 11.5 + 9 = 500.5 clock cycles on average. Computing only the point addition or point doubling part of Alg. 4 requires 4 · 62 + 3 · 11.5 = 282.5 and 4 · 62 + 2 · 11.5 + 9 = 280 clock cycles on average, respectively.
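The average latencies quoted in Sect. 6.2 can be re-derived with a few lines of arithmetic. The Python sketch below is only a check of the figures under stated assumptions: the names (`MUL`, `ADD_SUB`, `ADD_ADD`, `div2_pair_latency`, `PAD`, `PA`, `PD`) are ours, the two multiplication latencies are assumed to be equally likely, and the least significant bits of the two halved operands are assumed to be independent and uniform.

```python
# Average latencies in clock cycles, taken from the text of Sect. 6.2.
MUL = (61 + 63) / 2   # multiplication takes 61 or 63 cycles (assumed equally likely)
ADD_SUB = 11.5        # addition/subtraction pair, 7-18 cycles
ADD_ADD = 9.0         # addition/addition pair, 7-11 cycles


def div2_pair_latency():
    """Average latency of two halvings running in parallel on the two ALUs.

    A halving takes 12 cycles if the lsb is 0 and 17 cycles if it is 1.
    The pair finishes when the slower one does, so it takes 12 cycles only
    when both lsbs are 0 (probability 1/4, assuming independent uniform bits).
    """
    return 0.25 * 12 + 0.75 * 17


FERMAT_INV = (206 + 14) * MUL           # 206 squarings and 14 multiplications
PAD = 7 * MUL + 5 * ADD_SUB + ADD_ADD   # one iteration of Alg. 4
PA = 4 * MUL + 3 * ADD_SUB              # point addition part only
PD = 4 * MUL + 2 * ADD_SUB + ADD_ADD    # point doubling part only

if __name__ == "__main__":
    print(div2_pair_latency(), FERMAT_INV, PAD, PA, PD)
```

Under these assumptions the values agree with the numbers stated above, including the 15.75-cycle average for two parallel divisions by two.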


In the following we provide estimates for the latency of computing the double scalar multiplication k · G + l · Q with the core of Fig. 3. Similarly as before, we assume that the base point G is fixed and Q is varying. The scalar multiplication begins by precomputing all combinations of a1 G + a2 φ(G) + a3 Q + a4 φ(Q) with ai ∈ {0, 1}. Only the combinations where a3 = 1 or a4 = 1 need to be computed on the fly. The points depending only on G can be computed offline and written to the RAMs once during initialization. The online precomputation is given in Algorithm 5. It uses the point addition part of Alg. 4 to compute point additions where one operand is a point in projective coordinates and the other is a point in extended affine coordinates. In the end, all precomputed points in the table T are in extended affine coordinates. The projective version can be obtained from the affine version without computational cost: for (x, y), the projective version is (x, y, 1, x, y). The extended affine version requires computations: ((y + x)/2, (y − x)/2, x · y).

Algorithm 5 Precomputation (online)
Input: Affine version of Q and affine and extended affine versions of P, φ(P), φ(P) + P
Output: Table T of 16 points represented in the extended affine coordinates
1: Set T[0, 0, 0, 0] ← O, T[1, 0, 0, 0] ← P, T[0, 1, 0, 0] ← φ(P), T[1, 1, 0, 0] ← φ(P) + P;
2: Set T[0, 0, 1, 0] ← Q and convert it to extended affine coordinates;
3: Compute T[0, 0, 0, 1] ← φ(Q) and convert it to extended affine coordinates;
4: Compute T[1, 0, 1, 0] ← T[1, 0, 0, 0] + Q with point addition of Alg. 4 (the same below);
5: Compute T[0, 1, 1, 0] ← T[0, 1, 0, 0] + Q;
6: Compute T[1, 1, 1, 0] ← T[1, 1, 0, 0] + Q;
7: Compute T[1, 0, 0, 1] ← T[1, 0, 0, 0] + φ(Q);
8: Compute T[0, 1, 0, 1] ← T[0, 1, 0, 0] + φ(Q);
9: Compute T[1, 1, 0, 1] ← T[1, 1, 0, 0] + φ(Q);
10: Compute T[0, 0, 1, 1] ← T[0, 0, 0, 1] + Q;
11: Compute T[1, 0, 1, 1] ← T[0, 0, 1, 1] + P;
12: Compute T[0, 1, 1, 1] ← T[0, 0, 1, 1] + φ(P);
13: Compute T[1, 1, 1, 1] ← T[0, 0, 1, 1] + (φ(P) + P);
14: Convert T[1, 0, 1, 0], . . . , T[1, 1, 1, 1] to affine coordinates by using Montgomery's trick for inversions;
15: Convert T[1, 0, 1, 0], . . . , T[1, 1, 1, 1] to extended affine coordinates.

The cost of Alg. 5 is as follows. Line 1 is performed by writing data into the RAMs by using the external interface; this latency depends on the host processor and is not counted in the following clock cycle counts. Line 2 converts Q into extended affine coordinates with one addition/subtraction (y + x and y − x in parallel), divisions by two (again in parallel), and a multiplication; this costs 62 + 11.5 + 15.75 = 89.25 clock cycles on average. Line 3 computes φ(Q) which requires one inversion and one multiplication and, hence, on average 13,702 clock cycles. Lines 4–13 compute point additions with average latencies of 282.5 clock cycles. Line 14 finds the affine coordinates of 10 points, which requires 10 inversions and 20 multiplications. The inversions are computed using Montgomery's trick which translates the problem to one inversion and 27 multiplications. Hence, Line 14 takes 13,640 + (27 + 20) · 62 = 16,554 clock cycles on average. Finally, Alg. 5 ends in Line 15 with the computation of extended affine coordinates for the 10 points. The total cost of this is 10 · 89.25 = 892.5 clock cycles. Summing up all the latencies gives the average latency of 34,062.75 clock cycles for the precomputation.

The scalar array is scanned from the left (the msb) to the right (the lsb). A point doubling is computed for each column followed by a point addition if the column is nonzero (i.e., contains at least one one-bit). Hence, a point addition is computed on average for 15/16 of the columns. Whenever a point addition is skipped, we need to be able to compute only a point doubling. Also, the double scalar multiplication ends with a point addition 15 times out of 16. Hence, in addition to Algorithm 4 also routines for separate point addition and point doubling are needed. They can be constructed from the two halves of Algorithm 4 with simple modifications to the addressing. We assume that we have a 104-bit (≈ 207/2) scalar array. Then, the scalar multiplication latency is given by:

$PD + 102 \cdot \left( \frac{15}{16} \cdot PAD + \frac{1}{16} \cdot PD \right) + \frac{15}{16} \cdot PA$

where PD, PA, and PAD denote point doubling, point addition, and interleaved point addition and point doubling, respectively. Using the above latencies for these operations gives 50,190 clock cycles. In the end, the affine coordinates of the result point are obtained by computing an inversion followed by two multiplications with an average latency of 13,764 clock cycles. Summing up all above latencies gives that a double scalar multiplication requires 98,017 clock cycles on average.

6.3 Results and Discussion
We compiled the core depicted in Fig. 3 for Xilinx Virtex-7 XC7VX330T-1FFG1157 using Xilinx ISE 14.7. The results are collected in Table 4. They show that the core is compact and operates on a relatively high clock frequency (the critical path is in the pipelined 52-bit multiplier). If parallel instances of the core are implemented in the FPGA, then the number of DSP48E1 blocks will become the bottleneck. The numbers indicate that even 50 parallel cores could fit in one Virtex-7 XC7VX330T.

TABLE 4
Results of the verification core on Xilinx Virtex-7 XC7VX330T-1FFG1157

LUTs                          955 (0.5%)
Registers                     992 (0.2%)
Slices                        377 (0.7%)
RAMB36E1                      2 (0.3%)
DSP48E1                       20 (1.8%)
Max. freq. (MHz)              205.634
Latency (clock cycles)        98,017
Timing (ms @ 200 MHz)         0.490
Throughput (ops @ 200 MHz)    2,040

We are not aware of other published results that would be directly comparable with this architecture for double scalar multiplications for signature verifications. An architecture for verifying self-certified signatures on a Koblitz curve NIST K-163 over $F_{2^{163}}$ was presented by Järvinen et


al. [38] in CHES 2007. It achieved a throughput of 166,000 verifications per second with an Altera Stratix II FPGA, which appears to be slightly faster than what is achievable with our architecture. However, the comparison is not fair because the curve used in [38] offers less security (approximately 80 bits vs. 100 bits) and F_{2^n} arithmetic is typically much more efficient in hardware (FPGA) than F_p arithmetic. Recently, Sasdrich and Güneysu [39] presented a highly optimized core for single scalar multiplications on Curve25519 [40]. Their single-core implementation has resource requirements comparable to those of our core (e.g., 20 DSP blocks) and it computes 2,519 scalar multiplications per second. Hence, we can estimate that their implementation is capable of computing roughly 1,260 double scalar multiplications per second. Our core achieves a throughput of 2,040 double scalar multiplications per second, which is approximately 60% more. However, these numbers are not directly comparable because Curve25519 offers a roughly 128-bit security level whereas our curve offers a 100-bit security level. Nevertheless, this shows that our core compares favorably to the state-of-the-art FPGA implementations of fast elliptic curve cryptography over prime fields.

Our high-speed core was purposely designed to be as simple as possible in order to maximize the operating frequency. Adding certain features to the ALU would allow shorter latencies, but would also lead to a drop in the maximum frequency. In particular, adding support for shift operations would allow optimizing squarings (currently treated as normal multiplications) and using faster inversions based on the Extended Euclidean Algorithm. Future work includes studying whether such modifications could lead to further speedups and improvements in the speed-area ratio.

7 CONCLUSION

In this work, we introduced a twisted Edwards curve with an efficiently computable endomorphism and described how the endomorphism can be exploited to speed up double scalar multiplications. We described two hardware implementations utilizing the endomorphism; they target resource-constrained IoT devices and FPGAs for the server side, respectively.

We presented an area-optimized processor architecture for resource-constrained applications. The processor is built around a 16-bit datapath and it has an overall silicon area of only 5821 GE when synthesized with a 130 nm CMOS standard-cell library. In addition, we showed that the architecture and the presented methods support various trade-offs between execution time and memory requirements, which gives a designer many options to optimize double scalar multiplications for different requirements. Our processor architecture compares favorably to various counterparts from the literature.

We also provided a high-speed architecture for FPGA devices. This verification core was designed to use parallel processing with two ALUs and RAM memories. It resulted in a fast and compact FPGA implementation of the double scalar multiplication. It can be used for achieving very high throughputs for signature verifications in server-side operations related to the IoT by using parallel instances of the core inside one FPGA device.

These implementations show that our curve can be efficiently implemented for applications that require low resources or high speed. This is a particularly important advantage for IoT applications because such systems must be flexible in the sense that they can be efficiently implemented in environments with varying implementation constraints. Our curve offers a roughly 100-bit security level, which is a good tradeoff between security and performance. All this makes our methods, the curve, and the architectures good options for implementing cryptographic protocols in IoT applications.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their valuable comments and helpful suggestions. Zhi Hu was partially supported by the Natural Science Foundation of China (Grant No. 61602526).

REFERENCES

[1] L. Atzori, A. Iera, and G. Morabito, The internet of things: A survey, Computer Networks, vol. 54, no. 15, pp. 2787–2805, Oct. 2010.
[2] R. Roman, P. Najera, and J. Lopez, Securing the internet of things, Computer, vol. 44, no. 9, pp. 51–58, 2011.
[3] T. Dierks and E. K. Rescorla, The Transport Layer Security (TLS) Protocol Version 1.2, Internet Engineering Task Force, Network Working Group, RFC 5246, Aug. 2008.
[4] E. K. Rescorla and N. G. Modadugu, Datagram Transport Layer Security Version 1.2, Internet Engineering Task Force, Network Working Group, RFC 6347, Jan. 2012.
[5] S. L. Keoh, S. S. Kumar, and H. Tschofenig, Securing the Internet of things: A standardization perspective, IEEE Internet of Things Journal, vol. 1, no. 3, pp. 265–275, Jun. 2014.
[6] R. L. Rivest, A. Shamir, and L. M. Adleman, A method for obtaining digital signatures and public key cryptosystems, Communications of the ACM, vol. 21, no. 2, pp. 120–126, Feb. 1978.
[7] National Institute of Standards and Technology (NIST), Digital Signature Standard (DSS), FIPS Publication 186-4, available for download at http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.186-4.pdf, Gaithersburg, MD, USA, Jul. 2013.
[8] D. Johnson, A. J. Menezes, and S. A. Vanstone, The elliptic curve digital signature algorithm (ECDSA), International Journal of Information Security, vol. 1, no. 1, pp. 36–63, Jul. 2001.
[9] S. Blake-Wilson, N. Bolyard, V. Gupta, C. Hawk, and B. Möller, Elliptic Curve Cryptography (ECC) Cipher Suites for Transport Layer Security (TLS), Internet Engineering Task Force, Network Working Group, RFC 4492, May 2006.
[10] N. P. Smart, Ed., ECRYPT II Yearly Report on Algorithms and Keysizes (2011-2012). European Network of Excellence in Cryptology (ECRYPT II), Sep. 2012, deliverable D.SPA.20, available for download at http://www.ecrypt.eu.org/documents/D.SPA.20.pdf.
[11] D. J. Bernstein, N. Duif, T. Lange, P. Schwabe, and B.-Y. Yang, High-speed high-security signatures, Journal of Cryptographic Engineering, vol. 2, no. 2, pp. 77–89, Sep. 2012.
[12] D. R. Hankerson, A. J. Menezes, and S. A. Vanstone, Guide to Elliptic Curve Cryptography. Springer Verlag, 2004.
[13] D. J. Bernstein, P. Birkner, M. Joye, T. Lange, and C. Peters, Twisted Edwards curves, in Progress in Cryptology – AFRICACRYPT 2008, ser. Lecture Notes in Computer Science, S. Vaudenay, Ed., vol. 5023. Springer Verlag, 2008, pp. 389–405.
[14] H. Hişil, K. K.-H. Wong, G. Carter, and E. Dawson, Twisted Edwards curves revisited, in Advances in Cryptology – ASIACRYPT 2008, ser. Lecture Notes in Computer Science, J. Pieprzyk, Ed., vol. 5350. Springer Verlag, 2008, pp. 326–343.
[15] R. P. Gallant, R. J. Lambert, and S. A. Vanstone, Faster point multiplication on elliptic curves with efficient endomorphism, in Advances in Cryptology – CRYPTO 2001, ser. Lecture Notes in Computer Science, J. Kilian, Ed., vol. 2139. Springer Verlag, 2001, pp. 190–200.


[16] S. D. Galbraith, X. Lin, and M. Scott, Endomorphisms for faster elliptic curve cryptography on a large class of curves, in Advances in Cryptology – EUROCRYPT 2009, ser. Lecture Notes in Computer Science, A. Joux, Ed., vol. 5479. Springer Verlag, 2009, pp. 518–535.
[17] P. Longa and F. Sica, Four-dimensional Gallant-Lambert-Vanstone scalar multiplication, in Advances in Cryptology – ASIACRYPT 2012, ser. Lecture Notes in Computer Science, X. Wang and K. Sako, Eds., vol. 7658. Springer Verlag, 2012, pp. 719–739.
[18] P. Longa and C. H. Gebotys, Efficient techniques for high-speed elliptic curve cryptography, in Cryptographic Hardware and Embedded Systems – CHES 2010, ser. Lecture Notes in Computer Science, S. Mangard and F.-X. Standaert, Eds., vol. 6225. Springer Verlag, 2010, pp. 80–94.
[19] A. Faz-Hernández, P. Longa, and A. H. Sánchez, Efficient and secure algorithms for GLV-based scalar multiplication and their implementation on GLV-GLS curves, in Topics in Cryptology – CT-RSA 2014, ser. Lecture Notes in Computer Science, J. Benaloh, Ed., vol. 8366. Springer Verlag, 2014, pp. 1–27.
[20] D. A. Cox, Primes of the Form x^2 + ny^2. John Wiley & Sons, 1989.
[21] J. A. Solinas, Low-weight binary representations for pairs of integers, Centre for Applied Cryptographic Research (CACR), University of Waterloo, Waterloo, Canada, Tech. Rep. CORR 2001-41, 2001.
[22] T. Yanık, E. Savaş, and Ç. K. Koç, Incomplete reduction in modular arithmetic, IEE Proceedings – Computers and Digital Techniques, vol. 149, no. 2, pp. 46–52, Mar. 2002.
[23] B. S. Kaliski, The Montgomery inverse and its applications, IEEE Transactions on Computers, vol. 44, no. 8, pp. 1064–1065, 1995.
[24] E. Savas, Ç. Koç, The Montgomery modular inverse-revisited, IEEE Transactions on Computers, vol. 49, no. 7, pp. 763–766, 2000.
[25] E. Savaş, M. Naseer, A.-A. Gutub, and Ç. K. Koç, Efficient unified Montgomery inversion with multibit shifting, IEE Proceedings – Computers and Digital Techniques, vol. 152, no. 4, pp. 489–498, 2005.
[26] R. Lórencz and J. Hlaváč, Subtraction-free almost Montgomery inverse algorithm, Information Processing Letters, vol. 94, no. 1, pp. 11–14, 2005.
[27] V. G. Oklobdzija, An algorithmic and novel design of a leading zero detector circuit: Comparison with logic synthesis, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 2, no. 1, pp. 124–128, 1994.
[28] G. Chen, G. Bai, and H. Chen, A high-performance elliptic curve cryptographic processor for general curves over GF(p) based on a systolic arithmetic unit, Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 54, no. 5, pp. 412–416, 2007.
[29] J.-Y. Lai and C.-T. Huang, A highly efficient cipher processor for dual-field elliptic curve cryptography, Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 56, no. 5, pp. 394–398, 2009.
[30] A. Satoh and K. Takano, A scalable dual-field elliptic curve cryptographic processor, Computers, IEEE Transactions on, vol. 52, no. 4, pp. 449–460, 2003.
[31] F. Furbass and J. Wolkerstorfer, ECC processor with low die size for RFID applications, in Circuits and Systems, 2007. ISCAS 2007. IEEE International Symposium on. IEEE, 2007, pp. 1835–1838.
[32] M. Hutter, M. Feldhofer, and T. Plos, An ECDSA processor for RFID authentication, in Radio Frequency Identification: Security and Privacy Issues. Springer, 2010, pp. 189–202.
[33] E. Wenger, M. Feldhofer, and N. Felber, Low-resource hardware design of an elliptic curve processor for contactless devices, in Information Security Applications. Springer, 2011, pp. 92–106.
[34] T. Plos, M. Hutter, M. Feldhofer, M. Stiglic, and F. Cavaliere, Security-enabled near-field communication tag with flexible architecture supporting asymmetric cryptography, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 21, no. 11, pp. 1965–1974, 2013.
[35] Z. Liu, E. Wenger, and J. Großschädl, MoTE-ECC: Energy-scalable elliptic curve cryptography for wireless sensor networks, in The 12th International Conference on Applied Cryptography and Network Security – ACNS 2014, ser. Lecture Notes in Computer Science, I. Boureanu, P. Owezarski, and S. Vaudenay, Eds. Springer Verlag, 2014.
[36] E. Wenger, Hardware architectures for MSP430-based wireless sensor nodes performing elliptic curve cryptography, in Applied Cryptography and Network Security – ACNS 2013, ser. Lecture Notes in Computer Science, vol. 7954. Springer, 2013, pp. 290–306.
[37] P. G. Comba, Exponentiation cryptosystems on the IBM PC, IBM Systems Journal, vol. 29, no. 4, pp. 526–538, 1990.
[38] K. Järvinen, J. Forsten, and J. Skyttä, FPGA design of self-certified signature verification on Koblitz curves, in Cryptographic Hardware and Embedded Systems – CHES 2007, ser. Lecture Notes in Computer Science, vol. 4727. Springer, 2007, pp. 256–271.
[39] P. Sasdrich and T. Güneysu, Implementing Curve25519 for side-channel-protected elliptic curve cryptography, ACM Transactions on Reconfigurable Technology and Systems, vol. 9, no. 1, p. 3, Nov. 2015.
[40] D. J. Bernstein, Curve25519: New Diffie-Hellman speed records, in Public Key Cryptography – PKC 2006, ser. Lecture Notes in Computer Science. Springer, 2006, vol. 3958, pp. 207–228.

Zhe Liu is a full professor in the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics. He is currently a postdoctoral research fellow in the Institute for Quantum Computing (IQC) and the Department of Combinatorics and Optimization, University of Waterloo, Canada. He received his Ph.D. degree from the Laboratory of Algorithmics, Cryptology and Security (LACS), University of Luxembourg, in 2015. During his doctoral studies, he was a visiting scholar at City University of Hong Kong, COSIC, K.U.Leuven, as well as Microsoft Research (MSR), Redmond. His research interests include different aspects of information security. He has co-authored more than 40 peer-reviewed journal and conference papers in the area of cryptographic engineering, including IEEE TIFS and IACR CHES.

Johann Großschädl is a member of the research staff at the Laboratory of Algorithmics, Cryptology and Security (LACS), University of Luxembourg. Before joining the University of Luxembourg, he was a research scientist in the computer science department of the University of Bristol, United Kingdom. He has published more than 60 papers in international, peer-reviewed journals and conference proceedings, such as ACM Annual Computer Security Applications Conference (ACSAC) Cryptographic Hardware and Embedded Systems (CHES), which are the flagship events in the field of applied cryptography. He is a member of the IEEE and the International Association for Cryptologic Research (IACR).

Zhi Hu was born in Hunan Province, China, in 1985. He received the B.S. degree in 2007 and the Ph.D. degree in 2012, both from the School of Mathematical Sciences, Peking University, China. He was a postdoctoral research fellow in the Beijing International Center for Mathematical Research (BICMR) from 2012 to 2014. After that, he joined the School of Mathematics and Statistics, Central South University, China, where he is currently a lecturer. His research interests include cryptography and information security, especially elliptic curve cryptography.


Kimmo Järvinen received the M.Sc. (Tech.) degree in 2003 and the D.Sc. (Tech.) degree in 2008, both in electrical engineering, from Helsinki University of Technology (TKK), Espoo, Finland. He was with the Signal Processing Laboratory at TKK from 2002 to 2008. From 2008 to 2014, he worked in the Department of Information and Computer Science, Aalto University, Espoo, Finland. From 2014 to 2015, he was with the COSIC group of KU Leuven ESAT, Leuven, Belgium. He is currently with the Department of Computer Science in Aalto University. His research interests include efficient and secure realization of cryptosystems, general computer arithmetic, and FPGAs.

Husen Wang received his B.S. degree from Beihang University in 2009 and his M.S. degree in computer science from Tsinghua University in 2012. Since 2015, he has been a researcher in Security and Trust (SnT), University of Luxembourg, Luxembourg. His research interests include information security, with special focus on cryptographic engineering. His works have been published in refereed journals and cryptology conferences.

Ingrid Verbauwhede received the electrical engineering degree and PhD degree from the KU Leuven, Heverlee, Belgium, in 1991. From 1992 to 1994, she was a postdoctoral researcher and visiting lecturer with the University of California, Berkeley. From 1994 to 1998, she worked with TCSI and ATMEL in Berkeley, California. In 1998, she joined the faculty of University of California, Los Angeles (UCLA). She is currently a professor with the KU Leuven and an adjunct professor with UCLA. At KU Leuven, she is a co-director of the Computer Security and Industrial Cryptography (COSIC) Laboratory. Her research interests include circuits, processor architectures and design methodologies for real-time embedded systems for security, cryptography, digital signal processing, and wireless communications. This includes the influence of new technologies and new circuit solutions on the design of next-generation systems on chip. She was the program chair of CHES’07, CHES’12, ASAP’08, and ISLPED’02. She was also the general chair of ISLPED’03. She was a member of the
executive committee of DCA’05 and DAC’06.
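
As a cross-check of the arithmetic in Sections 6.2 and 6.3, the reported cycle counts can be re-added with a few lines of Python. This is only a minimal sketch and not part of the authors' design flow: every constant is copied from the text, while the names (`MUL`, `INV`, `precomputation_cycles`, `total_cycles`, `throughput_per_second`) are illustrative assumptions. The 50,190-cycle main loop is taken as reported rather than recomputed, because the individual PD, PA, and PAD latencies are not repeated in this section.

```python
# Re-adds the average cycle counts quoted in Section 6.2.
# All numbers come from the text; the identifiers are hypothetical.

MUL = 62        # one field multiplication, from "(27 + 20) * 62"
INV = 13_640    # one field inversion, from "13,640 + (27 + 20) * 62"


def precomputation_cycles() -> float:
    """Average latency of the precomputation (Algorithm 5)."""
    ext_affine = 62 + 11.5 + 15.75        # one extended affine conversion
    phi_q = INV + MUL                     # Line 3: inversion + multiplication
    point_additions = 10 * 282.5          # Lines 4-13: ten point additions
    to_affine = INV + (27 + 20) * MUL     # Line 14: Montgomery's trick
    ext_affine_all = 10 * ext_affine      # Line 15: ten conversions
    return ext_affine + phi_q + point_additions + to_affine + ext_affine_all


def total_cycles(main_loop: float = 50_190) -> float:
    """Precomputation + main loop (as reported) + final affine conversion."""
    final_affine = INV + 2 * MUL          # inversion + two multiplications
    return precomputation_cycles() + main_loop + final_affine


def throughput_per_second(cycles: float, clock_hz: float = 200e6) -> float:
    """Double scalar multiplications per second at the given clock."""
    return clock_hz / cycles


if __name__ == "__main__":
    cycles = round(total_cycles())
    print("precomputation:", precomputation_cycles())
    print("total cycles:  ", cycles)
    print("time (ms):     ", round(cycles / 200e6 * 1e3, 3))
    print("ops per second:", int(throughput_per_second(cycles)))
```

Rounding the total to the nearest cycle reproduces the latency listed in Table 4, and dividing 200 MHz by that latency gives the throughput quoted there.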
