Abstract—We present an innovative methodology for accelerating the elliptic curve point formulas over prime fields. This flexible technique uses the substitution of multiplication with squaring and other cheaper operations, by exploiting the fact that field squaring is generally less costly than multiplication. Applying this substitution to the traditional formulas, we obtain faster point operations in unprotected sequential implementations. We also show the significant impact our methodology has on protecting against simple side-channel analysis (SSCA) attacks. We modify the elliptic curve cryptography (ECC) point formulas to achieve a faster atomic structure when applying side-channel atomicity protection. In contrast to previous atomic operations that assume that squarings are indistinguishable from multiplications, our new atomic structure offers true SSCA protection because it includes squaring in its formulation. Moreover, we extend our implementation to parallel architectures such as Single-Instruction Multiple-Data (SIMD). With the introduction of a new coordinate system and the flexibility of our methodology, we present, to our knowledge, the fastest formulas for SIMD-based schemes that are capable of executing three and four operations simultaneously. Finally, a new parallel SSCA-protected scheme is proposed for multiprocessor/parallel architectures by applying the atomic structure presented in this work. Our parallel and atomic operations are shown to be significantly faster than previous implementations.
Index Terms—Elliptic curve, point arithmetic, side-channel attacks, atomicity, SIMD operation, parallel implementations.
1 INTRODUCTION
multiplication is carried out with no cryptographic coprocessors or hardware accelerators. For details of addition/subtraction costs in an efficient implementation, the reader is referred to [5] and [11].

Besides efforts to speed up scalar multiplication, there are two additional and important areas of research in ECC: side-channel analysis (SCA) attacks and efficient implementations on parallel architectures. In the following sections, we present a brief description of these areas of research.

1.1 Side-Channel Attacks

Side-channel information such as power dissipation and electromagnetic (EM) emission leaked by real-world devices has been shown to be highly useful for guessing private keys and effectively breaking the otherwise mathematically strong ECC cryptosystem [14]. There are two main classes of these attacks: simple SCA (SSCA) and differential SCA (DSCA). We will focus on SSCA, which is based on the analysis of a single execution trace of a scalar multiplication, to guess the secret key by revealing the sequence in the execution of point operations.

Extensive research has been carried out to yield effective countermeasures. Among them, we could mention indistinguishable operations via dummy instructions, scalar multiplication with a fixed sequence of group operations (for example, Coron's double-and-add-always countermeasure [15]), unified addition and doubling formulas (for example, the Jacobi and Hessian forms [16], [17], [18]), and side-channel atomicity [19]. The first two methods are, in general, highly expensive and quite susceptible to fault attacks. Using unified addition and doubling formulas has the drawback of being expensive or relying on special curves that are different from the ones specified by international standards. A highly efficient variation of the scalar multiplication with a fixed sequence of group operations is based on the Montgomery Ladder [21], [22], [23]. However, similarly to the unified operation approach, the most efficient version of the Montgomery Ladder also relies on a nonstandardized curve form, namely, the Montgomery curve. Side-channel atomicity, proposed by Chevallier-Mames et al. [19], dissolves point operations into small homogeneous blocks, known as atomic blocks, which cannot be distinguished from each other through simple side-channel analysis because each one contains the same pattern of basic field operations. Furthermore, atomic blocks are made sufficiently small to make this approach inexpensive. It is important to note that, as pointed out in [19], if the leaking of the Hamming weight of the scalar is an issue, it can be avoided by applying some technique such as blinding. Also, we assume that transitions between blocks are carefully implemented to avoid the distinction of the different point operations. If the previous assumption cannot be met, then it could be advisable to consider the approach given in [20] and make the point operations have the same number of field operations. However, notice that this extra measure would introduce a significant overhead into the side-channel atomicity strategy. Chevallier-Mames et al. [19] proposed the Multiplication-Addition-Negation-Addition (M-A-N-A) structure to build SSCA-protected formulas over prime fields. However, the main drawback of the traditional M-A-N-A structure is that it relies on the assumption that field multiplication and squaring are indistinguishable from each other. In software implementations, timing and power consumption have been shown to be quite different for these operations, making them directly distinguishable through power analysis [7], [9]. The following attack can be conceived in such a case. By observing only one EM or power trace, an attacker may be able to detect which portions of the scalar multiplication are in fact executing a squaring. With that knowledge, he/she can now gain access to the point doubling/addition sequence (and, consequently, to the secret key), given that the atomic addition has far more multiplications than squarings and its pattern of squarings/multiplications is very different from the pattern for the atomic doubling. Hardware platforms can be thought to be invulnerable to this attack when one hardware multiplier executes both field squarings and multiplications. However, some studies suggest that higher order DSCA attacks [24] can reveal differences between those operations by detecting data-dependent information through the observation of multiple sample times in the power trace. For instance, Walter [25] proposed a high-order DPA attack to defeat an RSA implementation by distinguishing multiplications with precomputed points from squarings and multiplications with random numbers through power analysis. This work suggests that, similarly, power traces of multiplications with random numbers can be distinguished from power traces of squarings. These conclusions can be directly applied to ECC cryptosystems and exploited to implement the attack described previously. Although more research is required to assess the effectiveness of these and related attacks in hardware implementations, the smart developer would take into account some precautions when implementing side-channel atomicity [24, Section 29.4].

In this work (see Section 5), we exploit the flexibility given by our technique of replacing multiplications by squarings to propose a more efficient atomic structure that effectively takes squarings into account in its formulation. The latter makes our atomic operations not only faster but also invulnerable to the potential attack described in this section. The increase in speed comes from the fact that squarings are generally less expensive than multiplications and that our improved atomic structure permits us to pack more field operations in one atomic block.

1.2 Parallel Architectures

In recent years, a new design paradigm has arisen with the appearance of multiprocessor/parallel architectures, which can execute several operations simultaneously. This topic is becoming increasingly important since single-processor design is reaching its limits in terms of clock frequency. Among several parallel architectures, Single-Instruction Multiple-Data (SIMD) has become highly attractive since it generally avoids the higher hardware complexity needed in parallel architectures such as superscalar computers by leaving to the programmer the task of parallelizing the program execution. Hence, we can already find SIMD-based schemes in many popular processors such as Pentium, SPARC, and PowerPC.

Similarly to other systems, ECC can be adapted to parallel architectures at different algorithmic levels. We
LONGA AND MIRI: FAST AND FLEXIBLE ELLIPTIC CURVE POINT ARITHMETIC OVER PRIME FIELDS 291
focus our efforts to parallelize ECC formulas at the point arithmetic level. In this regard, Aoki et al. [26] and Izu and Takagi [27] introduced efficient parallel point operations targeting SIMD-based processors. In [27], the authors presented formulas for two-processor architectures. The authors in [26] introduced modified Jacobian coordinates (X, Y, Z, Z²) to develop fast parallel formulas for platforms that can execute two and three operations simultaneously. Their parallel formulas are, to our knowledge, the fastest.

However, the limitation of the previous works is that they rely on traditional point operation formulas, which are restricted to a fixed number of squarings and multiplications. The methodology of replacing multiplications introduced in this work will be shown to allow the development of superior parallel operations that are more efficient for multiprocessor/parallel execution. In this regard, we propose faster parallel formulas that are able to execute three and four operations simultaneously (see Section 6).

The previous approach targets unprotected implementations where side-channel analysis is not a concern. However, as previously stated, portable devices should be protected against SCA. In the last part of the present work, we deal with SSCA-protected implementations for parallel architectures. Similar efforts can be found in the literature. In [21] and [22], the authors presented efficient parallel schemes on generic curves over prime fields using the Montgomery Ladder, which is intrinsically protected against SSCA because every iteration in the main loop involves one doubling and one addition. An advantage of this method is that the formulas involve computation with the x-coordinate only. In particular, Fischer et al. [21] presented a more attractive scheme since it parallelizes doublings and additions at the field operation level, whereas Izu and Takagi [22] parallelize at the point operation level in every iteration of the main loop. The latter has the limitation that the cost of every iteration is determined by the most costly point operation, namely, addition. Later, Izu and Takagi [23] improved the previous proposals and introduced a unified Doubling-Addition formula for the Montgomery Ladder method. The composite formula was then efficiently parallelized. In [28], Mishra proposed a pipelined approach for generic curves over prime fields using the standard point arithmetic. In this scheme, each point operation is protected against SSCA using atomicity and the atomic block execution is done through a pipeline, where up to two atomic blocks can be computed simultaneously. Because a pipelined atomic operation can begin its execution before the previous atomic operation is complete, the effective cost of each point operation is reduced to a few atomic blocks.

In this work (see Section 7), we propose a faster two-processor SSCA-protected scheme that introduces further cost reductions by using the enhanced atomic structure with squarings introduced in Section 5. As previously explained, our atomic structure not only offers true protection against SSCA by distinguishing multiplications from squarings but also allows us to pack more field operations in each block.

For the rest of this work, M, S, and A, in italics, stand for the computing costs of field multiplication, squaring, and addition or subtraction, respectively. To simplify our cost analyses in the different sections, we will consider two possible scenarios:

1. The first scenario involves software-based implementations where squaring is faster than multiplication. In this case, we consider 1S = 0.6M or 1S = 0.8M.

2. The second scenario involves implementations on hardware platforms or when some built-in hardware is used to accelerate EC operations (for example, using a modular hardware multiplier [29]). In this case, a multiplier executes both squaring and multiplication and, consequently, the ratio S/M is fixed to one.

Although A/M ratios widely vary from application to application, to simplify comparisons, we consider a low ratio for applications where multiplications are not favored by a fast hardware multiplier. In such a case, we use 1A ≈ 0.05M, as achieved in [11]. If that is not the case, it can happen that the cost of an addition is not negligible. In this case, we will consider the ratio achieved in [30], 1A ≈ 0.2M.

This paper is organized as follows: In Section 2, we introduce some basic concepts about ECC and present, for comparison purposes, the traditional point operation formulas. In Section 3, we present our methodology and, in Section 4, we apply it to develop fast formulas for the case of point doubling, general addition, mixed addition, and tripling. In Section 5, new atomic structures are used to develop atomic formulas for the previous fast point operations. We then analyze in Section 6 the case of parallel architectures and develop SIMD-based formulas with three and four simultaneous operations. In Section 7, we continue with parallel architectures, but this time to develop an SSCA-protected scheme that computes two operations simultaneously. We end with some conclusions in Section 8.

2 PRELIMINARIES

An elliptic curve E over a field K is defined by the general Weierstrass equation:

E: y² + a1xy + a3y = x³ + a2x² + a4x + a6, (1)

where a1, a2, a3, a4, a6 ∈ K.

The set of pairs (x, y) that solves (1), together with the point at infinity O, which is the identity for the group law, forms an abelian group. This group of points is used to implement ECC.

We can define ECC over different finite fields. In particular, we will work with a prime field Fp (the field with p elements, where p is a large prime). In this case, the general Weierstrass equation simplifies to the following:

E: y² = x³ + ax + b, (2)

where a, b ∈ Fp and Δ = 4a³ + 27b² ≠ 0.

Let E be the elliptic curve over the finite field Fp. We can represent the main operation in ECC, namely, scalar multiplication, as Q = dP, where P and Q are points in E(Fp) and d is the secret scalar. Several algorithms, such as
292 IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 3, MARCH 2008
binary, NAF, and sliding window methods, among many others, have been proposed to compute the scalar multiplication efficiently. In general, these methods rely on the execution of a given sequence of point doubling (2P) and addition operations (P + Q). Recent methods such as scalar multiplication based on the ternary/binary method [4] or double-base chains [3] introduced the tripling of a point (3P) as an additional point operation.

The representation of points on the curve E with affine coordinates (x, y) introduces field inversions into the computation of point doubling and point addition. Inversions over prime fields are the most expensive field operation and are avoided as much as possible. Projective coordinates (X, Y, Z) solve that problem by adding the third coordinate Z to replace inversions with a few other field operations.

In the present work, we will use Jacobian coordinates, a special case of projective coordinates that has yielded very efficient point formulas. In the case of addition, we will also consider the most efficient case, which is obtained by adding two points in different point representations. In particular, we will analyze the case when one point is represented in Jacobian coordinates and the second point in affine coordinates (mixed addition with Jacobian-affine coordinates).

In the following sections, we introduce the traditional point operation formulas, which can later be used for comparison to our proposed fast operations. It is important to note that some additional improvements in computational costs have been proposed for the tripling formulas.

2.1 Point Doubling in Jacobian Coordinates

Let P = (X1, Y1, Z1) be a point in Jacobian coordinates on the elliptic curve E. The point doubling 2P = (X3, Y3, Z3) can be computed by the following traditional formula:

2.3 Mixed Addition in Jacobian-Affine Coordinates

Let P = (X1, Y1, Z1) and Q = (X2, Y2) be two points in Jacobian and affine coordinates, respectively, on the elliptic curve E. The mixed addition P + Q = (X3, Y3, Z3) is traditionally obtained as follows:

X3 = α² − β³ − 2X1β², Y3 = α(X1β² − X3) − Y1β³, Z3 = Z1β,

with

α = Z1³Y2 − Y1, β = Z1²X2 − X1. (6)

With (6), the cost of a mixed addition is fixed at 8M + 3S.

2.4 Point Tripling in Jacobian Coordinates

Dimitrov et al. [3] introduced a fast tripling formula that costs 10M + 6S. Let P = (X1, Y1, Z1) be a point in Jacobian coordinates on the elliptic curve E. The point tripling 3P = (X3, Y3, Z3) can be computed with the following:

X3 = 8Y1²(β − α) + X1ω², Y3 = Y1[4(α − β)(2β − α) − ω³], Z3 = Z1ω, (7)

where α = θω, β = 8Y1⁴, θ = 3X1² + aZ1⁴, ω = 12X1Y1² − θ².

We first remark that the same formula can be more efficiently implemented with 9M + 7S. On the other hand, Dimitrov et al. [3] proposed accelerating the computation by avoiding intermediate operations during computation of the term aZ⁴ when repeated triplings are to be computed. This idea is based on a similar approach given in [2] with their modified Jacobian coordinates. However, it is straightforward to note that applying another well-known technique (fixing a = −3) actually gives the best performance. Thus, by computing θ using the factorization technique given in (4), we reduce the cost of the tripling to 9M + 5S.
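As a concrete check on the formulas above, the following sketch implements the mixed addition (6) and the tripling (7) and compares them with the textbook affine group law. The toy curve, point, and helper names are illustrative choices for this sketch, not values from the paper.

```python
# Numerical check of the traditional mixed addition (6) and tripling (7)
# on a toy curve E: y^2 = x^3 + a*x + b over F_p.
p, a, b = 23, 1, 1          # tiny example curve y^2 = x^3 + x + 1 over F_23
P_aff = (0, 1)              # affine point on the curve

def affine_add(P, Q):
    """Textbook affine group law (no point-at-infinity handling needed here)."""
    (x1, y1), (x2, y2) = P, Q
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

def to_affine(X, Y, Z):
    iz = pow(Z, -1, p)
    return X * iz * iz % p, Y * iz * iz * iz % p

def mixed_add(X1, Y1, Z1, x2, y2):
    # Eq. (6): alpha = Z1^3*y2 - Y1, beta = Z1^2*x2 - X1
    alpha = (Z1 ** 3 * y2 - Y1) % p
    beta = (Z1 ** 2 * x2 - X1) % p
    X3 = (alpha ** 2 - beta ** 3 - 2 * X1 * beta ** 2) % p
    Y3 = (alpha * (X1 * beta ** 2 - X3) - Y1 * beta ** 3) % p
    return X3, Y3, Z1 * beta % p

def tripling(X1, Y1, Z1):
    # Eq. (7): theta = 3X1^2 + aZ1^4, omega = 12X1Y1^2 - theta^2,
    #          alpha = theta*omega, beta = 8Y1^4
    th = (3 * X1 ** 2 + a * Z1 ** 4) % p
    om = (12 * X1 * Y1 ** 2 - th ** 2) % p
    al, be = th * om % p, 8 * Y1 ** 4 % p
    X3 = (8 * Y1 ** 2 * (be - al) + X1 * om ** 2) % p
    Y3 = Y1 * (4 * (al - be) * (2 * be - al) - om ** 3) % p
    return X3, Y3, Z1 * om % p

P2 = affine_add(P_aff, P_aff)
P3 = affine_add(P2, P_aff)
assert to_affine(*tripling(P_aff[0], P_aff[1], 1)) == P3
# mixed addition: 2P (Jacobian, Z = 1) plus P (affine) equals 3P
assert to_affine(*mixed_add(P2[0], P2[1], 1, *P_aff)) == P3
print("formulas (6) and (7) agree with the affine group law")
```

The same checks pass for any point of odd order on any short-Weierstrass curve over a prime field; the tiny prime only keeps the example readable.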
The equivalence class denoted by (X : Y : Z) that contains the projective coordinates (X, Y, Z) is [1]

(X : Y : Z) = {(λ^c X, λ^d Y, λZ) : λ ∈ K*, c, d ∈ Z⁺}. (9)

If we define λ = 2, the previous strategy efficiently inserts multiples of two into the formula, which permits the transformation of the original algebraic substitution (8) to the next form for an "even" field multiplication, eliminating the division by two:

2ab = (a + b)² − a² − b². (10)

Our flexible methodology can be summarized in two steps:

1. Modify, if necessary, the point formula by inserting multiples of two via the selection of the following representative of the equivalence class for Jacobian coordinates:

(X : Y : Z) = {(2²X, 2³Y, 2Z)}. (11)

can be efficiently replaced by squarings because it would add more than one squaring to the formula. The revised doubling formula is given as follows:

X3 = α² − 2β, Y3 = α(β − X3) − 8Y1⁴, Z3 = (Y1 + Z1)² − Y1² − Z1², (12)

where

α = 3X1² + aZ1⁴, β = 2[(X1 + Y1²)² − X1² − Y1⁴].

Given (12), the cost of a doubling is reduced from 4M + 6S to 2M + 8S, trading two multiplications for two squarings.

If we fix a = −3, there is a computing reduction by applying the factorization technique in (4). Note that computation of X1² is avoided and, consequently, computing β = 2[(X1 + Y1²)² − X1² − Y1⁴] is not an improvement anymore, since one multiplication would be replaced by two squarings instead of only one. The doubling formula when a = −3 is given as follows:
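The a = −3 speedup rests on the factorization referred to above as technique (4): 3X1² + aZ1⁴ factors as 3(X1 − Z1²)(X1 + Z1²) when a = −3. A minimal numerical check follows; the prime and the random coordinates are illustrative choices, not values from the paper.

```python
# When a = -3, theta = 3*X1^2 + a*Z1^4 = 3*(X1 - Z1^2)*(X1 + Z1^2):
# given that Z1^2 is already available, the factored form costs one
# multiplication where the direct form costs two squarings.
import random

p = (1 << 192) - 2 ** 64 - 1   # a 192-bit prime of NIST shape (illustrative)
random.seed(1)
for _ in range(100):
    X1, Z1 = random.randrange(p), random.randrange(p)
    Z1sq = Z1 * Z1 % p                               # Z1^2, assumed cached
    direct = (3 * X1 * X1 - 3 * Z1sq * Z1sq) % p     # 3*X1^2 + a*Z1^4 with a = -3
    factored = 3 * (X1 - Z1sq) * (X1 + Z1sq) % p     # one multiplication instead
    assert direct == factored
print("3*X1^2 - 3*Z1^4 == 3*(X1 - Z1^2)*(X1 + Z1^2) holds")
```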
TABLE 1
Cost of the Proposed Fast Point Operations in Comparison with Traditional Formulas

where

α = 2(Z1³Y2 − Z2³Y1), β = Z1²X2 − Z2²X1, γ = (Z1 + Z2)² − Z1² − Z2².

Given (15), the cost of an addition is reduced to 11M + 5S, trading one multiplication for one squaring in the traditional formula.
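A quick numerical check of the substitutions described above: the revised doubling (12) must compute exactly the same point as the textbook Jacobian doubling, since identity (10) rewrites the products 4X1Y1² and 2Y1Z1 as differences of squares. The toy curve and points below are illustrative choices for this sketch.

```python
# Sanity check of the revised doubling (12) against the textbook Jacobian
# doubling, plus the "even multiplication" identity (10).
p, a = 23, 1                      # toy field and curve parameter a

def double_traditional(X1, Y1, Z1):
    al = (3 * X1 ** 2 + a * Z1 ** 4) % p
    X3 = (al ** 2 - 8 * X1 * Y1 ** 2) % p
    Y3 = (al * (4 * X1 * Y1 ** 2 - X3) - 8 * Y1 ** 4) % p
    return X3, Y3, 2 * Y1 * Z1 % p

def double_with_squarings(X1, Y1, Z1):
    # Eq. (12): the products 4*X1*Y1^2 and 2*Y1*Z1 become squarings via (10)
    al = (3 * X1 ** 2 + a * Z1 ** 4) % p
    be = 2 * ((X1 + Y1 ** 2) ** 2 - X1 ** 2 - Y1 ** 4) % p     # = 4*X1*Y1^2
    X3 = (al ** 2 - 2 * be) % p
    Y3 = (al * (be - X3) - 8 * Y1 ** 4) % p
    Z3 = ((Y1 + Z1) ** 2 - Y1 ** 2 - Z1 ** 2) % p              # = 2*Y1*Z1
    return X3, Y3, Z3

# identity (10): 2ab = (a+b)^2 - a^2 - b^2, also behind the 2*Z1*Z2 term above
for u in range(1, 10):
    for v in range(1, 10):
        assert 2 * u * v == (u + v) ** 2 - u ** 2 - v ** 2

assert double_traditional(0, 1, 1) == double_with_squarings(0, 1, 1)
assert double_traditional(6, 19, 1) == double_with_squarings(6, 19, 1)
print("revised doubling (12) matches the traditional formula")
```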
TABLE 2
Cost of the Proposed and Previous Atomic Operations (Scalar Multiplications Using Radix 2)

TABLE 3
Cost of the Proposed and Previous Atomic Operations (Scalar Multiplications Using Radices 2 and 3)

TABLE 4
Performance of New Atomic Operations in Comparison with Previous Atomic Formulas (NAF Method, n = 160 bits)
(In Tables 2 and 3, w = number of repeated doublings or triplings.)

respectively. Thus, a doubling costs 10M + 20A and an addition, 16M + 32A. Mishra [28] presented an improved atomic operation for the case of mixed addition. The number of atomic blocks for addition was reduced to 11, with a total cost of 11M + 22A. Later, Dimitrov et al. [3] presented a fast tripling formula with 16 atomic blocks using the same atomic structure, with a total cost of 16M + 32A. In comparison, our enhanced atomic structure based on multiplication and squaring shows reduced costs in all cases. Table 2 summarizes the performance of the new formulas when only point addition and doubling are used, as is the case for the traditional binary and NAF methods. For the case where ternary bases are included in the computation of the scalar multiplication, Table 3 summarizes our results.

As we can see in both tables, our atomic operations achieve reduced computing costs by minimizing the number of required field additions and by replacing multiplications with squarings, in comparison to previous atomic operations using M-A-N-A, including cases where some savings can be achieved by the successive execution of doublings or triplings.

Remarkably, we also observe that, even in applications where squarings are considered as costly as multiplications (that is, 1S = 1M, if the same hardware multiplier is used to perform multiplication and squaring), our S-N-A-M-N-A-A-based atomic doubling and repeated doubling present a reduction of at least two field multiplications and four field additions. In the case of triplings, the cost would be the same (16M + 32A), but, when repeated triplings are computed, our approach is again superior, reducing the required number of multiplications and additions from (15w + 1)M + (30w + 2)A to (14w + 2)M + (28w + 4)A, which means an overall reduction of (w − 1) field multiplications and 2(w − 1) field additions. Point addition is still one multiplication more expensive than the traditional formulas. However, we expect that such a disadvantage is minimized due to the scarce occurrence of these operations during the scalar multiplication.

To have a more precise idea of the improvement that can be achieved with our atomic operations, we compare the performance when using a traditional scalar multiplication with the NAF method and a scalar d of length n = 160 bits. NAF has a nonzero density of approximately 1/3 [1]. Thus, a 160-bit NAF scalar multiplication approximately requires 159D + 53A. The numbers of required operations when using the new atomic formulas and the best previous atomic operations in [3], [19], [28] are detailed in Table 4 for the case where a hardware multiplier executes both multiplications and squarings (1S = 1M) and for the most common cases in software implementations (1S = 0.6M and 1S = 0.8M).

As can be seen in Table 4, our atomic operations perform significantly better than the previous operations in all of the studied scenarios. For instance, let us consider the case of an implementation using a modular hardware multiplier. In such a case, we have 1S = 1M and an A/M ratio as high as 0.2. Then, the new M-N-A-M-N-A-A structure presents a reduction of about 18.5 percent in comparison with a NAF scalar multiplication using previous atomic operations. Now, let us consider the case of an efficient software implementation such as the one presented in [5], [11], where addition is very cheap. In that case, we set 1A = 0.05M. Then, our approach presents significant reductions of about 22.2 percent (S/M = 0.8) and 30.2 percent (S/M = 0.6).

6 PARALLEL POINT OPERATIONS

In this section, we show that our methodology of replacing multiplications, proposed in Section 3, permits flexibly modifying the point doubling, addition, and tripling formulas to make them more efficient when implemented in a parallel architecture such as SIMD. In the following, we present formulas to compute three and four operations in parallel. It is important to note that, in the four-processor case, one multiplication can be replaced by up to two squarings since more computing resources are available and the introduction of squarings permits reducing the costs further.

Also, a new coordinate system that takes advantage of the inserted squarings and thus minimizes computing costs in parallel implementations is introduced: (X, Y, Z, X², Z², Z³/Z⁴). The fourth coordinate, X², will be required for doublings and triplings, and the sixth coordinate will be Z³ if the current operation is an addition and Z⁴ if the current operation is a doubling or tripling.
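The percentage comparisons above come down to simple operation-count arithmetic. The sketch below evaluates only the previous-operations side, using the M-A-N-A atomic counts quoted in the text (doubling 10M + 20A, addition 16M + 32A) and the stated NAF counts (159D + 53A for n = 160); the totals for the new structures live in Table 4 and are not reproduced here. The function name and the idea of expressing everything in M are illustrative choices.

```python
# A rough cost model for the 160-bit NAF comparison above: a NAF scalar
# multiplication takes about 159 doublings (D) and 53 additions (A_pt),
# since the NAF nonzero density is ~1/3. Costs are in field
# multiplications M, with the A/M ratio as a parameter.
def naf_cost(mul_per_D, add_per_D, mul_per_A, add_per_A, a_ratio):
    """Total cost in M of 159 doublings and 53 additions."""
    d = mul_per_D + add_per_D * a_ratio
    ad = mul_per_A + add_per_A * a_ratio
    return 159 * d + 53 * ad

# previous M-A-N-A atomic operations, hardware-multiplier scenario
# (1S = 1M, 1A = 0.2M) versus the cheap-addition ratio (1A = 0.05M)
prev = naf_cost(10, 20, 16, 32, 0.2)
prev_cheap = naf_cost(10, 20, 16, 32, 0.05)
print(f"previous atomic ops: {prev:.1f}M (1A=0.2M), {prev_cheap:.1f}M (1A=0.05M)")
```

Plugging the per-operation counts of any structure from Tables 2-4 into the same function gives the corresponding totals, from which the quoted percentage reductions follow.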
TABLE 5
Three-Processor Point Doubling

TABLE 7
Three-Processor Point Tripling
(a) Z3³ if the next operation is a point addition and Z3⁴ if the next operation is a doubling or tripling.

(α + 4β − X3)² − α² − (4β − X3)².

Although the number of squarings (and the cost for the sequential implementation) has been increased in (20), in a four-processor architecture this leads to higher processor utilization and a reduced or nil number of multiplications. The parallel doubling formula is shown in Table 8, with a total cost of 3S + 13A (see Appendix G, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm, for further details). If the following
operation is an addition, then the cost is slightly increased to 1M + 2S + 13A because we need to compute Z3³ in the third step, Processor 2.

TABLE 8
Four-Processor Point Doubling
(a) Z3³ if the next operation is a point addition and Z3⁴ if the next operation is a doubling or tripling.

TABLE 10
Four-Processor Point Tripling
(a) Z3³ if the next operation is a point addition and Z3⁴ if the next operation is a doubling or tripling.
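Every multiplication-for-squaring rewrite in these parallel formulas is an instance of identity (10), 2uv = (u + v)² − u² − v². A quick randomized check of the pattern used for terms of the form 2α(4β − X3) follows; the names α, β, X3 and the prime are placeholders, not values from the paper.

```python
# The rewrites used in the parallel formulas all follow
# 2uv = (u + v)^2 - u^2 - v^2; e.g. a term 2*alpha*(4*beta - X3)
# becomes (alpha + 4*beta - X3)^2 - alpha^2 - (4*beta - X3)^2,
# i.e. three squarings and some additions instead of a multiplication.
import random

random.seed(7)
p = 2 ** 255 - 19                      # any prime works; this one is illustrative
for _ in range(50):
    alpha, beta, X3 = (random.randrange(p) for _ in range(3))
    t = (4 * beta - X3) % p
    assert 2 * alpha * t % p == ((alpha + t) ** 2 - alpha ** 2 - t ** 2) % p
print("2*alpha*(4*beta - X3) via three squarings: OK")
```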
Additionally, the cost is reduced by two field additions to 3S + 11A if repeated doublings are performed or the following operation is a tripling, because the last two four-parallel field operations can be merged with the first two four-parallel operations of the following doubling or tripling.

In the case of point addition, we use the fast mixed addition given by (16). Following the strategy previously applied to the doubling formula, we modify (16) as follows:

X3 = 4α² − 4β³ − 8X1β², Y3 = 2α(4X1β² − X3) − 8Y1β³, Z3 = (Z1 + β)² − Z1² − β², (21)

where α = Z1³Y2 − Y1, β = Z1²X2 − X1, 2Y1β is computed as (Y1 + β)² − Y1² − β², and 2α(4X1β² − X3) is computed as (α + 4X1β² − X3)² − α² − (4X1β² − X3)².

The parallel addition formula is presented in Table 9 (see Appendix H, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm, for more details). Its total cost is 2M + 2S + 9A. If the following operation is a doubling or tripling, the cost is reduced by two additions to 2M + 2S + 7A.

In the case of the tripling, this operation can be implemented with the fast formula given by (17), fixing a = −3. We again follow the strategy applied to doubling and addition and show that, in this case, the replacement of all multiplications by squarings leads to the lowest cost. Formula (17) is modified as follows:

X3 = 16Y1²(2β − 2α) + 4X1ω², Y3 = 4Y1ψ, Z3 = (Z1 + ω)² − Z1² − ω², (22)

where

2α = (θ + ω)² − θ² − ω², 2β = 16Y1⁴, θ = 3X1² + aZ1⁴, ω = 6[(X1 + Y1²)² − X1² − Y1⁴] − θ², ψ = 2(2α − 2β)(4β − 2α) − 2ω³.

The next multiplications are computed as follows:

4X1ω² = 2[(X1 + ω²)² − X1² − ω⁴],
2ω³ = (ω + ω²)² − ω² − ω⁴,
16Y1²(2β − 2α) = (8Y1² + 2β − 2α)² − 64Y1⁴ − (2β − 2α)²,
4Y1ψ = (2Y1 + ψ)² − 4Y1² − ψ²,
2(2α − 2β)(4β − 2α) = 4β² − (2α − 2β)² − (4β − 2α)².

The parallel tripling formula has a total cost of 6S + 17A (Table 10; see Appendix I, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm). If the operation following a tripling is an addition, the cost is slightly increased to 1M + 5S + 17A. On the other hand, the cost is reduced by two additions to 6S + 15A if repeated triplings are performed or the following operation is a doubling.

TABLE 9
Four-Processor Mixed Addition

6.3 Performance Comparison

Table 11 summarizes the cost of the parallel point operations presented for the cases when three and four operations are executed simultaneously on an SIMD-based architecture. The results are compared to the previous proposals given in [26] and [27]. The most efficient scenario was given in [26], which developed cheaper three-processor SIMD doubling and mixed addition operations using the modified coordinate system (X, Y, Z, Z²). Similarly to our case, they used a = −3 for the doubling formula, without applying the factorization technique in (4). However, our flexible methodology has allowed further cost reductions by replacing some costly multiplications with squarings. This permits the reduction of doublings to 1M + 2S in the case of three-parallel operations, in comparison to the cost of 2M + 1S for the doubling given in [26]. For instance, for an n-bit
TABLE 11
Comparison of Different Parallel and Sequential Point Operations

NAF scalar multiplication, with an approximate cost of (n − 1) doublings and (n/3) additions, the formula in [26] would require roughly 3.3nM + 1nS, whereas our formulas require 2nM + 2.6nS. When 1S = 0.8M, there are no significant differences between both formulas (about 661M each if n = 160). However, for the case 1S = 0.6M, the formula in [26] costs 627M and ours only costs 574M. In comparison to costs using traditional sequential formulas (described in Section 2), 1,712M (if 1S = 0.8M) and 1,552M (if 1S = 0.6M), we get computing reductions of about 61 percent and 63 percent, respectively. In comparison with our fast sequential formulas presented in Section 4, 1,657M (if 1S = 0.8M) and 1,455M (if 1S = 0.6M), we get computing reductions of about 60 percent and 61 percent, respectively.

For the tripling, we have proposed, to our knowledge, the first approach for a parallel implementation. On a three-processor SIMD scheme, the proposed tripling performs twice as fast as a sequential implementation. For instance, if 1S = 0.8M, the traditional tripling in [3] costs 14.8M and our fast tripling in a sequential fashion, 12.6M. In contrast, the proposed three-processor tripling only costs 5.4M.

Furthermore, our methodology makes point operations suitable for architectures that compute four operations simultaneously. We have further reduced our three-parallel formulas, which are the most efficient to our knowledge, to achieve faster four-parallel formulas by trading one multiplication for one squaring in the case of a doubling, reducing one field multiplication in the case of mixed addition, and trading three multiplications for three squarings in the case of a tripling. For comparison purposes, if we consider a 160-bit NAF scalar multiplication, our four-processor formulas would require approximately 584M (if 1S = 0.8M) and 478M (if 1S = 0.6M). That means reductions of about 11 percent and 17 percent for each case, respectively, in comparison with the three-processor implementation. In comparison with the traditional sequential approach, we obtain reductions of about 66 percent and
4.8M, the tripling given in [3] costs 14.8M, and our fast tripling in a sequential fashion costs 12.6M.

7 PARALLEL SSCA-PROTECTED POINT OPERATIONS

Operations presented in the previous section are oriented to achieving the maximum speedup on SIMD-based implementations when SCA attacks are not a problem. In the present section, we propose a scheme with two-processor point operations protected against SSCA. Again, atomicity has been used to achieve the required level of security. We have investigated dependences among field operations in each point operation and concluded that architectures that are designed for executing two operations simultaneously are more efficiently exploited if squarings are also considered in the formula. Throughout this paper, our highly flexible methodology has been applied to yield improved formulas that permit the efficient introduction of squarings into the atomic block structure, as was shown in Section 5. Thus, our scheme arranges two field operations in parallel at each step, following the atomic structure given by S-N-A-A-M-N-A-A (introduced in Section 5.2), which has been found to efficiently accommodate all of the point operations in two-processor architectures. In the following, we call a block a parallel atomic block if it is able to execute two operations in parallel and follows the aforementioned atomic structure to protect against SSCA.

In the next paragraphs, we describe each parallel point operation. For each formula, the order of execution has been carefully arranged to yield the lowest cost.

As explained in Section 5, to achieve minimum costs, we require formulas with a balanced number of field multiplications and squarings. For the case of doubling, the traditional formula given by (3), with a = −3, is already balanced, with a cost of 4M + 4S. Thus, we only require two parallel atomic blocks, as each of these is capable of executing two field multiplications and two field squarings. The cost of the two-parallel atomic doubling is fixed at 2M + 2S + 8A. The details of this operation are shown in Appendix J, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm.

For the case of mixed addition, in Section 5, we derived the balanced formula (19) with a cost of 6M + 6S. We use this formula to obtain our two-parallel atomic addition with only four parallel atomic blocks. Thus, the total cost of the two-parallel addition is given by 4M + 4S + 16A (see further details in Appendix K, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm).

Similarly, in Section 4, we presented a balanced tripling formula given by (18) with a cost of 7M + 7S. Using this formula, we derive a two-parallel tripling protected against
70 percent for each case, respectively, which means that the SSCA with five parallel atomic blocks and a fixed cost in
four-processor scheme is about three times faster than the 5M þ 5S þ 20A (see Appendix L, which can be found in the
traditional sequential implementation. Computer Society Digital Library at http://computer.org/
On a four-processor scheme, the proposed tripling tc/archives.htm).
formula performs almost three times faster than the In Table 12, we show a sample execution of consecutive
sequential operation. For instance, if 1S ¼ 0:8M, the tripling, doubling, and addition operations in our proposed
proposed four-processor tripling costs, in most cases, ð6SÞ two-parallel SSCA-protected scheme, where one point
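As a quick sanity check on the block counts above, the fixed costs of the two-parallel operations follow directly from the S-N-A-A-M-N-A-A structure: each parallel atomic block places one squaring, one multiplication, and four additions on the critical path (negations are treated as free, as usual). A minimal tally in Python, where all helper names are ours and not from the paper:

```python
# Latency contributed by one S-N-A-A-M-N-A-A parallel atomic block:
# one squaring (S), one multiplication (M), four additions (A);
# negations (N) are treated as free.
BLOCK_LATENCY = {"M": 1, "S": 1, "A": 4}

def two_parallel_cost(num_blocks):
    """Latency of an operation built from `num_blocks` parallel atomic blocks."""
    return {op: num_blocks * c for op, c in BLOCK_LATENCY.items()}

def pretty(cost):
    return f"{cost['M']}M + {cost['S']}S + {cost['A']}A"

# Block counts stated in the text: doubling 2, mixed addition 4, tripling 5.
print(pretty(two_parallel_cost(2)))  # doubling -> 2M + 2S + 8A
print(pretty(two_parallel_cost(4)))  # addition -> 4M + 4S + 16A
print(pretty(two_parallel_cost(5)))  # tripling -> 5M + 5S + 20A
```

The tally reproduces the fixed costs 2M + 2S + 8A, 4M + 4S + 16A, and 5M + 5S + 20A quoted for the doubling, mixed addition, and tripling, respectively.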
300 IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 3, MARCH 2008
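Before turning to the tabulated comparisons, the three-parallel figures quoted earlier can be approximated from the per-bit costs alone. The sketch below is ours, not from the paper; it counts only the dominant per-bit multiplications and squarings, so the totals land slightly below the quoted 661M/627M/574M, which also include some cheaper operations:

```python
# Approximate cost of an n-bit scalar multiplication, measured in
# field-multiplication equivalents (M), from per-bit M and S counts.
def scalar_mul_cost(n, m_per_bit, s_per_bit, s_ratio):
    """s_ratio is the cost of one squaring relative to one multiplication."""
    return n * (m_per_bit + s_per_bit * s_ratio)

n = 160
for s_ratio in (0.8, 0.6):
    aoki = scalar_mul_cost(n, 3.3, 1.0, s_ratio)  # three-parallel formulas of [26]
    ours = scalar_mul_cost(n, 2.0, 2.6, s_ratio)  # our three-parallel formulas
    print(f"1S = {s_ratio}M: [26] ~ {aoki:.0f}M, ours ~ {ours:.0f}M")
```

As in the text, the two formulas are nearly tied at 1S = 0.8M, while ours is clearly cheaper at 1S = 0.6M.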
TABLE 13
Comparison of Performance of Parallel SSCA-Protected Methods with n = 160

26 percent, and 30 percent for the three presented cases, respectively, when compared to the best parallel method using the Montgomery ladder [23]. In comparison with the pipelined scheme [28], our parallel approach introduces reductions of approximately 17 percent, 25 percent, and 32 percent.

For comparison purposes, sequential atomic methods are also presented. In that case, our scheme introduces speedups of about 40 percent, 43 percent, and 43 percent for the three studied cases, respectively, in comparison with the new atomic implementation proposed in Section 5. The improvements are as high as 51 percent, 56 percent, and 60 percent when compared with the traditional atomic implementation based on M-A-N-A.

8 CONCLUSION
We have presented a highly flexible methodology for deriving fast formulas for the doubling, tripling, and addition operations, in which some multiplications are efficiently replaced by squarings for optimization purposes. Furthermore, we have shown that parallel schemes such as SIMD can greatly benefit from this flexible technique. For instance, a 160-bit NAF scalar multiplication with the introduced three- and four-parallel SIMD operations reduces computing costs by approximately 63 percent and 70 percent, respectively, compared with the traditional sequential approach. Also, we have protected our formulas against SSCA using innovative and highly efficient atomic structures in which squarings are also included. We have shown new atomic formulas that are cheaper and, more importantly, offer true protection against SSCA. For instance, in scalar multiplication using NAF, our atomic blocks speed up the computation by up to 30 percent in contrast to previous atomic implementations. Finally, by using the new atomic structure, a highly efficient two-parallel SSCA-protected scheme has been presented, with a computing reduction of up to 32 percent in the scalar multiplication in comparison with the previous best method using NAF.

APPENDIX
The appendices of this work can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm.

ACKNOWLEDGMENTS
The authors would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for partially supporting this work and the reviewers for their valuable comments.

REFERENCES
[1] D. Hankerson, A. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography. Springer, 2004.
[2] H. Cohen, A. Miyaji, and T. Ono, "Efficient Elliptic Curve Exponentiation Using Mixed Coordinates," Advances in Cryptology—Proc. ASIACRYPT '98, pp. 51-65, 1998.
[3] V. Dimitrov, L. Imbert, and P.K. Mishra, "Efficient and Secure Elliptic Curve Point Multiplication Using Double-Base Chains," Advances in Cryptology—Proc. ASIACRYPT '05, pp. 59-78, 2005.
[4] M. Ciet, M. Joye, K. Lauter, and P.L. Montgomery, "Trading Inversions for Multiplications in Elliptic Curve Cryptography," Designs, Codes, and Cryptography, vol. 39, no. 2, pp. 189-206, 2006.
[5] D. Bernstein, "High-Speed Diffie-Hellman, Part 2," presentation at INDOCRYPT '06, tutorial session, 2006.
[6] M. Brown, D. Hankerson, J. Lopez, and A. Menezes, "Software Implementation of the NIST Elliptic Curves over Prime Fields," Topics in Cryptology—CT-RSA '01, pp. 250-265, 2001.
[7] J. Großschädl, R. Avanzi, E. Savas, and S. Tillich, "Energy-Efficient Software Implementation of Long Integer Modular Arithmetic," Proc. Seventh Int'l Workshop Cryptographic Hardware and Embedded Systems, pp. 75-90, 2005.
[8] C.H. Lim and H.S. Hwang, "Fast Implementation of Elliptic Curve Arithmetic in GF(p^m)," Proc. Third Int'l Workshop Practice and Theory in Public Key Cryptography, pp. 405-421, 2000.
[9] C.H. Gebotys and R.J. Gebotys, "Secure Elliptic Curve Implementations: An Analysis of Resistance to Power-Attacks in a DSP Processor," Proc. Fifth Int'l Workshop Cryptographic Hardware and Embedded Systems, pp. 114-128, 2003.
[10] R. Avanzi, "Aspects of Hyperelliptic Curves over Large Prime Fields in Software Implementations," Proc. Sixth Int'l Workshop Cryptographic Hardware and Embedded Systems, pp. 148-162, 2004.
[11] D. Bernstein, "Curve25519: New Diffie-Hellman Speed Records," Proc. Ninth Int'l Conf. Theory and Practice of Public Key Cryptography, pp. 229-240, 2006.
[12] N. Gura, A. Patel, A. Wander, H. Eberle, and S.C. Shantz, "Comparing Elliptic Curve Cryptography and RSA on 8-Bit CPUs," Proc. Sixth Int'l Workshop Cryptographic Hardware and Embedded Systems, pp. 119-132, 2004.
[13] A. Woodbury, "Efficient Algorithms for Elliptic Curve Cryptosystems on Embedded Systems," MSc thesis, Worcester Polytechnic Inst., 2001.
[14] R. Avanzi, "Side Channel Attacks on Implementations of Curve-Based Cryptographic Primitives," Cryptology ePrint Archive, Report 2005/017, http://eprint.iacr.org/, 2005.
[15] J.S. Coron, "Resistance against Differential Power Analysis for Elliptic Curve Cryptosystems," Proc. First Int'l Workshop Cryptographic Hardware and Embedded Systems, pp. 292-302, 1999.
[16] P.Y. Liardet and N.P. Smart, "Preventing SPA/DPA in ECC Systems Using the Jacobi Form," Proc. Third Int'l Workshop Cryptographic Hardware and Embedded Systems, pp. 401-411, 2001.
[17] O. Billet and M. Joye, "The Jacobi Model of an Elliptic Curve and Side-Channel Analysis," Cryptology ePrint Archive, Report 2002/125, http://eprint.iacr.org/2002/125/, 2002.
[18] N.P. Smart, "The Hessian Form of an Elliptic Curve," Proc. Third Int'l Workshop Cryptographic Hardware and Embedded Systems, pp. 118-125, 2001.
[19] B. Chevallier-Mames, M. Ciet, and M. Joye, "Low-Cost Solutions for Preventing Simple Side-Channel Analysis: Side-Channel Atomicity," IEEE Trans. Computers, vol. 53, no. 6, pp. 760-768, June 2004.
[20] L. Batina, N. Mentens, B. Preneel, and I. Verbauwhede, "Balanced Point Operations for Side-Channel Protection of Elliptic Curve Cryptography," IEE Proc.—Information Security, vol. 152, no. 1, pp. 57-65, 2005.
[21] W. Fischer, C. Giraud, E.W. Knudsen, and J.-P. Seifert, "Parallel Scalar Multiplication on General Elliptic Curves over F_p Hedged against Non-Differential Side-Channel Attacks," IACR ePrint Archive, Report 2002/007, http://www.iacr.org, 2002.
[22] T. Izu and T. Takagi, "A Fast Parallel Elliptic Curve Multiplication Resistant against Side Channel Attacks," Proc. Fifth Int'l Workshop Practice and Theory in Public Key Cryptosystems, pp. 280-296, 2002.
[23] T. Izu and T. Takagi, "Fast Elliptic Curve Multiplications Resistant against Side Channel Attacks," IEICE Trans. Fundamentals, vol. E88-A, no. 1, pp. 161-171, 2005.
[24] R. Avanzi, H. Cohen, C. Doche, G. Frey, T. Lange, K. Nguyen, and F. Vercauteren, Handbook of Elliptic and Hyperelliptic Curve Cryptography. CRC Press, 2005.
[25] C.D. Walter, "Sliding Windows Succumbs to Big Mac Attack," Proc. Third Int'l Workshop Cryptographic Hardware and Embedded Systems, pp. 286-299, 2001.
[26] K. Aoki, F. Hoshino, T. Kobayashi, and H. Oguro, "Elliptic Curve Arithmetic Using SIMD," Proc. Fourth Int'l Conf. Information Security, pp. 235-247, 2001.
[27] T. Izu and T. Takagi, "Fast Elliptic Curve Multiplications with SIMD Operations," Proc. Fourth Int'l Conf. Information and Comm. Security, pp. 217-230, 2002.
[28] P.K. Mishra, "Pipelined Computation of Scalar Multiplication in Elliptic Curve Cryptosystems," IEEE Trans. Computers, vol. 55, no. 8, pp. 1000-1010, Aug. 2006.
[29] S.B. Xu and L. Batina, "Efficient Implementation of Elliptic Curve Cryptosystems on an ARM7 with Hardware Accelerator," Proc. Third Int'l Conf. Information and Comm. Security, pp. 266-279, 2001.
[30] K. Itoh, M. Takenaka, N. Torii, S. Temma, and Y. Kurihara, "Fast Implementation of Public-Key Cryptography on a DSP TMS320C6201," Proc. First Int'l Workshop Cryptographic Hardware and Embedded Systems, pp. 61-72, 1999.

Patrick Longa received the BSc degree in electrical engineering from the Catholic University of Peru in 1999 and the MSc degree from the University of Ottawa, where he conducted research in cryptography and DSP. Currently, he is beginning his PhD studies in electrical and computer engineering at the University of Waterloo. He worked as a researcher at the Catholic University of Peru and the Navy Industrial Services (SIMA). He is the author of eight research papers. His research interests include (curve-based) cryptography, security on portable devices, and computer architectures for signal processing and cryptography.

Ali Miri received the BSc and MSc degrees in mathematics from the University of Toronto in 1991 and 1993, respectively, and the PhD degree in electrical and computer engineering from the University of Waterloo in 1998. He is currently an associate professor with the School of Information Technology and Engineering (SITE) and the Department of Mathematics and Statistics at the University of Ottawa, Canada. He is also the director of the Computational Laboratory in Coding and Cryptography (CLiCC), University of Ottawa. His research interests include coding and information theory, applied number theory, and cryptography. He is a member of the Professional Engineers Ontario and the ACM and a senior member of the IEEE.