
IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 3, MARCH 2008

Fast and Flexible Elliptic Curve Point Arithmetic over Prime Fields

Patrick Longa and Ali Miri, Senior Member, IEEE

Abstract—We present an innovative methodology for accelerating elliptic curve point formulas over prime fields. This flexible technique replaces field multiplications with squarings and other cheaper operations, exploiting the fact that a field squaring is generally less costly than a multiplication. Applying this substitution to the traditional formulas, we obtain faster point operations in unprotected sequential implementations. We also show the significant impact our methodology has on protection against simple side-channel analysis (SSCA) attacks: we modify the elliptic curve cryptography (ECC) point formulas to achieve a faster atomic structure when applying side-channel atomicity. In contrast to previous atomic operations, which assume that squarings are indistinguishable from multiplications, our new atomic structure offers true SSCA protection because it includes squarings in its formulation. Moreover, we extend our implementation to parallel architectures such as Single-Instruction Multiple-Data (SIMD). With the introduction of a new coordinate system and the flexibility of our methodology, we present, to our knowledge, the fastest formulas for SIMD-based schemes that are capable of executing three and four operations simultaneously. Finally, a new parallel SSCA-protected scheme is proposed for multiprocessor/parallel architectures by applying the atomic structure presented in this work. Our parallel and atomic operations are shown to be significantly faster than previous implementations.

Index Terms—Elliptic curve, point arithmetic, side-channel attacks, atomicity, SIMD operation, parallel implementations.

1 INTRODUCTION

ELLIPTIC curve cryptography (ECC), independently introduced by Koblitz and Miller in the 1980s, has attracted increasing attention in recent years due to its shorter key length requirement in comparison with other public-key cryptosystems such as RSA. A shorter key length means reduced power consumption, computing effort, and storage requirements, factors that are fundamental in ubiquitous portable devices such as PDAs, cell phones, smart cards, and many others. To that end, a great deal of research has been carried out to speed up and improve ECC implementations, mainly focusing on the most important ECC operation: scalar multiplication. The structure of this operation involves three computational levels: the scalar multiplication algorithm, point arithmetic, and field arithmetic [1]. We will mainly focus on improvements at the point arithmetic level to speed up the ECC scalar multiplication.

ECC point arithmetic involves the efficient execution of doubling and addition operations. Significant effort to optimize formulas for those operations has been carried out through the last few years. In particular, projective coordinates have been shown to be highly effective in speeding up operations by eliminating the costly field inversion from the main loop of the scalar multiplication. This is achieved by introducing a third coordinate into the traditional (x, y)-based point representation known as affine coordinates. For point addition, a combination of projective and affine coordinates, namely, mixed addition [2], has yielded the most efficient formulas. In the case of adding points in the same coordinate system, the required formula is more costly and is referred to as general addition.

Recently, new approaches to compute faster scalar multiplications (double-base chains [3] and the ternary/binary method [4]) have introduced tripling as a new point operation. Dimitrov et al. [3] developed efficient tripling formulas.

In this work (see Section 4), we replace some expensive field multiplications with a few cheaper operations to accelerate the traditional point doubling, addition, and tripling formulas. Previous work presented in [5] makes use of a direct algebraic substitution to replace one "even" field multiplication (that is, a multiplication accompanied by a multiple of two, such as 2ab) by one squaring and three field additions/subtractions in the doubling and general addition formulas. However, our technique first optimally modifies the current formulas for doubling, addition (general and mixed addition), and tripling in such a way that allows maximum gain through the mentioned algebraic substitution, which is applied right after. It is important to note that it is widely accepted that one squaring is less computationally expensive than one multiplication on software platforms. In this case, most implementations report 1S ≈ 0.6M-0.8M [6], [7], [8], [9], [10]. In particular, some implementations using special primes and Optimal Extension Fields (OEFs) have reported S/M ratios as low as 0.6-0.67 [11], [12], [13]. Also, note that, for this part of our work (Section 4), we target applications where the cost of additions/subtractions can be considered negligible in comparison with that of multiplications. This is typically observed in software implementations where multiplication is carried out with no cryptocoprocessors or hardware accelerators. For details of addition/subtraction costs in an efficient implementation, the reader is referred to [5] and [11].

Besides efforts to speed up scalar multiplication, there are two additional and important areas of research in ECC: side-channel analysis (SCA) attacks and efficient implementations on parallel architectures. In the following sections, we present a brief description of these areas of research.

----------------------------------------
The authors are with the School of Information Technology and Engineering (SITE), University of Ottawa, Ottawa, ON K1N 6N5, Canada. E-mail: plong034@uottawa.ca, samiri@site.uottawa.ca.
Manuscript received 6 Mar. 2007; revised 31 July 2007; accepted 21 Aug. 2007; published online 6 Sept. 2007. Recommended for acceptance by E. Antelo.
For information on obtaining reprints of this article, please send e-mail to tc@computer.org, and reference IEEECS Log Number TC-2007-03-0073.
Digital Object Identifier no. 10.1109/TC.2007.70815.
0018-9340/08/$25.00 © 2008 IEEE. Published by the IEEE Computer Society.
----------------------------------------

1.1 Side-Channel Attacks

Side-channel information such as power dissipation and electromagnetic (EM) emission leaked by real-world devices has been shown to be highly useful for guessing private keys and effectively breaking the otherwise mathematically strong ECC cryptosystem [14]. There are two main strategies among these attacks: simple SCA (SSCA) and differential SCA (DSCA). We will focus on SSCA, which is based on the analysis of a single execution trace of a scalar multiplication, to guess the secret key by revealing the sequence in the execution of point operations.

Extensive research has been carried out to yield effective countermeasures. Among them, we could mention indistinguishable operations via dummy instructions, scalar multiplication with a fixed sequence of group operations (for example, Coron's double-and-add-always countermeasure [15]), unified addition and doubling formulas (for example, the Jacobi and Hessian forms [16], [17], [18]), and side-channel atomicity [19]. The first two methods are, in general, highly expensive and quite susceptible to fault attacks. Using unified addition and doubling formulas has the drawback of being expensive or relying on special curves that are different from the ones specified by international standards. A highly efficient variation of the scalar multiplication with a fixed sequence of group operations is based on the Montgomery Ladder [21], [22], [23]. However, similarly to the unified operation approach, the most efficient version of the Montgomery Ladder also relies on a nonstandardized curve form, namely, the Montgomery curve. Side-channel atomicity, proposed by Chevallier-Mames et al. [19], dissolves point operations into small homogeneous blocks, known as atomic blocks, which cannot be distinguished from each other through simple side-channel analysis because each one contains the same pattern of basic field operations. Furthermore, atomic blocks are made sufficiently small to make this approach inexpensive. It is important to note that, as pointed out in [19], if the leaking of the Hamming weight of the scalar is an issue, it can be avoided by applying some technique such as blinding. Also, we assume that transitions between blocks are carefully implemented to avoid the distinction of the different point operations. If the previous assumption cannot be met, then it could be advisable to consider the approach given in [20] and make the point operations have the same number of field operations. However, notice that this extra measure would introduce a significant overhead into the side-channel atomicity strategy.

Chevallier-Mames et al. [19] proposed the Multiplication-Addition-Negation-Addition (M-A-N-A) structure to build SSCA-protected formulas over prime fields. However, the main drawback of the traditional M-A-N-A structure is that it relies on the assumption that field multiplication and squaring are indistinguishable from each other. In software implementations, timing and power consumption have been shown to be quite different for these operations, making them directly distinguishable through power analysis [7], [9]. The following attack can be conceived in such a case. By observing only one EM or power trace, an attacker may be able to detect which portions of the scalar multiplication are in fact executing a squaring. With that knowledge, he/she can now gain access to the point doubling/addition sequence (and, consequently, to the secret key), given that the atomic addition has far more multiplications than squarings and its pattern of squarings/multiplications is very different from the pattern for the atomic doubling. Hardware platforms can be thought to be invulnerable to this attack when one hardware multiplier executes both field squarings and multiplications. However, some studies suggest that higher order DSCA attacks [24] can reveal differences between those operations by detecting data-dependent information through the observation of multiple sample times in the power trace. For instance, Walter [25] proposed a high-order DPA attack to defeat an RSA implementation by distinguishing multiplications with precomputed points from squarings and multiplications with random numbers through power analysis. This work suggests that, similarly, power traces of multiplications with random numbers can be distinguished from power traces of squarings. These conclusions can be directly applied to ECC cryptosystems and exploited to implement the attack described previously. Although more research is required to assess the effectiveness of these and related attacks in hardware implementations, a prudent developer would take some precautions when implementing side-channel atomicity [24, Section 29.4].

In this work (see Section 5), we exploit the flexibility given by our technique of replacing multiplications by squarings to propose a more efficient atomic structure that effectively takes squarings into account in its formulation. The latter makes our atomic operations not only faster but also invulnerable to the potential attack described in this section. The increase in speed comes from the fact that squarings are generally less expensive than multiplications and that our improved atomic structure permits us to pack more field operations into one atomic block.

1.2 Parallel Architectures

In recent years, a new design paradigm has arisen with the appearance of multiprocessor/parallel architectures, which can execute several operations simultaneously. This topic is becoming increasingly important since single-processor design is reaching its limits in terms of clock frequency. Among several parallel architectures, Single-Instruction Multiple-Data (SIMD) has become highly attractive since it generally avoids the higher hardware complexity needed in parallel architectures such as superscalar computers by leaving to the programmer the task of parallelizing the program execution. Hence, we can already find SIMD-based schemes in many popular processors such as Pentium, SPARC, and PowerPC.

Similarly to other systems, ECC can be adapted to parallel architectures at different algorithmic levels.
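The attack sketched in Section 1.1 can be made concrete with a toy model. In the sketch below, everything is schematic and of our own choosing: the leakage strings `DBL` and `ADD` are hypothetical patterns, picked only to reflect that the atomic addition is multiplication-heavy while the doubling is squaring-heavy; they are not the actual atomic patterns of [19]. If squarings (S) and multiplications (M) are distinguishable in a trace, the doubling/addition sequence of a double-and-add scalar multiplication, and hence the secret scalar, can be read off directly:

```python
# Schematic side-channel model: each point operation leaks its pattern of
# field operations. These patterns are illustrative placeholders only.
DBL = "SSMSM"   # hypothetical doubling pattern (squaring-heavy)
ADD = "MMMMS"   # hypothetical addition pattern (multiplication-heavy)

def leak(bits):
    """Trace of a left-to-right double-and-add: one DBL per bit after the
    MSB, followed by one ADD whenever the bit is 1."""
    trace = ""
    for bit in bits[1:]:
        trace += DBL
        if bit == "1":
            trace += ADD
    return trace

def recover(trace):
    """Attacker's view: parse the S/M trace back into the scalar bits."""
    bits, i = "1", 0
    while i < len(trace):
        assert trace[i:i + len(DBL)] == DBL   # every step starts with a doubling
        i += len(DBL)
        if trace[i:i + len(ADD)] == ADD:      # an addition reveals a 1-bit
            bits += "1"
            i += len(ADD)
        else:
            bits += "0"
    return bits

secret = "101101"
assert recover(leak(secret)) == secret
```

Because the two patterns start with different operation types, the parse is unambiguous; this is exactly the leakage that an atomic structure including squarings, as proposed in Section 5, is meant to remove.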
We focus our efforts on parallelizing ECC formulas at the point arithmetic level. In this regard, Aoki et al. [26] and Izu and Takagi [27] introduced efficient parallel point operations targeting SIMD-based processors. In [27], the authors presented formulas for two-processor architectures. The authors in [26] introduced modified Jacobian coordinates (X, Y, Z, Z²) to develop fast parallel formulas for platforms that can execute two and three operations simultaneously. Their parallel formulas are, to our knowledge, the fastest. However, the limitation of the previous works is that they rely on traditional point operation formulas, which are restricted to a fixed number of squarings and multiplications. The methodology of replacing multiplications introduced in this work will be shown to allow the development of superior parallel operations that are more efficient for multiprocessor/parallel execution. In this regard, we propose faster parallel formulas that are able to execute three and four operations simultaneously (see Section 6).

The previous approach targets unprotected implementations where side-channel analysis is not a concern. However, as previously stated, portable devices should be protected against SCA. In the last part of the present work, we deal with SSCA-protected implementations for parallel architectures. Similar efforts can be found in the literature. In [21] and [22], the authors presented efficient parallel schemes on generic curves over prime fields using the Montgomery Ladder, which is intrinsically protected against SSCA because every iteration in the main loop involves one doubling and one addition. An advantage of this method is that the formulas involve computation with the x-coordinate only. In particular, Fischer et al. [21] presented a more attractive scheme since it parallelizes doublings and additions at the field operation level, whereas Izu and Takagi [22] parallelize at the point operation level in every iteration of the main loop. The latter has the limitation that the cost of every iteration is determined by the most costly point operation, namely, addition. Later, Izu and Takagi [23] improved the previous proposals and introduced a unified Doubling-Addition formula for the Montgomery Ladder method. The composite formula was then efficiently parallelized. In [28], Mishra proposed a pipelined approach for generic curves over prime fields using the standard point arithmetic. In this scheme, each point operation is protected against SSCA using atomicity, and the atomic blocks are executed through a pipeline in which up to two atomic blocks can be computed simultaneously. Because a pipelined atomic operation can begin its execution before the previous atomic operation is complete, the effective time per point operation is significantly reduced to that of a few atomic blocks.

In this work (see Section 7), we propose a faster two-processor SSCA-protected scheme that introduces further cost reductions by using the enhanced atomic structure with squarings introduced in Section 5. As previously explained, our atomic structure not only offers true protection against SSCA by distinguishing multiplications from squarings but also allows us to pack more field operations into each block.

For the rest of this work, M, S, and A stand for the computing costs of field multiplication, squaring, and addition or subtraction, respectively. To simplify our cost analyses in the different sections, we will consider two possible scenarios:

1. The first scenario involves software-based implementations where squaring is faster than multiplication. In this case, we consider 1S = 0.6M or 1S = 0.8M.
2. The second scenario involves implementations on hardware platforms, or implementations where some built-in hardware is used to accelerate EC operations (for example, a modular hardware multiplier [29]). In this case, one multiplier executes both squarings and multiplications and, consequently, the ratio S/M is fixed at one.

Although A/M ratios vary widely from application to application, to simplify comparisons, we consider a low ratio for applications where multiplications are not favored by a fast hardware multiplier. In such a case, we use 1A ≈ 0.05M, as achieved in [11]. Otherwise, the cost of an addition may not be negligible, and we will consider the ratio achieved in [30], 1A ≈ 0.2M.

This paper is organized as follows: In Section 2, we introduce some basic concepts about ECC and present, for comparison purposes, the traditional point operation formulas. In Section 3, we present our methodology and, in Section 4, we apply it to develop fast formulas for point doubling, general addition, mixed addition, and tripling. In Section 5, new atomic structures are used to develop atomic formulas for the previous fast point operations. We then analyze in Section 6 the case of parallel architectures and develop SIMD-based formulas with three and four simultaneous operations. In Section 7, we continue with parallel architectures, but this time to develop an SSCA-protected scheme that computes two operations simultaneously. We end with some conclusions in Section 8.
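Under these conventions, the cost of any formula reduces to a single number of equivalent field multiplications. The following minimal sketch (the helper name `cost_in_M` is ours; the operation counts quoted in the comments are the doubling costs derived in Sections 2.1 and 4.1) shows how the comparisons in this paper are computed:

```python
def cost_in_M(m, s, a=0, s_ratio=0.8, a_ratio=0.05):
    """Cost of a formula counting m multiplications, s squarings, and
    a additions/subtractions, expressed in equivalent multiplications."""
    return m + s * s_ratio + a * a_ratio

# Traditional Jacobian doubling with a = -3 (Section 2.1): 4M + 4S
# Fast doubling proposed in Section 4 for a = -3:          3M + 5S
assert cost_in_M(4, 4, s_ratio=0.6) > cost_in_M(3, 5, s_ratio=0.6)   # scenario 1
assert cost_in_M(4, 4, s_ratio=1.0) == cost_in_M(3, 5, s_ratio=1.0)  # scenario 2
```

Under scenario 1 the squaring-heavy variant wins; under scenario 2 (S/M = 1) the trade is cost-neutral, which is what the balanced atomic structures of Section 5 exploit.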
2 PRELIMINARIES

An elliptic curve E over a field K is defined by the general Weierstrass equation:

E: y² + a1xy + a3y = x³ + a2x² + a4x + a6,    (1)

where a1, a2, a3, a4, a6 ∈ K.

The set of pairs (x, y) that solves (1), together with the point at infinity O, which is the identity for the group law, forms an abelian group. This group of points is used to implement ECC.

We can define ECC over different finite fields. In particular, we will work with a prime field Fp (the field with p elements, where p is a large prime). In this case, the general Weierstrass equation simplifies to the following:

E: y² = x³ + ax + b,    (2)

where a, b ∈ Fp and Δ = 4a³ + 27b² ≠ 0.

Let E be an elliptic curve over the finite field Fp. We can represent the main operation in ECC, namely, scalar multiplication, as Q = dP, where P and Q are points in E(Fp) and d is the secret scalar. Several algorithms, such as the binary, NAF, and sliding window methods, among many others, have been proposed to compute the scalar multiplication efficiently. In general, these methods rely on the execution of a given sequence of point doubling (2P) and addition (P + Q) operations. Recent methods such as scalar multiplication based on the ternary/binary method [4] or double-base chains [3] introduced the tripling of a point (3P) as an additional point operation.

The representation of points on the curve E with affine coordinates (x, y) introduces field inversions into the computation of point doubling and point addition. Inversions over prime fields are the most expensive field operation and are avoided as much as possible. Projective coordinates (X, Y, Z) solve that problem by adding a third coordinate Z to replace inversions with a few other field operations.

In the present work, we will use Jacobian coordinates, a special case of projective coordinates that has yielded very efficient point formulas. In the case of addition, we will also consider the most efficient case, which is obtained by adding two points in different point representations. In particular, we will analyze the case when one point is represented in Jacobian coordinates and the second point in affine coordinates (mixed addition with Jacobian-affine coordinates).

In the following sections, we introduce the traditional point operation formulas, which are later used for comparison with our proposed fast operations. It is important to note that some additional improvements in computational cost have been proposed for the tripling formulas.

2.1 Point Doubling in Jacobian Coordinates

Let P = (X1, Y1, Z1) be a point in Jacobian coordinates on the elliptic curve E. The point doubling 2P = (X3, Y3, Z3) can be computed by the following traditional formula:

X3 = α² − 2β, Y3 = α(β − X3) − 8Y1⁴, Z3 = 2Y1Z1,    (3)

where α = 3X1² + aZ1⁴ and β = 4X1Y1².

Thus, the cost of a doubling is 4M + 6S. Without loss of generality [1], we can consider the efficient case with a = −3 (see (2)). In that case, α is more efficiently computed as follows:

3X1² + aZ1⁴ = 3(X1 + Z1²)(X1 − Z1²).    (4)

By using (4), the cost of a doubling is reduced to 4M + 4S.

2.2 Point Addition in Jacobian Coordinates

Let P = (X1, Y1, Z1) and Q = (X2, Y2, Z2) be points in Jacobian coordinates on the elliptic curve E. The point addition P + Q = (X3, Y3, Z3) can be computed by

X3 = α² − β³ − 2Z2²X1β², Y3 = α(Z2²X1β² − X3) − Z2³Y1β³, Z3 = Z1Z2β,    (5)

where α = Z1³Y2 − Z2³Y1 and β = Z1²X2 − Z2²X1.

Given (5), the cost of the general addition is 12M + 4S.

2.3 Mixed Addition in Jacobian-Affine Coordinates

Let P = (X1, Y1, Z1) and Q = (X2, Y2) be two points in Jacobian and affine coordinates, respectively, on the elliptic curve E. The mixed addition P + Q = (X3, Y3, Z3) is traditionally obtained as follows:

X3 = α² − β³ − 2X1β², Y3 = α(X1β² − X3) − Y1β³, Z3 = Z1β,    (6)

where α = Z1³Y2 − Y1 and β = Z1²X2 − X1.

With (6), the cost of a mixed addition is fixed at 8M + 3S.

2.4 Point Tripling in Jacobian Coordinates

Dimitrov et al. [3] introduced a fast tripling formula that costs 10M + 6S. Let P = (X1, Y1, Z1) be a point in Jacobian coordinates on the elliptic curve E. The point tripling 3P = (X3, Y3, Z3) can be computed with the following:

X3 = 8Y1²(τ − θ) + X1ω², Y3 = Y1[4(θ − τ)(2τ − θ) − ω³], Z3 = Z1ω,    (7)

where θ = αω, τ = 8Y1⁴, α = 3X1² + aZ1⁴, and ω = 12X1Y1² − α².

We first remark that the same formula can be implemented more efficiently with 9M + 7S. On the other hand, Dimitrov et al. [3] proposed accelerating the computation by avoiding intermediate operations during the computation of the term aZ⁴ when repeated triplings are to be computed. This idea is based on a similar approach given in [2] with their modified Jacobian coordinates. However, it is straightforward to note that applying another well-known technique (fixing a = −3) actually gives the best performance. Thus, by computing α using the factorization technique given in (4), we reduce the cost of the tripling to 9M + 5S.

3 OUR FLEXIBLE METHODOLOGY

The following algebraic substitution holds for any elements a, b in a prime field:

ab = (1/2)[(a + b)² − a² − b²].    (8)

The first observation is that, if we apply (8) to the traditional formulas given in Section 2, one field multiplication would be replaced by three squarings, three additions/subtractions, and one division by two. Consequently, a direct replacement would be inefficient if we consider 1S = 0.8M or 1S = 0.6M. However, we will show that redundancy in the ECC point arithmetic formulas over prime fields removes the need to compute two out of the three squarings.

We still have to deal with the division by two. A direct solution is to transform it into a multiplication by the inverse 2⁻¹ mod p. However, this value is expected to be a very large number and consequently requires, at worst, a whole field multiplication to complete the execution according to (8).

To solve this problem more efficiently, we propose choosing another representative from the projective equivalence class that inserts multiples of two into the formula.
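Both the substitution and the cost of the halving are easy to check numerically. A minimal sketch, using a toy prime and arbitrary field elements of our own choosing, verifies identity (8) and shows that for an "even" product 2ab the division by two disappears entirely:

```python
p = 101          # toy prime; any odd prime field works the same way
a, b = 57, 88    # arbitrary field elements

# Substitution (8): ab = (1/2)[(a + b)^2 - a^2 - b^2]
half = pow(2, -1, p)   # 2^{-1} mod p is itself a full-size field element
assert (a * b) % p == half * ((a + b) ** 2 - a * a - b * b) % p

# "Even" product: 2ab = (a + b)^2 - a^2 - b^2, with no halving at all
assert (2 * a * b) % p == ((a + b) ** 2 - a * a - b * b) % p
```

The second identity is the one exploited in the next section: once a formula is rewritten so that the product appears with a factor of two, the multiplication is traded for a single extra squaring plus additions.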
The equivalence class, denoted by (X : Y : Z), that contains the projective coordinates (X, Y, Z) is [1]

(X : Y : Z) = {(λ^c X, λ^d Y, λZ) : λ ∈ K*, c, d ∈ Z⁺}.    (9)

If we define λ = 2, the previous strategy efficiently inserts multiples of two into the formula, which permits the transformation of the original algebraic substitution (8) into the following form for an "even" field multiplication, eliminating the division by two:

2ab = (a + b)² − a² − b².    (10)

Our flexible methodology can be summarized in two steps:

1. Modify, if necessary, the point formula by inserting multiples of two via the selection of the following representative of the equivalence class for Jacobian coordinates (for which c = 2 and d = 3):

(X : Y : Z) = {(2²X, 2³Y, 2Z)}.    (11)

2. Replace inserted or existing "even" field multiplications by applying the algebraic substitution given in (10), depending on the requirements of the targeted application.

The high flexibility of this methodology permits us to optimally adapt the point formulas to each application in such a way that the maximum cost reduction is achieved. In particular, we will show that the substitutions in Step 2 of our methodology are closely related to the targeted application. For instance, in unprotected sequential operations (see Section 4), we must replace one "even" multiplication by only one squaring so that the cost (and number of operations) is kept to the minimum possible, while, in parallel implementations (see Section 6), in some cases, we can replace one "even" multiplication by up to two squarings to take advantage of the multiple processing units and, in this way, reduce the cost further.

4 FAST POINT ARITHMETIC

In this section, we apply the methodology introduced in Section 3 to derive fast formulas for ECC point operations. As explained previously, to achieve maximum gain in a sequential software-based implementation, we should replace one multiplication with only one squaring.

4.1 Fast Point Doubling

Observing (3), we can easily detect two multiplications that can be directly replaced by squarings using the algebraic substitution (10): Z3 = 2Y1Z1 and β = 4X1Y1². Obviously, we do not have to worry about divisions by two in this case:

Z3 = 2Y1Z1 = (Y1 + Z1)² − Y1² − Z1²,
β = 4X1Y1² = 2[(X1 + Y1²)² − X1² − Y1⁴].

Note that each of the previous multiplications is replaced by only one squaring and some extra additions, since the rest of the squarings (Y1², Z1², X1², and Y1⁴) are already computed in the doubling formula. No other multiplication can be efficiently replaced by squarings, because doing so would add more than one squaring to the formula. The revised doubling formula is given as follows:

X3 = α² − 2β, Y3 = α(β − X3) − 8Y1⁴, Z3 = (Y1 + Z1)² − Y1² − Z1²,    (12)

where α = 3X1² + aZ1⁴ and β = 2[(X1 + Y1²)² − X1² − Y1⁴].

Given (12), the cost of a doubling is reduced from 4M + 6S to 2M + 8S, trading two field multiplications for two squarings.

If we fix a = −3, there is a further computing reduction by applying the factorization technique in (4). Note that the computation of X1² is then avoided and, consequently, computing β = 2[(X1 + Y1²)² − X1² − Y1⁴] is no longer an improvement, since one multiplication would be replaced by two squarings instead of only one. The doubling formula when a = −3 is given as follows:

X3 = α² − 2β, Y3 = α(β − X3) − 8Y1⁴, Z3 = (Y1 + Z1)² − Y1² − Z1²,    (13)

where α = 3(X1 + Z1²)(X1 − Z1²) and β = 4X1Y1².

In this case, the cost of a doubling is reduced from 4M + 4S to 3M + 5S.

Further improvement can be achieved for implementations where squaring is relatively cheap in comparison with multiplication (that is, 1S = 0.6M). In this case, considering a as sparse (sparse meaning of very low Hamming weight), (12) is the most efficient doubling formula, with a cost of only 1M + 8S, given that multiplication by the constant a can be computed with a few inexpensive field additions. This is even more efficient than the case a = −3 and less restrictive in the choice of a.

4.2 Fast Point Addition

By observing (5), we can quickly detect one multiplication that can be replaced by one squaring using the algebraic substitution (8): Z1Z2 in Z3. However, this term lacks a multiple of two to avoid the division by two. To solve this problem, we follow the methodology given in Section 3. Thus, the formula is first transformed as follows:

X3 = α² − 4β³ − 8Z2²X1β², Y3 = α(4Z2²X1β² − X3) − 8Z2³Y1β³, Z3 = 2Z1Z2β,    (14)

where α = 2(Z1³Y2 − Z2³Y1) and β = Z1²X2 − Z2²X1.

In this modified formula, the term Z1Z2 has been replaced by 2Z1Z2, allowing the computation

2Z1Z2 = (Z1 + Z2)² − Z1² − Z2².

Thus, the new addition formula is given as follows:

X3 = α² − 4β³ − 8Z2²X1β², Y3 = α(4Z2²X1β² − X3) − 8Z2³Y1β³, Z3 = θβ,    (15)
where α = 2(Z1³Y2 − Z2³Y1), β = Z1²X2 − Z2²X1, and θ = (Z1 + Z2)² − Z1² − Z2².

Given (15), the cost of an addition is reduced to 11M + 5S, trading one multiplication for one squaring in the traditional formula.

4.3 Fast Mixed Addition

By following the same methodology, we obtain the following revised formula for the mixed addition:

X3 = α² − 4β³ − 8X1β², Y3 = α(4X1β² − X3) − 8Y1β³, Z3 = (Z1 + β)² − Z1² − β²,    (16)

where α = 2(Z1³Y2 − Y1) and β = Z1²X2 − X1.

Note that the term Z3 = 2Z1β (from (14), with Z2 = 1) has been replaced by Z3 = (Z1 + β)² − Z1² − β², reducing the cost to 7M + 4S.

4.4 Fast Point Tripling

Substituting squarings for multiplications gives the greatest increase in speed in the tripling formula, given the rich redundancy found in this operation. If we apply our flexible methodology to (7), the new tripling operation can be expressed as follows:

X3 = 16Y1²(2τ − 2θ) + 4X1ω², Y3 = 8Y1[(2θ − 2τ)(4τ − 2θ) − ω³], Z3 = (Z1 + ω)² − Z1² − ω²,    (17)

where

2θ = (α + ω)² − α² − ω², 2τ = 16Y1⁴, α = 3X1² + aZ1⁴, ω = 6[(X1 + Y1²)² − X1² − Y1⁴] − α².

Thus, the cost of a tripling has been efficiently reduced from 10M + 6S to 6M + 10S.

A more efficient result is achieved by fixing a as a sparse number. In that case, the cost is further reduced to 5M + 10S.

Again, for the case a = −3, there is an additional reduction if we compute α using the factorization technique given in (4). Note that the computation of X1² is then avoided and, consequently, computing ω = 12X1Y1² − α² as 6[(X1 + Y1²)² − X1² − Y1⁴] − α² is no longer an improvement, since one multiplication would be replaced by two squarings instead of only one. Thus, the revised tripling formula when a = −3 is given as follows:

X3 = 16Y1²(2τ − 2θ) + 4X1ω², Y3 = 8Y1[(2θ − 2τ)(4τ − 2θ) − ω³], Z3 = (Z1 + ω)² − Z1² − ω²,    (18)

where

2θ = (α + ω)² − α² − ω², 2τ = 16Y1⁴, α = 3(X1 + Z1²)(X1 − Z1²), ω = 12X1Y1² − α².

In this case, the cost of the tripling is further reduced to 7M + 7S.

[TABLE 1: Cost of the Proposed Fast Point Operations in Comparison with Traditional Formulas. w = number of repeated triplings.]

Table 1 summarizes our achievements to this point. We distinguish three cases: when a has any possible value (first column, general case), when a is defined as sparse (second column), and when a = −3 (third column).

As can be seen, our formulas replace expensive multiplications with squarings. In some highly efficient cases, such as the proposed fast tripling, we replace up to four multiplications with four squarings in the original formula given in [3]. We remark that the improvement is more significant in applications where squarings are relatively very cheap in comparison with multiplications (that is, 1S = 0.6M). Also, we point out that defining a as sparse does not represent the most efficient case for the traditional operations. However, in our fast point formulas, defining a as sparse yields a doubling and a tripling that are more efficient than the general case or the case a = −3 if 1S = 0.6M.

In addition, Dimitrov et al. [3] proposed a repeated-tripling formula that efficiently trades one multiplication for two squarings at every repeated tripling. However, we can see in Table 1 that our fast tripling, with no extra modifications needed, is more efficient when consecutive triplings are to be computed. For instance, if 1S = 0.8M, the formula proposed in [3] needs (14.2w + 0.6) field multiplications, while our formula only requires (14w) field multiplications. If 1S = 0.6M, the advantage of our formula is further increased. Furthermore, the reader should note that even more efficient repeated triplings can be computed when our formula fixes a as sparse or a = −3.
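As a minimal sketch of the idea (ours, not the code of [19]; the prime and register names are illustrative): a side-channel-atomic implementation executes every point operation as a fixed sequence of identical blocks, padding with dummy operations where a formula needs no real one, so that a power trace shows only a uniform block pattern.

```python
# Illustration only (not from [19]): every atomic block performs the same
# fixed Multiplication-Addition-Negation-Addition pattern over a register
# file, with dummy entries simply rewriting a scratch register.
p = 0xFFFFFFFFFFFFFFC5  # illustrative prime (2^64 - 59)

def mana_block(r, mul, add1, neg, add2):
    """Each argument names a (dst, src1[, src2]) register operation."""
    d, a, b = mul;  r[d] = r[a] * r[b] % p        # M
    d, a, b = add1; r[d] = (r[a] + r[b]) % p      # A
    d, a = neg;     r[d] = -r[a] % p              # N
    d, a, b = add2; r[d] = (r[a] + r[b]) % p      # A

# e.g. one block that computes t = x*y and doubles it, with the negation
# and final addition acting on a scratch register s:
r = {"x": 3, "y": 5, "t": 0, "s": 0}
mana_block(r, ("t", "x", "y"), ("t", "t", "t"), ("s", "t"), ("s", "s", "s"))
```

Because every block issues the same operation sequence, a doubling (10 blocks) and an addition (16 blocks) differ only in block count, not in block shape.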
 
For efficiency purposes, squarings and multiplications are considered to be side-channel equivalent [3], [19], [28] and, consequently, the atomic block efficiency depends heavily on the cost of the multiplication. The general assumption is that multiplication and squaring are indistinguishable from a side-channel analysis point of view. However, as explained in Section 1, that is not generally the case. Hence, an efficient atomic block should consider squaring in its structure. With previous atomic formulas, such a consideration would have been very inefficient and expensive. However, the flexible methodology presented in this work permits modification of the point formulas in such a way that allows us to easily balance the number of squarings and multiplications and thus introduce squarings into the formulation. In this work, we present an innovative atomic structure based on Squaring-Negation-Addition-Multiplication-Negation-Addition-Addition (S-N-A-M-N-A-A) to build the point operations. We remark that this new structure is more efficient and truly protects against SSCAs because it takes into account the differences between field squarings and multiplications.

To achieve cheaper atomic operations with the proposed atomic structure, we first have to balance the number of squarings and multiplications in a given formula. The latter is intended to minimize the overall cost of each atomic operation. For instance, a traditional mixed addition costs 8M + 3S and, thus, its atomic form using S-N-A-M-N-A-A would require at least eight blocks, since there is a maximum of one multiplication per block. If we instead "balance" the number of squarings and multiplications, say, to 6M + 6S, then the requirement of the atomic implementation is reduced to only six blocks.

In the next paragraphs, we follow the previous approach in combination with our methodology of replacing multiplications by squarings to derive cheaper operations using the new atomic structure, first for the case of scalar multiplication using only radix 2 for the scalar expansion (for example, traditional NAF and wNAF [1]) and then for the case of expansions including ternary bases (for example, double-base chains [3]).

5.1 Case with Binary Bases

In the traditional point doubling formula, we observe that the number of multiplications and squarings is already balanced with the minimum number of operations when considering a = −3 (3): 4M + 4S. Thus, working with the aforementioned atomic structure (S-N-A-M-N-A-A), four atomic blocks are sufficient to accommodate the balanced doubling formula, as each atomic block contains one field multiplication and one field squaring. Therefore, the cost of the atomic doubling is 4M + 4S. The details are shown in Appendix A (which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm).

For the case of mixed addition, we have the fast formula (16), whose cost is 7M + 4S. We first need to balance the number of operations. That can be achieved if one multiplication is replaced by two squarings as follows: 2Z1³Y2 = (Z1³ + Y2)² − Z1⁶ − Y2², with the term Y2² precomputed. Thus, the balanced formula is given as follows:

    X3 = α² − 4β³ − 8X1β²,
    Y3 = α(4X1β² − X3) − 8Y1β³,                                       (19)
    Z3 = (Z1 + β)² − Z1² − β²,

where α = (Z1³ + Y2)² − Z1⁶ − Y2² − 2Y1 and β = Z1²X2 − X1.

The cost is now 6M + 6S, containing the minimum possible number of operations when the number of field multiplications and that of squarings are equivalent. The cost of the atomic addition is fixed at 6M + 6S, as only six atomic blocks are required to accommodate (19). The details of the atomic addition formula are shown in Appendix B, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm.

5.2 Case with Ternary Bases

Scalar multiplications that take advantage of ternary bases to accelerate computations [3], [4] require the tripling of a point besides doubling and addition. Given the high number of field additions found in the tripling formula (see (7), (17), and (18)), it is not possible to accommodate this operation with the optimal number of atomic blocks using the proposed S-N-A-M-N-A-A structure. Thus, an additional field addition per atomic block has been added to the previous atomic structure to permit the optimal performance: S-N-A-A-M-N-A-A. In this way, we can accommodate the tripling formula without requiring extra multiplications or squarings, resulting in the cheapest atomic implementation known to date.

In the following paragraphs, we present the modified atomic operations for scalar multiplications that require triplings, additions, and doublings, as is the case of the double-base chain algorithm.

In the case of the tripling operation, the optimal balance between multiplications and squarings can be found in (18), when a = −3, with a cost of 7M + 7S. Then, seven atomic blocks would be required to accommodate all of these operations. However, because of the internal dependences between field operations in the tripling formula, one extra block is necessary, making a total of eight atomic blocks. If repeated triplings are computed, it is possible to reduce the cost to (7w + 1)M + (7w + 1)S, where w is the number of repeated triplings, by merging the last block of the current tripling with the first block of the following tripling, saving one field multiplication and one field squaring at every additional tripling operation. The details of the atomic tripling and atomic repeated tripling are presented in Appendix C, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm.

In addition, the point doubling and addition operations must be modified according to the new atomic structure to make them suitable for scalar multiplications that include triplings in their computation. Basically, four and six extra additions have to be added to the (S-N-A-M-N-A-A)-based atomic doubling and mixed addition of Section 5.1, as detailed in Appendices A and B, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm.

5.3 Performance Comparison

Chevallier-Mames et al. [19] proposed atomic doublings and additions built with 10 and 16 M-A-N-A atomic blocks,

[TABLE 2. Cost of the Proposed and Previous Atomic Operations (Scalar Multiplications Using Radix 2)]

[TABLE 3. Cost of the Proposed and Previous Atomic Operations (Scalar Multiplications Using Radices 2 and 3). w = number of repeated doublings or triplings.]

respectively. Thus, a doubling costs 10M + 20A and an addition, 16M + 32A. Mishra [28] presented an improved atomic operation for the case of mixed addition: the number of atomic blocks for addition was reduced to 11, with a total cost of 11M + 22A. Later, Dimitrov et al. [3] presented a fast tripling formula with 16 atomic blocks using the same atomic structure, with a total cost of 16M + 32A. In comparison, our enhanced atomic structure based on multiplication and squaring shows reduced costs in all cases. Table 2 summarizes the performance of the new formulas when only point addition and doubling are used, as is the case for the traditional binary and NAF methods. For the case where ternary bases are included in the computation of the scalar multiplication, Table 3 summarizes our results.

As we can see in both tables, our atomic operations achieve reduced computing costs by minimizing the number of required field additions and by replacing multiplications with squarings, in comparison to previous atomic operations using M-A-N-A, including cases where some savings can be achieved by the successive execution of doublings or triplings.

Remarkably, we also observe that, even in applications where squarings are considered as costly as multiplications (that is, 1S = 1M, if the same hardware multiplier is used to perform multiplication and squaring), our S-N-A-M-N-A-A-based atomic doubling and repeated doubling present a reduction of at least two field multiplications and four field additions. In the case of triplings, the cost would be the same (16M + 32A) but, when repeated triplings are computed, our approach is again superior, reducing the required number of multiplications and additions from (15w + 1)M + (30w + 2)A to (14w + 2)M + (28w + 4)A, which means an overall reduction of (w − 1) field multiplications and 2(w − 1) field additions. Point addition is still one multiplication more expensive than the traditional formulas. However, we expect that such a disadvantage is minimized due to the scarce occurrence of these operations during the scalar multiplication.

[TABLE 4. Performance of New Atomic Operations in Comparison with Previous Atomic Formulas (NAF Method, n = 160 bits)]

To have a more precise idea of the improvement that can be achieved with our atomic operations, we compare the performance when using a traditional scalar multiplication with the NAF method and a scalar d of length n = 160 bits. NAF has a nonzero density of approximately 1/3 [1]. Thus, a 160-bit NAF scalar multiplication requires approximately 159D + 53A. The numbers of required operations when using the new atomic formulas and the best previous atomic operations in [3], [19], [28] are detailed in Table 4 for the case where a hardware multiplier executes both multiplications and squarings (1S = 1M) and for the most common cases in software implementations (1S = 0.6M and 1S = 0.8M).

As can be seen in Table 4, our atomic operations perform significantly better than the previous operations in all of the studied scenarios. For instance, let us consider the case of an implementation using a modular hardware multiplier. In such a case, we have 1S = 1M and an A/M ratio as high as 0.2. Then, the new S-N-A-M-N-A-A structure presents a reduction of about 18.5 percent in comparison with a NAF scalar multiplication using previous atomic operations. Now, let us consider the case of an efficient software implementation such as the one presented in [5], [11], where addition is very cheap. In that case, we set 1A = 0.05M. Then, our approach presents significant reductions of about 22.2 percent (S/M = 0.8) and 30.2 percent (S/M = 0.6).

6 PARALLEL POINT OPERATIONS

In this section, we show that our methodology of replacing multiplications with squarings, proposed in Section 3, permits flexibly modifying the point doubling, addition, and tripling formulas to make them more efficient when implemented on a parallel architecture such as SIMD. In the following, we present formulas to compute three and four operations in parallel. It is important to note that, in the four-processor case, one multiplication can be replaced by up to two squarings, since more computing resources are available and the introduction of squarings permits reducing the costs further.

Also, a new coordinate system that takes advantage of the inserted squarings, and thus minimizes computing costs in parallel implementations, is introduced: (X, Y, Z, X², Z², Z³/Z⁴). The fourth coordinate, X², will be required for doublings and triplings, and the sixth coordinate will be Z³ if the current operation is an addition and Z⁴ if the current operation is a doubling or tripling.
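The core device used throughout these sections — trading a product 2ab for the squaring identity (a + b)² − a² − b², which pays off whenever a² and b² are cached or reused — together with the cached coordinates just described, can be illustrated with a short sketch (ours, not from the paper's appendices; the prime and helper names are illustrative):

```python
# Illustrative sketch: 2ab from squarings, plus the extended point
# representation (X, Y, Z, X^2, Z^2, Z^3/Z^4) that keeps squares on hand.
p = 2**255 - 19  # any odd prime; this choice is just an example

def mul2(a, b, a2=None, b2=None):
    """Return 2ab mod p using one squaring (plus cached a^2, b^2)."""
    a2 = a * a % p if a2 is None else a2
    b2 = b * b % p if b2 is None else b2
    return (pow(a + b, 2, p) - a2 - b2) % p

def extend(X, Y, Z, next_is_addition):
    """Attach the cached coordinates: X^2, Z^2, and Z^3 (before an
    addition) or Z^4 (before a doubling or tripling)."""
    Z2 = Z * Z % p
    tail = Z2 * Z % p if next_is_addition else Z2 * Z2 % p
    return (X, Y, Z, X * X % p, Z2, tail)
```

For example, the trade 2X1Y1² = (X1 + Y1²)² − X1² − Y1⁴ used in the doubling and tripling formulas is `mul2(X1, Y1**2 % p)` with X1² and Y1⁴ already available from the cache.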

[TABLE 5. Three-Processor Point Doubling. (a) Z3³ if the next operation is a point addition and Z3⁴ if the next operation is a doubling or tripling.]

6.1 Three-Processor Formulas

For the parallel doubling operation, we use the fast formula (12), considering a = −3 for optimal performance. However, we do not use the factorization technique given in (4), with the objective of taking advantage of the new coordinate system, which already includes X² and Z⁴ as precomputed terms. The parallel doubling formula is shown in Table 5. Only squarings and multiplications are shown. For a detailed description, the reader is referred to Appendix D, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm.

The total cost of the parallel doubling is 1M + 2S + 11A. This cost is reduced by one addition to 1M + 2S + 10A if repeated doublings are being computed or the following operation is a tripling. This reduction is possible because the last three-parallel operations of the doubling can be merged with the first three-parallel operations of the following doubling or tripling.

For the parallel addition formula, we consider the efficient case with mixed coordinates given by the fast addition formula (15). The parallel addition formula is shown in Table 6, with a cost of 3M + 2S + 8A. This cost is reduced by one addition to 3M + 2S + 7A if a doubling or tripling is computed right after the addition. Similarly to the three-processor doubling, the last three-parallel operations of the addition can be merged with the first three-parallel operations of the following doubling or tripling (see Appendix E, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm, for details).

[TABLE 6. Three-Processor Point Addition]

[TABLE 7. Three-Processor Point Tripling. (a) Z3³ if the next operation is a point addition and Z3⁴ if the next operation is a doubling or tripling.]

For the parallel tripling operation, we consider (17) with a = −3. The parallel tripling formula is shown in Table 7 and, similarly to the doubling case, a is defined as sparse with a fixed value of −3, but the factorization technique in (4) is not applied, to achieve the maximal utilization of the introduced coordinate system. The total cost of the parallel tripling is 3M + 3S + 14A (see Appendix F, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm, for further details).

6.2 Four-Processor Formulas

For the parallel doubling formula, and similarly to the case of three processors, we use (12) with a = −3 and avoid the factorization technique (4). However, this time we can maximize processor utilization by replacing additional multiplications by two squarings each. As was discussed previously, in a sequential architecture this would lead to higher costs. However, we remark that this strategy leads to further computing time reductions in four-processor architectures, since not all processors are being used all the time, and the extra squarings can be accommodated by inactive processors. Before proceeding, we have to modify the fast doubling formula (12) to make it suitable for more squaring-for-multiplication replacements. Applying the strategy given in Section 3, the revised formula is defined as follows:

    X3 = 4α² − 8δ,
    Y3 = 2α(4δ − X3) − 64Y1⁴,                                          (20)
    Z3 = 2[(Y1 + Z1)² − Y1² − Z1²],

where α = 3X1² + aZ1⁴, 4δ = 8[(X1 + Y1²)² − X1² − Y1⁴], and 2α(4δ − X3) is computed as (α + 4δ − X3)² − α² − (4δ − X3)².

Although the number of squarings (and the cost for a sequential implementation) has been increased in (20), in a four-processor architecture this leads to higher processor utilization and a reduced or nil number of multiplications. The parallel doubling formula is shown in Table 8, with a total cost of 3S + 13A (see Appendix G, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm, for further details).
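As a numerical sanity check (our addition, not the appendix code; the prime is illustrative), the revised doubling (20) can be compared against the textbook Jacobian doubling: since (20) is the standard result scaled by λ = 2, i.e., (4X3, 8Y3, 2Z3), both must yield the same affine point.

```python
p = 2**61 - 1          # illustrative prime field
a_curve = p - 3        # the a = -3 case used above

def std_double(X1, Y1, Z1):
    """Textbook Jacobian doubling on y^2 = x^3 + a*x + b."""
    M = (3*X1*X1 + a_curve*pow(Z1, 4, p)) % p
    S = 4*X1*Y1*Y1 % p
    X3 = (M*M - 2*S) % p
    Y3 = (M*(S - X3) - 8*pow(Y1, 4, p)) % p
    return X3, Y3, 2*Y1*Z1 % p

def fast_double(X1, Y1, Z1):
    """Equation (20): every product rewritten as squarings."""
    X1s, Y1s, Z1s = X1*X1 % p, Y1*Y1 % p, Z1*Z1 % p
    alpha = (3*X1s + a_curve*Z1s*Z1s) % p
    delta4 = 8*((X1 + Y1s)**2 - X1s - Y1s*Y1s) % p   # 4*delta
    X3 = (4*alpha*alpha - 2*delta4) % p              # 4a^2 - 8*delta
    t = (delta4 - X3) % p
    Y3 = ((alpha + t)**2 - alpha*alpha - t*t - 64*Y1s*Y1s) % p
    return X3, Y3, 2*((Y1 + Z1)**2 - Y1s - Z1s) % p

def affine(X, Y, Z):
    zi = pow(Z, -1, p)
    return X*zi*zi % p, Y*zi*zi*zi % p

assert affine(*std_double(7, 11, 13)) == affine(*fast_double(7, 11, 13))
```

The check holds for arbitrary (X1, Y1, Z1) with Y1, Z1 nonzero, because the two routines compute the same rational functions up to the Jacobian scaling (λ²X, λ³Y, λZ).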

[TABLE 8. Four-Processor Point Doubling. (a) Z3³ if the next operation is a point addition and Z3⁴ if the next operation is a doubling or tripling.]

If the following operation is an addition, then the cost is slightly increased to 1M + 2S + 13A, because we need to compute Z3³ in the third step, Processor 2. Additionally, the cost is reduced by two field additions to 3S + 11A if repeated doublings are performed or the following operation is a tripling, because the last two four-parallel field operations can be merged with the first two four-parallel operations of the following doubling or tripling.

In the case of point addition, we use the fast mixed addition given by (16). Following the strategy previously applied to the doubling formula, we modify (16) as follows:

    X3 = 4α² − 4β³ − 8X1β²,
    Y3 = 2α(4X1β² − X3) − 8Y1β³,                                       (21)
    Z3 = (Z1 + β)² − Z1² − β²,

where α = Z1³Y2 − Y1, β = Z1²X2 − X1, 2Y1β³ is computed as (Y1 + β³)² − Y1² − β⁶, and 2α(4X1β² − X3) is computed as (α + 4X1β² − X3)² − α² − (4X1β² − X3)².

The parallel addition formula is presented in Table 9 (see Appendix H, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm, for more details). Its total cost is 2M + 2S + 9A. If the following operation is a doubling or tripling, the cost is reduced by two additions to 2M + 2S + 7A.

[TABLE 9. Four-Processor Mixed Addition]

In the case of the tripling, this operation can be implemented with the fast formula given by (17), fixing a = −3. We again follow the strategy applied to doubling and addition and show that, in this case, the replacement of all multiplications by squarings leads to the lowest cost. Formula (17) is modified as follows:

    X3 = 16Y1²(2β − 2αω) + 4X1ω²,
    Y3 = 4Y1θ,                                                         (22)
    Z3 = (Z1 + ω)² − Z1² − ω²,

where

    2αω = (α + ω)² − α² − ω²,    2β = 16Y1⁴,
    α = 3X1² + aZ1⁴,    ω = 6[(X1 + Y1²)² − X1² − Y1⁴] − α²,
    θ = 2(2αω − 2β)(4β − 2αω) − 2ω³.

The next multiplications are computed as follows:

    4X1ω² = 2[(X1 + ω²)² − X1² − ω⁴],
    2ω³ = (ω + ω²)² − ω² − ω⁴,
    16Y1²(2β − 2αω) = (8Y1² + 2β − 2αω)² − 64Y1⁴ − (2β − 2αω)²,
    4Y1θ = (2Y1 + θ)² − 4Y1² − θ²,
    2(2αω − 2β)(4β − 2αω) = 4β² − (2αω − 2β)² − (4β − 2αω)².

The parallel tripling formula has a total cost of 6S + 17A (Table 10; see Appendix I, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm). If the operation following a tripling is an addition, the cost is slightly increased to 1M + 5S + 17A. On the other hand, the cost is reduced by two additions to 6S + 15A if repeated triplings are performed or the following operation is a doubling.

[TABLE 10. Four-Processor Point Tripling. (a) Z3³ if the next operation is a point addition and Z3⁴ if the next operation is a doubling or tripling.]

6.3 Performance Comparison

Table 11 summarizes the costs of the parallel point operations presented for the cases when three and four operations are executed simultaneously on an SIMD-based architecture. The results are compared to the previous proposals given in [26] and [27]. The most efficient scenario was given in [26], which developed cheaper three-processor SIMD doubling and mixed addition operations using the modified coordinate system (X, Y, Z, Z²). Similarly to our case, they used a = −3 for the doubling formula without applying the factorization technique in (4). However, our flexible methodology has allowed further cost reductions by replacing some costly multiplications with squarings. This permits the reduction of doublings to 1M + 2S in the case of three-parallel operations, in comparison to the cost of 2M + 1S for the doubling given in [26].
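The per-operation trade-off just described can be checked with a one-line cost model (an illustrative helper of ours, not from Table 11; costs are in multiples of M, with s = S/M and additions ignored):

```python
def op_cost(muls, sqrs, s):
    """Cost of a point operation in field multiplications, for a
    squaring/multiplication ratio s = S/M (additions ignored)."""
    return muls + s * sqrs

# Three-processor doubling: 1M + 2S here versus 2M + 1S in [26] --
# ours is never more expensive, and is strictly cheaper whenever
# squaring is cheaper than multiplication (s < 1).
for s in (0.6, 0.8, 1.0):
    assert op_cost(1, 2, s) <= op_cost(2, 1, s)
```

Since 1 + 2s ≤ 2 + s exactly when s ≤ 1, the advantage grows as squaring gets cheaper relative to multiplication.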

[TABLE 11. Comparison of Different Parallel and Sequential Point Operations]

For instance, for an n-bit NAF scalar multiplication, with an approximate cost of (n − 1) doublings and (n/3) additions, the formulas in [26] would require roughly 3.3nM + 1nS, whereas our formulas require 2nM + 2.6nS. When 1S = 0.8M, there are no significant differences between both formulas (about 661M each if n = 160). However, for the case 1S = 0.6M, the formula in [26] costs 627M and ours only 574M. In comparison to the costs using the traditional sequential formulas (described in Section 2), 1,712M (if 1S = 0.8M) and 1,552M (if 1S = 0.6M), we get computing reductions of about 61 percent and 63 percent, respectively. In comparison with our fast sequential formulas presented in Section 4, 1,657M (if 1S = 0.8M) and 1,455M (if 1S = 0.6M), we get computing reductions of about 60 percent and 61 percent, respectively.

For the tripling, we have proposed, to our knowledge, the first approach for a parallel implementation. On a three-processor SIMD scheme, the proposed tripling performs twice as fast as a sequential implementation. For instance, if 1S = 0.8M, the traditional tripling in [3] costs 14.8M and our fast tripling in a sequential fashion, 12.6M. In contrast, the proposed three-processor tripling costs only 5.4M.

Furthermore, our methodology makes the point operations suitable for architectures that compute four operations simultaneously. We have further reduced our three-parallel formulas, which are the most efficient to our knowledge, to achieve faster four-parallel formulas by trading one squaring for one multiplication in the case of a doubling, reducing one field multiplication in the case of mixed addition, and trading three squarings for three multiplications in the case of a tripling. For comparison purposes, if we consider a 160-bit NAF scalar multiplication, our four-processor formulas would require approximately 584M (if 1S = 0.8M) and 478M (if 1S = 0.6M). That means reductions of about 11 percent and 17 percent for each case, respectively, in comparison with the three-processor implementation. In comparison with the traditional sequential approach, we obtain reductions of about 66 percent and 70 percent for each case, respectively, which means that the four-processor scheme is about three times faster than the traditional sequential implementation.

On a four-processor scheme, the proposed tripling formula performs almost three times faster than the sequential operation. For instance, if 1S = 0.8M, the proposed four-processor tripling costs, in most cases, (6S) 4.8M, the tripling given in [3] costs 14.8M, and our fast tripling in a sequential fashion costs 12.6M.

7 PARALLEL SSCA-PROTECTED POINT OPERATIONS

The operations presented in the previous section are oriented to achieving the maximum speedup on SIMD-based implementations when SCA attacks are not a concern. In the present section, we propose a scheme with two-processor point operations protected against SSCA. Again, atomicity has been used to achieve the required level of security. We have investigated the dependences among field operations in each point operation and concluded that architectures designed for executing two operations simultaneously are more efficiently exploited if squarings are also considered in the formula. Throughout this paper, our highly flexible methodology has been applied to yield improved formulas that permit the efficient introduction of squarings into the atomic block structure, as was shown in Section 5. Thus, our scheme arranges two field operations in parallel at each step, following the atomic structure given by S-N-A-A-M-N-A-A (introduced in Section 5.2), which has been found to efficiently accommodate all of the point operations on two-processor architectures. In the following, we call a block a parallel atomic block if it is able to execute two operations in parallel and follows the aforementioned atomic structure to protect against SSCA.

In the next paragraphs, we describe each parallel point operation. For each formula, the order of execution has been carefully arranged to yield the lowest cost.

As explained in Section 5, to achieve minimum costs, we require formulas with a balanced number of field multiplications and squarings. For the case of doubling, the traditional formula given by (3), with a = −3, is already balanced, with a cost of 4M + 4S. Thus, we only require two parallel atomic blocks, as each of these is capable of executing two field multiplications and two field squarings. The cost of the two-parallel atomic doubling is fixed at 2M + 2S + 8A. The details of this operation are shown in Appendix J, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm.

For the case of mixed addition, in Section 5 we derived the balanced formula (19) with a cost of 6M + 6S. We use this formula to obtain our two-parallel atomic addition with only four parallel atomic blocks. Thus, the total cost of the two-parallel addition is given by 4M + 4S + 16A (see further details in Appendix K, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm).

Similarly, in Section 4 we presented a balanced tripling formula given by (18) with a cost of 7M + 7S. Using this formula, we derive a two-parallel tripling protected against SSCA with five parallel atomic blocks and a fixed cost of 5M + 5S + 20A (see Appendix L, which can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm).

In Table 12, we show a sample execution of consecutive tripling, doubling, and addition operations in our proposed two-parallel SSCA-protected scheme, where one point

[TABLE 12. EC Point Operations in the Proposed Two-Parallel SSCA-Protected Scheme. (a) Atomic point addition. (b) Atomic point doubling. (c) Atomic point tripling.]

operation is executed in parallel at a time. We denote the parallel atomic block x executed by processor y by x;y. Five, two, and four parallel atomic blocks are required to complete a tripling, a doubling, and an addition, respectively.

7.1 Performance Comparison

Several efforts to protect parallel implementations against SSCA can be found in the literature. In [21] and [22], the authors presented efficient parallel SSCA-protected schemes using the Montgomery ladder method over prime fields. In general, for an n-bit scalar multiplication, the Montgomery ladder method requires (n − 1) iterations. Fischer et al. [21] presented a parallel doubling-and-addition execution with a cost of 10M + 8A. Thus, the scalar multiplication would cost 10(n − 1)M + 8(n − 1)A. If n = 160 bits, then the total cost is 1,590M + 1,272A. On the other hand, the method given in [22] fixes the cost of every iteration to one point addition. The doubling and addition formulas in [22] cost 6M + 3S + 6A and 8M + 2S + 7A, respectively. Since one extra doubling is required to complete the scalar multiplication, the cost of the scalar multiplication using the Montgomery ladder would be one doubling and (n − 1) additions, which is equivalent to

    (6M + 3S + 6A) + (n − 1)(8M + 2S + 7A) = 1,278M + 321S + 1,119A    (n = 160).

Later, Izu and Takagi [23] improved the proposals in [21] and [22] and proposed a unified doubling-addition formula for the Montgomery ladder. The composite formula was efficiently parallelized with a cost of 7M + 2S + 8A. Thus, a scalar multiplication would cost

    (n − 1)(7M + 2S + 8A) = 1,113M + 318S + 1,272A    (n = 160).

In [28], Mishra proposed a pipelined approach for generic curves over prime fields using the standard point arithmetic. The pipeline scheme was protected against SSCA using atomicity. Because a pipelined atomic operation can begin its execution before the previous atomic operation is complete, the total throughput is reduced to only six atomic blocks. In this work, each atomic block had the traditional M-A-N-A structure. Since the cost of using the NAF method is (n − 1) doublings and (n/3) additions, the pipelined method costs 10 + 6(n + n/3 − 2) atomic blocks [28], which is equivalent to 1,278 atomic blocks when n = 160. As each atomic block contains one multiplication and two additions, the total cost of this method is fixed at 1,278M + 2,556A.

In contrast to [28], our proposed parallel SSCA-protected scheme introduces further cost reductions by including squarings in the atomic structure. As was shown in the previous sections, our atomic structure not only offers true protection against SSCA by distinguishing multiplications from squarings, but also allows us to pack more field operations per block. Furthermore, our parallel approach is shown to be superior to the pipelined one in [28]. The pipeline reduces the throughput to six atomic blocks by beginning each point operation as soon as possible, with a maximum of two processes being computed simultaneously. Hence, the main obstacle to achieving cheaper execution is given by interoperation dependencies (dependencies found between consecutive point operations). In contrast, our approach parallelizes field operations inside each point operation and, hence, the cost of the atomic formulas is mainly defined by intraoperation dependencies (dependencies inside each point operation). We have carefully analyzed both kinds of dependency and concluded that interoperation dependencies in the point arithmetic over prime fields are more restrictive. Thus, even though throughput is effectively reduced to six atomic blocks (about six field multiplications) in [28], with the parallel approach we have doublings and additions that are executed with only 2M + 2S and 4M + 4S, respectively. The reader should note that doublings are more frequently required in efficient scalar multiplication methods and, hence, our method would be superior even in the case 1S = 1M, making our approach not only more secure but also faster in hardware-based implementations.

In an n-bit NAF scalar multiplication, our scheme costs 2(n − 1) + 4(n/3) atomic blocks, since doublings and additions require two and four parallel atomic blocks, respectively. Consequently, if n = 160, it requires 531 parallel atomic blocks. Since each parallel atomic block costs one multiplication, one squaring, and four additions, the total cost would be 531M + 531S + 2,124A. Table 13 compares the performance of previous parallel and protected methods against that of the proposed two-parallel SSCA-protected scheme for three possible scenarios: hardware-assisted implementations (1S = 1M); considering that multiplication is very efficient and, thus, addition is not negligible (1A = 0.2M); and software-based implementations where squaring is generally cheaper (1S = 0.6M or 0.8M). For the latter, we consider that addition can be made very fast in comparison to multiplication (1A = 0.05M).

As can be seen in Table 13, our approach introduces significant cost reductions in all of the studied cases. For instance, it reduces costs by approximately 12 percent,

[TABLE 13. Comparison of Performance of Parallel SSCA-Protected Methods with n = 160]

26 percent, and 30 percent for the three presented cases, respectively, when compared to the best parallel method using the Montgomery ladder [23]. In comparison with the pipelined scheme [28], our parallel approach introduces reductions of approximately 17 percent, 25 percent, and 32 percent.

For comparison purposes, sequential atomic methods are also presented. In that case, our scheme introduces speedups of about 40 percent, 43 percent, and 43 percent for the three studied cases, respectively, in comparison with the new atomic implementation proposed in Section 5. The improvements are as high as 51 percent, 56 percent, and 60 percent if compared with the traditional atomic implementation based on M-A-N-A.

8 CONCLUSION

We have presented a highly flexible methodology to derive fast formulas for the doubling, tripling, and addition operations, where some multiplications have been effi-

APPENDIX

Appendices of this work can be found in the Computer Society Digital Library at http://computer.org/tc/archives.htm.

ACKNOWLEDGMENTS

The authors would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for partially supporting this work and the reviewers for their valuable comments.

REFERENCES

[1] D. Hankerson, A. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography. Springer, 2004.
[2] H. Cohen, A. Miyaji, and T. Ono, "Efficient Elliptic Curve Exponentiation Using Mixed Coordinates," Advances in Cryptology—Proc. ASIACRYPT '98, pp. 51-65, 1998.
[3] V. Dimitrov, L. Imbert, and P.K. Mishra, "Efficient and Secure Elliptic Curve Point Multiplication Using Double-Base Chains," Advances in Cryptology—Proc. ASIACRYPT '05, pp. 59-78, 2005.
[4] M. Ciet, M. Joye, K. Lauter, and P.L. Montgomery, "Trading Inversions for Multiplications in Elliptic Curve Cryptography," Designs, Codes, and Cryptography, vol. 39, no. 2, pp. 189-206, 2006.
[5] D. Bernstein, "High-Speed Diffie-Hellman, Part 2," presentation at INDOCRYPT '06, tutorial session, 2006.
[6] M. Brown, D. Hankerson, J. Lopez, and A. Menezes, "Software Implementation of the NIST Elliptic Curves over Prime Fields," Topics in Cryptology—CT-RSA '01, pp. 250-265, 2001.
[7] J. Großschädl, R. Avanzi, E. Savas, and S. Tillich, "Energy-Efficient Software Implementation of Long Integer Modular Arithmetic," Proc. Seventh Int'l Workshop Cryptographic Hardware and Embedded Systems, pp. 75-90, 2005.
[8] C.H. Lim and H.S. Hwang, "Fast Implementation of Elliptic Curve Arithmetic in GF(p^m)," Proc. Third Int'l Workshop Practice and Theory in Public Key Cryptography, pp. 405-421, 2000.
[9] C.H. Gebotys and R.J. Gebotys, "Secure Elliptic Curve Implementations: An Analysis of Resistance to Power-Attacks in a DSP Processor," Proc. Fifth Int'l Workshop Cryptographic Hardware and Embedded Systems, pp. 114-128, 2003.
[10] R. Avanzi, "Aspects of Hyperelliptic Curves over Large Prime Fields in Software Implementations," Proc. Sixth Int'l Workshop Cryptographic Hardware and Embedded Systems, pp. 148-162, 2004.
[11] D. Bernstein, "Curve25519: New Diffie-Hellman Speed Records," Proc. Ninth Int'l Conf. Theory and Practice of Public Key Cryptography, pp. 229-240, 2006.
[12] N. Gura, A. Patel, A. Wander, H. Eberle, and S.C. Shantz,
ciently replaced by squarings with optimization purposes. “Comparing Elliptic Curve Cryptography and RSA on 8-Bit
Furthermore, we have shown that parallel schemes such as CPUs,” Proc. Sixth Int’l Workshop Cryptographic Hardware and
SIMD can greatly benefit from this flexible technique. For Embedded Systems, pp. 119-132, 2004.
instance, a 160-bit NAF scalar multiplication with intro- [13] A. Woodbury, “Efficient Algorithms for Elliptic Curve Crypto-
systems on Embedded Systems,” MSc thesis, Worcester Poly-
duced three and four-parallel SIMD operations reduces technic Inst., 2001.
computing costs by approximately 63 percent and 70 per- [14] R. Avanzi, “Side Channel Attacks on Implementations of Curve-
cent, respectively, when compared with the traditional Based Cryptographic Primitives,” Cryptology ePrint Archive,
Report 2005/017, http://eprint.iacr.org/, 2005.
sequential approach. Also, we have protected our formulas [15] J.S. Coron, “Resistance against Differential Power Analysis for
against SSCA using innovative and highly efficient atomic Elliptic Curve Cryptosystems,” Proc. First Int’l Workshop Crypto-
structures where squarings have also been included. We graphic Hardware and Embedded Systems, pp. 292-302, 1999.
have shown new atomic formulas that are cheaper and, [16] P.Y. Liardet and N.P. Smart, “Preventing SPA/DPA in ECC
Systems Using the Jacobi Form,” Proc. Third Int’l Workshop
more importantly, offer true protection against SSCA. For Cryptographic Hardware and Embedded Systems, pp. 401-411, 2001.
instance, in the scalar multiplication using NAF, our atomic [17] O. Billet and M. Joye, “The Jacobi Model of an Elliptic Curve and
blocks speed up the computation by up to 30 percent in Side-Channel Analysis,” Cryptology ePrint Archive, Report 2002/
125, http://eprint.iacr.org/2002/125/, 2002.
contrast to previous atomic implementations. Finally, by [18] N.P. Smart, “The Hessian Form of an Elliptic Curve,” Proc. Third
using the new atomic structure, a highly efficient two- Int’l Workshop Cryptographic Hardware and Embedded Systems,
parallel SSCA-protected scheme has been presented with a pp. 118-125, 2001.
computing reduction of up to 32 percent in the scalar [19] B. Chevallier-Mames, M. Ciet, and M. Joye, “Low-Cost Solutions
for Preventing Simple Side-Channel Analysis: Side-Channel
multiplication, in comparison with the previous best Atomicity,” IEEE Trans. Computers, vol. 53, no. 6, pp. 760-768,
method using NAF. June 2004.
302 IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 3, MARCH 2008

[20] L. Batina, N. Mentens, B. Preneel, and I. Verbauwhede, “Balanced Patrick Longa received the BSc degree in
Point Operations for Side-Channel Protection of Elliptic Curve electrical engineering from the Catholic Univer-
Cryptography,” IEE Proc.—Information Security, vol. 152, no. 1, sity of Peru in 1999 and the MSc degree from
pp. 57-65, 2005. the University of Ottawa, where he conducted
[21] W. Fischer, C. Giraud, E.W. Knudsen, and J.-P. Seifert, “Parallel research in cryptography and DSP. Currently, he
Scalar Multiplication on General Elliptic Curves over IFp Hedged is beginning his PhD studies in electrical and
against Non-Differential Side-Channel Attacks,” IACR ePrint computer engineering at the University of Water-
Archive, Report 2002/007, http://www.iacr.org, 2002. loo. He worked as a researcher at the Catholic
[22] T. Izu and T. Takagi, “A Fast Parallel Elliptic Curve Multiplication University of Peru and the Navy Industrial
Resistant against Side Channel Attacks,” Proc. Fifth Int’l Workshop Services (SIMA). He is the author of eight
Practice and Theory in Public Key Cryptosystems, pp. 280-296, 2002. research papers. His research interests include (curve-based) crypto-
[23] T. Izu and T. Takagi, “Fast Elliptic Curve Multiplications Resistant graphy, security on portable devices, and computer architectures for
against Side Channel Attacks,” IEICE Trans. Fundamentals, signal processing and cryptography.
vol. E88-A, no. 1, pp. 161-171, 2005.
[24] R. Avanzi, H. Cohen, C. Doche, G. Frey, T. Lange, K. Nguyen, and Ali Miri received the BSc and MSc degrees in
F. Vercauteren, Handbook of Elliptic and Hyperelliptic Curve mathematics from the University of Toronto in
Cryptography. CRC Press, 2005. 1991 and 1993, respectively, and the PhD
[25] C.D. Walter, “Sliding Windows Succumbs to Big Mac Attack,” degree in electrical and computer engineering
Proc. Third Int’l Workshop Cryptographic Hardware and Embedded from the University of Waterloo in 1998. He is
Systems, pp. 286-299, 2001. currently an associate professor with the School
[26] K. Aoki, F. Hoshino, T. Kobayashi, and H. Oguro, “Elliptic Curve of Information Technology and Engineering
Arithmetic Using SIMD,” Proc. Fourth Int’l Conf. Information (SITE) and the Department of Mathematics
Security, pp. 235-247, 2001. and Statistics at the University of Ottawa,
[27] T. Izu and T. Takagi, “Fast Elliptic Curve Multiplications with Canada. He is also the director of the Computa-
SIMD Operations,” Proc. Fourth Int’l Conf. Information and Comm. tional Laboratory in Coding and Cryptography (CLiCC), University of
Security, pp. 217-230, 2002. Ottawa. His research interests include coding and information theory,
[28] P.K. Mishra, “Pipelined Computation of Scalar Multiplication in applied number theory, and cryptography. He is a member of the
Elliptic Curve Cryptosystems,” IEEE Trans. Computers, vol. 55, Professional Engineers Ontario and the ACM and a senior member of
no. 8, pp. 1000-1010, Aug. 2006. the IEEE.
[29] S.B. Xu and L. Batina, “Efficient Implementation of Elliptic Curve
Cryptosystems on an ARM7 with Hardware Accelerator,” Proc.
Third Int’l Conf. Information and Comm. Security, pp. 266-279, 2001.
[30] K. Itoh, M. Takenaka, N. Torii, S. Temma, and Y. Kurihara, “Fast
Implementation of Public-Key Cryptography on a DSP . For more information on this or any other computing topic,
TMS320C6201,” Proc. First Int’l Workshop Cryptographic Hardware please visit our Digital Library at www.computer.org/publications/dlib.
and Embedded Systems, pp. 61-72, 1999.

You might also like