Low-Power Design For A Digit-Serial Polynomial Basis Finite Field Multiplier Using Factoring Technique

You might also like

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 17

Low-Power Design for a Digit-Serial

Polynomial Basis Finite Field Multiplier


Using Factoring Technique
Shoaleh Hashemi Namin, Huapeng Wu, Senior Member, IEEE, and Majid Ahmadi, Fellow, IEEE

Abstract— In CMOS-based application-specific integrated EC cryptosystem uses shorter key compared with RSA to
circuit (ASIC) designs, total power consumption is dominated by provide the same level of security, it is probably the more
dynamic power, where dynamic power consists of two major
components, namely, switching power and internal power. In this widely used technique in resource-constrained devices. Since
paper, we present a low-power design for a digit-serial finite field EC used in an EC cryptosystem is defined over finite fields,
m low-power design of finite field arithmetic results in an EC
multiplier in GF( 2 ). In the proposed design, a factoring
technique is used to minimize switching power. To the best of our cryptosystem, which consumes lower power and makes it more
knowledge, factoring method has not been reported in the suitable for wireless applications.
literature being used in the design of a finite field multiplier at an m
architectural level. Logic gate substitution is also utilized to reduce Binary extension field, denoted by GF(2 ), is very attractive
internal power. Our proposed design along with several existing for hardware implementation, because it offers carry-free
233 arithmetic. Multiplication operation has been paid most
similar works have been realized for GF(2 ) on ASIC platform,
and a comparison is made between them. The synthesis results attention by researchers, because addition is simply bitwise XOR
show that the proposed multiplier design consumes at least 27.8% operation between two field elements, and the more complex
lower total power than any previous work in comparison. operations, inversion, can be carried out with a few multiplications.
Index Terms— Digit-serial architecture, elliptic curve (EC) m
In GF(2 ), there are various methods to represent field elements,
cryptography, factoring method, finite field multiplier, low-power

design. such as polynomial basis (PB), normal basis, and dual basis. PB is
probably the most pop-ularly used basis, because it is adopted as
one of the basis choices by organizations that set standards for
I. INTRODUCTION cryptography applications [1], [2]. Thus, a large number of

ACCORDING to Moore’s law, the number of transistors on a architectures for efficient implementation of PB finite field
chip doubles almost every two years. As a result, more multipliers have been proposed. In addition, new representations
functions and more complicated designs can be imple-mented based on PB called shifted PB (SPB) [3] and generalized PB [4]
on one chip, which leads to more power density and more heat have been proposed for efficient implementation of multipliers
on the circuits. Higher power density on the circuit reduces the m
over GF(2 ).
reliability of the system and the battery life of the battery-
based devices. Thus, power and energy consumptions of the The choice of the irreducible polynomial p(x ) affects the
circuit gain the same or probably more importance than area, complexity of a finite field multiplier. Various types of
especially for most compact portable devices that work irreducible polynomials include trinomials, pentanomials, all
one polynomials, and equally spaced polynomials. Standard
by battery. organizations recommend irreducible polynomials with less
Nowadays, lots of information are exchanged through number of nonzero terms (irreducible trinomials and pen-
networks, thus providing security services over networks is tanomials) for practical use as these types of irreducible
crucial for protecting information. Among security technolo- polynomials can provide multipliers with lower complexity.
gies, public key cryptography is popular and important, since it PB finite field multiplier architectures can be categorized
can provide certain unique security services, such as key into bit-serial [5]–[7], bit-parallel [8]–[11], and digit-serial
exchange and digital signature. There are two public key architectures. Bit-parallel structure is fast, but it is expensive in
cryptography techniques, in practice, namely, Rivest–Shamir– terms of area. In EC cryptography, the binary extension field
Adleman (RSA) and elliptic curve (EC) cryptosystem. Since 2
size, m, is required to be on the order of 10 , and thus a bit-
parallel structure requires a very high I/O bandwidth, which is
Manuscript received October 6, 2015; revised February 9, 2016 and April 29,
2016; accepted June 7, 2016. Date of publication July 20, 2016; date of current
usually not available in the small portable and wireless devices.
version January 19, 2017. This work was supported by the Natural Sciences Bit-serial architecture is area efficient, but it is too slow for
and Engineering Research Council of Canada (NSERC). many applications. Power optimization were also considered in
The authors are with the Department of Electrical and Computer Engi- some of these works [12]–[18].
neering, University of Windsor, Windsor, ON N9B 3P4, Canada (e-mail:
hashemi1@uwindsor.ca; hwu@uwindsor.ca; ahmadi@uwindsor.ca). The digit-serial architecture is flexible in that it can trade off
Color versions of one or more of the figures in this paper are available online between space and speed; therefore, it achieves a moderate
at http://ieeexplore.ieee.org. speed and reasonable cost of implementation, so it is most
Digital Object Identifier 10.1109/TVLSI.2016.2585980
appropriate for practical use. Many digit-serial PB multipliers

1063-8210 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
442 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 2, FEBRUARY 2017

have been proposed in the literature [6], [19]–[28]. Most of digit-serial PB multiplication is explained, and sources of
these worked toward high-speed and/or low-area architectures. power dissipation for CMOS-based circuits and low-power
Only few considered the optimization of power consump-tion techniques, such as factoring and logic gate substitution, are
[19], [20]. also discussed. In Section III, the proposed low-power digit-
In [20], two low-energy digit-serial PB multipliers have been m
serial PB multiplier in GF(2 ) is presented. Synthesis results
proposed. For degree reduction, they used a binary-tree for area, power, and energy consumptions of the proposed
structure of XOR gates instead of a linear array of XOR gates to multiplier and similar multipliers in comparison are
reduce both power consumption and delay. Energy summarized in Section IV. Finally, in Section V, the conclusion
consumption of the multipliers was estimated using the HEAT remarks are given.
tool [29]. However, experimental results were only reported
16 II. PRELIMINARIES
over a very small field GF(2 ); hence, the issues involved in
multipliers over practically large fields, such as the existence of m
A. Binary Extension Field GF(2 )
very high fan-out nets, buffer insertions, and so on that can have A finite field is defined as a set of finite many elements,
high impacts on multipliers’ efficiencies, are not considered. where addition and multiplication are the operations defined on
In [19], a most significant digit (MSD)-first digit-serial m
the set. A binary extension field, GF(2 ),mis gener-ated by a
degree m irreducible polynomial, p(x ) = x +
233 m−1 2
multiplier
than that inin[20].
GF(2 ) was was
Its latency proposed with complexity
also reduced lower
by one cycle pm−1x + · · · + p2x + p1x + 1, where is either 0 or 1.
pi
2 m−
1
p(x ) also specifies a PB {1, x , x , . . . , x }. Each element
compared with that in [20]. m
Kumar et al. [23] proposed two architectures referred to as of GF(2 ) can be represented as a polynomial of degree at most
m
double accumulator multiplier and N-accumulator multiplier, m − 1 over GF(2 ) with respect to the PB. For instance, an
m
which are based on the least significant digit (LSD)-first element A GF(2 ) can be expressed as
multiplier in [20]. Their architectures achieved higher speed m−1 m−2 2
A(x ) = am−1x + am−2 x + · · · + a2x + a1 x + a0
than that in [20] by adding registers to store the intermediate
(1)
results at expenses of higher area complexities and power
consumption. with
In [21], an LSD-first multiplier with lower area-time com-
plexity has been proposed using a new method for the ai GF(2), 0 ≤ i ≤ m − 1.
accumulation in which XOR gates and D flip-flops (FFs) have
Multiplication of two field elements A(x ) and B(x ) of the
been replaced by T FFs and feedback loops have been
binary extension field can be given by
eliminated. In [30], a multiple-bit serial-parallel multiplier has
been proposed. One operand is decomposed recursively and the
other operand is prereduced hierarchically. The multiplier has C(x ) = A(x ) B(x ) mod p(x ). (2)
higher area complexity compared with some of the existing
multipliers; however, it has lower critical path delay, which B. Digit-Serial PB Multiplication
results in lower area-time complexity. In digit-serial multiplication, the bits of one operand are
A low-complexity digit-serial SPB multiplier using the pro- divided into digits of size k while the bits of the other input
posed (b, 2)-way Karatsuba decomposition has been proposed operand are processed in parallel. Only one digit of the first
in [31]. The (4, 2)-way and (6, 2)-way Karatsuba-based digit- operand is accessible in each clock cycle.
serial multipliers have been implemented and compared with Let two operands to multiplication be A and B,
the existing SPB multiplier. The Karatsuba-based digit-serial in digits: A A A and
SPB multiplier shows lower area complexity, less area-delay where A (isj) given ( j ) = (m/ k) −1 · · · 0
Aj = (ak 1 , . . . , a0 ) represent a digit of k-bits, where
product, and less energy consumption compared with the (j) −
existing SPB multiplier. Ai = a j k+i , i = 0, 1, . . . , k − 1 and j = 0, . . . , (m/ k) − 1.
In this paper, a factoring technique is adopted in design of a MSD digit-serial multiplication can be realized in several ways
m depending on when and where polynomial modular reduction
digit-serial PB multiplier in GF(2 ). To the best of our
knowledge, a factoring method has not been reported in the is performed. One MSD digit-serial multiplication algorithm
literature being used in the design of a finite field multiplier at used in [19] can be described as follows:
an architectural level. A logic gate substitution technique is also k
A B = · · · A mk −1 B mod p(x ) x mod p(x )
used in our design to reduce the internal power consumption of k
+ A mk −2 B mod p(x ) x m od p(x ) + · · ·
the proposed digit-serial multiplier. The synthesis results show
that our new design has both the lowest dynamic power + A0 B mod p(x ). (3)
consumption and the lowest total power consumption among
several similar existing works. Total power consumption of the
proposed multiplier is lower than any other previous works in C. Power Dissipation for CMOS-Based Circuits
comparison by at least 27.8%. Power consumption in a CMOS-based design contains two
The rest of this paper is organized as follows. In Section II, major components: static power and dynamic power [32]. For
a brief explanation of finite field and PB is provided, a CMOS-based design, dynamic power plays a dominant
HASHEMI NAMIN et al.: LOW-POWER DESIGN FOR A DIGIT-SERIAL PB FINITE FIELD MULTIPLIER 443

1
role in the total power consumption [33]. Dynamic power
consumption of a CMOS-based design, which consists of a large
number of standard cells and nets (a net is a connection to the cells
inputs or outputs), can be expressed
P as follows [34]:
P P + ternal
dynamic = switching in
C P
=P αi Li + internali . (4)
nets(i) cells(i)
In (4), Pswitching is the total switching power summarized
over all nets. For each net i , switching power is the power
dissipated due to the charging and discharging of the output
load capacitance of cell i . Switching activity, αi , denotes the
probability of a 0 → 1 transition during a clock period on the
output node of cell i . CLi represents total load capacitance at
the output of cell i and P is a variable independent of switching
activity and load capacitance, but related to clock frequency of Fig. 1. Two circuits for realizing f = ab + cb + cd. (a) Without factoring.
the circuit, and supply voltage. (b) With factoring (shaded gates indicate high switching activity at output)
[38].
The second term, Pinternal, in (4) is the total internal power
obtained by summing over all cells. The internal power of each
cell i (Pinternali ) is the power consumed within the cell because
of the charging and discharging of internal nodes capacitances
of a cell and short-circuit current. During the transitions of the
input signals and the output node for a short period of time, both
pull-up and pull-down paths in the CMOS cell conduct and the
current flows from VDD to GND. The current is called short- m
Fig. 2. Proposed digit-serial PB multiplier in GF(2 ).
circuit current. It can be concluded that Pinternal is also a
high activity nets. Therefore, the circuit shown in Fig. 1(b) has
function of switching activity and correlates positively with the
switching activity [34]. lower switching activity, and thus, it has lower Pswitching
As can be seen in (4), dynamic power (Pdynamic) can be compared with the circuit presented in Fig. 1(a).
reduced by lowering Pswitching or Pinternal or the both. In this Logic gate substitution reduces Pinternal by replacing the
paper, we have utilized a factoring technique, which is gates with higher internal power with those that consume lower
internal power.
explained later, to reduce Pswitching by minimizing the switch-
ing activities (αi s). We have also reduced Pinternal through III. PROPOSED LOW-POWER DESIGN OF A
logic gate substitution in which gates with larger number of m
internal nodes are replaced by gates with smaller number of DIGIT-SERIAL MULTIPLIER IN GF(2 )
internal nodes. In this section, we present a factoring-based circuit design for
m
a digit-serial PB multiplier in GF(2 ) that reduces
D. Low-Power Techniques for Dynamic Power Reduction Pswitching effectively. A logic gate substitution technique is
Factoring is an effective method to reduce power consump- also presented that reduces Pinternal by using gates with lower
tion applicable at both architecture and gate level [35]–[37]. internal power consumption. Gate count of the proposed digit-
serial PB multiplier is also optimized.
This method reduces Pswitching by reducing the logic depth,
which is connected to the nets with high switching activity [38]
A. Multiplier Architecture
to reduce the switching activity of the circuit.
Fig. 1 shows a simple gate level example of using factoring An architecture diagram for the proposed digit-serial PB
m
to reduce the switching activity of a small circuit. Both circuits multiplier in GF(2 ) is shown in Fig. 2. There are three
shown in Fig. 1 realize function f = ab + cb + cd. Assume that modules, as shown in Fig. 2, namely, k ×m multiplier, constant
input b has higher switching activity compared with the three multiplier, and field adder.
other inputs, a, c, and d. 1) k × m multiplier takes one operand B of m-bit and the
In Fig. 1(a), input b with high switching activity propagates other operand A j of k-bit. Note that A j changes for
through two gates at the first stage, which results in larger different clock cycles j . Thus, it has higher switching
number of high activity nets, and as a result, it causes higher activity compared with operand B. A straightforward
switching activity in the whole circuit. While in the circuit realization of this module was used in [19]. For the
shown in Fig. 1(b), input b propagates through one gate at the comparison purpose, it is given in Algorithm 1. Note that
second stage, and thus, it results in lower number of a modification to this algorithm using a factoring method
is proposed in Section III-B. The three steps in Algorithm
1 1 are, respectively, realized with the circuit blocks from
As expected, synthesis results that are given in Table V also show that static power
consumption of the finite field multipliers is significantly lower than their dynamic power left to right, as shown in Fig. 3(a).
consumption (nW versus μW).
444 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 2, FEBRUARY 2017

Algorithm 1 Operations Performed by k × m Multiplier Algorithm 2 Modified Algorithm for k × m Multiplier With
Switching Minimization Using Factoring

TABLE I
SWITCHING POWER REDUCTION WITH FACTORING METHOD
FOR k × m MULTIPLIER (EXPERIMENT SETTING:
m = 233 AND k = 8)

The two designs of k × m multiplier are shown in Fig. 3 both


of which include three submodules: AND network,
XOR network 1, and XOR network 2. The circuit shown
(j) =
in Fig. 3(a) first computes a i B, i ( j) 0, 1, . . . , k − 1
in
(j) network, and then computes (ai B)x i mod p(x ) =
AND
Vi at XOR network 1. In this circuit, input A j that has higher
switching activity compared with input B affects all the three
Fig. 3. k × m multiplier. (a) Without applying factoring. (b) With applied
submodules, which results in larger number of high activity
factoring method (the shaded modules indicate high switching activity). nets (outputs of the shaded modules), and thus causes higher
switching activity in the k × m multiplier. Factoring can be
applied to decrease the switching activity of the k × m multiplier
2) Constant multiplier module realizes multiplication by reducing the logic depth connected to high
k activity input A j .
between a field element and the constant x .
3) Field adder module implements finite field addition using In Fig. 3(b), the proposed design is obtained from Fig. 3(a)
m two-input XOR gates formed as a one-layer network. by applying a factoring technique. As shown in Fig. 3(b), the
Note that k × m multiplier is the most complex module logic depth connected to input A j is reduced by switching
among the three modules. In fact, it takes majority of system the submodules AND network and XOR network 1. In XOR net-
i
complexity in terms of gate count. By experiment, we also work 1, it computes Ui = B x mod p(x ), i = 1, 2, . . . , k − 1.
found that its power consumption is much higher than all the (j) (j)
Then at AND network, Vi = ai Ui , i = 0, 1, . . . , k − 1 are
2
other modules combined. In the following, we will propose obtained. As shown in Fig. 3(b), input A j with high
a low-power design of the k × m multiplier using factoring and switching activity does not propagate through submodule XOR
logic gate substitution methods. Complexity optimization of network 1. Therefore, the number of nets with high switching
this module is also presented. activity is minimized, which results in lower switching activity
in the proposed design.
XOR network 2 shown in Fig. 3(a) and (b) computes
B. Low-Power Design Using Factoring Technique
k−1 V ( j ) . This submodule is composed of m binary trees
Consider Algorithm 2 as the modified version of Algorithm i=0 i

1 for the operation by k × m multiplier module. While in each with k inputs.


Algorithm 1, transitions of high activity input A j
are involved in all the three steps, Step 1 in Algorithm 2 (Ui = C. Power Consumption With and Without Factoring
i In this section, we demonstrate that how much power
B x mod p(x ), i = 0, 1, . . . , k − 1) is not affected by A j and
it also does not depend on the cycle j , which means there is no reduction can be achieved on the k × m multiplier by applying
input data transitions involved in this step in Algorithm 2 for factoring method alone.
all the cycles j = 0, 1, . . . , m/ k − 1. Two designs of k × m multiplier module, shown in Fig. 3,
2 have been implemented for (m, k) = (233, 8) using
We assume (m, k) = (233, 8) for both the complexity estimation and the CORE65LPSVT standard cell library from STMicroelectron-
power consumption experiment.
ics [39]. Pswitching and Pinternal have been estimated by
HASHEMI NAMIN et al.: LOW-POWER DESIGN FOR A DIGIT-SERIAL PB FINITE FIELD MULTIPLIER 445

In the XOR network 1 with reduced gate count, it should be

noted that the k products Ui , i = 0, 1, . . . , k − 1 of constant


multiplication, have to be computed in serial fashion, which,

while reducing the gate count, could possibly increase the


critical path delay for general irreducible polynomial p(x ).

E. Low-Power Design Using Logic Gate Substitution


As can be seen in Fig. 3(b), the proposed k × m multiplier
consists of AND network followed immediately by XOR net-
work 2 that is made up of binary tree of XOR gates. Now lets
take a look at a simple example of this style logic functions
that is presented by (7) where two two-input AND gates and
one XOR gate are required to implement function y
y = (x0 · x1) ⊕ (x2 · x3). (7)
In CMOS logic, a two-input AND gate consists of six transis-
tors, while a two-input NAND gate can be made of only four
transistors. A two-input NAND gate also has less number of
internal nodes compared with a two-input AND gate, and as
a result, it consumes lower internal power compared with a
two-input AND gate.
gate count.
Note that inputs of the XOR gate in (7) can be negated
Synopsys Power Compiler tool [34] at 50-MHz clock fre- without changing the output [see (8)], which means that logic
quency and they are given in Table I. It can be seen that gate substitution can be utilized to reduce Pinternal by replacing
significant switching power consumption can be saved for this AND gates with NAND gates
module with the factoring method-based new design.
P y = (x0 · x1) ⊕(x2 · x3). (8)
internal is
also slightly reduced with the new method. It can be seen from Fig. 5 that the modified design of
D. Gate Count Optimization function y [implementation of (8)] can save four transistors
A straightforward realization of XOR network 1 compared with the standard method [implementation of (7)].
shown in Fig. 3(b) can be given in Fig. 4(a), where Consequently, for the proposed multiplier with even digit
i size k, there are 2km transistor savings.
Ui = B x mod p(x ) is implemented with the subblocks When digit size is odd, for example, digit size equal to 3,
i
CMi, = 1, 2, . . . , k − 1,i where CMi stands for constant
multiplier with constant x . An observation can be made logic function for AND–XOR network can be given by

that the subblocks CMi can be replaced with k − 1 z (x0 x1) (x2 x3) (x4 x5). (9)
number of subblocks CM1 to generate B x i mod p(x ), = · ⊕ · ⊕ ·

Any two ANDed terms in (9) can be complemented without


Step 1 in changing the value of z [see (10)]
Algorithm 2 has been changed as follows:
z ((x0 x1) (x2 (x4 x5)
U0 = B (5) = · ⊕ · x3)) ⊕ ·

Ui+1 = Ui x mod p(x ), i = 0, . . . , k − 2. = (x0 · x1) ⊕ ((x2 · x3) ⊕ (x4 · x5))

Fig. 4(b) shows the proposed architecture for XOR network 1 = ((x0 · x1) ⊕(x4 · x5)) ⊕(x2 · x3). (10)
with lower gate counts that computes the above expressions. For the proposed multiplier with odd digit size k, (k−1)m AND
Assuming that the irreducible polynomial is p(x ) = x + gates are replaced by NAND gates, thus 2(k
m
1)m transistors
n
x + 1, (0 < n < m), the operation performed by one CM1 can be saved. −
module is given by As shown in Fig. 5(b), NAND gates in the modified design
m−1 of function y have less number of internal nodes compared
Ui+1 = x Ui = xu(i) x with AND gates that are used in the standard design shown
=0 in Fig. 5(a). As a result, the modified design of function y
m−1 has lower Pinternal compared with the standard design. Thus,
(i) (i ) (i ) (i)
= um −1 +u − 1x + un − 1 + um −1 x n (6) using the modified design of function y in k × m multiplier
=1 reduces Pinternal and leaves Pswitching almost unchanged, which
=n result in further reduction of Pdynamic. A similar idea has been
where Ui and Ui+1 denote the input and output to the CM1 used in [27] for reducing the critical path delay and area of
8
constant multiplier, i = 0, 1, . . . , k − 2, respectively. It can be a digit-serial PB multiplier in GF(2 ). We have not included
seen that only one XOR gate is used for realizing CM1, or the previous work [27] in our comparison, because it did not
(k − 1) XOR gates are used for the XOR network 1. consider the power optimization of the multiplier architecture.
446 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 2, FEBRUARY 2017

Fig. 6. k × m multiplier with applied factoring, logic gate substitution, and


gate count optimization.

TABLE II
INTERNAL POWER REDUCTION WITH LOGIC GATE SUBSTITUTION
FOR k × m MULTIPLIER (EXPERIMENT SETTING:
m = 233 AND k = 8)

using the implementation settings mentioned in Section III-C.


Pswitching and Pinternal have been estimated and they are pre-
sented in Table II.
According to the experimental results, gate count reduction
has an insignificant impact on power consumption; therefore,
the power saving is effectively caused by logic gate substitu-
tion. As presented in Table II, internal power consumption of k
× m multiplier has been reduced significantly by applying logic
gate substitution.

F. Architecture Level Complexities


To evaluate the complexities of finite field arithmetic hard-
ware, circuit complexity is usually presented by the num-ber of
FFs, two-input AND/NAND gates, two-input XOR gates, and
MUXs. In estimation of critical path delay, TA, TN A , T X , TM
, TD , and TT are used to refer to as the delay caused by a two-
input AND gate, a two-input NAND gate,
a two-input XOR gate, a 2-to-1 multiplexer, a D FF, and a T
FF, respectively.
As shown in Fig. 6, the complexity of k × m multiplier
module can be estimated as follows. XOR network 1 with
reduced gate counts [Fig. 4(b)] contains (k − 1) number of CM1
modules. The number of NAND gates in the NAND network is
equal to the number of AND gates in the AND network for even
digit sizes. The operations involved in
Fig. 5. (a) Standard design: implementing (7) using 24 transistors. (j)
(b) Modified design: implementing (8) using 20 transistors.
the AND network are shown in Step 2 in Algorithm 2, Vi =
(j)
ai Ui , i = 0, 1, . . . , k − 1, which use km AND gates.
Therefore, the NAND network contains km NAND gates. The
The proposed low-power and area efficient design of k × m XOR network 2 contains m binary trees each with k inputs, which
multiplier with applied factoring method, logic gate substitu- requires (k − 1)m XOR gates.
tion, and gate count optimization is shown in Fig. 6. Besides k × m multiplier, constant multiplier module, field
To demonstrate that how much power reduction can be adder module, and the register (Fig. 2) require k number of XOR
achieved on k × m multiplier by applying logic gate sub- gates, m number of XOR gates, and m D FFs, respectively.
stitution, we implemented the architecture shown in Fig. 6
HASHEMI NAMIN et al.: LOW-POWER DESIGN FOR A DIGIT-SERIAL PB FINITE FIELD MULTIPLIER 447

TABLE III
m
ARCHITECTURE LEVEL AREA COMPLEXITIES AND COMPARISONS OF THE DIGIT-SERIAL MULTIPLIERS OVER GF(2 )

TABLE IV
TIME COMPLEXITIES AND COMPARISONS OF THE DIGIT-SERIAL MULTIPLIERS
m
OVER GF(2 )

Tables III and IV show the complexities for our proposed Fig. 7. Power estimation flow.
architecture and some similar digit-serial PB multipliers in the
literature, where we take p(x ) a trinomial. circuit is highly dependent on input data transitions and
Note that in Table III, output registers are added to some switching activity of each net in the circuit.
3,4 Our proposed architecture along with the other five digit-
existing works to support a stable output.
serial PB multipliers [19]–[21], [23] and a digit-serial SPB
IV. VLSI EXPERIMENTAL RESULTS multiplier [31] in the literature has been synthesized using DC,
and gate level netlists have been generated, then NCSim
A. Experimental Settings simulator from Cadence has been used for gate level sim-
Our proposed design and the similar existing architectures ulation. Switching activity information generated via gate level
have been modeled in VHDL hardware description language at simulation has been used by Synopsys Power Compiler tool for
register transfer level to realize a digit-serial multiplier in power consumption calculation. For more accurate power
233
GF(2 ) with a digit size of 8. The irreducible trinomial p(x ) = x estimation, full-timing rather than zero-delay gate level
233 74
+ x + 1 has been used for generating the field. simulation, with 1000 random vectors for inputs A and B, has
The designs have been synthesized by Design Com-piler been used for obtaining switching activity information. In full-
(DC) tool from Synopsys [40] and STMicroelectronics timing simulation, glitches that affect the power consumption
CORE65LPSVT standard cell library at a supply voltage of 1 can be captured.
V, 25 °C operating temperature and at typical process corner Fig. 7 shows the power estimation flow, where SAIF denotes
has been used for synthesis. CORE65LPSVT is a library for 65- the switching activity interchange format and the SDF file
nm low-power CMOS digital design [39]. All designs have denotes the standard delay format file that is used for full-
been synthesized with 50-MHz clock fre-quency. This timing simulation.
frequency is suitable for many low-power embed-ded devices,
such as smart cards.
C. Synthesis Results
B. Power Estimation Method Synthesis results are presented in Tables V and VI. As pre-sented
in Table V, our proposed multiplier has the lowest Pswitching and
In this paper, we have used input vectors and run gate level
simulation to generate switching activities that are used for Pinternal, and thus, it consumes the lowest amount of Pdynamic.
power estimation. This is a much more accurate power Compared with our proposed multiplier, the best previous work [20]
5 consumes about 38.4% higher Pdynamic.
measuring method, since power consumption of a digital In Table V, the last column presents the total power con-
3
sumption (Ptotal) of the proposed digit-serial PB multiplier and the
Subsequently, in Table IV, their number of clock cycles is also adjusted accordingly. similar existing multipliers. Ptotal has been obtained by adding
4 Pdynamic and Pstatic. However, it can be seen that Pstatic is
Note that the number of clock cycles should be m/ k for (m, k) = (233, 8) in
both the work in [19] and our proposed work. significantly smaller than Pdynamic (nW versus μW). This is the
5
A commonly used power estimation method, which is far from accurate is
provided by Synopsys Power Compiler by setting a fixed toggle rate and static
reason that we have optimized Pdynamic rather than Pstatic. As
probability for all the circuit inputs regardless of the input data transitions and shown in Table V, our proposed multiplier consumes the least
propagating the transitions using zero-delay simulation [34]. amount of total power among all other multipliers.
448 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 2, FEBRUARY 2017

TABLE V
233
POWER CONSUMPTION AT 50-MHz OPERATING CLOCK FOR THE DIGIT-SERIAL MULTIPLIERS OVER GF(2 ) CHOOSING
233 74
THE NATIONAL INSTITUTE OF STANDARD AND TECHNOLOGY RECOMMENDED IRREDUCIBLE TRINOMIAL x +x + 1 AND DIGIT SIZE k = 8

TABLE VI the proposed architecture, the best previous work is still about
AREA, ENERGY CONSUMPTION, EA PRODUCT, AND LATENCY OF THE 75.5% higher in the EA product. Fig. 8 shows the energy
233
DIGIT-SERIAL MULTIPLIERS OVER GF(2 ) CHOOSING THE consumed by one multiplication and the area complexity for the
NIST RECOMMENDED IRREDUCIBLE TRINOMIAL
233 74 proposed multiplier and the existing multipliers of similar
x +x + 1 AND DIGIT SIZE k = 8
architectural style.

V. CONCLUSION
The factoring technique has been adopted for a new archi-
tecture level design that minimizes the switching activities and,
consequently, reduces the power consumption of a digit-serial PB
m
multiplier in GF(2 ). The logic gate substitution technique has
also been utilized to further reduce the power consumption of the
digit-serial PB multiplier. Moreover, the area complexity of the
finite field multiplier has been reduced. The VLSI experimental
results show that the new architecture consumes about 27.8%
lower power and 31.6% lower energy and achieves 43% lower EA
product compared with the best previ-ous work. The proposed low-
power digit-serial PB multiplier is suitable for implementing low-
power EC cryptosystems in embedded systems with limited power
resources. The proposed digit-serial PB multiplier can also be used
as an IP core for fast implementation of EC cryptosystems.

ACKNOWLEDGMENT
Fig. 8. Energy per multiplication versus area.
S. Hashemi Namin would like to thank Dr. Muscedere for his
technical support and guidance in preparing the VLSI
Our proposed multiplier and the existing multipliers experiments.
in comparison complete one field multiplication in
different numbers of clock cycles. Therefore, it is REFERENCES
beneficial to use energy per one multiplication as an
[1] C. F. Kerry, “Digital signature standard (DSS),” Nat. Inst. Standards
efficiency measure for comparison, because it gives equal Technol., Gaithersburg, MD, USA, FIPS PUB 186-4, 2013.
weight to both power and computational delay. The energy for [2] IEEE Standard Specifications for Public-Key Cryptography, IEEE
one multiplication operation for the new architecture is about Standard 1363-2000, Aug. 2000, pp. 1–228.
31.6% lower than the best existing result given in Table VI. n
[3] H. Fan and Y. Dai, “Fast bit-parallel GF(2 ) multiplier for all trinomi-als,”

IEEE Trans. Comput., vol. 54, no. 4, pp. 485–490, Apr. 2005.
m
[4] A. Cilardo, “Fast parallel GF(2 ) polynomial multiplication for all degrees,”
IEEE Trans. Comput., vol. 62, no. 5, pp. 929–943, May 2013.
Table VI shows a comparison of energy-area (EA) product [5] T. Beth and D. Gollman, “Algorithm engineering for public key algo-
between our proposed field multiplier and the similar rithms,” IEEE J. Sel. Areas Commun., vol. 7, no. 4, pp. 458–466, May
existing multipliers. We introduce EA product metric as 1989.
[6] L. Song and K. K. Parhi, “Efficient finite field serial/parallel multiplica-
a comprehensive metric that considers the criteria, such tion,” in Proc. Int. Conf. Appl. Specific Syst., Archit. Processors (ASAP),
as area complexity and energy consumption. As shown Aug. 1996, pp. 72–82.
[7] M. Nikooghadam and A. Zakerolhosseini, “Utilization of pipeline tech-
in Table VI, the proposed multiplier has the lowest nique in AOP based multipliers with parallel inputs,” J. Signal Process.
EA product among all other multipliers. Compared with Syst., vol. 72, no. 1, pp. 57–62, Jul. 2013.
HASHEMI NAMIN et al.: LOW-POWER DESIGN FOR A DIGIT-SERIAL PB FINITE FIELD MULTIPLIER 449

[8] B. Sunar and C. K. Koc, “Mastrovito multiplier for all trinomials,” IEEE [34] Power Compiler User Guide, Version B-2008.09-SP4, Synopsys,
Trans. Comput., vol. 48, no. 5, pp. 522–527, May 1999. Mountain View, CA, USA, Mar. 2009.
[9] H. Wu, “Bit-parallel finite field multiplier and squarer using polynomial [35] M. P. S. Iman, “Logic extraction and factorization for low power,” in
basis,” IEEE Trans. Comput., vol. 51, no. 7, pp. 750–758, Jul. 2002. Proc. 32nd Annu. ACM/IEEE Design Autom. Conf. (DAC), Jun. 1995, pp.
[10] S.-M. Park, K.-Y. Chang, D. Hong, and C. Seo, “New efficient bit- 248–253.
parallel polynomial basis multiplier for special pentanomials,” Integr., [36] P. Balasubramanian, D. A. Edwards, and C. Hari Narayanan, “Low power
VLSI J., vol. 47, no. 1, pp. 130–139, Jan. 2014. synthesis of XOR-XNOR intensive combinational logic,” in Proc. Can. Conf.
[11] Y. Li and Y. Chen, “New bit-parallel Montgomery multiplier for trino- Elect. Comput. Eng. (CCECE), Apr. 2007, pp. 243–246.
mials using squaring operation,” Integr., VLSI J., vol. 52, pp. 142–155, [37] P. Balasubramanian and R. Arisaka, “A set theory based factoring
Jan. 2016. technique and its use for low power logic design,” Int. J. Comput., Elect.,
[12] A. G. Wassal, M. A. Hassan, and M. I. Elmasry, “Low-power design of Autom., Control Inf. Eng., vol. 1, no. 3, pp. 188–198, 2007.
finite field multipliers for wireless applications,” in Proc. 8th Great Lakes [38] C. Piguet, Low-Power CMOS Circuits: Technology, Logic Design and
Symp. VLSI (GLSVLSI), Feb. 1998, pp. 19–25. CAD Tools. Boca Raton, FL, USA: CRC Press, 2006.
[13] A. Zakerolhosseini and M. Nikooghadam, “Low-power and high-speed [39] CORE65LPSVT_1.00V 4.1 Standard Cell Library User Manual &
m Databook, STMicroelectronics, Geneva, Switzerland, 2006.
design of a versatile bit-serial multiplier in finite fields GF(2 ),” Integr.,
VLSI J., vol. 46, no. 2, pp. 211–217, Mar. 2013. [40] Synopsys, Mountain View, CA, USA. (Sep. 2008). Design Compiler
m User Guide Version B-2008.09. [Online]. Available:
[14] J. Grossschadl, “A low-power bit-serial multiplier for finite fields GF(2
),” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), vol. 4. May 2001, pp. http://www.synopsys.com
37–40.
[15] L. Gao and K. K. Parhi, “Custom VLSI design of efficient low latency and low
power finite field multiplier for Reed–Solomon codec,” in Proc. IEEE Int.
Symp. Circuits Syst. (ISCAS), vol. 4. May 2001, pp. 574–577. Shoaleh Hashemi Namin received the B.Sc. degree
[16] H. Hinkelmann, P. Zipf, J. Li, G. Liu, and M. Glesner, “On the design of in computer engineering from the Amirkabir Uni-
reconfigurable multipliers for integer and Galois field multiplication,” versity of Technology, Tehran, Iran, in 2004, the
Microprocessors Microsyst., vol. 33, no. 1, pp. 2–12, Feb. 2009. M.Sc. degree in computer engineering from the
[17] Z. Yu, “Low power finite field multiplication and division in re- Sharif University of Technology, Tehran, in 2006,
configurable Reed–Solomon codec,” in Proc. IEEE Int. Symp. Circuits and the Ph.D. degree from the Department of Electri-
Syst. (ISCAS), vol. 5. May 2002, pp. V-785–V-788. cal and Computer Engineering, University of Wind-
[18] W. Zhang, J. Wang, and X. Zhang, “Low-power design of Reed–Solomon sor, Windsor, ON, Canada, in 2016.
encoders,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2013, She is currently a Post-Doctoral Fellow in the
pp. 1560–1563. Department of Electrical and Computer Engineer-
[19] W. Tang, H. Wu, and M. Ahmadi, “VLSI implementation of bit-parallel ing, University of Windsor, Windsor, ON, Canada.
233
word-serial multiplier in GF(2 ),” in Proc. 3rd Int. IEEE-NEWCAS Her current research interests include digital integrated circuit design, low
Conf., Jun. 2005, pp. 399–402. power design, hardware implementation of finite field arithmetic, and
[20] L. Song and K. K. Parhi, “Low-energy digit-serial/parallel finite field cryptosystems.
multipliers,” J. VLSI Signal Process. Syst. Signal, Image Video Technol.,
vol. 19, no. 2, pp. 149–166, Jul. 1998.
[21] P. K. Meher, “High-throughput hardware-efficient digit-serial architec-
m
ture for field multiplication over GF(2 ),” in Proc. 6th Int. Conf. Inf., Huapeng Wu (S’98–M’99–SM’13) received the
Commun. Signal Process. (ICICS), Dec. 2007, pp. 1–5.
[22] J. Fan and I. Verbauwhede, “A digit-serial architecture for inversion and B.S. degree in electrical engineering and the M.Sc.
m degree in computer science from the University of
multiplication in GF(2 ),” in Proc. IEEE Workshop Signal Process.
Syst. (SiPS), Oct. 2008, pp. 7–12. Science and Technology of China, Hefei, China, in
[23] S. Kumar, T. Wollinger, and C. Paar, “Optimum digit serial G F(2 m) 1987 and 1992, respectively, and the Ph.D. degree in
multipliers for curve-based cryptography,” IEEE Trans. Comput., vol. 55, electrical engineering from the University of Water-
no. 10, pp. 1306–1311, Oct. 2006. loo, Waterloo, ON, USA, in 1999.
[24] G. Bertoni, L. Breveglieri, and P. Fragneto, “Comparative He was a Visiting Assistant Professor with the
cost/performance evaluation of digit-serial multipliers for finite fields of Department of Electrical and Computer Engineering,
n Illinois Institute of Technology, Chicago, IL, USA, in
type GF(2 ),” in Proc. 14th Annu. IEEE Int. ASIC/SOC Conf., Sep. 2001,
pp. 306–310. 1999. He held a post-doctoral position with the
[25] M. Hutter, J. Großschadl, and G.-A. Kamendje, “A versatile and scalable Centre for Applied Cryptographic Research, University of Waterloo, from 2000
m
digit-serial/parallel multiplier architecture for finite fields GF(2 ),” to 2002. He is currently a Professor with the Department of Elec-trical and
in Proc. Int. Conf. Inf. Technol., Coding Comput. [Comput. Com- Computer Engineering, University of Windsor, Windsor, ON, Canada. He has
mun.] (ITCC), Apr. 2003, pp. 692–700. authored or co-authored over 20 journal papers including 15 IEEE
m
[26] M. K. Ibrahim and A. Almulhem, “Bit-level pipelined digit serial GF(2 TRANSACTIONS papers, and over 40 peer-reviewed conference papers. His
) multiplier,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), vol. 4. May current research interests include fast and efficient implementation of public
2001, pp. 586–589. key cryptography systems, data security, cyber security, and security
[27] P. K. Meher, “Systolic and non-systolic scalable modular designs of finite applications in vehicles.
field multipliers for Reed–Solomon codec,” IEEE Trans. Very Large Dr. Wu served as an Associate Editor of the IEEE TRANSACTIONS ON
Scale Integr. (VLSI) Syst., vol. 17, no. 6, pp. 747–757, Jun. 2009. COMPUTERS.
[28] H. Ho, “Design and implementation of a polynomial basis multiplier
m
architecture over GF(2 ),” J. Signal Process. Syst., vol. 75, no. 3,
pp. 203–208, Jun. 2014.
[29] J. H. Satyanarayana and K. K. Parhi, “HEAT: Hierarchical energy
analysis tool,” in Proc. 33rd Annu. Design Autom. Conf., Jun. 1996, Majid Ahmadi (S’75–M’77–SM’84–F’02) received
pp. 9–14. the B.Sc. degree in electrical engineering from Sharif
m University, Tehran, Iran, in 1971, and the Ph.D.
[30] P. K. Meher and C.-Y. Lee, “Scalable serial-parallel multiplier over GF(2 ) degree in electrical engineering from the Imperial
by hierarchical pre-reduction and input decomposition,” in Proc. IEEE Int.
Symp. Circuits Syst. (ISCAS), May 2009, pp. 2910–2913. College of Science, Technology and Medicine, Lon-
[31] C.-Y. Lee, C.-S. Yang, B. K. Meher, P. K. Meher, and J.-S. Pan, “Low- don, U.K., in 1977.
complexity digit-serial and scalable SPB/GPB multipliers over large He has been with the Department of Electrical and
binary extension fields using (b,2)-way Karatsuba decomposition,” IEEE Computer Engineering, University of Windsor,
Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 11, pp. 3115–3124, Nov. Windsor, ON, Canada, since 1980. He is currently a
2014. Distinguished University Professor. He has authored
[32] J. Rabaey, Low Power Design Essentials. New York, USA: Springer, over 550 articles. His current research interests
2009. include digital signal processing, machine vision, pattern recognition, neural
[33] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS network architectures, applications, VLSI implementation, computer arith-
Design. Springer, 1995. metic, and MEMS.
Low Power Design for a Word Level Normal Basis Finite Field
Multiplier Using Factoring Technique
Prasanna laxmi1, G.SWathi2, B.Hari Krishna3
1
M.Tech student,CMR Engineering College, ECE Department,prasanna963lucky@gmail.com,
2
Associate Professor,CMR Engineering College,ECEDepartment,swathi.guntupalli1@gmail.com
3
Professor, CMR Engineering College, ECE Department,harikrishna07@gmail.com

Abstract:

In CMOS based mostly application-specific integrated circuits (ASIC) styles, total power
consumption is dominated by dynamic power, wherever dynamic power consists of 2 major
elements, specifically change power and internal power. Low power style for a polynomial basis
finite field multiplier factor in GF (2m) during this technique provides high power consumption
and time delay with approximate results. We have a tendency to planned word level traditional
basis multiplier factor by this technology accustomed minimize change power and reduces the
facility consumption and time delay with correct results by victimization Xilinx computer code.
The synthesis results of the planned multiplier factor style consumes twenty seventh total power.

Keywords:

CMOS,ASIC, GF(2m),finite field multiplier,polynomial basis,normal basis.

1. Introduction:
The use of finite field arithmetic within the space of cryptography has gained increasing
importance. Analysis during this space is moving toward finding new architectures to implement
arithmetic operations a lot of with efficiency [1]. Multiplication operation has been paid most
attention by researchers, as a result of addition is just bitwise XOR operation between 2 field
components, and therefore the a lot of complicated operations, inversion, will be dole out with a
couple of multiplications.In GF (2m), there square measure numerous strategies to represent field
parts, like polynomial basis (PB), traditional basis, and twin basis. Lead is maybe the foremost
popularly used basis, as a result of it's adopted collectively of the premise decisions by
organizations that set standards for cryptography applications [2], [3]. An outsized variety of
architectures for economical implementation of lead finite field multipliers are projected.
Additionally, new representations supported lead referred to as shifted lead (SPB) [4] and
generalized lead [5] are projected for economical implementation of multipliers over GF (2m).The
choice of the irreducible polynomial p(x) affects the quality of a finite field number. Varied sorts
of irreducible polynomials embrace trinomials, pentanomials, all one polynomials, and equally
spaced polynomials. Normal organizations suggest irreducible polynomials with less variety of
nonzero terms (irreducible trinomials and pentanomials) for sensible use as these sorts of
irreducible polynomials will offer multipliers with lower quality. Atomic number 82 finite field
number architectures are often categorised into bit-serial [6], [7], bit-parallel [8],[9], and digit-
serial architectures.Bit-parallel structure is quick, however it's overpriced in terms of space. In
European Economic Community cryptography, the binary extension field size, m, is needed to get
on the order of 102, and so a bit-parallel structure needs a awfully high I/O information measure,
that is typically not accessible within the little transportable and wireless devices. Bit- serial design
is space economical, however it's too slow for several applications. In [10], a multiple-bit serial-
parallel multiplier factor has been planned. One quantity is rotten recursively and also the different
quantity is pre reduced hierarchically. The multiplier factor has higher space complexness
compared with a number of the present multipliers; but, it's lower important path delay, which
ends in lower area-time complexness.

2. Polynomial basis multiplier:


Limited field increase is otherwise called secluded duplication is spoken to as (A*B mod F), where
„A‟ and „B‟ are the information operands and „F‟ is final limited field generator polynomial.
Limited field augmentation is performed in two stages.

1. The initial step is customary polynomial increase.

2. The second step is modulo polynomial decrease process. The unchangeable limited field
generator polynomial (F) is utilized for decrease process.

The polynomial information (pk) and polynomial yield (pk) amid a cell is same since it's utilized
only for calculation. Each cell processes and that territory unit the coefficients of A (x)(i) and C
(x)(i) severally. In this approach we used 32 bit input And array’s, PIPO and acumilator. AND
array output is connected to PIPO then PIPO output is hooked up to XOR array and XOR array
output is connected to PIPO and subsequent PIPO output is hooked up to acumilator. Acumilator
is used to eliminate the modulus operation as shown in fig1.

Figure 1: Schematic diagram of polynomial basis multiplier

3. Word level normal basis:


A resolution technique is adopted in style of a word-level traditional basis number in GF (2m). it's
employed in the look of a finite field number at associate field of study level. A computer circuit
substitution technique is employed in style to cut back the inner power consumption of the
projected word-level traditional basis number. By mistreatment this technique to consume the low
power, scale back the time delay and space compared with existing technique.The equipment
execution for any NB limited field multiplier over GF(2m) can be ordered either as a parallel or
consecutive compose. In an ordinary parallel multiplier for GF(2m), once 2m-bits of two sources
of info are gotten, „m‟ bits of the item are acquired together at the yield after deferrals through
different rationale entryways or after postponements because of a memory get to.Somewhat level
successive multiplier is considerably more proficient yet it takes „m‟ cycles for one increase. In
this technique we used sixty four bit input XOR array, AND array, assign
and shift operation. By the usage of this shift and assign operations to get the accurate values and
much less time put off as shown in fig2.

Figure 2: Schematic diagram of word level normal basis multiplier

4. Results and Discussion:

The accompanying outcome demonstrates the current arrangement of polynomial premise


multiplier appeared in fig3.Here in existing method first delay occurred then output appears in fig3
and rationale use for polynomial premise multiplier appeared in fig4. In this venture the proposed
framework is word level ordinary premise multiplier which is appeared in fig5.Here in proposed
method without delay output appears shown in fig5 and rationale usage for word level typical
premise multiplier appeared in fig6. The power values shown in fig7.
Fig 3:Simulation waveform for polynomial basis multiplier

Fig 4: Logic utilization for polynomial basis multiplier

Fig 5:Simulation waveform for word level normal basis multiplier

Fig 6: Logic utilization for polynomial basis multiplier


proposed power value...0..036 watts
Existing power value...0..042 watts

Fig7: power values of Word level normal basis multiplier

Conclusion:

In CMOS based application-particular incorporated circuits (ASIC) plans, add up to control


utilization is commanded by unique power, where dynamic power comprises of two noteworthy
segments, specifically exchanging power and inward power. Low power plan for a polynomial
premise limited field multiplier in GF (2m) in this technique gives high power utilization and time
delay with estimated results. We proposed word level typical premise multiplier by this innovation
used to limit exchanging power and decreases the power utilization and time delay with precise
outcomes by utilizing Xilinx programming. The combination results have demonstrated that the
proposed multiplier configuration has devoured 37 % of aggregate power.
References:

[1] T. Beth and D. Gollman, “Algorithm Engineering for Public Key


Algorithms,” IEEE J. Selected Areas in Comm., vol. 7, no. 4, pp. 458-465, May
1989.

[2] C. F. Kerry, “Digital signature standard (DSS),” Nat. Inst. Standards Technol.,
Gaithersburg, MD, USA, FIPS PUB 186-4, 2013.

[3] IEEE Standard Specifications for Public-Key Cryptography, IEEE Standard


1363-2000, Aug. 2000, pp. 1–228.

[4] H. Fan and Y. Dai, “Fast bit-parallel GF(2n) multiplier for all trinomials,”
IEEE Trans. Comput., vol. 54, no. 4, pp. 485–490, Apr. 2005.

[5] A. Cilardo, “Fast parallel GF(2m ) polynomial multiplication for alldegrees,”


IEEE Trans. Comput., vol. 62, no. 5, pp. 929–943, May 2013.

[6] T. Beth and D. Gollman, “Algorithm engineering for public key algorithms,”
IEEE J. Sel. Areas Commun., vol. 7, no. 4, pp. 458–466,May 1989.

[7] M. Nikooghadam and A. Zakerolhosseini, “Utilization of pipeline technique


in AOP based multipliers with parallel inputs,” J. Signal Process.Syst., vol. 72,
no. 1, pp. 57–62, Jul. 2013.

[8] B. Sunar and C. K. Koc, “Mastrovito multiplier for all trinomials,” IEEE
Trans. Comput., vol. 48, no. 5, pp. 522–527, May 1999.

[9] Y. Li and Y. Chen, “New bit-parallel Montgomery multiplier for trinomials


using squaring operation,” Integr., VLSI J., vol. 52, pp. 142–155,Jan. 2016.

[10] P. K. Meher and C.-Y. Lee, “Scalable serial-parallel multiplier over GF(2m
) by hierarchical pre-reduction and input decomposition,” in Proc. IEEE Int.
Symp. Circuits Syst. (ISCAS), May 2009, pp. 2910–2913.

You might also like