Microprocessors and Microsystems 67 (2019) 82–92


High-throughput and area-efficient fully-pipelined hashing cores using BRAM in FPGA

Lin Li a, Shaoyu Lin a, Shuli Shen a, Kongcheng Wu a, Xiaochao Li a,b,∗, Yihui Chen a

a Department of Electronic Engineering, Xiamen University, Xiamen 361005, China
b School of Electrical and Computer Engineering, Xiamen University Malaysia, Sepang 43900, Malaysia

Article history: Received 27 May 2018; Revised 19 January 2019; Accepted 16 March 2019; Available online 17 March 2019

Keywords: Field-programmable gate arrays (FPGA); Cryptography; Secure Hash Algorithm (SHA); Fully-pipelined; Block RAM (BRAM)

Abstract: In this paper, an area-efficient fully-pipelined architecture of SHA-1 and SHA-256 implemented on FPGA is proposed for achieving high operating frequency and throughput. The conventional pipeline architecture consumes a lot of registers, and the consumption increases dramatically as the number of pipeline stages grows. To solve this problem, a new scheme using block RAM (BRAM) is presented to reduce the consumption of registers and to simplify the fully-pipelined architecture. Additionally, to achieve an operating frequency greater than 300 MHz, new sub-cores of SHA-1 and SHA-256 combined with the loop unrolling and pre-computation techniques are introduced to the design. Compared to previous works, the throughput and throughput/slice of SHA-1 and SHA-256 in the proposed designs are substantially increased, reaching 159.590 Gbps, 16.083 Mbps/slice and 154.880 Gbps, 10.94 Mbps/slice respectively on a Kintex-7 FPGA.

© 2019 Elsevier B.V. All rights reserved.

1. Introduction

The Secure Hash Algorithms are a family of cryptographic hash functions published by the National Institute of Standards and Technology (NIST) as a U.S. Federal Information Processing Standard (FIPS), including SHA-0, SHA-1, SHA-2 and SHA-3 [1]. A cryptographic hash function, which is designed to be a one-way function, is a mathematical algorithm that transforms data of arbitrary size into a fixed-size string [2]. The Secure Hash Algorithms are often used in information authentication to ensure data confidentiality and integrity. Among them, the SHA-1 and SHA-256 hash functions play an important role and are widely used in various applications. For instance, the SHA-1 algorithm is applied in distributed revision control systems such as Git to identify revisions and to detect data corruption [3]. It is also used in the authentication system of the MySQL server to protect user passwords. The SHA-256 algorithm is implemented in several widely used security applications and protocols, including TLS and SSL, PGP, SSH, S/MIME, and IPsec. Several cryptocurrencies such as Bitcoin use SHA-256 for verifying transactions and calculating proof-of-work or proof-of-stake.

A variety of hardware implementations of SHA-1 and SHA-256 have been proposed since the Secure Hash Algorithm standards were announced by NIST in 2002. Well-known techniques such as loop unrolling [4], pre-computation [5], pipelining [6] and parallelism exploitation [7] were introduced to these designs to improve performance. In [8] and [9], the loop unrolling technique has been discussed in detail. Loop unrolling refers to performing several hash operations in one clock cycle. It improves the throughput of SHA-1 and SHA-256 architectures at the cost of higher resource consumption and lower frequency. Lee et al. [5] presented the pre-computation technique to achieve high performance in SHA-1. Pre-computation means pre-calculating, during the current operation, some intermediate values needed in the next operation. Compared to loop unrolling, the pre-computation technique improves throughput by shortening the critical path and thereby increasing the operating frequency. For much higher throughput, pipelining and parallelism exploitation were used in many designs. In [7] and [10], parallel architectures of SHA-1 and SHA-256 were proposed, in which multiple hash modules work simultaneously for high-speed calculation, at the cost of very high slice utilization. It is difficult to increase the throughput and reduce the resource consumption at the same time.

In a previous study, a design with eight pipeline stages achieved higher TPS (throughput/slice) values than one with four pipeline stages [6], and the accompanying analysis shows that designs with more pipeline stages achieve not only higher throughput but also higher TPS values and better area efficiency. However, the authors consider fully-pipelined designs unrealistic, since such a design takes up a large portion of the resources in an FPGA [6,11]. Yet in most cases the hash functions are incorporated into a larger security system and are subject to area constraints. Therefore, the area

∗ Corresponding author at: Department of Electronic Engineering, Xiamen University, Xiamen 361005, China.
E-mail address: leexcjeffrey@xmu.edu.cn (X. Li).
https://doi.org/10.1016/j.micpro.2019.03.002
0141-9331/© 2019 Elsevier B.V. All rights reserved.
reduction is the key factor in implementing a fully-pipelined hash function.

In this paper, an area-efficient fully-pipelined architecture of SHA-1 and SHA-256 using BRAM is implemented on different Xilinx families (Virtex-4, Virtex-5, Virtex-6 and Kintex-7). Compared to previous works, the proposed design targets not only high throughput but also area efficiency: both slice and BRAM utilization are under 20% of the totals available in the Kintex-7 series FPGA. Moreover, new sub-cores of SHA-1 and SHA-256 are applied to the architectures to reduce the delay in the critical path. Compared to other works, our designs achieve better performance in terms of frequency, throughput and throughput/area.

The rest of the paper is organized as follows. Section 2 gives a brief introduction to the background of SHA-1 and SHA-256. In Section 3, the proposed designs are explained in detail. Implementation results and comparisons with previous works are given in Section 4. Finally, Section 5 concludes the paper.

2. Background

To describe the design of SHA-1 and SHA-256 on FPGA, we first give a brief introduction to the basic SHA-1 and SHA-256 algorithms and to three common techniques used to improve the throughput of hash functions. These techniques, namely loop unrolling, pre-computation and pipelining, are merged into the proposed designs in Section 3.

2.1. SHA-1 algorithm

SHA-1 takes input data of length less than 2^64 bits and gives a 160-bit output called the message digest or hash value [1]. The arbitrary-length message is packed and padded into one or more 512-bit blocks in the first stage of the algorithm. Then each 512-bit block is used to generate eighty 32-bit values, W_t. The SHA-1 algorithm requires 80 basic hash operations to process a single 512-bit block and produce the output values. Each operation, which consists of additions, a nonlinear function and circular shifts, should be completed in one clock cycle.

The structure of a SHA-1 hash operation is shown in Fig. 1. The darker dotted line represents the critical path, which consists of three additions. Let S^n(x) denote the 32-bit value obtained by circularly shifting x left by n bit positions. Let f_t denote the nonlinear function. W_t represents the 32-bit value computed from the 512-bit data and K_t the 32-bit round constant. a_t, b_t, c_t, d_t, e_t are the five variables used to perform the hash operations. The expressions that give a_{t+1}, b_{t+1}, c_{t+1}, d_{t+1}, e_{t+1} are described by Eq. (1) (0 ≤ t < 80):

a_{t+1} = S^5(a_t) + f_t(b_t, c_t, d_t) + e_t + W_t + K_t
b_{t+1} = a_t
c_{t+1} = S^30(b_t)                                                       (1)
d_{t+1} = c_t
e_{t+1} = d_t

Let 'x & y' denote the bit-wise AND of x and y and 'x ⊕ y' the bit-wise XOR of x and y. The symbol '∗' denotes multiplication and 'M' represents the 512-bit data block. f_t, W_t and K_t in Eq. (1) are defined by Eqs. (2)–(4):

f_t(b_t, c_t, d_t) = (b_t & c_t) ⊕ (¬b_t & d_t)                0 ≤ t < 20
f_t(b_t, c_t, d_t) = b_t ⊕ c_t ⊕ d_t                           20 ≤ t < 40     (2)
f_t(b_t, c_t, d_t) = (b_t & c_t) ⊕ (b_t & d_t) ⊕ (c_t & d_t)   40 ≤ t < 60
f_t(b_t, c_t, d_t) = b_t ⊕ c_t ⊕ d_t                           60 ≤ t < 80

W_t = M[(32∗t + 31) : (32∗t)]                                  0 ≤ t < 16
W_t = S^1(W_{t−16} ⊕ W_{t−14} ⊕ W_{t−8} ⊕ W_{t−3})             16 ≤ t < 80     (3)

K_t = 0x5A827999   0 ≤ t < 20
K_t = 0x6ED9EBA1   20 ≤ t < 40     (4)
K_t = 0x8F1BBCDC   40 ≤ t < 60
K_t = 0xCA62C1D6   60 ≤ t < 80

For the first hash operation there is no preceding hash value. Let H_0, H_1, H_2, H_3, H_4 represent the five initial hash values. The initial inputs (a_0, b_0, c_0, d_0, e_0) are then derived by Eq. (5):

a_0 = H_0 = 0x67452301
b_0 = H_1 = 0xEFCDAB89
c_0 = H_2 = 0x98BADCFE     (5)
d_0 = H_3 = 0x10325476
e_0 = H_4 = 0xC3D2E1F0

2.2. SHA-256 algorithm

Like the SHA-1 hash function, the SHA-256 algorithm operates on an arbitrary-length message M and returns a fixed-length output, called the hash value or message digest of M. The differences between SHA-1 and SHA-256 are the size of the final hash value, the initial hash constants, the round constants and the round computation. SHA-256 performs 64 iterations to produce its 256-bit output, and its operational round includes simple additions and nonlinear functions, as shown in Fig. 2.

Fig. 1. SHA-1 hash operation.
Fig. 2. SHA-256 hash operation.
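Eqs. (1)–(5) fully determine the SHA-1 compression. As a sanity check of the reconstructed equations, the following is a minimal software reference model in Python (a bit-accurate sketch only, not the hardware design), verified against the published test vector for the message "abc":

```python
# Software model of the SHA-1 round of Eq. (1), the message schedule of
# Eq. (3), and the constants of Eqs. (2), (4) and (5). Reference only.
import struct

MASK = 0xFFFFFFFF

def S(x, n):
    """S^n(x): circular left shift of a 32-bit word by n positions."""
    return ((x << n) | (x >> (32 - n))) & MASK

def f(t, b, c, d):
    """Nonlinear function f_t of Eq. (2)."""
    if t < 20:
        return ((b & c) ^ (~b & d)) & MASK
    if 40 <= t < 60:
        return (b & c) ^ (b & d) ^ (c & d)
    return b ^ c ^ d

def K(t):
    """Round constant K_t of Eq. (4)."""
    return (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)[t // 20]

def sha1(message: bytes) -> str:
    # Pad to a multiple of 512 bits (one '1' bit, zeros, 64-bit length).
    bitlen = len(message) * 8
    message += b"\x80" + b"\x00" * ((56 - (len(message) + 1)) % 64)
    message += struct.pack(">Q", bitlen)
    H = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]  # Eq. (5)
    for i in range(0, len(message), 64):
        W = list(struct.unpack(">16I", message[i:i + 64]))
        for t in range(16, 80):                     # schedule, Eq. (3)
            W.append(S(W[t-16] ^ W[t-14] ^ W[t-8] ^ W[t-3], 1))
        a, b, c, d, e = H
        for t in range(80):                         # 80 rounds, Eq. (1)
            a, b, c, d, e = ((S(a, 5) + f(t, b, c, d) + e + W[t] + K(t)) & MASK,
                             a, S(b, 30), c, d)
        H = [(x + y) & MASK for x, y in zip(H, (a, b, c, d, e))]
    return "".join("%08x" % h for h in H)
```

Running `sha1(b"abc")` reproduces the standard digest, confirming that Eqs. (1)–(5) as written above are consistent.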


The darker dotted line represents the critical path of SHA-256, which consists of five additions. Let 'Maj' and 'Ch' denote the nonlinear functions and K_t^{256} the 32-bit round constant. a_t, b_t, c_t, d_t, e_t, f_t, g_t, h_t are the eight variables used to perform the hash operations. The expressions that give a_{t+1}, b_{t+1}, c_{t+1}, d_{t+1}, e_{t+1}, f_{t+1}, g_{t+1}, h_{t+1} are described by Eq. (6) (0 ≤ t < 64):

a_{t+1} = h_t + Σ1^{256}(e_t) + Ch(e_t, f_t, g_t) + K_t^{256} + W_t + Σ0^{256}(a_t) + Maj(a_t, b_t, c_t)
b_{t+1} = a_t
c_{t+1} = b_t
d_{t+1} = c_t                                                             (6)
e_{t+1} = d_t + h_t + Σ1^{256}(e_t) + Ch(e_t, f_t, g_t) + K_t^{256} + W_t
f_{t+1} = e_t
g_{t+1} = f_t
h_{t+1} = g_t

Let SHR^n(x) denote the right shift operation by n bit positions. W_t and the nonlinear functions in Eq. (6) are defined by Eqs. (7)–(13). More details about SHA-256 can be found in the Secure Hash Standard [1].

W_t = M[(32∗t + 31) : (32∗t)]                                             0 ≤ t < 16
W_t = σ1^{256}(W_{t−2}) + W_{t−7} + σ0^{256}(W_{t−15}) + W_{t−16}         16 ≤ t < 64     (7)

Ch(x, y, z) = (x & y) ⊕ (¬x & z)     (8)

Maj(x, y, z) = (x & y) ⊕ (x & z) ⊕ (y & z)     (9)

Σ0^{256}(x) = S^30(x) ⊕ S^19(x) ⊕ S^10(x)     (10)

Σ1^{256}(x) = S^26(x) ⊕ S^21(x) ⊕ S^7(x)     (11)

σ0^{256}(x) = S^25(x) ⊕ S^14(x) ⊕ SHR^3(x)     (12)

σ1^{256}(x) = S^15(x) ⊕ S^13(x) ⊕ SHR^10(x)     (13)

2.3. Loop unrolling

The loop unrolling technique refers to performing two or more hash operations in one clock cycle [4]. Among its variants, unrolling two hash operations into one new transformation operation is the most commonly used. According to Eq. (1), in the SHA-1 algorithm the outputs a_{t+2}, b_{t+2}, c_{t+2}, d_{t+2} and e_{t+2} should be derived from the values a_{t+1}, b_{t+1}, c_{t+1}, d_{t+1} and e_{t+1}. Loop unrolling means that the outputs a_{t+2} to e_{t+2} are instead derived directly from a_t to e_t. The modified expressions of this design are given in Eq. (14):

a_{t+2} = S^5(S^5(a_t) + f_t(b_t, c_t, d_t) + e_t + W_t + K_t) + f_{t+1}(a_t, S^30(b_t), c_t) + d_t + W_{t+1} + K_{t+1}
b_{t+2} = S^5(a_t) + f_t(b_t, c_t, d_t) + e_t + W_t + K_t
c_{t+2} = S^30(a_t)                                                       (14)
d_{t+2} = S^30(b_t)
e_{t+2} = c_t

The structure of the new transformation is shown in Fig. 3. The darker dotted line represents the critical path, which is composed of four additions and one circular shift. Obviously, the frequency of the design decreases because of the large delay in the critical path. On the other hand, the throughput increases, since two hash operations are computed in one clock cycle and the hash value is produced in 40 cycles instead of 80 compared to the basic SHA-1 design.

Fig. 3. The transformation operation of loop unrolling in SHA-1.

2.4. Pre-computation

Pre-computation was proposed on the basis of the loop unrolling design [5]. Pre-computation is applied to part of the transformation operation of loop unrolling to increase the frequency, as illustrated in Fig. 4. It refers to pre-calculating, during one operation, some intermediate values and storing them in registers so that they can be used in the next operation. The pale blue region represents the pre-computation unit. One extra clock cycle is spent to finish the first pre-computation; after that, 40 iterative transformation operations are completed in 40 clock cycles.

Three new parameters l_t, m_t, n_t are used to store the intermediate values. The expressions of the pre-computation design are described by Eq. (15). The subscript t is defined as t = 2x (0 ≤ x < 40; for l_{t+2}, m_{t+2}, n_{t+2}, 0 ≤ x < 39):

a_{t+2} = S^5(S^5(a_t) + l_t) + f_{t+1}(a_t, S^30(b_t), c_t) + m_t
b_{t+2} = S^5(a_t) + l_t
c_{t+2} = S^30(a_t)
d_{t+2} = S^30(b_t)                                                       (15)
e_{t+2} = c_t
l_{t+2} = f_{t+2}((S^5(a_t) + l_t), S^30(a_t), S^30(b_t)) + c_t + n_t
m_{t+2} = S^30(b_t) + W_{t+3} + K_{t+3}
n_{t+2} = W_{t+4} + K_{t+4}

The critical path, represented by the dotted line, consists of two additions, one nonlinear function and one circular shift. The pre-computation design achieves a higher frequency than loop unrolling by decreasing the delay in the critical path.

Fig. 4. The pre-computation design of SHA-1.

2.5. Pipelining

Suppose that the architecture is split up into a sequence of dependent stages; pipelining is a technique in which different stages can be executed in parallel [6,7]. Pipelining increases throughput
by performing multiple stages at the same time, but it does not reduce the latency, i.e. the time required for a single datum to propagate through the stages of the pipeline from start to finish.

The core SHA-1 block consists of 80 basic hash operations (SHA-256 has 64). The typical pipeline architecture has four stages [7,12]. Each stage performs 20 hash operations in 20 clock cycles. The first output hash value is obtained after 80 clock cycles, and after that a hash value is obtained every 20 clock cycles. For the fully-pipelined design, the architecture is divided into 80 stages for SHA-1 and 64 stages for SHA-256, each stage performing one basic hash operation. Hash values are then produced every clock cycle, but the slice consumption in the FPGA greatly increases. In this paper, a fully-pipelined architecture that solves this problem is proposed in Section 3.

3. Proposed designs

3.1. Optimized fully-pipelined architecture

The conventional pipeline architecture of SHA-1 and SHA-256 consumes a lot of registers; in particular, the gate count increases dramatically with the number of pipeline stages. For instance, Michail et al. [6] implement a fully-pipelined design with 80 pipeline stages in the Virtex-4 family. It takes up a significant portion of the whole FPGA device: the slice utilization of their design is 31,700 of 49,152, almost 64%. To solve this problem, we propose a new scheme for the fully-pipelined architecture of hash functions which focuses on a balanced utilization of the resources in the FPGA device and makes the cumbersome design simple.

An illustrative example of our design is shown in Fig. 5. The core idea of the scheme is to use BRAM in appropriate positions to reduce register consumption. The figure shows a fully-pipelined SHA-1 computation with 80 pipeline stages. It consists of a message padding unit, a words computation unit and the main calculation unit (represented by the dotted frame). The message padding unit receives the messages each clock cycle and transforms the data into the 512-bit format that the SHA-1 algorithm requires. After that, the processed 512-bit data is divided into 16 blocks (W_0 to W_15) and sent to the next two units: the main calculation unit and the words computation unit. The words computation unit is responsible for computing the other required W_t values (W_16 to W_79). In the end, we obtain 80 W_t values, which are used in the 80 rounds of the SHA-1 computation.

Fig. 5. Fully-pipelined implementation.

As shown in Fig. 5, the main calculation unit consists of five similar sub-groups of sub-cores. A sub-core is defined as a block that performs one hash operation, or several operations, of SHA-1; we discuss it in detail in the next section. Each sub-group accepts 16 W_t values, which are stored in BRAM before being consumed by the sub-cores. In our design, the BRAM acts as a first-in/first-out (FIFO) buffer. Analyzing the main calculation unit, W_0 does not need to be stored in a BRAM, since it is consumed first. The other W_t values (W_1 to W_79) are saved in BRAM to reduce the consumption of registers.

The use of BRAM is illustrated in Fig. 6. In the proposed architectures, the block RAM is configured in 18Kb mode, in a 512 × 32 configuration [13]. The BRAM modules are configured in simple dual-port RAM mode. In this mode, independent Read and Write operations can occur simultaneously: port A is designated as the Write port and port B as the Read port. Address A represents the position where a W_t (1 ≤ t < 80) value is written at the current active clock edge, and is updated at each clock edge as:

AddressA = AddressA + 1
After each active clock edge, the value of Address A increases by one, so the next value is written into the next position of the BRAM at the next active clock edge. The initial value of Address A is 1, and its range is 512 because the BRAM is in the 512 × 32 configuration. As shown in Fig. 6, the symbol W_{t,n} represents the W_t value written at the n-th active clock edge.

Simultaneously, at each active clock edge, sub-core #t reads the corresponding W_t value from Address B of the BRAM to perform its calculation. Address B is defined as:

AddressB = AddressA + Delay

Address B depends on the value of Address A and on the Delay value, which equals the number of cycles the W_t value waits before being read by sub-core #t.

Fig. 6. BRAM organization chart of the proposed design.

There are several ways to implement a block RAM on a Xilinx device, such as instantiation, inference, the core generator and macros [22,23]. We chose the core generator and inference implementation methods in our design. The BRAMs can be generated by the IP core generator of the ISE or Vivado design suites; the BRAM IP core can be instantiated as shown in Fig. 7. The configuration interface of BRAM in ISE 14.7 is shown in Fig. 8, where it can be seen that the primitive uses 'ram', not fabric slices. Another way to incorporate block RAM into the design is to use the automatic inference capability of the synthesis tool, for a higher degree of portability. In some synthesis tools, such as ISE 14.4–14.7 and Vivado, BRAM can be inferred directly from plain Verilog/VHDL code [22,23]. Fig. 9 shows how we create a simple dual-port block RAM in the design; in this way we achieve flexibility and portability of the code across multiple architectures.

Fig. 7. The instantiation of the BRAM IP core.

In the conventional fully-pipelined architecture of SHA-1 (80 stages), W_1 to W_79 must be stored in registers. Suppose a set of data words (data1, data2, …, datax) enter the architecture at the 1st, 2nd, …, x-th active clock edge respectively, and let 'W_t_datax' represent the W_t value corresponding to datax. As shown in the upper part of Fig. 10, the W_t values are stored in registers before being used by sub-core #t. At the 1st active clock edge, one extra register, register A, is spent to keep W_t_data1. At the 2nd active clock edge, W_t_data2 is generated from data2. However, W_t_data2 cannot be stored in register A, since W_t_data1 has not yet been used; another extra register, register B, is used to keep W_t_data2. Accordingly, register C is used to keep W_t_data3. Registers A, B and C cannot be overwritten until their values have been used by the sub-cores. Normally, shift registers are used in the conventional fully-pipelined architecture for storing the above W_t values. At the 1st active clock edge, W_t_data1 is stored in W_t_1delay. At the 2nd active clock edge, the value of W_t_data1 moves from W_t_1delay to W_t_2delay and W_t_data2 is written to W_t_1delay at the same time. At the 3rd active clock edge, W_t_data1 moves into W_t_3delay, W_t_data2 moves into W_t_2delay and W_t_data3 is written to W_t_1delay simultaneously. Accordingly, at the t-th active clock edge, W_t_data1 is written to W_t_tdelay, and at the next active clock edge W_t_data1 is used by sub-core #t. After that, the value of W_t_data1 is discarded, and likewise W_t_data2, W_t_data3, …, W_t_datax.

Let us simply count the registers this costs. Suppose that each sub-core takes i clock cycles to finish its calculation. Then W_1 must wait 1·i clock cycles before being used by sub-core #1, which means W_1 needs 1·i extra registers to keep it. W_2 must wait 2·i clock cycles before being used by sub-core #2, so W_2 needs 2·i extra registers. Accordingly, W_t needs t·i extra registers (1 ≤ t < 80). If i equals 1, each sub-core takes one clock cycle to finish its calculation, and the total number of registers spent in this situation is:

1 + 2 + ... + t + ... + 79 = 3160.

This shows that the original design costs at least 3160 extra registers just to keep the W_t values.

The SHA-256 implementation is similar, differing only in the pipeline stages: the SHA-256 algorithm uses 64 pipeline stages and produces a 256-bit output. Our scheme can remove these extra registers and make the design simpler by storing W_t in BRAM. It also means that the scheme can fit more SHA-1 or SHA-256 modules into one FPGA device, because it balances the utilization of resources in the FPGA; more comparison data can be found in the experimental section. One important point is that the design should flexibly adjust the amount of BRAM according to the actual situation, for example storing W_1 and W_2 in registers given the low cost of their storage.

3.2. The sub-cores of SHA-1

The SHA-1 hash function is an iterative algorithm which performs 80 iterations, called hash operations. A sub-core of SHA-1 is defined as a block that performs one hash operation or several operations. In order to push the operating frequency over 300 MHz, three different sub-cores, namely the basic sub-core, the sub-core incorporating loop unrolling and the sub-core incorporating pre-computation, are carefully designed in this subsection by applying the techniques described in Section 2, and are explained in detail.

3.2.1. The basic sub-core

In [9], Li et al. showed that the addition operations dominate the delay of SHA-1 hash operations, followed by the nonlinear function and the circular shift. Therefore, the number of additions in the critical path should be minimized before other functions are considered. As shown in Fig. 1, the critical path of the hash operation is composed of three additions. It is common practice to implement a fully-pipelined SHA-1 architecture with this hash operation, but the frequency cannot reach 300 MHz with such a design. For this reason, we propose a sub-core aimed at a high frequency of over 300 MHz.

The basic sub-core structure is shown in Fig. 11. The hash operation is divided into two parts, and each part must be completed in one cycle. Seven temporary variables (m_t∗, b_t∗, c_t∗, d_t∗, e_t∗, u_t∗, v_t∗), stored in seven additional 32-bit registers, are introduced into the hash operation. The modified expressions that give a_{t+1}, b_{t+1}, c_{t+1}, d_{t+1}, e_{t+1} are described in Eq. (16) (1 ≤ t < 80).
Fig. 8. The configuration interface of BRAM in ISE 14.7.

Part 1 (first cycle):
m_t∗ = S^5(a_t)
u_t∗ = f_t(b_t, c_t, d_t)
v_t∗ = e_t + W_t + K_t
b_t∗ = a_t                                                                (16)
c_t∗ = S^30(b_t)
d_t∗ = c_t
e_t∗ = d_t

Part 2 (second cycle):
a_{t+1} = m_t∗ + u_t∗ + v_t∗
b_{t+1} = b_t∗
c_{t+1} = c_t∗
d_{t+1} = d_t∗
e_{t+1} = e_t∗

Analyzing Fig. 11, the darker dotted line denotes the critical path. It consists of two additions and is thus shortened by one addition compared to Fig. 1.

The proposed fully-pipelined architecture with the basic sub-core can now reach a frequency of more than 300 MHz thanks to the reduction of the critical path. The design has 80 sub-cores, so the number of pipeline stages is 80. Each sub-core requires 2 cycles to perform its calculation. Therefore, an initial latency of 160 clock cycles is spent to produce the first hash value; after that, new hash values are produced every clock cycle due to the fully-pipelined design.

3.2.2. The sub-core incorporating loop unrolling

The most common use of loop unrolling is unrolling and combining two hash operations into one new transformation operation, which should be performed in a single clock cycle. The critical path of the original design consists of four additions and one circular shift, as shown in Fig. 3. To achieve a frequency over 300 MHz, the sub-core incorporating loop unrolling divides the new transformation operation into 3 parts to reduce the large delay of the critical path.

The structure of the sub-core is shown in Fig. 12. The darker dotted line represents the new critical path. Compared to the original loop unrolling, the critical path is reduced to two additions. The design needs three clock cycles to finish its calculation, so several additional registers are used to store the temporary variables. The expressions for a_{t+2}, b_{t+2}, c_{t+2}, d_{t+2} and e_{t+2} are described in Eq. (17). The subscript t is defined as t = 2x (0 ≤ x < 40).

The number of pipeline stages is 40 instead of 80, since two hash operations are merged into one transformation operation. Each stage has a sub-core that now needs two W_t values to perform its calculation. In this case, the structure of the fully-pipelined architecture in Fig. 5 requires a slight adjustment: each sub-core accepts two W_t values from two corresponding BRAMs. Implementing this 40-stage fully-pipelined design, each stage requires 3 clock cycles to complete. The first hash value is obtained after 120 clock cycles; then hash values are produced every clock cycle. The frequency of the design is over
300 MHz.

Fig. 9. The code for creating the simple dual-port block RAM in the design.
Fig. 10. The units used to store W_t values in the conventional design.
Fig. 11. The basic sub-core of SHA-1.

Cycle 1:
m_t∗ = S^5(a_t) + f_t(b_t, c_t, d_t)
u_t∗ = f_{t+1}(a_t, S^30(b_t), c_t)
n_t∗ = d_t + W_{t+1} + K_{t+1}
v_t∗ = e_t + W_t + K_t
c_t∗ = S^30(a_t)
d_t∗ = S^30(b_t)
e_t∗ = c_t

Cycle 2:
b_{t+1}∗ = m_t∗ + v_t∗
c_{t+1}∗ = c_t∗                                                           (17)
d_{t+1}∗ = d_t∗
e_{t+1}∗ = e_t∗
u_{t+1}∗ = u_t∗
n_{t+1}∗ = n_t∗

Cycle 3:
a_{t+2} = S^5(b_{t+1}∗) + u_{t+1}∗ + n_{t+1}∗
b_{t+2} = b_{t+1}∗
c_{t+2} = c_{t+1}∗
d_{t+2} = d_{t+1}∗
e_{t+2} = e_{t+1}∗
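The three-cycle schedule of Eq. (17) can be checked against two applications of the basic round of Eq. (1). The Python sketch below is a functional model only (cycle boundaries are shown as comments; variable names are illustrative):

```python
# Model of the three-cycle unrolled sub-core (Eq. (17)), checked against
# two successive basic rounds of Eq. (1) over a full 80-round schedule.
MASK = 0xFFFFFFFF

def S(x, n):                                  # circular left shift by n
    return ((x << n) | (x >> (32 - n))) & MASK

def f(t, b, c, d):                            # Eq. (2)
    if t < 20:
        return ((b & c) ^ (~b & d)) & MASK
    if 40 <= t < 60:
        return (b & c) ^ (b & d) ^ (c & d)
    return b ^ c ^ d

def K(t):                                     # Eq. (4)
    return (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)[t // 20]

def unrolled_subcore(t, a, b, c, d, e, W):
    # Cycle 1: partial terms m*, u*, n*, v* and the rotations.
    m = (S(a, 5) + f(t, b, c, d)) & MASK
    u = f(t + 1, a, S(b, 30), c)
    n = (d + W[t + 1] + K(t + 1)) & MASK
    v = (e + W[t] + K(t)) & MASK
    c1, d1, e1 = S(a, 30), S(b, 30), c
    # Cycle 2: b_{t+2} = m* + v*; u* and n* are simply registered.
    b2 = (m + v) & MASK
    # Cycle 3: the two-addition critical path.
    a2 = (S(b2, 5) + u + n) & MASK
    return a2, b2, c1, d1, e1

def basic_round(t, a, b, c, d, e, W):         # Eq. (1)
    return ((S(a, 5) + f(t, b, c, d) + e + W[t] + K(t)) & MASK,
            a, S(b, 30), c, d)
```

For every even t, the sub-core's output equals the state obtained by applying the basic round at t and then at t+1, confirming that the 3-part split implements Eq. (14) exactly.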
Fig. 12. The sub-core incorporating loop unrolling.

3.2.3. The sub-core incorporating pre-computation

The sub-core incorporating pre-computation is shown in Fig. 13. The design contains a new transformation operation which is performed in two clock cycles. The pale blue region represents the pre-computation unit; likewise, it is divided into two parts for calculation. The variables l, m denote the intermediate values that will be used in the next operations; thus the outputs l_{t+2}, m_{t+2} are fed into the next transformation operation as input values. The modified expressions that describe the design are given in Eq. (18). The subscript t is defined as t = 2x (0 ≤ x < 40; for l_{t+2}, m_{t+2}, 0 ≤ x < 39).

Cycle 1:
b_t∗ = S^5(a_t) + l_t
c_t∗ = S^30(a_t)
d_t∗ = S^30(b_t)
e_t∗ = c_t
u_t∗ = f_{t+1}(a_t, S^30(b_t), c_t)
v_t∗ = c_t + W_{t+2} + K_{t+2}
m_t∗ = m_t

Cycle 2:
a_{t+2} = S^5(b_t∗) + u_t∗ + m_t∗
b_{t+2} = b_t∗
c_{t+2} = c_t∗                                                            (18)
d_{t+2} = d_t∗
e_{t+2} = e_t∗
l_{t+2} = f_{t+2}(b_t∗, c_t∗, d_t∗) + v_t∗
m_{t+2} = d_t∗ + W_{t+3} + K_{t+3}

Comparing the pre-computation designs in Figs. 4 and 13, the proposed sub-core removes the variable n_t. In the original pre-computation, n_t is used as an intermediate value to generate the variable l_{t+2}; in that case, the critical path for calculating l_{t+2} is shortened by one addition compared to the case without n_t. However, as shown in Eq. (18), the new calculation of l_{t+2} is composed of just one addition and one nonlinear function, so there is no need to introduce an additional variable to shorten the critical path, and the variable n_t can be removed. The darker dotted line represents the critical path. The new critical path is shortened by one nonlinear function and one circular shift compared to Fig. 4; it consists of only two additions, which ensures the high frequency of the architecture.

As shown in Fig. 13, an extra initial block is introduced into the architecture for the first pre-computation; it requires two clock cycles. The number of sub-cores is 40, and each of them requires two clock cycles to finish its calculation. Thus, the initial latency of the architecture is 82 clock cycles, after which hash values are produced every clock cycle.

Fig. 13. The sub-core incorporating pre-computation.

3.3. The sub-core of SHA-256

The SHA-256 hash function performs 64 iterations, called hash operations, to produce the 256-bit hash value. The sub-core of SHA-256 is defined as a block that performs one hash operation. As shown in Fig. 2, the critical path of SHA-256 consists of five additions; in this case, implementations of SHA-256 are inefficient because of the low frequency of the resulting architectures. To achieve higher throughput, we propose a new sub-core of SHA-256 which attains a high frequency of over 311 MHz by reducing the critical path.
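The pre-computation sub-core of Eq. (18) above can also be checked in software. The sketch below assumes initial values l_0 = f_0(b_0,c_0,d_0) + e_0 + W_0 + K_0 and m_0 = d_0 + W_1 + K_1, which are not spelled out in the excerpt but are the only choices consistent with Eqs. (15) and (18); all names are illustrative:

```python
# Model of the two-cycle pre-computation sub-core (Eq. (18)), checked
# against two successive basic rounds of Eq. (1). The initial l and m are
# an assumption derived for consistency with Eqs. (15)/(18).
MASK = 0xFFFFFFFF

def S(x, n):                                  # circular left shift by n
    return ((x << n) | (x >> (32 - n))) & MASK

def f(t, b, c, d):                            # Eq. (2)
    if t < 20:
        return ((b & c) ^ (~b & d)) & MASK
    if 40 <= t < 60:
        return (b & c) ^ (b & d) ^ (c & d)
    return b ^ c ^ d

def K(t):                                     # Eq. (4)
    return (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)[t // 20]

def precomp_subcore(t, a, b, c, d, e, l, m, W):
    # Cycle 1: b*, the rotations, u*, v* (guards avoid reading past W/K).
    b1 = (S(a, 5) + l) & MASK                 # b*_t, which is also b_{t+2}
    c1, d1, e1 = S(a, 30), S(b, 30), c
    u = f(t + 1, a, S(b, 30), c)
    v = (c + W[t + 2] + K(t + 2)) & MASK if t + 2 < 80 else 0
    # Cycle 2: two-addition critical path, plus the next-stage l and m.
    a2 = (S(b1, 5) + u + m) & MASK
    l2 = (f(t + 2, b1, c1, d1) + v) & MASK if t + 2 < 80 else 0
    m2 = (d1 + W[t + 3] + K(t + 3)) & MASK if t + 3 < 80 else 0
    return a2, b1, c1, d1, e1, l2, m2

def basic_round(t, a, b, c, d, e, W):         # Eq. (1)
    return ((S(a, 5) + f(t, b, c, d) + e + W[t] + K(t)) & MASK,
            a, S(b, 30), c, d)
```

Iterating the sub-core over t = 0, 2, ..., 78 reproduces the state produced by the plain round function, so removing n_t does not change the computed hash.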
Fig. 14. The sub-core of SHA-256.

The new sub-core of SHA-256 is shown in Fig. 14. Ten additional 32-bit registers are introduced to the hash operation, dividing the sub-core into two parts. The sub-core now needs two clock cycles to finish its calculations, but the critical path is reduced to two additions, improving the frequency of the design. The darker dotted line represents the new critical path. The expressions for at+1 to ht+1 are described in Eq. (19).

    T1 = Σ1^256(et) + Ch(et, ft, gt)
    T2 = Σ0^256(at) + Maj(at, bt, ct)
    T3 = ht + Kt^256 + Wt

    a* = at        at+1 = T1 + T3 + T2
    b* = bt        bt+1 = a*
    c* = ct        ct+1 = b*
    d* = dt        dt+1 = c*
    e* = et        et+1 = T1 + T3 + d*
    f* = ft        ft+1 = e*
    g* = gt        gt+1 = f*
                   ht+1 = g*                    (19)

The number of pipeline stages is 64 since SHA-256 performs 64-round hash operations. Each stage includes one sub-core, which needs two clock cycles to perform its computations. The first output can be obtained after 128 clock cycles. Then, the other outputs are produced every clock cycle at a high frequency of 311 MHz.

4. Experimental results

All proposed designs were coded in Verilog HDL using the Xilinx ISE Design Suite 14.7 or Vivado 2014.2 and synthesized using Xilinx Synthesis Technology. For a complete study and fair comparison, we have implemented our designs on three different Xilinx families, including Virtex-5 (xc5vsx240t-2ff1738), Virtex-6 (xc6vlx240t-1ff1156) and Kintex-7 (xc7k325t-1ffg676) [16]. In addition, the proposed designs were also implemented on the older Xilinx family Virtex-4 (xc4vfx100-12ff1517) for the comparison between our design and the existing conventional fully-pipelined architecture [6]. The functionality of each design was tested and verified with the ISim simulator using functional and post-place-and-route simulations. The occupied slices, BRAM and frequency values provided in the following tables are achieved after performing place and route with no timing violations.

Table 1 shows the results of the proposed fully-pipelined architectures of SHA-1 in the different Xilinx families. Tables 2 and 3 give the comparisons between the proposed designs and the previous works on SHA-1 and SHA-256. Moreover, we compared the proposed designs with the existing conventional fully-pipelined architectures in the same Xilinx families in Table 4. The throughput is calculated as

    Throughput = (BlockSize × Frequency) / #Cycles                    (20)

The value of BlockSize is 512 bits. #Cycles refers to the number of clock cycles required to produce each hash value. The throughput-per-slice (TPS: throughput/area) is computed using Eq. (21). For a fairer comparison, one BRAM (36 Kb, 2 × 18 Kb) is counted as 128 slices in the calculations [17,18].

    TPS = Throughput / (128 × #BRAM + Slices)                    (21)

As shown in Table 1, all proposed designs achieve throughput over 136 Gbps and frequency over 266 MHz. Notably, the frequency increased to 317 MHz when the proposed design was implemented on the Kintex-7 FPGA device. The design with the basic sub-core demands the most slices; the two other designs require fewer slices owing to the application of the loop-unrolling and pre-computation techniques. The minimum number of occupied slices is only 4803, which is 9.4% of the total slices of the Kintex-7 FPGA device (xc7k325t-1ffg676). This is due to the use of the BRAM structure, which can greatly reduce the occupied slices and balance the utilization of resources in FPGA devices. This feature is very important when many other cryptography algorithms must be integrated to accommodate a complicated authentication scheme in one resource-constrained FPGA.

Table 1 also gives the usage of BRAM in the different Xilinx families. In Virtex-5, the pre-computation design takes the smallest number of BRAMs, namely 40 BRAMs (36 Kb). The word length of the BRAM data words is 1440 Kbits, which accounts for 7.8% of the total BRAMs (18,576 Kbits). The base design and the loop-unrolling design cost the same number of BRAMs. In Virtex-6, they use 42 BRAMs (36 Kb). The word length of the BRAMs is now 1512 Kbits, which is 10.1% of the total BRAMs (14,976 Kbits).

Table 2 gives the comparisons of SHA-1 between the proposed design and other related studies. For a fair comparison, we implemented our design on the same FPGA families as the related papers. The proposed design used for comparison in Table 2 applies the new sub-core with the pre-computation technique. It is clear from the table that our design achieves much better performance in terms of frequency, throughput and TPS. In Virtex-6, the achieved throughput of the proposed design is 143.57 Gbps, which is 13 times higher than that (11.1 Gbps) in [14]. The achieved frequency is 280.4 MHz, an increase of 63 MHz compared to that in [14]. Also, the achieved TPS is the best, at 12.59 Mbps/slice.

Table 3 gives the comparisons of SHA-256 between the proposed design and other related studies. The results are similar to those of SHA-1. We have achieved the best performance in terms of frequency, throughput and TPS. The achieved throughput of the proposed design was increased by more than 130 Gbps. Compared to the throughput in [20] and [21], it was increased by 11.9 times and 68.3 times respectively. When the proposed design was implemented on the Kintex-7 FPGA device, the best performance would be

Table 1
The performance of proposed designs of SHA-1.

Technique  Device    Frequency (MHz)  Slices          BRAM        Throughput (Gbps)
Basic      Virtex-5  272.6            11,135 (29.7%)  42 (8.1%)   139.571
Basic      Virtex-6  277.3            8144 (21.6%)    42 (10.1%)  141.978
Basic      Kintex-7  306.4            7230 (14.2%)    42 (9.4%)   156.877
Loop.      Virtex-5  266.2            8605 (23.0%)    42 (8.1%)   136.294
Loop.      Virtex-6  272.6            6203 (16.4%)    42 (10.1%)  139.571
Loop.      Kintex-7  317.0            5029 (9.9%)     42 (9.4%)   162.304
Pre.       Virtex-5  274.1            8841 (23.6%)    40 (7.8%)   140.339
Pre.       Virtex-6  280.4            6027 (16%)      40 (9.6%)   143.565
Pre.       Kintex-7  311.7            4803 (9.4%)     40 (9.0%)   159.590

Table 2
The comparisons between the proposed design and the other SHA-1 architectures.

Device    Design  Frequency (MHz)  Slices  BRAM  Throughput (Gbps)  TPS (Mbps/Slice)
Virtex-5  [14]    207.1            1213    0     10.6               8.739
Virtex-5  Prop.   274.1            8841    40    140.339            9.871
Virtex-6  [12]    172.3            1230    0     8.607              6.998
Virtex-6  [14]    217.4            1123    0     11.1               9.884
Virtex-6  [15]    150.7            1649    0     7.351              4.458
Virtex-6  Prop.   280.4            6027    40    143.565            12.590
Kintex-7  Prop.   311.7            4803    40    159.590            16.083

Table 3
The comparisons between the proposed design and the other SHA-256 architectures.

Device    Design  Frequency (MHz)  Slices  BRAM  Throughput (Gbps)  TPS (Mbps/Slice)
Virtex-5  [14]    169.1            1892    0     10.8               5.708
Virtex-5  [19]    64.45            139     0     0.118              0.840
Virtex-5  [20]    169              1885    0     10.816             5.747
Virtex-5  Prop.   275.6            14,786  35    141.107            7.324
Virtex-6  [20]    172              1831    0     11.008             6.012
Virtex-6  [21]    271              905     0     2.041              2.255
Virtex-6  Prop.   276.4            11,660  35    141.517            8.768
Kintex-7  Prop.   302.5            9681    35    154.880            10.940

Table 4
The comparisons between the proposed designs and the existing conventional architectures.

Device    Hash function  Design  Frequency (MHz)  Slices  BRAM  Throughput (Gbps)  TPS (Mbps/Slice)
Virtex-4  SHA-1          [6]     172.6            20,190  0     88.37              4.377
Virtex-4  SHA-1          Prop.   257.2            12,350  40    131.686            7.538
Virtex-4  SHA-256        [6]     135.3            26,340  0     69.27              2.630
Virtex-4  SHA-256        Prop.   262.2            21,736  35    134.246            5.121
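The tabulated throughput and TPS figures follow directly from Eqs. (20) and (21). The short Python check below (written for this article, not part of the design flow) recomputes the Kintex-7 entries, taking #Cycles = 1 because a fully loaded pipeline completes one 512-bit block per clock, and counting one 36 Kb BRAM as 128 slices.

```python
# Recompute the Kintex-7 entries of Tables 1-3 from Eqs. (20), (21).
def throughput_gbps(freq_mhz, block_bits=512, cycles=1):
    # Eq. (20): Throughput = BlockSize * Frequency / #Cycles
    return block_bits * freq_mhz * 1e6 / cycles / 1e9

def tps_mbps_per_slice(tp_gbps, slices, brams):
    # Eq. (21): one 36 Kb BRAM is counted as 128 slices [17,18]
    return tp_gbps * 1e3 / (128 * brams + slices)

sha1_tp = throughput_gbps(311.7)                      # ~159.59 Gbps (Table 2)
sha1_tps = tps_mbps_per_slice(sha1_tp, 4803, 40)      # ~16.08 Mbps/slice
sha256_tp = throughput_gbps(302.5)                    # ~154.88 Gbps (Table 3)
sha256_tps = tps_mbps_per_slice(sha256_tp, 9681, 35)  # ~10.94 Mbps/slice

# The Table 4 gains quoted in the text check out the same way, e.g.
sha1_gain = (131.686 - 88.37) / 88.37                 # ~0.49, the 49% figure
```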

achieved. In this case, the frequency reached 302.5 MHz and the TPS is up to 10.94 Mbps/slice.

To the best of our knowledge, only Michail et al. [6] have implemented the conventional fully-pipelined architecture of SHA-1 and SHA-256. Table 4 shows the comparisons between the proposed designs and the fully-pipelined architectures of [6] on the same Xilinx Virtex-4 family. It is evident that our designs occupy fewer slices yet achieve higher throughput and TPS values. Take the SHA-1 implementation as an example: compared to the architecture in [6], the occupied slices of our design were reduced by 7840 (nearly 39%), while the achieved throughput and TPS were increased by 43.3 Gbps (49%) and 3.161 Mbps/slice (72.2%) respectively. Only 40 BRAMs were consumed to pay for this optimized performance. Likewise, for the SHA-256 implementation, compared with the architecture in [6], we reduced the occupied slices by 4604 (nearly 17%) and improved the achieved throughput and TPS by 65.0 Gbps (94%) and 2.491 Mbps/slice (94.7%) respectively, at the expense of 35 BRAM resources.

5. Conclusions

In this paper, an area-efficient fully-pipelined architecture of SHA-1 and SHA-256 implemented on FPGA is proposed. To balance the consumption of resources, we apply BRAM modules to our fully-pipelined architecture. Additionally, new sub-cores of SHA-1 and SHA-256 are introduced to the architecture to achieve higher operating frequency and throughput. Compared to previous works, the proposed designs have achieved better performance in terms of frequency, throughput and TPS.

Acknowledgment

This work is supported by the funding for the operation of Fujian Key Laboratory of IC Design and Measurement (Xiamen University), the special funds for science and technology of Xiamen Science and Technology Bureau, and the innovation funding of Xiamen University in 2017.

References

[1] National Institute of Standards and Technology, Secure Hash Standard, Federal Information Processing Standards Publication 180-4. [Online]. Available: https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.180-4.pdf
[2] Y. Zhang, et al., High performance and low power hardware implementation for cryptographic hash functions, Int. J. Distrib. Sens. Netw. 10 (3) (2014) 1–12.
[3] P. Hagemeister, M. Mauve, Distributing distributed revision control systems, in: Proceedings of the IEEE Conference on Local Computer Networks, Dubai, UAE, 2016, pp. 647–650.
[4] H.E. Michail, et al., A top-down design methodology for ultrahigh-performance hashing cores, IEEE Trans. Dependable Secur. Comput. 6 (4) (2009) 255–268.
[5] E.-H. Lee, et al., Implementation of high-speed SHA-1 architecture, IEICE Electron. Express 6 (16) (2009) 1174–1179.
[6] H.E. Michail, et al., Area-throughput trade-offs for SHA-1 and SHA-256 hash functions' pipelined designs, J. Circuits Syst. Comput. 25 (04) (2016).
[7] Hu-ung Lee, et al., Parallelizing SHA-1, IEICE Electron. Express 12 (12) (2015) 20150371.
[8] H. Michail, C. Goutis, Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware, in: Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits, Hong Kong, China, 2008, pp. 144–147.
[9] K.L. Yong, H. Chan, I. Verbauwhede, Throughput optimized SHA-1 architecture using unfolding transformation, in: Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, Steamboat Springs, CO, 2006, pp. 354–359.
[10] S. Gueron, Parallelized hashing via j-lanes and j-pointers tree modes, with applications to SHA-256, J. Inform. Secur. 5 (3) (2014) 91–113.
[11] G.S. Athanasiou, et al., Optimising the SHA-512 cryptographic hash function on FPGAs, IET Comput. Digit. Tech. 8 (2) (2014) 70–82.
[12] R.K. Makkad, A.K. Sahu, Novel design of fast and compact SHA-1 algorithm for security applications, in: Proceedings of the IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology, Bengaluru, India, 2016, pp. 921–925.
[13] Xilinx, IP Processor Block RAM (BRAM) Block v1.00a, data sheet, 2011. [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/bram_block.pdf
[14] H.E. Michail, et al., On the development of high-throughput and area-efficient multi-mode cryptographic hash designs in FPGAs, Integr. VLSI J. 47 (4) (2014) 387–407.
[15] J.W. Kim, et al., Design for high throughput SHA-1 hash function on FPGA, in: Proceedings of the International Conference on Ubiquitous & Future Networks, Phuket, Thailand, 2012, pp. 403–404.
[16] Xilinx, 7 Series FPGAs Data Sheet: Overview v2.5, data sheet, 2017. [Online]. Available: https://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf
[17] D.S. Kundi, A. Aziz, N. Ikram, A high performance ST-Box based unified AES encryption/decryption architecture on FPGA, Microprocess. Microsyst. 41 (2016) 37–46.
[18] J.M. Granado-Criado, et al., A new methodology to implement the AES algorithm using partial and dynamic reconfiguration, Integr. VLSI J. 43 (1) (2010) 72–80.
[19] R. Garcia, et al., A compact FPGA-based processor for the Secure Hash Algorithm SHA-256, Comput. Electr. Eng. 40 (1) (2014) 194–202.
[20] H.E. Michail, et al., On the exploitation of a high-throughput SHA-256 FPGA design for HMAC, ACM Trans. Reconfigurable Technol. Syst. 5 (1) (2012) 1–28.
[21] M.D. Rote, N. Vijendran, D. Selvakumar, High performance SHA-2 core using the round pipelined technique, in: Proceedings of the IEEE International Conference on Electronics, Computing and Communication Technologies, Bangalore, India, 2016, pp. 1–6.
[22] Xilinx, Vivado Design Suite User Guide, 2018. [Online]. Available: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_2/ug901-vivado-synthesis.pdf
[23] Xilinx, Synthesis and Simulation Design Guide, 2017. [Online]. Available: https://www.xilinx.com/support/documentation/sw_manuals/xilinx14_7/sim.pdf

Lin Li received the B.S. degree and Ph.D. degree in electronic engineering from the University of Science and Technology of China, in 2003 and 2008, respectively. She became a Member of IEEE in 2008. From 2008 to 2011, she was an Assistant Professor with the Electronic Engineering Department, Xiamen University, China. Since 2011, she has been an Associate Professor. Her research interests include speaker verification, speech recognition, embedded systems and integrated circuit design. She is a committee member of the National Conference on Man-Machine Speech Communication.

Shaoyu Lin received the B.S. degree from the Department of Electronic Engineering, Xiamen University, Xiamen, Fujian, China, in 2015. He is currently pursuing the master degree with the Department of Electronic Engineering, Xiamen University, Xiamen, Fujian, China.

Shuli Shen received the M.S. degree from the Department of Electronic Engineering, Xiamen University, Xiamen, Fujian, China, in 2017. She is currently a software engineer with Wangsu Technology Corporation, Xiamen, China.

Kongcheng Wu received the B.S. degree from the Department of Physics, Fuzhou University, Fuzhou, Fujian, China, in 2015. He is currently pursuing the master degree with the Department of Electronic Engineering, Xiamen University, Xiamen, Fujian, China. His research interests include embedded systems and computer science.

Xiaochao Li received the B.S. degree in electrical engineering from Beijing Institute of Technology, Beijing, China, in 1992, the M.S. degree in electrical engineering from Xiamen University, Xiamen, China, in 1995, and the Ph.D. degree in solid-state physics from Xiamen University, Xiamen, China, in 2005. From 2005 to 2008, he was an Assistant Professor with the Electrical Engineering Department, Xiamen University, China. Since Aug. 2008, he has been an Associate Professor with the Electrical Engineering Department, Xiamen University, China. During this period, from 2010 to 2011, he was a Visiting Scholar in the Department of Electrical and Computer Engineering of North Carolina State University, USA. From 2014 to 2016, he was a Visiting Fellow in the State Key Laboratory of Analog and Mixed-signal VLSI (AMSV) of the University of Macau, China. He is currently the principal investigator of the advanced circuit and system group in Xiamen University.

Yihui Chen received the Master Degree from Mechanical and Electrical Engineering, Xiamen University, Xiamen, Fujian, China, in 2012. She is currently an Engineer with the School of Information Science and Engineering, Xiamen University, Xiamen, China. Her research interests include embedded systems, FPGA and switching power supply design.
