Microprocessors and Microsystems: Lin Li, Shaoyu Lin, Shuli Shen, Kongcheng Wu, Xiaochao Li, Yihui Chen
Article history:
Received 27 May 2018
Revised 19 January 2019
Accepted 16 March 2019
Available online 17 March 2019

Keywords:
Field-programmable gate arrays (FPGA)
Cryptography
Secure Hash Algorithm (SHA)
Fully-pipelined
Block RAM (BRAM)

Abstract

In this paper, an area-efficient fully-pipelined architecture of SHA-1 and SHA-256 implemented on FPGA is proposed for achieving high operating frequency and throughput. The conventional pipeline architecture consumes a large number of registers, and the consumption increases dramatically with the number of pipeline stages. To solve this problem, a new scheme using block RAM (BRAM) is presented to reduce the register consumption and make the fully-pipelined architecture simpler. Additionally, to achieve an operating frequency greater than 300 MHz, new sub-cores of SHA-1 and SHA-256, combined with the loop unrolling and pre-computation techniques, are introduced into the design. Compared to previous works, the throughput and throughput/slice of SHA-1 and SHA-256 in the proposed designs are substantially increased, to 159.590 Gbps and 16.083 Mbps/slice, and 154.880 Gbps and 10.94 Mbps/slice, respectively, on a Kintex-7 FPGA.

© 2019 Elsevier B.V. All rights reserved.
1. Introduction

The Secure Hash Algorithms are a family of cryptographic hash functions published by the National Institute of Standards and Technology (NIST) as a U.S. Federal Information Processing Standard (FIPS), including SHA-0, SHA-1, SHA-2 and SHA-3 [1]. A cryptographic hash function, which is designed to be a one-way function, is a mathematical algorithm that transforms data of arbitrary size into a fixed-size string [2]. The Secure Hash Algorithms are often used in information authentication to ensure data confidentiality and integrity. Among them, the SHA-1 and SHA-256 hash functions play an important role and are widely used in various applications. For instance, the SHA-1 algorithm is applied in distributed revision control systems like Git to identify revisions and to detect data corruption [3]. It is also used in the authentication system of the MySQL server to protect users' passwords. The SHA-256 algorithm is implemented in several widely used security applications and protocols, including TLS and SSL, PGP, SSH, S/MIME and IPsec. Several cryptocurrencies like Bitcoin use SHA-256 for verifying transactions and calculating proof-of-work or proof-of-stake.

A variety of hardware implementations of SHA-1 and SHA-256 have been proposed since the Secure Hash Algorithm standards were announced by NIST in 2002. Well-known techniques such as loop unrolling [4], pre-computation [5], pipelining [6] and parallelism exploitation [7] were introduced in these designs to improve performance. In [8] and [9], the loop unrolling technique is discussed in detail. Loop unrolling refers to performing several hash operations in one clock cycle; it improves the throughput of SHA-1 and SHA-256 architectures at the cost of higher resource consumption and lower operating frequency. Lee et al. [5] presented the pre-computation technique to achieve high performance in SHA-1. Pre-computation means calculating, during the current operation, intermediate values that are needed in the next operation. Compared to loop unrolling, the pre-computation technique improves the throughput by shortening the critical path and thus increasing the operating frequency. For much higher throughput, pipelining and parallelism exploitation were used in many designs. In [7] and [10], parallel architectures of SHA-1 and SHA-256 were proposed, in which multiple SHA-1 or SHA-512 modules work simultaneously for high-speed calculation at the cost of very high slice utilization. It is difficult to increase the throughput and reduce the resource consumption at the same time.

In a previous study [6], a design with eight pipeline stages achieves higher TPS (throughput/slice) values than a four-stage design, and the accompanying analysis shows that designs with more pipeline stages achieve not only higher throughput but also higher TPS values and better area efficiency. However, the authors regard fully-pipelined designs as unrealistic, since such a design takes up a large portion of the FPGA resources [6,11], while, in most cases, the hash functions are incorporated into a bigger security system and are subject to area constraints. Therefore, area reduction is the key factor in implementing a fully-pipelined hash function.

∗ Corresponding author at: Department of Electronic Engineering, Xiamen University, Xiamen 361005, China.
E-mail address: leexcjeffrey@xmu.edu.cn (X. Li).
https://doi.org/10.1016/j.micpro.2019.03.002
0141-9331/© 2019 Elsevier B.V. All rights reserved.
L. Li, S. Lin and S. Shen et al. / Microprocessors and Microsystems 67 (2019) 82–92
In this paper, an area-efficient fully-pipelined architecture of SHA-1 and SHA-256 using BRAM is implemented on different Xilinx families (Virtex-4, Virtex-5, Virtex-6 and Kintex-7). Compared to previous works, the proposed design is concerned not only with high throughput but also with area efficiency: the utilization of both slices and BRAM is under 20% of the totals available in the Kintex-7 series FPGA. Moreover, new sub-cores of SHA-1 and SHA-256 are applied to the architectures to reduce the delay in the critical path. Compared to other works, our designs achieve better performance in terms of frequency, throughput and throughput/area.

The rest of the paper is organized as follows. Section 2 gives a brief introduction to the background of SHA-1 and SHA-256. In Section 3, the proposed designs are explained in detail. Implementation results and comparisons with previous works are given in Section 4. Finally, Section 5 concludes the paper.

2. Background

To describe the design of SHA-1 and SHA-256 on FPGA, we first give a brief introduction to the basic SHA-1 and SHA-256 algorithms and to three common techniques used to improve the throughput of hash functions. These techniques, namely loop unrolling, pre-computation and pipelining, are merged with the proposed designs in Section 3.

2.1. SHA-1 algorithm

SHA-1 takes input data of length less than 2^64 bits and gives a 160-bit output called the message digest or hash value [1]. The arbitrary-length message is packed and padded into one or more 512-bit blocks in the first stage of the algorithm. Then, each 512-bit block is used to generate eighty 32-bit values, W_t. The SHA-1 algorithm requires 80 basic hash operations to process a single 512-bit block and produce the output values. Each operation, which consists of additions, a nonlinear function and circular shifts, should be completed in one clock cycle.

The structure of a SHA-1 hash operation is shown in Fig. 1. The darker dotted line represents the critical path, which consists of three additions. Let S^n(x) denote the 32-bit value obtained by circularly shifting x left by n bit positions, and let f_t denote the nonlinear function. W_t represents the 32-bit value computed from the 512-bit data, and K_t represents the 32-bit round constant. a_t, b_t, c_t, d_t, e_t are the five variables used to perform the hash operations. The expressions that give a_{t+1}, b_{t+1}, c_{t+1}, d_{t+1}, e_{t+1} are described by Eq. (1) (0 ≤ t < 80):

    a_{t+1} = S^5(a_t) + f_t(b_t, c_t, d_t) + e_t + W_t + K_t
    b_{t+1} = a_t
    c_{t+1} = S^30(b_t)                                               (1)
    d_{t+1} = c_t
    e_{t+1} = d_t

Let x & y denote the bit-wise AND of x and y, x ⊕ y the bit-wise XOR of x and y, and ¬x the bit-wise complement of x. The symbol ∗ denotes multiplication and M represents the 512-bit data. f_t, W_t and K_t in Eq. (1) are defined by Eqs. (2)–(4):

    f_t(b_t, c_t, d_t) = (b_t & c_t) ⊕ (¬b_t & d_t)                    0 ≤ t < 20
                         b_t ⊕ c_t ⊕ d_t                              20 ≤ t < 40
                         (b_t & c_t) ⊕ (b_t & d_t) ⊕ (c_t & d_t)      40 ≤ t < 60    (2)
                         b_t ⊕ c_t ⊕ d_t                              60 ≤ t < 80

    W_t = M[(32 ∗ t + 31) : (32 ∗ t)]                                  0 ≤ t < 16
          S^1(W_{t-16} ⊕ W_{t-14} ⊕ W_{t-8} ⊕ W_{t-3})               16 ≤ t < 80    (3)

    K_t = 0x5A827999     0 ≤ t < 20
          0x6ED9EBA1    20 ≤ t < 40
          0x8F1BBCDC    40 ≤ t < 60                                                 (4)
          0xCA62C1D6    60 ≤ t < 80

For the first hash operation, no previous hash value exists. Let H_0, H_1, H_2, H_3, H_4 represent the five initial hash values; the initial inputs (a_0, b_0, c_0, d_0, e_0) are then given by Eq. (5):

    a_0 = H_0 = 0x67452301
    b_0 = H_1 = 0xEFCDAB89
    c_0 = H_2 = 0x98BADCFE                                                          (5)
    d_0 = H_3 = 0x10325476
    e_0 = H_4 = 0xC3D2E1F0

2.2. SHA-256 algorithm

Like the SHA-1 hash function, the SHA-256 algorithm operates on an arbitrary-length message, M, and returns a fixed-length output, called the hash value or message digest of M. The differences between SHA-1 and SHA-256 are the size of the final hash value, the initial hash constants, the round constants and the round computation. SHA-256 performs 64 iterations to produce the 256-bit output, and its operational round includes simple additions and nonlinear functions, as shown in Fig. 2.
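Eqs. (1)–(5) fully determine the per-block computation, so they can be sanity-checked in software. The sketch below (Python, offered as an illustration rather than the paper's HDL; the helper names `S`, `f`, `K` and `sha1` are ours) implements the padding stage, the word expansion of Eq. (3) and the 80 rounds of Eq. (1), and compares the result against a reference implementation:

```python
import hashlib
import struct

MASK = 0xFFFFFFFF  # all arithmetic is modulo 2^32

def S(n, x):
    """S^n(x): circular left shift of the 32-bit word x by n positions."""
    return ((x << n) | (x >> (32 - n))) & MASK

def f(t, b, c, d):
    """Eq. (2): the round-dependent nonlinear function f_t."""
    if t < 20:
        return (b & c) ^ (~b & d)
    if t < 40 or t >= 60:
        return b ^ c ^ d
    return (b & c) ^ (b & d) ^ (c & d)

def K(t):
    """Eq. (4): the round constant K_t, one per group of 20 rounds."""
    return (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)[t // 20]

def sha1(msg: bytes) -> bytes:
    # Message padding: append 0x80, zero-fill, then the 64-bit message
    # length, so the padded message is a multiple of 512 bits.
    n = len(msg)
    msg += b"\x80" + b"\x00" * ((55 - n) % 64) + struct.pack(">Q", 8 * n)
    # Eq. (5): the five initial hash values H_0..H_4.
    H = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
    for i in range(0, len(msg), 64):
        # W_0..W_15 come straight from the 512-bit block; W_16..W_79 by Eq. (3).
        W = list(struct.unpack(">16I", msg[i:i + 64]))
        for t in range(16, 80):
            W.append(S(1, W[t - 16] ^ W[t - 14] ^ W[t - 8] ^ W[t - 3]))
        a, b, c, d, e = H
        for t in range(80):  # the 80 basic hash operations of Eq. (1)
            a, b, c, d, e = ((S(5, a) + f(t, b, c, d) + e + W[t] + K(t)) & MASK,
                             a, S(30, b), c, d)
        H = [(x + y) & MASK for x, y in zip(H, (a, b, c, d, e))]
    return struct.pack(">5I", *H)

assert sha1(b"abc") == hashlib.sha1(b"abc").digest()
```

One fully-pipelined pass of the main calculation unit corresponds to one iteration of the 80-round loop above, unrolled in space across 80 sub-cores.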
by performing multiple stages at the same time, but it does not reduce the latency, i.e. the time required for a single datum to propagate through the stages of the pipeline from start to finish.

The core SHA-1 block consists of 80 basic hash operations (SHA-256 has 64 basic hash operations). The typical pipeline architecture has four stages [7,12]; each stage performs 20 hash operations in 20 clock cycles. The first output hash value is obtained after 80 clock cycles, and after that a hash value is obtained every 20 clock cycles. In a fully-pipelined design, the architecture is divided into 80 stages for SHA-1 and 64 stages for SHA-256, with each stage performing one basic hash operation. Hash values are then produced every clock cycle, but the slice consumption in the FPGA greatly increases. In this paper, a fully-pipelined architecture is proposed in Section 3 to solve this problem.

3. Proposed designs

3.1. Optimized fully-pipelined architecture

The conventional pipeline architecture of SHA-1 and SHA-256 consumes a lot of registers, and the gate count increases dramatically with the number of pipeline stages. For instance, Michail et al. [6] implement a fully-pipelined design with 80 pipeline stages in the Virtex-4 family. It takes up a significant portion of the whole FPGA device: the design occupies 31,700 of 49,152 slices, almost 64%. To solve this problem, we propose a new scheme for the fully-pipelined architecture of hash functions, which focuses on a balanced utilization of the resources in the FPGA device, to make the cumbersome design simple.

An illustrated example of our design is shown in Fig. 5. The core idea of the scheme is to use BRAM at appropriate positions to reduce register consumption. The figure shows a fully-pipelined SHA-1 computation with 80 pipeline stages. It consists of a message padding unit, a words computation unit and the main calculation unit (represented by the dotted frame). The message padding unit receives the messages, one per clock cycle, and transforms the data into the 512-bit format the SHA-1 algorithm requires. After that, the processed 512-bit data is divided into 16 words (W_0 to W_15) and sent to the next two units: the main calculation unit and the words computation unit. The words computation unit is responsible for computing the other required W_t values (W_16 to W_79). In the end, we obtain 80 W_t values, which are used in the 80 rounds of the SHA-1 computation.

As shown in Fig. 5, the main calculation unit consists of five similar sub-groups of sub-cores. A sub-core is defined as a block that performs one hash operation, or several operations, of SHA-1; we discuss it in detail in the next section. Each sub-group accepts 16 W_t values, which are stored in BRAM before being consumed by the sub-cores. In our design, the BRAM acts as a first-in/first-out (FIFO) buffer. Analyzing the main calculation unit, W_0 does not need to be stored in a BRAM since it is consumed first; the other W_t values (W_1 to W_79) are saved in BRAM to reduce register consumption.

The use of BRAM is illustrated in Fig. 6. In the proposed architectures, the block RAM is configured in 18 Kb mode, with a 512 × 32 configuration [13]. The BRAM modules are configured in simple dual-port RAM mode. In this mode, independent read and write operations can occur simultaneously: port A is designated as the write port and port B as the read port. Address A represents the position where the W_t (1 ≤ t < 80) value is written at the current active clock edge, and it simply increments every cycle:

    AddressA = AddressA + 1
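As a software analogue of this arrangement, the sketch below models a 512 × 32 simple dual-port BRAM whose write address increments every clock, used as a FIFO that delays a stream of W_t words. The class name and the 16-cycle delay are illustrative choices of ours, not taken from the paper:

```python
class SimpleDualPortBram:
    """Software model of an 18 Kb BRAM in 512 x 32 simple dual-port mode:
    port A only writes, port B only reads, and both may operate in the
    same clock cycle."""

    DEPTH, WIDTH_MASK = 512, 0xFFFFFFFF

    def __init__(self):
        self.mem = [0] * self.DEPTH
        self.addr_a = 0  # write pointer: AddressA = AddressA + 1 each clock

    def clock(self, w_t, addr_b):
        """One active clock edge: read AddressB (port B) and write w_t at
        AddressA (port A) simultaneously; returns the data read."""
        data_b = self.mem[addr_b % self.DEPTH]          # port B read
        self.mem[self.addr_a] = w_t & self.WIDTH_MASK   # port A write
        self.addr_a = (self.addr_a + 1) % self.DEPTH    # increment write address
        return data_b

# FIFO use: delay a stream of W_t words by a fixed number of cycles, so a
# sub-group receives each operand at the right pipeline stage.
bram = SimpleDualPortBram()
delay = 16  # illustrative: one sub-group holds 16 W_t values
out = [bram.clock(t, t - delay) for t in range(64)]
assert out[delay:] == list(range(64 - delay))  # reads reproduce writes, 16 cycles late
```

Because the read and write ports are independent, no extra multiplexing is needed between the pipeline stages; the BRAM replaces the long register chains of the conventional design.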
Fig. 9. The code creating the simple dual-port block RAM in the design.
Fig. 10. The units used to store Wt values in the conventional design.
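The timing contrast drawn in Section 3, a four-stage pipeline delivering a digest every 20 cycles versus a fully-pipelined core delivering one per cycle, reduces to a small cycle-count model. A sketch (the function name is ours):

```python
def sha1_pipeline_cycles(n_blocks, stages, rounds=80):
    """Clock cycles to hash n_blocks 512-bit blocks on a pipeline with
    `stages` stages, each stage performing rounds/stages hash operations.
    The latency is always `rounds` cycles; only the interval between
    successive digests shrinks as the stage count grows."""
    interval = rounds // stages  # cycles between successive output digests
    return rounds + (n_blocks - 1) * interval

# Conventional four-stage pipeline: first digest after 80 cycles,
# then one digest every 20 cycles.
assert sha1_pipeline_cycles(1, 4) == 80
assert sha1_pipeline_cycles(5, 4) == 80 + 4 * 20

# Fully pipelined (80 stages): after the 80-cycle fill,
# one digest per clock cycle.
assert sha1_pipeline_cycles(1000, 80) == 80 + 999
```

For a long message stream, throughput is therefore proportional to 512/interval bits per cycle, which is what makes the fully-pipelined design attractive despite its area cost.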
Table 1
The performance of the proposed designs of SHA-1.

Design | Frequency (MHz)        | Slices                                      | BRAM                              | Throughput (Gbps)
       | V-5    V-6    K-7      | V-5             V-6            K-7          | V-5       V-6        V-7 (K-7)    | V-5      V-6      K-7
Basic  | 272.6  277.3  306.4    | 11,135 (29.7%)  8144 (21.6%)   7230 (14.2%) | 42 (8.1%) 42 (10.1%) 42 (9.4%)    | 139.571  141.978  156.877
Loop.  | 266.2  272.6  317.0    | 8605 (23.0%)    6203 (16.4%)   5029 (9.9%)  | 42 (8.1%) 42 (10.1%) 42 (9.4%)    | 136.294  139.571  162.304
Pre.   | 274.1  280.4  311.7    | 8841 (23.6%)    6027 (16.0%)   4803 (9.4%)  | 40 (7.8%) 40 (9.6%)  40 (9.0%)    | 140.339  143.565  159.590

(V-5 = Virtex-5, V-6 = Virtex-6, K-7 = Kintex-7.)
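The throughput column of Table 1 is consistent with the fully-pipelined property that one 512-bit block completes per clock cycle, so throughput equals frequency × 512 bits. A quick check on the Kintex-7 figures (the function name is ours):

```python
def fully_pipelined_throughput_gbps(freq_mhz):
    """One 512-bit block per clock cycle:
    throughput (Gbps) = f (MHz) * 512 bits / 1000."""
    return freq_mhz * 512 / 1000

# Kintex-7 rows of Table 1:
assert round(fully_pipelined_throughput_gbps(306.4), 3) == 156.877  # Basic
assert round(fully_pipelined_throughput_gbps(317.0), 3) == 162.304  # Loop.
assert round(fully_pipelined_throughput_gbps(311.7), 3) == 159.590  # Pre.
```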
Table 2
The comparisons between the proposed design and the other SHA-1 architectures.

Table 3
The comparisons between the proposed design and the other SHA-256 architectures.

Table 4
The comparisons between the proposed designs and the existing conventional architectures. Columns: Device, Hash function, Design, Frequency (MHz), Area (Slices, BRAM), Throughput (Gbps), TPS (Mbps/Slice).
achieved. In this case, the frequency reached 302.5 MHz and the TPS is up to 10.94 Mbps/slice.

To the best of our knowledge, only Michail et al. [6] have implemented the conventional fully-pipelined architecture of SHA-1 and SHA-256. Table 4 shows the comparison between the proposed designs and the fully-pipelined architectures of [6], based on the same Xilinx Virtex-4 family. Our design clearly costs fewer slices while achieving higher throughput and TPS values. Take the SHA-1 implementation as an example: compared to the architecture in [6], the occupied slices of our design were reduced by 7840 (nearly 39%), while the achieved throughput and TPS values were increased by 43.3 Gbps (49%) and 3.161 Mbps/slice (72.2%) respectively, with only 40 BRAMs consumed to pay for the optimized performance. Meanwhile, for the SHA-256 implementation, compared with the architecture in [6], we reduced the occupied slices by 4604 (nearly 17%) and improved the achieved throughput and TPS by 65.0 Gbps (94%) and 2.491 Mbps/slice (94.7%) respectively, at the expense of 35 BRAM resources.

5. Conclusions

In this paper, an area-efficient fully-pipelined architecture of SHA-1 and SHA-256 implemented on FPGA is proposed. To balance the consumption of resources, we apply BRAM modules to our fully-pipelined architecture. Additionally, new sub-cores of SHA-1 and SHA-256 are introduced into the architecture to achieve higher operating frequency and throughput. Compared to previous works, the proposed designs achieve better performance in terms of frequency, throughput and TPS.

Acknowledgment

This work is supported by the funding for the operation of the Fujian Key Laboratory of IC Design and Measurement (Xiamen University), the special funds for science and technology of the Xiamen Science and Technology Bureau, and the innovation funding of Xiamen University in 2017.