
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/373096678

Energy-efficient Unified Multi-hash Coprocessor for Securing IoT Systems Integrating Blockchain

Conference Paper · August 2023



All content following this page was uploaded by Pham Hoai Luan on 13 August 2023.



Energy-efficient Unified Multi-hash Coprocessor for
Securing IoT Systems Integrating Blockchain
Pham Hoai Luan¹, Thi Sang Duong¹, Vu Trung Duong Le¹, Thi Hong Tran², and Yasuhiko Nakashima¹
¹Nara Institute of Science and Technology, Nara, Japan
²Osaka Metropolitan University, Osaka 558-8585, Japan
Email: {pham.luan, duong.sang.do4, le.vu trung duong.lp4, nakashim}@is.naist.jp and hong@osaka-cu.ac.jp

Abstract—SHA-256, BLAKE-256, and BLAKE2s are cryptographic hash functions widely used in data security for Blockchain-enabled IoT systems to ensure the integrity of data and transactions. However, previous works have focused solely on implementing individual hash functions from these three cryptographic hash functions. Therefore, we propose a unified hardware solution named a B2$HA coprocessor that integrates all three hash functions and offers high speed, low power consumption, and high flexibility. Our proposed approach involves three novel optimizations: a multi-level pipeline architecture, a reused-register technique, and an adder-sharing method. The theoretical evaluation demonstrates that the B2$HA coprocessor shares 76.5% of 32-bit registers (104 out of 136) and 84.6% of 32-bit adders (44 out of 52) for SHA-256, BLAKE-256, and BLAKE2s computations. Additionally, the implementation results on the 45 nm CMOS ASIC show that the unified efficiency of B2$HA is approximately 82%. The B2$HA coprocessor is implemented and verified at the system-on-chip level on the Xilinx Zynq UltraScale+ MPSoC ZCU102 FPGA. Our experimental results on multiple FPGAs show that the B2$HA coprocessor exhibits superior flexibility and outperforms stand-alone hashing architectures in terms of throughput and area efficiency by 1.9-38.4 times and 1.4-4.5 times, respectively.

Index Terms—SHA-2, BLAKE, Blockchain, IoT, Unified.

978-1-6654-8128-1/22/$31.00 ©2022 IEEE

I. INTRODUCTION

Cryptographic hash functions are essential in modern computing as they can produce a fixed-size hash output from any message input length. Among the most widely used hash functions are SHA-256, BLAKE-256, and BLAKE2s, utilized for various security applications, such as password storage [1], digital signature [2], error detection and correction [3], pseudorandom number generators [4], and JPEG image encryption [5]. Moreover, SHA-256, BLAKE-256, and BLAKE2s are crucial for data integrity in the blockchain, enabling secure and transparent distributed ledgers [6].

The internet of things (IoT) is a popular and modern field that uses SHA-256, BLAKE-256, and BLAKE2s for ensuring data security. The IoT field necessitates hashing hardware that is specifically designed to meet the power-consumption limitations of low-end devices. Additionally, the hashing hardware for IoT devices needs to support various hash functions for different security needs, such as using SHA-256 for secure medical image encryption [7] and BLAKE2s-based authentication to block unauthorized nodes from sending harmful packets [8]. Furthermore, IoT applications have increasingly adopted blockchain technology for data integrity and privacy, demanding hashing hardware that has sufficient performance to ensure blockchain network security. Therefore, developing hashing hardware that is low-power, low-area, and high-performance and also supports all three hash functions is crucial for blockchain-based IoT applications.

Fig. 1. The proposed B2$HA coprocessor at the system-on-chip level on a Xilinx Zynq UltraScale+ MPSoC ZCU102 FPGA.

Numerous studies have proposed several architectures for SHA-256, BLAKE-256, and BLAKE2s to improve performance, area, and power consumption. For example, a compact and efficient FPGA processor for SHA-256 that utilizes a custom datapath to enable module reusing was proposed in [9]. Authors in [10] presented a cost-effective and high-performing multicore architecture for SHA-256. In [11], the authors proposed BLAKE-256 architectures for round-transformation, capable of producing a hash output through multiple rounds. Authors in [12], [13] proposed a compact ALU and distributed RAM memory to improve area and energy consumption for BLAKE-256 computation. The BLAKE-256/2s coprocessor in [14] was proposed to achieve high flexibility and energy efficiency for blockchain-based IoT applications. Although the architectures proposed in [9]–[14] have improved efficiency, area, and power consumption, they only support a single hash function, either SHA-256 or BLAKE-256/2s, which limits their flexibility. To the best of our knowledge, no architecture has been proposed to share hardware resources for supporting all three hash functions due to the completely different mathematical structures of SHA-256 and BLAKE-256/2s.

To address the issues in previous works, this paper proposes the B2$HA coprocessor, a unified resource-sharing hardware for SHA-256, BLAKE-256, and BLAKE2s. The coprocessor achieves high flexibility, high performance, and low power
by utilizing our three optimizations, including the multi-level pipeline architecture, the reused-register technique, and the adder-sharing method. The coprocessor is implemented and verified successfully at the system-on-chip (SoC) level on a real FPGA board, and comparisons with previous works are presented. The paper’s structure is as follows: Section II presents the B2$HA coprocessor, Section III discusses its implementation, quantitative analysis, and comparison with related works, and Section IV provides the conclusion.

II. PROPOSED B2$HA COPROCESSOR

Fig. 1 illustrates the B2$HA coprocessor’s SoC-level architecture, where the ARM Cortex-A53 CPU governs coprocessor operations via the AXI bus. Configuration, message, and hash input data are transferred to dedicated memories before being loaded into the B2$HA core. Computed hash outputs are stored in the output memory for retrieval by the CPU. Additionally, we propose three optimizations to enhance flexibility, performance, and resource efficiency of the B2$HA.

A. Optimization 1: Multi-level Pipeline Architecture

The process of generating a hash output in SHA-256, BLAKE-256, and BLAKE2s involves multiple loops that require extensive mathematical computations. The large number of operations in each loop can result in a very long critical path, which necessitates the use of a pipeline technique. However, balancing the critical path for the B2$HA core using traditional pipeline techniques is challenging due to the differing number of operations in each loop calculation of the three hash functions. To overcome this issue, we propose a multi-level pipeline technique for the B2$HA core that can effectively and optimally reduce the critical path.

Fig. 2. A multi-level pipeline architecture for the B2$HA core.

Fig. 2 depicts the multi-level pipeline architecture for the B2$HA core. Performing a loop calculation of SHA-256 requires 11 adders, 11 XORs, 5 ANDs, and 12 shift-rotations, while BLAKE-256/2s needs 48 adders, 48 XORs, and 32 shift-rotations for each loop calculation. Since BLAKE-256/2s has approximately four times more operations per loop than SHA-256, the B2$HA core is divided into 4 computational pipeline stages, where each stage is capable of handling either one loop calculation of SHA-256 or a quarter of the loop calculation of BLAKE-256/2s (equivalent to half a G function). To synchronize computational data, a 4-stage pipeline is also needed for storing and processing messages. For efficient utilization of hardware resources, the B2$HA core requires four data flows, including four messages and hash inputs from memories, as input data. When computing for SHA-256, the B2$HA core becomes a pipelined 4-loop unrolling architecture that can handle 64 loops in 16×4 cycles to generate 4 hash outputs. On the other hand, when computing for BLAKE-256 or BLAKE2s, the B2$HA core becomes a 4-stage pipelined round computation architecture that takes 14×4 or 10×4 cycles to complete 14 or 10 loops to generate 4 hash outputs, respectively. Since all four stages remain active, the multi-level pipelined architecture can achieve 100% hardware efficiency.

B. Optimization 2: Reused-Register Technique

Although a multi-level pipeline architecture can enhance performance, it also results in a higher number of registers needed for the B2$HA core. Therefore, we propose the reused-register technique to effectively save hardware resources.

Fig. 3. Reused-register technique for the multi-level pipeline architecture.

Fig. 3 depicts the reused-register technique implemented in the multi-level pipeline architecture. In each loop calculation, SHA-256 updates 8 new 32-bit hashes, while BLAKE-256/2s updates 16 new 32-bit hashes. The 16 32-bit registers used for BLAKE-256/2s can be reused for SHA-256 by sharing the first 8 registers at each pipeline round. However, the storage for messages differs between SHA-256 and BLAKE-256/2s. SHA-256 employs shift registers to add a new 32-bit message word in each loop, while message words in BLAKE-256/2s remain constant throughout 14/10 loops. Despite this variation, all three hash functions necessitate 16 32-bit registers to store 16 32-bit message words in each loop calculation, allowing for
the reuse of these registers. Multiplexers are utilized to select between shifting registers or retaining message words for SHA-256 or BLAKE-256/2s. By implementing the proposed reused-register technique, the B2$HA core can reuse 76.5% of its 32-bit registers (104 out of 136).

C. Optimization 3: Adder-Sharing Method

The B2$HA core relies on four ALUs as crucial elements to execute arithmetic and logical operations necessary for SHA-256, BLAKE-256, and BLAKE2s computations. One of the most frequently used and hardware-intensive operations in the three hash functions is the 32-bit addition. Therefore, we propose an adder-sharing method to minimize the hardware cost associated with the 32-bit adders used in four ALUs.

Fig. 4. Inside architecture of four ALUs utilizing adder-sharing method.

Fig. 4 illustrates the detailed architecture of four ALUs that use the adder-sharing method. According to theory, each loop calculation of the SHA-256 algorithm requires 11 32-bit adders, while the BLAKE-256/2s algorithm requires up to 48 32-bit adders. To share adders effectively with SHA-256, the loop calculation of BLAKE-256/2s is divided into four stages, with each stage including 12 adders. This way, 11 32-bit adders can be shared between SHA-256 and BLAKE-256/2s without affecting the performance of either algorithm. By using a multiplexer, the 11 adders used by SHA-256 and BLAKE-256/2s can be shared by adjusting the input operands to accommodate each hash function. It’s worth noting that while the finalization step of SHA-256 requires 8 32-bit adders, BLAKE-256/2s only utilizes XOR operations in this step, and hence these adders cannot be shared. Overall, the B2$HA core, consisting of 12 ALUs, can share 78.6% of 32-bit adders (44 out of 56), resulting in a more cost-effective and efficient implementation of three hash functions.

III. IMPLEMENTATION AND EVALUATION

A. B2$HA Implementation and Verification on FPGA

The B2$HA coprocessor was implemented and validated on a Xilinx ZCU102 FPGA hardware platform (Fig. 1). The design comprises an ARM Cortex-A53 CPU in the processing system (PS) and the B2$HA coprocessor IP in the programmable logic (PL), developed using Xilinx Vivado v2020.2. Operating at 200 MHz, the coprocessor IP consumes 0.262 W and utilizes 8,078 LUTs, 8,337 FFs, and 3,978 slices. The B2$HA core specifically employs 6,479 LUTs, 4,370 FFs, and 2,396 slices, consuming 0.222 W.

Upon completing the SoC design, it was converted into a bitstream and integrated into the Linux 5.4.0-Xilinx-v2020.2 kernel on the ZCU102 FPGA. To validate the coprocessor’s accuracy, hash outputs generated by the coprocessor were compared with message and hash inputs generated by the ARM Cortex-A53 CPU. The experiment encompassed 10,000 different inputs, and the results demonstrate a flawless performance, achieving a 100% success rate.

B. Quantitative Analysis of the B2$HA Coprocessor on ASIC

This section provides a quantitative analysis of the B2$HA coprocessor to evaluate its unified efficiency. Additionally, we investigate the trade-offs of using the reused-register technique and adder-sharing method for the B2$HA core. Accordingly, we have implemented four architectures in Verilog: one B2$HA core and three base stand-alone hashing modules applying the multi-level pipeline architecture. These architectures were implemented on ASICs with 45 nm CMOS technology using Synopsys Design Compiler.

Table I shows the maximum frequency, area, power, and throughput comparisons between the proposed B2$HA core and base hashing modules based on ASIC results.

TABLE I
COMPARISON BETWEEN B2$HA AND BASE STAND-ALONE MODULES

Design                Algorithm   Freq. (MHz)  Area (µm²)  Power (mW)  Throughput (Gbps)
Base module (Opt. 1)  SHA-256     465          42,388      21          14.004
Base module (Opt. 1)  BLAKE-256   510          53,626      31.4        17.408
Base module (Opt. 1)  BLAKE2s     512          51,583      21.4        23.831
B2$HA (Opt. 1,2,3)    SHA-256     450          70,495      31.6        13.552 (↓3%)
B2$HA (Opt. 1,2,3)    BLAKE-256   450          70,495      31.6        15.360 (↓12%)
B2$HA (Opt. 1,2,3)    BLAKE2s     450          70,495      31.6        20.945 (↓12%)

1) Unified Efficiency Analysis: Let \(A_i\) (\(0 \le i \le 2\)) denote the area of the base hashing modules and let \(A_{\mathrm{B2\$HA}}\) denote the area of B2$HA. Since B2$HA is a unified architecture of BLAKE-256, BLAKE2s, and SHA-256, \(A_{\mathrm{B2\$HA}}\) is always bigger than or equal to the largest among \(A_i\), denoted \(\max(A_i)\). Therefore, an area equal to \(\max(A_i)\) is considered the highest unified efficiency (100%) for B2$HA, while an area equal to the sum of \(A_i\), denoted as \(\sum_{i=0}^{2} A_i\), is deemed the lowest unified efficiency (0%) for B2$HA. The unified efficiency of B2$HA is calculated by eq. (1).

\[
\text{Unified Efficiency} = \frac{\sum_{i=0}^{2} A_i - A_{\mathrm{B2\$HA}}}{\sum_{i=0}^{2} A_i - \max(A_i)} \times 100\% \tag{1}
\]

Using the area results presented in Table I and eq. (1), we have found that the unified efficiency of B2$HA is above 82%. This indicates that the reused-register technique and adder-sharing method are effective in reducing hardware resource usage.

2) Trade-off Analysis: Table I demonstrates that the B2$HA core exhibits a slight reduction in throughput compared to the base modules, mainly due to the delay effect of adding multiplexers. However, this reduction in throughput is not considered significant when considering the enhanced flexibility of the B2$HA core. Furthermore, we analyze the trade-off
between energy and area efficiencies of the B2$HA core and base hashing modules at various input clock speeds ranging from 50 to 500 MHz (Fig. 5). Despite a notable decrease in area and energy efficiency, the B2$HA core achieves impressive values, with a maximum power efficiency of 663 Mbps/mW and a maximum area efficiency of 297 Mbps/kµm². These values surpass those of existing hashing architectures, as discussed in the subsequent section.

Fig. 5. Trade-off analysis of energy and area efficiencies between B2$HA and base modules in (a) SHA-256, (b) BLAKE-256, and (c) BLAKE2s computations.

C. Performance Evaluation: B2$HA vs. Existing Works

Based on our investigation, the reported B2$HA is the first unified and shared-resource architecture for SHA-256, BLAKE-256, and BLAKE2s, and no implementation results are currently available in the open literature for comparison. Therefore, this section only compares the performance of B2$HA with previous stand-alone hashing architectures, such as [9]–[14]. To ensure a fair comparison with other studies, we synthesized the Verilog code of the B2$HA core on the Virtex 5 and Virtex 6 FPGA boards using the Xilinx ISE 14.7 tool.

Table II presents a comparison of throughput and area efficiency between the B2$HA core and related stand-alone hashing architectures.

TABLE II
COMPARISON WITH PREVIOUS STAND-ALONE HASHING ARCHITECTURES

Device    Design   Algorithm   Freq. (MHz)  Area (Slice)  Through. (Mbps)  Area Eff. (Mbps/slice)
Virtex 5  [9]      SHA-256     64           139           118              0.84
Virtex 5  [10]**   SHA-256     278          2,447         2,229            0.91
Virtex 5  [11]     BLAKE-256   124.5        1,739         2,280            1.31
Virtex 5  B2$HA    SHA-256     139          2,605         4,186            1.61
Virtex 5  B2$HA    BLAKE-256   139          2,605         4,745            1.82
Virtex 5  B2$HA    BLAKE2s     139          2,605         6,470            2.48
Virtex 6  [12]     BLAKE-256   154          267*          214              0.80
Virtex 6  [13]     BLAKE-256   349          306*          151              0.49
Virtex 6  [14]     BLAKE-256   183          2,218         5,856            2.64
Virtex 6  [14]     BLAKE2s     183          2,218         7,808            3.52
Virtex 6  B2$HA    SHA-256     170          2,633         5,120            1.94
Virtex 6  B2$HA    BLAKE-256   170          2,633         5,803            2.20
Virtex 6  B2$HA    BLAKE2s     170          2,633         7,913            3.01
* : One Block RAM (BRAM) is normalized to 128 slices.
** : One core in the multicore architecture is utilized for comparison.

On the Virtex 5 board, the B2$HA utilizes 2,605 slices, achieves a maximum frequency of 139 MHz, and delivers throughputs of 4,186 Mbps, 4,745 Mbps, and 6,470 Mbps for SHA-256, BLAKE-256, and BLAKE2s computations, respectively. Compared to existing architectures, in SHA-256 mode, the B2$HA exhibits a throughput 35.5 times higher than [9] and 1.9 times higher than [10]. Furthermore, it demonstrates an area efficiency 1.9 times better than [9] and 1.8 times better than [10]. In BLAKE-256 mode, the B2$HA achieves a throughput 2.1 times greater than [11] and an area efficiency 1.4 times better than [11].

On the Virtex 6 board, the B2$HA employs 2,633 slices, operates at a maximum frequency of 170 MHz, and achieves throughputs of 5,120 Mbps, 5,803 Mbps, and 7,913 Mbps for SHA-256, BLAKE-256, and BLAKE2s computations, respectively. In BLAKE-256 mode, the B2$HA demonstrates a throughput 27.1 times higher than [12] and 38.4 times higher than [13], along with an area efficiency 2.75 times better than [12] and 4.5 times better than [13]. When compared to the BLAKE-256/2s architecture in [14], the B2$HA showcases slightly lower throughput and area efficiency, ranging from 1% to 16%. Moreover, the B2$HA provides dynamic configurability, enabling seamless switching between different hash functions, unlike the fixed stand-alone hashing architectures in [9]–[14], making it a more flexible coprocessor.

IV. CONCLUSION

The B2$HA coprocessor is a unified hardware solution designed to integrate SHA-256, BLAKE-256, and BLAKE2s cryptographic hash functions for data and transaction integrity in Blockchain-enabled IoT systems. Implementation results on a 45 nm CMOS ASIC reveal an approximate unified efficiency of 82%. Experimental evaluations across multiple FPGA platforms demonstrate that the B2$HA coprocessor offers enhanced flexibility and surpasses stand-alone hashing architectures in terms of throughput and area efficiency.

ACKNOWLEDGMENT

This work was supported by Projects Supported by NAIST Foundation FY2023 under Grant R5190015. This work was also supported through the activities of VDEC, The University of Tokyo, in collaboration with NIHON SYNOPSYS G.K.
REFERENCES
[1] W. Luo, Y. Hu, H. Jiang, and J. Wang, “Authentication by encrypted
negative password,” IEEE Transactions on Information Forensics and
Security, vol. 14, no. 1, pp. 114–128, 2019.
[2] M. Iavich, G. Iashvili, S. Gnatyuk, A. Tolbatov, and L. Mirtskhulava,
“Efficient and secure digital signature scheme for post quantum epoch,”
in Information and Software Technologies, A. Lopata, D. Gudonienė,
and R. Butkienė, Eds. Cham: Springer International Publishing, 2021,
pp. 185–193.
[3] W. Shan, W. Dai, C. Zhang, H. Cai, P. Liu, J. Yang, and L. Shi, “Tg-
spp: A one-transmission-gate short-path padding for wide-voltage-range
resilient circuits in 28-nm cmos,” IEEE Journal of Solid-State Circuits,
vol. 55, no. 5, pp. 1422–1436, 2020.
[4] A. Coughlin, G. Cusack, J. Wampler, E. Keller, and E. Wustrow, “Break-
ing the trust dependence on third party processes for reconfigurable
secure hardware,” in Proceedings of the 2019 ACM/SIGDA Interna-
tional Symposium on Field-Programmable Gate Arrays, ser. FPGA ’19.
New York, NY, USA: Association for Computing Machinery, 2019, pp.
282–291.
[5] P. Li, J. Meng, and Z. Sun, “A new jpeg encryption scheme using
adaptive block size,” in Advances in Intelligent Information Hiding and
Multimedia Signal Processing, J.-S. Pan, J. Li, O.-E. Namsrai, Z. Meng,
and M. Savić, Eds. Singapore: Springer Singapore, 2021, pp. 140–147.
[6] F. Wang, Y. Chen, R. Wang, A. O. Francis, B. Emmanuel, W. Zheng, and
J. Chen, “An experimental investigation into the hash functions used in
blockchains,” IEEE Transactions on Engineering Management, vol. 67,
no. 4, pp. 1404–1424, 2020.
[7] R. Hamza, K. Muhammad, A. N., and G. Ramírez-González, “Hash
based encryption for keyframes of diagnostic hysteroscopy,” IEEE
Access, vol. 6, pp. 60 160–60 170, 2018.
[8] A. Mathur, T. Newe, and M. Rao, “Defence against black hole and
selective forwarding attacks for medical wsns in the iot,” Sensors,
vol. 16, no. 1, p. 118, 2016.
[9] R. García, I. Algredo-Badillo, M. Morales-Sandoval, C. Feregrino-Uribe, and R. Cumplido, “A
compact fpga-based processor for the secure hash algorithm sha-256,”
Computers & Electrical Engineering, vol. 40, no. 1, pp. 194–202, 2014.
[10] T. H. Tran, H. L. Pham, and Y. Nakashima, “A high-performance
multimem sha-256 accelerator for society 5.0,” IEEE Access, vol. 9,
pp. 39 182–39 192, 2021.
[11] K. Latif, A. Mahboob, and A. Aziz, “High throughput hardware im-
plementation of secure hash algorithm (sha-3) finalist: Blake,” in 2011
Frontiers of Information Technology, 2011, pp. 189–194.
[12] J.-P. Kaps, P. Yalla, K. K. Surapathi, B. Habib, S. Vadlamudi, and
S. Gurung, “Lightweight implementations of sha-3 finalists on fpgas,”
in The Third SHA-3 Candidate Conference, no. 60, 2012, pp. 1–17.
[13] N. At, J.-L. Beuchat, E. Okamoto, İ. San, and T. Yamazaki, “Compact
hardware implementations of chacha, blake, threefish, and skein on
fpga,” IEEE Transactions on Circuits and Systems I: Regular Papers,
vol. 61, no. 2, pp. 485–498, 2014.
[14] P. H. Luan, T. H. Tran, V. T. Duong Le, and Y. Nakashima, “A flexible
and energy-efficient blake-256/2s co-processor for blockchain-based iot
applications,” in 2022 35th SBC/SBMicro/IEEE/ACM Symposium on
Integrated Circuits and Systems Design (SBCCI), 2022, pp. 1–6.
