
488 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 30, NO. 4, APRIL 2022

On Efficiency Enhancement of SHA-3 for FPGA-Based Multimodal Biometric Authentication

M. M. Sravani and S. Ananiah Durai

Abstract— A synchronized padder block and a compact-dynamic round constant (RC) generator are proposed in this work to achieve a highly efficient Keccak architecture. The proposed design yields high security with an option of 1024 bits as capacity “c,” while limiting the round count to less than 12 for the base design. Fusion schemes are adapted as a cost-effective approach in the base design to explore and arrive at the most efficient architecture for biometric access control applications. The hybrid architecture, designed as a pipeline structure with ≤2 stages, eliminated the need for on-chip digital signal processor (DSP) and block random access memory (BRAM) slices. Though fusion schemes might lead to an increase in area, the minimized structural RC design, coupled with a low-cost architecture, ensures a moderately low area. Among the proposed architectures, the dual round function (Dual-f) design performed better in terms of throughput and operating frequency. Thus, when implemented, Dual-f achieved the highest efficiency of all, with 12.85 Mb/s/slices and 15.11 Mb/s/slices on Virtex-5 and Virtex-7 devices, respectively. The miniature and high-speed features of the Dual-f Keccak design are found to be adequate for multimodal biometric authentication applications.

Index Terms— Efficiency, fusion, round constant (RC), secured hash algorithm-3 (SHA-3), throughput (TP).

Manuscript received October 11, 2021; revised November 24, 2021 and January 9, 2022; accepted January 25, 2022. Date of publication February 17, 2022; date of current version March 22, 2022. (Corresponding author: S. Ananiah Durai.)
The authors are with the Center for Nanoelectronics and VLSI Design and the School of Electronics Engineering, VIT Chennai, Chennai, Tamil Nadu 600127, India (e-mail: ananiahdurai.s@vit.ac.in).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TVLSI.2022.3148275.
Digital Object Identifier 10.1109/TVLSI.2022.3148275
1063-8210 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

HASH algorithm is prominent and increasingly preferred in hardware applications such as biometric authentication, Internet of Things (IoT), remote health monitoring electronics, and digital certificates, for its superior authentication and data integrity features. Most authentication technologies have now adopted biometric data for secured access as it is nonreplicable. However, the conventional unimodal biometric access [1] fails to provide adequate security, as privacy attacks on the biometric template were observed due to weak shielding with single bio-identity data. Multimodal biometric authentication has been found to overcome these security issues by utilizing two or more biometric features. Features such as fingerprint, iris, face, palm, and ear images are preprocessed first to get a 2-D image with skimmed features. Then, the random unique minutiae of the multibiometric data are extracted and transformed to a single bio-template by biohashing, which is then stored in the database for decision threshold. Though feature extraction approaches such as deep machine learning, fuzzy, and convolutional neural network [2]–[4] processes tighten the security fence of such authentication systems, breaches leading to reconstruction of the original biometric template from the biohash binary string were still reported [5].

The hash algorithm standardized by the National Institute of Standards and Technology (NIST), USA, being irreversible, can resist such breaches and attacks, eliminating the unreliability in the prevailing biometric access control techniques. Once the binarized data of multiple bio-features are randomly coalesced at the feature-level point, encryption of the biometric template is done through the hash algorithm. The final hash digest bio-template is then stored in the secured database for access control [6]. Such encryption standards are even found to resist side channel attacks (SCAs) [7] and birthday attacks. Among the hash techniques, the Keccak architecture of secured hash algorithm-3 (SHA-3) gained popularity in such access control implementations, even replacing the mature SHA-1 and SHA-2 schemes [8], [9]. SHA-3, exhibiting better properties such as diversity and reusability, is much preferred for biometric template protection [6]. Alternatively, additional hashing of the features at the decision-level point renders two levels of security [3]. This is achieved by hashing each bio-feature individually, yielding a specific hash key for each feature. Then, the generated hash keys are combined to form the final hashed biometric template, as shown in Fig. 1.

Fig. 1. Hash biometric authentication system.

The complexity of back-to-back hashing coupled with the parallel processing of multibiometric images deteriorates the system performance greatly. Therefore, a highly efficient SHA-3 architecture that provides increased speed without any preimage collision (nondiversity) is required. Techniques for parameter improvement of SHA-3 that target specific application and hardware device choices have been explored since its introduction by NIST during the year 2012. Hardware implementation of SHA-3 is preferred over the software counterpart due to the fact that the performances surpass in terms of speed, power,

Authorized licensed use limited to: PES University Bengaluru. Downloaded on March 27,2023 at 05:40:10 UTC from IEEE Xplore. Restrictions apply.
SRAVANI AND ANANIAH DURAI: ON EFFICIENCY ENHANCEMENT OF SHA-3 489

and throughput (TP). The cost and implementation timeline are lower with the choice of a field-programmable gate array (FPGA) as the hardware implementation platform rather than application-specific integrated circuits (ASICs); hence, extensive research on architectural modifications over such platforms was reported recently. A design that yields performance improvement in terms of TP, frequency (F), area (A), and efficiency (E_f) was reported in [10]. Also, efforts were taken by a few other designs to enhance the efficiency by focusing on resource sharing and pipelining of functional logic. Though the area achieved was low, TP was found to be poor. The performance metrics of the reported work have been compared in Table I, to determine the consistent overall progress in performance improvement of the reported architectures.

TABLE I
SHA-3 PERFORMANCE METRICS OF REPORTED WORK

In [11], the complete Keccak architecture has been designed deriving from a single-equation form to achieve high frequency while effectively reducing the lookup table (LUT) resources. However, the “iota” step in the round function of the architecture eventually doubled the area due to structural implementation of both with and without round constant (RC). Even though such implementation significantly improves TP, the area is inevitably compromised. Furthermore, the reported architecture in [12] achieved increased maximum frequency while designing the Keccak architecture as a single block. Though the inclusion of a temporary register improves the clock rates and hence the frequency, the area is greatly compromised. Therefore, an optimized design that caters to the requirements of the biometric authentication application, such as high TP, increased frequency, and hence efficiency, is required, without compromising area.

Considering the multiple requirements of the end user’s application and the complex computational aspects of SHA-3 designs, improvement in terms of efficiency while maintaining high TP, increased frequency, and low area is the major focus of this work. Three architectural techniques are proposed to achieve the desired target performance enhancement. First, a handshake protocol for synchronized operation among the padder and permutation process is proposed, which eliminates unnecessary wait cycles to achieve maximum operating frequency. Second, a novel compact-dynamic RC (C-RC) generator to effectively lower the area is designed. Finally, a dual round function (Dual-f) architecture that utilizes fusion of boosting schemes and an integrated round function (Integrated-f) to achieve high efficiency is proposed with experimental validation. The performance parameters in terms of TP, frequency, area, and hence the efficiency were compared among the proposed techniques as well as with the reported designs. Debugging and verification were achieved using the virtual input–output (VIO) core on Xilinx Virtex-5 (V5) and Virtex-7 (V7) devices. The right choice of the design that efficiently secures biometric data is then determined based on the analysis.

Section II describes the overall basic architecture and techniques of message scrambling with SHA-3. The requirement and the need for an application-driven design improvement of SHA-3 are provided in Section III. A brief overview of the proposed design is explored in Section IV. In Section V, the detailed hardware architectural design approach of SHA-3 is discussed with the performance analysis. A comparison with previous work is then provided in Section VI, followed by a conclusion in Section VII.

II. SHA-3 ARCHITECTURE

The Keccak algorithm includes four variable lengths of outputs and two extendable output functions. Each variant has found its own specific applications based on the level of security it offers. SHA-3(224) and SHA-3(256) hash functions are popularly preferred for digital signature applications, whereas the SHA-3(384) digested output is employed in pseudorandom generation [13]. The largest bit variant, SHA-3(512), offering high security, is found extensively implemented in almost all digital authentication purposes. The two extendable functions SHAKE-256 and SHAKE-128, which are capable of producing any desired output length, are preferred in digital transactions. Furthermore, the variable length of SHA-3 functions shows minimal variation in architecture structure. In order to illustrate the operational variables that yield a specific architecture of a SHA-3 variant, let us consider SHA-3(512) here. The final hash output length in this variant is 512 bits, as specified within the braces of the standard conventions. Inherently, within its engine core, the variable rate “r” has a standard bit size of 576 and the capacity “c” will be of 1024 bits [13]. Concatenation of these two variables yields another variable called bit rate “b,” which will be of 1600 bits. Similarly, details of other SHA-3 variants are provided in Table II; it can be observed that though the Keccak function is of 1600 bits for all variants, the varied bit length of “r” and “c” requires minimal variation in architectural design.

TABLE II
SHA-3 VARIANTS WITH DIFFERENT OUTPUT LENGTHS [13]
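The relation between output length, capacity “c,” and rate “r” described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `sponge_parameters` is hypothetical, and the rule c = 2 × (digest length) is taken from the SHA-3 standard (FIPS 202) rather than from this paper, which states only the SHA-3(512) values explicitly.

```python
STATE_BITS = 1600  # width "b" of the Keccak state, fixed for all SHA-3 variants


def sponge_parameters(digest_bits: int) -> dict:
    """Return capacity c and rate r (in bits) for a fixed-length SHA-3 variant."""
    if digest_bits not in (224, 256, 384, 512):
        raise ValueError("SHA-3 defines digests of 224, 256, 384, or 512 bits")
    capacity = 2 * digest_bits      # FIPS 202 rule: c = 2 * output length
    rate = STATE_BITS - capacity    # r and c together fill the 1600-bit state
    return {"digest": digest_bits, "capacity": capacity, "rate": rate}


if __name__ == "__main__":
    for n in (224, 256, 384, 512):
        p = sponge_parameters(n)
        print(f"SHA-3({n}): r = {p['rate']} bits, c = {p['capacity']} bits")
```

For SHA-3(512) this gives r = 576 and c = 1024, matching the values quoted in the text; only the split between r and c changes across variants, which is why the architecture needs little modification.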


Fig. 2. Keccak sponge construction [13].

In the SHA-3 algorithm, the permutation function “f” is the core engine that performs the bit scrambling. The basic functional block of the SHA-3 Keccak algorithm, shown in Fig. 2, is popularly referred to as a sponge construction due to the behavior of its internal modules [13]. It includes two phases of function, the absorbing phase and the squeezing phase. During the absorbing phase, within the padder block, the input message is XORed with the initial state of “r” (which consists of only zeroes) to form a new state of “r.” The new state “r” concatenated with the initial capacity “c” yields a fixed bit rate “b” of 1600 bits. The permutation function is then applied with 24 rounds of operations on these bits after formulating them into a 5 × 5 × 64 state matrix. Each round includes five steps, viz., “theta,” “rho,” “pi,” “chi,” and “iota” steps. The “theta” step, being the initial step of the function, includes complex trilogical computation stages that start with a bitwise XOR, followed by a cyclic shift, and then again an XOR stage. These three logical stages are usually referred to as “theta1,” “theta2,” and “theta3.” The operational flow of the complete Keccak function is described as an algorithm in Fig. 3.

Fig. 3. Algorithm for the round function [13].

On completion of the first four steps in a round, the final “iota” step will have only one lane of operation left, viz., A[0, 0]. This lane, having a bit length of 64 bits, is then XORed with a 64-bit predefined RC [13]. Each round requires a specific RC value to complete the process; therefore, a specific precomputed RC is utilized for each round. At the end of the 24th round, the entire message block is absorbed, and then the resultant output variable “r” is forwarded to the squeezing phase. Extraction of the MSBs from the “r” bit sets is performed by the squeezing function. Truncation with a specific number of MSBs is then carried out appropriately according to the required size of the output digest. Finally, byte inversion is done on the truncated “r” to obtain the final hash digested output.
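The absorb, permute, and squeeze flow described above can be sketched in software as follows. This is a minimal reference sketch, not the hardware design of this paper: it follows the byte-level conventions of FIPS 202 (little-endian lanes, padding bytes 0x06 and 0x80), so the MSB extraction and byte inversion mentioned in the text do not appear as separate steps, and it derives each round constant with the standard 8-bit LFSR instead of a precomputed table. The function names are illustrative.

```python
MASK64 = (1 << 64) - 1


def rol64(value: int, shift: int) -> int:
    """Rotate a 64-bit lane left by `shift` bits."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64


def keccak_f1600(lanes):
    """Apply the 24-round permutation to a 5 x 5 array of 64-bit lanes, lanes[x][y]."""
    lfsr = 1
    for _ in range(24):
        # theta: column parities, then XOR of the two neighbouring columns
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate each lane and move it to its new position
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi: nonlinear mixing along each row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: XOR the round constant into lane A[0, 0]
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes


def sha3_512_sketch(message: bytes) -> bytes:
    """SHA-3(512): rate r = 576 bits, capacity c = 1024 bits."""
    rate_bytes = 576 // 8
    # Padding: first pad byte carries 0x06, last pad byte carries 0x80
    pad_len = rate_bytes - (len(message) % rate_bytes)
    padding = bytearray(pad_len)
    padding[0] ^= 0x06
    padding[-1] ^= 0x80
    padded = bytes(message) + bytes(padding)
    lanes = [[0] * 5 for _ in range(5)]  # initial state: all zeroes
    # Absorbing phase: XOR each r-bit block into the state, then permute
    for offset in range(0, len(padded), rate_bytes):
        block = padded[offset:offset + rate_bytes]
        for i in range(rate_bytes // 8):
            lanes[i % 5][i // 5] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        lanes = keccak_f1600(lanes)
    # Squeezing phase: 512 bits fit inside one rate block, so read 8 lanes
    return b"".join(lanes[i % 5][i // 5].to_bytes(8, "little") for i in range(8))


if __name__ == "__main__":
    print(sha3_512_sketch(b"abc").hex())
```

The output can be compared against `hashlib.sha3_512` from the Python standard library as a sanity check on the round function.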
III. SURVEY ON EXISTING WORK

The SHA-3 architectures have unique techniques and approaches for implementation on various hardware devices. FPGAs are found to be the most preferred target device due to faster implementation, low complexity, increased TP, and low computational time. Earlier designs and evaluations on different FPGAs have proven performance metrics in terms of TP, area, and efficiency. Based on the techniques utilized, SHA-3 designs can be classified as conventional (base), folding, unrolling, pipelining, and subpipelining techniques. Efforts on performance improvement were reported with such techniques in the recent past [14]–[33]. Area versus TP performances for a few notable reported designs are depicted in Fig. 4. At the end of this section, security aspects of SHA-3 are briefly discussed, especially with the serious threat posed due to the information leak. Techniques reported to effectively thwart such attacks are also surveyed.

Fig. 4. Area versus TP on reported works.

A. TP Enhancement Techniques

The first hardware architecture proposed by Gaj et al. [14] was primarily with data path and controller models. The model with this design style brought forth a SHA-3 core unit that includes an additional first-in–first-out (FIFO) to enable a common I/O interface. Even though the additional serial data path FIFO chains might marginally increase the computational time, the parallel architecture of the round function compensated the lag, yielding a TP of 6.85 Gb/s. Increase in frequency of 275 MHz boosted the efficiency though the TP achieved


is moderate. In order to decrease the computational time further, which can ultimately increase the TP, Kerckhof et al. [15] proposed a design with two distinct asynchronous single-port random access memories (RAMs). Feasibility of parallel processing of the SHA-3 core steps [15] provided better TP. In addition, a low area is achieved with the inclusion of a barrel rotator to store the offset values of the “rho” step, which is implemented using dedicated LUT resources instead of a basic multiplexer (MUX). However, though the design is compact, it could not meet the TP requirement for image-based biometric access control applications. Enhancing the TP with minimal modification of this architecture is not feasible due to the additional RAM store/fetch cycles. A co-processor design was later proposed by Knezevic et al. [16] for fair evaluation of the reported SHA-3 candidates. This co-processor architecture, with a control and cryptographic FPGA design in a fully autonomous mode, yielded an increased TP of 9 Gb/s. However, the word-length padding scheme increased the delay/wait cycles, rendering a long waiting period for preparing the padded message. Latif et al. [17] and Jararweh et al. [18] suggested an efficient FPGA implementation technique, which can boost the TP through efficient mapping of LUT resources. Though the area utilization was better, the TP achieved was unsatisfactory.

The designs reported in [14]–[18] include a software padder that lies outside the hardware core unit, which could not yield the required TP for complex applications. A full hardware crypto architecture was first proposed by Provelengios et al. [19] that can promisingly cater to the TP requirement for image-based biometric access control applications. A lane architectural design was suggested by implementing a complete hardware padder block that seamlessly interfaces with the core unit. Though the TP achieved was better than the earlier designs of [14]–[18], the additional clock cycles required by the lane architecture meant that it could not meet the required performance level. A complete hardware architectural design with less computational time, together with an intermodule architectural modification that enhances TP and, thereby, the efficiency, is required for the target application. Efficiency is directly proportional to TP as per (1). Enhancing the TP directly depends on frequency and bit rate improvement; however, TP itself is inversely dependent on the number of clock cycles, thus requiring a design which yields high frequency, moderate bit rate, and low clock cycle numbers, as per (2)

$$E_f = \frac{\mathrm{TP}\ (\text{in Gb/s})}{A\ (\text{in slices})} \tag{1}$$

$$\mathrm{TP} = \frac{B \cdot f}{C_n} \tag{2}$$

where TP is throughput, A is area, B is block size, f is maximum frequency, and C_n is the number of clock cycles.

B. Area Minimization Techniques

Techniques to lower area by schemes such as minimized RC generator logic, structural modifications of the core unit, and effective resource utilization were reported. Paul and Shukla [20] proposed an RC design that includes a count generator to fetch the RC from on-board read only memory (ROM). This appeared to be better than the earlier work but yielded poor frequency. In another attempt, Wong et al. [21] suggested a technique to lower ROM area by resizing the bit-length from 64 to 8. Though the area occupied was convincingly low, the achieved maximum frequency is not satisfactorily high. A compression box technique reported in [22] minimized the area by 16% by reducing the round function steps; however, TP is compromised due to large path delay. Furthermore, a compact design with significantly low area is achieved by employing the folding factor (FR) technique in [23] and [24]. The Keccak round function is folded four times by a rescheduling concept. A separate state memory is also utilized to store the offset values of the “pi” step to minimize the rewriting cycles. Even though a low area of around 476 slices is achieved, TP is poor as the number of clock cycles is four times more than the basic/reference Keccak designs. It is observed that the FR and the number of clock cycles are thus inversely proportional to TP and can be represented as follows:

$$\mathrm{FR} \propto C_n \propto \frac{1}{\mathrm{TP}}. \tag{3}$$

Aziz [25] suggested an approach to reduce area by way of efficient utilization of digital signal processor (DSP) resources. The folding technique was employed with no additional resource cost such as block RAM (BRAM), DSP, and RAM; however, the invariable increase of clock cycles resulted in poor TP. A technique to reduce the clock cycles with an additional round function within the loop of the core unit was successfully implemented in [26] while maintaining the same area. The unrolling technique utilized here rendered poor frequency. Later, this loop unrolling technique was further enhanced by El Moumni et al. [27] to decrease the number of clock cycles from 24 to 2. However, the architecture consumed larger area. A major concern posed by this architecture is the weak security wall against collision attack, which was successfully demonstrated by Guo et al. [28]. To enhance the security strength “S” of message bits, capacity “c” bits must be larger compared with rate “r” bits, as shown in Table II. Based on the security analysis, (3) can be rewritten in terms of capacity bits and number of clock cycles/rounds, as shown in the following equation:

$$S \propto c \propto C_n \propto \mathrm{FR} \propto \frac{1}{\mathrm{TP}}. \tag{4}$$

Considering the parameter dependencies provided by (1)–(4), a Keccak design with capacity as high as 1024 bits that curbs collision attack is proposed in this work. Furthermore, the folding technique to achieve high efficiency is not preferred in this work as it significantly increases the clock cycles, rendering poor TP. Instead, a frequency enhancement technique is employed to do so. Therefore, considering (1)–(4), efficiency can be expressed in terms of frequency as in the following equation:

$$E_f \propto \mathrm{TP} \propto f. \tag{5}$$

C. Frequency Boosting Schemes

Subpipelining, which reduces the longest data path, is widely employed to enhance the frequency [29]–[32]. Michail et al. [33] replicated the permutation function four times, which yielded a four-stage pipelining architecture for frequency improvement. However, the architecture holds drawbacks such as poor TP due to the software padder and additional area overhead due to multiple pipelining stages. Furthermore,


the architecture implemented on a V5 device in [10] was found to have frequency dependency on the data path delay (DPD) between the registers. DPD is in turn directly proportional to the number of logic levels (LLs), as is evident from Fig. 5; thus, increasing the number of LLs substantially decreases TP.

Fig. 5. DPD of high-speed core of Keccak [10].

The LUT mapping-driven register transfer level (RTL) design with an appropriate hardware description language (HDL) coding is crucial in eliminating the maximum critical path delays. Therefore, frequency in terms of the number of LLs can be expressed as in the following equation, and thus it can be concluded that an LL-centered architecture is a key parameter in improving the efficiency:

$$\frac{1}{f} \propto \mathrm{DPD} \propto \mathrm{LL}. \tag{6}$$

Comprehensive analysis and review of the above techniques and their implications on Keccak architecture performances revealed that an architectural enhancement that yields high efficiency with moderate TP and low area, combined with a collision-resistant design, is essential. An architecture that includes structural designs of padding and permutation modules is proposed in this work. The architecture effectively improves the frequency by decreasing the number of LLs between the registers as per (6). The padder block is implemented utilizing the load shift and store (LSS) scheme with an additional output buffer that prepares the 576 padded bits. A synchronized clock employed between padder and permutation provides seamless store and forward of padded messages without the need of additional clock cycles.

Handling the storage of RC is another crucial aspect in achieving high efficiency. Earlier work employed a straightforward method of store/fetch. The 24 RC values are stored in the on-board ROM with a 24 × 1 MUX as an I/O controller. Alternatively, the RC circuit constructed using a linear feedback shift register (LFSR) can perform on-the-go generation of the RC values as designed in [31]. Another interesting and efficient technique is to dynamically calculate the RC value by means of an RC generator, which can then be invoked whenever required, as demonstrated in [20]. All the existing techniques of handling the RCs failed in terms of area and TP performances. Therefore, a novel technique of RC value generation within the required clock cycle is proposed. This is achieved by the minimized structural design of combinational and sequential logic that analyzes the nonzero bits to generate the required RC value. This has substantially decreased the area with a significant increase in TP. The proposed Keccak architecture that includes padder, permutation, and the RC generator was tested and verified on the preferred hardware device using the existing test vectors. The debugging and verifications were carried out with the VIO intellectual property (IP) core interface of the Virtex devices, utilizing the Xilinx Vivado EDA tool.

D. Security Aspects of SHA-3 Against SCA

Power analysis (PA) and fault analysis (FA) are the primary SCA strategies on cryptosystems that challenge the security aspects, though stringent measures were implemented to thwart the information leak. In SHA-3 implementation, if unprotected, the internal state bits of Keccak might reveal the secret key of the combinational crypto-engine. One such noticeable and vulnerable architecture was the message authentication code (MAC)-Keccak architecture. PA such as simple PA (SPA), correlation PA (CPA), and differential PA (DPA) was employed on the side channel information leaked from such an architecture [7], to effectively retrieve the internal secret data. The invader specifically targets the intermediate registers of the permutation function for a successful breach. It is observed that the “theta” (θ) step of the permutation function, which utilizes a larger intermediate storage element due to its three-stage logic, is victim to tapping and reading of internal power traces. The strong correlation leakages are analyzed through CPA to retrieve the complete secret key of the MAC, which has been demonstrated in [34] and [35]. Effective countermeasures that can shield the SHA-3 engine were reported recently. Random shuffling, invariance-based protection scheme, and compiler assistant threshold are a few techniques to thwart such power leakages [36].

In contrast, FA, such as differential fault injection attack (DFA) and algebraic fault injection attack (AFA), poses a severe threat to SHA-3, as it leads to serious faults in the system. The target functional units for such attacks were the “theta” (θ) and “chi” (χ) steps. The eavesdropping is performed especially during the last two rounds of operation [37], [38] for effective fault injection. In DFA, a few single-byte faults were injected during the last-but-one round operation, resulting in minimal fault to the data. In AFA, by contrast, a unit of 8, 16, or 32 bits was injected into the permutation block; thus, severity is found to be high compared with DFA. This is due to the fact that AFA targets an algebraic modification resulting in structural alteration during the last two rounds of operation with optimized methods. Thus, a fault is successfully injected within a few seconds. Error detection approach, clock glitches, and power distribution analysis help to thwart such fault injection in SHA-3.

IV. PROPOSED SYSTEM

Existing designs have invariably increased the DPDs and area, rendering poor efficiency. This work explored the possibility of employing a handshake protocol within the core units to substantially decrease the path delays. In addition, the built-in buffer introduced in the padder significantly increases the maximum allowable clock frequency. Furthermore, a novel C-RC is proposed that eliminates the need of additional memory resources as well as clock cycles, reducing the area and computation time to a greater extent. Optimized predefined capacity bits of 1024 and clock cycles of ≤12 further improved the area utilization, as large BRAM and DSP resources are not required. Also, use of a multistage pipeline is avoided. Thus, an architectural design that provides overall improvement in efficiency is proposed.

A. Handshake Protocol

The proposed SHA-3 core unit includes padder and permutation modules, which are designed as two separate


structural units; however, as they run on a synchronized clock, they resemble one functional module. This results in a shorter path delay between the two. The proposed unidirectional communication protocol transfers the rate “r” bits from the padder to the permutation in a ready-acknowledge mode termed the handshaking protocol, as shown in Fig. 6. The communication is initiated by the padder module when its buffer is full. Once the “buffer_full” signal goes high, the padder module will pull the “padd_ready” signal high, requesting the permutation block to accept the transfer of the “r” bits (576 bits). If the permutation module is ready, an acknowledge signal “f_ack” is sent back to the padder as acceptance. Thus, a collaborative trusted communication is established for a seamless data transfer. The handshaking-based bit transfer is found to have shortened the longest data path by one-third compared with the earlier designs [16], [21]. This is due to the fact that the number of LLs between the input and the output flip-flops (FFs) has decreased as per (6). The decreased number of LLs will also improve the maximum clock frequency of the architecture. Table III provides information on nomenclatures and conventions that are utilized in Sections IV-B and IV-C.

Fig. 6. Handshaking operation of SHA-3 core.

TABLE III
NOMENCLATURES OF CONTROL SIGNALS

B. Synchronized Padder Module

The padding technique can be either bit-level or with the entire word length. The bit-level technique is opted for in this work due to various advantages as discussed in [39] and [40]. The message bursts are prepared according to the output bit size by appending “0” bits for the bits left out in the message inputs. Such schemes are well suited to input messages with variable bit-length. Also, the complexity is less compared with the word-level padding technique. Furthermore, the bit-wise padding adds more confidentiality, as intense scrambling is done at the initial stage itself. As an added advantage, shielding against SCAs at the intermediate stage due to the bit-level MUX operations is achieved, which prevents information leak. Furthermore, the choice of input message length determines the complexity and hence influences the architecture performances. An optimal input message length of 32 bits for the reference design and 64 bits for the high-speed architecture is preferred here, which renders low complexity and improved performances. Table IV provides the padding rule for a 32-bit fixed message length [13]. Various numbers of padded bits are required to be appended for each variable message length. For example, if the message input is of 24 bits, then a padding operation to append 8 more bits must be carried out so as to obtain the required message length of 32 bits.

TABLE IV
PADDER RULE FOR 32 BITS [13]

The encryption process begins with padding of the input bits by the padder block as soon as they are received. Reception of the input messages starts only when the “in_ready” signal goes high. The “Padd_Scheme” circuitry subsequently verifies the user-defined “pad_num” to carry out the required padding scheme. Once padded, the LSS technique is applied by the MUX, as shown in Fig. 7. The “is_last” selection line of the MUX determines the sequence of padded bits to be loaded on to the shift registers. If the “is_last” bit is low, the “padd_out” bits from the “Padd_Scheme” get loaded on to the shifter; else, the “buff_feed” bits from the output buffer are loaded. The bits are shifted once to the left by the shifter logic and then passed on to the output buffer, filling from the MSBs. The LSS process continues until the buffer is filled with all 576 bits. A “buffer_full” signal enabled during the completion of the padder operation indicates that the “padder_out” bits are ready for scrambling by the permutation block. The permutation block waits until the “padd_ready” control signal is received through the handshake protocol. The LSS schemes combined with the universal clock-based handshake protocol reduce the computational time by eliminating unnecessary idle and wait cycles. Therefore, as compared with the earlier designs [14]–[18], the handshake protocol has improved the performance of the padder to a greater extent.

C. Permutation Module on Handshake Mode

In handshake communication, the permutation module starts to execute the Keccak function once the padder enables “padd_ready.” Initially, XOR operation is performed

Authorized licensed use limited to: PES University Bengaluru. Downloaded on March 27,2023 at 05:40:10 UTC from IEEE Xplore. Restrictions apply.
494 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 30, NO. 4, APRIL 2022

Fig. 7. Padder block.

to concatenate the “padder_out” bits of 576 (referred to as


rate “r ” bits) with the capacity “c” bits of 1024, to form
5 × 5 × 64 state matrix [13]. Next, MUX with its initial
low selection signal starts the “round 0” operations with this
concatenated bits itself. However, for the remaining rounds,
MUX select line goes high such that it chooses the feedback
bits from the output registers. The reason to do so is that the
output register values are null before the “round 0” operation,
as no functions are performed yet. A total of 24 rounds of
operations are performed until the final 1600 bits are obtained.
Truncations are then performed according to the required SHA
output digest. For a 512-bits SHA encryption, 512 MSB bits
of the 1600 bits are truncated; furthermore, byte inversion
is applied to appropriately rearrange the final SHA-3 (512)
output digest [13]. Fig. 8 depicts the functionality and data
flow of the proposed base SHA-3 architecture.
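The data path described above (XOR of the padded 576-bit rate block into the 1600-bit state, 24 rounds with feedback, truncation to 512 bits) can be cross-checked in software. The following is a minimal Python sketch under stated assumptions, not the authors' RTL: the function names are hypothetical, the padding is the byte-level SHA-3 padding of FIPS 202, and the digest is taken as the first 64 bytes of the state in FIPS 202 byte order, which is assumed to correspond to the MSB-side truncation and byte inversion of the hardware. The result is compared against Python's `hashlib`.

```python
import hashlib

RATE_BYTES = 72     # r = 576 bits
DIGEST_BYTES = 64   # 512-bit digest; capacity c = 1600 - 576 = 1024 bits


def rol64(a, n):
    """Rotate a 64-bit lane left by n positions."""
    n %= 64
    return ((a << n) | (a >> (64 - n))) % (1 << 64)


def keccak_f1600(state):
    """Apply the 24-round Keccak-f[1600] permutation to a 200-byte state."""
    lanes = [[int.from_bytes(state[8 * (x + 5 * y):8 * (x + 5 * y) + 8], "little")
              for y in range(5)] for x in range(5)]
    lfsr = 1
    for _ in range(24):
        # theta
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ ((~row[(x + 1) % 5]) & row[(x + 2) % 5])
        # iota: the round constant is generated on the fly, one bit at a time
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    out = bytearray(200)
    for x in range(5):
        for y in range(5):
            out[8 * (x + 5 * y):8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
    return out


def sha3_512_sketch(msg):
    """Absorb msg in 576-bit blocks, then truncate the state to 512 bits."""
    padded = bytearray(msg) + b"\x06"                  # SHA-3 domain bits + first pad bit
    padded += b"\x00" * (-len(padded) % RATE_BYTES)    # zero fill up to a full block
    padded[-1] |= 0x80                                 # final pad bit
    state = bytearray(200)                             # rate and capacity start at zero
    for off in range(0, len(padded), RATE_BYTES):
        for i in range(RATE_BYTES):
            state[i] ^= padded[off + i]                # XOR the block into the rate part
        state = keccak_f1600(state)
    return bytes(state[:DIGEST_BYTES])


if __name__ == "__main__":
    print(sha3_512_sketch(b"abc") == hashlib.sha3_512(b"abc").digest())
```

For the first block the state is all zeros, so the XOR of the rate bits leaves the 1024 capacity bits at zero, which matches the description of combining the “padder_out” bits with the capacity bits before “round 0.”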
1) Compact-Dynamic RC: The round function in the core unit includes the “theta,” “rho,” “pi,” “chi,” and “iota” steps [13]. Detailed explanations of these five steps are already provided in Section II. Storing and reading the RCs for the “iota” step influences the area and computational time to a great extent. The RCs, being 24 sets of 64-bit data, require additional memory resources. Furthermore, reading a word of this length during every round function requires additional clock cycles and, hence, additional power budget allocation. For better resource utilization and to reduce computational time, dynamic generation of the RCs is proposed. During each round function, only the specific required RC value is generated at the rising edge of the clock with the help of a combinational/sequential logic circuit. Consider the hexadecimal code of the RC for round 7, which is “8000 0000 8000 8081.” The binary equivalent has 1s only in specific places, as shown in Table V. Interestingly, it is noted that the 1s fall only within the bit positions 0, 1, 3, 7, 15, 31, and 63 (for round 7, at positions 0, 7, 15, 31, and 63), and the rest of the positions are 0s. Surprisingly, the RCs for the remaining rounds were also observed to have 1s only in the 0th, 1st, 3rd, 7th, 15th, 31st, and 63rd bit positions (seven bit positions only), as tabulated in Table VI. Parameter “i” in Table VI represents the round count, for which a synchronous counter is implemented to generate the specific bit sequence that triggers the RC logic circuit to generate the required constant.

Fig. 8. Proposed base SHA-3 architecture.

TABLE V
RC CONSTANT FOR SEVENTH ROUND FUNCTION

To dynamically generate the RCs shown in Table VI, a sequential logic circuit has been designed using structural modeling with seven MUX-implemented multi-input OR logic blocks having ≤12 inputs, which can effectively minimize the utilization of 6-input LUT resources. Also, a 64-bit single-row edge-triggered parallel-in–parallel-out (PIPO) register is designed for single-cycle direct loading of the generated RC values onto the functional logic circuits. The need for a store/forward scheme is thus eliminated, as shown in Fig. 9(a). Seven OR gates are required to generate the seven 1s for the bit positions ranging from 0 to 63. Though the maximum number of 1s in any of these seven bit positions is 15 (specifically for RC[15]), the OR inputs are optimized to 12 so as to generate “1”s appropriately in the required bit positions for a specific round. This is achieved by XORing


TABLE VI
ROUND-SPECIFIC RC SEQUENCES

the counter outputs that are alternately “1” during any two respective rounds, such that the “1” output of that specific XOR drives the corresponding OR gate for appropriate generation of the RC value. This can be observed in the OR0 gate, which generates the “RC[0]” bit and must be driven by the 12 counter output lines [0]RC, [4]RC, [5]RC, [6]RC, [7]RC, [10]RC, [12]RC, [13]RC, [14]RC, [15]RC, [20]RC, and [22]RC. Direct implementation of the logic would result in the utilization of 12-input OR gates for all combinations. However, it is observed that [6]RC and [7]RC are alternately “1” during the 7th and 8th rounds; similarly, [12]RC and [13]RC, as well as [14]RC and [15]RC, have their outputs as “1” alternately during the 13th and 14th rounds and the 15th and 16th rounds, respectively. When three XOR gates are used for these counter output lines, the OR gate input lines are reduced from 12 to 9, as shown in Fig. 9(b).

Fig. 9. (a) C-RC generator block diagram. (b) Schematic of C-RC generator highlighting seventh round bit generation.
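The structure of the round constants and the input reduction for the OR0 gate can be checked with a short Python sketch. It is an illustration under stated assumptions rather than the authors' circuit: the constants are derived with the LFSR of FIPS 202, the round counter is modeled as a one-hot 24-bit vector indexed by the 0-based round count “i,” and all function names are hypothetical.

```python
def keccak_round_constants():
    """Derive the 24 Keccak-f[1600] round constants with the FIPS 202 LFSR."""
    constants, lfsr = [], 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                rc |= 1 << ((1 << j) - 1)   # only bits 0, 1, 3, 7, 15, 31, 63
        constants.append(rc)
    return constants


RC = keccak_round_constants()
ACTIVE_BITS = (0, 1, 3, 7, 15, 31, 63)

# Counter lines paired through an XOR gate for OR0 (assumed from the text).
XOR_PAIRS = ((6, 7), (12, 13), (14, 15))


def rounds_with_bit(bit):
    """0-based round indices whose constant has a 1 at the given bit."""
    return [i for i, rc in enumerate(RC) if (rc >> bit) & 1]


def or0_inputs(i):
    """Inputs of the reduced OR0 gate while round i is active (one-hot counter)."""
    onehot = [int(k == i) for k in range(24)]
    paired = {k for pair in XOR_PAIRS for k in pair}
    inputs = [onehot[a] ^ onehot[b] for a, b in XOR_PAIRS]
    inputs += [onehot[k] for k in rounds_with_bit(0) if k not in paired]
    return inputs


def rc0_bit(i):
    """RC[0] bit produced by the reduced OR0 gate for round i."""
    return int(any(or0_inputs(i)))


if __name__ == "__main__":
    print(hex(RC[6]))              # constant of the seventh round
    print(rounds_with_bit(0))      # counter lines that must drive OR0
    print(len(or0_inputs(0)))      # OR0 inputs after XOR pairing
```

Because the counter is one-hot, two paired lines are never “1” at the same time, so the XOR of a pair behaves like their OR and the reduced gate still yields the correct RC[0] bit in every round.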
The optimized design now has five 2-input XOR gates and seven variable-input OR gates, thus requiring a lower LUT count to cater for the total of 80 inputs (70 OR gate inputs + 10 XOR gate inputs). It is observed that only 16 LUTs are required to implement the design, which would otherwise require 19 LUTs (for 85 inputs in total).

To further explore the circuit functionality, consider that the seventh round operation is in progress, with “i” being [6] enabled. The register input lines RC[0], RC[7], RC[15], RC[31], and RC[63] go high, which in turn pull the corresponding register output lines high. The remaining bit positions of the 64-bit register remain low, generating the required RC value of “8000 0000 8000 8081” for this round operation. The circuit with active connections for round 7 is highlighted in Fig. 9(b). A similar process continues for the other rounds to generate the remaining 23 RCs. The simulation output of the 24 RC values is shown in Fig. 10, with the value of the seventh round marked. The proposed C-RC generator logic substantially decreases the computational time for generation of each RC value. In addition, the design is compact, occupying extremely low area, as dynamic generation and temporary store and forward techniques were employed. This is evident from the fact that it requires only 64 single-bit registers compared with the 24 × 64 ROM/BRAM utilized in the earlier reported works [20]–[22].

Fig. 10. Simulation results of C-RC.

At the end of the first round function (i.e., when i = 0), the output bits of the final step (“iota”) are temporarily stored in the output “temp_register,” as shown in Fig. 8. This is then fed back as an input to the MUX for further scrambling by the subsequent rounds. The MUX selection line is set high from the start of the second round so that the “temp_register” value is selected for the rest of the round operations. The feedback


value, which is of 1600 bits (it can be represented in the form of a 5 × 5 × 64 matrix), goes through all five steps of “theta,” “rho,” “pi,” “chi,” and “iota” during the remaining 23 rounds. During the “iota” step of each round, the rising edge of the clock initiates the 24-bit binary counter, which in turn triggers the C-RC generator to compute the appropriate RC. If, say, the round 2 operation is in progress, the second bit place of the 24-bit register will be “1”; hence, the RC for [1]RC[63:0] is computed by the RC generator logic. The [1]RC[63:0] constant value is utilized by the “iota” function to perform the second round operation. The final 1600 bits present in the output “temp_register” at the end of round 24 are truncated from the MSBs to yield 512 bits. Byte inversion is then applied to appropriately rearrange the final SHA-3 (512) output digest.

The proposed base design is verified on the V5 FPGA device, achieving a maximum frequency of 313 MHz and resulting in a relatively moderate TP of only 7.7 Gb/s. However, the customized structural design, occupying a low area of 872 slices, yielded a marginally high efficiency of 8.69 Mb/s/slices. It is preferable for applications such as IoT security, secured wireless sensor nodes [11], [20], and biometric cryptographic access control [41], [42] to run on high-efficiency crypto-engines, which will enhance the processing speed. However, the reported cryptographic hash-based multimodal biometric access control systems were observed to have poor speed, with encryption times of a few milliseconds [43], [44]. In addition, the enormous computation involved in both parallel image processing and double encryption rendered poor efficiencies of 2.2, 1.12, 0.379, and 0.0007 (in Mb/s/slices) with SHA1, SHA2, advanced encryption standard (AES), and Rivest–Shamir–Adleman (RSA) encryption standards, respectively [43]–[46]. Moreover, the AES implemented for a multimodal biometric application on the older Virtex-II FPGA device [41] also rendered a poor efficiency of around 0.006 Mb/s/slices. Though the base architecture designed in this work achieves satisfactory efficiency, much higher efficiency is preferred to fully utilize the maximum speed capability of advanced FPGA devices, such as V5 and V7. Thus, the permutation functions of the base hash core must be redesigned to capitalize on the advanced features of the modern devices, which can in turn enhance the efficiency and hence the speed of the authentication system. Techniques to fuse a few boosting schemes are explored in this work to enhance the efficiency, which can further improve the speed. Such high-efficiency architectures have been implemented without violating the operational aspects of the Keccak architecture.

D. Fusion Architectures
To catapult the efficiency, interleaving of one or more boosting schemes in the permutation module of the proposed base architecture [20]–[33] is introduced in this work. The interleaving technique can be carried out by fusing pipelining with subpipelining, subpipelining with unrolling, or pipelining with unrolling. Such fusion schemes can support up to 64-bit message input, which can significantly decrease the computational time of the architecture, yielding a high-speed crypto unit. Hence, even with a larger data width of 64 bits, increased TP can be achieved when compared with the base design with a 32-bit crypto-processor.

1) Integrated-f: An additional functional unit in the fusion schemes might provide poor area and TP performances; hence, it requires a compensatory structural implementation. An integrated functional (int_f) unit that renders a logically optimized implementation of the permutation function within four steps (θ1,2, θ3, γ, and ι) is proposed. The initial “theta” step requires multiple intermediate storage elements due to its three-stage logic operations, viz., “theta1,” “theta2,” and “theta3.” The logical optimizations reported in the earlier work [11], [22], [25], [29] provided poor area utilization, making them a poor choice for fusion architectures. Therefore, an optimization technique with (θ1,2) and (θ3) is explored in this “theta” step. The resultant logical expression of (θ1,2) after optimization can be given as

\[
\theta_{1,2} = \bigl(a[x-1,0] \oplus a[x-1,1] \oplus a[x-1,2] \oplus a[x-1,3] \oplus a[x-1,4]\bigr)
\oplus \bigl(\mathrm{Rot}(a[x+1,0],1) \oplus \mathrm{Rot}(a[x+1,1],1) \oplus \mathrm{Rot}(a[x+1,2],1) \oplus \mathrm{Rot}(a[x+1,3],1) \oplus \mathrm{Rot}(a[x+1,4],1)\bigr). \tag{7}
\]

To further enhance the area utilization, the “rho,” “pi,” and “chi” steps are optimized to yield the “gamma” (γ) logical structure, which significantly increases the efficiency. The logical expression for the optimized “gamma” step can be given as

\[
\gamma = \{b[(x+3y),x][63-n:0],\; b[(x+3y),x][63:63-n+1]\}
\oplus \sim\{b[(x+3y)+1,x][63-n:0],\; b[(x+3y)+1,x][63:63-n+1]\}
\;\&\; \{b[(x+3y)+2,x][63-n:0],\; b[(x+3y)+2,x][63:63-n+1]\}. \tag{8}
\]

The output of “gamma” is then processed by the final “iota” step, yielding an integrated four-step round function, as shown in Fig. 11. Though both the five-step f and Integrated-f perform similar logical operations, the logical structure is optimized in the latter by eliminating the need for multiple intermediate storage elements. Thus, additional FPGA resources to store the intermediate data are effectively minimized in Integrated-f; hence, when compared with the conventional five-step function, the utilization of LUT-FF pairs is found to be marginally low. Furthermore, Integrated-f realization for the unrolling architecture with V5 as the target device revealed that almost 95% of the resources ((5279/5568) ∗ 100) are effectively used as LUT-FF pairs among the total resources utilized. This is about one percentage point higher than the design realized with the five-step function, which has only about 94% of

Fig. 11. Integrated-f.


Fig. 13. Cascade-P design.

Fig. 12. Fusion architectures: (a) base design and (b) dual-pipeline design
(Dual-p).

resource utilization ((5363/5724) ∗ 100) [27]. Moreover, the fully used LUT-FF pairs are found to be about 40% ((2103/5279) ∗ 100) for Integrated-f, which is again a little over one percentage point higher than the roughly 38% ((2064/5363) ∗ 100) rendered by the five-step function. Thus, it is evident that the proposed Integrated-f minimizes either a single LUT or a single FF in any of the slices. Therefore, Integrated-f, which provided better logical optimization, is employed in the other fusion architectures to ascertain the performances and to determine the right choice for multimodal biometric authentication.
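The utilization ratios quoted above follow directly from the reported resource counts. The short Python sketch below simply redoes that arithmetic; the helper name and variable names are hypothetical, and the counts are the ones given in the text for the V5 unrolled realization.

```python
def utilization_pct(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# LUT-FF pairs used among the total resources utilized
pairs_integrated_f = utilization_pct(5279, 5568)
pairs_five_step = utilization_pct(5363, 5724)

# Fully used LUT-FF pairs among the LUT-FF pairs
full_integrated_f = utilization_pct(2103, 5279)
full_five_step = utilization_pct(2064, 5363)

if __name__ == "__main__":
    print(f"LUT-FF pairs: {pairs_integrated_f:.2f}% vs {pairs_five_step:.2f}%")
    print(f"fully used:   {full_integrated_f:.2f}% vs {full_five_step:.2f}%")
```

With these counts, the gap between Integrated-f and the five-step function comes to a little over one percentage point in both comparisons, which is why whole-number rounding of the individual ratios can make the difference look larger.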
2) Interleaving Techniques: Though the base architecture provided significantly high TP and showed effective utilization of resources, its efficiency is still found to be marginally low. Fusion of several boosting schemes would be an alternative option to significantly enhance the efficiency. Adapting fusion techniques with a few boosting schemes, such as pipelining, subpipelining, and unrolling, in the permutation module of the proposed base architecture is explored. The optimum design approach that meets the application-specific efficiency requirement is then determined.

Primarily, the base Keccak architecture includes a permutation module with a simple feedback loop, as shown in Fig. 12(a). Owing to the long critical path delay between the round operations, caused by the increased number of LUTs, it failed to achieve the desired maximum frequency. A latch to temporarily store and forward the MUX output boosts the clock frequency between each round operation. Furthermore, it optimizes the critical path delay, thus increasing the computational performance when compared with the earlier designs [29]–[32]. Such a design, which resembles pipelining of the message bits, is employed to enhance the maximum operating frequency. To further enhance the clock frequency, an additional register that caters to subpipelining of the data-out from the theta step is fused within the pipelining operation. Also, the “gamma” within the internal round function (sub_f) can minimize the path delay between LUTs, due to the store/forward logic. The proposed high-speed Keccak architecture with such a dual-pipelining structure (Dual-p), shown in Fig. 12(b), is found to decrease the DPD to two-thirds compared with the earlier designs [29], [32]. Also, a significantly high maximum frequency is observed. However, the additional registers are found to degrade the area utilization. Though the frequency is high, the poor area substantially decreases the efficiency, as area is inversely proportional to efficiency.

The cascade permutation (Cascade-P) design shown in Fig. 13 has reduced the clock cycle count to half, which almost doubled the TP as evident in (1). Two cascaded permutation stages of the base Keccak architecture, when operated in a round-split mode with counter select (CS) logic, perform a pseudo-parallel function. Therefore, two functions run on a single clock, thus requiring half the total clock cycles to generate the hash digest. When the first P stage ends its (0–11) rounds for the first set of permutation bits, the second P stage starts the (12–23) rounds of Integrated-f operations. While the second P stage is performing its (12–23) round functions, the first P stage begins its (0–11) rounds for the subsequent message data. This resembles a pseudo-parallel operation over the Cascade-P stages. Therefore, the hash digest value is obtained within 12 clock cycles instead of 24 clock cycles, which doubles the TP when implemented on the V5 device. Furthermore, the operating frequency is also maximized with subpipelining, which introduces the shortest path delay between the steps. Though the design offers salient features, the additional P stage employed for fusion has doubled the area, degrading the targeted efficiency performance.

Fig. 14. Dual-f design.

To overcome the area drawback of the Cascade-P design and hence enhance the efficiency, the Dual-f Keccak architecture with a single MUX and register is designed. The minimal structural elements have marginally improved the area performance; however, the increased LLs have led to the degradation of the frequency as evident in (5). Therefore, an additional storage element within the round functions is included to minimize the loss in frequency performance, as shown in Fig. 14. Furthermore, the single-loop operation of the Dual-f, resembling a piggy-back design, completes the entire permutation function in half the clock counts. These two performance enhancements provided a satisfactory improvement in efficiency that can cater for biometric authentication applications.

The proposed three fusion Keccak architectures have been implemented on V5 and V7 FPGA devices to validate the


Fig. 15. Preferred target device of the reported SHA-3 designs.

Fig. 17. Experimental setup of SHA-3 physical implementation.

efficiency performance. Physical realization of the architectures is mandatory to ascertain the right choice of design for the specific application that meets the desired performances.

Fig. 16. VIO block diagram [40].

V. HARDWARE IMPLEMENTATION
For fair evaluation among the proposed designs, as well as for comparison with the existing work, implementations were carried out on V5 and V7 boards, as they are the most preferred target devices of the reported designs. This is evident from the pie chart of Fig. 15. Xilinx ISE was utilized for implementing the design on V5 and the Xilinx Vivado tools for the targeted V7 implementation [47], [48].

A. Physical Implementation Approach
Physical implementation of the proposed designs is done on the V7 device initially. As the design requires a large number of I/Os, the VIO core (as shown in Fig. 16), which features multiple input driving capability as well as virtual output readout through Joint Test Action Group (JTAG), is leveraged for debugging and verification of the physical design [49]. The output ports of the VIO drive the inputs of the implemented physical design, whereas the outputs of the physical design drive the inputs of the VIO. A universal on-chip clock drives both the physical design and the VIO core, minimizing the clock skew. The input message data provided as test vectors through the VIO core physically drive the crypto architecture present internal to the FPGA device. Simultaneously, the hash digest output can be read through the JTAG port. As an alternative to the VIO core, an integrated logic analyzer (ILA) can also be used to debug and verify designs with even more than 1000 bits. However, unlike VIO, its requirement of on-chip/off-chip RAM resources makes it a poor choice for a seamless debugging process.

The performances evaluated in the RTL functional verification were found to be satisfactory; however, postplacement and routing verification is required for validating the physical crypto-engine. Once the proposed designs are implemented on the target device with the assistance of the Vivado tool (as shown in Fig. 17), synchronization of “clk” between the VIO and the proposed designs is initiated. The 32 bits of input message are driven with the “in_ready,” “pad_num,” and “is_last” signals enabled. The proposed crypto engine completes the process and pulls the “f_out_ready” signal high. Furthermore, the final hash digest output bits that are successfully processed are driven out through the JTAG port onto the VIO’s graphical user interface. For comparative evaluation of the proposed designs, the V5 device was also considered as a supplementary target device. Identifying the implementation with the most effective resource utilization, which can lead to the choice of the best crypto engine for biometric authentication, is the sole motive of this comparative analysis. As expected, the V7 device with its 5- and 6-input LUTs yielded comparatively low area. Furthermore, performance metrics such as frequency and TP were also observed to be high.

B. Comparative Evaluation of Designs on V5 and V7
The proposed architectures of Keccak were implemented on V5 (XC5VLX220t-2fft1738) and V7 (XC7VX485t-2ffg1761) Xilinx boards. TP and efficiency were considered as the primary design target parameters for performance evaluations, as evident from (1) and (2). A few design specifications, such as an optimal hash digest output size for better security, eliminating the necessity of embedded resources for a low-cost design, and opting for a minimal clock cycle count for increased speed, were also considered. The choice of 512 bits as the hash output digest resulted in a block size “r” of 576 bits. Also, utilization of DSP and BRAM was deliberately eliminated at the abstraction level, providing a comparatively low-cost design. A total of 24 clock counts for the base and dual-pipeline designs and 12 for Cascade-P and Dual-f yielded better TP. The postplacement and routing validation results have been provided in Tables VII and VIII for quick comparison. The efficiency of the base design on V7 is observed to be higher


TABLE VII
IMPLEMENTATION RESULTS OF BASE ARCHITECTURE

Fig. 18. Area performances of base design on various FPGAs.

TABLE VIII
IMPLEMENTATION RESULTS OF FUSION DESIGNS

Fig. 19. TP versus E_f of base designs on various FPGAs.

with 9.9 Mb/s/slices, when compared with the implemented design on V5 (which is low with 8.69 Mb/s/slices). Moreover, the Dual-f design has yielded better efficiency with 15.11 Mb/s/slices on V7, whereas it is marginally low with 12.85 Mb/s/slices on V5. The Dual-f design, with a satisfactory overall performance, will be the best choice for biometric authentication applications.

VI. INFERENCES
Graphical comparative analysis has been carried out to visualize performance metrics such as area, TP, and frequency of the proposed and earlier designs. These parameters influence the efficiency directly or indirectly. The performance evaluation of area for the proposed base design against other work on different FPGA boards is plotted in Fig. 18. On the V5, the design in [19] is found to occupy the highest area of 2573 slices, whereas the lowest one is the proposed “base” design with around 872 slices. Though [21] occupies marginally lower area than the base design, it is the utilization of six-input LUTs on the advanced V6 device that caters to the effective area reduction. The design focus on improving frequency on V7 implementations rendered comparatively moderate area utilization as an added advantage.

Fig. 20. Area performances on various fusion techniques.

TP, being a crucial factor in determining the efficiency, has been explored extensively in earlier work. A few reported works [15], [18], [20] have managed to improve the TP by increasing the bit rate “r” from 576 to 1024 bits. However, the capacity compromised from 1024 to 576 bits has impinged on the security level. In contrast, the capacity of 1024 bits in the proposed “base” design yielded better security while keeping the bit rate as low as 576 bits. The padder area in the reported designs has been invariably increased for handling the larger bit rate, leading to reduced efficiency, whereas the proposed “base” design yielded a comparatively high efficiency of 9.9 Mb/s/slices on V7 and 8.69 Mb/s/slices on V5 boards due to its low padder area, as shown in Fig. 19.

Performance comparison in terms of area and TP versus efficiency of the proposed fusion architectures is shown in Figs. 20 and 21, respectively. Comparing the area among the fusion designs, it is notable that the Dual-f outperformed all other designs on V5. Furthermore, a low area of around 1252 slices was achieved by Dual-p on V7. The Integrated-f, which has only four step operations (θ1,2, θ3, γ, ι), has led to the


effective utilization of 5/6-input LUT resources by eliminating the additional pipeline stage. Furthermore, performance analysis in terms of TP revealed that the Cascade-P design achieved the highest TP of 24.43 Gb/s on V7 when compared with any other existing design, whereas Dual-f yielded the highest TP on V5.

Fig. 21. TP versus E_f of fusion designs on various FPGAs.

The architecture in [21] exploited the fusion of unrolling, pipelining, and subpipelining techniques to raise the efficiency to 11.47 Mb/s/slices on V6. However, the proposed Dual-f design achieved the highest efficiency of 12.85 Mb/s/slices even with V5 as the target device. Furthermore, implementation of the same architecture on V7 rendered an efficiency of 15.11 Mb/s/slices. Though the area and TP performances are only marginally high, the efficiency observed is significantly high. Analysis was carried out to determine the cause of the efficiency improvement. Referring to (5), it might be inferred that the fusion technique, which tends to provide frequency improvement alone, has inherently improved the efficiency parameter. However, from Fig. 22, it can be observed that though a higher frequency of around 400 MHz is achieved by the Dual-p architecture, its efficiency is found to be lower than that of Dual-f. Area versus efficiency investigations as per Table VIII revealed that a moderately low area combined with the increased frequency due to fusion of the functional unit yielded high efficiency. Therefore, it can be concluded that both frequency improvement and effective area utilization in the Dual-f architecture resulted in efficiency enhancement; hence, (5) can also be expressed as

\[
E_f \propto f \propto \frac{1}{A}. \tag{9}
\]

Fig. 22. Efficiency versus frequency analysis on V5.

VII. CONCLUSION
The comparative analysis from Figs. 18 to 22 shows that the fusion of unrolling and pipelining architecture (Dual-f) has satisfactory performance in terms of efficiency. A high efficiency of 15.11 Mb/s/slices on V7 and a relatively better efficiency of 12.85 Mb/s/slices on V5 (when compared with [21]) proved that the Dual-f performance surpassed all reported designs. Therefore, the Dual-f design can be considered the best choice for the target biometric access control application. The choice of Dual-f is determined not only by its high efficiency but also by its greater security strength, as well as its impressive features such as compactness (low area utilization) and higher operating frequency. Debugging and verification carried out through the VIO core on the target FPGA device validated, and hence assisted in, the right architectural choice for the desired application. As future work, ASIC implementation of the preferred Dual-f SHA-3 architecture in an advanced CMOS technology will be carried out. Such a crypto co-processor design can cater for securing portable biometric authentication systems.

ACKNOWLEDGMENT
Authors acknowledge the facilities and support extended by VIT Chennai, Chennai, India, in carrying out the research work successfully. The suggestions indicated by the anonymous reviewers that improved the quality of this article are duly recognized and the authors express sincere thanks for their productive comments.

REFERENCES
[1] S. Thavalengal, P. Bigioi, and P. Corcoran, “Iris authentication in handheld devices–Considerations for constraint-free acquisition,” IEEE Trans. Consum. Electron., vol. 61, no. 2, pp. 245–253, May 2015.
[2] V. Talreja, M. C. Valenti, and N. M. Nasrabadi, “Deep hashing for secure multimodal biometrics,” IEEE Trans. Inf. Forensics Security, vol. 16, pp. 1306–1321, 2021.
[3] C. Li, J. Hu, J. Pieprzyk, and W. Susilo, “A new biocryptosystem-oriented security analysis framework and implementation of multibiometric cryptosystems based on decision level fusion,” IEEE Trans. Inf. Forensics Security, vol. 10, no. 6, pp. 1193–1206, Jun. 2015.
[4] M. Hammad, Y. Liu, and K. Wang, “Multimodal biometric authentication systems using convolution neural network based on different level fusion of ECG and fingerprint,” IEEE Access, vol. 7, pp. 26527–26542, 2018.
[5] B. Topcu, C. Karabat, M. Azadmanesh, and H. Erdogan, “Practical security and privacy attacks against biometric hashing using sparse recovery,” EURASIP J. Adv. Signal Process., vol. 2016, no. 1, pp. 1–20, Dec. 2016.
[6] N. D. Hammod, “Biometric authentication based on hash Iris features,” Int. J. Biometrics Bioinf. (IJBB), vol. 13, no. 1, pp. 1–11, 2020.
[7] M. M. Sravani and S. Ananiah Durai, “Attacks on cryptosystems implemented via VLSI: A review,” J. Inf. Secur. Appl., vol. 60, Aug. 2021, Art. no. 102861.
[8] M. Stevens, E. Bursztein, P. Karpman, A. Albertini, and Y. Markov, “The first collision for full SHA-1,” in Advances in Cryptology–(CRYPTO) (Lecture Notes in Computer Science), vol. 10401, Mountain View, CA, USA: Springer, 2017, pp. 570–596.
[9] T. Mladenov and S. Nooshabadi, “Implementation of reconfigurable SHA-2 hardware core,” in Proc. IEEE Asia Pacific Conf. Circuits Syst. (APCCAS), Macao, China, Nov. 2008, pp. 1802–1805.
[10] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, “Keccak,” in Proc. Annu. Int. Conf. Theory Appl. Cryptograph. Techn. Berlin, Germany: Springer, 2013, pp. 313–314.
[11] M. Rao, T. Newe, I. Grout, and A. Mathur, “High speed implementation of a SHA-3 core on Virtex-5 and Virtex-6 FPGAs,” J. Circuits, Syst. Comput., vol. 25, no. 7, 2016, Art. no. 1650069.
[12] S. E. Moumni, M. Fettach, and A. Tragha, “High frequency implementation of cryptographic hash function Keccak-512 on FPGA devices,” Int. J. Inf. Comput. Secur., vol. 10, no. 4, pp. 361–373, 2018.
[13] M. J. Dworkin, SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions, FIPS, Standard 202, 2015.
[14] K. Gaj, E. Homsirikamol, and M. Rogawski, “Fair and comprehensive [36] P. Luo, “Side-channel security analysis and protection of SHA-3,” Ph.D.
methodology for comparing hardware performance of fourteen round dissertation, Dept. Comput. Eng., Northeastern Univ., Boston, MA,
two SHA-3 candidates using FPGAs,” in Cryptographic Hardware USA, 2017.
and Embedded Systems (CHES) (Lecture Notes in Computer Science), [37] P. Luo, Y. Fei, L. Zhang, and A. A. Ding, “Differential fault analysis
vol. 6225, S. Mangard and F. X. Standaert, Eds. Berlin, Germany: of SHA-3 under relaxed fault models,” J. Hardw. Syst. Secur., vol. 1,
Springer, 2010, pp. 264–278. no. 2, pp. 156–172, Jun. 2017.
[15] S. Kerckhof, F. Durvaux, N. Veyrat-Charvillon, F. Regazzoni, G. M. de [38] P. Luo, K. Athanasiou, Y. Fei, and T. Wahl, “Algebraic fault analysis of
Dormale, and F. X. Standaert, “Compact FPGA implementations of the SHA-3 under relaxed fault models,” IEEE Trans. Inf. Forensics Security,
five SHA-3 finalists,” in Proc. Int. Conf. Smart Card Res. Adv. Appl. vol. 13, no. 7, pp. 1752–1761, Jul. 2018.
Berlin, Germany: Springer, 2011, pp. 217–233. [39] B. Baldwin et al., “A hardware wrapper for the SHA-3 hash algorithms,”
[16] M. Knezevic et al., “Fair and consistent hardware evaluation of fourteen in Proc. IET Irish Signals Syst. Conf. (ISSC), Cork, Ireland, 2010,
round two SHA-3 candidates,” IEEE Trans. Very Large Scale Integr. pp. 1–6.
(VLSI) Syst., vol. 20, no. 5, pp. 827–840, May 2012. [40] P. S. Z. Chen and S. Morozov, “A hardware interface for hashing
[17] K. Latif, A. Aziz, and A. Mahboob, “Look-up table based implemen- algorithms,” Cryptol. e-Print Arch., Lyon, France, Tech. Rep. 2008/529,
tations of SHA-3 finalists: JH, Keccak and Skein,” KSII Trans. Internet 2008. [Online]. Available: http://eprint.iacr.org/
Inf. Syst., vol. 6, no. 9, pp. 2388–2404, 2012. [41] V. Conti, C. Militello, F. Sorbello, and S. Vitabile, “A multimodal
[18] Y. Jararweh, L. Tawalbeh, H. Tawalbeh, and A. Moh’d, “Hardware technique for an embedded fingerprint recognizer in mobile payment
performance evaluation of SHA-3 candidate algorithms,” J. Inf. Secur., systems,” Mobile Inf. Syst., vol. 5, no. 2, pp. 105–124, 2009.
vol. 3, no. 2, pp. 69–76, 2012. [42] A. Alzahrani and F. Gebali, “Multi-core dataflow design and implemen-
[19] G. Provelengios, P. Kitsos, N. Sklavos, and C. Koulamas, “FPGA-based tation of secure hash algorithm-3,” IEEE Access, vol. 6, pp. 6092–6102,
design approaches of Keccak hash function,” in Proc. 15th Euromicro 2018.
Conf. Digit. Syst. Design, Izmir, Turkey, Sep. 2012, pp. 5–8. [43] A. Ashok, P. Poornachandran, and K. Achuthan, “Secure authentication
[20] R. Paul and S. Shukla, “Partitioned security processor architecture on in multimodal biometric systems using cryptographic hash functions,”
FPGA platform,” IET Comput. Digit. Techn., vol. 12, no. 5, pp. 216–226, in Proc. Int. Conf. Secur. Comput. Netw. Distrib. Syst. Berlin, Germany:
Sep. 2018. Springer, 2012, pp. 168–177.
[21] M. M. Wong, J. Haj-Yahya, S. Sau, and A. Chattopadhyay, “A new high [44] D. Jagadiswary and D. Saraswady, “Biometric authentication using
throughput and area efficient SHA-3 implementation,” in Proc. IEEE Int. fused multimodal biometric,” Proc. Comput. Sci., vol. 85, pp. 109–116,
Symp. Circuits Syst. (ISCAS), Florence, Italy, May 2018, pp. 1–5. Jan. 2016.
[22] A. Arshad, D.-E.-S. Kundi, and A. Aziz, “Compact implementation [45] A. Muthukumar and S. Kannan, “AES based multimodal biometric
of SHA3-512 on FPGA,” in Proc. Conf. Inf. Assurance Cyber Secur. authentication using cryptographic level fusion with fingerprint and fin-
(CIACS), Jun. 2014, pp. 29–33. ger knuckle print,” Int. Arab J. Inf. Technol., vol. 12, no. 5, pp. 431–440,
[23] T. Honda, H. Guntur, and A. Satoh, “FPGA implementation of new 2015.
standard hash function Keccak,” in Proc. IEEE 3rd Global Conf. [46] K. Vasavi, R. University, Y. Latha, and M. Reddy Engineering College
Consum. Electron. (GCCE), Oct. 2014, pp. 275–279. for Women, “RSA cryptography based multi-modal biometric identifi-
cation system for high-security application,” Int. J. Intell. Eng. Syst.,
[24] M. Sundal and R. Chaves, “Efficient FPGA implementation of the
vol. 12, no. 1, pp. 10–21, Feb. 2019.
SHA-3 hash function,” in Proc. IEEE Comput. Soc. Annu. Symp. VLSI
[47] Xilinx. (2012). Virtex-5 User Guide V5.4. [Online]. Available:
(ISVLSI), Jul. 2017, pp. 86–91.
https://www.xilinx.com/support/documentation/user_guides/ug190.pdf
[25] D.-E.-S. Kundi and A. Aziz, “A low-power SHA-3 designs using
[48] Xilinx. (2019). VC707 Evaluation Board for the Virtex-7 FPGA.
embedded digital signal processing slice on FPGA,” Comput. Electr.
[Online]. Available: https://www.xilinx.com/support/documentation/
Eng., vol. 55, pp. 138–152, Oct. 2016.
boards_and_kits/vc707/ug885_VC707_Eval_Bd.pdf
[26] A. Gholipour and S. Mirzakuchaki, “High-speed implementation of the [49] Xilinx. (2018). PG159—Virtual Input/Output V3.0 Product Guide v3.0.
Keccak hash function on FPGA,” Int. J. Adv. Comput. Sci., vol. 2, no. 8, [Online]. Available: https://www.xilinx.com/support/documentation/
pp. 303–307, 2012. ip_documentation/vio/v3_0/pg159-vio.pdf
[27] S. El Moumni, M. Fettach, and A. Tragha, “High throughput imple-
mentation of SHA3 hash algorithm on field programmable gate array
(FPGA),” Microelectron. J., vol. 93, Nov. 2019, Art. no. 104615.
[28] J. Guo, G. Liao, G. Liu, M. Liu, K. Qiao, and L. Song, “Practical
collision attacks against round-reduced SHA-3,” J. Cryptol., vol. 33,
M. M. Sravani received the bachelor’s and master’s degrees from JNTU Anantapur University, Anantapur, India, in 2012 and 2015, respectively. She is currently working toward the Ph.D. degree at the Vellore Institute of Technology, Chennai, India, under the supervision of Dr. S. Ananiah Durai.
Her research interests are cryptographic hardware implementation, reconfigurable computing, and digital logic design.

S. Ananiah Durai received the Ph.D. degree in integrated circuit design from Massey University, Auckland, New Zealand, in 2015.
He is currently an Associate Professor with the Center for Nanoelectronics and VLSI Design, School of Electronics Engineering, Vellore Institute of Technology, Chennai, India. He has authored over 14 research articles published in various journals. His research interests are analog CMOS IC design, microsensor system design with CMOS-MEMS, hardware security, and on-chip signal conditioning circuit design.
