Professional Documents
Culture Documents
High-Performance and Area-Efficient VLSI Archit
High-Performance and Area-Efficient VLSI Archit
This research work has been carried out under the SMDP-C2SD
project, sponsored by MeitY, Govt. of India.
393
with 33 clock latency, a throughput of 221.63 Mbps has been PRESENT block cipher a 64-bit datapath is chosen, mainly to
obtained. Similar to this a 64-bit datapath based architecture implement the permutation operation efficiently.
has been synthesized on 87 slices of the Xilinx Virtex-5 In the proposed architecture there are three main
XC5VLX50 FPGA device [17]. Here a latency of 47 clocks, components which are encryption/decryption engine, key
maximum frequency of 221.64 MHz and a throughput of scheduling unit and controller. There is a 1-bit input signal
341.64 Mbps have been reported. ‘enc_dec’ which is used to select the encryption or decryption
As evident, most of the authors have provided an operations. If ‘enc_dec’ is at logic ‘1’ level then encryption
architecture for encryption-only operation and they have operation is performed, else the decryption operation is
assumed that the decryption operation works opposite to that executed. An up-down counter facilitates the integrated
of the encryption operation and hence would require roughly encryption/decryption operation. The main building blocks of
same logic complexities and hardware resources. However, in the proposed architecture have been arranged in different
our opinion, the decryption operation is a bit complex in subsections which are described below.
comparison to the encryption operation. This is due to the fact
that to start the decryption operation, first, we need the last A. Datapath of the Integrated PRESENT Architecture
round key that is generated from the key scheduling operation. As shown in the Fig. 3, the architecture consists of a set of
In connection to this, one of the implementations in which registers, multiplexers and XOR gates. The bit permutation (1)
both encryption and decryption operations have been tackled and inverse bit permutation operations (2) are simple bit-
[11], assumes that last round key is available at the starting of transposition, which require only simple wirings. There is one
the decryption operation. They have considered that the key is 64-bit multiplexer which is required to switch the state
static in nature throughout all of the encryption and decryption between the load phase (Input) and the intermediate state.
operations for all the input blocks. In another implementation Next, the multiplexer passes the state to a 64-bit state register.
by [17], for PRESENT cipher context, it has been given that This register is used to store the intermediate state and passes
the decryption operation requires almost same area as of it to the 64-bit XOR gate. This gate performs the XOR
encryption when implemented separately. In the following operation of intermediate state coming from the state register
section, an integrated encryption and decryption architecture with 64-bit round key coming from the key scheduling unit.
is proposed and described in detail. In the architecture, both the S-box (S) and inverse S-box (S-1)
are realized by the area-optimized combinational logic
IV. AN INTEGRATED VLSI ARCHITECTURE FOR THE implementation. To differentiate between the encryption and
PRESENT LIGHTWEIGHT BLOCK CIPHER decryption operations, two 64-bit multiplexers are deployed in
The proposed architecture for an integrated encryption and the datapath. In the proposed architecture, the inputs and
decryption of PRESENT block cipher is shown in Fig. 3. outputs are registered. The output register is added to
Here, an iterative type of architecture is considered for saving synchronize the output with the last round.
the resources and computation time. To implement the
0
64 64
64
P − layer 1 User key
64 Input
64 enc_dec 80 |128
64 64
4 4
−1 1 0 state_gen
S/S 1 S / S16−1
64 mux_sel
Schedulling
(BRAM)
4 4 key_gen
Key
Register
64 reset
State
enc_dec 1 0 clk
up_down
64 64
80 |128
80 |128
P −1 − layer
round
64
64
64 64 ⊕ 64
Counter
Output 64
Output
reset en clk
mux_sel Register
up_down
en
reset
clk
out_ready
clk state_gen
Controller up_down
enc_dec
out_ready
round key_gen
Fig. 3. A VLSI architecture of the PRESENT block cipher for an integrated encryption/decryption operation.
394
As per Fig. 3, a total of 33 clock cycles are required for the port of the BRAM. The intermediate key is read out from the
encryption operation to get the ciphertext. Out of 33 clock data_out port of the BRAM and first leftmost 64 bits of
cycles, the first cycle is used to load the user key and plaintext. intermediate key i.e., round key is XORed with the
In next 31 clock cycles, 31 intermediate states are computed intermediate state of that particular round. There are three
for each of the rounds and finally in the last clock cycle output control signals, ‘en’, ‘key_gen’ and ‘mux_sel’ generated by
is available in the output register. Whereas, in the decryption the controller for monitoring the key scheduling process,
operation, total 64 clock cycles are required to obtain the first which are shown in Fig. 4. Alongwith the BRAM, one 80-bit
block of decrypted output. Out of these 64 clock cycles, the or 128-bit multiplexer is also used. The ‘mux_sel’ signal is
31 clock cycles are required to compute the last round key and used to switch between the load phase (user key) and
33 clock cycles are required to perform the decryption intermediate keys. The ‘key_gen’ signal is given to the
operation. Here, the computed keys have been simultaneously write_enable port of the BRAM and it enables the BRAM for
stored in a block RAM (BRAM) so that there is no need to updating its content. The controller that controls the datapath
compute the last round key for other blocks of input. Thus, and the key generation unit is described below.
only 33 clock cycles are required to decrypt the remaining
blocks of ciphertext. The advantage of using the integrated C. Controller for the Encryption/Decryption Operations
architecture is that there are some resources which can be used The controller for the integrated encryption/decryption
in both encryption and decryption operations. The detailed operation of the PRESENT cipher is given in Fig. 5. To make
description of the key scheduling operation is detailed below. the proposed architecture (as in Fig. 3), capable of performing
the encryption and decryption operations for the multiple
B. Architecture for the Key Scheduling Operation blocks with the same user key, an appropriate controlling
Key scheduling unit works in the storage mode. Here, mechanism is required. The controller, thus, is designed to
computation of the round keys is performed only for the first provide various control signals for monitoring the key
block of data and the computed round keys are stored generation unit and for controlling the encryption/decryption
simultaneously in the BRAM. This computation mode offers engine. There are total nine states in the controller, which are
a reduced number of clock cycles for the decryption operation. used for performing the encryption and decryption operations.
The key storage mode is also beneficial for processing a large Out of these nine states, two states i.e. Start (S0) and the
chunk of data which contains multiple blocks that has to be Reg_out (SOUT) are common for both the encryption and
encrypted or decrypted with the same key. A detailed decryption. In the first state, S0, the counter is enabled through
description of the round key computation for both 80-bit and ‘en’ signal and works as an up counter by setting the
128-bit key lengths is given in [9]. In Fig. 3, the key ‘up_down’ signal to logic ‘1’. In this state, input and the user
scheduling unit is shown in the datapath of the integrated keys are loaded and the key scheduling unit starts computation
encryption/decryption architecture. A detailed architecture of through ‘key_gen’ signal which is at logic ‘1’. In the next
the same is given in Fig. 4, for both key-lengths of 80-bit and clock cycle, the state switches to either S1E state or S1D state for
128-bit. enc_dec=‘1’ or ‘0’, respectively.
round
5
62
For controlling the encryption operation given in Fig. 5, in
5
15 ⊕
the state S1E, multiplexers are switched as ‘mux_sel’ signal for
5 5 the key scheduling unit gets logic ‘1’ and the signal
⊕ 53
5 5
56
user key 4
S
4
‘state_gen’ for the datapath gets high so the encryption of the
user key
4 S 4 4 S 4 first block of plaintext begins in this state. Simultaneously, all
80 80 128 128 the generated round keys are stored in the BRAM. The state
remains in S1E, until the counter value reaches 31. Then, the
mux_sel
mux_sel
0 1 0 1
80 128 state switches to SOUT state, where the counter is disabled
80 data_in data_in 128 through en=‘0’ and ‘out_ready’ signal becomes logic ‘1’. In
address address
<< 61 << 61 this state, the key generation unit finishes its computation
BRAM
BRAM
enable en enable
395
en=‘1’
reset=‘1’ up_down=‘1’
enc_dec=‘1’ key_gen=‘1’ enc_dec=‘0’ round<30
Start
(S0)
mux_sel=‘1’
state_gen=‘0’ Compute round
mux_sel=‘1’ key_gen=‘1’ keys (S1D)
Encrypt 1st
plaintext (S1E) state_gen=‘1’
key_gen=‘1’
enc_dec=‘0’ round=30
enc_dec=‘1’
Reg_out
round<31 (SOUT) Load other
ciphertexts (S4D)
en=‘1’ out_ready=‘1’
Load other up_down=‘1’ up_down=‘0’
up_down=‘0’
plaintexts (S2E) key_gen=‘0’
key_gen=‘1’
key_gen=‘0’
round=0
Load 1st
round>0 ciphertext (S2D)
state_gen=‘1’
key_gen=‘0’
mux_sel=‘1’
Encrypt
state_gen=‘1’ Decrypt
key_gen=‘0’
(S3E)
(S3D)
396
than storing the computed keys in the BRAM. An around 16 mW for both the key lengths. The proposed
architectural-level comparison between the proposed design architecture is area-efficient with high-performance capability
and the design of [11] is given below. for providing an adequate level of security under the resource-
constrained environment for IoT and CPS applications.
A. Architectural-level Comparison
The architecture presented in [11] is one of a few REFERENCES
established ones that provides decryption operation for the
FPGA. This architecture has been implementation on the [1] E. A. Lee and S. A. Seshia, Introduction to Embedded Systems, A
Xilinx Spartan-IIIXC3S400 FPGA device. Thus, to perform a Cyber-Physical Systems Approach, 2011.
fair comparison of utilized device resources, we have targeted [2] T. Xu, J. B. Wendt and M. Potkonjak, “Security of IoT systems:
the same FPGA device and equal speed grade. Similar to [11], Design challenges and opportunities,” in IEEE/ACM Int'l Conf. on
Comp.-Aided Design, San Jose, Califo., pp. 417-423, 03 Nov. 2014.
we also kept the synthesis tool project design goals and
strategies with synthesis and place and route (PnR) effort [3] A. Biryukov and L. Perrin, “Lightweight Block Ciphers,” [Online]:
https://www.cryptolux.org/index.php/Lightweight_Block_Ciphers.
properties as high and the PnR extra effort at continue on [Accessed 06 Jan. 2017].
impossible. The implementation has been performed for both
[4] K. McKay, L. E. Bassham, M. S. Turan and N. W. Mouha, “NISTIR
80-bit key length (PRE_80) and 128-bit key length 8114 - Report on Lightweight Cryptography,” National Institute of
(PRE_128). The synthesis results for both the architectures are Standards and Technology (NIST), Gaithersburg, March 2017.
compared and shown in Fig. 6. [5] B. J. Mohd, T. Hayajneh and A. V. Vasilakos, “A survey on
lightweight block ciphers for low-resource devices: Comparative
study and open issues,” Jour. of Network and Computer Appl., vol.
Architectural-level Comparison 58, pp. 73-93, 2015.
[6] T. Eisenbarth, S. Kumar, C. Paar, A. Poschmann and L. Uhsadel, “A
Resource Consumption
575
382
373
366
328
326
306
299
268
253
221
217
200
202
202
197
176
397