
Design and Implementation of HDFS Data Encryption Scheme using ARIA Algorithm on Hadoop
Youngho Song, Young-Sung Shin, Miyoung Jang, Jae-Woo Chang*
Dept. of Computer Engineering
Chonbuk National University, Republic of Korea
{songyoungho, twotoma, brilliant, jwchang}@jbnu.ac.kr
*Corresponding author

978-1-5090-3015-6/17/$31.00 ©2017 IEEE 84 BigComp 2017

Abstract—Hadoop was developed as a distributed data processing platform for analyzing big data. Enterprises can use Hadoop to analyze big data containing users’ sensitive information and use the results for marketing. Therefore, data encryption has been widely studied to prevent the leakage of sensitive data stored in Hadoop. However, existing studies support only AES, the international standard data encryption algorithm. Meanwhile, the Korean government selected the ARIA algorithm as a standard data encryption scheme for domestic use. In this paper, we propose an HDFS data encryption scheme that supports both the ARIA and AES algorithms on Hadoop. First, the proposed scheme provides an HDFS block-splitting component that performs ARIA/AES encryption and decryption in the Hadoop distributed computing environment. Second, the proposed scheme provides a variable-length data processing component that performs encryption and decryption by adding dummy data when the last data block does not contain a full 128 bits of data. Finally, we show through performance analysis that the proposed scheme is efficient for various applications, such as word counting, sorting, k-Means, and hierarchical clustering.

Keywords— Hadoop Security, Data Encryption, HDFS Data Encryption, ARIA Encryption Algorithm, Hadoop Encryption Codec

I. INTRODUCTION

Recently, due to the prevalence of smartphones, the amount of data generated by Social Networking Services (SNS) has been increasing rapidly. These data, which include multimedia content in various forms, have grown to big data scale. Meanwhile, the MapReduce framework supports parallel data processing based on map and reduce functions in a distributed big data computing environment. Hadoop [1] is an open-source implementation of the MapReduce framework. Because Hadoop is used in major projects at Facebook and Amazon, it is considered a standard distributed parallel processing system for dealing with big data. Currently, companies analyze customer information and location data collected in various fields and use them for marketing. As a result, sensitive personal data can be disclosed while customers’ data are analyzed. To solve this problem, the security of Hadoop has been actively studied [3-6]. First, Kerberos [3, 4] protects data from disclosure through network attacks. Hadoop can block access by malicious attackers by using user authentication through Kerberos. Second, data encryption using a compression codec has been studied to prevent data disclosure in HDFS (Hadoop Distributed File System). Liu Yi [5] proposed a SplittableCompressionCodec to solve the problem that the existing compression codec cannot encrypt distributed data because it uses only a single map for input data compression. The SplittableCompressionCodec can compress and decompress HDFS data consisting of a set of blocks in parallel, because multiple maps are allowed to process HDFS blocks. In addition, S. Park [6] proposed a scheme that supports AES encryption of HDFS data by using the SplittableCompressionCodec and the Java cipher library. However, both studies are limited in that they support only the AES encryption algorithm.

Meanwhile, the ARIA algorithm [9, 10] is widely used in many public institutions because the Korean government selected it as a standard data encryption scheme for domestic use. The ARIA algorithm encrypts data in 128-bit blocks, like the AES algorithm. In this paper, we propose an HDFS data encryption codec scheme using both the ARIA and AES encryption algorithms for Hadoop. First, the proposed encryption scheme provides an HDFS block-splitting component that performs both ARIA encryption and decryption in a distributed computing environment by dividing data into HDFS blocks. Second, we provide variable-length data processing that performs the encryption and decryption of variable-sized data by adding dummy data. Finally, we provide a codec selection component that selectively uses the existing AES codec and the proposed ARIA codec.

The structure of this paper is as follows. In Section 2, we introduce the existing HDFS encryption schemes based on the AES algorithm. In Section 3, we present the encryption scheme for HDFS data using the ARIA algorithm. In Section 4, we compare the performance of the proposed ARIA-based encryption scheme with the existing AES-based encryption scheme. Finally, we conclude the paper with future work in Section 5.

II. RELATED WORK

A. Encryption Algorithm
The Advanced Encryption Standard (AES) algorithm [7, 8] is a symmetric-key encryption algorithm that was announced by the
US National Institute of Standards and Technology (NIST). The AES encryption algorithm is widely used in various data-related fields as a US federal information processing standard. Meanwhile, the ARIA algorithm [9, 10] is a block cipher algorithm designed in 2003 by a large group of researchers from government agencies, academia, and research institutes in South Korea. In 2004, the Korean Agency for Technology and Standards selected it as a standard cryptographic technique. The ARIA algorithm processes data in 128-bit blocks and uses keys of 128, 192, or 256 bits. It relies on simple bit operations such as XOR and protects data by executing 12, 14, or 16 encryption rounds, depending on the key size. Because the ARIA algorithm encrypts data in 128-bit blocks, like the AES algorithm, it can substitute for the AES algorithm. Table 1 compares the AES algorithm and the ARIA algorithm.

TABLE I. COMPARISON OF THE AES AND ARIA

| | AES | ARIA |
|---|---|---|
| Period | 1999 | 2004 |
| Main Agent | National Institute of Standards and Technology | National Security Research Institute |
| Target | National service, Private service | Administrative service |
| Characteristic | Standard of domestic and international encryption | AES alternative domestic technology |

B. HDFS data encryption methods based on AES encryption algorithm
Liu Yi et al. [5] proposed a SplittableCompressionCodec to perform the existing encryption in a distributed environment. The existing compression codec suffers from an efficiency problem because it performs encryption with a single map function. The SplittableCompressionCodec supports parallel compression and decompression of HDFS data that can be divided into several blocks by assigning each HDFS block to a map function. Because Hadoop supports the SplittableCompressionCodec in versions 1.x and 2.x, there are no compatibility issues. The user only needs to specify the CompressCodec class name in the configuration file to make use of it.

Fig. 1. Encryption using Hadoop compression function

S. Y. Park et al. [6] proposed a scheme that supports the AES encryption of HDFS data using the Java cipher library. Because the cipher class is tied to the encryption library, the cipher can inherit encryption algorithms such as AES, DES, and RSA (Fig. 1). By overriding the cipher class in the SplittableCompressionCodec, the scheme performs both encryption and decryption based on the selected encryption algorithm. The limitations of the existing encryption schemes are as follows. First, the existing AES encryption schemes support the encryption of data only with a fixed 128-bit block size. As a result, they cannot encrypt a variable-length data block. Second, the existing encryption schemes do not support the ARIA encryption algorithm, which is a standard encryption algorithm in South Korea.

III. DESIGN OF PROPOSED ARIA ENCRYPTION SCHEME

A. System Architecture
Though the ARIA encryption algorithm is designated as a standard encryption algorithm by South Korea, to the best of our knowledge, there is no research that uses the ARIA data encryption algorithm on Hadoop. In this paper, we design an HDFS data encryption scheme using both the ARIA and AES encryption algorithms. Figure 2 shows the overall system architecture of the proposed HDFS data encryption scheme. The data stored in the master's local file system are encrypted as follows. i) The user specifies the preferred encryption algorithm between the ARIA and the AES algorithms. ii) The data encryption codec encrypts data in HDFS based on the selected encryption algorithm. iii) The encrypted data are distributed to data nodes in HDFS blocks of 64MB each. Data decryption is done in the reverse order of data encryption.

Fig. 2. Structure for encrypting HDFS data

To process the encrypted data in MapReduce, it is required not only to decrypt data before computation, but also to encrypt the result before storing it to HDFS. The overall processing procedure of the encrypted data in MapReduce is described in Figure 3. i) The input data of the Map function is decrypted before the Map function is processed. ii) When the Map function is completed, the output data is encrypted and stored back into HDFS. iii) The Reduce function is run after decrypting the intermediate result of the Map function in

HDFS. iv) When the Reduce function is completed, the final result is encrypted and stored into HDFS. Depending on the sensitivity of the intermediate result of the Map function, steps ii) and iii) can be omitted.

Fig. 3. Processing of Hadoop MapReduce on encrypted HDFS
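
The four steps above can be illustrated with a minimal Python sketch of a word-count job over encrypted splits. It is not the authors' implementation: `toy_cipher`, `map_phase`, `reduce_phase`, `run_job`, and `KEY` are hypothetical names, and the repeating-key XOR is only a placeholder standing in for ARIA or AES (it is not secure).

```python
from collections import Counter
from itertools import cycle

KEY = b"demo-key-16bytes"  # hypothetical key, for illustration only


def toy_cipher(data, key=KEY):
    """Placeholder for ARIA/AES: XOR with a repeating key (NOT secure).

    XOR is its own inverse, so the same function encrypts and decrypts.
    """
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def serialize(counts):
    # One "word<TAB>count" line per word, sorted for a stable output.
    return "\n".join(f"{w}\t{c}" for w, c in sorted(counts.items()))


def map_phase(encrypted_split):
    # i) Decrypt the input split before the Map function runs.
    text = toy_cipher(encrypted_split).decode("utf-8")
    counts = Counter(text.split())
    # ii) Encrypt the intermediate result before storing it.
    return toy_cipher(serialize(counts).encode("utf-8"))


def reduce_phase(encrypted_intermediates):
    total = Counter()
    for blob in encrypted_intermediates:
        # iii) Decrypt each intermediate result before the Reduce function runs.
        for line in toy_cipher(blob).decode("utf-8").splitlines():
            word, count = line.split("\t")
            total[word] += int(count)
    # iv) Encrypt the final result before storing it.
    return toy_cipher(serialize(total).encode("utf-8"))


def run_job(plain_splits):
    # Simulate data that is already stored encrypted, one blob per split.
    encrypted_splits = [toy_cipher(s.encode("utf-8")) for s in plain_splits]
    intermediates = [map_phase(s) for s in encrypted_splits]
    return reduce_phase(intermediates)
```

Omitting steps ii) and iii) for non-sensitive intermediate results would correspond to passing the serialized counts between `map_phase` and `reduce_phase` without calling `toy_cipher`.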

B. Encryption procedure of the proposed scheme in HDFS


The data encryption of the proposed scheme is performed as shown in Figure 4. i) The user selects the preferred encryption CODEC between the existing AES and the proposed ARIA, and calls the Hadoop encryption component to encrypt the input data. To convert the input data into HDFS blocks of 64MB each, the Hadoop encryption component calls the HDFS block splitting component. ii) Because the existing HDFS encryption scheme cannot process variable-length data, we newly design a variable-length data processing component that checks whether the last data sub-block is a full 128-bit unit. iii) The variable-length data processing component is called by the HDFS block splitting component. If the data stored in the last sub-block is not a full 128-bit unit, the variable-length data processing component adds dummy data to make the last sub-block a 128-bit unit. iv) The encryption component encrypts each data sub-block by using the selected encryption CODEC (ARIA or AES). Finally, the encrypted sub-blocks are merged into the encrypted HDFS block, which is stored into the corresponding data node by the HDFS block splitting component.

Fig. 4. Procedure of data encryption
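
As a worked example of the sizes involved in this procedure, the following Python sketch computes how many HDFS blocks, 16-byte sub-blocks, and dummy bytes a file of a given size produces. The function name `encryption_plan` is hypothetical, and the sketch assumes that "64MB" means 64 × 1024 × 1024 bytes.

```python
HDFS_BLOCK_SIZE = 64 * 1024 * 1024  # assumption: "64MB" means 64 MiB
SUB_BLOCK_SIZE = 16                 # 128-bit cipher block, in bytes


def encryption_plan(file_size):
    """Return how a file of file_size bytes is split and padded."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    # Number of HDFS blocks needed (ceiling division).
    hdfs_blocks = -(-file_size // HDFS_BLOCK_SIZE)
    # Dummy bytes needed to complete the last 16-byte sub-block.
    dummy_bytes = (-file_size) % SUB_BLOCK_SIZE
    # Total number of 16-byte sub-blocks handed to the cipher after padding.
    sub_blocks = (file_size + dummy_bytes) // SUB_BLOCK_SIZE
    return {
        "hdfs_blocks": hdfs_blocks,
        "sub_blocks": sub_blocks,
        "dummy_bytes": dummy_bytes,
    }
```

Under this assumption the HDFS block size is a multiple of 16 bytes, so every full HDFS block splits evenly into sub-blocks and only the final sub-block of the file can need dummy data. For example, a file of 100 MiB plus 5 bytes occupies 2 HDFS blocks and needs 11 dummy bytes.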
1) Hadoop encryption component: Through the Hadoop encryption component, users select the preferred encryption algorithm for their data. Figure 5 shows the overall architecture of the Hadoop encryption component. In the figure, the users select an encryption algorithm between the ARIA and AES algorithms and override the encryption CODEC through the algorithm selection module. The data block generated by the HDFS block splitting component is encrypted by the CODEC, and the encrypted data is returned to the users.

Fig. 5. Hadoop encryption component
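
The algorithm selection module can be pictured as a lookup from an algorithm name to a codec class. The Python sketch below is only an illustration of that idea: `ToyCodec`, `AriaCodec`, `AesCodec`, `CODECS`, and `select_codec` are hypothetical names, and both codec classes reuse a placeholder XOR transform instead of the real ARIA and AES ciphers.

```python
from itertools import cycle


class ToyCodec:
    """Hypothetical stand-in for an encryption codec (NOT real ARIA/AES)."""

    name = "TOY"

    def __init__(self, key):
        if not key:
            raise ValueError("key must not be empty")
        self.key = key

    def encrypt(self, block):
        # Placeholder transform: XOR with a repeating key (not secure).
        return bytes(b ^ k for b, k in zip(block, cycle(self.key)))

    def decrypt(self, block):
        # XOR is its own inverse, so decryption reuses the same transform.
        return self.encrypt(block)


class AriaCodec(ToyCodec):
    name = "ARIA"


class AesCodec(ToyCodec):
    name = "AES"


# Registry consulted by the selection step.
CODECS = {"ARIA": AriaCodec, "AES": AesCodec}


def select_codec(algorithm, key):
    """Return a codec instance for the algorithm chosen by the user."""
    try:
        codec_cls = CODECS[algorithm.upper()]
    except KeyError:
        raise ValueError(f"unsupported algorithm: {algorithm}") from None
    return codec_cls(key)
```

Keeping the choice in a single registry means the block splitting and variable-length processing steps can stay identical for both algorithms.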
2) HDFS block splitting component: Because existing ARIA implementations normally encrypt an entire database, they cannot encrypt data in HDFS, which is stored in a distributed manner. To solve this problem, we apply the SplittableCompressionCodec [5] to our encryption module in order to divide data into HDFS blocks of 64MB each and to encrypt the data by using the same encryption key.

Fig. 6. HDFS block splitting component

3) Variable-length data processing component: The AES and ARIA algorithms only support the encryption of fixed-length data blocks of 128 bits. As a result, the existing AES and ARIA algorithms on Hadoop cannot encrypt variable-length data. To solve this problem, we propose a variable-length data processing component that automatically checks the data

block size and adds dummy data if the data is not a multiple of 16 bytes. In addition, the proposed component can be applied to the existing AES encryption scheme. Figure 7 shows the processing of variable-length data.

Fig. 7. Encryption of variable-length data: (a) data shorter than 16 bytes; (b) data longer than 16 bytes

Algorithm 1 shows the variable-length data processing algorithm, which encrypts data in units of sub-blocks. The algorithm divides the input file into a set of 16-byte sub-blocks and encrypts each sub-block. First, the algorithm checks whether the sub-block to be encrypted is the last sub-block. If it is not, the algorithm reads a 16-byte sub-block and assigns it to the encryption sub-block. Second, if the sub-block is the last one, the algorithm checks whether it is shorter than 16 bytes. If it is shorter, the algorithm appends dummy data to the end of the last sub-block to make it 16 bytes long (lines 3~8), and the padded 16-byte sub-block is assigned to the encryption sub-block. Finally, the encrypted sub-blocks are merged to form the encrypted block.

Algorithm 1: Variable-length data processing algorithm
Input : Plain text file
Output : OutputStream that contains encrypted block
01: CurrentOff = 0
02: while CurrentOff < EOF do
03:   if CurrentOff + 16 >= EOF then
04:     CodecBlk = Read InputData to EOF
05:     CurrentOff = CurrentOff + Read Bytes
06:     if CodecBlk.size % 16 != 0 then
07:       Fill CodecBlk
08:     end if
09:   else
10:     CodecBlk = Read InputData 16 Bytes
11:     CurrentOff = CurrentOff + 16
12:   end if
13:   encrypt(CodecBlk, OutputStream)
14: end while
15: return OutputStream

4) Encryption component: The encryption component encrypts the intermediate data generated by the variable-length data processing component. There are two types of encryption depending on the selected encryption scheme, i.e., ARIA and AES. First, when the user selects the ARIA algorithm, the encryption component performs the encryption by using the encryption key received from the user. The encryption component returns the encrypted data to the user through three components: the variable-length data processing component, the HDFS block splitting component, and the Hadoop encryption component. Second, when the user selects the AES algorithm, the encryption component performs the encryption by using the encryption key from the user. The encryption component returns the encrypted data to the user through the same three components.

C. Decryption procedure of proposed scheme
The decryption procedure of the proposed scheme is performed as shown in Figure 8. i) The user selects the preferred encryption algorithm between the AES and the ARIA and decrypts the input data by calling the Hadoop decryption component. To divide the input data into HDFS blocks of 64MB each, the Hadoop decryption component uses the HDFS block splitting component. ii) The HDFS block-splitting component merges the encrypted data distributed in HDFS into a single block. iii) The variable-length data processing component splits this single block into 128-bit sub-blocks. iv) The decryption component decrypts each sub-block by using the selected decryption algorithm, i.e., ARIA or AES. Finally, the decrypted sub-blocks are merged into the HDFS block by the variable-length processing component. The HDFS block is stored into the corresponding data node through the HDFS block splitting/merging component.
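
The variable-length processing of Algorithm 1 and its reversal during decryption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the block transform is a placeholder XOR rather than ARIA or AES, the dummy data is assumed to be zero bytes, and the original length is assumed to be known to the caller so that the dummy bytes can be dropped after decryption (the paper does not specify how they are removed). All function names are hypothetical.

```python
import io

BLOCK = 16  # 128-bit sub-block, in bytes


def toy_encrypt_block(block, key):
    """Placeholder block transform (XOR with a 16-byte key, NOT secure)."""
    if len(block) != BLOCK or len(key) != BLOCK:
        raise ValueError("block and key must both be 16 bytes")
    return bytes(b ^ k for b, k in zip(block, key))


def encrypt_stream(plain, key):
    """Read 16-byte sub-blocks, fill the last one if short, encrypt each."""
    src, out = io.BytesIO(plain), io.BytesIO()
    while True:
        codec_blk = src.read(BLOCK)
        if not codec_blk:  # reached EOF
            break
        if len(codec_blk) % BLOCK != 0:
            # Fill the last sub-block with dummy data (assumed: zero bytes).
            codec_blk = codec_blk.ljust(BLOCK, b"\x00")
        out.write(toy_encrypt_block(codec_blk, key))
    return out.getvalue()


def decrypt_stream(cipher, key, original_length):
    """Decrypt each 16-byte sub-block, merge, and drop the dummy bytes."""
    if len(cipher) % BLOCK != 0:
        raise ValueError("ciphertext must be a multiple of 16 bytes")
    # XOR is its own inverse, so the same transform decrypts each sub-block.
    plain = b"".join(
        toy_encrypt_block(cipher[i:i + BLOCK], key)
        for i in range(0, len(cipher), BLOCK)
    )
    return plain[:original_length]
```

For a 21-byte input, `encrypt_stream` produces two sub-blocks (32 bytes), the second of which carries 11 dummy bytes; `decrypt_stream` with `original_length=21` returns the original data.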

Fig. 8. Procedure of data decryption

D. Detailed design of proposed scheme
In this section, we describe the detailed design of the proposed HDFS data encryption scheme using the ARIA algorithm. Figure 9 shows the overall architecture of the proposed ARIA CODEC. In the figure, i) the AriaParallelCryptoCodec class, which implements the Hadoop encryption component, calls the AriaCryptoOutputStream. ii) The AriaCryptoOutputStream class, which implements the HDFS block splitting/merging component, returns the encrypted data by splitting the original data into HDFS blocks of 64MB each. iii) The AriaEncrypt class divides the HDFS block into 16-byte sub-blocks and checks whether the last sub-block is a full 16-byte unit. The variable-length data processing component in the AriaEncrypt class handles the encryption of sub-blocks that are shorter than 16 bytes. iv) The ARIAEngine class performs the encryption of sub-blocks through the AriaEncrypt class.

Fig. 9. Detailed design of our ARIA-based HDFS data encryption scheme

The decryption procedure is the reverse of the encryption procedure. In this figure, the AriaConstant class manages constant values used for encryption, such as the encryption key and the block size.

IV. PERFORMANCE EVALUATION

A. Experimental environment
We implement the proposed ARIA-based HDFS encryption scheme on Java Eclipse ver. 1.8.24. In addition, we use the JAR file format to aggregate the encryption class files so that users can apply the ARIA encryption CODEC without modifying the Hadoop internal source code. Moreover, we use two types of typical queries: string data processing algorithms and scientific data analysis algorithms. For string data processing, we use the WordCount and sorting algorithms, both of which are among the most typical ones. For scientific data analysis, we use the k-Means clustering algorithm [12] and the hierarchical clustering algorithm [13]. The k-Means clustering algorithm is a typical scientific data processing algorithm based on nearest neighbor data search. The hierarchical clustering algorithm requires more complex and intensive computation than k-Means clustering. Table 2 shows the four applications used for the performance evaluation.

TABLE II. APPLICATIONS OF PERFORMANCE EVALUATION

| Application | Reason for selection |
|---|---|
| Word Count | Most basic application for understanding the operation process of the MapReduce framework |
| Sort | String-type big data processing application that has a large number of arithmetic operations |
| k-Means | The basic algorithm for solving the clustering problem |
| Hierarchical Clustering | Clustering application that performs more complex computation than k-Means |

Table 3 describes the datasets used in our experiments, where both the Wikipedia English dump data [14] and the MetaWiki stub dump data [15] are XML data sets. The MODIS AQUA data [16] includes the seaweed information of marine science for a year.

TABLE III. DATA USED FOR APPLICATIONS

| Application | Type of data | Size |
|---|---|---|
| WordCount | Dump of English Wikipedia in 25.07.2015 | 86GB |
| Sort | Dump of English Metawiki Stub in 10.08.2015 | 6.45GB |
| k-Means | MODIS AQUA data in 10.2015 | 368.05MB, 15 million records |
| Hierarchical Clustering | MODIS AQUA data in 2012 | 1.34GB, 75 million records |

We run the four algorithms on a cluster with three data nodes under the experimental environment described in Table 4.
TABLE IV. PERFORMANCE EVALUATION ENVIRONMENT

| Item | Performance |
|---|---|
| CPU | Intel(R) Core i3-3240 CPU @ 3.40GHz |
| Memory | 12GB |
| OS | Ubuntu 12.04.2 |

B. Performance evaluation of WordCount
Figure 10 shows the performance evaluation result using the WordCount algorithm. NoEncrypt and AES require 12307.8 and 13190.4 seconds to perform the WordCount algorithm, respectively. On the other hand, the ARIA algorithm requires 13556.2 seconds, which indicates a slight performance degradation compared to NoEncrypt. In addition, our ARIA encryption shows a similar performance to that of the AES algorithm.

Fig. 10. Result of WordCount application

C. Performance evaluation of sorting algorithm
Figure 11 shows the performance evaluation result using the sorting algorithm. NoEncrypt and AES take 1947.6 and 2016.6 seconds on average, respectively. On the other hand, the ARIA algorithm requires 2081.6 seconds, which indicates a slight performance degradation compared with NoEncrypt. In addition, our ARIA encryption shows a similar performance to that of the AES algorithm. From the experiment, we can show that, for string data, the encryption cost does not make much difference in performance between the AES encryption algorithm and the ARIA encryption algorithm.

Fig. 11. Result of sort application

D. k-Means algorithm
Figure 12 shows the performance evaluation result of the k-Means algorithm. NoEncrypt and AES require 440 and 462.4 seconds to perform the k-Means algorithm, respectively. AES encryption shows approximately 5% performance degradation compared to NoEncrypt. On the other hand, the ARIA algorithm requires 478.4 seconds, which indicates about 9% performance degradation compared to NoEncrypt and 3% degradation compared to AES encryption. The performance trend for the k-Means algorithm using ARIA encryption is similar to those of WordCount and the sorting algorithm. We attribute this to the encryption and decryption costs being relatively high compared with the computation cost, because the k-Means algorithm requires less computation overhead than other clustering algorithms.

Fig. 12. Result of k-Means clustering

E. Hierarchical clustering algorithm
Figure 13 shows the performance evaluation result of the hierarchical clustering algorithm. NoEncrypt and AES require 1460 and 1470.2 seconds to perform hierarchical clustering, respectively. On the other hand, the ARIA algorithm requires 1460.6 seconds, which indicates about 2% performance

Fig. 13. Result of hierarchical clustering

degradation compared to NoEncrypt and 1% performance degradation compared to AES encryption. Because the hierarchical clustering algorithm requires heavy computation for clustering, the performance degradation of the ARIA algorithm is significantly reduced. For clustering algorithms with heavy computation, we show that the performance of the AES and ARIA encryption schemes depends less on the data encryption and decryption costs.

V. CONCLUSION AND FUTURE WORK
In this paper, we proposed an HDFS data encryption scheme using the ARIA encryption algorithm on Hadoop. The ARIA encryption algorithm has gained importance since it became the standard encryption scheme in South Korea. The proposed scheme provides an HDFS block-splitting component that supports encryption and decryption of data in Hadoop by dividing the data into HDFS blocks. We also introduced a variable-length data processing component that supports the encryption of not only data in 128-bit units but also variable-length data. In addition, users can choose between the international standard AES and the Korean standard ARIA. We showed through the performance evaluation that our proposed scheme supports ARIA-based encryption for various applications, such as word counting, sorting, k-Means, and hierarchical clustering. Our ARIA-based encryption scheme showed only 2-3% performance degradation on query processing, compared to AES encryption. However, our scheme can support the encryption of variable-length data while providing the same level of data protection as the AES algorithm. As future work, we plan to apply our ARIA-based encryption scheme to real-world applications, such as location-based services and financial information processing.

ACKNOWLEDGEMENT
This work was partly supported by the Human Resource Training Program for Regional Innovation and Creativity through the Ministry of Education and National Research Foundation of Korea (NRF-2016H1C1A1065816). This work was also supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0113-16-0005, Development of a Unified Data Engineering Technology for Large-scale Transaction Processing and Real-time Complex Analytics).

REFERENCES
[1] Jeffrey Dean, Sanjay Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the ACM, Vol.51, Issue.1, 2008, pp.107-113.
[2] Hadoop [Internet], http://hadoop.apache.org.
[3] Sudheesh Narayanan, “Securing Hadoop : Implement robust end-to-end security for your Hadoop ecosystem,” 1st Vol, PACKT Publishing, 2014.
[4] So Hyeon Park and Ik Rae Jeong, “A Study on Security Improvement in Hadoop Distributed File System Based on Kerberos,” Journal of the Korea Institute of Information Security and Cryptology, Vol.23, Issue.5, 2013, pp.803-813.
[5] Liu Yi, Hadoop Crypto Design [Internet], https://issues.apache.org/jira/secure/attachment/12571116/Hadoop Crypto Design.pdf.
[6] Seonyoung Park and Youngseok Lee, “A Performance Analysis of Encryption in HDFS,” Journal of KISS : Databases, Vol.41, Issue.1, 2014, pp.21-27.
[7] Byeong-yoon Choi, “Design of Cryptographic Processor for AES Rijndael Algorithm,” The Journal of The Korean Institute of Communication Sciences, Vol.26, Issue.10, 2001, pp.1491-1500.
[8] Yong Kuk Cho, Jung Hwan Song, and Sung Woo Kang, “Criteria for Evaluating Cryptographic Algorithms based on Statistical Testing of Randomness,” Journal of the Korea Institute of Information Security and Cryptology, Vol.11, Issue.6, 2001, pp.67-76.
[9] ARIA Development Team, Block Encryption Algorithm ARIA [Internet], http://glukjeoluk.tistory.com/attachment/ok110000000002.pdf.
[10] Korea Internet & Security Agency, ARIA specification [Internet], http://seed.kisa.or.kr/iwt/ko/bbs/EgovReferenceDetail.do?bbsId=BBSMSTR_000000000002&nttId=39&pageIndex=1&searchCnd=&searchWrd=.
[11] Jeffrey Root, Intel® Advanced Encryption Standard Instructions (AES-NI) [Internet], https://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni.
[12] Weizhong Zhao, Huifang Ma, Qing He, “Parallel k-means clustering based on mapreduce,” In: IEEE International Conference on Cloud Computing. Springer Berlin Heidelberg, Vol.5931, pp.674-679, 2009.
[13] Hui Gao, Jun Jiang, Li She, Yan Fu, “A New Agglomerative Hierarchical Clustering Algorithm Implementation based on the Map Reduce Framework,” International Journal of Digital Content Technology and its Applications, Vol.4, Issue.3, 2010, pp.95-100.
[14] [Internet] https://dumps.wikimedia.org/enwiki/
[15] [Internet] https://dumps.wikimedia.org/metawiki/
[16] MODIS [Internet] https://modis.gsfc.nasa.gov/
