Design and Implementation of HDFS Data Encryption Scheme Using ARIA Algorithm On Hadoop
Abstract—Hadoop was developed as a distributed data processing platform for analyzing big data. Enterprises can analyze big data containing users' sensitive information by using Hadoop and utilize the results for their marketing. Therefore, research on data encryption has been widely conducted to prevent the leakage of sensitive data stored in Hadoop. However, the existing studies support only the AES international standard data encryption algorithm. Meanwhile, the Korean government selected the ARIA algorithm as a standard data encryption scheme for domestic use. In this paper, we propose an HDFS data encryption scheme which supports both ARIA and AES algorithms on Hadoop. First, the proposed scheme provides an HDFS block-splitting component that performs ARIA/AES encryption and decryption under the Hadoop distributed computing environment. Second, the proposed scheme provides a variable-length data processing component that can perform encryption and decryption by adding dummy data when the last data block is shorter than 128 bits. Finally, we show from performance analysis that our proposed scheme is efficient for various applications, such as word counting, sorting, k-Means, and hierarchical clustering.

Keywords— Hadoop Security, Data Encryption, HDFS Data Encryption, ARIA Encryption Algorithm, Hadoop Encryption Codec

I. INTRODUCTION

Recently, due to the prevalence of smart phones, the amount of data generated by Social Networking Services (SNS) has been increasing rapidly. This data, which includes multimedia contents in various forms, has grown to big data scale. Meanwhile, the MapReduce framework can support parallel data processing based on both map and reduce functions for a distributed big data processing environment. Hadoop [1] is an open-source implementation of the MapReduce framework. Because Hadoop is used for major projects of Facebook and Amazon, it is considered a standard distributed parallel processing system for dealing with big data. Currently, companies analyze customers' information and location data collected in various fields and utilize them for marketing activities. Therefore, sensitive personal data can be disclosed while analyzing customers' data. To solve this problem, studies on the security of Hadoop [3-6] have been actively conducted. First, Kerberos [3, 4] protects against the disclosure of data through network attacks. Hadoop can block the accesses of malicious attackers by using user authentication through Kerberos. Second, studies on data encryption using a compression codec have been done to prevent data disclosure in HDFS (Hadoop Distributed File System). Liu Yi [5] proposed a SplittableCompressionCodec to solve the problem that the existing compression codec cannot support encryption on distributed data because it uses only a single map for input data compression. The SplittableCompressionCodec can compress and decompress HDFS data, which consists of a set of blocks, in a parallel way because multiple maps are allowed to process HDFS blocks. In addition, S. Park [6] proposed a scheme which can support the AES encryption of HDFS data by using the SplittableCompressionCodec and the Java cipher library. However, both studies have the limitation that they support only the AES encryption algorithm.

Meanwhile, the ARIA algorithm [9, 10] is widely used in many public institutions because the Korean government selected it as a standard data encryption scheme for domestic use. The ARIA algorithm performs data encryption based on a 128-bit block, like the AES algorithm. In this paper, we propose an HDFS data encryption codec scheme using both ARIA and AES encryption algorithms for Hadoop. First, the proposed encryption scheme provides an HDFS block-splitting component that performs both ARIA encryption and decryption in a distributed computing environment by dividing data into HDFS blocks. Second, we provide a variable-length data processing component that performs the encryption and decryption of variable-sized data by adding dummy data. Finally, we provide a codec selection component that selectively utilizes the existing AES codec and the proposed ARIA codec.

The structure of this paper is as follows. In Section 2, we introduce the HDFS encryption scheme based on the existing AES algorithm. In Section 3, we present the encryption scheme of the HDFS data using the ARIA algorithm. In Section 4, we compare the performance of the proposed ARIA-based encryption scheme with the existing AES-based encryption scheme. Finally, we conclude the paper with future work in Section 5.

II. RELATED WORK

A. Encryption Algorithm
The advanced encryption standard (AES) algorithm [7, 8] is a symmetric key-based encryption algorithm that was announced by
HDFS. iv) When the Reduce function is completed, the final
result is encrypted and stored in HDFS. Depending on the
sensitivity of the intermediate result of the Map function,
steps ii) and iii) can be omitted.
block size and adds dummy data if the data is not composed of a 16-byte unit. In addition, the proposed component can be applied to the existing AES encryption scheme. Figure 7 shows the processing of variable-length data.

Algorithm 1: Variable-length data processing algorithm
Input : Plain text file
Output : OutputStream that contains encrypted blocks
CurrentOff = 0
while CurrentOff < EOF do
    if CurrentOff + 16 >= EOF then
        CodecBlk = Read InputData to EOF
        CurrentOff = CurrentOff + Read Bytes
        if CodecBlk.size % 16 != 0 then
            Fill CodecBlk with dummy data up to 16 bytes
        end if
    else
        CodecBlk = Read InputData 16 Bytes
        CurrentOff = CurrentOff + 16
    end if
    encrypt(CodecBlk, OutputStream)
end while
return OutputStream
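
The loop in Algorithm 1 can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name VariableLengthEncryptor is hypothetical, the JDK's AES cipher stands in for ARIA (which the JDK does not ship, but which uses the same 128-bit block), zero bytes are an assumed choice of dummy data, and ECB/NoPadding is used only so that each 16-byte block is encrypted independently, as in the pseudocode.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

final class VariableLengthEncryptor {
    static final int BLOCK_SIZE = 16; // one 128-bit cipher block

    // Mirrors Algorithm 1: read 16 bytes at a time, fill a short last block, encrypt.
    static void encrypt(InputStream in, OutputStream out, byte[] key)
            throws IOException, GeneralSecurityException {
        // Assumption: JDK AES stands in for ARIA; ECB/NoPadding keeps blocks independent.
        Cipher cipher = Cipher.getInstance("AES/ECB/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        byte[] codecBlk = new byte[BLOCK_SIZE];
        int read;
        while ((read = in.readNBytes(codecBlk, 0, BLOCK_SIZE)) > 0) {
            if (read < BLOCK_SIZE) {
                // "Fill CodecBlk": zero bytes are an assumed choice of dummy data.
                Arrays.fill(codecBlk, read, BLOCK_SIZE, (byte) 0);
            }
            out.write(cipher.doFinal(codecBlk));
        }
    }

    // Convenience wrapper for in-memory data (hypothetical helper).
    static byte[] encrypt(byte[] plain, byte[] key) {
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            encrypt(new ByteArrayInputStream(plain), out, key);
            return out.toByteArray();
        } catch (IOException | GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] key = new byte[16]; // all-zero demo key, for illustration only
        // 20 input bytes span one full block plus a 4-byte tail that gets filled.
        System.out.println(encrypt(new byte[20], key).length); // prints 32
    }
}
```

With zero filling, the decrypting side cannot tell dummy bytes from real trailing zeros, so a real codec would also need to record the original length or use a self-describing fill; the sketch leaves that out.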
(a) Encrypt data that is less than 16 bytes
encryption of constant values, such as encryption key and block size.

Fig. 8. Procedure of data decryption

D. Detailed design of proposed scheme
In this section, we describe the detailed design of the proposed HDFS data encryption scheme using the ARIA algorithm. Figure 9 shows the overall architecture of the proposed ARIA CODEC. In the figure, i) the AriaParallelCryptoCodec class, which implements the Hadoop encryption component, calls the AriaCryptoOutputStream. ii) The AriaCryptoOutputStream class, which implements the HDFS block splitting/merging component, returns the encrypted data by splitting the original data into 64MB HDFS blocks. iii) The AriaEncrypt class divides each HDFS block into 16-byte sub-blocks and checks whether the last sub-block is a full 16 bytes. The variable-length data processing component in the ARIAEncrypt class encrypts sub-blocks that are shorter than 16 bytes. iv) The ARIAEngine class performs the encryption of sub-blocks through the ARIAEncrypt class.

Fig. 9. Detailed design of our ARIA-based HDFS data encryption scheme

A. Experimental environment
We implement the proposed ARIA-based HDFS encryption scheme on Java Eclipse ver. 1.8.24. In addition, we use a JAR file format to aggregate encryption class files so that users can apply the ARIA encryption CODEC without modifying the Hadoop internal source code. Moreover, we use two types of typical queries: string data processing algorithms and scientific data analysis algorithms. For the string data processing, we use the WordCount and the sorting algorithms, both of which are the most typical ones. For the scientific data analysis, we use the k-Means clustering algorithm [12] and the hierarchical clustering algorithm [13]. The k-Means clustering algorithm is the typical scientific data processing algorithm based on the nearest neighbor data search. The hierarchical clustering algorithm requires more complex and intense data computation than k-Means clustering. Table 2 shows the four applications used for the performance evaluation.

TABLE II. APPLICATIONS OF PERFORMANCE EVALUATION
Application | Reason for selection
Word Count | Most basic application for understanding the operation process of the MapReduce framework
Sort | String-type big data processing application that has a large number of arithmetic operations
k-Means | The basic algorithm for solving the clustering problem
Hierarchical Clustering | Application that performs a more complex amount of calculation than k-Means

Hierarchical Clustering | MODIS AQUA data in 2012 | 1.34GB, 75 million records
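
For a sense of scale, the block sizes named in the detailed design (64MB HDFS blocks, 16-byte sub-blocks) can be applied to the 1.34GB hierarchical clustering dataset listed above. The following is a rough arithmetic sketch only: the class and method names are hypothetical, and it assumes 1 GB = 2^30 bytes and 1 MB = 2^20 bytes, which the paper does not state.

```java
final class BlockLayout {
    static final long HDFS_BLOCK_BYTES = 64L * 1024 * 1024; // 64MB HDFS block (assumed 2^20-byte MB)
    static final int SUB_BLOCK_BYTES = 16;                  // 128-bit cipher block

    // Number of HDFS blocks needed to hold a file, rounding up.
    static long hdfsBlockCount(long fileBytes) {
        return (fileBytes + HDFS_BLOCK_BYTES - 1) / HDFS_BLOCK_BYTES;
    }

    // Number of 16-byte sub-blocks in one HDFS block, rounding up.
    static long subBlockCount(long blockBytes) {
        return (blockBytes + SUB_BLOCK_BYTES - 1) / SUB_BLOCK_BYTES;
    }

    // Dummy bytes required to complete the last sub-block (0 if already aligned).
    static int dummyBytesNeeded(long blockBytes) {
        int tail = (int) (blockBytes % SUB_BLOCK_BYTES);
        return tail == 0 ? 0 : SUB_BLOCK_BYTES - tail;
    }

    public static void main(String[] args) {
        long fileBytes = (long) (1.34 * 1024 * 1024 * 1024); // assumes 1 GB = 2^30 bytes
        long blocks = hdfsBlockCount(fileBytes);
        long lastBlockBytes = fileBytes - (blocks - 1) * HDFS_BLOCK_BYTES;
        System.out.println("HDFS blocks:               " + blocks);
        System.out.println("sub-blocks per full block: " + subBlockCount(HDFS_BLOCK_BYTES));
        System.out.println("dummy bytes in last block: " + dummyBytesNeeded(lastBlockBytes));
    }
}
```

Under these assumptions only the final sub-block of the final HDFS block can need dummy data, since 64MB is itself a multiple of 16 bytes.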
TABLE IV. PERFORMANCE EVALUATION ENVIRONMENT
Item | Performance
CPU | Intel(R) Core i3-3240 CPU @ 3.40GHz
Memory | 12GB
OS | Ubuntu 12.04.2

B. Performance evaluation of WordCount
Figure 10 shows the performance evaluation result using the WordCount algorithm. NoEncrypt and AES require 12307.8 and 13190.4 seconds to perform the WordCount algorithm, respectively. On the other hand, the ARIA algorithm requires 13556.2 seconds, which indicates a slight performance degradation (about 10%) compared to NoEncrypt. In addition, our ARIA encryption shows a similar performance to that of the AES algorithm.

to that of the AES algorithm. From the experiment, we can show that the encryption cost for string data does not have much influence on the performance difference between the AES encryption algorithm and the ARIA encryption algorithm.

D. k-Means algorithm
Figure 12 shows the performance evaluation result of the k-Means algorithm. NoEncrypt and AES require 440 and 462.4 seconds to perform the k-Means algorithm, respectively. AES encryption shows approximately 5% performance degradation compared to NoEncrypt. On the other hand, the ARIA algorithm requires 478.4 seconds, which indicates about 9% performance degradation compared to NoEncrypt and 3% degradation compared to AES encryption. The performance trend for the k-Means algorithm using ARIA encryption is similar to those of WordCount and the sorting algorithm. We attribute this to the encryption and decryption costs being relatively high compared with the computation cost, because the k-Means algorithm requires less computation than other clustering algorithms.
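
The degradation percentages quoted in this section follow directly from the reported running times. The short sketch below recomputes them; the class and method names are hypothetical, and the only inputs are the timings stated in the text for WordCount and k-Means.

```java
final class Overhead {
    // Relative slowdown of `measured` versus `baseline`, in percent.
    static double percent(double baseline, double measured) {
        return (measured - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Running times in seconds, as reported in the text.
        double wcNone = 12307.8, wcAes = 13190.4, wcAria = 13556.2; // WordCount
        double kmNone = 440.0, kmAes = 462.4, kmAria = 478.4;       // k-Means

        System.out.printf("WordCount AES  vs NoEncrypt: %.1f%%%n", percent(wcNone, wcAes));
        System.out.printf("WordCount ARIA vs NoEncrypt: %.1f%%%n", percent(wcNone, wcAria));
        System.out.printf("WordCount ARIA vs AES:       %.1f%%%n", percent(wcAes, wcAria));
        System.out.printf("k-Means   AES  vs NoEncrypt: %.1f%%%n", percent(kmNone, kmAes));
        System.out.printf("k-Means   ARIA vs NoEncrypt: %.1f%%%n", percent(kmNone, kmAria));
        System.out.printf("k-Means   ARIA vs AES:       %.1f%%%n", percent(kmAes, kmAria));
    }
}
```

For k-Means this gives roughly 5.1%, 8.7%, and 3.5%, matching the rounded 5%, 9%, and 3% in the text; for WordCount the same calculation gives roughly 7.2%, 10.1%, and 2.8%.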
degradation compared to NoEncrypt and 1% performance degradation compared to AES encryption. Because the hierarchical clustering algorithm requires high computation for clustering, the performance degradation of the ARIA algorithm is significantly reduced. For clustering algorithms with high computation, we show that the performance of the AES and ARIA encryption schemes depends less on the data encryption and decryption costs.

V. CONCLUSION AND FUTURE WORK
In this paper, we proposed an HDFS data encryption scheme using the ARIA encryption algorithm on Hadoop. The ARIA encryption algorithm has gained importance since it became the standard encryption scheme in South Korea. The proposed scheme provides the HDFS block-splitting component that supports encryption and decryption of data in Hadoop by dividing the data into HDFS blocks. We also introduced the variable-length data processing component that supports the encryption of not only data in 128-bit units but also variable-length data. In addition, users can choose between the international standard AES and the Korean standard ARIA. We showed from the performance evaluation that our proposed scheme can support ARIA-based encryption for various applications, such as word counting, sorting, k-Means, and hierarchical clustering. Our ARIA-based encryption scheme showed only 2-3% performance degradation on query processing, compared to AES encryption. However, our scheme can support the encryption of variable-length data while providing the same level of data protection as the AES algorithm. Future work is to apply our ARIA-based encryption scheme to real-world applications, such as location-based services and financial information processing.

ACKNOWLEDGEMENT
This work was partly supported by the Human Resource Training Program for Regional Innovation and Creativity through the Ministry of Education and National Research Foundation of Korea (NRF-2016H1C1A1065816). This work was also supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0113-16-0005, Development of a Unified Data Engineering Technology for Large-scale Transaction Processing and Real-time Complex Analytics).

REFERENCES
[1] Jeffrey Dean, Sanjay Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the ACM, Vol.51, Issue.1, 2008, pp.107-113.
[2] Hadoop [Internet], http://hadoop.apache.org.
[3] Sudheesh Narayanan, “Securing Hadoop : Implement robust end-to-end security for your Hadoop ecosystem,” 1st Vol, PACKT Publishing, 2014.
[4] So Hyeon Park and Ik Rae Jeong, “A Study on Security Improvement in Hadoop Distributed File System Based on Kerberos,” Journal of the Korea Institute of Information Security and Cryptology, Vol.23, Issue.5, 2013, pp.803-813.
[5] Liu Yi, Hadoop Crypto Design [Internet], https://issues.apache.org/jira/secure/attachment/12571116/Hadoop Crypto Design.pdf.
[6] Seonyoung Park and Youngseok Lee, “A Performance Analysis of Encryption in HDFS,” Journal of KISS : Databases, Vol.41, Issue.1, 2014, pp.21-27.
[7] Byeong-yoon Choi, “Design of Cryptographic Processor for AES Rijndael Algorithm,” The Journal of The Korean Institute of Communication Sciences, Vol.26, Issue.10, 2001, pp.1491-1500.
[8] Yong Kuk Cho, Jung Hwan Song, and Sung Woo Kang, “Criteria for Evaluating Cryptographic Algorithms based on Statistical Testing of Randomness,” Journal of the Korea Institute of Information Security and Cryptology, Vol.11, Issue.6, 2001, pp.67-76.
[9] ARIA Development Team, Block Encryption Algorithm ARIA [Internet], http://glukjeoluk.tistory.com/attachment/ok110000000002.pdf.
[10] Korea Internet & Security Agency, ARIA specification [Internet], http://seed.kisa.or.kr/iwt/ko/bbs/EgovReferenceDetail.do?bbsId=BBSMSTR_000000000002&nttId=39&pageIndex=1&searchCnd=&searchWrd=.
[11] Jeffrey Root, Intel® Advanced Encryption Standard Instructions (AES-NI) [Internet], https://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni.
[12] Weizhong Zhao, Huifang Ma, Qing He, “Parallel k-means clustering based on mapreduce,” In: IEEE International Conference on Cloud Computing, Springer Berlin Heidelberg, Vol.5931, 2009, pp.674-679.
[13] Hui Gao, Jun Jiang, Li She, Yan Fu, “A New Agglomerative Hierarchical Clustering Algorithm Implementation based on the Map Reduce Framework,” International Journal of Digital Content Technology and its Applications, Vol.4, Issue.3, 2010, pp.95-100.
[14] [Internet] https://dumps.wikimedia.org/enwiki/
[15] [Internet] https://dumps.wikimedia.org/metawiki/
[16] MODIS [Internet] https://modis.gsfc.nasa.gov/