CERTIFICATE

This is to certify that the thesis entitled “A Novel Approach for


Providing Efficient and Scalable Security Model for Hadoop Cluster
in Cloud” submitted to Dr. Babasaheb Ambedkar Marathwada
University, Aurangabad for the award of Doctor of Philosophy in
Computer Science & Engineering, is a bonafide record of the research
work carried out by Mr. K. Vishal Reddy under our guidance. The
content of the thesis, in full or parts, has not been submitted to any other
Institute or University for the award of any other degree.

Place : Aurangabad

Date :

Prof. Dr. Jayantrao B. Patil Prof. Dr. Ratnadeep R. Deshmukh

Research Guide Research Co-Guide

ABSTRACT
The term “Big Data” is associated with the colossal and varied nature of
the data generated at a tremendous rate. In the current scenario, data is
produced from numerous sources and in numerous formats, such as images,
videos, audio, PDFs, and text documents. Most organizations frequently
generate terabytes or petabytes of data. Acquiring data is no longer the
issue; how organizations manage and use such large volumes of data is the
decisive concern. The Hadoop Distributed File System is an open-source,
reliable, and user-friendly system that plays a crucial part in handling
an enormous volume of generated data with minimal susceptibility to
faults. Hadoop permits multiple clients to exploit Big Data’s potential
using other open-source resources, such as Hive, HBase, Pig, Spark, and
Storm. However, the data stored in Hadoop is prone to different internal
and remote attacks, which may compromise the reliability of Hadoop. To
counter these attacks, various investigations have been carried out on
data encryption to prevent the leakage of Hadoop’s sensitive data.

Nonetheless, preserving data securely in Hadoop remains challenging. To
provide data-level security, the proposed work aims to develop an
efficient Transparent Data Encryption scheme using the RC6, Blowfish,
and Modified Blowfish algorithms to secure data in Hadoop. The work
minimizes all possible attacks and provides an appropriate remedy. The
proposed work has enhanced performance using the RC6, Blowfish, and
Modified Blowfish (MBF) algorithms. MBF provides the encryption
/decryption of data by parallel data processing using the Adams-Moulton
method. The proposed work overcomes the execution complexity, key
management complexity, and application modification by integrating the
cryptosystem into the MapReduce code, and it consumes less power.
Experimental results showed that the proposed work performs better
than several existing methods.

DECLARATION
I hereby declare that the thesis entitled “A Novel Approach for
Providing Efficient and Scalable Security Model for
Hadoop Cluster in Cloud” is carried out by me under the
guidance of Prof. Dr. Jayantrao B. Patil, Department of Computer
Engineering, R. C. Patel Institute of Technology, Shirpur, Dhule (MS),
India, and Prof. Dr. Ratnadeep R. Deshmukh, Department of
Computer Science and Information Technology, Dr. Babasaheb
Ambedkar Marathwada University, Aurangabad (MS), India. The work
is original and has not been submitted in part or in full to any other
University or Institute for the award of any research degree. The extent of
information derived from the existing literature has been indicated in the
body of the thesis at appropriate places by giving the references.

Place : Aurangabad Mr. K. Vishal Reddy

Date : Research Student

ACKNOWLEDGMENT
I wish to express my hearty and sincere gratitude to my Research Guide
respected Prof. Dr. Jayantrao B. Patil, Professor and Principal,
Department of Computer Engineering, R. C. Patel Institute of
Technology, Shirpur, for his valuable guidance and constant
encouragement throughout this work.
I would also like to express my deep sense of gratitude and indebtedness
to respected Dr. Ratnadeep R. Deshmukh (Research Co-Guide),
Professor and former Head of Computer Science and Information
Technology Department, Dr. Babasaheb Ambedkar Marathwada
University, Aurangabad, for always guiding and encouraging me
throughout this work.
I thank Dr. Sachin N. Deshmukh, Professor and Head, Department
of Computer Science and Information Technology, Dr. Babasaheb
Ambedkar Marathwada University, Aurangabad, for providing the
necessary facilities and continuous support during the research
work.
I owe my sincere thanks to the Management of Deogiri Institute of
Engineering and Management Studies, Aurangabad for giving me
permission and encouragement to carry out the research work. I am
incredibly thankful to Dr. Ulhas D. Shiurkar, Director, Deogiri
Institute of Engineering and Management Studies, Aurangabad and
Sanjay B. Kalyankar, Head of Department, Computer Science and
Engineering, Deogiri Institute of Engineering and Management Studies,
Aurangabad for their moral support during my work and encouraging
me throughout this work.
I am also very much thankful to Prof. Dr. Nitin N. Patil, Head
Department of Computer Science and Engineering, R. C. Patel Institute
of Technology, Shirpur, for his constant support and meticulous
supervision at every step to come out with this work.

I express my sincere gratitude to my parents, Dr. K. Ravinder Reddy
and Smt. K. Prabhavati Reddy, for their support and encouragement and
for being a constant source of strength. I deeply thank my
wife, Ramya Reddy, for her constant support. I am also thankful to my sister
Shirisha Reddy and brother-in-law Ramesh Chandra Reddy for their
continuous encouragement.
I am grateful to all my friends and colleagues as well as those who have
directly or indirectly helped me in the process of completion of this thesis.

K. Vishal Reddy

TABLE OF CONTENTS

CERTIFICATE . . . . . . . . . . . . . . . . . . . . . . . . i
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . ii
DECLARATION . . . . . . . . . . . . . . . . . . . . . . . . iii
ACKNOWLEDGMENT . . . . . . . . . . . . . . . . . . . . iv
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . x
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . xii
LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . xiii

1 INTRODUCTION 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Hadoop Framework . . . . . . . . . . . . . . . . . . . . 2
1.3.1 Need for Hadoop Distributed File System (HDFS) 2
1.3.2 Hadoop Distributed File System (HDFS)
Architecture . . . . . . . . . . . . . . . . . . . . 4
1.3.3 Anatomy of a file write . . . . . . . . . . . . 8
1.3.4 Anatomy of a file read . . . . . . . . . . . . . . 9
1.3.5 MapReduce . . . . . . . . . . . . . . . . . . . . 10
1.4 Cloud Computing Framework . . . . . . . . . . . . . . 13
1.4.1 Types/Deployment Models of Cloud . . . . . . . 14
1.4.2 Service/Delivery Models . . . . . . . . . . . . . 16
1.4.3 Key Essential Characteristics . . . . . . . . . . . 18
1.5 Hadoop in Cloud Computing . . . . . . . . . . . . . . . 18
1.5.1 Need for Hadoop in Cloud . . . . . . . . . . . . 19
1.6 Security threats and possible attacks . . . . . . . . . . . 21

1.7 Technologies for security - Securing Hadoop Solution . . 23
1.8 Major Challenges . . . . . . . . . . . . . . . . . . . . . 25
1.8.1 Data Size and Storage Criticality . . . . . . . . . 25
1.8.2 Distributed Nature and Fragmented Data . . . . 25
1.9 Security issues in Hadoop and Cloud Computing . . . . 26
1.9.1 Security in Hadoop . . . . . . . . . . . . . . . . 26
1.9.2 Security in Cloud . . . . . . . . . . . . . . . . . 29
1.10 Necessities . . . . . . . . . . . . . . . . . . . . . . . . 31
1.11 Research Motivation . . . . . . . . . . . . . . . . . . . 32
1.12 Problem Statement . . . . . . . . . . . . . . . . . . . . 32
1.13 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.14 Research Contributions . . . . . . . . . . . . . . . . . . 33
1.15 Organization of Thesis . . . . . . . . . . . . . . . . . . 34

2 LITERATURE REVIEW 36
2.1 Authentication and Authorization in the Hadoop
Framework . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2 Data Security in Hadoop Distributed File System . . . . 42
2.3 Security Tools for Hadoop Cluster by Third Party . . . . 47
2.4 Cloud Computing . . . . . . . . . . . . . . . . . . . . . 49
2.4.1 Security Issue in Cloud Computing . . . . . . . 50
2.4.2 Authentication . . . . . . . . . . . . . . . . . . 50
2.4.3 Confidentiality . . . . . . . . . . . . . . . . . . 52
2.4.4 Integrity . . . . . . . . . . . . . . . . . . . . . . 53
2.4.5 Availability . . . . . . . . . . . . . . . . . . . . 55
2.4.6 Accountability . . . . . . . . . . . . . . . . . . 56
2.4.7 Privacy and Preservability . . . . . . . . . . . . 58
2.5 Data security in Cloud Computing . . . . . . . . . . . . 59
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 63

3 SYSTEM DEVELOPMENT 64
3.1 Necessity of Security at HDFS . . . . . . . . . . . . . . 65

3.2 Security Integration by Hadoop Distributors . . . . . . . 66
3.3 System Architecture . . . . . . . . . . . . . . . . . . . . 68
3.4 Data level Encryption . . . . . . . . . . . . . . . . . . . 69
3.4.1 Advanced Encryption Standard (AES) . . . . . . 70
3.4.2 RC6 . . . . . . . . . . . . . . . . . . . . . . . . 71
3.4.3 Blowfish . . . . . . . . . . . . . . . . . . . . . 71
3.4.4 Modified Version of Blowfish (BF) . . . . . . . 76
3.5 Encryption in Hadoop . . . . . . . . . . . . . . . . . . . 79
3.6 Implementation of Cryptosystems at Application Level . 80
3.6.1 Application-Level Encryption . . . . . . . . . . 80
3.6.2 Application-Level Decryption . . . . . . . . . . 81
3.6.3 Advantages of Application-level Encryption . . . 82
3.6.4 Disadvantages of Application-level Encryption . 83
3.7 File-System Level Cryptosystems . . . . . . . . . . . . 83
3.7.1 Transparent Data Encryption (TDE) . . . . . . . 84
3.7.2 Key Management Server (KMS) Architecture . . 89
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 91

4 RESULT AND DISCUSSION 92


4.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . 92
4.2 PERFORMANCE ANALYSIS . . . . . . . . . . . . . . 93
4.2.1 Encryption time . . . . . . . . . . . . . . . . . . 93
4.2.2 Decryption time . . . . . . . . . . . . . . . . . 93
4.3 Implementation of Efficient and Scalable Security Model
for Hadoop Cluster . . . . . . . . . . . . . . . . . . . . 94
4.4 Performance of Application-Level Encryption /
Decryption Using Map-Reduce (MR) model . . . . . . . 94
4.4.1 Experimental Environment . . . . . . . . . . . . 95
4.4.2 Analysis of Space Consumption: Advanced
Encryption Standard (AES) VS Blowfish (BF) . 95
4.4.3 Analysis of Computational Time: AES VS
Blowfish (BF) . . . . . . . . . . . . . . . . . . . 97

4.5 Implementation of TDE Model . . . . . . . . . . . . . . 100
4.5.1 Experimental setup . . . . . . . . . . . . . . . . 100
4.6 Performance Evaluation of TDE Model: AES vs RC6 . . 100
4.6.1 Analysis of Space Consumption: AES VS RC6 . 100
4.6.2 Analysis of Computational time: AES VS RC6 . 102
4.7 Performance Evaluation of TDE Model: AES vs RC6 vs
Blowfish (BF) . . . . . . . . . . . . . . . . . . . . . . . 104
4.7.1 Analysis of Space Consumption: AES vs RC6 vs
Blowfish (BF) . . . . . . . . . . . . . . . . . . . 104
4.7.2 Analysis of Computational time: AES vs RC6 vs
Blowfish (BF) . . . . . . . . . . . . . . . . . . . 106
4.8 Performance Evaluation of TDE Model: AES vs RC6 vs
Blowfish (BF) vs Modified Blowfish (MBF) . . . . . . . 108
4.8.1 Analysis of Space Consumption: AES vs RC6 vs
Blowfish (BF) vs Modified Blowfish (MBF) . . . 108
4.8.2 Analysis of Computational time: AES vs RC6 vs
Blowfish (BF) vs Modified Blowfish (MBF) . . . 111
4.9 Analysis of Throughput and Percentage Increase in file
size : AES vs RC6 vs Blowfish (BF) vs MBF . . . . . . 113
4.9.1 Analysis of Throughput: AES vs RC6 vs
Blowfish (BF) vs Modified Blowfish (MBF) . . . 113
4.9.2 Analysis of Percentage Increase in file size : AES
vs RC6 vs Blowfish (BF) vs MBF . . . . . . . . 115

5 CONCLUSION AND FUTURE SCOPE 116


5.1 Important Interventions of the Research . . . . . . . . . 116
5.1.1 Conclusions . . . . . . . . . . . . . . . . . . . . 117
5.1.2 Future Scope . . . . . . . . . . . . . . . . . . . 119
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . 119
LIST OF PUBLICATIONS . . . . . . . . . . . . . . . . . .
PLAGIARISM REPORT . . . . . . . . . . . . . . . . . . .

LIST OF FIGURES

1.1 Generic Hadoop Architecture . . . . . . . . . . . . . . 4


1.2 Hadoop Distributed File System (HDFS) Architecture . 5
1.3 Working of HDFS Client . . . . . . . . . . . . . . . . . 7
1.4 Anatomy of a File Write . . . . . . . . . . . . . . . . . 8
1.5 Anatomy of a File Read . . . . . . . . . . . . . . . . . 10
1.6 Architecture of Hadoop YARN . . . . . . . . . . . . . 11
1.7 Map-Reduce (MR) Execution Flow . . . . . . . . . . . 12
1.8 Cloud Computing . . . . . . . . . . . . . . . . . . . . . 13
1.9 Cloud Computing Model . . . . . . . . . . . . . . . . . 15
1.10 Hadoop Usage in Cloud Computing . . . . . . . . . . . 20

3.1 Generic Hadoop Eco-System . . . . . . . . . . . . . . . 65


3.2 SecHDFS-AWS: Secured Hadoop Distributed File
System (HDFS) deployed in Amazon Web Services
(AWS) cloud . . . . . . . . . . . . . . . . . . . . . . . 68
3.3 Generic Representation of Blowfish (BF) algorithm . . . 75
3.4 Generic Representation of Feistel Function execution (F) 76
3.5 Feistel Function R of the Modified Blowfish (MBF)
algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.6 Encryption process using Map-Reduce (MR) . . . . . . 81
3.7 Decryption process using Map-Reduce (MR) . . . . . . 82
3.8 Transparent Data Encryption (TDE) key formation . . . 85
3.9 Creation of Encryption Zones (EZ) . . . . . . . . . . . 86
3.10 Write/Read a file to/from Encryption Zones (EZ) . . . . 88
3.11 Hadoop Key Management Server (KMS) Architecture . 90

4.1 Map-Reduce Model Space Consumption Analysis after
Encryption . . . . . . . . . . . . . . . . . . . . . . . . 96
4.2 Map-Reduce Model computational time Analysis during
encryption . . . . . . . . . . . . . . . . . . . . . . . . 98
4.3 Map-Reduce Model computational time Analysis during
decryption . . . . . . . . . . . . . . . . . . . . . . . . 99
4.4 TDE Model Space Consumption Analysis: AES vs RC6 101
4.5 TDE Model Computational time Analysis: AES vs RC6 103
4.6 TDE Model Space Consumption Analysis: AES vs RC6
vs Blowfish (BF) . . . . . . . . . . . . . . . . . . . . . 106
4.7 TDE Model Computational time Analysis: AES vs RC6
vs Blowfish (BF) . . . . . . . . . . . . . . . . . . . . . 107
4.8 TDE Model Space Consumption Analysis: AES vs RC6
vs Blowfish (BF) vs Modified Blowfish (MBF) . . . . . 110
4.9 TDE Model Computational time Analysis: AES vs RC6
vs Blowfish (BF) vs Modified Blowfish (MBF) . . . . . 112
4.10 TDE Model Throughput: AES vs RC6 vs Blowfish (BF)
vs Modified Blowfish (MBF) . . . . . . . . . . . . . . . 114
4.11 TDE Model % Increase in File Size: AES vs RC6 vs
Blowfish (BF) vs Modified Blowfish (MBF) . . . . . . . 115

LIST OF TABLES

1.1 Use-Cases of Big Data . . . . . . . . . . . . . . . . . . 3

3.1 Provision of Security by the Hadoop Distributors . . . . 67

4.1 Analysis of Map-Reduce Model Space Consumption


after Encryption . . . . . . . . . . . . . . . . . . . . . 96
4.2 Map-Reduce Model computational time Analysis during
encryption . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3 Map-Reduce Model computational time Analysis during
decryption . . . . . . . . . . . . . . . . . . . . . . . . 99
4.4 TDE Model Space Consumption Analysis: AES vs RC6 101
4.5 TDE Model Computational time Analysis: AES vs RC6 103
4.6 TDE Model Space Consumption Analysis: AES vs RC6
vs Blowfish (BF) . . . . . . . . . . . . . . . . . . . . . 105
4.7 TDE Model Computational time Analysis: AES vs RC6
vs Blowfish (BF) . . . . . . . . . . . . . . . . . . . . . 107
4.8 TDE Model Space Consumption Analysis: AES vs RC6
vs Blowfish (BF) vs Modified Blowfish (MBF) . . . . . 109
4.9 TDE Model Computational time Analysis: AES vs RC6
vs Blowfish (BF) vs Modified Blowfish (MBF) . . . . . 111
4.10 TDE Model Throughput: AES vs RC6
vs Blowfish (BF) vs Modified Blowfish (MBF) . . . . . 114
4.11 TDE Model % Increase in File Size: AES vs RC6
vs Blowfish (BF) vs Modified Blowfish (MBF) . . . . . 115

LIST OF ABBREVIATIONS

HDFS Hadoop Distributed File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

NoSQL Not Only Structured Query Language . . . . . . . . . . . . . . . . . . . . . . 2

RPC Remote Procedure Call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

ACK acknowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

HTTP Hyper Text Transfer protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

YARN Yet Another Resource Negotiator . . . . . . . . . . . . . . . . . . . . . . . . . . 10

MR Map-Reduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

CLC Container Launch Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

IAAS Infrastructure As A Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

PAAS Platform As A Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

SAAS Software As A Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

AWS Amazon Web Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

EC2 Elastic Compute Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

DC Data Confidentiality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

ACL Access Control Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

DoS Denial of Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

DDoS Distributed Denial of Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

LDAP Lightweight Directory Access Protocol . . . . . . . . . . . . . . . . . . . . . 28

AD Active Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

MAC Media Access Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

BGP Border Gateway Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

KMS Key Management Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

AES Advanced Encryption Standard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

RC6 Rivest Cipher 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

BF Blowfish . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

MBF Modified Blowfish . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


TDE Transparent Data Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

RDBMS Relational DataBase Management System . . . . . . . . . . . . . . . . 66

HIPAA Health Insurance Portability and Accountability Act . . . . . . . . 69

PCI DSS Payment Card Industry Data Security Standard . . . . . . . . . . . 69

FISMA Federal Information Security Management Act . . . . . . . . . . . . 69

NIST National Institute of Standards and Technology . . . . . . . . . . . . . . 70

ODE Ordinary Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

RM Resource Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

EZ Encryption Zones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

EDEK Encrypted Data Encryption Key . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

DEK Data Encryption Key . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

Chapter 1

INTRODUCTION

1.1 Introduction
This chapter focuses on the fundamental principles of Big Data, the
Hadoop framework, Cloud Computing, and security challenges in
Hadoop and Cloud Computing. The ultimate objective of the proposed
work is to develop a data security model for a Hadoop cluster in a cloud
environment. The Hadoop framework faces various threats and attacks,
and securing it imposes several requirements and obstacles, which are
described in the following sections. This chapter also presents the problem
statement and the research contributions of this work.

1.2 Big Data


Big Data describes the vast and varied nature of data collected at a rapid
rate. Big Data, like the internet, is critical to every business and society.
More data implies more precise analysis. In today’s world, data is generated in
various formats (.csv, .tsv, .json, images, videos, audio, pdf, text
documents) from various sources. Businesses generate data in terabytes or
petabytes. The issue is not gathering the data but what
corporations do with that vast amount of information. For example, the
stock market creates approximately terabytes of data every day, which is
analysed to identify the optimal trends. As illustrated in Table
1.1, organisations have grasped the hidden prospects and benefits of Big
Data and have begun to analyse and mine data to gain better insights. The
heterogeneous nature of the data, which may be unstructured, semi-structured,
or structured, together with its massive volume, poses significant issues in
terms of storage and processing. To handle and process such massive data,
new technologies and approaches have arisen. Researchers have
proposed various solutions to these issues, including distributed
file systems [1] and Not Only Structured Query Language (NoSQL)
databases [2]. These tools and systems are well suited to the long-term
storage and management of large, schema-free data-sets.

1.3 Hadoop Framework


Hadoop began development in 2006 and was quickly promoted to a
top-level project at Apache in 2008 [3]. Later, Hadoop 1.0.0 was
released under Apache Software Foundation in 2011 as an open-source
project. Hadoop is a set of foundational libraries (framework). Because
Hadoop has brought solutions to the issues posed by Big Data, many
businesses have begun to employ it for big data analytics. Across the
cluster, the Hadoop Framework provides distributed storage and
computing. Hadoop clusters offer flexibility, reliability, scalability,
fault-tolerance, and cost-effectiveness by utilising low-cost commodity
hardware. Organizations prefer Hadoop to scale up resources from a
single server to the number of machines organized in a cluster. The
Hadoop Distributed File System (HDFS) and Map-Reduce (MR) 2.0 are
the main components of Hadoop. The Hadoop architecture is depicted in
Figure 1.1. Every server will provide local computation and storage for
the associated client’s operations.

1.3.1 Need for HDFS

HDFS first appeared at Yahoo as part of the company’s ad serving and


site indexing requirements. Like other web-based organisations, Yahoo
found itself juggling a range of apps accessed by an expanding number

Table 1.1: Use-Cases of Big Data

Type of Organization Targeted analysis of data


Ad targeting
Abuse and Click Fraud detection
Web and e-tailing
Search quality
Recommendation Engines
Network performance optimization
Analysing Network to predict failure
Telecommunications
Customer Churn prevention
Calling Data Record (CDR) Analysis
Welfare schemes
Government Sectors Fraud detection and Cybersecurity
Justice
Gene sequencing
Drug safety
Life Sciences and Healthcare Serialization
Healthcare service quality improvements
Health information exchange
Threat analysis
Credit scoring and analysis
Banks and Financial services Fraud detection
Modelling true risk
Trade surveillance
Customer churn analysis
Retail Sentiment analysis
Point of sales transaction analysis

of users who were producing an ever-increasing amount of data. HDFS


is also at the heart of numerous open-source data warehouse alternatives,
dubbed ”data lakes” by some. Organizations used the HDFS in
most large-scale implementations because it facilitated fault-tolerance
and was installed on low-cost commodity hardware [4]. When used with

Figure 1.1: Generic Hadoop Architecture

a web search and related applications, such systems can scale to


thousands of nodes and petabytes of user data. This system must be able
to withstand server failures with ease.

1.3.2 HDFS Architecture

HDFS is a cloud file system and one of the Hadoop framework’s most
important components. HDFS is a distributed file system based on Java.
HDFS is a file system that runs on top of the native file system. For
example, HDFS is installed on top of the ext3, ext4, and XFS file
systems in the Linux operating system. It simultaneously saves and
retrieves various documents from several connected nodes. HDFS is one
of the essential components in the Hadoop architecture, and it is
responsible for data storage. Hadoop’s storage architecture is distributed
across multiple servers to improve reliability and lower costs.

The HDFS provides a framework for managing and storing files in


a distributed environment. Although HDFS depicts storage capacity as a
single unit, data is stored in dispersed clusters across numerous

computers. NameNode is the name of the core dedicated server in this
environment, and it holds the file system metadata [5]. The
application data is stored in the DataNodes. The information security
technique is characterised by various ways of providing reliable storage
on numerous DataNodes. The HDFS architecture
is depicted in Figure 1.2.

Figure 1.2: HDFS Architecture

The HDFS works in a client-server mode using a master-slave
configuration. The NameNode represents the master node of the file system,
which it manages by preserving the namespace (metadata) information.
The DataNodes act as the slave nodes and are in charge of handling and
managing the actual data blocks. Large files are separated into
blocks with a default size of 64 MB in Hadoop version 1 and 128 MB
from Hadoop version 2.

Furthermore, as illustrated in Figure 1.1, each block is duplicated at


least three times by default, with two copies saved on the same rack but
different physical servers and one copy on a separate rack [6]. The
replication is done to improve dependability and to account for the
possibility of server outages. When a client programme (cloud
application) requests to write/read a file, NameNode examines the file
size and responds to the client with an appropriate amount of DataNode

locations (meta-data) and offset. The client makes direct contact with
the DataNode once it receives the DataNode locations from NameNode.
In the case of a write, the client appends data to the relevant
first slave server. After that, the slave servers replicate the data to the
other slave servers sequentially.
NameNode: To alleviate the latency caused by disc reads, NameNode
saves metadata in memory for speedier retrieval [7]. Filename, file path,
number of blocks, block ID, block location, slave-related parameters,
and so on are all metadata parameters.

These parameters are organised into directory and file specifications in
a structured namespace. These folders and files are represented in
the NameNode as inodes in the Hadoop approach. The inode stores
information such as the disc space quota and access time by utilising
various attributes. A colossal file is divided into blocks with related IDs,
and this kind of partial data placement characterises the cluster in a
cloud system. There are hundreds of HDFS clients and
DataNodes in each cluster. The NameNode is in charge of identifying a file’s
specific information, location, and size. The heartbeat mechanism
allows the NameNode to keep track of the DataNodes.
DataNode: The data is stored in the DataNodes of the cloud system as data
blocks. These DataNodes make use of the local native file system [8].
The information is saved in two critical files. Metadata is stored in one
file, whereas information on time stamping and checksums is stored in
another. SSH (Secure Shell) allows DataNodes and NameNodes to
communicate efficiently in a Hadoop cluster. Every 3 seconds, every
slave machine sends a heartbeat message to the NameNode, indicating
that it is alive. When a NameNode does not get a heartbeat from a
DataNode for 10 minutes, the NameNode deems that DataNode to be
dead and begins the Block replication process on another DataNode.

All Data Nodes in the Hadoop cluster are synchronized so that they
can communicate with one another and ensure that:

i. The data in the system is balanced;

ii. Data is moved to maintain high replication; and

iii. When necessary, copy data.

HDFS Client: The HDFS defines schema to create, store, and delete
files on DataNodes with application-specific access. Figure 1.3 depicts
the HDFS client’s connection to the HDFS storage area.

Figure 1.3: Working of HDFS Client

HDFS generates a pipeline; the client puts data into the pipe,
conducting various operations on the data files. The file can be deleted,
written to, and updated using the HDFS client. The client can connect
directly to the DataNode for accessing or transferring a required block.
The client conducts the write/read action, and data is updated on
NameNode and DataNode. The NameNode is accessed first, followed
by the DataNode for mapping to submit the request. The client also

acquires the node to node mapping for data replication to transfer the
data. The block-specific procedures are carried out in a well-organized
manner. The acknowledgements are completed in reverse
order once the copies of a block have been written. The last slave server
informs the primary slave machine that the other slave servers hold their
copies. The primary slave server then acknowledges the client. Once the
data has been successfully written, the client alerts the master (NameNode)
server. The master server updates the metadata, and the client closes
the pipeline.

1.3.3 Anatomy of a file write

Figure 1.4: Anatomy of a File Write

To save a file in the HDFS, the client uses the distributed file system
to send a write operation to NameNode, as shown in Figure 1.4.
NameNode determines which DataNodes are available. NameNode
responds with a sufficient number of DataNode locations for each block
(with replication) based on file size. A client publishes the blocks of a
file to HDFS in parallel on various machines after obtaining the
DataNode locations (DataNodes). However, the block is replicated

sequentially. The DataNodes accomplish this by creating a pipeline for
every block and copying the blocks one by one. The write is considered
complete when the acknowledge (ACK) is passed back from the last DataNode
through the previous DataNodes until it reaches the primary DataNode. The
client is then informed that the principal DataNode has successfully
written the block. At the same time, the DataNodes acknowledge the
NameNode through heartbeat messages. The client disconnects from the DataNodes
in question.
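
As a concrete illustration of this write path, the following minimal sketch
(not taken from this work) uses the Hadoop Java FileSystem API; the NameNode
address and the target path are hypothetical placeholders. The client only
asks the NameNode for metadata, while the bytes themselves travel through the
DataNode pipeline described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; this address is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        // create() contacts the NameNode for block locations; the data is
        // then streamed directly into the DataNode replication pipeline.
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"))) {
            out.writeBytes("Hello HDFS\n");
        }
    }
}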

1.3.4 Anatomy of a file read

Consider a client/user node that requests to read a file from HDFS. If the file
is 100 megabytes, there will be two blocks: one of 64 megabytes and the
other of 36 megabytes. The client node issues a request to the
distributed file system (software in the library), which is forwarded to the
NameNode. The NameNode verifies the namespace (metadata) to see whether the
requested file is present in HDFS. If it exists, it responds by sending the
block positions to the user. As shown in Figure 1.5, once the client has the
coordinates of the file blocks, it directly accesses the DataNodes where the
file blocks exist.
This process of reading/writing a file to HDFS is composed of two halves:

1. Hadoop clients use Hadoop’s Remote Procedure Call (RPC)

library to access most Hadoop services (the NameNode) to open or
create a folder. In insecure versions of Hadoop, the user’s login name
is obtained from the client OS and delivered to the NameNode and
DataNodes as part of the connection setup; this is insecure
because a knowledgeable client who understands the protocol can
substitute any user-id.

2. Once the client has the locations, a streaming (pipeline) connection


is utilised to read or write a file block at a DataNode using the Hyper
Text Transfer protocol (HTTP).

Figure 1.5: Anatomy of a File Read
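
To complement the write example given earlier, the sketch below shows the
corresponding read path with the same hypothetical NameNode address: open()
obtains the block locations from the NameNode, and the blocks are then
streamed directly from the DataNodes that hold them.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address

        // open() asks the NameNode for block locations, then the client
        // reads each block directly from the DataNodes that store it.
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/user/demo/sample.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}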

1.3.5 MapReduce

The Apache Hadoop Yet Another Resource Negotiator (YARN)


component is also known as MR 2.0 [9]. Graph, interactive, stream, and
batch processing are possible with YARN. Figure 1.6 shows the
components of Hadoop YARN: Resource Manager, Node Manager,
Application Master, and Container.
Resource Manager: The Resource Manager is in charge of resource
allocation when a job is submitted to it by the client.
The scheduler is in charge of scheduling the submitted jobs for execution.
The resource manager has two scheduler plugins: the Capacity and Fair schedulers.
The resource manager is in charge of the following tasks:

i. Initialization of job to node managers.

ii. Monitor the job execution status by receiving heartbeat messages


from the node manager.

iii. It assigns application master and containers to node manager for job
execution.

Figure 1.6: Architecture of Hadoop YARN

Node Manager: It registers with the resource manager and periodically sends

heartbeat signals to it. Running the application master and the containers
assigned by the resource manager are the node manager’s primary
responsibilities. Execution takes place at the node manager, which connects
with the DataNode. Once a job is completed successfully, the node manager
responds to the resource manager with an ACK of the job completion status.
After that, the resource manager provides the client
with the job execution response.
Application Master: The resource manager creates an application
master for each application. It is in charge of managing and coordinating
the application’s jobs. The application master negotiates with the resource
manager for the appropriate resources (CPU, memory, disk, and
network).
Containers: A container gathers physical resources from the underlying machine,
such as RAM, CPU cores, and drives. Containers are handled through the
Container Launch Context (CLC). A particular amount of resources is
allocated to the application master on request.

MR runs on top of the Apache Hadoop YARN

framework and provides the computational platform. MR is a parallel
and distributed programming approach that runs on commodity
equipment. MR in Hadoop allows clients to develop simple applications
to process enormous data. The MR concept provides mapper and
Reducer functions. The input/output data in MR is executed in
key-value pairs [10]. Figure 1.7 depicts the MR execution flow.

Figure 1.7: MR Execution Flow

Mapper: The mapper’s job is to take a data block as an input as a


key/value (K1, V1) pair, process it according to code, and construct an
intermediate key/value (K2, V2) pair.
Shuffle and Sort: The intermediate data generated from the mapper
phase is then shuffled and stored accordingly.

Reduce: The shuffle and sort phase’s output (K2, V2) will be used as
input for the Reducer phase. Key K2 remains the same in the reducer
phase and performs a function operation on V2 (K2, f(V2)) to yield the
final output as a Key/Value (K3, V3) pair.
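
The classic word-count job below is a minimal sketch of this (K1, V1) to
(K3, V3) execution flow using the standard Hadoop MapReduce Java API; it is a
generic textbook example rather than code from this thesis, and the class and
path names are illustrative. Such a job is typically packaged as a JAR and
submitted to YARN, which schedules the map and reduce tasks across the cluster.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: (K1 = byte offset, V1 = line of text) -> intermediate (K2 = word, V2 = 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: (K2 = word, list of V2) -> (K3 = word, V3 = total count)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}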

1.4 Cloud Computing Framework


The word ’cloud’ is commonly used as a metaphor for the Internet, and cloud
computing services are supplied over the Internet. Customers use the
broad network (cloud) to access a collection of remotely connected
resources (computing), as depicted in Figure 1.8. Customers have the
ability to process, manage, and store data from anywhere [11, 12].

Figure 1.8: Cloud Computing

In remotely connected resources, a set of heterogeneous commodity


hardware is used. This commodity hardware processes customers’
workloads. It will include a flexible environment to host and analyse

user data correctly.

Cloud computing aims to give clients seemingly unlimited computational

capacity seamlessly. In cloud computing, data is distributed across many
network servers with specific connections. Instead of installing a full
software suite on each computer, this method requires only a single software
package to be installed on each machine, through which the services are accessed.

The user can use this technology to access a web-based service.


In a cloud computing system, the workload shifts away from the user: local
computers no longer have to bear the entire strain of executing
programmes. Customers (individuals/organizations) benefit from
cloud computing technology because it is simple, adaptable, reliable,
and cost-effective. On the user’s end, hardware and software costs are
reduced. Clients use browsers to operate and access the cloud services
enabled. Clients see the user-friendly front end, while cloud providers
hide the sophisticated back end technology. The cloud’s back end
consists of a database system, many computers, and a server. Users must
enable internet access on their devices to use cloud services. By
connecting to the cloud, the user can access programmes from any
device at any time. Dropbox, Google Docs, Google Calendar, and Gmail
are examples of real-time apps. Cloud computing allows for the
dissemination of services across an extensive global network. There are
numerous resources and people sharing forms in a global environment,
depending on the demand. The cloud has been divided into three
primary groups, which are stated below, depending on scope, sharing,
and access restrictions. Figure 1.9 depicts cloud computing’s varied
deployment, delivery, and important fundamental qualities.

1.4.1 Types/Deployment Models of Cloud

Private Cloud: Private Cloud refers to a cloud environment that is


organised and applied to a single individual or supplied for exclusive

Figure 1.9: Cloud Computing Model

usage by a single business. This form of cloud’s services are available


within the corporation [13, 14]. Such a cloud will be ideal for highly
sophisticated commercial environments where service security is
paramount. The data centre-specific security limits are also described
for this constrained regional network. Security, which includes
authorisation and authentication standards, is one of the most critical
needs. Private networks are accessible worldwide due to the state of
corporate authentication data. This model may be hosted on-premise or
externally on a third-party cloud provider’s premises.
Public Cloud: This cloud type’s availability and scope are enormous,
implying that it may be used by a person or a vast business [15]. Anyone
can utilize this form of cloud, and generic services are available
in-network. Authorization and authentication are available either
through a free client account or without charge. The authentication
limitations are lessened in this cloud version. Data is transferred with

high-security constraints to protect against attackers in some specialized
applications.
Hybrid Cloud: The functionality, features, and criticalities of both
(private and public) cloud types are combined in a hybrid cloud [15].
These cloud models are linked in the private and public domains to share
resources and services. Specialized and generalist services are available
and shared in the same environment, subject to authorisation and
authentication requirements. The sharing rules and different security
limitations are specified based on public and organisational access to
these cloud services. Corporations often supply add-on services such as
load balancing and cloud bursting to other cloud models in the hybrid
cloud.
Community Cloud: This form of cloud is set up for a certain group of
enterprises to utilise exclusively [15]. This form of cloud is adaptable
enough to be utilised by a community of businesses with similar
qualities, such as educational institutions and banks. Usage
authorization, authentication rules, and architecture may all be described
openly to organisation users. Ownership, management, and operations
for corporate clients may be carried out by many different parties
(internal-external). A community cloud is a type of organisation
confined to a cloud system. Community models represent the different
degrees of authorization with on-campus and off-campus accessibility.

1.4.2 Service/Delivery Models

Hundreds of service models are available on the market. All service


models are categorized into three types of delivery models. Software As
A Service (SAAS), Platform As A Service (PAAS), and Infrastructure
As A Service (IAAS) are three types of cloud computing services. This
section goes through these service delivery models in detail. These
service delivery models are primarily found in a cloud computing
context.

IAAS: Providers of IaaS host client virtual machines or supply network
storage. The IAAS model is directly linked to the hardware and shared
[16]. Networking, storage (NAS/SAN), servers, and virtualization all belong
to the bare-metal hardware layer. These components include things like a data
centre, a storage disc, memory, and a CPU, among others. The users are
not aware of these hardware components. Rather than actual resources,
virtual resources are provided to end-users in the cloud system. The
characterisation of these resources may be done based on sharing
elements, support, and scalability. This layer of the cloud system
infrastructure also includes security measures such as intrusion
detection, intrusion prevention, and a firewall. Amazon Web
Services (AWS) Elastic Compute Cloud (EC2), GoGrid, Flexiscale,
Windows Azure, RackSpace, Google Compute Engine, Hewlett Packard
Enterprise, DigitalOcean, and others are examples of such service
providers.
PAAS: Users are provided with a network-hosted software development
environment in PAAS [17]. Providers maintain the underlying
infrastructure, operating system, programming platform, and so forth.
Because the providers abstract these fundamental resources for security
reasons, customers will not have access to them. The resources allocated
by multiple suppliers to the application developer are measured on a
request basis. Clients throughout the world will be able to access apps
that have been created and deployed on such a platform. The client can
acquire service or server access here without obtaining any physical or
core data. PAAS can be used to enable resource management, data
sharing, user authentication, hardware feature characterization, and
application-level control. The cloud platform’s adaptability can be
explained by the fact that it supports multiple programming languages. This layer also provides the
security constraint integration, party relationship, and development
cycle. SalesForce.com, Amazon AWS Elastic Beanstalk, Google App
Engine, Microsoft Azure, IBM BlueMix, RedHat Software, OpenShift,

VMWare Pivotal Software, and Heroku are just a few examples of such
service providers.
SAAS: SAAS provides network-hosted (remote) applications to end
consumers, which are often accessed via mobile client apps or browsers.
Users can make a service request under this model. They don’t have
control over any of the cloud stack’s tiers. The cloud service provider is
responsible for handling, managing, maintaining, and monitoring the
complete cloud stack. Google Apps, Salesforce.com [17], Workday,
Citrix GoToMeeting, CiscoWebEx, DropBox, Paycom, Splunk,
HubSpot, and Zohosuite.com are a few examples of SAAS providers.

1.4.3 Key Essential Characteristics

Resource Pooling: Multiple customers share resources with enough


segmentation, so each user has their own control and consumption area.
On-Demand Self-Service: Customers may easily supply computer
resources on demand without the need for human intervention or aid
from the help desk.
Broad-Network Access: Capabilities are openly available over the
network and across devices.
Rapid Elasticity: Users can use flexible and automated provisioning and
de-provisioning to meet their needs. The customer appears to have nearly
infinite resources.
Measured Services: Resource usage is transparently monitored, controlled,
and reported to both the provider and the consumer (tenant).

1.5 Hadoop in Cloud Computing


Cloud providers use Hadoop clusters to store and analyze enormous
volumes of data sets in a distributed and parallel manner throughout the
cluster. It’s built to scale up to tens of thousands of computers from a
single server that provides storage and computing locally. On the other

hand, cloud computing encompasses a variety of computing concepts
that incorporate a large number of machines connected via a real-time
communication network. Cloud computing focuses on scalable,
on-demand, and adaptive service architectures. In cloud computing,
Cloud MapReduce is a replacement for MapReduce [18]. The
fundamental distinction between Hadoop and Cloud MapReduce is that
Cloud MapReduce does not provide its own infrastructure; instead, it
depends on the infrastructure provided by various cloud service
providers. Hadoop is an ‘ecosystem’ of open-source software
that is widely used on industry-standard
hardware.

1.5.1 Need for Hadoop in Cloud

Since the term ”cloud” has been defined, it is clear what the
phrase ”Hadoop in the cloud” means: operating Hadoop clusters
on resources provided by a cloud provider, as illustrated in Figure 1.10.
This practice is regularly contrasted with Hadoop clusters that
operate on an organisation’s own hardware, referred to as ”on-prem” or
”on-premises” clusters.

If you are already acquainted with running Hadoop clusters

on-premises, you will find that much of your knowledge and practice
translates well to the cloud. A cloud instance should behave just like a typical
server that you can connect to remotely, with a certain amount of
disk space, a certain number of CPU cores, and root access, among
other things. When instances are properly grouped and made accessible,
you may imagine them running in a traditional data centre rather than in a
cloud provider’s data centre. This abstraction is intentional, so that a
cloud provider feels familiar and your skills apply regardless of the
circumstances.

Figure 1.10: Hadoop Usage in Cloud Computing

That is not to say that there is nothing more to learn or that the abstraction
is complete. There are many choices and a variety of provider
characteristics to understand and investigate, so that the
customer can build not just a functional framework but a well-tuned arrangement
of Hadoop clusters. Cloud providers also provide services beyond what
you can accomplish on-premises, and the Hadoop cluster may benefit
from them.

Hadoop clusters rarely operate in isolation. Non-Hadoop servers and
applications support the clusters [19], among other things, and the supporting
assets surrounding them supervise the information flowing in and out and have
their own tooling. This supporting cast can also run in the cloud, or dedicated
systems-management features can help bring them closer together.

1.6 Security threats and possible attacks
Security is a set of systems, policies, and technologies that work together
to keep networks, computers, programmes, data, and information safe
against attacks, harm, and illegal access. A threat is someone who has the
ability to harm a system or an organisation. Natural (earthquakes, floods,
or tornadoes) risks, accidental (staff errors that result in the deletion or
disclosure of private data) threats, and purposeful (spyware, malware)
threats are all possible. The following cryptographic security elements
should be enabled in every system:

• Data Confidentiality (DC)

• Data integrity

• Data Encryption/Decryption

DC: Only an authorised identity can access data or an information


system in DC. To protect data confidentiality, user IDs and passwords,
Access Control Lists (ACL), and policy-based security measures are
utilised. Attacks that may jeopardise data confidentiality are:

• Cracking (decrypting/deciphering).

• Man-in-Middle attacks on plain text.

• Data leakage scenarios: unauthorized copying of sensitive data.

• Installing spyware/malware on a data server to transmit its private


data to attackers.

Data Integrity: Only authorised individuals have access to data, which


is kept in its original condition while not in use. Integrity is ensured by
the use of data encryption and hashing methods.
Data Encryption/Decryption: The process of converting plain text to
encrypted text and vice versa is known as data encryption/decryption.
Both symmetric and asymmetric cryptographic techniques can be used

to encrypt and decrypt data. Symmetric algorithms are further divided
into block and stream cyphers. Public-key cryptography algorithms are
sometimes known as asymmetric algorithms.
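
The short sketch below illustrates symmetric encryption and decryption with
the Java Cryptography Extension, using Blowfish (one of the block ciphers
considered in this work); it is only a minimal illustration of the concept,
not the proposed scheme. The cipher transformation shown relies on the
provider’s default mode and padding, which a production system would
normally specify explicitly along with an initialization vector.

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class SymmetricCipherDemo {
    public static void main(String[] args) throws Exception {
        // Generate a random 128-bit key for the Blowfish block cipher.
        KeyGenerator keyGen = KeyGenerator.getInstance("Blowfish");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        // Encrypt: plain text -> cipher text.
        Cipher cipher = Cipher.getInstance("Blowfish");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] cipherText = cipher.doFinal("sensitive record".getBytes(StandardCharsets.UTF_8));
        System.out.println("Encrypted: " + Base64.getEncoder().encodeToString(cipherText));

        // Decrypt: cipher text -> plain text, using the same (symmetric) key.
        cipher.init(Cipher.DECRYPT_MODE, key);
        System.out.println("Decrypted: "
                + new String(cipher.doFinal(cipherText), StandardCharsets.UTF_8));
    }
}
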
Some of the attacks on cryptographic systems are:

• Browser-Based Attacks

• Brute-Force Attack

• Denial of service attack

• Replay attack

• Man-in-the-Middle-Attack

• Dictionary Attack

Browser-Based Attacks A web browser, which is the most widely used


method of accessing the internet, is utilised to compromise a system in
such assaults. These attacks are frequently carried out on legitimate
websites with known flaws (especially to the attacker). An attacker takes
advantage of the site and uses malicious scripts or processes to infect it.
When users try to access the infected website using a web browser, it
tries to install malware on their computers by exploiting
browser vulnerabilities.
Brute-Force Attack To break the password for a target
(service/system/device), the attacker uses a series of permutations or
fuzzing processes. Because establishing many passwords to test against
a target is a time-consuming activity, attackers typically use tools like
fuzzer to automate the process.
Denial of Service (DoS) attack DoS attacks aim to ”overwhelm” a
target – such as websites, FTP / DNS / NTP servers, and so on – with
traffic floods [20]. The goal of these assaults is to slow down or crash
the target, reducing its availability. Distributed Denial of Service
(DDoS) is a type of DoS attack in which numerous attack sites are used

to carry out the assault. The purpose, though, remains the same. DDoS
is more effective and difficult to counteract. DDoS is occasionally
employed as a diversionary tactic to draw security personnel’s attention
away from a surreptitiously carried out large attack.
Replay attack A replay attack is designed to stymie processing by repeatedly
delivering data to the host. In the absence of safeguards in the receiving
service, such as time stamping, one-time tokens, or sequence verification
codes, the system may process duplicate files.
Man-in-the-Middle-Attack A Man-in-the-Middle attack occurs when an
attacker can intercept and alter the data exchanged between two legitimate users.
Dictionary Attack When it comes to password files, the dictionary
attack is the most popular. A dictionary attack is used to overcome an
authentication scheme by systematically inputting each word to explain
the decryption key of encrypted communication. It takes advantage of
users’ bad habits of choosing easy passwords based on unique phrases.
The dictionary attack encrypts all of the words in a dictionary, then
compares the hash to an encrypted password saved in the SAM file or
other password files. As a result, the dictionary attack is always quicker
than the brute force approach.
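
As an illustration of why a dictionary attack is faster than brute force, the
hypothetical sketch below hashes only the candidate words in a small list and
compares each digest with the stored password hash; the word list and target
password are invented for the example, and real password files would use salted
hashes rather than a plain digest.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

public class DictionaryAttackDemo {
    public static void main(String[] args) throws Exception {
        // Hash of the unknown password, as it might appear in a leaked password file.
        String targetHash = sha256("sunshine");

        // Hypothetical word list; a real attack would load a large dictionary from disk.
        List<String> dictionary = List.of("password", "letmein", "qwerty", "sunshine", "dragon");

        // Hash each candidate word and compare it against the stored hash.
        for (String candidate : dictionary) {
            if (sha256(candidate).equals(targetHash)) {
                System.out.println("Password recovered: " + candidate);
                return;
            }
        }
        System.out.println("No match found in dictionary.");
    }

    private static String sha256(String input) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}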

1.7 Technologies for security - Securing Hadoop


Solution
As Hadoop becomes more ubiquitous in the IT world and is heavily used in
production environments, the same security considerations that apply to
IT systems like databases also apply to Hadoop. Hadoop was not
designed with typical security measures in mind in its early years, but
implementing enterprise-grade protection features is an important aspect
of Hadoop’s maturation. It is also an important commercial consideration:
a supplier that is unable to give security assurances cannot serve a variety
of applications, such as finance, healthcare, banking, social media, and e-commerce.
For securing the information, three superior aspects were considered:

i. Perimeter management

ii. Access control

iii. Encryption

(i) Perimeter management: The first step in IT security is to carefully


control the borders between your network and the rest of the world.
Because Hadoop is a distributed system that spans several machines,
enabling perimeter security poses problems. A Hadoop cluster is a
distributed computing platform with many individual workstations, each
of which provides multiple open ports and services. Security is as bad as
you’d expect, and most administrators deal with it by keeping the cluster
on a separate network. The difficulty arises from the fact that customers
must run their own programmes against Hadoop.

Consider using NameNodes as a bridge between clients and the


Hadoop cluster with shared networking. However, the method raises
security concerns. To solve this issue, the Hortonworks team has begun
work on the Apache Knox Project [21, 22], which enables secure access
to Hadoop cluster services. Knox makes service-level authorization
easier by authenticating clients at the perimeter using a token
verification technique. Knox does not replace Kerberos, although it does
simplify the client’s settings.
(ii) Access control: The control of access is an important aspect of the
security debate; where perimeter control provides protection by restricting
the access points, access control restricts what authenticated users may do.
Apache Ranger [22, 23] is one such product that allows for centralised
management of access control and auditing methods. The IT security policies
for safeguarding the environment can be achieved by combining both
technologies, Apache Ranger and Apache Knox.
(iii) Encryption: Managing the perimeter and restricting access limit what
can happen if a breach occurs, but the data itself must also be protected;
encryption may be the last line of defence.

Encryption as an option for any data saved on disc in HDFS is
currently being worked on by the Hadoop community. Intel’s project Rhino
[22, 24] got a head start on this, allowing data encryption in
HDFS through encryption instructions in the Intel CPUs used in Hadoop
slave nodes. Third-party technologies are also available to encrypt data
stored in HDFS.

1.8 Major Challenges


The Hadoop architecture presents many security issues that we don’t see
in a typical data management system. The following are the problems
that have to be overcome in order to improve Hadoop security.

• Data size and Storage Criticality

• Distributed Nature and Fragmented Data

1.8.1 Data Size and Storage Criticality

The data criticality is defined by significant data characterizations such


as veracity, value, volume, velocity, and variety. Twitter,
Facebook, Instagram, Youtube, and Netflix are just a few examples of
cloud-based apps that create large amounts of data in a short amount of
time. The quantity and complexity of data provide several issues in
developing long-term security techniques. The current security
procedures are neither designed nor appropriate for massive data use. To
handle and analyze such massive data, the tools (Pig, Hive, HBase, and
Mahout) coupled with the Hadoop framework require more robust
security capabilities.

1.8.2 Distributed Nature and Fragmented Data

Hadoop operates in a distributed environment with hundreds of nodes


that operate independently. The cluster’s heterogeneity offers a number
of issues in terms of establishing security measures. Data is also broken

down into little chunks and stored on multiple nodes. When a client runs
a map-reduce operation, it generates intermediate data that is stored to
disc before being shuffled to other nodes in a distributed environment,
which might be vulnerable to attack. The formation of communication
between Hadoop components offers a challenge for fine-grained security
implementation.

1.9 Security issues in Hadoop and Cloud Computing


This section focuses on challenges in providing security for HDFS and
cloud computing. The following sections will briefly explain the security
issues in both HDFS and cloud.

1.9.1 Security in Hadoop

When considering Hadoop security, it is important to remember how

Hadoop was conceptualized and developed. When Doug Cutting and
Mike Cafarella started working on it, Hadoop was not created with security
in mind. Hadoop was designed to process a large amount of online data
in a private or on-premises network; therefore, security was not a
priority. Another problem is that Hadoop was not designed and built as a
solid architecture with predefined components. Instead, it has evolved
into a collection of modules that either correspond to other open-source
projects or include several (exclusive) augmentations developed by
various companies to provide functionality currently absent in the
Hadoop ecosystem.

Hadoop is now moving from a proof-of-concept or early-stage innovation to a high-level business and corporate application. These new clients require a strategy for ensuring the security of sensitive company information. Kerberos security is now the standard method for creating a secure Hadoop cluster environment [25].

Kerberos validation is now fully supported in Hadoop and its major sub-projects (Pig, Hive, Sqoop, HBase, Mahout, and so on). However, this only applies to a certain level of validation. With Kerberos alone, there is still no reliable intrinsic technique to classify client tasks for better control across sub-projects, no genuine means to lock down access to Hadoop processes (or daemons), and no way to encrypt information in transit (or even at rest).

Hadoop’s distributed approach allows for the storage of large amounts of data as well as simultaneous processing. With the help of
HDFS, it is possible to process enormous amounts of data in the
terabytes or petabytes range and provide high throughput access to this
data. The files are copied across many machines for concurrent
operations to provide dependability and high availability. When
handling sensitive or personal data in a dispersed setting, secure
computing is required. The attacker’s primary goal is to monitor the
vulnerable system, and in such circumstances, Hadoop becomes the
starting point because of the massive amount of data stored there.

Hadoop’s current version has a simple fundamental implementation, making the Hadoop cluster prone to the following attacks.

• Hadoop does not authenticate the client before allowing them to access HDFS.

• By bypassing the NameNode, one can reach a DataNode directly and eavesdrop on data packets sent by DataNodes.

• Anyone can connect to a DataNode and request data without first connecting to the NameNode.

• Failing DataNodes in the Hadoop cluster may tend to leak information.

• A malicious user with network access might intercept an inter-node conversation.

• Big data stacks were created with little or limited security. Prevalent big data installations are constructed on the web-services paradigm, with limited features for preventing common web threats.

Third-party permission and auditing mechanisms were utilised in previous distributions. Such access control was readily bypassed since any user could imitate any other user. Since most users accessed the cluster and performed impersonation, this security-control solution was no longer practical. Later, authentication and authorisation were added, although they were found to have several flaws.

All programmers and users in the cluster had the same degree of access to all data. Because MapReduce has no concept of authorisation or authentication, a nefarious user could reduce the priority of other Hadoop jobs to make their own work complete faster.

Firewalls, the current Kerberos implementation, ACLs, and appropriate HDFS permissions are some of the security elements that the Hadoop community supports. Kerberos is not a mandatory requirement for a Hadoop cluster, making it possible to run the whole cluster without deploying or installing any security. Kerberos [26] is also difficult to install and configure on a cluster of dispersed servers, as well as to coordinate with Lightweight Directory Access Protocol (LDAP) and Active Directory (AD) services.

The firewall does not effectively handle Hadoop security; once the firewall is penetrated, the cluster is wide open to attackers. The firewall provides no security for data in motion or data at rest within the cluster, and it offers no protection from attacks originating within the firewall perimeter after a security failure.

1.9.2 Security in Cloud

Cloud computing is a relatively new technology that has lately received a lot of interest in various fields, including industry and academia. Data leakage
prevention is also a major concern in the cloud computing context.
Identity-based authentication, dynamic intrusion detection,
Diffie-Hellman key exchange, Media Access Control (MAC), secure
access control, and the ElGamal cryptosystem are only a few of the
security mechanisms suggested in cloud computing [27]. To access
encrypted, massive multidimensional data in an untrusted heterogeneous
distributed system environment, cipher-text policy attribute-based
encryption is utilised [28]. A security integration of cloud computing
and Hadoop MapReduce for healthcare application and traffic
management is illustrated in [29].

The sharing and transaction of vital data take place in an open environment, and they are vulnerable to external and internal threats.
Security control mechanisms are required at several cloud computing
tiers to provide data exchange security. Service-level, hardware-level,
and user-level attacks are all possible. The following are some of the
security issues with cloud computing:

i. SAAS Security Issues

ii. IAAS Security Issues

iii. Storage Security

iv. Network Security

(i) SAAS Security Issues: The complete suite of software given to customers is managed by cloud service providers. As a result, it is the responsibility of application providers to secure the underlying resources. Such a programme implements well-known web-service ideas and provides an API that any other application may use. If the initial web-service API is compromised, attackers will have access to all other web-service applications that utilise it. The cloud environment offers function-driven analysis to detect anomalous and harmful actions. Along with these security issues, other data breaches can be found, resulting in various kinds of data damage. Buffer overflows, integration issues, implementation faults, design flaws, and other errors are examples of these flaws. Security concerns in this type of service delivery are:

1. Lack of compliance standards maintenance.

2. Inability to access Cloud Service Providers’ operations.

3. Lack of effective mechanisms in authentication and authorization facilities.

(ii) IAAS Security Issues: Infrastructure that underpins considerable business-process preparations presents additional major problems in the cloud environment. Because virtualization is so important to any cloud service provider, any assault on this layer might be disastrous for all of the VMs (virtual machines). One such assault was disclosed at Vaserv.com, a UK-based corporation, and referred to as a zero-day vulnerability. It is suggested that comprehensive security parameters be implemented at the infrastructure level, as this is the foundation for the other two service models.
(iii) Storage Security: Files and their data specifications form the greatest repository of a cloud storage system. Storage space is provided to both private and public customers. Data retention, data erasure, residual data, and data leakage are just a few of the security risks in storage systems.
(iv) Network Security: There are no additional threats in the private cloud, but changes in security needs may modify the network topology when using public cloud provider services. The concerns at the network layer include how an organisation ensures that data-in-transit is safe, and whether public cloud service providers correctly age IP addresses, which might otherwise lead to unauthorised access to network resources. Some network layer assaults, such as Border Gateway Protocol (BGP) attacks, DDoS, and DoS, may render services inaccessible.

1.10 Necessities
Hadoop was designed to manage enormous amounts of web data in an environment where security was not a priority. Hadoop gained a reputation as an insecure platform as its use grew and it evolved into an enterprise tool. Because the built-in security and available settings differ between release versions, security is inconsistent across Hadoop releases. As the digital universe expands and Hadoop is adopted in practically every industry, including business, banking, health care, military, education, and government, security becomes a key problem. Prior implementations lacked security features, and this affected a wide range of industries. Engineers eventually discovered the huge oversight that Hadoop did not come with any security software, which had an impact on many Hadoop-based applications. As a result, Hadoop security is the next important step for the Hadoop Framework. HDFS is a commodity-hardware-based distributed file system that offers data storage for other Apache Hadoop ecosystem components like MR and Apache HBase. Storing sensitive data unprotected in a distributed environment is an unacceptable risk for use cases that deal with personal information or financial records. In HDFS, we frequently need to encrypt data in order to store it safely at rest. This concern necessitates the development of cryptographic algorithms that ensure end-to-end data security for clients; using these algorithms to enhance security against various attacks can be achieved by integrating cryptosystems into HDFS.
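As a minimal, hedged illustration of the kind of client-side symmetric encryption such integration builds on, the following Java sketch encrypts a local file with AES before it would be written to HDFS. The key handling, cipher mode, and file paths are placeholders for illustration only and are not the implementation developed in this thesis.

import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.io.FileInputStream;
import java.io.FileOutputStream;

public class AesFileEncryptDemo {
    public static void main(String[] args) throws Exception {
        // Generate a 128-bit AES key; in practice the key would come from a
        // key management service (KMS), not be generated ad hoc like this.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        // AES in ECB mode with PKCS5 padding keeps the sketch short;
        // a real deployment would prefer an authenticated mode such as GCM.
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, key);

        // Stream-encrypt the plaintext file into a ciphertext file that
        // could then be uploaded to HDFS (paths are illustrative).
        try (FileInputStream in = new FileInputStream("plain.txt");
             CipherOutputStream out =
                     new CipherOutputStream(new FileOutputStream("cipher.bin"), cipher)) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}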

1.11 Research Motivation
With the growing popularity of the Hadoop and Cloud Computing environments, security issues are expanding along with the development of these technologies. Even though Hadoop and Cloud Computing can deliver numerous advantages, a few threats/attacks may leave them defenseless. In many deployments of these environments, critical security components remain underutilized. The motivation for the proposed work is to provide security at the data level in such circumstances and thereby upgrade overall security. Consequently, a need has emerged for a legitimate solution that secures the components, closes the loopholes, and addresses the difficulties exposed to attacks on distributed computing and Hadoop, so as to lessen infrastructure and platform attacks.

1.12 Problem Statement


The distinct focus of the HDFS-enabled cloud is to guarantee the security of files stored in Hadoop, whether encrypted or not, and to handle the large volume of the dataset. Hadoop is accompanied by numerous security threats: replay attacks, impersonation attacks, DataNode impersonating attacks, stolen verifier attacks, brute-forcing attacks, known-plaintext attacks, etc. These various threats need a novel and proficient security model that will mitigate such sensitive attacks before the big data is stored in HDFS. Today, with advances in innovation and technology, a massive amount of data is generated from numerous sources and handled in HDFS-enabled cloud environments that give clients simple access to store their files. The proposed technique provides a significant degree of safety for clients’ information stored in HDFS-enabled cloud environments, which support petabytes of data.

1.13 Objectives
The present research work focuses on providing a data security model for the Hadoop cluster in a cloud environment. To accomplish the desired goal of this work, the specific objectives are as follows:

• To study and review various Cloud Delivery Models, Cryptosystems and Hadoop Security issues.

• To configure and deploy a Hadoop Cluster in the cloud environment.

• To provide a proper access control mechanism to the stored files on HDFS.

• To design and develop a secure model that will be suitable for any application.

• To evaluate the performance of implemented symmetric cryptographic algorithms.

1.14 Research Contributions


The present research work focuses on a unified system architecture for improving security in HDFS-enabled Amazon Web Services. The following specific contributions were implemented:

• In phase 1, in the Key Management Server (KMS), the ACL permission is used to achieve secure user-level authentication, which can mitigate Replay Attacks and User Impersonation Attacks.

• In phase 2, in the Custom MR model (Map-Combine-Reduce), we proposed efficient and straightforward symmetric algorithms for data encryption and decryption, such as Advanced Encryption Standard (AES) and Blowfish (BF), to handle data-level security (an illustrative sketch follows this list).

• In phase 3, we proposed transparent end-to-end encryption by incorporating symmetric algorithms like AES, BF and Rivest Cipher 6 (RC6) into Hadoop’s common library. This architecture is appropriate for any Hadoop-based application or tool. When compared to a bespoke MR model, it delivers resilient (data-in-transit or data-at-rest) security, quick processing, and lower overhead.

• In phase 4, we designed and implemented a Modified Blowfish (MBF) algorithm using parallel evaluation and the Adams-Moulton method.

• Finally, the proposed system characteristics have been tested against security assaults using quantitative and qualitative data. The results of the performance evaluations are used to calculate a variety of metrics.
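To make the phase 2 idea concrete, the sketch below shows one plausible way a MapReduce Mapper could encrypt each record value with AES using the standard javax.crypto API. It is an assumption-laden illustration: the configuration property name, key handling, and output format are hypothetical and are not the exact custom Map-Combine-Reduce code developed in this thesis.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EncryptMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private Cipher cipher;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // Hypothetical job property carrying a 16-byte AES key (Base64 encoded).
            byte[] keyBytes = Base64.getDecoder()
                    .decode(context.getConfiguration().get("demo.aes.key"));
            cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(keyBytes, "AES"));
        } catch (Exception e) {
            throw new IOException("Cipher initialisation failed", e);
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        try {
            // Encrypt the record and emit it Base64-encoded so it remains a valid Text value.
            byte[] enc = cipher.doFinal(line.toString().getBytes(StandardCharsets.UTF_8));
            context.write(offset, new Text(Base64.getEncoder().encodeToString(enc)));
        } catch (Exception e) {
            throw new IOException("Encryption failed for record at offset " + offset, e);
        }
    }
}

A matching decryption job would initialise the cipher in DECRYPT_MODE and reverse the Base64 step while reading records back from HDFS.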

1.15 Organization of Thesis


The present work in the form of the thesis is organized as follows:
Chapter 1: Introduction
This Chapter will include an introduction to the Hadoop system.
Similarly, the need to secure Hadoop, HDFS architecture, MR, cloud
computing framework, security in Hadoop and Cloud, security threats
and possible attacks, motivation, research objective, and organization of
thesis will be included in this chapter.
Chapter 2: Literature Review
This chapter will be focused on a review of various existing research
methodologies concerning the proposed work.
Chapter 3: Proposed System
This chapter will be devoted to various proposed security algorithms for
Data Security and a comparative approach to secure the data storage
model in the Hadoop Framework.

Chapter 4: Performance Analysis
This chapter will be devoted to the performance analysis of the
implementation of AES, BF, RC6 and MBF Algorithm in HDFS and
will discuss its suitability. Experimental results and analysis in two
different simulation environments (Locally and Cloud) will be covered
in this chapter.
Chapter 5: Conclusion and Future Work
This chapter will summarize the proposed approach and concluding
remarks about the work done under the scope of this thesis. The
difficulties, limitations, and technical problems encountered and future
research directions associated with this work will be highlighted in this
chapter.

Chapter 2

LITERATURE REVIEW

The literature survey gives a theoretical base for the research, and it is useful for the investigator in assessing the data. For the investigator, it collects information and provides summaries that assess and clarify the literature. Reviews are also helpful in understanding the subject and its significance, identifying the conceptual methodology, and subjecting it to sound reasoning and meaningful interpretation.

Among the enormous number of research contributions and articles, relevant references were reviewed for scrutinizing and criticizing the available data. Several studies have focused on guaranteeing security in Hadoop and Cloud environments. Security can be ensured at different layers, such as authentication and authorization, to maintain the confidentiality, integrity, and availability of the information stored on the system. This section concentrates on related studies covering the following aspects.

2.1 Authentication and Authorization in the Hadoop Framework
G. S. Sadasivam et al. have proposed a dual server authentication
mechanism based on fundamental triangle properties for enhancing
security in the Hadoop cluster. The triangle parameters (calculation of
medians, centroids, Euler line) were used to derive the user password.
They suggested three approaches to authenticate a user to access
components of Hadoop securely. These approaches use an
authentication server, and two backend servers to authenticate the user.
The generation of arbitrary numbers improves the security level in the
Hadoop cluster. They analyzed the communication and computational
performance of these three approaches and found it to be the same [30].

S. H. Park and I. R. Jeong pointed to the vulnerabilities caused by adapting Kerberos [10, 25, 26] with a block access token system. This system is unprotected against replay and impersonation attacks on the block access tokens passed from clients to the DataNode. To defeat this, they proposed another public-key exchange protocol approach that accompanies the access token, which can moderate these weaknesses/drawbacks and provide secure access between the client, NameNode, and DataNode. Moreover, this model will not influence the performance and cost metrics of Hadoop [31].

K. Zheng and W. Jiang explored an innovative method for creating a Kerberos pre-authentication layer on both sides, the client and the KDC (Key Distribution Center). They implemented a Kerberos pre-authentication plan that permits users to verify with a standard token into the KDC and created an MIT plugin, which can be used separately for the new mechanism. On this basis, they designed a token authentication solution for the whole Hadoop stack, which aligns with the OAuth 2.0 authorization solution and management system while avoiding deployment overhead, complication, and risk [32].

P. K. Rahul and T. Gireesh Kumar proposed a novel authentication framework for the Hadoop cluster. The proposed method was developed based on an assessment of the various security issues. The developed framework utilizes different cryptographic algorithms such as symmetric key, public key, random number generator, and hashing methods. It also provides an additional layer between the client and the Hadoop cluster and characterizes two roles: the client/user and the data server. A new key is produced from a varying number using the hashing procedure and is allotted to every client to grant authentication and authorization permissions between different parts of Hadoop. Appropriate data privacy with information integrity is guaranteed by applying symmetric encryption to client data stored on HDFS [33].

N. Somu and V. S. S. Sriram developed a one-time pad based encryption mechanism for the authentication service in the Hadoop cluster. In this methodology, the one-time pad algorithm creates a random key to encrypt the password and provide a secure communication channel between the authentication server and the backend server. This technique makes Hadoop progressively more secure, as a new random key for encryption is created for each login. Using a random key generation process with one-time pad encryption makes Hadoop more secure and mitigates replay and stolen verifier attacks [34].

H. Zhou and Q. Wen have proposed an access control mechanism for Hadoop based on Ciphertext-Policy Attribute-Based Encryption (CP-ABE). It is a kind of encryption where both the ciphertext and the private keys rely upon attributes. These attributes are collected from the clients for identification. CP-ABE majorly depends on authorization management and authorization control. The proposed technique prevents unauthorized access to client data and decreases the chances of violating client rights. The methodology enables data access security in Hadoop throughout its different components. In the proposed model, security and reliability have been achieved, while the efficiency of the model is yet to be achieved [35].

M. Sarvabhatla et al. analyzed and illustrated the possible security
flaws (offline password guessing attack) in N. Somu et al. [34]
authentication service with a one-time pad. To fix this, they proposed an
authentication scheme based on light-weight OTP for the Hadoop
environment. They analyzed the scheme from the security perspective
and illustrated that the scheme could mitigate security flaws. So, the
proposed scheme provides a robust and secure authentication service to
the Hadoop platform [36].

B. Saraladevi et al. have implemented three approaches for enhancing security in the Hadoop Distributed File System (HDFS).
They concentrated on combined usage of three strategies, such as
Kerberos, Bull Eye Algorithm, and replication of NameNode. The first
approach focuses on enabling Kerberos for secure access control
between client, NameNode, and DataNode. The Bull Eye Algorithm
employs secure storage and access mechanisms on sensitive data. The
third approach ensures the high availability of NameNode and avoids
server failures. These approaches will overcome specific problems that
may occur in the NameNode and DataNode [7].

Y. S. Jeong and Y. T. Kim proposed a token-based authentication scheme against replay attacks and impersonation attacks. In this authentication scheme, tokens are generated using a hash chain mechanism instead of a public key exchange scheme. These generated tokens are encrypted using elliptic curve cryptography, thereby securing them against various attacks. The proposed authentication scheme protects sensitive data stored in HDFS. In this model, clients are authenticated by the NameNode and DataNode using hash chain keys. The proposed model attains performance similar to existing HDFS systems in terms of communication and computing power [37].

Y. S. Jeong et al. focused on mitigating replay and DataNode impersonating attacks by developing a hash chain technique that addresses their cause. This expanded approach produces hash-chained block values, and in this way HDFS blocks are security-aware and unavailable to an outsider trying to access the client’s data. The proposed approach contains three stages: initialization, NameNode–client authentication, and client–DataNode authentication. Client data remains invisible to the DataNode, and the DataNode must confirm its access with the NameNode. Consequently, DataNode impersonation attacks can be eliminated altogether [38].
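The hash-chain idea recurring in these schemes can be illustrated with a short, generic sketch: each token is the hash of the previous one, so tokens can be released in reverse order and verified with a single hash. This is a generic illustration of the technique, with a placeholder seed, and not the exact construction of the reviewed papers.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class HashChainDemo {
    public static void main(String[] args) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");

        // Build a chain t1 = H(seed), t2 = H(t1), ... of length n.
        int n = 5;
        byte[][] chain = new byte[n][];
        byte[] current = "secret-seed".getBytes(StandardCharsets.UTF_8); // placeholder seed
        for (int i = 0; i < n; i++) {
            current = sha256.digest(current);
            chain[i] = current;
        }

        // The verifier stores only the tip chain[n-1]; a client later reveals
        // chain[n-2], and the verifier checks that H(chain[n-2]) equals the stored tip.
        byte[] revealed = chain[n - 2];
        boolean valid = MessageDigest.isEqual(sha256.digest(revealed), chain[n - 1]);
        System.out.println("Token accepted: " + valid);
    }
}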

I. Khalil et al. have provided a trusted platform module (TPM) based mutual authentication between the NameNode, DataNodes, and external clients. The proposed model overcomes the inadequacy of the Kerberos mechanism of not protecting the authentication from insider attacks by using the TPM model. In the TPM model, bind and seal functions mitigate and protect the Hadoop environment from malicious insider attacks without affecting the Platform Configuration Registers (PCR). The proposed model shows better performance results and security when compared with state-of-the-art protocols [39].

Y. A. Jung et al. concentrated on solving the issues in the PK scheme used to generate a one-time token based on a hash value for secure client authentication to the Hadoop environment. They found that the PK scheme consumes more computational power to calculate the hash values. So, they made a subtle modification to the scheme by adding a hash function G, which reduces the computational cost of the hash values. This proposed scheme enhanced the performance and efficiency of the security scheme in Hadoop [40].

Z. Dou et al. have designed and implemented a trusted platform module (TPM) based strong mutual authentication between the different daemons of the Hadoop framework. Additionally, external clients are mutually authenticated with the DataNodes and NameNode while interacting with Hadoop internal components. This subtle change of adding bind and seal functions to the TPM model protects Hadoop from insiders’ malicious attacks using trusted third-party attestation identity key (AIK) certifications [41].

Y. Mei applied the hash chain procedure for security enhancement in the Hadoop cluster. Here, the hash chain password strategy is utilized to improve the efficacy and productivity of the token-based scheme, and the computational expense of using a public-key cryptographic algorithm is thereby avoided. The developed approach mitigates impersonation and replay attacks and improves the confidentiality of the token [42].

M. Hena and N. Jeyanthi proposed a Kerberos-based validation system for the Hadoop environment which utilizes OTP (One Time Password). Secure communication between the nodes is established using symmetric cryptographic algorithms. In this approach, OTP is used rather than an asymmetric algorithm for secure correspondence. The proposed method decreases computational expense and time and improves execution performance. As no password is held by the KDC (Key Distribution Center), it mitigates password-guessing attacks and settles time synchronization issues [43].

Authentication and authorization in the Hadoop system focus on securing the design and do not provide protection at the data level. Thus, the following cryptographic algorithms have been embedded to provide security at the data level.

2.2 Data Security in Hadoop Distributed File System
Securing the data is of more significant importance than just securing the system. In Hadoop, data security is achieved with the help of the individual implementation of various algorithms using MapReduce. Lightweight encryption methods are always desirable. Many security algorithms are used in MapReduce concepts, such as RC4, ARIA, DES, Triple DES, Blowfish, AES, one-time pad, ECC, hashing, and MD5. The related literature concerning data security is reviewed below.

W. Wei et al. introduced service integrity with SecureMR in the Hadoop structure. SecureMR helps protect five components, i.e., managing, scheduling, executing, committing, and verification. The integrity of the distributed processing system is maintained by mitigating replay and denial-of-service attacks. The authors have also checked the performance of the proposed approach against various existing schemes and claim that it produces low execution overhead, which is acceptable as it enables integrity in task execution [44].

J. H. Majors has proposed symmetric cryptography algorithms such as AES, DES, and Triple DES at the application level through MapReduce. This approach provides security to data-at-rest in HDFS. In the proposed model, two applications were maintained: one for encryption of data while writing to HDFS and the other for decryption of data while reading from HDFS. Further evaluation and comparison were performed, claiming that AES provides better results than the other two symmetric algorithms [45].

H.-Y. Lin et al. executed a hybrid encryption approach to secure the HDFS storage system. HDFS-RSA and HDFS-Pairing are the two encryption techniques that have been added as an extension to HDFS. In HDFS-RSA, they utilized both symmetric-key and public-key encryption. The file is separated into fixed-size blocks; all the blocks except the last one are encoded using AES, whereas the encryption of the last block is done using an RC4 stream cipher. In HDFS-Pairing, access control mechanisms were added to secure the information. This hybrid approach is claimed to give better secrecy in Hadoop [46].

S. Park et al. presented a safe and secure HDFS by including encrypt and decrypt functions as an inherent encryption/decryption class in the Hadoop cluster. The AESCodec is fabricated utilizing CompressionCodec. Since the HDFS client is responsible for encrypting the entire data, a bottleneck arises while uploading files in this approach, and the proposed method shows lower performance in reading/writing files in HDFS [47].

C. Zhonghan et al. focused on protecting the blocks of files and the communication keys; for this, they proposed two encryption algorithms (AES and RSA) utilizing the Java Cryptography Extension (JCE). RSA is used to provide secure communication between the NameNode, DataNodes, and clients, while AES helps in encrypting the blocks of files stored in HDFS. This approach will prevent intruders from attacking the Hadoop cluster [48].

Q. Quan et al. have proposed a model for secured cloud storage based on HDFS. This model adopted a combined utilization of symmetric
and public-key cryptography to secure data. This model exhibits some
of the key benefits such as portability, scalability, data access with high
efficiency and confidentiality with maintaining integrity over the cloud
environment [49].

C. Yang et al. focused on the security provided to distributed data storage in the cloud and proposed a novel triple encryption mechanism. In this approach, the HDFS files are encrypted using DES. The encryption of the client’s DES key is done using the RSA public key, while the client’s RSA private key is encrypted using IDEA. This model is implemented and incorporated within cloud data storage based on Hadoop, and the confidentiality of reading and writing files to the cloud is enhanced [50].

X. Yu et al. researched various insider attacks when a Hadoop cluster is deployed on a Cloud IAAS platform. By compromising the authentication key and authorization control, Hadoop security can be bypassed by attackers. The SEHadoop block is established to overcome this issue. In the SEHadoop block, tokens are created using SHA256 (Secure Hash Algorithm), and block tokens are encrypted with the AES algorithm to provide a secure access channel from the NameNode to the DataNode. They claim this model improves isolation between Hadoop components by enforcing least-access privileges and forestalls side-channel attacks on Hadoop from compromised Hadoop processes [51].

D. Shehzad et al. propose a scheme in which the content of files is encrypted/decrypted with hybrid cryptographic techniques. Files are encoded symmetrically using images as a secret key. Then the key is asymmetrically encrypted using the user’s RSA public key. The client keeps the private key and sends the encrypted data to HDFS. The hybrid encryption scheme combines the essential features of symmetric and asymmetric encryption schemes. This approach reduces the overhead of generating the secret key. This mechanism provides better results than existing techniques for securing data on HDFS based on the cloud [52].

M. M. Shetty and D. H. Manjaiah considered the current safety measures embedded in Hadoop for securing data. In this work, Transparent Data Encryption (TDE) has been studied in depth; it can prevent and mitigate various attack vectors, for example, hardware access, key management, root access, rogue user, and HDFS admin level exploits. TDE gives end-to-end security to the information stored on HDFS. The information is encrypted in an end-to-end encryption/decryption manner while writing and is consequently decrypted on reading. TDE supports only the AES algorithm to encrypt and decrypt the data [5].

A. Jayan and B. R. Upadhyay enhanced data storage in the Hadoop environment by using an improvised parallel RC4 (Rivest Cipher) stream cipher with MapReduce. MapReduce reduces the cost of the algorithm. As RC4 belongs to the stream ciphers, the execution speed depends on the availability of cores. Finally, the performance of RC4 with and without the MapReduce method is compared with other existing algorithms for data security. They claim that RC4 gains better performance than existing algorithms [53].

Using the ARIA algorithm on Hadoop, Song et al. proposed an HDFS data encryption method. Based on the Hadoop distributed computing environment, their approach provides an HDFS block-splitting component that performs the ARIA/AES algorithms. Further, variable-length data processing is achieved by adding a padding technique on the last block if its size is less than 128 bits. Finally, they show that ARIA has some degradation in performance compared to AES on various data analysis applications such as k-means and hierarchical clustering algorithms [54].

H. Mahmoud et al. have proposed the integration of the AES and OTP algorithms for encrypting and decrypting data in files stored on HDFS. The merger is intended to lower the complexity of cryptographic algorithms. The files are encrypted within HDFS and decrypted within the Map task only. The AES algorithm implemented in previous work increased the file size by 50 percent, whereas AES with OTP reduced this increase to 20 percent after encryption. The proposed model has enhanced the performance of encryption and decryption in Hadoop [55].

Y. Xu et al. developed a distributed RSA for encryption and decryption of data by utilizing the distributed programming model MapReduce. RSA is applied in the MapReduce programming model in a distributed manner. The client input data is divided into four parts, and distributed encryption and decryption are applied to each part. Finally, it is demonstrated that the speed of operation can be optimized by the distributed RSA algorithm [56].

P. Johri et al. achieved data security for data-at-rest by implementing two MapReduce applications based on AES. The first application performs data encryption and storage, while the second application decrypts the data during reading. The authors also suggested the use of a password for encryption/decryption instead of the symmetric key. Comparing the results obtained from AES with DES (Data Encryption Standard), they claim that AES provides enhanced performance over DES [57].

T. S. Algaradi and B. Rama proposed a Blowfish algorithm using parallel MapReduce that improves encryption performance in the Hadoop environment. A pre-designed MapReduce programming model is used to encode enormous volumes of data in less time. Later, performance evaluations were conducted with conventional Blowfish and compared with the proposed model, demonstrating that the proposed model gives better efficiency [58].

2.3 Security Tools for Hadoop Cluster by Third Party
In this section, we will illustrate different types of security tools being
integrated into the Hadoop cluster. The primary tools used for providing
and monitoring the security in Apache Hadoop are Apache Knox, Apache
Ranger, Kerberos, Project Rhino, Apache Sentry, and eCryptfs.

Apache Knox is utilized as a gateway for secure access to the different services or components of Hadoop, providing perimeter security to the Hadoop cluster. Knox provides service-level authorization at the perimeter. For authentication and token affirmation at the boundary, Knox provides an SSO (Single-Sign-On) facility. It also exposes REST APIs backed by AD (Active Directory) or LDAP (Lightweight Directory Access Protocol). Several enterprises utilize these REST APIs and expose a single URL to clients; clients then use these URLs to access the different services of Hadoop. Apache Knox is appropriate for authentication, authorization, and auditing services [21].
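As an illustration of perimeter access through such a gateway, the snippet below issues a WebHDFS LISTSTATUS call via a Knox-style URL with HTTP Basic authentication. The host, port, topology name, and credentials are placeholders, TLS trust configuration is omitted, and the exact URL layout would depend on the actual Knox deployment.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class KnoxWebHdfsDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder gateway URL of the form https://<knox-host>:8443/gateway/<topology>/webhdfs/v1/...
        URL url = new URL("https://knox.example.com:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // Knox authenticates the caller at the perimeter (here via HTTP Basic against LDAP/AD).
        String credentials = Base64.getEncoder()
                .encodeToString("user:password".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + credentials);

        // Read and print the JSON directory listing returned through the gateway.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}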

Apache Ranger facilitates a centralized administration framework for security policies and access control auditing services in Hadoop. It can be easily integrated with various Hadoop project stacks such as HDFS, Hive, Knox, Storm, and HBase. It is based on a User/Group synchronization server and a Policy Admin Server, which provide an interface between the client and Hadoop. Ranger provides different authorization methods, such as role-based and attribute-based access control. Ranger with Knox offers better access control mechanisms for the Hadoop framework [23].

Kerberos is an open-source authentication framework developed at the Massachusetts Institute of Technology (MIT). Kerberos majorly depends on the Key Distribution Center (KDC), which has three components: the authentication server (AS), the Kerberos database, and the ticket-granting service (TGS). The Kerberos database handles all the principals and the realms. Kerberos integrated with Hadoop generates many tickets for ensuring proper access. It creates a delegation token for establishing communication between the client and the NameNode. Once the connection is established, a block access token is issued to secure the connection between the client and the DataNodes [26].

Project Rhino was developed as open source within Intel’s distribution of the Hadoop framework. Besides authorization, Rhino provides support for encryption and key management of data stored on Hadoop. Rhino is capable of facilitating security support for all the sub-projects of Hadoop. Authorization in Rhino is provided via the SSO concept. Project Rhino provides data-at-rest as well as data-in-transit level security, and it implemented AES-NI (Advanced Encryption Standard New Instructions) for the encryption and decryption mechanism in Hadoop [24].

Apache Sentry is a fine-grained, role-based authorization component for Hadoop. It provides an accurate level of access to data by privileged users of Hadoop. For authorization between Hadoop components such as Impala, Hive, Apache Solr, and HDFS, it maintains elements such as the Sentry server, Data Engines, and Sentry plugins. The Sentry server supports various policy metadata for authorization. Data Engines process data using Hive, Impala, and HDFS; these Data Engines load Sentry plugins as an interface to manipulate the policy metadata stored in the Sentry server. Apache Sentry applies Access Control Lists (ACLs) on Unix-based platforms [59].

eCryptfs is an open-source file system that can be used effectively with Hadoop for providing data-at-rest encryption. eCryptfs is part of the Linux kernel and is a POSIX-compliant encrypted file system. eCryptfs facilitates filesystem-level encryption, which can be configured for a partition, and it can also be applied at the file or directory level. eCryptfs is packaged as an application with Linux-based operating systems. It supports the following key modules: OpenSSL, TSPI, and passphrase. eCryptfs encrypts the data based on the user’s selection among various symmetric cryptographic algorithms such as AES, Blowfish, Triple DES, Cast5, and Cast6. As Hadoop is deployed on top of the Linux file system, the files or directories created on top of HDFS can be encrypted using eCryptfs [60].

2.4 Cloud Computing


X. Xu has briefly reviewed the basic features of cloud computing. Cloud computing has adopted all the essential requirements from the perspectives of end-users, enterprises, and cloud providers. These features have made the cloud computing paradigm accepted by various enterprises. With the extensive adoption of cloud computing, the way enterprises and industries focus on their businesses has changed drastically. For the manufacturing industry, cloud computing is rising as one of the significant enablers; it can improve the customary manufacturing business model, create intelligent factory networks, and help adjust product development to the business procedure that supports compelling collaboration. In the manufacturing sector, two kinds of cloud computing adoption have been proposed: one is to directly adopt the services provided by cloud computing, and the other is to build a similar version of the cloud computing framework. Cloud computing has been built around key areas such as customized solutions, flexibility in deployment, scaling production up and down per demand, pay-as-you-go business models, and IT. In a unified manner, distributed resources are encapsulated into cloud services and managed in cloud manufacturing, and clients can utilize cloud services as per their requirements [61].

2.4.1 Security Issue in Cloud Computing

Z. Xiao et al. have discussed the most important privacy and security attributes, such as authentication, privacy preservability, accountability, availability, integrity, and confidentiality. They presented the relationships among these attributes, the vulnerabilities that attackers may exploit, and the defense mechanisms adopted by cloud service providers. For each attribute, future research directions are determined [62].

2.4.2 Authentication

A. A. Yassin et al. have proposed a two-factor mutual authentication model based on Schnorr digital signature from the client’s fingerprint
feature extraction. Schnorr digital signature relies on ElGamal digital
signature. The proposed scheme involves three different components
(user, data owner, and service provider). Firstly, the user registers with
the data owner by providing his username, password, and fingerprint to
the data owner. Later data owner generates a public key parameter, and
the secret key based on Schnorr digital signature responds to the service
provider and user. In the proposed scheme, Users can access files shared
by the data owner without any device for authentication as the details are
already saved at the data owner during the registration phase. This
model will reduce device cost and computational time at a service
provider, which enhances the efficiency of service provision.
Furthermore, this scheme will mitigate many attacks like off-line
attacks, replay attacks, impersonation attacks, etc. [63].

N. Gajra, S. S. Khan, and P. Rane observe that the provision of security is not ensured by the authentication mechanism alone; the files stored in a third-party environment should also be secured. They proposed a hybrid approach for facilitating data security and authentication for secure access. The data is encrypted using modified AES (MAES) and Blowfish algorithms, while the key is generated using ECC, and key agreement is ensured using Diffie-Hellman. Authentication is ensured by using ECDSA (Elliptic Curve Digital Signature Algorithm) between the client and the service provider. The MAES provides more robustness towards vulnerabilities by enhancing the complexity of the AES algorithm [64].

H. Li et al. have implemented a novel authentication framework based on the smart card in cloud computing. They used the user’s
password, smart card, and public key algorithm to ensure secured users
authentication and key exchange. The proposed scheme can address and
overcome two key issues with existing systems: 1) Attacks due to loss of
smart cards. 2) The issue related to forward and backward security. The
entire process of authentication and key exchange uses ECDH (Elliptic
Curve Diffie-Hellman) algorithm. The authors claim that the technique
proposed can provide all security requirements in authenticating a user
for accessing the cloud environment [14].

A. S. Tomar et al. identified that authentication in the cloud environment is a crucial step towards the provision of data security. They implemented Captcha and ECC (Elliptic Curve Cryptography) for secure key exchange. In the proposed model, the user should select a Captcha for authentication by the CSP (Cloud Service Provider). Furthermore, the key is encrypted using ECC and exchanged between the user and the CSP. This mechanism ensures a secure authentication and key exchange policy between users and the CSP. The authors claim that this model will resolve various attacks such as insider attacks, man-in-the-middle attacks, impersonation attacks, etc. [65].

R. Dangi and S. Pawar have focused on the loophole of two-factor authentication, which was device-dependent. They proposed three-level authentication using email-based OTP, replacing the device-dependent factor with a secure interface to enhance the security of data over the cloud infrastructure. In this approach, a Private Key Generator (PKG) generates a public and private key. The ECC algorithm encrypts the file using the public key. Later, SHA-2 creates a hash value, which works as a key for the RSA algorithm applied to the encrypted data. The receiver receives the OTP through e-mail and uses the reverse process to decrypt the data. This proposed scheme reduces the system cost and time complexity while enhancing the security mechanism in sharing and accessing the file from the sender to the receiver [66].

2.4.3 Confidentiality

In ongoing Cloud Computing research, one of the key challenges is data confidentiality. Based on fragmentation, Hudic et al. present a method for confidential storage and security of data in the cloud environment. Hosting confidential business data with a semi-trusted external cloud service provider (CSP) requires handing over control of the data. However, when the data is distributed among multiple CSP servers in particular, cryptographic techniques add computational overhead. The proposed approach utilizes the least possible amount of encryption, and they presented a fragmentation method that efficiently stores the data on the CSP servers. The fragmentation process is applied to a relational database, where each table is preserved as a self-contained fragment. The 3NF normalization of the databases ensures a fragment structure without any duplication or anomalies. The tables are categorized following user requirements for the exported XML fragments, availability, service, and performance. After the fragments are identified and the corresponding privacy ratings assigned, the least number of cloud service providers (CSPs) is utilized to keep all fragments, which have to be held unlinked in separate places [67].

L. Arockiam et al. indicated that users get an enormous amount of virtual storage from cloud computing. Cloud storage allows small and medium-sized enterprises to limit their investment in and maintenance of storage servers. Cloud service provision adopts the multi-tenant approach, where a resource is virtualized and handles multiple customers. The confidentiality parameter assures cloud storage protection, and encryption is a generally utilized technique for ensuring confidentiality. If cloud data confidentiality is violated, the industry will lose data. They used encryption and obfuscation as two different technologies for securing data and the confidentiality of cloud storage. Encryption is the mechanism of transforming readable text into an unreadable form by utilizing an algorithm and a key. Encryption and obfuscation are almost similar: obfuscation is a mechanism that hides data from unauthorized users by applying a specific mathematical function or using programming techniques. Obfuscation and encryption can be applied according to the data type: encryption may be applied to alphanumeric and alphabetic data, while numerical data may be obfuscated. Cloud data can be protected against a larger number of unauthorized users by using obfuscation and encryption techniques. By the combined use of obfuscation and encryption, confidentiality can be attained [68].

2.4.4 Integrity

Based on security levels, Y. Ren et al. make the master manage the computing workers; building on existing methods, a caching mechanism and a trusted verifier worker are introduced. According to their system analysis, the service integrity framework is more efficient at detecting malicious workers in a MapReduce-based cloud computing environment [69].

For big data analytics and management applications, Wang et al. proposed IntegrityMR based on such an architecture. They explore the result-integrity-check technique at two alternative software layers: the application layer and the MapReduce layer. Based on Pig Latin and MapReduce, they design and implement the system at both layers with popular management and big data analytics applications such as Apache Mahout and Pig, on a local cluster environment and on commercial public clouds such as Amazon EC2 and Microsoft Azure. The proposed model has shown better performance overhead compared to existing approaches [70].

Saxena et al. have proposed a Paillier homomorphic cryptography (PHC) scheme. In the proposed model, two building blocks (homomorphic tags and combinatorial batch codes) are used. The PHC technique is used to obtain homomorphic encryption of data blocks, while the combinatorial batch codes are used to allocate and store basic information on different distributed cloud servers. Based on the MapReduce and Hadoop framework, they implemented an application to demonstrate their method and tested it against various parameters [71].

For data verification purposes, Idris et al. aimed to improve verification using the SHA1 and MD5 cryptographic hash functions. These cryptographic functions were chosen due to their worldwide acceptance. Attackers can break an independent application of SHA1 or MD5, but combining both algorithms makes the scheme stronger and less likely to be broken. During the uploading process, data validation can be performed to prevent users from processing erroneous data [72].
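A minimal sketch of combining the two digests for file verification might look as follows; concatenating the two hex strings is only one possible way to club the algorithms and is not necessarily the exact combination used by Idris et al.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class DualDigestDemo {

    // Compute hex(SHA-1(data)) + hex(MD5(data)) as a combined verification tag.
    static String combinedDigest(byte[] data) throws Exception {
        return toHex(MessageDigest.getInstance("SHA-1").digest(data))
                + toHex(MessageDigest.getInstance("MD5").digest(data));
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] upload = "example file content".getBytes(StandardCharsets.UTF_8);
        // The tag is stored at upload time and recomputed later to validate the data.
        System.out.println(combinedDigest(upload));
    }
}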

R. Sumithra et al. have implemented weighted and hash-tree algorithms, association rule mining (ARM), and a hybrid weighted-hashT Apriori algorithm. Their work handles the security and integrity of data during the mining process. Utilizing the Eucalyptus platform, the computation is tested in a cloud environment with HDFS and VMware Workstation. It was assessed how the distributed usage performs better than stand-alone executions of the weighted and hash-tree Apriori algorithms. For guaranteeing data security in the mining procedure, their work considered the viability of utilizing Eucalyptus Hadoop nodes and the performance changes arising from the use of the security protocol [73].

2.4.5 Availability

A. Undheim et al. developed a model for the cloud data center to focus
on the availability attribute of a cloud SLA. The investigation was
performed on various techniques, which increased the availability of the
virtualized system. The outcomes of the works demonstrated that it
achieved availability differentiation based on failure rate and various
deployment scenarios. To restart the virtual machine, different priority
levels are used, which showed large differences [74].

For a distributed computing environment, a dynamic data replication method was proposed by D. W. Sun et al., together with a brief survey of replication strategies. It incorporates: 1) analyzing and modeling the relationship between system availability and the number of replicas; 2) measuring and identifying popular data and triggering a replication operation when the data popularity passes a dynamic threshold; 3) calculating a suitable number of replicas to meet a reasonable system byte-availability requirement and placing the replicas among data nodes in a sensible way; and 4) organizing the dynamic data replication algorithm in a cloud [75].

C. T. Yang et al. used open-source software and platforms such as the OpenNebula virtual machine management tool and Xen hypervisor technology to attain availability and stability. High availability was achieved with Hadoop after extending the capabilities of the components, referred to as virtualization fault tolerance. A practical problem in the virtualization system, the single-point-of-failure issue, was considered. High availability is achieved based on four components: Distributed Replicated Block Device (DRBD), Virtualization Fault Tolerance (VFT), the Xen hypervisor, and, lastly, OpenNebula. The results confirmed that the interval of downtime could be reduced even when failure happened. Virtualization fault tolerance is used in many areas such as cluster-based systems, not only Hadoop applications. The careful design and implementation confirmed that the single point of failure has been eliminated [76].

B. Mao et al. proposed an approach named HyRD, a hybrid redundant data distribution method in a cloud-of-clouds. The HyRD approach enhances the availability of cloud storage by exploiting cloud provider diversity and workload features. In HyRD, huge documents are distributed across numerous cloud storage providers with erasure-coded data redundancy. Large files are stored in cost-effective cloud storage providers, whereas small files and file system metadata are replicated in performance-oriented cloud storage providers [77].

2.4.6 Accountability

J. Li et al. propose a way to implement ABE-based (attribute-based encryption) fine-grained and flexible access control schemes. The current ABE-based access control systems fail to prevent the illegal sharing of key data among colluding users, which is needed to ensure secure cloud access control. The paper tackles this difficult open problem by identifying and implementing data-based access policies and by using traitor tracing for device accountability. Besides, revocation and user grants are provided efficiently with the help of a broadcast encryption method [78].

R. K. L. Ko et al. proposed the TrustCloud platform, which gives users a view of cloud services that provides accountability. With large-scale virtualization and data-spreading mechanisms in a distributed environment, they focus on the urgent need for cloud accountability. To increase accountability, detection methods are adopted instead of preventive methods. Detective strategies complement protective approaches since external threats as well as insider risks are considered, and detective techniques can be used in a less invasive manner than preventive procedures. They argued that addressing end-users’ concerns about data integrity and accountability, rather than only system health and performance, requires a file-centric view in addition to the usual system-centric approach [79].

S. Sundareswaran et al. proposed a new, highly decentralized data accountability model to keep track of users’ data usage in the cloud. By leveraging the capabilities of programmable Java JAR files, user policies and data are packaged together with the logging mechanisms. This method guarantees that any access to user data triggers authentication and automatic local JAR logging. Distributed auditing mechanisms were proposed for highly accurate user control. The feasibility and efficiency of the suggested methods were illustrated by an experimental study [80].

Z. Xiao et al. have proposed accountable MapReduce in the cloud, which forces every machine to be accountable for its behavior. A group of auditors is set up to perform an A-test (Accountability Test) that detects malicious nodes among all of the functioning machines in real time. Based on how the auditors are allocated, the A-test can be carried out with different options. They formulate an Optimal Worker and Auditor Assignment problem for maximizing utilization, which tries to select a good count of workers and auditors so that the total work time is kept to a minimum. The assessment findings indicated that the A-test can be used on existing MapReduce cloud platforms in an efficient and practical manner [81].

2.4.7 Privacy and Preservability

G. Zhang et al. reviewed various challenges in providing privacy protection and preservation in the cloud paradigm. Based on the review,
they have drawn some insights into future aspects for research in
protecting privacy and preservation in the cloud model. They even
classified different privacy and preservation mechanisms from cloud
service roles and levels. This investigation gives the importance of cloud
privacy protection and preservation strategies. They also point out
potential key areas in cloud privacy protection and preservation [82].

A. Waqar et al. have studied the possibility of exploiting the metadata stored in the cloud’s database to compromise the privacy of users’ data items placed in the cloud provider’s simple storage service. They modified the schema of the database by introducing cryptographic and relational privacy-preservation actions that utilize the sensitivity parameterization and parent-class membership of cloud database attributes. The formulation of its constituent steps is kept well aligned with the recommendations, which ensures the suitability of the proposed technique for private cloud environments [83].

P. Hu et al. proposed a privacy and security preservation method for minimizing such issues. Based on face identification, a fog computing outline was given. In the procedures of face identification and face resolution, an authentication and session key agreement technique, a data encryption strategy, and a data integrity checking method were introduced to settle the problems of availability, trustworthiness, and privacy. The influence of the security method on system performance was evaluated with a prototype system [84].

G. Sun et al. proposed a synthetic fog-computing-based vehicle crowdsensing scheme to solve the security problems arising in data collection. A two-tier fog framework was designed to meet the data security requirements, and the lower-tier fog nodes are positioned on fog buses. Access control and data integrity verification were included for resolving data forgery. These innovations were used to provide traceability and to support the use of incentive mechanisms [85].

2.5 Data Security in Cloud Computing


S. K. Sood proposed a combined approach for securing data stored in
the cloud that provides confidentiality, integrity, and availability. In the
proposed model, confidentiality is achieved through SSL (Secure Socket
Layer), integrity is maintained with a MAC (Message Authentication
Code), and availability is supported by classifying data into three
categories (public, private, and owner access-controlled) combined with
searchable encryption. First, a sensitivity rating is calculated to classify
the data, upon which indexing is done for faster retrieval. Later, the data
along with its index is encrypted using SSL to maintain confidentiality
while the data is in transit. Further, a MAC is generated and sent along
with the encrypted data to the cloud to ensure integrity. The proposed
scheme can curtail issues such as information leakage, tampering of data,
and unintended access [86].

F. F. Moghaddam et al. used RSA small-e and modified Diffie-Hellman
algorithms to ensure data security in the cloud framework. RSA small-e
encrypts data on the client side and stores the encrypted data on the cloud.
Modified Diffie-Hellman is used to securely exchange the encryption key,
providing access control between intended users. The proposed model
provides confidentiality and a secure access control mechanism for data
stored on the cloud. The authors claim that the model reduces security
concerns by mitigating well-known attacks (man-in-the-middle, cycle,
discrete logarithm, and brute-force) [87].

V. S. Mahalle and A. K. Shahade have proposed a hybrid encryption
algorithm to secure the data stored in the cloud. In this hybrid scheme, a
combination of RSA and AES with different key lengths is used. A unique
key is generated based on the system time, which increases the complexity
of cracking the key for an intruder. The data is encrypted twice to attain
enhanced security and to protect it from unintended access or attacks.
First, the data is encrypted using the AES secret key; later, the encrypted
data is encrypted again using the RSA public key. Users can access the
file using the AES secret key and the RSA private key, by which integrity
and confidentiality are achieved [88].

N. Khanezaei and Z. M. Hanapi have implemented a combination of
symmetric and asymmetric cryptographic techniques to ensure data
security in the cloud. The primary focus of the proposed system is to
transfer files between the user and the cloud securely. For secure
communication, an integration of the RSA and AES algorithms is used.
RSA strengthens the encryption mechanism against attackers, and a
reduction of the file-transfer time is achieved by using AES. This approach
enhances the security of data stored in the cloud while mitigating various
types of attacks [89].

G. Raj et al. have provided an improved security model for data stored
in the cloud using different AES key sizes. First, the data is categorized
into large, medium, and small sizes; a different AES key size is then
suggested for encrypting each category. In the case of a large file, an AES
128-bit key with ten rounds is applied for data encryption. An AES
192-bit key with 12 rounds is applicable for medium file sizes, and for
small files, an AES 256-bit key is suggested. The proposed model improves
the computational cost and provides confidentiality to the data stored on
the cloud [90].

N. Sengupta identified the security flaws of data-in-transit to cloud
systems and proposed a hybrid RSA algorithm for cloud systems. The
proposed model is divided into two stages for providing security to data.
In the first stage, RSA encryption is applied to the data. In the second
stage, the Feistel encryption algorithm is applied to the output generated
by the first stage. The new hybrid RSA encryption algorithm is used to
transfer data into the cloud system and minimizes man-in-the-middle
attacks [91].

M. Kumar et al. have investigated the threats to the confidentiality of
data at rest in the cloud environment and proposed end-to-end security
for data stored in the cloud. They made a subtle change to the Sood et al.
model by employing PBE (predictive-based encryption) to protect
customer data-at-rest in cloud systems. While the data is in transit, files
are encrypted using SSL-128 and SSL-256 bits between the client and
cloud systems to ensure data privacy. A MAC maintains data integrity so
that tampering does not occur. The scheme focuses on various data-owner
scenarios and partitions the clients' data based on a severity rating (SR) of
low, medium, or high. The experimental study showed that the unified
method explicitly protects the customer data [92].

Y. Li et al. have focused on the issues with distributed storage in the
cloud and provided an approach to secure the data stored in the cloud
environment. They proposed three security algorithms: Alternative Data
Distribution (AD2), Efficient Data Conflation (EDCon), and Secure
Efficient Data Distribution (SED2). AD2 decides whether or not to split
the data based on the sensitivity of the information. SED2 then processes
the data before moving it to the cloud. Further, EDCon enables a secure
access control mechanism between different users. They claim the
proposed model withstands various attacks and provides better
performance than the AES algorithm [93].

G. Amalarethinam and H. M. Leena proposed an enhanced RSA
algorithm for providing data security in the cloud. In the standard RSA
algorithm, the private and public keys use the same computed value N for
both encryption and decryption. They proposed a subtle modification by
introducing two values, N1 and N2, which RSA uses during encryption
and decryption instead of the single value N. The authors claim that the
proposed model enhances performance when compared to the standard
RSA algorithm [94].

D. P. Timothy and A. K. Santra developed a new hybrid cryptosystem
approach for protecting data stored in the cloud. The proposed
cryptosystem utilizes a combination of Blowfish, RSA (Rivest Shamir
Adleman), and SHA-2 (Secure Hash Algorithm). In the proposed model,
the Blowfish algorithm deals with data confidentiality, authentication is
provided using RSA, and the SHA-2 mechanism ensures data integrity.
This approach offers a high degree of security to data during transmission
and on-demand access to the shared pool of resources, primarily storage,
network, and server facilities [95].

T. Kapse implemented a hybrid model for providing secure
communication in the cloud framework. In the proposed model, the RC6
and ECC algorithms provide confidentiality, and MD5 (Message Digest)
provides integrity to the data stored on cloud systems. Through
evaluation, the author claims that the proposed model not only offers a
better security level but also enhances authentication and integrity. The
proposed model reduces overhead and computation time when compared
with the existing Timothy et al. model [96].

2.6 Summary
Several schemes have been explained to address the issues of data
security in a Hadoop-based cloud computing environment. This chapter
reviewed the security provisions in the Hadoop and cloud ecosystem for
authentication, authorization, auditing, and cryptographic (symmetric
and asymmetric) algorithms. Next, third-party security tools for the
Hadoop cluster, such as Apache Knox, Apache Sentry, Apache Ranger,
Project Rhino, and Kerberos, were also reviewed. Then, we studied
security algorithms like RC4, ARIA, AES with One-Time Pad, and ECC,
and security issues in cloud computing such as confidentiality, integrity,
availability, privacy, and accountability.

Chapter 3

SYSTEM DEVELOPMENT

Numerous researchers are constantly striving to secure Hadoop using
cryptographic approaches. The security procedures (Authentication and
Authorization) in Hadoop can be readily bypassed by attackers using
modern hacking tools and technology. If the attacker succeeds, they will
have access to the plain text saved on HDFS, which is a significant risk.
On the other hand, the Hadoop architecture provides a dynamic and
adaptable nature for integrating multiple technologies, which adds to the
complexity of data security. The focus of this chapter is on securing the
data stored in HDFS. First, we discussed Hadoop’s ecosystem,
emphasising the need for HDFS security. The following section
compares and contrasts the security techniques used by various Hadoop
distributors.

We use EC2 instances to build a five-node Hadoop cluster on the AWS
cloud. Various cryptographic algorithms (AES, BF, and RC6) were
examined and applied in this work to provide data-level security. A
modification of Blowfish's Feistel function is paired with parallel
processing and the Adams-Moulton approach to enhance performance.
The applicability of data-level security in the conventional stack is taken
into account. The next part focuses on safeguarding data in the HDFS
environment by combining cryptosystems with the concurrent MR
paradigm. This section also discusses the advantages and disadvantages of
implementing application-level security approaches. Finally, we use the
Transparent Data Encryption (TDE) technique to cipher/decipher data at
the file-system level, which overcomes the flaws imposed by
application-level security.

3.1 Necessity of Security at HDFS


As illustrated in Figure 3.1, Hadoop is an open-source project that
consists of several modules that have grown independently over time to
add other tools to its fundamental (HDFS and MR) capabilities. HDFS
and YARN / MR 2.0 are two important Hadoop components. Different
additional tools are created on top of these components, such as Sqoop,
Flume, Pig, Hive, Spark, Mahout, Ganglia, Nagios, ZooKeeper, etc.

Figure 3.1: Generic Hadoop Eco-System

The following are the numerous categories in which these components
are classified:

• Storage Component: HDFS has a distributed and parallel storage
component.

• Resource Management: YARN / MR 2.0 improves resource
management by providing a parallel and distributed processing
environment.

• Data Injection Tools: Sqoop allows data to be transferred from a
Relational DataBase Management System (RDBMS) to HDFS and
back. Flume is used to continuously collect streaming data from
sources such as Twitter, Facebook, and LinkedIn.

• Processing Tools: Pig, Hive, Mahout, Spark, and other tools may
analyze and process data stored on HDFS.

• Monitoring and Managing Tools: The tools used to monitor the
execution status include Ganglia and Nagios. ZooKeeper
coordinates the work of distributed applications.

One of the most intriguing aspects of Hadoop is how the tools in the
Hadoop ecosphere interact with the basic YARN and HDFS architecture.
In retrospect, security was an afterthought, and Hadoop falls short of a
suitable security strategy. As Hadoop's footprint has grown, organizations
have become concerned about the security of the sensitive data they are
gradually storing in the Hadoop framework. Administrators face a
significant and complicated challenge in ensuring data security in Hadoop
and its components. Authentication, authorization, and safeguarding data
stored in Hadoop are the three most essential elements of Hadoop security.
Authentication and authorization are now provided through conventional
third-party, community-supported approaches.

Hadoop has a security flaw when it comes to storing data in HDFS.
In Hadoop's ecosphere, data is managed using HDFS. As a result,
Hadoop distributors are concerned about protecting sensitive data stored
on HDFS.

3.2 Security Integration by Hadoop Distributors


Hadoop Distributors explore solutions for privacy and security problems
in the Hadoop framework.

Table 3.1: Provision of Security by the Hadoop Distributors

Authentication:
• Hortonworks: uses Kerberos with Ambari and Knox.
• Cloudera: Kerberos with AD/LDAP for client and service authentication.
• MapR: provides binary options for authentication via native authentication and Kerberos.
• IBM Insights: uses Kerberos with LDAP for authentication between clients and services.

Authorization:
• Hortonworks: Ranger handles access control.
• Cloudera: POSIX-style permissions with ACL and RBAC; mainly Apache Sentry is used for authorization.
• MapR: uses Access Control Expressions (ACE) for granting access to authorized users.
• IBM Insights: ACL handles the authorization.

Auditing:
• Hortonworks: Atlas with Ranger enforces data access policies and analysis.
• Cloudera: Cloudera's proprietary product, Cloudera Navigator, is used to analyze the logs.
• MapR: maintains the log file, and later Apache Drill is used to analyze these log files.
• IBM Insights: adopts a lightweight JMX monitoring tool for auditing.

Data-in-Transit:
• Hortonworks: RPC connections are secured using SASL and SSL.
• Cloudera: Kerberos RPC prevents impersonation attacks.
• MapR: Over-the-Wire encryption is adopted to protect the data-in-transit phase.
• IBM Insights: data-in-transit is secured by SSL and TLS certificates.

Data-at-Rest:
• Hortonworks: TDE is used to encrypt/decrypt the data-at-rest.
• Cloudera: provides a shield through HDFS encryption to protect data-at-rest.
• MapR: offers more granular protection to data-at-rest through Application Encryption.
• IBM Insights: data encryption for data-at-rest is ensured through two techniques: 1) TDE; 2) IBM Data Encryption.

Hortonworks, Cloudera, MapR, and IBM Insights are the four
major Hadoop distributors. Some distributors have added their flavour of
components to the standard Hadoop components for security. Some
distributors have made these extra components proprietary, while others
have made them open source.

Table 3.1 shows some Hadoop distributors that have previously
built a safe data protection method in all elements (Authentication,
Authorization, Auditing, Data-in-transit, and Data-at-Rest). However,
there is room to improve security measures in the case of data-at-rest on
HDFS in terms of performance and efficiency. Within the TDE
approach, Hortonworks and IBM Insights only supply the AES
encryption algorithm. The method used by MapR, on the other hand,
adds overhead to the programme during reading/writing operations.
Cloudera’s Hadoop distribution is also proprietary.

3.3 System Architecture


Clients, KMS, Cloud, and HDFS are the four parts of our study, as
indicated in Figure 3.2.

Figure 3.2: SecHDFS-AWS: Secured HDFS deployed in AWS cloud

We use AWS to implement HDFS in the cloud environment. AWS
provides clients with dependable, adaptable, and scalable services. EC2 is
a key Infrastructure as a Service (IaaS) offering of the Amazon cloud. A
web-based management panel virtualizes and provisions various computer
components, such as servers, network endpoints, and storage of the
required configuration, in seconds. Before gaining access to services placed
on AWS, customers must first authenticate themselves using the
administration portal. KMS manages the Hadoop cluster's permissions by
granting or rejecting access depending on access privileges.

3.4 Data level Encryption


Data encryption is required in various corporate sectors to fulfil the
security requirements at the data level, including national security,
health and medical departments, social media, and the military
throughout the world. Health Insurance Portability and Accountability
Act (HIPAA) laws apply to the health and medical business. In contrast,
Payment Card Industry Data Security Standard (PCI DSS) standards
apply to the card payment industry. The US government has
implemented Federal Information Security Management Act (FISMA)
requirements. Hadoop has now made a considerable step forward to
provide security to data stored on HDFS.

A cryptosystem is a system that uses cryptographic techniques to
deliver information security services. The study of encryption concepts
and methods is referred to as cryptography. Asymmetric and symmetric
ciphers are the two types of modern cryptographic algorithms. Asymmetric
ciphers are public-key ciphers, while symmetric ciphers are known as
secret-key (private-key) ciphers.

The length of the keys (number of bits) used for encryption in a
public-key cryptosystem is considerable. As a result, the
encryption/decryption process is slower than symmetric-key encryption.
The asymmetric process also requires significantly more computational
(processing) capacity than symmetric-key encryption. For these reasons,
symmetric encryption methods are favoured over asymmetric encryption
techniques for encrypting large volumes of data.

Block ciphers and Stream ciphers are two types of symmetric cipher
algorithms. Block ciphers encrypt a block of a defined length at a time,
usually 64 or 128 bits, whereas Stream ciphers encrypt data one bit or
byte at a time and are most commonly employed in a continuous flow
of data. Stream ciphers are quicker than block ciphers, but they’re more
challenging to set up and prone to assaults. As a result, block ciphers are
chosen over stream ciphers in this study.

DES, IDEA, Blowfish, 3DES, RC6, AES, CAST-128, and Twofish
are the most used current symmetric cryptosystems [97, 98]. Compared
to alternative symmetric block ciphers, AES and BF symmetric
algorithms give higher security and performance [99, 100]. We initially
implemented and studied AES and BF ciphers in this work.

3.4.1 AES

In 1998, Joan Daemen and Vincent Rijmen created the AES algorithm
and presented a proposal to the National Institute of Standards and
Technology (NIST).
The AES cipher is a Rijndael variation that uses a
substitution-permutation network rather than the Feistel network. AES
algorithm has a fixed block length of 128-bits plain text. AES supports
various key lengths of 128, 192, and 256 bits. For different key sizes,
AES has a specified number of rounds 10 (128), 12 (192), and 14 (256).
For every round, except in the last round, AES transforms the basic data
block using four transformation functions (Substitute Bytes, Shift Rows,
Mixing Columns, and Add Round Key). The mixing columns

transformation function is not used in the final round. The 128-bit data
block is structured as a 4x4 matrix of bytes. A
state array matrix of 4x4 order holds all of the intermediate output
created after each transformation step. In the past decade, various
attacks were reported viz., distinguishing attacks, key recovery attacks,
and side-channel attacks.

3.4.2 RC6

Rivest, Robshaw, Sidney, and Yin created RC6 to meet AES criteria
[101]. RC6 is a variant of RC5 with two more registers and a
multiplicative operation. RC6- w/r/b is the exact specification, with w
indicating the word size, r indicating the number of rounds in the cipher,
and b indicating the encryption/decryption key length in bytes. The word
(w) size in RC6 is 32 bits, and the rounds (r) are fixed at 20, while
the key length has three options: 128, 192, and 256 bits.
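For example, RC6-32/20/16 denotes 32-bit words, 20 rounds, and a 16-byte
(128-bit) secret key.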

3.4.3 Blowfish

In 1993, Bruce Schneier created the BF cipher, which is not trademarked
and is freely available to all users. BF was designed as a replacement for
the DES and IDEA block ciphers. Compared to other symmetric ciphers,
BF offers simplicity, security, and compactness. BF is simple since it
relies on basic operations such as XOR, table lookup, and addition modulo
operations. No effective cryptanalysis of BF has been published; hence it
is considered secure. BF is known for its compactness, performing
encryption and decryption in less than 5 KB of memory.

The key length in BF can range from 32 bits to 448 bits. BF is built
on the Feistel network and features a 64-bit fixed block length with a 16-
round iteration. It also contains the setup of four S-Boxes and P-Arrays.
The variables P-Array and S-Box are created from the hexadecimal digits of Pi.

Algorithm 1 RC6 Encryption Process
1: The input block length is of 128-bits which is subdivided into w-bits (w=32 bits)
and stored in four registers A, B, C, D.
2: The output cipher block length is of 128-bits (Encrypted Data)
3: Initialize the number of rounds r and generate the w-bit round keys S[0,1, 2r+3]
4: User-provided secret key z is loaded into an array L[], and two magic (Pw , Qw )
constants are used based on the length of the w-bit.
5: if w = 32 − bit then
6: Pw = B7E15163
7: Qw = 9E3779B9
8: end if
//The Process for Creating Rounds
9: S[0] = Pw
10: for i = 1 to 2r+3 do
11: S[i] = S[i-1] + Qw
12: end for
13: a=b=c=d=0
14: e = 3 * max(number of words in L, 2r+4)
15: for j = 1 to e do
16: a = S[c] = (S[c] + a + b) <<< 3
17: b = L[d] = (L[d] + a + b) <<< (a + b)
18: c = (c + 1) mod (2r + 4)
19: d = (d + 1) mod (number of words in L)
20: end for
//The data blocks in registers B and D undergo a pre-whitening step; its purpose
is to prevent the first round's input from being inferred from the plaintext
21: B = B + S[0]
22: D = D + S[1]
//After pre-whitening step the round operations are applied
23: for i = 1 to r do
24: T = (B * (2B + 1)) <<< lg w
25: U = (D * (2D + 1)) <<< lg w
26: A = ((A ⊕ T) <<< U) + S[2i]
27: C = ((C ⊕ U) <<< T) + S[2i + 1]
28: (A,B,C,D) = (B,C,D,A)
29: end for
30: A = A + S[2r + 2]
31: C = C + S[2r + 3]

Algorithm 2 RC6 Decryption Process
1: The Input cipher block length of 128 bits passed to the decryption process which is
subdivided into w-bits (w=32 bits) and stored in four registers A, B, C, D.
2: The output plain block length is of 128-bits
3: Initialize the number of rounds r and generate the w-bit round keys S[0,1, 2r+3]
4: User-provided secret key z is loaded into an array L[], and two magic (Pw , Qw )
constants are used based on the length of the w-bit.
5: if w = 32 − bit then
6: Pw = B7E15163
7: Qw = 9E3779B9
8: end if
//The Process for Creating Rounds
9: S[0] = Pw
10: for i = 1 to 2r+3 do
11: S[i] = S[i-1] + Qw
12: end for
13: a=b=c=d=0
14: e = 3 * max(number of words in L, 2r+4)
15: for j = 1 to e do
16: a = S[c] = (S[c] + a + b) <<< 3
17: b = L[d] = (L[d] + a + b) <<< (a + b)
18: c = (c + 1) mod (2r + 4)
19: d = (d + 1) mod (number of words in L)
20: end for
//The data blocks in registers C and A have the whitening that was added at the end
of encryption removed before the round operations are applied
21: C = C - S[2r + 3]
22: A = A - S[2r + 2]
//After pre-whitening step the round operations are applied
23: for i = r to 1 do
24: (A,B,C,D) = (D,A,B,C)
25: U = (D * (2D + 1)) <<< lg w
26: T = (B * (2B + 1)) <<< lg w
27: C = ((C - S[2i +1]) >>> T) ⊕ U
28: A = ((A - S[2i]) >>> U) ⊕ T
29: end for
30: D = D - S[1]
31: B = B - S[0]

P-Array is a one-dimensional array with eighteen 32-bit items. The
S-Box is a two-dimensional array with 256 entries of 32-bits.
P-Array ⇒ P[1 ... 18]
S-BOX ⇒ S[0 ... 3] [0 ... 255]
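In Java terms, these structures correspond to the following illustrative
declarations (not the thesis code):

int[] P = new int[18];        // eighteen 32-bit sub-keys
int[][] S = new int[4][256];  // four S-boxes, each with 256 entries of 32 bits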
The execution of BF is split into two parts:

1. Sub-Key generation

2. Data Encryption Process

Algorithm 3 Sub-Key Generation


1: The user’s secret key gets split into 32-bit keys as K1, K2, K3.....Kn [1 ≤ n ≤ 14].
(if the key is 448-bits then K14)
2: Initialize the Eighteen P-Array and Four S-Boxes.
3: P [1] ⊕ K1, P [2] ⊕ K2, till P [14] ⊕ K14, Now repeat the cycle with P [15] ⊕ K1,
and so on till P [18] ⊕ K4.
4: Encrypt the all-zero string with the BF cipher, using the sub-keys produced in step 2 and
step 3.
5: Replace P[1], P[2] ..... P[18] with the generated output from step 4.
6: Carry on with this process till all the entries of P-Array, and then all the entries of
S-Box are modified.

In this process, the original secret key gets discarded after the
transformation of P-Array and S-Box.
Algorithm 4 BF Encryption Process
1: The input block length is of 64-bits denoted by Y which is subdivided into two
32-bits halves: YL = 32-bits, YR = 32-bits
2: The output cipher block length is of 64-bits (Encrypted Data)
3: for k = 1 to 16 do
4: YL = YL ⊕ PK
5: YR = F(YL ) ⊕ YR
6: SWAP YL and YR
7: end for
8: SWAP YR and YL (UNDO LAST SWAP)
9: YR = YR ⊕ P17
10: YL = YL ⊕ P18
11: Combine YL and YR

Figure 3.3: Generic Representation of BF algorithm

As illustrated in Figure 3.3, the 64-bit input block is divided into
two equal 32-bit halves, YL (LEFT) and YR (RIGHT).

• XORing with sub-key P[1] is now performed on the left 32-bit block
(32-bits). The output is sent into the F function. As illustrated in
Figure 3.4, the input 32-bits are divided into four 8-bits and sent to
four distinct S-Boxes.

• These S-Boxes replace these 8 bits with pre-computed values, resulting
in a 32-bit result. Later, modulo addition and XOR operations are
performed in F as follows: F = ((S-Box 1 output + S-Box 2 output mod 2^32)
XOR S-Box 3 output) + S-Box 4 output mod 2^32.

• The right block is XORed with the produced output of function F.

• Swap the right and left blocks and go through the procedure again
for a total of sixteen rounds.

• Now use P[17] to conduct XOR on the right block’s output, and
P[18] to do XOR on the left block’s output.

• Combine the blocks on the left and right sides. The 64-bit ciphertext
will then be obtained.
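The steps above can be summarised in a short Java sketch of the F function;
the array name s[4][256] and the method name are illustrative assumptions,
not code from this work:

// Minimal illustrative sketch of the Blowfish round function F described above.
// Assumes pre-initialised S-boxes s[4][256] holding 32-bit entries; Java int
// arithmetic wraps modulo 2^32, matching the modulo-addition used by BF.
static int feistelF(int[][] s, int yL) {
    int a = (yL >>> 24) & 0xFF;   // first 8-bit quarter
    int b = (yL >>> 16) & 0xFF;   // second 8-bit quarter
    int c = (yL >>> 8) & 0xFF;    // third 8-bit quarter
    int d = yL & 0xFF;            // fourth 8-bit quarter
    int h = s[0][a] + s[1][b];    // S-Box 1 + S-Box 2 (mod 2^32)
    h ^= s[2][c];                 // XOR with S-Box 3
    return h + s[3][d];           // + S-Box 4 (mod 2^32)
}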

Figure 3.4: Generic Representation of Feistel Function execution (F)

The decryption process is similar to the encryption process but applies
the sub-keys in reverse order, i.e., P[18] to P[1].

3.4.4 Modified Version of BF

In the modified version of BF, the entire process of sub-key generation
and the rounds are kept similar to the standard BF algorithm. The
algorithm consists of a P-array of eighteen 32-bit sub-keys and four
S-boxes, each consisting of 256 entries of 32 bits. The modification
replaces the sequential execution of the Feistel structure with parallel
processing using the Adams-Moulton method. By doing so, the
performance of the BF cryptographic algorithm is qualitatively enhanced.
XOR(EL, Pk) is applied in each round to obtain a new EL of 32 bits. The
obtained EL is fed as an input to the Feistel function, where it is
sub-divided equally into four 8-bit quarters (say w, x, y, z) and
substituted by the S-boxes as indicated in Figure 3.5.

Figure 3.5: Feistel Function R of the MBF algorithm

The new equation will be (3.1),

R = S(N) (3.1)

Where R is the dependent variable and N is the independent variable.


The modified BF function R can be evaluated as a parallel set of
operations without violating the security criteria while even increasing the
speed. The function is obtained by dividing the left (EL) 32-bit input plain
text into four 8-bit quarters, such as w, x, y, z, and the standard BF
Feistel function equation is given as (3.2):

R(E_L) = \left( \left( S_1[w] + S_2[x] \bmod 2^{32} \right) \oplus S_3[y] \right) + S_4[z] \bmod 2^{32}    (3.2)

where "+" is addition of 32-bit words modulo 2^{32}, S_1[w] denotes the
entry of key S-box 1 indexed by w, S_2[x] the entry of S-box 2 indexed by
x, S_3[y] the entry of S-box 3 indexed by y, and S_4[z] the entry of S-box 4
indexed by z.
The equation consists of an independent and a dependent variable.
Thus, Ordinary Differential Equation (ODE) techniques can be used to
increase the algorithm's speed. ODEs typically admit two types of
solutions: analytic and numerical. Analytical methods have drawbacks
that depend on the nature of the function and, in contrast to numerical
methods, are not well suited for practical applications. Numerical
approaches are therefore preferred in this situation. The first-order
Adams-Moulton method is similar to the backward Euler method, with a
local truncation error of Θ(h²). Higher-order methods have smaller
truncation errors: the fourth-order Adams-Moulton method has a local
truncation error of Θ(h⁵) (a global error of Θ(h⁴)) [102], offering more
precision and greater stability for a given step size h. The Adams-Moulton
formulation is implicit, unlike explicit methods such as Adams-Bashforth.
The fourth-order Adams-Moulton method also needs less space compared
to the existing Blowfish algorithm. The fourth-order Adams-Moulton
method is shown in equations (3.3) - (3.6).

R_{n+3} = R_{n+2} + \frac{h}{24}\left(9S_{n+3} + 19S_{n+2} - 5S_{n+1} + S_n\right)    (3.3)

R_{n+2} = R_{n+1} + \frac{h}{12}\left(5S_{n+2} + 8S_{n+1} - S_n\right)    (3.4)

R_{n+1} = R_n + \frac{h}{2}\left(S_{n+1} + S_n\right)    (3.5)

R_n = h\,S_n, \qquad S_n = S(N_0, R_0)    (3.6)

The actual Blowfish function is thus modified by incorporating parallel
evaluation and the Adams-Moulton method. All four 8-bit quarters are
executed in parallel and generate an ID that is passed to the function for
further processing.
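Purely as an illustration of this parallel-evaluation idea, and not the actual
implementation of this work, the four independent S-box substitutions of one
round could be launched concurrently and then combined exactly as in
equation (3.2); all names below are assumptions:

import java.util.concurrent.CompletableFuture;

final class ParallelRoundSketch {
    // sbox[4][256]: pre-initialised 32-bit S-box entries (assumed available).
    static int round(int[][] sbox, int eL) {
        // The four 8-bit quarters w, x, y, z are looked up in parallel.
        CompletableFuture<Integer> w = CompletableFuture.supplyAsync(() -> sbox[0][(eL >>> 24) & 0xFF]);
        CompletableFuture<Integer> x = CompletableFuture.supplyAsync(() -> sbox[1][(eL >>> 16) & 0xFF]);
        CompletableFuture<Integer> y = CompletableFuture.supplyAsync(() -> sbox[2][(eL >>> 8) & 0xFF]);
        CompletableFuture<Integer> z = CompletableFuture.supplyAsync(() -> sbox[3][eL & 0xFF]);
        // Combine as in equation (3.2): ((S1[w] + S2[x]) XOR S3[y]) + S4[z], mod 2^32.
        return ((w.join() + x.join()) ^ y.join()) + z.join();
    }
}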
Algorithm 5 Encryption Process of Modified Blowfish
1: The input block length is of 64-bits denoted by E
2: The output cipher block length is of 64-bits (Encrypted Data)
3: Divide 64-bits of input plaintext block into two 32-bits halves: EL = 32-bits, ER =
32-bits
4: for i = 1 to 16 do
5: EL = EL ⊕ Pi
6: ID = getK(EL )
7: ER = R(EL ,ID) ⊕ ER
8: Swap EL and ER
9: end for
10: Swap ER and EL (Undo last Swap)
11: ER = ER ⊕ P17
12: EL = EL ⊕ P18
13: Combine EL and ER

The decryption process is similar to the encryption process but applies
the sub-keys in reverse order, i.e., P[18] to P[1].

3.5 Encryption in Hadoop


This study aims to improve the performance and efficiency of the
cryptographic techniques used to store data in HDFS. In the typical stack,
cryptosystems can be enabled at three separate layers.

• Application layer: Application-level encryption (encryption within a
Hadoop application) enables a higher degree of granularity and
prevents "rogue administrators." Enabling encryption at this layer is
the most difficult, but it is also secure and robust. Unfortunately,
developing applications to achieve this is difficult, and it is
impractical for customers who already have existing apps.

• File System Layer: A second methodology encrypts the files or
directories in HDFS. This methodology utilizes specially designated
HDFS directories known as "encryption zones."

• Disk Layer: Disk encryption is the lowest level of encryption,
ensuring data protection even if it is lost due to an assault. This
technology does not allow fine-grained encryption of single files or
directories because the entire disc is encrypted.

Down the stack, things become easier, but up the stack, things become
more secure. The implementation of cryptographic methods at two layers,
namely the application and filesystem level, is used in this study.
Encryption at the disk level may be vulnerable to OS-level or runtime
attacks. Adopting application- and filesystem-level encryption will prevent
attacks at the disk level since HDFS provides an extra layer above the
native file system.

3.6 Implementation of Cryptosystems at Application Level

3.6.1 Application-Level Encryption

The process of writing a document into HDFS is depicted in Figure 3.6.
Before storing the file to the DataNodes, an HDFS client node divides the
file according to the chosen block size and encrypts each block using the
MR application. At the client machine, the MR application is run. MR
applications typically have two phases: Mapper and Reducer.

However, our programme has only the mapper phase for encrypting
(AES or BF) user data. Each file line is treated as a record by the Mapper
function, which encrypts it. The OutputCollector class in MR gathers the
output of the encryption function. MR performs the encryption process in
parallel, significantly increasing performance and lowering encryption
costs.
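As a rough sketch of such a mapper-only encryption job (not the exact code
used in this work), the following Java class illustrates the idea; the class
name, the configuration property carrying the Base64-encoded key, and the
use of AES in ECB mode with PKCS5 padding are assumptions made only for
this example.

import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EncryptMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private Cipher cipher;

    @Override
    protected void setup(Context context) throws java.io.IOException {
        try {
            // Hypothetical job property carrying a Base64-encoded 128-bit AES key.
            byte[] key = Base64.getDecoder()
                    .decode(context.getConfiguration().get("encrypt.job.aes.key"));
            cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        } catch (Exception e) {
            throw new java.io.IOException("Cipher setup failed", e);
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws java.io.IOException, InterruptedException {
        try {
            // Each input line is treated as one record and emitted in encrypted form.
            byte[] enc = cipher.doFinal(line.copyBytes());
            context.write(NullWritable.get(), new Text(Base64.getEncoder().encodeToString(enc)));
        } catch (Exception e) {
            throw new java.io.IOException("Encryption failed", e);
        }
    }
}

A corresponding driver would register this class with job.setMapperClass(...)
and pass the key through the job Configuration; a decryption mapper would
mirror this structure with Cipher.DECRYPT_MODE.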

A file’s encrypted blocks are written parallel to the DataNode


locations provided. Each encrypted block is replicated in consecutive

80
order. The NameNode is responsible for handling metadata for system
files and granting access to encrypted data.

Figure 3.6: Encryption process using MR

3.6.2 Application-Level Decryption

The client uses MR code to process the encrypted data stored in HDFS.
As depicted in Figure 3.7, this MR task is submitted to the Resource
Manager (RM). By asking the NameNode, RM determines whether the
file exists in HDFS. If the relevant file’s metadata is located, NameNode
responds to the RM with it. RM launches the container with enough
resources to run the task on the DataNodes where the blocks are stored.
The task is duplicated in an amount equal to the number of blocks in a
file, with each block running in parallel on various slave nodes.
In the decryption process, the following steps are defined:

1. Before processing the data, the Mapper function decrypts the
encrypted data.

2. After the Mapper has completed its task, the intermediate output is
created and encrypted before being stored in HDFS.

3. The Reducer phase receives the output of the Mapper phase as an
input.

4. Before conducting its series of operations and generating the
required output file, the Reducer decrypts the data once again.

5. The client receives the output file.

Figure 3.7: Decryption process using MR

3.6.3 Advantages of Application-level Encryption

1. This entire procedure gives customers fine-grained control over how
the file is stored and processed on HDFS.

2. Companies can make data security provisioning for files stored on
HDFS more accessible.

3. A separate cryptographic tool is not required if the client apps
encrypt the data before saving it to HDFS.

4. To hack or change a file protected by application-level encryption,
an attacker requires access to the HDFS contents as well as the
programs and keys used to encrypt the data.

3.6.4 Disadvantages of Application-level Encryption

1. The client must alter programmes by integrating the cryptosystem
into the MR code, which developers consider difficult.

2. The developer will require a certain amount of time and
computational resources for such a procedure.

3. Because various apps run on the same data businesses own, these
applications require access to, and control over, the encrypted data.

4. Organizations analyse data using various technologies such as Pig,
Hive, HBase, and Spark. Such tools do not have capabilities for
constructing such complicated encryption algorithms.

5. In such an environment, key management becomes more
complicated.

6. Application-level encryption cannot be used when using data
injection tools to send large amounts of data between
databases/web servers.

3.7 File-System Level Cryptosystems


Cryptographic techniques applied at the file system level will fill up the
gaps left by application-level limitations. Cryptographic techniques
were incorporated with the distributed file system in filesystem-level
encryption. There is no need to update the application code developed in
any tool like Pig, Hive, Spark, HBase, or Map-Reduce due to this
integration. Developers may concentrate on the logic of the completed
work rather than on security measures. A KMS is deployed at the file
system level, which an administrator manages. By maintaining security
safeguards on encrypted data, KMS delivers an appropriate access
control system to all of the firm’s users. This part focuses on combining
multiple cryptographic algorithms such as AES, RC6, BF, and Modified
Version of Blowfish, which were covered in section 3.4, to provide
security at the file system level.

3.7.1 TDE

Hadoop introduced the TDE technique to allow end-to-end encryption
while transparently securing data. With transparent encryption, clients
are unaware of the encryption/decryption methods, whereas end-to-end
encryption implies data is encrypted both at rest and in transit. The first
step in installing transparent encryption is to integrate HDFS with an
external, organization-level Keystore. A crucial part of this component,
used to mitigate different threats, is the separation of duties between a
key administrator and an HDFS superuser. The Hadoop KMS, a new
component/service that acts as an intermediary between HDFS users and
the Keystore, has been incorporated. To communicate with one another
and with HDFS users, both the Hadoop KMS (described in the following
section) and the Keystore must use Hadoop's KeyProvider API.
HDFS-enabled TDE provides these capabilities:

1. Client users are the only ones who can encrypt and decrypt data.

2. Encryption keys are stored outside of HDFS.

3. Encrypted data and encryption keys are inaccessible to the HDFS
superuser.

TDE on HDFS adds the notion of Encryption Zones (EZ), i.e.,
directories whose contents are automatically encrypted on write and
decrypted on read. These EZs have cryptographic algorithms encoded in
them (AES, RC6, BF, and MBF). Only an empty directory can be
changed into an EZ. The EZ automatically encrypts all files and
subdirectories produced within it. When a zone is formed, the key
administrator specifies a unique key connected with it. This key is
referred to as the EZ key and is saved on the KMS, separate from HDFS.
When a client writes a file to one of these encryption zones, the file is
encrypted using a unique key known as the Data Encryption Key (DEK).
Each DEK is encrypted using its encryption zone's EZ key, producing an
Encrypted Data Encryption Key (EDEK). Figure 3.8 displays the
complete process of encrypting and decrypting a key.

The EDEK is associated with the file's metadata and preserved in the
NameNode, whereas EZ keys are always stored in the KMS key store. To
minimise attacks that occur at the operating-system or file-system level,
the DEK is not stored on the KMS or the NameNodes.

Figure 3.8: TDE key formation

i. Creating an Encryption Zone

Figure 3.9 describes the creation of the EZ within HDFS.

Figure 3.9: Creation of EZ

1. The client requests the KMS to create an EZ key.
2. KMS checks the ACL permissions; if the client is allowed, it creates
and stores the EZ key in its Keystore.
3. Now the client can create a plain directory in the NameNode and
request the superuser to make that directory an EZ.
4. The HDFS superuser (NameNode) passes the request to KMS.
5. KMS checks the access-level permissions; if approved, the directory
is converted to an EZ using the EZ key created by the concerned
client.

In the above process, KMS treats the client as the admin key user of
the EZ key.

• Commands Used For Creating EZ

As a client, create an EZ key and a plain directory using:

$ bin/hadoop key create demokey

$ bin/hdfs dfs -mkdir /demozone

Now change to the superuser and convert the directory to an EZ using:

$ bin/hdfs crypto -createZone -keyName demokey -path /demozone

We have modified the CryptoCodec class to integrate RC6, BF, and
MBF, enhancing performance and offering alternative cipher suites to
the existing one. The hdfs crypto package contains the CryptoCodec
class, where by default the AES encryption/decryption is applied. The
crypto sub-command of HDFS creates the EZ using the EZ key. All the
directories, sub-directories, and files get encrypted on write and
decrypted on read using any of the four ciphers (AES, RC6, BF, and
MBF).

Change the owner of the EZ to the client:

$ bin/hdfs dfs -chown demouser:demouser /demozone
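The configured zones can be verified with the crypto sub-command's
listZones option:

$ bin/hdfs crypto -listZones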

ii. Reading/Writing files within an EZ

Figure 3.10 describes the procedure for reading a file from, or writing
a file to, an EZ, as follows.

1. To encrypt/decrypt a file, the client requests an EDEK for the EZ
from the NameNode.
2. The NameNode passes the request to KMS.
3. KMS checks the access privileges on the EZ and creates an EDEK.
4. KMS responds to the NameNode with the EDEK.
5. The NameNode forwards the EDEK to the client.
6. Now the client requests KMS to decrypt the EDEK to get the DEK.
7. KMS decrypts the EDEK after checking the ACL permissions and
responds to the client with a DEK.
8. Now the client uses the DEK to encrypt the file and stores the
encrypted file in HDFS.

In the entire process, the DEK is not persisted, whereas the EDEK is
attached to the metadata of the file. The whole process of encrypting and
decrypting the EDEK takes place in the KMS; the client never handles
the EZ key in the entire process.
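Because the encryption and decryption are transparent, a client then reads
and writes inside the zone with the ordinary HDFS shell; for example, with
the /demozone zone created earlier and a hypothetical local file:

$ bin/hdfs dfs -put sample.txt /demozone/

$ bin/hdfs dfs -cat /demozone/sample.txt

The put command stores the file in encrypted form on the DataNodes, while
cat returns the decrypted plaintext to an authorised client.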

Figure 3.10: Write/Read a file to/from EZ

3.7.2 KMS Architecture

The Hadoop framework has a KMS component that handles
cryptographic keys. Hadoop's KMS is implemented as a Java Jetty web
application. Clients interact with it over HTTP through a
Representational State Transfer (REST) Application Programming
Interface (API). Internally, KMS uses ACL permissions to grant access to
specific individuals, groups, or applications.

Hadoop KMS provides a fine-grained access control mechanism on
encryption keys and key operations. KMS ACL plays a pivotal role in
controlling key access to secure encryption keys. KMS will not directly
control access to data; instead, it ensures whether or not an authorized
client can perform a particular operation on encryption keys.

Hadoop KMS manages ACL for many user roles, including key
administrators, HDFS superusers, HDFS service users, and end-users.
Encryption zone keys are created and managed by a key administrator.
The HDFS superuser generates encryption zones, but they are not
permitted to decode data stored in these zones. The HDFS service user
can generate an EDEK for each encryption zone key. Finally, clients
classified as end users have access to encryption zones and may read and
write to them.

As shown in Figure 3.11, the KMS ACL classes offer granular access
control on keys and may be applied KMS-wide or per key. The
hadoop.kms.acl class is used throughout KMS to grant users access to
certain operations, while the hadoop.kms.blacklist class is used KMS-wide
to deny particular users access to specific operations. We may define
access control for particular users by setting key-specific ACL permissions
for specific operations. The access rights across all keys for executing key
operations are controlled by the whitelist.key.acl class. A single key is
represented by the key.acl class. All keys for which no ACL has been set
are covered by the default.key.acl class. These key-specific ACL
configurations are evaluated based on two critical rules:

1. The whitelist class configuration bypasses the key.acl and
default.key.acl classes.

2. The configuration defined in key.acl overrides the corresponding
default.key.acl operation class.

Figure 3.11: Hadoop KMS Architecture

The mechanism used by KMS to determine whether a user is granted
or denied access is depicted in Figure 3.11. The client or NameNode
requests a key from KMS. KMS then evaluates the request in two stages:
KMS-wide and key-specific operations. If a user or group is on the
blacklist, access is denied and the outcome is "Denied". If the user or
group has permission for the KMS-wide operation, the decision flow
moves to key-specific authorisation. First, KMS checks the whitelist
operations in the key-specific configuration, and access privileges are
granted if the user is authorised there. If the user is not found, the
evaluation continues. Second, the KMS ACL checks the key.acl
configuration and allows the request if a matching entry is discovered; if
not, it assesses default.key.acl and approves the request based on the
user's operations. If a user or group is not mentioned in the whitelist, key,
or default configuration, they are blocked and their access outcome is set
to "Denied".
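To make these ACL classes concrete, the following is a minimal kms-acls.xml
sketch; the operation suffixes (CREATE, DECRYPT_EEK, MANAGEMENT) follow
commonly documented Hadoop KMS property names, and the user and key names
(keyadmin, hdfs, demouser, demokey) are illustrative assumptions only.

<configuration>
  <!-- KMS-wide: who may create keys at all -->
  <property>
    <name>hadoop.kms.acl.CREATE</name>
    <value>keyadmin</value>
  </property>
  <!-- KMS-wide blacklist: the hdfs superuser may never decrypt EDEKs -->
  <property>
    <name>hadoop.kms.blacklist.DECRYPT_EEK</name>
    <value>hdfs</value>
  </property>
  <!-- Default for keys without their own ACL -->
  <property>
    <name>default.key.acl.MANAGEMENT</name>
    <value>keyadmin</value>
  </property>
  <!-- Key-specific ACL: only demouser may decrypt with demokey -->
  <property>
    <name>key.acl.demokey.DECRYPT_EEK</name>
    <value>demouser</value>
  </property>
</configuration>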

The TDE approach assures that different attacks on the data can be
neutralised by encrypting at the file-system level. Insider attacks (HDFS
admin, root access) and hardware-level access are examples of such
vulnerabilities. In all of these cases, the attacker only has access to the
encrypted data, not the file's original content. EDEKs are kept together
with the metadata of a file in this approach, whereas EZ keys are
persisted in the KMS and DEKs are not stored at all. With key-rolling
policies, users can re-encrypt the encrypted data.

3.8 Summary
This chapter focused on cryptographic techniques to address the data
security issue at the HDFS storage level. Cryptographic techniques are
used at two levels to achieve data security: application level and
filesystem level. The application level provides a secure means of storing
data in HDFS but lacks performance and increases complexity when
integrating with tools like Pig, Hive, Spark, Sqoop, and Flume. A
filesystem-level cryptosystem addresses these shortcomings by providing
improved performance and the flexibility to interface with diverse tools.
The performance of both levels is discussed in the next chapter.

Chapter 4

RESULT AND DISCUSSION

4.1 INTRODUCTION
HDFS is used to store and manage enormous amounts of data. As a
result, any data platform hoping to break through into the business
mainstream must prioritise security and data governance. In recent years,
the Hadoop community (both open-source and commercial) has made
tremendous progress in minimising the threats to the Hadoop system, but
more work is needed. The Hadoop community must band together to
address Hadoop's security vulnerabilities, as these are the issues
preventing many practitioners from implementing Hadoop for
production-grade workloads and mission-critical applications.

Hadoop offers several advantages: it is free and open-source, simple to
use, and performs well. The open-source HDFS software stores large
amounts of data, which MR analyses with high throughput and fault
tolerance. HDFS is a crucial target in the Hadoop system; thus, the lack
of a security model became the Hadoop software's fundamental flaw.
Given Hadoop's relevance in today's businesses, there is a growing trend
toward providing high-security features.
Chapter 4 presents the performance analysis of the implemented
data-level security (data protection) mechanisms in the Hadoop cluster at
two different levels, namely,

1. MR model

2. TDE model

4.2 PERFORMANCE ANALYSIS

4.2.1 Encryption time

The time it takes to convert plaintext to ciphertext is known as the
encryption time. The key size, plaintext block size, and mode significantly
dictate how long it takes to encrypt data. In this experiment, encryption
time is measured in milliseconds. The amount of time it takes to encrypt
data influences the performance of the system. As a result, the time it
takes to encrypt data must be minimal, resulting in a speedy and
responsive design.

The total encryption time of the records is calculated by dividing the
total encrypted plaintext in bytes by the time taken to encrypt each
document, as indicated in equation (4.1).

C(t) = \frac{E(p(s))}{T(e(s))}    (4.1)

where
C(t) = total time consumed for encryption of the records in bytes,
E(p(s)) = encrypted plain text in bytes,
T(e(s)) = time consumed to encrypt each document.

4.2.2 Decryption time

The decryption time is the time it takes to retrieve plaintext from
ciphertexts. The decryption time should be less than the encryption time
to make the system responsive and fast. Decryption time, likewise, has an
impact on the system's performance. The decryption time is measured in
milliseconds in this experiment.

Decryption time is calculated as the total ciphertext in bytes decrypted
divided by the time taken to decipher each record, as indicated in
equation (4.2).

P(t) = \frac{D(c(s))}{T(d(s))}    (4.2)

where P(t) is the whole time taken for decryption of the records in bytes,
D(c(s)) is the decrypted ciphertext in bytes, and T(d(s)) is the time taken
to decipher each record.

4.3 Implementation of Efficient and Scalable Security Model for Hadoop Cluster

This section focuses on the performance evaluation of both proposed
models, namely,

• Model 1: MR model

• Model 2: TDE model

These models are notable for their security provision at the data level.
The following sections and subsections present the step-wise performance
evaluation of the suggested models.

4.4 Performance of Application-Level Encryption / Decryption Using MR Model

MR optimizes network traffic by shifting processing to the data node
where the data is stored. An application requests a computation that
operates on its data; when the computation is executed near the data, it
becomes much more efficient, especially for large data sizes. The
underlying assumption is that processing the data at its original location
is far superior to relocating the data and computing it where the
programme runs. On the other hand, data locality is achieved by the Map
task, which, once completed, sends its result to the Reduce task machine
across the network. As a result, there is an opportunity to develop this
method even further.

4.4.1 Experimental Environment

We configured a Hadoop testbed with a master node and a slave node
to evaluate the performance of the proposed strategy. The master and
slave nodes each include an 8-core, 2.2 GHz CPU, a 1 TB hard drive, and
8 GB of RAM.

4.4.2 Analysis of Space Consumption: AES VS BF

A comparison was performed based on the amount of space plain text
takes up once it is encrypted. The suggested BF technique consumes
494.49 MB of space while encrypting a plain file of 366.61 MB, whereas
the AES approach consumes 497.81 MB. Table 4.1 shows that AES
encryption takes up more space than the BF technique. The Map-Reduce
model space utilised by the proposed BF system and the present AES
method is shown in Table 4.1. According to the investigation, BF creates
an 11.61 MB file for a 7.95 MB file size, whereas the current AES
technique yields 11.92 MB for the same file size. The proposed BF
approach generates 25.44 MB for an 18.09 MB file size, while the known
AES method creates 26.52 MB.

The suggested BF approach thus yields 37.96 MB for a 25.96 MB file
size, while the known AES method produces 38.92 MB. Furthermore, the
presented BF approach yields 66.7 MB for a 45.61 MB file size, while the
available AES method offers 68.32 MB. The proposed BF technique
returns 127.31 MB for a 91.21 MB file size, while the existing method
returns 130.36 MB. The proposed BF approach yields 190.21 MB for a
137.48 MB file size, whereas the known AES method gets 193.12 MB.
Furthermore, with a file size of 194.65 MB, the proposed BF technique
returns 265 MB, whereas the available AES method returns 270.13 MB.
After that, the BF approach yields 374.18 MB, whereas the existing AES
method creates 378.6 MB for a 274.96 MB file size. Finally, the BF
approach generates 494.49 MB for a 366.61 MB file size, while the known
AES method yields 497.81 MB.

Table 4.1: Analysis of Map-Reduce Model Space Consumption after Encryption

Plain Text (MB) AES (MB) BF (MB)


7.95 11.92 11.61
18.09 26.52 25.44
25.96 38.92 37.96
45.61 68.32 66.7
91.21 130.36 127.31
137.48 193.12 190.21
194.65 270.13 265
274.96 378.6 374.18
366.61 497.81 494.49

Figure 4.1: Map-Reduce Model Space Consumption Analysis after Encryption

In comparison to the proposed method, the current system consumes
more space; the suggested solution consumes less space for all file sizes.
As a result, it can be concluded that the new approach outperforms the
existing one. Figure 4.1 shows a graphical depiction of the space utilised.

4.4.3 Analysis of Computational Time: AES VS BF

The file encryption and decryption times for AES-encrypted HDFS and
BF-encrypted HDFS are compared in this section. Table 4.2 and Table
4.3 illustrate the comparison and indicate that AES takes less time than
the BF algorithm to convert plain text to cipher text, or cipher text back
to plain text. It means that the AES method outperforms BF in terms of
the time it takes to encrypt or decrypt a file. Figure 4.2 and Figure 4.3
depict this graphically.

Table 4.2: Map-Reduce Model computational time Analysis during encryption

Plain Text (MB) AES (Sec.) BF (Sec.)


7.95 7 10
18.09 7 13
25.96 9 25
45.61 13 40
91.21 14 42
137.48 17 45
194.65 19 49
274.96 32 55
366.61 55 72

Table 4.2 shows the encryption time taken by the proposed BF and
current approaches such as AES for various data quantities. According to
the table, the suggested BF algorithm encrypts a file of 7.95 MB in an
average of 10 seconds. The present AES, on the other hand, takes an
average of 7 seconds to encrypt the same data.

The suggested BF approach takes 13 seconds for a file size of 18.09 MB,
while the existing AES approach takes 7 seconds. The proposed BF
approach takes 25 seconds for a 25.96 MB file, while the current AES
method takes 9 seconds. Furthermore, the suggested BF approach takes
40 seconds to process a 45.61 MB file, whereas the conventional AES
method takes 13 seconds. The proposed BF approach takes 42 seconds to
process a 91.21 MB file, whereas the traditional method takes 14 seconds.
The suggested BF approach takes 45 seconds for a 137.48 MB file, while
the known AES method takes 17 seconds. Compared to the current AES
approach, we can clearly conclude that the proposed method takes longer
to encrypt.

Figure 4.2: Map-Reduce Model computational time Analysis during encryption

Table 4.3 compares the decryption times of the proposed BF and AES
approaches for different data amounts.
Table 4.3: Map-Reduce Model computational time Analysis during decryption

Encrypted File in MB AES (Sec.) BF (Sec.)


11.61 6 10
25.44 6 13
37.98 10 25
66.72 13 39
127.35 15 41
190.27 17 44
265.09 18 48
374.31 30 54
494.66 31 70

According to the table, the suggested BF algorithm decrypts the
encrypted output of a 7.95 MB plaintext file in an average of 10 seconds,
whereas the present AES takes an average of 6 seconds to decrypt.

Figure 4.3: Map-Reduce Model computational time Analysis during decryption

Also, with a file size of 11.61 MB, the proposed BF approach takes
10 seconds, whereas the existing AES methods take 6 seconds. The BF
algorithm takes 13 seconds to process a 25.44 MB file, while
conventional AES methods take 6 seconds. Furthermore, the suggested
BF approach takes 25 seconds to process a 37.98 MB file, while
traditional AES methods take 10 seconds. The proposed BF approach

takes 39 seconds to process a 66.72 MB file, while the conventional
method takes 13 seconds. The suggested BF approach thus takes 41
seconds for a 127.35 MB file, while the known AES methods take 15
seconds.

4.5 Implementation of TDE Model


TDE is a feature of HDFS that protects data from unwanted access.
Because the HDFS enabled TDE method is built as an end-to-end
solution, clients who store their data within HDFS may benefit because
it is encrypted and decrypted at the client side.

4.5.1 Experimental setup

The computation is done on AWS EC2 instances. Here, the master
node is configured on a t2.2xlarge instance type with 8 vCPUs and 32 GB
of memory. At the same time, the slave nodes are configured on
general-purpose t2.medium instances with 2 vCPUs and 4 GB of memory.
On the master and slave nodes, the storage space provided is 30 GB of
SSD (EBS-only).

4.6 Performance Evaluation of TDE Model: AES vs RC6

4.6.1 Analysis of Space Consumption: AES VS RC6

Table 4.4 illustrates that under the TDE model for the Hadoop
framework, the proposed RC6 algorithm outperforms the AES method in
terms of space consumption. Figure 4.4 depicts the space utilised
graphically.

The TDE model space utilised by the proposed RC6 system and the
present AES method is shown in Table 4.4.

Table 4.4: TDE Model Space Consumption Analysis: AES vs RC6

Plain Text (MB) AES (MB) RC6 (MB)


8 9.63 9.43
18 21.85 21.4
26 31.82 31.16
45.6 54.75 53.63
91.12 109.5 107.3
137.5 165.3 161.9
194 233.3 228.5
275 330.6 323.8
366.8 440.8 431.72
520.7 625.7 612.84
840 1009.38 988.64

Figure 4.4: TDE Model Space Consumption Analysis: AES vs RC6

According to the research, RC6 creates a file of 9.43 MB for an 8 MB
input file, while the existing AES technique yields 9.63 MB for the same
file size. The suggested RC6 technique yields 21.4 MB for an 18 MB file
size, while the in-built AES method creates 21.85 MB. The proposed RC6
technique produces 31.16 MB for a 26 MB file size, while the existing AES
method produces 31.82 MB. Furthermore, the suggested RC6 approach
yields 53.63 MB for a 45.6 MB file size, while the current AES method
yields 54.75 MB.

The proposed RC6 approach yields 107.3 MB for a 91.12 MB file, while the existing method yields 109.5 MB. The proposed RC6 technique produces 161.9 MB for a 137.5 MB file, while the existing AES method creates 165.3 MB. Furthermore, the proposed RC6 approach yields 228.5 MB for a 194 MB file, while the AES method yields 233.3 MB. The proposed RC6 algorithm then generates 323.8 MB for a 275 MB file, whereas the existing AES method produces 330.6 MB. The proposed RC6 technique generates 431.72 MB for a 366.8 MB file, while the in-built AES method creates 440.8 MB, and it yields 612.84 MB for a 520.7 MB file, against 625.7 MB for AES.

Similarly, the proposed RC6 approach yields 988.64 MB for the largest file size of 840 MB, whereas the existing method yields 1009.38 MB. The in-built AES algorithm therefore consumes more space than the proposed RC6 technique, and it can be concluded that the proposed RC6 approach outperforms the existing AES algorithm. Figure 4.4 shows a graphical depiction of the space consumed.
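
For reference, the relative overhead implied by Table 4.4 can be checked with a short calculation; the sketch below (illustrative only, using three rows copied from the table) computes the percentage of extra space each cipher adds over the plaintext.

# (plain MB, AES MB, RC6 MB) rows taken from Table 4.4
rows = [
    (8.0, 9.63, 9.43),
    (91.12, 109.5, 107.3),
    (840.0, 1009.38, 988.64),
]
for plain, aes, rc6 in rows:
    aes_overhead = (aes - plain) / plain * 100   # extra space added by AES
    rc6_overhead = (rc6 - plain) / plain * 100   # extra space added by RC6
    print(f"{plain:7.2f} MB: AES +{aes_overhead:.1f}%, RC6 +{rc6_overhead:.1f}%")

Across the table the AES ciphertext is roughly 20% larger than the plaintext while the RC6 ciphertext is roughly 18% larger, which is consistent with the percentages reported later in Table 4.11.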

4.6.2 Analysis of Computational time: AES VS RC6

Table 4.5 illustrates that, under the TDE Model for the Hadoop Framework, the proposed RC6 technique delivers better results than the AES algorithm as the file size and computing time rise. The TDE model computation time using the proposed RC6 system and the conventional AES technique is shown in Table 4.5. According to the analysis, RC6 takes 0.27 sec for a file size of 8 MB; the existing AES technique also takes 0.27 sec for the same file size. The proposed RC6 technique takes 0.37 sec for an 18 MB file, the same as the existing AES method. The proposed RC6 technique takes 0.47 sec for a 26 MB file, matching the existing AES method. Furthermore, the proposed RC6 process takes 0.68 sec for a 45.6 MB file, as does the current AES method.
Table 4.5: TDE Model Computational time Analysis: AES vs RC6

Input Plain Text in MB AES (Sec.) RC6 (Sec.)


8 0.27 0.27
18 0.37 0.37
26 0.47 0.47
45.6 0.68 0.68
91.12 1.19 1.15
137.5 1.7 1.6
194 2.2 2.2
275 3.3 3.1
366.8 6.3 5.1
520.7 10.4 8.6
840 26.8 16.06

Figure 4.5: TDE Model Computational time Analysis: AES vs RC6

The proposed RC6 approach takes 1.15 seconds to process a 91.12 MB file, whereas the existing method takes 1.19 seconds. The proposed RC6 technique takes 1.6 seconds for a 137.5 MB file, while the conventional AES method takes 1.7 seconds. Furthermore, for a 194 MB file, both the proposed RC6 technique and the current AES method take 2.2 seconds.

Also, the RC6 technique takes 3.1 seconds for a 275 MB file, whereas the conventional AES method takes 3.3 seconds. The proposed RC6 approach takes 5.1 seconds for a 366.8 MB file, whereas the traditional AES method takes 6.3 seconds. Furthermore, for a 520.7 MB file, the proposed RC6 technique takes 8.6 seconds, whereas the conventional AES method takes 10.4 seconds. Similarly, at the largest file size of 840 MB, the proposed RC6 approach takes 16.06 seconds, whereas the traditional method takes 26.8 seconds. As a result, the existing AES algorithm requires more computing time than the proposed technique for every file size considered. Figure 4.5 shows a graphical depiction of the computing time of the AES and RC6 algorithms.

4.7 Performance Evaluation of TDE Model: AES vs


RC6 vs BF
Table 4.6 and Table 4.7 show how much space AES, RC6, and BF take
after encryption and how long they take to compute.
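
As a point of reference for how such measurements can be reproduced outside the cluster, the sketch below times a single encryption pass and reports the ciphertext size for AES and Blowfish on one in-memory buffer. It assumes the PyCryptodome package and is not the Hadoop/TDE code used in this work; RC6 is omitted because that library does not provide it, and figures obtained this way will differ from the cluster results in Tables 4.6 and 4.7, which include HDFS and MapReduce overheads.

import time
from Crypto.Cipher import AES, Blowfish
from Crypto.Random import get_random_bytes
from Crypto.Util.Padding import pad

def measure(make_cipher, block_size, data):
    # Return (seconds, ciphertext length in bytes) for one encryption pass.
    start = time.perf_counter()
    ciphertext = make_cipher().encrypt(pad(data, block_size))
    return time.perf_counter() - start, len(ciphertext)

data = get_random_bytes(8 * 1024 * 1024)            # an 8 MB test buffer
aes_key, bf_key = get_random_bytes(16), get_random_bytes(16)

t_aes, n_aes = measure(lambda: AES.new(aes_key, AES.MODE_CBC), AES.block_size, data)
t_bf, n_bf = measure(lambda: Blowfish.new(bf_key, Blowfish.MODE_CBC), Blowfish.block_size, data)
print(f"AES: {t_aes:.3f} s, {n_aes / 1e6:.2f} MB ciphertext")
print(f"BF : {t_bf:.3f} s, {n_bf / 1e6:.2f} MB ciphertext")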

4.7.1 Analysis of Space Consumption: AES vs RC6 vs BF

Table 4.6 shows that the BF algorithm uses less space than the AES and RC6 techniques. The space consumed after encryption is similar or substantially identical for smaller file sizes. As the file size grows larger, the space occupied by the BF-encrypted data is determined to be smaller than that of the AES and RC6 methods, as shown in Table 4.6.
Table 4.6: TDE Model Space Consumption Analysis: AES vs RC6 vs BF

Plain Text (MB) AES (MB) RC6 (MB) BF (MB)


8.0 9.630 9.430 8.870
18.0 21.850 21.40 20.130
26.0 31.820 31.160 29.320
45.60 54.750 53.630 50.40
91.120 109.50 107.30 100.90
137.50 165.30 161.90 152.30
194.0 233.30 228.50 214.90
275.0 330.60 323.80 304.50
366.80 440.80 431.720 406.040
520.70 625.70 612.840 576.40
840.0 1009.380 988.640 929.840

According to Table 4.6, the AES and RC6 methods occupy 9.63 MB and 9.43 MB, respectively, after encrypting an 8 MB file, while the BF algorithm occupies 8.87 MB. For an 18 MB file, the AES, RC6, and BF algorithms occupy 21.85 MB, 21.4 MB, and 20.13 MB, respectively.

For a 26 MB file, the AES, RC6, and BF algorithms occupy 31.82 MB, 31.16 MB, and 29.32 MB, respectively. For a 45.6 MB file, they occupy 54.75 MB, 53.63 MB, and 50.4 MB, respectively. After encrypting a 91.12 MB file, the space occupied is 109.5 MB, 107.3 MB, and 100.9 MB for the AES, RC6, and BF algorithms, respectively. For a 137.5 MB file, the AES, RC6, and BF algorithms occupy 165.3 MB, 161.9 MB, and 152.3 MB, respectively. Encrypting a 194 MB file yields 233.3 MB, 228.5 MB, and 214.9 MB for the AES, RC6, and BF algorithms, respectively, and a 275 MB file yields 330.6 MB, 323.8 MB, and 304.5 MB.

For a 366.8 MB file, the AES, RC6, and BF algorithms occupy 440.8 MB, 431.72 MB, and 406.04 MB, respectively. For a 520.7 MB file, they occupy 625.7 MB, 612.84 MB, and 576.4 MB, respectively. Similarly, for the 840 MB file, the AES, RC6, and BF algorithms occupy 1009.38 MB, 988.64 MB, and 929.84 MB, respectively. Figure 4.6 shows the graphical depiction of the space used after encryption.

Figure 4.6: TDE Model Space Consumption Analysis: AES vs RC6 vs BF

4.7.2 Analysis of Computational time: AES vs RC6 vs BF

The computation time for a plain file of size 8 MB may be observed in Table 4.7 (i.e., 0.27 sec for both the AES and RC6 algorithms). The BF algorithm likewise takes 0.27 seconds to encrypt an 8 MB file. The AES, RC6, and BF algorithms each take 0.37 sec to encrypt an 18 MB file and 0.47 sec to encrypt a 26 MB file.

Table 4.7: TDE Model Computational time Analysis: AES vs RC6 vs BF

Plain Text in MB AES (Sec.) RC6 (Sec.) BF (Sec.)


8.0 0.270 0.270 0.270
18.0 0.370 0.370 0.370
26.0 0.470 0.470 0.470
45.60 0.680 0.680 0.680
91.120 1.190 1.150 1.180
137.50 1.70 1.60 1.60
194.0 2.20 2.20 2.20
275.0 3.30 3.10 2.90
366.80 6.30 5.10 4.70
520.70 10.40 8.60 7.70
840.0 26.80 16.10 14.60

Figure 4.7: TDE Model Computational time Analysis: AES vs RC6 vs BF

For the AES, RC6, and BF algorithms, the computation time for a 45.6 MB file was 0.68 sec in each case. Encrypting a 91.12 MB file took 1.19, 1.15, and 1.18 seconds for the AES, RC6, and BF algorithms, respectively. Furthermore, the AES, RC6, and BF algorithms required 1.7 seconds, 1.6 seconds, and 1.6 seconds, respectively, to encrypt a 137.5 MB file. Then, for the AES, RC6, and BF algorithms, the computation time for a 194 MB file was 2.2 seconds in each case. The calculation time required to encrypt a 275 MB file was 3.3 seconds, 3.1 seconds, and 2.9 seconds for the AES, RC6, and BF algorithms, respectively.

For a 366.8 MB file, the computation time is 6.3 seconds, 5.1 seconds, and 4.7 seconds for the AES, RC6, and BF algorithms, respectively. The AES, RC6, and BF algorithms took 10.4 seconds, 8.6 seconds, and 7.7 seconds, respectively, to encrypt a 520.7 MB file. Similarly, the AES, RC6, and BF algorithms encrypted the 840 MB file in 26.8 seconds, 16.06 seconds, and 14.6 seconds, respectively. The computing time needed for all of these techniques is similar or substantially identical for smaller file sizes. As the file size grows larger, BF's computation time for data encryption is determined to be smaller than that of the AES and RC6 methods, as shown in Figure 4.7.

4.8 Performance Evaluation of TDE Model: AES vs


RC6 vs BF vs MBF
Table 4.8 and Table 4.9 show how the proposed MBF compares to the existing AES, RC6, and BF approaches in terms of space consumption and computational time.

4.8.1 Analysis of Space Consumption: AES vs RC6 vs BF vs MBF

Table 4.8 shows that the MBF method takes up less space than the AES,
RC6, and BF algorithms.

The TDE model space occupied by the proposed MBF system and the existing AES, RC6, and BF algorithms is shown in Table 4.8. According to the analysis, MBF creates an 8.57 MB file for an 8 MB input, whereas the existing AES, RC6, and BF techniques generate 9.63 MB, 9.43 MB, and 8.87 MB for the same file size. The proposed MBF approach generates 19.45 MB for an 18 MB file, while the existing AES, RC6, and BF methods produce 21.85 MB, 21.4 MB, and 20.13 MB. The proposed MBF technique yields 28.33 MB for a 26 MB file, while the existing AES, RC6, and BF methods produce 31.82 MB, 31.16 MB, and 29.32 MB.

Table 4.8: TDE Model Space Consumption Analysis: AES vs RC6 vs BF vs MBF

Plain Text (MB) AES (MB) RC6 (MB) BF (MB) MBF (MB)

8.0 9.630 9.430 8.870 8.570


18.0 21.850 21.40 20.130 19.450
26.0 31.820 31.160 29.320 28.330
45.60 54.750 53.630 50.40 48.750
91.120 109.50 107.30 100.90 97.50
137.50 165.30 161.90 152.30 147.180
194.0 233.30 228.50 214.90 207.650
275.0 330.60 323.80 304.50 294.350
366.80 440.80 431.720 406.04 392.470
520.70 625.70 612.840 576.40 557.130
840.0 1009.380 988.640 929.840 898.760

Furthermore, the proposed MBF technique yields 48.75 MB for a 45.6 MB file, while the existing AES, RC6, and BF methods yield 54.75 MB, 53.63 MB, and 50.4 MB. The proposed MBF approach yields 97.5 MB for a 91.12 MB file, while the existing methods yield 109.5 MB, 107.3 MB, and 100.9 MB. The proposed MBF technique produces 147.18 MB for a 137.5 MB file, while the existing AES, RC6, and BF methods yield 165.3 MB, 161.9 MB, and 152.3 MB. Furthermore, the proposed MBF technique produces 207.65 MB for a 194 MB file, while the existing AES, RC6, and BF methods yield 233.3 MB, 228.5 MB, and 214.9 MB.

Figure 4.8: TDE Model Space Consumption Analysis: AES vs RC6 vs BF vs MBF

Therefore, the proposed MBF technique generates 294.35 MB for a 275 MB file, while the conventional AES, RC6, and BF methods produce 330.6 MB, 323.8 MB, and 304.5 MB. The proposed MBF approach yields 392.47 MB for a 366.8 MB file, while the existing AES, RC6, and BF methods yield 440.8 MB, 431.72 MB, and 406.04 MB. Furthermore, the proposed MBF technique produces 557.13 MB for a 520.7 MB file, while the existing AES, RC6, and BF methods yield 625.7 MB, 612.84 MB, and 576.4 MB. Similarly, the proposed MBF technique yields 898.76 MB for the largest file size of 840 MB, while the existing AES, RC6, and BF approaches yield 1009.38 MB, 988.64 MB, and 929.84 MB. In comparison to the proposed technique, the existing algorithms consume more storage space for every file size considered. As a result, it can be concluded that the proposed MBF strategy performs well in comparison to the earlier approaches. Figure 4.8 shows a graphical depiction of the space consumed.

4.8.2 Analysis of Computational time: AES vs RC6 vs BF vs MBF

The computing time needed for all of these techniques is similar or substantially identical for smaller file sizes. As shown in Table 4.9 and Figure 4.9, however, the calculation time required by MBF for data encryption becomes increasingly smaller than that of the AES, RC6, and BF algorithms as the file size rises.

Table 4.9: TDE Model Computational time Analysis: AES vs RC6 vs BF vs MBF

Plain Text in MB AES (Sec.) RC6 (Sec.) BF (Sec.) MBF (Sec.)

8.0 0.270 0.270 0.270 0.270


18.0 0.370 0.370 0.370 0.370
26.0 0.470 0.470 0.470 0.470
45.60 0.680 0.680 0.680 0.680
91.120 1.190 1.150 1.180 1.180
137.50 1.70 1.60 1.60 1.60
194.0 2.20 2.20 2.20 1.90
275.0 3.30 3.10 2.90 2.80
366.80 6.30 5.10 4.70 4.40
520.70 10.40 8.60 7.70 7.50
840.0 26.80 16.060 14.60 13.80

The TDE model computation time analysis for the proposed MBF system and the existing AES, RC6, and BF algorithms is shown in Table 4.9. According to the analysis, MBF takes 0.27 sec for an 8 MB file, and the existing AES, RC6, and BF algorithms likewise take 0.27 sec each for the same file size. The proposed MBF approach takes 0.37 sec for an 18 MB file, as do the existing AES, RC6, and BF algorithms.

Figure 4.9: TDE Model Computational time Analysis: AES vs RC6 vs BF vs MBF

The proposed MBF technique takes 0.47 sec for a 26 MB file, matching the 0.47 sec of the existing AES, RC6, and BF methods. Furthermore, the proposed MBF technique takes 0.68 sec for a 45.6 MB file, while the existing AES, RC6, and BF methods also take 0.68 sec each. The proposed MBF approach takes 1.18 sec for a 91.12 MB file, while the existing methods take 1.19 sec, 1.15 sec, and 1.18 sec. The proposed MBF technique takes 1.6 sec for a 137.5 MB file, while the existing AES, RC6, and BF methods take 1.7 sec, 1.6 sec, and 1.6 sec, respectively. Furthermore, the proposed MBF technique takes 1.9 sec for a 194 MB file, while the existing AES, RC6, and BF methods take 2.2 sec each. The proposed MBF technique takes 2.8 sec for a 275 MB file, while the existing AES, RC6, and BF methods take 3.3 sec, 3.1 sec, and 2.9 sec.

The proposed MBF approach takes 4.4 sec for a 366.8 MB file, while the conventional AES, RC6, and BF methods take 6.3 sec, 5.1 sec, and 4.7 sec, respectively. Furthermore, the proposed MBF technique takes 7.5 sec for a 520.7 MB file, while the existing AES, RC6, and BF methods take 10.4 sec, 8.6 sec, and 7.7 sec. Similarly, for the largest file size of 840 MB, the proposed MBF technique takes 13.8 seconds, while the existing AES, RC6, and BF methods take 26.8 seconds, 16.06 seconds, and 14.6 seconds. In comparison to the proposed technique, the existing algorithms require more computing time, and the proposed method takes less time than the existing approaches for every file size considered. As a result, it can be concluded that the proposed MBF strategy performs well in comparison to the earlier approaches. Figure 4.9 shows a graphical depiction of the computational time.

4.9 Analysis of Throughput and Percentage Increase in


file size : AES vs RC6 vs BF vs MBF
The encryption throughput (E_TP) is obtained by dividing the total plaintext in megabytes (T_P) by the average of the total encryption time (T_t) for each algorithm:

E_TP = T_P / T_t    (4.3)

The percentage increase in file size after encryption (x) is obtained from the encrypted text size (x_c) and the plain text size (y_f) as:

x = ((x_c - y_f) / y_f) × 100    (4.4)
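
As a worked example of Equations 4.3 and 4.4 (illustrative only; the values below are copied from the AES columns of Tables 4.4 and 4.10, nothing is re-measured):

# AES column of Table 4.10: plaintext sizes (MB) and encryption times (sec).
plain_mb = [8, 18, 26, 45.6, 91.12, 137.5, 194, 275, 366.8, 520.7, 840]
aes_time = [0.27, 0.37, 0.47, 0.68, 1.19, 1.7, 2.2, 3.3, 6.3, 10.4, 26.8]

total_plain = sum(plain_mb)                   # T_P: total plaintext in MB
avg_time = sum(aes_time) / len(aes_time)      # T_t: average encryption time in sec
throughput = total_plain / avg_time           # Equation 4.3
print(f"AES throughput: {throughput:.2f} MB/sec")      # about 516.95 MB/sec (Table 4.10)

plain, encrypted = 840.0, 1009.38             # largest file, AES column of Table 4.4
increase = (encrypted - plain) / plain * 100  # Equation 4.4
print(f"AES file-size increase: {increase:.1f} %")      # about 20 %, in line with Table 4.11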

4.9.1 Analysis of Throughput: AES vs RC6 vs BF vs MBF

Table 4.10 examines the encryption throughput of the proposed MBF approach; the greater the encryption throughput, the more efficiently the technique encrypts. The proposed MBF method completes encryption in an average of 3.18 seconds, whereas the existing AES, RC6, and BF algorithms require 4.88, 3.6, and 3.33 seconds on average, respectively. Accordingly, MBF achieves an encryption throughput of 793.31 MB/sec, compared with 516.95 MB/sec, 700.75 MB/sec, and 757.57 MB/sec for AES, RC6, and BF, all of which are lower than the proposed MBF technique. Figure 4.10 shows a graphical depiction of the proposed MBF based on encryption time and encryption throughput.

Table 4.10: TDE Model Encryption Time and Throughput Analysis: AES vs RC6 vs BF vs MBF

Plain Text in MB AES (Sec.) RC6 (Sec.) BF (Sec.) MBF (Sec.)


8 0.27 0.27 0.27 0.27
18 0.37 0.37 0.37 0.37
26 0.47 0.47 0.47 0.47
45.6 0.68 0.68 0.68 0.68
91.12 1.19 1.15 1.18 1.18
137.5 1.7 1.6 1.6 1.6
194 2.2 2.2 2.2 1.9
275 3.3 3.1 2.9 2.8
366.8 6.3 5.1 4.7 4.4
520.7 10.4 8.6 7.7 7.5
840 26.8 16.06 14.6 13.8
Average Time (Sec.) 4.88 3.6 3.33 3.18
Throughput (MB/Sec.) 516.95 700.75 757.57 793.31

Figure 4.10: TDE Model Throughput: AES vs RC6 vs BF vs MBF

Table 4.10 clearly shows that, compared to the AES, RC6, and BF approaches, the proposed MBF achieves a higher throughput value. Figure 4.10 shows a graphical depiction of the proposed MBF method based on encryption throughput.

4.9.2 Analysis of Percentage Increase in file size : AES vs RC6 vs


BF vs MBF

The percentage increase in file size when the AES, RC6, BF, and MBF algorithms are applied is shown in Table 4.11. Compared to the built-in AES method, the proposed approaches reduce the size of the encrypted file. Figure 4.11 graphically depicts the percentage increase in file size after encryption for the AES, RC6, BF, and MBF algorithms.

Table 4.11: TDE Model Percentage Increase in File Size after Encryption: AES vs RC6 vs BF vs MBF

AES RC6 BF MBF


21% 18% 11% 8%

Figure 4.11: TDE Model % Increase in File Size: AES vs RC6 vs BF vs MBF

Chapter 5

CONCLUSION AND FUTURE


SCOPE

The work presented in this thesis intends to create a unique approach for implementing an efficient and scalable security model in the Hadoop Framework in the cloud. Different symmetric cryptographic methods are grouped together in the proposed research work.

5.1 Important Interventions of the Research


• The Hadoop Framework’s security is configured in three ways:
authentication, authorization, and data protection.

• Because the Hadoop cluster consists of hundreds of nodes


functioning in a distributed environment, an
unauthorized/unapproved user may have unrestricted access to the
data at any time. In order to avoid this, data security is crucial.

• Application-level, database-level, filesystem-level, and disk-level data security may all be configured as per the requirement. Among these, the application level provides a reliable and secure method of data security.

• An application-level security model was built in the Hadoop framework using Map-Reduce; however, it was found to have several limitations (as discussed in Section 3.1 of this thesis).

• Transparent Data Encryption (TDE) is one of the approaches that provides data-level security. It readily addresses the limitations of the application-level data protection model.

• The Hadoop Framework’s conventional TDE architecture encrypts data using AES only. In the present work, other cryptographic algorithms, such as RC6, Blowfish, and Modified Blowfish, are implemented and run within TDE.

• In terms of computational time and space complexity, the results


obtained from constructed cryptographic methods are determined to
be comparable to those obtained from the current AES algorithm.

5.1.1 Conclusions

The proposed work successfully implemented two distinctive approaches for data security in HDFS:
• Application-level using MapReduce model

• SecHDFS-AWS using TDE in Hadoop Framework


Initially, a MapReduce model is proposed, and Blowfish and AES are
compared on the basis of two parameters: space consumption and
computational time. In comparison to AES, the Blowfish method uses
less space in data encryption but requires more processing time in
encrypting/decrypting the data, according to the analysis.

The limitations of the proposed MapReduce model include the need to modify applications, its batch-oriented nature, complex integration with other tools, key management, and access control rules, among other things. To address these shortcomings, the second recommended approach, Transparent Data Encryption, is implemented.

TDE model execution is divided into three phases: RC6 algorithm
implementation (Phase I), Blowfish algorithm implementation (Phase II),
and Modified Blowfish method implementation (Phase III).

The RC6 technique is used in Phase I to secure the data-at-rest in


HDFS. In comparison to AES, RC6 is simpler to implement. AES is used
for data security in the current version of Hadoop, 3.2.1. At this point, the
AES method is replaced with the RC6 algorithm, which provides superior
results in terms of space usage and computing time.

In Phase II, TDE is used to offer an efficient end-to-end


encryption mechanism. TDE currently only supports AES for
transparent data encryption in the Hadoop context. Modifications to
AES are made at this step, followed by the introduction of the Blowfish
algorithm for transparent data encryption. AES, RC6, and Blowfish are
symmetric cryptographic algorithms that have been satisfactorily
verified for space consumption after encryption and computational time.
The performance of these algorithms is investigated, indicating that
Blowfish outperforms AES and RC6.

In Phase III, an efficient transparent data encryption scheme for Hadoop, based on a Modified Blowfish algorithm, is presented to provide greater data storage security. At this level, the proposed architecture allows parallel data processing for encryption, thereby reducing the computing time required by sequential data processing. The Adams-Moulton approach is utilised for parallel data processing since it is quicker than the existing sequential methods. Application modification is also avoided by embedding the cryptosystem into the MapReduce code. According to the experimental findings, the proposed MBF approach outperforms AES, RC6, and Blowfish.

5.1.2 Future Scope

The findings revealed a number of prospective study possibilities as well


as security issues with Big Data platforms.

• In the Hadoop environment, various cryptanalysis techniques can be applied to the implemented methods.

• Greater focus on implementing asymmetric algorithms in the TDE model would help mitigate security threats.

• Further work can be focused on application, network, and cloud


security issues.

REFERENCES

[1] J. Howard et al. “Scale and Performance in a Distributed File System”. In: ACM
SIGOPS Oper. Syst. Rev. 21.5 (1987), pp. 1–2. DOI: 10.1145/37499.37500.
URL: https://doi.org/10.1145/37499.37500.

[2] R. Cattell. “Scalable SQL and NoSQL Data Stores”. In: ACM SIGMOD Rec.
39.4 (2011), pp. 12–27. DOI: 10.1145/1978915.1978919. URL: https://
doi.org/10.1145/1978915.1978919.
[3] Data Flair Team:History of Hadoop – The complete evolution of Hadoop
Ecosytem. URL :
https://data-flair.training/blogs/hadoop-history/.
[4] I. A. T. Hashem et al. “The rise of ”big data” on cloud computing: Review
and open research issues”. In: Information Systems 47 (2015), pp. 98–115. DOI:
https://doi.org/10.1016/j.is.2014.07.006. URL: https://www.
sciencedirect.com/science/article/pii/S0306437914001288.
[5] M. M Shetty and D. H. Manjaiah. “Data security in Hadoop distributed file
system”. In: 2016 International Conference on Emerging Technological Trends
(ICETT). 2016, pp. 1–5. DOI: 10.1109/ICETT.2016.7873697.
[6] G. S. Bhathal and A. Singh. “Big Data: Hadoop framework vulnerabilities,
security issues and attacks”. In: Array 1-2 (2019), p. 100002. DOI:
https : / / doi . org / 10 . 1016 / j . array . 2019 . 100002. URL: https :
//www.sciencedirect.com/science/article/pii/S2590005619300025.
[7] B. Saraladevi et al. “Big Data and Hadoop-a Study in Security Perspective”. In:
Procedia Computer Science 50 (2015), pp. 596–601. DOI: https://doi.org/
10.1016/j.procs.2015.04.091. URL: https://www.sciencedirect.
com/science/article/pii/S187705091500592X.
[8] H. Lu, Chen H.-S., and Hu T.-T. “Research on Hadoop Cloud Computing Model
and its Applications”. In: 2012 Third International Conference on Networking
and Distributed Computing. 2012, pp. 59–63. DOI: 10.1109/ICNDC.2012.22.

[9] Arun Murthy et al. Apache Hadoop YARN: Moving beyond MapReduce and
Batch Processing with Apache Hadoop 2. 1st. Addison-Wesley Data and
Analytics, 2014.
[10] Tom White. Hadoop – The Definitive Guide. 4th. O’Reilly, 2015.
[11] S. Parikh et al. “Security and Privacy Issues in Cloud, Fog and Edge
Computing”. In: Procedia Computer Science 160 (2019). The 10th
International Conference on Emerging Ubiquitous Systems and Pervasive
Networks (EUSPN-2019) / The 9th International Conference on Current and
Future Trends of Information and Communication Technologies in Healthcare
(ICTH-2019), pp. 734–739. DOI :
https : / / doi . org / 10 . 1016 / j . procs . 2019 . 11 . 018. URL: https :
//www.sciencedirect.com/science/article/pii/S1877050919317181.
[12] H. H. Song. “Testing and Evaluation System for Cloud Computing Information
Security Products”. In: Procedia Computer Science 166 (2020). Proceedings of
the 3rd International Conference on Mechatronics and Intelligent Robotics
(ICMIR-2019), pp. 84–87. DOI :
https : / / doi . org / 10 . 1016 / j . procs . 2020 . 02 . 023. URL: https :
//www.sciencedirect.com/science/article/pii/S1877050920301459.
[13] L. Savu. “Cloud Computing: Deployment Models, Delivery Models, Risks and
Research Challenges”. In: 2011 International Conference on Computer and
Management (CAMAN). 2011, pp. 1–4. DOI: 10.1109/CAMAN.2011.5778816.
[14] H. Li et al. “Towards smart card based mutual authentication schemes in cloud
computing”. In: KSII Transactions on Internet and Information Systems 9.7
(July 2015). URL:
http://itiis.org/digital-library/manuscript/1072.
[15] S. Goyal. “Public vs Private vs Hybrid vs Community - Cloud Computing: A
Critical Review”. In: International Journal of Computer Network and
Information Security 6 (2014), pp. 20–29.
[16] S. Bhardwaj, L. Jain, and S. Jain. “Cloud Computing: A Study of Infrastructure
AS A Service (IAAS)”. In: International Journal of Engineering and
Information Technology 2.1 (Jan. 2010), pp. 60–63.
[17] G. Kulkarni, P. Khatawkar, and J. Gambhir. “Cloud Computing-Platform as
Service”. In: International Journal of Engineering and Advanced Technology
(IJEAT) 1.2 (Dec. 2011), pp. 115–120.
[18] S. Daneshyar. “Large-Scale Data Processing Using MapReduce in Cloud
Computing Environment”. In: International Journal on Web Service
Computing 3.4 (Dec. 2012), pp. 1–13. DOI: 10.5121/ijwsc.2012.3401.

[19] H. Lee et al. “Implementation of MapReduce-based image conversion module
in cloud computing environment”. In: The International Conference on
Information Network. 2012, pp. 234–238. DOI :
10.1109/ICOIN.2012.6164383.
[20] CyberPedia: An overview of DoS attacks. URL :
https://www.paloaltonetworks.com/cyberpedia/what-is-a-denial-
of-service-attack-dos.
[21] Apache Knox: Apache Knox Gateway 1.3.x User’s Guide. https : / / knox .
apache.org/books/knox- 1- 3- 0/user- guide.html. [Online; accessed
9-July-2021].
[22] Walker Rowe. Introduction to Hadoop Security. July 2016. URL : https : / /
www.bmc.com/blogs/hadoop-security/.
[23] Apache Software Foundation: Apache Ranger. http://ranger.apache.org/
index.html. [Online; accessed 9-July-2021].
[24] T. Allen. Project Rhino: Building a Layered Defense for Apache Hadoop.
https : / / itpeernetwork . intel . com / project - rhino - building - a -
layered - defense - for - apache - hadoop / gs . 4te3dc. [Online; accessed
9-July-2021].
[25] Bhushan Lakhe. Practical hadoop security. Apress, 2014.
[26] Kerberos: The Network Authentication Protocol. http : / / web . mit . edu /
KERBEROS/. [Online; accessed 9-July-2021].
[27] I. Mohiuddin et al. “Secure distributed adaptive bin packing algorithm for
cloud storage”. In: Future Generation Computer Systems 90 (Aug. 2018),
pp. 307–316. DOI: 10.1016/j.future.2018.08.013.
[28] J. Rong, R. Lu, and K-K. R. Choo. “Achieving high performance and
privacy-preserving query over encrypted multidimensional big metering data”.
In: Future Generation Computer Systems 78 (2018), pp. 392–401. DOI:
https : / / doi . org / 10 . 1016 / j . future . 2016 . 05 . 005. URL: https :
//www.sciencedirect.com/science/article/pii/S0167739X16301157.
[29] Gunasekaran M. et al. “A new architecture of Internet of Things and big data
ecosystem for secured smart healthcare monitoring and alerting system”. In:
Future Generation Computer Systems 82 (2018), pp. 375–387. DOI: https :
/ / doi . org / 10 . 1016 / j . future . 2017 . 10 . 045. URL: https : / / www .
sciencedirect.com/science/article/pii/S0167739X17305149.

[30] G. S. Sadasivam, K. A. Kumari, and S. Rubika. “A Novel Authentication
Service for Hadoop in Cloud Environment”. In: 2012 IEEE International
Conference on Cloud Computing in Emerging Markets (CCEM). 2012,
pp. 1–6. DOI: 10.1109/CCEM.2012.6354591.
[31] S. H. Park and I. R. Jeong. “A Study on Security Improvement in Hadoop
Distributed File System Based on Kerberos”. In: J. Korea Inst. Inf. Secur.
Cryptol., vol. 23, no. 5. 2013, pp. 415–463.
[32] K. Zheng and W. Jiang. “A token authentication solution for hadoop based on
kerberos pre-authentication”. In: 2014 International Conference on Data
Science and Advanced Analytics (DSAA). 2014, pp. 354–360. DOI:
10.1109/DSAA.2014.7058096.
[33] P. K. Rahul and T. GireeshKumar. “A Novel Authentication Framework for
Hadoop”. In: Artificial Intelligence and Evolutionary Algorithms in
Engineering Systems. Ed. by L. Padma Suresh, Subhransu Sekhar Dash, and
Bijaya Ketan Panigrahi. New Delhi: Springer India, 2015, pp. 333–340.
[34] N. Somu, A. Gangaa, and V. S. S. Sriram. “Authentication Service in Hadoop
using One Time Pad”. In: Indian Journal of Science and Technology 7.4 (2020),
pp. 56–62.
[35] H. Zhou and Q. Wen. “A new solution of data security accessing for Hadoop
based on CP-ABE”. In: 2014 IEEE 5th International Conference on Software
Engineering and Service Science. 2014, pp. 525–528. DOI: 10.1109/ICSESS.
2014.6933621.
[36] M. Sarvabhatla, M. Reddy, and C. Vorugunti. “A Secure and Light Weight
Authentication Service in Hadoop using One Time Pad”. In: Procedia
Computer Science 50 (Dec. 2015), pp. 81–86. DOI :
10.1016/j.procs.2015.04.064.
[37] Y.-S. Jeong and Y.-T. Kim. “A token-based authentication security scheme for
Hadoop distributed file system using elliptic curve cryptography”. In: J.
Comput. Virol. Hacking Tech. 11.3 (2015), pp. 137–142.
[38] Y.-S. Jeong, S.-S. Shin, and K.-H. Han. “High-dimentional data authentication
protocol based on hash chain for Hadoop systems”. In: Cluster Computing 19
(Mar. 2016), pp. 475–484. DOI: 10.1007/s10586-015-0508-y.
[39] I. Khalil, Z. Dou, and A. Khreishah. “TPM-Based Authentication Mechanism
for Apache Hadoop”. In: vol. 152. Sept. 2015, pp. 105–122. DOI: 10.1007/
978-3-319-23829-6_8.

[40] Y.-A. Jung, S.-J. Woo, and S.-S. Yeo. “A Study on Hash Chain-Based Hadoop
Security Scheme”. In: 2015 IEEE 12th Intl Conf on Ubiquitous Intelligence and
Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing
and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications
and Its Associated Workshops (UIC-ATC-ScalCom). 2015, pp. 1831–1835. DOI:
10.1109/UIC-ATC-ScalCom-CBDCom-IoP.2015.332.
[41] Z. Dou et al. “Robust Insider Attacks Countermeasure for Hadoop: Design and
Implementation”. In: IEEE Systems Journal 12.2 (2018), pp. 1874–1885. DOI:
10.1109/JSYST.2017.2669908.
[42] Y. MEI. “Using the HashChain to Improve the Security of the Hadoop”. In:
Proceedings of the 3rd Annual International Conference on Electronics,
Electrical Engineering and Information Science (EEEIS 2017). Atlantis Press,
2017/09, pp. 554–558. DOI :
https : / / doi . org / 10 . 2991 / eeeis - 17 . 2017 . 82. URL:
https://doi.org/10.2991/eeeis-17.2017.82.
[43] M. Hena and N. Jeyanthi. “Authentication Framework for Kerberos Enabled
Hadoop Clusters”. In: International Journal of Engineering and Advanced
Technology 9.1 (2019), pp. 510–519.
[44] W. Wei et al. “SecureMR: A Service Integrity Assurance Framework for
MapReduce”. In: 2009 Annual Computer Security Applications Conference.
2009, pp. 73–82. DOI: 10.1109/ACSAC.2009.17.
[45] J. H. Majors. “Secdoop: A Confidentiality Service on Hadoop Clusters”, Auburn
University, Master Thesis. 2011.
[46] H.-Y. Lin et al. “Toward Data Confidentiality via Integrating Hybrid
Encryption Schemes and Hadoop Distributed File System”. In: 2012 IEEE
26th International Conference on Advanced Information Networking and
Applications. 2012, pp. 740–747. DOI: 10.1109/AINA.2012.28.
[47] S. Park and Y. Lee. “Secure Hadoop with Encrypted HDFS”. In: Grid and
Pervasive Computing. Vol. 7861. Springer Berlin Heidelberg, 2013,
pp. 134–141. DOI: doi.org/10.1007/978-3-642-38027-3_14.
[48] C. Zhonghan et al. “Design and Implementation of Data Encryption in Cloud
based on HDFS”. In: Proceedings of the The 1st International Workshop on
Cloud Computing and Information Security. Atlantis Press, 2013, pp. 274–277.
DOI : https://doi.org/10.2991/ccis-13.2013.64.

[49] Q. Quan et al. “A model of cloud data secure storage based on HDFS”. In:
2013 IEEE/ACIS 12th International Conference on Computer and Information
Science (ICIS). 2013, pp. 173–178. DOI: 10.1109/ICIS.2013.6607836.

[50] C. Yang, W. Lin, and M. Liu. “A Novel Triple Encryption Scheme for
Hadoop-Based Cloud Data Security”. In: 2013 Fourth International
Conference on Emerging Intelligent Data and Web Technologies. 2013,
pp. 437–442. DOI: 10.1109/EIDWT.2013.80.
[51] X. Yu, P. Ning, and M. A. Vouk. “Enhancing security of Hadoop in a public
cloud”. In: 2015 6th International Conference on Information and
Communication Systems (ICICS). 2015, pp. 38–43. DOI :
10.1109/IACS.2015.7103198.
[52] D. Shehzad et al. “A Novel Hybrid Encryption Scheme to Ensure Hadoop
Based Cloud Data Security”. In: International Journal of Computer Science
and Information Security 1947-5500 14.4 (May 2016), pp. 480–484.
[53] A. Jayan and B. R. Upadhyay. “RC4 in Hadoop security using MapReduce”.
In: 2017 International Conference on Computational Intelligence in Data
Science(ICCIDS). 2017, pp. 1–5. DOI: 10.1109/ICCIDS.2017.8272637.
[54] Y. Song et al. “Design and implementation of HDFS data encryption scheme
using ARIA algorithm on Hadoop”. In: 2017 IEEE International Conference on
Big Data and Smart Computing (BigComp). 2017, pp. 84–90. DOI: 10.1109/
BIGCOMP.2017.7881720.
[55] H. Mahmoud, A. Hegazy, and M. H. Khafagy. “An approach for big data
security based on Hadoop distributed file system”. In: 2018 International
Conference on Innovative Trends in Computer Engineering (ITCE). 2018,
pp. 109–114. DOI: 10.1109/ITCE.2018.8316608.
[56] Y. Xu et al. “Design and implementation of distributed RSA algorithm based
on Hadoop”. In: Journal of Ambient Intelligence and Humanized Computing 11
(2020), pp. 1047–1053. DOI: 10.1007/s12652-018-1021-y.
[57] P. Johri, S. Arora, and M. Kumar. “Privacy Preserve Hadoop (PPH)—An
Implementation of BIG DATA Security by Hadoop with Encrypted HDFS”. In:
Information and Communication Technology for Sustainable Development.
Vol. 10. Springer Singapore, 2018, pp. 339–346. DOI :
https://doi.org/10.1007/978-981-10-3920-1_35.
[58] T. S. Algaradi and B. Rama. “A Novel Blowfish Based-Algorithm To Improve
Encryption Performance In Hadoop Using Mapreduce”. In: International
Journal of Scientific & Technology Research 8.11 (2019), pp. 2074–2081.
[59] A. D. Yu and Sun. Sentry Tutorial. https :
//cwiki.apache.org/confluence/display/SENTRY/sentry+tutorial.
[Online; accessed 9-July-2021].
[60] eCryptfs. URL: https://wiki.archlinux.org/title/ECryptfs.

[61] Xun Xu. “From cloud computing to cloud manufacturing”. In: Robotics and
Computer-Integrated Manufacturing 28.1 (2012), pp. 75–86. ISSN: 0736-5845.
DOI : https://doi.org/10.1016/j.rcim.2011.07.002. URL : https:
//www.sciencedirect.com/science/article/pii/S0736584511000949.
[62] Z. Xiao and Y. Xiao. “Security and Privacy in Cloud Computing”. In: IEEE
Communications Surveys Tutorials 15.2 (2013), pp. 843–859. DOI: 10.1109/
SURV.2012.060912.00182.
[63] A. A. Yassin et al. “Anonymous Password Authentication Scheme by Using
Digital Signature and Fingerprint in Cloud Computing”. In: 2012 Second
International Conference on Cloud and Green Computing. 2012, pp. 282–289.
DOI : 10.1109/CGC.2012.91.

[64] N. Gajra, S. S. Khan, and P. Rane. “Private cloud security: Secured user
authentication by using enhanced hybrid algorithm”. In: 2014 International
Conference on Advances in Communication and Computing Technologies
(ICACACT 2014). 2014, pp. 1–6. DOI: 10.1109/EIC.2015.7230712.
[65] A. S. Tomar et al. “Enhanced Image Based Authentication with Secure Key
Exchange Mechanism Using ECC in Cloud”. In: Security in Computing and
Communications. Vol. 625. Springer Singapore, 2016, pp. 63–73. DOI:
10.1007/978-981-10-2738-3_6.
[66] R. Dangi and S. Pawar. “An Improved Authentication and Data Security
Approach Over Cloud Environment”. In: Harmony Search and Nature Inspired
Optimization Algorithms. Vol. 741. Springer Singapore, 2019, pp. 1069–1076.
DOI : 10.1007/978-981-13-0761-4_100.

[67] Aleksandar Hudic et al. “Data confidentiality using fragmentation in cloud


computing”. In: International Journal of Pervasive Computing and
Communications 9.1 (2013), pp. 37–51. DOI: 10.1108/17427371311315743.
[68] L. Arockiam and S. Monikandan. “Efficient cloud storage confidentiality to
ensure data security”. In: 2014 International Conference on Computer
Communication and Informatics. 2014, pp. 1–5. DOI :
10.1109/ICCCI.2014.6921762.
[69] Yulong Ren and Wen Tang. “A service integrity assurance framework for cloud
computing based on MapReduce”. In: 2012 IEEE 2nd International Conference
on Cloud Computing and Intelligence Systems. Vol. 1. 2012, pp. 240–244. DOI:
10.1109/CCIS.2012.6664404.

[70] Yongzhi Wang et al. “IntegrityMR: Integrity assurance framework for big data
analytics and management applications”. In: 2013 IEEE International
Conference on Big Data. 2013, pp. 33–40. DOI :
10.1109/BigData.2013.6691780.
[71] R. Saxena and S. Dey. “Cloud Audit: A Data Integrity Verification Approach
for Cloud Computing”. In: Procedia Computer Science 89 (2016). Twelfth
International Conference on Communication Networks, ICCN 2016, August
19–21, 2016, Bangalore, India; Twelfth International Conference on Data
Mining and Warehousing, ICDMW 2016, August 19-21, 2016, Bangalore,
India Twelfth International Conference on Image and Signal Processing, ICISP
2016, August 19-21, 2016, Bangalore, India, pp. 142–151. ISSN: 1877-0509.
DOI : https://doi.org/10.1016/j.procs.2016.06.024. URL : https:
//www.sciencedirect.com/science/article/pii/S1877050916310894.
[72] Y. B. Idris et al. “Enhancement data integrity checking using combination md5
and sha1 algorithm in hadoop architecture”. In: Journal of Computer Science
& Computational Mathematics 7.3 (2017), pp. 99–102. DOI: 10.20967/jcscm.2017.03.007.
[73] R. Sumithra and S. Paul. “Incorporating security and integrity into the mining
process of hybrid weighted-hasht apriori algorithm using Hadoop”. In:
International Journal of Data Science 3.3 (2018), pp. 266–287. DOI:
10.1504/ijds.2018.094506.
[74] A. Undheim, A. Chilwan, and P. Heegaard. “Differentiated Availability in Cloud
Computing SLAs”. In: 2011 IEEE/ACM 12th International Conference on Grid
Computing. 2011, pp. 129–136. DOI: 10.1109/Grid.2011.25.
[75] D.-W. Sun et al. “Modeling a Dynamic Data Replication Strategy to Increase
System Availability in Cloud Computing Environments”. In: Journal of
Computer Science and Technology 27.2 (2012), pp. 256–272. DOI:
10.1007/s11390-012-1221-4.
[76] C.-T. Yang et al. “On Improvement of Cloud Virtual Machine Availability with
Virtualization Fault Tolerance Mechanism”. In: 2011 IEEE Third International
Conference on Cloud Computing Technology and Science. 2011, pp. 122–129.
DOI : 10.1109/CloudCom.2011.26.

[77] B. Mao, S. Wu, and H. Jiang. “Improving Storage Availability in Cloud-of-


Clouds with Hybrid Redundant Data Distribution”. In: 2015 IEEE International
Parallel and Distributed Processing Symposium. 2015, pp. 633–642. DOI: 10.
1109/IPDPS.2015.47.

[78] J. Li et al. “Fine-Grained Data Access Control Systems with User
Accountability in Cloud Computing”. In: 2010 IEEE Second International
Conference on Cloud Computing Technology and Science. 2010, pp. 89–96.
DOI : 10.1109/CloudCom.2010.44.

[79] R. K.L. Ko et al. “TrustCloud: A Framework for Accountability and Trust in


Cloud Computing”. In: 2011 IEEE World Congress on Services. 2011,
pp. 584–588. DOI: 10.1109/SERVICES.2011.91.
[80] S. Sundareswaran et al. “Promoting Distributed Accountability in the Cloud”.
In: 2011 IEEE 4th International Conference on Cloud Computing. 2011,
pp. 113–120. DOI: 10.1109/CLOUD.2011.57.
[81] Z. Xiao and Y. Xiao. “Achieving Accountable MapReduce in Cloud
Computing”. In: Future Gener. Comput. Syst. 30.C (Jan. 2014), pp. 1–13. ISSN:
0167-739X. DOI: 10.5555/2747903.2748177.
[82] G. Zhang et al. “Key Research Issues for Privacy Protection and Preservation
in Cloud Computing”. In: 2012 International Conference on Cloud and Green
Computing (CGC). Los Alamitos, CA, USA: IEEE Computer Society, Nov.
2012, pp. 47–54. DOI: 10 . 1109 / CGC . 2012 . 47. URL:
https://doi.ieeecomputersociety.org/10.1109/CGC.2012.47.
[83] A. Waqar et al. “A framework for preservation of cloud users’ data privacy
using dynamic reconstruction of metadata”. In: Journal of Network and
Computer Applications 36.1 (2013), pp. 235–248. ISSN: 1084-8045. DOI:
https : / / doi . org / 10 . 1016 / j . jnca . 2012 . 09 . 001. URL: https :
//www.sciencedirect.com/science/article/pii/S1084804512001890.
[84] P. Hu et al. “Security and Privacy Preservation Scheme of Face Identification
and Resolution Framework Using Fog Computing in Internet of Things”. In:
IEEE Internet of Things Journal 4.5 (2017), pp. 1143–1155. DOI: 10.1109/
JIOT.2017.2659783.
[85] G. Sun et al. “Security and privacy preservation in fog-based crowd sensing on
the internet of vehicles”. In: Journal of Network and Computer Applications 134
(2019), pp. 89–99. ISSN: 1084-8045. DOI: https://doi.org/10.1016/j.
jnca.2019.02.018. URL: https://www.sciencedirect.com/science/
article/pii/S1084804519300694.
[86] S. K. Sood. “A combined approach to ensure data security in cloud computing”.
In: Journal of Network and Computer Applications 35.6 (2012), pp. 1831–1838.
ISSN : 1084-8045. DOI : https : / / doi . org / 10 . 1016 / j . jnca . 2012 . 07 .
007. URL: https://www.sciencedirect.com/science/article/pii/
S1084804512001592.

[87] F. F. Moghaddam et al. “A client-based user authentication and encryption
algorithm for secure accessing to cloud servers based on modified
Diffie-Hellman and RSA small-e”. In: 2013 IEEE Student Conference on
Research and Developement. 2013, pp. 175–180. DOI :
10.1109/SCOReD.2013.7002566.
[88] V. S. Mahalle and A. K. Shahade. “Enhancing the data security in Cloud by
implementing hybrid (RSA & AES) encryption algorithm”. In: 2014
International Conference on Power, Automation and Communication (INPAC).
2014, pp. 146–149. DOI: 10.1109/INPAC.2014.6981152.
[89] N. Khanezaei and Z. M. Hanapi. “A framework based on RSA and AES
encryption algorithms for cloud computing services”. In: 2014 IEEE
Conference on Systems, Process and Control (ICSPC 2014). 2014, pp. 58–62.
DOI : 10.1109/SPC.2014.7086230.

[90] G. Raj, R. C. Kesireddi, and S. Gupta. “Enhancement of security mechanism


for confidential data using AES-128, 192 and 256bit encryption in cloud”. In:
2015 1st International Conference on Next Generation Computing Technologies
(NGCT). 2015, pp. 374–378. DOI: 10.1109/NGCT.2015.7375144.
[91] N. Sengupta. “Designing of Hybrid RSA Encryption Algorithm for Cloud
Security”. In: International Journal of Innovative Research in Computer and
Communication Engineering 3.5 (2015), pp. 4146–4152. ISSN: 2320-9801.
DOI : https://10.15680/ijircce.2015.0305106.

[92] M. Kumar et al. “Data outsourcing: A threat to confidentiality, integrity, and


availability”. In: 2015 International Conference on Green Computing and
Internet of Things (ICGCIoT). 2015, pp. 1496–1501. DOI:
10.1109/ICGCIoT.2015.7380703.
[93] Y. Li et al. “Intelligent cryptography approach for secure distributed big data
storage in cloud computing”. In: Information Sciences 387 (2017), pp. 103–115.
ISSN : 0020-0255. DOI : https : / / doi . org / 10 . 1016 / j . ins . 2016 . 09 .
005. URL: https://www.sciencedirect.com/science/article/pii/
S0020025516307319.
[94] G. Amalarethinam and H.M. Leena. “Enhanced RSA algorithm for data security
in cloud”. In: Int. J. Control Theory Appl. 9.27 (Jan. 2016), pp. 147–152.
[95] D. P. Timothy and A. K. Santra. “A hybrid cryptography algorithm for cloud
computing security”. In: 2017 International conference on Microelectronic
Devices, Circuits and Systems (ICMDCS). 2017, pp. 1–5. DOI:
10.1109/ICMDCS.2017.8211728.

[96] T. Kapse. “Hybrid Security Model for Secure Communication in Cloud
Environment”. In: Journal of the Gujarat Research Society 21.15 (2019),
pp. 179–185. ISSN: 0374-8588.
[97] P. Jindal and B. Singh. “Analyzing the security-performance tradeoff in block
ciphers”. In: International Conference on Computing, Communication
Automation. 2015, pp. 326–331. DOI: 10.1109/CCAA.2015.7148425.
[98] V. Poonia and N. S. Yadav. “Analysis of modified Blowfish algorithm in
different cases with various parameters”. In: 2015 International Conference on
Advanced Computing and Communication Systems. 2015, pp. 1–5. DOI:
10.1109/ICACCS.2015.7324114.
[99] P. Patil et al. “A Comprehensive Evaluation of Cryptographic Algorithms: DES,
3DES, AES, RSA and Blowfish”. In: Procedia Computer Science 78 (2016). 1st
International Conference on Information Security Privacy 2015, pp. 617–624.
ISSN : 1877-0509. DOI : https://doi.org/10.1016/j.procs.2016.02.
108. URL: https://www.sciencedirect.com/science/article/pii/
S1877050916001101.
[100] R. Patel and P. Kamboj. “Security Enhancement of Blowfish Block Cipher”. In:
vol. 628. 2016, pp. 231–238. ISBN: 978-981-10-3432-9. DOI: 10.1007/978-
981-10-3433-6_28.
[101] S. Contini et al. “The RC6 Block Cipher”. In: (Mar. 2000), pp. 1–21.
[102] Dinesh. Adams Methods. URL: http : / / www . cs . unc . edu / ~dm / UNC /
COMP205/LECTURES/DIFF/lec19/node2.html#eq.

LIST OF PUBLICATIONS
1. K. Vishal Reddy, Jayantrao Patil, and Ratnadeep R. Deshmukh.
”Security Issues in Hadoop Framework: A Review”, International
Journal of Science, Engineering and Management (IJSEM). 3(2),
(2018), pp. 193 – 198.(Impact Factor 2.7).

2. K. Vishal Reddy, Jayantrao Patil, and Ratnadeep R. Deshmukh.


”SecHDFS: Efficient and Secure Data Storage Model over HDFS
using RC6”. International Journal of Engineering and Advanced
Technology 9(2), pp. 3900-3903, 2019. (Scopus)

3. K. Vishal Reddy, Jayantrao Patil, and Ratnadeep R. Deshmukh. ”A


Comparative Approach to Secure Data Storage Model in Hadoop
Framework”. In: Computing in Engineering and Technology.
Springer, vol 1025. 2020, pp. 135-143. (Scopus)

4. K. Vishal Reddy, Jayantrao Patil, and Ratnadeep R. Deshmukh.


”SecHDFS-AWS: A Novel Approach to Design Efficient and
Secure Data Storage Model Over HDFS Enabled Amazon Cloud”.
In: Proceedings of International Conference on Data Science and
Applications. Springer, vol 288. 2022, pp. 467-478, (Scopus)

5. K. Vishal Reddy, Jayantrao B. Patil, and Ratnadeep R. Deshmukh.


(2021) Enhancing Performance of TDE Model in Hadoop Distributed File System. Copyright application filed and registered with the Copyright Office, Government of India, dated 29/07/2021, with Registration Number SW-14758/2021.
