Place : Aurangabad
Date :
ABSTRACT
The term “Big Data” refers to the colossal and varied nature of the data
generated at a tremendous rate. In the current scenario, data is produced
from numerous sources and in many formats, such as images, videos,
audio, PDFs, and text documents. Most organizations now generate
terabytes or petabytes of data. Acquiring data is not the real issue for an
organization; how these organizations manage and use such large volumes
of data is the decisive concern. The Hadoop Distributed File System is an
open-source, reliable, and user-friendly system that plays a crucial part in
handling the enormous volume of generated data with fault tolerance.
Hadoop permits multiple clients to exploit Big Data’s potential using other
open-source resources, such as Hive, HBase, Pig, Spark, and Storm.
However, the data stored in Hadoop is prone to various internal or
external attacks, which may compromise the reliability of Hadoop. To
counter these attacks, various investigations have been carried out on
data encryption to prevent leakage of Hadoop’s sensitive data.
DECLARATION
I hereby declare that the thesis entitled “A Novel Approach for
Providing Efficient and Scalable Security Model for
Hadoop Cluster in Cloud” has been carried out by me under the
guidance of Prof. Dr. Jayantrao B. Patil, Department of Computer
Engineering, R. C. Patel Institute of Technology, Shirpur, Dhule (MS),
India, and Prof. Dr. Ratnadeep R. Deshmukh, Department of
Computer Science and Information Technology, Dr. Babasaheb
Ambedkar Marathwada University, Aurangabad (MS), India. The work
is original and has not been submitted in part or in full to any other
University or Institute for the award of any research degree. The extent
of information derived from the existing literature has been indicated in
the body of the thesis at appropriate places, giving the references.
ACKNOWLEDGMENT
I wish to express my hearty and sincere gratitude to my Research Guide
respected Prof. Dr. Jayantrao B. Patil, Professor and Principal,
Department of Computer Engineering, R. C. Patel Institute of
Technology, Shirpur, for his valuable guidance and constant
encouragement throughout this work.
I would also like to express my deep sense of gratitude and indebtedness
to respected Dr. Ratnadeep R. Deshmukh (Research Co-Guide),
Professor and former Head of Computer Science and Information
Technology Department, Dr. Babasaheb Ambedkar Marathwada
University, Aurangabad, for always guiding and encouraging me
throughout this work.
I thank Dr. Sachin N. Deshmukh, Professor and Head, Department
of Computer Science and Information Technology, Dr. Babasaheb
Ambedkar Marathwada University, Aurangabad for providing me the
necessary facilities and continuous support rendered during the research
work.
I owe my sincere thanks to the Management of Deogiri Institute of
Engineering and Management Studies, Aurangabad for giving me
permission and encouragement to carry out the research work. I am
incredibly thankful to Dr. Ulhas D. Shiurkar, Director, Deogiri
Institute of Engineering and Management Studies, Aurangabad and
Sanjay B. Kalyankar, Head of Department, Computer Science and
Engineering, Deogiri Institute of Engineering and Management Studies,
Aurangabad for their moral support during my work and encouraging
me throughout this work.
I am also very much thankful to Prof. Dr. Nitin N. Patil, Head
Department of Computer Science and Engineering, R. C. Patel Institute
of Technology, Shirpur, for his constant support and meticulous
supervision at every step to come out with this work.
I express my sincere gratitude to my parents, Dr. K. Ravinder Reddy
and Smt. K. Prabhavati Reddy, for their support and encouragement and
for being a constant source of strength. I deeply thank my
wife Ramya Reddy for her constant support. I am also thankful to my sister
Shirisha Reddy and brother-in-law Ramesh Chandra Reddy for their
continuous encouragement.
I am grateful to all my friends and colleagues as well as those who have
directly or indirectly helped me in the process of completion of this thesis.
K. Vishal Reddy
TABLE OF CONTENTS
CERTIFICATE . . . . . . . . . . . . . . . . . . . . . . . . i
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . ii
DECLARATION . . . . . . . . . . . . . . . . . . . . . . . . iii
ACKNOWLEDGMENT . . . . . . . . . . . . . . . . . . . . iv
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . x
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . xii
LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . xiii
1 INTRODUCTION 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Hadoop Framework . . . . . . . . . . . . . . . . . . . . 2
1.3.1 Need for Hadoop Distributed File System (HDFS) 2
1.3.2 Hadoop Distributed File System (HDFS)
Architecture . . . . . . . . . . . . . . . . . . . . 4
1.3.3 Anatomy of a file write . . . . . . . . . . . 8
1.3.4 Anatomy of a file read . . . . . . . . . . . . . . 9
1.3.5 MapReduce . . . . . . . . . . . . . . . . . . . . 10
1.4 Cloud Computing Framework . . . . . . . . . . . . . . 13
1.4.1 Types/Deployment Models of Cloud . . . . . . . 14
1.4.2 Service/Delivery Models . . . . . . . . . . . . . 16
1.4.3 Key Essential Characteristics . . . . . . . . . . . 18
1.5 Hadoop in Cloud Computing . . . . . . . . . . . . . . . 18
1.5.1 Need for Hadoop in Cloud . . . . . . . . . . . . 19
1.6 Security threats and possible attacks . . . . . . . . . . . 21
1.7 Technologies for security - Securing Hadoop Solution . . 23
1.8 Major Challenges . . . . . . . . . . . . . . . . . . . . . 25
1.8.1 Data Size and Storage Criticality . . . . . . . . . 25
1.8.2 Distributed Nature and Fragmented Data . . . . 25
1.9 Security issues in Hadoop and Cloud Computing . . . . 26
1.9.1 Security in Hadoop . . . . . . . . . . . . . . . . 26
1.9.2 Security in Cloud . . . . . . . . . . . . . . . . . 29
1.10 Necessities . . . . . . . . . . . . . . . . . . . . . . . . 31
1.11 Research Motivation . . . . . . . . . . . . . . . . . . . 32
1.12 Problem Statement . . . . . . . . . . . . . . . . . . . . 32
1.13 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.14 Research Contributions . . . . . . . . . . . . . . . . . . 33
1.15 Organization of Thesis . . . . . . . . . . . . . . . . . . 34
2 LITERATURE REVIEW 36
2.1 Authentication and Authorization in the Hadoop
Framework . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2 Data Security in Hadoop Distributed File System . . . . 42
2.3 Security Tools for Hadoop Cluster by Third Party . . . . 47
2.4 Cloud Computing . . . . . . . . . . . . . . . . . . . . . 49
2.4.1 Security Issue in Cloud Computing . . . . . . . 50
2.4.2 Authentication . . . . . . . . . . . . . . . . . . 50
2.4.3 Confidentiality . . . . . . . . . . . . . . . . . . 52
2.4.4 Integrity . . . . . . . . . . . . . . . . . . . . . . 53
2.4.5 Availability . . . . . . . . . . . . . . . . . . . . 55
2.4.6 Accountability . . . . . . . . . . . . . . . . . . 56
2.4.7 Privacy and Preservability . . . . . . . . . . . . 58
2.5 Data security in Cloud Computing . . . . . . . . . . . . 59
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 63
3 SYSTEM DEVELOPMENT 64
3.1 Necessity of Security at HDFS . . . . . . . . . . . . . . 65
3.2 Security Integration by Hadoop Distributors . . . . . . . 66
3.3 System Architecture . . . . . . . . . . . . . . . . . . . . 68
3.4 Data level Encryption . . . . . . . . . . . . . . . . . . . 69
3.4.1 Advanced Encryption Standard (AES) . . . . . . 70
3.4.2 RC6 . . . . . . . . . . . . . . . . . . . . . . . . 71
3.4.3 Blowfish . . . . . . . . . . . . . . . . . . . . . 71
3.4.4 Modified Version of Blowfish (BF) . . . . . . . 76
3.5 Encryption in Hadoop . . . . . . . . . . . . . . . . . . . 79
3.6 Implementation of Cryptosystems at Application Level . 80
3.6.1 Application-Level Encryption . . . . . . . . . . 80
3.6.2 Application-Level Decryption . . . . . . . . . . 81
3.6.3 Advantages of Application-level Encryption . . . 82
3.6.4 Disadvantages of Application-level Encryption . 83
3.7 File-System Level Cryptosystems . . . . . . . . . . . . 83
3.7.1 Transparent Data Encryption (TDE) . . . . . . . 84
3.7.2 Key Management Server (KMS) Architecture . . 89
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.5 Implementation of TDE Model . . . . . . . . . . . . . . 100
4.5.1 Experimental setup . . . . . . . . . . . . . . . . 100
4.6 Performance Evaluation of TDE Model: AES vs RC6 . . 100
4.6.1 Analysis of Space Consumption: AES vs RC6 . 100
4.6.2 Analysis of Computational time: AES vs RC6 . 102
4.7 Performance Evaluation of TDE Model: AES vs RC6 vs
Blowfish (BF) . . . . . . . . . . . . . . . . . . . . . . . 104
4.7.1 Analysis of Space Consumption: AES vs RC6 vs
Blowfish (BF) . . . . . . . . . . . . . . . . . . . 104
4.7.2 Analysis of Computational time: AES vs RC6 vs
Blowfish (BF) . . . . . . . . . . . . . . . . . . . 106
4.8 Performance Evaluation of TDE Model: AES vs RC6 vs
Blowfish (BF) vs Modified Blowfish (MBF) . . . . . . . 108
4.8.1 Analysis of Space Consumption: AES vs RC6 vs
Blowfish (BF) vs Modified Blowfish (MBF) . . . 108
4.8.2 Analysis of Computational time: AES vs RC6 vs
Blowfish (BF) vs Modified Blowfish (MBF) . . . 111
4.9 Analysis of Throughput and Percentage Increase in file
size : AES vs RC6 vs Blowfish (BF) vs MBF . . . . . . 113
4.9.1 Analysis of Throughput: AES vs RC6 vs
Blowfish (BF) vs Modified Blowfish (MBF) . . . 113
4.9.2 Analysis of Percentage Increase in file size : AES
vs RC6 vs Blowfish (BF) vs MBF . . . . . . . . 115
LIST OF FIGURES
4.1 Map-Reduce Model Space Consumption Analysis after
Encryption . . . . . . . . . . . . . . . . . . . . . . . . 96
4.2 Map-Reduce Model computational time Analysis during
encryption . . . . . . . . . . . . . . . . . . . . . . . . 98
4.3 Map-Reduce Model computational time Analysis during
decryption . . . . . . . . . . . . . . . . . . . . . . . . 99
4.4 TDE Model Space Consumption Analysis: AES vs RC6 101
4.5 TDE Model Computational time Analysis: AES vs RC6 103
4.6 TDE Model Space Consumption Analysis: AES vs RC6
vs Blowfish (BF) . . . . . . . . . . . . . . . . . . . . . 106
4.7 TDE Model Computational time Analysis: AES vs RC6
vs Blowfish (BF) . . . . . . . . . . . . . . . . . . . . . 107
4.8 TDE Model Space Consumption Analysis: AES vs RC6
vs Blowfish (BF) vs Modified Blowfish (MBF) . . . . . 110
4.9 TDE Model Computational time Analysis: AES vs RC6
vs Blowfish (BF) vs Modified Blowfish (MBF) . . . . . 112
4.10 TDE Model Throughput: AES vs RC6 vs Blowfish (BF)
vs Modified Blowfish (MBF) . . . . . . . . . . . . . . . 114
4.11 TDE Model % Increase in File Size: AES vs RC6 vs
Blowfish (BF) vs Modified Blowfish (MBF) . . . . . . . 115
LIST OF TABLES
LIST OF ABBREVIATIONS
ACK acknowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
MR Map-Reduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
DC Data Confidentiality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
AD Active Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
BF Blowfish . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
RM Resource Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
EZ Encryption Zones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Chapter 1
INTRODUCTION
1.1 Introduction
This chapter focuses on the fundamental principles of Big Data, the
Hadoop framework, Cloud Computing, and security challenges in
Hadoop and Cloud Computing. The ultimate objective of the proposed
work is to develop a data security model for a Hadoop cluster in a cloud
environment. The Hadoop framework faces various threats and attacks,
and the requirements and obstacles in securing the model are described
in the following subsections. This chapter also presents the problem
definition and the research contributions made during the research work.
Table 1.1: Use-Cases of Big Data
Figure 1.1: Generic Hadoop Architecture
HDFS is a cloud file system and one of the Hadoop framework’s most
important components: a Java-based distributed file system that runs on
top of the native file system. For example, HDFS is installed on top of
the ext3, ext4, or XFS file systems in the Linux operating system. It
simultaneously saves and retrieves documents from several connected
nodes and is the component of the Hadoop architecture responsible for
data storage. Hadoop’s storage architecture is distributed across
multiple servers to improve reliability and lower costs.
computers. The core dedicated server in this environment is called the
NameNode, which contains the metadata [5]. The application data is
stored in DataNodes. The information security technique is
characterised by the various ways of providing reliable storage across
numerous DataNodes. The HDFS architecture is depicted in Figure 1.2.
locations (meta-data) and offset. The client makes direct contact with
the DataNode once it receives the DataNode locations from the NameNode.
In the case of a write, the client simultaneously appends data to the relevant
first slave servers. After that, the slave servers replicate the data to other
slave servers sequentially.
NameNode: To alleviate the latency caused by disc reads, the NameNode
keeps metadata in memory for speedier retrieval [7]. Metadata parameters
include the filename, file path, number of blocks, block ID, block
locations, slave-related parameters, and so on.
All Data Nodes in the Hadoop cluster are synchronized so that they
can communicate with one another and ensure that:
HDFS Client: The HDFS defines schema to create, store, and delete
files on DataNodes with application-specific access. Figure 1.3 depicts
the HDFS client’s connection to the HDFS storage area.
HDFS generates a pipeline; the client puts data into the pipe,
conducting various operations on the data files. The file can be deleted,
written to, and updated using the HDFS client. The client can connect
directly to the DataNode for accessing or transferring a required block.
The client conducts the write/read action, and data is updated on
NameNode and DataNode. The NameNode is accessed first, followed
by the DataNode for mapping to submit the request. The client also
acquires the node-to-node mapping for data replication in order to transfer
the data. The block-specific procedures are carried out in a well-organized
manner. Acknowledgements are returned in reverse order once the copies
of a block have been written. The last slave server informs the preceding
slave servers, down to the primary slave server, and the primary slave
server acknowledges the client. Once the data has been successfully
written, the client alerts the master (NameNode) server. The master server
updates the metadata, and the client operations shut off the pipeline.
To save a file in the HDFS, the client uses the distributed file system
to send a write operation to NameNode, as shown in Figure 1.4.
NameNode determines which DataNodes are available. NameNode
responds with a sufficient number of DataNode locations for each block
(with replication) based on file size. A client publishes the blocks of a
file to HDFS in parallel on various machines after obtaining the
DataNode locations (DataNodes). However, the block is replicated
sequentially. The DataNodes accomplish this by creating a pipeline for
every block and copying the blocks one by one. The write is considered
complete when the acknowledgement (ACK) is transmitted from the last
DataNode back through the previous DataNodes until it reaches the
primary DataNode. The client is then informed that the principal
DataNode has successfully written the block. At the same time, the
DataNodes acknowledge the NameNode through heartbeat messages.
The client then disconnects from the DataNodes in question.
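For reference, the client side of this write path can be sketched in Java with Hadoop's FileSystem API, as below; this is a minimal illustration, assuming the cluster address (fs.defaultFS) is available in the configuration files on the classpath and that /user/demo/sample.txt is a hypothetical path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml/hdfs-site.xml from the classpath; fs.defaultFS must point at the NameNode.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"))) {
            // The client streams bytes into the write pipeline; block placement
            // and replication are handled by HDFS itself.
            out.writeUTF("hello hdfs");
        }
    }
}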
Consider a client/user node requesting to read a file from HDFS. If the
data is 100 megabytes and the block size is 64 megabytes, there will be two
blocks: one of 64 megabytes and the other of 36 megabytes. The client
node issues a request to the distributed file system (software in the library),
which is forwarded to the NameNode. The NameNode verifies the
namespace (metadata) to see if the requested file is present in HDFS. If it
exists, it responds by sending the block positions to the user. As shown in
Figure 1.5, once the client has the coordinates of the file blocks, it accesses
the DataNodes directly where the file blocks exist.
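A corresponding minimal read sketch, using the same Java FileSystem API, is shown below; again, the file path is hypothetical and the cluster address is assumed to come from the configuration on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"))) {
            // open() obtains the block locations from the NameNode; the stream
            // then reads each block directly from the DataNodes that hold it.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}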
This process of reading/writing a file to HDFS is composed of two halves:
Figure 1.5: Anatomy of a File Read
1.3.5 MapReduce
iii. It assigns application master and containers to node manager for job
execution.
Figure 1.6: Architecture of Hadoop YARN
is a programme that handles containers. It allocates a particular amount
of resources to the application master on request.
Reduce: The shuffle and sort phase’s output (K2, V2) will be used as
input for the Reducer phase. Key K2 remains the same in the reducer
phase and performs a function operation on V2 (K2, f(V2)) to yield the
final output as a Key/Value (K3, V3) pair.
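As a concrete illustration of this Key/Value flow, the classic word-count job can be sketched in Java as follows: the Mapper emits the intermediate (K2, V2) pairs and the Reducer applies f(V2) = sum over the grouped values to produce the final (K3, V3) output. The class names are illustrative, not part of the proposed model.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) pairs -- the intermediate (K2, V2) records.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            ctx.write(word, ONE);
        }
    }
}

// Reducer: receives (K2, list<V2>) after shuffle/sort and applies f(V2) = sum,
// producing the final (K3, V3) pair.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
    }
}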
user data correctly.
Figure 1.9: Cloud Computing Model
high-security constraints to protect against attackers in some specialized
applications.
Hybrid Cloud: The functionality, features, and criticalities of both
(private and public) cloud types are combined in a hybrid cloud [15].
These cloud models are linked in the private and public domains to share
resources and services. Specialized and generalist services are available
and shared in the same environment, subject to authorisation and
authentication requirements. The sharing rules and different security
limitations are specified based on public and organisational access to
these cloud services. Corporations often supply add-on services such as
load balancing and cloud bursting to other cloud models in the hybrid
cloud.
Community Cloud: This form of cloud is set up for a certain group of
enterprises to utilise exclusively [15]. This form of cloud is adaptable
enough to be utilised by a community of businesses with similar
qualities, such as educational institutions and banks. Usage
authorization, authentication rules, and architecture may all be described
openly to organisation users. Ownership, management, and operations
for corporate clients may be carried out by many different parties
(internal-external). A community cloud is thus a cloud system confined
to a particular group of organisations. Community models represent the
different degrees of authorization with on-campus and off-campus
accessibility.
IAAS: Providers of IaaS host client virtual machines or supply network
storage. The IAAS model is directly linked to the hardware and shared
[16]. Networking, storage (NAS/SAN), servers, and virtualization
constitute the bare-metal hardware layer. These components include
things like a data centre, storage discs, memory, and CPUs, among others.
not aware of these hardware components. Rather than actual resources,
virtual resources are provided to end-users in the cloud system. The
characterisation of these resources may be done based on sharing
elements, support, and scalability. This layer of the cloud system
infrastructure also includes security measures such as intrusion
detection, intrusion prevention, and a firewall. Amazon Web
Services (AWS) Elastic Compute Cloud (EC2), GoGrid, Flexiscale,
Windows Azure, RackSpace, Google Compute Engine, Hewlett Packard
Enterprise, DigitalOcean, and others are examples of such service
providers.
PAAS: Users are provided with a network-hosted software development
environment in PAAS [17]. Providers maintain the underlying
infrastructure, operating system, programming platform, and so forth.
Because the providers abstract these fundamental resources for security
reasons, customers will not have access to them. The resources allocated
by multiple suppliers to the application developer are measured on a
request basis. Clients throughout the world will be able to access apps
that have been created and deployed on such a platform. The client can
acquire service or server access here without obtaining any physical or
core data. PAAS can be used to enable resource management, data
sharing, user authentication, hardware feature characterization, and
application-level control. The cloud system’s adaptability can be
explained by its support for multiple programming languages. This layer
also provides security constraint integration, party relationships, and the
development
cycle. SalesForce.com, Amazon AWS Elastic Beanstalk, Google App
Engine, Microsoft Azure, IBM BlueMix, RedHat Software, OpenShift,
VMWare Pivotal Software, and Heroku are just a few examples of such
service providers.
SAAS: SAAS provides network-hosted (remote) applications to end
consumers, which are often accessed via mobile client apps or browsers.
Users can make a service request under this model. They don’t have
control over any of the cloud stack’s tiers. The cloud service provider is
responsible for handling, managing, maintaining, and monitoring the
complete cloud stack. Google Apps, Salesforce.com [17], Workday,
Citrix GoToMeeting, CiscoWebEx, DropBox, Paycom, Splunk,
HubSpot, and Zohosuite.com are a few examples of SAAS providers.
hand, cloud computing encompasses a variety of computing concepts
that incorporate a large number of machines connected via a real-time
communication network. Cloud computing focuses on scalable,
on-demand, and adaptive service architectures. In cloud computing,
Cloud MapReduce is a replacement for MapReduce [18]. The
fundamental distinction between Hadoop and cloud MapReduce is that
cloud MapReduce does not provide its own implementation; instead, it is
dependent on the infrastructure provided by various cloud service
providers. Hadoop is an ‘ecosystem’ of open source software that is
straightforward to adopt and is widely used on industry-standard
hardware.
Since the term ”cloud” has been defined, it is clear what the phrase
”Hadoop in the cloud” means: operating Hadoop clusters on resources
provided by a cloud provider, as illustrated in Figure 1.10. This practice
is regularly compared with running Hadoop clusters on one’s own
hardware, referred to as ”on-prem” or ”on-premises” clusters.
Figure 1.10: Hadoop Usage in Cloud Computing
That is not to say that there is nothing more to learn or that the
abstraction is complete. There are many choices and a variety of provider
characteristics to understand and investigate, so that the customer may
build not just a functional framework but a well-tuned arrangement of
Hadoop clusters. Cloud providers also offer services beyond what can be
accomplished on-premises, and the Hadoop cluster may benefit from
them.
Hadoop clusters occasionally need to operate alongside other systems.
Non-Hadoop servers and applications are supported by the clusters [19],
among other things, and the supporting assets surrounding them supervise
the flow of information in and out with their own tools. This supporting
cast can also run in the cloud, or dedicated systems management features
can help bring them closer together.
1.6 Security threats and possible attacks
Security is a set of systems, policies, and technologies that work together
to keep networks, computers, programmes, data, and information safe
against attacks, harm, and illegal access. A threat is anything or anyone
with the ability to harm a system or an organisation. Threats may be
natural (earthquakes, floods, or tornadoes), accidental (staff errors that
result in the deletion or disclosure of private data), or purposeful
(spyware, malware). The following cryptographic security elements
should be enabled in every system:
• Data integrity
• Data Encryption/Decryption
• Cracking (decrypting/deciphering).
to encrypt and decode data. Symmetric algorithms are further divided
into block and stream cyphers. Public-key cryptography algorithms are
sometimes known as asymmetric algorithms.
Some of the attacks on cryptographic systems are:
• Browser-Based Attacks
• Brute-Force Attack
• Replay attack
• Man-in-the-Middle-Attack
• Dictionary Attack
to carry out the assault. The purpose, though, remains the same. DDoS
is more effective and difficult to counteract. DDoS is occasionally
employed as a diversionary tactic to draw security personnel’s attention
away from a surreptitiously carried out large attack.
Replay attack Designed to stymie processing by repeatedly delivering
data to the host. In the absence of safeguards in the receiving service,
such as time stamping, one-time tokens, or sequence verification codes,
the system may handle duplicate files.
Man-in-the-Middle-Attack A Man in the Middle attack occurs when an
attacker can alter the data accepted by two legitimate users.
Dictionary Attack When it comes to password files, the dictionary
attack is the most popular. A dictionary attack is used to overcome an
authentication scheme by systematically trying each word from a word
list as the password or decryption key of an encrypted communication.
It takes advantage of users’ bad habit of choosing easy passwords based
on common phrases. The dictionary attack hashes (or encrypts) all of the
words in a dictionary and then compares each result to the encrypted
password saved in the SAM file or other password files. As a result, the
dictionary attack is always quicker than the brute force approach.
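As a small illustration of the idea, the following Java sketch hashes each candidate word with SHA-256 and compares it against a stored (unsalted) password hash; the word list and the stored hash are hypothetical inputs used only to show the mechanics.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

public class DictionaryAttackSketch {
    // Returns the dictionary word whose SHA-256 hash matches the stored hash, or null.
    static String crack(byte[] storedHash, List<String> dictionary) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        for (String word : dictionary) {
            byte[] candidate = digest.digest(word.getBytes(StandardCharsets.UTF_8));
            if (MessageDigest.isEqual(candidate, storedHash)) {
                return word;
            }
        }
        return null; // the password was not in the dictionary
    }
}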
i. Perimeter management
iii. Encryption
Encryption as an option for any data saved on disc in HDFS is currently
being worked on by the Hadoop community. Intel’s distribution (Project
Rhino) [22, 24] got a head start on this, allowing data encryption in
HDFS through the encryption instructions of the Intel CPUs used in
Hadoop slave nodes. Third-party technologies are also available to
encrypt data stored in HDFS.
down into little chunks and stored on multiple nodes. When a client runs
a map-reduce operation, it generates intermediate data that is stored to
disc before being shuffled to other nodes in a distributed environment,
which might be vulnerable to attack. The formation of communication
between Hadoop components offers a challenge for fine-grained security
implementation.
Kerberos authentication is now fully supported in Hadoop and its major
sub-projects (Pig, Hive, Sqoop, HBase, Mahout, and so on). However,
this only provides a certain level of protection. With Kerberos alone,
there is still no reliable intrinsic technique to classify client tasks for
better control across sub-projects, no genuine means to lock down access
to Hadoop processes (or daemons), and no way to encrypt information in
transit (or even at rest).
• A malicious user with network access might intercept an inter-node
conversation.
• Big data stacks were created with little or limited security. Prevalent
big data installations are constructed on the web services paradigm
with limited features for preventing mutual web dangers.
All programmers and users in the cluster had the same degree of
access to all data. Because MapReduce has no concept of authorisation or
authentication, a nefarious user may reduce the priority of other Hadoop
processes to make their work complete faster.
The firewall does not effectively handle the Hadoop security; when
the firewall is penetrated, the cluster is wide open to attackers. The
firewall provides no security for data in motion or data at rest within the
cluster. Firewalls also offer no protection from attacks originating within
the firewall perimeter due to security failure.
1.9.2 Security in Cloud
web-service API is compromised, attackers will have access to all other
web-service apps that utilized it. The cloud environment offers
function-driven analysis to notice anomalous and harmful actions.
Along with these security issues, other data breaches can be found,
resulting in various kinds of data damage. Buffer overflows, integration
issues, implementation faults, design flaws, and other errors are examples
of these flaws. Security concerns in this type of service delivery are:
assaults, such as Border Gateway Protocol (BGP), DDoS, and DoS, may
render services inaccessible.
1.10 Necessities
Hadoop was designed to manage enormous amounts of web data in an
environment where security was not a priority. Hadoop gained a
reputation as an insecure platform as use grew and it evolved into an
enterprise tool. Because the built-in security and accessible settings
differ between release versions, security is inconsistent among Hadoop
releases. As the digital universe expands and Hadoop is adopted in
practically every industry, including business, banking, health care,
military, education, and government, security becomes a key problem.
The prior implementations lacked security features due to inconsistencies
in built-in security and available options between release versions,
affecting a wide range of industries. This is when engineers discovered a
huge oversight: Hadoop did not come with any built-in security.
This had an impact on many Hadoop-based applications. As a result,
security is the next important step for the Hadoop framework.
HDFS is a commodity-hardware-based distributed file system that offers
data storage for other Apache Hadoop ecosystem components like MR
and Apache HBase. Storing sensitive data unprotected in a distributed
environment is an unacceptable risk for uses that deal with personal
information or financial records. In HDFS, data frequently needs to be
encrypted so that it can be stored safely at rest. This concern of Hadoop
security necessitates the development of cryptographic algorithms that
ensure end-to-end data security for clients; employing these algorithms
to enhance security against various attacks can be achieved by
integrating cryptosystems into HDFS.
1.11 Research Motivation
Along with the growing popularity of the Hadoop and Cloud Computing
environments, security issues are expanding with the development of
these technologies. Even though Hadoop and Cloud Computing can
deliver numerous advantages, a few threats/attacks may make them
vulnerable. Because of the way these environments are deployed, critical
security components remain underutilized. The motivation for the
proposed work is to provide security at the data level in such
circumstances and thereby upgrade overall security. Consequently, a need
has emerged for a legitimate solution that secures the components, closes
the loopholes, and addresses the difficulties exposed to attacks in
distributed computing and Hadoop, in order to lessen vulnerable
infrastructure and platform attacks.
1.13 Objectives
The present research work focuses on providing a data security model
for the Hadoop cluster in a cloud environment. To accomplish the desired
goal of this work, the specific objectives are as follows:
• To design and develop a secure model, that will be suitable for any
application.
• In phase 3, we proposed transparent end-to-end encryption by
incorporating symmetric algorithms like AES, BF and Rivest
Cipher 6 (RC6) into Hadoop’s common library. This architecture is
appropriate for any Hadoop-based application or tool. When
compared to a bespoke MR model, it delivers resilient
(data-in-transit or data-at-rest) security, quick processing, and
lower overhead.
Chapter 4: Performance Analysis
This chapter will be devoted to the performance analysis of the
implementation of AES, BF, RC6 and MBF Algorithm in HDFS and
will discuss its suitability. Experimental results and analysis in two
different simulation environments (Locally and Cloud) will be covered
in this chapter.
Chapter 5: Conclusion and Future Work
This chapter will summarize the proposed approach and concluding
remarks about the work done under the scope of this thesis. The
difficulties, limitations, and technical problems encountered and future
research directions associated with this work will be highlighted in this
chapter.
Chapter 2
LITERATURE REVIEW
The literature survey gives a theoretical base for the research and is useful
for the investigator in assessing the available data. It helps the investigator
collect, summarize, assess, and clarify the existing literature. Reviews are
also helpful in understanding the subject and its significance, identifying
the conceptual methodology, and subjecting it to sound reasoning and
meaningful interpretation.
methods. It also provides an additional layer between the client and the
Hadoop cluster and characterizes the two parties as the client/user and
the data server. A new key is produced from a varying number using a
hashing procedure and is allotted to every client, providing
authentication and authorization permissions between the different parts
of Hadoop. Appropriate data confidentiality with data integrity is
guaranteed by applying symmetric encryption to client data stored on
HDFS [33].
M. Sarvabhatla et al. analyzed and illustrated the possible security
flaws (offline password guessing attack) in N. Somu et al. [34]
authentication service with a one-time pad. To fix this, they proposed an
authentication scheme based on light-weight OTP for the Hadoop
environment. They analyzed the scheme from the security perspective
and illustrated that the scheme could mitigate security flaws. So, the
proposed scheme provides a robust and secure authentication service to
the Hadoop platform [36].
Y. S. Jeong et al. focused on mitigating replay and DataNode
impersonation attacks by developing a hash chain technique that
addresses their cause. This expanded approach produces hash-chained
block values, and in this way HDFS blocks become security-aware and
unavailable for an outsider to access the client’s data. The proposed
approach contains three stages: initialization, NameNode–client
authentication, and client–DataNode authentication. Client data should
be invisible to the DataNode, and the DataNode needs to confirm access
with the NameNode. Consequently, DataNode impersonation attacks can
be largely eliminated [38].
daemons of the Hadoop framework. Additionally, external clients are
authenticated with DataNodes and NameNode mutually while
interacting with Hadoop internal components. This subtle change by
adding bind and seal functions in TPM Model protects the Hadoop from
insiders’ malicious attacks using trusted third party attestation identity
keys (AIK) certifications [41].
2.2 Data Security in Hadoop Distributed File System
Securing the data is even more important than just securing the
system. In Hadoop, data security is achieved with the help of individual
implementation of various algorithms using MapReduce. Lightweight
encryption methods are always desirable. Many security algorithms are
used in MapReduce concepts such as RC4, ARIA, DES, Triple DES,
Blowfish, AES and one-time pad, ECC, Hashing, and MD5. The related
literature in concern to data security is mentioned in the following
manner.
HDFS. In HDFS-RSA, they utilized both symmetric-key and public-key
encryption. The file is separated into fixed-size blocks; all the blocks
except for the last block are encoded using AES, whereas the last block
is encrypted using the RC4 stream cipher. In HDFS-Pairing, access
control mechanisms were added to secure the information. This hybrid
approach is guaranteed to give better secrecy in Hadoop [46].
In this approach, the HDFS files are encrypted by using DES. The
encryption of the client’s DES key is done using the RSA public key,
while the client’s RSA private key is encrypted using IDEA. This model
is implemented and incorporated within data storage in the cloud-based
on Hadoop. The confidentiality of reading and writing the files to the
cloud is enhanced herein [50].
inside and out, which can prevent and mitigate various attack vectors,
for example hardware access, key management, root access, rogue user,
and HDFS-admin levels of exploit. TDE gives end-to-end security to the
information stored on HDFS. The information is encrypted end-to-end
while being written and is consequently decrypted on reading. TDE
supports only the AES algorithm to encrypt and decrypt the data [5].
work has increased the file size by 50 percent, whereas AES with OTP
has reduced the file size to 20 percent after encryption. The proposed
model has enhanced the performance of encryption and decryption in
Hadoop [55].
2.3 Security Tools for Hadoop Cluster by Third Party
In this section, we will illustrate different types of security tools being
integrated into the Hadoop cluster. The primary tools used for providing
and monitoring the security in Apache Hadoop are Apache Knox, Apache
Ranger, Kerberos, Project Rhino, Apache Sentry, and eCryptfs.
the ticket-granting service (TGS). The Kerberos database handles all the
principals and the realms. Kerberos integrated with Hadoop generates
many tickets for ensuring proper access. It creates a delegation token for
establishing communication between the client and NameNode. Once
the connection is established, a block access token is designed to secure
the connection between the client and DataNodes [26].
partition. eCryptfs can also be applied at the file or directory level.
eCryptfs is an application package bundled with Linux-based operating systems.
eCryptfs supports the following key level encryption: OpenSSL, tspi,
and passphrase. eCryptfs encrypts the data based on user selection
between various symmetric cryptographic algorithms such as AES,
Blowfish, Triple DES, Cast5, and Cast6. As Hadoop is deployed on top
of the Linux file system, the files or directories created on top of HDFS
can be encrypted using eCryptfs [60].
2.4.1 Security Issue in Cloud Computing
Z. Xiao et al. have discussed the most important privacy and security
attributes, such as authentication, privacy preservability, accountability,
availability, integrity, and confidentiality. They presented the
relationships among these attributes, the vulnerabilities that attackers
may exploit, and the defense mechanisms adopted by cloud service
providers. For each attribute, future research directions are identified [62].
2.4.2 Authentication
using ECC, and agreement can be ensured using Diffie-Hellman.
Authentication is ensured by using ECDSA (Elliptic Curve Digital
Signature Algorithm) between client and service provider. The MAES
provides more robustness towards vulnerabilities by enhancing the
complexity of the AES algorithm [64].
infrastructure. In this approach, a Private Key Generator (PKG) will
generate a public and private key. ECC algorithm encrypts the file using
the public key. Later SHA-2 creates a hash value, which will work as a
key for the RSA algorithm applied to the encrypted data. The receiver
will receive OTP through e-mail and use the reverse process to decrypt
the data. This proposed scheme will reduce the system cost and
time-complexity by enhancing the security mechanism in sharing and
accessing the file from the sender to the receiver [66].
2.4.3 Confidentiality
a medium-sized and small enterprise to limit the maintenance of storage
servers and their investments. Cloud service provision adopts the
multi-tenant approach, where a resource is virtualized and handles
multiple customers. The confidentiality parameter assures cloud storage
protection. The encryption is a generally utilized technique for ensuring
confidentiality. If cloud data confidentiality is violated, the industry will
lose data. They used encryption and obfuscation as two various
technologies for securing data and the confidentiality of cloud storage.
Encryption is the mechanism of using an algorithm and a key to
transform readable text into an unreadable form. Encryption and
obfuscation are closely related: obfuscation is a mechanism that hides
data from unauthorized users by applying a specific mathematical
function or using programming techniques. Obfuscation and encryption
can be applied according to the data type; encryption may be applied to
alphanumeric and alphabetic data, while numerical data may be
obfuscated. A larger number of unauthorized users can be kept away
from the cloud data by using obfuscation and encryption techniques.
By the combined use of obfuscation and encryption,
confidentiality can be attained [68].
2.4.4 Integrity
Based on the security levels, Y. Ren et al. make the master manage the
computing workers; building on existing methods, a caching mechanism
and a trusted verifier worker are introduced. According to the system
analysis, the service integrity framework is more efficient in the cloud
computing environment at detecting malicious workers based on
MapReduce [69].
MapReduce, they design and implement the system design at both layers
with popular management applications and big data analytics like
Apache Mahout, local cluster environment, and Pig on commercial,
public clouds such as Amazon EC2 and Microsoft Azure. The proposed
model has shown better performance overhead as compared to existing
approaches [70].
the mining procedure for guaranteeing the security of data; their work
considered the viability of utilizing Eucalyptus Hadoop nodes and the
performance changes arising from the use of the security protocol
[73].
2.4.5 Availability
A. Undheim et al. developed a model for the cloud data center to focus
on the availability attribute of a cloud SLA. The investigation was
performed on various techniques, which increased the availability of the
virtualized system. The outcomes of the works demonstrated that it
achieved availability differentiation based on failure rate and various
deployment scenarios. To restart the virtual machine, different priority
levels are used, which showed large differences [74].
four components: Distributed Replicated Block Device (DRBD),
Virtualization fault tolerance (VFT), Xen Hypervisor, and lastly,
OpenNebula. The results confirmed that the interval of downtime could
be reduced even when a failure happened. The virtualization fault tolerance is
used in many areas like the cluster-based system, not only Hadoop
applications. The careful design and implementation have confirmed
that a single point of failure has been eliminated [76].
2.4.6 Accountability
accountability. To increase accountability, detection methods are
adopted instead of preventive methods. Detective strategies support
protective approaches since external threats, as well as insider risks, are
also considered. Detective techniques can also be used in a less invasive
manner than preventive procedures. They argued that moving toward
integrity and accountability of the data exchanged between end-users,
alongside concerns of system health and performance, involves a
file-centered view rather than a purely system-centric approach [79].
2.4.7 Privacy and Preservability
security requirements, and the lower-tier fog nodes are positioned on fog
buses. Access control and data integrity verification were included to
resolve data forgery. These innovations were used for traceability and to
support the use of incentive mechanisms [85].
V. S. Mahalle and A. K. Shahade have proposed a hybrid encryption
algorithm to secure the data stored in the cloud. In this hybrid scheme, a
combination of RSA and AES is used with different key lengths. A
unique key is generated based on system time, which enhances the
complexity of cracking the key by an intruder. The data is encrypted
twice for attaining enhanced security for data and protecting from
unintended access or attacks. Firstly the data gets encrypted using the
AES secret key; later, the encrypted data is again encrypted using RSA
public key. Users can access the file using AES secret key and RSA
private key by which integrity and confidentiality are achieved [88].
N. Sengupta has identified the security flaws of data-in-transit to
cloud systems and proposed a hybrid RSA algorithm for cloud systems.
The proposed model is divided into two stages for providing security
to data. In the first stage, RSA encryption is applied to the data. In
the second stage, the Feistel encryption algorithm is applied to the
output generated by the first stage. The new hybrid RSA encryption
algorithm is used to transfer data into the cloud system and minimizes
man-in-the-middle attacks [91].
G. Amalarethinam and H. M. Leena propose an enhanced RSA
algorithm for providing data security in the cloud. In the standard RSA
algorithm, the private and public keys use the same computed value N
for both encryption and decryption. They proposed a subtle modification
on the computed N value by introducing two values, N1 and N2: RSA
uses N1 during encryption and N2 during decryption instead of the same
value N. The authors claim that the proposed model
enhances the performance when compared to the standard RSA
algorithm [94].
2.6 Summary
Several schemes have been explained that address the issues of data
security in a Hadoop-based cloud computing environment. This chapter
reviews the security provisions in the Hadoop and Cloud ecosystem for
authentication, authorization, auditing, and cryptographic (symmetric
and asymmetric) algorithms. Next, third-party security tools for the
Hadoop cluster, such as Apache Knox, Apache Sentry, Apache Ranger,
Project Rhino, and Kerberos, are reviewed. Then, we studied security
algorithms like RC4, ARIA, and AES with One-Time Pad and ECC,
and security issues in cloud computing like confidentiality, integrity,
availability, privacy, and accountability.
Chapter 3
SYSTEM DEVELOPMENT
• Data Ingestion Tools: Sqoop allows data to be transferred from a
Relational DataBase Management System (RDBMS) to HDFS and
back. Flume is used to collect data simultaneously from sources such
as web servers, Twitter, Facebook, and LinkedIn.
• Processing Tools: Pig, Hive, Mahout, Spark, and other tools may
analyze and process data stored on HDFS.
One of the most intriguing aspects of Hadoop is how the tools in the
Hadoop ecosphere interact with the YARN and HDFS basic
architecture. In retrospect, security was an afterthought, and Hadoop falls
short of a suitable security strategy. Organizations are concerned about
the security of sensitive data that they are gradually storing in the
Hadoop framework as Hadoop’s footprints have grown. Administrators
have a significant and complicated challenge in ensuring data security in
Hadoop and its components. Authentication, authorization, and
safeguarding data stored in Hadoop are the three most essential elements
in Hadoop security. Authentication and authorization are now provided
through conventional third-party community-supported approaches.
Table 3.1: Provision of Security by the Hadoop Distributors
Parameters compared across the four Hadoop distributors (Hortonworks, Cloudera, MapR, and IBM Insights):

Authentication
• Hortonworks: uses Kerberos with Ambari and Knox.
• Cloudera: Kerberos with AD/LDAP for client and service authentication.
• MapR: provides binary options for authentication, via native authentication and Kerberos.
• IBM Insights: uses Kerberos with LDAP for authentication between clients and services.

Authorization
• Hortonworks: Ranger handles access control.
• Cloudera: POSIX-style permissions with ACLs and RBAC; mainly Apache Sentry is used for authorization.
• MapR: uses Access Control Expressions (ACE) for granting access to authorized users.
• IBM Insights: ACLs handle the authorization.

Auditing
• Hortonworks: Atlas with Ranger enforces data access policies and analysis.
• Cloudera: Cloudera's proprietary product, Cloudera Navigator, is used to analyze the logs.
• MapR: maintains the log file, and later Apache Drill is used to analyze these log files.
• IBM Insights: adopts a lightweight JMX monitoring tool for auditing.

Data-in-Transit
• Hortonworks: RPC connections are secured using SASL and SSL.
• Cloudera: Kerberos RPC prevents impersonation attacks.
• MapR: over-the-wire encryption is adopted to protect the data-in-transit phase.
• IBM Insights: data-in-transit is secured by SSL and TLS certificates.

Data-at-Rest
• Hortonworks: TDE is used to encrypt/decrypt data-at-rest.
• Cloudera: provides a shield through HDFS encryption to protect data-at-rest.
• MapR: offers more granular protection of data-at-rest through Application Encryption.
• IBM Insights: data encryption for data-at-rest is ensured through two techniques: 1) TDE and 2) IBM Data Encryption.
Hortonworks, Cloudera, MapR, and IBM Insights are the four
major Hadoop distributors. Some distributors have added their flavour of
components to the standard Hadoop components for security. Some
distributors have made these extra components proprietary, while others
have made them open source.
We use AWS to implement HDFS in the cloud environment. AWS
provides clients with dependable, adaptable, and scalable services. EC2
is a key Infrastructure-as-a-Service (IAAS) offering of the Amazon cloud.
A web-based management panel virtualizes and provisions various
computer components such as servers, network endpoints, and storage of
the required configuration in seconds. Before granting access to services
placed on AWS, customers must first authenticate themselves using the
administration portal. KMS manages the Hadoop cluster’s permission
by permitting or rejecting access depending on access privileges.
The length of keys (number of bits) for encryption in the public key
cryptosystem is considerable. As a result, the encryption/decryption
process is slower than symmetric key encryption. Symmetric encryption
methods are favoured over asymmetric encryption techniques for
encrypting large volumes of data for these reasons. The asymmetric
process requires more significant computational (processing) capacity
than symmetric key encryption.
Block ciphers and Stream ciphers are two types of symmetric cipher
algorithms. Block ciphers encrypt a block of a defined length at a time,
usually 64 or 128 bits, whereas Stream ciphers encrypt data one bit or
byte at a time and are most commonly employed in a continuous flow
of data. Stream ciphers are quicker than block ciphers, but they’re more
challenging to set up and prone to assaults. As a result, block ciphers are
chosen over stream ciphers in this study.
3.4.1 AES
In 1998, Joan Daemen and Vincent Rijmen created the Rijndael
algorithm and submitted a proposal to the National Institute of Standards
and Technology (NIST). The AES cipher is a Rijndael variant that uses a
substitution-permutation network rather than a Feistel network. The AES
algorithm has a fixed block length of 128 bits of plain text and supports
key lengths of 128, 192, and 256 bits. For these key sizes, AES specifies
10 (128-bit), 12 (192-bit), and 14 (256-bit) rounds. In every round except
the last, AES transforms the data block using four transformation
functions (Substitute Bytes, Shift Rows, Mix Columns, and Add Round
Key); the Mix Columns transformation is not used in the final round. The
128-bit data block is structured as a 4x4 matrix of bytes, and a state array
matrix of 4x4 order holds all of the intermediate output created after each
transformation step. In the past decade, various attacks have been
reported, viz. distinguishing attacks, key recovery attacks, and
side-channel attacks.
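As a minimal illustration of how AES is typically invoked from Java (the language in which Hadoop and the cryptosystems used in this work are implemented), the following sketch uses the standard javax.crypto API with AES in CBC mode and PKCS5 padding. The sample plaintext and the 128-bit key size are arbitrary demonstration choices, not the configuration of the proposed model.

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class AesExample {
    public static void main(String[] args) throws Exception {
        // Generate a random 128-bit AES key (192- and 256-bit keys select 12 and 14 rounds).
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        // A fresh random IV is required for CBC mode.
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] cipherText = cipher.doFinal("sensitive record".getBytes(StandardCharsets.UTF_8));

        cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] plainText = cipher.doFinal(cipherText);
        System.out.println(new String(plainText, StandardCharsets.UTF_8));
    }
}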
3.4.2 RC6
Rivest, Robshaw, Sidney, and Yin created RC6 to meet the AES criteria
[101]. RC6 is a variant of RC5 with two more registers and a
multiplicative operation. RC6-w/r/b is the exact specification, with w
indicating the word size, r the number of rounds in the cipher,
and b the length of the encryption/decryption key. The word
size (w) in RC6 is 32 bits and the number of rounds (r) is fixed at 20,
while the key length has three options: 128, 192, and 256 bits.
3.4.3 Blowfish
The key length in BF can range from 32 bits to 448 bits. BF is built
on the Feistel network and features a 64-bit fixed block length with a 16-
round iteration. It also contains the setup of four S-Boxes and P-Arrays.
The variables P-Array and S-Box are created from the hexadecimal digits
Algorithm 1 RC6 Encryption Process
1: The input block length is of 128-bits which is subdivided into w-bits (w=32 bits)
and stored in four registers A, B, C, D.
2: The output cipher block length is of 128-bits (Encrypted Data)
3: Initialize the number of rounds r and generate the w-bit round keys S[0, 1, ..., 2r+3]
4: User-provided secret key z is loaded into an array L[], and two magic (Pw , Qw )
constants are used based on the length of the w-bit.
5: if w = 32 − bit then
6: Pw = B7E15163
7: Qw = 9E3779B9
8: end if
//The Process for Creating Rounds
9: S[0] = Pw
10: for i = 1 to 2r+3 do
11: S[i] = S[i-1] + Qw
12: end for
13: a = b = c = d = 0
14: e = 3 * max(number of w-bit words in L, 2r + 4)
15: for j = 1 to e do
16: a = S[c] = (S[c] + a + b) <<< 3
17: b = L[d] = (L[d] + a + b) <<< (a + b)
18: c = (c + 1) mod (2r + 4)
19: d = (d + 1) mod (number of w-bit words in L)
20: end for
//The data blocks in registers B and D undergo a pre-whitening step; the purpose
of pre-whitening is to mask the relationship between the input text and the first round
21: B = B + S[0]
22: D = D + S[1]
//After pre-whitening step the round operations are applied
23: for i = 1 to r do
24: T = (B * (2B + 1)) <<< lg w
25: U = (D * (2D + 1)) <<< lg w
26: A = ((A ⊕ T) <<< U) + S[2i]
27: C = ((C ⊕ U) <<< T) + S[2i + 1]
28: (A,B,C,D) = (B,C,D,A)
29: end for
30: A = A + S[2r + 2]
31: C = C + S[2r + 3]
Algorithm 2 RC6 Decryption Process
1: The Input cipher block length of 128 bits passed to the decryption process which is
subdivided into w-bits (w=32 bits) and stored in four registers A, B, C, D.
2: The output plain block length is of 128-bits
3: Initialize the number of rounds r and generate the w-bit round keys S[0, 1, ..., 2r+3]
4: User-provided secret key z is loaded into an array L[], and two magic (Pw , Qw )
constants are used based on the length of the w-bit.
5: if w = 32 − bit then
6: Pw = B7E15163
7: Qw = 9E3779B9
8: end if
//The Process for Creating Rounds
9: S[0] = Pw
10: for i = 1 to 2r+3 do
11: S[i] = S[i-1] + Qw
12: end for
13: a = b = c = d = 0
14: e = 3 * max(number of w-bit words in L, 2r + 4)
15: for j = 1 to e do
16: a = S[c] = (S[c] + a + b) <<< 3
17: b = L[d] = (L[d] + a + b) <<< (a + b)
18: c = (c + 1) mod (2r + 4)
19: d = (d + 1) mod (number of w-bit words in L)
20: end for
//The data blocks in registers C and A undergo a whitening-removal step, undoing
the post-whitening that was applied at the end of encryption
21: C = C - S[2r + 3]
22: A = A - S[2r + 2]
//After pre-whitening step the round operations are applied
23: for i = r to 1 do
24: (A,B,C,D) = (D,A,B,C)
25: U = (D * (2D + 1)) <<< lg w
26: T = (B * (2B + 1)) <<< lg w
27: C = ((C - S[2i +1]) >>> T) ⊕ U
28: A = ((A - S[2i]) >>> U) ⊕ T
29: end for
30: D = D - S[1]
31: B = B - S[0]
of Pi. P-Array is a one-dimensional array with eighteen 32-bit items. The
S-Box is a two-dimensional array with 256 entries of 32-bits.
P-Array ⇒ P[1 ... 18]
S-BOX ⇒ S[0 ... 3] [0 ... 255]
The execution of BF is split into two parts:
1. Sub-Key generation
In this process, the original secret key gets discarded after the
transformation of P-Array and S-Box.
Algorithm 4 BF Encryption Process
1: The input block length is of 64-bits denoted by Y which is subdivided into two
32-bits halves: YL = 32-bits, YR = 32-bits
2: The output cipher block length is of 64-bits (Encrypted Data)
3: for k = 1 to 16 do
4: YL = YL ⊕ PK
5: YR = F(YL ) ⊕ YR
6: SWAP YL and YR
7: end for
8: SWAP YR and YL (UNDO LAST SWAP)
9: YR = YR ⊕ P17
10: YL = YL ⊕ P18
11: Combine YL and YR
Figure 3.3: Generic Representation of BF algorithm
• XORing with sub-key P[1] is now performed on the left 32-bit block
(32-bits). The output is sent into the F function. As illustrated in
Figure 3.4, the input 32-bits are divided into four 8-bits and sent to
four distinct S-Boxes.
addition of S-Box 2) XOR with S-Box 3) modulo addition with
S-Box 4).
• Swap the right and left blocks and go through the procedure again
for a total of sixteen rounds.
• Now use P[17] to conduct XOR on the right block’s output, and
P[18] to do XOR on the left block’s output.
• Combine the blocks on the left and right sides. The 64-bit ciphertext
will then be obtained.
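To make the round structure concrete, the following minimal Java sketch shows one possible implementation of the F function described above, assuming the four S-boxes have already been initialized from the hexadecimal digits of Pi; it is an illustration of the standard Blowfish round, not the exact code used in this work.

// One possible F function for a Blowfish round; sBox is the 4 x 256 table of 32-bit entries.
static int feistelF(int x, int[][] sBox) {
    // Split the 32-bit input into four 8-bit quarters.
    int a = (x >>> 24) & 0xFF;
    int b = (x >>> 16) & 0xFF;
    int c = (x >>> 8) & 0xFF;
    int d = x & 0xFF;
    // ((S1[a] + S2[b]) XOR S3[c]) + S4[d], with additions modulo 2^32
    // (Java int arithmetic wraps around, giving the modular addition for free).
    return ((sBox[0][a] + sBox[1][b]) ^ sBox[2][c]) + sBox[3][d];
}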
sequential execution of the Feistel structure to parallel processing using the
Adams-Moulton method. By doing so, the performance of the BF
cryptographic algorithm is significantly enhanced. XOR(EL , Pk ) is
applied in each round to obtain a new 32-bit EL. The obtained EL is
fed as an input to the Feistel function, where it is sub-divided equally
into four 8-bit groups (denoted w, x, y, z) and substituted through the S-boxes
as indicated in Figure 3.5.
R = S(N) = ((S1,w + S2,x) ⊕ S3,y) + S4,z    (3.1)
Where “+” is the addition of 32-bit words modulo 2^32, S1,w represents key S-box[1][w],
S2,x key S-box[2][x], S3,y key S-box[3][y], and S4,z key S-box[4][z].
The equation consists of an independent and a dependent variable.
Thus, Ordinary Differential Equations (ODEs) can be used to increase
the algorithm speed. An ODE typically admits two types of solutions:
one analytic and the other numerical. Analytical methods have drawbacks
that depend on the nature of the function and, in contrast to numerical
methods, are not well suited for practical applications, so numerical
approaches are preferred in this situation. The first-order Adams-Moulton
method is similar to the Euler method, with a truncation error of Θ(h^2).
Higher-order methods reduce this error: the fourth-order Adams-Moulton
method has a local truncation error of Θ(h^5) [102] and a global error of
Θ(h^4), offering more precision and greater stability. The Adams-Moulton
formulation is implicit, unlike the explicit Adams-Bashforth methods. The
fourth-order Adams-Moulton method also needs less space compared to the
existing Blowfish algorithm. The fourth-order Adams-Moulton method is
shown in equations (3.3)–(3.6).
Rn+3 = Rn+2 + (h/24)(9Sn+3 + 19Sn+2 − 5Sn+1 + Sn)    (3.3)

Rn+2 = Rn+1 + (h/12)(5Sn+2 + 8Sn+1 − Sn)    (3.4)

Rn+1 = Rn + (h/2)(Sn+1 + Sn)    (3.5)
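As a small numerical illustration of the corrector in equation (3.3), the sketch below applies the fourth-order Adams-Moulton method to an assumed test problem dR/dN = S(N, R) = −R, using an Adams-Bashforth step as the predictor. The test function, step size, and starting values are assumptions made purely for illustration and do not reflect how the method is embedded in the modified cipher.

```python
# Hedged illustration of the fourth-order Adams-Moulton corrector (eq. 3.3),
# applied to the assumed test ODE dR/dN = S(N, R) = -R (exact solution exp(-N)).
import math

def S(N, R):
    return -R

def adams_moulton4(R, h, steps):
    # R holds four starting values R0..R3 (here taken from the exact solution);
    # each new point is predicted with Adams-Bashforth and corrected with eq. (3.3).
    f = [S(i * h, r) for i, r in enumerate(R)]
    for n in range(len(R) - 1, steps):
        # four-step Adams-Bashforth predictor
        pred = R[n] + h / 24 * (55 * f[n] - 59 * f[n - 1] + 37 * f[n - 2] - 9 * f[n - 3])
        f_pred = S((n + 1) * h, pred)
        # fourth-order Adams-Moulton corrector, equation (3.3)
        corr = R[n] + h / 24 * (9 * f_pred + 19 * f[n] - 5 * f[n - 1] + f[n - 2])
        R.append(corr)
        f.append(S((n + 1) * h, corr))
    return R

h = 0.1
R = adams_moulton4([math.exp(-i * h) for i in range(4)], h, 10)
print(R[-1], math.exp(-10 * h))   # numerical vs exact value at N = 1.0
```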
executed in parallel and generate an ID that is passed to the function for further
processing.
Algorithm 5 Encryption Process of Modified Blowfish
1: The input block length is 64 bits, denoted by E
2: The output cipher block length is 64 bits (Encrypted Data)
3: Divide the 64-bit input plaintext block into two 32-bit halves: EL = 32 bits, ER =
32 bits
4: for i = 1 to 16 do
5: EL = EL ⊕ Pi
6: ID = getK(EL )
7: ER = R(EL ,ID) ⊕ ER
8: Swap EL and ER
9: end for
10: Swap ER and EL (undo the last swap)
11: ER = ER ⊕ P17
12: EL = EL ⊕ P18
13: Combine EL and ER
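A control-flow sketch of Algorithm 5 in Python is given below; getK and R are placeholders standing in for the key-selection and round functions of the modified Blowfish, whose internals are only outlined in the surrounding text, so the bodies shown here are assumptions used to make the sketch executable rather than the algorithm actually proposed.

```python
# Control-flow sketch of Algorithm 5 (Modified Blowfish encryption).
# getK() and R() are placeholders: the thesis only outlines that an identifier
# is generated and passed to the round function, so the bodies below are
# assumptions made purely to keep the sketch self-contained.
MASK32 = 0xFFFFFFFF

def getK(EL):
    # Placeholder: derive an identifier from EL (assumed behaviour).
    return EL & 0x03

def R(EL, ID, S):
    # Placeholder Feistel-style round function; ID is accepted but not used
    # in this simplified stand-in.
    a, b, c, d = (EL >> 24) & 0xFF, (EL >> 16) & 0xFF, (EL >> 8) & 0xFF, EL & 0xFF
    return ((((S[0][a] + S[1][b]) & MASK32) ^ S[2][c]) + S[3][d]) & MASK32

def mbf_encrypt_block(EL, ER, P, S):
    for i in range(16):
        EL ^= P[i]              # step 5: EL = EL XOR Pi
        ID = getK(EL)           # step 6: obtain the identifier
        ER ^= R(EL, ID, S)      # step 7: ER = R(EL, ID) XOR ER
        EL, ER = ER, EL         # step 8: swap the halves
    EL, ER = ER, EL             # step 10: undo the last swap
    ER ^= P[16]                 # step 11: ER = ER XOR P17
    EL ^= P[17]                 # step 12: EL = EL XOR P18
    return EL, ER               # step 13: combine EL and ER
```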
• Disk Layer: Disk encryption is the lowest level of encryption,
ensuring data protection even if a disk is lost or stolen after an attack. This
technique does not allow fine-grained encryption of individual files or
directories because the entire disk is encrypted.
Moving down the stack, encryption becomes easier to implement, while moving
up the stack it becomes more secure. In this study, cryptographic methods are
implemented at two layers, namely the application and filesystem levels.
Encryption at the disk level may be vulnerable to OS-level or runtime attacks.
Adopting application- and filesystem-level encryption prevents such attacks,
since HDFS provides an extra layer above the native file system.
order. The NameNode is responsible for handling metadata for system
files and granting access to encrypted data.
The client uses MR code to process the encrypted data stored in HDFS.
As depicted in Figure 3.7, this MR task is submitted to the Resource
Manager (RM). By asking the NameNode, RM determines whether the
file exists in HDFS. If the relevant file’s metadata is located, NameNode
responds to the RM with it. RM launches the container with enough
resources to run the task on the DataNodes where the blocks are stored.
The task is replicated once for each block of the file, with each block processed
in parallel on different slave nodes.
In the process of decryption, the following steps are defined:
3. The Reducer phase receives the output of the Mapper phase as an
input.
4. To hack or alter a file protected by application-level encryption, an attacker
requires access to the HDFS contents as well as the programs and keys used to
encrypt the data.
3. Because various applications run on the same data that businesses own, these
applications require access to, and control over, the encrypted data.
safeguards on encrypted data, KMS delivers an appropriate access
control system to all of the firm’s users. This part focuses on combining
multiple cryptographic algorithms such as AES, RC6, BF, and Modified
Version of Blowfish, which were covered in section 3.4, to provide
security at the file system level.
3.7.1 TDE
1. Client users are the only ones who can encrypt and decrypt data.
specifies a unique key connected with it. This key is referred to as the EZ key,
which is stored in the KMS, separate from HDFS. When a client writes a
file to one of these encryption zones, the file is encrypted using a
unique encryption key known as the Data Encryption Key (DEK). Each DEK is
then encrypted with its encryption zone's EZ key to produce an Encrypted Data
Encryption Key (EDEK). Figure 3.8 displays the complete process of
encrypting and decrypting a key.
i. Creating an Encryption Zone
In the above process, KMS treats the client as the admin key user of
the EZ key.
• Commands Used For Creating EZ
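The specific commands used in this work are not reproduced here; as a hedged illustration, an encryption zone is typically created with the standard Hadoop key and hdfs crypto commands shown below (the key name ezkey1 and the path /secure_zone are placeholders), wrapped in Python for consistency with the other sketches.

```python
# Hedged illustration: typical Hadoop CLI steps for creating an encryption zone.
# The key name and the path are placeholders, not the values used in this work.
import subprocess

def run(cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["hadoop", "key", "create", "ezkey1"])                 # create the EZ key in the KMS
run(["hdfs", "dfs", "-mkdir", "/secure_zone"])             # directory that becomes the zone
run(["hdfs", "crypto", "-createZone",
     "-keyName", "ezkey1", "-path", "/secure_zone"])       # mark it as an encryption zone
run(["hdfs", "crypto", "-listZones"])                      # verify the zone was created
```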
1. To encrypt/decrypt a file, the client requests an EDEK for the EZ from the
NameNode.
2. The NameNode passes the request to the KMS.
3. The KMS checks the access privileges on the EZ and creates an EDEK.
4. The KMS responds to the NameNode with the EDEK.
5. The NameNode forwards the EDEK to the client.
6. The client now requests the KMS to decrypt the EDEK to get the DEK.
7. The KMS decrypts the EDEK after checking the ACL permissions and
responds to the client with the DEK.
8. The client now uses the DEK to encrypt the file and stores it in HDFS.
3.7.2 KMS Architecture
Hadoop KMS manages ACLs for many user roles, including key
administrators, HDFS superusers, HDFS service users, and end users.
Encryption zone keys are created and managed by a key administrator.
The HDFS superuser creates encryption zones but is not
permitted to decrypt data stored in these zones. The HDFS service user
can generate an EDEK for each encryption zone key. Finally, clients
classified as end users have access to encryption zones and may read and
write to them.
whitelist.key.acl class. A single key is represented by the class key.acl.
All keys for which no ACL has been set are covered by the
class default.key.acl. These key-specific ACL configurations are
evaluated based on two critical rules:
to access KMS-wide activities, the decision flow then moves to
key-specific authorisation. First, KMS checks the key-specific whitelist
operations, and access privileges are granted if the user is authorised there.
If the user is not found there, the evaluation continues. Second, KMS checks
the key.acl configuration and allows the operation if a matching entry is
found; if not, it evaluates default.key.acl and approves the request based on
the user's operations. If a user or group is not mentioned in the whitelist,
key-specific, or default configuration, they are blocked and their access
permissions are set to "Denied".
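The decision flow just described can be summarised in a short Python sketch; the three dictionaries below are illustrative stand-ins for the whitelist.key.acl, key.acl, and default.key.acl configuration classes, and the operation names are examples rather than an exhaustive list.

```python
# Illustrative sketch of the key-specific ACL decision flow described above.
# The three dictionaries stand in for whitelist.key.acl, key.acl and
# default.key.acl; they are not Hadoop's real configuration objects.
def is_allowed(user, key_name, operation, whitelist_acl, key_acl, default_acl):
    # 1. whitelist.key.acl: users listed here are always authorised
    if user in whitelist_acl.get(operation, set()):
        return True
    # 2. key.acl: if an ACL is configured for this key, it decides the outcome
    per_key = key_acl.get(key_name, {})
    if operation in per_key:
        return user in per_key[operation]
    # 3. default.key.acl: covers keys with no key-specific ACL
    if user in default_acl.get(operation, set()):
        return True
    # 4. not mentioned anywhere: access is denied
    return False

# Example: 'alice' manages ezkey1, 'bob' falls through to the default ACL
whitelist = {"DECRYPT_EEK": {"hdfs"}}
per_key = {"ezkey1": {"MANAGEMENT": {"alice"}}}
default = {"DECRYPT_EEK": {"bob"}}
print(is_allowed("bob", "ezkey1", "DECRYPT_EEK", whitelist, per_key, default))  # True
print(is_allowed("eve", "ezkey1", "MANAGEMENT", whitelist, per_key, default))   # False
```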
3.8 Summary
This chapter focuses on cryptographic techniques to address the
data-security issue at the HDFS storage level. Cryptographic techniques
are applied at two levels to achieve data security: the application level
and the filesystem level. The application level provides a secure means
of storing data in HDFS but lacks performance and increases
complexity when integrating with tools like Pig, Hive, Spark, Sqoop,
and Flume. By providing improved performance and the flexibility to
interface with diverse tools, a filesystem-level cryptosystem addresses
the shortcomings of application-level cryptosystems. In the next chapter,
the performance of both layers is discussed.
Chapter 4
4.1 INTRODUCTION
HDFS is used to store and manage enormous amounts of data. As a
result, any data platform hoping to break through into the business
mainstream must prioritise security and data governance. In recent
years, the Hadoop community (both the open-source and commercial
communities) has made tremendous progress in minimising the threats
to the Hadoop system, but more work is needed. The Hadoop community
must band together to address Hadoop's security vulnerabilities, as these
are the issues preventing many practitioners from implementing Hadoop
for production-grade workloads and mission-critical applications, and they
will continue to do so. The performance evaluation in this chapter considers
the following two models:
1. MR model
2. TDE model
C(t) = E(p(s)) / T(e(s))    (4.1)
Where,
C(t) = Total time consumed for encryption of the records in bytes
E(p(s)) = Encrypted plain text in bytes
T(e(s)) = Time consumed to encrypt each document
to make the system responsive and fast. Decryption time, on the other
hand, has an impact on the system’s performance. The decryption time
is measured in milliseconds in this experiment.
P(t) = D(c(s)) / T(d(s))    (4.2)
Where, P(t) is the total time taken for decryption of the records, D(c(s)) is the
decrypted ciphertext in bytes, and T(d(s)) is the time taken to decipher each
record.
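As a hedged illustration of how the quantities in equations (4.1) and (4.2) can be measured, the snippet below times a per-document encrypt/decrypt call; encrypt_fn and decrypt_fn are placeholders for whichever cipher (AES, RC6, BF, or MBF) is under test, and the trivial identity functions in the example exist only to make the sketch runnable.

```python
# Hedged illustration of measuring the quantities in equations (4.1) and (4.2).
# encrypt_fn/decrypt_fn are placeholders for the cipher under test.
import time

def measure(docs, encrypt_fn, decrypt_fn):
    enc_bytes = dec_bytes = enc_time = dec_time = 0.0
    for doc in docs:
        t0 = time.perf_counter()
        ct = encrypt_fn(doc)
        enc_time += time.perf_counter() - t0     # T(e(s)) summed over documents
        enc_bytes += len(ct)                     # E(p(s)) in bytes

        t0 = time.perf_counter()
        pt = decrypt_fn(ct)
        dec_time += time.perf_counter() - t0     # T(d(s)) summed over documents
        dec_bytes += len(pt)                     # D(c(s)) in bytes
    return enc_bytes / enc_time, dec_bytes / dec_time   # ratios of eqs (4.1), (4.2)

# Example with a trivial identity stand-in for the cipher, for illustration only
docs = [b"x" * 1024] * 100
print(measure(docs, lambda d: d, lambda c: c))
```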
• Model 1: MR model
These models are notable for providing security at the data level. The
following sections and subsections present the step-wise performance
evaluation of the suggested models.
computed on the large data size. The assumption states that processing
the data at its original location is far superior to relocating the data and
computing it where the programme operates. On the other hand, data
locality is achieved by the Map task, which, once completed, sends its
result to the Reduce task machine across the network. As a result, there’s
an opportunity to develop this method even further.
size, whereas known AES methods get 193.12. Furthermore, with a file
size of 194.65 MB, the proposed BF technique returns 265, whereas the
available AES methods return 270.13. After that, the BF approach
yields 374.18, whereas the existing AES methods create 378.6 for a
274.96 MB file size. The BF approach thus generates 497.81 for a
366.61 MB file size, while the known AES methods yield 494.49.
In comparison to the proposed method, the current system consumes
more resources. In addition, compared to the existing technique, the
suggested solution consumes less space for all file sizes. As a
result, it can be concluded that the new approach outperforms the existing
one. Figure 4.1 shows a graphical depiction of the space utilised.
The file encryption and decryption times for AES-encrypted HDFS and
BF-encrypted HDFS are compared in this section. Table 4.2 and Table
4.3 illustrate the comparison and indicate that AES encryption takes
less time to convert plain text to cipher text, or cipher text to plain text,
than the BF algorithm. This means that the AES method outperforms BF in
terms of the time it takes to encrypt or decrypt a file. Figure 4.2 and
Figure 4.3 depict this comparison graphically.
Table 4.2 shows the encryption time taken by the proposed BF and
the existing AES approach for various data quantities. According to
the table, the suggested BF algorithm encrypts data with an average file
size of 7.95 MB in an average of 10 seconds. The existing AES, on the
other hand, takes an average of 7 seconds to encrypt the same data.
Table 4.3 compares the decryption times of the proposed BF and the existing
AES for different data amounts. According to the
table, the suggested BF algorithm decrypts data with an average file size
Table 4.3: Map-Reduce Model computational time Analysis during decryption
Also, with a file size of 11.61 MB, the proposed BF approach takes
10 seconds, whereas the existing AES methods take 6 seconds. The BF
algorithm takes 13 seconds to process a 25.44 MB file, while
conventional AES methods take 6 seconds. Furthermore, the suggested
BF approach takes 25 seconds to process a 37.98 MB file, while
traditional AES methods take 10 seconds. The proposed BF approach
takes 39 seconds to process a 66.72 MB file, while the conventional
method takes 13 seconds. The suggested BF approach thus takes 41
seconds for a 127.35 MB file, while the known AES methods take 15
seconds.
Table 4.4 illustrates that, under the TDE model for the Hadoop framework,
the proposed RC6 algorithm outperforms the AES method in terms of space
consumption. Figure 4.4 depicts the space utilised graphically.
The TDE model space utilised by the proposed RC6 system and the
existing AES method is shown in Table 4.4.
Table 4.4: TDE Model Space Consumption Analysis: AES vs RC6
RC6 approach yields 53.63 MB for a 45.6 MB file size, while current
AES methods yield 54.75 MB.
Table 4.5 illustrates that, under the TDE model for the Hadoop framework,
the suggested RC6 technique delivers better outcomes than the AES
algorithm as the file size (and hence the computing time) increases.
The TDE model computation time utilising the proposed RC6 system
and the conventional AES technique is shown in Table 4.5. According to
the investigation, RC6 produces 0.27sec for a file size of 8 MB. The
existing AES technique consumes 0.27sec for the same amount of file
sizes. The proposed RC6 technique generates 0.37sec for 18 MB files,
whereas existing AES methods consume 0.37sec. The suggested RC6
technique generates 0.47sec for a 26 MB file size, whereas existing AES
methods produce 0.47sec. Furthermore, the proposed RC6 process
consumes 0.68sec for 45.6MB files, whereas current AES methods
consume 0.68sec.
Table 4.5: TDE Model Computational time Analysis: AES vs RC6
The proposed RC6 approach takes 1.15 seconds to process a 91.12
MB file, whereas the existing method takes 1.19 seconds. The suggested
RC6 technique thus takes 1.6 seconds for a 137.5 MB file, while
conventional AES methods take 1.7 seconds. Furthermore, for a 194
MB file size, the suggested RC6 technique takes 2.2 seconds, whereas
current AES methods take 2.2 seconds.
Also, the RC6 technique takes 3.1 seconds for a 275 MB file,
whereas conventional AES methods take 3.3 seconds. The suggested
RC6 approach thus takes 5.1 seconds for a 366.5 MB file, whereas
traditional AES methods take 6.3 seconds. Furthermore, with a 520.7
MB file size, the suggested RC6 technique takes 8.6 seconds, whereas
conventional AES methods take 10.4 seconds. Similarly, at the largest
file size of 840 MB, the proposed RC6 approach takes 16.06 seconds,
whereas the traditional method takes 26.8 seconds. As a result, the
current system consumes more computation time than the suggested technique. In
addition, compared to the existing approach for all File Size values, the
proposed method requires less time to compute. Figure 4.5 shows a
graphical depiction of the computing time of the AES and RC6
algorithms.
Table 4.6 shows that the BF algorithm uses less space than the AES and
RC6 techniques. The computing time needed for all of these techniques
is similar or substantially identical for smaller file sizes. As the file size
grows larger, BF's computation time for data encryption is
determined to be smaller than that of AES and RC6 methods, as shown
in Table 4.6.
Table 4.6: TDE Model Space Consumption Analysis: AES vs RC6 vs BF
According to Table 4.6, the AES and RC6 methods take 4.98
seconds and 4.76 seconds, respectively, to consume 8 MB of space after
encryption. It takes 8.87 seconds for the BF algorithm to encrypt a file
with a size of 8 megabytes. The AES, RC6, and BF algorithms take
21.85 seconds, 21.4 seconds, and 20.13 seconds, respectively, to encrypt
18 megabytes.
file. The encryption of a 194 MB file took 233.3 seconds, 228.5 seconds,
and 214.9 seconds for the AES, RC6, and BF algorithms, respectively.
The AES, RC6, and BF algorithms took 330.6 seconds, 323.8 seconds,
and 304.5 seconds, respectively, to encrypt a 275 MB file.
take 0.37 sec, 0.37 sec, and 0.37 sec, respectively, to encrypt 18 MB. The
AES, RC6, and BF algorithms took 0.47 sec, 0.47 sec, and 0.47 sec,
respectively, to encrypt a 26 MB file.
For the AES, RC6, and BF algorithms, the computation time for
a 45.6 MB file was 0.68 sec, 0.68 sec, and 0.68 sec, respectively. The
computation time for a 91.12 MB file was 1.19, 1.15,
and 1.18 seconds for the AES, RC6, and BF algorithms, respectively.
Furthermore, the AES, RC6, and BF algorithms required 1.7 seconds,
1.6 seconds, and 1.6 seconds, respectively, to encrypt a 137.5 MB file.
Then, for AES, RC6, and BF algorithms, the computation time for a 194
MB file size was 2.2 seconds, 2.2 seconds, and 2.2 seconds, respectively.
For the AES, RC6, and BF algorithms, the calculation time required for
encryption of a 275 MB file size was 3.3 seconds, 3.1 seconds, and 2.9
seconds, respectively.
For the AES, RC6, and BF algorithms, the computation time for a
366.8 MB file size is 6.3 seconds, 5.1 seconds, and 4.7 seconds,
respectively. The AES, RC6, and BF algorithms took 10.4 seconds, 8.6
seconds, and 7.7 seconds, respectively, to encrypt a 520.7 MB file.
Similarly, the AES, RC6, and BF algorithms encrypted the 840 MB file
in 26.8 seconds, 16.06 seconds, and 14.6 seconds, respectively. The
computing time needed for all of these techniques is similar or
substantially identical for smaller file sizes. As the file size grows larger,
BF’s computation time for data encryption is determined to be smaller
than that of AES and RC6 methods, as shown in Figure 4.7.
Table 4.8 shows that the MBF method takes up less space than the AES,
RC6, and BF algorithms.
The TDE model space occupied by the proposed MBF system compared
with the existing AES, RC6, and BF algorithms is shown in Table 4.8.
According to the research, MBF creates an 8.57 MB file for an 8 MB file
size. The existing AES, RC6, and BF techniques generate 9.63, 9.43, and
8.87 MB for the same file size. The suggested MBF approach
generates 19.45 MB for an 18 MB file size, while the existing AES, RC6, and
BF methods yield 21.85, 21.4, and 20.13 MB. The suggested MBF technique
yields 28.33 MB for a 26 MB file size, while the existing AES, RC6, and BF
methods yield 31.82, 31.16, and 29.32 MB.
Table 4.8: TDE Model Space Consumption Analysis: AES vs RC6 vs BF vs MBF
Plain Text (MB) AES (MB) RC6 (MB) BF (MB) MBF (MB)
Figure 4.8: TDE Model Space Consumption Analysis: AES vs RC6 vs BF vs MBF
4.8.2 Analysis of Computational time: AES vs RC6 vs BF vs MBF
Table 4.9: TDE Model Computational time Analysis: AES vs RC6 vs BF vs MBF
Figure 4.9: TDE Model Computational time Analysis: AES vs RC6 vs BF vs MBF
RC6, BF methods yield 625.7, 612.84, 576.4, and 557.13. Similarly, for
the largest file size of 840 MB, the suggested MBF technique takes 15.4
seconds, while the existing AES, RC6, BF method takes 26.8 seconds,
16.06 seconds, and 14.6 seconds. In comparison to the proposed
technique, the existing systems consume more computation time. In addition,
the suggested method consumes less time compared to the existing
approaches for all file sizes. As a result, when compared to
previous efforts, it can be concluded that the suggested strategy performs
well. Figure 4.9 shows a graphical depiction of the computation time consumed.
∈TP = TP / Tt    (4.3)
The percentage increase in file size after encryption (x) can be obtained
by dividing the encrypted text size (xc) by the plain text size (yf).
x = (xc / yf) × 100    (4.4)
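A small Python sketch of equations (4.3) and (4.4) is given below; taking TP as the total plaintext processed and Tt as the total encryption time is an assumption about those symbols, and the example values are taken from the MBF figures quoted above purely for illustration.

```python
# Hedged sketch of equations (4.3) and (4.4).
def encryption_throughput(total_plaintext_mb, total_time_sec):
    # eq. (4.3): throughput = total plaintext processed / total encryption time
    return total_plaintext_mb / total_time_sec

def file_size_percentage(encrypted_mb, plaintext_mb):
    # eq. (4.4): x = (xc / yf) * 100, i.e. the encrypted size as a
    # percentage of the original plain-text size
    return encrypted_mb / plaintext_mb * 100

# Illustration with the MBF figures quoted above (840 MB in 15.4 s; 19.45 MB for 18 MB)
print(encryption_throughput(840.0, 15.4))   # MB per second
print(file_size_percentage(19.45, 18.0))    # percent of the original size
```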
than the proposed MBF technique. Figure 4.10 shows a graphical
depiction of the proposed MBF based on encryption time and encryption
throughput.
Table 4.10: TDE Model Encryption Throughput Analysis: AES vs RC6 vs BF vs MBF
Table 4.10 clearly shows that when compared to the AES, RC6, and
BF approaches, the suggested MBF achieves a higher throughput value.
Figure 4.10 shows a graphical depiction of the proposed MBF method
based on encryption throughput.
The percentage increase in file size when the AES, RC6, BF, and MBF algorithms
are applied is shown in Table 4.11. Compared to the built-in AES
method, these approaches lower the size of the encrypted file. Figure
4.11 graphically depicts the percentage increase in file size after encryption
for the AES, RC6, BF, and MBF algorithms.
Table 4.11: TDE Model % Increase in File Size: AES vs RC6 vs BF vs MBF
Figure 4.11: TDE Model % Increase in File Size: AES vs RC6 vs BF vs MBF
Chapter 5
The work presented in this thesis intends to create a unique approach for
implementing an efficient and scalable security model for the Hadoop
framework in the cloud. Different symmetric cryptographic methods
are combined in the proposed research work.
5.1.1 Conclusions
TDE model execution is divided into three phases: RC6 algorithm
implementation (Phase I), Blowfish algorithm implementation (Phase II),
and Modified Blowfish method implementation (Phase III).
5.1.2 Future Scope
REFERENCES
[1] J. Howard et al. “Scale and Performance in a Distributed File System”. In: ACM
SIGOPS Oper. Syst. Rev. 21.5 (1987), pp. 1–2. DOI: 10.1145/37499.37500.
URL: https://doi.org/10.1145/37499.37500.
[2] R. Cattell. “Scalable SQL and NoSQL Data Stores”. In: ACM SIGMOD Rec.
39.4 (2011), pp. 12–27. DOI: 10.1145/1978915.1978919. URL: https://
doi.org/10.1145/1978915.1978919.
[3] Data Flair Team: History of Hadoop – The Complete Evolution of the Hadoop
Ecosystem. URL:
https://data-flair.training/blogs/hadoop-history/.
[4] I. A. T. Hashem et al. “The rise of ”big data” on cloud computing: Review
and open research issues”. In: Information Systems 47 (2015), pp. 98–115. DOI:
https://doi.org/10.1016/j.is.2014.07.006. URL: https://www.
sciencedirect.com/science/article/pii/S0306437914001288.
[5] M. M Shetty and D. H. Manjaiah. “Data security in Hadoop distributed file
system”. In: 2016 International Conference on Emerging Technological Trends
(ICETT). 2016, pp. 1–5. DOI: 10.1109/ICETT.2016.7873697.
[6] G. S. Bhathal and A. Singh. “Big Data: Hadoop framework vulnerabilities,
security issues and attacks”. In: Array 1-2 (2019), p. 100002. DOI:
https : / / doi . org / 10 . 1016 / j . array . 2019 . 100002. URL: https :
//www.sciencedirect.com/science/article/pii/S2590005619300025.
[7] B. Saraladevi et al. “Big Data and Hadoop-a Study in Security Perspective”. In:
Procedia Computer Science 50 (2015), pp. 596–601. DOI: https://doi.org/
10.1016/j.procs.2015.04.091. URL: https://www.sciencedirect.
com/science/article/pii/S187705091500592X.
[8] H. Lu, Chen H.-S., and Hu T.-T. “Research on Hadoop Cloud Computing Model
and its Applications”. In: 2012 Third International Conference on Networking
and Distributed Computing. 2012, pp. 59–63. DOI: 10.1109/ICNDC.2012.22.
[9] Arun Murthy et al. Apache Hadoop YARN: Moving beyond MapReduce and
Batch Processing with Apache Hadoop 2. 1st. Addison-Wesley Data and
Analytics, 2014.
[10] Tom White. Hadoop – The Definitive Guide. 4th. O’Reilly, 2015.
[11] S. Parikh et al. “Security and Privacy Issues in Cloud, Fog and Edge
Computing”. In: Procedia Computer Science 160 (2019). The 10th
International Conference on Emerging Ubiquitous Systems and Pervasive
Networks (EUSPN-2019) / The 9th International Conference on Current and
Future Trends of Information and Communication Technologies in Healthcare
(ICTH-2019), pp. 734–739. DOI :
https : / / doi . org / 10 . 1016 / j . procs . 2019 . 11 . 018. URL: https :
//www.sciencedirect.com/science/article/pii/S1877050919317181.
[12] H. H. Song. “Testing and Evaluation System for Cloud Computing Information
Security Products”. In: Procedia Computer Science 166 (2020). Proceedings of
the 3rd International Conference on Mechatronics and Intelligent Robotics
(ICMIR-2019), pp. 84–87. DOI :
https : / / doi . org / 10 . 1016 / j . procs . 2020 . 02 . 023. URL: https :
//www.sciencedirect.com/science/article/pii/S1877050920301459.
[13] L. Savu. “Cloud Computing: Deployment Models, Delivery Models, Risks and
Research Challenges”. In: 2011 International Conference on Computer and
Management (CAMAN). 2011, pp. 1–4. DOI: 10.1109/CAMAN.2011.5778816.
[14] H. Li et al. “Towards smart card based mutual authentication schemes in cloud
computing”. In: KSII Transactions on Internet and Information Systems 9.7
(July 2015). URL:
http://itiis.org/digital-library/manuscript/1072.
[15] S. Goyal. “Public vs Private vs Hybrid vs Community - Cloud Computing: A
Critical Review”. In: International Journal of Computer Network and
Information Security 6 (2014), pp. 20–29.
[16] S. Bhardwaj, L. Jain, and S. Jain. “Cloud Computing: A Study of Infrastructure
AS A Service (IAAS)”. In: International Journal of Engineering and
Information Technology 2.1 (Jan. 2010), pp. 60–63.
[17] G. Kulkarni, P. Khatawkar, and J. Gambhir. “Cloud Computing-Platform as
Service”. In: International Journal of Engineering and Advanced Technology
(IJEAT) 1.2 (Dec. 2011), pp. 115–120.
[18] S. Daneshyar. “Large-Scale Data Processing Using MapReduce in Cloud
Computing Environment”. In: International Journal on Web Service
Computing 3.4 (Dec. 2012), pp. 1–13. DOI: 10.5121/ijwsc.2012.3401.
[19] H. Lee et al. “Implementation of MapReduce-based image conversion module
in cloud computing environment”. In: The International Conference on
Information Network. 2012, pp. 234–238. DOI :
10.1109/ICOIN.2012.6164383.
[20] CyberPedia: An overview of DoS attacks. URL :
https://www.paloaltonetworks.com/cyberpedia/what-is-a-denial-
of-service-attack-dos.
[21] Apache Knox: Apache Knox Gateway 1.3.x User’s Guide. https : / / knox .
apache.org/books/knox- 1- 3- 0/user- guide.html. [Online; accessed
9-July-2021].
[22] Walker Rowe. Introduction to Hadoop Security. July 2016. URL : https : / /
www.bmc.com/blogs/hadoop-security/.
[23] Apache Software Foundation: Apache Ranger. http://ranger.apache.org/
index.html. [Online; accessed 9-July-2021].
[24] T. Allen. Project Rhino: Building a Layered Defense for Apache Hadoop.
https : / / itpeernetwork . intel . com / project - rhino - building - a -
layered - defense - for - apache - hadoop / gs . 4te3dc. [Online; accessed
9-July-2021].
[25] Bhushan Lakhe. Practical hadoop security. Apress, 2014.
[26] Kerberos: The Network Authentication Protocol. http : / / web . mit . edu /
KERBEROS/. [Online; accessed 9-July-2021].
[27] I. Mohiuddin et al. “Secure distributed adaptive bin packing algorithm for
cloud storage”. In: Future Generation Computer Systems 90 (Aug. 2018),
pp. 307–316. DOI: 10.1016/j.future.2018.08.013.
[28] J. Rong, R. Lu, and K-K. R. Choo. “Achieving high performance and
privacy-preserving query over encrypted multidimensional big metering data”.
In: Future Generation Computer Systems 78 (2018), pp. 392–401. DOI:
https : / / doi . org / 10 . 1016 / j . future . 2016 . 05 . 005. URL: https :
//www.sciencedirect.com/science/article/pii/S0167739X16301157.
[29] Gunasekaran M. et al. “A new architecture of Internet of Things and big data
ecosystem for secured smart healthcare monitoring and alerting system”. In:
Future Generation Computer Systems 82 (2018), pp. 375–387. DOI: https :
/ / doi . org / 10 . 1016 / j . future . 2017 . 10 . 045. URL: https : / / www .
sciencedirect.com/science/article/pii/S0167739X17305149.
[30] G. S. Sadasivam, K. A. Kumari, and S. Rubika. “A Novel Authentication
Service for Hadoop in Cloud Environment”. In: 2012 IEEE International
Conference on Cloud Computing in Emerging Markets (CCEM). 2012,
pp. 1–6. DOI: 10.1109/CCEM.2012.6354591.
[31] S. H. Park and I. R. Jeong. “A Study on Security Improvement in Hadoop
Distributed File System Based on Kerberos”. In: J. Korea Inst. Inf. Secur.
Cryptol., vol. 23, no. 5. 2013, pp. 415–463.
[32] K. Zheng and W. Jiang. “A token authentication solution for hadoop based on
kerberos pre-authentication”. In: 2014 International Conference on Data
Science and Advanced Analytics (DSAA). 2014, pp. 354–360. DOI:
10.1109/DSAA.2014.7058096.
[33] P. K. Rahul and T. GireeshKumar. “A Novel Authentication Framework for
Hadoop”. In: Artificial Intelligence and Evolutionary Algorithms in
Engineering Systems. Ed. by L. Padma Suresh, Subhransu Sekhar Dash, and
Bijaya Ketan Panigrahi. New Delhi: Springer India, 2015, pp. 333–340.
[34] N. Somu, A. Gangaa, and V. S. S. Sriram. “Authentication Service in Hadoop
using One Time Pad”. In: Indian Journal of Science and Technology 7.4 (2020),
pp. 56–62.
[35] H. Zhou and Q. Wen. “A new solution of data security accessing for Hadoop
based on CP-ABE”. In: 2014 IEEE 5th International Conference on Software
Engineering and Service Science. 2014, pp. 525–528. DOI: 10.1109/ICSESS.
2014.6933621.
[36] M. Sarvabhatla, M. Reddy, and C. Vorugunti. “A Secure and Light Weight
Authentication Service in Hadoop using One Time Pad”. In: Procedia
Computer Science 50 (Dec. 2015), pp. 81–86. DOI :
10.1016/j.procs.2015.04.064.
[37] Y.-S. Jeong and Y.-T. Kim. “A token-based authentication security scheme for
Hadoop distributed file system using elliptic curve cryptography”. In: J.
Comput. Virol. Hacking Tech. 11.3 (2015), pp. 137–142.
[38] Y.-S. Jeong, S.-S. Shin, and K.-H. Han. “High-dimentional data authentication
protocol based on hash chain for Hadoop systems”. In: Cluster Computing 19
(Mar. 2016), pp. 475–484. DOI: 10.1007/s10586-015-0508-y.
[39] I. Khalil, Z. Dou, and A. Khreishah. “TPM-Based Authentication Mechanism
for Apache Hadoop”. In: vol. 152. Sept. 2015, pp. 105–122. DOI: 10.1007/
978-3-319-23829-6_8.
[40] Y.-A. Jung, S.-J. Woo, and S.-S. Yeo. “A Study on Hash Chain-Based Hadoop
Security Scheme”. In: 2015 IEEE 12th Intl Conf on Ubiquitous Intelligence and
Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing
and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications
and Its Associated Workshops (UIC-ATC-ScalCom). 2015, pp. 1831–1835. DOI:
10.1109/UIC-ATC-ScalCom-CBDCom-IoP.2015.332.
[41] Z. Dou et al. “Robust Insider Attacks Countermeasure for Hadoop: Design and
Implementation”. In: IEEE Systems Journal 12.2 (2018), pp. 1874–1885. DOI:
10.1109/JSYST.2017.2669908.
[42] Y. MEI. “Using the HashChain to Improve the Security of the Hadoop”. In:
Proceedings of the 3rd Annual International Conference on Electronics,
Electrical Engineering and Information Science (EEEIS 2017). Atlantis Press,
2017/09, pp. 554–558. DOI :
https : / / doi . org / 10 . 2991 / eeeis - 17 . 2017 . 82. URL:
https://doi.org/10.2991/eeeis-17.2017.82.
[43] M. Hena and N. Jeyanthi. “Authentication Framework for Kerberos Enabled
Hadoop Clusters”. In: International Journal of Engineering and Advanced
Technology 9.1 (2019), pp. 510–519.
[44] W. Wei et al. “SecureMR: A Service Integrity Assurance Framework for
MapReduce”. In: 2009 Annual Computer Security Applications Conference.
2009, pp. 73–82. DOI: 10.1109/ACSAC.2009.17.
[45] J. H. Majors. “Secdoop: A Confidentiality Service on Hadoop Clusters”, Auburn
University, Master Thesis. 2011.
[46] H.-Y. Lin et al. “Toward Data Confidentiality via Integrating Hybrid
Encryption Schemes and Hadoop Distributed File System”. In: 2012 IEEE
26th International Conference on Advanced Information Networking and
Applications. 2012, pp. 740–747. DOI: 10.1109/AINA.2012.28.
[47] S. Park and Y. Lee. “Secure Hadoop with Encrypted HDFS”. In: Grid and
Pervasive Computing. Vol. 7861. Springer Berlin Heidelberg, 2013,
pp. 134–141. DOI: doi.org/10.1007/978-3-642-38027-3_14.
[48] C. Zhonghan et al. “Design and Implementation of Data Encryption in Cloud
based on HDFS”. In: Proceedings of the The 1st International Workshop on
Cloud Computing and Information Security. Atlantis Press, 2013, pp. 274–277.
DOI : https://doi.org/10.2991/ccis-13.2013.64.
[49] Q. Quan et al. “A model of cloud data secure storage based on HDFS”. In:
2013 IEEE/ACIS 12th International Conference on Computer and Information
Science (ICIS). 2013, pp. 173–178. DOI: 10.1109/ICIS.2013.6607836.
[50] C. Yang, W. Lin, and M. Liu. “A Novel Triple Encryption Scheme for
Hadoop-Based Cloud Data Security”. In: 2013 Fourth International
Conference on Emerging Intelligent Data and Web Technologies. 2013,
pp. 437–442. DOI: 10.1109/EIDWT.2013.80.
[51] X. Yu, P. Ning, and M. A. Vouk. “Enhancing security of Hadoop in a public
cloud”. In: 2015 6th International Conference on Information and
Communication Systems (ICICS). 2015, pp. 38–43. DOI :
10.1109/IACS.2015.7103198.
[52] D. Shehzad et al. “A Novel Hybrid Encryption Scheme to Ensure Hadoop
Based Cloud Data Security”. In: International Journal of Computer Science
and Information Security 1947-5500 14.4 (May 2016), pp. 480–484.
[53] A. Jayan and B. R. Upadhyay. “RC4 in Hadoop security using MapReduce”.
In: 2017 International Conference on Computational Intelligence in Data
Science(ICCIDS). 2017, pp. 1–5. DOI: 10.1109/ICCIDS.2017.8272637.
[54] Y. Song et al. “Design and implementation of HDFS data encryption scheme
using ARIA algorithm on Hadoop”. In: 2017 IEEE International Conference on
Big Data and Smart Computing (BigComp). 2017, pp. 84–90. DOI: 10.1109/
BIGCOMP.2017.7881720.
[55] H. Mahmoud, A. Hegazy, and M. H. Khafagy. “An approach for big data
security based on Hadoop distributed file system”. In: 2018 International
Conference on Innovative Trends in Computer Engineering (ITCE). 2018,
pp. 109–114. DOI: 10.1109/ITCE.2018.8316608.
[56] Y. Xu et al. “Design and implementation of distributed RSA algorithm based
on Hadoop”. In: Journal of Ambient Intelligence and Humanized Computing 11
(2020), pp. 1047–1053. DOI: 10.1007/s12652-018-1021-y.
[57] P. Johri, S. Arora, and M. Kumar. “Privacy Preserve Hadoop (PPH)—An
Implementation of BIG DATA Security by Hadoop with Encrypted HDFS”. In:
Information and Communication Technology for Sustainable Development.
Vol. 10. Springer Singapore, 2018, pp. 339–346. DOI :
https://doi.org/10.1007/978-981-10-3920-1_35.
[58] T. S. Algaradi and B. Rama. “A Novel Blowfish Based-Algorithm To Improve
Encryption Performance In Hadoop Using Mapreduce”. In: International
Journal of Scientific & Technology Research 8.11 (2019), pp. 2074–2081.
[59] A. D. Yu and Sun. Sentry Tutorial. https :
//cwiki.apache.org/confluence/display/SENTRY/sentry+tutorial.
[Online; accessed 9-July-2021].
[60] eCryptfs. URL: https://wiki.archlinux.org/title/ECryptfs.
[61] Xun Xu. “From cloud computing to cloud manufacturing”. In: Robotics and
Computer-Integrated Manufacturing 28.1 (2012), pp. 75–86. ISSN: 0736-5845.
DOI : https://doi.org/10.1016/j.rcim.2011.07.002. URL : https:
//www.sciencedirect.com/science/article/pii/S0736584511000949.
[62] Z. Xiao and Y. Xiao. “Security and Privacy in Cloud Computing”. In: IEEE
Communications Surveys Tutorials 15.2 (2013), pp. 843–859. DOI: 10.1109/
SURV.2012.060912.00182.
[63] A. A. Yassin et al. “Anonymous Password Authentication Scheme by Using
Digital Signature and Fingerprint in Cloud Computing”. In: 2012 Second
International Conference on Cloud and Green Computing. 2012, pp. 282–289.
DOI : 10.1109/CGC.2012.91.
[64] N. Gajra, S. S. Khan, and P. Rane. “Private cloud security: Secured user
authentication by using enhanced hybrid algorithm”. In: 2014 International
Conference on Advances in Communication and Computing Technologies
(ICACACT 2014). 2014, pp. 1–6. DOI: 10.1109/EIC.2015.7230712.
[65] A. S. Tomar et al. “Enhanced Image Based Authentication with Secure Key
Exchange Mechanism Using ECC in Cloud”. In: Security in Computing and
Communications. Vol. 625. Springer Singapore, 2016, pp. 63–73. DOI:
10.1007/978-981-10-2738-3_6.
[66] R. Dangi and S. Pawar. “An Improved Authentication and Data Security
Approach Over Cloud Environment”. In: Harmony Search and Nature Inspired
Optimization Algorithms. Vol. 741. Springer Singapore, 2019, pp. 1069–1076.
DOI : 10.1007/978-981-13-0761-4_100.
[70] Yongzhi Wang et al. “IntegrityMR: Integrity assurance framework for big data
analytics and management applications”. In: 2013 IEEE International
Conference on Big Data. 2013, pp. 33–40. DOI :
10.1109/BigData.2013.6691780.
[71] R. Saxena and S. Dey. “Cloud Audit: A Data Integrity Verification Approach
for Cloud Computing”. In: Procedia Computer Science 89 (2016). Twelfth
International Conference on Communication Networks, ICCN 2016, August
19-21, 2016, Bangalore, India; Twelfth International Conference on Data
Mining and Warehousing, ICDMW 2016, August 19-21, 2016, Bangalore,
India; Twelfth International Conference on Image and Signal Processing, ICISP
2016, August 19-21, 2016, Bangalore, India, pp. 142–151. ISSN: 1877-0509.
DOI: https://doi.org/10.1016/j.procs.2016.06.024. URL: https:
//www.sciencedirect.com/science/article/pii/S1877050916310894.
[72] Y. B. Idris et al. “Enhancement data integrity checking using combination md5
and sha1 algorithm in hadoop architecture”. In: Journal of Computer Science
& Computational Mathematics 7.3 (2017), pp. 99–102. DOI: 10.20967/
jcscm.2017.03.007.
[73] R. Sumithra and S. Paul. “Incorporating security and integrity into the mining
process of hybrid weighted-hasht apriori algorithm using Hadoop”. In:
International Journal of Data Science 3.3 (2018), pp. 266–287. DOI:
10.1504/ijds.2018.094506.
[74] A. Undheim, A. Chilwan, and P. Heegaard. “Differentiated Availability in Cloud
Computing SLAs”. In: 2011 IEEE/ACM 12th International Conference on Grid
Computing. 2011, pp. 129–136. DOI: 10.1109/Grid.2011.25.
[75] D.-W. Sun et al. “Modeling a Dynamic Data Replication Strategy to Increase
System Availability in Cloud Computing Environments”. In: Journal of
Computer Science and Technology 27.2 (2012), pp. 256–272. DOI:
10.1007/s11390-012-1221-4.
[76] C.-T. Yang et al. “On Improvement of Cloud Virtual Machine Availability with
Virtualization Fault Tolerance Mechanism”. In: 2011 IEEE Third International
Conference on Cloud Computing Technology and Science. 2011, pp. 122–129.
DOI : 10.1109/CloudCom.2011.26.
[78] J. Li et al. “Fine-Grained Data Access Control Systems with User
Accountability in Cloud Computing”. In: 2010 IEEE Second International
Conference on Cloud Computing Technology and Science. 2010, pp. 89–96.
DOI : 10.1109/CloudCom.2010.44.
[87] F. F. Moghaddam et al. “A client-based user authentication and encryption
algorithm for secure accessing to cloud servers based on modified
Diffie-Hellman and RSA small-e”. In: 2013 IEEE Student Conference on
Research and Developement. 2013, pp. 175–180. DOI :
10.1109/SCOReD.2013.7002566.
[88] V. S. Mahalle and A. K. Shahade. “Enhancing the data security in Cloud by
implementing hybrid (RSA & AES) encryption algorithm”. In: 2014
International Conference on Power, Automation and Communication (INPAC).
2014, pp. 146–149. DOI: 10.1109/INPAC.2014.6981152.
[89] N. Khanezaei and Z. M. Hanapi. “A framework based on RSA and AES
encryption algorithms for cloud computing services”. In: 2014 IEEE
Conference on Systems, Process and Control (ICSPC 2014). 2014, pp. 58–62.
DOI : 10.1109/SPC.2014.7086230.
[96] T. Kapse. “Hybrid Security Model for Secure Communication in Cloud
Environment”. In: Journal of the Gujarat Research Society 21.15 (2019),
pp. 179–185. ISSN: 0374-8588.
[97] P. Jindal and B. Singh. “Analyzing the security-performance tradeoff in block
ciphers”. In: International Conference on Computing, Communication
Automation. 2015, pp. 326–331. DOI: 10.1109/CCAA.2015.7148425.
[98] V. Poonia and N. S. Yadav. “Analysis of modified Blowfish algorithm in
different cases with various parameters”. In: 2015 International Conference on
Advanced Computing and Communication Systems. 2015, pp. 1–5. DOI:
10.1109/ICACCS.2015.7324114.
[99] P. Patil et al. “A Comprehensive Evaluation of Cryptographic Algorithms: DES,
3DES, AES, RSA and Blowfish”. In: Procedia Computer Science 78 (2016). 1st
International Conference on Information Security Privacy 2015, pp. 617–624.
ISSN : 1877-0509. DOI : https://doi.org/10.1016/j.procs.2016.02.
108. URL: https://www.sciencedirect.com/science/article/pii/
S1877050916001101.
[100] R. Patel and P. Kamboj. “Security Enhancement of Blowfish Block Cipher”. In:
vol. 628. 2016, pp. 231–238. ISBN: 978-981-10-3432-9. DOI: 10.1007/978-
981-10-3433-6_28.
[101] S. Contini et al. “The RC6 Block Cipher”. In: (Mar. 2000), pp. 1–21.
[102] Dinesh. Adams Methods. URL: http : / / www . cs . unc . edu / ~dm / UNC /
COMP205/LECTURES/DIFF/lec19/node2.html#eq.
LIST OF PUBLICATIONS
1. K. Vishal Reddy, Jayantrao Patil, and Ratnadeep R. Deshmukh.
”Security Issues in Hadoop Framework: A Review”, International
Journal of Science, Engineering and Management (IJSEM). 3(2),
(2018), pp. 193–198. (Impact Factor 2.7).