
SECURITY ISSUES ASSOCIATED WITH BIG DATA

1. K. Sekhar Sai Viswam  2. Mrs. S. R. Srividhya

1. Student, Department of CSE, BIST, BIHER, Chennai.

2. Assistant Professor, Department of CSE, BIST, BIHER, Chennai.

1. viswakoppa@gmail.com  2. vidhyasrinivasan1890@gmail.com

Abstract

Data is currently one of the most important assets for companies in every field. The continuous growth in the importance and volume of data has created a new problem: it cannot be handled by traditional analysis techniques. This problem was, therefore, solved through the creation of a new paradigm: Big Data. However, Big Data has given rise to new issues related not only to the volume or the variety of the data, but also to data security and privacy. In order to obtain a full perspective of the problem, we decided to carry out an investigation with the objective of highlighting the main issues regarding Big Data security, and also the solutions proposed by the scientific community to solve them. In this paper, we explain the results obtained after applying a systematic mapping study to security in the Big Data ecosystem. It is almost impossible to carry out detailed research into the entire topic of security, and the outcome of this research is, therefore, a big picture of the main problems related to security in a Big Data system, along with the principal solutions to them proposed by the research community.

Introduction

Over the last few years, data have become one of the most important assets for companies in almost every field. Not only are

they important for companies related to the computer science industry, but also for organisations such as countries’ governments, healthcare, education, or the engineering sector. Data are essential with respect to carrying out their daily activities, and also to helping the businesses’ management to achieve their goals and make the best decisions on the basis of the information extracted from them. It is estimated that of all the data in recorded human history, 90 percent has been created in the last few years. In 2003, five exabytes of data were created by humans, and this amount of information is, at present, created within two days.

This tendency towards increasing the volume and detail of the data that is collected by companies will not change in the near future, as the rise of social networks, multimedia, and the Internet of Things (IoT) is producing an overwhelming flow of data. We are living in the era of Big Data. Furthermore, this data is mostly unstructured, signifying that traditional systems are not capable of analysing it. Organisations are willing to extract more beneficial information from this high volume and variety of data. A new analysis paradigm with which to analyse and better understand this data, therefore, emerged in order to obtain not only private, but also public, benefits, and this was Big Data.

Each new disruptive technology brings new issues with it. In the case of Big Data, these issues are related not only to the volume or the variety of data, but also to data quality, data privacy, and data security. This paper will focus on the subjects of Big Data privacy and security. Big Data not only increases the scale of the challenges related to privacy and security as they are addressed in traditional security management, but also creates new ones that need to be approached in a new way.

As more data is stored and analysed by organisations or governments, more regulations are needed to address these concerns. Security in Big Data has, therefore, become one of the most important barriers that could slow down the spread of the technology; without adequate security guarantees, Big Data will not achieve the required level of trust. Big Data brings big responsibility.

According to the Big Data Working Group at the Cloud Security Alliance, there are, principally, four different aspects of Big Data security: infrastructure security, data privacy, data management, and integrity and reactive security. This division of Big Data security into four principal topics has also been used by the International Organisation for Standardisation in order to create a security standard for Big Data. Figure 1 contains a scheme showing the main topics related to security in Big Data.

Figure 1. Main challenges as regards security in Big Data.

The purposes of this paper are to highlight the main security challenges that may affect Big Data, along with the solutions that researchers have proposed in order to deal with them. This big picture of the security problem may help other researchers to better understand the security changes produced by the inherent characteristics of the Big Data framework and, consequently, find new research lines so as to carry out more in-depth investigations. This goal has been accomplished by carrying out an empirical investigation by means of the systematic mapping study method with the aim of obtaining a complete background to the security problem as regards Big Data and the proposed solutions.

This paper is consequently structured as follows: first we provide a brief introduction to the subject of Big Data, after which we describe the systematic mapping study process. We then go on to analyse the results obtained, and additionally discuss them. Finally, we present a section concerning our conclusions.

Big Data Basis

The term Big Data refers to a framework that allows the analysis and management of a larger amount of data than traditional data processing technologies can handle. Big Data entails a change from the traditional techniques in three different ways: the amount of data (volume), the rate of data generation and transmission (velocity), and the types of structured and unstructured data (variety). These properties are known as the three basic V’s of Big Data. Many authors have added new characteristics to the initial group, such as variability, veracity, or value. One of the most important parts of the Big Data world is the use of brand new technologies in order to extract valuable information from data and the ability to combine data from different sources and different formats. Big Data has also changed the way in which organisations store data, and has allowed them to develop a more thorough and in-depth understanding of their business, which implies a great benefit.

Data was traditionally stored in a structured format, such as a relational database, in order to make it easier to process and to understand that data. There is now, however, a new tendency: storing the current data volumes in unstructured or semi-structured form. The current maturity of technologies such as Cloud Computing or ubiquitous network connectivity provides a platform on which to easily collect the data, store it, or process it. This set of characteristics has allowed the rapid spread of Big Data techniques. Furthermore, not only can big organisations afford Big Data, but small companies can also obtain benefits from the use of this Big Data ecosystem. A scheme of the typical Big Data ecosystem is shown in Figure 2.
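
The shift from structured to semi-structured storage described in this section can be made concrete with a small sketch. The example below is only an illustration under stated assumptions: the table, the field names, and the sample records are hypothetical and do not come from any system discussed in this paper. It contrasts a fixed relational schema with a handful of JSON records whose fields differ from one record to the next.

```python
import json
import sqlite3

# Structured storage: every row must fit one fixed schema.
# (Table and column names are hypothetical, chosen for illustration.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ana', 'Chennai')")
structured_rows = conn.execute("SELECT name, city FROM customers").fetchall()

# Semi-structured storage: each record carries its own fields,
# so records from different sources can sit side by side.
raw_events = [
    '{"user": "ana", "action": "login"}',
    '{"user": "raj", "action": "upload", "file": {"name": "a.png", "bytes": 2048}}',
    '{"sensor": "t-17", "reading": 21.5}',
]
events = [json.loads(line) for line in raw_events]

# No schema is declared up front; the set of fields is discovered from the data.
fields_seen = sorted({key for event in events for key in event})

print(structured_rows)
print(fields_seen)
```

The relational table rejects anything that does not match its columns, whereas the JSON records accept new or nested fields without any change to the storage layer, which is one reason why traditional analysis tools struggle with this kind of data.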

The success of Big Data technology can be explained by the release of a type of software: Apache Hadoop. Hadoop is a framework developed by Apache that allows the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to be scalable from a single server to thousands of them, each of which offers computation and local storage. Hadoop can be considered as a de facto standard. The input for the Hadoop framework is the data that feed the Big Data system. As explained previously, these data usually originate from very different sources and formats. Hadoop has its own distributed file system (HDFS), which stores the data in different servers with different functions, such as the NameNode, which is used to store the metadata, or the DataNodes, which store the application data. The principal characteristic of Hadoop is, however, that of being an open-source implementation of MapReduce. MapReduce is a programming model that is particularly focused on processing and generating large datasets. The MapReduce paradigm accomplishes this goal by describing two different functions: the map function, which processes a key/value pair in order to create a set of intermediate key/value pairs, and the reduce function, which processes the intermediate values generated and merges them to produce a solution.

Apart from the MapReduce framework, there are several different projects that comprise the Hadoop ecosystem, such as Hive, Pig, Sqoop, Mahout, ZooKeeper, Spark, or HBase. Each of these tools, along with hundreds of others, provides a range of possibilities that allow organisations to obtain value from their data. Big Data environments do not traditionally prioritise security, and it was for this reason that we decided to carry out research in order to discover the main security challenges with respect to Big Data, along with the solutions, methods, or techniques proposed by researchers so as to achieve security in Big Data systems.

Systematic Mapping Study

In order to obtain a big picture of the security problem in the Big Data field, we decided to

carry out an empirical investigation based on previous literature. We, therefore, resolved to adapt the systematic mapping study method. Mapping studies basically use the same methodology as systematic literature reviews, but their main objective is to identify and classify all the research related to a broad software engineering topic, rather than answering a more specific question. This method has four stages: the research questions, the research method, the case selection and case study roles and procedures and, finally, data analysis and interpretation.

Research Method

The research method employed was that of carrying out an automatic search of various online libraries: ACM, SCOPUS, and the IEEE Digital Library. The reason for selecting these libraries rather than others was that they contain a great amount of literature related to our main objective and that they would facilitate a specific search. In order to achieve a proper outcome, we created a research string:

(“Big Data” OR BigData OR Hadoop) AND (Secur* OR Confidentiality OR Integrity OR Availability OR Privacy)

The first part of the research string is related to Big Data technology and we, therefore, included the main implementation of Big Data: Hadoop. Hadoop can be considered as a de facto standard. The second part of the string is related to the traditional security dimensions: confidentiality, integrity, and availability. We also decided to include the privacy dimension, even though it can be considered as personal confidentiality. This decision was related to the larger impact that the privacy dimension seems to have with respect to the subject of Big Data in comparison with confidentiality. This perception is also confirmed by the classification made by the CSA.

Previous Works

Before starting our research, we decided to carry out a small search for studies in order to discover whether there were any papers that had already dealt with the subject of our investigation. This goal was accomplished by searching the same online libraries for papers reviewing security in Big Data between the years 2004 and 2015. We found some reviews focused on more specific topics, such as privacy

in social networks, but were unable to find any whose objective was to obtain a high-level picture of security in Big Data.

Case Selection and Case Study Roles and Procedures

The aforementioned research string was then run in the selected online libraries in order to search for the words it contained in the titles, keywords, abstracts, and whole texts of papers. This resulted in about 2300 papers. Once we had obtained the papers that conformed to our research question, it was necessary to make a case selection so as to attain only those that best fitted our main research aim. In order for us to retain those that were genuinely related to our research, the selected papers had to fulfil a number of criteria, such as having been published between 2004 (the year in which the MapReduce programming paradigm was released by Google) and 2015. We additionally considered only those papers published in journals, conferences, congresses, or important workshops. We first selected those cases that were truly related to our purpose and made a classification of them. This classification was carried out by focusing our research on the title and abstract of each paper, although it was, in some cases, necessary to read the full paper. We then carried out a new search of the selected literature in order to avoid the inclusion of any duplicates and improve the quality of the research. It is important to highlight that it is possible for a case to belong to different categories, e.g., privacy and integrity. We eventually obtained over 500 papers that adjusted to our parameters and would be useful for our research.

Infrastructure Security

When discussing infrastructure security, it is necessary to highlight the main technologies and frameworks found as regards securing the architecture of a Big Data system, and particularly those based on the Hadoop technology, since it is the one most frequently used. In this section we shall also discuss certain other topics, such as communication security in Big Data, or how to achieve high availability. Figure 3 contains a graphic that shows the main topics found and

the quantity of papers dealing with each specific topic.

Figure 3. Main topics regarding infrastructure security.

Hadoop can be considered as a de facto standard for implementing a Big Data environment in a company. The security problems related to this technology have, therefore, been widely discussed by researchers, who have also proposed various methods with which to improve the security of the Hadoop system. This category is probably the most transverse since, in order to protect it, the solutions use different security mechanisms such as authentication or cryptography. For example, there is a proposal for a security model for G-Hadoop (an extension of the MapReduce framework to run on multiple clusters) that simplifies users’ authentication and some security mechanisms in order to protect the system from traditional attacks. A few papers focus on protecting the data that is stored in the HDFS by proposing a new schema, a secure access system, or even the creation of an encryption scheme.

Availability

Researchers have also dealt with the subject of availability in Big Data systems. One of the main characteristics of Big Data environments, and by extension of a Hadoop implementation, is the availability attained by the use of hundreds of computers in which the data are not only stored, but are also replicated across the cluster. Finding an architecture that will ensure the full availability of the system is, therefore, a priority. For instance, one proposal achieves high availability by having multiple active NameNodes at the same time. Other solutions are based on creating a new infrastructure for the storage system so as to improve availability and fault tolerance.

Architecture Security

Another different approach is that of

describing a new Big Data architecture, or modifying the typical one, in order to improve the security of the environment. One proposal describes a new architecture based on the Hadoop file system which, when combined with network coding and multi-node reading, makes it possible to improve the security of the system. Another solution focuses on secure group communications in large-scale networks managed by Big Data systems, and this is achieved by creating certain protocols and changing the infrastructure of the nodes.

Authentication

The value of the data obtained after executing a Big Data process can, to a great extent, be determined by its authenticity. A few papers deal with this problem by proposing solutions related to authentication. One of them suggests solving the problem of authentication by creating an identity-based signcryption scheme for Big Data.

Communication Security

The security of communications between different parts of the Big Data ecosystem is a topic that is often ignored, and only a small number of papers therefore deal with this problem. One paper approaches the topic by explaining the regular data life cycle in a Big Data system, following the different network protocols and applications that the data pass through. The authors also enumerate the main data transfer security techniques.

Summary

With regard to the topic of infrastructure security, the main problem dealt with by researchers would appear to be security for Hadoop systems. This is not surprising since, as stated previously, Hadoop can be considered as a de facto standard in industry. The remaining problems addressed in this topic are usually solved by modifying the usual scheme of a Big Data system through the addition of new security layers.

Data Privacy

Data privacy is probably the topic about which ordinary people are most concerned, but it should also be one of the greatest concerns for the organisations that use Big Data techniques. A Big Data system

Figure 4. Main topics on data privacy.

Cryptography

The most frequently employed solution as regards securing data privacy in a Big Data system is cryptography. Cryptography has been used to protect data for a considerable amount of time. This tendency continues in the case of Big Data, but Big Data has a few inherent characteristics that make the direct application of traditional cryptography techniques impossible. One example of the use of cryptography is a proposal for a bitmap encryption scheme that guarantees users’ privacy. Other authors’ research is focused on how to process data that is already encrypted. One paper, for example, explains a technique with which to analyse and programme transformations with Pig Latin in the case of encrypted data.

Access Control

Access control is one of the basic traditional techniques used to achieve the security of a system. Its main objective is to restrict undesirable users’ access to the system. In the case of Big Data, the access control problem is related to the fact that there are only basic forms of access control. In order to solve this problem, some authors propose a framework that supports the integration of access control features. Other researchers focus their attention on the MapReduce process itself, and suggest a framework with which to enforce the security policies at the key-value level.

Confidentiality

Although privacy is traditionally treated as a part of confidentiality, we decided to change the order

this problem is not an easy task, and some authors suggest new legislation with which to increase the protection of data privacy. Another paper, meanwhile, proposes a technique that can be used to increase the control that users have over their own data in social networks.

Differential Privacy

The objective of differential privacy is to provide a method with which to maximise the value of analysis of a set of data while minimising the chances of identifying users’ identities. A few papers focus on achieving privacy in Big Data by applying differential privacy techniques. For example, one approach attempts to distort the data by adding noise.

Summary

The topic most frequently dealt with by researchers would appear to be privacy. There are a lot of different perspectives as regards ensuring privacy. Authors usually propose different means of encryption, based on traditional techniques but with a few changes in order to adapt these techniques to the inherent characteristics of a Big Data environment. Owing to the large number of papers found on this topic in comparison to the others, we believe that it is advisable to split this category up into, on the one hand, data privacy itself, and on the other, cryptography and access control techniques.

Data Management

This section focuses on what to do once the data is contained in the Big Data environment. It not only shows how to secure the data that is stored in the Big Data system, but also how to share that data. We shall also discuss the different policies and legislation that authors suggest in order to use Big Data techniques safely. Figure 5 contains a graphic that shows the topics that will be discussed in this section, along with the quantity of papers found for each specific topic.

Figure 5. Main topics on data management.
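
Before turning to the individual data management topics, the noise-adding idea mentioned under Differential Privacy in the preceding section can be illustrated with a short sketch. It is not the method of any specific paper surveyed here: it assumes the standard Laplace mechanism applied to a simple counting query, and the function names, the sample ages, and the chosen epsilon are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the Laplace scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data set: ages of eight users.
ages = [23, 35, 41, 29, 52, 61, 38, 47]
rng = random.Random(0)

# The analyst sees only the noisy answer, never the exact count.
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(noisy)
```

A smaller epsilon produces more noise and therefore stronger privacy at the cost of accuracy, which is the trade-off between analytical value and the risk of identification that differential privacy is meant to control.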
Recovery

The main purpose of this topic is to create particular policies or controls in order to ensure that the system recovers as soon as possible when a disaster occurs. Many organisations currently store their data in Big Data systems, signifying that if a disaster occurs the entire company could be in danger. We have found only a few papers that cover this problem. For example, one work offers some recommendations regarding what can be done to recover from a desperate situation.

Summary

In this section, the main topic discussed by researchers would appear to be the integrity of data. In order to secure that integrity, they propose various kinds of verification to ensure that the data has not been modified. This section also covers the possibility of detecting the attacks that a Big Data system may undergo. There is a lack of papers dealing with the possibility of a disaster occurring in a Big Data system, and how to recover from it. This is probably a consequence of the high availability that a Big Data system usually achieves, but this topic should not be overlooked.

Analysis of Results

Analysing such a large number of results is not an easy task, and in order to meticulously interpret them we shall, therefore, answer the research questions in the specified order. The first question was “What are the main challenges and problems as regards security in Big Data?” This was easily answered, because during the first phase of research we found that a few documents have been produced by the Cloud Security Alliance and by the National Institute of Standards and Technology that approach the topic of security in Big Data and highlight the main problems and challenges that concern this technology. These results allowed us to guide the remainder of our research. Figure 7 shows a pie chart with the number of papers grouped by the different categories. However, according to the results of our research, and
owing to the quantity of papers found that are related to this problem, we believed that it might be useful to split the privacy category into two different categories: privacy itself, and access control and cryptography.

The second and third questions were “What are the main security dimensions on which researchers are focusing their efforts?” and “What techniques, methodologies, and models with which to achieve security in Big Data exist?” These questions form the main body of our research, and, in order to simplify the visualisation of the results, we have, therefore, created Table 2, which connects the main topics found with the typical security dimensions. The last column shows those papers that are not clearly related to any of these dimensions or deal with security in general, without specifying more. The table distinguishes, on the one hand, those papers that express the problem with respect to one specific topic, and on the other, those that deal with the problem and propose a solution. Many papers are, of course, located in both columns. For example, we have found forty-three papers related to integrity (thirteen express the problem and thirty-seven propose a solution, but there are a few that do both things at the same time).

The table shows that the main problems are principally related to infrastructure problems and privacy issues. In the first case, researchers focus their attention on creating new Big Data architectures that deal with the problems of availability and privacy. Privacy challenges are, meanwhile, by far the most frequently discussed topic. Many authors deal with this topic by explaining the new privacy problems that have arisen as the result of the use of Big Data technology, while others attempt to solve the issue by applying variations of traditional techniques that have been adjusted to the inherent characteristics of Big Data. The other two main topics found are researched to a far lesser extent than those mentioned above. While integrity is well covered, there are only a few papers dealing with the

problem of recovering the system in the case of a total failure. Furthermore, we discovered that the topic of data management is not dealt with as frequently as it should be. For example, we have not found any security governance frameworks that would make it possible to manage the security of a Big Data system throughout its entire life cycle. We believe that this is crucial if the correct deployment of Big Data technology is to be achieved. With regard to the quantity of studies found per year, we have detected that the number of papers written by researchers increased continuously until the year 2015. This may be a consequence of the maturity reached by Big Data technology. Figure 8 contains a graph with the evolution in the quantity of papers found during the considered period.

Conclusions

This paper provides an explanation of the research carried out in order to discover the main problems and challenges related to security in Big Data, and how researchers are dealing with these problems. This objective was achieved by following the systematic mapping study methodology, which allowed us to find the papers related to our main goal. Having done so, we discovered that the principal problems are related to the inherent characteristics of a Big Data system, and also to the fact that security issues were not contemplated when Big Data was initially conceived. Many authors, therefore, focus their research on creating means to protect data, particularly with respect to privacy, but privacy is not the only security problem that can be found in a Big Data system; the traditional architecture itself and how to protect a Hadoop system are also a huge concern for researchers. We have, however, also detected a lack of investigations in the field of data management, especially with respect to governance. We are of the considered opinion that this is

not acceptable, since having a security governance framework will allow the rapid spread of Big Data technology. In conclusion, Big Data technology seems to be reaching a mature stage, and that is the reason why a number of studies have been produced in the last year. However, that does not mean that it is no longer necessary to study this paradigm; in fact, the studies produced from now on should focus on more specific problems. Furthermore, Big Data can be useful as a base for the development of the future technologies that will change the world as we see it, like the Internet of Things (IoT) or on-demand services, and that is the reason why Big Data is, after all, the future.

References

1. Mayer-Schönberger, V.; Cukier, K. Big Data: A Revolution that Will Transform How We Live, Work, and Think; Houghton Mifflin Harcourt: Boston, MA, USA, 2013.
2. Sagiroglu, S.; Sinanc, D. Big data: A review. In Proceedings of the 2013 International Conference on Collaboration Technologies and Systems (CTS), San Diego, CA, USA, 20–24 May 2013; pp. 42–47.
3. Hashem, I.A.T.; Yaqoob, I.; Anuar, N.B.; Mokhtar, S.; Gani, A.; Ullah Khan, S. The rise of “big data” on cloud computing: Review and open research issues. Inf. Syst. 2015, 47, 98–115.
4. Sharma, S. Rise of Big Data and related issues. In Proceedings of the 2015 Annual IEEE India Conference (INDICON), New Delhi, India, 17–20 December 2015; pp. 1–6.
5. Eynon, R. The rise of Big Data: What does it mean for education, technology, and media research? Learn. Media Technol. 2013, 38, 237–240.
