Professional Documents
Culture Documents
01-13-Big Data Classification Problems and Challenges in Network Intrusion Prediction With Machine Learning
01-13-Big Data Classification Problems and Challenges in Network Intrusion Prediction With Machine Learning
Velocity
Need-based
request Additional
Need-based data storage
data transfer
Cloud Computing
Storage System
Figure 2: Proposed definition (C3) of Big Data characteristics The NTRS unit helps to capture network traffic and streams the
traffic data to HDFS unit or CCSS unit in real-time based on the
need of an additional storage. The HDFS system will also use
Compared to defining a metric to measure the Big Data Hive database to store the data. The UILS unit can learn and
characteristics in V3 space, it is much easier to develop a metric in control the additional storage and data requirements.
C3 space using mathematical and statistical tools. In C3 space the
cardinality defines the number of records in the dynamically
growing dataset at a particular instance. The continuity defines 3.2 Communication Challenges
two characteristics and they are: (i) representation of data by In computer networking research and applications, the
continuous functions, and (ii) continuously growth of data size communication cost is the major concern compared to the
with respect to time. The complexity defines three characteristics processing cost of the data – the same in this suggested topology.
and they are: (i) large varieties of data types, (ii) high dimensional The challenge here is to minimize that communication cost while
dataset; and (iii) the speed of data processing is very high. satisfying the additional storage and data requirement from public
cloud for processing Big Data. As stated in [3] and [4], the
bandwidth and latency are the two major network features that
3. MANAGEMENT OF BIG DATA will affect the communication between the clients and the cloud
The cardinality parameter demands the need for an efficient server. These problems and associated challenges to find solutions
distributed file system for data capture, storage and analysis of will adversely affect the timing requirements of the Big Data
network traffic for intrusion prediction. Then the continuity and processing at HDFS and UILS. Hence the machine learning
complexity parameters add extra difficulties to the task of techniques, using these technologies, must be developed by
managing the Big Data. Hence the network topology must be keeping these problems and challenges in mind.
3.3 Security Challenges while preserving the original characteristics of the data, to another
The security mechanism in cloud technology is generally weak. domain so that the classification algorithms can improve accuracy,
Hence tampering of data at the public cloud is inevitable and it is reduce computational complexity and increase processing speed.
a big concern. Finding a robust security mechanism for the However the Big Data classification requires multi-domain,
purpose of using the public cloud like CCSS is a challenging representation-learning (MDRL) technique because of its large
problem. As stated in [5], in cloud technology, an attacker can and growing data domain. The MDRL technique includes feature
easily tamper the data that is being exchanged between the CCSS variable learning, feature extraction learning and distance-metric
server and the HDFS and NTRS units; the attacker can spoof the learning. Several representation-learning techniques have been
reply between them and shut down the server (CCSS) using DOS proposed in machine learning research, but the recently proposed
attack. These problems can lead to challenges in implementing cross-domain, representation-learning (CDRL) technique by Tu
Big Data analytics tool with the suggested network topology. and Sun [14] maybe suitable for the Big Data classification along
with the suggested network model. The implementation of the
CDRL technique to Big Data classification will encounter several
challenges, including the difficulty in selecting relevant features,
4. LEARNING OF BIG DATA constructing geometric representation, extracting suitable features
The complexity parameter consists of many other influential and separating the various types of data. Recently the concept of
data characteristics, which are high dimensional dataset, a large unit-circle algorithm (UCA) [15] has been proposed. It represents
number of data types (classes), high speed in which the data the intrusion traffic data by unit circles and assigns many related
should be processed and unstructured data. The complexity records to fewer unit-circles. This property can help the Big Data
parameter and its resulting problems have to be addressed using classification to work effectively. The public NSL-KDD dataset
machine learning techniques. However the challenge still rests on [16] was used in this study. The challenge now is to adopt this
improving the current learning techniques to deal with the Big new algorithm to work with the modern network technology.
Data classification problems and requirements.
[2] T. White, Hadoop: The definitive guide. O'Reilly Media, [19] D. L. Silver, “Machine lifelong learning: challenges and
2012. benefits for artificial general intelligence,” Lecture Notes in
Computer Science, vol. 6830, 370-375, Springer, 2011.
[3] S. Carlin and K. Curran. "Cloud Computing Technologies."
International Journal of Cloud Computing and Services
Science (IJ-CLOSER) 1.2: 59-65, 2012.
[4] P. C. Wong, H. W. Shen, C. R. Johnson, C. Chen, and R. B.
Ross, “The Top 10 Challenges in Extreme-Scale Visual
Analytics,” Computer Graphics and Applications, IEEE,
32(4), 63-67, 2012.
[5] I. Muttik and C. Barton. "Cloud security technologies,"
information security technical report 14.1: 1-6, 2009.