
2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD 2017)

Analyst Intuition Based Hidden Markov Model On High Speed, Temporal Cyber Security Big Data

T.T. Teoh1a, Y.Y. Nguwi2, Yuval Elovici1b, N.M. Cheung1c, W.L. Ng3

1 Centre for Research in Cyber Security, Singapore University of Technology and Design, Singapore.
2 School of Business (IT), James Cook University, Singapore. yokyen.nguwi@jcu.edu.au
3 ST Electronics (Info-Security) Pte Ltd, Singapore. wailoong@stee.stengg.com
1a teiktoe_teoh@sutd.edu.sg
1b yuval_elovici@sutd.edu.sg
1c ngaiman_cheung@sutd.edu.sg

Abstract: Hidden Markov Models (HMMs) are probabilistic models that can be used for forecasting time series data. They have seen success in various domains such as finance [1-5], bioinformatics [6-8], healthcare [9-11], agriculture [12-14] and artificial intelligence [15-17]. However, few uses of HMMs in cyber security have been reported to date. We believe that the properties of an HMM (it is predictive and probabilistic, and it can model different naturally occurring states) form a good basis for modelling cyber security data. This work is therefore motivated to provide the initial results of our attempts to predict security attacks using an HMM. A large network dataset representing cyber security attacks was used in this work to establish an expert system. The characteristics of attackers' IP addresses can be extracted from our integrated datasets to generate statistical data. The cyber security expert provides the weight of each attribute and forms a scoring system by annotating the log history. We applied an HMM to distinguish between attack, unsure and no attack by first breaking the data into 3 clusters using Fuzzy k-means (FKM), then manually labelling a small subset of the data (Analyst Intuition) and finally using the HMM state-based approach. Our results are very encouraging compared with searching for anomalies in a cyber security log, which generally produces a large number of false detections.

Keywords- Hidden Markov Model (HMM), Cyber security, Network Protocols, Virus, Big Data, High Velocity, Analyst Intuition, Principal Component Analysis (PCA), Expectation Regulated, Fuzzy k-means (FKM), Multi-layer Perceptron (MLP)

I. INTRODUCTION

The landscape of cyber security attacks evolves at an incredible speed. The volume of information flowing in and out of an organisation is enormous, and detecting anomalies in this information is a real challenge. The painstaking process of recovery is often due to the late discovery of such instances. These factors and consequences have led to the rise of big data for intrusion detection and prevention. Big data is a general term that has been widely used to describe the avalanche of information being consolidated into a dataset. The sheer volume of big data presents a great challenge when we attempt to study the patterns or associations within the data. Advances in handling big data enable many industrial problems and challenges to be addressed. Industries and companies are now able to understand and process volumes of data that were once beyond their reach. While many domains have benefited from big data technologies, cyber security is one field that is just beginning to explore the use of big data analytics. The ability to detect and deter cyber-attacks can make or break the functional success of an enterprise [18]. Using big data, organizations may be able to rigorously detect threats, create better defence mechanisms and improve security.

The objective of this research is to build an efficient expert system that draws on the expertise of cyber security experts and allows them to input suitable weights for different attributes. The cyber security expert also contributes to the scoring system based on the words in the log files. We then adopt the Fuzzy k-Means (FKM) algorithm to create clusters of attackers and non-attackers in order to segregate the attack-related traffic from the network datasets.

Our Analyst Intuition approach is inspired by Kalyan [19] and Chang [20]. Kalyan [19] used a semi-supervised approach for huge, unbalanced and unlabelled data. The approach started by labelling a sample of the data, training the system on it, and then testing against the remaining data. Likewise, Chang et al. [20] use a similar method known as "Expectation Regulated Neural Network" for DDOS attacks.

In this study, we collected 3 days of data comprising 36 million log files and amounting to 36 Gigabytes. The data was provided by Singapore Technology Engineering. The total number of Malware instances is about 60. We apply data mining techniques to study the statistical data obtained from the integrated datasets. These analytics help in separating attack-related traffic from normal traffic as well as in extracting attack patterns. The Fuzzy k-Means (FKM) clustering algorithm was applied to create attacker and non-attacker clusters on the time-related and connection-related data obtained from the integrated datasets.

Several models were generated by changing the key parameters. The testing step was repeated several times to determine the accuracy and efficiency of the results. The results obtained from the algorithms were validated against each other to verify the attack-related traffic.

The FKM algorithm created three clusters in total: (i) cluster-1 consists of no attackers, (ii) cluster-2 consists of an uncertain number of attackers, and (iii) cluster-3 consists of 364 non-attackers. One of the issues in cyber security is that different network security systems and tools generate log files in different formats, which complicates consolidation. This research demonstrates the integration and analysis of datasets for identifying attack-related traffic, which can potentially lead to easier threat detection in cases where attacks occur on multiple platforms.

978-1-5386-2165-3/17/$31.00 ©2017 IEEE
Authorized licensed use limited to: Birla Inst of Technology and Science Pilani Dubai. Downloaded on November 02,2023 at 17:01:22 UTC from IEEE Xplore. Restrictions apply.

II. METHODOLOGY AND RESULTS


In this research, we adopt a semi-supervised method of classifying cyber security logs into attack, unsure and no attack by first breaking the data into 3 clusters using Fuzzy k-means (FKM), then manually labelling a subset of the data and finally training a multi-layer perceptron (MLP) neural network classifier on the manually labelled data. Our results are very encouraging compared with searching for anomalies in a cyber security log, which generally produces a large number of false detections. The method of combining Artificial Intelligence (AI) and Analyst Intuition (AI) is also known as AI2 [19, 20]. The FKM clustering algorithm was applied to create attacker and non-attacker clusters on the time-related and connection-related data obtained from the integrated datasets. Our model is illustrated in Figure 1. The model starts by extracting data from ArcSight through Graylog, where ArcSight collects data from McAfee, Checkpoint and other applications. We extracted 1147 records from the 3.6 million logs for training and testing, based on the computer IP address 10.67.25.69. The data is split into 3 clusters based on the k-means algorithm: no attack, unsure and attack. We then train a Multi-Layer Perceptron neural network using 2/3 of the data. The remaining 1/3 of the data is used for testing. We then arrange the data as a sequence of 1147 and use the HMM to train and test on the sequential data. A Hidden Markov Model (HMM) differs slightly from a Markov model. In a Markov model, the states are directly visible. In an HMM, the states are not directly visible; only the outputs are, so the states are "hidden" from observers.

Figure 1 The model that detects anomaly from big data

The original data is not labelled. Words in the log files are given certain weights, and scores are assigned from those weights. We use Excel to visualise the data and manually label the data into 3 classes along with the clusters: attack, unsure and no attack. From there, we train our model and use the model for classification. The classification system takes in the expert's view to provide weightages according to the types of attacks, as shown in Table I.

TABLE I: WEIGHTS FOR DIFFERENT ATTACKS
Types of Attacks    Weights
Soft1026            0.8
Trojan              0.7
Malware             0.5
Worm                0.4
Virus               0.3
Forced_Off          0.5
Failed_Login        0.4
Severity            0.5
Very_High           0.9
High                0.7
Medium              0.5

In this study, we collected 3 days of data comprising 36 million log files and amounting to 36 Gigabytes. The total number of Malware instances is about 60. Network traffic data arrives at very high speed, with more than 1000 log files generated every second, so we use batch processing instead of real-time processing. We extract 864 log files, of which 500 contain no attack information. The types of attacks and the number of cases are summarized in Table II.

TABLE II: NUMBER OF CASES FOR EACH TYPE OF ATTACK
Types of Attacks    Number of Cases
Phishing            1
Malware             28
Virus               19
Soft1026            4
Trojan              29
Passing off         112
Login failed        91,000
Very High           150,000
High                200,000
Medium              104,000
Low                 36,000,000
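The keyword-and-weight scoring described above can be sketched as follows. This is a minimal illustration only: the weights are copied from Table I, but the function name `score_log_line`, the substring matching and the rule for combining keyword and severity scores (taking the larger of the two) are assumptions, since the paper does not spell out how the individual scores are combined.

```python
# Weights copied from Table I. Everything else in this sketch is hypothetical.
ATTACK_WEIGHTS = {
    "soft1026": 0.8,
    "trojan": 0.7,
    "malware": 0.5,
    "worm": 0.4,
    "virus": 0.3,
    "forced_off": 0.5,
    "failed_login": 0.4,
}
SEVERITY_WEIGHT = 0.5  # weight of the "Severity" attribute in Table I
SEVERITY_LEVELS = {"very_high": 0.9, "high": 0.7, "medium": 0.5}


def score_log_line(line, severity=None):
    """Return (score, word_found) for a single log line.

    word_found is 1 when any attack keyword appears in the line, else 0.
    The severity score is attribute weight times level value, following the
    worked example in Algorithm 1. Combining the two with max() is an
    assumption made for this sketch.
    """
    text = line.lower()
    matched = [w for keyword, w in ATTACK_WEIGHTS.items() if keyword in text]
    word_found = 1 if matched else 0
    keyword_score = max(matched) if matched else 0.0
    severity_score = SEVERITY_WEIGHT * SEVERITY_LEVELS.get(severity, 0.0)
    return max(keyword_score, severity_score), word_found


if __name__ == "__main__":
    print(score_log_line("Trojan detected on host", "very_high"))
    print(score_log_line("user logged in", "medium"))
```

Keeping the weights in plain dictionaries mirrors the role of the analyst in the paper: an expert can adjust a weight or add a keyword without touching the scoring logic.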
We apply data mining techniques to study the statistical data obtained from the integrated datasets, which contain attacks such as Malware, Trojan, Passing off, Soft1026 and Virus. The expert system allows cyber security experts to enter their inputs to form the scores. These analytics help in separating attack-related traffic from normal traffic as well as in extracting attack patterns. The Fuzzy k-Means (FKM) clustering algorithm was applied to create attacker and non-attacker clusters on the time-related and connection-related data obtained from the integrated datasets. The clustering algorithm forms 3 clusters: Strong, Average and Mild. The distance measure used is K-means. Prior to clustering, fuzzification was performed. We extract a sample of 864 records for processing. The system first looks for keywords such as worm and malware in the data and marks the feature as 1 when a keyword is encountered. Expert weightage is then applied to form the score. Algorithm 1 outlines the process of expert labelling.

Figure 2 outlines the system flow. We hope to use this model to prioritize the identified events among the large log files. A large number of events are captured by ArcSight, and some events are missed by security analysts because of the volume involved. The proposed model will learn from the attacks identified by security analysts and propose new high priority events. The diagram (Figure 2) shows that the raw data from any application, including firewalls, intrusion detection systems (IDS) and anti-virus software with the common event format, is fed into ArcSight. In the meantime, we intend to design a database that captures a live feed from ArcSight. Our algorithm will then take the data from ArcSight as well as from the database and produce new high priority events on top of the existing ArcSight ranking. The system will then combine our new rankings with the ArcSight data and produce a consolidated result.

Figure 2 Proposed model to prioritize the identified event

For series events, the trained model (see Figure 3) adopts a 3-state HMM derived from 5-state observations. The three output states are attack, no attack and unsure. The observation states are separated into 5 ranges of scoring: (I) 0 to 0.2, (II) 0.21 to 0.4, (III) 0.41 to 0.6, (IV) 0.61 to 0.8, (V) 0.81 and above. The raw input is in the form of strings. Each second of log consists of 1147 string inputs, which are then converted into a vector for HMM processing. The training stage attempts to calculate the emission and transition probabilities based on the Viterbi algorithm. The Viterbi algorithm finds Viterbi paths based on naïve Bayesian calculation. The resulting model is very generic because all the parameters are adjustable, including the observation states and the length of the string. We use an HMM in this context because of its state-based property and because it caters for highly imbalanced data. Further, the data in log files is sequential in nature, which suits the states of an HMM very well. The HMM also provides fast training and computation because it is a memoryless model.

Figure 3 The model that detects anomaly from big data

Figure 4 The results of the HMM model indicate 3 major attacks, circled in red, that occurred across the log files.

Algorithm 1: HMM Scoring System with Expert Labelling or Analyst Intuition
Given Log String=[] ;
Repeat the following until Log String is empty:
1. Sort the 36 million logs by source IP address.
2. Extract the problematic computer, which in this case is 10.67.25.69.
3. Look for words that contain certain keywords such as Malware; when one is found, set word_found=1.
4. For every log, assign a weight to the attribute and a score to the attribute. For example, if the severity weight is 0.8 and very high is 0.9, then the score will be 0.72.
5. Visualise the data using Excel (see Figure 3).
6. Cluster the data into 3 clusters using FKM.
7. Label the output as attack, unsure or no attack. Labelling rules:
    if score[i-1] > 0.7 and word_found[i-1] == 1:
        label[i-1] = 1      # attack
    elif score[i-1] > 0.15 and word_found[i-1] == 0:
        label[i-1] = 0.5    # unsure
    else:
        label[i-1] = 0      # no attack
8. Train the MLP neural network model and test the trained classifier against the test data.
9. Arrange the data as a sequence of 1147 instead of on a per-log basis, and split the sequence into 80% training and 20% testing.
10. HMM observation: break the score into 5 ranges: 0 to 0.2, 0.21 to 0.4, 0.41 to 0.6, 0.61 to 0.8, and 0.81 and above.
11. Train using Viterbi (Bayesian based) to calculate the HMM emission.
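To make the HMM stage more concrete, the sketch below maps scores to the five observation ranges and decodes the most likely sequence of the three states with the Viterbi algorithm. It is an illustration under stated assumptions, not the authors' implementation: the start, transition and emission probabilities are hypothetical hand-picked values (the paper does not report its trained parameters), parameter estimation is not shown, and the names `to_observation` and `viterbi` are introduced here for illustration.

```python
import math

STATES = ["no attack", "unsure", "attack"]


def to_observation(score):
    """Map a score to one of the five observation ranges (index 0 to 4)."""
    if score <= 0.2:
        return 0
    if score <= 0.4:
        return 1
    if score <= 0.6:
        return 2
    if score <= 0.8:
        return 3
    return 4


# Hypothetical parameters, chosen only so that the example runs.
START = [0.80, 0.15, 0.05]
TRANS = [
    [0.90, 0.08, 0.02],  # from "no attack"
    [0.30, 0.50, 0.20],  # from "unsure"
    [0.10, 0.20, 0.70],  # from "attack"
]
EMIT = [
    [0.70, 0.20, 0.07, 0.02, 0.01],  # "no attack" favours low scores
    [0.10, 0.30, 0.40, 0.15, 0.05],  # "unsure" favours mid scores
    [0.01, 0.04, 0.15, 0.30, 0.50],  # "attack" favours high scores
]


def viterbi(obs):
    """Return the most likely state sequence for a list of observation indices."""
    if not obs:
        return []
    n = len(STATES)
    # Log-probabilities avoid numerical underflow on long sequences.
    logp = [math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev = logp
        step, logp = [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + math.log(TRANS[r][s]))
            step.append(best)
            logp.append(prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o]))
        back.append(step)
    # Trace the best path backwards from the most likely final state.
    state = max(range(n), key=lambda s: logp[s])
    path = [state]
    for step in reversed(back):
        state = step[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]


if __name__ == "__main__":
    scores = [0.05, 0.10, 0.85, 0.90, 0.95, 0.10]
    print(viterbi([to_observation(x) for x in scores]))
```

Because each step depends only on the previous state, decoding is linear in the sequence length, which is consistent with the short computation time the paper attributes to the memoryless model.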

Figure 4 shows a visualization of the HMM results. The data records the log activity over 3 days across the 1146 instances. During the capture period, three major attacks were detected by the HMM, as circled in red. This outcome is in line with, and validated by, the cyber security analyst's manual interpretation.

III. CONCLUSION

This research presented a blend of the Fuzzy k-means (FKM) approach with an HMM. We first cluster the raw data into 3 clusters (attacks, unsure and non-attacks), then use the state-based HMM approach to observe the state transitions in the data. The major challenge in interpreting cyber security data has been the large volume of data, which often results in major attacks within the data being missed. We conducted an experiment based on 36 million log files and successfully detected the 3 major attacks in a short computation time of 0.13 seconds. We aim to extend this work to incorporate deep learning for a better hit rate on larger sets of data.

REFERENCES
[1] Giampieri, G., M. Davis, and M. Crowder, Analysis of default data using hidden Markov models. Quantitative Finance, 2005. 5(1): p. 27-34.
[2] Lu, S.-L., A Hidden Markov Chain Model with Applications for Assessing Credit Risk. Asia Pacific Management Review, 2014. 19(4): p. 405.
[3] Maheu, J.M. and Q. Yang, An infinite hidden Markov model for short-term interest rates. Journal of Empirical Finance, 2016. 38: p. 202-220.
[4] Rossi, A. and G.M. Gallo, Volatility estimation via hidden Markov models. Journal of Empirical Finance, 2006. 13(2): p. 203-230.
[5] Wypasek, C.J., Hidden Markov Models in Finance. 2008, American Statistical Association: Alexandria. p. 1713-1714.
[6] D, H., Discriminating between rate heterogeneity and interspecific recombination in DNA sequence alignments with phylogenetic factorial hidden Markov models. Bioinformatics, 2005. 21(S2): p. ii166-ii172.
[7] Liu, M., L.T. Watson, and L. Zhang, Quantitative prediction of the effect of genetic variation using hidden Markov models. BMC Bioinformatics, 2014. 15(1): p. 5-5.
[8] Lunter, G., HMMoC-a compiler for hidden Markov models. Bioinformatics, 2007. 23(18): p. 2485-2487.
[9] Akhbari, M., et al., ECG segmentation and fiducial point extraction using multi hidden Markov model. Computers in Biology and Medicine, 2016. 79: p. 21-29.
[10] Huang, Z., et al., Medical Inpatient Journey Modeling and Clustering: A Bayesian Hidden Markov Model Based Approach. AMIA Annual Symposium Proceedings, 2015. 2015: p. 649.
[11] Lim, D., et al., Fall-Detection Algorithm Using 3-Axis Acceleration: Combination with Simple Threshold and Hidden Markov Model. Journal of Applied Mathematics, 2014. 2014: p. 1-8.
[12] Fu, G., S.P. Charles, and S. Kirshner, Daily rainfall projections from general circulation models with a downscaling nonhomogeneous hidden Markov model (NHMM) for south-eastern Australia. Hydrological Processes, 2013. 27(25): p. 3663-3673.
[13] Milone, D.H., et al., Automatic recognition of ingestive sounds of cattle based on hidden Markov models. Computers and Electronics in Agriculture, 2012. 87: p. 51.
[14] Siachalou, S., G. Mallinis, and M. Tsakiri-Strati, A Hidden Markov Models Approach for Crop Classification: Linking Crop Phenology to Time Series of Multi-Sensor Remote Sensing Data. Remote Sensing, 2015. 7(4): p. 3633-3650.
[15] Fox, M., et al., Robot introspection through learned hidden Markov models. Artificial Intelligence, 2006. 170(2): p. 59-113.
[16] Teoh, T.T. and Y.Y. Nguwi. Emotion indexing using Hidden Markov Expert Rule Model (HMER) for autism children. in 2010 11th International Conference on Control Automation Robotics & Vision (ICARCV). 2010. Singapore.
[17] Teoh, T.-T., S.-Y. Cho, and Y.-Y. Nguwi. Hidden Markov Model for hard-drive failure detection. 2012.
[18] Jelani, H. Enterprise Threats: Big Data and Cyber Security. Dataversity Education 2013; Available from: http://www.dataversity.net/enterprise-threats-big-data-and-cyber-security/.
[19] Kalyan, V. and I. Arnaldo. AI2: Training a big data machine to defend. Available from: https://people.csail.mit.edu/kalyan/AI2_Paper.pdf.
[20] Ching-Yun, C., Z. Teng, and Y. Zhang. Expectation-Regulated Neural Model for Event Mention Extraction. Available from: http://www.aclweb.org/anthology/N/N16/N16-1045.pdf.
