Analyst Intuition Based Hidden Markov Model On High Speed Temporal Cyber Security Big Data
Abstract— Hidden Markov Models (HMM) are probabilistic models that can be used for forecasting time series data. They have seen success in various domains such as finance [1-5], bioinformatics [6-8], healthcare [9-11], agriculture [12-14] and artificial intelligence [15-17]. However, few applications of HMM to cyber security have been reported to date. We believe that the properties of HMM, namely being predictive and probabilistic and being able to model different naturally occurring states, form a good basis for modelling cyber security data. This work therefore presents the initial results of our attempts to predict security attacks using HMM. A large network dataset representing cyber security attacks has been used in this work to establish an expert system. The characteristics of attackers' IP addresses are extracted from our integrated datasets to generate statistical data. The cyber security expert provides the weight of each attribute and forms a scoring system by annotating the log history. We apply HMM to distinguish between attack, unsure and no attack by first breaking the data into 3 clusters using Fuzzy k-means (FKM), then manually labelling a small sample of the data (Analyst Intuition), and finally applying the state-based HMM approach. Our results are very encouraging compared with searching for anomalies in a cyber security log, which generally produces a large number of false detections.

Keywords- Hidden Markov Model (HMM), Cyber security, Network Protocols, Virus, Big Data, High Velocity, Analyst Intuition, Principal Component Analysis (PCA), Expectation Regulated, Fuzzy k-means (FKM), Multi-layer Perceptron (MLP)

978-1-5386-2165-3/17/$31.00 ©2017 IEEE
Authorized licensed use limited to: Birla Inst of Technology and Science Pilani Dubai. Downloaded on November 02,2023 at 17:01:22 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

The landscape of cyber security attacks evolves at an incredible speed. The volume of information flowing in and out of an organisation is enormous, and detecting anomalies within this information is a real challenge. The painstaking process of recovery is often due to the late discovery of such instances. These factors and consequences have led to the rise of big data for intrusion detection and prevention. Big data is a general term that has been widely used to describe the avalanche of information being consolidated into a dataset. The sheer volume of big data presents a great challenge when we attempt to study the patterns or associations within the data. Advances in handling big data enable many industrial problems and challenges to be addressed. Industries and companies are now able to understand and process volumes of data that were once beyond their reach. While many domains have benefited from big data technologies, cyber security is one field that is just beginning to explore the use of big data analytics. The ability to detect and deter cyber-attacks can make or break the functional success of an enterprise [18]. Using big data, organizations may be able to rigorously detect threats, create better defence mechanisms and improve security.

The objective of this research is to build an efficient expert system that draws on the expertise of cyber security experts and allows them to input suitable weights for different attributes. The cyber security expert also contributes to the scoring system based on the words in the log file. We then adopt the Fuzzy k-Means (FKM) algorithm to create clusters of attackers and non-attackers in order to segregate the attack-related traffic from the network datasets.

Our Analyst Intuition approach is inspired by Kalyan [19] and Chang [20]. Kalyan [19] used a semi-supervised approach for huge, unbalanced and unlabelled data. The approach starts by labelling a sample of the data, trains the system on that sample, and then tests it against the remaining data. Likewise, Chang et al. [20] use a similar method known as “Expectation Regulated Neural Network” for DDoS attacks.

In this study, we collected 3 days of data, amounting to 36 million log files and 36 gigabytes of data. The data was provided by Singapore Technology Engineering. The total number of malware instances is about 60. We apply data mining techniques to study the statistical data obtained from the integrated datasets. These analytics help in separating the attack-related traffic from normal traffic as well as in extracting attack patterns. The Fuzzy k-Means (FKM) clustering algorithm was applied to create attacker and non-attacker clusters on the time-related and connection-related data obtained from the integrated datasets.

Several models were generated by changing the key parameters. The testing step was repeated several times to determine the accuracy and efficiency of the results. The results obtained from the algorithms were validated against each other in verifying the attack-related traffic.

The FKM algorithm created three clusters in total: (i) cluster-1 consists of no attackers, (ii) cluster-2 consists of an uncertain number of attackers, and (iii) cluster-3 consists of 364 non-attackers. One of the issues in cyber security is that different network security systems and tools generate log files in different formats, which complicates consolidation. This research demonstrates the integration and analysis of datasets for identifying attack-related traffic, which can potentially lead to easier threat detection in cases where attacks occur on multiple platforms.
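To make the clustering step concrete, the following is a minimal sketch of fuzzy clustering into three groups. It assumes that FKM refers to the standard fuzzy c-means update (alternating membership-weighted centre updates and distance-based membership updates); the paper does not give its exact formulation. The function name `fuzzy_k_means`, the fuzzifier `m = 2`, and the two-feature example points are all hypothetical.

```python
import math
import random

def fuzzy_k_means(points, n_clusters=3, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Cluster feature tuples; return (centers, memberships).

    memberships[i][k] is the degree (0..1) to which point i belongs to
    cluster k, and each row sums to 1. m > 1 is the fuzzifier (assumed 2).
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Start from random memberships, normalised per point.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    power = 2.0 / (m - 1.0)
    for _ in range(n_iter):
        # Centre update: mean of all points weighted by membership ** m.
        centers = []
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            w_sum = sum(w)
            centers.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / w_sum
                for d in range(dim)))
        # Membership update: closer centres receive higher membership.
        shift = 0.0
        for i in range(n):
            dist = [max(math.dist(points[i], c), 1e-12) for c in centers]
            row = [1.0 / sum((dist[k] / dist[j]) ** power
                             for j in range(n_clusters))
                   for k in range(n_clusters)]
            shift = max(shift, max(abs(a - b) for a, b in zip(row, u[i])))
            u[i] = row
        if shift < tol:
            break
    return centers, u

# Hypothetical (time-related, connection-related) features per source IP.
example_points = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1),
                  (5.1, 5.0), (10.0, 9.9), (9.9, 10.0)]
example_centers, example_u = fuzzy_k_means(example_points)
# Hard assignment, if needed: the cluster with the highest membership.
hard = [max(range(3), key=row.__getitem__) for row in example_u]
```

Unlike hard k-means, every point keeps a graded membership in all three clusters, which fits the idea of an intermediate, uncertain group rather than a strict attacker or non-attacker split.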
The expert system allows cyber security experts to enter their inputs to form the scores. These analytics help in separating the attack-related traffic from normal traffic as well as in extracting attack patterns. The Fuzzy k-Means (FKM) clustering algorithm was applied to create attacker and non-attacker clusters on the time-related and connection-related data obtained from the integrated datasets. The clustering algorithm forms 3 clusters: Strong, Average, and Mild. The distance measure used is K-means. Prior to clustering, fuzzification was performed. We extract a sample of 864 data points for processing. The system first looks for keywords such as worm and malware among the data and marks the feature as 1 when a keyword is encountered. Expert weightage is then given and forms the scoring. Algorithm 1 outlines the process of expert labelling.

Figure 2 outlines the system flow. We hope to use this model to prioritize the identified events among the large log files. A large number of events are captured by ArcSight, and some events are missed by security analysts due to the large volume involved. The proposed model will learn from the attacks identified by security analysts and propose new high-priority events. The diagram (Figure 5) shows that the raw data from any application, including firewalls, intrusion detection systems (IDS) and anti-virus software with common event format, are fed into ArcSight. In the meantime, we intend to design a database that captures the live feed from ArcSight. Our algorithm will then take the data from ArcSight as well as from the database and produce new high-priority events on top of the original ArcSight ranking. The system will then combine our new rankings with the ArcSight data and produce a consolidated result.

Figure 2 Proposed model to prioritize the identified event

For series events, the trained model (see Figure 3) adopts a 3-state HMM with 5 observation states. The three output states are attack, no attack and unsure. The observation states are separated into 5 ranges of scoring: (I) 0 to 0.2, (II) 0.21 to 0.4, (III) 0.41 to 0.6, (IV) 0.61 to 0.8, (V) 0.81 and above. The raw input is in the form of strings; each second of log consists of 1147 string inputs, which are then converted into a vector for HMM processing. The training stage attempts to calculate the emission and transition probabilities based on the Viterbi algorithm. The Viterbi algorithm finds Viterbi paths based on naïve Bayesian calculation. The model is very generic because all the parameters are adjustable, including the observation states and the length of the string. HMM is used in this context because of its state-based property and because it caters for highly imbalanced data. Further, the data in log files are sequential in nature, which suits the states in HMM very well. The HMM also provides fast training and computation time due to the use of a memoryless model.

Figure 3 The model that detects anomaly from big data

Figure 4 The results of HMM model indicated 3 major attacks, as circled in red, occurring across the log files.

Algorithm 1: HMM Scoring System with Expert Labeling or Analyst Intuition
Given Log String=[] ;
Repeat the following until Log String is empty:
1. Sort the 36 million logs by source IP address.
2. Extract the problematic computer, which in this case is 10.67.25.69.
3. Look for words that contain certain keywords like Malware; if found, set word_found=1.
4. For every log, assign a weight to the attribute and a score to the attribute value; for example, if the severity weight is 0.8 and "very high" is 0.9, then the score will be 0.72.
5. Visualize the data using Excel (see Figure 3).
6. Cluster the data into 3 clusters using FKM.
7. Label the output as attack, unsure or no attack. Labeling rules:
   if score[i-1] > 0.7 and word_found[i-1]==1:
       label[i-1]=1
   elif score[i-1] > 0.15 and word_found[i-1]==0:
       label[i-1]=0.5
   else:
       label[i-1]=0
8. Train the neural network (MLP) model and test the trained classifier against the test data.
9. Arrange the logs in sequences of 1147 instead of on a per-log basis, and split the sequences into 80% training and 20% testing.
10. HMM observation: break the score into 5 ranges: 0 to 0.2, 0.21 to 0.4, 0.41 to 0.6, 0.61 to 0.8, and 0.81 and above.
11. Train using Viterbi (Bayesian based) to calculate the HMM emission.
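The keyword flag, the weighted score and the labelling rules of Algorithm 1 can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the keyword tuple and the sample log lines are hypothetical, while the thresholds (0.7 and 0.15) and the 0.8 × 0.9 example come from the algorithm as written.

```python
KEYWORDS = ("malware", "worm")  # example keywords named in the text

def word_found(log_line, keywords=KEYWORDS):
    """Return 1 if any keyword occurs in the log line, else 0."""
    text = log_line.lower()
    return int(any(k in text for k in keywords))

def attribute_score(weight, level):
    """Expert weight of an attribute times the expert score of its value."""
    return weight * level

def label_entry(score, found):
    """Labelling rules of Algorithm 1: 1 = attack, 0.5 = unsure, 0 = no attack."""
    if score > 0.7 and found == 1:
        return 1.0
    if score > 0.15 and found == 0:
        return 0.5
    return 0.0

# Hypothetical log lines; severity weight 0.8 and "very high" = 0.9 as in the text.
sample_logs = ["Malware detected on host", "Routine connection closed"]
severity_score = attribute_score(0.8, 0.9)
sample_labels = [label_entry(severity_score, word_found(line))
                 for line in sample_logs]
```

Read literally, these rules label an entry that contains a keyword but scores 0.7 or below as no attack, because the "unsure" branch requires that no keyword was found.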
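The 3-state HMM with five observation ranges can be illustrated with a small, self-contained sketch. The paper does not spell out its training procedure, so this example makes two assumptions: the start, transition and emission probabilities are estimated by counting over labelled sequences with add-one smoothing, and decoding uses the standard Viterbi recursion in log space. The bin edges treat each range as closed on its upper bound, and all names and the toy sequence are hypothetical.

```python
import math
from bisect import bisect_left

STATES = ("no attack", "unsure", "attack")   # state indices 0, 1, 2
EDGES = (0.2, 0.4, 0.6, 0.8)                 # upper bounds of ranges I to IV

def to_symbol(score):
    """Map a score to an observation symbol 0..4 (ranges I to V)."""
    return bisect_left(EDGES, score)

def normalise(row):
    total = sum(row)
    return [v / total for v in row]

def estimate(sequences, n_states=3, n_obs=5, alpha=1.0):
    """Count-based estimates from sequences of (state, symbol) pairs.

    alpha is an add-one style smoothing constant (an assumption) that keeps
    every probability non-zero.
    """
    start = [alpha] * n_states
    trans = [[alpha] * n_states for _ in range(n_states)]
    emit = [[alpha] * n_obs for _ in range(n_states)]
    for seq in sequences:
        start[seq[0][0]] += 1
        for state, symbol in seq:
            emit[state][symbol] += 1
        for (a, _), (b, _) in zip(seq, seq[1:]):
            trans[a][b] += 1
    return (normalise(start),
            [normalise(r) for r in trans],
            [normalise(r) for r in emit])

def viterbi(symbols, start, trans, emit):
    """Return the most likely state path for a sequence of symbols."""
    n_states = len(start)
    logp = [math.log(start[s]) + math.log(emit[s][symbols[0]])
            for s in range(n_states)]
    back = []
    for o in symbols[1:]:
        prev, logp, ptr = logp, [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + math.log(trans[r][s]))
            ptr.append(best)
            logp.append(prev[best] + math.log(trans[best][s])
                        + math.log(emit[s][o]))
        back.append(ptr)
    # Trace the best final state back to the start.
    state = max(range(n_states), key=lambda s: logp[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Hypothetical toy data: one labelled sequence of (state index, score) pairs.
labelled = [(0, 0.10), (0, 0.15), (1, 0.50), (2, 0.90), (2, 0.95), (0, 0.05)]
train = [[(s, to_symbol(x)) for s, x in labelled]] * 5
start, trans, emit = estimate(train)
decoded = viterbi([to_symbol(x) for _, x in labelled], start, trans, emit)
print([STATES[s] for s in decoded])
```

In practice each training sequence would be one of the 1147-entry sequences from the 80% training split, with the analyst-derived labels (0, 0.5, 1) mapped to the three state indices.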
Figure 4 visualizes the HMM results. The data records the log activity over 3 days across the 1146 instances. During the period of data capture, three major attacks were detected by the HMM model, as circled in red. This outcome is in line with, and validated by, the cyber security analyst's manual interpretation.

III. CONCLUSION

This research presented a unique blend of the Fuzzy k-means (FKM) approach with the HMM model. We first cluster the raw data into 3 clusters (attacks, unsure and non-attacks), then use the state-based HMM approach to observe the state transitions among the data. The major challenge of interpreting cyber security data has been the large volume of data, which often results in major attacks within the data being missed. We conducted an experiment based on 36 million log files and successfully detected the 3 major attacks in a short computation time of 0.13 seconds. We aim to extend this work to incorporate deep learning for a better hit rate on larger sets of data.

REFERENCES
[1] Giampieri, G., M. Davis, and M. Crowder, Analysis of default data using hidden Markov models. Quantitative Finance, 2005. 5(1): p. 27-34.
[2] Lu, S.-L., A Hidden Markov Chain Model with Applications for Assessing Credit Risk. Asia Pacific Management Review, 2014. 19(4): p. 405.
[3] Maheu, J.M. and Q. Yang, An infinite hidden Markov model for short-term interest rates. Journal of Empirical Finance, 2016. 38: p. 202-220.
[4] Rossi, A. and G.M. Gallo, Volatility estimation via hidden Markov models. Journal of Empirical Finance, 2006. 13(2): p. 203-230.
[5] Wypasek, C.J., Hidden Markov Models in Finance. 2008, American Statistical Association: Alexandria. p. 1713-1714.
[6] D, H., Discriminating between rate heterogeneity and interspecific recombination in DNA sequence alignments with phylogenetic factorial hidden Markov models. Bioinformatics, 2005. 21(S2): p. ii166-ii172.
[7] Liu, M., L.T. Watson, and L. Zhang, Quantitative prediction of the effect of genetic variation using hidden Markov models. BMC Bioinformatics, 2014. 15(1): p. 5-5.
[8] Lunter, G., HMMoC-a compiler for hidden Markov models. Bioinformatics, 2007. 23(18): p. 2485-2487.
[9] Akhbari, M., et al., ECG segmentation and fiducial point extraction using multi hidden Markov model. Computers in Biology and Medicine, 2016. 79: p. 21-29.
[10] Huang, Z., et al., Medical Inpatient Journey Modeling and Clustering: A Bayesian Hidden Markov Model Based Approach. AMIA Annual Symposium Proceedings, 2015. 2015: p. 649.
[11] Lim, D., et al., Fall-Detection Algorithm Using 3-Axis Acceleration: Combination with Simple Threshold and Hidden Markov Model. Journal of Applied Mathematics, 2014. 2014: p. 1-8.
[12] Fu, G., S.P. Charles, and S. Kirshner, Daily rainfall projections from general circulation models with a downscaling nonhomogeneous hidden Markov model (NHMM) for south-eastern Australia. Hydrological Processes, 2013. 27(25): p. 3663-3673.
[13] Milone, D.H., et al., Automatic recognition of ingestive sounds of cattle based on hidden Markov models. Computers and Electronics in Agriculture, 2012. 87: p. 51.
[14] Siachalou, S., G. Mallinis, and M. Tsakiri-Strati, A Hidden Markov Models Approach for Crop Classification: Linking Crop Phenology to Time Series of Multi-Sensor Remote Sensing Data. Remote Sensing, 2015. 7(4): p. 3633-3650.
[15] Fox, M., et al., Robot introspection through learned hidden Markov models. Artificial Intelligence, 2006. 170(2): p. 59-113.
[16] Teoh, T.T. and Y.Y. Nguwi. Emotion indexing using Hidden Markov Expert Rule Model (HMER) for autism children. in 2010 11th International Conference on Control Automation Robotics & Vision (ICARCV). 2010. Singapore.
[17] Teoh, T.-T., S.-Y. Cho, and Y.-Y. Nguwi. Hidden Markov Model for hard-drive failure detection. 2012.
[18] Jelani, H. Enterprise Threats: Big Data and Cyber Security. Dataversity Education 2013; Available from: http://www.dataversity.net/enterprise-threats-big-data-and-cyber-security/.
[19] Kalyan, V. and I. Arnaldo. AI2: Training a big data machine to defend. Available from: https://people.csail.mit.edu/kalyan/AI2_Paper.pdf.
[20] Ching-Yun, C., Z. Teng, and Y. Zhang. Expectation-Regulated Neural Model for Event Mention Extraction. Available from: http://www.aclweb.org/anthology/N/N16/N16-1045.pdf.