
AIMLCZG567: AI & ML

Techniques for Cyber Security


Jagdish Prasad
BITS Pilani, Pilani Campus (WILP)

Session: 08
Title: AIML Lecture 01-07 Recap
Agenda
• Introduction to Cyber Security
• Introduction to Artificial Intelligence
• Basics of Machine Learning 1 & 2
• Supervised Learning for Misuse/Signature Detection
• Machine Learning for Anomaly Detection
• Machine Learning for Hybrid Detection



Introduction to Cyber Security



Computer Security

• Computer systems consist of ASSETS.

• Computer ASSETS are of the following types:
• Hardware: Computers, devices (disk drives, memory cards, printers etc.), networks
• Software: Operating system, utilities, commercial applications (MS-Office, Oracle apps, SAP etc.), individual applications
• Data: Documents, photos, emails, projects, corporate data etc.
• Computer ASSETS have a value and deserve security protection.
• Computer ASSET value may be monetary or non-monetary and is person- and time-dependent.
• Computer security is the protection of computer ASSETS.



Vulnerability-Threat-Control Paradigm
• ‘Vulnerability’ is a weakness in the system that might be exploited to cause loss or harm.
• ‘Threat’ is a set of circumstances that has the potential to cause loss or harm to the system.
• A person who exploits a vulnerability perpetrates an ‘Attack’.
• ‘Control’ is an action, device, procedure or technique that removes or reduces a vulnerability.

Analogy (water rising behind a cracked wall):
• Vulnerability: Crack in the wall
• Threat: Rising water level
• Attack: Someone pumping in more water
• Control: Fill the gap, strengthen the wall



Security Triad - CIA
• Confidentiality: Ability of a system to
ensure that an asset is viewed by only
authorized parties
• Integrity: Ability of a system to ensure
that an asset is modified by only
authorized parties
• Availability: Ability of a system to
ensure that an asset can be used by any
authorized parties

Two additional properties:


• Authentication: Ability of a system to validate the identity of a sender
• Non-repudiation or Accountability: Ability of a system to confirm that a
sender cannot convincingly deny having sent something



Security Challenges

• Remote and hybrid workforce
• Blockchain and crypto-currency attacks
• Emerging 5G applications
• Ransomware evolution
• Phishing and spear-phishing attacks
• Machine learning and AI attacks
• Cloud and IoT attacks
• Software vulnerabilities
• Serverless app vulnerabilities
• Supply chain attacks
• Outdated hardware and software
• Mobile malware
• Growth of hacktivism
• Firmware weaponization
• Deep fake technology
• API attacks



Cryptography for Security
• Cryptography uses encryption to ensure that data is visible only to
authorised parties and no intermediary can snoop on the data
• Cryptography provides confidentiality and integrity of the information
without being vulnerable to attackers or threats
• Common Cryptographic techniques:
• Symmetric key cryptography
• Public key cryptography
• Hash functions
• Digital Signatures
• Message Authentication Codes
• Steganography
• Quantum Cryptography
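Two of the techniques listed above, hash functions and Message Authentication Codes, can be illustrated with Python's standard library. This is a minimal sketch; the message and key are made-up examples, not a real protocol:

```python
import hashlib
import hmac

# Hash function: any change to the message changes the digest,
# which supports integrity checking.
message = b"transfer 100 to account 42"
digest = hashlib.sha256(message).hexdigest()

# Message Authentication Code: a shared secret key binds the digest
# to a sender, so an intermediary cannot forge a valid tag.
key = b"shared-secret-key"  # illustrative key only
tag = hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(key: bytes, message: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)
```

Verification succeeds only with the same key and an unmodified message, which is the property that gives integrity and authentication.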



AI & ML Security Techniques
• AI in cybersecurity: Automates the threat detection and response process
more effectively than traditional methods.
• Detecting New Security Threats
• Endpoint Protection
• Identity Analytics and Fraud Detection
• IT Asset Inventory
• Smart Botnets: Automated Malware Detection and Prevention
• ML in cybersecurity: Helps understand past cyber attack patterns &
behaviours to improve security processes
• Automated Security Workflow
• Network Risk Scoring
• Threat Detection and Classification



Introduction to Artificial
Intelligence



What is Artificial Intelligence?
• IBM describes AI as: an application of advanced Natural Language
Processing, Information Retrieval, Knowledge Representation and
Reasoning, and Machine Learning technologies to the field of open domain
question answering.
• Artificial intelligence (AI) is the ability of a digital computer or computer-
controlled robot to perform tasks commonly associated with human beings.
• AI mimics human intelligence using data and logic.
• AI is applied to systems with the intellectual characteristic of humans, such as
the ability to reason, discover meaning, generalize, or learn from past
experience.
• Intelligence is composed of 5 abilities: Learning, Reasoning, Problem solving,
Perception and Language



Artificial Intelligence: Turing Test

• A human communicates with a computer via a teletype. If the human
can’t tell whether he is talking to a computer or another human, the
computer passes the test. Capabilities required:
– Natural language processing
– Knowledge representation
– Automated reasoning
– Machine learning
• Addition of vision and robotics leads to the total Turing test.
• Devised by Alan Turing in 1950.
[Figure: an interrogator sends questions and receives answers — which is the person, and which is the computer?]



Types of Artificial Intelligence

• Narrow AI (known as weak AI): Application of AI to specific tasks.
Ex: Siri, Alexa etc.
• General AI (known as strong AI): Machine has the ability to perform
intellectual tasks similar to humans. Ex: Self-driving cars.
• Super AI: Capability of machines surpasses humans.
Ex: Self-learning robots.



Predictive Analytics
• Predictive analytics is an advanced form of analytics that makes predictions
about future outcomes using historical data combined with statistical
modelling, data mining techniques and machine learning – “what might
happen next”?
• Organizations have large amounts of data from log files, images, videos and
other data forms residing in disparate data repositories.
• Predictive analytics is used to find patterns in this data to identify risks and
opportunities.
• Some of these statistical techniques include logistic and linear
regression models, neural networks and decision trees.
• Some of the ML modelling techniques use predictive analytics for insights.



Predictive Analytics Framework
Define the problem
• A prediction needs a good thesis and set of requirements.
• Example: detect fraud, determine optimal inventory levels for the holiday shopping season, identify
potential flood levels from severe weather.

Acquire and organize data
• Decades of data or a continual flood of data from customer interactions.
• Required data flows must be identified, and then datasets organized in a repository.

Pre-process data
• Clean data to remove anomalies, missing data points, or extreme outliers, which might be the result
of input or measurement errors.

Develop predictive models
• Identify the tools and techniques to develop predictive models depending on the problem to be
solved and the nature of the dataset.
• Machine learning, regression models and decision trees are commonly used models.

Validate and deploy results
• Check the accuracy of the model and adjust accordingly.
• Once acceptable results have been achieved, make the model available to stakeholders via an app,
website, or data dashboard.



Cognitive Computing
• A Cognitive system learns at scale, reasons with purpose and interacts with
humans naturally.
• Cognitive systems learn and reason from interactions with human beings and
experiences with environment rather than being explicitly programmed
• Cognitive computing overlaps with Artificial Intelligence and involves similar
technologies to power cognitive applications.
• Cognitive systems understand and simulate human reasoning and behaviour
• Cognitive computing systems help humans make better decisions at work.
• Examples:
• Speech recognition
• Sentiment analysis
• Face detection
• Risk assessment
• Fraud detection.



Key Attributes of Cognitive Computing

Adaptive
• Must be flexible enough to understand changes in the information.
• Must be able to digest dynamic data in real time and make adjustments
as the data and environment change.

Interactive
• Must have human-computer interaction (HCI) for users to interact and
define needs.
• Can also interact with other processors, devices and cloud platforms.

Iterative and stateful
• Must be able to identify problems by asking questions or pulling in
additional data if the problem is incomplete.
• Must maintain information about similar situations that have previously
occurred.

Contextual
• Must understand, identify and mine contextual data like syntax, time,
location, domain, requirements, a specific user’s profile, tasks or goals.
• May draw on multiple sources of information, including structured and
unstructured data and visual, auditory or sensor data.



Scope of Cognitive Computing
• Engagement
• Ability to develop deep domain insights and provide expert assistance.
• Contextual relationships between various system entities to enable it to form
hypotheses and arguments.
• Able to engage in deep dialogue with humans e.g. chatbots
• Decision
• Systems are modelled using reinforcement learning and continually evolve based
on new information, outcomes, and actions.
• Autonomous decision making depends on the ability to trace why the particular
decision was made and change the confidence score of a systems response.
• Example: IBM Watson for healthcare
• Discovery
• Use of deep learning to understand vast amounts of data
• Distributed intelligent agents to collect streaming data, like text and video, to
create an interactive sensing, inspection, and visualization system that provides
real-time monitoring and analysis.
• Real-time alerts and reconfiguration to isolate/fix a critical event
Cognitive Computing Products
• IBM Watson
• Google DeepMind
• Microsoft Cognitive Services
• CognitiveScale (founded by former IBM Watson team members)
• SparkCognition



Machine Learning Basics



What is Data Selection?
• Data selection is defined as the process of determining the appropriate data
type and source and suitable instruments to collect data.
• Data selection precedes the actual practice of data collection.
• The process of selecting suitable data for a research project can impact data
integrity.
• Primary objective of data selection is determining appropriate data type,
source, and instrument that allow investigators to answer research questions
adequately.
• This determination is often discipline-specific and is primarily driven by the
nature of the investigation, existing literature, and accessibility to necessary
data sources.



What is Feature Selection?
• Feature Selection is the method of reducing the input variables to the model
by using only relevant data and getting rid of noise in the data.
• Process of automatically choosing relevant features for machine learning
model based on the type of problem being solved.
• This is done by including or excluding important features without changing
them.
• It helps in cutting down the noise in our data and reducing the size of our
input data.



Feature Selection: Filter Method

• Features are dropped based on their relation to the output, i.e. how
strongly they correlate with the output.
• Use correlation to check whether features are positively or negatively
correlated with the output labels and drop features accordingly.
• Common methods: Information Gain, Chi-Square Test, Fisher’s Score,
Correlation Coefficient, Variance Threshold, Mean Absolute Difference etc.
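A filter method can be sketched in plain Python using the correlation coefficient listed above. The feature names and threshold below are illustrative, not from the slides:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between a feature column and labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def filter_select(features, labels, threshold=0.5):
    """Keep only features whose |correlation| with the labels exceeds threshold."""
    return [name for name, col in features.items()
            if abs(pearson(col, labels)) > threshold]

# Toy data: 'bytes_sent' tracks the label closely, 'noise' does not.
features = {
    "bytes_sent": [10, 20, 30, 40, 50],
    "noise":      [3, 1, 4, 1, 5],
}
labels = [0, 0, 1, 1, 1]
selected = filter_select(features, labels)
```

Note the filter looks only at the feature-label relation; no model is trained, which is what distinguishes it from the wrapper method below.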



Feature Selection: Wrapper Method

• Data is split into subsets and a model is trained on them.
• Based on the output of the model, features are added or subtracted and
the model is trained again.
• It forms the subsets using a greedy approach and evaluates the accuracy
of all the possible combinations of features.
• Common methods: Forward Selection, Backward Elimination,
Bi-directional Elimination, Exhaustive Selection etc.
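Forward Selection, the first wrapper method named above, can be sketched as a greedy loop. The scoring function here is a hypothetical stand-in for a trained model's validation accuracy, and the feature names are invented for illustration:

```python
def forward_selection(all_features, score_fn, n_keep=2):
    """Greedy wrapper: repeatedly add the single feature that most
    improves the model score, re-evaluating at every step."""
    chosen = []
    while len(chosen) < n_keep:
        best_feat, best_score = None, float("-inf")
        for feat in all_features:
            if feat in chosen:
                continue
            score = score_fn(chosen + [feat])  # "train" on this subset
            if score > best_score:
                best_feat, best_score = feat, score
        chosen.append(best_feat)
    return chosen

# Hypothetical score: two features are informative, one is noise.
def toy_score(subset):
    weights = {"src_port": 0.4, "payload_len": 0.5, "noise": 0.01}
    return sum(weights[f] for f in subset)

selected = forward_selection(["src_port", "payload_len", "noise"], toy_score)
```

Because the model is retrained for every candidate subset, wrapper methods are more expensive than filter methods but can capture feature interactions.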



Feature Selection: Intrinsic Method

• Combines both Filter and


Wrapper methods
• Common methods:
Regularization, Tree-based
methods



Data Sampling Steps

• Sample Goal: The population property that you wish to estimate using the
sample.
• Population: The scope or domain from which observations could
theoretically be made.
• Selection Criteria: The methodology that will be used to accept or reject
observations in your sample.
• Sample Size: The number of observations that will constitute the sample.



What is Data Sampling

1. Identify and define the target population
2. Select the sampling frame
3. Choose sampling methods
4. Determine the sample size
5. Collect the required data



Sampling Types
• Probabilistic sampling methods
• Simple random sampling
• Cluster sampling
• Systematic sampling
• Stratified random sampling
• Non-probabilistic sampling methods
• Convenience sampling
• Judgemental or Selective sampling
• Snowball sampling
• Quota sampling
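Two of the probabilistic methods above can be sketched with the standard library. The population of 100 records and the fixed seed are illustrative:

```python
import random

population = list(range(100))  # e.g. 100 log-record IDs

def simple_random_sample(pop, k, seed=0):
    """Simple random sampling: every record has an equal selection chance."""
    rng = random.Random(seed)  # seeded for reproducibility
    return rng.sample(pop, k)

def systematic_sample(pop, k):
    """Systematic sampling: take every (N/k)-th record from a fixed start."""
    step = len(pop) // k
    return pop[::step][:k]

srs = simple_random_sample(population, 10)
sys_sample = systematic_sample(population, 10)
```

Stratified sampling would additionally partition the population (e.g. by attack class) and sample each stratum separately, which matters for the imbalanced datasets discussed later.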



Supervised Learning
Types of Supervised learning
• Classification: A
classification problem is
when the output variable is
a category, such as “red” or
“blue” or “disease” and “no
disease”.
• Regression: A regression
problem is when the output
variable is a real value, such
as “dollars” or “weight”.



Unsupervised Learning

Types of Unsupervised learning


• Clustering: A clustering
problem is where you want to
discover the inherent
groupings in the data, such as
grouping customers by
purchasing behavior.
• Association: An association
rule learning problem is
where you want to discover
rules that describe large
portions of your data, such as
people that buy X also tend to
buy Y.



Reinforcement Learning



Machine Learning Types on a Page





Machine Learning Process

Analysis phase
• The ingested data is analyzed to detect patterns in the data.
• Patterns are used to create explicit features or parameters that can
be used to train the model.

Training phase
• Data parameters generated in the previous phase are used to create
machine learning models in this phase.
• The training phase is an iterative process, where the data incrementally
helps to improve the quality of prediction.

Testing phase
• Machine learning models created in the training phase are tested with
more data and the model's performance is assessed.
• Test with data that has not been used in previous phases.
• Model evaluation may or may not require parameter training.

Application phase
• Tuned models are fed with real-world data at this phase.
• The model is deployed in the production environment.



Feature Extraction
• Most of the IT system logs are text data while machines
understand numerical data only.
• Text data ‘as-is’ can not be used as input to Machine Learning
algorithm.
• Process of converting text data into numbers is called Feature
Extraction (also called text vectorization).
• Feature Extraction is an important step for a better
understanding of the context of what we are dealing with.
• NLP uses feature extraction extensively.



Feature Extraction
• Feature Extraction also reduces the number of features in a
dataset by creating new features from the existing ones (and
then discarding the original features).
• New reduced set of features can summarize most of the
information contained in the original set of features.
• A summarised version of the original features can be created
from a combination of the original set.



Feature Extraction Techniques
• One Hot Encoding
• Bag of Words (BoW)
• N-grams
• Tf-Idf
• Custom features
• Word2Vec (Word Embedding)
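The first two techniques above can be sketched in a few lines; the vocabulary and document are invented examples:

```python
def one_hot(vocab, word):
    """One Hot Encoding: a vector with 1 at the word's index, 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

def bag_of_words(vocab, document):
    """Bag of Words: count of each vocabulary term in the document."""
    tokens = document.lower().split()
    return [tokens.count(w) for w in vocab]

vocab = ["login", "failed", "success"]
vec = bag_of_words(vocab, "login failed login success")
```

One-hot vectors encode a single token; bag-of-words encodes a whole document but discards word order, which N-grams partially recover.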



TF-IDF: Term Frequency and Inverse
Document Frequency
• TF-IDF is a statistical measure that evaluates how relevant a word is to a
document in a collection of documents.
• Term Frequency (TF):
• Ratio of the number of times a word appears in a document to the total number of
words in that document (0 ≤ Tf ≤ 1)

Tfij = Count of term i in document j / Total count of all terms in document j

• Inverse Document Frequency (IDF):


• Logarithm of the number of documents in the corpus divided by the number of
documents where the specific term appears.
• Scikit-learn uses the formula log(N/ni) + 1.

idfi = log (Total number of documents / Number of documents with term i in it)
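The two formulas above can be computed directly. This sketch uses the plain log(N/ni) variant from the slide (not Scikit-learn's log(N/ni) + 1), on an invented two-document corpus:

```python
import math

def tf(term, doc):
    """Term frequency: count of term / total terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing term)."""
    n_docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_docs_with_term)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    ["failed", "login", "attempt"],
    ["successful", "login"],
]
```

Here "login" appears in every document, so its idf is log(2/2) = 0 and it contributes nothing; rarer terms like "failed" score higher, which is exactly the relevance weighting TF-IDF is meant to provide.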



Under / Over fit Models



Challenges of Imbalanced Class
• Malicious events are significantly rarer than normal healthy events, at
approximately 1-2% of the total number of observations.
• Requirement is to improve identification of the rare minority event as opposed
to achieving higher overall accuracy.
• Machine Learning algorithms tend to produce unsatisfactory classifiers when
faced with such imbalanced datasets.
• If the event to be predicted belongs to the minority class and the event rate is
less than 5%, it is referred to as a rare event and the dataset is imbalanced.
• Example:
• Electricity theft (the third largest form of theft) is one of the main challenges faced by
the utility industry today.
• Advanced Analytics and Machine Learning algorithms are used to identify
consumption patterns that indicate theft.
• Biggest challenge in this is the humongous data and its distribution.



Evaluation Metrics
• Evaluation metrics are quantitative measures used to assess the performance
and effectiveness of a Machine Learning model.
• Metrics indicate how well a model is performing in comparison to different
models or algorithms.
• Evaluation metrics provide objective criteria to evaluate a Machine Learning
model for its:
• Predictive ability
• Generalization capability
• Overall quality
• Choice of evaluation metrics depends on the specific problem domain, the type
of data and the desired outcome.



Precision & Recall
Precision
• Precision is a measure of a model’s performance that tells how many of the
positive predictions made by the model are actually correct.
• Calculated as the number of true positive predictions divided by the number of
true positive and false positive predictions.
Precision = TP / (TP + FP)

Recall
• Recall is a measure of a model’s ability to detect correct positive samples.
• Calculated as the number of true positive predictions divided by the number of
true positive and false negative predictions.
Recall = TP / (TP + FN)
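Both formulas can be computed by counting TP, FP and FN directly from labels. The labels below are an invented six-event example (1 = attack, 0 = normal):

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive class from raw labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fp), tp / (tp + fn)

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]   # one missed attack (FN), one false alarm (FP)
p, r = precision_recall(y_true, y_pred)
```

With 2 true positives, 1 false positive and 1 false negative, both precision and recall come out to 2/3, showing how a single false alarm and a single miss pull each metric down.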



F1-Score
• F1-Score is the harmonic mean of precision and recall values for a classification.
• F1-Score ranges between 0 and 1 and is calculated as:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

• F1-Score tells how precise (correctly classifies how many instances) and robust
(does not miss any significant number of instances) the classifier is.
• Harmonic Mean punishes extreme values more.
• Example: Assume a binary classification model with the following results:
• Precision: 0, Recall: 1
• If we take the arithmetic mean, we get 0.5. It indicates that the above result comes
from a classifier that ignores the input and predicts one of the classes as output.
• If we were to take HM, we would get 0 which is accurate as this model is useless for
all purposes.



F1-Score
• Standard F1 score calculation gives the same importance to both Recall and
Precision.
• F1 score can be calculated by attaching a weightage to either Recall or Precision
depending on which one is more important.
• The modified equation is as under (β is the weightage):

Fβ = (1 + β²) × (Precision × Recall) / ((β² × Precision) + Recall)
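The harmonic-mean behaviour can be checked in code. This sketch implements the standard F-beta formula, Fβ = (1 + β²)·P·R / (β²·P + R), with β = 1 giving the ordinary F1:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weighs recall higher, beta < 1 precision."""
    num = (1 + beta ** 2) * precision * recall
    den = beta ** 2 * precision + recall
    return num / den if den else 0.0

# The degenerate classifier from the example: precision 0, recall 1.
f1_degenerate = f_beta(0.0, 1.0)   # harmonic mean punishes the extreme
f1_balanced = f_beta(0.5, 0.5)
```

For precision 0 and recall 1 the score is 0, matching the slide's point that the harmonic mean, unlike the arithmetic mean (0.5), correctly rates a useless classifier.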


Feature Pipeline
• Machine Learning process combines a series of transformers on raw data,
transforming the dataset at each step until it is passed to the fit method of a
final estimator.
• Input documents need to be vectorized in the same manner, otherwise the
results will be wrong or, at the very least, unintelligible.
• Scikit-Learn’s Pipeline object provides solution to this problem.
• Pipeline objects integrate a series of transformers that combine normalization,
vectorization and feature analysis into a single well-defined mechanism.
• Pipeline objects move data from a loader into feature extraction mechanisms
to finally an estimator object that implements the predictive models.
• Pipelines are Directed Acyclic Graphs (DAGs) that can be simple linear chains of
transformers to arbitrarily complex branching and joining paths.

• Ref: https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html





Feature Pipeline
• Objective of a Pipeline is to chain together multiple estimators representing a
fixed sequence of steps into a single unit.
• All estimators in the pipeline, except the last one, must be transformers
• They must implement the transform method
• Last estimator can be of any type, including predictive estimators.
• A Pipeline's fit and transform can be called once and applied across
multiple estimator objects at once.
• Pipelines provide a single interface for grid search of multiple estimators at
once.
• Pipelines provide operationalization of text models by coupling a vectorization
methodology with a predictive model.
• Pipelines are constructed by describing a list of (key, value) pairs where
the key is a string that names the step and the value is the estimator object.
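The (key, value) construction above can be sketched with Scikit-Learn directly, coupling a vectorizer (transformer) with a predictive final estimator. The toy log lines and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Each step is a (key, value) pair: a name string and an estimator.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),  # transformer: implements transform
    ("clf", MultinomialNB()),      # final estimator: any type
])

logs = [
    "failed login root attempt",
    "failed password brute force",
    "user opened report document",
    "scheduled backup completed ok",
]
labels = ["attack", "attack", "normal", "normal"]

# fit runs the raw text through every step in sequence,
# so new documents are vectorized in exactly the same manner.
pipe.fit(logs, labels)
pred = pipe.predict(["failed root password"])[0]
```

Because vectorization and prediction live in one object, the same pipeline can be pickled and deployed, which is the operationalization point made above.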



ML Algorithm
• Naïve Bayes Algorithm
• Decision Tree
• Random Forests
• K-Means Clustering
• Support Vector Machine (SVM)
• Genetic Algorithm
• Artificial Neural Network (ANN)



Misuse/Signature Detection
Methods



What is Misuse/Signature Detection?
• Misuse detection recognizes specific unique patterns (signature) of
unauthorized behaviour to predict and detect subsequent similar attempts.
• Signature is a specific pattern that includes patterns of log files or packets that
have been identified as a threat.
• Each log file or packet has signatures.
• In a host-based IDS, a signature can be a pattern of system calls.
• In a network-based IDS, a signature can be a specific pattern of the packet, such as
packet content signatures and/or header content signatures that can indicate
unauthorized actions, e.g. improper FTP initiation.
• A packet includes source or destination IP addresses, source or destination
TCP/UDP ports, IP protocols (UDP, TCP, ICMP etc.) and data payloads.



Supervised Learning for Misuse Detection
• Supervised learning can be used for misuse/signature detection.
• Supervised learning can learn attack patterns depicting the sequences of
events correlated with the cyber attacks.
• A method of threat elimination includes:
• Measuring similarity between the patterns recognized in the recent activity and
the known patterns of various types of cyber attacks
• Identifying which system vulnerabilities have been used in known attacks
• Determining actions cyber administrators should take to defend against attacks.
• Execution signatures vary substantially from one attack category to another
• Specific detection methods are required to identify attack patterns to improve
detection capability.
• Machine Learning algorithms help improve detection performance.



Misuse Detection using ‘if-then’ Rules
• If the learned patterns and signature
of attacks match, the system
generates an alert for system
administrator indicating an attack.
• Administrator labels the attack.
• Required information is provided to
the administrator.
• Example:
• A known attack signature as “Login
name = Sadan”.
• If login name matches this signature,
the system will alert the administrator.
• The event will be flagged as anomalous.
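The if-then matching described above can be sketched as a table of hypothetical signature rules checked against each event; the rules (including the "Sadan" login from the example) are illustrative:

```python
# Hypothetical signature rules in if-then form: each maps a
# field condition to an alert label for the administrator.
RULES = [
    {"field": "login_name", "equals": "Sadan", "alert": "known attacker login"},
    {"field": "command", "equals": "rm -r *", "alert": "destructive command"},
]

def match_signatures(event):
    """Return an alert for every rule whose condition the event satisfies."""
    return [rule["alert"] for rule in RULES
            if event.get(rule["field"]) == rule["equals"]]

alerts = match_signatures({"login_name": "Sadan", "command": "ls"})
```

Real rule engines support richer conditions (regexes, thresholds, field combinations), but the structure — known signature in, alert out — is the same.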



Machine Learning in Misuse Detection
• Misuse detection searches for known potentially malicious information by scanning
and makes decisions based on prior knowledge of the attack signatures.
• Requirements for an effective misuse detection solution:
• Comprehensive collections of known cyber attack characteristics (signature database).
• Frequent updating of the signature database.
• Anti-spyware tools use signature detection to find malicious programs
• Scans files and programs in the system
• Compares them with the signatures in the database.
• Effective in locating known threats.
• However may also raise false alarms.
• Example:
• A user forgets the login password and makes multiple attempts to sign into an account.
• The account is locked for a specified time period after a defined number of failed login attempts.
• Unsuccessful attempts after this point can be considered as attacks.
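The lockout example can be sketched as a small counter; the threshold of three failures and the status labels are illustrative choices, not from the slides:

```python
class LoginMonitor:
    """Lock an account after max_failures failed attempts; attempts
    made past the lockout point are treated as suspected attacks."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = {}  # per-user failed-attempt counts

    def failed_login(self, user):
        self.failures[user] = self.failures.get(user, 0) + 1
        if self.failures[user] > self.max_failures:
            return "attack"   # unsuccessful attempts after lockout
        if self.failures[user] == self.max_failures:
            return "locked"
        return "retry"

mon = LoginMonitor(max_failures=3)
statuses = [mon.failed_login("bob") for _ in range(4)]
```

This also illustrates the false-alarm risk noted above: a genuinely forgetful user and an attacker produce the same signature.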



Machine Learning Process for Misuse Detection
• A typical misuse/signature detection process has five steps:
• Information collection or Data source
• Data pre-processing
• Misuse/signature identification by matching methods
• Rules regeneration
• Denial of Service (DoS) or other security response
• Data sources include audit logs, network packet flows, the Windows registry etc.
• Data pre-processing prepares input data for pattern learning by reducing noise and
normalizing, selecting and extracting features.
• Domain experts or automatic intelligent learning systems build intrusive learning
models such as rule-based expert systems, based on prior knowledge of malicious
code, data and vulnerabilities.
• Learned classification models or rules are applied to the incoming data for misuse
pattern detection.
• If any incoming data matches the attack patterns then a defensive action can be
activated automatically or an alert can be generated for administrators for further
analysis.


Rule Based Signature Analysis
• Rules define the correlation between attribute conditions and class labels.
• Rules detail various descriptive scenarios of attacks.
• Intrusion detection mechanism identifies a potential attack if a user’s
activities are found to be consistent with the established rules for detecting a
misuse.
• Two types of Rule-based analysis:
• Associative classification and Association rules.
• Fuzzy-rule-based classification.



Classification Using Association Rules
• Association rule represents relationships among attributes in a multi-
dimensional database.
• Association rule classification describes the patterns in a dataset
• Process steps for Rule based classification:
• Step 1. Find all length 1 item sets that satisfy the minimum support threshold.
• Step 2. Iteratively generate sets of candidate length k item sets by combining two
length k − 1 frequent item sets.
• Prune the infrequent length k candidate item sets that include any infrequent length k −
1 subsets.
• Find all length k item sets among the candidate pool, which satisfy the minimum support
threshold.
• Step 3. Generate all non-empty subsets for each of the frequent item sets
generated in Step 2.
• Step 4. For each non-empty subset generated in Step 3, output the corresponding
rules for the frequent item sets that satisfy minimum confidence.
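Steps 1-2 above (growing and pruning frequent itemsets) can be sketched in plain Python. The shell-command transactions and support threshold are invented to mirror the example on the next slide:

```python
from itertools import combinations

def apriori_frequent(transactions, min_support):
    """Level-wise frequent itemset mining: length-k candidates are built
    from length-(k-1) frequent sets, pruning any candidate with an
    infrequent subset, per Steps 1-2 above."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    freq = {}
    # Step 1: frequent length-1 itemsets.
    level = {s for s in {frozenset([i]) for t in transactions for i in t}
             if support(s) >= min_support}
    k = 1
    while level:
        for s in level:
            freq[s] = support(s)
        # Step 2: combine length-k frequent sets into length-(k+1) candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every length-k subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k))}
        level = {c for c in candidates if support(c) >= min_support}
        k += 1
    return freq

transactions = [frozenset(t) for t in
                [{"vi", "mail"}, {"vi", "mail"}, {"vi", "ls"}, {"mail", "ls"}]]
freq = apriori_frequent(transactions, min_support=0.5)
```

Steps 3-4 (rule generation with a confidence threshold) would then enumerate non-empty subsets of each frequent itemset; this sketch stops at the frequent itemsets themselves.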



Example: Association Rules for Shell Command Data

• Example:
• Host-based data of one telnet session
recorded by a mid-size company server.
• Has 15 transactions in the database.
• Association rules are built using the Apriori
algorithm on the attributes: time,
hostname, command and arg.
• The basic Apriori algorithm does not
consider domain knowledge, so its
application results in a large number of
irrelevant rules.
• Prior knowledge can reduce redundant
rules in post-processing or use item
constraints over attribute values.





Fuzzy Rule Based
• Fuzzy logic generates human-like expertise in decision-making process.
• Fuzzy-rule-based systems exploit tolerance for uncertainty and partial truth
to achieve tractability and robustness.
• Finding a set of fuzzy rules pertaining to the specific classification problem
being solved is a challenge.
• A rule-based system classifies the membership of data points in a binary term:
• A data point belongs to either a normal or an anomalous data set.
• Membership of any data point in a set is indicated by {0, 1}
• A fuzzy logic system define the membership of any data point in a set by a
value in the range [0.0, 1.0].
• 0.0 representing absolute falseness
• 1.0 representing absolute truth



Fuzzy Rule Based
• Given a set of data points X = {x} and a fuzzy set A, the membership of each
data point x ∈ A is denoted by a membership function f, where f: A → [0, 1].
• For each data point x ∈ A, f(x) is the weight of membership of x.
• An element mapping to the value 0 means that the member is not included in
the fuzzy set, while 1 describes a fully included member.
• Values between 0 and 1 characterize the fuzzy members.
• The set {x ∈ A | f(x) > 0} is called the support of the fuzzy set (A, f).
• A fuzzy system is characterized by a set of linguistic statements based on
expert knowledge.
• A rule is in the form of “if: antecedent–then: consequent,”
• if (src_ip == dst_ip) then “land attack.”
• A fuzzy rule is in the form of “if: antecedent–then: consequent [weight]”
• if (src_ip == dst_ip) then “land attack” [0.6].
• The fuzzy rule is evaluated as FZ(rule) = FO(src_ip == dst_ip) * 0.6, where
FO is the fuzzy operator function.
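The weighted land-attack rule above can be sketched in code. The fuzzy operator FO here is a hypothetical linear fall-off (comparing addresses as plain numbers purely for illustration):

```python
def fuzzy_eq(a, b, tolerance=8):
    """Hypothetical fuzzy operator FO: degree to which two values are
    'equal' — 1.0 when identical, falling off linearly with distance."""
    return max(0.0, 1.0 - abs(a - b) / tolerance)

def fuzzy_rule(src_ip, dst_ip, weight=0.6):
    """FZ(rule) = FO(src_ip == dst_ip) * weight, per the rule above."""
    return fuzzy_eq(src_ip, dst_ip) * weight

# Crisp rule membership is {0, 1}; fuzzy membership lies in [0.0, 1.0].
exact = fuzzy_rule(10, 10)   # identical addresses: full firing * 0.6
near = fuzzy_rule(10, 12)    # close addresses: partial firing
```

A crisp rule would return only 0 or 0.6 here; the fuzzy version grades in between, which is the point of replacing {0, 1} membership with a value in [0.0, 1.0].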
Fuzzy Rule Based
• A heuristic process is applied to achieve the consequent class and grade of
certainty for each of these classes.
• Step 1. For each training data point, xi = (xi1, …, xid), calculate the joint
antecedent fuzzy set of the qth rule
• Step 2. For each class c ∈ {1, …, C}, calculate the sum of the grades of the
training data points in class c with the qth fuzzy rule Rq.
• Step 3. Seek the class that has the maximum value calculated in Step 2.
• Step 4. Calculate the grade of certainty as follows:



Artificial Neural Network
• ANN provides the capability to analyze incomplete or distorted data.
• Can learn misuse attacks and identify suspicious events that are unlikely to be
accurately observed with other methods.
• Assumes that attackers often emulate the successes of others
• Can detect similar attacks that do not match previous malicious behaviours exactly.
• Faster, nonlinear data analysis and predictive capability to detect instances of misuse.
• Challenges in the application of ANN for misuse detection.
• Accurate prediction needs a large amount of attack data to ensure the training data
are adequate and balanced with the normal data.
• Malicious information in nature is infrequent and time consuming to collect.
• ANN learns patterns in a black box, which consists of the connection weights and
transfer functions of various net nodes.
• Success of ANN depends on the learning results of these weights.
• Accuracy of prediction cannot be interpreted using the complex network structure



ANN Usage Approach for Misuse Detection
• Two approaches to implement ANN in misuse detection:
• Incorporate ANN into existing or modified rule-based systems.
• Configure ANN standalone as a misuse detection system.
• Incorporate with a rule-based system:
• ANN is used to filter the input data of suspicious events before forwarding the misuse
candidate data to a rule-based expert system.
• Increases the sensitivity of misuse detection within the system.
• Drawback: Rule-based system has to be updated when ANN identifies new suspicious
events, because rule-based system cannot improve itself automatically with the
incoming data.
• Configure ANN to be fed network data directly to identify the malicious events.

Support Vector Machine (SVM)
• SVM conducts structural risk minimization, i.e. minimizing the true error on
unseen examples, while ANN focuses on empirical risk minimization.
• SVM selects a number of parameters based on the requirement of the margin
that separates the data points and not based on the number of feature
dimensions.
• This feature allows SVM to be compatible with more applications.
• SVM has two significant advantages over ANN when applied in intrusion
detection: speed and scalability.
• Speed is important for real-time detection
• Scalability is important for the huge information flow.
• SVM is capable of updating training patterns dynamically - Important when
attack patterns change.

Application SVM for Misuse Detection
• SVM is used to identify attack and misuse patterns associated with computer security
breaches, such as consequence of system software bugs, hardware or software failures,
incorrect system administration procedures, or failure of the system authentication.
• SVM intrusion detection procedures include three steps:
• Extraction of input and output pairs from user logs, web servers logs and the authority log
• Training of SVM model using data obtained in the previous step
• Testing of classification done by SVM model
• Raw information in system log files of user activities consists of various types of
attributes related to command, HTTP, and class labels - normal or anomalous.
• Weights assigned to system commands and user activities to indicate the potential status
as an anomaly.
• an ‘rm’ command assigned a weight of four
• an ‘rm -r *’ command assigned a weight of five, as this posed a greater threat to the system.
• In HTTP activities, “Read only actual html pages or images” were assigned weight of one
• In HTTP activities, “Read and attempt to access directory pages” were assigned a weight of two.
• “Read and attempt to access directory pages” received a higher weight because it may be
related to malicious queries to the server.
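A toy sketch of the three-step procedure above. The feature encoding (one weighted command score and one weighted HTTP score per session) and the data points are invented for illustration; only the SVC usage is real scikit-learn API:

```python
from sklearn.svm import SVC

# Step 1: input/output pairs extracted from logs (hypothetical toy features:
# [command_weight, http_weight] per session; 0 = normal, 1 = anomalous)
X_train = [[1, 1], [1, 2], [4, 1], [5, 2], [4, 2], [1, 2]]
y_train = [0, 0, 1, 1, 1, 0]

# Step 2: train the SVM model on the extracted pairs
model = SVC(kernel="linear").fit(X_train, y_train)

# Step 3: test the classification on unseen sessions
preds = model.predict([[1, 1], [5, 2]])
```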

Genetic Programming (GP)
• GP automatically breeds a population of computer programs according to
Darwin’s theory of evolution by natural selection.
• Evolution leads to overall fitness increase of the population with every generation.
• GP is a searching algorithm similar to the natural selection process.
• GP creates an initial population of candidate solutions:
• Using natural selection each solution is evaluated and fitness values are assigned to a
solution according to the fitness of the solution to the problem.
• Fittest solutions in the population are more likely to perform the reproduction
operation that creates a new generation of solutions.
• Reproduction of the new population includes three operations:
• Direct reproduction by copying the best existing programs
• Creation of new computer programs by a mutation operation
• Creation of new computer programs by a crossover operation
• Empirical results show that GP technique is more accurate than some
conventional Machine Learning-based IDSs

Application GP for Misuse Detection

GP Technique Process Steps
• Step 1. Randomly initialize a population of individual solutions.
• Step 2. Randomly select the fittest individuals from the population by using a
selection method.
• Fitness measure defines the problem the algorithm is expected to solve.
• Step 3. Generate new variants by applying the following Genetic operators for
certain probabilities:
• Reproduction: Copy an individual without change.
• Recombination: Exchange substructures between individuals.
• Mutation: Randomly replace a single atomic unit in an individual.
• Step 4. Calculate the fitness of the new individuals.
• Step 5. Go to Step 2, if the termination criterion is not met.
• Step 6. Stop. The best individual represents the best solution found.
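The loop in Steps 1–6 can be sketched with a deliberately tiny stand-in problem. For brevity this evolves fixed-length bit strings (a genetic algorithm rather than full tree-based GP), with the bit count standing in for a real detection-accuracy fitness measure:

```python
import random

def evolve(length=16, pop_size=30, generations=200, seed=1):
    """Genetic-search skeleton following Steps 1-6 above (toy fitness)."""
    rng = random.Random(seed)
    # Step 1: random initial population
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    fitness = lambda ind: sum(ind)            # the problem-defining fitness
    for _ in range(generations):              # Step 5: loop until termination
        def select():                         # Step 2: tournament selection
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        new_pop = [max(pop, key=fitness)]     # reproduction: copy the best as-is
        while len(new_pop) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randrange(1, length)    # recombination: exchange substructures
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.1:            # mutation: flip one atomic unit
                i = rng.randrange(length)
                child[i] ^= 1
            new_pop.append(child)             # Step 4: evaluated by fitness()
        pop = new_pop                         # on the next iteration
    return max(pop, key=fitness)              # Step 6: best individual found
```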

Types of GP Technique
• GP techniques have three variants: Linear Genetic Programming (LGP), Multi-Expression
Programming (MEP) and Gene Expression Programming (GEP).
• LGP evolves computer programs as sequences of imperative instructions.
• Sequences of instructions run on the registers of predefined sets.
• Successively removing the variables that start with the last effective instruction, we can represent an LGP
chromosome in a functional way.
• Settings of LGP parameters, such as the restriction of the migration of individuals among
subpopulations, which are located in subdivisions of the population space and control
population diversity, are critical for the performance of the system.
• MEP chromosomes encode several expressions (computer programs).
• Represent MEP genes using variable-length substrings.
• Length of the MEP chromosome corresponds to the number of genes of the chromosome.
• A function symbol presents a gene, which consists of pointers toward the function arguments.
• MEP is suitable for situations in which the target expression complexity is unknown.
• GEP uses character linear chromosomes with structurally organized heads and tails by
genes.
• Head consists of symbols denoting the elements from both function and terminal sets, while the tail consists
of only elements from terminal sets.
• Represent GEP individuals using fixed-length linear strings corresponding to variable-size and shape
chromosomes.
Decision Tree
• Decision tree is a non-parametric machine-learning method, which has no
requirement for data types.
• A data point is labelled by testing the feature values of the data against nodes of
the decision tree.
• Classification of the data point can be traced from the root node to a leaf node.
• Classification has high accuracy, intuitive knowledge expression, simple
implementation, efficiency and strength in handling high dimensional data
• Decision-tree classifiers are popularly used in applications, such as biomedical
analysis, manufacturing and production and clinical research.
• CART (Classification and Regression Trees) represents decision trees in the form of binary recursive partitioning.
• Classifies objects or predicts outcomes by selecting from a large number of variables.
• Most important of these variables determine the outcome variable.

Decision Tree Process Steps
• Step 1. Split a variable at all of its split points. Sample sections into multiple
nodes at each split point.
• Step 2. Select the best split in the variable in terms of splitting criterion.
• Step 3. Repeat Steps 1 and 2 for all variables at the root node.
• Step 4. Rank the best splits and select the variable that achieves the highest
purity at the root.
• Step 5. Assign classes to the nodes according to a rule that minimizes
misclassification costs.
• Step 6. Repeat Steps 1–5 for each nonterminal node.
• Step 7. Grow a large tree until each leaf is pure.
• Step 8. Prune and choose the final tree using the cross validation (CV).

Decision Tree Process Steps
• Splitting criterion plays a critical role in the feature selection for splitting.
• Two most employed splitting criteria are: Information Gain and Gini Index.
• Assume there is a discrete set of symbols {x1, …, xn} with an associated
probability Pi for variable x.
• As per Shannon’s information theory, the randomness of a sequence of symbols
drawn from a symbol set can be measured by the entropy of the probability
distribution as under: H(X) = −Σi Pi log2 Pi
• IG(X|Y) = H(X) − H(X|Y)
• Information Gain is the difference between the original information
requirement and the new information requirement.
• IG indicates the reduction in uncertainty about one variable when we have the
knowledge of the other correlated variable
• Feature with the highest IG is chosen as the splitting feature at a node.
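The two quantities above can be computed directly from label counts; a minimal sketch with toy labels (not KDD data):

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum_i P_i * log2(P_i) over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, feature):
    """IG(X|Y) = H(X) - H(X|Y): the drop in label entropy after partitioning
    the data by the values of one candidate splitting feature."""
    n = len(labels)
    h_cond = 0.0
    for v in set(feature):
        part = [l for l, f in zip(labels, feature) if f == v]
        h_cond += len(part) / n * entropy(part)
    return entropy(labels) - h_cond
```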

Decision Tree Technique
• Gini index measures the impurity of a data distribution and is defined as
Gini(X) = 1 − Σi pi², where pi is the probability that a variable x has symbol xi
• Gini index considers a binary split for each feature.
• Gini index is widely used by CART
• Due to noise and outliers, many of the decision-tree branches reflect anomalies
after the trees are built.
• Tree-pruning methods are employed to remove the least reliable branches.
• Two methods are commonly used in tree-pruning: pre-pruning and post-pruning.
• Pre-pruning is applied to halt splitting at a given node.
• Post-pruning removes subtrees from a fully built tree.
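Gini impurity itself is a one-liner over the label counts (toy labels for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini(X) = 1 - sum_i p_i^2: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())
```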

Application of Decision Tree in Misuse Detection
• Feature selection is critical in real-world IDSs, especially in improving the
effectiveness of IDSs.
• Difficult to test the match between an input element and a rule (signature) by
sequentially comparing the input element to the corresponding constraints
associated with the rule.
• Matching process covers the most resources intensively in the processes of
signature detection.
• A straightforward approach of clustering rules according to selected criteria will
improve the matching efficiency
• Example: Cluster the rules that have the same constraints in the same group.
• During signature detection, each rule in the group will be checked only when the
common constraints of a rule group match any input element.

What is CART?
• Red teaming is ethical hacking on a much broader and larger scale than conventional
security testing.
• Unlike penetration testing, it is not based on scope of IPs/application but instead
objective or goal-based i.e. you can attack whatever you want to achieve the goal.
• CART (Continuous Automated Red Teaming) is a security technology designed to automate red teaming to achieve the breadth
and depth of the process as well as scale it and conduct it on a continuous basis.
• Traditional penetration testing is conducted on a few, known applications or systems.
• CART discovers the attack surface on its own without any inputs and launches a
combination of multi-stage attacks, spanning from networks to applications to humans.
• Once an attack surface is recognized and a scope for the simulated attack is authorized,
the attack engine launches multi-stage attacks on the discovered surface to identify
security blind spots and attack paths before hackers do.
• Traditional red teaming is typically conducted once or twice a year
• CART automates the process and makes red teaming continuous.

Classification Model for CART
• CART (Classification and Regression Trees) classifies data by constructing a decision tree.
• Significance of predictors is ranked according to their contribution to the
construction of the decision tree.
• Ranking indicates the significance of each feature in intrusion detection.
• Three steps are included in CART:
• Step 1. Tree building
• Step 2. Pruning
• Step 3. Optimal tree selection
• While splitting a variable at all of its split points, the sample splits into binary
nodes at each split point.
• Optimal tree selection process finds the correct complexity parameter, so that
the information in L is fit, but not overfit.
• This fit requires an independent set of data.
• If an independent set of data is not available, we can use CV to pick out the
subtree with the lowest estimated misclassification rate.
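The three steps can be sketched with scikit-learn's decision tree as an illustrative stand-in for the CART tool: grow a full tree, enumerate pruned subtrees, and pick the cost-complexity parameter by cross-validation when no independent data set is available. `ccp_alpha` and `cost_complexity_pruning_path` are that library's API, assumed here as the pruning mechanism:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Step 1: tree building (grown until leaves are pure)
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Steps 2-3: candidate pruned subtrees scored by 5-fold CV; keep the best alpha
alphas = full_tree.cost_complexity_pruning_path(X, y).ccp_alphas
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=5).mean() for a in alphas]
best = DecisionTreeClassifier(random_state=0,
                              ccp_alpha=alphas[int(np.argmax(scores))]).fit(X, y)
```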
Anomaly Detection Systems

What is Anomaly Detection?
• Anomaly detection defines a profile of normal behaviours, which reflects the
health and sensitivity of a cyberinfrastructure.
• Anomaly behaviour is a pattern in data that does not conform to the expected
behaviours, including outliers, contaminants, noise etc.
• Anomaly detection requires a clear boundary between normal and anomalous
behaviour.
• A profile must contain robustly characterized normal behaviour, such as a
host/IP address or VLAN segment and have the ability to track the normal
behaviours of the target environment.
• Following information can be used to define a profile:
• Occurrence patterns of specific commands in application protocols
• Association of content types with different fields of application protocols
• Connectivity patterns between protected servers and outside world
• Rate and burst length distributions for all types of traffic

What is Anomaly Detection?
• Anomaly detection flags out malicious behaviours like
• Segmentation of binary code in a user password
• Stealth reconnaissance attempts
• Backdoor service on a well-known standard port
• Natural failures in the network
• New buffer overflow attacks
• HTTP traffic on a nonstandard port
• Intentionally stealthy attacks
• Example: If a user who usually logs in around 10 am from university dormitory,
logs in at 5:30 am from an IP address of China, then an anomaly has occurred.
• Business use cases:
• Medical diagnostics
• Network intrusion detection
• Manufacturing defect detection
• Fraud detection
Types of Anomaly

Global Outliers:
• A data point assumes a value that is far outside all the other data point value ranges in the dataset.
• Normally a rare event i.e. much higher salary credited than monthly average.

Contextual Outliers:
• Contextual means that its value doesn’t correspond with what we expect to observe for a similar data point in the same context.
• Contexts are usually temporal, and the same situation observed at different times may not be an outlier.
• Ex: A sudden sales boost for a store outside of holidays or sales season.

Collective Outliers:
• Collective outliers are a subset of data points that deviate from normal behaviour.
• Ex: Tech companies tend to grow bigger, but some companies may decay. If many companies at once show a decrease in revenue in the same period of time, it is a collective outlier.
Why Machine Learning for Anomaly Detection?

Supervised Learning:
• Items in the dataset are labelled into two categories: normal and abnormal.
• Model uses this data to learn patterns and be able to detect abnormal patterns in previously unseen data.

Unsupervised Learning:
• Artificial neural networks decrease the amount of manual work needed to pre-process i.e. no manual labelling is needed.
• Neural networks can be applied to unstructured data.
• Neural networks can detect anomalies in unlabelled data and use this learning to work with new data.

Semi-Supervised Learning:
• Combines the benefits of unsupervised learning and human supervision.
• Unsupervised learning is used to automate feature learning and work with unstructured data.
• Human supervision controls the kind of patterns learned by the model.
• Helps to make the model’s predictions more accurate.

Rule Based Anomaly Analysis
• Rules are defined to describe normal profiles of users, programs, and other
resources in cyberinfrastructures.
• Anomaly detection method identifies a potential attack if users or programs
act inconsistently with the defined rules.
• Constructing anomaly detection models using association rules is performed
in two steps.
• System audit data is mined for consistent and useful patterns of program and user
behaviours.
• Inductively learned classifiers are trained using the relevant features present in the
patterns to recognize anomalies

Threshold Rules
• Tests events or flows for activity that is greater than or less than a specified
range.
• Uses these rules to detect bandwidth usage changes in applications, failed
services, number of users connected to a VPN, and detecting large outbound
transfers.
• Example:
• A user who was involved in a previous incident has large outbound transfer
• When a user is involved in a previous offense, automatically set the Rule response
to add to the Reference set.
• If you have a watch list of users, add them to the Reference set.
• Tune acceptable limits within the Threshold rule.

Behavioral Rules
• Tests events or flows for volume changes that occur in regular patterns to
detect outliers e.g. a mail server that has an open relay and suddenly
communicates with many hosts or an IPS that starts to generate numerous
alert activity.
• A behavioural rule learns the rate or volume of a property over a pre-defined
season.
• Season defines the baseline comparison timeline for what you are evaluating.
• When you set a season of 1 week, the behaviour for the property over that 1
week is learned, and then you use rule tests to alert you to the changes.
• After a behavioural rule is set, the seasons adjust automatically.
• As the data in the season is learned and is continually evaluated so that business growth is
profiled within the season, you do not have to make changes to your rules.
• The longer that a behavioural rule runs, the more accurate it is over time.
• You can then adjust the rule responses to capture more subtle changes.
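The season/baseline idea above can be sketched in a few lines. The daily-volume data and the 30% tolerance are invented for illustration, not product defaults:

```python
def learn_season(history):
    """history: volumes observed over one season (e.g. a week of daily counts).
    Returns (mean, tolerance): the learned baseline for that season."""
    mean = sum(history) / len(history)
    return mean, 0.30 * mean          # assumed tolerance: 30% of the baseline

def is_outlier(volume, baseline):
    """Flag a new observation whose deviation from the season mean exceeds
    the learned tolerance."""
    mean, tol = baseline
    return abs(volume - mean) > tol

week = [100, 104, 98, 101, 99, 102, 96]   # one learned season of volumes
baseline = learn_season(week)
```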

Behavioral Rules
• Detects changes in traffic or properties that are always present such as mail
traffic, firewall traffic, bytes transferred by common protocols such as 443
traffic, or applications that are common within network.
• Define a pattern, traffic type, or data type that can track to generate an
overall trend or historical analysis.
• Assign rule tests against that pattern to alert to special conditions.
• Example:
• When the importance of the current traffic level (on a scale of 0 to 100) is 70
compared to learned traffic trends and behaviour to the rule test, the system
sends an alert when the traffic is +70 or -70 of the learned behaviour.

Artificial Neural Network: Detection Steps
• Feature Selection: Key attributes are packet source & destination addresses,
source & destination ports, and protocol (TCP, UDP, ICMP).
• Ranking: Principle of ranking used to categorize the features into three groups:
• Preliminary features represent the most useful features.
• Secondary features are of less impact on the monitoring process.
• Less important features are of the slightest impact on the detection process.
• This categorization can accelerate the whole operation of the system.
• Ex: Considered only 22 out of the 41 KDD’99 features and experimenting with 5 preliminary, 7
secondary, 10 less important features
• Encoding: Encoding is done without using any keys and the non numerical data
(records in the database) are converted into serial numbers.
• Normalization: Process of efficiently organising the data with the following three
goals: eliminate redundant, partially dependent and transitively dependent data
• Anomaly Detection: Use ANN algorithm for classification into Normal, Probing,
R2L (remote to local), U2R (user to root) and DOS attacks.

Artificial Neural Network: Attack Examples
• Probing: ipsweep, mscan, nmap, portsweep, sendsaint, satan etc
• R2L (Remote to Local): ftp write, guess_password, imap, multihop, named,
worm, xsnoop, snmpgetattack etc
• U2R (User to Root): buffer_overflow, httptunnel, rootkit, sqlattack, xterm etc
• DOS: mailbomb, smurf, teardrop, udpstorm, neptune etc

Support Vector Machine for Anomaly Detection
• Supervised SVM can be used in anomaly detection by training the SVM
structure with both attack data sets and normal data sets.
• SVM can also be applied as unsupervised machine learning in anomaly
detection.
• SVM outperforms ANN by fine tuning support vectors to separate data to:
• Achieve the global optimum
• Easily control the overfitting problem.
• One-class classification method is used to detect the outliers and anomalies
in a dataset.
• Based on Support Vector Machines (SVM) evaluation, the One-class SVM
applies a One-class classification method for novelty detection.

Support Vector Machine for Anomaly Detection
• Distance between the separating hyperplane and the closest data points of the two classes is referred to as the margin.
• SVM determines the best or the optimal hyperplane with the maximum margin to
ensure the distance between the two classes is as wide as possible.
• Regarding anomaly detection, SVM computes the margin of the new data point
observation from the hyperplane to classify it.
• If the margin exceeds the set threshold, it classifies the new observation as an anomaly.
• If the margin is less than the threshold, the observation is classified as normal.
• SVM algorithms are highly efficient in handling high-dimensional and complex data sets.
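A One-class SVM sketch: the model is trained on normal data only, and scikit-learn's `predict` returns +1 for inliers and -1 for outliers. The synthetic 2-D "normal traffic" and the `nu` value are illustrative choices:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # normal data only

# nu is (roughly) an upper bound on the fraction of training outliers
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal)

preds = model.predict(np.array([[0.1, -0.2],   # close to the normal profile
                                [8.0, 8.0]]))  # far outside it
```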

Random Forests
• Two key parameters required to build the network traffic model: number of
random features to split the node of trees (Nf) and number of trees in a forest
(Nt).
• Combinational values of these two variables are selected, corresponding to the
optimal prediction accuracy of the random forests.
• Random forests uses proximity measure between the paired data points to find
outliers.
• If a data point has low proximity measures to all the other data points in a given
data set, it is likely to be outlier.
• Proximity between xenquiry and xj ∈ Cj, prox(xenquiry, xj), is incremented by one
each time both data points are found in the same leaf of a tree.
• Obtain proximity and the degree of an outlier in any class for each data point
extracted from a network traffic data set and decide using the threshold.
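The proximity measure can be sketched with scikit-learn's forest, whose `apply()` method returns per-tree leaf indices; averaging leaf co-occurrence over trees normalizes by the number of trees. The six-point data set is invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.2], [0.1, 0.0]])
y = np.array([0, 0, 0, 1, 1, 0])
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)  # Nt = 50

leaves = forest.apply(X)      # shape (n_samples, n_trees): leaf index per tree

def proximity(i, j):
    """Fraction of trees in which points i and j fall into the same leaf."""
    return float(np.mean(leaves[i] == leaves[j]))
```

A point whose proximities to all other points are low is an outlier candidate.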

Isolation Forests
• Principle of isolation forests is that outliers are points with features that are considerably different
from the rest of the data.
• The inliers will be closer together while the outliers will be farther apart.
• A decision tree is constructed by making random cuts with randomly chosen features.
• Tree is allowed to grow until all points have been isolated.
• Fewer splits are required to isolate outliers and outlier nodes will reside closer to the root node.
• Process of constructing a tree with random cuts is repeated to create an ensemble, hence the term
forest in the algorithm’s name.
• Two key hyperparameters are the number of estimators, which gives the number of trees
used in the ensemble, and contamination, which gives the fraction of outliers in the data set.
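In scikit-learn's implementation those two hyperparameters are `n_estimators` and `contamination`, and `predict()` returns -1 for outliers, +1 for inliers. The data below (a dense cloud plus one far-away point) is invented for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # dense inlier cloud
               [[10.0, 10.0]]])                   # one far-away point

forest = IsolationForest(n_estimators=100, contamination=0.01,
                         random_state=0).fit(X)
preds = forest.predict(X)                         # -1 = outlier, +1 = inlier
```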

Clustering Techniques
• Clustering-based outlier detection methods assume that the normal data
objects belong to large and dense clusters, whereas outliers belong to small or
sparse clusters, or do not belong to any clusters.
• Clustering-based approaches detect outliers by extracting the relationship
between Objects and Cluster.
• An object is an outlier if:
• Does the object belong to any cluster? If not, then it is identified as an outlier.
• Is there a large distance between the object and the cluster to which it is closest? If
yes, it is an outlier.
• Is the object part of a small or sparse cluster? If yes, then all the objects in that cluster
are outliers.

Clustering Techniques
• Categorization method is similar to assigning specific patterns or characteristics
to the groups of normal and anomalous data.
• Categorize clustering-based anomaly detection into two groups:
• Distance-based clustering: k-means clustering, Expectation Maximization and SOM
• Density-based clustering: CLIQUE and MAFIA
• Major focus is on the first group as the second group has fewer research results and does not
have as good anomaly detection results as the first group.

Clustering Techniques
• Most deployed distance-based clustering method is adapted from k-means
clustering.
• Without defining K in these algorithms, the clustering hyper-spheres are
constrained by a threshold r.
• Given data set X = {x1, …, xm} and cluster set C = {C1, …, CK}, distance metric
dist(xi, Cj) measures the closeness between data point xi, i = 1, …, m, and cluster
Cj.
• Steps to implement distance-based clustering are as under:
• Step 1. Initialize cluster set C = {C1, …, CK}.
• Step 2. Assign each data point xi in X to the closest cluster C*, C* ∈ {C1, …, CK},
if dist(xi, C*) ≤ r; otherwise create a new cluster Cʹ for this data point and
update the cluster set C.
• Step 3. Iterate until all data points are assigned to a cluster.
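A minimal sketch of Steps 1–3 with Euclidean distance. Keeping the first point of each cluster as its centre is a simplification for brevity (running means are also common):

```python
import math

def cluster(points, r):
    """Threshold-constrained clustering: assign each point to the nearest
    existing cluster centre if within r, else start a new cluster."""
    clusters = []                              # Step 1: start with no clusters
    for p in points:                           # Step 3: iterate over all data
        best, best_d = None, None
        for c in clusters:
            d = math.dist(p, c["centre"])      # Euclidean distance metric
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= r:   # Step 2: within threshold r
            best["members"].append(p)
        else:                                  # otherwise create a new cluster
            clusters.append({"centre": p, "members": [p]})
    return clusters
```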

Clustering Techniques…
• Commonly used distance metric is Euclidean distance.
• If we choose the distance between a data point xi and cluster Cj to measure
dist(xi, Cj), the clustering algorithm is similar to k-means clustering, except there
is an additional constraint r for the clustering threshold.
• As all training data are unlabelled, we cannot determine which clusters belong
to normal or anomaly types.
• Each cluster may include mixed instances of normal data and different types of
attacks.
• Normal data out-number anomaly data, so the clusters that constitute more
than a percentage α of the training data set are labelled as “normal” groups.
• Other clusters are labelled as “attack.”
• Determination of abnormal clusters by the size of classes can lead to some
small-sized normal data groups being misclassified as anomaly clusters, especially
in the case of multi-type normal data.

Clustering Techniques…
• Threshold r also affects the result of clustering.
• When r is large, the cluster number will decrease; when r is small, the cluster
number will increase.
• Selection of r is dependent on the knowledge of the normal data distribution.
• For instance, we know statistically it should be greater than the intra-cluster
distance and smaller than the inter-cluster distance.
• Jiang (2006) selected r by generating the mean and standard deviation of
distances between pairs of sample data points from the training data set.
• Once the training data have been clustered and labelled, testing data can be
grouped according to their shortest distance to any cluster in the cluster set.

Deep Learning
• Deep Learning is designed to work with multivariate and high dimensional data.
• This makes it easy to integrate information from multiple sources, and
eliminates challenges associated with individually modelling anomalies for each
variable and aggregating the results.
• Deep learning approaches are well-adapted to jointly modelling the interactions
between multiple variables with respect to a given task.
• Beyond the specification of generic hyperparameters (number of layers, units per
layer, etc.), deep learning models require minimal tuning to achieve good results.
• Deep learning methods offer the opportunity to model complex, nonlinear
relationships within data, and leverage this for the anomaly detection task.
• Performance of deep learning models can potentially scale with the availability
of appropriate training data, making them suitable for data-rich problems.

Autoencoders
• Autoencoders are neural networks designed to learn a low-dimensional
representation, given some input data.
• Consist of two components:
• Encoder that learns to map input data to a low-dimensional representation (termed
the bottleneck)
• Decoder that learns to map this low-dimensional representation back to the original
input data.
• Encoder network learns an efficient “compression” function that maps input
data to a salient lower-dimensional representation, such that the decoder
network is able to successfully reconstruct the original input data.
• Model is trained by minimizing the reconstruction error, which is the difference
(mean squared error) between the original input and the reconstructed output
produced by the decoder.
• In practice, autoencoders have been applied as a dimensionality reduction
technique, as well as in other use cases such as noise removal from images,
image colorization, unsupervised feature extraction, and data compression
Autoencoders
• A semi-supervised approach is used to
train the model for normal behaviour.
• Autoencoder is trained on normal data
samples.
• Model learns a mapping function that
reconstructs normal data samples with a
very small reconstruction error.
• This behavior is replicated at test time,
where the reconstruction error is small for
normal data samples, and large for
abnormal data samples.
• To identify anomalies, we use the
reconstruction error score as an anomaly
score and flag samples with
reconstruction errors above a given
threshold
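The reconstruction-error scoring described above can be sketched compactly. A PCA projection is used here as a linear stand-in for the encoder/decoder pair (a linear autoencoder learns essentially the same subspace), so the example stays small; a real deployment would use a trained neural autoencoder, and the threshold here is purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# "Normal" samples live near a 1-D line inside 3-D space (plus small noise)
t = rng.normal(size=(300, 1))
normal = np.hstack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(300, 3))

# Encoder + decoder analogue with a bottleneck of dimension 1,
# trained on normal samples only (the semi-supervised setup above)
pca = PCA(n_components=1).fit(normal)

def anomaly_score(x):
    """Mean squared reconstruction error of a single sample."""
    recon = pca.inverse_transform(pca.transform(x.reshape(1, -1)))
    return float(np.mean((x - recon) ** 2))

threshold = 0.01   # illustrative; in practice set from validation-set errors
```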
Variational Autoencoders
• VAE consists of an encoder and a decoder network
for Variational Inference learning.
• VAE learns a mapping from an input to a distribution
and learns to reconstruct the original data by
sampling from this distribution using a latent code.
• In Bayesian terms, prior is the distribution of the
latent code, likelihood is the distribution of the input
given the latent code, and posterior is distribution of
latent code, given our input.
• Encoder learns the parameters (mean and variance)
of a distribution that outputs a latent code vector Z,
given the input data (posterior) – typically a Gaussian
or Bernoulli.
• Decoder learns a distribution that outputs the
original input data point (or something really close to
it), given a latent bottleneck sample (likelihood) -
typically, an isotropic Gaussian distribution.
• VAE model is trained by minimizing the difference
between the estimated distribution produced by the
model and the real distribution of the data.
Hybrid Detection Systems

Why Hybrid Systems?
• Misuse detection methods and Anomaly detection methods have
compensatory functions and abilities.
• Hybrid of these two will provide:
• Flexibility & Intelligence of anomaly detection methods
• Accuracy and Reliability of misuse detection methods.

Designing Hybrid Systems
• Design of hybrid detection system requires to consider two critical factors.
• First:
• Identify the best candidates of misuse or anomaly detection systems
• Determine the good pairs of the misuse or anomaly detection systems for
integration
• Can be difficult to choose from so many misuse and anomaly detection systems.
• Second:
• Need to consider the optimal or sub-optimal way to integrate the given detection
systems based on the different fundamental techniques
• Can achieve the best balance between the detection and false-alarm rates and
maintain the ability to detect new intrusions
• Selection of misuse and anomaly detection systems for combination is an
application-specific problem.

Integration Frameworks
• Anomaly-Misuse sequence detection
• Misuse-Anomaly sequence detection
• Parallel detection
• Complex mixture detection

Anomaly-Misuse Sequence Detection

• Input dataset = X has two components Xi (intrusive) and Xn (normal)


• Only Anomaly detection used: An (Normal) and Au (unknown)
• X = Au + An
• Only Misuse detection used: Mi (intrusive) and Mu (unknown)
• X = Mi + Mu
• If output of Anomaly (Au) is input to Misuse then output of Misuse will be:
Au.Mi and Au.Mu (. stands for intersection)

Anomaly-Misuse Sequence Detection
Anomaly confusion matrix:
      Xn    Xi
An    ATP   AFP
Au    AFN   ATN
• An = ATP + AFP
• Au = AFN + ATN

Misuse confusion matrix:
      Xn    Xi
Mi    MFP   MTP
Mu    MFN   MTN
• Mi = MFP + MTP
• Mu = MFN + MTN

• Correct subsets = ATP . MTP, ATN . MTN


• Incorrect subsets = AFP . MFP, AFN . MFN
• Other subsets have suspicious data e.g. AFN . MTP (Anomaly declares Normal
but Misuse declares Intrusion – here Anomaly is wrong)

Anomaly-Misuse Detection

• In Anomaly-Misuse detection, the Au output of the anomaly stage is sent to the misuse stage.

• The system outputs three subsets: An, Au.Mi, and Au.Mu
• An is not sent to Misuse even though it may contain errors (AFN . MTP)
• For the system to have high accuracy, AFN . MTP must be negligible
• Subset Au.Mu (consisting of ATP.MFN and AFP.MTN) requires administrator intervention to classify its records as intrusive or normal.
• Subset Au.Mi consists of ATP.MTP and AFP.MFP: the first part is correctly detected intrusions (malicious), while the second is false alarms raised by both systems
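The reporting policy can also be phrased per record; a sketch using hypothetical detector predicates:

```python
# Per-record triage in the Anomaly-Misuse sequence: An is reported normal
# without a misuse check, Au.Mi is reported intrusive, and Au.Mu is queued
# for administrator review. The predicates are hypothetical detectors.

def triage(x, anomaly_is_normal, misuse_is_intrusive):
    if anomaly_is_normal(x):
        return "normal"        # in An: never reaches the misuse stage
    if misuse_is_intrusive(x):
        return "intrusive"     # in Au.Mi: matched a known signature
    return "admin-review"      # in Au.Mu: unknown to both stages

is_normal = lambda x: x < 1000        # toy anomaly profile
is_signature = lambda x: x == 9999    # toy misuse signature
labels = [triage(x, is_normal, is_signature) for x in (100, 9999, 5000)]
print(labels)  # ['normal', 'intrusive', 'admin-review']
```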
Association Rule in Audit Data Analysis & Mining
Framework of the training phase in ADAM

Framework of the testing phase in ADAM



Misuse-Anomaly Sequence Detection

• Input dataset X has two components: Xi (intrusive) and Xn (normal)

• If only anomaly detection is used, it splits X into An (normal) and Au (unknown):
• X = Au + An
• If only misuse detection is used, it splits X into Mi (intrusive) and Mu (unknown):
• X = Mi + Mu
• If the misuse output Mu is fed to the anomaly stage, the anomaly output will be Mu.An and Mu.Au (. stands for intersection)



Misuse-Anomaly Sequence Detection
Anomaly confusion matrix (rows: detector output, columns: true class)

         Xn    Xi
An       ATN   AFN
Au       AFP   ATP

• An = ATN + AFN
• Au = AFP + ATP

Misuse confusion matrix

         Xn    Xi
Mi       MFP   MTP
Mu       MTN   MFN

• Mi = MFP + MTP
• Mu = MTN + MFN

• Correct subsets = ATP . MTP, ATN . MTN
• Incorrect subsets = AFP . MFP, AFN . MFN
• The remaining subsets hold suspicious data, e.g. AFN . MTP (Anomaly declares Normal but Misuse declares Intrusion – here Anomaly is wrong)



Misuse-Anomaly Detection

• In Misuse-Anomaly detection, the Mu output of the misuse stage is sent to the anomaly stage.

• Mi is not sent to Anomaly even though it may contain errors (MFP . ATN)
• For the system to have high accuracy, MFP . ATN must be negligible
• Subset Mu.An (consisting of MTN.ATN and MFN.AFN) requires administrator intervention to classify its records as anomalous or normal.
• AFN can be kept low by building an accurate profile of normal behaviour.
• To keep MFN low, the rules in the misuse detection system must have good coverage of malicious data.
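The reversed pipeline can be sketched the same way, again with placeholder predicates rather than real detectors:

```python
# Misuse-Anomaly sequence: known signatures are reported first (Mi); the
# remainder Mu goes to the anomaly stage, which either clears it (Mu.An)
# or flags it as anomalous (Mu.Au). Both predicates are hypothetical.

def misuse_anomaly_cascade(X, misuse_is_intrusive, anomaly_is_normal):
    Mi = [x for x in X if misuse_is_intrusive(x)]        # reported intrusive
    Mu = [x for x in X if not misuse_is_intrusive(x)]    # forwarded to anomaly
    Mu_An = [x for x in Mu if anomaly_is_normal(x)]      # reported normal
    Mu_Au = [x for x in Mu if not anomaly_is_normal(x)]  # flagged anomalous
    return Mi, Mu_An, Mu_Au

Mi, Mu_An, Mu_Au = misuse_anomaly_cascade(
    [100, 9999, 5000, 120],
    misuse_is_intrusive=lambda x: x == 9999,  # toy signature
    anomaly_is_normal=lambda x: x < 1000,     # toy normal profile
)
print(Mi, Mu_An, Mu_Au)  # [9999] [100, 120] [5000]
```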
Using Random Forests in Network Intrusion
Detection
The workflow of the misuse–anomaly detection system in random-forest-based network intrusion detection systems



Using Weighted Signature Generation over Anomalous
Internet Episodes in Network Intrusion Detection

The workflow of the signature generation module in hybrid intrusion detection with weighted signature generation over anomalous Internet episodes
Parallel Detection System

• In a parallel detection system, the anomaly detection system and the misuse detection system perform intrusion detection independently and in parallel.
• Output of anomaly detection system consists of two subsets Au and An.
• Output of misuse detection system consists of two subsets Mu and Mi.
• An intelligent correlation system, called resolver, is designed to analyze these
subsets and report the detection results.
Parallel Detection System
• Combination of these subsets composes the final results of the parallel
detection system (An ∩ Mi, An ∩ Mu, Au ∩ Mi, and Au ∩ Mu)
• An ∩ Mu and Au ∩ Mi improve the detection rate of normal and intrusive
patterns.
• The An ∩ Mi and Au ∩ Mu subsets denote suspicious data that needs further analysis.
• Subset An ∩ Mi implies that a malicious pattern defined by intrusive signatures
in misuse detection system falls into the normal profile defined in anomaly
detection system.
• Both or either of these signatures and profiles may need to be updated.
• Subset Au ∩ Mu represents data that is unknown to both the misuse and anomaly systems.
• Requires further investigation.
• In parallel detection systems, the correlation system plays a key role: it analyzes the detection results from both the misuse and anomaly detection systems.
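A resolver over the four intersections might be sketched as follows; the action names and predicates are illustrative, not taken from any specific system:

```python
# Toy resolver for parallel detection: both detectors run on every record
# and each of the four intersections is mapped to an action.

def resolve(x, anomaly_is_normal, misuse_is_intrusive):
    an = anomaly_is_normal(x)
    mi = misuse_is_intrusive(x)
    if an and not mi:
        return "normal"             # An ∩ Mu: both agree it is benign
    if not an and mi:
        return "intrusive"          # Au ∩ Mi: both agree it is malicious
    if an and mi:
        return "update-signatures"  # An ∩ Mi: profile or signature may be stale
    return "investigate"            # Au ∩ Mu: unknown to both systems

is_normal = lambda x: x < 1000     # toy anomaly profile
is_signature = lambda x: x in (120, 9999)  # toy signature set
actions = [resolve(x, is_normal, is_signature) for x in (100, 9999, 120, 5000)]
print(actions)  # ['normal', 'intrusive', 'update-signatures', 'investigate']
```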
Parallel Detection System
• Given prior knowledge about intrusive or normal data, the parallel detection
system can reduce the chances of missing intrusions among suspicious data
and the false-alarm rate.
• NIDES (developed by SRI) is an example for such parallel detection systems.



Complex Mixture Detection System
• No clear boundary between the misuse and anomaly detection subsystems.
• The AdaBoost (Adaptive Boosting) algorithm is applied as a typical example for intrusion detection.
• AdaBoost is widely used in supervised machine learning.
• Performs weight based classification on a number of weak classifiers according
to a function of classification errors.
• AdaBoost algorithm is simple and efficient in implementation, given simple
weak classifiers.
• AdaBoost tends to resist overfitting in practice, although it is sensitive to outliers
• AdaBoost must be trained using both attack and normal labelled data.
• Using labelled data of networks, one can train the weak classifiers and improve
strong classifiers using AdaBoost algorithms.



Complex Mixture Detection System
• Trained strong classifiers are used to classify new data as normal or anomalous.
• AdaBoost algorithm can select weak classifiers among ANN, KNN, SVM, decision
trees, etc.
• Computational complexities of such an action depend on the training of the
classifiers.
• Computational complexity of a strong classifier depends mostly on the
computational complexity of a weak classifier.
• It is preferable to choose simply structured machine-learning methods as weak classifiers.



AdaBoost Algorithm
• The principle behind boosting algorithms is to first build a model and then build a second model to rectify the errors of the first.
• The procedure is repeated until the errors are minimized and the dataset is predicted correctly.
• A boosting algorithm combines multiple models (weak learners) to reach the final output (a strong learner).
• Decision trees with just one split (decision stumps) are normally used in AdaBoost.
• AdaBoost first builds a model giving equal weights to all the data points.
• It then assigns higher weights to the points that were wrongly classified.
• All points with higher weights are given more importance in the next model.
• Models keep being trained until the error is sufficiently low.
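The reweighting loop above can be illustrated with a minimal pure-Python AdaBoost using decision stumps on one-dimensional toy data (a sketch for intuition, not an IDS implementation):

```python
import math

# Minimal AdaBoost with one-split decision stumps on 1-D data, labels in
# {+1 (intrusive), -1 (normal)}. Misclassified points get larger weights,
# so later stumps focus on them.

def train_adaboost(xs, ys, rounds=5):
    n = len(xs)
    w = [1.0 / n] * n                          # equal weights to start
    ensemble = []                              # list of (alpha, threshold, sign)
    for _ in range(rounds):
        best = None
        for t in sorted(set(xs)):              # candidate split points
            for s in (1, -1):                  # s=1 means predict +1 when x > t
                preds = [s if x > t else -s for x in xs]
                err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, t, s, preds)
        err, t, s, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, s))
        # boost the weights of misclassified points, then renormalize
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if x > t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

# Toy traffic feature: small values normal (-1), large values intrusive (+1).
xs = [1, 2, 3, 8, 9, 10]
ys = [-1, -1, -1, 1, 1, 1]
model = train_adaboost(xs, ys)
print([predict(model, x) for x in xs])  # recovers the training labels
```

With richer weak learners (ANN, KNN, SVM, deeper trees), only the inner stump search changes; the weighting scheme stays the same.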



Complex Mixture Detection System

Workflow of a hybrid detection system using the AdaBoost algorithm



Thank You
