
Anomaly Detection (AD)

This chapter is about detecting unexpected events, or anomalies, in systems.

Introduction
• In a general context, anomaly detection is any method for finding events that do not conform to an expectation.
• In the context of network security, anomaly detection refers to identifying unexpected intruders or breaches.
• One important application of anomaly detection is fraud detection. Fraud in the financial industry can often be fished out of a vast pool of legitimate transactions by studying the patterns of normal events and detecting when deviations occur.

• Similarly, if a power company can find anomalies in the electrical power grid and remedy them, it can possibly avoid the expensive damage that occurs when a power flow causes outages in other system components.
Terminology
• Novelty detection and outlier detection are both forms of anomaly detection, but there is an important distinction between them:
• Novelty detection involves learning a representation of “regular” data using data that does not contain any outliers.
• Outlier detection involves learning from data that contains both regular data and outliers.
• A time series is a sequence of data points of an event or process observed at successive points in time, often collected at regular intervals.
• The study of anomaly detection is closely coupled with time series analysis, because an anomaly is often defined as a deviation from what is normal or expected given what has been observed in the past.
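To make this time-series framing concrete, here is a minimal sketch using NumPy only; the synthetic series, the 24-point look-back window, and the 3-standard-deviation threshold are illustrative assumptions, not values from the source.

import numpy as np

# Synthetic hourly readings: steady noise around 50, plus one injected spike.
rng = np.random.default_rng(0)
series = 50 + rng.normal(0, 1, 200)
series[120] = 60  # the anomaly we hope to flag

window = 24  # illustrative look-back window
anomalies = []
for t in range(window, len(series)):
    history = series[t - window:t]
    mean, std = history.mean(), history.std()
    # Flag points that deviate from the recent past by more than 3 standard deviations.
    if abs(series[t] - mean) > 3 * std:
        anomalies.append(t)

print(anomalies)  # should include index 120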

AD Versus Supervised Learning


• It is sometimes unclear which approach to take when looking to develop a
solution for a problem.
• For example, if you are looking for fraudulent credit card transactions, it
might make sense to use a supervised learning model if you have a large
number of both legitimate and fraudulent transactions with which to train
your model.
• Supervised learning would be especially suited for the problem if you expect future instances of
fraud to look similar to the examples of fraud you have in your training set.
• Many other scenarios are different. Server breaches, for example, are sometimes caused by zero-day attacks.
• By definition, the method of intrusion cannot be predicted in advance, and it is difficult to build a profile of every possible method of intrusion in a system.
• Because such events are relatively rare, they also contribute to the class imbalance problem that makes supervised learning difficult to apply.
• Anomaly detection is perfect for such problems.

Anomaly Detection Algorithms
1. Unsupervised machine learning
i. One-class support vector machines
ii. Isolation Forest
2. Density-based methods
i. Local Outlier Factor (LOF)
3. Forecasting (supervised machine learning)
4. Goodness-of-fit tests
i. Elliptic envelope fitting (covariance estimate fitting); see the sketch after this list
5. Statistical metrics
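Elliptic envelope fitting (item 4.i) is not demonstrated later in this chapter, so here is a minimal hedged sketch using scikit-learn's EllipticEnvelope; the toy data and the contamination value of 0.05 are illustrative assumptions.

import numpy as np
from sklearn.covariance import EllipticEnvelope

# Mostly Gaussian points around the origin, plus a few distant outliers.
rng = np.random.RandomState(42)
inliers = rng.randn(100, 2)
outliers = rng.uniform(low=-8, high=8, size=(5, 2))
X = np.r_[inliers, outliers]

# Fit a robust covariance estimate and flag the most unlikely ~5% of points.
env = EllipticEnvelope(contamination=0.05, random_state=42)
pred = env.fit_predict(X)   # 1 for inliers, -1 for outliers
print(np.sum(pred == -1))   # roughly 5 points flagged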

1.i One-class Support Vector Machines


• The one-class support vector machine (one-class SVM) is an unsupervised model for anomaly or outlier detection.
• Unlike the regular supervised SVM, the one-class SVM has no target labels for the model training process. Instead, it learns the boundary of the normal data points and identifies data outside that boundary as anomalies.
• One-class classification aims to differentiate samples of one particular class by learning from single-class samples during training.

▪ This method is more suitable for novelty detection than outlier detection; that is, the training data should ideally be thoroughly cleaned and contain no anomalies.
▪ The one-class SVM works on the basic idea of minimizing a hypersphere around the single class of examples in the training data, and it considers all samples outside the hypersphere to be outliers, i.e., outside the training data distribution.

Example
from sklearn.svm import OneClassSVM

X = [[0], [0.44], [0.45], [0.46], [1]]
clf = OneClassSVM(gamma='auto').fit(X)
print(clf.predict(X))        # array([-1,  1,  1,  1, -1])
print(clf.score_samples(X))
# [1.77987316 2.05479873 2.05560497 2.05615569 1.73328509]

The score_samples method exposes the scoring function of the estimator. Note that scikit-learn's OneClassSVM has no contamination parameter; its nu parameter plays that role, setting an upper bound on the fraction of training points treated as outliers (and a lower bound on the fraction of support vectors), which effectively controls the classification threshold.
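As a rough illustration of how nu behaves, here is a small sketch; the synthetic training data and the nu value of 0.05 are assumptions for this example.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.randn(500, 2)

# nu upper-bounds the fraction of training points treated as outliers
# (and lower-bounds the fraction of support vectors).
clf = OneClassSVM(kernel='rbf', gamma='auto', nu=0.05).fit(X_train)
print(np.mean(clf.predict(X_train) == -1))   # roughly 0.05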

Example:
• Create a random sample dataset for this example by using the make_blobs() function.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

np.random.seed(13)
x, y = make_blobs(n_samples=200, n_features=2, centers=1, cluster_std=.3)  # y: cluster number
plt.scatter(x[:,0], x[:,1], c=y)
plt.show()
[Slide figure: the same dataset generated with centers=3]


• Define the model, fit and predict:
svm = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.03)
svm.fit(x)
pred = svm.predict(x)

• Next, we'll extract the negative outputs as the outliers.


anom_index = np.where(pred==-1)
values = x[anom_index]

• Visualize
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()

Anomaly Detection With Scores


• We can also find anomalies by using their scores. In this method, we define the model, fit it on the x data with the fit_predict() method, and then identify outliers from the score of each element.
svm = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.02)
pred = svm.fit_predict(x)
scores = svm.score_samples(x)
• Next, we'll obtain the threshold value from the scores by using the quantile
function. Here, we'll get the lowest 3 percent of score values as the
anomalies.
thresh = np.quantile(scores, 0.03)
• Next, we'll extract the anomalies by comparing the threshold value and
identify the values of elements.
index = np.where(scores<=thresh)
values = x[index]
a quantile determines how many values in a distribution are
above or below a certain limit.


• Finally, we can visualize the results in a plot by highlighting the
anomalies with a color.
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()


1.ii Isolation Forest


• The sklearn.ensemble.IsolationForest class helps determine the anomaly score of
a sample using the Isolation Forest algorithm.
• The algorithm operates in the context of anomaly detection by computing the
number of splits required to isolate a single sample; that is, how many times we
need to perform splits on features in the dataset before we end up with a region
that contains only the single target sample.
• The intuition behind this method is that:
• Inliers have more feature value similarities, which requires them to go through more splits to be
isolated.
• Outliers, on the other hand, should be easier to isolate with a small number of splits because they will
likely have some feature value differences that distinguish them from inliers.
• By measuring the “path length” of recursive splits from the root of the tree, we
have a metric with which we can attribute an anomaly score to data points.
• Anomalous data points should have shorter path lengths than regular data
points.
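A small sketch of how these path-length-based scores surface in scikit-learn; the toy one-dimensional data is an illustrative assumption.

import numpy as np
from sklearn.ensemble import IsolationForest

# One obvious outlier among otherwise similar one-dimensional points.
X = np.array([[-1.1], [0.3], [0.5], [100.0]])

iso = IsolationForest(random_state=0).fit(X)
print(iso.predict(X))        # the far point is expected to be labeled -1
print(iso.score_samples(X))  # it should also receive the lowest (most negative) score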



• In the sklearn implementation, the threshold for points to be considered anomalous is
defined by the contamination ratio. With a contamination ratio of 0.01, the shortest 1% of
paths will be considered anomalies.
• Let’s apply this algorithm to the non-Gaussian contaminated dataset; see the next slides.

WIKI: Anomaly detection with Isolation Forest is a process composed of two main stages:
1. In the first stage, a training dataset is used to build the isolation trees.
2. In the second stage, each instance in the test set is passed through these trees, and an “anomaly score” is assigned to the instance.
Once all the instances in the test set have been assigned an anomaly score, any point whose score is greater than a predefined threshold can be marked as an “anomaly”; the threshold depends on the domain the analysis is being applied to.
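A minimal sketch of this two-stage train/score workflow, with an assumed synthetic training set and two hypothetical test instances. Note that scikit-learn's score_samples returns the negative of the paper's anomaly score, so in this sketch lower scores mean more anomalous.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.randn(500, 2)            # stage 1: training data used to build the isolation trees
X_test = np.array([[0.1, -0.2],        # an ordinary-looking test instance
                   [8.0, 8.0]])        # a test instance far outside the training distribution

forest = IsolationForest(random_state=0).fit(X_train)

# Stage 2: pass each test instance through the trees and read off its score.
scores = forest.score_samples(X_test)
threshold = np.quantile(forest.score_samples(X_train), 0.01)   # domain-dependent cut-off (assumed here)
print(scores)
print(scores < threshold)   # the second instance should fall below the threshold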

Hypothetical Non-Gaussian Distribution Dataset
• We will synthesize a non-Gaussian, multimodal dataset (there is more than one “center” of regular inliers) and then include a 0.01 ratio of outliers in the mixture.
• Note: each individual cluster has a Gaussian distribution.
• This code generates a three-cluster (multimodal) dataset:

import numpy as np

num_dimensions = 2
num_samples = 1000
outlier_ratio = 0.01
num_inliers = int(num_samples * (1 - outlier_ratio))
num_outliers = num_samples - num_inliers

# randn: generate the normally distributed inliers (standard normal distribution)
x_0 = np.random.randn(num_inliers//3, num_dimensions) - 3
x_1 = np.random.randn(num_inliers//3, num_dimensions)
x_2 = np.random.randn(num_inliers//3, num_dimensions) + 4

# Add outliers sampled from a random uniform distribution
x_rand = np.random.uniform(low=-10, high=10, size=(num_outliers, num_dimensions))
x = np.r_[x_0, x_1, x_2, x_rand]  # concatenate

# Generate labels, 1 for inliers and -1 for outliers
labels = np.ones(num_samples, dtype=int)
labels[-num_outliers:] = -1



Plotting The Dataset
import matplotlib.pyplot as plt

plt.plot(x[:num_inliers,0], x[:num_inliers,1], 'wo', mec='k', label='inliers')   # white circles; black edges keep them visible on newer matplotlib defaults
plt.plot(x[-num_outliers:,0], x[-num_outliers:,1], 'ko', label='outliers')       # black circles
plt.xlim(-11, 11)
plt.ylim(-11, 11)
plt.legend(numpoints=1)
plt.show()

Back to the example
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(99)
classifier = IsolationForest(max_samples=num_samples,
                             contamination=outlier_ratio, random_state=rng)
classifier.fit(x)
y_pred = classifier.predict(x)
num_anom = sum(y_pred == -1)
print('Number of anomalies: {}'.format(num_anom))
> Number of anomalies: 10


2- Density-Based Methods
• Several different density-based methods have been adapted for use in anomaly
detection.
• The main idea behind all of them is to form a cluster representation of the
training data, under the hypothesis that outliers or anomalies will be located in
low-density regions of this cluster representation.
• Even though the k-nearest neighbors (k-NN) algorithm is not a clustering
algorithm, it is commonly considered a density-based method and is a
popular way to measure the probability that a data point is an outlier.
• In essence, the algorithm estimates the local sample density around a point by measuring its
distance to the kth nearest neighbor (see the sketch after this list).
• You can also use k-means clustering for anomaly detection in a similar way, using
distances between the point and centroids as a measure of sample density.
• In this section, we will focus on a method called the local outlier factor (LOF),
which is a classic density-based machine learning method for anomaly detection.
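Before turning to LOF, here is the sketch referred to above: an assumed, minimal example of the k-NN distance idea, using the distance to the kth nearest neighbor as an outlier score (the cluster, the isolated point, and k=5 are illustrative choices).

import numpy as np
from sklearn.neighbors import NearestNeighbors

# A tight Gaussian cluster plus one isolated point.
rng = np.random.RandomState(0)
X = np.r_[rng.randn(50, 2), [[6.0, 6.0]]]

k = 5  # illustrative choice
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own nearest neighbor
distances, _ = nn.kneighbors(X)
knn_score = distances[:, -1]   # distance to the k-th genuine neighbor: larger = lower local density
print(np.argmax(knn_score))    # expected: index 50, the isolated point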

2.1 Local Outlier Factor (LOF)
• The LOF is an anomaly score that you can generate using the scikit-
learn class sklearn.neighbors.LocalOutlierFactor.
• Similar to the aforementioned k-NN and k-means anomaly detection
methods, LOF classifies anomalies using local density around a
sample.
• The local density of a data point refers to the concentration of other
points in the immediate surrounding region, where the size of this
region can be defined either by a fixed distance threshold or by the
closest n neighboring points.
• Data points with a significantly lower local density than that of their
closest n neighbors are considered to be anomalies.

LOF Example 1
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = [[-1.1], [0.2], [101.1], [0.3]]
clf = LocalOutlierFactor(n_neighbors=2)
print(clf.fit_predict(X))              # array([ 1,  1, -1,  1])
print(clf.negative_outlier_factor_)
# array([ -0.9821...,  -1.0370..., -73.3697...,  -0.9821...])

negative_outlier_factor_: the higher, the more normal. Inliers tend to have a score close to -1, while outliers receive more negative scores; lower values indicate stronger outliers.

n_neighbors : the number of neighbors to use for density estimation

LOF Example 2
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
x, _ = make_blobs(n_samples=200, n_features=2, centers=1,
                  cluster_std=.3, center_box=(10, 10))
plt.scatter(x[:,0], x[:,1])
plt.show()

lof = LocalOutlierFactor(n_neighbors=20, contamination=.03)
y_pred = lof.fit_predict(x)
lofs_index = np.where(y_pred == -1)
values = x[lofs_index]
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()


…Using Scores and Thresholds


model = LocalOutlierFactor(n_neighbors=20)
model.fit_predict(x)
lof = model.negative_outlier_factor_
thresh = np.quantile(lof, .2)
print(thresh)
index = np.where(lof <= thresh)
values = x[index]
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()

… Let’s run an example on a similar non-Gaussian, contaminated dataset once again:

from sklearn.neighbors import LocalOutlierFactor

classifier = LocalOutlierFactor(n_neighbors=100,
                                contamination=outlier_ratio)
y_pred = classifier.fit_predict(x)
num_anom = sum(y_pred == -1)
print('Number of anomalies: {}'.format(num_anom))
> Number of anomalies: 10


LOF for Novelty Detection


• By default, LOF is only meant to be used for outlier detection
(novelty=False).
• Set novelty to True if you want to use LOF for novelty detection.
• In this case, be aware that you should only use predict, decision_function, and score_samples on new, unseen data, not on the training set; the results obtained this way may differ from the standard LOF results.
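A minimal sketch of that novelty-detection workflow; the clean training set and the two new test points are illustrative assumptions.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)    # assumed to be clean, anomaly-free training data

lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_train)

# Score only previously unseen data; do not call predict on the training set.
X_new = np.array([[0.2, -0.1],   # looks like the training distribution
                  [5.0, 5.0]])   # far from the training distribution
print(lof.predict(X_new))            # expected: [ 1 -1]
print(lof.decision_function(X_new))  # negative values indicate novelties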

