10 - Anomaly Detection
Introduction
• In a general context, anomaly detection is any method for finding events that do not conform to an expectation.
• In the context of network security, anomaly detection refers to identifying unexpected intruders or breaches.
One important application of anomaly detection is in the field
of fraud detection. Fraud in the financial industry can often be
fished out of a vast pool of legitimate transactions by studying
patterns of normal events and detecting when deviations
occur.
Anomaly Detection Algorithms
1. Unsupervised machine learning
i. One-class support vector machines
ii. Isolation Forest
2. Density-based methods
i. Local Outlier Factor (LOF)
3. Forecasting (supervised machine learning)
4. Goodness-of-fit tests
i. Elliptic envelope fitting (covariance estimate fitting)
5. Statistical metrics
1.1 One-Class Support Vector Machine (SVM)
▪ This method is more suitable for novelty detection than outlier detection; that is, the
training data should ideally be thoroughly cleaned and contain no anomalies.
▪ A one-class SVM works on the basic idea of finding the minimal hypersphere that encloses the
single class of examples in the training data, and it considers any sample falling outside this
hypersphere to be an outlier, i.e., outside the training data distribution.
Example
from sklearn.svm import OneClassSVM
X = [[0], [0.44], [0.45], [0.46], [1]]
clf = OneClassSVM(gamma='auto').fit(X)
print(clf.predict(X))        # array([-1, 1, 1, 1, -1])
print(clf.score_samples(X))
# [1.77987316 2.05479873 2.05560497 2.05615569 1.73328509]
The score_samples method exposes the estimator's raw scoring function. Note that OneClassSVM itself
has no contamination parameter; its nu parameter bounds the fraction of training points treated as
outliers (in sklearn's other outlier detectors, the contamination parameter sets this classification threshold).
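• As a hedged aside (continuing the snippet above): sklearn exposes the learned score threshold as the offset_ attribute, and decision_function(X) equals score_samples(X) - offset_, so the negative values are exactly the points predicted as outliers.
print(clf.offset_)                   # the learned score threshold
print(clf.decision_function(X))      # score_samples(X) - clf.offset_
print(clf.decision_function(X) < 0)  # True exactly where predict(X) == -1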
Example:
• Create a random sample dataset for this example by using the make_blobs() function.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
np.random.seed(13)
x, y = make_blobs(n_samples=200, n_features=2, centers=1, cluster_std=.3)  # y: cluster number
plt.scatter(x[:,0], x[:,1], c=y)
plt.show()
(Figure: the same call with centers=3 produces three separate clusters.)
…
• Define the model, fit, and predict:
svm = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.03)
svm.fit(x)
pred = svm.predict(x)
values = x[pred == -1]  # keep the points predicted as anomalies
…
• Finally, we can visualize the results in a plot by highlighting the
anomalies with a color.
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()
1.2 Isolation Forest
• In the sklearn implementation, the threshold for points to be considered anomalous is
defined by the contamination ratio. With a contamination ratio of 0.01, the 1% of points
with the shortest average path lengths will be considered anomalies.
• Let's apply this algorithm to the non-Gaussian, contaminated dataset – see the next slides.
From Wikipedia: Anomaly detection with Isolation Forest is a process composed of two main stages:
1. In the first stage, a training dataset is used to build isolation trees.
2. In the second stage, each instance in the test set is passed through these trees, and a proper
"anomaly score" is assigned to the instance.
Once all the instances in the test set have been assigned an anomaly score, it is possible to mark as
an "anomaly" any point whose score is greater than a predefined threshold, which depends on the domain
the analysis is being applied to.
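• A minimal sketch of these two stages on assumed toy data (names and values here are illustrative only; the dataset for our running example is built on the next slide):
import numpy as np
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(0)
X_train = rng.randn(500, 2)                     # stage 1: training data used to build the isolation trees
X_test = np.r_[rng.randn(50, 2), [[6.0, 6.0]]]  # test data plus one obvious outlier
forest = IsolationForest(random_state=rng).fit(X_train)
scores = forest.score_samples(X_test)  # stage 2: anomaly score per test instance (lower = more anomalous)
threshold = np.percentile(scores, 2)   # the domain-dependent threshold, chosen here only for illustration
print(X_test[scores < threshold])      # the points marked as anomalies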
Hypothetical Non-Gaussian Distribution Dataset
• We will synthesize a non-Gaussian, multimodal dataset (there is more than one "center" of
regular inliers) and then include a 0.01 ratio of outliers in the mixture.
• Note: each cluster has a Gaussian distribution.
• This code will generate a multimodal dataset with three clusters:
import numpy as np
num_dimensions = 2
num_samples = 1000
outlier_ratio = 0.01
num_inliers = int(num_samples * (1 - outlier_ratio))
num_outliers = num_samples - num_inliers
# randn: generate the normally distributed inliers (standard normal distribution)
x_0 = np.random.randn(num_inliers//3, num_dimensions) - 3
x_1 = np.random.randn(num_inliers//3, num_dimensions)
x_2 = np.random.randn(num_inliers//3, num_dimensions) + 4
# Add outliers sampled from a random uniform distribution
x_rand = np.random.uniform(low=-10, high=10, size=(num_outliers, num_dimensions))
x = np.r_[x_0, x_1, x_2, x_rand]  # concatenate
# Generate labels: 1 for inliers and -1 for outliers
labels = np.ones(num_samples, dtype=int)
labels[-num_outliers:] = -1
Plotting The Dataset
import matplotlib.pyplot as plt
plt.plot(x[:num_inliers,0], x[:num_inliers,1], 'wo', markeredgecolor='k', label='inliers')  # edge color so white markers stay visible
plt.plot(x[-num_outliers:,0], x[-num_outliers:,1], 'ko', label='outliers')
plt.xlim(-11, 11)
plt.ylim(-11, 11)
plt.legend(numpoints=1)
plt.show()
Back to the example
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(99)
classifier = IsolationForest(max_samples=num_samples,
contamination=outlier_ratio, random_state=rng)
classifier.fit(x)
y_pred = classifier.predict(x)
num_anom = sum(y_pred == -1)
print('Number of anomalies detected: {}'.format(num_anom))
> Number of anomalies detected: 10
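• Because the ground-truth labels from the data-generation step are still in scope, a hedged follow-up (not on the original slide) can count true misclassifications rather than just flagged points:
num_errors = sum(y_pred != labels)
print('Number of misclassified points: {}'.format(num_errors))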
2- Density-Based Methods
• Several different density-based methods have been adapted for use in anomaly
detection.
• The main idea behind all of them is to form a cluster representation of the
training data, under the hypothesis that outliers or anomalies will be located in
low-density regions of this cluster representation.
• Even though the k-nearest neighbors (k-NN) algorithm is not a clustering
algorithm, it is commonly considered a density-based method and is actually
quite a popular way to measure the probability that a data point is an outlier.
• In essence, the algorithm can estimate the local sample density of a point by measuring its
distance to the kth nearest neighbor.
• You can also use k-means clustering for anomaly detection in a similar way, using
distances between the point and the centroids as a measure of sample density – see the sketch after this list.
• In this section, we will focus on a method called the local outlier factor (LOF),
which is a classic density-based machine learning method for anomaly detection.
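• A hedged sketch of the two density proxies mentioned above (the toy data and parameter choices are assumptions, not from the slides):
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans
rng = np.random.RandomState(0)
X = np.r_[rng.randn(100, 2), [[8.0, 8.0]]]  # 100 Gaussian inliers plus one planted outlier
# (a) k-NN: the distance to the kth nearest neighbor estimates local density
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own nearest neighbor
knn_score = nn.kneighbors(X)[0][:, -1]           # larger distance = lower density = more anomalous
print(X[np.argmax(knn_score)])                   # recovers the planted outlier [8. 8.]
# (b) k-means: the distance to the nearest centroid plays the same role
km = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)
km_score = km.transform(X).min(axis=1)           # transform returns distances to each centroid
print(X[np.argmax(km_score)])                    # again the planted outlier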
2.1 Local Outlier Factor (LOF)
• The LOF is an anomaly score that you can generate using the scikit-
learn class sklearn.neighbors.LocalOutlierFactor.
• Similar to the aforementioned k-NN and k-means anomaly detection
methods, LOF classifies anomalies using local density around a
sample.
• The local density of a data point refers to the concentration of other
points in the immediate surrounding region, where the size of this
region can be defined either by a fixed distance threshold or by the
closest n neighboring points.
• Data points with a significantly lower local density than that of their
closest n neighbors are considered to be anomalies.
LOF Example 1
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
X = [[-1.1], [0.2], [101.1], [0.3]]
clf = LocalOutlierFactor(n_neighbors=2)
print(clf.fit_predict(X))            # array([ 1, 1, -1, 1])
print(clf.negative_outlier_factor_)
# array([ -0.9821..., -1.0370..., -73.3697..., -0.9821...])
negative_outlier_factor_: the higher, the more normal. Inliers tend to have a negative_outlier_factor_
close to -1, while outliers have much lower (more negative) values; lower values indicate stronger outliers.
LOF Example 2
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
x, _ = make_blobs(n_samples=200, n_features=2, centers=1, cluster_std=.3, center_box=(10, 10))
plt.scatter(x[:,0], x[:,1])
plt.show()
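• The slide stops at plotting. A hedged continuation (n_neighbors and contamination are assumed values, mirroring the one-class SVM example) would fit LOF and highlight the flagged points:
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.03)
pred = lof.fit_predict(x)
values = x[pred == -1]  # points flagged as anomalies
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()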
… Let's run an example on a similar non-Gaussian, contaminated dataset once again:
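The result slides are figures only. A hedged sketch of what that run could look like, reusing x, labels, and outlier_ratio from the Isolation Forest example (n_neighbors=20 is an assumption):
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=20, contamination=outlier_ratio)
y_pred = lof.fit_predict(x)
print('Number of anomalies detected: {}'.format(sum(y_pred == -1)))
print('Number of misclassified points: {}'.format(sum(y_pred != labels)))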