
8/5/2021 Isolation Forest Algorithm for Anomaly Detection | by Prakash verma | Heartbeat

Isolation Forest Algorithm for Anomaly Detection
Detecting anomalies using tree-based algorithms

Prakash verma


Oct 27, 2020 · 7 min read

https://heartbeat.fritz.ai/isolation-forest-algorithm-for-anomaly-detection-2a4abd347a5 1/16

Photo by Erol Ahmed on Unsplash


Introduction:
Have you ever wondered how credit card fraud is caught in real time?
Do you want to know how to catch an intruder program that is trying to
access your system? Both are possible by applying an anomaly detection
machine learning model.

Anomaly detection is one of the most popular machine learning techniques.


In this article, we will learn concepts related to anomaly detection and how
to implement it as a machine learning model.

What is Anomaly Detection?


In simple words, anomaly detection is the finding of abnormal events, data,
or activity during a process, such as running an app or program. In the
picture above, only one egg is red and the others are white. Identifying the
red egg is an outcome of anomaly detection, because it differs from the
pattern. Eggs are generally white, so the presence of a red egg violates the
pattern.

Let us try to understand this with one more example. An egg that floats in
water might be old and rotten. This suggests that the weight of an egg
varies with its freshness, so weight can help differentiate a fresh egg
from a rotten one.


Photo by science4fun

Suppose we have a list of the weight of each egg. From these values, we
want to identify the number of rotten eggs and their percentage of the
lot. We can solve this using machine learning.
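
Before any model is involved, the counting task itself can be sketched with a fixed cutoff. The weights and the 80-gram cutoff below are illustrative assumptions, not the article's data; a learned model replaces the hand-picked cutoff.

```python
# Hypothetical egg weights in grams; not the article's dataset.
egg_weights = [100, 98, 102, 45, 101, 99, 97, 50, 103, 100]

# Assumed cutoff: treat any egg lighter than 80 g as suspect.
CUTOFF_GRAMS = 80

suspect = [w for w in egg_weights if w < CUTOFF_GRAMS]
suspect_share = 100 * len(suspect) / len(egg_weights)

print(f"{len(suspect)} suspect eggs ({suspect_share:.1f}% of the lot)")
```

A fixed cutoff only works when we already know what "too light" means, which is exactly the knowledge an anomaly detection model tries to infer from the data.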

What are the Different Use Cases of Anomaly Detection?


Real-world datasets can be very large and have complicated patterns, so it
is difficult to detect an anomaly just by looking at the data. That’s why
anomaly detection is an extremely important application of machine
learning.

Anomaly detection has a wide variety of applications across many
industries. For example, below is a list of three applications, though there
are many more.

1. Fault detection in manufacturing

2. Fraud detection in banking

3. Software systems health monitoring

Anomaly Detection Types

Anomaly detection methods are classified under the following two
headings, based on different machine learning algorithms.

Supervised: The supervised machine learning method requires pre-labeled
datasets that contain both normal and anomalous data points. Examples of
this method include anomaly detection using neural networks, K-nearest
neighbors, and Bayesian networks.
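
As a rough illustration of the supervised setting, the sketch below trains scikit-learn's KNeighborsClassifier on labeled data. The synthetic weights, the label encoding (0 for normal, 1 for anomaly), and the choice of three neighbors are all assumptions made for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic labeled data (an assumption for illustration):
# normal weights cluster near 100 g, anomalous ones near 50 g.
normal = rng.normal(loc=100, scale=3, size=(95, 1))
anomalous = rng.normal(loc=50, scale=3, size=(5, 1))
X = np.vstack([normal, anomalous])
y = np.array([0] * 95 + [1] * 5)  # 0 = normal, 1 = anomaly

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

# Classify two unseen weights.
new_points = np.array([[101.0], [48.0]])
pred = clf.predict(new_points)
print(pred)
```

The classifier can only recognize the kinds of anomalies that appear in its labels, which is the main practical limit of the supervised approach.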

Unsupervised: The unsupervised machine learning method of anomaly
detection does not depend on manually labeled training data. It is based
on the statistical assumption that most incoming data are normal and only
a minor percentage are anomalous. It also assumes that malicious data
differ statistically from normal data. Some of the unsupervised methods
include K-means clustering, the autoencoder method, and
hypothesis-based analysis.
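
One common way to use K-means without labels is to flag the points that lie far from every cluster center. The sketch below follows that idea; the synthetic data, the choice of two clusters, and the three-standard-deviation rule are assumptions for illustration rather than a recommended recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic, unlabeled weights in grams (an assumption for illustration):
# two normal groups (small and large eggs) plus two unusual values.
small = rng.normal(loc=50, scale=1.5, size=49)
large = rng.normal(loc=70, scale=1.5, size=49)
weights = np.concatenate([small, large, [20.0, 110.0]]).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(weights)

# transform() gives the distance to every centroid; keep the nearest one.
nearest_distance = kmeans.transform(weights).min(axis=1)

# Assumed rule of thumb: flag points more than three standard deviations
# above the mean distance.
threshold = nearest_distance.mean() + 3 * nearest_distance.std()
flagged = weights[nearest_distance > threshold].ravel()
print(flagged)
```

Note that this approach still builds a profile of normal data (the cluster centers) and measures deviation from it.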

Isolation Forest:
Most common anomaly detection techniques are based on constructing a
profile of what normal data looks like. Anomalies are then the instances
that do not conform to that profile.

However, the isolation forest does not work this way. It identifies
anomalies by isolating outliers in the data. Isolation forest is an
unsupervised machine learning algorithm.


One of the advantages of the isolation forest is that it not only detects
anomalies quickly but also requires less memory than many other anomaly
detection algorithms.

Isolation forest works on the principle of the decision tree algorithm. It
isolates the outliers by randomly selecting a feature from the given set of
features and then randomly selecting a split value between the maximum
and minimum values of the selected feature. This random partitioning
produces shorter paths in the trees for anomalous data values, which
distinguishes them from the normal data.

Isolation forest also works on the principle of recursion. The algorithm
recursively partitions the dataset by randomly selecting a feature and then
randomly selecting a split value for that feature. Anomalies need fewer
random partitions to be isolated than normal data points do. Therefore,
the anomalies are the points that have a shorter path in the tree. Here,
the path length is the number of edges traversed from the root node.
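
The claim that anomalies need fewer random partitions can be checked with a small simulation. The sketch below is a simplified, single-feature illustration and not scikit-learn's implementation: it follows only the branch that contains one target value and counts the random splits until that value is alone. The helper isolation_depth and the synthetic weights are assumptions made for this example.

```python
import random

def isolation_depth(points, target, rng):
    """Count the random splits needed until `target` is alone in its partition."""
    current = list(points)
    depth = 0
    while len(current) > 1 and min(current) < max(current):
        split = rng.uniform(min(current), max(current))
        # Keep only the side of the split that contains the target.
        if target < split:
            current = [p for p in current if p < split]
        else:
            current = [p for p in current if p >= split]
        depth += 1
    return depth

rng = random.Random(0)

# Synthetic weights in grams (an assumption): 100 values near 100 g and one at 40 g.
normal = [rng.gauss(100, 3) for _ in range(100)]
outlier = 40.0
points = normal + [outlier]
typical = sorted(normal)[len(normal) // 2]  # a value from the middle of the cluster

trials = 200
avg_outlier = sum(isolation_depth(points, outlier, rng) for _ in range(trials)) / trials
avg_typical = sum(isolation_depth(points, typical, rng) for _ in range(trials)) / trials

print(f"average splits to isolate the outlier: {avg_outlier:.2f}")
print(f"average splits to isolate a typical value: {avg_typical:.2f}")
```

Averaging over many random trials plays the same role as averaging over many trees in the forest: a single random split sequence is noisy, but the mean path length separates the outlier from the typical value.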


Implementation in Python

Let us start by importing the required libraries: numpy, pandas, seaborn, and
matplotlib. We also need to import the isolation forest from
sklearn.ensemble.

import numpy as np

import pandas as pd

import seaborn as sns

from sklearn.ensemble import IsolationForest

import matplotlib.pyplot as plt

Our second task is to read the data file from CSV into a pandas DataFrame.
The data is a collection of egg weights in grams. It has a few anomalies
(such as a weight that is too low), which the algorithm will detect.

# Read Egg weight detail from .csv file

df = pd.read_csv('egg_weight.csv')

df.head(15)
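
If the CSV file is not at hand, a stand-in DataFrame with the same Egg_weight column can be generated to follow along. The distribution, the three light eggs, and the row count below are assumptions and do not reproduce the article's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Assumed distribution: most eggs weigh about 100 g, and three are far too light.
normal_weights = rng.normal(loc=100, scale=2, size=47).round(1)
light_weights = np.array([40.0, 45.0, 50.0])

df = pd.DataFrame({"Egg_weight": np.concatenate([normal_weights, light_weights])})

# Shuffle the rows so the light eggs are not grouped at the end.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(df.describe())
```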

Define and Fit Model

Here we will define a model variable and instantiate the isolation forest
class. The four main parameters passed to the model are listed below.

Maximum Number of Features (max_features): The number of features to draw
from the total features to train each tree. The default value is 1.0,
which means all features are used.

Maximum Number of Samples (max_samples): The number of samples to draw
to train each tree. If this value is more than the number of samples
provided, all samples will be used.

Contamination (contamination): The expected proportion of outliers in the
dataset. The model is quite sensitive to it, because it is used when
fitting to define the threshold on the scores of the samples.

Count of Estimators (n_estimators): The number of trees that will be built
in the forest. It is an optional parameter with a default value of 100.

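max_features = 1.0       # use all features for each tree
n_estimators = 50        # number of trees in the forest
max_samples = 'auto'     # min(256, n_samples) samples per tree
contamination = 0.2      # expected proportion of outliers

forest_model = IsolationForest(max_features=max_features,
                               n_estimators=n_estimators,
                               max_samples=max_samples,
                               contamination=contamination)

# Train the forest on the egg weight column
forest_model.fit(df[['Egg_weight']])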


After we define the model, it needs to be trained on the dataset. For
this, we use the fit() method. We pass one argument to fit(): our data of
interest, which is the egg weight column of the dataset.
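
Because contamination sets the score threshold, it largely determines how many points get flagged. The sketch below fits the same forest with three contamination values on synthetic weights; the data and the fixed random_state are assumptions added so the comparison is repeatable.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic weights in grams (an assumption): 97 normal eggs and 3 light ones.
X = np.concatenate([rng.normal(100, 2, 97), [40.0, 45.0, 50.0]]).reshape(-1, 1)

flagged = {}
for contamination in (0.03, 0.1, 0.2):
    model = IsolationForest(n_estimators=50,
                            contamination=contamination,
                            random_state=0)
    labels = model.fit_predict(X)  # -1 for anomalies, 1 for normal points
    flagged[contamination] = int((labels == -1).sum())
    print(f"contamination={contamination}: {flagged[contamination]} points flagged")
```

A value that is too high makes the model flag ordinary points, so it is worth choosing contamination from what is known about the data rather than leaving it as a guess.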

Find Scores
Now let’s find the values of the scores and anomaly columns. By passing
the egg weight column to the decision_function() method of the trained
model, we can find the values of the scores column.

Similarly, we can find the values of the anomaly column by passing the egg
weight column to the predict() method of the trained model.

df['scores']=forest_model.decision_function(df[['Egg_weight']])

df['anomaly_Value']=forest_model.predict(df[['Egg_weight']])

df.head(10)
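
The two columns are closely related. In scikit-learn, decision_function() is score_samples() shifted by the fitted offset_ attribute, and predict() returns -1 wherever decision_function() is negative. The sketch below checks both relations on synthetic weights; the data and parameter values are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Synthetic weights in grams (an assumption): 97 normal eggs and 3 light ones.
X = np.concatenate([rng.normal(100, 2, 97), [40.0, 45.0, 50.0]]).reshape(-1, 1)

model = IsolationForest(n_estimators=50, contamination=0.05, random_state=1).fit(X)

scores = model.decision_function(X)   # negative values indicate anomalies
labels = model.predict(X)             # -1 for anomalies, 1 for normal points
raw_scores = model.score_samples(X)   # scores before the threshold shift

# The shift applied by decision_function is the fitted offset_.
shift_matches = np.allclose(scores, raw_scores - model.offset_)
# predict flags exactly the points with a negative decision score.
labels_match = np.array_equal(labels == -1, scores < 0)

print("decision_function == score_samples - offset_:", shift_matches)
print("predict == -1 exactly where score < 0:", labels_match)
```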


After adding the scores and anomalies for all the rows in the complete
dataset, it will print the predicted anomalies.

Anomalies
To show the predicted anomalies in the egg weight column, the data needs
to be analyzed after the scores and anomaly columns are added. Note that
for a predicted anomaly the anomaly column value is -1 and the
corresponding score is negative.

By using this information one can show the predicted anomaly as below.
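
One way to list the predicted anomalies is to filter on the anomaly column. The sketch below rebuilds the columns on synthetic data so that it runs on its own; the weights and the random_state are assumptions, while the column names follow the article.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Synthetic weights in grams (an assumption): 47 normal eggs and 3 light ones.
df = pd.DataFrame(
    {"Egg_weight": np.concatenate([rng.normal(100, 2, 47), [40.0, 45.0, 50.0]])}
)

forest_model = IsolationForest(n_estimators=50, contamination=0.2, random_state=7)
forest_model.fit(df[["Egg_weight"]])
df["scores"] = forest_model.decision_function(df[["Egg_weight"]])
df["anomaly_Value"] = forest_model.predict(df[["Egg_weight"]])

# Keep only the rows predicted as anomalies, most anomalous first.
anomalies = df[df["anomaly_Value"] == -1].sort_values("scores")
print(anomalies)
```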


Evaluation
For model evaluation, let’s define an outlier as any egg with a weight
below 80. Remember that our goal is to find the number of outliers
present in the data according to this rule.

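# Number of eggs below the 80 threshold (1 for this dataset)
outliers_counter = len(df[df['Egg_weight'] < 80])

It’s time now to calculate the accuracy of the model. Note that this
figure is the ratio of the number of predicted anomalies to the number of
rule-based outliers, so it can exceed 100 when the model flags more eggs
than the rule does.

print("Accuracy percentage:",
      100 * list(df['anomaly_Value']).count(-1) / outliers_counter)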
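
A ratio of counts does not show whether the model flagged the same eggs as the rule. A row-by-row comparison, sketched below with precision and recall, is one alternative. This is an addition for illustration and not the article's evaluation method; the synthetic weights and the random_state are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)

# Synthetic weights in grams (an assumption): 47 normal eggs and 3 light ones.
weights = np.concatenate([rng.normal(100, 2, 47), [40.0, 45.0, 50.0]])
X = weights.reshape(-1, 1)

model = IsolationForest(n_estimators=50, contamination=0.2, random_state=3)
predicted = model.fit_predict(X) == -1  # True where the model flags an anomaly
rule = weights < 80                     # True where the rule marks an outlier

tp = int(np.sum(predicted & rule))      # flagged by both
fp = int(np.sum(predicted & ~rule))     # flagged by the model only
fn = int(np.sum(~predicted & rule))     # missed by the model

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Precision is the share of flagged eggs that the rule also marks, and recall is the share of rule-based outliers that the model found. When contamination is set well above the true outlier rate, precision is the figure that suffers.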

Conclusion:
In this article, we discussed one of the most powerful anomaly detection
algorithms: the isolation forest.

Isolation forest is used widely due to its faster anomaly detection and
smaller memory requirement.

I hope you will be able to use this algorithm if required.

Cheers!!

Editor’s Note: Heartbeat is a contributor-driven online publication and
community dedicated to exploring the emerging intersection of mobile app
development and machine learning. We’re committed to supporting and
inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Fritz AI, the
machine learning platform that helps developers teach devices to see, hear,
sense, and think. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can
also sign up to receive our weekly newsletters (Deep Learning Weekly and the
Fritz AI Newsletter), join us on Slack, and follow Fritz AI on Twitter for all
the latest in mobile machine learning.