Isolation Forest in Python
A dataset with continuous output values is known as a regression dataset. Here we will take
the house price dataset and detect whether it contains any outliers based on the prices of the
houses.
Before going to the implementation part, ensure that you have installed the following Python
modules on your system:
numpy
pandas
seaborn
sklearn
plotly
You can install the required modules by running the following commands in a Jupyter
notebook cell.
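The install commands themselves were not shown above; a typical set, assuming a pip-based environment, would be:

```shell
pip install numpy
pip install pandas
pip install seaborn
pip install scikit-learn
pip install plotly
```

Note that the sklearn module is installed under the package name scikit-learn.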
Once the installation is complete, we can then start the implementation part.
import pandas as pd
dataset = pd.read_csv('Dushanbe_house.csv')
dataset.head()
Notice that our dataset contains some NULL values and an unnecessary column of index values.
Let’s remove them from the dataset:
dataset.dropna(inplace=True)
# dropping the unneeded index column (assuming it is the first, unnamed column)
dataset.drop(dataset.columns[0], axis=1, inplace=True)
# heading
dataset.head()
import plotly.express as px
# plotting the house prices (the figure construction was missing here; a simple
# scatter of the prices is assumed)
fig = px.scatter(x=dataset.index, y=dataset['price'])
fig.show()
The visualization shows that there are some anomalies in the dataset. Let’s apply the Isolation
Forest algorithm to detect them.
Let’s train the Isolation Forest model using our dataset. Note that we will not split the
dataset into training and testing parts, as the Isolation Forest is an Unsupervised
Machine Learning algorithm.
from sklearn.ensemble import IsolationForest
# initializing the isolation forest with a 0.3% contamination rate
isolation_model = IsolationForest(contamination=0.003)
isolation_model.fit(dataset)
# making predictions
IF_predictions = isolation_model.predict(dataset)
The contamination parameter defines a rough estimate of the proportion of outliers in our
dataset. So, we have set contamination to 0.003 (0.3%) in our case.
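To see how the contamination parameter controls the number of flagged points, here is a small self-contained sketch on synthetic data (not the house dataset):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 997 ordinary points plus 3 far-away points
X = np.vstack([rng.normal(0, 1, (997, 2)), rng.normal(10, 1, (3, 2))])

model = IsolationForest(contamination=0.003, random_state=42)
labels = model.fit_predict(X)

# with contamination=0.003, roughly 0.3% of the 1000 points are flagged
n_outliers = int((labels == -1).sum())
print(n_outliers)
```

The model thresholds its anomaly scores so that about the requested fraction of points ends up labeled -1.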
# printing
print(IF_predictions)
Notice that all the predictions are either 1 or -1, where 1 marks a normal data point and -1
marks an outlier.
Let’s plot the outliers in a different color on the same scatter plot:
dataset['anomalies'] = IF_predictions
# selecting the data points flagged as anomalies
anomalies = dataset[dataset['anomalies'] == -1]
import plotly.graph_objects as go
normal = go.Scatter(x=dataset.index.astype(str), y=dataset['price'], name="Dataset", mode='markers')
outlier = go.Scatter(x=anomalies.index.astype(str), y=anomalies['price'], name="Anomalies", mode='markers',
                     marker=dict(color='red', size=6,
                                 line=dict(color='red', width=1)))
# plotting
fig = go.Figure(data=[normal, outlier])
fig.show()
Looking at the above plot, you might wonder why the algorithm has classified some data
points as anomalies when some of the red dots do not seem to be outliers. However, that
impression is misleading. The algorithm classified those (red) points as anomalies based on
all input variables, while we are visualizing only the price column. Some points may not be
anomalies based on the price alone, but other features of the same data points make them
anomalous.
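This effect can be reproduced on synthetic data. The hypothetical point below has an ordinary price but an extreme area, so the model flags it even though it would look normal on a price-only plot:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
prices = rng.normal(100, 10, 200)   # unremarkable prices
areas = rng.normal(50, 5, 200)
areas[0] = 500                      # extreme only in the 'area' feature
X = np.column_stack([prices, areas])

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(X)
print(labels[0])
```

The first point is flagged as -1 because of its area, even though its price falls squarely inside the normal range.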
Let’s remove some columns and detect the anomalies again, to see how the input variables
affect the detection process. We will drop the area, the number of floors, and the number of
rooms, and train the model using only the location and price information.
data = dataset.copy()
# dropping columns (the exact column names are assumed here; adjust to the dataset)
data.drop(['area', 'floors', 'rooms'], axis=1, inplace=True)
data.head()
Now we can train the model using the same contamination parameter value (0.3%).
# initializing the isolation forest
isolation_model1 = IsolationForest(contamination=0.003)
isolation_model1.fit(data)
# making predictions
IF_predictions1 = isolation_model1.predict(data)
data['anomalies'] = IF_predictions1
# selecting the flagged data points
anomalies1 = data[data['anomalies'] == -1]
normal = go.Scatter(x=data.index.astype(str), y=data['price'], name="Dataset", mode='markers')
outlier = go.Scatter(x=anomalies1.index.astype(str), y=anomalies1['price'], name="Anomalies", mode='markers',
                     marker=dict(color='red', size=5,
                                 line=dict(color='red', width=1)))
# plotting
fig = go.Figure(data=[normal, outlier])
fig.show()
Notice that the algorithm has caught a different set of anomalies this time, because we
trained the model on a reduced feature set (location and price data).
You can experiment further by training the model on each of the columns separately and then
visualizing the anomalies as we did here.
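One way to run that experiment is a simple loop over the columns. The sketch below uses a stand-in DataFrame with hypothetical column names; substitute the real house dataset:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(1)
# stand-in data; replace with the house dataset's columns
df = pd.DataFrame({'price': rng.normal(100, 10, 300),
                   'area': rng.normal(50, 5, 300)})

counts = {}
for col in df.columns:
    model = IsolationForest(contamination=0.01, random_state=1)
    # fit on a single column at a time
    labels = model.fit_predict(df[[col]])
    counts[col] = int((labels == -1).sum())
print(counts)
```

Each single-column model flags a slightly different set of points, which makes the influence of each feature on the detection visible.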
A dataset that contains categorical values as output is known as a classification dataset. In this
section, we will use a dataset about credit card transactions. The dataset contains transactions
made by credit cards in September 2013 by European cardholders. This dataset presents
transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions.
The dataset is highly unbalanced because the positive class (frauds) accounts for 0.172% of all
transactions. Let’s use the Isolation Forest algorithm to detect fraud transactions and calculate
how accurately the model predicts them.
First, let’s import the dataset and print out the first few rows to get familiar with the data.
credit_card = pd.read_csv("creditcard.csv")
# heading
credit_card.head()
Notice that the dataset contains 30 different columns storing transactions data, and the last
column is the output class. Columns V1, V2, V3, …, and V28 are a result of the PCA
transformation. According to the official dataset website (OpenML), features V1, V2, …, and V28
are the principal components obtained by PCA, and the only features which have not been
transformed with PCA are ‘Time’ and ‘Amount.’
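As a quick reminder of what such a PCA transformation looks like, here is a minimal sketch on random data (the credit card features themselves come already transformed):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))    # 100 samples, 5 raw features

# project onto the top 3 principal components, analogous to V1, V2, ...
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
print(X_pca.shape)  # → (100, 3)
```

PCA is typically applied to such datasets to anonymize the original features, which is why only ‘Time’ and ‘Amount’ remain interpretable here.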
Let’s plot the output category using a bar chart:
import seaborn as sns
sns.set(rc={'figure.figsize':(15,8)})
sns.countplot(x=credit_card['Class'])
Notice that the fraud category is far smaller than the normal transactions. Let’s
print out the total number of fraud and normal transactions:
fraud = credit_card[credit_card['Class']=="'1'"]
valid = credit_card[credit_card['Class']=="'0'"]
# printing
print("Fraud transactions:", len(fraud))
print("Valid transactions:", len(valid))
Let’s now use the info() method to get information about the data type in each column:
# info method
credit_card.info()
Notice that except for the output class, every column has numeric data. So, before training the
model, we need to change the last column to numeric values too.
from sklearn import preprocessing
# encoding the output column to numeric values
label_encoder = preprocessing.LabelEncoder()
credit_card['Class'] = label_encoder.fit_transform(credit_card['Class'])
We can split the dataset into training and testing parts to evaluate the Isolation Forest once it
has detected the outliers, because we know upfront that the fraud category represents the
outliers. After that, we will compare the actual fraud cases with the outliers predicted by the model.
columns = credit_card.columns.tolist()
# excluding the output class from the input features
columns = [c for c in columns if c != 'Class']
X = credit_card[columns]
Y = credit_card['Class']
# importing the modules
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
# splitting the dataset (the split ratio is assumed here)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
clf = IsolationForest(max_samples=len(X_train))
clf.fit(X_train)
We have applied only the input data to train the model because it is an Unsupervised Learning
algorithm. Once training is complete, we can make predictions:
# making predictions
y_pred = clf.predict(X_test)
We already know that the Isolation Forest predicts -1 for outliers and 1 for normal values. In
the actual dataset, however, 0 represents a normal transaction and 1 represents fraud, so we
will map the predictions to 0 and 1 accordingly.
y_pred[y_pred==1] = 0
y_pred[y_pred==-1] = 1
Let’s find the accuracy of the Isolation Forest by comparing its predictions with the actual values:
from sklearn.metrics import accuracy_score
print(accuracy_score(Y_test, y_pred))
The result shows that the model has correctly classified fraud and valid transactions in 99.7% of cases.
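Keep in mind that accuracy can be flattering on such an unbalanced dataset: with only 0.172% fraud, even a model that labels every transaction as valid scores close to 99.8%. The toy check below (synthetic labels, not the real dataset) illustrates this:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 1000 synthetic labels with roughly the dataset's fraud rate
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1                        # 2 fraud cases (~0.2%)
baseline = np.zeros(1000, dtype=int)  # predicting 'valid' for everything
print(accuracy_score(y_true, baseline))  # → 0.998
```

For this reason, it is worth also checking how many of the actual fraud cases the model catches, not just the overall accuracy.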
If you want to know how we can apply the Isolation forest to the Time series, look at the
Implementing anomaly detection using Python article.
Summary
The Isolation Forest is an Unsupervised Machine Learning algorithm that detects outliers in
a dataset by building an ensemble of randomly constructed isolation trees. In this article, we’ve
covered the Isolation Forest algorithm, its logic, and how to apply it to regression and
classification datasets.