Isolation Forest in Python
A dataset with continuous output values is known as a regression dataset. Here we will take
the house price dataset and detect whether it contains any outliers based on the prices of the
houses.
Before going to the implementation part, ensure that you have installed the following Python
modules on your system:
numpy
pandas
seaborn
sklearn
plotly
You can install the required modules by running the following commands in a Jupyter
notebook cell.
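The install commands themselves were not shown above; a typical set, assuming a pip-based environment, would be:

```shell
pip install numpy
pip install pandas
pip install seaborn
pip install scikit-learn
pip install plotly
```

Note that the sklearn module is installed under the package name scikit-learn.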
Once the installation is complete, we can then start the implementation part.
import pandas as pd
dataset = pd.read_csv('Dushanbe_house.csv')
dataset.head()
Notice that our dataset contains some NULL values and an unnecessary column of index values.
Let’s remove them from the dataset:
dataset.dropna(inplace=True)
# dropping the unneeded index column (assuming it is the first, unnamed column)
dataset.drop(dataset.columns[0], axis=1, inplace=True)
# heading
dataset.head()
import plotly.express as px
# plotting the house prices (the figure construction was missing here; a simple
# scatter of the prices is assumed)
fig = px.scatter(x=dataset.index, y=dataset['price'])
fig.show()
The visualization shows that there are some anomalies in the dataset. Let’s apply the Isolation
Forest algorithm to detect them.
Let’s train the Isolation Forest model using our dataset. Note that we will not split the
dataset into training and testing parts, as the Isolation Forest is an Unsupervised
Machine Learning algorithm.
from sklearn.ensemble import IsolationForest
# initializing the isolation forest with a 0.3% contamination rate
isolation_model = IsolationForest(contamination=0.003)
isolation_model.fit(dataset)
# making predictions
IF_predictions = isolation_model.predict(dataset)
The contamination parameter defines a rough estimate of the proportion of outliers in our
dataset. So, we have set contamination to 0.003 (0.3%) in our case.
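To see how the contamination parameter controls the number of flagged points, here is a small self-contained sketch on synthetic data (not the house dataset):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 997 ordinary points plus 3 far-away points
X = np.vstack([rng.normal(0, 1, (997, 2)), rng.normal(10, 1, (3, 2))])

model = IsolationForest(contamination=0.003, random_state=42)
labels = model.fit_predict(X)

# with contamination=0.003, roughly 0.3% of the 1000 points are flagged
n_outliers = int((labels == -1).sum())
print(n_outliers)
```

The model thresholds its anomaly scores so that about the requested fraction of points ends up labeled -1.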
# printing
print(IF_predictions)
Notice that all the predictions are either 1 or -1, where 1 marks a normal data point and -1
marks an outlier.
Let’s plot the outliers in a different color on the same scatter plot:
dataset['anomalies'] = IF_predictions
# selecting the data points flagged as anomalies
anomalies = dataset[dataset['anomalies'] == -1]
import plotly.graph_objects as go
normal = go.Scatter(x=dataset.index.astype(str), y=dataset['price'], name="Dataset", mode='markers')
outlier = go.Scatter(x=anomalies.index.astype(str), y=anomalies['price'], name="Anomalies", mode='markers',
                     marker=dict(color='red', size=6,
                                 line=dict(color='red', width=1)))
# plotting
fig = go.Figure(data=[normal, outlier])
fig.show()
Looking at the above plot, you might wonder why the algorithm has classified some data
points as anomalies when some of the red dots do not seem to be outliers. However, that
impression is misleading. The algorithm classified those (red) points as anomalies based on
all input variables, while we are visualizing only the price column. Some points may not be
anomalies based on the price alone, but other features of the same data points make them
anomalous.
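This effect can be reproduced on synthetic data. The hypothetical point below has an ordinary price but an extreme area, so the model flags it even though it would look normal on a price-only plot:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
prices = rng.normal(100, 10, 200)   # unremarkable prices
areas = rng.normal(50, 5, 200)
areas[0] = 500                      # extreme only in the 'area' feature
X = np.column_stack([prices, areas])

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(X)
print(labels[0])
```

The first point is flagged as -1 because of its area, even though its price falls squarely inside the normal range.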
Let’s remove some columns and detect the anomalies again, to see how the input variables
affect the detection process. We will drop the area, the number of floors, and the number of
rooms, and train the model using only the location and price information.
data = dataset.copy()
# dropping columns (the exact column names are assumed here; adjust to the dataset)
data.drop(['area', 'floors', 'rooms'], axis=1, inplace=True)
data.head()
Now we can train the model using the same contamination parameter value (0.3%).
# initializing the isolation forest
isolation_model1 = IsolationForest(contamination=0.003)
isolation_model1.fit(data)
# making predictions
IF_predictions1 = isolation_model1.predict(data)
data['anomalies'] = IF_predictions1
# selecting the flagged data points
anomalies1 = data[data['anomalies'] == -1]
normal = go.Scatter(x=data.index.astype(str), y=data['price'], name="Dataset", mode='markers')
outlier = go.Scatter(x=anomalies1.index.astype(str), y=anomalies1['price'], name="Anomalies", mode='markers',
                     marker=dict(color='red', size=5,
                                 line=dict(color='red', width=1)))
# plotting
fig = go.Figure(data=[normal, outlier])
fig.show()
Notice that the algorithm has caught a different set of anomalies this time, because we
trained the model on a reduced feature set (location and price data).
You can experiment further by training the model on each of the columns separately and then
visualizing the anomalies as we did here.
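One way to run that experiment is a simple loop over the columns. The sketch below uses a stand-in DataFrame with hypothetical column names; substitute the real house dataset:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(1)
# stand-in data; replace with the house dataset's columns
df = pd.DataFrame({'price': rng.normal(100, 10, 300),
                   'area': rng.normal(50, 5, 300)})

counts = {}
for col in df.columns:
    model = IsolationForest(contamination=0.01, random_state=1)
    # fit on a single column at a time
    labels = model.fit_predict(df[[col]])
    counts[col] = int((labels == -1).sum())
print(counts)
```

Each single-column model flags a slightly different set of points, which makes the influence of each feature on the detection visible.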
A dataset that contains categorical values as output is known as a classification dataset. In this
section, we will use a dataset about credit card transactions. The dataset contains transactions
made by credit cards in September 2013 by European cardholders. This dataset presents
transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions.
The dataset is highly unbalanced because the positive class (frauds) accounts for 0.172% of all
transactions. Let’s use the Isolation Forest algorithm to detect fraud transactions and calculate
how accurately the model predicts them.
First, let’s import the dataset and print out the first few rows to get familiar with the data.
credit_card = pd.read_csv("creditcard.csv")
# heading
credit_card.head()
Notice that the dataset contains 30 different columns storing transactions data, and the last
column is the output class. Columns V1, V2, V3, …, and V28 are a result of the PCA
transformation. According to the official dataset website (OpenML), features V1, V2, …, and V28
are the principal components obtained by PCA, and the only features which have not been
transformed with PCA are ‘Time’ and ‘Amount.’
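As a quick reminder of what such a PCA transformation looks like, here is a minimal sketch on random data (the credit card features themselves come already transformed):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))    # 100 samples, 5 raw features

# project onto the top 3 principal components, analogous to V1, V2, ...
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
print(X_pca.shape)  # → (100, 3)
```

PCA is typically applied to such datasets to anonymize the original features, which is why only ‘Time’ and ‘Amount’ remain interpretable here.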
Let’s plot the output category using a bar chart:
import seaborn as sns
sns.set(rc={'figure.figsize':(15,8)})
sns.countplot(x=credit_card['Class'])
Notice that the fraud category is far smaller than the normal transactions. Let’s
print out the total number of fraud and normal transactions:
fraud = credit_card[credit_card['Class']=="'1'"]
valid = credit_card[credit_card['Class']=="'0'"]
# printing
print("Fraud transactions:", len(fraud))
print("Valid transactions:", len(valid))
Let’s now use the info() method to get information about the data type in each column:
# info method
credit_card.info()
Notice that except for the output class, every column has numeric data. So, before training the
model, we need to change the last column to numeric values too.
from sklearn import preprocessing
# encoding the output column to numeric values
label_encoder = preprocessing.LabelEncoder()
credit_card['Class'] = label_encoder.fit_transform(credit_card['Class'])
We can split the dataset into training and testing parts to evaluate the Isolation Forest once it
has detected the outliers, because we know upfront that the fraud category represents the
outliers. After that, we will compare the actual fraud cases with the outliers predicted by the model.
columns = credit_card.columns.tolist()
# excluding the output class from the input features
columns = [c for c in columns if c != 'Class']
X = credit_card[columns]
Y = credit_card['Class']
# importing the modules
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
# splitting the dataset (the split ratio is assumed here)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
clf = IsolationForest(max_samples=len(X_train))
clf.fit(X_train)
We have applied only the input data to train the model because it is an Unsupervised Learning
algorithm. Once training is complete, we can make predictions:
# making predictions
y_pred = clf.predict(X_test)
We already know that the Isolation Forest predicts -1 for outliers and 1 for normal values. In
the actual dataset, however, 0 represents a normal transaction and 1 represents fraud, so we
will map the predictions to 0 and 1 accordingly.
y_pred[y_pred==1] = 0
y_pred[y_pred==-1] = 1
Let’s find the accuracy of the Isolation Forest by comparing its predictions with the actual values:
from sklearn.metrics import accuracy_score
print(accuracy_score(Y_test, y_pred))
The result shows that the model has correctly classified fraud and valid transactions in 99.7% of cases.
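Keep in mind that accuracy can be flattering on such an unbalanced dataset: with only 0.172% fraud, even a model that labels every transaction as valid scores close to 99.8%. The toy check below (synthetic labels, not the real dataset) illustrates this:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 1000 synthetic labels with roughly the dataset's fraud rate
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1                        # 2 fraud cases (~0.2%)
baseline = np.zeros(1000, dtype=int)  # predicting 'valid' for everything
print(accuracy_score(y_true, baseline))  # → 0.998
```

For this reason, it is worth also checking how many of the actual fraud cases the model catches, not just the overall accuracy.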
If you want to know how we can apply the Isolation forest to the Time series, look at the
Implementing anomaly detection using Python article.
Summary
The Isolation Forest is an Unsupervised Machine Learning algorithm that detects outliers in
a dataset by building an ensemble of randomly constructed isolation trees. In this article, we’ve
covered the Isolation Forest algorithm, its logic, and how to apply it to regression and
classification datasets.