
Mumbai University

TPCT’s, TERNA ENGINEERING COLLEGE (TEC), NAVI MUMBAI

Experiment No.06
A.1 Aim:
To implement the SMOTE technique to generate synthetic data.

A.2 Prerequisite:
Knowledge of Python, Dataset

A.3 Outcome:
After successful completion of this experiment, students will be able to generate synthetic data
using the SMOTE technique and compare the performance metrics obtained with and without resampling.
A.4 Theory:
What is an imbalanced dataset?

An imbalanced dataset is one in which the observations are not evenly distributed between classes:
one class (the majority class) contains far more observations than the other (the minority class).

Let’s understand this with the help of an example:

Suppose we want to build a model that identifies whether a given patient has cancer
or not. There are 1000 patients, with 900 being non-cancer patients and the other 100 being
cancer patients. Since the non-cancer patients are far more numerous, they belong to the
majority class, while the cancer patients belong to the minority class.

As the purpose of our model is to predict whether someone has cancer or not, the focus is
primarily on the minority class. In this case, however, the majority class is nine times larger than the
minority class. This is an imbalanced dataset: a model trained on it can achieve high accuracy
simply by predicting the majority (non-cancer) class and will be biased toward that class, even though
detecting the minority class is the main objective of building our system.

Here are more examples of imbalanced datasets:

• Fraudulent transactions occurring in banks.


• Theft and pilferage of electricity.
• Identification of rare diseases such as cancer, tumors, and so on.
• Natural disasters.
• Customer churn rate.
• Classifying email as spam or not.

Before learning about SMOTE’s functionality, it’s important to understand two important terms:
undersampling and oversampling.

Undersampling

The purpose of undersampling is to reduce the size of the majority class. We perform it by removing some
observations of that class. There are two ways of doing so: in the first method, we randomly
remove some records of the majority class, which is known as random undersampling. In the
second method, we use statistical methods to decide which majority-class records to remove, known as
informed undersampling.

These undersampling methods can also use data cleaning techniques to further refine the majority
class. Undersampling methods are generally not preferred because there is a chance of losing
valuable information. Removing data to force the two classes into equal proportion can also
introduce bias.
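
As a quick illustration, random undersampling is available in the imbalanced-learn library; the following is a minimal sketch (the toy dataset generated with scikit-learn is an assumption for illustration, not part of this experiment):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced dataset: roughly 90% majority / 10% minority
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print('Before:', Counter(y))

# Randomly drop majority-class records until the classes are balanced
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print('After:', Counter(y_res))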

Oversampling

Oversampling is the opposite of undersampling. The objective is to increase the number of samples of the
minority class so that the observations of the two classes become equal. Unlike
undersampling, where we remove observations, in oversampling we add new observations to the dataset. It can
be achieved in two ways: random oversampling and synthetic oversampling.

In random oversampling, we replicate existing minority-class records and add them to the dataset
to increase the minority class. Synthetic oversampling, in contrast, is the process of
generating artificial samples for the minority class. New samples are created in such a way that
they add relevant information to the minority class while avoiding misclassification. The main
drawback of random oversampling is that it can lead to overfitting, because the same
information is duplicated; a sketch of random oversampling follows.
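
For comparison with the undersampling sketch above, here is a minimal, illustrative use of imbalanced-learn's RandomOverSampler (again on an assumed toy dataset), which balances the classes by duplicating minority-class records:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print('Before:', Counter(y))

# Duplicate minority-class records until the classes are balanced
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print('After:', Counter(y_res))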

The SMOTE algorithm

An oversampling method, SMOTE creates new, synthetic observations from existing samples of
the minority class. Rather than merely duplicating the existing data, it creates new records whose
feature values lie close to those of the minority class, a form of data augmentation. These new
synthetic training records are made by randomly selecting one or more of the K nearest neighbors of
each minority-class sample. After completing oversampling, the problem of an imbalanced
dataset is resolved and we are ready to test different classification models.

Below are the steps to implement the SMOTE algorithm:

• Draw a random sample from the minority class.

• For each observation in this sample, locate its K nearest neighbors, using the
Euclidean distance to measure closeness.

• Find the difference vector between the current data point and a selected
neighbor.

• Multiply this vector by a random number between 0 and 1.

• Add the scaled vector to the current data point to obtain a new synthetic sample; the
collection of such samples forms the new dataset.
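
The following is a minimal from-scratch sketch of the interpolation steps above (the function name smote_sample is hypothetical, and this is not the imblearn implementation), assuming the minority-class samples are rows of a NumPy array:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, seed=42):
    # X_min: minority-class samples; n_new: number of synthetic points to create
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))      # pick a random minority point
        j = rng.choice(idx[i][1:])        # pick one of its k nearest neighbors
        gap = rng.random()                # random factor in [0, 1)
        # interpolate along the vector from point i to neighbor j
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)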
An exploration of SMOTE and some variants like Borderline-SMOTE and ADASYN

Figure 1. SMOTE, Borderline-SMOTE and ADASYN representation

The general idea of SMOTE is the generation of synthetic data between each sample of the
minority class and its "k" nearest neighbors. That is, for each sample of the minority
class, its "k" nearest neighbors are located (by default k = 5); then, between the pairs of points
formed by the sample and each of its neighbors, a new synthetic data point is generated. In Figure 2
you can see a visual description of the SMOTE implementation.

Figure 2. SMOTE visual description



As we can see in Figure 2 (b), SMOTE is applied to generate synthetic data from x1 considering
the 3 nearest neighbors (x2, x3 and x4) to generate the synthetic data s1, s2 and s3.
Although SMOTE allows the generation of synthetic tabular data, the algorithm by itself has some
limitations. SMOTE only works with continuous features (that is, it is not designed to
generate categorical synthetic data). Furthermore, the synthetic points are linear interpolations of
existing samples, which can bias the generated data and consequently produce an overfitted model. For this
reason, alternatives based on SMOTE have been proposed that aim to address the limitations of the
original SMOTE technique.

Borderline-SMOTE

Unlike the original SMOTE technique, Borderline-SMOTE focuses on generating synthetic data
by considering only samples that make up the border that divides one class from another. That is,
Borderline-SMOTE detects which samples are on the border of the class space and applies the
SMOTE technique to these samples. In Figure 3 you can see a visual description of Borderline-
SMOTE.

Figure 3. Borderline-SMOTE visual description

As can be seen in the previous image, the minority-class samples used to
generate the synthetic samples are those that lie on the borderline. An alternative to
Borderline-SMOTE is SVM-SMOTE, which determines the borderline using support vector
machines.

ADASYN

ADASYN is a technique based on the SMOTE algorithm for generating synthetic data.
The difference between ADASYN and SMOTE is that ADASYN detects the samples of the minority
class that lie in regions dominated by the majority class, in order to generate more samples
in these lower-density areas of the minority class. That is, ADASYN
focuses on the minority-class samples that are difficult to classify because they lie in a
low-density area. In Figure 4 you can see a visual description of ADASYN.

Figure 4. ADASYN visual description
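
All three samplers, plus the SVM-SMOTE variant mentioned above, are implemented in the imbalanced-learn library. A minimal comparison sketch (default parameters, assumed toy data) might look like this:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN

# Toy imbalanced dataset (roughly 90% / 10%)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print('Original:', Counter(y))

# Each sampler balances the classes using its own strategy
for sampler in (SMOTE(random_state=42),
                BorderlineSMOTE(random_state=42),
                SVMSMOTE(random_state=42),
                ADASYN(random_state=42)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))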

Figures 5, 6 and 7 show visualizations of the implementation of the SMOTE, Borderline-SMOTE
and ADASYN algorithms, respectively.

Figure 5. SMOTE

Figure 6. Borderline-SMOTE

Figure 7. ADASYN

PART B
(PART B: TO BE COMPLETED BY STUDENTS)

(Students must submit the soft copy as per the following segments within two hours of the
practical. The soft copy must be uploaded on Blackboard or emailed to the concerned lab
in-charge faculty at the end of the practical in case there is no Blackboard access
available.)

Roll No.: 65 Name: Ketki Kulkarni


Class: BE-A Batch: A4
Date of Experiment: Date of Submission:
Grade:

B.1 Observations and learning:


Code:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the breast cancer dataset into a DataFrame
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.head()

X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Baseline: logistic regression on the original (imbalanced) training set
lr = LogisticRegression(max_iter=10000)
lr.fit(X_train, y_train)
y_pred_normal = lr.predict(X_test)

cm_normal = confusion_matrix(y_test, y_pred_normal)
print('Confusion matrix (Normal Logistic Regression):\n')
print(cm_normal)
print('\nClassification report (Normal Logistic Regression):\n')
print(classification_report(y_test, y_pred_normal))

# Random undersampling of the majority class in the training set
rus = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)

lr.fit(X_train_under, y_train_under)
y_pred_under = lr.predict(X_test)

cm_under = confusion_matrix(y_test, y_pred_under)
print('Confusion matrix (Undersampling):\n')
print(cm_under)
print('\nClassification report (Undersampling):\n')
print(classification_report(y_test, y_pred_under))

# SMOTE oversampling of the minority class in the training set
smote = SMOTE(random_state=42)
X_train_over, y_train_over = smote.fit_resample(X_train, y_train)

lr.fit(X_train_over, y_train_over)
y_pred_over = lr.predict(X_test)

accuracy_over = accuracy_score(y_test, y_pred_over)
cm_over = confusion_matrix(y_test, y_pred_over)
print('Confusion matrix (Oversampling):\n')
print(cm_over)
print('\nClassification report (Oversampling):\n')
print(classification_report(y_test, y_pred_over))

Output:

B.2 Conclusion:
Hence, the SMOTE technique has been successfully implemented to generate synthetic data, and the
performance metrics of the resulting model have been compared with those obtained using the
original data and random undersampling.

B.3 Question of Curiosity


Q.1 What is the SMOTE technique? How does the SMOTE technique work?
Answer: SMOTE stands for Synthetic Minority Oversampling Technique. It is an oversampling
algorithm used to generate synthetic data points for the minority class in an imbalanced
dataset. SMOTE works by creating synthetic examples based on the existing minority-class data
points: the algorithm selects a minority-class instance, finds its K nearest minority-class
neighbors, chooses one of these neighbors, and generates a synthetic data point somewhere on the
line segment between the two points. This process is repeated until the desired balance between
the minority and majority classes is achieved. SMOTE thus falls under the synthetic-sample
generation strategy, as opposed to simple duplication; a sketch of its main parameters follows.
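
As an illustration of the tuning knobs involved, imbalanced-learn's SMOTE exposes the number of neighbors and the target class ratio; the following is a minimal sketch (the parameter values and toy data are chosen arbitrarily for illustration):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# k_neighbors: how many minority neighbors to interpolate towards
# sampling_strategy=0.5: oversample until minority is half the majority size
sm = SMOTE(k_neighbors=5, sampling_strategy=0.5, random_state=0)
X_res, y_res = sm.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))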

Q.2 What are some real-world applications that use synthetic data?
Answer: In financial services, synthetic data is used to manufacture data with attributes similar
to actual sensitive or regulated data. Data scientists use synthetic data to help prevent fraud, for
example by creating a dataset of typical transactions; synthetic data can also be used to train
machine learning models to detect fraudulent activities. In healthcare, synthetic data is used to
complete existing datasets and improve data accessibility. It can be used to simulate and predict
research outcomes, to test hypotheses, methods, and algorithms, and to develop health IT. Synthetic
data can also be used for education and training, public release of datasets, and record linkage.
Sharing healthcare data among researchers, institutions, and companies building AI solutions can
have numerous benefits: synthetic data can contribute to faster disease or drug discovery, a more
personalized approach to patient treatment, and improved AI systems.

Q.3 Why is SMOTE better than random oversampling?


Answer: SMOTE is generally better than random oversampling because it generates synthetic data
points based on the existing minority-class data points, rather than simply duplicating them. As
described above, it selects a minority-class instance, finds its K nearest minority-class
neighbors, and generates a synthetic point between the instance and one of those neighbors,
repeating the process until the desired class balance is achieved. SMOTE can be viewed as an
advanced form of oversampling, or as a specific data augmentation algorithm for tabular data. Its
advantage is that the synthetic points are not duplicates of existing points but new points that
resemble them, which reduces the overfitting caused by duplication. SMOTE is well suited to
imbalanced datasets where the minority class is rare in the population or the data is difficult to
collect.

Q4. What is the drawback of SMOTE?


Answer: One drawback of SMOTE is that it can generate synthetic samples by interpolating between
noisy examples, which can lead to over-representation of noise in the resampled dataset and
degrade the performance of machine learning models trained on it. Another drawback is that SMOTE
does not consider the majority class while creating synthetic examples, which can cause problems
when there is a strong overlap between the classes. It is therefore important to select the
parameters of the SMOTE algorithm carefully. It is also worth noting that SMOTE is not always the
best solution for imbalanced datasets; other techniques, such as undersampling or a combination of
oversampling and undersampling, may be more appropriate depending on the specific dataset and
problem at hand. One such combined approach is sketched below.
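
As one example of such a combination, imbalanced-learn provides SMOTEENN, which applies SMOTE oversampling and then cleans the result with Edited Nearest Neighbours undersampling; a minimal sketch on assumed toy data:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Oversample with SMOTE, then remove noisy/ambiguous points with ENN
sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))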
