Be A 65 Ads Exp 6
Experiment No.06
A.1 Aim:
To implement the SMOTE technique to generate synthetic data.
A.2 Prerequisite:
Knowledge of Python, Dataset
A.3 Outcome:
After successful completion of this experiment, students will be able to generate synthetic data
using the SMOTE technique and compare the performance metrics of the dataset with and without resampling.
A.4 Theory:
What is an imbalanced dataset?
An imbalanced dataset is one in which the classes are not represented equally: most observations
belong to one class (the majority class) and only a few belong to the other (the minority class).
Suppose we want to build a model that will help us identify whether a given patient has cancer
or not. There are 1000 patients, with 900 being non-cancer patients and the other 100 being
cancer patients. Since the non-cancer patients are higher in number, they form the
majority class, while the cancer patients form the minority class.
As the purpose of our model is to predict whether someone has cancer or not, the focus is
primarily on the minority class. In this case, however, the majority class is nine times bigger than the
minority class. A model trained on such an imbalanced dataset can deliver high overall accuracy simply by
predicting non-cancer patients well, and it will be biased toward the majority class, even though detecting
the minority class is the main objective of building our system.
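The effect described above can be checked with a few lines of Python. This is only a toy sketch using the 900/100 split from the example; the "model" here is a hypothetical one that always predicts the majority class, and all variable names are illustrative.

```python
# Toy labels matching the example in the text: 900 non-cancer (0), 100 cancer (1).
y_true = [0] * 900 + [1] * 100

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Overall accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: fraction of cancer patients that were detected.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

imbalance_ratio = y_true.count(0) / y_true.count(1)

print(f"accuracy        = {accuracy:.2f}")         # 0.90
print(f"minority recall = {recall:.2f}")           # 0.00
print(f"imbalance ratio = {imbalance_ratio:.0f}:1")  # 9:1
```

The accuracy looks high even though not a single cancer patient is detected, which is why accuracy alone is a poor metric on imbalanced data.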
Before learning about SMOTE’s functionality, it is important to understand two terms:
undersampling and oversampling.
Mumbai University
Undersampling
The purpose of undersampling is to reduce the size of the majority class. We perform it by removing some
observations of that class. There are two ways of doing so: in the first method, known as random
undersampling, we randomly remove some records of the majority class. In the
second method, known as informed undersampling, we use statistical methods to decide which
majority-class records to remove.
These undersampling methods can also use data cleaning techniques to further refine the majority
class. Undersampling methods are generally not preferred because there is a chance of losing
valuable information. They can also introduce bias, since data is removed only to make the
proportions of the two classes equal, and the remaining records may no longer represent the majority class well.
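Random undersampling, as described above, could be sketched as follows. This is a minimal illustration using only the standard library; the function name `random_undersample` and the toy data are assumptions made for the example, not part of the experiment.

```python
import random

def random_undersample(X, y, seed=42):
    """Randomly drop rows of the larger classes until every class
    has as many rows as the smallest class."""
    rng = random.Random(seed)

    # Group row indices by class label.
    indices_by_class = {}
    for i, label in enumerate(y):
        indices_by_class.setdefault(label, []).append(i)

    # Every class is reduced to the size of the smallest class.
    target = min(len(idx) for idx in indices_by_class.values())

    kept = []
    for idx in indices_by_class.values():
        kept.extend(rng.sample(idx, target))  # sampling without replacement
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept]

# Toy data: 900 majority rows (label 0) and 100 minority rows (label 1).
X = [[float(i)] for i in range(1000)]
y = [0] * 900 + [1] * 100

X_under, y_under = random_undersample(X, y)
print(y_under.count(0), y_under.count(1))  # 100 100
```

Note that 800 of the 900 majority rows are discarded here, which is the information loss mentioned above.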
Oversampling
Oversampling is the opposite of undersampling. The objective is to increase the samples of the
minority class so that the number of observations in the majority and minority classes becomes equal. Unlike
undersampling, where we remove data, in oversampling we add new data to the dataset. It can
be achieved in two ways: random oversampling and synthetic oversampling.
In random oversampling, we replicate existing minority-class samples and add them to our dataset
to increase the minority class. Synthetic oversampling, meanwhile, is the process of
generating artificial samples for the minority class. New samples are created in such a way that
they add relevant information to the minority class while avoiding misclassification. The main
drawback of oversampling, especially random oversampling, is that it can lead to overfitting due to
duplication of the same information.
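The duplication issue with random oversampling can be seen in a small sketch. The function name `random_oversample` and the toy data are hypothetical and are used only for illustration.

```python
import random

def random_oversample(X, y, seed=42):
    """Replicate randomly chosen rows of the smaller classes until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group row indices by class label.
    indices_by_class = {}
    for i, label in enumerate(y):
        indices_by_class.setdefault(label, []).append(i)

    target = max(len(idx) for idx in indices_by_class.values())

    X_out, y_out = list(X), list(y)
    for label, idx in indices_by_class.items():
        # Sampling with replacement: the same row may be copied many times.
        extra = rng.choices(idx, k=target - len(idx))
        X_out.extend(X[i] for i in extra)
        y_out.extend(label for _ in extra)
    return X_out, y_out

# Toy data: 900 majority rows (label 0) and 100 minority rows (label 1).
X = [(float(i),) for i in range(1000)]
y = [0] * 900 + [1] * 100

X_over, y_over = random_oversample(X, y)
minority_rows = [x for x, label in zip(X_over, y_over) if label == 1]
print(len(minority_rows), len(set(minority_rows)))  # 900 100
```

The minority class now has 900 rows, but only 100 distinct ones, so no new information has been added. This is the motivation for synthetic oversampling.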
SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling method that creates new,
synthetic observations from the existing samples of the minority class. Rather than duplicating the
existing data, it creates new data points whose values are close to those of existing minority-class
samples, which is a form of data augmentation. These new synthetic training records are made by randomly
selecting one or more of the k nearest neighbors of each minority-class sample. After oversampling,
the classes are balanced and we are ready to test different classification models.
The general idea of SMOTE is the generation of synthetic data between each sample of the
minority class and its “k” nearest neighbors. That is, for each sample of the minority
class, its “k” nearest neighbors within the minority class are located (by default k = 5); then, on the
line segment joining the sample and each selected neighbor, a new synthetic sample is generated. In Figure 2
you can see a visual description of the SMOTE implementation.
As we can see in Figure 2 (b), SMOTE is applied to generate synthetic data from x1 considering
the 3 nearest neighbors (x2, x3 and x4) to generate the synthetic data s1, s2 and s3.
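The interpolation step can be sketched in NumPy as follows. This is a simplified illustration of the idea, not the reference SMOTE implementation: the function name `smote_sketch` is hypothetical, the base sample for each synthetic point is chosen at random (an assumption made for brevity), and the toy minority samples are randomly generated.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=42):
    """Generate n_new synthetic points, each on the line segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    if k >= len(X_min):
        raise ValueError("k must be smaller than the number of minority samples")

    # Pairwise Euclidean distances between minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)               # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest neighbours

    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X_min))          # pick a minority sample x_i
        nn = neighbours[i, rng.integers(k)]   # pick one of its k neighbours
        gap = rng.random()                    # position along the segment, in [0, 1)
        synthetic[j] = X_min[i] + gap * (X_min[nn] - X_min[i])
    return synthetic

# Toy minority class: 20 samples with 2 features (illustrative data only).
rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))

X_syn = smote_sketch(X_min, n_new=30, k=5)
print(X_syn.shape)  # (30, 2)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies between them and never outside the range already covered by the minority class.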
Although SMOTE is a technique that allows the generation of synthetic tabular data, the algorithm by
itself has some limitations. SMOTE only works with continuous data (that is, it is not designed to
generate categorical synthetic data). In addition, the synthetic data is generated by linear interpolation
between existing samples, so it is linearly dependent on them, which can cause a bias in the data generated
and consequently produce an overfitted model. For this reason, alternatives based on SMOTE have been
proposed that aim to address the limitations of the original SMOTE technique.
Borderline-SMOTE
Unlike the original SMOTE technique, Borderline-SMOTE focuses on generating synthetic data
by considering only samples that make up the border that divides one class from another. That is,
Borderline-SMOTE detects which samples are on the border of the class space and applies the
SMOTE technique to these samples. In Figure 3 you can see a visual description of Borderline-
SMOTE.
As can be seen in the previous image, the samples of the minority class that are considered to
generate the synthetic samples are those that are part of the borderline. An alternative to
Borderline-SMOTE is SVM-SMOTE, which determines the borderline using support vector
machines.
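One common way to decide which minority samples are "on the border" is to look at the m nearest neighbours of each minority sample in the whole dataset: a sample counts as borderline when at least half, but not all, of those neighbours belong to the majority class. The sketch below assumes that rule; the function name `borderline_indices` and the one-dimensional toy data are hypothetical.

```python
import numpy as np

def borderline_indices(X, y, minority_label=1, m=5):
    """Return indices of minority samples whose m nearest neighbours
    (searched in the whole dataset) are mostly, but not all, majority."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)

    # Distances from every minority sample to every sample in the dataset.
    diff = X[minority_idx][:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    dist[np.arange(len(minority_idx)), minority_idx] = np.inf  # exclude self

    nn = np.argsort(dist, axis=1)[:, :m]
    n_majority = (y[nn] != minority_label).sum(axis=1)

    # All neighbours majority -> treated as noise; fewer than half -> safe.
    danger = (n_majority >= m / 2) & (n_majority < m)
    return minority_idx[danger]

# Toy 1-D data: majority on the left, minority on the right,
# plus one isolated minority point (1.4) inside the majority region.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [1.4], [4.5], [6.2], [7.0], [8.0]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

danger_idx = borderline_indices(X, y, m=3)
print(danger_idx.tolist())  # [6]  (the sample at x = 4.5)
```

Only the sample at x = 4.5 is selected: it sits next to the majority region, while the isolated point at x = 1.4 is treated as noise and the remaining minority samples are safe. The SMOTE interpolation would then be applied to the selected samples only. The imbalanced-learn package provides ready-made `BorderlineSMOTE` and `SVMSMOTE` classes in `imblearn.over_sampling`.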
ADASYN
ADASYN is a technique that is based on the SMOTE algorithm for generating synthetic data.
The difference between ADASYN and SMOTE is that ADASYN implements a methodology that
detects those samples of the minority class found in regions dominated by the majority class, in
order to generate more samples in the lower-density areas of the minority class. That is, ADASYN
focuses on those samples of the minority class that are difficult to classify because they are in a
low-density area. In Figure 4 you can see a visual description of ADASYN.
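The weighting idea behind ADASYN can be sketched as follows: each minority sample receives a share of the synthetic points in proportion to the fraction of majority samples among its k nearest neighbours. The function name `adasyn_allocation`, the total of 10 synthetic points, and the toy data are assumptions made for illustration; the interpolation step itself is the same as in SMOTE and is omitted.

```python
import numpy as np

def adasyn_allocation(X, y, n_new, minority_label=1, k=5):
    """Return, for each minority sample, how many synthetic points to
    generate from it, weighted by how majority-dominated its neighbourhood is."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)

    # Distances from every minority sample to every sample in the dataset.
    diff = X[minority_idx][:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    dist[np.arange(len(minority_idx)), minority_idx] = np.inf  # exclude self

    nn = np.argsort(dist, axis=1)[:, :k]
    # r_i: fraction of majority samples among the k nearest neighbours.
    r = (y[nn] != minority_label).sum(axis=1) / k
    if r.sum() == 0:
        return np.zeros(len(minority_idx), dtype=int)  # no hard samples found

    weights = r / r.sum()                     # normalise so the weights sum to 1
    return np.rint(weights * n_new).astype(int)

# Toy 1-D data: majority on the left, minority on the right,
# plus one isolated minority point (1.4) inside the majority region.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [1.4], [4.5], [6.2], [7.0], [8.0]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

allocation = adasyn_allocation(X, y, n_new=10, k=3)
print(allocation.tolist())  # [6, 4, 0, 0, 0]
```

All synthetic points are allocated to the two minority samples surrounded by majority samples (x = 1.4 and x = 4.5), and none to the samples deep inside the minority region. The isolated point receives the largest share, which also shows that this weighting can be sensitive to outliers.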
Figures 5, 6 and 7 show the visualizations of the implementation of the SMOTE, Borderline-SMOTE
and ADASYN algorithms respectively.
Figure 5. SMOTE
Figure 6. Borderline-SMOTE
Figure 7. ADASYN
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the
practical. The soft copy must be uploaded on Blackboard or emailed to the concerned lab
in-charge faculty at the end of the practical in case there is no Blackboard access
available.)
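from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

data = load_breast_cancer()  # assumption: the original does not show which dataset was loaded
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)  # assumed split

lr = LogisticRegression(max_iter=10000)  # baseline: original (imbalanced) training set
lr.fit(X_train, y_train)
y_pred_normal = lr.predict(X_test)

rus = RandomUnderSampler(random_state=42)  # random undersampling, training set only
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)
lr.fit(X_train_under, y_train_under)
y_pred_under = lr.predict(X_test)

smote = SMOTE(random_state=42)  # SMOTE oversampling, training set only
X_train_over, y_train_over = smote.fit_resample(X_train, y_train)
lr.fit(X_train_over, y_train_over)
y_pred_over = lr.predict(X_test)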
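To compare the models trained on the original, undersampled and oversampled data, metrics that focus on the minority class are more informative than accuracy. The sketch below shows how precision, recall and F1 could be computed from a vector of predictions. The function name `minority_metrics` is hypothetical, and the two label lists are made-up toy values used only to show the calculation; they are not results of this experiment.

```python
def minority_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 score for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up toy labels, for illustration only.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

metrics = minority_metrics(y_true_toy, y_pred_toy)
print(metrics)  # {'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

The same function could be called once per prediction vector (for example `y_pred_normal` and `y_pred_under` against the test labels) to compare the techniques side by side. scikit-learn's `classification_report` in `sklearn.metrics` reports the same quantities for every class.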
B.2 Conclusion:
Hence, the SMOTE technique has been successfully implemented to generate synthetic data, and its
performance evaluation metrics have been compared with those of other techniques.
Q.2 What are the real-world applications that use synthetic data?
Answer: In financial services, synthetic data is used to produce data with attributes similar
to actual sensitive or regulated data. Data scientists use synthetic data to help prevent fraud by
creating a dataset of typical transactions. Synthetic data can also be used to train machine
learning models to detect fraudulent activities. In healthcare, synthetic data is used to complete
existing datasets and improve data accessibility. Synthetic data can be used to support simulation and
prediction research, to test hypotheses, methods, and algorithms, and to develop health IT. Synthetic data
can also be used for education and training, public release of datasets, and linking data. Sharing
healthcare data among researchers, institutions, and companies building AI solutions can have
numerous benefits. Synthetic data can contribute to faster disease or drug discovery, a more
personalized approach to patient treatment, and improved AI systems.
advantage of SMOTE is that it creates synthetic data points that are not duplicates of the existing
data points, but rather new data points that are like the existing data points. SMOTE is ideal for
imbalanced datasets, where the minority class is rare in the population, or the data is difficult to
collect.