
Outlier and Class Imbalance

Dr. Manjubala Bisi

Assistant Professor
Department of Computer Science and Engineering
National Institute of Technology Warangal
manjubalabisi@nitw.ac.in

02/05/2023



Outline of the Talk

Outlier
Class Imbalance Problem
Case Study



Outlier

In statistics, an outlier is an observation point that is distant from
other observations
An outlier is an object that deviates significantly from the rest of
the objects
Outliers can be caused by measurement or execution errors
The analysis of outlier data is referred to as outlier analysis or
outlier mining
An outlier is the odd one out, the observation that differs from the
crowd
Outliers have a different underlying behavior than the rest of the
data



Introduction of Outliers into Datasets

Outliers are first introduced into the population while gathering or
collecting the data
Data can be collected in many ways: interviews, questionnaire
surveys, observations, document records, focus groups, oral history,
etc.; and in this tech era the Internet, IT sensors, etc. are also
generating data for us
Other possible causes of outliers are incorrect entry, misreporting
of data or observations, sampling errors made while doing the
experiment, or exceptional but true values
Outliers can thus be the result of a mistake during data collection,
or they can simply be an indication of variance in your data



Types of Outliers

Point or global outliers
Contextual (conditional) outliers
Collective outliers



Point or global Outliers

Observations anomalous with respect to the majority of observations
in a feature
A data point is considered a global outlier if its value is far
outside the entirety of the data set in which it is found
Example: In a class, all students' ages will be approximately
similar, but if we see a record of a student with age 100, it is an
outlier and could have been generated for various reasons



Contextual (Conditional) Outliers

Observations considered anomalous given a specific context
A data point is considered a contextual outlier if its value
significantly deviates from the rest of the data points in the same
context
The same value may not be considered an outlier if it occurs in a
different context
Example: The world economy falling drastically due to COVID-19 in 2020



Collective Outliers

A collection of observations that are anomalous but appear close to
one another because they all have a similar anomalous value
A subset of data points within a data set is considered anomalous if
those values as a collection deviate significantly from the entire
data set, even though the individual data points are not themselves
anomalous in either a contextual or global sense
In time series data, this can manifest as normal peaks and valleys
occurring outside of the time frame in which that seasonal sequence
is normal, or as a combination of time series that is in an outlier
state as a group



Importance of identifying the outliers

Machine learning algorithms are sensitive to the range and
distribution of attribute values
Outliers can spoil and mislead the training process, resulting in
longer training times, less accurate models and ultimately poorer
results



Detection of Outliers

Visualization Technique
Box plot
Histogram
Scatter plot
Mathematical Function
Z Score
IQR (Inter Quartile Range) Score



Visualization Technique (Box plot)



Visualization Technique (Histogram)



Visualization Technique (Scatter plot)



Mathematical Function (Z-score)

The z-score is an important concept in statistics
It helps to understand whether a data value is greater or smaller
than the mean, and how far away it is from the mean
It tells how many standard deviations away a data point is from the
mean
z-score = (x - mean) / standard deviation
If the z-score of a data point has an absolute value greater than 3,
the data point is quite different from the other data points and is
considered an outlier
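
A minimal sketch of z-score based detection (NumPy assumed; the age
data below are hypothetical):

    import numpy as np

    def zscore_outliers(values, threshold=3.0):
        """Flag points whose |z-score| exceeds the threshold."""
        values = np.asarray(values, dtype=float)
        z = (values - values.mean()) / values.std()
        return values[np.abs(z) > threshold]

    ages = np.array([21, 22, 21, 23, 22, 24, 21, 100])
    # With small samples the outlier inflates the standard deviation,
    # so a threshold below 3 may be needed to flag it
    print(zscore_outliers(ages, threshold=2.5))  # [100.]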



Mathematical Function (IQR score)

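The IQR rule flags points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR,
where IQR = Q3 - Q1. A minimal NumPy sketch with hypothetical data:

    import numpy as np

    def iqr_outliers(values, k=1.5):
        """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
        values = np.asarray(values, dtype=float)
        q1, q3 = np.percentile(values, [25, 75])
        iqr = q3 - q1
        lo, hi = q1 - k * iqr, q3 + k * iqr
        return values[(values < lo) | (values > hi)]

    ages = np.array([21, 22, 21, 23, 22, 24, 21, 100])
    print(iqr_outliers(ages))  # [100.]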


Prevention Methods for Outliers
The choice depends on domain knowledge and experience with outliers
Drop the outlier records
Sometimes it's best to completely remove those records from your
dataset to stop them from skewing your analysis
Cap your outliers' data
Another way to handle true outliers is to cap them. For example, if
you're using income, you might find that people above a certain income
level behave in the same way as those with a lower income. In this
case, you can cap the income value at a level that keeps that behavior
intact
Assign a new value
If an outlier seems to be due to a mistake in your data, try imputing a
new value. Common imputation methods include using the mean of a
variable or using a regression model to predict the missing value
Try a transformation
A different approach to true outliers is to create a transformation of
the data rather than using the data itself. For example, try creating
a percentile version of your original field and working with that new
field instead (see the sketch below)
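
A minimal pandas sketch of the capping, imputation and
percentile-transform ideas above (the income column and the 95th
percentile threshold are illustrative assumptions):

    import pandas as pd

    df = pd.DataFrame({"income": [30e3, 45e3, 52e3, 61e3, 58e3, 2.5e6]})

    # Cap (winsorize) at the 95th percentile -- illustrative threshold
    cap = df["income"].quantile(0.95)
    df["income_capped"] = df["income"].clip(upper=cap)

    # Impute suspected mistakes with the mean of the remaining values
    mask = df["income"] > cap
    df["income_imputed"] = df["income"].where(
        ~mask, df.loc[~mask, "income"].mean())

    # Transformation: work with a percentile (rank) version of the field
    df["income_pct"] = df["income"].rank(pct=True)
    print(df)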
Class Imbalance

When the number of observations in one class is much higher than in
the other classes, there exists a class imbalance
Example: when detecting fault-prone and non fault-prone modules in
software, the fault-prone modules may number around 100 compared
with around 9000 non fault-prone modules
Class imbalance is a common problem in machine learning, especially
in classification problems
Imbalanced data can severely hamper model accuracy



The Problem with Class Imbalance

Most machine learning algorithms work best when the number of
samples in each class is about equal
This is because most algorithms are designed to maximize accuracy
and reduce errors
However, if the data set is imbalanced, you get a fairly high
accuracy just by predicting the majority class, but you fail to
capture the minority class



The Problem with Class Imbalance

Accuracy will mislead
For all the non fault-prone modules, you'd have 100% accuracy
For the fault-prone modules, you'd have 0% accuracy
Overall accuracy would be high simply because most modules are non
fault-prone, not because the model is good: with 100 fault-prone and
9000 non fault-prone modules, always predicting "non fault-prone"
already gives 9000/9100 ≈ 98.9% accuracy



Resampling Technique

A widely adopted technique for dealing with highly imbalanced
datasets is resampling
It consists of removing samples from the majority class
(under-sampling)
and/or adding more examples to the minority class
(over-sampling)



Resampling Technique



Resampling Technique

The simplest implementation of over-sampling is to duplicate random
records from the minority class, which can cause overfitting
In under-sampling, the simplest technique involves removing random
records from the majority class, which can cause loss of information



Random Under-Sampling

Under-sampling can be defined as removing some observations of the
majority class
This is done until the majority and minority classes are balanced
Under-sampling can be a good choice when you have a ton of data,
think millions of rows
But a drawback of under-sampling is that we may be removing
information that is valuable



Random Over-Sampling

Over-sampling can be defined as adding more copies of the minority
class
Over-sampling can be a good choice when you don't have a ton of
data to work with
A con to consider with over-sampling is that it can cause overfitting
and poor generalization on your test set
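
A minimal sketch of both random resamplers using the imbalanced-learn
library (assumed installed; the dataset is synthetic, standing in for
any feature matrix X and labels y):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler

    # Synthetic imbalanced data: roughly 1% minority class
    X, y = make_classification(n_samples=9100, weights=[0.99],
                               random_state=42)
    print(Counter(y))

    ros = RandomOverSampler(random_state=42)   # duplicate minority records
    X_over, y_over = ros.fit_resample(X, y)
    print(Counter(y_over))

    rus = RandomUnderSampler(random_state=42)  # drop majority records
    X_under, y_under = rus.fit_resample(X, y)
    print(Counter(y_under))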



Under-sampling: Tomek link

Tomek links are pairs of very close instances of opposite classes
Removing the majority-class instance of each pair increases the space
between the two classes, facilitating the classification process
A Tomek link exists if the two samples are each other's nearest
neighbors
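
A minimal imbalanced-learn sketch (library assumed; synthetic data):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import TomekLinks

    X, y = make_classification(n_samples=2000, weights=[0.9],
                               random_state=0)
    tl = TomekLinks()  # removes the majority-class member of each link
    X_res, y_res = tl.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))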



Under-sampling: Tomek link



Synthetic Minority Oversampling Technique (SMOTE)

This technique generates synthetic data for the minority class
It works by randomly picking a point from the minority class and
computing the k-nearest neighbors of this point
The synthetic points are added between the chosen point and its
neighbors
It synthesizes new minority instances between existing minority
instances
It generates virtual training records by linear interpolation for the
minority class
These synthetic training records are generated by randomly selecting
one or more of the k-nearest neighbors for each example in the
minority class



Synthetic Minority Oversampling Technique (SMOTE)

Step 1: Set the minority class set A; for each X ∈ A, the k-nearest
neighbors of X are obtained by calculating the Euclidean distance
between X and every other sample in set A
Step 2: The sampling rate N is set according to the imbalance
proportion. For each X ∈ A, N examples (i.e. X1, X2, ..., XN) are
randomly selected from its k-nearest neighbors, and they construct
the set A1
Step 3: For each example Xk ∈ A1 (k = 1, 2, ..., N), the following
formula is used to generate a new example:
X' = X + rand(0, 1) × |X - Xk|
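
A minimal sketch using imbalanced-learn's SMOTE, one common
implementation of the steps above (synthetic data):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=2000, weights=[0.9],
                               random_state=0)
    # k_neighbors is the k used to pick interpolation partners (Step 1)
    sm = SMOTE(k_neighbors=5, random_state=0)
    X_res, y_res = sm.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))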



Near-Miss Algorithm – Under sampling

Near-Miss is an under-sampling technique
It aims to balance the class distribution by eliminating majority
class examples
When instances of two different classes are very close to each other,
we remove instances of the majority class to increase the space
between the two classes
This helps the classification process and limits the information-loss
problem of random under-sampling



Near-Miss Algorithm – Under sampling

Step 1: The method first finds the distances between all instances of
the majority class and the instances of the minority class. Here, the
majority class is to be under-sampled
Step 2: Then, n instances of the majority class that have the
smallest distances to those in the minority class are selected



Different versions of the Near-Miss Algorithm

NearMiss – Version 1: selects samples of the majority class for which
the average distance to the k closest instances of the minority class
is smallest
NearMiss – Version 2: selects samples of the majority class for which
the average distance to the k farthest instances of the minority
class is smallest
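
A minimal imbalanced-learn sketch (library assumed; synthetic data):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import NearMiss

    X, y = make_classification(n_samples=2000, weights=[0.9],
                               random_state=0)
    # version=1 keeps majority samples closest on average to minority ones
    nm = NearMiss(version=1, n_neighbors=3)
    X_res, y_res = nm.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))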



ADASYN: Adaptive Synthetic Sampling Approach for
Imbalanced Learning

It uses a weighted distribution over the minority class examples
according to their level of difficulty in learning
More synthetic data is generated for minority class examples that are
harder to learn, compared to those minority examples that are easier
to learn
As a result, the ADASYN approach improves learning with respect to
the data distributions in two ways:
1. reducing the bias introduced by the class imbalance
2. adaptively shifting the classification decision boundary toward
the difficult examples



ADASYN: Adaptive Synthetic Sampling Approach for
Imbalanced Learning

Input: training data set Dtr with m samples (xi, yi), i = 1, ..., m,
where xi is an instance in the n-dimensional feature space X and yi
is the class label associated with xi
Define ms and ml as the number of minority class examples and the
number of majority class examples, respectively. Therefore, ms ≤ ml
and ms + ml = m
dth is a preset threshold for the maximum tolerated degree of class
imbalance ratio



ADASYN: Adaptive Synthetic Sampling Approach for
Imbalanced Learning

Calculate the degree of class imbalance as d = ms / ml, where
d ∈ (0, 1]
If d < dth, do the following steps:
Calculate the number of synthetic data examples that need to be
generated for the minority class: G = (ml - ms) × β, where β ∈ [0, 1]
is a parameter used to specify the desired balance level after
generation of the synthetic data
For each example xi in the minority class, find the K nearest
neighbors based on the Euclidean distance in the n-dimensional space,
and calculate the ratio ri = Si / K, i = 1, ..., ms, where Si is the
number of examples among the K nearest neighbors of xi that belong to
the majority class; therefore ri ∈ [0, 1]
Normalize ri as ri′ = ri / Σ(i=1..ms) ri, so that the ri′ form a
density distribution (they sum to 1)



ADASYN: Adaptive Synthetic Sampling Approach for
Imbalanced Learning

Calculate the number of synthetic data examples that need to be
generated for each minority example xi as gi = ri′ × G
For each minority class data example xi, generate gi synthetic data
examples using the following steps:
Randomly choose one minority data example, xzi, from the K nearest
neighbors of xi
Generate the synthetic data example as si = xi + (xzi - xi) × α,
where α ∈ [0, 1] is a random number
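
A minimal sketch using imbalanced-learn's ADASYN (library assumed;
n_neighbors plays the role of K above):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import ADASYN

    X, y = make_classification(n_samples=2000, weights=[0.9],
                               random_state=0)
    ada = ADASYN(n_neighbors=5, random_state=0)
    X_res, y_res = ada.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))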



Performance metric

Accuracy is not the best metric to use when evaluating imbalanced
datasets, as it can be misleading
Confusion matrix
Precision is a measure of a classifier's exactness
Recall is a measure of a classifier's completeness
F1-score is the harmonic mean of precision and recall
Area Under the ROC Curve (AUROC) represents the likelihood of the
model distinguishing observations from the two classes
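
A minimal scikit-learn sketch (the labels and scores below are
hypothetical stand-ins for a trained classifier's output):

    from sklearn.metrics import (classification_report,
                                 confusion_matrix, roc_auc_score)

    y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # actual labels
    y_pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]   # hard predictions
    y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.4]

    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, digits=3))
    print("AUROC:", roc_auc_score(y_true, y_score))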



Area Under the Curve (AUC)
AUC measures the ability of a classifier to distinguish between
classes; it compares the TP rate vs. the FP rate
The higher the AUC, the better the performance of the model at
distinguishing between the positive and negative classes
When AUC = 1, the classifier is able to perfectly distinguish all the
positive class points from the negative class points
When AUC = 0, the classifier predicts all negatives as positives, and
all positives as negatives



Area Under the Curve (AUC)
When 0.5 < AUC < 1, there is a high chance that the classifier will
be able to distinguish the positive class values from the negative
class values, because it detects more true positives and true
negatives than false negatives and false positives



Area Under the Curve (AUC)
When AUC = 0.5, the classifier is not able to distinguish between
positive and negative class points, meaning it is predicting either a
random class or a constant class for all the data points
So, the higher the AUC value of a classifier, the better its ability
to distinguish between the positive and negative classes



Case Study on Early Software Reliability Prediction

Reliability prediction during the early phases of the software
development life cycle (SDLC), i.e. before the testing phase
Failure data are not available before the testing phase; using
software metrics available in the early phases of the SDLC,
reliability can be predicted
Early prediction of software reliability is used to
evaluate design feasibility
compare design alternatives
identify potential failure areas
track reliability improvements
identify cost overruns at an early stage
provide optimal development strategies



Early Software Reliability Prediction Models

Software fault-prone module prediction
Software defect prediction
Software development effort prediction
Software testing effort prediction



Need for Software fault-prone module prediction

Predicting the software fault-prone modules during the early phases
(requirement analysis, design and coding) of the software development
process helps software project managers plan better testing
activities
Testing efforts are focused on the most troublesome fault-prone
modules only, instead of on all modules of the software system
To achieve high software reliability in less time, it is important
that testing is prioritized based on the fault-proneness of modules



Software Quality Metrics Affecting Fault-Proneness

A number of software quality metrics that affect fault-proneness are
considered
These metrics are classified into the following major categories:
Lines of code measures
Base Halstead metrics
Derived Halstead metrics
Branch count



Software fault-prone module prediction model: ANN
approach

Software fault-prone classification can be used to classify modules
as fault-prone or not
It ranks the software modules according to the number of defects, and
thus can be used to determine the order in which code should be
inspected
Developers can allocate the limited test resources to the code areas
most likely to contain bugs
It reduces the overall cost of maintenance activities and maximizes
company profits



Problem Statement

Develop an ANN model to classify software modules into faulty and
not faulty
The key goal of the model is to classify a software module as faulty
or not faulty
The model for software fault-prone classification can be defined as a
mapping Y = f(m), where m is a given module and Y is the
classification output, which can take two possible values (0/1):
Y = 0 means the software module m is not faulty, and Y = 1 means the
software module m is faulty
A software module m is represented as a set of software metrics,
defined as m = <sm1, sm2, ..., smn>
The target is to find an approximation to the mapping f using an ANN
approach



ANN approach



Data Set Example

CM1 Data Set



Software Metrics in Data Set



Data Preprocessing

Natural logarithmic transformation
It converts the data distribution from highly skewed to less skewed
Normalization
It transforms large data value ranges into small data value ranges
(min-max normalization)
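
A minimal NumPy sketch of both steps (log1p is used so zero-valued
metrics stay defined; that choice is an assumption, not from the
slides):

    import numpy as np

    X = np.array([[10.0, 200.0], [15.0, 5000.0], [12.0, 800.0]])

    # Natural log transform to reduce skew (log1p handles zeros safely)
    X_log = np.log1p(X)

    # Min-max normalization to [0, 1], column by column
    X_min, X_max = X_log.min(axis=0), X_log.max(axis=0)
    X_norm = (X_log - X_min) / (X_max - X_min)
    print(X_norm)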



ANN architecture, Training and Testing Phase

Determine the ANN architecture (number of layers, number of neurons
in each layer)
Identify the activation function of each layer
Train the model using the back-propagation algorithm, or customize
the training algorithm
Test the model using the testing data (see the sketch below)
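
A minimal sketch of these steps with scikit-learn's MLPClassifier,
one possible stand-in for the ANN (the architecture, activation and
synthetic data are illustrative assumptions, not the slides' exact
setup):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import classification_report

    # Stand-in for a software-metrics dataset such as CM1
    X, y = make_classification(n_samples=500, n_features=21,
                               weights=[0.9], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)

    # One hidden layer of 10 neurons, logistic activation, trained by
    # gradient-based back-propagation
    clf = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                        max_iter=1000, random_state=0)
    clf.fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te), digits=3))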



ANN model



Performance Measures using confusion matrix



Performance Measures

TPR = TP / (TP + FN)
TNR = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F-measure = 2 * Precision * Recall / (Precision + Recall)
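
A minimal sketch computing these measures from a confusion matrix
(the labels below are hypothetical):

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

    # ravel() order for binary labels is tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)          # recall / true positive rate
    tnr = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    print(tpr, tnr, accuracy, precision, f_measure)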



Demo in Python



References

1. Pravali Manchala and Manjubala Bisi, "Diversity based imbalance
learning approach for software fault prediction using machine
learning models", Applied Soft Computing (2022)
2. Kwabena Ebo Bennin, Jacky Keung, Passakorn Phannachitta, Akito
Monden, and Solomon Mensah, "MAHAKIL: Diversity Based Oversampling
Approach to Alleviate the Class Imbalance Issue in Software Defect
Prediction", IEEE Transactions on Software Engineering (2018)
3. Rheza Harliman and Kaoru Uchida, "Data- and Algorithm-Hybrid
Approach for Imbalanced Data Problems in Deep Neural Network",
International Journal of Machine Learning and Computing (2018)



Thank You

