
Outlier and Class Imbalance

Dr. Manjubala Bisi

Assistant Professor
Department of Computer Science and Engineering
National Institute of Technology Warangal
manjubalabisi@nitw.ac.in

02/05/2023



Outline of the Talk

Outlier
Class Imbalance Problem
Case Study



Outlier

In statistics, an outlier is an observation point that is distant from
other observations
An outlier is an object that deviates significantly from the rest of
the objects
Outliers can be caused by measurement or execution errors
The analysis of outlier data is referred to as outlier analysis or
outlier mining
An outlier is the odd one out, the observation that differs from the
crowd
Outliers have a different underlying behavior than the rest of the
data



Introduction of Outliers into Datasets

Outliers are first introduced into the population while gathering or
collecting the data
Data can be collected in many ways: interviews, questionnaire
surveys, observations, document records, focus groups, oral history,
etc.; and in this tech era the Internet, IT sensors, etc. are also
generating data for us
Other possible causes of outliers are incorrect entry, misreporting
of data or observations, sampling errors made while doing the
experiment, or exceptional but true values
Outliers can thus be the result of a mistake during data collection,
or they can simply be an indication of variance in your data



Types of Outliers

Point or global outliers
Contextual (conditional) outliers
Collective outliers



Point or global Outliers

Observations anomalous with respect to the majority of observations
in a feature
A data point is considered a global outlier if its value is far
outside the entirety of the data set in which it is found
Example: In a class, all students' ages will be approximately
similar, but if we see a record of a student with age 100, it is an
outlier and could have been generated for various reasons



Contextual (Conditional) Outliers

Observations considered anomalous given a specific context
A data point is considered a contextual outlier if its value
significantly deviates from the rest of the data points in the same
context
The same value may not be considered an outlier if it occurs in a
different context
Example: The world economy falling drastically due to COVID-19 in 2020



Collective Outliers

A collection of observations that are anomalous but appear close to
one another because they all have a similar anomalous value
A subset of data points within a data set is considered anomalous if
those values as a collection deviate significantly from the entire
data set, even though the individual data points are not themselves
anomalous in either a contextual or global sense
In time series data, this can manifest as normal peaks and valleys
occurring outside of the time frame in which that seasonal sequence
is normal, or as a combination of time series that is in an outlier
state as a group



Importance of identifying the outliers

Machine learning algorithms are sensitive to the range and
distribution of attribute values
Outliers can spoil and mislead the training process, resulting in
longer training times, less accurate models and ultimately poorer
results



Detection of Outliers

Visualization Technique
Box plot
Histogram
Scatter plot
Mathematical Function
Z Score
IQR (Inter Quartile Range) Score



Visualization Technique (Box plot)



Visualization Technique (Histogram)



Visualization Technique (Scatter plot)



Mathematical Function (Z-score)

The z-score is an important concept in statistics
It helps to understand whether a data value is greater or smaller
than the mean, and how far away it is from the mean
It tells how many standard deviations away a data point is from the
mean
z-score = (x - mean) / standard deviation
If the z-score of a data point has an absolute value greater than 3,
the data point is quite different from the other data points and is
considered an outlier
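
A minimal sketch of z-score based detection (NumPy assumed; the age
data below are hypothetical):

    import numpy as np

    def zscore_outliers(values, threshold=3.0):
        """Flag points whose |z-score| exceeds the threshold."""
        values = np.asarray(values, dtype=float)
        z = (values - values.mean()) / values.std()
        return values[np.abs(z) > threshold]

    ages = np.array([21, 22, 21, 23, 22, 24, 21, 100])
    # With small samples the outlier inflates the standard deviation,
    # so a threshold below 3 may be needed to flag it
    print(zscore_outliers(ages, threshold=2.5))  # [100.]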



Mathematical Function (IQR score)

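The IQR rule flags points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR,
where IQR = Q3 - Q1. A minimal NumPy sketch with hypothetical data:

    import numpy as np

    def iqr_outliers(values, k=1.5):
        """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
        values = np.asarray(values, dtype=float)
        q1, q3 = np.percentile(values, [25, 75])
        iqr = q3 - q1
        lo, hi = q1 - k * iqr, q3 + k * iqr
        return values[(values < lo) | (values > hi)]

    ages = np.array([21, 22, 21, 23, 22, 24, 21, 100])
    print(iqr_outliers(ages))  # [100.]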


Prevention Methods for Outliers
The choice depends on domain knowledge and experience with outliers
Drop the outlier records
Sometimes it's best to completely remove those records from your
dataset to stop them from skewing your analysis
Cap your outliers' data
Another way to handle true outliers is to cap them. For example, if
you're using income, you might find that people above a certain income
level behave in the same way as those with a lower income. In this
case, you can cap the income value at a level that keeps that behavior
intact
Assign a new value
If an outlier seems to be due to a mistake in your data, try imputing a
new value. Common imputation methods include using the mean of a
variable or using a regression model to predict the missing value
Try a transformation
A different approach to true outliers is to create a transformation of
the data rather than using the data itself. For example, try creating
a percentile version of your original field and working with that new
field instead (see the sketch below)
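
A minimal pandas sketch of the capping, imputation and
percentile-transform ideas above (the income column and the 95th
percentile threshold are illustrative assumptions):

    import pandas as pd

    df = pd.DataFrame({"income": [30e3, 45e3, 52e3, 61e3, 58e3, 2.5e6]})

    # Cap (winsorize) at the 95th percentile -- illustrative threshold
    cap = df["income"].quantile(0.95)
    df["income_capped"] = df["income"].clip(upper=cap)

    # Impute suspected mistakes with the mean of the remaining values
    mask = df["income"] > cap
    df["income_imputed"] = df["income"].where(
        ~mask, df.loc[~mask, "income"].mean())

    # Transformation: work with a percentile (rank) version of the field
    df["income_pct"] = df["income"].rank(pct=True)
    print(df)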
Class Imbalance

When the number of observations in one class is much higher than in
the other classes, there exists a class imbalance
Example: when detecting fault-prone and non fault-prone modules in
software, the fault-prone modules may number around 100 compared
with around 9000 non fault-prone modules
Class imbalance is a common problem in machine learning, especially
in classification problems
Imbalanced data can severely hamper model accuracy



The Problem with Class Imbalance

Most machine learning algorithms work best when the number of
samples in each class is about equal
This is because most algorithms are designed to maximize accuracy
and reduce errors
However, if the data set is imbalanced, you get a fairly high
accuracy just by predicting the majority class, but you fail to
capture the minority class



The Problem with Class Imbalance

Accuracy will mislead
For all the non fault-prone modules, you'd have 100% accuracy
For the fault-prone modules, you'd have 0% accuracy
Overall accuracy would be high simply because most modules are non
fault-prone, not because the model is good: with 100 fault-prone and
9000 non fault-prone modules, always predicting "non fault-prone"
already gives 9000/9100 ≈ 98.9% accuracy



Resampling Technique

A widely adopted technique for dealing with highly imbalanced
datasets is resampling
It consists of removing samples from the majority class
(under-sampling)
and/or adding more examples to the minority class
(over-sampling)



Resampling Technique



Resampling Technique

The simplest implementation of over-sampling is to duplicate random
records from the minority class, which can cause overfitting
In under-sampling, the simplest technique involves removing random
records from the majority class, which can cause loss of information



Random Under-Sampling

Under-sampling can be defined as removing some observations of the
majority class
This is done until the majority and minority classes are balanced
Under-sampling can be a good choice when you have a ton of data,
think millions of rows
But a drawback of under-sampling is that we may be removing
information that is valuable



Random Over-Sampling

Over-sampling can be defined as adding more copies of the minority
class
Over-sampling can be a good choice when you don't have a ton of
data to work with
A con to consider with over-sampling is that it can cause overfitting
and poor generalization on your test set
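
A minimal sketch of both random resamplers using the imbalanced-learn
library (assumed installed; the dataset is synthetic, standing in for
any feature matrix X and labels y):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler

    # Synthetic imbalanced data: roughly 1% minority class
    X, y = make_classification(n_samples=9100, weights=[0.99],
                               random_state=42)
    print(Counter(y))

    ros = RandomOverSampler(random_state=42)   # duplicate minority records
    X_over, y_over = ros.fit_resample(X, y)
    print(Counter(y_over))

    rus = RandomUnderSampler(random_state=42)  # drop majority records
    X_under, y_under = rus.fit_resample(X, y)
    print(Counter(y_under))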



Under-sampling: Tomek link

Tomek links are pairs of very close instances of opposite classes
Removing the majority-class instance of each pair increases the space
between the two classes, facilitating the classification process
A Tomek link exists if the two samples are each other's nearest
neighbors
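
A minimal imbalanced-learn sketch (library assumed; synthetic data):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import TomekLinks

    X, y = make_classification(n_samples=2000, weights=[0.9],
                               random_state=0)
    tl = TomekLinks()  # removes the majority-class member of each link
    X_res, y_res = tl.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))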



Under-sampling: Tomek link



Synthetic Minority Oversampling Technique (SMOTE)

This technique generates synthetic data for the minority class
It works by randomly picking a point from the minority class and
computing the k-nearest neighbors of this point
The synthetic points are added between the chosen point and its
neighbors
It synthesizes new minority instances between existing minority
instances
It generates virtual training records by linear interpolation for the
minority class
These synthetic training records are generated by randomly selecting
one or more of the k-nearest neighbors for each example in the
minority class



Synthetic Minority Oversampling Technique (SMOTE)

Step 1: Set the minority class set A; for each X ∈ A, the k-nearest
neighbors of X are obtained by calculating the Euclidean distance
between X and every other sample in set A
Step 2: The sampling rate N is set according to the imbalance
proportion. For each X ∈ A, N examples (i.e. X1, X2, ..., XN) are
randomly selected from its k-nearest neighbors, and they construct
the set A1
Step 3: For each example Xk ∈ A1 (k = 1, 2, ..., N), the following
formula is used to generate a new example:
X' = X + rand(0, 1) × |X - Xk|
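
A minimal sketch using imbalanced-learn's SMOTE, one common
implementation of the steps above (synthetic data):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=2000, weights=[0.9],
                               random_state=0)
    # k_neighbors is the k used to pick interpolation partners (Step 1)
    sm = SMOTE(k_neighbors=5, random_state=0)
    X_res, y_res = sm.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))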



Near-Miss Algorithm – Under sampling

Near-Miss is an under-sampling technique
It aims to balance the class distribution by eliminating majority
class examples
When instances of two different classes are very close to each other,
we remove instances of the majority class to increase the space
between the two classes
This helps the classification process and limits the information-loss
problem of random under-sampling



Near-Miss Algorithm – Under sampling

Step 1: The method first finds the distances between all instances of
the majority class and the instances of the minority class. Here, the
majority class is to be under-sampled
Step 2: Then, n instances of the majority class that have the
smallest distances to those in the minority class are selected



Different versions of the Near-Miss Algorithm

NearMiss – Version 1: selects samples of the majority class for which
the average distance to the k closest instances of the minority class
is smallest
NearMiss – Version 2: selects samples of the majority class for which
the average distance to the k farthest instances of the minority
class is smallest
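
A minimal imbalanced-learn sketch (library assumed; synthetic data):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import NearMiss

    X, y = make_classification(n_samples=2000, weights=[0.9],
                               random_state=0)
    # version=1 keeps majority samples closest on average to minority ones
    nm = NearMiss(version=1, n_neighbors=3)
    X_res, y_res = nm.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))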



ADASYN: Adaptive Synthetic Sampling Approach for
Imbalanced Learning

It uses a weighted distribution over the minority class examples
according to their level of difficulty in learning
More synthetic data is generated for minority class examples that are
harder to learn, compared to those minority examples that are easier
to learn
As a result, the ADASYN approach improves learning with respect to
the data distributions in two ways:
1. reducing the bias introduced by the class imbalance
2. adaptively shifting the classification decision boundary toward
the difficult examples



ADASYN: Adaptive Synthetic Sampling Approach for
Imbalanced Learning

Input: training data set Dtr with m samples (xi, yi), i = 1, ..., m,
where xi is an instance in the n-dimensional feature space X and yi
is the class label associated with xi
Define ms and ml as the number of minority class examples and the
number of majority class examples, respectively. Therefore, ms ≤ ml
and ms + ml = m
dth is a preset threshold for the maximum tolerated degree of class
imbalance ratio



ADASYN: Adaptive Synthetic Sampling Approach for
Imbalanced Learning

Calculate the degree of class imbalance as d = ms / ml, where
d ∈ (0, 1]
If d < dth, do the following steps:
Calculate the number of synthetic data examples that need to be
generated for the minority class: G = (ml - ms) × β, where β ∈ [0, 1]
is a parameter used to specify the desired balance level after
generation of the synthetic data
For each example xi in the minority class, find the K nearest
neighbors based on the Euclidean distance in the n-dimensional space,
and calculate the ratio ri = Si / K, i = 1, ..., ms, where Si is the
number of examples among the K nearest neighbors of xi that belong to
the majority class; therefore ri ∈ [0, 1]
Normalize ri as ri′ = ri / Σ(i=1..ms) ri, so that the ri′ form a
density distribution (they sum to 1)



ADASYN: Adaptive Synthetic Sampling Approach for
Imbalanced Learning

Calculate the number of synthetic data examples that need to be
generated for each minority example xi as gi = ri′ × G
For each minority class data example xi, generate gi synthetic data
examples using the following steps:
Randomly choose one minority data example, xzi, from the K nearest
neighbors of xi
Generate the synthetic data example as si = xi + (xzi - xi) × α,
where α ∈ [0, 1] is a random number
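
A minimal sketch using imbalanced-learn's ADASYN (library assumed;
n_neighbors plays the role of K above):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import ADASYN

    X, y = make_classification(n_samples=2000, weights=[0.9],
                               random_state=0)
    ada = ADASYN(n_neighbors=5, random_state=0)
    X_res, y_res = ada.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))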



Performance metric

Accuracy is not the best metric to use when evaluating imbalanced
datasets, as it can be misleading
Confusion matrix
Precision is a measure of a classifier's exactness
Recall is a measure of a classifier's completeness
F1-score is the harmonic mean of precision and recall
Area Under the ROC Curve (AUROC) represents the likelihood of the
model distinguishing observations from the two classes
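
A minimal scikit-learn sketch (the labels and scores below are
hypothetical stand-ins for a trained classifier's output):

    from sklearn.metrics import (classification_report,
                                 confusion_matrix, roc_auc_score)

    y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # actual labels
    y_pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]   # hard predictions
    y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.4]

    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, digits=3))
    print("AUROC:", roc_auc_score(y_true, y_score))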



Area Under the Curve (AUC)
AUC measures the ability of a classifier to distinguish between
classes; it compares the TP rate vs. the FP rate
The higher the AUC, the better the performance of the model at
distinguishing between the positive and negative classes
When AUC = 1, the classifier is able to perfectly distinguish all the
positive class points from the negative class points
When AUC = 0, the classifier predicts all negatives as positives, and
all positives as negatives



Area Under the Curve (AUC)
When 0.5 < AUC < 1, there is a high chance that the classifier will
be able to distinguish the positive class values from the negative
class values, because it detects more true positives and true
negatives than false negatives and false positives



Area Under the Curve (AUC)
When AUC = 0.5, the classifier is not able to distinguish between
positive and negative class points, meaning it is predicting either a
random class or a constant class for all the data points
So, the higher the AUC value of a classifier, the better its ability
to distinguish between the positive and negative classes



Case Study on Early Software Reliability Prediction

Reliability prediction during the early phases of the software
development life cycle (SDLC), i.e. before the testing phase
Failure data are not available before the testing phase; using
software metrics available in the early phases of the SDLC,
reliability can be predicted
Early prediction of software reliability is used to
evaluate design feasibility
compare design alternatives
identify potential failure areas
track reliability improvements
identify cost overruns at an early stage
provide optimal development strategies



Early Software Reliability Prediction Models

Software fault-prone module prediction
Software defect prediction
Software development effort prediction
Software testing effort prediction



Need for Software fault-prone module prediction

Predicting the software fault-prone modules during the early phases
(requirement analysis, design and coding) of the software development
process helps software project managers plan better testing
activities
Testing efforts are focused on the most troublesome fault-prone
modules only, instead of on all modules of the software system
To achieve high software reliability in less time, it is important
that testing is prioritized based on the fault-proneness of modules



Software Quality Metrics Affecting Fault-Proneness

A number of software quality metrics that affect fault-proneness are
considered
These metrics are classified into the following major categories:
Lines of code measures
Base Halstead metrics
Derived Halstead metrics
Branch count



Software fault-prone module prediction model: ANN
approach

Software fault-prone classification can be used to classify modules
as fault-prone or not
It ranks the software modules according to the number of defects, and
thus can be used to determine the order in which code should be
inspected
Developers can allocate the limited test resources to the code areas
most likely to contain bugs
It reduces the overall cost of maintenance activities and maximizes
company profits



Problem Statement

Develop an ANN model to classify software modules into faulty and
not faulty
The key goal of the model is to classify a software module as faulty
or not faulty
The model for software fault-prone classification can be defined as a
mapping Y = f(m), where m is a given module and Y is the
classification output, which can take two possible values (0/1):
Y = 0 means the software module m is not faulty, and Y = 1 means the
software module m is faulty
A software module m is represented as a set of software metrics,
defined as m = <sm1, sm2, ..., smn>
The target is to find an approximation to the mapping f using an ANN
approach



ANN approach



Data Set Example

CM1 Data Set



Software Metrics in Data Set



Data Preprocessing

Natural logarithmic transformation
It converts the data distribution from highly skewed to less skewed
Normalization
It transforms large data value ranges into small data value ranges
(min-max normalization)
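
A minimal NumPy sketch of both steps (log1p is used so zero-valued
metrics stay defined; that choice is an assumption, not from the
slides):

    import numpy as np

    X = np.array([[10.0, 200.0], [15.0, 5000.0], [12.0, 800.0]])

    # Natural log transform to reduce skew (log1p handles zeros safely)
    X_log = np.log1p(X)

    # Min-max normalization to [0, 1], column by column
    X_min, X_max = X_log.min(axis=0), X_log.max(axis=0)
    X_norm = (X_log - X_min) / (X_max - X_min)
    print(X_norm)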



ANN architecture, Training and Testing Phase

Determine the ANN architecture (number of layers, number of neurons
in each layer)
Identify the activation function of each layer
Train the model using the back-propagation algorithm, or customize
the training algorithm
Test the model using the testing data (see the sketch below)
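
A minimal sketch of these steps with scikit-learn's MLPClassifier,
one possible stand-in for the ANN (the architecture, activation and
synthetic data are illustrative assumptions, not the slides' exact
setup):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import classification_report

    # Stand-in for a software-metrics dataset such as CM1
    X, y = make_classification(n_samples=500, n_features=21,
                               weights=[0.9], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)

    # One hidden layer of 10 neurons, logistic activation, trained by
    # gradient-based back-propagation
    clf = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                        max_iter=1000, random_state=0)
    clf.fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te), digits=3))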



ANN model



Performance Measures using confusion matrix



Performance Measures

TPR = TP / (TP + FN)
TNR = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F-measure = 2 * Precision * Recall / (Precision + Recall)
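
A minimal sketch computing these measures from a confusion matrix
(the labels below are hypothetical):

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

    # ravel() order for binary labels is tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)          # recall / true positive rate
    tnr = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    print(tpr, tnr, accuracy, precision, f_measure)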



Demo in Python



References

1. Pravali Manchala and Manjubala Bisi, "Diversity based imbalance
learning approach for software fault prediction using machine
learning models", Applied Soft Computing (2022)
2. Kwabena Ebo Bennin, Jacky Keung, Passakorn Phannachitta, Akito
Monden, and Solomon Mensah, "MAHAKIL: Diversity Based Oversampling
Approach to Alleviate the Class Imbalance Issue in Software Defect
Prediction", IEEE Transactions on Software Engineering (2018)
3. Rheza Harliman and Kaoru Uchida, "Data- and Algorithm-Hybrid
Approach for Imbalanced Data Problems in Deep Neural Network",
International Journal of Machine Learning and Computing (2018)



Thank You

