
COMPARISON OF DISTANCE-BASED SAMPLING FOR IMBALANCED DATASET:
BINARY LOGISTIC REGRESSION

Hezlin Aryani Abd Rahman
CS953 (2013833008)
Supervisor: Assoc. Prof. Dr. Ahmad Zia Ul-Saufi Mohamad Japeri
PROBLEM STATEMENT
ISSUES IN IDS
1. Real Datasets Are Commonly Imbalanced
Binary examples (one class forms the majority, the other the minority):
• Survival status (DIED/ALIVE)
• Examination (PASS/FAIL)
• Production (REJECT/NOT)
• Medical exam (HEALTHY/SICK)

Note: Imbalance can also happen in polytomous datasets.

ISSUES IN IDS
2. The Imbalanced Classes Cause Poor Classification Accuracy for the Classifiers in Data Mining
(Japkowicz, 2000a; Lemnaru & Potolea, 2012; Visa & Ralescu, 2005; Zhi et al., 2012; Niu et al., 2022)

3. Solutions at Data Level or Algorithm Level?
• Most prefer the data level (Chawla, Bowyer, Hall, & Kegelmeyer, 2002; Kubat & Matwin, 1997; Liu et al., 2016; Niu et al., 2022).
• The reason is the relative simplicity of the technique (Lemnaru & Potolea, 2012; Weiss, 2004).
ISSUES IN IDS
4. How 'Imbalanced' is Imbalanced? What Degree Affects the IDS More?
• There are no universal guidelines for the data scientist as to what degree of imbalance affects a particular classifier, or which techniques perform well in IDS (He & Garcia, 2009; Weiss, 2013; Xing et al., 2012; Thabtah et al., 2020).

5. Can a Simulation Study be Beneficial in IDS Research?
• Many IDS studies apply the statistical method to a set of real data.
• Simulation allows us to control some parameters and assess the performance of techniques under various scenarios.
ISSUES IN IDS
6. There are Already Many Studies Solving the Binary Classification Problem
• Most studies focus on classifiers like Decision Tree C4.5, C5.0, MLP and SVM, as well as algorithm modifications in IDS (He, 2003).
• Not many studies approach the statistical model, i.e. binary logistic regression (Japkowicz, 2003a).

7. How does the IDS Problem Affect Binary Logistic Regression?
• Since many IDS studies use Decision Tree C4.5 & C5.0, MLP and SVM (Weiss, 2013; He, 2003), there is a need to study the predictive performance of BLR in IDS.
ISSUES IN IDS
8. Using BLR, will the Parameter Estimates be Affected under Different Ratios of Imbalance?
• More information is needed on the effect of imbalance on the parameter estimates of BLR.

9. Can the Sampling Strategy be Further Improved in Solving IDS?
• Common sampling strategies in IDS are random sampling, SMOTE, etc., which have shown some overfitting issues (Weiss, 2013; He, 2003).
• Distance-based sampling shows promising results, thus more study is required.
RESEARCH QUESTIONS
1. What is the threshold of imbalance ratio that affects the parameter estimation of the binary logistic regression model?
2. Is there a difference in the performance of parameter estimation of the binary logistic regression in dealing with imbalanced datasets? Is the effect significant?
3. Can distance-based sampling techniques improve the parameter estimation of the binary logistic regression model?
4. What is the best distance-based sampling technique for imbalanced datasets in the case of the binary logistic regression model?
RESEARCH OBJECTIVES
To compare the performance of different distance-based sampling strategies for binary logistic regression, decision tree and artificial neural network in terms of accuracy, sensitivity and specificity for imbalanced datasets.
SCOPE OF STUDY
Classifier:
Binary Logistic Regression
Decision Tree
ANN

Sampling Techniques:
Distance-based Undersampling (Euclidean, Mahalanobis, Manhattan, Minkowski, and Hamming distances)

Benchmark Data:
Binary dataset from UCI Repository (used in the simulation study by
Weiss & Provost, 2001)
LITERATURE REVIEW
Imbalanced Learning
[Diagram: solutions in IDS at the Data Level, the Algorithm Level, and the Evaluation Level — after Weiss (2013)]
Summary of Methods in IDS
METHOD: Sampling (available in R: YES)
• Strengths: simplest and most popular strategy for handling IDS; solves the problem at the data level; usually outperforms other methods because of its simplicity; the first choice of data scientists.
• Weaknesses: potential over-fitting; redundancy; discards potentially useful observations that could be important for the induction process.

METHOD: Ensemble (available in R: YES)
• Strengths: high generalization ability; history of great success in many applications; has evolved through combinations with sampling methods (hybrid ensemble methods); promising in handling multi-class IDS (Zhou & Liu, 2012).
• Weaknesses: potential decrease in accuracy for more complex ensembles; potential over-fitting; quite complex in terms of algorithm development.

METHOD: Cost-Sensitive (available in R: NO)
• Strengths: algorithm-based learning; can accept cost information from the user.
• Weaknesses: the user needs to calculate the cost to feed the algorithm and keeps adjusting it; the cost is difficult to determine; not easy for basic users.

METHOD: Feature Selection (available in R: YES)
• Strengths: good at handling high-dimensional datasets; improves the predictive performance of classifiers.
• Weaknesses: does not work well in IDS without multi-dimensional features.

METHOD: Algorithm Modification (available in R: NO)
• Strengths: promising results in handling IDS; improves the predictive performance of classifiers.
• Weaknesses: unfavorable, as it is difficult to implement; not easy for basic users.
SAMPLING APPROACH
Advantages:
• Solves the IDS problem at the data level (Bekkar & Alitouche, 2013; Chawla et al., 2002)
• The easiest and most popular in learning imbalanced datasets (Hoens & Chawla, 2013; Japkowicz, 2000b; Chawla et al., 2002)
• Outperformed other methods owing to its simplicity (Bekkar & Alitouche, 2013; Chawla et al., 2002)
• Drawbacks can be solved using heuristic/hybrid sampling approaches (Bekkar & Alitouche, 2013)

Disadvantages:
• How many samples are enough? To oversample or undersample? (Hoens & Chawla, 2013; Japkowicz, 2000b)
• Over-fitting, from oversampling (Japkowicz, 2000b; Chawla, 2003)
• Oversampling generates synthetic data, copying an observation many times (Japkowicz, 2000b; Chawla et al., 2002)
• Undersampling tends to eliminate important observations (Bekkar & Alitouche, 2013; Chawla et al., 2002)
Binary Logistic Regression (BLR)

• The limitation of linear regression in modelling dichotomous and categorical data has led to the introduction of logistic regression.
• The logistic model predicts the logit of Y from X, which is the natural logarithm (ln) of the odds of Y, where the odds are the ratio of the probability (π) of Y happening to the probability (1 − π) of Y not happening.
• The logit transformation of the model is as follows:

  logit(π) = ln(π / (1 − π))

• The following equation shows the simple logistic regression model:

  ln(π / (1 − π)) = β0 + β1X, so that π = exp(β0 + β1X) / (1 + exp(β0 + β1X))

Source: Han & Kamber, 2006
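As a minimal illustration (not on the original slide), such a model can be fitted in base R with glm(); the data and variable names below are hypothetical:

```r
# Minimal sketch (hypothetical data): fit a binary logistic regression with glm().
set.seed(1)
x1 <- rnorm(200)
p  <- 1 / (1 + exp(-(0.5 + 1.2 * x1)))   # true model: logit(p) = 0.5 + 1.2 * x1
y  <- rbinom(200, size = 1, prob = p)

fit <- glm(y ~ x1, family = binomial(link = "logit"))
summary(fit)     # parameter estimates, standard errors, p-values
exp(coef(fit))   # odds ratios
```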


Binary Logistic Regression
The assumptions in Logistic Regression are as follows (Hosmer, 2004):
1. Logistic regression does not assume a linear relationship between the dependent
and independent variables.
2. The dependent variable must be categorical (either dichotomous, i.e. two classes, or
more).
3. The independent variables need not be interval, nor normally distributed, nor
linearly related, nor of equal variance within each group.
4. The categories (groups) must be mutually exclusive and exhaustive; a case can
only be in one group and every case must be a member of one of the groups.
5. Larger samples are needed than for linear regression because maximum
likelihood estimation works well only for large samples.

Source: Hosmer, 2004


Simulation Study
• The objective of a simulation study differs from one study to
the other.
• In statistical analysis, simulation studies are performed to
assess the performance of a selection of statistical methods in
relation to a known truth, or to observe an underlying truth,
using computer-intensive procedures. Such a study cannot be
achieved by merely applying the statistical method to a single
set of data.
SIMULATION STUDIES
IN IDS
Statements and sources:
• There are few published simulation studies in IDS; this is insufficient for other researchers to gain information on the methods, procedures and details of performing such studies. (Burton, Altman, Royston, & Holder, 2006)
• Assessed the effects of the complexity of a dataset and the degree of imbalance on a dual multi-layered perceptron (DMLP), a type of neural network classifier; generated 125 variables with 5000 observations; considered undersampling and oversampling. (Japkowicz, 2003)
METHODOLOGY
Research Process Flow
Phase 1: Simulation 1 (Objectives 1 & 2)
• Effects of imbalance on the classification performance of binary logistic regression for different types of covariates
• Obtain the threshold

Phase 2: Simulation 2 (Objective 3)
• Effects of ROS & RUS on the classification performance of binary logistic regression for different types of covariates

Phase 3: Simulation 3 (Objective 4)
• Effects of E-DBUS (Li et al., 2013), iE-DBUS (simplified E-DBUS) & M-DBUS on the classification performance of binary logistic regression for different types of covariates

Phase 4: Evaluation (Objective 5)
• Using 14 real benchmark datasets from the UCI repository, as used by Weiss & Provost (2001), to evaluate the performance of the sampling techniques (ROS, RUS, E-DBUS, iE-DBUS & M-DBUS) on IDS
Simulation Study 1
Classifier: Binary Logistic Regression
Covariates:
• One continuous (x1)
• Two continuous (x1 & x2)
• One categorical (x1)
• One categorical (x1) & one continuous (x2)
Imbalance Ratio (IR%): 1, 2, 5, 10, 20, 30, 40, 50
Sample Size (n): 100, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000
Simulation replications: 5000

Objectives:
1. To compare the performance of binary logistic regression in parameter estimation for imbalanced datasets.
2. To determine the threshold level where the imbalance ratio (IR) affects the parameter estimation of the binary logistic regression.
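The slides do not show the data-generating code, so the following is only a hedged sketch of one replicate, assuming the imbalance ratio is controlled through the intercept; all names are illustrative:

```r
# One hypothetical replicate: single continuous covariate, target minority
# proportion ir, true slope beta1. qlogis(ir) centres P(y = 1) near ir.
simulate_blr <- function(n = 1000, ir = 0.05, beta1 = 1) {
  x1    <- rnorm(n)
  beta0 <- qlogis(ir)
  y     <- rbinom(n, 1, plogis(beta0 + beta1 * x1))
  data.frame(y = y, x1 = x1)
}

dat <- simulate_blr(n = 1000, ir = 0.05)
fit <- glm(y ~ x1, family = binomial, data = dat)
coef(fit)   # in the study, such estimates are summarised over 5000 replications
```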
Simulation Study 2
Classifier: Binary Logistic Regression
Covariates:
• One continuous (x1)
• Two continuous (x1 & x2)
• One categorical (x1)
• One categorical (x1) & one continuous (x2)
Sampling Techniques:
• Random Oversampling (ROS)
• Random Undersampling (RUS)
Imbalance Ratio (IR%): 1, 2, 5, 10, 20, 30, 40
Sample Size (n): 100, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000
Simulation replications: 5000

Objective:
3. To compare the performance of different sampling strategies readily available in R programming (random oversampling and undersampling) in handling imbalanced datasets.

Note: ROS & RUS are implemented with the ROSE package in R.
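A minimal sketch of ROS and RUS via the ROSE package referenced above, reusing the hypothetical `dat` from the previous sketch; the target balance p = 0.5 is an assumption:

```r
library(ROSE)

# Random oversampling and undersampling to a target minority proportion of 0.5.
ros <- ovun.sample(y ~ x1, data = dat, method = "over",  p = 0.5)$data
rus <- ovun.sample(y ~ x1, data = dat, method = "under", p = 0.5)$data

fit_ros <- glm(y ~ x1, family = binomial, data = ros)
fit_rus <- glm(y ~ x1, family = binomial, data = rus)
```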
Simulation Study 3
Classifier: Binary Logistic Regression
Covariates: One continuous (x1)
Sampling Techniques:
• Euclidean Distance-Based Undersampling (E-DBUS) (Li et al., 2013)
• Improved Euclidean Distance-Based Undersampling (iE-DBUS)
• Mahalanobis Distance-Based Undersampling (M-DBUS)
Imbalance Ratio (IR%): 1, 2, 5, 10, 20, 30, 40
Sample Size (n): 100, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000
Simulation replications: 5000

Objective:
4. To improve the parameter estimation of the binary logistic regression using a distance-based sampling approach on imbalanced datasets.
Improvements on Method

E-DBUS (Li et al., 2013)
Let the minority class have N instances and the majority class have M instances.
Step 1: Select a sample x_i of the majority class and calculate the Euclidean distance to each sample in the minority class; record these as d_ij.
Step 2: Compute the average distance, d_i.
Step 3: If d_i is greater than a predefined threshold, then x_i is deleted; otherwise reserve x_i.
Step 4: Repeat Steps 1 to 3 for all samples in the majority class.
Step 5: A new dataset is generated from the reserved samples.

iE-DBUS (Improved Method)
Let the minority class have N instances and the majority class have M instances.
Step 1: Select a sample x_i of the majority class and calculate the Euclidean distance to each sample in the minority class; record these as d_ij.
Step 2: Compute the average distance, d_i.
Step 3: If d_i is greater than a predefined threshold, then x_i is deleted; otherwise reserve x_i.
Step 4: Repeat Step 3 for all samples in the majority class until M = N.
Step 5: If, after the deletions in Step 3, it is still M > N, then delete samples in order of the largest d_i values until M = N.
Step 6: A new dataset is generated from the reserved samples.

Note: the threshold value is the entropy value:
H(X) = −Σ_{k ∈ K} p(k) log p(k)
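A hedged R sketch of the E-DBUS steps as summarised above (not Li et al.'s original code); the single covariate `x1`, the class coding with y = 1 as the minority, and the threshold handling are assumptions:

```r
# Sketch of E-DBUS: delete majority samples whose average Euclidean distance
# to the minority class exceeds the threshold (e.g. the entropy value).
e_dbus <- function(dat, threshold, covars = "x1") {
  maj  <- dat[dat$y == 0, , drop = FALSE]          # majority class (M instances)
  mnr  <- dat[dat$y == 1, , drop = FALSE]          # minority class (N instances)
  Xmaj <- as.matrix(maj[, covars, drop = FALSE])
  Xmnr <- as.matrix(mnr[, covars, drop = FALSE])
  # Steps 1-2: average distance from each majority sample to all minority samples
  avg_d <- apply(Xmaj, 1, function(xi) mean(sqrt(colSums((t(Xmnr) - xi)^2))))
  # Step 3: reserve only the samples at or below the threshold
  rbind(maj[avg_d <= threshold, , drop = FALSE], mnr)
}

# iE-DBUS-style balancing (one reading of Step 5): if the classes are still
# unbalanced, drop the majority samples with the largest avg_d first until M = N.
```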
Improvements on Method

M-DBUS (Mahalanobis Distance)
Let the minority class have N instances and the majority class have M instances.
Step 1: Select a sample x_i of the majority class and calculate the Mahalanobis distance to the samples in the minority class; record these as d_ij.
Step 2: Compute the average distance, d_i.
Step 3: If d_i is greater than a predefined threshold, then x_i is deleted; otherwise reserve x_i.
Step 4: Repeat Steps 1 to 3 for all samples in the majority class.
Step 5: A new dataset is generated from the reserved samples.

The Euclidean distance is as follows:
d = √((x2 − x1)² + (y2 − y1)²)
where d is the Euclidean distance, (x1, y1) is the coordinate of the first point and (x2, y2) is the coordinate of the second point.

The Mahalanobis distance is calculated as follows:
D² = (x − μ)' Σ⁻¹ (x − μ)
where D² is the square of the Mahalanobis distance, x is the vector of the observation (row in a dataset), μ is the vector of mean values of the independent variables (mean of each column), and Σ⁻¹ is the inverse covariance matrix of the independent variables.

Note: the threshold value is the entropy value:
H(X) = −Σ_{k ∈ K} p(k) log p(k)
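For the Mahalanobis step, base R's mahalanobis() matches the D² formula above directly; the sketch below computes distances of majority rows to the minority mean vector, which is one hedged reading of Step 1 (names are illustrative):

```r
# Squared Mahalanobis distances of majority observations from the minority
# centroid, using the minority covariance; mahalanobis() returns D^2.
m_dbus_distance <- function(X_maj, X_min) {
  mu <- colMeans(X_min)   # mean vector of the independent variables
  S  <- cov(X_min)        # covariance matrix (inverted internally)
  sqrt(mahalanobis(X_maj, center = mu, cov = S))
}
```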
Evaluation on Method
Benchmark Dataset: 14 UCI Repository datasets (Weiss & Provost, 2001)
Evaluation Criteria: Accuracy, Sensitivity, Specificity
Sampling Techniques: ROS, RUS, E-DBUS (Li et al., 2013), iE-DBUS, M-DBUS

Objective:
5. To evaluate the performance of the enhanced distance-based undersampling techniques on the classification of binary logistic regression in handling highly imbalanced datasets, using benchmark datasets.

Notes for E-DBUS, iE-DBUS & M-DBUS:
1. The threshold value is the entropy value: H(X) = −Σ_{k ∈ K} p(k) log p(k)
2. For more than one independent variable, the average of the distances is taken for measurement.
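A small sketch of the three evaluation criteria computed from a confusion matrix, assuming the minority class is coded 1 and a 0.5 cutoff (both assumptions); `fit` and `dat` are from the earlier sketches:

```r
# Accuracy, sensitivity and specificity from 0/1 truths and predictions.
eval_metrics <- function(y_true, y_pred) {
  tp <- sum(y_pred == 1 & y_true == 1); tn <- sum(y_pred == 0 & y_true == 0)
  fp <- sum(y_pred == 1 & y_true == 0); fn <- sum(y_pred == 0 & y_true == 1)
  c(accuracy    = (tp + tn) / length(y_true),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

pred <- as.integer(predict(fit, type = "response") > 0.5)
eval_metrics(dat$y, pred)
```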
Benchmark Data Set Descriptions
[Table: descriptions of the 14 benchmark datasets]
RESULTS & DISCUSSIONS
Phase 1:
Simulation 1
Objective 1: To compare the performance of binary logistic regression in the parameter
estimation for imbalanced datasets.
Objective 2: To determine the threshold level where the imbalance ratio (IR) affects the
parameter estimation of the binary logistic regression.
[Line plots of parameter estimates (β1): one continuous covariate]
[Line plots of parameter estimates (β1, β2): two continuous covariates]
[Line plots of parameter estimates (β1): one categorical covariate]
[Line plots of parameter estimates (β1, β2): one categorical & one continuous covariate]
[Horizontal boxplots of parameter estimates: one continuous covariate]
[Horizontal boxplots of parameter estimates: two continuous covariates]
[Horizontal boxplots of parameter estimates: one categorical covariate]
[Horizontal boxplots of parameter estimates: one categorical & one continuous covariate]
Objective 1: To compare the performance of binary logistic
regression in the parameter estimation for imbalanced
datasets.
1. The smaller the sample size, the more it is affected by imbalance;
2. highly imbalanced data (IR ≤ 20%) affects even big sample sizes (n = 5000);
3. models with more variables may be more biased by imbalance than simple models.
Thus, these simulation results give a better understanding and
confirmation of the effect of imbalance and sample size on the
parameter estimation of a binary logistic regression.
Objective 2: To determine the threshold level where the
imbalance ratio (IR) affects the parameter estimation of the
binary logistic regression.
Based on Table 4.8:
1. Different distributions gave different cut-offs or thresholds for different sample sizes and IR percentages.
2. In summary, the most affected are:
   1. highly imbalanced data (IR ≤ 20%), even for large samples (n = 5000);
   2. small sample sizes (n).
Thus, these simulation results confirm that highly imbalanced
data and small sample sizes are the most affected by imbalance.
Phase 2:
Simulation 2
Objective 3: To compare the performance of different sampling
strategies readily available in R programming (random oversampling,
undersampling) in handling imbalanced datasets.
[Line plots of parameter estimates for RUS & ROS (β1): one continuous covariate]
[Line plots of parameter estimates for RUS & ROS (β1, β2): two continuous covariates]
[Line plots of parameter estimates for RUS & ROS (β1): one categorical covariate]
[Line plots of parameter estimates for RUS & ROS (β1, β2): one categorical & one continuous covariate]
Summary
1. Results for ROS are better than for RUS, since the ROS data becomes
bigger compared to RUS.
2. However, both sampling strategies give better performance
than no sampling.
3. For small sample sizes and low imbalance ratios, the imbalance effect is still prominent.
Phase 3:
Simulation 3
Objective 4: To improve the parameter estimation of the binary logistic
regression using a distance-based undersampling approach on imbalanced
datasets.
[Line plots of parameter estimates (β1): no sampling, RUS, E-DBUS, iE-DBUS, M-DBUS]
[Line plots of MSE (β1): no sampling, RUS, E-DBUS, iE-DBUS, M-DBUS]
[Horizontal boxplots of parameter estimates (β1): no sampling, RUS, E-DBUS, iE-DBUS, M-DBUS]
Summary
1. RUS, E-DBUS, iE-DBUS and M-DBUS all performed better than no sampling.
2. Although iE-DBUS and M-DBUS did not perform much differently
from the E-DBUS of Li et al. (2013), all three methods performed
better than the RUS method.
3. This might be because RUS is random in nature, while the other
three methods have a set of rules that guides the selection (or
removal) of samples.
4. However, for IR = 1% and IR = 2%, and sample size n < 1000,
the effect of imbalance is still visible.
Phase 4:
Evaluation
Objective 5: To evaluate the performance of the improved distance-based
undersampling technique on the classification of binary logistic
regression in handling imbalanced datasets using benchmark datasets.
[Spiderweb diagrams of train and test performance (scale 0 to 1) comparing ORI (no sampling), ROS, RUS, E-DBUS, iE-DBUS and M-DBUS on the benchmark datasets: (i) Lattera, (ii) Pendigits, (iii) Abalone, (iv) Sick, (v) Hyperthyroid, (vi) SolarFlare, Contraceptive, (vii) Adult, (viii) Splice, (ix) Yeast, (x) Car, (xi) German, (xii) BC, (xiii) Bands]
Summary of Spiderweb
Diagrams
• The results of this evaluation on benchmark datasets are
aligned with the simulation results in the previous sections.
• In conclusion,
(1) the smaller the sample size, the more it is affected by
imbalance;
(2) highly imbalanced data (IR ≤ 20%) affects even big sample
sizes (n = 5000), especially IR < 10%;
(3) models with more variables might be more biased toward
imbalance compared to simple models.
CONCLUSION &
RECOMMENDATIONS
Conclusions
1. Effect Of Imbalance On Parameter Estimation
1. The effect of the imbalance ratio is more prominent for small sample sizes (n ≤ 1000). This should be considered especially when sampling to balance the imbalance effect; the total sample in this context is the sample size after sampling has taken place.
2. A highly imbalanced ratio (IR ≤ 20%) affects even big sample sizes (n = 5000). The imbalance ratio in the response variable affects not only the parameter estimates, but the p-values and odds ratios of the covariates as well; hence it leads to biased estimates and inaccurate findings.
3. Models with more variables may be more biased toward imbalance compared to simple models: the more complex the binary logistic regression model is, the more prone it is to the effect of imbalance.
2. Threshold Level Of Imbalance Ratio
1. Big data with large sample sizes (at least n > 10,000) will not be much affected by the imbalance problem. However, for sample sizes less than 10,000, sampling strategies should be considered, especially for highly imbalanced datasets.
2. Simple models with IR > 20% will not be much affected by the imbalance problem, especially for large sample sizes. However, for IR ≤ 20%, sampling strategies should be considered to lower the effect of imbalance on the model.
3. For more complex models with many variables, both sample size and imbalance ratio have a great effect on the performance of the model. A parsimonious model is thus much less prone to imbalance, and complex models should consider sampling strategies to lower the imbalance effect.
3. EFFECT OF RESAMPLING IDS ON PARAMETER ESTIMATION
1. From the results of the previous chapter, resampling of imbalanced datasets has a tremendous effect in lowering the effect of imbalance on the classification performance of the binary logistic regression model. Although random oversampling showed quite an outstanding performance in this study, replicating observations is unlikely to be a good choice, especially from a statistical point of view.
2. The undersampling approach, although it may seem "reckless" for discarding precious observations, is preferred over creating synthetic observations. However, this study showed that randomly discarding observations is an unjust and unstable approach. Thus, undersampling guided by a selection criterion is the preferred way to undersample.
4. COMPARISON AND PERFORMANCE OF DISTANCE SAMPLING ON PARAMETER
ESTIMATION
1. Distance-based undersampling, with either the Euclidean or the Mahalanobis method, has been shown to significantly improve the performance of binary logistic regression in terms of parameter estimation and classification performance (accuracy, sensitivity, and specificity). Although the methods proposed in this study (E-DBUS, iE-DBUS and M-DBUS) did not differ much from each other in performance, all of them certainly proved to be better choices than no sampling or random undersampling for improving the performance of binary logistic regression on imbalanced datasets.
Recommendations
1. Explore categorical covariates further; there is potential to obtain
more information from simulation studies.
2. Explore other types of data, i.e. panel data, circular data.
3. Focus on highly imbalanced datasets (IR < 5%).
4. Extend to other classifiers: the Support Vector Machine
algorithm and the k-Nearest Neighbours approach should be used
in combination with sampling techniques to improve classifier
performance on imbalanced datasets.
REFERENCES
He, H. & Ma, Y. (2013) "Imbalanced Learning: Foundations, Algorithms, and Applications". IEEE Press & Wiley, ISBN: 9781118074626.
He, H. & Garcia, E.A. (2009) "Learning from imbalanced data sets". IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263-1284.
Weiss, G. & Provost, F. (2003) "Learning when training data are costly: the effect of class distribution on tree induction". Journal of Artificial Intelligence Research, vol. 19, pp. 315-354.
Japkowicz, N. (2003) "Concept learning in the presence of between-class and within-class imbalances". Proc. of the Fourteenth Conference of the Canadian Society for Computational Studies of Intelligence, pp. 67-77.
Weiss, G. (1995) "Learning with rare cases and small disjuncts". Proc. of the Twelfth International Conference on Machine Learning, pp. 558-565.
Weiss, G. (2013) "Foundations of Imbalanced Learning". In "Imbalanced Learning: Foundations, Algorithms, and Applications", IEEE Press & Wiley, ISBN: 9781118074626, ch. 2, pp. 13-38.
Japkowicz, N. (2013) "Assessment Metrics for Imbalanced Learning". In "Imbalanced Learning: Foundations, Algorithms, and Applications", IEEE Press & Wiley, ISBN: 9781118074626, ch. 8, pp. 187-205.
Kubat, M. & Matwin, S. (1997) "Addressing the curse of imbalanced training sets: one-sided selection". Proc. of the Fourteenth International Conference on Machine Learning, pp. 179-186.
Breiman, L. (1996) "Bagging Predictors". Machine Learning, 24, pp. 123-140.
Zadrozny, B., Langford, J. & Abe, N. (2003) "Cost-sensitive learning by cost-proportionate example weighting". Proc. of the Third IEEE International Conference on Data Mining, pp. 435-442.
Fan, W., Stolfo, S.J., Zhang, J. & Chan, P. (1999) "AdaCost: misclassification cost-sensitive boosting". Proc. of the Sixteenth International Conference on Machine Learning, pp. 99-105.
Chawla, N., Lazarevic, A., Hall, L. & Bowyer, K. (2003) "SMOTEBoost: Improving prediction of the minority class in boosting". Proc. of Principles of Knowledge Discovery in Databases, pp. 107-119.
Galar, M., Fernández, A., Barrenechea, E., Bustince, H. & Herrera, F. (2011) "An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes". Pattern Recognition, 44, pp. 1761-1776.
Doucette, J. & Heywood, M.I. (2008) "GP classification under imbalanced data sets: Active sub-sampling and AUC approximation". Lecture Notes in Computer Science, vol. 4971, pp. 266-277.
Schölkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J. & Williamson, R.C. (2001) "Estimating the support of a high-dimensional distribution". Neural Computation, vol. 13, pp. 1443-1471.
Hong, X., Chen, S. & Harris, C.J. (2007) "A kernel-based two-class classifier for imbalanced data sets". IEEE Transactions on Neural Networks, vol. 18, no. 1, pp. 28-41.
ADD-ONS BASED ON
PREVIOUS MOCK
1. Change in the main title.
2. Reorganize and rephrase the research questions and objectives.
3. Add recent papers to the literature review.
4. Reorganize the methodology & add more equations.
5. Rewrite Chapter 4: discussion of results and analysis.
6. Add another distance-based sampling method, using the Mahalanobis distance.
7. Represent the analysis better with creative chart comparisons and spiderweb diagrams.
Q&A SESSION
THANK YOU
