UN Makassar Proposal 2023
RESEARCH OBJECTIVES
To compare the performance of different distance-based sampling
strategies for binary logistic regression, decision tree and
artificial neural network classifiers in terms of accuracy,
sensitivity and specificity on imbalanced datasets.
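The three evaluation measures named in the objective can be computed from a confusion matrix. The following is a minimal Python sketch (the study itself refers to R); the function name and the toy labels are hypothetical, and the minority class is assumed to be coded as 1.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Sensitivity: share of minority (positive) cases that are detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of majority (negative) cases that are detected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


# Hypothetical toy data: 2 minority cases out of 10, and a classifier
# that always predicts the majority class.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
acc, sens, spec = classification_metrics(y_true, y_pred)
print(acc, sens, spec)
```

In this toy case accuracy looks high while sensitivity is zero, which is why accuracy alone is not enough for imbalanced datasets.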
SCOPE OF STUDY
Classifier:
Binary Logistic Regression
Decision Tree
Artificial Neural Network (ANN)
Sampling Techniques:
Distance-based Undersampling (Euclidean, Mahalanobis,
Manhattan, Minkowski and Hamming distances)
Benchmark Data:
Binary dataset from UCI Repository (used in the simulation study by
Weiss & Provost, 2001)
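The distance measures listed under Sampling Techniques can be written down directly. This is a minimal Python sketch using only the standard library; the function names are illustrative, and the Mahalanobis version assumes the caller supplies an inverse covariance matrix (`inv_cov`), which the source does not specify how to estimate.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    """Number of positions at which two (categorical) vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def mahalanobis(x, y, inv_cov):
    """sqrt((x - y)^T S^{-1} (x - y)); inv_cov is S^{-1} as nested lists."""
    diff = [a - b for a, b in zip(x, y)]
    k = len(diff)
    q = sum(diff[i] * inv_cov[i][j] * diff[j]
            for i in range(k) for j in range(k))
    return math.sqrt(q)


a, b = (0.0, 0.0), (3.0, 4.0)
identity = [[1.0, 0.0], [0.0, 1.0]]
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, 3),
      hamming(("a", "b", "c"), ("a", "x", "c")),
      mahalanobis(a, b, identity))
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance, which gives a quick sanity check.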
LITERATURE REVIEW
Imbalanced Learning for IDS (Weiss, 2013):
• Data Level
• Algorithm Level
• Evaluation Level
Summary of Methods in IDS
Sampling (available in R: YES)
Strengths:
• Simplest and most popular strategy for handling IDS
• Solves the problem at the data level
• Usually outperforms other methods because of its simplicity
• The first choice of data scientists
Weaknesses:
• Potential over-fitting problem
• Redundancy problem
• Discards potentially useful observations that could be important for the induction process

Ensemble (available in R: YES)
Strengths:
• High generalization ability
• History of great success in every application
• Evolved through a series of combinations with sampling methods (hybrid ensemble methods)
• Promising in handling multi-class IDS (Zhou & Liu, 2012)
Weaknesses:
• Potential decrease in accuracy for more complex ensembles
• Potential over-fitting
• Quite complex in terms of algorithm development

Cost-Sensitive (available in R: NO)
Strengths:
• Algorithm-based learning
• Ability to accept cost information from the user
Weaknesses:
• The user needs to calculate the cost to feed the algorithm (and keeps adjusting it); the cost is difficult to determine
• Not easy for basic users to use

Feature Selection (available in R: YES)
Strengths:
• Good at handling high-dimensional datasets
• Improves the predictive performance of classifiers
Weaknesses:
• Does not work well in IDS without multi-dimensional features
Phase 2 (Objective 3, Simulation 2):
• Effects of ROS & RUS on IDS on the classification
performance of binary logistic regression for
different types of covariates
Phase 3 (Objective 4, Simulation 3):
• Effects of E-DBUS (Li et al., 2013), iE-DBUS
(simplified E-DBUS) & M-DBUS on IDS on the
classification performance of binary logistic
regression for different types of covariates
Phase 4 (Objective 5, Evaluation):
• Using 14 real benchmark datasets from the UCI
repository, as used by Weiss & Provost (2001),
to evaluate the performance of sampling
techniques (ROS, RUS, DBUS & IDBUS) on IDS
Simulation Study 1
Classifier:
• Binary Logistic Regression
Covariates:
• One Continuous (x1)
• Two Continuous (x1 & x2)
• One Categorical (x1)
• One Categorical (x1) & One Continuous (x2)
Imbalance Ratio (IR%):
• 1, 2, 5, 10, 20, 30, 40, 50
Sample Size (n):
• 100, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000
Simulation replication:
• 5000
Objectives
1. To compare the performance of binary logistic regression
in parameter estimation for imbalanced datasets.
2. To determine the threshold level at which the imbalance
ratio (IR) affects the parameter estimation of the binary
logistic regression.
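A design of this kind can be sketched for the one-continuous-covariate case as follows. This is an illustration in pure Python rather than the study's R code, and every modelling choice is an assumption not stated in the source: x1 is drawn from N(0, 1), the true slope is 1, the intercept is found by bisection so that the expected event rate equals the target IR, the model is fitted by Newton-Raphson, and only a handful of replications are run instead of 5000.

```python
import math
import random


def sigmoid(z):
    z = max(-500.0, min(500.0, z))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def intercept_for_ratio(xs, b1, ratio):
    """Bisection for b0 so that the mean of P(y=1) equals `ratio`."""
    lo, hi = -30.0, 30.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if sum(sigmoid(mid + b1 * x) for x in xs) / len(xs) < ratio:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def fit_logistic(xs, ys, iters=25):
    """Newton-Raphson MLE for logit P(y=1) = b0 + b1 * x."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient w.r.t. b0
            g1 += (y - p) * x    # gradient w.r.t. b1
            h00 += w             # entries of X^T W X
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def simulate(n, ratio, true_b1=1.0, reps=5, seed=1):
    """Mean estimate of b1 over `reps` simulated datasets."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b0 = intercept_for_ratio(xs, true_b1, ratio)
        ys = [1 if rng.random() < sigmoid(b0 + true_b1 * x) else 0
              for x in xs]
        if 0 < sum(ys) < n:  # skip datasets with only one class
            estimates.append(fit_logistic(xs, ys)[1])
    return sum(estimates) / len(estimates) if estimates else float("nan")


mean_b1 = simulate(n=2000, ratio=0.05)
print(mean_b1)
```

Repeating the call over the grid of sample sizes and imbalance ratios, and comparing the mean estimate with the true slope, is one way to look for the IR level at which estimation starts to degrade.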
Simulation Study 2
Classifier:
• Binary Logistic Regression
Covariates:
• One Continuous (x1)
• Two Continuous (x1 & x2)
• One Categorical (x1)
• One Categorical (x1) & One Continuous (x2)
Sampling Technique:
• Random Oversampling (ROS)
• Random Undersampling (RUS)
Imbalance Ratio (IR%):
• 1, 2, 5, 10, 20, 30, 40
Simulation replication:
• 5000
Objectives
3. To compare the performance of different sampling
strategies readily available in R (random oversampling
and random undersampling) in handling imbalanced datasets.
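The two sampling techniques compared in this study can be illustrated with a short Python sketch. It assumes that both techniques resample until the two classes have equal size (the source does not state the target ratio) and that the minority class is labelled 1; the function names are hypothetical.

```python
import random


def _class_indices(labels, minority):
    min_idx = [i for i, y in enumerate(labels) if y == minority]
    maj_idx = [i for i, y in enumerate(labels) if y != minority]
    return min_idx, maj_idx


def random_oversample(rows, labels, minority=1, seed=0):
    """ROS: duplicate random minority rows until the classes are balanced."""
    rng = random.Random(seed)
    min_idx, maj_idx = _class_indices(labels, minority)
    extra = [rng.choice(min_idx)
             for _ in range(len(maj_idx) - len(min_idx))]
    keep = maj_idx + min_idx + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_undersample(rows, labels, minority=1, seed=0):
    """RUS: keep a random subset of majority rows, same size as minority."""
    rng = random.Random(seed)
    min_idx, maj_idx = _class_indices(labels, minority)
    keep = rng.sample(maj_idx, len(min_idx)) + min_idx
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: 2 minority rows and 8 majority rows.
rows = [[float(i)] for i in range(10)]
labels = [1, 1] + [0] * 8
ros_rows, ros_labels = random_oversample(rows, labels)
rus_rows, rus_labels = random_undersample(rows, labels)
print(len(ros_rows), sum(ros_labels), len(rus_rows), sum(rus_labels))
```

ROS yields a larger training set (16 rows here) while RUS yields a smaller one (4 rows), which is the size difference the later summary refers to.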
Improvements on Method

E-DBUS (Li et al., 2013)
Let the minority class have N instances and the majority class have M instances.
Step 1: Select a sample from the majority class and calculate its Euclidean distance to all samples in the minority class. Record the distances.
Step 2: Compute the average distance.
Step 3: If the average distance is greater than a predefined threshold, the sample is deleted; otherwise it is reserved.
Step 4: Repeat Step 1 to Step 3 for all samples in the majority class.
Step 5: The new dataset is generated from the reserved samples.

iE-DBUS (Improved Method)
Let the minority class have N instances and the majority class have M instances.
Step 1: Select a sample from the majority class and calculate its Euclidean distance to all samples in the minority class. Record the distances.
Step 2: Compute the average distance.
Step 3: If the average distance is greater than a predefined threshold, the sample is deleted; otherwise it is reserved.
Step 4: Repeat Step 3 for all samples in the majority class until M = N.
Step 5: If, after the deletions in Step 3, M is still greater than N, delete the samples with the largest average distances until M = N.
Step 6: The new dataset is generated from the reserved samples.
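The two procedures above can be sketched in Python as follows. This is one possible reading of the steps, not the authors' implementation: the threshold is passed in as a plain number, ties are broken arbitrarily, and Step 5 of iE-DBUS is interpreted as removing the reserved majority samples with the largest average distance until the class sizes match. `math.dist` requires Python 3.8 or later.

```python
import math


def mean_distance_to_minority(sample, minority):
    """Steps 1-2: average Euclidean distance to every minority sample."""
    return sum(math.dist(sample, m) for m in minority) / len(minority)


def e_dbus(majority, minority, threshold):
    """E-DBUS: reserve majority samples whose average distance to the
    minority class does not exceed the threshold."""
    return [s for s in majority
            if mean_distance_to_minority(s, minority) <= threshold]


def ie_dbus(majority, minority, threshold):
    """iE-DBUS (assumed reading): as E-DBUS, but stop deleting once the
    majority size reaches N, then trim by largest distance if needed."""
    n = len(minority)
    remaining = len(majority)
    reserved = []
    for s in majority:
        d = mean_distance_to_minority(s, minority)
        if d > threshold and remaining > n:
            remaining -= 1            # Step 3: delete this sample
        else:
            reserved.append((d, s))   # Step 3: reserve this sample
    # Step 5: if still more than N, drop the largest distances first.
    reserved.sort(key=lambda pair: pair[0])
    return [s for _, s in reserved[:n]]


# Hypothetical toy data.
minority = [(0.0, 0.0), (1.0, 0.0)]
majority = [(0.0, 1.0), (1.0, 1.0), (5.0, 5.0), (10.0, 10.0), (0.5, 0.5)]
kept_e = e_dbus(majority, minority, threshold=3.0)
kept_ie = ie_dbus(majority, minority, threshold=3.0)
print(len(kept_e), len(kept_ie))
```

On the toy data E-DBUS keeps three majority samples, so the classes remain unequal, whereas the iE-DBUS reading trims to two so that M = N.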
Note:
The threshold value is the entropy value, as follows:
H(X) = -\sum_{k \in K} p(k) \log p(k)
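As an illustration of the entropy formula, the sketch below computes H(X) from a list of class labels in Python. The source does not say which variable X is or which logarithm base is used, so treating X as the class label and defaulting to base 2 are both assumptions.

```python
import math
from collections import Counter


def shannon_entropy(labels, base=2):
    """H(X) = -sum over classes k of p(k) * log(p(k))."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


balanced = shannon_entropy([0, 1, 0, 1])
imbalanced = shannon_entropy([0] * 99 + [1])
print(balanced, imbalanced)
```

Under these assumptions a balanced binary variable has entropy 1, and the value shrinks towards 0 as the classes become more imbalanced.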
The Euclidean distance between two observations x and y with p covariates is as follows:
d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}
[Figure: Line plots of parameter estimates (β1, β2), One Categorical Covariate]
[Figure: Line plots of parameter estimates (β1, β2), One Continuous Covariate]
[Figure: Horizontal boxplots of parameter estimates (β1 and β2 for ROS and RUS)]
[Figure: Line plots of parameter estimates for RUS & ROS (β1 and β2), One Categorical Covariate]
Summary
1. Results for ROS are better than for RUS, as the ROS data
become larger than the RUS data.
2. However, both sampling techniques give better performance
than no sampling.
3. For and IR , imbalance effect is still prominent, especially
Phase 3:
Simulation 3
Objective 4: To improve the parameter estimation of the binary logistic
regression using a distance-based undersampling approach on imbalanced
datasets.
[Figure: Line plots of parameter estimates (β1 for No Sampling, RUS, E-DBUS, iE-DBUS and M-DBUS)]
[Figure: Line plots of MSE (β1 for No Sampling, RUS, E-DBUS, IDBUS and M-DBUS)]
[Figure: Horizontal boxplots of parameter estimates (β1 for No Sampling, iE-DBUS and RUS)]
[Figure: Comparison of sampling techniques (ORI, ROS, RUS, E-DBUS, iE-DBUS, M-DBUS) on TRAIN and TEST sets for the benchmark datasets. Panels: (i) Letter-a, (ii) Pendigits, (iii) Abalone, (iv) Sick, (v) Hyperthyroid, (vi) SolarFlare, Contraceptive, (vii) Adult, (viii) Splice]