
Proposal Defence

Name: Vijayalakshmi Mahanra Rao


Programme: PhD
Faculty: FCI
Registration Date: Jan ‘13
Supervisor: Prof. Y.P. Singh
Co-supervisor: Dr. C.K. Ho
Research Title: HYBRID ENSEMBLE DECISION TREES
LEARNING FOR IMBALANCED DATA SET
Outline
• Abstract
• Class Imbalance problem
• Research Problem and Motivations
• Research Background for Sampling techniques, Classifier
Ensembles and Ensemble Learning Machines
• Research Background on PSO
• Methodology
• Work Planned
• Overall Research Timeline
• Initial experiments on datasets using various machine
learning algorithms, mainly C4.5 and the wrapper method
• References
Abstract
Classification based on decision trees is one of the most important problems in data mining
and has applications in many fields. The imbalanced dataset problem is a special type of
classification problem in which the class priors are highly unequal. It has been observed that
class imbalance may produce a significant deterioration in the performance achieved by
existing decision tree learning and classification systems. Learning binary and multiclass
decision trees from imbalanced data sets has emerged as one of the challenging problems in
machine learning and data mining. The objective functions used for learning decision tree
classifiers typically tend to favor the larger, less important classes in such problems. This
research project proposes to examine the performance of several popular decision tree
splitting criteria (information gain, Gini index, and DKM) and to develop a new
skew-insensitive measure for decision tree induction. A number of algorithms have been
proposed at the data level as well as the algorithm level. Different resampling techniques
have been introduced at the data level, such as oversampling, undersampling, or a
combination of both, regenerative oversampling, and SMOTE. These methods have been
shown to achieve good performance on minority examples of two-class data sets; multiclass
classification in imbalanced data sets, however, remains an important topic of research. At
the algorithm level, existing algorithms have been modified to suit the characteristics of
the imbalanced data. This thesis explores methods for imbalanced data sets at the data and
algorithm levels by proposing a hybrid ensemble decision tree induction method for binary and
multiclass classification. The thesis will also consider the computational complexity of the
proposed hybrid ensemble algorithm.
Key Points in Abstract
• Rare events are infrequent in machine learning, i.e., the
data are imbalanced, with minority-class frequencies
ranging from 5% down to less than 0.1%.
• Decision tree learning for imbalanced dataset
• Exploring resampling techniques for binary class as
well as multiclass problem for decision tree learning
• Classifiers Ensemble learning with decision tree
• Proposed Hybrid ensemble learning
• Computational complexity of the proposed techniques
Class Imbalance Problem
Balanced dataset definition:
• Datasets are said to be balanced if there are
approximately as many positive examples of the concept
(class) as there are negative ones.

Imbalanced dataset definition:
• Datasets are said to be imbalanced if the classes are not
equally represented, i.e., there is an unequal distribution
of data (e.g., the majority class represents 95+% of the
dataset).
Class Imbalance Examples
• There exist many domains or problems that do not have
a balanced dataset.
• Examples:
- Helicopter Gearbox Fault Monitoring
- Discrimination between Earthquakes and Nuclear Explosions
- Document Filtering
- Detection of Oil Spills
- Detection of Fraudulent Telephone Calls
- Financial Fraud Transactions
- Protein Fold Recognition
Research Problem
• The problem with class-imbalanced data sets is that
standard learners are often biased towards the majority
class, for binary as well as multiclass classifiers.
• That is because these classifiers attempt to reduce global
quantities such as the error rate (entropy or
classification error), without taking the data distribution
into consideration.
• As a result, examples from the majority class are well
classified, whereas examples from the minority class tend
to be misclassified.
Research Motivation
• To develop hybrid ensemble decision tree induction
algorithms for binary as well as multiclass problems
involving
1. Induction of base classifiers for ensemble methods based
on greedy (filter method) (C4.5) and global (wrapper
method) (GA/PSO)
2. Sampling techniques for ensemble methods for
imbalanced dataset to improve the accuracy of learning for
binary as well as multiclass classification.
3. Error bound calculation of ensemble classifiers with
imbalanced datasets
• Computational complexity analysis of the proposed
hybrid ensemble methods
Research Questions as Objective
Hypothesis: Given an imbalanced dataset, propose hybrid learning
techniques involving weak learners, SMOTE, and boosting algorithms:
1. Selection of base classifiers
a. Single-attribute-based method (filter method)
b. Wrapper method (subset of attributes) – evolutionary techniques
(GA/PSO)
2. SMOTE based ensemble methods (Hybrid Ensemble Methods)
a. SMOTEAdaBoost
b. SMOTEAdaBoost.M1
c. SMOTEAdaBoost.M2
d. SMOTE based Stochastic Gradient Boosting techniques
3. Performance and complexity analysis of the proposed algorithms against
existing boosting algorithms for binary as well as multiclass problems.
4. Analytical calculation of error bounds based on PAC learning for the
proposed hybrid ensemble methods and their generalization.
Literature Review on Learning
Algorithms
• J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann
Publishers Inc. (1993)

• J. R. Quinlan. Induction of decision trees. Machine Learning, Vol. 1, No. 1,
pp. 81-106 (1986)

• Breiman, L. Bagging Predictors. Machine Learning, 24, pp. 123-140 (1996)

• J. R. Quinlan. Bagging, Boosting, and C4.5. In Proceedings of the Thirteenth
National Conference on Artificial Intelligence (1996)

• Freund, Y., & Schapire, R. E. Experiments with a new boosting algorithm.
In ICML (Vol. 96, pp. 148-156) (1996)

• Eberhart, R. C., & Kennedy, J. (1995). A new optimizer using particle swarm
theory. In Proceedings of the Sixth International Symposium on Micro Machine
and Human Science (pp. 39-43)
Research Background
• There are various top-down decision tree induction algorithms,
such as ID3, C4.5, and CART.
• C4.5 and CART consist of two conceptual phases: Growing and
Pruning.
• C4.5 algorithm
- developed by Ross Quinlan as an extension of ID3 for
building decision tree classifiers/models
- uses information gain/gain ratio to select the best
attribute to use as a node in the tree to split the dataset,
constructing the decision tree in a top-down, recursive,
divide-and-conquer manner.
• Ensemble techniques construct multiple classifiers from the
original data set and aggregate their predictions when
classifying unknown instances.
Research Background
• Top-down decision tree induction algorithms: ID3
and C4.5.

ID3                                            C4.5
handles discrete values                        handles continuous and discrete values
doesn't handle missing and continuous values   handles missing values
doesn't perform pruning                        performs pruning
uses information gain as splitting criterion   uses gain ratio as splitting criterion
Research Background
C4.5 Decision Tree Induction Algorithm
Input: Training set S of n examples
Output: decision tree with root R;
 
1. If the instances in S belong to the same class, or the number of instances in
S is too small, set R as a leaf node and label the node R with the most frequent
class in S;
2. Otherwise, choose a test attribute X with two or more values (outcomes)
based on a selection criterion, and label the node R with X;
3. Partition S into subsets S1, S2, …, Sm according to the outcome of
attribute X; generate m child nodes R1, R2, …, Rm;
4. For every pair (Si, Ri), recursively build a subtree with root Ri;
5. Recursive partitioning completes only when all instances belong to the same
class, no more attributes remain, or no instances are left.
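
To make the recursion concrete, here is a minimal Python sketch of top-down induction with information gain, in the spirit of ID3/C4.5 (discrete attributes only; gain ratio, pruning, and missing-value handling are omitted; all names are illustrative, not from Quinlan's implementation):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Reduction in entropy obtained by splitting on attribute index attr
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    # Returns a leaf label, or a dict {attr_index: {value: subtree}}
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # majority-class leaf
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    node = {best: {}}
    for value in set(row[best] for row in rows):
        pairs = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [y for _, y in pairs]
        remaining = [a for a in attrs if a != best]
        node[best][value] = build_tree(sub_rows, sub_labels, remaining)
    return node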
Research Background
• Ensemble learning is defined as learning multiple
classifiers using different training data or different
learning algorithms
• Combine decisions of multiple classifiers, e.g. using
majority voting, weighted voting or Bayesian voting.
• Applied to improve classifiers and the accuracy of their
predictions.
• Ensemble methods
- Boosting
- Gradient Boosting/Stochastic Gradient Boosting
- Bagging
- Stacking
Research Background
• Ensemble Methods – Generic Approach:
Three main steps exist in an ensemble model:
1. training set generation,
2. learning, and
3. integration.
Step 1 begins with the original training set S. From this training
set, t data subsets are created (S1, S2, …, St). Bagging and
boosting are common ways to accomplish this step.
Step 2: t base classifiers (I1, I2, …, It) are generated using different
training data or different learning algorithms. These classifiers may all be
the same, all different, or any combination of the same and different
classifiers. Each classifier Ii is trained using the subset Si.
Step 3: the predictions of the classifiers are combined in a
predetermined way to produce the resulting classification.
Research Background
Ensemble Methods – Generic Approach:
Two primary approaches exist for the integration phase:
combination and selection.
1. In the combination approach, the base classifiers produce their
class predictions and the final outcome is composed from
those predictions.
2. In the selection approach, one of the classifiers is selected and
the final prediction is the one it produces.
The most commonly used integration techniques are:
1. voting,
2. simple and weighted averaging, and
3. a posteriori (Bayesian) combination.
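
As a rough illustration of the combination approach, the following Python sketch implements simple majority voting and weighted voting, assuming each base classifier is a callable mapping an instance to a class label (names are illustrative):

from collections import Counter

def majority_vote(classifiers, x):
    # Unweighted vote: the most common predicted label wins
    return Counter(clf(x) for clf in classifiers).most_common(1)[0][0]

def weighted_vote(classifiers, weights, x):
    # Each classifier's vote counts with its weight
    scores = {}
    for clf, w in zip(classifiers, weights):
        label = clf(x)
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)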
Research Background
• Use a single, arbitrary learning algorithm but
manipulate training data to make it learn multiple
models.
– Data1  Data2  …  Data m
– Learner1 = Learner2 = … = Learner m
• Different methods for changing training data:
– Bagging: Resample training data (Bootstrap sampling)
– Boosting: Reweight training data
– DECORATE: Add additional artificial training data
• In WEKA, these are called meta-learners; they take a
learning algorithm as an argument (the base learner) and
create a new learning algorithm.
Research Background
• Ensemble Method -Bagging
- Create ensembles by repeatedly and randomly resampling the
training data (Breiman, 1996).
- Given a training set of size S, create m samples of size S by
drawing S examples from the original data, with replacement.
  - Each bootstrap sample will on average contain 63.2% of
the unique training examples; the rest are replicates.
- Combine the m resulting models using simple majority voting.
- Decreases error by decreasing the variance in the results due to
unstable learners: algorithms (like decision trees) whose
output can change dramatically when the training data is
slightly changed.
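
A minimal Python sketch of bagging as described above, assuming a train(rows, labels) helper that fits a model and returns a callable classifier (an illustrative sketch, not Breiman's code):

import random
from collections import Counter

def bagging(train, rows, labels, m=10, seed=0):
    # m bootstrap samples of size S, drawn with replacement
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        models.append(train([rows[i] for i in idx], [labels[i] for i in idx]))
    def predict(x):
        # combine the m models by simple majority voting
        return Counter(model(x) for model in models).most_common(1)[0][0]
    return predict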
Research Background
• Ensemble Methods - Boosting
- Train multiple classifiers using the training data in the
following way:
  - Look at the errors from previous classifiers to decide what to
focus on in the next training iteration.
  - Each new classifier depends on its predecessors' errors.
- Result: more weight on 'hard' samples (the ones where
classifiers made mistakes in previous iterations).
• Predict the outcome for a previously unseen sample by
aggregating the predictions made by the multiple models
(ensembles).
• ('Hard' and 'difficult' samples are used synonymously here.)
Research Background
• Ensemble Method: AdaBoost learning algorithm.
1. Assume that the learning algorithm accepts
weighted examples
2. At each step, AdaBoost increases the weights of
examples from the learning sample misclassified
by the previous model
3. Thus, the algorithm focuses on the hard/difficult
samples from the learning samples
4. In the weighted majority vote, AdaBoost gives
higher influence to the more accurate models
Research Background
• Ensemble Methods:AdaBoost (adaptive boosting) :
1. AdaBoost (two-class).
2. AdaBoost.M1 and AdaBoost.M2 (multi-class).
3. AdaBoostR (regression).
• AdaBoost.M1 is the most popular (classification
algorithm with more than two classes).
• Other families (what changes is the weight/loss
function and the voting function):
 LogitBoost
 L2Boost
Research Background

AdaBoost final hypotheses:

• Binary classification (weighted majority vote):

    H(x) = sign( Σ_t α_t · h_t(x) )

• Multiclass (AdaBoost.M1):

    H(x) = argmax_y Σ_{t : h_t(x) = y} ln(1/β_t)

where h_t is the t-th base classifier, α_t its vote weight, and
β_t = ε_t / (1 − ε_t) for weighted training error ε_t.
Research Background

AdaBoost Algorithm for Binary Class

[Embedded document: AdaBoost (binary) algorithm listing]
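
Since the embedded listing does not survive this text export, the following is a minimal Python sketch of binary AdaBoost, assuming labels in {-1, +1} and a train(rows, labels, weights) helper that fits a weak classifier to the weighted sample (illustrative, not the exact listing from the slide):

import math

def adaboost(train, rows, labels, T=10):
    n = len(rows)
    weights = [1.0 / n] * n                      # uniform initial weights
    models, alphas = [], []
    for _ in range(T):
        h = train(rows, labels, weights)
        miss = [h(x) != y for x, y in zip(rows, labels)]
        eps = sum(w for w, m in zip(weights, miss) if m)
        if eps == 0 or eps >= 0.5:               # stop if the weak learner fails
            break
        alpha = 0.5 * math.log((1 - eps) / eps)  # vote weight
        # increase weights of misclassified examples, decrease the rest
        weights = [w * math.exp(alpha if m else -alpha)
                   for w, m in zip(weights, miss)]
        z = sum(weights)
        weights = [w / z for w in weights]       # renormalize to a distribution
        models.append(h)
        alphas.append(alpha)
    def predict(x):
        # weighted majority vote: H(x) = sign(sum_t alpha_t * h_t(x))
        s = sum(a * h(x) for a, h in zip(alphas, models))
        return 1 if s >= 0 else -1
    return predict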
Research Background
AdaBoost.M1 Algorithm
• Freund & Schapire (1996)
• Each classifier is generated with different training
set obtained from the original dataset using
resampling or reweighting techniques.
• Creates an ensemble of classifiers with each gives a
weighted vote
• Is used to boost decision trees
Research Background
AdaBoost.M1 Algorithm for Multiclass

[Embedded document: AdaBoost.M1 algorithm listing]
Research Background
AdaBoost.M2 Algorithm:
• Freund et al. (1999)
• Extension of AdaBoost.M1 for multiclass problems that
makes use of the base classifiers' confidence rates
Research Background
AdaBoost.M2 Algorithm for Multiclass

[Embedded document: AdaBoost.M2 algorithm listing]
Research Background
PSO – Particle Swarm Optimization
• developed in 1995 by James Kennedy and Russell Eberhart
• Basically, PSO works as follows:
- Each particle searches for the optimum.
- Each particle is moving and hence has a velocity.
- Each particle remembers the position where it achieved its
best result so far (its personal best).
- A particle has a neighborhood associated with it.
- A particle knows the fitnesses of the particles in its
neighborhood, and uses the position of the one with the best fitness.
- This position is simply used to adjust the particle's velocity.
Research Background
PSO - Pseudo code
For each particle
    Initialize particle
END

Do
    For each particle
        Calculate fitness value
        If the fitness value is better than its personal best
            set current value as the new pBest
    End

    Choose the particle with the best fitness value of all as gBest
    For each particle
        Calculate particle velocity
        Update particle position
    End
While maximum iterations or minimum error criterion is not attained
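
The pseudocode above can be made concrete with the following minimal, self-contained Python sketch of a global-best PSO for minimizing a fitness function; the coefficient values (w, c1, c2) and bounds are common illustrative defaults, not prescribed by the slide:

import random

def pso(fitness, dim, n_particles=30, iters=100,
        w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                   # personal best positions
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # global best (gBest)
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + cognitive (pBest) + social (gBest) terms
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = fitness(pos[i])
            if val < pbest_val[i]:                # update personal best
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:               # update global best
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Example: minimize the sphere function in 3 dimensions
best, value = pso(lambda x: sum(xi * xi for xi in x), dim=3)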
Literature Review on
Sampling Techniques
• Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.:
"SMOTE: Synthetic Minority Over-sampling Technique".
Journal of Artificial Intelligence Research 16 (2002) 321-357.

• N. V. Chawla. C4.5 and imbalanced datasets: Investigating the
effect of sampling method, probabilistic estimate, and decision
tree structure. In Proceedings of the ICML'03 Workshop on
Class Imbalances, 2003.
Research Background
Class Imbalance dataset Balancing Techniques
• Oversampling
- Balance the class distribution by replicating minority class
examples via uniform sampling with replacement
• Undersampling
- Balance the class distribution by removing majority class
examples via uniform sampling without replacement
• SMOTE (Synthetic Minority Over-sampling
Technique), using kNN techniques
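
For illustration, minimal Python sketches of random oversampling and undersampling for a binary dataset of (x, y) pairs are shown below (the minority/majority labels and data layout are assumptions):

import random

def oversample(data, minority, seed=0):
    # Replicate minority examples (with replacement) to match the majority
    rng = random.Random(seed)
    minor = [d for d in data if d[1] == minority]
    major = [d for d in data if d[1] != minority]
    extra = [rng.choice(minor) for _ in range(len(major) - len(minor))]
    return major + minor + extra

def undersample(data, majority, seed=0):
    # Keep a random subset of majority examples (without replacement)
    rng = random.Random(seed)
    major = [d for d in data if d[1] == majority]
    minor = [d for d in data if d[1] != majority]
    return minor + rng.sample(major, len(minor))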
Research Background
• SMOTE: A State-of-the-Art Resampling Approach
- SMOTE stands for Synthetic Minority Oversampling
Technique.
- It is a technique designed by Chawla, Bowyer, Hall, &
Kegelmeyer in 2002.
- It combines informed oversampling of the minority class
with random undersampling of the majority class.
- SMOTE currently yields the best results as far as
resampling and modifying-the-probabilistic-estimate
techniques go (Chawla, 2003).
Research Background
SMOTE’s Informed Oversampling Procedure – Top Level
Description:
• For each minority sample
– Find its k-nearest minority neighbours
– Randomly select i of these neighbours
– Randomly generate synthetic samples along the
lines joining the minority sample and its i selected
neighbours
(i depends on the amount of oversampling desired)
Research Background
SMOTE’s Informed Oversampling Procedure:
• Random Oversampling (with replacement) of the minority
class has the effect of making the decision region for the
minority class very specific.
• In a decision tree, it would cause a new split and often lead to
overfitting.
• SMOTE’s informed oversampling generalizes the decision
region for the minority class.
• As a result, larger and less specific regions are learned, thus,
paying attention to minority class samples without causing
overfitting.
Research Background
SMOTE Algorithm
Input: number of minority class samples T, amount of SMOTE N%,
number of nearest neighbors k
Output: (N/100) × T synthetic minority class samples

[Embedded document: SMOTE algorithm listing]
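
As the embedded listing is lost in this export, here is a minimal Python sketch of SMOTE's synthetic-sample generation (Euclidean k-NN within the minority class, interpolation toward a random neighbor; it assumes numeric feature vectors and N as a multiple of 100):

import math
import random

def smote(minority, n_percent=100, k=5, seed=0):
    rng = random.Random(seed)
    per_sample = n_percent // 100        # synthetic points per minority sample
    synthetic = []
    for x in minority:
        # k nearest minority neighbors by Euclidean distance
        neighbors = sorted((m for m in minority if m is not x),
                           key=lambda m: math.dist(x, m))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbors)
            gap = rng.random()           # random position along the joining line
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic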
Literature Review on
Assessment Metrics and Wrapper
• Kohavi, R., John, G.H. Wrappers for feature subset
selection. Artificial Intelligence, Volume 97, Issues 1-2,
December 1997, Pages 273-324 (1997).

• Galar, M.; Fernández, A.; Barrenechea, E.; Bustince, H.;
Herrera, F. "A Review on Ensembles for the Class
Imbalance Problem: Bagging-, Boosting-, and Hybrid-
Based Approaches," IEEE Transactions on Systems, Man,
and Cybernetics, Part C: Applications and Reviews, 42(4),
463-484, 2012.

• Chawla, N. V. Data mining for imbalanced datasets: An
overview. In Data Mining and Knowledge Discovery
Handbook (pp. 853-867) (2005).
Research Background
Wrapper Approach – GA/PSO – Top-Level Description

• attribute subset selection is applied by having an induction
algorithm "wrapped" around a search engine, using the
induction algorithm itself as the evaluation function

• the induction algorithm is fed with the dataset and
partitions it into internal training and test sets, with
different sets of features removed from the data

• the feature subset with the highest evaluation is chosen as the
final set on which to run the induction algorithm

• the resulting classifier is then evaluated on an independent test
set that was not used earlier during the search
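
Below is a minimal Python sketch of the wrapper idea, using greedy forward selection with an internal held-out split as the evaluation function; a GA/PSO variant would replace only the search strategy (train and accuracy are assumed helpers, not part of the slide):

def wrapper_select(train, accuracy, rows, labels, n_features, split=0.7):
    # Internal train/test split used as the wrapper's evaluation function
    cut = int(len(rows) * split)
    tr_x, tr_y = rows[:cut], labels[:cut]
    te_x, te_y = rows[cut:], labels[cut:]
    def score(subset):
        # Induce a classifier on the selected features, score on held-out data
        proj = lambda data: [[r[i] for i in subset] for r in data]
        model = train(proj(tr_x), tr_y)
        return accuracy(model, proj(te_x), te_y)
    selected, best_score = [], 0.0
    improved = True
    while improved:
        improved = False
        best_f = None
        for f in range(n_features):
            if f in selected:
                continue
            s = score(selected + [f])
            if s > best_score:              # keep the feature only if it helps
                best_score, best_f, improved = s, f, True
        if best_f is not None:
            selected.append(best_f)
    return selected, best_score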
Research Background
Assessment Metrics for Class Imbalance Learning

• The confusion matrix is used to measure the performance of a
binary classifier:

                   Positive Prediction    Negative Prediction
Negative Class     FP                     TN
Positive Class     TP                     FN

• ROC curve – from signal detection, characterizes the tradeoff
between hit rate and false-alarm rate over a noisy channel.
• Accuracy, Sensitivity, Specificity, and many others.
Research Background
Assessment Metrics for Class Imbalance Learning

• Recall = TP / (TP + FN)
• Precision = TP / (TP + FP)
• The F-measure combines the tradeoffs of precision and
recall and outputs a single number reflecting the "goodness" of a
classifier in the presence of rare classes. It is denoted as
below, where β corresponds to the relative importance of
precision vs. recall (usually set to 1):

    Fβ = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)
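
These metrics can be computed directly from confusion-matrix counts, as in this small illustrative Python sketch:

def recall(tp, fn):
    return tp / (tp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def f_measure(tp, fp, fn, beta=1.0):
    # F-beta: weighted harmonic combination of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Example: 40 true positives, 10 false positives, 20 false negatives
print(f_measure(40, 10, 20))   # ~0.727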
Methodology
Based on the diagram below, which depicts the two types of class imbalance
methods, the data-level approach has been chosen for balancing the
dataset.

Class Imbalance
• Data level
  - Undersampling
    • Random Undersampling
    • Condensed Nearest Neighbour Rule
    • Tomek Links
    • Evolutionary Prototype Selection
  - Oversampling
    • Random Oversampling
    • SMOTE
    • Borderline-SMOTE1
    • Borderline-SMOTE2
• Algorithm level
  - Cost-Sensitive Learning
  - One Class Learning
Methodology
• A hybrid ensemble decision tree is proposed for solving
binary and multiclass imbalanced classification problems.

• The hybrid approach involves a combination of decision
trees with sampling techniques and boosting ensemble
methods, along with optimization techniques for
attribute selection, such as:
- Genetic Algorithms
- Particle Swarm Optimisation
Methodology
Simulation environments considered for this thesis are:

1) R language – a language and environment for statistical
computing and graphics
2) WEKA – a tool with a collection of machine learning
algorithms for data mining tasks, used together with R
Methodology
Datasets considered for this thesis are as below:

Binary class:

Dataset          Instances   Attributes   Classes
Credit Scoring   690         14           2 (good, bad)

Class distribution – Good: 307 (44.5%), Bad: 383 (55.5%)

Multiclass – the most common datasets used in multiclass
research papers:

Dataset   Instances   Attributes   Classes
Glass     214         9            6
ecoli     336         7            8
wine      178         13           3

? Financial multiclass dataset
Work Planned
C4.5 algorithms and experimental design for binary and
multiclass classification problems on imbalanced datasets:

1) Experimental studies of existing sampling
techniques at the data level with the existing learning
algorithm C4.5
2) Experimental studies of existing splitting
measures (information gain and gain ratio) and
their effects on classifier performance
3) Wrapper method using GA and PSO for
constructing base classifiers
Work Planned
Combining existing sampling techniques with ensemble-based
algorithms for binary and multiclass classification:

1) Multiclass imbalanced datasets
2) Multiclass classification by combining sampling and
boosting techniques
3) Effects of various weight-update rules for imbalanced
data on the classifier
Work Planned
Boosting-based ensemble learning in multiclass
classification:

1) Performance analysis of boosting techniques
with/without sampling techniques
Overall Research
Timeline
Initial Experiments on
Binary Class Classifier
Dataset          Instances   Attributes   Classes
Credit Scoring   690         14           2 (good, bad)

Class distribution – Good: 307 (44.5%), Bad: 383 (55.5%)

References
• Lior Rokach and Oded Maimon. Data Mining with Decision
Trees: Theory and Applications. Machine Perception and
Artificial Intelligence, Vol. 69 (2008).

• Lior Rokach. Pattern Classification Using Ensemble Methods.
Machine Perception and Artificial Intelligence, Vol. 75 (2010).

• Shuo Wang, Xin Yao. "Multiclass Imbalance Problems:
Analysis and Potential Solutions," IEEE Transactions on
Systems, Man, and Cybernetics, Part B: Cybernetics,
vol. 42, no. 4, pp. 1119-1130, Aug. 2012.

• Chawla, N. V. Data mining for imbalanced datasets: An
overview. In Data Mining and Knowledge Discovery
Handbook (pp. 853-867) (2005).
References
• Galar, M.; Fernández, A.; Barrenechea, E.; Bustince, H.;
Herrera, F. "A Review on Ensembles for the Class Imbalance
Problem: Bagging-, Boosting-, and Hybrid-Based
Approaches," IEEE Transactions on Systems, Man, and
Cybernetics, Part C: Applications and Reviews, 42(4),
463-484, 2012.

• Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.:
"SMOTE: Synthetic Minority Over-sampling Technique".
Journal of Artificial Intelligence Research 16 (2002) 321-357.

• Breiman, L.: Bagging Predictors. Technical Report No. 421,
September 1994, Department of Statistics, University of
California, Berkeley, California 94720.
References
• Japkowicz, N.: Class Imbalance Problem: Significance &
Strategies. In: International Conference on Artificial
Intelligence (ICAI) (2000) 111-117.

• N. V. Chawla. C4.5 and imbalanced datasets: Investigating the
effect of sampling method, probabilistic estimate, and decision
tree structure. In Proceedings of the ICML'03 Workshop on
Class Imbalances, 2003.

• Freund, Y., and R. E. Schapire: Experiments with a New
Boosting Algorithm. Machine Learning: Proceedings of the
Thirteenth International Conference (1996).
References
• Briem, G.J., Benediktsson, J.A., Sveinsson, J.R.: Boosting,
Bagging, and Consensus Based Classification of Multisource
Remote Sensing Data. Multiple Classifier Systems 2001: 279-288.

• Tumer, K., Ghosh, J.: Analysis of decision boundaries in linearly
combined neural classifiers. Pattern Recognition 29 (1996) 341-348.

• Benediktsson, J.A., Sveinsson, J.R., Swain, P.H.: Hybrid
consensus theoretic classification. IEEE Transactions on
Geoscience and Remote Sensing 35 (1997) 833-843.

• Y. Park, J. Ghosh: Compact Ensemble Trees for Imbalanced
Data. In Multiple Classifier Systems (2011).
References
• J. Ross Quinlan. C4.5: Programs for Machine Learning.
Morgan Kaufmann Publishers Inc. (1993)

• J. R. Quinlan. Induction of decision trees. Machine
Learning, Vol. 1, No. 1, pp. 81-106 (1986)

• J. R. Quinlan. Bagging, Boosting, and C4.5. In
Proceedings of the Thirteenth National Conference on
Artificial Intelligence (1996)

• Eberhart, R. C., & Kennedy, J. (1995). A new optimizer
using particle swarm theory. In Proceedings of the Sixth
International Symposium on Micro Machine and Human
Science (pp. 39-43)
References
• Freund, Y., & Schapire, R. E. Experiments with a new
boosting algorithm. In ICML (Vol. 96, pp. 148-156) (1996)

• He, H., & Garcia, E. A. Learning from imbalanced
data. IEEE Transactions on Knowledge and Data
Engineering, 21(9), 1263-1284 (2009).

• Sun, Y., Kamel, M. S., & Wang, Y. Boosting for learning
multiple classes with imbalanced class distribution. In Data
Mining, 2006 (ICDM'06), Sixth International Conference
on (pp. 592-602). IEEE (2006)

• Ghosh, J.: Multiclassifier Systems: Back to the Future. Proc.
3rd Int. Workshop on Multiple Classifier Systems, 2002,
pp. 1-15.
Thank you
Q&A
