Business Intelligence DM2 WEKA Classification
Using Weka
Getting to know WEKA
• Open the dataset
• Naïve Bayes
• K-nearest neighbours (IBk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
• Color is important!
Model: Decision Tree

Training data: 10 records with the attributes Refund (Yes/No), Marital Status (Single/Married/Divorced) and Taxable Income, e.g.:

  Tid  Refund  Marital Status  Taxable Income  Cheat
  2    No      Married         100K            No
  3    No      Single          70K             No
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Note that:
- "Cheat" is the categorical "output" (class) variable.

A decision tree that fits this training data:

  Refund?
    Yes → NO
    No  → MarSt?
            Married          → NO
            Single, Divorced → TaxInc?
                                 < 80K → NO
                                 ≥ 80K → YES

There can consequently be more than one decision tree for the same data set!

Applying the model to an unseen record (Refund = No, Married, 80K): Refund = No and MarSt = Married, so the tree predicts Cheat = No.
Validation of a decision tree
• Focus on the predictive ability of a model
• Instead of focusing on how much time it takes to make a model, the size of the model, etc.
• Confusion Matrix: how many records were properly classified? How many errors?
                      PREDICTED CLASS
                      Class x     Class y
  ACTUAL   Class x    a (TP)      b (FN)
  CLASS    Class y    c (FP)      d (TN)

a: TP (true positives)    b: FN (false negatives)
c: FP (false positives)   d: TN (true negatives)

Two examples:

                PREDICTED                        PREDICTED
                x       y                        x       y
  ACTUAL   x    150     40         ACTUAL   x    250     45
  CLASS    y    60      250        CLASS    y    5       200
The same idea generalizes to more than two classes: entry Cij counts the instances of actual class Ci that received computed decision Cj.

                Computed decision
                C1     C2     C3
  Actual  C1    C11    C12    C13
          C2    C21    C22    C23
          C3    C31    C32    C33
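In the Weka Explorer this matrix is printed automatically with every evaluation. A minimal sketch of obtaining it through Weka's Java API (the data file and the choice of J48 are placeholders, not prescribed by the slides):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ConfusionMatrixDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.nominal.arff"); // any ARFF file
            data.setClassIndex(data.numAttributes() - 1);             // class = last attribute

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1)); // 10-fold CV

            System.out.println(eval.toMatrixString("=== Confusion Matrix ==="));
            System.out.println("Correctly classified: " + eval.pctCorrect() + " %");
        }
    }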
Configuration panel
• Right-click (the result in the result list) to visualize
Weka: J48
• Evolution:
• ID3 (1979)
• C4.5 (1993)
• C4.8 (1996?) => J48 (adapted for Weka)
• C5.0 (commercial)
• Open the configuration panel in Weka (click on the white box under "Classifier")
• Check the "More" information
• Examine the options
• Look at leaf sizes; set minNumObj to 15 to avoid small leaves (a scripted version is sketched below)
• Visualize the tree using right‐click menu
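The same experiment through Weka's Java API (a minimal sketch; the data file is a placeholder, in the Explorer you would set minNumObj in the configuration panel instead):

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48MinLeafDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("segment-challenge.arff"); // placeholder dataset
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            tree.setMinNumObj(15);      // at least 15 instances per leaf: avoids small leaves
            tree.buildClassifier(data);

            System.out.println(tree);   // prints the tree, its size and its number of leaves
        }
    }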
Minimum leaf size 2: accuracy = 66.8%. Minimum leaf size 15: accuracy = 62.15%.
Output: general information and the tree size.
Output: the accuracy % and the Kappa statistic.
Entropy is a measure of impurity (information gain measures the reduction in entropy). So the entropy after the split should be lower than the entropy before the split.
J48: algorithm
Note: information gain is measured in 'bits', as a unit of information.
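As an illustration, a small sketch of the computation in Java (the counts are the classic ones for a split on Outlook in the weather.nominal data, shown later in this deck):

    public class EntropyDemo {
        // Entropy, in bits, of a class distribution given as counts.
        static double entropy(double... counts) {
            double total = 0, h = 0;
            for (double c : counts) total += c;
            for (double c : counts) {
                if (c > 0) {
                    double p = c / total;
                    h -= p * (Math.log(p) / Math.log(2)); // log base 2 => bits
                }
            }
            return h;
        }

        public static void main(String[] args) {
            double before = entropy(9, 5);              // whole data set: 9 yes, 5 no (~0.940 bits)
            double after = (5.0 / 14) * entropy(2, 3)   // Outlook = sunny
                         + (4.0 / 14) * entropy(4, 0)   // Outlook = overcast
                         + (5.0 / 14) * entropy(3, 2);  // Outlook = rainy
            System.out.println("Information gain = " + (before - after) + " bits"); // ~0.247
        }
    }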
Weka: J48
• Fewer attributes = sometimes a better classification!
• Open glass.arff to run J48:
  • Run J48 with default options first
  • Next, remove Fe and run J48 again
  • Next, remove all attributes except RI and Mg, and run J48 again
  • Compare the decision trees, and particularly their accuracy %
  • (Also use the right-click menu to visualize the decision trees; a scripted version of the exercise is sketched below)
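A sketch of the first two steps of this exercise via the Java API, using the Remove filter (attribute indices are looked up by name rather than hard-coded):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class GlassAttributeDemo {
        static double cvAccuracy(Instances data) throws Exception {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            return eval.pctCorrect();
        }

        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("glass.arff");
            data.setClassIndex(data.numAttributes() - 1);
            System.out.println("All attributes: " + cvAccuracy(data));

            Remove remove = new Remove();     // drop Fe (setAttributeIndices is 1-based)
            remove.setAttributeIndices("" + (data.attribute("Fe").index() + 1));
            remove.setInputFormat(data);
            Instances noFe = Filter.useFilter(data, remove);
            System.out.println("Without Fe:     " + cvAccuracy(noFe));
        }
    }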
Classification Algorithms
• Structure of the slides:
• Naïve Bayes
• K-nearest neighbours (IBk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
PART
• Rules from partial decision trees: PART
• Theoretically, rules and trees have equivalent “descriptive” or expressive power ... but either
can be more perspicuous (understandable, transparent) than the other
• Create a decision tree: top-down, ”divide-and-conquer”; read rules off the tree
• One rule for each leaf
• Straightforward, but rules contain repeated tests and are overly complex
• Alternative: a covering method: bottom-up, “separate-and-conquer”:
• Take a certain class (value) in turn and seek a way of covering all instances in it. This is called a covering
approach because at each stage you identify a rule that “covers” some of the instances. This approach may
lead to a set of rules, rather than to a decision tree.
• Separate-and-conquer:
• Identify a rule
• Remove the instances it covers
• Continue, creating rules for the remaining instances
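In Weka, PART is available as rules > PART. A minimal sketch of running it programmatically (the data file is a placeholder):

    import weka.classifiers.rules.PART;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PartDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.nominal.arff");
            data.setClassIndex(data.numAttributes() - 1);

            PART part = new PART();     // builds rules from partial decision trees
            part.buildClassifier(data);
            System.out.println(part);   // prints the decision list, one rule at a time
        }
    }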
PART: Separate-and-Conquer: Example
• Identifying a rule for class a (so not explicitly considering class b).
Baseline Accuracy: ZeroR
• Open file diabetes.arff: 768 instances (500 negative, 268 positive)
• Always guessing "negative" gives 500/768: accuracy = 65%
• rules > ZeroR: predicts the most likely class!
• Try these classifiers (cf. later for more info):
  – trees > J48: 74%
  – bayes > NaiveBayes: 74%
  – lazy > IBk: 70%
  – rules > PART: 75%
• So ZeroR is a classifier that uses no attributes. ZeroR simply "predicts" the mean (for a numeric class, i.e. output variable) or the mode (for a nominal class). A sketch of this comparison follows below.
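A sketch that reproduces the comparison through the Java API (10-fold cross-validation; the exact percentages vary slightly with the random seed):

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.rules.PART;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BaselineDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("diabetes.arff");
            data.setClassIndex(data.numAttributes() - 1);

            Classifier[] classifiers = {
                new ZeroR(), new J48(), new NaiveBayes(), new IBk(), new PART()
            };
            for (Classifier c : classifiers) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(c, data, 10, new Random(1));
                System.out.printf("%-10s accuracy = %.1f%%%n",
                        c.getClass().getSimpleName(), eval.pctCorrect());
            }
        }
    }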
Baseline Accuracy : ZeroR
• Always try a simple baseline first. Sometimes, the baseline is best!
• Open supermarket.arff and blindly apply the following classifications:
  • rules > ZeroR – accuracy = 63.7%
  • trees > J48 – accuracy = 63.7%
  • bayes > NaiveBayes – accuracy = 63.7%
  • lazy > IBk – accuracy = 37% (!)
  • rules > PART – accuracy = 63.7%
OneR: One attribute does all the work
• Learn a 1-level "decision tree", i.e., rules that all test one particular attribute
• Basic version
• One branch for each value
• Each branch assigns most frequent class
• Error rate: proportion of instances that don’t belong to the majority class of their corresponding
branch
• Choose attribute with smallest error rate
• In Weka:
• Open file weather.nominal.arff
• Choose OneR rule learner (rules>OneR)
• Look at the rule.
Dealing with numeric attributes
• Idea: discretize numeric attributes into sub ranges (intervals or partitions)
• How to divide each attribute’s overall range into intervals?
• Sort instances according to attribute’s values
• Place breakpoints where (majority) class changes
• This minimizes the total classification error; the example below yields 8 intervals (partitions)
• Example: temperature from the weather.numeric data

  64   65   68  69  70   71  72   72  75  75   80   81  83   85
  Yes | No | Yes Yes Yes | No No | Yes Yes Yes | No | Yes Yes | No

• However, whenever adjacent partitions have the same majority class, they can be merged together; doing so (combined with a minimum number of instances per partition, cf. minBucketSize below) leads to 2 partitions:

  64  65  68  69  70  71  72  72  75  75   80  81  83  85
  Yes No  Yes Yes Yes No  No  Yes Yes Yes | No  Yes Yes No

• So the rule:  temperature ≤ 77.5 → yes
                temperature > 77.5 → no
Results with overfitting avoidance
• Resulting rule sets for the four attributes in the weather.numeric data, with only two rules for the temperature attribute:

  Attribute     Rules                      Errors   Total errors
  Outlook       Sunny → No                 2/5      4/14
                Overcast → Yes             0/4
                Rainy → Yes                2/5
  Temperature   ≤ 77.5 → Yes               3/10     5/14
                > 77.5 → No*               2/4
  Humidity      ≤ 82.5 → Yes               1/7      3/14
                > 82.5 and ≤ 95.5 → No     2/6
                > 95.5 → Yes               0/1
  Windy         False → Yes                2/8      5/14
                True → No*                 3/6

• In Weka:
  • Open weather.numeric
  • Classify
  • Rules > OneR
  • Configuration panel: minBucketSize = 3 (see previous slide)
  • Start
  • Look at the rule… (a scripted version is sketched below)
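A minimal sketch of the same run through the Java API:

    import weka.classifiers.rules.OneR;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class OneRDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.numeric.arff");
            data.setClassIndex(data.numAttributes() - 1);

            OneR oneR = new OneR();
            oneR.setMinBucketSize(3);  // overfitting avoidance: at least 3 instances per partition
            oneR.buildClassifier(data);
            System.out.println(oneR);  // prints the rule on the single chosen attribute
        }
    }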
Classification Algorithms
• Structure of the slides:
• Naïve Bayes
• K-nearest neighbours (IBk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
The problem of Overfitting
• Any machine learning method may “overfit” the training data …
• … by producing a classifier that fits the training data too tightly
• So the method performs well on the training data but not on the independent test data (which is then reflected in a low accuracy rate)
• Overfitting is a general phenomenon that plagues all ML methods
• This is one reason why you must always evaluate on an independent test set
• However, overfitting can occur more generally: you can have a good accuracy rate, but the model performs badly on new validation data (coming from a different context)
• E.g. you try many ML methods and choose the best one for your data – you cannot expect to get the same performance on new validation data
Overfitting : an example
• Experiment with the diabetes dataset
• Open file diabetes.arff
• Choose ZeroR rule learner (rules>ZeroR)
• Use cross‐validation: 65.1%
• Choose OneR rule learner (rules>OneR)
• Use cross‐validation: 71.5%
• Look at the rule in the output (plas = plasma glucose concentration)
• In the configuration panel of OneR, change minBucketSize parameter to 1: run again
• Use cross-validation: 57.16%
• Look at the rule again
• So in the 2nd run of OneR, the rule is much more complex and tightly (over)fitted to the training set, but performs worse on the test set(s) (even worse than ZeroR).
Classification Algorithms
• Structure of the slides:
• Naïve Bayes
• K-nearest neighbours (IBk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
Quality Evaluation: Accuracy, Precision and Recall
• So far, the main focus was on Accuracy % = number of correctly classified instances / total number of instances
• Overall a useful indicator, mainly in the case of more or less balanced class values
• Easy to understand, evaluates the entire model
• Additional, related measures (between 0 and 1), given per class value:
  • TP rate = TP/(TP+FN) = Recall = sensitivity
  • Precision = TP/(TP+FP)
  • F-measure is the harmonic mean of recall and precision
  • Area under the ROC curve (also called the C statistic)
    • A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the discriminatory ability of a classifier.
    • The 'area under the curve' (or C statistic) represents the classification quality of the model: 1 is a perfect model, without any false positives; 0.5 is a worthless model that detects as many true positives as false positives.
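A worked example (using the first confusion matrix shown earlier, from the perspective of class x): TP = 150, FN = 40, FP = 60, TN = 250. Recall = 150/(150+40) ≈ 0.79; Precision = 150/(150+60) ≈ 0.71; F-measure = 2 × Precision × Recall / (Precision + Recall) ≈ 0.75; overall accuracy = (150+250)/500 = 80%.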
Quality Evaluation: Kappa Statistic
• Measures such as the accuracy % are easy to understand, but may give a distorted picture in the case of imbalanced classes. E.g. suppose we have two classes, say A and B, and A shows up only 5% of the time. Classifying everything as B gives an accuracy of 95% (so 'excellent'), whereas the minority class is not predicted at all. This problem is even more pronounced in the case of 3 or more classes.
• Cohen's kappa statistic is a good measure that handles imbalanced class problems well.
• Kappa statistic:
• (success rate of actual predictor - success rate of random predictor) / (1 - success rate of random predictor)
• Measures relative improvement on random predictor: 1 means perfect accuracy, 0 means we are doing no better than
random
• Interpretation, rules of thumb (!):
based on Landis, J.R.; Koch, G.G. (1977). “The measurement of observer agreement for categorical
data”. Biometrics 33 (1): 159–174
• a value ≤ 0 indicates no improvement over random prediction,
• 0–0.20 a slight improvement,
• 0.21–0.40 a fair improvement,
• 0.41–0.60 a moderate improvement,
• 0.61–0.80 a substantial improvement,
• and 0.81–1 an almost perfect, maximal improvement.
• It basically tells you how much better your classifier is performing over the performance of a classifier
that simply guesses at random according to the frequency of each class.
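A worked example (using the second confusion matrix shown earlier): the observed success rate is (250+200)/500 = 0.90. A random predictor with the same marginal frequencies would succeed with probability (295/500)(255/500) + (205/500)(245/500) ≈ 0.50, so Kappa = (0.90 − 0.50) / (1 − 0.50) ≈ 0.80: a substantial improvement on the Landis & Koch scale above.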
Example
• A Decision Tree on the Iris.arff dataset:
• When opting for separate training and test sets, separate files (.arff) need to be created first. The model is then built with the training set and tested separately on the test set. Training and test files can be created by random selection or stratified (with each file respecting a comparable proportion of the output (class) variable values). Typically, the number of instances in the test set is one third of the training set; this is not a strict requirement though. A sketch of creating such a split follows below.
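A sketch of creating a stratified 2/3–1/3 split programmatically (file names are placeholders; the Explorer can achieve the same with filters):

    import java.io.File;
    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SplitDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");
            data.setClassIndex(data.numAttributes() - 1);

            data.randomize(new Random(42)); // shuffle first (random selection)
            data.stratify(3);               // 3 folds, each mirroring the class proportions

            Instances train = data.trainCV(3, 0); // 2/3 of the data
            Instances test  = data.testCV(3, 0);  // the remaining 1/3

            save(train, "iris-train.arff");
            save(test, "iris-test.arff");
        }

        static void save(Instances inst, String file) throws Exception {
            ArffSaver saver = new ArffSaver();    // writes an .arff file
            saver.setInstances(inst);
            saver.setFile(new File(file));
            saver.writeBatch();
        }
    }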
Accuracy testing: 2. Holdout estimation
• What should we do if we only have a single dataset?
• The holdout method reserves a certain amount for testing and uses the remainder for
training, after shuffling
• Usually: one third for testing, the rest for training; by default 66% training – 34% testing in Weka
• Problem: the samples might not be representative
• Example: a class value might be missing in the test data
• Advanced version uses stratification
• Ensures that each class value is represented with approximately equal proportions in both subsets
• Holdout estimates can be made more reliable by repeating the process with
different subsamples
• In each iteration, a certain proportion is randomly selected for training (possibly with
stratification)
• The error rates on the different iterations are averaged to yield an overall error rate
• This is called the repeated holdout method
• Still not optimum: the different test sets overlap
• Can we prevent overlapping?
3. Repeated holdout: an example
• So there is quite some variation in the accuracy %.
• Calculate the mean and variance to get a more reliable outcome (here the accuracy lies between 92.9% and 96.7%); a sketch follows below.
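A sketch of repeated holdout with mean and standard deviation (J48 on iris.arff as a placeholder; a new shuffle is used in every run):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RepeatedHoldoutDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");
            data.setClassIndex(data.numAttributes() - 1);

            int runs = 10;
            double sum = 0, sumSq = 0;
            for (int seed = 1; seed <= runs; seed++) {
                Instances copy = new Instances(data);
                copy.randomize(new Random(seed));   // a different shuffle each run
                int trainSize = (int) Math.round(copy.numInstances() * 0.66);
                Instances train = new Instances(copy, 0, trainSize);
                Instances test  = new Instances(copy, trainSize, copy.numInstances() - trainSize);

                J48 tree = new J48();
                tree.buildClassifier(train);
                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(tree, test);

                sum += eval.pctCorrect();
                sumSq += eval.pctCorrect() * eval.pctCorrect();
            }
            double mean = sum / runs;
            double std = Math.sqrt(sumSq / runs - mean * mean);
            System.out.printf("mean accuracy = %.1f%%, std = %.1f%n", mean, std);
        }
    }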
4.1. k-fold Cross Validation
• 10-fold cross-validation:
  • Divide the dataset into 10 parts (folds)
  • Hold out each part in turn as the test set, and train on the remaining 9 folds (the training set)
  • Average the results
  • So each data point is used once for testing, 9 times for training (a sketch follows below)
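In the Weka Explorer this is the default test option; a minimal sketch through the Java API (setting the number of folds to the number of instances would give leave-one-out, cf. below):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidationDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");
            data.setClassIndex(data.numAttributes() - 1);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1)); // stratified 10-fold CV
            System.out.printf("accuracy = %.1f%%, kappa = %.2f%n",
                    eval.pctCorrect(), eval.kappa());
        }
    }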
4.1. k-fold Cross Validation:
• With 10‐fold cross‐validation, Weka invokes the learning algorithm 11 times
• Practical rule of thumb:
• Lots of data? – use percentage split
• Else stratified 10‐fold cross‐validation
ML = Machine Learning
More on cross-validation
• Standard method for evaluation: stratified ten-fold cross-validation
• Why ten?
• Extensive experiments have shown that this is the best choice to get an accurate estimate
• There is also some theoretical evidence for this (cf. Witten, et al. 2017)
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
• E.g., ten-fold cross-validation is repeated ten times and results are averaged (reduces the
variance)
Accuracy testing: 4.2. Leave-one-out cross-validation
• Leave-one-out: a particular form of k-fold cross-validation:
• Set number of folds to the number of training instances
• I.e., for n training instances, build classifier n times
• Makes best use of the data
• Involves no random subsampling
• Very computationally expensive (exception: using lazy classifiers such as the
nearest-neighbor classifier ( cf. later))
• Disadvantage of Leave-one-out CV: stratification is not possible
• It guarantees a non-stratified sample because there is only one instance in the test set!
Classification Algorithms
• Structure of the slides:
• Naïve Bayes
• K-nearest neighbours (IBk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
Naïve Bayes
• Frequently used in Machine Learning, Naive Bayes is a collection of classification algorithms based on the Bayes
Theorem. The family of algorithms all share a common principle, that every feature being used to classify is independent
of the value of any other feature. So for example, a fruit may be considered to be an apple if it is red, round, and about
3″ in diameter. A Naive Bayes classifier considers each of these “features” (red, round, 3” in diameter) to contribute
independently to the probability that the fruit is an apple, regardless of any correlations between features. In reality, however, features or attributes are not always independent, which is often seen as a shortcoming of the Naive Bayes algorithm and is why it is called "naive".
• Although based on a relatively simple idea, Naive Bayes can often outperform other more sophisticated algorithms and
is very useful in common applications like spam detection and document classification.
• In a nutshell, the algorithm aims at predicting a class (outcome variable), given a set of features, using probabilities. So
in a fruit example, we could predict whether a fruit is an apple, orange or banana (class) based on its color, shape, etc.
(features).
• Advantages
• It is relatively simple to understand and build
• It is easily trained, even with a small dataset
• It is not sensitive to irrelevant attributes
• Disadvantages
• It assumes every attribute is independent, which is not always the case
Probabilities using Bayes’s rule
• Famous rule from probability theory thanks to Thomas Bayes
• Probability of an event H given observed evidence E:
P(H | E) = P(E | H)P(H) / P(E)
• H = class variable; E=instance
• A priori probability of H : P(H )
• Probability of event before, prior to, evidence is seen
• E.g. before tossing a coin: the probability of 'heads' is 50%, given that the coin has two similar sides … We know this a priori.
• A posteriori probability of H : P(H | E)
• Probability of event after evidence is seen
• E.g. tossing a coin 1000 times, with 550 times 'heads'. So now, with this evidence, the probability of tossing heads = 550/1000. This is the a posteriori probability.
Naïve Bayes
• So suppose we are presented with new data: we are only given the features of a piece of fruit and we need to predict the class, i.e. the fruit type. If we are told that the additional fruit is Long, Sweet and Yellow, we can classify it using the following formula and a frequency table (how often each feature occurs per fruit type).
• P(H|E) = P(E|H)*P(H) / P(E)
• So H = the values of the class = Banana, Orange or Other
• So E = the evidence presented = 3 attributes: Long, Sweet and Yellow
• P(E) is not known, but it is the same for each fruit, so we can conclude that the fruit is most likely a Banana (0.252 > 0.01875).
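The quoted numbers correspond to a computation of the following shape (the individual probabilities are reconstructed here to match the figures above, so treat them as illustrative):

  P(Banana | E) ∝ P(Long | Banana) × P(Sweet | Banana) × P(Yellow | Banana) × P(Banana) = 0.8 × 0.7 × 0.9 × 0.5 = 0.252
  P(Other | E)  ∝ 0.5 × 0.75 × 0.25 × 0.2 = 0.01875
  P(Orange | E) ∝ 0 (none of the oranges counted are Long)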
Naïve Bayes: Probabilities for weather.nominal data
            Outlook            Temperature          Humidity             Windy            Play
            Yes   No           Yes   No             Yes   No             Yes   No         Yes   No
  Sunny      2     3    Hot     2     2    High      3     4    False     6     2          9     5
  Overcast   4     0    Mild    4     2    Normal    6     1    True      3     3
  Rainy      3     2    Cool    3     1
  Sunny     2/9   3/5   Hot    2/9   2/5   High     3/9   4/5   False    6/9   2/5       9/14  5/14
  Overcast  4/9   0/5   Mild   4/9   2/5   Normal   6/9   1/5   True     3/9   3/5
  Rainy     3/9   2/5   Cool   3/9   1/5

Just counting… E.g. "Outlook = Sunny" together with "Play = yes" happened 2 times in the 14 instances. The first rows of the data set:

  Outlook  Temp  Humidity  Windy  Play
  Sunny    Hot   High      False  No
  Sunny    Hot   High      True   No

For a new day E = (Sunny, Cool, High, True):

  P(yes | E) = P(Sunny | yes) × P(Cool | yes) × P(High | yes) × P(True | yes) × P(yes) / P(E)
             = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E)
Naïve Bayes: Probabilities for weather.nominal data (continued; same frequency table as above)
• What if an attribute value does not occur with every class value?
(e.g., “Outlook= Overcast” for class “no”)
• Probability will be zero: E.g. P( Outlook = Overcast | No) = 0
• A posteriori probability will also be zero: P(No | E) = 0
(Regardless of how likely the other values are!)
• So a zero frequency effectively takes over the entire probability calculation.
• Remedy: add 1 to the count for every attribute value-class combination
(Laplace estimator)
• Result: probabilities will never be zero
• Additional advantage: stabilizes probability estimates computed from small samples of data (where zero counts are more likely)
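For example (a worked illustration, not on the slide): with the Laplace estimator, P(Outlook = Overcast | no) becomes (0 + 1) / (5 + 3) = 1/8 instead of 0/5 = 0, since 1 is added to each of Outlook's three value counts within class "no".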
Numeric (Continuous) attributes
• In the previous examples, all attributes were discrete!
• What if certain attributes are numeric (e.g. Temperature in the weather.numeric data)?
• Usual assumption: attributes have a normal or Gaussian probability distribution (given the
class)
• The probability density function for the normal distribution is defined by two parameters:
• Sample mean
• Standard deviation
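Written out (a standard formula, included here for completeness): the density is f(x) = 1/(σ√(2π)) · e^(−(x−μ)²/(2σ²)), with the mean μ and standard deviation σ estimated per class value; the density value then takes the place of the frequency-based probability in the Naïve Bayes product.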
Classifying a new day
• Weka's standard output shows just the frequencies (counts). Do you have to calculate the probabilities yourself? No: choose "Output predictions" (Plaintext) via the "More options…" button in the test panel.
Classification Algorithms
• Structure of the slides:
• Decision Trees (J48)
• PART
• ZeroR
• 1R (’one R’)
• Naïve Bayes
• K-nearest neighbours (IBk in Weka)
• Neural Network (Multilayer Perceptron in Weka)

K-nearest neighbours (IBk)
• Note that taking the square root is not required when comparing (Euclidean) distances: the square root is monotonic, so the order of the neighbours does not change (see the sketch below)
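A tiny illustration of why the square root can be skipped (plain Java, no Weka):

    public class DistanceDemo {
        // Squared Euclidean distance: sufficient for *comparing* distances,
        // because sqrt is a monotonically increasing function.
        static double squaredDistance(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return sum;
        }

        public static void main(String[] args) {
            double[] query = {1.0, 2.0};
            double[] p = {2.0, 3.0};
            double[] q = {4.0, 0.0};
            // p is nearer than q iff its squared distance is smaller; taking
            // Math.sqrt of both sides would not change the comparison.
            System.out.println(squaredDistance(query, p) < squaredDistance(query, q)); // true
        }
    }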
[Figure: a small feed-forward network; nodes (Node i, Node j, Node k, …) connected by weighted links (Wij, Wik, Wjk, …)]

• The "nodes" are neurons or perceptrons, and the links between them are the "dendrites". The neurons perform a number of operations and pass the result on to the next layer. This calculation is a linear combination of the inputs and their weights. Now what does a value of, say, 8 mean? We need to define a threshold value. The neural network's output, 0 or 1 (stay home or go to work), is determined by whether the value of the linear combination is greater than the threshold value. Suppose the threshold value is 5: if the calculation gives you less than 5, you can stay at home, but if it is equal to or more than 5, you need to go to work.

Source: Towardsdatascience.com
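A minimal sketch of this threshold unit (the inputs and weights are made-up numbers for illustration):

    public class PerceptronDemo {
        public static void main(String[] args) {
            double[] inputs  = {1.0, 0.0, 1.0};  // hypothetical yes/no input factors
            double[] weights = {6.0, 3.0, 2.0};  // hypothetical importance of each factor

            double sum = 0;                      // linear combination of inputs and weights
            for (int i = 0; i < inputs.length; i++) {
                sum += inputs[i] * weights[i];   // here: 6 + 0 + 2 = 8
            }

            double threshold = 5.0;
            int output = (sum >= threshold) ? 1 : 0;  // 1 = go to work, 0 = stay home
            System.out.println("sum = " + sum + ", output = " + output); // sum = 8.0, output = 1
        }
    }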
Neural Network
• So using weights, the NN algorithm knows which information will be most important in making its
decision. A higher weight means the neural network considers that input more important compared to
other inputs.
• The weights are set to random values, and then the network adjusts those weights based on the output
errors it made using the previous weights. This is called training the neural network.
• So the NN algorithm takes in inputs and applies a linear combination using a weight vector. The result is compared to a threshold value, leading to an output, 1 or 0. This predicted output is compared to the observed (real) output, so the NN can learn:
  Error = Observed (actual) − Predicted
1. Each neuron Ni first computes a weighted sum of its inputs:

     in_i = Σ w_ki × a_k   (summing over all k in Inputs(Ni))

   where Inputs(Ni) corresponds to all neurons that provide an input to Ni, a_k is the input (activation) value coming from neuron Nk, and w_ki the weight on that link.

2. This sum, called the input function, is then passed on to Ni's activation function g_i. So:

     a_i = g_i(in_i)

   Typically, the activation functions of all neurons in the NN are the same, so we just write g instead of g_i. Example of a Sigmoid function: g(x) = 1 / (1 + e^(−x)).

Thus, every neuron in a NN takes in the activations of all its inputs and provides its own activation as an output.
NN: Backpropagation & Gradient Descent
• It is the job of the training process of a NN to ensure that the weights, w_ij, given by each neuron to each of its inputs are set right, so that the entire NN gives an optimal outcome. Backpropagation is one of the ways to optimize those weights.
• The error function defines how much the output of the model differs from the required (observed) output. Typically, a mean-squared error function is used for this.
• It is important to know that each training instance will result in a different error value. It is the
backpropagation’s goal to minimize the error for all training instances on average. Backpropagation tries
to minimize or reduce the error function, by going backwards through the network, layer by layer. Then it
uses gradient descent to optimize or fine-tune the weights.
• So gradient descent is a technique used to fine-tune the weights; the update rule is sketched below.
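In symbols (a standard formulation, spelled out here for completeness): with mean-squared error E = ½ Σ (observed − predicted)², each weight is repeatedly adjusted in the direction that reduces E:

  w_ij ← w_ij − η · ∂E/∂w_ij

where η is the learning rate. Backpropagation computes the ∂E/∂w_ij terms layer by layer, from the output layer backwards.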