Business Intelligence DM2 WEKA Classification


Supervised learning: Classification

Using Weka
Getting to know WEKA
• Open the dataset

Stephan Poelmans– Data Mining 2


Classification Algorithms
• Structure of the slides:

• Decision Trees (J48)


• PART
• ZeroR
• 1R (’one R’)

• The problem of overfitting


• Quality Evaluation

• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)

Stephan Poelmans– Data Mining


Decision Trees with Classification rules
• Decision trees are usually easy to interpret intuitively. A decision tree closely resembles a “flow chart” that repeatedly splits the data at its nodes.
• The difficulty is not understanding a decision tree but rather building it. Especially when you work with hundreds or thousands of records, you need a data mining tool.
• A decision tree:
• Is an application of supervised learning (with “output” variables)
• Used to classify data (i.e. “classification”)
• The output variable is discrete!

Stephan Poelmans– Data Mining


Decision Trees with Classification rules:
Introductory example

Stephan Poelmans– Data Mining 5


Decision Trees with Classification rules :
Introductory example
• Training Instances

Stephan Poelmans– Data Mining 6


Decision Trees with Classification rules :
Introductory example
• Shape is important !

• Color is important !

Stephan Poelmans– Data Mining 7


Decision Trees with Classification rules :
Introductory example
• Decision tree:

Stephan Poelmans– Data Mining 8


Example of a decision tree
Splitting attributes

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree):
• Refund = Yes → NO
• Refund = No, MarSt = Married → NO
• Refund = No, MarSt = Single or Divorced, TaxInc < 80K → NO
• Refund = No, MarSt = Single or Divorced, TaxInc > 80K → YES

Note that:
- “Cheat” is the categorical “output” variable

Stephan Poelmans– Data Mining


Yet another example of a decision tree
(Same training data as on the previous slide.)

Alternative model (decision tree):
• MarSt = Married → NO
• MarSt = Single or Divorced, Refund = Yes → NO
• MarSt = Single or Divorced, Refund = No, TaxInc < 80K → NO
• MarSt = Single or Divorced, Refund = No, TaxInc > 80K → YES

There can consequently be more than one decision tree for the same data set!

Stephan Poelmans– Data Mining


Evaluating the model on the test data
Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Applying the decision tree: Refund = No → MarSt = Married → class = NO

Stephan Poelmans– Data Mining


How good is a decision tree?
• Multiple decision trees can be made on the same training set.
• The question then arises: which decision tree is the best one to use?
• To measure the validity of a decision tree we use a “confusion matrix”.
• A confusion matrix shows how many records (observations) were correctly classified by the decision tree, and how many records were wrongly classified.
• The confusion matrix is obtained by applying the decision tree to the test set!

12
Stephan Poelmans– Data Mining
Validation of a decision tree
• Focus on the predictive ability of a model
• Instead of focusing on how much time it takes to make a model, the size of the model, etc.
• Confusion Matrix: how many records were properly classified? How many errors?

PREDICTED CLASS
                        Class x     Class y
ACTUAL CLASS   Class x   a (TP)      b (FN)
               Class y   c (FP)      d (TN)

a: TP (true positive); b: FN (false negative); c: FP (false positive); d: TN (true negative)

False negative: a record belongs to a particular class, but was not assigned to it
False positive: a record does not belong to a class, but was nevertheless assigned to it
13
Stephan Poelmans– Data Mining
Validation of a decision tree
• Accuracy: the number of correctly classified records from a dataset / the total number
of records.

Model M1 (PREDICTED CLASS):
                     x      y
ACTUAL CLASS   x    150     40
               y     60    250
Accuracy = 80% (400 “true” / 500 “total”)

Model M2 (PREDICTED CLASS):
                     x      y
ACTUAL CLASS   x    250     45
               y      5    200
Accuracy = 90% (450 “true” / 500 “total”)

Stephan Poelmans– Data Mining 14
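Not part of the original slides: a minimal Python sketch of how a confusion matrix and the accuracy % can be computed from actual and predicted class labels (the toy label lists are made up purely for illustration).

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Count (actual, predicted) pairs into a nested dict: matrix[actual][predicted]."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}

def accuracy(actual, predicted):
    """Fraction of records whose predicted class equals the actual class."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Toy example with two classes, x and y
actual    = ["x", "x", "x", "y", "y", "y", "y", "x"]
predicted = ["x", "y", "x", "y", "y", "x", "y", "x"]
print(confusion_matrix(actual, predicted, ["x", "y"]))
print("accuracy =", accuracy(actual, predicted))   # 6/8 = 0.75
```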


Confusion Matrix

Table 2.5 • A Three-Class Confusion Matrix

                 Computed Decision
                 C1      C2      C3
       C1        C11     C12     C13
       C2        C21     C22     C23
       C3        C31     C32     C33

Stephan Poelmans– Data Mining


15
Weka: Use J48 to analyze the glass dataset
• Check the available classifiers (the “Classify” tab)
• Choose the J48 decision tree learner (trees>J48)
• Run it (“start”)
• Examine the output
• Look at the correctly classified instances
• … and the confusion matrix

16
Stephan Poelmans– Data Mining
Configuration panel

Click to open the configuration panel
Choose J48
Right-click to visualize

17
Stephan Poelmans– Data Mining
Weka: J48
• Evolution:
• ID3 (1979)
• C4.5 (1993)
• C4.8 (1996?) => J48 (adapted for Weka)
• C5.0 (commercial)
• Open the configuration panel in Weka (click on the white box under “Classifier”):
• Check the “More” information
• Examine the options
• Look at leaf sizes; set minNumObj to 15 to avoid small leaves
• Visualize the tree using the right-click menu

18
Stephan Poelmans– Data Mining
Min. leaf size = 2: accuracy = 66.8%. Min. leaf size = 15: accuracy = 62.15%.
19
Stephan Poelmans– Data Mining
General information

Output

Test method (cf. below)

Classification rules of the tree

Tree size
20
Stephan Poelmans– Data Mining
Output:
• Accuracy!
• Kappa!
• Other accuracy indicators, such as the ROC area under the curve, …
• Confusion matrix! Horizontal = observed. E.g. observed a = 70, of which 46 correctly classified (so the TP rate of a = 0.657).

Stephan Poelmans– Data Mining 21


J48: algorithm

Stephan Poelmans– Data Mining 22


J48: algorithm

23
Stephan Poelmans– Data Mining
Entropy is a measure of impurity. Information gain is the reduction in entropy achieved by a split, so the entropy after the split should be lower than before the split.
Stephan Poelmans– Data Mining 24
Stephan Poelmans– Data Mining 25
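As an illustration of the entropy / information-gain idea (not Weka's J48 code): a small Python sketch that computes the entropy before and after a split. The example reproduces the classic split of the 14 weather instances (9 yes / 5 no) on Outlook, using the counts that also appear in the Naïve Bayes tables later in these slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_lists):
    """Entropy before the split minus the weighted entropy after the split."""
    n = len(parent_labels)
    after = sum(len(child) / n * entropy(child) for child in child_label_lists)
    return entropy(parent_labels) - after

# Splitting 9 yes / 5 no on Outlook (weather data): Sunny, Overcast, Rainy
parent   = ["yes"] * 9 + ["no"] * 5
sunny    = ["yes", "yes", "no", "no", "no"]
overcast = ["yes", "yes", "yes", "yes"]
rainy    = ["yes", "yes", "yes", "no", "no"]
print(round(information_gain(parent, [sunny, overcast, rainy]), 3))  # ≈ 0.247 bits
```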
J48: algorithm

Stephan Poelmans– Data Mining 26


J48: algorithm

Note: information gain is measured in ‘bits’, as a unit of information
Stephan Poelmans– Data Mining 27
Weka: J48
• Fewer attributes = sometimes a better classification!
• Open glass.arff and run J48 (with default options first)
• Next, remove Fe, and run J48 again
• Next, remove all attributes except RI and Mg, and run J48 again
• Compare the decision trees, and particularly their accuracy %
• (Also use the right-click menu to visualize the decision trees)

• What is your conclusion? Which model is the most accurate?

28
Stephan Poelmans– Data Mining
Classification Algorithms
• Structure of the slides:

• Decision Trees (J48)


• PART
• ZeroR
• 1R (’one R’)

• The problem of overfitting


• Quality Evaluation

• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)

29
Stephan Poelmans– Data Mining
PART
• Rules from partial decision trees: PART
• Theoretically, rules and trees have equivalent “descriptive” or expressive power ... but either
can be more perspicuous (understandable, transparent) than the other
• Create a decision tree: top-down, ”divide-and-conquer”; read rules off the tree
• One rule for each leaf
• Straightforward, but rules contain repeated tests and are overly complex
• Alternative: a covering method: bottom-up, “separate-and-conquer”:
• Take a certain class (value) in turn and seek a way of covering all instances in it. This is called a covering
approach because at each stage you identify a rule that “covers” some of the instances. This approach may
lead to a set of rules, rather than to a decision tree.
• Separate-and-conquer:
• Identify a rule
• Remove the instances it covers
• Continue, creating rules for the remaining instances

30
Stephan Poelmans– Data Mining
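A minimal covering (“separate-and-conquer”) sketch in Python, assuming single-test rules and a tiny made-up data set. It follows the spirit of the description above (identify a rule, remove the instances it covers, continue), not Weka's actual PART algorithm, which builds partial C4.5 trees and turns the best leaf into a rule.

```python
# Minimal separate-and-conquer sketch (PRISM-style covering, single-test rules).
def best_test(instances, attributes, target):
    """Pick the single attribute=value test with the highest precision for `target`."""
    best, best_prec = None, -1.0
    for attr in attributes:
        for value in {x[attr] for x in instances}:
            covered = [x for x in instances if x[attr] == value]
            prec = sum(x["class"] == target for x in covered) / len(covered)
            if prec > best_prec:
                best, best_prec = (attr, value), prec
    return best

def cover_class(instances, attributes, target):
    """Keep adding rules until all `target` instances are covered (separate-and-conquer)."""
    rules, remaining = [], list(instances)
    while any(x["class"] == target for x in remaining):
        attr, value = best_test(remaining, attributes, target)
        rules.append((attr, value, target))
        remaining = [x for x in remaining if x[attr] != value]   # "separate" the covered instances
    return rules

data = [  # made-up toy instances
    {"outlook": "sunny", "windy": "false", "class": "yes"},
    {"outlook": "sunny", "windy": "true",  "class": "no"},
    {"outlook": "rainy", "windy": "false", "class": "yes"},
    {"outlook": "rainy", "windy": "true",  "class": "no"},
]
print(cover_class(data, ["outlook", "windy"], "yes"))  # e.g. [('windy', 'false', 'yes')]
```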
PART: Separate-and-Conquer: Example
• Identifying a rule for class a (so not explicitly considering class b):

• Possible rule set for class b:

• You could add more rules to get a ‘perfect’ rule set…


• E.g. one ‘a’ is not covered in the 3rd graph: this could be acceptable, but if you also want to cover that instance, you need an extra rule: If x > 1.4 and y < 2.4, then class = a
Stephan Poelmans– Data Mining 31
PART
• In the example above, a top-down decision tree
approach (’divide-and-conquer’) might lead to the same
rule set, presented via a decision tree
• However, whereas the covering algorithm is concerned
with only covering a single class, disregarding what
happens to the other classes, the tree division would
take all classes into account, right from the start. As a
result, in many instances, there is a difference between
rules and trees in terms of the transparency of the
representations. A decision tree typically (but not
always) leads to more complex rules
• In fact: Generating a PART decision list, using separate-
and-conquer, builds a partial C4.5 (J48 in Weka)
decision tree in each iteration and makes the "best" leaf
into a rule.

Stephan Poelmans– Data Mining 32


PART in Weka:
• Application to the Diabetes
dataset :
• Open diabetes.arff, classify:
rules -> PART ; trees -> J48:
• ZeroR: 64.29% (Can you
calculate this number
yourself?)
• J48 : Accuracy: 73.82% : 39-
node tree (with 20 leaves)
• PART: Accuracy: 75.26% : 13
rules

Stephan Poelmans– Data Mining 33


Classification Algorithms
• Structure of the slides:

• Decision Trees (J48)


• PART
• ZeroR
• 1R (’one R’)

• The problem of overfitting


• Quality Evaluation

• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)

34
Stephan Poelmans– Data Mining
Baseline Accuracy: ZeroR
• Open file diabetes.arff: 768 instances (500 negative, 268 positive)
• Always guess “negative”: 500/768 : accuracy = 65%
• rules > ZeroR: most likely class!
• Try these classifiers (cf. later for more info):
– trees > J48 74%
– bayes > NaiveBayes 74%
– lazy > IBk 70%
– rules > PART 75%
• So ZeroR is a classifier that uses no attributes. ZeroR simply “predicts” the mean (for a
numeric class (or output variable)) or the mode (for a nominal class).

35
Stephan Poelmans– Data Mining
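A ZeroR baseline is trivial to sketch in Python; the diabetes class distribution from the slide (500 negative / 268 positive) reproduces the ~65% baseline accuracy. The labels here are just placeholder strings, not the actual .arff contents.

```python
from collections import Counter
from statistics import mean

def zero_r(train_outputs):
    """ZeroR: ignore all attributes; predict the mode (nominal) or mean (numeric) of the class."""
    if all(isinstance(v, (int, float)) for v in train_outputs):
        return mean(train_outputs)
    return Counter(train_outputs).most_common(1)[0][0]

# diabetes.arff has 500 negative and 268 positive instances
labels = ["neg"] * 500 + ["pos"] * 268
prediction = zero_r(labels)                     # "neg"
baseline_accuracy = labels.count(prediction) / len(labels)
print(prediction, round(baseline_accuracy, 3))  # neg 0.651
```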
Baseline Accuracy : ZeroR
• Always try a simple baseline. Sometimes, the baseline is best!
• Open supermarket.arff and blindly apply the following classifications:
• rules > ZeroR -- accuracy =63.7%
• trees > J48 -- accuracy = 63.7%
• bayes > NaiveBayes – accuracy = 63.7%
• lazy > IBk – accuracy = 37% (!)
• rules > PART – accuracy = 63.7%

36
Stephan Poelmans– Data Mining
OneR: One attribute does all the work
• Learn a 1‐level “decision tree”
• i.e., rules that all test one particular attribute

• Basic version
• One branch for each value
• Each branch assigns most frequent class
• Error rate: proportion of instances that don’t belong to the majority class of their corresponding
branch
• Choose attribute with smallest error rate

37
Stephan Poelmans– Data Mining
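A minimal Python sketch of the basic OneR idea for nominal attributes: one branch per attribute value, each branch predicting its majority class, and the attribute with the smallest error rate wins. The toy data set and attribute names are made up; Weka's OneR additionally handles numeric attributes via discretization (see below).

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr="class"):
    """OneR: for each attribute build one rule per value (predict the majority class
    of that branch); keep the attribute with the smallest total number of errors."""
    best_attr, best_rules, best_errors = None, None, None
    for attr in attributes:
        branches = defaultdict(list)
        for x in instances:
            branches[x[attr]].append(x[class_attr])
        rules = {v: Counter(cs).most_common(1)[0][0] for v, cs in branches.items()}
        errors = sum(x[class_attr] != rules[x[attr]] for x in instances)
        if best_errors is None or errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules, best_errors

data = [  # made-up toy instances
    {"outlook": "sunny",    "windy": "false", "class": "no"},
    {"outlook": "sunny",    "windy": "true",  "class": "no"},
    {"outlook": "overcast", "windy": "false", "class": "yes"},
    {"outlook": "rainy",    "windy": "false", "class": "yes"},
    {"outlook": "rainy",    "windy": "true",  "class": "no"},
]
print(one_r(data, ["outlook", "windy"]))
```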
• In Weka:
• Open file weather.nominal.arff
• Choose OneR rule learner (rules>OneR)
• Look at the rule,

38
Stephan Poelmans– Data Mining
Dealing with numeric attributes
• Idea: discretize numeric attributes into sub ranges (intervals or partitions)
• How to divide each attribute’s overall range into intervals?
• Sort instances according to attribute’s values
• Place breakpoints where the (majority) class changes
• This minimizes the total classification error
• Example: temperature from the weather.numeric data, giving 8 intervals or partitions:

64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes | No | Yes  Yes  Yes | No   No | Yes  Yes  Yes | No | Yes  Yes | No

Outlook Temperature Humidity Windy Play


Sunny 85 85 False No
Sunny 80 90 True No
Overcast 83 86 False Yes
Rainy 75 80 False Yes
… … … … …
39
Stephan Poelmans– Data Mining
The problem of overfitting
• Discretization procedure is very sensitive to noise
• A single instance with an incorrect class label will probably produce a separate interval
• Simple solution:
enforce minimum number of instances in majority class per interval
• Example: temperature attribute with required minimum number of instances in majority class
set to three:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No | Yes Yes Yes | No | Yes Yes | No

• So now we have the following partition:


64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No

• However, whenever adjacent partitions have the same majority class, as the two first
partitions above (in both, ”yes” is the majority), they can be merged together, leading to 2
partitions
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes | No Yes Yes No
• So the rule: temperature <= 77.5 -> yes; temperature > 77.5 -> no
40
Stephan Poelmans– Data Mining
Results with overfitting avoidance
• Resulting rule sets for the four attributes in the weather.numeric data, with only two rules for the temperature attribute:

Attribute     Rules                        Errors   Total errors
Outlook       Sunny → No                   2/5      4/14
              Overcast → Yes               0/4
              Rainy → Yes                  2/5
Temperature   ≤ 77.5 → Yes                 3/10     5/14
              > 77.5 → No*                 2/4
Humidity      ≤ 82.5 → Yes                 1/7      3/14
              > 82.5 and ≤ 95.5 → No       2/6
              > 95.5 → Yes                 0/1
Windy         False → Yes                  2/8      5/14
              True → No*                   3/6

• In Weka:
  • Open weather.numeric
  • Classify
  • Rules > OneR
  • Configuration panel: minBucketSize = 3 (see previous slide)
  • Start
  • Look at the rule…

Stephan Poelmans– Data Mining 41


OneR: Conclusion
• Incredibly simple method, described in 1993 (Robert C. Holte, Computer Science
Department, University of Ottawa)
• “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets”
• Experimental evaluation on 16 datasets
• Used cross‐validation
• Simple rules often outperformed far more complex methods
• How can it work so well?
• some datasets really are simple
• some are so small/noisy/complex that nothing can be learned from them!

42
Stephan Poelmans– Data Mining
Classification Algorithms
• Structure of the slides:

• Decision Trees (J48)


• PART
• ZeroR
• 1R (’one R’)

• The problem of overfitting


• Quality Evaluation

• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)

43
Stephan Poelmans– Data Mining
The problem of Overfitting
• Any machine learning method may “overfit” the training data …
• … by producing a classifier that fits the training data too tightly
• So ML performs well on the training data but not on the independent test data (which is
then reflected in a low accuracy rate)
• Overfitting is a general phenomenon that plagues all ML methods
• This is one reason why you must always evaluate on an independent test set
• However, overfitting can occur more generally: you can have good accuracy rate but the
model performs badly when using new validation data (coming from a different context)
• E.g. You try many ML methods, and choose the best for your data – you cannot expect to get the
same performance on new validation data

44
Stephan Poelmans– Data Mining
Overfitting : an example
• Experiment with the diabetes dataset
• Open file diabetes.arff
• Choose ZeroR rule learner (rules>ZeroR)
• Use cross‐validation: 65.1%
• Choose OneR rule learner (rules>OneR)
• Use cross‐validation: 71.5%
• Look at the rule in the output (plas = plasma glucose concentration)
• In the configuration panel of OneR, change minBucketSize parameter to 1: run again
• Use cross-validation: 57.16%
• Look at the rule again
• So in the 2nd run of OneR, the rule is much more complex, tightly (over)fitted to the training set, but
performing worse on the test set(s) (even worse than ZeroR).

45
Stephan Poelmans– Data Mining
Classification Algorithms
• Structure of the slides:

• Decision Trees (J48)


• PART
• ZeroR
• 1R (’one R’)

• The problem of overfitting


• Quality Evaluation

• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)

46
Stephan Poelmans– Data Mining
Quality Evaluation: Accuracy, Precision and Recall
• So far, the main focus was on Accuracy % = the number of correctly classified records / the total number of records
• Overall a useful indicator, mainly in the case of more or less balanced class values
• Easy to understand, evaluates the entire model
• Additional, related measures (between 0 and 1), given per class value:
• TP rate = TP/(TP+FN) = Recall = sensitivity
• Precision = TP/(TP+FP)
• The F-measure is the harmonic mean of recall and precision
• Area under the ROC curve (also called the C statistic)
• A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the discriminatory ability of a classifier.
• The ‘area under the curve’ (or C statistic) represents the classification quality of the model: 1 is a perfect model, without any false positives; 0.5 is a worthless model that detects as many false positives as true positives

47
Stephan Poelmans– Data Mining
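A small Python sketch (not Weka output) that derives recall (TP rate), precision and the F-measure for one class value from a confusion matrix; the example matrix reuses the model M1 counts shown earlier.

```python
def per_class_metrics(matrix, cls):
    """Recall (TP rate), precision and F-measure for one class value,
    from a confusion matrix given as matrix[actual][predicted] counts."""
    tp = matrix[cls][cls]
    fn = sum(matrix[cls][p] for p in matrix if p != cls)
    fp = sum(matrix[a][cls] for a in matrix if a != cls)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f

# Two-class matrix from model M1 above: rows = actual, columns = predicted
m = {"x": {"x": 150, "y": 40}, "y": {"x": 60, "y": 250}}
print(per_class_metrics(m, "x"))  # recall 150/190 ≈ 0.789, precision 150/210 ≈ 0.714
```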
Quality Evaluation: Kappa Statistic
• Measures such as the accuracy % are easy to understand, but may give a distorted picture in the case of imbalanced classes. E.g. we have two classes, say A and B, and A shows up only 5% of the time. Classifying everything as B gives an accuracy of 95% (so ‘excellent’), whereas the minority class is not predicted well at all. This problem is even more pronounced in the case of 3 or more classes.
• Cohen’s kappa statistic is a very good measure that can handle imbalanced class problems very well.
• Cohen’s kappa statistic is a very good measure that can handle very well imbalanced class problems.
• Kappa statistic:
• (success rate of actual predictor - success rate of random predictor) / (1 - success rate of random predictor)
• Measures relative improvement on random predictor: 1 means perfect accuracy, 0 means we are doing no better than
random
• Interpretation, rules of thumb (!):
based on Landis, J.R.; Koch, G.G. (1977). “The measurement of observer agreement for categorical
data”. Biometrics 33 (1): 159–174
• value < = 0 is indicating no improvement over random prediction
• 0–0.20 a slight improvement,
• 0.21–0.40 a fair improvement,
• 0.41–0.60 a moderate improvement,
• 0.61–0.80 a substantial improvement,
• and 0.81–1 an almost perfect, maximal improvement.
• It basically tells you how much better your classifier is performing over the performance of a classifier
that simply guesses at random according to the frequency of each class.
48
Stephan Poelmans– Data Mining
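A sketch of the kappa computation from a confusion matrix, following the formula above (the random predictor guesses according to the observed class frequencies); the example reuses the model M2 counts shown earlier.

```python
def kappa(matrix):
    """Cohen's kappa from a confusion matrix given as matrix[actual][predicted]:
    (observed accuracy - chance accuracy) / (1 - chance accuracy)."""
    labels = list(matrix)
    total = sum(matrix[a][p] for a in labels for p in labels)
    observed = sum(matrix[c][c] for c in labels) / total
    # chance agreement of a random predictor that follows the actual/predicted frequencies
    chance = sum(
        (sum(matrix[a][p] for p in labels) / total) *    # actual frequency of class a
        (sum(matrix[r][a] for r in labels) / total)      # predicted frequency of class a
        for a in labels
    )
    return (observed - chance) / (1 - chance)

m = {"x": {"x": 250, "y": 45}, "y": {"x": 5, "y": 200}}
print(round(kappa(m), 3))   # ≈ 0.80, a substantial improvement over random
```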
Example
• A Decision Tree on the Iris.arff dataset:

TP rate, a = 49/50 = 0.98
Precision, a = 49/49 = 1.0
In this data set, all classes (a, b, c) are very well classified; class a has the best performance.

Stephan Poelmans– Data Mining 49


Accuracy Testing options
• Test set: independent instances that have played no part in formation of classifier
• Assumption: both training data and test data are representative samples of the underlying problem
• Test and training data may differ in nature
• Example: classifiers built using customer data from two different towns A and B
• To estimate performance of classifier from town A in completely new town, test it on data from B

1. Using separate training and test datasets


2. Holdout estimation (‘Percentage Split’)
3. Repeated Holdout
4. Cross-validation
1. K-Fold
2. Leave-one-out
50
Stephan Poelmans– Data Mining
Training and Test set need to be different,
but they are created from one dataset

Stephan Poelmans– Data Mining 51


Accuracy testing: 1. Using Separate training & test sets

• When opting for separate training and test sets, separate files (.arff) need to be created first. The model is then built with the training set, and tested separately on the test set. Training and test files can be created by random selection or stratified (with each file respecting a comparable proportion of the output (class) variable values). Typically, the number of instances in the test set is one third of the training set. This is not a strict requirement though.

• As an example, look at the Reuterscorn-test.arff (N=604) and Reuterscorn-training.arff (N=1554) sets. So open the training set in the Weka Explorer and, when classifying, choose the test file in “Supply test set”.

52
Stephan Poelmans– Data Mining
Accuracy testing: 2. Holdout estimation
• What should we do if we only have a single dataset?
• The holdout method reserves a certain amount for testing and uses the remainder for
training, after shuffling
• Usually: one third for testing, the rest for training, by default 66% training– 34% testing in Weka
• Problem: the samples might not be representative
• Example: a class value might be missing in the test data
• Advanced version uses stratification
• Ensures that each class value is represented with approximately equal proportions in both subsets

Stephan Poelmans– Data Mining


53
Accuracy testing: 3. Repeated holdout method

• Holdout estimates can be made more reliable by repeating the process with
different subsamples
• In each iteration, a certain proportion is randomly selected for training (possibly with
stratification)
• The error rates on the different iterations are averaged to yield an overall error rate
• This is called the repeated holdout method
• Still not optimum: the different test sets overlap
• Can we prevent overlapping?

Stephan Poelmans– Data Mining 54


3. Repeated holdout: an example

See slide 45
Stephan Poelmans– Data Mining 55
3. Repeated holdout: an example
• So there is quite some variation in the accuracy %.
• Calculate the mean and variance to get a more reliable outcome (it lies between 92.9% and 96.7%)

Stephan Poelmans– Data Mining 56


More on the Seed option: repeat holdout
(percentage split)
• When opting for repeated holdout estimation, you choose the percentage split, which is 66% by default. This means that Weka randomly selects 66% of your data set to build the model.
• So if you simply repeat the algorithm, you might expect another 66% to be chosen, leading to a different accuracy %. However, try it yourself: this is not the case. When simply repeating, you get the same accuracy.
• This is because Weka uses the same random seed (and thus the same random selection) each time, so you can re-run and compare methods without having different training sets.
• If you want different training sets (and test sets, and thus accuracy levels), click on ‘More options’ in the test panel and change the seed option in the resulting pop-up window.
57
Stephan Poelmans– Data Mining
Accuracy Testing 4.1. k-fold Cross-validation
• K-fold cross-validation avoids overlapping test sets
• First step: split data into k subsets of equal size
• Second step: use each subset in turn for testing, the remainder for training
• This means the learning algorithm is applied to k different training sets
• Often the subsets are stratified before the cross-validation is performed to yield
stratified k-fold cross-validation
• The error estimates are averaged to yield an overall error estimate; also, standard
deviation is often computed
• Alternatively, predictions and actual target values from the k folds are pooled to
compute one estimate
• Does not yield an estimate of standard deviation

58
Stephan Poelmans– Data Mining
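A minimal Python sketch of k-fold cross-validation as described above; `train_and_test` is a placeholder for any routine that builds a model on the training folds and returns its accuracy on the held-out fold (no stratification in this sketch).

```python
import random

def k_fold_indices(n, k, seed=1):
    """Shuffle the instance indices and split them into k (nearly) equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(instances, train_and_test, k=10):
    """Use each fold once for testing and the remaining k-1 folds for training;
    return the average accuracy over the k runs."""
    folds = k_fold_indices(len(instances), k)
    scores = []
    for test_idx in folds:
        test_set = set(test_idx)
        train = [x for i, x in enumerate(instances) if i not in test_set]
        test = [instances[i] for i in test_idx]
        scores.append(train_and_test(train, test))   # returns accuracy on `test`
    return sum(scores) / k
```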
4.1. k-fold Cross Validation
• 10-fold cross-validation:
  • Divide the dataset into 10 parts (folds)
  • Hold out each part (the test set) in turn for testing; train on the remaining folds (the training set)
  • Average the results
  • So each data point is used once for testing, 9 times for training

59
Stephan Poelmans– Data Mining
4.1. k-fold Cross Validation:
• With 10‐fold cross‐validation, Weka invokes the learning algorithm 11 times
• Practical rule of thumb:
• Lots of data? – use percentage split
• Else stratified 10‐fold cross‐validation

ML = Machine Learning
60
Stephan Poelmans– Data Mining
More on cross-validation
• Standard method for evaluation: stratified ten-fold cross-validation
• Why ten?
• Extensive experiments have shown that this is the best choice to get an accurate estimate
• There is also some theoretical evidence for this (cf. Witten, et al. 2017)
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
• E.g., ten-fold cross-validation is repeated ten times and results are averaged (reduces the
variance)

61
Stephan Poelmans– Data Mining
Accuracy Testing 4.2. Leave-one-out cross-
validation
• Leave-one-out:
a particular form of k-fold cross-validation:
• Set number of folds to the number of training instances
• I.e., for n training instances, build classifier n times
• Makes best use of the data
• Involves no random subsampling
• Very computationally expensive (exception: using lazy classifiers such as the
nearest-neighbor classifier ( cf. later))
• Disadvantage of Leave-one-out CV: stratification is not possible
• It guarantees a non-stratified sample because there is only one instance in the test set!

62
Stephan Poelmans– Data Mining
Classification Algorithms
• Structure of the slides:

• Decision Trees (J48)


• PART
• ZeroR
• 1R (’one R’)

• The problem of overfitting


• Quality Evaluation

• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)

63
Stephan Poelmans– Data Mining
Naïve Bayes
• Frequently used in Machine Learning, Naive Bayes is a collection of classification algorithms based on the Bayes
Theorem. The family of algorithms all share a common principle, that every feature being used to classify is independent
of the value of any other feature. So for example, a fruit may be considered to be an apple if it is red, round, and about
3″ in diameter. A Naive Bayes classifier considers each of these “features” (red, round, 3” in diameter) to contribute
independently to the probability that the fruit is an apple, regardless of any correlations between features. Features or
attributes, however, aren’t always independent in reality, which is often seen as a shortcoming of the Naive Bayes
algorithm and this is why it’s called “naive”.
• Although based on a relatively simple idea, Naive Bayes can often outperform other more sophisticated algorithms and
is very useful in common applications like spam detection and document classification.
• In a nutshell, the algorithm aims at predicting a class (outcome variable), given a set of features, using probabilities. So
in a fruit example, we could predict whether a fruit is an apple, orange or banana (class) based on its color, shape, etc.
(features).
• Advantages
• It is relatively simple to understand and build
• It is easily trained, even with a small dataset
• It is not sensitive to irrelevant attributes
• Disadvantages
• It assumes every attribute is independent, which is not always the case
64
Stephan Poelmans– Data Mining
Probabilities using Bayes’s rule
• Famous rule from probability theory thanks to Thomas Bayes
• Probability of an event H given observed evidence E:
P(H | E) = P(E | H)P(H) / P(E)
• H = class variable; E=instance
• A priori probability of H : P(H )
• Probability of event before, prior to, evidence is seen
• E.g. before tossing a coin: the probability of ‘heads’ is 50%, given that the coin has two similar sides…
We know this a priori.
• A posteriori probability of H : P(H | E)
• Probability of event after evidence is seen
• E.g. tossing a coin 1000 times , with 550 times ‘heads’. So now, with this evidence, the probability of
tossing heads = 550/1000. This is the a posteriori probability.

Stephan Poelmans– Data Mining 65


Naïve Bayes
P(H | E) = P(E | H)P(H) / P(E)
• Evidence E = non-class attribute values (attributes used to predict the class or outcome
variable)
• Event H = class (outcome) value of instance
• Naïve assumption: the evidence is split into parts (i.e., attributes) that are
conditionally independent
• This means, given n attributes, we can write Bayes’ rule using a product of per-
attribute probabilities:

P(H | E) = P(E1 | H) P(E2 | H) … P(En | H) P(H) / P(E)


• But wait, don’t worry … an example will make this clear… take the time to analyse it…

Stephan Poelmans– Data Mining 66


Naïve Bayes: Explanation with the Fruit example
• A simple example best explains the application of Naive Bayes for classification.
• So, suppose we have data on 1000 pieces of fruit. The fruit is a Banana, Orange or some
Other fruit, and imagine we know 3 features of each fruit, whether it’s long or not,
sweet or not and yellow or not, as displayed in the table below:

Stephan Poelmans– Data Mining 67


• So from the table, what do we already know?
• 50% of the fruits are bananas P(Banana) = .50
• 30% are oranges P(Orange) = .30
• 20% are other fruits P(Other)= .20
• Based on this training set we can also say the following:
• From 500 bananas 400 (0.8) are Long, 350 (0.7) are Sweet and 450 (0.9) are Yellow
• So P(Long|Banana) = .8; P(Sweet|Banana)= .7; P(Yellow|Banana) = .9
• Out of 300 oranges 0 are Long, 150 (0.5) are Sweet and 300 (1) are Yellow
• From the remaining 200 fruits, 100 (0.5) are Long, 150 (0.75) are Sweet and 50 (0.25) are Yellow

68
Stephan Poelmans– Data Mining
Naïve Bayes
• So suppose we are presented with new data: we are only given the features of a piece of fruit
and we need to predict the class, i.e. fruit type. If we are told that the additional fruit is Long,
Sweet and Yellow, we can classify it using the following formula and the facts from the table
above.
• P(H|E) = P(E|H)*P(H) / P(E)
• So H = the values of the class = Banana, Orange or Other
• So E = the evidence presented = 3 attributes: Long, Sweet and Yellow

• P(Banana | Long, Sweet, Yellow) = P(Long|Banana) * P(Sweet|Banana) * P(Yellow|Banana) * P(Banana) / (P(Long) * P(Sweet) * P(Yellow))
  = 0.8 * 0.7 * 0.9 * 0.5 / P(E) = 0.252 / P(E)
• P(Orange | Long, Sweet, Yellow) = 0 / P(E)
• P(Other | Long, Sweet, Yellow) = P(Long|Other) * P(Sweet|Other) * P(Yellow|Other) * P(Other) / (P(Long) * P(Sweet) * P(Yellow))
  = 0.5 * 0.75 * 0.25 * 0.2 / P(E) = 0.01875 / P(E)

• P(E) is not known, but it is the same for each fruit, so based on the above we can conclude that the fruit is most likely a Banana (0.252 > 0.01875).
69
Stephan Poelmans– Data Mining
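The fruit calculation above can be reproduced in a few lines of Python; the priors and per-class probabilities are copied from the table, and P(E) is ignored because it is the same for every class.

```python
# Class priors and per-class feature probabilities from the table of 1000 fruits
priors = {"Banana": 0.50, "Orange": 0.30, "Other": 0.20}
likelihoods = {             # P(feature | class)
    "Banana": {"long": 0.8, "sweet": 0.7,  "yellow": 0.9},
    "Orange": {"long": 0.0, "sweet": 0.5,  "yellow": 1.0},
    "Other":  {"long": 0.5, "sweet": 0.75, "yellow": 0.25},
}

def score(cls, evidence):
    """Numerator of Bayes' rule under the naive independence assumption (P(E) is dropped)."""
    p = priors[cls]
    for feature in evidence:
        p *= likelihoods[cls][feature]
    return p

for cls in priors:
    print(cls, score(cls, ["long", "sweet", "yellow"]))
# Banana 0.252, Orange 0.0, Other 0.01875  ->  predict Banana
```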
Naïve Bayes: Probabilities for weather.nominal data
Counts and relative frequencies, obtained by just counting in the data set below. E.g. Outlook = Sunny occurred 2 times with Play = yes and 3 times with Play = no; there are 9 yes and 5 no instances in total, so 2/9 and 3/5:

Outlook:      Sunny  yes 2 (2/9), no 3 (3/5);  Overcast  yes 4 (4/9), no 0 (0/5);  Rainy  yes 3 (3/9), no 2 (2/5)
Temperature:  Hot    yes 2 (2/9), no 2 (2/5);  Mild      yes 4 (4/9), no 2 (2/5);  Cool   yes 3 (3/9), no 1 (1/5)
Humidity:     High   yes 3 (3/9), no 4 (4/5);  Normal    yes 6 (6/9), no 1 (1/5)
Windy:        False  yes 6 (6/9), no 2 (2/5);  True      yes 3 (3/9), no 3 (3/5)
Play:         yes 9 (9/14), no 5 (5/14)

Data set (weather.nominal):

Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No

Stephan Poelmans– Data Mining 70
Weather.nominal data example

Predict a new day (evidence E): yes or no?

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?

Probability of class “yes”:
P(yes | E) = P(Outlook = Sunny | yes) × P(Temperature = Cool | yes) × P(Humidity = High | yes) × P(Windy = True | yes) × P(yes) / P(E)
           = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E)
71
Stephan Poelmans– Data Mining
Naïve Bayes: Probabilities for weather.nominal data
(Using the same counts and relative frequencies as on the previous slides.)

• A new day:

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?

Likelihood of the two classes:
For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For “no” = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into a probability by normalization:
P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no”) = 0.0206 / (0.0053 + 0.0206) = 0.795
72
Stephan Poelmans– Data Mining
The “zero-frequency problem”

• What if an attribute value does not occur with every class value?
(e.g., “Outlook= Overcast” for class “no”)
• Probability will be zero: E.g. P( Outlook = Overcast | No) = 0
• A posteriori probability will also be zero: P(No | E) = 0
(Regardless of how likely the other values are!)
• So a single zero frequency dominates the entire probability calculation.
• Remedy: add 1 to the count for every attribute value-class combination
(Laplace estimator)
• Result: probabilities will never be zero
• Additional advantage: stabilizes probability estimates computed from
small samples of data (where the likelihood of zero is bigger)

73
Stephan Poelmans– Data Mining
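A minimal Python sketch of a Naive Bayes classifier for nominal attributes with a Laplace estimator (starting all counts at 1). The attribute names and the `weather_instances` variable in the usage comment are illustrative placeholders; this is not Weka's implementation.

```python
from collections import Counter, defaultdict

def train_naive_bayes(instances, attributes, class_attr="class", laplace=1):
    """Count class and attribute-value frequencies; `laplace` adds 1 to every
    attribute-value/class count so probabilities can never be zero."""
    class_counts = Counter(x[class_attr] for x in instances)
    value_counts = defaultdict(Counter)                 # (attribute, class) -> value counts
    values = {a: {x[a] for x in instances} for a in attributes}
    for x in instances:
        for a in attributes:
            value_counts[(a, x[class_attr])][x[a]] += 1

    def predict(new):
        scores = {}
        for c, nc in class_counts.items():
            p = nc / len(instances)                     # class prior
            for a in attributes:
                p *= (value_counts[(a, c)][new[a]] + laplace) / (nc + laplace * len(values[a]))
            scores[c] = p
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}   # normalized posteriors
    return predict

# Usage sketch on weather.nominal-style data (weather_instances is hypothetical):
# predict = train_naive_bayes(weather_instances, ["outlook", "temp", "humidity", "windy"])
# print(predict({"outlook": "sunny", "temp": "cool", "humidity": "high", "windy": "true"}))
```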
Numeric (Continuous) attributes
• In the previous examples, all attributes were discrete !
• What if certain attributes are numeric (e.g. Temperature in the Weather.numeric data)
• Usual assumption: attributes have a normal or Gaussian probability distribution (given the
class)
• The probability density function for the normal distribution is defined by two parameters:

• Sample mean μ

• Standard deviation σ

• Then the density function is f(x) = (1 / (√(2π) σ)) · e^( −(x − μ)² / (2σ²) )

Stephan Poelmans– Data Mining


74
Statistics for weather.numeric data

• Counts for the nominal attributes (Outlook, Windy) and for Play are the same as for weather.nominal (e.g. Sunny: yes 2 (2/9), no 3 (3/5); False: yes 6 (6/9), no 2 (2/5); Play: yes 9/14, no 5/14).
• For the numeric attributes, compute the sample mean and standard deviation per class:
  • Temperature | yes: 64, 68, 69, 70, 72, …   μ = 73, σ = 6.2
  • Temperature | no: 65, 71, 72, 80, 85, …    μ = 75, σ = 7.9
  • Humidity | yes: 65, 70, 70, 75, 80, …      μ = 79, σ = 10.2
  • Humidity | no: 70, 85, 90, 91, 95, …       μ = 86, σ = 9.7

• Example density value: f(temperature = 66 | yes) ≈ 0.034

75
Stephan Poelmans– Data Mining
Classifying a new day

• A new day:

Outlook   Temp.   Humidity   Windy   Play
Sunny     66      90         True    ?

Likelihood of “yes” = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
Likelihood of “no” = 3/5 × 0.0221 × 0.0381 × 3/5 × 5/14 = 0.000108
P(“yes”) = 0.000036 / (0.000036 + 0.000108) = 25%
P(“no”) = 0.000108 / (0.000036 + 0.000108) = 75%

• Missing values during training are not included in the calculation of mean and standard deviation

Stephan Poelmans– Data Mining 76
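A small sketch of how the normal density is used for a numeric attribute. The temperature values for Play = yes are the standard weather.numeric values partially listed above (mean 73, standard deviation about 6.2); the result matches the 0.0340 used in the calculation above.

```python
import math
from statistics import mean, stdev

def gaussian_density(x, values):
    """Density f(x) of a normal distribution with the sample mean and standard
    deviation of `values` (how Naive Bayes handles a numeric attribute per class)."""
    mu, sigma = mean(values), stdev(values)
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Temperature values for Play = yes in weather.numeric
temp_yes = [64, 68, 69, 70, 72, 75, 75, 81, 83]
print(round(gaussian_density(66, temp_yes), 4))   # ≈ 0.034
```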


NaïveBayes in Weka
• Open file weather.nominal.arff
• Choose Naive Bayes method (bayes>NaiveBayes)
• Look at the output
• Avoid zero frequencies: start all counts at 1 (How can you see this in the output?)

Standard output in Weka: just frequencies. Do you have to calculate the probabilities yourselves? Choose “Output predictions”, PlainText (cf. the More options button in the test panel).
Stephan Poelmans– Data Mining 77
Classification Algorithms
• Structure of the slides:
• Decision Trees (J48)
• PART
• ZeroR
• 1R (’one R’)

• The problem of overfitting


• Quality Evaluation

• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)

Stephan Poelmans– Data Mining 78


K-nearest neighbours (KNN, Ibk in Weka)
• “Rote learning”: the simplest form of learning:
• To classify a new instance, search the training set for one that’s “most
like” it
• the instances themselves represent the “knowledge”
• lazy learning: just look for similarities, nothing more
• “Instance‐based” learning (Ibk) = “nearest‐neighbor” learning

Stephan Poelmans– Data Mining 79


KNN, an example
• A simple example: Below (left) is a chart with data points (observations); they are
grouped into red circles (RC) and green squares (GS). The intention is to find out the
class of the blue star (BS) (=new data). BS can either be RC or GS and nothing else. The
“K” in the KNN algorithm is the number of nearest neighbors we wish to consider.
Suppose K = 3, and the 3 closest neighbours of BS all belong to RC; we then classify BS in this
group. Next, we look at the values of the class (output) variable among these neighbours: the
majority class value is then predicted for BS (a ‘majority vote’).

Stephan Poelmans– Data Mining 80


K-nearest neighbours (KNN, Ibk in Weka)
• Search training set for one that’s “most like” it:
• We Need a similarity function
• Regular (“Euclidean”) distance? (sum of squares of differences)
• Manhattan (“city‐block”) distance? (sum of absolute differences)
• Nominal attributes? Distance = 1 if different, 0 if same
• Normalize the attributes to lie between 0 and 1?
• What about noisy instances?
• Nearest‐neighbor
• k‐nearest‐neighbors
• choose majority class among several neighbors (k of them)
• In Weka: lazy>IBk (instance‐based learning)

Stephan Poelmans– Data Mining 81


Similarity Functions
• Euclidean distance: d(x, y) = √( Σi (xi − yi)² )

  • Note that taking the square root is not required when comparing distances

• Manhattan (‘city-block’) distance: adds the absolute differences without squaring them: d(x, y) = Σi |xi − yi|

• Normalize the attributes to lie between 0 and 1?

  • Different attributes are measured on different scales and need to be normalized, e.g., to the range [0, 1]:

    ai = (vi − min vi) / (max vi − min vi), where vi is the actual value of attribute i
Stephan Poelmans– Data Mining 82


K-nearest neighbours (KNN, Ibk in Weka)
• So the algorithm
1. Compute a distance value between the item to be classified and every item in the
training data-set
2. Pick the k closest data points (the items with the k lowest distances)
3. Conduct a “majority vote” among those data points — the dominating classification in
that pool is decided as the final classification

• Investigate effect of changing k


• Use the Glass dataset
• lazy > IBk, k = 1, 5, 20
• 10‐fold cross‐validation:

Stephan Poelmans– Data Mining 83
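A minimal Python sketch of the 3-step algorithm above (Euclidean distance, pick the k closest, majority vote). The two-dimensional toy points mimic the RC/GS example from the earlier slide; they are not the glass data.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    """train = list of (feature_vector, class_label); predict by majority vote
    among the k training instances closest to `query`."""
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "RC"), ((1.2, 0.8), "RC"), ((0.9, 1.1), "RC"),
         ((3.0, 3.2), "GS"), ((3.1, 2.9), "GS")]
print(knn_classify(train, (1.1, 1.0), k=3))   # "RC"
```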


K-nearest neighbours (KNN, Ibk in Weka)
• Often very accurate … but slow:
• scan entire training data to make each prediction?
• sophisticated data structures can make this faster
• Assumes all attributes equally important
• Remedy: attribute selection or weights
• Remedies against noisy instances:
• Majority vote over the k nearest neighbors
• Weight instances according to prediction accuracy
• Identify reliable “prototypes” for each class
• Statisticians have used k-NN since the 1950s
• As the training set size grows towards infinity, the error approaches a minimum

Stephan Poelmans– Data Mining 84


Classification Algorithms
• Structure of the slides:

• Decision Trees (J48)


• PART
• ZeroR
• 1R (’one R’)

• The problem of overfitting


• Accuracy testing

• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)

Stephan Poelmans– Data Mining 85


Neural Networks
• A neural network is an algorithm that mimics the action of the brain

• A neural network, for example, looks like the following: an input layer (e.g. the values 1.0, 0.4 and 0.7 feeding nodes 1, 2 and 3), a hidden layer (nodes i and j) and an output layer (node k), where every link carries a weight (w1i, w1j, w2i, w2j, …, wik, wjk).

• The topology “reflects” the human brain.

• The “nodes” are neurons or perceptrons and the links between them are the “dendrites”. The neurons perform a number of operations and pass the result to the next layer.

Stephan Poelmans– Data Mining 86


Neural Networks
• 2 biological neurons: input of a neuron is provided by the dendrites, output of a neuron is provided by the
axons. The “synapse” links the axon with the dendrites (signals are thus transmitted between brain cells).

Stephan Poelmans– Data Mining 87


History of Neural Networks

Stephan Poelmans– Data Mining 88


Neural Networks
• Humans can be looked at as pattern-recognition machines. Human brains process
‘inputs’ from the world, categorize them (that’s a dog, that’s ice-cream, a spider, …), and
then generate an ‘output’ (pet the dog (or fight/ run), run away from the spider, taste
the ice-cream). This happens automatically and quickly, with little or no effort. It is the
same system that senses that someone is angry at us, or involuntarily reads the stop
sign as we speed past it. Psychologists call this mode of thinking ‘System 1’ (see K.
Stanovich and R. West), and it includes the innate skills — like perception and fear —
that we share with other animals. (There’s also a ‘System 2’, see Thinking, Fast and
Slow by Daniel Kahneman).
• Neural networks loosely mimic the way our brains solve the problem: by taking in
inputs, processing them and generating an output. Like us, they learn to recognize
patterns, but they do this by training on datasets.
Source: Towardsdatascience.com

Stephan Poelmans– Data Mining 89


Neural Network: input -> output
Suppose you bike to work. (output = go to work or not)
You have two factors to make your decision to go to work: the weather must not be bad, and it must be a weekday. The weather
is not that big a deal, but working on weekends is a big no-no. (So they say..) So weekday has a higher weight in your decision
than ‘bad weather’. The inputs have to be binary, so let’s propose the conditions as yes or no questions. Weather is fine? 1 for
yes, 0 for no. Is it a weekday? 1 yes, 0 no.
Let us set suitable weights of 2 for weather and 6 for weekday. Now how do we calculate the output? We simply multiply each
input with its respective weight, and sum up the values we get for all the inputs. For example, if it is a nice, sunny (1) weekday
(1), we would do the following calculation:
(1 × 2) + (1 × 6) = 8
This calculation is a linear combination. Now what does an 8 mean? We need to define the threshold value. The neural network’s
output, 0 or 1 (stay home or go to work), is determined if the value of the linear combination is greater than the threshold value.
Suppose the threshold value is 5, which means that if the calculation gives you less than 5, you can stay at home, but if it is equal to
or more than 5, you need to go to work.
Source: Towardsdatascience.com
Stephan Poelmans– Data Mining 90
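The bike-to-work example can be written as a tiny threshold unit in Python; the weights 2 and 6 and the threshold 5 are taken from the text above, everything else is illustrative.

```python
def go_to_work(weather_is_fine, is_weekday, weights=(2, 6), threshold=5):
    """Linear combination of binary inputs, compared to a threshold."""
    total = weather_is_fine * weights[0] + is_weekday * weights[1]
    return 1 if total >= threshold else 0

print(go_to_work(1, 1))  # sunny weekday: 2 + 6 = 8 >= 5 -> 1 (go to work)
print(go_to_work(1, 0))  # sunny weekend: 2 < 5        -> 0 (stay home)
```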
Neural Network
• So using weights, the NN algorithm knows which information will be most important in making its
decision. A higher weight means the neural network considers that input more important compared to
other inputs.
• The weights are set to random values, and then the network adjusts those weights based on the output
errors it made using the previous weights. This is called training the neural network.
• So the NN algorithm takes in inputs and applies a linear combination using a weight vector. The result is
compared to a threshold value, leading to an output, 1 or 0. This predicted output is compared to the
observed (real) output, so the NN can learn.

Stephan Poelmans– Data Mining 91


Neural Network
• As the neural network trains, it makes incremental changes to those weights to produce more accurate outputs.
• But additionally, a function that transforms the values for the decision of the output perceptron is known as
an activation function. A well-known activation function is the Sigmoid function. There are other activation
functions, such as linear functions, rectified linear functions, hyperbolic tangent, etc.
• Even when dealing with absolutes (1s and 0s; yeses and no’s), it is beneficial to have the output give an intermediary
value (a probability between 0 and 1). (It is comparable to answering “maybe” when asked a yes-or-no question you
have no clear answer to.)

Stephan Poelmans– Data Mining 92


Architecture of a Perceptron

Error = Observed (actual) − Predicted

Stephan Poelmans– Data Mining 93


Neural Network
• So far, we have discussed the architecture of the perceptron. The simplest neural network model has one
layer of perceptrons, so 1 hidden layer.
• In the neural network diagram below, the layer on the far left is the input layer (i.e. the data you feed in),
and the layer on the far right is the output layer (the network’s prediction/answer). Any number of layers
in between these two are known as hidden layers. The more layers there are, the more nuanced the
decision-making can get, and the more depth it has (and so they call it deep learning).
• The networks often go by different names: deep feedforward networks, feedforward neural
networks, or multi-layer perceptrons (MLP, cf. Weka).
• = feedforward networks because the information flows in one general (forward) direction, where
mathematical functions are applied at each stage.

Stephan Poelmans– Data Mining 94


How does it work more precisely?
Suppose Ni is the i-th perceptron (neuron) in a NN (in a hidden layer).
1. Ni has a stored weight vector, with one weight wki assigned to each of its inputs. As the first step, Ni computes a
weighted sum of all inputs (the input function in_i). So:

in_i = Σk (wki · ak),
where k ranges over Inputs(Ni), i.e. all neurons that provide an input to Ni, and ak is the input (activation) value of neuron k

2. This sum, called the input function, is then passed on to Ni’s activation function gi, which produces Ni’s activation ai. So:

ai = gi(in_i)

Typically, the activation functions of all neurons in the NN are the same, so we just write g
instead of gi. Example of a sigmoid function: g(x) = 1 / (1 + e^(−x))

Thus, every neuron in a NN takes in the activations of all its inputs and provides its own activation as an output
Stephan Poelmans– Data Mining 95
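A sketch of a forward pass through one hidden layer using the sigmoid activation. The weights are made-up numbers purely to illustrate the flow of activations (no bias terms, no training); the inputs 1.0, 0.4 and 0.7 echo the earlier network diagram.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron_activation(inputs, weights):
    """One neuron: weighted sum of the incoming activations (the input function),
    passed through the activation function g."""
    in_i = sum(w * a for w, a in zip(weights, inputs))
    return sigmoid(in_i)

def forward_pass(x, hidden_weights, output_weights):
    """Tiny feed-forward pass: one hidden layer, one output neuron."""
    hidden = [neuron_activation(x, w) for w in hidden_weights]
    return neuron_activation(hidden, output_weights)

# Hypothetical weights, just to show how activations flow forward
print(forward_pass([1.0, 0.4, 0.7],
                   hidden_weights=[[0.2, -0.5, 0.1], [0.4, 0.3, -0.2]],
                   output_weights=[0.7, -0.3]))
```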
NN: Backpropagation & Gradient Descent
• It is the job of the training process of a NN to ensure that the weights, wij, given by each neuron to each
of its inputs is set right, so that the entire NN gives an optimal outcome. Backpropagation is one of the
ways to optimize those weights.
• The error function defines how much the output of the model differs from the required (observed)
output. Typically, a mean-squared error function is used for this.
• It is important to know that each training instance will result in a different error value. It is the
backpropagation’s goal to minimize the error for all training instances on average. Backpropagation tries
to minimize or reduce the error function, by going backwards through the network, layer by layer. Then it
uses the gradient descent to optimize or fine-tune the weights.
• So the gradient descent is a technique used to fine tune the weights.

Stephan Poelmans– Data Mining 96


NN: Backpropagation & Gradient Descent
• Consider f(x) = 2x + 5. The gradient of the function defines how the value of f will
change with a unit decrease/increase of x. In the example, this is 2. (compare x=1 to
x=2)
• Gradient descent?
The red regions correspond to places of higher
function value, while the blue regions correspond to
low values. You want to reach the lowest point
possible. You are standing on the red hill, and you
can only see a small part of the terrain around you.
So you will take small steps in the direction in which
the terrain slopes down. With each step you take,
you will scan your immediate neighbourhood
(which is visible to you), and go in the direction that
shows the steepest descent (the local optimum).
Stephan Poelmans– Data Mining 97
NN: Backpropagation & Gradient descent
• So you are descending from a higher value of the target function (the error function), to
a lower value, by following the direction that corresponds to the steepest decrease in
function value.
• If the gradient descent algorithm finds the minimum point, it will stop. Then, the model
has converged.
• Note that you always want to move in the direction shown by the negation of the
gradient. Consider again f(x) = 2x+5. The gradient = 2; therefore to reduce the f(x), you
need to decrease the value of x.
• It is important to understand that backpropagation tries to optimize the weights between the neurons
and not the activation function within the neurons.

Stephan Poelmans– Data Mining 98


Gradient Descent
• A summary of the algorithm:
• Start from a point a on the graph of a function F;
• Find the direction in which the function decreases fastest, i.e. the negative gradient −∇F(a);
• Take a small step of size γ (down) along this direction, arriving at a new point a;
• By iterating the above three steps, we can find a local minimum (or the global minimum) of this function.

Stephan Poelmans– Data Mining 99
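A minimal gradient-descent sketch on a one-dimensional function; the function and step size are illustrative only. Real NN training applies the same idea to the error function over all weights, with the gradients supplied by backpropagation.

```python
def gradient_descent(grad, start, step=0.1, iterations=100):
    """Repeatedly move a small step against the gradient (the direction of
    steepest decrease) until (hopefully) reaching a minimum."""
    a = start
    for _ in range(iterations):
        a = a - step * grad(a)
    return a

# Minimize F(x) = (x - 3)^2, whose gradient is 2(x - 3); the minimum is at x = 3
print(round(gradient_descent(lambda x: 2 * (x - 3), start=0.0), 3))   # ≈ 3.0
```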


Neural networks : Multilayer Perceptron in Weka
• Open weather.nominal
• Functions -> MultilayerPerceptron; in the configuration panel, set GUI = True, then start
• 10-fold cross-validation: accuracy = 71.43% (compare with other classifiers)
• By default: 1 hidden layer!
• One epoch = one forward pass and one backward pass of all the training examples
• Learning rate = the amount by which the weights are changed in each update

Stephan Poelmans– Data Mining 100


Neural networks : Multilayer Perceptron in Weka
• You can extend the number of hidden layers in Weka (by default = 1), in the
configuration panel.
• For example, type “3, 5” in the hiddenLayers field instead of ‘a’, run again and see what happens

Want to know more?


https://www.coursera.org/courses?query=neural%20networks
&
http://neuralnetworksanddeeplearning.com

Stephan Poelmans– Data Mining 101


A few applications
• Stock market predictions
• Character recognition
• Business applications
• Face recognition, Colorization, etc.

Stephan Poelmans– Data Mining 102
