Naïve Bayes Classifier

Ke Chen

COMP24111 Machine Learning


Outline

• Background
• Probability Basics
• Probabilistic Classification
• Naïve Bayes
– Principle and Algorithms
– Example: Play Tennis

• Relevant Issues
• Summary
Background
• There are three methods to establish a classifier
a) Model a classification rule directly
Examples: k-NN, decision trees, perceptron, SVM
b) Model the probability of class memberships given input data
Example: perceptron with the cross-entropy cost
c) Make a probabilistic model of data within each class
Examples: naive Bayes, model based classifiers
• a) and b) are examples of discriminative classification
• c) is an example of generative classification
• b) and c) are both examples of probabilistic classification

Probability Basics
• Prior, conditional and joint probability for random variables
– Prior probability: P(X)
– Conditional probability: P(X1|X2), P(X2|X1)
– Joint probability: X = (X1, X2), P(X) = P(X1, X2)
– Relationship: P(X1, X2) = P(X2|X1)P(X1) = P(X1|X2)P(X2)
– Independence: P(X2|X1) = P(X2), P(X1|X2) = P(X1), P(X1, X2) = P(X1)P(X2)
• Bayesian Rule

  P(C|X) = P(X|C)P(C) / P(X),   i.e.   Posterior = (Likelihood × Prior) / Evidence
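
As a quick illustration of the rule in code, with made-up numbers (a hypothetical two-class example, not from the slides):

```python
# Bayes rule with made-up numbers (hypothetical two-class example, illustrative only).
prior = {"spam": 0.3, "ham": 0.7}           # P(C)
likelihood = {"spam": 0.8, "ham": 0.1}      # P(X|C) for one observed feature value X

evidence = sum(likelihood[c] * prior[c] for c in prior)              # P(X)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}  # P(C|X)
print(posterior)  # {'spam': 0.774..., 'ham': 0.225...}
```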

Probability Basics
• Quiz: We have two six-sided dice. When they are rolled, the following events can
occur: (A) die 1 lands on side “3”, (B) die 2 lands on side “1”, and (C) the two
dice sum to eight. Answer the following questions:
1) P(A) = ?
2) P(B) = ?
3) P(C) = ?
4) P(A|B) = ?
5) P(C|A) = ?
6) P(A, B) = ?
7) P(A, C) = ?
8) Is P(A, C) equal to P(A) × P(C)?
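
One way to check your answers is brute force: enumerate all 36 equally likely outcomes and count (a small sketch, not part of the slides):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 (die1, die2) pairs

def prob(event):
    """Probability of an event (a predicate on one outcome) under the uniform distribution."""
    hits = [o for o in outcomes if event(o)]
    return Fraction(len(hits), len(outcomes))

A = lambda o: o[0] == 3            # die 1 shows 3
B = lambda o: o[1] == 1            # die 2 shows 1
C = lambda o: o[0] + o[1] == 8     # the two dice sum to 8

print(prob(A), prob(B), prob(C))                           # 1/6  1/6  5/36
print(prob(lambda o: A(o) and B(o)) / prob(B))             # P(A|B) = 1/6
print(prob(lambda o: A(o) and C(o)) / prob(A))             # P(C|A) = 1/6
print(prob(lambda o: A(o) and B(o)))                       # P(A, B) = 1/36
print(prob(lambda o: A(o) and C(o)))                       # P(A, C) = 1/36
print(prob(lambda o: A(o) and C(o)) == prob(A) * prob(C))  # False: A and C are not independent
```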
Probabilistic Classification
• Establishing a probabilistic model for classification
– Discriminative model: model P(C|X) directly, where C = c1, …, cL and X = (X1, …, Xn)

[Diagram: a single discriminative probabilistic classifier takes the input x = (x1, x2, …, xn) and outputs the posterior probabilities P(c1|x), P(c2|x), …, P(cL|x).]

Probabilistic Classification
• Establishing a probabilistic model for classification (cont.)
– Generative model: model P(X|C) for each class, where C = c1, …, cL and X = (X1, …, Xn)

[Diagram: one generative probabilistic model per class; each model takes x = (x1, x2, …, xn) and outputs the class-conditional likelihood P(x|c1), P(x|c2), …, P(x|cL).]

Probabilistic Classification
• MAP classification rule
– MAP: Maximum A Posteriori
– Assign x to c* if
  P(C = c*|X = x) > P(C = c|X = x),  for all c ≠ c*, c = c1, …, cL

• Generative classification with the MAP rule
– Apply the Bayesian rule to convert class-conditional likelihoods and priors into posterior probabilities
  P(C = ci|X = x) = P(X = x|C = ci)P(C = ci) / P(X = x)
                  ∝ P(X = x|C = ci)P(C = ci),  for i = 1, 2, …, L
– Then apply the MAP rule
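
A compact sketch of the rule: because the evidence P(X = x) is the same for every class, comparing the unnormalised products P(x|c)·P(c) gives the same decision. The `likelihood` and `prior` arguments below are placeholders for whatever estimates are available, not functions defined in these notes:

```python
# MAP decision sketch: choose the class maximising likelihood * prior.
# likelihood(x, c) ~ P(X=x|C=c) and prior(c) ~ P(C=c) are assumed to be
# supplied by the caller (placeholders, not part of the lecture's code).
def map_classify(x, classes, likelihood, prior):
    # P(X=x) is the same denominator for every class, so it can be ignored.
    return max(classes, key=lambda c: likelihood(x, c) * prior(c))
```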

Naïve Bayes
• Bayes classification
  P(C|X) ∝ P(X|C)P(C) = P(X1, …, Xn|C)P(C)
  Difficulty: learning the joint probability P(X1, …, Xn|C)

• Naïve Bayes classification
– Assumption: all input features are conditionally independent given the class!
  P(X1, X2, …, Xn|C) = P(X1|X2, …, Xn, C) P(X2, …, Xn|C)
                     = P(X1|C) P(X2, …, Xn|C)
                     = P(X1|C) P(X2|C) ⋯ P(Xn|C)
– MAP classification rule: for x = (x1, x2, …, xn), assign the label c* if
  [P(x1|c*) ⋯ P(xn|c*)]P(c*) > [P(x1|c) ⋯ P(xn|c)]P(c),  for all c ≠ c*, c = c1, …, cL

Naïve Bayes
• Algorithm: Discrete-Valued Features
– Learning Phase: Given a training set S,
  For each target value ci (ci = c1, …, cL)
    P̂(C = ci) ← estimate P(C = ci) with examples in S;
    For every feature value xjk of each feature Xj (j = 1, …, n; k = 1, …, Nj)
      P̂(Xj = xjk|C = ci) ← estimate P(Xj = xjk|C = ci) with examples in S;
  Output: conditional probability tables; for each Xj, a table of Nj × L elements

– Test Phase: Given an unknown instance x′ = (a1, …, an),
  look up the tables to assign the label c* to x′ if
  [P̂(a1|c*) ⋯ P̂(an|c*)]P̂(c*) > [P̂(a1|c) ⋯ P̂(an|c)]P̂(c),  for all c ≠ c*, c = c1, …, cL

Example
• Example: Play Tennis
  [Training data table: 14 example days, each described by Outlook, Temperature, Humidity and Wind and labelled Play = Yes or No.]

Example
• Learning Phase
Outlook    Play=Yes  Play=No        Temperature  Play=Yes  Play=No
Sunny        2/9       3/5          Hot            2/9       2/5
Overcast     4/9       0/5          Mild           4/9       2/5
Rain         3/9       2/5          Cool           3/9       1/5

Humidity   Play=Yes  Play=No        Wind         Play=Yes  Play=No
High         3/9       4/5          Strong         3/9       3/5
Normal       6/9       1/5          Weak           6/9       2/5

P(Play=Yes) = 9/14        P(Play=No) = 5/14

Example
• Test Phase
– Given a new instance, predict its label:
  x′ = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase:
  P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
  P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
  P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
  P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
  P(Play=Yes) = 9/14                     P(Play=No) = 5/14

– Decision making with the MAP rule (unnormalised posterior scores):
  P(Yes|x′) ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
  P(No|x′)  ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206

  Since P(Yes|x′) < P(No|x′), we label x′ as “No”.
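
As a quick check of these numbers, the learned tables can be encoded directly and the two unnormalised scores computed (a sketch; the dictionary layout is mine, not from the slides):

```python
from functools import reduce

# Conditional probability tables from the learning phase, P(feature=value | Play=c)
cond = {
    "Yes": {"Outlook=Sunny": 2/9, "Temperature=Cool": 3/9, "Humidity=High": 3/9, "Wind=Strong": 3/9},
    "No":  {"Outlook=Sunny": 3/5, "Temperature=Cool": 1/5, "Humidity=High": 4/5, "Wind=Strong": 3/5},
}
prior = {"Yes": 9/14, "No": 5/14}

x_new = ["Outlook=Sunny", "Temperature=Cool", "Humidity=High", "Wind=Strong"]

scores = {c: reduce(lambda acc, f: acc * cond[c][f], x_new, prior[c]) for c in prior}
print(scores)                       # {'Yes': 0.00529..., 'No': 0.02057...}
print(max(scores, key=scores.get))  # 'No'  -> label x' as "No"
```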

Naïve Bayes
• Algorithm: Continuous-valued Features
– A continuous-valued feature can take innumerable values, so a full probability table cannot be enumerated
– Conditional probability is often modeled with the normal (Gaussian) distribution
  P̂(Xj|C = ci) = 1/(√(2π)·σji) · exp( −(Xj − μji)² / (2σji²) )
  μji : mean (average) of the values of feature Xj over the examples for which C = ci
  σji : standard deviation of the values of feature Xj over the examples for which C = ci

– Learning Phase: for X = (X1, …, Xn), C = c1, …, cL
  Output: n × L normal distributions and P(C = ci), i = 1, …, L
– Test Phase: Given an unknown instance X′ = (a1, …, an)
  • Instead of looking up tables, calculate the conditional probabilities with the
    normal distributions obtained in the learning phase
  • Apply the MAP rule to make a decision

Naïve Bayes
• Example: Continuous-valued Features
– Temperature is naturally a continuous-valued feature.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate mean and variance for each class
  μ = (1/N) Σ xₙ ,    σ² = (1/N) Σ (xₙ − μ)²

  μ_Yes = 21.64, σ_Yes = 2.35
  μ_No  = 23.88, σ_No  = 7.09

– Learning Phase: output two Gaussian models for P(temp|C)

  P̂(x|Yes) = 1/(2.35·√(2π)) · exp( −(x − 21.64)² / (2 · 2.35²) )
  P̂(x|No)  = 1/(7.09·√(2π)) · exp( −(x − 23.88)² / (2 · 7.09²) )
Relevant Issues
• Violation of the Independence Assumption
– For many real-world tasks, P(X1, …, Xn|C) ≠ P(X1|C) ⋯ P(Xn|C)
– Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero Conditional Probability Problem
– If no training example contains the attribute value Xj = ajk, then P̂(Xj = ajk|C = ci) = 0
– In this circumstance, P̂(x1|ci) ⋯ P̂(ajk|ci) ⋯ P̂(xn|ci) = 0 during the test phase
– As a remedy, estimate the conditional probabilities with the m-estimate:

  P̂(Xj = ajk|C = ci) = (nc + m·p) / (n + m)

  nc : number of training examples for which Xj = ajk and C = ci
  n  : number of training examples for which C = ci
  p  : prior estimate (usually p = 1/t for t possible values of Xj)
  m  : weight given to the prior (number of “virtual” examples, m ≥ 1)
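
A one-function sketch of this remedy (the function name and signature are mine, not from the slides):

```python
def m_estimate(n_c, n, t, m=1.0):
    """Smoothed estimate of P(Xj = ajk | C = ci).

    n_c : count of training examples with Xj = ajk and C = ci
    n   : count of training examples with C = ci
    t   : number of possible values of feature Xj (so the prior p = 1/t)
    m   : weight of the prior ("virtual" examples)
    """
    p = 1.0 / t
    return (n_c + m * p) / (n + m)

# Example: Outlook=Overcast never occurs with Play=No (0 of 5 examples, 3 possible Outlook values):
print(m_estimate(0, 5, 3))   # 0.0555... instead of 0
```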
Summary
• Naïve Bayes: built on the conditional independence assumption
– Training is very easy and fast: it only requires considering each
  attribute in each class separately
– Testing is straightforward: just look up tables or calculate
  conditional probabilities with the estimated distributions
• A popular generative model
– Performance is competitive with most state-of-the-art classifiers, even
  when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate for a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more…

Exercise
• Using naïve Bayes, please classify buys_computer for the case:
  (age=youth, income=medium, student=yes, credit_rating=fair).

Thank You
