

Machine Learning

Generative Classification & Naïve Bayes

Dariush Hosseini

dariush.hosseini@ucl.ac.uk
Department of Computer Science
University College London

Lecture Overview

1 Lecture Overview

2 Generative Classification - Recap

3 Naïve Bayes
Categorical Naïve Bayes
Gaussian Naïve Bayes
Gaussian Naïve Bayes & Logistic Regression

4 Summary

Lecture Overview

By the end of this lecture you should:

1 Understand the Naïve Bayes algorithm and its motivation as a Generative approach to the classification problem

2 Understand the discrete and continuous versions of the Naïve Bayes algorithm

3 Understand the relationship between Gaussian Naïve Bayes and Logistic Regression

Generative Classification - Recap

Notation

Inputs
$\mathbf{x} = [1, x_1, \dots, x_m]^T \in \mathbb{R}^{m+1}$

Binary Outputs
$y \in \{0, 1\}$

Training Data
$S = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$

Data-Generating Distribution, $\mathcal{D}$
$S \sim \mathcal{D}$

Generative Classification - Recap

Probabilistic Environment

We assume:

x is the outcome of a random variable X

y is the outcome of a random variable Y

$(\mathbf{x}, y)$ are drawn i.i.d. from some data-generating distribution, $\mathcal{D}$, i.e.:

$(\mathbf{x}, y) \sim \mathcal{D}$

and:

$S \sim \mathcal{D}^n$

Generative Classification - Recap

Learning Problem
Representation
$f \in \mathcal{F}$

Evaluation
Loss Measure:
$E(f(\mathbf{x}), y) = \mathbb{I}[y \neq f(\mathbf{x})]$
Generalisation Loss:
$\mathcal{L}(E, \mathcal{D}, f) = \mathbb{E}_{\mathcal{D}}\big[\mathbb{I}[Y \neq f(\mathbf{X})]\big]$

Where $\mathcal{D}$ is characterised by $p_{X,Y}(\mathbf{x}, y) = p_Y(y|\mathbf{x})\,p_X(\mathbf{x})$ for some pmf, $p_Y(\cdot|\cdot)$, and some pdf, $p_X(\cdot)$

Optimisation
$f^* = \operatorname*{argmin}_{f \in \mathcal{F}} \mathbb{E}_{\mathcal{D}}\big[\mathbb{I}[Y \neq f(\mathbf{X})]\big]$

Generative Classification - Recap

Bayes Optimal Classifier

So the generalisation minimiser for the Misclassification Loss can be specified entirely in terms of the posterior distribution:

$f^*(\mathbf{x}) = \begin{cases} 1 & \text{if } p_Y(y=1|\mathbf{x}) > 0.5 \\ 0 & \text{if } p_Y(y=1|\mathbf{x}) < 0.5 \end{cases}$

It is known as the Bayes Optimal Classifier

Generative Classification - Recap

Probabilistic Classifier

In probabilistic classification we use this expression for the Bayes Optimal Classifier in order to re-cast the classification problem as an inference problem in which we must learn $p_Y(y=1|\mathbf{x})$

Here $p_Y(y=1|\mathbf{x})$ characterises an inhomogeneous Bernoulli distribution

Generative Classification - Recap

Generative Classification
In Generative Classification we seek to learn $p_Y(y=1|\mathbf{x})$ indirectly

First we re-express the Bayes Optimal Classifier as follows, without loss of generality:

$f^*(\mathbf{x}) = \operatorname*{argmax}_{y \in \{0,1\}} p_Y(y|\mathbf{x})$

$= \operatorname*{argmax}_{y \in \{0,1\}} \dfrac{p_X(\mathbf{x}|y)\,p_Y(y)}{\sum_{y' \in \{0,1\}} p_X(\mathbf{x}|y')\,p_Y(y')}$   [Bayes' Theorem]

$= \operatorname*{argmax}_{y \in \{0,1\}} p_X(\mathbf{x}|y)\,p_Y(y)$   [Denominator doesn't depend on $y$]

Then we seek to infer the likelihood $p_X(\mathbf{x}|y)$ and the prior $p_Y(y)$ for each class separately
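
A minimal Python sketch of this decision rule, using a made-up prior and a placeholder likelihood purely for illustration:

```python
import numpy as np

# Toy generative classifier: assume the prior p_Y(y) and the class-conditional
# likelihood p_X(x|y) have already been inferred. All numbers are invented.
prior = {0: 0.6, 1: 0.4}
means = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 2.0])}

def likelihood(x, y):
    # Placeholder class-conditional density (an unnormalised Gaussian bump).
    return np.exp(-0.5 * np.sum((x - means[y]) ** 2))

def f_star(x):
    # f*(x) = argmax_y p_X(x|y) p_Y(y); the evidence term has cancelled
    return max((0, 1), key=lambda y: likelihood(x, y) * prior[y])

print(f_star(np.array([1.8, 2.1])))   # -> 1
```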

Generative Classification - Recap

Inference Problem
Inferring $p_Y(y)$ is straightforward
In binary classification there is only one parameter to learn

Inferring $p_X(\mathbf{x}|y)$ is more difficult

For example: consider $\mathbf{x}$ which is a vector of boolean attributes
For each possible value $\mathbf{x} = \hat{\mathbf{x}}$ and $y = \hat{y}$ we must learn a probability, $p_X(\hat{\mathbf{x}}|\hat{y})$
For each value of $\hat{y}$ there are $2^m$ possible values of $\hat{\mathbf{x}}$
$2^m - 1$ parameters must be inferred for each output class
And $2(2^m - 1)$ parameters must be inferred altogether
This is intractable (a quick count is sketched below)
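
For illustration only, counting in Python how fast the full joint parameterisation grows with m:

```python
# 2^m - 1 free probabilities per class, so 2 * (2^m - 1) in total for binary y
for m in (5, 10, 20, 30):
    print(m, 2 * (2 ** m - 1))
# 5 62
# 10 2046
# 20 2097150
# 30 2147483646
```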

Generative Classification - Recap

Example: Document Topic Classification

Outcomes, y, of a random variable, Y, characterise a set of topics

Outcomes, $\mathbf{x}$, of a random variable, $\mathbf{X}$, characterise a particular document according to the bag-of-words representation

Here word order doesn't matter; instead $\mathbf{x}$ is a vector whose elements are boolean, each of which indicates the presence or absence of a particular dictionary word in the document

A dictionary is the set of words

Generative Classification - Recap

Example: Document Topic Classification


So, for example:

$\mathbf{x}^{(i)} = \begin{bmatrix} \text{`aardvark'}: & x_1^{(i)} = 1 \\ & \vdots \\ \text{`zyme'}: & x_m^{(i)} = 0 \end{bmatrix}$

But a dictionary contains $\sim 10{,}000$ words

So $m \approx 10{,}000$, and we need to infer $\sim 2^{10{,}000}$ parameters to characterise the likelihood!

We need a simplifying assumption...
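
A minimal sketch of the bag-of-words encoding; the five-word dictionary and the document below are invented for illustration (a real dictionary would have ~10,000 entries):

```python
dictionary = ["aardvark", "ballot", "election", "trump", "zyme"]

def to_binary_features(document):
    # x_i = 1 if dictionary word i appears in the document, else 0
    words = set(document.lower().split())
    return [1 if w in words else 0 for w in dictionary]

print(to_binary_features("The election ballot was counted"))   # [0, 1, 1, 0, 0]
```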

Naı̈ve Bayes

Conditional Independence: Definition


Given 3 random variables, X, Y, Z, we say that X is conditionally independent of Y given Z iff the probability distribution governing X is independent of the outcomes of Y given the outcomes of Z

So, $\forall\, i, j, k$:

$P\big(x^{(i)}\,|\,y^{(j)}, z^{(k)}\big) = P\big(x^{(i)}\,|\,z^{(k)}\big)$

$\implies P\big(x^{(i)}\,|\,y^{(j)}, z^{(k)}\big)\,P\big(y^{(j)}\,|\,z^{(k)}\big) = P\big(x^{(i)}\,|\,z^{(k)}\big)\,P\big(y^{(j)}\,|\,z^{(k)}\big)$

$\implies P\big(x^{(i)}, y^{(j)}\,|\,z^{(k)}\big) = P\big(x^{(i)}\,|\,z^{(k)}\big)\,P\big(y^{(j)}\,|\,z^{(k)}\big)$

Here $x^{(i)}, y^{(j)}, z^{(k)}$ are outcomes of X, Y, Z respectively

And the notation $P\big(x^{(i)}\,|\,y^{(j)}, z^{(k)}\big)$ is used as a short-hand for $P\big(X = x^{(i)}\,|\,Y = y^{(j)}, Z = z^{(k)}\big)$
Naı̈ve Bayes

Conditional Independence: Example

(Figure: a graphical-model diagram relating H and V)

A is a random variable with outcomes that are children's ages
H is a random variable with outcomes that are children's heights
V is a random variable with outcomes that are the ranges of children's vocabulary

$P(H = h, V = v) \neq P(H = h)\,P(V = v)$

$P(H = h, V = v \,|\, A = a) = P(H = h \,|\, A = a)\,P(V = v \,|\, A = a)$

Naïve Bayes

Recall that each sample, $(\mathbf{x}, y)$, is an outcome of a random variable, $(\mathbf{X}, Y)$
Furthermore: each element of $\mathbf{x}$, $x_i$, is the outcome of a corresponding random variable, $X_i$
Thus: $p_X(\mathbf{x}) = p_{X_1, X_2, \dots, X_m}(x_1, x_2, \dots, x_m)$

Naïve Bayes seeks to simplify the likelihood by assuming that $\{X_i\}_{i=1}^{m}$ are all conditionally independent given Y:

$p_X(\mathbf{x}|y) = \prod_{i=1}^{m} p_{X_i}(x_i|y)$

This is a much simpler representation
So: for our vector of boolean attributes we now need only $2m$ parameters, rather than $2^m - 1$, to characterise the likelihood (the factorised likelihood is sketched below)
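
A minimal sketch of the factorised likelihood for boolean features; the values theta[i, k] = p(X_i = 1 | Y = k) below are invented for illustration:

```python
import numpy as np

# m = 3 boolean features, 2 classes -> only 2m = 6 parameters in total
theta = np.array([[0.9, 0.2],
                  [0.1, 0.7],
                  [0.5, 0.5]])

def log_likelihood(x, k):
    # log p_X(x|y=k) = sum_i log p_{X_i}(x_i|y=k) under the NB assumption
    p = theta[:, k]
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

x = np.array([1, 0, 1])
print(log_likelihood(x, 0), log_likelihood(x, 1))   # class 0 fits this x far better
```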
Naı̈ve Bayes

Example: Document Topic Classification


Consider again our bag-of-words example
Let the topic, y, be 'Politics'
Let $x_i$ correspond to the presence or absence of the word 'Trump'
Let $x_j$ correspond to the presence or absence of the word 'Clinton'

The conditional independence assumption implies that:

$P(x_i = \text{`Trump'} \,|\, x_j = \text{`Clinton'}, y = \text{`Politics'}) = P(x_i = \text{`Trump'} \,|\, y = \text{`Politics'})$

This is quite a strong assumption... surely:

$P(x_i = \text{`Trump'} \,|\, x_j = \text{`Clinton'}, y = \text{`Politics'}) > P(x_i = \text{`Trump'} \,|\, y = \text{`Politics'})$

Despite this, Naïve Bayes often works well as a classifier

Naı̈ve Bayes

Representation

Recall that we seek:

$f^*(\mathbf{x}) = \operatorname*{argmax}_{y \in \{0,1\}} p_X(\mathbf{x}|y)\,p_Y(y)$

$= \operatorname*{argmax}_{y \in \{0,1\}} p_Y(y) \prod_{i=1}^{m} p_{X_i}(x_i|y)$   [By NB assumption]

But how do we learn the parameterisation of the prior and the likelihood?

Let's consider two cases:

Categorical Naïve Bayes → for discrete inputs
Gaussian Naïve Bayes → for continuous inputs

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Categorical Naı̈ve Bayes

Assume that the features, $x_i$, are discrete-valued, and take $m_i$ different values, such that the outcomes of $x_i$ are taken from the set $\{x_{ij}\}_{j=1}^{m_i}$

Let us attempt to learn the pmf's for $p_Y(y)$ and $p_{X_i}(x_i|y)$ in a frequentist setting using MLE

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Evaluation: $p_Y(y)$

Y is a Bernoulli random variable, with outcomes $y \sim \text{Bern}(\theta_y)$, which implies the following log-likelihood function:

$\ln(L(\theta_y)) = \ln\left(\prod_{i=1}^{n} p_Y(y^{(i)}; \theta_y)\right)$

$= \sum_{i=1}^{n} \ln\left(p_Y(y^{(i)}; \theta_y)\right)$

$= \sum_{i=1}^{n} y^{(i)} \ln\theta_y + (1 - y^{(i)}) \ln(1 - \theta_y)$

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Optimisation: $p_Y(y)$

We seek $\theta_{y,\text{MLE}}$ such that:

$\theta_{y,\text{MLE}} = \operatorname*{argmax}_{\theta_y} \sum_{i=1}^{n} y^{(i)} \ln\theta_y + (1 - y^{(i)}) \ln(1 - \theta_y)$

Let's try to find an analytic solution:

$\dfrac{d}{d\theta_y} \ln(L(\theta_y)) = \sum_{i=1}^{n} \dfrac{y^{(i)}}{\theta_y} - \dfrac{(1 - y^{(i)})}{1 - \theta_y}$

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Optimisation: $p_Y(y)$

For stationarity set this equal to zero:

$\sum_{i=1}^{n} \dfrac{y^{(i)}}{\theta_{y,\text{MLE}}} - \dfrac{(1 - y^{(i)})}{1 - \theta_{y,\text{MLE}}} = 0$

$\implies \sum_{i=1}^{n} y^{(i)}(1 - \theta_{y,\text{MLE}}) - (1 - y^{(i)})\,\theta_{y,\text{MLE}} = 0$

$\implies \sum_{i=1}^{n} y^{(i)} = \sum_{i=1}^{n} \theta_{y,\text{MLE}} = n\,\theta_{y,\text{MLE}}$

$\implies \theta_{y,\text{MLE}} = \dfrac{1}{n}\sum_{i=1}^{n} y^{(i)} = \dfrac{n_1}{n}$

Where $n_1$ is equal to the number of training points for which $y = 1$

We can confirm that this stationary point is a maximum by taking the second derivative (the log-likelihood is concave in $\theta_y$)
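
A quick numerical sketch (with invented toy labels) confirming that the sample fraction $n_1/n$ maximises the Bernoulli log-likelihood:

```python
import numpy as np

y = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 0])       # toy labels
theta_mle = y.mean()                                # n_1 / n
print(theta_mle)                                    # 0.4

def loglik(theta):
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

grid = np.linspace(0.01, 0.99, 99)
print(grid[np.argmax([loglik(t) for t in grid])])   # ~0.4, matching the MLE
```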


Naı̈ve Bayes / Categorical Naı̈ve Bayes

Evaluation: $p_{X_i}(x_i|y)$

$(X_i \,|\, y = k)$ is a categorical random variable, which can take the values $\{x_{ij}\}_{j=1}^{m_i}$

We seek to parameterise a different categorical distribution for each $(X_i, y = k)$

What is the categorical distribution?

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Evaluation: $p_{X_i}(x_i|y)$

It is a generalisation of the Bernoulli distribution, where the random variable has more than 2 discrete outcomes (in this case $m_i$):

$(x_i \,|\, y = k) \sim \text{Categorical}(\Theta_{ik})$

$\Theta_{ik}$ has elements $\{\theta_{ijk}\}_{j=1}^{m_i}$, with $\sum_{j=1}^{m_i} \theta_{ijk} = 1$

$p_{X_i}(X_i = x_{ij} \,|\, y = k; \Theta_{ik}) = \theta_{ijk}$

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Evaluation & Optimisation: $p_{X_i}(x_i|y)$

We can learn these pmf's by forming the log-likelihood, and then performing a constrained optimisation of the resulting function using the method of Lagrange multipliers

This results in:

$\theta_{ijk,\text{MLE}} = \dfrac{n_{ijk}}{n_k}$

Where $n_{ijk}$ is equal to the number of training points for which $(X_i = x_{ij} \wedge Y = k)$
Where $n_k$ is equal to the number of training points for which $Y = k$
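
A minimal count-based sketch of these estimators; the discrete toy data below is invented for illustration:

```python
import numpy as np

X = np.array([[0, 2],        # two discrete features
              [1, 2],
              [0, 1],
              [1, 0],
              [0, 0]])
y = np.array([1, 1, 1, 0, 0])

def fit_categorical_nb(X, y, n_values):
    n, m = X.shape
    theta = {}                                    # theta[(i, k)][j] = p(X_i = x_ij | Y = k)
    for k in np.unique(y):
        Xk = X[y == k]                            # the n_k points with Y = k
        for i in range(m):
            counts = np.bincount(Xk[:, i], minlength=n_values[i])   # n_ijk
            theta[(i, k)] = counts / counts.sum()                   # n_ijk / n_k
    return theta

theta = fit_categorical_nb(X, y, n_values=[2, 3])
print(theta[(1, 1)])   # second feature given Y = 1 -> [0, 1/3, 2/3]
```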

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Recap

Representation
$\mathcal{F} = \left\{ f_{\theta_y, \{\theta_{ijk}\}}(\mathbf{x}) = \operatorname*{argmax}_{y \in \{0,1\}} p_Y(y) \prod_{i=1}^{m} p_{X_i}(x_i|y) \;\middle|\; p_Y(y=1) = \theta_y,\; p_{X_i}(x_{ij}|k) = \theta_{ijk} \right\}_{i=1,\, j=1,\, k=0}^{m,\, m_i,\, 1}$

Evaluation
$\ln(L(\theta_y))$ and $\big\langle \ln(L(\Theta_{ik})) \big\rangle_{i=1,\, k=0}^{m,\, 1}$

Optimisation
$\theta_{y,\text{MLE}} = \dfrac{n_1}{n}$ and $\left\langle \theta_{ijk,\text{MLE}} = \dfrac{n_{ijk}}{n_k} \right\rangle_{i=1,\, j=1,\, k=0}^{m,\, m_i,\, 1}$

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Problem: Overfitting

We must take care when the training data contains no instances that satisfy $X_i = x_{ij}$

If this occurs then the resulting parameter, $\theta_{ijk}$, would be zero, and for any data point for which $X_i = x_{ij}$, regardless of the state of Y, then:

$p_{X_i}(x_{ij}|k) = 0 \quad \forall k$

$\implies p_X(\mathbf{x}|k) = \prod_{i=1}^{m} p_{X_i}(x_i|k) = 0 \quad \forall k$

$\implies p_Y(k|\mathbf{x}) = \dfrac{p_X(\mathbf{x}|k)\,p_Y(k)}{\sum_{\tilde{k}} p_X(\mathbf{x}|\tilde{k})\,p_Y(\tilde{k})} = \dfrac{0}{0} \quad \forall k$

And we have a problem!

This is an example of overfitting:
statistically speaking it's a bad idea to estimate the probability of an event to be zero just because we have never observed it

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Solution: Additive Smoothing


We remedy the problem by adjusting our MLE estimates such that:

$\theta_{ijk} = \dfrac{n_{ijk} + \alpha}{n_k + \alpha J}$

$\theta_y = \dfrac{n_1 + \alpha}{n + \alpha K}$

Where:
J = # of distinct values outcomes of $X_i$ can take ($m_i$ in this case)
K = # of distinct values outcomes of Y can take (2 in this case)
$\alpha$ indicates the strength of smoothing

Additive smoothing is like adding instances uniformly to the data
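
A small sketch of the effect of smoothing on a single feature-class pair, using invented counts:

```python
import numpy as np

alpha = 1.0
J = 3                                    # number of distinct values feature i can take
counts = np.array([0, 1, 2])             # n_ijk for j = 0, 1, 2 within class k
n_k = counts.sum()

theta_mle = counts / n_k                               # contains a hard zero
theta_smoothed = (counts + alpha) / (n_k + alpha * J)
print(theta_mle)        # approx [0, 0.33, 0.67]
print(theta_smoothed)   # approx [0.17, 0.33, 0.5] -- no zero probabilities
```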

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Solution: Additive Smoothing

Where does additive smoothing come from?

It is a form of regularisation that emerges most naturally from the Bayesian approach:
We treat $\theta_{ijk}$ and $\theta_y$ as random variables and place prior distributions over them which correspond to the belief that they are both non-zero

If we choose symmetric Dirichlet distributions for each of these priors then the expectations of $\theta_{ijk}$ and $\theta_y$ with respect to their posterior distributions yield the additive smoothing estimators

Note these are distinct from the MAP estimators

Naı̈ve Bayes / Gaussian Naı̈ve Bayes

Gaussian Naı̈ve Bayes

Now let us assume that the features, $x_i$, are continuous-valued, such that $x_i \in \mathbb{R}$

We assume further that $(X_i \,|\, y = k)$ is a Gaussian random variable

Then we attempt to learn the pmf for $p_Y(y)$ and the pdf for $p_{X_i}(x_i|y)$ in a frequentist setting using MLE

The inference problem for $p_Y(y)$ remains the same

Naı̈ve Bayes / Gaussian Naı̈ve Bayes

Evaluation: $p_{X_i}(x_i|y)$

$(x_i \,|\, y = k) \sim \mathcal{N}(\mu_{ik}, \sigma_{ik}^2)$

$p_{X_i}(x_i \,|\, k; \mu_{ik}, \sigma_{ik}) = \dfrac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\left(-\dfrac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right)$

(Figure: contours of $p_X(\mathbf{x}|y=0)$ and $p_X(\mathbf{x}|y=1)$ in the $(x_1, x_2)$ plane, centred at $(\mu_{10}, \mu_{20})$ and $(\mu_{11}, \mu_{21})$ respectively)
Naı̈ve Bayes / Gaussian Naı̈ve Bayes

Evaluation: $p_{X_i}(x_i|y)$

Given $\{x_i^{(j)}\}_{j=1}^{n_k}$ samples drawn from the Normal distribution, and for which $y = k$, the log-likelihood is given by:

$\ln(L(\mu_{ik}, \sigma_{ik})) = \ln\left(\prod_{j=1}^{n_k} \dfrac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\left(-\dfrac{(x_i^{(j)} - \mu_{ik})^2}{2\sigma_{ik}^2}\right)\right)$

$= -n_k \ln\sigma_{ik} - \sum_{j=1}^{n_k} \dfrac{(x_i^{(j)} - \mu_{ik})^2}{2\sigma_{ik}^2} + \text{const.}$

Naı̈ve Bayes / Gaussian Naı̈ve Bayes

Optimisation: $p_{X_i}(x_i|y)$

We can optimise the log-likelihood as we did in the Statistics Lecture to give:

$\mu_{ik,\text{MLE}} = \dfrac{1}{n_k} \sum_{j=1}^{n_k} x_i^{(j)}$

$\sigma_{ik,\text{MLE}}^2 = \dfrac{1}{n_k} \sum_{j=1}^{n_k} \left(x_i^{(j)} - \mu_{ik}\right)^2$
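
A minimal sketch of these estimators and the resulting classifier on synthetic data (all values invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),    # class 0
               rng.normal([2.0, 3.0], 1.0, size=(50, 2))])   # class 1
y = np.array([0] * 50 + [1] * 50)

mu = {k: X[y == k].mean(axis=0) for k in (0, 1)}     # mu_ik,MLE
var = {k: X[y == k].var(axis=0) for k in (0, 1)}     # sigma^2_ik,MLE (no Bessel correction)
prior = {k: np.mean(y == k) for k in (0, 1)}         # class priors

def log_joint(x, k):
    # log p_Y(k) + sum_i log N(x_i; mu_ik, sigma^2_ik)
    return np.log(prior[k]) + np.sum(
        -0.5 * np.log(2 * np.pi * var[k]) - (x - mu[k]) ** 2 / (2 * var[k]))

x_new = np.array([1.8, 2.5])
print(max((0, 1), key=lambda k: log_joint(x_new, k)))   # -> 1
```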

Naı̈ve Bayes / Gaussian Naı̈ve Bayes

Recap

Representation
$\mathcal{F} = \left\{ f_{\theta_y, \{\mu_{ik}, \sigma_{ik}\}}(\mathbf{x}) = \operatorname*{argmax}_{y \in \{0,1\}} p_Y(y) \prod_{i=1}^{m} \mathcal{N}(x_i; \mu_{ik}, \sigma_{ik}) \;\middle|\; p_Y(y=1) = \theta_y,\; \mu_{ik} \in \mathbb{R},\; \sigma_{ik} > 0 \right\}_{i=1,\, k=0}^{m,\, 1}$

Evaluation
$\ln(L(\theta_y))$ and $\big\langle \ln(L(\mu_{ik}, \sigma_{ik})) \big\rangle_{i=1,\, k=0}^{m,\, 1}$

Optimisation
$\theta_{y,\text{MLE}} = \dfrac{n_1}{n}$ and $\left\langle \mu_{ik,\text{MLE}} = \dfrac{1}{n_k}\sum_{j=1}^{n_k} x_i^{(j)},\; \sigma_{ik,\text{MLE}}^2 = \dfrac{1}{n_k}\sum_{j=1}^{n_k} \left(x_i^{(j)} - \mu_{ik}\right)^2 \right\rangle_{i=1,\, k=0}^{m,\, 1}$

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Gaussian Naı̈ve Bayes & Logistic Regression

While we have already motivated Logistic Regression, we may examine it further and view it as a sort of generalisation of the GNB algorithm

This gives an insight into the tradeoff between Generative and Discriminative methods more generally

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Bayes Rule Revisited

$p_Y(y=1|\mathbf{x}) = \dfrac{p_Y(y=1)\,p_X(\mathbf{x}|y=1)}{p_Y(y=1)\,p_X(\mathbf{x}|y=1) + p_Y(y=0)\,p_X(\mathbf{x}|y=0)}$

$= \dfrac{1}{1 + \dfrac{p_Y(y=0)\,p_X(\mathbf{x}|y=0)}{p_Y(y=1)\,p_X(\mathbf{x}|y=1)}}$

$= \dfrac{1}{1 + \exp\left(\ln\left(\dfrac{p_Y(y=0)\,p_X(\mathbf{x}|y=0)}{p_Y(y=1)\,p_X(\mathbf{x}|y=1)}\right)\right)}$

$= \dfrac{1}{1 + \exp\left(\ln\left(\dfrac{p_Y(y=0)}{p_Y(y=1)}\right) + \ln\left(\dfrac{p_X(\mathbf{x}|y=0)}{p_X(\mathbf{x}|y=1)}\right)\right)}$

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Logistic Regression

At this point the Logistic Regression assumption is to make the following modelling choice:

$\ln\left(\dfrac{p_Y(y=0)}{p_Y(y=1)}\right) + \ln\left(\dfrac{p_X(\mathbf{x}|y=0)}{p_X(\mathbf{x}|y=1)}\right) = -\mathbf{w} \cdot \mathbf{x}$

And recall that $\mathbf{w} \cdot \mathbf{x} = 0$ defines a linear discriminant because:

$\mathbf{w} \cdot \mathbf{x} > 0 \implies p_Y(y=1|\mathbf{x}) > 0.5 \implies f_{\mathbf{w}}(\mathbf{x}) = 1$
$\mathbf{w} \cdot \mathbf{x} < 0 \implies p_Y(y=1|\mathbf{x}) < 0.5 \implies f_{\mathbf{w}}(\mathbf{x}) = 0$
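
A tiny numerical sketch of this equivalence, with invented weights ($x_0 = 1$ is the constant feature):

```python
import numpy as np

w = np.array([-1.0, 2.0, 0.5])           # invented weights; w_0 acts as the bias
x = np.array([1.0, 0.8, 0.4])            # [1, x_1, x_2]

posterior = 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # p_Y(y=1|x) under the LR model
print(posterior)                          # ~0.69 > 0.5 ...
print(int(np.dot(w, x) > 0))              # ... and indeed w.x = 0.8 > 0, so f_w(x) = 1
```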

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Naı̈ve Bayes

On the other hand, Naïve Bayes gives us:

$\ln\left(\dfrac{p_Y(y=0)}{p_Y(y=1)}\right) = \ln\left(\dfrac{1 - \theta_y}{\theta_y}\right)$

And using the conditional independence assumption:

$\ln\left(\dfrac{p_X(\mathbf{x}|y=0)}{p_X(\mathbf{x}|y=1)}\right) = \sum_{i=1}^{m} \ln\left(\dfrac{p_{X_i}(x_i|y=0)}{p_{X_i}(x_i|y=1)}\right)$

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Gaussian Naı̈ve Bayes


Furthermore, the Gaussianity assumption, and an assumption that $\sigma_{ik} = \sigma_i$, implies:

$\ln\left(\dfrac{p_X(\mathbf{x}|y=0)}{p_X(\mathbf{x}|y=1)}\right) = \sum_{i=1}^{m} \ln\left(\dfrac{\frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left(-\frac{(x_i - \mu_{i0})^2}{2\sigma_i^2}\right)}{\frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left(-\frac{(x_i - \mu_{i1})^2}{2\sigma_i^2}\right)}\right)$

$= \sum_{i=1}^{m} \ln\left(\exp\left(-\dfrac{(x_i - \mu_{i0})^2}{2\sigma_i^2} + \dfrac{(x_i - \mu_{i1})^2}{2\sigma_i^2}\right)\right)$

$= \sum_{i=1}^{m} \dfrac{-x_i^2 - \mu_{i0}^2 + 2x_i\mu_{i0} + x_i^2 + \mu_{i1}^2 - 2x_i\mu_{i1}}{2\sigma_i^2}$

$= \sum_{i=1}^{m} \left( x_i\,\dfrac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} + \dfrac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right)$

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Gaussian Naı̈ve Bayes


But this means that:

$\ln\left(\dfrac{p_Y(y=0)}{p_Y(y=1)}\right) + \ln\left(\dfrac{p_X(\mathbf{x}|y=0)}{p_X(\mathbf{x}|y=1)}\right) = \ln\left(\dfrac{1 - \theta_y}{\theta_y}\right) + \sum_{i=1}^{m} \left( x_i\,\dfrac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} + \dfrac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right) = w_0 + \sum_{i=1}^{m} w_i x_i$

Where:

$w_0 = \ln\left(\dfrac{1 - \theta_y}{\theta_y}\right) + \sum_{i=1}^{m} \dfrac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}$

$w_i = \dfrac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}$

In other words, we have demonstrated that GNB (with non-class-contingent variances) also results in a linear discriminant
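
A minimal numerical check of this identity, using invented GNB parameters with shared variances: the posterior computed directly from the generative model matches the sigmoid of the linear form above:

```python
import numpy as np

theta_y = 0.5                                            # p_Y(y = 1)
mu0, mu1 = np.array([0.0, 1.0]), np.array([2.0, 3.0])    # mu_i0, mu_i1
sigma2 = np.array([1.0, 4.0])                            # shared variances sigma_i^2

w0 = np.log((1 - theta_y) / theta_y) + np.sum((mu1**2 - mu0**2) / (2 * sigma2))
w = (mu0 - mu1) / sigma2                                 # w_i as defined above

def gnb_posterior(x):
    # p_Y(y=1|x) computed directly from the prior and the Gaussian likelihoods
    def log_joint(mu, prior):
        return np.log(prior) + np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                                      - (x - mu) ** 2 / (2 * sigma2))
    return 1.0 / (1.0 + np.exp(log_joint(mu0, 1 - theta_y) - log_joint(mu1, theta_y)))

def linear_posterior(x):
    # p_Y(y=1|x) = 1 / (1 + exp(w_0 + w.x)) with the weights derived above
    return 1.0 / (1.0 + np.exp(w0 + np.dot(w, x)))

x = np.array([1.5, 2.0])
print(gnb_posterior(x), linear_posterior(x))   # the two agree (~0.731)
```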

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Gaussian Naı̈ve Bayes & Logistic Regression Compared

GNB and LR both result in a similar form of linear discriminant classifier

However, they do not result in the same classifier:
$\mathbf{w}$ will be different for each method
GNB makes different model assumptions to LR

GNB and LR are different representations, with different restrictions on how the parameters are set:
LR places fewer restrictions on these parameters
GNB explicitly defines a dependence between $w_0$ and $w_i$

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Gaussian Naı̈ve Bayes & Logistic Regression Compared


Recall that generative classifiers require us to learn more parameters than discriminative classifiers, all other things being equal

But the generative approach is often intractable, so we are forced to make model assumptions...

...This means that a discriminative method such as LR makes fewer model assumptions than a generative one such as NB...

...This means that LR is considered more robust and less sensitive to modelling choices than NB...

However, because of these parameter restrictions, NB (if its modelling assumptions are good) requires less data than LR for a similar level of convergence in parameter estimates
Summary

1 The Generative approach to classification is demanding and sometimes intractable. It prompts us to make simplifying assumptions

2 The Naïve Bayes algorithm flows from the conditional independence assumption.
Then depending on the form which we assume for the class-contingent distribution we are led to different NB algorithms

3 Both Logistic Regression and Gaussian NB lead to similar forms of linear classifier, but may result in very different discriminant boundaries

In the next lecture we will return to more theoretical considerations and discuss the problem of Model Selection
