

Machine Learning

Generative Classification & Naïve Bayes

Dariush Hosseini

dariush.hosseini@ucl.ac.uk
Department of Computer Science
University College London

Lecture Overview

1 Lecture Overview

2 Generative Classification - Recap

3 Naïve Bayes
Categorical Naïve Bayes
Gaussian Naïve Bayes
Gaussian Naïve Bayes & Logistic Regression

4 Summary

Lecture Overview

By the end of this lecture you should:

1 Understand the Naïve Bayes algorithm and its motivation as a Generative approach to the classification problem

2 Understand the discrete and continuous versions of the Naïve Bayes algorithm

3 Understand the relationship between Gaussian Naïve Bayes and Logistic Regression

Generative Classification - Recap

Notation

Inputs
$\mathbf{x} = [1, x_1, \dots, x_m]^T \in \mathbb{R}^{m+1}$

Binary Outputs
$y \in \{0, 1\}$

Training Data
$S = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$

Data-Generating Distribution, $\mathcal{D}$
$S \sim \mathcal{D}$

Generative Classification - Recap

Probabilistic Environment

We assume:

x is the outcome of a random variable X

y is the outcome of a random variable Y

$(\mathbf{x}, y)$ are drawn i.i.d. from some data-generating distribution, $\mathcal{D}$, i.e.:

$(\mathbf{x}, y) \sim \mathcal{D}$

and:

$S \sim \mathcal{D}^n$

Generative Classification - Recap

Learning Problem
Representation
$f \in \mathcal{F}$

Evaluation
Loss Measure:
$E(f(\mathbf{x}), y) = \mathbb{I}[y \neq f(\mathbf{x})]$
Generalisation Loss:
$\mathcal{L}(E, \mathcal{D}, f) = \mathbb{E}_{\mathcal{D}}\big[\mathbb{I}[Y \neq f(\mathbf{X})]\big]$

Where $\mathcal{D}$ is characterised by $p_{X,Y}(\mathbf{x}, y) = p_Y(y|\mathbf{x})\,p_X(\mathbf{x})$ for some pmf, $p_Y(\cdot|\cdot)$, and some pdf, $p_X(\cdot)$

Optimisation
$f^* = \operatorname*{argmin}_{f \in \mathcal{F}} \mathbb{E}_{\mathcal{D}}\big[\mathbb{I}[Y \neq f(\mathbf{X})]\big]$

Generative Classification - Recap

Bayes Optimal Classifier

So the generalisation minimiser for the Misclassification Loss can be specified entirely in terms of the posterior distribution:

$f^*(\mathbf{x}) = \begin{cases} 1 & \text{if } p_Y(y=1|\mathbf{x}) > 0.5 \\ 0 & \text{if } p_Y(y=1|\mathbf{x}) < 0.5 \end{cases}$

It is known as the Bayes Optimal Classifier

Generative Classification - Recap

Probabilistic Classifier

In probabilistic classification we use this expression for the Bayes Optimal Classifier in order to re-cast the classification problem as an inference problem in which we must learn $p_Y(y=1|\mathbf{x})$

Here $p_Y(y=1|\mathbf{x})$ characterises an inhomogeneous Bernoulli distribution

Generative Classification - Recap

Generative Classification
In Generative Classification we seek to learn $p_Y(y=1|\mathbf{x})$ indirectly

First we re-express the Bayes Optimal Classifier as follows, without loss of generality:

$f^*(\mathbf{x}) = \operatorname*{argmax}_{y \in \{0,1\}} p_Y(y|\mathbf{x})$

$= \operatorname*{argmax}_{y \in \{0,1\}} \dfrac{p_X(\mathbf{x}|y)\,p_Y(y)}{\sum_{y' \in \{0,1\}} p_X(\mathbf{x}|y')\,p_Y(y')}$   [Bayes' Theorem]

$= \operatorname*{argmax}_{y \in \{0,1\}} p_X(\mathbf{x}|y)\,p_Y(y)$   [Denominator doesn't depend on $y$]

Then we seek to infer the likelihood $p_X(\mathbf{x}|y)$ and the prior $p_Y(y)$ for each class separately
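
A minimal Python sketch of this decision rule, using a made-up prior and a placeholder likelihood purely for illustration:

```python
import numpy as np

# Toy generative classifier: assume the prior p_Y(y) and the class-conditional
# likelihood p_X(x|y) have already been inferred. All numbers are invented.
prior = {0: 0.6, 1: 0.4}
means = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 2.0])}

def likelihood(x, y):
    # Placeholder class-conditional density (an unnormalised Gaussian bump).
    return np.exp(-0.5 * np.sum((x - means[y]) ** 2))

def f_star(x):
    # f*(x) = argmax_y p_X(x|y) p_Y(y); the evidence term has cancelled
    return max((0, 1), key=lambda y: likelihood(x, y) * prior[y])

print(f_star(np.array([1.8, 2.1])))   # -> 1
```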

Generative Classification - Recap

Inference Problem
Inferring $p_Y(y)$ is straightforward
In binary classification there is only one parameter to learn

Inferring $p_X(\mathbf{x}|y)$ is more difficult

For example: consider $\mathbf{x}$ which is a vector of boolean attributes
For each possible value $\mathbf{x} = \hat{\mathbf{x}}$ and $y = \hat{y}$ we must learn a probability, $p_X(\hat{\mathbf{x}}|\hat{y})$
For each value of $\hat{y}$ there are $2^m$ possible values of $\hat{\mathbf{x}}$
$2^m - 1$ parameters must be inferred for each output class
And $2(2^m - 1)$ parameters must be inferred altogether
This is intractable (a quick count is sketched below)
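
For illustration only, counting in Python how fast the full joint parameterisation grows with m:

```python
# 2^m - 1 free probabilities per class, so 2 * (2^m - 1) in total for binary y
for m in (5, 10, 20, 30):
    print(m, 2 * (2 ** m - 1))
# 5 62
# 10 2046
# 20 2097150
# 30 2147483646
```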

Generative Classification - Recap

Example: Document Topic Classification

Outcomes, y, of a random variable, Y, characterise a set of topics

Outcomes, $\mathbf{x}$, of a random variable, $\mathbf{X}$, characterise a particular document according to the bag-of-words representation

Here word order doesn't matter; instead $\mathbf{x}$ is a vector whose elements are boolean, each of which indicates the presence or absence of a particular dictionary word in the document

A dictionary is the set of words

Generative Classification - Recap

Example: Document Topic Classification


So, for example:

$\mathbf{x}^{(i)} = \begin{bmatrix} \text{`aardvark'}: & x_1^{(i)} = 1 \\ & \vdots \\ \text{`zyme'}: & x_m^{(i)} = 0 \end{bmatrix}$

But a dictionary contains $\sim 10{,}000$ words

So $m \approx 10{,}000$, and we need to infer $\sim 2^{10{,}000}$ parameters to characterise the likelihood!

We need a simplifying assumption...
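
A minimal sketch of the bag-of-words encoding; the five-word dictionary and the document below are invented for illustration (a real dictionary would have ~10,000 entries):

```python
dictionary = ["aardvark", "ballot", "election", "trump", "zyme"]

def to_binary_features(document):
    # x_i = 1 if dictionary word i appears in the document, else 0
    words = set(document.lower().split())
    return [1 if w in words else 0 for w in dictionary]

print(to_binary_features("The election ballot was counted"))   # [0, 1, 1, 0, 0]
```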

Naı̈ve Bayes

Conditional Independence: Definition


Given 3 random variables, X, Y, Z, we say that X is conditionally independent of Y given Z iff the probability distribution governing X is independent of the outcomes of Y given the outcomes of Z

So, $\forall\, i, j, k$:

$P\big(x^{(i)}\,|\,y^{(j)}, z^{(k)}\big) = P\big(x^{(i)}\,|\,z^{(k)}\big)$

$\implies P\big(x^{(i)}\,|\,y^{(j)}, z^{(k)}\big)\,P\big(y^{(j)}\,|\,z^{(k)}\big) = P\big(x^{(i)}\,|\,z^{(k)}\big)\,P\big(y^{(j)}\,|\,z^{(k)}\big)$

$\implies P\big(x^{(i)}, y^{(j)}\,|\,z^{(k)}\big) = P\big(x^{(i)}\,|\,z^{(k)}\big)\,P\big(y^{(j)}\,|\,z^{(k)}\big)$

Here $x^{(i)}, y^{(j)}, z^{(k)}$ are outcomes of X, Y, Z respectively

And the notation $P\big(x^{(i)}\,|\,y^{(j)}, z^{(k)}\big)$ is used as a short-hand for $P\big(X = x^{(i)}\,|\,Y = y^{(j)}, Z = z^{(k)}\big)$
Naı̈ve Bayes

Conditional Independence: Example

(Figure: a graphical-model diagram relating H and V)

A is a random variable with outcomes that are children's ages
H is a random variable with outcomes that are children's heights
V is a random variable with outcomes that are the ranges of children's vocabulary

$P(H = h, V = v) \neq P(H = h)\,P(V = v)$

$P(H = h, V = v \,|\, A = a) = P(H = h \,|\, A = a)\,P(V = v \,|\, A = a)$

Naïve Bayes

Recall that each sample, $(\mathbf{x}, y)$, is an outcome of a random variable, $(\mathbf{X}, Y)$
Furthermore: each element of $\mathbf{x}$, $x_i$, is the outcome of a corresponding random variable, $X_i$
Thus: $p_X(\mathbf{x}) = p_{X_1, X_2, \dots, X_m}(x_1, x_2, \dots, x_m)$

Naïve Bayes seeks to simplify the likelihood by assuming that $\{X_i\}_{i=1}^{m}$ are all conditionally independent given Y:

$p_X(\mathbf{x}|y) = \prod_{i=1}^{m} p_{X_i}(x_i|y)$

This is a much simpler representation
So: for our vector of boolean attributes we now need only $2m$ parameters, rather than $2^m - 1$, to characterise the likelihood (the factorised likelihood is sketched below)
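
A minimal sketch of the factorised likelihood for boolean features; the values theta[i, k] = p(X_i = 1 | Y = k) below are invented for illustration:

```python
import numpy as np

# m = 3 boolean features, 2 classes -> only 2m = 6 parameters in total
theta = np.array([[0.9, 0.2],
                  [0.1, 0.7],
                  [0.5, 0.5]])

def log_likelihood(x, k):
    # log p_X(x|y=k) = sum_i log p_{X_i}(x_i|y=k) under the NB assumption
    p = theta[:, k]
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

x = np.array([1, 0, 1])
print(log_likelihood(x, 0), log_likelihood(x, 1))   # class 0 fits this x far better
```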
Naı̈ve Bayes

Example: Document Topic Classification


Consider again our bag-of-words example
Let the topic, y, be 'Politics'
Let $x_i$ correspond to the presence or absence of the word 'Trump'
Let $x_j$ correspond to the presence or absence of the word 'Clinton'

The conditional independence assumption implies that:

$P(x_i = \text{`Trump'} \,|\, x_j = \text{`Clinton'}, y = \text{`Politics'}) = P(x_i = \text{`Trump'} \,|\, y = \text{`Politics'})$

This is quite a strong assumption... surely:

$P(x_i = \text{`Trump'} \,|\, x_j = \text{`Clinton'}, y = \text{`Politics'}) > P(x_i = \text{`Trump'} \,|\, y = \text{`Politics'})$

Despite this, Naïve Bayes often works well as a classifier

Naı̈ve Bayes

Representation

Recall that we seek:

$f^*(\mathbf{x}) = \operatorname*{argmax}_{y \in \{0,1\}} p_X(\mathbf{x}|y)\,p_Y(y)$

$= \operatorname*{argmax}_{y \in \{0,1\}} p_Y(y) \prod_{i=1}^{m} p_{X_i}(x_i|y)$   [By NB assumption]

But how do we learn the parameterisation of the prior and the likelihood?

Let's consider two cases:

Categorical Naïve Bayes → for discrete inputs
Gaussian Naïve Bayes → for continuous inputs

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Categorical Naı̈ve Bayes

Assume that the features, $x_i$, are discrete-valued, and take $m_i$ different values, such that the outcomes of $x_i$ are taken from the set $\{x_{ij}\}_{j=1}^{m_i}$

Let us attempt to learn the pmf's for $p_Y(y)$ and $p_{X_i}(x_i|y)$ in a frequentist setting using MLE

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Evaluation: $p_Y(y)$

Y is a Bernoulli random variable, with outcomes $y \sim \text{Bern}(\theta_y)$, which implies the following log-likelihood function:

$\ln(L(\theta_y)) = \ln\left(\prod_{i=1}^{n} p_Y(y^{(i)}; \theta_y)\right)$

$= \sum_{i=1}^{n} \ln\left(p_Y(y^{(i)}; \theta_y)\right)$

$= \sum_{i=1}^{n} y^{(i)} \ln\theta_y + (1 - y^{(i)}) \ln(1 - \theta_y)$

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Optimisation: $p_Y(y)$

We seek $\theta_{y,\text{MLE}}$ such that:

$\theta_{y,\text{MLE}} = \operatorname*{argmax}_{\theta_y} \sum_{i=1}^{n} y^{(i)} \ln\theta_y + (1 - y^{(i)}) \ln(1 - \theta_y)$

Let's try to find an analytic solution:

$\dfrac{d}{d\theta_y} \ln(L(\theta_y)) = \sum_{i=1}^{n} \dfrac{y^{(i)}}{\theta_y} - \dfrac{(1 - y^{(i)})}{1 - \theta_y}$

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Optimisation: $p_Y(y)$

For stationarity set this equal to zero:

$\sum_{i=1}^{n} \dfrac{y^{(i)}}{\theta_{y,\text{MLE}}} - \dfrac{(1 - y^{(i)})}{1 - \theta_{y,\text{MLE}}} = 0$

$\implies \sum_{i=1}^{n} y^{(i)}(1 - \theta_{y,\text{MLE}}) - (1 - y^{(i)})\,\theta_{y,\text{MLE}} = 0$

$\implies \sum_{i=1}^{n} y^{(i)} = \sum_{i=1}^{n} \theta_{y,\text{MLE}} = n\,\theta_{y,\text{MLE}}$

$\implies \theta_{y,\text{MLE}} = \dfrac{1}{n}\sum_{i=1}^{n} y^{(i)} = \dfrac{n_1}{n}$

Where $n_1$ is equal to the number of training points for which $y = 1$

We can confirm that this stationary point is a maximum by taking the second derivative (the log-likelihood is concave in $\theta_y$)
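
A quick numerical sketch (with invented toy labels) confirming that the sample fraction $n_1/n$ maximises the Bernoulli log-likelihood:

```python
import numpy as np

y = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 0])       # toy labels
theta_mle = y.mean()                                # n_1 / n
print(theta_mle)                                    # 0.4

def loglik(theta):
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

grid = np.linspace(0.01, 0.99, 99)
print(grid[np.argmax([loglik(t) for t in grid])])   # ~0.4, matching the MLE
```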


Naı̈ve Bayes / Categorical Naı̈ve Bayes

Evaluation: $p_{X_i}(x_i|y)$

$(X_i \,|\, y = k)$ is a categorical random variable, which can take the values $\{x_{ij}\}_{j=1}^{m_i}$

We seek to parameterise a different categorical distribution for each $(X_i, y = k)$

What is the categorical distribution?

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Evaluation: $p_{X_i}(x_i|y)$

It is a generalisation of the Bernoulli distribution, where the random variable has more than 2 discrete outcomes (in this case $m_i$):

$(x_i \,|\, y = k) \sim \text{Categorical}(\Theta_{ik})$

$\Theta_{ik}$ has elements $\{\theta_{ijk}\}_{j=1}^{m_i}$, with $\sum_{j=1}^{m_i} \theta_{ijk} = 1$

$p_{X_i}(X_i = x_{ij} \,|\, y = k; \Theta_{ik}) = \theta_{ijk}$

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Evaluation & Optimisation: $p_{X_i}(x_i|y)$

We can learn these pmf's by forming the log-likelihood, and then performing a constrained optimisation of the resulting function using the method of Lagrange multipliers

This results in:

$\theta_{ijk,\text{MLE}} = \dfrac{n_{ijk}}{n_k}$

Where $n_{ijk}$ is equal to the number of training points for which $(X_i = x_{ij} \wedge Y = k)$
Where $n_k$ is equal to the number of training points for which $Y = k$
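
A minimal count-based sketch of these estimators; the discrete toy data below is invented for illustration:

```python
import numpy as np

X = np.array([[0, 2],        # two discrete features
              [1, 2],
              [0, 1],
              [1, 0],
              [0, 0]])
y = np.array([1, 1, 1, 0, 0])

def fit_categorical_nb(X, y, n_values):
    n, m = X.shape
    theta = {}                                    # theta[(i, k)][j] = p(X_i = x_ij | Y = k)
    for k in np.unique(y):
        Xk = X[y == k]                            # the n_k points with Y = k
        for i in range(m):
            counts = np.bincount(Xk[:, i], minlength=n_values[i])   # n_ijk
            theta[(i, k)] = counts / counts.sum()                   # n_ijk / n_k
    return theta

theta = fit_categorical_nb(X, y, n_values=[2, 3])
print(theta[(1, 1)])   # second feature given Y = 1 -> [0, 1/3, 2/3]
```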

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Recap

Representation
$\mathcal{F} = \left\{ f_{\theta_y, \{\theta_{ijk}\}}(\mathbf{x}) = \operatorname*{argmax}_{y \in \{0,1\}} p_Y(y) \prod_{i=1}^{m} p_{X_i}(x_i|y) \;\middle|\; p_Y(y=1) = \theta_y,\; p_{X_i}(x_{ij}|k) = \theta_{ijk} \right\}_{i=1,\, j=1,\, k=0}^{m,\, m_i,\, 1}$

Evaluation
$\ln(L(\theta_y))$ and $\big\langle \ln(L(\Theta_{ik})) \big\rangle_{i=1,\, k=0}^{m,\, 1}$

Optimisation
$\theta_{y,\text{MLE}} = \dfrac{n_1}{n}$ and $\left\langle \theta_{ijk,\text{MLE}} = \dfrac{n_{ijk}}{n_k} \right\rangle_{i=1,\, j=1,\, k=0}^{m,\, m_i,\, 1}$

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Problem: Overfitting

We must take care when the training data contains no instances that satisfy $X_i = x_{ij}$

If this occurs then the resulting parameter, $\theta_{ijk}$, would be zero, and for any data point for which $X_i = x_{ij}$, regardless of the state of Y, then:

$p_{X_i}(x_{ij}|k) = 0 \quad \forall k$

$\implies p_X(\mathbf{x}|k) = \prod_{i=1}^{m} p_{X_i}(x_i|k) = 0 \quad \forall k$

$\implies p_Y(k|\mathbf{x}) = \dfrac{p_X(\mathbf{x}|k)\,p_Y(k)}{\sum_{\tilde{k}} p_X(\mathbf{x}|\tilde{k})\,p_Y(\tilde{k})} = \dfrac{0}{0} \quad \forall k$

And we have a problem!

This is an example of overfitting:
statistically speaking it's a bad idea to estimate the probability of an event to be zero just because we have never observed it

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Solution: Additive Smoothing


We remedy the problem by adjusting our MLE estimates such that:

$\theta_{ijk} = \dfrac{n_{ijk} + \alpha}{n_k + \alpha J}$

$\theta_y = \dfrac{n_1 + \alpha}{n + \alpha K}$

Where:
J = # of distinct values outcomes of $X_i$ can take ($m_i$ in this case)
K = # of distinct values outcomes of Y can take (2 in this case)
$\alpha$ indicates the strength of smoothing

Additive smoothing is like adding instances uniformly to the data
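
A small sketch of the effect of smoothing on a single feature-class pair, using invented counts:

```python
import numpy as np

alpha = 1.0
J = 3                                    # number of distinct values feature i can take
counts = np.array([0, 1, 2])             # n_ijk for j = 0, 1, 2 within class k
n_k = counts.sum()

theta_mle = counts / n_k                               # contains a hard zero
theta_smoothed = (counts + alpha) / (n_k + alpha * J)
print(theta_mle)        # approx [0, 0.33, 0.67]
print(theta_smoothed)   # approx [0.17, 0.33, 0.5] -- no zero probabilities
```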

Naı̈ve Bayes / Categorical Naı̈ve Bayes

Solution: Additive Smoothing

Where does additive smoothing come from?

It is a form of regularisation that emerges most naturally from the Bayesian approach:
We treat $\theta_{ijk}$ and $\theta_y$ as random variables and place prior distributions over them which correspond to the belief that they are both non-zero

If we choose symmetric Dirichlet distributions for each of these priors then the expectations of $\theta_{ijk}$ and $\theta_y$ with respect to their posterior distributions yield the additive smoothing estimators

Note these are distinct from the MAP estimators

Naı̈ve Bayes / Gaussian Naı̈ve Bayes

Gaussian Naı̈ve Bayes

Now let us assume that the features, $x_i$, are continuous-valued, such that $x_i \in \mathbb{R}$

We assume further that $(X_i \,|\, y = k)$ is a Gaussian random variable

Then we attempt to learn the pmf for $p_Y(y)$ and the pdf for $p_{X_i}(x_i|y)$ in a frequentist setting using MLE

The inference problem for $p_Y(y)$ remains the same

Naı̈ve Bayes / Gaussian Naı̈ve Bayes

Evaluation: $p_{X_i}(x_i|y)$

$(x_i \,|\, y = k) \sim \mathcal{N}(\mu_{ik}, \sigma_{ik}^2)$

$p_{X_i}(x_i \,|\, k; \mu_{ik}, \sigma_{ik}) = \dfrac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\left(-\dfrac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right)$

(Figure: contours of $p_X(\mathbf{x}|y=0)$ and $p_X(\mathbf{x}|y=1)$ in the $(x_1, x_2)$ plane, centred at $(\mu_{10}, \mu_{20})$ and $(\mu_{11}, \mu_{21})$ respectively)
Naı̈ve Bayes / Gaussian Naı̈ve Bayes

Evaluation: $p_{X_i}(x_i|y)$

Given $\{x_i^{(j)}\}_{j=1}^{n_k}$ samples drawn from the Normal distribution, and for which $y = k$, the log-likelihood is given by:

$\ln(L(\mu_{ik}, \sigma_{ik})) = \ln\left(\prod_{j=1}^{n_k} \dfrac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\left(-\dfrac{(x_i^{(j)} - \mu_{ik})^2}{2\sigma_{ik}^2}\right)\right)$

$= -n_k \ln\sigma_{ik} - \sum_{j=1}^{n_k} \dfrac{(x_i^{(j)} - \mu_{ik})^2}{2\sigma_{ik}^2} + \text{const.}$

Naı̈ve Bayes / Gaussian Naı̈ve Bayes

Optimisation: $p_{X_i}(x_i|y)$

We can optimise the log-likelihood as we did in the Statistics Lecture to give:

$\mu_{ik,\text{MLE}} = \dfrac{1}{n_k} \sum_{j=1}^{n_k} x_i^{(j)}$

$\sigma_{ik,\text{MLE}}^2 = \dfrac{1}{n_k} \sum_{j=1}^{n_k} \left(x_i^{(j)} - \mu_{ik}\right)^2$
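
A minimal sketch of these estimators and the resulting classifier on synthetic data (all values invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),    # class 0
               rng.normal([2.0, 3.0], 1.0, size=(50, 2))])   # class 1
y = np.array([0] * 50 + [1] * 50)

mu = {k: X[y == k].mean(axis=0) for k in (0, 1)}     # mu_ik,MLE
var = {k: X[y == k].var(axis=0) for k in (0, 1)}     # sigma^2_ik,MLE (no Bessel correction)
prior = {k: np.mean(y == k) for k in (0, 1)}         # class priors

def log_joint(x, k):
    # log p_Y(k) + sum_i log N(x_i; mu_ik, sigma^2_ik)
    return np.log(prior[k]) + np.sum(
        -0.5 * np.log(2 * np.pi * var[k]) - (x - mu[k]) ** 2 / (2 * var[k]))

x_new = np.array([1.8, 2.5])
print(max((0, 1), key=lambda k: log_joint(x_new, k)))   # -> 1
```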

Naı̈ve Bayes / Gaussian Naı̈ve Bayes

Recap

Representation
$\mathcal{F} = \left\{ f_{\theta_y, \{\mu_{ik}, \sigma_{ik}\}}(\mathbf{x}) = \operatorname*{argmax}_{y \in \{0,1\}} p_Y(y) \prod_{i=1}^{m} \mathcal{N}(x_i; \mu_{ik}, \sigma_{ik}) \;\middle|\; p_Y(y=1) = \theta_y,\; \mu_{ik} \in \mathbb{R},\; \sigma_{ik} > 0 \right\}_{i=1,\, k=0}^{m,\, 1}$

Evaluation
$\ln(L(\theta_y))$ and $\big\langle \ln(L(\mu_{ik}, \sigma_{ik})) \big\rangle_{i=1,\, k=0}^{m,\, 1}$

Optimisation
$\theta_{y,\text{MLE}} = \dfrac{n_1}{n}$ and $\left\langle \mu_{ik,\text{MLE}} = \dfrac{1}{n_k}\sum_{j=1}^{n_k} x_i^{(j)},\; \sigma_{ik,\text{MLE}}^2 = \dfrac{1}{n_k}\sum_{j=1}^{n_k} \left(x_i^{(j)} - \mu_{ik}\right)^2 \right\rangle_{i=1,\, k=0}^{m,\, 1}$

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Gaussian Naı̈ve Bayes & Logistic Regression

While we have already motivated Logistic Regression, we may examine it further and view it as a sort of generalisation of the GNB algorithm

This gives an insight into the tradeoff between Generative and Discriminative methods more generally

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Bayes Rule Revisited

$p_Y(y=1|\mathbf{x}) = \dfrac{p_Y(y=1)\,p_X(\mathbf{x}|y=1)}{p_Y(y=1)\,p_X(\mathbf{x}|y=1) + p_Y(y=0)\,p_X(\mathbf{x}|y=0)}$

$= \dfrac{1}{1 + \dfrac{p_Y(y=0)\,p_X(\mathbf{x}|y=0)}{p_Y(y=1)\,p_X(\mathbf{x}|y=1)}}$

$= \dfrac{1}{1 + \exp\left(\ln\left(\dfrac{p_Y(y=0)\,p_X(\mathbf{x}|y=0)}{p_Y(y=1)\,p_X(\mathbf{x}|y=1)}\right)\right)}$

$= \dfrac{1}{1 + \exp\left(\ln\left(\dfrac{p_Y(y=0)}{p_Y(y=1)}\right) + \ln\left(\dfrac{p_X(\mathbf{x}|y=0)}{p_X(\mathbf{x}|y=1)}\right)\right)}$

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Logistic Regression

At this point the Logistic Regression assumption is to make the following modelling choice:

$\ln\left(\dfrac{p_Y(y=0)}{p_Y(y=1)}\right) + \ln\left(\dfrac{p_X(\mathbf{x}|y=0)}{p_X(\mathbf{x}|y=1)}\right) = -\mathbf{w} \cdot \mathbf{x}$

And recall that $\mathbf{w} \cdot \mathbf{x} = 0$ defines a linear discriminant because:

$\mathbf{w} \cdot \mathbf{x} > 0 \implies p_Y(y=1|\mathbf{x}) > 0.5 \implies f_{\mathbf{w}}(\mathbf{x}) = 1$
$\mathbf{w} \cdot \mathbf{x} < 0 \implies p_Y(y=1|\mathbf{x}) < 0.5 \implies f_{\mathbf{w}}(\mathbf{x}) = 0$
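
A tiny numerical sketch of this equivalence, with invented weights ($x_0 = 1$ is the constant feature):

```python
import numpy as np

w = np.array([-1.0, 2.0, 0.5])           # invented weights; w_0 acts as the bias
x = np.array([1.0, 0.8, 0.4])            # [1, x_1, x_2]

posterior = 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # p_Y(y=1|x) under the LR model
print(posterior)                          # ~0.69 > 0.5 ...
print(int(np.dot(w, x) > 0))              # ... and indeed w.x = 0.8 > 0, so f_w(x) = 1
```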

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Naı̈ve Bayes

On the other hand, Naïve Bayes gives us:

$\ln\left(\dfrac{p_Y(y=0)}{p_Y(y=1)}\right) = \ln\left(\dfrac{1 - \theta_y}{\theta_y}\right)$

And using the conditional independence assumption:

$\ln\left(\dfrac{p_X(\mathbf{x}|y=0)}{p_X(\mathbf{x}|y=1)}\right) = \sum_{i=1}^{m} \ln\left(\dfrac{p_{X_i}(x_i|y=0)}{p_{X_i}(x_i|y=1)}\right)$

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Gaussian Naı̈ve Bayes


Furthermore, the Gaussianity assumption, and an assumption that $\sigma_{ik} = \sigma_i$, implies:

$\ln\left(\dfrac{p_X(\mathbf{x}|y=0)}{p_X(\mathbf{x}|y=1)}\right) = \sum_{i=1}^{m} \ln\left(\dfrac{\frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left(-\frac{(x_i - \mu_{i0})^2}{2\sigma_i^2}\right)}{\frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left(-\frac{(x_i - \mu_{i1})^2}{2\sigma_i^2}\right)}\right)$

$= \sum_{i=1}^{m} \ln\left(\exp\left(-\dfrac{(x_i - \mu_{i0})^2}{2\sigma_i^2} + \dfrac{(x_i - \mu_{i1})^2}{2\sigma_i^2}\right)\right)$

$= \sum_{i=1}^{m} \dfrac{-x_i^2 - \mu_{i0}^2 + 2x_i\mu_{i0} + x_i^2 + \mu_{i1}^2 - 2x_i\mu_{i1}}{2\sigma_i^2}$

$= \sum_{i=1}^{m} \left( x_i\,\dfrac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} + \dfrac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right)$

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Gaussian Naı̈ve Bayes


But this means that:

$\ln\left(\dfrac{p_Y(y=0)}{p_Y(y=1)}\right) + \ln\left(\dfrac{p_X(\mathbf{x}|y=0)}{p_X(\mathbf{x}|y=1)}\right) = \ln\left(\dfrac{1 - \theta_y}{\theta_y}\right) + \sum_{i=1}^{m} \left( x_i\,\dfrac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} + \dfrac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right) = w_0 + \sum_{i=1}^{m} w_i x_i$

Where:

$w_0 = \ln\left(\dfrac{1 - \theta_y}{\theta_y}\right) + \sum_{i=1}^{m} \dfrac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}$

$w_i = \dfrac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}$

In other words, we have demonstrated that GNB (with non-class-contingent variances) also results in a linear discriminant
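
A minimal numerical check of this identity, using invented GNB parameters with shared variances: the posterior computed directly from the generative model matches the sigmoid of the linear form above:

```python
import numpy as np

theta_y = 0.5                                            # p_Y(y = 1)
mu0, mu1 = np.array([0.0, 1.0]), np.array([2.0, 3.0])    # mu_i0, mu_i1
sigma2 = np.array([1.0, 4.0])                            # shared variances sigma_i^2

w0 = np.log((1 - theta_y) / theta_y) + np.sum((mu1**2 - mu0**2) / (2 * sigma2))
w = (mu0 - mu1) / sigma2                                 # w_i as defined above

def gnb_posterior(x):
    # p_Y(y=1|x) computed directly from the prior and the Gaussian likelihoods
    def log_joint(mu, prior):
        return np.log(prior) + np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                                      - (x - mu) ** 2 / (2 * sigma2))
    return 1.0 / (1.0 + np.exp(log_joint(mu0, 1 - theta_y) - log_joint(mu1, theta_y)))

def linear_posterior(x):
    # p_Y(y=1|x) = 1 / (1 + exp(w_0 + w.x)) with the weights derived above
    return 1.0 / (1.0 + np.exp(w0 + np.dot(w, x)))

x = np.array([1.5, 2.0])
print(gnb_posterior(x), linear_posterior(x))   # the two agree (~0.731)
```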

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Gaussian Naı̈ve Bayes & Logistic Regression Compared

GNB and LR both result in a similar form of linear discriminant classifier

However, they do not result in the same classifier:
$\mathbf{w}$ will be different for each method
GNB makes different model assumptions to LR

GNB and LR are different representations, with different restrictions on how the parameters are set:
LR places fewer restrictions on these parameters
GNB explicitly defines a dependence between $w_0$ and $w_i$

Naı̈ve Bayes / Gaussian Naı̈ve Bayes & Logistic Regression

Gaussian Naı̈ve Bayes & Logistic Regression Compared


Recall that generative classifiers require us to learn more parameters than discriminative classifiers, all other things being equal

But the generative approach is often intractable, so we are forced to make model assumptions...

...This means that a discriminative method such as LR makes fewer model assumptions than a generative one such as NB...

...This means that LR is considered more robust and less sensitive to modelling choices than NB...

However, because of these parameter restrictions, NB (if its modelling assumptions are good) requires less data than LR for a similar level of convergence in parameter estimates
Summary

1 The Generative approach to classification is demanding and sometimes intractable. It prompts us to make simplifying assumptions

2 The Naïve Bayes algorithm flows from the conditional independence assumption.
Then depending on the form which we assume for the class-contingent distribution we are led to different NB algorithms

3 Both Logistic Regression and Gaussian NB lead to similar forms of linear classifier, but may result in very different discriminant boundaries

In the next lecture we will return to more theoretical considerations and discuss the problem of Model Selection
