
Naive Bayes and Gaussian Bayes Classifier

Mengye Ren
mren@cs.toronto.edu

September 29, 2017

Mengye Ren Naive Bayes and Gaussian Bayes Classifier September 29, 2017 1 / 21
Naive Bayes

Bayes rule:

    p(t \mid x) = \frac{p(x \mid t)\, p(t)}{p(x)}

Naive Bayes assumption:

    p(x \mid t) = \prod_{j=1}^{D} p(x_j \mid t)

Likelihood function:

    L(\theta) = p(x, t \mid \theta) = p(x \mid t, \theta)\, p(t \mid \theta)
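As a quick numeric sketch (all probabilities here are made up for illustration, assuming numpy), the two rules above combine into p(t | x) ∝ p(t) ∏_j p(x_j | t):

```python
import numpy as np

# Toy example (made-up numbers): two classes, three binary features.
# mu[t, j] = p(x_j = 1 | t); prior[t] = p(t).
mu = np.array([[0.1, 0.5, 0.2],   # class 0
               [0.6, 0.5, 0.7]])  # class 1
prior = np.array([0.7, 0.3])

x = np.array([1, 0, 1])  # observed feature vector

# Naive Bayes: p(t | x) proportional to p(t) * prod_j p(x_j | t)
likelihood = np.prod(mu ** x * (1 - mu) ** (1 - x), axis=1)
posterior = prior * likelihood
posterior /= posterior.sum()  # normalize: divide by p(x)
# posterior -> [0.1, 0.9]
```

The normalization step plays the role of p(x) in the denominator of Bayes rule.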
Example: Spam Classification

Each vocabulary word is one feature dimension.

We encode each email as a feature vector x \in \{0, 1\}^{|V|}:
x_j = 1 iff vocabulary word j appears in the email.

We want to model the probability of each word x_j appearing in an email, given that the email is spam or not spam.
Example: "$10,000", "Toronto", "Piazza", etc.

Idea: use a Bernoulli distribution to model p(x_j \mid t).
Example: p(\text{``\$10,000''} \mid \text{spam}) = 0.3
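A minimal sketch of this encoding; the vocabulary and the example email are made up for illustration (a real pipeline would tokenize rather than use substring matching):

```python
# Hypothetical 5-word vocabulary for illustration.
vocab = ["$10,000", "toronto", "piazza", "prize", "assignment"]

def encode(email):
    """Binary feature vector: x_j = 1 iff vocabulary word j appears in the email."""
    text = email.lower()
    return [1 if w in text else 0 for w in vocab]

x = encode("Claim your $10,000 prize now")  # -> [1, 0, 0, 1, 0]
```

Note that x records only presence/absence, not word counts — that is what makes the Bernoulli model below appropriate.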
Bernoulli Naive Bayes

Assume all data points x^{(i)} are i.i.d. samples, and that p(x_j \mid t) follows a Bernoulli distribution with parameter \mu_{jt}:

    p(x^{(i)} \mid t^{(i)}) = \prod_{j=1}^{D} \mu_{jt^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{jt^{(i)}}\right)^{1 - x_j^{(i)}}

    p(t \mid x) \propto \prod_{i=1}^{N} p(t^{(i)})\, p(x^{(i)} \mid t^{(i)}) = \prod_{i=1}^{N} p(t^{(i)}) \prod_{j=1}^{D} \mu_{jt^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{jt^{(i)}}\right)^{1 - x_j^{(i)}}

where p(t) = \pi_t. The parameters \pi_t and \mu_{jt} can be learned by maximum likelihood.
Derivation of maximum likelihood estimator (MLE)

    \theta = [\mu, \pi]

    \log L(\theta) = \log p(x, t \mid \theta)
    = \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} + \sum_{j=1}^{D} \left( x_j^{(i)} \log \mu_{jt^{(i)}} + \left(1 - x_j^{(i)}\right) \log\left(1 - \mu_{jt^{(i)}}\right) \right) \right]

Want: \arg\max_\theta \log L(\theta) subject to \sum_k \pi_k = 1
Derivation of maximum likelihood estimator (MLE)
Take the derivative w.r.t. \mu:

    \frac{\partial \log L(\theta)}{\partial \mu_{jk}} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) \left( \frac{x_j^{(i)}}{\mu_{jk}} - \frac{1 - x_j^{(i)}}{1 - \mu_{jk}} \right) = 0

    \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) \left[ x_j^{(i)} \left(1 - \mu_{jk}\right) - \left(1 - x_j^{(i)}\right) \mu_{jk} \right] = 0

    \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) \mu_{jk} = \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) x_j^{(i)}

    \mu_{jk} = \frac{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) x_j^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right)}
Derivation of maximum likelihood estimator (MLE)

Use a Lagrange multiplier to derive \pi:

    \frac{\partial \log L(\theta)}{\partial \pi_k} + \frac{\partial}{\partial \pi_k}\, \lambda \left(1 - \sum_k \pi_k\right) = 0 \;\Rightarrow\; \lambda = \frac{1}{\pi_k} \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right)

    \pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right)}{\lambda}

Applying the constraint \sum_k \pi_k = 1 gives \lambda = N, so

    \pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right)}{N}
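The two estimators derived above (\mu_{jk} and \pi_k) are just per-class counts. A minimal sketch on a made-up dataset (X and t are illustrative, not from the lecture):

```python
import numpy as np

# X: N x D binary features, t: N class labels in {0, ..., K-1}. Made-up data.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0]])
t = np.array([1, 1, 0, 0])
K = 2

N, D = X.shape
pi = np.zeros(K)
mu = np.zeros((K, D))
for k in range(K):
    mask = (t == k)                            # the indicator 1(t^(i) = k)
    pi[k] = mask.sum() / N                     # pi_k = N_k / N
    mu[k] = X[mask].sum(axis=0) / mask.sum()   # mu_jk = class-k count of x_j / N_k
```

One practical caveat: if a word never occurs (or always occurs) in a class, the MLE gives \mu_{jk} = 0 or 1, which zeroes out the likelihood of unseen emails; add-one (Laplace) smoothing of the counts is a common fix, though it is not part of the derivation above.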
Spam Classification Demo

Gaussian Bayes Classifier

Instead of assuming conditional independence of the x_j, we model p(x \mid t) as a Gaussian distribution; the dependence among the x_j is encoded in the covariance matrix.

Multivariate Gaussian distribution:

    f(x) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma)}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

\mu: mean, \Sigma: covariance matrix, D: \dim(x)
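The density above can be evaluated directly with numpy; a sketch (the sanity check against the 1-D standard normal, f(0) = 1/\sqrt{2\pi}, is my addition):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density f(x) as defined above."""
    D = len(mu)
    diff = x - mu
    Z = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    # solve() applies Sigma^{-1} without forming the explicit inverse
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / Z

# Sanity check: 1-D standard normal at x = 0 gives 1/sqrt(2*pi)
val = gaussian_pdf(np.array([0.0]), np.array([0.0]), np.array([[1.0]]))
```

If SciPy is available, `scipy.stats.multivariate_normal` provides the same density; the explicit version is shown here only to mirror the formula.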
Derivation of maximum likelihood estimator (MLE)

    \theta = [\mu, \Sigma, \pi], \qquad Z = \sqrt{(2\pi)^D \det(\Sigma)}

    p(x \mid t) = \frac{1}{Z} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

    \log L(\theta) = \log p(x, t \mid \theta) = \log p(t \mid \theta) + \log p(x \mid t, \theta)

    = \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} - \log Z_{t^{(i)}} - \frac{1}{2} \left(x^{(i)} - \mu_{t^{(i)}}\right)^T \Sigma_{t^{(i)}}^{-1} \left(x^{(i)} - \mu_{t^{(i)}}\right) \right]

Want: \arg\max_\theta \log L(\theta) subject to \sum_k \pi_k = 1
Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. \mu:

    \frac{\partial \log L}{\partial \mu_k} = \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) \Sigma_k^{-1} \left(x^{(i)} - \mu_k\right) = 0

    \mu_k = \frac{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) x^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right)}
Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. \Sigma^{-1} (not \Sigma).

Note:

    \frac{\partial \det(A)}{\partial A} = \det(A)\, A^{-T}

    \det(A^{-1}) = \det(A)^{-1}

    \frac{\partial\, x^T A x}{\partial A} = x x^T

    \Sigma^T = \Sigma

    \frac{\partial \log L}{\partial \Sigma_k^{-1}} = \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) \left[ -\frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} - \frac{1}{2} \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^T \right] = 0
Derivation of maximum likelihood estimator (MLE)
    Z_k = \sqrt{(2\pi)^D \det(\Sigma_k)}

    \frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{Z_k} \frac{\partial Z_k}{\partial \Sigma_k^{-1}} = (2\pi)^{-\frac{D}{2}} \det(\Sigma_k)^{-\frac{1}{2}}\, (2\pi)^{\frac{D}{2}}\, \frac{\partial \det\!\left(\Sigma_k^{-1}\right)^{-\frac{1}{2}}}{\partial \Sigma_k^{-1}}

    = \det(\Sigma_k)^{-\frac{1}{2}} \left( -\frac{1}{2} \right) \det\!\left(\Sigma_k^{-1}\right)^{-\frac{3}{2}} \det\!\left(\Sigma_k^{-1}\right) \Sigma_k^T = -\frac{1}{2} \Sigma_k

    \frac{\partial \log L}{\partial \Sigma_k^{-1}} = \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) \left[ \frac{1}{2} \Sigma_k - \frac{1}{2} \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^T \right] = 0

    \Sigma_k = \frac{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^T}{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right)}
Derivation of maximum likelihood estimator (MLE)

    \pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right)}{N}

(same as in the Bernoulli case)
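Putting the three Gaussian MLE formulas together on a made-up dataset (a sketch, assuming numpy; the data is random and purely illustrative):

```python
import numpy as np

# Made-up data: N = 100 points in 2-D with labels in {0, ..., K-1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = rng.integers(0, 2, size=100)
K = 2

N = len(X)
pi = np.array([(t == k).mean() for k in range(K)])         # pi_k = N_k / N
mu = np.array([X[t == k].mean(axis=0) for k in range(K)])  # class means
Sigma = []
for k in range(K):
    diff = X[t == k] - mu[k]
    # MLE covariance: normalize by the class count N_k (not N_k - 1)
    Sigma.append(diff.T @ diff / (t == k).sum())
```

Each `Sigma[k]` matches `np.cov(X[t == k], rowvar=False, bias=True)`; note the MLE divides by N_k, unlike the unbiased sample covariance.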
Gaussian Bayes Classifier Demo

Gaussian Bayes Classifier

If we constrain \Sigma to be diagonal, we can rewrite p(x \mid t) as a product of the p(x_j \mid t):

    p(x \mid t) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma_t)}} \exp\left( -\frac{1}{2} (x - \mu_t)^T \Sigma_t^{-1} (x - \mu_t) \right)

    = \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi\, \Sigma_{t,jj}}} \exp\left( -\frac{1}{2\, \Sigma_{t,jj}} \left(x_j - \mu_{jt}\right)^2 \right) = \prod_{j=1}^{D} p(x_j \mid t)

A diagonal covariance matrix satisfies the naive Bayes assumption.
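The factorization above can be checked numerically; a sketch with made-up parameter values:

```python
import numpy as np

# With a diagonal covariance, the joint Gaussian density equals the product
# of D univariate Gaussian densities. All values below are made up.
mu = np.array([0.5, -1.0, 2.0])
var = np.array([0.4, 1.5, 0.9])  # diagonal entries Sigma_jj
Sigma = np.diag(var)
x = np.array([0.0, 0.3, 1.7])

D = len(x)
diff = x - mu
joint = (np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)
         / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma)))
product = np.prod(np.exp(-0.5 * (x - mu) ** 2 / var)
                  / np.sqrt(2 * np.pi * var))
# joint and product agree up to floating-point error
```

With any off-diagonal entries in Sigma, the two quantities no longer match — that is exactly the dependence naive Bayes throws away.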
Gaussian Bayes Classifier

Case 1: the covariance matrix is shared among classes,

    p(x \mid t) = \mathcal{N}(x \mid \mu_t, \Sigma)

Case 2: each class has its own covariance,

    p(x \mid t) = \mathcal{N}(x \mid \mu_t, \Sigma_t)
Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is shared between classes, the decision boundary satisfies

    p(t = 1 \mid x) = p(t = 0 \mid x)

    \log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0)

    C + x^T \Sigma^{-1} x - 2 \mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 = x^T \Sigma^{-1} x - 2 \mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0

    2 (\mu_0 - \mu_1)^T \Sigma^{-1} x - \left( \mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1 \right) = C

where C collects the constant terms from the priors. The quadratic terms x^T \Sigma^{-1} x cancel, so

    \Rightarrow\; a^T x - b = 0

The decision boundary is a linear function (a hyperplane in general).
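This linearity can be checked numerically. In the sketch below (all parameter values are made up), g_t(x) = \log \pi_t - \frac{1}{2}(x - \mu_t)^T \Sigma^{-1} (x - \mu_t) is the per-class score, and the difference g_1 - g_0 should equal an affine function a^T x - b:

```python
import numpy as np

# Made-up shared-covariance parameters for a 2-class, 2-D problem.
pi = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0], [2.0, 1.0]])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
Sinv = np.linalg.inv(Sigma)

def g(x, k):
    """Class score: log pi_k - 0.5 (x - mu_k)^T Sigma^{-1} (x - mu_k)."""
    d = x - mu[k]
    return np.log(pi[k]) - 0.5 * d @ Sinv @ d

# Affine coefficients obtained by expanding the quadratics (x^T Sinv x cancels).
a = Sinv @ (mu[1] - mu[0])
b = 0.5 * (mu[1] @ Sinv @ mu[1] - mu[0] @ Sinv @ mu[0]) - np.log(pi[1] / pi[0])

for x in np.random.default_rng(1).normal(size=(5, 2)):
    assert np.isclose(g(x, 1) - g(x, 0), a @ x - b)
```

The boundary g_1(x) = g_0(x) is then exactly the hyperplane a^T x = b.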
Relation to Logistic Regression

We can write the posterior distribution p(t = 0 \mid x) as

    p(t = 0 \mid x) = \frac{p(x, t = 0)}{p(x, t = 0) + p(x, t = 1)} = \frac{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma)}{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma) + \pi_1\, \mathcal{N}(x \mid \mu_1, \Sigma)}

    = \left\{ 1 + \frac{\pi_1}{\pi_0} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right) \right\}^{-1}

    = \left\{ 1 + \exp\left( \log \frac{\pi_1}{\pi_0} + (\mu_1 - \mu_0)^T \Sigma^{-1} x - \frac{1}{2} \left( \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \right) \right) \right\}^{-1}

    = \frac{1}{1 + \exp(-w^T x - b)}

i.e. the posterior has the same sigmoid-of-linear form as logistic regression.
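A numerical check of this identity (parameter values are made up): the posterior computed directly from the two Gaussians should equal the logistic sigmoid of a linear score w^T x + b:

```python
import numpy as np

# Made-up shared-covariance parameters.
pi0, pi1 = 0.3, 0.7
mu0 = np.array([0.0, 1.0])
mu1 = np.array([1.5, -0.5])
Sigma = np.array([[1.0, 0.2], [0.2, 0.8]])
Sinv = np.linalg.inv(Sigma)

def log_gauss(x, mu):
    """Log-density of N(x | mu, Sigma)."""
    d = x - mu
    D = len(x)
    return -0.5 * d @ Sinv @ d - 0.5 * np.log((2 * np.pi) ** D * np.linalg.det(Sigma))

x = np.array([0.4, 0.9])
p0 = np.exp(np.log(pi0) + log_gauss(x, mu0))
p1 = np.exp(np.log(pi1) + log_gauss(x, mu1))
direct = p0 / (p0 + p1)                       # p(t = 0 | x) by Bayes rule

# Linear score read off the derivation above.
w = Sinv @ (mu0 - mu1)
b = np.log(pi0 / pi1) - 0.5 * (mu0 @ Sinv @ mu0 - mu1 @ Sinv @ mu1)
sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # agrees with `direct`
```

Unlike logistic regression, which fits w and b by maximizing the conditional likelihood directly, here they are derived from the generative parameters (\pi, \mu, \Sigma).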
Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is not shared between classes, the decision boundary satisfies

    p(t = 1 \mid x) = p(t = 0 \mid x)

    \log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)

    x^T \left( \Sigma_1^{-1} - \Sigma_0^{-1} \right) x - 2 \left( \mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1} \right) x + \left( \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right) = C

where C collects the constant terms. The quadratic terms no longer cancel, so

    \Rightarrow\; x^T Q x - 2 b^T x + c = 0

The decision boundary is a quadratic function; in the 2-D case it looks like an ellipse, a parabola, or a hyperbola.
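As with the shared-covariance case, this can be checked numerically. In the sketch below (made-up parameters), the per-class score g_t(x) includes the class-specific normalizer -\frac{1}{2}\log\det\Sigma_t, and the score difference should match a quadratic -\frac{1}{2} x^T Q x + b^T x + c:

```python
import numpy as np

# Made-up per-class parameters for a 2-class, 2-D problem.
pi = np.array([0.5, 0.5])
mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sig = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
Sinv = [np.linalg.inv(S) for S in Sig]

def g(x, k):
    """log pi_k + log N(x | mu_k, Sigma_k), dropping the shared (2*pi)^D term."""
    d = x - mu[k]
    return (np.log(pi[k]) - 0.5 * d @ Sinv[k] @ d
            - 0.5 * np.log(np.linalg.det(Sig[k])))

# Quadratic form obtained by expanding the two quadratics.
Q = Sinv[1] - Sinv[0]
b = Sinv[1] @ mu[1] - Sinv[0] @ mu[0]
c = (np.log(pi[1] / pi[0])
     - 0.5 * (mu[1] @ Sinv[1] @ mu[1] - mu[0] @ Sinv[0] @ mu[0])
     - 0.5 * np.log(np.linalg.det(Sig[1]) / np.linalg.det(Sig[0])))

for x in np.random.default_rng(2).normal(size=(5, 2)):
    assert np.isclose(g(x, 1) - g(x, 0), -0.5 * x @ Q @ x + b @ x + c)
```

When Sig[0] == Sig[1], Q vanishes and the boundary collapses back to the hyperplane of the shared-covariance case.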
Thanks!
