
Naive Bayes and Gaussian Bayes Classifier

Mengye Ren
mren@cs.toronto.edu

September 29, 2017

Mengye Ren Naive Bayes and Gaussian Bayes Classifier September 29, 2017 1 / 21
Naive Bayes

Bayes rule:

    p(t \mid x) = \frac{p(x \mid t)\, p(t)}{p(x)}

Naive Bayes assumption:

    p(x \mid t) = \prod_{j=1}^{D} p(x_j \mid t)

Likelihood function:

    L(\theta) = p(x, t \mid \theta) = p(x \mid t, \theta)\, p(t \mid \theta)
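As a quick numeric sketch (all probabilities here are made up for illustration, assuming numpy), the two rules above combine into p(t | x) ∝ p(t) ∏_j p(x_j | t):

```python
import numpy as np

# Toy example (made-up numbers): two classes, three binary features.
# mu[t, j] = p(x_j = 1 | t); prior[t] = p(t).
mu = np.array([[0.1, 0.5, 0.2],   # class 0
               [0.6, 0.5, 0.7]])  # class 1
prior = np.array([0.7, 0.3])

x = np.array([1, 0, 1])  # observed feature vector

# Naive Bayes: p(t | x) proportional to p(t) * prod_j p(x_j | t)
likelihood = np.prod(mu ** x * (1 - mu) ** (1 - x), axis=1)
posterior = prior * likelihood
posterior /= posterior.sum()  # normalize: divide by p(x)
# posterior -> [0.1, 0.9]
```

The normalization step plays the role of p(x) in the denominator of Bayes rule.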
Example: Spam Classification

Each vocabulary word is one feature dimension.

We encode each email as a feature vector x \in \{0, 1\}^{|V|}:
x_j = 1 iff vocabulary word j appears in the email.

We want to model the probability of each word x_j appearing in an email, given that the email is spam or not spam.
Example: "$10,000", "Toronto", "Piazza", etc.

Idea: use a Bernoulli distribution to model p(x_j \mid t).
Example: p(\text{``\$10,000''} \mid \text{spam}) = 0.3
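A minimal sketch of this encoding; the vocabulary and the example email are made up for illustration (a real pipeline would tokenize rather than use substring matching):

```python
# Hypothetical 5-word vocabulary for illustration.
vocab = ["$10,000", "toronto", "piazza", "prize", "assignment"]

def encode(email):
    """Binary feature vector: x_j = 1 iff vocabulary word j appears in the email."""
    text = email.lower()
    return [1 if w in text else 0 for w in vocab]

x = encode("Claim your $10,000 prize now")  # -> [1, 0, 0, 1, 0]
```

Note that x records only presence/absence, not word counts — that is what makes the Bernoulli model below appropriate.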
Bernoulli Naive Bayes

Assume all data points x^{(i)} are i.i.d. samples, and that p(x_j \mid t) follows a Bernoulli distribution with parameter \mu_{jt}:

    p(x^{(i)} \mid t^{(i)}) = \prod_{j=1}^{D} \mu_{jt^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{jt^{(i)}}\right)^{1 - x_j^{(i)}}

    p(t \mid x) \propto \prod_{i=1}^{N} p(t^{(i)})\, p(x^{(i)} \mid t^{(i)}) = \prod_{i=1}^{N} p(t^{(i)}) \prod_{j=1}^{D} \mu_{jt^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{jt^{(i)}}\right)^{1 - x_j^{(i)}}

where p(t) = \pi_t. The parameters \pi_t and \mu_{jt} can be learned by maximum likelihood.
Derivation of maximum likelihood estimator (MLE)

    \theta = [\mu, \pi]

    \log L(\theta) = \log p(x, t \mid \theta)
    = \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} + \sum_{j=1}^{D} \left( x_j^{(i)} \log \mu_{jt^{(i)}} + \left(1 - x_j^{(i)}\right) \log\left(1 - \mu_{jt^{(i)}}\right) \right) \right]

Want: \arg\max_\theta \log L(\theta) subject to \sum_k \pi_k = 1
Derivation of maximum likelihood estimator (MLE)
Take the derivative w.r.t. \mu:

    \frac{\partial \log L(\theta)}{\partial \mu_{jk}} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) \left( \frac{x_j^{(i)}}{\mu_{jk}} - \frac{1 - x_j^{(i)}}{1 - \mu_{jk}} \right) = 0

    \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) \left[ x_j^{(i)} \left(1 - \mu_{jk}\right) - \left(1 - x_j^{(i)}\right) \mu_{jk} \right] = 0

    \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) \mu_{jk} = \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) x_j^{(i)}

    \mu_{jk} = \frac{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) x_j^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right)}
Derivation of maximum likelihood estimator (MLE)

Use a Lagrange multiplier to derive \pi:

    \frac{\partial \log L(\theta)}{\partial \pi_k} + \frac{\partial}{\partial \pi_k}\, \lambda \left(1 - \sum_k \pi_k\right) = 0 \;\Rightarrow\; \lambda = \frac{1}{\pi_k} \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right)

    \pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right)}{\lambda}

Applying the constraint \sum_k \pi_k = 1 gives \lambda = N, so

    \pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right)}{N}
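The two estimators derived above (\mu_{jk} and \pi_k) are just per-class counts. A minimal sketch on a made-up dataset (X and t are illustrative, not from the lecture):

```python
import numpy as np

# X: N x D binary features, t: N class labels in {0, ..., K-1}. Made-up data.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0]])
t = np.array([1, 1, 0, 0])
K = 2

N, D = X.shape
pi = np.zeros(K)
mu = np.zeros((K, D))
for k in range(K):
    mask = (t == k)                            # the indicator 1(t^(i) = k)
    pi[k] = mask.sum() / N                     # pi_k = N_k / N
    mu[k] = X[mask].sum(axis=0) / mask.sum()   # mu_jk = class-k count of x_j / N_k
```

One practical caveat: if a word never occurs (or always occurs) in a class, the MLE gives \mu_{jk} = 0 or 1, which zeroes out the likelihood of unseen emails; add-one (Laplace) smoothing of the counts is a common fix, though it is not part of the derivation above.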
Spam Classification Demo

Gaussian Bayes Classifier

Instead of assuming conditional independence of the x_j, we model p(x \mid t) as a Gaussian distribution; the dependence among the x_j is encoded in the covariance matrix.

Multivariate Gaussian distribution:

    f(x) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma)}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

\mu: mean, \Sigma: covariance matrix, D: \dim(x)
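The density above can be evaluated directly with numpy; a sketch (the sanity check against the 1-D standard normal, f(0) = 1/\sqrt{2\pi}, is my addition):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density f(x) as defined above."""
    D = len(mu)
    diff = x - mu
    Z = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    # solve() applies Sigma^{-1} without forming the explicit inverse
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / Z

# Sanity check: 1-D standard normal at x = 0 gives 1/sqrt(2*pi)
val = gaussian_pdf(np.array([0.0]), np.array([0.0]), np.array([[1.0]]))
```

If SciPy is available, `scipy.stats.multivariate_normal` provides the same density; the explicit version is shown here only to mirror the formula.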
Derivation of maximum likelihood estimator (MLE)

    \theta = [\mu, \Sigma, \pi], \qquad Z = \sqrt{(2\pi)^D \det(\Sigma)}

    p(x \mid t) = \frac{1}{Z} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

    \log L(\theta) = \log p(x, t \mid \theta) = \log p(t \mid \theta) + \log p(x \mid t, \theta)

    = \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} - \log Z_{t^{(i)}} - \frac{1}{2} \left(x^{(i)} - \mu_{t^{(i)}}\right)^T \Sigma_{t^{(i)}}^{-1} \left(x^{(i)} - \mu_{t^{(i)}}\right) \right]

Want: \arg\max_\theta \log L(\theta) subject to \sum_k \pi_k = 1
Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. \mu:

    \frac{\partial \log L}{\partial \mu_k} = \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) \Sigma_k^{-1} \left(x^{(i)} - \mu_k\right) = 0

    \mu_k = \frac{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) x^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right)}
Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. \Sigma^{-1} (not \Sigma).

Note:

    \frac{\partial \det(A)}{\partial A} = \det(A)\, A^{-T}

    \det(A^{-1}) = \det(A)^{-1}

    \frac{\partial\, x^T A x}{\partial A} = x x^T

    \Sigma^T = \Sigma

    \frac{\partial \log L}{\partial \Sigma_k^{-1}} = \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) \left[ -\frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} - \frac{1}{2} \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^T \right] = 0
Derivation of maximum likelihood estimator (MLE)
    Z_k = \sqrt{(2\pi)^D \det(\Sigma_k)}

    \frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{Z_k} \frac{\partial Z_k}{\partial \Sigma_k^{-1}} = (2\pi)^{-\frac{D}{2}} \det(\Sigma_k)^{-\frac{1}{2}}\, (2\pi)^{\frac{D}{2}}\, \frac{\partial \det\!\left(\Sigma_k^{-1}\right)^{-\frac{1}{2}}}{\partial \Sigma_k^{-1}}

    = \det(\Sigma_k)^{-\frac{1}{2}} \left( -\frac{1}{2} \right) \det\!\left(\Sigma_k^{-1}\right)^{-\frac{3}{2}} \det\!\left(\Sigma_k^{-1}\right) \Sigma_k^T = -\frac{1}{2} \Sigma_k

    \frac{\partial \log L}{\partial \Sigma_k^{-1}} = \sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) \left[ \frac{1}{2} \Sigma_k - \frac{1}{2} \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^T \right] = 0

    \Sigma_k = \frac{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right) \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^T}{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right)}
Derivation of maximum likelihood estimator (MLE)

    \pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\!\left(t^{(i)} = k\right)}{N}

(same as in the Bernoulli case)
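Putting the three Gaussian MLE formulas together on a made-up dataset (a sketch, assuming numpy; the data is random and purely illustrative):

```python
import numpy as np

# Made-up data: N = 100 points in 2-D with labels in {0, ..., K-1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = rng.integers(0, 2, size=100)
K = 2

N = len(X)
pi = np.array([(t == k).mean() for k in range(K)])         # pi_k = N_k / N
mu = np.array([X[t == k].mean(axis=0) for k in range(K)])  # class means
Sigma = []
for k in range(K):
    diff = X[t == k] - mu[k]
    # MLE covariance: normalize by the class count N_k (not N_k - 1)
    Sigma.append(diff.T @ diff / (t == k).sum())
```

Each `Sigma[k]` matches `np.cov(X[t == k], rowvar=False, bias=True)`; note the MLE divides by N_k, unlike the unbiased sample covariance.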
Gaussian Bayes Classifier Demo

Gaussian Bayes Classifier

If we constrain \Sigma to be diagonal, we can rewrite p(x \mid t) as a product of the p(x_j \mid t):

    p(x \mid t) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma_t)}} \exp\left( -\frac{1}{2} (x - \mu_t)^T \Sigma_t^{-1} (x - \mu_t) \right)

    = \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi\, \Sigma_{t,jj}}} \exp\left( -\frac{1}{2\, \Sigma_{t,jj}} \left(x_j - \mu_{jt}\right)^2 \right) = \prod_{j=1}^{D} p(x_j \mid t)

A diagonal covariance matrix satisfies the naive Bayes assumption.
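The factorization above can be checked numerically; a sketch with made-up parameter values:

```python
import numpy as np

# With a diagonal covariance, the joint Gaussian density equals the product
# of D univariate Gaussian densities. All values below are made up.
mu = np.array([0.5, -1.0, 2.0])
var = np.array([0.4, 1.5, 0.9])  # diagonal entries Sigma_jj
Sigma = np.diag(var)
x = np.array([0.0, 0.3, 1.7])

D = len(x)
diff = x - mu
joint = (np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)
         / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma)))
product = np.prod(np.exp(-0.5 * (x - mu) ** 2 / var)
                  / np.sqrt(2 * np.pi * var))
# joint and product agree up to floating-point error
```

With any off-diagonal entries in Sigma, the two quantities no longer match — that is exactly the dependence naive Bayes throws away.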
Gaussian Bayes Classifier

Case 1: the covariance matrix is shared among classes,

    p(x \mid t) = \mathcal{N}(x \mid \mu_t, \Sigma)

Case 2: each class has its own covariance,

    p(x \mid t) = \mathcal{N}(x \mid \mu_t, \Sigma_t)
Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is shared between classes, the decision boundary satisfies

    p(t = 1 \mid x) = p(t = 0 \mid x)

    \log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0)

    C + x^T \Sigma^{-1} x - 2 \mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 = x^T \Sigma^{-1} x - 2 \mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0

    2 (\mu_0 - \mu_1)^T \Sigma^{-1} x - \left( \mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1 \right) = C

where C collects the constant terms from the priors. The quadratic terms x^T \Sigma^{-1} x cancel, so

    \Rightarrow\; a^T x - b = 0

The decision boundary is a linear function (a hyperplane in general).
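This linearity can be checked numerically. In the sketch below (all parameter values are made up), g_t(x) = \log \pi_t - \frac{1}{2}(x - \mu_t)^T \Sigma^{-1} (x - \mu_t) is the per-class score, and the difference g_1 - g_0 should equal an affine function a^T x - b:

```python
import numpy as np

# Made-up shared-covariance parameters for a 2-class, 2-D problem.
pi = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0], [2.0, 1.0]])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
Sinv = np.linalg.inv(Sigma)

def g(x, k):
    """Class score: log pi_k - 0.5 (x - mu_k)^T Sigma^{-1} (x - mu_k)."""
    d = x - mu[k]
    return np.log(pi[k]) - 0.5 * d @ Sinv @ d

# Affine coefficients obtained by expanding the quadratics (x^T Sinv x cancels).
a = Sinv @ (mu[1] - mu[0])
b = 0.5 * (mu[1] @ Sinv @ mu[1] - mu[0] @ Sinv @ mu[0]) - np.log(pi[1] / pi[0])

for x in np.random.default_rng(1).normal(size=(5, 2)):
    assert np.isclose(g(x, 1) - g(x, 0), a @ x - b)
```

The boundary g_1(x) = g_0(x) is then exactly the hyperplane a^T x = b.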
Relation to Logistic Regression

We can write the posterior distribution p(t = 0 \mid x) as

    p(t = 0 \mid x) = \frac{p(x, t = 0)}{p(x, t = 0) + p(x, t = 1)} = \frac{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma)}{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma) + \pi_1\, \mathcal{N}(x \mid \mu_1, \Sigma)}

    = \left\{ 1 + \frac{\pi_1}{\pi_0} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right) \right\}^{-1}

    = \left\{ 1 + \exp\left( \log \frac{\pi_1}{\pi_0} + (\mu_1 - \mu_0)^T \Sigma^{-1} x - \frac{1}{2} \left( \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \right) \right) \right\}^{-1}

    = \frac{1}{1 + \exp(-w^T x - b)}

i.e. the posterior has the same sigmoid-of-linear form as logistic regression.
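A numerical check of this identity (parameter values are made up): the posterior computed directly from the two Gaussians should equal the logistic sigmoid of a linear score w^T x + b:

```python
import numpy as np

# Made-up shared-covariance parameters.
pi0, pi1 = 0.3, 0.7
mu0 = np.array([0.0, 1.0])
mu1 = np.array([1.5, -0.5])
Sigma = np.array([[1.0, 0.2], [0.2, 0.8]])
Sinv = np.linalg.inv(Sigma)

def log_gauss(x, mu):
    """Log-density of N(x | mu, Sigma)."""
    d = x - mu
    D = len(x)
    return -0.5 * d @ Sinv @ d - 0.5 * np.log((2 * np.pi) ** D * np.linalg.det(Sigma))

x = np.array([0.4, 0.9])
p0 = np.exp(np.log(pi0) + log_gauss(x, mu0))
p1 = np.exp(np.log(pi1) + log_gauss(x, mu1))
direct = p0 / (p0 + p1)                       # p(t = 0 | x) by Bayes rule

# Linear score read off the derivation above.
w = Sinv @ (mu0 - mu1)
b = np.log(pi0 / pi1) - 0.5 * (mu0 @ Sinv @ mu0 - mu1 @ Sinv @ mu1)
sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # agrees with `direct`
```

Unlike logistic regression, which fits w and b by maximizing the conditional likelihood directly, here they are derived from the generative parameters (\pi, \mu, \Sigma).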
Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is not shared between classes, the decision boundary satisfies

    p(t = 1 \mid x) = p(t = 0 \mid x)

    \log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)

    x^T \left( \Sigma_1^{-1} - \Sigma_0^{-1} \right) x - 2 \left( \mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1} \right) x + \left( \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right) = C

where C collects the constant terms. The quadratic terms no longer cancel, so

    \Rightarrow\; x^T Q x - 2 b^T x + c = 0

The decision boundary is a quadratic function; in the 2-D case it looks like an ellipse, a parabola, or a hyperbola.
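As with the shared-covariance case, this can be checked numerically. In the sketch below (made-up parameters), the per-class score g_t(x) includes the class-specific normalizer -\frac{1}{2}\log\det\Sigma_t, and the score difference should match a quadratic -\frac{1}{2} x^T Q x + b^T x + c:

```python
import numpy as np

# Made-up per-class parameters for a 2-class, 2-D problem.
pi = np.array([0.5, 0.5])
mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sig = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
Sinv = [np.linalg.inv(S) for S in Sig]

def g(x, k):
    """log pi_k + log N(x | mu_k, Sigma_k), dropping the shared (2*pi)^D term."""
    d = x - mu[k]
    return (np.log(pi[k]) - 0.5 * d @ Sinv[k] @ d
            - 0.5 * np.log(np.linalg.det(Sig[k])))

# Quadratic form obtained by expanding the two quadratics.
Q = Sinv[1] - Sinv[0]
b = Sinv[1] @ mu[1] - Sinv[0] @ mu[0]
c = (np.log(pi[1] / pi[0])
     - 0.5 * (mu[1] @ Sinv[1] @ mu[1] - mu[0] @ Sinv[0] @ mu[0])
     - 0.5 * np.log(np.linalg.det(Sig[1]) / np.linalg.det(Sig[0])))

for x in np.random.default_rng(2).normal(size=(5, 2)):
    assert np.isclose(g(x, 1) - g(x, 0), -0.5 * x @ Q @ x + b @ x + c)
```

When Sig[0] == Sig[1], Q vanishes and the boundary collapses back to the hyperplane of the shared-covariance case.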
Thanks!
