
Bayesian Decision Theory
Prof. Richard Zanibbi

Bayesian Decision Theory

The Basic Idea
To minimize errors, choose the least risky class, i.e. the class for which the expected loss is smallest.

Assumptions
The problem is posed in probabilistic terms, and all relevant probabilities are known.

Probability Mass vs. Probability Density Functions

Probability Mass Function, P(x)
Probability for values of a discrete random variable x, which takes one of the values {v_1, ..., v_m}. Each value has its own associated probability:

  $P(x) \ge 0$, and $\sum_x P(x) = 1$

Probability Density Function, p(x)
Probability density for values of a continuous random variable x:

  $p(x) \ge 0$, and $\int p(x)\,dx = 1$

  $\Pr[x \in (a, b)] = \int_a^b p(x)\,dx$

The probability returned is for an interval within which the value lies (intervals defined by some unit distance).
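As a quick numerical illustration (not from the slides; the PMF, density, and interval below are hypothetical), the sketch checks these normalization properties with NumPy: a PMF sums to 1, a density integrates to 1, and an interval probability is the integral of the density over (a, b).

import numpy as np

# Hypothetical discrete random variable with values {v1, v2, v3} and PMF P(x).
pmf = {"v1": 0.2, "v2": 0.5, "v3": 0.3}
assert all(prob >= 0 for prob in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-12        # sum_x P(x) = 1

# Hypothetical continuous random variable: a unit-variance Gaussian density p(x).
def p(x, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-10.0, 10.0, 100_001)
print(np.trapz(p(xs), xs))                         # integral of p(x) dx, ~1.0
ab = np.linspace(0.0, 1.0, 10_001)
print(np.trapz(p(ab), ab))                         # Pr[x in (0, 1)], ~0.3413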

Prior Probability

Definition ( P(ω) )
The likelihood of a value for the random variable representing the state of nature (the true class for the current input), in the absence of other information. Informally: what percentage of the time a given state occurs.

Example
The prior probability that the input instance belongs to each of two classes, before any features are observed (e.g. P(cat) = 0.3, P(dog) = 0.7).

Class-Conditional Probability Density Function (for Continuous Features)

Definition ( p( x | ω ) )
The probability density of a value for continuous random variable x, given a state of nature ω.

For each class in ω we have a different class-conditional pdf over x (example on the next slide).

Example: Class-Conditional Probability Densities

[Figure: two class-conditional densities $p(x|\omega_i)$ plotted against the feature value x.]

FIGURE 2.1. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ω_i. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Bayes Formula

  posterior = (likelihood x prior) / evidence

  $P(\omega_j|x) = \dfrac{p(x|\omega_j)\,P(\omega_j)}{p(x)}$, where $p(x) = \sum_{j=1}^{c} p(x|\omega_j)\,P(\omega_j)$

Purpose
Convert the class prior and the class-conditional densities into a posterior probability for a class: the probability of a class given the input features (post-observation).
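A minimal sketch of the formula (the priors and likelihood values below are assumed, chosen only for illustration): given the priors P(ω_j) and the class-conditional likelihoods p(x|ω_j) evaluated at the observed x, the posteriors follow directly.

import numpy as np

priors = np.array([2 / 3, 1 / 3])           # P(w_1), P(w_2)
likelihoods = np.array([0.05, 0.60])        # p(x | w_1), p(x | w_2) at the observed x

evidence = np.sum(likelihoods * priors)     # p(x) = sum_j p(x | w_j) P(w_j)
posteriors = likelihoods * priors / evidence
print(posteriors, posteriors.sum())         # posteriors sum to 1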

Example: Posterior Probabilities

[Figure: posterior probabilities $P(\omega_i|x)$ plotted against the feature value x.]

FIGURE 2.2. Posterior probabilities for the particular priors P(ω_1) = 2/3 and P(ω_2) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω_2 is roughly 0.08, and that it is in ω_1 is 0.92. At every x, the posteriors sum to 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Choosing the Most Likely Class

What happens if we do the following?

  Decide $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$; otherwise decide $\omega_2$

Answer: we minimize the average probability of error. Consider the two-class case:

  $P(\mathrm{error}|x) = \begin{cases} P(\omega_1|x) & \text{if we choose } \omega_2 \\ P(\omega_2|x) & \text{if we choose } \omega_1 \end{cases}$

  $P(\mathrm{error}) = \int P(\mathrm{error}|x)\,p(x)\,dx$  (average error)

Expected Loss or Conditional Risk of an Action

  $R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x)$

Explanation
The expected (average) loss for taking an action (choosing a class) given an input vector, for a given conditional loss function λ (lambda).
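A small sketch of the computation (the loss matrix and posteriors are hypothetical): with λ(α_i|ω_j) stored as a matrix, the conditional risks are a single matrix-vector product, and the Bayes decision is the action with the smallest risk.

import numpy as np

# lam[i, j] = loss lambda(a_i | w_j) for taking action a_i when the true class is w_j.
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])
posteriors = np.array([0.7, 0.3])           # P(w_1 | x), P(w_2 | x)

risks = lam @ posteriors                    # R(a_i | x) = sum_j lam[i, j] P(w_j | x)
print(risks, np.argmin(risks))              # choose the least risky action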

Decision Functions and Overall Risk

  $R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x)$

  $R = \int R(\alpha(x)|x)\,p(x)\,dx$

Decision Function or Decision Rule
( α(x) ): takes on the value of exactly one action for each input vector x.

Overall Risk
The expected (average) loss associated with a decision rule.

Bayes Decision Rule

Idea
Minimize the overall risk by choosing the action with the least conditional risk for each input vector x.

Bayes Risk (R*)
The resulting overall risk produced using this procedure. This is the best performance that can be achieved given the available information.

Bayes Decision Rule: Two-Category Case

Notation: $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$

For each input, select the class with the least conditional risk, i.e. choose class one if:

  $R(\alpha_1|x) < R(\alpha_2|x)$

where

  $R(\alpha_1|x) = \lambda_{11} P(\omega_1|x) + \lambda_{12} P(\omega_2|x)$
  $R(\alpha_2|x) = \lambda_{21} P(\omega_1|x) + \lambda_{22} P(\omega_2|x)$

Equivalently, choose class one if:

  $(\lambda_{21} - \lambda_{11})\,P(\omega_1|x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2|x)$

  $(\lambda_{21} - \lambda_{11})\,p(x|\omega_1)P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(x|\omega_2)P(\omega_2)$

  $\dfrac{p(x|\omega_1)}{p(x|\omega_2)} > \dfrac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \dfrac{P(\omega_2)}{P(\omega_1)}$

Alternate Equivalent Expressions of the Bayes Decision Rule (Choose Class One If...)

Posterior class probabilities:

  $R(\alpha_1|x) < R(\alpha_2|x)$

  $(\lambda_{21} - \lambda_{11})\,P(\omega_1|x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2|x)$

Class priors and conditional densities (produced by applying Bayes formula to the above and multiplying both sides by p(x)):

  $(\lambda_{21} - \lambda_{11})\,p(x|\omega_1)P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(x|\omega_2)P(\omega_2)$

Likelihood ratio:

  $\dfrac{p(x|\omega_1)}{p(x|\omega_2)} > \dfrac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \dfrac{P(\omega_2)}{P(\omega_1)}$
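The likelihood-ratio form is easy to apply directly. The sketch below uses assumed Gaussian class-conditionals, loss values, and priors (chosen only as an example) and decides class one when the ratio exceeds the threshold on the right-hand side.

import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical losses lambda_ij and priors.
l11, l12, l21, l22 = 0.0, 3.0, 1.0, 0.0
P1, P2 = 2 / 3, 1 / 3

x = 12.0                                             # observed feature value
ratio = gauss(x, 10.0, 1.0) / gauss(x, 13.0, 1.5)    # p(x | w_1) / p(x | w_2)
threshold = (l12 - l22) / (l21 - l11) * (P2 / P1)
print("decide w_1" if ratio > threshold else "decide w_2")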

The Zero-One Loss and Minimum Error Rate Classification

Definition

  $\lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j \\ 1 & i \ne j \end{cases}, \quad i, j = 1, \dots, c$

Conditional risk for the zero-one loss:

  $R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x) = \sum_{j \ne i} P(\omega_j|x) = 1 - P(\omega_i|x)$

Bayes decision rule (minimum error rate), multicategory case:

  Decide $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$ for all $j \ne i$
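A one-line consequence worth checking numerically (the posteriors below are hypothetical): under the zero-one loss, minimizing R(α_i|x) = 1 - P(ω_i|x) is the same as picking the class with the largest posterior.

import numpy as np

posteriors = np.array([0.2, 0.5, 0.3])      # P(w_j | x) for c = 3 classes
risks = 1.0 - posteriors                    # R(a_i | x) under the 0-1 loss
assert np.argmin(risks) == np.argmax(posteriors)
print(np.argmax(posteriors))                # decide the most probable class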

Example: Likelihood Ratio

[Figure: the likelihood ratio $p(x|\omega_1)/p(x|\omega_2)$ plotted against x, with thresholds θ_a and θ_b and the induced decision regions R_1, R_2.]

FIGURE 2.3. The likelihood ratio p(x|ω_1)/p(x|ω_2) for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold θ_a. If our loss function penalizes miscategorizing ω_2 as ω_1 patterns more than the converse, we get the larger threshold θ_b, and hence R_1 becomes smaller. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Bayes Classifiers

Recall the Canonical Model
Decide class i if:

  $g_i(x) > g_j(x)$ for all $j \ne i$

For Bayes Classifiers
Use the first definition of the discriminant below for the general case, and the second for zero-one loss:

  $g_i(x) = -R(\alpha_i|x)$

  $g_i(x) = P(\omega_i|x)$

For the zero-one loss case, equivalent discriminants include:

  $g_i(x) = P(\omega_i|x) = \dfrac{p(x|\omega_i)P(\omega_i)}{\sum_{j=1}^{c} p(x|\omega_j)P(\omega_j)}$

  $g_i(x) = p(x|\omega_i)P(\omega_i)$

  $g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i)$
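Since each discriminant above is a monotonic function of the posterior, they all produce the same decision. The sketch below (Gaussian class-conditionals and priors are assumed, purely for illustration) verifies this for one observation.

import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = np.array([2 / 3, 1 / 3])
mus, sigmas = np.array([10.0, 13.0]), np.array([1.0, 1.5])

x = 11.5
likelihoods = gauss(x, mus, sigmas)                            # p(x | w_i)
g_post = likelihoods * priors / np.sum(likelihoods * priors)   # P(w_i | x)
g_prod = likelihoods * priors                                  # p(x | w_i) P(w_i)
g_log = np.log(likelihoods) + np.log(priors)                   # ln p(x|w_i) + ln P(w_i)

assert np.argmax(g_post) == np.argmax(g_prod) == np.argmax(g_log)
print(np.argmax(g_post))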

Equivalent Discriminants for Zero-One Loss (Minimum-Error-Rate)

  $g_i(x) > g_j(x)$ for all $j \ne i$

Trade-off
Simplicity of understanding vs. computation:

  $g_i(x) = P(\omega_i|x) = \dfrac{p(x|\omega_i)P(\omega_i)}{\sum_{j=1}^{c} p(x|\omega_j)P(\omega_j)}$

  $g_i(x) = p(x|\omega_i)P(\omega_i)$

  $g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i)$

Discriminants for Two Categories

For Two Categories
We can use a single discriminant function, with decision rule: choose class one if the discriminant returns a value > 0.

Example (zero-one loss):

  $g(x) = P(\omega_1|x) - P(\omega_2|x)$

  $g(x) = \ln \dfrac{p(x|\omega_1)}{p(x|\omega_2)} + \ln \dfrac{P(\omega_1)}{P(\omega_2)}$

Example: Decision Regions for a Binary Classifier

[Figure: two-dimensional densities $p(x|\omega_1)P(\omega_1)$ and $p(x|\omega_2)P(\omega_2)$, with decision regions R_1, R_2 and the decision boundary.]

FIGURE 2.6. In this two-dimensional two-category classifier, the probability densities are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R_2 is not simply connected. The ellipses mark where the density is 1/e times that at the peak of the distribution. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

The (Univariate) Normal Distribution

Why are Gaussians so Useful?
They represent many probability distributions in nature quite accurately. In our case, they are useful when patterns can be represented as random variations of an ideal prototype (represented by the mean feature vector).

Everyday examples: the height and weight of a population.

Univariate Normal Distribution

[Figure: p(x) with 2.5% tails beyond $\mu - 2\sigma$ and $\mu + 2\sigma$.]

FIGURE 2.7. A univariate normal distribution has roughly 95% of its area in the range $|x - \mu| \le 2\sigma$, as shown. The peak of the distribution has value $p(\mu) = 1/\sqrt{2\pi}\,\sigma$. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Formal Definition: Univariate Normal

  $p(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\dfrac{1}{2}\left(\dfrac{x-\mu}{\sigma}\right)^{2}\right]$

The peak of the distribution, $p(\mu)$, occurs at the mean. The mean and variance are defined by

  $\mu = \int x\,p(x)\,dx$

  $\sigma^{2} = \int (x-\mu)^{2}\,p(x)\,dx$
Multivariate Normal Density

Informal Definition
A normal distribution over two or more variables (d variables/dimensions).

Formal Definition

  $p(x) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\!\left[-\dfrac{1}{2}(x-\mu)^{t}\,\Sigma^{-1}(x-\mu)\right]$

with mean vector and covariance matrix

  $\mu = \int x\,p(x)\,dx$

  $\Sigma = \int (x-\mu)(x-\mu)^{t}\,p(x)\,dx$
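A direct implementation of the density formula (the mean vector and covariance matrix below are hypothetical), using only NumPy:

import numpy as np

def mvn_density(x, mu, sigma):
    # p(x) = exp(-0.5 (x - mu)^t Sigma^{-1} (x - mu)) / ((2 pi)^{d/2} |Sigma|^{1/2})
    d = len(mu)
    diff = x - mu
    maha_sq = diff @ np.linalg.inv(sigma) @ diff       # squared Mahalanobis distance
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha_sq) / norm

mu = np.array([0.0, 1.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                         # symmetric, positive definite
print(mvn_density(np.array([0.5, 1.5]), mu, sigma))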

The Covariance Matrix

  $\Sigma = \int (x-\mu)(x-\mu)^{t}\,p(x)\,dx$

For our purposes...
Assume the matrix is positive definite, so the determinant of the matrix is always positive.

Matrix Elements
Main diagonal: the variances of the individual variables.
Off-diagonal: the covariances of each variable pairing i and j (note: values are repeated, as the matrix is symmetric).

Independence and Correlation

For a multivariate normal covariance matrix:

Off-diagonal entries with a value of 0 indicate uncorrelated variables; for a multivariate normal, uncorrelated variables are also statistically independent (the variables do not influence one another).

Roughly speaking, a covariance is positive if the two variables increase together (positive correlation), and negative if one variable decreases when the other increases (negative correlation).

A Two-Dimensional Gaussian Distribution, with Samples Shown

[Figure: samples in the (x_1, x_2) plane with elliptical contours of equal density.]

FIGURE 2.9. Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean μ. The ellipses show lines of equal probability density of the Gaussian. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Linear Transformations in a 2D Feature Space

[Figure: a source distribution $N(\mu, \Sigma)$, its image $N(A^{t}\mu, A^{t}\Sigma A)$ under a linear transformation A, a whitening transform $A_w$ giving $N(A_w^{t}\mu, I)$, and a projection onto a line through the point a.]

FIGURE 2.8. The action of a linear transformation on the feature space will convert an arbitrary normal distribution into another normal distribution.

Discriminant Functions ( g_i(x) ) for the Normal Density

We will consider three special cases for normally distributed features and minimum-error-rate classification (0-1 loss).

Recall: decide $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \ne i$, using $g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i)$.

If $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$, i.e.

  $p(x|\omega_i) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}}\exp\!\left[-\dfrac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i)\right]$

then

  $g_i(x) = -\dfrac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i) - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$

Minimum Error-Rate Discriminant Function for Multivariate Gaussian Feature Distributions

Taking the natural log of $g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i)$ with $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$ gives a general form for our discriminant functions:

  $g_i(x) = -\dfrac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i) - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$
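The general form translates directly into code. In the sketch below, the class means, covariances, and priors are assumed values chosen only to illustrate evaluating g_i(x) and picking the largest.

import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    # g_i(x) = -1/2 (x-mu)^t Sigma^{-1} (x-mu) - d/2 ln 2pi - 1/2 ln|Sigma| + ln P(w_i)
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

params = [  # (mean, covariance, prior) for two hypothetical classes
    (np.array([0.0, 0.0]), np.eye(2), 0.5),
    (np.array([2.0, 2.0]), np.array([[1.5, 0.3], [0.3, 0.8]]), 0.5),
]
x = np.array([1.0, 1.2])
print(np.argmax([gaussian_discriminant(x, mu, s, p) for mu, s, p in params]))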

Special Cases for Binary Classification

Purpose
Overview of commonly assumed cases for the feature likelihood densities.

Goal: eliminate common additive constants in the discriminant functions. These do not affect the classification decision (i.e. define g_i(x) using just the differences). Also, look at the resulting decision surfaces (defined by g_i(x) = g_j(x)).

Three Special Cases
1. Statistically independent features, identically distributed Gaussians for each class
2. Identical covariances for each class
3. Arbitrary covariances

Case I: $\Sigma_i = \sigma^{2} I$

Recall: if $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$, then

  $g_i(x) = -\dfrac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i) - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$

Remove: the terms $-\frac{d}{2}\ln 2\pi$ and $-\frac{1}{2}\ln|\Sigma_i|$ are the same across classes (unimportant additive constants).

With $\Sigma_i = \sigma^{2} I$:

  $|\Sigma_i| = \sigma^{2d}, \qquad \Sigma_i^{-1} = (1/\sigma^{2})\,I$

so the only effect of the inverse covariance is to scale the difference vector by $1/\sigma^{2}$.

Discriminant function:

  $g_i(x) = -\dfrac{(x-\mu_i)^{t}(x-\mu_i)}{2\sigma^{2}} + \ln P(\omega_i)$

  $g_i(x) = -\dfrac{1}{2\sigma^{2}}\left[x^{t}x - 2\mu_i^{t}x + \mu_i^{t}\mu_i\right] + \ln P(\omega_i)$

Case I: Linear Discriminant Function

Produced by factoring the previous form (the $x^{t}x$ term is the same for all classes and can be dropped):

  $g_i(x) = \dfrac{1}{\sigma^{2}}\mu_i^{t}x - \dfrac{1}{2\sigma^{2}}\mu_i^{t}\mu_i + \ln P(\omega_i)$

  $g_i(x) = w_i^{t}x + w_{i0}$

with weight vector and threshold (bias) for class i:

  $w_i = \dfrac{1}{\sigma^{2}}\mu_i, \qquad w_{i0} = -\dfrac{1}{2\sigma^{2}}\mu_i^{t}\mu_i + \ln P(\omega_i)$

A change in the prior translates the decision boundary.
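A sketch of the Case I linear machine (the means, σ², and priors are assumed example values): each class contributes a weight vector w_i = μ_i/σ² and a bias w_i0, and changing a prior only shifts the bias.

import numpy as np

sigma2 = 0.5                                   # shared variance, Sigma_i = sigma2 * I
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.6, 0.4]

def g_case1(x, mu, prior):
    w = mu / sigma2                            # w_i = mu_i / sigma^2
    w0 = -mu @ mu / (2 * sigma2) + np.log(prior)
    return w @ x + w0

x = np.array([1.2, 0.4])
print(np.argmax([g_case1(x, mu, p) for mu, p in zip(mus, priors)]))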

Case I: Decision Boundary

Setting $g_i(x) = g_j(x)$ gives a hyperplane:

  $w^{t}(x - x_0) = 0$

where

  $w = \mu_i - \mu_j$

  $x_0 = \dfrac{1}{2}(\mu_i + \mu_j) - \dfrac{\sigma^{2}}{(\mu_i - \mu_j)^{t}(\mu_i - \mu_j)}\,\ln\dfrac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j)$

Notes
The decision boundary goes through x_0, which lies on the line between the means, and is orthogonal to this line.
If the priors are equal, x_0 is midway between the means (a minimum-distance classifier); otherwise x_0 is shifted.
If the variance is small relative to the distance between the means, the priors have limited effect on the boundary location.

Case I: Statistically Independent Features with Identical Variances

[Figure: one-, two-, and three-dimensional examples of $p(x|\omega_i)$ with P(ω_1) = P(ω_2) = .5 and the resulting decision regions R_1, R_2.]

FIGURE 2.10. If the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d - 1 dimensions, perpendicular to the line separating the means. In these one-, two-, and three-dimensional examples, we indicate p(x|ω_i) and the boundaries for the case P(ω_1) = P(ω_2). In the three-dimensional case, the grid plane separates R_1 from R_2. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Example: Translation of Decision Boundaries Through Changing Priors

[Figure: the same pairs of densities $p(x|\omega_i)$ with increasingly unequal priors (e.g. P(ω_1) = .7, .8, .9, .99); the decision boundary shifts away from the more probable class.]

FIGURE 2.11. As the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these one-, two- and three-dimensional examples. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Case II: Identical Covariances, $\Sigma_i = \Sigma$

Recall the general form: if $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$, then

  $g_i(x) = -\dfrac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i) - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$

Remove
As in Case I, the terms $-\frac{d}{2}\ln 2\pi$ and $-\frac{1}{2}\ln|\Sigma_i|$ can be ignored (same across classes).

Squared Mahalanobis Distance
The quadratic term $(x-\mu_i)^{t}\Sigma^{-1}(x-\mu_i)$ is the distance from x to the mean for class i, taking the covariance into account; it defines contours of fixed density.

Case II: Identical Covariances, $\Sigma_i = \Sigma$ (continued)

  $g_i(x) = -\dfrac{1}{2}(x-\mu_i)^{t}\Sigma^{-1}(x-\mu_i) + \ln P(\omega_i)$

Expansion of the squared Mahalanobis distance:

  $(x-\mu_i)^{t}\Sigma^{-1}(x-\mu_i) = x^{t}\Sigma^{-1}x - x^{t}\Sigma^{-1}\mu_i - \mu_i^{t}\Sigma^{-1}x + \mu_i^{t}\Sigma^{-1}\mu_i = x^{t}\Sigma^{-1}x - 2(\Sigma^{-1}\mu_i)^{t}x + \mu_i^{t}\Sigma^{-1}\mu_i$

The last step uses the symmetry of the covariance matrix, and thus of its inverse:

  $\Sigma^{t} = \Sigma, \qquad (\Sigma^{-1})^{t} = \Sigma^{-1}$

Once again, the quadratic term $x^{t}\Sigma^{-1}x$ is an additive constant independent of the class, and can be removed.

Case II: Identical Covariances (continued)

Dropping the class-independent term leaves a linear discriminant function:

  $g_i(x) = w_i^{t}x + w_{i0}$

where

  $w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = -\dfrac{1}{2}\mu_i^{t}\Sigma^{-1}\mu_i + \ln P(\omega_i)$

Decision boundary: setting $g_i(x) = g_j(x)$ gives

  $w^{t}(x - x_0) = 0$

where

  $w = \Sigma^{-1}(\mu_i - \mu_j)$

  $x_0 = \dfrac{1}{2}(\mu_i + \mu_j) - \dfrac{\ln[P(\omega_i)/P(\omega_j)]}{(\mu_i - \mu_j)^{t}\Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)$
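A sketch of the Case II linear discriminant (the shared covariance, means, and priors are assumed values): the only change from Case I is that Σ^{-1} replaces (1/σ²)I.

import numpy as np

sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])                 # shared covariance (symmetric, pos. def.)
sigma_inv = np.linalg.inv(sigma)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.5, 0.5]

def g_case2(x, mu, prior):
    w = sigma_inv @ mu                         # w_i = Sigma^{-1} mu_i
    w0 = -0.5 * mu @ sigma_inv @ mu + np.log(prior)
    return w @ x + w0

x = np.array([1.0, 0.3])
print(np.argmax([g_case2(x, mu, p) for mu, p in zip(mus, priors)]))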

Case II: Identical Covariances, Notes on the Decision Boundary

As for Case I, the boundary passes through a point x_0 lying on the line between the two class means; again, x_0 is in the middle if the priors are identical.

The hyperplane defined by the boundary is generally not orthogonal to the line between the two means.

[Figure: equal but asymmetric Gaussian densities and the resulting decision regions for equal and unequal priors.]

FIGURE 2.12. Probability densities (indicated by the surfaces in two dimensions and ellipsoidal surfaces in three dimensions) and decision regions for equal but asymmetric Gaussian distributions. The decision hyperplanes need not be perpendicular to the line connecting the means. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Case III: Arbitrary $\Sigma_i$

Recall the general form (for $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$):

  $g_i(x) = -\dfrac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i) - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$

Remove
Only the $-\frac{d}{2}\ln 2\pi$ term can be removed; it alone is the same across classes.

Discriminant function (quadratic):

  $g_i(x) = x^{t}W_i x + w_i^{t}x + w_{i0}$

where

  $W_i = -\dfrac{1}{2}\Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1}\mu_i, \qquad w_{i0} = -\dfrac{1}{2}\mu_i^{t}\Sigma_i^{-1}\mu_i - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$
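A sketch of the Case III quadratic discriminant with assumed class parameters; unlike Cases I and II, each class keeps its own W_i and its own ln|Σ_i| term.

import numpy as np

def g_case3(x, mu, sigma, prior):
    sigma_inv = np.linalg.inv(sigma)
    W = -0.5 * sigma_inv                       # W_i = -1/2 Sigma_i^{-1}
    w = sigma_inv @ mu                         # w_i = Sigma_i^{-1} mu_i
    w0 = (-0.5 * mu @ sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(sigma))
          + np.log(prior))
    return x @ W @ x + w @ x + w0

params = [  # (mean, covariance, prior) for two hypothetical classes
    (np.array([0.0, 0.0]), np.array([[0.5, 0.0], [0.0, 2.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.6], [0.6, 1.0]]), 0.5),
]
x = np.array([1.0, 1.5])
print(np.argmax([g_case3(x, *p) for p in params]))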

Case III: Arbitrary $\Sigma_i$ (continued)

Decision Boundaries
Are hyperquadrics: they can be hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids.

Decision Regions
Need not be simply connected, even in one dimension (next slide).

Case III: Arbitrary Covariances

[Figure: two one-dimensional Gaussians $p(x|\omega_i)$ with unequal variances; the decision region R_1 consists of two disjoint intervals surrounding R_2.]

FIGURE 2.13. Non-simply connected decision regions can arise in one dimension for Gaussians having unequal variance. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

(See Fig. 2.15 in the text for three-dimensional cases.)

FIGURE 2.14. Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics. Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric.

More than Two Categories

Decision Boundary
Defined by the two most likely classes for each segment.

Other Distributions
Possible; the underlying Bayesian Decision Theory is unmodified.

[Figure 2.16: the decision regions R_1, ..., R_4 for four normal distributions. Even with this small number of categories, the shapes of the boundary regions can be complex. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.]

Discrete Features

Roughly speaking...
Replace probability densities by probability mass functions. Expressions using integrals, e.g.

  $\int p(x|\omega_j)\,dx$

are changed to use summations, e.g.

  $\sum_{x} P(x|\omega_j)$

Bayes Formula

  $P(\omega_j|x) = \dfrac{P(x|\omega_j)\,P(\omega_j)}{P(x)}$, where $P(x) = \sum_{j=1}^{c} P(x|\omega_j)\,P(\omega_j)$
Example: Independent Binary Features

Binary Feature Vector
x = {x_1, ..., x_d} of 0/1-valued features, where each x_i equals 1 with probability

  $p_i = \Pr[x_i = 1 \,|\, \omega_1]$

Conditional Independence
Assume that, given a class, the features are independent.

Likelihood Function

  $P(x|\omega_1) = \prod_{i=1}^{d} p_i^{\,x_i}\,(1 - p_i)^{1 - x_i}$
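A sketch of the likelihood under conditional independence (the Bernoulli parameters p_i and the observation are assumed values); the log form evaluates the same product in a numerically safer way.

import numpy as np

p = np.array([0.8, 0.3, 0.6])                  # p_i = Pr[x_i = 1 | w_1]
x = np.array([1, 0, 1])                        # observed binary feature vector

likelihood = np.prod(p ** x * (1 - p) ** (1 - x))
log_likelihood = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
print(likelihood, np.exp(log_likelihood))      # identical values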
