
Bayesian Decision Theory
Prof. Richard Zanibbi

Bayesian Decision Theory

The Basic Idea
To minimize errors, choose the least risky class, i.e. the class for which the expected loss is smallest.

Assumptions
The problem is posed in probabilistic terms, and all relevant probabilities are known.

Probability Mass vs. Probability Density Functions

Probability Mass Function, P(x)
Probability for values of a discrete random variable x, which takes one of the values {v_1, ..., v_m}. Each value has its own associated probability:

  $P(x) \ge 0$, and $\sum_x P(x) = 1$

Probability Density Function, p(x)
Probability density for values of a continuous random variable x:

  $p(x) \ge 0$, and $\int p(x)\,dx = 1$

  $\Pr[x \in (a, b)] = \int_a^b p(x)\,dx$

The probability returned is for an interval within which the value lies (intervals defined by some unit distance).
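As a quick numerical illustration (not from the slides; the PMF, density, and interval below are hypothetical), the sketch checks these normalization properties with NumPy: a PMF sums to 1, a density integrates to 1, and an interval probability is the integral of the density over (a, b).

import numpy as np

# Hypothetical discrete random variable with values {v1, v2, v3} and PMF P(x).
pmf = {"v1": 0.2, "v2": 0.5, "v3": 0.3}
assert all(prob >= 0 for prob in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-12        # sum_x P(x) = 1

# Hypothetical continuous random variable: a unit-variance Gaussian density p(x).
def p(x, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-10.0, 10.0, 100_001)
print(np.trapz(p(xs), xs))                         # integral of p(x) dx, ~1.0
ab = np.linspace(0.0, 1.0, 10_001)
print(np.trapz(p(ab), ab))                         # Pr[x in (0, 1)], ~0.3413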

Prior Probability

Definition ( P(ω) )
The likelihood of a value for the random variable representing the state of nature (the true class for the current input), in the absence of other information. Informally: what percentage of the time a given state occurs.

Example
The prior probability that the input instance belongs to each of two classes, before any features are observed (e.g. P(cat) = 0.3, P(dog) = 0.7).

Class-Conditional Probability Density Function (for Continuous Features)

Definition ( p( x | ω ) )
The probability density of a value for continuous random variable x, given a state of nature ω.

For each class in ω we have a different class-conditional pdf over x (example on the next slide).

Example: Class-Conditional Probability Densities

[Figure: two class-conditional densities $p(x|\omega_i)$ plotted against the feature value x.]

FIGURE 2.1. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ω_i. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Bayes Formula

  posterior = (likelihood x prior) / evidence

  $P(\omega_j|x) = \dfrac{p(x|\omega_j)\,P(\omega_j)}{p(x)}$, where $p(x) = \sum_{j=1}^{c} p(x|\omega_j)\,P(\omega_j)$

Purpose
Convert the class prior and the class-conditional densities into a posterior probability for a class: the probability of a class given the input features (post-observation).
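A minimal sketch of the formula (the priors and likelihood values below are assumed, chosen only for illustration): given the priors P(ω_j) and the class-conditional likelihoods p(x|ω_j) evaluated at the observed x, the posteriors follow directly.

import numpy as np

priors = np.array([2 / 3, 1 / 3])           # P(w_1), P(w_2)
likelihoods = np.array([0.05, 0.60])        # p(x | w_1), p(x | w_2) at the observed x

evidence = np.sum(likelihoods * priors)     # p(x) = sum_j p(x | w_j) P(w_j)
posteriors = likelihoods * priors / evidence
print(posteriors, posteriors.sum())         # posteriors sum to 1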

Example: Posterior Probabilities

[Figure: posterior probabilities $P(\omega_i|x)$ plotted against the feature value x.]

FIGURE 2.2. Posterior probabilities for the particular priors P(ω_1) = 2/3 and P(ω_2) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω_2 is roughly 0.08, and that it is in ω_1 is 0.92. At every x, the posteriors sum to 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Choosing the Most Likely Class

What happens if we do the following?

  Decide $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$; otherwise decide $\omega_2$

Answer: we minimize the average probability of error. Consider the two-class case:

  $P(\mathrm{error}|x) = \begin{cases} P(\omega_1|x) & \text{if we choose } \omega_2 \\ P(\omega_2|x) & \text{if we choose } \omega_1 \end{cases}$

  $P(\mathrm{error}) = \int P(\mathrm{error}|x)\,p(x)\,dx$  (average error)

Expected Loss or Conditional Risk of an Action

  $R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x)$

Explanation
The expected (average) loss for taking an action (choosing a class) given an input vector, for a given conditional loss function λ (lambda).
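A small sketch of the computation (the loss matrix and posteriors are hypothetical): with λ(α_i|ω_j) stored as a matrix, the conditional risks are a single matrix-vector product, and the Bayes decision is the action with the smallest risk.

import numpy as np

# lam[i, j] = loss lambda(a_i | w_j) for taking action a_i when the true class is w_j.
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])
posteriors = np.array([0.7, 0.3])           # P(w_1 | x), P(w_2 | x)

risks = lam @ posteriors                    # R(a_i | x) = sum_j lam[i, j] P(w_j | x)
print(risks, np.argmin(risks))              # choose the least risky action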

Decision Functions and Overall Risk

  $R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x)$

  $R = \int R(\alpha(x)|x)\,p(x)\,dx$

Decision Function or Decision Rule
( α(x) ): takes on the value of exactly one action for each input vector x.

Overall Risk
The expected (average) loss associated with a decision rule.

Bayes Decision Rule

Idea
Minimize the overall risk by choosing the action with the least conditional risk for each input vector x.

Bayes Risk (R*)
The resulting overall risk produced using this procedure. This is the best performance that can be achieved given the available information.

Bayes Decision Rule: Two-Category Case

Notation: $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$

For each input, select the class with the least conditional risk, i.e. choose class one if:

  $R(\alpha_1|x) < R(\alpha_2|x)$

where

  $R(\alpha_1|x) = \lambda_{11} P(\omega_1|x) + \lambda_{12} P(\omega_2|x)$
  $R(\alpha_2|x) = \lambda_{21} P(\omega_1|x) + \lambda_{22} P(\omega_2|x)$

Equivalently, choose class one if:

  $(\lambda_{21} - \lambda_{11})\,P(\omega_1|x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2|x)$

  $(\lambda_{21} - \lambda_{11})\,p(x|\omega_1)P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(x|\omega_2)P(\omega_2)$

  $\dfrac{p(x|\omega_1)}{p(x|\omega_2)} > \dfrac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \dfrac{P(\omega_2)}{P(\omega_1)}$

Alternate Equivalent Expressions of the Bayes Decision Rule (Choose Class One If...)

Posterior class probabilities:

  $R(\alpha_1|x) < R(\alpha_2|x)$

  $(\lambda_{21} - \lambda_{11})\,P(\omega_1|x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2|x)$

Class priors and conditional densities (produced by applying Bayes formula to the above and multiplying both sides by p(x)):

  $(\lambda_{21} - \lambda_{11})\,p(x|\omega_1)P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(x|\omega_2)P(\omega_2)$

Likelihood ratio:

  $\dfrac{p(x|\omega_1)}{p(x|\omega_2)} > \dfrac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \dfrac{P(\omega_2)}{P(\omega_1)}$
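The likelihood-ratio form is easy to apply directly. The sketch below uses assumed Gaussian class-conditionals, loss values, and priors (chosen only as an example) and decides class one when the ratio exceeds the threshold on the right-hand side.

import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical losses lambda_ij and priors.
l11, l12, l21, l22 = 0.0, 3.0, 1.0, 0.0
P1, P2 = 2 / 3, 1 / 3

x = 12.0                                             # observed feature value
ratio = gauss(x, 10.0, 1.0) / gauss(x, 13.0, 1.5)    # p(x | w_1) / p(x | w_2)
threshold = (l12 - l22) / (l21 - l11) * (P2 / P1)
print("decide w_1" if ratio > threshold else "decide w_2")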

The Zero-One Loss and Minimum Error Rate Classification

Definition

  $\lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j \\ 1 & i \ne j \end{cases}, \quad i, j = 1, \dots, c$

Conditional risk for the zero-one loss:

  $R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x) = \sum_{j \ne i} P(\omega_j|x) = 1 - P(\omega_i|x)$

Bayes decision rule (minimum error rate), multicategory case:

  Decide $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$ for all $j \ne i$
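A one-line consequence worth checking numerically (the posteriors below are hypothetical): under the zero-one loss, minimizing R(α_i|x) = 1 - P(ω_i|x) is the same as picking the class with the largest posterior.

import numpy as np

posteriors = np.array([0.2, 0.5, 0.3])      # P(w_j | x) for c = 3 classes
risks = 1.0 - posteriors                    # R(a_i | x) under the 0-1 loss
assert np.argmin(risks) == np.argmax(posteriors)
print(np.argmax(posteriors))                # decide the most probable class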

Example: Likelihood Ratio

[Figure: the likelihood ratio $p(x|\omega_1)/p(x|\omega_2)$ plotted against x, with thresholds θ_a and θ_b and the induced decision regions R_1, R_2.]

FIGURE 2.3. The likelihood ratio p(x|ω_1)/p(x|ω_2) for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold θ_a. If our loss function penalizes miscategorizing ω_2 as ω_1 patterns more than the converse, we get the larger threshold θ_b, and hence R_1 becomes smaller. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Bayes Classifiers

Recall the Canonical Model
Decide class i if:

  $g_i(x) > g_j(x)$ for all $j \ne i$

For Bayes Classifiers
Use the first definition of the discriminant below for the general case, and the second for zero-one loss:

  $g_i(x) = -R(\alpha_i|x)$

  $g_i(x) = P(\omega_i|x)$

For the zero-one loss case, equivalent discriminants include:

  $g_i(x) = P(\omega_i|x) = \dfrac{p(x|\omega_i)P(\omega_i)}{\sum_{j=1}^{c} p(x|\omega_j)P(\omega_j)}$

  $g_i(x) = p(x|\omega_i)P(\omega_i)$

  $g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i)$
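Since each discriminant above is a monotonic function of the posterior, they all produce the same decision. The sketch below (Gaussian class-conditionals and priors are assumed, purely for illustration) verifies this for one observation.

import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = np.array([2 / 3, 1 / 3])
mus, sigmas = np.array([10.0, 13.0]), np.array([1.0, 1.5])

x = 11.5
likelihoods = gauss(x, mus, sigmas)                            # p(x | w_i)
g_post = likelihoods * priors / np.sum(likelihoods * priors)   # P(w_i | x)
g_prod = likelihoods * priors                                  # p(x | w_i) P(w_i)
g_log = np.log(likelihoods) + np.log(priors)                   # ln p(x|w_i) + ln P(w_i)

assert np.argmax(g_post) == np.argmax(g_prod) == np.argmax(g_log)
print(np.argmax(g_post))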

Equivalent Discriminants for Zero-One Loss (Minimum-Error-Rate)

  $g_i(x) > g_j(x)$ for all $j \ne i$

Trade-off
Simplicity of understanding vs. computation:

  $g_i(x) = P(\omega_i|x) = \dfrac{p(x|\omega_i)P(\omega_i)}{\sum_{j=1}^{c} p(x|\omega_j)P(\omega_j)}$

  $g_i(x) = p(x|\omega_i)P(\omega_i)$

  $g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i)$

Discriminants for Two Categories

For Two Categories
We can use a single discriminant function, with decision rule: choose class one if the discriminant returns a value > 0.

Example (zero-one loss):

  $g(x) = P(\omega_1|x) - P(\omega_2|x)$

  $g(x) = \ln \dfrac{p(x|\omega_1)}{p(x|\omega_2)} + \ln \dfrac{P(\omega_1)}{P(\omega_2)}$

Example: Decision Regions for a Binary Classifier

[Figure: two-dimensional densities $p(x|\omega_1)P(\omega_1)$ and $p(x|\omega_2)P(\omega_2)$, with decision regions R_1, R_2 and the decision boundary.]

FIGURE 2.6. In this two-dimensional two-category classifier, the probability densities are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R_2 is not simply connected. The ellipses mark where the density is 1/e times that at the peak of the distribution. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

The (Univariate) Normal Distribution

Why are Gaussians so Useful?
They represent many probability distributions in nature quite accurately. In our case, they are useful when patterns can be represented as random variations of an ideal prototype (represented by the mean feature vector).

Everyday examples: the height and weight of a population.

Univariate Normal Distribution

[Figure: p(x) with 2.5% tails beyond $\mu - 2\sigma$ and $\mu + 2\sigma$.]

FIGURE 2.7. A univariate normal distribution has roughly 95% of its area in the range $|x - \mu| \le 2\sigma$, as shown. The peak of the distribution has value $p(\mu) = 1/\sqrt{2\pi}\,\sigma$. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Formal Definition: Univariate Normal

  $p(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\dfrac{1}{2}\left(\dfrac{x-\mu}{\sigma}\right)^{2}\right]$

The peak of the distribution, $p(\mu)$, occurs at the mean. The mean and variance are defined by

  $\mu = \int x\,p(x)\,dx$

  $\sigma^{2} = \int (x-\mu)^{2}\,p(x)\,dx$
Multivariate Normal Density

Informal Definition
A normal distribution over two or more variables (d variables/dimensions).

Formal Definition

  $p(x) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\!\left[-\dfrac{1}{2}(x-\mu)^{t}\,\Sigma^{-1}(x-\mu)\right]$

with mean vector and covariance matrix

  $\mu = \int x\,p(x)\,dx$

  $\Sigma = \int (x-\mu)(x-\mu)^{t}\,p(x)\,dx$
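A direct implementation of the density formula (the mean vector and covariance matrix below are hypothetical), using only NumPy:

import numpy as np

def mvn_density(x, mu, sigma):
    # p(x) = exp(-0.5 (x - mu)^t Sigma^{-1} (x - mu)) / ((2 pi)^{d/2} |Sigma|^{1/2})
    d = len(mu)
    diff = x - mu
    maha_sq = diff @ np.linalg.inv(sigma) @ diff       # squared Mahalanobis distance
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha_sq) / norm

mu = np.array([0.0, 1.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                         # symmetric, positive definite
print(mvn_density(np.array([0.5, 1.5]), mu, sigma))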

The Covariance Matrix

  $\Sigma = \int (x-\mu)(x-\mu)^{t}\,p(x)\,dx$

For our purposes...
Assume the matrix is positive definite, so the determinant of the matrix is always positive.

Matrix Elements
Main diagonal: the variances of the individual variables.
Off-diagonal: the covariances of each variable pairing i and j (note: values are repeated, as the matrix is symmetric).

Independence and Correlation

For a multivariate normal covariance matrix:

Off-diagonal entries with a value of 0 indicate uncorrelated variables; for a multivariate normal, uncorrelated variables are also statistically independent (the variables do not influence one another).

Roughly speaking, a covariance is positive if the two variables increase together (positive correlation), and negative if one variable decreases when the other increases (negative correlation).

A Two-Dimensional Gaussian Distribution, with Samples Shown

[Figure: samples in the (x_1, x_2) plane with elliptical contours of equal density.]

FIGURE 2.9. Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean μ. The ellipses show lines of equal probability density of the Gaussian. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Linear Transformations in a 2D Feature Space

[Figure: a source distribution $N(\mu, \Sigma)$, its image $N(A^{t}\mu, A^{t}\Sigma A)$ under a linear transformation A, a whitening transform $A_w$ giving $N(A_w^{t}\mu, I)$, and a projection onto a line through the point a.]

FIGURE 2.8. The action of a linear transformation on the feature space will convert an arbitrary normal distribution into another normal distribution.

Discriminant Functions ( g_i(x) ) for the Normal Density

We will consider three special cases for normally distributed features and minimum-error-rate classification (0-1 loss).

Recall: decide $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \ne i$, using $g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i)$.

If $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$, i.e.

  $p(x|\omega_i) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}}\exp\!\left[-\dfrac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i)\right]$

then

  $g_i(x) = -\dfrac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i) - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$

Minimum Error-Rate Discriminant Function for Multivariate Gaussian Feature Distributions

Taking the natural log of $g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i)$ with $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$ gives a general form for our discriminant functions:

  $g_i(x) = -\dfrac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i) - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$
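The general form translates directly into code. In the sketch below, the class means, covariances, and priors are assumed values chosen only to illustrate evaluating g_i(x) and picking the largest.

import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    # g_i(x) = -1/2 (x-mu)^t Sigma^{-1} (x-mu) - d/2 ln 2pi - 1/2 ln|Sigma| + ln P(w_i)
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

params = [  # (mean, covariance, prior) for two hypothetical classes
    (np.array([0.0, 0.0]), np.eye(2), 0.5),
    (np.array([2.0, 2.0]), np.array([[1.5, 0.3], [0.3, 0.8]]), 0.5),
]
x = np.array([1.0, 1.2])
print(np.argmax([gaussian_discriminant(x, mu, s, p) for mu, s, p in params]))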

Special Cases for Binary Classification

Purpose
Overview of commonly assumed cases for the feature likelihood densities.

Goal: eliminate common additive constants in the discriminant functions. These do not affect the classification decision (i.e. define g_i(x) using just the differences). Also, look at the resulting decision surfaces (defined by g_i(x) = g_j(x)).

Three Special Cases
1. Statistically independent features, identically distributed Gaussians for each class
2. Identical covariances for each class
3. Arbitrary covariances

Case I: $\Sigma_i = \sigma^{2} I$

Recall: if $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$, then

  $g_i(x) = -\dfrac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i) - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$

Remove: the terms $-\frac{d}{2}\ln 2\pi$ and $-\frac{1}{2}\ln|\Sigma_i|$ are the same across classes (unimportant additive constants).

With $\Sigma_i = \sigma^{2} I$:

  $|\Sigma_i| = \sigma^{2d}, \qquad \Sigma_i^{-1} = (1/\sigma^{2})\,I$

so the only effect of the inverse covariance is to scale the difference vector by $1/\sigma^{2}$.

Discriminant function:

  $g_i(x) = -\dfrac{(x-\mu_i)^{t}(x-\mu_i)}{2\sigma^{2}} + \ln P(\omega_i)$

  $g_i(x) = -\dfrac{1}{2\sigma^{2}}\left[x^{t}x - 2\mu_i^{t}x + \mu_i^{t}\mu_i\right] + \ln P(\omega_i)$

Case I: Linear Discriminant Function

Produced by factoring the previous form (the $x^{t}x$ term is the same for all classes and can be dropped):

  $g_i(x) = \dfrac{1}{\sigma^{2}}\mu_i^{t}x - \dfrac{1}{2\sigma^{2}}\mu_i^{t}\mu_i + \ln P(\omega_i)$

  $g_i(x) = w_i^{t}x + w_{i0}$

with weight vector and threshold (bias) for class i:

  $w_i = \dfrac{1}{\sigma^{2}}\mu_i, \qquad w_{i0} = -\dfrac{1}{2\sigma^{2}}\mu_i^{t}\mu_i + \ln P(\omega_i)$

A change in the prior translates the decision boundary.
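A sketch of the Case I linear machine (the means, σ², and priors are assumed example values): each class contributes a weight vector w_i = μ_i/σ² and a bias w_i0, and changing a prior only shifts the bias.

import numpy as np

sigma2 = 0.5                                   # shared variance, Sigma_i = sigma2 * I
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.6, 0.4]

def g_case1(x, mu, prior):
    w = mu / sigma2                            # w_i = mu_i / sigma^2
    w0 = -mu @ mu / (2 * sigma2) + np.log(prior)
    return w @ x + w0

x = np.array([1.2, 0.4])
print(np.argmax([g_case1(x, mu, p) for mu, p in zip(mus, priors)]))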

Case I: Decision Boundary

Setting $g_i(x) = g_j(x)$ gives a hyperplane:

  $w^{t}(x - x_0) = 0$

where

  $w = \mu_i - \mu_j$

  $x_0 = \dfrac{1}{2}(\mu_i + \mu_j) - \dfrac{\sigma^{2}}{(\mu_i - \mu_j)^{t}(\mu_i - \mu_j)}\,\ln\dfrac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j)$

Notes
The decision boundary goes through x_0, which lies on the line between the means, and is orthogonal to this line.
If the priors are equal, x_0 is midway between the means (a minimum-distance classifier); otherwise x_0 is shifted.
If the variance is small relative to the distance between the means, the priors have limited effect on the boundary location.

Case I: Statistically Independent Features with Identical Variances

[Figure: one-, two-, and three-dimensional examples of $p(x|\omega_i)$ with P(ω_1) = P(ω_2) = .5 and the resulting decision regions R_1, R_2.]

FIGURE 2.10. If the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d - 1 dimensions, perpendicular to the line separating the means. In these one-, two-, and three-dimensional examples, we indicate p(x|ω_i) and the boundaries for the case P(ω_1) = P(ω_2). In the three-dimensional case, the grid plane separates R_1 from R_2. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Example: Translation of Decision Boundaries Through Changing Priors

[Figure: the same pairs of densities $p(x|\omega_i)$ with increasingly unequal priors (e.g. P(ω_1) = .7, .8, .9, .99); the decision boundary shifts away from the more probable class.]

FIGURE 2.11. As the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these one-, two- and three-dimensional examples. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Case II: Identical Covariances, $\Sigma_i = \Sigma$

Recall the general form: if $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$, then

  $g_i(x) = -\dfrac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i) - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$

Remove
As in Case I, the terms $-\frac{d}{2}\ln 2\pi$ and $-\frac{1}{2}\ln|\Sigma_i|$ can be ignored (same across classes).

Squared Mahalanobis Distance
The quadratic term $(x-\mu_i)^{t}\Sigma^{-1}(x-\mu_i)$ is the distance from x to the mean for class i, taking the covariance into account; it defines contours of fixed density.

Case II: Identical Covariances, $\Sigma_i = \Sigma$ (continued)

  $g_i(x) = -\dfrac{1}{2}(x-\mu_i)^{t}\Sigma^{-1}(x-\mu_i) + \ln P(\omega_i)$

Expansion of the squared Mahalanobis distance:

  $(x-\mu_i)^{t}\Sigma^{-1}(x-\mu_i) = x^{t}\Sigma^{-1}x - x^{t}\Sigma^{-1}\mu_i - \mu_i^{t}\Sigma^{-1}x + \mu_i^{t}\Sigma^{-1}\mu_i = x^{t}\Sigma^{-1}x - 2(\Sigma^{-1}\mu_i)^{t}x + \mu_i^{t}\Sigma^{-1}\mu_i$

The last step uses the symmetry of the covariance matrix, and thus of its inverse:

  $\Sigma^{t} = \Sigma, \qquad (\Sigma^{-1})^{t} = \Sigma^{-1}$

Once again, the quadratic term $x^{t}\Sigma^{-1}x$ is an additive constant independent of the class, and can be removed.

Case II: Identical Covariances (continued)

Dropping the class-independent term leaves a linear discriminant function:

  $g_i(x) = w_i^{t}x + w_{i0}$

where

  $w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = -\dfrac{1}{2}\mu_i^{t}\Sigma^{-1}\mu_i + \ln P(\omega_i)$

Decision boundary: setting $g_i(x) = g_j(x)$ gives

  $w^{t}(x - x_0) = 0$

where

  $w = \Sigma^{-1}(\mu_i - \mu_j)$

  $x_0 = \dfrac{1}{2}(\mu_i + \mu_j) - \dfrac{\ln[P(\omega_i)/P(\omega_j)]}{(\mu_i - \mu_j)^{t}\Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)$
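A sketch of the Case II linear discriminant (the shared covariance, means, and priors are assumed values): the only change from Case I is that Σ^{-1} replaces (1/σ²)I.

import numpy as np

sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])                 # shared covariance (symmetric, pos. def.)
sigma_inv = np.linalg.inv(sigma)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.5, 0.5]

def g_case2(x, mu, prior):
    w = sigma_inv @ mu                         # w_i = Sigma^{-1} mu_i
    w0 = -0.5 * mu @ sigma_inv @ mu + np.log(prior)
    return w @ x + w0

x = np.array([1.0, 0.3])
print(np.argmax([g_case2(x, mu, p) for mu, p in zip(mus, priors)]))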

Case II: Identical Covariances, Notes on the Decision Boundary

As for Case I, the boundary passes through a point x_0 lying on the line between the two class means; again, x_0 is in the middle if the priors are identical.

The hyperplane defined by the boundary is generally not orthogonal to the line between the two means.

[Figure: equal but asymmetric Gaussian densities and the resulting decision regions for equal and unequal priors.]

FIGURE 2.12. Probability densities (indicated by the surfaces in two dimensions and ellipsoidal surfaces in three dimensions) and decision regions for equal but asymmetric Gaussian distributions. The decision hyperplanes need not be perpendicular to the line connecting the means. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

Case III: Arbitrary $\Sigma_i$

Recall the general form (for $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$):

  $g_i(x) = -\dfrac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i) - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$

Remove
Only the $-\frac{d}{2}\ln 2\pi$ term can be removed; it alone is the same across classes.

Discriminant function (quadratic):

  $g_i(x) = x^{t}W_i x + w_i^{t}x + w_{i0}$

where

  $W_i = -\dfrac{1}{2}\Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1}\mu_i, \qquad w_{i0} = -\dfrac{1}{2}\mu_i^{t}\Sigma_i^{-1}\mu_i - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$
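A sketch of the Case III quadratic discriminant with assumed class parameters; unlike Cases I and II, each class keeps its own W_i and its own ln|Σ_i| term.

import numpy as np

def g_case3(x, mu, sigma, prior):
    sigma_inv = np.linalg.inv(sigma)
    W = -0.5 * sigma_inv                       # W_i = -1/2 Sigma_i^{-1}
    w = sigma_inv @ mu                         # w_i = Sigma_i^{-1} mu_i
    w0 = (-0.5 * mu @ sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(sigma))
          + np.log(prior))
    return x @ W @ x + w @ x + w0

params = [  # (mean, covariance, prior) for two hypothetical classes
    (np.array([0.0, 0.0]), np.array([[0.5, 0.0], [0.0, 2.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.6], [0.6, 1.0]]), 0.5),
]
x = np.array([1.0, 1.5])
print(np.argmax([g_case3(x, *p) for p in params]))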

Case III: Arbitrary $\Sigma_i$ (continued)

Decision Boundaries
Are hyperquadrics: they can be hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids.

Decision Regions
Need not be simply connected, even in one dimension (next slide).

Case III: Arbitrary Covariances

[Figure: two one-dimensional Gaussians $p(x|\omega_i)$ with unequal variances; the decision region R_1 consists of two disjoint intervals surrounding R_2.]

FIGURE 2.13. Non-simply connected decision regions can arise in one dimension for Gaussians having unequal variance. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.

(See Fig. 2.15 in the text for three-dimensional cases.)

FIGURE 2.14. Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics. Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric.

More than Two Categories

Decision Boundary
Defined by the two most likely classes for each segment.

Other Distributions
Possible; the underlying Bayesian Decision Theory is unmodified.

[Figure 2.16: the decision regions R_1, ..., R_4 for four normal distributions. Even with this small number of categories, the shapes of the boundary regions can be complex. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.]

Discrete Features

Roughly speaking...
Replace probability densities by probability mass functions. Expressions using integrals, e.g.

  $\int p(x|\omega_j)\,dx$

are changed to use summations, e.g.

  $\sum_{x} P(x|\omega_j)$

Bayes Formula

  $P(\omega_j|x) = \dfrac{P(x|\omega_j)\,P(\omega_j)}{P(x)}$, where $P(x) = \sum_{j=1}^{c} P(x|\omega_j)\,P(\omega_j)$
Example: Independent Binary Features

Binary Feature Vector
x = {x_1, ..., x_d} of 0/1-valued features, where each x_i equals 1 with probability

  $p_i = \Pr[x_i = 1 \,|\, \omega_1]$

Conditional Independence
Assume that, given a class, the features are independent.

Likelihood Function

  $P(x|\omega_1) = \prod_{i=1}^{d} p_i^{\,x_i}\,(1 - p_i)^{1 - x_i}$
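A sketch of the likelihood under conditional independence (the Bernoulli parameters p_i and the observation are assumed values); the log form evaluates the same product in a numerically safer way.

import numpy as np

p = np.array([0.8, 0.3, 0.6])                  # p_i = Pr[x_i = 1 | w_1]
x = np.array([1, 0, 1])                        # observed binary feature vector

likelihood = np.prod(p ** x * (1 - p) ** (1 - x))
log_likelihood = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
print(likelihood, np.exp(log_likelihood))      # identical values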
