Bayesian Decision Theory
Prof. Richard Zanibbi
Assumptions
The problem is posed in probabilistic terms, and all relevant probabilities are known.

Probability (discrete random variable)
For a random variable $x$ taking values in $\{v_1, \ldots, v_m\}$:
$$P(x) \geq 0, \qquad \sum_x P(x) = 1$$

Probability Density (continuous random variable)
For a continuous random variable $x$ with density $p(x)$:
$$p(x) \geq 0, \qquad \int_{-\infty}^{\infty} p(x)\,dx = 1, \qquad \Pr[x \in (a, b)] = \int_a^b p(x)\,dx$$
A density returns a probability only for an interval within which the value lies (intervals defined by some unit distance), not for individual values.
Prior Probability
Definition ( $P(\omega)$ )
The likelihood of a value for a random variable representing the state of nature (the true class for the current input), in the absence of other information.

Example
The prior probability that an instance from one of two classes is provided as input, in the absence of any features (e.g. $P(\text{cat}) = 0.3$, $P(\text{dog}) = 0.7$).
For each value of $x$, we have a different class-conditional pdf $p(x \mid \omega_i)$ for each class in $\Omega$ (example on the next slide).
Example: Class-Conditional Probability Densities
[Figure: two class-conditional densities $p(x \mid \omega_i)$ for classes $\omega_1$ and $\omega_2$, plotted over feature values $x$ from 9 to 15 (cf. Fig. 2.1 of Duda, Hart & Stork).]
Bayes Formula
$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}, \qquad \text{i.e.}\quad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
where
$$p(x) = \sum_{j=1}^{c} p(x \mid \omega_j)\,P(\omega_j)$$
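To make the formula concrete, here is a minimal sketch in Python; the Gaussian class-conditional parameters and priors are invented for illustration (loosely echoing Figs. 2.1-2.2), not taken from the slides.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density p(x) with mean mu and std. dev. sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical two-class problem: priors and class-conditional densities.
priors = {"w1": 2 / 3, "w2": 1 / 3}
x = 14.0
likelihoods = {"w1": gaussian_pdf(x, mu=11.0, sigma=1.0),
               "w2": gaussian_pdf(x, mu=13.5, sigma=1.0)}

# Evidence: p(x) = sum_j p(x|w_j) P(w_j)
evidence = sum(likelihoods[w] * priors[w] for w in priors)

# Posteriors P(w_j|x); by construction they sum to 1.
posteriors = {w: likelihoods[w] * priors[w] / evidence for w in priors}
print(posteriors)
```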
FIGURE 2.2. Posterior probabilities for the particular priors $P(\omega_1) = 2/3$ and $P(\omega_2) = 1/3$ for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value $x = 14$, the probability it is in category $\omega_2$ is roughly 0.08, and that it is in $\omega_1$ is 0.92. At every $x$, the posteriors sum to 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Minimizing the Average Probability of Error
What happens if we minimize the average probability of error? Consider the two-class case from the previous slide:
$$P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we choose } \omega_2 \\ P(\omega_2 \mid x) & \text{if we choose } \omega_1 \end{cases}$$
$$P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error} \mid x)\,p(x)\,dx \quad \text{(average error)}$$
Loss Functions and Conditional Risk
The loss $\lambda(\alpha_i \mid \omega_j)$ is the cost incurred by taking action $\alpha_i$ when the true state of nature is $\omega_j$. The conditional risk of action $\alpha_i$ for input $x$ is the expected loss:
$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x)$$

Overall Risk
The expected (average) loss associated with a decision rule $\alpha(x)$:
$$R = \int R(\alpha(x) \mid x)\,p(x)\,dx$$
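As a minimal sketch (the loss matrix and posteriors below are invented for illustration), the conditional risk of each candidate action follows directly from the posteriors:

```python
# Conditional risk R(alpha_i|x) = sum_j lam[i][j] * P(w_j|x).
posteriors = [0.92, 0.08]   # P(w_1|x), P(w_2|x)
lam = [[0.0, 2.0],          # action alpha_1: zero loss if truth is w_1, loss 2 if w_2
       [1.0, 0.0]]          # action alpha_2: loss 1 if truth is w_1, zero if w_2

risks = [sum(l * p for l, p in zip(row, posteriors)) for row in lam]
best = min(range(len(risks)), key=risks.__getitem__)
print(risks, "-> take action", best + 1)   # Bayes rule: least conditional risk
```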
Bayesian Decision Rules
Writing $\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$: for each input, select the class with the least conditional risk, i.e. choose class one if
$$R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$$
where
$$R(\alpha_1 \mid x) = \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x)$$
$$R(\alpha_2 \mid x) = \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)$$
Equivalently, choose $\omega_1$ if
$$(\lambda_{21} - \lambda_{11})\,P(\omega_1 \mid x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2 \mid x)$$
Alternate Equivalent Expressions of the Bayes Decision Rule (Choose Class One If...)

Posterior Class Probabilities
$$(\lambda_{21} - \lambda_{11})\,P(\omega_1 \mid x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2 \mid x)$$

Class Priors and Conditional Densities
Produced by applying Bayes formula to the above, then multiplying both sides by $p(x)$:
$$(\lambda_{21} - \lambda_{11})\,p(x \mid \omega_1)\,P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(x \mid \omega_2)\,P(\omega_2)$$

Likelihood Ratio
$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$
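A small sketch of the likelihood-ratio form of the rule; all parameters (Gaussian class models, priors, losses) are invented for illustration:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical two-class setup.
mu1, mu2, sigma = 11.0, 13.5, 1.0
P1, P2 = 2 / 3, 1 / 3
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0

def choose_class(x):
    """Choose w_1 iff the likelihood ratio exceeds the loss/prior threshold."""
    ratio = gaussian_pdf(x, mu1, sigma) / gaussian_pdf(x, mu2, sigma)
    threshold = (lam12 - lam22) / (lam21 - lam11) * (P2 / P1)
    return 1 if ratio > threshold else 2

print(choose_class(12.0), choose_class(14.0))  # -> 1 2
```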
Minimum Error Rate Classification

Definition (Zero-One Loss)
$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \ldots, c$$

Conditional Risk for Zero-One Loss
$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$

Bayes Decision Rule (minimum error rate)
Decide $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ for all $j \neq i$.
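Under zero-one loss the Bayes rule reduces to picking the class with the largest posterior; a minimal sketch (posterior values invented):

```python
def classify_min_error(posteriors):
    """Zero-one loss: minimizing R(alpha_i|x) = 1 - P(w_i|x)
    means choosing the class with the largest posterior."""
    return max(range(len(posteriors)), key=posteriors.__getitem__)

print(classify_min_error([0.92, 0.08]))  # -> 0, i.e. class w_1
```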
FIGURE 2.3. The likelihood ratio $p(x \mid \omega_1)/p(x \mid \omega_2)$ for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold $\theta_a$. If our loss function penalizes miscategorizing $\omega_2$ as $\omega_1$ patterns more than the converse, we get the larger threshold $\theta_b$, and hence $\mathcal{R}_1$ becomes smaller. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Bayes Classifiers
Recall, for zero-one loss:
$$R(\alpha_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$
and we decide $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ for all $j \neq i$.
Multicategory Case (revisited)

Equivalent Discriminants for Zero-One Loss (Minimum-Error-Rate)
Decide $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$, where any one of the following serves as the discriminant:
$$g_i(x) = P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j)\,P(\omega_j)}$$
$$g_i(x) = p(x \mid \omega_i)\,P(\omega_i)$$
$$g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$
These are equivalent because the evidence $p(x)$ is the same for every class, and the logarithm is monotonically increasing.
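A sketch of classification with the log-form discriminant $g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$; the three-class Gaussian parameters below are invented:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical three-class problem: (mean, std. dev., prior) per class.
classes = [(10.0, 1.0, 0.5),
           (12.0, 1.0, 0.3),
           (14.0, 1.5, 0.2)]

def decide(x):
    """Return the index i maximizing g_i(x) = ln p(x|w_i) + ln P(w_i)."""
    g = [math.log(gaussian_pdf(x, mu, s)) + math.log(p) for mu, s, p in classes]
    return max(range(len(g)), key=g.__getitem__)

print(decide(11.0), decide(13.5))  # -> 0 2
```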
Two-Category Case (Example: Zero-One Loss)
A single discriminant suffices; decide $\omega_1$ if $g(x) > 0$:
$$g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$$
$$g(x) = \ln \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$
[Figure: weighted class-conditional densities $p(x \mid \omega_i)P(\omega_i)$ with the decision boundary separating regions $\mathcal{R}_1$ and $\mathcal{R}_2$; a univariate normal with 2.5% tails beyond $\mu \pm 2\sigma$.]
FIGURE 2.7. A univariate normal distribution has roughly 95% of its area in the range $|x - \mu| \leq 2\sigma$, as shown. The peak of the distribution has value $p(\mu) = 1/(\sqrt{2\pi}\,\sigma)$. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
The Gaussian (Univariate Normal) Density

Formal Definition
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$$
with mean and variance
$$\mu = \int_{-\infty}^{\infty} x\,p(x)\,dx, \qquad \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2\,p(x)\,dx$$
The Multivariate Normal Density

Informal Definition
A normal distribution over two or more variables ($d$ variables/dimensions).

Formal Definition
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^t\,\Sigma^{-1}\,(\mathbf{x} - \boldsymbol{\mu})\right]$$

Matrix Elements
The mean vector and covariance matrix are
$$\boldsymbol{\mu} = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}, \qquad \Sigma = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t\,p(\mathbf{x})\,d\mathbf{x}$$
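A direct transcription of the density into Python (NumPy), as a sketch with invented parameters; for production use, a numerically stabler route such as a Cholesky factorization (or scipy.stats.multivariate_normal) is preferable:

```python
import numpy as np

def multivariate_normal_pdf(x, mu, sigma):
    """Evaluate the N(mu, sigma) density at x, straight from the formula."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm)

# Illustrative 2-D parameters.
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(multivariate_normal_pdf(np.array([0.5, -0.5]), mu, sigma))
```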
A Two-Dimensional Gaussian Distribution, with Samples Shown

FIGURE 2.9. Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean $\boldsymbol{\mu}$. The ellipses show lines of equal probability density of the Gaussian. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Linear Transformations in a 2D Feature Space

[Figure: a normal density $N(\boldsymbol{\mu}, \Sigma)$ mapped by a linear transformation $A$ to $N(A^t\boldsymbol{\mu}, A^t \Sigma A)$, by a whitening transform $A_w$ to $N(A_w^t\boldsymbol{\mu}, I)$, and projected onto a line by $P$.]

FIGURE 2.8. The action of a linear transformation on the feature space will convert an arbitrary normal distribution into another normal distribution. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Discriminant Functions ( $g_i(x)$ ) for the Normal Density

We will consider three special cases for normally distributed features and minimum-error-rate classification (0-1 loss), i.e.
$$p(x \mid \omega_i) \sim N(\boldsymbol{\mu}_i, \Sigma_i)$$

Recall the Bayes decision rules (equivalent minimum-error-rate discriminants):
$$g_i(x) = -R(\alpha_i \mid x), \qquad g_i(x) = P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j)\,P(\omega_j)},$$
$$g_i(x) = p(x \mid \omega_i)\,P(\omega_i), \qquad g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$

Substituting the multivariate normal density
$$p(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \boldsymbol{\mu})^t \Sigma^{-1} (x - \boldsymbol{\mu})\right]$$
into the log form gives
$$g_i(x) = -\frac{1}{2}(x - \boldsymbol{\mu}_i)^t \Sigma_i^{-1} (x - \boldsymbol{\mu}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
Minimum Error-Rate Discriminant Functions for Multivariate Gaussian Feature Distributions

Purpose: an overview of commonly assumed cases for the feature likelihood densities $p(x \mid \omega_i) \sim N(\boldsymbol{\mu}_i, \Sigma_i)$:
Case I: $\Sigma_i = \sigma^2 I$
Case II: $\Sigma_i = \Sigma$ (identical for all classes, otherwise arbitrary)
Case III: $\Sigma_i$ arbitrary
Bayes Decision Rules, Case I: $\Sigma_i = \sigma^2 I$

Start from the general discriminant
$$g_i(x) = -\frac{1}{2}(x - \boldsymbol{\mu}_i)^t \Sigma_i^{-1} (x - \boldsymbol{\mu}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

With $\Sigma_i = \sigma^2 I$ we have $|\Sigma_i| = \sigma^{2d}$ and $\Sigma_i^{-1} = (1/\sigma^2)I$; the only effect of the inverse covariance is to scale vectors by $1/\sigma^2$. The terms $-\frac{d}{2}\ln 2\pi$ and $-\frac{1}{2}\ln|\Sigma_i|$ are identical for every class and can be removed, leaving
$$g_i(x) = -\frac{(x - \boldsymbol{\mu}_i)^t (x - \boldsymbol{\mu}_i)}{2\sigma^2} + \ln P(\omega_i) = -\frac{\lVert x - \boldsymbol{\mu}_i \rVert^2}{2\sigma^2} + \ln P(\omega_i)$$

Expanding the squared distance:
$$g_i(x) = -\frac{1}{2\sigma^2}\left[x^t x - 2\boldsymbol{\mu}_i^t x + \boldsymbol{\mu}_i^t \boldsymbol{\mu}_i\right] + \ln P(\omega_i)$$
The term $x^t x$ is also the same for all classes and can be dropped.

Linear Discriminant Function
Produced by factoring the previous form:
$$g_i(x) = w_i^t x + w_{i0}, \qquad w_i = \frac{1}{\sigma^2}\boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2\sigma^2}\boldsymbol{\mu}_i^t \boldsymbol{\mu}_i + \ln P(\omega_i)$$
where $w_{i0}$ is the threshold or bias for class $i$.

Decision Boundary
Setting $g_i(x) = g_j(x)$ gives the hyperplane
$$w^t(x - x_0) = 0$$
with
$$w = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j, \qquad x_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\lVert \boldsymbol{\mu}_i - \boldsymbol{\mu}_j \rVert^2}\,\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$
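A compact NumPy sketch of the Case I linear discriminant; the means, variance, and priors are invented for illustration:

```python
import numpy as np

# Case I sketch: identical spherical covariances sigma^2 * I.
sigma2 = 1.0
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.5, 0.5]

def g(i, x):
    """Linear discriminant g_i(x) = w_i.x + w_i0 for Sigma_i = sigma^2 I."""
    w = mus[i] / sigma2
    w0 = -mus[i] @ mus[i] / (2 * sigma2) + np.log(priors[i])
    return w @ x + w0

x = np.array([1.2, 0.4])
print("decide class", 1 if g(0, x) > g(1, x) else 2)
```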
FIGURE 2.10. If the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in $d$ dimensions, and the boundary is a generalized hyperplane of $d - 1$ dimensions, perpendicular to the line separating the means. In these one-, two-, and three-dimensional examples, we indicate $p(x \mid \omega_i)$ and the boundaries for the case $P(\omega_1) = P(\omega_2)$. In the three-dimensional case, the grid plane separates $\mathcal{R}_1$ from $\mathcal{R}_2$. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Example: Translation of Decision Boundaries Through Changing Priors

FIGURE 2.11. As the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these one-, two- and three-dimensional spherical Gaussian distributions. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Case II: Identical Covariances, $\Sigma_i = \Sigma$

Again start from
$$g_i(x) = -\frac{1}{2}(x - \boldsymbol{\mu}_i)^t \Sigma_i^{-1} (x - \boldsymbol{\mu}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
The terms $-\frac{d}{2}\ln 2\pi$ and $-\frac{1}{2}\ln|\Sigma_i|$ are class-independent and can be removed, leaving
$$g_i(x) = -\frac{1}{2}(x - \boldsymbol{\mu}_i)^t \Sigma^{-1} (x - \boldsymbol{\mu}_i) + \ln P(\omega_i)$$

Expansion of the squared Mahalanobis distance:
$$(x - \boldsymbol{\mu}_i)^t \Sigma^{-1} (x - \boldsymbol{\mu}_i) = x^t \Sigma^{-1} x - x^t \Sigma^{-1} \boldsymbol{\mu}_i - \boldsymbol{\mu}_i^t \Sigma^{-1} x + \boldsymbol{\mu}_i^t \Sigma^{-1} \boldsymbol{\mu}_i = x^t \Sigma^{-1} x - 2(\Sigma^{-1}\boldsymbol{\mu}_i)^t x + \boldsymbol{\mu}_i^t \Sigma^{-1} \boldsymbol{\mu}_i$$
where the last step comes from the symmetry of $\Sigma^{-1}$ ($\Sigma^t = \Sigma$ implies $(\Sigma^{-1})^t = \Sigma^{-1}$).

Once again, the term $x^t \Sigma^{-1} x$ is an additive constant independent of class, and can be removed.

Linear Discriminant Function
$$g_i(x) = w_i^t x + w_{i0}, \qquad w_i = \Sigma^{-1}\boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^t \Sigma^{-1} \boldsymbol{\mu}_i + \ln P(\omega_i)$$

Decision Boundary
Setting $g_i(x) = g_j(x)$ again gives a hyperplane $w^t(x - x_0) = 0$, now with
$$w = \Sigma^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j), \qquad x_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln[P(\omega_i)/P(\omega_j)]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^t \Sigma^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$
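The same sketch adapted to Case II, with a shared non-spherical covariance; all values are invented:

```python
import numpy as np

# Case II sketch: one covariance Sigma shared by all classes.
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.7, 0.3]

def g(i, x):
    """Linear discriminant for Sigma_i = Sigma: w_i = Sigma^-1 mu_i."""
    w = Sigma_inv @ mus[i]
    w0 = -0.5 * mus[i] @ Sigma_inv @ mus[i] + np.log(priors[i])
    return w @ x + w0

x = np.array([1.0, 0.5])
print("decide class", 1 if g(0, x) > g(1, x) else 2)
```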
Notes on the Case II Decision Boundary
The hyperplane defined by the boundary is generally not orthogonal to the line between the two means.

FIGURE 2.12. Probability densities (indicated by the surfaces in two dimensions and ellipsoidal surfaces in three dimensions) and decision regions for equal but asymmetric Gaussian distributions. The decision hyperplanes need not be perpendicular to the line connecting the means. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Case III: $\Sigma_i$ Arbitrary

From the general discriminant
$$g_i(x) = -\frac{1}{2}(x - \boldsymbol{\mu}_i)^t \Sigma_i^{-1} (x - \boldsymbol{\mu}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
we can now remove only the one class-independent term, $-\frac{d}{2}\ln 2\pi$.

Discriminant Function (quadratic)
$$g_i(x) = x^t W_i x + w_i^t x + w_{i0}$$
where
$$W_i = -\frac{1}{2}\Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1}\boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^t \Sigma_i^{-1} \boldsymbol{\mu}_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
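A quadratic-discriminant sketch for Case III, with per-class covariances; the parameters are invented for illustration:

```python
import numpy as np

# Case III sketch: each class has its own covariance (quadratic discriminant).
params = [  # (mu, Sigma, prior)
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 1.0]), np.array([[2.0, 0.5], [0.5, 1.5]]), 0.5),
]

def g(i, x):
    """Quadratic discriminant g_i(x) = x^t W_i x + w_i^t x + w_i0."""
    mu, S, prior = params[i]
    S_inv = np.linalg.inv(S)
    W = -0.5 * S_inv
    w = S_inv @ mu
    w0 = -0.5 * mu @ S_inv @ mu - 0.5 * np.log(np.linalg.det(S)) + np.log(prior)
    return x @ W @ x + w @ x + w0

x = np.array([1.0, 0.2])
print("decide class", 1 if g(0, x) > g(1, x) else 2)
```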
Decision Regions
Need not be simply connected, even in one dimension (next slide).
FIGURE 2.13. Non-simply connected decision regions can arise in one dimension for Gaussians having unequal variance. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Other Distributions
Other distributions are possible; the underlying Bayesian decision theory is unmodified, however.

FIGURE 2.14. Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics. Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric.

FIGURE 2.16. The decision regions ($\mathcal{R}_1$ through $\mathcal{R}_4$) for four normal distributions.
Discrete Features
Roughly speaking: replace probability densities by probability mass functions. Expressions using integrals, e.g.
$$\int p(x \mid \omega_j)\,dx$$
are changed to use summations, e.g.
$$\sum_x P(x \mid \omega_j)$$

Bayes Formula
$$P(\omega_j \mid x) = \frac{P(x \mid \omega_j)\,P(\omega_j)}{P(x)}, \qquad P(x) = \sum_{j=1}^{c} P(x \mid \omega_j)\,P(\omega_j)$$
"
P (x|j )
Example: Independent
Binary
P (x|j )P (j )
!
P (j |x) =
Features p(x| ) dx P (x)
j
"
c
"
j ) (x|j )P (j )
P (x) =P (x|P
x
j=1
P (x|
x = {x1, ..., xd} of 0/1 -valued
features,
where
j )P (j )
P (j |x) =
P (x)
each xi is 0/1 with probability: pi = P r[x
i = 1|1 ]
P (x) =
Conditional Independence
c
"
P (x|j )P (j )
j=1
Likelihood Function
P (x|1 ) =
d
#
i=1
pxi i (1 pi )1xi
47
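A sketch combining the independence likelihood with the discrete Bayes formula above; the per-feature probabilities and priors are invented:

```python
import math

def binary_likelihood(x, p):
    """P(x|w) = prod_i p_i^{x_i} (1 - p_i)^{1 - x_i}, by conditional independence."""
    return math.prod(pi if xi else (1 - pi) for xi, pi in zip(x, p))

# Hypothetical per-feature probabilities P(x_i = 1 | w_j) for two classes, and priors.
p1, p2 = [0.8, 0.3, 0.6], [0.2, 0.7, 0.4]
P1, P2 = 0.5, 0.5

x = [1, 0, 1]
l1, l2 = binary_likelihood(x, p1), binary_likelihood(x, p2)
evidence = l1 * P1 + l2 * P2            # P(x) = sum_j P(x|w_j) P(w_j)
print("P(w1|x) =", l1 * P1 / evidence)  # discrete Bayes formula
```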