
CSE 473

Pattern Recognition

Instructor:
Dr. Md. Monirul Islam
Bayesian Classifier
and its Variants

2
Classification Example 1
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the time
– one of every 50,000 persons has meningitis
– one of every 20 persons has stiff neck

3
Classification Example 1
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the time
– one of every 50,000 persons has meningitis
– one of every 20 persons has stiff neck

• If a patient has a stiff neck, does he/she have meningitis?

4
Classification Example 2

Attribute types: Refund (categorical), Marital Status (categorical),
Taxable Income (continuous), Evade (class label)

Tid  Refund  Marital Status  Taxable Income  Evade
  1  Yes     Single          125K            No
  2  No      Married         100K            No
  3  No      Single           70K            No
  4  Yes     Married         120K            No
  5  No      Divorced         95K            Yes
  6  No      Married          60K            No
  7  Yes     Divorced        220K            No
  8  No      Single           85K            Yes
  9  No      Married          75K            No
 10  No      Single           90K            Yes

5
Classification Example 2

(training data as in the table above)

• A married person with income 100K did not refund the loan previously
6
Classification Example 2

(training data as in the table above)

• A married person with income 100K did not refund the loan previously
• Can we trust him?

7
Classification Example 3
• The sea bass/salmon example

– We know the previous counts of salmon/sea bass


– Can we predict which fish is coming on the
conveyor?

8
Bayes Classifier
• A probabilistic framework for solving classification
problems

• Bayes theorem:

P(A, C) = P(A) P(C | A) = P(A | C) P(C)

9
Bayes Classifier
• A probabilistic framework for solving classification
problems

• Conditional Probabilities:

P(C | A) = P(A, C) / P(A)

P(A | C) = P(A, C) / P(C)

10
Example 1
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20

• If a patient has a stiff neck, does he/she have meningitis?

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
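As a quick check of this calculation, here is a minimal Python sketch (the probabilities are exactly those given above):

    # Minimal sketch: Bayes rule for the meningitis example
    p_s_given_m = 0.5          # P(S | M): stiff neck given meningitis
    p_m = 1 / 50000            # P(M): prior probability of meningitis
    p_s = 1 / 20               # P(S): prior probability of stiff neck

    p_m_given_s = p_s_given_m * p_m / p_s     # Bayes theorem
    print(p_m_given_s)                        # 0.0002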

11
Example 3
• The sea bass/salmon example

– We know the previous counts of salmon/sea bass


• the amount of each catch
• the class, or state of nature (salmon / sea bass)
(the state of nature is a random variable)

12
Example 3
• The sea bass/salmon example

• if the catch of salmon and sea bass is equi-probable

– P(ω1) = P(ω2) (uniform priors)

– P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)

13
Example 3
• Decision rule with only the prior information

– Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

14
Example 3
• Decision rule with only the prior information

– Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

– This rule will misclassify many fish

15
Example 3
• Use of the class-conditional information
– Use lightness

• P(x | ω1) and P(x | ω2) describe lightness in sea bass and salmon

16
Example 3

[Figure: class-conditional densities p(x | ω1) and p(x | ω2) for the lightness feature x]
17
Example 3
• Given the lightness evidence x, calculate Posterior
from Likelihood and evidence

– P(ωj | x) = p(x | ωj) P(ωj) / p(x)

– Posterior = (Likelihood × Prior) / Evidence

18
Example 3
• Given the lightness evidence x, calculate Posterior
from Likelihood and evidence

– P(ωj | x) = p(x | ωj) P(ωj) / p(x)

where, in the case of two categories,

p(x) = Σ_j p(x | ωj) P(ωj),   j = 1, 2

19
Example 3

[Figure: posterior functions P(ω1 | x) and P(ω2 | x) versus the lightness x]

20
Example 3
Decision rules with the posterior probabilities

x is an observation for which:

if P(ω1 | x) > P(ω2 | x), the true state of nature = ω1
if P(ω1 | x) < P(ω2 | x), the true state of nature = ω2

21
Example 3
Decision rules with the posterior probabilities

x is an observation for which:

if P(ω1 | x) > P(ω2 | x), the true state of nature = ω1
if P(ω1 | x) < P(ω2 | x), the true state of nature = ω2

Therefore:
whenever we observe a particular x, the probability of error is:
P(error | x) = P(ω1 | x) if we decide ω2
P(error | x) = P(ω2 | x) if we decide ω1

22
• Minimizing the probability of error

• Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2

Therefore:
P(error | x) = min [P(ω1 | x), P(ω2 | x)]
(Bayes decision)
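A minimal Python sketch of the Bayes decision and its error probability (the posterior values are illustrative):

    import numpy as np

    # Bayes decision: pick the class with the larger posterior;
    # the probability of error at x is the smaller posterior.
    posteriors = np.array([0.3, 0.7])     # illustrative P(w1 | x), P(w2 | x)

    decision = np.argmax(posteriors) + 1  # classes numbered 1 and 2
    p_error  = np.min(posteriors)         # P(error | x)
    print("decide class", decision, "with P(error | x) =", p_error)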

23
Bayesian Decision Theory – Continuous
Features

• Generalization of the preceding ideas

– Use of more than one feature


– Use more than two states of nature
– Allowing actions, and not only decisions on the state of nature
– Introducing a loss function that is more general than the probability of error

24
Classification to Minimize Loss
Let {ω1, ω2, …, ωm} be the set of m states of nature
(or "categories" or "classes")

Let {α1, α2, …, αn} be the set of n possible actions

25
Classification to Minimize Loss
Let {ω1, ω2, …, ωm} be the set of m states of nature
(or "categories" or "classes")

Let {α1, α2, …, αn} be the set of n possible actions

Let λ(αi | ωj) be the loss incurred for taking
action αi when the true state of nature (class) is ωj

26
Classification to Minimize Loss

The risk of taking action αi, given the observation x, is

R(αi | x) = Σ_j λ(αi | ωj) P(ωj | x),   j = 1, …, m

27
Classification to Minimize Loss

The risk of taking action αi, given the observation x, is

R(αi | x) = Σ_j λ(αi | ωj) P(ωj | x),   j = 1, …, m

Overall risk:
R = sum of R(αi | x) for i = 1, …, n

28
Classification to Minimize Loss

The risk of taking action αi, given the observation x, is

R(αi | x) = Σ_j λ(αi | ωj) P(ωj | x),   j = 1, …, m

Overall risk:
R = sum of R(αi | x) for i = 1, …, n

Minimizing R ⇔ minimizing R(αi | x) for i = 1, …, n
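A minimal Python sketch of minimum-risk classification; the loss matrix and posteriors are illustrative values, not taken from the slides:

    import numpy as np

    # lam[i, j] = loss lambda(a_i | w_j) for taking action a_i when the true class is w_j
    lam = np.array([[0.0, 2.0],      # losses for action a1
                    [1.0, 0.0]])     # losses for action a2
    posteriors = np.array([0.3, 0.7])        # P(w_j | x) for the observed x

    risks = lam @ posteriors                 # R(a_i | x) = sum_j lam_ij * P(w_j | x)
    best = np.argmin(risks)                  # Bayes decision: minimum conditional risk
    print(risks, "-> take action", best + 1)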

29
Classification to Minimize Loss
• Two-category classification
α1 : deciding ω1
α2 : deciding ω2

30
Classification to Minimize Loss
• Two-category classification
α1 : deciding ω1
α2 : deciding ω2
λij = λ(αi | ωj)
loss incurred for deciding ωi when the true state of nature is ωj

31
Classification to Minimize Loss
• Two-category classification
α1 : deciding ω1
α2 : deciding ω2
λij = λ(αi | ωj)
loss incurred for deciding ωi when the true state of nature is ωj

Conditional risk:

R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

32
Classification to Minimize Loss
Our rule is the following:
if R(α1 | x) < R(α2 | x)
take action α1: "decide ω1"

33
Classification to Minimize Loss
Our rule is the following:
if R(α1 | x) < R(α2 | x)
take action α1: "decide ω1"

Now use these formulas:

R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

34
Classification to Minimize Loss
Our rule is the following:
if R(α1 | x) < R(α2 | x)
take action α1: "decide ω1"

This results in the equivalent rule:

decide ω1 if:
(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)

and decide ω2 otherwise


35
Classification to Minimize Loss
The preceding rule
(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)

is equivalent to the following rule:

if   P(x | ω1) / P(x | ω2)  >  [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]

then take action α1 (decide ω1)
otherwise take action α2 (decide ω2)
36
Classification to Minimize Loss

The preceding rule

(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)

is equivalent to the following rule:

if   P(x | ω1) / P(x | ω2)  >  [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]

(the left-hand side is the likelihood ratio)

then take action α1 (decide ω1)
otherwise take action α2 (decide ω2)
37
Classification to Minimize Loss
Optimal decision property

"If the likelihood ratio exceeds a threshold value
independent of the input pattern x, we can take optimal actions"

P(x | ω1) / P(x | ω2)  >  [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
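A minimal Python sketch of this likelihood-ratio test; the loss, prior, and likelihood values are illustrative:

    # Likelihood-ratio test: compare p(x | w1) / p(x | w2) with a threshold
    # that depends only on losses and priors, not on x. Values are illustrative.
    lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0   # losses lambda_ij
    p_w1, p_w2 = 0.4, 0.6                             # priors P(w1), P(w2)
    px_w1, px_w2 = 0.5, 0.1                           # likelihoods at the observed x

    threshold = (lam12 - lam22) / (lam21 - lam11) * (p_w2 / p_w1)
    ratio = px_w1 / px_w2                             # likelihood ratio
    print("decide w1" if ratio > threshold else "decide w2")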

38
Minimum Error Rate Classification
Assume the (zero-one) loss function for the two-class case:

λ(αi | ωj) = 0 if i = j, and 1 if i ≠ j

The risk is now:

R(αi | x) = Σ_{j ≠ i} P(ωj | x) = 1 − P(ωi | x)

39
Minimum Error Rate Classification
Assume the (zero-one) loss function for the two-class case:

λ(αi | ωj) = 0 if i = j, and 1 if i ≠ j

The risk is now:

R(αi | x) = 1 − P(ωi | x)

Minimizing R ⇔ minimizing R(αi | x) ⇔ maximizing P(ωi | x)


Minimum Error Rate Classification

For loss minimization, our rule was:


if R(α1 | x) < R(α2 | x)
take action α1: "decide ω1"
Minimum Error Rate Classification

For loss minimization, our rule was:


if R(α1 | x) < R(α2 | x)
take action α1: "decide ω1"

which, under the zero-one loss, is equivalent to:

decide ω1 if P(ω1 | x) > P(ω2 | x)


Classifiers, Discriminant Functions
and Decision Surfaces
Classifiers, Discriminant Functions
and Decision Surfaces
• Remember the Bayesian classifier:
Classifiers, Discriminant Functions
and Decision Surfaces
• Remember the Bayesian classifier:

gi(x) > gj(x)


Classifiers, Discriminant Functions
and Decision Surfaces
• Remember the Bayesian classifier:

• This is equivalent to:


decide x to class ωi
if gi(x) > gj(x) ∀ j ≠ i
where gi(x) = P(ωi | x), i = 1, …, c
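A minimal Python sketch of such a discriminant-function classifier; the posterior values standing in for gi(x) are illustrative:

    import numpy as np

    # Assign x to the class whose discriminant g_i(x) = P(w_i | x) is largest.
    g = np.array([0.2, 0.5, 0.3])        # illustrative g_i(x) for c = 3 classes
    predicted = np.argmax(g) + 1         # classes numbered 1..c
    print("assign x to class", predicted)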
Classifiers, Discriminant Functions
and Decision Surfaces

[Figure: c discriminant functions g1(x), …, gc(x) computed from a d-dimensional feature vector]
47
Classifiers, Discriminant Functions
and Decision Surfaces

• Based on minimum risk classification


– Let gi(x) = −R(αi | x)
(max. discriminant corresponds to min. risk!)

48
Classifiers, Discriminant Functions
and Decision Surfaces

• For the minimum error rate, we take


gi(x) = P(ωi | x)

(max. discriminant corresponds to max. posterior!)

49
Classifiers, Discriminant Functions
and Decision Surfaces

• For the minimum error rate, we take


gi(x) = P(ωi | x)

Some alternative representations giving equivalent results:

gi(x) = P(x | ωi) P(ωi)
gi(x) = ln P(x | ωi) + ln P(ωi)
(ln: natural logarithm!)
50
Decision Surface
• Let, Ri and Rj : two regions identifying classes ωi and
ωj

[Figure: feature space partitioned into decision regions Ri and Rj]
Decision Surface
• Let, Ri and Rj : two regions identifying classes ωi and
ωj
Decision Rule: Decide ωi if P(ωi | x) > P(ωj | x)

[Figure: feature space partitioned into decision regions Ri and Rj]
Decision Surface
• Let, Ri and Rj : two regions identifying classes ωi and
ωj
Decision Rule: Decide ωi if P(ωi | x) − P(ωj | x) > 0

[Figure: feature space partitioned into decision regions Ri and Rj]
Decision Surface
• Let, Ri and Rj : two regions identifying classes ωi and
ωj
Decision Rule: Decide ωi if g(x) ≡ P(ωi | x) − P(ωj | x) > 0

[Figure: feature space partitioned into decision regions Ri and Rj]
Decision Surface
• Let, Ri, Rj : two regions identifying classes ωi and ωj

g(x) ≡ P(ωi | x) − P(ωj | x) > 0


Decision Surface
• If Ri, Rj : two regions identifying classes ωi and ωj

g(x) ≡ P(ωi | x) − P(ωj | x) > 0

g(x) = 0 : the decision surface
56
Decision Surface
• If Ri, Rj : two regions identifying classes ωi and ωj

g(x) ≡ P(ωi | x) − P(ωj | x) > 0

g(x) > 0 on the Ri side (+), g(x) < 0 on the Rj side (−)
g(x) = 0 : the decision surface
57
Decision Surface
• If Ri, Rj : two regions identifying classes ωi and ωj

g(x) ≡ P(ωi | x) − P(ωj | x) > 0

Ri : P(ωi | x) > P(ωj | x)   (the + side, g(x) > 0)
Rj : P(ωj | x) > P(ωi | x)   (the − side, g(x) < 0)

g(x) = 0 : the decision surface
58
Decision Surface in Multi-categories

• Feature space is divided into c decision regions


if gi(x) > gj(x) ∀ j ≠ i, then x is in Ri
(Ri means: assign x to ωi)

59
Decision Surface in Two-categories

• The two-category case


– A classifier is a “dichotomizer” that has two
discriminant functions g1 and g2

Let g(x) ≡ g1(x) − g2(x)

Decide ω1 if g(x) > 0; otherwise decide ω2

60
– The computation of g(x) (equivalent forms):

g(x) = P(ω1 | x) − P(ω2 | x)
g(x) = P(x | ω1) P(ω1) − P(x | ω2) P(ω2)
g(x) = ln [ P(x | ω1) / P(x | ω2) ] + ln [ P(ω1) / P(ω2) ]
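A minimal Python sketch of the dichotomizer using the log form of g(x); the likelihood and prior values are illustrative:

    import numpy as np

    # g(x) = ln[P(x|w1)/P(x|w2)] + ln[P(w1)/P(w2)]; decide w1 if g(x) > 0.
    px_w1, px_w2 = 0.15, 0.40     # class-conditional densities at the observed x
    p_w1, p_w2   = 0.6, 0.4       # priors

    g = np.log(px_w1 / px_w2) + np.log(p_w1 / p_w2)
    print("g(x) =", g, "->", "decide w1" if g > 0 else "decide w2")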

61
62
The Normal Density
• Density which is analytically tractable
• Continuous density
• Many processes are asymptotically Gaussian
– Handwritten characters, speech sounds, and many more
– Any prototype corrupted by a random process

63
The Normal Density
• Univariate density:

P(x) = (1 / (√(2π) σ)) · exp[ −(1/2) ((x − μ) / σ)² ]

where:
μ = mean (or expected value) of x
σ² = expected squared deviation, or variance
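A minimal Python sketch that evaluates this density (the μ and σ values are illustrative):

    import numpy as np

    def univariate_normal(x, mu, sigma):
        """Univariate normal density N(mu, sigma^2) evaluated at x."""
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

    # Illustrative parameters: mean 0, standard deviation 1
    print(univariate_normal(0.0, mu=0.0, sigma=1.0))   # about 0.3989, the peak of N(0, 1)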

64
65
66
• Multivariate density

– Multivariate normal density in d dimensions is:

P(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) · exp[ −(1/2) (x − μ)^t Σ⁻¹ (x − μ) ]

where:
x = (x1, x2, …, xd)^t (t stands for the transpose vector form)
μ = (μ1, μ2, …, μd)^t is the mean vector
Σ = the d×d covariance matrix
|Σ| and Σ⁻¹ are its determinant and inverse, respectively
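A minimal Python sketch that evaluates the multivariate density; the mean vector and covariance matrix are illustrative:

    import numpy as np

    def multivariate_normal_density(x, mu, sigma):
        """Multivariate normal density N(mu, Sigma) evaluated at the point x."""
        d = len(mu)
        diff = x - mu
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
        return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

    mu    = np.array([0.0, 0.0])                 # illustrative mean vector
    sigma = np.array([[1.0, 0.5],
                      [0.5, 2.0]])               # illustrative covariance matrix
    print(multivariate_normal_density(np.array([0.5, -0.5]), mu, sigma))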

67
• Multivariate density

– Multivariate normal density in d dimensions is:

P(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) · exp[ −(1/2) (x − μ)^t Σ⁻¹ (x − μ) ]

68
2D Gaussian Example - 1

70
2D Gaussian Example - 2

[Figure: 2D Gaussian density plot]

71
2D Gaussian Example - 3

[Figure: 2D Gaussian density plot]

72
Computer Exercise

• Use MATLAB to generate Gaussian plots

• Try with different Σ and μ
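For those without MATLAB, an equivalent sketch in Python (NumPy/Matplotlib); the μ and Σ values are illustrative and meant to be varied:

    import numpy as np
    import matplotlib.pyplot as plt

    # Evaluate a 2D Gaussian density on a grid and draw its contours.
    mu    = np.array([0.0, 0.0])
    sigma = np.array([[1.0, 0.6],
                      [0.6, 2.0]])

    xs, ys = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200))
    pts  = np.stack([xs - mu[0], ys - mu[1]], axis=-1)       # shape (200, 200, 2)
    quad = np.einsum('...i,ij,...j->...', pts, np.linalg.inv(sigma), pts)
    density = np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(sigma)))

    plt.contourf(xs, ys, density, levels=20)
    plt.colorbar()
    plt.title('2D Gaussian density')
    plt.show()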

73
Classification Example 2

(training data as in the Classification Example 2 table above)

• A married person with income 120K did not refund the loan previously
• Can we trust him?

74
