
Bayesian Decision Making

2 Introduction
• Classification differs from regression in that the output is a discrete label denoting the class of the entity.

• We will study the decision-making process by relying on probabilistic inference.

• Bayes' theorem is very powerful, and knowledge of it helps us design classifiers with appropriate decision surfaces.


3 Bayes Decision
• It is the decision-making process used when all underlying probability distributions are known.

• It is a generative approach: the class-conditional densities and priors are modeled explicitly.

• It is optimal, given that the distributions are known.

• In this discussion, we assume a supervised learning paradigm.


4 Problem
• Two types of fish, sea bass and salmon, arrive on a conveyor belt.

• Problem: we need to recognize the type of each fish.

• This is a 2-class (2-category) problem.


5 Prior probability

Let us denote the 2 classes by

• ω1 : Salmon
• ω2 : Sea bass

• Let us say that salmon occurs more often than sea bass. (This information is known as the prior and is generally estimated by experimentation.)

Mathematically, P(ω1) > P(ω2)


6
Estimation of prior probabilities
by frequentist approach
• Assume that out of 8000 fish used for training, we observed that 6000 were salmon and 2000 were sea bass.

• Accordingly, we have

P(ω1) = 6000/8000 = 0.75, the prior probability of salmon

P(ω2) = 2000/8000 = 0.25, the prior probability of sea bass


7
Simple Decision making
based on prior knowledge
Assign an unknown fish to salmon (ω1) if
P(ω1) > P(ω2);
else assign it to sea bass (ω2). (A sketch of this rule follows below.)
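A minimal sketch in Python of the frequentist prior estimate and the prior-only rule above; the counts are the ones quoted on the previous slide, and no external libraries are needed.

```python
# Minimal sketch: frequentist estimation of priors from the slide's counts,
# followed by the prior-only decision rule (always picks the larger prior).
counts = {"salmon": 6000, "sea bass": 2000}   # training counts from the slides
total = sum(counts.values())
priors = {cls: n / total for cls, n in counts.items()}
print(priors)  # {'salmon': 0.75, 'sea bass': 0.25}

# Prior-only rule: assign every unknown fish to the class with the larger prior.
decision = max(priors, key=priors.get)
print(decision)  # 'salmon' -- every test fish receives this label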


8
Issues with using only
prior knowledge
• The decision rule is flawed: salmon will always be favored, and every test fish will be assigned to it.

• Such an approach will not work in practice.

• Sole dependence on the prior probability for making decisions is not a good idea.


9 Need for features
• Solution: look for additional information to describe the sea bass and salmon.

• The key idea is to look for discriminative features.

• Features like length, width, color, life span, and texture may describe salmon and sea bass.


10 Feature vector
• Assume that we have d features.

• Let the set of features describing a fish be represented by a d-dimensional feature vector x.


11 Class Conditional Density
• For each class/category of fish, we can model the d features as coming from a probability distribution function.

• This pdf is referred to as the ‘class conditional density’.

• The features are continuous in nature.

• Note that, for a set of d features describing the fish, we work with a d-dimensional probability distribution.

12

• For the time being, let us work on how to improve our decision process by incorporating a single feature x.

• Later, we extend the framework to d-dimensional features, and also to more than 2 classes.


13 Class conditional Density


14 Bayes Theorem
• Bayes' theorem:

P(ωj | x) = p(x | ωj) P(ωj) / P(x)

posterior = (likelihood × prior) / evidence

In the case of two categories:

P(x) = ∑_{j=1}^{2} p(x | ωj) P(ωj)

(A numerical sketch of this computation follows below.)
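A minimal sketch of the posterior computation, assuming NumPy and SciPy are available. The Gaussian class-conditional parameters are made-up illustrative values; only the priors 0.75 / 0.25 come from the earlier slides.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch: posteriors for the two-class case via Bayes' theorem.
priors = np.array([0.75, 0.25])           # P(w1) = salmon, P(w2) = sea bass
dists  = [norm(loc=10.0, scale=2.0),      # p(x | w1), hypothetical parameters
          norm(loc=14.0, scale=2.5)]      # p(x | w2), hypothetical parameters

def posteriors(x):
    likelihoods = np.array([d.pdf(x) for d in dists])  # p(x | wj)
    joint = likelihoods * priors                        # p(x | wj) P(wj)
    evidence = joint.sum()                              # P(x)
    return joint / evidence                             # P(wj | x)

print(posteriors(12.0))   # two numbers summing to 1
```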


15
Posterior probability plot
for the two classes


16
Decision based on
posterior probabilities
• Decision given the posterior probabilities

Let x be an observation. Then:

if P(ω1 | x) > P(ω2 | x), decide that the true state of nature is ω1

if P(ω1 | x) < P(ω2 | x), decide that the true state of nature is ω2

Therefore, whenever we observe a particular x, the probability of error is:

P(error | x) = P(ω1 | x) if we decide ω2

P(error | x) = P(ω2 | x) if we decide ω1


17
Decision based on
posterior probabilities
• Minimizing the probability of error

Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2.

Therefore:
P(error | x) = min [ P(ω1 | x), P(ω2 | x) ]
(Bayes decision)


18 Probability of error

P(error) = ∫_{−∞}^{∞} P(error, x) dx = ∫_{−∞}^{∞} P(error | x) p(x) dx

• We want P(error | x) to be as small as possible for every value of x.

The Bayes classifier scheme strives to achieve exactly that (a numerical sketch follows below).
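A sketch of this integral evaluated numerically for two hypothetical 1-D Gaussian class conditionals; the distribution parameters are illustrative, not from the slides.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch: numerically evaluating P(error) = integral of P(error|x) p(x) dx
# with P(error|x) taken as the smaller of the two posteriors at each x.
priors = [0.75, 0.25]
dists  = [norm(10.0, 2.0), norm(14.0, 2.5)]   # p(x|w1), p(x|w2), illustrative

x = np.linspace(-5, 30, 20001)
joint = np.vstack([p * d.pdf(x) for p, d in zip(priors, dists)])  # p(x|wj) P(wj)
p_x = joint.sum(axis=0)                                            # evidence p(x)
p_error_given_x = joint.min(axis=0) / p_x          # min posterior at each x
p_error = np.trapz(p_error_given_x * p_x, x)       # integral of P(error|x) p(x) dx
print(p_error)
```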


19
Bayesian classification framework for high
dimensional features and more classes

Let {ω1, ω2, …, ωC} be the set of “C” states of nature (or “categories” / “classes”).

Assume that, for an unknown pattern, a d-dimensional feature vector x is constructed.

From Bayes' rule:

P(ωj | x) = p(x | ωj) P(ωj) / P(x)

We compute the posterior probability of the pattern with respect to each of the “C” classes.

In the decision-making step, we assign the pattern to the class for which the posterior probability is greatest.


20
Bayesian classification framework for high
dimensional features and more classes

ωtest = arg max_j P(ωj | x) = arg max_j [ p(x | ωj) P(ωj) / P(x) ],   j = 1, 2, …, C

P(x) = ∑_{j=1}^{C} p(x | ωj) P(ωj)
(the evidence acts as a normalization factor; the term is the same for all classes)

ωtest is the class for which the posterior probability is highest. The pattern is assigned to this class.
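A minimal sketch of the arg-max decision rule above; the likelihood values and priors passed in are purely illustrative.

```python
import numpy as np

# Minimal sketch of the multi-class decision rule: pick the class with the
# largest posterior.
def bayes_decision(likelihoods, priors):
    """likelihoods[j] = p(x | w_j), priors[j] = P(w_j); returns argmax_j P(w_j|x)."""
    joint = np.asarray(likelihoods) * np.asarray(priors)   # p(x|wj) P(wj)
    posteriors = joint / joint.sum()                        # divide by evidence P(x)
    return int(np.argmax(posteriors)), posteriors

idx, post = bayes_decision([0.05, 0.20, 0.01], [0.5, 0.3, 0.2])
print(idx, post)   # the evidence only rescales, so it never changes the argmax
```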


21 Risk minimization framework
Let {ω1, ω2, …, ωC} be the set of “C” states of nature (or “categories”).

Let {α1, α2, …, αa} be the set of “a” possible actions.

Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj.


22 Risk minimization framework

The expected loss of taking action αi, given an observation with feature vector x, is the conditional risk:

R(αi | x) = ∑_{j=1}^{C} λ(αi | ωj) P(ωj | x)

At every x, a decision α(x) is made by minimizing this expected loss (see the sketch below).

Our final goal is to minimize the total risk over all x:

R = ∫ R(α(x) | x) p(x) dx
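A minimal sketch of the conditional-risk computation, assuming NumPy; the loss matrix and posterior values below are illustrative, not from the slides.

```python
import numpy as np

# Minimal sketch of risk minimization: the conditional risk of each action is
# R(a_i|x) = sum_j lambda(a_i|w_j) P(w_j|x); choose the action minimizing it.
loss = np.array([[0.0, 2.0],      # lambda(a1|w1), lambda(a1|w2)
                 [1.0, 0.0]])     # lambda(a2|w1), lambda(a2|w2)
posterior = np.array([0.6, 0.4])  # P(w1|x), P(w2|x) at some observation x

cond_risk = loss @ posterior       # R(a_i|x) for every action i
best_action = int(np.argmin(cond_risk))
print(cond_risk, best_action)      # [0.8, 0.6] -> take action a2 at this x
```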


23 Reading

• Sections 2.1, 2.2 :

Duda, Hart , Stork : Pattern Classification


Bayesian Decision Making

2 Risk minimization framework
Let {ω1, ω2, …, ωC} be the set of “C” states of nature (or “categories”).

Let {α1, α2, …, αa} be the set of “a” possible actions.

Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj.


3 Risk minimization framework

The expected loss of taking action αi, given an observation with feature vector x, is the conditional risk:

R(αi | x) = ∑_{j=1}^{C} λ(αi | ωj) P(ωj | x)

At every x, a decision α(x) is made by minimizing this expected loss.

Our final goal is to minimize the total risk over all x:

R = ∫ R(α(x) | x) p(x) dx


4 Risk minimization framework
• Two-category classification
α1 : deciding ω1
α2 : deciding ω2
λij = λ(αi | ωj) is loss incurred for deciding ωi
when the true state of nature is ωj

Conditional risk:
R(α1 | x) = λ11P(ω1 | x) + λ12P(ω2 | x)
R(α2 | x) = λ21P(ω1 | x) + λ22P(ω2 | x)


5 Risk minimization framework
Our rule is the following:

if R(α1 | x) < R(α2 | x), take action α1: “decide ω1”.

This results in the equivalent rule:

Decide ω1 if:
(λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2)

and decide ω2 otherwise.


6
Likelihood ratio

The preceding rule is equivalent to the following rule:

if  p(x | ω1) / p(x | ω2)  >  [ (λ12 − λ22) / (λ21 − λ11) ] · [ P(ω2) / P(ω1) ]

then take action α1 (decide ω1);

otherwise take action α2 (decide ω2).


7
Likelihood ratio

• The decision regions are therefore determined by comparing the likelihood ratio against a fixed threshold (for the zero-one loss, this threshold reduces to P(ω2) / P(ω1)):

Let θλ = [ (λ12 − λ22) / (λ21 − λ11) ] · [ P(ω2) / P(ω1) ]

Then decide ω1 if:  p(x | ω1) / p(x | ω2) > θλ
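A minimal sketch of the thresholded likelihood-ratio rule; the loss matrix, priors, and 1-D Gaussian class conditionals are all assumed illustrative values.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch of the likelihood-ratio rule: decide w1 when
# p(x|w1)/p(x|w2) exceeds theta_lambda.
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])                  # lam[i, j] = lambda(a_i | w_j)
priors = [0.75, 0.25]
dists  = [norm(10.0, 2.0), norm(14.0, 2.5)]   # p(x|w1), p(x|w2), illustrative

theta = (lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0]) * priors[1] / priors[0]

def decide(x):
    ratio = dists[0].pdf(x) / dists[1].pdf(x)   # likelihood ratio
    return "w1" if ratio > theta else "w2"

print(theta, decide(11.0), decide(15.0))
```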


8 Likelihood ratio


9 Likelihood ratio


10 Likelihood ratio
Optimal decision property

“If the likelihood ratio exceeds a threshold value independent of the input pattern x, we can take optimal actions.”


11 Zero-one loss function
• Actions are decisions on classes.

• If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j.

• We seek a decision rule that minimizes the probability of error, i.e., the error rate.


12 Zero-one loss function
• Introduction of the zero-one loss function:

λ(αi | ωj) = 0 if i = j,  1 if i ≠ j     (i, j = 1, …, C)

All errors are equally costly. Therefore, the conditional risk is:

R(αi | x) = ∑_{j=1}^{C} λ(αi | ωj) P(ωj | x) = ∑_{j≠i} P(ωj | x) = 1 − P(ωi | x)

“The risk corresponding to this loss function is the average probability of error.”
13 Zero-one loss function
• Minimizing the risk therefore requires maximizing P(ωi | x)
(since R(αi | x) = 1 − P(ωi | x))

• For the minimum error rate:
decide ωi if P(ωi | x) > P(ωj | x) ∀ j ≠ i
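A minimal sketch verifying this equivalence numerically, for an illustrative 3-class posterior.

```python
import numpy as np

# Minimal sketch: under the zero-one loss, the conditional risk of deciding
# w_i equals 1 - P(w_i|x), so minimizing risk is the same as picking the
# maximum posterior.
posterior = np.array([0.2, 0.5, 0.3])   # P(w_j|x) for a 3-class problem (illustrative)
loss = 1.0 - np.eye(3)                  # lambda(a_i|w_j): 0 if i == j, else 1

cond_risk = loss @ posterior            # R(a_i|x) = sum_j lambda * P(w_j|x)
print(cond_risk)                        # equals 1 - posterior
print(np.argmin(cond_risk) == np.argmax(posterior))   # True: same decision
```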


Bayesian Decision Making

Discriminant functions, Normal


Distribution

2
Classifiers, Discriminant Functions
and Decision Surfaces

• The multi-category case

– Set of discriminant functions gi(x), i = 1, …, c

– The classifier assigns a feature vector x to class ωi if:
gi(x) > gj(x) ∀ j ≠ i


3
Generic view of a
pattern classifier


4
Discriminant Functions

• Let gi(x) = −R(αi | x)
(the maximum discriminant then corresponds to the minimum risk!)

• For the minimum error rate, we take
gi(x) = P(ωi | x)
(the maximum discriminant then corresponds to the maximum posterior!)

• Equivalent choices that give the same decision:
gi(x) ≡ p(x | ωi) P(ωi)
gi(x) = ln p(x | ωi) + ln P(ωi)
(ln: natural logarithm)
5 Discriminant functions
• Discriminant functions do not change the decision when scaled by a positive constant ‘k’.

• The decision is not affected when the same constant is added to all the discriminant functions.


6
Discriminant Functions

• The feature space is divided into c decision regions:
if gi(x) > gj(x) ∀ j ≠ i, then x is in Ri
(Ri means: assign x to ωi)

• The two-category case

– A classifier is a “dichotomizer” that has two discriminant functions g1 and g2

Let g(x) ≡ g1(x) – g2(x)

Decide ω1 if g(x) > 0; otherwise decide ω2



7 Dichotomizer

gi(x) = ln p(x | ωi) + ln P(ωi)
(ln: natural logarithm!)

– The computation of g(x) for the dichotomizer:

g(x) = P(ω1 | x) − P(ω2 | x)

or

g(x) = ln [ p(x | ω1) / p(x | ω2) ] + ln [ P(ω1) / P(ω2) ]
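A minimal sketch of such a dichotomizer using the log form of g(x); the 1-D Gaussian class conditionals are hypothetical and the priors are the ones used earlier.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch of a dichotomizer with
# g(x) = ln p(x|w1) - ln p(x|w2) + ln P(w1) - ln P(w2).
priors = [0.75, 0.25]
dists  = [norm(10.0, 2.0), norm(14.0, 2.5)]   # illustrative class conditionals

def g(x):
    return (dists[0].logpdf(x) - dists[1].logpdf(x)
            + np.log(priors[0]) - np.log(priors[1]))

for x in (9.0, 12.0, 16.0):
    print(x, "-> w1" if g(x) > 0 else "-> w2")
```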


8 Normal /Gaussian Distribution


9


10 Multivariate Gaussian Distribution

p(x) = 1 / ( (2π)^{d/2} |Σ|^{1/2} ) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

where μ is the d-dimensional mean vector and Σ is the d × d covariance matrix (a sketch of evaluating this density follows below).
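A minimal sketch that evaluates this density directly from the formula, for illustrative 2-D parameters.

```python
import numpy as np

# Minimal sketch: evaluating the d-dimensional Gaussian density from the formula.
def mvn_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(Sigma)
    norm_const = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ inv @ diff)

mu = np.array([0.0, 0.0])                 # illustrative mean vector
Sigma = np.array([[2.0, 0.5],             # illustrative covariance matrix
                  [0.5, 1.0]])
print(mvn_pdf(np.array([1.0, -1.0]), mu, Sigma))
# Optional cross-check: scipy.stats.multivariate_normal(mu, Sigma).pdf([1.0, -1.0])
```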


11


12 Multivariate Gaussian Distribution


13 Covariance matrix property
• If w is any d-dimensional vector, the variance of wᵀx cannot be negative.

• This leads to the quadratic form wᵀΣw being non-negative → Σ is positive semi-definite.

• The eigenvalues of Σ are non-negative, as checked in the sketch below.
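A small sketch that checks these properties on a sample covariance matrix estimated from random, purely illustrative data.

```python
import numpy as np

# Minimal sketch: a sample covariance matrix is positive semi-definite, so
# every quadratic form w' Sigma w is >= 0 and all eigenvalues are >= 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # 500 samples, d = 3 features (illustrative)
Sigma = np.cov(X, rowvar=False)          # d x d covariance estimate

w = rng.normal(size=3)                   # an arbitrary d-dimensional vector
print(w @ Sigma @ w >= 0)                # the quadratic form is non-negative
print(np.linalg.eigvalsh(Sigma))         # eigenvalues are all non-negative
```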


14 Multivariate Gaussian Distribution

Reminder: the covariance matrix is symmetric and positive semi-definite.

Entropy is a measure of uncertainty.

The normal distribution has the maximum entropy among all distributions with a given mean and variance.


15 2D Gaussian


16 Multivariate Gaussian Distribution


17 Mahalanobis Distance
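The figure for this slide is not reproduced here; as a sketch, the squared Mahalanobis distance r² = (x − μ)ᵀ Σ⁻¹ (x − μ) can be computed as follows, with μ and Σ taken as illustrative values.

```python
import numpy as np

# Minimal sketch of the (squared) Mahalanobis distance from a point x to the
# mean mu of a distribution with covariance Sigma.
def mahalanobis_sq(x, mu, Sigma):
    diff = x - mu
    return float(diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])                 # illustrative parameters
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mahalanobis_sq(np.array([1.0, -1.0]), mu, Sigma))
# Points at equal Mahalanobis distance from mu lie on the same ellipsoidal
# contour of the Gaussian density.
```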


Bayesian Decision Making

2 Error Computation

P(e) = P(x ∈ R1, ω2) + P(x ∈ R2, ω1)

     = P(x ∈ R1 | ω2) P(ω2) + P(x ∈ R2 | ω1) P(ω1)

     = ∫_{R1} p(x | ω2) P(ω2) dx + ∫_{R2} p(x | ω1) P(ω1) dx


3 Error computation


4 Error computation

P(correct) = ∑_{i=1}^{C} P(x ∈ Ri, ωi)

           = ∑_{i=1}^{C} P(x ∈ Ri | ωi) P(ωi)

           = ∑_{i=1}^{C} ∫_{Ri} p(x | ωi) P(ωi) dx

(A numerical sketch follows below.)
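A sketch that evaluates P(correct) on a grid for two hypothetical 1-D Gaussian class conditionals, taking the decision regions from the maximum-posterior rule; all parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch: P(correct) = sum_i integral over R_i of p(x|w_i) P(w_i) dx,
# with the regions R_i defined by the maximum-posterior decision rule.
priors = [0.75, 0.25]
dists  = [norm(10.0, 2.0), norm(14.0, 2.5)]   # illustrative class conditionals

x = np.linspace(-5, 30, 20001)
joint = np.vstack([p * d.pdf(x) for p, d in zip(priors, dists)])  # p(x|wi) P(wi)
regions = joint.argmax(axis=0)             # R_i = the set where class i wins

p_correct = sum(np.trapz(np.where(regions == i, joint[i], 0.0), x)
                for i in range(2))
print(p_correct, 1.0 - p_correct)          # P(correct) and P(error)
```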


5 Bayes Decision Theory – Discrete Features

• Components of x are binary or integer valued.

• x can take only one of m discrete values


v1, v2, …, vm


6
Bayes Decision Theory – Discrete
Features


7
Bayes Decision Theory – Discrete
Features
• Case of independent binary features in a 2-category problem

Let x = [x1, x2, …, xd]ᵀ, where each xi is either 0 or 1, with probabilities:

pi = P(xi = 1 | ω1)
qi = P(xi = 1 | ω2)


8
Bayes Decision Theory – Discrete
Features
Assuming conditional independence of the components of the feature vector, the likelihoods can be computed as:

P(x | ω1) = ∏_{i=1}^{d} pi^{xi} (1 − pi)^{1 − xi}

P(x | ω2) = ∏_{i=1}^{d} qi^{xi} (1 − qi)^{1 − xi}


9
Bayes Decision Theory – Discrete
Features

If pi = qi for all i, the two likelihoods are identical and the decision depends on the prior only.


10
Bayes Decision Theory – Discrete
Features
We have:

g(x) = ∑_{i=1}^{d} wi xi + w0

where:

wi = ln [ pi (1 − qi) / ( qi (1 − pi) ) ],   i = 1, …, d

and:

w0 = ∑_{i=1}^{d} ln [ (1 − pi) / (1 − qi) ] + ln [ P(ω1) / P(ω2) ]

Decide ω1 if g(x) > 0 and ω2 if g(x) ≤ 0, as illustrated in the sketch below.
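A minimal sketch of this linear discriminant; the probabilities pi, qi and the priors are illustrative values, not taken from the slides.

```python
import numpy as np

# Minimal sketch of the linear discriminant for independent binary features.
p = np.array([0.8, 0.6, 0.3])      # p_i = P(x_i = 1 | w1), illustrative
q = np.array([0.2, 0.5, 0.7])      # q_i = P(x_i = 1 | w2), illustrative
priors = (0.5, 0.5)                # P(w1), P(w2), illustrative

w  = np.log(p * (1 - q) / (q * (1 - p)))                      # weights w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(priors[0] / priors[1])

def decide(x):
    g = w @ x + w0                 # g(x) = sum_i w_i x_i + w_0
    return "w1" if g > 0 else "w2"

print(decide(np.array([1, 1, 0])), decide(np.array([0, 0, 1])))
```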
11 Evaluation of a classifier

The four possible outcomes of a two-class decision:

• True positive / hit
• False positive / false alarm
• True negative / correct rejection
• False negative / miss

12 Discriminability ratio
The discriminability between two univariate normal distributions with means μ1, μ2 and a common standard deviation σ can be summarized by

d′ = |μ2 − μ1| / σ


13
Receiver operating
characteristics
Consider a pair of distributions with discriminability d′.

Vary the threshold x*.

We note that the true positive (hit) probability and the false positive (false alarm) probability will vary with the threshold x*.

The variation of these probabilities for a given discriminability traces out a smooth curve, sketched numerically below.

This curve is called the ‘Receiver Operating Characteristic’ (ROC).
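A minimal sketch that traces such a curve for two unit-variance Gaussians separated by an assumed discriminability d′ = 2.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch of an ROC curve: sweep the threshold x* and record the hit
# (true positive) and false alarm (false positive) probabilities for two
# unit-variance Gaussians separated by d' (illustrative value).
d_prime = 2.0
noise, signal = norm(0.0, 1.0), norm(d_prime, 1.0)

thresholds = np.linspace(-4, 4 + d_prime, 200)
hit_rate   = signal.sf(thresholds)     # P(x > x* | target class)
false_rate = noise.sf(thresholds)      # P(x > x* | other class)

for t, h, f in zip(thresholds[::50], hit_rate[::50], false_rate[::50]):
    print(f"x*={t:5.2f}  hit={h:.3f}  false alarm={f:.3f}")
# Plotting hit_rate against false_rate traces the smooth ROC curve.
```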
