
Bayesian Decision Making

2 Introduction
• Classification differs from regression in that the output is a discrete label denoting the class of the entity.

• We will study the decision-making process by relying on probabilistic inference.

• Bayes' theorem is very powerful, and knowledge of it helps us design classifiers with appropriate decision surfaces.


3 Bayes Decision
• It is the decision-making process used when all underlying probability distributions are known.

• It is a generative approach: the class-conditional densities and priors are modeled explicitly.

• It is optimal, given that the distributions are known.

• In this discussion, we assume a supervised learning paradigm.


4 Problem
• Two types of fish, sea bass and salmon, arrive on a conveyor belt.

• Problem: we need to recognize the type of each fish.

• This is a 2-class (2-category) problem.


5 Prior probability

Let us denote the 2 classes by

• ω1 : Salmon
• ω2 : Sea bass

• Let us say that salmon occurs more often than sea bass. (This information is known as the prior and is generally estimated by experimentation.)

Mathematically, P(ω1) > P(ω2)


6
Estimation of prior probabilities
by frequentist approach
• Assume that out of 8000 fish used for training, we observed that 6000 were salmon and 2000 were sea bass.

• Accordingly, we have

P(ω1) = 6000/8000 = 0.75, the prior probability of salmon

P(ω2) = 2000/8000 = 0.25, the prior probability of sea bass


7
Simple Decision making
based on prior knowledge
Assign an unknown fish to salmon (ω1) if
P(ω1) > P(ω2);
else assign it to sea bass (ω2). (A sketch of this rule follows below.)
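A minimal sketch in Python of the frequentist prior estimate and the prior-only rule above; the counts are the ones quoted on the previous slide, and no external libraries are needed.

```python
# Minimal sketch: frequentist estimation of priors from the slide's counts,
# followed by the prior-only decision rule (always picks the larger prior).
counts = {"salmon": 6000, "sea bass": 2000}   # training counts from the slides
total = sum(counts.values())
priors = {cls: n / total for cls, n in counts.items()}
print(priors)  # {'salmon': 0.75, 'sea bass': 0.25}

# Prior-only rule: assign every unknown fish to the class with the larger prior.
decision = max(priors, key=priors.get)
print(decision)  # 'salmon' -- every test fish receives this label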


8
Issues with using only
prior knowledge
• The decision rule is flawed: salmon will always be favored, and every test fish will be assigned to it.

• Such an approach will not work in practice.

• Sole dependence on the prior probability for making decisions is not a good idea.


9 Need for features
• Solution: look for additional information to describe the sea bass and salmon.

• The key idea is to look for discriminative features.

• Features like length, width, color, life span, and texture may describe salmon and sea bass.


10 Feature vector
• Assume that we have d features.

• Let the set of features describing a fish be represented by a d-dimensional feature vector x.


11 Class Conditional Density
• For each class/category of fish, we can model the d features as coming from a probability distribution function.

• This pdf is referred to as the ‘class conditional density’.

• The features are continuous in nature.

• Note that, for a set of d features describing the fish, we work with a d-dimensional probability distribution.

12

• For the time being, let us work on how to improve our decision process by incorporating a single feature x.

• Later, we extend the framework to d-dimensional features, and also to more than 2 classes.


13 Class conditional Density


14 Bayes Theorem
• Bayes' theorem:

P(ωj | x) = p(x | ωj) P(ωj) / P(x)

posterior = (likelihood × prior) / evidence

In the case of two categories:

P(x) = ∑_{j=1}^{2} p(x | ωj) P(ωj)

(A numerical sketch of this computation follows below.)
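A minimal sketch of the posterior computation, assuming NumPy and SciPy are available. The Gaussian class-conditional parameters are made-up illustrative values; only the priors 0.75 / 0.25 come from the earlier slides.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch: posteriors for the two-class case via Bayes' theorem.
priors = np.array([0.75, 0.25])           # P(w1) = salmon, P(w2) = sea bass
dists  = [norm(loc=10.0, scale=2.0),      # p(x | w1), hypothetical parameters
          norm(loc=14.0, scale=2.5)]      # p(x | w2), hypothetical parameters

def posteriors(x):
    likelihoods = np.array([d.pdf(x) for d in dists])  # p(x | wj)
    joint = likelihoods * priors                        # p(x | wj) P(wj)
    evidence = joint.sum()                              # P(x)
    return joint / evidence                             # P(wj | x)

print(posteriors(12.0))   # two numbers summing to 1
```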


15
Posterior probability plot
for the two classes


16
Decision based on
posterior probabilities
• Decision given the posterior probabilities

Let x be an observation. Then:

if P(ω1 | x) > P(ω2 | x), decide that the true state of nature is ω1

if P(ω1 | x) < P(ω2 | x), decide that the true state of nature is ω2

Therefore, whenever we observe a particular x, the probability of error is:

P(error | x) = P(ω1 | x) if we decide ω2

P(error | x) = P(ω2 | x) if we decide ω1


17
Decision based on
posterior probabilities
• Minimizing the probability of error

Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2.

Therefore:
P(error | x) = min [ P(ω1 | x), P(ω2 | x) ]
(Bayes decision)


18 Probability of error

P(error) = ∫_{−∞}^{∞} P(error, x) dx = ∫_{−∞}^{∞} P(error | x) p(x) dx

• We want P(error | x) to be as small as possible for every value of x.

The Bayes classifier scheme strives to achieve exactly that (a numerical sketch follows below).
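A sketch of this integral evaluated numerically for two hypothetical 1-D Gaussian class conditionals; the distribution parameters are illustrative, not from the slides.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch: numerically evaluating P(error) = integral of P(error|x) p(x) dx
# with P(error|x) taken as the smaller of the two posteriors at each x.
priors = [0.75, 0.25]
dists  = [norm(10.0, 2.0), norm(14.0, 2.5)]   # p(x|w1), p(x|w2), illustrative

x = np.linspace(-5, 30, 20001)
joint = np.vstack([p * d.pdf(x) for p, d in zip(priors, dists)])  # p(x|wj) P(wj)
p_x = joint.sum(axis=0)                                            # evidence p(x)
p_error_given_x = joint.min(axis=0) / p_x          # min posterior at each x
p_error = np.trapz(p_error_given_x * p_x, x)       # integral of P(error|x) p(x) dx
print(p_error)
```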


19
Bayesian classification framework for high
dimensional features and more classes

Let {ω1, ω2, …, ωC} be the set of “C” states of nature (or “categories” / “classes”).

Assume that, for an unknown pattern, a d-dimensional feature vector x is constructed.

From Bayes' rule:

P(ωj | x) = p(x | ωj) P(ωj) / P(x)

We compute the posterior probability of the pattern with respect to each of the “C” classes.

In the decision-making step, we assign the pattern to the class for which the posterior probability is greatest.


20
Bayesian classification framework for high
dimensional features and more classes

ωtest = arg max_j P(ωj | x) = arg max_j [ p(x | ωj) P(ωj) / P(x) ],   j = 1, 2, …, C

P(x) = ∑_{j=1}^{C} p(x | ωj) P(ωj)
(the evidence acts as a normalization factor; the term is the same for all classes)

ωtest is the class for which the posterior probability is highest. The pattern is assigned to this class.
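A minimal sketch of the arg-max decision rule above; the likelihood values and priors passed in are purely illustrative.

```python
import numpy as np

# Minimal sketch of the multi-class decision rule: pick the class with the
# largest posterior.
def bayes_decision(likelihoods, priors):
    """likelihoods[j] = p(x | w_j), priors[j] = P(w_j); returns argmax_j P(w_j|x)."""
    joint = np.asarray(likelihoods) * np.asarray(priors)   # p(x|wj) P(wj)
    posteriors = joint / joint.sum()                        # divide by evidence P(x)
    return int(np.argmax(posteriors)), posteriors

idx, post = bayes_decision([0.05, 0.20, 0.01], [0.5, 0.3, 0.2])
print(idx, post)   # the evidence only rescales, so it never changes the argmax
```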


21 Risk minimization framework
Let {ω1, ω2, …, ωC} be the set of “C” states of nature (or “categories”).

Let {α1, α2, …, αa} be the set of “a” possible actions.

Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj.


22 Risk minimization framework

The expected loss of taking action αi, given an observation with feature vector x, is the conditional risk:

R(αi | x) = ∑_{j=1}^{C} λ(αi | ωj) P(ωj | x)

At every x, a decision α(x) is made by minimizing this expected loss (see the sketch below).

Our final goal is to minimize the total risk over all x:

R = ∫ R(α(x) | x) p(x) dx
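A minimal sketch of the conditional-risk computation, assuming NumPy; the loss matrix and posterior values below are illustrative, not from the slides.

```python
import numpy as np

# Minimal sketch of risk minimization: the conditional risk of each action is
# R(a_i|x) = sum_j lambda(a_i|w_j) P(w_j|x); choose the action minimizing it.
loss = np.array([[0.0, 2.0],      # lambda(a1|w1), lambda(a1|w2)
                 [1.0, 0.0]])     # lambda(a2|w1), lambda(a2|w2)
posterior = np.array([0.6, 0.4])  # P(w1|x), P(w2|x) at some observation x

cond_risk = loss @ posterior       # R(a_i|x) for every action i
best_action = int(np.argmin(cond_risk))
print(cond_risk, best_action)      # [0.8, 0.6] -> take action a2 at this x
```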


23 Reading

• Sections 2.1, 2.2 :

Duda, Hart , Stork : Pattern Classification


Bayesian Decision Making

2 Risk minimization framework
Let {ω1, ω2, …, ωC} be the set of “C” states of nature (or “categories”).

Let {α1, α2, …, αa} be the set of “a” possible actions.

Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj.


3 Risk minimization framework

The expected loss of taking action αi, given an observation with feature vector x, is the conditional risk:

R(αi | x) = ∑_{j=1}^{C} λ(αi | ωj) P(ωj | x)

At every x, a decision α(x) is made by minimizing this expected loss.

Our final goal is to minimize the total risk over all x:

R = ∫ R(α(x) | x) p(x) dx


4 Risk minimization framework
• Two-category classification
α1 : deciding ω1
α2 : deciding ω2
λij = λ(αi | ωj) is loss incurred for deciding ωi
when the true state of nature is ωj

Conditional risk:
R(α1 | x) = λ11P(ω1 | x) + λ12P(ω2 | x)
R(α2 | x) = λ21P(ω1 | x) + λ22P(ω2 | x)


5 Risk minimization framework
Our rule is the following:

if R(α1 | x) < R(α2 | x), take action α1: “decide ω1”.

This results in the equivalent rule:

Decide ω1 if:
(λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2)

and decide ω2 otherwise.


6
Likelihood ratio

The preceding rule is equivalent to the following rule:

if  p(x | ω1) / p(x | ω2)  >  [ (λ12 − λ22) / (λ21 − λ11) ] · [ P(ω2) / P(ω1) ]

then take action α1 (decide ω1);

otherwise take action α2 (decide ω2).


7
Likelihood ratio

• The decision regions are therefore determined by comparing the likelihood ratio against a fixed threshold (for the zero-one loss, this threshold reduces to P(ω2) / P(ω1)):

Let θλ = [ (λ12 − λ22) / (λ21 − λ11) ] · [ P(ω2) / P(ω1) ]

Then decide ω1 if:  p(x | ω1) / p(x | ω2) > θλ
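A minimal sketch of the thresholded likelihood-ratio rule; the loss matrix, priors, and 1-D Gaussian class conditionals are all assumed illustrative values.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch of the likelihood-ratio rule: decide w1 when
# p(x|w1)/p(x|w2) exceeds theta_lambda.
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])                  # lam[i, j] = lambda(a_i | w_j)
priors = [0.75, 0.25]
dists  = [norm(10.0, 2.0), norm(14.0, 2.5)]   # p(x|w1), p(x|w2), illustrative

theta = (lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0]) * priors[1] / priors[0]

def decide(x):
    ratio = dists[0].pdf(x) / dists[1].pdf(x)   # likelihood ratio
    return "w1" if ratio > theta else "w2"

print(theta, decide(11.0), decide(15.0))
```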


8 Likelihood ratio


9 Likelihood ratio


10 Likelihood ratio
Optimal decision property

“If the likelihood ratio exceeds a threshold value independent of the input pattern x, we can take optimal actions.”


11 Zero-one loss function
• Actions are decisions on classes.

• If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j.

• We seek a decision rule that minimizes the probability of error, i.e., the error rate.


12 Zero-one loss function
• Introduction of the zero-one loss function:

λ(αi | ωj) = 0 if i = j,  1 if i ≠ j     (i, j = 1, …, C)

All errors are equally costly. Therefore, the conditional risk is:

R(αi | x) = ∑_{j=1}^{C} λ(αi | ωj) P(ωj | x) = ∑_{j≠i} P(ωj | x) = 1 − P(ωi | x)

“The risk corresponding to this loss function is the average probability of error.”
13 Zero-one loss function
• Minimizing the risk therefore requires maximizing P(ωi | x)
(since R(αi | x) = 1 − P(ωi | x))

• For the minimum error rate:
decide ωi if P(ωi | x) > P(ωj | x) ∀ j ≠ i
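A minimal sketch verifying this equivalence numerically, for an illustrative 3-class posterior.

```python
import numpy as np

# Minimal sketch: under the zero-one loss, the conditional risk of deciding
# w_i equals 1 - P(w_i|x), so minimizing risk is the same as picking the
# maximum posterior.
posterior = np.array([0.2, 0.5, 0.3])   # P(w_j|x) for a 3-class problem (illustrative)
loss = 1.0 - np.eye(3)                  # lambda(a_i|w_j): 0 if i == j, else 1

cond_risk = loss @ posterior            # R(a_i|x) = sum_j lambda * P(w_j|x)
print(cond_risk)                        # equals 1 - posterior
print(np.argmin(cond_risk) == np.argmax(posterior))   # True: same decision
```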


Bayesian Decision Making

Discriminant functions, Normal


Distribution

2
Classifiers, Discriminant Functions
and Decision Surfaces

• The multi-category case

– Set of discriminant functions gi(x), i = 1, …, c

– The classifier assigns a feature vector x to class ωi if:
gi(x) > gj(x) ∀ j ≠ i


3
Generic view of a
pattern classifier


4
Discriminant Functions

• Let gi(x) = −R(αi | x)
(the maximum discriminant then corresponds to the minimum risk!)

• For the minimum error rate, we take
gi(x) = P(ωi | x)
(the maximum discriminant then corresponds to the maximum posterior!)

• Equivalent choices that give the same decision:
gi(x) ≡ p(x | ωi) P(ωi)
gi(x) = ln p(x | ωi) + ln P(ωi)
(ln: natural logarithm)
5 Discriminant functions
• Discriminant functions do not change the decision when scaled by a positive constant ‘k’.

• The decision is not affected when the same constant is added to all the discriminant functions.


6
Discriminant Functions

• The feature space is divided into c decision regions:
if gi(x) > gj(x) ∀ j ≠ i, then x is in Ri
(Ri means: assign x to ωi)

• The two-category case

– A classifier is a “dichotomizer” that has two discriminant functions g1 and g2

Let g(x) ≡ g1(x) – g2(x)

Decide ω1 if g(x) > 0; otherwise decide ω2



7 Dichotomizer

gi(x) = ln p(x | ωi) + ln P(ωi)
(ln: natural logarithm!)

– The computation of g(x) for the dichotomizer:

g(x) = P(ω1 | x) − P(ω2 | x)

or

g(x) = ln [ p(x | ω1) / p(x | ω2) ] + ln [ P(ω1) / P(ω2) ]
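A minimal sketch of such a dichotomizer using the log form of g(x); the 1-D Gaussian class conditionals are hypothetical and the priors are the ones used earlier.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch of a dichotomizer with
# g(x) = ln p(x|w1) - ln p(x|w2) + ln P(w1) - ln P(w2).
priors = [0.75, 0.25]
dists  = [norm(10.0, 2.0), norm(14.0, 2.5)]   # illustrative class conditionals

def g(x):
    return (dists[0].logpdf(x) - dists[1].logpdf(x)
            + np.log(priors[0]) - np.log(priors[1]))

for x in (9.0, 12.0, 16.0):
    print(x, "-> w1" if g(x) > 0 else "-> w2")
```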


8 Normal /Gaussian Distribution


9


10 Multivariate Gaussian Distribution

p(x) = 1 / ( (2π)^{d/2} |Σ|^{1/2} ) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

where μ is the d-dimensional mean vector and Σ is the d × d covariance matrix (a sketch of evaluating this density follows below).
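A minimal sketch that evaluates this density directly from the formula, for illustrative 2-D parameters.

```python
import numpy as np

# Minimal sketch: evaluating the d-dimensional Gaussian density from the formula.
def mvn_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(Sigma)
    norm_const = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ inv @ diff)

mu = np.array([0.0, 0.0])                 # illustrative mean vector
Sigma = np.array([[2.0, 0.5],             # illustrative covariance matrix
                  [0.5, 1.0]])
print(mvn_pdf(np.array([1.0, -1.0]), mu, Sigma))
# Optional cross-check: scipy.stats.multivariate_normal(mu, Sigma).pdf([1.0, -1.0])
```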


11


12 Multivariate Gaussian Distribution


13 Covariance matrix property
• If w is any d-dimensional vector, the variance of wᵀx cannot be negative.

• This leads to the quadratic form wᵀΣw being non-negative → Σ is positive semi-definite.

• The eigenvalues of Σ are non-negative, as checked in the sketch below.
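A small sketch that checks these properties on a sample covariance matrix estimated from random, purely illustrative data.

```python
import numpy as np

# Minimal sketch: a sample covariance matrix is positive semi-definite, so
# every quadratic form w' Sigma w is >= 0 and all eigenvalues are >= 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # 500 samples, d = 3 features (illustrative)
Sigma = np.cov(X, rowvar=False)          # d x d covariance estimate

w = rng.normal(size=3)                   # an arbitrary d-dimensional vector
print(w @ Sigma @ w >= 0)                # the quadratic form is non-negative
print(np.linalg.eigvalsh(Sigma))         # eigenvalues are all non-negative
```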


14 Multivariate Gaussian Distribution

Reminder: the covariance matrix is symmetric and positive semi-definite.

Entropy is a measure of uncertainty.

The normal distribution has the maximum entropy among all distributions with a given mean and variance.


15 2D Gaussian


16 Multivariate Gaussian Distribution


17 Mahalanobis Distance
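The figure for this slide is not reproduced here; as a sketch, the squared Mahalanobis distance r² = (x − μ)ᵀ Σ⁻¹ (x − μ) can be computed as follows, with μ and Σ taken as illustrative values.

```python
import numpy as np

# Minimal sketch of the (squared) Mahalanobis distance from a point x to the
# mean mu of a distribution with covariance Sigma.
def mahalanobis_sq(x, mu, Sigma):
    diff = x - mu
    return float(diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])                 # illustrative parameters
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mahalanobis_sq(np.array([1.0, -1.0]), mu, Sigma))
# Points at equal Mahalanobis distance from mu lie on the same ellipsoidal
# contour of the Gaussian density.
```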


Bayesian Decision Making

2 Error Computation

P(e) = P(x ∈ R1, ω2) + P(x ∈ R2, ω1)

     = P(x ∈ R1 | ω2) P(ω2) + P(x ∈ R2 | ω1) P(ω1)

     = ∫_{R1} p(x | ω2) P(ω2) dx + ∫_{R2} p(x | ω1) P(ω1) dx


3 Error computation


4 Error computation

P(correct) = ∑_{i=1}^{C} P(x ∈ Ri, ωi)

           = ∑_{i=1}^{C} P(x ∈ Ri | ωi) P(ωi)

           = ∑_{i=1}^{C} ∫_{Ri} p(x | ωi) P(ωi) dx

(A numerical sketch follows below.)
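A sketch that evaluates P(correct) on a grid for two hypothetical 1-D Gaussian class conditionals, taking the decision regions from the maximum-posterior rule; all parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch: P(correct) = sum_i integral over R_i of p(x|w_i) P(w_i) dx,
# with the regions R_i defined by the maximum-posterior decision rule.
priors = [0.75, 0.25]
dists  = [norm(10.0, 2.0), norm(14.0, 2.5)]   # illustrative class conditionals

x = np.linspace(-5, 30, 20001)
joint = np.vstack([p * d.pdf(x) for p, d in zip(priors, dists)])  # p(x|wi) P(wi)
regions = joint.argmax(axis=0)             # R_i = the set where class i wins

p_correct = sum(np.trapz(np.where(regions == i, joint[i], 0.0), x)
                for i in range(2))
print(p_correct, 1.0 - p_correct)          # P(correct) and P(error)
```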


5 Bayes Decision Theory – Discrete Features

• Components of x are binary or integer valued.

• x can take only one of m discrete values


v1, v2, …, vm


6
Bayes Decision Theory – Discrete
Features


7
Bayes Decision Theory – Discrete
Features
• Case of independent binary features in a 2-category problem

Let x = [x1, x2, …, xd]ᵀ, where each xi is either 0 or 1, with probabilities:

pi = P(xi = 1 | ω1)
qi = P(xi = 1 | ω2)


8
Bayes Decision Theory – Discrete
Features
Assuming conditional independence of the components of the feature vector, the likelihoods can be computed as:

P(x | ω1) = ∏_{i=1}^{d} pi^{xi} (1 − pi)^{1 − xi}

P(x | ω2) = ∏_{i=1}^{d} qi^{xi} (1 − qi)^{1 − xi}


9
Bayes Decision Theory – Discrete
Features

If pi = qi for all i, the two likelihoods are identical and the decision depends on the prior only.


10
Bayes Decision Theory – Discrete
Features
We have:

g(x) = ∑_{i=1}^{d} wi xi + w0

where:

wi = ln [ pi (1 − qi) / ( qi (1 − pi) ) ],   i = 1, …, d

and:

w0 = ∑_{i=1}^{d} ln [ (1 − pi) / (1 − qi) ] + ln [ P(ω1) / P(ω2) ]

Decide ω1 if g(x) > 0 and ω2 if g(x) ≤ 0, as illustrated in the sketch below.
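A minimal sketch of this linear discriminant; the probabilities pi, qi and the priors are illustrative values, not taken from the slides.

```python
import numpy as np

# Minimal sketch of the linear discriminant for independent binary features.
p = np.array([0.8, 0.6, 0.3])      # p_i = P(x_i = 1 | w1), illustrative
q = np.array([0.2, 0.5, 0.7])      # q_i = P(x_i = 1 | w2), illustrative
priors = (0.5, 0.5)                # P(w1), P(w2), illustrative

w  = np.log(p * (1 - q) / (q * (1 - p)))                      # weights w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(priors[0] / priors[1])

def decide(x):
    g = w @ x + w0                 # g(x) = sum_i w_i x_i + w_0
    return "w1" if g > 0 else "w2"

print(decide(np.array([1, 1, 0])), decide(np.array([0, 0, 1])))
```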
11 Evaluation of a classifier

The four possible outcomes of a two-class decision:

• True positive / hit
• False positive / false alarm
• True negative / correct rejection
• False negative / miss

12 Discriminability ratio
The discriminability between two univariate normal distributions with means μ1, μ2 and a common standard deviation σ can be summarized by

d′ = |μ2 − μ1| / σ


13
Receiver operating
characteristics
Consider a pair of distributions with discriminability d′.

Vary the threshold x*.

We note that the true positive (hit) probability and the false positive (false alarm) probability will vary with the threshold x*.

The variation of these probabilities for a given discriminability traces out a smooth curve, sketched numerically below.

This curve is called the ‘Receiver Operating Characteristic’ (ROC).
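A minimal sketch that traces such a curve for two unit-variance Gaussians separated by an assumed discriminability d′ = 2.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch of an ROC curve: sweep the threshold x* and record the hit
# (true positive) and false alarm (false positive) probabilities for two
# unit-variance Gaussians separated by d' (illustrative value).
d_prime = 2.0
noise, signal = norm(0.0, 1.0), norm(d_prime, 1.0)

thresholds = np.linspace(-4, 4 + d_prime, 200)
hit_rate   = signal.sf(thresholds)     # P(x > x* | target class)
false_rate = noise.sf(thresholds)      # P(x > x* | other class)

for t, h, f in zip(thresholds[::50], hit_rate[::50], false_rate[::50]):
    print(f"x*={t:5.2f}  hit={h:.3f}  false alarm={f:.3f}")
# Plotting hit_rate against false_rate traces the smooth ROC curve.
```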
