Bayesian Theory
-
2 Introduction
• Classification differs from regression in that the
output is a discrete label denoting the class of
the entity.
–
3 Bayes Decision
• It is the decision-making process used when all the
underlying probability distributions are known
• Generative model
–
4 Problem
• Two types of fish, sea bass and salmon, arrive on a
conveyor belt
–
5 Prior probability
• ω1 : Salmon
• ω2 : Sea bass
–
6
Estimation of prior probabilities
by frequentist approach
• Assume that out of 8000 fish used for
training, we observed that 6000 were
salmon and 2000 were sea bass
• Accordingly we have
P(ω1) = 6000/8000 = 0.75 and P(ω2) = 2000/8000 = 0.25
–
7
Simple Decision making
based on prior knowledge
Assign an unknown fish to salmon ω1 if
P(ω1) > P(ω2)
else
assign it to sea bass ω2.
–
8
Issues with using only
prior knowledge
• This decision rule is flawed: since P(ω1) > P(ω2),
every test fish will be assigned to salmon.
–
9 Need for features
• Solution: Look for additional information
to describe the sea bass and salmon.
–
10 Feature vector
• Assume that we have d features, collected into a
feature vector x = (x1, …, xd)ᵀ.
–
11 Class Conditional Density
• For each class/category of fish, we model the
d features as being drawn from a class-conditional
probability density function p(x | ωj).
–
13 Class Conditional Density
–
14 Bayes Theorem
• Bayes Theorem:
P(ωj | x) = p(x | ωj) P(ωj) / P(x)
posterior = (likelihood × prior) / evidence
In the case of two categories:
P(x) = Σ_{j=1}^{2} p(x | ωj) P(ωj)
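As a minimal sketch (not from the slides), the posterior can be computed directly from the theorem. The two Gaussian class-conditional densities and their parameters below are assumptions for illustration; the priors are the earlier frequentist estimates.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch of the posterior computation (assumed class-conditional
# densities; priors from the earlier frequentist estimate).
def posterior(x, likelihoods, priors):
    """Return P(omega_j | x) for each class j."""
    joint = np.array([lik(x) * pr for lik, pr in zip(likelihoods, priors)])
    evidence = joint.sum()      # P(x) = sum_j p(x | omega_j) P(omega_j)
    return joint / evidence

# Hypothetical Gaussian class-conditional densities for salmon / sea bass.
liks = [norm(4.0, 1.0).pdf, norm(7.0, 1.5).pdf]
print(posterior(5.0, liks, [0.75, 0.25]))   # posteriors sum to 1
```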
–
15
Posterior probability plot
for the two classes
–
16
Decision based on
posterior probabilities
• Decision given the posterior probabilities: decide ω1 if
P(ω1 | x) > P(ω2 | x), otherwise decide ω2.
Therefore:
P(error | x) = min [P(ω1 | x), P(ω2 | x)]
(Bayes decision)
–
18 Probability of error
P(error) = ∫_{−∞}^{∞} P(error, x) dx = ∫_{−∞}^{∞} P(error | x) p(x) dx
–
19
Bayesian classification framework for high
dimensional features and more classes
Let {ω1, ω2,…, ωC} be the set of “C” states of nature (or
“categories” / “classes”)
–
20
Bayesian classification framework for high
dimensional features and more classes
ωtest = arg max_j P(ωj | x) = arg max_j p(x | ωj) P(ωj) / P(x),   j = 1, 2, …, C
P(x) = Σ_{j=1}^{C} p(x | ωj) P(ωj)
The evidence acts only as a normalization factor; the term is the
same for all classes, so it can be dropped from the arg max.
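A minimal sketch of the C-class rule; the three class densities and priors below are hypothetical.

```python
import numpy as np
from scipy.stats import norm

# Sketch of the C-class Bayes rule: argmax_j p(x | w_j) P(w_j).
# The evidence P(x) is the same for every class, so it is dropped.
def bayes_classify(x, likelihoods, priors):
    scores = [lik(x) * pr for lik, pr in zip(likelihoods, priors)]
    return int(np.argmax(scores))       # index of the winning class

# Three hypothetical 1-D Gaussian classes with assumed priors.
liks = [norm(4.0, 1.0).pdf, norm(7.0, 1.5).pdf, norm(10.0, 2.0).pdf]
print(bayes_classify(6.0, liks, [0.5, 0.3, 0.2]))
```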
–
21 Risk minimization framework
Let {ω1, ω2,…, ωC} be the set of “C” states
of nature (or “categories”)
–
22 Risk minimization framework
–
23 Reading
–
Bayesian Decision Making
-
2 Risk minimization framework
Let {ω1, ω2,…, ωC} be the set of “C” states
of nature (or “categories”)
–
3 Risk minimization framework
λ(αi | ωj): loss incurred for taking action αi when the true
state of nature is ωj
Conditional risk of action αi:
R(αi | x) = Σ_{j=1}^{C} λ(αi | ωj) P(ωj | x)
Choosing, for every x, the action that minimizes R(αi | x)
minimizes the overall risk R = ∫ R(α(x) | x) p(x) dx.
–
4 Risk minimization framework
• Two-category classification
α1 : deciding ω1
α2 : deciding ω2
λij = λ(αi | ωj) is the loss incurred for deciding ωi
when the true state of nature is ωj
Conditional risks:
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
–
5 Risk minimization framework
Our rule is the following: decide ω1 if R(α1 | x) < R(α2 | x);
otherwise decide ω2 (see the sketch below).
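As a sketch of this rule, the two conditional risks can be computed as a matrix-vector product; the loss matrix and posteriors below are illustrative assumptions.

```python
import numpy as np

# Sketch: conditional risks as a matrix-vector product.
# lam[i][j] = loss for deciding w_(i+1) when w_(j+1) is true (illustrative).
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])
post = np.array([0.4, 0.6])   # P(w1 | x), P(w2 | x) at some observed x
risk = lam @ post             # R(alpha_i | x) = sum_j lam_ij P(w_j | x)
print(risk)                   # [1.2, 0.4] -> decide w2
print("decide w%d" % (np.argmin(risk) + 1))
```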
–
6
Likelihood ratio
Deciding ω1 when R(α1 | x) < R(α2 | x) is equivalent to
(λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x)
i.e., decide ω1 if
p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
–
7
Likelihood ratio
–
8 Likelihood ratio
–
9 Likelihood ratio
–
10 Likelihood ratio
Optimal decision property: the Bayes rule minimizes the overall
risk; no other decision rule achieves a lower risk.
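A minimal sketch of the resulting likelihood-ratio test; all numeric values below are illustrative assumptions.

```python
# Sketch of the likelihood-ratio test implied by risk minimization:
# decide w1 when p(x|w1)/p(x|w2) exceeds a threshold fixed by the
# losses and priors. All numbers below are illustrative.
def lr_decide(px_w1, px_w2, p1, p2, l11, l12, l21, l22):
    threshold = ((l12 - l22) * p2) / ((l21 - l11) * p1)
    return 1 if px_w1 / px_w2 > threshold else 2

# threshold = (2 - 0) * 0.25 / ((1 - 0) * 0.75) = 2/3; ratio 3 > 2/3 -> w1
print(lr_decide(0.3, 0.1, 0.75, 0.25, 0.0, 2.0, 1.0, 0.0))
```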
–
11 Zero-one loss function
• Actions are decisions on classes
–
12 Zero-one loss function
• Introduction of the zero-one loss function:
λ(αi | ωj) = 0 if i = j, 1 if i ≠ j
All errors are equally costly.
Therefore, the conditional risk is:
R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x) = Σ_{j≠i} P(ωj | x) = 1 − P(ωi | x)
Minimizing this risk is therefore the same as maximizing the
posterior P(ωi | x).
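A quick check with illustrative numbers: if the posteriors at some x are P(ω1 | x) = 0.7, P(ω2 | x) = 0.2, and P(ω3 | x) = 0.1, the zero-one conditional risks are 0.3, 0.8, and 0.9, so minimizing the risk selects ω1, exactly the class with the maximum posterior.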
–
Bayesian Decision Making
-
2
Classifiers, Discriminant Functions
and Decision Surfaces
–
3
Generic view of a
pattern classifier
–
4
Discriminant Functions
A classifier assigns a feature vector x to class ωi if
gi(x) > gj(x) for all j ≠ i
where gi(x), i = 1, …, C, are the discriminant functions.
–
6
Discriminant Functions
For the Bayes classifier we may take
gi(x) = P(ωi | x)
or
gi(x) = ln p(x | ωi) + ln P(ωi)
(any monotonically increasing function of gi leaves the decision unchanged)
–
8 Normal / Gaussian Distribution
p(x) = (1 / √(2πσ²)) exp(−(x − μ)² / 2σ²)
with mean μ and variance σ²
–
9
–
10 Multivariate Gaussian Distribution
p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
where μ is the d-dimensional mean vector and Σ is the
d × d covariance matrix
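The density can be evaluated directly from this formula; a minimal sketch with an assumed mean vector and covariance matrix follows.

```python
import numpy as np

# Minimal sketch of the d-variate Gaussian density (illustrative mu, Sigma).
def mvn_pdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm_const

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([1.0, -1.0]), mu, sigma))
```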
–
11
–
12 Multivariate Gaussian Distribution
–
13 Covariance matrix property
• If w is any d-dimensional vector, the variance of wᵀx
cannot be negative.
• This forces the quadratic form wᵀΣw to be non-negative,
i.e., Σ is positive semi-definite.
–
14 Multivariate Gaussian Distribution
The normal distribution has the maximum entropy over all
distributions with a given mean and variance.
–
15 2D Gaussian
–
16 Multivariate Gaussian Distribution
–
17 Mahalanobis Distance
The squared Mahalanobis distance from x to the mean μ is
r² = (x − μ)ᵀ Σ⁻¹ (x − μ)
which reduces to the squared Euclidean distance when Σ = I.
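A minimal sketch of the distance computation, reusing the assumed μ and Σ from the Gaussian sketch above.

```python
import numpy as np

# Sketch: squared Mahalanobis distance r^2 = (x - mu)^T Sigma^{-1} (x - mu),
# with the same illustrative mu and Sigma as the Gaussian sketch above.
def mahalanobis_sq(x, mu, sigma):
    diff = x - mu
    return float(diff @ np.linalg.inv(sigma) @ diff)

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mahalanobis_sq(np.array([1.0, -1.0]), mu, sigma))
```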
–
Bayesian Decision Making
-
2 Error Computation
P(error) = P(x ∈ R1, ω2) + P(x ∈ R2, ω1)
= ∫_{R1} p(x | ω2) P(ω2) dx + ∫_{R2} p(x | ω1) P(ω1) dx
–
3 Error computation
–
4 Error computation
P(correct) = Σ_{i=1}^{C} P(x ∈ Ri, ωi)
= Σ_{i=1}^{C} P(x ∈ Ri | ωi) P(ωi)
= Σ_{i=1}^{C} ∫_{Ri} p(x | ωi) P(ωi) dx
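These integrals can be approximated numerically; a sketch for two hypothetical 1-D Gaussian classes, integrating the smaller joint density (the one the Bayes rule sacrifices at each x) over the whole line.

```python
import numpy as np
from scipy.stats import norm

# Sketch: numerical estimate of P(error) for two hypothetical 1-D Gaussian
# classes. The Bayes rule errs wherever the smaller joint density lies.
p1, p2 = 0.75, 0.25
xs = np.linspace(-10.0, 20.0, 20001)
joint1 = norm(4.0, 1.0).pdf(xs) * p1    # p(x | w1) P(w1)
joint2 = norm(7.0, 1.5).pdf(xs) * p2    # p(x | w2) P(w2)
p_error = np.trapz(np.minimum(joint1, joint2), xs)
print(p_error)                          # P(error) under the Bayes rule
```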
–
5 Bayes Decision Theory – Discrete Features
–
6
Bayes Decision Theory – Discrete
Features
–
7
Bayes Decision Theory – Discrete
Features
• Case of independent binary features in a
two-category problem:
pi = P(xi = 1 | ω1)
qi = P(xi = 1 | ω2)
–
8
Bayes Decision Theory – Discrete
Features
Assuming conditional independence of the components of the feature
vector, the likelihoods can be computed as
P(x | ω1) = Π_{i=1}^{d} pi^{xi} (1 − pi)^{1−xi}
P(x | ω2) = Π_{i=1}^{d} qi^{xi} (1 − qi)^{1−xi}
–
9
Bayes Decision Theory – Discrete
Features
The resulting discriminant is linear:
g(x) = Σ_{i=1}^{d} wi xi + w0
where:
wi = ln [pi (1 − qi) / (qi (1 − pi))],   i = 1, …, d
and:
w0 = Σ_{i=1}^{d} ln [(1 − pi) / (1 − qi)] + ln [P(ω1) / P(ω2)]
Decide ω1 if g(x) > 0 and ω2 if g(x) ≤ 0.
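A minimal sketch of this linear discriminant with illustrative pi, qi, and priors.

```python
import numpy as np

# Sketch of the linear discriminant for independent binary features
# (illustrative p_i, q_i, and priors).
p = np.array([0.8, 0.6, 0.7])    # p_i = P(x_i = 1 | w1)
q = np.array([0.3, 0.4, 0.5])    # q_i = P(x_i = 1 | w2)
P1, P2 = 0.75, 0.25

w = np.log(p * (1 - q) / (q * (1 - p)))                  # weights w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)

x = np.array([1, 0, 1])          # an observed binary feature vector
g = w @ x + w0                   # g(x) = sum_i w_i x_i + w0
print("decide w1" if g > 0 else "decide w2")
```
–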
11 Evaluation of classifier
• True positive / hit
• False negative / miss
• False positive / false alarm
• True negative / correct rejection
–
12 Discriminability ratio
Discriminability ratio between two distributions
(equal variance σ², means μ1 and μ2):
d′ = |μ2 − μ1| / σ
–
13
Receiver operating
characteristics
Consider a pair of distributions with
discriminability ratio d′.
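Sweeping the decision threshold for two unit-variance Gaussians whose means differ by d′ traces the ROC curve; a minimal sketch with an illustrative d′.

```python
import numpy as np
from scipy.stats import norm

# Sketch: ROC for two unit-variance Gaussians whose means differ by d'
# (illustrative d' = 2). Sweeping the threshold traces the curve.
d_prime = 2.0
thresholds = np.linspace(-4.0, 4.0 + d_prime, 9)
hit = 1 - norm.cdf(thresholds - d_prime)   # P(x > t | signal class)
fa = 1 - norm.cdf(thresholds)              # P(x > t | noise class)
for t, f, h in zip(thresholds, fa, hit):
    print(f"t={t:+.1f}  FA={f:.3f}  Hit={h:.3f}")
```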