
Intro to Pattern Recognition:

Bayesian Decision Theory


- Introduction

- Bayesian Decision Theory: Continuous Features

Credits and Acknowledgments
Materials used in this course were taken from the textbook Pattern Classification by Duda
et al., John Wiley & Sons, 2001 with the permission of the authors and the publisher; and
also from

Other material on the web:

Dr. A. Aydin Atalan, Middle East Technical University, Turkey
Dr. Djamel Bouchaffra, Oakland University
Dr. Adam Krzyzak, Concordia University
Dr. Joseph Picone, Mississippi State University
Dr. Robi Polikar, Rowan University
Dr. Stefan A. Robila, University of New Orleans
Dr. Sargur N. Srihari, State University of New York at Buffalo
David G. Stork, Stanford University
Dr. Godfried Toussaint, McGill University
Dr. Chris Wyatt, Virginia Tech
Dr. Alan L. Yuille, University of California, Los Angeles
Dr. Song-Chun Zhu, University of California, Los Angeles



TYPICAL APPLICATIONS
IMAGE PROCESSING EXAMPLE
Sorting Fish: incoming fish are sorted according to species
using optical sensing (sea bass or salmon?)
Processing pipeline: Sensing -> Segmentation -> Feature Extraction
Problem Analysis:
set up a camera and take some sample images to extract features
Consider features such as length, lightness, width,
number and shape of fins, position of mouth, etc.
TYPICAL APPLICATIONS
LENGTH AS A DISCRIMINATOR
Length is a poor discriminator
TYPICAL APPLICATIONS
ADD ANOTHER FEATURE
Lightness is a better feature than length because it
reduces the misclassification error.
Can we combine features in such a way that we improve
performance? (Hint: correlation)
TYPICAL APPLICATIONS
WIDTH AND LIGHTNESS
Treat features as a two-dimensional vector
Create a scatter plot
Draw a line (regression) separating the two classes
TYPICAL APPLICATIONS
DECISION THEORY
Can we do better than a linear classifier?
What is wrong with this decision surface?
(hint: generalization)
TYPICAL APPLICATIONS
GENERALIZATION AND RISK
Why might a smoother decision surface be a better
choice?
This course investigates how to find such optimal
decision surfaces and how to provide system designers
with the tools to make intelligent trade-offs.
Thomas Bayes
At the time of his death, Rev. Thomas Bayes (1702-1761) left
behind two unpublished essays attempting to determine the
probabilities of causes from observed effects. Forwarded to the
British Royal Society, the essays had little impact and were soon
forgotten.

When, several years later, the French mathematician Laplace
independently rediscovered a very similar concept, the English
scientists quickly reclaimed ownership of what is now known as
Bayes' Theorem.
Bayesian decision theory is a fundamental statistical
approach to the problem of pattern classification.
Quantify the tradeoffs between various classification
decisions using probability and the costs that
accompany these decisions.
Assume all relevant probability distributions are known
(later we will learn how to estimate these from data).
Can we exploit prior knowledge in our fish classification
problem:
Is the sequence of fish predictable? (statistics)
Is each class equally probable? (uniform priors)
What is the cost of an error? (risk, optimization)
BAYESIAN DECISION THEORY
PROBABILISTIC DECISION THEORY
State of nature is prior information
Model it as a random variable, ω:

ω = ω₁: the event that the next fish is a sea bass
category 1: sea bass; category 2: salmon
P(ω₁) = probability of category 1
P(ω₂) = probability of category 2
P(ω₁) + P(ω₂) = 1
Exclusivity: ω₁ and ω₂ share no basic events
Exhaustivity: the union of all outcomes is the sample space
(either ω₁ or ω₂ must occur)
If all incorrect classifications have an equal cost:
Decide ω₁ if P(ω₁) > P(ω₂); otherwise, decide ω₂
BAYESIAN DECISION THEORY
PRIOR PROBABILITIES
BAYESIAN DECISION THEORY
CLASS-CONDITIONAL PROBABILITIES
A decision rule with only prior information always
produces the same result and ignores measurements.
If P(ω₁) >> P(ω₂), we will be correct most of the time.
Probability of error: P(E) = min(P(ω₁), P(ω₂)).
Given a feature, x (lightness), which is a continuous random
variable, p(x|ω₂) is the class-conditional probability density
function:
p(x|ω₁) and p(x|ω₂) describe the difference in lightness between
populations of sea bass and salmon.
A probability density function is denoted in lowercase and represents
a function of a continuous variable.
p_x(x|ω) denotes a probability density function for the random variable
X. Note that p_x(x|ω) and p_y(y|ω) can be two different functions.
P(x|ω) denotes a probability mass function, and must obey the
following constraints:

BAYESIAN DECISION THEORY
PROBABILITY FUNCTIONS
P(x) \ge 0, \qquad \sum_{x \in X} P(x) = 1

Probability mass functions are typically used for discrete
random variables, while densities describe continuous
random variables (the latter must be integrated).
Suppose we know both P(ωⱼ) and p(x|ωⱼ), and we can
measure x. How does this influence our decision?
The joint probability of finding a pattern that is in
category ωⱼ and that has a feature value of x is:

p(\omega_j, x) = P(\omega_j \mid x)\, p(x) = p(x \mid \omega_j)\, P(\omega_j)

BAYESIAN DECISION THEORY
BAYES FORMULA
Rearranging terms, we arrive at Bayes formula:

P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}

where, in the case of two categories, the evidence is:

p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\, P(\omega_j)
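To make the formula concrete, here is a minimal Python sketch of the posterior computation. The Gaussian class-conditional densities and their parameters are hypothetical stand-ins for the lightness densities; only the priors (2/3 and 1/3) come from the fish example used later.

import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x|w1), p(x|w2) for lightness.
likelihoods = [norm(loc=4.0, scale=1.0).pdf,   # sea bass (w1), assumed parameters
               norm(loc=7.0, scale=1.5).pdf]   # salmon  (w2), assumed parameters
priors = [2.0 / 3.0, 1.0 / 3.0]                # P(w1), P(w2)

def posteriors(x):
    # Bayes formula: P(wj|x) = p(x|wj) P(wj) / p(x)
    joint = np.array([lik(x) * pr for lik, pr in zip(likelihoods, priors)])
    return joint / joint.sum()                 # divide by the evidence p(x)

print(posteriors(5.0))                         # the two posteriors sum to 1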
Bayes formula

P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}

can be expressed in words as:

posterior = (likelihood × prior) / evidence

By measuring x, we can convert the prior probability,
P(ωⱼ), into a posterior probability, P(ωⱼ|x).
Evidence can be viewed as a scale factor and is often
ignored in optimization applications (e.g., speech
recognition).
BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES
Likelihood is a synonym for probability.
BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES
For every value of x, the posteriors sum to 1.0.
At x = 14, the probability that the fish is in category ω₂ is 0.08,
and in category ω₁ is 0.92.
Two-class fish sorting problem (P(ω₁) = 2/3, P(ω₂) = 1/3):
BAYESIAN DECISION THEORY
BAYES DECISION RULE
Decision rule:
For an observation x, decide ω₁ if P(ω₁|x) > P(ω₂|x); otherwise,
decide ω₂.
Probability of error:

P(\text{error} \mid x) =
\begin{cases}
P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\
P(\omega_2 \mid x) & \text{if we decide } \omega_1
\end{cases}

so that, under this rule, P(\text{error} \mid x) = \min[\,P(\omega_1 \mid x),\, P(\omega_2 \mid x)\,].

The average probability of error is given by:

P(\text{error}) = \int P(\text{error}, x)\, dx = \int P(\text{error} \mid x)\, p(x)\, dx

If for every x we ensure that P(error|x) is as small as
possible, then the integral is as small as possible. Thus,
the Bayes decision rule minimizes P(error).
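As an illustration (not from the textbook), the same hypothetical densities and priors as in the earlier sketch can be used to approximate P(error) numerically; at each x the integrand min_j p(x|ωj)P(ωj) equals P(error|x) p(x) under the Bayes rule.

import numpy as np
from scipy.stats import norm

likelihoods = [norm(loc=4.0, scale=1.0).pdf, norm(loc=7.0, scale=1.5).pdf]  # assumed
priors = [2.0 / 3.0, 1.0 / 3.0]

def average_error(x_grid):
    # P(error) ~ integral of min_j p(x|wj) P(wj) dx, approximated on a uniform grid
    integrand = [min(lik(x) * pr for lik, pr in zip(likelihoods, priors))
                 for x in x_grid]
    dx = x_grid[1] - x_grid[0]
    return float(np.sum(integrand) * dx)

print(average_error(np.linspace(-3.0, 15.0, 4000)))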
Bayes Decision Rule
BAYESIAN DECISION THEORY
EVIDENCE
The evidence, p(x), is a scale factor that assures
conditional probabilities sum to 1:
P(ω₁|x) + P(ω₂|x) = 1
We can eliminate the scale factor (which appears
on both sides of the equation):
Decide ω₁ if p(x|ω₁)P(ω₁) > p(x|ω₂)P(ω₂)

Special cases:
if p(x|ω₁) = p(x|ω₂): x gives us no useful information
if P(ω₁) = P(ω₂): the decision is based entirely on the
likelihood, p(x|ωⱼ).
Generalization of the preceding ideas:
Use of more than one feature
(e.g., length and lightness)
Use more than two states of nature
(e.g., N-way classification)
Allowing actions other than a decision to decide on the
state of nature (e.g., rejection: refusing to take an
action when alternatives are close or confidence is low)
Introduce a loss function which is more general than
the probability of error
(e.g., errors are not equally costly)
Let us replace the scalar x by the vector x in a
d-dimensional Euclidean space, Rᵈ, called the feature space.
CONTINUOUS FEATURES
GENERALIZATION OF TWO-CLASS PROBLEM
CONTINUOUS FEATURES
LOSS FUNCTION
Let λ(αᵢ|ωⱼ) be the loss incurred for taking action αᵢ when the
state of nature is ωⱼ.
The posterior, P(ωⱼ|x), can be computed from Bayes formula:

P(\omega_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_j)\, P(\omega_j)}{p(\mathbf{x})}

where the evidence is:

p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\, P(\omega_j)

The expected loss from taking action αᵢ is:

R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})
CONTINUOUS FEATURES
BAYES RISK
An expected loss is called a risk.
R(αᵢ|x) is called the conditional risk.
A general decision rule is a function α(x) that tells us
which action to take for every possible observation.
The overall risk is given by:

R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}

If we choose α(x) so that R(α(x)|x) is as small as possible
for every x, the overall risk will be minimized.
Compute the conditional risk for every α and select the
action that minimizes R(αᵢ|x). This is denoted R*, and is
referred to as the Bayes risk.
The Bayes risk is the best performance that can be
achieved.
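A minimal sketch of the minimum-conditional-risk rule; the loss matrix and the posterior values below are made-up numbers used only to show the computation.

import numpy as np

# lam[i, j] = loss for taking action alpha_i when the true state is omega_j
lam = np.array([[0.0, 2.0],      # hypothetical loss values
                [1.0, 0.0]])

def min_risk_action(post):
    # Conditional risks R(alpha_i|x) = sum_j lam[i, j] P(omega_j|x); pick the smallest.
    risks = lam @ np.asarray(post)
    return int(np.argmin(risks)), risks

action, risks = min_risk_action([0.3, 0.7])   # example posterior values
print(action, risks)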
CONTINUOUS FEATURES
TWO-CATEGORY CLASSIFICATION
Let α₁ correspond to ω₁, α₂ to ω₂, and λᵢⱼ = λ(αᵢ|ωⱼ).
The conditional risk is given by:

R(\alpha_1 \mid \mathbf{x}) = \lambda_{11} P(\omega_1 \mid \mathbf{x}) + \lambda_{12} P(\omega_2 \mid \mathbf{x})
R(\alpha_2 \mid \mathbf{x}) = \lambda_{21} P(\omega_1 \mid \mathbf{x}) + \lambda_{22} P(\omega_2 \mid \mathbf{x})

Our decision rule is:
choose ω₁ if R(α₁|x) < R(α₂|x);
otherwise decide ω₂.

This results in the equivalent rule:
choose ω₁ if (λ₂₁ - λ₁₁) P(ω₁|x) > (λ₁₂ - λ₂₂) P(ω₂|x);
otherwise decide ω₂.

If the loss incurred for making an error is greater than that
incurred for being correct, the factors (λ₂₁ - λ₁₁) and
(λ₁₂ - λ₂₂) are positive, and the ratio of these factors simply
scales the posteriors.
CONTINUOUS FEATURES
LIKELIHOOD
By employing Bayes formula, we can replace the
posteriors by the prior probabilities and conditional
densities:
choose ω₁ if:
(λ₂₁ - λ₁₁) p(x|ω₁) P(ω₁) > (λ₁₂ - λ₂₂) p(x|ω₂) P(ω₂);
otherwise decide ω₂.

If λ₂₁ - λ₁₁ is positive, our rule becomes:

choose ω₁ if:
\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}

If the loss factors are identical, and the prior probabilities
are equal, this reduces to a standard likelihood ratio:

choose ω₁ if:
\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > 1
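A small self-contained Python sketch of the likelihood-ratio rule; the densities, priors, and λ values are all hypothetical, chosen only to exercise the threshold.

from scipy.stats import norm

p1, p2 = norm(4.0, 1.0).pdf, norm(7.0, 1.5).pdf   # assumed p(x|w1), p(x|w2)
P1, P2 = 2.0 / 3.0, 1.0 / 3.0                      # P(w1), P(w2)
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0            # lambda_ij = loss(alpha_i | omega_j)

def decide(x):
    # choose w1 if p(x|w1)/p(x|w2) > [(l12 - l22)/(l21 - l11)] * [P(w2)/P(w1)]
    threshold = (l12 - l22) / (l21 - l11) * (P2 / P1)
    return "w1" if p1(x) / p2(x) > threshold else "w2"

print(decide(5.0))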
Minimum Error Rate
Consider a symmetrical or zero-one loss function:

\lambda(\alpha_i \mid \omega_j) =
\begin{cases}
0 & i = j \\
1 & i \ne j
\end{cases}
\qquad i, j = 1, 2, \ldots, c
MINIMUM ERROR RATE
The conditional risk is:

R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})
 = \sum_{j \ne i} P(\omega_j \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x})

The conditional risk is the average probability of error.
To minimize the error, maximize P(ωᵢ|x).
Minimum Error Rate
LIKELIHOOD RATIO
Minimum error rate classification (zero-one loss):
choose ωᵢ if: P(ωᵢ|x) > P(ωⱼ|x) for all j ≠ i
Classifiers, Discriminant Functions
and Decision Surfaces
The multi-category case

Set of discriminant functions gᵢ(x), i = 1, ..., c

The classifier assigns a feature vector x to class ωᵢ if
gᵢ(x) > gⱼ(x) for all j ≠ i
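A hedged sketch of the multi-category rule in Python; the three discriminant functions below are arbitrary placeholders (any monotonically increasing function of the posteriors would serve as gᵢ).

import numpy as np

def classify(x, g_funcs):
    # Assign x to the class omega_i whose discriminant g_i(x) is largest.
    return int(np.argmax([g(x) for g in g_funcs]))

# Made-up discriminants for a three-class, one-dimensional example.
g_funcs = [lambda x: -(x - 1.0) ** 2,
           lambda x: -(x - 4.0) ** 2,
           lambda x: -(x - 9.0) ** 2]
print(classify(5.0, g_funcs))        # -> 1 (the class whose score peaks near x = 4)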
Let gᵢ(x) = -R(αᵢ|x)
(max. discriminant corresponds to min. risk!)
For the minimum error rate, we take
gᵢ(x) = P(ωᵢ|x)
(max. discriminant corresponds to max. posterior!)

gᵢ(x) = p(x|ωᵢ) P(ωᵢ)

gᵢ(x) = ln p(x|ωᵢ) + ln P(ωᵢ)
(ln: natural logarithm!)
Feature space divided into c decision regions:
if gᵢ(x) > gⱼ(x) for all j ≠ i, then x is in Rᵢ
(Rᵢ means assign x to ωᵢ)
The two-category case
A classifier is a dichotomizer that has two discriminant functions
g₁ and g₂.

Let g(x) ≡ g₁(x) - g₂(x)

Decide ω₁ if g(x) > 0; otherwise decide ω₂.
The computation of g(x):

g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x}),

or equivalently, as a discriminant,

g(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}
The Normal Density
Univariate density

Density which is analytically tractable
Continuous density
Many processes are asymptotically Gaussian
Handwritten characters and speech sounds can be viewed as ideal
prototypes corrupted by a random process (central limit theorem)



The univariate normal density is

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right]

where:
μ = mean (or expected value) of x
σ² = expected squared deviation, or variance
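A direct Python transcription of this density, with illustrative parameter values:

import math

def normal_pdf(x, mu, sigma):
    # p(x) = 1/(sqrt(2 pi) sigma) exp[-1/2 ((x - mu)/sigma)^2]
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

print(normal_pdf(0.0, mu=0.0, sigma=1.0))   # about 0.3989 at the mean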
Expected value
The expected value of x (taken over the feature space) is given by

\mu = \mathcal{E}[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx

and the expected squared deviation, or variance, is:

\sigma^{2} = \mathcal{E}[(x - \mu)^{2}] = \int_{-\infty}^{\infty} (x - \mu)^{2}\, p(x)\, dx
Multivariate density

Multivariate normal density in d dimensions is:

p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{t}\, \Sigma^{-1}\, (\mathbf{x} - \boldsymbol{\mu}) \right]

where:
x = (x₁, x₂, ..., x_d)ᵗ  (t stands for the transpose vector form)
μ = (μ₁, μ₂, ..., μ_d)ᵗ  is the mean vector
Σ = d×d covariance matrix
|Σ| and Σ⁻¹ are its determinant and inverse, respectively
A two dimensional Gaussian density
distribution
Expected value - Multivariate case
The expected value of x (a vector) is given by

\boldsymbol{\mu} = \mathcal{E}[\mathbf{x}] = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x}

and the expected value of the squared deviation (the covariance matrix) is

\Sigma = \mathcal{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^{t}] = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^{t}\, p(\mathbf{x})\, d\mathbf{x}

Mahalanobis distance
is the distance between x and μ:

r^{2} = (\mathbf{x} - \boldsymbol{\mu})^{t}\, \Sigma^{-1}\, (\mathbf{x} - \boldsymbol{\mu})
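A brief numpy sketch of both quantities; the mean vector and covariance matrix are illustrative values, not taken from the slides.

import numpy as np

def mahalanobis_sq(x, mu, cov):
    # r^2 = (x - mu)^t cov^-1 (x - mu)
    d = x - mu
    return float(d @ np.linalg.solve(cov, d))

def multivariate_normal_pdf(x, mu, cov):
    # p(x) = exp(-r^2 / 2) / ((2 pi)^(d/2) |cov|^(1/2))
    d = len(mu)
    norm_const = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * mahalanobis_sq(x, mu, cov)) / norm_const

mu = np.array([0.0, 0.0])                          # illustrative parameters
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
print(multivariate_normal_pdf(np.array([1.0, -1.0]), mu, cov))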
Discriminant function

g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)

For the normal density this becomes

g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{t}\, \Sigma_i^{-1}\, (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)

For two categories, g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x}); in logarithmic form, thus,

g(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \omega_i)}{p(\mathbf{x} \mid \omega_j)} + \ln \frac{P(\omega_i)}{P(\omega_j)}, \qquad i \ne j
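A hedged numpy sketch of the quadratic discriminant above; the two classes' means, covariances, and priors are placeholders.

import numpy as np

def gaussian_discriminant(x, mu, cov, prior):
    # g_i(x) = -1/2 (x-mu)^t cov^-1 (x-mu) - d/2 ln(2 pi) - 1/2 ln|cov| + ln P(omega_i)
    d = len(mu)
    diff = x - mu
    return (-0.5 * float(diff @ np.linalg.solve(cov, diff))
            - 0.5 * d * np.log(2.0 * np.pi)
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

# Two made-up classes; pick the class with the larger discriminant.
params = [(np.array([0.0, 0.0]), np.eye(2), 0.5),
          (np.array([3.0, 3.0]), 2.0 * np.eye(2), 0.5)]
x = np.array([1.0, 1.0])
print(int(np.argmax([gaussian_discriminant(x, m, c, p) for m, c, p in params])))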
Case 1: Σᵢ = σ²I

Features are statistically independent, and all features have the
same variance: distributions are spherical in d dimensions.

\Sigma_i = \sigma^{2} I =
\begin{bmatrix}
\sigma^{2} & 0 & \cdots & 0 \\
0 & \sigma^{2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^{2}
\end{bmatrix},
\qquad |\Sigma_i| = \sigma^{2d},
\qquad \Sigma_i^{-1} = \frac{1}{\sigma^{2}} I \quad \text{(independent of } i\text{)}
GAUSSIAN CLASSIFIERS
THRESHOLD DECODING
This has a simple geometric interpretation: the decision boundary
between ωᵢ and ωⱼ satisfies

\| \mathbf{x} - \boldsymbol{\mu}_j \|^{2} - \| \mathbf{x} - \boldsymbol{\mu}_i \|^{2} = 2\sigma^{2} \ln \frac{P(\omega_j)}{P(\omega_i)}

When the priors are equal and the support regions are spherical,
the decision boundary lies simply halfway between the means
(Euclidean distance).
Note how the priors shift the boundary away from the more likely mean!
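An illustrative sketch of this case (spherical covariances); the means, σ, and priors are made up. Each class score is -||x - μᵢ||² / (2σ²) + ln P(ωᵢ), so with equal priors the rule reduces to nearest-mean classification, and unequal priors shift the boundary toward the less likely mean.

import numpy as np

def case1_classify(x, means, sigma, priors):
    # Minimum-distance classifier for Sigma_i = sigma^2 I, with prior terms.
    scores = [-np.sum((x - mu) ** 2) / (2.0 * sigma ** 2) + np.log(p)
              for mu, p in zip(means, priors)]
    return int(np.argmax(scores))

means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]     # illustrative means
print(case1_classify(np.array([1.8, 0.0]), means, sigma=1.0, priors=[0.7, 0.3]))
# With equal priors the boundary sits at x = 2, halfway between the means;
# the larger P(w1) pushes it toward the less likely mean.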
GAUSSIAN CLASSIFIERS
3-D case
GAUSSIAN CLASSIFIERS
Case 2: Σᵢ = Σ
Covariance matrices are arbitrary, but equal to each other
for all classes. Features then form hyper-ellipsoidal
clusters of equal size and shape.

Discriminant function is linear:

g_i(\mathbf{x}) = \mathbf{w}_i^{t} \mathbf{x} + w_{i0}

where

\mathbf{w}_i = \Sigma^{-1} \boldsymbol{\mu}_i,
\qquad
w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^{t} \Sigma^{-1} \boldsymbol{\mu}_i + \ln P(\omega_i)
GAUSSIAN CLASSIFIERS
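A short numpy sketch computing the linear-discriminant parameters for this case (following the formulas above); the shared covariance, mean, and prior are illustrative values.

import numpy as np

def case2_linear_params(mu_i, shared_cov, prior_i):
    # w_i = Sigma^-1 mu_i,  w_i0 = -1/2 mu_i^t Sigma^-1 mu_i + ln P(omega_i)
    w = np.linalg.solve(shared_cov, mu_i)
    w0 = -0.5 * float(mu_i @ w) + np.log(prior_i)
    return w, w0

shared_cov = np.array([[2.0, 0.5], [0.5, 1.0]])          # illustrative shared covariance
w, w0 = case2_linear_params(np.array([1.0, 2.0]), shared_cov, prior_i=0.5)
print(w, w0)                                              # g_i(x) = w @ x + w0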
Case 3: Σᵢ = arbitrary
The covariance matrices are different for each category.
All bets are off! In the two-class case, the decision boundaries are
hyperquadrics.
(Hyperquadrics are: hyperplanes, pairs of hyperplanes, hyperspheres,
hyperellipsoids, hyperparaboloids, and hyperhyperboloids.)



GAUSSIAN CLASSIFIERS
ARBITRARY COVARIANCES
Example: decision regions for two-dimensional Gaussian data
Chi-Square Goodness-of-fit test
A special type of hypothesis test that is often used to test the equivalence
of a probability density function of sampled data to some theoretical
density function is called the chi-squared goodness-of-fit test.
The procedure involves:

1. Using a statistic with an approximate chi-square distribution
2. Determining the degree of discrepancy between the observed and
theoretical frequencies
3. Testing a hypothesis of equivalence
Procedure
Consider a sample of N independent observations from a
random variable x with a probability density p(x). Let N
observations be grouped into K intervals, called class
intervals, which together form a frequency histogram.

The number of observations falling within the i-th class interval is
called the observed frequency, fᵢ.

The number of observations expected in the i-th class interval if the
true pdf is p₀(x) is called the expected frequency, Fᵢ.
Histogram of the observations grouped into K class intervals
(frequency versus value).
Procedure
The total discrepancy over all K intervals is given by

X^{2} = \sum_{i=1}^{K} \frac{(f_i - F_i)^{2}}{F_i}

The discrepancy in each interval is normalized by the associated
expected frequency.
The distribution of X² is approximately the same as the chi-squared
(χ²) distribution.
Hypothesis test
The region of acceptance is

X^{2} \le \chi^{2}_{n;\alpha}

where the value of \chi^{2}_{n;\alpha} is available from tables and the
number of degrees of freedom is n = K - 3.
If the expression is not true, the hypothesis that p(x) = p₀(x) is
rejected at the level of significance α;
otherwise, the hypothesis is accepted at the level α.
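A hedged end-to-end Python sketch of this procedure using numpy and scipy; the data are synthetic, the expected frequencies come from a normal model whose mean and standard deviation are estimated from the sample, and n = K - 3 degrees of freedom are used as stated above.

import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=3.3, size=200)         # synthetic "thermal noise" sample

K = 10
edges = np.linspace(x.min(), x.max(), K + 1)
f_obs, _ = np.histogram(x, bins=edges)                # observed frequencies f_i

mu, s = x.mean(), x.std(ddof=1)                       # parameters estimated from data
cdf = norm(loc=mu, scale=s).cdf(edges)
F_exp = len(x) * np.diff(cdf)                         # expected frequencies F_i
                                                      # (probability outside the range is ignored)
X2 = np.sum((f_obs - F_exp) ** 2 / F_exp)             # chi-square statistic
alpha, n = 0.05, K - 3                                # level and degrees of freedom
critical = chi2.ppf(1.0 - alpha, df=n)
print(X2, critical, "accept" if X2 <= critical else "reject")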
Basic ways to apply the chi-square goodness-of-fit test
There are two basic ways to apply the chi-square goodness-of-fit test:

1) Select intervals that produce equal expected frequencies
(different width from class to class).

2) Select class intervals of equal width
(different frequencies from class to class).

Equal width is the more commonly used choice, with a suggested
interval width of Δx = 0.4 s, where s is the standard deviation of
the data.
Example
A sample of N = 200 independent observations of the digital output of a
thermal noise generator is stored.
Test the noise generator output for normality by performing a
chi-square goodness-of-fit test at the level α = 0.05.
Interval width: Δx = 0.4 × 3.30 ≈ 1.3
Calculations
Results
Histogram of the generator output (figure), checked in MATLAB with:
[h p l c] = lillietest(x, 0.032)
Statistical independence and trend test
Situations often arise in data analysis where it is desired to
establish whether a sequence of observations or parameter estimates
is statistically independent or includes an underlying trend.
Two procedures are considered:
Run test
Reverse arrangements test
Both are distribution-free (nonparametric) procedures.
Run Test
The procedure is as follows
1. Consider a sequence of N observed values of a random
variable.
2. Classify the sequence into two mutually exclusive categories (+)
or (-).
3. Identify the runs. A run is a sequence of identical observations
that is preceded and followed by a different observation.
4. The number of runs that occur in a sequence of observations
gives an indication as to whether or not results are
independent.
Run Test
The sampling distribution of the number of runs is a random variable r
with mean and variance

\mu_r = \frac{2 N_1 N_2}{N} + 1

\sigma_r^{2} = \frac{2 N_1 N_2 (2 N_1 N_2 - N)}{N^{2}(N - 1)}

A tabulation of the 100α percentage points of the distribution of the
number of runs provides a means to test the hypothesis that the number
of runs falls in the interval

r_{n;\,1-\alpha/2} < r \le r_{n;\,\alpha/2}
EXAMPLE
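In place of the slide's worked example, here is a hedged Python sketch of the run-test statistics: the sequence is classified into (+)/(-) about its median, the runs are counted, and the mean and variance above are evaluated (the data are synthetic).

import numpy as np

def run_test_stats(x):
    # Count runs of (+)/(-) about the median and return (runs, mean, variance).
    signs = x > np.median(x)
    n1, n2 = int(signs.sum()), int((~signs).sum())
    n = n1 + n2
    runs = 1 + int(np.sum(signs[1:] != signs[:-1]))
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return runs, mean, var

x = np.random.default_rng(1).normal(size=50)
print(run_test_stats(x))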
Reverse Arrangements Test
Consider a sequence of N observations of a random variable x, where the
observations are denoted by xᵢ, i = 1, 2, 3, ..., N.
The procedure consists in counting the number of times that

xᵢ > xⱼ  for  i < j

This is called a reverse arrangement. The total number of reverse
arrangements is denoted by A.
EXAMPLE
Distribution of A
If the sequence of N observations consists of independent observations
of the same random variable, then the number of reverse arrangements is
a random variable A with mean and variance given by

\mu_A = \frac{N(N - 1)}{4}

\sigma_A^{2} = \frac{2N^{3} + 3N^{2} - 5N}{72}

The 100α percentage points of A can then be found in tables.
The test is good for detecting monotonic trends, but not so good for
detecting fluctuating trends.
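A short Python sketch that counts reverse arrangements and evaluates the mean and variance above; the data are synthetic and the O(N²) count is fine for small N.

import numpy as np

def reverse_arrangements(x):
    # Count pairs (i, j) with i < j and x_i > x_j, plus the mean/variance of A.
    x = np.asarray(x)
    n = len(x)
    A = sum(int(np.sum(x[i] > x[i + 1:])) for i in range(n - 1))
    mean = n * (n - 1) / 4.0
    var = (2 * n ** 3 + 3 * n ** 2 - 5 * n) / 72.0
    return A, mean, var

x = np.random.default_rng(2).normal(size=30)
print(reverse_arrangements(x))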
EXAMPLE
