Parameter Estimation

Pattern Classification

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
Introduction
— Data availability in a Bayesian framework
— We could design an optimal classifier if we knew:
— P(ωi) (priors)
— P(x | ωi) (class-conditional densities)
Unfortunately, we rarely have this complete information
— Design a classifier from a training sample
— No problem with prior estimation
— Samples are often too small for class-conditional estimation (large dimension of feature space!)

— A priori information about the problem
— Normality of P(x | ωi)

P(x | ωi) ~ N(µi, Σi)

— Characterized by 2 parameters: the mean vector µi and the covariance matrix Σi

— Estimation techniques
— Maximum-Likelihood (ML) and Bayesian estimation
— Results are nearly identical, but the approaches are different

• Parameters in ML estimation are fixed but unknown

• Bayesian methods view the parameters as random variables having some known distribution

— Best parameters are obtained by maximizing the probability of obtaining the samples observed

— In either approach, we use P(ωi | x) for our classification rule!

Maximum-Likelihood Estimation
— Has good convergence properties as the sample size increases
— Simpler than alternative techniques

— General principle

— Assume we have c classes and

— P(x | ωj) ~ N(µj, Σj)
— P(x | ωj) ≡ P(x | ωj, θj) where:

θj = (µj, Σj) = (µj^1, µj^2, …, σj^11, σj^22, …, cov(xj^m, xj^n), …)

Maximum-Likelihood Estimation
— Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category
— Suppose that D contains n samples, x1, x2, …, xn

P(D | θ) = ∏_{k=1}^{n} P(x_k | θ) = F(θ)

P(D | θ) is called the likelihood of θ with respect to the set of samples


— The ML estimate of θ is, by definition, the value that maximizes P(D | θ)
— “It is the value of θ that best agrees with the actually observed training sample”

— Optimal estimation
— Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator

∇θ = (∂/∂θ1, ∂/∂θ2, …, ∂/∂θp)^t

— We define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ)

— New problem statement: “Determine the θ that maximizes the log-likelihood”

θ̂ = arg max_θ l(θ)
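A minimal numerical sketch of this problem statement, assuming a univariate Gaussian model with a synthetic sample: the log-likelihood l(θ) = ∑_k ln P(x_k | θ) is formed and maximized directly with a general-purpose optimizer. The data, starting point, and log-σ parameterization are illustrative choices, not part of the slides.

```python
# Sketch: maximum-likelihood estimation by directly maximizing l(theta).
# Synthetic data and parameterization are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=200)      # synthetic training sample

def neg_log_likelihood(theta):
    mu, log_sigma = theta                         # sigma parameterized on the log scale to stay positive
    return -np.sum(norm.logpdf(D, loc=mu, scale=np.exp(log_sigma)))

theta_hat = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0])).x
print("mu_hat =", theta_hat[0], " sigma_hat =", np.exp(theta_hat[1]))
```

The optimizer's answer should agree closely with the closed-form estimates derived on the following slides.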

— The set of necessary conditions for an optimum is:

∇θ l = ∑_{k=1}^{n} ∇θ ln P(x_k | θ)

∇θ l = 0

Example of a specific case: Unknown µ
— P(x_k | µ) ~ N(µ, Σ)
— (Samples are drawn from a multivariate normal population)

ln P(x_k | µ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(x_k − µ)^t Σ^{-1} (x_k − µ)

and ∇µ ln P(x_k | µ) = Σ^{-1} (x_k − µ)

— θ = µ, therefore the ML estimate for µ must satisfy:

∑_{k=1}^{n} Σ^{-1} (x_k − µ̂) = 0

— Multiplying by Σ and rearranging, we obtain:

µ̂ = (1/n) ∑_{k=1}^{n} x_k

— Just the arithmetic average of the samples of the training data

— Conclusion:
— If P(x_k | ωj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform an optimal classification!
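A short sketch of this conclusion for the unknown-mean case, assuming a synthetic two-dimensional sample with known covariance (all values illustrative): the arithmetic average is the ML estimate µ̂, and the necessary condition ∑_k Σ^{-1}(x_k − µ̂) = 0 can be checked numerically.

```python
# Sketch: for samples drawn from N(mu, Sigma) with Sigma known, the ML estimate
# of mu is the arithmetic average of the training samples (synthetic data below).
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
mu_true = np.array([1.0, -2.0])
X = rng.multivariate_normal(mu_true, Sigma, size=500)        # n x d training sample

mu_hat = X.mean(axis=0)                                      # (1/n) * sum_k x_k
residual = np.linalg.inv(Sigma) @ (X - mu_hat).sum(axis=0)   # necessary condition, ~ zero vector
print("mu_hat =", mu_hat, " residual =", residual)
```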

Gaussian Case: Unknown µ and σ
— θ = (θ1, θ2) = (µ, σ²)

l = ln P(x_k | θ) = −(1/2) ln 2πθ2 − (1/(2θ2))(x_k − θ1)²

∇θ l = ( ∂(ln P(x_k | θ))/∂θ1 , ∂(ln P(x_k | θ))/∂θ2 )^t

     = ( (1/θ2)(x_k − θ1) , −1/(2θ2) + (x_k − θ1)²/(2θ2²) )^t

— Summation:

∑_{k=1}^{n} (1/θ̂2)(x_k − θ̂1) = 0          (1)

−∑_{k=1}^{n} 1/θ̂2 + ∑_{k=1}^{n} (x_k − θ̂1)²/θ̂2² = 0          (2)

— Combining (1) and (2):

µ̂ = (1/n) ∑_{k=1}^{n} x_k ;   σ̂² = (1/n) ∑_{k=1}^{n} (x_k − µ̂)²
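A brief sketch of these closed-form estimates on a synthetic univariate sample (values illustrative); note that σ̂² divides by n rather than n − 1, which is exactly the source of the bias discussed on the next slide.

```python
# Sketch: closed-form ML estimates obtained by solving conditions (1) and (2).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=1000)    # synthetic univariate sample

mu_hat = x.mean()                                # (1/n) sum_k x_k
sigma2_hat = ((x - mu_hat) ** 2).mean()          # (1/n) sum_k (x_k - mu_hat)^2, divides by n
print("mu_hat =", mu_hat, " sigma2_hat =", sigma2_hat)   # roughly 5.0 and 4.0
```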

Bias
• The ML estimate for σ² is biased, i.e., the expected value of the sample variance over all datasets of size n is not equal to the true variance:

E[ (1/n) ∑_i (x_i − x̄)² ] = ((n − 1)/n) σ² ≠ σ²

• An elementary unbiased estimator for Σ is the sample covariance matrix:

C = (1/(n − 1)) ∑_{k=1}^{n} (x_k − µ̂)(x_k − µ̂)^t
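A small simulation sketch of the bias statement (all numbers illustrative): averaging the divide-by-n variance over many synthetic datasets of size n recovers roughly (n − 1)/n times the true variance, while the divide-by-(n − 1) estimator does not show this shrinkage.

```python
# Sketch: empirical check of the (n-1)/n bias factor of the ML variance estimate.
import numpy as np

rng = np.random.default_rng(3)
n, true_var, trials = 5, 4.0, 200_000
samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))

ml_var = samples.var(axis=1, ddof=0)         # divide by n   (biased ML estimate)
unbiased_var = samples.var(axis=1, ddof=1)   # divide by n-1 (sample-covariance analogue)

print(ml_var.mean())         # close to (n-1)/n * 4.0 = 3.2
print(unbiased_var.mean())   # close to 4.0
```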

— If an estimator is unbiased for all distributions, then it is called absolutely unbiased

— If the estimator tends to become unbiased as the number of samples becomes very large, then the estimator is asymptotically unbiased

• Bayesian Estimation (BE)

• Bayesian Parameter Estimation: General Theory

• Bayesian Parameter Estimation: Gaussian Case

Bayesian Estimation
• Applying Bayesian learning to pattern classification problems
– In MLE, θ was assumed to be fixed
– In BE, θ is a random variable
– The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
– Goal: compute P(ωi | x, D)
– Given the sample D, the Bayes formula can be written

P(ωi | x, D) = P(x | ωi, D) P(ωi | D) / ∑_{j=1}^{c} P(x | ωj, D) P(ωj | D)
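A minimal sketch of this posterior computation for c = 2 classes, assuming the class-conditional densities estimated from D are univariate Gaussians; the means, variances, and priors below are made-up illustrative values.

```python
# Sketch: P(omega_i | x, D) for two classes with assumed (illustrative)
# class-conditional densities P(x | omega_i, D) and priors P(omega_i | D).
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])                      # P(omega_i | D)
class_conditionals = [norm(loc=0.0, scale=1.0),    # P(x | omega_1, D)
                      norm(loc=3.0, scale=2.0)]    # P(x | omega_2, D)

def posteriors(x):
    joint = np.array([p.pdf(x) for p in class_conditionals]) * priors
    return joint / joint.sum()                     # normalize over the c classes

print(posteriors(1.2))                             # posterior probabilities for x = 1.2
```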

Bayesian Parameter Estimation: Gaussian Case
• Goal: estimate θ using the a posteriori density P(θ | D)

– The univariate case: P(µ | D)
– µ is the only unknown parameter

p(x | µ) ~ N(µ, σ²)
p(µ) ~ N(µ0, σ0²)

– (µ0 and σ0 are known!)

P(µ | D) = P(D | µ) P(µ) / ∫ P(D | µ) P(µ) dµ = α ∏_{k=1}^{n} P(x_k | µ) P(µ)

with p(x | µ) ~ N(µ, σ²), p(µ) ~ N(µ0, σ0²), and P(µ | D) ~ N(µn, σn²)
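A small numerical sketch of exactly this product of likelihood and prior, evaluated on a grid of µ values; σ, µ0, σ0, the sample, and the grid are illustrative assumptions, and the grid normalization plays the role of the constant α.

```python
# Sketch: P(mu | D) computed numerically as the normalized product of the
# likelihood prod_k P(x_k | mu) and the prior P(mu) ~ N(mu_0, sigma_0^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
sigma, mu0, sigma0 = 1.0, 0.0, 2.0               # assumed known (illustrative) quantities
D = rng.normal(loc=1.5, scale=sigma, size=20)    # synthetic training sample

mu_grid = np.linspace(-4.0, 4.0, 2001)
log_post = norm.logpdf(mu_grid, mu0, sigma0)                       # ln P(mu)
log_post += norm.logpdf(D[:, None], mu_grid, sigma).sum(axis=0)    # + sum_k ln P(x_k | mu)

post = np.exp(log_post - log_post.max())
post /= post.sum() * (mu_grid[1] - mu_grid[0])   # normalization (the alpha factor)
print(mu_grid[post.argmax()])                    # posterior mode; P(mu | D) is approximately N(mu_n, sigma_n^2)
```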

– The univariate case P(x | D)
• P(µ | D) has been computed
• P(x | D) remains to be computed!

P(x | D) = ∫ P(x | µ) P(µ | D) dµ   is Gaussian

• It provides: P(x | D) ~ N(µn, σ² + σn²)

• (This is the desired class-conditional density P(x | Dj, ωj))
• Therefore: use P(x | Dj, ωj) together with P(ωj) and the Bayes formula for classification (a sketch follows below)
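For completeness, a sketch using the closed-form expressions for µn and σn². These are the standard conjugate-normal results from the textbook; they are not reproduced in the extracted text above, so treat them as an assumption of this sketch. The predictive density is then P(x | D) ~ N(µn, σ² + σn²).

```python
# Sketch using the standard conjugate-normal results (assumed, not shown above):
#   mu_n      = (n * sigma0^2 * xbar + sigma^2 * mu0) / (n * sigma0^2 + sigma^2)
#   sigma_n^2 = (sigma0^2 * sigma^2) / (n * sigma0^2 + sigma^2)
import numpy as np

def posterior_and_predictive(D, sigma2, mu0, sigma0_2):
    n, xbar = len(D), float(np.mean(D))
    mu_n = (n * sigma0_2 * xbar + sigma2 * mu0) / (n * sigma0_2 + sigma2)
    sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)
    return mu_n, sigma_n2, sigma2 + sigma_n2    # P(x | D) ~ N(mu_n, sigma^2 + sigma_n^2)

rng = np.random.default_rng(5)
D = rng.normal(loc=1.5, scale=1.0, size=20)     # synthetic training sample
print(posterior_and_predictive(D, sigma2=1.0, mu0=0.0, sigma0_2=4.0))
```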
Bayesian Parameter Estimation: General Theory
• The P(x | D) computation can be applied to any situation in which the unknown density can be parameterized; the basic assumptions are:
– The form of P(x | θ) is assumed known, but the value of θ is not known exactly
– Our knowledge about θ is assumed to be contained in a known prior density P(θ)
– The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn according to P(x)
Bayesian Parameter Estimation: General Theory…
• The basic problem is: “Compute the posterior density P(θ | D)”
• then “Derive P(x | D)”

• Using the Bayes formula: P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ

• And by the independence assumption: P(D | θ) = ∏_{k=1}^{n} P(x_k | θ)
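To illustrate how general this recipe is, the same two steps can be carried out numerically for a non-Gaussian model. In the sketch below, P(x | θ) is a Bernoulli density with unknown θ, P(θ) is a uniform prior, and both P(θ | D) and P(x | D) are computed on a grid; the data and grid resolution are illustrative assumptions.

```python
# Sketch of the general two-step recipe for a Bernoulli model with unknown theta:
# 1) P(theta | D) from the Bayes formula,  2) P(x | D) by integrating over theta.
import numpy as np

D = np.array([1, 0, 1, 1, 0, 1, 1, 1])                 # n binary training samples (illustrative)
theta = np.linspace(0.001, 0.999, 999)                 # grid over the parameter
prior = np.ones_like(theta)                            # uniform prior P(theta)

likelihood = theta ** D.sum() * (1 - theta) ** (len(D) - D.sum())   # prod_k P(x_k | theta)
dtheta = theta[1] - theta[0]
post = likelihood * prior
post /= post.sum() * dtheta                            # P(theta | D), normalized

p_x1_given_D = np.sum(theta * post) * dtheta           # P(x=1 | D) = integral of P(x=1 | theta) P(theta | D)
print(p_x1_given_D)                                    # about 0.7 for this sample
```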

Challenges with MLE
• Text classification:
– Every word unseen in the training data is assigned a probability of zero.
– Keywords: Bayes, nnet, classification, …
– Classes: ML/AI and not ML/AI
– A classifier trained to find articles on AI and ML would leave out an article containing “Metaverse” or “Meta”
Smoothing
• Add-1 (Laplace) smoothing: assume every seen or unseen event occurred once more than it did in the training data (a sketch follows after this list)

• Disadvantages of Add-1 smoothing:
– Takes away too much probability mass from seen events.
– Assigns too much total probability mass to unseen events.
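A minimal sketch of add-1 smoothing for unigram word probabilities, assuming a toy corpus and vocabulary (both made up): an unseen word such as “metaverse” now receives 1/(T + V) instead of zero, which also shows how much mass flows to unseen events.

```python
# Sketch: add-1 (Laplace) smoothed unigram probabilities over a toy corpus.
from collections import Counter

corpus = "bayes nnet classification bayes bayes nnet".split()
vocab = {"bayes", "nnet", "classification", "metaverse"}   # includes an unseen word

counts = Counter(corpus)
T, V = len(corpus), len(vocab)

def p_add1(word):
    # every event is counted once more than it occurred in the training data
    return (counts[word] + 1) / (T + V)

print(p_add1("bayes"), p_add1("metaverse"))   # 0.4 and 0.1 for this toy corpus
```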
Smoothing
How much probability does the model assign to strings/pixel values it hasn’t seen before?

• An empirical fact about language (Zipf’s Law):
• A small number of events occur with high frequency
• A large number of events occur with low frequency
Estimators

• f is a fudge factor
• The Expected Likelihood Estimator sets f = 0.5
• The Laplace and add-one estimators are the same thing and set f = 1
• add-tiny sets f = 1/T (a sketch of this add-f family follows after the reference link below)

http://kochanski.org/gpk/teaching/0401Oxford/GoodTuring.pdf
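The slide above lists only the values of the fudge factor f. The sketch below assumes the common add-f form P(X) = (N_X + f) / (T + f·V), where V is the number of distinct possible events; the exact definition in the linked notes may differ in detail, so treat the formula and the toy corpus as assumptions.

```python
# Sketch of the add-f family of estimators, assuming P(X) = (N_X + f) / (T + f * V).
from collections import Counter

corpus = "bayes nnet classification bayes bayes nnet".split()
counts = Counter(corpus)
T, V = len(corpus), 4                     # sample size and (assumed) number of possible events

def p_add_f(word, f):
    return (counts[word] + f) / (T + f * V)

for name, f in [("Laplace / add-one", 1.0),
                ("Expected Likelihood", 0.5),
                ("add-tiny", 1.0 / T)]:
    print(name, p_add_f("bayes", f), p_add_f("metaverse", f))
```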
Good-Turing Smoothing

• X is the event, N_X is the number of times you have seen event X, T is the sample size, and E(n) is an estimate of how many different events happened exactly n times.
• Translating that into text-analysis terms, X is a word, N_X is the number of times you have seen word X, T is the size of the corpus, and E(n) is an estimate of how many different words were observed exactly n times.
Good-Turing Smoothing
• Take a corpus of 30000 English words; our universe is all English words, and our event X is the word “unusualness.” The word “unusualness” appears once, so N_X = 1. In a reasonable corpus, you might have 10000 different words that appear once, so E(1) = 10000, and you might have 3000 words that appear twice, giving E(2) = 3000. The Good-Turing estimate of the probability of “unusualness” is then computed as in the sketch below.
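A sketch of the computation this example sets up, assuming the standard Good-Turing relation p_GT(X) = (N_X + 1)·E(N_X + 1) / (E(N_X)·T) described in the linked notes; the formula itself is not shown in the extracted text above.

```python
# Sketch: Good-Turing estimate for the worked example, assuming the standard
# relation p_GT(X) = (N_X + 1) * E(N_X + 1) / (E(N_X) * T).
T = 30_000           # corpus size
N_X = 1              # "unusualness" appears once
E = {1: 10_000,      # number of distinct words seen exactly once
     2: 3_000}       # number of distinct words seen exactly twice

p_gt = (N_X + 1) * E[N_X + 1] / (E[N_X] * T)
print(p_gt)          # 2e-05, versus the raw ML estimate N_X / T ≈ 3.3e-05
```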
