Parameter Estimation

Pattern Classification

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
Introduction
— Data availability in a Bayesian framework
— We could design an optimal classifier if we knew:
— P(ωi) (priors)
— P(x | ωi) (class-conditional densities)
Unfortunately, we rarely have this complete information
— Design a classifier from a training sample
— No problem with prior estimation
— Samples are often too small for class-conditional estimation (large dimension of feature space!)

— A priori information about the problem
— Normality of P(x | ωi)

P(x | ωi) ~ N(µi, Σi)

— Characterized by 2 parameters: the mean vector µi and the covariance matrix Σi

— Estimation techniques
— Maximum-Likelihood (ML) and Bayesian estimation
— Results are nearly identical, but the approaches are different

• Parameters in ML estimation are fixed but unknown

• Bayesian methods view the parameters as random variables having some known distribution

— Best parameters are obtained by maximizing the probability of obtaining the samples observed

— In either approach, we use P(ωi | x) for our classification rule!

Maximum-Likelihood Estimation
— Has good convergence properties as the sample size increases
— Simpler than alternative techniques

— General principle

— Assume we have c classes and

— P(x | ωj) ~ N(µj, Σj)
— P(x | ωj) ≡ P(x | ωj, θj) where:

θj = (µj, Σj) = (µj^1, µj^2, …, σj^11, σj^22, …, cov(xj^m, xj^n), …)

Maximum-Likelihood Estimation
— Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category
— Suppose that D contains n samples, x1, x2, …, xn

P(D | θ) = ∏_{k=1}^{n} P(x_k | θ) = F(θ)

P(D | θ) is called the likelihood of θ with respect to the set of samples


— The ML estimate of θ is, by definition, the value that maximizes P(D | θ)
— “It is the value of θ that best agrees with the actually observed training sample”

— Optimal estimation
— Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator

∇θ = (∂/∂θ1, ∂/∂θ2, …, ∂/∂θp)^t

— We define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ)

— New problem statement: “Determine the θ that maximizes the log-likelihood”

θ̂ = arg max_θ l(θ)
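A minimal numerical sketch of this problem statement, assuming a univariate Gaussian model with a synthetic sample: the log-likelihood l(θ) = ∑_k ln P(x_k | θ) is formed and maximized directly with a general-purpose optimizer. The data, starting point, and log-σ parameterization are illustrative choices, not part of the slides.

```python
# Sketch: maximum-likelihood estimation by directly maximizing l(theta).
# Synthetic data and parameterization are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=200)      # synthetic training sample

def neg_log_likelihood(theta):
    mu, log_sigma = theta                         # sigma parameterized on the log scale to stay positive
    return -np.sum(norm.logpdf(D, loc=mu, scale=np.exp(log_sigma)))

theta_hat = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0])).x
print("mu_hat =", theta_hat[0], " sigma_hat =", np.exp(theta_hat[1]))
```

The optimizer's answer should agree closely with the closed-form estimates derived on the following slides.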

— The set of necessary conditions for an optimum is:

∇θ l = ∑_{k=1}^{n} ∇θ ln P(x_k | θ)

∇θ l = 0

Example of a specific case: Unknown µ
— P(x_k | µ) ~ N(µ, Σ)
— (Samples are drawn from a multivariate normal population)

ln P(x_k | µ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(x_k − µ)^t Σ^{-1} (x_k − µ)

and ∇µ ln P(x_k | µ) = Σ^{-1} (x_k − µ)

— θ = µ, therefore the ML estimate for µ must satisfy:

∑_{k=1}^{n} Σ^{-1} (x_k − µ̂) = 0

— Multiplying by Σ and rearranging, we obtain:

µ̂ = (1/n) ∑_{k=1}^{n} x_k

— Just the arithmetic average of the samples of the training data

— Conclusion:
— If P(x_k | ωj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform an optimal classification!
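A short sketch of this conclusion for the unknown-mean case, assuming a synthetic two-dimensional sample with known covariance (all values illustrative): the arithmetic average is the ML estimate µ̂, and the necessary condition ∑_k Σ^{-1}(x_k − µ̂) = 0 can be checked numerically.

```python
# Sketch: for samples drawn from N(mu, Sigma) with Sigma known, the ML estimate
# of mu is the arithmetic average of the training samples (synthetic data below).
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
mu_true = np.array([1.0, -2.0])
X = rng.multivariate_normal(mu_true, Sigma, size=500)        # n x d training sample

mu_hat = X.mean(axis=0)                                      # (1/n) * sum_k x_k
residual = np.linalg.inv(Sigma) @ (X - mu_hat).sum(axis=0)   # necessary condition, ~ zero vector
print("mu_hat =", mu_hat, " residual =", residual)
```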

Gaussian Case: Unknown µ and σ
— θ = (θ1, θ2) = (µ, σ²)

l = ln P(x_k | θ) = −(1/2) ln 2πθ2 − (1/(2θ2))(x_k − θ1)²

∇θ l = ( ∂(ln P(x_k | θ))/∂θ1 , ∂(ln P(x_k | θ))/∂θ2 )^t

     = ( (1/θ2)(x_k − θ1) , −1/(2θ2) + (x_k − θ1)²/(2θ2²) )^t

— Summation:

∑_{k=1}^{n} (1/θ̂2)(x_k − θ̂1) = 0          (1)

−∑_{k=1}^{n} 1/θ̂2 + ∑_{k=1}^{n} (x_k − θ̂1)²/θ̂2² = 0          (2)

— Combining (1) and (2):

µ̂ = (1/n) ∑_{k=1}^{n} x_k ;   σ̂² = (1/n) ∑_{k=1}^{n} (x_k − µ̂)²
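A brief sketch of these closed-form estimates on a synthetic univariate sample (values illustrative); note that σ̂² divides by n rather than n − 1, which is exactly the source of the bias discussed on the next slide.

```python
# Sketch: closed-form ML estimates obtained by solving conditions (1) and (2).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=1000)    # synthetic univariate sample

mu_hat = x.mean()                                # (1/n) sum_k x_k
sigma2_hat = ((x - mu_hat) ** 2).mean()          # (1/n) sum_k (x_k - mu_hat)^2, divides by n
print("mu_hat =", mu_hat, " sigma2_hat =", sigma2_hat)   # roughly 5.0 and 4.0
```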

Bias
• The ML estimate for σ² is biased, i.e., the expected value of the sample variance over all datasets of size n is not equal to the true variance:

E[ (1/n) ∑_i (x_i − x̄)² ] = ((n − 1)/n) σ² ≠ σ²

• An elementary unbiased estimator for Σ is the sample covariance matrix:

C = (1/(n − 1)) ∑_{k=1}^{n} (x_k − µ̂)(x_k − µ̂)^t
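A small simulation sketch of the bias statement (all numbers illustrative): averaging the divide-by-n variance over many synthetic datasets of size n recovers roughly (n − 1)/n times the true variance, while the divide-by-(n − 1) estimator does not show this shrinkage.

```python
# Sketch: empirical check of the (n-1)/n bias factor of the ML variance estimate.
import numpy as np

rng = np.random.default_rng(3)
n, true_var, trials = 5, 4.0, 200_000
samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))

ml_var = samples.var(axis=1, ddof=0)         # divide by n   (biased ML estimate)
unbiased_var = samples.var(axis=1, ddof=1)   # divide by n-1 (sample-covariance analogue)

print(ml_var.mean())         # close to (n-1)/n * 4.0 = 3.2
print(unbiased_var.mean())   # close to 4.0
```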

— If an estimator is unbiased for all distributions, then it is called absolutely unbiased

— If the estimator tends to become unbiased as the number of samples becomes very large, then the estimator is asymptotically unbiased

• Bayesian Estimation (BE)

• Bayesian Parameter Estimation: General Theory

• Bayesian Parameter Estimation: Gaussian Case

Bayesian Estimation
• Applying Bayesian learning to pattern classification problems
– In MLE, θ was assumed to be fixed
– In BE, θ is a random variable
– The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
– Goal: compute P(ωi | x, D)
– Given the sample D, the Bayes formula can be written

P(ωi | x, D) = P(x | ωi, D) P(ωi | D) / ∑_{j=1}^{c} P(x | ωj, D) P(ωj | D)
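A minimal sketch of this posterior computation for c = 2 classes, assuming the class-conditional densities estimated from D are univariate Gaussians; the means, variances, and priors below are made-up illustrative values.

```python
# Sketch: P(omega_i | x, D) for two classes with assumed (illustrative)
# class-conditional densities P(x | omega_i, D) and priors P(omega_i | D).
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])                      # P(omega_i | D)
class_conditionals = [norm(loc=0.0, scale=1.0),    # P(x | omega_1, D)
                      norm(loc=3.0, scale=2.0)]    # P(x | omega_2, D)

def posteriors(x):
    joint = np.array([p.pdf(x) for p in class_conditionals]) * priors
    return joint / joint.sum()                     # normalize over the c classes

print(posteriors(1.2))                             # posterior probabilities for x = 1.2
```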

Bayesian Parameter Estimation: Gaussian Case
• Goal: estimate θ using the a posteriori density P(θ | D)

– The univariate case: P(µ | D)
– µ is the only unknown parameter

p(x | µ) ~ N(µ, σ²)
p(µ) ~ N(µ0, σ0²)

– (µ0 and σ0 are known!)

P(µ | D) = P(D | µ) P(µ) / ∫ P(D | µ) P(µ) dµ = α ∏_{k=1}^{n} P(x_k | µ) P(µ)

with p(x | µ) ~ N(µ, σ²), p(µ) ~ N(µ0, σ0²), and P(µ | D) ~ N(µn, σn²)
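A small numerical sketch of exactly this product of likelihood and prior, evaluated on a grid of µ values; σ, µ0, σ0, the sample, and the grid are illustrative assumptions, and the grid normalization plays the role of the constant α.

```python
# Sketch: P(mu | D) computed numerically as the normalized product of the
# likelihood prod_k P(x_k | mu) and the prior P(mu) ~ N(mu_0, sigma_0^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
sigma, mu0, sigma0 = 1.0, 0.0, 2.0               # assumed known (illustrative) quantities
D = rng.normal(loc=1.5, scale=sigma, size=20)    # synthetic training sample

mu_grid = np.linspace(-4.0, 4.0, 2001)
log_post = norm.logpdf(mu_grid, mu0, sigma0)                       # ln P(mu)
log_post += norm.logpdf(D[:, None], mu_grid, sigma).sum(axis=0)    # + sum_k ln P(x_k | mu)

post = np.exp(log_post - log_post.max())
post /= post.sum() * (mu_grid[1] - mu_grid[0])   # normalization (the alpha factor)
print(mu_grid[post.argmax()])                    # posterior mode; P(mu | D) is approximately N(mu_n, sigma_n^2)
```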

– The univariate case P(x | D)
• P(µ | D) has been computed
• P(x | D) remains to be computed!

P(x | D) = ∫ P(x | µ) P(µ | D) dµ   is Gaussian

• It provides: P(x | D) ~ N(µn, σ² + σn²)

• (This is the desired class-conditional density P(x | Dj, ωj))
• Therefore: use P(x | Dj, ωj) together with P(ωj) and the Bayes formula for classification (a sketch follows below)
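For completeness, a sketch using the closed-form expressions for µn and σn². These are the standard conjugate-normal results from the textbook; they are not reproduced in the extracted text above, so treat them as an assumption of this sketch. The predictive density is then P(x | D) ~ N(µn, σ² + σn²).

```python
# Sketch using the standard conjugate-normal results (assumed, not shown above):
#   mu_n      = (n * sigma0^2 * xbar + sigma^2 * mu0) / (n * sigma0^2 + sigma^2)
#   sigma_n^2 = (sigma0^2 * sigma^2) / (n * sigma0^2 + sigma^2)
import numpy as np

def posterior_and_predictive(D, sigma2, mu0, sigma0_2):
    n, xbar = len(D), float(np.mean(D))
    mu_n = (n * sigma0_2 * xbar + sigma2 * mu0) / (n * sigma0_2 + sigma2)
    sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)
    return mu_n, sigma_n2, sigma2 + sigma_n2    # P(x | D) ~ N(mu_n, sigma^2 + sigma_n^2)

rng = np.random.default_rng(5)
D = rng.normal(loc=1.5, scale=1.0, size=20)     # synthetic training sample
print(posterior_and_predictive(D, sigma2=1.0, mu0=0.0, sigma0_2=4.0))
```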
Bayesian Parameter Estimation: General Theory
• The P(x | D) computation can be applied to any situation in which the unknown density can be parameterized; the basic assumptions are:
– The form of P(x | θ) is assumed known, but the value of θ is not known exactly
– Our knowledge about θ is assumed to be contained in a known prior density P(θ)
– The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn according to P(x)
Bayesian Parameter Estimation: General Theory…
• The basic problem is: “Compute the posterior density P(θ | D)”
• then “Derive P(x | D)”

• Using the Bayes formula: P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ

• And by the independence assumption: P(D | θ) = ∏_{k=1}^{n} P(x_k | θ)
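To illustrate how general this recipe is, the same two steps can be carried out numerically for a non-Gaussian model. In the sketch below, P(x | θ) is a Bernoulli density with unknown θ, P(θ) is a uniform prior, and both P(θ | D) and P(x | D) are computed on a grid; the data and grid resolution are illustrative assumptions.

```python
# Sketch of the general two-step recipe for a Bernoulli model with unknown theta:
# 1) P(theta | D) from the Bayes formula,  2) P(x | D) by integrating over theta.
import numpy as np

D = np.array([1, 0, 1, 1, 0, 1, 1, 1])                 # n binary training samples (illustrative)
theta = np.linspace(0.001, 0.999, 999)                 # grid over the parameter
prior = np.ones_like(theta)                            # uniform prior P(theta)

likelihood = theta ** D.sum() * (1 - theta) ** (len(D) - D.sum())   # prod_k P(x_k | theta)
dtheta = theta[1] - theta[0]
post = likelihood * prior
post /= post.sum() * dtheta                            # P(theta | D), normalized

p_x1_given_D = np.sum(theta * post) * dtheta           # P(x=1 | D) = integral of P(x=1 | theta) P(theta | D)
print(p_x1_given_D)                                    # about 0.7 for this sample
```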

Challenges with MLE
• Text classification:
– Every word unseen in the training data is assigned a probability of zero.
– Keywords: Bayes, nnet, classification, …
– Classes: ML/AI and not ML/AI
– A classifier trained to find articles on AI and ML would leave out an article containing “Metaverse” or “Meta”
Smoothing
• Add-1 (Laplace) smoothing: assume every seen or unseen event occurred once more than it did in the training data (a sketch follows after this list)

• Disadvantages of Add-1 smoothing:
– Takes away too much probability mass from seen events.
– Assigns too much total probability mass to unseen events.
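A minimal sketch of add-1 smoothing for unigram word probabilities, assuming a toy corpus and vocabulary (both made up): an unseen word such as “metaverse” now receives 1/(T + V) instead of zero, which also shows how much mass flows to unseen events.

```python
# Sketch: add-1 (Laplace) smoothed unigram probabilities over a toy corpus.
from collections import Counter

corpus = "bayes nnet classification bayes bayes nnet".split()
vocab = {"bayes", "nnet", "classification", "metaverse"}   # includes an unseen word

counts = Counter(corpus)
T, V = len(corpus), len(vocab)

def p_add1(word):
    # every event is counted once more than it occurred in the training data
    return (counts[word] + 1) / (T + V)

print(p_add1("bayes"), p_add1("metaverse"))   # 0.4 and 0.1 for this toy corpus
```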
Smoothing
How much probability does the model assign to strings/pixel values it hasn’t seen before?

• An empirical fact about language (Zipf’s Law):
• A small number of events occur with high frequency
• A large number of events occur with low frequency
Estimators

• f is a fudge factor
• The Expected Likelihood Estimator sets f = 0.5
• The Laplace and add-one estimators are the same thing and set f = 1
• add-tiny sets f = 1/T (a sketch of this add-f family follows after the reference link below)

http://kochanski.org/gpk/teaching/0401Oxford/GoodTuring.pdf
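The slide above lists only the values of the fudge factor f. The sketch below assumes the common add-f form P(X) = (N_X + f) / (T + f·V), where V is the number of distinct possible events; the exact definition in the linked notes may differ in detail, so treat the formula and the toy corpus as assumptions.

```python
# Sketch of the add-f family of estimators, assuming P(X) = (N_X + f) / (T + f * V).
from collections import Counter

corpus = "bayes nnet classification bayes bayes nnet".split()
counts = Counter(corpus)
T, V = len(corpus), 4                     # sample size and (assumed) number of possible events

def p_add_f(word, f):
    return (counts[word] + f) / (T + f * V)

for name, f in [("Laplace / add-one", 1.0),
                ("Expected Likelihood", 0.5),
                ("add-tiny", 1.0 / T)]:
    print(name, p_add_f("bayes", f), p_add_f("metaverse", f))
```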
Good-Turing Smoothing

• X is the event, N_X is the number of times you have seen event X, T is the sample size, and E(n) is an estimate of how many different events happened exactly n times.
• Translating that into text-analysis terms, X is a word, N_X is the number of times you have seen word X, T is the size of the corpus, and E(n) is an estimate of how many different words were observed exactly n times.
Good-Turing Smoothing
• Take a corpus of 30000 English words; our universe is all English words, and our event X is the word “unusualness.” The word “unusualness” appears once, so N_X = 1. In a reasonable corpus, you might have 10000 different words that appear once, so E(1) = 10000, and you might have 3000 words that appear twice, giving E(2) = 3000. The Good-Turing estimate of the probability of “unusualness” is then computed as in the sketch below.
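A sketch of the computation this example sets up, assuming the standard Good-Turing relation p_GT(X) = (N_X + 1)·E(N_X + 1) / (E(N_X)·T) described in the linked notes; the formula itself is not shown in the extracted text above.

```python
# Sketch: Good-Turing estimate for the worked example, assuming the standard
# relation p_GT(X) = (N_X + 1) * E(N_X + 1) / (E(N_X) * T).
T = 30_000           # corpus size
N_X = 1              # "unusualness" appears once
E = {1: 10_000,      # number of distinct words seen exactly once
     2: 3_000}       # number of distinct words seen exactly twice

p_gt = (N_X + 1) * E[N_X + 1] / (E[N_X] * T)
print(p_gt)          # 2e-05, versus the raw ML estimate N_X / T ≈ 3.3e-05
```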
