Parameter Estimation
Classification
The Gaussian density is characterized by 2 parameters: the mean µ and the variance σ²
General principle
Set of necessary conditions for an optimum:

$$\nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln P(x_k \mid \theta) \quad \text{and} \quad \nabla_\theta l = 0$$
For the Gaussian case with unknown mean $\mu$ (and known covariance $\Sigma$):

$$\ln P(x_k \mid \mu) = -\frac{1}{2}\ln\left[(2\pi)^d \,\lvert\Sigma\rvert\right] - \frac{1}{2}(x_k - \mu)^t\, \Sigma^{-1}\, (x_k - \mu)$$

and

$$\nabla_\mu \ln P(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu)$$
Conclusion: if P(xk | wj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the parameter vector θ = (θ1, θ2, …, θc)t and perform optimal classification!
The one-dimensional maximum-likelihood estimates are:

$$\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k\,; \qquad \hat\sigma^2 = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat\mu)^2$$
Sample covariance matrix:

$$\hat\Sigma = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat\mu)(x_k - \hat\mu)^t$$
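To make these estimators concrete, here is a minimal numerical sketch; the `mle_gaussian` helper and the sampled data are illustrative, not from the slides:

```python
import numpy as np

def mle_gaussian(X):
    """Maximum-likelihood estimates for a Gaussian sample.

    X: (n, d) array of n i.i.d. d-dimensional samples.
    Returns the sample mean and the (biased, 1/n) sample covariance
    matrix, matching the ML formulas above.
    """
    n = X.shape[0]
    mu_hat = X.mean(axis=0)                  # (1/n) sum of x_k
    centered = X - mu_hat
    sigma_hat = (centered.T @ centered) / n  # (1/n) sum of (x_k - mu)(x_k - mu)^t
    return mu_hat, sigma_hat

# Hypothetical usage: 1000 samples from a known 2-D Gaussian.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.5], [0.5, 1.0]], size=1000)
mu_hat, sigma_hat = mle_gaussian(X)
print(mu_hat)      # close to [1, -2]
print(sigma_hat)   # close to [[2, 0.5], [0.5, 1]]
```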
Bayesian Parameter Estimation: Gaussian Case

$$p(x \mid \mu) \sim N(\mu, \sigma^2), \qquad p(\mu) \sim N(\mu_0, \sigma_0^2)$$
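For this conjugate model the posterior over µ is itself Gaussian; a sketch of the standard closed-form update (the `gaussian_posterior` name is ours):

```python
import numpy as np

def gaussian_posterior(x, mu0, sigma0_sq, sigma_sq):
    """Closed-form posterior p(mu | D) = N(mu_n, sigma_n^2) for the model
    above: p(x | mu) ~ N(mu, sigma^2) with sigma^2 known, and the prior
    p(mu) ~ N(mu0, sigma0^2)."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    denom = n * sigma0_sq + sigma_sq
    mu_n = (n * sigma0_sq * xbar + sigma_sq * mu0) / denom  # posterior mean
    sigma_n_sq = (sigma0_sq * sigma_sq) / denom             # posterior variance
    return mu_n, sigma_n_sq

# With more data (larger n), the posterior mean moves from the prior mu0
# toward the sample mean, and the posterior variance shrinks toward 0.
mu_n, sigma_n_sq = gaussian_posterior([0.9, 1.1, 1.3],
                                      mu0=0.0, sigma0_sq=4.0, sigma_sq=1.0)
```

The predictive density p(x | D) is then Gaussian as well, N(µn, σ² + σn²).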
• Desired: the class-conditional density P(x | Dj, wj)
• Therefore, P(x | Dj, wj) together with the priors P(wj) gives everything needed
• And we use Bayes formula for classification
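As a sketch of how the estimated densities and priors combine under Bayes formula, assuming hypothetical helpers `fit_classes` and `classify`:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_classes(X, y):
    """Per-class Gaussian estimates: mean, covariance, and prior P(w_j)."""
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        models[c] = (Xc.mean(axis=0),           # estimated class mean
                     np.cov(Xc, rowvar=False),  # 1/(n-1) covariance; close to the 1/n ML estimate
                     len(Xc) / len(X))          # prior P(w_j) from class frequencies
    return models

def classify(x, models):
    """Bayes decision rule: pick the class maximizing P(x | w_j) P(w_j)."""
    return max(models, key=lambda c: multivariate_normal.pdf(
        x, mean=models[c][0], cov=models[c][1]) * models[c][2])
```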
Bayesian Parameter Estimation:
General Theory
• The computation of P(x | D) can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:
– The form of P(x | θ) is assumed known, but the value of θ is not known exactly
– Our knowledge about θ is assumed to be contained in a known prior density P(θ)
– The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to P(x)
Bayesian Parameter Estimation:
General Theory…
• The basic problem is: “Compute the posterior density P(θ | D)”
• Then: “Derive P(x | D)”
• Using Bayes formula:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{\int P(D \mid \theta)\,P(\theta)\,d\theta}$$

• And, by the independence assumption:

$$P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$$
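Both formulas, together with the predictive density P(x | D) = ∫ P(x | θ) P(θ | D) dθ, can be checked numerically. A minimal sketch on a parameter grid, with an assumed 1-D Gaussian-mean model and made-up data:

```python
import numpy as np

# Assumed 1-D model: theta is the unknown mean of a Gaussian with known
# sigma; the grid, prior, and data below are illustrative only.
sigma = 1.0
theta = np.linspace(-5.0, 5.0, 1001)
dtheta = theta[1] - theta[0]
prior = np.exp(-theta**2 / (2 * 2.0**2))  # P(theta) ~ N(0, 2^2), unnormalized
data = np.array([0.8, 1.2, 1.0, 0.5])     # hypothetical samples

def gauss(x, mu, s):
    return np.exp(-(x - mu)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

# Independence assumption: P(D | theta) = prod_k P(x_k | theta)
likelihood = np.prod([gauss(xk, theta, sigma) for xk in data], axis=0)

# Bayes formula: P(theta | D) = P(D | theta) P(theta) / normalizer
numer = likelihood * prior
posterior = numer / (numer.sum() * dtheta)

# Predictive density: P(x | D) = integral of P(x | theta) P(theta | D) dtheta
x_new = 1.5
p_x_given_D = (gauss(x_new, theta, sigma) * posterior).sum() * dtheta
print(p_x_given_D)
```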
Zipf’s Law
Estimators
• f is a fudge factor added to each count; these correspond to Lidstone-style smoothing, P(w) = (C(w) + f) / (N + fB), with N observed tokens and B bins
• The Expected Likelihood Estimator sets f = 0.5
• The Laplace and add-one estimators are the same thing and set f = 1
• Add-tiny sets f = 1/T
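A minimal sketch of these add-f estimators as unigram smoothing; the `lidstone_probs` helper and toy vocabulary are illustrative:

```python
from collections import Counter

def lidstone_probs(tokens, f, vocab):
    """Add-f (Lidstone) smoothed unigram probabilities, as in the fudge-
    factor estimators above: f = 0.5 gives ELE, f = 1 gives Laplace/
    add-one. `vocab` fixes the number of bins B."""
    counts = Counter(tokens)
    N = len(tokens)
    B = len(vocab)
    return {w: (counts[w] + f) / (N + f * B) for w in vocab}

# Hypothetical usage: unseen words still get nonzero probability.
vocab = {"the", "cat", "sat", "mat", "dog"}
probs = lidstone_probs(["the", "cat", "sat"], f=0.5, vocab=vocab)
print(probs["dog"])  # > 0 even though "dog" was never observed
```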
http://kochanski.org/gpk/teaching/0401Oxford/GoodTuring.pdf
Good-Turing Smoothing
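The linked Kochanski note develops Good-Turing count re-estimation. As a minimal sketch, assuming the standard adjusted-count formula r* = (r + 1) N_{r+1} / N_r, where N_r is the number of distinct items seen exactly r times:

```python
from collections import Counter

def good_turing_counts(tokens):
    """Good-Turing adjusted counts via r* = (r + 1) * N_{r+1} / N_r.
    Real implementations smooth the N_r curve first (see the linked
    note); this sketch skips that step."""
    counts = Counter(tokens)
    freq_of_freq = Counter(counts.values())  # N_r: how many words occur r times
    adjusted = {}
    for w, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        # Fall back to the raw count when N_{r+1} = 0 (the usual caveat).
        adjusted[w] = (r + 1) * n_r1 / n_r if n_r1 else r
    return adjusted

# The total probability mass reserved for unseen events is N_1 / N.
```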