Estimation
Bing-Yi JING
Math, HKUST
Email: majing@ust.hk
1 Probability Models 2
  1.1 Motivations 2
  1.2 Models 3
  2.1 Sufficiency 8
  2.8 Exercises 20
3 Unbiased Estimation 21
  3.3 UMVUE 22
  3.6 UMVUE in Normal Linear Models 30
  3.8 Exercises 33
  4.3 Examples 36
  4.6 Exercises 42
5 Efficient Estimation 43
6 Transformation 56
  6.4 Preliminaries 60
  6.5 The δ-method 62
    6.5.1 Examples 63
  6.7 Exercises 68
7 Quantiles 69
  7.1 Definition 69
  7.7 Exercises 81
8 M-Estimation 82
  8.1 Definition 82
  8.7 Exercises 97
9 U-Statistics 99
  9.1 U-statistics 99
10 Jackknife 105
  10.4 Application of the jackknife to the smooth function of means models 115
11 Bootstrap 129
12 Bayes Rules and Empirical Bayes 146
Chapter 1
Probability Models
Statistical inference is based on probability models, which, broadly speaking, can be classified into
• Parametric models
• Non-parametric models
1.1 Motivations
We wish to draw useful information about the underlying population. For instance, we collect information on n people consisting of
1.2 Models
Parametric models
Examples
1. A random sample
Xi ∼i.i.d. N (µ, σ 2 ), i = 1, ..., n,
where θ = (µ, σ 2 ).
2. Linear regression
Yi = α + β ᵀ xi + εi , εi ∼ N (0, σ 2 ), i = 1, ..., n,
where θ = (α, β1 , ..., βp , σ 2 ).
Non-parametric models
Examples
1. A random sample
Xi ’s are i.i.d. with EXi2 < ∞.
2. Linear regression
Yi = α + βxi + εi , i = 1, ..., n,
where the εi are i.i.d. For instance,
• Eεi = 0, V (εi ) = σ 2 , or
• median(εi ) = 0, and IQR(εi ) = σ 2 .
3. Non-linear regression
• prone to overfitting
Semi-parametric models
These models all have their advantages and disadvantages, and are useful in different
situations. Exactly which model to use purely depends on the problem at hand.
We will study parametric models first before moving to nonparametric models later.
Exponential family
where η(θ) = (η1 (θ), ..., ηk (θ)) and T (x) = (T1 (x), ..., Tk (x)).
Note that B(θ) = ln ∫ exp{ ∑_{j=1}^k ηj (θ)Tj (x) } h(x) dx, since ∫ fθ (x) dx = 1.
2. Canonical family: Reparametrizing ηj = ηj (θ), we get a canonical family
fη (x) = exp {η · T (x) − A(η)} h(x),
where η = (η1 , ..., ηk ) is called the natural parameter.
3. The natural parameter space
Ξ = { η = (η1 , ..., ηk ) : ∫ exp{ ∑_{j=1}^k ηj Tj (x) } h(x) dx < ∞ }.
Examples of exponential families include Gaussian, Gamma, Binomial, Poisson, etc. How-
ever, examples of non-exponential families are Student-t, Uniform(0, θ), Cauchy, Double-
exponential.
The exponential family is closed under independent sampling, marginalization, and conditioning.
Proof. For simplicity, we only prove this for discrete r.v.'s. For t = (t1 , ..., tk ),
P (T = t) = ∑_{T (x)=t} P (X = x)
          = ∑_{T (x)=t} exp{η(θ) · T (x) − B(θ)} h(x)
          = ∑_{T (x)=t} exp{η(θ) · t − B(θ)} h(x)
          = exp{η(θ) · t − B(θ)} h0 (t),
where h0 (t) = ∑_{T (x)=t} h(x).
Theorem 1.4 Let X ∼ fη (x) = exp{η · T (x) − A(η)} h(x). The moment-generating function (m.g.f.) of T (X) is
M (s) := E e^{s·T (X)} = exp[A(s + η) − A(η)].
Proof. Note
M (s) = E e^{s·T (X)} = ∫ e^{s·T (x)} fη (x) dx = ∫ e^{(s+η)·T (x)−A(η)} h(x) dx
      = e^{A(s+η)−A(η)} ∫ e^{(s+η)·T (x)−A(s+η)} h(x) dx = e^{A(s+η)−A(η)} .
So its cumulant generating function (c.g.f.) is C(s) := log M (s) = A(s + η) − A(η). Hence
E(T (X)) = C ′ (s)|_{s=0} = A′ (η),   V ar(T (X)) = C ′′ (s)|_{s=0} = A′′ (η).
It is clear that exponential family requires the existence of all moments (very thin-
tailed distributions). This excludes cases like Student-t, Cauchy, etc.
Location-scale family
Location family: {F (x − a) : a ∈ R}   (i.e., X + a ∼ F (x − a));
Scale family: {F (x/b) : b > 0}   (i.e., bX ∼ F (x/b));
Location-scale family: {F ((x − a)/b) : a ∈ R, b > 0}   (i.e., a + bX ∼ F ((x − a)/b)).
In terms of densities:
Location family: {f (x − a) : a ∈ R};
Scale family: {(1/b) f (x/b) : b > 0};
Location-scale family: {(1/b) f ((x − a)/b) : a ∈ R, b > 0}.
Note a and b are NOT necessarily the mean and standard deviation, which may not even exist, e.g., for F = Cauchy.
1. If X1 , ..., Xn ∼ U (0, θ) (a scale family), then Xi /θ ∼ U nif (0, 1), and hence
EX(k) = kθ/(n + 1).
2. If Y1 , ..., Yn ∼ U (θ, θ + 1) (a location family), then Yi − θ ∼ U nif (0, 1), and hence
EY(k) = θ + k/(n + 1).
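A minimal Monte Carlo check of these two expectations (a sketch in Python; the values of n, k, θ and the number of replications are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
n, k, theta, B = 5, 2, 3.0, 200_000   # illustrative choices

# scale family U(0, theta): E X_(k) = k*theta/(n+1)
X = rng.uniform(0, theta, size=(B, n))
print(np.sort(X, axis=1)[:, k - 1].mean(), k * theta / (n + 1))

# location family U(theta, theta+1): E Y_(k) = theta + k/(n+1)
Y = rng.uniform(theta, theta + 1, size=(B, n))
print(np.sort(Y, axis=1)[:, k - 1].mean(), theta + k / (n + 1))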
Chapter 2
• Sufficiency principle
One can reduce X to a “sufficient” one T (X) without losing information on θ. That
is, T (X) contains all information about θ. (If we cannot reduce it any further, we
get the minimal sufficient statistics.)
• Ancillary statistics
They contain no information about θ. Properly chosen, this proves to be very useful
in (conditional) statistical inference. This is opposite to sufficiency.
• Basu's Theorem
Sufficient and ancillary statistics are independent for a complete family.
2.1 Sufficiency
Definition 2.1 T (X) is sufficient for θ if the distribution of X conditional on T (X) does
not depend on θ.
(That is, once T is known, the data X contains no more information about θ.)
Example 2.1
not depending on θ.
(Continuity of d.f. ensures no ties. One can remove the assumption by the Factor-
ization Theorem later.)
3. T = ∑_{i=1}^n Xi if X ∼ Bernoulli(p).
Proof. T ∼ Bin(n, p).
• If ∑_{i=1}^n xi ≠ t, then P (X = x|T (X) = t) = 0.
• If ∑_{i=1}^n xi = t, then P (X = x|T (X) = t) equals
P (X = x|T (X) = t) = P (X = x, T (X) = t) / P (∑_{i=1}^n Xi = t)   (1.1)
                    = P (X = x) / P (∑_{i=1}^n Xi = t)   (1.2)
Proof. We prove it for the discrete case only. Now P (X = x|T = t) = P (X = x, T (X) = t) / P (T (X) = t).
Hence T is sufficient.
• Suppose T (X) is sufficient. Then
fθ (x) = P (X = x) = P (X = x, T (X) = T (x))
(as {X = x} ⊂ {T (X) = T (x)})
= P (T (X) = T (x)) P (X = x|T (X) = T (x))
=: g(T (x), θ) h(x),
where sufficiency implies that h(x) is free of θ.
Example 2.2 fθ (x) = ∏_{i=1}^n fθ (xi ).
1. X ∼ Bernoulli(p), then
fθ (x) = ∏_{i=1}^n p^{xi} (1 − p)^{1−xi} = p^{∑ xi} (1 − p)^{n−∑ xi} .
So T = ∑_{i=1}^n Xi is sufficient.
So T = X(n) is sufficient.
Definition 2.2 T (X) is minimal sufficient if, given any other sufficient statistic S(X),
T (X) is a function of S(X).
1. If T is sufficient, So is S.
Proof.
Examples
Example 2.3 Clearly, (X(1) , ..., X(n) ) is sufficient, so the original data (X1 , ..., Xn ) is
never minimal sufficient.
Example 2.4 X ∼ U(θ, θ + 1), i.e., fθ (x) = I{θ < x < θ + 1}. Then
fθ (x) = ∏_{i=1}^n I{θ < xi < θ + 1} = I{θ < x(1) ≤ x(n) < θ + 1} = I{x(n) − 1 < θ < x(1) }.   (3.4)
By the factorization theorem, T = (X(1) , X(n) ) is sufficient.
To show minimal sufficiency, suppose that S(X) is also sufficient. By the factoriza-
tion theorem,
fθ (x) = g (S(x), θ) h (x) .
We only consider those x s.t. h(x) > 0, so all conclusions hold a.s. From (3.4),
Sufficient statistics can be obtained easily by Factorization theorem. Then one can apply
the following to verify minimal sufficiency.
1. T (X) is sufficient.
2. fθ (x)/fθ (y) is free of θ for all x, y =⇒ T (x) = T (y).
Proof. Let T̃ (X) be any sufficient statistic. For each x ∈ X , we have a pair (T̃ (x), T (x)). To show that T (x) is a function of T̃ (x), it suffices to prove that
Examples
Example 2.5 Find minimal sufficient statistics for the following cases:
1. X ∼ U(θ, θ + 1)
2. X ∼ U(0, θ)
3. X ∼ U(0, θ) with θ ≥ 1.
Solution.
1. First, fθ (x) = I{x(n) − 1 < θ < x(1) }, so T (x) = (x(1) , x(n) ) is sufficient by the Factorization Theorem.
If
2. The density of a single observation is θ −1 I{0 < x < θ}, so the joint density is fθ (x) = θ −n I(0 < x(1) ≤ x(n) < θ) = θ −n I(0 < x(1) ) I(x(n) < θ). Hence T (x) = x(n) is sufficient.
Now if
3. First the pdf of X is fθ (x) = θ−1 I{0 < x < θ}I{θ > 1}. Hence,
It is intuitively clear why Te(X) = max{X(n) , 1}, rather than X(n) , is a minimal
sufficient statistic: if X(n) < 1, say X(n) = 1/2, then 1/2 would not be a good estimator
of θ since we know θ > 1. A better choice would be 1. Of course, if X(n) ≥ 1, then X(n)
would be a very good estimator of θ.
2.4 Minimal sufficiency for special families
For full rank exponential families, finding minimal sufficient statistics is very easy.
is free of θ iff ∑_{j=1}^k aj ηj (θ) is free of θ, i.e.,
∑_{j=1}^k aj ηj (θ) = C(x, y).
Fix θ0 ∈ Θ; then ∑_{j=1}^k aj ηj (θ0 ) = C(x, y). Then
0 = ∑_{j=1}^k aj [ηj (θ) − ηj (θ0 )] = a · η̃(θ),   (4.6)
where a = (a1 , ..., ak )′ and η̃(θ) = (η1 (θ) − η1 (θ0 ), ..., ηk (θ) − ηk (θ0 ))′ .
are linearly independent. Then it follows from (4.6) that a = 0. That is, Ti (x) = Ti (y) for all 1 ≤ i ≤ k.
Examples
Example 2.6 Let X ∼ N (µ, σ 2 ); find a minimal sufficient statistic for θ = (µ, σ 2 ).
Solution. fθ (x) = (√(2π) σ)^{−1} e^{−(x−µ)²/(2σ²)} . Therefore,
fθ (x) = (√(2π) σ)^{−n} e^{−∑_{i=1}^n (xi −µ)²/(2σ²)} = C(θ) exp{ (µ/σ 2 ) ∑_{i=1}^n xi − (1/(2σ 2 )) ∑_{i=1}^n xi² }.
So k = 2,
η(θ) = (η1 , η2 ) = ( µ/σ 2 , −1/(2σ 2 ) )
and
T (x) = (T1 (x), T2 (x)) = ( ∑_{i=1}^n xi , ∑_{i=1}^n xi² ).
Clearly, the natural parameter space Ξ is a half space in R2 , which contains a 2-dim
rectangle. By Theorem 2.4, T (X) is a minimal sufficient statistic.
with sufficient statistic T (x) = (T1 (x), T2 (x)) = ( ∑_{i=1}^n xi , ∑_{i=1}^n xi² ). Since it is a curved exponential family (not of full rank), one cannot use the very last theorem to find the minimal sufficient statistic. However, we can use the simple rule given earlier. Now
r(θ) ≡ fθ (x)/fθ (y) = exp{ (1/θ)(T1 (x) − T1 (y)) − (1/(2θ 2 ))(T2 (x) − T2 (y)) }
is free of θ iff
(1/θ)(T1 (x) − T1 (y)) − (1/(2θ 2 ))(T2 (x) − T2 (y)) = C(x, y) = 0,
where the last equality is obtained by letting θ → ∞ on both sides. Again, since θ is arbitrary, we get (by taking θ = 1 and θ = 2)
2. If X ∼ exp(θ, 1) with pdf f (x−θ) = e−(x−θ) I(x > θ), then X(1) is minimal sufficient.
(Similar proof to U (0, θ) example.)
3. If X ∼ U (θ, θ + 1), X(1) , X(n) is minimal sufficient (shown earlier).
In some other cases, T (X) = X(1) , ..., X(n) is the maximum reduction (i.e. minimal
sufficiency).
Example 2.9 (Further reduction is impossible) T (X) = X(1) , ..., X(n) is minimal
sufficient if X ∼ fθ (x) = f (x − θ), where
1. (Double exponential) f (x) = (1/2) e^{−|x|} (all moments exist, and fθ (x) is not differentiable at θ = x).
2. (Cauchy) f (x) = 1/(π(1 + x2 )) (the mean does not exist).
3. (Logistic) f (x) = e^{−x} /(1 + e^{−x} )2 (µ = EX1 exists).
Hence,
g(θ) := ∑_{i=1}^n |xi − θ| = ∑_{i=1}^n |yi − θ|.
The value minimizing the above is the sample median. Thus, the two data sets X’s and
Y ’s have the same medians. Consider two cases:
• n = 2m + 1 is odd
By minimizing w.r.t. θ, we see that x’s and y’s have the same sample medians, i.e.,
x(m+1) = y(m+1) .
• n = 2m is even
Any value in [x(m) , x(m+1) ] is a median of x’s, which minimizes g(θ) w.r.t. θ.
Any value in [y(m) , y(m+1) ] is a median of y’s, which minimizes g(θ) w.r.t. θ.
These two sets must be the same, thus x(m) = y(m) and x(m+1) = y(m+1) .
Removing these two points, we still have even number of points left.
Continuing doing this, we get x(k) = y(k) for all k = 1, ..., n.
Remarks.
• For double exponential d.f., the sample median is a better estimator than the sample
mean. In fact, as can be seen in later chapters, the sample median is an MLE, and
hence asymptotically efficient.
• For Cauchy d.f., the sample mean is a bad choice. But sample median is OK, even
though it is not the asymptotic efficient estimator (MLE).
• For logistic d.f., we will find out later. Again, the sample median is better than the
sample mean, even though the optimal one is neither.
Example 2.10
Ancillary statistics are useful in conditional inference and also in hypotheses testing.
1. T is complete for θ if
Examples of complete families
Example 2.11 If X ∼ Bernoulli(p), T = ∑_{i=1}^n Xi is complete.
Example 2.12 If X1 , ..., Xn ∼ U (0, θ), then T = X(n) is complete.
Proof. F (x) = (x/θ)I(0 < x < θ) + I(x ≥ θ), and P (T ≤ t) = F (t)n . If Eθ g(T ) = 0 for all θ > 0, then
∫_0^θ g(t) n t^{n−1} θ −n dt = 0  =⇒  ∫_0^θ g(t) t^{n−1} dt = 0.
Differentiating w.r.t. θ, one gets g(θ)θ n−1 = 0, i.e., g(θ) = 0 for all θ > 0.
Pn−1
Example 2.13 Let X1 , ..., Xn ∼ Bin(1, p). Then ( i=1 Xi , Xn ) is not complete.
Example 2.14 Given X1 , ..., Xn ∼ U (θ, θ + 1), T = X(1) , X(n) is minimal sufficient,
but not complete.
Exponential family of full rank is complete
Then T (X) = (T1 (X), ..., Tk (X)) is complete (and minimal sufficient).
Proof. If Eη [g(T )] = 0, recalling fT (t) = exp{tη ′ − A(η)} h0 (t) for t = (t1 , ..., tk ) and η = (η1 , ..., ηk ), we have
Eη [g(T )] = ∫ g(t) fT (t) dt = ∫ g(t) exp{tη ′ − A(η)} h0 (t) dt = ∫ g(t) exp{tη ′ − A(η)} dλ(t) = 0,
for all η ∈ N (η0 ), where N (η0 ) = {η : ||η − η0 || < ε} for some ε > 0. In particular,
∫ g + (t) exp{tη0 ′ − A(η0 )} dλ(t) = ∫ g − (t) exp{tη0 ′ − A(η0 )} dλ(t) = C ≥ 0.
Sufficient statistics contain all the information about θ while ancillary statistics
contain no information about θ. It is natural to ask whether (minimal) sufficient statistics
and ancillary statistics are independent. This does not hold true in general unless bounded
completeness is imposed.
Theorem 2.6 (Basu’s Theorem) If T (X) is bounded complete and sufficient and V is
ancillary, then T (X) ⊥ V .
Note that g(T ) is free of θ as T is sufficient and V is ancillary. It is also bounded. Clearly,
Since T is bounded complete, we have g(T ) = 0 a.s., i.e., hence (7.8) holds.
Examples
Example 2.16 Given X ∼ N (µ, σ 2 ), X̄ and S 2 = n−1 ∑_{i=1}^n (Xi − X̄)2 are independent.
Proof. Fix σ 2 , and treat µ as the only parameter. Then X̄ is complete and sufficient for µ. On the other hand, S 2 is ancillary for µ. The result follows from Basu's theorem.
Alternative proof. Let Yi = Xi /σ, and define Ȳ and SY2 correspondingly. Note that Yi ∼ N (µ/σ, 1) =: N (µ̃, 1). Treating µ̃ as our new parameter, we can use Basu's Theorem as above.
Example 2.17 Given X1 , . . . , Xn ∼ fθ (x) = e−(x−θ) I{x > θ}, it is easy to show that X(1)
is complete and sufficient. Hence, X(1) ⊥ g(X), where g(X) is any ancillary statistics,
e.g.,
g(X) = S 2 , or (X(n) − X(1) ), or X2 − X1 , etc.
By Basu’s Theorem,
X(1) ⊥ g(X).
For example,
2
eX(1) − cos X(1) , and eS sin2 (X(n) − X(1) ) + log (1 + |X2 − X1 |) .
In this case, Basu’s Theorem allows us to deduce independence of two statistics without
finding their joint distributions, which is typically very messy to do.
2.8 Exercises
2. Let X1 , . . . , Xn be independent r.v.’s with densities fXi (x) = eiθ−x , for x ≥ iθ.
Prove that T = mini {Xi /i} is a sufficient statistic for θ.
6. Let X1 , · · · , Xn be i.i.d. random variables from U (a, b), where a < b. Show that
Zi := (X(i) − X(1) )/(X(n) − X(1) ), i = 2, ..., n − 1, are independent of (X(1) , X(n) ) for any a and b.
7. X1 , · · · , Xn are i.i.d. r.v.'s with pdf f (x) = θ −1 e^{−(x−a)/θ} I(a,∞) (x). Show that
(a) ∑_{i=1}^n (Xi − X(1) ) and X(1) are independent;
(b) Zi := (X(n) − X(i) )/(X(n) − X(n−1) ), i = 1, 2, ..., n − 2, are independent of (X(1) , ∑_{i=1}^n (Xi − X(1) )).
8. Let X have pmf
fθ (x) = θ for x = 0;  = (1 − θ)2 θ^{x−1} for x = 1, 2, ...;  = 0 otherwise,
where θ ∈ (0, 1). Show that X is boundedly complete, but not complete.
Chapter 3
Unbiased Estimation
• Bayesian estimator
• ......
A simple example will show that an optimal estimator does not exist in the class of all
estimators. There are many criteria for “optimality”. Here we use the minimization of
Mean Squared Errors (MSE):
where bias[T ] = ET − g(θ). Hence T is MSE-optimal if it has the smallest MSE for all
θ ∈ Θ.
Example 3.1 Let X1 , · · · , Xn be iid with EXi = µ and V ar(Xi ) = 1. Take three estima-
tors of µ:
T1 = X1 , T2 = 3, T3 = X̄.
Clearly,
M SE(T1 ) = 1,  M SE(T2 ) = (3 − µ)2 ,  M SE(T3 ) = 1/n.
We can plot their MSEs against µ.
• Neither T2 nor T3 dominates the other for all θ, i.e., there is no uniformly MSE-
optimal estimator between the two.
• If we have some vague knowledge that µ is near 3, we might take both estimators
into account, e.g., a linear combination of both estimators (Bayesian estimator):
T4 = λT2 + (1 − λ)T3 .
3.3 UMVUE
Definition
The problem arises because the class of all estimators under consideration (with second
moments) is too big. One way to get around is to focus on a smaller class of estimators,
i.e., all unbiased estimators (with second moments):
Characterization of a UMVUE.
Theorem 3.1
(a) Eη(T ) = g(θ);
(b) Cov[η(T ), U (T )] = E[η(T )U (T )] = 0, ∀θ ∈ Θ and ∀U ∈ Ũ{0} = {h(T ) : Eh(T ) = 0}.
Proof. We shall only prove the first part. The second part can be shown similarly.
Necessity. Suppose that δ is a UMVUE for g(θ). Fix U ∈ U{0} and θ ∈ Θ. Let
δ1 = δ + λU . Clearly Eδ1 = g(θ), and so
that is,
2λCov(U, δ) + λ2 V ar(U ) ≥ 0,
for all λ. The quadratic in λ has two roots λ1 = 0 and λ2 = −2Cov(U, δ)/V ar(U ), and
hence takes on negative values unless λ1 = λ2 , that is, Cov(U, δ) = 0.
Sufficiency. Suppose that Eθ (δU ) = 0 for all U ∈ U{0}. For any δ1 such that
E(δ1 ) = g(θ), we have δ1 − δ ∈ U{0}. Hence,
• E[U (X)] = 0.
• E[δ(X)U (X)] = 0.
• E[δ(X)] = g(θ).
Or
• E[U (T )] = 0.
• E[η(T )U (T )] = 0.
• E[η(T )] = g(θ).
Example 3.2 X ∼ Bin(n,p). Then, U{g(θ)} is nonempty (i.e., g(p) is U -estimable) iff
it is a polynomial in p of degree m (an integer), where 0 ≤ m ≤ n.
Proof. η(X) is an unbiased estimator iff g(p) = Eη(X) = ∑_{k=0}^n C(n, k) η(k) p^k (1 − p)^{n−k} , i.e. g(p) is a polynomial in p of degree m, where 0 ≤ m ≤ n.
Thus, g(p) = p^{−1} , p/(1 − p) and √p are not U -estimable.
Uniqueness of UMVUE, if it exists
Proof. If δ1 and δ2 are two UMVUEs, then δ1 −δ2 ∈ U{0}. By Characterization Theorem,
we have E[δ1 (δ1 − δ2 )] = 0 and E[δ2 (δ1 − δ2 )] = 0. So
Lehmann-Scheffe Theorem
Theorem 3.3 If T is C-S, and Eθ [η(T )] = g(θ), then η(T ) is the UMVUE for g(θ).
Rao-Blackwell Theorem
2. V ar(η(T )) ≤ V ar(δ).
2. This follows from the last and the next parts, and M SE(Y ) = bias(Y )2 + V ar(Y ).
3. Now
where
C = E ([δ − η][η − g(θ)]) = EE ([δ − η][η − g(θ)]|T ) = E {[η − g(θ)] (E[δ|T ] − η)} = 0.
Therefore, M SE(δ) ≥ M SE(η) and the equality holds iff δ = η(T ) a.s.
Theorem 3.5 If T is C-S and Eδ = g(θ), then η(T ) = E(δ|T ) is the UMVUE for g(θ).
– Solving equations
– Conditioning by Rao-Blackwell Theorem.
Solving equations
Example 3.3 X1 , . . . , Xn ∼ U (0, θ). Find a UMVUE for g(θ), where g is differentiable.
η(T ) = g(X(n) ) + (X(n) /n) g ′ (X(n) ).
(a) If g(θ) = θ, then η(T ) = ((n + 1)/n) X(n) .
(b) If g(θ) = θ 2 , then η(T ) = X(n)² + (2/n) X(n)² = ((n + 2)/n) X(n)² .
Remark 3.1 An MLE of g(θ) is g(X(n) ) with bias of order O(n−1 ).
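A small simulation contrasting the UMVUE (n + 1)X(n) /n with the MLE X(n) (a sketch; n, θ and the replication count are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
n, theta, B = 10, 2.0, 100_000
xmax = rng.uniform(0, theta, size=(B, n)).max(axis=1)

umvue = (n + 1) / n * xmax      # unbiased for theta
mle = xmax                      # biased, bias of order O(1/n)
print("means:", umvue.mean(), mle.mean(), " true theta:", theta)
print("MSEs :", ((umvue - theta) ** 2).mean(), ((mle - theta) ** 2).mean())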
Example 3.4 Let X1 , . . . , Xn ∼ Poisson(λ); find a UMVUE for g(λ), where g(λ) is infinitely differentiable.
Solution. T = ∑ Xi is C-S, and T ∼ P oisson(nλ), that is, fT (t) = e^{−nλ} (nλ)^t /t! for t = 0, 1, 2, ... Then
Eη(T ) = ∑_{k=0}^∞ η(k) e^{−nλ} (nλ)^k /k! = g(λ).
Then
∑_{k=0}^∞ η(k) n^k λ^k /k! = g(λ) e^{nλ} = ( ∑_{i=0}^∞ g^{(i)} (0)λ^i /i! )( ∑_{j=0}^∞ n^j λ^j /j! ) = ∑_{k=0}^∞ ak λ^k ,
and η is found by matching coefficients of λ^k . For instance, for g(λ) = λ^r ,
η(T ) = T ! n^{T −r} r! / ( n^T (T − r)! r! ) = T (T − 1) · · · (T − r + 1)/n^r  if T ≥ r;
      = 0  if T < r.
• X̄ and S 2 are independent, where S 2 = (n − 1)−1 ∑_{i=1}^n (Xi − X̄)2 .
• X̄ ∼ N (µ, σ 2 /n), and (n − 1)S 2 /σ 2 ∼ χ2_{n−1} , where the pdf of χ2_r is x^{r/2−1} e^{−x/2} /(2^{r/2} Γ(r/2)).
• Γ(α + 1) = αΓ(α).
Try
η(T ) = C X̄/S 2 .
Then Eη(T ) = Cµ E(1/S 2 ). Now taking r = n − 1, we have
E( σ 2 /((n − 1)S 2 ) ) = E( 1/χ2_{n−1} )
  = ∫_0^∞ (1/x) x^{r/2−1} e^{−x/2} /(2^{r/2} Γ(r/2)) dx
  = ∫_0^∞ [ x^{(r−2)/2−1} e^{−x/2} /(2^{(r−2)/2} Γ((r − 2)/2)) ] · [ 2^{(r−2)/2} Γ((r − 2)/2)/(2^{r/2} Γ(r/2)) ] dx
  = (1/2) Γ((n − 3)/2)/Γ((n − 1)/2)
  = 1/(n − 3).
Then
Eη(T ) = Cµ E(1/S 2 ) = Cµ (n − 1)/((n − 3)σ 2 ) = µ/σ 2 ,
if C = (n − 3)/(n − 1). That is, ((n − 3)/(n − 1)) X̄/S 2 is a UMVUE of µ/σ 2 .
g(λ) = P (X1 = k) = e^{−λ} λ^k /k!, for any k = 0, 1, 2, ...
Solution. Let δ = I{X1 = k}. Clearly, Eδ = g(λ). Now T = ∑ Xi is C-S, and T ∼ P oisson(nλ).
So a UMVUE for g(λ) is
η(T ) = C(T, k) (1/n)^k (1 − 1/n)^{T −k} .
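The unbiasedness of this η(T ) is easy to check by simulation, since η(T ) is the Bin(T, 1/n) pmf evaluated at k (a sketch; λ, n, k and the replication count are arbitrary illustrative choices):

import numpy as np
from math import factorial
from scipy.stats import binom

rng = np.random.default_rng(2)
lam, n, k, B = 1.5, 8, 2, 200_000
T = rng.poisson(lam, size=(B, n)).sum(axis=1)

# eta(T) = C(T,k) (1/n)^k (1 - 1/n)^(T-k); equals 0 automatically when T < k
eta = binom.pmf(k, T, 1.0 / n)
print("E eta(T) approx:", eta.mean(),
      "  g(lambda) =", np.exp(-lam) * lam ** k / factorial(k))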
Remark 3.2
When a C-S statistic is not available or not easily available, we can not use Lehmann-
Scheffe Theorem to find UMVUEs. Then we can try the Characterization Theorem.
Solution. It is known that T = X(n) is sufficient. But it is not complete. To show this, note that FT (t) = F n (t) = t n /θ n , and so
fT (t) = n t^{n−1} /θ n · I(0 < t < θ).
If Eθ g(T ) = 0, then
0 = ∫_0^θ t^{n−1} g(t) dt = ∫_0^1 t^{n−1} g(t) dt + ∫_1^θ t^{n−1} g(t) dt.
(3). Next we need to find η(T ) such that Eη(T ) = θ. We use two approaches.
Approach 1. Clearly, (1) and (2) are both satisfied by
η(t) = c if 0 < t ≤ 1;  = bt if t > 1.
Approach 2. Take
η(t) = c if 0 < t ≤ 1;  = η(t) if t > 1.   (5.1)
Here, we choose η(t) = c for 0 < t ≤ 1 since it satisfies (1). From Eη(T ) = θ, we get
n η(θ) θ^{n−1} = (n + 1) θ n .
Therefore, η(θ) = ((n + 1)/n) θ for θ > 1. Putting this back into (5.1), we get c = 1. Hence,
η(T ) = 1 if 0 < T ≤ 1;  = ((n + 1)/n) T if T > 1.
Remark 3.3 If x(n) < 1, we estimate θ by 1. If x(n) ≥ 1, we just carry on as usual.
η(T ) = 1 if T0 = 1;  = (1 + n−1 ) T0 if T0 > 1.
1. Is it U -estimable? If not, it is the end of story. If so, move on to the next step.
• Even if UMVUE’s exist, they are usually quite difficult to work out.
We have considered UMVUE for i.i.d. r.v.’s. We now consider independent r.v.’s.
Xi = ξi + εi = α + βti + εi , i = 1, ..., n,
Solution. The joint pdf is
f (x) = ∏_{i=1}^n (2πσ 2 )^{−1/2} exp{ −(xi − α − βti )2 /(2σ 2 ) }
      = C(θ) exp{ −(1/(2σ 2 )) ∑_{i=1}^n xi² + (α/σ 2 ) ∑_{i=1}^n xi + (β/σ 2 ) ∑_{i=1}^n ti xi }.
Hence, T = ( ∑_{i=1}^n xi , ∑_{i=1}^n ti xi , ∑_{i=1}^n xi² ) is C-S for θ̃ = (1/σ 2 , α/σ 2 , β/σ 2 ), or equivalently
Since the LSEs of α and β are unbiased and are functions of T , hence they must be
UMVUE. Similarly, the UMVUE of σ 2 is given by σ̂ 2 = SSE/(n − 2).
Rather than dealing with every single case, here we present some unified treatment for
independent case, in particular, the ”Normal Linear Models”:
Xi ∼ N (ξi , σ 2 ), i = 1, ..., n,
or equivalently
X ∼ N (ξ, σ 2 I),
where the Xi ’s are independent and ξ = (ξ1 , ..., ξn ) ∈ Ls , an s-dimensional linear subspace
of Rn (s < n). Let
ηs+1 = ... = ηn = 0.
As a result, we have
fY (y) = ∏_{i=1}^n (2πσ 2 )^{−1/2} exp{ −(yi − ηi )2 /(2σ 2 ) }
       = C1 (θ) exp{ −∑_{i=1}^s (yi − ηi )2 /(2σ 2 ) − ∑_{i=s+1}^n yi² /(2σ 2 ) }
       = C1 (θ) exp{ −∑_{i=1}^n yi² /(2σ 2 ) + ∑_{i=1}^s ηi yi /σ 2 }.
Hence, T = (Y1 , ..., Ys , ∑_{i=1}^n Yi² ), or equivalently T̃ = (Y1 , ..., Ys , ∑_{i=s+1}^n Yi² ), is C-S.
Theorem 3.6 The UMVUEs of ∑_{i=1}^s λi ηi and σ 2 are, respectively,
∑_{i=1}^s λi Yi , and σ̂ 2 = ∑_{i=s+1}^n Yi² /(n − s).
Proof. They are unbiased and functions of T , hence UMVUE by L-S theorem.
Proof. From ∑_{i=1}^n (Xi − ξi )2 = ∑_{i=1}^n (Yi − ηi )2 = ∑_{i=1}^s (Yi − ηi )2 + ∑_{i=s+1}^n Yi² , we get
η̂ = (Y1 , ..., Ys , 0, ..., 0) = ξ̂ C.
Thus ξ̂ = η̂ C T , and hence ∑_{i=1}^s λi ξ̂i , is a function of Y1 , ..., Ys . But η = E η̂ = E ξ̂ C = ξC,
So the LSEs, and hence the UMVUEs, of the ξi are ξ̂i = X̄i· . And the UMVUE of σ 2 is
σ̂ 2 = ∑_{i=1}^s ∑_{j=1}^{ni} (Xij − X̄i· )2 / (n − s).
3.8 Exercises
1. T = (X1 , . . . , Xn ), where the Xi are iid with density f (x), n ≥ 2. Show that T is sufficient, but not complete.
3. Let X1 , . . . , Xn ∼iid N (µ, σ 2 ). Find a UMVUE for µ2 /σ.
4. X1 , . . . , Xn ∼iid P oisson(λ). Find a UMVUE for e−tλ for t > 0.
5. X1 , . . . , Xn ∼iid U (θ1 , θ2 ), where θ1 < θ2 . Find a UMVUE for (θ1 + θ2 )/2.
6. Let X1 , . . . , Xn be iid Bin(1, p), where p ∈ (0, 1). Find the UMVUE of
8. Let X1 , . . . , Xn ∼ Bin(k, p). Find a UMVUE for g(p) = P (X1 = 1) = kp(1 − p)k−1 .
9. X1 , . . . , Xn ∼iid N (0, σ 2 ). We wish to find a UMVUE of σ 2 .
Chapter 4
Likelihood-based methods lie at the heart of statistics. They have been widely used in practice, more so than you perhaps realise. Some methods, e.g., least squares estimation (LSE), logistic regression, etc., which are seemingly unrelated to likelihood, can be cast
into the likelihood framework. Others, e.g., Bayesian methods to be studied later, can be
loosely regarded as the extension of likelihood methods.
The idea of the maximum likelihood estimate (MLE), first introduced by R.A. Fisher,
is simple and yet very powerful. It enjoys many nice properties, such as invariance prin-
ciple, asymptotic efficiency, existence principle. Of course, like every method, it also has
drawbacks, e.g. strong assumptions, non-robust, etc.
Let θ̂ be an estimate of θ.
θ̂ = 0 if X = 0, 3
= 1 if X = 1, 2, 3.
Definition 4.1 Suppose X ∼ fθ (x), where θ ∈ Θ ⊂ Rk . Let Θ̄ be the closure of Θ.
• Given X = x, define
i.e.,
θ̂ = arg sup L (θ(x)) = arg sup l (θ(x)) .
θ∈Θ θ∈Θ
We note that MLE may not be unique (unlike UMVUE). Furthermore, if L(θ) is
differentiable, possible candidates for MLE’s satisfy
One very attractive property of the MLE is its invariance. To be more precise, let η = g(θ),
where g(·) is a function from Rk to Rd . Basically, then the invariance property of MLE
means: if θ̂ is an MLE of θ, then g(θ̂) is an MLE of g(θ).
Theorem 4.1 Let X ∼ fθ (x) and η = g(θ) where g(·) is 1-to-1. If θ̂ is an MLE for θ,
then g(θ̂) is an MLE for g(θ).
L(θ̂) = L1 (g(θ̂))
Since θ̂ is an MLE, hence L(θ) and L1 (η) are maximized by θ̂ and g(θ̂), respectively. That
is, g(θ̂) is an MLE of η.
4.2.2 g(·) is not 1-to-1
If g(·) is NOT 1-to-1, η = g(θ) is no longer a parameter, since once knowing η, we could
have more than one θ’s, say, θ1 and θ2 , such that η = g(θ1 ) = g(θ2 ), i.e., we may not be
able to completely specify θ in the parameter space Θ (non-identifiable). This is bad since
we don’t know whether X ∼ fθ1 (x) or X ∼ fθ2 (x).
L̃(η) = sup_{θ∈Θ: g(θ)=η} L(θ).
Such a η̂ always exists in g(Θ) (not necessarily in g(Θ)), and is still called a maxi-
mum likelihood estimator (MLE) of θ.
L̃(η̂) = sup_{η∈g(Θ)} L̃(η) = sup_η sup_{θ∈Θ: g(θ)=η} L(θ) = sup_{θ∈Θ} L(θ) = L(θ̂).
L̃(g(θ̂)) = sup_{θ∈Θ: g(θ)=g(θ̂)} L(θ) = L(θ̂).
4.3 Examples
Example 4.2 Let X1 , ..., Xn ∼ Bin(1, p) with p ∈ (0, 1). Find an MLE for p2 .
Solution. L(p) = ∏_{i=1}^n p^{xi} (1 − p)^{1−xi} , so
l(p) = n[ x̄ log p + (1 − x̄) log(1 − p) ].
Setting
l′ (p) = n[ x̄/p − (1 − x̄)/(1 − p) ] = n(x̄ − p)/(p(1 − p)) = 0,
we get p̂ = x̄. Since
l′′ (p) = −n[ x̄/p2 + (1 − x̄)/(1 − p)2 ] < 0,
l(p) is a concave function. Then p̂ = x̄ is an MLE, and by the invariance property an MLE of p2 is p̂ 2 = x̄ 2 .
Remark 4.1 (a) If X̄ = 0 or 1, then p̂ = 0 or 1 ∉ Θ = (0, 1). (b) Note that in this example, the MLE X̄ is also the UMVUE.
Example 4.3 Let X1 , ..., Xn ∼ P oisson(λ) with λ > 0. Find an MLE for λ.
λ̂ = x̄.
Remark 4.2 (i) If X̄ = 0, then λ̂ = 0 ∉ Θ = (0, ∞). (ii) Note that in this example, the MLE X̄ is also the UMVUE.
Example 4.4 Let X1 , ..., Xn ∼ N (µ, σ 2 ) with µ ∈ R and σ 2 > 0, where n ≥ 2. Find an
MLE for θ = (µ, σ 2 ).
Solution. Note
L(θ) = (2πσ 2 )^{−n/2} exp{ −(1/(2σ 2 )) ∑_{i=1}^n (xi − µ)2 },
and
l(θ) = −(1/(2σ 2 )) ∑_{i=1}^n (xi − µ)2 − (n/2) log σ 2 − (n/2) log(2π).
The log-likelihood equations become
l′ (θ) = ( lµ ′ (θ), lσ2 ′ (θ) )ᵀ = ( (1/σ 2 ) ∑_{i=1}^n (xi − µ), (1/(2σ 4 )) ∑_{i=1}^n (xi − µ)2 − n/(2σ 2 ) )ᵀ = (0, 0)ᵀ .
Therefore, we get θ̂ = (µ̂, σ̂ 2 ), where
µ̂ = x̄, and σ̂ 2 = n−1 ∑_{i=1}^n (xi − x̄)2 .
It can be shown that l′′ (θ) is negative definite (exercise). Hence, the MLE of θ is
µ̂ = X̄, and σ̂ 2 = n−1 ∑_{i=1}^n (Xi − X̄)2 .
Example 4.5 Let X1 , ..., Xn ∼ N (µ, σ 2 ) with µ ≥ 0 and σ 2 > 0, where n ≥ 2. Find an
MLE for θ = (µ, σ 2 ).
If x̄ ≥ 0, the unconstrained maximizer is feasible, resulting in
µ̂ = x̄, and σ̂ 2 = n−1 ∑_{i=1}^n (xi − x̄)2 .
If x̄ < 0: for each (fixed) σ 2 , l(θ) attains its maximum over µ ≥ 0 at µ = 0, since x̄ ≤ 0. Then
l(0, σ 2 ) = −(1/(2σ 2 )) ∑_{i=1}^n (xi − x̄)2 − (n/(2σ 2 )) x̄ 2 − (n/2) log σ 2 − (n/2) log(2π),
and maximizing over σ 2 gives
µ̂ = 0, σ̂ 2 = n−1 ∑ xi² .
In summary, an MLE of θ is
θ̂ = (0, n−1 ∑ Xi² ) if X̄ < 0;  = (X̄, n−1 ∑ (Xi − X̄)2 ) if X̄ ≥ 0.
4.4 Numerical solution to likelihood equations
Typically, there are no explicit solutions to likelihood equations, and some numerical
methods have to be used to compute MLE’s. We start with a couple of examples.
Newton’s algorithm.
Note that θ̂ →p θ, and n−1 l(θ), n−1 l′ (θ) and n−1 l′′ (θ) are all sample means of i.i.d. r.v.'s. Thus
0 = n−1 l′ (θ̂) ≈ n−1 l′ (θ) + (θ̂ − θ) n−1 l′′ (θ) ≈ n−1 l′ (θ) + (θ̂ − θ) n−1 E l′′ (θ).
Then
θ̂ ≈ θ − l′ (θ)[l′′ (θ)]−1 ≈ θ − l′ (θ)[E l′′ (θ)]−1 .
• Newton-Raphson's algorithm:
θj+1 = θj − l′ (θj )[l′′ (θj )]−1 , j = 0, 1, 2, ...
• Scoring method:
θj+1 = θj − l′ (θj )[E l′′ (θj )]−1 , j = 0, 1, 2, ...
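A minimal sketch of both iterations for a one-parameter model. The Cauchy location family is used purely as an illustration: there l′ (θ) = ∑ 2(xi − θ)/(1 + (xi − θ)2 ), l′′ (θ) = ∑ 2((xi − θ)2 − 1)/(1 + (xi − θ)2 )2 , and E l′′ (θ) = −n/2; the data, sample size and starting value below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
theta_true, n = 1.0, 200                       # illustrative choices
x = theta_true + rng.standard_cauchy(n)

def score(th):        # l'(theta) for the Cauchy location family
    u = x - th
    return np.sum(2 * u / (1 + u ** 2))

def hessian(th):      # l''(theta)
    u = x - th
    return np.sum(2 * (u ** 2 - 1) / (1 + u ** 2) ** 2)

th_nr = th_sc = np.median(x)                   # a sensible starting value
for _ in range(20):
    th_nr = th_nr - score(th_nr) / hessian(th_nr)   # Newton-Raphson
    th_sc = th_sc - score(th_sc) / (-n / 2.0)       # scoring: E l''(theta) = -n/2
print(th_nr, th_sc)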
Solution. Note
L(θ) = ∏_{i=1}^n I(θ − 1/2 < xi < θ + 1/2) = I(θ − 1/2 < x(1) ≤ x(n) < θ + 1/2) = I(x(n) − 1/2 < θ < x(1) + 1/2).
By plotting L(θ) against θ, we see that θ̂ can be any value in (x(n) − 1/2, x(1) + 1/2). So we have infinitely many MLEs, and the length of this interval is x(1) + 1/2 − (x(n) − 1/2) = 1 − (x(n) − x(1) ) → 0 as n → ∞.
4.5.2 MLE may not be asymptotically normal
θ̂ = X(n) .
Now
WLLN, we get X̄ →p µ.
When there are many nuisance parameters, then MLE’s can behave very badly even for
large samples, as shown below.
......
Xp1 , · · · , Xpn ∼ N (µp , σ 2 ).
Let θ = (µ1 , · · · , µp , σ 2 ). Equivalently,
Xi = (X1i , ..., Xpi ) ∼ M N (µ, σ 2 I), where µ = (µ1 , ..., µp ), and i = 1, ..., n.
Show that
1. an MLE of θ is θ̂, where µ̂i = n−1 ∑_{j=1}^n xij = x̄i· , and σ̂ 2 = (pn)−1 ∑_{i=1}^p ∑_{j=1}^n (xij − x̄i· )2 ;
3. if n is fixed and p → ∞, then θ̂ is NOT consistent for θ.
Proof.
Thus,
E σ̂ 2 = ((n − 1)/n) σ 2 = σ 2 − σ 2 /n,  Bias(σ̂ 2 ) = −σ 2 /n,  V ar(σ̂ 2 ) = 2(n − 1)σ 4 /(pn2 ).
3. If n is fixed, p → ∞, then
The third part is often referred to as the "small n, large p" problem in the literature (p ≫ n). There are several ways to overcome this difficulty.
4.6 Exercises
Chapter 5
Efficient Estimation
Let X ∼ fθ (x), and l(θ) = log fθ (X), where θ ∈ Rk . Define the Fisher Information of θ
contained in X by
IX (θ) = E[ (l′ (θ))ᵀ_{k×1} (l′ (θ))_{1×k} ].
Theorem 5.1 (Cramer-Rao (C-R) lower bound) Let Eδ(X) = g(θ). Then
where the equality holds iff δ(x) and l0 (θ) are linearly dependent, provided
∂ ∂
Z Z
h(x)fθ (x)dx = h(x) fθ (x)dx, for θ ∈ Θ
∂θ ∂θ
for h(x) = 1 and h(x) = δ(x),
Proof
where the equality holds iff δ and Y are linearly dependent. Here,
0 ≤ V ar(δ − λY ᵀ ) = Cov(δ − λY ᵀ , δ − λY ᵀ ) = V ar(δ) − 2Cov(δ, Y )λᵀ + λV ar(Y )λᵀ =: g(λ).
Setting g ′ (λ) = 2λV ar(Y ) − 2Cov(δ, Y ) = 0, we get the stationary point
λ0 = Cov(δ, Y )[V ar(Y )]−1 .
Thus,
g(λ0 ) = V ar(δ) − Cov(δ, Y )[V ar(Y )]−1 [Cov(δ, Y )]ᵀ ≥ 0,
where equality holds iff g(λ0 ) = 0 = V ar(δ − λ0 Y ᵀ ), iff δ − λ0 Y ᵀ = C.
(b) Take
Y = l′ (θ) = ∂/∂θ log fθ (X) = (∂fθ (X)/∂θ)/fθ (X).
Differentiating (w.r.t. θ)
1 = ∫ fθ (x)dx, and g(θ) = E(δ(X)) = ∫ δ(x)fθ (x)dx,
we get
0 = ∫ ∂fθ (x)/∂θ dx = ∫ l′ (θ)fθ (x)dx = El′ (θ) = EY,
g ′ (θ) = ∂/∂θ Eδ(X) = ∫ δ(x) ∂fθ (x)/∂θ dx = ∫ δ(x)l′ (θ)fθ (x)dx = E[δ(X)l′ (θ)] = cov(δ(X), l′ (θ)) = cov(δ(X), Y ),
IX (θ) = E[ (l′ (θ))ᵀ_{k×1} (l′ (θ))_{1×k} ] = E(Y ᵀ Y ) = V ar(Y ).
5. Let θ = θ(η) be a map from Rm to Rk , where m, k ≥ 1. Assume that θ(·) is differentiable. Then the Fisher information that X contains about η is
IX (η) = (∂θ/∂η ᵀ )_{m×k} IX (θ) (∂θ ᵀ /∂η)_{k×m} .
Indeed,
IX (η) = E[ (∂l(θ)/∂η)ᵀ (∂l(θ)/∂η) ] = E[ (l′ (θ)θ ′ (η))ᵀ (l′ (θ)θ ′ (η)) ] = (θ ′ (η))ᵀ E[ (l′ (θ))ᵀ l′ (θ) ] θ ′ (η) = (θ ′ (η))ᵀ IX (θ) θ ′ (η).
6. If θ = h(η) is 1-1 (from Rk to Rk ), then the Cramer-Rao lower bound remains the
same (i.e., reparametrization invariant).
Proof. X1 , ..., Xn ∼ fθ (x), and θ = h(η) is 1-1. Now Eδ(X) = g(θ) = g(h(η)) =
r(η). By C-R lower bound theorem,
V ar(δ(X)) ≥ g ′ (θ)_{1×k} [IX (θ)]−1_{k×k} (g ′ (θ))ᵀ_{k×1} =: A,
and also
V ar(δ(X)) ≥ r′ (η)_{1×k} [IX (η)]−1_{k×k} (r′ (η))ᵀ_{k×1} =: B.
V ar(δ(X)) ≥ [g ′ (θ)]2 / IX (θ), where IX (θ) = E[l′ (θ)]2 = −E[l′′ (θ)].
• efficient if V ar(δ) = [IX (θ)]−1
• sub-efficient if V ar(δ) > [IX (θ)]−1
• super-efficient if V ar(δ) < [IX (θ)]−1
Examples of efficient and sub-efficient estimators
A necessary and sufficient condition for reaching Cramer-Rao lower bound is that δ(x)
and l0 (θ) are linearly dependent, i.e. δ(x) − Eδ(X) = λ0 (l0 (θ))τ .
Example 5.1 Let X1 , ..., Xn be iid from Bin(1, p). Show that the Cramer-Rao lower bound can be reached in estimating p but not p2 .
Proof. First fp (x) = p^{∑ xi} (1 − p)^{n−∑ xi} , xi = 0, 1. So l(p) = log fp (x) = ∑_i xi log p + (n − ∑_i xi ) log(1 − p). Then
l′ (p) = ∑_i xi /p − (n − ∑_i xi )/(1 − p) = n(x̄ − p)/(p(1 − p)), xi = 0, 1.   (2.2)
Therefore,
IX (p) = n E(X1 − p)2 /(p2 (1 − p)2 ) = n/(p(1 − p)).
If Eδ(X) = g(p), then by the Cramer-Rao inequality,
V ar(δ(X)) ≥ [g ′ (p)]2 /IX (p) = [g ′ (p)]2 p(1 − p)/n.
From (2.2), the equality only holds for g(p) = c0 + c1 p, a linear function of p.
On regularity conditions
1. the assumption in the Cramer-Rao theorem does not hold
(the range of the pdf depends on the unknown parameter).
2. the information inequality does not apply to the UMVUE of θ.
Solution.
EX(n)² = ∫_0^θ x2 · n x^{n−1} /θ n dx = nθ 2 /(n + 2),
var(θ̂) = ((n + 1)/n)2 EX(n)² − θ 2 = ((n + 1)/n)2 · nθ 2 /(n + 2) − θ 2 = θ 2 /(n(n + 2)) < θ 2 /n = 1/(nI(θ)).
Note that the V ar(X(n) ) is of size O(n−2 ), as opposed to the usual size O(n−1 ).
Remark 5.1 In this example, the UMVUE has variance below the C-R lower bound, which
is a good thing. In fact, the variance of θ̂ is of order n−2 , much smaller than the usual
n−1 . More precisely, we have
n(θ − θ̂)/θ ∼approx Exp(λ = 1)
Therefore, θ̂ is asymptotically Exp(λ = 1), but not asymptotically normal.
Taking y = tθ, we get
P ( n(θ − X(n) )/θ ≤ t ) = 1 − (1 − t/n)n → 1 − e−t as n → ∞.
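A quick simulation of this exponential limit (a sketch; n, θ and the replication count are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(4)
n, theta, B = 50, 2.0, 100_000
xmax = rng.uniform(0, theta, size=(B, n)).max(axis=1)
Z = n * (theta - xmax) / theta

# compare a few empirical tail probabilities with exp(-t)
for t in (0.5, 1.0, 2.0):
    print(t, (Z > t).mean(), np.exp(-t))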
Assume
√n[δn − g(θ)] →d N (0, v(θ)), v(θ) > 0,   (3.4)
which holds if bias(δn ) = Eδn − g(θ) = o(n−1/2 ) (i.e., δn is asymptotically unbiased) and V ar(√n δn ) → v(θ). Letting n → ∞ in (3.3), we would get
v(θ) ≥ [g ′ (θ)]2 /I(θ) for ALL θ ∈ Θ.   (3.5)
However, (3.5) may not hold uniformly for ALL θ ∈ Θ, even under nice regularity condi-
tions. (The example when regularity conditions fails is U nif (0, θ).)
Example 5.3 (Hodges) Let X1 , ..., Xn ∼ N (θ, 1), θ ∈ R. Then IX1 (θ) = 1. The MLE
of θ is θ̂ = X̄ ∼ N (θ, 1/n). Now let us define
θ̃ = X̄ if |X̄| ≥ n−1/4 ;  = tX̄ if |X̄| < n−1/4 ,
where t is a fixed constant with 0 ≤ t < 1.
Remark 5.2 θ̃ shrinks X̄ around 0 to tX̄. Hence it is more efficient at 0 if the true
parameter is 0. The trouble is that we don’t know where the true θ is.
Solution.
Note that θ̃ = X̄In + tX̄(1 − In ), where In = I(|X̄| ≥ n−1/4 ) = I(|√n X̄| ≥ n1/4 ).
(1) If θ = 0, then √n X̄ ∼ N (0, 1). Also, In →p 0, since P (In ≥ ε) = P (|√n X̄| ≥ n1/4 ) = P (|N (0, 1)| ≥ n1/4 ) → 0. Hence, by Slutsky's theorem, we have
√n(θ̃ − θ) = √n X̄In + t√n X̄(1 − In ) →d N (0, t2 ).
(2) If θ ≠ 0, then √n(X̄ − θ) ∼ N (0, 1). Also, √n(1 − In ) →p 0, since
P (√n(1 − In ) ≥ ε) = P (In ≤ 1 − ε/√n) = P (In = 0) = P (|√n X̄| < n1/4 )
  = P (−n1/4 < √n X̄ < n1/4 )
  = P (−√n θ − n1/4 < √n(X̄ − θ) < n1/4 − √n θ)
  = Φ(−√n θ + n1/4 ) − Φ(−√n θ − n1/4 )
  → 0.
In the case of θ = 0, the asymptotic Cramer-Rao theorem does not hold, since t2 < 1 = [I(θ)]−1 .
Remark 5.3
• The set of all points violating (3.5) is called points of superefficiency. Supereffi-
ciency in fact is a very good property.
• Even for the normal family (the best possible family), (3.5) may not hold. However,
it will hold for a more restricted class (i.e., “regular” estimators).
θ̂ = X̄ if |X̄ − k| ≥ n−1/4 ;  = k + tk (X̄ − k) if |X̄ − k| < n−1/4 ,
where we choose 0 ≤ tk < 1 for all k. Then it can be shown that this estimator will be super-efficient at all k = 0, ±1, ±2, ...
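A sketch comparing n · MSE of the Hodges estimator (with t = 0, i.e. hard shrinkage to 0) and of X̄, at θ = 0 and at θ = 0.5; all numerical choices below are illustrative:

import numpy as np

rng = np.random.default_rng(5)
n, B, t = 400, 50_000, 0.0

for theta in (0.0, 0.5):
    xbar = theta + rng.standard_normal((B, n)).mean(axis=1)
    hodges = np.where(np.abs(xbar) >= n ** (-0.25), xbar, t * xbar)
    print(theta,
          "n*MSE(Hodges):", n * ((hodges - theta) ** 2).mean(),
          "n*MSE(Xbar):", n * ((xbar - theta) ** 2).mean())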
• Although (3.5) is not always true, however, it holds a.e., i.e., the points of super-
efficiency has Lebesgue measure 0. See the next theorem (Le Cam (1953), Bahadur
(1964)).
Theorem 5.2 If δn = δn (X1 , . . . , Xn ) is any estimator satisfying (3.4), then v(θ) satisfies
(3.5) except on a set of Lebesgue measure 0, given the following conditions hold:
1. The set A = {x : fθ (x) > 0} is independent of θ. [i.e., they have common support,
and the boundary of Θ does not depend on θ.]
3. ∫ fθ (x)dx can be twice differentiated under the integral sign.
4. 0 < IX1 (θ) < ∞, where IX1 (θ) = E[ ∂ log fθ (X1 )/∂θ ]2 .
5. For any θ0 ∈ Θ, there exists c > 0 and a function M (x) (both may depend on θ0 )
such that for all x ∈ A and |θ − θ0 | < c, we have
|∂ 2 log fθ (x)/∂θ 2 | ≤ M (x), and Eθ0 [M (X1 )] < ∞.
5.4 Consistency of the MLE
For a single observation we do not usually have l(θ0 ) = f (x1 |θ0 ) > f (x1 |θ) = l(θ). However, for large enough sample size n, we have l(θ0 ) = ∏_{i=1}^n f (xi |θ0 ) > ∏_{i=1}^n f (xi |θ) = l(θ) with high probability. We do not know θ0 , but we can determine the value θ̂ of θ which maximizes l(θ) = ∏_{i=1}^n f (xi |θ). The above theorem suggests that, if fθ (x) varies smoothly with θ, the MLE θ̂ typically should be close to the true value θ0 , and hence be a reasonable estimator.
In fact, we illustrate this for the simple case in which Θ is finite, say, Θ = {θ0 , θ1 , · · · , θk }. By the last theorem, for any θi , i = 1, ..., k, we have
Pθ0 [ l(θ0 ) > l(θi ), for large n ] = 1.
Hence,
Pθ0 [ l(θ0 ) > max_{1≤i≤k} l(θi ), for large n ] = 1.
Asymptotically efficiency
1. (Consistency): δn →p g(θ).
√
2. (Asymptotic normality): n[δn − g(θ)] →d N (0, v(θ)).
• Do AE estimators exist? Under what conditions?
√
AE estimators are not unique. If δn is AE, so is δn + Rn if nRn →p 0, since
!
√ √ √ [g 0 (θ)]2
n (δn + Rn − g(θ)) = n (δn − g(θ)) + nRn →d N 0, .
I(θ)
Theorem 5.4 Let F = {fθ : θ ∈ Θ}, where Θ is an open set in R. Assume that
1. For each θ ∈ Θ, l′′′ (θ) exists for all x, and there exists H(x) ≥ 0 (possibly depending on θ) such that for θ ∈ N (θ0 , ε) = {θ : |θ − θ0 | < ε},
|l′′′ (θ)| ≤ H(x), Eθ H(X1 ) < ∞.
2. For gθ (x) = fθ (x) or l′′′ (θ), we have ∂/∂θ ∫ gθ (x)dx = ∫ ∂gθ (x)/∂θ dx.
3. For each θ ∈ Θ, we have 0 < IX1 (θ) = Eθ [l′ (θ)]2 < ∞.
Let X1 , ..., Xn ∼ Fθ . Then with probability 1, the likelihood equations, l0 (θ) = 0, admit a
sequence of solutions {θ̂n } satisfying
Proof.
s(θ0 ) → Es(θ0 ) = 0,
s0 (θ0 ) → Es0 (θ0 ) = −IX1 (θ0 ),
H̄(X) → Eθ H(Xi ) < ∞.
Remark 5.4 Heuristically, we have, for large n and θ = θ0 ± ε,
s(θ0 ± ε) ≈ ∓ε IX1 (θ0 ) + (1/2) EH(X1 ) η ∗ ε2 ≈ ∓ε IX1 (θ0 ),
which is negative at θ0 + ε and positive at θ0 − ε. Thus there is a change of sign for s(θ) near θ0 , and hence there must be at least one root. By choosing ε smaller and smaller, we get a sequence of solutions converging to θ0 . Let us make this precise below.
|s(θ0 )| ≤ ε2 ,
|s′ (θ0 ) + IX1 (θ0 )| ≤ ε2 ,
|H̄(X) − Eθ H(Xi )| ≤ ε2 .
We now show that θ̂n,ε is a proper random variable, that is, θ̂n,ε is measurable. Note that, for all t ≥ θ0 − ε, we have
{θ̂n,ε > t} = { inf_{θ0 −ε≤θ≤t} s(θ) > 0 } = { inf_{θ0 −ε≤θ≤t, θ rational} s(θ) > 0 },
Thus,
√n s(θ0 ) = √n(θ̂n − θ0 ) [ −s′ (θ0 ) − (1/2) H̄(X) η ∗ (θ̂n − θ0 ) ].
Since √n s(θ0 ) → N (0, I(θ0 )) in distribution by the CLT, and −s′ (θ0 ) − (1/2)H̄(X)η ∗ (θ̂n − θ0 ) → I(θ0 ) a.s., it follows from Slutsky's theorem that
√n(θ̂n − θ0 ) = √n s(θ0 ) / [ −s′ (θ0 ) − (1/2)H̄(X)η ∗ (θ̂n − θ0 ) ] →d N (0, I(θ0 ))/I(θ0 ) = N (0, I −1 (θ0 )).
Remark 5.5 To estimate g(θ) (g is smooth), we can use g(θ̂n ). Consequently, we can
apply the continuous mapping theorem and δ-method to obtain
Example 5.4 Let X1 , . . . , Xn ∼iid fθ (x) = (1/2) exp{−|x − θ|} (the double exponential). Despite the violation of regularity conditions, the asymptotic version of the Cramer-Rao theorem still holds for the MLE.
Proof. First of all, for EVERY x, fθ (x) is not differentiable w.r.t. θ at θ = x. So the regularity conditions are not satisfied. Secondly,
l(θ) = ∑_{i=1}^n log fθ (xi ) = log( 2−n exp{ −∑_{i=1}^n |xi − θ| } ) = C − ∑_{i=1}^n |xi − θ|.
So an MLE of θ is θ̂ = arg maxθ∈R l(θ) = arg minθ∈R ∑_{i=1}^n |xi − θ| = the sample median, as seen in an earlier chapter.
Let us assume for the moment that n is odd; then θ̂ = m̂. It is well known that
√n(θ̂ − θ) ∼asymptotic N (0, [4f 2 (θ)]−1 ).
In our current example, [4f 2 (θ)]−1 = [4 × (1/4)]−1 = 1. Hence, the asymptotic Cramer-Rao bound does hold.
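A sketch checking the limiting variance [4f 2 (θ)]−1 = 1 of the sample median under the double-exponential model (n is taken odd so the median is unique; all numerical choices are illustrative):

import numpy as np

rng = np.random.default_rng(6)
n, B, theta = 201, 50_000, 0.0
x = theta + rng.laplace(scale=1.0, size=(B, n))   # density (1/2) exp(-|x - theta|)
med = np.median(x, axis=1)
print("n * Var(median) approx:", n * med.var(), " (theory: 1)")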
5.6 Some comparisons about the UMVUE and MLE’s
1. It can be shown that UMVUE’s are always consistent and MLE’s usually are. (Bickel
and Doksum, 1977., p133.)
2. In the exponential family, both estimates are functions of the natural sufficient statis-
tics Ti (X).
3. When n is large, both give essentially the same answer. In that case, use is governed
by ease of computation.
4. MLEs always exist while UMVUEs may not exist, and MLEs are usually easier to
calculate than UMVUE’s, due partially to the invariance property of MLE’s.
5. Neither MLE’s nor UMVUE’s are satisfactory in general if one takes a Bayesian or
minimax point of view. Nor are they necessarily robust.
Chapter 6
Transformation
Transforms are usually applied so that the data appear to more closely meet the
assumptions of a statistical inference procedure that is to be applied, or to improve the
interpretability or appearance of graphs.
In fact, many familiar measured scales are really transformed scales: decibels, pH
and the Richter scale of earthquake magnitude are all logarithmic. Furthermore, trans-
formations are needed because there is no guarantee that the world works on the scales it
happens to be measured on, such as music.
• Convenience
An example is standardisation: z = (x − a)/b. It puts all variables on the same scale.
Standardisation makes no difference to the shape of a distribution.
• Reducing skewness
A (nearly) symmetric distribution is often easier to handle and interpret than a
skewed one. More specifically, a normal distribution is often regarded as ideal as it
is assumed by many statistical methods.
– To reduce right skewness, try √x, log x or 1/x.
– To reduce left skewness, take x2 , x3 , ....
• Linear relationships
When looking at relationships between variables, it is often far easier to think about
patterns that are approximately linear than about patterns that are highly curved.
This is vitally important when using linear regression, which amounts to fitting such
patterns to data.
For example, a plot of logarithms of a series of values against time has the property
that periods with constant rates of change (growth or decline) plot as straight lines.
• Additive relationships
Relationships are often easier to analyse for an additive model
y = a + bx
• Easier to visualize.
For example, suppose we have a scatterplot in which the points are the countries of
the world, and the data values being plotted are the land area and population of
each country. If the plot is made using untransformed data (e.g. square kilometers
for area and the number of people for population), most of the countries would be
plotted in tight cluster of points in the lower left corner of the graph. The few coun-
tries with very large areas and/or populations would be spread thinly around most
of the graph’s area. Simply rescaling units (e.g., to thousand square kilometers, or
to millions of people) will not change this. However, following logarithmic transfor-
mations of both area and population, the points will be spread more uniformly in
the graph.
The most useful transformations are the reciprocal, logarithm, cube root, square root, and
square.
Reciprocal: 1/x, x ≠ 0
• The reciprocal of a ratio may often be interpreted as easily as the ratio itself: e.g.
– population density (people per unit area) becomes area per person;
– persons per doctor becomes doctors per person;
– rates of erosion become time to erode a unit depth.
• This has a major effect on distribution shape. It is commonly used for reducing right
skewness.
• Exponential growth or decline y = a exp(bx) is made linear by
ln y = ln a + bx.
The exponential function does not go through the origin, since at x = 0 we have y = a ≠ 0.
• Power function y = axb is made linear by
log y = log a + b log x.
The power function for b > 0 goes through the origin, as at x = 0 (b > 0),
we have y = 0. This often makes physical or biological or economic sense.
• Consider ratios y = p/q ∈ (0, ∞) where p, q > 0. Examples are
– males/females;
– dependants/workers;
– downstream length/downvalley length
which are usually skewed, because there is a clear lower limit and no clear upper
limit. The logarithm
log y = log p/q = log p − log q ∈ (−∞, ∞),
which is likely to be more symmetrically distributed.
• The cube root x1/3 is a fairly strong transformation with a substantial effect on
distribution shape: it is weaker than the logarithm.
• It is also used for reducing right skewness, and has the advantage that it can be
applied to zero and negative values.
• Note that the cube root of a volume has the units of a length. It is commonly applied
to rainfall data.
Square x2
• The square x2 has a moderate effect on distribution shape and it could be used to
reduce left skewness.
• In practice, the main reason for using it is to fit a response by a quadratic function
y = a + bx + cx2 .
Test
QQ plots
Box plot
6.4 Preliminaries
• Xn →p X =⇒ g(Xn ) →p g(X).
• Xn →d X =⇒ g(Xn ) →d g(X).
• Xn + Yn →d X + C.
• Xn Yn →d CX.
• Xn /Yn →d X/C if C 6= 0.
Order relations could greatly simplify writings in a proof. Let {Xn }’s and {Yn }’s be two
sequences of r.v.’s.
(a) Xn = Op (1)
⇐⇒ ∀ε > 0, ∃Mε and Nε such that P (|Xn | > Mε ) < ε, ∀n ≥ Nε (i.e., P (|Xn | ≤ Mε ) ≥ 1 − ε).
⇐⇒ ∀ε > 0, ∃Mε such that supn P (|Xn | > Mε ) < ε.
⇐⇒ lim_{M →∞} lim sup_n P (|Xn | > M ) = 0. ("tightness")
We say that {Xn } is “stochastically bounded”, or “tight.” It means that no mass
will escape to ∞ with positive probability.
(c) Xn = op (1) if Xn →p 0.
Theorem 6.3
1. If Xn →d X, then Xn = Op (1).
Proof.
6.5 The δ-method
The δ-method is a very useful tool in proving limit theorems and Slutsky’s theorems are
extensively employed. We illustrate the idea with smooth function of means.
Theorem 6.4 Let X1 , ..., Xn be i.i.d. with µ = EX1 and σ 2 = V ar(X1 ) ∈ (0, ∞), and g ′ (µ) ≠ 0. Then
g(X̄) →d N (g(µ), n−1 [g ′ (µ)]2 σ 2 )  ⇐⇒  √n(g(X̄) − g(µ)) / (g ′ (µ)σ) →d N (0, 1).   (5.1)
The multivariate case
6.5.1 Examples
Solution.
5. If µ ≠ 0, take g(x, y) = (y − x2 )1/2 /x. Then
S/X̄ = ( n−1 ∑_{i=1}^n Xi² − X̄ 2 )^{1/2} / X̄ := g( X̄, n−1 ∑_{i=1}^n Xi² ).
We often have
Tn →d N (µ, σ 2 (µ))
i.e., the variance changes with the mean.
We now seek some transformation g so that g(Tn ) has stable variance. Since
g(Tn ) −→d N g(µ), [g 0 (µ)]2 σ(µ)2 ,
µ = λ, σ 2 = λ2 = µ2 .
from which one can get an approximate confidence interval for λ.
g(θ) = ∫ C/σ(θ) dθ = ∫ C/√θ dθ = 2C√θ.
Choose C = 1/2. Then we get g(θ) = √θ, hence
√X̄ ∼ AN( √λ, 1/(4n) ), or 2√n( √X̄ − √λ ) ∼ AN (0, 1).
Hence, an approximate two-sided confidence interval for √λ at (1 − α)-level is
(L, U ) = √X̄ ± (1/(2√n)) Φ−1 (1 − α/2).
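A sketch of the resulting interval and its coverage for √λ (back-transforming by squaring gives an interval for λ itself); λ, n, α and the replication count are arbitrary illustrative choices:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
lam, n, B, alpha = 4.0, 100, 50_000, 0.05
xbar = rng.poisson(lam, size=(B, n)).mean(axis=1)

z = norm.ppf(1 - alpha / 2)
L = np.sqrt(xbar) - z / (2 * np.sqrt(n))
U = np.sqrt(xbar) + z / (2 * np.sqrt(n))
print("coverage for sqrt(lambda):",
      ((L <= np.sqrt(lam)) & (np.sqrt(lam) <= U)).mean())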
Example 1. Let (X, Y ), (X1 , Y1 ), ..., (Xn , Yn ) be i.i.d. random vectors. For simplicity, we
first assume that µx = µy = 0. Then the correlation coefficient is
(Note that this is different from the usual definition of ρ̂; see the next example.)
(a) Show that ρ̂ →a.s. ρ, and √n(ρ̂ − ρ) →d N (0, (1 − ρ2 )2 ). Further,
√n(ρ̂ − ρ)/(1 − ρ̂2 ) →d N (0, 1).
Solution. (i) Let θ = (EX 2 , EY 2 , EXY ) and θ̂ = ( n−1 ∑ Xi² , n−1 ∑ Yi² , n−1 ∑ Xi Yi ), and
g(z1 , z2 , z3 ) = z3 /(z1 z2 )1/2 .
Then ρ = g(θ) and ρ̂ = g(θ̂). Note that E θ̂ = θ. Assuming that EX 4 < ∞ and EY 4 < ∞, from the multivariate CLT, we have
θ̂ ∼ AN (θ, Σ/n),
where Σ (3 × 3) is the covariance matrix of (X 2 , Y 2 , XY ):
Σ = [ V (X 2 ), Cov(X 2 , Y 2 ), Cov(X 2 , XY );  Cov(X 2 , Y 2 ), V (Y 2 ), Cov(Y 2 , XY );  Cov(X 2 , XY ), Cov(Y 2 , XY ), V (XY ) ].
Therefore, from the theorem in this section, we have
ρ̂ = g(θ̂) ∼ AN ( ρ, n−1 [∇g(θ)]ᵀ Σ ∇g(θ) ).
From
[∇g(z)]ᵀ = ( ∂g/∂z1 , ∂g/∂z2 , ∂g/∂z3 ) = ( −z3 /(2z1^{3/2} z2^{1/2} ), −z3 /(2z1^{1/2} z2^{3/2} ), 1/(z1^{1/2} z2^{1/2} ) ),
we have
[∇g(θ)]ᵀ = ( −σxy /(2σx³ σy ), −σxy /(2σx σy³ ), 1/(σx σy ) ).
For simplicity, we shall also assume that σx2 = σy2 = 1. Then X ∼ N (0, 1),
X2 ∼ χ21 , hence EX 2 = 1, V (X 2 ) = 2, EX 3 = 0, EX 4 = 3. Similar results hold for Y as
well. Note that X and Y − βX are independent for β = σxy /σx2 = ρ. Therefore
E(X 2 Y 2 ) = E(X 2 (Y − βX + βX)2 )
= EX 2 (Y − βX)2 + 2βEX 3 (Y − βX) + β 2 EX 4
= EX 2 E(Y − βX)2 + 2βEX 3 E(Y − βX) + β 2 EX 4
= EX 2 E(Y − βX)2 + β 2 EX 4
= EX 2 EY 2 − 2βEXY + β 2 EX 2 + β 2 EX 4
= 1 − 2ρ2 + ρ2 + 3ρ2
= 1 + 2ρ2
V (XY ) = E(X 2 Y 2 ) − (EXY )2 = 1 + 2ρ2 − ρ2 = 1 + ρ2
Cov(X 2 , Y 2 ) = E(X 2 Y 2 ) − EX 2 EY 2 = 1 + 2ρ2 − 1 = 2ρ2
Cov(X 2 , XY ) = E(X 3 Y ) − EX 2 EXY = EX 3 (Y − βX) + βEX 4 − ρ
= EX 3 E(Y − βX) + 3ρ − ρ = 2ρ
Cov(Y 2 , XY ) = 2ρ (similar to above)
Hence,
Σ = [ 2, 2ρ2 , 2ρ;  2ρ2 , 2, 2ρ;  2ρ, 2ρ, 1 + ρ2 ] = 2 [ 1, ρ2 , ρ;  ρ2 , 1, ρ;  ρ, ρ, (1 + ρ2 )/2 ].
and [∇g(θ)]ᵀ can be simplified to
[∇g(θ)]ᵀ = ( −σxy /(2σx³ σy ), −σxy /(2σx σy³ ), 1/(σx σy ) ) = (−ρ/2, −ρ/2, 1) = (1/2)(−ρ, −ρ, 2).
Finally,
[∇g(θ)]ᵀ Σ ∇g(θ) = (1/2)(−ρ, −ρ, 2) · 2[ 1, ρ2 , ρ;  ρ2 , 1, ρ;  ρ, ρ, (1 + ρ2 )/2 ] · (1/2)(−ρ, −ρ, 2)ᵀ
  = ( ρ(1 − ρ2 ), ρ(1 − ρ2 ), 1 − ρ2 ) · (1/2)(−ρ, −ρ, 2)ᵀ
  = (1/2)[ −2ρ2 (1 − ρ2 ) + 2(1 − ρ2 ) ]
  = (1 − ρ2 )2 .
Therefore, √n(ρ̂ − ρ) →d N (0, (1 − ρ2 )2 ). This, ρ̂ →a.s. ρ and Slutsky's theorem imply
√n(ρ̂ − ρ)/(1 − ρ̂2 ) →d N (0, 1).
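A sketch checking n · Var(ρ̂) ≈ (1 − ρ2 )2 for this centered estimator under bivariate normal data (ρ, n and the replication count are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(9)
rho, n, B = 0.6, 500, 20_000
cov = np.array([[1.0, rho], [rho, 1.0]])
XY = rng.multivariate_normal([0.0, 0.0], cov, size=(B, n))
X, Y = XY[..., 0], XY[..., 1]

# rho_hat uses uncentered moments, matching the definition in this example
rho_hat = (X * Y).mean(axis=1) / np.sqrt((X ** 2).mean(axis=1) * (Y ** 2).mean(axis=1))
print("n * Var(rho_hat):", n * rho_hat.var(), "  theory:", (1 - rho ** 2) ** 2)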
Example 2. (Continuation of the last example.) Let (X, Y ), (X1 , Y1 ), ..., (Xn , Yn ) be i.i.d. random vectors. The general definition of the correlation coefficient is
ρ = σxy /(σx σy ) = Cov(X, Y )/√(V (X)V (Y )) = E(X − µx )(Y − µy )/√(V (X)V (Y )).
The sample correlation coefficient is
ρ̂ = [ n−1 ∑_{i=1}^n (Xi − X̄)(Yi − Ȳ ) ] / { [ n−1 ∑_{i=1}^n (Xi − X̄)2 ]^{1/2} [ n−1 ∑_{i=1}^n (Yi − Ȳ )2 ]^{1/2} }.
Then ρ = g(θ) and ρ̂ = g(θ̂). Note that E θ̂ = θ. Assuming that EX 4 < ∞ and EY 4 < ∞, from the multivariate CLT, we have
θ̂ ∼ AN (θ, Σ/n),
where Σ (5 × 5) is the covariance matrix of (X, Y, X 2 , Y 2 , XY ). (Why?) Therefore, from the theorem in this section, we have
Some routine calculations yield
[∇g(θ)]ᵀ = ( ρµx /σx² − µy /(σx σy ), ρµy /σy² − µx /(σx σy ), −ρ/(2σx² ), −ρ/(2σy² ), 1/(σx σy ) ).
6.7 Exercises
Below X, X1 , ..., Xn are i.i.d. r.v.’s with mean µ, variance σ 2 , and µk = E(X − µ)k .
1. Prove the equivalence of several definitions for Xn = Op (1) given in the notes.
4. Let (X, Y ), (X1 , Y1 ), ..., (Xn , Yn ) be i.i.d. with mean µ = (µx , µy )τ and covariance
matrix Σ. Let r = µx /µy with µy 6= 0, and an estimate of r is r̂ = X̄/Ȳ . Investigate
the limiting behavior of r̂ (i.e. both consistency and asymptotic d.f.).
5. Let U1 , U2 , ... ∼iid Uniform[0, 1], and Xn = ( ∏_{i=1}^n Ui )^{−1/n} . Show that √n(Xn − e) →d N (0, e2 ).
Chapter 7
Quantiles
We are often interested in a special portion of a population distribution, other than the
centre. For instance, we might be interested in
These are all examples of quantiles. The layout of the chapter is as follows:
• definitions of quantiles
• consistency
• asymptotic normality
• Bahadur theorem
• quantile regression
7.1 Definition
2. F −1 (F (x)) ≤ x, ∀x ∈ R.
3. F (ξp −) ≤ p ≤ F (ξp ).
4. ξp = F −1 (p) ≤ t ⇐⇒ p ≤ F (t).
Proof.
Left-continuity. Let pn ↗ p.
=⇒ F −1 (pn ) ≤ F −1 (p) and F −1 (pn ) ↗ b, say. (Monotonicity shown above.)
=⇒ F −1 (pn ) ≤ b ≤ F −1 (p) for all n.
=⇒ pn ≤ F (b) from (4) (proof given below).
=⇒ F (b) ≥ limn pn = p.
=⇒ b ∈ {t : F (t) ≥ p}, hence b ≥ F −1 (p).
=⇒ limn F −1 (pn ) = b = F −1 (p).
X := F −1 (U ) ∼ F.
Theorem 7.2 forms the basis for sampling from d.f. F , at least in principle. Feasi-
bility of this technique depends on either having F −1 available in closed form, or being
able to approximate it numerically, or using some other techniques.
Theorem 7.3 If F (ξp − δ) < p < F (ξp + δ) for all δ > 0 (i.e., F is not flat at ξp ), then for all ε > 0 and n,
P (|ξ̂p − ξp | > ε) ≤ 2C e^{−2nδε²} ,
where δε = min{F (ξp + ε) − p, p − F (ξp − ε)}, and C > 0 is an absolute constant.
Corollary 7.1 Let F (ξp − ε) < p < F (ξp + ε) for any ε > 0. Then ξ̂p −→a.s. ξp .
Proof. ∑_{n=1}^∞ P (|ξ̂p − ξp | > ε) ≤ ∑_{n=1}^∞ 2C e^{−2nδε²} < ∞.
Asymptotic normality can be proved in several different ways, e.g. Bahadur representation.
For simplicity, let m = F −1 (1/2) and m̂ = Fn−1 (1/2) be the population and sample
medians, respectively. General quantiles can be treated similarly.
Theorem 7.4 Assume f (x) = F ′ (x) is continuous near m and f (m) > 0. Then
m̂ ∼asymp. N ( m, 1/(4f 2 (m)n) ).
Proof. WLOG, assume n is odd. Let Yni = I{Xi ≤ m + x/√n}. Then ∑ Yni ∼ Bin(n, pn ), where pn = P (Xi ≤ m + x/√n). Now
P (√n(m̂ − m) ≤ x) = P (m̂ ≤ m + x/√n)
  = P ( ∑_{i=1}^n I{Xi ≤ m + x/√n} ≥ (n + 1)/2 )
  = P ( ∑_{i=1}^n Yni ≥ (n + 1)/2 )
  = P ( (∑_{i=1}^n Yni − npn )/√(npn (1 − pn )) ≥ ((n + 1)/2 − npn )/√(npn (1 − pn )) ),
and √n(pn − 1/2) → xf (m). Thus,
((n + 1)/2 − npn )/√(npn (1 − pn )) = ( 1/(2√n) + √n(1/2 − pn ) )/√(pn (1 − pn )) → −xf (m)/(1/2) = −2xf (m).
Applying the CLT to the triangular array Yni and Slutsky's theorem, we have
P (√n(m̂ − m) ≤ x) → 1 − Φ(−2xf (m)) = Φ(2xf (m)),
which is the N (0, [4f 2 (m)]−1 ) distribution function evaluated at x.
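A sketch of Theorem 7.4 for standard normal data, where m = 0, f (m) = (2π)−1/2 and 1/(4f 2 (m)) = π/2 (n and the replication count are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(10)
n, B = 201, 50_000                       # odd n so the sample median is unique
med = np.median(rng.standard_normal((B, n)), axis=1)
print("n * Var(median):", n * med.var(), "  theory pi/2 =", np.pi / 2)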
The sample quantile ξ̂p is a nonlinear statistic (i.e., not an average of i.i.d. r.v.'s). However, the Bahadur representation says that it can be approximated by a linear statistic, which is much easier to handle. Heuristically, by Taylor's expansion,
ξ̂p = ξp + [F (ξp ) − Fn (ξp )]/f (ξp ) + op (n−1/2 ).
Or equivalently,
√n( ξ̂p − ξp ) = √n[F (ξp ) − Fn (ξp )]/f (ξp ) + op (1).
Xn − Yn = op (1).
To apply the lemma, let t ∈ R and ξnt = ξp + t/√n. We have
P (Xn ≤ t, Yn ≥ t + ε) = P ( ξ̂p ≤ ξnt , Yn ≥ t + ε )
  = P ( Fn (ξ̂p ) ≤ Fn (ξnt ), Yn ≥ t + ε )
  = P ( F (ξnt ) − Fn (ξnt ) ≤ F (ξnt ) − Fn (ξ̂p ), Yn ≥ t + ε )
  = P ( √n[F (ξnt ) − Fn (ξnt )]/f (ξp ) ≤ √n[F (ξnt ) − Fn (ξ̂p )]/f (ξp ), √n[F (ξp ) − Fn (ξp )]/f (ξp ) ≥ t + ε )
  := P ( Zn (t) ≤ Un (t), Zn (0) ≥ t + ε ),
where
Zn (t) = √n[F (ξnt ) − Fn (ξnt )]/f (ξp ),  Un (t) = √n[F (ξnt ) − Fn (ξ̂p )]/f (ξp ).
We can show (later), as n → ∞,
Proof. Let ξ̂p = Fn −1 (p). Note that Fn (ξ̂p −) ≤ p ≤ Fn (ξ̂p ). Thus
|Fn (ξ̂p ) − p| ≤ |Fn (ξ̂p ) − Fn (ξ̂p −)| = n−1 .
(a2) Note that √n[F (ξnt ) − p]/f (ξp ) → t, since [F (ξp + t/√n) − F (ξp )]/(t/√n) → f (ξp ).
where Yni = I{ξp < Xi ≤ ξnt }. Note that EYni = P (ξp < X1 ≤ ξnt ) = F (ξnt ) − F (ξp ) = f (ξp )t/√n + o(n−1/2 ) [see the proof of (a2) above]. Thus,
V ar( (1/√n) ∑_{i=1}^n (Yni − EYni ) ) = (1/n) ∑_{i=1}^n V ar(Yni ) ≤ V ar(Yn1 ) ≤ E(Yn1² ) = EYn1 = f (ξp )t/√n + o(n−1/2 ) −→ 0,
which implies that the first term in (4.1) is op (1). Thus Zn (t) − Zn (0) = op (1).
Theorem 7.6 Let F have positive derivatives at ξpj , where 0 < p1 < ... < pm < 1. Then
√ h i
n ξˆp1 , ..., ξˆpm − (ξp1 , ..., ξpm ) −→d Nm (0, Σ),
The theorem is useful in deriving the (joint) d.f. of the sample range, inter-quartile range,
etc.
Proof of Lemma 7.1. Fix ε > 0 and δ > 0. Since Xn = Op (1), there exist Mε and Nε such that
P (|Xn | > Mε ) < ε, ∀n ≥ Nε .
Divide the interval [−Mε , Mε ] into m sub-intervals each of length ≤ δ/2; e.g., we can take m = 2[2Mε /(δ/2)]. Write [−Mε , Mε ] ⊂ ∪_{k=1}^m [tk , tk+1 ] with tk+1 − tk ≤ δ/2. Then for all n ≥ Nε , we have
Now for every k and as n → ∞, we have (draw a diagram to illustrate)
Population quantiles
F (ξp −) ≤ p ≤ F (ξp ).
For simplicity, we first consider medians (i.e. p = 1/2) and denote m = ξ1/2 , i.e.
Theorem 7.7
Proof.
• Medians are closed, i.e. mL = inf{m : m is a median} and mU = sup{m : m is a median} are also medians.
Proof. WLOG, we only show that mL is a median. Clearly, there exists a decreasing sequence of medians, denoted by {mk , k ≥ 1}, such that
– mk ↓ mL ;
– F (mk −) ≤ 1/2 ≤ F (mk ).
Hence,
– F (mL −) ≤ F (mk −) ≤ 1/2;
– F (mL ) = limk F (mk ) ≥ 1/2.
Similarly, we can show that mU is a median too.
where the last inequality holds since 2F (m) − 1 ≥ 0 (as m is a median), and also
the assumption that c > m.
Sample quantiles
Definition. Given a sample X1 , ..., Xn from a continuous d.f. F , the sample median is
any value m̂ satisfying
Fn (m̂−) ≤ 1/2 ≤ Fn (m̂)
where Fn is the empirical distribution function (e.d.f.):
Fn (x) = n−1
X
I(Xi ≤ x).
(1) m̂ = X((n+1)/2) if n is odd; m̂ = any value in [X(n/2) , X(n/2+1) ] if n is even.
(2) We have
m̂ = arg minc EFn |X − c| = arg minc ∑_{i=1}^n |Xi − c|,
i.e. for any c, we have
∑_{i=1}^n |Xi − m̂| ≤ ∑_{i=1}^n |Xi − c|.
(3) For n given points (Xi , Yi ), i = 1, ..., n, the sum ∑_{i=1}^n |Yi − bXi | is minimized by b = Z(k) , where k satisfies ∑_{i=1}^k p(i) ≥ 1/2 and ∑_{i=k}^n p(i) ≥ 1/2 with pi = |Xi |/ ∑_j |Xj |, Z(1) , ..., Z(n) are the order statistics of Y1 /X1 , ..., Yn /Xn , and p(i) is the value corresponding to Z(i) .
Proof.
(1) It suffices to show that m̂ satisfies nFn (m̂−) ≤ n/2 ≤ nFn (m̂), or equivalently
LHS := ∑_{i=1}^n I{Xi < m̂} ≤ n/2 ≤ ∑_{i=1}^n I{Xi ≤ m̂} =: RHS.
– If n is odd, then m̂ = X((n+1)/2) , so
RHS = #{ Xi ≤ X((n+1)/2) } = (n + 1)/2 ≥ n/2,
LHS = #{ Xi < X((n+1)/2) } = (n + 1)/2 − 1 = (n − 1)/2 ≤ n/2.
– If n is even, then for m̂ ∈ [X(n/2) , X(n/2+1) ], we have
RHS = #{ Xi ≤ m̂ } = n/2 or n/2 + 1 ≥ n/2,
LHS = #{ Xi < m̂ } = n/2 − 1 or n/2 ≤ n/2.
7.5.1 Alternative definition of quantiles
Theorem 7.9 A p-th quantile of X is any value ξp such that (Lehmann, p58.)
FX (ξp −) ≤ p ≤ FX (ξp ).
Then
Proof. We only prove the last below since the proofs for the first two are similar to those
of medians. WLOG, assume that c > ξp , then
pE(X − c)+ + (1 − p)E(X − c)−
= p ∫ (x − c)I{x − c > 0} dF (x) + (1 − p) ∫ [−(x − c)]I{x − c ≤ 0} dF (x)
= (1 − p) ∫_{(−∞,c]} (c − x) dF (x) + p ∫_{(c,∞)} (x − c) dF (x)
= (1 − p) ∫_{(−∞,ξp ]} (c − x) dF (x) + (1 − p) ∫_{(ξp ,c]} (c − x) dF (x) + p ∫_{(ξp ,∞)} (x − c) dF (x) − p ∫_{(ξp ,c]} (x − c) dF (x)
= (1 − p) ∫_{(−∞,ξp ]} (ξp − x) dF (x) + p ∫_{(ξp ,∞)} (x − ξp ) dF (x)
  + (1 − p) ∫_{(−∞,ξp ]} (c − ξp ) dF (x) + p ∫_{(ξp ,∞)} (ξp − c) dF (x)
  + (1 − p) ∫_{(ξp ,c]} (c − x) dF (x) + p ∫_{(ξp ,c]} (c − x) dF (x)
= pE(X − ξp )+ + (1 − p)E(X − ξp )− + (c − ξp )[ (1 − p)F (ξp ) − p(1 − F (ξp )) ] + ∫_{(ξp ,c]} (c − x) dF (x)
= pE(X − ξp )+ + (1 − p)E(X − ξp )− + (c − ξp )[ F (ξp ) − p ] + ∫_{(ξp ,c]} (c − x) dF (x)
≥ pE(X − ξp )+ + (1 − p)E(X − ξp )− ,
where the last inequality holds since F (ξp ) ≥ p (as ξp is a p-th quantile), and also the assumption that c > ξp . The statement is proved.
Remark 7.2 As we have seen earlier, our quantiles may not be unique. In fact, we have
shown that it can be any value from the closed interval [a, b], where
Definition. Let X1 , ..., Xn be a set of distinct real numbers, its order statistics are
X(1) , ..., X(n) . Then we define the sample p-th quantile to be
    ξ̂_p = any value in [X_{([np])}, X_{([np]+1)}]   if np = [np] (i.e., np is an integer),
    ξ̂_p = X_{([np]+1)}                              if np ≠ [np] (i.e., np is NOT an integer).
(1) ξ̂_p is also a p-th quantile of F_n(x) = n^{-1} Σ_{i=1}^n I(X_i ≤ x).
(2)–(3) The analogues of properties (2) and (3) for the median hold, with weights p_i = |X_i| / Σ_j |X_j|. The details are similar to the medians.
Proof.
– If np is NOT an integer, then ξ̂_p = X_{([np]+1)}, therefore
    RHS = #{X_i ≤ X_{([np]+1)}} = [np] + 1 ≥ np,
    LHS = #{X_i < X_{([np]+1)}} = [np] ≤ np.
– If np is an integer, then ξ̂_p can be any value in the closed interval [X_{([np])}, X_{([np]+1)}] = [X_{(np)}, X_{(np+1)}], therefore
    RHS = #{X_i ≤ ξ̂_p} = np or np + 1 ≥ np,
    LHS = #{X_i < ξ̂_p} = np − 1 or np ≤ np.
(2) Note that the empirical weighted L1 loss can be written as
    E_{F_n}[ p(X − c)^+ + (1 − p)(X − c)^− ] = n^{-1} [ Σ_{i: X_i ≥ c} p|X_i − c| + Σ_{i: X_i < c} (1 − p)|X_i − c| ],
(3) Assume that X_i > 0 for all i. Then, letting Z_i = Y_i/X_i and p_i = |X_i| / Σ_j |X_j|, we have
    Σ_{i: Y_i − cX_i ≥ 0} p|Y_i − cX_i| + Σ_{i: Y_i − cX_i < 0} (1 − p)|Y_i − cX_i|
      = ( Σ_i |X_i| ) [ Σ_{i: Y_i/X_i − c ≥ 0} p|Y_i/X_i − c| p_i + Σ_{i: Y_i/X_i − c < 0} (1 − p)|Y_i/X_i − c| p_i ]
      = ( Σ_i |X_i| ) [ Σ_{i: Z_i − c ≥ 0} p|Z_i − c| p_i + Σ_{i: Z_i − c < 0} (1 − p)|Z_i − c| p_i ].
Consider the linear regression model Y_i = α + β^τ x_i + ε_i, i = 1, ..., n.
• The least absolute deviation (LAD) estimates [or median regression] of (α, β) solve
    min_{a,b} Σ_{i=1}^n |Y_i − (a + b^τ x_i)|.
Remark 7.3
• LAD is more robust than the LSE. If the d.f. of ε is heavy-tailed, one can use LAD. There are many other robust estimates, balancing robustness and efficiency.
• The LSE is easier to compute than LAD. Theorem 7.10 shows how to compute one unknown for LAD. With more than one unknown, we can fix all but one coordinate, solve for that coordinate, and cycle through the others; a small numerical sketch of this scheme is given below.
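The following is a minimal numerical sketch (not from the notes) of the coordinate-wise scheme just described, for a single covariate: with the intercept fixed, the slope step is the weighted median of property (3) above; with the slope fixed, the intercept step is an ordinary median. The data-generating model and all names are illustrative.

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest v whose cumulative weight reaches half of the total weight."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def lad_line(x, y, n_iter=50):
    """Coordinate-descent LAD fit of y ~ a + b*x (illustrative sketch)."""
    a, b = np.median(y), 0.0
    for _ in range(n_iter):
        # slope step: minimize sum |(y_i - a) - b x_i| over b  ->  weighted median
        nz = x != 0
        b = weighted_median((y[nz] - a) / x[nz], np.abs(x[nz]))
        # intercept step: minimize sum |y_i - b x_i - a| over a  ->  ordinary median
        a = np.median(y - b * x)
    return a, b

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.standard_t(df=2, size=200)   # heavy-tailed errors
print(lad_line(x, y))                                # close to (1, 2) despite the heavy tails
```

Each step can only decrease the objective, but since the LAD criterion is not differentiable, such a coordinate scheme is not guaranteed to reach the global minimizer; in practice a linear-programming formulation is often preferred.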
Remark 7.4 It can be shown (see the chapter on M-functionals) that, the mean, the
median and the p-th quantile satisfy
    µ   = arg min_c E(X − c)²,
    m   = arg min_c E|X − c|,
    ξ_p = arg min_c E[ p(X − c)^+ + (1 − p)(X − c)^− ].
7.7 Exercises
1. Suppose X ∼ F , and m = F −1 (1/2), µ = EX, σ 2 = V ar(x). Show that
|m − µ| ≤ σ.
(Hint: use the L1 -optimal property of the median.)
2. Let F (x) = |x|, which is convex and has derivatives everywhere except 0. Denote its
left derivative by D(x) =: f (x−) = I{x > 0} − I{x ≤ 0}. Show that the one-term
Taylor expansion satisfies
|F (x + t) − F (x) − f (x−)t| =: ||x + t| − |x| − D(x)t|
≤ 2|t|I{|X| ≤ |t|}.
Chapter 8
M-Estimation
8.1 Definition
λF (t0 ) = 0.
Remark 8.1 Although the M -functional and M -estimators are usually derived by Min-
imization or Maximization of certain quantities EF ρ(X, t) (hence the name M ), we
actually define those items without any mention of EF ρ(X, t).
8.2 Examples of M-estimators
Example 1 (MLE). Let F = {F_θ(·) : θ ∈ Θ}. Assume that F_θ'(x) = f_θ(x) exists. Maximizing the log-likelihood log f_θ(x) is equivalent to minimizing ρ(x, θ) = − log f_θ(x). Choose
    ψ(x, θ) = − (∂/∂θ) log f_θ(x).
The resulting M -estimator T (Fn ) is an MLE.
and the corresponding M -estimators T (Fn ) to the location parameter. Different choices
of ψ lead to different estimators. Here are some examples.
1. The least squares estimate (LSE). If ρ(x) = x²/2, then ψ(x) = x. The solution to E_F ψ(X − t_0) = E_F(X − t_0) = µ_F − t_0 = 0 is t_0 = µ_F. Therefore,
       T(F) = ∫ x dF(x) = EX,  the mean functional,
   and
       T(F_n) = ∫ x dF_n(x) = X̄,  the sample mean.
2. The least absolute deviation estimate (LAD). If ρ(x) = |x|, then ψ(x) = sgn(x). (In fact, ρ(x) is not differentiable at 0, but we still define ψ(0) = 0.) Then we claim that any solution t_0 of E_F ψ(X − t_0) = 0 must be a median m.
3. pth LAD estimator (LAD). More generally, we can take ρ(x) = |x|^p / p, where p ∈ [1, 2]; then
       ψ(x) = −|x|^{p−1}   if x < 0,
       ψ(x) = 0            if x = 0,
       ψ(x) = |x|^{p−1}    if x > 0.
The M-estimator T (Fn ) is called the pth least absolute deviation estimator
(LAD) or the minimum Lp distance estimator. (Again, ρ(x) may not be differentiable
at 0, but we still define ψ(0) = 0).
4. Trimmed mean. Take
       ψ(x) = x I{|x| ≤ C} = x if |x| ≤ C, and 0 if |x| > C.
   Then the solution t_0 of
       E_F ψ(X − t_0) = E_F (X − t_0) I{|X − t_0| ≤ C} = E_F X I{|X − t_0| ≤ C} − t_0 P(|X − t_0| ≤ C) = 0
   satisfies
       t_0 = E_F X I{|X − t_0| ≤ C} / P(|X − t_0| ≤ C) = E_F( X | |X − t_0| ≤ C ) := T(F).
   Similarly, the sample version t̂ = T(F_n) satisfies
       t̂ = Σ_{i=1}^n X_i I{|X_i − t̂| ≤ C} / Σ_{i=1}^n I{|X_i − t̂| ≤ C} = E_{F_n}( X | |X − t̂| ≤ C ) := T(F_n).
   Intuitively, t̂ = T(F_n) is the mean over the restricted range [t̂ − C, t̂ + C]. Notice that both sides of the above equation involve t̂, but we can solve for t̂ iteratively, with the sample mean (or any other reasonable guess) as the initial value t̂_0; a numerical sketch of this iteration follows the examples below.
   Clearly, T(F_n) is a type of trimmed mean. It completely eliminates the influence of the extreme values (outliers).
5. Huber’s estimate (or winsorized mean). Huber (1964) considers
       ρ(x) = x²/2            if |x| ≤ C,
       ρ(x) = C|x| − C²/2     if |x| > C,
   with
       ψ(x) = C     if x > C,
       ψ(x) = x     if |x| ≤ C,
       ψ(x) = −C    if x < −C.
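The fixed-point iteration mentioned in Example 4 is easy to implement. Below is a minimal sketch (the cutoff C, the starting value, and the data are all illustrative choices, not prescriptions from the notes).

```python
import numpy as np

def trimmed_mean_mestimate(x, C, tol=1e-10, max_iter=200):
    """Solve t = mean of the X_i lying in [t - C, t + C] by fixed-point iteration."""
    t = np.mean(x)                      # initial guess: the sample mean
    for _ in range(max_iter):
        inside = np.abs(x - t) <= C
        t_new = x[inside].mean()
        if abs(t_new - t) < tol:
            break
        t = t_new
    return t

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(20.0, 1.0, 5)])  # 5 gross outliers
print(x.mean(), trimmed_mean_mestimate(x, C=2.5))   # the M-estimate ignores the outliers
```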
8.3 Consistency and asymptotic normality of M -estimates
Recall λ_F(t) = E_F ψ(X, t) = ∫ ψ(x, t) dF(x) and λ_{F_n}(t) = E_{F_n} ψ(X, t) = n^{-1} Σ_i ψ(X_i, t).
(b) Either t0 = T (F ) is an isolated root of λF (t) = 0, (or λF (t) changes sign uniquely
in a neighborhood of t0 ).
(c) {Tn } is a root of λFn (t) = 0, (or λFn (t) changes sign in a neighborhood of Tn ).
If t_0 is an isolated root of λ_F(t_0) = 0 (i.e., t_0 is the unique root in a neighborhood), then for any ε > 0,
recall that λ_{F_n}(t) is non-increasing in t; from assumption (c), we must have
    t_0 − ε ≤ T_n ≤ t_0 + ε,  a.s. for all large n.
(Note that T_n does not depend on ε and may not be unique, but all such roots satisfy the above inequality.) Letting ε → 0, we see that T_n → t_0 a.s.
(c) Eψ 2 (X, t) < ∞ for t in a neighborhood of t0 and is continuous at t = t0 .
Suppose that {Tn } is a root of λFn (t) = 0, or λFn (t) changes sign in a neighborhood of Tn .
Then,
(i) Tn −→a.s. t0 ;
(ii) √n (T_n − t_0) →_d N(0, σ²), where σ² = E_F ψ²(X, t_0) / [λ_F'(t_0)]².
Thus, P(λ_{F_n}(t) < 0) ≤ P(T_n ≤ t) ≤ P(λ_{F_n}(t) ≤ 0). Note that
    P( √n(T_n − t_0)/σ ≤ x ) = P( T_n ≤ t_0 + σx/√n ) = P( T_n ≤ t_{nx} ),   where t_{nx} = t_0 + σx/√n.
Now
    P( λ_{F_n}(t_{nx}) ≤ 0 ) = P( Σ_i ψ(X_i, t_{nx}) ≤ 0 )
      = P( Σ_i [ψ(X_i, t_{nx}) − Eψ(X_i, t_{nx})] / (√n σ_{nx}) ≤ −√n Eψ(X_i, t_{nx}) / σ_{nx} )
      =: P( n^{-1/2} Σ_i Y_{ni} ≤ −√n λ_F(t_{nx}) / σ_{nx} ),
and, since E_F ψ²(X, t) and E_F ψ(X, t) = λ_F(t) are both continuous at t_0, we have
    σ_{nx}² = E_F ψ²(X, t_{nx}) − [E_F ψ(X, t_{nx})]²
            = E_F ψ²(X, t_0) + o(1) − [E_F ψ(X, t_0) + o(1)]²
            = E_F ψ²(X, t_0) + o(1)
            = σ² [λ_F'(t_0)]² + o(1)
            → σ² [λ_F'(t_0)]².
Therefore, σ_{nx} → −λ_F'(t_0) σ, as λ_F'(t_0) ≤ 0. Hence, as n → ∞,
    −√n λ_F(t_{nx}) / σ_{nx} → x.
Since Yni , 1 ≤ i ≤ n, are i.i.d. with mean 0 and variance 1, we may apply the double
array CLT (Theorem 8.6) with Bn2 = nV ar(Yn1 ) = n. Applying Slutsky’s theorem and
the CLT for double arrays (see Theorem 8.6 in Section 8.6 below), we get
By the following remark, we have limn P (λFn (tnx ) < 0) = Φ(x). The proof is complete.
Remark 8.2 There are some equivalent definitions of weak convergence, referred to as
Portmanteau Theorem:
(a) Xn =⇒ X or Xn →d X.
(d) lim P (Xn ∈ A) = P (X ∈ A) for all sets A with P (X ∈ ∂A) = 0, where ∂A is the
n→∞
boundary of A.
(f ) Eg(Xn ) → Eg(X) for all functions g of the form g(x) = h(x)I[a,b] (x), where h is
continuous on [a, b] and a, b ∈ C(F ).
(g) limn ψn (t) = ψ(t), where ψn (t) and ψ(t) are the c.f.s of Xn and X, respectively.
Remark 8.3 In the quantile example below, the estimating equation λFn (t) = 0 reduces
to Fn (t) = p, which may have no solution or an infinite number of solutions, since Fn (t)
has jump sizes n−1 at each observation. If λFn (t) = 0 has no solution, then Tn is defined
to be a value which changes sign of λFn (t).
Example 1. (Sample pth Quantile). F is continuous, and 0 < p < 1. Let ξp be the
unique root of F (t) = p, and let ξˆp be a root of Fn (t) = p or Fn (t) = p changes signs at
ξˆp . Further assume that F 0 (ξp ) = f (ξp ). Then
    √n ( ξ̂_p − ξ_p ) →_d N( 0, p(1 − p)/f²(ξ_p) ).
Proof. Take ψ(x) = p/(1 − p) for x > 0, ψ(x) = −1 for x < 0 (and ψ(0) = 0). Then
    λ_F(t) = E_F ψ(X − t) = (−1) P(X − t < 0) + [p/(1 − p)] P(X − t > 0)
           = [p/(1 − p)][1 − F(t)] − F(t) = [p − F(t)] / (1 − p),
    Eψ²(X, ξ_p) = (−1)² P(X − ξ_p < 0) + [p/(1 − p)]² P(X − ξ_p > 0)
                = p + [p²/(1 − p)²](1 − p) = p/(1 − p),
and
    λ_F'(ξ_p) = −F'(ξ_p)/(1 − p) = −f(ξ_p)/(1 − p).
It is easy to check that all conditions in Theorem 8.2 are satisfied with t_0 = ξ_p. Therefore, √n (ξ̂_p − ξ_p) →_d N(0, σ²), where
    σ² = Eψ²(X, ξ_p) / [λ_F'(ξ_p)]² = [p/(1 − p)] · (1 − p)²/f²(ξ_p) = p(1 − p)/f²(ξ_p).
Ignoring the constant factor 1/(1 − p), we see that the p-th quantile estimate minimizes the weighted L1 loss E L(c), where L(c) = p(X − c)^+ + (1 − p)(X − c)^−.
Take p = 5% for instance: if we plot L(c) against c, deviations of X toward the left tail are penalized more heavily than deviations toward the right tail. This makes sense since we are interested in a quantile in the left tail (p = 5%, say). A small simulation checking the limiting variance is sketched below.
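The following quick Monte Carlo check of the limit just derived uses a standard normal population (an illustrative choice); scipy's norm is used only for the true quantile and density.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
p, n, reps = 0.25, 2000, 5000
xi_p = norm.ppf(p)                                            # true p-th quantile of N(0, 1)
est = np.array([np.quantile(rng.standard_normal(n), p) for _ in range(reps)])

print("empirical var of sqrt(n)(xi_hat - xi_p):", n * est.var())
print("theoretical p(1-p)/f(xi_p)^2           :", p * (1 - p) / norm.pdf(xi_p) ** 2)
```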
The next theorem trades monotonicity of ψ(x, ·) for smoothness restrictions. The method
has been most used for MLE.
Lemma 8.1 Let g(x, t) be continuous at t0 uniformly in x. Let F be a d.f. such that
EF |g(X, t0 )| < ∞. Let {Xi } be i.i.d. with d.f. F and suppose that Tn −→p t0 . Then,
    n^{-1} Σ_{i=1}^n g(X_i, T_n) →_p E_F g(X, t_0).
Proof. Write
    | n^{-1} Σ_i g(X_i, T_n) − E_F g(X, t_0) | ≤ | n^{-1} Σ_i [g(X_i, T_n) − g(X_i, t_0)] | + | n^{-1} Σ_i g(X_i, t_0) − E_F g(X, t_0) |.
The second term on the RHS →_p 0 by the law of large numbers, while the first term is bounded by
    n^{-1} Σ_i |g(X_i, T_n) − g(X_i, t_0)| →_p 0,
Proof. Since λ_F(t) is continuous in t, assumption (b) implies that there exists ε > 0 such that
    λ_F(t_0 + ε) > λ_F(t_0) = 0 > λ_F(t_0 − ε)   (or with the inequalities reversed).
For simplicity, we assume the first case holds. By the Strong Law of Large Numbers (SLLN), λ_{F_n}(t) → λ_F(t) a.s. for each t. Then, as n → ∞, λ_{F_n}(t_0 + ε) > 0 > λ_{F_n}(t_0 − ε) a.s. for all large n. Since λ_{F_n}(t) is also continuous in t, there EXISTS a solution sequence {T_n} of λ_{F_n}(t) = 0 in the interval [t_0 − ε, t_0 + ε]. That is,
    t_0 − ε ≤ T_n ≤ t_0 + ε,  a.s.
Remark 8.5 Note that λFn (t) = 0 in Theorem 8.3 may have multiple solutions. In that
case, one needs to identify a consistent solution sequence in practice. Therefore, for a
given solution sequence, e.g., one obtained by a specified algorithm, one needs to CHECK
if it is consistent or not.
Let {Tn } be a strongly consistent solution sequence of λFn (t) = 0 (The existence is estab-
lished in Theorem 8.3). Then,
    √n (T_n − t_0) →_d N(0, σ²),   where σ² = Var(φ_F(X_1)) = Eψ²(X, t_0) / [λ_F'(t_0)]².
The derivation of (3.4) will be given in the next theorem. Let us prove d∞ -Hadamard
differentiability next.
We now show Hadamard differentiability. For G_j = F + t_j Δ_j, we have
    T_F'(G_j − F) = − λ_{G_j − F}(T(F)) / λ_F'(T(F)) = − t_j λ_{Δ_j}(T(F)) / λ_F'(T(F)).
Then,
    0 = λ_{G_j}[T(G_j)] − λ_F[T(F)]
      = λ_{G_j}[T(G_j)] − λ_{G_j}[T(F)] + λ_{G_j}[T(F)] − λ_F[T(F)]
      = [T(G_j) − T(F)] D_j + t_j λ_{Δ_j}[T(F)],
where D_j := { λ_{G_j}[T(G_j)] − λ_{G_j}[T(F)] } / { T(G_j) − T(F) }. So
    T(G_j) − T(F) = T_F'(G_j − F) λ_F'[T(F)] / D_j
                  = T_F'(G_j − F) + T_F'(G_j − F) [ λ_F'[T(F)]/D_j − 1 ]
                  =: T_F'(G_j − F) + t_j λ_{Δ_j}[T(F)] Ξ_j,
with Ξ_j := −[ λ_F'[T(F)]/D_j − 1 ] / λ_F'[T(F)]. It suffices to show that |λ_{Δ_j}[T(F)]| < C and Ξ_j → 0. The first relation is easy to prove; the second can be proved as well.
Here ‖h‖_V := sup Σ_{i=1}^k |h(x_i) − h(x_{i−1})| denotes the total variation of h, the supremum being taken over all partitions a = x_0 < ... < x_k = b of [a, b].
Lemma 8.2 Let the function h be continuous with khkV < ∞ and the function K be
right-continuous with kKk∞ < ∞ and K(∞) = K(−∞) = 0. Then,
    | ∫ h dK | ≤ ‖h‖_V ‖K‖_∞ .
Proof. Integrating by parts, using the fact that h(±∞) is finite (why? please check) and the assumption K(∞) = K(−∞) = 0, we get
    | ∫ h dK | = | ∫ K dh | ≤ ∫ |K(x)| d|h|(x) ≤ sup_x |K(x)| ∫ d|h|(x) ≤ ‖K‖_∞ ‖h‖_V .
    √n (T_n − t_0) →_d N(0, σ²),   where σ² = Var(φ_F(X_1)) = Eψ²(X, t_0) / [λ_F'(t_0)]².
• Let us first find φ_F(x). Let F_t = F + t(G − F), where eventually G = δ_x, and recall λ_F(s) = E_F ψ(X, s) = ∫ ψ(y, s) dF(y). Then,
    λ_{F_t}(T(F_t)) = E_{F_t} ψ(X, T(F_t)) = E_F ψ(X, T(F_t)) + t [ E_G ψ(X, T(F_t)) − E_F ψ(X, T(F_t)) ]
                    = λ_F(T(F_t)) + t [ λ_G(T(F_t)) − λ_F(T(F_t)) ] = 0.
Differentiating w.r.t. t, we get
    λ_F'(T(F_t)) dT(F_t)/dt + [ λ_G(T(F_t)) − λ_F(T(F_t)) ] + t (d/dt)[ λ_G(T(F_t)) − λ_F(T(F_t)) ] = 0.
Setting t = 0 and using λ_F(t_0) = 0, we get
    λ_F'(t_0) dT(F_t)/dt |_{t=0} + λ_G(t_0) = 0.
Therefore,
    T_F'(G − F) = dT(F_t)/dt |_{t=0} = − λ_G(t_0)/λ_F'(t_0) = − E_G ψ(X, t_0)/λ_F'(t_0),
and, taking G = δ_x,
    φ_F(x) = T_F'(δ_x − F) = − E_{δ_x} ψ(X, t_0)/λ_F'(t_0) = − ψ(x, t_0)/λ_F'(t_0).
• Next consider the remainder. Write
    T(G) − t_0 = [ λ_F(T(G)) − λ_F(t_0) ] / h[T(G)],
where
    h[T(G)] = [ λ_F(T(G)) − λ_F(t_0) ] / [ T(G) − t_0 ]   if T(G) ≠ t_0,
    h[T(G)] = λ_F'(t_0)                                    if T(G) = t_0.
It is easy to see that h(t) → h(t_0) = λ_F'(t_0) as t → t_0. Therefore,
    R_n(G, F) = [ λ_F(T(G)) − λ_F(t_0) ] / h[T(G)] + λ_G(t_0) / λ_F'(t_0).
As it turns out, this expression is not especially manageable. (Try this yourself.) However, if the denominator λ_F'(t_0) in the second term of R_n is replaced by h[T(G)], things become a lot easier. In other words, we can study
We now specialize to G = Fn and T (G) = T (Fn ) = Tn . The assumption Tn −→p t0
yields that h[T (Fn )] −→p h[t0 ] = λ0F (t0 ). Check that Lemma 8.2 is applicable. We
thus have
    |R̃_n(F_n, F)| = | − ∫ [ψ(x, T_n) − ψ(x, t_0)] d[F_n(x) − F(x)] / h[T_n] |
                   ≤ ‖ψ(·, T_n) − ψ(·, t_0)‖_V ‖F_n − F‖_∞ / |h[T_n]|
                   = o_p(1) ‖F_n − F‖_∞ = o_p(n^{-1/2}) [ n^{1/2} ‖F_n − F‖_∞ ]
                   = o_p(n^{-1/2}) O_p(1)
        (recall that E( n^{1/2} ‖F_n − F‖_∞ )^s < ∞ for any s > 0 from before)
                   = o_p(n^{-1/2}).
(a) least pth power estimates, i.e., ψ(x) = |x|p−1 sgn(x) (1 < p ≤ 2).
(b) Huber’s estimates
(c) Hampel’s (smoothed or non-smoothed) estimates.
(i) the uniform asymptotic negligibility condition
Pkn
(ii) the asymptotic normality condition j=1 Xnj ∼asy N (An , Bn2 )
Let X_1, ..., X_n ∼ F(x) with symmetric pdf f(x). Then µ = m. Here we could use either the sample mean X̄ or the sample median m̂ to estimate the center.
Therefore,
    ARE(m̂, X̄) = (σ²/n) / ( [4f²(µ)]^{-1}/n ) = 4σ² f²(µ).
1. Take f(x) = (2πσ²)^{-1/2} e^{-(x−µ)²/(2σ²)} (i.e., N(µ, σ²)). Then f²(µ) = 1/(2πσ²). So
       ARE(m̂, X̄) = 4σ²/(2πσ²) = 2/π ≈ 0.64.
2. If f(x) = e^{-x}/(1 + e^{-x})² (logistic distribution), then µ = EX = 0, σ² = var(X) = π²/3 and f(0) = 1/4, so
       ARE(m̂, X̄) = 4 · (π²/3) · (1/16) = π²/12 ≈ 0.82.
4. If f (x) is the pdf of Cauchy distribution, then σ 2 = ∞. So
ARE(m̂, X̄) = ∞.
Remark: This example illustrates that as the tail of the pdf gets thicker, the median becomes more efficient relative to the mean. A small Monte Carlo comparison is sketched below.
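The following Monte Carlo illustration of this remark compares the two estimators under a normal and a heavy-tailed t(3) population; the sample size, replication count and the t(3) alternative are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 4000

for name, sampler in [("normal", lambda size: rng.standard_normal(size)),
                      ("t(3)",   lambda size: rng.standard_t(3, size))]:
    means = np.array([sampler(n).mean() for _ in range(reps)])
    medians = np.array([np.median(sampler(n)) for _ in range(reps)])
    # ARE(median, mean) estimated as var(mean) / var(median)
    print(name, means.var() / medians.var())
# roughly 2/pi ~ 0.64 for the normal, and a value above 1 for the heavy-tailed t(3)
```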
Logistic distribution. X ∼ f(x) = e^{-x}/(1 + e^{-x})². Find µ = EX and σ² = var(X).
Solution. Clearly, f(−x) = e^{x}/(1 + e^{x})² = e^{-x}/(1 + e^{-x})² = f(x), i.e., f(x) is even and symmetric around the origin. Thus µ = EX = 0. And
    σ² = E(X − µ)² = EX² = ∫_{−∞}^{∞} x² f(x) dx = 2 ∫_0^{∞} x² f(x) dx
       = 2 ∫_0^{∞} x² e^{x}/(1 + e^{x})² dx
       = −2 ∫_0^{∞} x² d( 1/(1 + e^{x}) )
       = −2 [ x²/(1 + e^{x}) ]_0^{∞} + 4 ∫_0^{∞} x/(1 + e^{x}) dx
       = 4 ∫_0^{∞} x e^{-x}/(1 + e^{-x}) dx
       = 4 ∫_0^{∞} x e^{-x} ( 1 − e^{-x} + e^{-2x} − e^{-3x} + e^{-4x} − ... ) dx
       = 4 ∫_0^{∞} x e^{-x} Σ_{m=0}^{∞} (−1)^m e^{-mx} dx
       = 4 Σ_{m=0}^{∞} (−1)^m ∫_0^{∞} x e^{-(m+1)x} dx        (interchanging integration and summation)
       = 4 Σ_{m=0}^{∞} (−1)^m/(m+1)² ∫_0^{∞} y e^{-y} dy       [take y = (m+1)x]
       = 4 Σ_{m=0}^{∞} (−1)^m/(m+1)² = 4 ( 1 − 1/2² + 1/3² − 1/4² + 1/5² − 1/6² + ... ).
But it is known that 1 + 1/2² + 1/3² + 1/4² + 1/5² + 1/6² + ... = π²/6. So
    2 ( 1/2² + 1/4² + 1/6² + 1/8² + ... ) = 2 × (1/4) ( 1 + 1/2² + 1/3² + 1/4² + ... ) = π²/12.
The difference of the above two relations gives
    1 − 1/2² + 1/3² − 1/4² + 1/5² − 1/6² + ... = π²/6 − π²/12 = π²/12.
Therefore, σ² = 4 × π²/12 = π²/3.
8.5 Robustness v.s. Efficiency
    σ_M² = Eψ²(X − θ_0) / [Eψ'(X − θ_0)]² =: A/B².
By the Cauchy–Schwarz inequality,
    σ_M² = A/B² ≥ 1/I(θ),
with equality if and only if
    ψ(X − θ_0) = a ∂ ln f_{θ_0}(X)/∂θ,   for some constant a.
That is, M-estimators are not efficient unless they are MLE’s.
The method of GEE is a very powerful tool and general method of deriving point estima-
tors, which includes many other methods as special cases. We omit the details here.
8.7 Exercises
Using the theorem given in the lecture, show that any solution sequence Tn of
Pn
i=1 ψ(Xi , Tn ) = 0 satisfies asymptotically
√
n (Tn − θ) −→ N (0, σ 2 ), in distribution,
i.e., only the smallest/largest two order statistics do not have finite second moments.
(c) Check whether the trimmed mean obtained by removing the smallest and largest two observations is a consistent estimate of the center of the d.f.
(Hint: find the mean/variance of the trimmed mean.)
Chapter 9
U -Statistics
Hoeffding (1948) introduced U -statistics and proved their central limit theorem. There
are several factors that are unique to the class of U -statistics.
• Many of the techniques developed for U -statistics are also essential in studying even
more general classes of statistics, such as the symmetrical statistics.
• The special dependence structure in the U -statistics allows one to employ the mar-
tingale properties extensively.
9.1 U -statistics
Definition. Let X_1, ..., X_n ∼_{i.i.d.} F and let h(x_1, ..., x_m) be a symmetric function of its arguments. Consider
    θ = T(F) = E_F h(X_1, ..., X_m) = ∫ ... ∫ h(x_1, ..., x_m) dF(x_1) ... dF(x_m).
The corresponding U-statistic is
    U_n = (n choose m)^{-1} Σ_{1 ≤ i_1 < ... < i_m ≤ n} h(X_{i_1}, ..., X_{i_m})
(for m = 2 this is U_n = [n(n − 1)]^{-1} Σ_{i ≠ j} h(X_i, X_j)), while the corresponding V-statistic plugs F_n in everywhere: V_n = n^{-m} Σ_{i_1=1}^n ... Σ_{i_m=1}^n h(X_{i_1}, ..., X_{i_m}).
Remark 9.1 U-statistics are always unbiased (“U” = unbiased) while V-statistics (“V” comes from “von Mises”, who studied this type of statistics first) are typically biased for m ≥ 2, since usually Eh(X_1, ..., X_1) ≠ Eh(X_1, ..., X_m) for the diagonal terms.
9.2 Examples
• If h(x) = x, the U-statistic
Un = x̄n = (x1 + · · · + xn )/n
is the sample mean.
• If h(x_1, x_2) = |x_1 − x_2|, the U-statistic
      U_n = [n(n − 1)]^{-1} Σ_{i ≠ j} |x_i − x_j|,   n ≥ 2,
  is an unbiased estimate of E|X_1 − X_2| (Gini’s mean difference).
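A minimal sketch computing the two examples above directly from the definitions; the kernel and data below are illustrative choices.

```python
import numpy as np
from itertools import combinations

def u_statistic(x, h):
    """U-statistic for a symmetric kernel h of two arguments (average over pairs i < j)."""
    return np.mean([h(x[i], x[j]) for i, j in combinations(range(len(x)), 2)])

rng = np.random.default_rng(4)
x = rng.exponential(size=50)
u_gini = u_statistic(x, lambda a, b: abs(a - b))        # unbiased for E|X1 - X2|
v_gini = np.mean(np.abs(x[:, None] - x[None, :]))       # V-statistic: diagonal included, biased
print(u_gini, v_gini)
```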
9.3 Hoeffding-decomposition
where
    T_i    = E(T|X_i),
    T_{ij} = E(T|X_i, X_j) − T_i − T_j,
    T_{ijk} = E(T|X_i, X_j, X_k) − E(T|X_i, X_j) − E(T|X_i, X_k) − E(T|X_j, X_k) + E(T|X_i) + E(T|X_j) + E(T|X_k)
            = E(T|X_i, X_j, X_k)
              − {E(T|X_i, X_j) − E(T|X_i) − E(T|X_j)}
              − {E(T|X_i, X_k) − E(T|X_i) − E(T|X_k)}
              − {E(T|X_j, X_k) − E(T|X_j) − E(T|X_k)}
              − E(T|X_i) − E(T|X_j) − E(T|X_k)
            = E(T|X_i, X_j, X_k) − (T_{ij} + T_{ik} + T_{jk}) − (T_i + T_j + T_k).
Clearly,
• Ti is the main effect of Xi on T ,
• Tij is the 2-way interaction effect of Xi and Xj on T , after removing the main effects.
• Tijk is the 3-way interaction effect of Xi , Xj , Xk on T , after removing the main and
two-way interaction effects.
Heuristic proof
• For n = 1,
• For n = 2,
T := f (X1 , X2 )
= E(T |X1 ) + E(T |X2 ) + {E(T |X1 , X2 ) − E(T |X1 ) − E(T |X2 )}
2
X X
= E(T |Xi ) + [E(T |Xi , Xj ) − E(T |Xi ) − E(T |Xj )]
i=1 1≤i<j≤2
• For n = 3,
T := f (X1 , X2 , X3 )
= E(T |X1 ) + E(T |X2 ) + E(T |X3 )
+{E(T |X1 , X2 ) − E(T |X1 ) − E(T |X2 )}
+{E(T |X1 , X3 ) − E(T |X1 ) − E(T |X3 )}
+{E(T |X2 , X3 ) − E(T |X2 ) − E(T |X3 )}
+{E(T |X1 , X2 , X3 ) − E(T |X1 , X2 ) − E(T |X1 , X3 ) − E(T |X2 , X3 )
+E(T |X1 ) + E(T |X2 ) + E(T |X3 )}
3
X X
= E(T |Xi ) + [E(T |Xi , Xj ) − E(T |Xi ) − E(T |Xj )]
i=1 1≤i<j≤3
X
+ {E(T |Xi , Xj , Xk ) − E(T |Xi , Xj ) − E(T |Xi , Xk ) − E(T |Xj , Xk )
1≤i<j<k≤3
+E(T |Xi ) + E(T |Xj ) + E(T |Xk )}.
For R_2, the contribution of the 2-way interaction of each pair (X_{i_1}, X_{i_2}) is T_{i_1, i_2}, giving
    T = ET + Σ_{i=1}^n T_i + Σ_{1 ≤ i_1 < i_2 ≤ n} T_{i_1, i_2} + R_3.
Continuing until all interaction effects are taken care of gives the ANOVA decomposition above.
Hence,
    E ψ(X_1, X_2) ψ(X_1, X_3) = E E[ ψ(X_1, X_2) ψ(X_1, X_3) | X_1 ]
                              = E{ E[ψ(X_1, X_2)|X_1] · E[ψ(X_1, X_3)|X_1] } = 0.
9.3.1 Asymptotic Normality
Proof. Note that √n (U_n − θ) = (2/√n) Σ_{i=1}^n g(X_i) + √n R_n, where R_n = [2/(n(n − 1))] Σ_{i<j} ψ(X_i, X_j). Now
    Var(√n R_n) = [4n/(n²(n − 1)²)] Σ_{i<j} Var(ψ(X_i, X_j)) = [2/(n − 1)] Var(ψ(X_1, X_2)) → 0,
implying √n R_n →_p 0. Then apply the CLT and Slutsky’s theorem.
Here g(X_i) = E{h(X_i, X_j)|X_i} − θ can be approximated by ĝ(X_i) ≈ (n − 1)^{-1} Σ_{j ≠ i} h_{ij} − U_n. Then an approximation to the variance of U_n is
    V̂ar(U_n) = (4/n²) Σ_{i=1}^n [ (n − 1)^{-1} Σ_{j ≠ i} h_{ij} − U_n ]².
This is almost the same as the jackknife variance estimator, V̂ar_Jack(U_n), given in the next theorem. In effect, V̂ar_Jack(U_n) estimates the variance of the dominating term in the H-decomposition of U_n.
Furthermore, V̂ar_Jack(U_n) is a consistent estimator of Var(U_n) in the sense that
    V̂ar_Jack(U_n) / Var(U_n) →_p 1,   as n → ∞.    (4.2)
Chapter 10
Jackknife
The jackknife and bootstrap are two popular nonparametric methods in statistical infer-
ence. The first idea related to the bootstrap was von-Mises, who used the plug-in principle
in the 1930’s, (although it can be argued that Gauss invented the bootstrap in 1800’s).
Then in the late 40’s Quenouille found a way of correcting the bias for estimators whose
bias was known to be of the form. The method is called the jackknife.
Both methods provide several advantages over the traditional parametric approach:
the methods are easy to describe and they apply to arbitrarily complicated situations;
distribution assumptions, such as normality, are never made.
3. its sampling distribution: P ((Tn − θ)/σTn ≤ x), or P ((Tn − θ)/σ̂Tn ≤ x)? (for con-
structing confidence intervals for θ).
Example 2. Let θ = µ² and T_n = (X̄)². Then
1. Bias[(X̄)²] = E[(X̄)²] − µ² = Var(X̄) + [E(X̄)]² − µ² = σ²/n. Hence, a bias-corrected estimator of µ² is given by
       T̃_n = T_n − S̃²/n = (X̄)² − S̃²/n,
   which is in fact unbiased for µ². (But its variance may increase as a result, so one needs to use the MSE criterion, e.g., to choose between the two.)
2. Let us find out V ar(Tn ) next. Let Y = X − µ. Note that Tn = (X̄ − µ + µ)2 =
(Ȳ + µ)2 = (Ȳ )2 + 2µȲ + µ2 . Hence
Exercise. Let (Xi , Yi ) be i.i.d. r.v.s. Let θ = ρ and Tn = ρ̂ be the population and sample
correlation coefficients, respectively. Repeat the last example.
(1) The theoretical formulas for the bias, variance, etc. can be very difficult or even impossible to obtain. Deriving them often requires the data analyst to have a strong mathematical and statistical background.
(3) The formulas are problem-specific. For a new problem, we need to repeat the whole derivation all over again.
(4) The formulas are often valid only for large sample sizes n (e.g. via the CLT). The performance for small n can often be poor.
In the following two chapters, we shall introduce alternative methods such as boot-
strap and jackknife and other resampling methods to overcome the above-mentioned dif-
ficulties. Some of the advantages are listed below.
(1) The methods are highly automatic, easy to use.
(2) The methods require little mathematical background on the user’s part.
(3) The methods are general, not problem-specific.
(4) The methods have good performances for small sample size, as compared to some
asymptotic methods.
10.2 Jackknife
The jackknife preceded the bootstrap. It is simple to use. The original work on the
“delete-one” jackknife is due to Quenouille (1949) and Tukey (1958).
10.2.1 Definition
where
    F_{n−1,i}(x) ≡ (n − 1)^{-1} Σ_{j ≠ i} I{X_j ≤ x}.
That is, T^{(−i)}_{n−1} = T(F_{n−1,i}) is the estimate of θ = T(F) based on the data with X_i “deleted” or left out. The pseudo-values are defined by
    T_i^{pseudo} = n T_n − (n − 1) T^{(−i)}_{n−1},   1 ≤ i ≤ n.
• Tipseudo (1 ≤ i ≤ n) may be treated as i.i.d. r.v.’s, which have been verified in many
cases such as U -statistics and functions of means.
√
• Tipseudo has approximately the same variance as nTn .
We shall now use the pseudo-values T_i^{pseudo} to estimate the bias, variance, etc. Let
    T^{(·)}_{n−1} = n^{-1} Σ_{i=1}^n T^{(−i)}_{n−1}.
3. From Remark 10.1, we estimate var(√n T_n) by the sample variance of the pseudo-values T_i^{pseudo}. Hence, the jackknife estimate of Var(T_n) is
    V̂_Jack(T_n) = [n(n − 1)]^{-1} Σ_{i=1}^n ( T_i^{pseudo} − n^{-1} Σ_{j=1}^n T_j^{pseudo} )²
                 = [(n − 1)/n] Σ_{i=1}^n ( T^{(−i)}_{n−1} − T^{(·)}_{n−1} )².
(A numerical sketch of the delete-1 jackknife appears after this list.)
4. The jackknife can also be used to estimate d.f.’s. Details are omitted.
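The following is a minimal delete-1 jackknife sketch, implementing the bias-corrected estimate T_n^{Jack} = nT_n − (n − 1)T^{(·)}_{n−1} and the variance formula of item 3 above; the statistic and data are illustrative choices.

```python
import numpy as np

def jackknife(x, stat):
    """Delete-1 jackknife: bias-corrected estimate and variance estimate of Tn = stat(x)."""
    n = len(x)
    tn = stat(x)
    loo = np.array([stat(np.delete(x, i)) for i in range(n)])   # T_{n-1}^{(-i)}
    t_dot = loo.mean()                                          # T_{n-1}^{(.)}
    t_jack = n * tn - (n - 1) * t_dot                           # bias-corrected estimate
    v_jack = (n - 1) / n * np.sum((loo - t_dot) ** 2)           # jackknife variance
    return t_jack, v_jack

rng = np.random.default_rng(5)
x = rng.normal(loc=3.0, size=100)
print(x.mean() ** 2, *jackknife(x, lambda s: s.mean() ** 2))    # estimating mu^2 = 9
```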
Remark 10.2 The delete-1 jackknife estimate of P(T_n ≤ t) can be taken as P̂_Jack(T_n ≤ t) = n^{-1} Σ_{i=1}^n I{T^{(−i)}_{n−1} ≤ t} or P̂_Jack(T_n ≤ t) = n^{-1} Σ_{i=1}^n I{T_i^{pseudo} ≤ t}. This has poor properties for i.i.d. samples. For dependent samples, however, a jackknife estimate of this type can be used, as will be mentioned later.
So TnJack has bias of order O(n−2 ), one order smaller than the bias of Tn (of O(n−1 )).
One can repeat the jackknife again and again to keep on reducing the bias. But be
aware: the jackknife might introduce more variability. One could use the Mean Square
Error (MSE) to see if one- or more-step jackknife is worthwhile or not. Usually, the
jackknife is rarely used beyond one-step.
    T^{(·)}_{n−1} = n^{-1} Σ_{i=1}^n T^{(−i)}_{n−1} = (nX̄ − X̄)/(n − 1) = X̄.
So we have
2. The jackknife (bias-corrected) estimate of θ = T(F) is
       T_n^{Jack} ≡ T_n − b̂ias(T_n) = (X̄)² − S̃²/n,
   which is unbiased, since E T_n^{Jack} − µ² = E(X̄)² − E S̃²/n − µ² = (µ² + σ²/n) − σ²/n − µ² = 0.
3. The jackknife estimate of V ar(θ̂) is left as an exercise.
Example 3. Let θ = T (F ) = σ 2 = V ar(X), and Tn = T (Fn ) = n−1 ni=1 (Xi − X̄)2 , which
P
Example 4. Let Tn = X(n) be the largest order statistic of a random sample X1 , ..., Xn .
(Tn is often used to estimate the boundary point of a d.f., such as Unif[0, θ]). Find its
jackknife estimator TnJack . Compare their MSEs.
Proof. It is easy to find that T_n^{Jack} = [(2n − 1)/n] X_{(n)} − [(n − 1)/n] X_{(n−1)}. Finding the MSEs is more tedious.
more tedious.
Definition. Let X1 , ..., Xn ∼i.i.d. F and h(x1 , ..., xm ) be a symmetric function in its
arguments. Consider
Z Z
θ = T (F ) = Eh(X1 , ..., Xm ) = ... h(x1 , ..., xm )dF (x1 )...dF (xm ).
Remark 10.3 U -statistics are always unbiased (“U ” = unbiased) while V -statistics (“V ”
comes from “von Mises”, who studied this type of statistics first) are typically biased for
m ≥ 2, since usually Eh(X1 , ..., X1 ) 6= Eh(X1 , ..., Xn ) for diagonal elements.
10.3.1 Reduce bias for V -statistics by jackknife
Examples.
The above three examples are all special cases of the next theorem.
    V_{n,Jack} = U_n.
Proof. Recall that V_{n,Jack} = n V_n − (n − 1) V^{(·)}_{n−1}. We first calculate V^{(·)}_{n−1}. Denote h_{ij} := h(X_i, X_j); then
    V_n = n^{-2} Σ_{i=1}^n Σ_{j=1}^n h_{ij}.
Hence (easy to see from a diagram),
    V^{(−k)}_{n−1} = (n − 1)^{-2} [ Σ_i Σ_j h_{ij} − Σ_i h_{ik} − Σ_j h_{kj} + h_{kk} ]
                   = (n − 1)^{-2} [ Σ_i Σ_j h_{ij} − 2 Σ_j h_{kj} + h_{kk} ],
    V^{(·)}_{n−1} = n^{-1} Σ_{k=1}^n V^{(−k)}_{n−1}
                  = (n − 1)^{-2} [ Σ_i Σ_j h_{ij} − (2/n) Σ_k Σ_j h_{kj} + (1/n) Σ_k h_{kk} ]
                  = (n − 1)^{-2} [ ((n − 2)/n) Σ_i Σ_j h_{ij} + (1/n) Σ_k h_{kk} ].
Therefore,
    V_{n,Jack} = n V_n − (n − 1) V^{(·)}_{n−1}
               = (1/n) Σ_i Σ_j h_{ij} − (n − 1)^{-1} [ ((n − 2)/n) Σ_i Σ_j h_{ij} + (1/n) Σ_k h_{kk} ]
               = (1/n) [ 1 − (n − 2)/(n − 1) ] Σ_i Σ_j h_{ij} − [n(n − 1)]^{-1} Σ_k h_{kk}
               = [n(n − 1)]^{-1} [ Σ_i Σ_j h_{ij} − Σ_{i=j} h_{ij} ]
               = [n(n − 1)]^{-1} Σ_{i ≠ j} h_{ij}
               = U_n.
Furthermore, V̂ar_Jack(U_n) is a consistent estimator of Var(U_n) in the sense that
    V̂ar_Jack(U_n) / Var(U_n) →_p 1,   as n → ∞.    (3.2)
Proof. We only derive (3.1) below (the proof of (3.2) is more involved and hence omitted). From Theorem 10.1, we have
    U_n = [n(n − 1)]^{-1} Σ_{i ≠ j} h_{ij},
    U^{(−k)}_{n−1} = [(n − 1)(n − 2)]^{-1} [ Σ_{i ≠ j} h_{ij} − 2 Σ_{j ≠ k} h_{kj} ],
    U^{(·)}_{n−1} = n^{-1} Σ_{k=1}^n U^{(−k)}_{n−1}
                  = [(n − 1)(n − 2)]^{-1} [ Σ_{i ≠ j} h_{ij} − (2/n) Σ_{i ≠ j} h_{ij} ]
                  = [(n − 1)(n − 2)]^{-1} [ Σ_{i ≠ j} h_{ij} − 2(n − 1) U_n ],
    V̂ar_Jack(U_n) = [(n − 1)/n] Σ_{k=1}^n ( U^{(−k)}_{n−1} − U^{(·)}_{n−1} )²
                   = [(n − 1)/n] [(n − 1)(n − 2)]^{-2} Σ_{k=1}^n [ 2 Σ_{j ≠ k} h_{kj} − 2(n − 1) U_n ]²
                   = [(n − 1)/n] [ 4(n − 1)²/((n − 1)²(n − 2)²) ] Σ_{k=1}^n [ (n − 1)^{-1} Σ_{j ≠ k} h_{kj} − U_n ]²
                   = [ 4(n − 1)/(n(n − 2)²) ] Σ_{k=1}^n [ (n − 1)^{-1} Σ_{j ≠ k} h_{kj} − U_n ]².
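A numerical check of this identity for the kernel h(x_1, x_2) = |x_1 − x_2| (data and seed are illustrative): the closed form above and a direct delete-1 jackknife agree to machine precision.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(size=40)
n = len(x)
h = np.abs(x[:, None] - x[None, :])                 # kernel matrix, zero on the diagonal
un = h.sum() / (n * (n - 1))                        # U_n

# closed form: 4(n-1)/(n(n-2)^2) * sum_k ( (n-1)^{-1} sum_{j != k} h_kj - U_n )^2
row_mean = h.sum(axis=1) / (n - 1)
var_closed = 4 * (n - 1) / (n * (n - 2) ** 2) * np.sum((row_mean - un) ** 2)

def u_stat(s):
    m = len(s)
    return np.abs(s[:, None] - s[None, :]).sum() / (m * (m - 1))

loo = np.array([u_stat(np.delete(x, k)) for k in range(n)])
var_direct = (n - 1) / n * np.sum((loo - loo.mean()) ** 2)
print(var_closed, var_direct)                       # the two numbers agree
```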
where T_i = E(T − ET|X_i), T_{i_1,i_2} = E(T − ET|X_{i_1}, X_{i_2}) − T_{i_1} − T_{i_2}, etc. Start from
    T = ET + R_1.
From R_1, we take the main effect (contribution) of each X_i, given by T_i, resulting in
    T = ET + Σ_{i=1}^n T_i + R_2.
From R_2, we take the 2-way interaction of each pair (X_{i_1}, X_{i_2}) (with the main effects removed), given by T_{i_1,i_2}, resulting in
    T = ET + Σ_{i=1}^n T_i + Σ_{1 ≤ i_1 < i_2 ≤ n} T_{i_1,i_2} + R_3.
We can continue until all the interaction effects have been taken care of, giving the ANOVA decomposition above.
where g(X_i) = E(h_{ij} | X_i) − θ and ψ(X_i, X_j) = h_{ij} − g(X_i) − g(X_j) + θ, and all the terms on the RHS of U_n are uncorrelated.
Proof. WLOG, assume that θ = 0. Then
    T_1 = E(U_n|X_1) = [n(n − 1)]^{-1} Σ_{i ≠ j} E[h(X_i, X_j)|X_1]
        = [n(n − 1)]^{-1} 2(n − 1) E[h(X_1, X_j)|X_1] = [n(n − 1)]^{-1} 2(n − 1) g_1 = (2/n) g_1,
    E(U_n|X_1, X_2) = [n(n − 1)]^{-1} Σ_{i ≠ j} E[h(X_i, X_j)|X_1, X_2]
                    = [n(n − 1)]^{-1} ( h_{12} + 2(n − 2) g_1 + 2(n − 2) g_2 ).
Hence,
    Var(U_n) = 4 Var[g(X_1)]/n + O(n^{-2}),
which can be approximated by
    Ṽar(U_n) ≈ (4/n) · n^{-1} Σ_{i=1}^n [g(X_i)]² = (4/n²) Σ_{i=1}^n g²(X_i).
Combining the above, we get an approximation to the variance of U_n:
    V̂ar(U_n) = (4/n²) Σ_{i=1}^n [ (n − 1)^{-1} Σ_{j ≠ i} h_{ij} − U_n ]².
Remark 10.7 Asymptotic properties such as the CLT for U -statistics, Berry-Esseen bounds,
and so on will be described in a later chapter.
Let X̄ and Σ̂ = n^{-1} Σ_{i=1}^n (X_i − X̄)(X_i − X̄)^τ denote the sample mean and sample covariance matrix. One obvious estimate of σ_g² is the plug-in (i.e., bootstrap) estimator:
However, this formula requires the user to carry out much analytical work, such as calculating ∇g(X̄), etc. By comparison, the jackknife provides a very simple alternative, which requires no differentiation at all.
    V̂ar_Jack[g(X̄)] / (σ_g²/n) = n V̂ar_Jack[g(X̄)] / σ_g² →_{a.s.} 1.
Proof. Let θ = g(µ) = T (F ), and θ̂ = T (Fn ) = g(X̄). The jackknife estimate of V ar(θ̂)
is
    n V̂ar_Jack[g(X̄)] = n V̂ar_Jack(T_n)
      = (n − 1) Σ_{i=1}^n ( T^{(−i)}_{n−1} − T^{(·)}_{n−1} )²
      = (n − 1) Σ_{i=1}^n ( g(X̄^{(−i)}) − n^{-1} Σ_{j=1}^n g(X̄^{(−j)}) )²
      = (n − 1) Σ_{i=1}^n ( [g(X̄^{(−i)}) − g(X̄)] − n^{-1} Σ_{j=1}^n [g(X̄^{(−j)}) − g(X̄)] )².
Now
    X̄^{(−i)} = (n − 1)^{-1} (X_1 + ... + X_{i−1} + X_{i+1} + ... + X_n) = (nX̄ − X_i)/(n − 1),
    X̄^{(−i)} − X̄ = [nX̄ − X_i − (n − 1)X̄]/(n − 1) = (X̄ − X_i)/(n − 1),   n^{-1} Σ_{i=1}^n (X̄^{(−i)} − X̄) = 0.
Furthermore, by the mean-value theorem, we have
    g(X̄^{(−i)}) − g(X̄) = ∇g(ξ_i)^τ (X̄^{(−i)} − X̄) = ∇g(X̄)^τ (X̄^{(−i)} − X̄) + R_i,
Note that (trace of Σ̂) →_{a.s.} (trace of Σ). Since ξ_i lies between X̄^{(−i)} and X̄, we have
    max_{1≤i≤n} ‖ξ_i − X̄‖² ≤ max_{1≤i≤n} ‖X̄^{(−i)} − X̄‖² = (n − 1)^{-2} max_{1≤i≤n} ‖X_i − X̄‖²
      ≤ (n − 1)^{-2} Σ_{i=1}^n ‖X_i − X̄‖² ≤ [n/(n − 1)²] (trace of Σ̂) →_{a.s.} 0.
That is, max1≤i≤n kξi − X̄k −→a.s. 0. This, together with X̄ −→a.s. µ, implies that
un −→a.s. 0. Hence, Bn −→a.s. 0.
For Cn2 ≤ |An | |Bn |, since An −→a.s. σg2 and Bn −→a.s. 0, we have Cn −→a.s. 0.
q
Remark 10.8 Theorem 10.3 implies that [g(X̄) − g(µ)]/ Vd
arn,Jack [g(X̄)] −→d N (0, 1),
which could be used to construct a C.I. for g(µ).
Remark 10.9 From Theorem 10.3, V̂ar_Jack[g(X̄)] is consistent if ∇g(x) is continuous in a neighborhood of µ. On the other hand, σ̂_g² is consistent if ∇g(x) is continuous only at µ. So the jackknife needs a slightly stronger condition. (A small numerical comparison is sketched below.)
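The following small comparison of the two variance estimates is for a smooth function of a (here one-dimensional) mean; the function g and the data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=300)
n = len(x)
g = lambda m: np.exp(m)                              # smooth function of the mean

# plug-in (delta-method) estimate: g'(xbar)^2 * sigma_hat^2 / n -- needs the derivative
xbar = x.mean()
var_plugin = np.exp(xbar) ** 2 * x.var() / n

# jackknife estimate -- no derivative required
loo = np.array([g(np.delete(x, i).mean()) for i in range(n)])
var_jack = (n - 1) / n * np.sum((loo - loo.mean()) ** 2)
print(var_plugin, var_jack)                          # close for large n
```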
Consider the linear model Y_i = x_i^τ β + ε_i, i = 1, ..., n, or in matrix form
    Y = Xβ + ε,
where
    Y = (Y_1, ..., Y_n)^τ,   X is the n × p design matrix with rows x_1^τ, ..., x_n^τ,
    β = (β_0, ..., β_{p−1})^τ,   ε = (ε_1, ..., ε_n)^τ.
The LSE is β̂ = (X^τ X)^{-1} X^τ Y, with
    Var(β̂) = (X^τ X)^{-1} (X^τ W X) (X^τ X)^{-1},   where W = diag(σ_1², ..., σ_n²).
Basic properties
• E β̂ = β (unbiased).
• If σ_i² = σ² for all i, then Var(β̂) = σ² (X^τ X)^{-1}. Furthermore, in this case β̂ is the best linear unbiased estimator (BLUE) of β.
Miller (1974) first extended the jackknife to the regression problem. Let β̂_{(i)} be the LSE of β based on the data (x_1^τ, Y_1), ..., (x_{i−1}^τ, Y_{i−1}), (x_{i+1}^τ, Y_{i+1}), ..., (x_n^τ, Y_n). That is,
    β̂_{(i)} = ( X_{(−i)}^τ X_{(−i)} )^{-1} X_{(−i)}^τ Y_{(i)}
            = ( Σ_{j=1}^n x_j x_j^τ − x_i x_i^τ )^{-1} ( Σ_{j=1}^n x_j Y_j − x_i Y_i )
            = ( X^τ X − x_i x_i^τ )^{-1} ( X^τ Y − x_i Y_i ).
Recall the rank-one update identity
    (A + uv^τ)^{-1} = A^{-1} − A^{-1} u v^τ A^{-1} / (1 + v^τ A^{-1} u),
which is true since
    (A + uv^τ) [ A^{-1} − A^{-1} u v^τ A^{-1} / (1 + v^τ A^{-1} u) ]
      = I + u v^τ A^{-1} − u v^τ A^{-1} / (1 + v^τ A^{-1} u) − u (v^τ A^{-1} u) v^τ A^{-1} / (1 + v^τ A^{-1} u)
      = I + u v^τ A^{-1} − (1 + v^τ A^{-1} u) u v^τ A^{-1} / (1 + v^τ A^{-1} u)     (since v^τ A^{-1} u is a scalar)
      = I + u v^τ A^{-1} − u v^τ A^{-1} = I.
Letting
    h_i = x_i^τ (X^τ X)^{-1} x_i,   r_i = Y_i − x_i^τ β̂,   z_i = (X^τ X)^{-1} x_i,   b_i = z_i r_i / (1 − h_i),
we have
    β̂_{(i)} = [ (X^τ X)^{-1} + (X^τ X)^{-1} x_i x_i^τ (X^τ X)^{-1} / (1 − x_i^τ (X^τ X)^{-1} x_i) ] (X^τ Y − x_i Y_i)
            = β̂ − (X^τ X)^{-1} x_i Y_i + (X^τ X)^{-1} x_i x_i^τ (X^τ X)^{-1} (X^τ Y − x_i Y_i) / (1 − h_i)
            = β̂ − (X^τ X)^{-1} x_i Y_i + (X^τ X)^{-1} x_i x_i^τ [ β̂ − (X^τ X)^{-1} x_i Y_i ] / (1 − h_i)
            = β̂ − (X^τ X)^{-1} x_i [ Y_i (1 − h_i) − x_i^τ β̂ + h_i Y_i ] / (1 − h_i)
            = β̂ − (X^τ X)^{-1} x_i ( Y_i − x_i^τ β̂ ) / (1 − h_i)
            = β̂ − z_i r_i / (1 − h_i)
            = β̂ − b_i.
Hence,
    β̂_Jack = n β̂ − (n − 1) β̂_{(·)} = n β̂ − (n − 1) [ β̂ − n^{-1} Σ_{i=1}^n z_i r_i/(1 − h_i) ] = β̂ + [(n − 1)/n] Σ_{i=1}^n b_i = β̂ + (n − 1) b̄,
    V̂_Jack(β̂) = [(n − 1)/n] Σ_{i=1}^n ( β̂_{(i)} − β̂_{(·)} )( β̂_{(i)} − β̂_{(·)} )^τ = [(n − 1)/n] Σ_{i=1}^n ( b_i − b̄ )( b_i − b̄ )^τ.
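A quick numerical check of the update β̂_{(i)} = β̂ − z_i r_i/(1 − h_i) against a direct refit without the i-th case; the design, coefficients and the checked index below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
r = Y - X @ beta_hat                                    # residuals r_i
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)             # leverages h_i = x_i' (X'X)^{-1} x_i
b = (XtX_inv @ X.T) * (r / (1 - h))                     # column i is b_i = z_i r_i / (1 - h_i)

i = 7
direct = np.linalg.lstsq(np.delete(X, i, axis=0), np.delete(Y, i), rcond=None)[0]
print(np.allclose(beta_hat - b[:, i], direct))          # True
```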
Remark 10.10
• Note that the “normal equations” from the LSE imply that
      Σ_{i=1}^n r_i = Σ_{i=1}^n (Y_i − x_i^τ β̂) = 0,
      Σ_{i=1}^n z_i r_i = (X^τ X)^{-1} Σ_{i=1}^n x_i [Y_i − x_i^τ β̂] = (X^τ X)^{-1} [ X^τ Y − X^τ X β̂ ] = β̂ − β̂ = 0.
  However, in general, b̄ = n^{-1} Σ_{i=1}^n z_i r_i/(1 − h_i) ≠ 0. Hence, θ̂_Jack ≠ θ̂, although E θ̂_Jack = E θ̂ for fixed design points.
• Both β̂ and β̂Jack are unbiased for β, but β̂Jack has an extra unattractive term (n−1)b̄
(although it is pretty harmless). The same thing will happen to the linear combination
of β. One way to get rid of the extra term is to use “weighted jackknife” method.
• The extra term appears because of the unbalanced design points xi . Note that the
bigger the absolute value xi is, the bigger its influence. To counterbalance this, one
can down-weigh the influence of the large values in absolute value terms.
Remark 10.11 In fact, for linear combinations θ = C^τ β, since the design points x_i are fixed, we do have
    E θ̂_Jack = E θ̂.
That is, θ̂_Jack is indeed an unbiased estimator of θ = C^τ β. Despite this, it is still not natural to have b̄ around. Some proposals have been made to “correct” this (maybe not necessary here) by using a weighted jackknife.
Miller (1974) showed that V̂_Jack(θ̂) is a consistent variance estimator of Var(θ̂) under fairly weak conditions. However, θ̂_Jack may not be consistent. Recall that C^τ β̂ is unbiased for C^τ β, but the jackknife estimator is biased. This is caused by the imbalance of the linear model, i.e., the EY_i = x_i^τ β are not all the same. (Please check the statements here.)
The jackknife is good, but it can fail miserably if the statistic θ̂ is not “smooth” (i.e., small changes in the data can cause big changes in the statistic). One example is the sample median (or quantile), whose influence curve IC(x; F, T) has a jump (i.e., is not smooth) at the median. Note that the sample median depends on only one or two values in the middle of the data.
10.6.1 An example
10, 27, 31, 38, 40, 50, 52, 104, 146, 150.
Then, we get
    T^{(−i)}_{n−1} = 50   for 1 ≤ i ≤ 5,
    T^{(−i)}_{n−1} = 40   for 6 ≤ i ≤ 10,
    T^{(·)}_{n−1} = 45,
    T^{(−i)}_{n−1} − T^{(·)}_{n−1} = 5 for 1 ≤ i ≤ 5, and −5 for 6 ≤ i ≤ 10.
Thus, V̂ar_Jack(T_n) tries to estimate the variance from just these two values, 40 and 50, resulting in
    V̂ar_Jack(T_n) = [(10 − 1)/4] (50 − 40)² = 225.
Note that the estimate depends solely on two points 40 and 50, which makes it highly
unstable. (It is impossible to estimate from one data value, and near impossible to estimate
from two data values. The variance estimate should be based on “many” data points.)
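The failure is easy to reproduce numerically with the ten observations above: the delete-1 jackknife value 225 depends only on the two middle observations, while a bootstrap estimate (shown for comparison; an illustrative choice) uses the whole sample.

```python
import numpy as np

x = np.array([10, 27, 31, 38, 40, 50, 52, 104, 146, 150], dtype=float)
n = len(x)

loo = np.array([np.median(np.delete(x, i)) for i in range(n)])
var_jack = (n - 1) / n * np.sum((loo - loo.mean()) ** 2)
print(loo)                       # only the two values 50 and 40 ever appear
print(var_jack)                  # 225.0

rng = np.random.default_rng(9)
boot = np.array([np.median(rng.choice(x, size=n, replace=True)) for _ in range(5000)])
print(boot.var())                # spreads over many resampled medians
```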
The jackknife described so far deletes one observation at a time. If we try to delete d = 2 observations at a time in the above median example, the total number of ways one can do this is C(10, 2) = 10 × 9/2 = 45. You can try to list the pseudo-values here. Clearly, we obtain more distinct values now. More generally, one can do the delete-d jackknife, to be discussed below.
Theorem 10.4 Let θ = F −1 (1/2) and Tn = Fn−1 (1/2) and n = 2m. Assume that f = F 0
exists and continuous at θ, and f (θ) > 0. Then
1
where vn = .
4nf 2 (θ)
(Hence, no matter how large n is, the ratio on the RHS does not go to 1.)
Proof.
1. The order statistics are X_{(1)}, ..., X_{(m)}, X_{(m+1)}, ..., X_{(n)}. Now
       T^{(−i)}_{n−1} = X_{(m+1)}   for 1 ≤ i ≤ m,
       T^{(−i)}_{n−1} = X_{(m)}     for m + 1 ≤ i ≤ n,
       T^{(·)}_{n−1} = [ m X_{(m+1)} + m X_{(m)} ] / n = [ X_{(m+1)} + X_{(m)} ] / 2,
       T^{(−i)}_{n−1} − T^{(·)}_{n−1} =  [ X_{(m+1)} − X_{(m)} ] / 2    for 1 ≤ i ≤ m,
                                      = −[ X_{(m+1)} − X_{(m)} ] / 2    for m + 1 ≤ i ≤ n.
   Hence,
       V̂ar_Jack(T_n) = [(n − 1)/n] Σ_{i=1}^n ( T^{(−i)}_{n−1} − T^{(·)}_{n−1} )² = [(n − 1)/4] ( X_{(m+1)} − X_{(m)} )².
2. The assumptions in the theorem imply that F has a local inverse at θ. WLOG, we may assume that F has an inverse everywhere. By Lemma 10.1, we have
       X_k =: F^{-1}(1 − e^{-Y_k}) := H(Y_k) ∼ F,   1 ≤ k ≤ n,
   where Y_1, ..., Y_n are i.i.d. Exp(1). Since H(t) = F^{-1}(1 − e^{-t}) is increasing in t, from Lemma 10.1 we have
       X_{(k)} = F^{-1}(1 − e^{-Y_{(k)}}) := H(Y_{(k)}),   1 ≤ k ≤ n,
   where Y_{(1)}, ..., Y_{(n)} are the order statistics of Y_1, ..., Y_n ∼_{i.i.d.} Exp(1).
   Note that H'(t) = e^{-t}/f(H(t)). By (6.3), we have Y_{(m+1)} − Y_{(m)} = Z_m/m = O_p(n^{-1}), and
       EY_{(m)} = Σ_{k=m}^n 1/k = Σ_{k=1}^n 1/k − Σ_{k=1}^{m−1} 1/k = (C + ln n + ε_n) − (C + ln(m − 1) + ε_{m−1})
                = ln[ n/(m − 1) ] + (ε_n − ε_{m−1}) = ln 2 + o(1),
       Var(Y_{(m)}) = Σ_{k=m}^n Var(Z_k)/k² = 1/m² + ... + 1/n² = O(n^{-1}),
       Y_{(m)} = EY_{(m)} + O_p(n^{-1/2}),
       H(EY_{(m)}) = F^{-1}(1 − e^{-EY_{(m)}}) = F^{-1}(1 − e^{-ln 2 + o(1)}) = F^{-1}(1/2 + o(1)) = θ + o(1),
       H'(EY_{(m)}) = e^{-EY_{(m)}} / f(H(EY_{(m)})) = e^{-ln 2 + o(1)} / f(θ + o(1)) = 1/(2f(θ)) + o(1),
       H'(Y_{(m)}) = H'(EY_{(m)}) + o(Y_{(m)} − EY_{(m)}).
Let θ̂ estimate θ. Let θ̂_{(s)} denote θ̂ applied to the data with the subset s removed. Then the delete-d jackknife variance estimator is
    V̂ar_Jack(T_n)[d] = [ (n − d) / ( d (n choose d) ) ] Σ_s ( θ̂_{(s)} − θ̂_{(s·)} )²,
where Σ_s is the sum over all subsets s of size d chosen without replacement from {X_1, ..., X_n}, and θ̂_{(s·)} = (n choose d)^{-1} Σ_s θ̂_{(s)}.
Theorem 10.5 Under the same conditions as Theorem 10.4, if √n/d → 0 and n − d → ∞, then
    V̂ar_Jack(T_n)[d] / v_n →_p 1.
Remark 10.12
1. The conditions require d to go to ∞, but not too fast. One such choice is d = n2/3 .
2. If n is large and √n < d < n, then (n choose d) is huge. In this case, one can use random sampling (or Monte Carlo simulation) to approximate V̂ar_Jack(T_n)[d]. If this is too much trouble, one might use the bootstrap in the first place.
3. The delete-d jackknife can be shown to be consistent for the median if √n/d → 0 and n − d → ∞. Roughly speaking, it is preferable to choose d with √n < d < n for delete-d jackknife estimation of the standard error.
4. d plays the role of smoothing parameter, similar to that of bandwidth in curve esti-
mation.
5. The delete-d jackknife can also be used to estimate d.f.’s for t-statistics, for exam-
ple. But its convergence rate is slower than the bootstrap for the independent data.
For dependent data though, the delete-d jackknife works beautifully while ”naive”
bootstrap or delete-1 jackknife all fails here.
Lemma 10.1 Let Y1 , ..., Yn be iid from Exp(1), and Y(1) , ..., Y(n) its order statistics.
3. Let U(1) , ..., U(n) be order statistics from U nif orm(0, 1). Let U(0) = 0, U(n+1) = 1.
Then,
Vi := n[U(i) − U(i−1) ] −→d Exp(1), 1 ≤ i ≤ n + 1.
Proof.
√
1. We prove (ii) only. fY (y) = e−y I{y > 0}. Let w = y 2 , so y = w, dy/dw =
√
1/(2 w). Hence,
√ √ 1 1/2
fW (w) = fY (x)|dy/dw| = e− w
/(2 w) = w−1/2 e−w , w > 0.
2
β
(Recall that W eibull(α, β) has density f (w) = αβwβ−1 e−αw , where α > 0, β > 0,
w > 0.)
2. The joint pdf of (Y(1) , ..., Y(n) ) is
fY(1) ,...,Y(n) (y1 , ..., yn ) = n!fY (y1 )...fY (yn ) = n!e−y1 ...e−yn , where y1 < ... < yn .
Let
Z1 = nY(1)
Z2 = (n − 1)[Y(2) − Y(1) ]
Z3 = (n − 2)[Y(3) − Y(2) ]
... ...... (6.4)
Zn = 1 × [Y(n) − Y(n−1) ],
dz
whose jacobian is = n!. Therefore,
dy
dy 1
fZ1 ,...,Zn (z1 , ..., zn ) = fY(1) ,...,Y(n) (y1 , ..., yn ) = n!e−y1 −...−yn = e−z1 −...−zn = e−z1 ...e−zn ,
dz n!
Z1
Y(1) =
n
Z1 Z2
Y(2) = +
n n−1
... ......
Z1 Z2 Zn
Y(n) = + + ...... + .
n n−1 1
n!
fU(k) ,U(k+1) (u, v) = uk−1 [1 − v]n−k−1 , 0 ≤ u < v ≤ 1.
(k − 1)!(n − k − 1)!
Let
X = U(k)
Y = n[U(k+1) − U(k) ].
So
U(k) = X
Y
U(k+1) = X + .
n
Therefore,
dU
f(X,Y ) (x, y) = f(U(k) ,U(k+1) ) (u, v)
dY
(n − 1)!
= uk−1 [1 − v]n−k−1
(k − 1)!(n − k − 1)!
(n − 1)!
= xk−1 [1 − y/n − x]n−k−1 , where 0 ≤ x ≤ 1 − y/n.
(k − 1)!(n − k − 1)!
(Note that 0 ≤ u < v ≤ 1 ⇐⇒ 0 ≤ x < x + y/n ≤ 1 ⇐⇒ 0 ≤ x ≤ 1 − y/n.) Thus,
(n − 1)! 1−y/n Z
fY (y) = xk−1 [1 − y/n − x]n−k−1 dx
(k − 1)!(n − k − 1)! 0
y n−1
Z 1
(n − 1)! y
= 1− tk−1 (1 − t)n−k−1 dt, where x = t 1 −
n (k − 1)!(n − k − 1)! 0 n
n−1
y
= 1−
n
−→ e−y , y > 0.
In the study of income shares, sometimes we need to estimate the poverty line (or low
income cutoff point). Consider the model
m
X
log Zi = γ1 + γ2 log Yi + βt xit + i , i = 1, ..., n,
t=1
where
Zi = expenditure on “necessities”,
Yi = total income,
Let γ0 be the overall proportion of income spent on “necessities”. Then the poverty line
θ can be defined to be the solution of
m
X
log[(γ0 + 0.2)θ] = γ1 + γ2 log θ + βt x0t
t=1
for a particular set x01 , ..., x0m . Our purpose is to estimate θ and construct a (1 − α)
confidence interval for θ.
On the other hand, γ1 , γ2 , βt ’s can be estimated by the LSE, denoted by γ̂1 , γ̂2 , β̂t ’s. As
a consequence, θ can be estimated (implicitly) from
m
X
log[(γ̂0 + 0.2)θ̂] = γ̂1 + γ̂2 log θ̂ + β̂t x0t ,
t=1
Note that
θ̂ = g γ̂0 , γ̂1 , γ̂2 , β̂1 , ..., β̂m := g(X̄),
where
Xi = (Zi , Yi , log Zi , log Yi , xi1 , ..., xim , all other cross product terms) .
Therefore, we can express θ̂ into a very smooth function of multivariate sample means
of independent (not necessarily identically distributed) r.v.’s. Hence, θ̂ is asymptotically
normal. We need to get a variance estimate.
vn = 5g(X̄)τ Vd
ar(X̄) 5 g(X̄).
Since g is not explicit, it is very tedious to calculate 5g(X̄) in this case. On the other
hand, the jackknife can provide an easy way to get a consistent variance estimator:
2. The jackknife variance estimator v̂Jack is consistent for many types of statistics
under mild conditions, such as sample means, function of sample means, U -, V -,
L-statistics, M -estimates, etc.
10.9 Exercises
estimator of σ 2 .
(a) Show that θb has bias equal to −σ 2 /n.
(b) Find θbJ , the jackknife estimator of σ 2 .
(b) Find vbJ (θ),
b the jackknife estimator of V ar(θ).
b
3. (Use S-Plus or otherwise to do this question.) The following data (Xi , Yi ) are English
scores of 6 people before and after attending an intensive study program.
(80, 90), (85, 78), (65, 75), (88, 92), (77, 86), (95,90).
9. Let X1 , ..., Xn ∼i.i.d. from F . Let θ = T (F ) and Tn = T (Fn ). Suppose that
√
n (Tn − θ) −→d N (0, σF2 ),
Check if this variance estimate is consistent in the case of the sample median, Tn =
Fn−1 (1/2).
Chapter 10
Bootstrap
The bootstrap was first introduced by Efron (1979). It is a powerful nonparametric tech-
nique, like the jackknife. In a sense, jackknife can be considered as a first-order approxi-
mation to the bootstrap.
Suppose
• An estimator of θ is
Tn ≡ Tn (X1 , . . . , Xn ) ≡ Tn (Fn ).
Let X1∗ , ..., Xn∗ ∼i.i.d. Fn . Then, the bootstrap approximations of the above quantities are
Example. Let θ = µ² and T_n = (X̄)². Find the bootstrap estimate of the bias. We have
    E_{F_n}[(X̄^*)²] = V_{F_n}(X_1^*)/n + (E_{F_n} X_1^*)²
                     = (1/n) [ n^{-1} Σ_{i=1}^n X_i² − X̄² ] + X̄²
                     = S_n²/n + X̄².
Hence, the bootstrap bias estimate is
    B̂ias_Boot[(X̄)²] = E_{F_n}[(X̄^*)²] − (X̄)² = S_n²/n.
Recall that the true bias is
    Bias[(X̄)²] = E_F[(X̄)²] − µ² = σ²/n + µ² − µ² = σ²/n.
n n
η̂Boot = EFn T (X1∗ , ..., Xn∗ , Fn ), where X1∗ , ..., Xn∗ ∼iid Fn .
That is, the generalized function η is approximated by the generalized statistical functional
η̂Boot .
Generally speaking, there is no closed form expression for the bootstrap estimate η̂Boot .
On the other hand, for each k, we note
1
P (Xk∗ = Xj ) = , j = 1, · · · , n.
n
Hence, one can find the exact value η̂_Boot by exhausting all possible bootstrap samples (n^n altogether) and calculating
    η̂_Boot = n^{-n} Σ_{k=1}^{n^n} T(X_1^*(k), ..., X_n^*(k), F_n).
If T is symmetric in X_1, . . . , X_n, then the number of distinct samples is (2n − 1 choose n) (why?). If n = 10, then n^n = 10,000,000,000 and (2n − 1 choose n) = (19 choose 10) = 92,378, still a huge number. Therefore, the exact calculation is not practical, and one resorts to Monte Carlo approximation:
1. Draw B independent random samples Xk∗ = {X1∗ (k), ..., Xn∗ (k)} from Fn , k = 1, ..., B.
(This can be done by drawing from Fn with replacement).
2. The Monte Carlo approximation to η̂_Boot is
       η̂_MC = B^{-1} Σ_{k=1}^B T(X_1^*(k), ..., X_n^*(k), F_n).
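The following is a minimal sketch of this Monte Carlo scheme for the bias and variance of a statistic (statistic, data and B are illustrative choices); for T_n = (X̄)² the bias estimate should be close to the exact bootstrap value S_n²/n computed above.

```python
import numpy as np

def bootstrap_bias_var(x, stat, B=2000, seed=0):
    """Monte Carlo approximation of the bootstrap bias and variance of stat(x)."""
    rng = np.random.default_rng(seed)
    n, tn = len(x), stat(x)
    reps = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    return reps.mean() - tn, reps.var()

rng = np.random.default_rng(10)
x = rng.normal(loc=2.0, size=100)
bias_mc, var_mc = bootstrap_bias_var(x, lambda s: s.mean() ** 2)
print(bias_mc, x.var() / len(x))        # Monte Carlo bias vs exact bootstrap bias S_n^2/n
```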
Closed expressions
Example. Let X_1, . . . , X_n be i.i.d. r.v.’s from d.f. F and let F_n be the e.d.f. Let
    ξ_p = F^{-1}(p) = inf{t : F(t) ≥ p},
    ξ̂_p = F_n^{-1}(p) = inf{t : nF_n(t) ≥ np} := X_{(r)},
where r = ⌈np⌉ is the smallest integer ≥ np (namely [np] if np is an integer, [np] + 1 otherwise). Find the bootstrap estimates of
(2) the variance σξ2 = V arF (ξˆp ), and the Mean Square Error M SEξ = EF (ξˆp − ξp )2
(3) the d.f.’s of standardized and studentized sample quantiles
√ ˆ ! √ ˆ !
n(ξp − ξp ) n(ξp − ξp )
H(t) = PF ≤t , K(t) = PF ≤t ,
σξ σ̂ξ
where σξ2 = σξ2 (F ) = V arF (ξˆp ) and σ̂ξ2 is an consistent estimate of σξ2 .
Note that X_{(r)} =_d F^{-1}(U_{(r)}), where U_{(r)} is the r-th order statistic from U(0, 1) with pdf given by
    dG_{(r)}(u) = r (n choose r) u^{r−1} (1 − u)^{n−r} du,   0 < u < 1.
Therefore,
    E_F(ξ̂_p − c)^m = E[X_{(r)} − c]^m = ∫ (x − c)^m dF_{(r)}(x)
                    = ∫_{−∞}^{∞} (x − c)^m r (n choose r) F^{r−1}(x) [1 − F(x)]^{n−r} dF(x)
                    = r (n choose r) ∫_0^1 [F^{-1}(u) − c]^m u^{r−1} (1 − u)^{n−r} du.
For the bootstrap,
    E_{F_n}(ξ̂_p^* − c)^m = Σ_{k=1}^n w_{nk} [X_{(k)} − c]^m,
where
    w_{nk} = r (n choose r) ∫_{(k−1)/n}^{k/n} u^{r−1} (1 − u)^{n−r} du,   k = 1, ..., n.
Similarly,
    Ĥ_Boot(t) = P_{F_n}( (ξ̂_p^* − ξ̂_p)/σ_ξ(F_n) ≤ t ) = Σ_{k=r}^n (n choose k) F_n^k(s) [1 − F_n(s)]^{n−k}.
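The weights w_{nk} above have a closed form: the integrand is the Beta(r, n − r + 1) density of U_{(r)}, so w_{nk} = P((k − 1)/n < U_{(r)} ≤ k/n), which scipy evaluates directly. A small sketch (data and p are illustrative) computing the exact bootstrap mean and variance of ξ̂_p^* from these weights:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(11)
x = np.sort(rng.normal(size=25))
n, p = len(x), 0.5
r = int(np.ceil(n * p))                                  # r = smallest integer >= np
grid = np.arange(n + 1) / n
w = beta.cdf(grid[1:], r, n - r + 1) - beta.cdf(grid[:-1], r, n - r + 1)

boot_mean = np.sum(w * x)                                # exact E_{Fn}[xi_p*]
boot_var = np.sum(w * x ** 2) - boot_mean ** 2           # exact Var_{Fn}(xi_p*)
print(boot_mean, boot_var)
```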
Comparisons with other methods
(a) the kernel-based estimate p(1 − p)/f̂²(ξ̂_p), where f̂(x) = (nh)^{-1} Σ_{i=1}^n K((x − X_i)/h);
(b) the bootstrap variance estimate σ̂²_{ξ,boot} = σ_ξ²(F_n) = Σ_{k=1}^n w_{nk} X_{(k)}² − ( Σ_{k=1}^n w_{nk} X_{(k)} )²;
(c) recall that the delete-1 jackknife variance estimate is inconsistent.
ξˆp∗ − ξˆp
!
K̂Boot (t) = PFn ≤t
σξ2 (Fn∗ )
where !2
n n
∗ 2
σξ2 (Fn∗ ) ∗
X X
= wnk X(k) − wnk X(k) .
k=1 k=1
Note that there is no close form expression for K̂Boot (t). Monte Carlo simulations
have be carried out here.
Remark 10.1 For more materials see Hall (1992), page 319.
Remark 10.2 Bootstrap approximations of the mean, bias, and variance are all weighted
averages of the ordered statistics. Note that ur−1 [1−u]n−r is maximized at u = (r−1)/(n−
1) ≈ p. Hence, all the bootstrap estimates place more weights around θ̂ = Fn−1 (p) = X(r) ,
and less weights if the data are far away from θ̂, thus producing a consistent estimator.
By comparison, the jackknife variance estimator, Vd arJack (θ̂) only uses two values near θ̂
to approximate the variance, which fails to be consistent.
In particular,
(a) ĤB (x) is (first-order) consistent if supx ĤB (x) − H(x) = o(1) or O(n−1/2 ).
(b) ĤB (x) is second-order accurate if supx ĤB (x) − H(x) = o(n−1/2 ) or O(n−1 ).
(c) ĤB (x) is third-order accurate if supx ĤB (x) − H(x) = o(n−1 ) or O(n−3/2 ).
10.4.1 First-order accurate or consistent
Under certain regularity conditions, the bootstrap provides first-order accurate approxi-
mation in a number of familiar situations:
2. U -statistics
4. symmetric statistics
Theorem 10.1 Assume that σ 2 ∈ (0, ∞). Denote σ̂ 2 = n−1 ni=1 (Xi − X̄)2 . Let Rn be
P
√ √
n(X̄ − µ) √ n(X̄ − µ)
(a). , (b). n(X̄ − µ), (c). .
σ σ̂
Then ĤB (x) is strongly consistent for H(x) in all three cases.
Lemma 10.1 (Marcinkeiwicz SLLN) If Y1 , ..., Yn are i.i.d. r.v.’s with E|Y1 |δ < ∞ for
δ ∈ (0, 1), then
n
1 X
|Yi | →a.s. 0.
n1/δ i=1
Lemma 10.2 (Berry-Esseen bounds) If X, X1 , ..., Xn are i.i.d. r.v.’s with E|X|3 < ∞,
and denote σ̂ 2 = n−1 ni=1 (Xi − X̄)2 . Then
P
√ !
n(X̄ − µ) A E|X|3
P ≤ t − Φ(t) ≤ √ .
σ or σ̂ n σ3
EFn |X1 |∗ 3
Pn 3 Pn
A A i=1 |X1 | A i=1 Yi
sup ĤB (x) − Φ(x) ≤ √ ≤ 3/2 = 3 .
2 σ̂ 3
x ∗
n (EFn |X1 | ) 3/2 n σ̂ n1/δ
where δ = 2/3 and Yi = |X1 |3 , and clearly E|Yi |δ = EXi2 < ∞. Note
that σ̂ 3 → σ 3
a.s. and also by Marcinkeiwicz SLLN, we get supx ĤB (x) − Φ(x) →a.s. 0.
134
(b) By the CLT,
sup |H(x) − Φ (x/σ)| → 0.
x
By the Berry-Esseen bounds, we have
EFn |X1 |∗ 3
Pn 3
A A i=1 |X1 |
sup ĤB (x) − Φ (x/σ̂) ≤ √ ≤ 3/2 →a.s. 0,
∗
n (EFn |X1 | )2 3/2 σ̂ 3
x n
(c) The proof is very similar to that of (a), and hence omitted.
Theorem 10.2 Let X, X1 , . . . , Xn be i.i.d. r.v.’s with µ = EX and EkX1 k2 < ∞. Let
θ = g(µ), θ̂ = g(X̄). Assume that
Then,
Vd
arBoot (θ̂) arFn (θ∗ )
Vd
:= −1 −→a.s. 1.
vn n 5 g(µ)τ V ar(X) 5 g(µ)
Lemma 10.3
(a) If supn E|Xn |1+δ < ∞ for some δ > 0, then {X1 , . . . , Xn } is u.i.
Lemma 10.4 (Bernstein inequality) Let Y, Y1 , ..., Yn be i.i.d. r.v.’s. such that P (|Y −
EY | < C) = 1 for some C < ∞. Then for t > 0,
( )
−nt2 /2
P (|Ȳ − EY | > t) ≤ 2 exp
V ar(Y ) + Ct/3
135
Lemma 10.5 For any r.v. Z ≥ 0, we have
∞
X ∞
X
P (Z ≥ i) ≤ EZ ≤ P (Z ≥ i) + 1.
i=1 i=1
Lemma 10.6 If E|X|α < ∞, and X(1) ≤ ... ≤ X(n) are ordered statistics. Then,
|X(1) | + |X(n) |
−→a.s. 0.
n1/α
Proof. Exercise.
(1) There are some close relationships between confidence (upper or lower) limits
and coverage probability. Suppose that θ̂L (α) is an approximate (1 − α) level C.L. and
θ̂L (α) − θ̂Exact (α) = Op (n−(k+1)/2 ).
Then,
P θ ∈ [IˆL (α), ∞) = P θ ≥ IˆL (α)
= P θ ≥ θ̂Exact (α) + Op (n−(k+1)/2 )
= P θ̂Exact (α) ≤ θ + Op (n−(k+1)/2 )
!
v̂ 1/2
= P θ̂ − √ K −1 (1 − α) ≤ θ + Op (n−(k+1)/2 )
n
√ !
n(θ̂ − θ) −1 −k/2
= P ≤ K (1 − α) + Op (n )
v̂ 1/2
√ !
n(θ̂ − θ)
= P ≤ K −1 (1 − α) + O(n−k/2 )
v̂ 1/2
[assuming that Op (n−k/2 ) makes contribution O(n−k/2 )]
= (1 − α) + O(n−k/2 ).
136
For instance, if we need to show that θ̂L (α) to be first or second order accurate, then if
suffices to show that
Outline of proof:
X
→ 0, σ 2 = 5g 0 (µ) Σ 5 g (µ)
sup P (Rn ≤ X) − Φ
X σ
3
AE X1∗ − X̄
X
sup P ∗ (Rn∗ ≤ X) − Φ , σ̂ 2 = 5g 0 X̄ Σ̂ 5 g X̄ .
≤ √
X σ̂ nσ
We have two approximations to :
X
1. Normal approximation: Φ σ ;
X1 , · · · , Xn ∼ F, Xi ∈ Rd
Notations:
θ = θ (F ) = T (F ) : parameter
θ̂ = θ (Fn ) = T (Fn ) : estimate
√
v is the asymptotic variance of n (θ)
v̂ : estimate of v
1–sided upper (1 − α) confidence interval
Iˆ1,U = −∞, θ̂U (α) if P θ ∈ Iˆ1,U = 1 − α. (6.2)
137
1–sided lower (1 − α) confidence interval
Iˆ1,L = θ̂L (α) , ∞ if P θ ∈ Iˆ1,L = 1 − α. (6.3)
Example 10.1 X1 , · · · , Xn ∼ N µ, σ 2 .
1–sided upper
σ̂
−∞, X̄ + tn−1 (α) √ (6.5)
n
1–sided lower
σ̂
X̄ − tn−1 (α) √ , ∞ (6.6)
n
2–sided
σ̂
X̄ ± tn−1 (α/2) √ (6.7)
n
Define
√
n θ̂ − θ
K (t) = P ≤ t
v̂ 1/2
√
n θ̂∗ − θ
K̂B (t) = P ∗ ≤ t .
v̂ ∗1/2
Normal approximation !
−1 v̂ 1/2
−∞, θ̂ + Φ (1 − α) √ . (6.9)
n
2. Percentile method
ĜP er (X) = PF∗n θ̂∗ ≤ X (6.11)
138
3. Hybrid method
ĜHybrid (X) = PF∗n θ̂∗ − θ̂ ≤ X (6.12)
GH (X) = P θ̂ − θ ≤ X = P θ ≥ θ̂ − X = 1 − α (6.13)
where Ẑ0 = Φ−1 ĜP er θ̂
Z b ∞
X M
X Z bi
cos (tx) = ci IAi g (x)dx = ci cos (tx) dx
a i=1 i=1 ai
sin (tx) bi
= → 0 as t → ∞,
t ai
for g continuous.
Let
√
n θ̂ − θ
H (X) = P ≤ t Standardized
V 1/2
√
n θ̂ − θ
K (X) = P ≤ t . Studentized
V̂ 1/2
139
Bootstrap approximation
√
n θ̂∗ − θ̂
Ĥ (X) = P ∗ ≤ t
V̂ 1/2
√
n θ̂∗ − θ̂
K (X) = P ∗ ≤ t .
V̂ ∗1/2
V̂ 1/2
θ̂L,Exact (α) = θ̂ − √ κ−1 (1 − α)
n
V̂ 1/2
θ̂U,Exact (α) = θ̂ − √ κ−1 (α) .
n
√
n θ̂ − θ
!
V̂1/2
P θ ≥ θ̂L = P θ ≥ θ̂ − √ κ−1 (1 − α) =P ≤ κ−1 (1 − α) (6.16)
n V̂ 1/2
if we know κ.
If
θ̂L (α) = θ̂L,Exact + Op n−(k+1)/2 (stochastic expansion)
= θ̂L,Exact + δ̂n .
Lemma 10.8
sup |P (Xn + Yn ≤ t) − Φ (t)| ≤ sup |P (Xn ≤ t) − Φ (t)| + O (an ) + P (|Yn | > an ) , an > 0.
t t
(6.17)
140
Proof.
an = cn−k/2 ,
P |∆n | > cn−k/2 = O n−k/2 . (Use Chebyshev’s bound, etc) (6.18)
To show C.L.
Cornish–Fisher expansions
1 1
α = κH (X) = Φ (X) + √ (q1 )P1 (X) ϕ (X) + (q2 )P2 (X) ϕ (X) + · · · (6.19)
n n
1
X ≡ H −1 (α) = Zα + √ P11 (Zα ) + · · · Zα = Φ−1 (α) (6.20)
n
1
Xα = Zα + Wn , α = H (Zα + Wn ) = Φ (Zα + Wn ) + √ P1 (Zα + Wn ) ϕ (Zα + Wn ) + · · ·
n
= Zα + √1n P1 (Zα ) + Wn1
= α + Wn ϕ (Zα ) + · · · + √1n P1 (Zα ) ϕ (Zα ) + · · · Iteration
⇒ Wn = − √1n P1 (Zα ) + · · ·
1. Percentile−t method
V̂ 1/2
θ̂L,P er−t (α) = θ̂ − √ κ̂ (1 − α) − κ−1 (1 − α) + κ−1 (1 − α)
n
V̂ 1/2 h i
= θ̂L,Exact − √ κ̂ (1 − α) − κ−1 (1 − α) .
n
Now
1
κ−1 (1 − α) = Z1−α + √ q11 (Z1−α ) + · · ·
n
1
κ̂−1 (1 − α) = Z1−α + √ q̂11 (Z1−α ) + · · ·
n
1
⇒ κ̂−1 (1 − α) − κ−1 (1 − α) = O √ .
n
141
2. Normal approximation
V̂ 1/2 −1
θ̂L,N or (α) = θ̂ − √ Φ (1 − α) − κ−1 (1 − α) + κ−1 (1 − α)
n
V̂ 1/2 h −1 i
= θ̂L,Exact − √ Φ (1 − α) − κ−1 (1 − α)
n
Now
1
κ−1 (1 − α) = Φ−1 (1 − α) + √ q11 (Z1−α ) + · · ·
n
1
κ̂−1 (1 − α) = Φ−1 (1 − α) + √ q̂11 (Z1−α ) + · · ·
n
1
⇒ Φ−1 (1 − α) − κ−1 (1 − α) = O √ .
n
3. Percentile method
θ̂L,P er (α) = Ĝ−1 (α) . (6.21)
G (X) = P θ̂ ≤ X
√
√
n θ̂ − θ n (X − θ)
= P ≤
V 1/2 V 1/2
√ !
n (X − θ)
= H
V 1/2
= α.
√
n G−1 (α) − θ
= H −1 (α) . (6.22)
V 1/2
V 1/2
G−1 (α) = θ + √ H −1 (α)
n
V̂ 1/2 −1
Ĝ−1 (α) = θ̂ + √ Ĥ (α) + κ−1 (1 − α) − κ−1 (1 − α)
n
V̂ 1/2
= θ̂L,Exact + √ Typically not cancelledH −1 (α) + κ−1 (1 − α) .
n | {z }
c1 c2
H −1 (α) = Zα + √ , κ−1 (1 − α) = Zα + √ . (6.23)
n n
Percentile method: 1st order accurate.
4. Hybrid method
θ̂L,Hyb (α) = θ̂ − Ĝ−1 (1 − α) . : α → 1 − α (6.24)
√ −1
n G (α)
G (X) = P θ̂ − θ ≤ X ⇒ = H −1 (α) . (6.25)
V 1/2
V̂ 1/2 c1
θ̂L,Hyb (α) = θ̂L − √ Ĥ −1 (1 − α) , Ĥ −1 (1 − α) = Z1−α + √ . (6.26)
n n
Hybrid method: 1st order accurate.
5. Bias correction (only special case: e.g. transformation location model)
Ĝ−1
P er Φ Zα + 2Ẑ0 = Φ−1 Ĝ θ̂ . (6.27)
1
Ĝ θ̂ = H (0) = Φ (0) + √ P1 (0) ϕ (0) + · · ·
n
1 1
= + √ Wn P1 (0) ϕ (0) + · · ·
2 n | {z }
1 1 Wn
Φ−1 Ĝ θ̂ = Φ−1 + Wn = Φ−1 +
2 2 φ (0)
1
√ P1 (0) + · · ·
=
n
2P1 (0) 2P1 (0)
Φ Zα + √ + ··· = α + ϕ (Zα ) √ + ···
n n
V̂ 1/2
Ĝ−1 (α) = θ̂ + √ H −1 (α)
n
V̂ 1/2 2P1 (0)
θ̂L,BC (α) = θ̂ + √ H −1 α + φ (Zα ) √ −1
+κ (1 − α) − κ −1
(1 − α)
n n
V̂ 1/2 2P1 (0)
= θ̂L,Exact + √ H −1 α + φ (Zα ) √ +κ −1
(1 − α)
n n
2P1 (0)
H −1 α + φ (Zα ) √ = Zα + · · · (CF expansion, depends on γ − skewness)
n
c1
κ−1 (1 − α) = Z1−α + √ (Taylor expansion)
n
Bias correction method: 1st order accurate. [Calibration method (choose α̂ → α)]
6. ABC " #!
−1 Φ−1 (α) + Ẑ0
Ĝ Φ (6.28)
1 − a ()
Problem: Ẑ0 unknown. a−depends on 3rd moment.
Edgeworth expansion
1
1 − α = K (t) = Φ (t) + √ P̂1 (X) φ (X) (6.29)
n
Empirical Edgeworth expansion (2nd order accurate). May not be better than bootstrap
since tail may not be very accurate.
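The following is a minimal sketch of three of the constructions above (normal approximation, percentile, and hybrid), applied to the waiting-time data of Exercise 4 below; percentile-t, bias-corrected and ABC intervals are omitted, and all tuning choices are illustrative.

```python
import numpy as np

def boot_cis(x, stat, alpha=0.10, B=4000, seed=0):
    """Two-sided (1 - alpha) bootstrap confidence intervals: normal, percentile, hybrid."""
    rng = np.random.default_rng(seed)
    n, tn = len(x), stat(x)
    reps = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    z = 1.6449                                        # Phi^{-1}(1 - alpha/2) for alpha = 0.10
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    return {"normal":     (tn - z * reps.std(), tn + z * reps.std()),
            "percentile": (lo, hi),
            "hybrid":     (2 * tn - hi, 2 * tn - lo)}

x = np.array([12.1, 9.8, 10.5, 5.6, 8.2, 0.5, 6.8, 10.1, 17.2, 4.2])
print(boot_cis(x, np.mean))
```

All three of these intervals are first-order accurate, consistent with the discussion above.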
10.7 Exercises
1. Consider an artificial data set consisting of the 8 numbers
1, 2, 3.5, 4, 7, 7.3, 8.6, 12.4, 13.8, 18.1.
n−[nα]
1 X
Define the α trimmed mean by θ̂ = Xn,i .
n − 2[nα] i=[nα]+1
(a) Take α = 25%, calculate the bootstrap variance estimate v̂B (θ̂) by
Monte-Carlo simulations for B = 20, 100, 200, 500, 1000, 2000. Do they be-
come more stable as B gets larger?
(b) Find the jackknife variance estimate v̂J (θ̂).
3. Show that for function of means, one-sided (lower) confidence interval by the hybrid
method is only first-order accurate.
4. The following are the numbers of minutes that a doctor kept 10 patients waiting
beyond their appointment times,
12.1, 9.8, 10.5, 5.6, 8.2, 0.5, 6.8, 10.1, 17.2, 4.2.
find a two-sided 90% confidence interval for the mean waiting time by the normal
approximation, percentile-t, percentile, ABC method.
5. The following data are the number of hours 10 students studied for a certain achieve-
ment test and their scores on the test.
Number of hours (x): 8, 6, 10, 9, 14, 6, 18, 10, 15, 24,
Score on the test (Y): 80, 62, 91, 77, 89, 65, 96, 85, 94, 91.
6. “Are statistical tables obsolete?” In this example, we shall use Monte Carlo simula-
tions to obtain quantiles from any given distribution. We shall illustrate this with
the standard normal distribution Φ(x). Suppose we wish to find the 90% and 95%
quantile, i.e. q1 = Φ−1 (0.90) and q2 = Φ−1 (0.95).
144
(3) Rank the random numbers in increasing order, say, XB,1 , · · · , XB,B .
So the Monte Carlo approximations for q1 and q2 are
(4) Compare and comment on those Monte Carlo approximations and the
exact values.
(5) Let X ∼ N (0, 1). Suppose that we are interested in calculate P (X ≥
2), µ = EX and σ 2 = var(X). How do we use the Monte Carlo simulations
to approximate these? Compare your approximations with the exact ones.
145
Chapter 11
11.1 Introduction
It is impossible to find estimators which minimize the risk R(θ, δ) uniformly at every value
of θ (see the example below) unless we restrict ourselves to a smaller class. We shall now
drop such restrictions, admitting all estimators into competition, but shall then have to
be satisfied with a weaker optimality property than uniformly minimum risk. We shall
look for estimators that make the risk function R(θ, δ) small in some overall sense. Two
such optimality properties will be considered:
We consider the first approach now while the second will be studied in the next chapter.
An illustrative example
Suppose we are interested in the average income µ of an area A. If we have no data for
the area, a natural guess would be µ0 , the average income from the last national census;
whereas if we have a random sample X1, X2, ..., Xn from the area A, we may want to
combine µ0 and X̄ = n^{−1} Σ_{i=1}^n X_i into an estimator, for instance,
µ̂ = (0.2)µ0 + (0.8)X̄.
For comparison, we calculate their MSE(µ̂) = E(µ̂ − µ)² = Var(µ̂) + bias²(µ̂). Hence,
3. If we use as our criterion the maximum (over µ) of the MSE (called the minimax
criterion), then X̄ is optimal (shown later).
11.2 Bayes estimator
Assume that
X|θ ∼ f (x|θ),
θ ∼ π.
• Posterior risk:
r(δ|x) ≡ E[ l(θ, δ(x)) | X = x ] = ∫ l(θ, δ(x)) f(θ|x) dθ.
Definition 11.1 An estimator δ∗ is called a Bayes estimator w.r.t. π if it minimizes the Bayes (average) risk, i.e., r(π, δ∗) = inf_δ r(π, δ).
Bayes estimators play a central role in Wald’s decision theory. One of its main results
is that in any given statistical problem, attention can be restricted to Bayes solutions and
suitable limits of Bayes solutions; given any other procedure δ, there exists a procedure δ 0
in this class such that R(θ, δ 0 ) ≤ R(θ, δ) for all θ. (In view of this result, it is not surprising
that Bayes estimators provide a tool for solving minimax problems, as will be seen in the
next chapter.)
Finding a Bayes estimator is, in principle, quite simple.
In practice, however, the determination of Bayes estimates may be daunting, and the use of conjugate
priors can facilitate the calculations and reduce the computational complexity of δ∗(x). Conjugate
priors are parametric families of priors such that the posterior distributions also belong to the same
family.
Examples
Example 11.1 (Normal mean with known variance)
Suppose Xi | µ are i.i.d. N(µ, σ²) with σ² known, i = 1, ..., n, and µ ∼ N(M, A) (conjugate prior);
write σ₀² = σ²/n = var(X̄). Then
µ | x ∼ N( E[µ|x], var[µ|x] ),
δ∗ = E[µ|x] = ( A/(A + σ₀²) ) x̄ + ( σ₀²/(A + σ₀²) ) M.
Remark 11.1
1. δ ∗ = wx̄ + (1 − w)M , a weighted average of the standard estimator X̄ and the mean
M of the prior distribution.
Remark 11.2 Inference can be based on the sufficient statistic z = x̄ only. That is,
δ∗ = E[µ|z] = ( A/(A + σ₀²) ) z + ( σ₀²/(A + σ₀²) ) M.
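A small numerical sketch of this posterior-mean calculation (the data, the hyperparameters and the function name are illustrative assumptions; σ₀² is taken to be σ²/n as above):

import numpy as np

def normal_mean_posterior(x, sigma2, M, A):
    """Posterior mean and variance of mu when X_i | mu ~ N(mu, sigma2)
    and mu ~ N(M, A); sigma0^2 = sigma2/n is the variance of x_bar."""
    n = len(x)
    sigma0_sq = sigma2 / n
    w = A / (A + sigma0_sq)                       # weight on the data
    post_mean = w * np.mean(x) + (1 - w) * M      # Bayes estimator under L2 loss
    post_var = A * sigma0_sq / (A + sigma0_sq)
    return post_mean, post_var

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=1.0, size=25)
print(normal_mean_posterior(x, sigma2=1.0, M=0.0, A=4.0))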
Example 11.2 (Normal variance with known mean)
Suppose
Xi |σ 2 ∼ N (0, σ 2 ),
1
τ ≡ 2 ∼ Gamma(α, β) (conjugate prior)
σ
where π(τ) = (β^α/Γ(α)) τ^{α−1} e^{−βτ}, with
Eτ = α/β,   Eτ² = α(α + 1)/β²,
Eτ^{−1} = Eσ² = β/(α − 1),   Eτ^{−2} = Eσ⁴ = β²/[(α − 1)(α − 2)].
So the posterior distribution is
Example 11.3 (Normal variance with unknown mean)
Suppose
• τ ≡ 1/σ² ∼ Gamma(α, β), i.e., π(τ) ∝ τ^{α−1} e^{−βτ} (conjugate prior)
• µ ⊥ τ
Now
f(x|µ, τ) = ( 1/(2πσ²) )^{n/2} exp{ −Σ_{i=1}^n (x_i − µ)²/(2σ²) } ∝ τ^{n/2} exp{ −(τ/2) Σ_{i=1}^n (x_i − µ)² }.
Under L2-loss, the Bayes estimator of σ² = 1/τ is
δ∗ = E(τ^{−1}|x) = [ (1/2) Σ_{i=1}^n (x_i − x̄)² + β ] / ( n/2 + α − 3/2 )
   = [ Σ_{i=1}^n (x_i − x̄)² + 2β ] / ( n + 2α − 3 )
   = ( (n − 1)/(n + 2α − 3) ) · Σ_{i=1}^n (x_i − x̄)²/(n − 1) + ( (2α − 2)/(n + 2α − 3) ) · β/(α − 1)
   =: w S² + (1 − w) Eσ²,
which is a weighted average of the sample variance S² = Σ_{i=1}^n (x_i − x̄)²/(n − 1) and the prior
variance Eσ² = β/(α − 1).
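A quick numerical sketch of this weighted-average form (the data and hyperparameters are illustrative assumptions):

import numpy as np

def posterior_mean_sigma2(x, alpha, beta):
    """Bayes estimate of sigma^2 under L2 loss with a Gamma(alpha, beta)
    prior on tau = 1/sigma^2, in the weighted-average form derived above."""
    n = len(x)
    ss = np.sum((x - np.mean(x)) ** 2)
    estimate = (ss + 2 * beta) / (n + 2 * alpha - 3)
    # equivalent form: w * S^2 + (1 - w) * E(sigma^2)
    w = (n - 1) / (n + 2 * alpha - 3)
    assert np.isclose(estimate, w * ss / (n - 1) + (1 - w) * beta / (alpha - 1))
    return estimate

rng = np.random.default_rng(3)
x = rng.normal(0.0, 2.0, size=30)
print(posterior_mean_sigma2(x, alpha=3.0, beta=8.0))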
Example 11.4 (Binomial mean)
Suppose X | p ∼ Bin(n, p) and p ∼ Beta(a, b) (conjugate prior), so that
E(p) = a/(a + b)   and   var(p) = ab/[ (a + b)²(a + b + 1) ].
The Bayes estimator of p under L2-loss is the posterior mean
E[p|x] = (a + x)/(a + b + n) = ( n/(a + b + n) ) · (x/n) + ( (a + b)/(a + b + n) ) · a/(a + b),
which is a weighted average of the standard estimate X/n and the prior mean a/(a + b).
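A one-line sketch of this posterior mean (the numerical values are illustrative):

def binomial_bayes_estimate(x, n, a, b):
    """Posterior mean of p for X ~ Bin(n, p), p ~ Beta(a, b):
    a weighted average of x/n and the prior mean a/(a + b)."""
    w = n / (a + b + n)
    return w * (x / n) + (1 - w) * a / (a + b)   # equals (a + x) / (a + b + n)

print(binomial_bayes_estimate(x=7, n=10, a=2, b=2))   # 9/14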
Example 11.5 (Poisson)
Suppose
Xi | θ ∼ Poisson(θ),
θ ∼ Gamma(α, β), (conjugate prior)
Note that EX = var(X) = θ. Although L0 (θ, δ) = (θ − δ)2 is often preferred for the
estimation of a mean, some type of scaled squared error loss, e.g., Lk (θ, δ) = (θ − δ)2 /θk ,
may be more appropriate for the estimation of a variance. The Bayes estimator under
Lk -loss is (by taking w(θ) = θ−k in Corollary 11.1)
δ∗_k(x) = E(θ^{1−k}|x) / E(θ^{−k}|x) = (x̄ + α − k) · β/(β + 1)
        = ( β/(β + 1) ) x̄ + ( 1/(β + 1) ) (α − k)β,   for α − k > 0,
which is a weighted average of the sample mean x̄ and (α − k)β (the prior mean is αβ).
Note that the choice of loss function can have a large effect on the resulting Bayes
estimator.
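A short sketch of how the loss exponent k changes the estimate (the numerical values are illustrative assumptions):

def poisson_bayes_Lk(xbar, alpha, beta, k):
    """Bayes estimator of a Poisson mean under L_k(theta, d) = (theta - d)^2 / theta^k,
    with a Gamma(alpha, beta) prior (prior mean alpha*beta), using the closed form above."""
    w = beta / (beta + 1)
    return w * xbar + (1 - w) * (alpha - k) * beta

print(poisson_bayes_Lk(xbar=3.0, alpha=3.0, beta=1.5, k=0))   # ordinary squared error
print(poisson_bayes_Lk(xbar=3.0, alpha=3.0, beta=1.5, k=2))   # scaled loss (needs alpha - k > 0)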
Example 11.6 (Regression)
Y = Xβ + ε,   ε ∼ N(0, V),   with prior β ∼ N(βP, P).
β̂Bayes = (X T V −1 X + P −1 )−1 (X T V −1 Y + P −1 βP ).
Remarks:
1. The MLE of β is
β̂ = arg min_β εᵀV^{−1}ε = arg max_β exp{ −εᵀV^{−1}ε/2 },   ε = Y − Xβ,
i.e., β̂ = (XᵀV^{−1}X)^{−1}XᵀV^{−1}Y = AY with A = (XᵀV^{−1}X)^{−1}XᵀV^{−1}. Hence
E β̂ = (XᵀV^{−1}X)^{−1}XᵀV^{−1}Xβ = β,
Var(β̂) = A Var(Y) Aᵀ = (XᵀV^{−1}X)^{−1}XᵀV^{−1} V V^{−1}X(XᵀV^{−1}X)^{−1} = (XᵀV^{−1}X)^{−1}.
2. The posterior of β is obtained from the joint density
f(ε, β) = f(ε|β)π(β)
        = (2π)^{−n/2}|V|^{−1/2} exp{ −(1/2)(Y − Xβ)ᵀV^{−1}(Y − Xβ) }
          × (2π)^{−p/2}|P|^{−1/2} exp{ −(1/2)(β − βP)ᵀP^{−1}(β − βP) }
        ∝ exp{ −(1/2)(Y − Xβ)ᵀV^{−1}(Y − Xβ) − (1/2)(β − βP)ᵀP^{−1}(β − βP) }
        := exp{ −h(β) }.
The maximum a posteriori (MAP) estimate solves
0 = h′(β)
  = −XᵀV^{−1}(Y − Xβ) + P^{−1}(β − βP)
  = −XᵀV^{−1}Y + XᵀV^{−1}Xβ + P^{−1}β − P^{−1}βP
  = (XᵀV^{−1}X + P^{−1})β − (XᵀV^{−1}Y + P^{−1}βP),
resulting in
β̂_Bayes = (XᵀV^{−1}X + P^{−1})^{−1}(XᵀV^{−1}Y + P^{−1}βP).
It is easy to see that, as with other Bayes estimators, β̂_Bayes is a linear combination of the prior
estimator and the frequentist estimator. It is biased, but with reduced variance.
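A minimal numerical sketch of this posterior-mode formula (the synthetic data, the prior mean βP, prior covariance P and error covariance V are illustrative assumptions):

import numpy as np

def bayes_regression(X, Y, V, P, beta_P):
    """Posterior mode/mean of beta in Y = X beta + eps, eps ~ N(0, V), prior beta ~ N(beta_P, P):
    (X'V^-1 X + P^-1)^-1 (X'V^-1 Y + P^-1 beta_P)."""
    Vinv = np.linalg.inv(V)
    Pinv = np.linalg.inv(P)
    lhs = X.T @ Vinv @ X + Pinv
    rhs = X.T @ Vinv @ Y + Pinv @ beta_P
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(4)
n, p = 50, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
V = 0.25 * np.eye(n)          # known error covariance (assumed for illustration)
P = 10.0 * np.eye(p)          # prior covariance
beta_P = np.zeros(p)          # prior mean
print(bayes_regression(X, Y, V, P, beta_P))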
11.3 Properties of Bayes Estimators
In many cases, a Bayes estimator is a weighted average of frequentist estimate θ̂f req and
prior estimate θprior :
δ ∗ = wθ̂f req + (1 − w)θprior ,
where the weight w strikes a balance between the evidence from the data and the prior belief.
Bayes estimators can also be regarded as shrinkage estimators. Without a prior, the
usual estimator is θ̂freq; once we place a prior, the new estimator shrinks toward θprior.
More discussion will be given on the James-Stein estimator.
• However, under weak conditions, Bayes estimators are unique, e.g., if the loss func-
tion l(θ, δ) is squared error loss, or more generally, if it is strictly convex in δ.
The proof follows easily from that of Corollary 11.1, and is hence omitted here.
Theorem 11.2 No unbiased estimator δ(X) can be a Bayes solution under L2-loss unless its Bayes risk is zero, i.e., unless δ(X) = g(θ) a.s.
Proof. Suppose δ(X) is a Bayes estimator and is unbiased for estimating g(θ). Then, we
have: (i) δ(X) = E[g(θ)|X], a.s. and (ii) E[δ(X)|θ] = g(θ) for all θ. Hence,
E[δ(X)g(θ)] = E[ δ(X) E(g(θ)|X) ] = E[δ(X)²]   and   E[δ(X)g(θ)] = E[ g(θ) E(δ(X)|θ) ] = E[g(θ)²],
so E[δ(X) − g(θ)]² = E[δ(X)²] + E[g(θ)²] − 2E[δ(X)g(θ)] = 0, i.e., the Bayes risk is zero.
Example 11.7 Sample means are NOT Bayes estimators
Let Xi ’s be i.i.d. (i = 1, ..., n) with E(Xi) = θ and var(Xi) = σ² (independent of θ); then X̄ is unbiased for θ, and so by Theorem 11.2 it cannot be a Bayes estimator under L2-loss (unless its Bayes risk is zero).
Improper prior
Since the Fisher information I(θ) = constant, this is actually the Jeffreys prior. It
is easy to check that the posterior distribution calculated from this improper prior is a
proper distribution as soon as an observation has been taken. This is not surprising; since
X is normally distributed about θ with variance 1, even a single observation provides a
good idea of the position of θ.
Let X = (X1 , ..., Xn ) ∼ f (x|θ) with sufficient statistic T . Its likelihood function is
L(θ|x) = f (x|θ). Given T = t, we have
π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ)dθ = g(t|θ)π(θ) / ∫ g(t|θ)π(θ)dθ = π(θ|t).
That is, π(θ|x) depends on x only through t, and the posterior distribution of θ is the
same whether we compute it on the basis of x or of t. Since Bayesian measures are
computed from posterior distributions, we need only focus on functions of a minimal
sufficient statistic.
11.4 Empirical Bayes
Assume
Xi |θ ∼ f(x|θ), i = 1, ..., n,
θ|γ ∼ π(θ|γ),
Procedure
Examples
Example 11.8 (Continuation of Example 11.1, Normal mean with known variance)
Proof.
f(z|A) = ∫ ( 1/(2πσ₀²)^{1/2} ) exp{ −(z − µ)²/(2σ₀²) } · ( 1/(2πA)^{1/2} ) exp{ −µ²/(2A) } dµ
       = ( 1/(4π²σ₀²A)^{1/2} ) ∫ exp{ −(z − µ)²/(2σ₀²) } exp{ −µ²/(2A) } dµ
       = ( 1/(4π²σ₀²A)^{1/2} ) ∫ exp{ −(1/2)[ µ²(1/A + 1/σ₀²) − 2µ z/σ₀² + z²/σ₀² ] } dµ
       = ( 1/(4π²σ₀²A)^{1/2} ) exp{ −z²/(2σ₀²) } ∫ exp{ −(1/2)(1/A + 1/σ₀²)[ µ² − 2µ·Az/(σ₀² + A) ] } dµ
       = ( 1/(4π²σ₀²A)^{1/2} ) exp{ −z²/(2σ₀²) } exp{ (1/2)·((σ₀² + A)/(Aσ₀²))·( Az/(σ₀² + A) )² }
         × ∫ exp{ −(1/2)·((σ₀² + A)/(Aσ₀²))·( µ − Az/(σ₀² + A) )² } dµ
       = ( 1/(4π²σ₀²A)^{1/2} ) exp{ −z²/(2σ₀²) } exp{ (z²/(2σ₀²))·A/(σ₀² + A) } × ( 2πAσ₀²/(σ₀² + A) )^{1/2}
       = ( 1/(2π(σ₀² + A))^{1/2} ) exp{ −z²/(2(σ₀² + A)) }.
Method A: Empirical Bayes-MLE (EB-MLE)
η̂ = max{σ₀², z²}.   (4.4)
Proof of (4.4). Let η = σ₀² + A, so η ∈ [σ₀², ∞). We have L(η) ∝ η^{−1/2} exp{ −z²/(2η) }, so
l′(η) ∝ −1/η + z²/η² = (z² − η)/η².
1. If z² ∈ [σ₀², ∞), the MLE is η̂ = z² [the stationary point of l(η)].
2. If z² ∉ [σ₀², ∞), i.e., x̄² < σ₀², the MLE is η̂ = σ₀² [l(η) decreases on [σ₀², ∞) since l′(η) ≤ 0].
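A short sketch of the resulting EB estimate of µ (this assumes, as in the derivation above, that z = x̄, σ₀² = σ²/n and the prior mean is 0; the numbers are illustrative):

def eb_mle_normal_mean(xbar, sigma2, n):
    """EB-MLE step: z = xbar ~ N(mu, sigma0^2) with mu ~ N(0, A), so marginally
    z ~ N(0, sigma0^2 + A); the MLE of eta = sigma0^2 + A is max(sigma0^2, z^2)."""
    sigma0_sq = sigma2 / n
    z = xbar
    eta_hat = max(sigma0_sq, z ** 2)          # equation (4.4)
    A_hat = eta_hat - sigma0_sq               # = (z^2 - sigma0^2)_+
    return A_hat / (A_hat + sigma0_sq) * z    # plug-in Bayes (EB) estimate of mu

print(eb_mle_normal_mean(xbar=1.8, sigma2=4.0, n=10))
print(eb_mle_normal_mean(xbar=0.3, sigma2=4.0, n=10))   # shrinks all the way to 0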
Example 11.9 (Empirical Bayes binomial)
Empirical Bayes estimation is best suited to large-scale inference, where there are many
problems that can be modeled simultaneously in a common way.
For example, suppose that there are K different groups of patients, where each group
has n patients. Each group is given a different treatment for the same illness, and in the
k-th group, we count Xk , k = 1, · · · , K, the number of successful treatments out of n.
Let Xk = Σ_i X_{ki} ∼ Bin(n, pk), k = 1, ..., K. Since the groups receive different treatments,
we expect different success rates; however, since we are treating the same illness, these rates
should be somewhat related to each other. These considerations suggest the hierarchy
Xk |pk ∼ Bin(n, pk ),
pk ∼ Beta(a, b), k = 1, ..., K,
where the K groups are tied together by the common prior distribution. As in Example
11.4, the Bayes estimator of pk under L2 -loss is
p̂_k^{Bayes} = E[pk|xk] = (a + xk)/(a + xk + b + n − xk) = (a + xk)/(a + b + n)
            = ( (a + b)/(a + b + n) )·a/(a + b) + ( n/(a + b + n) )·(xk/n),
since the posterior of pk is Beta(a + xk, b + n − xk).
If a and b are unknown, one can use the EB method. First we need to calculate
the marginal distribution
f(x|a, b) = ∫₀¹ ··· ∫₀¹ Π_{k=1}^K [ C(n, xk) pk^{xk}(1 − pk)^{n−xk} · (Γ(a+b)/(Γ(a)Γ(b))) pk^{a−1}(1 − pk)^{b−1} ] dp₁ ··· dp_K
          = Π_{k=1}^K [ C(n, xk) (Γ(a+b)/(Γ(a)Γ(b))) ∫₀¹ pk^{xk+a−1}(1 − pk)^{n−xk+b−1} dpk ]
          = Π_{k=1}^K C(n, xk) (Γ(a+b)/(Γ(a)Γ(b))) · Γ(a + xk)Γ(n − xk + b)/Γ(a + b + n),
where C(n, xk) denotes the binomial coefficient.
We then compute the MLEs of a and b, denoted by â and b̂; they have no closed-form
solutions and must be calculated numerically. Therefore,
p̂_k^{EB} = (â + xk)/(â + b̂ + n) = ( (â + b̂)/(â + b̂ + n) )·â/(â + b̂) + ( n/(â + b̂ + n) )·(xk/n).
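A sketch of this EB calculation, maximizing the log marginal likelihood numerically (the data, the optimizer choice and the starting values are illustrative assumptions):

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def neg_log_marginal(params, x, n):
    """Negative log marginal likelihood of (a, b) in the beta-binomial hierarchy
    (binomial-coefficient terms are constant in (a, b) and therefore dropped)."""
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf
    ll = np.sum(gammaln(a + b) - gammaln(a) - gammaln(b)
                + gammaln(a + x) + gammaln(n - x + b) - gammaln(a + b + n))
    return -ll

def eb_binomial(x, n):
    """Plug the MLEs (a_hat, b_hat) into the posterior mean (a_hat + x_k)/(a_hat + b_hat + n)."""
    res = minimize(neg_log_marginal, x0=np.array([1.0, 1.0]),
                   args=(x, n), method="Nelder-Mead")
    a_hat, b_hat = res.x
    return (a_hat + x) / (a_hat + b_hat + n), (a_hat, b_hat)

x = np.array([3, 5, 4, 6, 2, 7, 5, 4, 3, 6])   # successes in K = 10 groups of n = 10
print(eb_binomial(x, n=10))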
11.5 Empirical Bayes and the James-Stein Estimator
Charles Stein shocked the statistical world in 1955 with his proof that MLE methods for
Gaussian models, in common use for more than a century, were inadmissible beyond simple
one- or two-dimensional situations. These methods are still in use, for good reasons, but
Stein-type estimators have pointed the way toward a radically different empirical Bayes
approach to high-dimensional statistical inference. Empirical Bayes ideas are powerful for
estimation, testing, and prediction. We begin here with their path-breaking appearance
in the James-Stein formulation.
Model
Suppose µi ∼ N(0, A) and zi | µi ∼ N(µi, σ₀²), independently for i = 1, ..., p.
Write µ = (µ1, ..., µp)′, z = (z1, ..., zp)′, and I as the p × p identity matrix. Then
z | µ ∼ N(µ, σ₀²I),   µ ∼ N(0, AI),
and
z ∼ N(0, (A + σ₀²)I),
which was verified component-wise before.
Bayes estimator
Theorem 11.3 The Bayes estimator of µ is µ̂^{Bayes} = ( 1 − σ₀²/(A + σ₀²) ) z, with Bayes risk r(µ̂^{Bayes}) = p σ₀²A/(σ₀² + A).
Proof.
MLE
The obvious estimator of µ is the MLE µ̂^{MLE} = z. Its risk and Bayes risk are
R^{(MLE)}(µ) = Σ_{i=1}^p Eµ(zi − µi)² = Σ_{i=1}^p Vµ(zi) = pσ₀²,
r^{(MLE)} = E R^{(MLE)}(µ) = pσ₀².
Empirical Bayes Estimation, James-Stein Estimation
Since A is assumed unknown, we can use empirical Bayes rules below. Note that
‖z‖² = Σ_{i=1}^p zi² ∼ (A + σ₀²)χ²_p.
Thus, if we replace σ₀²/(A + σ₀²) in (5.5) by an unbiased estimator (p − 2)σ₀²/‖z‖², we have
µ̂^{JS} = ( 1 − (p − 2)σ₀²/‖z‖² ) z,
the James-Stein estimator. It was discovered by Stein (1956) and later shown by James
and Stein (1961) to have a smaller mean squared error than the MLE z for all µ. Its
empirical Bayes derivation can be found in Efron and Morris (1972).
Since ‖z‖²/(A + σ₀²) ∼ χ²_p = Gamma(a, b) with (a, b) = (p/2, 1/2), we have
E[ (A + σ₀²)/‖z‖² ] = (1/2)/(p/2 − 1) = 1/(p − 2)   =⇒   E[ (p − 2)σ₀²/‖z‖² ] = σ₀²/(A + σ₀²).
Since the James-Stein estimator (or any EB estimator) cannot attain as small a
Bayes risk as the Bayes estimator, it is of interest to compare their risks, which, in effect,
tells us the penalty we are paying for estimating A.
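A small Monte Carlo sketch comparing the James-Stein estimator with the MLE z (the dimension, variances and number of replications are illustrative assumptions):

import numpy as np

def james_stein(z, sigma0_sq):
    """James-Stein estimator shrinking z toward 0 (requires p >= 3)."""
    shrink = 1.0 - (len(z) - 2) * sigma0_sq / np.sum(z ** 2)
    return shrink * z

rng = np.random.default_rng(5)
p, sigma0_sq, A, reps = 10, 1.0, 1.0, 2000
sse_mle = sse_js = 0.0
for _ in range(reps):
    mu = rng.normal(0.0, np.sqrt(A), size=p)              # mu ~ N(0, A I)
    z = mu + rng.normal(0.0, np.sqrt(sigma0_sq), size=p)   # z | mu ~ N(mu, sigma0^2 I)
    sse_mle += np.sum((z - mu) ** 2) / reps
    sse_js += np.sum((james_stein(z, sigma0_sq) - mu) ** 2) / reps
print(sse_mle, sse_js)   # roughly p*sigma0^2 = 10 for the MLE versus about 6 for JS here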
Stein Unbiased Risk Estimation (SURE)
Stein provided an unbiased estimate of the risk R(θ, δ) (and hence, after averaging over the prior, of
the Bayes risk r(θ, δ)). SURE can not only be used to calculate risks, but also (perhaps more
importantly) to find EB-SURE estimators or to choose tuning parameters (such as a soft-thresholding level).
Theorem 11.4 (SURE) Let X ∼ Np(µ, σ²I), and let the estimator δ be of the form
δ(x) = x − g(x),
where g(x) = x − δ(x) is (weakly) differentiable. If E‖g′(X)‖₁ ≡ Σ_{i=1}^p E|∂g_i(X)/∂X_i| < ∞,
then
Eµ‖δ(X) − µ‖² = Eµ[ pσ² + ‖g(X)‖² − 2σ² Σ_{i=1}^p ∂g_i(X)/∂X_i ],
so the bracketed quantity, SURE(X), is an unbiased estimate of the risk.
Proof. It suffices to show the step from (5.7) to (5.8). For simplicity, we only prove it
for p = 2. For g(x) = (g₁(x), g₂(x))ᵀ, we have
Eµ[ g₁(x₁, x₂)(x₁ − µ₁) ]
= ∫∫ φ((x₁ − µ₁)/σ) φ((x₂ − µ₂)/σ) g₁(x₁, x₂)(x₁ − µ₁) dx₁ dx₂
= ∫∫ (−σ) φ′((x₁ − µ₁)/σ) φ((x₂ − µ₂)/σ) g₁(x₁, x₂) dx₁ dx₂
= −σ ∫_{x₂∈R} φ((x₂ − µ₂)/σ) ∫_{x₁∈R} φ′((x₁ − µ₁)/σ) g₁(x₁, x₂) dx₁ dx₂
= −σ² ∫_{x₂∈R} φ((x₂ − µ₂)/σ) ∫_{x₁∈R} g₁(x₁, x₂) dφ((x₁ − µ₁)/σ) dx₂
= σ² ∫_{x₂∈R} φ((x₂ − µ₂)/σ) ∫_{x₁∈R} ( ∂g₁(x₁, x₂)/∂x₁ ) φ((x₁ − µ₁)/σ) dx₁ dx₂
= σ² ∫_{x₁∈R} ∫_{x₂∈R} ( ∂g₁(x₁, x₂)/∂x₁ ) φ((x₁ − µ₁)/σ) φ((x₂ − µ₂)/σ) dx₁ dx₂
= σ² Eµ[ ∂g₁/∂x₁ ].
Similarly, Eµ[ g₂(x₁, x₂)(x₂ − µ₂) ] = σ² Eµ[ ∂g₂/∂x₂ ].
Theorem 11.5 For the JS estimator µ̂^{JS} = ( 1 − (p − 2)σ₀²/‖z‖² ) z (here σ² denotes σ₀²), we have
R(µ, µ̂^{JS}) = pσ² − (p − 2)²σ⁴ Eµ[ 1/‖z‖² ],
SURE(µ̂^{JS}) = pσ² − (p − 2)²σ⁴/‖z‖²,
r(µ, µ̂^{JS}) = pσ² − (p − 2)²σ⁴ E[ 1/‖z‖² ]
             = pσ² − (p − 2)σ⁴/(σ² + A)   (by (5.6))
             = p σ²A/(σ² + A) + 2σ⁴/(σ² + A).
Proof. Here µ̂^{JS} = z − g(z) with g(z) = (p − 2)σ₀² z/‖z‖². Hence,
R(µ, µ̂^{JS}) = pσ² + Eµ[ (p − 2)²σ₀⁴ ‖z‖²/‖z‖⁴ ] − 2σ₀² Eµ[ Σ_{i=1}^p ∂/∂zi ( (p − 2)σ₀² zi/‖z‖² ) ]
             = pσ² + (p − 2)²σ₀⁴ Eµ[ 1/‖z‖² ] − 2(p − 2)σ₀⁴ Eµ[ Σ_{i=1}^p (‖z‖² − 2zi²)/‖z‖⁴ ]
             = pσ² + (p − 2)²σ₀⁴ Eµ[ 1/‖z‖² ] − 2(p − 2)σ₀⁴ Eµ[ (p‖z‖² − 2‖z‖²)/‖z‖⁴ ]
             = pσ² − (p − 2)²σ₀⁴ Eµ[ 1/‖z‖² ].
And SURE(µ̂^{JS}) = pσ² − (p − 2)²σ₀⁴/‖z‖² follows in the same way, without taking the expectation Eµ.
Recall that r(µ, µ̂^{Bayes}) = p σ²A/(σ² + A). So the relative risk is
r(µ, µ̂^{JS}) / r(µ, µ̂^{Bayes}) = 1 + [ 2σ⁴/(σ² + A) ] / [ p σ²A/(σ² + A) ] = 1 + 2σ²/(pA).
For p = 10 and A = σ², r(µ, µ̂^{JS}) is only 20% greater than the true Bayes risk.
The shock that the James-Stein estimator gave the statistical world did not come from
the average-risk comparisons above. These are based on the zero-centric Bayesian model,
where the MLE, which does not favor values of µ near 0, might be expected to be bested.
The rude surprise came from the theorem proved by James and Stein in 1961: for p ≥ 3,
R(µ, µ̂^{JS}) < R(µ, µ̂^{MLE}) for every µ.
This result is frequentist rather than Bayesian – it implies the superiority of µ̂^{JS}
no matter what one's prior beliefs about µ may be. Since versions of µ̂^{MLE} underlie
popular statistical techniques such as linear regression, its apparent uniform inferiority
was a cause for alarm. The fact that linear regression applications continue unabated
reflects some virtues of µ̂^{MLE}, which will be discussed later.
11.6 Using SURE to find Estimators
Recall
µ̂^{Bayes} = ( 1 − 1/(A + 1) ) z = z − g(z),
where g(z) = z/(A + 1) and ∂gi(z)/∂zi = 1/(A + 1). From Theorem 11.4, we have
SURE(A) = pσ² + ‖g(x)‖² − 2σ² Σ_{i=1}^p ∂gi/∂xi = pσ² + ‖z‖²/(A + 1)² − 2pσ²/(A + 1).
A SURE estimate of A is
Â = arg min_{A≥0} SURE(A);
minimizing the quadratic in 1/(A + 1) gives a SURE estimate of 1/(1 + A), namely
1/(1 + Â) = pσ²/‖z‖².
Hence, we obtain a SURE-type estimator of µ as
µ̂^{SURE} = ( 1 − pσ²/‖z‖² ) z.
Note that this is very similar to the James-Stein estimator µ̂JS .
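A tiny sketch of this SURE-tuned shrinkage rule (the data are illustrative; the clamp at 1, which enforces the constraint A ≥ 0, is an added detail not displayed in the notes):

import numpy as np

def sure_shrinkage(z, sigma2=1.0):
    """SURE-tuned shrinkage toward 0 for z ~ N(mu, sigma2 I), mu ~ N(0, A I):
    minimizing SURE(A) over A >= 0 gives 1/(1 + A_hat) = min(1, p*sigma2/||z||^2),
    so mu_hat = (1 - min(1, p*sigma2/||z||^2)) z, close to the James-Stein form."""
    w = min(1.0, len(z) * sigma2 / np.sum(z ** 2))   # estimated 1/(1 + A)
    return (1.0 - w) * z

rng = np.random.default_rng(6)
z = rng.normal(1.0, 1.0, size=10)
print(sure_shrinkage(z))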
11.7 James-Stein Estimator, shrinking to non-zero point
The above James-Stein estimator shrinks each observed value zi toward 0. We don’t have
to take 0 as the preferred shrinking point. Now suppose µi ∼ N(M, A) and zi | µi ∼ N(µi, σ₀²), i = 1, ..., p,
the (µi, zi) pairs being independent of each other. We can write this compactly as
zi ∼ N(M, A + σ₀²)   marginally.
Bayes estimator
Theorem 11.7
James-Stein Estimation
Since A is assumed unknown, we can use empirical Bayes rules below. Note that
‖z − z̄‖² = Σ_{i=1}^p (zi − z̄)² ∼ (A + σ₀²)χ²_{p−1}.
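The notes stop here before displaying the resulting estimator. The sketch below fills in the standard construction as an assumption, not as the notes' own display: shrink toward z̄, with p − 3 replacing p − 2 because ‖z − z̄‖² has p − 1 degrees of freedom.

import numpy as np

def james_stein_to_grand_mean(z, sigma0_sq):
    """Hypothetical sketch: James-Stein shrinkage toward z_bar.  Since
    ||z - z_bar||^2 ~ (A + sigma0^2) chi^2_{p-1}, the factor
    (p - 3) sigma0^2 / ||z - z_bar||^2 unbiasedly estimates sigma0^2/(A + sigma0^2)."""
    zbar = np.mean(z)
    shrink = 1.0 - (len(z) - 3) * sigma0_sq / np.sum((z - zbar) ** 2)
    return zbar + shrink * (z - zbar)

rng = np.random.default_rng(7)
z = rng.normal(5.0, 1.5, size=12)
print(james_stein_to_grand_mean(z, sigma0_sq=1.0))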