Math Stat 2019
September 2019
Contents

1 Introduction
1.1 Speed of light example
1.2 Notation
1.3 Example: the location model
1.4 Some further examples of statistical models

2 Estimation
2.1 What is an estimator?
2.2 The empirical distribution function
2.3 Some estimators for the location model
2.4 How to construct estimators
2.4.1 Plug-in estimators
2.4.2 The method of moments
2.4.3 Likelihood methods
2.5 Asymptotic tests and confidence intervals based on the likelihood
8 Comparison of estimators
8.1 Definition of risk
8.2 Risk and sufficiency
8.3 Rao-Blackwell
8.4 Sensitivity and robustness
8.5 Computational aspects

9 Equivariant statistics
9.1 Equivariance in the location model
9.1.1 Construction of the UMRE estimator
9.1.2 Quadratic loss: the Pitman estimator
9.1.3 Invariant statistics
9.1.4 Quadratic loss and Basu's Lemma
9.2 Equivariance in the location-scale model ?
9.2.1 Construction of the UMRE estimator ?
9.2.2 Quadratic loss ?

10 Decision theory
10.1 Decisions and their risk
10.2 Admissible decisions
10.2.1 Not using the data at all is admissible
14 M-estimators
14.1 MLE as special case of M-estimation
14.2 Consistency of M-estimators
14.3 Asymptotic normality of M-estimators
14.4 Asymptotic normality of the MLE
14.5 Two further examples of M-estimation
14.6 Asymptotic relative efficiency
14.7 Asymptotic pivots
14.8 Asymptotic pivot based on the MLE
14.9 MLE for the multinomial distribution
14.10 Likelihood ratio tests
17 Literature
Introduction
Fizeau and Foucault developed methods for estimating the speed of light (1849,
1850), which were later improved by Newcomb and Michelson. The main idea
is to pass light from a rapidly rotating mirror to a fixed mirror and back to
the rotating mirror. An estimate of the velocity of light is obtained, taking
into account the speed of the rotating mirror, the distance travelled, and the
displacement of the light as it returns to the rotating mirror.
Fig. 1
The data are Newcomb’s measurements of the passage time it took light to
travel from his lab, to a mirror on the Washington Monument, and back to his
lab.
[Figure: Newcomb's measurements X1, X2, X3 plotted against t1, t2, t3, one panel per day (day 1, day 2, day 3).]
One may estimate the speed of light using e.g. the mean, the median, or Huber's estimate (see below). This gives the following results (for the three days separately, and for the three days combined):
Table 1
The question of which estimate is "the best one" is one of the topics of these notes.
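As a sketch of how such estimates are computed, the snippet below contrasts the mean, the median, and a Huber-type M-estimate on illustrative numbers (not Newcomb's actual measurements); the Winsorized-mean iteration and the tuning constant k = 1.345 are conventional choices, not ones prescribed by these notes.

```python
import statistics

def huber_location(x, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location: solves sum(psi(x_i - mu)) = 0 with
    psi(t) = max(-k, min(k, t)), via the Winsorized-mean fixed point."""
    mu = statistics.median(x)                      # robust starting value
    for _ in range(max_iter):
        winsorized = [min(max(xi, mu - k), mu + k) for xi in x]
        mu_new = statistics.fmean(winsorized)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

# illustrative sample with one gross outlier
x = [24.8, 25.0, 25.1, 24.9, 25.2, 24.7, 25.0, 3.0]
print(statistics.fmean(x))    # pulled far down by the outlier
print(statistics.median(x))   # robust
print(huber_location(x))      # robust compromise
```

The outlier drags the mean to about 22.2, while the median and the Huber estimate stay near 25.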
1.2 Notation
F(·) = P(X ≤ ·).
Recall that the distribution function F determines the distribution P (and vice versa).
Further model assumptions then concern the modeling of P. We write such a model as P ∈ P, where P is a given collection of probability measures, the so-called model class. Typically the distributions in P are indexed by a parameter, say θ, in some parameter space, say Θ. Then P = {Pθ : θ ∈ Θ}, and P = Pθ for some θ ∈ Θ. Then θ is often called the "true parameter". The parameter space Θ may be high-dimensional, or even ∞-dimensional. Often, one is only interested in a certain aspect of the parameter. We write the parameter of interest as γ := g(θ), where g : Θ → Γ is a given function with values in some space Γ.
The following example will serve to illustrate the concepts that are to follow. Let X be real-valued. The location model is
Xi = µ + εi, i = 1, . . . , n,
with ε1, . . . , εn i.i.d. and, under model (1.2), symmetrically distributed but otherwise unknown, and, under model (1.3), N(0, σ²)-distributed with unknown variance σ².
γ = Pθ(X ≥ 4) = 1 − Pθ(X ≤ 3) = 1 − (1 + θ + θ²/2 + θ³/3!) e^{−θ} := g(θ).
xi # days
0 100
1 60
2 32
3 8
≥4 0
where θ > 0 is unknown. This is the Pareto density which is often used to
model the distribution of income. The parameter θ is sometimes called the
Pareto index. We write the model class for the distribution as
Pθ (Y = 1|Z = z) = θ(z), z ∈ R,
with
θ(·) ∈ Θ := {all increasing functions θ : R → [0, 1]}.
γ := θ⁻¹(1/2),
that is, the value γ such that for z ≥ γ you are at risk:
Pθ(Y = 1|Z = z) ≥ 1/2.
Estimation
F(·) := P(X ≤ ·).
Most estimators are constructed according to the so-called plug-in principle (Einsetzprinzip). That is, the parameter of interest γ is written as γ = Q(F), with Q some given map. The empirical distribution F̂n is then "plugged in", to obtain the estimator T := Q(F̂n). (We note however that problems can arise: e.g., Q(F̂n) may not be well-defined.)
In the location model of Section 1.3, one may consider the following estimators
µ̂ of µ (among others):
• The average
µ̂1 := (1/n) ∑_{i=1}^n Xi,
that is,
µ̂1 = arg min_{µ∈R} ∑_{i=1}^n (Xi − µ)².  (2.1)
It can be shown that µ̂1 is a “good” estimator if the model (1.3) holds, i.e., if
X1 , . . . , Xn are i.i.d. N (µ, σ 2 ). When (1.3) is not true, in particular when there
are outliers (large, “wrong”, observations) (Ausreisser), then one has to apply
a more robust estimator.
• The (sample) median is
µ̂2 := X((n+1)/2) when n is odd, and µ̂2 := {X(n/2) + X(n/2+1)}/2 when n is even,
where X(1) ≤ · · · ≤ X(n) are the order statistics. Note that µ̂2 is a minimizer of the absolute loss
∑_{i=1}^n |Xi − µ|.
¹ To avoid misunderstanding, we note that e.g. in (2.1), µ is used as the variable over which one minimizes, and is there not the unknown parameter µ. It is a general convention to abuse notation and employ the same symbol µ. When further developing the theory we shall often introduce a different symbol for the variable, to distinguish it from the "true parameter" µ; e.g., (2.1) is then written as
µ̂1 := arg min_{c∈R} ∑_{i=1}^n (Xi − c)².
where
ρ(x) = x² if |x| ≤ k, and ρ(x) = k(2|x| − k) if |x| > k,
with k > 0 some given threshold.
• We finally mention the α-trimmed mean, defined, for some 0 < α < 1, as
µ̂4 := (1/(n − 2[nα])) ∑_{i=[nα]+1}^{n−[nα]} X(i).
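A direct implementation of µ̂4 (a sketch; the function name is ours, and [nα] denotes the integer part of nα):

```python
import math

def trimmed_mean(x, alpha):
    """alpha-trimmed mean: average of the order statistics
    X_([n*alpha]+1), ..., X_(n-[n*alpha])."""
    n = len(x)
    m = math.floor(n * alpha)      # [n*alpha]
    xs = sorted(x)                 # the order statistics X_(1) <= ... <= X_(n)
    return sum(xs[m:n - m]) / (n - 2 * m)

print(trimmed_mean([1, 2, 3, 4, 100], 0.2))   # drops 1 and 100: (2+3+4)/3 = 3.0
```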
The above estimators µ̂1, . . . , µ̂4 are plug-in estimators of the location parameter µ. We define the maps
Q1(F) := ∫ x dF(x),
Q2(F) := F⁻¹(1/2),
and finally
Q4(F) := (1/(1 − 2α)) ∫_{F⁻¹(α)}^{F⁻¹(1−α)} x dF(x).
F(·) = P(X ≤ ·).
Note that when knowing only F̂n, one can reconstruct the order statistics X(1) ≤ . . . ≤ X(n), but not the original data X1, . . . , Xn. Now, the order in which the data are given carries no information about the distribution P. In other words, a "reasonable"² estimator T = T(X1, . . . , Xn) depends on the sample (X1, . . . , Xn) only via the order statistics (X(1), . . . , X(n)) (i.e., shuffling the data should have no influence on the value of T). Because these order statistics can be determined from the empirical distribution F̂n, we conclude that any "reasonable" estimator T can be written as a function of F̂n:
T = Q(F̂n),
γ = g(θ) = Q(Fθ ).
Let X ∈ R and suppose (say) that the parameter of interest is θ itself, and that Θ ⊂ Rp. Let µ1(θ), . . . , µp(θ) denote the first p moments of X (assumed to exist), i.e.,
µj(θ) = Eθ X^j = ∫ x^j dFθ(x), j = 1, . . . , p.
² What is "reasonable" has to be considered with some care. There are in fact "reasonable" statistical procedures that do treat the {Xi} in an asymmetric way. An example is splitting the sample into a training and test set (for model validation).
This is the distribution of the number of failures till the k-th success, where at each trial the probability of success is θ, and where the trials are independent. It holds that
Eθ(X) = k(1 − θ)/θ := m(θ).
Hence
m⁻¹(µ) = k/(µ + k),
and the method of moments estimator is
θ̂ = k/(X̄ + k) = nk/(∑_{i=1}^n Xi + nk) = number of successes / number of trials.
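A numerical sketch of this estimator (k, θ, and the sample size are arbitrary choices; the negative binomial variable is simulated as the number of failures before the k-th success):

```python
import random

def neg_binom(k, theta):
    """Number of failures before the k-th success in independent Bernoulli(theta) trials."""
    failures = successes = 0
    while successes < k:
        if random.random() < theta:
            successes += 1
        else:
            failures += 1
    return failures

random.seed(3)
k, theta, n = 4, 0.3, 20000
xbar = sum(neg_binom(k, theta) for _ in range(n)) / n
theta_hat = k / (xbar + k)        # the method of moments estimate m^{-1}(Xbar)
print(theta_hat)                  # close to the true theta = 0.3
```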
with respect to Lebesgue measure, and with θ ∈ Θ ⊂ (0, ∞) (see Example 1.4.2). Then, for θ > 1,
Eθ X = 1/(θ − 1) := m(θ),
with inverse
m⁻¹(µ) = (1 + µ)/µ.
The method of moments estimator would thus be
θ̂ = (1 + X̄)/X̄.
However, the mean Eθ X does not exist for θ < 1, so when Θ contains values θ < 1, the method of moments is perhaps not a good idea. We will see that the maximum likelihood estimator does not suffer from this problem.
Note We use the symbol ϑ for the variable in the likelihood function, and the
slightly different symbol θ for the parameter we want to estimate. It is however
a common convention to use the same symbol for both (as already noted in the
footnotes in Sections 1.2 and 2.3). As we will see, different symbols are needed
for the development of the theory.
Note Alternatively, we may write the MLE as the maximizer of the log-likelihood:
θ̂ = arg max_{ϑ∈Θ} log LX(ϑ) = arg max_{ϑ∈Θ} ∑_{i=1}^n log pϑ(Xi).
We will now show that maximum likelihood is a plug-in method. First, as noted above, the MLE maximizes the log-likelihood. We may of course normalize the log-likelihood by 1/n:
θ̂ = arg max_{ϑ∈Θ} (1/n) ∑_{i=1}^n log pϑ(Xi).  (2.2)
Replacing the average ∑_{i=1}^n log pϑ(Xi)/n in (2.2) by its theoretical counterpart Eθ log pϑ(X) gives
arg max_{ϑ∈Θ} Eθ log pϑ(X)
(d/dϑ) log pϑ(x) = 1/ϑ − log(1 + x).
We put the derivative of the log-likelihood based on n observations to zero and solve:
n/θ̂ − ∑_{i=1}^n log(1 + Xi) = 0
⇒ θ̂ = 1 / [∑_{i=1}^n log(1 + Xi)/n].
(One may check that this is indeed the maximum.)
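A sketch of this MLE on simulated data. The Pareto density is taken here to be pθ(x) = θ(1 + x)^−(1+θ), x > 0 (consistent with the derivative computed above), so that X can be sampled by inverting F(x) = 1 − (1 + x)^−θ:

```python
import random, math

random.seed(11)
theta, n = 2.5, 5000
# inversion: if U ~ Uniform(0,1), then (1-U)^(-1/theta) - 1 has the Pareto law above
x = [(1 - random.random()) ** (-1 / theta) - 1 for _ in range(n)]
theta_hat = 1 / (sum(math.log(1 + xi) for xi in x) / n)   # the MLE derived above
print(theta_hat)                                          # close to theta = 2.5
```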
Example 2.4.4 MLE for some location/scale models
Let X ∈ R and θ = (µ, σ²), with µ ∈ R a location parameter and σ > 0 a scale parameter. We assume that the distribution function Fθ of X is
Fθ(·) = F0((· − µ)/σ),
Taking µ̃ = X1 yields
LX(X1, σ̃²) = (1/√(2π)) (1/2 + 1/(2σ̃)) ∏_{i=2}^n [(1/2) φ(Xi − X1) + (1/(2σ̃)) φ((Xi − X1)/σ̃)].
This section is a look ahead at what is to come in Chapter 14. Suppose that Θ is an open subset of Rp. Define the log-likelihood ratio
Z(X, θ) := 2[log LX(θ̂) − log LX(θ)].
Intermezzo: distribution theory
Recall the definition of conditional probabilities: for two sets A and B, with P(B) ≠ 0, the conditional probability of A given B is defined as
P(A|B) := P(A ∩ B)/P(B).
It follows that
P(B|A) = P(A|B) P(B)/P(A),
and that, for a partition {Bj},
P(A) = ∑_j P(A|Bj) P(Bj).
j
Consider now two random vectors X ∈ Rn and Y ∈ Rm. Let fX,Y(·, ·) be the density of (X, Y) with respect to Lebesgue measure (assumed to exist). The marginal density of X is
fX(·) = ∫ fX,Y(·, y) dy,
Thus, we have
fY(y|x) = fX(x|y) fY(y)/fX(x), (x, y) ∈ Rn+m,
and
fX(x) = ∫ fX(x|y) fY(y) dy, x ∈ Rn.
Proof. Define
h(y) := E[g(X, Y)|Y = y].
Then
Eh(Y) = ∫ h(y) fY(y) dy = ∫ E[g(X, Y)|Y = y] fY(y) dy
= ∫∫ g(x, y) fX,Y(x, y) dx dy = Eg(X, Y). □
In a survey, people were asked their opinion about some political issue. Let X
be the number of yes-answers, Y be the number of no and Z be the number
of perhaps. The total number of people in the survey is n = X + Y + Z. We
consider the votes as a sample with replacement, with p1 = P (yes), p2 = P (no),
and p3 = P (perhaps), p1 + p2 + p3 = 1. Then
P(X = x, Y = y, Z = z) = (n choose x, y, z) p1^x p2^y p3^z, (x, y, z) ∈ {0, . . . , n}³, x + y + z = n.
Here
(n choose x, y, z) := n!/(x! y! z!).
It is called a multinomial coefficient.
Lemma 3.2.1 The marginal distribution of X is the Binomial(n, p1 )-distribution.
Proof. For x ∈ {0, . . . , n}, we have
P(X = x) = ∑_{y=0}^{n−x} P(X = x, Y = y, Z = n − x − y)
= ∑_{y=0}^{n−x} (n choose x, y, n − x − y) p1^x p2^y (1 − p1 − p2)^{n−x−y}
= (n choose x) p1^x ∑_{y=0}^{n−x} (n − x choose y) p2^y (1 − p1 − p2)^{n−x−y}
= (n choose x) p1^x (1 − p1)^{n−x}. □
Definition 3.2.1 We say that the random vector (N1, . . . , Nk) has the multinomial distribution with parameters n and p1, . . . , pk (with ∑_{j=1}^k pj = 1), if for all (n1, . . . , nk) ∈ {0, . . . , n}^k with n1 + · · · + nk = n, it holds that
P(N1 = n1, . . . , Nk = nk) = (n choose n1, . . . , nk) p1^{n1} · · · pk^{nk}.
Here
(n choose n1, . . . , nk) := n!/(n1! · · · nk!).
Example 3.2.1 Histograms
Let X1 , . . . , Xn be i.i.d. copies of a random variable X ∈ R with distribution F ,
and let −∞ = a0 < a1 < · · · < ak−1 < ak = ∞. Define, for j = 1, . . . , k,
pj := P (X ∈ (aj−1 , aj ]) = F (aj ) − F (aj−1 ),
Nj/n := #{Xi ∈ (aj−1, aj]}/n = F̂n(aj) − F̂n(aj−1).
Then (N1 , . . . , Nk ) has the Multinomial(n, p1 , . . . , pk )-distribution.
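A minimal sketch of this correspondence (the cell boundaries are arbitrary illustrative choices): the histogram cell counts N1, . . . , Nk always sum to n and jointly follow the Multinomial(n, p1, . . . , pk) law.

```python
import random, bisect

random.seed(5)
n = 1000
x = [random.gauss(0, 1) for _ in range(n)]

cells = [-1.0, 0.0, 1.0]      # interior boundaries a_1 < a_2 < a_3; a_0 = -inf, a_4 = +inf
counts = [0] * (len(cells) + 1)
for xi in x:
    counts[bisect.bisect_left(cells, xi)] += 1   # N_j = #{X_i in (a_{j-1}, a_j]}
print(counts)    # one realization of (N_1, ..., N_4)
```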
Lemma 3.3.1 Suppose X and Y are independent, and that X has the Poisson(λ)-
distribution, and Y the Poisson(µ)-distribution. Then Z := X + Y has the
Poisson(λ + µ)-distribution.
Proof. For all z ∈ {0, 1, . . .}, we have
P(Z = z) = ∑_{x=0}^z P(X = x, Y = z − x)
= ∑_{x=0}^z P(X = x) P(Y = z − x)
= ∑_{x=0}^z e^{−λ} (λ^x/x!) e^{−µ} (µ^{z−x}/(z − x)!)
= e^{−(λ+µ)} (1/z!) ∑_{x=0}^z (z choose x) λ^x µ^{z−x}
= e^{−(λ+µ)} (λ + µ)^z/z!. □
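The convolution step in this proof can be checked numerically for fixed λ, µ, and z (the values below are arbitrary):

```python
import math

def pois_pmf(k, lam):
    """Poisson probability mass function at k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam, mu, z = 2.0, 3.0, 4
# convolution of Poisson(lam) and Poisson(mu) evaluated at z
conv = sum(pois_pmf(x, lam) * pois_pmf(z - x, mu) for x in range(z + 1))
assert abs(conv - pois_pmf(z, lam + mu)) < 1e-12   # matches Poisson(lam + mu) at z
```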
Lemma 3.3.2 Let X1, . . . , Xn be independent, and (for i = 1, . . . , n) let Xi have the Poisson(λi)-distribution. Define Z := ∑_{i=1}^n Xi. Let z ∈ {0, 1, . . .}. Then the conditional distribution of (X1, . . . , Xn) given Z = z is the multinomial distribution with parameters z and p1, . . . , pn, where
pj = λj/∑_{i=1}^n λi, j = 1, . . . , n.
Proof. First note that Z is Poisson(λ+)-distributed, with λ+ := ∑_{i=1}^n λi. Thus, for all (x1, . . . , xn) ∈ {0, 1, . . . , z}^n satisfying ∑_{i=1}^n xi = z, we have
P(X1 = x1, . . . , Xn = xn | Z = z) = P(X1 = x1, . . . , Xn = xn)/P(Z = z)
= [∏_{i=1}^n e^{−λi} λi^{xi}/xi!] / [e^{−λ+} λ+^z/z!]
= (z choose x1, . . . , xn) (λ1/λ+)^{x1} · · · (λn/λ+)^{xn}. □
Z := max{X1, X2}.
P(Z ≤ z) = P(max{X1, X2} ≤ z) = P(X1 ≤ z, X2 ≤ z) = F²(z).
f(z) = (d/dz) F(z).
So the derivative of F² exists almost everywhere, and
(d/dz) F²(z) = 2F(z) f(z). □
The conditional distribution function of X1 given Z = z is
FX1(x1|z) = F(x1)/(2F(z)) for x1 < z, and FX1(x1|z) = 1 for x1 ≥ z.
4.1 Sufficiency
Pθ (X ∈ ·|S(X) = s)
Pθ (Xi = 1) = 1 − Pθ (Xi = 0) = θ.
Take S = ∑_{i=1}^n Xi. Then S is sufficient for θ: for all possible s,
IPθ(X1 = x1, . . . , Xn = xn | S = s) = 1/(n choose s), ∑_{i=1}^n xi = s.
(This is the Gamma(2, θ)-distribution.) For all possible s, the conditional den-
sity of (X1 , X2 ) given S = s is thus
fX1,X2(x1, x2|S = s) = 1/s, x1 + x2 = s.
Hence, S is sufficient for θ.
Example 4.1.4 Sufficiency of the order statistics
Let X1, . . . , Xn be an i.i.d. sample from a continuous distribution F. Then S := (X(1), . . . , X(n)) is sufficient for F: for all possible s = (s1, . . . , sn) (s1 < . . . < sn), and for (xq1, . . . , xqn) = s,
IPθ((X1, . . . , Xn) = (x1, . . . , xn) | (X(1), . . . , X(n)) = s) = 1/n!.
Example 4.1.5 Sufficiency and the uniform distribution
Let X1 and X2 be independent, and both uniformly distributed on the interval
[0, θ], with θ > 0. Define Z := X1 + X2 .
Lemma The random variable Z has density
fZ(z; θ) = z/θ² if 0 ≤ z ≤ θ, and fZ(z; θ) = (2θ − z)/θ² if θ ≤ z ≤ 2θ.
For general θ, the result follows from the uniform case by the transformation Z ↦ θZ, which maps fZ into fZ(·/θ)/θ. □
The conditional density of (X1 , X2 ) given Z = z ∈ (0, 2θ) depends on θ, so Z
is not sufficient for θ.
Consider now S := max{X1, X2}. The conditional density of (X1, X2) given S = s ∈ (0, θ) is
fX1,X2(x1, x2|S = s) = 1/(2s), for 0 ≤ x1 < s, x2 = s, or x1 = s, 0 ≤ x2 < s.
This does not depend on θ, so S is sufficient for θ.
pθ (x) = gθ (S(x))h(x), ∀ x, θ
(⇒) If S is sufficient for θ, the above does not depend on θ, but is only a
function of x, say h(x). So we may write for S(x) = s,
Pθ(X = x|S = s) = h(x) / ∑_{j: S(aj)=s} h(aj).
pθ(x1, . . . , xn) = (1/θⁿ) 1{0 ≤ min{x1, . . . , xn} ≤ max{x1, . . . , xn} ≤ θ}
= gθ(S(x1, . . . , xn)) h(x1, . . . , xn),
with
gθ(s) := (1/θⁿ) 1{s ≤ θ},
and
h(x1, . . . , xn) := 1{0 ≤ min{x1, . . . , xn}}.
Thus, S = max{X1 , . . . , Xn } is sufficient for θ.
Corollary 4.2.1 The likelihood is LX (θ) = pθ (X) = gθ (S)h(X). Hence, the
maximum likelihood estimator θ̂ = arg maxθ LX (θ) = arg maxθ gθ (S) depends
only on the sufficient statistic S.
pθ(x) = exp[∑_{j=1}^k cj(θ) Tj(x) − d(θ)] h(x).
pθ(x) = ∏_{i=1}^n pθ(xi) = exp[∑_{j=1}^k n cj(θ) T̄j(x) − n d(θ)] ∏_{i=1}^n h(xi),
where, for j = 1, . . . , k,
T̄j(x) = (1/n) ∑_{i=1}^n Tj(xi).
pθ(x) = e^{−θ} θ^x/x! = exp[x log θ − θ] (1/x!).
So we can take T(x) = x, c(θ) = log θ, and d(θ) = θ.
Example 4.3.3 Negative binomial distribution
If X has the Negative Binomial(m, θ)-distribution with m known, we have
pθ(x) = [Γ(x + m)/(Γ(m) x!)] θ^m (1 − θ)^x = [Γ(x + m)/(Γ(m) x!)] exp[x log(1 − θ) + m log θ].
pθ(x) = (θ^m/Γ(m)) e^{−θx} x^{m−1} = (x^{m−1}/Γ(m)) exp[−θx + m log θ].
pθ(x) = (λ^m/Γ(m)) e^{−λx} x^{m−1} = exp[−λx + (m − 1) log x + m log λ − log Γ(m)].
So we can take T1(x) = x, T2(x) = log x, c1(θ) = −λ, c2(θ) = m − 1, and d(θ) = −m log λ + log Γ(m).
pθ(x) = (1/(√(2π) σ)) exp[−(x − µ)²/(2σ²)]
= (1/√(2π)) exp[xµ/σ² − x²/(2σ²) − µ²/(2σ²) − log σ].
Let Z ∈ Rk be a random vector. Then its mean (if it exists) is defined as the vector consisting of the means of each entry of Z:
EZ = (EZ1, . . . , EZk)′.
Its covariance matrix is
Σ := EZZ′ − EZ EZ′,
the symmetric k × k matrix with diagonal entries σj² and off-diagonal entries σj,l.
pθ := dPθ/dν.
pθ(x) = exp[∑_{j=1}^k θj Tj(x) − d(θ)] h(x).
We let
ḋ(θ) := (∂/∂θ) d(θ)
denote the vector of first derivatives, and let
d̈(θ) := (∂²/∂θ ∂θ′) d(θ) = [∂²d(θ)/∂θj1 ∂θj2]
denote the matrix of second derivatives. Then
Eθ T(X) = ḋ(θ), Covθ(T(X)) = d̈(θ).
= Covθ(T(X)). □
θ ↦ c(θ) := γ (say),
to get
p̃γ(x) = exp[γ T(x) − d0(γ)] h(x),
It follows that
Eθ T(X) = ḋ0(γ) = ḋ(c⁻¹(γ))/ċ(c⁻¹(γ)) = ḋ(θ)/ċ(θ),  (4.1)
and
varθ(T(X)) = d̈0(γ) = d̈(c⁻¹(γ))/[ċ(c⁻¹(γ))]² − ḋ(c⁻¹(γ)) c̈(c⁻¹(γ))/[ċ(c⁻¹(γ))]³
= d̈(θ)/[ċ(θ)]² − ḋ(θ) c̈(θ)/[ċ(θ)]³
= (1/[ċ(θ)]²) [d̈(θ) − (ḋ(θ)/ċ(θ)) c̈(θ)].
sθ(x) := (d/dθ) log pθ(x).
Eθ sθ(X) = 0,
and
I(θ) = −Eθ ṡθ(X),
where ṡθ(x) := (d/dθ) sθ(x).
Proof. The results follow from the fact that densities integrate to one, assuming that we may interchange derivatives and integrals:
Eθ sθ(X) = ∫ sθ(x) pθ(x) dν(x) = ∫ (d log pθ(x)/dθ) pθ(x) dν(x) = ∫ (ṗθ(x)/pθ(x)) pθ(x) dν(x)
= ∫ ṗθ(x) dν(x) = (d/dθ) ∫ pθ(x) dν(x) = (d/dθ) 1 = 0,
and
Eθ ṡθ(X) = Eθ [p̈θ(X)/pθ(X)] − Eθ [ṗθ(X)/pθ(X)]²
= Eθ [p̈θ(X)/pθ(X)] − Eθ sθ²(X).
Now, Eθ sθ²(X) equals varθ sθ(X), since Eθ sθ(X) = 0. Moreover,
Eθ [p̈θ(X)/pθ(X)] = ∫ p̈θ(x) dν(x) = (d²/dθ²) ∫ pθ(x) dν(x) = (d²/dθ²) 1 = 0. □
I(θ) = d̈(θ) − (ḋ(θ)/ċ(θ)) c̈(θ).
I0(γ) = d̈0(γ) = I(θ)/[ċ(θ)]².
We reparametrize:
γ := c(θ) = log(θ/(1 − θ)),
which is called the log-odds ratio. Inverting gives
θ = e^γ/(1 + e^γ),
and hence
d(θ) = −log(1 − θ) = log(1 + e^γ) := d0(γ).
Thus
ḋ0(γ) = e^γ/(1 + e^γ) = θ = Eθ X,
and
d̈0(γ) = e^γ/(1 + e^γ) − e^{2γ}/(1 + e^γ)² = e^γ/(1 + e^γ)² = θ(1 − θ) = varθ(X).
Eθ sθ²(X) = varθ(X)/[θ(1 − θ)]² = 1/(θ(1 − θ)).
Definition 4.9.1 We say that two likelihoods Lx (θ) and Lx̃ (θ) are proportional
at (x, x̃), if
Lx (θ) = Lx̃ (θ)c(x, x̃), ∀ θ,
for some constant c(x, x̃).
We write this as
Lx (θ) ∝ Lx̃ (θ).
A sufficient statistic S is called minimal sufficient if S(x) = S(x̃) for all x and
x̃ for which the likelihoods are proportional.
which equals
log c(x, x̃), ∀ θ,
for some function c, if and only if (x(1) , . . . , x(n) ) = (x̃(1) , . . . , x̃(n) ). So the
order statistics X(1) , . . . , X(n) are minimal sufficient.
Chapter 5
Bias, variance and the Cramér-Rao lower bound
biasθ(T) := Eθ T − g(θ).
The estimator T is called unbiased if
biasθ(T) = 0, ∀ θ.
Note that p(θ) is a power series in θ. Thus only parameters g(θ) which are
a power series in θ times e−θ can be estimated unbiasedly. An example is the
probability of early failure
g(θ) := e−2θ .
An unbiased estimator is
T(X) = +1 if X is even, and T(X) = −1 if X is odd.
This estimator does not make sense at all!
Example 5.1.3 Unbiased estimator of the variance
Let X1 , . . . , Xn be i.i.d. N (µ, σ 2 ), and let θ = (µ, σ 2 ) ∈ R × R+ . Then
S² := (1/(n − 1)) ∑_{i=1}^n (Xi − X̄)²
Lemma 5.2.1 We have the following decomposition for the mean square error:
MSEθ(T) := Eθ(T − g(θ))² = biasθ²(T) + varθ(T).
It is known that S² is unbiased. But does it also have small mean square error? Let us compare it with the estimator
σ̂² := (1/n) ∑_{i=1}^n (Xi − X̄)².
To compute the mean square errors of these two estimators, we first recall that
∑_{i=1}^n (Xi − X̄)²/σ² ∼ χ²_{n−1},
a χ²-distribution with n − 1 degrees of freedom. The χ²-distribution is a special case of the Gamma-distribution, namely
χ²_{n−1} = Γ((n − 1)/2, 1/2).¹
Thus
Eθ [∑_{i=1}^n (Xi − X̄)²/σ²] = n − 1, var[∑_{i=1}^n (Xi − X̄)²/σ²] = 2(n − 1).
It follows that
MSEθ(S²) = Eθ(S² − σ²)² = var(S²) = 2(n − 1)σ⁴/(n − 1)² = 2σ⁴/(n − 1).
Moreover,
Eθ σ̂² = ((n − 1)/n) σ², biasθ(σ̂²) = −(1/n) σ²,
so that
MSEθ(σ̂²) = Eθ(σ̂² − σ²)² = biasθ²(σ̂²) + varθ(σ̂²) = σ⁴/n² + 2(n − 1)σ⁴/n² = (2n − 1)σ⁴/n².
Conclusion: the mean square error of σ̂ 2 is smaller than the mean square error
of S 2 !
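In units of σ⁴, the two mean square errors are 2/(n − 1) for S² and (2n − 1)/n² for σ̂² (the latter obtained by adding the squared bias σ⁴/n² to the variance 2(n − 1)σ⁴/n²); a small sketch confirms the inequality for a range of sample sizes:

```python
# compare the two MSEs, measured in units of sigma^4
for n in range(2, 200):
    mse_S2 = 2 / (n - 1)                   # MSE of the unbiased 1/(n-1) estimator
    mse_sigma_hat = (2 * n - 1) / n ** 2   # bias^2 + variance of the 1/n estimator
    assert mse_sigma_hat < mse_S2          # the 1/n estimator always wins on MSE
```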
Generally, it is not possible to construct an estimator that possesses all desirable properties at once. We therefore fix one property: unbiasedness (despite its disadvantages), and look for good estimators among the unbiased ones.
Definition 5.2.2 An unbiased estimator T ∗ is called UMVU (Uniform Minimum
Variance Unbiased) if for any other unbiased estimator T ,
varθ (T ∗ ) ≤ varθ (T ), ∀ θ.
T ∗ := E(T |S).
varθ (T ∗ ) ≤ varθ (T ), ∀ θ.
¹ If Y has a Γ(k, λ)-distribution, then EY = k/λ and var(Y) = k/λ².
= EY² − E[E(Y|Z)]².
Hence, when adding up, the term E[E(Y|Z)]² cancels out, and what is left over is exactly the variance
var(Y) = EY² − [EY]². □
The question arises: can we construct an unbiased estimator with even smaller
variance than T ∗ = E(T |S)? Note that T ∗ depends on X only via S = S(X),
i.e., it depends only on the sufficient statistic. In our search for UMVU estima-
tors, we may restrict our attention to estimators depending only on S. Thus, if
there is only one unbiased estimator depending only on S, it has to be UMVU.
Definition 5.3.1 A statistic S is called complete if we have the following im-
plication:
Eθ h(S) = 0 ∀ θ ⇒ h(S) = 0, Pθ − a.s., ∀ θ.
Here, h is a function of S not depending on θ.
Lemma 5.3.1 (Lehmann-Scheffé) Let T be an unbiased estimator of g(θ),
with, for all θ, finite variance. Moreover, let S be sufficient and complete.
Then T ∗ := E(T |S) is UMVU.
Proof. We already noted that T∗ = T∗(S) is unbiased and that varθ(T∗) ≤ varθ(T) ∀ θ. If T′(S) is another unbiased estimator of g(θ), we have
Eθ(T∗(S) − T′(S)) = 0, ∀ θ.
Because S is complete, this implies
T∗ = T′, Pθ-a.s. □
To check whether a statistic is complete, one often needs somewhat sophisti-
cated tools from analysis/integration theory. In the next two examples, we only
sketch the proofs of completeness.
A sufficient statistic is
S := ∑_{i=1}^n Xi.
The equation
Eθ h(S) = 0 ∀ θ
thus implies
∑_{k=0}^∞ h(k) (nθ)^k/k! = 0 ∀ θ.
The left hand side can only be zero for all x if f ≡ 0, in which case also f^(k)(0) = 0 for all k. Thus (h(k) takes the role of f^(k)(0) and nθ the role of x), we conclude that h(k) = 0 for all k, i.e., that S is complete.
So we know from the Lehmann-Scheffé Lemma that T ∗ := E(T |S) is UMVU.
Let us now calculate T ∗ . First,
Hence
T∗ = ((n − 1)/n)^S
is UMVU.
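Unbiasedness of T∗ can also be checked numerically: with S ∼ Poisson(nθ), one has E a^S = e^{nθ(a−1)}, which at a = (n − 1)/n equals e^{−θ}. A sketch (θ, n, and the truncation point are arbitrary numerical choices):

```python
import math

n, theta = 5, 1.3
lam = n * theta                  # S = X_1 + ... + X_n ~ Poisson(n*theta)
a = (n - 1) / n
e_T, pmf = 0.0, math.exp(-lam)   # pmf holds P(S = s), updated recursively
for s in range(120):             # truncation; the tail beyond s = 120 is negligible here
    e_T += a ** s * pmf
    pmf *= lam / (s + 1)
assert abs(e_T - math.exp(-theta)) < 1e-10   # E T* = e^{-theta} = g(theta)
```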
Example 5.3.2 UMVU estimation for uniform distribution
Let X1 , . . . , Xn be i.i.d. Uniform[0, θ]-distributed, and g(θ) := θ. We know
h(θ)θn−1 = 0 ∀ θ,
f(z) = (λ^k/Γ(k)) e^{−λz} z^{k−1}, z > 0.
Hence,
pθ(x) = exp[−λ ∑_{i=1}^n xi + (k − 1) ∑_{i=1}^n log xi − d(θ)] h(x),
where
d(k, λ) = −nk log λ + n log Γ(k),
and
h(x) = 1{xi > 0, i = 1, . . . , n}.
It follows that
(∑_{i=1}^n Xi, ∑_{i=1}^n log Xi)
S := (∑_{i=1}^n Xi, ∑_{i=1}^n Xi², ∑_{j=1}^m Yj, ∑_{j=1}^m Yj²)
Definition 5.5.1 If I and II hold, we call sθ the score function, and I(θ) the
Fisher information.
Comparing the above with Definition 4.7.1, we see that they coincide if θ ↦ pθ is differentiable and "regularity conditions" hold. Recall also that in Lemma 4.7.1 we assumed unspecified "regularity conditions". We now present a rigorous proof of the first part of this lemma.
Lemma 5.5.1 Assume Conditions I and II. Then
Eθ sθ (X) = 0, ∀ θ.
Proof. Under Pθ, we only need to consider values x with pθ(x) > 0; that is, we may freely divide by pθ without worrying about dividing by zero. Observe that
Eθ [(pθ+h(X) − pθ(X))/pθ(X)] = ∫_A (pθ+h − pθ) dν = 0,
since densities integrate to 1, and both pθ+h and pθ vanish outside A. Thus,
|Eθ sθ(X)|² = |Eθ [(pθ+h(X) − pθ(X))/(h pθ(X)) − sθ(X)]|²
≤ Eθ [(pθ+h(X) − pθ(X))/(h pθ(X)) − sθ(X)]² → 0. □
Note Thus I(θ) = varθ (sθ (X)).
Remark As already noted, if pθ(x) is differentiable for all x, we can take (under regularity conditions)
sθ(x) := (d/dθ) log pθ(x) = ṗθ(x)/pθ(x),
where ṗθ(x) := (d/dθ) pθ(x).
Remark Suppose X1, . . . , Xn are i.i.d. with density pθ, and sθ = ṗθ/pθ exists. The joint density is
pθ(x) = ∏_{i=1}^n pθ(xi),
so that (under Conditions I and II) the score function for n observations is
sθ(x) = ∑_{i=1}^n sθ(xi).
Moreover,
varθ(T) ≥ ġ²(θ)/I(θ), ∀ θ.
≤ varθ(T) Eθ [(pθ+h(X) − pθ(X))/(h pθ(X)) − sθ(X)]² → 0.
Thus,
ġ(θ) = Eθ T(X)sθ(X) = covθ(T, sθ(X)).
The last equality holds because Eθ sθ(X) = 0. By Cauchy-Schwarz,
ġ²(θ) = [covθ(T, sθ(X))]² ≤ varθ(T) varθ(sθ(X)) = varθ(T) I(θ). □
Definition 5.5.2 We call ġ²(θ)/I(θ), θ ∈ Θ, the Cramér-Rao lower bound (CRLB) (for estimating g(θ)).
Example 5.5.1 CRLB for exponential case
Let X1, . . . , Xn be i.i.d. Exponential(θ), θ > 0. The density of a single observation is then
pθ(x) = θe^{−θx}, x > 0.
Let g(θ) := 1/θ, and T := X̄. Then T is unbiased, and varθ(T) = 1/(nθ²). We now compute the CRLB. With g(θ) = 1/θ, one has ġ(θ) = −1/θ². Moreover,
sθ(x) = 1/θ − x,
and hence
I(θ) = varθ(X) = 1/θ².
The CRLB for n observations is thus
ġ²(θ)/(nI(θ)) = 1/(nθ²).
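Since varθ(X̄) = 1/(nθ²) equals this bound, X̄ attains the CRLB here. A simulation sketch (sample size and replication count are arbitrary choices) comparing the empirical variance of X̄ with 1/(nθ²):

```python
import random, statistics

random.seed(1)
theta, n, reps = 2.0, 50, 4000
# variance of Xbar across many replications of a sample of size n
xbars = [statistics.fmean(random.expovariate(theta) for _ in range(n)) for _ in range(reps)]
emp_var = statistics.variance(xbars)
crlb = 1 / (n * theta ** 2)      # the bound computed above
print(emp_var, crlb)             # the two numbers nearly agree
```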
so
sθ(x) = −1 + x/θ,
and hence
I(θ) = varθ(X/θ) = varθ(X)/θ² = 1/θ.
One easily checks that X̄ reaches the CRLB for estimating θ.
Let now g(θ) := e^{−θ}. The UMVU estimator of g(θ) is
T := (1 − 1/n)^{∑_{i=1}^n Xi}.
To compute its variance, we first compute
Eθ T² = ∑_{k=0}^∞ (1 − 1/n)^{2k} e^{−nθ} (nθ)^k/k!
= e^{−nθ} ∑_{k=0}^∞ [(n − 1)²θ/n]^k/k!
= e^{−nθ} exp[(n − 1)²θ/n] = exp[(1 − 2n)θ/n].
Thus,
varθ(T) = exp[(1 − 2n)θ/n] − e^{−2θ},
which is > θe^{−2θ}/n, and ≈ θe^{−2θ}/n for n large. The CRLB for n observations is
ġ²(θ)/(nI(θ)) = θe^{−2θ}/n.
We conclude that T does not reach the CRLB, but the gap is small for n large.
The next lemma shows that the CRLB can only be reached within exponential families, and is thus only tight in a rather limited context.
Lemma 5.6.1 Assume Conditions I and II, with sθ = ṗθ/pθ. Suppose T is unbiased for g(θ), and that T reaches the Cramér-Rao lower bound. Then {Pθ : θ ∈ Θ} forms a one-dimensional exponential family: there exist functions c(θ), d(θ), and h(x) such that for all θ,
pθ(x) = exp[c(θ) T(x) − d(θ)] h(x),
i.e., then the correlation between T and sθ (X) is ±1. Thus, there exist constants
a(θ) and b(θ) (depending on θ), such that
max_{a∈Rp} |a′c|²/(a′Va) = c′V⁻¹c.
max_{b∈Rp} |b′d|²/‖b‖² = ‖d‖² = d′d = c′V⁻¹c. □
We will now present the CRLB in higher dimensions. To simplify the exposition,
we will not carefully formulate the regularity conditions, that is, we assume
derivatives to exist and that we can interchange differentiation and integration
at suitable places.
g : Θ → R,
ġ(θ) := (∂g(θ)/∂θ1, . . . , ∂g(θ)/∂θk)′.
sθ(·) := (∂ log pθ/∂θ1, . . . , ∂ log pθ/∂θk)′.
varθ(T) ≥ max_{a∈Rk} |a′ġ(θ)|²/(a′I(θ)a) = ġ′(θ) I(θ)⁻¹ ġ(θ). □
Corollary 5.7.1 As a consequence, one obtains a lower bound for unbiased estimators of higher-dimensional parameters of interest. As an example, let g(θ) := θ = (θ1, . . . , θk)′, and suppose that T ∈ Rk is an unbiased estimator of θ: Eθ T = θ ∀ θ. Then, for all a ∈ Rk, a′T is an unbiased estimator of a′θ. Since a′θ has derivative a, the CRLB gives
varθ(a′T) ≥ a′I(θ)⁻¹a.
But
varθ(a′T) = a′Covθ(T)a.
So for all a,
a′Covθ(T)a ≥ a′I(θ)⁻¹a;
in other words, Covθ(T) ≥ I(θ)⁻¹, that is, Covθ(T) − I(θ)⁻¹ is positive semi-definite.
Chapter 6
It holds that
F(q^F_inf(u)) ≥ u
and, for all h > 0,
F(q^F_sup(u) − h) ≤ u.
Hence
F(q^F_sup(u)−) := lim_{h↓0} F(q^F_sup(u) − h) ≤ u.
does not depend on θ. We note that to find a pivot is unfortunately not always possible. However, if we do have a pivot Z(X, γ) with distribution G, we can compute its quantile functions
qL := q^G_sup(α/2), qR := q^G_inf(1 − α/2),
and the test
φ(X, γ0) := 1 if Z(X, γ0) ∉ [qL, qR], and φ(X, γ0) := 0 else.
Then the test has level α for testing Hγ0, with γ0 = g(θ0):
IPθ0(φ(X, g(θ0)) = 1) = IPθ0(Z(X, g(θ0)) > qR) + IPθ0(Z(X, g(θ0)) < qL)
= 1 − G(qR) + G(qL−) ≤ 1 − (1 − α/2) + α/2 = α.
Asymptotic pivot Let Zn (X1 , . . . , Xn , γ) be some function of the data and the
parameter of interest, defined for each sample size n. We call Zn (X1 , . . . , Xn , γ)
an asymptotic pivot if for all θ ∈ Θ,
Θ := {θ = (µ, F0) : µ ∈ R, F0 ∈ F0}.
Then
Zn(X1, . . . , Xn, µ) := √n (X̄n − µ)/Sn
is an asymptotic pivot, with limiting distribution G = Φ.
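A sketch of how this asymptotic pivot is used to build a confidence interval (data are simulated here since none are given; 1.96 is the 0.975-quantile of the limiting N(0, 1)); over many repetitions the interval should cover µ in roughly 95% of cases:

```python
import random, statistics, math

def pivot_interval(x, z=1.96):
    """Interval from the asymptotic pivot sqrt(n)(Xbar - mu)/S_n ~ N(0,1)."""
    n = len(x)
    xbar = statistics.fmean(x)
    h = z * statistics.stdev(x) / math.sqrt(n)
    return xbar - h, xbar + h

random.seed(7)
mu, n, reps = 3.0, 50, 2000
hits = 0
for _ in range(reps):
    lo, hi = pivot_interval([mu + random.gauss(0, 2) for _ in range(n)])
    hits += lo < mu < hi
print(hits / reps)   # empirical coverage, close to the nominal 0.95
```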
I := [γ, γ̄],
where the boundaries γ = γ(X) and γ̄ = γ̄(X) depend (only) on the data X.
Let for each γ0 ∈ R, φ(X, γ0 ) ∈ {0, 1} be a test at level α for the hypothesis
Hγ0 : γ = γ0 .
Thus, we reject Hγ0 if and only if φ(X, γ0 ) = 1, and
IPθ:γ=γ0 (φ(X, γ0 ) = 1) ≤ α.
Then
I(X) := {γ : φ(X, γ) = 0}
is a (1 − α)-confidence set for γ.
Conversely, if I(X) is a (1 − α)-confidence set for γ, then, for all γ0, the test φ(X, γ0) defined as
φ(X, γ0) := 1 if γ0 ∉ I(X), and φ(X, γ0) := 0 else,
is a test at level α of Hγ0.
When comparing confidence intervals, the aim is usually to take the one with smallest length on average (keeping the level at 1 − α). In the case of tests, we look for the one with maximal power. Recall that the power of a test φ(X, γ0) at a value θ with g(θ) ≠ γ0 is Pθ(φ(X, γ0) = 1).
Consider the following data, concerning weight gain/loss. The control group x
had their usual diet, and the treatment group y obtained a special diet, designed
for preventing weight gain. The study was carried out to test whether the diet
works.
control group x   treatment group y   rank(x)   rank(y)
 5                 6                   7         8
 0                -5                   3         2
16                -6                  10         1
 2                 1                   5         4
 9                 4                   9         6
sums: Σx = 32, Σy = 0
Table 2
Let n (m) be the sample size of the control group x (treatment group y). The
mean in group x (y) is denoted by x̄ (ȳ). The sums of squares are
SSx := Σ_{i=1}^n (xi − x̄)² and SSy := Σ_{j=1}^m (yj − ȳ)². So in this study, one has
n = m = 5 and the values x̄ = 6.4, ȳ = 0, SSx = 161.2 and SSy = 114. The ranks, rank(x)
and rank(y), are the rank-numbers when putting all n + m data together (e.g.,
y3 = −6 is the smallest observation and hence rank(y3) = 1).
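These summary statistics are easy to verify (a small sketch; the variable names are ours):

```python
x = [5, 0, 16, 2, 9]    # control group
y = [6, -5, -6, 1, 4]   # treatment group

xbar = sum(x) / len(x)                     # 6.4
ybar = sum(y) / len(y)                     # 0.0
ss_x = sum((xi - xbar) ** 2 for xi in x)   # 161.2
ss_y = sum((yj - ybar) ** 2 for yj in y)   # 114.0

# Ranks in the pooled sample (no ties here, so a plain sort suffices).
pooled = sorted(x + y)
rank = {v: i + 1 for i, v in enumerate(pooled)}
rank_x = [rank[v] for v in x]              # [7, 3, 10, 5, 9]
rank_y = [rank[v] for v in y]              # [8, 2, 1, 4, 6]
```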
We assume that the data are realizations of two independent samples, say
X = (X1 , . . . , Xn ) and Y = (Y1 , . . . , Ym ), where X1 , . . . , Xn are i.i.d. with
distribution function FX , and Y1 , . . . , Ym are i.i.d. with distribution function
FY . The distribution functions FX and FY may be in whole or in part unknown.
The testing problem is:
6.5. AN ILLUSTRATION: THE TWO-SAMPLE PROBLEM 63
H0 : FX = FY
against a one- or two-sided alternative.
The classical two-sample student test is based on the assumption that the data
come from a normal distribution. Moreover, it is assumed that the variance of
FX and FY are equal. Thus,
(FX , FY ) ∈ { FX = Φ((· − µ)/σ), FY = Φ((· − (µ + γ))/σ) : µ ∈ R, σ > 0, γ ∈ Γ } .
Here, Γ ⊃ {0} is the range of shifts in mean one considers, e.g. Γ = R for
two-sided situations, and Γ = (−∞, 0] for a one-sided situation. The testing
problem reduces to
H0 : γ = 0.
We now look for a pivot Z(X, Y, γ). Define the sample means
X̄ := (1/n) Σ_{i=1}^n Xi ,   Ȳ := (1/m) Σ_{j=1}^m Yj .
Note that X̄ has expectation µ and variance σ 2 /n, and Ȳ has expectation µ + γ
and variance σ 2 /m. So Ȳ − X̄ has expectation γ and variance
σ²/n + σ²/m = σ² (n + m)/(nm).

Hence

√(nm/(n + m)) · (Ȳ − X̄ − γ)/σ

is N(0, 1)-distributed.

To arrive at a pivot, we now plug in an estimate S of the unknown σ (for
instance the pooled estimate S² := (SSX + SSY)/(n + m − 2)):

Z(X, Y, γ) := √(nm/(n + m)) · (Ȳ − X̄ − γ)/S .
So the p-value is in this case only a little larger than the level α = 0.05.
The continuity assumption ensures that all observations are distinct, that is,
there are no ties. We can then put them in strictly increasing order. Let
N = n + m and Z1 , . . . , ZN be the pooled sample
Zi := Xi , i = 1, . . . , n, Zn+j := Yj , j = 1, . . . , m.
Define
Ri := rank(Zi ), i = 1, . . . , N.
and let
Z(1) < · · · < Z(N )
be the order statistics of the pooled sample (so that Zi = Z(Ri ) (i = 1, . . . , n)).
The Wilcoxon test statistic is

T = T^Wilcoxon := Σ_{i=1}^n Ri .

One may check that this test statistic T can alternatively be written as

T = #{Yj < Xi} + n(n + 1)/2 .
For example, for the data in Table 2, the observed value of T is 34, and
#{yj < xi} = 19 ,   n(n + 1)/2 = 15 .
Large values of T mean that the Xi are generally larger than the Yj , and hence
indicate evidence against H0 .
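For the data of Table 2, both expressions for T can be checked directly (a sketch, our variable names):

```python
x = [5, 0, 16, 2, 9]    # control group
y = [6, -5, -6, 1, 4]   # treatment group
n = len(x)

pooled = sorted(x + y)
rank = {v: i + 1 for i, v in enumerate(pooled)}

T_ranks = sum(rank[v] for v in x)                  # sum of the x-ranks
pairs = sum(1 for xi in x for yj in y if yj < xi)  # #{y_j < x_i} = 19
T_pairs = pairs + n * (n + 1) // 2                 # 19 + 15

assert T_ranks == T_pairs == 34
```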
To check whether or not the observed value of the test statistic is compatible
with the null-hypothesis, we need to know its null-distribution, that is, the
distribution under H0 . Under H0 : FX = FY , the vector of ranks (R1 , . . . , Rn )
has the same distribution as n random draws without replacement from the
numbers {1, . . . , N}. That is, if we let R := (R1, . . . , RN) denote the full vector
of ranks, then for every permutation

r := (r1, . . . , rn, rn+1, . . . , rN)

of {1, . . . , N},

IP(R = r) = 1/N! .
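Because the null-distribution of T depends only on the ranks, it can be enumerated exactly: under H0 the set of x-ranks is a uniformly chosen n-subset of {1, . . . , N}. A sketch for n = m = 5 as in Table 2:

```python
from itertools import combinations
from math import comb

n, N = 5, 10
counts = {}
for subset in combinations(range(1, N + 1), n):  # possible rank sets for group x
    t = sum(subset)
    counts[t] = counts.get(t, 0) + 1

total = comb(N, n)                               # 252 equally likely rank subsets
p_value = sum(c for t, c in counts.items() if t >= 34) / total
print(p_value)                                   # one-sided P(T >= 34) = 28/252
```

This exact enumeration is exactly what tables of the Wilcoxon test tabulate for small n and m.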
R = r ⇔ Q = r−1 := q,
where r−1 is the inverse permutation of r.1 For all permutations q and all
measurable maps f ,
f(Z1, . . . , ZN) =D f(Zq1, . . . , ZqN),

where =D denotes equality in distribution. In particular, for all measurable A,

IP( (Z(1), . . . , Z(N)) ∈ A, R = r ) = IP( (Zq1, . . . , ZqN) ∈ A, Zq1 < · · · < ZqN ),

where r = q−1. The latter probability is the same for every permutation q, and the
N! events {R = r} are disjoint with union of probability one. Thus we have shown
that for all measurable A, and for all r,

IP( (Z(1), . . . , Z(N)) ∈ A, R = r ) = (1/N!) IP( (Z(1), . . . , Z(N)) ∈ A ).   (6.1)

Since IP(R = r) = 1/N!, plugging this back into (6.1) gives the product structure

IP( (Z(1), . . . , Z(N)) ∈ A, R = r ) = IP( (Z(1), . . . , Z(N)) ∈ A ) IP( R = r ),

which holds for all measurable A. In other words, (Z(1), . . . , Z(N)) and R are
independent. □
¹ Here is an example, with N = 3:

(z1, z2, z3) = ( 5 , 6 , 4 )
(r1, r2, r3) = ( 2 , 3 , 1 )
(q1, q2, q3) = ( 3 , 1 , 2 )
Because Wilcoxon’s test is only based on the ranks, and does not rely on the
assumption of normality, one may expect that, when the data are in fact normally
distributed, Wilcoxon’s test has less power than Student’s test. The loss
of power is however small. Let us formulate this more precisely, in terms of
the relative efficiency of the two tests. Let the significance level α be fixed, and
let β be the required power. Let n and m be equal, N = 2n be the total
sample size, and N Student (N Wilcoxon ) be the number of observations needed to
reach power β using Student’s (Wilcoxon’s) test. Consider shift alternatives,
i.e. FY (·) = FX (· − γ), (with, in our example, γ < 0). One can show that
N Student /N Wilcoxon is approximately .95 when the normal model is correct. For
a large class of distributions, the ratio N Student /N Wilcoxon ranges from .85 to ∞,
that is, when using Wilcoxon one generally has very limited loss of efficiency as
compared to Student, and one may in fact have a substantial gain of efficiency.
Chapter 7

The Neyman Pearson Lemma and UMP tests

Recall that a test φ has level α for H0 : θ ∈ Θ0 if

sup_{θ∈Θ0} Eθ φ(X) ≤ α.
We consider testing
H0 : θ = θ0
against the alternative
H1 : θ = θ1 .
Define the risk R(θ, φ) of a test φ as the probability of error of first and second
kind:
            Eθ φ(X),      θ = θ0
R(θ, φ) :=
            1 − Eθ φ(X),  θ = θ1 .
70 CHAPTER 7. THE NEYMAN PEARSON LEMMA AND UMP TESTS
We let p0 (p1 ) be the density of Pθ0 (Pθ1 ) with respect to some dominating
measure ν (for example ν = Pθ0 + Pθ1 ). A Neyman Pearson test is
        1  if p1/p0 > c
φNP :=  q  if p1/p0 = c
        0  if p1/p0 < c .
Proof.

R(θ1, φNP) − R(θ1, φ) = ∫ (φ − φNP) p1

= ∫_{p1/p0>c} (φ − φNP) p1 + ∫_{p1/p0=c} (φ − φNP) p1 + ∫_{p1/p0<c} (φ − φNP) p1

≤ c ∫_{p1/p0>c} (φ − φNP) p0 + c ∫_{p1/p0=c} (φ − φNP) p0 + c ∫_{p1/p0<c} (φ − φNP) p0

= c [ R(θ0, φ) − R(θ0, φNP) ]. □
7.2.1 An example

Let X1, . . . , Xn be i.i.d. Bernoulli(θ)-distributed, θ ∈ (0, 1):

Pθ(Xi = 1) = 1 − Pθ(Xi = 0) = θ,

and write T := Σ_{i=1}^n Xi.
We consider three testing problems. The chosen level in all three problems is
α = 0.05.
Problem 1
We want to test, at level α, the hypothesis

H0 : θ = 1/2 =: θ0 ,

against the alternative

H1 : θ = 1/4 =: θ1 .
7.2. UNIFORMLY MOST POWERFUL TESTS 71
q = ( α − Pθ0(T ≤ t0 − 1) ) / Pθ0(T = t0) .
Because φ = φNP is the Neyman Pearson test, it is the most powerful test (at
level α) (see the Neyman Pearson Lemma in Section 7.1). The power of the
test is β(θ1 ), where
β(θ) := Eθ φ(T ).
Numerical Example

Let n = 7. Then

Pθ0(T = 0) = (1/2)^7 = 0.0078,

Pθ0(T = 1) = 7 · (1/2)^7 = 0.0546,

Pθ0(T ≤ 1) = 0.0624 > α,

so we choose t0 = 1. Moreover

q = (0.05 − 0.0078)/0.0546 = 422/546 ≈ 0.77.

The power is now

β(θ1) = Pθ1(T = 0) + q Pθ1(T = 1) = (3/4)^7 + q · 7 · (1/4)(3/4)^6 ≈ 0.37.
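The constants t0 and q and the resulting power can be cross-checked directly (a sketch using only the standard library; exact arithmetic gives q = 27/35 ≈ 0.771, while 422/546 comes from rounded probabilities):

```python
from math import comb

def binom_pmf(n, theta, t):
    return comb(n, t) * theta**t * (1 - theta)**(n - t)

def np_test_constants(n, theta0, alpha):
    # t0 is the first t with P(T <= t) > alpha under theta0; randomize at T = t0.
    cdf = 0.0
    for t in range(n + 1):
        p_t = binom_pmf(n, theta0, t)
        if cdf + p_t > alpha:
            return t, (alpha - cdf) / p_t
        cdf += p_t

def power(n, theta, t0, q):
    return sum(binom_pmf(n, theta, t) for t in range(t0)) + q * binom_pmf(n, theta, t0)

t0, q = np_test_constants(7, 0.5, 0.05)  # t0 = 1, q ≈ 0.771
beta = power(7, 0.25, t0, q)             # ≈ 0.374
```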
Problem 2
Consider now testing

H0 : θ = 1/2 =: θ0 ,

against

H1 : θ < 1/2 .
In Problem 1, the construction of the test φ is independent of the value θ1 < θ0 .
So φ is most powerful for all θ1 < θ0 . We say that φ is uniformly most powerful
for the alternative H1 : θ < θ0 .
Problem 3
We now want to test

H0 : θ ≥ 1/2 ,

against the alternative

H1 : θ < 1/2 .
Recall the function
β(θ) := Eθ φ(T ).
The level of φ is defined as
sup β(θ).
θ≥1/2
We have
β(θ) = Pθ (T ≤ t0 − 1) + qPθ (T = t0 )
= (1 − q)Pθ (T ≤ t0 − 1) + qPθ (T ≤ t0 ).
Observe that if θ1 < θ0, small values of T are more likely under Pθ1 than under
Pθ0:

Pθ1(T ≤ t) > Pθ0(T ≤ t), ∀ t ∈ {0, 1, . . . , n − 1}.
Thus, β(θ) is a decreasing function of θ. It follows that the level of φ is

sup_{θ≥1/2} β(θ) = β(1/2) = α.

Hence, φ is uniformly most powerful for H0 : θ ≥ 1/2 against H1 : θ < 1/2.
We now study the situation where Θ is an interval in R, and the testing problem
is
H0 : θ ≤ θ0 ,
against
H1 : θ > θ0 .
where q0 and c0 are chosen in such a way that Eθ0 φNP (X) = α. We have
pθ1(x)/pθ0(x) = exp[ (c(θ1) − c(θ0)) T(x) − (d(θ1) − d(θ0)) ] .
Hence, as c(θ1) > c(θ0),

pθ1(x)/pθ0(x) > c ⇔ T(x) > t ,   pθ1(x)/pθ0(x) = c ⇔ T(x) = t ,   pθ1(x)/pθ0(x) < c ⇔ T(x) < t ,

where t is some constant (depending on c, θ0 and θ1). Therefore, φ = φNP. It
follows that φ is most powerful for H0 : θ = θ0 against H1 : θ = θ1 . Because φ
does not depend on θ1 , it is therefore UMP for H0 : θ = θ0 against H1 : θ > θ0 .
We will now prove that β(θ) := Eθ φ(T) is increasing in θ. Let ϑ > θ, and let
s0 be a point such that p̄ϑ(t) − p̄θ(t) ≤ 0 for t ≤ s0 and p̄ϑ(t) − p̄θ(t) ≥ 0 for
t ≥ s0 (such a point exists because the likelihood ratio p̄ϑ/p̄θ is increasing in t).
But then, since φ(t) ≤ φ(s0) for t ≤ s0 and φ(t) ≥ φ(s0) for t ≥ s0,

β(ϑ) − β(θ) = ∫ φ(t) [p̄ϑ(t) − p̄θ(t)] dν̄(t)

= ∫_{t≤s0} φ(t) [p̄ϑ(t) − p̄θ(t)] dν̄(t) + ∫_{t≥s0} φ(t) [p̄ϑ(t) − p̄θ(t)] dν̄(t)

≥ φ(s0) ∫ [p̄ϑ(t) − p̄θ(t)] dν̄(t) = 0.
Hence, φ has level α. Because any other test φ0 with level α must have
Eθ0 φ0(X) ≤ α, we conclude that φ is UMP. □
Example 7.3.1 Test for the variance of the normal distribution
Let X1 , . . . , Xn be an i.i.d. sample from the N (µ0 , σ 2 )-distribution, with µ0
known, and σ 2 > 0 unknown. We want to test
H0 : σ 2 ≤ σ02 ,
against
H1 : σ 2 > σ02 .
The density of the sample is
n
1 X 2 n 2
pσ2 (x1 , . . . , xn ) = exp − 2 (xi − µ0 ) − log(2πσ ) .
2σ 2
i=1
The function c(σ 2 )is strictly increasing in σ 2 . So we let φ be the test which
rejects H0 for large values of T (X). Note that under H0 , the statistic T (X)/σ02
has a χ2 -distribution with n degrees of freedom, the χ2n -distribution (see Section
12.2 for a definition). So we can find the critical value from the quantile of the
χ2n -distribution.
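Statistical software provides χ²-quantiles directly; with only the standard library the critical value can be approximated by simulation, since under H0 the statistic T(X)/σ0² is a sum of n squared standard normals (a sketch; for n = 10 and α = 0.05 the exact value is χ²_{10;0.95} ≈ 18.31):

```python
import random

def chi2_quantile_mc(df, p, reps=100_000, seed=7):
    """Monte Carlo chi^2_df quantile: simulate sums of df squared N(0,1) variables."""
    rng = random.Random(seed)
    draws = sorted(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
                   for _ in range(reps))
    return draws[int(p * reps)]

def reject(x, mu0, sigma0_sq, alpha=0.05):
    # Reject H0: sigma^2 <= sigma0^2 for large T(X)/sigma0^2.
    T = sum((xi - mu0) ** 2 for xi in x)
    return T / sigma0_sq > chi2_quantile_mc(len(x), 1 - alpha)

crit = chi2_quantile_mc(df=10, p=0.95)  # roughly 18.3
```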
We can take

c(θ) = log( θ/(1 − θ) ),

which is strictly increasing in θ. Then T(X) = Σ_{i=1}^n Xi.
Right-sided alternative
H0 : θ ≤ θ0 ,
against
H1 : θ > θ0 .
The UMP test is

          1   T > tR
φR(T) :=  qR  T = tR
          0   T < tR .
The function βR (θ) := Eθ φR (T ) is strictly increasing in θ.
Left-sided alternative
H0 : θ ≥ θ0 ,
against H1 : θ < θ0 .
The UMP test is

          1   T < tL
φL(T) :=  qL  T = tL
          0   T > tL .
The function βL (θ) := Eθ φL (T ) is strictly decreasing in θ.
Two-sided alternative
H0 : θ = θ0 ,
against
H1 : θ 6= θ0 .
The test φR is most powerful for θ > θ0 , whereas φL is most powerful for θ < θ0 .
Hence, a UMP test does not exist for the two-sided alternative.
Note Let φR be a right-sided test as defined in Theorem 7.3.1 with level at most
α and let φL be the similarly defined left-sided test. Then βR(θ) = Eθ φR(T) is
strictly increasing, and βL (θ) = Eθ φL (T ) is strictly decreasing. The two-sided
test φ of Theorem 7.5.1 is a superposition of two one-sided tests. Writing
β(θ) = Eθ φ(T ),
and hence

tR − nµ0 = √n σ0 Φ^{−1}(1 − α/2) ,   tL − nµ0 = −√n σ0 Φ^{−1}(1 − α/2) .
We now study the case where Θ is an interval in R2 . We let θ = (β, γ), and we
assume that γ is the parameter of interest. We aim at testing
H0 : γ ≤ γ0 ,
against the alternative
H1 : γ > γ0 .
We assume moreover that we are dealing with an exponential family in canonical
form:
pθ (x) = exp[βT1 (x) + γT2 (x) − d(θ)]h(x).
Then we can restrict ourselves to tests φ(T ) depending only on the sufficient
statistic T = (T1 , T2 ).
Lemma 7.6.1 Suppose that {β : (β, γ0 ) ∈ Θ} contains an open interval. Let
                1       if T2 > t0(T1)
φ(T1, T2) :=    q(T1)   if T2 = t0(T1)
                0       if T2 < t0(T1) ,

where the constants t0(T1) and q(T1) are allowed to depend on T1, and are
chosen in such a way that

Eγ0( φ(T1, T2) | T1 ) = α.
Then φ is UMPU.
Sketch of proof.
Let p̄θ (t1 , t2 ) be the density of (T1 , T2 ) with respect to dominating measure ν̄:
We assume ν̄(t1, t2) = ν̄1(t1)ν̄2(t2) is a product measure. The conditional
density of T2 given T1 = t1 is then

p̄γ(t2|t1) = exp[ γ t2 − d(γ|t1) ] h̄(t1, t2),

where

d(γ|t1) := log( ∫ exp[γ s2] h̄(t1, s2) dν̄2(s2) ) .
7.6. CONDITIONAL TESTS ? 79
Proof of Result 1.
Conversely,

sup_{γ≤γ0} E(β,γ) φ(T) = sup_{γ≤γ0} E(β,γ)[ Eγ( φ(T) | T1 ) ] ≤ α,

since Eγ( φ(T) | T1 ) ≤ α for γ ≤ γ0. I.e., φ is unbiased.
Result 3 Let φ0 be a test with level
we know that
E(β,γ0 ) φ0 (T ) ≤ α0 , ∀ β.
Conversely, the unbiasedness implies that for all γ > γ0 ,
E(β,γ) φ0 (T ) ≥ α0 , ∀ β.
E(β,γ0 ) φ0 (T ) = α0 , ∀ β.
80 CHAPTER 7. THE NEYMAN PEARSON LEMMA AND UMP TESTS
E(β,γ0 ) (φ0 (T ) − α0 ) = 0, ∀ β.
The joint density of the two samples factorizes as

exp[ log(λ) Σ_{i=1}^n xi + log(µ) Σ_{j=1}^m yj − nλ − mµ ] ∏_{i=1}^n (1/xi!) ∏_{j=1}^m (1/yj!)

= exp[ log(µ) ( Σ_{i=1}^n xi + Σ_{j=1}^m yj ) + log(λ/µ) Σ_{i=1}^n xi − nλ − mµ ] h(x, y)

= exp[ β T1(x, y) + γ T2(x) − d(θ) ] h(x, y),

where

T1(X, Y) := Σ_{i=1}^n Xi + Σ_{j=1}^m Yj ,   T2(X) := Σ_{i=1}^n Xi ,

and

h(x, y) := ∏_{i=1}^n (1/xi!) ∏_{j=1}^m (1/yj!) .
Chapter 8

Comparison of estimators

One can compare estimators in terms of their risk. This is done in Sections 8.1,
8.2 and 8.3. Section 8.4 briefly addresses sensitivity and robustness and Section
8.5 discusses computational aspects. These last two sections only sketch the
main ideas, without going into detail.
84 CHAPTER 8. COMPARISON OF ESTIMATORS
Let S = S(X) be sufficient. Knowing the sufficient statistic S one can forget
about the original data X without losing information. Indeed, the following
lemma says that any decision based on the original data X can be replaced by
a randomized one which depends only on S and which has the same risk.
Lemma 8.2.1 Suppose S is sufficient for θ. Let d : X → A be some decision.
Then there is a randomized decision δ(S) that only depends on S, such that
R(θ, δ(S)) = R(θ, d(X)) for all θ.
Proof. Let Xs∗ be a random variable with distribution P (X ∈ ·|S = s). Then,
by construction, for all possible s, the conditional distribution, given S = s,
of Xs∗ and X are equal. It follows that X and XS∗ have the same distribution.
Formally, let us write Qθ for the distribution of S. Then
Pθ(X∗S ∈ ·) = ∫ P(X∗s ∈ ·|S = s) dQθ(s) = ∫ P(X ∈ ·|S = s) dQθ(s) = Pθ(X ∈ ·). □
8.3 Rao-Blackwell
The Lemma of Rao-Blackwell says that in the case of convex loss an estimator
based on the original data X can be replaced by one based only on S without
increasing the risk. Randomization is not needed here.
Lemma 8.3.1 (Rao Blackwell) Suppose that S is sufficient for θ. Suppose
moreover that the action space A ⊂ Rp is convex, and that for each θ, the
map a 7→ L(θ, a) is convex. Let d : X → A be a decision, and define d0 (s) :=
E(d(X)|S = s) (assumed to exist). Then R(θ, d0) ≤ R(θ, d) for all θ.

Proof. Jensen’s inequality tells us that for a convex function g,

E(g(X)) ≥ g(EX).

Hence, ∀ θ,

E[ L(θ, d(X)) | S = s ] ≥ L( θ, E[d(X)|S = s] ) = L(θ, d0(s)),

and therefore

R(θ, d) = Eθ E[ L(θ, d(X)) | S ] ≥ Eθ L(θ, d0(S)) = R(θ, d0). □
Example 8.3.1 Mean square error

Let T be an estimator of g(θ) ∈ R and take quadratic loss, so that the risk
R(θ, T) = Eθ(T − g(θ))² is the mean square error. Let S be sufficient and
T̃ := E(T|S). Then by the Rao-Blackwell Lemma,

R(θ, T̃) ≤ R(θ, T) ∀ θ.

The mean square error can be decomposed in the variance term and the squared
bias term. Since E T̃ = E T by the iterated expectations lemma, we thus have
varθ(T̃) ≤ varθ(T) for all θ.
Compare with Lemma 5.2.2 and the result of Lehmann-Scheffé in Section 5.3.
If (m) := ∞, we say that with m outliers the estimator can break down. The
break down point is defined as
Today, the data are often high-dimensional and the number of parameters p
is also very large. Maximum likelihood estimation, for example, requires maximization
of a function of p variables, and this can be very hard if p is large. The
more so if the likelihood is not concave, or if, e.g., some parameters are
integer-valued. We will moreover examine Bayesian theory in Chapter 10.
There one needs to find so-called “posterior distributions”, which is typically
computationally very hard (this is where MCMC (Markov Chain Monte Carlo)
algorithms come in). Clearly, an estimator which cannot be computed (say in
polynomial time) is of little practical value.
Chapter 9
Equivariant statistics
Xi = θ + εi , i = 1, . . . , n.
T (x1 + c, . . . , xn + c) = T (x1 , . . . , xn ) + c.
88 CHAPTER 9. EQUIVARIANT STATISTICS
Examples Location equivariant statistics include:

• the sample mean T = X̄,

• the sample median T = X((n+1)/2) (n odd),

• ···
Definition 9.1.2 A loss function L(θ, a) is called location invariant if for all
c ∈ R,
L(θ + c, a + c) = L(θ, a), (θ, a) ∈ R2 .
or equivalently,
R(0, T) = min_{d equivariant} R(0, d).
Proof.
(⇒) Trivial.
(⇐) Replacing X by X + c leaves Y unchanged (i.e. Y is invariant). So

T(X + c) = T(Y) + Xn + c = T(X) + c. □
9.1. EQUIVARIANCE IN THE LOCATION MODEL 89
Moreover, let
T ∗ (X) := T ∗ (Y) + Xn .
Then T ∗ is UMRE.
Proof. First, note that Y, and hence its distribution, does not depend on θ, so
that T∗ is indeed a statistic. It is also equivariant, by the previous lemma.
Let T be an equivariant statistic. Then T(X) = T(Y) + Xn. So

T(X) − θ = T(Y) + εn .

Hence

R(0, T) = E L0( T(Y) + εn ) = E[ E( L0(T(Y) + εn) | Y ) ].

But

E( L0(T(Y) + εn) | Y ) ≥ min_v E( L0(v + εn) | Y ) = E( L0(T∗(Y) + εn) | Y ).

Hence,

R(0, T) ≥ E[ E( L0(T∗(Y) + εn) | Y ) ] = R(0, T∗). □
For quadratic loss

L(θ, a) := (a − θ)²,

the minimizing constant is

T∗(Y) = arg min_v E( (v + εn)² | Y ) = −E(εn | Y),

and hence

T∗(X) = Xn − E(εn | Y).
This estimator is called the Pitman estimator.
So
p0(x1 − z, . . . , xn − z) = 1{x(n) − 1/2 ≤ z ≤ x(1) + 1/2}.
Thus, writing
T1 := X(n) − 1/2, T2 := X(1) + 1/2,
the UMRE estimator T∗ is

T∗ = ∫_{T1}^{T2} z dz / ∫_{T1}^{T2} dz = (T1 + T2)/2 = (X(1) + X(n))/2 .
Y(x) = Y(x′) ⇔ ∃ c : x = x′ + c.
Then
T ∗ (X) := T ∗ (Y) + d(X)
is UMRE.
Proof. Let T be an equivariant estimator. Then

T(X) = T(Y) + d(X).
Hence

E( L0(T(ε)) | Y ) = E( L0(T(Y) + d(ε)) | Y )

≥ min_v E( L0(v + d(ε)) | Y ).
For quadratic loss (L0(a) = a²), the definition of T∗(Y) in the above theorem
is
T ∗ (Y) = −E(d(ε)|Y) = −E0 (d(X)|X − d(X)),
so that
T ∗ (X) = d(X) − E0 (d(X)|X − d(X)).
So for an equivariant estimator T , we have
So then T is UMRE.
To check independence, Basu’s lemma can be useful.
Basu’s lemma Let X have distribution Pθ , θ ∈ Θ. Suppose T is sufficient
and complete, and that Y = Y (X) has a distribution that does not depend on
θ. Then, for all θ, T and Y are independent under Pθ .
Proof. Let A be some measurable set, and define

h(t) := P(Y ∈ A) − P(Y ∈ A|T = t).

Since T is sufficient and the distribution of Y does not depend on θ, the
function h does not depend on θ, and Eθ h(T) = 0 for all θ. By the completeness
of T,

h(T) = 0, Pθ −a.s., ∀ θ,

in other words,

P(Y ∈ A|T) = P(Y ∈ A), Pθ −a.s., ∀ θ.

Since A was arbitrary, we thus have that the conditional distribution of Y given
T is equal to the unconditional distribution:
Location-scale model
We assume
Xi = µ + σεi , i = 1, . . . , n.
The unknown parameter is θ = (µ, σ), with µ ∈ R a location parameter and
σ > 0 a scale parameter. The parameter space Θ and action space A are both
R × R+ (R+ := (0, ∞)). The distribution of ε = (1 , . . . , n ) is assumed to be
known.
Definition 9.2.1 A statistic T = T (X) = (T1 (X), T2 (X)) is called location-
scale equivariant if for all constants b ∈ R, c ∈ R+ , and all x = (x1 , . . . , xn ),
T1(b + cx1, . . . , b + cxn) = b + cT1(x1, . . . , xn),

and

T2(b + cx1, . . . , b + cxn) = cT2(x1, . . . , xn).
Definition 9.2.2 A loss function L(µ, σ, a1 , a2 ) is called location-scale invari-
ant if for all (µ, a1 , b) ∈ R3 , (σ, a2 , c) ∈ R3+
where L0 (a1 , a2 ) := L(0, 1, a1 , a2 ). We conclude that the risk does not depend
on θ. We may therefore omit the subscript θ in the last expression:
or equivalently,
Then

T∗(X) := ( d1(X) + d2(X) T1∗(Y) , d2(X) T2∗(Y) )

is UMRE.
Proof. We have

Y = (X − d1(X))/d2(X) = (ε − d1(ε))/d2(ε).

So

ε = d1(ε) + d2(ε) Y.

Hence, for any equivariant T,

R(0, T) = E L0( T1(ε), T2(ε) )

= E L0( T1(d1(ε) + d2(ε)Y), T2(d1(ε) + d2(ε)Y) )

= E L0( d1(ε) + d2(ε) T1(Y), d2(ε) T2(Y) )

= E E[ L0( d1(ε) + d2(ε) T1(Y), d2(ε) T2(Y) ) | Y ]

≥ E min_{a1∈R, a2∈R+} E[ L0( d1(ε) + d2(ε) a1, d2(ε) a2 ) | Y ]

= E E[ L0( d1(ε) + d2(ε) T1∗(Y), d2(ε) T2∗(Y) ) | Y ]. □
For quadratic loss (L0(a1, a2) := a1²), the definition of T∗(Y) in the above
theorem is

T1∗(Y) = arg min_{a1∈R} E( (d1(ε) + d2(ε) a1)² | Y ).
We then have:
9.2. EQUIVARIANCE IN THE LOCATION-SCALE MODEL ? 95
Lemma 9.2.1 Suppose that d is equivariant, and sufficient and complete. Then
is UMRE.
Proof. By Basu’s lemma, d and Y are independent. Hence

E( (d1(ε) + d2(ε) a1)² | Y ) = E (d1(ε) + d2(ε) a1)².

Moreover,

arg min_{a1∈R} E (d1(ε) + d2(ε) a1)² = − E[ d1(ε) d2(ε) ] / E[ d2²(ε) ]. □
Example 9.2.1 UMRE of the mean of the normal distribution: σ 2
unknown
Let X1 , . . . , Xn be i.i.d. and N (µ, σ 2 )-distributed. Define
E d1(ε) = E ε̄ = 0,
and, from Example 9.1.2 (a consequence of Basu’s lemma), we know that d1 (X) =
X̄ and d2 (X) = S are independent. So
Chapter 10

Decision theory
In this chapter, we again denote the observable random variable (the data) by
X ∈ X , and its distribution by P ∈ P. The probability model is P := {Pθ : θ ∈
Θ}, with θ an unknown parameter.
In particular cases, we apply the results with X being replaced by a vector
X = (X1, . . . , Xn), with X1, . . . , Xn i.i.d. with distribution P ∈ {Pθ : θ ∈ Θ}
(so that X has distribution IP := ∏_{i=1}^n P ∈ { IPθ := ∏_{i=1}^n Pθ : θ ∈ Θ }).
We give a definition of risk, but now somewhat more formal than in Chapter 8.
Let A be the action space.
• A = R corresponds to estimating a real-valued parameter.
• A = {0, 1} corresponds to testing a hypothesis.
• A = [0, 1] corresponds to randomized tests.
• A = {intervals} corresponds to confidence intervals.
Given the observation X, we decide to take a certain action in A. Thus, a
decision is a map d : X → A, with d(X) being the action taken. If A = R a
decision is often called an estimator (denoted by T for instance). If A = {0, 1}
or A = [0, 1], a decision is often called a test (denoted by φ for instance).
A loss function (Verlustfunktion) is a map
L : Θ × A → R,
with L(θ, a) being the loss when the parameter value is θ and one takes action
a.
The risk of decision d(X) is defined as
R(θ, d) := Eθ L(θ, d(X)), θ ∈ Θ.
98 CHAPTER 10. DECISION THEORY
Note
The best decision d is the one with the smallest risk R(θ, d). However, θ is not
known. Thus, if we compare two decision functions d1 and d2 , we may run into
problems because the risks are not comparable: R(θ, d1 ) may be smaller than
R(θ, d2 ) for some values of θ, and larger than R(θ, d2 ) for other values of θ.
Example 10.1.3 Estimating the mean
We revisit Example 5.2.1. Let X ∈ R and let g(θ) = Eθ X := µ. We take
quadratic loss
L(θ, a) := |µ − a|2 .
Assume that varθ (X) = 1 for all θ. Consider the collection of decisions
dλ (X) := λX,
where 0 ≤ λ ≤ 1. Then
R(θ, dλ ) = var(λX) + bias2θ (λX)
= λ2 + (λ − 1)2 µ2 .
10.2. ADMISSIBLE DECISIONS 99
The best choice would here be

λopt := µ²/(1 + µ²),

because this value minimizes R(θ, dλ). However, λopt depends on the unknown
µ, so dλopt(X) is not an estimator.
A decision d0 is called strictly better than d if

R(θ, d0) ≤ R(θ, d) ∀ θ,

and

∃ θ : R(θ, d0) < R(θ, d).

When there exists a d0 that is strictly better than d, then d is called inadmissible.
Example 10.2.1 Using only one of the observations
Let, for n ≥ 2, X1 , . . . , Xn be i.i.d., with g(θ) := Eθ (Xi ) := µ, and var(Xi ) = 1
(for all i). Take quadratic loss L(θ, a) := |µ − a|2 . Consider d0 (X1 , . . . , Xn ) :=
X̄n and d(X1 , . . . , Xn ) := X1 . Then, ∀ θ,
R(θ, d0) = 1/n ,   R(θ, d) = 1,
so that d is inadmissible.
Note
We note that to show that a decision d is inadmissible, it suffices to find a
strictly better d0 . On the other hand, to show that d is admissible, one has to
verify that there is no strictly better d0 . So in principle, one then has to take
all possible d0 into account.
Let L(θ, a) := |g(θ) − a|r and d(X) := g(θ0 ), where θ0 is some fixed given value.
Lemma 10.2.1 Assume that Pθ0 dominates Pθ ¹ for all θ. Then d is admissible.

¹ Let P and Q be probability measures on the same measurable space. Then P dominates
Q if for all measurable B, P(B) = 0 implies Q(B) = 0 (Q is absolutely continuous with
respect to P).
Proof.
Suppose that d0 is better than d. Then in particular R(θ0, d0) ≤ R(θ0, d) = 0, i.e.,

Eθ0 |g(θ0) − d0(X)|^r ≤ 0.
This implies that
d0 (X) = g(θ0 ), Pθ0 −almost surely. (10.1)
Since by (10.1),
Pθ0 (d0 (X) 6= g(θ0 )) = 0
the assumption that Pθ0 dominates Pθ , ∀ θ, implies now
Pθ (d0 (X) 6= g(θ0 )) = 0, ∀ θ.
That is, for all θ, d0 (X) = g(θ0 ), Pθ -almost surely, and hence, for all θ, R(θ, d0 ) =
R(θ, d). So d0 is not strictly better than d. We conclude that d is admissible. □
Consider testing
H0 : θ = θ0
against the alternative
H1 : θ = θ1 .
Define the risk R(θ, φ) of a test φ as the probability of error of first and second
kind:
            Eθ φ(X),      θ = θ0
R(θ, φ) :=
            1 − Eθ φ(X),  θ = θ1 .
We let p0 (p1 ) be the density of Pθ0 (Pθ1 ) with respect to some dominating
measure ν (for example ν = Pθ0 + Pθ1 ). A Neyman Pearson test is (see Section
7.1)
        1  if p1/p0 > c
φNP :=  q  if p1/p0 = c
        0  if p1/p0 < c .
Thus, the minimax criterion concerns the best decision in the worst possible
case.
Lemma 10.3.1 A Neyman Pearson test φNP is minimax, if and only if R(θ0 , φNP ) =
R(θ1 , φNP ).
Proof. Let φ be a test, and write for j = 0, 1, rj := R(θj, φNP).

Suppose that r0 = r1 and that φNP is not minimax. Then, for some test φ,
Thus, the Bayes risk may be thought of as taking a weighted average of the
risks. For example, one may want to assign more weight to “important” values
of θ.
Proof.

rw(φ) = w0 ∫ φ p0 + w1 ∫ (1 − φ) p1

= ∫ φ (w0 p0 − w1 p1) + w1 .
i.e., the risk is large if the two densities are close to each other.
10.5. CONSTRUCTION OF BAYES ESTIMATORS 103
pθ (x) = p(x|θ), x ∈ X .
w(ϑ|x) = p(x|ϑ) w(ϑ)/p(x) ,   ϑ ∈ Θ, x ∈ X .
and

d(x) := arg min_a l(x, a). □
Corollary 10.5.1 The Bayes decision is
where

l(x, a) = E( L(θ, a) | X = x ) = ∫ L(ϑ, a) w(ϑ|x) dµ(ϑ)

= ∫ L(ϑ, a) gϑ(S(x)) w(ϑ) dµ(ϑ) h(x)/p(x).

So

dBayes(X) = arg min_{a∈A} ∫ L(ϑ, a) gϑ(S) w(ϑ) dµ(ϑ),
we have

l(x, φ) = φ w0 p0(x)/p(x) + (1 − φ) w1 p1(x)/p(x).

Thus,

                    1  if w1 p1 > w0 p0
arg min_φ l(·, φ) = q  if w1 p1 = w0 p0
                    0  if w1 p1 < w0 p0 .
Proof.

E(Z − a)² = var(Z) + (a − EZ)². □
Consider the case A = R and Θ ⊆ R. Let L(θ, a) := |θ − a|². Then the Bayes
decision is the posterior mean T := E(θ|X). Indeed, for quadratic loss, the
Bayes risk of an estimator T′ is

rw(T′) = E R(θ, T′) = E(θ − T′)² = E[ E( (θ − T′)² | X ) ]

= E[ var(θ|X) + (T − T′)² ] = E var(θ|X) + E(T − T′)²,

which is minimized by T′ = T.
Consider again the case Θ ⊆ R, and A = Θ, and now with loss function
L(θ, a) := 1{|θ − a| > c} for a given constant c > 0. Then

l(x, a) = Π( |θ − a| > c | X = x ) = ∫_{|ϑ−a|>c} w(ϑ|x) dϑ.
Thus, for c small, Bayes rule is approximately d0 (x) := arg maxa∈Θ p(x|a)w(a).
The estimator d0 (X) is called the maximum a posteriori estimator (MAP). If
w is the uniform density on Θ (which only exists if Θ is bounded), then d0 (X)
is the maximum likelihood estimator.
In many examples, it saves a lot of work not to write out complete expressions
for the posterior w(ϑ|x). The main interest (at first) is how it depends on
ϑ and all the other expressions can (at least theoretically, it may be difficult
computationally) be found later by using that a density integrates to one. We
therefore recall the symbol ∝ (see Section 4.9). For example, we may write
w(ϑ|x) = p(x|ϑ)w(ϑ)/p(x)
∝ p(x|ϑ)w(ϑ)
where

Γ(k) = ∫_0^∞ e^{−z} z^{k−1} dz.
w(ϑ|x) = p(x|ϑ) w(ϑ)/p(x)

= e^{−ϑ} (ϑ^x/x!) × ( λ^k ϑ^{k−1} e^{−λϑ}/Γ(k) ) / p(x)

= e^{−ϑ(1+λ)} ϑ^{k+x−1} c(x, k, λ),

for some constant c(x, k, λ) not depending on ϑ. Using the short-hand notation,

w(ϑ|x) ∝ p(x|ϑ) w(ϑ) ∝ e^{−ϑ(1+λ)} ϑ^{k+x−1} .
You see this saves a lot of writing. We recognize w(ϑ|x) as the density of the
Gamma(k + x, 1 + λ)-distribution. The Bayes estimator with quadratic loss is thus

E(θ|X) = (k + X)/(1 + λ).

The maximum a posteriori estimator is

(k + X − 1)/(1 + λ).
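The conjugate update is one line of code (a sketch; the names are ours):

```python
def poisson_gamma_posterior(x, k, lam):
    """Posterior of theta given X = x: Gamma(k + x, 1 + lam)."""
    return k + x, 1 + lam

def bayes_estimate(x, k, lam):
    # Posterior mean: Bayes estimator under quadratic loss.
    a, b = poisson_gamma_posterior(x, k, lam)
    return a / b

def map_estimate(x, k, lam):
    # Posterior mode (valid when k + x > 1).
    a, b = poisson_gamma_posterior(x, k, lam)
    return (a - 1) / b

# e.g. prior Gamma(2, 1) and observation x = 3:
print(bayes_estimate(3, 2, 1), map_estimate(3, 2, 1))  # 2.5 2.0
```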
Example 10.5.2 Binomial distribution with Beta prior

Suppose given θ, X has the Binomial(n, θ)-distribution, and that θ is uniformly
distributed on [0, 1]. Then

w(ϑ|x) = (n choose x) ϑ^x (1 − ϑ)^{n−x} / p(x).
10.6. DISCUSSION OF BAYESIAN APPROACH 107
More generally, suppose that X is binomial(n, θ) and that θ has the Beta(r, s)-
prior
w(ϑ) = ( Γ(r + s)/(Γ(r)Γ(s)) ) ϑ^{r−1} (1 − ϑ)^{s−1} ,   0 < ϑ < 1.

Here r and s are given positive numbers. The prior expectation is

Eθ = r/(r + s).

The Bayes estimator under quadratic loss is the posterior expectation

E(θ|X) = (X + r)/(n + r + s).
Example 10.5.3 Normal distribution with normal prior
Let X ∼ N(θ, 1) and θ ∼ N(c, τ²) for some c and τ². We have

w(ϑ|x) = p(x|ϑ) w(ϑ) / p(x)

∝ φ(x − ϑ) φ( (ϑ − c)/τ )

∝ exp[ −(1/2){ (x − ϑ)² + (ϑ − c)²/τ² } ]

∝ exp[ −(1/2) ( ϑ − (τ²x + c)/(τ² + 1) )² (1 + τ²)/τ² ].

We conclude that the Bayes estimator for quadratic loss is

TBayes = E(θ|X) = (τ²X + c)/(τ² + 1).
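The estimator shrinks X towards the prior mean c, with weight τ²/(τ² + 1) on the data; as the prior becomes vaguer (τ → ∞) the estimator approaches the MLE X. A sketch:

```python
def bayes_normal(x, c, tau_sq):
    """Posterior mean of theta for X ~ N(theta, 1), theta ~ N(c, tau^2)."""
    w = tau_sq / (tau_sq + 1.0)   # weight on the data
    return w * x + (1.0 - w) * c

print(bayes_normal(2.0, 0.0, 1.0))    # 1.0: halfway between x and c
print(bayes_normal(2.0, 0.0, 100.0))  # close to 2.0: vague prior, almost the MLE
```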
where τ > 0 is the prior scale parameter (τ 2 is the variance of this distribution).
Given X1, . . . , Xn, the posterior distribution of the vector θ is then

w(ϑ|X1, . . . , Xn) ∝ (2π)^{−n/2} exp( − Σ_{i=1}^n (Xi − ϑi)²/2 ) × (2τ²)^{−n/2} exp( − √2 Σ_{i=1}^n |ϑi| / τ ).

Thus, θ̂ with regularization parameter λ = √2/τ is the maximum a posteriori
estimator.
Bayesian methods as theoretical tool. In Chapter 11 we will illustrate the
fact that Bayesian methods can be exploited as a tool for proving for example
frequentist lower bounds. We will see for instance that the Bayesian estimator
with constant risk is also the minimax estimator. The idea in such results is to
look for “worst possible priors”.
Striving for flexible prior distributions, one can model them as depending on a
further “hyper-parameter”, say τ, i.e., in formula

w(ϑ) := w(ϑ|τ).
One can proceed by estimating τ, using for instance maximum likelihood (this
is generally computationally quite hard), or the method of moments. One then
obtains a prior w(ϑ|τ̂ ) with estimated parameter τ̂ . The prior is thus based on
the data. The whole procedure is called empirical Bayes.
Example 10.7.1 Poisson with Gamma prior with hyperparameters
Suppose X1 , . . . , Xn are independent and Xi has a Poisson(θi )-distribution,
i = 1, . . . , n. Assume moreover that θ1 , . . . , θn are i.i.d. with Gamma(k, λ)-
distribution, i.e., each has prior density

w(ϑ) = λ^k ϑ^{k−1} e^{−λϑ}/Γ(k) ,   ϑ > 0.

The marginal density of X1, . . . , Xn (with θ1, . . . , θn integrated out) is then

p̃(x1, . . . , xn|k, λ) = ∫ ∏_{i=1}^n [ e^{−ϑi} (ϑi^{xi}/xi!) · (λ^k/Γ(k)) ϑi^{k−1} e^{−λϑi} ] dϑ1 ··· dϑn

= ∏_{i=1}^n (λ^k/(Γ(k) xi!)) ∫_0^∞ e^{−ϑ(1+λ)} ϑ^{xi+k−1} dϑ

= ∏_{i=1}^n ( Γ(xi + k)/(Γ(k) xi!) ) p^k (1 − p)^{xi} ,
where p := λ/(1 + λ). Thus, under p̃(·|k, λ), the observations X1, . . . , Xn are
independent and Xi has a negative binomial distribution with parameters k and
p (check the formula for the negative binomial distribution, see e.g. Example
2.4.1). The mean and variance of the negative binomial distribution can be
calculated directly or looked up in a textbook. We then find (for i = 1, . . . , n)

E(Xi|k, λ) = k(1 − p)/p = k/λ

and

var(Xi|k, λ) = k(1 − p)/p² = k(1 + λ)/λ² .
We use the method of moments to estimate k and λ. Let X̄n be the sample
mean and Sn² := Σ_{i=1}^n (Xi − X̄n)²/(n − 1) be the sample variance. We solve

k̂/λ̂ = X̄n ,   k̂(1 + λ̂)/λ̂² = Sn² .

This yields

k̂ = X̄n²/(Sn² − X̄n) ,   λ̂ = X̄n/(Sn² − X̄n) .
For given k and λ, the Bayes estimator of θi is given in Example 10.5.1. We
now insert the estimated values of k and λ to get an empirical Bayes estimator
θ̂i = (Xi + k̂)/(1 + λ̂) = Xi (1 − X̄n/Sn²) + X̄n²/Sn² ,   i = 1, . . . , n.
1 + λ̂
The MLE of θi is Xi itself (i = 1, . . . , n). We see that the empirical Bayes
estimator uses all observations to estimate a particular θi . The empirical Bayes
estimator θ̂i is a convex combination (1 − α)Xi + αX̄n of Xi and X̄n , with
α = X̄n /Sn2 generally close to one if the pooled sample has mean and variance
approximately equal, i.e., if the pooled sample is “Poisson-like”.
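The whole procedure fits in a few lines (a sketch; it requires Sn² > X̄n, i.e. overdispersion relative to the Poisson, for the moment estimators to be positive):

```python
def empirical_bayes_poisson(x):
    """Empirical Bayes estimates of theta_i for Poisson data with a Gamma(k, lam) prior."""
    n = len(x)
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    if s2 <= xbar:
        raise ValueError("method of moments needs S_n^2 > X_bar")
    lam_hat = xbar / (s2 - xbar)
    k_hat = xbar ** 2 / (s2 - xbar)
    # theta_i = (x_i + k_hat)/(1 + lam_hat) = (1 - alpha) x_i + alpha xbar, alpha = xbar/s2
    return [(xi + k_hat) / (1 + lam_hat) for xi in x]

x = [0, 1, 2, 3, 4, 10]
print(empirical_bayes_poisson(x))  # each x_i shrunk towards the pooled mean
```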
Chapter 11

Proving admissibility and minimaxity
Bayes estimators are quite useful, also for obdurate frequentists. They can be
used to construct estimators that are minimax (admissible), or for verification
of minimaxity (admissibility).
Especially in cases where one wants to use the uniform distribution as prior,
but cannot do so because Θ is not bounded, the notion extended Bayes is useful.
112 CHAPTER 11. PROVING ADMISSIBILITY AND MINIMAXITY
11.1 Minimaxity
Lemma 11.1.1 Suppose T is a statistic with risk R(θ, T ) = R(T ) not depend-
ing on θ. Then
(i) T admissible ⇒ T minimax,
(ii) T Bayes ⇒ T minimax,
and in fact more generally,
(iii) T extended Bayes ⇒ T minimax.
Proof.
(i) T is admissible, so for all T′, either there is a θ with R(θ, T′) > R(T), or
R(θ, T′) ≥ R(T) for all θ. Hence supθ R(θ, T′) ≥ R(T).
(ii) Since Bayes implies extended Bayes, this follows from (iii). We nevertheless
present a separate proof, as it is somewhat simpler than (iii).
Note first that for any T′,

rw(T′) = ∫ R(ϑ, T′) w(ϑ) dµ(ϑ) ≤ sup_ϑ R(ϑ, T′) ∫ w(ϑ) dµ(ϑ) = sup_ϑ R(ϑ, T′),   (11.1)

that is, the Bayes risk is always bounded by the supremum risk. Suppose now that
T′ is a statistic with supθ R(θ, T′) < R(T). Then
because, as we have seen in (11.1), the Bayes risk is bounded by the supremum
risk. Since ε can be chosen arbitrarily small, this proves (iii). □
Example 11.1.1 Minimax estimator for Binomial distribution
Consider a Binomial(n, θ) random variable X. Let the prior on θ ∈ (0, 1) be
the Beta(r, s) distribution. Then Bayes estimator for quadratic loss is
X +r
T =
n+r+s
(see Example 10.5.2). Its risk is

R(θ, T) = Eθ(T − θ)² = varθ(T) + bias²θ(T)

= nθ(1 − θ)/(n + r + s)² + ( (nθ + r)/(n + r + s) − θ )²

= ( [(r + s)² − n] θ² + [n − 2r(r + s)] θ + r² ) / (n + r + s)² .

This can only be constant in θ if the coefficients in front of θ² and θ are zero:

(r + s)² − n = 0 ,   n − 2r(r + s) = 0.

Solving gives r = s = √n/2. For this choice, the risk equals r²/(n + r + s)² =
1/(4(√n + 1)²) for all θ, so by Lemma 11.1.1 the estimator T = (X + √n/2)/(n + √n)
is minimax.
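These equations give r = s = √n/2, and that the resulting risk is constant in θ is easy to verify numerically (a sketch; n = 25 is arbitrary):

```python
from math import sqrt, isclose

def risk(theta, n, r, s):
    # Mean square error of T = (X + r)/(n + r + s) for X ~ Binomial(n, theta).
    denom = n + r + s
    var = n * theta * (1 - theta) / denom ** 2
    bias = (n * theta + r) / denom - theta
    return var + bias ** 2

n = 25
r = s = sqrt(n) / 2
risks = [risk(t, n, r, s) for t in (0.1, 0.3, 0.5, 0.9)]
print(risks)  # all equal 1 / (4 (sqrt(n) + 1)^2) = 1/144 for n = 25
```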
Then for all x, Ta,b (x) → T (x) as a → −∞ and b → ∞. One can easily verify
that also
lim_{a→−∞, b→∞} E0 T²a,b(X) = E0 T²(X).
(Note that, for any prior w, E0 T 2 (X) is the Bayes risk rw (T ) since the risk
R(θ, T ) = E0 T 2 (X) does not depend on θ.) Moreover
Ta,b(X) − θ = ∫_a^b (z − θ) p0(X − z) dz / ∫_a^b p0(X − z) dz

= ∫_{a−θ}^{b−θ} v p0(X − θ − v) dv / ∫_{a−θ}^{b−θ} p0(X − θ − v) dv .
It follows that

Eθ( Ta,b(X) − θ )² = E0 T²_{a−θ, b−θ}(X).

Hence,

R(θ, Tm) = E0 T²_{−m−θ, m−θ}(X).

The Bayes risk is

rwm(Tm) = Eθ∼wm R(θ, Tm) = (1/(2m)) ∫_{−m}^{m} E0 T²_{−m−ϑ, m−ϑ}(X) dϑ.
11.2 Admissibility
Lemma 11.2.1 Suppose that the statistic T is Bayes for the prior density w.
Then (i) or (ii) below are sufficient conditions for the admissibility of T .
(i) The statistic T is the unique Bayes decision (i.e., r_w(T) = r_w(T') implies that ∀ θ, T = T' P_θ-almost surely).
(ii) For all T', R(θ, T') is continuous in θ, and moreover, for all open U ⊂ Θ, the prior probability Π(U) := ∫_U w(ϑ) dµ(ϑ) of U is strictly positive.
Proof.
(i) Suppose that for some T', R(θ, T') ≤ R(θ, T) for all θ. Then also r_w(T') ≤ r_w(T). Because T is Bayes, we then must have equality:
r_w(T') = r_w(T).
By uniqueness, T = T' P_θ-almost surely for all θ, so T' cannot be strictly better than T, and T is admissible.
(ii) Suppose again that for some T', R(θ, T') ≤ R(θ, T) for all θ, with strict inequality at some θ₀. By continuity there are then an ε > 0 and an open U ⊂ Θ containing θ₀ such that
R(ϑ, T') ≤ R(ϑ, T) − ε, ϑ ∈ U.
But then
r_w(T') = ∫_U R(ϑ, T') w(ϑ) dν(ϑ) + ∫_{U^c} R(ϑ, T') w(ϑ) dν(ϑ)
        ≤ ∫_U R(ϑ, T) w(ϑ) dν(ϑ) − ε Π(U) + ∫_{U^c} R(ϑ, T) w(ϑ) dν(ϑ)
        = r_w(T) − ε Π(U) < r_w(T),
a contradiction, since T is Bayes. □
R(ϑ, T') ≤ R(ϑ, T) − ε, ϑ ∈ U.
Suppose for simplicity that a Bayes decision Tm for the prior wm exists, for all
m, i.e.
r_{w_m}(T_m) = inf_{T'} r_{w_m}(T'), m = 1, 2, … .
Then, for all m,
or
(r_{w_m}(T) − r_{w_m}(T_m)) / Π_m(U) ≥ ε > 0,
that is, we have arrived at a contradiction. □
T = aX + b, a > 0, b ∈ R.

T_Bayes = E(θ|X) = (τ²X + c)/(τ² + 1).

Taking
τ²/(τ² + 1) = a,  c/(τ² + 1) = b,
yields T = T_Bayes.
Next, we check (i) in Lemma 11.2.1, i.e. that T is unique. In view of the calculations in Subsection 10.5.2,
r_w(T') = E var(θ|X) + E(T − T')²,
so r_w(T') = r_w(T) implies
E(T − T')² = 0.
Here, the expectation is with θ integrated out, i.e., with respect to the measure P with density
p(x) = ∫ p_ϑ(x) w(ϑ) dµ(ϑ).
11.2. ADMISSIBILITY 117
Now, we can write X = θ + ε, with θ N(c, τ²)-distributed, and with ε a standard normal random variable independent of θ. So X is N(c, τ² + 1)-distributed, that is, P is the N(c, τ² + 1)-distribution. Now, E(T − T')² = 0 implies T = T' P-a.s. Since P dominates all P_θ, we conclude that T = T' P_θ-a.s., for all θ. So T is unique, and hence admissible.
(ii)
In this case, T = X. We use Lemma 11.2.2. Because R(θ, T ) = 1 for all θ, also
rw (T ) = 1 for any prior. Let wm be the density of the N (0, m)-distribution.
As we have seen in Example 10.5.3 and also in the previous part of the proof,
the Bayes estimator is
T_m = (m/(m + 1)) X.
By the bias-variance decomposition, it has risk
R(θ, T_m) = m²/(m + 1)² + (m/(m + 1) − 1)² θ² = m²/(m + 1)² + θ²/(m + 1)².
As Eθ² = m, its Bayes risk is
r_{w_m}(T_m) = m²/(m + 1)² + m/(m + 1)² = m/(m + 1).
It follows that
r_{w_m}(T) − r_{w_m}(T_m) = 1 − m/(m + 1) = 1/(m + 1).
So T is extended Bayes. But we need to prove the more refined property of
Lemma 11.2.2. It is clear that here, we only need to consider open intervals
U = (u, u + h), with u and h > 0 fixed. We have
Π_m(U) = Φ((u + h)/√m) − Φ(u/√m) = (1/√m) φ(u/√m) h + o(1/√m).
For m large,
φ(u/√m) ≈ φ(0) = 1/√(2π) > 1/4 (say),
so for m sufficiently large (depending on u)
φ(u/√m) ≥ 1/4.
Thus, for m sufficiently large (depending on u and h), we have
Π_m(U) ≥ h/(4√m).
We conclude that for m sufficiently large
(r_{w_m}(T) − r_{w_m}(T_m)) / Π_m(U) ≤ 4/(h√m) → 0, m → ∞,
so Lemma 11.2.2 applies and T = X is admissible.
118 CHAPTER 11. PROVING ADMISSIBILITY AND MINIMAXITY
E_θ T' := q(θ).
The bias of T' is
b(θ) = q(θ) − g(θ),
or
q(θ) = b(θ) + g(θ) = b(θ) + ḋ(θ).
This implies
q̇(θ) = ḃ(θ) + I(θ).
By the Cramér–Rao lower bound,
[ḃ(θ) + I(θ)]²/I(θ) + b²(θ) ≤ I(θ),
or
I(θ){b²(θ) + 2ḃ(θ)} ≤ −[ḃ(θ)]² ≤ 0.
Hence b²(θ) + 2ḃ(θ) ≤ 0, i.e.
ḃ(θ)/b²(θ) ≤ −1/2,
so
(d/dθ)(1/b(θ)) − 1/2 ≥ 0,
or
(d/dθ)(1/b(θ) − θ/2) ≥ 0.
In other words, 1/b(θ) − θ/2 is an increasing function.
We will now show that this gives a contradiction, implying that b(θ) = 0 for all
θ.
Suppose instead b(θ₀) < 0 for some θ₀. Then also b(ϑ) < 0 for all ϑ > θ₀, since b(·) is decreasing. It follows that
1/b(ϑ) ≥ 1/b(θ₀) + (ϑ − θ₀)/2 → ∞, ϑ → ∞,
i.e.,
b(ϑ) → 0, ϑ → ∞.
A similar argument gives
b(ϑ) → 0, ϑ → −∞. □
Example 11.3.1 Admissibility of the sample average for estimating
the mean of a normal distribution
Let X be N (θ, 1)-distributed, with θ ∈ R unknown. Then X is an admissible
estimator of θ.
p_θ(x) = (1/√(2πσ²)) exp(−x²/(2σ²)) = exp[θT(x) − d(θ)] h(x),
with
T(x) = −x²/2, θ = 1/σ²,
Definition 11.4.1 Let p > 2 and let 0 < b < 2(p − 2) be some constant. Stein's estimator is
T* := (1 − b/‖X‖²) X.
Lemma 11.4.1 We have
R(θ, T*) = p − [2b(p − 2) − b²] E_θ[1/‖X‖²].
E_θ(T*_i − θ_i)² = 1 + b² E_θ[X_i²/‖X‖⁴] − 2b E_θ[1/‖X‖² − 2X_i²/‖X‖⁴]
               = 1 + (b² + 4b) E_θ[X_i²/‖X‖⁴] − 2b E_θ[1/‖X‖²].
It follows that
R(θ, T*) = p + (b² + 4b) E_θ[∑_{i=1}^p X_i²/‖X‖⁴] − 2bp E_θ[1/‖X‖²]
         = p − [2b(p − 2) − b²] E_θ[1/‖X‖²]. □
We thus have the surprising fact that Stein's estimator of θ_i also uses the observations X_j with j ≠ i, even though these observations are independent of X_i and have a distribution which does not depend on θ_i.
Note that 2b(p − 2) − b² is maximized for b = p − 2. So the value b = p − 2 gives the maximal improvement over X. Stein's estimator is then
T* = (1 − (p − 2)/‖X‖²) X.
T_{i,Bayes} = (τ²/(1 + τ²)) X_i, i = 1, …, p
(see Example 5.2.1). Given θi , Xi ∼ N (θi , 1) (i = 1, . . . , p). So uncondi-
tionally, Xi ∼ N (0, 1 + τ 2 ) (i = 1, . . . , p). Thus, unconditionally, X1 , . . . , Xp
are identically distributed, each having the N (0, σ 2 )-distribution with σ 2 =
1 + τ². As estimator of the variance σ² we may use the sample version σ̂² := ∑_{i=1}^p X_i²/p = ‖X‖²/p (we need not center with the sample average, as the unconditional mean of the X_i is known to be zero). That is, we estimate τ² by
τ̂² := σ̂² − 1 = ‖X‖²/p − 1.
This leads to the empirical Bayes estimator
T_{emp.Bayes} := (τ̂²/(1 + τ̂²)) X = (1 − p/‖X‖²) X.
This shows that when p > 4, then Stein’s estimator with b = p is an empirical
Bayes estimator.
Chapter 12

The linear model
over b = (b1 , . . . , bp )T ∈ Rp .
Let us denote the Euclidean norm of a vector v ∈ R^n by
‖v‖₂ := (∑_{i=1}^n v_i²)^{1/2}.
Write Y = (Y_1, …, Y_n)^T. Then
∑_{i=1}^n (Y_i − ∑_{j=1}^p x_{i,j} b_j)² = ‖Y − Xb‖₂².
X^T(Y − Xβ̂) = 0,
or
X^T Y = X^T X β̂.
If X has rank p, the matrix X^T X has an inverse (X^T X)^{−1} and we get
β̂ = (X^T X)^{−1} X^T Y.
X(X^T X)^{−1} X^T Y is then the projection of Y onto the column space of X.
Example with p = 1
For p = 1 (one covariate plus an intercept), X is the n × 2 matrix with i-th row (1, x_i).
Then
X^T X = [ n, ∑_{i=1}^n x_i ; ∑_{i=1}^n x_i, ∑_{i=1}^n x_i² ],
and
(X^T X)^{−1} = (n ∑_{i=1}^n x_i² − (∑_{i=1}^n x_i)²)^{−1} [ ∑ x_i², −∑ x_i ; −∑ x_i, n ]
             = (∑_{i=1}^n (x_i − x̄)²)^{−1} [ (1/n)∑_{i=1}^n x_i², −x̄ ; −x̄, 1 ].
Moreover,
X^T Y = [ n Ȳ ; ∑_{i=1}^n x_i Y_i ].
(α̂, β̂)^T = (X^T X)^{−1} X^T Y
= (∑_{i=1}^n (x_i − x̄)²)^{−1} [ (1/n)∑ x_i², −x̄ ; −x̄, 1 ] [ n Ȳ ; ∑ x_i Y_i ]
= (∑ (x_i − x̄)²)^{−1} [ Ȳ ∑ x_i² − x̄ ∑ x_i Y_i ; −n x̄ Ȳ + ∑ x_i Y_i ]
= (∑ (x_i − x̄)²)^{−1} [ ∑ (x_i − x̄)² Ȳ − x̄ (∑ x_i Y_i − n x̄ Ȳ) ; ∑ x_i Y_i − n x̄ Ȳ ].
Here we used that ∑_{i=1}^n x_i² = ∑_{i=1}^n (x_i − x̄)² + n x̄². We can moreover write
∑_{i=1}^n x_i Y_i − n x̄ Ȳ = ∑_{i=1}^n (x_i − x̄)(Y_i − Ȳ).
Thus
(α̂, β̂)^T = ( Ȳ − β̂ x̄ ; ∑_{i=1}^n (x_i − x̄)(Y_i − Ȳ) / ∑_{i=1}^n (x_i − x̄)² ).
(Figure: scatter plot of y against x.)
Z := (Z_1, …, Z_p)^T.
For a symmetric positive definite matrix Σ, one can define the square root Σ1/2
as a symmetric positive definite matrix satisfying
Σ1/2 Σ1/2 = Σ.
Z̃ := Σ−1/2 Z
Proof.
(i) By straightforward computation,
β̂ − β* = (X^T X)^{−1} X^T ε =: A ε.
We therefore have
E(β̂ − β*) = A Eε = 0,
and the covariance matrix of β̂ is
Cov(β̂) = Cov(Aε) = A Cov(ε) A^T = σ² A A^T = σ² (X^T X)^{−1},
since Cov(ε) = σ² I.
where V := P^T ε,
E V = P^T Eε = 0,
and
Cov(V) = P^T Cov(ε) P = σ² I.
It follows that
E ∑_{j=1}^p V_j² = ∑_{j=1}^p E V_j² = σ² p.
Proof.
(i) Since β̂ is a linear function of the multivariate normal ε, the least squares estimator β̂ is also multivariate normal.
(ii) Define the projection P P^T := X(X^T X)^{−1} X^T. Then
‖X(β̂ − β*)‖₂² = ‖P P^T ε‖₂² = ‖P^T ε‖₂² = ∑_{j=1}^p V_j².
EY = Xβ.
□
Corollary 12.4.1 Let L(a) := 2a^T z − a^T V a. The difference between the unrestricted maximum and the restricted maximum of L(a) is
max_a L(a) − max_{a: Ba=0} L(a) = z^T V^{−1} B^T (B V^{−1} B^T)^{−1} B V^{−1} z.
Y = Xβ + ε,

= ε^T X(X^T X)^{−1} B^T (B(X^T X)^{−1} B^T)^{−1} B(X^T X)^{−1} X^T ε
= Z^T (B(X^T X)^{−1} B^T)^{−1} Z.
The q-vector
Z := B(X^T X)^{−1} X^T ε
has a multivariate normal distribution with mean zero and covariance matrix
σ² B(X^T X)^{−1} X^T (B(X^T X)^{−1} X^T)^T = σ² B(X^T X)^{−1} B^T.
Chapter 13

Asymptotic theory
γ := g(θ) ∈ Rp ,
Γ := {g(θ) : θ ∈ Θ}
Tn = Tn (X1 , . . . , Xn ).
for all permutations π of {1, . . . , n}. In that case, one can write Tn in the form
Tn = Q(P̂n ), where P̂n is the empirical distribution (see also Subsection 2.4.1).
Notation: Z_n −→^{IP} Z.
Remark Chebyshev's inequality can be a tool to prove convergence in probability. It says that for all increasing functions ψ : [0, ∞) → [0, ∞), one has
IP(‖Z_n − Z‖ ≥ ε) ≤ E ψ(‖Z_n − Z‖) / ψ(ε).
The following theorem says that for convergence in distribution, one actually
can do with one-dimensional random variables. We omit the proof.
Theorem 13.1.1 (Cramér–Wold device) Let ({Z_n}, Z) be a collection of R^p-valued random variables. Then
Z_n −→^D Z ⇔ a^T Z_n −→^D a^T Z ∀ a ∈ R^p.
¹Let (Ω, A, IP) be a probability space, and X : Ω → R^p and Y : Ω → R^q be two measurable maps. Then X and Y are called random variables, and they are defined on the same probability space Ω.
Define
X̄_n = (X̄_n^{(1)}, …, X̄_n^{(p)})^T.
By the 1-dimensional CLT, for all a ∈ R^p,
√n(a^T X̄_n − a^T µ) −→^D N(0, a^T Σ a).
Let {Z_n} be a collection of R^p-valued random variables, and let {r_n} be strictly positive random variables. We write Z_n = O_IP(1) (Z_n is bounded in probability) if lim_{M→∞} lim sup_{n→∞} IP(‖Z_n‖ > M) = 0, and Z_n = o_IP(1) if Z_n converges to zero in probability. Moreover, Z_n = O_IP(r_n) if Z_n/r_n = O_IP(1), and Z_n = o_IP(r_n) (Z_n is of small order r_n in probability) if Z_n/r_n = o_IP(1).
²A real-valued function f on (a subset of) R^p is Lipschitz if for a constant C_L and all (z, z̃) in the domain of f, |f(z) − f(z̃)| ≤ C_L ‖z − z̃‖.
IP(Zn ≤ −M ) → G(−M ).
Then
|Ef(A_n^T Z_n) − Ef(a^T Z)| ≤ |Ef(A_n^T Z_n) − Ef(a^T Z_n)| + |Ef(a^T Z_n) − Ef(a^T Z)|.
Let g(θ) := µ and T_n := (X_1 + · · · + X_n)/n =: X̄_n. Then, by the law of large numbers, T_n is a consistent estimator of µ and, by the central limit theorem,
√n(T_n − µ) −→^{D_θ} N(0, σ²_{F_0}).
We rewrite σ̂² as
σ̂_n² = (1/n)∑_{i=1}^n (X_i − µ)² + (X̄_n − µ)² − (2/n)∑_{i=1}^n (X_i − µ)(X̄_n − µ)
     = (1/n)∑_{i=1}^n (X_i − µ)² − (X̄_n − µ)².
σ̂_n² = (1/n)∑_{i=1}^n (X_i − µ)² + O_{IP_θ}(1/n).
So asymptotically one does not notice that µ is estimated, and σ̂n2 (and also S 2 )
is asymptotically linear with influence function
lθ (x) = (x − µ)2 − σ 2 .
Then
(h(T_n) − h(c))/r_n −→^D ḣ(c)^T Z,
and in fact
h(T_n) − h(c) = ḣ(c)^T (T_n − c) + o_IP(r_n).
Since (T_n − c)/r_n converges in distribution, we know that ‖T_n − c‖/r_n = O_IP(1). Hence ‖T_n − c‖ = O_IP(r_n). The result now follows from
h(T_n) − h(c) = ḣ(c)^T (T_n − c) + o(‖T_n − c‖) = ḣ(c)^T (T_n − c) + o_IP(r_n). □
Corollary 13.4.1 Let T_n be an asymptotically normal estimator of γ = g(θ) ∈ R^p with asymptotic covariance matrix V_θ. Then h(T_n) is an asymptotically normal estimator of h(γ) with asymptotic covariance matrix
ḣ(γ)^T V_θ ḣ(γ).
If moreover T_n is asymptotically linear with influence function l_θ, then h(T_n) is asymptotically linear with influence function
ḣ(γ)^T l_θ.
We have
σ̂_n² = h(T_n),
where T_n = (T_{n,1}, T_{n,2})^T, with
T_{n,1} = X̄_n, T_{n,2} = (1/n)∑_{i=1}^n X_i²,
and
h(t) = t_2 − t_1², t = (t_1, t_2)^T.
The estimator T_n has influence function
l_θ(x) = ( x − µ_1 ; x² − µ_2 ),
with
V_θ = [ µ_2 − µ_1², µ_3 − µ_1 µ_2 ; µ_3 − µ_1 µ_2, µ_4 − µ_2² ].
It holds that
ḣ(µ_1, µ_2) = ( −2µ_1 ; 1 ),
i.e., the δ-method gives the same result as the ad hoc method in Example 13.3.2,
as it of course should.
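The δ-method prediction can also be verified numerically (an illustrative sketch, numpy assumed). For N(0,1) data, µ₁ = 0, µ₂ = 1, µ₄ = 3, so the asymptotic variance ḣ(γ)^T V_θ ḣ(γ) reduces to µ₄ − µ₂² = 2, and n · var(σ̂_n²) should be close to 2:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_rep = 200, 5000

# Monte Carlo variance of sigma_hat_n^2 for standard normal samples
X = rng.normal(size=(n_rep, n))
sigma2_hat = X.var(axis=1)          # (1/n) sum (X_i - X_bar)^2 per row
mc_var = n * sigma2_hat.var()       # should be near the delta-method value 2
```

The agreement improves as n grows, consistent with the asymptotic nature of the result.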
Chapter 14

M-estimators
The “M” in “M-estimator” stands for Minimizer (or, after a sign change, Maximizer).
If ρ_c(x) is differentiable in c for all x, we can generally define γ̂_n as a solution of setting the derivatives
Ṙ̂_n(c) = (∂/∂c)(1/n)∑_{i=1}^n ρ_c(X_i) = (1/n)∑_{i=1}^n ψ_c(X_i)
to zero. This is called the Z-estimator. The “Z” in “Z-estimator” stands for Zero.
Definition 14.0.2 The Z-estimator γ̂_n of γ is defined as a solution of the equations
Ṙ̂_n(γ̂_n) = 0,
where Ṙ̂_n(c) = (1/n)∑_{i=1}^n ψ_c(X_i).
Eθ (X − c)2 = σ 2 + (µ − c)2 .
ρc (x) = (x − c)2 .
Clearly
(1/n)∑_{i=1}^n (X_i − c)²
is minimized at c = X̄_n := ∑_{i=1}^n X_i/n. See also Section 2.3.
Suppose Θ ⊂ Rp and that the densities pθ = dPθ /dν exist w.r.t. some σ-finite
measure ν.
With ρθ̃ (x) = − log pθ̃ (x) we find ψθ̃ (x) := ρ̇θ̃ (x) = −sθ̃ (x). Recall that sθ is
the score function
sθ = ṗθ /pθ ,
see Definition 4.7.1. We have seen moreover in Lemma 4.7.1 that Eθ sθ (X) = 0.
This is just another way to see that θ is a solution of the equation
Ṙ(θ) = 0,
With ρ_{θ̃}(x) = − log p_{θ̃}(x) the M-estimator is the maximum likelihood estimator:
θ̂ = arg max_{θ̃∈Θ} L_X(θ̃) = arg min_{θ̃∈Θ} (1/n)∑_{i=1}^n − log p_{θ̃}(X_i). □
We will need that convergence of the minimum value also implies convergence of the arg min, i.e., convergence of the location of the minimum. To this end, we present the following definition.
Definition The minimizer γ of R(c) is called well-separated if for all ε > 0,
inf{R(c) : c ∈ Γ, ‖c − γ‖ > ε} > R(γ).
P_θ w(·, δ, c) → 0 as δ ↓ 0. Hence for every ε > 0 and every c there is a δ_c > 0 with
P_θ w(·, δ_c, c) ≤ ε.
Let
Bc := {c̃ ∈ Γ : kc̃ − ck < δc }.
Then {B_c : c ∈ Γ} is a covering of Γ by open sets. Since Γ is compact, there exists a finite sub-covering
B_{c_1}, …, B_{c_N}.
For c ∈ Bcj ,
|ρc − ρcj | ≤ w(·, δcj , cj ).
It follows that
□
Example 14.2.1 Consistency of the MLE in the logistic location fam-
ily
The above theorem directly uses the definition of the M-estimator, and does not
rely on having an explicit expression available. Here is an example where an
explicit expression is indeed not possible. Consider the logistic location family,
where the densities are
p_θ(x) = e^{x−θ}/(1 + e^{x−θ})², x ∈ R.
This expression cannot be made into an explicit expression for θ̂n . However, we
do note the caveat that in order to be able to apply the above consistency theorem,
we need to assume that Θ is compact. This problem can be circumvented by
using the result below for Z-estimators.
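As an illustration (a sketch, not part of the notes), the likelihood equation for the logistic location family can be solved numerically. The score equation ∑ᵢ s_θ(Xᵢ) = 0, with s_θ(x) = −1 + 2e^{x−θ}/(1 + e^{x−θ}), is strictly decreasing in θ, so bisection converges to the MLE:

```python
import math, random

random.seed(0)
theta_true = 1.0
# Simulate from the logistic location family via the inverse CDF
sample = [theta_true + math.log(u / (1 - u))
          for u in (random.random() for _ in range(2000))]

def score_sum(theta, xs):
    # sum_i s_theta(X_i), with s_theta(x) = -1 + 2 e^{x-theta}/(1+e^{x-theta})
    return sum(-1 + 2 / (1 + math.exp(theta - x)) for x in xs)

# Bisection: score_sum is strictly decreasing in theta
lo, hi = min(sample), max(sample)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if score_sum(mid, sample) > 0:
        lo = mid
    else:
        hi = mid
theta_hat = 0.5 * (lo + hi)   # numerical MLE, close to theta_true
```

There is indeed no closed form, but the estimator is easy to compute to machine precision.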
To prove consistency of a Z-estimator of a one-dimensional parameter is rela-
tively easy.
Recall that ψc = ρ̇c and Ṙ(c) = Pθ ψc := Eθ ψc (X), c ∈ Γ. Recall further that
Ṙ(γ) = 0 since γ is defined as the minimizer of R(·).
Theorem 14.2.2 Suppose that Γ ⊂ R and that ψ_c(x) is continuous in c for all x. Assume moreover that
P_θ|ψ_c| < ∞, ∀ c,
and that ∃ δ > 0 such that
Ṙ(c) < 0 for γ − δ ≤ c < γ, Ṙ(c) > 0 for γ < c ≤ γ + δ.
Then there are solutions γ̂_n of Ṙ̂_n(γ̂_n) = 0 with γ̂_n → γ, IP_θ-a.s.
Proof. Let 0 < ε < δ be arbitrary. By the law of large numbers, IP_θ-a.s. for n sufficiently large,
Ṙ̂_n(γ + ε) > 0, Ṙ̂_n(γ − ε) < 0.
The continuity of c ↦ ψ_c implies that then Ṙ̂_n(γ̂_n) = 0 for some |γ̂_n − γ| < ε. □
For verifying asymptotic continuity, there are various tools, which involve com-
plexity assumptions on the map c 7→ ψc . This goes beyond the scope of these
notes. But let us see what asymptotic continuity can bring us.
Recall that Ṙ(c) = P_θ ψ_c. We assume that
M_θ := (∂/∂c^T) Ṙ(c)|_{c=γ}
exists.
Theorem 14.3.1 Let γ̂_n be the Z-estimator of γ. Suppose that γ̂_n is a consistent estimator of γ, and that ν_n is asymptotically continuous at γ. Suppose moreover that M_θ^{−1} exists, and also
J_θ := P_θ ψ_γ ψ_γ^T.
Then γ̂_n is asymptotically linear with influence function
l_θ = −M_θ^{−1} ψ_γ.
Proof. By definition,
Ṙ̂_n(γ̂_n) = 0, Ṙ(γ) = 0.
So we have
0 = Ṙ̂_n(γ̂_n)
  = ν_n(γ̂_n)/√n + Ṙ(γ̂_n)
  = ν_n(γ̂_n)/√n + Ṙ(γ̂_n) − Ṙ(γ)
  =: (i) + (ii).
So we arrive at
0 = Ṙ̂_n(γ) + o_{IP_θ}(1/√n) + M_θ(γ̂_n − γ) + o(‖γ̂_n − γ‖).
Because, by the CLT, Ṙ̂_n(γ) = O_{IP_θ}(1/√n), this implies ‖γ̂_n − γ‖ = O_{IP_θ}(1/√n). Hence
0 = Ṙ̂_n(γ) + M_θ(γ̂_n − γ) + o_{IP_θ}(1/√n),
or
M_θ(γ̂_n − γ) = −Ṙ̂_n(γ) + o_{IP_θ}(1/√n) = −P̂_n ψ_γ + o_{IP_θ}(1/√n),
or
γ̂_n − γ = −P̂_n M_θ^{−1} ψ_γ + o_{IP_θ}(1/√n). □
with
Vθ = Mθ−1 Jθ Mθ−1 .
ψ̇_c(x) = (∂/∂c^T) ψ_c(x)
(a p × p matrix). Assume moreover that, for all c and c̃ in a neighborhood of γ, and for all x, we have, in matrix norm,
‖ψ̇_c(x) − ψ̇_{c̃}(x)‖ ≤ H(x)‖c − c̃‖,
where H : X → R satisfies
P_θ H < ∞.
Then
M_θ := (∂/∂c^T) Ṙ(c)|_{c=γ} = P_θ ψ̇_γ,   (14.3)
and γ̂_n is asymptotically linear with influence function
l_θ = −M_θ^{−1} ψ_γ.
so that
‖Ṙ̂_n(γ) + P̂_n ψ̇_γ (γ̂_n − γ)‖ ≤ P̂_n H ‖γ̂_n − γ‖² = O_{IP_θ}(1)‖γ̂_n − γ‖²,
where in the last equality we used P_θ H < ∞. Now, by the law of large numbers,
P̂_n ψ̇_γ = P_θ ψ̇_γ + o_{IP_θ}(1) = M_θ + o_{IP_θ}(1).
Thus
Ṙ̂_n(γ) + M_θ(γ̂_n − γ) + o_{IP_θ}(‖γ̂_n − γ‖) = O_{IP_θ}(‖γ̂_n − γ‖²).
Because Ṙ̂_n(γ) = O_{IP_θ}(1/√n) by the CLT, this ensures that ‖γ̂_n − γ‖ = O_{IP_θ}(1/√n). It follows that
Ṙ̂_n(γ) + M_θ(γ̂_n − γ) + o_{IP_θ}(1/√n) = O_{IP_θ}(1/n).
Hence
M_θ(γ̂_n − γ) = −Ṙ̂_n(γ) + o_{IP_θ}(1/√n),
and so
γ̂_n − γ = −M_θ^{−1} Ṙ̂_n(γ) + o_{IP_θ}(1/√n) = −P̂_n M_θ^{−1} ψ_γ + o_{IP_θ}(1/√n). □
Note The asymptotic normality follows again from the asymptotic linearity
established in Theorem 14.3.2.
In this section, we show that, under regularity conditions, the MLE is asymp-
totically normal with asymptotic covariance matrix the inverse of the Fisher-
information matrix I(θ). We use that maximum likelihood estimation is a
special case of M-estimation and apply the results of the previous section. In
order to do so we need to assume regularity conditions.
Let P = {Pθ : θ ∈ Θ} be dominated by a σ-finite dominating measure ν, and
write the densities as pθ = dPθ /dν. Suppose that Θ ⊂ Rp . Assume that the
support of pθ does not depend on θ (Condition I in Section 5.5). As loss we
take minus the log-density:
ρθ := − log pθ .
The MLE is
θ̂n := arg max P̂n log pθ̃ .
θ̃∈Θ
I(θ) := P_θ s_θ s_θ^T.
Now, it is clear that ψ_θ = −s_θ, and, assuming derivatives exist and that again we may change the order of differentiation and integration,
P_θ ṡ_θ = P_θ (p̈_θ/p_θ − ṗ_θ ṗ_θ^T/p_θ²)
       = (∂²/∂θ∂θ^T) 1 − P_θ s_θ s_θ^T
       = 0 − I(θ) = −I(θ).
The influence function is thus
l_θ = −M_θ^{−1} ψ_θ = I(θ)^{−1} s_θ.
It follows that
√n(θ̂_n − θ) −→^{D_θ} N(0, I^{−1}(θ)).
where
ρ(x) := (1 − α)|x| 1{x < 0} + α|x| 1{x > 0}.
We have
ρ̇(x) = α 1{x > 0} − (1 − α) 1{x < 0}.
Note that ρ̇ does not exist at x = 0. This is one of the irregularities in this example. It follows that
ψ_c(x) = −α 1{x > c} + (1 − α) 1{x < c}.
Hence
Ṙ(c) = Pθ ψc = −α + F (c)
(the fact that ψc is not defined at x = c can be shown not to be a problem, roughly
because a single point has probability zero, as F is assumed to be continuous).
So
Ṙ(γ) = 0, for γ = F −1 (α).
l_θ(x) = −M_θ^{−1} ψ_γ(x) = (α − 1{x < γ}) / f(γ).
Note that in the special case α = 1/2 (where γ is the median), this becomes
l_θ(x) = −1/(2f(γ)) for x < γ, and +1/(2f(γ)) for x > γ.
which we write as the sample quantile γ̂_n = F̂_n^{−1}(α) (or an approximation thereof up to order o_{IP_θ}(1/√n)), one has
√n(F̂_n^{−1}(α) − F^{−1}(α)) −→^{D_θ} N(0, α(1 − α)/f²(F^{−1}(α))).
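A simulation sketch (illustrative only) confirming this for the median of N(0,1) data, where α(1 − α)/f²(F^{−1}(α)) = (1/4)/φ(0)² = π/2:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_rep = 401, 4000

# Monte Carlo variance of the sample median, scaled by n
medians = np.median(rng.normal(size=(n_rep, n)), axis=1)
mc_var = n * medians.var()

asym_var = np.pi / 2   # alpha(1-alpha)/f^2 = (1/4)/phi(0)^2 for N(0,1)
```

The scaled Monte Carlo variance is close to π/2 ≈ 1.57, as the asymptotic formula predicts.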
with
ρ(x) = x² if |x| ≤ k, and ρ(x) = k(2|x| − k) if |x| > k.
We define γ as
γ := arg min_c P_θ ρ_c.
It holds that
ρ̇(x) = 2x for |x| ≤ k, +2k for x > k, −2k for x < −k.
Therefore,
ψ_c(x) = −2(x − c) for |x − c| ≤ k, −2k for x − c > k, +2k for x − c < −k.
One easily derives that
Ṙ(c) := P_θ ψ_c = −2 ∫_{−k+c}^{k+c} x dF(x) + 2c[F(k + c) − F(−k + c)] − 2k[1 − F(k + c)] + 2kF(−k + c).
So
M_θ = (d/dc) Ṙ(c)|_{c=γ} = 2[F(k + γ) − F(−k + γ)].
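The Huber Z-estimator is straightforward to compute, because (1/n)∑ᵢ ψ_c(Xᵢ) is nondecreasing in c; a minimal sketch (illustrative data; k = 1.345 is a commonly used tuning value, not one prescribed by the notes):

```python
def psi(x, c, k):
    # psi_c(x) as displayed above
    if abs(x - c) <= k:
        return -2 * (x - c)
    return -2 * k if x - c > k else 2 * k

def huber_estimate(xs, k=1.345, tol=1e-10):
    # Z-estimator: solve sum_i psi_c(X_i) = 0 by bisection;
    # the left-hand side is nondecreasing in c.
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(psi(x, mid, k) for x in xs) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

data = [0.8, 1.1, 0.9, 1.3, 1.0, 12.0]   # one gross outlier
est = huber_estimate(data)               # near 1.29, unlike the mean (2.85)
```

The outlier's contribution is clipped at ±2k, so the estimate stays close to the bulk of the data.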
γ ∈ Γ ⊂ R.
Definition 14.6.1 Let T_{n,1} and T_{n,2} be two estimators of γ that satisfy
√n(T_{n,j} − γ) −→^{D_θ} N(0, V_{θ,j}), j = 1, 2.
Then
e_{2:1} := V_{θ,1}/V_{θ,2}
is called the asymptotic relative efficiency of T_{n,2} with respect to T_{n,1}.
If e2:1 > 1, the estimator Tn,2 is asymptotically more efficient than Tn,1 . An
asymptotic (1 − α)-confidence interval for γ based on Tn,2 is then narrower than
the one based on Tn,1 .
Example 14.6.1 Asymptotic relative efficiency of sample mean and
sample median
Let X = R, and F be the distribution function of X. Suppose that F is sym-
metric around the parameter of interest µ. In other words,
F (·) = F0 (· − µ),
e2:1 = 2.
14.6. ASYMPTOTIC RELATIVE EFFICIENCY 155
Thus, the sample median, which is the MLE for this case, is the winner.
It follows that
e_{2:1} = (2/π)(1 − 2η/3)²(1 + 8η).
Let us now further compare the results with the α-trimmed mean. Because F is symmetric, it turns out that the α-trimmed mean has the same influence function as the Huber-estimator with k = F_0^{−1}(1 − α):
l_θ(x) = (1/(F_0(k) − F_0(−k))) × { x − µ for |x − µ| ≤ k; +k for x − µ > k; −k for x − µ < −k }.
(This can be seen from Example 15.3.2 ahead which is not part of the exam).
The influence function is used to compute the asymptotic variance V_{θ,α} of the α-trimmed mean:
V_{θ,α} = [∫_{F_0^{−1}(α)}^{F_0^{−1}(1−α)} x² dF_0(x) + 2α(F_0^{−1}(1 − α))²] / (1 − 2α)².
From this, we then calculate the asymptotic relative efficiency of the α-trimmed
mean w.r.t. the mean. Note that the median is the limiting case with α → 1/2.
Table: Asymptotic relative efficiency of α-trimmed mean over mean

            α = 0.05   α = 0.125   α = 0.5
η = 0.00      0.99       0.94       0.64
η = 0.05      1.20       1.19       0.83
η = 0.25      1.40       1.66       1.33
V_θ = M_θ^{−1} J_θ M_θ^{−1},
where
J_θ = P_θ ψ_γ ψ_γ^T,
and
M_θ := R̈(γ) = (∂/∂c^T) Ṙ(c)|_{c=γ} = (∂/∂c^T) P_θ ψ_c|_{c=γ} = P_θ ψ̇_γ.
Let
Ĵ_n := P̂_n ψ_{γ̂_n} ψ_{γ̂_n}^T
and
M̂_n := R̈̂_n(γ̂_n) = P̂_n ψ̇_{γ̂_n} = (1/n)∑_{i=1}^n ψ̇_{γ̂_n}(X_i).
Then V̂_n := M̂_n^{−1} Ĵ_n M̂_n^{−1} is a consistent estimator of V_θ.⁴
be the MLE.
Assume enough regularity such as existence of derivatives and interchanging
integration and differentiation. Recall that θ̂_n is an M-estimator with loss function ρ_ϑ = − log p_ϑ, and hence under regularity conditions, ψ_ϑ = ρ̇_ϑ is minus
⁴From most algorithms used to compute the M-estimator γ̂_n, one can easily obtain M̂_n and Ĵ_n as output. Recall e.g. that the Newton-Raphson algorithm is based on the iterations
γ̂_new = γ̂_old − (R̈̂_n(γ̂_old))^{−1} Ṙ̂_n(γ̂_old).
the score function sϑ := ṗϑ /pϑ . The asymptotic variance of the MLE is I −1 (θ),
where I(θ) := Pθ sθ sTθ is the Fisher information:
√n(θ̂_n − θ) −→^{D_θ} N(0, I^{−1}(θ)), ∀ θ.
Hence,
2L_n(θ̂_n) − 2L_n(θ) ≈ n(P̂_n s_θ)^T I(θ)^{−1} (P̂_n s_θ).
The result now follows from
√n P̂_n s_θ −→^{D_θ} N(0, I(θ)). □
Pθ (X = j) := πj , j = 1, . . . , k.
where the probabilities π_j are positive and add up to one: ∑_{j=1}^k π_j = 1, but are assumed to be otherwise unknown. Then there are p := k − 1 unknown parameters, say θ = (π_1, …, π_{k−1}). Define N_j := #{i : X_i = j}. Note that (N_1, …, N_k) has a multinomial distribution with parameters n and (π_1, …, π_k).
Lemma 14.9.1 For each j = 1, …, k, the MLE of π_j is
π̂_j = N_j/n.
so that
∑_{i=1}^n log p_θ(X_i) = ∑_{j=1}^k N_j log π_j,
and thus
1 = ∑_{j=1}^k π̂_j = n π̂_k/N_k,
yielding
π̂_k = N_k/n,
and hence
π̂_j = N_j/n, j = 1, …, k. □
We now first calculate Zn,1 (θ). For that, we need to find the Fisher information
I(θ).
Lemma 14.9.2 The Fisher information is
I(θ) = diag(1/π_1, …, 1/π_{k−1}) + (1/π_k) ιι^T,
where ι denotes the (k − 1)-vector of ones.
□
We thus find
Z_{n,1}(θ) = n ∑_{j=1}^k (π̂_j − π_j)²/π_j = ∑_{j=1}^k (N_j − nπ_j)²/(nπ_j).
The approximation log(1 + x) ≈ x − x²/2 shows that 2L_n(θ̂_n) − 2L_n(θ) ≈ Z_{n,2}(θ):
2L_n(θ̂_n) − 2L_n(θ) = −2 ∑_{j=1}^k N_j log(1 + (π_j − π̂_j)/π̂_j)
                    ≈ −2 ∑_{j=1}^k N_j (π_j − π̂_j)/π̂_j + ∑_{j=1}^k N_j ((π_j − π̂_j)/π̂_j)²
                    = Z_{n,2}(θ).
The three asymptotic pivots Zn,1 (θ), Zn,2 (θ) and 2Ln (θ̂n ) − 2Ln (θ) are each
asymptotically χ2k−1 -distributed under IPθ .
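The near-equivalence of the pivots can be seen on a concrete (made-up) multinomial outcome; a minimal sketch computing the Pearson statistic Z_{n,1}(θ) and the likelihood-ratio statistic:

```python
import math

n = 1000
pi = [0.2, 0.3, 0.5]      # hypothesized probabilities under the null
N = [210, 290, 500]       # observed counts N_j, summing to n

# Pearson statistic Z_{n,1}(theta) = sum_j (N_j - n pi_j)^2 / (n pi_j)
z1 = sum((Nj - n * pj) ** 2 / (n * pj) for Nj, pj in zip(N, pi))

# Likelihood-ratio statistic 2 L_n(theta_hat) - 2 L_n(theta),
# with pi_hat_j = N_j / n
z_lr = 2 * sum(Nj * math.log((Nj / n) / pj) for Nj, pj in zip(N, pi))
```

For these counts z1 = 5/6 ≈ 0.833 and z_lr ≈ 0.829: the two statistics agree to first order, and both are to be compared with χ² quantiles on k − 1 = 2 degrees of freedom.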
where Gp is the distribution function of the χ2p -distribution. This test has
approximately level α as was shown in Lemma 14.8.1.
Consider now the hypothesis
H0 : R(θ) = 0,
where
R(θ) = (R_1(θ), …, R_q(θ))^T
are q restrictions on θ (with R : R^p → R^q a given function⁸).
Let θ̂_n be the unrestricted MLE, that is,
θ̂_n = arg max_{ϑ∈Θ} ∑_{i=1}^n log p_ϑ(X_i).
As in the sketch of the proof of Lemma 14.8.1, we can use a two-term Taylor
expansion to show for any sequence ϑn satisfying ϑn = θ + OIPθ (n−1/2 ), that
2 ∑_{i=1}^n (log p_{ϑ_n}(X_i) − log p_θ(X_i))
⁸The notation R is used here for the restrictions; it has nothing to do with the risk R.
= 2√n(ϑ_n − θ)^T Z_n − n(ϑ_n − θ)^T I(θ)(ϑ_n − θ) + o_{IP_θ}(1).
Here, we also again use that ∑_{i=1}^n ṡ_{ϑ_n}(X_i)/n = −I(θ) + o_{IP_θ}(1). Moreover, by
a one-term Taylor expansion, and invoking that R(θ) = 0,
2 ∑_{i=1}^n (log p_{θ̂_n}(X_i) − log p_θ(X_i)) − 2 ∑_{i=1}^n (log p_{θ̂_n^0}(X_i) − log p_θ(X_i))
= Z_n^T I(θ)^{−1} Ṙ(θ)^T (Ṙ(θ) I(θ)^{−1} Ṙ(θ)^T)^{−1} Ṙ(θ) I(θ)^{−1} Z_n + o_{IP_θ}(1).
Define
Y_n := Ṙ(θ) I(θ)^{−1} Z_n, W := Ṙ(θ) I(θ)^{−1} Ṙ(θ)^T.
We know that
Z_n −→^{D_θ} N(0, I(θ)).
Hence
Y_n −→^{D_θ} N(0, W),
so that
Y_n^T W^{−1} Y_n −→^{D_θ} χ²_q. □
Corollary 14.10.1 From the sketch of the proof of Lemma 14.10.1, one sees that moreover (under regularity),
2L_n(θ̂_n) − 2L_n(θ̂_n^0) ≈ n(θ̂_n − θ̂_n^0)^T I(θ̂_n^0)(θ̂_n − θ̂_n^0).
From Section 14.9, we know that the (unrestricted) MLE of π_{j,k} is equal to
π̂_{j,k} := N_{j,k}/n.
We now want to test whether the two labels are independent. The null-
hypothesis is
H0 : πj,k = (πj,+ ) × (π+,k ) ∀ (j, k).
Here
π_{j,+} := ∑_{k=1}^s π_{j,k}, π_{+,k} := ∑_{j=1}^r π_{j,k}.
where
π̂_{j,+} := ∑_{k=1}^s π̂_{j,k}, π̂_{+,k} := ∑_{j=1}^r π̂_{j,k}.
Chapter 15

Abstract asymptotics ?
In Section 5.5 we obtained the Cramér–Rao lower bound. A somewhat disappointing result was that it can only be reached within exponential families (see Lemma 5.6.1). We have also seen in Section 14.4 that the MLE is asymptotically unbiased and asymptotically reaches the CRLB (its asymptotic covariance matrix is I(θ)^{−1}, where I(θ) is the Fisher information for estimating θ). The topic of the second part of this chapter is to show (in the one-dimensional case, for simplicity) that 1/I(θ) is indeed asymptotically the efficient variance.
One may now wonder why the inverse of I(θ) is there. Think of it in this way.
The Fisher information was obtained by looking at derivatives of the map
θ 7→ log pθ .
P 7→ θ
P 7→ γ.
When X is Euclidean space, one can define the distribution function F (x) :=
Pθ (X ≤ x) and the empirical distribution function
F̂_n(x) = (1/n) #{X_i ≤ x, 1 ≤ i ≤ n}.
This is the distribution function of a probability measure that puts mass 1/n at
each observation. For general X , we define likewise the empirical distribution P̂n
as the distribution that puts mass 1/n at each observation, i.e., more formally
P̂_n := (1/n) ∑_{i=1}^n δ_{X_i},
so that for measurable sets A,
P̂_n(A) = (1/n) #{X_i ∈ A, 1 ≤ i ≤ n}.
n
For (measurable) functions f : X → Rr (for some r ∈ N), we write, as in
previous sections,
P̂_n f := (1/n) ∑_{i=1}^n f(X_i) = ∫ f dP̂_n,
so that
P_θ(A) = P_θ 1_A.
γ = g(θ) ∈ Rp .
γ = Q(Pθ ),
Tn := Q(P̂n ).
Conversely, a plug-in estimator may also depend on n:
T_n = Q_n(P̂_n).
For example, the M-estimator arg min_{c∈Γ} P̂_n ρ_c is a plug-in estimator of
γ = arg min_{c∈Γ} P_θ ρ_c,
and the Z-estimator solving P̂_n ψ_c = 0 is a plug-in estimator of the solution γ of
P_θ ψ_c |_{c=γ} = 0.
What is its theoretical counterpart? Because the i-th order statistic X_{(i)} can be written as
X_{(i)} = F̂_n^{−1}(i/n),
and in fact
X_{(i)} = F̂_n^{−1}(u), i/n ≤ u < (i + 1)/n,
we may write, for α_n := [nα]/n,
T_n = (n/(n − 2[nα])) (1/n) ∑_{i=[nα]+1}^{n−[nα]} F̂_n^{−1}(i/n)
    = (1/(1 − 2α_n)) ∫_{α_n + 1/n}^{1−α_n} F̂_n^{−1}(u) du := Q_n(P̂_n).
f̂_n(x) := (F̂_n(x + h_n) − F̂_n(x − h_n)) / (2h_n) := Q_n(P̂_n),
with h_n “small” (h_n → 0 as n → ∞).
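A sketch (illustrative) of this plug-in density estimator on simulated N(0,1) data, evaluated at x = 0 where the true density is φ(0) ≈ 0.3989:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=100_000)    # sample whose density phi we estimate

def f_hat(x, sample, h):
    # (F_hat(x + h) - F_hat(x - h)) / (2h): fraction of observations
    # in (x - h, x + h], scaled by the window width
    return np.mean((sample > x - h) & (sample <= x + h)) / (2 * h)

est = f_hat(0.0, X, h=0.1)      # close to phi(0) = 0.3989...
```

Smaller h reduces the smoothing bias but increases the variance, which is the usual bandwidth trade-off.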
Let ε > 0 be arbitrary, and take a_0 < a_1 < · · · < a_{N−1} < a_N in such a way that
F(a_j) − F(a_{j−1}) ≤ ε, j = 1, …, N,
and, for a_{j−1} ≤ x < a_j,
F̂_n(x) − F(x) ≥ F̂_n(a_{j−1}) − F(a_j) ≥ F̂_n(a_{j−1}) − F(a_{j−1}) − ε,
so
sup_x |F̂_n(x) − F(x)| ≤ max_{1≤j≤N} |F̂_n(a_j) − F(a_j)| + ε → ε, IP_θ-a.s. □
Example 15.2.1 Consistency of the sample median
Let X = R and let F be the distribution function of X. We consider estimating the median γ := F^{−1}(1/2). We assume F to be continuous and strictly increasing. The sample median is
T_n := F̂_n^{−1}(1/2) := X_{((n+1)/2)} for n odd, and [X_{(n/2)} + X_{(n/2+1)}]/2 for n even.
So
F̂_n(T_n) = 1/2 + 1/(2n) for n odd, and 1/2 for n even.
It follows that
P̃ f := EP̃ f (X).
P lP (:= EP lP (X)) = 0.
The following result says that Fréchet differentiable functionals are generally
asymptotically linear.
Lemma 15.3.1 Suppose that Q is Fréchet differentiable at P with influence function l_P, and that
d(P̂_n, P) = O_IP(n^{−1/2}).   (15.1)
Then
Q(P̂_n) − Q(P) = P̂_n l_P + o_IP(n^{−1/2}).
Then indeed d(P̂n , P ) = OIP (n−1/2 ). This follows from Donsker’s theorem,
which we state here without proof:
Theorem 15.3.1 (Donsker's theorem) Suppose F is continuous. Then
sup_x √n |F̂_n(x) − F(x)| −→^D Z,
Fréchet differentiability is generally quite hard to prove, and often not even true.
We will sketch how to establish Gâteaux differentiability in two examples.
Example 15.3.1 Asymptotic linearity of the Z-estimator
We consider again the asymptotic linearity of the Z-estimator. Throughout this example, we assume enough regularity. Let γ be defined by the equation
P ψ_γ = 0,
and, for ε > 0 and P_ε := (1 − ε)P + εP̃, let γ_ε be the solution of
P_ε ψ_{γ_ε} = 0,
so
P ψ_{γ_ε} + ε(P̃ − P)ψ_{γ_ε} = 0,
and hence
P(ψ_{γ_ε} − ψ_γ) + ε(P̃ − P)ψ_{γ_ε} = 0.
Assuming differentiability of c ↦ P ψ_c, we obtain
P(ψ_{γ_ε} − ψ_γ) = (∂/∂c^T) P ψ_c |_{c=γ} (γ_ε − γ) + o(|γ_ε − γ|)
                := M_P(γ_ε − γ) + o(|γ_ε − γ|).
It follows that
M_P(γ_ε − γ) + o(|γ_ε − γ|) + ε(P̃ − P)ψ_γ + o(ε) = 0,
or, assuming M_P to be invertible,
(γ_ε − γ)(1 + o(1)) = −ε M_P^{−1} P̃ ψ_γ + o(ε),
which gives
(γ_ε − γ)/ε → −M_P^{−1} P̃ ψ_γ.
The influence function is thus (as already seen in Subsection 14.3)
l_P = −M_P^{−1} ψ_γ.
Example 15.3.2 Asymptotic linearity of the α-trimmed mean
The α-trimmed mean is a plug-in estimator of
γ := Q(P) = (1/(1 − 2α)) ∫_{F^{−1}(α)}^{F^{−1}(1−α)} x dF(x).
Using partial integration, we may write this as
(1 − 2α)γ = (1 − α)F^{−1}(1 − α) − αF^{−1}(α) − ∫_α^{1−α} v dF^{−1}(v).
= (1 − 2α) P̃ l_P,
where
l_P(x) = −(1/(1 − 2α)) ∫_{F^{−1}(α)}^{F^{−1}(1−α)} (1{x ≤ u} − F(u)) du.
We conclude that, under regularity conditions, the α-trimmed mean is asymptotically linear with the above influence function l_P, and hence asymptotically normal with asymptotic variance P l_P².
To make all this mathematically precise is quite involved. We refer to van der Vaart (1998). A glimpse is given in Le Cam's 3rd Lemma; see the next subsection.
Remark 2 Note that when θ = θ_n is allowed to change with n, the distribution of X_i changes with n, and hence X_i itself may change with n. Instead of regarding the sample X_1, …, X_n as the first n of an infinite sequence, we now consider for each n a new sample, say X_{n,1}, …, X_{n,n}.
Remark 3 We have seen that the MLE θ̂n generally is indeed asymptotically
unbiased with asymptotic variance Vθ equal to 1/I(θ), i.e., under regularity
assumptions, the MLE is asymptotically efficient.
For asymptotically linear estimators, with influence function lθ , one has asymp-
totic variance Vθ = Eθ lθ2 (X). The next lemma indicates that generally 1/I(θ)
is indeed a lower bound for the asymptotic variance.
Lemma 15.4.1 Suppose asymptotic linearity:
T_n − θ = (1/n) ∑_{i=1}^n l_θ(X_i) + o_{IP_θ}(n^{−1/2}).
Then
V_θ ≥ 1/I(θ).
P ψ_θ = 0,
where
M_θ := (d/dθ) P ψ_θ.
Under regularity, we have
M_θ = P ψ̇_θ = ∫ ψ̇_θ p_θ dν, ψ̇_θ = (d/dθ) ψ_θ.
We may also write
M_θ = −∫ ψ_θ ṗ_θ dν, ṗ_θ = (d/dθ) p_θ.
This follows from the chain rule
(d/dθ)(ψ_θ p_θ) = ψ̇_θ p_θ + ψ_θ ṗ_θ,
and (under regularity)
∫ (d/dθ)(ψ_θ p_θ) dν = (d/dθ) ∫ ψ_θ p_θ dν = (d/dθ) P ψ_θ = (d/dθ) 0 = 0.
Thus
P l_θ s_θ = −M_θ^{−1} P ψ_θ s_θ = −M_θ^{−1} ∫ ψ_θ ṗ_θ dν = 1,
or, as h → 0,
1 = (P_{θ+h} − P_θ) l_θ / h + o(1) = ∫ l_θ (p_{θ+h} − p_θ) dν / h + o(1) → ∫ l_θ ṗ_θ dν = P_θ(l_θ s_θ).
So (15.2) holds.
Then
√n(T_n − θ) −→^{D_θ} N(0, 1) if θ ≠ 0, and −→^{D_θ} N(0, 1/4) if θ = 0.
So the pointwise asymptotics show that T_n can be more efficient than the sample average X̄_n. But what happens if we consider sequences θ_n? For example, let θ_n = h/√n. Then, under IP_{θ_n}, X̄_n = ε̄_n + h/√n = O_{IP_{θ_n}}(n^{−1/2}). Hence IP_{θ_n}(|X̄_n| > n^{−1/4}) → 0, so that IP_{θ_n}(T_n = X̄_n) → 0. Thus,
√n(T_n − θ_n) = √n(T_n − θ_n) 1{T_n = X̄_n} + √n(T_n − θ_n) 1{T_n = X̄_n/2}
            −→^{D_{θ_n}} N(−h/2, 1/4).
The asymptotic mean square error AMSE_θ(T_n) is defined as the asymptotic variance + asymptotic squared bias:
AMSE_{θ_n}(T_n) = (1 + h²)/4.
The AMSE_θ(X̄_n) of X̄_n is its normalized non-asymptotic mean square error, which is
AMSE_{θ_n}(X̄_n) = MSE_{θ_n}(X̄_n) = 1.
So when h is large enough, the asymptotic mean square error of Tn is larger
than that of X̄n .
Le Cam’s 3rd lemma shows that asymptotic linearity for all θ implies asymptotic
√
normality, now also for sequences θn = θ + h/ n. The asymptotic variance for
such sequences θn does not change. Moreover, if (15.2) holds for all θ, the
estimator is also asymptotically unbiased under IPθn .
Lemma 15.5.1 (Le Cam's 3rd Lemma) Suppose that for all θ,
T_n − θ = (1/n) ∑_{i=1}^n l_θ(X_i) + o_{IP_θ}(n^{−1/2}).
Then, for θ_n = θ + h/√n,
√n(T_n − θ_n) −→^{D_{θ_n}} N({P_θ(l_θ s_θ) − 1}h, V_θ).
We will present a sketch of the proof of this lemma. For this purpose, we need
the following auxiliary lemma.
φ_Z(z) e^{z_2} = φ_Y(z).
as
(1/n) ∑_{i=1}^n ṡ_θ(X_i) ≈ E_θ ṡ_θ(X) = −I(θ).
Thus,
(√n(T_n − θ), Λ_n) −→^{D_θ} Z,
where Z = (Z_1, Z_2) ∈ R² has the two-dimensional normal distribution
Z ∼ N( (0, −(h²/2) I(θ))^T , [ V_θ, hP_θ(l_θ s_θ) ; hP_θ(l_θ s_θ), h² I(θ) ] ).
Thus, we know that for all bounded and continuous f : R² → R, one has
E_θ f(√n(T_n − θ), Λ_n) → Ef(Z_1, Z_2).
We may write
E_{θ_n} f(√n(T_n − θ)) = E_θ f(√n(T_n − θ)) e^{Λ_n}.
The function (z_1, z_2) ↦ f(z_1) e^{z_2} is continuous, but not bounded. However, one can show that one may extend the Portmanteau Theorem to this situation. This then yields
E_θ f(√n(T_n − θ)) e^{Λ_n} → Ef(Z_1) e^{Z_2}.
Now, apply the auxiliary lemma, with
\[
\mu = \begin{pmatrix} 0 \\ -\frac{h^2}{2} I(\theta) \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} V_\theta & h P_\theta(l_\theta s_\theta) \\ h P_\theta(l_\theta s_\theta) & h^2 I(\theta) \end{pmatrix}.
\]
Then we get
\[
E f(Z_1) e^{Z_2} = \int f(z_1) e^{z_2} \phi_Z(z)\, dz = \int f(z_1) \phi_Y(z)\, dz = E f(Y_1),
\]
where
\[
Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
\sim N\left( \begin{pmatrix} h P_\theta(l_\theta s_\theta) \\ \frac{h^2}{2} I(\theta) \end{pmatrix},
\begin{pmatrix} V_\theta & h P_\theta(l_\theta s_\theta) \\ h P_\theta(l_\theta s_\theta) & h^2 I(\theta) \end{pmatrix} \right),
\]
so that
\[
Y_1 \sim N\bigl(h P_\theta(l_\theta s_\theta),\; V_\theta\bigr).
\]
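The density identity used here, φ_Z(z)e^{z₂} = φ_Y(z) with the mean of Y obtained by shifting µ by the second column of Σ (which requires µ₂ = −Σ₂₂/2, as is indeed the case above), can be checked numerically. The Python sketch below uses hypothetical values standing in for Vθ, I(θ), h and Pθ(lθ sθ); any choice with positive definite Σ works.

```python
import numpy as np

# Hypothetical stand-ins for V_theta, I(theta), h and P_theta(l_theta s_theta).
V, I_th, h, c = 1.0, 2.0, 0.5, 0.8
mu = np.array([0.0, -h**2 * I_th / 2])           # note mu_2 = -Sigma_22 / 2
Sigma = np.array([[V, h * c], [h * c, h**2 * I_th]])

def dnorm2(z, m, S):
    """Density of the two-dimensional N(m, S) distribution."""
    d = z - m
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / (
        2 * np.pi * np.sqrt(np.linalg.det(S)))

mu_Y = mu + Sigma[:, 1]   # mean of Y: mu shifted by the second column of Sigma
for z in [np.array([0.3, -0.2]), np.array([-1.0, 0.7])]:
    lhs = dnorm2(z, mu, Sigma) * np.exp(z[1])
    rhs = dnorm2(z, mu_Y, Sigma)
    assert abs(lhs - rhs) < 1e-12
print("identity phi_Z(z) e^{z_2} = phi_Y(z) verified at test points")
```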
So we conclude that
\[
\sqrt{n}(T_n - \theta) \stackrel{{\cal D}_{\theta_n}}{\longrightarrow} Y_1 \sim N\bigl(h P_\theta(l_\theta s_\theta),\; V_\theta\bigr).
\]
Hence
\[
\sqrt{n}(T_n - \theta_n) = \sqrt{n}(T_n - \theta) - h
\stackrel{{\cal D}_{\theta_n}}{\longrightarrow} N\bigl(h\{P_\theta(l_\theta s_\theta) - 1\},\; V_\theta\bigr). \qquad \square
\]
182 CHAPTER 15. ABSTRACT ASYMPTOTICS ?
Chapter 16
Complexity regularization
We note that even when the parameter space is infinite-dimensional one need not
always apply complexity regularization. The dimension of Θ is in itself not
always the best description of complexity. If Θ is a metric space, one can use
its so-called entropy as a measure of complexity. The details go beyond the
scope of these lecture notes. We will instead present the non-parametric and the
high-dimensional regression problem as prototype examples.
184 CHAPTER 16. COMPLEXITY REGULARIZATION
Clearly, if all xi are distinct, then fˆLS = Y and one has a perfect fit:
\[
\|Y - \hat f_{\rm LS}\|_2^2 = 0.
\]
This is a typical instance of overfitting. The estimator fˆLS just reproduces the
data and has no predictive power.
We need a model for f , say f ∈ F with F some class of functions.
One way to go is now to do least squares over all f under the restriction that
$\int_0^1 |f'(x)|^2\, dx \le M^2$, where M is a given constant. It turns out that a more
flexible approach is to apply the Lagrangian version of this. Then, choose a
tuning parameter λ ≥ 0 and define the estimator fˆ as
\[
\hat f := \arg\min_f \Bigl\{ \|Y - f\|_2^2 + \lambda^2 \int_0^1 |f'(x)|^2\, dx \Bigr\}.
\]
The integral $\int_0^1 |f'(x)|^2\, dx$ serves as a penalty for choosing a too wiggly function. The penalty regularizes the
function. The tuning parameter λ controls the amount of regularization: the
larger λ, the more regular the estimator will be. The choice of λ is not an easy
point. From a theoretical point of view there are some guidelines. (Choosing λ²
of order $n^{1/3}$ trades off approximation error and estimation error. One may
alternatively invoke Bayesian arguments to choose λ.) In practice one may
apply cross-validation.
16.3. A CONTINUOUS VERSION WITH EXPLICIT SOLUTION ? 185
The tuning parameter λ can be chosen smaller for higher values of m. (A value
of λ² of order $n^{1/(2m+1)}$ gives a trade-off between approximation error and estimation
error.) The resulting estimator is called a smoothing spline.
With a quadratic penalty, the penalized least squares estimator fˆ is not difficult
to compute as it is a minimizer of a quadratic function. Nevertheless, explicit
expressions are typically not available. In the next section, we provide explicit
expressions for the continuous version, as a “curiosity”.
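A discrete sketch of this observation in Python: replacing ∫|f′(x)|²dx by the squared differences ‖Df‖₂², with D a first-difference matrix (a discretization chosen here purely for illustration), the penalized criterion is quadratic in f and its minimizer solves a linear system.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 50, 2.0
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)  # noisy observations

# First-difference matrix: a discrete stand-in for the derivative penalty.
D = np.diff(np.eye(n), axis=0)

# The criterion ||y - f||^2 + lam^2 ||D f||^2 is quadratic in f, so the
# minimizer solves the linear system (I + lam^2 D'D) f = y.
f_hat = np.linalg.solve(np.eye(n) + lam**2 * D.T @ D, y)

# Sanity check: perturbing f_hat can only increase the criterion.
crit = lambda f: np.sum((y - f) ** 2) + lam**2 * np.sum((D @ f) ** 2)
assert crit(f_hat) <= crit(f_hat + 0.01 * rng.standard_normal(n))
```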
We examine the continuous version of the previous section. The problem can
be explicitly solved. Suppose we observe a function y : [0, 1] → R and we aim
at smoothing it using the penalty of the previous section. We let
\[
\hat f = \arg\min_f \Bigl\{ \int_0^1 |y(x) - f(x)|^2\, dx + \lambda^2 \int_0^1 |f'(x)|^2\, dx \Bigr\}. \tag{16.1}
\]
The solution of (16.1) is
\[
\hat f(x) = \frac{C}{\lambda} \cosh\Bigl(\frac{x}{\lambda}\Bigr)
+ \frac{1}{\lambda} \int_0^x y(u) \sinh\Bigl(\frac{u - x}{\lambda}\Bigr)\, du,
\]
where
\[
C = \Bigl( Y(1) - \frac{1}{\lambda} \int_0^1 Y(u) \sinh\Bigl(\frac{1-u}{\lambda}\Bigr)\, du \Bigr) \Big/ \sinh\Bigl(\frac{1}{\lambda}\Bigr),
\]
with
\[
Y(x) = \int_0^x y(u)\, du.
\]
Numerical example

[Figure: two panels, “Denoised, lambda=0.1” (Error = 2.8119e-04) and “Denoised, lambda=0.05” (Error = 7.8683e-05).]

In black is the unknown function f (as this is a simulation we know the unknown f). The red wiggly function is the observed function y. The green function is the estimator fˆ. We see in the second panel that by decreasing the tuning parameter λ the estimator fˆ is closer to the unknown truth f.
i.e. without the square. This seems like a minor modification, but it makes a
huge difference. One may further relax the assumption of differentiability and
define the total variation of the function f as
\[
{\rm TV}(f) := \sum_{i=2}^n |f(x_i) - f(x_{i-1})|,
\]
16.4. ESTIMATING A FUNCTION OF BOUNDED VARIATION 187
where we assume that the xi are ordered: x1 < . . . < xn. This leads to the
estimator
\[
\hat f := \arg\min_f \Bigl\{ \|Y - f\|_2^2 + \lambda^2\, {\rm TV}(f) \Bigr\}.
\]
One may write the estimator as the solution of a linear least squares problem with
an ℓ1-penalty on the coefficients. This relates the problem to that of
Section 16.5. The results there will help to understand why removing the square
in the penalty drastically changes the estimator.
Indeed, write
\[
f(x_i) = \sum_{j=1}^i \bigl( f(x_j) - f(x_{j-1}) \bigr)
= \sum_{j=1}^n \underbrace{\bigl( f(x_j) - f(x_{j-1}) \bigr)}_{=: b_j}\, \underbrace{{\rm l}\{j \le i\}}_{=: \xi_{i,j}}
= \sum_{j=1}^n b_j \xi_{i,j},
\]
with the convention f(x₀) := 0, so that b₁ = f(x₁). In matrix notation f = Ξb with Ξ := (ξ_{i,j}), and TV(f) = Σ_{j=2}^n |b_j|, so that fˆ can indeed be computed by least squares with an ℓ1-penalty on the coefficients b₂, . . . , bₙ. Here ‖b‖₁ = Σ_{j=1}^n |b_j| denotes the ℓ1-norm of the vector b ∈ Rⁿ. This rewriting
makes clear that there are as many parameters as there are observations, namely
n. The penalty takes care that despite this fact one will not overfit the data
(when λ is not too small).
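The rewriting of f in terms of its increments can be checked in a few lines of Python (with the convention b₁ = f(x₁), i.e. f(x₀) := 0):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
f = rng.standard_normal(n)

# Increments b_j = f(x_j) - f(x_{j-1}), with the convention b_1 = f(x_1).
b = np.concatenate([[f[0]], np.diff(f)])

# Design Xi with entries xi_{i,j} = 1{j <= i}: lower-triangular ones.
Xi = np.tril(np.ones((n, n)))

assert np.allclose(Xi @ b, f)             # f(x_i) = sum_{j <= i} b_j
assert np.isclose(np.sum(np.abs(b[1:])),  # TV(f) is the l1-norm of b_2..b_n
                  np.sum(np.abs(np.diff(f))))
print("TV(f) =", np.sum(np.abs(np.diff(f))))
```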
We end this section with a small trip to the case where X is two-dimensional,
say X = [0, 1]². Then the unknown f is an image say, and the observations are
\[
Y_{i_1, i_2} = f(x_{i_1, i_2}) + \epsilon_{i_1, i_2}.
\]
Consider n = m² observations and say they are on a regular grid
\[
x_{i_1, i_2} = \Bigl( \frac{i_1}{m}, \frac{i_2}{m} \Bigr), \quad i_1 = 1, \ldots, m, \; i_2 = 1, \ldots, m.
\]
Now, how can we model smoothness of an image? Again, one may opt to
work with the squares of derivatives, a counterpart of $\int |f'(x)|^2 dx$ for the one-dimensional
case. Alternatively, one may penalize the total variation of the image,
\[
{\rm TV}(f) := \sum_{i_1=2}^m \sum_{i_2=2}^m |\Delta f(x_{i_1, i_2})|,
\]
where
\[
\Delta f(x_{i_1, i_2}) := f\Bigl( \frac{i_1}{m}, \frac{i_2}{m} \Bigr) - f\Bigl( \frac{i_1 - 1}{m}, \frac{i_2}{m} \Bigr) - f\Bigl( \frac{i_1}{m}, \frac{i_2 - 1}{m} \Bigr) + f\Bigl( \frac{i_1 - 1}{m}, \frac{i_2 - 1}{m} \Bigr).
\]
The image reconstruction algorithm is
\[
\hat f := \arg\min_f \Bigl\{ \sum_{i_1=1}^m \sum_{i_2=1}^m \bigl( Y_{i_1, i_2} - f(x_{i_1, i_2}) \bigr)^2 + \lambda^2\, {\rm TV}(f) \Bigr\}.
\]
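The mixed difference ∆f and the corresponding total variation penalty are easy to compute for a given image; a Python sketch with an arbitrary test image (the image and its size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
m = 16
img = rng.standard_normal((m, m)).cumsum(0).cumsum(1)  # an arbitrary test image

# Mixed second difference Delta f(x_{i1,i2}) from the text, computed for
# all i1, i2 = 2, ..., m at once.
delta = img[1:, 1:] - img[:-1, 1:] - img[1:, :-1] + img[:-1, :-1]
tv = np.sum(np.abs(delta))  # total variation penalty of the image
print("TV penalty of the test image:", round(tv, 3))
```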
We now investigate further expressions for both the ridge estimator and the Lasso.
Lemma 16.5.1 The ridge estimator β̂ridge is given by
\[
\hat\beta_{\rm ridge} = (X^T X + \lambda^2 I)^{-1} X^T Y.
\]
Proof. We have
\[
\frac{1}{2} \frac{\partial}{\partial b} \Bigl\{ \|Y - Xb\|_2^2 + \lambda^2 \|b\|_2^2 \Bigr\}
= -X^T (Y - Xb) + \lambda^2 b = -X^T Y + \bigl( X^T X + \lambda^2 I \bigr) b.
\]
Setting this derivative to zero yields the result. □
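A quick numerical sanity check of the lemma in Python (the design matrix, response and λ below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 30, 4, 1.5
X = rng.standard_normal((n, p))
Y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + rng.standard_normal(n)

# Closed form from Lemma 16.5.1: (X'X + lam^2 I)^{-1} X'Y.
beta_ridge = np.linalg.solve(X.T @ X + lam**2 * np.eye(p), X.T @ Y)

# The gradient -X'(Y - Xb) + lam^2 b must vanish at the minimizer.
grad = -X.T @ (Y - X @ beta_ridge) + lam**2 * beta_ridge
assert np.allclose(grad, 0)

# And any perturbation increases the penalized criterion.
crit = lambda b: np.sum((Y - X @ b) ** 2) + lam**2 * np.sum(b**2)
assert crit(beta_ridge) <= crit(beta_ridge + 0.01)
```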
Under the orthogonal design $X^T X = nI$ one obtains the decomposition
\[
E \|X \hat\beta_{\rm ridge} - f\|_2^2
= \underbrace{\Bigl( \frac{\lambda^2/n}{1 + \lambda^2/n} \Bigr)^2 \|X\beta^*\|_2^2}_{({\rm bias})^2}
+ \underbrace{\Bigl( \frac{1}{1 + \lambda^2/n} \Bigr)^2 p \sigma^2}_{\rm variance}
+ \underbrace{\|X\beta^* - f\|_2^2}_{\rm misspecification\ error}.
\]
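This decomposition can be verified exactly, with no simulation, by computing both sides by linear algebra. The Python sketch below builds an arbitrary orthogonal design with XᵀX = nI and uses hypothetical values for λ and the noise standard deviation σ:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam, sigma = 8, 3, 2.0, 1.0
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q                              # now X'X = n I
f = rng.standard_normal(n)                      # arbitrary mean vector E Y
beta_star = np.linalg.solve(X.T @ X, X.T @ f)   # best linear approximation

shrink = 1 / (1 + lam**2 / n)
bias2 = (lam**2 / n * shrink) ** 2 * np.sum((X @ beta_star) ** 2)
variance = shrink**2 * p * sigma**2
mis = np.sum((X @ beta_star - f) ** 2)          # misspecification error

# Exact risk E||X beta_ridge - f||^2 from first principles: with hat matrix
# H = X (X'X + lam^2 I)^{-1} X', the risk is ||Hf - f||^2 + sigma^2 tr(HH').
H = X @ np.linalg.solve(X.T @ X + lam**2 * np.eye(p), X.T)
exact = np.sum((H @ f - f) ** 2) + sigma**2 * np.trace(H @ H.T)
assert np.isclose(bias2 + variance + mis, exact)
```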
If β̂j > 0 it must be a solution of setting the derivative of the above expression
to zero:
\[
-Z_j + n \hat\beta_j + \lambda = 0,
\]
or
\[
\hat\beta_j = Z_j/n - \lambda/n.
\]
Similarly, if β̂j < 0 we must have
\[
-Z_j + n \hat\beta_j - \lambda = 0.
\]
Otherwise β̂j = 0. □
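In other words, β̂j is obtained by soft-thresholding Zj at level λ: β̂j = sign(Zj)(|Zj| − λ)₊/n. A small Python sketch (the values of n, λ and Z are illustrative):

```python
import numpy as np

def soft_threshold(z, lam, n):
    """Coordinatewise Lasso solution under orthogonal design X'X = nI:
    beta_j = sign(z_j) (|z_j| - lam)_+ / n, matching the stationarity
    conditions -z_j + n beta_j ± lam = 0 derived above."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0) / n

n, lam = 10.0, 3.0
Z = np.array([8.0, -1.0, -7.0, 2.5])
beta = soft_threshold(Z, lam, n)
print(beta)  # coordinates with |Z_j| <= lam are set exactly to zero

# Check one coordinate against brute-force minimization of the (doubled)
# one-dimensional criterion -2 z b + n b^2 + 2 lam |b|.
grid = np.linspace(-2, 2, 400001)
crit = -2 * Z[0] * grid + n * grid**2 + 2 * lam * np.abs(grid)
assert abs(grid[np.argmin(crit)] - beta[0]) < 1e-4
```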
Some notation
◦ For a vector z ∈ Rp we let ‖z‖∞ := max_{1≤j≤p} |zj| be its ℓ∞-norm.
◦ We let X1, . . . , Xp denote the columns of X.
◦ For a subset S ⊂ {1, . . . , p} we let XβS* be the best linear approximation of
f := EY using the variables in S, i.e., XβS* is the projection in Rⁿ of f on the
linear space {Σ_{j∈S} Xj b_{S,j} : b_S ∈ R^{|S|}}.
In the next theorem we again assume orthogonal design. For general design,
one needs so-called restricted eigenvalues or compatibility conditions between
the ℓ2-norm and the ℓ1-norm.
Theorem 16.5.1 Consider again fixed design with XᵀX = nI. Let f = EY
and ε = Y − f. Fix some level α ∈ (0, 1) and suppose that for some λα it holds
that
\[
{\rm IP}\bigl( \|X^T \epsilon\|_\infty > \lambda_\alpha \bigr) \le \alpha.
\]
Then for λ > λα we have with probability at least 1 − α
\[
\|X \hat\beta_{\rm Lasso} - f\|_2^2 \le \min_S \Bigl\{ \underbrace{\frac{(\lambda + \lambda_\alpha)^2}{n}\, |S|}_{\rm estimation\ error} + \underbrace{\|X\beta_S^* - f\|_2^2}_{\rm approximation\ error} \Bigr\}.
\]
\[
\|X \hat\beta_{\rm Lasso} - f\|_2^2
\le \frac{(\lambda + \lambda_\alpha)^2}{n}\, \#\{j :\, n|\beta_j| > \lambda + \lambda_\alpha\}
+ \sum_{n|\beta_j| \le \lambda + \lambda_\alpha} n \beta_j^2
= \min_S \Bigl\{ \frac{(\lambda + \lambda_\alpha)^2}{n}\, |S| + \|X\beta_S^* - f\|_2^2 \Bigr\}. \qquad \square
\]
If we compare the result of Theorem 16.5.1 with result iii) of Lemma 12.3.1
concerning the ordinary least squares estimator, we see that the Lasso exhibits
\[
\|X(\hat\beta_{\rm Lasso} - \beta)\|_2^2 \le \frac{(\lambda + \lambda_\alpha)^2}{n}\, s.
\]
The above corollary tells us that the Lasso estimator adapts to favourable sit-
uations where β has many zeroes (i.e. where β is sparse).
To complete the story, we need to study a bound for λα. It turns out that
for many types of error distributions, one can take λα of order √(n log p). We
show this for the case of i.i.d. N(0, σ²) noise. For that purpose, we start with
a bound for the tails of a standard normal random variable.
Lemma 16.5.3 Suppose Z ∼ N(0, 1). Then for all t > 0
\[
{\rm IP}\bigl( Z \ge \sqrt{2t} \bigr) \le \exp[-t].
\]
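One can verify this bound numerically, using the exact standard normal tail P(Z ≥ x) = ½ erfc(x/√2), so that P(Z ≥ √(2t)) = ½ erfc(√t):

```python
import math

# Exact tail P(Z >= sqrt(2t)) = erfc(sqrt(t)) / 2 for Z ~ N(0, 1);
# Lemma 16.5.3 bounds it by exp(-t).
for t in [0.1, 0.5, 1.0, 2.0, 5.0]:
    tail = math.erfc(math.sqrt(t)) / 2
    assert tail <= math.exp(-t)
    print(f"t={t}: P(Z >= sqrt(2t)) = {tail:.3e} <= exp(-t) = {math.exp(-t):.3e}")
```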
Remark. The value α = 1/2 thus gives a bound for the median of ‖Xβ̂Lasso − f‖₂².
In the case of Gaussian errors one may use “concentration of measure” to deduce
that ‖Xβ̂Lasso − f‖₂² is “concentrated” around its median.
16.6 Conclusion
We have seen in this chapter that several concepts from classical statistics also
play their role in the modern version of statistics where the parameter is possibly
high-dimensional. The classical least squares methodology keeps its prominent
place, but now it is equipped with a regularization penalty. More generally,
M-estimators (e.g. maximum likelihood estimators) can also be used in high
dimensions, applying again some regularization technique. The bias-variance
decomposition continues to play its role too, for instance leading to guidelines
for choosing tuning parameters.
Shrinkage estimators play an important role in high-dimensional statistics. This
is also related to the result of Section 11.4 where we have seen that the sample
average in dimension higher than 2 is inadmissible as it can be improved by a
shrinkage estimator.
Complexity regularization can typically be seen as a Bayesian MAP approach.
One may also use the a posteriori mean as estimator etc. Today, Bayesian ap-
proaches to high-dimensional and non-parametric problems are very important
and successful.
Complexity regularization is typically invoked for the construction of adaptive
estimators. An adaptive estimator mimics the situation where the complexity of
the underlying target to be estimated is known beforehand. To evaluate the
performance of an adaptive estimator, one uses as benchmark the case where the
(hopefully low) complexity of the target is indeed known. Thus the benchmark
comes from classical statistical theory.
Chapter 17
Literature