
Mathematical Statistics

Sara van de Geer

September 2019
Contents

1 Introduction
1.1 Speed of light example
1.2 Notation
1.3 Example: the location model
1.4 Some further examples of statistical models

2 Estimation
2.1 What is an estimator?
2.2 The empirical distribution function
2.3 Some estimators for the location model
2.4 How to construct estimators
2.4.1 Plug-in estimators
2.4.2 The method of moments
2.4.3 Likelihood methods
2.5 Asymptotic tests and confidence intervals based on the likelihood

3 Intermezzo: distribution theory
3.1 Conditional distributions
3.2 The multinomial distribution
3.3 The Poisson distribution
3.4 The distribution of the maximum of two random variables

4 Sufficiency and exponential families
4.1 Sufficiency
4.2 Factorization Theorem of Neyman
4.3 Exponential families
4.4 Intermezzo: the mean and covariance matrix of a random vector
4.5 Canonical form of an exponential family
4.6 Reparametrizing in the one-dimensional case
4.7 Score function and Fisher information
4.8 Score function for exponential families
4.9 Minimal sufficiency

5 Bias, variance and the Cramér-Rao lower bound
5.1 What is an unbiased estimator?
5.2 UMVU estimators
5.3 The Lehmann-Scheffé Lemma
5.4 Completeness for exponential families
5.5 The Cramér-Rao lower bound
5.6 CRLB and exponential families
5.7 Higher-dimensional extensions

6 Tests and confidence intervals
6.1 Intermezzo: quantile functions
6.2 How to construct tests
6.3 Equivalence of confidence sets and tests
6.4 Comparison of confidence intervals and tests
6.5 An illustration: the two-sample problem
6.5.1 Student's test
6.5.2 Wilcoxon's test
6.5.3 Comparison of Student's test and Wilcoxon's test

7 The Neyman-Pearson Lemma and UMP tests
7.1 The Neyman-Pearson Lemma
7.2 Uniformly most powerful tests
7.2.1 An example
7.3 UMP tests and exponential families
7.4 One- and two-sided tests: an example with the Bernoulli distribution
7.5 Unbiased tests
7.6 Conditional tests ?

8 Comparison of estimators
8.1 Definition of risk
8.2 Risk and sufficiency
8.3 Rao-Blackwell
8.4 Sensitivity and robustness
8.5 Computational aspects

9 Equivariant statistics
9.1 Equivariance in the location model
9.1.1 Construction of the UMRE estimator
9.1.2 Quadratic loss: the Pitman estimator
9.1.3 Invariant statistics
9.1.4 Quadratic loss and Basu's Lemma
9.2 Equivariance in the location-scale model ?
9.2.1 Construction of the UMRE estimator ?
9.2.2 Quadratic loss ?

10 Decision theory
10.1 Decisions and their risk
10.2 Admissible decisions
10.2.1 Not using the data at all is admissible
10.2.2 A Neyman-Pearson test is admissible
10.3 Minimax decisions
10.3.1 Minimax Neyman-Pearson test
10.4 Bayes decisions
10.4.1 Bayes test
10.5 Construction of Bayes estimators
10.5.1 Bayes test revisited
10.5.2 Bayes estimator for quadratic loss
10.5.3 Bayes estimator and the maximum a posteriori estimator
10.5.4 Three worked-out examples
10.6 Discussion of Bayesian approach
10.7 Integrating parameters out ?

11 Proving admissibility and minimaxity
11.1 Minimaxity
11.1.1 Minimaxity of the Pitman estimator ?
11.2 Admissibility
11.2.1 Admissible estimators for the normal mean
11.3 Admissible estimators in exponential families ?
11.4 Inadmissibility in higher-dimensional settings ?

12 The linear model
12.1 Definition of the least squares estimator
12.2 Intermezzo: the χ2 distribution
12.3 Distribution of the least squares estimator
12.4 Intermezzo: some matrix algebra
12.5 Testing a linear hypothesis

13 Asymptotic theory
13.1 Types of convergence
13.1.1 Stochastic order symbols
13.1.2 Some implications of convergence
13.2 Consistency and asymptotic normality
13.3 Asymptotic linearity
13.4 The δ-technique

14 M-estimators
14.1 MLE as special case of M-estimation
14.2 Consistency of M-estimators
14.3 Asymptotic normality of M-estimators
14.4 Asymptotic normality of the MLE
14.5 Two further examples of M-estimation
14.6 Asymptotic relative efficiency
14.7 Asymptotic pivots
14.8 Asymptotic pivot based on the MLE
14.9 MLE for the multinomial distribution
14.10 Likelihood ratio tests
14.11 Contingency tables

15 Abstract asymptotics ?
15.1 Plug-in estimators ?
15.2 Consistency of plug-in estimators ?
15.3 Asymptotic normality of plug-in estimators ?
15.4 Asymptotic Cramér-Rao lower bound ?
15.5 Le Cam's 3rd Lemma ?

16 Complexity regularization
16.1 Non-parametric regression
16.2 Smoothness classes
16.3 A continuous version with explicit solution ?
16.4 Estimating a function of bounded variation
16.5 The ridge and Lasso penalty
16.6 Conclusion

17 Literature

These notes in English closely follow Mathematische Statistik by H.R. Künsch (2005). Mathematische Statistik can be used as supplementary reading material in German.

Throughout the notes, measurability assumptions are not stated explicitly. Mathematical rigour and clarity often conflict. At some places, not all subtleties are fully presented. A snake will indicate this.

Chapters or (sub)sections marked with a ? are not part of the exam.
Chapter 1

Introduction

Statistics is about the mathematical modeling of observable phenomena, using stochastic models, and about analyzing data: estimating parameters of the model, constructing confidence intervals and testing hypotheses. In these notes, we study various estimation and testing procedures. We consider their theoretical properties and we investigate various notions of optimality.

Some notation and model assumptions

The data consist of measurements (observations) x1, . . . , xn, which are regarded as realizations of random variables X1, . . . , Xn. In most of the notes, the Xi are real-valued: Xi ∈ R (for i = 1, . . . , n), although we will also consider some extensions to vector-valued observations.

1.1 Speed of light example

Fizeau and Foucault developed methods for estimating the speed of light (1849,
1850), which were later improved by Newcomb and Michelson. The main idea
is to pass light from a rapidly rotating mirror to a fixed mirror and back to
the rotating mirror. An estimate of the velocity of light is obtained, taking
into account the speed of the rotating mirror, the distance travelled, and the
displacement of the light as it returns to the rotating mirror.

Fig. 1
The data are Newcomb’s measurements of the passage time it took light to
travel from his lab, to a mirror on the Washington Monument, and back to his
lab.


distance: 7.44373 km
66 measurements on 3 consecutive days
first measurement: 0.000024828 seconds = 24828 nanoseconds
The dataset has the deviations from 24800 nanoseconds.
The measurements on the 3 different days:
[Figure: the measurements X plotted against observation number t, shown separately for day 1, day 2 and day 3.]

One may estimate the speed of light using e.g. the mean, or the median, or
Huber’s estimate (see below). This gives the following results (for the 3 days
separately, and for the three days combined):

        Day 1   Day 2   Day 3   All
Mean    21.75   28.55   27.85   26.21
Median  25.5    28      27      27
Huber   25.65   28.40   27.71   27.28

Table 1

The question of which estimate is “the best one” is one of the topics of these notes.

1.2 Notation

The collection of observations will be denoted by X = {X1, . . . , Xn}. The distribution of X, denoted by IP, is generally unknown. A statistical model is a collection of assumptions about this unknown distribution.


We will usually assume that the observations X1, . . . , Xn are independent and identically distributed (i.i.d.). Or, to formulate it differently, X1, . . . , Xn are i.i.d. copies of some population random variable, which we denote by X. The common distribution, that is, the distribution of X, is denoted by P. For X ∈ R, the distribution function of X is written as

F(·) = P(X ≤ ·).

Recall that the distribution function F determines the distribution P (and vice versa).
Further model assumptions then concern the modeling of P. We write such a model as P ∈ P, where P is a given collection of probability measures, the so-called model class. Typically the distributions in P are indexed by a parameter, say θ, in some parameter space, say Θ. Then P = {Pθ : θ ∈ Θ}, and P = Pθ for some θ ∈ Θ. Then θ is often called the “true parameter”.¹ The parameter space Θ may be high-dimensional, or even ∞-dimensional. Often, one is only interested in a certain aspect of the parameter. We write the parameter of interest as γ := g(θ), where g : Θ → Γ is a given function with values in some space Γ.

1.3 Example: the location model

The following example will serve to illustrate the concepts that are to follow.
Let X be real-valued. The location model is

P := {Pθ (X ≤ ·) := F0 (· − µ), θ := (µ, F0 ), µ ∈ R, F0 ∈ F0 }, (1.1)

where F0 is a given collection of distribution functions. Assuming the expectations exist, we center the distributions in F0 to have mean zero. Then Pµ,F0 has mean µ. We call µ a location parameter. Often, only µ is the parameter of interest, and F0 is a so-called nuisance parameter. Then g(µ, F0) = µ.
The class F0 is for example modeled as the class of all symmetric distributions, that is

F0 := {F0 : F0(x) = 1 − F0(−x) ∀ x}. (1.2)

This is an ∞-dimensional collection: it is not parametrized by a finite-dimensional parameter. We then call F0 an infinite-dimensional parameter.
A finite-dimensional model is for example

F0 := {Φ(·/σ) : σ > 0}, (1.3)

where Φ is the standard normal distribution function.


¹ To be mathematically correct one should write Pθ ∈ {Pϑ : ϑ ∈ Θ} to make the distinction in notation between the true parameter θ and the parameter ϑ indexing the class P. We actually need this distinction as the theory develops.

Thus, the location model is

Xi = µ + εi, i = 1, . . . , n,

with ε1, . . . , εn i.i.d. and, under model (1.2), symmetrically distributed but otherwise unknown and, under model (1.3), N(0, σ²)-distributed with unknown variance σ².

1.4 Some further examples of statistical models

Example 1.4.1 Poisson distribution
Consider a small insurance company. Let X be the number of claims on a particular day. Then a possible model is the Poisson model, which assumes that X has a Poisson distribution with parameter θ > 0:

Pθ(X = x) = (θ^x / x!) e^{−θ}, x = 0, 1, 2, . . . .
A parameter of interest could be for example the probability of at least 4 claims on a particular day. Then

γ = Pθ(X ≥ 4) = 1 − Pθ(X ≤ 3) = 1 − (1 + θ + θ²/2 + θ³/3!) e^{−θ} := g(θ).

Suppose we observed the number of claims X1, . . . , Xn during n = 200 days. A possible estimator of θ is the sample average X̄ := ∑_{i=1}^n Xi/n. For γ we may use the “plug-in” principle, that is, we plug in the estimator X̄ for θ in the function g. This gives γ̂ := g(X̄) as estimator of g(θ).
Here is a data example:

xi    # days
0     100
1     60
2     32
3     8
≥4    0

The observed average is x̄ = .74 and the estimate of Pθ (X ≥ 4) is .00697.
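This plug-in computation can be reproduced in a few lines. A minimal sketch in plain Python; the claim counts are taken from the table above:

```python
import math

def g(theta):
    # g(theta) = P_theta(X >= 4) = 1 - (1 + theta + theta^2/2 + theta^3/3!) e^(-theta)
    return 1.0 - (1.0 + theta + theta**2 / 2 + theta**3 / 6) * math.exp(-theta)

# observed claim counts over n = 200 days, from the table above
counts = {0: 100, 1: 60, 2: 32, 3: 8}
n = sum(counts.values())
x_bar = sum(x * d for x, d in counts.items()) / n  # sample average, 0.74

gamma_hat = g(x_bar)  # plug-in estimate of P_theta(X >= 4), roughly 0.007
print(x_bar, gamma_hat)
```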


Example 1.4.2 Pareto distribution
Suppose X has density (with respect to Lebesgue measure ν)

pθ(x) = θ(1 + x)^{−(1+θ)}, x > 0,

where θ > 0 is unknown. This is the Pareto density which is often used to
model the distribution of income. The parameter θ is sometimes called the
Pareto index. We write the model class for the distribution as

P = {Pθ : dPθ /dν = pθ }.

A parameter of interest may be the Gini index. It describes income inequality and is defined for θ ≥ 1 as γ(θ) = 1/(2θ − 1). The value γ(1) = 1 for θ = 1 corresponds to complete income inequality.
Example 1.4.3 Classification
Let X = (Y, Z) where Z = body mass index ∈ R and Y ∈ {0, 1} indicates
having diabetes or not. We assume the model

Pθ (Y = 1|Z = z) = θ(z), z ∈ R,

with

θ(·) ∈ Θ := {all increasing functions θ : R → [0, 1]}.

The parameter space is ∞-dimensional. A parameter of interest is for example

γ := θ^{−1}(1/2),

that is, the value γ such that for z ≥ γ you are at risk:

Pθ(Y = 1|Z = z) ≥ 1/2.

Example 1.4.4 Social networks


Consider p individuals, who either do or do not communicate with each other. If they do, we say there is a connection between them, or we call them friends. Let X be a p × p matrix X = (Xj,k) coding the connections: for j ≠ k,

Xj,k := 1 if there is a connection between j and k, and Xj,k := 0 else.

The “stochastic block model” assumes that the {Xj,k : j < k} are independent and

Pθ(Xj,k = 1) = βm if j and k are in the same community m, and Pθ(Xj,k = 1) = δ if j and k are in different communities.

If the number of communities is known, say M, but otherwise nothing further, then the parameter space is

Θ = {((β1, . . . , βM, δ), m) : (β1, . . . , βM, δ) ∈ [0, 1]^{M+1}, m ∈ M},

where M is the collection of all community configurations. There are M^p possible community configurations, so |M| = M^p. If p is large this is a huge number, i.e. the parameter space is very complex.
Remark: typically we observe only one realization of X, i.e. n = 1.

Example 1.4.5 Causal models


Suppose X = (Z1, . . . , Zp) ∈ R^p is a p-dimensional random variable, for example Z1 = rainfall, Z2 = tea consumption, Z3 = number of tall people, Z4 = mountain height, · · · within a particular canton of Switzerland. A causal model aims to find out which variables are causes and which are consequences. The structural relations model is

Zπ(j) = fj(Zπ(1), . . . , Zπ(j−1), εj), j = 2, . . . , p.

Here π is a permutation of {1, . . . , p}, εj is unobservable noise and fj is a partly unknown function. If we assume (for simplicity) the noise distribution to be known, the parameter space is

Θ := {all permutations π and structural relations (f2, . . . , fp)}.

A parameter of interest is for example the causal graph.


A sub-example is where the structural relations are modeled as being linear:

fj (z1 , . . . , zj−1 , εj ) = βj,1 z1 + · · · + βj,j−1 zj−1 + εj , j = 2, . . . , p

with {βj,k } unknown coefficients.


Chapter 2

Estimation

2.1 What is an estimator?

Recall that the data consist of observations X = (X1, . . . , Xn) with partly unknown distribution. A parameter is an aspect of the unknown distribution. We typically assume that X1, . . . , Xn are i.i.d. copies of a random variable X, where X has distribution Pθ ∈ {Pϑ : ϑ ∈ Θ}.
An estimator is constructed to estimate some unknown parameter, γ say. Its
formal definition is
Definition 2.1.1 An estimator T (X) is some given (measurable) function T (·)
evaluated at the observations X. The function T (·) is not allowed to depend on
unknown parameters.
An estimator is also called a statistic or a decision.
The reason why T is not allowed to depend on unknown parameters is that one
should be able to calculate it in practice, using only the data. We will often use
the same notation T for the estimator T (X) (i.e., we write T = T (X) omitting
the argument X) and the function T = T (·). It should be clear from the context
which of the two is meant.

2.2 The empirical distribution function

Let X1, . . . , Xn be real-valued observations. An example of a nonparametric estimator is the empirical distribution function

F̂n(·) := (1/n) #{Xi ≤ ·, 1 ≤ i ≤ n}.

This is an estimator of the theoretical distribution function

F(·) := P(X ≤ ·).

Most estimators are constructed according to the so-called plug-in principle (Einsetzprinzip). That is, the parameter of interest γ is written as γ = Q(F), with Q some given map. The empirical distribution F̂n is then “plugged in”, to obtain the estimator T := Q(F̂n). (We note however that problems can arise: e.g. Q(F̂n) may not be well-defined.)
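As a sketch of the plug-in idea (plain Python; the data values are arbitrary), here are the empirical distribution function and the plug-in estimator of the mean Q(F) = ∫ x dF(x):

```python
def ecdf(xs):
    # empirical distribution function: t -> #{i : x_i <= t} / n
    n = len(xs)
    return lambda t: sum(x <= t for x in xs) / n

def plug_in_mean(xs):
    # Q(F_hat_n) for Q(F) = integral of x dF(x): F_hat_n puts mass 1/n on
    # each observation, so the plug-in estimator is the sample average
    return sum(xs) / len(xs)

xs = [3.0, 1.0, 2.0, 2.0]
F_hat = ecdf(xs)
print(F_hat(2.0))        # 0.75: F_hat jumps by 2/n = 0.5 at the tied value 2.0
print(plug_in_mean(xs))  # 2.0
```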

2.3 Some estimators for the location model

In the location model of Section 1.3, one may consider the following estimators µ̂ of µ (among others):

• The average

µ̂1 := (1/n) ∑_{i=1}^n Xi.

Note that µ̂1 minimizes over µ the squared loss¹

∑_{i=1}^n (Xi − µ)²,

that is,

µ̂1 = arg min_{µ∈R} ∑_{i=1}^n (Xi − µ)². (2.1)

It can be shown that µ̂1 is a “good” estimator if the model (1.3) holds, i.e., if X1, . . . , Xn are i.i.d. N(µ, σ²). When (1.3) is not true, in particular when there are outliers (large, “wrong” observations) (Ausreisser), one has to apply a more robust estimator.

• The (sample) median is

µ̂2 := X((n+1)/2) when n is odd, and µ̂2 := {X(n/2) + X(n/2+1)}/2 when n is even,

where X(1) ≤ · · · ≤ X(n) are the order statistics. Note that µ̂2 is a minimizer of the absolute loss

∑_{i=1}^n |Xi − µ|.

¹ To avoid misunderstanding, we note that e.g. in (2.1), µ is used as the variable over which one minimizes, and is there not the unknown parameter µ. It is a general convention to abuse notation and employ the same symbol µ. When further developing the theory we shall often introduce a different symbol for the variable, to distinguish it from the “true parameter” µ; e.g., (2.1) is written as

µ̂1 := arg min_{c∈R} ∑_{i=1}^n (Xi − c)².

• The Huber estimator is

µ̂3 := arg min_{µ∈R} ∑_{i=1}^n ρ(Xi − µ),

where

ρ(x) = x² if |x| ≤ k, and ρ(x) = k(2|x| − k) if |x| > k,

with k > 0 some given threshold.

• We finally mention the α-trimmed mean, defined, for some 0 < α < 1, as

µ̂4 := (1/(n − 2[nα])) ∑_{i=[nα]+1}^{n−[nα]} X(i).
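The four estimators can be computed directly. Below is a sketch in plain Python; the threshold k = 1.5 and trimming fraction α = 0.2 are illustrative choices, and the Huber minimizer is found by a crude grid search rather than a proper convex solver:

```python
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    ys, n = sorted(xs), len(xs)
    return ys[n // 2] if n % 2 else (ys[n // 2 - 1] + ys[n // 2]) / 2

def huber(xs, k=1.5, grid=1000):
    # rho(x) = x^2 if |x| <= k, and k(2|x| - k) otherwise
    def rho(x):
        return x * x if abs(x) <= k else k * (2 * abs(x) - k)
    # minimize the Huber loss over a grid of candidate values of mu
    lo, hi = min(xs), max(xs)
    cands = [lo + (hi - lo) * j / grid for j in range(grid + 1)]
    return min(cands, key=lambda m: sum(rho(x - m) for x in xs))

def trimmed_mean(xs, alpha=0.2):
    ys, n = sorted(xs), len(xs)
    m = int(n * alpha)  # [n * alpha]
    return sum(ys[m:n - m]) / (n - 2 * m)

data = [1.2, 0.8, 1.0, 1.1, 0.9, 25.0]  # five inliers and one gross outlier
print(mean(data))          # 5.0 -- dragged away by the outlier
print(median(data))        # 1.05
print(huber(data))         # stays near the inliers
print(trimmed_mean(data))  # 1.05
```

On this sample the mean is pulled to 5.0 by the single outlier, while the median, the Huber estimate and the trimmed mean all stay near 1, which illustrates the robustness motivation above.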

The above estimators µ̂1, . . . , µ̂4 are plug-in estimators of the location parameter µ. We define the maps

Q1(F) := ∫ x dF(x)

(the mean, or point of gravity, of F), and

Q2(F) := F^{−1}(1/2)

(the median of F), and

Q3(F) := arg min_µ ∫ ρ(· − µ) dF,

and finally

Q4(F) := (1/(1 − 2α)) ∫_{F^{−1}(α)}^{F^{−1}(1−α)} x dF(x).

Then µ̂k corresponds to Qk(F̂n), k = 1, . . . , 4. If the model (1.2) is correct (i.e. if F is symmetric around µ), then µ̂1, . . . , µ̂4 are all estimators of µ. If the model is incorrect, each Qk(F̂n) is still an estimator of Qk(F) (assuming the latter exists), but the Qk(F) may all be different aspects of F.

2.4 How to construct estimators

2.4.1 Plug-in estimators

For real-valued observations, one can define the distribution function

F(·) = P(X ≤ ·).

An estimator of F is the empirical distribution function

F̂n(·) = (1/n) ∑_{i=1}^n 1{Xi ≤ ·}.

Note that when knowing only F̂n, one can reconstruct the order statistics X(1) ≤ . . . ≤ X(n), but not the original data X1, . . . , Xn. Now, the order in which the data are given carries no information about the distribution P. In other words, a “reasonable”² estimator T = T(X1, . . . , Xn) depends only on the sample (X1, . . . , Xn) via the order statistics (X(1), . . . , X(n)) (i.e., shuffling the data should have no influence on the value of T). Because these order statistics can be determined from the empirical distribution F̂n, we conclude that any “reasonable” estimator T can be written as a function of F̂n:

T = Q(F̂n),

for some map Q.


Similarly, the distribution function Fθ := Pθ (X ≤ ·) completely characterizes
the distribution Pθ . Hence, a parameter is a function of Fθ :

γ = g(θ) = Q(Fθ ).

If the mapping Q is defined at all Fθ as well as at F̂n , we call Q(F̂n ) a plug-in


estimator of Q(Fθ ).
The idea is not restricted to the one-dimensional setting. For an arbitrary observation space X, we define the empirical measure

P̂n = (1/n) ∑_{i=1}^n δ_{Xi},

where δx is a point mass at x. The empirical measure puts mass 1/n at each observation. This is indeed an extension of X = R to general X, as the empirical distribution function F̂n jumps at each observation, with jump height equal to the number of times the value was observed divided by n (i.e. jump height 1/n if all Xi are distinct). As in the real-valued case, if the map Q is defined at all Pθ as well as at P̂n, we call Q(P̂n) a plug-in estimator of Q(Pθ).

We stress that typically, the representation γ = g(θ) as a function Q of Pθ is not unique, i.e., there are various choices of Q. Each such choice generally leads to a different estimator. Moreover, the assumption that Q is defined at P̂n is often violated. One can sometimes modify the map Q to a map Qn that, in some sense, approximates Q for n large. The modified plug-in estimator then takes the form Qn(P̂n).

2.4.2 The method of moments

Let X ∈ R and suppose (say) that the parameter of interest is θ itself, and that Θ ⊂ R^p. Let µ1(θ), . . . , µp(θ) denote the first p moments of X (assumed to exist), i.e.,

µj(θ) = Eθ X^j = ∫ x^j dFθ(x), j = 1, . . . , p.

² What is “reasonable” has to be considered with some care. There are in fact “reasonable” statistical procedures that do treat the {Xi} in an asymmetric way. An example is splitting the sample into a training and test set (for model validation).

Also assume that the map

m : Θ → R^p,

defined by

m(θ) = [µ1(θ), . . . , µp(θ)],

has an inverse

m^{−1}(µ1, . . . , µp),

for all [µ1, . . . , µp] ∈ M (say). We estimate the µj by their sample counterparts

µ̂j := (1/n) ∑_{i=1}^n Xi^j = ∫ x^j dF̂n(x), j = 1, . . . , p.

When [µ̂1, . . . , µ̂p] ∈ M we can plug them in to obtain the estimator

θ̂ := m^{−1}(µ̂1, . . . , µ̂p).

Example 2.4.1 Method of moments for the negative binomial distribution
Let X have the negative binomial distribution with known parameter k and unknown success parameter θ ∈ (0, 1):

Pθ(X = x) = (k + x − 1 choose x) θ^k (1 − θ)^x, x ∈ {0, 1, . . .}.

This is the distribution of the number of failures until the k-th success, where at each trial the probability of success is θ, and where the trials are independent. It holds that

Eθ(X) = k(1 − θ)/θ := m(θ).

Hence

m^{−1}(µ) = k/(µ + k),

and the method of moments estimator is

θ̂ = k/(X̄ + k) = nk/(∑_{i=1}^n Xi + nk) = number of successes / number of trials.
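A minimal sketch of this estimator in plain Python (the failure counts are hypothetical, with k = 2 treated as known):

```python
def mom_negbin(xs, k):
    # theta_hat = k / (x_bar + k) = nk / (sum x_i + nk): successes over trials
    x_bar = sum(xs) / len(xs)
    return k / (x_bar + k)

# hypothetical numbers of failures before the k-th success, with k = 2
xs = [0, 3, 1, 2, 0, 4, 1, 1]
theta_hat = mom_negbin(xs, k=2)
print(theta_hat)  # 2 / (1.5 + 2) = 0.5714...
```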

Example 2.4.2 Method of moments for the Pareto distribution
Suppose X has density

pθ(x) = θ(1 + x)^{−(1+θ)}, x > 0,

with respect to Lebesgue measure, and with θ ∈ Θ ⊂ (0, ∞) (see Example 1.4.2). Then, for θ > 1,

Eθ X = 1/(θ − 1) := m(θ),

with inverse

m^{−1}(µ) = (1 + µ)/µ.

The method of moments estimator would thus be

θ̂ = (1 + X̄)/X̄.

However, the mean Eθ X does not exist for θ ≤ 1, so when Θ contains values θ ≤ 1, the method of moments is perhaps not a good idea. We will see that the maximum likelihood estimator does not suffer from this problem.

2.4.3 Likelihood methods

Suppose that P := {Pθ : θ ∈ Θ} is dominated by a σ-finite measure ν. We write the densities as

pθ := dPθ/dν, θ ∈ Θ.

Definition 2.4.1 The likelihood function (with data X = (X1, . . . , Xn)) is the function LX : Θ → R given by

LX(ϑ) := ∏_{i=1}^n pϑ(Xi), ϑ ∈ Θ.

The MLE (maximum likelihood estimator) is

θ̂ := arg max_{ϑ∈Θ} LX(ϑ).

Note We use the symbol ϑ for the variable in the likelihood function, and the
slightly different symbol θ for the parameter we want to estimate. It is however
a common convention to use the same symbol for both (as already noted in the
footnotes in Sections 1.2 and 2.3). As we will see, different symbols are needed
for the development of the theory.
Note Alternatively, we may write the MLE as the maximizer of the log-likelihood:

θ̂ = arg max_{ϑ∈Θ} log LX(ϑ) = arg max_{ϑ∈Θ} ∑_{i=1}^n log pϑ(Xi).

The log-likelihood is generally mathematically more tractable. For example, if the densities are differentiable, one can typically obtain the maximum by setting the derivatives to zero, and it is easier to differentiate a sum than a product.
Note The likelihood function may have local maxima. Moreover, the MLE is
not always unique, or may not exist (for example, the likelihood function may
be unbounded).

We will now show that maximum likelihood is a plug-in method. First, as noted above, the MLE maximizes the log-likelihood. We may of course normalize the log-likelihood by 1/n:

θ̂ = arg max_{ϑ∈Θ} (1/n) ∑_{i=1}^n log pϑ(Xi). (2.2)

Replacing the average ∑_{i=1}^n log pϑ(Xi)/n in (2.2) by its theoretical counterpart Eθ log pϑ(X) gives

arg max_{ϑ∈Θ} Eθ log pϑ(X),

which is indeed equal to the parameter θ we are trying to estimate:

Lemma 2.4.1 We have

θ = arg max_{ϑ∈Θ} Eθ log pϑ(X).

Proof. By the inequality log x ≤ x − 1, x > 0, we have for all ϑ ∈ Θ

Eθ log (pϑ(X)/pθ(X)) ≤ Eθ (pϑ(X)/pθ(X) − 1) = 0.

⊔⊓
Example 2.4.3 MLE for the Pareto distribution
Suppose X has density

pθ(x) = θ(1 + x)^{−(1+θ)}, x > 0,

with respect to Lebesgue measure, and with θ ∈ Θ = (0, ∞). Then

log pϑ(x) = log ϑ − (1 + ϑ) log(1 + x),

(d/dϑ) log pϑ(x) = 1/ϑ − log(1 + x).

We set the derivative of the log-likelihood based on n observations to zero and solve:

n/θ̂ − ∑_{i=1}^n log(1 + Xi) = 0
⇒ θ̂ = 1 / ({∑_{i=1}^n log(1 + Xi)}/n).

(One may check that this is indeed the maximum.)
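A sketch comparing this MLE with the method of moments estimator of Example 2.4.2 on simulated Pareto data (plain Python with a fixed seed; sampling uses inversion of Fθ(x) = 1 − (1 + x)^{−θ}):

```python
import math
import random

def mle_pareto(xs):
    # theta_hat = n / sum(log(1 + x_i)), as derived above
    return len(xs) / sum(math.log(1 + x) for x in xs)

def mom_pareto(xs):
    # theta_hat = (1 + x_bar) / x_bar; only sensible when E X exists (theta > 1)
    x_bar = sum(xs) / len(xs)
    return (1 + x_bar) / x_bar

random.seed(1)
theta = 2.5
# inverse-CDF sampling: F_theta(x) = 1 - (1 + x)^(-theta)
xs = [(1 - random.random()) ** (-1 / theta) - 1 for _ in range(5000)]
print(mle_pareto(xs), mom_pareto(xs))  # both should land near theta = 2.5
```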
Example 2.4.4 MLE for some location/scale models
Let X ∈ R and θ = (µ, σ²), with µ ∈ R a location parameter and σ > 0 a scale parameter. We assume that the distribution function Fθ of X is

Fθ(·) = F0((· − µ)/σ),

where F0 is a given distribution function, with density f0 w.r.t. Lebesgue measure. The density of X is thus

pθ(·) = (1/σ) f0((· − µ)/σ).

Case 1 If F0 = Φ (the standard normal distribution function), then

f0(x) = φ(x) = (1/√(2π)) exp(−x²/2), x ∈ R,

so that

pθ(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)), x ∈ R.

The MLE of µ resp. σ² is

µ̂ = X̄, σ̂² = (1/n) ∑_{i=1}^n (Xi − X̄)².

Case 2 The (standardized) double exponential or Laplace distribution has density

f0(x) = (1/√2) exp(−√2 |x|), x ∈ R,

so

pθ(x) = (1/(√2 σ)) exp(−√2 |x − µ|/σ), x ∈ R.

The MLE of µ resp. σ is now

µ̂ = sample median, σ̂ = (√2/n) ∑_{i=1}^n |Xi − µ̂|.

Example 2.4.5 An example where the MLE does not exist
Here is a famous example, from Kiefer and Wolfowitz (1956), where the likelihood is unbounded, and hence the MLE does not exist. It concerns the case of a mixture of two normals: each observation is either N(µ, 1)-distributed or N(µ, σ²)-distributed, each with probability 1/2 (say). The unknown parameter is θ = (µ, σ²), and X has density

pθ(x) = (1/2) φ(x − µ) + (1/(2σ)) φ((x − µ)/σ), x ∈ R,

w.r.t. Lebesgue measure. Then

LX(µ̃, σ̃²) = ∏_{i=1}^n [ (1/2) φ(Xi − µ̃) + (1/(2σ̃)) φ((Xi − µ̃)/σ̃) ].

Taking µ̃ = X1 yields

LX(X1, σ̃²) = (1/√(2π)) (1/2 + 1/(2σ̃)) ∏_{i=2}^n [ (1/2) φ(Xi − X1) + (1/(2σ̃)) φ((Xi − X1)/σ̃) ].

Now, since for all z ≠ 0

lim_{σ̃↓0} (1/σ̃) φ(z/σ̃) = 0,

we have

lim_{σ̃↓0} ∏_{i=2}^n [ (1/2) φ(Xi − X1) + (1/(2σ̃)) φ((Xi − X1)/σ̃) ] = ∏_{i=2}^n (1/2) φ(Xi − X1) > 0.

It follows that

lim_{σ̃↓0} LX(X1, σ̃²) = ∞.
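The degeneracy can be seen numerically. A sketch in plain Python (the data values are arbitrary illustrative numbers) evaluating the mixture log-likelihood at µ̃ = X1 for decreasing σ̃:

```python
import math

def phi(z):
    # standard normal density
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def log_lik(xs, mu, sigma):
    # log L_X(mu, sigma^2) for the 50/50 mixture of N(mu, 1) and N(mu, sigma^2)
    return sum(math.log(0.5 * phi(x - mu) + 0.5 / sigma * phi((x - mu) / sigma))
               for x in xs)

xs = [1.3, -0.2, 0.7, 2.1, 0.4]
for sigma in [0.1, 0.01, 0.001, 0.0001]:
    # at mu = X_1 the term for X_1 behaves like log(1/(2 sigma)) as sigma -> 0
    print(sigma, log_lik(xs, mu=xs[0], sigma=sigma))
```

Each further tenfold decrease of σ̃ raises the log-likelihood by roughly log 10, confirming that the supremum of the likelihood is infinite.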

2.5 Asymptotic tests and confidence intervals based on the likelihood

This section is a look ahead at what is to come in Chapter 14. Suppose that Θ is an open subset of R^p. Define the log-likelihood ratio

Z(X, θ) := 2 ( log LX(θ̂) − log LX(θ) ).

Note that Z(X, θ) ≥ 0, as θ̂ maximizes the (log-)likelihood. We will see in Chapter 14 that, under some regularity conditions,

Z(X, θ) −→Dθ χ²_p, ∀ θ.

Here, “−→Dθ” means convergence in distribution under IPθ, and χ²_p denotes the chi-squared distribution with p degrees of freedom.

We say that Z(X, θ) is an asymptotic pivot: its asymptotic distribution does not depend on the unknown parameter θ. For the null hypothesis

H0 : θ = θ0,

a test at asymptotic level α is: reject H0 if Z(X, θ0) > χ²_p(1 − α), where χ²_p(1 − α) is the (1 − α)-quantile of the χ²_p-distribution. An asymptotic (1 − α)-confidence set for θ is

{θ : Z(X, θ) ≤ χ²_p(1 − α)} = {θ : 2 log LX(θ̂) ≤ 2 log LX(θ) + χ²_p(1 − α)}.
Example 2.5.1 Likelihood ratio for the normal distribution
Here is a toy example. Let $X_1,\dots,X_n$ be i.i.d. with the $N(\mu,1)$-distribution, with $\mu\in\mathbb R$ unknown. The MLE of $\mu$ is the sample average $\hat\mu=\bar X$. It holds that
$$\log L_X(\hat\mu)=-\frac n2\log(2\pi)-\frac12\sum_{i=1}^n(X_i-\bar X)^2,$$
and
$$2\big(\log L_X(\hat\mu)-\log L_X(\mu)\big)=n(\bar X-\mu)^2.$$
The random variable $\sqrt n(\bar X-\mu)$ is $N(0,1)$-distributed under $I\!P_\mu$. So its square, $n(\bar X-\mu)^2$, has a $\chi^2_1$-distribution. Thus, in this case the above test (confidence interval) is exact.
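A quick simulation sketch of the exactness claim (illustrative only, not from the text): with $\mu$ known, $Z=n(\bar X-\mu)^2$ should fall below the $\chi^2_1$ $0.95$-quantile (approximately $3.841$) in about $95\%$ of samples.

```python
# Simulation sketch: coverage of the exact likelihood-ratio pivot.
import random

random.seed(0)
n, mu, reps = 25, 1.0, 2000
chi2_1_q95 = 3.841  # 0.95-quantile of the chi-squared(1) distribution

covered = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, 1.0) for _ in range(n)) / n
    z = n * (xbar - mu) ** 2
    if z <= chi2_1_q95:
        covered += 1

coverage = covered / reps
assert abs(coverage - 0.95) < 0.02   # exact pivot: close to the nominal level
```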
Chapter 3

Intermezzo: distribution theory

3.1 Conditional distributions

Recall the definition of conditional probabilities: for two sets $A$ and $B$, with $P(B)\neq 0$, the conditional probability of $A$ given $B$ is defined as
$$P(A|B):=\frac{P(A\cap B)}{P(B)}.$$
It follows that
$$P(B|A)=P(A|B)\frac{P(B)}{P(A)},$$
and that, for a partition $\{B_j\}$,¹
$$P(A)=\sum_j P(A|B_j)P(B_j).$$

Consider now two random vectors $X\in\mathbb R^n$ and $Y\in\mathbb R^m$. Let $f_{X,Y}(\cdot,\cdot)$ be the density of $(X,Y)$ with respect to Lebesgue measure (assumed to exist). The marginal density of $X$ is
$$f_X(\cdot)=\int f_{X,Y}(\cdot,y)\,dy,$$
and the marginal density of $Y$ is
$$f_Y(\cdot)=\int f_{X,Y}(x,\cdot)\,dx.$$
Definition 3.1.1 The conditional density of $X$ given $Y=y$ is
$$f_X(x|y):=\frac{f_{X,Y}(x,y)}{f_Y(y)},\quad x\in\mathbb R^n.$$
¹ $\{B_j\}$ is a partition if $B_j\cap B_k=\emptyset$ for all $j\neq k$ and $P(\cup_j B_j)=1$.


Thus, we have
$$f_Y(y|x)=f_X(x|y)\frac{f_Y(y)}{f_X(x)},\quad (x,y)\in\mathbb R^{n+m},$$
and
$$f_X(x)=\int f_X(x|y)f_Y(y)\,dy,\quad x\in\mathbb R^n.$$
Definition 3.1.2 Let $g:\mathbb R^n\times\mathbb R^m\to\mathbb R$ be some function. The conditional expectation of $g(X,Y)$ given $Y=y$ is
$$E[g(X,Y)|Y=y]:=\int f_X(x|y)g(x,y)\,dx.$$
Note thus that
$$E[g_1(X)g_2(Y)|Y=y]=g_2(y)E[g_1(X)|Y=y].$$
Notation We define the random variable $E[g(X,Y)|Y]$ as
$$E[g(X,Y)|Y]:=h(Y),$$
where $h(y)$ is the function $h(y):=E[g(X,Y)|Y=y]$.

Lemma 3.1.1 (Iterated expectations lemma) It holds that
$$E\big(E[g(X,Y)|Y]\big)=Eg(X,Y).$$
Proof. Define
$$h(y):=E[g(X,Y)|Y=y].$$
Then
$$Eh(Y)=\int h(y)f_Y(y)\,dy=\int E[g(X,Y)|Y=y]f_Y(y)\,dy=\int\!\!\int g(x,y)f_{X,Y}(x,y)\,dx\,dy=Eg(X,Y).\qquad\Box$$
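The lemma can be checked exactly on a small discrete joint distribution; the sketch below (with an arbitrary pmf and an arbitrary function $g$, illustrative only) computes $Eg(X,Y)$ both directly and via the iterated conditional expectation:

```python
# Exact check of the iterated expectations lemma on a toy discrete pmf.
f = {  # joint pmf of (X, Y) on {0,1} x {0,1}; values chosen arbitrarily
    (0, 0): 0.1, (0, 1): 0.2,
    (1, 0): 0.3, (1, 1): 0.4,
}

def g(x, y):
    """Any function of (X, Y); this particular choice is arbitrary."""
    return x * y + x

# Direct expectation E g(X, Y)
direct = sum(p * g(x, y) for (x, y), p in f.items())

# E[ E[g(X,Y) | Y] ]: first the conditional expectations h(y), then E h(Y)
fY = {y: sum(p for (x, yy), p in f.items() if yy == y) for y in (0, 1)}
h = {y: sum(p * g(x, y) for (x, yy), p in f.items() if yy == y) / fY[y]
     for y in (0, 1)}
iterated = sum(fY[y] * h[y] for y in (0, 1))

assert abs(direct - iterated) < 1e-12
```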

3.2 The multinomial distribution

In a survey, people were asked their opinion about some political issue. Let X
be the number of yes-answers, Y be the number of no and Z be the number
of perhaps. The total number of people in the survey is n = X + Y + Z. We
consider the votes as a sample with replacement, with p1 = P (yes), p2 = P (no),
and $p_3=P(\text{perhaps})$, $p_1+p_2+p_3=1$. Then
$$P(X=x,Y=y,Z=z)=\binom{n}{x\ y\ z}\,p_1^x p_2^y p_3^z,\quad (x,y,z)\in\{0,\dots,n\}^3,\ x+y+z=n.$$

Here
$$\binom{n}{x\ y\ z}:=\frac{n!}{x!\,y!\,z!}.$$
It is called a multinomial coefficient.
Lemma 3.2.1 The marginal distribution of $X$ is the Binomial$(n,p_1)$-distribution.
Proof. For $x\in\{0,\dots,n\}$, we have
$$P(X=x)=\sum_{y=0}^{n-x}P(X=x,Y=y,Z=n-x-y)=\sum_{y=0}^{n-x}\binom{n}{x\ y\ (n-x-y)}p_1^x p_2^y(1-p_1-p_2)^{n-x-y}$$
$$=\binom nx p_1^x\sum_{y=0}^{n-x}\binom{n-x}{y}p_2^y(1-p_1-p_2)^{n-x-y}=\binom nx p_1^x(1-p_1)^{n-x}.\qquad\Box$$
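Lemma 3.2.1 can be verified exactly for small $n$ by summing the trinomial pmf over $y$; the values of $n$, $p_1$, $p_2$ below are arbitrary illustrative choices:

```python
# Exact check: summing the trinomial pmf over y recovers Binomial(n, p1).
from math import comb, factorial

n, p1, p2 = 6, 0.2, 0.5
p3 = 1 - p1 - p2

def trinomial(x, y, z):
    """Trinomial pmf P(X=x, Y=y, Z=z) with x + y + z = n."""
    coeff = factorial(n) // (factorial(x) * factorial(y) * factorial(z))
    return coeff * p1**x * p2**y * p3**z

for x in range(n + 1):
    marginal = sum(trinomial(x, y, n - x - y) for y in range(n - x + 1))
    binom = comb(n, x) * p1**x * (1 - p1) ** (n - x)
    assert abs(marginal - binom) < 1e-12
```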
Definition 3.2.1 We say that the random vector $(N_1,\dots,N_k)$ has the multinomial distribution with parameters $n$ and $p_1,\dots,p_k$ (with $\sum_{j=1}^k p_j=1$), if for all $(n_1,\dots,n_k)\in\{0,\dots,n\}^k$ with $n_1+\cdots+n_k=n$, it holds that
$$P(N_1=n_1,\dots,N_k=n_k)=\binom{n}{n_1\cdots n_k}p_1^{n_1}\cdots p_k^{n_k}.$$
Here
$$\binom{n}{n_1\cdots n_k}:=\frac{n!}{n_1!\cdots n_k!}.$$
Example 3.2.1 Histograms
Let $X_1,\dots,X_n$ be i.i.d. copies of a random variable $X\in\mathbb R$ with distribution $F$, and let $-\infty=a_0<a_1<\cdots<a_{k-1}<a_k=\infty$. Define, for $j=1,\dots,k$,
$$p_j:=P(X\in(a_{j-1},a_j])=F(a_j)-F(a_{j-1}),$$
$$\frac{N_j}{n}:=\frac{\#\{X_i\in(a_{j-1},a_j]\}}{n}=\hat F_n(a_j)-\hat F_n(a_{j-1}).$$
Then $(N_1,\dots,N_k)$ has the Multinomial$(n,p_1,\dots,p_k)$-distribution.

3.3 The Poisson distribution

Definition 3.3.1 A random variable $X\in\{0,1,\dots\}$ has the Poisson distribution with parameter $\lambda>0$, if for all $x\in\{0,1,\dots\}$
$$P(X=x)=e^{-\lambda}\frac{\lambda^x}{x!}$$
(see also Example 1.4.1).

Lemma 3.3.1 Suppose $X$ and $Y$ are independent, $X$ has the Poisson$(\lambda)$-distribution, and $Y$ the Poisson$(\mu)$-distribution. Then $Z:=X+Y$ has the Poisson$(\lambda+\mu)$-distribution.
Proof. For all $z\in\{0,1,\dots\}$, we have
$$P(Z=z)=\sum_{x=0}^z P(X=x,Y=z-x)=\sum_{x=0}^z P(X=x)P(Y=z-x)$$
$$=\sum_{x=0}^z e^{-\lambda}\frac{\lambda^x}{x!}\,e^{-\mu}\frac{\mu^{z-x}}{(z-x)!}=e^{-(\lambda+\mu)}\frac{1}{z!}\sum_{x=0}^z\binom zx\lambda^x\mu^{z-x}=e^{-(\lambda+\mu)}\frac{(\lambda+\mu)^z}{z!}.\qquad\Box$$
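A finite exact check of the lemma (with arbitrary illustrative rates): convolving two Poisson pmfs term by term reproduces the Poisson$(\lambda+\mu)$ pmf.

```python
# Exact check of the Poisson convolution identity for small z.
from math import exp, factorial

lam, mu = 1.5, 2.5  # arbitrary rates

def pois(k, rate):
    """Poisson pmf P(X = k) for the given rate."""
    return exp(-rate) * rate**k / factorial(k)

for z in range(15):
    conv = sum(pois(x, lam) * pois(z - x, mu) for x in range(z + 1))
    assert abs(conv - pois(z, lam + mu)) < 1e-12
```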
Lemma 3.3.2 Let $X_1,\dots,X_n$ be independent, and (for $i=1,\dots,n$) let $X_i$ have the Poisson$(\lambda_i)$-distribution. Define $Z:=\sum_{i=1}^n X_i$. Let $z\in\{0,1,\dots\}$. Then the conditional distribution of $(X_1,\dots,X_n)$ given $Z=z$ is the multinomial distribution with parameters $z$ and $p_1,\dots,p_n$, where
$$p_j=\frac{\lambda_j}{\sum_{i=1}^n\lambda_i},\quad j=1,\dots,n.$$
Proof. First note that $Z$ is Poisson$(\lambda_+)$-distributed, with $\lambda_+:=\sum_{i=1}^n\lambda_i$. Thus, for all $(x_1,\dots,x_n)\in\{0,1,\dots,z\}^n$ satisfying $\sum_{i=1}^n x_i=z$, we have
$$P(X_1=x_1,\dots,X_n=x_n|Z=z)=\frac{P(X_1=x_1,\dots,X_n=x_n)}{P(Z=z)}=\frac{\prod_{i=1}^n e^{-\lambda_i}\lambda_i^{x_i}/x_i!}{e^{-\lambda_+}\lambda_+^z/z!}=\binom{z}{x_1\cdots x_n}\Big(\frac{\lambda_1}{\lambda_+}\Big)^{x_1}\cdots\Big(\frac{\lambda_n}{\lambda_+}\Big)^{x_n}.\qquad\Box$$
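For $n=2$ the lemma says that $X_1$ given $Z=z$ is Binomial$(z,\lambda_1/(\lambda_1+\lambda_2))$; a direct numerical check with arbitrary rates:

```python
# Exact check of Lemma 3.3.2 with n = 2 (arbitrary rates and z).
from math import exp, factorial, comb

lam1, lam2 = 2.0, 3.0

def pois(k, rate):
    """Poisson pmf P(X = k) for the given rate."""
    return exp(-rate) * rate**k / factorial(k)

z = 7
p = lam1 / (lam1 + lam2)
for x in range(z + 1):
    cond = pois(x, lam1) * pois(z - x, lam2) / pois(z, lam1 + lam2)
    binom = comb(z, x) * p**x * (1 - p) ** (z - x)
    assert abs(cond - binom) < 1e-12
```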

3.4 The distribution of the maximum of two random variables
Let $X_1$ and $X_2$ be independent, both with distribution $F$. Suppose that $F$ has density $f$ w.r.t. Lebesgue measure. Let
$$Z:=\max\{X_1,X_2\}.$$

Lemma 3.4.1 The distribution function of $Z$ is $F^2$. Moreover, $Z$ has density
$$f_Z(z)=2F(z)f(z),\quad z\in\mathbb R.$$
Proof. We have for all $z$,
$$P(Z\le z)=P(\max\{X_1,X_2\}\le z)=P(X_1\le z,X_2\le z)=F^2(z).$$
If $F$ has density $f$, then (Lebesgue-)almost everywhere,
$$f(z)=\frac{d}{dz}F(z).$$
So the derivative of $F^2$ exists almost everywhere, and
$$\frac{d}{dz}F^2(z)=2F(z)f(z).\qquad\Box$$
The conditional distribution function of $X_1$ given $Z=z$ is
$$F_{X_1}(x_1|z)=\begin{cases}\dfrac{F(x_1)}{2F(z)}, & x_1<z\\[1ex] 1, & x_1\ge z\end{cases}.$$
Note thus that this distribution has a jump of size $\frac12$ at $z$.
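A small simulation sketch for $F=\text{Uniform}[0,1]$ (illustrative only): the empirical distribution function of $Z$ at a point $z_0$ should be close to $F^2(z_0)=z_0^2$.

```python
# Simulation sketch: the maximum of two independent uniforms has
# distribution function F^2(z) = z^2 on [0, 1].
import random

random.seed(1)
reps = 20000
z0 = 0.7
hits = sum(max(random.random(), random.random()) <= z0 for _ in range(reps))
est = hits / reps
assert abs(est - z0 ** 2) < 0.02   # z0^2 = 0.49
```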
Chapter 4

Sufficiency and exponential families
In this chapter, we denote the data by $X\in\mathcal X$. (In examples $X$ is often replaced by $X=(X_1,\dots,X_n)$ with $X_1,\dots,X_n$ i.i.d. copies of $X$.) We assume $X$ has distribution $P\in\{P_\theta:\theta\in\Theta\}$.

4.1 Sufficiency

Let $S:\mathcal X\to\mathcal Y$ be some given map. We consider the statistic $S=S(X)$. Throughout, by the phrase for all possible $s$, we mean for all $s$ for which conditional distributions given $S=s$ are defined (in other words: for all $s$ in the support of the distribution of $S$, which may depend on $\theta$).
Definition 4.1.1 We call $S$ sufficient for $\theta\in\Theta$ if for all $\theta$, and all possible $s$, the conditional distribution
$$P_\theta(X\in\cdot\,|S(X)=s)$$
does not depend on $\theta$.
Example 4.1.1 Sufficiency and Bernoulli trials
Let $X=(X_1,\dots,X_n)$ with $X_1,\dots,X_n$ i.i.d. with the Bernoulli distribution with probability $\theta\in(0,1)$ of success: (for $i=1,\dots,n$)
$$P_\theta(X_i=1)=1-P_\theta(X_i=0)=\theta.$$
Take $S=\sum_{i=1}^n X_i$. Then $S$ is sufficient for $\theta$: for all possible $s$,
$$I\!P_\theta\Big(X_1=x_1,\dots,X_n=x_n\,\Big|\,S=s\Big)=\frac{1}{\binom ns},\quad \sum_{i=1}^n x_i=s.$$
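This claim can be confirmed by brute force for small $n$: the sketch below (with hypothetical $n$ and $s$) enumerates all sequences with $\sum_i x_i=s$ and checks that the conditional probability equals $1/\binom ns$ for several values of $\theta$.

```python
# Brute-force check that the conditional law of Bernoulli trials given
# their sum is uniform over sequences, for every theta.
from itertools import product
from math import comb

n, s = 5, 2  # illustrative values
conds = []
for theta in (0.2, 0.5, 0.9):
    p_s = comb(n, s) * theta**s * (1 - theta) ** (n - s)   # P(S = s)
    for seq in product((0, 1), repeat=n):
        if sum(seq) == s:
            p_seq = theta**s * (1 - theta) ** (n - s)       # P(X = seq)
            conds.append(p_seq / p_s)

# Every conditional probability equals 1 / C(n, s), free of theta:
assert all(abs(c - 1 / comb(n, s)) < 1e-12 for c in conds)
```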

Example 4.1.2 Sufficiency and the Poisson distribution
Let $X:=(X_1,\dots,X_n)$, with $X_1,\dots,X_n$ i.i.d. and Poisson$(\theta)$-distributed. Take $S=\sum_{i=1}^n X_i$. Then $S$ has the Poisson$(n\theta)$-distribution. For all possible $s$, the conditional distribution of $X$ given $S=s$ is the multinomial distribution with parameters $s$ and $(p_1,\dots,p_n)=(\frac1n,\dots,\frac1n)$:
$$I\!P_\theta\Big(X_1=x_1,\dots,X_n=x_n\,\Big|\,S=s\Big)=\binom{s}{x_1\cdots x_n}\Big(\frac1n\Big)^s,\quad \sum_{i=1}^n x_i=s.$$
This distribution does not depend on $\theta$, so $S$ is sufficient for $\theta$.


Example 4.1.3 Sufficiency and the exponential distribution
Let $X_1$ and $X_2$ be independent, both with the exponential distribution with parameter $\theta>0$. The density of, e.g., $X_1$ is then
$$f_{X_1}(x;\theta)=\theta e^{-\theta x},\quad x>0.$$
Let $S=X_1+X_2$. Verify that $S$ has density
$$f_S(s;\theta)=s\theta^2e^{-\theta s},\quad s>0.$$
(This is the Gamma$(2,\theta)$-distribution.) For all possible $s$, the conditional density of $(X_1,X_2)$ given $S=s$ is thus
$$f_{X_1,X_2}(x_1,x_2|S=s)=\frac1s,\quad x_1+x_2=s.$$
Hence, $S$ is sufficient for $\theta$.
Example 4.1.4 Sufficiency of the order statistics
Let $X_1,\dots,X_n$ be an i.i.d. sample from a continuous distribution $F$. Then $S:=(X_{(1)},\dots,X_{(n)})$ is sufficient for $F$: for all possible $s=(s_1,\dots,s_n)$ ($s_1<\cdots<s_n$), and for $(x_{(1)},\dots,x_{(n)})=s$,
$$I\!P\Big((X_1,\dots,X_n)=(x_1,\dots,x_n)\,\Big|\,(X_{(1)},\dots,X_{(n)})=s\Big)=\frac{1}{n!}.$$
Example 4.1.5 Sufficiency and the uniform distribution
Let $X_1$ and $X_2$ be independent, both uniformly distributed on the interval $[0,\theta]$, with $\theta>0$. Define $Z:=X_1+X_2$.
Lemma The random variable $Z$ has density
$$f_Z(z;\theta)=\begin{cases}z/\theta^2 & 0\le z\le\theta\\ (2\theta-z)/\theta^2 & \theta\le z\le 2\theta\end{cases}.$$
Proof. First, assume $\theta=1$. Then the distribution function of $Z$ is
$$F_Z(z)=\begin{cases}z^2/2 & 0\le z\le 1\\ 1-(2-z)^2/2 & 1\le z\le 2\end{cases},$$
so the density is then
$$f_Z(z)=\begin{cases}z & 0\le z\le 1\\ 2-z & 1\le z\le 2\end{cases}.$$
For general $\theta$, the result follows from the uniform case by the transformation $Z\mapsto\theta Z$, which maps $f_Z$ into $f_Z(\cdot/\theta)/\theta$. $\Box$
The conditional density of $(X_1,X_2)$ given $Z=z\in(0,2\theta)$ depends on $\theta$, so $Z$ is not sufficient for $\theta$.
Consider now $S:=\max\{X_1,X_2\}$. The conditional density of $(X_1,X_2)$ given $S=s\in(0,\theta)$ is
$$f_{X_1,X_2}(x_1,x_2|S=s)=\frac{1}{2s},\quad 0\le x_1<s,\ x_2=s\ \text{ or }\ x_1=s,\ 0\le x_2<s.$$
This does not depend on $\theta$, so $S$ is sufficient for $\theta$.

4.2 Factorization Theorem of Neyman

Theorem 4.2.1 (Factorization Theorem of Neyman) Suppose $\{P_\theta:\theta\in\Theta\}$ is dominated by a $\sigma$-finite measure $\nu$. Let $p_\theta:=dP_\theta/d\nu$ denote the densities. Then $S$ is sufficient for $\theta$ if and only if one can write $p_\theta$ in the form
$$p_\theta(x)=g_\theta(S(x))h(x),\quad\forall\,x,\theta,$$
for some functions $g_\theta(\cdot)\ge 0$ and $h(\cdot)\ge 0$.
Proof in the discrete case. Suppose $X$ takes only the values $a_1,a_2,\dots$, for all $\theta$ (so we may take $\nu$ to be the counting measure). Let $Q_\theta$ be the distribution of $S$:
$$Q_\theta(s):=\sum_{j:\,S(a_j)=s}P_\theta(X=a_j).$$
The conditional distribution of $X$ given $S$ is
$$P_\theta(X=x|S=s)=\frac{P_\theta(X=x)}{Q_\theta(s)},\quad S(x)=s.$$
($\Rightarrow$) If $S$ is sufficient for $\theta$, the above does not depend on $\theta$, but is only a function of $x$, say $h(x)$. So we may write for $S(x)=s$,
$$P_\theta(X=x)=P_\theta(X=x|S=s)Q_\theta(S=s)=h(x)g_\theta(s),$$
with $g_\theta(s)=Q_\theta(S=s)$.
($\Leftarrow$) Inserting $p_\theta(x)=g_\theta(S(x))h(x)$, we find
$$Q_\theta(s)=g_\theta(s)\sum_{j:\,S(a_j)=s}h(a_j).$$
This gives in the formula for $P_\theta(X=x|S=s)$,
$$P_\theta(X=x|S=s)=\frac{h(x)}{\sum_{j:\,S(a_j)=s}h(a_j)},$$
which does not depend on $\theta$. $\Box$
Remark The proof for the general case is along the same lines, but does have some subtle elements!

Example 4.2.1 Sufficiency for the uniform distribution with unknown endpoint
Let $X_1,\dots,X_n$ be i.i.d., and uniformly distributed on the interval $[0,\theta]$. Then the density of $X=(X_1,\dots,X_n)$ is
$$p_\theta(x_1,\dots,x_n)=\frac{1}{\theta^n}\,{\bf 1}\{0\le\min\{x_1,\dots,x_n\}\le\max\{x_1,\dots,x_n\}\le\theta\}=g_\theta(S(x_1,\dots,x_n))h(x_1,\dots,x_n),$$
with
$$g_\theta(s):=\frac{1}{\theta^n}{\bf 1}\{s\le\theta\},$$
and
$$h(x_1,\dots,x_n):={\bf 1}\{0\le\min\{x_1,\dots,x_n\}\}.$$
Thus, $S=\max\{X_1,\dots,X_n\}$ is sufficient for $\theta$.
Corollary 4.2.1 The likelihood is $L_X(\theta)=p_\theta(X)=g_\theta(S)h(X)$. Hence, the maximum likelihood estimator $\hat\theta=\arg\max_\theta L_X(\theta)=\arg\max_\theta g_\theta(S)$ depends only on the sufficient statistic $S$.

4.3 Exponential families

Definition 4.3.1 A $k$-dimensional exponential family is a family of distributions $\{P_\theta:\theta\in\Theta\}$, dominated by some $\sigma$-finite measure $\nu$, with densities $p_\theta=dP_\theta/d\nu$ of the form
$$p_\theta(x)=\exp\Big(\sum_{j=1}^k c_j(\theta)T_j(x)-d(\theta)\Big)h(x).$$
Note In case of a $k$-dimensional exponential family, the $k$-dimensional statistic $S(X)=(T_1(X),\dots,T_k(X))$ is sufficient for $\theta$.
Note If $X_1,\dots,X_n$ is an i.i.d. sample from a $k$-dimensional exponential family, then the distribution of $X=(X_1,\dots,X_n)$ is also in a $k$-dimensional exponential family. The density of $X$ is then (for $x:=(x_1,\dots,x_n)$)
$$p_\theta(x)=\prod_{i=1}^n p_\theta(x_i)=\exp\Big(\sum_{j=1}^k nc_j(\theta)\bar T_j(x)-nd(\theta)\Big)\prod_{i=1}^n h(x_i),$$
where, for $j=1,\dots,k$,
$$\bar T_j(x)=\frac1n\sum_{i=1}^n T_j(x_i).$$
Hence $S(X)=(\bar T_1(X),\dots,\bar T_k(X))$ is then sufficient for $\theta$.
Note The functions $\{T_j\}$ and $\{c_j\}$ are not uniquely defined.
Example 4.3.1 Poisson distribution
If $X$ is Poisson$(\theta)$-distributed, we have
$$p_\theta(x)=e^{-\theta}\frac{\theta^x}{x!}=\exp[x\log\theta-\theta]\frac{1}{x!}.$$
Hence, we may take $T(x)=x$, $c(\theta)=\log\theta$, and $d(\theta)=\theta$.
Example 4.3.2 Binomial distribution
If $X$ has the Binomial$(n,\theta)$-distribution, we have
$$p_\theta(x)=\binom nx\theta^x(1-\theta)^{n-x}=\binom nx\Big(\frac{\theta}{1-\theta}\Big)^x(1-\theta)^n=\binom nx\exp\Big[x\log\Big(\frac{\theta}{1-\theta}\Big)+n\log(1-\theta)\Big].$$
So we can take $T(x)=x$, $c(\theta)=\log(\theta/(1-\theta))$, and $d(\theta)=-n\log(1-\theta)$.
Example 4.3.3 Negative binomial distribution
If $X$ has the Negative Binomial$(m,\theta)$-distribution with $m$ known, we have
$$p_\theta(x)=\frac{\Gamma(x+m)}{\Gamma(m)x!}\theta^m(1-\theta)^x=\frac{\Gamma(x+m)}{\Gamma(m)x!}\exp[x\log(1-\theta)+m\log\theta].$$
So we may take $T(x)=x$, $c(\theta)=\log(1-\theta)$ and $d(\theta)=-m\log\theta$.
Example 4.3.4 Gamma distribution with one parameter
Let $X$ have the Gamma$(m,\theta)$-distribution with $m$ known. Then
$$p_\theta(x)=\frac{\theta^m}{\Gamma(m)}e^{-\theta x}x^{m-1}=\frac{x^{m-1}}{\Gamma(m)}\exp[-\theta x+m\log\theta].$$
So we can take $T(x)=x$, $c(\theta)=-\theta$, and $d(\theta)=-m\log\theta$.

Example 4.3.5 Gamma distribution with two parameters
Let $X$ have the Gamma$(m,\lambda)$-distribution, and let $\theta=(m,\lambda)$. Then
$$p_\theta(x)=\frac{\lambda^m}{\Gamma(m)}e^{-\lambda x}x^{m-1}=\exp[-\lambda x+(m-1)\log x+m\log\lambda-\log\Gamma(m)].$$
So we can take $T_1(x)=x$, $T_2(x)=\log x$, $c_1(\theta)=-\lambda$, $c_2(\theta)=m-1$, and $d(\theta)=-m\log\lambda+\log\Gamma(m)$.
Example 4.3.6 Normal distribution
Let $X$ be $N(\mu,\sigma^2)$-distributed, and let $\theta=(\mu,\sigma)$. Then
$$p_\theta(x)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big[-\frac{(x-\mu)^2}{2\sigma^2}\Big]=\frac{1}{\sqrt{2\pi}}\exp\Big[\frac{x\mu}{\sigma^2}-\frac{x^2}{2\sigma^2}-\frac{\mu^2}{2\sigma^2}-\log\sigma\Big].$$
So we can take $T_1(x)=x$, $T_2(x)=x^2$, $c_1(\theta)=\mu/\sigma^2$, $c_2(\theta)=-1/(2\sigma^2)$, and $d(\theta)=\mu^2/(2\sigma^2)+\log\sigma$.

4.4 Intermezzo: the mean and covariance matrix of a random vector

Let $Z\in\mathbb R^k$ be a random vector. Then its mean (if it exists) is defined as the vector consisting of the means of each entry of $Z$:
$$EZ=\begin{pmatrix}EZ_1\\ \vdots\\ EZ_k\end{pmatrix}.$$
The covariance matrix of $Z$ (if it exists) is defined as the symmetric $k\times k$ matrix $\Sigma$ containing the (co)variances between each pair of entries in $Z$:
$$\Sigma:=EZZ'-EZ\,EZ'=\begin{pmatrix}\sigma_1^2&\sigma_{1,2}&\cdots&\sigma_{1,k}\\ \sigma_{1,2}&\sigma_2^2&\cdots&\sigma_{2,k}\\ \vdots&\vdots&\ddots&\vdots\\ \sigma_{1,k}&\sigma_{2,k}&\cdots&\sigma_k^2\end{pmatrix}.$$
Here, $Z'$ denotes the transpose¹ of the vector $Z$, $\sigma_j^2:=\mathrm{var}(Z_j)$ ($j=1,\dots,k$), and for all $j_1\neq j_2$, $\sigma_{j_1,j_2}=\mathrm{cov}(Z_{j_1},Z_{j_2})$. We will often write the covariance matrix $\Sigma$ as $\mathrm{Cov}(Z)$.
¹ We alternatively write the transpose of a vector, say $v$, as $v^T$.

4.5 Canonical form of an exponential family

In this subsection, we assume regularity conditions, such as existence of derivatives and inverses, and permission to interchange differentiation and integration.
Let $\Theta\subset\mathbb R^k$, and let $\{P_\theta:\theta\in\Theta\}$ be a family of probability measures dominated by a $\sigma$-finite measure $\nu$. Define the densities
$$p_\theta:=\frac{dP_\theta}{d\nu}.$$
Definition We call $\{P_\theta:\theta\in\Theta\}$ an exponential family in canonical form, if
$$p_\theta(x)=\exp\Big(\sum_{j=1}^k\theta_jT_j(x)-d(\theta)\Big)h(x).$$
Note that $d(\theta)$ is the normalizing constant
$$d(\theta)=\log\bigg(\int\exp\Big(\sum_{j=1}^k\theta_jT_j(x)\Big)h(x)\,d\nu(x)\bigg).$$
We let
$$\dot d(\theta):=\frac{\partial}{\partial\theta}d(\theta)$$
denote the vector of first derivatives, and let
$$\ddot d(\theta):=\frac{\partial^2}{\partial\theta\partial\theta'}d(\theta)=\Big(\frac{\partial^2}{\partial\theta_{j_1}\partial\theta_{j_2}}d(\theta)\Big)$$
denote the $k\times k$ matrix of second derivatives. Further, we write
$$T(X):=\begin{pmatrix}T_1(X)\\ \vdots\\ T_k(X)\end{pmatrix},\qquad E_\theta T(X):=\begin{pmatrix}E_\theta T_1(X)\\ \vdots\\ E_\theta T_k(X)\end{pmatrix},$$
and we write the $k\times k$ covariance matrix of $T(X)$ as
$$\mathrm{Cov}_\theta(T(X)):=E_\theta T(X)T'(X)-E_\theta T(X)E_\theta T'(X).$$
Lemma 4.5.1 We have (under regularity)
$$E_\theta T(X)=\dot d(\theta),\qquad \mathrm{Cov}_\theta(T(X))=\ddot d(\theta).$$

Proof. By the definition of $d(\theta)$, we find
$$\dot d(\theta)=\frac{\partial}{\partial\theta}\log\int\exp\big(\theta'T(x)\big)h(x)\,d\nu(x)=\frac{\int\exp\big(\theta'T(x)\big)T(x)h(x)\,d\nu(x)}{\int\exp\big(\theta'T(x)\big)h(x)\,d\nu(x)}$$
$$=\int\exp\big(\theta'T(x)-d(\theta)\big)T(x)h(x)\,d\nu(x)=\int p_\theta(x)T(x)\,d\nu(x)=E_\theta T(X),$$
and (omitting the integration variable $x$ to shorten the expressions)
$$\ddot d(\theta)=\frac{\int\exp\big(\theta'T\big)TT'h\,d\nu}{\int\exp\big(\theta'T\big)h\,d\nu}-\frac{\big(\int\exp\big(\theta'T\big)Th\,d\nu\big)\big(\int\exp\big(\theta'T\big)Th\,d\nu\big)'}{\big(\int\exp\big(\theta'T\big)h\,d\nu\big)^2}$$
$$=\int\exp\big(\theta'T-d(\theta)\big)TT'h\,d\nu-\Big(\int\exp\big(\theta'T-d(\theta)\big)Th\,d\nu\Big)\Big(\int\exp\big(\theta'T-d(\theta)\big)Th\,d\nu\Big)'$$
$$=\int TT'p_\theta\,d\nu-\Big(\int p_\theta T\,d\nu\Big)\Big(\int p_\theta T\,d\nu\Big)'=E_\theta T(X)T'(X)-\big(E_\theta T(X)\big)\big(E_\theta T(X)\big)'=\mathrm{Cov}_\theta(T(X)).\qquad\Box$$
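A numeric sketch of the lemma for the Poisson family in canonical form ($\theta=\log\lambda$, $T(x)=x$, $d(\theta)=e^\theta$; the value of $\lambda$ below is arbitrary): finite differences of $d$ should match the mean and variance of $T(X)$, both equal to $\lambda$.

```python
# Numeric check of Lemma 4.5.1 for the canonical Poisson family.
from math import exp, log, factorial

lam = 3.0
theta = log(lam)

def d(t):
    """Normalizing constant d(theta) = e^theta of the canonical Poisson family."""
    return exp(t)

h = 1e-5  # step for central finite differences
d1 = (d(theta + h) - d(theta - h)) / (2 * h)
d2 = (d(theta + h) - 2 * d(theta) + d(theta - h)) / h ** 2

# Mean and variance of X ~ Poisson(lam) from the (truncated) pmf
pmf = [exp(-lam) * lam ** k / factorial(k) for k in range(60)]
mean = sum(k * p for k, p in enumerate(pmf))
var = sum(k ** 2 * p for k, p in enumerate(pmf)) - mean ** 2

assert abs(d1 - mean) < 1e-4   # d'(theta) = E_theta T(X)
assert abs(d2 - var) < 1e-2    # d''(theta) = var_theta(T(X))
```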

4.6 Reparametrizing in the one-dimensional case

Let us now simplify to the one-dimensional case, that is, $\Theta\subset\mathbb R$. Consider an exponential family, not necessarily in canonical form:
$$p_\theta(x)=\exp[c(\theta)T(x)-d(\theta)]h(x).$$
We can put this in canonical form by reparametrizing
$$\theta\mapsto c(\theta):=\gamma\ \text{(say)},$$
to get
$$\tilde p_\gamma(x)=\exp[\gamma T(x)-d_0(\gamma)]h(x),$$
where, when $c$ is one-to-one,
$$d_0(\gamma)=d(c^{-1}(\gamma)).$$
It follows that
$$E_\theta T(X)=\dot d_0(\gamma)=\frac{\dot d(c^{-1}(\gamma))}{\dot c(c^{-1}(\gamma))}=\frac{\dot d(\theta)}{\dot c(\theta)},\qquad(4.1)$$
and
$$\mathrm{var}_\theta(T(X))=\ddot d_0(\gamma)=\frac{\ddot d(c^{-1}(\gamma))}{[\dot c(c^{-1}(\gamma))]^2}-\frac{\dot d(c^{-1}(\gamma))\,\ddot c(c^{-1}(\gamma))}{[\dot c(c^{-1}(\gamma))]^3}=\frac{\ddot d(\theta)}{[\dot c(\theta)]^2}-\frac{\dot d(\theta)\ddot c(\theta)}{[\dot c(\theta)]^3}=\frac{1}{[\dot c(\theta)]^2}\bigg(\ddot d(\theta)-\frac{\dot d(\theta)}{\dot c(\theta)}\ddot c(\theta)\bigg).\qquad(4.2)$$

4.7 Score function and Fisher information

Consider an arbitrary (but regular) family of densities $\{p_\theta:\theta\in\Theta\}$, with (again for simplicity) $\Theta\subset\mathbb R$.
Definition 4.7.1 The score function is
$$s_\theta(x):=\frac{d}{d\theta}\log p_\theta(x).$$
The Fisher information for estimating $\theta$ is
$$I(\theta):=\mathrm{var}_\theta(s_\theta(X)).$$
More generally, the Fisher information for estimating a differentiable function $g(\theta)$ of the parameter $\theta$ is equal to $I(\theta)/[\dot g(\theta)]^2$.
See also Chapters 5 and 13.
Lemma 4.7.1 We have (under regularity)
$$E_\theta s_\theta(X)=0,$$
and
$$I(\theta)=-E_\theta\dot s_\theta(X),$$
where $\dot s_\theta(x):=\frac{d}{d\theta}s_\theta(x)$.

Proof. The results follow from the fact that densities integrate to one, assuming that we may interchange derivatives and integrals:
$$E_\theta s_\theta(X)=\int s_\theta(x)p_\theta(x)\,d\nu(x)=\int\frac{d\log p_\theta(x)}{d\theta}p_\theta(x)\,d\nu(x)=\int\frac{\dot p_\theta(x)}{p_\theta(x)}p_\theta(x)\,d\nu(x)$$
$$=\int\dot p_\theta(x)\,d\nu(x)=\frac{d}{d\theta}\int p_\theta(x)\,d\nu(x)=\frac{d}{d\theta}1=0,$$
and
$$E_\theta\dot s_\theta(X)=E_\theta\bigg(\frac{\ddot p_\theta(X)}{p_\theta(X)}-\Big(\frac{\dot p_\theta(X)}{p_\theta(X)}\Big)^2\bigg)=E_\theta\frac{\ddot p_\theta(X)}{p_\theta(X)}-E_\theta s_\theta^2(X).$$
Now, $E_\theta s_\theta^2(X)$ equals $\mathrm{var}_\theta(s_\theta(X))$, since $E_\theta s_\theta(X)=0$. Moreover,
$$E_\theta\frac{\ddot p_\theta(X)}{p_\theta(X)}=\int\ddot p_\theta(x)\,d\nu(x)=\frac{d^2}{d\theta^2}\int p_\theta(x)\,d\nu(x)=\frac{d^2}{d\theta^2}1=0.\qquad\Box$$

4.8 Score function for exponential families

In the special case that $\{P_\theta:\theta\in\Theta\}$ is a one-dimensional exponential family, the densities are of the form
$$p_\theta(x)=\exp[c(\theta)T(x)-d(\theta)]h(x).$$
Hence
$$s_\theta(x)=\dot c(\theta)T(x)-\dot d(\theta).$$
The equality $E_\theta s_\theta(X)=0$ implies that
$$E_\theta T(X)=\frac{\dot d(\theta)}{\dot c(\theta)},$$
which re-establishes (4.1). One moreover has
$$\dot s_\theta(x)=\ddot c(\theta)T(x)-\ddot d(\theta).$$
Hence, the equality $\mathrm{var}_\theta(s_\theta(X))=-E_\theta\dot s_\theta(X)$ implies
$$[\dot c(\theta)]^2\,\mathrm{var}_\theta(T(X))=-\ddot c(\theta)E_\theta T(X)+\ddot d(\theta)=\ddot d(\theta)-\frac{\dot d(\theta)}{\dot c(\theta)}\ddot c(\theta),$$
which re-establishes (4.2). In addition, it follows that
$$I(\theta)=\ddot d(\theta)-\frac{\dot d(\theta)}{\dot c(\theta)}\ddot c(\theta).$$
The Fisher information for estimating $\gamma=c(\theta)$ is
$$I_0(\gamma)=\ddot d_0(\gamma)=\frac{I(\theta)}{[\dot c(\theta)]^2}.$$

Example 4.8.1 Bernoulli distribution in canonical form
Let $X\in\{0,1\}$ have the Bernoulli distribution with success parameter $\theta\in(0,1)$:
$$p_\theta(x)=\theta^x(1-\theta)^{1-x}=\exp\Big[x\log\Big(\frac{\theta}{1-\theta}\Big)+\log(1-\theta)\Big],\quad x\in\{0,1\}.$$
We reparametrize:
$$\gamma:=c(\theta)=\log\Big(\frac{\theta}{1-\theta}\Big),$$
which is called the log-odds ratio. Inverting gives
$$\theta=\frac{e^\gamma}{1+e^\gamma},$$
and hence
$$d(\theta)=-\log(1-\theta)=\log\big(1+e^\gamma\big)=:d_0(\gamma).$$
Thus
$$\dot d_0(\gamma)=\frac{e^\gamma}{1+e^\gamma}=\theta=E_\theta X,$$
and
$$\ddot d_0(\gamma)=\frac{e^\gamma}{1+e^\gamma}-\frac{e^{2\gamma}}{(1+e^\gamma)^2}=\frac{e^\gamma}{(1+e^\gamma)^2}=\theta(1-\theta)=\mathrm{var}_\theta(X).$$
The score function is
$$s_\theta(x)=\frac{d}{d\theta}\Big(x\log\Big(\frac{\theta}{1-\theta}\Big)+\log(1-\theta)\Big)=\frac{x}{\theta(1-\theta)}-\frac{1}{1-\theta}.$$
The Fisher information for estimating the success parameter $\theta$ is
$$E_\theta s_\theta^2(X)=\frac{\mathrm{var}_\theta(X)}{[\theta(1-\theta)]^2}=\frac{1}{\theta(1-\theta)},$$
whereas the Fisher information for estimating the log-odds ratio $\gamma$ is
$$I_0(\gamma)=\theta(1-\theta).$$
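A numeric cross-check of the two information quantities ($\theta=0.3$ below is an arbitrary illustrative value): $I_0(\gamma)$ should equal $I(\theta)/[\dot c(\theta)]^2$, with $\dot c$ approximated by a finite difference.

```python
# Numeric check: Fisher information under the log-odds reparametrization.
from math import log

theta = 0.3
I_theta = 1 / (theta * (1 - theta))   # information for theta
I_gamma = theta * (1 - theta)         # information for gamma = c(theta)

def c(t):
    """Log-odds ratio c(theta) = log(theta / (1 - theta))."""
    return log(t / (1 - t))

h = 1e-6
c1 = (c(theta + h) - c(theta - h)) / (2 * h)  # finite-difference c'(theta)
assert abs(I_gamma - I_theta / c1 ** 2) < 1e-6
```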



4.9 Minimal sufficiency

Definition 4.9.1 We say that two likelihoods $L_x(\theta)$ and $L_{\tilde x}(\theta)$ are proportional at $(x,\tilde x)$, if
$$L_x(\theta)=L_{\tilde x}(\theta)c(x,\tilde x),\quad\forall\,\theta,$$
for some constant $c(x,\tilde x)$. We write this as
$$L_x(\theta)\propto L_{\tilde x}(\theta).$$
A sufficient statistic $S$ is called minimal sufficient if $S(x)=S(\tilde x)$ for all $x$ and $\tilde x$ for which the likelihoods are proportional.

Example 4.9.1 Minimal sufficiency for the normal distribution
Let $X_1,\dots,X_n$ be independent and $N(\theta,1)$-distributed. Then $S=\sum_{i=1}^n X_i$ is sufficient for $\theta$. We moreover have
$$\log L_x(\theta)=S(x)\theta-\frac{n\theta^2}{2}-\frac{\sum_{i=1}^n x_i^2}{2}-\frac n2\log(2\pi).$$
So
$$\log L_x(\theta)-\log L_{\tilde x}(\theta)=\big(S(x)-S(\tilde x)\big)\theta-\frac{\sum_{i=1}^n x_i^2-\sum_{i=1}^n\tilde x_i^2}{2},$$
which equals
$$\log c(x,\tilde x),\quad\forall\,\theta,$$
for some function $c$, if and only if $S(x)=S(\tilde x)$. So $S$ is minimal sufficient.
Example 4.9.2 Minimal sufficiency for the Laplace distribution
Let $X_1,\dots,X_n$ be independent and Laplace-distributed with location parameter $\theta$. Then
$$\log L_x(\theta)=-\frac n2\log 2-\sqrt 2\sum_{i=1}^n|x_i-\theta|,$$
so
$$\log L_x(\theta)-\log L_{\tilde x}(\theta)=-\sqrt 2\sum_{i=1}^n\big(|x_i-\theta|-|\tilde x_i-\theta|\big),$$
which equals
$$\log c(x,\tilde x),\quad\forall\,\theta,$$
for some function $c$, if and only if $(x_{(1)},\dots,x_{(n)})=(\tilde x_{(1)},\dots,\tilde x_{(n)})$. So the order statistics $X_{(1)},\dots,X_{(n)}$ are minimal sufficient.
Chapter 5

Bias, variance and the Cramér–Rao lower bound
5.1 What is an unbiased estimator?

Let $X\in\mathcal X$ denote the observations. The distribution $P$ of $X$ is assumed to be a member of a given class $\{P_\theta:\theta\in\Theta\}$ of distributions. The parameter of interest is $\gamma:=g(\theta)$, with $g:\Theta\to\mathbb R$. Except for the last section in this chapter, the parameter $\gamma$ is assumed to be one-dimensional.
Let $T:\mathcal X\to\mathbb R$ be an estimator of $g(\theta)$.
Definition 5.1.1 The bias of $T=T(X)$ is
$$\mathrm{bias}_\theta(T):=E_\theta T-g(\theta).$$
The estimator $T$ is called unbiased if
$$\mathrm{bias}_\theta(T)=0,\quad\forall\,\theta.$$
Thus, unbiasedness means that there is no systematic error: $E_\theta T=g(\theta)$. We require this for all $\theta$.

Example 5.1.1 Unbiased estimators in the Binomial case
Let $X\sim\text{Binomial}(n,\theta)$, $0<\theta<1$. We have
$$E_\theta T(X)=\sum_{k=0}^n\binom nk\theta^k(1-\theta)^{n-k}T(k)=:q(\theta).$$
Note that $q(\theta)$ is a polynomial in $\theta$ of degree at most $n$. So only parameters $g(\theta)$ which are polynomials of degree at most $n$ can be estimated unbiasedly. It means that there exists no unbiased estimator of, for example, $\sqrt\theta$ or $\theta/(1-\theta)$.

Example 5.1.2 Unbiased estimators in the Poisson case
Let $X\sim\text{Poisson}(\theta)$. Then
$$E_\theta T(X)=\sum_{k=0}^\infty e^{-\theta}\frac{\theta^k}{k!}T(k)=:e^{-\theta}p(\theta).$$
Note that $p(\theta)$ is a power series in $\theta$. Thus only parameters $g(\theta)$ which are a power series in $\theta$ times $e^{-\theta}$ can be estimated unbiasedly. An example is the probability of early failure
$$g(\theta):=e^{-\theta}=P_\theta(X=0).$$
An unbiased estimator of $e^{-\theta}$ is for instance
$$T(X)={\bf 1}\{X=0\}.$$
As another example, suppose the parameter of interest is
$$g(\theta):=e^{-2\theta}.$$
An unbiased estimator is
$$T(X)=\begin{cases}+1 & \text{if }X\text{ is even}\\ -1 & \text{if }X\text{ is odd}\end{cases}.$$
This estimator does not make sense at all!
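That this strange $T$ is nevertheless unbiased follows from $E_\theta(-1)^X=e^{-\theta}\sum_k(-\theta)^k/k!=e^{-2\theta}$; a numerical sketch (with an arbitrary $\theta$):

```python
# Exact check that T(X) = (-1)^X is unbiased for e^{-2 theta}
# when X ~ Poisson(theta), despite taking only the values +1 and -1.
from math import exp, factorial

theta = 1.7  # arbitrary illustrative value
E_T = sum((-1) ** k * exp(-theta) * theta ** k / factorial(k)
          for k in range(80))  # truncated Poisson expectation
assert abs(E_T - exp(-2 * theta)) < 1e-12
```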
Example 5.1.3 Unbiased estimator of the variance
Let $X_1,\dots,X_n$ be i.i.d. $N(\mu,\sigma^2)$, and let $\theta=(\mu,\sigma^2)\in\mathbb R\times\mathbb R_+$. Then
$$S^2:=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X)^2$$
is an unbiased estimator of $\sigma^2$. But $S$ is not an unbiased estimator of $\sigma$. In fact, one can show that there does not exist any unbiased estimator of $\sigma$!
We conclude that requiring unbiasedness can have disadvantages: unbiased estimators do not always exist, and if they do, they can be nonsensical. Moreover, the property of unbiasedness is not preserved under taking nonlinear transformations.

5.2 UMVU estimators

Definition 5.2.1 The mean square error of $T$ is
$$\mathrm{MSE}_\theta(T):=E_\theta\big(T-g(\theta)\big)^2.$$
Lemma 5.2.1 We have the following decomposition for the mean square error:
$$\mathrm{MSE}_\theta(T)=\mathrm{bias}_\theta^2(T)+\mathrm{var}_\theta(T).$$
Proof. Write $E_\theta T=:q(\theta)$. Then
$$E_\theta\big(T-g(\theta)\big)^2=\underbrace{E_\theta\big(T-q(\theta)\big)^2}_{=\mathrm{var}_\theta(T)}+\underbrace{\big(q(\theta)-g(\theta)\big)^2}_{=\mathrm{bias}_\theta^2(T)}+2\big(q(\theta)-g(\theta)\big)\underbrace{E_\theta\big(T-q(\theta)\big)}_{=0}.\qquad\Box$$
In other words, the mean square error consists of two components, the (squared) bias and the variance. This is called the bias-variance decomposition. As we will see, it is often the case that an attempt to decrease the bias results in an increase of the variance (and vice versa).
Example 5.2.1 Estimators in case of the normal distribution
Let $X_1,\dots,X_n$ be i.i.d. $N(\mu,\sigma^2)$-distributed. Both $\mu$ and $\sigma^2$ are unknown parameters: $\theta:=(\mu,\sigma^2)$.
Case i Suppose the mean $\mu$ is our parameter of interest. Consider the estimator $T:=a\bar X$, where $0\le a\le 1$. Then the bias is decreasing in $a$, but the variance is increasing in $a$:
$$\mathrm{MSE}_\theta(T)=E_\theta(T-\mu)^2=(1-a)^2\mu^2+a^2\sigma^2/n.$$
The right hand side can be minimized as a function of $a$. The minimum is attained at
$$a_{\rm opt}:=\frac{\mu^2}{\sigma^2/n+\mu^2}.$$
However, $a_{\rm opt}\bar X$ is not an estimator, as it depends on the unknown parameters.
Case ii Suppose $\sigma^2$ is the parameter of interest. Let $S^2$ be the sample variance:
$$S^2:=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X)^2.$$
It is known that $S^2$ is unbiased. But does it also have small mean square error? Let us compare it with the estimator
$$\hat\sigma^2:=\frac1n\sum_{i=1}^n(X_i-\bar X)^2.$$
To compute the mean square errors of these two estimators, we first recall that
$$\frac{\sum_{i=1}^n(X_i-\bar X)^2}{\sigma^2}\sim\chi^2_{n-1},$$
a $\chi^2$-distribution with $n-1$ degrees of freedom. The $\chi^2$-distribution is a special case of the Gamma distribution, namely
$$\chi^2_{n-1}=\Gamma\Big(\frac{n-1}2,\frac12\Big).$$
Thus¹
$$E_\theta\bigg(\frac{\sum_{i=1}^n(X_i-\bar X)^2}{\sigma^2}\bigg)=n-1,\qquad \mathrm{var}\bigg(\frac{\sum_{i=1}^n(X_i-\bar X)^2}{\sigma^2}\bigg)=2(n-1).$$
It follows that
$$\mathrm{MSE}_\theta(S^2)=E_\theta\big(S^2-\sigma^2\big)^2=\mathrm{var}(S^2)=\frac{2(n-1)\sigma^4}{(n-1)^2}=\frac{2\sigma^4}{n-1}.$$
Moreover
$$E_\theta\hat\sigma^2=\frac{n-1}n\sigma^2,\qquad \mathrm{bias}_\theta(\hat\sigma^2)=-\frac1n\sigma^2,$$
so that
$$\mathrm{MSE}_\theta(\hat\sigma^2)=E_\theta\big(\hat\sigma^2-\sigma^2\big)^2=\mathrm{bias}_\theta^2(\hat\sigma^2)+\mathrm{var}_\theta(\hat\sigma^2)=\frac{\sigma^4}{n^2}+\frac{2(n-1)\sigma^4}{n^2}=\frac{(2n-1)\sigma^4}{n^2}.$$
Conclusion: the mean square error of $\hat\sigma^2$ is smaller than the mean square error of $S^2$!
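A simulation sketch of this comparison (illustrative $n$ and number of repetitions, not from the text):

```python
# Simulation: for normal data the biased variance estimator sigma-hat^2
# has smaller mean square error than the unbiased S^2.
import random

random.seed(2)
n, mu, sigma2, reps = 10, 0.0, 1.0, 5000
mse_S2 = mse_hat = 0.0
for _ in range(reps):
    x = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    mse_S2 += (ss / (n - 1) - sigma2) ** 2   # unbiased estimator
    mse_hat += (ss / n - sigma2) ** 2        # biased estimator
mse_S2 /= reps
mse_hat /= reps

# Theory: MSE(S^2) = 2 sigma^4/(n-1) = 2/9, MSE(sigma-hat^2) = (2n-1)/n^2 = 0.19
assert mse_hat < mse_S2
```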
Generally, it is not possible to construct an estimator that possesses all desirable properties. We therefore fix one property: unbiasedness (despite its disadvantages), and look for good estimators among the unbiased ones.
Definition 5.2.2 An unbiased estimator $T^*$ is called UMVU (Uniform Minimum Variance Unbiased) if for any other unbiased estimator $T$,
$$\mathrm{var}_\theta(T^*)\le\mathrm{var}_\theta(T),\quad\forall\,\theta.$$
Suppose that $T$ is unbiased, and that $S$ is sufficient. Let
$$T^*:=E(T|S).$$
The distribution of $T$ given $S$ does not depend on $\theta$, so $T^*$ is also an estimator. Moreover, it is unbiased: by the iterated expectations lemma
$$E_\theta T^*=E_\theta\big(E(T|S)\big)=E_\theta T=g(\theta).$$
By conditioning on $S$, "superfluous" variance in the sample is killed. Indeed, the following lemma (which is a general property of conditional distributions) shows that $T^*$ cannot have larger variance than $T$:
$$\mathrm{var}_\theta(T^*)\le\mathrm{var}_\theta(T),\quad\forall\,\theta.$$
¹ If $Y$ has a $\Gamma(k,\lambda)$-distribution, then $EY=k/\lambda$ and $\mathrm{var}(Y)=k/\lambda^2$.

Lemma 5.2.2 Let $Y$ and $Z$ be two random variables. Then
$$\mathrm{var}(Y)=\mathrm{var}\big(E(Y|Z)\big)+E\,\mathrm{var}(Y|Z).$$
Proof. It holds that
$$\mathrm{var}\big(E(Y|Z)\big)=E\big(E(Y|Z)-E(E(Y|Z))\big)^2=E[E(Y|Z)]^2-[EY]^2,$$
and
$$E\,\mathrm{var}(Y|Z)=E\Big(E(Y^2|Z)-[E(Y|Z)]^2\Big)=EY^2-E[E(Y|Z)]^2.$$
Hence, when adding up, the term $E[E(Y|Z)]^2$ cancels out, and what is left over is exactly the variance
$$\mathrm{var}(Y)=EY^2-[EY]^2.\qquad\Box$$

5.3 The Lehmann-Scheffé Lemma

The question arises: can we construct an unbiased estimator with even smaller variance than $T^*=E(T|S)$? Note that $T^*$ depends on $X$ only via $S=S(X)$, i.e., it depends only on the sufficient statistic. In our search for UMVU estimators, we may restrict our attention to estimators depending only on $S$. Thus, if there is only one unbiased estimator depending only on $S$, it has to be UMVU.
Definition 5.3.1 A statistic $S$ is called complete if we have the following implication:
$$E_\theta h(S)=0\ \forall\,\theta\quad\Rightarrow\quad h(S)=0,\ P_\theta\text{-a.s.},\ \forall\,\theta.$$
Here, $h$ is a function of $S$ not depending on $\theta$.
Lemma 5.3.1 (Lehmann-Scheffé) Let $T$ be an unbiased estimator of $g(\theta)$, with, for all $\theta$, finite variance. Moreover, let $S$ be sufficient and complete. Then $T^*:=E(T|S)$ is UMVU.
Proof. We already noted that $T^*=T^*(S)$ is unbiased and that $\mathrm{var}_\theta(T^*)\le\mathrm{var}_\theta(T)$ $\forall\,\theta$. If $T'(S)$ is another unbiased estimator of $g(\theta)$, we have
$$E_\theta\big(T^*(S)-T'(S)\big)=0,\quad\forall\,\theta.$$
Because $S$ is complete, this implies
$$T^*=T',\quad P_\theta\text{-a.s.}\qquad\Box$$
To check whether a statistic is complete, one often needs somewhat sophisticated tools from analysis/integration theory. In the next two examples, we only sketch the proofs of completeness.

Example 5.3.1 UMVU estimator in the Poisson case
Let $X_1,\dots,X_n$ be i.i.d. Poisson$(\theta)$-distributed. We want to estimate $g(\theta):=e^{-\theta}$, the probability of early failure. An unbiased estimator is
$$T(X_1,\dots,X_n):={\bf 1}\{X_1=0\}.$$
A sufficient statistic is
$$S:=\sum_{i=1}^n X_i.$$
We now check whether $S$ is complete. Its distribution is the Poisson$(n\theta)$-distribution. We therefore have for any function $h$,
$$E_\theta h(S)=\sum_{k=0}^\infty e^{-n\theta}\frac{(n\theta)^k}{k!}h(k).$$
The equation
$$E_\theta h(S)=0\quad\forall\,\theta$$
thus implies
$$\sum_{k=0}^\infty\frac{(n\theta)^k}{k!}h(k)=0\quad\forall\,\theta.$$
Let $f$ be a function with Taylor expansion at zero. Then
$$f(x)=\sum_{k=0}^\infty\frac{x^k}{k!}f^{(k)}(0).$$
The left hand side can only be zero for all $x$ if $f\equiv 0$, in which case also $f^{(k)}(0)=0$ for all $k$. Thus ($h(k)$ takes the role of $f^{(k)}(0)$ and $n\theta$ the role of $x$), we conclude that $h(k)=0$ for all $k$, i.e., that $S$ is complete.
So we know from the Lehmann-Scheffé Lemma that $T^*:=E(T|S)$ is UMVU. Let us now calculate $T^*$. First,
$$P(T=1|S=s)=P(X_1=0|S=s)=\frac{e^{-\theta}\,e^{-(n-1)\theta}[(n-1)\theta]^s/s!}{e^{-n\theta}(n\theta)^s/s!}=\Big(\frac{n-1}n\Big)^s.$$
Hence
$$T^*=\Big(\frac{n-1}n\Big)^S$$
is UMVU.
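Unbiasedness of $T^*$ can be verified numerically from the Poisson$(n\theta)$ distribution of $S$ (arbitrary $n$ and $\theta$ below): indeed $E_\theta a^S=e^{n\theta(a-1)}=e^{-\theta}$ for $a=(n-1)/n$.

```python
# Exact check that T* = ((n-1)/n)^S is unbiased for e^{-theta}
# when S ~ Poisson(n * theta).
from math import exp, factorial

n, theta = 4, 0.8  # arbitrary illustrative values
a = (n - 1) / n
E_Tstar = sum(a ** s * exp(-n * theta) * (n * theta) ** s / factorial(s)
              for s in range(120))  # truncated expectation over S
assert abs(E_Tstar - exp(-theta)) < 1e-12
```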
Example 5.3.2 UMVU estimation for the uniform distribution
Let $X_1,\dots,X_n$ be i.i.d. Uniform$[0,\theta]$-distributed, and $g(\theta):=\theta$. We know that $S:=\max\{X_1,\dots,X_n\}$ is sufficient (see Example 4.2.1). The distribution function of $S$ is
$$F_S(s)=P_\theta(\max\{X_1,\dots,X_n\}\le s)=\Big(\frac s\theta\Big)^n,\quad 0\le s\le\theta.$$
Its density is thus
$$f_S(s)=\frac{ns^{n-1}}{\theta^n},\quad 0\le s\le\theta.$$
Hence, for any (measurable) function $h$,
$$E_\theta h(S)=\int_0^\theta h(s)\frac{ns^{n-1}}{\theta^n}\,ds.$$
If
$$E_\theta h(S)=0\quad\forall\,\theta,$$
it must hold that
$$\int_0^\theta h(s)s^{n-1}\,ds=0\quad\forall\,\theta.$$
Differentiating w.r.t. $\theta$ gives
$$h(\theta)\theta^{n-1}=0\quad\forall\,\theta,$$
which implies $h\equiv 0$. So $S$ is complete.
It remains to find a statistic $T^*$ that depends only on $S$ and that is unbiased. We have
$$E_\theta S=\int_0^\theta s\,\frac{ns^{n-1}}{\theta^n}\,ds=\frac{n}{n+1}\theta.$$
So $S$ itself is not unbiased, it is too small. But this can be easily repaired: take
$$T^*=\frac{n+1}n S.$$
Then, by the Lehmann-Scheffé Lemma, $T^*$ is UMVU.
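A simulation sketch (with arbitrary $n$ and $\theta$) of the unbiasedness of $T^*$:

```python
# Simulation: T* = (n+1)/n * max(X_i) has mean (approximately) theta
# under Uniform[0, theta].
import random

random.seed(3)
n, theta, reps = 5, 2.0, 20000
total = 0.0
for _ in range(reps):
    total += (n + 1) / n * max(random.uniform(0, theta) for _ in range(n))
mean_Tstar = total / reps
assert abs(mean_Tstar - theta) < 0.02
```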

5.4 Completeness for exponential families

In the case of an exponential family, completeness holds for a sufficient statistic if the parameter space is "of the same dimension" as the sufficient statistic. This is stated more formally in the following lemma. We omit the proof.
Lemma 5.4.1 Let, for $\theta\in\Theta$,
$$p_\theta(x)=\exp\Big(\sum_{j=1}^k c_j(\theta)T_j(x)-d(\theta)\Big)h(x).$$
Consider the set
$$\mathcal C:=\{(c_1(\theta),\dots,c_k(\theta)):\theta\in\Theta\}\subset\mathbb R^k.$$
Suppose that $\mathcal C$ is truly $k$-dimensional (that is, not of dimension smaller than $k$), i.e., it contains an open ball in $\mathbb R^k$ (or an open cube $\prod_{j=1}^k(a_j,b_j)$). Then $S:=(T_1,\dots,T_k)$ is complete.

Example 5.4.1 Completeness for the Gamma distribution
Let $X_1,\dots,X_n$ be i.i.d. with the $\Gamma(k,\lambda)$-distribution. Both $k$ and $\lambda$ are assumed to be unknown, so that $\theta:=(k,\lambda)$. We moreover let $\Theta:=\mathbb R_+^2$. The density $f$ of the $\Gamma(k,\lambda)$-distribution is
$$f(z)=\frac{\lambda^k}{\Gamma(k)}e^{-\lambda z}z^{k-1},\quad z>0.$$
Hence,
$$p_\theta(x)=\exp\Big(-\lambda\sum_{i=1}^n x_i+(k-1)\sum_{i=1}^n\log x_i-d(\theta)\Big)h(x),$$
where
$$d(k,\lambda)=-nk\log\lambda+n\log\Gamma(k),$$
and
$$h(x)={\bf 1}\{x_i>0,\ i=1,\dots,n\}.$$
It follows that
$$\Big(\sum_{i=1}^n X_i,\ \sum_{i=1}^n\log X_i\Big)$$
is sufficient and complete.

Example 5.4.2 Completeness for two normal samples
Consider two independent samples from normal distributions: $X_1,\dots,X_n$ i.i.d. $N(\mu,\sigma^2)$-distributed and $Y_1,\dots,Y_m$ i.i.d. $N(\nu,\tau^2)$-distributed.
Case i If $\theta=(\mu,\nu,\sigma^2,\tau^2)\in\mathbb R^2\times\mathbb R_+^2$, one can easily check that
$$S:=\Big(\sum_{i=1}^n X_i,\ \sum_{i=1}^n X_i^2,\ \sum_{j=1}^m Y_j,\ \sum_{j=1}^m Y_j^2\Big)$$
is sufficient and complete.
Case ii If $\mu$, $\sigma^2$ and $\tau^2$ are unknown, and $\nu=\mu$, then $S$ of course remains sufficient. One can however show that $S$ is not complete. Difficult question: does a sufficient and complete statistic exist?

5.5 The Cramér Rao lower bound

Let {Pθ : θ ∈ Θ} be a collection of distributions on X, dominated by a σ-finite
measure ν. We denote the densities by

pθ := dPθ/dν, θ ∈ Θ.

In this section, we assume that Θ is a one-dimensional open interval (the ex-
tension to a higher-dimensional parameter space will be handled in the next
section).
We will impose the following two conditions:
Condition I The set
A := {x : pθ (x) > 0}
does not depend on θ.
Condition II (Differentiability in L2) For all θ and for a function sθ : X → R
satisfying

I(θ) := Eθ sθ²(X) < ∞,

it holds that

lim_{h→0} Eθ [ (pθ+h(X) − pθ(X)) / (h pθ(X)) − sθ(X) ]² = 0.

Definition 5.5.1 If I and II hold, we call sθ the score function, and I(θ) the
Fisher information.
Comparing the above with Definition 4.7.1, we see that they coincide if θ 7→ pθ is
differentiable and “regularity conditions” hold. Recall also that in Lemma 4.7.1
we assumed unspecified “regularity conditions”. We now present a rigorous
proof of the first part of this lemma.
Lemma 5.5.1 Assume Conditions I and II. Then

Eθ sθ (X) = 0, ∀ θ.

Proof. Under Pθ , we only need to consider values x with pθ (x) > 0, that is,
we may freely divide by pθ , without worrying about dividing by zero.
Observe that

Eθ [ (pθ+h(X) − pθ(X)) / pθ(X) ] = ∫_A (pθ+h − pθ) dν = 0,

since densities integrate to 1, and both pθ+h and pθ vanish outside A. Thus,

|Eθ sθ(X)|² = | Eθ [ (pθ+h(X) − pθ(X)) / (h pθ(X)) − sθ(X) ] |²

≤ Eθ [ (pθ+h(X) − pθ(X)) / (h pθ(X)) − sθ(X) ]² → 0.

□
Note Thus I(θ) = varθ (sθ (X)).
Remark As already noted, if pθ (x) is differentiable for all x, we can take (under
regularity conditions)
sθ(x) := (d/dθ) log pθ(x) = ṗθ(x)/pθ(x),

where

ṗθ(x) := (d/dθ) pθ(x).

Remark Suppose X1 , . . . , Xn are i.i.d. with density pθ , and sθ = ṗθ /pθ exists.
The joint density is

pθ(x) = ∏_{i=1}^n pθ(xi),

so that (under conditions I and II) the score function for n observations is
sθ(x) = Σ_{i=1}^n sθ(xi).

The Fisher information for n observations is thus

varθ(sθ(X)) = Σ_{i=1}^n varθ(sθ(Xi)) = n I(θ),

i.e., n times the Fisher information for one observation.

Theorem 5.5.1 (The Cramér-Rao lower bound) Suppose Conditions I and II
are met, and that T is an unbiased estimator of g(θ) with finite variance. Then
g(θ) has a derivative, ġ(θ) := dg(θ)/dθ, equal to

ġ(θ) = covθ(T, sθ(X)).

Moreover,

varθ(T) ≥ ġ²(θ)/I(θ), ∀ θ.

Proof. We first show differentiability of g. As T is unbiased, we have

(g(θ + h) − g(θ))/h = (Eθ+h T(X) − Eθ T(X))/h

= (1/h) ∫ T (pθ+h − pθ) dν = Eθ [ T(X) (pθ+h(X) − pθ(X)) / (h pθ(X)) ]

= Eθ [ T(X) ( (pθ+h(X) − pθ(X)) / (h pθ(X)) − sθ(X) ) ] + Eθ T(X)sθ(X)

= Eθ [ (T(X) − g(θ)) ( (pθ+h(X) − pθ(X)) / (h pθ(X)) − sθ(X) ) ] + Eθ T(X)sθ(X)

→ Eθ T(X)sθ(X),

as h → 0. This is because, by the Cauchy-Schwarz inequality,


| Eθ [ (T(X) − g(θ)) ( (pθ+h(X) − pθ(X)) / (h pθ(X)) − sθ(X) ) ] |²

≤ varθ(T) Eθ [ (pθ+h(X) − pθ(X)) / (h pθ(X)) − sθ(X) ]²

→ 0.

Thus,

ġ(θ) = Eθ T(X)sθ(X) = covθ(T, sθ(X)).

The last equality holds because Eθ sθ(X) = 0. By Cauchy-Schwarz,

ġ²(θ) = [covθ(T, sθ(X))]² ≤ varθ(T) varθ(sθ(X)) = varθ(T) I(θ).

□
Definition 5.5.2 We call ġ²(θ)/I(θ), θ ∈ Θ, the Cramér-Rao lower bound
(CRLB) (for estimating g(θ)).
Example 5.5.1 CRLB for exponential case
Let X1 , . . . , Xn be i.i.d. Exponential(θ), θ > 0. The density of a single obser-
vation is then
pθ (x) = θe−θx , x > 0.
Let g(θ) := 1/θ, and T := X̄. Then T is unbiased, and varθ (T ) = 1/(nθ2 ). We
now compute the CRLB. With g(θ) = 1/θ, one has ġ(θ) = −1/θ2 . Moreover,

log pθ (x) = log θ − θx,

so
sθ (x) = 1/θ − x,
and hence

I(θ) = varθ(X) = 1/θ².

The CRLB for n observations is thus

ġ²(θ)/(n I(θ)) = 1/(nθ²).

In other words, T reaches the CRLB.
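A small simulation sketch (with illustrative parameters, not from the text) confirms this: the Monte Carlo variance of X̄ over repeated Exponential(θ) samples matches the bound 1/(nθ²):

```python
import random

# Sketch: simulate X-bar for Exponential(theta) samples and compare its
# variance with the CRLB 1/(n*theta^2), which X-bar attains exactly.
random.seed(1)
theta, n, reps = 2.0, 10, 20_000   # illustrative values

means = []
for _ in range(reps):
    sample = [random.expovariate(theta) for _ in range(n)]
    means.append(sum(sample) / n)

m = sum(means) / reps
var_hat = sum((x - m) ** 2 for x in means) / (reps - 1)
crlb = 1.0 / (n * theta ** 2)
print(var_hat, crlb)
```

The two printed numbers agree up to Monte Carlo error.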


Example 5.5.2 CRLB for Poisson case Suppose X1 , . . . , Xn are i.i.d.
Poisson(θ), θ > 0. Then

log pθ (x) = −θ + x log θ − log x!,



so

sθ(x) = −1 + x/θ,

and hence

I(θ) = varθ(X/θ) = varθ(X)/θ² = 1/θ.
One easily checks that X̄ reaches the CRLB for estimating θ.
Let now g(θ) := e−θ . The UMVU estimator of g(θ) is
T := (1 − 1/n)^{Σ_{i=1}^n Xi}.
To compute its variance, we first compute

Eθ T² = Σ_{k=0}^∞ (1 − 1/n)^{2k} ((nθ)^k / k!) e^{−nθ}

= e^{−nθ} Σ_{k=0}^∞ (1/k!) ((n − 1)²θ/n)^k

= e^{−nθ} exp( (n − 1)²θ/n ) = exp( (1 − 2n)θ/n ).
Thus,

varθ(T) = Eθ T² − [Eθ T]² = Eθ T² − e^{−2θ}

= e^{−2θ} ( e^{θ/n} − 1 )

> θ e^{−2θ}/n

≈ θ e^{−2θ}/n for n large.

As ġ(θ) = −e^{−θ}, the CRLB is

ġ²(θ)/(n I(θ)) = θ e^{−2θ}/n.
We conclude that T does not reach the CRLB, but the gap is small for n large.
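The series computation above can be checked numerically; the following sketch (with illustrative θ and n) truncates the Poisson series to verify that Eθ T = e^{−θ} and that varθ(T) = e^{−2θ}(e^{θ/n} − 1) lies just above the CRLB:

```python
import math

# Sketch: exact (truncated-series) check that T = (1 - 1/n)^{sum X_i} is
# unbiased for exp(-theta) under Poisson(theta) sampling, and that its
# variance exceeds the CRLB theta*exp(-2*theta)/n by a vanishing margin.
theta, n, K = 1.5, 8, 100   # illustrative values; K series terms suffice here

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

# sum X_i has the Poisson(n*theta) distribution
ET = sum((1 - 1 / n) ** k * poisson_pmf(k, n * theta) for k in range(K))
ET2 = sum((1 - 1 / n) ** (2 * k) * poisson_pmf(k, n * theta) for k in range(K))

var_T = ET2 - ET ** 2
crlb = theta * math.exp(-2 * theta) / n
print(ET, math.exp(-theta))   # unbiasedness
print(var_T, crlb)            # var_T slightly above the CRLB
```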

5.6 CRLB and exponential families

For the next result, we recall the following. Let X and Y be two real-valued
random variables. The correlation between X and Y is

ρ(X, Y) := cov(X, Y) / √(var(X) var(Y)).

We have

|ρ(X, Y)| = 1 ⇔ ∃ constants a, b : Y = aX + b (a.s.).



The next lemma shows that the CRLB can only be reached within exponential
families, and is thus tight only in a rather limited context.
Lemma 5.6.1 Assume Conditions I and II, with sθ = ṗθ/pθ. Suppose T is
unbiased for g(θ), and that T reaches the Cramér-Rao lower bound. Then
{Pθ : θ ∈ Θ} forms a one-dimensional exponential family: there exist functions
c(θ), d(θ), and h(x) such that for all θ,

pθ(x) = exp[c(θ)T(x) − d(θ)]h(x), x ∈ X.

Moreover, c(θ) and d(θ) are differentiable, say with derivatives ċ(θ) and ḋ(θ)
respectively. We furthermore have the equality

g(θ) = ḋ(θ)/ċ(θ), ∀ θ.

Proof. By Theorem 5.5.1, when T reaches the CRLB, we must have

varθ(T) = |covθ(T, sθ(X))|² / varθ(sθ(X)),

i.e., then the correlation between T and sθ (X) is ±1. Thus, there exist constants
a(θ) and b(θ) (depending on θ), such that

sθ (X) = a(θ)T (X) − b(θ). (5.1)

But, as sθ = ṗθ/pθ = d log pθ/dθ, we can take primitives:

log pθ(x) = c(θ)T(x) − d(θ) + h̃(x),

where ċ(θ) = a(θ), ḋ(θ) = b(θ) and h̃(x) is constant in θ. Hence,

pθ(x) = exp[c(θ)T(x) − d(θ)]h(x),

with h(x) = exp[h̃(x)].


Moreover, equation (5.1) tells us that

Eθ sθ(X) = a(θ)Eθ T − b(θ) = a(θ)g(θ) − b(θ).

Because Eθ sθ(X) = 0, this implies that g(θ) = b(θ)/a(θ). □

5.7 Higher-dimensional extensions

Expectations and covariance matrices of random vectors


Let Z ∈ R^k be a k-dimensional random vector. Then EZ is a k-dimensional
vector, and

Σ := Cov(Z) := E ZZ′ − (EZ)(EZ)′

is a k × k matrix containing all variances (on the diagonal) and covariances
(off-diagonal). Note that Σ is positive semi-definite: for any vector a ∈ R^k, we
have

var(a′Z) = a′Σa ≥ 0.

Some matrix algebra


Let V be a symmetric matrix. If V is positive (semi-)definite, we write this
as V > 0 (V ≥ 0). One then has that V = W 2 , where W is also positive
(semi-)definite.
Auxiliary lemma. Suppose V > 0. Then

max_{a∈R^p} |a′c|² / (a′V a) = c′V^{−1}c.

Proof. Write V = W², and b := Wa, d := W^{−1}c. Then a′V a = b′b = ‖b‖² and
a′c = b′d. By Cauchy-Schwarz,

max_{b∈R^p} |b′d|² / ‖b‖² = ‖d‖² = d′d = c′V^{−1}c.

□
We will now present the CRLB in higher dimensions. To simplify the exposition,
we will not carefully formulate the regularity conditions, that is, we assume
derivatives to exist and that we can interchange differentiation and integration
at suitable places.

Consider a parameter space Θ ⊂ Rk . Let

g : Θ → R,

be a given function. Denote the vector of partial derivatives as

∂g(θ)/∂θ1
 

ġ(θ) :=  .. .
.
∂g(θ)/∂θk

The score vector is defined as

sθ(·) := ( ∂ log pθ/∂θ1, . . . , ∂ log pθ/∂θk )′.

The Fisher information matrix is

I(θ) = Eθ sθ(X)sθ′(X) = Covθ(sθ(X)).



Theorem 5.7.1 Let T be an unbiased estimator of g(θ). Then, under regularity
conditions,

varθ(T) ≥ ġ(θ)′ I(θ)^{−1} ġ(θ).

Proof. As in the one-dimensional case, one can show that, for j = 1, . . . , k,

ġj (θ) = covθ (T, sθ,j (X)).

Hence, for all a ∈ R^k,

|a′ġ(θ)|² = |covθ(T, a′sθ(X))|²

≤ varθ(T) varθ(a′sθ(X))

= varθ(T) a′I(θ)a.

Combining this with the auxiliary lemma gives

varθ(T) ≥ max_{a∈R^k} |a′ġ(θ)|² / (a′I(θ)a) = ġ(θ)′I(θ)^{−1}ġ(θ).

□
Corollary 5.7.1 As a consequence, one obtains a lower bound for unbiased
estimators of higher-dimensional parameters of interest. As an example, let g(θ) :=
θ = (θ1, . . . , θk)′, and suppose that T ∈ R^k is an unbiased estimator of θ:
Eθ T = θ ∀ θ. Then, for all a ∈ R^k, a′T is an unbiased estimator of a′θ. Since
a′θ has derivative a, the CRLB gives

varθ(a′T) ≥ a′I(θ)^{−1}a.

But

varθ(a′T) = a′ Covθ(T) a.

So for all a,

a′ Covθ(T) a ≥ a′I(θ)^{−1}a,

in other words, Covθ(T) ≥ I(θ)^{−1}, that is, Covθ(T) − I(θ)^{−1} is positive semi-definite.
Chapter 6

Tests and confidence intervals

6.1 Intermezzo: quantile functions

Let F be a distribution function on R. Then F is cadlag (continue à droite,
limite à gauche). Define the quantile functions

q^F_sup(u) := sup{x : F(x) ≤ u},

and

q^F_inf(u) := inf{x : F(x) ≥ u} =: F^{−1}(u).

It holds that

F(q^F_inf(u)) ≥ u

and, for all h > 0,

F(q^F_sup(u) − h) ≤ u.

Hence

F(q^F_sup(u)−) := lim_{h↓0} F(q^F_sup(u) − h) ≤ u.
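For a concrete illustration, the two quantile functions can be evaluated for an empirical distribution function. The implementation below is a sketch (function names and data are ours), restricted to values of u for which the quantiles fall among the data points:

```python
# Sketch of the two quantile functions for an empirical distribution
# function F: q_inf(u) = inf{x : F(x) >= u} and q_sup(u) = sup{x : F(x) <= u}.
data = sorted([3, 1, 4, 1, 5, 9, 2, 6])   # illustrative sample
n = len(data)

def F(x):
    return sum(1 for z in data if z <= x) / n

def q_inf(u):
    # smallest data point x with F(x) >= u
    return next(x for x in data if F(x) >= u)

def q_sup(u):
    # for a step function, sup{x : F(x) <= u} is the first jump point
    # where F exceeds u (the sup need not belong to the set itself)
    for x in data:
        if F(x) > u:
            return x
    return data[-1]

print(q_inf(0.5), q_sup(0.5))
```

With this sample, F(q_inf(0.5)) ≥ 0.5 and F(q_sup(0.5) −) ≤ 0.5, as the text asserts.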

6.2 How to construct tests

Consider a model class


P := {Pθ : θ ∈ Θ}.
Moreover, consider a space Γ, and a map
g : Θ → Γ, g(θ) := γ.
We think of γ as the parameter of interest.
Definition 6.2.1 Let γ0 ∈ Γ and α ∈ [0, 1] be given. A (non-randomized) test
at level α for the hypothesis
H0 : γ = γ0
is a statistic φ(X, γ0) ∈ {0, 1} such that Pθ(φ(X, γ0) = 1) ≤ α for all θ ∈ {ϑ :
g(ϑ) = γ0}.


We often omit the dependence of φ on γ0 , i.e., we write φ(X) := φ(X, γ0 ).


Note Typically a test φ is based on a test statistic T, i.e. it is of the form

φ(X) = 1 if T(X) > c,
       0 else.

The constant c is called the critical value.
To test Hγ0 : γ = γ0 ,
we look for a pivot (Tür-Angel). This is a function Z(X, γ) depending on the
data X and on the parameter γ, such that for all θ ∈ Θ, the distribution

IPθ (Z(X, g(θ)) ≤ ·) =: G(·)

does not depend on θ. We note that to find a pivot is unfortunately not always
possible. However, if we do have a pivot Z(X, γ) with distribution G, we can
compute its quantile functions
qL := q^G_sup(α/2), qR := q^G_inf(1 − α/2),

and the test

φ(X, γ0) := 1 if Z(X, γ0) ∉ [qL, qR],
            0 else.
Then the test has level α for testing Hγ0, with γ0 = g(θ0):

IPθ0(φ(X, g(θ0)) = 1) = IPθ0(Z(X, g(θ0)) > qR) + IPθ0(Z(X, g(θ0)) < qL)

= 1 − G(qR) + G(qL−) ≤ 1 − (1 − α/2) + α/2 = α.

Asymptotic pivot Let Zn (X1 , . . . , Xn , γ) be some function of the data and the
parameter of interest, defined for each sample size n. We call Zn (X1 , . . . , Xn , γ)
an asymptotic pivot if for all θ ∈ Θ,

lim IPθ (Zn (X1 , . . . , Xn , γ) ≤ ·) = G(·),


n→∞

at all continuity points of G, where the limit G does not depend on θ.


Example 6.2.1 The location model
As example, consider again the location model (Section 1.3). Let

Θ := {θ = (µ, F0 ), µ ∈ R, F0 ∈ F0 },

with F0 a subset of the collection of symmetric distributions (see (1.2)). Let µ̂


be an equivariant estimator, that is: the distribution of µ̂ − µ does not depend
on µ (see Chapter 9 for the formal definition of equivariance).
• If F0 := {F0 } is a single distribution (i.e., the distribution F0 is known), we
take Z(X, µ) := µ̂ − µ as pivot. By the equivariance, this pivot has distribution
G depending only on F0 .

• If F0 := {F0(·) = Φ(·/σ) : σ > 0}, we choose µ̂ := X̄n, where X̄n =
Σ_{i=1}^n Xi/n is the sample mean. As pivot, we take

Z(X, µ) := √n (X̄n − µ)/Sn,

where Sn² = Σ_{i=1}^n (Xi − X̄n)²/(n − 1) is the sample variance. Then G is the
Student distribution with n − 1 degrees of freedom.
• If F0 := {F0 symmetric and continuous at x = 0}, we let the pivot be the sign
test statistic:

Z(X, µ) := Σ_{i=1}^n 1{Xi ≥ µ}.

Then G is the Binomial(n, p) distribution, with parameter p = 1/2.


• Suppose now that X1, . . . , Xn are the first n of an infinite sequence of i.i.d.
random variables, and that

F0 := {F0 : ∫ x dF0(x) = 0, ∫ x² dF0(x) < ∞}.

Then

Zn(X1, . . . , Xn, µ) := √n (X̄n − µ)/Sn

is an asymptotic pivot, with limiting distribution G = Φ.
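A simulation sketch (illustrative, with shifted Exponential(1) data so that the mean is µ) shows the asymptotic pivot at work: the interval X̄n ± Φ^{−1}(0.975) Sn/√n covers µ close to 95% of the time even though the data are far from normal:

```python
import random
import statistics

# Sketch: Z_n = sqrt(n)(Xbar - mu)/S_n is an approximate pivot even for
# skewed data; check the coverage of the normal-approximation interval.
random.seed(42)
mu, n, reps = 1.0, 50, 4000      # illustrative values
hits = 0
for _ in range(reps):
    # Exponential(1) data shifted to have mean mu
    x = [mu - 1.0 + random.expovariate(1.0) for _ in range(n)]
    xbar = sum(x) / n
    s = statistics.stdev(x)
    if abs(xbar - mu) <= 1.96 * s / n ** 0.5:   # 1.96 ~ Phi^{-1}(0.975)
        hits += 1
coverage = hits / reps
print(coverage)
```

The coverage is slightly below the nominal 95% at this moderate n, reflecting the asymptotic nature of the pivot.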

6.3 Equivalence confidence sets and tests

Definition 6.3.1 A subset I = I(X) ⊂ Γ, depending (only) on the data X =


(X1 , . . . , Xn ), is called a confidence set (Vertrauensbereich) for γ, at level 1−α,
if
IPθ (γ ∈ I) ≥ 1 − α, ∀ θ ∈ Θ.
A confidence interval is of the form

I := [γ, γ̄],

where the boundaries γ = γ(X) and γ̄ = γ̄(X) depend (only) on the data X.
Let for each γ0 ∈ R, φ(X, γ0 ) ∈ {0, 1} be a test at level α for the hypothesis
Hγ0 : γ = γ0 .
Thus, we reject Hγ0 if and only if φ(X, γ0 ) = 1, and

IPθ:γ=γ0 (φ(X, γ0 ) = 1) ≤ α.

Then
I(X) := {γ : φ(X, γ) = 0}
is a (1 − α)-confidence set for γ.

Conversely, if I(X) is a (1 − α)-confidence set for γ, then, for all γ0, the test
φ(X, γ0) defined as

φ(X, γ0) = 1 if γ0 ∉ I(X),
           0 else
is a test at level α of Hγ0 .

6.4 Comparison of confidence intervals and tests

When comparing confidence intervals, the aim is usually to take the one with
smallest length on average (keeping the level at 1 − α). In the case of tests,
we look for the one with maximal power. Recall that the power of a test
φ(X, γ0) at a value θ with g(θ) ≠ γ0 is Pθ(φ(X, γ0) = 1).

6.5 An illustration: the two-sample problem

Consider the following data, concerning weight gain/loss. The control group x
had their usual diet, and the treatment group y obtained a special diet, designed
for preventing weight gain. The study was carried out to test whether the diet
works.
control group x   treatment group y   rank(x)   rank(y)
       5                  6              7         8
       0                 -5              3         2
      16                 -6             10         1
       2                  1              5         4
       9                  4              9         6
sum:  32          sum:    0

Table 2
Let n (m) be the sample size of the control group x (treatment group y). The
mean in group x (y) is denoted by x̄ (ȳ). The sums of squares are SSx :=
Σ_{i=1}^n (xi − x̄)² and SSy := Σ_{j=1}^m (yj − ȳ)². So in this study, one has n = m = 5
and the values x̄ = 6.4, ȳ = 0, SSx = 161.2 and SSy = 114. The ranks, rank(x)
and rank(y), are the rank-numbers when putting all n + m data together (e.g.,
y3 = −6 is the smallest observation and hence rank(y3) = 1).
We assume that the data are realizations of two independent samples, say
X = (X1 , . . . , Xn ) and Y = (Y1 , . . . , Ym ), where X1 , . . . , Xn are i.i.d. with
distribution function FX , and Y1 , . . . , Ym are i.i.d. with distribution function
FY . The distribution functions FX and FY may be in whole or in part unknown.
The testing problem is:

H0 : FX = FY
against a one- or two-sided alternative.

6.5.1 Student’s test

The classical two-sample Student test is based on the assumption that the data
come from a normal distribution. Moreover, it is assumed that the variances of
FX and FY are equal. Thus,

(FX, FY) ∈ { FX = Φ((· − µ)/σ), FY = Φ((· − (µ + γ))/σ) : µ ∈ R, σ > 0, γ ∈ Γ }.

Here, Γ ⊃ {0} is the range of shifts in mean one considers, e.g. Γ = R for
two-sided situations, and Γ = (−∞, 0] for a one-sided situation. The testing
problem reduces to

H0 : γ = 0.
We now look for a pivot Z(X, Y, γ). Define the sample means

X̄ := (1/n) Σ_{i=1}^n Xi, Ȳ := (1/m) Σ_{j=1}^m Yj,

and the pooled sample variance

S² := (1/(m + n − 2)) [ Σ_{i=1}^n (Xi − X̄)² + Σ_{j=1}^m (Yj − Ȳ)² ].

Note that X̄ has expectation µ and variance σ²/n, and Ȳ has expectation µ + γ
and variance σ²/m. So Ȳ − X̄ has expectation γ and variance

σ²/n + σ²/m = σ² (n + m)/(nm).

The normality assumption implies that

Ȳ − X̄ is N( γ, σ² (n + m)/(nm) )-distributed.

Hence

√(nm/(n + m)) (Ȳ − X̄ − γ)/σ is N(0, 1)-distributed.

To arrive at a pivot, we now plug in the estimate S for the unknown σ:

Z(X, Y, γ) := √(nm/(n + m)) (Ȳ − X̄ − γ)/S.

Indeed, Z(X, Y, γ) has a distribution G which does not depend on unknown


parameters. The distribution G is Student(n + m − 2) (the Student-distribution

with n + m − 2 degrees of freedom). As test statistic for H0 : γ = 0, we therefore
take

T = T^{Student} := Z(X, Y, 0).

The one-sided test at level α, for H0 : γ = 0 against H1 : γ < 0, is

φ(X, Y) := 1 if T < −t_{n+m−2}(1 − α),
           0 if T ≥ −t_{n+m−2}(1 − α),

where, for ν > 0, tν(1 − α) = −tν(α) is the (1 − α)-quantile of the Student(ν)-distribution.
Let us apply this test to the data given in Table 2. We take α = 0.05. The
observed values are x̄ = 6.4, ȳ = 0 and s2 = 34.4. The test statistic takes the
value −1.725 which is bigger than the 5% quantile t8 (0.05) = −1.9. Hence, we
cannot reject H0 . The p-value of the observed value of T is

p−value := IPγ=0 (T < −1.725) = 0.06.

So the p-value is in this case only a little larger than the level α = 0.05.
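The computations of this subsection can be reproduced directly from the Table 2 data; the following sketch recovers x̄ = 6.4, ȳ = 0, s² = 34.4 and the observed test statistic −1.725:

```python
# Reproducing the two-sample Student test on the Table 2 data
# (a verification sketch; the numbers should match the text).
x = [5, 0, 16, 2, 9]      # control group
y = [6, -5, -6, 1, 4]     # treatment group
n, m = len(x), len(y)

xbar = sum(x) / n
ybar = sum(y) / m
ssx = sum((v - xbar) ** 2 for v in x)        # SS_x = 161.2
ssy = sum((v - ybar) ** 2 for v in y)        # SS_y = 114
s2 = (ssx + ssy) / (n + m - 2)               # pooled variance

T = ((n * m / (n + m)) ** 0.5) * (ybar - xbar) / s2 ** 0.5
print(xbar, ybar, s2, round(T, 3))
```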

6.5.2 Wilcoxon’s test

In this subsection, we suppose that FX and FY are continuous, but otherwise


unknown. The model class for both FX and FY is thus

F := {all continuous distributions}.

The continuity assumption ensures that all observations are distinct, that is,
there are no ties. We can then put them in strictly increasing order. Let
N = n + m and Z1 , . . . , ZN be the pooled sample

Zi := Xi , i = 1, . . . , n, Zn+j := Yj , j = 1, . . . , m.

Define
Ri := rank(Zi ), i = 1, . . . , N.
and let

Z(1) < · · · < Z(N)

be the order statistics of the pooled sample (so that Zi = Z(Ri), i = 1, . . . , N).
The Wilcoxon test statistic is

T = T^{Wilcoxon} := Σ_{i=1}^n Ri.

One may check that this test statistic T can alternatively be written as

T = #{(i, j) : Yj < Xi} + n(n + 1)/2.

For example, for the data in Table 2, the observed value of T is 34, and

#{(i, j) : yj < xi} = 19, n(n + 1)/2 = 15.
Large values of T mean that the Xi are generally larger than the Yj , and hence
indicate evidence against H0 .
To check whether or not the observed value of the test statistic is compatible
with the null-hypothesis, we need to know its null-distribution, that is, the
distribution under H0 . Under H0 : FX = FY , the vector of ranks (R1 , . . . , Rn )
has the same distribution as n random draws without replacement from the
numbers {1, . . . , N }. That is, if we let

r := (r1, . . . , rn, rn+1, . . . , rN)

denote a permutation of {1, . . . , N}, then

IP_{H0}( (R1, . . . , Rn, Rn+1, . . . , RN) = r ) = 1/N!,

(see Theorem 6.5.1 below), and hence

IP_{H0}(T = t) = #{r : Σ_{i=1}^n ri = t} / N!.
This can also be written as

IP_{H0}(T = t) = #{ r1 < · · · < rn, rn+1 < · · · < rN : Σ_{i=1}^n ri = t } / (N choose n).

So clearly, the null-distribution of T does not depend on FX or FY . It does


however depend on the sample sizes n and m. It is tabulated for n and m
small or moderately large. For large n and m, a normal approximation of the
null-distribution can be used.
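For n = m = 5, the null distribution of T can be enumerated exactly. The sketch below lists all (10 choose 5) = 252 rank assignments and computes the exact right-tail probability of the observed value T = 34 (the tail count and probability are outputs of the enumeration, not taken from the text):

```python
from itertools import combinations

# Enumerating the exact null distribution of the Wilcoxon statistic
# T = sum of the x-ranks for n = m = 5 (N = 10): under H0 the set of
# x-ranks is a uniform draw without replacement from {1, ..., 10}.
n, N = 5, 10
counts = {}
for ranks in combinations(range(1, N + 1), n):
    t = sum(ranks)
    counts[t] = counts.get(t, 0) + 1

total = sum(counts.values())          # = binom(10, 5) = 252
t_obs = 34                            # observed value for the Table 2 data
tail = sum(c for t, c in counts.items() if t >= t_obs)
p_right = tail / total
print(total, tail, p_right)
```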
Theorem 6.5.1 formally derives the null-distribution of the test, and actually
proves that the order statistics and the ranks are independent. The latter result
will be of interest in Example 4.1.4.
For two random variables X and Y, use the notation

X =^D Y

when X and Y have the same distribution.


Theorem 6.5.1 Let Z1 , . . . , ZN be i.i.d. with continuous distribution F on
R. Then (Z(1) , . . . , Z(N ) ) and R := (R1 , . . . , RN ) are independent, and for all
permutations r := (r1, . . . , rN),

IP(R = r) = 1/N!.

Proof. Let Z_{Qi} := Z(i), and Q := (Q1, . . . , QN). Then

R = r ⇔ Q = r^{−1} =: q,

where r^{−1} is the inverse permutation of r.¹ For all permutations q and all
measurable maps f,

f(Z1, . . . , ZN) =^D f(Zq1, . . . , ZqN).

Therefore, for all measurable sets A ⊂ R^N, and all permutations q,

IP( (Z1, . . . , ZN) ∈ A, Z1 < . . . < ZN )

= IP( (Zq1, . . . , ZqN) ∈ A, Zq1 < . . . < ZqN ).

Because there are N! permutations, we see that for any q,

IP( (Z(1), . . . , Z(N)) ∈ A ) = N! IP( (Zq1, . . . , ZqN) ∈ A, Zq1 < . . . < ZqN )

= N! IP( (Z(1), . . . , Z(N)) ∈ A, R = r ),

where r = q^{−1}. Thus we have shown that for all measurable A, and for all r,

IP( (Z(1), . . . , Z(N)) ∈ A, R = r ) = (1/N!) IP( (Z(1), . . . , Z(N)) ∈ A ). (6.1)

Take A = R^N to find that (6.1) implies

IP(R = r) = 1/N!.

Plug this back into (6.1) to see that we have the product structure

IP( (Z(1), . . . , Z(N)) ∈ A, R = r ) = IP( (Z(1), . . . , Z(N)) ∈ A ) IP(R = r),

which holds for all measurable A. In other words, (Z(1), . . . , Z(N)) and R are
independent. □
¹ Here is an example, with N = 3:

(z1, z2, z3) = (5, 6, 4)
(r1, r2, r3) = (2, 3, 1)
(q1, q2, q3) = (3, 1, 2)

6.5.3 Comparison of Student’s test and Wilcoxon’s test

Because Wilcoxon’s test is ony based on the ranks, and does not rely on the
assumption of normality, it lies at hand that, when the data are in fact normally
distributed, Wilcoxon’s test will have less power than Student’s test. The loss
of power is however small. Let us formulate this more precisely, in terms of
the relative efficiency of the two tests. Let the significance α be fixed, and
let β be the required power. Let n and m be equal, N = 2n be the total
sample size, and N Student (N Wilcoxon ) be the number of observations needed to
reach power β using Student’s (Wilcoxon’s) test. Consider shift alternatives,
i.e. FY (·) = FX (· − γ), (with, in our example, γ < 0). One can show that
N Student /N Wilcoxon is approximately .95 when the normal model is correct. For
a large class of distributions, the ratio N Student /N Wilcoxon ranges from .85 to ∞,
that is, when using Wilcoxon one generally has very limited loss of efficiency as
compared to Student, and one may in fact have a substantial gain of efficiency.
Chapter 7

The Neyman Pearson Lemma


and UMP tests

Let P := {Pθ : θ ∈ Θ} be a family of probability measures. Let Θ0 ⊂ Θ,


Θ1 ⊂ Θ, and Θ0 ∩ Θ1 = ∅. Based on observations X ∈ X , with distribution
P ∈ P, we consider the general testing problem for
H0 : θ ∈ Θ0 ,
against
H1 : θ ∈ Θ1 .

A (possibly randomized) test is some function φ : X → [0, 1]. Fix some α ∈
[0, 1]. We say that φ is a test at level α if

sup_{θ∈Θ0} Eθ φ(X) ≤ α.

Definition 7.0.1 A test φ is called Uniformly Most Powerful (UMP; German:
gleichmässig mächtigst) if
• φ has level α,
• for all tests φ′ with level α, it holds that Eθ φ′(X) ≤ Eθ φ(X) ∀ θ ∈ Θ1.

7.1 The Neyman Pearson Lemma

We consider testing
H0 : θ = θ0
against the alternative
H1 : θ = θ1 .

Define the risk R(θ, φ) of a test φ as the probability of error of the first and
second kind:

R(θ, φ) := Eθ φ(X) if θ = θ0,
           1 − Eθ φ(X) if θ = θ1.


We let p0 (p1) be the density of Pθ0 (Pθ1) with respect to some dominating
measure ν (for example ν = Pθ0 + Pθ1). A Neyman Pearson test is

φNP := 1 if p1/p0 > c,
       q if p1/p0 = c,
       0 if p1/p0 < c.

Here 0 ≤ q ≤ 1 and 0 ≤ c < ∞ are given constants.


Lemma 7.1.1 (Neyman Pearson Lemma) Let φ be some test. We have

R(θ1 , φNP ) − R(θ1 , φ) ≤ c[R(θ0 , φ) − R(θ0 , φNP )].

Proof.

R(θ1, φNP) − R(θ1, φ) = ∫ (φ − φNP) p1

= ∫_{p1/p0>c} (φ − φNP) p1 + ∫_{p1/p0=c} (φ − φNP) p1 + ∫_{p1/p0<c} (φ − φNP) p1

≤ c ∫_{p1/p0>c} (φ − φNP) p0 + c ∫_{p1/p0=c} (φ − φNP) p0 + c ∫_{p1/p0<c} (φ − φNP) p0

= c [R(θ0, φ) − R(θ0, φNP)].

□

7.2 Uniformly most powerful tests

7.2.1 An example

Let X1 , . . . , Xn be i.i.d. copies of a Bernoulli random variable X ∈ {0, 1} with


success parameter θ ∈ (0, 1):

Pθ (X = 1) = 1 − Pθ (X = 0) = θ.

We consider three testing problems. The chosen level in all three problems is
α = 0.05.
Problem 1
We want to test, at level α, the hypothesis

H0 : θ = 1/2 =: θ0,

against the alternative

H1 : θ = 1/4 =: θ1.

Let T := Σ_{i=1}^n Xi be the number of successes (T is a sufficient statistic), and
consider the randomized test

φ(T) := 1 if T < t0,
        q if T = t0,
        0 if T > t0,

where q ∈ (0, 1), and where t0 is the critical value of the test. The constants q
and t0 ∈ {0, . . . , n} are chosen in such a way that the probability of rejecting
H0 when it is in fact true, is equal to α:

Pθ0 (H0 rejected) = Pθ0 (T ≤ t0 − 1) + qPθ0 (T = t0 ) := α.

Thus, we take t0 in such a way that

Pθ0(T ≤ t0 − 1) ≤ α, Pθ0(T ≤ t0) > α

(i.e., t0 − 1 = q^G_inf(α), with q^G_inf the quantile function defined in Section 6.1 and
G the distribution function of T), and

q = (α − Pθ0(T ≤ t0 − 1)) / Pθ0(T = t0).

Because φ = φNP is the Neyman Pearson test, it is the most powerful test (at
level α) (see the Neyman Pearson Lemma in Section 7.1). The power of the
test is β(θ1 ), where
β(θ) := Eθ φ(T ).

Numerical Example
Let n = 7. Then

Pθ0(T = 0) = (1/2)^7 = 0.0078,

Pθ0(T = 1) = (7 choose 1) (1/2)^7 = 0.0546,

Pθ0(T ≤ 1) = 0.0624 > α,

so we choose t0 = 1. Moreover,

q = (0.05 − 0.0078)/0.0546 = 422/546.
The power is now

β(θ1 ) = Pθ1 (T = 0) + qPθ1 (T = 1)


 7   6  
3 422 7 3
= + 1/4
4 546 1 4
422
= 0.1335 + 0.3114.
546
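The constants of this example can be recomputed; the sketch below uses exact binomial probabilities, so it returns q = 27/35 ≈ 0.771 rather than the 422/546 ≈ 0.773 obtained from the rounded values in the text, and a power of about 0.374:

```python
from math import comb

# Recomputing the randomized Neyman-Pearson test of Problem 1
# (n = 7, theta_0 = 1/2, theta_1 = 1/4, alpha = 0.05) with exact pmfs.
n, alpha = 7, 0.05
theta0, theta1 = 0.5, 0.25

def binom_pmf(k, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# find t0 with P_{theta0}(T <= t0 - 1) <= alpha < P_{theta0}(T <= t0)
cum, t0 = 0.0, 0
while cum + binom_pmf(t0, theta0) <= alpha:
    cum += binom_pmf(t0, theta0)
    t0 += 1
q = (alpha - cum) / binom_pmf(t0, theta0)

power = sum(binom_pmf(k, theta1) for k in range(t0)) + q * binom_pmf(t0, theta1)
print(t0, round(q, 4), round(power, 4))
```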

Problem 2
Consider now testing

H0 : θ = 1/2,

against

H1 : θ < 1/2.

In Problem 1, the construction of the test φ is independent of the value θ1 < θ0.
So φ is most powerful for all θ1 < θ0. We say that φ is uniformly most powerful
for the alternative H1 : θ < θ0.
Problem 3
We now want to test

H0 : θ ≥ 1/2,

against the alternative

H1 : θ < 1/2.

Recall the function

β(θ) := Eθ φ(T).

The level of φ is defined as

sup_{θ≥1/2} β(θ).

We have

β(θ) = Pθ (T ≤ t0 − 1) + qPθ (T = t0 )
= (1 − q)Pθ (T ≤ t0 − 1) + qPθ (T ≤ t0 ).

Observe that if θ1 < θ0, small values of T are more likely under Pθ1 than under
Pθ0:

Pθ1(T ≤ t) > Pθ0(T ≤ t), ∀ t ∈ {0, 1, . . . , n − 1}.
Thus, β(θ) is a decreasing function of θ. It follows that the level of φ is

sup_{θ≥1/2} β(θ) = β(1/2) = α.

Hence, φ is uniformly most powerful for H0 : θ ≥ 1/2 against H1 : θ < 1/2.

7.3 UMP tests and exponential families

We now study the situation where Θ is an interval in R, and the testing problem
is
H0 : θ ≤ θ0 ,
against
H1 : θ > θ0 .

We suppose that P is dominated by a σ-finite measure ν.



Theorem 7.3.1 Suppose that P is a one-dimensional exponential family

pθ (x) = exp[c(θ)T (x) − d(θ)]h(x).

Assume moreover that c(θ) is a strictly increasing function of θ. Then a UMP
test φ is

φ(T(x)) := 1 if T(x) > t0,
           q if T(x) = t0,
           0 if T(x) < t0,

where q and t0 are chosen in such a way that Eθ0 φ(T) = α.


Proof. The Neyman Pearson test for H0 : θ = θ0 against H1 : θ = θ1 is

φNP(x) = 1 if pθ1(x)/pθ0(x) > c0,
         q0 if pθ1(x)/pθ0(x) = c0,
         0 if pθ1(x)/pθ0(x) < c0,

where q0 and c0 are chosen in such a way that Eθ0 φNP(X) = α. We have

pθ1(x)/pθ0(x) = exp[ (c(θ1) − c(θ0))T(x) − (d(θ1) − d(θ0)) ].

Hence

pθ1(x)/pθ0(x) > c ⇔ T(x) > t (and likewise with > replaced by = or <),

where t is some constant (depending on c, θ0 and θ1). Therefore, φ = φNP. It
follows that φ is most powerful for H0 : θ = θ0 against H1 : θ = θ1. Because φ
does not depend on θ1, it is therefore UMP for H0 : θ = θ0 against H1 : θ > θ0.
We will now prove that β(θ) := Eθ φ(T ) is increasing in θ. Let

p̄θ (t) = exp[c(θ)t − d(θ)]

be the density of T with respect to a dominating measure ν̄. For ϑ > θ,

p̄ϑ(t)/p̄θ(t) = exp[ (c(ϑ) − c(θ))t − (d(ϑ) − d(θ)) ],

which is increasing in t. Moreover, we have

∫ p̄ϑ dν̄ = ∫ p̄θ dν̄ = 1.

Therefore, there must be a point s0 where the two densities cross:

p̄ϑ(t)/p̄θ(t) ≤ 1 for t ≤ s0,
p̄ϑ(t)/p̄θ(t) ≥ 1 for t ≥ s0.

But then

β(ϑ) − β(θ) = ∫ φ(t)[p̄ϑ(t) − p̄θ(t)] dν̄(t)

= ∫_{t≤s0} φ(t)[p̄ϑ(t) − p̄θ(t)] dν̄(t) + ∫_{t≥s0} φ(t)[p̄ϑ(t) − p̄θ(t)] dν̄(t)

≥ φ(s0) ∫ [p̄ϑ(t) − p̄θ(t)] dν̄(t) = 0.

So indeed β(θ) is increasing in θ. But then

sup_{θ≤θ0} β(θ) = β(θ0) = α.

Hence, φ has level α. Because any other test φ′ with level α must have
Eθ0 φ′(X) ≤ α, we conclude that φ is UMP. □
Example 7.3.1 Test for the variance of the normal distribution
Let X1 , . . . , Xn be an i.i.d. sample from the N (µ0 , σ 2 )-distribution, with µ0
known, and σ 2 > 0 unknown. We want to test
H0 : σ 2 ≤ σ02 ,
against
H1 : σ 2 > σ02 .
The density of the sample is
p_{σ²}(x1, . . . , xn) = exp[ −(1/(2σ²)) Σ_{i=1}^n (xi − µ0)² − (n/2) log(2πσ²) ].

Thus, we may take

c(σ²) = −1/(2σ²),

and

T(X) = Σ_{i=1}^n (Xi − µ0)².

The function c(σ²) is strictly increasing in σ². So we let φ be the test which
rejects H0 for large values of T (X). Note that under H0 , the statistic T (X)/σ02
has a χ2 -distribution with n degrees of freedom, the χ2n -distribution (see Section
12.2 for a definition). So we can find the critical value from the quantile of the
χ2n -distribution.

7.4 One- and two-sided tests: an example with the


Bernoulli distribution

Let X1 , . . . , Xn be an i.i.d. sample from the Bernoulli(θ)-distribution, 0 < θ < 1.


Then

pθ(x1, . . . , xn) = exp[ log(θ/(1 − θ)) Σ_{i=1}^n xi + n log(1 − θ) ].

We can take

c(θ) = log(θ/(1 − θ)),

which is strictly increasing in θ. Then T(X) = Σ_{i=1}^n Xi.

Right-sided alternative
H0 : θ ≤ θ0 ,
against
H1 : θ > θ0 .
The UMP test is

φR(T) := 1 if T > tR,
         qR if T = tR,
         0 if T < tR.

The function βR(θ) := Eθ φR(T) is strictly increasing in θ.
Left-sided alternative
H0 : θ ≥ θ0 ,
against H1 : θ < θ0 .
The UMP test is

φL(T) := 1 if T < tL,
         qL if T = tL,
         0 if T > tL.

The function βL(θ) := Eθ φL(T) is strictly decreasing in θ.
Two-sided alternative
H0 : θ = θ0 ,
against
H1 : θ 6= θ0 .
The test φR is most powerful for θ > θ0 , whereas φL is most powerful for θ < θ0 .
Hence, a UMP test does not exist for the two-sided alternative.

7.5 Unbiased tests

Consider again the general case: P := {Pθ : θ ∈ Θ} is a family of probability


measures, the spaces Θ0 , and Θ1 are disjoint subspaces of Θ, and the testing
problem is
H0 : θ ∈ Θ0 ,
against
H1 : θ ∈ Θ1 .
The significance level is α (< 1).
As we have seen in Section 7.4, uniformly most powerful tests do not always
exist. We therefore restrict attention to a smaller class of tests, and look for
uniformly most powerful tests in the smaller class.

Definition 7.5.1 A test φ is called unbiased (German unverfälscht) if for all


θ ∈ Θ0 and all ϑ ∈ Θ1 ,
Eθ φ(X) ≤ Eϑ φ(X).
Definition 7.5.2 A test φ is called Uniformly Most Powerful Unbiased (UMPU)
if
• φ has level α,
• φ is unbiased,
• for all unbiased tests φ′ with level α, one has Eθ φ′(X) ≤ Eθ φ(X) ∀ θ ∈ Θ1.
We return to the special case where Θ ⊂ R is an interval. We consider testing
H0 : θ = θ0 ,
against
H1 : θ 6= θ0 .
The following theorem presents the UMPU test. We omit the proof (see e.g.
Lehmann (1986)).
Theorem 7.5.1 Suppose P is a one-dimensional exponential family:

pθ(x) := dPθ/dν(x) = exp[c(θ)T(x) − d(θ)]h(x),

with c(θ) strictly increasing in θ. Then a UMPU test is

φ(T(x)) := 1 if T(x) < tL or T(x) > tR,
           qL if T(x) = tL,
           qR if T(x) = tR,
           0 if tL < T(x) < tR,

where the constants tR, tL, qR and qL are chosen in such a way that

Eθ0 φ(X) = α, (d/dθ) Eθ φ(X)|_{θ=θ0} = 0.

Note Let φR be a right-sided test as defined in Theorem 7.3.1 with level at most
α, and let φL be the similarly defined left-sided test. Then βR(θ) = Eθ φR(T) is
strictly increasing, and βL(θ) = Eθ φL(T) is strictly decreasing. The two-sided
test φ of Theorem 7.5.1 is a superposition of two one-sided tests. Writing

β(θ) = Eθ φ(T ),

the one-sided tests are constructed in such a way that

β(θ) = βR (θ) + βL (θ).

Moreover, β(θ) should be minimal at θ = θ0 , whence the requirement that its


derivative at θ0 should vanish. Let us see what this derivative looks like. With
the notation used in the proof of Theorem 7.3.1, for a test φ̃ depending only on
the sufficient statistic T,

Eθ φ̃(T) = ∫ φ̃(t) exp[c(θ)t − d(θ)] dν̄(t).

Hence, assuming we can take the differentiation inside the integral,

(d/dθ) Eθ φ̃(T) = ∫ φ̃(t) exp[c(θ)t − d(θ)] (ċ(θ)t − ḋ(θ)) dν̄(t)

= ċ(θ) covθ(φ̃(T), T).

The UMPU test sets this to zero. We leave the interpretation to the reader...
Example 7.5.1 Two-sided test for the mean of the normal distribu-
tion
Let X1 , . . . , Xn be an i.i.d. sample from the N (µ, σ02 )-distribution, with µ ∈ R
unknown, and with σ02 known. We consider testing
H0 : µ = µ0 ,
against
H1 : µ 6= µ0 .
A sufficient statistic is T := ∑ᵢ₌₁ⁿ Xi. We have, for tL < tR,

Eµ φ(T) = IPµ(T > tR) + IPµ(T < tL)
= IPµ( (T − nµ)/(√n σ0) > (tR − nµ)/(√n σ0) ) + IPµ( (T − nµ)/(√n σ0) < (tL − nµ)/(√n σ0) )
= 1 − Φ( (tR − nµ)/(√n σ0) ) + Φ( (tL − nµ)/(√n σ0) ),
where Φ is the standard normal distribution function. To avoid confusion with
the test φ, we denote the standard normal density in this example by Φ̇. Thus,
(d/dµ) Eµ φ(T) = (n/(√n σ0)) Φ̇( (tR − nµ)/(√n σ0) ) − (n/(√n σ0)) Φ̇( (tL − nµ)/(√n σ0) ).

So putting

(d/dµ) Eµ φ(T)│µ=µ0 = 0

gives

Φ̇( (tR − nµ0)/(√n σ0) ) = Φ̇( (tL − nµ0)/(√n σ0) ),

or

(tR − nµ0)² = (tL − nµ0)².
We take the solution (tL − nµ0) = −(tR − nµ0) (the solution
(tL − nµ0) = (tR − nµ0) leads to a test that always rejects, and hence does
not have level α, as α < 1). Plugging this solution back in gives
Eµ0 φ(T) = 1 − Φ( (tR − nµ0)/(√n σ0) ) + Φ( −(tR − nµ0)/(√n σ0) )
= 2( 1 − Φ( (tR − nµ0)/(√n σ0) ) ).

The requirement Eµ0 φ(T) = α gives us

Φ( (tR − nµ0)/(√n σ0) ) = 1 − α/2,

and hence

tR − nµ0 = √n σ0 Φ⁻¹(1 − α/2),   tL − nµ0 = −√n σ0 Φ⁻¹(1 − α/2).
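As a numerical sanity check (a Python sketch, not part of the text), we can compute tL and tR for given µ0, σ0, n and α, and verify that the rejection probability equals α at µ = µ0 and exceeds α away from µ0, in line with unbiasedness:

```python
from statistics import NormalDist

def critical_values(mu0, sigma0, n, alpha):
    """t_L and t_R of the two-sided UMPU test based on T = X_1 + ... + X_n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = n ** 0.5 * sigma0 * z
    return n * mu0 - half_width, n * mu0 + half_width

def power(mu, mu0, sigma0, n, alpha):
    """E_mu phi(T) = P_mu(T < t_L) + P_mu(T > t_R); here T ~ N(n*mu, n*sigma0^2)."""
    tL, tR = critical_values(mu0, sigma0, n, alpha)
    law_T = NormalDist(mu=n * mu, sigma=n ** 0.5 * sigma0)
    return law_T.cdf(tL) + 1 - law_T.cdf(tR)
```

For µ0 = 0 the critical values are symmetric around zero, and the power function has its minimum α at µ0.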

7.6 Conditional tests ?

We now study the case where Θ is an interval in R2 . We let θ = (β, γ), and we
assume that γ is the parameter of interest. We aim at testing
H0 : γ ≤ γ0 ,
against the alternative
H1 : γ > γ0 .
We assume moreover that we are dealing with an exponential family in canonical
form:
pθ (x) = exp[βT1 (x) + γT2 (x) − d(θ)]h(x).
Then we can restrict ourselves to tests φ(T ) depending only on the sufficient
statistic T = (T1 , T2 ).
Lemma 7.6.1 Suppose that {β : (β, γ0) ∈ Θ} contains an open interval. Let

φ(T1, T2) := { 1 if T2 > t0(T1);  q(T1) if T2 = t0(T1);  0 if T2 < t0(T1) },

where the cutoff t0(T1) and the randomization q(T1) are allowed to depend on T1,
and are chosen in such a way that

Eγ0( φ(T1, T2) │ T1 ) = α.

Then φ is UMPU.

Sketch of proof.
Let p̄θ(t1, t2) be the density of (T1, T2) with respect to a dominating measure ν̄:

p̄θ(t1, t2) := exp[βt1 + γt2 − d(θ)]h̄(t1, t2).

We assume ν̄(t1, t2) = ν̄1(t1)ν̄2(t2) is a product measure. The conditional density
of T2 given T1 = t1 is then

p̄θ(t2|t1) = exp[βt1 + γt2 − d(θ)]h̄(t1, t2) / ∫ exp[βt1 + γs2 − d(θ)]h̄(t1, s2)dν̄2(s2)
= exp[γt2 − d(γ|t1)]h̄(t1, t2),

where

d(γ|t1) := log( ∫ exp[γs2]h̄(t1, s2)dν̄2(s2) ).

In other words, the conditional distribution of T2 given T1 = t1
- does not depend on β,
- is a one-parameter exponential family in canonical form.

This implies that, given T1 = t1, φ is UMPU.
Result 1 The test φ has level α, i.e.

sup_{γ≤γ0} E(β,γ) φ(T) = E(β,γ0) φ(T) = α, ∀ β.

Proof of Result 1. On the one hand,

sup_{γ≤γ0} E(β,γ) φ(T) ≥ E(β,γ0) φ(T) = E(β,γ0) Eγ0(φ(T)|T1) = α.

Conversely, since the conditional power function is increasing in γ, we have
Eγ(φ(T)|T1) ≤ α for all γ ≤ γ0, so that

sup_{γ≤γ0} E(β,γ) φ(T) = sup_{γ≤γ0} E(β,γ) Eγ(φ(T)|T1) ≤ α.

Result 2 The test φ is unbiased.

Proof of Result 2. If γ > γ0, it holds that Eγ(φ(T)|T1) ≥ α, as the conditional
test is unbiased. Thus, also, for all β,

E(β,γ) φ(T) = E(β,γ) Eγ(φ(T)|T1) ≥ α,

i.e., φ is unbiased.
Result 3 Let φ′ be a test with level

α′ := sup_β sup_{γ≤γ0} E(β,γ) φ′(T) ≤ α,

and suppose moreover that φ′ is unbiased, i.e., that

sup_{γ≤γ0} sup_β E(β,γ) φ′(T) ≤ inf_{γ>γ0} inf_β E(β,γ) φ′(T).

Then, conditionally on T1, φ′ has level α′.


Proof of Result 3. As

α′ = sup_β sup_{γ≤γ0} E(β,γ) φ′(T),

we know that

E(β,γ0) φ′(T) ≤ α′, ∀ β.

Conversely, the unbiasedness implies that for all γ > γ0,

E(β,γ) φ′(T) ≥ α′, ∀ β.

A continuity argument therefore gives

E(β,γ0) φ′(T) = α′, ∀ β.

In other words, we have

E(β,γ0)( φ′(T) − α′ ) = 0, ∀ β.

But then also

E(β,γ0) Eγ0( φ′(T) − α′ │ T1 ) = 0, ∀ β,

which we can write as

E(β,γ0) h(T1) = 0, ∀ β,

where h(T1) := Eγ0(φ′(T)|T1) − α′.
The assumption that {β : (β, γ0) ∈ Θ} contains an open interval implies that
T1 is complete for (β, γ0). So we must have

h(T1) = 0, P(β,γ0)-a.s., ∀ β,

or, by the definition of h,

Eγ0(φ′(T)|T1) = α′, P(β,γ0)-a.s., ∀ β.

So conditionally on T1, the test φ′ has level α′.


Result 4 Let φ′ be a test as given in Result 3. Then φ′ cannot be more powerful
than φ at any (β, γ) with γ > γ0.

Proof of Result 4. By the Neyman Pearson lemma, conditionally on T1, we
have

Eγ(φ′(T)|T1) ≤ Eγ(φ(T)|T1), ∀ γ > γ0.

Thus also

E(β,γ) φ′(T) ≤ E(β,γ) φ(T), ∀ β, γ > γ0. □
Example 7.6.1 Comparing the means of two Poissons
Consider two independent samples X = (X1 , . . . , Xn ) and Y = (Y1 , . . . , Ym ),
where X1 , . . . , Xn are i.i.d. Poisson(λ)-distributed, and
Y1 , . . . , Ym are i.i.d. Poisson(µ)-distributed. We aim at testing
H0 : λ ≤ µ,
against the alternative
H1 : λ > µ.
Define
β := log(µ), γ := log(λ/µ).
The testing problem is equivalent to
H0 : γ ≤ γ0 ,
against the alternative
H1 : γ > γ0 ,
where γ0 := 0.
The density is

pθ(x1, . . . , xn, y1, . . . , ym)
= exp[ log(λ) ∑ᵢ₌₁ⁿ xi + log(µ) ∑ⱼ₌₁ᵐ yj − nλ − mµ ] ∏ᵢ₌₁ⁿ (1/xi!) ∏ⱼ₌₁ᵐ (1/yj!)
= exp[ log(µ)( ∑ᵢ₌₁ⁿ xi + ∑ⱼ₌₁ᵐ yj ) + log(λ/µ) ∑ᵢ₌₁ⁿ xi − nλ − mµ ] h(x, y)
= exp[βT1(x, y) + γT2(x) − d(θ)]h(x, y),

where

T1(X, Y) := ∑ᵢ₌₁ⁿ Xi + ∑ⱼ₌₁ᵐ Yj,   T2(X) := ∑ᵢ₌₁ⁿ Xi,

and

h(x, y) := ∏ᵢ₌₁ⁿ (1/xi!) ∏ⱼ₌₁ᵐ (1/yj!).

The conditional distribution of T2 given T1 = t1 is the Binomial(t1, p)-distribution,
with

p = nλ/(nλ + mµ) = (n/m)eγ / (1 + (n/m)eγ),

which is increasing in γ.
Thus, conditionally on T1 = t1 , using the observation T2 from the Binomial(t1 , p)-
distribution, we test
H0 : p ≤ p0 ,
against the alternative
H1 : p > p0 ,
where p0 := n/(n + m). This test is UMPU for the unconditional problem.
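A sketch of the resulting conditional test in Python (not from the text): given T1 = t1, we compare T2 with the Binomial(t1, p0) distribution, p0 = n/(n + m). The non-randomized version below is slightly conservative, since it omits the randomization at the boundary.

```python
from math import comb

def conditional_poisson_test(x_sum, y_sum, n, m, alpha=0.05):
    """Exact conditional test of H0: lambda <= mu.
    At the boundary, T2 | T1 = t1 ~ Binomial(t1, p0) with p0 = n / (n + m).
    Returns (reject, right-tail p-value)."""
    t1, t2 = x_sum + y_sum, x_sum
    p0 = n / (n + m)
    # P(T2 >= t2 | T1 = t1) under p = p0
    p_value = sum(comb(t1, k) * p0 ** k * (1 - p0) ** (t1 - k)
                  for k in range(t2, t1 + 1))
    return p_value <= alpha, p_value
```

For instance, with n = m = 10, ∑Xi = 15 and ∑Yj = 5, the conditional distribution is Binomial(20, 1/2) and the right tail beyond 15 is small, so H0: λ ≤ µ is rejected at level 0.05.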
Chapter 8

Comparison of estimators

One can compare estimators in terms of their risk. This is done in Sections 8.1,
8.2 and 8.3. Section 8.4 briefly addresses sensitivity and robustness, and Section
8.5 discusses computational aspects. These last two sections do not go into
detail.

8.1 Definition of risk

Consider a random variable X with distribution Pθ , θ ∈ Θ. Let T = T (X)


be an estimator of a parameter of interest γ = g(θ). A risk function R(·, ·)
measures the loss due to the error of the estimator. The risk depends on the
unknown parameter θ and on the estimator. We define the risk as
R(θ, T) := IEθ L(θ, T(X)),

where L(·, ·) is a given so-called loss function¹. A more detailed description is
given in Chapter 10.
Example 8.1.1 Risk of a test
Consider the testing problem
H0 : θ = θ0 ,
against
H1 : θ = θ1 .
Let φ(X) ∈ [0, 1] be a test.
The risk of the test can then be defined as the probability of an error, i.e.

R(θ, φ) = { IEθ0 φ(X) if θ = θ0;  1 − IEθ1 φ(X) if θ = θ1 }.
Example 8.1.2 Risk of an estimator
In the case γ ∈ R an important risk measure is the mean square error

R(θ, T) := IEθ(T(X) − g(θ))² =: MSEθ(T).

¹ Note that the quantity L(θ, T(X)) is random. Note also that in the notation of risk
R(θ, T), the symbol T stands for the map T.


8.2 Risk and sufficiency

Let S = S(X) be sufficient. Knowing the sufficient statistic S one can forget
about the original data X without losing information. Indeed, the following
lemma says that any decision based on the original data X can be replaced by
a randomized one which depends only on S and which has the same risk.
Lemma 8.2.1 Suppose S is sufficient for θ. Let d : X → A be some decision.
Then there is a randomized decision δ(S) that only depends on S, such that

R(θ, δ(S)) = R(θ, d), ∀ θ.

Proof. Let Xs∗ be a random variable with distribution P (X ∈ ·|S = s). Then,
by construction, for all possible s, the conditional distribution, given S = s,
of Xs∗ and X are equal. It follows that X and XS∗ have the same distribution.
Formally, let us write Qθ for the distribution of S. Then
Z
Pθ (XS∗ ∈ ·) = P (Xs∗ ∈ ·|S = s)dQθ (s)
Z
= P (X ∈ ·|S = s)dQθ (s) = Pθ (X ∈ ·).

The result of the lemma follows by taking δ(s) := d(Xs∗). □

8.3 Rao-Blackwell

The Lemma of Rao-Blackwell says that in the case of convex loss an estimator
based on the original data X can be replaced by one based only on S without
increasing the risk. Randomization is not needed here.
Lemma 8.3.1 (Rao-Blackwell) Suppose that S is sufficient for θ. Suppose
moreover that the action space A ⊂ Rᵖ is convex, and that for each θ, the
map a ↦ L(θ, a) is convex. Let d : X → A be a decision, and define d′(s) :=
E(d(X)|S = s) (assumed to exist). Then

R(θ, d′) ≤ R(θ, d), ∀ θ.

Proof. Jensen's inequality says that for a convex function g,

E(g(X)) ≥ g(EX).

Hence, ∀ θ,

E( L(θ, d(X)) │ S = s ) ≥ L( θ, E(d(X)|S = s) ) = L(θ, d′(s)).

By the iterated expectations lemma, we arrive at

R(θ, d) = Eθ L(θ, d(X)) = Eθ E( L(θ, d(X)) │ S ) ≥ Eθ L(θ, d′(S)) = R(θ, d′). □
Example 8.3.1 Mean square error
Let T be an estimator of g(θ) ∈ R and let

R(θ, T) := IEθ(T(X) − g(θ))² =: MSEθ(T).

Let S be sufficient and T̃ := IE(T|S). Then by the Rao-Blackwell Lemma,

R(θ, T̃) ≤ R(θ, T), ∀ θ.

The mean square error can be decomposed into a variance term and a squared
bias term. Since IE T̃ = IE T by the iterated expectations lemma, we thus have

varθ(T̃) ≤ varθ(T), ∀ θ.

Compare with Lemma 5.2.2 and the result of Lehmann-Scheffé in Section 5.3.
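The variance reduction is easy to see in a small simulation (a Python sketch, not from the text): for Bernoulli(θ) data, T = X1 is unbiased for θ, S = ∑Xi is sufficient, and E(T|S) = S/n = X̄.

```python
import random

def rao_blackwell_demo(theta=0.3, n=10, reps=20000, seed=1):
    """Compare MSE of the crude estimator T = X1 with MSE of its
    Rao-Blackwellization E(T | S) = S / n for Bernoulli(theta) samples."""
    rng = random.Random(seed)
    se_raw = se_rb = 0.0
    for _ in range(reps):
        x = [1 if rng.random() < theta else 0 for _ in range(n)]
        se_raw += (x[0] - theta) ** 2          # T = X1
        se_rb += (sum(x) / n - theta) ** 2     # E(T | S) = sample mean
    return se_raw / reps, se_rb / reps
```

Theory gives MSEθ(X1) = θ(1 − θ) = 0.21 and MSEθ(X̄) = θ(1 − θ)/n = 0.021 here, and the simulation reproduces both up to Monte Carlo error.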

8.4 Sensitivity and robustness

We can compare estimators with respect to their sensitivity to large errors in
the data. Let X1, . . . , Xn be i.i.d. copies of a random variable X. Let Tn =
Tn(X1, . . . , Xn) be a real-valued estimator defined for each n, and symmetric
in X1, . . . , Xn.
Influence of a single additional observation
The influence function is

l(x) := Tn+1 (X1 , . . . , Xn , x) − Tn (X1 , . . . , Xn ), x ∈ R.

Break down point

Let, for m ≤ n,

ε(m) := sup_{x∗1,...,x∗m} |T(x∗1, . . . , x∗m, Xm+1, . . . , Xn)|.

If ε(m) = ∞, we say that with m outliers the estimator can break down. The
break down point is defined as

ε∗ := min{m : ε(m) = ∞}/n.

An estimator is called robust if it has a bounded influence function and/or a
large breakdown point.
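As an illustration (a Python sketch with hypothetical data, not part of the text): the sample mean breaks down with a single outlier (ε∗ = 1/n), while the sample median tolerates just under half of the sample being corrupted.

```python
def sample_median(xs):
    """Sample median: middle order statistic (odd n) or midpoint (even n)."""
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def corrupt(xs, m, value):
    """Replace the first m observations by an arbitrarily large outlier."""
    return [value] * m + list(xs[m:])
```

With n = 9 clean observations, one huge outlier already carries the mean away, whereas the median only breaks down once more than half of the observations are corrupted.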

8.5 Computational aspects

Today, the data are often high-dimensional and the number of parameters p
is also very large. Maximum likelihood estimation, for example, requires maximization
of a function of p variables, and this can be very hard if p is large, all the
more so if the likelihood is not concave, or if, e.g., some parameters are
integer-valued. We will moreover examine Bayesian theory in Chapter 10.
There one needs to find so-called “posterior distributions”, which is typically
computationally very hard (this is where MCMC (Markov Chain Monte Carlo)
algorithms come in). Clearly, an estimator which cannot be computed (say, in
polynomial time) is of little practical value.
Chapter 9

Equivariant statistics

As we have seen in Chapter 5 for instance, it can be useful to restrict attention


to a collection of statistics satisfying certain desirable properties. In Chapter
5, we restricted ourselves to unbiased estimators. In this chapter, equivariance
will be the key concept.
The data consists of i.i.d. real-valued random variables X1, . . . , Xn. We write
X := (X1, . . . , Xn). The density w.r.t. some dominating measure ν of a
single observation is denoted by pθ. The density of X is pθ(x) = ∏ᵢ pθ(xi),
x = (x1, . . . , xn).
Location model
Here θ ∈ R is a location parameter, and we assume

Xi = θ + εi, i = 1, . . . , n.

We are interested in estimating θ. Both the parameter space Θ, as well as the
action space A, are the real line R. We assume ε1, . . . , εn are i.i.d. with a known
density p0(·).
Location-scale model
Here θ = (µ, σ), with µ ∈ R a location parameter and σ > 0 a scale parameter.
We assume

Xi = µ + σεi, i = 1, . . . , n.

The parameter space Θ and action space A are both R × (0, ∞). We assume
ε1, . . . , εn are i.i.d. with a known density p0(·).

9.1 Equivariance in the location model

Definition 9.1.1 A statistic T = T (X) is called location equivariant if for all


constants c ∈ R and all x = (x1 , . . . , xn ),

T (x1 + c, . . . , xn + c) = T (x1 , . . . , xn ) + c.


Examples

T = X((n+1)/2) (the sample median, n odd), · · ·
Definition 9.1.2 A loss function L(θ, a) is called location invariant if for all
c ∈ R,
L(θ + c, a + c) = L(θ, a), (θ, a) ∈ R2 .

In this section we abbreviate location equivariance (invariance) to simply equiv-


ariance (invariance), and we assume throughout that the loss L(θ, a) is invari-
ant.
Corollary 9.1.1 If T is equivariant (and L(θ, a) is invariant), then

R(θ, T) = Eθ L(θ, T(X)) = Eθ L(0, T(X) − θ) = Eθ L(0, T(X − θ)) = Eθ L0[T(ε)],

where L0[a] := L(0, a) and ε := (ε1, . . . , εn). Because the distribution of ε does
not depend on θ, we conclude that the risk does not depend on θ. We may
therefore omit the subscript θ in the last expression:

R(θ, T) = EL0[T(ε)].

Since for θ = 0 we have the equality X = ε, we may alternatively write

R(θ, T) = E0 L0[T(X)] = R(0, T).

Definition 9.1.3 An equivariant statistic T is called uniform minimum risk
equivariant (UMRE) if

R(θ, T) = min_{d equivariant} R(θ, d), ∀ θ,

or equivalently,

R(0, T) = min_{d equivariant} R(0, d).

9.1.1 Construction of the UMRE estimator

Lemma 9.1.1 Let Yi := Xi − Xn, i = 1, . . . , n, and Y := (Y1, . . . , Yn). We
have

T equivariant ⇔ T(X) = T(Y) + Xn.

Proof.
(⇒) Trivial.
(⇐) Replacing X by X + c leaves Y unchanged (i.e., Y is invariant). So
T(X + c) = T(Y) + Xn + c = T(X) + c. □

Theorem 9.1.1 Let Yi := Xi − Xn, i = 1, . . . , n, Y := (Y1, . . . , Yn), and define

T∗(Y) := arg min_v E( L0[v + εn] │ Y ).

Moreover, let

T∗(X) := T∗(Y) + Xn.

Then T∗ is UMRE.
Proof. First, note that Y and its distribution do not depend on θ, so that
T∗ is indeed a statistic. It is also equivariant, by the previous lemma.
Let T be an equivariant statistic. Then T(X) = T(Y) + Xn. So

T(X) − θ = T(Y) + εn.

Hence

R(0, T) = EL0[T(Y) + εn] = E E( L0[T(Y) + εn] │ Y ).

But

E( L0[T(Y) + εn] │ Y ) ≥ min_v E( L0[v + εn] │ Y ) = E( L0[T∗(Y) + εn] │ Y ).

Hence,

R(0, T) ≥ E E( L0[T∗(Y) + εn] │ Y ) = R(0, T∗). □

9.1.2 Quadratic loss: the Pitman estimator

Corollary 9.1.2 If we take quadratic loss

L(θ, a) := (a − θ)²,

we get L0[a] = a², and so, for Y = X − Xn,

T∗(Y) = arg min_v E( (v + εn)² │ Y ) = −E(εn|Y),

and hence

T∗(X) = Xn − E(εn|Y).

This estimator is called the Pitman estimator.

To investigate the case of quadratic risk further, we note the following.

Note If (X, Z) has density f(x, z) w.r.t. Lebesgue measure, then the density
of Y := X − Z is

fY(y) = ∫ f(y + z, z)dz.

Lemma 9.1.2 Consider quadratic loss. Let p0 be the density of ε = (ε1, . . . , εn)
w.r.t. Lebesgue measure. Then a UMRE statistic is

T∗(X) = ∫ z p0(X1 − z, . . . , Xn − z)dz / ∫ p0(X1 − z, . . . , Xn − z)dz.

Proof. Let Y = X − Xn. The random vector Y has density

fY(y1, . . . , yn−1, 0) = ∫ p0(y1 + z, . . . , yn−1 + z, z)dz.

So the density of εn given Y = y = (y1, . . . , yn−1, 0) is

f_{εn}(u) = p0(y1 + u, . . . , yn−1 + u, u) / ∫ p0(y1 + z, . . . , yn−1 + z, z)dz.

It follows that

E(εn|y) = ∫ u p0(y1 + u, . . . , yn−1 + u, u)du / ∫ p0(y1 + z, . . . , yn−1 + z, z)dz.

Thus

E(εn|Y) = ∫ u p0(Y1 + u, . . . , Yn−1 + u, u)du / ∫ p0(Y1 + z, . . . , Yn−1 + z, z)dz
= ∫ u p0(X1 − Xn + u, . . . , Xn−1 − Xn + u, u)du / ∫ p0(X1 − Xn + z, . . . , Xn−1 − Xn + z, z)dz
= Xn − ∫ z p0(X1 − z, . . . , Xn−1 − z, Xn − z)dz / ∫ p0(X1 − z, . . . , Xn−1 − z, Xn − z)dz,

substituting z = Xn − u in the last step. Finally, recall that T∗(X) = Xn − E(εn|Y). □
Example 9.1.1 Uniform distribution with unknown midpoint
Suppose X1, . . . , Xn are i.i.d. Uniform[θ − 1/2, θ + 1/2], θ ∈ R. Then

p0(x) = 1{|x| ≤ 1/2}.

We have

max_{1≤i≤n} |xi − z| ≤ 1/2 ⇔ x(n) − 1/2 ≤ z ≤ x(1) + 1/2.

So

p0(x1 − z, . . . , xn − z) = 1{x(n) − 1/2 ≤ z ≤ x(1) + 1/2}.

Thus, writing

T1 := X(n) − 1/2,   T2 := X(1) + 1/2,

the UMRE estimator T∗ is

T∗ = ∫_{T1}^{T2} z dz / ∫_{T1}^{T2} dz = (T1 + T2)/2 = (X(1) + X(n))/2.
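A numerical check (a Python sketch with hypothetical data, not from the text): evaluating the integral formula of Lemma 9.1.2 by a midpoint Riemann sum reproduces the midrange for uniform noise.

```python
def pitman_estimate(x, p0_marginal, lo, hi, steps=200000):
    """Pitman estimator  ∫ z Π p0(x_i − z) dz / ∫ Π p0(x_i − z) dz,
    approximated by a midpoint Riemann sum over z in [lo, hi].
    p0_marginal is the density of a single noise term (i.i.d. noise)."""
    h = (hi - lo) / steps
    num = den = 0.0
    for k in range(steps):
        z = lo + (k + 0.5) * h
        w = 1.0
        for xi in x:
            w *= p0_marginal(xi - z)
        num += z * w
        den += w
    return num / den  # the step width h cancels in the ratio

def uniform_density(u):
    """Uniform[-1/2, 1/2] noise density."""
    return 1.0 if abs(u) <= 0.5 else 0.0
```

For uniform noise the numerator and denominator are supported on [x(n) − 1/2, x(1) + 1/2], so the ratio is its midpoint, i.e., the midrange (x(1) + x(n))/2.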

9.1.3 Invariant statistics

We now consider more general invariant statistics Y.


Definition 9.1.4 A map Y : Rⁿ → Rⁿ is called maximal invariant if

Y(x) = Y(x′) ⇔ ∃ c : x = x′ + c.

(The constant c may depend on x and x′.)

Example The map Y(x) := x − xn is maximal invariant:
(⇐) is clear;
(⇒) if x − xn = x′ − x′n, we have x = x′ + (xn − x′n).
More generally:
Example Let d(X) be equivariant. Then Y := X − d(X) is maximal invariant.
Theorem 9.1.2 Suppose that d(X) is equivariant. Let Y := X − d(X), and

T∗(Y) := arg min_v E( L0[v + d(ε)] │ Y ).

Then

T∗(X) := T∗(Y) + d(X)

is UMRE.

Proof. Let T be an equivariant estimator. Then

T(X) = T(X − d(X)) + d(X) = T(Y) + d(X).

Hence

E( L0[T(ε)] │ Y ) = E( L0[T(Y) + d(ε)] │ Y ) ≥ min_v E( L0[v + d(ε)] │ Y ).

Now, use the iterated expectations lemma. □

9.1.4 Quadratic loss and Basu’s Lemma

For quadratic loss (L0[a] = a²), the definition of T∗(Y) in the above theorem
becomes

T∗(Y) = −E(d(ε)|Y) = −E0(d(X)|X − d(X)),

so that

T∗(X) = d(X) − E0(d(X)|X − d(X)).

So for an equivariant estimator T, we have

T is UMRE ⇔ E0(T(X)|X − T(X)) = 0.



From the right hand side, we conclude that E0 T = 0 and hence Eθ T = θ,
∀ θ. Thus, in the case of quadratic loss, a UMRE estimator is unbiased.
Conversely, suppose we have an equivariant and unbiased estimator T. If T(X)
and X − T(X) are independent, it follows that
Conversely, suppose we have an equivariant and unbiased estimator T . If T (X)
and X − T (X) are independent, it follows that

E0 (T (X)|X − T (X)) = E0 T (X) = 0.

So then T is UMRE.
To check independence, Basu’s lemma can be useful.
Basu’s lemma Let X have distribution Pθ , θ ∈ Θ. Suppose T is sufficient
and complete, and that Y = Y (X) has a distribution that does not depend on
θ. Then, for all θ, T and Y are independent under Pθ .
Proof. Let A be some measurable set, and

h(T ) := P (Y ∈ A|T ) − P (Y ∈ A).

Notice that indeed, P (Y ∈ A|T ) does not depend on θ because T is sufficient.


Because

Eθ h(T) = 0, ∀ θ,

we conclude from the completeness of T that

h(T ) = 0, Pθ −a.s., ∀ θ,

in other words,

P (Y ∈ A|T ) = P (Y ∈ A), Pθ −a.s., ∀ θ.

Since A was arbitrary, we thus have that the conditional distribution of Y given
T is equal to the unconditional distribution:

P (Y ∈ ·|T ) = P (Y ∈ ·), Pθ −a.s., ∀ θ,

that is, for all θ, T and Y are independent under Pθ. □
Basu’s lemma is intriguing: it proves a probabilistic property (independence)
via statistical concepts.
Example 9.1.2 UMRE estimator for the mean of the normal distri-
bution: σ 2 known
Let X1 , . . . , Xn be independent N (θ, σ 2 ), with σ 2 known. Then T := X̄ is suf-
ficient and complete, and moreover, the distribution of Y := X − X̄ does not
depend on θ. So by Basu’s lemma, X̄ and X − X̄ are independent. Hence, X̄
is UMRE.
Remark Indeed, Basu’s lemma is peculiar: X̄ and X − X̄ of course remain
independent if the mean θ is known and/or the variance σ 2 is unknown!
Remark As a by-product, one concludes the independence of X̄ and the sample
variance S² = ∑ᵢ₌₁ⁿ(Xi − X̄)²/(n − 1), because S² is a function of X − X̄.
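A quick Monte Carlo illustration of this independence (a Python sketch, not from the text): the sample correlation between X̄ and S over repeated normal samples should be near zero.

```python
import math
import random

def xbar_s_correlation(mu=2.0, sigma=1.5, n=5, reps=5000, seed=7):
    """Sample correlation between the mean X-bar and the sample standard
    deviation S over repeated N(mu, sigma^2) samples; Basu's lemma implies
    independence, so the correlation should be close to zero."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(reps):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(x) / n
        s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
        xs.append(xbar)
        ys.append(s)
    mx, my = sum(xs) / reps, sum(ys) / reps
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / reps
    vx = sum((a - mx) ** 2 for a in xs) / reps
    vy = sum((b - my) ** 2 for b in ys) / reps
    return cov / math.sqrt(vx * vy)
```

Up to Monte Carlo error the correlation vanishes, whatever µ and σ are used, consistent with the remark above.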

9.2 Equivariance in the location-scale model ?

Location-scale model
We assume

Xi = µ + σεi, i = 1, . . . , n.

The unknown parameter is θ = (µ, σ), with µ ∈ R a location parameter and
σ > 0 a scale parameter. The parameter space Θ and action space A are both
R × R+ (R+ := (0, ∞)). The distribution of ε = (ε1, . . . , εn) is assumed to be
known.
Definition 9.2.1 A statistic T = T(X) = (T1(X), T2(X)) is called location-
scale equivariant if for all constants b ∈ R, c ∈ R+, and all x = (x1, . . . , xn),

T1(b + cx1, . . . , b + cxn) = b + cT1(x1, . . . , xn)

and

T2(b + cx1, . . . , b + cxn) = cT2(x1, . . . , xn).
Definition 9.2.2 A loss function L(µ, σ, a1 , a2 ) is called location-scale invari-
ant if for all (µ, a1 , b) ∈ R3 , (σ, a2 , c) ∈ R3+

L(b + cµ, cσ, b + ca1 , ca2 ) = L(µ, σ, a1 , a2 ).

In this section we abbreviate location-scale equivariance (invariance) to simply


equivariance (invariance), and we assume throughout that the loss L(θ, a) is
invariant.
Corollary 9.2.1 If T is equivariant (and L(θ, a) is invariant), then

R(θ, T) = Eθ L(µ, σ, T1(X), T2(X))
= Eθ L( 0, 1, (T1(X) − µ)/σ, T2(X)/σ )
= Eθ L(0, 1, T1(ε), T2(ε)) = Eθ L0(T(ε)),

where L0(a1, a2) := L(0, 1, a1, a2). We conclude that the risk does not depend
on θ. We may therefore omit the subscript θ in the last expression:

R(θ, T) = EL0(T(ε)).

Definition 9.2.3 An equivariant statistic T is called uniform minimum risk
equivariant (UMRE) if

R(θ, T) = min_{d equivariant} R(θ, d), ∀ θ,

or equivalently,

R(0, 1, T1, T2) = min_{d equivariant} R(0, 1, d1, d2).

9.2.1 Construction of the UMRE estimator ?

Theorem 9.2.1 Suppose that d(X) is equivariant. Let

Y := (X − d1(X)) / d2(X),

and

T∗(Y) := arg min_{a1∈R, a2∈R+} E( L0( d1(ε) + d2(ε)a1, d2(ε)a2 ) │ Y ).

Then

T∗(X) := ( d1(X) + d2(X)T1∗(Y),  d2(X)T2∗(Y) )

is UMRE.
Proof. We have

Y = (X − d1(X)) / d2(X) = (ε − d1(ε)) / d2(ε).

So

ε = d1(ε) + d2(ε)Y.

Let T be an equivariant estimator. Then

EL0( T1(ε), T2(ε) )
= EL0( T1(d1(ε) + d2(ε)Y), T2(d1(ε) + d2(ε)Y) )
= EL0( d1(ε) + d2(ε)T1(Y), d2(ε)T2(Y) )
= E E( L0( d1(ε) + d2(ε)T1(Y), d2(ε)T2(Y) ) │ Y )
≥ E min_{a1∈R, a2∈R+} E( L0( d1(ε) + d2(ε)a1, d2(ε)a2 ) │ Y )
= E E( L0( d1(ε) + d2(ε)T1∗(Y), d2(ε)T2∗(Y) ) │ Y ). □

9.2.2 Quadratic loss ?

For quadratic loss (L0(a1, a2) := a1²), the definition of T∗(Y) in the above
theorem becomes

T∗(Y) = arg min_{a1∈R} E( (d1(ε) + d2(ε)a1)² │ Y ).

We then have:

Lemma 9.2.1 Suppose that d is equivariant, and sufficient and complete. Then

T∗(X) := d1(X) − d2(X) · E[d1(ε)d2(ε)] / E[d2²(ε)]

is UMRE.

Proof. By Basu's lemma, d and Y are independent. Hence

E( (d1(ε) + d2(ε)a1)² │ Y ) = E( d1(ε) + d2(ε)a1 )².

Moreover,

arg min_{a1} E( d1(ε) + d2(ε)a1 )² = −E[d1(ε)d2(ε)] / E[d2²(ε)]. □
Example 9.2.1 UMRE of the mean of the normal distribution: σ²
unknown
Let X1, . . . , Xn be i.i.d. and N(µ, σ²)-distributed. Define

d1(X) := X̄,   d2(X) := S,

where S² is the sample variance

S² := (1/(n − 1)) ∑ᵢ₌₁ⁿ (Xi − X̄)².

It is easy to see that d is equivariant. We moreover know from Example 4.3.6
that d is sufficient, and an application of Lemma 5.4.1 shows that d is also
complete. We furthermore have

Ed1(ε) = E ε̄ = 0,

and, from Example 9.1.2 (a consequence of Basu's lemma), we know that d1(X) =
X̄ and d2(X) = S are independent. So

Ed1(ε)d2(ε) = Ed1(ε) · Ed2(ε) = 0.

It follows that T∗(X) = X̄ is UMRE.


Chapter 10

Decision theory

In this chapter, we again denote the observable random variable (the data) by
X ∈ X , and its distribution by P ∈ P. The probability model is P := {Pθ : θ ∈
Θ}, with θ an unknown parameter.
In particular cases, we apply the results with X being replaced by a vector
X = (X1, . . . , Xn), with X1, . . . , Xn i.i.d. with distribution P ∈ {Pθ : θ ∈ Θ}
(so that X has distribution IP := ∏ᵢ₌₁ⁿ P ∈ {IPθ = ∏ᵢ₌₁ⁿ Pθ : θ ∈ Θ}).

10.1 Decisions and their risk

We give a definition of risk, but now somewhat more formal than in Chapter 8.
Let A be the action space.
• A = R corresponds to estimating a real-valued parameter.
• A = {0, 1} corresponds to testing a hypothesis.
• A = [0, 1] corresponds to randomized tests.
• A = {intervals} corresponds to confidence intervals.
Given the observation X, we decide to take a certain action in A. Thus, a
decision is a map d : X → A, with d(X) being the action taken. If A = R a
decision is often called an estimator (denoted by T for instance). If A = {0, 1}
or A = [0, 1], a decision is often called a test (denoted by φ for instance).
A loss function (Verlustfunktion) is a map
L : Θ × A → R,
with L(θ, a) being the loss when the parameter value is θ and one takes action
a.
The risk of decision d(X) is defined as
R(θ, d) := Eθ L(θ, d(X)), θ ∈ Θ.


Example 10.1.1 Estimation In the case of estimating a parameter of inter-


est g(θ) ∈ R, the action space is A = R (or a subset thereof ). Important loss
functions are then
L(θ, a) := w(θ)|g(θ) − a|r ,
where w(·) are given non-negative weights and r ≥ 0 is a given power. The risk
is then
R(θ, d) = w(θ)Eθ |g(θ) − d(X)|r .
A special case is taking w ≡ 1 and r = 2. Then L(θ, a) = (g(θ) − a)2 is called
quadratic loss and
R(θ, d) = Eθ |g(θ) − d(X)|2
is called the mean square error.
Example 10.1.2 Tests Consider testing the hypothesis
H0 : θ ∈ Θ0
against the alternative
H1 : θ ∈ Θ1 .
Here, Θ0 and Θ1 are given subsets of Θ with Θ0 ∩ Θ1 = ∅. As action space, we
take A = {0, 1}, and as loss

L(θ, a) := { 1 if θ ∈ Θ0 and a = 1;  c if θ ∈ Θ1 and a = 0;  0 otherwise }.

Here c > 0 is some given constant. Then

R(θ, d) = { Pθ(d(X) = 1) if θ ∈ Θ0;  cPθ(d(X) = 0) if θ ∈ Θ1;  0 otherwise }.

Thus, the risks correspond to the error probabilities (type I and type II errors).

Note
The best decision d is the one with the smallest risk R(θ, d). However, θ is not
known. Thus, if we compare two decision functions d1 and d2 , we may run into
problems because the risks are not comparable: R(θ, d1 ) may be smaller than
R(θ, d2 ) for some values of θ, and larger than R(θ, d2 ) for other values of θ.
Example 10.1.3 Estimating the mean
We revisit Example 5.2.1. Let X ∈ R and let g(θ) = Eθ X := µ. We take
quadratic loss

L(θ, a) := |µ − a|².

Assume that varθ(X) = 1 for all θ. Consider the collection of decisions

dλ(X) := λX,

where 0 ≤ λ ≤ 1. Then

R(θ, dλ) = var(λX) + bias²θ(λX) = λ² + (λ − 1)²µ².

The “optimal” choice for λ would be

λopt := µ² / (1 + µ²),

because this value minimizes R(θ, dλ). However, λopt depends on the unknown
µ, so dλopt(X) is not an estimator.
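The computation above is easy to verify numerically (a Python sketch, not from the text):

```python
def risk(lam, mu):
    """R(theta, d_lambda) = lambda^2 + (lambda - 1)^2 mu^2 (since var(X) = 1)."""
    return lam ** 2 + (lam - 1) ** 2 * mu ** 2

def lam_opt(mu):
    """The minimizer mu^2 / (1 + mu^2) of lam -> risk(lam, mu)."""
    return mu ** 2 / (1 + mu ** 2)
```

At the minimizer the risk equals µ²/(1 + µ²), and no value of λ on a grid does better; but λopt depends on µ, which is exactly why dλopt is not an estimator.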

Various optimality concepts

We will consider three optimality concepts: admissibility (German Zulässigkeit),
minimax, and Bayes.

10.2 Admissible decisions

Definition 10.2.1 A decision d′ is called strictly better than d if

R(θ, d′) ≤ R(θ, d), ∀ θ,

and

∃ θ : R(θ, d′) < R(θ, d).

When there exists a d′ that is strictly better than d, then d is called inadmissible.
Example 10.2.1 Using only one of the observations
Let, for n ≥ 2, X1, . . . , Xn be i.i.d., with g(θ) := Eθ(Xi) := µ and var(Xi) = 1
(for all i). Take quadratic loss L(θ, a) := |µ − a|². Consider d′(X1, . . . , Xn) :=
X̄n and d(X1, . . . , Xn) := X1. Then, ∀ θ,

R(θ, d′) = 1/n,   R(θ, d) = 1,

so that d is inadmissible.

Note
We note that to show that a decision d is inadmissible, it suffices to find a
strictly better d′. On the other hand, to show that d is admissible, one has to
verify that there is no strictly better d′. So in principle, one then has to take
all possible d′ into account.

10.2.1 Not using the data at all is admissible

Let L(θ, a) := |g(θ) − a|^r and d(X) := g(θ0), where θ0 is some fixed given value.

Lemma 10.2.1 Assume that Pθ0 dominates Pθ¹ for all θ. Then d is admissible.

¹ Let P and Q be probability measures on the same measurable space. Then P dominates
Q if for all measurable B, P(B) = 0 implies Q(B) = 0 (Q is absolut stetig bezüglich P).

Proof.
Suppose that d′ is better than d. Since R(θ0, d) = 0, we then have

Eθ0 |g(θ0) − d′(X)|^r ≤ 0.

This implies that

d′(X) = g(θ0), Pθ0-almost surely. (10.1)

Since by (10.1),

Pθ0( d′(X) ≠ g(θ0) ) = 0,

the assumption that Pθ0 dominates Pθ, ∀ θ, now implies

Pθ( d′(X) ≠ g(θ0) ) = 0, ∀ θ.

That is, for all θ, d′(X) = g(θ0), Pθ-almost surely, and hence, for all θ, R(θ, d′) =
R(θ, d). So d′ is not strictly better than d. We conclude that d is admissible. □

10.2.2 A Neyman Pearson test is admissible

Consider testing

H0 : θ = θ0

against the alternative

H1 : θ = θ1.

Define the risk R(θ, φ) of a test φ as the probability of an error of the first and
second kind:

R(θ, φ) := { Eθ φ(X) if θ = θ0;  1 − Eθ φ(X) if θ = θ1 }.

We let p0 (p1) be the density of Pθ0 (Pθ1) with respect to some dominating
measure ν (for example ν = Pθ0 + Pθ1). A Neyman Pearson test is (see Section
7.1)

φNP := { 1 if p1/p0 > c;  q if p1/p0 = c;  0 if p1/p0 < c }.

Here 0 ≤ q ≤ 1 and 0 ≤ c < ∞ are given constants.


Lemma 10.2.2 A Neyman Pearson test is admissible if and only if one of the
following two cases hold:
i) its power is strictly less than 1,
or
ii) it has minimal level among all tests with power 1.
Proof. Suppose R(θ0, φ) < R(θ0, φNP). Then from the Neyman Pearson
Lemma, we know that either R(θ1, φ) > R(θ1, φNP) (i.e., then φ is not better
than φNP), or c = 0. But when c = 0, it holds that R(θ1, φNP) = 0, i.e., then
φNP has power one.
Similarly, suppose that R(θ1, φ) < R(θ1, φNP). Then it follows from the Neyman
Pearson Lemma that R(θ0, φ) > R(θ0, φNP), because we assume c < ∞. □

10.3 Minimax decisions

Definition 10.3.1 A decision d is called minimax if

sup_θ R(θ, d) = inf_{d′} sup_θ R(θ, d′).

Thus, the minimax criterion concerns the best decision in the worst possible
case.

10.3.1 Minimax Neyman Pearson test

Lemma 10.3.1 A Neyman Pearson test φNP is minimax if and only if R(θ0, φNP) =
R(θ1, φNP).

Proof. Let φ be a test, and write, for j = 0, 1,

rj := R(θj, φNP),   rj′ := R(θj, φ).

Suppose that r0 = r1 and that φNP is not minimax. Then, for some test φ,

max_j rj′ < max_j rj.

This implies that both

r0′ < r0,   r1′ < r1,

and by the Neyman Pearson Lemma, this is not possible.
Conversely, let S := {(R(θ0, φ), R(θ1, φ)) : φ : X → [0, 1]}, and note that S is
convex. Thus, if r0 < r1, we can find a test φ with r0 < r0′ < r1 and r1′ < r1.
So then φNP is not minimax. Similarly if r0 > r1. □
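For instance (a Python sketch, not from the text), with one observation from N(0, 1) versus N(1, 1), the NP test rejects when x > t, and bisection on t equalizes the two error probabilities; by the lemma above this yields the minimax test (here t = 1/2 by symmetry).

```python
from statistics import NormalDist

def minimax_threshold(mu0=0.0, mu1=1.0, sigma=1.0, tol=1e-10):
    """Find the cutoff t of the NP test 'reject iff x > t' equalizing the
    type I error 1 - F0(t) and the type II error F1(t), by bisection."""
    F0, F1 = NormalDist(mu0, sigma), NormalDist(mu1, sigma)
    lo, hi = mu0 - 10 * sigma, mu1 + 10 * sigma
    while hi - lo > tol:
        t = (lo + hi) / 2
        r0 = 1 - F0.cdf(t)   # type I error
        r1 = F1.cdf(t)       # type II error
        if r0 > r1:          # r0 - r1 is decreasing in t
            lo = t
        else:
            hi = t
    return (lo + hi) / 2
```

The bisection works because the type I error decreases and the type II error increases in t, so their difference changes sign exactly once.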

10.4 Bayes decisions

Suppose the parameter space Θ is a measurable space. We can then equip it
with a probability measure Π. We call Π the a priori distribution.

Definition 10.4.1 The Bayes risk (with respect to the probability measure Π)
is

r(Π, d) := ∫Θ R(ϑ, d)dΠ(ϑ).

A decision d is called Bayes (with respect to Π) if

r(Π, d) = inf_{d′} r(Π, d′).

Let Π have density w := dΠ/dµ with respect to some dominating measure µ.
We may then write

r(Π, d) = ∫Θ R(ϑ, d)w(ϑ)dµ(ϑ) =: rw(d).

Thus, the Bayes risk may be thought of as a weighted average of the
risks. For example, one may want to assign more weight to “important” values
of θ.

10.4.1 Bayes test

Consider again the testing problem

H0 : θ = θ0

against the alternative

H1 : θ = θ1.

Let L(θ0, a) := a and L(θ1, a) := 1 − a, w(θ0) =: w0 and w(θ1) =: w1 = 1 − w0.
Then

rw(φ) := w0 R(θ0, φ) + w1 R(θ1, φ).

We take 0 < w0 = 1 − w1 < 1.

Lemma 10.4.1 The Bayes test is

φBayes = { 1 if p1/p0 > w0/w1;  q if p1/p0 = w0/w1;  0 if p1/p0 < w0/w1 }.

Proof.

rw(φ) = w0 ∫ φ p0 + w1 ∫ (1 − φ)p1 = ∫ φ(w0 p0 − w1 p1) + w1.

So we choose φ ∈ [0, 1] to minimize φ(w0 p0 − w1 p1) pointwise. This is done by
taking

φ = { 1 if w0 p0 − w1 p1 < 0;  q if w0 p0 − w1 p1 = 0;  0 if w0 p0 − w1 p1 > 0 },

where for q we may take any value between 0 and 1. □
Note that

2rw(φBayes) = 1 − ∫ |w1 p1 − w0 p0|.

In particular, when w0 = w1 = 1/2,

2rw(φBayes) = 1 − ∫ |p1 − p0| / 2,

i.e., the risk is large if the two densities are close to each other.
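This identity is easy to check numerically (a Python sketch, not from the text), using that rw(φBayes) = ∫ min(w0 p0, w1 p1):

```python
from statistics import NormalDist

def bayes_risk_identity(w0=0.3, mu1=1.0, steps=40000, lo=-10.0, hi=11.0):
    """Check 2 r_w(phi_Bayes) = 1 - ∫ |w1 p1 - w0 p0| for p0 = N(0,1) and
    p1 = N(mu1, 1) densities, via a midpoint Riemann sum.  Returns (lhs, rhs)."""
    w1 = 1 - w0
    p0 = NormalDist(0, 1).pdf
    p1 = NormalDist(mu1, 1).pdf
    h = (hi - lo) / steps
    risk = tv = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        a, b = w0 * p0(x), w1 * p1(x)
        risk += min(a, b) * h   # pointwise error mass of the Bayes test
        tv += abs(b - a) * h
    return 2 * risk, 1 - tv
```

The two sides agree up to discretization error, since 2 min(a, b) = (a + b) − |a − b| pointwise and the weighted densities integrate to w0 + w1 = 1.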

10.5 Construction of Bayes estimators

Let X have distribution P ∈ P := {Pθ : θ ∈ Θ}. Suppose P is dominated by a
(σ-finite) measure ν, and let pθ = dPθ/dν denote the densities. Let Π be an a
priori distribution on Θ, with density w := dΠ/dµ. We now think of pθ as the
density of X given the value of θ. We write it as

pθ(x) = p(x|θ), x ∈ X.

Moreover, we define the marginal density

p(·) := ∫Θ p(·|ϑ)w(ϑ)dµ(ϑ).

Definition 10.5.1 The a posteriori density of θ is

w(ϑ|x) = p(x|ϑ) w(ϑ)/p(x), ϑ ∈ Θ, x ∈ X.

Lemma 10.5.1 Given the data X = x, consider θ as a random variable with density w(ϑ|x). Let

l(x, a) := E[L(θ, a)|X = x] = ∫_Θ L(ϑ, a) w(ϑ|x) dµ(ϑ),

and

d(x) := arg min_a l(x, a).

Then d is the Bayes decision d_Bayes.

Proof. For any decision d′,

r_w(d′) = ∫_Θ R(ϑ, d′) w(ϑ) dµ(ϑ)
= ∫_Θ ( ∫_X L(ϑ, d′(x)) p(x|ϑ) dν(x) ) w(ϑ) dµ(ϑ)
= ∫_X ( ∫_Θ L(ϑ, d′(x)) w(ϑ|x) dµ(ϑ) ) p(x) dν(x)
= ∫_X l(x, d′(x)) p(x) dν(x)
≥ ∫_X l(x, d(x)) p(x) dν(x)
= r_w(d). □
Corollary 10.5.1 The Bayes decision is

d_Bayes(X) = arg min_{a∈A} l(X, a),

where

l(x, a) = E(L(θ, a)|X = x) = ∫ L(ϑ, a) w(ϑ|x) dµ(ϑ).

If S is a sufficient statistic, the Factorization Theorem gives p(x|ϑ) = gϑ(S(x)) h(x), so that

l(x, a) = ∫ L(ϑ, a) gϑ(S(x)) w(ϑ) dµ(ϑ) h(x)/p(x).

Hence

d_Bayes(X) = arg min_{a∈A} ∫ L(ϑ, a) gϑ(S) w(ϑ) dµ(ϑ),

which only depends on the sufficient statistic S.

10.5.1 Bayes test revisited

For the testing problem

H0 : θ = θ0

against the alternative

H1 : θ = θ1,

with loss function

L(θ0, a) := a, L(θ1, a) := 1 − a, a ∈ {0, 1},

we have

l(x, φ) = φ w0 p0(x)/p(x) + (1 − φ) w1 p1(x)/p(x).

Thus,

arg min_φ l(·, φ) = { 1 if w1 p1 > w0 p0
                      q if w1 p1 = w0 p0
                      0 if w1 p1 < w0 p0 }.

10.5.2 Bayes estimator for quadratic loss

In the next result, we shall use:

Lemma 10.5.2 Let Z be a real-valued random variable. Then

arg min_{a∈R} E(Z − a)² = EZ.

Proof.

E(Z − a)² = var(Z) + (a − EZ)². □

Consider the case A = R and Θ ⊆ R. Let L(θ, a) := |θ − a|². Then

d_Bayes(X) = E(θ|X).

For quadratic loss, and for T = E(θ|X), the Bayes risk of an estimator T′ is

r_w(T′) = E var(θ|X) + E(T − T′)²

(compare with Lemma 5.2.2). This follows from straightforward calculations:

r_w(T′) = ∫ R(ϑ, T′) w(ϑ) dµ(ϑ) = E R(θ, T′) = E(θ − T′)² = E[ E( (θ − T′)² | X ) ]

and, with θ being the random variable,

E( (θ − T′)² | X ) = E( (θ − T)² | X ) + (T − T′)² = var(θ|X) + (T − T′)².

10.5.3 Bayes estimator and the maximum a posteriori estimator

Consider again the case Θ ⊆ R, and A = Θ, and now with loss function L(θ, a) := 1{|θ − a| > c} for a given constant c > 0. Then

l(x, a) = Π(|θ − a| > c | X = x) = ∫_{|ϑ−a|>c} w(ϑ|x) dϑ.

We note that for c → 0,

(1 − l(x, a)) / (2c) = Π(|θ − a| ≤ c | X = x) / (2c) ≈ w(a|x) = p(x|a) w(a)/p(x).

Thus, for c small, the Bayes rule is approximately d0(x) := arg max_{a∈Θ} p(x|a) w(a). The estimator d0(X) is called the maximum a posteriori estimator (MAP). If w is the uniform density on Θ (which only exists if Θ is bounded), then d0(X) is the maximum likelihood estimator.

10.5.4 Three worked-out examples

In many examples, it saves a lot of work not to write out complete expressions for the posterior w(ϑ|x). The main interest (at first) is how it depends on ϑ; all the other factors can (at least in theory, though possibly with computational difficulty) be recovered later by using that a density integrates to one. We therefore recall the symbol ∝ (see Section 4.9). For example, we may write

w(ϑ|x) = p(x|ϑ) w(ϑ)/p(x) ∝ p(x|ϑ) w(ϑ)

because the marginal density p(x) does not depend on ϑ.


As we will see, in the following three examples the posterior lies in the same family as the prior. Such priors are called conjugate priors for the distribution concerned.

Example 10.5.1 Poisson with Gamma prior
Suppose that given θ, X has the Poisson distribution with parameter θ, and that θ has the Gamma(k, λ)-distribution. The density of θ is then

w(ϑ) = λ^k ϑ^{k−1} e^{−λϑ} / Γ(k),

where

Γ(k) = ∫_0^∞ e^{−z} z^{k−1} dz.

The Gamma(k, λ) distribution has mean

Eθ = ∫_0^∞ ϑ w(ϑ) dϑ = k/λ.

The a posteriori density is then

w(ϑ|x) = p(x|ϑ) w(ϑ)/p(x)
= e^{−ϑ} (ϑ^x / x!) · λ^k ϑ^{k−1} e^{−λϑ} / (Γ(k) p(x))
= e^{−ϑ(1+λ)} ϑ^{k+x−1} c(x, k, λ),

where c(x, k, λ) is such that ∫ w(ϑ|x) dϑ = 1. With the ∝ notation,

w(ϑ|x) ∝ p(x|ϑ) w(ϑ) ∝ e^{−ϑ(1+λ)} ϑ^{k+x−1}.

You see this saves a lot of writing. We recognize w(ϑ|x) as the density of the Gamma(k + x, 1 + λ)-distribution. The Bayes estimator with quadratic loss is thus

E(θ|X) = (k + X)/(1 + λ).

The maximum a posteriori estimator is

(k + X − 1)/(1 + λ).
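As a numerical sanity check (a sketch, not part of the notes; the helper name is ad hoc), one can integrate the unnormalized posterior e^{−ϑ(1+λ)} ϑ^{k+x−1} on a grid and recover both the posterior mean (k + x)/(1 + λ) and the MAP (k + x − 1)/(1 + λ):

```python
import math

def posterior_summary(x, k, lam, t_max=50.0, n=100_000):
    """Grid-based mean and argmax (MAP) of the unnormalized posterior
    u(t) = exp(-t*(1 + lam)) * t**(k + x - 1) of the Poisson-Gamma model."""
    h = t_max / n
    num = den = 0.0
    t_map, u_max = 0.0, -1.0
    for i in range(1, n + 1):
        t = i * h
        u = math.exp(-t * (1 + lam)) * t ** (k + x - 1)
        num += t * u * h
        den += u * h
        if u > u_max:
            t_map, u_max = t, u
    return num / den, t_map

mean, mode = posterior_summary(x=3, k=2, lam=1.0)
print(mean, (2 + 3) / 2)       # posterior mean, closed form 2.5
print(mode, (2 + 3 - 1) / 2)   # MAP, closed form 2.0
```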
Example 10.5.2 Binomial distribution with Beta prior
Suppose that given θ, X has the Binomial(n, θ)-distribution, and that θ is uniformly distributed on [0, 1]. Then

w(ϑ|x) = (n choose x) ϑ^x (1 − ϑ)^{n−x} / p(x).

This is the density of the Beta(x + 1, n − x + 1)-distribution. Thus, with quadratic loss, the Bayes estimator is

E(θ|X) = (X + 1)/(n + 2).

More generally, suppose that X is Binomial(n, θ) and that θ has the Beta(r, s)-prior

w(ϑ) = Γ(r + s)/(Γ(r)Γ(s)) ϑ^{r−1} (1 − ϑ)^{s−1}, 0 < ϑ < 1.

Here r and s are given positive numbers. The prior expectation is

Eθ = r/(r + s).

The Bayes estimator under quadratic loss is the posterior expectation

E(θ|X) = (X + r)/(n + r + s).
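The Beta posterior mean (X + r)/(n + r + s) can be verified by direct numerical integration of ϑ^{x+r−1}(1 − ϑ)^{n−x+s−1} (a sketch, not part of the notes):

```python
def beta_posterior_mean(x, n, r=1.0, s=1.0, m=100_000):
    """Posterior mean for X ~ Binomial(n, theta) with a Beta(r, s) prior,
    by midpoint-rule integration; closed form is (x + r)/(n + r + s)."""
    h = 1.0 / m
    num = den = 0.0
    for i in range(m):
        t = (i + 0.5) * h
        u = t ** (x + r - 1) * (1.0 - t) ** (n - x + s - 1)
        num += t * u * h
        den += u * h
    return num / den

print(beta_posterior_mean(7, 10), (7 + 1) / (10 + 2))       # uniform prior
print(beta_posterior_mean(3, 10, 2, 2), (3 + 2) / (10 + 4)) # Beta(2, 2) prior
```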
Example 10.5.3 Normal distribution with normal prior
Let X ∼ N(θ, 1) and θ ∼ N(c, τ²) for some c and τ². We have

w(ϑ|x) = p(x|ϑ) w(ϑ) / p(x)
∝ φ(x − ϑ) φ((ϑ − c)/τ)
∝ exp( −[ (x − ϑ)² + (ϑ − c)²/τ² ] / 2 )
∝ exp( −(1/2) ( ϑ − (τ²x + c)/(τ² + 1) )² (1 + τ²)/τ² ).

We conclude that the Bayes estimator for quadratic loss is

T_Bayes = E(θ|X) = (τ²X + c)/(τ² + 1).

10.6 Discussion of Bayesian approach

A main objection to the Bayesian approach is that it is generally subjective: the final estimator depends strongly on the choice of the prior distribution. On the other hand, Bayesian methods are very powerful and often quite natural. The prior may be inspired by or estimated from previous data sets, in which case the above subjectivity problem becomes less pressing. Furthermore, in complicated models with many unknown parameters, Bayesian methods are a welcome tool for developing sensible algorithms.
Credibility sets. A (frequentist) confidence set for a parameter of interest can be hard to find, and is also less easy to explain to “non-experts”. The Bayesian version of a confidence set is called a credibility set, which generally is seen as an intuitively much clearer concept. For example, in the case of a real-valued parameter θ, a (1 − α)-credibility interval is defined as

I := [θ̂_L(X), θ̂_R(X)],

where the endpoints θ̂_L and θ̂_R are chosen in such a way that

∫_{θ̂_L(X)}^{θ̂_R(X)} w(ϑ|X) dϑ = 1 − α.

Thus, it is the set which has posterior probability 1 − α. A (1 − α)-credibility set is generally not a (1 − α)-confidence set, i.e., from a frequentist point of view, its properties are not always clear.
Pragmatic point of view. The Bayesian approach is fruitful for the construction of estimators. One can then proceed by studying the frequentist properties of the Bayesian procedure. For example, in the Binomial(n, θ)-model with a uniform prior on θ, the Bayes estimator is

θ̂_Bayes(X) = (X + 1)/(n + 2).

Given this estimator, one can “forget” we obtained it by Bayesian arguments, and study for example its (frequentist) mean square error.
Complexity regularization. Here is a “toy” example where a Bayesian method helps constructing a useful procedure. Let X1, . . . , Xn be independent random variables, where Xi is N(θi, 1)-distributed. The n parameters θi are all unknown. Thus, there are as many observations as unknowns, a situation where complexity regularization is needed. Complexity regularization (see Chapter 16) means that in principle, one allows for any parameter value, but that one pays a price for choosing “complex” values. What “complexity” means depends on the situation at hand. We consider in this example the situation where complexity is the opposite of sparsity, where the sparseness of a vector ϑ is defined as its number of non-zero entries. Consider the estimator

θ̂ := arg min_{ϑ∈R^n} Σ_{i=1}^n (Xi − ϑi)² + 2λ Σ_{i=1}^n |ϑi|,

where λ > 0 is a regularization parameter. Note that when λ = 0, one has θ̂i = Xi for all i, whereas on the other extreme, when λ = ∞, one has θ̂ ≡ 0. The larger λ, the more sparse the estimator will be. In fact, it is easy to verify that for i = 1, . . . , n,

θ̂i = { Xi − λ if Xi > λ
        0      if |Xi| ≤ λ
        Xi + λ if Xi < −λ }.
This is called the soft thresholding estimator. The procedure corresponds to Bayesian maximum a posteriori estimation with a double-exponential (also called Laplace) prior. Indeed, suppose that the prior is θ1, . . . , θn i.i.d. with density

w(z) = (1/(τ√2)) exp(−√2 |z| / τ), z ∈ R,

where τ > 0 is the prior scale parameter (τ² is the variance of this distribution). Given X1, . . . , Xn, the posterior distribution of the vector θ is then

w(ϑ|X1, . . . , Xn) ∝ (2π)^{−n/2} exp(−Σ_{i=1}^n (Xi − ϑi)²/2) × (2τ²)^{−n/2} exp(−√2 Σ_{i=1}^n |ϑi| / τ).

Thus, θ̂ with regularization parameter λ = √2/τ is the maximum a posteriori estimator.
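The claim that soft thresholding minimizes the penalized criterion coordinate-wise can be checked numerically (a sketch with ad hoc names; the brute-force grid search just serves as a reference):

```python
def soft_threshold(x, lam):
    """Closed-form minimizer of (x - t)^2 + 2*lam*|t| over t in R."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def brute_force(x, lam, lo=-10.0, hi=10.0, n=80_001):
    """Minimize the same objective over a fine grid, as a reference."""
    h = (hi - lo) / (n - 1)
    best_t, best_v = lo, float("inf")
    for i in range(n):
        t = lo + i * h
        v = (x - t) ** 2 + 2.0 * lam * abs(t)
        if v < best_v:
            best_t, best_v = t, v
    return best_t

for x in (-3.0, -0.4, 0.7, 2.5):
    print(x, soft_threshold(x, 1.0), brute_force(x, 1.0))
```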
Bayesian methods as theoretical tool. In Chapter 11 we will illustrate the fact that Bayesian methods can be exploited as a tool for proving, for example, frequentist lower bounds. We will see for instance that a Bayes estimator with constant risk is also minimax. The idea in such results is to look for “worst possible priors”.

10.7 Integrating parameters out ?

Striving for flexible prior distributions, one can let them depend on a further “hyper-parameter”, say τ, i.e., in formula

w(ϑ) := w(ϑ|τ).

Keeping τ fixed and integrating ϑ out, the density of X is then

p̃(x|τ) := ∫ p(x|ϑ) w(ϑ|τ) dµ(ϑ).

One can proceed by estimating τ, using for instance maximum likelihood (this is generally computationally quite hard) or the method of moments. One then obtains a prior w(ϑ|τ̂) with estimated parameter τ̂. The prior is thus based on the data. The whole procedure is called empirical Bayes.
Example 10.7.1 Poisson with Gamma prior with hyperparameters
Suppose X1, . . . , Xn are independent and Xi has a Poisson(θi)-distribution, i = 1, . . . , n. Assume moreover that θ1, . . . , θn are i.i.d. with the Gamma(k, λ)-distribution, i.e., each has prior density

w(z|k, λ) = e^{−λz} z^{k−1} λ^k / Γ(k), z > 0.

Both k and λ are considered as hyper-parameters. Then the density of X1, . . . , Xn is

p̃(x1, . . . , xn|k, λ) = ∏_{i=1}^n ∫ e^{−ϑi} (ϑi^{xi}/xi!) e^{−λϑi} ϑi^{k−1} (λ^k/Γ(k)) dϑi
= ∏_{i=1}^n [Γ(xi + k)/(xi! Γ(k))] p^k (1 − p)^{xi},

where p := λ/(1 + λ). Thus, under p̃(·|k, λ), the observations X1 , . . . , Xn are
independent and Xi has a negative binomial distribution with parameters k and
p (check the formula for the negative binomial distribution, see e.g. Example
2.4.1. The mean and variance of the negative binomial distribution can be
calculated directly or looked up in a textbook. We then find (for i = 1, . . . , n),

k(1 − p) k
E(Xi |k, λ) = =
p λ
and
k(1 − p) k(1 + λ)
var(Xi |k, λ) = 2
= .
p λ2
We use the method
Pn of moments2 to estimate k and λ. Let X̄n be the sample
2
mean and Sn := i=1 (Xi − X̄) /(n − 1) be the sample variance. We solve

k̂ k̂(1 + λ̂)
= X̄n , = Sn2 .
λ̂ λ̂2
This yields
X̄n2 X̄n
k̂ = 2
, λ̂ = 2 .
Sn − X̄n Sn − X̄n
For given k and λ, the Bayes estimator of θi is given in Example 10.5.1. We
now insert the estimated values of k and λ to get an empirical Bayes estimator

Xi + k̂
θ̂i = = Xi (1 − X̄n /Sn2 ) + X̄n2 /Sn2 , i = 1, . . . , n.
1 + λ̂
The MLE of θi is Xi itself (i = 1, . . . , n). We see that the empirical Bayes
estimator uses all observations to estimate a particular θi . The empirical Bayes
estimator θ̂i is a convex combination (1 − α)Xi + αX̄n of Xi and X̄n , with
α = X̄n /Sn2 generally close to one if the pooled sample has mean and variance
approximately equal, i.e., if the pooled sample is “Poisson-like”.
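A small numerical sketch of this empirical Bayes procedure (not part of the notes; it assumes Sn² > X̄n, i.e., over-dispersed counts, so that k̂ and λ̂ are positive):

```python
def empirical_bayes_poisson(xs):
    """Empirical Bayes estimates for the Poisson-Gamma model via the
    method of moments; requires sample variance > sample mean
    (over-dispersion), so that k_hat and lam_hat are positive."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    k_hat = xbar ** 2 / (s2 - xbar)
    lam_hat = xbar / (s2 - xbar)
    # theta_hat_i = (X_i + k_hat)/(1 + lam_hat), a convex combination
    # (1 - alpha) X_i + alpha * xbar with alpha = xbar / s2
    return [(x + k_hat) / (1 + lam_hat) for x in xs]

xs = [0, 1, 2, 3, 4, 10]           # over-dispersed toy counts
est = empirical_bayes_poisson(xs)
print(est)                          # each count shrunk toward the mean
```

Note that the shrinkage preserves the total Σθ̂i = ΣXi, since the estimator is a convex combination of each Xi with the common mean X̄n.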
Chapter 11

Proving admissibility and minimaxity

Bayes estimators are quite useful, also for obdurate frequentists. They can be used to construct estimators that are minimax (admissible), or for verification of minimaxity (admissibility).

Let us first recall the definitions. Let X ∈ X have distribution Pθ, θ ∈ Θ. Let T = T(X) be a statistic (estimator, decision), L(θ, a) a loss function, and R(θ, T) := Eθ L(θ, T(X)) the risk of T.

◦ T is minimax if ∀ T′: sup_θ R(θ, T) ≤ sup_θ R(θ, T′).

◦ T is inadmissible if ∃ T′: {∀ θ R(θ, T′) ≤ R(θ, T) and ∃ θ R(θ, T′) < R(θ, T)}.

◦ T is Bayes (for the prior density w on Θ) if ∀ T′: r_w(T) ≤ r_w(T′).

Recall also that the Bayes risk for w is

r_w(T) = ∫ R(ϑ, T) w(ϑ) dµ(ϑ).

Whenever we say that a statistic T is Bayes, without referring to an explicit prior on Θ, we mean that there exists a prior for which T is Bayes. Of course, if the risk R(θ, T) = R(T) does not depend on θ, then the Bayes risk of T does not depend on the prior.

Especially in cases where one wants to use the uniform distribution as prior, but cannot do so because Θ is not bounded, the notion of extended Bayes is useful.

Definition 11.0.1 A statistic T is called extended Bayes if there exists a sequence of prior densities {w_m}_{m=1}^∞ (w.r.t. dominating measures that are allowed to depend on m), such that r_{wm}(T) − inf_{T′} r_{wm}(T′) → 0 as m → ∞.


11.1 Minimaxity

Lemma 11.1.1 Suppose T is a statistic with risk R(θ, T) = R(T) not depending on θ. Then
(i) T admissible ⇒ T minimax,
(ii) T Bayes ⇒ T minimax,
and in fact more generally,
(iii) T extended Bayes ⇒ T minimax.

Proof.
(i) T is admissible, so for all T′, either there is a θ with R(θ, T′) > R(T), or R(θ, T′) ≥ R(T) for all θ. Hence sup_θ R(θ, T′) ≥ R(T).
(ii) Since Bayes implies extended Bayes, this follows from (iii). We nevertheless present a separate proof, as it is somewhat simpler than (iii). Note first that for any T′,

r_w(T′) = ∫ R(ϑ, T′) w(ϑ) dµ(ϑ)    (11.1)
≤ ∫ sup_ϑ R(ϑ, T′) w(ϑ) dµ(ϑ)    (11.2)
= sup_ϑ R(ϑ, T′),    (11.3)

that is, the Bayes risk is always bounded by the supremum risk. Suppose now that T′ is a statistic with sup_θ R(θ, T′) < R(T). Then

r_w(T′) ≤ sup_ϑ R(ϑ, T′) < R(T) = r_w(T),

which is in contradiction with the assumption that T is Bayes.
(iii) Suppose for simplicity that a Bayes decision Tm for the prior wm exists for all m, i.e.

r_{wm}(Tm) = inf_{T′} r_{wm}(T′), m = 1, 2, . . . .

By assumption, for all ε > 0 there exists an m sufficiently large such that, for any T′,

R(T) = r_{wm}(T) ≤ r_{wm}(Tm) + ε ≤ r_{wm}(T′) + ε ≤ sup_θ R(θ, T′) + ε,

because, as we have seen in (11.1), the Bayes risk is bounded by the supremum risk. Since ε can be chosen arbitrarily small, this proves (iii). □
Example 11.1.1 Minimax estimator for the Binomial distribution
Consider a Binomial(n, θ) random variable X. Let the prior on θ ∈ (0, 1) be the Beta(r, s) distribution. Then the Bayes estimator for quadratic loss is

T = (X + r)/(n + r + s)

(see Example 10.5.2). Its risk is

R(θ, T) = Eθ(T − θ)² = varθ(T) + biasθ²(T)
= nθ(1 − θ)/(n + r + s)² + ( (nθ + r)/(n + r + s) − θ )²
= ( [(r + s)² − n]θ² + [n − 2r(r + s)]θ + r² ) / (n + r + s)².

This can only be constant in θ if the coefficients in front of θ² and θ are zero:

(r + s)² − n = 0, n − 2r(r + s) = 0.

Solving for r and s gives

r = s = √n/2.

Plugging these values back into the estimator shows that

T = (X + √n/2)/(n + √n)

is minimax. The minimax risk is

R(T) = 1/(4(√n + 1)²).

We can compare this with the supremum risk of the unbiased estimator X/n:

sup_θ R(θ, X/n) = sup_θ θ(1 − θ)/n = 1/(4n).

So for large n, this does not differ much from the minimax risk.
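One can verify numerically that with r = s = √n/2 the risk is indeed constant in θ (a sketch, not part of the notes):

```python
import math

def risk(theta, n, r, s):
    """Quadratic risk of T = (X + r)/(n + r + s), X ~ Binomial(n, theta):
    variance plus squared bias, in closed form."""
    var = n * theta * (1 - theta) / (n + r + s) ** 2
    bias = (n * theta + r) / (n + r + s) - theta
    return var + bias ** 2

n = 25
r = s = math.sqrt(n) / 2
risks = [risk(t / 100, n, r, s) for t in range(1, 100)]
print(min(risks), max(risks))                # constant in theta
print(1 / (4 * (math.sqrt(n) + 1) ** 2))     # the minimax risk, here 1/144
```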

11.1.1 Minimaxity of the Pitman estimator ?

We consider again the Pitman estimator (see Lemma 9.1.2)

T* = ∫ z p0(X1 − z, . . . , Xn − z) dz / ∫ p0(X1 − z, . . . , Xn − z) dz.

Lemma 11.1.2 T* is extended Bayes (for quadratic loss).

Proof. Let wm be (the density of) the uniform distribution on the interval [−m, m]:

wm = 1[−m,m]/(2m).

The posterior density is then

wm(ϑ|x) = p0(x − ϑ) 1[−m,m](ϑ) / ∫_{−m}^m p0(x − ϑ) dϑ.

The Bayes estimator is thus

Tm = ∫_{−m}^m ϑ p0(x − ϑ) dϑ / ∫_{−m}^m p0(x − ϑ) dϑ.

We now compute R(θ, Tm) = Eθ(Tm − θ)². Let

T_{a,b}(x) := ∫_a^b z p0(x − z) dz / ∫_a^b p0(x − z) dz.

Then for all x, T_{a,b}(x) → T*(x) as a → −∞ and b → ∞. One can easily verify that also

lim_{a→−∞, b→∞} E0 T²_{a,b}(X) = E0 T*²(X).

(Note that, for any prior w, E0 T*²(X) is the Bayes risk r_w(T*), since the risk R(θ, T*) = E0 T*²(X) does not depend on θ.) Moreover

T_{a,b}(X) − θ = ∫_a^b (z − θ) p0(X − z) dz / ∫_a^b p0(X − z) dz = ∫_{a−θ}^{b−θ} v p0(X − θ − v) dv / ∫_{a−θ}^{b−θ} p0(X − θ − v) dv.

It follows that

Eθ(T_{a,b}(X) − θ)² = E0 T²_{a−θ,b−θ}(X).

Hence,

R(θ, Tm) = E0 T²_{−m−θ,m−θ}(X).

The Bayes risk is

r_{wm}(Tm) = E_{θ∼wm} R(θ, Tm) = (1/2m) ∫_{−m}^m E0 T²_{−m−ϑ,m−ϑ}(X) dϑ.

Hence, for any 0 < ε < 1, we have

r_{wm}(Tm) ≥ (1 − ε) inf_{|ϑ|≤m(1−ε)} E0 T²_{−m−ϑ,m−ϑ}(X) ≥ (1 − ε) inf_{a≤−mε, b≥mε} E0 T²_{a,b}(X).

It follows that for any 0 < ε < 1,

lim inf_{m→∞} r_{wm}(Tm) ≥ (1 − ε) lim_{m→∞} inf_{a≤−mε, b≥mε} E0 T²_{a,b}(X) = (1 − ε) E0 T*²(X).

Hence we have r_{wm}(Tm) → E0 T*²(X) = r_{wm}(T*), i.e., r_{wm}(Tm) − r_{wm}(T*) → 0. □

Corollary 11.1.1 T* is minimax (for quadratic loss).

11.2 Admissibility

In this section, the parameter space is assumed to be an open subset of a topological space, so that we can consider open neighborhoods of members of Θ, and continuous functions on Θ. We moreover restrict ourselves to statistics T with R(θ, T) < ∞.

Lemma 11.2.1 Suppose that the statistic T is Bayes for the prior density w. Then (i) or (ii) below are sufficient conditions for the admissibility of T.
(i) The statistic T is the unique Bayes decision (i.e., r_w(T) = r_w(T′) implies that ∀ θ, T = T′ Pθ-almost surely).
(ii) For all T′, R(θ, T′) is continuous in θ, and moreover, for all open U ⊂ Θ, the prior probability Π(U) := ∫_U w(ϑ) dµ(ϑ) of U is strictly positive.

Proof.
(i) Suppose that for some T′, R(θ, T′) ≤ R(θ, T) for all θ. Then also r_w(T′) ≤ r_w(T). Because T is Bayes, we then must have equality:

r_w(T′) = r_w(T).

So then, ∀ θ, T′ and T are equal Pθ-a.s., and hence, ∀ θ, R(θ, T′) = R(θ, T), so that T′ cannot be strictly better than T.
(ii) Suppose that T is inadmissible. Then, for some T′, R(θ, T′) ≤ R(θ, T) for all θ, and, for some θ0, R(θ0, T′) < R(θ0, T). This implies that for some ε > 0 and some open neighborhood U ⊂ Θ of θ0, we have

R(ϑ, T′) ≤ R(ϑ, T) − ε, ϑ ∈ U.

But then

r_w(T′) = ∫_U R(ϑ, T′) w(ϑ) dµ(ϑ) + ∫_{U^c} R(ϑ, T′) w(ϑ) dµ(ϑ)
≤ ∫_U R(ϑ, T) w(ϑ) dµ(ϑ) − εΠ(U) + ∫_{U^c} R(ϑ, T) w(ϑ) dµ(ϑ)
= r_w(T) − εΠ(U) < r_w(T).

We thus arrived at a contradiction. □
Lemma 11.2.2 Suppose that T is extended Bayes, and that for all T′, R(θ, T′) is continuous in θ. In fact assume, for all open sets U ⊂ Θ,

( r_{wm}(T) − inf_{T′} r_{wm}(T′) ) / Πm(U) → 0

as m → ∞. Here Πm(U) := ∫_U wm(ϑ) dµm(ϑ) is the probability of U under the prior Πm. Then T is admissible.

Proof. We start out as in the proof of (ii) in the previous lemma. Suppose that T is inadmissible. Then, for some T′, R(θ, T′) ≤ R(θ, T) for all θ, and, for some θ0, R(θ0, T′) < R(θ0, T), so that for some ε > 0 and some open neighborhood U ⊂ Θ of θ0, we have

R(ϑ, T′) ≤ R(ϑ, T) − ε, ϑ ∈ U.

This would give that for all m,

r_{wm}(T′) ≤ r_{wm}(T) − εΠm(U).

Suppose for simplicity that a Bayes decision Tm for the prior wm exists for all m, i.e.

r_{wm}(Tm) = inf_{T′} r_{wm}(T′), m = 1, 2, . . . .

Then, for all m,

r_{wm}(Tm) ≤ r_{wm}(T′) ≤ r_{wm}(T) − εΠm(U),

or

( r_{wm}(T) − r_{wm}(Tm) ) / Πm(U) ≥ ε > 0,

that is, we arrived at a contradiction. □

11.2.1 Admissible estimators for the normal mean

Let X be N(θ, 1)-distributed, θ ∈ Θ := R, and let R(θ, T) := Eθ(T − θ)² be the quadratic risk. We consider estimators of the form

T = aX + b, a > 0, b ∈ R.

Lemma T is admissible if and only if one of the following cases holds:
(i) a < 1,
(ii) a = 1 and b = 0.

Proof.
(⇐) (i)
First, we show that T is Bayes for some prior. It turns out that this works with a normal prior, i.e., we take θ ∼ N(c, τ²) for some c and τ² to be specified. In Example 10.5.3 we have seen that the Bayes estimator is

T_Bayes = E(θ|X) = (τ²X + c)/(τ² + 1).

Taking

τ²/(τ² + 1) = a, c/(τ² + 1) = b

yields T = T_Bayes.
Next, we check (i) in Lemma 11.2.1, i.e. that T is unique. In view of the calculations in Subsection 10.5.2,

r_w(T′) = E var(θ|X) + E(T − T′)²,

so if r_w(T′) = r_w(T), then

E(T − T′)² = 0.

Here, the expectation is with θ integrated out, i.e., with respect to the measure P with density

p(x) = ∫ p_ϑ(x) w(ϑ) dµ(ϑ).

Now, we can write X = θ + ε, with θ N(c, τ²)-distributed and with ε a standard normal random variable independent of θ. So X is N(c, τ² + 1), that is, P is the N(c, τ² + 1)-distribution. Now, E(T − T′)² = 0 implies T = T′ P-a.s. Since P dominates all Pθ, we conclude that T = T′ Pθ-a.s., for all θ. So T is unique, and hence admissible.
(⇐) (ii)
In this case, T = X. We use Lemma 11.2.2. Because R(θ, T) = 1 for all θ, also r_w(T) = 1 for any prior. Let wm be the density of the N(0, m)-distribution. As we have seen in Example 10.5.3 and also in the previous part of the proof, the Bayes estimator is

Tm = m X/(m + 1).

By the bias-variance decomposition, it has risk

R(θ, Tm) = m²/(m + 1)² + ( m/(m + 1) − 1 )² θ² = m²/(m + 1)² + θ²/(m + 1)².

As Eθ² = m under the prior, the Bayes risk is

r_{wm}(Tm) = m²/(m + 1)² + m/(m + 1)² = m/(m + 1).

It follows that

r_{wm}(T) − r_{wm}(Tm) = 1 − m/(m + 1) = 1/(m + 1).

So T is extended Bayes. But we need to prove the more refined property of Lemma 11.2.2. It is clear that here, we only need to consider open intervals U = (u, u + h), with u and h > 0 fixed. We have

Πm(U) = Φ((u + h)/√m) − Φ(u/√m) = (1/√m) φ(u/√m) h + o(1/√m).

For m large,

φ(u/√m) ≈ φ(0) = 1/√(2π) > 1/4 (say),

so for m sufficiently large (depending on u),

φ(u/√m) ≥ 1/4.

Thus, for m sufficiently large (depending on u and h), we have

Πm(U) ≥ h/(4√m).

We conclude that for m sufficiently large,

( r_{wm}(T) − r_{wm}(Tm) ) / Πm(U) ≤ 4/(h√m).
As the right-hand side converges to zero as m → ∞, this shows that X is admissible.
(⇒)
We now have to show that if neither (i) nor (ii) holds, then T is not admissible. This means we have to consider two cases: a > 1, and a = 1 with b ≠ 0. In the case a > 1, we have R(θ, aX + b) ≥ var(aX + b) = a² > 1 = R(θ, X), so aX + b is not admissible. When a = 1 and b ≠ 0, it is the bias term that makes aX + b inadmissible:

R(θ, X + b) = 1 + b² > 1 = R(θ, X). □

11.3 Admissible estimators in exponential families ?

Lemma 11.3.1 Let θ ∈ Θ = R and {Pθ : θ ∈ Θ} be an exponential family in canonical form:

p_θ(x) = exp[θT(x) − d(θ)] h(x).

Then T is an admissible estimator of g(θ) := ḋ(θ), under quadratic loss (i.e., under the loss L(θ, a) := |a − g(θ)|²).

Proof. Recall that

ḋ(θ) = EθT, d̈(θ) = varθ(T) = I(θ)

(see Section 4.8). Now, let T′ be some estimator, with expectation

EθT′ := q(θ).

The bias of T′ is

b(θ) = q(θ) − g(θ),

or

q(θ) = b(θ) + g(θ) = b(θ) + ḋ(θ).

This implies

q̇(θ) = ḃ(θ) + I(θ).

By the Cramér-Rao lower bound,

R(θ, T′) = varθ(T′) + b²(θ) ≥ [q̇(θ)]²/I(θ) + b²(θ) = [ḃ(θ) + I(θ)]²/I(θ) + b²(θ).

Suppose now that

R(θ, T′) ≤ R(θ, T), ∀ θ.

Because R(θ, T) = I(θ), this implies

[ḃ(θ) + I(θ)]²/I(θ) + b²(θ) ≤ I(θ),

or

I(θ){b²(θ) + 2ḃ(θ)} ≤ −[ḃ(θ)]² ≤ 0.

This in turn implies

b²(θ) + 2ḃ(θ) ≤ 0,

and hence b(θ) is decreasing, and when b(θ) ≠ 0,

ḃ(θ)/b²(θ) ≤ −1/2,

so

(d/dθ)(1/b(θ)) − 1/2 ≥ 0,

or

(d/dθ)( 1/b(θ) − θ/2 ) ≥ 0.

In other words, 1/b(θ) − θ/2 is an increasing function.
We will now show that this gives a contradiction, implying that b(θ) = 0 for all θ.
Suppose instead b(θ0) < 0 for some θ0. Then also b(ϑ) < 0 for all ϑ > θ0, since b(·) is decreasing. It follows that

1/b(ϑ) ≥ 1/b(θ0) + (ϑ − θ0)/2 → ∞, ϑ → ∞,

i.e.,

b(ϑ) → 0, ϑ → ∞.

This is not possible, as b(θ) is a decreasing function with b(θ0) < 0.
Similarly, if b(θ0) > 0, take θ0 ≥ ϑ → −∞, to find again

b(ϑ) → 0, ϑ → −∞,

which is not possible.
We conclude that b(θ) = 0 for all θ, i.e., T′ is an unbiased estimator of g(θ). By the Cramér-Rao lower bound, we now conclude

R(θ, T′) = varθ(T′) ≥ R(θ, T) = I(θ). □
Example 11.3.1 Admissibility of the sample average for estimating
the mean of a normal distribution
Let X be N (θ, 1)-distributed, with θ ∈ R unknown. Then X is an admissible
estimator of θ.

Example 11.3.2 Inadmissibility in the normal distribution with σ² unknown
Let X be N(0, σ²), with σ² ∈ (0, ∞) unknown. Its density is

p_θ(x) = (1/√(2πσ²)) exp(−x²/(2σ²)) = exp[θT(x) − d(θ)] h(x),

with

T(x) = −x²/2, θ = 1/σ²,

d(θ) = (log σ²)/2 = −(log θ)/2,

ḋ(θ) = −1/(2θ) = −σ²/2,

d̈(θ) = 1/(2θ²) = σ⁴/2.

Observe that θ ∈ Θ = (0, ∞), which is not the whole real line. So Lemma 11.3.1 cannot be applied. We will now show that T is not admissible. Define for all a > 0,

T_a := −aX²,

so that T = T_{1/2}. We have

R(θ, T_a) = varθ(T_a) + biasθ²(T_a) = 2a²σ⁴ + [a − 1/2]²σ⁴.

Thus, R(θ, T_a) is minimized at a = 1/6, giving

R(θ, T_{1/6}) = σ⁴/6 < σ⁴/2 = R(θ, T).

11.4 Inadmissibility in higher-dimensional settings ?

Let (for i = 1, . . . , p) Xi ∼ N(θi, 1) and let X1, . . . , Xp be independent. The vector θ := (θ1, . . . , θp) ∈ R^p is unknown. For an estimator T = (T1, . . . , Tp) ∈ R^p, we define the risk

R(θ, T) := Σ_{i=1}^p Eθ(Ti − θi)².

Note that R(θ, X) = p, where X := (X1, . . . , Xp). One can moreover show (in a similar way as for the case p = 1) that X is minimax, extended Bayes, UMRE, and that it reaches the Cramér-Rao lower bound. But for p > 2, X is inadmissible. This follows from the lemma below, which shows that X can be improved by Stein's estimator. We use the notation ‖X‖² := Σ_{i=1}^p Xi².

Definition 11.4.1 Let p > 2 and let 0 < b < 2(p − 2) be some constant. Stein's estimator is

T* := (1 − b/‖X‖²) X.

Lemma 11.4.1 We have

R(θ, T*) = p − [2b(p − 2) − b²] Eθ(1/‖X‖²).

Proof. We first calculate

Eθ(T*_i − θi)² = Eθ( (1 − b/‖X‖²)Xi − θi )²
= Eθ( (Xi − θi) − bXi/‖X‖² )²
= Eθ(Xi − θi)² + b² Eθ[Xi²/‖X‖⁴] − 2b Eθ[Xi(Xi − θi)/‖X‖²]
= 1 + b² Eθ[Xi²/‖X‖⁴] − 2b Eθ[Xi(Xi − θi)/‖X‖²].

Consider now the expectation in the last term, with i = 1 (say). Using (x1 − θ1)φ(x1 − θ1) dx1 = −dφ(x1 − θ1) and partial integration,

Eθ[X1(X1 − θ1)/‖X‖²] = ∫ [x1(x1 − θ1)/‖x‖²] ∏_{i=1}^p φ(xi − θi) dxi
= ∫ ( ∫ [x1(x1 − θ1)/‖x‖²] φ(x1 − θ1) dx1 ) ∏_{i=2}^p φ(xi − θi) dxi
= −∫ ( ∫ [x1/‖x‖²] dφ(x1 − θ1) ) ∏_{i=2}^p φ(xi − θi) dxi
= ∫ ( ∫ φ(x1 − θ1) d[x1/‖x‖²] ) ∏_{i=2}^p φ(xi − θi) dxi
= ∫ ( ∫ φ(x1 − θ1) [1/‖x‖² − 2x1²/‖x‖⁴] dx1 ) ∏_{i=2}^p φ(xi − θi) dxi
= ∫ [1/‖x‖² − 2x1²/‖x‖⁴] ∏_{i=1}^p φ(xi − θi) dxi
= Eθ[ 1/‖X‖² − 2X1²/‖X‖⁴ ].

The same calculation can be done for all other i. Inserting the result in our formula for Eθ(T*_i − θi)² gives

Eθ(T*_i − θi)² = 1 + b² Eθ[Xi²/‖X‖⁴] − 2b Eθ[1/‖X‖² − 2Xi²/‖X‖⁴]
= 1 + (b² + 4b) Eθ[Xi²/‖X‖⁴] − 2b Eθ[1/‖X‖²].

It follows that

R(θ, T*) = p + (b² + 4b) Eθ[ Σ_{i=1}^p Xi²/‖X‖⁴ ] − 2bp Eθ[1/‖X‖²]
= p − [2b(p − 2) − b²] Eθ[1/‖X‖²]. □
We thus have the surprising fact that Stein's estimator of θi uses also the observations Xj with j ≠ i, even though these observations are independent of Xi and have a distribution which does not depend on θi.

Note that [2b(p − 2) − b²] is maximized for b = p − 2. So the value b = p − 2 gives the maximal improvement over X. Stein's estimator is then

T* = (1 − (p − 2)/‖X‖²) X.

Remark It turns out that Stein's estimator is itself also inadmissible!

Remark Let g(θ) := Eθ[1/‖X‖²]. One can show that g(0) = 1/(p − 2). Moreover, g(θ) ↓ 0 as ‖θ‖ ↑ ∞, so R(θ, T*) ≈ R(θ, X) for ‖θ‖ large.

Remark Let us take an empirical Bayesian point of view. Suppose θ1, . . . , θp are i.i.d. with the N(0, τ²)-distribution. If τ² is known, the Bayes estimator is

T_{i,Bayes} = τ²/(1 + τ²) Xi, i = 1, . . . , p

(see Example 5.2.1). Given θi, Xi ∼ N(θi, 1) (i = 1, . . . , p). So unconditionally, Xi ∼ N(0, 1 + τ²) (i = 1, . . . , p). Thus, unconditionally, X1, . . . , Xp are identically distributed, each having the N(0, σ²)-distribution with σ² = 1 + τ². As estimator of the variance σ² we may use the sample version σ̂² := Σ_{i=1}^p Xi²/p = ‖X‖²/p (we need not center with the sample average as the unconditional mean of the Xi is known to be zero). That is, we estimate τ² by

τ̂² := σ̂² − 1 = ‖X‖²/p − 1.

This leads to the empirical Bayes estimator

T_{i,emp. Bayes} := τ̂²/(1 + τ̂²) Xi = (1 − p/‖X‖²) Xi.

This shows that when p > 4, Stein's estimator with b = p is an empirical Bayes estimator.
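A small Monte Carlo sketch (not part of the notes; sample size and seed are arbitrary) comparing the risk of X with that of the Stein-type estimator (1 − b/‖X‖²)X at θ = 0, where the improvement is largest:

```python
import random

def risks(theta, b, reps=2000, seed=1):
    """Monte Carlo risk of the MLE X and of the Stein-type estimator
    (1 - b/||X||^2) X, for independent X_i ~ N(theta_i, 1)."""
    rng = random.Random(seed)
    mle = stein = 0.0
    for _ in range(reps):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        n2 = sum(v * v for v in x)
        shrink = 1.0 - b / n2
        mle += sum((xi - t) ** 2 for xi, t in zip(x, theta))
        stein += sum((shrink * xi - t) ** 2 for xi, t in zip(x, theta))
    return mle / reps, stein / reps

p = 10
r_mle, r_stein = risks([0.0] * p, b=p - 2)
print(r_mle, r_stein)   # r_mle is close to p; r_stein is much smaller
```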
Chapter 12

The linear model

Consider n independent observations Y1, . . . , Yn. This time we do not assume that they are identically distributed. Let X ∈ R^{n×p} be a given matrix with (non-random) entries {xi,j : i = 1, . . . , n, j = 1, . . . , p}. The matrix X is considered as (fixed) input and the vector Y = (Y1, . . . , Yn)^T as (random) output. One also calls the columns of X the co-variables. The matrix X is the design matrix. We assume it to be non-random, that is, we consider the case of fixed design.

12.1 Definition of the least squares estimator

We aim at predicting Y given X and decide to do this by linear approximation: we look for the best linear approximation of Yi given xi,1, . . . , xi,p. We measure the fit using the residual sum of squares. This means that we minimize

Σ_{i=1}^n ( Yi − a − Σ_{j=1}^p xi,j bj )²

over a ∈ R and b = (b1, . . . , bp)^T ∈ R^p.

To simplify the expressions, we rename the quantities involved as follows. Define for all i, xi,p+1 := 1 and define bp+1 := a. Then for all i, a + Σ_{j=1}^p xi,j bj = Σ_{j=1}^{p+1} xi,j bj. In other words, if we put in the matrix X a column containing only 1's, then we may omit the constant a. Thus, putting the column of only 1's in front and replacing p + 1 by p, we let

X := [ 1  x1,2  ···  x1,p
       1  x2,2  ···  x2,p
       ⋮   ⋮          ⋮
       1  xn,2  ···  xn,p ].

Then we minimize

Σ_{i=1}^n ( Yi − Σ_{j=1}^p xi,j bj )²

over b = (b1, . . . , bp)^T ∈ R^p.

Let us denote the Euclidean norm of a vector v ∈ R^n by¹

‖v‖2 := ( Σ_{i=1}^n vi² )^{1/2}.

Write Y = (Y1, . . . , Yn)^T. Then

Σ_{i=1}^n ( Yi − Σ_{j=1}^p xi,j bj )² = ‖Y − Xb‖2².

Definition 12.1.1 Suppose X has rank p. One calls

β̂ := arg min_{b∈R^p} ‖Y − Xb‖2²

the least squares estimator.

We say that the least squares estimator β̂ is obtained by (linear) regression of Y on X. The distance between Y and the space {Xb : b ∈ R^p} spanned by the columns of X is minimized by projecting Y on this space. In fact, one has

(1/2) ∂/∂b ‖Y − Xb‖2² = −X^T(Y − Xb).

It follows that β̂ is a solution of the so-called normal equations

X^T(Y − Xβ̂) = 0,

or

X^T Y = X^T X β̂.

If X has rank p, the matrix X^T X has an inverse (X^T X)^{−1} and we get

β̂ = (X^T X)^{−1} X^T Y.

The projection of Y on {Xb : b ∈ R^p} is

X(X^T X)^{−1} X^T Y.

Recall that a projection is a linear map of the form P P^T such that P^T P = I. We can write X(X^T X)^{−1} X^T = P P^T.²

¹ We sometimes omit the subscript “2”.
² Write the singular value decomposition of X as X = P Φ Q^T, where Φ = diag(φ1, . . . , φp) contains the singular values and where P^T P = I and Q^T Q = I.

Example with p = 1
For p = 1,

X = [ 1  x1
      1  x2
      ⋮   ⋮
      1  xn ].

Then

X^T X = [ n               Σ_{i=1}^n xi
          Σ_{i=1}^n xi    Σ_{i=1}^n xi² ],

(X^T X)^{−1} = ( n Σ_{i=1}^n xi² − (Σ_{i=1}^n xi)² )^{−1} [ Σ_{i=1}^n xi²    −Σ_{i=1}^n xi
                                                            −Σ_{i=1}^n xi    n ]
= ( Σ_{i=1}^n (xi − x̄)² )^{−1} [ (1/n) Σ_{i=1}^n xi²    −x̄
                                  −x̄                     1 ].

Moreover

X^T Y = [ n Ȳ
          Σ_{i=1}^n xi Yi ].

We now let (changing notation: α̂ := β̂1, β̂ := β̂2)

( α̂ ; β̂ ) = (X^T X)^{−1} X^T Y
= ( Σ_{i=1}^n (xi − x̄)² )^{−1} [ (1/n) Σ xi²   −x̄ ; −x̄   1 ] [ n Ȳ ; Σ xi Yi ]
= ( Σ_{i=1}^n (xi − x̄)² )^{−1} [ Σ xi² Ȳ − x̄ Σ xi Yi ; −n x̄ Ȳ + Σ xi Yi ]
= ( Σ_{i=1}^n (xi − x̄)² )^{−1} [ Σ (xi − x̄)² Ȳ − x̄ (Σ xi Yi − n x̄ Ȳ) ; Σ xi Yi − n x̄ Ȳ ].

Here we used that Σ_{i=1}^n xi² = Σ_{i=1}^n (xi − x̄)² + n x̄². We can moreover write

Σ_{i=1}^n xi Yi − n x̄ Ȳ = Σ_{i=1}^n (xi − x̄)(Yi − Ȳ).

Thus

α̂ = Ȳ − β̂ x̄, β̂ = Σ_{i=1}^n (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^n (xi − x̄)².
126 CHAPTER 12. THE LINEAR MODEL

2.0
1.5
1.0
y

0.5
0.0
-0.5

0.0 0.2 0.4 0.6 0.8 1.0

Simulated data with Y = .3 + .6 × x + ,  ∼ N (0, 14 ), α̂ = .19 , β̂ = .740

12.2 Intermezzo: the χ² distribution

Let Z_1, …, Z_p be i.i.d. N(0, 1)-distributed. Define the p-vector

    Z := (Z_1, …, Z_p)^T.

Then Z is N(0, I)-distributed, with I the p × p identity matrix. The χ²-
distribution with p degrees of freedom is defined as the distribution of

    ‖Z‖₂² := Σ_{j=1}^p Z_j².

Notation: ‖Z‖₂² ∼ χ²_p.

For a symmetric positive definite matrix Σ, one can define the square root Σ^{1/2}
as a symmetric positive definite matrix satisfying

    Σ^{1/2} Σ^{1/2} = Σ.

Its inverse is denoted by Σ^{-1/2} (which is the square root of Σ^{-1}). If Z ∈ R^p is
N(0, Σ)-distributed, the transformed vector

    Z̃ := Σ^{-1/2} Z

is N(0, I)-distributed. It follows that

    Z^T Σ^{-1} Z = Z̃^T Z̃ = ‖Z̃‖₂² ∼ χ²_p.
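A quick numerical check (our own, with an eigendecomposition-based square root) that Σ^{1/2}Σ^{1/2} = Σ and Z^T Σ^{-1} Z = ‖Σ^{-1/2} Z‖₂², as used above:

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # symmetric positive definite

# Symmetric square root via the eigendecomposition Sigma = U diag(w) U^T
w, U = np.linalg.eigh(Sigma)
Sigma_half = U @ np.diag(np.sqrt(w)) @ U.T
Sigma_minus_half = U @ np.diag(1 / np.sqrt(w)) @ U.T

Z = np.array([1.0, -2.0])                 # one realization of Z
quad_form = Z @ np.linalg.solve(Sigma, Z)            # Z^T Sigma^{-1} Z
whitened_norm = np.sum((Sigma_minus_half @ Z) ** 2)  # ||Sigma^{-1/2} Z||^2
```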



12.3 Distribution of the least squares estimator

Definition 12.3.1 For f := EY we let β* := (X^T X)^{-1} X^T f and we call Xβ*
the best linear approximation of f.

Lemma 12.3.1 Suppose E[εε^T] = σ² I, where ε := Y − f. Then
i) E β̂ = β*, Cov(β̂) = σ² (X^T X)^{-1},
ii) E ‖X(β̂ − β*)‖₂² = σ² p,
iii) E ‖X β̂ − f‖₂² = σ² p + ‖Xβ* − f‖₂²,
where the first term on the right in iii) is the estimation error and the second
the misspecification error.

Proof.
i) By straightforward computation,

    β̂ − β* = (X^T X)^{-1} X^T ε =: A ε.

We therefore have

    E(β̂ − β*) = A Eε = 0,

and the covariance matrix of β̂ is

    Cov(β̂) = Cov(Aε) = A Cov(ε) A^T = σ² A A^T = σ² (X^T X)^{-1},

using Cov(ε) = σ² I.
ii) Define the projection P P^T := X(X^T X)^{-1} X^T. Then

    ‖X(β̂ − β*)‖₂² = ‖P P^T ε‖₂² = Σ_{j=1}^p V_j²,

where V := P^T ε,

    EV = P^T Eε = 0,

and

    Cov(V) = P^T Cov(ε) P = σ² I.

It follows that

    E Σ_{j=1}^p V_j² = Σ_{j=1}^p E V_j² = σ² p.

iii) It holds by Pythagoras’ rule for all b that

    ‖Xb − f‖₂² = ‖X(b − β*)‖₂² + ‖Xβ* − f‖₂²,

since Xβ* − f is orthogonal to the column space of X. □
Lemma 12.3.2 Suppose ε := Y − f ∼ N(0, σ² I). Then we have
i) β̂ − β* ∼ N(0, σ² (X^T X)^{-1}),
ii) ‖X(β̂ − β*)‖₂² / σ² ∼ χ²_p, where χ²_p is the χ²-distribution with p degrees of
freedom (see Section 12.2 for a definition).

Proof.
i) Since β̂ is a linear function of the multivariate normal ε, the least squares
estimator β̂ is also multivariate normal.
ii) Define the projection P P^T := X(X^T X)^{-1} X^T. Then

    ‖X(β̂ − β*)‖₂² = ‖P P^T ε‖₂² = Σ_{j=1}^p V_j².

Now V := P^T ε has i.i.d. N(0, σ²) entries. □
Remark The misspecification error ‖Xβ* − f‖₂² comes from the possible
misspecification of the linear model. That is, f need not be a linear combination
of the columns of X. One sometimes also calls ‖Xβ* − f‖₂² the approximation
error. The estimation error is here the variance term σ² p.
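Lemma 12.3.1 ii) can be illustrated by simulation. The sketch below is our own (the mean vector f and all constants are arbitrary choices); it uses X(β̂ − β*) = Hε with H the hat matrix, and the Monte Carlo average should be close to σ²p:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 50, 4, 1.5
X = rng.standard_normal((n, p))
f = np.sin(np.arange(n))                 # arbitrary mean vector, not linear in X
H = X @ np.linalg.solve(X.T @ X, X.T)    # projection on the column space of X

# Monte Carlo estimate of E ||X(beta_hat - beta_star)||^2 = sigma^2 p
reps = 2000
total = 0.0
for _ in range(reps):
    eps = sigma * rng.standard_normal(n)
    total += np.sum((H @ eps) ** 2)      # X(beta_hat - beta_star) = H eps
mc_estimate = total / reps
```

With σ² = 2.25 and p = 4, the target value is σ²p = 9, and the Monte Carlo estimate should land close to it.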
Remark More generally, many estimators are approximately normally dis-
tributed (for example the sample median) and many test statistics have ap-
proximately a χ2 null-distribution (for example the χ2 goodness-of-fit statis-
tic). This phenomenon occurs because many models can in a certain sense be
approximated by the linear model and many minus log-likelihoods resemble the
least squares loss function (see Chapter 14). Understanding the linear model is
a first step towards understanding a wide range of more complicated models.
Corollary 12.3.1 Suppose the linear model is well-specified: for some β ∈ R^p,

    EY = Xβ.

Assume ε := Y − EY ∼ N(0, σ² I), where σ² := σ₀² is known. Then a test for

    H₀ : β = β₀

is: reject H₀ when ‖X(β̂ − β₀)‖₂² / σ₀² > G_p^{-1}(1 − α),
where G_p is the distribution function of a χ²_p-distributed random variable.
Remark When σ² is unknown one may estimate it using the estimator

    σ̂² = ‖ε̂‖₂² / (n − p),

where ε̂ := Y − X β̂ is the vector of residuals. Under the assumptions of the
previous corollary (but now with possibly unknown σ²) the test statistic
‖X(β̂ − β₀)‖₂² / (p σ̂²) has a so-called F-distribution with p and n − p degrees
of freedom.

12.4 Intermezzo: some matrix algebra

Let z ∈ R^p be a vector and B ∈ R^{q×p} be a q × p matrix (p ≥ q) with rank q.
Moreover, let V ∈ R^{p×p} be a positive definite p × p matrix.

Lemma 12.4.1 We have

    max_{a ∈ R^p : Ba=0} {2a^T z − a^T a} = z^T z − z^T B^T (B B^T)^{-1} B z.

Proof. We use a Lagrange multiplier λ ∈ R^q. We have

    (1/2) ∂/∂a {2a^T z − a^T a + 2a^T B^T λ} = z − a + B^T λ.

Hence for

    a* := arg max_{a ∈ R^p : Ba=0} {2a^T z − a^T a},

we have

    z − a* + B^T λ = 0,

or

    a* = z + B^T λ.

The restriction B a* = 0 gives

    Bz + B B^T λ = 0.

So

    λ = −(B B^T)^{-1} B z.

Inserting this in the solution a* gives

    a* = z − B^T (B B^T)^{-1} B z.

Now

    a*^T a* = ( z − B^T (B B^T)^{-1} B z )^T ( z − B^T (B B^T)^{-1} B z )
            = z^T z − z^T B^T (B B^T)^{-1} B z.

So

    2 a*^T z − a*^T a* = z^T z − z^T B^T (B B^T)^{-1} B z. □
Lemma 12.4.2 We have

    max_{a ∈ R^p : Ba=0} {2a^T z − a^T V a}
        = z^T V^{-1} z − z^T V^{-1} B^T ( B V^{-1} B^T )^{-1} B V^{-1} z.

Proof. Make the transformation b := V^{1/2} a, y := V^{-1/2} z, and C := B V^{-1/2}.
Then

    max_{a : Ba=0} {2a^T z − a^T V a} = max_{b : Cb=0} {2b^T y − b^T b}
        = y^T y − y^T C^T (C C^T)^{-1} C y
        = z^T V^{-1} z − z^T V^{-1} B^T ( B V^{-1} B^T )^{-1} B V^{-1} z. □
Corollary 12.4.1 Let L(a) := 2a^T z − a^T V a. The difference between the un-
restricted maximum and the restricted maximum of L(a) is

    max_a L(a) − max_{a : Ba=0} L(a) = z^T V^{-1} B^T ( B V^{-1} B^T )^{-1} B V^{-1} z.
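The identity in Corollary 12.4.1 can be verified numerically by maximizing over the null space of B (a parametrization we introduce for the check; not from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 5, 2
z = rng.standard_normal(p)
A = rng.standard_normal((p, p))
V = A @ A.T + p * np.eye(p)              # positive definite
B = rng.standard_normal((q, p))          # rank q (a.s.)

Vinv = np.linalg.inv(V)

# Unrestricted maximum of L(a) = 2 a^T z - a^T V a, attained at a = V^{-1} z
unrestricted = z @ Vinv @ z

# Restricted maximum over {a : B a = 0}: parametrize a = N u, the columns
# of N an orthonormal basis of the null space of B (last rows of V^T in the SVD)
_, _, Vt = np.linalg.svd(B)
N = Vt[q:].T                             # p x (p - q) null-space basis
u = np.linalg.solve(N.T @ V @ N, N.T @ z)
a_star = N @ u
restricted = 2 * a_star @ z - a_star @ V @ a_star

# Closed-form gap from Corollary 12.4.1
gap = z @ Vinv @ B.T @ np.linalg.solve(B @ Vinv @ B.T, B @ Vinv @ z)
```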

12.5 Testing a linear hypothesis

In this section we assume the model

    Y = Xβ + ε,

where ε ∼ N(0, σ² I). We want to test the hypothesis

    H₀ : Bβ = 0,

where B ∈ R^{q×p} is a given q × p matrix.
Let

    β̂₀ := arg min_{b ∈ R^p : Bb=0} ‖Y − Xb‖₂²

be the least squares estimator under the restriction Bb = 0.

Lemma 12.5.1 Under H₀,

    ( ‖Y − X β̂₀‖₂² − ‖Y − X β̂‖₂² ) / σ²

has a χ²_q-distribution.
Proof. Since ‖Y − Xb‖₂² = ‖ε‖₂² − 2ε^T X(b − β) + (b − β)^T X^T X (b − β), we have
under H₀ (write b̃ := b − β, and note that Bb = 0 is then equivalent to B b̃ = 0)

    β̂₀ − β = arg max_{b̃ ∈ R^p : B b̃ = 0} { 2ε^T X b̃ − b̃^T X^T X b̃ }.

Therefore, invoking Corollary 12.4.1,

    ‖Y − X β̂₀‖₂² − ‖Y − X β̂‖₂²
        = ε^T X (X^T X)^{-1} B^T ( B (X^T X)^{-1} B^T )^{-1} B (X^T X)^{-1} X^T ε
        = Z^T ( B (X^T X)^{-1} B^T )^{-1} Z.

The q-vector

    Z := B (X^T X)^{-1} X^T ε

has a multivariate normal distribution with mean zero and covariance matrix

    σ² B (X^T X)^{-1} X^T ( B (X^T X)^{-1} X^T )^T = σ² B (X^T X)^{-1} B^T.

It follows that under H₀,

    ( ‖Y − X β̂₀‖₂² − ‖Y − X β̂‖₂² ) / σ²

has a χ²_q-distribution. □
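The key identity in the proof, restricted RSS minus unrestricted RSS equals Z^T(B(X^T X)^{-1}B^T)^{-1}Z, can be verified numerically. In this sketch (ours) the restricted estimator is computed by parametrizing {b : Bb = 0} with a null-space basis of B:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 30, 4, 2
X = rng.standard_normal((n, p))
B = rng.standard_normal((q, p))
beta = np.zeros(p)                       # H0 : B beta = 0 holds trivially
Y = X @ beta + rng.standard_normal(n)    # sigma = 1

# Unrestricted least squares
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
rss = np.sum((Y - X @ beta_hat) ** 2)

# Restricted least squares over {b : B b = 0}, via a null-space basis of B
_, _, Vt = np.linalg.svd(B)
N = Vt[q:].T                             # p x (p - q)
XN = X @ N
beta0_hat = N @ np.linalg.solve(XN.T @ XN, XN.T @ Y)
rss0 = np.sum((Y - X @ beta0_hat) ** 2)

# Quadratic form from the proof, with Z = B (X^T X)^{-1} X^T eps
XtXinv = np.linalg.inv(X.T @ X)
eps = Y - X @ beta
Z = B @ XtXinv @ X.T @ eps
quad = Z @ np.linalg.solve(B @ XtXinv @ B.T, Z)
```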
Chapter 13

Asymptotic theory

In this and subsequent chapters, the observations X_1, …, X_n are considered as
the first n of an infinite sequence of i.i.d. random variables X_1, …, X_n, … with
values in X and with distribution P. We say that the X_i are i.i.d. copies of
some random variable X ∈ X with distribution P. We let IP = P × P × ⋯ be
the distribution of the whole sequence {X_i}_{i=1}^∞.

The model class for P is


P := {Pθ : θ ∈ Θ}.
When P = Pθ , we write IP = IPθ = Pθ × Pθ × · · ·. The parameter of interest is

γ := g(θ) ∈ Rp ,

where g : Θ → Rp is a given function. We let

Γ := {g(θ) : θ ∈ Θ}

be the parameter space for γ.


An estimator
Tn (X1 , . . . , Xn )
based on the data X1 , . . . , Xn , is some function Tn (·) evaluated at the data
X1 , . . . , Xn . We often write shorthand

Tn = Tn (X1 , . . . , Xn ).

We assume the estimator Tn is defined for all n, i.e., we actually consider a
sequence of estimators {Tn}_{n=1}^∞. We are interested in estimators Tn ∈ Γ of γ.

Remark Under the i.i.d. assumption, it is natural to assume that each Tn is a


symmetric function of the data, that is,

    Tn(X_1, …, X_n) = Tn(X_{π1}, …, X_{πn})


for all permutations π of {1, . . . , n}. In that case, one can write Tn in the form
Tn = Q(P̂n ), where P̂n is the empirical distribution (see also Subsection 2.4.1).

13.1 Types of convergence

Definition 13.1.1 Let {Zn}_{n=1}^∞ and Z be R^p-valued random variables defined
on the same probability space¹. We say that Zn converges in probability to Z
if for all ε > 0,

    lim_{n→∞} IP(‖Zn − Z‖ > ε) = 0.

Notation: Zn →^{IP} Z.

Remark Chebyshev’s inequality can be a tool to prove convergence in proba-
bility. It says that for all increasing functions ψ : [0, ∞) → [0, ∞), one has

    IP(‖Zn − Z‖ ≥ ε) ≤ IE ψ(‖Zn − Z‖) / ψ(ε).

Definition 13.1.2 Let {Zn}_{n=1}^∞ and Z be R^p-valued random variables. We
say that Zn converges in distribution to Z if for all continuous and bounded
functions f,

    lim_{n→∞} IEf(Zn) = IEf(Z).

Notation: Zn →^{D} Z.

Remark Convergence in probability implies convergence in distribution, but
not the other way around.
Example 13.1.1 The central limit theorem (CLT)
Let X_1, X_2, … be i.i.d. real-valued random variables with mean µ and variance
σ². Let X̄n := Σ_{i=1}^n X_i / n be the average of the first n. Then by the central
limit theorem (CLT),

    √n (X̄n − µ) →^{D} N(0, σ²),

that is,

    IP( √n (X̄n − µ)/σ ≤ z ) → Φ(z), ∀ z.

The following theorem says that for convergence in distribution, one actually
can do with one-dimensional random variables. We omit the proof.

Theorem 13.1.1 (Cramér-Wold device) Let ({Zn}, Z) be a collection of R^p-
valued random variables. Then

    Zn →^{D} Z  ⇔  a^T Zn →^{D} a^T Z  ∀ a ∈ R^p.

¹ Let (Ω, A, IP) be a probability space, and X : Ω → R^p and Y : Ω → R^q be two measurable
maps. Then X and Y are called random variables, and they are defined on the same probability
space Ω.

Example 13.1.2 Multivariate CLT
Let X_1, X_2, … be i.i.d. copies of a random variable X = (X^{(1)}, …, X^{(p)})^T in
R^p. Assume EX := µ = (µ_1, …, µ_p)^T and Σ := Cov(X) := EXX^T − µµ^T
exist. Then for all a ∈ R^p,

    IE a^T X = a^T µ,   var(a^T X) = a^T Σ a.

Define

    X̄n := (X̄n^{(1)}, …, X̄n^{(p)})^T.

By the 1-dimensional CLT, for all a ∈ R^p,

    √n (a^T X̄n − a^T µ) →^{D} N(0, a^T Σ a).

The Cramér-Wold device therefore gives the p-dimensional CLT

    √n (X̄n − µ) →^{D} N(0, Σ).

We state the Portmanteau Theorem:

Theorem 13.1.2 Let ({Zn}, Z) be a collection of R^p-valued random variables.
Denote the distribution of Z by Q and let G = Q(Z ≤ ·) be its distribution
function. The following statements are equivalent:
(i) Zn →^{D} Z (i.e., IEf(Zn) → IEf(Z) ∀ f bounded and continuous).
(ii) IEf(Zn) → IEf(Z) ∀ f bounded and Lipschitz².
(iii) IEf(Zn) → IEf(Z) ∀ f bounded and Q-a.s. continuous.
(iv) IP(Zn ≤ z) → G(z) for all G-continuity points z.

13.1.1 Stochastic order symbols

Let {Zn} be a collection of R^p-valued random variables, and let {rn} be strictly
positive random variables. We write

    Zn = O_IP(1)

(Zn is bounded in probability) if

    lim_{M→∞} lim sup_{n→∞} IP(‖Zn‖ > M) = 0.

This is also called uniform tightness of the sequence {Zn}. We write Zn =
O_IP(rn) if Zn/rn = O_IP(1).
If Zn converges in probability to zero, we write this as

    Zn = o_IP(1).

Moreover, Zn = o_IP(rn) (Zn is of small order rn in probability) if Zn/rn = o_IP(1).

² A real-valued function f on (a subset of) R^p is Lipschitz if for a constant C_L and all (z, z̃)
in the domain of f, |f(z) − f(z̃)| ≤ C_L ‖z − z̃‖.

13.1.2 Some implications of convergence

Lemma 13.1.1 Suppose that Zn converges in distribution. Then Zn = O_IP(1).

Proof. To simplify, take p = 1 (use the Cramér-Wold device for general p). Let
Zn →^{D} Z, where Z has distribution function G. Then for every G-continuity
point M,

    IP(Zn > M) → 1 − G(M),

and for every G-continuity point −M,

    IP(Zn ≤ −M) → G(−M).

Since 1 − G(M) as well as G(−M) converge to zero as M → ∞, the result
follows. □

Example 13.1.3 Averages differ from their means by an order 1/√n
in probability
Let X_1, X_2, … be i.i.d. copies of a random variable X ∈ R with EX = µ and
var(X) < ∞. Then by the CLT,

    X̄n − µ = O_IP(1/√n).
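A seeded illustration (our own) of this order statement: √n(X̄n − µ) stays of the same magnitude as n grows by factors of 100:

```python
import numpy as np

rng = np.random.default_rng(5)
mu = 2.0

# sqrt(n) * (mean - mu) stays bounded as n grows
scaled = []
for n in [100, 10_000, 1_000_000]:
    x = rng.exponential(scale=mu, size=n)   # EX = mu, finite variance
    scaled.append(np.sqrt(n) * (x.mean() - mu))
```

Each entry of `scaled` is a draw whose distribution is close to N(0, µ²), so the values fluctuate on the same scale for every n.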

Theorem 13.1.3 (Slutsky’s Theorem) Let ({Zn, An}, Z) be a collection of R^p-
valued random variables, and a ∈ R^p be a vector of constants. Assume that
Zn →^{D} Z and An →^{IP} a. Then

    An^T Zn →^{D} a^T Z.

Proof. Take a bounded Lipschitz function f, say

    |f| ≤ C_B,  |f(z) − f(z̃)| ≤ C_L ‖z − z̃‖.

Then

    |IEf(An^T Zn) − IEf(a^T Z)|
        ≤ |IEf(An^T Zn) − IEf(a^T Zn)| + |IEf(a^T Zn) − IEf(a^T Z)|.

Because the function z ↦ f(a^T z) is bounded and Lipschitz (with Lipschitz
constant ‖a‖ C_L), we know that the second term goes to zero. As for the first
term, we argue as follows. Let ε > 0 and M > 0 be arbitrary. Define Sn :=
{‖Zn‖ ≤ M, ‖An − a‖ ≤ ε}. Then

    |IEf(An^T Zn) − IEf(a^T Zn)| ≤ IE|f(An^T Zn) − f(a^T Zn)|
        = IE |f(An^T Zn) − f(a^T Zn)| 1{Sn} + IE |f(An^T Zn) − f(a^T Zn)| 1{Sn^c}
        ≤ C_L ε M + 2 C_B IP(Sn^c).                                  (13.1)

Now

    IP(Sn^c) ≤ IP(‖Zn‖ > M) + IP(‖An − a‖ > ε).

Thus, both terms in (13.1) can be made arbitrarily small by appropriately
choosing ε small and n and M large. □

13.2 Consistency and asymptotic normality

Definition 13.2.1 A sequence of estimators {Tn} of γ = g(θ) is called consis-
tent if

    Tn →^{IPθ} γ.

Definition 13.2.2 A sequence of estimators {Tn} of γ = g(θ) is called asymp-
totically normal with asymptotic covariance matrix Vθ if

    √n (Tn − γ) →^{Dθ} N(0, Vθ).

Example 13.2.1 Consistency and asymptotic normality of the average
Suppose P is the location model

    P = { Pµ,F0 : Pµ,F0(X ≤ ·) := F0(· − µ), µ ∈ R, F0 ∈ F0 }.

The parameter is then θ = (µ, F0) and Θ = R × F0. We assume for all F0 ∈ F0

    ∫ x dF0(x) = 0,  σ²_{F0} := ∫ x² dF0(x) < ∞.

Let g(θ) := µ and Tn := (X_1 + ⋯ + X_n)/n =: X̄n. Then, by the law of large
numbers, Tn is a consistent estimator of µ and, by the central limit theorem,

    √n (Tn − µ) →^{Dθ} N(0, σ²_{F0}).

13.3 Asymptotic linearity

As we will show, for many estimators asymptotic normality is a consequence
of asymptotic linearity, that is, the estimator is approximately an average, to
which we can apply the CLT.

Definition 13.3.1 The sequence of estimators {Tn} of γ = g(θ) ∈ R^p is called
asymptotically linear if for a function lθ : X → R^p, with Eθ lθ(X) = 0 and³

    Eθ lθ(X) lθ^T(X) =: Vθ < ∞,

it holds that

    Tn − γ = (1/n) Σ_{i=1}^n lθ(X_i) + o_{IPθ}(1/√n).

³ In the one-dimensional case (p = 1) we thus have Vθ = Eθ lθ²(X).

Remark. We then call lθ the influence function of (the sequence) Tn . Roughly


speaking, lθ (x) approximately measures the influence of an additional observa-
tion x (compare with the influence function as defined in Section 8.4).

Example 13.3.1 Influence function of the sample average


Assuming the entries of X have finite variance, the estimator Tn := X̄n is a
linear and hence asymptotically linear estimator of the mean µ, with influence
function
lθ (x) = x − µ.

Example 13.3.2 Influence function of the sample variance
Let X be real-valued, with Eθ X =: µ, varθ(X) =: σ² and κ := Eθ(X − µ)⁴
(assumed to exist). The sample variance is

    S² := (1/(n−1)) Σ_{i=1}^n (X_i − X̄n)².

Let σ̂n² be the estimator

    σ̂n² := (1/n) Σ_{i=1}^n (X_i − X̄n)².

One sees that

    S² − σ̂n² = O_IP(1/n) = o_IP(1/√n).

We rewrite σ̂n² as

    σ̂n² = (1/n) Σ_{i=1}^n (X_i − µ)² + (X̄n − µ)² − (2/n) Σ_{i=1}^n (X_i − µ)(X̄n − µ)
        = (1/n) Σ_{i=1}^n (X_i − µ)² − (X̄n − µ)².

Because by the CLT, X̄n − µ = O_{IPθ}(n^{-1/2}), we get

    σ̂n² = (1/n) Σ_{i=1}^n (X_i − µ)² + O_{IPθ}(1/n).

So asymptotically one does not notice that µ is estimated, and σ̂n² (and also S²)
is asymptotically linear with influence function

    lθ(x) = (x − µ)² − σ².

The asymptotic variance is

    Vθ = Eθ ( (X − µ)² − σ² )² = κ − σ⁴.
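The algebraic steps above hold exactly for any sample, which the following sketch (ours) confirms, together with S² − σ̂n² = σ̂n²/(n − 1) = O(1/n):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = 1.0
x = mu + rng.standard_normal(200)
n = x.size

sigma2_hat = np.mean((x - x.mean()) ** 2)            # the estimator sigma_hat_n^2
rewritten = np.mean((x - mu) ** 2) - (x.mean() - mu) ** 2

s2 = np.sum((x - x.mean()) ** 2) / (n - 1)           # sample variance S^2
```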

13.4 The δ-technique

Theorem 13.4.1 Let ({Tn}, Z) be a collection of random variables in R^p,
c ∈ R^p be a nonrandom vector, and {rn} be a nonrandom sequence of posi-
tive numbers with rn ↓ 0. Moreover, let h : R^p → R be differentiable at c, with
derivative ḣ(c) ∈ R^p. Suppose that

    (Tn − c)/rn →^{D} Z.

Then

    ( h(Tn) − h(c) )/rn →^{D} ḣ(c)^T Z,

and in fact

    h(Tn) − h(c) = ḣ(c)^T (Tn − c) + o_IP(rn).

Proof. By Slutsky’s Theorem,

    ḣ(c)^T (Tn − c)/rn →^{D} ḣ(c)^T Z.

Since (Tn − c)/rn converges in distribution, we know that ‖Tn − c‖/rn = O_IP(1).
Hence, ‖Tn − c‖ = O_IP(rn). The result now follows from

    h(Tn) − h(c) = ḣ(c)^T (Tn − c) + o(‖Tn − c‖) = ḣ(c)^T (Tn − c) + o_IP(rn). □
Corollary 13.4.1 Let Tn be an asymptotically normal estimator of γ = g(θ) ∈
R^p with asymptotic covariance matrix Vθ. Suppose h is differentiable at γ.
Then h(Tn) is an asymptotically normal estimator of h(γ) with asymptotic
variance⁴

    ḣ(γ)^T Vθ ḣ(γ).

If moreover Tn is an asymptotically linear estimator of γ, with influence func-
tion lθ, then h(Tn) is an asymptotically linear estimator of h(γ) with influence
function

    ḣ(γ)^T lθ.

Example 13.4.1 Asymptotically linear estimator of the parameter of
the exponential distribution
Let X_1, …, X_n be a sample from the Exponential(θ) distribution, with θ > 0.
Then X̄n is a linear estimator of Eθ X = 1/θ =: γ, with influence function
lθ(x) = x − 1/θ. The variance of √n (Tn − 1/θ) is 1/θ² = γ². By Theorem
13.4.1, 1/X̄n is an asymptotically linear estimator of θ. In this case, h(γ) = 1/γ,
so that ḣ(γ) = −1/γ². The influence function of 1/X̄n is thus

    ḣ(γ) lθ(x) = −(1/γ²)(x − γ) = −θ² (x − 1/θ).

The asymptotic variance of 1/X̄n is

    [ḣ(γ)]² γ² = 1/γ² = θ².

So

    √n ( 1/X̄n − θ ) →^{Dθ} N(0, θ²).

⁴ For p = 1 the asymptotic variance of h(Tn) is thus ḣ²(γ) Vθ.
Example 13.4.2 Two-dimensional asymptotic linearity of the sample
average and sample variance
Consider again Example 13.3.2. Let X be real-valued, with Eθ X := µ, varθ(X) :=
σ² and κ := Eθ(X − µ)⁴ (assumed to exist). Define moreover, for r = 1, 2, 3, 4,
the r-th moment µ_r := Eθ X^r. We again consider the estimator

    σ̂n² := (1/n) Σ_{i=1}^n (X_i − X̄n)².

We have

    σ̂n² = h(Tn),

where Tn = (T_{n,1}, T_{n,2})^T, with

    T_{n,1} = X̄n,  T_{n,2} = (1/n) Σ_{i=1}^n X_i²,

and

    h(t) = t_2 − t_1²,  t = (t_1, t_2)^T.

The estimator Tn has influence function

    lθ(x) = ( x − µ_1
              x² − µ_2 ).

By the 2-dimensional CLT,

    √n ( Tn − (µ_1, µ_2)^T ) →^{Dθ} N(0, Vθ),

with

    Vθ = ( µ_2 − µ_1²    µ_3 − µ_1 µ_2
           µ_3 − µ_1 µ_2  µ_4 − µ_2²  ).

It holds that

    ḣ( (µ_1, µ_2)^T ) = ( −2µ_1
                           1   ),

so that σ̂n² has influence function

    ḣ^T lθ(x) = (−2µ_1, 1) ( x − µ_1
                             x² − µ_2 )
              = (x − µ)² − σ²

(invoking µ_1 = µ). After some calculations, one finds moreover that

    (−2µ_1, 1) Vθ ( −2µ_1
                     1    ) = κ − σ⁴,

i.e., the δ-method gives the same result as the ad hoc method in Example 13.3.2,
as it of course should.
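The identity ḣ^T Vθ ḣ = κ − σ⁴ can be confirmed numerically for any distribution with four moments; the sketch below (ours) uses a small discrete distribution:

```python
import numpy as np

# A small discrete distribution with known moments
vals = np.array([0.0, 1.0, 3.0])
probs = np.array([0.2, 0.5, 0.3])

mu1, mu2, mu3, mu4 = (np.sum(probs * vals ** r) for r in range(1, 5))
mu = mu1
sigma2 = mu2 - mu1 ** 2                  # variance
kappa = np.sum(probs * (vals - mu) ** 4)  # fourth central moment

V = np.array([[mu2 - mu1 ** 2, mu3 - mu1 * mu2],
              [mu3 - mu1 * mu2, mu4 - mu2 ** 2]])
h_dot = np.array([-2 * mu1, 1.0])
asymptotic_variance = h_dot @ V @ h_dot
```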
Chapter 14

M-estimators

Recall the maximum likelihood estimator as defined in Section 2.4.3. In this
chapter we introduce a general class of estimators of which the MLE is a special
case. They are defined as minimizers of some empirical risk function.
Let, for each γ ∈ Γ, some loss function ργ(X) be defined. These are for instance
constructed as in Chapter 10: we let L(θ, a) be the loss when taking action a.
Then, we fix some decision d(x), and rewrite

    L(θ, d(x)) := ργ(x),

assuming the loss L depends on θ only via the parameter of interest γ = g(θ).
We now require that the theoretical risk

    R(c) := Eθ ρc(X)

is minimized at the value c = γ, i.e.,

    γ = arg min_{c ∈ Γ} Eθ ρc(X) = arg min_{c ∈ Γ} R(c).      (14.1)

Alternatively, given ρc, one may view (14.1) as the definition of γ.
If c ↦ ρc(x) is differentiable for all x, we write

    ψc(x) := ρ̇c(x) := ∂/∂c ρc(x).

Then, assuming we may interchange differentiation and taking expectations¹,
we have

    Ṙ(γ) = 0,

where Ṙ(c) = Eθ ψc(X).
Define now the empirical risk

    R̂n(c) := (1/n) Σ_{i=1}^n ρc(X_i),  c ∈ Γ.

¹ If |∂ρc/∂c| ≤ H(·) where Eθ H(X) < ∞, then it follows from the dominated convergence
theorem that ∂[Eθ ρc(X)]/∂c = Eθ[∂ρc(X)/∂c], or otherwise put, Ṙ(c) = Eθ ψc(X).


Definition 14.0.1 The M-estimator γ̂n of γ is defined as

    γ̂n := arg min_{c ∈ Γ} (1/n) Σ_{i=1}^n ρc(X_i) = arg min_{c ∈ Γ} R̂n(c).

The “M” in “M-estimator” stands for Minimizer (or, taking minus signs, Max-
imizer).
If ρc(x) is differentiable in c for all x, we can generally define γ̂n as a solution
of putting the derivatives

    Ṙ̂n(c) = ∂/∂c (1/n) Σ_{i=1}^n ρc(X_i) = (1/n) Σ_{i=1}^n ψc(X_i)

to zero. This is called the Z-estimator. The “Z” in “Z-estimator” stands for
Zero.

Definition 14.0.2 The Z-estimator γ̂n of γ is defined as a solution of the
equations

    Ṙ̂n(γ̂n) = 0,

where Ṙ̂n(c) = (1/n) Σ_{i=1}^n ψc(X_i).

Remark A solution γ̂n ∈ Γ is then assumed to exist.

Example 14.0.1 The least squares estimator
Let X ∈ R, and let the parameter of interest be the mean µ = Eθ X. Assume X
has finite variance σ². Then

    µ = arg min_c Eθ (X − c)²,

as (recall), by the bias-variance decomposition,

    Eθ (X − c)² = σ² + (µ − c)².

So in this case, we can take

    ρc(x) = (x − c)².

Clearly

    (1/n) Σ_{i=1}^n (X_i − c)²

is minimized at c = X̄n := Σ_{i=1}^n X_i/n. See also Section 2.3.
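A small sketch (ours) illustrating that the empirical risk R̂n(c) = (1/n)Σ(X_i − c)² is minimized at the sample average, by comparing a grid minimizer with X̄n:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(100) + 0.5

def empirical_risk(c):
    # \hat{R}_n(c) = (1/n) sum_i (X_i - c)^2
    return np.mean((x - c) ** 2)

grid = np.linspace(-2, 3, 5001)
risks = np.array([empirical_risk(c) for c in grid])
c_min = grid[np.argmin(risks)]           # grid minimizer of the empirical risk
```

The grid minimizer agrees with X̄n up to the grid resolution.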

14.1 MLE as special case of M-estimation

Suppose Θ ⊂ Rp and that the densities pθ = dPθ /dν exist w.r.t. some σ-finite
measure ν.

Definition 14.1.1 The quantity

    K(θ̃ | θ) := Eθ log( pθ(X) / pθ̃(X) )

is called the Kullback-Leibler information, or the relative entropy.

Remark Some care has to be taken not to divide by zero! This can be handled
e.g. by assuming that the support {x : pθ(x) > 0} does not depend on θ (see
also Condition I in the CRLB of Chapter 5).

Take for all θ̃ ∈ Θ

    ρθ̃(x) = − log pθ̃(x).

Then

    R(θ̃) = −Eθ log pθ̃(X).

One easily sees that

    K(θ̃ | θ) = R(θ̃) − R(θ).

Let us restate Lemma 2.4.1 and reprove it in a slightly different manner.

Lemma 14.1.1 The function R(θ̃) = −Eθ log pθ̃(X) is minimized at θ̃ = θ:

    θ = arg min_{θ̃} R(θ̃).

Proof. We will show that

    K(θ̃ | θ) ≥ 0.

This follows from Jensen’s inequality. Since the log-function is concave,

    K(θ̃ | θ) = −Eθ log( pθ̃(X) / pθ(X) )
             ≥ − log Eθ ( pθ̃(X) / pθ(X) )
             = − log 1 = 0. □

With ρθ̃(x) = − log pθ̃(x) we find ψθ̃(x) := ρ̇θ̃(x) = −sθ̃(x). Recall that sθ is
the score function

    sθ = ṗθ / pθ,

see Definition 4.7.1. We have seen moreover in Lemma 4.7.1 that Eθ sθ(X) = 0.
This is just another way to see that θ is a solution of the equation

    Ṙ(θ) = 0,

where Ṙ(θ̃) = Eθ ψθ̃(X) with ψθ̃ = −ṗθ̃ / pθ̃.
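For discrete distributions with common support, K(θ̃|θ) ≥ 0 with equality when the distributions coincide, and the divergence is not symmetric; a minimal sketch (ours):

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence sum_i p_i log(p_i / q_i), for distributions
    # with common support (all entries strictly positive)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
```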



With ρθ̃(x) = − log pθ̃(x), the M-estimator is the maximum likelihood estimator

    θ̂ = arg max_{θ̃ ∈ Θ} L_X(θ̃)
      = arg min_{θ̃ ∈ Θ} (1/n) Σ_{i=1}^n ( − log pθ̃(X_i) ).

14.2 Consistency of M-estimators

Note that γ minimizes a theoretical expectation, whereas the M-estimator γ̂n
minimizes the empirical average. Likewise, γ is a solution of putting a theoret-
ical expectation to zero, whereas the Z-estimator γ̂n is the solution of putting
an empirical average to zero.
By the law of large numbers, averages converge to expectations. So the M-
estimator (Z-estimator) does make sense. However, consistency and further
properties are not immediate, because we actually need convergence of the aver-
ages to expectations over a range of values c ∈ Γ simultaneously. This is the
topic of empirical process theory.
We will borrow the notation from empirical process theory. For a function
f : X → R, we let

    Pθ f := Eθ f(X),  P̂n f := (1/n) Σ_{i=1}^n f(X_i).

Then, by the law of large numbers, if Pθ |f| < ∞,

    |(P̂n − Pθ) f| → 0,  IPθ-a.s..

With this new notation we have

    R̂n(c) = P̂n ρc,  R(c) = Pθ ρc.

Theorem 14.2.1 Suppose the uniform convergence

    sup_{c ∈ Γ} |R̂n(c) − R(c)| → 0,  IPθ-a.s..

Then

    R(γ̂n) → R(γ),  IPθ-a.s..

Proof. The uniform convergence implies

    0 ≤ Pθ (ρ_{γ̂n} − ρ_γ)
      = −(P̂n − Pθ)(ρ_{γ̂n} − ρ_γ) + P̂n (ρ_{γ̂n} − ρ_γ)
      ≤ −(P̂n − Pθ)(ρ_{γ̂n} − ρ_γ)
      ≤ |(P̂n − Pθ) ρ_{γ̂n}| + |(P̂n − Pθ) ρ_γ|
      ≤ sup_{c ∈ Γ} |(P̂n − Pθ) ρ_c| + |(P̂n − Pθ) ρ_γ|
      ≤ 2 sup_{c ∈ Γ} |(P̂n − Pθ) ρ_c|,

where the first inequality holds since γ minimizes R(·), and the second since
γ̂n minimizes R̂n(·), so that P̂n(ρ_{γ̂n} − ρ_γ) ≤ 0.

□
We will need that convergence to the minimum value also implies convergence
of the arg min, i.e., convergence of the location of the minimum. To this end,
we present the following definition.

Definition The minimizer γ of R(c) is called well-separated if for all ε > 0,

    inf{ R(c) : c ∈ Γ, ‖c − γ‖ > ε } > R(γ).

If γ is well-separated, then R(γ̂n) → R(γ) IPθ-a.s. implies γ̂n → γ IPθ-a.s..
Lemma 14.2.1 Suppose that Γ is compact, that c 7→ ρc (x) is continuous for
all x, and that  
Pθ sup |ρc | < ∞.
c∈Γ
Then we have the uniform convergence

sup |(P̂n − Pθ )ρc | → 0, IPθ −a.s.. (14.2)


c∈Γ

Proof. Define for each δ > 0 and c ∈ Γ,

    w(·, δ, c) := sup_{c̃ ∈ Γ: ‖c̃ − c‖ < δ} |ρ_c̃ − ρ_c|.

Then for all x, as δ ↓ 0,

    w(x, δ, c) → 0.

So also, by dominated convergence,

    Pθ w(·, δ, c) → 0.

Hence, for all ε > 0, there exists a δ_c such that

    Pθ w(·, δ_c, c) ≤ ε.

Let

    B_c := {c̃ ∈ Γ : ‖c̃ − c‖ < δ_c}.

Then {B_c : c ∈ Γ} is a covering of Γ by open sets. Since Γ is compact, there
exists a finite sub-covering

    B_{c_1}, …, B_{c_N}.

For c ∈ B_{c_j},

    |ρ_c − ρ_{c_j}| ≤ w(·, δ_{c_j}, c_j).

It follows that

    sup_{c ∈ Γ} |(P̂n − Pθ) ρc| ≤ max_{1≤j≤N} |(P̂n − Pθ) ρ_{c_j}|
        + max_{1≤j≤N} P̂n w(·, δ_{c_j}, c_j) + max_{1≤j≤N} Pθ w(·, δ_{c_j}, c_j)
        → 2 max_{1≤j≤N} Pθ w(·, δ_{c_j}, c_j) ≤ 2ε,  IPθ-a.s.. □
Example 14.2.1 Consistency of the MLE in the logistic location fam-
ily
The above theorem directly uses the definition of the M-estimator, and does not
rely on having an explicit expression available. Here is an example where an
explicit expression is indeed not possible. Consider the logistic location family,
where the densities are

    pθ(x) = e^{x−θ} / (1 + e^{x−θ})²,  x ∈ R,

where θ ∈ Θ ⊂ R is the location parameter. Take

    ρθ(x) := − log pθ(x) = θ − x + 2 log(1 + e^{x−θ}).

Then θ̂n is the MLE. It is a solution of

    (2/n) Σ_{i=1}^n e^{X_i−θ̂n} / (1 + e^{X_i−θ̂n}) = 1.

This expression cannot be made into an explicit expression for θ̂n. However, we
do note the caveat that in order to be able to apply the above consistency theorem,
we need to assume that Θ is compact. This problem can be circumvented by
using the result below for Z-estimators.
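The score equation above can be solved numerically, e.g. by bisection, since its left-hand side is decreasing in θ; a sketch (ours, sampling via the logistic inverse CDF):

```python
import numpy as np

rng = np.random.default_rng(8)
theta_true = 1.0
# Sample from the logistic location family: F^{-1}(u) = theta + log(u/(1-u))
u = rng.uniform(size=5000)
x = theta_true + np.log(u / (1 - u))

def score_equation(theta):
    # (2/n) sum_i e^{X_i - theta} / (1 + e^{X_i - theta}) - 1,
    # which is strictly decreasing in theta
    return 2 * np.mean(1 / (1 + np.exp(-(x - theta)))) - 1

# Bisection: keep the root bracketed between lo (positive) and hi (negative)
lo, hi = -10.0, 10.0
for _ in range(60):
    mid = (lo + hi) / 2
    if score_equation(mid) > 0:
        lo = mid
    else:
        hi = mid
theta_hat = (lo + hi) / 2
```

With n = 5000 the Z-estimator lands close to the true location parameter.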
To prove consistency of a Z-estimator of a one-dimensional parameter is rela-
tively easy.
Recall that ψc = ρ̇c and Ṙ(c) = Pθ ψc := Eθ ψc(X), c ∈ Γ. Recall further that
Ṙ(γ) = 0 since γ is defined as the minimizer of R(·).

Theorem 14.2.2 Suppose that Γ ⊂ R and that ψc(x) is continuous in c for all
x. Assume moreover that

    Pθ |ψc| < ∞, ∀ c,

and that there exists a δ > 0 such that

    Ṙ(c) > 0,  γ < c < γ + δ,
    Ṙ(c) < 0,  γ − δ < c < γ.

Then for n large enough, IPθ-a.s., there is a solution γ̂n of Ṙ̂n(γ̂n) = 0, and
this solution γ̂n is consistent.

Proof. Let 0 < ε < δ be arbitrary. By the law of large numbers, IPθ-a.s. for n
sufficiently large,

    Ṙ̂n(γ + ε) > 0,  Ṙ̂n(γ − ε) < 0.

The continuity of c ↦ ψc implies that then Ṙ̂n(γ̂n) = 0 for some |γ̂n − γ| < ε. □

14.3 Asymptotic normality of M-estimators

For a function f : X → R^p we let Pθ f := Eθ f(X) ∈ R^p (whenever it exists).
Moreover, we let

    Pθ f f^T := Eθ f(X) f^T(X) ∈ R^{p×p}

(whenever it exists). The covariance matrix of the vector f(X) is thus

    Σ := Pθ f f^T − (Pθ f)(Pθ f)^T.

The CLT says that

    √n (P̂n − Pθ) f →^{Dθ} N(0, Σ).

With this notation we can translate Definition 13.3.1 as: Tn is an asymptotically
linear estimator of γ if

    Tn − γ = P̂n lθ + o_{IPθ}(1/√n),

with Pθ lθ = 0 and Vθ := Pθ lθ lθ^T < ∞.
Denote now

    νn(c) := √n (P̂n − Pθ) ψc = √n ( Ṙ̂n(c) − Ṙ(c) ),  c ∈ Γ.

Definition 14.3.1 The stochastic process

    {νn(c) : c ∈ Γ}

is called the empirical process indexed by c. The empirical process is called
asymptotically continuous at γ if for all (possibly random) sequences {γn} in
Γ with ‖γn − γ‖ = o_{IPθ}(1), we have

    |νn(γn) − νn(γ)| = o_{IPθ}(1).

For verifying asymptotic continuity, there are various tools, which involve com-
plexity assumptions on the map c ↦ ψc. This goes beyond the scope of these
notes. But let us see what asymptotic continuity can bring us.
Recall that Ṙ(c) = Pθ ψc. We assume that

    Mθ := ∂/∂c^T Ṙ(c) |_{c=γ}

exists. It is a p × p matrix. We require it to be of full rank, which amounts to
assuming that γ, as a solution of Ṙ(γ) = 0, is well-identified.

Theorem 14.3.1 Let γ̂n be the Z-estimator of γ. Suppose that γ̂n is a con-
sistent estimator of γ, and that νn is asymptotically continuous at γ. Suppose
moreover that Mθ^{-1} exists, and also

    Jθ := Pθ ψγ ψγ^T.

Then γ̂n is asymptotically linear, with influence function

    lθ = −Mθ^{-1} ψγ.

Proof. By definition,

    Ṙ̂n(γ̂n) = 0,  Ṙ(γ) = 0.

So we have

    0 = Ṙ̂n(γ̂n)
      = νn(γ̂n)/√n + Ṙ(γ̂n)
      = νn(γ̂n)/√n + Ṙ(γ̂n) − Ṙ(γ)
      =: (i) + (ii).

For the first term, we use the asymptotic continuity of νn at γ:

    (i) = νn(γ̂n)/√n = νn(γ)/√n + o_{IPθ}(1/√n) = Ṙ̂n(γ) + o_{IPθ}(1/√n).

For the second term, we use the differentiability of Ṙ(c) = Pθ ψc at c = γ:

    (ii) = Ṙ(γ̂n) − Ṙ(γ) = Mθ (γ̂n − γ) + o(‖γ̂n − γ‖).

So we arrive at

    0 = Ṙ̂n(γ) + o_{IPθ}(1/√n) + Mθ (γ̂n − γ) + o(‖γ̂n − γ‖).

Because, by the CLT, Ṙ̂n(γ) = O_{IPθ}(1/√n), this implies ‖γ̂n − γ‖ = O_{IPθ}(1/√n).
Hence

    0 = Ṙ̂n(γ) + Mθ (γ̂n − γ) + o_{IPθ}(1/√n),

or

    Mθ (γ̂n − γ) = −Ṙ̂n(γ) + o_{IPθ}(1/√n) = −P̂n ψγ + o_{IPθ}(1/√n),

or

    γ̂n − γ = −P̂n Mθ^{-1} ψγ + o_{IPθ}(1/√n). □

Corollary 14.3.1 Under the conditions of Theorem 14.3.1,

    √n (γ̂n − γ) →^{Dθ} N(0, Vθ),

with

    Vθ = Mθ^{-1} Jθ Mθ^{-1}.

Asymptotic linearity can also be established directly, under rather restrictive
assumptions; see Theorem 14.3.2 coming up next. We assume quite a lot of
smoothness for the functions ψc (namely, derivatives that are Lipschitz), so
that asymptotic linearity can be proved by straightforward arguments. We
stress however that such smoothness assumptions are by no means necessary.

Theorem 14.3.2 Let γ̂n be the Z-estimator of γ, and suppose that γ̂n is a
consistent estimator of γ. Suppose that, for all c in a neighborhood {c ∈ Γ :
‖c − γ‖ < ε}, the map c ↦ ψc(x) is differentiable for all x, with derivative

    ψ̇c(x) = ∂/∂c^T ψc(x)

(a p × p matrix). Assume moreover that, for all c and c̃ in a neighborhood of
γ, and for all x, we have, in matrix norm²,

    ‖ψ̇c(x) − ψ̇_c̃(x)‖ ≤ H(x) ‖c − c̃‖,

where H : X → R satisfies

    Pθ H < ∞.

Then

    Mθ := ∂/∂c^T Ṙ(c) |_{c=γ} = Pθ ψ̇γ.      (14.3)

Assuming Mθ^{-1} and Jθ := Eθ ψγ ψγ^T exist, the influence function of γ̂n is

    lθ = −Mθ^{-1} ψγ.

Proof. Result (14.3) follows from the dominated convergence theorem.
By the mean value theorem,

    0 = Ṙ̂n(γ̂n)
      = P̂n ψ_{γ̂n}
      = P̂n ψγ + P̂n ψ̇_{γ̃n(·)} (γ̂n − γ)
      = Ṙ̂n(γ) + P̂n ψ̇_{γ̃n(·)} (γ̂n − γ),

where for all x, ‖γ̃n(x) − γ‖ ≤ ‖γ̂n − γ‖. Thus

    0 = Ṙ̂n(γ) + P̂n ψ̇γ (γ̂n − γ) + P̂n (ψ̇_{γ̃n(·)} − ψ̇γ)(γ̂n − γ),

² For a matrix A, ‖A‖ := sup_{v ≠ 0} ‖Av‖/‖v‖.

so that

    ‖ Ṙ̂n(γ) + P̂n ψ̇γ (γ̂n − γ) ‖ ≤ P̂n H ‖γ̂n − γ‖² = O_{IPθ}(1) ‖γ̂n − γ‖²,

where in the last inequality we used Pθ H < ∞. Now, by the law of large
numbers,

    P̂n ψ̇γ = Pθ ψ̇γ + o_{IPθ}(1) = Mθ + o_{IPθ}(1).

Thus

    Ṙ̂n(γ) + Mθ (γ̂n − γ) + o_{IPθ}(‖γ̂n − γ‖) = O_{IPθ}(‖γ̂n − γ‖²).

Because Ṙ̂n(γ) = O_{IPθ}(1/√n) by the CLT, this ensures that ‖γ̂n − γ‖ =
O_{IPθ}(1/√n). It follows that

    Ṙ̂n(γ) + Mθ (γ̂n − γ) + o_{IPθ}(1/√n) = O_{IPθ}(1/n).

Hence

    Mθ (γ̂n − γ) = −Ṙ̂n(γ) + o_{IPθ}(1/√n),

and so

    γ̂n − γ = −Mθ^{-1} Ṙ̂n(γ) + o_{IPθ}(1/√n)
           = −P̂n Mθ^{-1} ψγ + o_{IPθ}(1/√n). □
Note The asymptotic normality follows again from the asymptotic linearity
established in Theorem 14.3.2.

14.4 Asymptotic normality of the MLE

In this section, we show that, under regularity conditions, the MLE is asymp-
totically normal with asymptotic covariance matrix the inverse of the Fisher-
information matrix I(θ). We use that maximum likelihood estimation is a
special case of M-estimation and apply the results of the previous section. In
order to do so we need to assume regularity conditions.
Let P = {Pθ : θ ∈ Θ} be dominated by a σ-finite dominating measure ν, and
write the densities as pθ = dPθ /dν. Suppose that Θ ⊂ Rp . Assume that the
support of pθ does not depend on θ (Condition I in Section 5.5). As loss we
take minus the log-density:
ρθ := − log pθ .
The MLE is
    θ̂n := arg max_{θ̃ ∈ Θ} P̂n log pθ̃.

We suppose that the score function

sθ = (∂/∂θ) log pθ = ṗθ/pθ

exists, and that we may interchange differentiation and integration, so that the score has mean zero:

Pθ sθ = ∫ ṗθ dν = (∂/∂θ) ∫ pθ dν = (∂/∂θ) 1 = 0.
Recall that the Fisher-information matrix is

I(θ) := Pθ sθ sθᵀ.

Now, it is clear that ψθ = −sθ, and, assuming derivatives exist and that again we may change the order of differentiation and integration,

Mθ = Pθ ψ̇θ = −Pθ ṡθ,

and (see also Lemma 4.7.1)

Pθ ṡθ = Pθ ( p̈θ/pθ ) − Pθ ( (ṗθ/pθ)(ṗθ/pθ)ᵀ )
      = (∂²/∂θ∂θᵀ) 1 − Pθ sθ sθᵀ
      = 0 − I(θ) = −I(θ).

Hence, in this case, Mθ = −I(θ), and the influence function is

lθ = I(θ)−1 sθ .

So the asymptotic covariance matrix of the MLE θ̂n is

I(θ)⁻¹ ( Pθ sθ sθᵀ ) I(θ)⁻¹ = I(θ)⁻¹.

It follows that

√n (θ̂n − θ) −→^{Dθ} N(0, I⁻¹(θ)).
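For instance (a small check, not in the text), in the Bernoulli(θ) model the MLE is the sample mean and I(θ) = 1/(θ(1 − θ)); a seeded simulation confirms that the variance of √n(θ̂n − θ) is close to I(θ)⁻¹:

```python
# Simulation check: the Bernoulli MLE (the sample mean) has asymptotic
# variance I(theta)^{-1} = theta * (1 - theta).
import random
import statistics

theta = 0.3
rng = random.Random(1)
n, reps = 400, 2000
mles = [sum(rng.random() < theta for _ in range(n)) / n for _ in range(reps)]
scaled_var = n * statistics.pvariance(mles)  # var of sqrt(n)(theta_hat - theta)
inv_fisher = theta * (1.0 - theta)           # = 0.21
print(round(scaled_var, 3), inv_fisher)
```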

14.5 Two further examples of M-estimation

In this section we examine the α-quantile and the Huber estimator.


Example 14.5.1 Asymptotic normality of the α-quantile
In this example, the parameter of interest is the α-quantile. We will consider a
loss function which does not satisfy regularity conditions, but nevertheless leads
to an asymptotically linear, and hence asymptotically normal, estimator.

Let X := R. The distribution function of X is denoted by F . Let 0 < α < 1


be given. The α-quantile of F is γ = F −1 (α) (assumed to exist). We moreover
assume that F has density f with respect to Lebesgue measure, and that f (x) > 0
in a neighborhood of γ. As loss function we take

ρc (x) := ρ(x − c),

where
ρ(x) := (1 − α)|x|l{x < 0} + α|x|l{x > 0}.

We now first check that for R(c) := Pθ ρc,

arg min_c R(c) = F⁻¹(α) =: γ.

We have
ρ̇(x) = αl{x > 0} − (1 − α)l{x < 0}.
Note that ρ̇ does not exist at x = 0. This is one of the irregularities in this
example.
It follows that

ψc(x) = −α l{x > c} + (1 − α) l{x < c}.
Hence
Ṙ(c) = Pθ ψc = −α + F (c)
(the fact that ψc is not defined at x = c can be shown not to be a problem, roughly
because a single point has probability zero, as F is assumed to be continuous).
So
Ṙ(γ) = 0, for γ = F −1 (α).

We now derive Mθ, which is a scalar in this case:

Mθ = (d/dc) Ṙ(c)|c=γ = (d/dc)(−α + F(c))|c=γ = f(γ) = f(F⁻¹(α)).

The influence function is thus³

lθ(x) = −Mθ⁻¹ ψγ(x) = (1/f(γ)) ( α − l{x < γ} ).

³Note that in the special case α = 1/2 (where γ is the median), this becomes lθ(x) = −1/(2f(γ)) for x < γ, and lθ(x) = +1/(2f(γ)) for x > γ.

We conclude that, for R̂n(c) := P̂n ρc and

γ̂n := arg min_c R̂n(c),

which we write as the sample quantile γ̂n = F̂n⁻¹(α) (or an approximation thereof up to order oIPθ(1/√n)), one has

√n ( F̂n⁻¹(α) − F⁻¹(α) ) −→^{Dθ} N( 0, α(1 − α)/f²(F⁻¹(α)) ).
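As a numerical illustration (not part of the text), take X ~ Exponential(1), where F⁻¹(α) = −log(1 − α) and f(F⁻¹(α)) = 1 − α, so that the asymptotic variance above equals α/(1 − α); a seeded simulation of the sample quantile reproduces this:

```python
# Monte Carlo check of the asymptotic variance alpha(1-alpha)/f^2(F^{-1}(alpha))
# of the sample quantile, for Exponential(1) data.
import math
import random
import statistics

alpha = 0.25
rng = random.Random(2)
n, reps = 500, 2000

def sample_quantile(xs, a):
    # a crude version of \hat F_n^{-1}(a)
    return sorted(xs)[int(a * len(xs))]

est = [sample_quantile([rng.expovariate(1.0) for _ in range(n)], alpha)
       for _ in range(reps)]
true_q = -math.log(1.0 - alpha)
avar = alpha * (1.0 - alpha) / (1.0 - alpha) ** 2   # = alpha/(1-alpha) = 1/3
scaled_var = n * statistics.pvariance(est)
print(round(true_q, 3), round(avar, 3), round(scaled_var, 3))
```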

Example 14.5.2 Asymptotic normality of the Huber estimator


In this example, we illustrate that the Huber-estimator is asymptotically linear and hence asymptotically normal. Let again X = R and F be the distribution function of X. We let the parameter of interest be a location parameter. The Huber loss function is

ρc(x) = ρ(x − c),

with

ρ(x) = x² if |x| ≤ k, and ρ(x) = k(2|x| − k) if |x| > k.
We define γ as

γ := arg min_c Pθ ρc.

It holds that

ρ̇(x) = 2x if |x| ≤ k, ρ̇(x) = +2k if x > k, and ρ̇(x) = −2k if x < −k.

Therefore,

ψc(x) = −2(x − c) if |x − c| ≤ k, ψc(x) = −2k if x − c > k, and ψc(x) = +2k if x − c < −k.
One easily derives that

Ṙ(c) := Pθ ψc = −2 ∫_{−k+c}^{k+c} x dF(x) + 2c [F(k + c) − F(−k + c)]
                − 2k [1 − F(k + c)] + 2k F(−k + c).

So

Mθ = (d/dc) Ṙ(c)|c=γ = 2 [F(k + γ) − F(−k + γ)].

The influence function of the Huber estimator is

lθ(x) = (1/[F(k + γ) − F(−k + γ)]) ·
        { x − γ,  |x − γ| ≤ k
        { +k,     x − γ > k
        { −k,     x − γ < −k.

For k → 0, this corresponds to the influence function of the median.
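In practice the Huber estimate solves P̂nψc = 0, i.e. the average clipped residual is zero in c. The following minimal sketch (not from the text; the cutoff k = 1.345 and the contaminated sample are arbitrary illustrative choices) computes it by bisection:

```python
# Minimal sketch (illustrative choices throughout): the Huber location
# estimate solves (1/n) sum_i clip(X_i - c, -k, k) = 0 in c.
import random

k = 1.345

def huber_estimate(xs):
    lo, hi = min(xs), max(xs)
    def h(c):   # average clipped residual; decreasing in c
        return sum(max(-k, min(k, x - c)) for x in xs) / len(xs)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(3)
data = [rng.gauss(0.0, 1.0) for _ in range(95)] + [10.0] * 5  # 5% outliers
mean = sum(data) / len(data)
huber = huber_estimate(data)
print(round(mean, 3), round(huber, 3))
```

The few large outliers pull the sample mean well away from zero, while the Huber estimate stays near the center of the clean observations.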



14.6 Asymptotic relative efficiency

In this section, we assume that the parameter of interest is real-valued:

γ ∈ Γ ⊂ R.

Definition 14.6.1 Let Tn,1 and Tn,2 be two estimators of γ that satisfy

√n (Tn,j − γ) −→^{Dθ} N(0, Vθ,j), j = 1, 2.

Then

e2:1 := Vθ,1/Vθ,2

is called the asymptotic relative efficiency of Tn,2 with respect to Tn,1.
If e2:1 > 1, the estimator Tn,2 is asymptotically more efficient than Tn,1 . An
asymptotic (1 − α)-confidence interval for γ based on Tn,2 is then narrower than
the one based on Tn,1 .
Example 14.6.1 Asymptotic relative efficiency of sample mean and
sample median
Let X = R, and F be the distribution function of X. Suppose that F is sym-
metric around the parameter of interest µ. In other words,

F (·) = F0 (· − µ),

where F0 is symmetric around zero. We assume that F0 has finite variance σ², and that it has density f0 w.r.t. Lebesgue measure, with f0(0) > 0. Take Tn,1 := X̄n, the sample mean, and Tn,2 := F̂n⁻¹(1/2), the sample median. Then Vθ,1 = σ² and Vθ,2 = 1/(4f0²(0)) (the latter being derived in Example 14.5.1).
So
e2:1 = 4σ 2 f02 (0).
Whether the sample mean is the winner, or rather the sample median, depends
thus on the distribution F0 . Let us consider three cases.
Case i Let F0 be the standard normal distribution, i.e., F0 = Φ. Then σ² = 1 and f0(0) = 1/√(2π). Hence

e2:1 = 2/π ≈ 0.64.
So X̄n is the winner. Note that X̄n is the MLE in this case.
Case ii Let F0 be the Laplace distribution, with variance σ 2 equal to one. This
distribution has density
f0(x) = (1/√2) exp[−√2 |x|], x ∈ R.

So we have f0(0) = 1/√2, and hence

e2:1 = 2.

Thus, the sample median, which is the MLE for this case, is the winner.

Case iii Suppose

F0 = (1 − η)Φ + η Φ(·/3).

This means that the distribution of X is a mixture, with mixing probabilities 1 − η and η, of two normal distributions, one with unit variance and one with variance 3² = 9. Otherwise put, associated with X is an unobservable label Y ∈ {0, 1}. Given Y = 1, the random variable X is N(µ, 1)-distributed. Given Y = 0, the random variable X has a N(µ, 3²) distribution. Moreover, P(Y = 1) = 1 − P(Y = 0) = 1 − η. Hence

σ² := var(X) = (1 − η) var(X|Y = 1) + η var(X|Y = 0) = (1 − η) + 9η = 1 + 8η.

It furthermore holds that

f0(0) = (1 − η) φ(0) + (η/3) φ(0) = (1/√(2π)) (1 − 2η/3).

It follows that

e2:1 = (2/π) (1 − 2η/3)² (1 + 8η).
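This last formula is easy to evaluate numerically (a small computation, not in the text); it reproduces the α = 1/2 (median) column of the table of asymptotic relative efficiencies given further on:

```python
# Evaluating e_{2:1} = (2/pi)(1 - 2*eta/3)^2 (1 + 8*eta) for the contaminated
# normal model.
import math

def are_median_vs_mean(eta):
    sigma2 = 1.0 + 8.0 * eta                              # variance of the mixture
    f0_at_0 = (1.0 - 2.0 * eta / 3.0) / math.sqrt(2.0 * math.pi)
    return 4.0 * sigma2 * f0_at_0 ** 2

for eta in (0.0, 0.05, 0.25):
    print(eta, round(are_median_vs_mean(eta), 2))
# 0.0  -> 0.64 (= 2/pi, the Gaussian case)
# 0.05 -> 0.83
# 0.25 -> 1.33
```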

Let us now further compare the results with the α-trimmed mean. Because F is symmetric, it turns out that the α-trimmed mean has the same influence function as the Huber-estimator with k = F0⁻¹(1 − α):

lθ(x) = (1/[F0(k) − F0(−k)]) ·
        { x − µ,  |x − µ| ≤ k
        { +k,     x − µ > k
        { −k,     x − µ < −k.

(This can be seen from Example 15.3.2 ahead, which is not part of the exam.) The influence function is used to compute the asymptotic variance Vθ,α of the α-trimmed mean:

Vθ,α = ( ∫_{F0⁻¹(α)}^{F0⁻¹(1−α)} x² dF0(x) + 2α (F0⁻¹(1 − α))² ) / (1 − 2α)².

From this, we then calculate the asymptotic relative efficiency of the α-trimmed
mean w.r.t. the mean. Note that the median is the limiting case with α → 1/2.
Table: Asymptotic relative efficiency of the α-trimmed mean over the mean

           α = 0.05   α = 0.125   α = 0.5
η = 0.00     0.99       0.94       0.64
η = 0.05     1.20       1.19       0.83
η = 0.25     1.40       1.66       1.33

14.7 Asymptotic pivots

Again throughout this section, enough regularity is assumed, such as existence


of derivatives and interchanging integration and differentiation.

Recall the definition of an asymptotic pivot (see Section 6.2). It is a function


Zn (γ) := Zn (X1 , . . . , Xn , γ) of the data X1 , . . . , Xn and the parameter of inter-
est γ = g(θ) ∈ Rp , such that its asymptotic distribution does not depend on
the unknown parameter θ, i.e., for a random variable Z, with distribution Q
not depending on θ,

Zn(γ) −→^{Dθ} Z, ∀ θ.
An asymptotic pivot can be used to construct approximate (1 − α)-confidence
intervals for γ, and tests for H0 : γ = γ0 with approximate level α.
Consider now an asymptotically normal estimator Tn of γ, which is asymptot-
ically unbiased and has asymptotic covariance matrix Vθ , that is
√n (Tn − γ) −→^{Dθ} N(0, Vθ), ∀ θ

(assuming such an estimator exists). Then, depending on the situation, there


are various ways to construct an asymptotic pivot.
1st asymptotic pivot
If the asymptotic covariance matrix Vθ is non-singular, and depends only on
the parameter of interest γ, say Vθ = V (γ) (for example, if γ = θ), then an
asymptotic pivot is

Zn,1 (γ) := n(Tn − γ)T V (γ)−1 (Tn − γ).

The asymptotic distribution is the χ2 -distribution with p degrees of freedom.


2nd asymptotic pivot
If, for all θ, one has a consistent estimator V̂n of Vθ , then an asymptotic pivot
is
Zn,2 (γ) := n(Tn − γ)T V̂n−1 (Tn − γ).
The asymptotic distribution is again the χ2 -distribution with p degrees of free-
dom. This follows from Slutsky’s lemma.
Estimators of the asymptotic variance
◦ If θ̂n is a consistent estimator of θ and if θ ↦ Vθ is continuous, one may insert V̂n := Vθ̂n.
◦ If Tn = γ̂n is the M-estimator of γ, γ being the solution of Pθ ψγ = 0, then
(under regularity) the asymptotic covariance matrix is

Vθ = Mθ−1 Jθ Mθ−1 ,

where
Jθ = Pθ ψγ ψγT ,

and

Mθ := R̈(γ) = (∂/∂cᵀ) Ṙ(c)|c=γ = (∂/∂cᵀ) Pθ ψc|c=γ = Pθ ψ̇γ.

Then one may estimate Jθ and Mθ by

Jˆn := P̂n ψγ̂n ψγ̂nᵀ = (1/n) Σ_{i=1}^n ψγ̂n(Xi) ψγ̂n(Xi)ᵀ,

and

M̂n := R̈̂n(γ̂n) = P̂n ψ̇γ̂n = (1/n) Σ_{i=1}^n ψ̇γ̂n(Xi),

respectively. Under some regularity conditions,

V̂n := M̂n⁻¹ Jˆn M̂n⁻¹

is a consistent estimator of Vθ⁴.

14.8 Asymptotic pivot based on the MLE

Suppose that P = {Pθ : θ ∈ Θ} has Θ ⊂ Rp , and that P is dominated by some


σ-finite measure ν. Let pθ := dPθ /dν denote the densities, and let
θ̂n := arg max_{ϑ∈Θ} Σ_{i=1}^n log pϑ(Xi)

be the MLE.
Assume enough regularity such as existence of derivatives and interchanging
integration and differentiation. Recall that θ̂n is an M-estimator with loss func-
tion ρϑ = − log pϑ, and hence under regularity conditions, ψϑ = ρ̇ϑ is minus
⁴From most algorithms used to compute the M-estimator γ̂n, one can easily obtain M̂n and Jˆn as output. Recall e.g. that the Newton-Raphson algorithm is based on the iterations

γ̂new = γ̂old − ( R̈̂n(γ̂old) )⁻¹ Ṙ̂n(γ̂old).

the score function sϑ := ṗϑ /pϑ . The asymptotic variance of the MLE is I −1 (θ),
where I(θ) := Pθ sθ sθᵀ is the Fisher information:

√n (θ̂n − θ) −→^{Dθ} N(0, I⁻¹(θ)), ∀ θ

(see Section 14.4). Thus, in this case

Zn,1 (θ) = n(θ̂n − θ)T I(θ)(θ̂n − θ),

and, with Iˆn being a consistent estimator of I(θ)

Zn,2 (θ) = n(θ̂n − θ)T Iˆn (θ̂n − θ).

Note that one may take

Iˆn := −(1/n) Σ_{i=1}^n ṡθ̂n(Xi) = −(1/n) Σ_{i=1}^n (∂²/∂ϑ∂ϑᵀ) log pϑ(Xi)|ϑ=θ̂n

as estimator of the Fisher information⁵.


3rd asymptotic pivot
Define now twice the log-likelihood ratio

2Ln(θ̂n) − 2Ln(θ) := 2 Σ_{i=1}^n ( log pθ̂n(Xi) − log pθ(Xi) ).

It turns out that the log-likelihood ratio is indeed an asymptotic pivot. A


practical advantage is that it is self-normalizing: one does not need to explicitly
estimate asymptotic (co-)variances.
Lemma 14.8.1 Under regularity conditions, 2Ln (θ̂n ) − 2Ln (θ) is an asymp-
totic pivot for θ. Its asymptotic distribution is again the χ2 -distribution with p
degrees of freedom:
2Ln(θ̂n) − 2Ln(θ) −→^{Dθ} χ²p, ∀ θ.

Sketch of the proof. We have by a two-term Taylor expansion

2Ln(θ̂n) − 2Ln(θ) = 2n P̂n ( log pθ̂n − log pθ )
                  ≈ 2n (θ̂n − θ)ᵀ P̂n sθ + n (θ̂n − θ)ᵀ P̂n ṡθ (θ̂n − θ)
                  ≈ 2n (θ̂n − θ)ᵀ P̂n sθ − n (θ̂n − θ)ᵀ I(θ) (θ̂n − θ),
where in the second step, we used P̂n ṡθ ≈ Pθ ṡθ = −I(θ). (If you like you may
compare this two-term Taylor expansion with the one in the sketch of proof of
Le Cam’s 3rd Lemma (which is not part of the exam)). With the remainder
terms of the two-term Taylor expansion being asymptotically negligible, we are
⁵In other words (as for general M-estimators), the algorithm (e.g. Newton-Raphson) for calculating the maximum likelihood estimator θ̂n generally also provides an estimator of the Fisher information as a by-product.

mathematically speaking dealing with a situation as in Lemma 12.3.2 where the


least squares estimator is studied6 . The MLE θ̂n is asymptotically linear with
influence function lθ = I(θ)−1 sθ :

θ̂n − θ = I(θ)−1 P̂n sθ + oIPθ (n−1/2 ).

Hence,
2Ln (θ̂n ) − 2Ln (θ) ≈ n(P̂n sθ )T I(θ)−1 (P̂n sθ ).
The result now follows from

√n P̂n sθ −→^{Dθ} N(0, I(θ)).

□
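As an illustration (not in the text), in the Bernoulli(θ) model twice the log-likelihood ratio has the closed form 2n[p̂ log(p̂/θ) + (1 − p̂) log((1 − p̂)/(1 − θ))], and a seeded simulation shows it behaves like χ²₁ (mean about 1, roughly 5% of values above the 0.95-quantile 3.84):

```python
# Simulation (illustrative): 2L_n(theta_hat) - 2L_n(theta) in the Bernoulli
# model is approximately chi-square with 1 degree of freedom.
import math
import random

theta = 0.5
rng = random.Random(4)
n, reps = 200, 2000

def lr_stat(p_hat):
    if p_hat in (0.0, 1.0):      # degenerate sample; guard against log(0)
        return 2.0 * n * math.log(2.0)
    return 2.0 * n * (p_hat * math.log(p_hat / theta)
                      + (1 - p_hat) * math.log((1 - p_hat) / (1 - theta)))

stats = [lr_stat(sum(rng.random() < theta for _ in range(n)) / n)
         for _ in range(reps)]
mean_stat = sum(stats) / reps
frac_above = sum(s > 3.841 for s in stats) / reps   # chi2_1 0.95-quantile
print(round(mean_stat, 3), frac_above)
```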

14.9 MLE for the multinomial distribution

Let X1 , . . . , Xn be i.i.d. copies of X, where X ∈ {1, . . . , k} is a label, with

Pθ (X = j) := πj , j = 1, . . . , k.
where the probabilities πj are positive and add up to one, Σ_{j=1}^k πj = 1, but
are assumed to be otherwise unknown. Then there are p := k − 1 unknown
parameters, say θ = (π1 , . . . , πk−1 ). Define Nj := #{i : Xi = j}. Note that
(N1 , . . . , Nk ) has a multinomial distribution with parameters n and (π1 , . . . , πk ).
Lemma 14.9.1 For each j = 1, . . . , k, the MLE of πj is

π̂j = Nj/n.

Proof. The log-densities can be written as

log pθ(x) = Σ_{j=1}^k l{x = j} log πj,

so that

Σ_{i=1}^n log pθ(Xi) = Σ_{j=1}^k Nj log πj.

Putting the derivatives with respect to θ = (π1, . . . , πk−1) (with πk = 1 − Σ_{j=1}^{k−1} θj) to zero gives

Nj/π̂j − Nk/π̂k = 0.

Hence

π̂j = Nj π̂k/Nk, j = 1, . . . , k,

and thus

1 = Σ_{j=1}^k π̂j = n π̂k/Nk,

yielding

π̂k = Nk/n,

and hence

π̂j = Nj/n, j = 1, . . . , k.

□

⁶P̂n sθ takes the role of Xᵀε/n and I(θ) takes the role of XᵀX/n.
We now first calculate Zn,1 (θ). For that, we need to find the Fisher information
I(θ).
Lemma 14.9.2 The Fisher information is

I(θ) = diag(1/π1, . . . , 1/πk−1) + (1/πk) ι ιᵀ, ⁷

where ι is the (k − 1)-vector ι := (1, . . . , 1)ᵀ.


Proof. We have

sθ,j(x) = (1/πj) l{x = j} − (1/πk) l{x = k}.

So

(I(θ))j1,j2 = Eθ [ ( (1/πj1) l{X = j1} − (1/πk) l{X = k} ) ( (1/πj2) l{X = j2} − (1/πk) l{X = k} ) ]
            = 1/πk          if j1 ≠ j2,
            = 1/πj + 1/πk   if j1 = j2 = j.

□
We thus find

Zn,1(θ) = n (θ̂n − θ)ᵀ I(θ) (θ̂n − θ)
        = n Σ_{j=1}^{k−1} (π̂j − πj)²/πj + (n/πk) ( Σ_{j=1}^{k−1} (π̂j − πj) )²
        = n Σ_{j=1}^{k} (π̂j − πj)²/πj
        = Σ_{j=1}^{k} (Nj − nπj)²/(nπj),

using in the third equality that Σ_{j=1}^{k−1} (π̂j − πj) = −(π̂k − πk).

⁷To invert such a matrix, one may apply the formula (A + bbᵀ)⁻¹ = A⁻¹ − A⁻¹ b bᵀ A⁻¹/(1 + bᵀ A⁻¹ b).

This is Pearson's chi-square

Σ (observed − expected)²/expected.

A version of Zn,2(θ) is obtained by replacing, for j = 1, . . . , k, πj by π̂j in the expression for the Fisher information. This gives

Zn,2(θ) = Σ_{j=1}^k (Nj − nπj)²/Nj.

This is Pearson's chi-square

Σ (observed − expected)²/observed.

Finally, the log-likelihood ratio pivot is

2Ln(θ̂n) − 2Ln(θ) = 2 Σ_{j=1}^k Nj log( π̂j/πj ).

The approximation log(1 + x) ≈ x − x²/2 shows that 2Ln(θ̂n) − 2Ln(θ) ≈ Zn,2(θ):

2Ln(θ̂n) − 2Ln(θ) = −2 Σ_{j=1}^k Nj log( 1 + (πj − π̂j)/π̂j )
                  ≈ −2 Σ_{j=1}^k Nj (πj − π̂j)/π̂j + Σ_{j=1}^k Nj ( (πj − π̂j)/π̂j )²
                  = Zn,2(θ),

where the first sum vanishes because Nj/π̂j = n for all j and Σ_{j=1}^k (πj − π̂j) = 0.

The three asymptotic pivots Zn,1(θ), Zn,2(θ) and 2Ln(θ̂n) − 2Ln(θ) are each asymptotically χ²k−1-distributed under IPθ.
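For a concrete (invented) data set with k = 3 the three pivots can be computed directly, and they are indeed close to one another:

```python
# Computing the three asymptotic pivots for a toy multinomial data set
# (counts chosen for illustration only).
import math

N = [25, 55, 20]                 # observed counts N_j
pi = [0.3, 0.5, 0.2]             # hypothesised probabilities pi_j
n = sum(N)

Z1 = sum((Nj - n * pj) ** 2 / (n * pj) for Nj, pj in zip(N, pi))
Z2 = sum((Nj - n * pj) ** 2 / Nj for Nj, pj in zip(N, pi))
LR = 2.0 * sum(Nj * math.log((Nj / n) / pj) for Nj, pj in zip(N, pi))
print(round(Z1, 3), round(Z2, 3), round(LR, 3))
# Z1 = 4/3 ~ 1.333, Z2 = 16/11 ~ 1.455, LR ~ 1.368
```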

14.10 Likelihood ratio tests

For the simple hypothesis


H0 : θ = θ0 ,
we can use 2Ln (θ̂n ) − 2Ln (θ0 ) as test statistic: reject H0 if

2Ln(θ̂n) − 2Ln(θ0) > Gp⁻¹(1 − α),

where Gp is the distribution function of the χ2p -distribution. This test has
approximately level α as was shown in Lemma 14.8.1.
Consider now the hypothesis
H0 : R(θ) = 0,
where

R(θ) = ( R1(θ), . . . , Rq(θ) )ᵀ

are q restrictions on θ (with R : Rp → Rq a given function⁸).
Let θ̂n be the unrestricted MLE, that is

θ̂n = arg max_{ϑ∈Θ} Σ_{i=1}^n log pϑ(Xi).

Moreover, let θ̂n⁰ be the restricted MLE, defined as

θ̂n⁰ = arg max_{ϑ∈Θ: R(ϑ)=0} Σ_{i=1}^n log pϑ(Xi).

Define the (q × p)-matrix

Ṙ(θ) = (∂/∂ϑᵀ) R(ϑ)|ϑ=θ.

We assume Ṙ(θ) has rank q.
Let

Ln(θ̂n) − Ln(θ̂n⁰) = Σ_{i=1}^n ( log pθ̂n(Xi) − log pθ̂n⁰(Xi) )

be the log-likelihood ratio for testing H0 : R(θ) = 0.


Lemma 14.10.1 Under regularity conditions, and if H0 : R(θ) = 0 holds, we have

2Ln(θ̂n) − 2Ln(θ̂n⁰) −→^{Dθ} χ²q.

Sketch of the proof. Let

Zn := (1/√n) Σ_{i=1}^n sθ(Xi).

As in the sketch of the proof of Lemma 14.8.1, we can use a two-term Taylor expansion to show, for any sequence ϑn satisfying ϑn = θ + OIPθ(n−1/2), that

2 Σ_{i=1}^n ( log pϑn(Xi) − log pθ(Xi) )
= 2√n (ϑn − θ)ᵀ Zn − n (ϑn − θ)ᵀ I(θ) (ϑn − θ) + oIPθ(1).

Here, we also again use that (1/n) Σ_{i=1}^n ṡϑn(Xi) = −I(θ) + oIPθ(1). Moreover, by a one-term Taylor expansion, and invoking that R(θ) = 0,

R(ϑn) = Ṙ(θ)(ϑn − θ) + oIPθ(n−1/2).

Insert Corollary 12.4.1 with z := Zn, B := Ṙ(θ), and V = I(θ). This gives

2Ln(θ̂n) − 2Ln(θ̂n⁰)
= 2 Σ_{i=1}^n ( log pθ̂n(Xi) − log pθ(Xi) ) − 2 Σ_{i=1}^n ( log pθ̂n⁰(Xi) − log pθ(Xi) )
= Znᵀ I(θ)⁻¹ Ṙᵀ(θ) ( Ṙ(θ) I(θ)⁻¹ Ṙ(θ)ᵀ )⁻¹ Ṙ(θ) I(θ)⁻¹ Zn + oIPθ(1)
=: Ynᵀ W⁻¹ Yn + oIPθ(1),

⁸The notation R is used here for the restrictions; it has nothing to do with the risk R.

where Yn is the q-vector

Yn := Ṙ(θ) I(θ)⁻¹ Zn,

and where W is the (q × q)-matrix

W := Ṙ(θ) I(θ)⁻¹ Ṙ(θ)ᵀ.

We know that

Zn −→^{Dθ} N(0, I(θ)).

Hence

Yn −→^{Dθ} N(0, W),

so that

Ynᵀ W⁻¹ Yn −→^{Dθ} χ²q.

□

Corollary 14.10.1 From the sketch of the proof of Lemma 14.10.1, one sees that moreover (under regularity),

2Ln(θ̂n) − 2Ln(θ̂n⁰) ≈ n (θ̂n − θ̂n⁰)ᵀ I(θ) (θ̂n − θ̂n⁰),

and also

2Ln(θ̂n) − 2Ln(θ̂n⁰) ≈ n (θ̂n − θ̂n⁰)ᵀ I(θ̂n⁰) (θ̂n − θ̂n⁰).

14.11 Contingency tables

Let X be a bivariate label, say X ∈ {(j, k) : j = 1, . . . , r, k = 1, . . . , s}. For


example, the first index may correspond to sex (r = 2) and the second index to
the color of the eyes (s = 3). The probability of the combination (j, k) is
 
πj,k := Pθ( X = (j, k) ).

Let X1 , . . . , Xn be i.i.d. copies of X, and

Nj,k := #{Xi = (j, k)}.

From Section 14.9, we know that the (unrestricted) MLE of πj,k is equal to

π̂j,k := Nj,k/n.
We now want to test whether the two labels are independent. The null-
hypothesis is
H0 : πj,k = (πj,+ ) × (π+,k ) ∀ (j, k).
Here

πj,+ := Σ_{k=1}^s πj,k,   π+,k := Σ_{j=1}^r πj,k.

One may check that the restricted MLE is

π̂⁰j,k = (π̂j,+) × (π̂+,k),

where

π̂j,+ := Σ_{k=1}^s π̂j,k,   π̂+,k := Σ_{j=1}^r π̂j,k.

The log-likelihood ratio test statistic is thus

2Ln(θ̂n) − 2Ln(θ̂n⁰) = 2 Σ_{j=1}^r Σ_{k=1}^s Nj,k [ log( Nj,k/n ) − log( Nj,+ N+,k/n² ) ]
                     = 2 Σ_{j=1}^r Σ_{k=1}^s Nj,k log( n Nj,k/(Nj,+ N+,k) ).

Its approximation as given in Corollary 14.10.1 is

2Ln(θ̂n) − 2Ln(θ̂n⁰) ≈ n Σ_{j=1}^r Σ_{k=1}^s ( Nj,k − Nj,+ N+,k/n )² / ( Nj,+ N+,k ).

This is Pearson’s chi-squared test statistic for testing independence.


To find out what the value of q is in this example, we first observe that the
unrestricted case has p = rs − 1 free parameters. Under the null-hypothesis,

there remain (r − 1) + (s − 1) free parameters. Hence, the number of restrictions is

q = rs − 1 − ( (r − 1) + (s − 1) ) = (r − 1)(s − 1).

Thus, under H0 : πj,k = (πj,+) × (π+,k) ∀ (j, k), we have

n Σ_{j=1}^r Σ_{k=1}^s ( Nj,k − Nj,+ N+,k/n )² / ( Nj,+ N+,k ) −→^{Dθ} χ²(r−1)(s−1).
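A tiny worked example (the counts are invented for illustration): for the 2 × 2 table with rows (20, 30) and (30, 20), every expected count is 25 and each cell deviates from it by 5, so the statistic is 4 on 1 degree of freedom:

```python
# Pearson's chi-squared test of independence for a toy 2x2 contingency table.
N = [[20, 30],
     [30, 20]]
n = sum(sum(row) for row in N)
r, s = len(N), len(N[0])
row = [sum(N[j]) for j in range(r)]                       # N_{j,+}
col = [sum(N[j][k] for j in range(r)) for k in range(s)]  # N_{+,k}

stat = n * sum((N[j][k] - row[j] * col[k] / n) ** 2 / (row[j] * col[k])
               for j in range(r) for k in range(s))
df = (r - 1) * (s - 1)
print(stat, df)   # 4.0 and 1
```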
Chapter 15

Abstract asymptotics ?

In Subsection 2.4.1 we discussed so-called plug-in estimators. The idea is that an estimator γ̂n can (typically) be written as some functional Q of the empirical distribution P̂n. When P is the "true" distribution and γ = Q(P), then the point is to study how close γ̂n = Q(P̂n) is to γ = Q(P) when P̂n is close to P. This is the topic of the first part of this chapter.

In Section 5.5 we obtained the Cramér Rao lower bound. A somewhat disap-
pointing result was that it can only be reached within exponential families (see
Lemma 5.6.1). We have also seen in Section 14.4 that the MLE is asymptoti-
cally unbiased and reaches asymptotically the CRLB (its asymptotic covariance
matrix is I(θ)−1 , where I(θ) is the Fisher information for estimating θ). The
topic of the second part of this chapter is to show (for the one-dimensional case
for simplicity) that I(θ) is indeed asymptotically the efficient variance.

One may now wonder why the inverse of I(θ) appears. Think of it in this way. The Fisher information was obtained by looking at derivatives of the map

θ ↦ log pθ.

But what actually plays a role here is the inverse map

P ↦ θ,

or, in case γ = g(θ) is the parameter of interest, the map

P ↦ γ.

Then remember that the derivative of the inverse of a function (say f : R → R) is the inverse of its derivative. In our case the mapping P ↦ γ is rather abstract, so studying its derivatives, as done in the first part of this chapter, requires some new notions.


15.1 Plug-in estimators ?

When X is Euclidean space, one can define the distribution function F (x) :=
Pθ (X ≤ x) and the empirical distribution function

F̂n(x) = (1/n) #{Xi ≤ x, 1 ≤ i ≤ n}.
This is the distribution function of a probability measure that puts mass 1/n at
each observation. For general X , we define likewise the empirical distribution P̂n
as the distribution that puts mass 1/n at each observation, i.e., more formally
P̂n := (1/n) Σ_{i=1}^n δXi,

where δx is a point mass at x. Thus, for (measurable) sets A ⊂ X,

P̂n(A) = (1/n) #{Xi ∈ A, 1 ≤ i ≤ n}.
For (measurable) functions f : X → Rr (for some r ∈ N), we write, as in previous sections,

P̂n f := (1/n) Σ_{i=1}^n f(Xi) = ∫ f dP̂n.

Thus, for sets,


P̂n (A) = P̂n lA .
Again, as in previous sections, we use the same notation for expectations under Pθ:

Pθ f := Eθ f(X) = ∫ f dPθ,

so that

Pθ(A) = Pθ lA.

The parameter of interest is denoted as

γ = g(θ) ∈ Rp .

It can often be written in the form

γ = Q(Pθ ),

where Q is some functional on (a superset of) the model class P. Assuming Q is


also defined at the empirical measure P̂n , the plug-in estimator of γ is now

Tn := Q(P̂n ).

Conversely,

Definition 15.1.1 If a statistic Tn can be written as Tn = Q(P̂n ), then it is


called a Fisher-consistent estimator of γ = g(θ), if Q(Pθ ) = g(θ) for all θ ∈ Θ.
We will also encounter modifications, where

Tn = Qn (P̂n ),

and for n large,


Qn (Pθ ) ≈ Q(Pθ ) = g(θ).
Example 15.1.1 Plug-in estimator of (functions of ) the mean
Consider a given f : X → Rr and a given h : Rr → Rp . Let γ := h(Pθ f ). The
plug-in estimator is then Tn = h(P̂n f ).
Example 15.1.2 M- and Z-estimators are plug-in estimators
The M-estimator

γ̂n := arg min_{c∈Γ} P̂n ρc

is a plug-in estimator of

γ = arg min_{c∈Γ} Pθ ρc.

Similarly, the Z-estimator γ̂n, as solution of

P̂n ψc |c=γ̂n = 0,

is a plug-in estimator of the solution γ of

Pθ ψc |c=γ = 0.

Example 15.1.3 The α-trimmed mean as plug-in estimator


Let X = R and consider the α-trimmed mean

Tn := (1/(n − 2[nα])) Σ_{i=[nα]+1}^{n−[nα]} X(i).

What is its theoretical counterpart? Because the i-th order statistic X(i) can be written as

X(i) = F̂n⁻¹(i/n),

and in fact

X(i) = F̂n⁻¹(u), i/n ≤ u < (i + 1)/n,

we may write, for αn := [nα]/n,

Tn = (n/(n − 2[nα])) · (1/n) Σ_{i=[nα]+1}^{n−[nα]} F̂n⁻¹(i/n)
   = (1/(1 − 2αn)) ∫_{αn+1/n}^{1−αn} F̂n⁻¹(u) du =: Qn(P̂n).

Replacing F̂n by F gives

Qn(F) = (1/(1 − 2αn)) ∫_{αn+1/n}^{1−αn} F⁻¹(u) du
      ≈ (1/(1 − 2α)) ∫_{α}^{1−α} F⁻¹(u) du
      = (1/(1 − 2α)) ∫_{F⁻¹(α)}^{F⁻¹(1−α)} x dF(x)
      =: Q(Pθ).
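A direct implementation of Tn from the order statistics (illustrative, not from the text) also shows the trimming at work on a sample with one gross outlier:

```python
# The alpha-trimmed mean computed from the order statistics: discard the
# [n*alpha] smallest and largest observations and average the rest.
def trimmed_mean(xs, alpha):
    xs = sorted(xs)
    n = len(xs)
    m = int(n * alpha)          # [n * alpha]
    core = xs[m:n - m]
    return sum(core) / len(core)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]   # one gross outlier
print(trimmed_mean(data, 0.2))             # 5.5: averages x_(3), ..., x_(8)
print(sum(data) / len(data))               # 104.5: the mean is ruined
```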

Example 15.1.4 Histogram as plug-in estimator of a density


Let X = R, and suppose X has density f w.r.t. Lebesgue measure. Suppose f is the parameter of interest. We may write

f(x) = lim_{h→0} ( F(x + h) − F(x − h) )/(2h).

Replacing F by F̂n here does not make sense. Thus, this is an example where Q(P) = f is only well defined for distributions P that have a density f. We may however slightly extend the plug-in idea, by using the estimator

f̂n(x) := ( F̂n(x + hn) − F̂n(x − hn) )/(2hn) =: Qn(P̂n),

with hn "small" (hn → 0 as n → ∞).
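(A hypothetical numerical check, not in the text.) For Uniform(0, 1) data the true density is 1 on (0, 1), and the estimator recovers it approximately at an interior point:

```python
# The difference-quotient density estimator
# f_hat(x) = (F_hat(x + h) - F_hat(x - h)) / (2h) on simulated Uniform(0,1)
# data; the true density at x = 0.5 is 1.
import random

def density_estimate(xs, x, h):
    # F_hat(x+h) - F_hat(x-h) = (1/n) #{ x - h < X_i <= x + h }
    return sum(x - h < xi <= x + h for xi in xs) / (2 * h * len(xs))

rng = random.Random(5)
xs = [rng.random() for _ in range(10000)]
est = density_estimate(xs, 0.5, 0.1)
print(round(est, 3))   # close to 1
```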

15.2 Consistency of plug-in estimators ?

We first present the uniform convergence of the empirical distribution function


to the theoretical one.
Such uniform convergence results hold also in much more general settings (see
also (14.2) in the proof of consistency for M-estimators).
Theorem 15.2.1 (Glivenko-Cantelli) Let X = R. We have

sup_x |F̂n(x) − F(x)| → 0, IPθ-a.s.

Proof. We know that by the law of large numbers, for all x,

|F̂n(x) − F(x)| → 0, IPθ-a.s.,

so also for every finite collection a1, . . . , aN,

max_{1≤j≤N} |F̂n(aj) − F(aj)| → 0, IPθ-a.s.

Let ε > 0 be arbitrary, and take a0 < a1 < · · · < aN−1 < aN in such a way that

F(aj) − F(aj−1) ≤ ε, j = 1, . . . , N,

where F(a0) := 0 and F(aN) := 1. Then, for x ∈ (aj−1, aj],

F̂n(x) − F(x) ≤ F̂n(aj) − F(aj−1) ≤ F̂n(aj) − F(aj) + ε,

and

F̂n(x) − F(x) ≥ F̂n(aj−1) − F(aj) ≥ F̂n(aj−1) − F(aj−1) − ε,

so

sup_x |F̂n(x) − F(x)| ≤ max_{1≤j≤N} |F̂n(aj) − F(aj)| + ε → ε, IPθ-a.s.

Since ε > 0 was arbitrary, the result follows. □
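(A seeded simulation for illustration, not from the text.) For Uniform(0, 1) samples, sup_x |F̂n(x) − x| can be computed exactly from the order statistics, and it visibly shrinks as n grows:

```python
# Glivenko-Cantelli numerically: the Kolmogorov-Smirnov distance
# sup_x |F_hat_n(x) - x| for Uniform(0,1) samples shrinks with n.
import random

def ks_uniform(xs):
    xs = sorted(xs)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(6)
d_small = ks_uniform([rng.random() for _ in range(100)])
d_large = ks_uniform([rng.random() for _ in range(10000)])
print(round(d_small, 4), round(d_large, 4))
```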
Example 15.2.1 Consistency of the sample median
Let X = R and let F be the distribution function of X. We consider estimating the median γ := F⁻¹(1/2). We assume F to be continuous and strictly increasing. The sample median is

Tn := F̂n⁻¹(1/2) := X((n+1)/2) if n is odd, and [X(n/2) + X(n/2+1)]/2 if n is even.

So

F̂n(Tn) = 1/2 + 1/(2n) if n is odd, and F̂n(Tn) = 1/2 if n is even.

It follows that

|F(Tn) − F(γ)| ≤ |F̂n(Tn) − F(Tn)| + |F̂n(Tn) − F(γ)|
             = |F̂n(Tn) − F(Tn)| + |F̂n(Tn) − 1/2|
             ≤ |F̂n(Tn) − F(Tn)| + 1/(2n) → 0, IPθ-a.s.,

where the first term tends to zero by the Glivenko-Cantelli theorem. Since F is continuous and strictly increasing, this gives F̂n⁻¹(1/2) = Tn → γ = F⁻¹(1/2), IPθ-a.s., i.e., the sample median is a consistent estimator of the population median.

15.3 Asymptotic normality of plug-in estimators ?

Let γ := Q(P ) ∈ Rp be the parameter of interest. The idea in this subsection is


to apply a δ-method, but now in a nonparametric framework. The parametric
δ-method says that if θ̂n is an asymptotically linear estimator of θ ∈ Rp , and if
γ = g(θ) is some function of the parameter θ, with g being differentiable at θ,
then γ̂ is an asymptotically linear estimator of γ. Now, we write γ = Q(P ) as
a function of the probability measure P (with P = Pθ , so that g(θ) = Q(Pθ )).
We let P play the role of θ, i.e., we use the probability measures themselves as
parameterization of P. We then have to redefine differentiability in an abstract
setting, namely we differentiate w.r.t. P .
Definition 15.3.1
◦ The influence function of Q at P is

lP(x) := lim_{ε↓0} ( Q((1 − ε)P + εδx) − Q(P) ) / ε, x ∈ X,

whenever the limit exists.


◦ The map Q is called Gâteaux differentiable at P if for all probability measures P̃, we have

lim_{ε↓0} ( Q((1 − ε)P + εP̃) − Q(P) ) / ε = EP̃ lP(X).

◦ Let d be some (pseudo-)metric on the space of probability measures. The map Q is called Fréchet differentiable at P, with respect to the metric d, if

Q(P̃) − Q(P) = EP̃ lP(X) + o(d(P̃, P)).

Remark 1 In line with the notation introduced previously, we write for a function f : X → Rr and a probability measure P̃ on X

P̃ f := EP̃ f(X).

Remark 2 If Q is Fréchet or Gâteaux differentiable at P, then

P lP (:= EP lP(X)) = 0.

Remark 3 If Q is Fréchet differentiable at P, and if moreover

d((1 − ε)P + εP̃, P) = O(ε), ε ↓ 0,

then Q is Gâteaux differentiable at P:

Q((1 − ε)P + εP̃) − Q(P) = ((1 − ε)P + εP̃) lP + o(ε)
                         = ε P̃ lP + o(ε).
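(A toy check of the definition, not part of the text; the numbers µ = 2 and x = 5 are arbitrary.) For the mean functional Q(P) = P f with f(x) = x, one has Q((1 − ε)P + εδx) = (1 − ε)µ + εx, so the difference quotient equals x − µ for every ε, recovering the influence function lP(x) = x − µ:

```python
# Influence function of the mean functional Q(P) = E_P X, computed from the
# definition; the contaminated value is linear in eps, so the difference
# quotient equals x - mu (up to float rounding) for every eps > 0.
mu = 2.0          # mean of a hypothetical P

def Q_contaminated(eps, x):
    return (1 - eps) * mu + eps * x

x = 5.0
for eps in (0.1, 0.01, 0.001):
    dq = (Q_contaminated(eps, x) - mu) / eps
    print(eps, dq)   # approximately 3.0 = x - mu each time
```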

The following result says that Fréchet differentiable functionals are generally
asymptotically linear.
Lemma 15.3.1 Suppose that Q is Fréchet differentiable at P with influence
function lP , and that
d(P̂n , P ) = OIP (n−1/2 ). (15.1)
Then
Q(P̂n ) − Q(P ) = P̂n lP + oIP (n−1/2 ).

Proof. This follows immediately from the definition of Fréchet differentiability.


□
Corollary 15.3.1 Assume the conditions of Lemma 15.3.1, with influence function lP satisfying VP := P lP lPᵀ < ∞. Then

√n ( Q(P̂n) − Q(P) ) −→^{DP} N(0, VP).

An example where (15.1) holds
Suppose X = R and that we take

d(P̃, P) := sup_x |F̃(x) − F(x)|.

Then indeed d(P̂n , P ) = OIP (n−1/2 ). This follows from Donsker’s theorem,
which we state here without proof:
Theorem 15.3.1 (Donsker's theorem) Suppose F is continuous. Then

sup_x √n |F̂n(x) − F(x)| −→^{D} Z,

where the random variable Z has distribution function

G(z) = 1 − 2 Σ_{j=1}^∞ (−1)^{j+1} exp[−2j²z²], z ≥ 0.
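(A numerical side note, not in the text.) The series for G converges very fast; for instance G(1.358) ≈ 0.95, which is why 1.358/√n is the usual 95% critical value of the Kolmogorov-Smirnov statistic:

```python
# Evaluating the limiting Kolmogorov distribution
# G(z) = 1 - 2 * sum_{j>=1} (-1)^{j+1} exp(-2 j^2 z^2).
import math

def G(z, terms=100):
    s = sum((-1) ** (j + 1) * math.exp(-2.0 * j * j * z * z)
            for j in range(1, terms + 1))
    return 1.0 - 2.0 * s

print(round(G(1.358), 4))   # approximately 0.95
```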

Fréchet differentiability is generally quite hard to prove, and often not even true.
We will sketch how to establish Gâteaux differentiability in two examples.
Example 15.3.1 Asymptotic linearity of the Z-estimator
We consider again the asymptotic linearity of the Z-estimator. Throughout this example, we assume enough regularity. Let γ be defined by the equation

P ψγ = 0.

Let Pε := (1 − ε)P + εP̃, and let γε be a solution of the equation

Pε ψγε = 0.

We assume that as ε ↓ 0, also γε → γ. It holds that

(1 − ε) P ψγε + ε P̃ ψγε = 0,

so

P ψγε + ε (P̃ − P) ψγε = 0,

and hence

P (ψγε − ψγ) + ε (P̃ − P) ψγε = 0.

Assuming differentiability of c ↦ P ψc, we obtain

P (ψγε − ψγ) = ( (∂/∂cᵀ) P ψc |c=γ ) (γε − γ) + o(|γε − γ|)
             =: MP (γε − γ) + o(|γε − γ|).

Moreover, again under regularity,

ε (P̃ − P) ψγε = ε (P̃ − P) ψγ + ε (P̃ − P)(ψγε − ψγ)
              = ε (P̃ − P) ψγ + o(ε) = ε P̃ ψγ + o(ε).

It follows that

MP (γε − γ) + o(|γε − γ|) + ε (P̃ − P) ψγ + o(ε) = 0,

or, assuming MP to be invertible,

(γε − γ)(1 + o(1)) = −ε MP⁻¹ P̃ ψγ + o(ε),

which gives

(γε − γ)/ε → −MP⁻¹ P̃ ψγ.

The influence function is thus (as already seen in Section 14.3)

lP = −MP⁻¹ ψγ.
Example 15.3.2 Asymptotic linearity of the α-trimmed mean
The α-trimmed mean is a plug-in estimator of

γ := Q(P) = (1/(1 − 2α)) ∫_{F⁻¹(α)}^{F⁻¹(1−α)} x dF(x).

Using partial integration, we may write this as

(1 − 2α)γ = (1 − α)F⁻¹(1 − α) − αF⁻¹(α) − ∫_α^{1−α} v dF⁻¹(v).

The influence function of the quantile F⁻¹(v) is

qv(x) = −(1/f(F⁻¹(v))) ( l{x ≤ F⁻¹(v)} − v )

(see Example 14.5.1), i.e., for the distribution Pε = (1 − ε)P + εP̃, with distribution function Fε = (1 − ε)F + εF̃, we have

lim_{ε↓0} ( Fε⁻¹(v) − F⁻¹(v) ) / ε = P̃ qv = −(1/f(F⁻¹(v))) ( F̃(F⁻¹(v)) − v ).

Hence, for Pε = (1 − ε)P + εP̃,

(1 − 2α) lim_{ε↓0} ( Q((1 − ε)P + εP̃) − Q(P) ) / ε
= (1 − α) P̃ q_{1−α} − α P̃ qα − ∫_α^{1−α} v d(P̃ qv)
= − ∫_α^{1−α} (1/f(F⁻¹(v))) ( F̃(F⁻¹(v)) − v ) dv
= − ∫_{F⁻¹(α)}^{F⁻¹(1−α)} (1/f(u)) ( F̃(u) − F(u) ) dF(u)
= − ∫_{F⁻¹(α)}^{F⁻¹(1−α)} ( F̃(u) − F(u) ) du
= (1 − 2α) P̃ lP,
= (1 − 2α)P̃ lP ,

where

lP(x) = −(1/(1 − 2α)) ∫_{F⁻¹(α)}^{F⁻¹(1−α)} ( l{x ≤ u} − F(u) ) du.

We conclude that, under regularity conditions, the α-trimmed mean is asymptotically linear with the above influence function lP, and hence asymptotically normal with asymptotic variance P lP².

15.4 Asymptotic Cramer Rao lower bound ?

Let X have distribution P ∈ {Pθ : θ ∈ Θ}. We assume for simplicity that


Θ ⊂ R and that θ is the parameter of interest. Let Tn be an estimator of θ.
Throughout this section, we take certain, sometimes unspecified, regularity
conditions for granted.

In particular, we assume that P is dominated by some σ-finite measure ν, and


that the Fisher-information
I(θ) := Eθ sθ²(X)

exists for all θ. Here, sθ is the score function

sθ := (d/dθ) log pθ = ṗθ/pθ,

with pθ := dPθ/dν.
Recall now that if Tn is an unbiased estimator of θ, then by the Cramer Rao
lower bound, 1/I(θ) is a lower bound for its variance (under regularity Condi-
tions I and II, see Section 5.5).
Definition 15.4.1 Suppose that

√n (Tn − θ) −→^{Dθ} N(bθ, Vθ), ∀ θ.

Then bθ is called the asymptotic bias, and Vθ the asymptotic variance. The estimator Tn is called asymptotically unbiased if bθ = 0 for all θ. If Tn is asymptotically unbiased and moreover Vθ = 1/I(θ) for all θ, and some regularity conditions hold, then Tn is called asymptotically efficient.
Remark 1 The assumptions in the above definition are for all θ. Clearly, if one only looks at one fixed given θ0, it is easy to construct a super-efficient estimator, namely Tn = θ0. More generally, to avoid this kind of super-efficiency, one does not only require conditions to hold for all θ, but in fact uniformly in θ, or for all sequences {θn}. The regularity one needs here involves the idea that one actually needs to allow for sequences θn of the form θn = θ + h/√n. In fact, the regularity requirement is that also, for all h,

√n (Tn − θn) −→^{Dθn} N(0, Vθ).

To make all this mathematically precise is quite involved. We refer to van der
Vaart (1998). A glimpse is given in Le Cam’s 3rd Lemma, see the next subsection.
Remark 2 Note that when θ = θn is allowed to change with n, this means that the distribution of Xi can change with n, and hence Xi can change with n. Instead of regarding the sample X1, . . . , Xn as the first n of an infinite sequence, we now consider for each n a new sample, say X1,1, . . . , Xn,n.
Remark 3 We have seen that the MLE θ̂n generally is indeed asymptotically
unbiased with asymptotic variance Vθ equal to 1/I(θ), i.e., under regularity
assumptions, the MLE is asymptotically efficient.
For asymptotically linear estimators, with influence function lθ, one has asymptotic variance Vθ = Eθ lθ²(X). The next lemma indicates that generally 1/I(θ) is indeed a lower bound for the asymptotic variance.
Lemma 15.4.1 Suppose asymptotic linearity:

Tn − θ = (1/n) Σ_{i=1}^n lθ(Xi) + o_{IPθ}(n^{−1/2}),

where Eθ lθ(X) = 0, Eθ lθ²(X) := Vθ < ∞. Assume moreover that

Eθ lθ (X)sθ (X) = 1. (15.2)

Then

Vθ ≥ 1/I(θ).

Proof. This follows from the Cauchy-Schwarz inequality:

1 = |covθ(lθ(X), sθ(X))|² ≤ varθ(lθ(X)) varθ(sθ(X)) = Vθ I(θ).

□
It may look like a coincidence when in a special case equality (15.2) indeed holds. But actually, it is true in quite a few cases. This may at first seem like magic.
We consider two examples. To simplify the expressions, we again write shorthand

Pθ f := Eθ f(X).
Example 15.4.1 Equation (15.2) for Z-estimators
This example examines the Z-estimator of θ. Then we have, for P = Pθ ,

P ψθ = 0.

The influence function is


lθ = −ψθ /Mθ ,

where

Mθ := (d/dθ) P ψθ.

Under regularity, we have

Mθ = P ψ̇θ = ∫ ψ̇θ pθ dν,   where ψ̇θ = (d/dθ) ψθ.

We may also write

Mθ = − ∫ ψθ ṗθ dν,   where ṗθ = (d/dθ) pθ.

This follows from the chain rule

(d/dθ)(ψθ pθ) = ψ̇θ pθ + ψθ ṗθ,

and (under regularity)

∫ (d/dθ)(ψθ pθ) dν = (d/dθ) ∫ ψθ pθ dν = (d/dθ) P ψθ = (d/dθ) 0 = 0.
Thus

P lθ sθ = −Mθ^{−1} P ψθ sθ = −Mθ^{−1} ∫ ψθ ṗθ dν = 1,

that is, (15.2) holds.
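As a sanity check, the identity (15.2) can be probed by simulation. The sketch below (a Monte Carlo illustration, not part of the original text) takes the median as Z-estimator in the N(θ, 1) location model: there ψθ(x) = sign(x − θ), Mθ = −2pθ(θ), hence lθ(x) = sign(x − θ)/(2pθ(θ)), while the score is sθ(x) = x − θ.

```python
import numpy as np

# Monte Carlo check of (15.2) for the median as Z-estimator in the N(theta, 1)
# location model: psi_theta(x) = sign(x - theta), M_theta = -2 p_theta(theta),
# so l_theta(x) = sign(x - theta) / (2 p_theta(theta)), and s_theta(x) = x - theta.
rng = np.random.default_rng(0)
theta = 1.3
x = rng.normal(loc=theta, scale=1.0, size=200_000)
p_at_theta = 1.0 / np.sqrt(2.0 * np.pi)   # the N(theta, 1) density at theta
l_theta = np.sign(x - theta) / (2.0 * p_at_theta)
s_theta = x - theta
inner_product = np.mean(l_theta * s_theta)
print(inner_product)  # close to 1, as (15.2) predicts
```

Indeed, Eθ[sign(X − θ)(X − θ)] = E|Z| = √(2/π) for Z standard normal, and 2pθ(θ) = 2/√(2π) = √(2/π), so the ratio is exactly 1.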


Example 15.4.2 Equation (15.2) for plug-in estimators
We consider now the plug-in estimator Q(P̂n). Suppose that Q is Fisher consistent (i.e., Q(Pθ) = θ for all θ). Assume moreover that Q is Fréchet differentiable with respect to the metric d, at all Pθ, and that

d(Pθ̃ , Pθ ) = O(|θ̃ − θ|).

Then, by the definition of Fréchet differentiability

h = Q(Pθ+h ) − Q(Pθ ) = Pθ+h lθ + o(|h|) = (Pθ+h − Pθ )lθ + o(|h|),

or, as h → 0,

1 = (Pθ+h − Pθ)lθ / h + o(1) = (1/h) ∫ lθ (pθ+h − pθ) dν + o(1) → ∫ lθ ṗθ dν = Pθ(lθ sθ).

So (15.2) holds.

15.5 Le Cam’s 3rd Lemma ?

The following example serves as a motivation to consider sequences θn depending on n. It shows that pointwise asymptotics can be very misleading.

Example 15.5.1 Hodges-Lehmann example of super-efficiency


Let X1, . . . , Xn be i.i.d. copies of X, where X = θ + ε, and ε is N(0, 1)-distributed. Consider the estimator

Tn := { X̄n,     if |X̄n| > n^{−1/4},
      { X̄n/2,   if |X̄n| ≤ n^{−1/4}.

Then

√n(Tn − θ) −→^{Dθ} { N(0, 1),     θ ≠ 0,
                   { N(0, 1/4),   θ = 0.
So the pointwise asymptotics show that Tn can be more efficient than the sample average X̄n. But what happens if we consider sequences θn? For example, let θn = h/√n. Then, under IPθn, X̄n = ε̄n + h/√n = O_{IPθn}(n^{−1/2}). Hence, IPθn(|X̄n| > n^{−1/4}) → 0, so that IPθn(Tn = X̄n) → 0. Thus,

√n(Tn − θn) = √n(Tn − θn) l{Tn = X̄n} + √n(Tn − θn) l{Tn = X̄n/2} −→^{Dθn} N(−h/2, 1/4).

The asymptotic mean square error AMSEθ(Tn) is defined as the asymptotic variance + asymptotic squared bias:

AMSEθn(Tn) = (1 + h²)/4.

The AMSEθ(X̄n) of X̄n is its normalized non-asymptotic mean square error, which is

AMSEθn(X̄n) = MSEθn(X̄n) = 1.
So when h is large enough, the asymptotic mean square error of Tn is larger
than that of X̄n .
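The contrast between the two asymptotic regimes can be seen in a simulation. The sketch below (sample size, h and the number of repetitions are arbitrary choices) uses that X̄n ∼ N(θ, 1/n) exactly, so X̄n can be drawn directly.

```python
import numpy as np

# Hodges-Lehmann estimator: T_n = Xbar_n if |Xbar_n| > n^(-1/4), else Xbar_n / 2.
# Under theta = 0 it looks super-efficient; under theta_n = h / sqrt(n) it is biased.
rng = np.random.default_rng(1)
n, reps, h = 10_000, 5_000, 3.0

for theta in (0.0, h / np.sqrt(n)):
    xbar = theta + rng.normal(0.0, 1.0, size=reps) / np.sqrt(n)  # Xbar_n ~ N(theta, 1/n)
    t_n = np.where(np.abs(xbar) > n ** (-0.25), xbar, xbar / 2.0)
    z = np.sqrt(n) * (t_n - theta)
    print(f"theta = {theta:.4f}: mean {z.mean():+.3f}, variance {z.var():.3f}")
# theta = 0: roughly N(0, 1/4); theta_n = h/sqrt(n): roughly N(-h/2, 1/4)
```

For these values of n and h one has |X̄n| ≤ n^{−1/4} in essentially every repetition, so Tn = X̄n/2 throughout: the halving that helps at θ = 0 produces the bias −h/2 under θn.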
Le Cam’s 3rd lemma shows that asymptotic linearity for all θ implies asymptotic normality, now also for sequences θn = θ + h/√n. The asymptotic variance for such sequences θn does not change. Moreover, if (15.2) holds for all θ, the estimator is also asymptotically unbiased under IPθn.
Lemma 15.5.1 (Le Cam’s 3rd Lemma) Suppose that for all θ,

Tn − θ = (1/n) Σ_{i=1}^n lθ(Xi) + o_{IPθ}(n^{−1/2}),

where Pθ lθ = 0, and Vθ := Pθ lθ² < ∞. Then, under regularity conditions,

√n(Tn − θn) −→^{Dθn} N( {Pθ(lθ sθ) − 1} h, Vθ ).

We will present a sketch of the proof of this lemma. For this purpose, we need
the following auxiliary lemma.

Lemma 15.5.2 (Auxiliary lemma) Let Z ∈ R² be N(µ, Σ)-distributed, where

µ = ( µ1 )      Σ = ( σ1²    σ1,2 )
    ( µ2 ),         ( σ1,2   σ2²  ).

Suppose that

µ2 = −σ2²/2.

Let Y ∈ R² be N(µ + a, Σ)-distributed, with

a = ( σ1,2 )
    ( σ2²  ).

Let φZ be the density of Z and φY be the density of Y. Then we have the following equality for all z = (z1, z2) ∈ R²:

φZ(z) e^{z2} = φY(z).

Proof. The density of Z is

φZ(z) = (1/(2π √det(Σ))) exp( −(1/2)(z − µ)^T Σ^{−1} (z − µ) ).

Now, one easily sees that

Σ^{−1} a = (0, 1)^T.

So

(1/2)(z − µ)^T Σ^{−1}(z − µ) = (1/2)(z − µ − a)^T Σ^{−1}(z − µ − a) + a^T Σ^{−1}(z − µ) − (1/2) a^T Σ^{−1} a

and

a^T Σ^{−1}(z − µ) − (1/2) a^T Σ^{−1} a = (0, 1)(z − µ) − (1/2)(0, 1) a = z2 − µ2 − (1/2)σ2² = z2.

□
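The identity of the auxiliary lemma can also be checked numerically at arbitrary points. The sketch below (with arbitrarily chosen σ1, σ1,2, σ2 and µ1; only µ2 = −σ2²/2 is required) evaluates both sides of φZ(z) e^{z2} = φY(z).

```python
import numpy as np

# numeric check of the auxiliary lemma: phi_Z(z) * exp(z_2) = phi_Y(z)
def normal_density(z, mu, Sigma):
    """Bivariate normal density N(mu, Sigma) at the point z."""
    d = z - mu
    quad = d @ np.linalg.solve(Sigma, d)
    return np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(Sigma)))

s1, s12, s2 = 1.5, 0.4, 0.8                  # arbitrary, with Sigma positive definite
Sigma = np.array([[s1**2, s12], [s12, s2**2]])
mu = np.array([0.3, -s2**2 / 2.0])           # mu_2 = -sigma_2^2 / 2, as required
a = np.array([s12, s2**2])                   # the shift a = (sigma_{1,2}, sigma_2^2)^T
rng = np.random.default_rng(2)
ok = all(
    np.isclose(normal_density(z, mu, Sigma) * np.exp(z[1]),
               normal_density(z, mu + a, Sigma))
    for z in rng.normal(size=(20, 2))
)
print(ok)  # True
```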
Sketch of proof of Le Cam’s 3rd Lemma. Set

Λn := Σ_{i=1}^n ( log pθn(Xi) − log pθ(Xi) ).

Then under IPθ, by a two-term Taylor expansion,

Λn ≈ (h/√n) Σ_{i=1}^n sθ(Xi) + (h²/2) (1/n) Σ_{i=1}^n ṡθ(Xi)
   ≈ (h/√n) Σ_{i=1}^n sθ(Xi) − (h²/2) I(θ),

as

(1/n) Σ_{i=1}^n ṡθ(Xi) ≈ Eθ ṡθ(X) = −I(θ).

We moreover have, by the assumed asymptotic linearity, under IPθ,

√n(Tn − θ) ≈ (1/√n) Σ_{i=1}^n lθ(Xi).

Thus,

( √n(Tn − θ), Λn )^T −→^{Dθ} Z,

where Z ∈ R² has the two-dimensional normal distribution:

Z = ( Z1 ) ∼ N( (      0       )   ( Vθ            hPθ(lθ sθ) ) )
    ( Z2 )      ( −(h²/2) I(θ) ) , ( hPθ(lθ sθ)    h² I(θ)    ) ).

Thus, we know that for all bounded and continuous f : R² → R, one has

IEθ f(√n(Tn − θ), Λn) → IE f(Z1, Z2).

Now, let f : R → R be bounded and continuous. Then, since

∏_{i=1}^n pθn(Xi) = ∏_{i=1}^n pθ(Xi) e^{Λn},

we may write

IEθn f(√n(Tn − θ)) = IEθ f(√n(Tn − θ)) e^{Λn}.
The function (z1 , z2 ) 7→ f (z1 )ez2 is continuous, but not bounded. However,
one can show that one may extend the Portmanteau Theorem to this situation.
This then yields

IEθ f(√n(Tn − θ)) e^{Λn} → IE f(Z1) e^{Z2}.
Now, apply the auxiliary Lemma, with

µ = (      0       )      Σ = ( Vθ            hPθ(lθ sθ) )
    ( −(h²/2) I(θ) ),         ( hPθ(lθ sθ)    h² I(θ)    ).

Then we get

IE f(Z1) e^{Z2} = ∫ f(z1) e^{z2} φZ(z) dz = ∫ f(z1) φY(z) dz = IE f(Y1),

where

Y = ( Y1 ) ∼ N( ( hPθ(lθ sθ) )   ( Vθ            hPθ(lθ sθ) ) )
    ( Y2 )      ( (h²/2) I(θ) ) , ( hPθ(lθ sθ)    h² I(θ)    ) ),

so that
Y1 ∼ N (hPθ (lθ sθ ), Vθ ).

So we conclude that

√n(Tn − θ) −→^{Dθn} Y1 ∼ N(hPθ(lθ sθ), Vθ).

Hence

√n(Tn − θn) = √n(Tn − θ) − h −→^{Dθn} N( h{Pθ(lθ sθ) − 1}, Vθ ).

□
Chapter 16

Complexity regularization

Suppose again the framework where we have i.i.d. copies X1 , . . . , Xn of a pop-


ulation random variable X, and that we model the distribution P of X as
P ∈ P = {Pθ : θ ∈ Θ}. If Θ is finite-dimensional, one can regard its dimen-
sion, p say, as a measure of the complexity of the parameter space Θ. If p is
larger than n, there are more parameters than observations and one may intu-
itively already see that this is not a statistically favourable situation. It is in a
way comparable to a system with more unknowns than equations: an ill-posed
system. A model with very many parameters may fit the data very well, but
will have little predictive power. With too many parameters, there is a danger
of overfitting. Complexity regularization can be roughly seen as a way to deal
with a complex parameter space by letting the data decide which sub-model
yields a good trade-off between the approximation error and estimation error.

We note that even when the parameter space is ∞-dimensional one need not
always apply complexity regularization. The dimension of Θ is in itself not
always the best description of complexity. If Θ is a metric space, one can apply
its so-called entropy as a measure of complexity. The details go beyond the
scope of these lecture notes. We instead will give the non-parametric and the
high-dimensional regression problem as prototype examples.

16.1 Non-parametric regression

Consider n real-valued response variables Y1 , . . . , Yn which depend on some


fixed co-variables x1 , . . . , xn (with xi in some space X for all i) in the following
manner:
Yi = f(xi) + εi,   i = 1, . . . , n,

where ε1, . . . , εn is unobservable noise, and where f is an unknown function.


Suppose we think of using the least squares estimator for estimating f . Let
Y := (Y1 , . . . , Yn )T . With some abuse of notation, we identify a function f :


X → R with the vector

f := (f(x1), . . . , f(xn))^T ∈ R^n.

The least squares estimator is

f̂LS := arg min_{f ∈ R^n} ‖Y − f‖_2^2.

Clearly, if all xi are distinct, then f̂LS = Y and one has a perfect fit

‖Y − f̂LS‖_2^2 = 0.

This is a typical instance of overfitting. The estimator fˆLS just reproduces the
data and has no predictive power.
We need a model for f , say f ∈ F with F some class of functions.

16.2 Smoothness classes

Suppose X is some interval in R, say X = [0, 1]. It depends on the situation


but a reasonable assumption on f may be that it is not too wiggly. One may formulate that mathematically for example by saying that f is differentiable with derivative f′ with |f′| within bounds. One may for example measure the
roughness of f by the (Sobolev) semi-norm of its derivative

( ∫_0^1 |f′(x)|² dx )^{1/2}.

One way to go is now to do least squares over all f under the restriction that ∫_0^1 |f′(x)|² dx ≤ M² where M is a given constant. It turns out that a more
flexible approach is to apply the Lagrangian version of this. Then, choose a
tuning parameter λ ≥ 0 and define the estimator f̂ as

f̂ := arg min_f { ‖Y − f‖_2^2 + λ² ∫_0^1 |f′(x)|² dx }.

This is called (a version of) Tikhonov regularization. One may view

λ² ∫_0^1 |f′(x)|² dx

as a penalty for choosing a too wiggly function. The penalty regularizes the function. The tuning parameter λ controls the amount of regularization: the larger λ, the more regular the estimator will be. The choice of λ is not an easy point. From a theoretical point of view there are some guidelines. (Choosing λ² of order n^{1/3} trades off approximation error and estimation error. One may alternatively invoke Bayesian arguments to choose λ.) In practice one may apply cross-validation.
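Because the penalty is quadratic, a discretized version of this estimator is simply the solution of a linear system. The sketch below (grid size, noise level, λ² and the use of a first-difference matrix as a stand-in for the derivative are all arbitrary choices) illustrates this.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = np.linspace(0.0, 1.0, n)
f_true = np.sin(2.0 * np.pi * x)
y = f_true + rng.normal(0.0, 0.3, size=n)

# first-difference matrix D, a discrete stand-in for the derivative f'
D = np.diff(np.eye(n), axis=0)
lam2 = 5.0
# minimize ||y - f||_2^2 + lam2 * ||D f||_2^2  <=>  (I + lam2 * D^T D) f = y
f_hat = np.linalg.solve(np.eye(n) + lam2 * D.T @ D, y)

mse_raw = np.mean((y - f_true) ** 2)
mse_smooth = np.mean((f_hat - f_true) ** 2)
print(mse_smooth < mse_raw)  # True: here the regularized fit is closer to f than y is
```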

So far we described smoothness in terms of first derivatives being bounded. One may also use higher order derivatives if one believes the unknown function f has these. Let f^{(m)} denote the derivative of order m of the function f : [0, 1] → R. A possible penalty is then

λ² ∫_0^1 |f^{(m)}(x)|² dx.

The tuning parameter λ can be chosen smaller for higher values of m. (A value of λ² of order n^{1/(2m+1)} gives a trade-off between approximation error and estimation error.) The resulting estimator is called a smoothing spline.
With a quadratic penalty, the penalized least squares estimator fˆ is not difficult
to compute as it is a minimizer of a quadratic function. Nevertheless, explicit
expressions are typically not available. In the next section, we provide explicit
expressions for the continuous version, as a “curiosity”.

16.3 A continuous version with explicit solution ?

We examine the continuous version of the previous section. The problem can
be explicitly solved. Suppose we observe a function y : [0, 1] → R and we aim
at smoothing it using the penalty of the previous section. We let
f̂ = arg min_f { ∫_0^1 |y(x) − f(x)|² dx + λ² ∫_0^1 |f′(x)|² dx }.   (16.1)

Lemma 16.3.1 Let f̂ be given in (16.1). Then

f̂(x) = (C/λ) cosh(x/λ) + (1/λ) ∫_0^x y(u) sinh((u − x)/λ) du,

where

C = ( Y(1) − (1/λ) ∫_0^1 Y(u) sinh((u − 1)/λ) du ) / sinh(1/λ),

with

Y(x) = ∫_0^x y(u) du.

The proof is calculus of variations and will be skipped.


Yet, with this explicit expression it is very easy to carry out the estimation
procedure. Here is an example.
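A minimal numerical sketch of the formula in Lemma 16.3.1 (the grid, λ and the test function are arbitrary choices; integrals are approximated by the trapezoidal rule). As a sanity check it uses a constant y: a constant has zero roughness, so the minimizer of (16.1) is f̂ = y, and the explicit formula should reproduce it.

```python
import numpy as np

def trapezoid(g, x):
    """Trapezoidal rule for the integral of samples g over the grid x."""
    return float(np.sum((g[1:] + g[:-1]) / 2.0 * np.diff(x)))

def tikhonov_smooth(y, x, lam):
    """Evaluate the explicit solution of Lemma 16.3.1 on the grid x in [0, 1]."""
    # Y(x) = int_0^x y(u) du, by the cumulative trapezoidal rule
    Y = np.concatenate(([0.0], np.cumsum((y[1:] + y[:-1]) / 2.0 * np.diff(x))))
    integral = trapezoid(Y * np.sinh((x - 1.0) / lam), x)
    C = (Y[-1] - integral / lam) / np.sinh(1.0 / lam)
    f_hat = np.empty_like(y)
    for k, xk in enumerate(x):
        inner = trapezoid(y[: k + 1] * np.sinh((x[: k + 1] - xk) / lam), x[: k + 1])
        f_hat[k] = (C / lam) * np.cosh(xk / lam) + inner / lam
    return f_hat

x = np.linspace(0.0, 1.0, 2001)
y = np.full_like(x, 2.0)          # a constant function has zero roughness ...
f_hat = tikhonov_smooth(y, x, lam=0.5)
print(np.max(np.abs(f_hat - y)))  # ... so it is reproduced, up to quadrature error
```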

Numerical example

[Two figures: the denoised signal for tuning parameters λ = 0.1 (Error = 2.8119e-04) and λ = 0.05 (Error = 7.8683e-05).]

In black is the unknown function f (as this is a simulation we know the unknown f). The red wiggly function is the observed function y. The green function is the estimator f̂. We see in the second figure that by decreasing the tuning parameter λ the estimator f̂ is closer to the unknown truth f.

16.4 Estimating a function of bounded variation

Suppose again X = [0, 1]. Instead of taking the penalty

λ² ∫_0^1 |f′(x)|² dx

one may alternatively choose

λ² ∫_0^1 |f′(x)| dx,

i.e. without the square. This seems like a minor modification but it makes a huge difference. One may further relax the assumption of differentiability and define the total variation of the function f as

TV(f) := Σ_{i=2}^n |f(xi) − f(xi−1)|,

where we assume that the xi are ordered: x1 < . . . < xn. This leads to the estimator

f̂ := arg min_f { ‖Y − f‖_2^2 + λ² TV(f) }.

In analogy, if Y denotes the heights of mountains at locations where a road


is to be built, one tries to even out the mountains and valleys using the total
variation penalty (so that the road will not have steep slopes) and at the same
time move little earth measured in terms of the least squares loss. This can
be done in an iterative manner by filling the valley where the slope is steepest,
or digging away the mountain with steepest slope. The estimator is locally
adaptive: by increasing λ only local changes are made. This is in contrast with
the estimator of Section 16.2 or its continuous version in Section 16.3. One can
see in the numerical example of Section 16.3 that changing λ has a non-local
effect.

One may write the estimator as solution of a linear least squares problem with an ℓ1-penalty on the coefficients. This will relate the problem with that of Section 16.5. The results there will help to understand why removing the square in the penalty drastically changes the estimator.

Let us define f(x0) := 0. Then for i = 1, . . . , n,

f(xi) = Σ_{j=1}^i ( f(xj) − f(xj−1) )
      = Σ_{j=1}^n ( f(xj) − f(xj−1) ) l{j ≤ i}
      = Σ_{j=1}^n bj ξi,j,

with bj := f(xj) − f(xj−1) and ξi,j := l{j ≤ i}.

Putting the coefficients bj, j = 1, . . . , n, in a vector b = (b1, . . . , bn)^T and the ξi,j in a matrix

X := ( ξ1,1  · · ·  ξ1,n )
     (  ...   ...   ...  )
     ( ξn,1  · · ·  ξn,n )

we see that the total variation penalized estimator is

f̂ = arg min_{f = Xb} { ‖Y − f‖_2^2 + λ² ‖b‖_1 }

where ‖b‖_1 = Σ_{j=1}^n |bj| denotes the ℓ1-norm of the vector b ∈ R^n. This rewriting
makes clear that there are as many parameters as there are observations, namely
n. The penalty takes care that despite this fact one will not overfit the data
(when λ is not too small).
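The reparametrization above is easy to verify numerically. The sketch below (dimensions are arbitrary) builds the lower-triangular matrix X of indicators ξi,j = l{j ≤ i} and checks that f = Xb and that ‖b‖_1 recovers TV(f) up to the first term |f(x1)| (since b1 = f(x1) − f(x0) = f(x1)).

```python
import numpy as np

# check the reparametrization f = X b with b_j = f(x_j) - f(x_{j-1}), f(x_0) := 0,
# and xi_{i,j} = l{j <= i}: X is the lower-triangular matrix of ones
rng = np.random.default_rng(4)
n = 50
f = np.cumsum(rng.normal(size=n))            # an arbitrary vector f in R^n
X = np.tril(np.ones((n, n)))                 # X[i, j] = 1 iff j <= i
b = np.diff(np.concatenate(([0.0], f)))      # b_j = f(x_j) - f(x_{j-1})

reproduces = np.allclose(X @ b, f)
tv_match = np.isclose(np.sum(np.abs(b)),
                      np.abs(f[0]) + np.sum(np.abs(np.diff(f))))
print(reproduces, tv_match)  # True True
```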

We end this section with a small trip to the case where X is two-dimensional, say X = [0, 1]². Then the unknown f is an image say, and the observations are

Yi1,i2 = f(xi1,i2) + εi1,i2.

Consider n = m² observations and say they are on a regular grid

xi1,i2 = ( i1/m, i2/m ),   i1 = 1, . . . , m,  i2 = 1, . . . , m.
Now, how can we model smoothness of an image? Again, one may opt to work with the squares of derivatives, a counterpart of ∫ |f′(x)|² dx for the one-dimensional case. However, as in the one-dimensional case, taking squares has


non-local effects. The penalized least squares reconstruction of the image will
then look blurred. For example, if the image is a landscape with lakes and rivers,
there are sharp boundaries which will be blurred by taking a penalty based on
squared derivatives. An alternative is again a total variation penalty. There are
several definitions around for total variation in dimension larger than 1. One
possibility in our 2-dimensional case is to define

TV(f) := Σ_{i1=2}^m Σ_{i2=2}^m |∆f(xi1,i2)|,

where

∆f(xi1,i2) := f( i1/m, i2/m ) − f( (i1−1)/m, i2/m ) − f( i1/m, (i2−1)/m ) + f( (i1−1)/m, (i2−1)/m ).
The image reconstruction algorithm is

f̂ := arg min_f { Σ_{i1=1}^m Σ_{i2=1}^m ( Yi1,i2 − f(xi1,i2) )² + λ² TV(f) }.

This estimator can again be written as a least squares estimator of a linear function with an ℓ1-penalty on the coefficients.

16.5 The ridge and Lasso penalty

In the linear model one has data (x1, Y1), . . . , (xn, Yn) with xi ∈ R^p a p-dimensional row vector and Yi ∈ R (i = 1, . . . , n), and one wants to find a good linear approximation using the least squares loss function

b ↦ Σ_{i=1}^n ( Yi − Σ_{j=1}^p xi,j bj )².

Define the design matrix

X := ( x1 )   ( x1,1  · · ·  x1,p )
     ( .. ) = (  ...   ...   ...  )
     ( xn )   ( xn,1  · · ·  xn,p )

and the vector of responses

Y := (Y1, . . . , Yn)^T.

Set the vector of coefficients to b := (b1, . . . , bp)^T. Then

Σ_{i=1}^n ( Yi − Σ_{j=1}^p xi,j bj )² = ‖Y − Xb‖_2^2.

If p ≥ n and X has rank n, minimizing this over all b ∈ Rp gives a “perfect”


solution β̂LS with X β̂LS = Y . This solution just reproduces the data and is
therefore of no use. We say that it overfits.
Definition 16.5.1 The ridge regression estimator is

β̂ridge := arg min_{b ∈ R^p} { ‖Y − Xb‖_2^2 + λ² ‖b‖_2^2 },

where λ > 0 is a regularization parameter.


Definition 16.5.2 The Lasso¹ estimator is

β̂Lasso := arg min_{b ∈ R^p} { ‖Y − Xb‖_2^2 + 2λ ‖b‖_1 },

where λ > 0 is a regularization parameter and ‖b‖_1 := Σ_{j=1}^p |bj| is the ℓ1-norm of b.
Note Recall the definition of the Bayesian MAP estimator as given in Subsection 10.5.3. Consider the model Y = Xβ + ε with ε ∼ N(0, σ²I). The ridge regression estimator is the MAP estimator using as prior β1, . . . , βp i.i.d. ∼ N(0, τ²) with τ = σ/λ. The Lasso estimator is the MAP using as prior β1, . . . , βp i.i.d. ∼ Laplace(0, τ) where the standard deviation τ is τ = √2 σ²/λ. See also Section 10.6.
Remark As λ grows the ridge estimator shrinks the coefficients. They will
however not be set exactly to zero. The coefficients of the Lasso estimator
shrink as well, and some - or even many - are set exactly to zero. The ridge
estimator can be useful if p is moderately large. For very large p the Lasso
is often preferred. The idea is that one should not try to estimate something when the signal is below the noise level. Instead, one should then simply set it to zero.
Remark Both ridge estimator and Lasso are biased. As λ increases the bias
increases, but the variance decreases.
Remark The regularization parameter λ is for example chosen by using "cross-validation" or (information) theoretic or Bayesian arguments. Below, we will see that for the Lasso a choice of order √(n log p) is theoretically justified.
¹ This is relatively recent methodology, introduced as Lasso by Tibshirani, R., 1996: Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

We now investigate further expressions for both ridge estimator and Lasso.
Lemma 16.5.1 The ridge estimator β̂ridge is given by

β̂ridge = (X^T X + λ²I)^{−1} X^T Y.

Proof. We have

(1/2) ∂/∂b { ‖Y − Xb‖_2^2 + λ² ‖b‖_2^2 } = −X^T(Y − Xb) + λ² b = −X^T Y + (X^T X + λ²I) b.

The estimator β̂ridge puts this to zero. □
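A numerical sketch of Lemma 16.5.1 (dimensions and λ are arbitrary choices): the closed form solves the normal equations, and, since the penalized criterion is strictly convex, it should beat any perturbation of itself.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 30, 10, 2.0
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Lemma 16.5.1: beta_ridge = (X^T X + lam^2 I)^{-1} X^T Y
beta_ridge = np.linalg.solve(X.T @ X + lam**2 * np.eye(p), X.T @ Y)

def objective(b):
    return np.sum((Y - X @ b) ** 2) + lam**2 * np.sum(b**2)

# the closed form should beat random perturbations of itself
beats_all = all(objective(beta_ridge) < objective(beta_ridge + 0.01 * rng.normal(size=p))
                for _ in range(200))
print(beats_all)  # True
```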
Corollary 16.5.1 Suppose orthonormal design: X^T X = nI (thus p ≤ n necessarily). Then

β̂ridge = X^T Y / (n + λ²).
After some calculations, as in Example 5.2.1, one sees that when ε1, . . . , εn are i.i.d. with mean zero and variance σ², then

IE ‖Xβ̂ridge − f‖_2^2 = ( (λ²/n)/(1 + λ²/n) )² ‖Xβ*‖_2^2 + ( 1/(1 + λ²/n) )² p σ² + ‖Xβ* − f‖_2^2,

where the three terms are the (bias)², the variance, and the misspecification error, respectively,

and where Xβ* is the projection of f on the space spanned by the columns of X. We see that in order to trade off bias and variance, we have to know the variance σ², but what is worse, we also have to know the (bias)² ‖Xβ*‖_2^2. But the bias is unknown as f is unknown. Thus, we are facing the same problems as in Example 5.2.1.
For the Lasso estimator there is no simple expression in general. We therefore
only consider the special case of orthonormal design.
Lemma 16.5.2 Suppose X^T X = nI (thus p ≤ n necessarily). Define Z := X^T Y. Then for j = 1, . . . , p,

β̂Lasso,j = { Zj/n − λ/n,   Zj ≥ λ,
           { 0,             |Zj| ≤ λ,
           { Zj/n + λ/n,   Zj ≤ −λ.
Proof. Write β̂Lasso =: β̂ for short. We can write

‖Y − Xb‖_2^2 = ‖Y‖_2^2 − 2b^T X^T Y + n b^T b = ‖Y‖_2^2 − 2b^T Z + n b^T b.

Thus, ignoring the constant ‖Y‖_2^2, for each j we minimize

−2 bj Zj + n bj² + 2λ |bj|.



If β̂j > 0 it must be a solution of putting the derivative of the above expression
to zero:
−Zj + nβ̂j + λ = 0,
or
β̂j = Zj /n − λ/n.
Similarly, if β̂j < 0 we must have

−Zj + nβ̂j − λ = 0.

Otherwise β̂j = 0. □
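The three cases of Lemma 16.5.2 combine into the familiar soft-thresholding map; a minimal sketch (the numbers are arbitrary):

```python
import numpy as np

def lasso_orthonormal(Z, n, lam):
    """Lasso coordinates under orthonormal design X^T X = n I (Lemma 16.5.2):
    soft thresholding of Z = X^T Y at level lam, scaled by 1/n."""
    return np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0) / n

Z = np.array([5.0, -0.5, 2.0, -3.0])
beta = lasso_orthonormal(Z, n=2, lam=1.0)
print(beta.tolist())  # [2.0, -0.0, 0.5, -1.0]
```

Note how the middle coordinate, with |Zj| ≤ λ, is set exactly to zero, while the others are shrunken towards zero by λ/n.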
Some notation
◦ For a vector z ∈ R^p we let ‖z‖∞ := max_{1≤j≤p} |zj| be its ℓ∞-norm.
◦ We let X1, . . . , Xp denote the columns of X.
◦ For a subset S ⊂ {1, . . . , p} we let XβS* be the best linear approximation of f := IEY using the variables in S, i.e., XβS* is the projection in R^n of f on the linear space { Σ_{j∈S} Xj bS,j : bS ∈ R^{|S|} }.
In the next theorem we again assume orthogonal design. For general design, one needs so-called restricted eigenvalues or compatibility conditions between ℓ2-norms and ℓ1-norms.
Theorem 16.5.1 Consider again fixed design with X^T X = nI. Let f = IEY and ε = Y − f. Fix some level α ∈ (0, 1) and suppose that for some λα it holds that

IP( ‖X^T ε‖∞ > λα ) ≤ α.

Then for λ > λα we have with probability at least 1 − α

‖Xβ̂Lasso − f‖_2^2 ≤ min_S { ((λ + λα)²/n) |S| + ‖XβS* − f‖_2^2 },

where the first term is the estimation error and the second the approximation error.

Proof. Write β̂ := β̂Lasso and f = Xβ. On the set where ‖X^T ε‖∞ ≤ λα we have
- n|βj| > λ + λα ⇒ n|β̂j − βj| ≤ λ + λα,
- n|βj| ≤ λ + λα ⇒ |β̂j − βj| ≤ |βj|.
So with probability at least (1 − α),

‖Xβ̂Lasso − f‖_2^2 ≤ ((λ + λα)²/n) #{j : n|βj| > λ + λα} + Σ_{j: n|βj| ≤ λ+λα} n βj²
= min_S { ((λ + λα)²/n) |S| + ‖XβS* − f‖_2^2 }.

□
If we compare the result of Theorem 16.5.1 with result iii) of Lemma 12.3.1
concerning the ordinary least squares estimator, we see that the Lasso exhibits

an automatic trade-off between approximation error and estimation error. This is called adaptation. As we will see below, the parameter λ can typically be chosen of order √(n log p). Therefore the price paid for not knowing a priori which subset of the coefficients is relevant is of order log p. This is generally considered as being a relatively low price.
Note One may write

‖XβS* − f‖_2^2 = ‖XβS* − Xβ*‖_2^2 + ‖Xβ* − f‖_2^2

where Xβ* is the projection of f on the space spanned by the columns of X. Thus, the "approximation error" actually consists of two terms. The second term is the misspecification error: it vanishes when the linear model is well-specified.
Corollary 16.5.2 Suppose that f = Xβ where β has s := #{j : βj ≠ 0} non-zero components. Then under the conditions of the above theorem, with probability at least 1 − α,

‖X(β̂Lasso − β)‖_2^2 ≤ ((λ + λα)²/n) s.

The above corollary tells us that the Lasso estimator adapts to favourable situations where β has many zeroes (i.e. where β is sparse).
To complete the story, we need to study a bound for λα. It turns out that for many types of error distributions, one can take λα of order √(n log p). We show this for the case of i.i.d. N(0, σ²) noise. For that purpose, we start with a bound for the tails of a standard normal random variable.
Lemma 16.5.3 Suppose Z ∼ N(0, 1). Then for all t > 0

IP( Z ≥ √(2t) ) ≤ exp[−t].

Proof. First check that for all u > 0

IE exp[uZ] = exp[u²/2].

Then by Chebyshev’s inequality

IP( Z ≥ √(2t) ) ≤ exp[ u²/2 − u√(2t) ].

Now choose u = √(2t). □
Corollary 16.5.3 Let Z1, . . . , Zp be p (possibly dependent) standard normal random variables. Then for all t

IP( max_{1≤j≤p} |Zj| ≥ √(2(log(2p) + t)) ) ≤ Σ_{j=1}^p IP( |Zj| ≥ √(2(log(2p) + t)) )
≤ 2p exp[−(log(2p) + t)] = exp[−t].

Remark In the above corollary we used IP(A ∪ B) ≤ IP(A) + IP(B). This is


called the union bound.
Corollary 16.5.4 Let ε1, . . . , εn be i.i.d. N(0, σ²) and let X = (X1, . . . , Xp) with X1, . . . , Xp fixed vectors in R^n with ‖Xj‖_2^2 = n for all j. Then for 0 < α < 1

IP( ‖X^T ε‖∞ ≥ σ √(2n log(2p/α)) ) ≤ α.

Remark. The value α = 1/2 thus gives a bound for the median of ‖Xβ̂Lasso − f‖_2^2. In the case of Gaussian errors one may use "concentration of measure" to deduce that ‖Xβ̂Lasso − f‖_2^2 is "concentrated" around its median.
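Corollary 16.5.4 is easy to probe by simulation. In the sketch below (dimensions, σ and α are arbitrary choices) the design columns are normalized so that ‖Xj‖_2^2 = n; the exceedance frequency of the bound stays below α, typically by a wide margin, since the union bound is conservative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma, alpha = 100, 50, 1.0, 0.1

# fixed design with ||X_j||_2^2 = n for every column j
X = rng.normal(size=(n, p))
X = X / np.linalg.norm(X, axis=0) * np.sqrt(n)

threshold = sigma * np.sqrt(2.0 * n * np.log(2.0 * p / alpha))
exceed = np.mean([
    np.max(np.abs(X.T @ rng.normal(0.0, sigma, size=n))) > threshold
    for _ in range(2000)
])
print(exceed)  # well below alpha = 0.1
```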

16.6 Conclusion

We have seen in this chapter that several concepts from classical statistics also
play their role in the modern version of statistics where the parameter is possibly
high dimensional. The classical least squares methodology keeps its prominent
place, but now it is equipped with a regularization penalty. More generally,
M-estimators (e.g. maximum likelihood estimators) can also be used in high
dimensions, applying again some regularization technique. The bias-variance
decomposition continues to play its role too, for instance leading to guidelines
for choosing tuning parameters.
Shrinkage estimators play an important role in high-dimensional statistics. This
is also related to the result of Section 11.4 where we have seen that the sample
average in dimension higher than 2 is inadmissible as it can be improved by a
shrinkage estimator.
Complexity regularization can typically be seen as a Bayesian MAP approach. One may also use the a posteriori mean as estimator, etc. Today, Bayesian approaches to high-dimensional and non-parametric problems are very important and successful.
Complexity regularization is typically invoked for the construction of adaptive estimators. An adaptive estimator mimics the situation where we knew beforehand the complexity of the underlying target to be estimated. To evaluate the performance of an adaptive estimator, one uses as benchmark the case where the (hopefully low) complexity of the target is indeed known. Thus the benchmark comes from classical statistical theory.
Chapter 17

Literature

• J.O. Berger (1985) Statistical Decision Theory and Bayesian Analysis


Springer
A fundamental book on Bayesian theory.
• P.J. Bickel, K.A. Doksum (2001) Mathematical Statistics, Basic Ideas and
Selected Topics Volume I, 2nd edition, Prentice Hall
Quite general, and mathematically sound.
• D.R. Cox and D.V. Hinkley (1974) Theoretical Statistics Chapman and
Hall
Contains good discussions of various concepts and their practical meaning.
Mathematical development is sketchy.
• A. DasGupta (2011) Probability for Statistics and Machine Learning,
Springer
Contains all the probability theory background needed. (Look out for the
upcoming book Statistical Theory, a Comprehensive Course by the same
author.)
• J.G. Kalbfleisch (1985) Probability and Statistical Inference Volume 2,
Springer
Treats likelihood methods.
• L.M. Le Cam (1986) Asymptotic Methods in Statistical Decision Theory
Springer
Treats decision theory on a very abstract level.
• E.L. Lehmann (1983) Theory of Point Estimation Wiley
A “klassiker”. The lecture notes partly follow this book.
• E.L. Lehmann (1986) Testing Statistical Hypothesis 2nd edition, Wiley
Goes with the previous book.
• J.A. Rice (1994) Mathematical Statistics and Data Analysis 2nd edition,
Duxbury Press
A more elementary book.


• M.J. Schervish (1995) Theory of Statistics Springer


Mathematically exact and quite general. Also good as reference book.
• R.J. Serfling (1980) Approximation Theorems of Mathematical Statistics
Wiley
Treats asymptotics.
• A.W. van der Vaart (1998) Asymptotic Statistics Cambridge University
Press
Treats modern asymptotics and e.g. semiparametric theory
• L. Wasserman (2004) All of Statistics. A Concise Course in Statistical
Inference Springer.
Contains a wide range of topics in mathematical statistics and machine
learning.
