
QF314800/314900: Mathematical Statistics

Chapter 07: Sufficiency and Data Reduction


Instructor: Chung-Han Hsieh (ch.hsieh@mx.nthu.edu.tw)

In a typical statistical problem, we often want to study a random variable X of interest whose pdf
or pmf is unknown or only partially known.1 Statistical inference therefore uses the information
in a sample X := (X1 , . . . , Xn ) to draw conclusions about an unknown parameter vector θ ∈ Θ. If
the sample size n is large, then the observed sample2 x := (x1 , . . . , xn ) is a long list of numbers
that may be hard to interpret. Hence, it is desirable to “summarize” the information in a sample
by determining a few key features of the sample values. This is usually done by computing some
suitable statistics, T (X1 , . . . , Xn ); e.g., sample mean, sample variance, maximum observation,
minimum observation.
In this chapter, our goal is to study a notion called sufficiency and the methods of data reduction
that do not discard important information about the unknown parameter θ.3

7.1 Preliminaries
Below we review some basic ideas about random samples from Chapters 5 and 6.
Definition 7.1 (Random Samples). The random vector X := (X1 , . . . , Xn ) is called a random
sample of size n from the population f (x) if X1 , . . . , Xn are iid with pdf or pmf f (x).
Definition 7.2 (Statistic). Let X := (X1 , . . . , Xn ) be a random sample of size n from a pop-
ulation and let T (x1 , . . . , xn ) be a real-valued (or vector-valued) function whose domain includes
the sample space of X. Then the random variable (or random vector) T (X) := T (X1 , . . . , Xn ) is called a statistic.

7.1.1 Review of Some Desirable Properties of Point Estimator


Definition 7.3 (Point Estimator). A point estimator is any function T (X) of a sample X that
is used to estimate the population parameter θ. The realized statistic t := T (x) of X is called
the point estimate.
Definition 7.4 (Unbiased Estimator). Let X = (X1 , . . . , Xn ) be a random sample from a
population with pdf f (x; θ) for some θ ∈ Θ where Θ is a parameter space. Let T = T (X) be a
statistic. We say that T is an unbiased estimator of θ if E[T ] = θ.
Definition 7.5 (Consistent Estimator). Let X be a sample from the distribution of X with cdf
FX (x; θ) and θ ∈ Θ. Let Tn := T (X1 , . . . , Xn ) be a statistic. We say that Tn is a consistent
estimator of the parameter θ if Tn converges to θ in probability; i.e., for any ε > 0,

lim_{n→∞} P (|Tn − θ| ≥ ε) = 0.

Theorem 7.1 (WLLN). Let X1 , X2 , . . . be iid random variables with E[Xi ] = µ and var(Xi ) =
σ^2 < ∞. Define X̄_n := (1/n) Σ_{i=1}^{n} Xi . Then X̄_n → µ in probability.
Proof. See Chapter 6.

1 The ignorance about the unknown pdf (or pmf) can be classified in two ways: (i) f (x) is completely unknown.

(ii) The form of f (x) is known down to a parameter θ ∈ Θ.


2 Sometimes, this observed sample x is also called a realization of X.
3 The material of this chapter is drawn heavily from [1, Chapter 6] and [2, Chapter 7].

Exercise 7.1 (Image of Statistic). Let X be the sample space of X. Find the image, call it T ,
of the statistic T (X).
Proof. It is readily verified that T := {t : t = T (x) for some x ∈ X } is the image of X un-
der T (x).

7.1.2 Statistic T Partitions the Sample Space of X


Data reduction in terms of a particular statistic can be thought of as a partition of the sample
space X . Let T := {t : t = T (x) for some x ∈ X } be the image of X under T (x). Then T (x)
partitions the sample space into sets At , t ∈ T defined by

At := {x : T (x) = t}.

The statistic summarizes the data in that, rather than reporting the entire sample x, it reports
only that T (x) = t or equivalently, x ∈ At . All points in At are treated the same if we are
interested in T only. Thus, the statistic T provides a data reduction. Our goal here is to reduce
data as much as we can but not lose any important information about θ.
Remark 7.1. (i) Let T (X) be a statistic. For T , if x ̸= y but T (x) = T (y), then x and y
provide the same information about θ and can be treated as the same. (ii) Data reduction in terms of
a statistic T (X) is a partition of the sample space X .
Exercise 7.2 (Data Reduction). Suppose that X1 , X2 , X3 are i.i.d. Bernoulli(p) random variables
with p ∈ (0, 1) and Xi ∈ {0, 1}. Define T : {0, 1}^3 → {0, 1, 2, 3} by

T (X) := Σ_{i=1}^{3} Xi .

Find the partitions At and T .


Proof. The partition can be summarized in Table 1 below.

Table 1: Partition of Sample Space

partition    (X1 , X2 , X3 )    T (X) = Σ_{i=1}^{3} Xi
A0           (0, 0, 0)          0
A1           (0, 0, 1)          1
             (0, 1, 0)          1
             (1, 0, 0)          1
A2           (0, 1, 1)          2
             (1, 0, 1)          2
             (1, 1, 0)          2
A3           (1, 1, 1)          3

In particular, with the image T := {t : t = T (x) for some x ∈ X } and At := {x : T (x) = t} for t ∈ T ,
instead of reporting x = [x1 x2 x3 ]^T , we report only T (x) = t or, equivalently, x ∈ At . The partition of the sample
space based on T (X) is “coarser” than the original sample space. Indeed, there are 8
elements in the sample space of X, and they are partitioned into only 4 subsets; T (X) is simpler (coarser)
than X. A small computational check of this partition is sketched below.
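The following short Python enumeration (an illustrative sketch, not part of the original notes) reproduces Table 1 by grouping the 8 points of {0, 1}^3 according to the value of T :

    from itertools import product
    from collections import defaultdict

    # Enumerate the sample space {0,1}^3 and group its points by T(x) = x1 + x2 + x3.
    partition = defaultdict(list)          # partition[t] plays the role of A_t
    for x in product([0, 1], repeat=3):
        partition[sum(x)].append(x)

    for t in sorted(partition):
        print(f"A_{t}: {partition[t]}")
    # A_0 and A_3 each contain one point; A_1 and A_2 each contain three points.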

7.2 Sufficiency of Statistics
The sections to follow require some background on conditional probability and conditional
distributions; see, e.g., Appendix A.1. A sufficient statistic for an unknown
parameter θ is a statistic that captures all the information about θ contained in the sample. A
sufficient statistic is formally defined in the following way.
Definition 7.6 (Sufficient Statistic). Let X := (X1 , . . . , Xn ) be a random sample of size n from
a distribution that has pdf (or pmf) f (x; θ) for θ ∈ Θ. A statistic T (X) is called a sufficient
statistic for θ if, for each value t := T (x) of T (X), the conditional distribution of the sample X given
T (X) = t, i.e.,

f (x|t) = h(x),

does not depend on θ.

Remark 7.2. (i) If, for T (X) = t, the conditional pdf f (x | t) does not depend on the parameter θ, then
the random sample X contains no further information about θ once T (X) = t is observed. Said
another way, T (X) exhausts all the information about θ that is contained in the sample. (ii) If X
is discrete, then T (X) is discrete and sufficiency means that P (X = x|T (X) = t) is known; i.e.,
it does not depend on any unknown quantity θ. (iii) Once we observe x and compute a sufficient
statistic T (x), the original data x do not contain any further information concerning θ and can
be discarded; i.e., T (x) is all we need regarding θ. Roughly speaking, a statistic is sufficient if
we can calculate the joint pdf of X by knowing only T (X).
Exercise 7.3 (A Useful Identity). Let X be a random sample from a population and define T (X)
to be a statistic. Show that the following events relationship holds:

{X = x} = {X = x, T (X) = T (x)}

for any x ∈ X .
Proof. We observe that if X = x, then T (X) = T (x). It follows that {X = x} ⊂ {T (X) = T (x)}.
Therefore, we conclude {X = x, T (X) = T (x)} = {X = x}.

Remark 7.3. Indeed, the proof of Exercise 7.3 uses the fact that if A ⊂ B, then A ∩ B = A.


Theorem 7.2 (Criterion for a Sufficient Statistic). If p(x; θ) is the joint pdf (or pmf) of X and
q(t; θ) is the pdf (or pmf) of T (X), then T (X) is a sufficient statistic for θ if for every x in the
sample space, the ratio

p(x; θ) / q(T (x); θ) = h(x)

for some function h(x) that does not depend on θ ∈ Θ.
Proof. Let T (X) be a statistic. To prove that T (X) is a sufficient statistic for θ, we must show
that for any fixed values of x and T (x) = t, the conditional probability

f (x|t) = P (X = x|T (X) = t)

does not depend on θ. Observe that
P (X = x|T (X) = t) = P (X = x, T (X) = t) / P (T (X) = t)

                    = P (X = x) / P (T (X) = T (x))   if T (x) = t, and 0 otherwise

                    = p(x; θ) / q(T (x); θ)           if T (x) = t, and 0 otherwise,

where the second equality holds by Exercise 7.3. In the last equality, the quantity
p(x; θ) is the joint pmf of the sample X and q(t; θ) is the pmf of T (X). By assumption, the ratio

p(x; θ) / q(T (x); θ) = h(x)

does not depend on θ, which implies that T (X) is a sufficient statistic for θ.

Exercise 7.4 (Random Sample Itself is a Sufficient Statistic). Let T (X) := (X1 , . . . , Xn ) ∈ R^n
be the random sample itself, viewed as a statistic. Show that T is sufficient.
Proof. For T (X) = T (x) := x′ , we observe that

f (x; θ|x′ ) = f (x, x′ ; θ) / f (x′ ; θ) = f (x; θ) / f (x′ ; θ) = 1   if x = x′ , and 0 if x ̸= x′ ,

which is independent of θ. So X itself is a sufficient statistic.

Remark 7.4. Instead of listing all of the individual samples X1 , . . . , Xn , we might prefer to
give only the sample mean X̄ or the sample variance S_n^2 . Motivated by this, we seek ways
of reducing a set of data so that the data can be more easily understood without losing the
meaning associated with the entire set of observations.
Exercise 7.5 (Binomial Sufficient Statistic). Let X = (X1 , . . . , Xn ) be a Bernoulli random
sample with unknown parameter θ ∈ (0, 1); i.e., each Xi has pmf P (Xi = x; θ) = θ^x (1 − θ)^{1−x} for
x ∈ {0, 1}. Define a statistic

T (X) := Σ_{i=1}^{n} Xi .

(i) Show that the joint pmf of X is

f (x; θ) := P (X = x; θ) = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1−xi} .

(ii) Verify that T (X) has a binomial(n, θ) distribution.


(iii) Show that T (X) is a sufficient statistic for θ.
Proof. (i). Since X1 , . . . , Xn are iid, it is readily verified that the joint pmf of X is given by

f (x; θ) = P (X = x; θ) = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1−xi} .

(ii). Let T (X) := the total number of successes in n trials. The probability of the event {T (X) = t}
with t := Σ_{i=1}^{n} xi is given by

P (T (X) = t; θ) = \binom{n}{t} θ^t (1 − θ)^{n−t} ,   t = 0, 1, . . . , n,

which is the binomial(n, θ) distribution. (Verify, e.g., by the mgf technique.)


(iii). To show that T (X) is a sufficient statistic for θ, we invoke Theorem 7.2 by observing the
ratio of pmfs:

p(x; θ) / q(T (x); θ) = P (X = x; θ) / P (T (X) = T (x); θ)
                      = ∏_{i=1}^{n} P (Xi = xi ; θ) / [ \binom{n}{t} θ^t (1 − θ)^{n−t} ]   (∗)

where the last equality holds by the iid property of X and part (ii). Now, note that

P (Xi = xi ; θ) = θ^{xi} (1 − θ)^{1−xi} ,   xi ∈ {0, 1}.

It follows that

∏_{i=1}^{n} P (Xi = xi ; θ) = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1−xi} = θ^{Σ_{i=1}^{n} xi} (1 − θ)^{n − Σ_{i=1}^{n} xi} = θ^t (1 − θ)^{n−t} .

Hence, Equation (∗) becomes

p(x; θ) / q(T (x); θ) = θ^t (1 − θ)^{n−t} / [ \binom{n}{t} θ^t (1 − θ)^{n−t} ] = 1 / \binom{n}{t} = 1 / \binom{n}{Σ_{i=1}^{n} xi} ,

which does not depend on θ. Therefore, by Theorem 7.2, T (X) is a sufficient statistic for θ.

Remark 7.5. The analysis above tells us that the total number of successes in the Bernoulli
sample contains all the information about θ that is in the data. Other features of the data, such
as the exact value of X4 , contain no additional information.
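As a numerical illustration (a sketch added here, not part of the original notes; the sample point and θ values below are arbitrary choices), one can check that the ratio in Theorem 7.2 equals 1/\binom{n}{t} for every θ:

    from math import comb

    def joint_pmf(x, theta):
        """Joint pmf of an iid Bernoulli(theta) sample x."""
        t = sum(x)
        return theta**t * (1 - theta)**(len(x) - t)

    def binom_pmf(t, n, theta):
        """pmf of T = sum of the sample, which is Binomial(n, theta)."""
        return comb(n, t) * theta**t * (1 - theta)**(n - t)

    x = (1, 0, 1)                    # a sample with t = 2 successes out of n = 3
    n, t = len(x), sum(x)
    for theta in (0.2, 0.5, 0.9):
        print(theta, joint_pmf(x, theta) / binom_pmf(t, n, theta), 1 / comb(n, t))
    # The ratio equals 1/3 = 1/C(3,2) regardless of theta.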
Exercise 7.6 (Gamma Sufficient Statistic). Let X = (X1 , X2 , . . . , Xn ) be a random sample
from a gamma distribution with α = 2 and β = θ > 0, i.e., with pdf

f (x|α, β) = 1/(Γ(2)θ^2) x^{2−1} e^{−x/θ} = (1/θ^2) x e^{−x/θ} ,   x ≥ 0,

and mgf M (t) = (1 − θt)^{−2} for t < 1/θ.


(i) Find the joint distribution of X.
(ii) Define a statistic

T (X) := Σ_{i=1}^{n} Xi .

Use the mgf of T to show that T has a gamma distribution with α = 2n and β = θ; hence,

fT (t; θ) = 1/(Γ(2n)θ^{2n}) t^{2n−1} e^{−t/θ}   for t > 0, and 0 otherwise.

(iii) Show that T is a sufficient statistic for θ.


Proof. (i). The joint pdf is

f (x; θ) = ∏_{i=1}^{n} f (xi ; θ) = ∏_{i=1}^{n} (1/θ^2) xi e^{−xi/θ} .

(ii). We proceed with the mgf technique. By independence of the Xi ,

MT (t) = E[e^{tT}]
       = E[e^{t Σ_{i=1}^{n} Xi}]
       = E[e^{tX1} e^{tX2} · · · e^{tXn}]
       = ∏_{i=1}^{n} E[e^{tXi}]
       = (E[e^{tX1}])^n
       = (1 − θt)^{−2n}

for t < 1/θ, where the second-to-last equality uses that the Xi are identically distributed. Thus, by the uniqueness theorem for mgfs, it follows that T ∼ Γ(α = 2n, β = θ).
Therefore, the pdf is

fT (t; θ) = 1/(Γ(2n)θ^{2n}) t^{2n−1} e^{−t/θ}   for t > 0, and 0 otherwise.

(iii). To prove that T is a sufficient statistic, we note that

p(x; θ) / q(T (x); θ) = [ (1/θ^{2n}) e^{−t/θ} ∏_{i=1}^{n} xi ] / [ 1/(Γ(2n)θ^{2n}) t^{2n−1} e^{−t/θ} ]
                      = ( ∏_{i=1}^{n} xi ) / ( (1/Γ(2n)) t^{2n−1} ) ,

where xi > 0 for i = 1, . . . , n. Since the ratio does not depend on θ, the summation statistic T
is a sufficient statistic for θ.
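A quick numerical check (an illustrative sketch, not from the original notes; the sample below is arbitrary) confirms that the ratio p(x; θ)/q(T (x); θ) computed above is the same for every θ:

    import math

    def joint_pdf(x, theta):
        """Joint pdf of iid Gamma(alpha = 2, beta = theta) observations."""
        return math.prod((xi / theta**2) * math.exp(-xi / theta) for xi in x)

    def pdf_T(t, n, theta):
        """pdf of T = sum of n iid Gamma(2, theta) variables, i.e., Gamma(2n, theta)."""
        return t**(2*n - 1) * math.exp(-t / theta) / (math.gamma(2*n) * theta**(2*n))

    x = [0.7, 1.3, 2.1, 0.4]         # an arbitrary positive sample, n = 4
    n, t = len(x), sum(x)
    for theta in (0.5, 1.0, 3.0):
        print(theta, joint_pdf(x, theta) / pdf_T(t, n, theta))
    # Every line prints the same value, namely Gamma(2n) * prod(x_i) / t**(2n-1).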

Remark 7.6 (Preview of Neyman’s Factorization Theorem). As we will see later in this chapter,
an alternative way to show that T (X) = Σ_{i=1}^{n} Xi is a sufficient statistic for θ in Exercise 7.6 is to
invoke Neyman’s Factorization Theorem:

f (x; θ) = ∏_{i=1}^{n} f (xi ; θ)
         = ∏_{i=1}^{n} (1/θ^2) xi e^{−xi/θ}
         = (1/θ^{2n}) ( ∏_{i=1}^{n} xi ) e^{−(1/θ) Σ_{i=1}^{n} xi}
         = (1/θ^{2n}) e^{−(1/θ) t} ( ∏_{i=1}^{n} xi ) .

Exercise 7.7 (Normal Sufficient Statistic). Let X := (X1 , . . . , Xn ) be iid normal random vari-
ables with mean µ and variance σ^2 , where σ^2 is known but µ is not; i.e., θ := µ. Show that the
sample mean

T (X) = (1/n) Σ_{i=1}^{n} Xi

is a sufficient statistic for µ.
Proof. Let θ := µ. To show that T (X) is a sufficient statistic for θ, we fix x and then invoke
Theorem 7.2 by checking the ratio of pdfs f (x; θ)/q(T (x); θ). First note that, since the Xi are iid,
the numerator, i.e., the joint pdf of X, is

f (x; θ) = ∏_{i=1}^{n} f (xi ; θ)
         = ∏_{i=1}^{n} (1/(σ√(2π))) e^{−(1/2)((xi−θ)/σ)^2}
         = 1/(2πσ^2)^{n/2} e^{−(1/2) Σ_{i=1}^{n} ((xi−θ)/σ)^2} .   (∗)

On the other hand, to compute q(T (x); θ), we note that T (X) = (1/n) Σ_{i=1}^{n} Xi ∼ N (θ, σ^2 /n)
(verify!). Let T (x) = x̄ := (1/n) Σ_{i=1}^{n} xi . Then the pdf of T (X), evaluated at T (x), is given by

q(T (x); θ) = 1/((σ/√n)√(2π)) e^{−(1/2)((x̄−θ)/(σ/√n))^2}
            = n^{1/2}/(σ(2π)^{1/2}) e^{−(n/2)((x̄−θ)/σ)^2} .
Now, for Equation (∗), we further rewrite the exponent term:

Σ_{i=1}^{n} ((xi − θ)/σ)^2 = (1/σ^2) Σ_{i=1}^{n} (xi − x̄ + x̄ − θ)^2
                          = (1/σ^2) Σ_{i=1}^{n} [ (xi − x̄)^2 + 2(xi − x̄)(x̄ − θ) + (x̄ − θ)^2 ]
                          = (1/σ^2) [ Σ_{i=1}^{n} (xi − x̄)^2 + 2 Σ_{i=1}^{n} (xi − x̄)(x̄ − θ) + Σ_{i=1}^{n} (x̄ − θ)^2 ] .   (∗∗)

Note that the cross term

2 Σ_{i=1}^{n} (xi − x̄)(x̄ − θ) = 2(x̄ − θ) Σ_{i=1}^{n} (xi − x̄) = 2(x̄ − θ) ( Σ_{i=1}^{n} xi − n x̄ ) = 0.

Therefore, Equation (∗∗) becomes

Σ_{i=1}^{n} ((xi − θ)/σ)^2 = (1/σ^2) [ Σ_{i=1}^{n} (xi − x̄)^2 + Σ_{i=1}^{n} (x̄ − θ)^2 ]
                          = (1/σ^2) [ Σ_{i=1}^{n} (xi − x̄)^2 + n(x̄ − θ)^2 ] .

This implies that Equation (∗) is equivalent to

f (x; θ) = 1/(2πσ^2)^{n/2} e^{−(1/2) Σ_{i=1}^{n} ((xi−θ)/σ)^2}
         = 1/(2πσ^2)^{n/2} e^{−(1/(2σ^2)) [ Σ_{i=1}^{n} (xi−x̄)^2 + n(x̄−θ)^2 ]} .
Hence, the ratio

f (x; θ) / q(T (x); θ)
  = [ 1/(2πσ^2)^{n/2} e^{−(1/(2σ^2)) [ Σ_{i=1}^{n} (xi−x̄)^2 + n(x̄−θ)^2 ]} ] / [ (n^{1/2}/(σ(2π)^{1/2})) e^{−(n/2)((x̄−θ)/σ)^2} ]
  = [ 1/(2πσ^2)^{n/2} e^{−(1/(2σ^2)) Σ_{i=1}^{n} (xi−x̄)^2} e^{−(n/(2σ^2))(x̄−θ)^2} ] / [ (n^{1/2}/(σ(2π)^{1/2})) e^{−(n/2)((x̄−θ)/σ)^2} ]
  = [ 1/(2πσ^2)^{n/2} e^{−(1/(2σ^2)) Σ_{i=1}^{n} (xi−x̄)^2} ] / ( n^{1/2}/(σ(2π)^{1/2}) ),

which does not depend on θ. Therefore, by Theorem 7.2, the sample mean is a sufficient statistic
for θ = µ.
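The ratio just computed can also be checked numerically (a sketch, not part of the original notes; the sample, σ, and the µ values below are arbitrary):

    import math

    sigma = 2.0                              # sigma^2 is assumed known
    x = [1.2, -0.3, 0.8, 2.5, 1.1]           # an arbitrary sample, n = 5
    n = len(x)
    xbar = sum(x) / n

    def joint_pdf(mu):
        """Joint pdf of the iid N(mu, sigma^2) sample, evaluated at x."""
        return (2 * math.pi * sigma**2) ** (-n / 2) * math.exp(
            -sum((xi - mu) ** 2 for xi in x) / (2 * sigma**2))

    def pdf_xbar(mu):
        """pdf of the sample mean, which is N(mu, sigma^2 / n), evaluated at xbar."""
        var = sigma**2 / n
        return math.exp(-(xbar - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    for mu in (-1.0, 0.0, 2.0):
        print(mu, joint_pdf(mu) / pdf_xbar(mu))   # the same value for every mu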

Remark 7.7 (Motivation for Alternative Ways of Checking Sufficiency). To check whether a statistic T (X)
is sufficient, according to Theorem 7.2, we must check whether the ratio f (x; θ)/q(T (x); θ)
depends on θ or not. This requires knowing the pdf of T (X), which is often not easy to
calculate. Even if the pdf of T (X) is available, the check may require a very tedious analysis, as we saw
in the previous exercise. The next theorem provides a way to avoid this by factorizing the joint pdf
of X.

7.3 Neyman’s Factorization Theorem


Theorem 7.3 (Neyman’s Factorization Theorem). Let X := (X1 , X2 , . . . , Xn ) be a random
sample from a population with pdf (or pmf ) f (x; θ) for θ ∈ Θ. Then T (X) is a sufficient statistic
for θ if and only if there exist two nonnegative functions g(·) and h(·) such that for all sample
points x and all parameters θ, the joint pdf of X can be factorized as
f (x; θ) = g(T (x); θ) · h(x)
where h(x) does not depend on θ.

Proof. Here we provide a proof for continuous random samples.4
(⇒) If T (X) is a sufficient statistic for θ, then, for T (X) = t, the conditional pdf of X given
T (X) = t is

f (x; θ|t) = f (x, t; θ) / fT (t; θ) = f (x; θ) / fT (t; θ) = h(x)   (∗)

where fT is the pdf of T (X), the second equality holds since f (x, t; θ) = f (x; θ) (if X = x,
then T (X) = T (x) = t), and the last equality holds by the definition of sufficiency of T . Therefore, rearranging
the last equality in Equation (∗) yields

f (x; θ) = fT (t; θ)h(x)

which is the desired form of factorization if we take the function g to be the pdf of T (X); i.e.,
g := fT (t; θ).
(⇐) The proof of this direction is more involved. Assume that the factorization identity holds;
i.e.,

f (x; θ) = g(T (x); θ) · h(x). (1)

We must show that T (X) is a sufficient statistic. Define the one-to-one transformations5

y1 := T (x1 , . . . , xn );
y2 := T2 (x1 , . . . , xn );
⋮
yn := Tn (x1 , . . . , xn ),

having the inverse functions x1 = w1 (y1 , . . . , yn ), . . . , xn = wn (y1 , . . . , yn ). The Jacobian

J = det( ∂xi / ∂yj )_{i,j=1,...,n} ,

i.e., the determinant of the n × n matrix of partial derivatives ∂xi/∂yj , does not depend on θ. Then the joint pdf of Y1 , . . . , Yn , call it g, is given by

g(y1 , y2 , . . . , yn ; θ) = f (w1 (y1 , . . . , yn ), . . . , wn (y1 , . . . , yn ); θ)|J|
                         = f (x1 , . . . , xn ; θ)|J|
                         = g(T (x); θ) · h(x)|J|                                         (By Equation (1))
                         = g(y1 ; θ) · h(w1 (y1 , . . . , yn ), . . . , wn (y1 , . . . , yn ))|J|.

The pdf of Y1 = T (X), call it fY1 (y1 ; θ), is simply the marginal pdf of g. That is,

fY1 (y1 ; θ) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g(y1 , y2 , . . . , yn ; θ) dy2 · · · dyn
            = g(y1 ; θ) ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(w1 , w2 , . . . , wn )|J| dy2 · · · dyn .
4 A proof for the discrete case can be found in Appendix A.2.
5 In a more general setting, the assumption of a one-to-one transformation made here can be dropped; see [2].

Note that the function h does not depend on θ nor is θ involved in either the Jacobian J or the
limits of integration. Hence the (n − 1)-fold integral in the right-hand side above is a function
of y1 , say m(y1 ), for some function m(·). Thus, we may write
fY1 (y1 ; θ) = g(y1 ; θ) · m(y1 ).
If m(y1 ) = 0, then fY1 (y1 ; θ) = 0. If m(y1 ) > 0, then we can write

g(y1 ; θ) = fY1 (T (x); θ) / m(T (x)).

Therefore, with the aid of Equation (1), the joint pdf of X becomes

f (x; θ) = g(T (x); θ) · h(x) = fY1 (T (x); θ) · h(x) / m(T (x)).

Or, equivalently,

f (x; θ) / fY1 (T (x); θ) = h(x) / m(T (x)).

Since neither the function h nor the function m depends on θ, the right-hand side does not
depend on θ. In accordance with Theorem 7.2, Y1 (i.e., T (X)) is a sufficient statistic for the
parameter θ.

Exercise 7.8 (Random Sample Itself is Sufficient Statistic: Revisited). Let X be a random
sample with joint pdf f (x; θ) for parameter θ. Show that T (X) = X is a sufficient statistic for θ.
Proof. For X = x, observe that the joint pdf of X is given by
f (x; θ) = g(T (x); θ)h(x)
where h(x) := 1 and g(T (x); θ) := f (T (x); θ) = f (x; θ). Hence, by the factorization theorem, T (X) =
X is a sufficient statistic for θ.

Exercise 7.9 (Bernoulli Revisited). Let X1 , . . . , Xn be iid Bernoulli random variables with
parameter θ ∈ (0, 1); i.e., each Xi has pmf fX (x) = θ^x (1 − θ)^{1−x} for x ∈ {0, 1}. Define a statistic

T (X) := Σ_{i=1}^{n} Xi .

Show that T (X) is a sufficient statistic for θ.


Proof. To apply the factorization theorem, we observe that, for T (x) := t,

f (x; θ) = ∏_{i=1}^{n} f (xi ; θ)
         = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1−xi}
         = θ^{Σ_{i=1}^{n} xi} (1 − θ)^{n − Σ_{i=1}^{n} xi}
         = θ^t (1 − θ)^{n−t} · 1 =: g( Σ_{i=1}^{n} xi ; θ) · h(x).

Hence, according to the factorization theorem, T (X) is a sufficient statistic for θ.

Exercise 7.10 (Nonuniqueness of Sufficient Statistics). Let X1 , . . . , Xn be a random sample from
Poisson(λ) with pmf

f (x; λ) = λ^x e^{−λ} / x! ,   x = 0, 1, 2, . . .

Let

T (X) := Σ_{i=1}^{n} Xi .

(i) Show that T (X) is a sufficient statistic for θ := λ.

(ii) Is T ′ (X) = (1/n) Σ_{i=1}^{n} Xi also a sufficient statistic for θ?
Proof. (i). We begin by observing that the joint pmf of X1 , . . . , Xn is

f (x; λ) = ∏_{i=1}^{n} f (xi ; λ)
         = ∏_{i=1}^{n} λ^{xi} e^{−λ} / xi!
         = λ^{Σ_{i=1}^{n} xi} e^{−nλ} / ∏_{i=1}^{n} xi!
         = λ^{Σ_{i=1}^{n} xi} e^{−nλ} · ( 1 / ∏_{i=1}^{n} xi! ) =: g( Σ_{i=1}^{n} xi ; λ) · h(x).   (∗)

Thus, taking T (X) := Σ_{i=1}^{n} Xi , Equation (∗) tells us that

f (x; λ) = g(T (x); λ) · h(x);

hence, by the factorization theorem, T (X) is a sufficient statistic for λ.
(ii). Consider T ′ (X) = (1/n) Σ_{i=1}^{n} Xi . Then, by Equation (∗) above, we also have

f (x; λ) = λ^{n x̄} e^{−nλ} · ( 1 / ∏_{i=1}^{n} xi! ) ,

where x̄ := (1/n) Σ_{i=1}^{n} xi . Define g1 (x̄; λ) := λ^{n x̄} e^{−nλ} and h(x) := 1 / ∏_{i=1}^{n} xi! . Then, by the factorization theorem, x̄
is also a sufficient statistic for λ.
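A further illustration (a sketch, not part of the original notes; the sample and λ values are arbitrary): for a Poisson sample, P (X = x | T = t) equals the Multinomial(t, (1/n, . . . , 1/n)) pmf, which is free of λ, consistent with the sufficiency of T .

    import math

    def joint_pmf(x, lam):
        """Joint pmf of an iid Poisson(lam) sample x."""
        return math.prod(lam**xi * math.exp(-lam) / math.factorial(xi) for xi in x)

    def pmf_T(t, n, lam):
        """pmf of T = sum of n iid Poisson(lam) variables, which is Poisson(n*lam)."""
        return (n * lam)**t * math.exp(-n * lam) / math.factorial(t)

    x = (2, 0, 3, 1)                 # an arbitrary sample with n = 4 and t = 6
    n, t = len(x), sum(x)
    multinom = math.factorial(t) * (1 / n)**t / math.prod(math.factorial(xi) for xi in x)
    for lam in (0.5, 2.0, 7.0):
        print(lam, joint_pmf(x, lam) / pmf_T(t, n, lam), multinom)
    # The conditional probability matches the multinomial value for every lam.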

Exercise 7.11 (Gamma Sufficient Statistic Revisited). Let X = (X1 , X2 , . . . , Xn ) be a random


sample from a gamma distribution with α = 2 and β = θ > 0, i.e., with pdf

f (x|α, β) = 1/(Γ(2)θ^2) x^{2−1} e^{−x/θ} = (1/θ^2) x e^{−x/θ} ,   x ≥ 0.

Define a statistic

T (X) := Σ_{i=1}^{n} Xi .

Show that T is a sufficient statistic for θ.

Proof. The joint pdf is

f (x; θ) = ∏_{i=1}^{n} f (xi ; θ) = ∏_{i=1}^{n} (1/θ^2) xi e^{−xi/θ} .

To prove that T is a sufficient statistic, we invoke Neyman’s Factorization Theorem:

f (x; θ) = ∏_{i=1}^{n} f (xi ; θ)
         = ∏_{i=1}^{n} (1/θ^2) xi e^{−xi/θ}
         = (1/θ^{2n}) ( ∏_{i=1}^{n} xi ) e^{−(1/θ) Σ_{i=1}^{n} xi}
         = (1/θ^{2n}) e^{−(1/θ) t} ( ∏_{i=1}^{n} xi )
         = g(t; θ) h(x),

where g(t; θ) = (1/θ^{2n}) e^{−(1/θ) t} and h(x) = ∏_{i=1}^{n} xi . Hence, by the factorization theorem, T is a sufficient
statistic for θ.

Exercise 7.12 (Normal Sufficient Statistic Revisited). Let X1 , . . . , Xn be iid normal random
variables with mean θ and variance σ^2 , where σ^2 is known but θ is not. Show that the sample
mean T (X) = (1/n) Σ_{i=1}^{n} Xi is a sufficient statistic for θ := µ via the Factorization Theorem.
Proof. By Exercise 7.7, we know that the joint pdf of X is

f (x; θ) = 1/(2πσ^2)^{n/2} e^{−(1/2) Σ_{i=1}^{n} ((xi−θ)/σ)^2}   (@)
         = 1/(2πσ^2)^{n/2} e^{−(1/(2σ^2)) [ Σ_{i=1}^{n} (xi−x̄)^2 + n(x̄−θ)^2 ]} .   (∗)

Now, define

h(x) := 1/(2πσ^2)^{n/2} e^{−(1/(2σ^2)) Σ_{i=1}^{n} (xi−x̄)^2} ,

which does not depend on the unknown parameter θ = µ. The factor in Equation (∗) that
contains θ depends on the sample x only through the sample mean T (x) = x̄ =: t.
Thus, we set

g(t; θ) := exp( −n(t − θ)^2 /(2σ^2) )

and note that

f (x; θ) = h(x) g(T (x); θ).

Thus, by the Factorization Theorem, T (X) = X̄ is a sufficient statistic for θ.

Remark 7.8 (Trivial Sufficient Statistic). Of course, if we look at Equation (@), then we may
write

f (x; θ) = 1/(2πσ^2)^{n/2} e^{−(1/2) Σ_{i=1}^{n} ((xi−θ)/σ)^2} · 1 =: g(T (x); θ) · 1.

Call h(x) := 1 and g(T (x); θ) = g(x; θ) by setting T (X) := X, the random sample itself. Then, by
the factorization theorem, X is itself a sufficient statistic for θ := (µ, σ^2 ). However, this is not a
good sufficient statistic, because it requires the entire sample X1 , . . . , Xn .
Remark 7.9 (Multiple Sufficient Statistics). In all the previous exercises, the sufficient statistic
is a real-valued function of the sample. All the information about θ in the sample x is summarized
in the single number T (x). Sometimes, the information cannot be summarized in one number
and several numbers are required instead. In such cases, a sufficient statistic is a vector, say

T (X) := (T1 (X), T2 (X), . . . , Tr (X)).

This situation often occurs when the parameter itself is a vector, say θ = (θ1 , . . . , θs ), and it is
usually the case that the sufficient statistic and the parameter vector are of equal length, that
is r = s. While different combinations of lengths are possible, the Factorization Theorem may
be used to find a vector-valued sufficient statistic; see Exercise 7.13 below.

Exercise 7.13 (Joint Sufficient Statistic for Normal with Both Unknown Mean and Variance).
Assume that X1 , . . . , Xn are iid normal random variables with mean µ and variance σ^2 , where
both µ and σ^2 are unknown; i.e., the parameter vector is θ := (µ, σ^2 ). Let T1 (x) = x̄
and T2 (x) = s^2 := (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)^2 . Show that

T (X) = (T1 (X), T2 (X)) = (X̄, S^2 )

is a sufficient statistic for (µ, σ 2 ).


Proof. To apply the factorization theorem, any part of the joint pdf that depends on either µ or
σ^2 must be included in the g function. From

f (x; θ) = 1/(2πσ^2)^{n/2} e^{−(1/(2σ^2)) [ Σ_{i=1}^{n} (xi−x̄)^2 + n(x̄−µ)^2 ]}
         = 1/(2πσ^2)^{n/2} e^{−(1/(2σ^2)) [ (n−1)s^2 + n(x̄−µ)^2 ]} ,

it is clear that the pdf depends on the sample x only through the two values T1 (x) = x̄ =: t1
and T2 (x) = s^2 = (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)^2 =: t2 . Thus, taking h(x) := 1 and

g(t; θ) = g(t1 , t2 ; µ, σ^2 ) = (2πσ^2 )^{−n/2} exp( −( n(t1 − µ)^2 + (n − 1)t2 ) / (2σ^2 ) ),

we see that

f (x; µ, σ^2 ) = g(t1 , t2 ; µ, σ^2 ) h(x).

Thus, by the Factorization Theorem, T (X) = (T1 (X), T2 (X)) = (X̄, S^2 ) is a sufficient statistic
for (µ, σ^2 ) in this normal model.
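As a computational aside (a sketch, not part of the original notes; the data and parameter values are arbitrary), the identity Σ(xi − µ)^2 = (n − 1)s^2 + n(x̄ − µ)^2 means the normal likelihood can be rebuilt from (x̄, s^2 , n) alone:

    import math

    x = [4.1, 5.3, 3.8, 6.0, 4.9, 5.5]
    n = len(x)
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

    def likelihood_raw(mu, sigma2):
        """Likelihood computed directly from the raw sample."""
        return math.prod(
            math.exp(-(xi - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
            for xi in x)

    def likelihood_from_summary(mu, sigma2):
        """Likelihood rebuilt from the sufficient statistic (xbar, s2) and n only."""
        expo = ((n - 1) * s2 + n * (xbar - mu) ** 2) / (2 * sigma2)
        return (2 * math.pi * sigma2) ** (-n / 2) * math.exp(-expo)

    for mu, sigma2 in [(4.0, 1.0), (5.0, 0.5), (6.0, 2.0)]:
        print(likelihood_raw(mu, sigma2), likelihood_from_summary(mu, sigma2))
    # The two columns agree up to floating-point error for every (mu, sigma2).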

Exercise 7.14 (Order Statistic as Sufficient Statistic). Let Y1 < Y2 < · · · < Yn be the order
statistics of a random sample X1 , . . . , Xn from the population with pdf

f (x; θ) = e^{−(x−θ)} 1_{(θ,∞)} (x)

with unknown parameter θ. Define Y1 := mini Xi . Determine whether or not Y1 is a sufficient statistic
for θ.
Proof. Observe that the joint pdf is given by

f (x; θ) = ∏_{i=1}^{n} e^{−(xi−θ)} 1_{(θ,∞)} (xi )
         = ∏_{i=1}^{n} e^{−xi} e^{θ} 1_{(θ,∞)} (xi )
         = [e^{−x1} e^{θ} 1_{(θ,∞)} (x1 )] · [e^{−x2} e^{θ} 1_{(θ,∞)} (x2 )] · · · [e^{−xn} e^{θ} 1_{(θ,∞)} (xn )]
         = e^{−Σ_{i=1}^{n} xi} e^{nθ} 1_{(θ,∞)} (x1 ) · 1_{(θ,∞)} (x2 ) · · · 1_{(θ,∞)} (xn )
         = e^{−Σ_{i=1}^{n} xi} e^{nθ} 1_{(θ,∞)} (mini xi ),

where the last equality holds since 1_{(θ,∞)} (x1 ) · 1_{(θ,∞)} (x2 ) · · · 1_{(θ,∞)} (xn ) takes value 1 if and only
if 1_{(θ,∞)} (mini xi ) = 1; otherwise, it is zero. Now, set g(T ; θ) := e^{nθ} 1_{(θ,∞)} (mini xi ) and h(x) :=
e^{−Σ_{i=1}^{n} xi} . Then, by the factorization theorem, it follows that Y1 = mini Xi is a sufficient statistic.
In Chapter 3, we discussed a broad class of distributions called the exponential family. Using the
factorization theorem, one can readily find a sufficient statistic for an exponential family of
distributions.

Theorem 7.4 (Sufficient Statistic for an Exponential Family). Let X = (X1 , . . . , Xn ) be a random
sample from a pdf or pmf f (x|θ) that belongs to an exponential family, i.e.,

f (x|θ) = h(x) c(θ) exp( Σ_{i=1}^{k} wi (θ) ti (x) ),

where θ := (θ1 , . . . , θd ) with d ≤ k. Then

T (X) = ( Σ_{j=1}^{n} t1 (Xj ), . . . , Σ_{j=1}^{n} tk (Xj ) )

is a sufficient statistic for θ.

Proof. We observe that the joint pdf (or pmf) of X is given by

∏_{j=1}^{n} f (xj |θ) = ∏_{j=1}^{n} h(xj ) c(θ) exp( Σ_{i=1}^{k} wi (θ) ti (xj ) )
                     = [c(θ)]^n ( ∏_{j=1}^{n} h(xj ) ) ∏_{j=1}^{n} exp( Σ_{i=1}^{k} wi (θ) ti (xj ) )
                     = [c(θ)]^n ( ∏_{j=1}^{n} h(xj ) ) exp( Σ_{j=1}^{n} Σ_{i=1}^{k} wi (θ) ti (xj ) )
                     = [c(θ)]^n ( ∏_{j=1}^{n} h(xj ) ) exp( Σ_{i=1}^{k} wi (θ) Σ_{j=1}^{n} ti (xj ) ).

Define h′ (x) := ∏_{j=1}^{n} h(xj ) and g(T (x); θ) := [c(θ)]^n exp( Σ_{i=1}^{k} wi (θ) Σ_{j=1}^{n} ti (xj ) ). Then, by
the factorization theorem, T (X) is a sufficient statistic for θ.
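For concreteness, here is a worked instance (added for illustration, not part of the original notes): writing the N (µ, σ^2 ) pdf, with both parameters unknown, in the exponential-family form above identifies the sufficient statistic directly.

    f(x \mid \mu, \sigma^2)
      = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)
      = \underbrace{1}_{h(x)}\,
        \underbrace{\frac{e^{-\mu^2/(2\sigma^2)}}{\sqrt{2\pi\sigma^2}}}_{c(\theta)}\,
        \exp\!\Big(\underbrace{\tfrac{\mu}{\sigma^2}}_{w_1(\theta)}\,\underbrace{x}_{t_1(x)}
          + \underbrace{\big(-\tfrac{1}{2\sigma^2}\big)}_{w_2(\theta)}\,\underbrace{x^2}_{t_2(x)}\Big),
    \qquad
    T(\mathbf{X}) = \Big(\sum_{j=1}^{n} X_j,\ \sum_{j=1}^{n} X_j^2\Big).

Since (Σ_{j} Xj , Σ_{j} Xj^2 ) and (X̄, S^2 ) are one-to-one functions of each other, this is consistent with Exercise 7.13 (and with Exercise 7.15 below on bijective mappings).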

7.4 Minimal Sufficient Statistics


So far, we have focused on finding one sufficient statistic for each model considered. However, in
any problem there are many sufficient statistics. For example, the complete sample X is itself always a
sufficient statistic; see Exercise 7.8. Also, having obtained a sufficient statistic, applying a bijective
mapping to it still preserves sufficiency.
Exercise 7.15 (Bijective Mapping Preserves Sufficiency). Let X := (X1 , . . . , Xn ) be a random
sample from a population with pdf f (x; θ). Suppose T (X) is a sufficient statistic and u be a
one-to-one function with inverse u−1 . Show that T ∗ (X) := u(T (X)) is also sufficient.
Proof. Since T (X) is a sufficient statistic for θ, by the factorization theorem, there exist
functions g and h such that
f (x|θ) = g(T (x)|θ)h(x).
Since T ∗ (x) := u(T (x)) and u is invertible, it follows that u^{−1} (T ∗ (x)) = T (x). Therefore,
we have
f (x|θ) = g(u−1 (T ∗ (x))|θ)h(x).
Defining a new function g ∗ (t|θ) := g(u−1 (t)|θ), we see that
f (x|θ) = g ∗ (T ∗ (x)|θ)h(x).
By the Factorization theorem, T ∗ (X) is a sufficient statistic.

Remark 7.10 (Selecting a Good Sufficient Statistic). The above shows the nonuniqueness of
sufficient statistics. Hence, it is natural to ask whether one sufficient statistic is any better than
another. Recall that the purpose of a sufficient statistic is to achieve data reduction without
loss of information about the parameter θ; thus, a statistic that achieves the most data reduction
while still retaining all the information about θ might be considered preferable. The definition
of such a statistic is formalized now.

Definition 7.7 (Minimal Sufficient Statistic). A sufficient statistic T (X) is called a minimal
sufficient statistic if, for any other sufficient statistic T ′ (X), there exists a function g such
that T (x) = g(T ′ (x)) for any x ∈ X . Equivalently, for every x, y ∈ X , if T ′ (x) = T ′ (y),
then T (x) = T (y).
Theorem 7.5 (Sufficient Condition for Minimal Sufficient Statistic). Let f (x; θ) be the pdf (or
pmf ) of a random sample X. A statistic T (X) is a minimal sufficient statistic for θ if, for every
two sample points x and y,

f (x; θ)/f (y; θ) is independent of θ if and only if T (x) = T (y).   (∗)

Proof. In the sequel, to simplify the proof, we assume f (x; θ) > 0 for all x ∈ X and θ. Let T
be a statistic satisfying statement (∗). We first show that T (X) is a sufficient statistic; i.e., by
Neyman’s factorization theorem, we must show that there exist nonnegative functions g and h
such that f (x; θ) = g(T (x); θ)h(x). Let

T := {t : t = T (x) for some x ∈ X }

be the image of X under T (x). Define the partition sets induced by T (x) as At := {x : T (x) = t}.
For each At , choose and fix one element xt ∈ At . For any x ∈ X , xT (x) is the fixed element that
is in the same set At as x. Since x and xT (x) are in the same set At ,

T (x) = T (x_{T (x)} ),

and hence, by the assumed hypothesis, the ratio f (x; θ)/f (x_{T (x)} ; θ) is independent of θ. Thus, we can define
a function on X by

h(x) := f (x; θ) / f (x_{T (x)} ; θ),

and h does not depend on θ. Define a function on T by g(t; θ) := f (xt ; θ). Then we see that

f (x; θ) = f (x_{T (x)} ; θ) · f (x; θ) / f (x_{T (x)} ; θ) = g(T (x); θ) h(x),

and by the factorization theorem, T (X) is a sufficient statistic for θ.


To complete the proof, it remains to show that T (X) is minimal. Let T ′ (X) be any other sufficient
statistic. By the factorization theorem, there exist functions g ′ and h′ such that

f (x; θ) = g ′ (T ′ (x); θ)h′ (x).

Let x, y be any two sample points with T ′ (x) = T ′ (y). Then

f (x; θ) / f (y; θ) = g ′ (T ′ (x); θ) h′ (x) / [ g ′ (T ′ (y); θ) h′ (y) ] = h′ (x) / h′ (y).

Since the ratio does not depend on θ, the assumptions of the theorem imply that T (x) = T (y).
Thus, T (x) is a function of T ′ (x), and hence T (X) is minimal.

Exercise 7.16 (Minimal Sufficient Statistic: Bernoulli Case). Let X := (X1 , X2 , . . . , Xn ) be a
random sample from a Bernoulli(θ) population with distribution

f (x; θ) := θ^x (1 − θ)^{1−x} ,   x ∈ {0, 1}.

Define a statistic T (X) := Σ_{i=1}^{n} Xi to estimate θ. Recall that in Exercise 7.9 we have shown
that T is sufficient. Show that T (X) is minimal sufficient for θ.
Proof. To show that T (X) is a minimal sufficient statistic for θ, we fix two sample points x, y
and form the ratio

f (x; θ) / f (y; θ) = ∏_{i=1}^{n} f (xi ; θ) / ∏_{i=1}^{n} f (yi ; θ)
                    = θ^{Σ_{i} xi} (1 − θ)^{n − Σ_{i} xi} / [ θ^{Σ_{i} yi} (1 − θ)^{n − Σ_{i} yi} ]
                    = θ^{T (x)} (1 − θ)^{n−T (x)} / [ θ^{T (y)} (1 − θ)^{n−T (y)} ]
                    = θ^{T (x)−T (y)} (1 − θ)^{−(T (x)−T (y))}
                    = ( θ/(1 − θ) )^{T (x)−T (y)} .

Note that the ratio is independent of θ if and only if T (x) = T (y). Therefore, T (X) is minimal
sufficient for θ.

Exercise 7.17 (Normal Minimal Sufficient Statistic). Let X := (X1 , . . . , Xn ) be a random sample
from N (µ, σ^2 ), with both µ and σ^2 unknown. Let (x̄, s^2_x ) and (ȳ, s^2_y ) be the sample means and sample
variances corresponding to two sample points x and y, respectively. Show that (X̄, S^2 ) is a minimal
sufficient statistic for (µ, σ^2 ).
Proof. To show that (X̄, S^2 ) is a minimal sufficient statistic for (µ, σ^2 ), we apply Theorem 7.5. By Exercise 7.7, we
know that the joint pdf of X is

f (x; θ) = 1/(2πσ^2)^{n/2} e^{−(1/2) Σ_{i=1}^{n} ((xi−µ)/σ)^2}
         = 1/(2πσ^2)^{n/2} e^{−(1/(2σ^2)) [ Σ_{i=1}^{n} (xi−x̄)^2 + n(x̄−µ)^2 ]} .   (∗)

With θ := (µ, σ^2 ), we observe the ratio

f (x; θ) / f (y; θ) = ∏_{i=1}^{n} f (xi ; θ) / ∏_{i=1}^{n} f (yi ; θ)
  = [ ∏_{i=1}^{n} (1/(σ√(2π))) exp( −(xi − µ)^2 /(2σ^2) ) ] / [ ∏_{i=1}^{n} (1/(σ√(2π))) exp( −(yi − µ)^2 /(2σ^2) ) ]
  = e^{−(1/(2σ^2)) [ Σ_{i} (xi−x̄)^2 + n(x̄−µ)^2 ]} / e^{−(1/(2σ^2)) [ Σ_{i} (yi−ȳ)^2 + n(ȳ−µ)^2 ]}      (By (∗))
  = e^{−(1/(2σ^2)) [ (n−1)s^2_x + n(x̄−µ)^2 ]} / e^{−(1/(2σ^2)) [ (n−1)s^2_y + n(ȳ−µ)^2 ]}
  = exp( −(1/(2σ^2)) [ (n−1)s^2_x + n(x̄−µ)^2 − (n−1)s^2_y − n(ȳ−µ)^2 ] )
  = exp( −(1/(2σ^2)) [ (n−1)(s^2_x − s^2_y ) + n[ (x̄−µ)^2 − (ȳ−µ)^2 ] ] )
  = exp( −(1/(2σ^2)) [ (n−1)(s^2_x − s^2_y ) + n[ (x̄^2 − ȳ^2 ) − 2(x̄ − ȳ)µ ] ] ).

The ratio is constant as a function of µ and σ^2 if and only if x̄ = ȳ and s^2_x = s^2_y . Therefore,
by Theorem 7.5, (X̄, S^2 ) is a minimal sufficient statistic for (µ, σ^2 ).

A Appendix
A.1 Review of Conditional Probability
Let A, B be two events on sample space S. The conditional probability of A given B is

P (A|B) = P (A ∩ B) / P (B)

provided that P (B) > 0.


Remark A.1. P (·|B) is a probability set function; i.e., P (A|B) ≥ 0 for all A ⊂ S, P (S|B) = 1,
and if A1 , A2 , . . . are disjoint events, then

P (∪_{i=1}^{∞} Ai |B) = Σ_{i=1}^{∞} P (Ai |B).

Let X and Y be two random variables with joint pdf f (x, y) and marginal pdf’s fX (x) and fY (y).
The conditional pdf of Y given X = x is

f (y|x) = f (x, y) / fX (x)

for all y, provided fX (x) > 0. Note that f (y|x) is indeed a pdf since f (y|x) ≥ 0 for all y and ∫_{−∞}^{∞} f (y|x) dy = 1 for all x.

A.2 Factorization Theorem for Discrete Random Samples
Theorem A.1 (Neyman’s Factorization Theorem). Let f (x; θ) denote the joint pdf (or pmf )
of a random sample X. A statistic T (X) is a sufficient statistic for parameter θ if and only if
there exist two nonnegative functions gθ (t) and h(x) such that for all sample points x and all
parameter points θ,

f (x; θ) = ∏_{i=1}^{n} f (xi ; θ) = gθ (T (x)) h(x)   (∗)

where h(x) does not depend on θ but gθ (t) depends on θ.


Proof. We give the proof for discrete distributions.
(⇒) Suppose T (X) is a sufficient statistic. We must show that there exist two functions gθ (t)
and h such that the factorization (∗) holds. Choose

gθ (t) := Pθ (T (X) = t) (∗∗)

and
h(x) := P (X = x|T (X) = T (x)). (@)
We claim that these two functions are nonnegative and gθ depends on θ and h does not. The
nonnegativity is easy to verify. To verify the dependency requirement, we note that because T (X)
is sufficient, by Definition 7.6, the conditional probability (@) defining h(x) does not depend on θ.
Thus, this choice of h(x) and gθ (t) is legitimate. For this choice of gθ and h, we observe that,
for a discrete distribution, fθ (x) := Pθ (X = x). Moreover, note that if X = x, then T (X) = T (x).
Therefore, we have
Pθ (X = x) = Pθ (X = x, T (X) = T (x)).
We can rewrite the last probability via a conditional probability; i.e.,

Pθ (X = x, T (X) = T (x)) = Pθ (T (X) = T (x))P (X = x|T (X) = T (x)).

Thus, to sum up, we obtain

f (x; θ) = Pθ (X = x)
= Pθ (X = x, T (X) = T (x))
= Pθ (T (X) = T (x)) · P (X = x|T (X) = T (x))
= gθ (T (x)) · h(x),

which proves the factorization.


(⇐) Assume that the factorization form (∗) holds. Let qθ (t) be the pmf of T (X); i.e.,

qθ (t) := Pθ (T (X) = t).

To show that T (X) is a sufficient statistic, we examine the ratio fθ (x)/qθ (T (x)). For given x, define a
set Ax := {y : T (y) = T (x)}. Then,

qθ (T (x)) = Pθ (T (X) = T (x))
           = Pθ (X ∈ Ax )
           = Σ_{y∈Ax} fθ (y).   (⋆)

Hence, for any x,

fθ (x) / qθ (T (x)) = gθ (T (x)) · h(x) / qθ (T (x))                       (By factorization (∗))
                    = gθ (T (x)) · h(x) / Pθ (T (X) = T (x))
                    = gθ (T (x)) · h(x) / Σ_{y∈Ax} fθ (y)                   (By (⋆))
                    = gθ (T (x)) · h(x) / Σ_{y∈Ax} gθ (T (y)) h(y)          (factorization (∗) on the denominator)
                    = gθ (T (x)) · h(x) / [ gθ (T (x)) Σ_{y∈Ax} h(y) ]      (T is constant on Ax )
                    = h(x) / Σ_{y∈Ax} h(y).

Since the ratio above does not depend on θ, by Theorem 7.2, T (X) is a sufficient statistic
for θ.

References
[1] G. Casella and R. L. Berger, Statistical Inference. Cengage Learning, 2001.
[2] R. V. Hogg, J. W. McKean, and A. T. Craig, Introduction to Mathematical Statistics. 2019.
