Chap 07 Data Reduction
In a typical statistical problem, we want to study a random variable X of interest, but its pdf or pmf is unknown or only partially known.1 Statistical inference therefore uses the information in a sample X := (X1 , . . . , Xn ) to make inferences about an unknown parameter vector θ ∈ Θ. If the sample size n is large, then the observed sample2 x := (x1 , . . . , xn ) is a long list of numbers that may be hard to interpret. Hence, it is desirable to “summarize” the information in a sample by determining a few key features of the sample values. This is usually done by computing suitable statistics T (X1 , . . . , Xn ); e.g., the sample mean, the sample variance, the maximum observation, or the minimum observation.
In this chapter, our goal is to study a notion called sufficiency and methods of data reduction that do not discard important information about the unknown parameter θ.3
7.1 Preliminaries
Below we review some basic ideas about random samples from Chapters 5 and 6.
Definition 7.1 (Random Samples). The random vector X := (X1 , . . . , Xn ) is called a random
sample of size n from the population f (x) if X1 , . . . , Xn are iid with pdf or pmf f (x).
Definition 7.2 (Statistic). Let X := (X1 , . . . , Xn ) be a random sample of size n from a population and let T (x1 , . . . , xn ) be a real-valued (or vector-valued) function whose domain includes the sample space of X. Then the random variable (or random vector) T (X) := T (X1 , . . . , Xn ) is called a statistic.
Theorem 7.1 (WLLN). Let X1 , X2 , . . . be iid random variables with E[Xi ] = µ and var(Xi ) = σ² < ∞. Define X̄n := (1/n) ∑_{i=1}^{n} Xi . Then X̄n → µ in probability; that is, for every ε > 0,

lim_{n→∞} P (|X̄n − µ| ≥ ε) = 0.
Proof. See Chapter 6.
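The WLLN can be illustrated by simulation. Below is a minimal sketch; the Bernoulli(p) population with p = 0.3, the tolerance ε = 0.05, and the sample sizes are illustrative choices, not from the text.

```python
import random

# Illustrate the WLLN: for iid Bernoulli(p) draws, the fraction of
# sample means with |X_bar_n - p| >= eps shrinks as n grows.
# (p, eps, reps, and the sample sizes below are illustrative choices.)
random.seed(0)
p, eps, reps = 0.3, 0.05, 2000

def deviation_prob(n):
    """Monte Carlo estimate of P(|X_bar_n - p| >= eps)."""
    count = 0
    for _ in range(reps):
        xbar = sum(random.random() < p for _ in range(n)) / n
        if abs(xbar - p) >= eps:
            count += 1
    return count / reps

probs = [deviation_prob(n) for n in (10, 100, 1000)]
print(probs)  # the estimates decrease toward 0 as n grows
```

The estimated deviation probabilities shrink as n increases, matching the limit statement in Theorem 7.1.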
1 The ignorance about the unknown pdf (or pmf) can be classified in two ways: (i) f (x) is completely unknown; (ii) the form of f (x) is known but depends on an unknown parameter θ.
Exercise 7.1 (Image of Statistic). Let X be the sample space of X. Find the image, call it T ,
of the statistic T (X).
Proof. It is readily verified that T := {t : t = T (x) for some x ∈ X } is the image of X un-
der T (x).
For each t ∈ T , define the partition set At := {x : T (x) = t}.
The statistic summarizes the data in that, rather than reporting the entire sample x, it reports
only that T (x) = t or equivalently, x ∈ At . All points in At are treated the same if we are
interested in T only. Thus, the statistic T provides a data reduction. Our goal here is to reduce
data as much as we can but not lose any important information about θ.
Remark 7.1. (i) Let T (X) be a statistic. If x ̸= y but T (x) = T (y), then x and y provide the same information about θ and can be treated as the same. (ii) Data reduction in terms of a statistic T (X) corresponds to a partition of the sample space X .
Exercise 7.2 (Data Reduction). Suppose that X1 , X2 , X3 are iid Bernoulli(p) random variables with p ∈ (0, 1) and Xi ∈ {0, 1}. Define T : {0, 1}³ → {0, 1, 2, 3} with

T (X) := ∑_{i=1}^{3} Xi .

Describe the partition of {0, 1}³ induced by T .
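The partition induced by the statistic in Exercise 7.2 can be enumerated directly; a small sketch:

```python
from itertools import product

# The statistic T(x) = x1 + x2 + x3 partitions the sample space
# {0,1}^3 into the sets A_t = {x : T(x) = t}, t = 0, 1, 2, 3.
A = {t: [] for t in range(4)}
for x in product([0, 1], repeat=3):
    A[sum(x)].append(x)

for t, block in A.items():
    print(t, block)
# The 8 outcomes collapse into 4 blocks, of sizes 1, 3, 3, 1.
```

All 2³ = 8 outcomes are reduced to the 4 values of T, illustrating Remark 7.1(ii): the statistic is exactly a partition of the sample space.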
7.2 Sufficiency of Statistics
The sections that follow require some background on conditional probability and conditional distributions; e.g., see Appendix A.1. A sufficient statistic for an unknown parameter θ is a statistic that captures all the information about θ contained in the sample. A sufficient statistic is formally defined in the following way.
Definition 7.6 (Sufficient Statistic). Let X := (X1 , . . . , Xn ) be a random sample of size n from a distribution that has pdf (or pmf) f (x; θ) for θ ∈ Θ. A statistic T (X) is called a sufficient statistic for θ if, for each value t of T (X), the conditional distribution of the sample X given T (X) = t, i.e.,

f (x|t) = h(x),

does not depend on θ.
Remark 7.2. (i) If, for T (X) = t, the conditional pdf f (x; θ|t) does not involve the parameter θ, then the sample X contains no further information about θ once T (X) = t is observed. Said another way, T (X) exhausts all the information about θ that is contained in the sample. (ii) If X is discrete, then T (X) is discrete and sufficiency means that P (X = x|T (X) = t) is known; i.e., it does not depend on any unknown quantity θ. (iii) Once we observe x and compute a sufficient statistic T (x), the original data x contain no further information concerning θ and can be discarded; i.e., T (x) is all we need regarding θ. Roughly speaking, a statistic is sufficient if we can calculate the joint pdf of X, up to a factor free of θ, by knowing only T (X).
Exercise 7.3 (A Useful Identity). Let X be a random sample from a population and let T (X) be a statistic. Show that the following relationship between events holds:

{X = x} = {X = x, T (X) = T (x)}

for any x ∈ X .
Proof. We observe that if X = x, then T (X) = T (x). It follows that {X = x} ⊂ {T (X) = T (x)}, and hence {X = x} = {X = x} ∩ {T (X) = T (x)}. Therefore, we conclude {X = x, T (X) = T (x)} = {X = x}.
Theorem 7.2 (Sufficiency via a Ratio). Let p(x; θ) be the joint pdf (or pmf ) of the sample X and let q(t; θ) be the pdf (or pmf ) of T (X). Then T (X) is a sufficient statistic for θ if, for every x in the sample space, the ratio p(x; θ)/q(T (x); θ) does not depend on θ.
Proof. We give the argument for the discrete case. Observe that

P (X = x|T (X) = t) = P (X = x, T (X) = t) / P (T (X) = t)
                    = P (X = x) / P (T (X) = T (x))   if T (x) = t, and 0 otherwise
                    = p(x; θ) / q(T (x); θ)           if T (x) = t, and 0 otherwise,

where the second equality holds by Exercise 7.3. In the last equality, p(x; θ) is the joint pmf of the sample X and q(t; θ) is the pmf of T (X). By assumption,

p(x; θ) / q(T (x); θ) = h(x)

for some function h that does not depend on θ, which implies that T (X) is a sufficient statistic for θ.
Exercise 7.4 (Random Sample Itself is a Sufficient Statistic). Let T (X) := (X1 , . . . , Xn ) ∈ Rⁿ be the random sample itself, viewed as a statistic. Show that T is sufficient.
Proof. For T (X) = T (x) := x′, we observe that

f (x; θ|x′) = f (x, x′; θ) / f (x′; θ) = f (x; θ)/f (x′; θ) = 1   if x = x′, and 0 if x ̸= x′,

which does not depend on θ.
Remark 7.4. Instead of listing all of the individual samples X1 , . . . , Xn , we might prefer to give only the sample mean X̄ or the sample variance Sn². Motivated by this, we seek ways of reducing a set of data so that the data can be more easily understood without losing the meaning associated with the entire set of observations.
Exercise 7.5 (Binomial Sufficient Statistic). Let X = (X1 , . . . , Xn ) be a Bernoulli random sample with unknown parameter θ ∈ (0, 1); i.e., each Xi has pmf P (X = x) = θ^x (1 − θ)^{1−x} for x ∈ {0, 1}. Define a statistic

T (X) := ∑_{i=1}^{n} Xi .

Show that T (X) is a sufficient statistic for θ.
Proof. Note that T (X) = ∑_{i=1}^{n} Xi is the total number of successes in n trials. The probability of the event {T (X) = t} with t := ∑_{i=1}^{n} xi is

P (T (X) = t; θ) = C(n, t) θ^t (1 − θ)^{n−t} ,   t = 0, 1, . . . , n,

where C(n, t) := n!/(t!(n − t)!) is the binomial coefficient. Then

p(x; θ)/q(T (x); θ) = P (X = x; θ) / P (T (X) = T (x); θ) = ∏_{i=1}^{n} P (Xi = xi ; θ) / [C(n, t) θ^t (1 − θ)^{n−t}],   (∗)

where the last equality holds by the iid property of X. Now, note that

∏_{i=1}^{n} P (Xi = xi ; θ) = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1−xi} = θ^{∑ xi} (1 − θ)^{n − ∑ xi} = θ^t (1 − θ)^{n−t} .

It follows that

p(x; θ)/q(T (x); θ) = θ^t (1 − θ)^{n−t} / [C(n, t) θ^t (1 − θ)^{n−t}] = 1/C(n, t),

which does not depend on θ. Therefore, by Theorem 7.2, T (X) is a sufficient statistic for θ.
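The conclusion of Exercise 7.5 can be checked numerically: the conditional pmf of the sample given the number of successes equals 1/C(n, t) for every θ. A small sketch (n = 4 and the θ grid are illustrative choices):

```python
from itertools import product
from math import comb, isclose

# Check that P(X = x | T(X) = t) = 1 / C(n, t) for every theta:
# the conditional law of the Bernoulli sample given the number of
# successes does not involve theta.
n = 4

def joint_pmf(x, theta):
    t = sum(x)
    return theta**t * (1 - theta)**(n - t)

for theta in (0.2, 0.5, 0.9):
    for x in product([0, 1], repeat=n):
        t = sum(x)
        q = comb(n, t) * theta**t * (1 - theta)**(n - t)  # pmf of T at t
        cond = joint_pmf(x, theta) / q
        assert isclose(cond, 1 / comb(n, t))
print("conditional pmf is 1 / C(n, t) for every theta checked")
```

Given t successes, every arrangement of them is equally likely, no matter what θ is; that is exactly what sufficiency of the success count means here.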
Remark 7.5. The analysis above tells us that the total number of successes in the Bernoulli
sample contains all the information about θ that is in the data. Other features of the data, such
as the exact value of X4 , contain no additional information.
Exercise 7.6 (Gamma Sufficient Statistic). Let X = (X1 , X2 , . . . , Xn ) be a random sample from a gamma distribution with α = 2 and β = θ > 0, with pdf

f (x|α, β) = (1/(Γ(2)θ²)) x^{2−1} e^{−x/θ} = (1/θ²) x e^{−x/θ} ,   x ≥ 0.

Define T (X) := ∑_{i=1}^{n} Xi and show that T is a sufficient statistic for θ.
Proof. We first show, via the mgf, that T has a gamma distribution with α = 2n and β = θ. Observe that

M_T (t) = E[e^{tT}] = E[e^{t ∑_{i=1}^{n} Xi}] = E[e^{tX1} e^{tX2} · · · e^{tXn}] = ∏_{i=1}^{n} E[e^{tXi}] = (E[e^{tX1}])^n = (1 − θt)^{−2n}

for t < 1/θ. Thus, by the uniqueness theorem for mgfs, it follows that T ∼ Γ(α = 2n, β = θ). Therefore, the pdf of T is

f_T (t; θ) = (1/(Γ(2n)θ^{2n})) t^{2n−1} e^{−t/θ} for t > 0, and 0 otherwise.

Now, with t := T (x) = ∑_{i=1}^{n} xi , the ratio of the joint pdf of X to the pdf of T is

f (x; θ) / f_T (T (x); θ) = [ (1/θ^{2n}) ( ∏_{i=1}^{n} xi ) e^{−(1/θ) ∑_{i=1}^{n} xi} ] / [ (1/(Γ(2n)θ^{2n})) t^{2n−1} e^{−t/θ} ] = Γ(2n) ∏_{i=1}^{n} xi / ( ∑_{i=1}^{n} xi )^{2n−1} ,

where xi > 0 for i = 1, . . . , n. Since the ratio does not depend on θ, the summation statistic T is a sufficient statistic for θ.
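The mgf step in Exercise 7.6 can be sanity-checked by Monte Carlo: the sum of n iid Γ(2, θ) variables should match the Γ(2n, θ) moments E[T] = 2nθ and var(T) = 2nθ². A sketch (n = 5, θ = 2, and the replication count are illustrative choices):

```python
import random

# Monte Carlo check that T = X1 + ... + Xn with Xi ~ Gamma(2, theta)
# matches the Gamma(2n, theta) moments: E[T] = 2n*theta, var(T) = 2n*theta^2.
# Note: random.gammavariate(alpha, beta) uses beta as the SCALE parameter.
random.seed(1)
n, theta, reps = 5, 2.0, 20000

samples = [sum(random.gammavariate(2, theta) for _ in range(n)) for _ in range(reps)]
mean = sum(samples) / reps
var = sum((s - mean) ** 2 for s in samples) / reps

print(mean, var)  # should be near 2*n*theta = 20 and 2*n*theta**2 = 40
```

This does not replace the mgf argument, of course; it only confirms the first two moments of the claimed Γ(2n, θ) law.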
Exercise 7.7 (Normal Sufficient Statistic). Let X := (X1 , . . . , Xn ) be iid normal random variables with mean µ and variance σ², where σ² is known but µ is not; i.e., θ := µ. Show that the sample mean

T (X) = (1/n) ∑_{i=1}^{n} Xi

is a sufficient statistic for µ.
Proof. Let θ := µ. To show that T (X) is a sufficient statistic for θ, we fix x and then invoke Theorem 7.2 by checking the ratio of pdfs f (x; θ)/q(T (x); θ). First note that, using the iid property of X1 , . . . , Xn , the joint pdf of X is
f (x; θ) = ∏_{i=1}^{n} f (xi ; θ) = ∏_{i=1}^{n} (1/(σ√(2π))) e^{−(1/2)((xi−θ)/σ)²} = (1/(2πσ²)^{n/2}) e^{−(1/2) ∑_{i=1}^{n} ((xi−θ)/σ)²} .   (∗)
On the other hand, to compute q(T (x); θ), we note that T (X) = (1/n) ∑_{i=1}^{n} Xi ∼ N (θ, σ²/n) (verify!). Let T (x) = x̄ := (1/n) ∑_{i=1}^{n} xi . Then the pdf of T (X) evaluated at T (x) is given by

q(T (x); θ) = (1/((σ/√n)√(2π))) e^{−(1/2)((x̄−θ)/(σ/√n))²} = (n^{1/2}/(σ(2π)^{1/2})) e^{−(n/2)((x̄−θ)/σ)²} .
Now, for Equation (∗), we further rewrite the exponent term

∑_{i=1}^{n} ((xi − θ)/σ)² = (1/σ²) ∑_{i=1}^{n} (xi − x̄ + x̄ − θ)²
= (1/σ²) ∑_{i=1}^{n} [(xi − x̄)² + 2(xi − x̄)(x̄ − θ) + (x̄ − θ)²]
= (1/σ²) [ ∑_{i=1}^{n} (xi − x̄)² + 2 ∑_{i=1}^{n} (xi − x̄)(x̄ − θ) + n(x̄ − θ)² ].   (∗∗)
Note that the cross term

2 ∑_{i=1}^{n} (xi − x̄)(x̄ − θ) = 2(x̄ − θ) ∑_{i=1}^{n} (xi − x̄) = 2(x̄ − θ) ( ∑_{i=1}^{n} xi − n x̄ ) = 0.

Hence,

f (x; θ) = (1/(2πσ²)^{n/2}) e^{−(1/(2σ²)) [ ∑_{i=1}^{n} (xi − x̄)² + n(x̄ − θ)² ]} .
Hence, the ratio

f (x; θ) / q(T (x); θ) = [ (1/(2πσ²)^{n/2}) e^{−(1/(2σ²)) [∑_{i=1}^{n} (xi−x̄)² + n(x̄−θ)²]} ] / [ (n^{1/2}/(σ(2π)^{1/2})) e^{−(n/2)((x̄−θ)/σ)²} ]
= [ (1/(2πσ²)^{n/2}) e^{−(1/(2σ²)) ∑_{i=1}^{n} (xi−x̄)²} e^{−(n/(2σ²))(x̄−θ)²} ] / [ (n^{1/2}/(σ(2π)^{1/2})) e^{−(n/2)((x̄−θ)/σ)²} ]
= (σ(2π)^{1/2}/n^{1/2}) (1/(2πσ²)^{n/2}) e^{−(1/(2σ²)) ∑_{i=1}^{n} (xi−x̄)²} ,
which does not depend on θ. Therefore, by Theorem 7.2, the sample mean is a sufficient statistic
for θ = µ.
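The ratio in Exercise 7.7 can also be verified numerically: computed at several values of θ, it should come out identical. A small deterministic sketch (the data x, σ, and the θ grid are illustrative choices):

```python
from math import exp, pi, sqrt, isclose

# Numerically check Exercise 7.7: the ratio f(x; theta) / q(T(x); theta)
# is the same for every theta when T is the sample mean.
x = [1.2, -0.4, 2.5, 0.9, 1.1]
sigma, n = 1.5, len(x)
xbar = sum(x) / n

def joint_pdf(theta):
    # joint pdf of n iid N(theta, sigma^2) observations
    return (2 * pi * sigma**2) ** (-n / 2) * exp(
        -0.5 * sum(((xi - theta) / sigma) ** 2 for xi in x))

def mean_pdf(theta):
    # pdf of X_bar ~ N(theta, sigma^2 / n), evaluated at xbar
    s = sigma / sqrt(n)
    return 1 / (s * sqrt(2 * pi)) * exp(-0.5 * ((xbar - theta) / s) ** 2)

ratios = [joint_pdf(th) / mean_pdf(th) for th in (-1.0, 0.0, 0.7, 2.3)]
assert all(isclose(r, ratios[0]) for r in ratios)  # constant in theta
print(ratios[0])
```

All four ratios agree to floating-point precision, as the algebra above predicts: the θ-dependent exponential factors cancel exactly.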
Remark 7.7 (Motivation for Alternatives for Checking Sufficiency). To check whether a statistic T (X) is sufficient, according to Theorem 7.2, we must check whether the ratio f (x; θ)/q(T (x); θ) depends on θ. This requires knowing the pdf of T (X), which is often not easy to calculate. Even if the pdf of T (X) is available, it may require a very tedious analysis, as we have seen in the previous exercise. The next theorem provides a way to avoid this by factorizing the joint pdf of X.
Theorem 7.3 (Neyman’s Factorization Theorem). Let f (x; θ) denote the joint pdf (or pmf ) of a random sample X. A statistic T (X) is a sufficient statistic for θ if and only if there exist two nonnegative functions g and h such that, for all sample points x and all parameter points θ,

f (x; θ) = g(T (x); θ) h(x),

where h does not depend on θ.
Proof. Here we provide a proof for continuous random samples.4
(⇒) If T (X) is a sufficient statistic for θ, then, for T (X) = t, the conditional pdf of X given T (X) = t is

f (x; θ|t) = f (x, t; θ)/f_T (t; θ) = f (x; θ)/f_T (t; θ) = h(x),   (∗)
where f_T is the pdf of T (X), the second equality holds since f (x, t; θ) = f (x; θ) (if X = x, then T (x) = t), and the last equality holds by the definition of sufficiency of T . Therefore, rearranging the last equality in Equation (∗) yields

f (x; θ) = f_T (t; θ) · h(x),

which is the desired form of factorization if we take the function g to be the pdf of T (X); i.e., g(t; θ) := f_T (t; θ).
(⇐) The proof of this direction is more involved. Assume that the factorization identity holds; i.e.,

f (x; θ) = g(T (x); θ) h(x).   (1)

We must show that T (X) is a sufficient statistic. Define the one-to-one transformations5
y1 := T (x1 , . . . , xn );
y2 := T2 (x1 , . . . , xn );
⋮
yn := Tn (x1 , . . . , xn ),

with inverse functions xi = wi (y1 , . . . , yn ) for i = 1, . . . , n, and Jacobian

J := det [ ∂xi /∂yj ]_{i,j = 1, . . . , n}
that does not depend on θ. Then the joint pdf of Y1 , . . . , Yn is given by

f_Y (y1 , . . . , yn ; θ) = f (w1 , . . . , wn ; θ)|J| = g(y1 ; θ) h(w1 , . . . , wn )|J|,

where the second equality holds by the factorization identity (1). The pdf of Y1 = T (X), call it f_{Y1}(y1 ; θ), is simply the marginal pdf of f_Y . That is,

f_{Y1}(y1 ; θ) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f_Y (y1 , y2 , . . . , yn ; θ) dy2 · · · dyn
= g(y1 ; θ) ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(w1 , w2 , . . . , wn )|J| dy2 · · · dyn .
4 A proof for the discrete case can be found in Appendix A.2.
5 In a more general setting, the assumption of a one-to-one transformation made here can be dropped; see [2].
Note that the function h does not depend on θ, nor is θ involved in either the Jacobian J or the limits of integration. Hence the (n − 1)-fold integral on the right-hand side above is a function of y1 alone, say m(y1 ), for some function m(·). Thus, we may write

f_{Y1}(y1 ; θ) = g(y1 ; θ) · m(y1 ).

If m(y1 ) = 0, then f_{Y1}(y1 ; θ) = 0. If m(y1 ) > 0, then we can write

g(T (x); θ) = f_{Y1}(T (x); θ) / m(T (x)).
Therefore, with the aid of Equation (1), the joint pdf of X is

f (x; θ) = g(T (x); θ) · h(x) = f_{Y1}(T (x); θ) h(x)/m(T (x)).

Or equivalently,

f (x; θ) / f_{Y1}(T (x); θ) = h(x)/m(T (x)).
Since neither the function h nor the function m depends on θ, then the right-hand side does not
depend on θ. In accordance with the definition, Y1 (i.e., T (X)) is a sufficient statistic for the
parameter θ.
Exercise 7.8 (Random Sample Itself is Sufficient Statistic: Revisited). Let X be a random sample with joint pdf f (x; θ) for parameter θ. Show that T (X) = X is a sufficient statistic for θ.
Proof. For X = x, observe that the joint pdf of X is given by

f (x; θ) = g(T (x); θ) h(x),

where h(x) := 1 and g(T (x); θ) := f (T (x); θ) = f (x; θ). Hence, by the factorization theorem, T (X) = X is a sufficient statistic for θ.
Exercise 7.9 (Bernoulli Revisited). Let X1 , . . . , Xn be iid Bernoulli random variables with parameter θ ∈ (0, 1); i.e., each has pmf fX (x) = θ^x (1 − θ)^{1−x} for x ∈ {0, 1}. Define a statistic

T (X) := ∑_{i=1}^{n} Xi .

Show, via the factorization theorem, that T (X) is a sufficient statistic for θ.
Proof. With t := ∑_{i=1}^{n} xi , the joint pmf factors as

f (x; θ) = θ^t (1 − θ)^{n−t} · 1 = g( ∑_{i=1}^{n} xi ; θ ) · h(x),

where g( ∑_{i=1}^{n} xi ; θ ) := θ^t (1 − θ)^{n−t} and h(x) := 1. Hence, by the factorization theorem, T (X) is a sufficient statistic for θ.
Exercise 7.10 (Nonuniqueness of Sufficient Statistic). Let X1 , . . . , Xn be a random sample from Poisson(λ) with pmf

f (x; λ) = λ^x e^{−λ}/x! ,   x = 0, 1, 2, . . . .

Let

T (X) := ∑_{i=1}^{n} Xi .

Show that both T (X) and the sample mean x̄ are sufficient statistics for λ.
Proof. The joint pmf factors as

f (x; λ) = ∏_{i=1}^{n} λ^{xi} e^{−λ}/xi ! = λ^{∑_{i=1}^{n} xi} e^{−nλ} · ( 1/∏_{i=1}^{n} xi ! ) = g( ∑_{i=1}^{n} xi ; λ ) h(x),   (∗)

with g( ∑_{i=1}^{n} xi ; λ ) := λ^{∑ xi} e^{−nλ} and h(x) := 1/∏_{i=1}^{n} xi !. Thus, by the factorization theorem, T (X) = ∑_{i=1}^{n} Xi is a sufficient statistic for λ. Moreover, (∗) tells us that

f (x; λ) = λ^{n x̄} e^{−nλ} · ( 1/∏_{i=1}^{n} xi ! ),

where x̄ := (1/n) ∑_{i=1}^{n} xi . Define g1 (x̄; λ) := λ^{n x̄} e^{−nλ} and h(x) := 1/∏_{i=1}^{n} xi !. Then, by the factorization theorem, x̄ is also a sufficient statistic for λ.
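The Poisson factorization in Exercise 7.10 can be verified numerically: the joint pmf equals g(t; λ) h(x) for every λ, and g depends on the data only through t = ∑ xi (equivalently, through x̄ = t/n). A sketch (the data and the λ grid are illustrative choices):

```python
from math import exp, factorial, isclose

# Verify f(x; lam) = g(t; lam) * h(x) with t = sum(x),
# g(t; lam) = lam**t * exp(-n*lam), h(x) = 1 / prod(x_i!).
# Since t = n * x_bar, the same factor can be written through x_bar,
# which is why x_bar is sufficient as well.
x = [3, 0, 2, 5, 1]
n, t = len(x), sum(x)

def joint_pmf(lam):
    p = 1.0
    for xi in x:
        p *= lam**xi * exp(-lam) / factorial(xi)
    return p

h = 1.0
for xi in x:
    h /= factorial(xi)

for lam in (0.5, 1.0, 4.2):
    g = lam**t * exp(-n * lam)  # equivalently lam**(n * (t/n)) * exp(-n*lam)
    assert isclose(joint_pmf(lam), g * h)
print("factorization holds for every lambda checked")
```

Note that T = ∑ Xi and x̄ = T/n induce the same partition of the sample space (one is a one-to-one function of the other), which is the sense in which both are sufficient.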
Exercise 7.11 (Gamma Sufficient Statistic Revisited). Let X1 , . . . , Xn be a random sample from the gamma distribution of Exercise 7.6 (α = 2, β = θ > 0). Define a statistic

T (X) := ∑_{i=1}^{n} Xi .

Show, via the factorization theorem, that T is a sufficient statistic for θ.
Proof. The joint pdf is

f (x; θ) = ∏_{i=1}^{n} f (xi ; θ) = ∏_{i=1}^{n} (1/θ²) xi e^{−xi/θ} .

To prove that T is a sufficient statistic, one could compute the ratio f (x; θ)/f_T (T (x); θ) as in Exercise 7.6. An alternative way is to invoke the Neyman Factorization Theorem: with t := ∑_{i=1}^{n} xi ,

f (x; θ) = ∏_{i=1}^{n} (1/θ²) xi e^{−xi/θ} = (1/θ^{2n}) ( ∏_{i=1}^{n} xi ) e^{−(1/θ) ∑_{i=1}^{n} xi} = (1/θ^{2n}) e^{−(1/θ) t} ( ∏_{i=1}^{n} xi ) = g(t; θ) h(x),

where g(t; θ) := (1/θ^{2n}) e^{−(1/θ) t} and h(x) := ∏_{i=1}^{n} xi . Hence, by the factorization theorem, T is a sufficient statistic for θ.
Exercise 7.12 (Normal Sufficient Statistic Revisited). Let X1 , . . . , Xn be iid normal random variables with mean θ and variance σ², where σ² is known but θ is not. Show that the sample mean T (X) = (1/n) ∑_{i=1}^{n} Xi is a sufficient statistic for θ := µ via the Factorization Theorem.
Proof. By Exercise 7.7, we know that the joint pdf of X is

f (x; θ) = (1/(2πσ²)^{n/2}) e^{−(1/2) ∑_{i=1}^{n} ((xi−θ)/σ)²}   (@)
= (1/(2πσ²)^{n/2}) e^{−(1/(2σ²)) [∑_{i=1}^{n} (xi−x̄)² + n(x̄−θ)²]} .   (∗)

Now, define

h(x) := (1/(2πσ²)^{n/2}) e^{−(1/(2σ²)) ∑_{i=1}^{n} (xi−x̄)²} ,

which does not depend on the unknown parameter θ = µ. The factor in Equation (∗) that contains θ depends on the sample x only through the sample mean T (x) = x̄ = t. Thus, we may take

g(t; θ) := exp( −n(t − θ)²/(2σ²) ),

so that f (x; θ) = g(T (x); θ) h(x). Hence, by the factorization theorem, the sample mean is a sufficient statistic for θ.
Remark 7.8 (Trivial Sufficient Statistic). Of course, if we look at Equation (@), then we may write

f (x; θ) = (1/(2πσ²)^{n/2}) e^{−(1/2) ∑_{i=1}^{n} ((xi−θ)/σ)²} · 1,

and call h(x) := 1 and g(T (x); θ) := f (x; θ) by setting T (X) := X, the random sample itself. Then, by the factorization theorem, X is itself a sufficient statistic for θ := (µ, σ²). However, this is not a good sufficient statistic, because it achieves no data reduction: it requires retaining all of X1 , . . . , Xn .
Remark 7.9 (Multiple Sufficient Statistics). In all the previous exercises, the sufficient statistic is a real-valued function of the sample: all the information about θ in the sample x is summarized in the single number T (x). Sometimes the information cannot be summarized in one number, and several numbers are required instead. In such cases, a sufficient statistic is a vector, say T (X) := (T1 (X), . . . , Tr (X)). This situation often occurs when the parameter is also a vector, say θ = (θ1 , . . . , θs ), and it is usually the case that the sufficient statistic and the parameter vector are of equal length, that is, r = s. While different combinations of lengths are possible, the Factorization Theorem may be used to find a vector-valued sufficient statistic; see Exercise 7.13 below.
Exercise 7.13 (Joint Sufficient Statistic for Normal with Both Unknown Mean and Variance). Assume that X1 , . . . , Xn are iid normal random variables with mean µ and variance σ², where now both µ and σ² are unknown; i.e., the parameter vector is θ := (µ, σ²). Let T1 (x) = x̄ and T2 (x) = s² := (1/(n−1)) ∑_{i=1}^{n} (xi − x̄)². Show that T (X) := (T1 (X), T2 (X)) is a sufficient statistic for θ.
Exercise 7.14 (Order Statistic as Sufficient Statistic). Let Y1 < Y2 < · · · < Yn be the order statistics of a random sample X1 , . . . , Xn from the population with pdf f (x; θ) = e^{−(x−θ)} for x > θ, and zero otherwise. Show that Y1 = min_i Xi is a sufficient statistic for θ.
Proof. The joint pdf is

f (x; θ) = ∏_{i=1}^{n} e^{−(xi−θ)} 1_{(θ,∞)}(xi ) = e^{nθ} e^{−∑_{i=1}^{n} xi} 1_{(θ,∞)}(min_i xi ),

where the last equality holds since 1_{(θ,∞)}(x1 ) · 1_{(θ,∞)}(x2 ) · · · 1_{(θ,∞)}(xn ) takes value 1 if and only if 1_{(θ,∞)}(min_i xi ) = 1; otherwise, it is zero. Now, set g(T (x); θ) := e^{nθ} 1_{(θ,∞)}(min_i xi ) and h(x) := e^{−∑_{i=1}^{n} xi}; then, by the factorization theorem, it follows that Y1 = min_i Xi is a sufficient statistic for θ.
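The factorization in Exercise 7.14 can be checked numerically for a fixed sample: the product of shifted-exponential pdfs should equal g(min(x); θ) h(x), with only the minimum entering the θ-dependent factor. A sketch (the data and the θ grid are illustrative choices):

```python
from math import exp, isclose

# Check the Exercise 7.14 factorization: the joint pdf of iid shifted
# exponentials f(x; theta) = e^{-(x - theta)} for x > theta equals
# g(m; theta) * h(x) with m = min(x), g(m; theta) = e^{n*theta} * 1{m > theta},
# and h(x) = e^{-sum(x)}.
x = [2.3, 1.7, 4.0, 2.9]
n = len(x)

def joint_pdf(theta):
    p = 1.0
    for xi in x:
        p *= exp(-(xi - theta)) if xi > theta else 0.0
    return p

def g(m, theta):
    return exp(n * theta) if m > theta else 0.0

h = exp(-sum(x))
for theta in (0.0, 1.0, 1.69, 2.0):
    assert isclose(joint_pdf(theta), g(min(x), theta) * h)
print("factorization holds; theta enters only through min(x)")
```

The case θ = 2.0 is the interesting one: min(x) = 1.7 ≤ θ makes the indicator, and hence the joint pdf, vanish, showing why the minimum (and nothing else) carries the information about θ.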
In Chapter 3, we discussed a broad class of distributions called the exponential family. Using the factorization theorem, one can readily find a sufficient statistic for an exponential family of distributions.
Theorem 7.4 (Sufficient Statistic for Exponential Family). Let X = (X1 , . . . , Xn ) be a random sample from a pdf or pmf f (x|θ) that belongs to an exponential family given by

f (x|θ) = h(x) c(θ) exp( ∑_{i=1}^{k} wi (θ) ti (x) ).

Then

T (X) := ( ∑_{j=1}^{n} t1 (Xj ), . . . , ∑_{j=1}^{n} tk (Xj ) )

is a sufficient statistic for θ.
Proof. We observe that the joint pdf of X is given by

∏_{j=1}^{n} f (xj |θ) = ∏_{j=1}^{n} [ h(xj ) c(θ) exp( ∑_{i=1}^{k} wi (θ) ti (xj ) ) ]
= [c(θ)]^n ( ∏_{j=1}^{n} h(xj ) ) exp( ∑_{j=1}^{n} ∑_{i=1}^{k} wi (θ) ti (xj ) )
= [c(θ)]^n ( ∏_{j=1}^{n} h(xj ) ) exp( ∑_{i=1}^{k} wi (θ) ∑_{j=1}^{n} ti (xj ) ).

Define h′(x) := ∏_{j=1}^{n} h(xj ) and g(T (x); θ) := [c(θ)]^n exp( ∑_{i=1}^{k} wi (θ) ∑_{j=1}^{n} ti (xj ) ). Then, by the factorization theorem, T (X) is a sufficient statistic for θ.
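Theorem 7.4 can be checked on a concrete family. The Bernoulli(θ) pmf is in exponential-family form with h(x) = 1, c(θ) = 1 − θ, w(θ) = log(θ/(1 − θ)), and t(x) = x, so the theorem predicts T = ∑ Xj is sufficient. A sketch (the sample and the θ grid are illustrative choices):

```python
from math import log, exp, isclose

# Theorem 7.4 for Bernoulli(theta): write the pmf as
# f(x|theta) = h(x) c(theta) exp(w(theta) t(x)) with h(x) = 1,
# c(theta) = 1 - theta, w(theta) = log(theta/(1-theta)), t(x) = x,
# and check that the joint pmf factors through T = sum(x).
x = [1, 0, 1, 1, 0, 1]
n, t = len(x), sum(x)

def joint_pmf(theta):
    p = 1.0
    for xi in x:
        p *= theta**xi * (1 - theta)**(1 - xi)
    return p

for theta in (0.2, 0.5, 0.8):
    c, w = 1 - theta, log(theta / (1 - theta))
    g = c**n * exp(w * t)  # [c(theta)]^n exp(w(theta) * sum_j t(x_j))
    assert isclose(joint_pmf(theta), g * 1.0)  # h'(x) = 1 here
print("exponential-family factorization matches the joint pmf")
```

Algebraically, (1 − θ)^n (θ/(1 − θ))^t = θ^t (1 − θ)^{n−t}, so this recovers exactly the factorization of Exercise 7.9.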
Remark 7.10 (Selecting a Good Sufficient Statistic). The above shows the nonuniqueness of sufficient statistics. Hence, it is natural to ask whether one sufficient statistic is any better than another. Recall that the purpose of a sufficient statistic is to achieve data reduction without loss of information about the parameter θ; thus, a statistic that achieves the most data reduction while still retaining all the information about θ might be considered preferable. The definition of such a statistic is formalized now.
Definition 7.7 (Minimal Sufficient Statistic). A sufficient statistic T (X) is called a minimal
sufficient statistic if, for any other sufficient statistic T ′ (X), there exists a function g such
that T (x) = g(T ′ (x)) for any x ∈ X . Equivalently, for every x, y ∈ X if T ′ (x) = T ′ (y),
then T (x) = T (y).
Theorem 7.5 (Sufficient Condition for Minimal Sufficient Statistic). Let f (x; θ) be the pdf (or pmf ) of a random sample X. A statistic T (X) is a minimal sufficient statistic for θ if, for every two sample points x and y,

f (x; θ)/f (y; θ) does not depend on θ if and only if T (x) = T (y).   (∗)
Proof. In the sequel, to simplify the proof, we assume f (x; θ) > 0 for all x ∈ X and all θ. Let T be a statistic satisfying statement (∗). We first show that T (X) is a sufficient statistic; i.e., by Neyman’s factorization theorem, we must show that there exist nonnegative functions g and h such that f (x; θ) = g(T (x); θ)h(x). Let

T := {t : t = T (x) for some x ∈ X }

be the image of X under T (x). Define the partition sets induced by T (x) as At := {x : T (x) = t}. For each At , choose and fix one element xt ∈ At . For any x ∈ X , x_{T (x)} is the fixed element that is in the same set At as x. Since x and x_{T (x)} are in the same set At , T (x) = T (x_{T (x)}), and hence, by (∗), the ratio

h(x) := f (x; θ)/f (x_{T (x)}; θ)

does not depend on θ. Define g(t; θ) := f (xt ; θ). Then

f (x; θ) = f (x_{T (x)}; θ) · f (x; θ)/f (x_{T (x)}; θ) = g(T (x); θ) h(x),

so, by the factorization theorem, T is a sufficient statistic for θ.
Now we show minimality. Let T ′(X) be any other sufficient statistic. By the factorization theorem, there exist functions g′ and h′ such that f (x; θ) = g′(T ′(x); θ) h′(x). Let x and y be any two sample points with T ′(x) = T ′(y). Then

f (x; θ)/f (y; θ) = g′(T ′(x); θ) h′(x) / ( g′(T ′(y); θ) h′(y) ) = h′(x)/h′(y).
Since the ratio does not depend on θ, the assumptions of the theorem imply that T (x) = T (y).
Thus, T (x) is a function of T ′ (x) and T (x) is minimal.
Exercise 7.16 (Minimal Sufficient Statistic: Bernoulli Case). Let X := (X1 , X2 , . . . , Xn ) be a random sample from a Bernoulli(θ) population with pmf f (x; θ) = θ^x (1 − θ)^{1−x} for x ∈ {0, 1}. Show that T (X) := ∑_{i=1}^{n} Xi is a minimal sufficient statistic for θ.
Proof. For two sample points x and y, set t := ∑_{i=1}^{n} xi and t′ := ∑_{i=1}^{n} yi . Then

f (x; θ)/f (y; θ) = θ^t (1 − θ)^{n−t} / ( θ^{t′} (1 − θ)^{n−t′} ) = (θ/(1 − θ))^{t−t′} .

Note that the ratio is independent of θ if and only if t = t′, i.e., T (x) = T (y). Therefore, T (x) is minimal sufficient for θ.
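The two directions of the criterion in Exercise 7.16 can both be seen numerically: the likelihood ratio is constant in θ when the success counts agree, and varies with θ when they differ. A sketch (the sample points are illustrative choices):

```python
from math import isclose

# Theorem 7.5 check for Bernoulli samples: the likelihood ratio
# f(x; theta)/f(y; theta) = (theta/(1-theta))**(t_x - t_y) is constant
# in theta exactly when the success counts t_x and t_y coincide.
def ratio(x, y, theta):
    tx, ty, n = sum(x), sum(y), len(x)
    fx = theta**tx * (1 - theta)**(n - tx)
    fy = theta**ty * (1 - theta)**(n - ty)
    return fx / fy

x, y, z = [1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 0]
r_same = [ratio(x, y, th) for th in (0.2, 0.5, 0.8)]  # t_x = t_y = 2
r_diff = [ratio(x, z, th) for th in (0.2, 0.5, 0.8)]  # t_x = 2, t_z = 3

assert all(isclose(r, 1.0) for r in r_same)  # constant in theta
assert not isclose(r_diff[0], r_diff[1])     # varies with theta
print(r_same, r_diff)
```

When the counts agree the ratio is identically 1; when they differ it equals (θ/(1 − θ))^{t−t′}, which clearly moves with θ.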
Exercise 7.17 (Normal Minimal Sufficient Statistics). Let X := (X1 , . . . , Xn ) be a random sample from N (µ, σ²), with both µ and σ² unknown. Let (x̄, s²_x) and (ȳ, s²_y) be the sample means and sample variances corresponding to the x and y samples, respectively. Show that (X̄, S²) is a minimal sufficient statistic for (µ, σ²).
Proof. By Exercise 7.7, we know that the joint pdf of X is
f (x; θ) = (1/(2πσ²)^{n/2}) e^{−(1/2) ∑_{i=1}^{n} ((xi−µ)/σ)²} = (1/(2πσ²)^{n/2}) e^{−(1/(2σ²)) [∑_{i=1}^{n} (xi−x̄)² + n(x̄−µ)²]} .   (∗)
With θ := (µ, σ²), we observe the ratio

f (x; θ)/f (y; θ) = ∏_{i=1}^{n} f (xi ; θ) / ∏_{i=1}^{n} f (yi ; θ)
= e^{−(1/(2σ²)) [∑_{i=1}^{n} (xi−x̄)² + n(x̄−µ)²]} / e^{−(1/(2σ²)) [∑_{i=1}^{n} (yi−ȳ)² + n(ȳ−µ)²]}   (by (∗))
= e^{−(1/(2σ²)) [(n−1)s²_x + n(x̄−µ)²]} / e^{−(1/(2σ²)) [(n−1)s²_y + n(ȳ−µ)²]}
= exp( −(1/(2σ²)) [ (n−1)(s²_x − s²_y) + n[(x̄−µ)² − (ȳ−µ)²] ] )
= exp( −(1/(2σ²)) [ (n−1)(s²_x − s²_y) + n[(x̄² − ȳ²) − 2(x̄ − ȳ)µ] ] ).
The ratio is constant as a function of µ and σ² if and only if x̄ = ȳ and s²_x = s²_y . Therefore, by Theorem 7.5, (X̄, S²) is a minimal sufficient statistic for (µ, σ²).
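The criterion in Exercise 7.17 can be checked numerically: for two samples sharing the same mean and variance the ratio of joint pdfs is constant over (µ, σ), while it varies once the summaries differ. A sketch (the samples and parameter grid are illustrative choices):

```python
from math import exp, pi, isclose

# Exercise 7.17 check: f(x; mu, sigma)/f(y; mu, sigma) is constant
# over (mu, sigma) when the two samples share the same mean and
# variance, and depends on (mu, sigma) otherwise.
def joint_pdf(data, mu, sigma):
    n = len(data)
    ss = sum((d - mu) ** 2 for d in data)
    return (2 * pi * sigma**2) ** (-n / 2) * exp(-ss / (2 * sigma**2))

x = [1.0, 2.0, 3.0]
y = [3.0, 1.0, 2.0]   # a permutation: same mean and variance as x
z = [1.0, 2.0, 4.0]   # different mean and variance

params = [(0.0, 1.0), (1.5, 2.0), (-1.0, 0.5)]
r_xy = [joint_pdf(x, m, s) / joint_pdf(y, m, s) for m, s in params]
r_xz = [joint_pdf(x, m, s) / joint_pdf(z, m, s) for m, s in params]

assert all(isclose(r, r_xy[0]) for r in r_xy)  # constant (in fact 1)
assert not isclose(r_xz[0], r_xz[1])           # depends on (mu, sigma)
print(r_xy[0])
```

The permuted sample y carries exactly the same (x̄, s²) as x, so the ratio collapses to 1 for every parameter value, while z changes both summaries and the ratio moves with (µ, σ).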
A Appendix
A.1 Review of Conditional Probability
Let A, B be two events on a sample space S with P (B) > 0. The conditional probability of A given B is

P (A|B) = P (A ∩ B)/P (B).

Let X and Y be two random variables with joint pdf f (x, y) and marginal pdfs fX (x) and fY (y). The conditional pdf of Y given X = x is

f (y|x) = f (x, y)/fX (x)

for all y. Note that f (y|x) is still a pdf, since f (y|x) ≥ 0 for all y and ∫_{−∞}^{∞} f (y|x) dy = 1 for all x.
A.2 Factorization Theorem for Discrete Random Samples
Theorem A.1 (Neyman’s Factorization Theorem). Let f (x; θ) denote the joint pdf (or pmf ) of a random sample X. A statistic T (X) is a sufficient statistic for parameter θ if and only if there exist two nonnegative functions gθ (t) and h(x) such that, for all sample points x and all parameter points θ,

f (x; θ) = ∏_{i=1}^{n} f (xi ; θ) = gθ (T (x)) h(x).   (∗)
Proof. (⇒) Suppose that T (X) is a sufficient statistic. Define

gθ (t) := Pθ (T (X) = t)   and   h(x) := P (X = x|T (X) = T (x)).   (@)

We claim that these two functions are nonnegative, that gθ depends on θ, and that h does not. The nonnegativity is easy to verify. To verify the dependency requirement, we note that, because T (X) is sufficient, by Definition 7.6 the conditional probability (@) defining h(x) does not depend on θ. Thus, this choice of h(x) and gθ (t) is legitimate. For this choice of gθ and h, we observe that, for a discrete distribution, fθ (x) := Pθ (X = x). Moreover, note that if X = x, then T (X) = T (x).
Therefore, we have
Pθ (X = x) = Pθ (X = x, T (X) = T (x)).
We can rewrite the last probability via a conditional probability; i.e.,
f (x; θ) = Pθ (X = x)
= Pθ (X = x, T (X) = T (x))
= Pθ (T (X) = T (x)) · P (X = x|T (X) = T (x))
= gθ (T (x)) · h(x).
(⇐) Conversely, suppose the factorization (∗) holds. Let q(t; θ) denote the pmf of T (X) and write A_{T (x)} := {y : T (y) = T (x)}. Then, for any x,

f (x; θ)/q(T (x); θ) = gθ (T (x)) h(x) / ∑_{y ∈ A_{T (x)}} gθ (T (y)) h(y) = gθ (T (x)) h(x) / ( gθ (T (x)) ∑_{y ∈ A_{T (x)}} h(y) ) = h(x) / ∑_{y ∈ A_{T (x)}} h(y).

Since the ratio above does not depend on θ, by Theorem 7.2, T (X) is a sufficient statistic for θ.
References
[1] G. Casella and R. L. Berger, Statistical Inference. Cengage Learning, 2001.
[2] R. V. Hogg, J. W. McKean, and A. T. Craig, Introduction to Mathematical Statistics, 8th ed. Pearson, 2019.