Download as pdf or txt
Download as pdf or txt
You are on page 1of 92

50 Pen and Paper Exercises on Statistical Science

(with Solutions)

By Tanujit Chakraborty (Webpage)

For UG and PG students in Mathematics, Statistics, Economics, Data


Science, and Computer Science
Topics to be covered (Exercises and Solutions):

1. Transformation of Random Variables and Sampling Distributions

2. Limit Theorems and Order Statistics

3. Theory of Point Estimation

4. Confidence Intervals

5. Bayesian Inference

6. Testing of Hypothesis

7. Statistical Simulation using R

8. Linear and Nonlinear Regression Analysis

9. Multicollinearity and Shrinkage Methods

10. Logistic Regression

Acknowledgment: These problems and their solutions are largely based on some extraordi-
nary books on Statistics and several outstanding university courses available online, with
some additions by the author. A non-exhaustive list is given below. Happy Learning!

Note: To err is human! There can be typos and unwanted mistakes in the solutions. If
you find one, please email me at ctanujit@gmail.com.

References (Books):

1. John A. Rice. Mathematical Statistics and Data Analysis, 2007.

2. Draper, Norman R., and Harry Smith. Applied Regression Analysis, 1998.

3. Shao, Jun. Mathematical statistics, 2003.

References (Online Resources):

1. STAT 200: Introduction to Statistical Inference at Stanford University.

2. NPTEL Regression Analysis Course at IIT Kharagpur.

3. ISI MSTAT and IITJAM MS Previous Papers.


Statistical Science By Tanujit Chakraborty

Statistical Science - Pen and Paper Exercises (with Solutions)

Problem 1. (Bivariate and Multivariate Distributions)


(a) (Accounting for voter turnout) Let N be the number of people in the state of Iowa. Suppose pN of these
people support Hillary Clinton, and (1 − p)N of them support Donald Trump, for some p ∈ (0, 1). N is known (say
N = 3, 000, 000) and p is unknown.
(i) Suppose that each person in Iowa randomly and independently decides, on election day, whether or not to vote,
with probability 1/2 of voting and probability 1/2 of not voting. Let VHillary be the number of people who
vote for Hillary and VDonald be the number of people who vote for Donald. Show that
1 1
E [VHillary ] = pN, E [VDonald ] = (1 − p)N.
2 2
What are the standard deviations of VHillary and VDonald , in terms of p and N ? Explain why, when N is large,
we expect the fraction of voters who vote for Hillary to be very close to p.

(ii) Now, suppose there are two types of voters - “passive” and “active.” Each passive voter votes on election day
with probability 1/4 and doesn’t vote with probability 3/4, while each active voter votes with probability 3/4
and doesn’t vote with probability 1/4. Suppose that a fraction qH of the people who support Hillary are passive
and 1 − qH are active, and a fraction qD of the people who support Donald are passive and 1 − qD are active.
Show that
1 3 1 3
E [VHillary ] = qH pN + (1 − qH ) pN, E [VDonald ] = qD (1 − p)N + (1 − qD ) (1 − p)N.
4 4 4 4
What are the standard deviations of VHillary and VDonald in terms of p, N, qH , and qD ? If we estimate p by p̂
using a simple random sample of n = 1000 people from Iowa. Explain why p̂ might not be a good estimate of
the fraction of voters who will vote for Hillary.

(iii) We do not know qH and qD . However, suppose that in our simple random sample, we can observe whether
each person is passive or active, in addition to asking them whether they support Hillary or Donald. Suggest
estimators V̂Hillary and V̂Donald for E [VHillary ] and E [VDonald ] using this additional information. Show, for
your estimators, that
h i 1 3 h i 1 3
E V̂Hillary = qH pN + (1 − qH ) pN, E V̂Donald = qD (1 − p)N + (1 − qD ) (1 − p)N.
4 4 4 4

Solution: (i) Recall that a Binomial(n, p) random variable has mean np and variance np(1 − p). A total
 of pN
people support Hillary, each voting independently with probability 21 , so VHillary ∼ Binomial pN, 12 . Then

1 1
E [VHillary ] = pN, Var [VHillary ] = pN,
2 4
q
and the standard deviation of VHillary is 1
4 pN . Similarly, as (1 − p)N people support Donald, VDonald ∼

Binomial (1 − p)N, 12 , so

1 1
E [VDonald ] = (1 − p)N, Var [VDonald ] = (1 − p)N,
2 4
q
and the standard deviation of VDonald is 14 (1 − p)N . The fraction of voters who voted for Hillary is

VHillary VHillary /N
= .
VHillary + VDonald VHillary /N + VDonald /N

Page 1
As E [VHillary /N ] = 21 p (a constant) and Var [VHillary /N ] = 14 p/N → 0 as N → ∞, VHillary /N should be close
to 12 p with high probability when N is large. Similarly, as E [VDonald /N ] = 12 (1 − p) and Var [VDonald /N ] =
4 (1 − p)/N → 0 as N → ∞, VDonald /N should be close to 2 (1 − p) with high probability when N is large.
1 1

Then the fraction of voters for Hillary should, with high probability, be close to
1
2p
= p.
1
2p + 1
2 (1 − p)

(The above statement “close to with high probability” may be formalized using Chebyshev’s inequality, which
states that a random variable is, with high probability, not too many standard deviations away from its mean.)

(ii) Let VH,p and VH,a be the number of passive and active voters who vote for Hillary, and similarly define
VD,p and VD,a for Donald. There are qH pN passive Hillary supporters, each of whom votes independently with
probability 41 , so  
1
VH,p ∼ Binomial qH pN, .
4
Similarly,  
3
VH,a ∼ Binomial (1 − qH ) pN, ,
4
 
1
VD,p ∼ Binomial qD (1 − p)N, ,
4
 
3
VD,a ∼ Binomial (1 − qD ) (1 − p)N, ,
4
and these four random variables are independent. Since VHillary = VH,p + VH,a ,

1 3
E [VHillary ] = E [VH,p ] + E [VH,a ] = qH pN + (1 − qH ) pN,
4 4
3 3 3
Var [VHillary ] = Var [VH,p ] + Var [VH,a ] = qH pN + (1 − qH ) pN = pN,
16 16 16
q
3
and the standard deviation of VHillary is 16 pN . Similarly,

1 3
E [VDonald ] = E [VD,p ] + E [VD,a ] = qD (1 − p)N + (1 − qD ) (1 − p)N,
4 4
3 3
Var [VDonald ] = Var [VD,p ] + Var [VD,a ] = qD (1 − p)N + (1 − qD ) (1 − p)N
16 16
3
= (1 − p)N,
16
q
and the standard deviation of VDonald is 16 3
(1 − p)N .

The quantity p̂ estimates p, but in this case, p may not be the fraction of voters who vote for Hillary: By the
same argument as in part (a), the fraction of voters who vote for Hillary is given by

VHillary VHillary /N
=
VHillary + VDonald VHillary /N + VDonald /N
1
+ 34 (1 − qH ) p
4 qH p
≈ ,
1
4 qH p + 3
4 (1 − qH ) p + 14 qD (1 − p) + 43 (1 − qD ) (1 − p)

where the approximation is accurate with high probability when N is large. When qH 6= qD , this is different
p
from p: For example, if qH = 0 and qD = 1, this is equal to p+(1−p)/3 which is greater than p, reflecting the
fact that Hillary supporters are more likely to vote than are Donald supporters.

(iii) Let p̂ be the proportion of the 1000 surveyed people who support Hillary. Among the surveyed people
supporting Hillary, let q̂H be the proportion who are passive. Similarly, among the surveyed people supporting

Page 2
Donald, let q̂D be the proportion who are passive. (Note that these are observed quantities computed from our
sample of 1000 people.) Then we may estimate the number of voters for Hillary and Donald by
1 3
V̂Hillary = q̂H p̂N + (1 − q̂H ) p̂N
4 4
1 3
V̂Donald = q̂D (1 − p̂)N + (1 − q̂D ) (1 − p̂)N.
4 4
q̂H p̂ is simply the proportion of the 1000 surveyed people who both support Hillary and are passive. Hence,
letting X1 , . . . , X1000 indicate whether each surveyed person both supports Hillary and is passive, we have
1
q̂H p̂ = (X1 + . . . + Xn ) .
n
Each Xi ∼ Bernoulli (qH p), so linearity of expectation implies E [q̂H p̂] = qH p. Similarly, (1 − q̂H ) p̂, q̂D (1 − p̂),
and (1 − q̂D ) (1− p̂) are the proportions of the 1000 surveyed people who support Hillary and are active, support
Donald and are passive, and support Donald and are active, so the same argument shows E [(1 − q̂H ) p̂] =
(1 − qH ) p, E [q̂D (1 − p̂)] = qD (1 − p), and E [(1 − q̂D ) (1 − p̂)] = (1 − qD ) (1 − p). Then applying linearity of
expectation again yields
h i h i
E V̂Hillary = E [VHillary ] , E V̂Donald = E [VDonald ] .

(b) Let X and Y be continuous random variables with joint density function
 y3
−y(x+1)
f (x, y) = 2 e for x > 0, y > 0
0 otherwise.
(i) Find the marginal pdf of X.
(ii) Find the marginal pdf of Y and E[Y ].

Solution: (i) Integrating the joint pdf over y gives the marginal pdf of X. For x > 0, we have
Z ∞ 3 Z ∞
y −y(x+1) Γ(4) (x + 1)4 3 −y(x+1) 3
fX (x) = e dy = 4
y e dy = ,
0 2 2(x + 1) 0 Γ(4) (x + 1)4
as the integrand in the last integral over y is a Gamma(4, x + 1) pdf, and so integrates (over y) to 1.
For x ≤ 0, fX (x) = 0.

(ii) Integrating the joint pdf over x gives the marginal pdf of Y . For y > 0, we have
Z ∞ 3 Z
y −y(x+1) y 2 −y ∞ −yx y2
fY (y) = e dx = e ye dx = e−y
0 2 2 0 2
as the integrand in the last integral over x is an Exponential (y) pdf, and so integrates ( over x ) to 1 . For
y ≤ 0, fY (y) = 0. We can recognize the marginal distribution of Y to be Gamma(3, 1), and so E[Y ] = 3/1 = 3.

(c) (Existence of multivariate normal)


(i) Suppose (X1 , . . . , Xk ) ∼ N (0, Σ) for a covariance matrix Σ ∈ Rk×k . Let Y1 , . . . , Ym be linear combinations of
X1 , . . . , Xk , given by
Yj = aj1 X1 + . . . + ajk Xk
for each j = 1, . . . , m and some constants aj1 , . . . , ajk ∈ R. Consider the matrix
 
a11 a12 · · · a1k
 a21 a22 · · · a2k 
 
A= . .. . ..  .
 . . . . . . 
am1 am2 ··· amk
By computing the means, variances, and covariances of Y1 , . . . , Ym , show that

(Y1 , . . . , Ym ) ∼ N 0, AΣAT .

Page 3
(ii) Let A ∈ Rk×k be any matrix, let Σ = AAT , and let µ ∈ Rk be any vector. Show that there exist random
IID
variables Y1 , . . . , Yk such that (Y1 , . . . , Yk ) ∼ N (µ, Σ). (Hint: Let X1 , . . . , Xk ∼ N (0, 1), and let each Yj be a
certain linear combination of X1 , . . . , Xk plus a certain constant.)

Solution: (i) Any linear combination of Y1 , . . . , Ym is a linear combination of X1 , . . . , Xk , so (Y1 , . . . , Ym ) is


multivariate normal. Using linearity of expectation and bilinearity of covariance, we compute

E [Yi ] = ai1 E [X1 ] + . . . + aik E [Xk ] = 0,


Cov [Yi , Yj ] = Cov [ai1 X1 + . . . + aik Xk , aj1 X1 + . . . + ajk Xk ]
X
k X
k X
k X
k
= air ajs Cov [Xr , Xs ] = air ajs Σrs = ai ΣaTj ,
r=1 s=1 r=1 s=1

where ai and aj denote rows i and j of the matrix A. (The computation of covariance  is valid for both i 6= j and
i = j; in the latter case this yields Var [Yi ] = Cov [Yi , Yi ].) As ai ΣaTj = AΣAT ij , this implies by definition

Y ∼ N 0, AΣAT .

iid
(ii) Take X1 , . . . , Xk ∼ N (0, 1), define Zj = aj1 X1 + . . . + ajk Xk for each j = 1, . . . , k, and let Yj = Zj + µj .
As (X1 , . . . , Xk ) have the multivariate normal distribution N (0, I) where I is the k × k identity matrix, and
as AIAT = AAT = Σ, part (a) implies (Z1 , . . . , Zk ) ∼ N (0, Σ). Then (Y1 , . . . , Yk ) ∼ N (µ, Σ) (since adding the
vector µ = (µ1 , . . . , µk ) does not change the variances and covariances of Y1 , . . . , Yk but shifts their means by
µ1 , . . . , µk ).


(d) (i) Let (X, Y ) be a random point uniformly distributed on the unit disk (x, y) : x2 + y 2 ≤ 1 . Show that
Cov[X, Y ] = 0, but that X and Y are not independent.
(ii) Let X, Y ∼ N (0, 1) be independent. Compute P[X + Y > 0 | X > 0]. (Hint: Use the rotational symmetry of
the bivariate normal PDF.)

Solution: (i) X has the same distribution as −X, so E[X] = E[−X] = −E[X], hence E[X] = 0. Simi-
larly E[Y ] = 0. Also (X, Y ) has the same joint distribution as (−X, Y ), so E[XY ] = E[−XY ] = −E[XY ],
hence E[XY ] = 0. Then Cov[X, Y ] = E[XY
√ ] − E[X]E[Y
√ ] = 0. On the other hand, conditional on X = x, Y is
uniformly distributed on the interval − 1 − x2 , 1 − x2 . As this depends on x, X and Y are not independent.

(ii) We have
P[X + Y > 0, X > 0]
P[X + Y > 0 | X > 0] = .
P[X > 0]
Since X has the same distribution as −X, P[X > 0] = P[−X > 0] = P[X < 0], so P[X > 0] = 21 . To compute
P[X + Y > 0, X > 0], note that this is the integral of the bivariate normal PDF fX,Y (x, y) in the region to the
right of the y-axis and above the line y = −x. The integral of fX,Y (x, y) over all of R2 must equal 1; hence by
rotational symmetry of fX,Y (x, y) around the origin, the integral over any wedge formed by two rays extending
from the origin is θ/(2π) where θ is the angle formed by these rays. For the above region, this angle is 3π/4,
so P[X + Y > 0, X > 0] = 3/8. Then P[X + Y > 0 | X > 0] = 3/4.

(e) (i) A family of Yellow-Faced (YF) gophers consisting of 2 parents and 3 children are kept in a laboratory. In
addition to these, a family of YF gophers with 2 parents and 4 children, a family of Big Pocket (BP) gophers
with 2 parents and 5 children, and a family of BP gophers with 1 mother and 4 children are also kept in the
laboratory. A sample of 4 gophers is selected randomly from all the gophers in the laboratory. What is the
probability that the sample consists of one adult female, one adult male, and 2 children, with both adults of
the same genus (either both YF or both BP)?
(ii) Let X and Y be arbitrary random variables, and let g(·) and h(·) be arbitrary real valued functions defined on
R. For each statement, say whether it is TRUE or FALSE. If TRUE, prove it, and if FALSE, give a counterex-
ample.
(A) If X and Y are uncorrelated then so are g(X) and h(Y ) for any g and h.
(B) If g(X) and h(Y ) are uncorrelated for all g and h then X and Y are uncorrelated.

Page 4
Solution: (i) Let X1 be the number of male YF gophers, X2 the number of female YF gophers, X3 the number
of male BP gophers, X4 the number of female BP gophers, and X5 the number of child gophers in the sample.
The sample size is n = 4, and the total number of gophers in the laboratory is 23. The total numbers of
YF male, YF female, BP male, BP female, and children gophers are 2, 2, 1, 2 and 16 respectively. The joint
T
distribution of (X1 , X2 , X3 , X4 , X5 ) is Multivariate Hypergeometric and the desired probability is

P (X1 = 1, X2 = 1, X3 = 0, X4 = 0, X5 = 2) + P (X1 = 0, X2 = 0, X3 = 1, X4 = 1, X5 = 2)
           
2 2 1 2 16 2 2 1 2 16
1 1 0 0 2 0 0 1 1 2
=   +  
23 23
4 4
2 × 2 × 1 × 1 × 120 1 × 1 × 1 × 2 × 120 720
= + = ≈ 0.0813.
8855 8855 8855

 
(ii) (A) FALSE. Take X to have a N (0, 1) distribution. Then
 E[X] = 0 and E X 3 = 0, and take Y = X 2 .
Then Cov(X, Y ) = Cov X, X 2  = E X 3 − E[X]E X 2 = 0. Take  g(X) = X 2 and h(Y ) = Y . Then
Cov(g(X), h(Y )) = Cov X 2 , X 2 = Var X 2 = 2 > 0, where Var X 2 is the variance of a χ21 distribution.

(B) TRUE. Let A and B be arbitrary subsets of R, and take g(X) = IA (X) and h(Y ) = IB (Y ). Since
Cov(g(X), h(Y )) = 0 we have E [IA (X)IB (Y )] = E [IA (X)] E [IB (Y )], which is equivalent to P (X ∈ A, Y ∈
B) = P (X ∈ A)P (Y ∈ B). Since A and B were arbitrary this implies that X and Y are independent. But
then X and Y are uncorrelated.

Problem 2. (Transformation of Random Variables)


(a) Let X1 , X2 , X3 be independent N (0, 1) random variables. Find the probability density function of U = X12 +X22 +X32 .
Solution: We know from class that Xi2 has the χ21 distribution, i.e., the Gamma(1/2, 1/2) distribution. Since
X1 , X2 , X3 are independent, so are X12 , X22 , X32 , so U = X12 + X22 + X32 is a sum of 3 independent Gamma(1/2, 1/2)
random variables, and as such has the Gamma(3/2, 1/2) distribution (or χ23 distribution). Thus
(
(1/2)3/2 1/2 −(1/2)u
Γ(3/2) u e if x > 0
fU (u) =
0 otherwise.

π
The expression can be simplified by noting that Γ(3/2) = (1/2)Γ(1/2) = 2 .

(b) Suppose that the random vector Y = (Y1 , Y2 , Y3 ) is uniformly distributed on the sphere of radius 1 centered at the
origin; that is, Y has joint probability density function (pdf)
 3
4π if (y1 , y2 , y3 ) ∈ S
fY (y1 , y2 , y3 ) =
0 otherwise,

where S = (y1 , y2 , y3 ) ∈ R3 : y12 + y22 + y32 ≤ 1 is the sphere of radius 1 centred at (0, 0, 0). Let V = Y12 + Y22 + Y32 .
Find the probability density function of V and find E[V ].
Solution: We can find the pdf of V by first computing the CDF of V . For v ∈ [0, 1] we have
√  
FV (v) = P (V ≤ v) = P Y12 + Y22 + Y32 ≤ ( v)2 = P (Y1 , Y2 , Y3 ) ∈ S√v ,

where S√v is the sphere of radius v. Thus, for v ∈ [0, 1]
ZZZ ZZZ
3
FV (v) = fY (y1 , y2 , y3 ) dy1 dy2 dy3 = dy1 dy2 dy3
S √v S√v 4π
3
= × Volume of S√v
4π √
3 4π( v)3
= × = v 3/2 .
4π 3

Page 5
We also have FV (v) = 0 for v < 0 and FV (v) = 1 for v > 1. Differentiating, we obtain the pdf of V as
( √
3 v
′ 2 for 0 ≤ v ≤ 1
fV (v) = FV (v) =
0 otherwise.

Then Z  1
1
3 3/2 3 2 5/2 3
E[V ] = v dv = v = .
2 0 2 5 0 5

(c) Let X and Y be independent and identically distributed uniform random variables on the interval (0, 1). Define
U = 12 (X + Y ) to be the average and define V = X.
(i) Find the joint probability density function of (U, V ) and draw the sample space of (U, V ).
(ii) Find the marginal probability density function of U .

Solution: (i) The inverse transformation is X = V and Y = 2U − V . The matrix of partial derivatives of the
inverse transformation is  
0 1
2 −1
with determinant −2. The joint pdf of (X, Y ) is
(
1 for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
fXY (x, y) =
0 otherwise

So the joint pdf of (U, V ) is (


2 for (u, v) ∈ SU V
fU V (u, v) =
0 otherwise
where SU V is the sample space of (U, V ), determined by the constraints 0 ≤ v ≤ 1 and 0 ≤ 2u − v ≤ 1, or
0 ≤ v ≤ 1 and v/2 ≤ u ≤ (v + 1)/2. The sample space of (U, V ) is plotted in Fig. 1.

Figure 1: Sample space of (U, V ).

(ii) To get the marginal pdf of U we integrate the joint pdf fU V (u, v) over the variable v from −∞ to ∞. From
Figure 1, we see that the limits of integration are different depending on whether u ∈ [0, 0.5] or u ∈ (0.5, 1].
For u ∈ [0, 0.5] we have
Z 2u
fU (u) = (2)dv = 4u.
0
For u ∈ (0.5, 1] we have
Z 1
fU (u) = (2)dv = 2(1 − (2u − 1)) = 4(1 − u)
2u−1

Page 6
Clearly, for u ∈
/ [0, 1] we have fU (u) = 0. To summarize,

 4u for u ∈ [0, 0.5]
fU (u) = 4(1 − u) for u ∈ (0.5, 1]

0 otherwise.
T
(d) Let X = (X1 , X2 ) be uniformly distributed on the positive quadrant intersected with the disk of radius 1 centered
at the origin; i.e., X has joint pdf fX (x1 , x2 ) = π4 ISX (x1 , x2 ), where

SX = (x1 , x2 ) ∈ R2 : x1 ≥ 0, x2 ≥ 0, x21 + x22 ≤ 1 .
T
Let Y1 = X12 and Y2 = X22 . Find the joint pdf of (Y1 , Y2 ) and the marginal pdf of Y1 .

Solution: We use the bivariate change of variable formula to get the joint pdf of (Y1 , Y2 ). The inverse transformation
1/2 1/2
is X1 = Y1 and X2 = Y2 . The Jacobian is
" 1 #
√ 0 1
2 y1
J = det 1 = √
0 √ 4 y1 y 2
2 y2

The support of (Y1 , Y2 ) is 


SY = (y1 , y2 ) ∈ R2 : y1 ≥ 0, y2 ≥ 0, y1 + y2 ≤ 1
By the change of variable formula, the joint pdf of (Y1 , Y2 ) is given by
 
1 1 4 1 1
fY (y1 , y2 ) = fX √ , √ |J| = √ IS (y1 , y2 ) = √ IS (y1 , y2 ) .
2 y 1 2 y2 π 4 y1 y2 Y π y 1 y2 Y
The marginal pdf of Y1 is obtained by integration. For y1 ∈ [0, 1],
Z ∞ Z 1−y1
1
fY1 (y1 ) = fY (y1 , y2 ) dy2 = √ dy2
−∞ 0 π y1 y2
1 √ 1−y
= √ [2 y2 ]0 1
π y1

2 1 − y1
= √ ,
π y1
and fY1 (y1 ) = 0 otherwise.

(e) Let X1 , X2 , X3 be independent and identically distributed Exponential (λ) random variables (the Exponential (λ)
distribution has pdf fX (x) = λe−λx for x > 0 and fX(x) = 0 for x ≤ 0, and df FX (x) = 1 − e−λx for x > 0
and FX (x) = 0 for x ≤ 0). Find P X1 + X2 + X3 ≤ 23 . (Write down the appropriate 3-dimensional integral and
evaluate it).

Solution: The appropriate 3-dimensional integral is (by symmetry, every order in which the integration is done is
equally easy)
  Z 3/2 Z 3/2−x3 Z 3/2−x2 −x3
3
P X1 + X2 + X3 ≤ = fX (x1 ) fX (x2 ) fX (x3 ) dx1 dx2 dx3
2 0 0 0
Z 3/2 Z 3/2−x3  
= fX (x2 ) fX (x3 ) 1 − e−λ(3/2−x2 −x3 ) dx2 dx3
0 0
Z 3/2 Z 3/2−x3 h i
= λe−λx3 λe−λx2 − λe−λ(3/2−x3 ) dx2 dx3
0 0
Z 3/2    
3
= λe−λx3 1 − e−λ(3/2−x3 ) − λ − x3 e−λ(3/2−x3 ) dx3
0 2
 2
3λ −3λ/2
−3λ/2 3λ (3/2)2
=1 − e − e − e−3λ/2 + λ2 e−3λ/2
2 2 2
"  2 #
3λ 1 3λ
=1 − e−3λ/2 1 + + .
2 2 2

Page 7
Problem 3. (Moment-generating function)
(a) Let X ∼ Binomial(n, p). Find the moment-generating function of X in terms of n and p. (Hint: X is the sum of n
independent Bernoulli random variables.)

Solution: X is the sum of n independent Bernoulli random variables X1 , . . . , Xn , each with moment generating
function
MXi (t) = E exp (tXi ) = pet + (1 − p).
Combining these and applying independence yields
Y
n
n
MX (t) = E exp(tX) = E exp (t (X1 + . . . + Xn )) = E exp (tXi ) = 1 − p + pet .
i=1

(b) Let X1 , X2 , . . . be independent and identically distributed random variables, each with a Poisson distribution with
mean 1. Let Sn = X1 + . . . + Xn for n ≥ 1 and let Mn (t) be the moment generating function of Sn .
(i) Find the smallest n such that P (Mn (Sn ) > 1) ≥ 0.99 using the exact probability.
(ii) Find the smallest n such that P (Mn (Sn ) > 1) ≥ 0.99 using the central limit theorem.

Solution: (i) First, the common moment generating function of the Xi is



X ∞
X (et )
  1 −1
k
e = e−1 = e−1 ee = ee −1 .
t t
tXi tk
M (t) = E e = e
k! k!
k=0 k=0

−1)
and Mn (Sn ) = en(e −1) . So,
t Sn
Then Mn (t) = M (t)n = en(e
    
P (Mn (Sn ) > 1) = P en(e −1) > 1 = P n eSn − 1 > 0 = P eSn > 1 = P (Sn > 0) .
Sn

The exact distribution of Sn is Poisson(n), so P (Sn > 0) = 1 − P (Sn = 0) = 1 − e−n . Setting this to .99 gives
e−n = .01, or n = ln 100 = 4.605. So n = 5 is the smallest n.

(ii) By the central limit theorem, the distribution of (Sn − n) / n is approximated by a N (0, 1) distribution,
so that (from part i),
√ √  √
P (Mn (Sn ) > 1) = P (Sn > 0) = P (Sn − n) / n > − n ≈ 1 − Φ(− n),

where
√ Φ(·) is the standard normal cdf. Setting this to 0.99 gives Φ(− n) = 0.01, or (from the table)
− n = −2.33, or n = 5.429. So n = 6 is the smallest n according to this approximation.

(c) Let X have a Gamma (3, 3) distribution. Conditional on X = x let Z have a normal distribution with mean x and
variance 2. Finally, let Y = eZ . Find E[Y ] and Var(Y ).

Solution: Conditioned on X = x, E[Y ] is the moment generating function of a N (x, 2) distribution evaluated
2
at 1 . Letting Mx (·) denote this mgf,
 we have Mx (t) = ext+t , and so E[Y | X = x] = Mx (1) = ex+1 . Similarly,

E Y 2 | X = x = E  e2Z | X = x
 = Mx (2) = e2x+4 . Then by the law of total expectation, E[Y ] = E eX+1 =
eMX (1) and E Y 2 = E e2X+4 = e4 MX (2), where MX (·) is the moment generating function of X. Since X has
 3
3
a Gamma (3, 3) distribution we have MX (t) = 3−t , and so
 3
3 27e  
E[Y ] = e = ≈ 9.174 and E Y 2 = 27e4 .
2 8

Then  2
  27
Var(Y ) = E Y 2
− E[Y ] = 27e −
2 4
e2 ≈ 1390.
8

Page 8
(d) Let X be a random variable with Exponential(λ) distribution. Recall that the moment generating function of X is
λ
MX (t) = λ−t for t < λ.

(i) Find E [X n ], where n is any positive integer. You may use the mgf or compute E [X n ] directly.
(ii) Find MY (t), the mgf of Y = ln X, for t > −1.

Solution: (i) Using the mgf, we write the mgf as


∞  n
X
1 t
MX (t) = =
1 − t/λ n=0 λ
P∞ n
which is valid for |t| < λ. Comparing to MX (t) = n=0 E[X ] n
n! t (which is the MacLaurin series expansion of
MX (t) when it is valid (it is valid for all t < λ) ), we get that E [X n ] = λn!n . Computing E [X n ] directly, we
have Z ∞ Z
Γ(n + 1) ∞ λn+1 n!
E [X n ] = xn λe−λx dx = λ n+1 x(n+1)−1 e−λx dx = n ,
0 λ 0 Γ(n + 1) λ
since the integrand in the last integral is a Gamma(n + 1, λ) density and so the integral is 1, and Γ(n + 1) = n!.
   
(ii) Write the mgf of Y as MY (t) = E etY = E et ln X = E [X t ]. Computing E [X t ] directly, we get an
integral similar to the one in part(a) with n replaced by t. For t > −1 we get a proper Gamma(t + 1, λ) density
in the final integral and so we get
  Γ(t + 1)
MY (t) = E X t = .
λt

1
Pn
(e) Let X1 , . . . , Xn be independent Poisson(1) random variables and let X̄ = n i=1 Xi . Let k > 1 be given. Show that
 n
ek−1
P (X̄ ≥ k) ≤ .
kk

Solution: We find the mgf of X̄ and then apply Chernoff’s bound. The mgf of each Xi is

  X 1
etj e−1 = e−1 ee = ee −1 .
t t
MXi (t) = E etXi =
j=0
j!

Then the mgf of X̄ is given by


h i  n
t
= en(e −1) .
t/n
tX̄
MX (t) = E e = MX i
n

Chernoff’s bound gives


en(e −1)
t/n

= en(e −1)−tk .
t/n
P (X̄ > k) ≤ min (1)
t>0 etk
Differentiating the exponent with respect to t gives et/n − k. Setting this to 0 and solving for t gives t = n ln k.
Plugging this into the RHS of Eqn. (1) then gives
 n
n(k−1)−ln kkn en(k−1) ek−1
P (X̄ > k) ≤ e = = ,
k kn kk

as desired.

Page 9
Problem 4. (Order Statistics)
(a) Let X1 , . . . , Xn be a sequence of independent random variables uniformly distributed on the interval (0, 1), and let
X(1) , . . . , X(n) denote their order statistics. For fixed k, let gn (x) denote the probability density function of nX(k) .
Find gn (x) and show that (
xk−1 −x
(k−1)! e for x ≥ 0
lim gn (x) =
n→∞ 0 for x < 0
which is the Gamma(k, 1) density.

Solution: The pdf of X(k) is


(
n−k
n! k−1
(k−1)!(n−k)! xk (1 − xk ) for 0 < xk < 1
fk (xk ) =
0 otherwise.
Let X = nX(k) . Then the set of possible values of X (the sample space of X) is {x : 0 < x < n}. So for x ∈ (0, n),
by the (1-dimensional) change of variable formula the pdf of X is given by
x 1 (n − 1)!  x k−1  x n−k
gn (x) = fk = 1−
n n (k − 1)!(n − k)! n n
and gn (x) = 0 for x ∈
/ (0, n). Now fix x > 0 and let n > x. Then
(n − 1)!  x k−1  x n−k
gn (x) = 1−
(k − 1)!(n − k)! n n
(n − 1) × . . . × (n − k + 1)  x −k 1  x n
= 1 − x k−1
1 −
nk−1 n (k − 1)! n
1
→ xk−1 e−x ,
(k − 1)!
as desired, since    
(n − 1) × . . . × (n − k + 1) n−1 n−k+1
= × ... ×
nk−1 n n
→ 1 × . . . × 1 = 1,
(1 − x/n) clearly converges to 1 as n → ∞ (for fixed k ), and (1 − x/n)n → e−x as n → ∞ (from calculus). Since
−k

gn (x) = 0 for x < 0 it is obvious that gn (x) → 0 as n → ∞ if x < 0.

(b) (i) Let X1 , . . . , X6 be independent random variables uniformly distributed on the interval (0, 1). Find the pdf of
U = min {max (X1 , X2 ) , max (X3 , X4 ) , max (X5 , X6 )}.
(ii) Let X1 , . . . , Xn be independent and identically distributed random variables, each with a Uniform  distribution
 
on the interval (0, 1). Let X = min (X1 , . . . , Xn ) and Y = max (X1 , . . . , Xn ). Find P X < 12 < Y and E X 3 .

Solution: (i) Let Y1 = max (X1 , X2 ) , Y2 = max (X3 , X4 ), and Y3 = max (X5 , X6 ). Then Y1 , Y2 , Y3 are
independent and identically distributed random variables, each with cdf


y
2
for 0 < y < 1
2
FY (y) = P (X1 ≤ y, X2 ≤ y) = [P (X1 ≤ y)] = 0 for y ≤ 0


1 for y ≥ 1
and pdf 
2y for 0 < y < 1
fY (y) =
0 otherwise.
Then U = min (Y1 , Y2 , Y3 ) has cdf
FU (u) = 1 − P (min (Y1 , Y2 , Y3 ) > u)
= 1 − P (Y1 > u, Y2 > u, Y3 > u)
3
= 1 − (1 − FY (u))
 3
 1 − 1 − u2 for 0 < u < 1
= 0 for u ≤ 0

1 for u ≥ 1.

Page 10
By differentiation, then, U has pdf given by
 2
6u 1 − u2 for 0 < u < 1
fU (u) =
0 otherwise.

(ii) The joint pdf of (X, Y ) is



n(n − 1)(y − x)n−2 for 0 < x < y < 1
f (x, y) =
0 otherwise.
Therefore,
  Z 1 Z 1/2
1
P X< <Y = n(n − 1)(y − x)n−2 dxdy
2 1/2 0
Z 1 h i
1/2
= n − (y − x)n−1 0
dy
1/2
Z 1 n−1 Z 1 
1
= ny dy − n y− n−1
dy
1/2 1/2 2
 n 1
n 1 1 1 1 1
= y |1/2 − y − = 1 − n − n = 1 − n−1 .
2 1/2 2 2 2
The marginal pdf of X is 
n(1 − x)n−1 for 0 < x < 1
fX (x) =
0 otherwise.
Therefore, Z Z
  1 1
E X 3
= x n(1 − x)
3 n−1
dx = n x3 (1 − x)n−1 dx
0 0
The integral on the right is B(4, n), where B(·, ·) is the beta function. So
  Γ(4)Γ(n) 6n 6
E X 3 = nB(4, n) = n = = .
Γ(n + 4) (n + 3)(n + 2)(n + 1)n (n + 3)(n + 2)(n + 1)

(c) Let X1 , . . . , Xn , Y1 , . . . , Yn be independent Uniform (0, 1) random variables. We form n rectangles, where the i-th
rectangle has adjacent sides of length Xi and Yi , for i = 1, . . . , n. Let Ai be the area of the i th rectangle, i = 1, . . . , n,
and define Amax = max (A1 , . . . , An ). Find the pdf of Amax .

Solution: First note that Ai = Xi Yi for i = 1, . . . , n, and A1 , . . . , An are independent since the vectors
(X1 , Y1 ) , . . . , (Xn , Yn ) are independent. Also, the Ai are identically distributed, each with support (0, 1). We first
find P (Ai ≤ a) for any a ∈ (0, 1), which is P (Xi Yi ≤ a). If Yi ≤ a then the limits for Xi are from 0 to 1. If Yi > a
then the limits for Xi are from 0 to a/Yi . The joint pdf of (Xi , Yi ) is f (x, y) = 1 for 0 < x, y < 1. Thus we have
Z aZ 1 Z 1 Z a/y
P (Ai ≤ a) = P (Xi Yi ≤ a) = dxdy + dxdy
0 0 a 0
Z 1
1
=a+a dy = a + a(ln 1 − ln a) = a(1 − ln a).
a y
Next, we get the cdf of Amax as
n
F (a) = P (Amax ≤ a) = P (A1 ≤ a, . . . , An ≤ a) = P (A1 ≤ a) = (a(1 − ln a))n ,

for 0 < a < 1. Finally, we obtain the pdf f of Amax by differentiation:

f (a) = F ′ (a) = n(a(1 − ln a))n−1 [1 − ln a − 1] = −n ln a(a(1 − ln a))n−1 ,

which is valid for a ∈ (0, 1), and f (a) = 0 for a ≤ 0 or a ≥ 0.

Page 11
 
(d) Let X1 , . . . , Xn be independent and identically distributed Exponential (λ) random variables. Compute E X(1) e−λX(2) ,
where X(1) and X(2) are the first and second order statistics of X1 , . . . , Xn .

Solution: The joint pdf of X(1) and X(2) is


 n−2
n(n − 1)λ2 e−λx1 e−λx2 e−λx2 for 0 < x1 < x2 < ∞
f12 (x1 , x2 ) =
0 otherwise

Then Z ∞Z ∞
 
E X(1) e−λX(2) = n(n − 1)λ2 x1 e−λx2 e−λx1 e−λ(n−1)x2 dx2 dx1
0 x1
Z ∞
= (n − 1)λ x1 e−λx1 e−λnx1 dx1
Z 0
n−1 ∞
= x1 λ(n + 1)e−λ(n+1)x1 dx1
n+1 0
n−1
=
λ(n + 1)2
since the final integral is the mean of an Exponential(λ(n + 1)) distribution.

(e) Let X1 , X2 , X3 be independent Uniform (0, 1) random variables, and let X(1) , X(2) , X(3) denote their order statistics.
Let X be the area of the square with side length X(2) and let Y be the area of the rectangle with side lengths X(1)
and X(3) . Find P (X > Y ), E[X], and E[Y ].
  
2
Solution: We wish to compute P X(2) > X(1) X(3) . The joint pdf of X(1) , X(2) , X(3) is f123 (x1 , x2 , x3 ) = 6
for 0 < x1 < x2 < x3 < 1 and equals 0 otherwise. The solution is computed by integrating f123 (x1 , x2 , x3 ) over
the region A = (x1 , x2 , x2 ) : x22 − x1 x3 > 0 . If we set the inner integral over x2 and note that for given values of

0 < x1 < x3 < 1, the possible values of x2 are x1 x3 < x2 < x3 . The integral can be written as
ZZZ Z 1 Z x3 Z x3
f123 (x1 , x2 , x3 ) dx1 dx2 dx3 = 6 √
dx2 dx1 dx3
A 0 0 x1 x3
 Z 1 Z x3 
1/2 1/2
x3 − x1 x3 =6
dx1 dx3
0 0
Z 1 
1/2 2 3/2
=6 x23 − x3 x3 dx3
0 3
Z 1
2
=2 x23 dx3 = .
0 3

The marginal pdf of X(2) and the joint pdf of X(1) , X(3) are given by

6x2 (1 − x2 ) 0 < x2 < 1
f2 (x2 ) =
0 otherwise

and 
6 (x3 − x1 ) 0 < x1 < x3 < 1
f13 (x1 , x3 ) =
0 otherwise
respectively. Then
h i Z 1
6 6 3
E[X] = E 2
X(2) = x22 6x2 (1 − x2 ) dx2 = − =
0 4 5 10
Z Z Z  
  1 x3 1
x43 x4 1
E[Y ] = E X(1) X(3) = x1 x3 6 (x3 − x1 ) dx1 dx3 = 6 − 3 dx3 = .
0 0 0 2 3 5

Page 12
Problem 5. (Convergence of Random Variables and Limit Theorems)
(a) Let X and X1 , X2 , . . . be random variables each having a N (0, 1) distribution. Suppose (Xn , X) has a bivariate
normal distribution for each n and the correlation between Xn and X is ρ (Xn , X) = ρn , for n ≥ 1.
(i) Show that Xn converges to X in distribution as n → ∞ (for arbitrary correlations ρn ).
(ii) If ρn → 1 as n → ∞, show that Xn converges to X in probability as n → ∞.
(iii) Show that if ρn = 1 − an for some constant a ∈ (0, 1), then Xn converges to X with probability 1 as n → ∞.
Do you get convergence with probability 1 if a = 0? If a = 1 ? Prove your answers.

Solution: (i) If Φ is the standard normal cdf and Fn is the cdf of Xn for n ≥ 1, then Fn (x) = Φ(x) for all
x (this is given). Therefore, Fn (x) → Φ(x) as n → ∞ (trivially). It is also given that Φ(x) is the cdf of X.
Therefore, Xn converges to X in distribution.

(ii) Let ϵ > 0 be given. Both Xn and X are zero mean, so Xn − X also has zero mean. So by Chebyshev’e
inequality,
Var (Xn − X)
P (|Xn − X| > ϵ) ≤ . (2)
ϵ2
But
Var (Xn − X) = Var (Xn ) + Var(X) − 2 Cov (Xn , X) = 1 + 1 − 2ρn = 2 (1 − ρn ) . (3)
So if ρn → 1 as n → ∞ then P (|Xn − X| > ϵ) → 0 as n → ∞ for any ϵ > 0. Thus, Xn converges to X in
probability.

(iii) Plugging ρn = 1 − an into (3) we have Var (Xn − X) = 2an . Plugging this into Chebyshev’s inequality in
(2), we have
2an
P (|Xn − X| > ϵ) ≤ 2 .
ϵ
If a ∈ (0, 1), the sum on the right converges, and so Xn converges to X with probability 1 by the sufficient
condition from class. If a = 0, the sum is again convergent (it is equal to 0), and so Xn converges to X with
probability 1. If a = 1 then ρn = 0 so that Xn is independent of X for any n (since (Xn√ , X) is bivariate
normal). In this case Xn − X has a N (0, 2) distribution and so P (|Xn − X| > ϵ) = 2(1 − Φ(ϵ/ 2)), where Φ is
the standard normal cdf. This is some positive constant for every n and so P (|Xn − X| > ϵ) does not converge
to 0 as n → ∞. Therefore, Xn does not converge to X in probability. This then implies that Xn does not
converge to X with probability 1.

(b) Suppose 80 points are uniformly distributed in the ball in R3 centered at the origin with radius 1. Let Di be the
X
80
1
Euclidean distance from the origin of the i th point, i = 1, . . . , 80, and let D̄ = 80 Di . Use the central limit
i=1
theorem to find a value c satisfying P (|D̄ − E[D̄]| ≤ c) = 0.95.

Solution: We first find the distribution of Di . For x ∈ (0, 1), Di ≤ x if and only if the i-th point in the ball
3
is in the ball in R3 centred at the origin of radius x. That is, P (Di ≤ x) = 4πx /3 3
4π/3 = x . From this we have that
the pdf of Di is 
3x2 for x ∈ [0, 1]
fD (x) =
0 otherwise.
The mean, second moment, and variance are then easily computed to be
3   3 3
E [Di ] = , E Di2 = , Var (Di ) = .
4 5 80
By the central limit theorem
√ √ !
80|D̄ − 3/4| c 80
P (|D̄ − 3/4| ≤ c) = P p ≤p
3/80 3/80
√ !
c 80
≈P |Z| ≤ p ,
3/80

Page 13

where Z has a N (0, 1) distribution. In order for this probability to be equal to 0.95 we need √
c 80
equal to 1.96.
3/80
Solving for c gives √
1.96 3
c= ≈ 0.0424.
80
(c) (i) Suppose that {X
h n } is a sequence
i of zero-mean random variables and X is a zero mean random variable, and
2
suppose that E (Xn − X) ≤ C/np for every n, for some constants C and p > 1. Show that Xn → X almost
surely.
(ii) Suppose that {Xn } is a sequence of non-negative random variables. Show that E [Xn ] → 0 as n → ∞ implies
that Xn → 0 in probability, but that the converse is false in general.

Solution: (i) Let ϵ > 0 be given. Since E [Xn − X] = 0 we have by Chebyshev’s inequality that
h i
2
Var (Xn − X) E (Xn − X) C
P (|Xn − X| > ϵ) ≤ 2
= 2
≤ 2 p.
ϵ ϵ ϵ n
Therefore,

X X∞
C
P (|Xn − X| > ϵ) ≤ <∞
n=1 n=1
ϵ np
2

if p > 1. By a sufficient condition from class, this implies that Xn → X almost surely.

(ii) Suppose that E [Xn ] → 0. Let ϵ > 0 be given. Then by Markov’s inequality

E [Xn ]
P (|Xn − 0| > ϵ) ≤ →0
ϵ
as n → ∞. Hence, Xn → 0 in probability. To prove that the reverse implication is false in general, we give a
counterexample. Let the distribution of Xn be given by
(
0 with probability 1 − n1
Xn =
n with probability n1 .

 given P (|Xn − 0| > ϵ) = P (Xn = n) = n → 0 as n → ∞, so Xn → 0 in probability. However,


1
Then if ϵ > 0 is
1
E [Xn ] = n n = 1 for all n, which does not converge to 0.

(d) Suppose that {Xn } and {Yn } are sequences of random variables and X and Y are random variables such that
Xn → X in distribution and Yn → Y in distribution. Give an example where it is not true that Xn + Yn converges
to X + Y in distribution. [Hint: Consider Y = −X]

Solution: Let W and Z be independent N (0, 1) random variables. Let Xn = W for all n and Yn = Z for all n. Let
X = W and Y = −W . It is easy to see that X and Y are both N (0, 1), and it is clear that Xn → X in distribution
and Yn → Y in distribution. But X + Y = 0 whereas Xn + Yn is distributed as N (0, 2) for all n. Therefore, it is not
true that Xn + Yn converges to X + Y in distribution.

(e) Consider the probability !


1X
n
1 1
P F (Xi ) − ≥√ .
n i=1 2 3n
Use Chebyshev’s inequality to bound this probability. Use the central limit theorem to approximate this probability
for large n.
Pn
Solution: The random variable n1 i=1 F (Xi ) has mean 12 and variance 12n
1
. Applying Chebyshev’s inequality, we
have ! Pn 
1X
n
1 1 Var n1 i=1 F (Xi ) 1/12n 1
P F (Xi ) − ≥√ ≤ √ = = .
n i=1 2 3n (1/ 3n)2 1/3n 4

Page 14
1
∑n
By the central limit theorem, Zn = n √ F (Xi )−1/2 has approximately a N (0, 1) distribution for large n. Therefore,
i=1
1/12n
! √ !
1X 1X 1 √
n n
1 1 12n
P F (Xi ) − ≥√ =P F (Xi ) − 12n ≥ √
n i=1 2 3n n i=1 2 3n
= P (|Zn | ≥ 2) ≈ 0.0456.

Problem 6. (Sampling Distributions)

(a) Let T ∼ t1 (the t distribution with 1 degree of freedom). Explain why T has the same distribution as X
|Y | where
IID
X, Y ∼ N (0, 1), and hence why T also has the same distribution as X
Y .
IID
Y when X, Y ∼ N (0, 1) is also called the Cauchy distribution (the t distribution with
[Hints: The distribution of X
1 degree of freedom). You may check that it has a PDF:
1 1
f (x) = . .
π 1 + x2
You may use this result without proof in part (b).]

Solution: By definition of the t distribution, T has the same distribution as √XU , where X ∼ N (0, 1), U ∼ χ21 ,
and X and U are independent. By definition of the χ2 distribution, U has the same distribution as Y 2 where
Y ∼ N (0, 1). Therefore T has the same distribution as √XY 2 = |Y
X X X
| . To show |Y | has the same distribution as Y :
Suppose, for any t ∈ R,
 
X
P ≤ t = P[X ≤ t|Y |] = P[X ≤ tY, Y > 0] + P[X ≤ −tY, Y < 0],
|Y |
and
 
X
P ≤ t = P[X ≤ tY, Y > 0] + P[X ≥ tY, Y < 0] = P[X ≤ tY, Y > 0] + P[−X ≤ −tY, Y < 0].
Y
Since (X, Y ) has the same distribution as (−X, Y ), the above implies
   
X X
P ≤t =P ≤t .
|Y | Y
X X
So the CDFs of |Y | and Y are the same.

 
(b) t1 is an example of an extremely “heavy-tailed” distribution: For T ∼ t1 , show that E[|T |] = ∞ and E T 2 = ∞.
IID
If T1 , . . . , Tn ∼ t1 , explain why the Law of Large Numbers and the Central Limit Theorem do not apply to the
sample mean n1 (T1 + . . . + Tn ).

Solution: Noting f (x) = f (−x), we compute


Z ∞ Z ∞
1 1
E[|T |] = |x|f (x)dx = 2 x dx
−∞ 0 π x2 + 1
Z ∞ ∞
1 1  1 
= 2
d x2 + 1 = ln x2 + 1 = ∞.
π 0 x +1 π 0

Also, Z Z Z
  ∞ ∞ ∞
1 1 2 x2
E T2 = x2 f (x)dx = 2 x2 dx = dx,
−∞ 0 π x2 + 1 π 0 x2+1
x2
which equals ∞ since → 1 as x → ∞. So, T does not have a well-defined (finite) mean or variance, and the
x2 +1
LLN and CLT both do not apply. (In fact, it may be shown that n1 (T1 + . . . + Tn ) does not converge to 0 but rather
n (T1 + . . . + Tn ) ∼ t1 for any n.)
1

Page 15
(c) Let Un ∼ χ2n . Show that √ 11 → 1 in probability as n → ∞. (Hint: Apply the Law of Large Numbers and the
n Un
Continuous Mapping Theorem.)


[Continuous Mapping Theorem: If random variables {Xn }n=1 converge in probability to c ∈ R (as n → ∞ ),

and g : R → R is continuous, then {g (Xn )}n=1 converge in probability to g(c).]

Pn IID  
Solution: We may write Un = i=1 Xi2 , where√X1 , . . . , Xn ∼ N (0, 1). Then the LLN implies n1 Un → E Xi2 = 1
in probability as n → ∞. The function x 7→ 1/ x is continuous (at every x > 0 ), so by the Continuous Mapping
Theorem
1
q →1
1
n nU
in probability as n → ∞.

(d) Using Slutsky’s lemma, show that if Tn ∼ tn for each n = 1, 2, 3, . . ., then as n → ∞, Tn → Z in distribution
where Z ∼ N (0, 1). (This formalizes the statement that “the tn distribution approaches to the standard normal
distribution as n gets large”.)

∞ ∞
[Slutsky’s lemma: If sequences of random variables {Xn }n=1 and {Yn }n=1 satisfy Xn → c in probability for a
constant c ∈ R and Yn → Y in distribution for a random variable Y , then Xn Yn → cY in distribution.]

Solution: We may write Tn = √Z1n where Zn ∼ N (0, 1), Un ∼ χ2n , and Zn and Un are independent. By part (c),
n Un
√ 11 → 1 in probability. Clearly, Zn → N (0, 1) in distribution, since the distribution of Zn does not change with
n Un
n. Then Tn → N (0, 1) in distribution by Slutsky’s lemma.

(e) If ξp is the p-th quantile of Fm,n distribution (F distribution with (m, n) degree of freedoms) and ξp′ the p-th quantile

of Fn,m . Show that, ξp ξ1−p = 1.

Solution:
χ2m /m 1 χ2n /n
F = ⇒ = ∼ Fn,m .
χ2n /n F χ2m /m
   
1 1 1
p = P [Fm,n ≤ ξp ] = P ⩾ = P Fn,m ⩾
Fm,n ϵp ξp
 
1
⇒ P Fn,m ≤ = 1 − p.
ξp
 ′
 ′
But, p Fn,m ≤ ξ1−p =1−p⇒ 1
ξp = ξ1−p .

Problem 7. (Methods of Estimation)


IID
(a) Suppose X1 , . . . , Xn ∼ Geometric(p), where Geometric(p) is the geometric distribution on the positive integers
{1, 2, 3, . . .} defined by the probability mass function (PMF)

f (x | p) = p(1 − p)x−1 ,

with a single parameter p ∈ [0, 1]. Compute the method-of-moments estimate of p, as well as the MLE of p. For
large n, what approximately is the sampling distribution of the MLE? (You may use, without proof, the fact that
the Geometric(p) distribution has mean 1/p.)
Solution. The method of moments estimator sets the population mean, 1/p, equal to the sample mean,

1X
n
X̄ = Xi .
n i=1

Page 16
Inverting to solve for p gives
1
p̂MOM = .

For the maximum likelihood estimator, the likelihood is
Y
n

L (p | X1 , . . . , Xn ) = p(1 − p)Xi −1 = pn (1 − p)n(X̄−1)
i=1

and the log-likelihood is therefore


ℓ(p) = n log p + n(X̄ − 1) log(1 − p).
The derivative is
n n(X̄ − 1)
ℓ′ (p) = − .
p 1−p
Setting equal to zero and solving for p gives
1
p̂MLE = .

We must also check that ℓ(p) achieves a maximum at X̄ −1 ; this may be verified by checking that ℓ′ (p) takes positive
for p < X̄ −1 and negative values for p > X̄ −1 .
Now, we can get the asymptotic distribution using the delta method. We have from the central limit theorem that
 
√ 1−p
n(X̄ − 1/p) ∼ N 0, 2 .
p

Taking g(θ) = 1/θ gives (g ′ (θ)) = θ−4 , which for θ = 1/p is (g ′ (θ)) = p4 . Hence
2 2

√ √ √ 
n (p̂MLE − p) = n(1/X̄ − p) = n(g(X) − g(1/p)) ⇒ N 0, p2 (1 − p) .

Alternatively, we could obtain the variance using the Fisher information:


 
√ 1
n (p̂MLE − p) ⇒ N 0, ,
I(p)

where I(p) is the Fisher information for a single observation. We compute


 2 
′′ ∂
I(p) = −Ep [ℓ (p)] = −Ep 2 (log p + (X − 1) log(1 − p))
∂ p
    
∂ 1 X −1 1 X −1
= −Ep − = −Ep − 2 −
∂p p 1−p p (1 − p)2
1
= 2
p (1 − p)
√ 
So, n (p̂MLE − p) ∼ N 0, p2 (1 − p) .

IID 
(b) Let X1 , . . . , Xn ∼ N µ, σ 2 . We showed in class that the MLEs for µ and σ 2 are given by

1X 2
n
µ̂ = X̄ and σ̂ 2 = Xi − X̄ .
n i=1

(i) By computing the Fisher information matrix I µ, σ 2 , derive the approximate joint distribution of µ̂ and σ̂ 2
for large n. (Hint: Substitute v = σ 2 and treat v as the parameter rather than σ.)

(ii) Suppose it is known that µ = 0. Compute the MLE σ̃ 2 in the one-parameter sub-model N 0, σ 2 . The Fisher
information matrix in part (i) has off-diagonal entries equal to 0 when µ = 0 and n is large. What does this
tell you about the standard error of σ̃ 2 as compared to that of σ̂ 2 ?

Page 17
Solution. (i) Denote v = σ 2 . Then
1
e− 2v (X−µ)
1 2
f (X | µ, v) = √
2πv
and
1 1 1
ℓ(µ, v) = log f (X | µ, v) = − log 2π − log v − (X − µ)2 .
2 2 2v
To obtain the Fisher information matrix I(µ, v), we must compute the four second-order partial derivatives of
ℓ(µ, v). These quantities are
∂2ℓ 1
=− ,
∂µ2 v
∂2ℓ 1 1
= 2 − 3 (X − µ)2 ,
∂v 2 2v v
∂2ℓ ∂2ℓ X −µ
= =− .
∂µ∂v ∂v∂µ v2
Then " #  
∂2ℓ ∂2ℓ
1/v 0
I(µ, v) = −Eµ,v ∂µ2
∂2ℓ
∂µ∂v
∂2ℓ
= .
0 1/2v 2
∂v∂µ ∂v 2

This matrix has an inverse  


−1 v 0
I(µ, v) = .
0 2v 2
Substituting back v = σ 2 , we have  
−1 σ2 0
I µ, σ 2 = ,
0 2σ 4
which we conclude is the asymptotic variance of the maximum likelihood estimate. In other words,
       2 
√ X̄ µ 0 σ 0
n − ∼N , .
S2 σ2 0 0 2σ 4

(ii) The joint log-likelihood in this one-parameter sub-model is given by

1 X 2
n
n n
ℓ(v) = − log 2π − log v − X ,
2 2 2v i=1 i

where again v = σ 2 . Then


1 X 2
n
n
ℓ′ (v) = − + 2 X ,
2v 2v i=1 i
and setting equal to zero and solving for v gives

1X 2
n
ṽ = σ̃ 2 = X .
n i=1 i

Since the off-diagonals of the inverse Fisher information matrix are zero, the sample mean and standard devi-
ation are asymptotically uncorrelated, and so σ̂ 2 and σ̃ 2 have the same asymptotic standard error.

IID
(c) Let X1 , . . . , Xn ∼ Uniform(0, θ) for a single parameter θ > 0 having PDF
1
f (x | θ) = 1{0 ≤ X ≤ θ}.
θ

(i) Compute the MLE θ̂ of θ. (Hint: Note that the PDFs f (x | θ) do not have the same support for all θ > 0, and
they are also not differentiable with respect to θ you will need to reason directly from the definition of MLE.)

Page 18

(ii) If the true parameter is θ, explain why θ̂ ≤ θ always, and hence why it cannot be true that n(θ̂ − θ) converges
in distribution to N (0, v) for any v > 0.

Solution. (i) The likelihood is


1 Y
n
L(θ) = n 1 {0 ≤ Xi ≤ θ} .
θ i=1
Now, notice that the expression
Y
n
1 {0 ≤ Xi ≤ θ} ,
i=1

taken as a function of θ, is the same as


n o
1 {θ > Xi for all i} = 1 θ ≥ max Xi .
i

This means that the likelihood can be written as


1 n o
L(θ) = 1 θ ≥ max X i ,
θn i

and that the maximum likelihood estimate is the value of θ that maximizes 1/θn on the interval [maxi Xi , ∞).
Since 1/θn is a decreasing function, this maximum occurs at the left end point, so

θ̂ = max Xi .
i

(ii) The true parameter θ must satisfy θ ≥ Xi for all i = 1, . . . , n, since the range of Xi is bounded above by θ.

Hence θ ≥ maxi Xi = θ̂ as well. This means that for any value of n, n(θ̂ − θ) takes on positive values with

probability zero, so n(θ̂ − θ) cannot be asymptotically normally distributed.

(d) Consider a parametric model {f (x | θ) : θ ∈ R} of the form

f (x | θ) = eθT (x)−A(θ) h(x),

where T, A, and h are known functions.


(i) Show that the Poisson(λ) model is of this form, upon reparametrizing by θ = log λ. What are the functions
T (x), A(θ), and h(x)?

(ii) For any model of the above form, differentiate the identity
Z
1 = eθT (x)−A(θ) h(x)dx

with respect to θ on both sides, to obtain a formula for Eθ [T (X)], where Eθ denotes expectation when X ∼
f (x | θ). Verify that this formula is correct for the Poisson example in part (i).

(iii) The following procedure defines the generalized method-of-moments estimator: For a fixed function g(x),
compute Eθ [g(X)] in terms of θ, and take the estimate θ̂ to be the value of θ for which

1X
n
Eθ [g(X)] = g (Xi ) .
n i=1

The method-of-methods estimator discussed in class is the special case of this procedure for g(x) = x.
IID
Let X1 , . . . , Xn ∼ f (x | θ), where f (x | θ) is of the given form, and consider the generalized method-of-
moments estimator using the function g(x) = T (x). Show that this estimator is the same as the MLE. (You
may assume that the MLE is the unique solution to the equation 0 = l′ (θ), where l(θ) is the log-likelihood.)

Page 19
Solution. (i) A Poisson random variable has a mass function

λx e−λ 1
f (x | λ) = = ex log λ−λ ,
x! x!
for x = 0, 1, 2, . . .. Reparametrizing by θ = log λ, we obtain
1 θx−eθ
f (x | θ) = e ,
x!
which is of the form in question. The functions are given by
1
T (x) = x, A(θ) = eθ , and h(x) = .
x!

(ii) The derivative of the right hand side is


Z Z Z
d d θT (x)−A(θ)
e θT (x)−A(θ)
h(x)dx = e h(x)dx = (T (x) − A′ (θ)) eθT (x)−A(θ) h(x)dx.
dθ dθ
Since the derivative of the left-hand side is 0, we have
Z
0 = (T (x) − A′ (θ)) eθT (x)−A(θ) h(x)dx

which implies Z Z
θT (x)−A(θ) ′
T (x)e h(x)dx = A (θ) eθT (x)−A(θ) h(x)dx .
| {z } | {z }
Eθ [T (X)] 1

Using the identities noted above, we obtain the formula

Eθ [T (X)] = A′ (θ).

(Note: By replacing integrals with sums, the identity holds for discrete models as well.)
In the Poisson model, A(θ) = eθ , so A′ (θ) = eθ as well, and T (X) = X. This means

Eθ [X] = eθ = λ,

which we know to be true.

(iii) From part (ii), Eθ [T (X)] = A′ (θ), so the generalized method-of-moments estimator is the value of θ
satisfying
1X
n
A′ (θ) = T (Xi ) .
n i=1
We now compute the maximum likelihood estimator. The log-likelihood is
X
n X
n
θ T (Xi ) − nA(θ) + log h (Xi ) ,
i=1 i=1

which has derivative


X
n
T (Xi ) − nA′ (θ).
i=1

Setting equal to zero, we see that the MLE must satisfy

1X
n

A (θ) = T (Xi ) ,
n i=1

which is the same as the GMM estimator for g(x) = T (x).

Page 20
(e) Suppose that X is a discrete random variable with
2
P[X = 0] = θ
3
1
P[X = 1] = θ
3
2
P[X = 2] = (1 − θ)
3
1
P[X = 3] = (1 − θ)
3
where 0 ≤ θ ≤ 1 is a parameter. The following 10 independent observations were taken from such a distribution:
{3, 0, 2, 1, 3, 2, 1, 0, 2, 1}. (For parts (i) and (ii), feel free to use any asymptotic approximations you wish, even though
n = 10 here is rather small.)
(i) Find the method of moments estimate of θ and compute an approximate standard error of your estimate using
asymptotic theory.

(ii) Find the maximum likelihood estimate of θ and compute an approximate standard error of your estimate
using asymptotic theory. (Hint: Your formula for the log-likelihood based on n observations X1 , . . . , Xn should
depend on the numbers of 0’s, 1’s, 2’s, and 3’s in this sample.)

Solution. (i) The expectation of X is


2 1 2 1 7
E[X] = θ · 0 + θ · 1 + (1 − θ) · 2 + (1 − θ) · 3 = − 2θ.
3 3 3 3 3
For an IID sample X1 , . . . , Xn , equating 73 − 2θ with the sample mean X̄ and solving for θ, the method-of-

moments estimate is θ̂ = 12 37 − X̄ . For the 10 given observations, θ̂ = 0.417.
The variance of X is
 
Var[X] = E X 2 − E[X]2
 2
2 1 2 1 7 2
= θ · 02 + θ · 12 + (1 − θ) · 22 + (1 − θ) · 32 − − 2θ = + 4θ − 4θ2
3 3 3 3 3 9

Then Var[θ̂] = 14 Var[X̄] = 4n 1
Var[X] = 4n1 2
+ 4θ − 4θ2 .
r  9

9 + 4θ̂ − 4θ̂
1 2
An estimate of the standard error is 4n 2 , which for the 10 given observations is 0.173.

1
(An alternative estimate of the standard error is given by 4n times the sample variance of X1 , . . . , Xn , which
for the 10 given observations is 0.171.)

(ii) For an IID sample X1 , . . . , Xn , let N0 , N1 , N2 , N3 be the total numbers of observations equal to 0, 1, 2, and
3 . Then the log-likelihood is
Yn  1{Xi =0}  1{Xi =1}  1{Xi =2}  1{Xi =3} !
2 1 2 1
l(θ) = log θ θ (1 − θ) (1 − θ)
i=1
3 3 3 3
2 1 2 1
= N0 log θ + N1 log θ + N2 log (1 − θ) + N3 log (1 − θ).
3 3 3 3
To compute the MLE for θ, we set
N0 N1 N2 N3
0 = l′ (θ) = + − −
θ θ 1−θ 1−θ
and solve for θ, yielding θ̂ = (N0 + N1 ) / (N0 + N1 + N2 + N3 ) = (N0 + N1 ) /n. For the 10 given observa-
tions, θ̂ = 0.5. The total probability that X = 0 or X = 1 is θ, so N0 + N1 ∼ Binomial(n, θ). Then
Var[θ̂] = n12 Var [N0 + N1 ] = θ(1−θ)
n .

Page 21
(Alternatively, we may compute
(
∂2 − θ12 x = 0 or x = 1
log f (x | θ) =
∂θ2 − (1−θ)
1
2 x = 2 or x = 3,
h i
∂2
so the Fisher information is I(θ) = −E ∂θ 2 log f (X | θ) = 1
θ(1−θ) . This shows that the variance of θ̂ is
θ(1−θ)
approximately for large n.)
n q
Thus, an estimate of the standard error is θ̂(1 − θ̂)/n, which for the 10 given observations is 0.158.

(f) Let (X1 , . . . , Xk ) ∼ Multinomial (n, (p1 , . . . , pk )). (This is not quite the setting of n IID observations from a para-
metric model, although you can think of (X1 , . . . , Xk ) as a summary of n such observations Y1 , . . . , Yn from the
parametric model Multinomial (1, (p1 , . . . , pk )), where Yi indicates which of k possible outcomes occurred for the ith
observation.) Find the MLE of pi .

Solution. The log-likelihood is given by


     X
k
n n
l (p1 , . . . , pk ) = log pX
1
1
. . . pX
k
k
= log + Xi log pi ,
X1 , . . . , X k X1 , . . . , X k
i=1

and the parameter space is

Ω = {(p1 , . . . , pk ) : 0 ≤ pi ≤ 1 for all i and p1 + . . . + pk = 1} .

To maximize l (p1 , . . . , pk ) subject to the linear constraint p1 + . . . + pk = 1, we may use the method of Lagrange
multipliers: Consider the Lagrangian
  X
k
n
L (p1 , . . . , pk , λ) = log + Xi log pi + λ (p1 + . . . + pk − 1)
X1 , . . . , X k
i=1

for a constant λ to be chosen later. Clearly, subject to p1 + . . . + pk = 1, maximizing l (p1 , . . . , pk ) is the same as
maximizing L (p1 , . . . , pk , λ). Ignoring momentarily the constraint p1 + . . . + pk = 1, the unconstrained maximizer
of L is obtained by setting for each i = 1, . . . , k
∂L Xi
0= = + λ,
∂pi pi
Pn
Pn yields p̂i = −Xi /λ. For the specific choice of constant λ = −n, we obtain p̂i = Xi /n and
which i=1 p̂i =
i=1 X i /n = 1, so the constraint is satisfied. As p̂ i = X i /n is the unconstrained maximizer of L (p 1 , . . . , pk , −n),
this implies that it must also be the constrained maximizer of L (p1 , . . . , pk , −n), so it is the constrained maximizer
of l (p1 , . . . , pk ). So the MLE is given by p̂i = Xi /n for i = 1, . . . , k.

Problem 8. (Properties of Estimator)


(a) Suppose X1 , X2 , X3 are independent normally distributed random variables with mean µ and variance σ 2 . However,
instead of X1 , X2 , X3 , we only observe Y1 = X2 − X1 and Y2 = X3 − X2 . Which of the following statistics is sufficient
for σ 2 ?

(i) Y12 + Y22 − Y1 Y2 (ii) Y12 + Y22 + 2Y1 Y2 (iii) Y12 + Y22 (iv) Y12 + Y22 + Y1 Y2 .
iid 
Solution. Given X1 , X2 , X3 ∼ N µ, σ 2 and Y1 = X2 − X1 and Y2 = X3 − X2 . Now,

E (Y1 ) = E (X2 ) − E (X1 ) = µ − µ = 0


E (Y2 ) = E (X3 ) − E (X2 ) = µ − µ = 0
V (Y1 ) = V (X2 ) + V (X1 ) − 2 Cov (X2 , X1 ) = σ 2 + σ 2 − 0 = 2σ 2
V (Y2 ) = V (X3 ) + V (X2 ) − 2 cov (X3 , X2 ) = 2σ 2

Page 22
Cov (Y1 , Y2 ) = Cov (X2 − X1 , X3 − X2 ) = Cov (X2 , X3 ) − Cov (X2 , X2 ) − Cov (X1 , X3 ) + Cov (X1 , X2 ) = −σ 2 .
Cov (Y1 , Y2 ) −σ 2 1
Corr (Y1 , Y2 ) = p =√ =− .
V (Y1 ) V (Y2 ) 2σ · 2σ
2 2 2
Since Y1 , Y2 are linear combinations of Normal random variables, their joint distribution is also Normal

(Y1 , Y2 ) ∼ N2 0, 0, 2σ 2 , 2σ 2 , −1/2
[ 2
]
y1 y2 y1 y2
1 − 23 + 2σ22 + 1
e−[y1 +y2 +y1 y2 ]/3σ = gσ2 (T (y1 , y2 )) h (y1 , y2 ) ,
2 2 2
fY1, ,Y2 (y1 , y2 ) = √ e 2σ 2 2σ 2
= √
2 3πσ 2 2 3πσ 2

e−[y1 +y2 +y1 y2 ]/3σ and h (y1 , y2 ) = 1 and T (y1 , y2 ) = y12 + y22 + y1 y2 .
2 2 2
1
where gσ2 (T (y1 , y2 )) = 2√3πσ 2

∴ By Factorization Theorem, Y12 + Y22 + Y1 Y2 is sufficient for σ 2 .

Xn be a random sample from a Bernoulli distribution with parameter p; 0 ≤ p ≤ 1. The bias of the
(b) Let X1 , X2√, . . . , ∑
n+2 n i=1 Xi
estimator 2(n+ n)
√ for estimating p is equal to
    
(i) √1
n+1
p− 1
2 (ii) 1√
n+ n
1
2 −p (iii) √1
n+1
1
2 + √p
n
−p (iv) √1
n+1
1
2 −p .

iid
Solution. Given X1 , X2 , X3 ∼ Ber(p).
√ Pn
n + 2 i=1 Xi
Let, T = √ .
2(n + n)

Bias(T ) = E(T ) − p
√ Pn  √ P
n + 2 i=1 Xi n + 2E ( Xi )
=E √ −p= √ −p
2(n + n) 2(n + n)
√ h i
n + 2np PN
= √ −p since i=1 X i ∼ Bin (n, p) .
2(n + n)
√ √ √
n + 2np − 2np − 2 np n(1 − 2p)
= √ = √ √
2(n + n) 2 n(1 + n)
   
1 1 − 2p 1 1
=√ =√ −p .
n+1 2 n+1 2

(c) Let X1 , X2 , . . . , Xn be a random sample from an exponential distribution with the probability density function;
(
1 −x
θ e θ , if x > 0,
f (x; θ) =
0, otherwise,

where θ > 0.
PnDerive the Cramér-Rao lower bound for the variance of any unbiased estimator of θ. Hence, prove
that T = n1 i=1 Xi is the uniformly minimum variance unbiased estimator of θ.
IID
Solution. X1 , . . . , Xn ∼ Exp( mean θ)
1 −x/θ
f (x|θ) = e ; x > 0.
θ
Let τ (θ) = θ which is to be estimated.
Now, if T (x) is any unbiased estimator of τ (θ) then by Cramer-Rao Inequality

(τ ′ (θ))2
V (T (x)) ⩾ 2
∼ ∂
nE ∂θ ln f (x|θ)

Page 23
provided the distribution of X follows the regularity conditions.
Since X ∼ Exp( mean θ), thus X belongs to One Parameter Exponential family and hence satisfies the regularity
conditions
1
f (x | θ) = e−x/θ
θ
x
ln f (x | θ) = − ln θ −
θ
∂ 1 x
ln f (x|θ) = − + 2
∂θ θ θ
 2  2  2 
∂ x 1 x + θ2 − 2θx 2θ2 + θ2 − 2θ2 1
E ln f (x|θ) = E − = E = = 2
∂θ θ2 θ θ4 θ4 θ
1 θ2
Since, τ (θ) = θ ⇒ τ ′ (θ) = 1 ∴ V (T (x)) ⩾ = .
∼ n· 1
θ2
n
θ2
Thus, the Cramer-Rao
Pn Lower Bound = n
Now for T = n1 i=1 Xi ,

1X 1 X θ2
E(T ) = E (Xi ) = θ and V (T ) = V (X i ) = .
n n2 n
∴ T is an Unbiased Estimator of θ and it attains CRLB and T is the UMVUE of θ.


(d) Let X1 , X2 , . . . , Xn be a random sample from a N µ, σ 2 distribution, where both µ and σ 2 are unknown. Find
Pn 2
the value of b that minimizes the mean squared error of the estimator Tb = n−1b
Xi − X̄ for estimating σ 2 ,
Pn i=1
where X̄ = n1 i=1 Xi .

Solution.
b X 2
n
Given Tb = Xi − X̄ = bS 2 (where S 2 is the Sample Variance)
n − 1 i=1
2   2
MSE (Tb ) = Var (Tb ) + E (Tb ) − σ 2 = Var bS 2 + E bS 2 − σ 2
 2 2σ 4
= b2 Var S 2 + σ 2 (b − 1) = b2 · + σ 4 (b − 1)2
n−1
 
2σ 4  2σ 4
= b2 · + σ 4 b2 + 1 − 2b = + σ 4 b2 − 2σ 4 b + σ 4
n−1 n−1
 
n+1
= σ 4 · b2 − 2σ 4 b + σ 4 , which is a quadratic polynomial in b.
n−1
 
n+1
Let, p(b) = σ 4 b2 − 2σ 4 b + σ 4 .
n−1
 
′ n+1 n−1
p (b) = σ 4 · 2b − 2σ 4 = 0 ⇒ b = .
n−1 n+1
 
n+1
p′′ (b) = 2σ 4 > 0 ∀ b ∈ R.
n−1
∴ p(b) is minimum at b = n−1
n+1 and for b = n−1
n+1 , the MSE (Tb ) is minimum for estimating σ 2 .

(e) Let X1 , X2 , . . . , Xn be a random sample from a continuous distribution with the probability density function;
( 2
2x − xλ
λ e , if x > 0
f (x; λ) =
0, otherwise,

where λ > 0. Find the maximum likelihood estimator of λ and show that it is sufficient and an unbiased estimator
of λ.

Solution. Given X1, X2, . . . , Xn is a random sample from

f(x; λ) = (2x/λ) e^{−x²/λ}, x > 0.

The likelihood function is

L(λ | x) = ∏_{i=1}^n (2xi/λ) e^{−xi²/λ} = (2^n/λ^n) (∏_{i=1}^n xi) e^{−∑xi²/λ}, xi > 0 ∀ i = 1, . . . , n.

ln L(λ | x) = −∑xi²/λ − n ln λ + terms independent of λ,
∂/∂λ ln L(λ | x) = ∑xi²/λ² − n/λ = 0
⇒ (∑xi² − nλ)/λ² = 0
⇒ λ = (1/n)∑_{i=1}^n xi².

∴ λ̂ = (1/n)∑_{i=1}^n Xi² is the MLE of λ. Now,

E(X^k) = ∫_0^∞ x^k (2x/λ) e^{−x²/λ} dx;  substituting x²/λ = t, so that x = √(λt) and (2x/λ)dx = dt,
       = ∫_0^∞ (√(λt))^k e^{−t} dt = λ^{k/2} ∫_0^∞ t^{k/2+1−1} e^{−t} dt = λ^{k/2} Γ(k/2 + 1).

∴ E(X) = λ^{1/2} Γ(1/2 + 1) = √λ · (1/2)Γ(1/2) = √(πλ)/2,
E(X²) = λ Γ(2) = λ,
∴ E(λ̂) = (1/n)∑E(Xi²) = λ.

∴ λ̂ is an unbiased estimator of λ.

The joint pdf of X1, . . . , Xn is

f(x | λ) = ∏_{i=1}^n (2xi/λ) e^{−xi²/λ} = (2^n ∏xi / λ^n) e^{−∑_{i=1}^n xi²/λ}
         = [ (1/λ^n) e^{−(1/λ)∑xi²} ] · [ 2^n ∏xi ] = g_λ(∑_{i=1}^n xi²) h(x).

∴ By the Neyman–Fisher Factorization Theorem (NFFT), ∑_{i=1}^n Xi² is a sufficient statistic for λ.
∴ λ̂ = (1/n)∑Xi², being a one-one function of ∑Xi², is also a sufficient statistic for λ.
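Since X² ∼ Exp(mean λ) under this density, the MLE is easy to check by simulation in R (a minimal sketch; λ, n, and the replication count are arbitrary illustrative choices):

# Check that lambda-hat = mean(X_i^2) is (approximately) unbiased for lambda
set.seed(1)
lambda <- 4; n <- 25; B <- 10000
lambda_hat <- replicate(B, {
  x <- sqrt(rexp(n, rate = 1/lambda))   # X = sqrt(Y) with Y ~ Exp(mean lambda) has the given pdf
  mean(x^2)
})
mean(lambda_hat)                        # close to lambda = 4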

Problem 9. (Sufficiency Principle)


(a) Let X be a single observation from a population belonging to the family {f0(x), f1(x)}, where

f0(x) = (1/√(2π)) e^{−x²/2} and f1(x) = 1/(π(1 + x²)); x ∈ R.

Find a non-trivial sufficient statistic for the family of distributions.

Solution. Write the family as {fθ(x) : θ ∈ Ω = {0, 1}}. [Here, the parameter θ is called the labeling parameter.] Define I(θ) = 0 if θ = 0 and I(θ) = 1 if θ = 1.

The pdf of X is

fθ(x) = {f0(x)}^{1−I(θ)} {f1(x)}^{I(θ)} = (f1(x)/f0(x))^{I(θ)} · f0(x) = [ (1/(π(1 + x²))) / ((1/√(2π)) e^{−x²/2}) ]^{I(θ)} · (1/√(2π)) e^{−x²/2}
      = g(T(x); θ) h(x), where h(x) = (1/√(2π)) e^{−x²/2} and T(x) = x² or |x|.

Hence, X² or |X| is sufficient for the family of distributions.

(b) Let X1, X2, . . . , Xn be a random sample from the following pdfs. Find a non-trivial sufficient statistic in each case. [Hint: If the parameter of the distribution appears in the range of the Xi, then we have to use an indicator function together with X(1) or X(n), i.e., min_i{Xi} or max_i{Xi}.]

(i) f(x; θ) = θx^{θ−1} if 0 < x < 1, and 0 otherwise.

Solution. The joint pdf of X1, X2, . . . , Xn is

f(x) = θ^n (∏_{i=1}^n xi)^{θ−1}
     = g_θ(∏_{i=1}^n xi) · h(x), where h(x) = 1 and T(x) = ∏_{i=1}^n xi.

∴ By the Neyman–Fisher factorization criterion, T(X) = ∏_{i=1}^n Xi is sufficient for θ.
(ii) f(x; µ) = (1/(|µ|√(2π))) e^{−(x−µ)²/(2µ²)}; x ∈ R.

Solution. We know that if X ∼ N(µ, σ²):

f(x; µ, σ) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}.

So, in the given problem X ∼ N(µ, µ²), where µ ≠ 0. Hence, T(X) = (∑_{i=1}^n Xi, ∑_{i=1}^n Xi²) is sufficient for µ.

(iii) f(x; α, β) = x^{α−1}(1 − x)^{β−1}/B(α, β) if 0 < x < 1 (with α, β > 0), and 0 otherwise.

Solution. The joint pdf of X1, . . . , Xn is

f(x) = (1/B(α, β))^n (∏_{i=1}^n xi)^{α−1} (∏_{i=1}^n (1 − xi))^{β−1}
     = g(T(x); α, β) h(x), where h(x) = 1 and T(x) = (∏_{i=1}^n xi, ∏_{i=1}^n (1 − xi)).

Hence, T(X) = (∏_{i=1}^n Xi, ∏_{i=1}^n (1 − Xi)) is jointly sufficient for (α, β).

(iv) f(x; µ, λ) = (1/λ) e^{−(x−µ)/λ} if x > µ, and 0 otherwise.

Solution. The joint pdf of X is

f(x) = (1/λ^n) e^{−(1/λ)∑_{i=1}^n (xi − µ)} if xi > µ ∀ i
     = (1/λ^n) exp{−(∑_{i=1}^n xi − nµ)/λ} · I(x(1), µ), where I(a, b) = 1 if a ⩾ b and 0 otherwise,
     = g(∑_{i=1}^n xi, x(1); λ, µ) × h(x), where h(x) = 1.

Thus, X(1) and ∑_{i=1}^n Xi are jointly sufficient statistics for µ and λ.

(v) f(x; µ, σ) = (1/(xσ√(2π))) e^{−(ln x − µ)²/(2σ²)} if x > 0, and 0 otherwise.

Solution. The joint pdf of X is

f(x) = (1/∏_{i=1}^n xi) · (1/(σ^n (2π)^{n/2})) exp{−(1/(2σ²)) ∑_{i=1}^n (ln xi − µ)²}, if xi > 0 ∀ i
     = (1/(σ^n (2π)^{n/2})) exp{−∑(ln xi)²/(2σ²) + µ∑ln xi/σ² − nµ²/(2σ²)} · (1/∏_{i=1}^n xi)
     = g(∑_{i=1}^n ln xi, ∑_{i=1}^n (ln xi)²; µ, σ) · h(x), where h(x) = 1/∏_{i=1}^n xi.

Hence, T(X) = (∑_{i=1}^n ln Xi, ∑_{i=1}^n (ln Xi)²) is sufficient for (µ, σ).

(vi) f(x; α, θ) = θα^θ/x^{θ+1} if x > α, and 0 otherwise.

Solution. The joint pdf of X is

f(x) = θ^n α^{nθ} / ∏_{i=1}^n xi^{θ+1} if xi > α ∀ i
     = θ^n α^{nθ} (1/∏_{i=1}^n xi^{θ+1}) · I(x(1), α), where I(a, b) = 1 if a > b and 0 otherwise,
     = g(∏_{i=1}^n xi, x(1); θ, α) · h(x), where h(x) = 1 and T(x) = (∏_{i=1}^n xi, x(1)).

Hence, T(X) = (∏_{i=1}^n Xi, X(1)) is sufficient for (θ, α).

(vii) f(x; θ) = 2(θ − x)/θ² if 0 < x < θ, and 0 otherwise.

Solution. The joint pdf of X is

f(x) = (2^n/θ^{2n}) ∏_{i=1}^n (θ − xi), 0 < xi < θ
     = (2/θ²)^n (θ − x1)(θ − x2) · · · (θ − xn), 0 < xi < θ.

This cannot be expressed in the form required by the factorization criterion, so only (X1, X2, . . . , Xn), or equivalently (X(1), X(2), . . . , X(n)), is trivially sufficient for θ. Hence there is no non-trivial sufficient statistic.

(c) If fθ(x) = 1/2 for θ − 1 < x < θ + 1, i.e. Xi ∼ U(θ − 1, θ + 1), then show that X(1) and X(n) are jointly sufficient for θ.

Solution. The joint pdf of X is

f(x) = (1/2)^n
     = (1/2^n) I(θ − 1, x(1)) I(x(n), θ + 1); θ − 1 < x(1) ⩽ x(n) < θ + 1, where I(a, b) = 1 if a < b and 0 if a ⩾ b,
     = g(T(x); θ) h(x), where h(x) = 1/2^n and T(x) = (x(1), x(n)).

Hence T(X) = (X(1), X(n)) is jointly sufficient for θ.

(d) If a random sample of size n ⩾ 2 is drawn from a Cauchy distribution with PDF

fθ(x) = 1/(π[1 + (x − θ)²]),

where −∞ < θ < ∞, can you have a single sufficient statistic for θ?

Solution.

∏_{i=1}^n f(xi; θ) = 1/(π^n ∏_{i=1}^n [1 + (xi − θ)²]).

Note that ∏_{i=1}^n {1 + (xi − θ)²} = {1 + (x1 − θ)²}{1 + (x2 − θ)²} · · · {1 + (xn − θ)²}
= 1 + (terms involving one xi) + (terms involving two xi's) + . . . + (term involving all xi's)
= 1 + ∑_i (xi − θ)² + ∑_{i≠j} (xi − θ)²(xj − θ)² + . . . + ∏_{i=1}^n (xi − θ)².

Clearly, ∏_{i=1}^n f(xi; θ) cannot be written as g(T(x), θ) · h(x) for a statistic other than the trivial choices (X1, . . . , Xn) or (X(1), . . . , X(n)). Hence there is no non-trivial sufficient statistic, and no reduction of the sample is possible: the whole set (X1, . . . , Xn) is jointly sufficient for θ.

Problem 10. (Uniformly Minimum Variance Unbiased Estimator)
(a) Let X1, X2, . . . , Xn be a random sample from f(x; p) = p(1 − p)^x for x = 0, 1, . . . , and 0 otherwise.
Show that the unbiased estimator of p based on T = ∑_{i=1}^n Xi is unique. Hence or otherwise, find the UMVUE of p.

Solution. T = ∑_{i=1}^n Xi ∼ NB(n, p).
To solve for h(T) such that E{h(T)} = p ∀ p ∈ (0, 1):

∑_{t=0}^∞ h(t) C(t + n − 1, n − 1) p^n q^t = p ∀ p
⇒ ∑_{t=0}^∞ h(t) C(t + n − 1, n − 1) q^t = p^{−(n−1)} = (1 − q)^{−(n−1)}
⇒ ∑_{t=0}^∞ h(t) C(t + n − 1, n − 1) q^t = ∑_{t=0}^∞ C(n − 1 + t − 1, t) q^t, as 0 < q < 1.

By the uniqueness property of power series,

h(t) C(t + n − 1, n − 1) = C(n + t − 2, t), t = 0, 1, 2, . . . .

Hence, h(T) = (n − 1)/(T + n − 1) is the only solution of E{h(T)} = p ∀ p.
Thus, h(T) is the only unbiased estimator of p based on T.
It can be shown that T = ∑_{i=1}^n Xi is sufficient.
By the Rao–Blackwell theorem, the UMVUE is a function of T.
As there is only one unbiased estimator of p based on T, h(T) = (n − 1)/(T + n − 1) is the UMVUE of p.
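A small simulation in R illustrates the unbiasedness of h(T) (a sketch; p, n, and the replication count are arbitrary illustrative values):

# Check that h(T) = (n-1)/(T+n-1) is unbiased for p, with T = sum of n geometric counts
set.seed(1)
p <- 0.3; n <- 20; B <- 20000
h_vals <- replicate(B, {
  Tstat <- sum(rgeom(n, p))    # rgeom counts failures before the first success: P(X=x) = p(1-p)^x
  (n - 1) / (Tstat + n - 1)
})
mean(h_vals)                   # close to p = 0.3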

(b) Let X1 and X2 be two independent random variables having the same mean θ. Suppose that E(X1 − θ)² = 1 and E(X2 − θ)² = 2. For estimating θ, consider the estimators Tα(X1, X2) = αX1 + (1 − α)X2, α ∈ [0, 1]. The value of α ∈ [0, 1] for which the variance of Tα(X1, X2) is minimum equals

(i) 2/3 (ii) 1/2 (iii) 1/4 (iv) 3/4

Solution. Given X1 and X2 are independent with

E(X1) = E(X2) = θ, V(X1) = E(X1 − θ)² = 1, V(X2) = E(X2 − θ)² = 2.

Given Tα(X1, X2) = αX1 + (1 − α)X2, α ∈ [0, 1],

V(Tα(X1, X2)) = V(αX1 + (1 − α)X2)
             = α²V(X1) + (1 − α)²V(X2) + 2α(1 − α) Cov(X1, X2)   [Cov(X1, X2) = 0 since X1, X2 are independent]
             = α² + 2(1 − α)² = α² + 2(1 − 2α + α²) = 3α² − 4α + 2,

which is a quadratic polynomial in α. Let p(α) = 3α² − 4α + 2, α ∈ [0, 1].

p′(α) = 6α − 4 = 0 ⇒ α = 4/6 = 2/3,
p″(α) = 6 > 0 ∀ α ∈ [0, 1].

∴ p(α) is minimum at α = 2/3, so for α = 2/3, V(Tα(X1, X2)) is minimum.

(c) Let X1, X2, . . . , Xn be a random sample from

f(x; θ) = e^{−(x−θ)} if x > θ, and 0 otherwise.

Show that T = X(1) is a complete sufficient statistic. Hence find the UMVUE of θ.

Solution. The pdf of X = (X1, X2, . . . , Xn) is

∏_{i=1}^n f(xi; θ) = e^{−∑_{i=1}^n (xi − θ)} if xi > θ ∀ i = 1(1)n, and 0 otherwise
                   = e^{−∑_{i=1}^n (xi − θ)} · I(x(1), θ), where I(a, b) = 1 if a > b and 0 if a ⩽ b,
                   = e^{nθ} I(x(1), θ) · e^{−∑_{i=1}^n xi}
                   = g(T(x), θ) · h(x) with T(x) = x(1).

By the factorization criterion, T = X(1) is sufficient.

To find the distribution of T, note that

F_T(t) = 1 − P[T > t] = 1 − P[X(1) > t] = 1 − {P[X1 > t]}^n = 1 − (∫_t^∞ e^{−(x1−θ)} dx1)^n = 1 − e^{−n(t−θ)} if t > θ,

so f_T(t) = n e^{−n(t−θ)} if t > θ, and 0 otherwise.

For completeness, let E{h(T)} = 0 ∀ θ. Then

∫_θ^∞ h(t) n e^{−n(t−θ)} dt = 0 ∀ θ ⇒ ∫_θ^∞ h(t) e^{−nt} dt = 0 ∀ θ.

Differentiating w.r.t. θ,

0 − h(θ) e^{−nθ} = 0 ∀ θ ⇒ h(θ) = 0 ∀ θ, as e^{−nθ} > 0.

Hence, h(T) = 0 with probability 1 for all θ ⇒ T is complete.
Now,

E(T − θ) = ∫_{−∞}^∞ (t − θ) f_T(t) dt = ∫_θ^∞ (t − θ) n e^{−n(t−θ)} dt = (1/n)∫_0^∞ u e^{−u} du, where u = n(t − θ), = (1/n)Γ(2) = 1/n.

Thus, E(T − 1/n) = θ. By the Lehmann–Scheffé theorem, h(T) = T − 1/n = X(1) − 1/n is the UMVUE of θ.
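The UMVUE can likewise be checked by simulation in R (a sketch with arbitrary illustrative values of θ and n):

# Check that X_(1) - 1/n is (approximately) unbiased for theta in the shifted exponential model
set.seed(1)
theta <- 2; n <- 15; B <- 20000
est <- replicate(B, min(theta + rexp(n)) - 1/n)   # X_i = theta + standard exponential
mean(est)                                         # close to theta = 2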

(d) Is the following family of distributions regular in the sense of Cramér and Rao? If so, find the lower bound for the variance of an unbiased estimator of θ based on a sample of size n. Also, find the UMVUE of θ for the PDF

f(x, θ) = e^{−x²/(2θ)} / √(2πθ); −∞ < x < ∞, θ > 0.

Solution. As we know, equality holds in the CR inequality whenever the family of distributions is a one-parameter exponential family (OPEF). The given pdf is OPEF, satisfying the regularity conditions for the CR inequality; that is, it is regular in the sense of Cramér–Rao.
By the CR inequality, for an unbiased estimator T of θ,

Var(T) ⩾ 1/I_n(θ) = CRLB.

Here, f(x, θ) = (1/√(2πθ)) e^{−x²/(2θ)}; x ∈ R, θ > 0 ⇒ ln f(x, θ) = −(1/2)ln(2πθ) − x²/(2θ).
Differentiating w.r.t. θ we get ∂/∂θ ln f(x1, θ) = −1/(2θ) + x1²/(2θ²) and ∂²/∂θ² ln f(x1, θ) = 1/(2θ²) − x1²/θ³.

I_n(θ) = n · I_1(θ) = n · E[−∂²/∂θ² ln f(x1, θ)] = n(−1/(2θ²) + E(X1²)/θ³) = n(−1/(2θ²) + θ/θ³) = n/(2θ²).

Hence, Var(T) ⩾ 2θ²/n = CRLB. The MVBUE of θ, if it exists, is given by

T = ψ(θ) ± (ψ′(θ)/I_n(θ)) · ∂/∂θ ln L(x, θ) = θ ± (2θ²/n) ∑_{i=1}^n ∂/∂θ ln f(xi, θ) = θ + (2θ²/n)(−n/(2θ) + ∑xi²/(2θ²)), taking the +ve sign only,
  = (1/n)∑_{i=1}^n xi².

Hence, T = (1/n)∑_{i=1}^n Xi² attains the CRLB and (1/n)∑_{i=1}^n Xi² is the UMVUE of θ.


(e) Based on a random sample X1, X2, . . . , Xn from Gamma(α), obtain an estimator of ψ(α) = (∂/∂α) ln Γ(α) which attains the CRLB, and find its variance.

Solution. The pdf of X = (X1, X2, . . . , Xn) is

f(x, α) = ∏_{i=1}^n (1/Γ(α)) e^{−xi} xi^{α−1} = (1/{Γ(α)}^n) e^{−∑_{i=1}^n xi} (∏_{i=1}^n xi)^{α−1}, if xi > 0 ∀ i = 1(1)n.

⇒ ln f(x, α) = −n ln Γ(α) − ∑_{i=1}^n xi + (α − 1)∑_{i=1}^n ln xi, if xi > 0,
∂/∂α ln f(x, α) = −n (∂/∂α) ln Γ(α) + ∑_{i=1}^n ln xi,
∂²/∂α² ln f(x, α) = −n (∂²/∂α²) ln Γ(α).

Here, I(α) = E[−∂²/∂α² ln f(x, α)] = n (∂²/∂α²) ln Γ(α).
A UE which attains the CRLB, if it exists, is given by

T = ψ(α) ± (ψ′(α)/I(α)) · ∂/∂α ln f(x, α)
  = (∂/∂α) ln Γ(α) ± [ (∂²/∂α²) ln Γ(α) / (n (∂²/∂α²) ln Γ(α)) ] · (−n (∂/∂α) ln Γ(α) + ∑ ln xi)
  = (∂/∂α) ln Γ(α) ± (−(∂/∂α) ln Γ(α) + (1/n)∑_i ln xi) = (1/n)∑_{i=1}^n ln xi, taking the positive sign only,
  = ln G, where G = (∏_{i=1}^n Xi)^{1/n} is the GM of X1, X2, . . . , Xn.

Clearly, Var(T) = CRLB = {ψ′(α)}²/I(α) = {(∂²/∂α²) ln Γ(α)}² / (n (∂²/∂α²) ln Γ(α)) = ((∂²/∂α²) ln Γ(α))/n.
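In R, this estimator and its variance can be checked against the built-in digamma and trigamma functions (a minimal sketch; α, n, and the replication count are arbitrary illustrative choices):

# mean(log X) estimates psi(alpha) = digamma(alpha); its variance is about trigamma(alpha)/n, the CRLB
set.seed(1)
alpha <- 3; n <- 40; B <- 10000
est <- replicate(B, mean(log(rgamma(n, shape = alpha, rate = 1))))
c(mean_est = mean(est), digamma = digamma(alpha),
  var_est = var(est), crlb = trigamma(alpha) / n)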

Problem 11. (Finding Confidence Intervals)

(a) Let X1 , . . . , Xn be a random sample from U(0, θ), θ > 0. Find a confidence interval for θ with confidence coefficient
(1 − α), based on X(n) .
(b) Consider a random sample of size n from the rectangular distribution

f(x, θ) = 1/θ if 0 < x < θ, and 0 otherwise.

If Y is the sample range and ξ is given by ξ^{n−1}[n − (n − 1)ξ] = α, show that Y and Y ξ^{−1} are confidence limits for θ with confidence coefficient (1 − α).

(c) Consider a random sample of size n from an exponential distribution with PDF

f_X(x) = exp[−(x − θ)] if θ < x < ∞, and 0 otherwise.

Suggest a 100(1 − α)% confidence interval for θ.

Solution.
(a) The pdf of X(n) is

f_{X(n)}(x) = nx^{n−1}/θ^n if 0 < x < θ, and 0 otherwise.

The pdf of ψ(X(n), θ) = X(n)/θ = T is

g(t) = nt^{n−1} if 0 < t < 1, and 0 otherwise,

which is independent of θ.
Now, P[c < ψ(X(n), θ) < 1] = 1 − α
⇒ ∫_c^1 nt^{n−1} dt = 1 − α, where c is the cut-off point,
⇒ 1 − c^n = 1 − α, i.e. c = α^{1/n}.
Note that α^{1/n} < ψ(X(n), θ) = X(n)/θ < 1
⇒ α^{−1/n} > θ/X(n) > 1,
i.e., X(n) < θ < α^{−1/n} X(n).
Hence, (X(n), α^{−1/n} X(n)) is a confidence interval for θ with confidence coefficient (1 − α).
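A coverage check of this interval in R (a sketch; θ, n, α, and the replication count are arbitrary illustrative values):

# Empirical coverage of (X_(n), alpha^(-1/n) X_(n)) for U(0, theta) samples
set.seed(1)
theta <- 5; n <- 10; alpha <- 0.10; B <- 20000
covered <- replicate(B, {
  m <- max(runif(n, 0, theta))
  (m < theta) && (theta < m * alpha^(-1/n))
})
mean(covered)    # should be close to 1 - alpha = 0.90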

(b) Here, Y = X(n) − X(1). The pdf of Y is

f_Y(y) = n(n − 1) y^{n−2}(θ − y)/θ^n if 0 < y < θ, and 0 otherwise.

The pdf of ψ(Y, θ) = Y/θ = U is

f_U(u) = n(n − 1) u^{n−2}(1 − u) if 0 < u < 1, and 0 otherwise,

which is independent of θ.

Now, P[ξ ⩽ U ⩽ 1] = 1 − α
⇒ ∫_ξ^1 n(n − 1) u^{n−2}(1 − u) du = 1 − α
⇒ n(n − 1) ∫_ξ^1 (u^{n−2} − u^{n−1}) du = 1 − α
⇒ ξ^{n−1}[n − (n − 1)ξ] = α.

Note that {ξ ⩽ U ⩽ 1} = {ξ ⩽ Y/θ ⩽ 1} = {Y ⩽ θ ⩽ Y/ξ}.

Hence, (Y, Y ξ^{−1}) is a random confidence interval for θ with confidence coefficient 1 − α, where ξ is such that ξ^{n−1}[n − (n − 1)ξ] = α.

(c) The distribution function of X(1) is

F_{X(1)}(x(1)) = P[X(1) ⩽ x(1)] = 1 − P[X(1) > x(1)] = 1 − {P[X > x(1)]}^n = 1 − {e^{−(x(1)−θ)}}^n = 1 − e^{−n(x(1)−θ)} if x(1) > θ,

so the pdf of X(1) is f_{X(1)}(x(1)) = (d/dx(1)) F_{X(1)}(x(1)) = n e^{−n(x(1)−θ)} if x(1) > θ. (Note also that e^{−n(X(1)−θ)} = 1 − F_{X(1)}(X(1)) ∼ U(0, 1).)
Equivalently, let U = e^{−(X(1)−θ)}. Then log u = −(x(1) − θ) and the Jacobian is |dx(1)/du| = 1/u, so

f_U(u) = nu^{n−1} if 0 < u < 1, and 0 otherwise.

Now, 1 − α = P[c ⩽ U ⩽ 1] = ∫_c^1 nu^{n−1} du = 1 − c^n ⇒ c = α^{1/n} (equivalently, α = P[0 ⩽ U ⩽ c]).
Note that α^{1/n} ⩽ u ⩽ 1 ⇒ α^{1/n} ⩽ e^{−(x(1)−θ)} ⩽ 1
⇒ (1/n) log α ⩽ −(x(1) − θ) ⩽ 0 ⇒ x(1) + (1/n) log α ⩽ θ ⩽ x(1).

Hence, [X(1) + (1/n) log α, X(1)] is a 100(1 − α)% confidence interval for θ.

Problem 12. (A Heteroskedastic Linear Model)

Consider observed response variables Y1 , . . . , Yn ∈ R that depend linearly on a single covariate x1 , . . . , xn as follows:

Yi = βxi + εi .

Here, the εi's are independent Gaussian noise variables, but we do not assume they have the same variance. Instead, they are distributed as εi ∼ N(0, σi²) for possibly different variances σ1², . . . , σn². The unknown parameter of interest is β.

(a) Suppose that the error variances σ12 , . . . , σn2 are all known. Show that the MLE β̂ for β, in this case, minimizes a
certain weighted least-squares criterion, and derive an explicit formula for β̂.

(b) Show that the estimate β̂ in part (a) is unbiased, and derive a formula for the variance of β̂ in terms of σ12 , . . . , σn2
and x1 , . . . , xn .

(c) Compute the Fisher information IY (β) = −Eβ [l′′ (β)] in this model (still assuming σ12 , . . . , σn2 are known constants).
Show that the variance of β̂ that you derived in part (b) is exactly equal to IY (β)−1 .

In the remaining parts of this question, denote by β̃ the usual (unweighted) least-squares estimator for β, which
P 2
minimizes i (Yi − βxi ) . In practice, we might not know the values of σ12 , . . . , σn2 , so we might still estimate β
using β̃.

(d) Derive an explicit formula for β̃, and show that it is also an unbiased estimate of β.

(e) Derive a formula for the variance of β̃ in terms of σ12 , . . . , σn2 and x1 , . . . , xn . Show that when all error terms have
−1
the same variance σ02 , this coincides with the general formula σ02 X T X for the linear model.

(f) Using the Cauchy–Schwarz inequality (∑_i ai²)(∑_i bi²) ≥ (∑_i aibi)² for any positive numbers a1, . . . , an and b1, . . . , bn, compare your variance formulas from parts (b) and (e) and show directly that the variance of β̃ is always at least the variance of β̂. Explain, using the Cramér–Rao lower bound, why this is to be expected given your finding in (c).

Solution. (a) The log-likelihood is

ℓ(β) = −(n/2) log 2π − ∑_{i=1}^n log σi − ∑_{i=1}^n (Yi − βxi)²/(2σi²),

so that the MLE β̂ minimizes the weighted sum of squares

∑_{i=1}^n (Yi − βxi)²/σi².

Taking derivatives and solving for zero gives the explicit solution

β̂ = (∑_{i=1}^n xiYi/σi²) / (∑_{i=1}^n xi²/σi²).

(b) The estimator has mean

E[β̂] = (1/∑_{i=1}^n xi²/σi²) ∑_{i=1}^n (xi/σi²) E[Yi] = (1/∑_{i=1}^n xi²/σi²) ∑_{i=1}^n (xi/σi²) βxi = β,

so it is unbiased. The variance is

V[β̂] = (∑_{i=1}^n xi² V[Yi]/σi⁴) / (∑_{i=1}^n xi²/σi²)² = (∑_{i=1}^n xi²σi²/σi⁴) / (∑_{i=1}^n xi²/σi²)² = 1/(∑_{i=1}^n xi²/σi²), using V[Yi] = σi².

(c) Taking derivatives, we have

ℓ′(β) = ∑_{i=1}^n (xi/σi²)(Yi − βxi),
ℓ″(β) = −∑_{i=1}^n xi²/σi².

Therefore, I_Y(β) = −E_β[ℓ″(β)] = ∑_{i=1}^n xi²/σi². From part (b), this is exactly 1/V[β̂].
(d) Taking the derivative of ∑_i (Yi − βxi)² and solving gives

β̃ = (∑_{i=1}^n xiYi) / (∑_{i=1}^n xi²),

and its expectation is

E[β̃] = (∑_{i=1}^n xi E[Yi]) / (∑_{i=1}^n xi²) = (∑_{i=1}^n βxi²) / (∑_{i=1}^n xi²) = β.

(e) The variance of β̃ is given by

V[β̃] = (∑_{i=1}^n xi² V[Yi]) / (∑_{i=1}^n xi²)² = (∑_{i=1}^n xi²σi²) / (∑_{i=1}^n xi²)².

If σi² = σ0², then this reduces to

V[β̃] = σ0² / ∑_{i=1}^n xi².

In our case the model matrix X is the single column vector (x1, . . . , xn)^T, so that (X^T X)^{−1} = 1/∑_{i=1}^n xi². Hence the variance formula above is consistent with the general formula σ0²(X^T X)^{−1}.

(f) Applying the Cauchy–Schwarz inequality with ai = |xiσi| and bi = |xi/σi|, we obtain the inequality

(∑_{i=1}^n xi²σi²)(∑_{i=1}^n xi²/σi²) ≥ (∑_{i=1}^n xi²)²,

and hence, rearranging terms,

V[β̃] = (∑_{i=1}^n xi²σi²) / (∑_{i=1}^n xi²)² ≥ 1/(∑_{i=1}^n xi²/σi²) = V[β̂].

The Cramér–Rao lower bound states that any unbiased estimator has a variance no smaller than the inverse Fisher information. Since the variance of β̂ attains the lower bound by part (c), and β̃ is unbiased, the above inequality is expected.
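The two variance formulas can be compared numerically in R (a minimal sketch; the sample size, the covariate values, the error standard deviations, and β below are all arbitrary illustrative choices):

# Monte Carlo variances of the weighted (beta-hat) and unweighted (beta-tilde) estimators
set.seed(1)
n <- 50; beta <- 2
x   <- runif(n, 1, 3)            # fixed covariates
sig <- runif(n, 0.5, 3)          # known, unequal error standard deviations
B <- 5000
est <- replicate(B, {
  y <- beta * x + rnorm(n, 0, sig)
  c(wls = sum(x * y / sig^2) / sum(x^2 / sig^2),
    ols = sum(x * y) / sum(x^2))
})
apply(est, 1, var)                                         # empirical variances
c(1 / sum(x^2 / sig^2), sum(x^2 * sig^2) / sum(x^2)^2)     # theoretical variances from (b) and (e)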

Problem 13. (The delta method for two samples)


IID IID
Let X1 , . . . , Xn ∼ Bernoulli(p), and let Y1 , . . . , Ym ∼ Bernoulli(q), where the Xi ’s and Yi ’s are independent. For
example, X1 , . . . , Xn may represent, among n individuals exposed to a certain risk factor for a disease, which individuals
have this disease, and Y1 , . . . , Ym may represent, among m individuals not exposed to this risk factor, which individuals
have this disease. The odds-ratio

(p/(1 − p)) / (q/(1 − q))

provides a quantitative measure of the association between this risk factor and this disease. The log-odds-ratio is the (natural) logarithm of this quantity,

log[ (p/(1 − p)) / (q/(1 − q)) ].
(a) Suggest reasonable estimators p̂ and q̂ for p and q, and suggest a plugin estimator for the log-odds-ratio.
(b) Using the first-order Taylor expansion
g(p̂, q̂) ≈ g(p, q) + (p̂ − p) ∂g/∂p (p, q) + (q̂ − q) ∂g/∂q (p, q)
as well as the Central Limit Theorem and independence of the Xi ’s and Yi ’s, derive an asymptotic normal approxi-
mation to the sampling distribution of your plugin estimator in part (a).
(c) Give an approximate 95% confidence interval for the log-odds-ratio log[(p/(1 − p))/(q/(1 − q))]. Translate this into an approximate 95% confidence interval for the odds-ratio (p/(1 − p))/(q/(1 − q)). (You may use a plugin estimate for the variance of the normal distribution you derived in part (b).)

Solution. (a) We may estimate p by p̂ = X̄, and q by q̂ = Ȳ. The plugin estimator for the log-odds-ratio is

log[ (p̂/(1 − p̂)) / (q̂/(1 − q̂)) ].

(b) Let

g(p, q) = log[(p/(1 − p))/(q/(1 − q))] = log p − log(1 − p) − log q + log(1 − q).

Applying a first-order Taylor expansion to g,

g(p̂, q̂) ≈ g(p, q) + (p̂ − p)/(p(1 − p)) − (q̂ − q)/(q(1 − q)).

p̂ and q̂ are independent, and by the Central Limit Theorem, √n(p̂ − p) → N(0, p(1 − p)) and √m(q̂ − q) → N(0, q(1 − q)). Hence, for large m and n, g(p̂, q̂) is approximately distributed as N(g(p, q), v), where

v = (p(1 − p)/n) × 1/(p²(1 − p)²) + (q(1 − q)/m) × 1/(q²(1 − q)²) = 1/(np(1 − p)) + 1/(mq(1 − q)).

(c) Let v̂ = 1/(np̂(1 − p̂)) + 1/(mq̂(1 − q̂)) be the plugin estimate of v. As m, n → ∞, v̂/v → 1 in probability, so by part (b) and Slutsky's lemma,

P[−Z_{0.025} ≤ (g(p̂, q̂) − g(p, q))/√v̂ ≤ Z_{0.025}] ≈ 0.95

for large m and n. Rearranging yields a 95% confidence interval for g(p, q) given by

g(p̂, q̂) ± Z_{0.025}√v̂ = log[(p̂/(1 − p̂))/(q̂/(1 − q̂))] ± Z_{0.025} √(1/(np̂(1 − p̂)) + 1/(mq̂(1 − q̂))).

Denoting this interval by [L(p̂, q̂), U(p̂, q̂)], we may exponentiate to obtain the confidence interval (e^{L(p̂,q̂)}, e^{U(p̂,q̂)}) for the odds-ratio (p/(1 − p))/(q/(1 − q)).
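The interval is easy to compute from a 2×2 table of counts; here is a small R helper (a sketch; the function name and the example counts are made up purely for illustration):

# Approximate 95% CI for the odds-ratio from xA successes out of n (exposed) and xB out of m (unexposed)
odds_ratio_ci <- function(xA, n, xB, m, level = 0.95) {
  pA <- xA / n; pB <- xB / m
  est <- log(pA / (1 - pA)) - log(pB / (1 - pB))                 # log-odds-ratio
  se  <- sqrt(1 / (n * pA * (1 - pA)) + 1 / (m * pB * (1 - pB)))
  z   <- qnorm(1 - (1 - level) / 2)
  exp(c(lower = est - z * se, estimate = est, upper = est + z * se))
}
odds_ratio_ci(xA = 30, n = 200, xB = 15, m = 250)   # illustrative counts only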

Problem 14. (Laplace distribution and Bayesian Analysis)

The double-exponential distribution with mean µ and scale b is a continuous distribution over R with PDF
 
f(x | µ, b) = (1/(2b)) exp(−|x − µ|/b).

It is sometimes used as an alternative to the normal distribution to model data with heavier tails, as this PDF decays
exponentially in |x − µ| rather than in (x − µ)2 .

(a) What are the MLEs µ̂ and b̂ given data X1, . . . , Xn? Why is this MLE µ̂ more robust to outliers than the MLE µ̂ in the N(µ, σ²) model?
You may assume that n is odd and that the data values X1 , . . . , Xn are all distinct. (Hint: The log-likelihood is
differentiable in b but not in µ. To find the MLE µ̂, you will need to reason directly from its definition.)
Solution. The joint log-likelihood is

ℓ(µ, b) = −n log(2b) − (1/b)∑_{i=1}^n |Xi − µ|.

The likelihood is differentiable in b, so differentiating with respect to b gives

∂ℓ/∂b = −n/b + (1/b²)∑_{i=1}^n |Xi − µ|.

Setting this equal to 0, substituting in the MLE µ̂ for µ, and solving gives the MLE for b as

b̂ = (1/n)∑_{i=1}^n |Xi − µ̂|.

We can see that the MLE µ̂ is the value of µ that minimizes the total absolute deviation K(µ) = ∑_{i=1}^n |Xi − µ|. Without loss of generality assume that the X1, . . . , Xn are ordered. We shall see that the minimizer is the sample median µ̂ = Xm, where m = (n + 1)/2. We see that K(µ) is continuous everywhere (it is the sum of absolute value functions), and furthermore, it is decreasing for µ < Xm and increasing for µ > Xm.

Therefore the minimizer is given by µ̂ = Xm . This estimator is more robust to outliers because it only depends on
the middle few ordered values, so a few data points with extreme values won’t change the median, whereas the mean
depends on all data points.

(b) Suppose it is known that µ = 0. In a Bayesian analysis, let us model the scale parameter as a random variable B
with prior distribution B ∼ InverseGamma(α, β), where α, β > 0. If X1 , . . . , Xn ∼ Laplace (0, b) when B = b, what
are the posterior distribution and posterior mean of B given the data X1 , . . . , Xn ?
(Hints: The InverseGamma (α, β) distribution is a continuous distribution on (0, ∞) with PDF

f(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}

and with mean β/(α − 1) when α > 1.)

Solution. If µ = 0 and B ∼ InverseGamma(α, β), then the posterior density is given by

f(B | α, β, X1, . . . , Xn) ∝ f(X1, . . . , Xn | B) f(B | α, β)
  = (1/(2B)^n) exp{−(1/B)∑_{i=1}^n |Xi|} · (β^α/Γ(α)) B^{−α−1} e^{−β/B}
  ∝ B^{−(α+n)−1} exp{−(1/B)(β + ∑_{i=1}^n |Xi|)},

where we have dropped any normalizing constants into the proportionality term. From here, we can see that the posterior distribution of B follows an InverseGamma(α + n, β + ∑|Xi|) distribution, and therefore has posterior mean (β + ∑|Xi|)/(α + n − 1).

(c) Still supposing it is known that µ = 0, what is the MLE b̂ for b in this sub-model? How does this compare to the
posterior mean from part (b) when n is large?

Solution. The MLE for b when µ = 0 is

b̂ = (1/n)∑_{i=1}^n |Xi|.

We can write the posterior mean as a weighted average of the prior mean and the MLE,

(β + ∑_{i=1}^n |Xi|)/(α + n − 1) = [(α − 1)/(α + n − 1)] · β/(α − 1) + [n/(α + n − 1)] · (1/n)∑_{i=1}^n |Xi|,

from which we see that the posterior mean tends to the MLE as n → ∞.

Problem 15. (Bayesian inference for Multinomial Proportions)

The Dirichlet (α1 , . . . , αK ) distribution with parameters α1 , . . . , αK > 0 is a continuous joint distribution over K random
variables (P1 , . . . , PK ) such that 0 ≤ Pi ≤ 1 for all i = 1, . . . , K and P1 + . . . + PK = 1. It has (joint) PDF
f(p1, . . . , pK | α1, . . . , αK) ∝ p1^{α1−1} × . . . × pK^{αK−1}.

Letting α0 = α1 + . . . + αK, this distribution satisfies

E[Pi] = αi/α0,  Var[Pi] = αi(α0 − αi) / (α0²(α0 + 1)).
(a) Let (X1 , . . . , X6 ) ∼ Multinomial (n, (p1 , . . . , p6 )) be the numbers of 1’s through 6’s obtained in n rolls of a (possibly
biased) die. Let us model (P1 , . . . , P6 ) as random variables with prior distribution Dirichlet (α1 , . . . , α6 ). What
is the posterior distribution of (P1 , . . . , P6 ) given the observations (X1 , . . . , X6 )? What is the posterior mean and
variance of P1 ?
(b) How might you choose the prior parameters α1 , . . . , α6 to represent a strong prior belief that the die is close to fair
(meaning p1 , . . . , p6 are all close to 1/6)?
(c) How might you choose an improper Dirichlet prior to represent no prior information? How do the posterior mean
estimates of p1 , . . . , p6 under this improper prior compare to the MLE?

Solution.
(a) The posterior distribution has density proportional to

P1^{α1−1} × · · · × P6^{α6−1} × P1^{X1} × · · · × P6^{X6} = P1^{α1+X1−1} × · · · × P6^{α6+X6−1}.

So the posterior distribution of (P1, . . . , P6) given (X1, . . . , X6) is Dirichlet(α1 + X1, . . . , α6 + X6). Writing α0 = ∑_j αj and noting that ∑_{j=1}^6 Xj = n, the posterior mean and variance are given by

E[Pi | X1, . . . , X6] = (αi + Xi)/(α0 + n);  V[Pi | X1, . . . , X6] = (αi + Xi)(α0 + n − αi − Xi) / ((α0 + n)²(α0 + n + 1)).

(b) We would like to select the parameters αi such that


• The prior mean is 1/6 for each i,
• The prior variance is small.
Since

E[Pi] = αi / ∑_{j=1}^6 αj,

a prior mean of 1/6 can be achieved by setting αi = α for each i. Then the variance is given by

V[Pi] = α(6α − α) / ((6α)²(6α + 1)) = 5/(36(6α + 1)),
from which we see that a large value of α achieves a small variance. (The stronger our prior belief that the die is
fair, the larger we would set α.)

(c) From the posterior mean calculated in part (a), we can interpret the parameters αi as "prior counts," so an uninformative prior sets αi = 0. Then the posterior mean is

E[Pi | X1, . . . , X6] = Xi / ∑_{j=1}^6 Xj = Xi/n,

which is the same as the MLE.

Problem 16. (Basics of Testing)

(a) Let Z be a random variable with probability density function f(z) = (1/2)e^{−|z−µ|}, z ∈ R, with parameter µ ∈ R. Suppose we observe X = max(0, Z).

i. Find the constant c such that the test that “rejects when X > c” has size 0.05 for the null hypothesis H0 : µ = 0.

ii. Find the power of this test against the alternative hypothesis H1 : µ = 2.

(b) Suppose that X1 , . . . , Xn form a random sample from the uniform distribution on the interval [0, θ], and that the
following hypotheses are to be tested:
H0 : θ ≥ 2,
H1 : θ < 2.
Let Yn = max {X1 , . . . , Xn }, and consider a test procedure such that the critical region contains all the outcomes
for which Yn ≤ 1.5.

i. Determine the power function of the test.

ii. Determine the size of the test.

(c) Let X be a single observation of an Exp(λ) random variable, which has PDF

fλ(x) = λ exp(−λx) if x ≥ 0, and 0 if x < 0.

Consider testing H0 : λ ≥ λ0 versus H1 : λ < λ0 .

i. Find the power function of the hypothesis test that rejects H0 if and only if X ≥ c.

ii. Let 0 < α < 1. Find a value of c such that the test in part (i) has size α.

iii. For what true values of λ is Pλ (type II error) ≥ 1/2 for the test in part (i) with size α as in (ii)?

IID
(d) Let X1 , X2 ∼ Bin(1, θ), and consider testing H0 : θ = 1/3 versus H1 : θ < 1/3.

i. Find a test that has size 2/9 exactly. Note: It does not have to be a sensible test.

ii. Find the power function of the test from part (i), and use it to explain why this test is not a good test of
these hypotheses.

(e) Suppose X is a random variable on {0, 1, 2, . . .} with unknown PMF p(x). To test the hypothesis H0 : X ∼
Poisson(1/2) against H1 : p(x) = 2−(x+1) for all x ∈ {0, 1, 2, . . .}, we reject H0 if x > 2. The probability of type-II
error for this test is

(A) 1/4;  (B) 1 − (13/8)e^{−1/2};  (C) 1 − (3/2)e^{−1/2};  (D) 7/8.
Solution.
(a) i. We need P_{H0}(X > c) = 0.05. Under H0, µ = 0, so the pdf of Z is f(z) = (1/2)e^{−|z|}. As the support of Z is R, we can partition it into {Z ≥ 0, Z < 0} and condition on this partition:

P_{H0}(X > c) = P_{H0}(X > c, Z ≥ 0) + P_{H0}(X > c, Z < 0) = P_{H0}(X > c, Z ≥ 0) = P_{H0}(Z > c).

Thus, P_{H0}(X > c) = P_{H0}(Z > c) = ∫_c^∞ (1/2)e^{−|z|} dz = (1/2)e^{−c}.

Equating (1/2)e^{−c} with 0.05, we get c = ln 10.

ii. The power of the test against H1 is given by:

P_{H1}(X > ln 10) = P_{H1}(Z > ln 10) = ∫_{ln 10}^∞ (1/2)e^{−|z−2|} dz = e²/20.

(b) i. The power function of the test is

Power(θ) = Pθ(Yn ≤ 1.5) = Pθ(max_{1≤i≤n} Xi ≤ 1.5) = ∏_{i=1}^n Pθ(Xi ≤ 1.5) = [Pθ(X1 ≤ 1.5)]^n = (1.5/θ)^n

for θ ≥ 1.5, and Power(θ) = 1 for θ < 1.5.

ii. Power(θ) is a non-increasing function of θ, so

sup_{θ≥2} Power(θ) = Power(2) = (1.5/2)^n = (3/4)^n.

Thus, the size of the test is (3/4)^n.


(c) i. Power(λ) = Pλ(X ≥ c) = ∫_c^∞ fλ(x) dx = exp(−λc).

ii. Power(λ) is a non-increasing function of λ, so

sup_{λ≥λ0} Power(λ) = Power(λ0) = exp(−λ0 c).

Thus, the size of the test is exp(−λ0 c). Then the test has size α if and only if

exp(−λ0 c) = α ⟺ c = −(log α)/λ0,

noting that log α is negative since 0 < α < 1.

iii. Pλ(type II error) ≥ 1/2 if and only if both λ < λ0 and Power(λ) ≤ 1/2. The test in part (i) with size α as in (ii) has power function

Power(λ) = exp(−λ(−log α/λ0)) = α^{λ/λ0},

and hence

Power(λ) ≤ 1/2 ⟺ λ ≥ −(λ0 log 2)/(log α),

again noting that log α is negative. Thus, Pλ(type II error) ≥ 1/2 if and only if

−(λ0 log 2)/(log α) ≤ λ < λ0.

(Note that if α ≥ 1/2, then there are no such values of λ.)

(d) i. Note that there are only four possible values of (X1, X2), i.e., the sample space consists of only four points. If θ = 1/3, then (X1, X2) equals (0, 0) with probability 4/9, (0, 1) with probability 2/9, (1, 0) with probability 2/9, and (1, 1) with probability 1/9.
Thus, the only tests with size 2/9 exactly are the test that rejects H0 if and only if (X1, X2) = (0, 1) and the test that rejects H0 if and only if (X1, X2) = (1, 0).

ii. Power(θ) = θ(1 − θ) for both of the tests from part (i). Note that Power(1/3) > Power(θ) for all θ < 1/3. Thus, these tests are more likely to reject H0 if it is true than if it is false, which is exactly the opposite of what a good hypothesis test should do.

(e) Answer: (D). Given X ∼ p(x), reject H0 if x > 2.
P(Type-II error) = P(accepting H0 when it is false) = 1 − P(Reject H0 | H1 is true) = 1 − P_{H1}(Reject H0) = 1 − P_{H1}(X > 2). Now, we compute P_{H1}(X > 2) as follows:

P(X > 2) = 1 − P(X ≤ 2) = 1 − [P(X = 0) + P(X = 1) + P(X = 2)] = 1 − (2^{−1} + 2^{−2} + 2^{−3}) = 1 − 7/8 = 1/8.

Hence, P(Type-II error) = 1 − 1/8 = 7/8, which is option (D).

Problem 17. (Likelihood ratio tests)

(a) For data X1 , . . . , Xn ∈ R and two fixed and known values σ02 < σ12 , consider the following testing problem:
IID 
H0 : X1 , . . . , Xn ∼ N 0, σ02
IID 
H1 : X1 , . . . , Xn ∼ N 0, σ12

What is the most powerful test for testing H0 versus H1 at level α? Letting χ²_n(α) denote the 1 − α quantile of the χ²_n distribution, explicitly describe the test statistic T and the rejection region for this test.
(b) What is the distribution of this test statistic T under the alternative hypothesis H1 ? Using this result and letting
F denote the CDF of the χ2n distribution, provide a formula for the power of this test against H1 in terms of
χ2n (α), σ02 , σ12 , and F . Keeping σ02 fixed, what happens to the power of the test as σ12 increases to ∞ ?
(c) Consider two probability density functions on [0, 1] : f0 (x) = 1 and f1 (x) = 2x. Among all tests of the null hypothesis
H0 : X ∼ f0 (x) versus the alternative H1 : X ∼ f1 (x) with significance level α = 0.10, how large can the power
possibly be?

(d) Let X1 and X2 be a random sample from a distribution having the probability density function f (x). Consider the
testing of H0 : f (x) = f0 (x) against H1 : f (x) = f1 (x) based on X1 and X2 , where
f0(x) = 1 for 0 < x < 1 (and 0 otherwise), and f1(x) = 4x³ for 0 < x < 1 (and 0 otherwise).

(i) For a given level, show that the critical region of the most powerful test for testing H0 against H1 is of the form

{(x1 , x2 ) : ln x1 + ln x2 > c}

for some constant c.


(ii) Determine c in terms of a suitable cutoff point of a Chi-square distribution when the level is α.
(e) Let X1, . . . , Xn be i.i.d. observations from the density

f(x) = (1/µ) exp(−x/µ), x > 0,

where µ > 0 is an unknown parameter. Consider the problem of testing the hypothesis H0: µ ≤ µ0 against H1: µ > µ0.
Show that the test with critical region [X̄ ≥ µ0 χ²_{2n,1−α}/(2n)], where χ²_{2n,1−α} is the (1 − α)th quantile of the χ²_{2n} distribution, has size α. Give an expression for the power in terms of the c.d.f. of the χ²_{2n} distribution.

Solution.

(a) The joint PDF under H0 is

f0(x1, . . . , xn) = (1/√(2πσ0²))^n exp(−∑_{i=1}^n xi²/(2σ0²)).

The joint PDF under H1 is

f1(x1, . . . , xn) = (1/√(2πσ1²))^n exp(−∑_{i=1}^n xi²/(2σ1²)).

So the likelihood ratio statistic is

L(X1, . . . , Xn) = f0(X1, . . . , Xn)/f1(X1, . . . , Xn) = (σ1/σ0)^n exp( ((σ0² − σ1²)/(2σ0²σ1²)) ∑_{i=1}^n Xi² ).

Since σ0² < σ1², L is a decreasing function of T := ∑_{i=1}^n Xi². Then rejecting for small values of L is the same as rejecting for large values of T.
Since under H0, ∑_{i=1}^n (Xi/σ0)² ∼ χ²_n, we have T/σ0² ∼ χ²_n, so T ∼ σ0²χ²_n. Then the rejection threshold should be c = σ0²χ²_n(α), and the most powerful test rejects H0 when T > c.

(b) Under H1, ∑_{i=1}^n (Xi/σ1)² ∼ χ²_n, so T ∼ σ1²χ²_n. Then the probability of type II error is

β = P_{H1}[accept H0] = P_{H1}[T ≤ σ0²χ²_n(α)] = P_{H1}[T/σ1² ≤ (σ0²/σ1²)χ²_n(α)] = F((σ0²/σ1²)χ²_n(α)),

where F is the χ²_n CDF. The power of the test is then

Power = 1 − β = 1 − F((σ0²/σ1²)χ²_n(α)).

As σ1² → ∞, β → F(0) = 0 and the power of the test → 1.

(c) The likelihood ratio statistic is

L(X) = f0(X)/f1(X) = 1/(2X).

The condition L(X) < c is then equivalent to X > c̃, where c̃ = 1/(2c). Under the hypothesis H0, X ∼ Uniform(0, 1), so the rejection threshold c̃ should be 1 − 0.1 = 0.9, i.e. the most powerful test rejects H0 when X > 0.9. Under the hypothesis H1, X ∼ f1(x) = 2x. Then the type II error probability is

β = P_{H1}[accept H0] = P_{H1}[X ≤ 0.9] = ∫_0^{0.9} 2x dx = 0.81.

Thus the power of the test is

Power = 1 − β = 0.19.

This is the maximum power that can be achieved: according to the Neyman–Pearson lemma, for any other test of H0 with significance level at most 0.1, its power against H1 is at most 0.19.

(d) (i) By the NP Lemma, the most powerful critical region is given by

{x : f_{H1}(x)/f_{H0}(x) > k}, where k is a constant that depends on the level requirement.

Now, f_{H1}(x)/f_{H0}(x) = (∏_{i=1}^2 4xi³) / (∏_{i=1}^2 1) = 16(x1x2)³ > k
⇒ (x1x2)³ > k1 ⇒ 3 ln(x1x2) > k2 ⇒ ln x1 + ln x2 > c.
∴ W = {(x1, x2) : ln x1 + ln x2 > c} for some constant c.

(ii) Given level α,

P_{H0}(Reject H0) = α ⇒ P_{H0}(ln X1 + ln X2 > c) = α.

Under H0, X1, X2 iid ∼ U(0, 1), so −2 ln X1 ∼ χ²_(2) and −2 ln X1 − 2 ln X2 ∼ χ²_(4).

∴ P_{H0}(−2 ln X1 − 2 ln X2 < −2c) = α
⇒ P_{H0}(χ²_(4) < −2c) = α
∴ −2c = χ²_{4;α} ⇒ c = −(1/2)χ²_{4;α},

where χ²_{4;α} denotes the α-quantile of the χ²_4 distribution. ∴ c = −(1/2)χ²_{4;α} is the required value of c.
(e) The likelihood function of µ for the given sample is L(µ | x) = (1/µ^n) exp(−∑_{i=1}^n xi/µ), µ > 0; also observe that the sample mean X̄ is the MLE of µ. So the likelihood ratio statistic is

λ(x) = sup_{µ≤µ0} L(µ | x) / sup_µ L(µ | x) = 1 if µ0 ≥ x̄, and = L(µ0 | x)/L(x̄ | x) if µ0 < x̄.

So, our test function is φ(x) = 1 if λ(x) < k, and 0 otherwise.
We reject H0 at size α when φ(x) = 1, for some k with E_{H0}(φ) ≤ α. Hence, λ(x) < k

⇒ L(µ0 | x) < k L(x̄ | x)
⇒ −n ln µ0 − (1/µ0)∑_{i=1}^n xi < ln k − n ln x̄ − n
⇒ n ln x̄ − nx̄/µ0 < K*

for some constant K*.

Let g(x̄) = n ln x̄ − nx̄/µ0, and observe that g is a decreasing function of x̄ for x̄ ≥ µ0. Hence, there exists a c such that for x̄ ≥ c we have g(x̄) < K*. So the critical region of the test is of the form X̄ ≥ c, for some c such that P_{H0}(X̄ ≥ c) = α, where α is the size of the test. Now, our task is to find c, and for that observe that if X ∼ Exponential(θ) (mean θ), then 2X/θ ∼ χ²_2.

Hence, in this problem, since the Xi's follow Exponential(µ), we have 2nX̄/µ ∼ χ²_{2n}, so

P_{H0}(X̄ ≥ c) = α
⇒ P_{H0}(2nX̄/µ0 ≥ 2nc/µ0) = α
⇒ P(χ²_{2n} ≥ 2nc/µ0) = α,

which gives c = µ0 χ²_{2n;1−α}/(2n). Hence, the rejection region is indeed X̄ ≥ µ0 χ²_{2n;1−α}/(2n). Hence proved!

Now, the power of the test is

β(µ) = E_µ(φ) = P_µ(λ(x) < k) = P_µ(X̄ ≥ µ0 χ²_{2n;1−α}/(2n)) = P_µ(χ²_{2n} ≥ (µ0/µ) χ²_{2n;1−α}).

Hence, the power of the test is expressed through the c.d.f. of the χ²_{2n} distribution.

Problem 18. (MP tests and GLRT test)

(a) Let X1, . . . , X10 be a random sample of size 10 from a population having probability density function

f(x | θ) = θ/x^{θ+1} if x > 1, and 0 otherwise,

where θ > 0. For testing H0: θ = 2 against H1: θ = 4 at the level of significance α = 0.05, find the most powerful test. Also, find the power of this test.

(b) Let X1, . . . , Xn be a random sample from the population having probability density function

f(x, θ) = (2x/θ²) e^{−x²/θ²} if x > 0, and 0 otherwise.

Obtain the most powerful test for testing H0: θ = θ0 against H1: θ = θ1 (θ1 < θ0).


(c) Let X1 , X2 , . . . , X5 be a random sample from a N 2, σ 2 distribution, where σ 2 is unknown. Derive the most
powerful test of size α = 0.05 for testing H0 : σ 2 = 4 against H1 : σ 2 = 1.

(d) Let X1, . . . , Xn iid ∼ N(µ, σ²), where both µ and σ² are unknown. Consider the problem of testing

H0: µ = 0
H1: µ ≠ 0

Show that the generalized likelihood ratio test statistic for this problem simplifies to

Λ(X1, . . . , Xn) = ( ∑_{i=1}^n (Xi − X̄)² / ( ∑_{i=1}^n (Xi − X̄)² + nX̄² ) )^{n/2}.

Letting S²_X = (1/(n−1))∑_{i=1}^n (Xi − X̄)² and T = √n X̄/S_X (the usual one-sample t-statistic for this problem), show that Λ(X1, . . . , Xn) is a monotonically decreasing function of |T|, and hence the generalized likelihood ratio test is equivalent to the two-sided t-test which rejects for large values of |T|.

Solution.
(a) By the NP Lemma, the most powerful test rejects H0 if f_{H1}(x)/f_{H0}(x) > k, where k is a constant that depends on the level requirement.

f_{H1}(x)/f_{H0}(x) = (∏_{i=1}^{10} 4/xi⁵) / (∏_{i=1}^{10} 2/xi³) = 2^{10} / (∏_{i=1}^{10} xi)² > k
⇒ ∏_{i=1}^{10} xi < k′ ⇒ ∑_{i=1}^{10} ln xi < c.

Now, at level α = 0.05 we need P_{H0}(Reject H0) = 0.05 ⇒ P_{H0}(∑_{i=1}^{10} ln Xi < c) = 0.05.

If Xi has the given pdf, then ln Xi ∼ Exp(mean 1/θ). Under H0,

ln Xi ∼ Exp(mean 1/2)
∴ 4 ln Xi ∼ Exp(mean 2)
∴ P(∑_{i=1}^{10} 4 ln Xi < 4c) = 0.05 ⇒ P(χ²_20 < 4c) = 0.05 ⇒ P(χ²_20 > 4c) = 0.95
∴ 4c = χ²_{20,0.95} ⇒ c = (1/4)χ²_{20,0.95} = 10.85/4 = 2.7125.

∴ Reject H0 if ∑_{i=1}^{10} ln Xi < 2.7125.

Power of the test: P_{H1}(Reject H0) = P_{H1}(∑_{i=1}^{10} ln Xi < 2.7125). Under H1,

ln Xi ∼ Exp(mean 1/4)
∴ 8 ln Xi ∼ Exp(mean 2)
∴ P_{H1}(∑_{i=1}^{10} 8 ln Xi < 8(2.7125)) = P(χ²_20 < 21.7) = 1 − P(χ²_20 > 21.7) = 1 − 0.36 = 0.64.
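The two numerical values above can be reproduced directly in R (a one-line check using base R quantile and distribution functions):

# chi^2_{20,0.95} in the text's notation is the 5% quantile of chi^2_20
c_cut <- qchisq(0.05, df = 20) / 4       # approx 2.7125, the cutoff for sum(log X_i)
power <- pchisq(8 * c_cut, df = 20)      # P(chi^2_20 < 21.7), approx 0.64
c(cutoff = c_cut, power = power)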

(b) By the NP Lemma, the most powerful test rejects H0 if f_{H1}(x)/f_{H0}(x) > k, where k is a constant that depends on the level requirement.

f_{H1}(x)/f_{H0}(x) = ∏_{i=1}^n (2xi/θ1²)e^{−xi²/θ1²} / ∏_{i=1}^n (2xi/θ0²)e^{−xi²/θ0²} = (θ0/θ1)^{2n} exp{−∑_{i=1}^n xi² (1/θ1² − 1/θ0²)} > k

⇒ −∑_{i=1}^n xi² (1/θ1² − 1/θ0²) > k′
⇒ ∑_{i=1}^n xi² < c   [∵ θ1 < θ0 ⇒ 1/θ1² > 1/θ0² ⇒ 1/θ1² − 1/θ0² > 0]

Now, if X has the given distribution then X² ∼ Exp(mean = θ²), so

2X²/θ² ∼ Exp(mean = 2).

Under H0,

2Xi²/θ0² ∼ Exp(mean = 2) ∴ ∑_{i=1}^n 2Xi²/θ0² ∼ χ²_{2n}.

Let the test be a level α test:

P_{H0}(∑_{i=1}^n Xi² < c) = α
⇒ P_{H0}(∑_{i=1}^n 2Xi²/θ0² < 2c/θ0²) = α
⇒ P_{H0}(χ²_{2n} < 2c/θ0²) = α
⇒ P_{H0}(χ²_{2n} > 2c/θ0²) = 1 − α
∴ 2c/θ0² = χ²_{2n,1−α}
∴ c = (θ0²/2) χ²_{2n,1−α}.

∴ Reject H0 if ∑_{i=1}^n Xi² < (θ0²/2) χ²_{2n,1−α}.

(c) Given X1, X2, . . . , X5 ∼ N(2, σ²). To test

H0: σ² = 4 vs H1: σ² = 1.

By the NP Lemma, the most powerful critical region is given by W = {x : f_{H1}(x)/f_{H0}(x) > k}, where k ∈ R is such that the test is of size α.

f_{H1}(x)/f_{H0}(x) = ∏_{i=1}^5 (1/√(2π)) e^{−(xi−2)²/2} / ∏_{i=1}^5 (1/(2√(2π))) e^{−(xi−2)²/8} = 2⁵ exp{−(1/2)∑(xi − 2)² + (1/8)∑(xi − 2)²} > k

⇒ −(3/8)∑_{i=1}^5 (xi − 2)² > k′ ⇒ ∑_{i=1}^5 (xi − 2)² < c.

∴ Reject H0 if ∑_{i=1}^5 (xi − 2)² < c.
Now, P_{H0}(∑_{i=1}^5 (xi − 2)² < c) = 0.05. Under H0,

X1, X2, . . . , X5 iid ∼ N(2, 4) ∴ ∑_{i=1}^5 ((xi − 2)/2)² ∼ χ²_5.

∴ P_{H0}(∑_{i=1}^5 (xi − 2)²/4 < c/4) = 0.05
⇒ P_{H0}(χ²_5 < c/4) = 0.05
⇒ P_{H0}(χ²_5 > c/4) = 0.95
⇒ c/4 = χ²_{5,0.95} ⇒ c = 4χ²_{5,0.95}.

∴ The MP test of size α = 0.05 rejects H0 if ∑_{i=1}^5 (xi − 2)² < 4χ²_{5,0.95}.

(d) The log-likelihood for the full model is

−(n/2) log(2πσ²) − (1/(2σ²))∑_{i=1}^n (Xi − µ)²,

and the MLEs for µ and σ² are

µ̂ = X̄ and σ̂² = (1/n)∑_{i=1}^n (Xi − µ̂)².

Under the submodel defined by µ = 0, the log-likelihood is

−(n/2) log(2πσ²) − (1/(2σ²))∑_{i=1}^n Xi²

and the MLE for σ² is

σ̃² = (1/n)∑_{i=1}^n Xi².

Therefore, the GLRT statistic (the ratio of the maximized likelihoods) is given by

sup_{σ²} L(0, σ²) / sup_{µ,σ²} L(µ, σ²) = (2πσ̃²)^{−n/2} exp{−(1/(2σ̃²))∑Xi²} / [ (2πσ̂²)^{−n/2} exp{−(1/(2σ̂²))∑(Xi − µ̂)²} ] = (σ̂²/σ̃²)^{n/2}
  = ( ∑_{i=1}^n (Xi − X̄)² / ∑_{i=1}^n Xi² )^{n/2}
  = ( ∑_{i=1}^n (Xi − X̄)² / ( ∑_{i=1}^n (Xi − X̄)² + nX̄² ) )^{n/2}.

We can rewrite this as

Λ(X1, . . . , Xn) = ( (n − 1)S²_X / ((n − 1)S²_X + nX̄²) )^{n/2} = ( (n − 1) / (n − 1 + nX̄²/S²_X) )^{n/2} = ( (n − 1) / (n − 1 + T²) )^{n/2},

which is a decreasing function of T² and hence of |T| as well.

Problem 19. (Sign Test)


IID
Consider data X1 , . . . , Xn ∼ f for some unknown probability density function f , and the testing problem

H0 : f has median 0
H1 : f has median µ for some µ > 0

(a) Explain why the Wilcoxon signed rank statistic does not have the same sampling distribution under every P ∈ H0 .
Draw a picture of the graph of a density function f with median 0, such that the Wilcoxon signed rank statistic
would tend to take larger values under f than under any density function g that is symmetric about 0.
Solution.

For a density function f with median 0 but is skewed right, such as in Fig. (a), positive values of X1 , . . . , Xn
would tend to have a higher rank than negative values, so the Wilcoxon signed rank statistic would tend to take
larger values under f than under any density function g that is symmetric about 0.

Figure 2: f with median 0.

(b) Consider the sign statistic S, defined as the number of values in X1 , . . . , Xn that are greater than 0. Explain why S
has the same sampling distribution under every P ∈ H0 . How would you conduct a level-α test of H0 Vs. H1 using 
the test statistic S? (Describe explicitly the rejection threshold; you may assume that for X ∼ Binomial n, 12 ,
there exists an integer k such that P[X ≥ k] is exactly α.)

Solution. Let Yi = 1 if Xi > 0 and Yi = 0 otherwise. Since f has median 0, P[Yi = 1] = P[Xi > 0] = 1/2. Then

S = ∑_{i=1}^n Yi ∼ Binomial(n, 1/2).

This distribution is the same for any PDF f with median 0. A test of H0 Vs. H1 should reject for large values of
S. To achieve level-α, it should reject when S ≥ k, where k is a value such that P[S ≥ k] = α under H0 . This is
exactly the value of k given in the problem statement.

(c) When n is large, explain why we may reject H0 when S > n/2 + √(n/4) Zα, where Zα is the upper α point of N(0, 1), instead of using the rejection threshold you derived in part (b).

Solution. Note E[Yi] = 1/2, Var[Yi] = 1/4, and S/n = Ȳ. Then by the CLT,

√(4n)(S/n − 1/2) → N(0, 1)

in distribution as n → ∞. So for large n,

α ≈ P[√(4n)(S/n − 1/2) > Zα] = P[S/n > 1/2 + Zα/√(4n)] = P[S > n/2 + √(n/4) Zα],

and we may take n/2 + √(n/4) Zα as an approximate rejection threshold.
(d) In this problem, we'll study the power of this test against the specific alternative N(h/√n, 1), for a fixed constant h > 0 (say h = 1 or h = 2) and large n. If X ∼ N(h/√n, 1), show that

P[X > 0] = Φ(h/√n),

where Φ is the CDF of the standard normal distribution N(0, 1). Applying a first-order Taylor expansion of Φ around 0, show that for large n

P[X > 0] ≈ 1/2 + h/√(2πn).

Solution. Given that X ∼ N(h/√n, 1), we have

P[X > 0] = P[X − h/√n > −h/√n] = 1 − Φ(−h/√n) = Φ(h/√n).

A first-order Taylor expansion for a differentiable function f suggests that

f(x + h) ≈ f(x) + hf′(x).

Applying this to the above and noting that Φ′(x) is the normal PDF ϕ(x) = (1/√(2π)) e^{−x²/2},

Φ(h/√n) ≈ Φ(0) + (h/√n) ϕ(0) = 1/2 + h/√(2πn).
(e) Let X1, . . . , Xn iid ∼ N(h/√n, 1). In this case, show that √(4/n)(S − n/2) has an approximate normal distribution that does not depend on n (but depends on h); what are the mean and variance of this normal distribution? Using this result, derive an approximate formula for the power of the sign test against the alternative N(h/√n, 1), in terms of Zα, h, and the CDF Φ.
Solution. The sign statistic S can be written as

S = ∑_i Yi, where Yi ∼ Bernoulli(P[Xi > 0]).

By the CLT, √n(S/n − E[Yi]) is approximately distributed as N(0, Var[Yi]). Applying part (d), E[Yi] ≈ 1/2 + h/√(2πn), so

√n(S/n − E[Yi]) ≈ √n(S/n − 1/2 − h/√(2πn)) = (1/√n)(S − n/2) − h/√(2π).

For large n,

Var[Yi] ≈ (1/2 + h/√(2πn))(1/2 − h/√(2πn)) ≈ (1/2) × (1/2) = 1/4.

So (1/√n)(S − n/2) is approximately distributed as N(h/√(2π), 1/4). Multiplying by 2, √(4/n)(S − n/2) is approximately distributed as N(2h/√(2π), 1). The power of the sign test against the alternative N(h/√n, 1) is given by

P[S > n/2 + √(n/4) Zα] = P[√(4/n)(S − n/2) − 2h/√(2π) > Zα − 2h/√(2π)] ≈ 1 − Φ(Zα − 2h/√(2π)) = Φ(2h/√(2π) − Zα).
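The accuracy of this approximation can be examined by simulation in R (a sketch; n, h, α, and the replication count are arbitrary illustrative choices):

# Simulated power of the sign test against N(h/sqrt(n), 1) versus the normal approximation above
set.seed(1)
n <- 400; h <- 2; alpha <- 0.05; B <- 5000
z <- qnorm(1 - alpha)
reject <- replicate(B, {
  x <- rnorm(n, mean = h / sqrt(n))
  sum(x > 0) > n / 2 + sqrt(n / 4) * z
})
c(simulated = mean(reject), approx = pnorm(2 * h / sqrt(2 * pi) - z))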

Problem 20. (Comparing Binomial proportions)

The popular search engine Google would like to understand whether visitors to a website are more likely to click on an
advertisement at the top of the page than one on the side of the page. They conduct an “AB test” in which they show
n visitors (group A) a version of the website with the advertisement at the top and m visitors (group B) a version of
the website with the (same) advertisement at the side. They record how many visitors in each group clicked on the
advertisement.

(a) Formulate this problem as a hypothesis test. (You may assume that visitors in group A independently click on the
ad with probability pA and visitors in group B independently click on the ad with probability pB , where both pA and
pB are unknown probabilities in (0, 1).) What are the null and alternative hypotheses? Are they simple or composite?

IID IID
Solution. Let X1 , . . . , Xn ∼ Bernoulli (pA ) and Y1 , . . . , Ym ∼ Bernoulli (pB ) be indicators of whether each visitor
clicked on the ad. We wish to test
H 0 : p A = pB
H 1 : pA > p B
Both hypotheses are composite, as they do not specify the exact value of pA or pB .

(b) Let p̂A be the fraction of visitors in group A who clicked on the ad, and similarly for p̂B . A reasonable intuition is
to reject H0 when p̂A − p̂B is large. What is the variance of p̂A − p̂B ? Is this the same for all data distributions in H0 ?

Solution. As np̂A ∼ Binomial(n, pA), we have Var[np̂A] = npA(1 − pA), so Var[p̂A] = pA(1 − pA)/n. Similarly Var[p̂B] = pB(1 − pB)/m. Since p̂A and p̂B are independent,

Var[p̂A − p̂B] = Var[p̂A] + Var[−p̂B] = Var[p̂A] + Var[p̂B] = pA(1 − pA)/n + pB(1 − pB)/m.

Under H0, pA = pB = p for some p ∈ (0, 1), and this variance is p(1 − p)(1/n + 1/m). This is not the same for all data distributions in H0, as it depends on p. (So we cannot perform a test of H0 directly using the test statistic p̂A − p̂B.)

(c) Describe a way to estimate the variance of p̂A − p̂B using the available data, assuming H0 is true – call this estimate V̂. Explain heuristically why, when n and m are both large, the test statistic

T = (p̂A − p̂B) / √V̂

is approximately distributed as N(0, 1) under any data distribution in H0. (You may assume that when n and m are both large, the ratio of V̂ to the true variance of p̂A − p̂B that you derived in part (b) is very close to 1 with high probability.) Explain how to use this observation to perform an approximate level-α test of H0 versus H1.

Solution. One way of estimating the variance is to take

V̂ = p̂A(1 − p̂A)/n + p̂B(1 − p̂B)/m.

Another way (since pA = pB under H0) is to first estimate a pooled sample proportion

p̂ = (np̂A + mp̂B)/(n + m)

and then estimate the variance as

V̂ = p̂(1 − p̂)(1/n + 1/m).

(Both ways are reasonable under H0.)
Under H0, pA = pB = p for some p ∈ (0, 1), so the CLT implies √n(p̂A − p)/√(p(1 − p)) → N(0, 1) and √m(p̂B − p)/√(p(1 − p)) → N(0, 1) in distribution as n, m → ∞. So for large n and m, the distributions of p̂A and p̂B are approximately N(p, p(1 − p)/n) and N(p, p(1 − p)/m). Note that p̂A and p̂B are independent, so their difference is distributed approximately as N(0, p(1 − p)(1/n + 1/m)). We may write the test statistic T as

T = [ (p̂A − p̂B)/√(p(1 − p)(1/n + 1/m)) ] × [ √(p(1 − p)(1/n + 1/m)) / √V̂ ].

Since p(1 − p)(1/n + 1/m)/V̂ ≈ 1 with high probability for large n and m, T is approximately distributed as N(0, 1), and an asymptotic level-α test rejects H0 for T > Zα.
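The pooled test is straightforward to code in R (a sketch; the function name and the counts in the example call are made up for illustration):

# Pooled two-proportion z-test as derived above
prop_z_test <- function(xA, n, xB, m, alpha = 0.05) {
  pA <- xA / n; pB <- xB / m
  p  <- (xA + xB) / (n + m)                               # pooled proportion under H0
  Tstat <- (pA - pB) / sqrt(p * (1 - p) * (1/n + 1/m))
  c(T = Tstat, reject = Tstat > qnorm(1 - alpha))
}
prop_z_test(xA = 120, n = 1000, xB = 90, m = 1000)        # illustrative counts only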

Problem 21. (Covid-19 testing problem)

There are approximately 540 coronavirus testing locations in Abu Dhabi. At the beginning of the day, officials at each lo-
cation record Y = number of specimens tested to find the first positive case and assume Y follows a geometric distribution
with probability p. The probability p satisfies 0 < p < 1 and is unknown. In this application, we might also call p the “pop-
ulation prevalence” of the disease. Of course, careful thought should go into defining exactly what the “population” is here.

Suppose Y1 , Y2 , . . . , Y540 are iid geometric (p) random variables (one for each site) observed on a given day. Epidemiologists
at the Department of Health would like to test
H0 : p = 0.02
versus
Ha : p < 0.02.
(a) Show the likelihood function of p on the basis of observing y = (y1, y2, . . . , y540) is given by

L(p | y) = (1 − p)^{∑_{i=1}^{540} yi − 540} p^{540}.
(b) Show the uniformly most powerful (UMP) level α test of H0 versus Ha has a rejection region of the form

RR = { t = ∑_{i=1}^{540} yi ≥ k* }.

How would you choose k* to ensure the test is level α = 0.05? Hint: What is the sampling distribution of T = ∑_{i=1}^{540} Yi when H0 is true?

Solution.

(a) Using the p.m.f. of a geometric distribution, we can get

L(p | y) = ∏_{i=1}^{540} (1 − p)^{yi − 1} p = (1 − p)^{∑_{i=1}^{540} yi − 540} p^{540}.

(b) We start by considering a simple alternative Ha′: p = pa, with pa < 0.02.

L(p0 | y)/L(pa | y) = ((1 − p0)/(1 − pa))^{∑_{i=1}^{540} yi − 540} (p0/pa)^{540}.

Since pa < 0.02 = p0, we have (1 − p0)/(1 − pa) < 1, so this ratio goes up when T = ∑_{i=1}^{540} Yi goes down. Therefore, by the Neyman–Pearson Lemma, we can construct the most powerful test with rejection region

RR = { L(p0 | y)/L(pa | y) < k } = { T = ∑_{i=1}^{540} Yi > k* },

where k* satisfies α = P_{H0}(RR) = P(∑_{i=1}^{540} Yi > k* | H0).

When H0 is true, T = ∑_{i=1}^{540} Yi ∼ Negative Binomial(540, p0). (Here, the negative binomial is used in the version "the number of trials needed to get the first 540 successes, i.e., the first 540 positive cases"; there is another version counting "the number of failures before the 540th success," so one just needs to be careful about the parameterization.) Thus, we can choose k* to be the upper α quantile of this negative binomial distribution to achieve a level α test.

Finally, this rejection region does not depend on the value of pa under Ha′ (which was specified arbitrarily). Therefore, this test is the uniformly most powerful (UMP) level α test. Note: To get an exact size α, one may use a randomized test.
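In R, the cutoff k* can be obtained from the negative binomial quantile function (a sketch; note that R's *nbinom functions count failures before the 540th success, so 540 must be added to convert to total trials):

# Rejection cutoff for the level-0.05 UMP test under H0: p = 0.02
alpha <- 0.05; p0 <- 0.02
k_star <- 540 + qnbinom(1 - alpha, size = 540, prob = p0)
k_star     # reject H0 when T = sum(Y_i) exceeds k_star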

Problem 22. (Epidemiological Modeling of Covid-19)

SEIR models are used by epidemiologists to describe COVID-19 disease severity in a population. The model consists of
four different categories:

S = susceptible category; E = exposed category; I = infected category; R = recovered category.

The four categories are mutually exclusive and exhaustive among living individuals (SEIRD models do include a fifth
category for those who have died from disease). A random sample of n individuals is selected from a population (e.g.,
residents of Abu Dhabi city), and the category status of each individual is identified. This produces the multinomial
random vector  
X4
Y ∼ mult n, p; pj = 1  ,
j=1

where 
  
Y1 p1
 Y2   p2 
Y=  
 Y3  and p = 
.
p3 
Y4 p4
The random variables Y1 , Y2 , Y3 , Y4 record the number of individuals identified in the susceptible, exposed, infected, and
recovered categories, respectively. Recall that the beta distribution is a conjugate prior for the binomial. Just as the
multinomial distribution can be regarded as a generalization of the binomial (to more than two categories), we need a
prior distribution for p that is a generalization of the beta. This generalization is the Dirichlet (α1 , α2 , α3 , α4 ) distribution.
Specifically, suppose p is best regarded as random with prior pdf

g(p) = [Γ(α1 + α2 + α3 + α4) / (Γ(α1)Γ(α2)Γ(α3)Γ(α4))] p1^{α1−1} p2^{α2−1} p3^{α3−1} p4^{α4−1} for 0 < pj < 1 with ∑_{j=1}^4 pj = 1, and g(p) = 0 otherwise,
where α1 > 0, α2 > 0, α3 > 0, and α4 > 0 are known.


 P4 
(a) If Y | p ∼ mult(n, p; ∑_{j=1}^4 pj = 1) and p ∼ g(p), show the posterior distribution g(p | y) is Dirichlet with parameters αj* = yj + αj, for j = 1, 2, 3, 4. Hint: The joint distribution of Y and p satisfies

fY,p (y, p) = fY|p (y | p)g(p)

and g(p | y) is proportional to fY,p (y, p).


(b) A special case of the Dirichlet distribution above arises when α1 = α2 = α3 = α4 , the so-called “symmetric Dirich-
let distribution.” This distribution would arise when one has no prior information to favor the count in one SEIR
category over the other three. Do you think g(p) with α1 = α2 = α3 = α4 would be a reasonable prior model for
covid-19 in Abu Dhabi city? Explain.

Solution:

(a)

g(p | y) ∝ f_{Y,p}(y, p) = f_{Y|p}(y | p) g(p)
        = [ (n!/∏_{j=1}^4 yj!) ∏_{j=1}^4 pj^{yj} ] × [ (Γ(α1 + α2 + α3 + α4)/(Γ(α1)Γ(α2)Γ(α3)Γ(α4))) p1^{α1−1} p2^{α2−1} p3^{α3−1} p4^{α4−1} ]
        ∝ p1^{α1+y1−1} p2^{α2+y2−1} p3^{α3+y3−1} p4^{α4+y4−1}
        ∼ Dirichlet(α1*, α2*, α3*, α4*), where αj* = yj + αj.

(b) No, this prior might not be proper here. The probability for each category should not be all the same. (You can
search for more information to argue whether it is reasonable or not, like historical data, expertise, doctors, or even
your own observations, etc.)

Problem 23. (Mean referral waiting times)

We would like to compare the population mean referral waiting times for patients in Abu Dhabi and Dubai seeking care
from a gastrointestinal specialist. By “referral waiting time”, I mean the time it takes to see a gastrointestinal specialist
once a referral has been made by another health professional (e.g., a primary care physician). We have independent
random samples of patients from the two locations. Here are the corresponding waiting times and population-level models
for them:
IID
• Abu Dhabi: X1 , X2 , . . . , Xm ∼ Exponential (θ1 )
IID
• Dubai: Y1 , Y2 , . . . , Yn ∼ Exponential (θ2 ).
The population parameters satisfy θ1 > θ and θ2 > θ and are unknown. The goal is to test

H0 : θ 1 = θ 2
versus
6 θ2 .
Ha : θ 1 =

(a) Preparing for an LRT derivation below, carefully describe the null parameter space Θ0 and the entire parameter
space Θ. Draw a picture of what both spaces look like.
(b) Show the likelihood function is given by
L(θ | x, y) = L(θ1, θ2 | x, y) = (1/θ1^m) e^{−∑_{i=1}^m xi/θ1} × (1/θ2^n) e^{−∑_{j=1}^n yj/θ2},

where θ = (θ1 , θ2 ). This is just the likelihood from each sample multiplied together (because the two samples are
independent).
(c) Show the (restricted) MLE of θ over the null parameter space Θ0 is

θ̂0 = ( (mX̄ + nȲ)/(m + n), (mX̄ + nȲ)/(m + n) )^T.

(d) Show the (unrestricted) MLE of θ over the entire parameter space Θ is

θ̂ = (X̄, Ȳ)^T.

(e) Show the likelihood ratio test (LRT) statistic is

λ = L(θ̂0 | x, y) / L(θ̂ | x, y) = x̄^m ȳ^n / ( (mx̄ + nȳ)/(m + n) )^{m+n}.

(f) Here are the observed data on the referral waiting times for both groups of patients:

Abu Dhabi Dubai


36 11 1 49
47 52 16 43
52 9 22 8
1 39 11 2
9 20 8 8
72 32 26 53
12 24
39

Calculate −2 ln λ using the data above and implement a large-sample LRT to test H0 versus Ha . What is your
conclusion at α = 0.05?

Solution.
(a) The entire space should be the region with all positive values for θ1 and θ2 . The null space is then the line with
θ1 = θ2 .

(b) Plug in the p.d.f.'s to construct the likelihood:

L(θ | x, y) = L(θ1, θ2 | x, y) = ∏_{i=1}^m (1/θ1) e^{−xi/θ1} × ∏_{j=1}^n (1/θ2) e^{−yj/θ2}
            = (1/θ1^m) e^{−∑_{i=1}^m xi/θ1} × (1/θ2^n) e^{−∑_{j=1}^n yj/θ2}.

(c) Under H0, we assume θ1 = θ2 = θ; then

L(θ | x, y) = (1/θ^{m+n}) e^{−(∑_{i=1}^m xi + ∑_{j=1}^n yj)/θ}.

Log-likelihood:

l = log L(θ0 = (θ, θ)^T | x, y) = −(m + n) ln θ − (∑_{i=1}^m xi + ∑_{j=1}^n yj)/θ.

Derivative:

∂l/∂θ = −(m + n)/θ + (∑_{i=1}^m xi + ∑_{j=1}^n yj)/θ², set equal to 0.

Then we can get

θ̂ = (∑_{i=1}^m xi + ∑_{j=1}^n yj)/(m + n) = (mX̄ + nȲ)/(m + n),

and therefore

θ̂0 = (θ̂, θ̂)^T = ( (mX̄ + nȲ)/(m + n), (mX̄ + nȲ)/(m + n) )^T.

(d) From the original likelihood

L(θ | x, y) = L(θ1, θ2 | x, y) = (1/θ1^m) e^{−∑_{i=1}^m xi/θ1} × (1/θ2^n) e^{−∑_{j=1}^n yj/θ2},

we get the log-likelihood

l = log L(θ = (θ1, θ2)^T | x, y) = −m ln θ1 − ∑_{i=1}^m xi/θ1 − n ln θ2 − ∑_{j=1}^n yj/θ2.

Take partial derivatives:

∂l/∂θ1 = −m/θ1 + mX̄/θ1², set equal to 0 ⇒ θ̂1 = X̄,
∂l/∂θ2 = −n/θ2 + nȲ/θ2², set equal to 0 ⇒ θ̂2 = Ȳ.

Thus, we can get θ̂ = (X̄, Ȳ)^T.

(e) Plugging in the results above:

λ = max_{θ∈Θ0} L(θ | x, y) / max_{θ∈Θ} L(θ | x, y)
  = [ (1/θ̂^m) e^{−∑xi/θ̂} × (1/θ̂^n) e^{−∑yj/θ̂} ] / [ (1/x̄^m) e^{−mx̄/x̄} × (1/ȳ^n) e^{−nȳ/ȳ} ]
  = [ (1/θ̂^{m+n}) e^{−(mx̄+nȳ)/θ̂} ] / [ (1/(x̄^m ȳ^n)) e^{−(m+n)} ]
  = x̄^m ȳ^n e^{m+n} / (θ̂^{m+n} e^{m+n}) = x̄^m ȳ^n / ( (mx̄ + nȳ)/(m + n) )^{m+n}.

(f) We can use the large-sample result
$$-2\ln\lambda \xrightarrow{d} \chi^2_1.$$
Here, the degrees of freedom equal 1 since two parameters are estimated over the entire parameter space but only one under H0; thus ν = dim(Θ) − dim(Θ0) = 2 − 1 = 1.

Plugging in the values, we get $-2\ln\lambda = 1.0158 < \chi^2_{1,0.95} = 3.841$, so we cannot reject the null hypothesis at the 0.05 significance level.
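The large-sample LRT is easy to script in R; a minimal sketch (assuming the two waiting-time samples are stored in numeric vectors x and y, which are placeholder names, not objects defined in the problem) is:

# Large-sample LRT of H0: theta1 = theta2 for two independent exponential samples
lrt_exp <- function(x, y) {
  m <- length(x); n <- length(y)
  theta0 <- (m * mean(x) + n * mean(y)) / (m + n)          # restricted MLE
  loglam <- m * log(mean(x)) + n * log(mean(y)) - (m + n) * log(theta0)
  stat <- -2 * loglam
  c(stat = stat, p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}
# lrt_exp(x, y)   # compare stat with qchisq(0.95, 1) = 3.841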

Problem 24. (Effect of a confounding factor)

To study the effectiveness of a drug that claims to lower blood cholesterol levels, we designed a simple experiment with
n subjects in a control group and n (different) subjects in a treatment group. We administer the drug to the treatment
group and a placebo to the control group, measure the cholesterol levels of all subjects at the end of the study and look
at whether cholesterol levels are lower in the treatment group than in control. Let X1 , . . . , Xn be the cholesterol levels in
the control group and Y1 , . . . , Yn be those in the treatment group, and let
$$T = \frac{\bar{X} - \bar{Y}}{S_p\sqrt{\frac{1}{n} + \frac{1}{n}}}$$
be the standard two-sample t-statistic, where $S_p^2$ is the pooled variance.


Assume, throughout this problem, that the drug, in fact, is not effective and has the exact same effect as the placebo. However, suppose there are two types of subjects, high-risk and low-risk. (Approximately half of the human population is high-risk and half is low-risk; assume that we cannot directly observe whether a person is high-risk or low-risk.) The cholesterol level for high-risk subjects is distributed as $N(\mu_H, \sigma^2)$, and for low-risk subjects as $N(\mu_L, \sigma^2)$.

(a) A carefully designed study randomly selects subjects for the two groups so that each subject selected for either
group is (independently) with probability 1/2 high-risk and probability 1/2 low-risk. Explain why, in this case,
X1 , . . . , Xn , Y1 , . . . , Yn are IID from a common distribution. What are E [Xi ] and Var [Xi ]?
(b) Explain (using the CLT and Slutsky’s lemma) why, when n is large, T is approximately distributed as N (0, 1), and
hence a test that rejects for T > Zα is approximately level-α for large n.
(c) A poorly-designed study fails to properly randomize the treatment and control groups, so that each subject selected
for the control group is with probability p high-risk and probability 1 − p low-risk, and each subject selected
for the treatment group is with probability q high-risk and probability 1 − q low-risk. In this case, what are
E [Xi ] , Var [Xi ] , E [Yi ], and Var [Yi ]?
(d) In the setting of part (c), show that Sp2 converges in probability to a constant c ∈ R as n → ∞, and provide a
formula for c. Show that T is approximately normally distributed, and provide formulas for the mean and variance
of this normal. Is the rejection probability P[T > Zα ] necessarily close to α? Discuss briefly how this probability
depends on the values µH , µL , σ 2 , p, and q.

Solution.

(a) Each group member is selected independently from either the high-risk group or the low-risk group. So the cholesterol
level for each person in each group is a random variable independent of that for any other person. Also, since for
both the treatment and control groups, with probability 1/2, a high-risk individual is chosen, and with probability
1/2, a low-risk individual is chosen, they must have the same distribution. So, the variables X1 , · · · , Xn , Y1 , · · · , Yn
are IID from a common distribution. To compute the mean and variance, we may write Xi as
$$X_i = Z_iH_i + (1 - Z_i)L_i,$$
where $H_i \sim N(\mu_H, \sigma^2)$, $L_i \sim N(\mu_L, \sigma^2)$, $Z_i \sim \text{Bernoulli}(1/2)$, and these are independent. Then

$$E[X_i] = E[Z_i]E[H_i] + E[1 - Z_i]E[L_i] = \frac{1}{2}\mu_H + \frac{1}{2}\mu_L \quad \text{(due to independence)}.$$
To compute the variance, we have
$$E[X_i^2] = E\left[ Z_i^2H_i^2 + 2Z_i(1 - Z_i)H_iL_i + (1 - Z_i)^2L_i^2 \right].$$
Note that since $Z_i \in \{0, 1\}$, $Z_i(1 - Z_i) = 0$, $Z_i^2 = Z_i$, and $(1 - Z_i)^2 = 1 - Z_i$. Then
$$E[X_i^2] = E[Z_i]E[H_i^2] + E[1 - Z_i]E[L_i^2] = \frac{1}{2}E[H_i^2] + \frac{1}{2}E[L_i^2].$$
We have $E[H_i^2] = \text{Var}[H_i] + (E[H_i])^2 = \mu_H^2 + \sigma^2$, and similarly $E[L_i^2] = \mu_L^2 + \sigma^2$. So
$$E[X_i^2] = \frac{1}{2}\left( \mu_L^2 + \mu_H^2 \right) + \sigma^2,$$
and
$$\text{Var}[X_i] = E[X_i^2] - (E[X_i])^2 = \frac{1}{2}\left( \mu_L^2 + \mu_H^2 \right) + \sigma^2 - \frac{1}{4}\left( \mu_L^2 + 2\mu_L\mu_H + \mu_H^2 \right) = \sigma^2 + \frac{1}{4}(\mu_H - \mu_L)^2.$$

(b) As the Xi's and Yi's are all IID, by the Central Limit Theorem, $\sqrt{n}(\bar{X} - E[X_i]) \to N(0, \text{Var}[X_i])$ and $\sqrt{n}(\bar{Y} - E[X_i]) \to N(0, \text{Var}[X_i])$ in distribution, so their difference $\sqrt{n}(\bar{X} - \bar{Y}) \to N(0, 2\,\text{Var}[X_i])$. The pooled variance is
$$S_p^2 = \frac{1}{2n - 2}\left( \sum_{i=1}^n (X_i - \bar{X})^2 + \sum_{i=1}^n (Y_i - \bar{Y})^2 \right) = \frac{1}{2}S_X^2 + \frac{1}{2}S_Y^2,$$
where $S_X^2 = \frac{1}{n-1}\sum_i (X_i - \bar{X})^2$ and $S_Y^2 = \frac{1}{n-1}\sum_i (Y_i - \bar{Y})^2$ are the individual sample variances. Using this, $S_X^2 \to \text{Var}[X_i]$ and $S_Y^2 \to \text{Var}[Y_i] = \text{Var}[X_i]$ in probability, so the Continuous Mapping Theorem implies $S_p^2 \to \text{Var}[X_i]$. Then
$$T = \frac{\bar{X} - \bar{Y}}{S_p\sqrt{\frac{1}{n} + \frac{1}{n}}} = \frac{\sqrt{\text{Var}[X_i]}}{S_p}\cdot\frac{\sqrt{n}(\bar{X} - \bar{Y})}{\sqrt{2\,\text{Var}[X_i]}} \to N(0, 1)$$
in distribution by Slutsky's lemma. Hence, a test that rejects for T > Zα is approximately level α for large n.

(c) The difference in this part from Part (a) is that here

Xi = Zi Hi + (1 − Zi ) Li ,

where Hi , Li are defined as before but Zi ∼ Bernoulli(p). Then

E [Xi ] = pµH + (1 − p)µL .

Similarly, E [Yi ] = qµH + (1 − q)µL .
For the variances, we compute as in Part (a):
$$E[X_i^2] = E[Z_i]E[H_i^2] + E[1 - Z_i]E[L_i^2] = p(\mu_H^2 + \sigma^2) + (1 - p)(\mu_L^2 + \sigma^2) = p\mu_H^2 + (1 - p)\mu_L^2 + \sigma^2,$$
so
$$\text{Var}[X_i] = E[X_i^2] - (E[X_i])^2 = p\mu_H^2 + (1 - p)\mu_L^2 + \sigma^2 - \left( p\mu_H + (1 - p)\mu_L \right)^2 = \sigma^2 + (\mu_H - \mu_L)^2p(1 - p).$$
Similarly, $\text{Var}[Y_i] = \sigma^2 + (\mu_H - \mu_L)^2q(1 - q)$.

(d) In this case $S_X^2 \to \text{Var}[X_i]$ and $S_Y^2 \to \text{Var}[Y_i]$ in probability, so
$$S_p^2 \to \frac{1}{2}\left( \text{Var}[X_i] + \text{Var}[Y_i] \right) = \sigma^2 + \frac{1}{2}\left( p(1 - p) + q(1 - q) \right)(\mu_H - \mu_L)^2 =: c.$$
By the CLT, $\sqrt{n}(\bar{X} - E[X_i]) \to N(0, \text{Var}[X_i])$ and $\sqrt{n}(\bar{Y} - E[Y_i]) \to N(0, \text{Var}[Y_i])$. The Xi's and Yi's are independent, so the difference $\sqrt{n}\left( \bar{X} - \bar{Y} - E[X_i] + E[Y_i] \right) \to N(0, \text{Var}[X_i] + \text{Var}[Y_i])$.
Then
$$T = \frac{\bar{X} - \bar{Y}}{S_p\sqrt{2/n}} = \frac{1}{\sqrt{2S_p^2}}\left( \sqrt{n}(\bar{X} - \bar{Y}) \right)$$
is approximately distributed as
$$\frac{1}{\sqrt{2c}}\,N\left( \sqrt{n}(E[X_i] - E[Y_i]),\ \text{Var}[X_i] + \text{Var}[Y_i] \right) = N\left( \frac{\sqrt{n}(E[X_i] - E[Y_i])}{\sqrt{2c}},\ 1 \right).$$
Let
$$m := \frac{\sqrt{n}(E[X_i] - E[Y_i])}{\sqrt{2c}} = \frac{\sqrt{n}(p - q)(\mu_H - \mu_L)}{\sqrt{2c}} = \frac{\sqrt{n}(p - q)(\mu_H - \mu_L)}{\sqrt{2\sigma^2 + \left( p(1 - p) + q(1 - q) \right)(\mu_H - \mu_L)^2}},$$

so T is approximately N (m, 1). Then the rejection probability is

P[T > Zα ] = P[T − m > Zα − m] ≈ 1 − Φ(Zα − m) = Φ(m − Zα ).

This probability increases in m and only equals α when m = 0. If (p − q) (µH − µL ) > 0, then m → ∞ as n → ∞,
and we expect to falsely reject H0 with probability close to 1 for large n. If (p − q) (µH − µL ) < 0, then m → −∞
as n → ∞, and we expect the significance level of the test to in fact be close to 0 for large n.
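A short simulation sketch makes the inflated rejection probability visible; the parameter values below (µH = 6, µL = 5, σ = 1, p = 0.8, q = 0.2) are illustrative assumptions, not values given in the problem:

# Rejection rate of the level-0.05 test when the drug is ineffective but the
# control group over-samples high-risk subjects (poorly randomized study).
set.seed(1)
n <- 200; B <- 5000
muH <- 6; muL <- 5; sigma <- 1; p <- 0.8; q <- 0.2   # assumed illustrative values
reject <- replicate(B, {
  X <- ifelse(rbinom(n, 1, p) == 1, rnorm(n, muH, sigma), rnorm(n, muL, sigma))  # control
  Y <- ifelse(rbinom(n, 1, q) == 1, rnorm(n, muH, sigma), rnorm(n, muL, sigma))  # treatment
  t.test(X, Y, alternative = "greater", var.equal = TRUE)$statistic > qnorm(0.95)
})
mean(reject)   # far above the nominal 0.05 because (p - q)(muH - muL) > 0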

Problem 25. (Improving upon Bonferroni for independent tests)

(a) Let P1, ..., Pn be the p-values from n different hypothesis tests. Suppose that the tests are performed using independent sets of data, and in fact all of the null hypotheses are true, so $P_1, \ldots, P_n \overset{\text{IID}}{\sim} \text{Uniform}(0, 1)$. Show that for any t ∈ (0, 1),
$$P\left[ \min_{i=1}^n P_i \leq t \right] = 1 - (1 - t)^n.$$

(b) Under the setting of part (a), if we perform all tests at significance level 1 − (1 − α)1/n (that is, we reject a null
hypothesis if its p-value is less than this level), show that the probability of (falsely) rejecting any of the n null
hypotheses is exactly α. Is this procedure more or less powerful than the Bonferroni procedure (of performing all
tests at level α/n )?

(c) Suppose, now, that all of the p-values P1 , . . . , Pn are still independent, but not necessarily all of the null hypotheses
are true. (So the p-values corresponding to the true null hypotheses are still IID and distributed as Uniform (0, 1).)
If we perform all tests at significance level 1 − (1 − α)1/n , does this procedure control the family-wise error rate
(FWER) at level α? (Explain why, or show a counterexample.)

Solution.
(a) We know that $P_1, \ldots, P_n \overset{\text{IID}}{\sim} U(0, 1)$. So for any t ∈ (0, 1),
$$P\left[ \min_{i=1}^n P_i \leq t \right] = 1 - P\left[ \min_{i=1}^n P_i > t \right] = 1 - P[P_i > t \ \forall\, i = 1, \ldots, n] = 1 - \prod_{i=1}^n P[P_i > t] = 1 - (1 - t)^n.$$

(b) If all the tests are performed at significance level 1 − (1 − α)^(1/n),
$$P(\text{rejecting any of the } n \text{ null hypotheses}) = P\left[ P_i < 1 - (1 - \alpha)^{1/n} \text{ for some } i \right] = P\left[ \min_{i=1}^n P_i < 1 - (1 - \alpha)^{1/n} \right] = 1 - \left( 1 - 1 + (1 - \alpha)^{1/n} \right)^n = \alpha.$$
Hence, the probability of (falsely) rejecting any of the n null hypotheses is exactly α.
The Bonferroni procedure rejects when $P_i \leq \alpha/n$ and the above procedure rejects when $P_i \leq 1 - (1 - \alpha)^{1/n}$. Note that
$$\left( 1 - \frac{\alpha}{n} \right)^n > 1 - \alpha,$$
so $1 - (1 - \alpha)^{1/n} > \alpha/n$. Hence, whenever the Bonferroni test rejects, this procedure also rejects, so this procedure is more powerful than the Bonferroni test.
(c) Suppose there are k true null hypotheses, and without loss of generality, let us assume that these are the first k. If all the tests are performed at significance level 1 − (1 − α)^(1/n), and V is the number of true null hypotheses that are rejected, then the FWER is
$$P(V \geq 1) = P\left[ \min_{i=1}^k P_i \leq 1 - (1 - \alpha)^{1/n} \right] = 1 - P\left[ P_i > 1 - (1 - \alpha)^{1/n} \ \forall\, i = 1, \ldots, k \right] = 1 - \left( (1 - \alpha)^{1/n} \right)^k = 1 - (1 - \alpha)^{k/n}.$$
Since k ≤ n and α < 1, $(1 - \alpha)^{k/n} \geq 1 - \alpha$ and hence $1 - (1 - \alpha)^{k/n} \leq \alpha$, so the FWER is controlled.
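A quick R sketch comparing the two per-test cutoffs and checking the FWER of the Šidák-type rule by simulation when all null hypotheses are true:

# Bonferroni vs. Sidak-type cutoffs, and simulated FWER of the Sidak rule.
alpha <- 0.05; n <- 20
c(bonferroni = alpha / n, sidak = 1 - (1 - alpha)^(1 / n))   # Sidak cutoff is larger

B <- 100000
fwer <- mean(replicate(B, min(runif(n)) < 1 - (1 - alpha)^(1 / n)))
fwer   # close to alpha = 0.05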

Problem 26. (Computing the MME and MLE using Newton-Raphson Method)

(a) Let X = (X1, ..., Xn) be i.i.d. random variables with Kumaraswamy distribution with parameters a > 0 and b > 0. The probability density function of the Kumaraswamy distribution is
$$f(x; a, b) = ab\,x^{a-1}(1 - x^a)^{b-1}, \quad x \in [0, 1].$$

Find the method of moments for the Kumaraswamy distribution using numerical methods.
(b) Implement a function that takes as input a vector of data values X, performs the Newton-Raphson iterations to
compute the MLEs α̂ and β̂ in the Gamma (α, β) model, and outputs α̂ and β̂. (You may use the form of the Newton-
Raphson update equation derived in class. You may terminate the Newton-Raphson iterations when |α(t+1) − α(t) |
is sufficiently small.)

(c) For n = 500, use your function from part (b) to simulate the sampling distributions of α̂ and β̂ computed from $X_1, \ldots, X_n \overset{\text{IID}}{\sim}$ Gamma(1, 2). Plot histograms of the values of α̂ and β̂ across 5000 simulations, and report the simulated mean and variance of α̂ and β̂ as well as the simulated covariance between α̂ and β̂. Compute the inverse of the Fisher Information matrix I(α, β) at α = 1 and β = 2. Do your simulations support that (α̂, β̂) is approximately distributed as $N\left( (1, 2),\ \frac{1}{n}I(1, 2)^{-1} \right)$? (You may use the formula for the Fisher information matrix I(α, β) and/or its inverse derived in class.)

Solution.
(a) The Kumaraswamy distribution is an alternative to the Beta distribution, which offers similar flexibility in terms of
the shape of the density. An appealing feature of the Kumaraswamy distribution is that it only involves algebraic
terms, in contrast to the Beta distribution, which requires the evaluation of the Beta function. The mean and
variance of this distribution are
$$\mu_1 = E[X] = \frac{b\,\Gamma\!\left(1 + \frac{1}{a}\right)\Gamma(b)}{\Gamma\!\left(1 + \frac{1}{a} + b\right)}, \qquad \text{Var}[X] = E[X^2] - E[X]^2 = \mu_2 - \mu_1^2 = \frac{b\,\Gamma(1 + 2/a)\Gamma(b)}{\Gamma(1 + b + 2/a)} - \mu_1^2.$$
The MMEs of a and b, ã and b̃, are the solution to the system of non-linear equations
$$\bar{X} = \frac{b\,\Gamma\!\left(1 + \frac{1}{a}\right)\Gamma(b)}{\Gamma\!\left(1 + \frac{1}{a} + b\right)}, \qquad \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2 = \frac{b\,\Gamma(1 + 2/a)\Gamma(b)}{\Gamma(1 + b + 2/a)} - \mu_1^2,$$

which cannot be found in closed form. Thus, we can only obtain the MME using numerical methods. The following
R code shows how to calculate the MME numerically.

rm(list = ls()) # Delete memory


# Moments of order n of the Kumaraswamy(a,b) distribution
# https://en.wikipedia.org/wiki/Kumaraswamy_distribution

mn <- function(a,b,n){
log.num <- log(b) + lgamma(1 + n/a) + lgamma(b)
log.den <- lgamma(1 + b + n/a)
return(exp(log.num-log.den))
}

#--------------------------------
# Method of Moments Estimates
#--------------------------------
library(nleqslv) # Required packages

# Function containing the implementation of the MME


# It solves a system of nonlinear equations

MME <- function(data){


x.bar <- mean(data)
x2.bar <- mean(data^2)

fn <- function(par){
mom1 <- mn(exp(par[1]),exp(par[2]),1) - x.bar
mom2 <- mn(exp(par[1]),exp(par[2]),2) - mn(exp(par[1]),exp(par[2]),1)^2 - x2.bar + x.bar^2
return(c(mom1,mom2))
}

est <- as.vector(exp(nleqslv(c(0,0), fn)$x))

return(est)
}

# Examples

#--------------------------------
# Example 1
#--------------------------------
a0 <- 1
b0 <- 1
ns <- 1000

# Simulation of Kumaraswamy distributed data


set.seed(123)
dat <- (rbeta(ns, shape1 = 1, shape2 = b0))^(1/a0)

# MME
MME(dat)
## [1] 1.018995 1.032939
#--------------------------------
# Example 2
#--------------------------------
a0 <- 0.5
b0 <- 2
ns <- 1000

# Simulation of Kumaraswamy distributed data


set.seed(123)
dat <- (rbeta(ns, shape1 = 1, shape2 = b0))^(1/a0)

# MME
MME(dat)
## [1] 0.5059483 2.1418411
#--------------------------------
# Example 3
#--------------------------------
a0 <- 5
b0 <- 3
ns <- 1000

# Simulation of Kumaraswamy distributed data


set.seed(123)
dat <- (rbeta(ns, shape1 = 1, shape2 = b0))^(1/a0)

# MME
MME(dat)
## [1] 4.796745 2.915643

(b) We denote the function
$$f(\alpha) = \log\alpha - \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} = \log\alpha - \psi(\alpha),$$
where ψ(α) = Γ′(α)/Γ(α) is the digamma function. Its derivative is
$$f'(\alpha) = \frac{1}{\alpha} - \psi'(\alpha),$$
where ψ′(α) is the trigamma function.
The Newton-Raphson update rule is
$$\alpha^{(t+1)} = \alpha^{(t)} + \frac{-f\left(\alpha^{(t)}\right) + \log\bar{X} - \frac{1}{n}\sum_{i=1}^n \log X_i}{f'\left(\alpha^{(t)}\right)}.$$
We can implement this as follows:

gamma.MLE = function(X) {
ahat = compute.ahat(X)
bhat = ahat/mean(X)

return(c(ahat,bhat))
}

# estimate ahat by Newton-Raphson


compute.ahat = function(X) {
a.prev = -Inf
a = mean(X)^2/var(X)#initialguess

#while not converged, do Newton-Raphson update


while(abs(a-a.prev)>1e-12) {
a.prev = a
numerator = -f(a.prev)+log(mean(X))-mean(log(X))
denominator = f.prime(a.prev)
a = a.prev+numerator/denominator
}

return(a)
}

#define some helper functions


f=function(alpha) {

return(log(alpha)-digamma(alpha))
}
f.prime=function(alpha) {
return(1/alpha-trigamma(alpha))
}

(c) We run the simulation:


# In R you may simulate samples from Gamma (alpha, beta) using

X = rgamma(n , alpha, rate = beta )

# The sample variance of a vector of values of X is given by var(X), and the sample covariance
# between two vectors of values X and Y (of the same length) is given by cov(X, Y).

n=500
n.reps=5000
alpha=1
beta=2

alpha.hat=numeric(n.reps)
beta.hat=numeric(n.reps)

for(i in 1:n.reps) {
X=rgamma(n,shape=alpha,rate=beta)
estimates=gamma.MLE(X)
alpha.hat[i] = estimates[1]
beta.hat[i] = estimates[2]
}
Here are the resulting histograms:

hist(alpha.hat, breaks=20)

hist(beta.hat, breaks=20)

The inverse of the Fisher information matrix was computed to be
$$I(\alpha, \beta)^{-1} = \frac{1}{\psi'(\alpha)\frac{\alpha}{\beta^2} - \frac{1}{\beta^2}}\begin{pmatrix} \frac{\alpha}{\beta^2} & \frac{1}{\beta} \\[4pt] \frac{1}{\beta} & \psi'(\alpha) \end{pmatrix}.$$
Plugging in α = 1, β = 2 gives
$$I(1, 2)^{-1} = \frac{4}{\psi'(1) - 1}\begin{pmatrix} 1/4 & 1/2 \\ 1/2 & \psi'(1) \end{pmatrix} \approx \begin{pmatrix} 1.551 & 3.101 \\ 3.101 & 10.202 \end{pmatrix},$$
using ψ′(1) = π²/6 ≈ 1.645.


Now, let’s take a look at the empirical moments produced by the simulation. Here are the means:
mean(alpha.hat) # should be close to 1
## [1] 1.003308

mean(beta.hat) # should be close to 2


## [1] 2.010126
They are close to the true values of α = 1 and β = 2, as expected. Now here are the variance and covariance terms:

var(alpha.hat)
## [1] 0.003079451

var(beta.hat)
## [1] 0.02048946

cov(alpha.hat,beta.hat)
## [1] 0.006190339
In order to compare to the Fisher information, we need to scale by n.
n*var(alpha.hat) #should be close to 1.551
## [1] 1.539726

n*var(beta.hat) #should be close to 10.202


## [1] 10.24473

n*cov(alpha.hat,beta.hat) #should be close to 3.101


## [1] 3.09517

Problem 27. (MLE in a misspecified model)

Suppose you fit the model Exponential(λ) to data X1, ..., Xn by computing the MLE λ̂ = 1/X̄, but the true distribution of the data is $X_1, \ldots, X_n \overset{\text{IID}}{\sim}$ Gamma(2, 1).
(a) Let f(x | λ) be the PDF of the Exponential(λ) distribution, and let g(x) be the PDF of the Gamma(2, 1) distribution. Compute an explicit formula for the KL-divergence $D_{KL}(g(x)\,\|\,f(x \mid \lambda))$ in terms of λ, and find the value λ* that minimizes this KL-divergence.
(You may use the fact that if X ∼ Gamma(α, β), then E[X] = α/β and E[log X] = ψ(α) − log β where ψ is the
digamma function.)
(b) Show directly, using the Law of Large Numbers and the Continuous Mapping Theorem, that the MLE λ̂ converges
in probability to λ∗ as n → ∞.
(c) Perform a simulation study for the standard error of λ̂ with sample size n = 500, as follows: In each of B = 10000 simulations, sample $X_1, \ldots, X_n \overset{\text{IID}}{\sim}$ Gamma(2, 1), compute the MLE λ̂ = 1/X̄ for the exponential model, compute an estimate of the standard error of λ̂ using the Fisher information I(λ̂), and also compute the sandwich estimate of the standard error, $S_X/(\bar{X}^2\sqrt{n})$.

Report the true mean and standard deviation of λ̂ that you observe across your 10000 simulations. Is the mean
close to λ∗ from part (a)? Plot a histogram of the 10000 estimated standard errors using the Fisher information,
and also plot a histogram of the 10000 estimated standard errors using the sandwich estimate. Do either of these
methods for estimating the standard error of λ̂ seem accurate in this setting?

Solution.
(a) The KL-divergence is given by
$$D_{KL}(g(x)\,\|\,f(x \mid \lambda)) = E_g\left[ \log\frac{g(X)}{f(X \mid \lambda)} \right] = E_g\left[ \log\frac{\frac{1}{\Gamma(2)}Xe^{-X}}{\lambda e^{-\lambda X}} \right] = E_g[-\log\Gamma(2) + \log X - X - \log\lambda + \lambda X]$$
$$= -\log\Gamma(2) - \log\lambda + E_g[\log X] + (\lambda - 1)E_g[X] = -\log\Gamma(2) - \log\lambda + \psi(2) + 2(\lambda - 1).$$
Setting the derivative with respect to λ equal to 0 shows that the KL-divergence is minimized at λ* = 1/2.

(b) By the Law of Large Numbers, X̄ → Eg [X] = 2 in probability, so λ̂ = 1/X̄ → 1/2 in probability by the Continuous
Mapping Theorem.

(c) The Fisher information in the exponential model is given by
$$I(\lambda) = -E_\lambda\left[ \frac{\partial^2}{\partial\lambda^2}\log f(X \mid \lambda) \right] = -E_\lambda\left[ \frac{\partial^2}{\partial\lambda^2}(\log\lambda - \lambda X) \right] = 1/\lambda^2.$$
The corresponding plug-in estimate of the standard error is $\sqrt{\frac{1}{nI(\hat{\lambda})}} = \frac{\hat{\lambda}}{\sqrt{n}} = \frac{1}{\bar{X}\sqrt{n}}$.

n = 500
B = 10000
lamb_hat = numeric(B)
se_Fisher = numeric(B)
se_sandwich = numeric(B)
for(i in 1:B){
X = rgamma(n,2,rate=1)
lamb_hat[i] = 1/mean(X)
se_Fisher[i] = 1/(mean(X)*sqrt(n))
se_sandwich[i] = sd(X)/(mean(X)^2*sqrt(n))
}
print(mean(lamb_hat))
## [1] 0.5006192
print(sd(lamb_hat))
## [1] 0.01595704

hist(se_Fisher)

hist(se_sandwich)

The empirical mean and standard error of λ̂ are 0.501 and 0.016; the mean is close to λ∗ = 0.5 from part (a). The
Fisher-information-based estimate of the standard error is incorrect – it estimates the standard error as approximately
0.022. The sandwich estimate of the standard error seems correct – it estimates the standard error as 0.016, with
some variability in the third decimal place.

Problem 28. (Confidence intervals for a binomial proportion)


Let $X_1, \ldots, X_n \overset{\text{IID}}{\sim}$ Bernoulli(p) be n tosses of a biased coin, and let p̂ = X̄. In this problem, we will explore two different ways to construct a 95% confidence interval for p, both based on the Central Limit Theorem result
$$\sqrt{n}(\hat{p} - p) \to N(0, p(1 - p)). \tag{4}$$
(a) Use the plugin estimate p̂(1 − p̂) for the variance p(1 − p) to obtain a 95% confidence interval for p. (This is the
procedure discussed in class, yielding the Wald interval for p.)
(b) Instead of using the plugin estimate p̂(1 − p̂), note that Eqn. (4) implies, for large n,
$$P\left[ -\sqrt{p(1 - p)}\,Z_{\alpha/2} \leq \sqrt{n}(\hat{p} - p) \leq \sqrt{p(1 - p)}\,Z_{\alpha/2} \right] \approx 1 - \alpha.$$
Solve the equation $\sqrt{n}(\hat{p} - p) = \sqrt{p(1 - p)}\,Z_{\alpha/2}$ for p in terms of p̂, and solve the equation $\sqrt{n}(\hat{p} - p) = -\sqrt{p(1 - p)}\,Z_{\alpha/2}$ for p in terms of p̂, to obtain a different 95% confidence interval for p.
(c) Perform a simulation study to determine the true coverage of the confidence intervals in parts (a) and (b), for the 9 combinations of sample size n = 10, 40, 100 and true parameter p = 0.1, 0.3, 0.5. (For each combination, perform at least B = 100,000 simulations. In each simulation, you may simulate p̂ directly from $\frac{1}{n}$Binomial(n, p) instead of simulating X1, ..., Xn.) Report the simulated coverage levels in two tables. Which interval yields true coverage closer to 95% for small values of n?

Solution.


(a) Since p̂ → p in probability, $\frac{\sqrt{n}(\hat{p} - p)}{\sqrt{\hat{p}(1 - \hat{p})}} \to N(0, 1)$ in distribution by Slutsky's Lemma. So for large n
$$P\left[ -Z_{0.025} \leq \frac{\sqrt{n}(\hat{p} - p)}{\sqrt{\hat{p}(1 - \hat{p})}} \leq Z_{0.025} \right] \approx 0.95.$$
We may rewrite the above as
$$P\left[ \hat{p} - Z_{0.025}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \leq p \leq \hat{p} + Z_{0.025}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \right] \approx 0.95,$$
so an approximate 95% confidence interval is $\hat{p} \pm Z_{0.025}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$.

(b) The following conditions are equivalent:
$$-\sqrt{p(1 - p)}\,Z_{\alpha/2} \leq \sqrt{n}(\hat{p} - p) \leq \sqrt{p(1 - p)}\,Z_{\alpha/2}$$
$$\Updownarrow$$
$$n(\hat{p} - p)^2 \leq p(1 - p)Z_{\alpha/2}^2$$
$$\Updownarrow$$
$$\left( n + Z_{\alpha/2}^2 \right)p^2 - \left( 2n\hat{p} + Z_{\alpha/2}^2 \right)p + n\hat{p}^2 \leq 0.$$
This occurs when p is between the two real roots of the above quadratic equation, which are given by
$$\frac{\left( 2n\hat{p} + Z_{\alpha/2}^2 \right) \pm \sqrt{\left( 2n\hat{p} + Z_{\alpha/2}^2 \right)^2 - 4\left( n + Z_{\alpha/2}^2 \right)n\hat{p}^2}}{2\left( n + Z_{\alpha/2}^2 \right)}.$$
Taking α = 0.05 and simplifying the above, we obtain an approximate 95% confidence interval of
$$\frac{\hat{p} + \frac{Z_{0.025}^2}{2n} \pm Z_{0.025}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n} + \frac{Z_{0.025}^2}{4n^2}}}{1 + \frac{Z_{0.025}^2}{n}}.$$

(c) ns = c(10,40,100)
ps = c(0.1,0.3,0.5)
B=100000
z = qnorm(0.975)
for (n in ns) {
for (p in ps) {
cover_A = numeric(B)
cover_B = numeric(B)
for (i in 1:B) {
phat = rbinom(1,n,p)/n
U = phat+z*sqrt(phat*(1-phat)/n)
L = phat-z*sqrt(phat*(1-phat)/n)
if (p <= U && p >= L) {
cover_A[i] = 1
} else {
cover_A[i] = 0
}
U = (phat+z^2/(2*n)+z*sqrt(phat*(1-phat)/n+z^2/(4*n^2)))/(1+z^2/n)
L = (phat+z^2/(2*n)-z*sqrt(phat*(1-phat)/n+z^2/(4*n^2)))/(1+z^2/n)
if (p <= U && p >= L) {
cover_B[i] = 1
} else {
cover_B[i] = 0
}
}
print(c(n,p,mean(cover_A),mean(cover_B)))
}
}

For the interval from part (a), we obtain the following coverage probabilities:

p = 0.1 p = 0.3 p = 0.5


n = 10 0.65 0.84 0.89
n = 40 0.91 0.93 0.92
n = 100 0.93 0.95 0.94

For the interval from part (b), we obtain the following coverage probabilities:

p = 0.1 p = 0.3 p = 0.5


n = 10 0.93 0.92 0.98
n = 40 0.94 0.94 0.96
n = 100 0.94 0.94 0.94

The intervals from part (b) are more accurate when n is small.

Problem 29. (Power Comparisons)

Consider the problem of testing


$$H_0: X_1, \ldots, X_n \overset{\text{IID}}{\sim} N(0, 1) \qquad \text{versus} \qquad H_1: X_1, \ldots, X_n \overset{\text{IID}}{\sim} N(\mu, 1)$$
at significance level α = 0.05, where µ > 0. We’ve seen four tests that may be applied to this problem, summarized below:
• Likelihood ratio test: Reject H0 when $\bar{X} > \frac{1}{\sqrt{n}}Z_{0.05}$.
• t-test: Reject H0 when $T := \sqrt{n}\bar{X}/S > t_{n-1;0.05}$, where $S^2 = \frac{1}{n-1}\sum_i (X_i - \bar{X})^2$.
• Wilcoxon signed rank test: Reject H0 when $W_+ > \frac{n(n+1)}{4} + \sqrt{\frac{n(n+1)(2n+1)}{24}}\,Z_{0.05}$, where W+ is the Wilcoxon signed rank statistic.
• Sign test (from Problem 4 of TD-3): Reject H0 when $S > \frac{n}{2} + \sqrt{\frac{n}{4}}\,Z_{0.05}$, where S is the number of positive values in X1, ..., Xn.
(For the Wilcoxon and sign test statistics, we are using the normal approximations for their null distributions.) These
tests are successively more robust to violations of the N (0, 1) distributional assumption imposed by H0 .
(a) For n = 100, verify numerically that these tests have significance level close to α, in the following way: Perform
10,000 simulations. In each simulation, draw a sample of 100 observations from N (0, 1), compute the above four test
statistics X̄, T , W+ , and S on this sample, and record whether each test accepts or rejects H0 . Report the fraction
of simulations for which each test rejected H0 , and confirm that these fractions are close to 0.05.
(b) For n = 100, numerically compute the powers of these tests against the alternative H1 , for the values µ = 0.1, 0.2, 0.3,
and 0.4. Do this by performing 10,000 simulations as in part (a), except now drawing each sample of 100 observations
from N (µ, 1) instead of N (0, 1). (You should be able to re-use most of your code from part (a).) Report your
computed powers either in a table or visually using a graph.
(c) How do the powers of the four tests compare when testing against a normal alternative? Your friend says, “We
should always use the testing procedure that makes the fewest distributional assumptions because we never know in
practice, for example, whether the variance is truly 1 or whether data is truly normal.” Comment on this statement.
Rice says, “It has been shown that even when the assumption of normality holds, the [Wilcoxon] signed rank test
is nearly as powerful as the t test. The [signed rank test] is thus generally preferable, especially for small sample
sizes.” Do your simulated results support this conclusion?

Solution.
(a) The code below runs the simulations for the null case (µ = 0) as well as for µ = 0.1, 0.2, 0.3, 0.4 :

set.seed(1)
n = 100
B = 10000
for (mu in c(0,0.1,0.2,0.3,0.4)) {
output.Z = numeric(B)
output.T = numeric(B)

output.W = numeric(B)
output.S = numeric(B)
for (i in 1:B) {
X = rnorm(n, mean=mu, sd=1)
if (mean(X) > 1/sqrt(n)*qnorm(0.95)) {
output.Z[i] = 1
} else {
output.Z[i] = 0
}
T = t.test(X)$statistic
if (T > qt(0.95,df=n-1)) {
output.T[i] = 1
} else {
output.T[i] = 0
}
W = wilcox.test(X)$statistic
if (W > n*(n+1)/4+sqrt(n*(n+1)*(2*n+1)/24)*qnorm(0.95)) {
output.W[i] = 1
} else {
output.W[i] = 0
}
S = length(which(X>0))
if (S > n/2+sqrt(n/4)*qnorm(0.95)) {
output.S[i] = 1
} else {
output.S[i] = 0
}
}
print(paste("mu = ", mu))
print(paste("Z: ", mean(output.Z)))
print(paste("T: ", mean(output.T)))
print(paste("W: ", mean(output.W)))
print(paste("S: ", mean(output.S)))
}
Under H0 (case µ = 0), we obtained the results:

Test stat Type-I Error


Likelihood ratio test 0.0507
t-test 0.0505
Wilcoxon signed-rank test 0.053
Sign test 0.0441

(b) Under these alternatives, we obtained the results:

Test stat Power over N (µ, 1)


µ = 0.1 µ = 0.2 µ = 0.3 µ = 0.4
Likelihood ratio test 0.2631 0.6356 0.913 0.9895
t-test 0.261 0.6306 0.9085 0.9885
Wilcoxon signed-rank test 0.252 0.6164 0.8978 0.9847
Sign test 0.1805 0.4617 0.7521 0.9318

(c) The powers of the tests against N(µ, 1) decrease as we increasingly relax the distributional assumptions (from N(0, 1) to N(0, σ²) to any symmetric PDF f about 0 to any PDF f with median 0). The sign test makes the fewest distributional assumptions under H0, but its power is substantially lower than the other three tests. Hence if we have good reason to believe that the data distribution under H0 is symmetric (for example, if each data value is the difference of paired samples (Xi, Yi), and (Xi, Yi) should have the same distribution as (Yi, Xi) under H0), then we should at least opt for using the Wilcoxon test. The difference in powers between the Wilcoxon test, t-test, and the most-powerful likelihood ratio test is indeed very small, which supports Rice's claim (at least for the tested sample size n = 100).

Problem 30. (Testing gender ratios)

In a classical genetics study, Geissler (1889) studied hospital records in Saxony and compiled data on the gender ratio.
The following table shows the number of male children in 6115 families having 12 children:

Number of male children Number of families


0 7
1 45
2 181
3 478
4 829
5 1112
6 1343
7 1033
8 670
9 286
10 104
11 24
12 3

Let X1 , . . . , X6115 denote the number of male children in these 6115 families.
(a) Suggest two reasonable test statistics T1 and T2 for testing the null hypothesis
$$H_0: X_1, \ldots, X_{6115} \overset{\text{IID}}{\sim} \text{Binomial}(12, 0.5).$$

(This is intentionally open-ended; try to pick T1 and T2 to “target” different possible alternatives to the above null.)
Compute the values of T1 and T2 for the above data.
(b) Perform a simulation to simulate the null distributions of T1 and T2 . (For example: Simulate 6115 independent
samples X1 , . . . , X6115 from Binomial(12, 0.5), and compute T1 on this sample. Do this 1000 times to obtain 1000
simulated values of T1 . Do the same for T2 .) Plot the histograms of the simulated null distributions of T1 and T2 .
Using your simulated values, compute approximate p-values of the hypothesis tests based on T1 and T2 for the above
data. For either of your tests, can you reject H0 at significance level α = 0.05?
(c) In this example, why might the null hypothesis H0 not hold? (Please answer this question regardless of your findings
in part (b).)

Solution.

(a) There are many possible answers. We may take T1 to be the average number of male children per family,

T1 = X̄,

and perform a two-sided test based on T1 to check whether roughly half of the children are male. We may take T2 to be Pearson's chi-squared statistic
$$T_2 = \sum_{k=0}^{12} (O_k - E_k)^2/E_k,$$
where Ok is the number of families with k male children and Ek is the expected number under the hypothesized binomial distribution, i.e. $E_k = 6115 \times \binom{12}{k}(0.5)^{12}$, and perform a one-sided test that rejects for large T2 to check whether the shape of the observed distribution of X1, ..., X6115 matches the shape of the binomial PDF.

(b) R code corresponding to the above T1 and T2 is as follows:

ks = seq(0,12)
counts = c(7,45,181,478,829,1112,1343,1033,670,286,104,24,3)
expected = 6115*choose(12,ks)*(0.5^12)

T1_obs = sum(ks*counts)/6115
T2_obs = sum((counts-expected)^2/expected)

T1s = numeric(1000)
T2s = numeric(1000)
for (i in 1:1000) {
X = rbinom(6115, 12, 0.5)
T1s[i] = mean(X)
counts = numeric(13)
for (k in 0:12) {
counts[k+1] = length(which(X==k))
}
T2s[i] = sum((counts-expected)^2/expected)
}
hist(T1s)
hist(T2s)
T1_pvalue = length(which(T1s<T1_obs))/1000 * 2
T2_pvalue = length(which(T2s>T2_obs))/1000
Histograms of the null distributions of T1 and T2 are below:

The values of the test statistics for the observed data are T1 = 5.77 and T2 = 249, which are both far outside the
range of the simulated null distributions above. The simulated p-values for the two tests are both < 0.001, and there
is strong evidence that H0 is not correct.
(c) There may be biological and sociological reasons why H0 is false. Biologically, the human male-to-female sex ratio at
birth is not exactly 1 : 1. The probability p that a child is male might also vary from family to family. The sexes of
children within a family might be dependent; in particular, one source of dependence is the presence of identical twins.

Sociologically, there may be a relationship between family size and the sex ratio of children because the current sex
ratio influences parents’ decision of whether to have another child. Note that the given data is only for families
with 12 children, which is quite large even for that time. There is a noticeable bias towards families with more girls
than boys, which may be explained if parents tended to continue having children when their current children were
predominantly female.

Problem 31. (Monte Carlo Simulations)

(a) Consider the problem of approximating tail probabilities of a Standard Normal distribution. Find P(X > 2.04)
for X ∼ N (0, 1) using Monte Carlo methods. While we can easily calculate this in R using the pnorm function, check how
a Monte Carlo method would perform. How does the estimate change with the number of Monte Carlo samples?

(b) Consider the problem of calculating the expected present value of the payoff of a call option on a stock. Let S(t)
denote the price of the stock at time t. Consider a European call option, that is, the holder has the right to buy the stock
at a fixed price K at a fixed time T in the future, t = 0 being the current time. If S(T ) > K, the holder exercises the
option for a profit of S(T ) − K. If S(T ) ≤ K, the option expires worthless. The payoff at time T is thus

max{S(T ) − K, 0}.

The expected present value of this payoff is
$$E\left[ e^{-rT}\max\{S(T) - K, 0\} \right],$$
where r is the compound interest rate. The evolution of the stock price S(t) can be modeled via the Black-Scholes model expressed as
$$\frac{dS(t)}{S(t)} = r\,dt + \sigma\,dW(t),$$
where W is a standard Brownian motion. The solution to the above Stochastic Differential Equation is given by
$$S(T) = S(0)\exp\left( \left( r - \frac{1}{2}\sigma^2 \right)T + \sigma W(T) \right),$$
where S(0) is the current price of the stock and W(T) ∼ N(0, T), that is, $W(T) = \sqrt{T}Z$ with Z ∼ N(0, 1). Thus, the logarithm of S(T) is Normally distributed; in other words, the distribution of S(T) is log-normal. The expected payoff $E\left[ e^{-rT}\max\{S(T) - K, 0\} \right]$ is thus an integral w.r.t. the lognormal density.

Can you perform Monte Carlo sampling to estimate this integral (that is, find the expected payoff)? This will require
generating samples from the standard Normal distribution, followed by computing the value of the function inside the
expectation and then taking the average. Can you also compute the estimated standard error of the estimator?

Solution.
(a) N <- 1e06 # number of samples
t <- 2.04
Xsamp <- rnorm(N) #simulating from N(0,1)
tail.prob <- mean(Xsamp > t)
exact.tail.prob <- pnorm(t, lower.tail = FALSE)
c(MC.estimate = tail.prob, Exact.value = exact.tail.prob)
# Standard error of the estimate
MC.se <- sqrt(var(Xsamp > t)/N)
MC.se

library(tibble)
library(ggplot2)

MC_func <- function(N){


Xsamp <- rnorm(N) # simulate from N(0,1)
tail.prob <- mean(Xsamp > t)
MC.se <- sqrt(var(Xsamp > t)/N)
c(N = N, MC.estimate = tail.prob, MC.se = MC.se, Exact.value = exact.tail.prob)
}
N <- seq(1000, 100000, by = 1000)
out <- NULL
for (i in N) {
out <- rbind(out, MC_func(i))

}
out <- as_tibble(out)
ggplot(out, mapping = aes(N, MC.estimate)) +
  geom_line() + geom_hline(yintercept = exact.tail.prob, color = "blue")

(b) payoff_func <- function(S_0, r, s, T, K, N=1e06){


Z <- rnorm(N)
S_T <- S_0*exp((r - s^2/2)*T + s*sqrt(T)*Z)
payoff <- exp(-r*T)*pmax((S_T - K), 0) # pmax function in the formula
ex_payoff <- mean(payoff)
se_ex_payoff <- sqrt(var(payoff)/N)
out <- c(expected.payoff = ex_payoff, SE = se_ex_payoff)
return(out)
}
payoff_func(S_0 = 100, r = 0.07, s = 0.3, T = 5, K = 120)

Problem 32. (Simple Linear Regression)

A study was made on the effect of temperature on the yield of a chemical process. The following data were collected:

X -5 -4 -3 -2 -1 0 1 2 3 4 5
Y 1 5 4 7 10 8 9 13 14 13 18

(a) Assuming a model, Y = β0 + β1 X + ϵ, what are the least square estimates of β0 and β1 ? What is the fitted equation?
(b) Construct the ANOVA table and test the hypothesis H0 : β1 = 0 with α = 0.05.
(c) What are the confidence limits for β1 at α = 0.05?
(d) What are the confidence limits for the true mean value of Y when X = 3 at α = 0.05?

(e) What are the confidence limits at α = 0.05 level of significance for the difference between the true mean value of Y
when X1 = 3 and the true mean value of Y when X2 = −2?
[Given: F0.05,1,9 = 5.12, t0.05,9 = 1.833, t0.025,9 = 2.263]

Solution:

(a) The given data are (xi, yi), i = 1, 2, ..., 11. The linear regression model is
$$y_i = \beta_0 + \beta_1x_i + \epsilon_i, \quad \text{thus} \quad S = \sum_{i=1}^{11}(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i)^2,$$
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{11}x_iy_i - n\bar{x}\bar{y}}{\sum_{i=1}^{11}x_i^2 - n\bar{x}^2} = \frac{158}{110} = 1.44; \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = 9.27.$$
Thus, $\hat{y}_i = 9.27 + 1.44x_i$ is the fitted model.

(b) The sums of squares can be calculated as follows:
$$SST = \sum_{i=1}^{11}(y_i - \bar{y})^2 = 248.18, \quad SS_{Reg} = \hat{\beta}_1^2S_{xx} = 226.94, \quad SS_{Res} = \sum_{i=1}^{11}(y_i - \hat{y}_i)^2 = SST - SS_{Reg} = 21.24.$$

Thus, the ANOVA table is

Source       DF   SS       MS = SS/DF   F-obs = MSReg/MSRes   F-tab
Regression    1   226.94   226.94       96.17                 F0.05,1,9 = 5.12
Residual      9    21.24     2.36
Total        10   248.18

To test the hypothesis H0: β1 = 0 vs H1: β1 ≠ 0:
Here, we reject H0 since F-obs > F-tab; Y and X are linearly dependent.
Alternative method for testing H0: β1 = 0 vs H1: β1 ≠ 0:
Under H0: β1 = 0, $t = \frac{\hat{\beta}_1}{\sqrt{MS_{Res}/S_{xx}}} \sim t_{n-2}$. Hence $t = \frac{1.44}{\sqrt{2.36/110}} = 9.83 > t_{0.05,9} = 1.833$, so we reject H0: β1 = 0. Note that F ≡ t² under H0.

(c) The confidence interval for β1 can be computed as follows:
$$\hat{\beta}_1 \sim N\left( \beta_1, \frac{\sigma^2}{S_{xx}} \right) \Rightarrow \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2/S_{xx}}} \sim N(0, 1) \quad [\sigma^2 \text{ is unknown and estimated by } MS_{Res}]$$
$$\Rightarrow \frac{\hat{\beta}_1 - \beta_1}{\sqrt{MS_{Res}/S_{xx}}} \sim t_{n-2} \;\Rightarrow\; P\left[ -t_{\alpha/2,n-2} \leq \frac{\hat{\beta}_1 - \beta_1}{\sqrt{MS_{Res}/S_{xx}}} \leq t_{\alpha/2,n-2} \right] = 1 - \alpha$$
$$\Rightarrow \hat{\beta}_1 - t_{\alpha/2,n-2}\sqrt{\frac{MS_{Res}}{S_{xx}}} \leq \beta_1 \leq \hat{\beta}_1 + t_{\alpha/2,n-2}\sqrt{\frac{MS_{Res}}{S_{xx}}}$$
$$\Rightarrow 1.44 - 2.263 \times 0.146 \leq \beta_1 \leq 1.44 + 2.263 \times 0.146 \;\Rightarrow\; 1.11 \leq \beta_1 \leq 1.77.$$

(d) The confidence limits for E(Y | X = 3) can be computed as follows:
The 95% CI is for E(Y at X = x0) = β0 + β1x0, and an unbiased estimator of it is $\hat{\beta}_0 + \hat{\beta}_1x_0$. Therefore,
$$\hat{\beta}_0 + \hat{\beta}_1x_0 \sim N\left( \beta_0 + \beta_1x_0,\ \sigma^2\left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right] \right),$$
$$A = \frac{\hat{\beta}_0 + \hat{\beta}_1x_0 - (\beta_0 + \beta_1x_0)}{\sqrt{MS_{Res}\left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]}} \sim t_{n-2}, \qquad P\left[ -t_{\alpha/2,n-2} \leq A \leq t_{\alpha/2,n-2} \right] = 1 - \alpha.$$
$$\therefore\ \left( \hat{\beta}_0 + \hat{\beta}_1x_0 \right) \pm t_{\alpha/2,n-2}\sqrt{MS_{Res}\left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right)} \text{ is the confidence interval for } \beta_0 + \beta_1x_0.$$
Thus, replacing the values, we get the required confidence interval 12.15 ≤ β0 + β1x0 ≤ 15.03.
(e) The confidence limits for the difference between the true mean value of Y when X1 = 3 and the true mean value of Y when X2 = −2 can be computed as follows:
E(Y at X1 = 3) − E(Y at X2 = −2) ≡ Z1 − Z2.
Unbiased estimators of Z1 and Z2 are $\hat{Z}_1 = \hat{\beta}_0 + 3\hat{\beta}_1$ and $\hat{Z}_2 = \hat{\beta}_0 - 2\hat{\beta}_1$.
Thus, $\hat{Z}_1 - \hat{Z}_2 = (\hat{\beta}_0 + 3\hat{\beta}_1) - (\hat{\beta}_0 - 2\hat{\beta}_1) = 5\hat{\beta}_1 = 7.20$ (point estimate).
Confidence interval for Z1 − Z2:
$$\text{Var}(\hat{Z}_1 - \hat{Z}_2) = \text{Var}(5\hat{\beta}_1) = \frac{25\sigma^2}{S_{xx}} = \frac{25\sigma^2}{110}; \quad \hat{Z}_1 - \hat{Z}_2 \sim N\left( Z_1 - Z_2, \frac{25\sigma^2}{110} \right) \;\Rightarrow\; \frac{(\hat{Z}_1 - \hat{Z}_2) - (Z_1 - Z_2)}{\sqrt{\frac{25\,MS_{Res}}{110}}} \sim t_{n-2}.$$
So the C.I. for Z1 − Z2 is
$$(\hat{Z}_1 - \hat{Z}_2) - t_{\alpha/2,9}\sqrt{\frac{25 \times 2.36}{110}} \leq Z_1 - Z_2 \leq (\hat{Z}_1 - \hat{Z}_2) + t_{\alpha/2,9}\sqrt{\frac{25 \times 2.36}{110}} \;\Rightarrow\; 5.54 \leq Z_1 - Z_2 \leq 8.86.$$
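These hand calculations can be cross-checked in R; a short sketch with the data typed in:

# Simple linear regression checks for Problem 32.
x <- -5:5
y <- c(1, 5, 4, 7, 10, 8, 9, 13, 14, 13, 18)
fit <- lm(y ~ x)
summary(fit)                                        # beta0 ~ 9.27, beta1 ~ 1.44
anova(fit)                                          # ANOVA table
confint(fit, level = 0.95)                          # CI for beta0 and beta1
predict(fit, newdata = data.frame(x = 3),
        interval = "confidence", level = 0.95)      # CI for E(Y | X = 3)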

Problem 33. (Fake Data in Linear Regression)

There are a few occasions where it makes sense to fit a model without an intercept β0. If there were an occasion to fit the model y = βx + ϵ to a set of data (x1, y1), (x2, y2), ..., (xn, yn), the least squares estimate of β would be
$$\hat{\beta} = b = \frac{\sum_{i=1}^n x_iy_i}{\sum_{i=1}^n x_i^2}.$$
Suppose you have a programmed calculator that will fit only the intercept model y = β0 + β1x + ϵ, but you want to fit a non-intercept model. By adding one more fake data point (mx̄, mȳ) to the data above, where $m = \frac{n}{(n+1)^{1/2} - 1} = \frac{n}{a}$, say, and letting the calculator fit y = β0 + β1x + ϵ, can you estimate β by using b?

Solution:

Given the data set (x1, y1), (x2, y2), ..., (xn, yn), we fit y = βx + ϵ, where the least squares estimate of β is calculated as follows:
$$SSE = S = \sum(y_i - \hat{y}_i)^2 = \sum(y_i - \hat{\beta}x_i)^2; \quad \frac{\partial S}{\partial \beta} = 0 \Rightarrow \sum(y_i - \hat{\beta}x_i)x_i = 0 \Rightarrow \hat{\beta} = \frac{\sum x_iy_i}{\sum x_i^2}.$$
For the computer-fitted model with intercept, y = β0 + β1x + ϵ, the LSEs are
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}, \qquad \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = \frac{\sum x_iy_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}.$$
Adding one more data point (mx̄, mȳ) with $m = \frac{n}{(n+1)^{1/2} - 1} = \frac{n}{a}$ gives $(a + 1)^2 = n + 1$.
Thus the new dataset is given by (u1, v1), (u2, v2), ..., (un, vn), (un+1, vn+1) = (x1, y1), (x2, y2), ..., (xn, yn), (mx̄, mȳ).
Hence we have
$$\bar{u} = \frac{n\bar{x} + m\bar{x}}{n+1} = \frac{n\bar{x} + (n/a)\bar{x}}{n+1} = \frac{n\bar{x}\left( \frac{a+1}{a} \right)}{(a+1)^2} = \frac{n\bar{x}}{a(a+1)}, \qquad \text{similarly } \bar{v} = \frac{n\bar{y}}{a(a+1)},$$
$$S_{uu} = \sum_{i=1}^{n+1}u_i^2 - (n+1)\bar{u}^2 = \sum_{i=1}^n x_i^2 + (m\bar{x})^2 - (n+1)\bar{u}^2 = \sum_{i=1}^n x_i^2 + \frac{n^2}{a^2}\bar{x}^2 - (n+1)\frac{n^2\bar{x}^2}{a^2(a+1)^2} = \sum_{i=1}^n x_i^2,$$
$$S_{uv} = \sum_{i=1}^{n+1}u_iv_i - (n+1)\bar{u}\bar{v} = \sum_{i=1}^n x_iy_i + (m\bar{x})(m\bar{y}) - (n+1)\frac{n\bar{x}}{a(a+1)}\cdot\frac{n\bar{y}}{a(a+1)} = \sum_{i=1}^n x_iy_i + m^2\bar{x}\bar{y} - m^2\bar{x}\bar{y} = \sum_{i=1}^n x_iy_i.$$
For the programmed calculator,
$$\hat{\beta}_1 = \frac{S_{uv}}{S_{uu}} = \frac{\sum_{i=1}^n x_iy_i}{\sum_{i=1}^n x_i^2} = b.$$
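A short R sketch verifying the fake-data-point trick on simulated data (the (x, y) values below are an assumed example; any data set would do):

# The augmented with-intercept fit reproduces the no-intercept slope sum(x*y)/sum(x^2).
set.seed(42)
n <- 10
x <- runif(n, 1, 5); y <- 2 * x + rnorm(n)            # assumed example data
b_no_int <- sum(x * y) / sum(x^2)                      # no-intercept LSE
m <- n / (sqrt(n + 1) - 1)
x_aug <- c(x, m * mean(x)); y_aug <- c(y, m * mean(y))
b_aug <- coef(lm(y_aug ~ x_aug))["x_aug"]              # slope from the intercept model
c(b_no_int, b_aug)                                     # the two estimates agree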

Problem 34. (Multiple Linear Regression)

Fit the model Y = β0 + β1X1 + β2X2 + ϵ for the data given below. Provide an ANOVA table and perform the partial F-tests to test H0: βi = 0 vs H1: βi ≠ 0 for i = 1, 2, given that the other variable is already in the model. Comment on the relative contributions of the variables X1 and X2, depending on whether they enter the model first or second. Find the regression equation.

X1 -5 -4 -1 2 2 3 3
X2 5 4 1 -3 -2 -2 -3
Y 11 11 8 2 5 5 4

Solution:

We fit the model Y = β0 + β1X1 + β2X2 + ϵ with both X1 and X2. The OLS estimates can be obtained using $\hat{\beta} = (X^TX)^{-1}X^TY$:
$$X = \begin{pmatrix} 1 & -5 & 5 \\ 1 & -4 & 4 \\ 1 & -1 & 1 \\ 1 & 2 & -3 \\ 1 & 2 & -2 \\ 1 & 3 & -2 \\ 1 & 3 & -3 \end{pmatrix}; \qquad \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} = \begin{pmatrix} 7 & 0 & 0 \\ 0 & 68 & -67 \\ 0 & -67 & 68 \end{pmatrix}^{-1}\begin{pmatrix} 46 \\ -66 \\ 69 \end{pmatrix} = \begin{pmatrix} \frac{46}{7} \\ 1 \\ 2 \end{pmatrix}, \qquad \therefore \hat{Y} = \frac{46}{7} + X_1 + 2X_2.$$

Source DF SS MS F-obs F-tab


Regression 2 72.00 36.00 83.72 F0.05,2,4 = 6.94
Residual 4 1.71 0.43
Total 6 73.71

$$SST = \sum\left( Y_i - \bar{Y} \right)^2 = 73.71, \qquad e_i = y_i - \hat{y}_i, \qquad SS_{Res} = \sum_{i=1}^n e_i^2.$$
To test H0: β1 = β2 = 0 vs. H1: H0 is not true, we have F = 83.72 > F0.05,2,4 = 6.94 ⇒ H0 is rejected.

Partial F-test: Fit the model with X1 only, Y = β0 + β1X1 + ϵ, giving $\hat{Y} = \frac{46}{7} - \frac{66}{68}X_1$, and test the hypothesis H0: β2 = 0 vs H1: β2 ≠ 0:
$$F = \frac{\{SS_{Reg}(\text{Full}) - SS_{Reg}(\text{Restricted Model})\}/1}{MS_{Res}} = \frac{72 - 64.06}{0.43} = 18.53 > F_{0.05,1,4} = 7.71.$$
∴ H0 is rejected at the 5% level of significance, i.e., X2 is significant in the presence of X1.

Partial F-test: Fit the model with X2 only, Y = β0 + β2X2 + ϵ, giving $\hat{Y} = \frac{46}{7} + \frac{69}{68}X_2$, and test the hypothesis H0: β1 = 0 vs H1: β1 ≠ 0:
$$F = \frac{\{SS_{Reg}(\text{Full}) - SS_{Reg}(\text{Restricted Model})\}/1}{MS_{Res}} = \frac{72 - 70.01}{0.43} = 4.64 < F_{0.05,1,4} = 7.71.$$
∴ H0 is not rejected at the 5% level of significance, i.e., X1 is not significant in the presence of X2.

Implication: If X2 is in the model, we do not need X1. If X1 enters first, X2 still helps significantly. X2 is clearly the more useful variable: it alone explains $R^2 = \frac{70.01}{73.71} = 95\%$ of the total variability in Y about its mean, whereas X1 alone explains $R^2 = \frac{64.06}{73.71} \approx 87\%$, and X1 and X2 together explain $\frac{72.00}{73.71} = 97\%$ of the total variability.

Note: In this problem, X1 + X2 ≈ 0, so there is strong multicollinearity (X1 and X2 are nearly linearly dependent); that is why the partial F-test accepts H0: β1 = 0 once X2 is in the model.
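These results can be reproduced in R with lm() and nested-model F-tests; a brief sketch:

# Fit and partial F-tests for Problem 34.
x1 <- c(-5, -4, -1, 2, 2, 3, 3)
x2 <- c( 5,  4,  1, -3, -2, -2, -3)
y  <- c(11, 11, 8, 2, 5, 5, 4)
full <- lm(y ~ x1 + x2)
coef(full)                       # ~ (46/7, 1, 2)
anova(lm(y ~ x1), full)          # partial F-test for X2 given X1
anova(lm(y ~ x2), full)          # partial F-test for X1 given X2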

Problem 35. (More on Multiple Linear Regression)

Given a 2-variable linear regression problem Y = β1 + β2X1 + β3X2 + ϵ, the data yield the following:
$$X^TX = \begin{pmatrix} 33 & 0 & 0 \\ 0 & 40 & 20 \\ 0 & 20 & 60 \end{pmatrix}, \qquad X^TY = \begin{pmatrix} 132 \\ 24 \\ 92 \end{pmatrix}, \qquad \sum(Y - \bar{Y})^2 = 150.$$
(a) What is the sample size?
(b) Write the normal equations and solve for the regression coefficients.
(c) Estimate the standard error of β2 and test the hypothesis that β2 = 0
(d) Compute R2 and interpret it. Also, interpret the value of regression coefficients.
(e) Predict the value of y given x1 = −4 and x2 = 2
(f) Comment on the possibilities of any regressors being a dummy variable.

Solution:
 
(a) With design matrix
$$X = \begin{pmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} \end{pmatrix},$$
we have
$$X^TX = \begin{pmatrix} n & \sum x_{i1} & \sum x_{i2} \\ \sum x_{i1} & \sum x_{i1}^2 & \sum x_{i1}x_{i2} \\ \sum x_{i2} & \sum x_{i1}x_{i2} & \sum x_{i2}^2 \end{pmatrix} = \begin{pmatrix} 33 & 0 & 0 \\ 0 & 40 & 20 \\ 0 & 20 & 60 \end{pmatrix},$$
so the sample size is n = 33. Similarly,
$$X^TY = \begin{pmatrix} \sum y_i \\ \sum x_{i1}y_i \\ \sum x_{i2}y_i \end{pmatrix} = \begin{pmatrix} 132 \\ 24 \\ 92 \end{pmatrix}.$$

(b) We know the normal equations are given by $(X^TX)\hat{\beta} = X^TY$. Thus we have
$$\begin{pmatrix} 33 & 0 & 0 \\ 0 & 40 & 20 \\ 0 & 20 & 60 \end{pmatrix}\begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix} = \begin{pmatrix} 132 \\ 24 \\ 92 \end{pmatrix} \;\Rightarrow\; \begin{pmatrix} 33 & 0 & 0 \\ 0 & 40 & 20 \\ 0 & 0 & 50 \end{pmatrix}\begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix} = \begin{pmatrix} 132 \\ 24 \\ 80 \end{pmatrix} \quad \left( \text{Gaussian elimination: } R_3' = R_3 - \tfrac{1}{2}R_2 \right).$$
Thus we obtain b3 = 80/50 = 1.60; 40b2 + 20b3 = 24 ⇒ 40b2 = −8 ⇒ b2 = −0.20; and 33b1 = 132 ⇒ b1 = 4. Hence, the regression equation is ŷ = 4 − 0.20x1 + 1.60x2.

(c) $|X^TX| = 33(2400 - 400) = 66000$. The standard error of b2 is $\text{S.E.}(b_2) = S\sqrt{c_{22}}$, where
$$S^2 = \frac{RSS}{n - k} = \frac{7.6}{33 - 3} = 0.253 \quad \text{and} \quad c_{22} = \frac{\text{cofactor of the }(2,2)\text{ element}}{|X^TX|} = \frac{1980}{66000} = 0.03,$$
so $\text{S.E.}(b_2) = \sqrt{0.253 \times 0.03} \approx 0.087$.
To test the hypothesis H0: b2 = 0 vs. H1: b2 ≠ 0, we use the test statistic $t = \frac{b_2}{\text{S.E.}(b_2)} = \frac{-0.20}{0.087} \approx -2.29$. Since $|t| = 2.29 > t_{0.025,30} \approx 2.04$, we reject H0.
(d) TSS = RSS + SSReg. Given that TSS = 150, we can calculate
$$SS_{Reg} = \beta^{*T}(X^{*T}Y) = \begin{pmatrix} -0.20 & 1.60 \end{pmatrix}\begin{pmatrix} 24 \\ 92 \end{pmatrix} = 142.4,$$
and thus RSS = 150 − 142.4 = 7.6.
The coefficient of determination is R² = 142.4/150 = 0.9493: the fitted model explains 94.93% of the total variability in the response variable. The coefficient b2 = −0.20 is the estimated change in y per unit increase in x1 (holding x2 fixed), and b3 = 1.60 is the estimated change in y per unit increase in x2 (holding x1 fixed).
(e) The predicted value of y for the given values of x1 = −4 and x2 = 2 is given by ŷ = 4 − 0.20(−4) + 1.60(2) = 8.
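The normal-equation arithmetic can be checked directly from the given summary matrices; a small R sketch:

# Solve (X'X) b = X'Y for Problem 35 and recover R^2 and the prediction.
XtX <- matrix(c(33, 0, 0,
                 0, 40, 20,
                 0, 20, 60), nrow = 3, byrow = TRUE)
XtY <- c(132, 24, 92)
b <- solve(XtX, XtY)
b                                          # 4.0, -0.2, 1.6
SSReg <- sum(b[-1] * XtY[-1]); c(SSReg = SSReg, R2 = SSReg / 150)
sum(b * c(1, -4, 2))                       # prediction at x1 = -4, x2 = 2 -> 8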

Problem 36. (Presence of Multicollinearity)

Can we use the data given below to get a unique fit to the model Y = β0 X0 + β1 X1 + β2 X2 + β3 X3 + ϵ

X1 -4 3 1 4 -3 -1
X2 1 2 3 4 5 6
X3 3 -5 -4 -8 -2 -5
Y 7.4 14.7 13.9 18.2 12.1 14.8

Solution:
The LSE of the regression model Y = Xβ + ϵ is given by $\hat{\beta} = (X^TX)^{-1}X^TY$, with
$$X = \begin{pmatrix} 1 & -4 & 1 & 3 \\ 1 & 3 & 2 & -5 \\ 1 & 1 & 3 & -4 \\ 1 & 4 & 4 & -8 \\ 1 & -3 & 5 & -2 \\ 1 & -1 & 6 & -5 \end{pmatrix} \quad (\text{columns } x_0, x_1, x_2, x_3).$$
Note that X1 + X2 + X3 = 0, which implies that $X^TX$ is a singular matrix, i.e., $|X^TX| = 0$. So no, we cannot compute β̂ uniquely here.
Problem 37. (Residual Vs. Response Variable)

Show that in linear regression with a β0 term in the model:


(a) The correlation between the vector e and Y is (1 − R2 )1/2 . This result implies that it is a mistake to find defective
regressions by a plot of residuals ei versus observations Yi as this always shows a slope.
(b) Show further that the correlation between e and Ŷ is zero.

Solution:

(a) With $\bar{e} = 0$ (because β0 is in the model),
$$\text{Cor}(e, Y) = \frac{\sum(e_i - \bar{e})(Y_i - \bar{Y})}{\sqrt{\sum(e_i - \bar{e})^2\sum(Y_i - \bar{Y})^2}}, \qquad \sum(e_i - \bar{e})(Y_i - \bar{Y}) = \sum e_i(Y_i - \bar{Y}) = \sum e_iY_i = Y^Te.$$
Note that
$$Y = X\beta + \epsilon, \quad \hat{\beta} = (X^TX)^{-1}X^TY \Rightarrow \hat{Y} = X(X^TX)^{-1}X^TY = X\hat{\beta} = HY, \text{ where } H = X(X^TX)^{-1}X^T,$$
$$e = Y - \hat{Y} = (I - H)Y \;\Rightarrow\; e^Te = Y^T(I - H)^T(I - H)Y = Y^T(I - H)Y = Y^Te \quad [\text{since } H^2 = H].$$
Thus we have
$$\text{Cor}(e, Y) = \frac{Y^Te}{\sqrt{(e^Te)\,SST}} = \sqrt{\frac{e^Te}{SST}} = \sqrt{\frac{SS_{Res}}{SST}} = \sqrt{1 - \frac{SS_{Reg}}{SST}} = \sqrt{1 - R^2}.$$
Implication: This is why we plot ei against Ŷi rather than against Yi, since e and Y are always correlated.
(b) The correlation between e and Ŷ can be computed as follows:
$$\text{Cov}(e, \hat{Y}) = \sum(e_i - \bar{e})\left( \hat{Y}_i - \bar{\hat{Y}} \right) = e^T\hat{Y} = Y^T(I - H)HY = Y^T(H - H^2)Y = 0 \quad [\text{since } \hat{Y} = HY,\ e = (I - H)Y],$$
hence Cor(e, Ŷ) = 0.
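A tiny numerical illustration in R (the data are simulated; any regression with an intercept behaves the same way):

# cor(e, Y) = sqrt(1 - R^2) and cor(e, Yhat) = 0 for a model with an intercept.
set.seed(7)
x <- rnorm(30); y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)
e <- resid(fit)
c(cor(e, y), sqrt(1 - summary(fit)$r.squared))   # equal
cor(e, fitted(fit))                              # essentially 0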

Problem 38. (Multiple Coefficient of Determination)

Prove that the multiple coefficient of determination R2 equals the square of the correlation between Y and Ŷ .
Solution: We have to prove that $[\text{Cor}(Y, \hat{Y})]^2 = R^2 = \frac{SS_{Reg}}{SST}$.
Note that $\sum e_i = 0 \Rightarrow \sum(Y_i - \hat{Y}_i) = 0 \Rightarrow \sum Y_i = \sum\hat{Y}_i \Rightarrow \bar{Y} = \bar{\hat{Y}}$ [since $Y_i = \hat{Y}_i + e_i$]. Also, $\text{Cor}(e, \hat{Y}) = 0 \Rightarrow \text{Cov}(e, \hat{Y}) = 0 \Rightarrow \sum(\hat{Y}_i - \bar{Y})e_i = 0$. Therefore
$$r_{Y\hat{Y}} = \frac{\sum(\hat{Y}_i - \bar{\hat{Y}})(Y_i - \bar{Y})}{\sqrt{\sum(\hat{Y}_i - \bar{\hat{Y}})^2\sum(Y_i - \bar{Y})^2}} = \frac{\sum(\hat{Y}_i - \bar{Y})\left[ (\hat{Y}_i - \bar{Y}) + e_i \right]}{\sqrt{\sum(Y_i - \bar{Y})^2\sum(\hat{Y}_i - \bar{Y})^2}} = \sqrt{\frac{\sum(\hat{Y}_i - \bar{Y})^2}{\sum(Y_i - \bar{Y})^2}} = \sqrt{\frac{SS_{Reg}}{SST}} = \sqrt{R^2}.$$

Problem 39. (Polynomial Regression)

A newborn baby was weighed weekly. Twenty such weights are shown below, recorded in ounces. Fit to the data, using orthogonal polynomials, a polynomial model of a degree justified by the accuracy of the figures; that is, test as you go along for the significance of the linear, quadratic, and so forth, terms.

No. of weeks 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Weights 141 144 148 150 158 161 166 170 175 181 189 194 196 206 218 229 234 242 247 257

Solution:

We wish to fit the model
$$Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \ldots + \beta_kX^k + \epsilon.$$
Polynomial fitting using orthogonal polynomials:
$$Y = \alpha_0 + \alpha_1P_1(X) + \alpha_2P_2(X) + \ldots + \alpha_kP_k(X) + \epsilon, \qquad P_i(x) \text{ being the } i\text{th orthogonal polynomial}.$$
The parameters can be estimated as $\hat{\alpha}_0 = \bar{y}$ and $\hat{\alpha}_j = \frac{\sum P_j(x_i)y_i}{\sum P_j^2(x_i)}$. Given the data (xi, yi), we compute the total variability (SST) and then build a model to explain this variability. SSReg(α1) represents how much of the total variability in Y is explained by the linear term, and similarly for the quadratic and cubic terms:
$$SS_{Reg}(\alpha_1) = \hat{\alpha}_1\sum_{i=1}^n y_iP_1(x_i) = 25438.75 \text{ (linear)}, \qquad SS_{Reg}(\alpha_2) = \hat{\alpha}_2\sum_{i=1}^n y_iP_2(x_i) = 489 \text{ (quadratic)},$$
$$SS_{Reg}(\alpha_3) = \hat{\alpha}_3\sum_{i=1}^n y_iP_3(x_i) = 1.15 \text{ (cubic)}, \qquad SST = \sum_{i=1}^n (y_i - \bar{y})^2 = 26018.$$

Source             DF   SS         MS         F-obs     F-tab
Regression (α1)     1   25438.75   25438.75   4558.98   F0.05,1,16 = 4.49
Regression (α2)     1   489        489        84.63
Regression (α3)     1   1.15       1.15       0.21
Residual           16   89.30      5.58
Total              19   26018

As we can see, the linear term explains the major part of the variability, and α2 is also significant by its F-value, while the cubic term is not. Thus the final model is ŷ = 136.227 + 2.68x + 0.167x² (a quadratic fit). If the residual sum of squares were still large, one could go on to check a fourth-degree polynomial fit.
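The same analysis can be reproduced in R with poly(), which constructs the orthogonal polynomial terms; a sketch with the weights typed in:

# Orthogonal-polynomial fit for the weekly weight data (Problem 39).
week   <- 1:20
weight <- c(141, 144, 148, 150, 158, 161, 166, 170, 175, 181,
            189, 194, 196, 206, 218, 229, 234, 242, 247, 257)
fit3 <- lm(weight ~ poly(week, 3))     # orthogonal linear, quadratic, cubic terms
summary(fit3)$coefficients             # t-tests: linear and quadratic significant, cubic not
coef(lm(weight ~ week + I(week^2)))    # raw-scale quadratic fit; compare with 136.227 + 2.68x + 0.167x^2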

Problem 40. (Outliers)

If you are asked to fit a straight line to the data (X, Y ) = (1, 3), (2, 2.5), (2, 1.2), (3, 1), and (6, 4.5). What would you do
about it?

Solution:

From a scatter plot of the data, we can observe that (6, 4.5) is an influential observation. The regression line has a positive slope when the influential observation is included and a negative slope when it is excluded.

Recommendation: Influential observations can be set aside if they are few in number, but it is better to investigate them; some additional observations between X = 3 and X = 6 would be useful here.
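A short R sketch reproducing the comparison of the two fits:

# Influence of the point (6, 4.5) on the fitted slope (Problem 40).
x <- c(1, 2, 2, 3, 6)
y <- c(3, 2.5, 1.2, 1, 4.5)
coef(lm(y ~ x))                    # positive slope with all five points
coef(lm(y ~ x, subset = x < 6))    # negative slope without (6, 4.5)
plot(x, y); abline(lm(y ~ x))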

Problem 41. (Calculating R2 but Friends Forever)

Your friend says he/she has fitted a plane to n = 33 observations on (X1 , X2 , Y ) and his/ her overall regression (given β0 )
is just significant at the α = 0.05 level. You ask him/ her for R2 value, but s/he does not know. You work it out for him/
her based on what s/he has told you.

Solution:

The equation of the plane is


Y = β 0 + β 1 X1 + β 2 X2 + ϵ
Here we have n = 33 observations

Source DF SS MS F-obs F-tab


Regression 2 SSReg M SReg F F0.05,2,30 = 3.32
Residual 30 SSRes M SRes
Total 32 SST

$$R^2 = \frac{SS_{Reg}}{SST} = \frac{SS_{Reg}}{SS_{Reg} + SS_{Res}} = \frac{SS_{Reg}/MS_{Res}}{\frac{SS_{Reg}}{MS_{Res}} + \frac{SS_{Res}}{MS_{Res}}} = \frac{2MS_{Reg}/MS_{Res}}{\frac{2MS_{Reg}}{MS_{Res}} + 30} = \frac{2F}{2F + 30} = \frac{2 \times 3.32}{(2 \times 3.32) + 30} = 0.1812.$$

∴ Only about 18% of the total variability is explained by the model.

Implication: R² remains a useful measure of goodness of fit even when the F-test declares the regression significant; here the regression is "just significant" yet explains only 18% of the variation, so statistical significance alone does not imply a good fit.

Problem 42. (Understanding Regression Output)

You are given a regression printout that shows a plane fitted to X1, X2, X3, X4, X5 plus an intercept term, obtained from a set of 50 observations. The overall F for regression is ten times as high as the 5% upper-tail F percentage point. How big is R²?

Solution:

The regression equation is


Y = β 0 + β 1 X1 + β 2 X2 + β 3 X3 + β 4 X4 + β 5 X5 + ϵ
The ANOVA table is given below.

Source DF SS MS F-obs F-tab


Regression 5 SSReg M SReg F F0.05,5,44 = 2.43
Residual 44 SSRes M SRes 10 × F0.05,5,44 = 24.3
Total 49 SST

We know
$$R^2 = \frac{SS_{Reg}}{SST} = \frac{SS_{Reg}}{SS_{Reg} + SS_{Res}} = \frac{SS_{Reg}/MS_{Res}}{\frac{SS_{Reg}}{MS_{Res}} + \frac{SS_{Res}}{MS_{Res}}} = \frac{5MS_{Reg}/MS_{Res}}{\frac{5MS_{Reg}}{MS_{Res}} + 44} = \frac{5F}{5F + 44} = \frac{5 \times 24.3}{(5 \times 24.3) + 44} \approx 0.734.$$

Conclusion: About 73.4% of the total variability in the response variable is explained by the fitted model.

Problem 43. (Weighted Least Square)

Consider the simple linear regression model: yi = β0 + β1 xi + ϵi , where the variance of ϵi is proportional to x2i , i.e.,
V (ϵi ) = σ 2 x2i (assumption of constant variance is NOT satisfied).

(a) Suppose that we use the transformations y′ = y/x and x′ = 1/x. Is this a variance-stabilizing transformation?
(b) What are the relationships between the parameters in the original and the transformed model?
(c) Suppose we use the method of weighted least squares with $w_i = \frac{1}{x_i^2}$. Is this equivalent to the transformation introduced in part (a)?

Solution:

(a)
$$y_i = \beta_0 + \beta_1x_i + \epsilon_i \quad \longrightarrow \text{ original model},$$
$$\frac{y_i}{x_i} = \frac{\beta_0}{x_i} + \beta_1 + \frac{\epsilon_i}{x_i}, \qquad y_i' = \beta_0x_i' + \beta_1 + \epsilon_i' \quad \longrightarrow \text{ transformed model}.$$
Now $\text{Var}(\epsilon_i') = \text{Var}\left( \frac{\epsilon_i}{x_i} \right) = \frac{\sigma^2x_i^2}{x_i^2} = \sigma^2$. Yes, it is a variance-stabilizing transformation.
(b) The intercept β0 of the original model becomes the slope (the coefficient of x′) in the transformed model, and the original slope β1 becomes the intercept of the transformed model.
(c) The weighted least squares criterion:
$$S_{OLS}(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1x_i)^2 \quad \Longrightarrow \text{ Ordinary Least Squares},$$
$$S_{WLS}(\beta_0, \beta_1) = \sum_{i=1}^n w_i(y_i - \beta_0 - \beta_1x_i)^2 = \sum_{i=1}^n \frac{1}{x_i^2}(y_i - \beta_0 - \beta_1x_i)^2 = \sum_{i=1}^n \left( \frac{y_i}{x_i} - \frac{\beta_0}{x_i} - \beta_1 \right)^2 \quad \Longrightarrow \text{ Weighted Least Squares}.$$
We obtain β0 and β1 by minimizing $S_{WLS}(\beta_0, \beta_1)$ in the weighted least squares method.

For the transformed model, the OLS criterion is
$$S^*(\beta_0, \beta_1) = \sum_{i=1}^n (y_i' - \beta_0x_i' - \beta_1)^2 = \sum_{i=1}^n \left( \frac{y_i}{x_i} - \frac{\beta_0}{x_i} - \beta_1 \right)^2,$$
and β0 and β1 are obtained by minimizing $S^*(\beta_0, \beta_1)$. Here $S^*(\beta_0, \beta_1) \equiv S_{WLS}(\beta_0, \beta_1)$, so the two approaches are equivalent.
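A brief R check of the equivalence on simulated data (the data-generating values are assumed for illustration, with Var(ϵ) ∝ x²):

# WLS with weights 1/x^2 agrees with OLS on the transformed model (y/x on 1/x).
set.seed(3)
x <- runif(40, 1, 10)
y <- 2 + 3 * x + rnorm(40, sd = 0.5 * x)        # error s.d. proportional to x
wls <- lm(y ~ x, weights = 1 / x^2)
tr  <- lm(I(y / x) ~ I(1 / x))                  # intercept estimates beta1, slope estimates beta0
coef(wls)                                       # (beta0, beta1)
coef(tr)[c(2, 1)]                               # identical estimates, roles swapped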

Problem 44. (Autoregression)

Consider the simple linear regression model yt = β0 + β1xt + ϵt where the errors are generated by the second-order autoregressive process
$$\epsilon_t = \rho_1\epsilon_{t-1} + \rho_2\epsilon_{t-2} + z_t,$$
where zt is an NID(0, σ²z) random variable, and ρ1 and ρ2 are autocorrelation parameters. Discuss how the Cochrane-Orcutt iterative procedure could be used in this situation. What transformations would be appropriate for the variables yt and xt? How would you estimate the parameters ρ1 and ρ2?

Solution:
yt = β0 + β1xt + ϵt where ϵt = ρ1ϵt−1 + ρ2ϵt−2 + zt, $z_t \overset{\text{ind}}{\sim} N(0, \sigma_z^2)$. [In OLS-based SLR we assume $\epsilon_t \overset{\text{ind}}{\sim} N(0, \sigma^2)$, which is not true here.]
$$y_t \to y_t' = y_t - \rho_1y_{t-1} - \rho_2y_{t-2} = (\beta_0 + \beta_1x_t + \epsilon_t) - \rho_1(\beta_0 + \beta_1x_{t-1} + \epsilon_{t-1}) - \rho_2(\beta_0 + \beta_1x_{t-2} + \epsilon_{t-2})$$
$$= (\beta_0 - \rho_1\beta_0 - \rho_2\beta_0) + \beta_1(x_t - \rho_1x_{t-1} - \rho_2x_{t-2}) + (\epsilon_t - \rho_1\epsilon_{t-1} - \rho_2\epsilon_{t-2}) = \beta_0' + \beta_1x_t' + z_t.$$
Now the zt's are independent and $z_t \sim N(0, \sigma_z^2)$. But (y′t, x′t) cannot be used directly, since
$$y_t' = y_t - \rho_1y_{t-1} - \rho_2y_{t-2} \quad \text{and} \quad x_t' = x_t - \rho_1x_{t-1} - \rho_2x_{t-2}$$
are functions of the unknown parameters ρ1 and ρ2.

We know ϵt = ρ1ϵt−1 + ρ2ϵt−2 + zt. How do we estimate ρ1 and ρ2?

• Fit yt = β0 + β1xt + ϵt using OLS and obtain the residuals ei (ignoring the autocorrelation).
• Regress ei on ei−1 and ei−2, i.e., ei = ρ1ei−1 + ρ2ei−2 + zt (an MLR with two regressors).
• Compute $S(\rho_1, \rho_2) = \sum(e_i - \rho_1e_{i-1} - \rho_2e_{i-2})^2$, minimize $S(\rho_1, \rho_2)$, and obtain ρ̂1 and ρ̂2:
$$\frac{\partial S}{\partial \rho_1} = 0 \Rightarrow \sum(e_i - \rho_1e_{i-1} - \rho_2e_{i-2})e_{i-1} = 0, \qquad \frac{\partial S}{\partial \rho_2} = 0 \Rightarrow \sum(e_i - \rho_1e_{i-1} - \rho_2e_{i-2})e_{i-2} = 0.$$
These generate the LSEs ρ̂1 and ρ̂2.
• Form $y_t' = y_t - \hat{\rho}_1y_{t-1} - \hat{\rho}_2y_{t-2}$ and $x_t' = x_t - \hat{\rho}_1x_{t-1} - \hat{\rho}_2x_{t-2}$ and apply OLS to the transformed data, $y_t' = \beta_0' + \beta_1x_t' + z_t$, where $z_t \overset{\text{ind}}{\sim} N(0, \sigma_z^2)$.
• The final fitted model is $\hat{y}_t' = \hat{\beta}_0' + \hat{\beta}_1x_t'$.
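A minimal R sketch of one pass of this procedure for AR(2) errors (a full Cochrane-Orcutt implementation would iterate these steps; the data are simulated for illustration):

# One Cochrane-Orcutt style pass for a regression with AR(2) errors.
set.seed(11)
n <- 200
x <- rnorm(n)
eps <- as.numeric(arima.sim(list(ar = c(0.5, 0.3)), n = n))
y <- 1 + 2 * x + eps
e   <- resid(lm(y ~ x))                                      # step 1: OLS residuals
rho <- coef(lm(e[3:n] ~ 0 + e[2:(n - 1)] + e[1:(n - 2)]))    # step 2: estimate rho1, rho2
yp <- y[3:n] - rho[1] * y[2:(n - 1)] - rho[2] * y[1:(n - 2)] # step 3: transform
xp <- x[3:n] - rho[1] * x[2:(n - 1)] - rho[2] * x[1:(n - 2)]
coef(lm(yp ~ xp))   # slope estimates beta1; intercept estimates beta0*(1 - rho1 - rho2)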

Problem 45. (Durbin-Watson Test)

The following 24 residuals from a straight-line fit are equally spaced and given in time sequential order. Is there any
evidence of lag-1 serial correlation?

8, −5, 7, 1, −3, −6, 1, −2, 10, 1, −1, 8, −6, 1, −6, −8, 10, −6, 9, −3, 3, −5, 1, −9

Use a two-sided test at level α = 0.05.

Solution:

The lag-1 correlation between the residuals ei, i = 1, ..., 24, is Cor(eu, eu+1) = ρ (if ρ ≠ 0, there is autocorrelation).

We test the hypothesis H0: ρ = 0 vs H1: ρ ≠ 0; H0 says there is no lag-1 autocorrelation. To perform this test, we compute the Durbin-Watson test statistic
$$d = \frac{\sum_{u=2}^{24}(e_u - e_{u-1})^2}{\sum_{u=1}^{24}e_u^2} = \frac{2225}{834} = 2.67 \;\Rightarrow\; 4 - d = 1.33.$$
Now compare with the dL and dU values from the Durbin-Watson table. For α = 0.025 in each tail (a two-sided test at level 0.05), n = 24 and k = 1 (a straight-line fit with one regressor): dL = 1.16, dU = 1.33.

• Reject H0 if d < dL (positive autocorrelation) or 4 − d < dL (negative autocorrelation). Here d = 2.67 > dL and 4 − d = 1.33 > dL, so no rejection.
• Do not reject H0 if d > dU and 4 − d ≥ dU. Here d = 2.67 > 1.33 and 4 − d = 1.33 ≥ dU = 1.33, so we accept H0.

Thus, there is no evidence of lag-1 autocorrelation/serial correlation in the data.
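The statistic can be reproduced directly from the residuals in R:

# Durbin-Watson statistic for the 24 time-ordered residuals (Problem 45).
e <- c(8, -5, 7, 1, -3, -6, 1, -2, 10, 1, -1, 8,
       -6, 1, -6, -8, 10, -6, 9, -3, 3, -5, 1, -9)
d <- sum(diff(e)^2) / sum(e^2)
c(d = d, four.minus.d = 4 - d)    # 2.67 and 1.33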

Problem 46. (Nonlinear Regression)

Estimate the parameters α & β in the non-linear model Y = α + (0.49 − α)e−β(X−8) + ϵ from the following observations:

X 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42
Y 0.490 0.475 0.450 0.433 0.458 0.423 0.407 0.407 0.407 0.405 0.393 0.405 0.400 0.395 0.400 0.390 0.407 0.390

Solution:

The problem is to estimate α, β of the non-linear model using the data. The residual sum of squares can be written as
$$S(\alpha, \beta) = \sum_u \left( y_u - f(x_u; \alpha, \beta) \right)^2 = \sum_u \left( y_u - \alpha - (0.49 - \alpha)e^{-\beta(x_u - 8)} \right)^2, \qquad f(x_u; \alpha, \beta) = \alpha + (0.49 - \alpha)e^{-\beta(x_u - 8)}.$$
[Since f is nonlinear, we solve the system of nonlinear equations via a Taylor-series approximation of the nonlinear model by a linear one.]
$$\frac{\partial f}{\partial \alpha} = 1 - e^{-\beta(x_u - 8)}, \qquad \frac{\partial f}{\partial \beta} = -(0.49 - \alpha)(x_u - 8)e^{-\beta(x_u - 8)}.$$

Linearization: the Taylor series expansion of f(xu; α, β) about the point (α0, β0) is
$$f(x_u; \alpha, \beta) \approx f(x_u; \alpha_0, \beta_0) + \left( 1 - e^{-\beta_0(x_u - 8)} \right)(\alpha - \alpha_0) + \left[ -(0.49 - \alpha_0)(x_u - 8)e^{-\beta_0(x_u - 8)} \right](\beta - \beta_0)$$
$$= f_u^0 + z_{1u}^0(\alpha - \alpha_0) + z_{2u}^0(\beta - \beta_0) \quad \text{[a linear function obtained from the nonlinear one via the Taylor approximation]},$$
$$Y_u = f_u^0 + z_{1u}^0(\alpha - \alpha_0) + z_{2u}^0(\beta - \beta_0) + \epsilon_u \;\Rightarrow\; Y_u - f_u^0 = z_{1u}^0(\alpha - \alpha_0) + z_{2u}^0(\beta - \beta_0) + \epsilon_u \quad \longrightarrow \text{[MLR model]}$$
$$\Rightarrow\; Y_0 = z_0\theta_0 + \epsilon \quad \text{[in matrix form]} \;\Rightarrow\; \hat{\theta}_0 = (z_0^Tz_0)^{-1}z_0^TY_0 \text{ is the least squares estimate,}$$
where
$$Y_0 = \begin{pmatrix} Y_1 - f_1^0 \\ \vdots \\ Y_n - f_n^0 \end{pmatrix}, \qquad z_0 = \begin{pmatrix} z_{11}^0 & z_{21}^0 \\ \vdots & \vdots \\ z_{1n}^0 & z_{2n}^0 \end{pmatrix}, \qquad \theta_0 = \begin{pmatrix} \alpha - \alpha_0 \\ \beta - \beta_0 \end{pmatrix}, \qquad \epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}.$$

• If we begin the iteration with the initial guess α0 = 0.30, β0 = 0.02, then
$$z_0 = \begin{pmatrix} 1 - e^{-\beta_0(x_1 - 8)} & -(0.49 - \alpha_0)(x_1 - 8)e^{-\beta_0(x_1 - 8)} \\ \vdots & \vdots \\ 1 - e^{-\beta_0(x_n - 8)} & -(0.49 - \alpha_0)(x_n - 8)e^{-\beta_0(x_n - 8)} \end{pmatrix},$$

  iteration   αj     βj
  0           0.30   0.02
  1           0.84   0.10

• The iteration continues, producing αj+1 and βj+1 at each step.
• This process continues until |αj+1 − αj| < δ and |βj+1 − βj| < δ, with δ = 0.0001:

  iteration   αj       βj
  2           0.3901   0.1004
  3           0.3901   0.1016
  4           0.3901   0.1016

So we stop here, with α̂ ≈ 0.3901 and β̂ ≈ 0.1016.
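The same fit can be obtained directly with R's nls(), which carries out the Gauss-Newton iterations internally; a sketch with the data typed in (the starting values are a rough guess near the visible asymptote):

# Nonlinear least squares fit of Y = alpha + (0.49 - alpha) * exp(-beta * (X - 8)).
X <- seq(8, 42, by = 2)
Y <- c(0.490, 0.475, 0.450, 0.433, 0.458, 0.423, 0.407, 0.407, 0.407,
       0.405, 0.393, 0.405, 0.400, 0.395, 0.400, 0.390, 0.407, 0.390)
fit <- nls(Y ~ alpha + (0.49 - alpha) * exp(-beta * (X - 8)),
           start = list(alpha = 0.4, beta = 0.1))
coef(fit)    # close to alpha ~ 0.39, beta ~ 0.10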

Problem 47. (Statistician’s Dilemma) Look at these data. I don’t know whether to fit two straight lines, one straight
line, or something else. How can I solve this dilemma?

Solution:

If we attach a dummy variable Z to distinguish the two groups (such that Z = 0 for set A and Z = 1 for set B), we can
look at all 4 possibilities.

Y = (β0 + β1 X) + Z(α0 + α1 X) + ϵ = β0 + β1 X + α0 Z + α1 XZ + ϵ

Thus the X matrix becomes (columns 1, X, Z, XZ):
$$X = \begin{pmatrix} 1 & 8 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 12 & 0 & 0 \\ 1 & 2 & 0 & 0 \\ 1 & 9 & 1 & 9 \\ 1 & 7 & 1 & 7 \\ 1 & 8 & 1 & 8 \\ 1 & 6 & 1 & 6 \end{pmatrix}.$$

Thus we have Y = Xβ + ϵ ⇒ β = (β0 , β1 , β2 , β3 ) and β̂ = (X T X)−1 X T Y ⇒ Ŷ = 1.142 + 0.506X − 0.0418Z − 0.036XZ.
T

Case: Test if a single line is sufficient i.e., H0 : α0 = α1 = 0

F = [{SS_Reg(Full) − SS_Reg(Restricted)} / {df_Reg(Full) − df_Reg(Restricted)}] / MS_Res
  = [0.1818/(3 − 1)] / [0.3272/4] = 1.11 < F_{0.05, 2, 4}

Hence, we fail to reject H0 and can go for a single straight-line fit.
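A minimal R sketch of the same comparison, assuming vectors x, y and the group indicator z (0 for set A, 1 for set B) hold the data from the problem (these variable names are placeholders, not from the original):

full    <- lm(y ~ x + z + x:z)   # beta0 + beta1*X + alpha0*Z + alpha1*XZ
reduced <- lm(y ~ x)             # single straight line (alpha0 = alpha1 = 0)
anova(reduced, full)             # extra-sum-of-squares (partial) F test of H0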

Problem 48. (Principal component Regression)


Let x be a vector of p random variables and let α_k be a vector of p constants, and write α_k^T x = Σ_{j=1}^p α_{kj} x_j. Also, let S be the (known) sample covariance matrix of x. For k = 1, 2, show that the k-th principal component is given by z_k = α_k^T x, where α_k is an eigenvector of S corresponding to its k-th largest eigenvalue λ_k.
Solution:

To calculate the k-th principal component z_k = α_k^T x we apply constrained maximization. Using a Lagrange multiplier, we maximize Var(α_k^T x) = α_k^T Σ α_k subject to α_k^T α_k = 1. That is, we maximize the function


α_k^T Σ α_k − λ (α_k^T α_k − 1)

with respect to α_k. Differentiating with respect to α_k gives

d/dα_k [ α_k^T Σ α_k − λ_k (α_k^T α_k − 1) ] = 0
⇒ Σ α_k − λ_k α_k = 0
⇒ Σ α_k = λ_k α_k

This is recognizable as an eigenvector equation where αk is an eigenvector of Σ and λk is the associated eigenvalue.

If we recognize that the quantity to be maximized

α_k^T Σ α_k = α_k^T λ_k α_k = λ_k α_k^T α_k = λ_k

then we should choose λ_k to be as large as possible. So, calling λ_1 the largest eigenvalue of Σ and α_1 the corresponding
eigenvector, the solution to
Σα1 = λ1 α1
is the 1st principal component of x.

The second PC, α_2^T x, maximizes α_2^T Σ α_2 subject to being uncorrelated with α_1^T x. The zero-correlation constraint can be expressed using any of these equivalent equations:

cov(α_1^T x, α_2^T x) = α_1^T Σ α_2 = α_2^T Σ α_1 = λ_1 α_2^T α_1 = λ_1 α_1^T α_2 = 0.

Of these, if we choose the last, we can write the Lagrangian for maximizing α_2^T Σ α_2:



α_2^T Σ α_2 − λ_2 (α_2^T α_2 − 1) − φ α_2^T α_1

Differentiating this quantity with respect to α_2 and setting the result equal to zero yields

d/dα_2 [ α_2^T Σ α_2 − λ_2 (α_2^T α_2 − 1) − φ α_2^T α_1 ] = 0

⇒ Σ α_2 − λ_2 α_2 − φ α_1 = 0

If we left-multiply this expression by α_1^T,

α_1^T Σ α_2 − λ_2 α_1^T α_2 − φ α_1^T α_1 = 0

⇒ 0 − 0 − φ · 1 = 0,

so φ must be zero, and we are left with

Σ α_2 − λ_2 α_2 = 0.

This is another eigenvector equation, and the same strategy of choosing α_2 to be the eigenvector associated with the second largest eigenvalue λ_2 yields the second PC of x, namely α_2^T x.
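A small numerical illustration in R (with simulated data, so the numbers themselves are arbitrary): the variances of the principal components returned by prcomp() coincide with the eigenvalues of the sample covariance matrix S, and the loadings match the eigenvectors of S up to sign.

set.seed(1)
x <- matrix(rnorm(200), ncol = 4)   # n = 50 observations on p = 4 variables
S <- cov(x)                         # sample covariance matrix
eigen(S)$values                     # lambda_1 >= lambda_2 >= ... >= lambda_4
prcomp(x)$sdev^2                    # variances of the PCs: the same values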

Problem 49. (Shrinkage Methods in Regression)

Show that ∥β̂_Ridge∥ increases as its tuning parameter λ → 0. Does the same property hold for LASSO regression?

Solution

Consider the SVD decomposition X = U_{n×p} D_{p×p} V_{p×p}^T. Then

β̂_Ridge = (X^T X + λI)^{−1} X^T Y = (V D² V^T + λI)^{−1} V D U^T Y = V (D² + λI)^{−1} V^T V D U^T Y = V (D² + λI)^{−1} D U^T Y.

∥β̂_Ridge∥₂² = Y^T U D (D² + λI)^{−1} (D² + λI)^{−1} D U^T Y = (U^T Y)^T [D (D² + λI)^{−2} D] (U^T Y) = Σ_{j=1}^p d_j² (U^T Y)_j² / (d_j² + λ)²,

where D (D² + λI)^{−2} D is a diagonal matrix with elements d_j² / (d_j² + λ)².
Therefore, we see that ∥β̂_Ridge∥ increases as its tuning parameter λ → 0. Recall the dual form of the LASSO as defined below:
 2
X
N X
p
β̂Lasso = arg min yi − β0 − xij βj 
β
i=1 j=1

X
p
s.t. |βj | ≤ t
j=1
  2 

1 X N Xp Xp 

β̂Lasso = arg min y i − β 0 − xij βj  + λ |βj |
β  2 i=1 

j=1 j=1

It is easy to see that t and λ have an inverse relationship; therefore, as λ → 0, t increases and so does the norm of optimal
solutions.
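A short R check of the ridge result (simulated data; the design and response below are arbitrary placeholders): the 2-norm of (X^T X + λI)^{−1} X^T y grows as λ decreases toward 0.

set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p) + rnorm(n)
ridge_norm <- function(lambda)
  sqrt(sum(solve(crossprod(X) + lambda * diag(p), crossprod(X, y))^2))
sapply(c(100, 10, 1, 0.1, 0), ridge_norm)   # norm increases as lambda -> 0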

Problem 50. (Logistic Regression)

Consider a two-class logistic regression problem with x ∈ R. Characterize the maximum-likelihood estimates of the slope and intercept parameters if the samples x_i for the two classes are separated by a point x_0 ∈ R. Generalize this result to (a) x ∈ R^p and (b) more than two classes.

Solution

Without loss of generality, suppose that x0 = 0 and that the coding is y = 1 for xi > 0 and y = 0 for xi < 0. Now,
suppose that
p(x; β) = exp{βx + β_0} / (1 + exp{βx + β_0})

so that
1 − p(x; β) = 1 / (1 + exp{βx + β_0}).
Since x_0 = 0 is the decision boundary, p(x_0) = 1 − p(x_0), which forces β_0 = 0. Therefore,

p(x; β) = exp{βx} / (1 + exp{βx})

so that
1 − p(x; β) = 1 / (1 + exp{βx}).
Therefore, the likelihood function

L(β; y, x) = Π_{i=1}^N p(x_i; β)^{y_i} [1 − p(x_i; β)]^{1−y_i} = Π_{i=1}^N [ p(x_i; β) / (1 − p(x_i; β)) ]^{y_i} [1 − p(x_i; β)] = Π_{i=1}^N [exp{βx_i}]^{y_i} [1 + exp{βx_i}]^{−1},

so that the log-likelihood function


l(β; y, x) = Σ_{i=1}^N ( y_i βx_i − log[1 + exp{βx_i}] ).

Taking the derivative with respect to β and substituting in the proper coding of yi gives
 
dl(β; x, y)/dβ = Σ_{i=1}^N x_i ( y_i − exp{βx_i}/(1 + exp{βx_i}) )
             = Σ_{x_i>0} x_i ( 1 − exp{βx_i}/(1 + exp{βx_i}) ) − Σ_{x_i<0} x_i · exp{βx_i}/(1 + exp{βx_i})
             = Σ_{x_i>0} x_i − Σ_{x_i>0} x_i · exp{βx_i}/(1 + exp{βx_i}) − Σ_{x_i<0} x_i · exp{βx_i}/(1 + exp{βx_i}).

Setting the above equal to zero gives

Σ_{x_i>0} x_i = Σ_{i=1}^N x_i · exp{βx_i}/(1 + exp{βx_i}).

Clearly, for any such data set {x_i}_{i=1}^N we must have β → ∞ for this equality to hold: as β → ∞, the weights exp{βx_i}/(1 + exp{βx_i}) tend to 1 for x_i > 0 and to 0 for x_i < 0, so the right-hand side approaches the left-hand side only in that limit.
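This divergence is easy to see numerically. In the R sketch below (simulated data separated at x_0 = 0; the particular values are arbitrary), glm() warns that fitted probabilities are numerically 0 or 1 and reports an extremely large slope, which is the finite-sample signature of β̂ → ∞.

x <- c(-2, -1.5, -0.7, 0.4, 1.1, 2.3)   # perfectly separated at x0 = 0
y <- as.numeric(x > 0)
fit <- glm(y ~ x, family = binomial)    # expect separation / non-convergence warnings
coef(fit)                               # the slope estimate is very large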

(b) Now, suppose that there are K classes such that x_1 separates classes one and two, x_2 separates classes two and three, and so on up to x_{K−1}, which separates classes K − 1 and K, with −∞ = x_0 < x_1 < x_2 < … < x_{K−1} < x_K = ∞. Now, define the probabilities
p_1(x; β) = exp{β_1 x + β_{01}} / (1 + Σ_{j=1}^{K−1} exp{β_j x + β_{0j}})
p_2(x; β) = exp{β_2 x + β_{02}} / (1 + Σ_{j=1}^{K−1} exp{β_j x + β_{0j}})
⋮
p_K(x; β) = 1 / (1 + Σ_{j=1}^{K−1} exp{β_j x + β_{0j}}).
Now, suppose that the coding is yi = 1 if xj−1 < xi < xj and yi = 0 otherwise for observation i = 1, . . . , N and class
j = 1, . . . , K. Therefore, the likelihood function

L(β; y, x) = Π_{j=1}^{K} Π_{i=1}^{N_j} [p_j(x_i; β)]^{y_i}

where N_j is the number of observations in class j, so that the log-likelihood function

l(β; y, x) = Σ_{j=1}^{K−1} Σ_{i=1}^{N_j} y_i log[ exp{β_j x_i + β_{0j}} / (1 + Σ_{j=1}^{K−1} exp{β_j x_i + β_{0j}}) ] + Σ_{i=1}^{N_K} y_i log[ 1 / (1 + Σ_{j=1}^{K−1} exp{β_j x_i + β_{0j}}) ]
           = Σ_{j=1}^{K−1} Σ_{i=1}^{N_j} y_i [β_j x_i + β_{0j}] − Σ_{j=1}^{K} Σ_{i=1}^{N_j} y_i log( 1 + Σ_{j=1}^{K−1} exp{β_j x_i + β_{0j}} )

Now, we determine the values of β0j . First, note that β0j is a function of βj , xj−1 , and xj . So that the expression p(x; β)
maintains proper form, for xj−1 < x < xj we define

p(x; β_j) = [exp{β_j(x − x_{j−1})} − exp{β_j(x − x_j)}] / (1 + Σ_{j=1}^{K−1} exp{β_j x + β_{0j}})
          = exp{β_j x} [exp{−β_j x_{j−1}} − exp{−β_j x_j}] / (1 + Σ_{j=1}^{K−1} exp{β_j x + β_{0j}})
          = exp{β_j x + β_{0j}} / (1 + Σ_{j=1}^{K−1} exp{β_j x + β_{0j}}),

where β_{0j} = log[exp{−β_j x_{j−1}} − exp{−β_j x_j}]. The reason for starting from the first expression is that, for example, when x ∈ (x_1, x_2), so that x is classified to class two, the resulting probability function is concentrated on that interval (a plot with x_1 = 1 and x_2 = 2 makes this apparent).
Now, taking the derivative with respect to β = (β1 , . . . , βK−1 ) and substituting in the proper coding of yi gives

dl(β; x, y)/dβ_j = Σ_{i=1}^{N_j} ( x_i + ∂β_{0j}/∂β_j ) − Σ_{i=1}^{N_j} ( x_i + ∂β_{0j}/∂β_j ) · exp{β_j x_i + β_{0j}} / (1 + Σ_{j=1}^{K−1} exp{β_j x_i + β_{0j}}),

where ∂β_{0j}/∂β_j = [x_j exp{−β_j x_j} − x_{j−1} exp{−β_j x_{j−1}}] / [exp{−β_j x_{j−1}} − exp{−β_j x_j}].

Note that the ∂β_{0j}/∂β_j term above is a constant in the sum over i = 1, …, N_j. Therefore, setting the derivative equal to zero for each j = 1, …, K − 1 and solving for β_j gives, just as in the two-class case, maximum likelihood estimates with β_j → ∞.

Some Useful Formulae

• The volume of a sphere of radius r is (4/3)πr³.


• A Maclaurin series is a Taylor series expansion of a function about 0:
  f(x) = f(0) + f′(0)x + (f′′(0)/2!)x² + (f^{(3)}(0)/3!)x³ + … + (f^{(n)}(0)/n!)xⁿ + …
• The generic Chernoff bound for a random variable X is obtained by applying Markov's inequality to e^{tX}. This gives a bound in terms of the moment-generating function of X: for every t ≥ 0,
  Pr(X ≥ a) = Pr(e^{tX} ≥ e^{ta}) ≤ E[e^{tX}] / e^{ta}.
  (A small numerical check of this bound appears at the end of this list.)
• Uniform distribution on the interval (0, 1) has pdf
  f(x) = 1 if 0 < x < 1, and f(x) = 0 otherwise.
• The standard normal distribution has pdf
  f(x) = (1/√(2π)) e^{−x²/2}, for x ∈ R.

• Beta density with parameters α and β:
  f(x) = [Γ(α + β) / (Γ(α)Γ(β))] x^{α−1}(1 − x)^{β−1} if 0 < x < 1, and 0 otherwise.
  E[X] = α/(α + β),  Var(X) = αβ / [(α + β)²(α + β + 1)].
  α = 1 and β = 1 gives the Uniform distribution on (0, 1).
• Gamma distribution with parameters r > 0 and λ > 0:
  f(x) = [λ^r / Γ(r)] x^{r−1} e^{−λx} if x > 0, and 0 otherwise.
  E[X] = r/λ,  Var(X) = r/λ².
  M_X(t) = (λ/(λ − t))^r for t < λ.
  r = 1 gives the Exponential distribution with mean 1/λ.
• Exponential distribution with parameter λ > 0:
  f(x) = λe^{−λx} if x > 0, and 0 otherwise.
  E[X] = 1/λ,  Var(X) = 1/λ².
  F_X(x) = 1 − e^{−λx} if x > 0, and 0 otherwise.

• Normal distribution with mean µ and variance σ² > 0:
  f(x) = [1/(√(2π)σ)] e^{−(x−µ)²/(2σ²)} for x ∈ R.
  E[X] = µ,  Var(X) = σ².
  M_X(t) = e^{µt + t²σ²/2} for t ∈ R. µ = 0 and σ² = 1 gives the standard normal distribution.
• Poisson distribution with mean λ > 0:
  f(k) = (λ^k / k!) e^{−λ} if k = 0, 1, 2, …, and 0 otherwise.
  E[X] = λ,  Var(X) = λ.
  M_X(t) = exp{λ(e^t − 1)} for t ∈ R.
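As flagged in the Chernoff-bound item above, here is a small numerical check in R (the Poisson(4) example and the threshold a = 10 are arbitrary choices for illustration): the bound, minimized over t ≥ 0, sits above the exact tail probability.

lambda <- 4; a <- 10
t <- log(a / lambda)                                 # minimizes exp(lambda*(e^t - 1) - t*a) over t >= 0
bound <- exp(lambda * (exp(t) - 1) - t * a)          # Chernoff bound on P(X >= a)
exact <- ppois(a - 1, lambda, lower.tail = FALSE)    # exact P(X >= a) for X ~ Poisson(4)
c(chernoff = bound, exact = exact)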

Some Useful Distributional Properties


1. If Y ∼ N(µ, σ²), then Z = (Y − µ)/σ ∼ N(0, 1)

2. Y ∼ N(0, 1) =⇒ Y² ∼ χ²(1)

3. Y ∼ N(µ, σ²) =⇒ aY + b ∼ N(aµ + b, a²σ²)
4. Y ∼ U (0, 1) =⇒ − ln Y ∼ exponential(1)

Generalization: Y ∼ U (0, 1) =⇒ −β ln Y ∼ exponential(β)


Related: Y ∼ beta(α, 1) =⇒ − ln Y ∼ exponential(1/α)
Related: Y ∼ beta(1, β) =⇒ − ln(1 − Y ) ∼ exponential(1/β)
5. Y ∼ exponential(α) =⇒ Y 1/m ∼ Weibull(m, α)

Related: Y ∼ Weibull(m, α) =⇒ Y m ∼ exponential(α)


   
6. Y ∼ N(µ, σ²) =⇒ e^Y ∼ lognormal(µ, σ²), or equivalently, U ∼ lognormal(µ, σ²) =⇒ ln U ∼ N(µ, σ²)
7. Y ∼ beta(α, β) =⇒ 1 − Y ∼ beta(β, α)

8. Y ∼ U (−π/2, π/2) =⇒ tan Y ∼ Cauchy


9. Y ∼ gamma(α, β) =⇒ cY ∼ gamma(α, βc), where c > 0
Special case: 2Y /β ∼ χ2 (2α)
10. Y_1, Y_2, …, Y_n iid ∼ Bernoulli(p) =⇒ Σ Y_i ∼ b(n, p)

11. Y_i ∼ gamma(α_i, β), i = 1, 2, …, n (mutually independent)
    =⇒ Σ Y_i ∼ gamma(Σ α_i, β)
    Special case: α_i = 1, for i = 1, 2, …, n. Then Y_1, Y_2, …, Y_n iid ∼ exponential(β) =⇒ Σ Y_i ∼ gamma(n, β)
    Special case: α_i = ν_i/2, β = 2. Then Y_i ∼ χ²(ν_i), i = 1, 2, …, n (mutually independent) =⇒ Σ Y_i ∼ χ²(Σ ν_i)
    Combination: If Y_1, Y_2, …, Y_n iid ∼ exponential(β), then 2 Σ Y_i / β ∼ χ²(2n)

12. Y_i ∼ Poisson(λ_i), i = 1, 2, …, n (mutually independent) =⇒ Σ Y_i ∼ Poisson(Σ λ_i)


13. Y_i ∼ N(µ_i, σ_i²), i = 1, 2, …, n (mutually independent)
    =⇒ Σ a_i Y_i ∼ N(Σ a_i µ_i, Σ a_i² σ_i²)
    Special case: µ_i = µ and σ_i² = σ², for i = 1, 2, …, n. Then Y_1, Y_2, …, Y_n iid ∼ N(µ, σ²)
    =⇒ Σ a_i Y_i ∼ N(µ Σ a_i, σ² Σ a_i²)

    Special case of iid result: If Y_1, Y_2, …, Y_n iid ∼ N(µ, σ²), then Ȳ ∼ N(µ, σ²/n)
    Special case of iid result: If Y_1, Y_2, …, Y_n iid ∼ N(µ, σ²), then Σ Y_i ∼ N(nµ, nσ²)

14. If Y_i ∼ N(µ_i, σ_i²), i = 1, 2, …, n (mutually independent), then
    Z_i = (Y_i − µ_i)/σ_i ∼ N(0, 1),
    for i = 1, 2, …, n. Therefore, U = Σ Z_i² ∼ χ²(n) because Z_1², Z_2², …, Z_n² are iid χ²(1)
15. Y_1, Y_2, …, Y_n iid ∼ geometric(p) =⇒ U = Σ Y_i ∼ nib(n, p)

16. Y_1, Y_2 iid ∼ N(0, 1) =⇒ U = Y_1/Y_2 ∼ Cauchy

17. Y_1, Y_2, …, Y_n iid ∼ exponential(β) =⇒ Y_(1) ∼ exponential(β/n)

18. Y_1, Y_2, …, Y_n iid ∼ Weibull(m, α) =⇒ Y_(1) ∼ Weibull(m, α/n)
19. If Z ∼ N(0, 1), W ∼ χ²(ν), and Z ⊥⊥ W, then
    T = Z / √(W/ν) ∼ t(ν)

20. Y_1, Y_2, …, Y_n iid ∼ N(µ, σ²)
    =⇒ (n − 1)S²/σ² ∼ χ²(n − 1)
21. If W_1 ∼ χ²(ν_1), W_2 ∼ χ²(ν_2), and W_1 ⊥⊥ W_2, then
    F = (W_1/ν_1) / (W_2/ν_2) ∼ F(ν_1, ν_2)

22. If F ∼ F (ν1 , ν2 ), then 1/F ∼ F (ν2 , ν1 )


23. If T ∼ t(ν), then T 2 ∼ F (1, ν)

24. If W ∼ F(ν_1, ν_2), then
    (ν_1/ν_2)W / [1 + (ν_1/ν_2)W] ∼ beta(ν_1/2, ν_2/2)

Page 90
