Pen and Paper Exercises On Statistics
(with Solutions)
4. Confidence Intervals
5. Bayesian Inference
6. Testing of Hypothesis
Acknowledgment: These problems and their solutions are largely based on some extraordinary books on Statistics and several outstanding university courses available online, with some additions by the author. A non-exhaustive list is given below. Happy Learning!
Note: To err is human! There can be typos and unwanted mistakes in the solutions. If
you find one, please email me at ctanujit@gmail.com.
References (Books):
2. Draper, Norman R., and Harry Smith. Applied Regression Analysis, 1998.
(ii) Now, suppose there are two types of voters - “passive” and “active.” Each passive voter votes on election day
with probability 1/4 and doesn’t vote with probability 3/4, while each active voter votes with probability 3/4
and doesn’t vote with probability 1/4. Suppose that a fraction qH of the people who support Hillary are passive
and 1 − qH are active, and a fraction qD of the people who support Donald are passive and 1 − qD are active.
Show that
E[VHillary] = (1/4) qH pN + (3/4)(1 − qH) pN,    E[VDonald] = (1/4) qD (1 − p)N + (3/4)(1 − qD)(1 − p)N.
What are the standard deviations of VHillary and VDonald in terms of p, N, qH, and qD? Suppose we estimate p by p̂
using a simple random sample of n = 1000 people from Iowa. Explain why p̂ might not be a good estimate of
the fraction of voters who will vote for Hillary.
(iii) We do not know qH and qD . However, suppose that in our simple random sample, we can observe whether
each person is passive or active, in addition to asking them whether they support Hillary or Donald. Suggest
estimators V̂Hillary and V̂Donald for E [VHillary ] and E [VDonald ] using this additional information. Show, for
your estimators, that
E[V̂Hillary] = (1/4) qH pN + (3/4)(1 − qH) pN,    E[V̂Donald] = (1/4) qD (1 − p)N + (3/4)(1 − qD)(1 − p)N.
Solution: (i) Recall that a Binomial(n, p) random variable has mean np and variance np(1 − p). A total of pN
people support Hillary, each voting independently with probability 1/2, so VHillary ∼ Binomial(pN, 1/2). Then

E[VHillary] = (1/2) pN,    Var[VHillary] = (1/4) pN,
and the standard deviation of VHillary is √((1/4) pN). Similarly, as (1 − p)N people support Donald,
VDonald ∼ Binomial((1 − p)N, 1/2), so

E[VDonald] = (1/2)(1 − p)N,    Var[VDonald] = (1/4)(1 − p)N,
and the standard deviation of VDonald is √((1/4)(1 − p)N). The fraction of voters who voted for Hillary is

VHillary / (VHillary + VDonald) = (VHillary/N) / (VHillary/N + VDonald/N).
As E[VHillary/N] = (1/2)p (a constant) and Var[VHillary/N] = (1/4)p/N → 0 as N → ∞, VHillary/N should be close
to (1/2)p with high probability when N is large. Similarly, as E[VDonald/N] = (1/2)(1 − p) and Var[VDonald/N] =
(1/4)(1 − p)/N → 0 as N → ∞, VDonald/N should be close to (1/2)(1 − p) with high probability when N is large.
Then the fraction of voters for Hillary should, with high probability, be close to
(1/2)p / ((1/2)p + (1/2)(1 − p)) = p.
(The above statement “close to with high probability” may be formalized using Chebyshev’s inequality, which
states that a random variable is, with high probability, not too many standard deviations away from its mean.)
(ii) Let VH,p and VH,a be the number of passive and active voters who vote for Hillary, and similarly define
VD,p and VD,a for Donald. There are qH pN passive Hillary supporters, each of whom votes independently with
probability 1/4, so

VH,p ∼ Binomial(qH pN, 1/4).

Similarly,

VH,a ∼ Binomial((1 − qH) pN, 3/4),
VD,p ∼ Binomial(qD (1 − p)N, 1/4),
VD,a ∼ Binomial((1 − qD)(1 − p)N, 3/4),
and these four random variables are independent. Since VHillary = VH,p + VH,a ,
E[VHillary] = E[VH,p] + E[VH,a] = (1/4) qH pN + (3/4)(1 − qH) pN,
Var[VHillary] = Var[VH,p] + Var[VH,a] = (3/16) qH pN + (3/16)(1 − qH) pN = (3/16) pN,

and the standard deviation of VHillary is √((3/16) pN). Similarly,
E[VDonald] = E[VD,p] + E[VD,a] = (1/4) qD (1 − p)N + (3/4)(1 − qD)(1 − p)N,
Var[VDonald] = Var[VD,p] + Var[VD,a] = (3/16) qD (1 − p)N + (3/16)(1 − qD)(1 − p)N = (3/16)(1 − p)N,

and the standard deviation of VDonald is √((3/16)(1 − p)N).
The quantity p̂ estimates p, but in this case, p may not be the fraction of voters who vote for Hillary: By the
same argument as in part (i), the fraction of voters who vote for Hillary is given by
VHillary / (VHillary + VDonald) = (VHillary/N) / (VHillary/N + VDonald/N)
≈ [(1/4) qH p + (3/4)(1 − qH) p] / [(1/4) qH p + (3/4)(1 − qH) p + (1/4) qD (1 − p) + (3/4)(1 − qD)(1 − p)],
where the approximation is accurate with high probability when N is large. When qH ≠ qD, this is different
from p: For example, if qH = 0 and qD = 1, this is equal to p/(p + (1 − p)/3), which is greater than p, reflecting the
fact that Hillary supporters are more likely to vote than are Donald supporters.
(iii) Let p̂ be the proportion of the 1000 surveyed people who support Hillary. Among the surveyed people
supporting Hillary, let q̂H be the proportion who are passive. Similarly, among the surveyed people supporting
Donald, let q̂D be the proportion who are passive. (Note that these are observed quantities computed from our
sample of 1000 people.) Then we may estimate the number of voters for Hillary and Donald by
V̂Hillary = (1/4) q̂H p̂N + (3/4)(1 − q̂H) p̂N,
V̂Donald = (1/4) q̂D (1 − p̂)N + (3/4)(1 − q̂D)(1 − p̂)N.
q̂H p̂ is simply the proportion of the 1000 surveyed people who both support Hillary and are passive. Hence,
letting X1 , . . . , X1000 indicate whether each surveyed person both supports Hillary and is passive, we have

q̂H p̂ = (1/n)(X1 + . . . + Xn), with n = 1000.
Each Xi ∼ Bernoulli (qH p), so linearity of expectation implies E [q̂H p̂] = qH p. Similarly, (1 − q̂H ) p̂, q̂D (1 − p̂),
and (1 − q̂D ) (1− p̂) are the proportions of the 1000 surveyed people who support Hillary and are active, support
Donald and are passive, and support Donald and are active, so the same argument shows E [(1 − q̂H ) p̂] =
(1 − qH ) p, E [q̂D (1 − p̂)] = qD (1 − p), and E [(1 − q̂D ) (1 − p̂)] = (1 − qD ) (1 − p). Then applying linearity of
expectation again yields
E[V̂Hillary] = E[VHillary],    E[V̂Donald] = E[VDonald].
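As a numerical sanity check (a sketch that is not part of the original solution; the parameter values p = 0.5, qH = 0.2, and N = 10^6 are arbitrary choices), one can simulate repeated surveys of n = 1000 people and verify that V̂Hillary is unbiased for E[VHillary]:

```python
import random

def avg_vhat_hillary(p, qH, N, n=1000, reps=2000, seed=0):
    """Average of the estimator V-hat_Hillary over many simulated surveys of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        h_passive = h_active = 0
        for _ in range(n):
            if rng.random() < p:          # respondent supports Hillary...
                if rng.random() < qH:     # ...and is passive
                    h_passive += 1
                else:                     # ...and is active
                    h_active += 1
        # qH-hat * p-hat is the fraction who support Hillary AND are passive;
        # (1 - qH-hat) * p-hat is the fraction who support Hillary AND are active.
        total += (0.25 * h_passive / n + 0.75 * h_active / n) * N
    return total / reps

p, qH, N = 0.5, 0.2, 10**6
est = avg_vhat_hillary(p, qH, N)
exact = 0.25 * qH * p * N + 0.75 * (1 - qH) * p * N  # E[V_Hillary] from the formula
```

With these values the exact mean is 325,000, and the simulated average should agree to within Monte Carlo error.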
(b) Let X and Y be continuous random variables with joint density function
f(x, y) = (y³/2) e^{−y(x+1)} for x > 0, y > 0, and f(x, y) = 0 otherwise.
(i) Find the marginal pdf of X.
(ii) Find the marginal pdf of Y and E[Y ].
Solution: (i) Integrating the joint pdf over y gives the marginal pdf of X. For x > 0, we have
fX(x) = ∫0^∞ (y³/2) e^{−y(x+1)} dy = (Γ(4)/(2(x+1)⁴)) ∫0^∞ ((x+1)⁴/Γ(4)) y³ e^{−y(x+1)} dy = 3/(x+1)⁴,
as the integrand in the last integral over y is a Gamma(4, x + 1) pdf, and so integrates (over y) to 1.
For x ≤ 0, fX (x) = 0.
(ii) Integrating the joint pdf over x gives the marginal pdf of Y . For y > 0, we have
fY(y) = ∫0^∞ (y³/2) e^{−y(x+1)} dx = (y²/2) e^{−y} ∫0^∞ y e^{−yx} dx = (y²/2) e^{−y},

as the integrand in the last integral over x is an Exponential(y) pdf, and so integrates (over x) to 1. For
y ≤ 0, fY(y) = 0. We can recognize the marginal distribution of Y to be Gamma(3, 1), and so E[Y] = 3/1 = 3.
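As a quick numerical check (not part of the original solution; the step size and truncation point below are arbitrary), a crude midpoint-rule integration confirms fX(1) = 3/16 and E[Y] = 3:

```python
import math

def f_joint(x, y):
    """Joint density f(x, y) = (y^3 / 2) e^{-y(x+1)} for x, y > 0."""
    return (y ** 3 / 2.0) * math.exp(-y * (x + 1))

h, ymax = 0.001, 40.0
x0 = 1.0
fx_numeric = 0.0  # midpoint-rule approximation of f_X(1) = int_0^inf f(1, y) dy
ey_numeric = 0.0  # midpoint-rule approximation of E[Y] = int_0^inf y f_Y(y) dy
j = 0
while (y := (j + 0.5) * h) < ymax:
    fx_numeric += f_joint(x0, y) * h
    ey_numeric += y * (y ** 2 / 2.0) * math.exp(-y) * h  # y * f_Y(y)
    j += 1

fx_exact = 3.0 / (x0 + 1) ** 4  # marginal f_X(x) = 3 / (x + 1)^4
```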
(ii) Let A ∈ R^{k×k} be any matrix, let Σ = AA^T, and let µ ∈ R^k be any vector. Show that there exist random
variables Y1, . . . , Yk such that (Y1, . . . , Yk) ∼ N(µ, Σ). (Hint: Let X1, . . . , Xk ∼ N(0, 1) be IID, and let each Yj be a
certain linear combination of X1, . . . , Xk plus a certain constant.)
where ai and aj denote rows i and j of the matrix A. (The computation of covariance is valid for both i ≠ j and
i = j; in the latter case this yields Var[Yi] = Cov[Yi, Yi].) As ai Σ aj^T = (AΣA^T)_{ij}, this implies by definition
Y ∼ N(0, AΣA^T).
(ii) Take X1, . . . , Xk ∼ N(0, 1) IID, define Zj = a_{j1} X1 + . . . + a_{jk} Xk for each j = 1, . . . , k, and let Yj = Zj + µj.
As (X1, . . . , Xk) have the multivariate normal distribution N(0, I) where I is the k × k identity matrix, and
as AIA^T = AA^T = Σ, part (a) implies (Z1, . . . , Zk) ∼ N(0, Σ). Then (Y1, . . . , Yk) ∼ N(µ, Σ) (since adding the
vector µ = (µ1, . . . , µk) does not change the variances and covariances of Y1, . . . , Yk but shifts their means by
µ1, . . . , µk).
(d) (i) Let (X, Y) be a random point uniformly distributed on the unit disk {(x, y) : x² + y² ≤ 1}. Show that
Cov[X, Y] = 0, but that X and Y are not independent.
(ii) Let X, Y ∼ N (0, 1) be independent. Compute P[X + Y > 0 | X > 0]. (Hint: Use the rotational symmetry of
the bivariate normal PDF.)
Solution: (i) X has the same distribution as −X, so E[X] = E[−X] = −E[X], hence E[X] = 0. Similarly
E[Y] = 0. Also (X, Y) has the same joint distribution as (−X, Y), so E[XY] = E[−XY] = −E[XY],
hence E[XY] = 0. Then Cov[X, Y] = E[XY] − E[X]E[Y] = 0. On the other hand, conditional on X = x, Y is
uniformly distributed on the interval (−√(1 − x²), √(1 − x²)). As this depends on x, X and Y are not independent.
(ii) We have

P[X + Y > 0 | X > 0] = P[X + Y > 0, X > 0] / P[X > 0].

Since X has the same distribution as −X, P[X > 0] = P[−X > 0] = P[X < 0], so P[X > 0] = 1/2. To compute
P[X + Y > 0, X > 0], note that this is the integral of the bivariate normal PDF fX,Y (x, y) in the region to the
right of the y-axis and above the line y = −x. The integral of fX,Y (x, y) over all of R2 must equal 1; hence by
rotational symmetry of fX,Y (x, y) around the origin, the integral over any wedge formed by two rays extending
from the origin is θ/(2π) where θ is the angle formed by these rays. For the above region, this angle is 3π/4,
so P[X + Y > 0, X > 0] = 3/8. Then P[X + Y > 0 | X > 0] = 3/4.
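The answer P[X + Y > 0 | X > 0] = 3/4 can also be checked with a small Monte Carlo sketch (not part of the original solution; the sample size and seed are arbitrary):

```python
import random

def estimate(trials=200_000, seed=1):
    """Monte Carlo estimate of P[X + Y > 0 | X > 0] for independent N(0,1) X, Y."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(trials):
        x, y = rng.gauss(0, 1), rng.gauss(0, 1)
        if x > 0:            # condition on the event X > 0
            total += 1
            if x + y > 0:
                hits += 1
    return hits / total

p_hat = estimate()  # should be close to 3/4
```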
(e) (i) A family of Yellow-Faced (YF) gophers consisting of 2 parents and 3 children are kept in a laboratory. In
addition to these, a family of YF gophers with 2 parents and 4 children, a family of Big Pocket (BP) gophers
with 2 parents and 5 children, and a family of BP gophers with 1 mother and 4 children are also kept in the
laboratory. A sample of 4 gophers is selected randomly from all the gophers in the laboratory. What is the
probability that the sample consists of one adult female, one adult male, and 2 children, with both adults of
the same genus (either both YF or both BP)?
(ii) Let X and Y be arbitrary random variables, and let g(·) and h(·) be arbitrary real valued functions defined on
R. For each statement, say whether it is TRUE or FALSE. If TRUE, prove it, and if FALSE, give a counterex-
ample.
(A) If X and Y are uncorrelated then so are g(X) and h(Y ) for any g and h.
(B) If g(X) and h(Y ) are uncorrelated for all g and h then X and Y are uncorrelated.
Solution: (i) Let X1 be the number of male YF gophers, X2 the number of female YF gophers, X3 the number
of male BP gophers, X4 the number of female BP gophers, and X5 the number of child gophers in the sample.
The sample size is n = 4, and the total number of gophers in the laboratory is 23. The total numbers of
YF male, YF female, BP male, BP female, and children gophers are 2, 2, 1, 2 and 16 respectively. The joint
distribution of (X1, X2, X3, X4, X5)^T is Multivariate Hypergeometric and the desired probability is

P(X1 = 1, X2 = 1, X3 = 0, X4 = 0, X5 = 2) + P(X1 = 0, X2 = 0, X3 = 1, X4 = 1, X5 = 2)
= [C(2,1) C(2,1) C(1,0) C(2,0) C(16,2) + C(2,0) C(2,0) C(1,1) C(2,1) C(16,2)] / C(23,4)
= (2 × 2 × 1 × 1 × 120)/8855 + (1 × 1 × 1 × 2 × 120)/8855 = 720/8855 ≈ 0.0813.
(ii) (A) FALSE. Take X to have a N(0, 1) distribution, so that E[X] = 0 and E[X³] = 0, and take Y = X².
Then Cov(X, Y) = Cov(X, X²) = E[X³] − E[X]E[X²] = 0. Take g(X) = X² and h(Y) = Y. Then
Cov(g(X), h(Y)) = Cov(X², X²) = Var(X²) = 2 > 0, where Var(X²) is the variance of a χ²₁ distribution.
(B) TRUE. Let A and B be arbitrary subsets of R, and take g(X) = IA (X) and h(Y ) = IB (Y ). Since
Cov(g(X), h(Y )) = 0 we have E [IA (X)IB (Y )] = E [IA (X)] E [IB (Y )], which is equivalent to P (X ∈ A, Y ∈
B) = P (X ∈ A)P (Y ∈ B). Since A and B were arbitrary this implies that X and Y are independent. But
then X and Y are uncorrelated.
(b) Suppose that the random vector Y = (Y1, Y2, Y3) is uniformly distributed on the ball of radius 1 centered at the
origin; that is, Y has joint probability density function (pdf)

fY(y1, y2, y3) = 3/(4π) if (y1, y2, y3) ∈ S, and 0 otherwise,

where S = {(y1, y2, y3) ∈ R³ : y1² + y2² + y3² ≤ 1} is the ball of radius 1 centred at (0, 0, 0). Let V = Y1² + Y2² + Y3².
Find the probability density function of V and find E[V].
Solution: We can find the pdf of V by first computing the CDF of V. For v ∈ [0, 1] we have

FV(v) = P(V ≤ v) = P(Y1² + Y2² + Y3² ≤ (√v)²) = P((Y1, Y2, Y3) ∈ S_√v),

where S_√v is the ball of radius √v. Thus, for v ∈ [0, 1],

FV(v) = ∫∫∫_{S_√v} fY(y1, y2, y3) dy1 dy2 dy3 = ∫∫∫_{S_√v} 3/(4π) dy1 dy2 dy3
= (3/(4π)) × Volume of S_√v = (3/(4π)) × (4π(√v)³/3) = v^{3/2}.
We also have FV(v) = 0 for v < 0 and FV(v) = 1 for v > 1. Differentiating, we obtain the pdf of V as

fV(v) = FV′(v) = (3/2)√v for 0 ≤ v ≤ 1, and 0 otherwise.

Then

E[V] = ∫0^1 v (3/2)√v dv = (3/2)(2/5) v^{5/2} |0^1 = 3/5.
(c) Let X and Y be independent and identically distributed uniform random variables on the interval (0, 1). Define
U = 12 (X + Y ) to be the average and define V = X.
(i) Find the joint probability density function of (U, V ) and draw the sample space of (U, V ).
(ii) Find the marginal probability density function of U .
Solution: (i) The inverse transformation is X = V and Y = 2U − V. The matrix of partial derivatives of the
inverse transformation is

[ ∂x/∂u  ∂x/∂v ]   [ 0   1 ]
[ ∂y/∂u  ∂y/∂v ] = [ 2  −1 ]

with determinant −2. The joint pdf of (X, Y) is

fXY(x, y) = 1 for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and 0 otherwise,

so by the change-of-variables formula, fUV(u, v) = |−2| fXY(v, 2u − v) = 2 on the sample space
{(u, v) : 0 ≤ v ≤ 1, 0 ≤ 2u − v ≤ 1}, and 0 otherwise.
(ii) To get the marginal pdf of U we integrate the joint pdf fU V (u, v) over the variable v from −∞ to ∞. From
Figure 1, we see that the limits of integration are different depending on whether u ∈ [0, 0.5] or u ∈ (0.5, 1].
For u ∈ [0, 0.5] we have

fU(u) = ∫0^{2u} 2 dv = 4u.

For u ∈ (0.5, 1] we have

fU(u) = ∫_{2u−1}^{1} 2 dv = 2(1 − (2u − 1)) = 4(1 − u).
Clearly, for u ∉ [0, 1] we have fU(u) = 0. To summarize,

fU(u) = 4u for u ∈ [0, 0.5],  4(1 − u) for u ∈ (0.5, 1],  and 0 otherwise.
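The piecewise density can be sanity-checked by simulation (a sketch, not part of the original solution; the evaluation point u = 0.25 and the seed are arbitrary):

```python
import random

def cdf_U(u):
    """CDF implied by the piecewise density: integral of 4t on [0, u] for u <= 1/2,
    and 1 minus the integral of 4(1 - t) on [u, 1] for u > 1/2."""
    if u <= 0:
        return 0.0
    if u <= 0.5:
        return 2 * u * u
    if u <= 1.0:
        return 1 - 2 * (1 - u) ** 2
    return 1.0

rng = random.Random(2)
trials = 200_000
hits = sum((rng.random() + rng.random()) / 2 <= 0.25 for _ in range(trials))
p_hat = hits / trials  # Monte Carlo estimate of P(U <= 0.25)
```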
(d) Let X = (X1, X2)^T be uniformly distributed on the positive quadrant intersected with the disk of radius 1 centered
at the origin; i.e., X has joint pdf fX(x1, x2) = (4/π) I_{SX}(x1, x2), where

SX = {(x1, x2) ∈ R² : x1 ≥ 0, x2 ≥ 0, x1² + x2² ≤ 1}.

Let Y1 = X1² and Y2 = X2². Find the joint pdf of (Y1, Y2)^T and the marginal pdf of Y1.
Solution: We use the bivariate change of variable formula to get the joint pdf of (Y1, Y2). The inverse transformation
is X1 = Y1^{1/2} and X2 = Y2^{1/2}. The Jacobian is

J = det [ 1/(2√y1)    0         ]
        [ 0           1/(2√y2)  ] = 1/(4√(y1 y2)).

Then fY1,Y2(y1, y2) = (4/π) |J| = 1/(π√(y1 y2)) for y1 > 0, y2 > 0, y1 + y2 ≤ 1, and 0 otherwise. For 0 < y1 < 1,
the marginal pdf of Y1 is

fY1(y1) = ∫0^{1−y1} 1/(π√(y1 y2)) dy2 = (1/(π√y1)) · 2√(1 − y1) = (2/π)√((1 − y1)/y1),

and fY1(y1) = 0 otherwise.
(e) Let X1, X2, X3 be independent and identically distributed Exponential(λ) random variables (the Exponential(λ)
distribution has pdf fX(x) = λe^{−λx} for x > 0 and fX(x) = 0 for x ≤ 0, and cdf FX(x) = 1 − e^{−λx} for x > 0
and FX(x) = 0 for x ≤ 0). Find P(X1 + X2 + X3 ≤ 3/2). (Write down the appropriate 3-dimensional integral and
evaluate it.)
Solution: The appropriate 3-dimensional integral is (by symmetry, every order in which the integration is done is
equally easy)

P(X1 + X2 + X3 ≤ 3/2) = ∫0^{3/2} ∫0^{3/2−x3} ∫0^{3/2−x2−x3} fX(x1) fX(x2) fX(x3) dx1 dx2 dx3
= ∫0^{3/2} ∫0^{3/2−x3} fX(x2) fX(x3) (1 − e^{−λ(3/2−x2−x3)}) dx2 dx3
= ∫0^{3/2} λe^{−λx3} ∫0^{3/2−x3} (λe^{−λx2} − λe^{−λ(3/2−x3)}) dx2 dx3
= ∫0^{3/2} λe^{−λx3} (1 − e^{−λ(3/2−x3)} − λ(3/2 − x3) e^{−λ(3/2−x3)}) dx3
= 1 − e^{−3λ/2} − (3λ/2) e^{−3λ/2} − λ² ((3/2)²/2) e^{−3λ/2}
= 1 − e^{−3λ/2} (1 + 3λ/2 + (1/2)(3λ/2)²).
Problem 3. (Moment-generating function)
(a) Let X ∼ Binomial(n, p). Find the moment-generating function of X in terms of n and p. (Hint: X is the sum of n
independent Bernoulli random variables.)
Solution: X is the sum of n independent Bernoulli random variables X1, . . . , Xn, each with moment generating
function

MXi(t) = E[exp(tXi)] = pe^t + (1 − p).

Combining these and applying independence yields

MX(t) = E[exp(tX)] = E[exp(t(X1 + . . . + Xn))] = ∏_{i=1}^n E[exp(tXi)] = (1 − p + pe^t)^n.
(b) Let X1 , X2 , . . . be independent and identically distributed random variables, each with a Poisson distribution with
mean 1. Let Sn = X1 + . . . + Xn for n ≥ 1 and let Mn (t) be the moment generating function of Sn .
(i) Find the smallest n such that P (Mn (Sn ) > 1) ≥ 0.99 using the exact probability.
(ii) Find the smallest n such that P (Mn (Sn ) > 1) ≥ 0.99 using the central limit theorem.
Solution: (i) Each Xi has mgf M(t) = e^{e^t − 1}. Then Mn(t) = M(t)^n = e^{n(e^t − 1)} and Mn(Sn) = e^{n(e^{Sn} − 1)}. So,

P(Mn(Sn) > 1) = P(e^{n(e^{Sn} − 1)} > 1) = P(n(e^{Sn} − 1) > 0) = P(e^{Sn} > 1) = P(Sn > 0).

The exact distribution of Sn is Poisson(n), so P(Sn > 0) = 1 − P(Sn = 0) = 1 − e^{−n}. Setting this to .99 gives
e^{−n} = .01, or n = ln 100 ≈ 4.605. So n = 5 is the smallest n.
(ii) By the central limit theorem, the distribution of (Sn − n)/√n is approximated by a N(0, 1) distribution,
so that (from part (i)),

P(Mn(Sn) > 1) = P(Sn > 0) = P((Sn − n)/√n > −√n) ≈ 1 − Φ(−√n),

where Φ(·) is the standard normal cdf. Setting this to 0.99 gives Φ(−√n) = 0.01, or (from the table)
−√n = −2.33, or n = 5.429. So n = 6 is the smallest n according to this approximation.
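Both answers can be reproduced programmatically (a sketch, not part of the original solution):

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# (i) exact: P(S_n > 0) = 1 - e^{-n}; smallest n with 1 - e^{-n} >= 0.99
n_exact = next(n for n in range(1, 100) if 1 - math.exp(-n) >= 0.99)

# (ii) CLT approximation: P(S_n > 0) is approximately 1 - Phi(-sqrt(n))
n_clt = next(n for n in range(1, 100) if 1 - Phi(-math.sqrt(n)) >= 0.99)
```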
(c) Let X have a Gamma (3, 3) distribution. Conditional on X = x let Z have a normal distribution with mean x and
variance 2. Finally, let Y = eZ . Find E[Y ] and Var(Y ).
Solution: Conditioned on X = x, E[Y] is the moment generating function of a N(x, 2) distribution evaluated
at 1. Letting Mx(·) denote this mgf, we have Mx(t) = e^{xt + t²}, and so E[Y | X = x] = Mx(1) = e^{x+1}. Similarly,
E[Y² | X = x] = E[e^{2Z} | X = x] = Mx(2) = e^{2x+4}. Then by the law of total expectation, E[Y] = E[e^{X+1}] =
e MX(1) and E[Y²] = E[e^{2X+4}] = e⁴ MX(2), where MX(·) is the moment generating function of X. Since X has
a Gamma(3, 3) distribution we have MX(t) = (3/(3 − t))³, and so

E[Y] = e (3/2)³ = 27e/8 ≈ 9.174 and E[Y²] = e⁴ (3/1)³ = 27e⁴.
Then

Var(Y) = E[Y²] − (E[Y])² = 27e⁴ − (27/8)² e² ≈ 1390.
(d) Let X be a random variable with Exponential(λ) distribution. Recall that the moment generating function of X is
MX(t) = λ/(λ − t) for t < λ.
(i) Find E [X n ], where n is any positive integer. You may use the mgf or compute E [X n ] directly.
(ii) Find MY (t), the mgf of Y = ln X, for t > −1.
(e) Let X1, . . . , Xn be independent Poisson(1) random variables and let X̄ = (1/n) ∑_{i=1}^n Xi. Let k > 1 be given. Show that

P(X̄ ≥ k) ≤ (e^{k−1}/k^k)^n.
Solution: We find the mgf of X̄ and then apply Chernoff's bound. The mgf of each Xi is

MXi(t) = E[e^{tXi}] = ∑_{j=0}^∞ e^{tj} e^{−1}/j! = e^{−1} e^{e^t} = e^{e^t − 1},

so the mgf of X̄ = Sn/n is E[e^{tX̄}] = E[e^{(t/n)Sn}] = e^{n(e^{t/n} − 1)}. By Chernoff's bound,

P(X̄ > k) ≤ min_{t>0} e^{n(e^{t/n} − 1)}/e^{tk} = min_{t>0} e^{n(e^{t/n} − 1) − tk}.    (1)
Differentiating the exponent with respect to t gives e^{t/n} − k. Setting this to 0 and solving for t gives t = n ln k.
Plugging this into the RHS of Eqn. (1) then gives

P(X̄ > k) ≤ e^{n(k−1) − nk ln k} = e^{n(k−1)}/k^{nk} = (e^{k−1}/k^k)^n,

as desired.
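The minimization can be verified numerically (a sketch, not part of the original solution; n = 10 and k = 2 are arbitrary):

```python
import math

def chernoff_bound(t, n, k):
    """exp(n(e^{t/n} - 1) - t k), the Chernoff bound on P(Xbar > k) at parameter t."""
    return math.exp(n * (math.exp(t / n) - 1) - t * k)

n, k = 10, 2.0
t_star = n * math.log(k)                       # claimed minimizer t = n ln k
bound_at_star = chernoff_bound(t_star, n, k)
closed_form = (math.exp(k - 1) / k ** k) ** n  # (e^{k-1} / k^k)^n

# Confirm on a coarse grid of t values that nothing beats t_star.
grid_min = min(chernoff_bound(0.1 * j, n, k) for j in range(1, 200))
```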
Problem 4. (Order Statistics)
(a) Let X1 , . . . , Xn be a sequence of independent random variables uniformly distributed on the interval (0, 1), and let
X(1) , . . . , X(n) denote their order statistics. For fixed k, let gn (x) denote the probability density function of nX(k) .
Find gn(x) and show that

lim_{n→∞} gn(x) = x^{k−1} e^{−x}/(k − 1)! for x ≥ 0, and 0 for x < 0,

which is the Gamma(k, 1) density.
(b) (i) Let X1 , . . . , X6 be independent random variables uniformly distributed on the interval (0, 1). Find the pdf of
U = min {max (X1 , X2 ) , max (X3 , X4 ) , max (X5 , X6 )}.
(ii) Let X1, . . . , Xn be independent and identically distributed random variables, each with a Uniform distribution
on the interval (0, 1). Let X = min(X1, . . . , Xn) and Y = max(X1, . . . , Xn). Find P(X < 1/2 < Y) and E[X³].
Solution: (i) Let Y1 = max(X1, X2), Y2 = max(X3, X4), and Y3 = max(X5, X6). Then Y1, Y2, Y3 are
independent and identically distributed random variables, each with cdf

FY(y) = P(X1 ≤ y, X2 ≤ y) = [P(X1 ≤ y)]² = y² for 0 < y < 1,  0 for y ≤ 0,  1 for y ≥ 1,

and pdf

fY(y) = 2y for 0 < y < 1, and 0 otherwise.
Then U = min(Y1, Y2, Y3) has cdf

FU(u) = 1 − P(min(Y1, Y2, Y3) > u)
= 1 − P(Y1 > u, Y2 > u, Y3 > u)
= 1 − (1 − FY(u))³
= 1 − (1 − u²)³ for 0 < u < 1,  0 for u ≤ 0,  1 for u ≥ 1.
By differentiation, then, U has pdf given by

fU(u) = 6u(1 − u²)² for 0 < u < 1, and 0 otherwise.
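The cdf FU(u) = 1 − (1 − u²)³ can be checked by simulation (a sketch, not part of the original solution; the evaluation point u = 0.6 and the seed are arbitrary):

```python
import random

rng = random.Random(4)
trials = 200_000
u0 = 0.6
hits = 0
for _ in range(trials):
    xs = [rng.random() for _ in range(6)]
    # U = min of the three pairwise maxima, as in the problem statement.
    u = min(max(xs[0], xs[1]), max(xs[2], xs[3]), max(xs[4], xs[5]))
    hits += u <= u0
p_hat = hits / trials
cdf_exact = 1 - (1 - u0 ** 2) ** 3  # F_U(0.6) = 1 - (1 - 0.36)^3
```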
(c) Let X1 , . . . , Xn , Y1 , . . . , Yn be independent Uniform (0, 1) random variables. We form n rectangles, where the i-th
rectangle has adjacent sides of length Xi and Yi , for i = 1, . . . , n. Let Ai be the area of the i th rectangle, i = 1, . . . , n,
and define Amax = max (A1 , . . . , An ). Find the pdf of Amax .
Solution: First note that Ai = Xi Yi for i = 1, . . . , n, and A1 , . . . , An are independent since the vectors
(X1 , Y1 ) , . . . , (Xn , Yn ) are independent. Also, the Ai are identically distributed, each with support (0, 1). We first
find P (Ai ≤ a) for any a ∈ (0, 1), which is P (Xi Yi ≤ a). If Yi ≤ a then the limits for Xi are from 0 to 1. If Yi > a
then the limits for Xi are from 0 to a/Yi . The joint pdf of (Xi , Yi ) is f (x, y) = 1 for 0 < x, y < 1. Thus we have
P(Ai ≤ a) = P(Xi Yi ≤ a) = ∫0^a ∫0^1 dx dy + ∫a^1 ∫0^{a/y} dx dy
= a + a ∫a^1 (1/y) dy = a + a(ln 1 − ln a) = a(1 − ln a).
Next, we get the cdf of Amax as

F(a) = P(Amax ≤ a) = P(A1 ≤ a, . . . , An ≤ a) = [P(A1 ≤ a)]^n = (a(1 − ln a))^n

for a ∈ (0, 1). Differentiating, the pdf of Amax is f(a) = −n ln a · (a(1 − ln a))^{n−1} for 0 < a < 1, and 0 otherwise.
(d) Let X1, . . . , Xn be independent and identically distributed Exponential(λ) random variables. Compute
E[X(1) e^{−λX(2)}], where X(1) and X(2) are the first and second order statistics of X1, . . . , Xn.
Solution: The joint pdf of (X(1), X(2)) is n(n − 1) fX(x1) fX(x2)(1 − FX(x2))^{n−2} = n(n − 1)λ² e^{−λx1} e^{−λ(n−1)x2}
for 0 < x1 < x2. Then

E[X(1) e^{−λX(2)}] = ∫0^∞ ∫_{x1}^∞ n(n − 1)λ² x1 e^{−λx2} e^{−λx1} e^{−λ(n−1)x2} dx2 dx1
= ∫0^∞ (n − 1)λ x1 e^{−λx1} e^{−λn x1} dx1
= (n − 1)/(n + 1) ∫0^∞ x1 λ(n + 1) e^{−λ(n+1)x1} dx1
= (n − 1)/(λ(n + 1)²)

since the final integral is the mean of an Exponential(λ(n + 1)) distribution.
(e) Let X1 , X2 , X3 be independent Uniform (0, 1) random variables, and let X(1) , X(2) , X(3) denote their order statistics.
Let X be the area of the square with side length X(2) and let Y be the area of the rectangle with side lengths X(1)
and X(3) . Find P (X > Y ), E[X], and E[Y ].
Solution: We wish to compute P(X(2)² > X(1) X(3)). The joint pdf of (X(1), X(2), X(3)) is f123(x1, x2, x3) = 6
for 0 < x1 < x2 < x3 < 1 and equals 0 otherwise. The solution is computed by integrating f123(x1, x2, x3) over
the region A = {(x1, x2, x3) : x2² − x1 x3 > 0}. Taking the inner integral over x2, note that for given values of
0 < x1 < x3 < 1, the possible values of x2 are √(x1 x3) < x2 < x3. The integral can be written as
∫∫∫_A f123(x1, x2, x3) dx1 dx2 dx3 = 6 ∫0^1 ∫0^{x3} ∫_{√(x1 x3)}^{x3} dx2 dx1 dx3
= 6 ∫0^1 ∫0^{x3} (x3 − x1^{1/2} x3^{1/2}) dx1 dx3
= 6 ∫0^1 (x3² − (2/3) x3^{3/2} x3^{1/2}) dx3
= 2 ∫0^1 x3² dx3 = 2/3.
The marginal pdf of X(2) and the joint pdf of (X(1), X(3)) are given by

f2(x2) = 6x2(1 − x2) for 0 < x2 < 1, and 0 otherwise,

and

f13(x1, x3) = 6(x3 − x1) for 0 < x1 < x3 < 1, and 0 otherwise,

respectively. Then
E[X] = E[X(2)²] = ∫0^1 x2² · 6x2(1 − x2) dx2 = 6/4 − 6/5 = 3/10,

E[Y] = E[X(1) X(3)] = ∫0^1 ∫0^{x3} x1 x3 · 6(x3 − x1) dx1 dx3 = 6 ∫0^1 (x3⁴/2 − x3⁴/3) dx3 = 1/5.
Problem 5. (Convergence of Random Variables and Limit Theorems)
(a) Let X and X1 , X2 , . . . be random variables each having a N (0, 1) distribution. Suppose (Xn , X) has a bivariate
normal distribution for each n and the correlation between Xn and X is ρ (Xn , X) = ρn , for n ≥ 1.
(i) Show that Xn converges to X in distribution as n → ∞ (for arbitrary correlations ρn ).
(ii) If ρn → 1 as n → ∞, show that Xn converges to X in probability as n → ∞.
(iii) Show that if ρn = 1 − a^n for some constant a ∈ (0, 1), then Xn converges to X with probability 1 as n → ∞.
Do you get convergence with probability 1 if a = 0? If a = 1? Prove your answers.
Solution: (i) If Φ is the standard normal cdf and Fn is the cdf of Xn for n ≥ 1, then Fn (x) = Φ(x) for all
x (this is given). Therefore, Fn (x) → Φ(x) as n → ∞ (trivially). It is also given that Φ(x) is the cdf of X.
Therefore, Xn converges to X in distribution.
(ii) Let ϵ > 0 be given. Both Xn and X are zero mean, so Xn − X also has zero mean. So by Chebyshev's
inequality,

P(|Xn − X| > ϵ) ≤ Var(Xn − X)/ϵ².    (2)
But
Var (Xn − X) = Var (Xn ) + Var(X) − 2 Cov (Xn , X) = 1 + 1 − 2ρn = 2 (1 − ρn ) . (3)
So if ρn → 1 as n → ∞ then P (|Xn − X| > ϵ) → 0 as n → ∞ for any ϵ > 0. Thus, Xn converges to X in
probability.
(iii) Plugging ρn = 1 − a^n into (3) we have Var(Xn − X) = 2a^n. Plugging this into Chebyshev's inequality in
(2), we have

P(|Xn − X| > ϵ) ≤ 2a^n/ϵ².
If a ∈ (0, 1), the sum of the right-hand side over n converges, and so Xn converges to X with probability 1 by the
sufficient condition from class. If a = 0, the sum is again convergent (it is equal to 0), and so Xn converges to X with
probability 1. If a = 1 then ρn = 0, so that Xn is independent of X for any n (since (Xn, X) is bivariate
normal). In this case Xn − X has a N(0, 2) distribution and so P(|Xn − X| > ϵ) = 2(1 − Φ(ϵ/√2)), where Φ is
the standard normal cdf. This is some positive constant for every n and so P(|Xn − X| > ϵ) does not converge
to 0 as n → ∞. Therefore, Xn does not converge to X in probability. This then implies that Xn does not
converge to X with probability 1.
(b) Suppose 80 points are uniformly distributed in the ball in R³ centered at the origin with radius 1. Let Di be the
Euclidean distance from the origin of the i-th point, i = 1, . . . , 80, and let D̄ = (1/80) ∑_{i=1}^{80} Di. Use the central limit
theorem to find a value c satisfying P(|D̄ − E[D̄]| ≤ c) = 0.95.
Solution: We first find the distribution of Di. For x ∈ (0, 1), Di ≤ x if and only if the i-th point in the ball
is in the ball in R³ centred at the origin of radius x. That is, P(Di ≤ x) = (4πx³/3)/(4π/3) = x³. From this we have that
the pdf of Di is

fD(x) = 3x² for x ∈ [0, 1], and 0 otherwise.
The mean, second moment, and variance are then easily computed to be

E[Di] = 3/4,  E[Di²] = 3/5,  Var(Di) = 3/80.
By the central limit theorem
P(|D̄ − 3/4| ≤ c) = P(√80 |D̄ − 3/4|/√(3/80) ≤ c√80/√(3/80)) ≈ P(|Z| ≤ c√80/√(3/80)),
where Z has a N(0, 1) distribution. In order for this probability to be equal to 0.95 we need c√80/√(3/80)
equal to 1.96. Solving for c gives

c = 1.96√3/80 ≈ 0.0424.
(c) (i) Suppose that {Xn} is a sequence of zero-mean random variables and X is a zero-mean random variable, and
suppose that E[(Xn − X)²] ≤ C/n^p for every n, for some constants C and p > 1. Show that Xn → X almost
surely.
(ii) Suppose that {Xn } is a sequence of non-negative random variables. Show that E [Xn ] → 0 as n → ∞ implies
that Xn → 0 in probability, but that the converse is false in general.
Solution: (i) Let ϵ > 0 be given. Since E[Xn − X] = 0 we have by Chebyshev's inequality that

P(|Xn − X| > ϵ) ≤ Var(Xn − X)/ϵ² = E[(Xn − X)²]/ϵ² ≤ C/(ϵ² n^p).

Therefore,

∑_{n=1}^∞ P(|Xn − X| > ϵ) ≤ ∑_{n=1}^∞ C/(ϵ² n^p) < ∞

if p > 1. By a sufficient condition from class, this implies that Xn → X almost surely.
(ii) Suppose that E [Xn ] → 0. Let ϵ > 0 be given. Then by Markov’s inequality
P(|Xn − 0| > ϵ) ≤ E[Xn]/ϵ → 0
as n → ∞. Hence, Xn → 0 in probability. To prove that the reverse implication is false in general, we give a
counterexample. Let the distribution of Xn be given by

Xn = 0 with probability 1 − 1/n, and Xn = n with probability 1/n.

Then for any ϵ > 0, P(|Xn − 0| > ϵ) ≤ 1/n → 0, so Xn → 0 in probability, yet E[Xn] = n · (1/n) = 1 for all n,
so E[Xn] does not converge to 0.
(d) Suppose that {Xn } and {Yn } are sequences of random variables and X and Y are random variables such that
Xn → X in distribution and Yn → Y in distribution. Give an example where it is not true that Xn + Yn converges
to X + Y in distribution. [Hint: Consider Y = −X]
Solution: Let W and Z be independent N (0, 1) random variables. Let Xn = W for all n and Yn = Z for all n. Let
X = W and Y = −W . It is easy to see that X and Y are both N (0, 1), and it is clear that Xn → X in distribution
and Yn → Y in distribution. But X + Y = 0 whereas Xn + Yn is distributed as N (0, 2) for all n. Therefore, it is not
true that Xn + Yn converges to X + Y in distribution.
By the central limit theorem, Zn = ((1/n) ∑_{i=1}^n F(Xi) − 1/2)/√(1/(12n)) has approximately a N(0, 1)
distribution for large n. Therefore,

P(|(1/n) ∑_{i=1}^n F(Xi) − 1/2| ≥ 1/√(3n)) = P(|Zn| ≥ (1/√(3n))/√(1/(12n))) = P(|Zn| ≥ 2) ≈ 0.0456.
(a) Let T ∼ t1 (the t distribution with 1 degree of freedom). Explain why T has the same distribution as X/|Y|,
where X, Y ∼ N(0, 1) are IID, and hence why T also has the same distribution as X/Y.
[Hints: The distribution of X/Y when X, Y ∼ N(0, 1) are IID is also called the Cauchy distribution (the t distribution with
1 degree of freedom). You may check that it has a PDF:

f(x) = (1/π) · 1/(1 + x²).

You may use this result without proof in part (b).]
Solution: By definition of the t distribution, T has the same distribution as X/√U, where X ∼ N(0, 1), U ∼ χ²₁,
and X and U are independent. By definition of the χ² distribution, U has the same distribution as Y² where
Y ∼ N(0, 1). Therefore T has the same distribution as X/√(Y²) = X/|Y|. To show X/|Y| has the same distribution as X/Y:
For any t ∈ R,

P[X/|Y| ≤ t] = P[X ≤ t|Y|] = P[X ≤ tY, Y > 0] + P[X ≤ −tY, Y < 0],

and

P[X/Y ≤ t] = P[X ≤ tY, Y > 0] + P[X ≥ tY, Y < 0] = P[X ≤ tY, Y > 0] + P[−X ≤ −tY, Y < 0].

Since (X, Y) has the same distribution as (−X, Y), the above implies

P[X/|Y| ≤ t] = P[X/Y ≤ t].

So the CDFs of X/|Y| and X/Y are the same.
(b) t1 is an example of an extremely "heavy-tailed" distribution: For T ∼ t1, show that E[|T|] = ∞ and E[T²] = ∞.
If T1, . . . , Tn ∼ t1 are IID, explain why the Law of Large Numbers and the Central Limit Theorem do not apply to the
sample mean (1/n)(T1 + . . . + Tn).
Solution: First,

E[|T|] = ∫_{−∞}^∞ |x| f(x) dx = 2 ∫0^∞ (1/π) · x/(1 + x²) dx = ∞,

since the integrand behaves like 1/(πx) for large x. Also,

E[T²] = ∫_{−∞}^∞ x² f(x) dx = 2 ∫0^∞ (1/π) · x²/(1 + x²) dx,

which equals ∞ since x²/(x² + 1) → 1 as x → ∞. So, T does not have a well-defined (finite) mean or variance, and the
LLN and CLT both do not apply. (In fact, it may be shown that (1/n)(T1 + . . . + Tn) does not converge to 0 but rather
(1/n)(T1 + . . . + Tn) ∼ t1 for any n.)
(c) Let Un ∼ χ²_n. Show that 1/√(Un/n) → 1 in probability as n → ∞. (Hint: Apply the Law of Large Numbers and the
Continuous Mapping Theorem.)
[Continuous Mapping Theorem: If random variables {Xn}_{n=1}^∞ converge in probability to c ∈ R (as n → ∞),
and g : R → R is continuous, then {g(Xn)}_{n=1}^∞ converge in probability to g(c).]
Solution: We may write Un = ∑_{i=1}^n Xi², where X1, . . . , Xn ∼ N(0, 1) are IID. Then the LLN implies
(1/n)Un → E[Xi²] = 1 in probability as n → ∞. The function x ↦ 1/√x is continuous (at every x > 0), so by the
Continuous Mapping Theorem

1/√((1/n)Un) → 1

in probability as n → ∞.
(d) Using Slutsky's lemma, show that if Tn ∼ tn for each n = 1, 2, 3, . . ., then as n → ∞, Tn → Z in distribution
where Z ∼ N(0, 1). (This formalizes the statement that "the tn distribution approaches the standard normal
distribution as n gets large".)
[Slutsky's lemma: If sequences of random variables {Xn}_{n=1}^∞ and {Yn}_{n=1}^∞ satisfy Xn → c in probability for a
constant c ∈ R and Yn → Y in distribution for a random variable Y, then Xn Yn → cY in distribution.]
Solution: We may write Tn = Zn/√(Un/n), where Zn ∼ N(0, 1), Un ∼ χ²_n, and Zn and Un are independent. By part (c),
1/√(Un/n) → 1 in probability. Clearly, Zn → N(0, 1) in distribution, since the distribution of Zn does not change with
n. Then Tn → N(0, 1) in distribution by Slutsky's lemma.
(e) Let ξp be the p-th quantile of the Fm,n distribution (the F distribution with (m, n) degrees of freedom) and ξ′p the
p-th quantile of Fn,m. Show that ξp ξ′_{1−p} = 1.
Solution: If F ∼ Fm,n, then

F = (χ²m/m)/(χ²n/n)  ⇒  1/F = (χ²n/n)/(χ²m/m) ∼ Fn,m.

Hence

p = P[Fm,n ≤ ξp] = P[1/Fm,n ≥ 1/ξp] = P[Fn,m ≥ 1/ξp]  ⇒  P[Fn,m ≤ 1/ξp] = 1 − p.

But P[Fn,m ≤ ξ′_{1−p}] = 1 − p, so 1/ξp = ξ′_{1−p}, i.e., ξp ξ′_{1−p} = 1.
(a) Let X1, . . . , Xn be IID observations from the Geometric(p) distribution with probability mass function

f(x | p) = p(1 − p)^{x−1}, x = 1, 2, . . . ,

with a single parameter p ∈ [0, 1]. Compute the method-of-moments estimate of p, as well as the MLE of p. For
large n, what approximately is the sampling distribution of the MLE? (You may use, without proof, the fact that
the Geometric(p) distribution has mean 1/p.)
Solution. The method of moments estimator sets the population mean, 1/p, equal to the sample mean,

X̄ = (1/n) ∑_{i=1}^n Xi.
Page 16
Inverting to solve for p gives
1
p̂MOM = .
X̄
For the maximum likelihood estimator, the likelihood is
Y
n
L (p | X1 , . . . , Xn ) = p(1 − p)Xi −1 = pn (1 − p)n(X̄−1)
i=1
Taking g(θ) = 1/θ gives (g ′ (θ)) = θ−4 , which for θ = 1/p is (g ′ (θ)) = p4 . Hence
2 2
√ √ √
n (p̂MLE − p) = n(1/X̄ − p) = n(g(X) − g(1/p)) ⇒ N 0, p2 (1 − p) .
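The delta-method approximation above can be checked by simulation. The sketch below (illustrative constants; geometric draws via inverse-transform sampling, an assumption of this check rather than part of the original solution) verifies that p̂ = 1/X̄ centers on p with spread near √(p²(1−p)/n).

```python
import math
import random

def geometric(p, rng):
    """Inverse-transform draw from Geometric(p) on {1, 2, 3, ...}."""
    u = rng.random()
    return int(math.log(u) / math.log(1.0 - p)) + 1

rng = random.Random(1)
p, n, reps = 0.3, 500, 2000
estimates = []
for _ in range(reps):
    xbar = sum(geometric(p, rng) for _ in range(n)) / n
    estimates.append(1.0 / xbar)  # p_hat = 1 / sample mean (MOM = MLE)

mean_hat = sum(estimates) / reps
# Asymptotic standard deviation of p_hat: sqrt(p^2 (1 - p) / n)
sd_theory = math.sqrt(p * p * (1 - p) / n)
sd_hat = math.sqrt(sum((e - mean_hat) ** 2 for e in estimates) / reps)
```

The empirical spread sd_hat should land within a few percent of the delta-method value sd_theory.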
(b) Let X1, . . . , Xn ∼ IID N(µ, σ²). We showed in class that the MLEs for µ and σ² are given by
µ̂ = X̄ and σ̂² = (1/n) Σ_{i=1}^n (Xi − X̄)².
(i) By computing the Fisher information matrix I(µ, σ²), derive the approximate joint distribution of µ̂ and σ̂²
for large n. (Hint: Substitute v = σ² and treat v as the parameter rather than σ.)
(ii) Suppose it is known that µ = 0. Compute the MLE σ̃² in the one-parameter sub-model N(0, σ²). The Fisher
information matrix in part (i) has off-diagonal entries equal to 0 when µ = 0 and n is large. What does this
tell you about the standard error of σ̃² as compared to that of σ̂²?
Solution. (i) Denote v = σ². Then
f(X | µ, v) = (1/√(2πv)) e^{−(X−µ)²/(2v)}
and
ℓ(µ, v) = log f(X | µ, v) = −(1/2) log 2π − (1/2) log v − (1/(2v))(X − µ)².
To obtain the Fisher information matrix I(µ, v), we must compute the four second-order partial derivatives of
ℓ(µ, v). These quantities are
∂²ℓ/∂µ² = −1/v,
∂²ℓ/∂v² = 1/(2v²) − (1/v³)(X − µ)²,
∂²ℓ/∂µ∂v = ∂²ℓ/∂v∂µ = −(X − µ)/v².
Then
I(µ, v) = −E_{µ,v}[∂²ℓ/∂µ², ∂²ℓ/∂µ∂v; ∂²ℓ/∂v∂µ, ∂²ℓ/∂v²] = [1/v, 0; 0, 1/(2v²)].
Hence, for large n, (µ̂, σ̂²) is approximately jointly normal with mean (µ, v), variances v/n and 2v²/n, and zero
covariance.
(ii) When µ = 0 is known, the log-likelihood of the sample is
ℓ(v) = −(n/2) log 2π − (n/2) log v − (1/(2v)) Σ_{i=1}^n Xi²,
and maximizing over v gives
ṽ = σ̃² = (1/n) Σ_{i=1}^n Xi².
Since the off-diagonal entries of the inverse Fisher information matrix are zero, µ̂ and σ̂² are asymptotically
uncorrelated, and so σ̂² and σ̃² have the same asymptotic standard error, approximately √(2v²/n) = σ²√(2/n).
(c) Let X1, . . . , Xn ∼ IID Uniform(0, θ) for a single parameter θ > 0, having PDF
f(x | θ) = (1/θ) 1{0 ≤ x ≤ θ}.
(i) Compute the MLE θ̂ of θ. (Hint: Note that the PDFs f(x | θ) do not have the same support for all θ > 0, and
they are also not differentiable with respect to θ; you will need to reason directly from the definition of the MLE.)
(ii) If the true parameter is θ, explain why θ̂ ≤ θ always, and hence why it cannot be true that √n(θ̂ − θ) converges
in distribution to N(0, v) for any v > 0.
Solution. (i) The likelihood is
L(θ | X1, . . . , Xn) = Π_{i=1}^n (1/θ) 1{0 ≤ Xi ≤ θ} = (1/θⁿ) 1{maxi Xi ≤ θ},
so the likelihood is 0 for θ < maxi Xi, and the maximum likelihood estimate is the value of θ that maximizes 1/θⁿ
on the interval [maxi Xi, ∞). Since 1/θⁿ is a decreasing function, this maximum occurs at the left end point, so
θ̂ = maxi Xi.
(ii) The true parameter θ must satisfy θ ≥ Xi for all i = 1, . . . , n, since the range of Xi is bounded above by θ.
Hence θ ≥ maxi Xi = θ̂ as well. This means that for any value of n, √n(θ̂ − θ) takes on positive values with
probability zero, so √n(θ̂ − θ) cannot be asymptotically normally distributed.
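Both claims in (ii) are easy to see in simulation. The sketch below (with arbitrary illustrative θ = 2; not part of the original solution) confirms that θ̂ = max Xi never exceeds θ, and that the error θ − θ̂ scales like 1/n (its mean times n stays near θ), not like 1/√n.

```python
import random

rng = random.Random(2)
theta, n, reps = 2.0, 1000, 500
errors = []
for _ in range(reps):
    theta_hat = max(rng.uniform(0.0, theta) for _ in range(n))
    assert theta_hat <= theta  # the MLE can never exceed the true parameter
    errors.append(theta - theta_hat)

# n * (theta - theta_hat) is approximately Exponential with mean theta,
# so its average should sit near theta rather than near 0.
mean_scaled_error = n * sum(errors) / reps
```

A one-sided, 1/n-rate error like this is exactly why a centered-and-√n-scaled limit cannot be normal.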
(ii) For any model of the above form, differentiate the identity
1 = ∫ e^{θT(x)−A(θ)} h(x) dx
with respect to θ on both sides, to obtain a formula for Eθ[T(X)], where Eθ denotes expectation when X ∼
f(x | θ). Verify that this formula is correct for the Poisson example in part (i).
(iii) The following procedure defines the generalized method-of-moments estimator: For a fixed function g(x),
compute Eθ[g(X)] in terms of θ, and take the estimate θ̂ to be the value of θ for which
Eθ[g(X)] = (1/n) Σ_{i=1}^n g(Xi).
The method-of-moments estimator discussed in class is the special case of this procedure for g(x) = x.
Let X1, . . . , Xn ∼ IID f(x | θ), where f(x | θ) is of the given form, and consider the generalized method-of-
moments estimator using the function g(x) = T(x). Show that this estimator is the same as the MLE. (You
may assume that the MLE is the unique solution to the equation 0 = l′(θ), where l(θ) is the log-likelihood.)
Solution. (i) A Poisson random variable has mass function
f(x | λ) = λˣe^{−λ}/x! = (1/x!) e^{x log λ − λ},
for x = 0, 1, 2, . . .. Reparametrizing by θ = log λ, we obtain
f(x | θ) = (1/x!) e^{θx − e^θ},
which is of the form in question. The functions are given by
T(x) = x, A(θ) = e^θ, and h(x) = 1/x!.
(ii) Differentiating both sides of the identity with respect to θ gives
0 = ∫ (T(x) − A′(θ)) e^{θT(x)−A(θ)} h(x) dx,
which implies
∫ T(x) e^{θT(x)−A(θ)} h(x) dx = A′(θ) ∫ e^{θT(x)−A(θ)} h(x) dx,
where the left side is Eθ[T(X)] and the integral on the right equals 1. Hence
Eθ[T(X)] = A′(θ).
(Note: By replacing integrals with sums, the identity holds for discrete models as well.)
In the Poisson model, A(θ) = e^θ, so A′(θ) = e^θ as well, and T(X) = X. This means
Eθ[X] = e^θ = λ,
which is indeed the mean of the Poisson(λ) distribution.
(iii) From part (ii), Eθ[T(X)] = A′(θ), so the generalized method-of-moments estimator is the value of θ
satisfying
A′(θ) = (1/n) Σ_{i=1}^n T(Xi).
We now compute the maximum likelihood estimator. The log-likelihood is
l(θ) = θ Σ_{i=1}^n T(Xi) − nA(θ) + Σ_{i=1}^n log h(Xi),
so setting 0 = l′(θ) = Σ_{i=1}^n T(Xi) − nA′(θ) yields
A′(θ) = (1/n) Σ_{i=1}^n T(Xi),
the same equation that defines the generalized method-of-moments estimator. Hence the two estimators coincide.
(e) Suppose that X is a discrete random variable with
P[X = 0] = (2/3)θ,
P[X = 1] = (1/3)θ,
P[X = 2] = (2/3)(1 − θ),
P[X = 3] = (1/3)(1 − θ),
where 0 ≤ θ ≤ 1 is a parameter. The following 10 independent observations were taken from such a distribution:
{3, 0, 2, 1, 3, 2, 1, 0, 2, 1}. (For parts (i) and (ii), feel free to use any asymptotic approximations you wish, even though
n = 10 here is rather small.)
(i) Find the method of moments estimate of θ and compute an approximate standard error of your estimate using
asymptotic theory.
(ii) Find the maximum likelihood estimate of θ and compute an approximate standard error of your estimate
using asymptotic theory. (Hint: Your formula for the log-likelihood based on n observations X1 , . . . , Xn should
depend on the numbers of 0’s, 1’s, 2’s, and 3’s in this sample.)
Solution. (i) Since E[X] = (1/3)θ + 2 · (2/3)(1 − θ) + 3 · (1/3)(1 − θ) = 7/3 − 2θ, the method of moments estimate
solves X̄ = 7/3 − 2θ̂, giving θ̂MOM = (7/3 − X̄)/2 = 5/12 ≈ 0.417 for the 10 given observations (X̄ = 1.5). Since
Var(θ̂MOM) = Var(X̄)/4 = Var(X)/(4n), an estimate of the standard error is given by the square root of 1/(4n) times
the sample variance of X1, . . . , Xn, which for the 10 given observations is 0.171.
(ii) For an IID sample X1, . . . , Xn, let N0, N1, N2, N3 be the total numbers of observations equal to 0, 1, 2, and
3. Then the log-likelihood is
l(θ) = log Π_{i=1}^n ((2θ/3)^{1{Xi=0}} (θ/3)^{1{Xi=1}} (2(1−θ)/3)^{1{Xi=2}} ((1−θ)/3)^{1{Xi=3}})
= N0 log((2/3)θ) + N1 log((1/3)θ) + N2 log((2/3)(1 − θ)) + N3 log((1/3)(1 − θ)).
To compute the MLE for θ, we set
0 = l′(θ) = N0/θ + N1/θ − N2/(1 − θ) − N3/(1 − θ)
and solve for θ, yielding θ̂ = (N0 + N1)/(N0 + N1 + N2 + N3) = (N0 + N1)/n. For the 10 given observations,
θ̂ = 0.5. The total probability that X = 0 or X = 1 is θ, so N0 + N1 ∼ Binomial(n, θ). Then
Var[θ̂] = (1/n²) Var[N0 + N1] = θ(1 − θ)/n.
(Alternatively, we may compute
∂²/∂θ² log f(x | θ) = −1/θ² for x = 0 or x = 1, and −1/(1 − θ)² for x = 2 or x = 3,
so the Fisher information is
I(θ) = −E[∂²/∂θ² log f(X | θ)] = θ · (1/θ²) + (1 − θ) · (1/(1 − θ)²) = 1/(θ(1 − θ)).
This shows that the variance of θ̂ is approximately θ(1 − θ)/n for large n.)
Thus, an estimate of the standard error is √(θ̂(1 − θ̂)/n), which for the 10 given observations is 0.158.
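The closed-form answers for the 10 given observations can be reproduced directly (a small numeric check, not part of the original solution):

```python
import math

data = [3, 0, 2, 1, 3, 2, 1, 0, 2, 1]
n = len(data)
n01 = sum(1 for x in data if x in (0, 1))  # N0 + N1
theta_hat = n01 / n                        # MLE = (N0 + N1) / n
se_hat = math.sqrt(theta_hat * (1 - theta_hat) / n)
# theta_hat is 0.5 and se_hat is about 0.158, matching the text.
```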
(f) Let (X1 , . . . , Xk ) ∼ Multinomial (n, (p1 , . . . , pk )). (This is not quite the setting of n IID observations from a para-
metric model, although you can think of (X1 , . . . , Xk ) as a summary of n such observations Y1 , . . . , Yn from the
parametric model Multinomial (1, (p1 , . . . , pk )), where Yi indicates which of k possible outcomes occurred for the ith
observation.) Find the MLE of pi .
To maximize l(p1, . . . , pk) subject to the linear constraint p1 + . . . + pk = 1, we may use the method of Lagrange
multipliers: Consider the Lagrangian
L(p1, . . . , pk, λ) = log (n choose X1, . . . , Xk) + Σ_{i=1}^k Xi log pi + λ(p1 + . . . + pk − 1)
for a constant λ to be chosen later. Clearly, subject to p1 + . . . + pk = 1, maximizing l(p1, . . . , pk) is the same as
maximizing L(p1, . . . , pk, λ). Ignoring momentarily the constraint p1 + . . . + pk = 1, the unconstrained maximizer
of L is obtained by setting, for each i = 1, . . . , k,
0 = ∂L/∂pi = Xi/pi + λ,
which yields p̂i = −Xi/λ. For the specific choice of constant λ = −n, we obtain p̂i = Xi/n and Σ_{i=1}^k p̂i =
Σ_{i=1}^k Xi/n = 1, so the constraint is satisfied. As p̂i = Xi/n is the unconstrained maximizer of L(p1, . . . , pk, −n),
this implies that it must also be the constrained maximizer of L(p1, . . . , pk, −n), so it is the constrained maximizer
of l(p1, . . . , pk). So the MLE is given by p̂i = Xi/n for i = 1, . . . , k.
(i) Y1² + Y2² − Y1Y2 (ii) Y1² + Y2² + 2Y1Y2 (iii) Y1² + Y2² (iv) Y1² + Y2² + Y1Y2.
Solution. Given X1, X2, X3 ∼ IID N(µ, σ²) and Y1 = X2 − X1 and Y2 = X3 − X2. Now,
Cov(Y1, Y2) = Cov(X2 − X1, X3 − X2) = Cov(X2, X3) − Cov(X2, X2) − Cov(X1, X3) + Cov(X1, X2) = −σ².
Corr(Y1, Y2) = Cov(Y1, Y2)/√(V(Y1)V(Y2)) = −σ²/(√(2σ²) · √(2σ²)) = −1/2.
Since Y1, Y2 are linear combinations of Normal random variables, their joint distribution is also Normal:
(Y1, Y2) ∼ N2(0, 0, 2σ², 2σ², −1/2).
The joint density is
f_{Y1,Y2}(y1, y2) = (1/(2√3 πσ²)) e^{−(y1² + y2² + y1y2)/(3σ²)} = g_{σ²}(T(y1, y2)) h(y1, y2),
where g_{σ²}(T(y1, y2)) = (1/(2√3 πσ²)) e^{−(y1² + y2² + y1y2)/(3σ²)}, h(y1, y2) = 1, and T(y1, y2) = y1² + y2² + y1y2.
By the factorization theorem, the sufficient statistic is (iv) Y1² + Y2² + Y1Y2.
(b) Let X1, X2, . . . , Xn be a random sample from a Bernoulli distribution with parameter p, 0 ≤ p ≤ 1. The bias of the
estimator (√n + 2 Σ_{i=1}^n Xi)/(2(n + √n)) for estimating p is equal to
(i) (1/(√n + 1))(p − 1/2) (ii) (1/(n + √n))(1/2 − p) (iii) (1/(√n + 1))(1/2 + p/√n) − p (iv) (1/(√n + 1))(1/2 − p).
Solution. Given X1, X2, . . . , Xn ∼ IID Ber(p). Let
T = (√n + 2 Σ_{i=1}^n Xi)/(2(n + √n)).
Bias(T) = E(T) − p
= E[(√n + 2 Σ_{i=1}^n Xi)/(2(n + √n))] − p = (√n + 2 E(Σ Xi))/(2(n + √n)) − p
= (√n + 2np)/(2(n + √n)) − p, since Σ_{i=1}^n Xi ∼ Bin(n, p),
= (√n + 2np − 2np − 2√n p)/(2(n + √n)) = √n(1 − 2p)/(2√n(1 + √n))
= (1/(√n + 1)) · (1 − 2p)/2 = (1/(√n + 1))(1/2 − p),
which is option (iv).
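The algebra above can be sanity-checked numerically: for any n and p, E(T) = (√n + 2np)/(2(n + √n)) should differ from p by exactly (1/(√n + 1))(1/2 − p). The sketch below checks this over a few arbitrary (n, p) pairs.

```python
import math

for n in (4, 25, 100):
    for p in (0.1, 0.5, 0.9):
        rt = math.sqrt(n)
        expected_T = (rt + 2 * n * p) / (2 * (n + rt))  # E(T), using E(sum Xi) = np
        bias = expected_T - p
        closed_form = (1 / (rt + 1)) * (0.5 - p)        # option (iv)
        assert abs(bias - closed_form) < 1e-12
```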
(c) Let X1, X2, . . . , Xn be a random sample from an exponential distribution with the probability density function
f(x; θ) = (1/θ)e^{−x/θ} if x > 0, and 0 otherwise,
where θ > 0. Derive the Cramér–Rao lower bound for the variance of any unbiased estimator of θ. Hence, prove
that T = (1/n) Σ_{i=1}^n Xi is the uniformly minimum variance unbiased estimator of θ.
Solution. X1, . . . , Xn ∼ IID Exp(mean θ), so
f(x | θ) = (1/θ) e^{−x/θ}, x > 0.
Let τ(θ) = θ, which is to be estimated. Now, if T(x) is any unbiased estimator of τ(θ), then by the Cramér–Rao
Inequality,
V(T(x)) ⩾ (τ′(θ))² / (n E[((∂/∂θ) ln f(x | θ))²]),
provided the distribution of X satisfies the regularity conditions. Since X ∼ Exp(mean θ), X belongs to the
one-parameter exponential family and hence satisfies the regularity conditions.
ln f(x | θ) = −ln θ − x/θ
(∂/∂θ) ln f(x | θ) = −1/θ + x/θ²
E[((∂/∂θ) ln f(x | θ))²] = E[(x/θ² − 1/θ)²] = E[(x² + θ² − 2θx)/θ⁴] = (2θ² + θ² − 2θ²)/θ⁴ = 1/θ².
Since τ(θ) = θ, τ′(θ) = 1, so
V(T(x)) ⩾ 1/(n · (1/θ²)) = θ²/n.
Thus, the Cramér–Rao Lower Bound = θ²/n.
Now for T = (1/n) Σ_{i=1}^n Xi,
E(T) = (1/n) Σ E(Xi) = θ and V(T) = (1/n²) Σ V(Xi) = θ²/n.
∴ T is an unbiased estimator of θ and it attains the CRLB, so T is the UMVUE of θ.
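That X̄ attains the bound θ²/n is easy to see in simulation. The sketch below (illustrative constants; exponential draws via −θ ln(1 − U), an inverse-transform assumption of this check) compares the empirical variance of X̄ with the CRLB.

```python
import math
import random

rng = random.Random(3)
theta, n, reps = 2.0, 50, 4000
means = []
for _ in range(reps):
    sample = [-theta * math.log(1.0 - rng.random()) for _ in range(n)]  # Exp(mean theta)
    means.append(sum(sample) / n)

# Empirical variance of the sample mean, centered at the true theta
var_hat = sum((m - theta) ** 2 for m in means) / reps
crlb = theta ** 2 / n  # the Cramer-Rao lower bound derived above
```

Up to Monte Carlo noise, var_hat should sit right at crlb, with no slack.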
(d) Let X1, X2, . . . , Xn be a random sample from a N(µ, σ²) distribution, where both µ and σ² are unknown. Find
the value of b that minimizes the mean squared error of the estimator Tb = (b/(n − 1)) Σ_{i=1}^n (Xi − X̄)² for
estimating σ², where X̄ = (1/n) Σ_{i=1}^n Xi.
Solution. Given
Tb = (b/(n − 1)) Σ_{i=1}^n (Xi − X̄)² = bS² (where S² is the sample variance).
MSE(Tb) = Var(Tb) + (E(Tb) − σ²)² = Var(bS²) + (E(bS²) − σ²)²
= b² Var(S²) + σ⁴(b − 1)² = b² · 2σ⁴/(n − 1) + σ⁴(b − 1)²
= b² · 2σ⁴/(n − 1) + σ⁴(b² + 1 − 2b) = ((n + 1)/(n − 1)) σ⁴ b² − 2σ⁴ b + σ⁴,
which is a quadratic polynomial in b. Let p(b) = ((n + 1)/(n − 1)) σ⁴ b² − 2σ⁴ b + σ⁴. Then
p′(b) = ((n + 1)/(n − 1)) σ⁴ · 2b − 2σ⁴ = 0 ⇒ b = (n − 1)/(n + 1),
p′′(b) = 2σ⁴ (n + 1)/(n − 1) > 0 ∀ b ∈ R.
∴ p(b) is minimized at b = (n − 1)/(n + 1), and for b = (n − 1)/(n + 1) the MSE(Tb) is minimum for estimating σ².
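The minimizer b = (n − 1)/(n + 1) can be confirmed by evaluating the closed-form MSE on a grid (σ⁴ factors out of the quadratic, so it is set to 1 in this sketch; n = 10 is an arbitrary illustration):

```python
n = 10

def mse(b):
    # MSE(T_b) / sigma^4 = b^2 * 2/(n-1) + (b-1)^2
    return b * b * 2.0 / (n - 1) + (b - 1.0) ** 2

grid = [i / 1000.0 for i in range(1, 2001)]  # b in (0, 2]
best_b = min(grid, key=mse)
b_star = (n - 1) / (n + 1)  # analytic minimizer, 9/11 here
```

The grid search lands on the analytic value up to the grid resolution.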
(e) Let X1, X2, . . . , Xn be a random sample from a continuous distribution with the probability density function
f(x; λ) = (2x/λ) e^{−x²/λ} if x > 0, and 0 otherwise,
where λ > 0. Find the maximum likelihood estimator of λ and show that it is sufficient and an unbiased estimator
of λ.
Solution. Given X1, X2, . . . , Xn is a random sample from
f(x; λ) = (2x/λ) e^{−x²/λ}, x > 0.
The likelihood function is
L(λ | x) = Π_{i=1}^n (2xi/λ) e^{−xi²/λ} = (2ⁿ/λⁿ) (Π_{i=1}^n xi) e^{−Σ xi²/λ}, xi > 0 ∀ i = 1, . . . , n,
ln L(λ | x) = −(Σ xi²)/λ − n ln λ + terms independent of λ,
(∂/∂λ) ln L(λ | x) = (Σ xi²)/λ² − n/λ = 0
⇒ (Σ xi² − nλ)/λ² = 0
⇒ λ = (1/n) Σ_{i=1}^n xi².
∴ λ̂ = (1/n) Σ_{i=1}^n Xi² is the MLE of λ. Now,
E[X^k] = ∫_0^∞ x^k · (2x/λ) e^{−x²/λ} dx; substituting x²/λ = t, so that x = √(λt) and dx = (√λ/(2√t)) dt,
= ∫_0^∞ (√(λt))^k e^{−t} dt = λ^{k/2} ∫_0^∞ t^{k/2+1−1} e^{−t} dt = λ^{k/2} Γ(k/2 + 1).
∴ E(X) = √λ Γ(1/2 + 1) = √λ · (1/2) Γ(1/2) = √(πλ)/2,
E(X²) = λ Γ(2) = λ.
∴ E(λ̂) = (1/n) Σ_{i=1}^n E(Xi²) = λ,
so λ̂ is an unbiased estimator of λ. For sufficiency, the joint density is
f(x | λ) = Π_{i=1}^n (2xi/λ) e^{−xi²/λ} = (2ⁿ Π xi/λⁿ) e^{−Σ_{i=1}^n xi²/λ}
= ((1/λⁿ) e^{−Σ xi²/λ}) · (2ⁿ Π_{i=1}^n xi) = g_λ(Σ_{i=1}^n xi²) h(x).
∴ By the Neyman–Fisher Factorization Theorem (NFFT), Σ_{i=1}^n Xi² is a sufficient statistic for λ.
∴ λ̂ = (1/n) Σ Xi², being a one-one function of Σ Xi², is also a sufficient statistic for λ.
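Since X² ∼ Exp(mean λ) under this density, λ̂ = (1/n) Σ Xi² is a mean of exponentials, and its unbiasedness is easy to check by simulation (illustrative constants; X sampled as √(−λ ln(1 − U)), an inverse-transform assumption of this sketch):

```python
import math
import random

rng = random.Random(4)
lam, n, reps = 3.0, 40, 5000
lam_hats = []
for _ in range(reps):
    xs = [math.sqrt(-lam * math.log(1.0 - rng.random())) for _ in range(n)]
    lam_hats.append(sum(x * x for x in xs) / n)  # MLE = mean of X_i^2

mean_lam_hat = sum(lam_hats) / reps  # should be close to lam
```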
Solution. Write the family as {fθ(x) : θ ∈ Ω = {0, 1}}. [Here, the parameter θ is called the labeling parameter.]
Define I(θ) = 0 if θ = 0, and I(θ) = 1 if θ = 1. The pdf of X is
fθ(x) = {f0(x)}^{1−I(θ)} {f1(x)}^{I(θ)} = (f1(x)/f0(x))^{I(θ)} · f0(x)
= ((1/(π(1 + x²))) / ((1/√(2π)) e^{−x²/2}))^{I(θ)} · (1/√(2π)) e^{−x²/2}
= g(T(x); θ) h(x), where h(x) = (1/√(2π)) e^{−x²/2} and T(x) = x² (or |x|).
Hence, X² or |X| is sufficient for the family of distributions.
(b) Let X1, X2, . . . , Xn be a random sample from the following pdf. Find the non-trivial sufficient statistic in each case.
[Hint: If the parameter of the distribution appears in the range of Xi, then we have to use indicator functions
together with X(1) or X(n), i.e. mini Xi or maxi Xi.]
(i) f(x; θ) = θx^{θ−1} if 0 < x < 1, and 0 otherwise.
Solution. The joint density is Π_{i=1}^n θxi^{θ−1} = θⁿ (Π_{i=1}^n xi)^{θ−1}, so by the Neyman–Fisher Factorization
criterion, T(X) = Π_{i=1}^n Xi is sufficient for θ.
(ii) f(x; µ) = (1/(|µ|√(2π))) e^{−(x−µ)²/(2µ²)}, x ∈ R.
Solution. We know if X ∼ N(µ, σ²):
f(x; µ, σ) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}.
So, in the given problem X ∼ N(µ, µ²), where µ ≠ 0. Hence, T(X) = (Σ_{i=1}^n Xi, Σ_{i=1}^n Xi²) is sufficient for µ.
∼
(iii) f(x; α, β) = x^{α−1}(1 − x)^{β−1}/B(α, β) if 0 < x < 1 (with α, β > 0), and 0 otherwise.
(iv) f(x; µ, λ) = (1/λ) e^{−(x−µ)/λ} if x > µ, and 0 otherwise.
(v) f(x; µ, σ) = (1/(xσ√(2π))) e^{−(ln x−µ)²/(2σ²)} if x > 0, and 0 otherwise.
(vi) f(x; α, θ) = θα^θ/x^{θ+1} if x > α, and 0 otherwise.
(vii) f(x; θ) = 2(θ − x)/θ² if 0 < x < θ, and 0 otherwise.
Solution. The joint PDF of X is
f(x) = (2ⁿ/θ^{2n}) Π_{i=1}^n (θ − xi), 0 < xi < θ
= (2ⁿ/θ^{2n}) · (θ − x1)(θ − x2) . . . (θ − xn), 0 < xi < θ.
This cannot be expressed in the form required by the factorization criterion. So (X1, X2, . . . , Xn), or equivalently
(X(1), X(2), . . . , X(n)), is only trivially sufficient for θ. Hence there is no non-trivial sufficient statistic.
(c) If fθ(x) = 1/2 for θ − 1 < x < θ + 1, i.e. Xi ∼ U(θ − 1, θ + 1), then show that X(1) and X(n) are jointly sufficient
for θ.
Solution. The joint density is
f(x) = (1/2)ⁿ
= (1/2ⁿ) · I(θ − 1, x(1)) I(x(n), θ + 1), θ − 1 < x(1) < x(n) < θ + 1, where I(a, b) = 1 if a < b and 0 if a ⩾ b,
= g(T(x); θ) h(x), where h(x) = 1/2ⁿ and T(x) = (x(1), x(n)).
Hence T(X) = (X(1), X(n)) is jointly sufficient for θ.
(d) Consider a random sample of size n ⩾ 2 drawn from a Cauchy distribution with PDF
fθ(x) = 1/(π[1 + (x − θ)²]),
where −∞ < θ < ∞. Can you find a single sufficient statistic for θ?
Solution. The joint density is
Π_{i=1}^n f(xi; θ) = 1/(πⁿ Π_{i=1}^n [1 + (xi − θ)²]).
Note that
Π_{i=1}^n {1 + (xi − θ)²} = {1 + (x1 − θ)²}{1 + (x2 − θ)²} . . . {1 + (xn − θ)²}
= 1 + (terms involving one xi) + (terms involving two of the xi's) + . . . + (the term involving all the xi's)
= 1 + Σ_i (xi − θ)² + Σ_{i≠j} (xi − θ)²(xj − θ)² + . . . + Π_{i=1}^n (xi − θ)².
Clearly, Π_{i=1}^n f(xi; θ) cannot be written as g(T(x), θ) · h(x) for a statistic other than the trivial choices
(X1, . . . , Xn) or (X(1), . . . , X(n)). Hence there is no non-trivial sufficient statistic, and no reduction of the sample
space is possible; the whole set (X1, . . . , Xn) is jointly sufficient for θ.
Problem 10. (Uniformly Minimum Variance Unbiased Estimator)
(a) Let X1, X2, . . . , Xn be a random sample from f(x; p) = p(1 − p)ˣ for x = 0, 1, . . ., and 0 otherwise. Show that the
unbiased estimator of p based on T = Σ_{i=1}^n Xi is unique. Hence or otherwise, find the UMVUE of p.
Solution. T = Σ_{i=1}^n Xi ∼ NB(n, p), with P[T = t] = C(t+n−1, n−1) pⁿ qᵗ for t = 0, 1, . . ., where q = 1 − p.
To solve for h(T) such that
E[h(T)] = p ∀ p ∈ (0, 1):
Σ_{t=0}^∞ h(t) C(t+n−1, n−1) pⁿ qᵗ = p ∀ p
⇒ Σ_{t=0}^∞ h(t) C(t+n−1, n−1) qᵗ = p^{−(n−1)} = (1 − q)^{−(n−1)}
⇒ Σ_{t=0}^∞ h(t) C(t+n−1, n−1) qᵗ = Σ_{t=0}^∞ C(n−1+t−1, t) qᵗ, as 0 < q < 1.
Since two power series agreeing on 0 < q < 1 must have equal coefficients, h(t) is uniquely determined:
h(t) = C(n+t−2, t)/C(t+n−1, n−1) = (n − 1)/(n + t − 1).
Hence the UMVUE of p is h(T) = (n − 1)/(n + T − 1).
(b) Let X1 and X2 be two independent random variables having the same mean θ. Suppose that E(X1 − θ)² = 1 and
E(X2 − θ)² = 2. For estimating θ, consider the estimators Tα(X1, X2) = αX1 + (1 − α)X2, α ∈ [0, 1]. The value of
α ∈ [0, 1] for which the variance of Tα(X1, X2) is minimum equals
(i) 2/3 (ii) 1/2 (iii) 1/4 (iv) 3/4
(c) Let X1, X2, . . . , Xn be a random sample from
f(x; θ) = e^{−(x−θ)} if x > θ, and 0 otherwise.
Show that T = X(1) is a complete sufficient statistic. Hence find the UMVUE of θ.
(d) Is the following family of distributions regular in the sense of Cramér and Rao? If so, find the lower bound for the
variance of an unbiased estimator of θ based on a sample of size n. Also, find the UMVUE of θ for the PDF:
f(x, θ) = e^{−x²/(2θ)}/√(2πθ), −∞ < x < ∞, θ > 0.
Solution. As we know, equality holds in the CR inequality whenever the family of distributions is a one-parameter
exponential family (OPEF). The given pdf is OPEF, satisfying the regularity conditions for the CR inequality; that
is, it is regular in the sense of Cramér–Rao. By the CR inequality, for an unbiased estimator T of θ,
Var(T) ⩾ 1/In(θ) = CRLB.
Here, f(x, θ) = (1/√(2πθ)) e^{−x²/(2θ)}, x ∈ R, θ > 0 ⇒ ln f(x, θ) = −(1/2) ln(2πθ) − x²/(2θ).
Differentiating w.r.t. θ we get
(∂/∂θ) ln f(x1, θ) = −1/(2θ) + x1²/(2θ²) and (∂²/∂θ²) ln f(x1, θ) = 1/(2θ²) − x1²/θ³.
In(θ) = n · I1(θ) = n · E[−(∂²/∂θ²) ln f(x1, θ)] = n(−1/(2θ²) + E[X1²]/θ³) = n(−1/(2θ²) + θ/θ³) = n/(2θ²).
Hence, Var(T) ⩾ 2θ²/n = CRLB. The MVBUE, if it exists for θ, is given by
T = ψ(θ) ± (ψ′(θ)/In(θ)) · (∂/∂θ) ln L(x, θ) = θ ± (2θ²/n) Σ_{i=1}^n (∂/∂θ) ln f(xi, θ)
= θ + (2θ²/n)(−n/(2θ) + Σ xi²/(2θ²)), taking the +ve sign only,
= (1/n) Σ_{i=1}^n xi².
Hence, T = (1/n) Σ_{i=1}^n Xi² attains the CRLB, and (1/n) Σ_{i=1}^n Xi² is the UMVUE of θ.
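A simulation check (illustrative constants, not part of the original solution) that T = (1/n) Σ Xi² is unbiased for θ with variance near the bound 2θ²/n:

```python
import math
import random

rng = random.Random(5)
theta, n, reps = 1.5, 30, 5000
ts = []
for _ in range(reps):
    xs = [rng.gauss(0.0, math.sqrt(theta)) for _ in range(n)]  # N(0, theta)
    ts.append(sum(x * x for x in xs) / n)  # T = mean of squares

mean_t = sum(ts) / reps                              # should be near theta
var_t = sum((t - theta) ** 2 for t in ts) / reps     # should be near CRLB
crlb = 2 * theta ** 2 / n
```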
(e) Based on a random sample X1, X2, . . . , Xn from Gamma(α), obtain an estimator of ψ(α) = (∂/∂α) ln Γ(α) which
attains the CRLB, and find its variance.
Solution. For the joint density f(x, α) of the sample,
(∂/∂α) ln f(x, α) = −n (∂/∂α) ln Γ(α) + Σ_{i=1}^n ln xi
and (∂²/∂α²) ln f(x, α) = −n (∂²/∂α²) ln Γ(α).
Here, I(α) = E[−(∂²/∂α²) ln f(x, α)] = n (∂²/∂α²) ln Γ(α).
An unbiased estimator which attains the CRLB, if it exists, is given by
T = ψ(α) ± (ψ′(α)/I(α)) · (∂/∂α) ln f(x, α)
= (∂/∂α) ln Γ(α) ± ((∂²/∂α²) ln Γ(α) / (n (∂²/∂α²) ln Γ(α))) (−n (∂/∂α) ln Γ(α) + Σ ln xi)
= (∂/∂α) ln Γ(α) ± (−(∂/∂α) ln Γ(α) + (1/n) Σ_i ln xi) = (1/n) Σ_{i=1}^n ln xi, taking the positive sign only,
= ln G, where G = (Π_{i=1}^n Xi)^{1/n} is the GM of X1, X2, . . . , Xn.
Clearly, Var(T) = CRLB = {ψ′(α)}²/I(α) = ((∂²/∂α²) ln Γ(α))² / (n (∂²/∂α²) ln Γ(α)) = (∂²/∂α²) ln Γ(α)/n.
Problem 11. (Finding Confidence Intervals)
(a) Let X1 , . . . , Xn be a random sample from U(0, θ), θ > 0. Find a confidence interval for θ with confidence coefficient
(1 − α), based on X(n) .
(b) Consider a random sample of size n from the rectangular distribution
f(x, θ) = 1/θ if 0 < x < θ, and 0 otherwise.
Let Y be the sample range and let ξ be given by ξ^{n−1}[n − (n − 1)ξ] = α. Show that Y and Y ξ⁻¹ are confidence
limits for θ with confidence coefficient (1 − α).
(c) Consider a random sample of size n from an exponential distribution, with PDF
fX(x) = exp[−(x − θ)] if θ < x < ∞, and 0 otherwise.
Suggest a 100(1 − α)% confidence interval for θ.
Solution.
(a) The pdf of X(n) is
fX(n)(x) = nx^{n−1}/θⁿ if 0 < x < θ, and 0 otherwise.
The pdf of T = ψ(X(n), θ) = X(n)/θ is
g(t) = nt^{n−1} if 0 < t < 1, and 0 otherwise,
which is independent of θ. Now, set P[c < ψ(X(n), θ) < 1] = 1 − α:
∫_c^1 nt^{n−1} dt = 1 − α, where c is a constant to be determined,
⇒ 1 − cⁿ = 1 − α, i.e. c = α^{1/n}.
Note that α^{1/n} < ψ(X(n), θ) = X(n)/θ < 1
⇒ α^{−1/n} > θ/X(n) > 1,
i.e., X(n) < θ < α^{−1/n} X(n).
Hence, (X(n), α^{−1/n} X(n)) is a confidence interval for θ with confidence coefficient (1 − α).
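The interval (X(n), α^{−1/n} X(n)) can be checked for coverage by simulation (illustrative constants θ = 5, n = 20, α = 0.10; not part of the original solution):

```python
import random

rng = random.Random(6)
theta, n, alpha, reps = 5.0, 20, 0.10, 5000
cover = 0
for _ in range(reps):
    mx = max(rng.uniform(0.0, theta) for _ in range(n))
    lower, upper = mx, mx * alpha ** (-1.0 / n)
    cover += (lower < theta < upper)

coverage = cover / reps  # should be close to 1 - alpha = 0.90
```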
(b) The sample range is Y = X(n) − X(1), and U = Y/θ has density n(n − 1)u^{n−2}(1 − u) for 0 < u < 1,
which is independent of θ. Now, P[ξ ⩽ U ⩽ 1] = 1 − α
⇒ ∫_ξ^1 n(n − 1)u^{n−2}(1 − u) du = 1 − α
⇒ n(n − 1) ∫_ξ^1 (u^{n−2} − u^{n−1}) du = 1 − α
⇒ ξ^{n−1}[n − (n − 1)ξ] = α.
Note that {ξ ⩽ U ⩽ 1} = {ξ ⩽ Y/θ ⩽ 1} = {Y ⩽ θ ⩽ Y/ξ}.
Hence, (Y, Y ξ⁻¹) is a random confidence interval for θ with confidence coefficient 1 − α, where ξ is such that
ξ^{n−1}[n − (n − 1)ξ] = α.
Consider observed response variables Y1, . . . , Yn ∈ R that depend linearly on a single covariate x1, . . . , xn as follows:
Yi = βxi + εi,
where the errors εi ∼ N(0, σi²) are independent.
(a) Suppose that the error variances σ12 , . . . , σn2 are all known. Show that the MLE β̂ for β, in this case, minimizes a
certain weighted least-squares criterion, and derive an explicit formula for β̂.
(b) Show that the estimate β̂ in part (a) is unbiased, and derive a formula for the variance of β̂ in terms of σ12 , . . . , σn2
and x1 , . . . , xn .
(c) Compute the Fisher information IY (β) = −Eβ [l′′ (β)] in this model (still assuming σ12 , . . . , σn2 are known constants).
Show that the variance of β̂ that you derived in part (b) is exactly equal to IY (β)−1 .
In the remaining parts of this question, denote by β̃ the usual (unweighted) least-squares estimator for β, which
minimizes Σ_i (Yi − βxi)². In practice, we might not know the values of σ1², . . . , σn², so we might still estimate β
using β̃.
(d) Derive an explicit formula for β̃, and show that it is also an unbiased estimate of β.
(e) Derive a formula for the variance of β̃ in terms of σ1², . . . , σn² and x1, . . . , xn. Show that when all error terms have
the same variance σ0², this coincides with the general formula σ0²(XᵀX)⁻¹ for the linear model.
(f) Using the Cauchy–Schwarz inequality (Σ_i ai²)(Σ_i bi²) ≥ (Σ_i aibi)² for any positive numbers a1, . . . , an and
b1, . . . , bn, compare your variance formulas from parts (b) and (e) and show directly that the variance of β̃ is always
at least the variance of β̂. Explain, using the Cramér–Rao lower bound, why this is to be expected given your
finding in (c).
Solution. (a) The log-likelihood is
ℓ(β) = −(n/2) log 2π − Σ_{i=1}^n log σi − Σ_{i=1}^n (1/(2σi²))(Yi − βxi)²,
so maximizing ℓ(β) over β is equivalent to minimizing the weighted least-squares criterion
Σ_{i=1}^n (1/σi²)(Yi − βxi)².
Taking derivatives and solving for zero gives the explicit solution
β̂ = (Σ_{i=1}^n xiYi/σi²) / (Σ_{i=1}^n xi²/σi²).
(b) Taking expectations,
E[β̂] = (1/(Σ_{i=1}^n xi²/σi²)) Σ_{i=1}^n (xi/σi²) E[Yi] = (1/(Σ_{i=1}^n xi²/σi²)) Σ_{i=1}^n (xi/σi²) βxi = β,
so it is unbiased.
The variance is
V[β̂] = (Σ_{i=1}^n xi² V[Yi]/σi⁴) / (Σ_{i=1}^n xi²/σi²)² = (Σ_{i=1}^n xi²σi²/σi⁴) / (Σ_{i=1}^n xi²/σi²)²
= 1/(Σ_{i=1}^n xi²/σi²), using V[Yi] = σi².
(c) Since ℓ″(β) = −Σ_{i=1}^n xi²/σi² is non-random, IY(β) = −Eβ[ℓ″(β)] = Σ_{i=1}^n xi²/σi², and the variance derived
in part (b) is exactly IY(β)⁻¹.
(d) The unweighted least-squares estimator is
β̃ = (Σ_{i=1}^n xiYi)/(Σ_{i=1}^n xi²),
and E[β̃] = (Σ_{i=1}^n xi E[Yi])/(Σ_{i=1}^n xi²) = β(Σ xi²)/(Σ xi²) = β, so β̃ is also unbiased.
(e) Its variance is
V[β̃] = (Σ_{i=1}^n xi² V[Yi])/(Σ_{i=1}^n xi²)² = (Σ_{i=1}^n xi²σi²)/(Σ_{i=1}^n xi²)².
When all σi² = σ0², this reduces to σ0²/(Σ_{i=1}^n xi²) = σ0²(XᵀX)⁻¹, matching the general linear-model formula.
(f) Applying the Cauchy–Schwarz inequality with ai = |xiσi| and bi = |xi/σi|, we obtain the inequality
(Σ_{i=1}^n xi²σi²)(Σ_{i=1}^n xi²/σi²) ≥ (Σ_{i=1}^n xi²)²,
and dividing both sides by (Σ_{i=1}^n xi²)²(Σ_{i=1}^n xi²/σi²) gives
(Σ_{i=1}^n xi²σi²)/(Σ_{i=1}^n xi²)² ≥ 1/(Σ_{i=1}^n xi²/σi²),
i.e. V[β̃] ≥ V[β̂]. This is to be expected: by part (c), V[β̂] equals the Cramér–Rao lower bound IY(β)⁻¹, so no
unbiased estimator of β, and in particular β̃, can have smaller variance.
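The two variance formulas can be compared numerically for arbitrary illustrative xi and σi² (any heteroskedastic choice works; these values are assumptions of this sketch):

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
sig2 = [0.5, 1.0, 4.0, 0.25, 2.0]  # heteroskedastic error variances

# Variance of the weighted LS estimator: 1 / sum(x_i^2 / sigma_i^2)
var_wls = 1.0 / sum(x * x / s for x, s in zip(xs, sig2))
# Variance of the ordinary LS estimator: sum(x_i^2 sigma_i^2) / (sum x_i^2)^2
var_ols = sum(x * x * s for x, s in zip(xs, sig2)) / sum(x * x for x in xs) ** 2
# Cauchy-Schwarz guarantees var_ols >= var_wls, with equality iff all sigma_i equal.
```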
Solution. (a) We may estimate p by p̂ = X̄, and q by q̂ = Ȳ. The plug-in estimator for the log-odds-ratio is
log((p̂/(1 − p̂)) / (q̂/(1 − q̂))).
(b) Let
g(p, q) = log((p/(1 − p)) / (q/(1 − q))) = log p − log(1 − p) − log q + log(1 − q).
Applying a first-order Taylor expansion to g,
g(p̂, q̂) ≈ g(p, q) + (p̂ − p)/(p(1 − p)) − (q̂ − q)/(q(1 − q)).
p̂ and q̂ are independent, and by the Central Limit Theorem, √n(p̂ − p) → N(0, p(1 − p)) and √m(q̂ − q) → N(0, q(1 − q)).
Hence, for large m and n, g(p̂, q̂) is approximately distributed as N(g(p, q), v) where
v = (p(1 − p)/n) · (1/(p(1 − p)))² + (q(1 − q)/m) · (1/(q(1 − q)))² = 1/(np(1 − p)) + 1/(mq(1 − q)).
(c) Let v̂ = 1/(np̂(1 − p̂)) + 1/(mq̂(1 − q̂)) be the plug-in estimate of v. As m, n → ∞, v̂/v → 1 in probability, so by
part (b) and Slutsky’s lemma,
P[−Z0.025 ≤ (g(p̂, q̂) − g(p, q))/√v̂ ≤ Z0.025] ≈ 0.95
for large m and n. Rearranging yields a 95% confidence interval for g(p, q) given by
g(p̂, q̂) ± Z0.025 √v̂ = log((p̂/(1 − p̂)) / (q̂/(1 − q̂))) ± Z0.025 √(1/(np̂(1 − p̂)) + 1/(mq̂(1 − q̂))).
Denoting this interval by [L(p̂, q̂), U(p̂, q̂)], we may exponentiate to obtain the confidence interval
[e^{L(p̂,q̂)}, e^{U(p̂,q̂)}] for the odds-ratio (p/(1 − p))/(q/(1 − q)).
The double-exponential distribution with mean µ and scale b is a continuous distribution over R with PDF
f(x | µ, b) = (1/(2b)) exp(−|x − µ|/b).
It is sometimes used as an alternative to the normal distribution to model data with heavier tails, as this PDF decays
exponentially in |x − µ| rather than in (x − µ)².
The log-likelihood is
ℓ(µ, b) = −n log(2b) − (1/b) Σ_{i=1}^n |Xi − µ|,
and
∂ℓ/∂b = −n/b + (1/b²) Σ_{i=1}^n |Xi − µ|.
Setting this equal to 0, substituting in the MLE µ̂ for µ, and solving gives the MLE for b as
b̂ = (1/n) Σ_{i=1}^n |Xi − µ̂|.
We can see that the MLE µ̂ is the value of µ that minimizes the total absolute deviation K(µ) = Σ_{i=1}^n |Xi − µ|.
Without loss of generality assume that X1, . . . , Xn are ordered, and for simplicity that n is odd. We shall see that
the minimizer is the sample median µ̂ = Xm, where m = (n + 1)/2. K(µ) is continuous everywhere (it is a sum of
absolute value functions), and furthermore, it is decreasing for µ < Xm and increasing for µ > Xm.
Therefore the minimizer is given by µ̂ = Xm. This estimator is more robust to outliers than the sample mean because
it depends only on the middle ordered values, so a few data points with extreme values won't change the median,
whereas the mean depends on all data points.
(b) Suppose it is known that µ = 0. In a Bayesian analysis, let us model the scale parameter as a random variable B
with prior distribution B ∼ InverseGamma(α, β), where α, β > 0. If X1 , . . . , Xn ∼ Laplace (0, b) when B = b, what
are the posterior distribution and posterior mean of B given the data X1 , . . . , Xn ?
(Hint: The InverseGamma(α, β) distribution is a continuous distribution on (0, ∞) with PDF
f(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
and with mean β/(α − 1) when α > 1.)
Solution. The posterior density satisfies
f(B | α, β, X1, . . . , Xn) ∝ f(X1, . . . , Xn | B) f(B | α, β)
= (1/(2B)ⁿ) exp(−(1/B) Σ_{i=1}^n |Xi|) · (β^α/Γ(α)) B^{−α−1} e^{−β/B}
∝ B^{−(α+n)−1} exp(−(1/B)(β + Σ_{i=1}^n |Xi|)),
which is the kernel of the InverseGamma(α + n, β + Σ_{i=1}^n |Xi|) distribution. Hence the posterior mean is
(β + Σ_{i=1}^n |Xi|)/(α + n − 1).
(c) Still supposing it is known that µ = 0, what is the MLE b̂ for b in this sub-model? How does this compare to the
posterior mean from part (b) when n is large?
Solution. By the same computation as in part (a) with µ̂ replaced by 0, the MLE is b̂ = (1/n) Σ_{i=1}^n |Xi|. The
posterior mean (β + Σ_{i=1}^n |Xi|)/(α + n − 1) is a weighted combination of the prior mean and the MLE, from
which we see that the posterior mean tends to the MLE as n → ∞.
Problem 15. (Bayesian inference for Multinomial Proportions)
The Dirichlet (α1 , . . . , αK ) distribution with parameters α1 , . . . , αK > 0 is a continuous joint distribution over K random
variables (P1 , . . . , PK ) such that 0 ≤ Pi ≤ 1 for all i = 1, . . . , K and P1 + . . . + PK = 1. It has (joint) PDF
f(p1, . . . , pK | α1, . . . , αK) ∝ p1^{α1−1} × . . . × pK^{αK−1}.
Letting α0 = α1 + . . . + αK, this distribution satisfies
E[Pi] = αi/α0, Var[Pi] = αi(α0 − αi)/(α0²(α0 + 1)).
(a) Let (X1 , . . . , X6 ) ∼ Multinomial (n, (p1 , . . . , p6 )) be the numbers of 1’s through 6’s obtained in n rolls of a (possibly
biased) die. Let us model (P1 , . . . , P6 ) as random variables with prior distribution Dirichlet (α1 , . . . , α6 ). What
is the posterior distribution of (P1 , . . . , P6 ) given the observations (X1 , . . . , X6 )? What is the posterior mean and
variance of P1 ?
(b) How might you choose the prior parameters α1 , . . . , α6 to represent a strong prior belief that the die is close to fair
(meaning p1 , . . . , p6 are all close to 1/6)?
(c) How might you choose an improper Dirichlet prior to represent no prior information? How do the posterior mean
estimates of p1 , . . . , p6 under this improper prior compare to the MLE?
Solution.
(a) The posterior distribution has density proportional to
p1^{α1−1} × · · · × p6^{α6−1} × p1^{X1} × · · · × p6^{X6} = p1^{α1+X1−1} × · · · × p6^{α6+X6−1}.
So the posterior distribution of (P1, . . . , P6) given (X1, . . . , X6) is Dirichlet(α1 + X1, . . . , α6 + X6). Since
X1 + . . . + X6 = n, the posterior mean and variance are given by
E[Pi | X1, . . . , X6] = (αi + Xi)/(α0 + n); V[Pi | X1, . . . , X6] = (αi + Xi)(α0 + n − αi − Xi)/((α0 + n)²(α0 + n + 1)).
(c) From the posterior mean calculated in part (a), we can interpret the parameters αi as “prior counts,” so an unin-
formative prior sets αi = 0. Then the posterior mean is
E[Pi | X1, . . . , X6] = Xi/(Σ_{j=1}^6 Xj) = Xi/n,
which coincides with the MLE p̂i = Xi/n.
Problem 16. (Basics of Testing)
(a) Let Z be a random variable with probability density function f(z) = (1/2)e^{−|z−µ|}, z ∈ R, with parameter µ ∈ R.
Suppose we observe X = max(0, Z).
i. Find the constant c such that the test that “rejects when X > c” has size 0.05 for the null hypothesis H0 : µ = 0.
ii. Find the power of this test against the alternative hypothesis H1 : µ = 2.
(b) Suppose that X1 , . . . , Xn form a random sample from the uniform distribution on the interval [0, θ], and that the
following hypotheses are to be tested:
H0 : θ ≥ 2,
H1 : θ < 2.
Let Yn = max {X1 , . . . , Xn }, and consider a test procedure such that the critical region contains all the outcomes
for which Yn ≤ 1.5.
(c) Let X be a single observation of an Exp(λ) random variable, which has PDF
fλ(x) = λ exp(−λx) if x ≥ 0, and 0 if x < 0.
i. Find the power function of the hypothesis test that rejects H0 if and only if X ≥ c.
ii. Let 0 < α < 1. Find a value of c such that the test in part (i) has size α.
iii. For what true values of λ is Pλ (type II error) ≥ 1/2 for the test in part (i) with size α as in (ii)?
(d) Let X1, X2 ∼ IID Bin(1, θ), and consider testing H0 : θ = 1/3 versus H1 : θ < 1/3.
i. Find a test that has size 2/9 exactly. Note: It does not have to be a sensible test.
ii. Find the power function of the test from part (i), and use it to explain why this test is not a good test of
these hypotheses.
(e) Suppose X is a random variable on {0, 1, 2, . . .} with unknown PMF p(x). To test the hypothesis H0 : X ∼
Poisson(1/2) against H1 : p(x) = 2^{−(x+1)} for all x ∈ {0, 1, 2, . . .}, we reject H0 if x > 2. The probability of type-II
error for this test is
(A) 13/14; (B) 1 − (13/8)e^{−1/2}; (C) 1 − (3/2)e^{−1/2}; (D) 7/8.
Solution.
(a) i. We require PH0(X > c) = 0.05. Under H0, µ = 0, so the pdf of Z is f(z) = (1/2)e^{−|z|}. As the support of Z
is R, we can partition it into {Z ≥ 0} and {Z < 0} and condition on this partition. For c > 0,
PH0(X > c) = PH0(X > c, Z ≥ 0) + PH0(X > c, Z < 0) = PH0(X > c, Z ≥ 0) = PH0(Z > c).
Thus, PH0(X > c) = PH0(Z > c) = ∫_c^∞ (1/2)e^{−|z|} dz = (1/2)e^{−c}.
Equating (1/2)e^{−c} with 0.05, we get c = ln 10.
ii. The power of the test against H1 is given by (noting ln 10 > 2):
PH1(X > ln 10) = PH1(Z > ln 10) = ∫_{ln 10}^∞ (1/2)e^{−|z−2|} dz = (1/2)e^{−(ln 10 − 2)} = e²/20.
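Both the size 0.05 at c = ln 10 and the power e²/20 ≈ 0.369 can be checked by simulating Z (here built as the difference of two Exp(1) draws, one standard construction of a Laplace variable; a Monte Carlo sketch, not part of the original solution):

```python
import math
import random

def laplace(mu, rng):
    """Laplace(mu, scale 1) as mu + E1 - E2 with independent Exp(1) draws."""
    e1 = -math.log(1.0 - rng.random())
    e2 = -math.log(1.0 - rng.random())
    return mu + e1 - e2

rng = random.Random(7)
c, reps = math.log(10.0), 200000
size = sum(max(0.0, laplace(0.0, rng)) > c for _ in range(reps)) / reps
power = sum(max(0.0, laplace(2.0, rng)) > c for _ in range(reps)) / reps
# Theory: size = 0.05 and power = e^2 / 20, about 0.369.
```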
(c) i. The test rejects H0 : λ = λ0 when X ≥ c, so its power function is
Power(λ) = Pλ(X ≥ c) = exp(−λc).
ii. Thus, the size of the test is exp(−λ0 c). Then the test has size α if and only if
exp(−λ0 c) = α ⟺ c = −(log α)/λ0,
noting that log α is negative since 0 < α < 1.
iii. Pλ(type II error) ≥ 1/2 if and only if both λ < λ0 and Power(λ) ≤ 1/2. The test in part (i) with size α as in
(ii) has power function
Power(λ) = exp(−λ · (−(log α)/λ0)) = α^{λ/λ0},
and hence
Power(λ) ≤ 1/2 ⟺ λ ≥ −(λ0 log 2)/(log α),
again noting that log α is negative. Thus, Pλ(type II error) ≥ 1/2 if and only if
−(λ0 log 2)/(log α) ≤ λ < λ0.
(Note that if α ≥ 1/2, then there are no such values of λ.)
(d) i. Note that there are only four possible values of (X1, X2), i.e., the sample space consists of only four points. If
θ = 1/3, then (X1, X2) equals
(0, 0) with probability 4/9,
(0, 1) with probability 2/9,
(1, 0) with probability 2/9,
(1, 1) with probability 1/9.
Thus, the only tests with size 2/9 exactly are the test that rejects H0 if and only if (X1, X2) = (0, 1) and the test
that rejects H0 if and only if (X1, X2) = (1, 0).
ii. Power(θ) = θ(1 − θ) for both of the tests from part (i). Note that Power(1/3) > Power(θ) for all θ < 1/3. Thus,
these tests are more likely to reject H0 if it is true than if it is false, which is exactly the opposite of what a good
hypothesis test should do.
(e) (D) Given X ∼ p(x); reject H0 if X > 2.
P(Type-II error) = P(accepting H0 when it is false) = 1 − P(reject H0 | H1 is true) = 1 − P_{H1}(reject H0) = 1 − P_{H1}(X > 2).
Now, we compute P_{H1}(X > 2) as follows:
P(X > 2) = 1 − P(X ≤ 2)
= 1 − [P(X = 0) + P(X = 1) + P(X = 2)]
= 1 − (2^{−1} + 2^{−2} + 2^{−3})
= 1 − 7/8
= 1/8.
Hence, P(Type-II error) = 1 − 1/8 = 7/8.
(a) For data X1 , . . . , Xn ∈ R and two fixed and known values σ02 < σ12 , consider the following testing problem:
H0 : X1, . . . , Xn ∼ IID N(0, σ0²)
H1 : X1, . . . , Xn ∼ IID N(0, σ1²)
What is the most powerful test for testing H0 versus H1 at level α? Letting χ²_n(α) denote the 1 − α quantile of the
χ²_n distribution, explicitly describe the test statistic T and the rejection region for this test.
(b) What is the distribution of this test statistic T under the alternative hypothesis H1 ? Using this result and letting
F denote the CDF of the χ2n distribution, provide a formula for the power of this test against H1 in terms of
χ2n (α), σ02 , σ12 , and F . Keeping σ02 fixed, what happens to the power of the test as σ12 increases to ∞ ?
(c) Consider two probability density functions on [0, 1] : f0 (x) = 1 and f1 (x) = 2x. Among all tests of the null hypothesis
H0 : X ∼ f0 (x) versus the alternative H1 : X ∼ f1 (x) with significance level α = 0.10, how large can the power
possibly be?
(d) Let X1 and X2 be a random sample from a distribution having the probability density function f (x). Consider the
testing of H0 : f (x) = f0 (x) against H1 : f (x) = f1 (x) based on X1 and X2 , where
f0(x) = 1 for 0 < x < 1 (and 0 otherwise), and f1(x) = 4x³ for 0 < x < 1 (and 0 otherwise).
(i) For a given level, show that the critical region of the most powerful test for testing H0 against H1 is of the form
{(x1 , x2 ) : ln x1 + ln x2 > c}
(e) Let X̄ be the sample mean of a random sample X1, . . . , Xn from the Exponential(µ) distribution with mean µ, where µ > 0 is an unknown parameter. Consider the problem of testing the hypothesis H0 : µ ≤ µ0 against
H1 : µ > µ0.
Show that the test with critical region X̄ ≥ µ0 χ²_{2n,1−α}/(2n), where χ²_{2n,1−α} is the (1 − α)th quantile of the χ²_{2n}
distribution, has size α. Give an expression for the power in terms of the c.d.f. of the χ²_{2n} distribution.
Solution.
(a) By the Neyman–Pearson lemma, the most powerful level-α test rejects H0 for large values of T = Σ_{i=1}^n Xi², since the likelihood ratio is an increasing function of T when σ0² < σ1². Under H0, T/σ0² ∼ χ²_n, so the level-α rejection region is T > σ0² χ²_n(α).
(b) Under H1, Σ_{i=1}^n Xi²/σ1² ∼ χ²_n, so T ∼ σ1² χ²_n. Then the probability of type II error is
β = P_{H1}[accept H0] = P_{H1}[T ≤ σ0² χ²_n(α)]
= P_{H1}[T/σ1² ≤ (σ0²/σ1²) χ²_n(α)] = F((σ0²/σ1²) χ²_n(α)).
Hence the power against H1 is 1 − β = 1 − F((σ0²/σ1²) χ²_n(α)); keeping σ0² fixed, as σ1² increases to ∞ the argument (σ0²/σ1²) χ²_n(α) decreases to 0, so the power increases to 1.
(d) (i) By the NP lemma, the most powerful test rejects H0 when
f_{H1}(x̃)/f_{H0}(x̃) = Π_{i=1}^2 4xi³ / Π_{i=1}^2 1 = 16(x1 x2)³ > k
⇒ (x1 x2)³ > k1 ⇒ 3 ln(x1 x2) > k2 ⇒ ln x1 + ln x2 > c.
∴ W = {(x1, x2) | ln x1 + ln x2 > c} for some constant c.
(e) The likelihood function of µ for the given sample is
L(µ | X⃗) = (1/µ^n) exp(−Σ_{i=1}^n Xi / µ), µ > 0,
and observe that the sample mean of X⃗ is the MLE of µ. So, the likelihood ratio test rejects for small values of
g(x̄) = n ln x̄ − n x̄/µ0.
Observe that g is a decreasing function of x̄ for x̄ ≥ µ0. Hence, there exists a c such that, for x̄ ≥ c, we have g(x̄) < K*. So, the critical region of the test is of the form X̄ ≥ c, for some c such that
P_{H0}(X̄ ≥ c) = α, where 0 ≤ α ≤ 1 is the size of the test. Now, our task is to find c, and for that observe that
if X ∼ Exponential(θ), then 2X/θ ∼ χ²_2.
Hence, in this problem, since the Xi's follow Exponential(µ), we have 2nX̄/µ ∼ χ²_{2n}, so
P_{H0}(X̄ ≥ c) = α
⟺ P_{H0}(2nX̄/µ0 ≥ 2nc/µ0) = α
⟺ P_{H0}(χ²_{2n} ≥ 2nc/µ0) = α
which gives c = µ0 χ²_{2n;1−α}/(2n). Hence, the rejection region is indeed X̄ ≥ µ0 χ²_{2n;1−α}/(2n). Hence proved!
For µ > µ0, the power of the test is
Power(µ) = P_µ(X̄ ≥ µ0 χ²_{2n,1−α}/(2n)) = P(χ²_{2n} ≥ (µ0/µ) χ²_{2n,1−α}) = 1 − F_{2n}((µ0/µ) χ²_{2n,1−α}),
where F_{2n} is the c.d.f. of the χ²_{2n} distribution.
(a) Let X1 , . . . , X10 be a random sample of size 10 from a population having a probability density function
f(x | θ) = θ/x^{θ+1} if x > 1, and 0 otherwise,
where θ > 0. For testing H0 : θ = 2 against H1 : θ = 4 at the level of significance α = 0.05, find the most powerful
test. Also, find the power of this test.
(b) Let X1 , . . . , Xn be a random sample from the population having probability density function
f(x, θ) = (2x/θ²) e^{−x²/θ²} if x > 0, and 0 otherwise.
Obtain the most powerful test for testing H0 : θ = θ0 against H1 : θ = θ1 (θ1 < θ0 ).
(c) Let X1, X2, . . . , X5 be a random sample from a N(2, σ²) distribution, where σ² is unknown. Derive the most
powerful test of size α = 0.05 for testing H0 : σ² = 4 against H1 : σ² = 1.
(d) Let X1, . . . , Xn ∼ IID N(µ, σ²), where both µ and σ² are unknown. Consider the problem of testing
H0 : µ = 0
H1 : µ ≠ 0
Show that the generalized likelihood ratio test statistic for this problem simplifies to
Λ(X1, . . . , Xn) = ( Σ_{i=1}^n (Xi − X̄)² / (Σ_{i=1}^n (Xi − X̄)² + nX̄²) )^{n/2}.
Letting S²_X = Σ_{i=1}^n (Xi − X̄)²/(n − 1) and T = √n X̄/S_X (the usual one-sample t-statistic for this problem), show
that Λ (X1 , . . . , Xn ) is a monotonically decreasing function of |T |, and hence the generalized likelihood ratio test is
equivalent to the two-sided t-test which rejects for large values of |T |.
Solution.
(a) By the NP Lemma, the most powerful critical region is: reject H0 if f_{H1}(x̃)/f_{H0}(x̃) > k, where k is a constant that depends on the level requirements.
f_{H1}(x̃)/f_{H0}(x̃) = Π_{i=1}^{10} (4/xi⁵) / Π_{i=1}^{10} (2/xi³) = 2^{10} / (Π_{i=1}^{10} xi)² > k
⇒ Π_{i=1}^{10} xi < k′ ⇒ Σ_{i=1}^{10} ln xi < c.
Now, for level α = 0.05, we have P_{H0}(Reject H0) = 0.05 ⇒ P_{H0}(Σ_{i=1}^{10} ln Xi < c) = 0.05.
If Xi has the given pdf, then ln Xi ∼ Exp(mean = 1/θ). Under H0,
ln Xi ∼ Exp(mean = 1/2), ∴ 4 ln Xi ∼ Exp(mean = 2).
∴ P(Σ_{i=1}^{10} 4 ln Xi < 4c) = 0.05 ⇒ P(χ²_{20} < 4c) = 0.05 ⇒ P(χ²_{20} > 4c) = 0.95
∴ 4c = χ²_{20,0.95} ⇒ c = χ²_{20,0.95}/4 = 10.85/4 = 2.7125.
∴ Reject H0 if Σ_{i=1}^{10} ln Xi < 2.7125.
Power of the test: P_{H1}(Reject H0) = P_{H1}(Σ_{i=1}^{10} ln Xi < 2.7125). Under H1,
ln Xi ∼ Exp(mean = 1/4), ∴ 8 ln Xi ∼ Exp(mean = 2).
∴ P_{H1}(Σ_{i=1}^{10} 8 ln Xi < 8 × 2.7125) = P(χ²_{20} < 21.7) = 1 − P(χ²_{20} > 21.7) = 1 − 0.36 = 0.64.
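The constants above can be checked numerically. Since Python's standard library lacks a chi-square CDF, the sketch below uses the closed-form survival function for even degrees of freedom, P(χ²_{2k} > x) = e^{−x/2} Σ_{j<k} (x/2)^j/j! (the bisection bounds and tolerance are arbitrary choices):

```python
import math

def chi2_sf_even(x, df):
    # P(chi2_df > x) for even df, via P(chi2_{2k} > x) = e^{-x/2} * sum_{j<k} (x/2)^j / j!
    k = df // 2
    t = x / 2.0
    return math.exp(-t) * sum(t ** j / math.factorial(j) for j in range(k))

def chi2_lower_quantile(p, df, lo=0.0, hi=200.0, tol=1e-9):
    # Bisection for the point q with P(chi2_df <= q) = p
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 1 - chi2_sf_even(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

q = chi2_lower_quantile(0.05, 20)       # 5% point of chi^2_20
c = q / 4                               # the cutoff derived above
pwr = 1 - chi2_sf_even(8 * 2.7125, 20)  # P(chi^2_20 < 21.7)
print(round(q, 2), round(c, 4), round(pwr, 2))   # 10.85 2.7127 0.64
```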
(b) By the NP Lemma, the most powerful critical region is: reject H0 if f_{H1}(x̃)/f_{H0}(x̃) > k, where k is a constant that depends on the level requirements.
f_{H1}(x̃)/f_{H0}(x̃) = [Π_{i=1}^n (2xi/θ1²) e^{−xi²/θ1²}] / [Π_{i=1}^n (2xi/θ0²) e^{−xi²/θ0²}] = (θ0/θ1)^{2n} exp(−Σ_{i=1}^n xi² (1/θ1² − 1/θ0²))
⇒ (θ0/θ1)^{2n} exp(−Σ_{i=1}^n xi² (1/θ1² − 1/θ0²)) > k
⇒ −Σ_{i=1}^n xi² (1/θ1² − 1/θ0²) > k′
⇒ Σ_{i=1}^n xi² < c, since θ1 < θ0 implies θ1² < θ0², so 1/θ1² − 1/θ0² > 0.
Now, if X has the given distribution, then X² ∼ Exp(mean = θ²),
∴ 2X²/θ² ∼ Exp(mean = 2).
Under H0,
2Xi²/θ0² ∼ Exp(mean = 2), ∴ Σ_{i=1}^n 2Xi²/θ0² ∼ χ²_{2n}.
Hence, the most powerful level-α test rejects H0 when Σ_{i=1}^n 2xi²/θ0² is smaller than the α-quantile of the χ²_{2n} distribution.
(c) By the NP Lemma, arguing as in part (b) (here σ1² = 1 < 4 = σ0², so the likelihood ratio is increasing in −Σ_{i=1}^5 (xi − 2)²), the most powerful critical region is of the form
Σ_{i=1}^5 (xi − 2)² < c.
∴ Reject H0 if Σ_{i=1}^5 (xi − 2)² < c.
Now, P_{H0}(Σ_{i=1}^5 (xi − 2)² < c) = 0.05. Under H0,
X1, X2, . . . , X5 ∼ IID N(2, 4), ∴ Σ_{i=1}^5 ((xi − 2)/2)² ∼ χ²_5.
∴ P_{H0}(Σ_{i=1}^5 (xi − 2)²/4 < c/4) = 0.05
⇒ P_{H0}(χ²_5 < c/4) = 0.05
⇒ P_{H0}(χ²_5 > c/4) = 0.95
⇒ c/4 = χ²_{5,0.95} ⇒ c = 4 χ²_{5,0.95}.
∴ The MP test of size α = 0.05 is: reject H0 if Σ_{i=1}^5 (xi − 2)² < 4 χ²_{5,0.95}.
(d) The log-likelihood for the full model is
−(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (Xi − µ)²,
and under H0 (where µ = 0) it is
−(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n Xi².
Let X1, . . . , Xn be an IID sample from a continuous distribution P with density f, and consider testing
H0 : f has median 0
H1 : f has median µ for some µ > 0
(a) Explain why the Wilcoxon signed rank statistic does not have the same sampling distribution under every P ∈ H0 .
Draw a picture of the graph of a density function f with median 0, such that the Wilcoxon signed rank statistic
would tend to take larger values under f than under any density function g that is symmetric about 0.
Solution.
For a density function f that has median 0 but is skewed to the right, such as in Figure 2, positive values of X1, . . . , Xn
would tend to have higher ranks than negative values, so the Wilcoxon signed rank statistic would tend to take
larger values under f than under any density function g that is symmetric about 0.
Figure 2: f with median 0.
(b) Consider the sign statistic S, defined as the number of values in X1, . . . , Xn that are greater than 0. Explain why S
has the same sampling distribution under every P ∈ H0. How would you conduct a level-α test of H0 vs. H1 using
the test statistic S? (Describe explicitly the rejection threshold; you may assume that for X ∼ Binomial(n, 1/2),
there exists an integer k such that P[X ≥ k] is exactly α.)
Solution. Let Yi = 1 if Xi > 0 and Yi = 0 otherwise. Since f has median 0, P[Yi = 1] = P[Xi > 0] = 1/2. Then
S = Σ_{i=1}^n Yi ∼ Binomial(n, 1/2).
This distribution is the same for any PDF f with median 0. A test of H0 vs. H1 should reject for large values of
S. To achieve level α, it should reject when S ≥ k, where k is a value such that P[S ≥ k] = α under H0. This is
exactly the value of k given in the problem statement.
(c) When n is large, explain why we may reject H0 when S > n/2 + √(n/4) Zα, where Zα is the upper α point of N(0, 1),
instead of using the rejection threshold you derived in part (b).
(d) In this problem, we'll study the power of this test against the specific alternative N(h/√n, 1), for a fixed constant h.
Solution. Given that X ∼ N(h/√n, 1), we have
P[X > 0] = P[X − h/√n > −h/√n] = 1 − Φ(−h/√n) = Φ(h/√n).
A first-order Taylor expansion for a differentiable function f suggests that
f(x + h) ≈ f(x) + h f′(x).
Applying this to the above and noting that Φ′(x) is the normal PDF φ(x) = (1/√(2π)) e^{−x²/2},
Φ(h/√n) ≈ Φ(0) + (h/√n) φ(0) = 1/2 + h/√(2πn).
(e) Let X1, . . . , Xn ∼ IID N(h/√n, 1). In this case, show that √(4/n) (S − n/2) has an approximate normal distribution that
does not depend on n (but depends on h) – what are the mean and variance of this normal distribution? Using this
result, derive an approximate formula for the power of the sign test against the alternative N(h/√n, 1), in terms of
Zα, h, and the CDF Φ.
Solution. The sign statistic S can be written as
S = Σ_i Yi, where Yi ∼ Bernoulli(P[Xi > 0]).
By the CLT, √n (S/n − E[Yi]) is approximately distributed as N(0, Var[Yi]). Applying part (d), E[Yi] ≈ 1/2 + h/√(2πn),
so
√n (S/n − E[Yi]) ≈ √n (S/n − 1/2 − h/√(2πn)) = (1/√n)(S − n/2) − h/√(2π).
For large n,
Var[Yi] ≈ (1/2 + h/√(2πn))(1 − 1/2 − h/√(2πn)) ≈ (1/2) × (1/2) = 1/4.
So (1/√n)(S − n/2) is approximately distributed as N(h/√(2π), 1/4). Multiplying by 2, √(4/n)(S − n/2) is approximately
distributed as N(2h/√(2π), 1). The power of the sign test against the alternative N(h/√n, 1) is given by
P[S > n/2 + √(n/4) Zα] = P[√(4/n)(S − n/2) − 2h/√(2π) > Zα − 2h/√(2π)] ≈ 1 − Φ(Zα − 2h/√(2π)) = Φ(2h/√(2π) − Zα).
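The approximate power formula can be checked against a direct simulation of S ∼ Binomial(n, Φ(h/√n)); a Python sketch using only the standard library (the values of n, h, and the replication count are illustrative choices):

```python
import math
import random
import statistics

random.seed(0)
nd = statistics.NormalDist()
n, h, alpha = 10000, 3.0, 0.05           # illustrative values
z_alpha = nd.inv_cdf(1 - alpha)           # upper-alpha point of N(0, 1)
p = nd.cdf(h / math.sqrt(n))              # P[X_i > 0] under N(h/sqrt(n), 1)
threshold = n / 2 + math.sqrt(n / 4) * z_alpha

reps = 1000
rejections = sum(
    sum(random.random() < p for _ in range(n)) > threshold   # S ~ Binomial(n, p)
    for _ in range(reps)
)
simulated = rejections / reps
approx = nd.cdf(2 * h / math.sqrt(2 * math.pi) - z_alpha)    # formula above
print(round(simulated, 2), round(approx, 3))
```

The two printed values agree to within simulation error (the formula gives roughly 0.77 for these choices).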
The popular search engine Google would like to understand whether visitors to a website are more likely to click on an
advertisement at the top of the page than one on the side of the page. They conduct an “AB test” in which they show
n visitors (group A) a version of the website with the advertisement at the top and m visitors (group B) a version of
the website with the (same) advertisement at the side. They record how many visitors in each group clicked on the
advertisement.
(a) Formulate this problem as a hypothesis test. (You may assume that visitors in group A independently click on the
ad with probability pA and visitors in group B independently click on the ad with probability pB , where both pA and
pB are unknown probabilities in (0, 1).) What are the null and alternative hypotheses? Are they simple or composite?
Solution. Let X1, . . . , Xn ∼ IID Bernoulli(pA) and Y1, . . . , Ym ∼ IID Bernoulli(pB) be indicators of whether each visitor
clicked on the ad. We wish to test
H 0 : p A = pB
H 1 : pA > p B
Both hypotheses are composite, as they do not specify the exact value of pA or pB .
(b) Let p̂A be the fraction of visitors in group A who clicked on the ad, and similarly for p̂B . A reasonable intuition is
to reject H0 when p̂A − p̂B is large. What is the variance of p̂A − p̂B ? Is this the same for all data distributions in H0 ?
Solution. As n p̂A ∼ Binomial(n, pA), we have Var[n p̂A] = n pA(1 − pA), so Var[p̂A] = pA(1 − pA)/n. Similarly,
Var[p̂B] = pB(1 − pB)/m. Since p̂A and p̂B are independent,
Var[p̂A − p̂B] = Var[p̂A] + Var[−p̂B] = Var[p̂A] + Var[p̂B] = pA(1 − pA)/n + pB(1 − pB)/m.
Under H0, pA = pB = p for some p ∈ (0, 1), and this variance is p(1 − p)(1/n + 1/m). This is not the same for all data
distributions in H0, as it depends on p. (So we cannot perform a test of H0 directly using the test statistic p̂A − p̂B.)
(c) Describe a way to estimate the variance of p̂A − p̂B using the available data, assuming H0 is true – call this estimate
V̂ . Explain heuristically why, when n and m are both large, the test statistic
T = (p̂A − p̂B)/√V̂
is approximately distributed as N (0, 1) under any data distribution in H0 . (You may assume that when n and m
are both large, the ratio of V̂ to the true variance of p̂A − p̂B that you derived in part (b) is very close to 1 with
high probability.) Explain how to use this observation to perform an approximate level-α test of H0 versus H1 .
Solution. One way is to estimate the variance by plugging p̂A and p̂B into the formula from part (b). Another way is to pool the two samples, estimating the common value of p under H0 by
p̂ = (n p̂A + m p̂B)/(n + m),
and then estimating the variance as
V̂ = p̂(1 − p̂)(1/n + 1/m).
(Both ways are reasonable under H0.)
Under H0, pA = pB = p for some p ∈ (0, 1), so the CLT implies √n(p̂A − p)/√(p(1 − p)) → N(0, 1) and
√m(p̂B − p)/√(p(1 − p)) → N(0, 1) in distribution as n, m → ∞. So for large n and m, the distributions of p̂A and p̂B
are approximately N(p, p(1 − p)/n) and N(p, p(1 − p)/m). Note that p̂A and p̂B are independent, so their difference is
distributed approximately as N(0, p(1 − p)(1/n + 1/m)). We may write the test statistic T as
T = [√(p(1 − p)(1/n + 1/m)) / √V̂] × [(p̂A − p̂B) / √(p(1 − p)(1/n + 1/m))].
Since √(p(1 − p)(1/n + 1/m)) / √V̂ ≈ 1 with high probability for large n and m, T is approximately distributed as N(0, 1), and
an asymptotic level-α test rejects H0 for T > Zα.
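A small simulation illustrates that the pooled test is indeed approximately level α under H0; a Python sketch (p = 0.3, n = m = 500, and the replication count are arbitrary choices):

```python
import math
import random
import statistics

random.seed(1)
nd = statistics.NormalDist()
n, m, p, alpha = 500, 500, 0.3, 0.05   # illustrative values; H0 holds (pA = pB = p)
z_alpha = nd.inv_cdf(1 - alpha)

reps = 2000
rejections = 0
for _ in range(reps):
    xa = sum(random.random() < p for _ in range(n))   # clicks in group A
    xb = sum(random.random() < p for _ in range(m))   # clicks in group B
    pa, pb = xa / n, xb / m
    pooled = (xa + xb) / (n + m)
    v_hat = pooled * (1 - pooled) * (1 / n + 1 / m)
    rejections += (pa - pb) / math.sqrt(v_hat) > z_alpha

print(rejections / reps)   # close to alpha = 0.05
```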
Problem 21. (Covid-19 testing problem)
There are approximately 540 coronavirus testing locations in Abu Dhabi. At the beginning of the day, officials at each lo-
cation record Y = number of specimens tested to find the first positive case and assume Y follows a geometric distribution
with probability p. The probability p satisfies 0 < p < 1 and is unknown. In this application, we might also call p the “pop-
ulation prevalence” of the disease. Of course, careful thought should go into defining exactly what the “population” is here.
Suppose Y1 , Y2 , . . . , Y540 are iid geometric (p) random variables (one for each site) observed on a given day. Epidemiologists
at the Department of Health would like to test
H0 : p = 0.02
versus
Ha : p < 0.02.
(a) Show the likelihood function of p on the basis of observing y = (y1 , y2 , . . . , y540 ) is given by
L(p | y) = (1 − p)^{Σ_{i=1}^{540} yi − 540} p^{540}.
(b) Show the uniformly most powerful (UMP) level α test of H0 versus Ha has a rejection region of the form
RR = { t = Σ_{i=1}^{540} yi ≥ k* }.
How would you choose k* to ensure the test is level α = 0.05? Hint: What is the sampling distribution of T = Σ_{i=1}^{540} Yi
when H0 is true?
Solution.
(a) L(p | y) = Π_{i=1}^{540} (1 − p)^{yi − 1} p = (1 − p)^{Σ_{i=1}^{540} yi − 540} p^{540}.
(b) The rejection region is of the stated form, where k* satisfies
α = P_{H0}(RR) = P(Σ_{i=1}^{540} Yi ≥ k* | H0).
When H0 is true, we can argue that T = Σ_{i=1}^{540} Yi ∼ Negative Binomial(540, p0). (Here, I used the negative binomial in the
version of "the number of trials needed to get the first 540 successes, i.e., the first 540 positive cases". There is another
version, "the number of failures before the 540th success". You just need to take care of the definition of your parameters.) Thus, we can choose k* to be the appropriate (upper) quantile of the negative binomial distribution to achieve a level-α test.
Now, we simply note that this rejection region does not depend on the value of pa under Ha (which we specified arbitrarily). Therefore, this test is the uniformly most powerful (UMP) level-α test. Note: To get an exact size α, you
may refer to the method of making a randomized test.
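For concreteness, k* can be approximated numerically: in the number-of-trials version, T has mean 540/p0 and variance 540(1 − p0)/p0², so a normal approximation to the upper quantile gives (a Python sketch; the normal approximation is an assumption for illustration, not part of the exercise):

```python
import math
import statistics

nd = statistics.NormalDist()
r, p0, alpha = 540, 0.02, 0.05
mean = r / p0                               # E[T] = r / p0
sd = math.sqrt(r * (1 - p0)) / p0           # sqrt(Var[T]) = sqrt(r(1 - p0)) / p0
k_star = mean + nd.inv_cdf(1 - alpha) * sd  # approximate upper-alpha quantile
print(round(k_star))                        # about 28892
```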
Problem 22. (Epidemiological Modeling of Covid-19)
SEIR models are used by epidemiologists to describe COVID-19 disease severity in a population. The model consists of
four different categories: susceptible (S), exposed (E), infected (I), and recovered (R).
The four categories are mutually exclusive and exhaustive among living individuals (SEIRD models do include a fifth
category for those who have died from disease). A random sample of n individuals is selected from a population (e.g.,
residents of Abu Dhabi city), and the category status of each individual is identified. This produces the multinomial
random vector
X4
Y ∼ mult n, p; pj = 1 ,
j=1
where
Y1 p1
Y2 p2
Y=
Y3 and p =
.
p3
Y4 p4
The random variables Y1 , Y2 , Y3 , Y4 record the number of individuals identified in the susceptible, exposed, infected, and
recovered categories, respectively. Recall that the beta distribution is a conjugate prior for the binomial. Just as the
multinomial distribution can be regarded as a generalization of the binomial (to more than two categories), we need a
prior distribution for p that is a generalization of the beta. This generalization is the Dirichlet (α1 , α2 , α3 , α4 ) distribution.
Specifically, suppose p is best regarded as random with prior pdf
g(p) = [Γ(α1 + α2 + α3 + α4) / (Γ(α1) Γ(α2) Γ(α3) Γ(α4))] p1^{α1−1} p2^{α2−1} p3^{α3−1} p4^{α4−1}, for 0 < pj < 1 and Σ_{j=1}^4 pj = 1,
and g(p) = 0 otherwise.
Solution:
(a)
g(p | y) ∝ f_{Y,p}(y, p) = f_{Y|p}(y | p) g(p)
= [n! / Π_{j=1}^4 yj!] Π_{j=1}^4 pj^{yj} × [Γ(α1 + α2 + α3 + α4) / (Γ(α1) Γ(α2) Γ(α3) Γ(α4))] p1^{α1−1} p2^{α2−1} p3^{α3−1} p4^{α4−1}
∝ p1^{α1+y1−1} p2^{α2+y2−1} p3^{α3+y3−1} p4^{α4+y4−1}
∼ Dirichlet(α1*, α2*, α3*, α4*), where αj* = yj + αj.
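Because the posterior is again Dirichlet, posterior summaries are immediate: the posterior mean of pj is (αj + yj)/Σ_l(αl + yl). A Python sketch with hypothetical counts and prior parameters (both invented for illustration):

```python
alpha = [2, 2, 2, 2]        # hypothetical prior parameters
y = [60, 10, 5, 25]         # hypothetical category counts (n = 100)

post = [a + c for a, c in zip(alpha, y)]          # Dirichlet(alpha_j + y_j) parameters
total = sum(post)
post_mean = [round(a / total, 3) for a in post]   # posterior mean of each p_j
print(post_mean)            # [0.574, 0.111, 0.065, 0.25]
```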
(b) No, this prior might not be appropriate here: the prior probabilities for the four categories should not all be treated as the same. (You can
search for more information to argue whether it is reasonable or not, such as historical data, expertise from doctors, or even
your own observations.)
Problem 23. (Mean referral waiting times)
We would like to compare the population mean referral waiting times for patients in Abu Dhabi and Dubai seeking care
from a gastrointestinal specialist. By “referral waiting time”, I mean the time it takes to see a gastrointestinal specialist
once a referral has been made by another health professional (e.g., a primary care physician). We have independent
random samples of patients from the two locations. Here are the corresponding waiting times and population-level models
for them:
• Abu Dhabi: X1, X2, . . . , Xm ∼ IID Exponential(θ1)
• Dubai: Y1, Y2, . . . , Yn ∼ IID Exponential(θ2).
The population parameters satisfy θ1 > 0 and θ2 > 0 and are unknown. The goal is to test
H0 : θ1 = θ2
versus
Ha : θ1 ≠ θ2.
(a) Preparing for an LRT derivation below, carefully describe the null parameter space Θ0 and the entire parameter
space Θ. Draw a picture of what both spaces look like.
(b) Show the likelihood function is given by
L(θ | x, y) = L(θ1, θ2 | x, y) = (1/θ1^m) e^{−Σ_{i=1}^m xi/θ1} × (1/θ2^n) e^{−Σ_{j=1}^n yj/θ2},
where θ = (θ1 , θ2 ). This is just the likelihood from each sample multiplied together (because the two samples are
independent).
(c) Show the (restricted) MLE of θ over the null parameter space Θ0 is
θ̂0 = ( (mX̄ + nȲ)/(m + n), (mX̄ + nȲ)/(m + n) ).
(d) Show the (unrestricted) MLE of θ over the entire parameter space Θ is
θ̂ = (X̄, Ȳ).
(f) Here are the observed data on the referral waiting times for both groups of patients:
Calculate −2 ln λ using the data above and implement a large-sample LRT to test H0 versus Ha . What is your
conclusion at α = 0.05?
Solution.
(a) The entire space should be the region with all positive values for θ1 and θ2 . The null space is then the line with
θ1 = θ2 .
(d) Writing l(θ1, θ2) for the log-likelihood, take partial derivatives:
∂l/∂θ1 = −m/θ1 + mX̄/θ1² = 0 ⇒ θ̂1 = X̄,
∂l/∂θ2 = −n/θ2 + nȲ/θ2² = 0 ⇒ θ̂2 = Ȳ.
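With the restricted and unrestricted MLEs above, −2 ln λ = 2[(m + n) ln θ̂0 − m ln X̄ − n ln Ȳ], which is compared to a χ²_1 quantile for the large-sample LRT. A Python sketch with hypothetical waiting times (the observed data table did not survive extraction, so these numbers are invented for illustration):

```python
import math

def neg2_log_lambda(x, y):
    # -2 ln(lambda) for H0: theta1 = theta2 with two exponential samples,
    # using theta0_hat = (m*xbar + n*ybar)/(m + n) and (xbar, ybar) as above
    m, n = len(x), len(y)
    xbar, ybar = sum(x) / m, sum(y) / n
    theta0 = (m * xbar + n * ybar) / (m + n)
    return 2 * ((m + n) * math.log(theta0) - m * math.log(xbar) - n * math.log(ybar))

x = [12.0, 5.5, 8.0, 20.0, 9.5]   # hypothetical Abu Dhabi waiting times
y = [3.0, 7.5, 2.0, 4.5]          # hypothetical Dubai waiting times
stat = neg2_log_lambda(x, y)
print(round(stat, 3), stat > 3.841)   # compare to the chi-square(1) 0.95 quantile
```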
To study the effectiveness of a drug that claims to lower blood cholesterol levels, we designed a simple experiment with
n subjects in a control group and n (different) subjects in a treatment group. We administer the drug to the treatment
group and a placebo to the control group, measure the cholesterol levels of all subjects at the end of the study and look
at whether cholesterol levels are lower in the treatment group than in control. Let X1 , . . . , Xn be the cholesterol levels in
the control group and Y1 , . . . , Yn be those in the treatment group, and let
T = (X̄ − Ȳ) / (Sp √(1/n + 1/n))
(a) A carefully designed study randomly selects subjects for the two groups so that each subject selected for either
group is (independently) with probability 1/2 high-risk and probability 1/2 low-risk. Explain why, in this case,
X1 , . . . , Xn , Y1 , . . . , Yn are IID from a common distribution. What are E [Xi ] and Var [Xi ]?
(b) Explain (using the CLT and Slutsky’s lemma) why, when n is large, T is approximately distributed as N (0, 1), and
hence a test that rejects for T > Zα is approximately level-α for large n.
(c) A poorly-designed study fails to properly randomize the treatment and control groups, so that each subject selected
for the control group is with probability p high-risk and probability 1 − p low-risk, and each subject selected
for the treatment group is with probability q high-risk and probability 1 − q low-risk. In this case, what are
E [Xi ] , Var [Xi ] , E [Yi ], and Var [Yi ]?
(d) In the setting of part (c), show that Sp2 converges in probability to a constant c ∈ R as n → ∞, and provide a
formula for c. Show that T is approximately normally distributed, and provide formulas for the mean and variance
of this normal. Is the rejection probability P[T > Zα ] necessarily close to α? Discuss briefly how this probability
depends on the values µH , µL , σ 2 , p, and q.
Solution.
(a) Each group member is selected independently from either the high-risk group or the low-risk group. So the cholesterol
level for each person in each group is a random variable independent of that for any other person. Also, since for
both the treatment and control groups, with probability 1/2, a high-risk individual is chosen, and with probability
1/2, a low-risk individual is chosen, they must have the same distribution. So, the variables X1 , · · · , Xn , Y1 , · · · , Yn
are IID from a common distribution. To compute the mean and variance, we may write Xi as
Xi = Zi Hi + (1 − Zi) Li,
where Hi ∼ N(µH, σ²), Li ∼ N(µL, σ²), Zi ∼ Bernoulli(1/2), and these are independent. Then
E[Xi] = E[Zi] E[Hi] + E[1 − Zi] E[Li] = (µH + µL)/2.
Note that since Zi ∈ {0, 1}, Zi(1 − Zi) = 0, Zi² = Zi, and (1 − Zi)² = 1 − Zi. Then
E[Xi²] = E[Zi] E[Hi²] + E[1 − Zi] E[Li²] = (1/2) E[Hi²] + (1/2) E[Li²].
We have E[Hi²] = Var[Hi] + (E[Hi])² = µH² + σ², and similarly E[Li²] = µL² + σ². So
E[Xi²] = (1/2)(µL² + µH²) + σ²,
and
Var[Xi] = E[Xi²] − (E[Xi])² = (1/2)(µL² + µH²) + σ² − (1/4)(µL² + 2µLµH + µH²) = σ² + (1/4)(µH − µL)².
(b) As the Xi's and Yi's are all IID, by the Central Limit Theorem, √n(X̄ − E[Xi]) → N(0, Var[Xi]) and √n(Ȳ − E[Xi]) →
N(0, Var[Xi]) in distribution, so their difference √n(X̄ − Ȳ) → N(0, 2 Var[Xi]). The pooled variance is
Sp² = (1/(2n − 2)) (Σ_{i=1}^n (Xi − X̄)² + Σ_{i=1}^n (Yi − Ȳ)²) = (1/2) S²_X + (1/2) S²_Y,
where S²_X = Σ_i (Xi − X̄)²/(n − 1) and S²_Y = Σ_i (Yi − Ȳ)²/(n − 1) are the individual sample variances. By the Law of Large Numbers,
S²_X → Var[Xi] and S²_Y → Var[Yi] = Var[Xi] in probability, so the Continuous Mapping Theorem implies Sp² → Var[Xi]
in probability. Then T = √n(X̄ − Ȳ)/(Sp √2) → N(0, 1) in distribution by Slutsky's lemma. Hence, a test that rejects for T > Zα is approximately level α for large n.
(c) The difference in this part from Part (a) is that here
Xi = Zi Hi + (1 − Zi) Li, where now Zi ∼ Bernoulli(p), so E[Xi] = p µH + (1 − p) µL.
Similarly, E[Yi] = q µH + (1 − q) µL.
For the variances, we compute as in Part (a):
E[Xi²] = E[Zi] E[Hi²] + E[1 − Zi] E[Li²] = p(µH² + σ²) + (1 − p)(µL² + σ²) = p µH² + (1 − p) µL² + σ²,
so
Var[Xi] = E[Xi²] − (E[Xi])² = p µH² + (1 − p) µL² + σ² − (p µH + (1 − p) µL)² = σ² + (µH − µL)² p(1 − p).
Similarly, Var[Yi] = σ² + (µH − µL)² q(1 − q).
(d) In this case S²_X → Var[Xi] and S²_Y → Var[Yi] in probability, so
Sp² → (1/2)(Var[Xi] + Var[Yi]) = σ² + (1/2)(p(1 − p) + q(1 − q))(µH − µL)² =: c.
By the CLT, √n(X̄ − E[Xi]) → N(0, Var[Xi]) and √n(Ȳ − E[Yi]) → N(0, Var[Yi]). The Xi's and Yi's are
independent, so the difference √n(X̄ − Ȳ − E[Xi] + E[Yi]) → N(0, Var[Xi] + Var[Yi]).
Then
T = (X̄ − Ȳ)/(Sp √(2/n)) = (1/√(2 Sp²)) · √n(X̄ − Ȳ)
is approximately distributed as
(1/√(2c)) N(√n(E[Xi] − E[Yi]), Var[Xi] + Var[Yi]) = N(√n(E[Xi] − E[Yi])/√(2c), 1).
Let
m := √n(E[Xi] − E[Yi])/√(2c) = √n(p − q)(µH − µL)/√(2c) = √n(p − q)(µH − µL)/√(2σ² + (p(1 − p) + q(1 − q))(µH − µL)²),
so that T ≈ N(m, 1) and the rejection probability is P[T > Zα] ≈ 1 − Φ(Zα − m), which is not necessarily close to α.
This probability increases in m and only equals α when m = 0. If (p − q)(µH − µL) > 0, then m → ∞ as n → ∞,
and we expect to falsely reject H0 with probability close to 1 for large n. If (p − q)(µH − µL) < 0, then m → −∞
as n → ∞, and we expect the significance level of the test to in fact be close to 0 for large n.
(a) Let P1, . . . , Pn be the p-values from n different hypothesis tests. Suppose that the tests are performed using independent sets of data, and in fact all of the null hypotheses are true, so P1, . . . , Pn ∼ IID Uniform(0, 1). Show that for
any t ∈ (0, 1),
P[min_{i=1}^n Pi ≤ t] = 1 − (1 − t)^n.
(b) Under the setting of part (a), if we perform all tests at significance level 1 − (1 − α)1/n (that is, we reject a null
hypothesis if its p-value is less than this level), show that the probability of (falsely) rejecting any of the n null
hypotheses is exactly α. Is this procedure more or less powerful than the Bonferroni procedure (of performing all
tests at level α/n )?
(c) Suppose, now, that all of the p-values P1 , . . . , Pn are still independent, but not necessarily all of the null hypotheses
are true. (So the p-values corresponding to the true null hypotheses are still IID and distributed as Uniform (0, 1).)
If we perform all tests at significance level 1 − (1 − α)1/n , does this procedure control the family-wise error rate
(FWER) at level α? (Explain why, or show a counterexample.)
Solution.
(a) We know that P1, . . . , Pn ∼ IID U(0, 1). So for any t ∈ (0, 1),
P[min_{i=1}^n Pi ≤ t] = 1 − P[min_{i=1}^n Pi > t]
= 1 − P[Pi > t ∀ i = 1, . . . , n]
= 1 − Π_{i=1}^n P[Pi > t] = 1 − (1 − t)^n.
(b) Setting t = 1 − (1 − α)^{1/n} in part (a), the probability of (falsely) rejecting any of the n null hypotheses is
1 − (1 − t)^n = 1 − ((1 − α)^{1/n})^n = 1 − (1 − α) = α, exactly.
The Bonferroni procedure rejects when Pi ≤ α/n and the above procedure rejects when Pi ≤ 1 − (1 − α)^{1/n}.
Note that
(1 − α/n)^n > 1 − α,
so 1 − (1 − α)^{1/n} > α/n. Hence, whenever the Bonferroni test rejects, this procedure also rejects, so this procedure
is more powerful than the Bonferroni test.
(c) Suppose there are k true null hypotheses, and without loss of generality, let us assume that these are the first k. If
all the tests are performed at significance level 1 − (1 − α)1/n , and V is the number of true null hypotheses that are
rejected, then the FWER is
P(V ≥ 1) = P[min_{i=1}^k Pi ≤ 1 − (1 − α)^{1/n}]
= 1 − P[min_{i=1}^k Pi > 1 − (1 − α)^{1/n}]
= 1 − P[Pi > 1 − (1 − α)^{1/n} ∀ i = 1, · · · , k]
= 1 − (1 − 1 + (1 − α)^{1/n})^k = 1 − (1 − α)^{k/n}.
Since k ≤ n and α < 1, (1 − α)^{k/n} ≥ (1 − α) and hence 1 − (1 − α)^{k/n} ≤ α, so the FWER is controlled.
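The per-test level 1 − (1 − α)^{1/n} is the Šidák correction. A quick Python simulation under the global null illustrates both the comparison with Bonferroni and the exact FWER (the test count, seed, and replication count are arbitrary choices):

```python
import random

random.seed(2)
n, alpha, reps = 20, 0.05, 20000
cutoff = 1 - (1 - alpha) ** (1 / n)   # Sidak per-test level
bonf = alpha / n                      # Bonferroni per-test level
assert cutoff > bonf                  # Sidak is the (slightly) more powerful procedure

# All n nulls true: p-values are IID Uniform(0, 1); the FWER should be alpha exactly
fwer = sum(
    min(random.random() for _ in range(n)) <= cutoff for _ in range(reps)
) / reps
print(round(fwer, 3))                 # close to 0.05
```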
Problem 26. (Computing the MME and MLE using Newton-Raphson Method)
(a) Let X = (X1 , . . . , Xn ) be i.i.d. random variables with Kumaraswamy distribution with parameters a > 0 and b > 0.
The probability density function of the Kumaraswamy distribution is
f(x; a, b) = a b x^{a−1} (1 − x^a)^{b−1}, where x ∈ [0, 1].
Find the method of moments for the Kumaraswamy distribution using numerical methods.
(b) Implement a function that takes as input a vector of data values X, performs the Newton-Raphson iterations to
compute the MLEs α̂ and β̂ in the Gamma (α, β) model, and outputs α̂ and β̂. (You may use the form of the Newton-
Raphson update equation derived in class. You may terminate the Newton-Raphson iterations when |α(t+1) − α(t) |
is sufficiently small.)
(c) For n = 500, use your function from part (b) to simulate the sampling distributions of α̂ and β̂ computed from
X1, . . . , Xn ∼ IID Gamma(1, 2). Plot histograms of the values of α̂ and β̂ across 5000 simulations, and report the
simulated mean and variance of α̂ and β̂ as well as the simulated covariance between α̂ and β̂. Compute the inverse of
the Fisher Information matrix I(α, β) at α = 1 and β = 2. Do your simulations support that (α̂, β̂) is approximately
distributed as N((1, 2), (1/n) I(1, 2)^{−1})? (You may use the formula for the Fisher information matrix I(α, β) and/or
its inverse derived in class.)
Solution.
(a) The Kumaraswamy distribution is an alternative to the Beta distribution, which offers similar flexibility in terms of
the shape of the density. An appealing feature of the Kumaraswamy distribution is that it only involves algebraic
terms, in contrast to the Beta distribution, which requires the evaluation of the Beta function. The mean and
variance of this distribution are
µ1 = E[X] = b Γ(1 + 1/a) Γ(b) / Γ(1 + 1/a + b),
Var[X] = E[X²] − E[X]² = µ2 − µ1² = b Γ(1 + 2/a) Γ(b) / Γ(1 + b + 2/a) − µ1².
The MMEs of a and b, ã and b̃, are the solution to the system of non-linear equations
X̄ = b Γ(1 + 1/a) Γ(b) / Γ(1 + 1/a + b),
(1/n) Σ_{i=1}^n Xi² − X̄² = b Γ(1 + 2/a) Γ(b) / Γ(1 + b + 2/a) − µ1²,
which cannot be found in closed form. Thus, we can only obtain the MME using numerical methods. The following
R code shows how to calculate the MME numerically.
mn <- function(a, b, n) {
  # n-th raw moment of the Kumaraswamy(a, b) distribution, computed on the log scale
  log.num <- log(b) + lgamma(1 + n/a) + lgamma(b)
  log.den <- lgamma(1 + b + n/a)
  return(exp(log.num - log.den))
}
#--------------------------------
# Method of Moments Estimates
#--------------------------------
library(nleqslv) # Required package
MME <- function(dat) {
  x.bar <- mean(dat)
  x2.bar <- mean(dat^2)
  fn <- function(par) {
    # par = (log a, log b), so the solver works on an unconstrained scale
    mom1 <- mn(exp(par[1]), exp(par[2]), 1) - x.bar
    mom2 <- mn(exp(par[1]), exp(par[2]), 2) - mn(exp(par[1]), exp(par[2]), 1)^2 - x2.bar + x.bar^2
    return(c(mom1, mom2))
  }
  # Starting values c(0, 0), i.e. a = b = 1, are one reasonable choice
  est <- exp(nleqslv(c(0, 0), fn)$x)
  return(est)
}
# Examples
#--------------------------------
# Example 1
#--------------------------------
a0 <- 1
b0 <- 1
ns <- 1000
dat <- (1 - (1 - runif(ns))^(1/b0))^(1/a0) # inverse-CDF sample from Kumaraswamy(a0, b0)
# MME
MME(dat)
## [1] 1.018995 1.032939
#--------------------------------
# Example 2
#--------------------------------
a0 <- 0.5
b0 <- 2
ns <- 1000
dat <- (1 - (1 - runif(ns))^(1/b0))^(1/a0)
# MME
MME(dat)
## [1] 0.5059483 2.1418411
#--------------------------------
# Example 3
#--------------------------------
a0 <- 5
b0 <- 3
ns <- 1000
dat <- (1 - (1 - runif(ns))^(1/b0))^(1/a0)
# MME
MME(dat)
## [1] 4.796745 2.915643
gamma.MLE = function(X) {
  ahat = compute.ahat(X)
  bhat = ahat/mean(X)
  return(c(ahat,bhat))
}
# Newton-Raphson for the shape parameter: solve f(a) = s, where
# s = log(mean(X)) - mean(log(X)). The starting value 0.5/s is a standard
# approximation; the tolerance 1e-8 is one reasonable choice.
compute.ahat = function(X) {
  s = log(mean(X)) - mean(log(X))
  a = 0.5/s
  repeat {
    a.new = a - (f(a) - s)/f.prime(a)
    if (abs(a.new - a) < 1e-8) break
    a = a.new
  }
  return(a)
}
f = function(alpha) {
  return(log(alpha)-digamma(alpha))
}
f.prime = function(alpha) {
  return(1/alpha-trigamma(alpha))
}
# The sample variance of a vector of values of X is given by var(X), and the sample covariance
# between two vectors of values X and Y (of the same length) is given by cov(X, Y).
n=500
n.reps=5000
alpha=1
beta=2
alpha.hat=numeric(n.reps)
beta.hat=numeric(n.reps)
for(i in 1:n.reps) {
X=rgamma(n,shape=alpha,rate=beta)
estimates=gamma.MLE(X)
alpha.hat[i] = estimates[1]
beta.hat[i] = estimates[2]
}
Here are the resulting histograms:
hist(alpha.hat, breaks=20)
hist(beta.hat, breaks=20)
Plugging in α = 1, β = 2 gives
I(1, 2)^{−1} = [4/(ψ′(1) − 1)] [ 1/4 1/2 ; 1/2 ψ′(1) ] ≈ [ 1.551 3.101 ; 3.101 10.202 ].
var(alpha.hat)
## [1] 0.003079451
var(beta.hat)
## [1] 0.02048946
cov(alpha.hat,beta.hat)
## [1] 0.006190339
In order to compare to the Fisher information, we need to scale by n.
n*var(alpha.hat) #should be close to 1.551
## [1] 1.539726
Suppose you fit the model Exponential(λ) to data X1, . . . , Xn by computing the MLE λ̂ = 1/X̄, but the true distribution of the data is X1, . . . , Xn ∼ IID Gamma(2, 1).
(a) Let f (x | λ) be the PDF of the Exponential(λ) distribution, and let g(x) be the PDF of the Gamma (2, 1) distribution.
Compute an explicit formula for the KL-divergence D_KL(g(x) ‖ f(x | λ)) in terms of λ, and find the value λ* that
minimizes this KL-divergence.
(You may use the fact that if X ∼ Gamma(α, β), then E[X] = α/β and E[log X] = ψ(α) − log β where ψ is the
digamma function.)
(b) Show directly, using the Law of Large Numbers and the Continuous Mapping Theorem, that the MLE λ̂ converges
in probability to λ∗ as n → ∞.
(c) Perform a simulation study for the standard error of λ̂ with sample size n = 500, as follows: In each of B = 10000
simulations, sample X1, . . . , Xn ∼ IID Gamma(2, 1), compute the MLE λ̂ = 1/X̄ for the exponential model, compute
an estimate of the standard error of λ̂ using the Fisher information I(λ̂), and also compute the sandwich estimate
of the standard error, S_X/(X̄² √n).
Report the true mean and standard deviation of λ̂ that you observe across your 10000 simulations. Is the mean
close to λ∗ from part (a)? Plot a histogram of the 10000 estimated standard errors using the Fisher information,
and also plot a histogram of the 10000 estimated standard errors using the sandwich estimate. Do either of these
methods for estimating the standard error of λ̂ seem accurate in this setting?
Solution.
(a) The KL-divergence is given by
DKL(g(x) ‖ f(x | λ)) = E_g[ log( g(X) / f(X | λ) ) ]
= E_g[ log( (1/Γ(2)) X e^{−X} / (λ e^{−λX}) ) ]
= E_g[ −log Γ(2) + log X − X − log λ + λX ]
= −log Γ(2) − log λ + E_g[log X] + (λ − 1) E_g[X]
= −log Γ(2) − log λ + ψ(2) + 2(λ − 1).
Setting the derivative with respect to λ equal to zero gives −1/λ + 2 = 0, so the KL-divergence is minimized at λ* = 1/2.
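As a quick numerical sanity check (a sketch in Python rather than the document's R; the grid bounds and step are arbitrary choices), minimizing the KL expression over a grid recovers λ* = 1/2. The constants −log Γ(2) and ψ(2) do not affect the minimizer, so only −log λ + 2λ needs to be evaluated:

```python
import math

# KL(g || f(. | lam)) up to additive constants: -log(lam) + 2*lam
def kl_shifted(lam):
    return -math.log(lam) + 2.0 * lam

# Grid search over lambda in (0, 2]; the analytic minimizer is 1/2.
grid = [0.001 * k for k in range(1, 2001)]
lam_star = min(grid, key=kl_shifted)
print(lam_star)  # approximately 0.5
```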
(b) By the Law of Large Numbers, X̄ → Eg [X] = 2 in probability, so λ̂ = 1/X̄ → 1/2 in probability by the Continuous
Mapping Theorem.
n = 500
B = 10000
lamb_hat = numeric(B)
se_Fisher = numeric(B)
se_sandwich = numeric(B)
for(i in 1:B){
X = rgamma(n,2,rate=1)
lamb_hat[i] = 1/mean(X)
se_Fisher[i] = 1/(mean(X)*sqrt(n))
se_sandwich[i] = sd(X)/(mean(X)^2*sqrt(n))
}
print(mean(lamb_hat))
## [1] 0.5006192
print(sd(lamb_hat))
## [1] 0.01595704
hist(se_Fisher)
hist(se_sandwich)
The empirical mean and standard error of λ̂ are 0.501 and 0.016; the mean is close to λ∗ = 0.5 from part (a). The
Fisher-information-based estimate of the standard error is incorrect – it estimates the standard error as approximately
0.022. The sandwich estimate of the standard error seems correct – it estimates the standard error as 0.016, with
some variability in the third decimal place.
Solution.
(a) p̂ → p in probability, hence √n (p̂ − p)/√(p̂(1 − p̂)) → N(0, 1) in distribution by Slutsky's Lemma. So for large n,
P[ −Z_{0.025} ≤ √n (p̂ − p)/√(p̂(1 − p̂)) ≤ Z_{0.025} ] ≈ 0.95.
We may rewrite the above as
P[ p̂ − Z_{0.025} √(p̂(1 − p̂)/n) ≤ p ≤ p̂ + Z_{0.025} √(p̂(1 − p̂)/n) ] ≈ 0.95,
so an approximate 95% confidence interval is p̂ ± Z_{0.025} √(p̂(1 − p̂)/n).
(b) The following conditions are equivalent:
−√(p(1 − p)) Z_{α/2} ≤ √n (p̂ − p) ≤ √(p(1 − p)) Z_{α/2}
⇕
n (p̂ − p)² ≤ p(1 − p) Z²_{α/2}
⇕
(n + Z²_{α/2}) p² − (2np̂ + Z²_{α/2}) p + np̂² ≤ 0.
This occurs when p is between the two real roots of the corresponding quadratic equation, which are given by
[ (2np̂ + Z²_{α/2}) ± √( (2np̂ + Z²_{α/2})² − 4 (n + Z²_{α/2}) n p̂² ) ] / [ 2 (n + Z²_{α/2}) ].
Taking α = 0.05 and simplifying the above, we obtain an approximate 95% confidence interval of
[ p̂ + Z²_{0.025}/(2n) ± Z_{0.025} √( p̂(1 − p̂)/n + Z²_{0.025}/(4n²) ) ] / [ 1 + Z²_{0.025}/n ].
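Both interval formulas can be written out directly; the part-(b) interval is the Wilson score interval. A minimal Python sketch (using the hard-coded value z = 1.96 for Z_{0.025}) illustrates why it behaves better for small n: the Wald interval from part (a) can fall below 0, while the Wilson interval always stays inside (0, 1):

```python
import math

def wald_interval(phat, n, z=1.96):
    # part (a): p-hat +/- z * sqrt(p-hat (1 - p-hat) / n)
    half = z * math.sqrt(phat * (1 - phat) / n)
    return phat - half, phat + half

def wilson_interval(phat, n, z=1.96):
    # part (b): the two roots of the quadratic in p
    center = phat + z**2 / (2 * n)
    half = z * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    denom = 1 + z**2 / n
    return (center - half) / denom, (center + half) / denom

wald = wald_interval(0.1, 10)
wilson = wilson_interval(0.1, 10)
print(wald, wilson)  # the Wald lower endpoint is negative; Wilson's is not
```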
(c) ns = c(10,40,100)
ps = c(0.1,0.3,0.5)
B=100000
z = qnorm(0.975)
for (n in ns) {
for (p in ps) {
cover_A = numeric(B)
cover_B = numeric(B)
for (i in 1:B) {
phat = rbinom(1,n,p)/n
U = phat+z*sqrt(phat*(1-phat)/n)
L = phat-z*sqrt(phat*(1-phat)/n)
if (p <= U && p >= L) {
cover_A[i] = 1
} else {
cover_A[i] = 0
}
U = (phat+z^2/(2*n)+z*sqrt(phat*(1-phat)/n+z^2/(4*n^2)))/(1+z^2/n)
L = (phat+z^2/(2*n)-z*sqrt(phat*(1-phat)/n+z^2/(4*n^2)))/(1+z^2/n)
if (p <= U && p >= L) {
cover_B[i] = 1
} else {
cover_B[i] = 0
}
}
print(c(n,p,mean(cover_A),mean(cover_B)))
}
}
For the interval from part (a), we obtain the following coverage probabilities:
For the interval from part (b), we obtain the following coverage probabilities:
The intervals from part (b) are more accurate when n is small.
Solution.
(a) The code below runs the simulations for the null case (µ = 0) as well as for µ = 0.1, 0.2, 0.3, 0.4 :
set.seed(1)
n = 100
B = 10000
for (mu in c(0,0.1,0.2,0.3,0.4)) {
output.Z = numeric(B)
output.T = numeric(B)
output.W = numeric(B)
output.S = numeric(B)
for (i in 1:B) {
X = rnorm(n, mean=mu, sd=1)
if (mean(X) > 1/sqrt(n)*qnorm(0.95)) {
output.Z[i] = 1
} else {
output.Z[i] = 0
}
T = t.test(X)$statistic
if (T > qt(0.95,df=n-1)) {
output.T[i] = 1
} else {
output.T[i] = 0
}
W = wilcox.test(X)$statistic
if (W > n*(n+1)/4+sqrt(n*(n+1)*(2*n+1)/24)*qnorm(0.95)) {
output.W[i] = 1
} else {
output.W[i] = 0
}
S = length(which(X>0))
if (S > n/2+sqrt(n/4)*qnorm(0.95)) {
output.S[i] = 1
} else {
output.S[i] = 0
}
}
print(paste("mu = ", mu))
print(paste("Z: ", mean(output.Z)))
print(paste("T: ", mean(output.T)))
print(paste("W: ", mean(output.W)))
print(paste("S: ", mean(output.S)))
}
Under H0 (case µ = 0), we obtained the results:
(c) The powers of the tests against N(µ, 1) decrease as we increasingly relax the distributional assumptions under H0 (from N(0, 1), to N(0, σ²), to any PDF f symmetric about 0, to any PDF f with median 0). The sign test makes the fewest distributional assumptions under H0, but its power is substantially lower than that of the other three tests. Hence if we have good reason to believe that the data distribution under H0 is symmetric (for example, if each data value is the difference of paired samples (Xi, Yi), and (Xi, Yi) should have the same distribution as (Yi, Xi) under H0), then we should at least opt for the Wilcoxon test. The difference in power between the Wilcoxon test, the t-test, and the most-powerful likelihood ratio test is indeed very small, which supports Rice's claim (at least for the tested sample size n = 100).
In a classical genetics study, Geissler (1889) studied hospital records in Saxony and compiled data on the gender ratio.
The following table shows the number of male children in 6115 families having 12 children:
Let X1 , . . . , X6115 denote the number of male children in these 6115 families.
(a) Suggest two reasonable test statistics T1 and T2 for testing the null hypothesis
H0 : X1 , . . . , X6115 ~ IID Binomial(12, 0.5).
(This is intentionally open-ended; try to pick T1 and T2 to “target” different possible alternatives to the above null.)
Compute the values of T1 and T2 for the above data.
(b) Perform a simulation to simulate the null distributions of T1 and T2 . (For example: Simulate 6115 independent
samples X1 , . . . , X6115 from Binomial(12, 0.5), and compute T1 on this sample. Do this 1000 times to obtain 1000
simulated values of T1 . Do the same for T2 .) Plot the histograms of the simulated null distributions of T1 and T2 .
Using your simulated values, compute approximate p-values of the hypothesis tests based on T1 and T2 for the above
data. For either of your tests, can you reject H0 at significance level α = 0.05?
(c) In this example, why might the null hypothesis H0 not hold? (Please answer this question regardless of your findings
in part (b).)
Solution.
(a) There are many possible answers. We may take T1 to be the average number of male children per family,
T1 = X̄,
and perform a two-sided test based on T1 to check whether roughly half of the children are male. We may take T2
to be Pearson's chi-squared statistic
T2 = Σ_{k=0}^{12} (Ok − Ek)² / Ek,
where Ok is the number of families with k male children and Ek is the expected number under the hypothesized binomial distribution, i.e. Ek = 6115 × C(12, k) × (0.5)^12, and perform a one-sided test that rejects for large T2 to check whether the shape of the observed distribution of X1 , . . . , X6115 matches the shape of the binomial PMF.
(b) R code corresponding to the above T1 and T2 is as follows:
ks = seq(0,12)
counts = c(7,45,181,478,829,1112,1343,1033,670,286,104,24,3)
expected = 6115*choose(12,ks)*(0.5^12)
T1_obs = sum(ks*counts)/6115
T2_obs = sum((counts-expected)^2/expected)
T1s = numeric(1000)
T2s = numeric(1000)
for (i in 1:1000) {
X = rbinom(6115, 12, 0.5)
T1s[i] = mean(X)
counts = numeric(13)
for (k in 0:12) {
counts[k+1] = length(which(X==k))
}
T2s[i] = sum((counts-expected)^2/expected)
}
hist(T1s)
hist(T2s)
T1_pvalue = length(which(T1s<T1_obs))/1000 * 2
T2_pvalue = length(which(T2s>T2_obs))/1000
Histograms of the null distributions of T1 and T2 are below:
The values of the test statistics for the observed data are T1 = 5.77 and T2 = 249, which are both far outside the
range of the simulated null distributions above. The simulated p-values for the two tests are both < 0.001, and there
is strong evidence that H0 is not correct.
(c) There may be biological and sociological reasons why H0 is false. Biologically, the human male-to-female sex ratio at
birth is not exactly 1 : 1. The probability p that a child is male might also vary from family to family. The sexes of
children within a family might be dependent; in particular, one source of dependence is the presence of identical twins.
Sociologically, there may be a relationship between family size and the sex ratio of children because the current sex
ratio influences parents’ decision of whether to have another child. Note that the given data is only for families
with 12 children, which is quite large even for that time. There is a noticeable bias towards families with more girls
than boys, which may be explained if parents tended to continue having children when their current children were
predominantly female.
Problem 31. (Monte Carlo Simulations)
(a) Consider the problem of approximating tail probabilities of a Standard Normal distribution. Find P(X > 2.04)
for X ∼ N (0, 1) using Monte Carlo methods. While we can easily calculate this in R using the pnorm function, check how
a Monte Carlo method would perform. How does the estimate change with the number of Monte Carlo samples?
(b) Consider the problem of calculating the expected present value of the payoff of a call option on a stock. Let S(t)
denote the price of the stock at time t. Consider a European call option, that is, the holder has the right to buy the stock
at a fixed price K at a fixed time T in the future, t = 0 being the current time. If S(T ) > K, the holder exercises the
option for a profit of S(T) − K. If S(T) ≤ K, the option expires worthless. The payoff at time T is thus max{S(T) − K, 0}, and its present value is e^{−rT} max{S(T) − K, 0}, where r is the continuously compounded interest rate. The evolution of the stock price S(t) can be modeled via the Black-Scholes model, expressed as
dS(t)/S(t) = r dt + σ dW(t),
where W is a standard Brownian motion. The solution to the above stochastic differential equation is given by
S(T) = S(0) exp( (r − σ²/2) T + σ W(T) ),
where S(0) is the current price of the stock and W(T) ~ N(0, T), that is, W(T) = √T Z with Z ~ N(0, 1). Thus the logarithm of S(T) is normally distributed; in other words, the distribution of S(T) is log-normal. The expected payoff E[ e^{−rT} max{S(T) − K, 0} ] is thus an integral with respect to the lognormal density.
Can you perform Monte Carlo sampling to estimate this integral (that is, find the expected payoff)? This will require
generating samples from the standard Normal distribution, followed by computing the value of the function inside the
expectation and then taking the average. Can you also compute the estimated standard error of the estimator?
Solution.
(a) N <- 1e06 # number of samples
t <- 2.04
Xsamp <- rnorm(N) #simulating from N(0,1)
tail.prob <- mean(Xsamp > t)
exact.tail.prob <- pnorm(t, lower.tail = FALSE)
c(MC.estimate = tail.prob, Exact.value = exact.tail.prob)
# Standard error of the estimate
MC.se <- sqrt(var(Xsamp > t)/N)
MC.se
library(tibble)
library(ggplot2)
# Track the Monte Carlo estimate as the number of samples grows.
# (The grid of sample sizes below is illustrative; the original loop
# header was lost in extraction.)
Ns <- 10^(2:6)
out <- matrix(NA, nrow = length(Ns), ncol = 2,
              dimnames = list(NULL, c("N", "MC.estimate")))
for (i in seq_along(Ns)) {
  out[i, ] <- c(Ns[i], mean(rnorm(Ns[i]) > t))
}
out <- as_tibble(out)
ggplot(out, mapping = aes(N, MC.estimate)) +
  geom_line() + geom_hline(yintercept = exact.tail.prob, color = "blue")
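Part (b) follows the same recipe. Here is a minimal sketch in Python (rather than the document's R), with illustrative parameters S(0) = 100, K = 100, r = 0.05, σ = 0.2, T = 1; none of these values come from the problem:

```python
import math
import random

random.seed(0)
S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0  # illustrative parameters
N = 100_000  # number of Monte Carlo samples

# Simulate S(T) = S0 * exp((r - sigma^2/2) T + sigma sqrt(T) Z), Z ~ N(0, 1),
# and average the discounted payoffs exp(-rT) * max(S(T) - K, 0).
payoffs = []
for _ in range(N):
    Z = random.gauss(0.0, 1.0)
    ST = S0 * math.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * Z)
    payoffs.append(math.exp(-r * T) * max(ST - K, 0.0))

estimate = sum(payoffs) / N
# Estimated standard error: sample sd of the discounted payoffs over sqrt(N).
var = sum((p - estimate) ** 2 for p in payoffs) / (N - 1)
std_err = math.sqrt(var / N)
print(estimate, std_err)
```

For these parameters the closed-form Black-Scholes price is about 10.45, so the Monte Carlo estimate should land within a few estimated standard errors of that value.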
A study was made on the effect of temperature on the yield of a chemical process. The following data were collected:
X -5 -4 -3 -2 -1 0 1 2 3 4 5
Y 1 5 4 7 10 8 9 13 14 13 18
(a) Assuming a model, Y = β0 + β1 X + ϵ, what are the least square estimates of β0 and β1 ? What is the fitted equation?
(b) Construct the ANOVA table and test the hypothesis H0 : β1 = 0 with α = 0.05.
(c) What are the confidence limits for β1 at α = 0.05?
(d) What are the confidence limits for the true mean value of Y when X = 3 at α = 0.05?
(e) What are the confidence limits at α = 0.05 level of significance for the difference between the true mean value of Y
when X1 = 3 and the true mean value of Y when X2 = −2?
[Given: F0.05,1,9 = 5.12, t0.05,9 = 1.833, t0.025,9 = 2.263]
Solution:
(a) The given data is (xi , yi ), i = 1, 2, . . . , 11. The linear regression model is
yi = β0 + β1 xi + ϵi, so the least squares criterion is S = Σ_{i=1}^{11} (yi − β̂0 − β̂1 xi)².
β̂1 = Sxy/Sxx = (Σ_{i=1}^{11} xi yi − n x̄ ȳ) / (Σ_{i=1}^{11} xi² − n x̄²) = 158/110 = 1.44;  β̂0 = ȳ − β̂1 x̄ = 9.27.
Thus ŷi = 9.27 + 1.44 xi is the fitted model.
SST = Σ_{i=1}^{11} (yi − ȳ)² = 248.18,  SSReg = β̂1² Sxx = 226.94,  SSRes = Σ_{i=1}^{11} (yi − ŷi)² = SST − SSReg = 21.24.
(b) F = MSReg/MSRes = 226.94/(21.24/9) = 226.94/2.36 = 96.2 > F0.05,1,9 = 5.12, so H0 : β1 = 0 is rejected. Hence, F ≡ t² under H0.
(c)
(β̂1 − β1)/√(MSRes/Sxx) ~ t_{n−2}
⇒ P[ −t_{α/2,n−2} ≤ (β̂1 − β1)/√(MSRes/Sxx) ≤ t_{α/2,n−2} ] = 1 − α
⇒ β̂1 − t_{α/2,n−2} √(MSRes/Sxx) ≤ β1 ≤ β̂1 + t_{α/2,n−2} √(MSRes/Sxx)
⇒ 1.44 − 2.263 × 0.146 ≤ β1 ≤ 1.44 + 2.263 × 0.146 ⇒ 1.11 ≤ β1 ≤ 1.77.
(d) Let
A = [ β̂0 + β̂1 x0 − (β0 + β1 x0) ] / √( MSRes [ 1/n + (x0 − x̄)²/Sxx ] ) ~ t_{n−2},  ∴ P[ −t_{α/2,n−2} ≤ A ≤ t_{α/2,n−2} ] = 1 − α.
∴ β̂0 + β̂1 x0 ± t_{α/2,n−2} × √( MSRes [ 1/n + (x0 − x̄)²/Sxx ] ) is the confidence interval for β0 + β1 x0.
Substituting the values, the required confidence interval is 12.15 ≤ β0 + β1 x0 ≤ 15.03.
(e) The confidence limits for the difference between the true mean value of Y when X1 = 3 and the true mean value of Y when X2 = −2 can be computed as follows. Write Z1 = β0 + 3β1 and Z2 = β0 − 2β1.
Unbiased estimators of Z1 and Z2 are Ẑ1 = β̂0 + 3β̂1 and Ẑ2 = β̂0 − 2β̂1.
Thus Ẑ1 − Ẑ2 = (β̂0 + 3β̂1) − (β̂0 − 2β̂1) = 5β̂1 = 7.20 (point estimate).
Confidence interval for Z1 − Z2: Var(Ẑ1 − Ẑ2) = Var(5β̂1) = 25σ²/Sxx = 25σ²/110; thus Ẑ1 − Ẑ2 ~ N(Z1 − Z2, 25σ²/110), and
[ (Ẑ1 − Ẑ2) − (Z1 − Z2) ] / √(25 MSRes/110) ~ t_{n−2}.
So the C.I. for Z1 − Z2 is
(Ẑ1 − Ẑ2) ± t_{α/2,9} √(25 × 2.36/110) ⇒ 5.54 ≤ Z1 − Z2 ≤ 8.86.
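The arithmetic in parts (a) and (c) is easy to check mechanically; a small Python sketch (not part of the original solution) reproduces the fit and the confidence interval for β1:

```python
import math

x = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]
y = [1, 5, 4, 7, 10, 8, 9, 13, 14, 13, 18]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxx = sum(xi**2 for xi in x) - n * xbar**2                     # 110
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar   # 158
b1 = Sxy / Sxx            # slope, about 1.44
b0 = ybar - b1 * xbar     # intercept, about 9.27

SST = sum((yi - ybar) ** 2 for yi in y)
SSReg = b1**2 * Sxx
MSRes = (SST - SSReg) / (n - 2)

# 95% CI for beta_1 using the given t_{0.025,9} = 2.263
half = 2.263 * math.sqrt(MSRes / Sxx)
print(round(b1, 2), round(b0, 2), b1 - half, b1 + half)  # close to (1.11, 1.77)
```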
There are a few occasions where it makes sense to fit a model without an intercept β0. If there were an occasion to fit the model y = βx + ϵ to a set of data (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ), the least squares estimate of β would be
β̂ = b = Σ_{i=1}^{n} xi yi / Σ_{i=1}^{n} xi².
Suppose you have a programmed calculator that will fit only the intercept model y = β0 + β1 x + ϵ, but you want to fit a no-intercept model. By adding one more fake data point (m x̄, m ȳ) to the data above, where m = n/((n + 1)^{1/2} − 1) = n/a, say, and letting the calculator fit y = β0 + β1 x + ϵ, can you estimate β by using b?
Solution:
Given the data set (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ), we fit y = βx + ϵ, where the least squares estimate of β is calculated as follows:
SSE = S = Σ (yi − ŷi)² = Σ (yi − β̂xi)²;  ∂S/∂β = 0 ⇒ Σ (yi − β̂xi) xi = 0 ⇒ β̂ = Σ xi yi / Σ xi².
Add one more data point (m x̄, m ȳ) with m = n/((n + 1)^{1/2} − 1) = n/a, so that (a + 1)² = n + 1.
Thus the new dataset is (u1 , v1 ), . . . , (un+1 , vn+1 ) = (x1 , y1 ), . . . , (xn , yn ), (m x̄, m ȳ).
Hence ū = x̄(n + m)/(n + 1) and v̄ = ȳ(n + m)/(n + 1), and
Suu = Σ xi² + m² x̄² − (n + m)² x̄²/(n + 1),  Suv = Σ xi yi + m² x̄ ȳ − (n + m)² x̄ ȳ/(n + 1).
Since m = n/a and (a + 1)² = n + 1, we have n + m = n(a + 1)/a, so (n + m)²/(n + 1) = n²/a² = m², and the extra terms cancel.
For the programmed calculator: β̂1 = Suv/Suu = Σ_{i=1}^{n} xi yi / Σ_{i=1}^{n} xi², which is exactly the no-intercept estimate b.
Fit the model Y = β0 + β1 X1 + β2 X2 + ϵ for the data given below. Provide an ANOVA table and perform the partial F-tests to test H0 : βi = 0 vs H1 : βi ≠ 0 for i = 1, 2, given that the other variable is already in the model. Comment on the relative contributions of the variables X1 and X2, depending on whether they enter the model first or second. Find the regression equation.
X1 -5 -4 -1 2 2 3 3
X2 5 4 1 -3 -2 -2 -3
Y 11 11 8 2 5 5 4
Solution:
Y = β0 + β1 X1 + β2 X2 + ϵ
Fit the model with both X1 and X2. OLS estimates can be obtained using β̂ = (XᵀX)⁻¹XᵀY. With
X = [ 1 −5 5 ; 1 −4 4 ; 1 −1 1 ; 1 2 −3 ; 1 2 −2 ; 1 3 −2 ; 1 3 −3 ],
XᵀX = [ 7 0 0 ; 0 68 −67 ; 0 −67 68 ],  XᵀY = (46, −66, 69)ᵀ,
so β̂ = (β̂0, β̂1, β̂2)ᵀ = (46/7, 1, 2)ᵀ  ∴ Ŷ = 46/7 + X1 + 2X2.
SST = Σ (Yi − Ȳ)² = 73.71,  ei = yi − ŷi,  SSRes = Σ_{i=1}^{n} ei². To test H0 : β1 = β2 = 0 vs. H1 : H0 is not true, we have F = 83.72 > F0.05,2,4 = 6.94 ⇒ H0 is rejected.
Partial F-test: Fit the model with X2 only, Y = β0 + β2 X2 + ϵ, giving Ŷ = 46/7 + (69/68) X2, and test H0 : β1 = 0 vs H1 : β1 ≠ 0 given that X2 is in the model:
F = [SSReg(Full) − SSReg(Restricted Model)] / 1 / MSRes = (72 − 70.01)/0.43 = 4.64 < F0.05,1,4 = 7.71
∴ H0 is not rejected at the 5% level of significance, i.e., X1 is not significant in the presence of X2.
Implication: If X2 is in the model, we do not need X1; if X1 is in the model, X2 still helps significantly. X2 is clearly the more useful variable: it alone explains R² = 70.01/73.71 = 95% of the total variability in Y about its mean, whereas X1 alone explains R² = 64.06/73.71 = 87%. Together, X1 and X2 explain 72.00/73.71 = 98% of the total variability.
Note: In this problem X1 + X2 ≈ 0, so X1 and X2 are highly collinear (not independent), and this multicollinearity is why the partial F-test accepts β1 = 0.
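A compact Python sketch (not from the original) reproduces the partial F-test arithmetic; since the X1 and X2 columns each sum to zero, the 2×2 normal equations can be solved directly:

```python
X1 = [-5, -4, -1, 2, 2, 3, 3]
X2 = [5, 4, 1, -3, -2, -2, -3]
Y = [11, 11, 8, 2, 5, 5, 4]
n = len(Y)
ybar = sum(Y) / n

# Cross-products (the X columns already sum to zero, so no centering needed).
S11 = sum(a * a for a in X1)              # 68
S22 = sum(b * b for b in X2)              # 68
S12 = sum(a * b for a, b in zip(X1, X2))  # -67
S1y = sum(a * y for a, y in zip(X1, Y))   # -66
S2y = sum(b * y for b, y in zip(X2, Y))   # 69

# Solve the 2x2 system [S11 S12; S12 S22] (b1, b2)^T = (S1y, S2y)^T.
det = S11 * S22 - S12 * S12
b1 = (S22 * S1y - S12 * S2y) / det
b2 = (S11 * S2y - S12 * S1y) / det

SST = sum((y - ybar) ** 2 for y in Y)
SSReg_full = b1 * S1y + b2 * S2y       # = 72
SSReg_X2 = S2y**2 / S22                # = 70.01
MSRes = (SST - SSReg_full) / (n - 3)   # = 0.43
F_partial = (SSReg_full - SSReg_X2) / MSRes
print(b1, b2, round(F_partial, 2))  # approximately the 4.64 reported above
```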
Solution:
(a) For the design matrix
X = [ 1 x11 x12 ; 1 x21 x22 ; . . . ; 1 xn1 xn2 ],
the matrix XᵀX is given by
XᵀX = [ n  Σxi1  Σxi2 ;  Σxi1  Σxi1²  Σxi1xi2 ;  Σxi2  Σxi1xi2  Σxi2² ] = [ 33 0 0 ; 0 40 20 ; 0 20 60 ]
(b) We know the normal equations are given by (XᵀX)β̂ = XᵀY. Thus we have
[ 33 0 0 ; 0 40 20 ; 0 20 60 ] (b1, b2, b3)ᵀ = (132, 24, 92)ᵀ,
and using Gaussian elimination (R3′ = R3 − ½R2):
[ 33 0 0 ; 0 40 20 ; 0 0 50 ] (b1, b2, b3)ᵀ = (132, 24, 80)ᵀ.
Thus we obtain b3 = 1.60; 40b2 + 20b3 = 24 ⇒ 40b2 = −8 ⇒ b2 = −0.20; 33b1 = 132 ⇒ b1 = 4. Hence, the regression equation is ŷ = 4 − 0.20x1 + 1.60x2.
(c)
|XᵀX| = 33(40 × 60 − 20 × 20) = 33 × 2000 = 66000.
S.E. of b2 = S √C22, where S = √(RSS/(n − k)) = √(7.6/2) = √3.8 and C22 = cofactor/|XᵀX| = 1980/66000 = 0.03.
SE(b2) = √(0.03 × 3.8) ≈ 0.34.
To test the hypothesis H0 : b2 = 0 vs. H1 : b2 ≠ 0, we use the test statistic t = b2/SE(b2) = −0.20/0.34 = −0.59, and |t| = 0.59 < 4.303, so we fail to reject H0.
(d) TSS = RSS + SSReg.
Given that TSS = 150, we can calculate SSReg = β̂*ᵀ(X*ᵀY) = (−0.20)(24) + (1.60)(92) = 142.4, and thus RSS = 150 − 142.4 = 7.6.
Thus the coefficient of determination is R² = 142.4/150 = 0.9493: the model explains 94.93% of the total variability in the response variable.
(e) The predicted value of y for the given values of x1 = −4 and x2 = 2 is given by ŷ = 4 − 0.20(−4) + 1.60(2) = 8.
Can we use the data given below to get a unique fit to the model Y = β0 X0 + β1 X1 + β2 X2 + β3 X3 + ϵ
X1 -4 3 1 4 -3 -1
X2 1 2 3 4 5 6
X3 3 -5 -4 -8 -2 -5
Y 7.4 14.7 13.9 18.2 12.1 14.8
Solution:
The LSE of the regression model Y = Xβ + ϵ is given by β̂ = (XᵀX)⁻¹XᵀY, with design matrix (columns x0, x1, x2, x3)
X = [ 1 −4 1 3 ; 1 3 2 −5 ; 1 1 3 −4 ; 1 4 4 −8 ; 1 −3 5 −2 ; 1 −1 6 −5 ].
Note that X1 + X2 + X3 = 0, which implies that XᵀX is a singular matrix, i.e., |XᵀX| = 0. No, we cannot compute β̂ uniquely here.
Solution:
(a) Cor(e, Y) = Σ (ei − ē)(Yi − Ȳ) / √( Σ (ei − ē)² Σ (Yi − Ȳ)² ).
Note that Σ (ei − ē)(Yi − Ȳ) = Σ ei (Yi − Ȳ) = Σ ei Yi = Yᵀe  [since ē = 0 if β0 is in the model].
Also, for Y = Xβ + ϵ, β̂ = (XᵀX)⁻¹XᵀY ⇒ Ŷ = X(XᵀX)⁻¹XᵀY = Xβ̂ = HY, where H = X(XᵀX)⁻¹Xᵀ, and
e = Y − Ŷ = (I − H)Y ⇒ eᵀe = Yᵀ(I − H)ᵀ(I − H)Y = Yᵀ(I − H)Y = Yᵀe  [since H² = H].
Thus we have
Cor(e, Y) = Yᵀe / √( (eᵀe) SST ) = √( eᵀe / SST ) = √( SSRes / SST ) = √( 1 − SSReg/SST ) = √(1 − R²).
Implication: This is why we plot ei against Ŷi rather than against Yi: e and Y are correlated, whereas e and Ŷ are not.
(b) The correlation between e and Ŷ can be computed as follows:
Cov(e, Ŷ) = Σ (ei − ē)(Ŷi − Ŷ̄) = eᵀŶ = Yᵀ(I − H)HY = Yᵀ(H − H²)Y = 0  [since Ŷ = HY, e = (I − H)Y],
so Cor(e, Ŷ) = 0.
Prove that the multiple coefficient of determination R² equals the square of the correlation between Y and Ŷ.
Solution: We have to prove that [Cor(Y, Ŷ)]² = R² = SSReg/SST.
Note that Σ ei = 0 ⇒ Σ (Yi − Ŷi) = 0 ⇒ Σ Yi = Σ Ŷi ⇒ Ȳ = Ŷ̄  [since Yi = Ŷi + ei].
Also, Cor(e, Ŷ) = 0 ⇒ Cov(e, Ŷ) = 0 ⇒ Σ (Ŷi − Ȳ) ei = 0. Therefore
r_{YŶ} = Σ (Ŷi − Ŷ̄)(Yi − Ȳ) / √( Σ (Ŷi − Ŷ̄)² Σ (Yi − Ȳ)² )
= Σ (Ŷi − Ȳ)[(Ŷi − Ȳ) + ei] / √( Σ (Ŷi − Ȳ)² Σ (Yi − Ȳ)² )
= Σ (Ŷi − Ȳ)² / √( Σ (Ŷi − Ȳ)² SST )
= √( Σ (Ŷi − Ȳ)² / SST ) = √( SSReg / SST ) = √R².
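These identities are easy to verify numerically. A small Python sketch on made-up data, fitting a simple line by least squares with only the standard library:

```python
import math

def corr(a, b):
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    num = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    den = math.sqrt(sum((ai - abar) ** 2 for ai in a)
                    * sum((bi - bbar) ** 2 for bi in b))
    return num / den

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]  # arbitrary toy data
xbar, ybar = sum(x) / len(x), sum(y) / len(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]
e = [yi - yh for yi, yh in zip(y, yhat)]

SST = sum((yi - ybar) ** 2 for yi in y)
SSReg = sum((yh - ybar) ** 2 for yh in yhat)
R2 = SSReg / SST

# Cor(Y, Yhat)^2 = R^2  and  Cor(e, Y)^2 = 1 - R^2
print(corr(y, yhat) ** 2, R2, corr(e, y) ** 2, 1 - R2)
```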
A newborn baby was weighed weekly. Twenty such weights, recorded in ounces, are shown below. Fit to the data, using orthogonal polynomials, a polynomial model of degree justified by the accuracy of the figures; that is, test as you go along for the significance of the linear, quadratic, and so forth, terms.
No. of weeks 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Weights 141 144 148 150 158 161 166 170 175 181 189 194 196 206 218 229 234 242 247 257
Solution:
Y = α0 + α1 P1(X) + α2 P2(X) + . . . + αk Pk(X) + ϵ, where Pi(x) is the i-th orthogonal polynomial.
The parameters can be estimated as α̂0 = ȳ and α̂j = Σ Pj(xi) yi / Σ Pj(xi)². Given the data (xi , yi), we compute the total variability (SST), then build a model to explain this variability. SSReg(αj) represents how much of the total variability in Y is explained by the j-th degree term:
SSReg(α1) = α̂1 Σ_{i=1}^{n} yi P1(xi) = 25438.75 (linear term),  SSReg(α2) = α̂2 Σ_{i=1}^{n} yi P2(xi) = 489 (quadratic term),
SSReg(α3) = α̂3 Σ_{i=1}^{n} yi P3(xi) = 1.15 (cubic term),  SST = Σ_{i=1}^{n} (yi − ȳ)² = 26018.
As we can see, the linear term explains the major part of the variability, and α2 is also significant (by its F-value), while the cubic term is negligible. Thus the final model is the quadratic fit ŷ = 136.227 + 2.68x + 0.167x². If the residual SSRes were large, one could check a fourth-degree polynomial fit.
If you are asked to fit a straight line to the data (X, Y ) = (1, 3), (2, 2.5), (2, 1.2), (3, 1), and (6, 4.5). What would you do
about it?
Solution:
From the figure below, we can observe that (6, 4.5) is an influential observation. The regression model has a positive slope
when the influential observation is included and a negative slope when it is not.
Recommendation: Influential observations can be ignored if they are few in number. Some observations between X = 3 and X = 6 would be useful here.
Your friend says he/she has fitted a plane to n = 33 observations on (X1 , X2 , Y ) and his/ her overall regression (given β0 )
is just significant at the α = 0.05 level. You ask him/ her for R2 value, but s/he does not know. You work it out for him/
her based on what s/he has told you.
Solution:
With n = 33 observations and k = 2 regressors, the residual degrees of freedom are n − k − 1 = 30, and F0.05,2,30 = 3.32. "Just significant" means the observed F equals this value. Since F = [R²/k] / [(1 − R²)/(n − k − 1)], we can solve for R²: R² = kF/(kF + n − k − 1) = 6.64/36.64 ≈ 0.18.
Implication: A regression can be statistically significant while explaining only about 18% of the variability in Y, so the overall F-test alone does not guarantee a good fit; R² should be examined as well.
You are given a regression printout that shows a planar to fit X1 , X2 , X3 , X4 , X5 plus an intercept term obtained from a set
of 50 observations. The overall F for regression is ten times as high as the 5% upper-tail F percentage point. How big is R2 ?
Solution:
With n = 50 observations and k = 5 regressors, the residual degrees of freedom are n − k − 1 = 44, and F0.05,5,44 ≈ 2.43, so the observed F ≈ 24.3. Since F = [R²/k] / [(1 − R²)/(n − k − 1)], R² = kF/(kF + n − k − 1) ≈ 121.5/165.5 ≈ 0.7343.
Conclusion: 73.43% of the total variability in the response variable is explained by the fitted model.
Consider the simple linear regression model: yi = β0 + β1 xi + ϵi , where the variance of ϵi is proportional to x2i , i.e.,
V (ϵi ) = σ 2 x2i (assumption of constant variance is NOT satisfied).
Solution:
(a) yi = β0 + β1 xi + ϵi  (original model).
Dividing through by xi: yi/xi = β0/xi + β1 + ϵi/xi, i.e., yi′ = β0 xi′ + β1 + ϵi′  (transformed model, with yi′ = yi/xi and xi′ = 1/xi).
Now Var(yi′) = Var(ϵi/xi) = σ² xi²/xi² = σ². Yes, it is a variance-stabilizing transformation.
(b) The slope of the original model becomes an intercept in the transformed model, and vice versa.
(c) Weighted least squares function:
S_OLS(β0, β1) = Σ_{i=1}^{n} (yi − β0 − β1 xi)²  =⇒ Ordinary Least Squares
S_WLS(β0, β1) = Σ_{i=1}^{n} wi (yi − β0 − β1 xi)²  =⇒ Weighted Least Squares
= Σ_{i=1}^{n} (1/xi²)(yi − β0 − β1 xi)² = Σ_{i=1}^{n} (yi/xi − β0/xi − β1)²,
taking weights wi = 1/xi².
Consider the simple linear regression model yt = β0 +β1 xt +ϵt where the errors are generated by second-order autoregressive
process
ϵt = ρ1 ϵt−1 + ρ2 ϵt−2 + zt ,
where zt is an NID(0, σz²) random variable, and ρ1 and ρ2 are autocorrelation parameters. Discuss how the Cochrane-Orcutt iterative procedure could be used in this situation. What transformations would be appropriate for the variables yt and xt? How would you estimate the parameters ρ1 and ρ2?
Solution:
yt = β0 + β1 xt + ϵt, where ϵt = ρ1 ϵt−1 + ρ2 ϵt−2 + zt and zt ~ ind N(0, σz²). [In OLS-based SLR we assume the ϵt are independent N(0, σ²), which is not true here.]
Transform: yt′ = yt − ρ1 yt−1 − ρ2 yt−2 = (β0 + β1 xt + ϵt) − ρ1(β0 + β1 xt−1 + ϵt−1) − ρ2(β0 + β1 xt−2 + ϵt−2)
= β0(1 − ρ1 − ρ2) + β1(xt − ρ1 xt−1 − ρ2 xt−2) + (ϵt − ρ1 ϵt−1 − ρ2 ϵt−2) = β0′ + β1 xt′ + zt.
Now the zt are independent N(0, σz²). But (yt′, xt′) cannot be used directly since ρ1 and ρ2 are unknown, so the Cochrane-Orcutt procedure estimates them iteratively:
• Fit the OLS regression of yt on xt and obtain the residuals et.
• Regress ei on ei−1 and ei−2, i.e., ei = ρ1 ei−1 + ρ2 ei−2 + zi (an MLR with two regressors).
• Equivalently, compute S(ρ1, ρ2) = Σ (ei − ρ1 ei−1 − ρ2 ei−2)² and minimize it to obtain ρ̂1 and ρ̂2:
∂S/∂ρ1 = 0 ⇒ Σ (ei − ρ1 ei−1 − ρ2 ei−2) ei−1 = 0,  ∂S/∂ρ2 = 0 ⇒ Σ (ei − ρ1 ei−1 − ρ2 ei−2) ei−2 = 0.
These generate the least squares estimates ρ̂1 and ρ̂2.
• Form yt′ = yt − ρ̂1 yt−1 − ρ̂2 yt−2 and xt′ = xt − ρ̂1 xt−1 − ρ̂2 xt−2 and apply OLS to the transformed model yt′ = β0′ + β1 xt′ + zt, where the zt are independent N(0, σz²). Iterate until the estimates stabilize.
The following 24 residuals from a straight-line fit are equally spaced and given in time sequential order. Is there any
evidence of lag-1 serial correlation?
8, −5, 7, 1, −3, −6, 1, −2, 10, 1, −1, 8, −6, 1, −6, −8, 10, −6, 9, −3, 3, −5, 1, −9
Solution:
The lag-1 autocorrelation of the residuals ei, i = 1, . . . , 24, is ρ = Cor(eu, eu+1); if ρ ≠ 0, there is autocorrelation.
We test H0 : ρ = 0 vs H1 : ρ ≠ 0, so H0 says there is no lag-1 autocorrelation. To perform this test, we compute the Durbin-Watson test statistic:
d = Σ_{u=2}^{24} (eu − eu−1)² / Σ_{u=1}^{24} eu² = 2225/834 = 2.67 ⇒ 4 − d = 1.33.
Now compare with the dL and dU values from the Durbin-Watson table. For α = 0.025 (two-sided test), n = 24 and k = 1 (a straight-line fit with one regressor): dL = 1.16, dU = 1.33.
• Reject H0 if d < dL or 4 − d < dL. Here d = 2.67 and 4 − d = 1.33 both exceed dL = 1.16, so we do not reject on this criterion.
• Accept H0 if d > dU and 4 − d > dU. Here d = 2.67 > dU = 1.33 and 4 − d = 1.33 sits exactly at dU, so we (marginally) accept H0: there is no evidence of lag-1 serial correlation.
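The statistic is easy to recompute; a short Python sketch (not part of the original solution):

```python
e = [8, -5, 7, 1, -3, -6, 1, -2, 10, 1, -1, 8, -6, 1, -6, -8,
     10, -6, 9, -3, 3, -5, 1, -9]

# Durbin-Watson: sum of squared successive differences over sum of squares.
num = sum((e[u] - e[u - 1]) ** 2 for u in range(1, len(e)))
den = sum(eu**2 for eu in e)
d = num / den
print(num, den, round(d, 2))  # 2225 834 2.67
```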
Estimate the parameters α & β in the non-linear model Y = α + (0.49 − α)e−β(X−8) + ϵ from the following observations:
X 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42
Y 0.490 0.475 0.450 0.433 0.458 0.423 0.407 0.407 0.407 0.405 0.393 0.405 0.400 0.395 0.400 0.390 0.407 0.390
Solution:
The problem is to estimate α and β of the non-linear model using the data. The residual sum of squares can be written as
S(α, β) = Σ_u (yu − f(xu; α, β))² = Σ_u (yu − α − (0.49 − α) e^{−β(xu−8)})².
Linearization: The Taylor series expansion of f(xu; α, β) about the point (α0, β0) is
f(xu; α, β) ≈ f(xu; α0, β0) + (1 − e^{−β0(xu−8)})(α − α0) + [−(0.49 − α0)(xu − 8) e^{−β0(xu−8)}](β − β0)
= fu0 + z1u0 (α − α0) + z2u0 (β − β0)  [a linear function obtained from the nonlinear one by Taylor approximation].
Hence
Yu = fu0 + z1u0 (α − α0) + z2u0 (β − β0) + ϵu
⇒ Yu − fu0 = z1u0 (α − α0) + z2u0 (β − β0) + ϵu  [an MLR model]
⇒ Y0 = Z0 θ0 + ϵ  [in matrix form]
⇒ θ̂0 = (Z0ᵀ Z0)⁻¹ Z0ᵀ Y0 is the least squares estimate,
where
Y0 = (Y1 − f10, . . . , Yn − fn0)ᵀ,  Z0 = [z1u0  z2u0] (n × 2),  θ0 = (α − α0, β − β0)ᵀ,  ϵ = (ϵ1, . . . , ϵn)ᵀ.
Updating (α0, β0) ← (α0, β0) + θ̂0 and repeating gives the iterations:
iteration   αj       βj
0           0.30     0.02
1           0.84     0.10
2           0.3901   0.1004
3           0.3901   0.1016
4           0.3901   0.1016
Problem 47. (Statistician’s Dilemma) Look at these data. I don’t know whether to fit two straight lines, one straight
line, or something else. How can I solve this dilemma?
Solution:
If we attach a dummy variable Z to distinguish the two groups (such that Z = 0 for set A and Z = 1 for set B), we can
look at all 4 possibilities.
Y = (β0 + β1 X) + Z(α0 + α1 X) + ϵ = β0 + β1 X + α0 Z + α1 XZ + ϵ
Thus we have Y = Xβ + ϵ with β = (β0, β1, α0, α1)ᵀ, and β̂ = (XᵀX)⁻¹XᵀY gives Ŷ = 1.142 + 0.506X − 0.0418Z − 0.036XZ.
To calculate the k-th principal component zk = αkᵀx we apply the constrained maximization method. Using a Lagrange multiplier we maximize var(αkᵀx) = αkᵀΣαk subject to αkᵀαk = 1; that is, we maximize the function
αkᵀΣαk − λk(αkᵀαk − 1).
Setting the derivative with respect to αk equal to zero:
d/dαk [ αkᵀΣαk − λk(αkᵀαk − 1) ] = 0
⇒ Σαk − λkαk = 0
⇒ Σαk = λkαk.
This is recognizable as an eigenvector equation, where αk is an eigenvector of Σ and λk is the associated eigenvalue. Since the variance being maximized is var(αkᵀx) = αkᵀΣαk = λk αkᵀαk = λk, we should choose λk to be as big as possible. So, calling λ1 the largest eigenvalue of Σ and α1 the corresponding eigenvector, the solution to
Σα1 = λ1α1
is the 1st principal component of x.
The second PC, α2ᵀx, maximizes α2ᵀΣα2 subject to being uncorrelated with α1ᵀx. The uncorrelatedness constraint can be expressed using any of these equations:
cov(α1ᵀx, α2ᵀx) = α1ᵀΣα2 = α2ᵀΣα1 = α2ᵀλ1α1 = λ1α2ᵀα1 = λ1α1ᵀα2 = 0.
Introducing a second multiplier ϕ for this constraint, we maximize α2ᵀΣα2 − λ2(α2ᵀα2 − 1) − ϕα2ᵀα1. Differentiating with respect to α2 and setting the result equal to zero yields
d/dα2 [ α2ᵀΣα2 − λ2(α2ᵀα2 − 1) − ϕα2ᵀα1 ] = 0
⇒ Σα2 − λ2α2 − ϕα1 = 0.
If we left-multiply this expression by α1ᵀ,
α1ᵀΣα2 − λ2α1ᵀα2 − ϕα1ᵀα1 = 0,
the first two terms vanish by the uncorrelatedness constraint, so ϕ must be zero, and we are left with
Σα2 − λ2α2 = 0.
This is another eigenvalue equation, and the same strategy of choosing α2 to be the eigenvector associated with the second largest eigenvalue yields the second PC of x, namely α2ᵀx.
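For a concrete instance, the eigen-equation Σα = λα can be solved by hand for a 2×2 covariance matrix; the Python sketch below uses the made-up matrix Σ = [[5, 2], [2, 2]]:

```python
import math

# Illustrative 2x2 covariance matrix (symmetric positive definite).
S = [[5.0, 2.0], [2.0, 2.0]]

# Eigenvalues of a symmetric 2x2 matrix via the characteristic polynomial
# lam^2 - tr(S) lam + det(S) = 0.
tr = S[0][0] + S[1][1]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
disc = math.sqrt(tr * tr - 4 * det)
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2  # largest first

# Unit eigenvector for lam1: (S - lam1 I) v = 0 gives v ~ (S12, lam1 - S11).
v = (S[0][1], lam1 - S[0][0])
norm = math.sqrt(v[0] ** 2 + v[1] ** 2)
alpha1 = (v[0] / norm, v[1] / norm)

# Verify the eigen-equation S alpha1 = lam1 alpha1.
Sa = (S[0][0] * alpha1[0] + S[0][1] * alpha1[1],
      S[1][0] * alpha1[0] + S[1][1] * alpha1[1])
print(lam1, lam2, alpha1)  # lam1 = 6, lam2 = 1 for this matrix
```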
Show that ‖β̂Ridge‖ increases as its tuning parameter λ → 0. Does the same property hold for LASSO regression?
Solution
SVD decomposition of X = U_{n×p} D_{p×p} Vᵀ_{p×p}:
β̂Ridge = (XᵀX + λI)⁻¹XᵀY = (V D² Vᵀ + λI)⁻¹ V D Uᵀ Y = V (D² + λI)⁻¹ Vᵀ V D Uᵀ Y = V (D² + λI)⁻¹ D Uᵀ Y.
‖β̂Ridge‖₂² = Yᵀ U D (D² + λI)⁻¹ (D² + λI)⁻¹ D Uᵀ Y = (UᵀY)ᵀ [ D (D² + λI)⁻² D ] (UᵀY) = Σ_{j=1}^{p} dⱼ² (UᵀY)ⱼ² / (dⱼ² + λ)²,
where D (D² + λI)⁻¹ (D² + λI)⁻¹ D is a diagonal matrix with elements dⱼ²/(dⱼ² + λ)².
Therefore, we see that ‖β̂Ridge‖ increases as its tuning parameter λ → 0. Recall the dual form of LASSO as defined below:
β̂Lasso = argmin_β Σ_{i=1}^{N} ( yi − β0 − Σ_{j=1}^{p} xij βj )²  subject to  Σ_{j=1}^{p} |βj| ≤ t,
or equivalently, in Lagrangian form,
β̂Lasso = argmin_β (1/2) Σ_{i=1}^{N} ( yi − β0 − Σ_{j=1}^{p} xij βj )² + λ Σ_{j=1}^{p} |βj|.
It is easy to see that t and λ have an inverse relationship; therefore, as λ → 0, t increases and so does the norm of the optimal solutions.
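The diagonal form of the ridge solution makes the monotonicity easy to see numerically; a Python sketch with made-up singular values d and rotated responses UᵀY (both arrays are illustrative, not from any data set):

```python
import math

d = [3.0, 1.0, 0.5]      # illustrative singular values of X
uty = [2.0, -1.0, 0.5]   # illustrative components of U^T Y

def ridge_norm(lam):
    # ||beta_ridge||_2 = sqrt( sum_j d_j^2 (U^T Y)_j^2 / (d_j^2 + lam)^2 )
    return math.sqrt(sum((dj * uj) ** 2 / (dj * dj + lam) ** 2
                         for dj, uj in zip(d, uty)))

norms = [ridge_norm(lam) for lam in (10.0, 1.0, 0.1, 0.0)]
print(norms)  # strictly increasing as lambda decreases toward 0
```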
Consider a two-class logistic regression problem with x ∈ R. Characterize the maximum-likelihood estimates of the slope
and intercept parameter if the sample xi for the two classes are separated by a point x0 ∈ R. Generalize this result to (a)
x ∈ Rp and (b) more than two classes.
Solution
Without loss of generality, suppose that x0 = 0 and that the coding is y = 1 for xi > 0 and y = 0 for xi < 0. Now, suppose that
p(x; β) = exp{βx + β0} / (1 + exp{βx + β0}),
so that
1 − p(x; β) = 1 / (1 + exp{βx + β0}).
Since x0 = 0 is the boundary, p(x0) = 1 − p(x0) = 1/2, which forces β0 = 0. Therefore
p(x; β) = exp{βx} / (1 + exp{βx})  and  1 − p(x; β) = 1 / (1 + exp{βx}).
Therefore, the likelihood function is
L(β; y, x) = Π_{i=1}^{N} p(xi; β)^{yi} [1 − p(xi; β)]^{1−yi} = Π_{i=1}^{N} [ p(xi; β)/(1 − p(xi; β)) ]^{yi} [1 − p(xi; β)] = Π_{i=1}^{N} exp{βxi}^{yi} / (1 + exp{βxi}).
Taking the derivative of the log-likelihood with respect to β and substituting in the proper coding of yi gives
dl(β; x, y)/dβ = Σ_{i=1}^{N} xi ( yi − exp{βxi}/(1 + exp{βxi}) )
= Σ_{xi>0} xi ( 1 − exp{βxi}/(1 + exp{βxi}) ) − Σ_{xi<0} xi exp{βxi}/(1 + exp{βxi})
= Σ_{xi>0} xi − Σ_{xi>0} xi exp{βxi}/(1 + exp{βxi}) − Σ_{xi<0} xi exp{βxi}/(1 + exp{βxi}).
Setting this to zero gives the likelihood equation
Σ_{xi>0} xi = Σ_{i=1}^{N} xi exp{βxi}/(1 + exp{βxi}).
Clearly, for any data set {xi}_{i=1}^{N} we must have β → ∞ for the above equality to hold: the right side approaches the left side only in the limit, so the MLE of the slope diverges.
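This divergence is easy to see numerically; a Python sketch on a made-up perfectly separated sample shows the log-likelihood increasing without bound in β:

```python
import math

x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]  # perfectly separated at x0 = 0

def loglik(beta):
    # l(beta) = sum_i [ y_i * beta * x_i - log(1 + exp(beta * x_i)) ]
    return sum(yi * beta * xi - math.log(1.0 + math.exp(beta * xi))
               for xi, yi in zip(x, y))

values = [loglik(b) for b in (1.0, 5.0, 20.0)]
print(values)  # strictly increasing toward 0: the MLE of the slope diverges
```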
(b) Now, suppose that there are K classes such that x1 separates classes one and two, x2 separates classes two and three, and so on, up to xK−1, which separates classes K − 1 and K, with −∞ = x0 < x1 < x2 < . . . < xK−1 < xK = ∞. Now, define probabilities
p1(x; β) = exp{β1 x + β01} / (1 + Σ_{j=1}^{K−1} exp{βj x + β0j}),
p2(x; β) = exp{β2 x + β02} / (1 + Σ_{j=1}^{K−1} exp{βj x + β0j}),
. . .
pK(x; β) = 1 / (1 + Σ_{j=1}^{K−1} exp{βj x + β0j}).
Now, suppose that the coding is yi = 1 if xj−1 < xi < xj (observation i falls in class j) and yi = 0 otherwise, for observations i = 1, . . . , N and classes j = 1, . . . , K. Therefore, the likelihood function is
L(β; y, x) = Π_{j=1}^{K} Π_{i=1}^{Nj} [ pj(xi; β) ]^{yi},
where Nj is the number of observations in class j, so that the log-likelihood function is
l(β; y, x) = Σ_{j=1}^{K−1} Σ_{i=1}^{Nj} yi log[ exp{βj xi + β0j} / (1 + Σ_{j′=1}^{K−1} exp{βj′ xi + β0j′}) ] + Σ_{i=1}^{NK} yi log[ 1 / (1 + Σ_{j′=1}^{K−1} exp{βj′ xi + β0j′}) ]
= Σ_{j=1}^{K−1} Σ_{i=1}^{Nj} yi [βj xi + β0j] − Σ_{j=1}^{K} Σ_{i=1}^{Nj} yi log( 1 + Σ_{j′=1}^{K−1} exp{βj′ xi + β0j′} ).
Now, we determine the values of β0j. First, note that β0j is a function of βj, xj−1, and xj. So that the expression p(x; β) maintains the proper form for xj−1 < x < xj, we define
β0j = log[ exp{βj xj−1} − exp{βj xj} ].
The reason for this choice is that, for example, when x ∈ (x1, x2), so that x classifies to class two, the probability function then peaks on that interval (the accompanying figure, drawn assuming x1 = 1 and x2 = 2, illustrates this).
Now, taking the derivative with respect to $\beta = (\beta_1, \ldots, \beta_{K-1})$ and substituting in the proper coding of $y_i$ gives
$$\frac{dl(\beta; x, y)}{d\beta_j} = \sum_{i=1}^{N_j} x_i + \sum_{i=1}^{N_j} \frac{\exp\{\beta_j x_{j-1}\}\,x_{j-1} - \exp\{\beta_j x_j\}\,x_j}{\exp\{\beta_j x_{j-1}\} - \exp\{\beta_j x_j\}} - \sum_{i=1}^{N_j}\left(x_i + \frac{\exp\{\beta_j x_{j-1}\}\,x_{j-1} - \exp\{\beta_j x_j\}\,x_j}{\exp\{\beta_j x_{j-1}\} - \exp\{\beta_j x_j\}}\right)\frac{\exp\{\beta_j x_i + \beta_{0j}\}}{1+\sum_{k=1}^{K-1}\exp\{\beta_k x_i + \beta_{0k}\}}.$$
Note that the term
$$\frac{\exp\{\beta_j x_{j-1}\}\,x_{j-1} - \exp\{\beta_j x_j\}\,x_j}{\exp\{\beta_j x_{j-1}\} - \exp\{\beta_j x_j\}}$$
is a constant in the sum over $i = 1, \ldots, N_j$. Therefore, setting the derivative equal to zero for each $j = 1, \ldots, K-1$ and solving for $\beta_j$ shows, exactly as in the two-class case, that the maximum likelihood estimate satisfies $\beta_j \to \infty$.
Some Useful Formulae
• Standard Normal density:
$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \quad \text{for } x \in \mathbb{R}$$
• Beta density with parameters $\alpha$ and $\beta$:
$$f(x) = \begin{cases} \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1} & \text{if } 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}$$
$$E[X] = \frac{\alpha}{\alpha+\beta}, \qquad \operatorname{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$
$\alpha = 1$ and $\beta = 1$ gives the Uniform distribution on $(0, 1)$.
• Gamma distribution with parameters $r > 0$ and $\lambda > 0$:
$$f(x) = \begin{cases} \dfrac{\lambda^r}{\Gamma(r)}\, x^{r-1} e^{-\lambda x} & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \qquad E[X] = \frac{r}{\lambda}, \quad \operatorname{Var}(X) = \frac{r}{\lambda^2}$$
$$M_X(t) = \left(\frac{\lambda}{\lambda-t}\right)^r \quad \text{for } t < \lambda.$$
$r = 1$ gives the Exponential distribution with mean $1/\lambda$.
• Exponential distribution with parameter $\lambda > 0$:
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \qquad E[X] = \frac{1}{\lambda}, \quad \operatorname{Var}(X) = \frac{1}{\lambda^2}$$
$$F_X(x) = \begin{cases} 1 - e^{-\lambda x} & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases}$$
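The Gamma moments quoted above are easy to verify by simulation. The following sketch (my own check, not part of the formula sheet) uses numpy, whose generator is parameterized by shape and scale, so the scale is $1/\lambda$ in the rate parameterization used here.

```python
import numpy as np

# Monte Carlo check of the Gamma(r, lambda) moments quoted above.
# numpy uses shape/scale, so scale = 1/lambda in this rate parameterization.
r, lam = 3.0, 2.0
rng = np.random.default_rng(42)
samples = rng.gamma(shape=r, scale=1.0 / lam, size=1_000_000)

print(samples.mean())  # should be close to r/lam      = 1.5
print(samples.var())   # should be close to r/lam**2   = 0.75
```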
Some Useful Distributional Properties
1. If $Y \sim N(\mu, \sigma^2)$, then
$$Z = \frac{Y-\mu}{\sigma} \sim N(0, 1)$$
2. $Y \sim N(0, 1) \implies Y^2 \sim \chi^2(1)$
3. $Y \sim N(\mu, \sigma^2) \implies aY + b \sim N(a\mu + b, a^2\sigma^2)$
4. $Y \sim U(0, 1) \implies -\ln Y \sim \text{exponential}(1)$
Special case: $\alpha_i = 1$ for $i = 1, 2, \ldots, n$. Then $Y_1, Y_2, \ldots, Y_n \overset{\text{iid}}{\sim} \text{exponential}(\beta) \implies \sum Y_i \sim \text{gamma}(n, \beta)$
Special case: $\alpha_i = \nu_i/2$, $\beta = 2$. Then $Y_i \sim \chi^2(\nu_i)$, $i = 1, 2, \ldots, n$ (mutually independent) $\implies \sum Y_i \sim \chi^2\left(\sum \nu_i\right)$
Combination: If $Y_1, Y_2, \ldots, Y_n \overset{\text{iid}}{\sim} \text{exponential}(\beta)$, then
$$\frac{2\sum Y_i}{\beta} \sim \chi^2(2n)$$
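The exponential-to-chi-square combination above can be checked by simulation. In this sketch (my own check; here exponential($\beta$) means mean $\beta$, matching the scale parameterization of the surrounding special cases), the statistic $2\sum Y_i/\beta$ should have mean $2n$ and variance $4n$, the moments of a $\chi^2(2n)$ variable.

```python
import numpy as np

# Monte Carlo check: for Y_1, ..., Y_n iid exponential with mean beta,
# 2*sum(Y_i)/beta behaves like chi-square(2n): mean 2n, variance 4n.
n, beta = 5, 3.0
rng = np.random.default_rng(7)
y = rng.exponential(scale=beta, size=(200_000, n))
stat = 2.0 * y.sum(axis=1) / beta

print(stat.mean())  # should be close to 2n  = 10
print(stat.var())   # should be close to 4n  = 20
```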
13. $Y_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, 2, \ldots, n$ (mutually independent)
$$\implies \sum a_i Y_i \sim N\left(\sum a_i \mu_i, \sum a_i^2 \sigma_i^2\right)$$
Special case: $\mu_i = \mu$ and $\sigma_i^2 = \sigma^2$ for $i = 1, 2, \ldots, n$. Then $Y_1, Y_2, \ldots, Y_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$
$$\implies \sum a_i Y_i \sim N\left(\mu \sum a_i, \sigma^2 \sum a_i^2\right)$$
Special case of iid result: If $Y_1, Y_2, \ldots, Y_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$, then $\bar{Y} \sim N(\mu, \sigma^2/n)$
Special case of iid result: If $Y_1, Y_2, \ldots, Y_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$, then $\sum Y_i \sim N(n\mu, n\sigma^2)$
14. If $Y_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, 2, \ldots, n$ (mutually independent), then
$$Z_i = \frac{Y_i - \mu_i}{\sigma_i} \sim N(0, 1)$$
for $i = 1, 2, \ldots, n$. Therefore, $U = \sum Z_i^2 \sim \chi^2(n)$ because $Z_1^2, Z_2^2, \ldots, Z_n^2$ are iid $\chi^2(1)$.
15. $Y_1, Y_2, \ldots, Y_n \overset{\text{iid}}{\sim} \text{geometric}(p) \implies U = \sum Y_i \sim \text{nib}(n, p)$ (negative binomial)
16. $Y_1, Y_2 \overset{\text{iid}}{\sim} N(0, 1) \implies U = Y_1/Y_2 \sim \text{Cauchy}$
17. $Y_1, Y_2, \ldots, Y_n \overset{\text{iid}}{\sim} \text{exponential}(\beta) \implies Y_{(1)} \sim \text{exponential}(\beta/n)$
18. $Y_1, Y_2, \ldots, Y_n \overset{\text{iid}}{\sim} \text{Weibull}(m, \alpha) \implies Y_{(1)} \sim \text{Weibull}(m, \alpha/n)$
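Property 16 can be illustrated numerically. Since the Cauchy distribution has no mean, the sketch below (my own check, not part of the list) compares quantiles instead: a standard Cauchy has median $0$ and quartiles $\pm 1$.

```python
import numpy as np

# Monte Carlo illustration of property 16: the ratio of two independent
# standard normals is standard Cauchy (median 0, quartiles -1 and +1).
rng = np.random.default_rng(3)
u = rng.normal(size=1_000_000) / rng.normal(size=1_000_000)
q25, q50, q75 = np.quantile(u, [0.25, 0.5, 0.75])

print(q50)        # should be close to  0
print(q25, q75)   # should be close to -1 and +1
```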
19. If $Z \sim N(0, 1)$, $W \sim \chi^2(\nu)$, and $Z \perp\!\!\!\perp W$, then
$$T = \frac{Z}{\sqrt{W/\nu}} \sim t(\nu)$$
20. $Y_1, Y_2, \ldots, Y_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$
$$\implies \frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1)$$
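Property 20 is also easy to verify by simulation. In this sketch (my own check; the values of $n$, $\mu$, $\sigma$ are arbitrary), $(n-1)S^2/\sigma^2$ should match the moments of $\chi^2(n-1)$: mean $n-1$ and variance $2(n-1)$.

```python
import numpy as np

# Monte Carlo check of property 20: (n-1)S^2/sigma^2 has the moments of
# a chi-square with n-1 degrees of freedom (mean n-1, variance 2(n-1)).
n, mu, sigma = 10, 5.0, 2.0
rng = np.random.default_rng(1)
y = rng.normal(mu, sigma, size=(200_000, n))
stat = (n - 1) * y.var(axis=1, ddof=1) / sigma**2  # ddof=1 gives S^2

print(stat.mean())  # should be close to n-1      = 9
print(stat.var())   # should be close to 2(n-1)   = 18
```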
21. If $W_1 \sim \chi^2(\nu_1)$, $W_2 \sim \chi^2(\nu_2)$, and $W_1 \perp\!\!\!\perp W_2$, then
$$F = \frac{W_1/\nu_1}{W_2/\nu_2} \sim F(\nu_1, \nu_2)$$
Moreover, if $W \sim F(\nu_1, \nu_2)$, then
$$\frac{(\nu_1/\nu_2)\, W}{1 + (\nu_1/\nu_2)\, W} \sim \text{beta}(\nu_1/2, \nu_2/2)$$