
Chapter 3 - Questions and Solutions

Edwin Fennell

3.1 Prove the least squares optimal solution for the linear regression
case given in Eq. (3.13).

We seek to find the value of θ that minimises


\sum_{n=1}^{N} \left(y_n - \theta^T x_n\right)^2

This is a differentiable function w.r.t. all components of θ. Moreover, note that it is a convex quadratic in θ (a sum of squares), so any stationary point is a global minimum.

Taking the derivative w.r.t θi gives


  
\sum_{n=1}^{N} (x_n)_i \left( 2\sum_{j} (x_n)_j \theta_j - 2y_n \right)

For the value θ̂ of θ for which this is 0 for all i, we have


\sum_{n=1}^{N} \hat{\theta}^T x_n\, (x_n)_i = \sum_{n=1}^{N} y_n (x_n)_i

The right-hand side is the ith element of


\sum_{n=1}^{N} y_n x_n

while the left-hand side is the ith element of


\left( \sum_{n=1}^{N} x_n x_n^T \right) \hat{\theta}

Therefore these two vectors are equal, and rearranging gives the required least squares solution

\hat{\theta} = \left(\sum_{n=1}^{N} x_n x_n^T\right)^{-1} \sum_{n=1}^{N} y_n x_n

(provided the matrix \sum_{n=1}^{N} x_n x_n^T is invertible).
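A quick numerical sanity check of this result (a sketch, not part of the original solution; the dimensions, data and random seed are arbitrary choices): the normal-equations solution coincides with numpy's built-in least-squares solver.

    # Verify 3.1 numerically: solve (sum_n x_n x_n^T) theta_hat = sum_n y_n x_n
    # and compare with numpy's lstsq.
    import numpy as np

    rng = np.random.default_rng(0)
    N, l = 200, 3                              # arbitrary sizes
    X = rng.normal(size=(N, l))                # rows are the x_n
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

    A = X.T @ X                                # sum_n x_n x_n^T
    b = X.T @ y                                # sum_n y_n x_n
    theta_hat = np.linalg.solve(A, b)

    assert np.allclose(theta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
    print(theta_hat)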

3.2 Let θ̂_i, i = 1, 2, ..., m be unbiased estimators of a parameter vector θ, so that E[θ̂_i] = θ, i = 1, 2, ..., m. Moreover, assume that the respective estimators are uncorrelated with each other and that all have the same variance σ² = E[(θ̂_i − θ)^T(θ̂_i − θ)]. Show that by averaging the estimates, e.g.

\hat{\theta} = \frac{1}{m}\sum_{i=1}^{m} \hat{\theta}_i

the new estimator has total variance \hat{\sigma}^2 = E[(\hat{\theta} - \theta)^T(\hat{\theta} - \theta)] = \frac{\sigma^2}{m}.

Trivially, the mean of our unbiased estimators is also an unbiased estimator. The variance of our estimator θ̂ is therefore

E(\hat{\theta}^T \hat{\theta}) - E(\theta)^T E(\theta)

This expands to
  
E\left[\left(\sum_{i=1}^{m}\frac{1}{m}\hat{\theta}_i\right)^T\left(\sum_{j=1}^{m}\frac{1}{m}\hat{\theta}_j\right)\right] - E(\theta)^T E(\theta)

The estimators are all pairwise uncorrelated, which means that the product of the expectations of any two distinct estimators is equal to the expectation of their product. Therefore we can rewrite the above as
 
\sum_{i=1}^{m}\sum_{j=1}^{m}\frac{1}{m^2}\, E(\hat{\theta}_i)^T E(\hat{\theta}_j) \;+\; \sum_{i=1}^{m}\frac{1}{m^2}\left( E(\hat{\theta}_i^T \hat{\theta}_i) - E(\hat{\theta}_i)^T E(\hat{\theta}_i) \right) \;-\; E(\theta)^T E(\theta)

The first term with the double sum is actually just

\left(\sum_{i=1}^{m}\frac{1}{m}\, E(\hat{\theta}_i)\right)^T\left(\sum_{i=1}^{m}\frac{1}{m}\, E(\hat{\theta}_i)\right) = E(\theta)^T E(\theta)

This cancels out with the last term, and we are left with the middle term, which is just 1/m² times the sum of the variances of the initial m estimators, i.e. m · σ²/m² = σ²/m, as required.
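A Monte Carlo illustration of this result (a sketch; all parameters are arbitrary, and each θ̂_i is taken to be the mean of 10 Gaussian draws so that its total variance σ² is known in closed form): the averaged estimator has total variance close to σ²/m.

    # Each "estimator" theta_hat_i is the mean of 10 draws from N(theta, I_2),
    # so sigma^2 = E||theta_hat_i - theta||^2 = 2/10 = 0.2.
    import numpy as np

    rng = np.random.default_rng(1)
    theta, m, trials = np.array([2.0, -1.0]), 5, 50_000

    single = rng.normal(theta, 1.0, size=(trials, m, 10, 2)).mean(axis=2)
    avg = single.mean(axis=1)                  # average of the m estimators

    var_single = ((single[:, 0] - theta) ** 2).sum(axis=1).mean()
    var_avg = ((avg - theta) ** 2).sum(axis=1).mean()
    print(var_single, var_avg, var_single / m)   # var_avg ~ var_single / m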
3.3 Let x be a random variable uniformly distributed on [0, 1/θ], θ > 0. Assume that g is a Lebesgue measurable function on [0, 1/θ]. Show that if θ̂ = g(x) is an unbiased estimator, then

\int_0^{1/\theta} g(x)\, dx = 1
Assume that θ̂ is an unbiased estimator. Then regardless of the value of θ,

\theta = E(\hat{\theta}) = E(g(x)) = \int_0^{1/\theta} g(x)\,\varphi(x)\, dx

where φ is the p.d.f. of x. Since x is uniform, the p.d.f. is constantly equal to θ on [0, 1/θ]. This directly gives

\int_0^{1/\theta} g(x)\, dx = 1

We note that there is in fact no function g(x) such that this holds for all θ > 0. The condition gives

\int_a^b g(x)\, dx = 0 \quad \forall\, 0 < a < b

Therefore

1 = \int_0^{1/\theta} g(x)\, dx = \sum_{i=0}^{\infty} \int_{1/(2^{i+1}\theta)}^{1/(2^{i}\theta)} g(x)\, dx = 0

which is a contradiction.
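As a concrete illustration of this remark (a sketch; the choices θ_0 = 2 and g(x) = 2θ_0²x are arbitrary): this g satisfies ∫_0^{1/θ_0} g(x)dx = 1 and is unbiased at θ = θ_0, but it is biased for any other θ, consistent with the observation that no single g can work for all θ.

    import numpy as np

    rng = np.random.default_rng(2)
    theta0 = 2.0
    g = lambda x: 2 * theta0**2 * x            # g(x) = 8x, tied to theta0

    x = rng.uniform(0, 1 / theta0, size=500_000)
    print(g(x).mean())                         # ~ 2.0 = theta0: unbiased here

    t = np.linspace(0, 1 / theta0, 100_001)    # integral of g over [0, 1/theta0]
    print(g(t).mean() * (1 / theta0))          # ~ 1.0, as required

    x3 = rng.uniform(0, 1 / 3.0, size=500_000) # now take theta = 3 instead
    print(g(x3).mean())                        # ~ 4/3, not 3: biased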

3.4 A family [p(D; θ); θ ∈ A] is called complete if, for any vector function h(D) such that E_D[h(D)] = 0, ∀θ ∈ A, it follows that h = 0. Show that if [p(D; θ); θ ∈ A] is complete and there exists an MVU estimator, then this estimator is unique.

Suppose we have two MVU estimators, θ_1 and θ_2. Then (θ_1 + θ_2)/2 is also unbiased. This estimator has variance

E\left[\left(\frac{\theta_1+\theta_2}{2} - \theta\right)^T\left(\frac{\theta_1+\theta_2}{2} - \theta\right)\right] = \frac{\mathrm{var}(\theta_1)}{4} + \frac{\mathrm{var}(\theta_2)}{4} + \frac{1}{2}\, E\left[(\theta_1-\theta)^T(\theta_2-\theta)\right]

var(θ_1) = var(θ_2) is a lower bound for this variance (both are MVU), which gives us an inequality. A little rearranging, using var(θ_i) = E(θ_i^T θ_i) − θ^T θ and E[(θ_1 − θ)^T(θ_2 − θ)] = E(θ_1^T θ_2) − θ^T θ, gives

E(2\theta_1^T \theta_2) \ge E(\theta_1^T \theta_1) + E(\theta_2^T \theta_2)

We also know that

E(2\theta_1^T \theta_2) \le E(\theta_1^T \theta_1) + E(\theta_2^T \theta_2)

by examining the expectation of the non-negative quantity (θ_1 − θ_2)^T(θ_1 − θ_2). Therefore both of these inequalities hold with equality, and E((θ_1 − θ_2)^T(θ_1 − θ_2)) = 0 ∀θ. By completeness, we have (θ_1 − θ_2)^T(θ_1 − θ_2) = 0 identically, which immediately gives θ_1 = θ_2, so the MVU estimator must be unique.
3.5 Let θ̂_u be an unbiased estimator, so that E[θ̂_u] = θ_0. Define a biased one by θ̂_b = (1 + α)θ̂_u. Show that the range of α where the MSE of θ̂_b is smaller than that of θ̂_u is

-2 < -\frac{2\,\mathrm{MSE}(\hat{\theta}_u)}{\mathrm{MSE}(\hat{\theta}_u) + \theta_0^2} < \alpha < 0

We note that the MSE of θ̂u is

E[(\hat{\theta}_u - \theta_0)^2] = E[\hat{\theta}_u^2] - \theta_0^2

by unbiasedness. Similarly, the MSE of θ̂b is

E[(\hat{\theta}_b - \theta_0)^2] = (1+\alpha)^2 E[\hat{\theta}_u^2] - (1+2\alpha)\theta_0^2 = (1+\alpha)^2\,\mathrm{MSE}(\hat{\theta}_u) + \alpha^2\theta_0^2

Therefore the condition we want holds exactly when

(1+\alpha)^2\,\mathrm{MSE}(\hat{\theta}_u) + \alpha^2\theta_0^2 < \mathrm{MSE}(\hat{\theta}_u)

which rearranges to

\alpha\left(\left(\mathrm{MSE}(\hat{\theta}_u) + \theta_0^2\right)\alpha + 2\,\mathrm{MSE}(\hat{\theta}_u)\right) < 0

This occurs if and only if the two factors α and (MSE(θ̂_u) + θ_0²)α + 2 MSE(θ̂_u) have opposite signs. Note also that α > 0 directly implies (MSE(θ̂_u) + θ_0²)α + 2 MSE(θ̂_u) > 0. Thus our proposed condition holds iff

-\frac{2\,\mathrm{MSE}(\hat{\theta}_u)}{\mathrm{MSE}(\hat{\theta}_u) + \theta_0^2} < \alpha < 0

as required. The leftmost inequality (−2 < · · ·) stems from the fact that

\frac{\mathrm{MSE}(\hat{\theta}_u)}{\mathrm{MSE}(\hat{\theta}_u) + \theta_0^2} < 1

3.6 Show that for the setting of Problem 3.5, the optimal value of α is equal to

\alpha^* = -\frac{1}{1 + \frac{\theta_0^2}{\mathrm{var}(\hat{\theta}_u)}}

We note that our MSE for the biased estimator is a quadratic in α. Therefore we can just pick the unique value of α for which the derivative is 0, and we obtain the minimum possible MSE. The derivative is

2\left(\mathrm{MSE}(\hat{\theta}_u) + \theta_0^2\right)\alpha + 2\,\mathrm{MSE}(\hat{\theta}_u)

Setting this to 0 gives α* = −MSE(θ̂_u)/(MSE(θ̂_u) + θ_0²), which, since MSE(θ̂_u) = var(θ̂_u) for an unbiased estimator, rearranges to the required result.
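A small numerical check of 3.5 and 3.6 together (a sketch; it reuses the MSE expression derived above, and the values of θ_0, var(θ̂_u) and the α-grid are arbitrary): the biased estimator wins exactly on the predicted interval, and the minimum sits at α*.

    import numpy as np

    theta0, var_u = 1.5, 0.4                   # var_u = MSE of the unbiased estimator

    def mse_biased(alpha):
        return (1 + alpha) ** 2 * var_u + alpha ** 2 * theta0 ** 2

    alphas = np.linspace(-1.5, 0.5, 100_001)
    better = mse_biased(alphas) < var_u

    lo_pred = -2 * var_u / (var_u + theta0 ** 2)
    print(alphas[better].min(), alphas[better].max(), lo_pred)   # ~ (lo_pred, 0)

    alpha_star = -1 / (1 + theta0 ** 2 / var_u)
    print(alphas[np.argmin(mse_biased(alphas))], alpha_star)     # agree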


3.7 Show that the regularity condition for the Cramer-Rao lower
bound holds true if the order of differentiation and integration
can be interchanged.

Note that \frac{\partial \log p(x|\theta)}{\partial\theta} = \frac{1}{p(x|\theta)}\cdot\frac{\partial p(x|\theta)}{\partial\theta}. This gives

E\left[\frac{\partial \log p(x|\theta)}{\partial\theta}\right] = \int_{x\in\mathcal{X}} \frac{1}{p(x|\theta)}\cdot\frac{\partial p(x|\theta)}{\partial\theta}\, p(x|\theta)\, dx = \int_{x\in\mathcal{X}} \frac{\partial p(x|\theta)}{\partial\theta}\, dx

where \mathcal{X} is the sample space of x. (We fix θ to a single value throughout - we are using the frequentist concept of a "true" value for θ, so the expectation is taken over x only.)
Now, if we assume that differentiation and integration can be interchanged, then this is equal to

\frac{\partial}{\partial\theta}\int_{x\in\mathcal{X}} p(x|\theta)\, dx
The expression being differentiated is just the integral of the density over the whole sample space, and so is equal to 1. Therefore the derivative is equal to 0, and thus the regularity condition holds, as required.
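A numerical illustration of the regularity condition (a sketch; the Gaussian family N(θ, 1) and the value θ = 0.7 are arbitrary choices): both the expected score and the derivative of the total probability come out as (numerically) zero.

    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    theta = 0.7                                # arbitrary "true" value

    # E[d log p(x|theta) / d theta]; the score of N(theta, 1) is (x - theta)
    score_mean, _ = quad(lambda x: (x - theta) * norm.pdf(x, loc=theta),
                         -np.inf, np.inf)
    print(score_mean)                          # ~ 0

    # d/d theta of the total probability, via central differences
    h = 1e-5
    total = lambda t: quad(lambda x: norm.pdf(x, loc=t), -np.inf, np.inf)[0]
    print((total(theta + h) - total(theta - h)) / (2 * h))   # ~ 0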
3.8 Derive the Cramer-Rao bound for the LS estimator, when the training data result from the linear model

y_n = \theta x_n + \eta_n, \quad n = 1, 2, \ldots, N,

where x_n and η_n are i.i.d. samples of a zero mean random variable with variance σ_x², and a Gaussian random variable with zero mean and variance σ_η², respectively. Assume also that x and η are independent[sic]. Then show that the LS estimator achieves the CR bound only asymptotically.

Assume that the input variable X and output Y are observable and that the elements of x and η are all mutually independent. Therefore our pdf is separable as

f(a, b; \theta) = \prod_{n=1}^{N} p_{x_n}(a_n)\, p_{\eta_n}(b_n - \theta a_n)

where p_{x_n} is the marginal pdf of x_n and p_{η_n} is the marginal pdf of η_n.
Taking the log gives
-\sum_{n=1}^{N} \frac{(b_n - \theta a_n)^2}{2\sigma_\eta^2} + \text{terms independent of } \theta
Differentiating this twice w.r.t θ and multiplying through by -1 yields
\sum_{n=1}^{N} \frac{a_n^2}{\sigma_\eta^2}

The expectation of this quantity (as a multiple integral over all the a_n and b_n) is the Fisher information. We can make our lives easier by making the change of variables c_n = b_n − θa_n and reframing our integral as being over the a_n and c_n instead. This separates the base pdf completely into a_n and c_n terms. Since the quantity we are taking the expectation of only contains a_n terms, the Fisher information immediately simplifies to
\sum_{n=1}^{N} \int_{-\infty}^{\infty} \frac{a_n^2}{\sigma_\eta^2}\, p_{x_n}(a_n)\, da_n

Since the x_n are i.i.d. with zero mean and variance σ_x², this is just equal to

\frac{N\sigma_x^2}{\sigma_\eta^2}

Therefore the Cramer-Rao lower bound is

\frac{\sigma_\eta^2}{N\sigma_x^2}

Given a set of observations x, y, the value of θ̂ that minimises least squares is

\hat{\theta} = \frac{x\cdot y}{x\cdot x} = \frac{x\cdot(\theta x + \eta)}{x\cdot x} = \theta + \frac{x\cdot\eta}{x\cdot x}
This quantity has expectation θ since we can separate out the zero-mean
η from the second term. Therefore the variance is just

E\left[\frac{(x\cdot\eta)^2}{(x\cdot x)^2}\right]

Considering that the elements of η are mutually independent and have expectation zero, we can ignore cross terms in the expansion of the numerator, which allows us to simplify the above expression to

E\left[\frac{\sum_{n=1}^{N}(x_n\eta_n)^2}{(x\cdot x)^2}\right] = \sigma_\eta^2\, E\left[\frac{x\cdot x}{(x\cdot x)^2}\right] = E\left[\frac{\sigma_\eta^2}{x\cdot x}\right]

Note that x · x is the sum of N i.i.d. copies of x_n², and thus by the strong law of large numbers

\frac{x\cdot x}{N} \xrightarrow{\text{a.s.}} \sigma_x^2
where the convergence is almost sure convergence. By the continuous mapping theorem we immediately get

\frac{N}{x\cdot x} \xrightarrow{\text{a.s.}} \frac{1}{\sigma_x^2}
By Jensen's inequality, E[σ_η²/(x·x)] ≥ σ_η²/E[x·x] = σ_η²/(Nσ_x²), with equality only in the degenerate case where x·x is almost surely constant, so in general the LS estimator exceeds the CR bound at finite N. Under the assumption that the family N/(x·x) is uniformly integrable (which I think is true but showing it is effort), the expectations exist and converge to the expectation of the limit, so N·var(θ̂) → σ_η²/σ_x² and the bound is attained asymptotically. TODO - work out how to get from a.s. convergence to L¹ convergence.
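A Monte Carlo check of this conclusion (a sketch; for concreteness it additionally assumes Gaussian x_n, and σ_x, σ_η, θ and the trial count are arbitrary): the variance of the LS estimate stays above σ_η²/(Nσ_x²) and approaches it as N grows.

    import numpy as np

    rng = np.random.default_rng(3)
    theta, sigma_x, sigma_eta, trials = 0.8, 1.5, 0.5, 50_000

    for N in (5, 20, 100):
        x = rng.normal(0.0, sigma_x, size=(trials, N))
        eta = rng.normal(0.0, sigma_eta, size=(trials, N))
        y = theta * x + eta
        theta_hat = (x * y).sum(axis=1) / (x * x).sum(axis=1)   # LS estimate
        var_hat = ((theta_hat - theta) ** 2).mean()
        cr = sigma_eta**2 / (N * sigma_x**2)
        print(N, var_hat, cr, var_hat / cr)    # ratio > 1, tending to 1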
3.9 Let us consider the regression model

y_n = \theta^T x_n + \eta_n, \quad n = 1, 2, \ldots, N

where the noise samples η = [η_1, ..., η_N]^T come from a zero-mean Gaussian random vector with covariance matrix Σ_η. If X is our input matrix and y is our output vector, show that

\hat{\theta} = \left(X^T \Sigma_\eta^{-1} X\right)^{-1} X^T \Sigma_\eta^{-1} y

is an efficient estimate.

Here we treat the x_n as known quantities, and model the η_n as our only source of uncertainty. Using the relation y = Xθ + η we can rewrite θ̂ as

\theta + \left(X^T \Sigma_\eta^{-1} X\right)^{-1} X^T \Sigma_\eta^{-1} \eta

In particular E(θ̂) = θ, so the estimator is unbiased, and the covariance of this expression is just

\left(X^T \Sigma_\eta^{-1} X\right)^{-1}

The likelihood function here is given by

p(y = a) = p(\eta = a - X\theta) = \frac{1}{\left((2\pi)^N |\Sigma_\eta|\right)^{1/2}}\, e^{-\frac{1}{2}(a - X\theta)^T \Sigma_\eta^{-1} (a - X\theta)}

Taking the log, differentiating twice w.r.t. θ and negating, we obtain the Fisher information matrix as simply X^T Σ_η^{-1} X. Since θ̂ is unbiased and its covariance equals the inverse of the Fisher information, it attains the Cramer-Rao bound, i.e. parameter efficiency follows.
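A Monte Carlo check (a sketch; X, Σ_η and all dimensions are arbitrary choices): the empirical covariance of θ̂ matches (X^TΣ_η^{-1}X)^{-1}, the inverse of the Fisher information.

    import numpy as np

    rng = np.random.default_rng(4)
    N, l, trials = 30, 2, 100_000
    X = rng.normal(size=(N, l))
    theta = np.array([1.0, -0.5])

    A = rng.normal(size=(N, N))
    Sigma = A @ A.T + N * np.eye(N)            # an arbitrary SPD noise covariance
    Sigma_inv = np.linalg.inv(Sigma)
    G = np.linalg.inv(X.T @ Sigma_inv @ X) @ X.T @ Sigma_inv

    L = np.linalg.cholesky(Sigma)
    eta = rng.normal(size=(trials, N)) @ L.T   # rows ~ N(0, Sigma)
    theta_hat = (X @ theta + eta) @ G.T        # one estimate per trial

    print(np.cov(theta_hat, rowvar=False))
    print(np.linalg.inv(X.T @ Sigma_inv @ X))  # should be close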
3.10 Assume a set of i.i.d. X = {x_1, x_2, ..., x_N} samples of a random variable[sic] with mean µ and variance σ². Define also the quantities

S_\mu := \frac{1}{N}\sum_{n=1}^{N} x_n

S_{\sigma^2} := \frac{1}{N}\sum_{n=1}^{N} (x_n - S_\mu)^2

\bar{S}_{\sigma^2} := \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu)^2

Show that if µ is considered to be known, a sufficient statistic for σ² is S̄_{σ²}. Moreover, in the case where both (µ, σ²) are unknown, then a sufficient statistic is the pair (S_µ, S_{σ²}).

This is not well-posed in its current form. It does however make sense if we constrain the x_i to be drawn from a Gaussian distribution. We note that the p.d.f. of X is

p(X = x) = \frac{1}{(2\pi\sigma^2)^{N/2}}\, e^{-\frac{(x-\mu)^T(x-\mu)}{2\sigma^2}}

This can be written as

\frac{1}{(2\pi\sigma^2)^{N/2}}\, e^{-\frac{N\bar{S}_{\sigma^2}}{2\sigma^2}}

which is a function of σ² and S̄_{σ²} only. Therefore by the Fisher-Neyman factorisation theorem we have the first result. For the second part, we note that the p.d.f. may alternatively be written as

\frac{1}{(2\pi\sigma^2)^{N/2}}\, e^{-\frac{N\left(S_{\sigma^2} + S_\mu^2 - 2\mu S_\mu + \mu^2\right)}{2\sigma^2}}

which is solely a function of µ, σ², S_µ, and S_{σ²}. Therefore we have the second result by Fisher-Neyman.
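A quick numerical illustration of the factorisation (a sketch, under the Gaussian assumption made above; µ, σ² and N are arbitrary): the log-likelihood computed directly from the data agrees with the expressions that involve the samples only through S̄_{σ²} (known mean), or through (S_µ, S_{σ²}) (unknown mean).

    import numpy as np

    rng = np.random.default_rng(5)
    mu, sigma2, N = 1.3, 0.7, 50
    x = rng.normal(mu, np.sqrt(sigma2), size=N)

    S_mu = x.mean()
    S_sig2 = ((x - S_mu) ** 2).mean()
    S_bar = ((x - mu) ** 2).mean()

    def log_lik(m, s2):
        return -0.5 * N * np.log(2 * np.pi * s2) - ((x - m) ** 2).sum() / (2 * s2)

    # known mean: the data enter only through S_bar
    print(log_lik(mu, sigma2),
          -0.5 * N * np.log(2 * np.pi * sigma2) - N * S_bar / (2 * sigma2))

    # unknown mean: the data enter only through (S_mu, S_sig2)
    m = 0.9                                    # any candidate mean
    print(log_lik(m, sigma2),
          -0.5 * N * np.log(2 * np.pi * sigma2)
          - N * (S_sig2 + S_mu**2 - 2 * m * S_mu + m**2) / (2 * sigma2))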
3.11 Show that solving the task

\text{minimise } L(\theta, \lambda) = \sum_{n=1}^{N}\left(y_n - \theta_0 - \sum_{i=1}^{l}\theta_i x_{ni}\right)^2 + \lambda\sum_{i=1}^{l}|\theta_i|^2

is equivalent to solving the task

\text{minimise } L(\theta, \lambda) = \sum_{n=1}^{N}\left((y_n - \bar{y}) - \sum_{i=1}^{l}\theta_i (x_{ni} - \bar{x}_i)\right)^2 + \lambda\sum_{i=1}^{l}|\theta_i|^2

and the estimate of θ_0 is given by

\hat{\theta}_0 = \bar{y} - \sum_{i=1}^{l}\hat{\theta}_i \bar{x}_i

The derivative of the first objective is necessarily 0 when evaluated at any minimiser θ̂. By evaluating the derivative with respect to θ_0 at the minimiser, we immediately get

\hat{\theta}_0 = \bar{y} - \sum_{i=1}^{l}\hat{\theta}_i \bar{x}_i
Substituting this into the expression to be minimised reveals that minimising over the remaining l components θ_1, ..., θ_l is exactly the centred minimisation of the second task. We already know that the minimiser of our first problem satisfies the stated expression for θ̂_0, and therefore it solves the second task in its entirety.
The second optimisation problem has a unique solution (its objective, written out as a quadratic form, is strictly convex in θ_1, ..., θ_l for λ > 0), so the above implication shows that the first problem also has a unique solution. Thus the two problems are exactly equivalent.
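A numerical check of the equivalence (a sketch; the data, dimension l and λ are arbitrary choices): ridge with an unpenalised intercept gives the same solution as centred ridge followed by θ̂_0 = ȳ − Σ_i θ̂_i x̄_i.

    import numpy as np

    rng = np.random.default_rng(6)
    N, l, lam = 40, 3, 2.5
    X = rng.normal(size=(N, l)) + 5.0          # non-zero column means on purpose
    y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + rng.normal(size=N)

    # First task: minimise over (theta_0, theta), penalising theta only.
    Xa = np.hstack([np.ones((N, 1)), X])
    P = lam * np.eye(l + 1)
    P[0, 0] = 0.0                              # intercept not penalised
    full = np.linalg.solve(Xa.T @ Xa + P, Xa.T @ y)

    # Second task: centred ridge, then recover the intercept.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    theta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(l), Xc.T @ yc)
    theta0 = y.mean() - theta @ X.mean(axis=0)

    print(full)
    print(np.r_[theta0, theta])                # identical values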

3.12 Consider the linear regression task

y_n = \theta_o + \eta_n

where the η_n are i.i.d. zero-mean Gaussian variables of variance σ_η². Show that the ridge regression estimator with parameter λ, θ̂_b(λ), and the minimum variance unbiased estimator, θ̂_MVU, of θ_o satisfy the following: MSE(θ̂_b(λ)) < MSE(θ̂_MVU) provided that

\lambda \in (0, \infty) \quad \text{if } \theta_o^2 \le \frac{\sigma_\eta^2}{N}

\lambda \in \left(0,\; \frac{2\sigma_\eta^2}{\theta_o^2 - \frac{\sigma_\eta^2}{N}}\right) \quad \text{if } \theta_o^2 > \frac{\sigma_\eta^2}{N}

Moreover, show that the minimum MSE performance for the ridge regression estimate is attained at \lambda^* = \frac{\sigma_\eta^2}{\theta_o^2}.

We know that in this case, with i.i.d. Gaussian noise and constant input, the MVU estimator is just ȳ, the mean of the sample outputs. We thus calculate

\mathrm{MSE}(\hat{\theta}_{\mathrm{MVU}}) = E\left[(\bar{y} - \theta_o)^2\right] = E\left[\left(\sum_{n=1}^{N}\frac{\eta_n}{N}\right)^2\right]

Since the η_n are independent and have zero mean, the cross terms disappear and we are left with

\mathrm{MSE}(\hat{\theta}_{\mathrm{MVU}}) = \sum_{n=1}^{N} E\left[\frac{\eta_n^2}{N^2}\right] = N\cdot\frac{\sigma_\eta^2}{N^2} = \frac{\sigma_\eta^2}{N}

Similarly, we know from studying the ridge regression problem that θ̂_b(λ) satisfies

(\lambda + N)\,\hat{\theta}_b(\lambda) = \sum_{n=1}^{N} y_n = N\theta_o + \sum_{n=1}^{N}\eta_n

and therefore

\hat{\theta}_b(\lambda) = \frac{N}{N+\lambda}\,\theta_o + \sum_{n=1}^{N}\frac{\eta_n}{N+\lambda}
We now have

\mathrm{MSE}(\hat{\theta}_b(\lambda)) = E\left[\left(\frac{N}{N+\lambda}\,\theta_o + \sum_{n=1}^{N}\frac{\eta_n}{N+\lambda} - \theta_o\right)^2\right] = E\left[\left(\sum_{n=1}^{N}\frac{\eta_n}{N+\lambda} - \frac{\lambda\theta_o}{N+\lambda}\right)^2\right]

We again use the fact that the η_n are independent and have zero mean to obtain

\mathrm{MSE}(\hat{\theta}_b(\lambda)) = \frac{\lambda^2\theta_o^2 + N\sigma_\eta^2}{(N+\lambda)^2}
Therefore the statement MSE(θ̂_b(λ)) < MSE(θ̂_MVU) is equivalent to

\frac{\lambda^2\theta_o^2 + N\sigma_\eta^2}{(N+\lambda)^2} < \frac{\sigma_\eta^2}{N}
Multiplying through by both denominators reveals this to be equivalent to stating

N\lambda^2\theta_o^2 + N^2\sigma_\eta^2 < (N+\lambda)^2\sigma_\eta^2

or rather

\left(\sigma_\eta^2 - N\theta_o^2\right)\lambda^2 + 2\sigma_\eta^2 N\lambda > 0
Our outline of ridge regression already assumes that λ is positive. We can see by considering the sign of the quadratic coefficient that if θ_o² ≤ σ_η²/N this expression is positive for all λ ∈ (0, ∞). Similarly, if θ_o² > σ_η²/N, the expression is positive exactly for

\lambda \in \left(0,\; \frac{2\sigma_\eta^2}{\theta_o^2 - \frac{\sigma_\eta^2}{N}}\right)
as required. To calculate where the ridge regression estimate achieves minimum MSE, we take the derivative (with the help of the quotient rule) w.r.t. λ:

\frac{d\,\mathrm{MSE}(\hat{\theta}_b(\lambda))}{d\lambda} = \frac{2\lambda\theta_o^2(N+\lambda)^2 - 2\left(\lambda^2\theta_o^2 + N\sigma_\eta^2\right)(N+\lambda)}{(N+\lambda)^4}
This becomes 0 in the case where

\lambda\theta_o^2(N+\lambda) = \lambda^2\theta_o^2 + N\sigma_\eta^2

which directly gives the optimal value of λ as

\lambda^* = \frac{\sigma_\eta^2}{\theta_o^2}
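A Monte Carlo check of this problem (a sketch; θ_o, σ_η, N and the λ-grid are arbitrary, chosen so that θ_o² > σ_η²/N): the empirical MSE matches the formula above, dips below σ_η²/N on the predicted λ-range, and is minimised near λ* = σ_η²/θ_o².

    import numpy as np

    rng = np.random.default_rng(7)
    theta_o, sigma_eta, N, trials = 0.6, 1.0, 10, 100_000
    s = (theta_o + rng.normal(0.0, sigma_eta, size=(trials, N))).sum(axis=1)

    lambdas = np.linspace(0.01, 12.0, 60)
    mse_emp = np.array([((s / (N + lam) - theta_o) ** 2).mean() for lam in lambdas])
    mse_theory = (lambdas**2 * theta_o**2 + N * sigma_eta**2) / (N + lambdas) ** 2

    print(np.max(np.abs(mse_emp - mse_theory)))                  # small
    print(sigma_eta**2 / N)                                      # MSE of the MVU
    print(2 * sigma_eta**2 / (theta_o**2 - sigma_eta**2 / N))    # predicted upper end
    print(lambdas[np.argmin(mse_theory)], sigma_eta**2 / theta_o**2)  # ~ lambda*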

3.13 Consider
