
MATH 412 QUIZ 3

28/11/2022

Dear students, the usual rules apply: you cannot leave the class during the test, move from your desk, or talk to your colleagues. Mobile phones switched off; a scientific calculator is fine. You may ask for additional sheets by raising your hand while remaining at your desk. Please write your name and ID number on the first
sheet and hand back all your sheets numbered k/N with k = 1, 2, ..., N.

1) The following set of experimental results is given, leading to a least squares estimation problem.

We wish to identify the best hyperplane $X\beta$ approximating $y$, i.e. the one that minimizes the sum of
the squared errors $\sum_t e_t^2$, $t = 1, 2, \ldots, 6$, where $\beta \in \mathbb{R}^3$, $X \in \mathbb{R}^{6 \times 3}$, $y \in \mathbb{R}^6$.

a. Formulate the least squares minimization problem associated with this set of data,
leading to the least squares estimate $\hat{\beta}$. In particular: specify the objective function,
define the errors and the associated data sets.
Write down the first order optimality conditions leading to the definition of $\hat{\beta}$.
[Hint: you may solve the minimization problem, but notice that you are not expected to
do so. Indeed I can anticipate that the solution is $\hat{\boldsymbol{\beta}} = (0.6032,\ 0.2677,\ 1.2841)^T$, resulting in a
minimum $\sum_t e_t^2 = 7.6721$.]
b. Clarify the mathematical motivation behind the LSE method and the general
assumptions under which this quadratic programming problem arises.
[Hint: I expect here a link between the definition of the errors, the solution of the
unconstrained quadratic program and the rank of the matrix X.]
c. Given the LSE solution, what can you say regarding the linear dependence of $y$ on
$X_1$, $X_2$?
Which type of probability distribution is associated, for $t = 1, 2, \ldots, 6$, with the errors $e_t$?
Focus only on the mean and the variance. What can we say instead for $T \le 3$?
[Hint: notice that here you are expected to cover 3 possibilities, namely 6 data
points as above, but also $T = 3$ and $T < 3$. For the first case you have been working
above: think of the difference between the observed data and the fitting
hyperplane. For $T = 3$ and $T < 3$, will there be an error to specify? Then the answer follows.]

[Grades: 1.a: 15%, 1.b: 15%, 1.c: 15%; total 45 %]

Solution

a) We have $e_t = y_t - \hat{y}_t$, where $y_t = \beta_0 + \beta_1 X_{1,t} + \beta_2 X_{2,t} + e_t$ are the observed values and $\hat{y}_t = X_t \hat{\beta} = \hat{\beta}_0 + \hat{\beta}_1 X_{1,t} + \hat{\beta}_2 X_{2,t}$ are the predicted values. So the problem to be solved is
$$\min_{\beta} \sum_{t \le 6} e_t^2 = \sum_t (y_t - \hat{y}_t)^2 = \sum_t \big( y_t - [\hat{\beta}_0 + \hat{\beta}_1 X_{1,t} + \hat{\beta}_2 X_{2,t}] \big)^2 .$$
This can also be given in matrix form as
$$\min_{\beta} \|e\|^2 = \min_{\beta} \|y - X\beta\|^2 = \min_{\beta} (y - X\beta)^T (y - X\beta).$$

And we can derive the first order optimality condition (gradient = 0) in matrix form as
$$2X^T X \beta - 2X^T y = 0,$$
from which the solution $\hat{\beta} = (\hat{\beta}_0\ \hat{\beta}_1\ \hat{\beta}_2)^T = (X^T X)^{-1} X^T y$.

In our case the solution is $\hat{\boldsymbol{\beta}} = (0.6032,\ 0.2677,\ 1.2841)^T$, resulting in a minimum $\sum_t e_t^2 = 7.6721$.
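As a quick numerical sketch of the normal equations above (the quiz's data table is not reproduced in this text, so the X1, X2 and y below are synthetic stand-ins; with the actual six observations the same two lines return the $\hat{\beta}$ and residual sum of squares quoted above):

```python
import numpy as np

# Synthetic placeholder data: replace with the six observations from the sheet.
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=6), rng.normal(size=6)
y = 0.6 + 0.3 * X1 + 1.3 * X2 + rng.normal(scale=0.5, size=6)

X = np.column_stack([np.ones(6), X1, X2])      # design matrix, shape (6, 3)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations (X'X) beta = X'y
e = y - X @ beta_hat                           # residual vector
print(beta_hat, e @ e)                         # with the real data: [0.6032 0.2677 1.2841], 7.6721
```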
b) The key mathematical motivation for the LSE method is related to the fact that we have a
matrix of data X of dimension (m, n), in our case (6, 3), with typically $m \ge n$ and
rank = n. Under this assumption the system $X\beta = y$ is in general not solvable, and we then look for the
solution that minimizes the discrepancy: specifically, we look for the $\beta$ such that $X\beta$
has the minimum distance from $y$ within the subspace spanned by the columns of X (the range of X);
indeed the solution is represented by the orthogonal projection of $y$ onto that subspace.
The problem then arises any time a linear system of the classical type $Ax = b$ has no solution
because $m > n$ and the rank of A is at most n.
c) The key assumption of the LSE method is that the errors are normally distributed with mean
0 and given variance. This is the classical case above. When, however, the number of
observations equals the number of coefficients to estimate ($T = 3$), the linear system is
determined under the assumption of linear independence and there is no error at all.
When finally we have $T < 3$, the system has $3 - \operatorname{rank}(X)$ degrees of freedom and again no error at all.
From the coefficients we see that there is a positive effect of $X_1$ and $X_2$ on $y$ and that,
independently of $X_1$ and $X_2$, the $y$ variable is in any case expected to evolve along the
constant value 1.2841, which can be regarded as the part of $y$ left unexplained by the factors
included in the linear model.
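Returning to the determined case $T = 3$ discussed above, a minimal numerical sketch (with made-up numbers, purely illustrative): when the $3 \times 3$ design matrix has full rank, the system $X\beta = y$ is solved exactly and every residual vanishes.

```python
import numpy as np

# T = 3 observations, 3 coefficients: the fit interpolates the data exactly.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(3), rng.normal(size=3), rng.normal(size=3)])
y = rng.normal(size=3)

beta = np.linalg.solve(X, y)   # determined system, no least squares needed
print(y - X @ beta)            # residuals ≈ [0. 0. 0.]
```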

& & &

2) Consider the following minimization problem: $\min_{\mathbf{x}} f(\mathbf{x}) = x_1^2 + 3x_2^2 + x_3^2 - 3x_1$ subject to
two equality constraints, $h_1(\mathbf{x}): 2x_1 + x_3 = 2$ and $h_2(\mathbf{x}): x_2 - x_1 = 3$, with $\mathbf{x} \in \mathbb{R}^3$,
$\mathbf{x} = (x_1, x_2, x_3)^T$. [Hint: this is a quadratic program with linear constraints. It is not
however in the form $\min_x \tfrac{1}{2} x^T Q x$ s.t. $Ax = b$, due to the linear term $-3x_1$ in the objective. It
may be extended to the classical case we have seen in class, but I suggest you just develop the
Lagrange method.]
a. Define the Lagrange function for this problem and derive the first order necessary
optimality conditions. Why, in general, are these first order conditions necessary
but not sufficient for a minimum (or a maximum), while they are sufficient in this
case (for this class of problems)? [Hint: for the sufficiency, elaborate on the fact
that both the objective and the constraints are convex functions.]
b. Specify the Jacobian matrix for this problem and show that any point $\mathbf{x} = (x_1, x_2, x_3)^T \in \mathbb{R}^3$
is a regular point on the constraint surface. Why is this condition
necessary to establish a candidate $\mathbf{x}^* = \arg\min\{f(\mathbf{x}) \ \text{s.t.}\ h(\mathbf{x}) = 0\}$?
[Hint: this question is a direct consequence of the Lagrange theorem in $\mathbb{R}^n$ and
focuses on its assumptions.]
c. The function has a unique minimum given by $\mathbf{x}^* = (-0.4375,\ 2.5625,\ 2.875)^T$. Evaluate the
objective function at this point and derive the optimal Lagrange multipliers. Verify
the optimality of $\mathbf{x}^*$. [Hint: verifying the optimality amounts to determining the
Lagrange multipliers and checking that, with the given set of optimal vectors $\mathbf{x}^*$ and
$\boldsymbol{\lambda}^*$, the FONC are verified.]
d. Consider now the same objective function $\min_{\mathbf{x}} f(\mathbf{x}) = x_1^2 + 3x_2^2 + x_3^2 - 3x_1$ but
without any constraint. Determine the optimal decision vector $\mathbf{x}^*$ and the optimal
$f(\mathbf{x}^*)$. Is this optimal value unique? [Hint: the solution is immediate once you
write down the first order conditions. Notice that the function is quadratic and,
without the linear term, the optimal x would be the origin.]
e. Compare the solution of the constrained problem with that of the unconstrained
problem: when analysing the optimal values of the objective function and the
optimal vectors $\mathbf{x}^*$, how would you interpret the effect of the Lagrange multipliers
on the optimal solution of the constrained case? [Hint: you have all the
information from the previous answers: this should be easy to answer. Focus on
the definition of the dual multipliers as shadow prices and as sensitivity coefficients
for constraint violation.]

[Grades: 2.a: 15%, 2.b:15%, 2.c:10%; 2.d:10%, 2.e:10%, total 60%]

Solution

a) We have two constraints, so the problem is in the form $\min_x f(x)$ s.t. $h_1(x) = 0$, $h_2(x) = 0$. The
Lagrange function is
$$L(\mathbf{x}, \boldsymbol{\lambda}) = x_1^2 + 3x_2^2 + x_3^2 - 3x_1 + \lambda_1 (2 - 2x_1 - x_3) + \lambda_2 (3 - x_2 + x_1).$$
We derive the FONC:
$$\frac{\partial L(x, \lambda)}{\partial x_1} = 2x_1 - 3 - 2\lambda_1 + \lambda_2 = 0$$
$$\frac{\partial L(x, \lambda)}{\partial x_2} = 6x_2 - \lambda_2 = 0$$
$$\frac{\partial L(x, \lambda)}{\partial x_3} = 2x_3 - \lambda_1 = 0$$
$$\frac{\partial L(x, \lambda)}{\partial \lambda_1} = 2 - 2x_1 - x_3 = 0$$
$$\frac{\partial L(x, \lambda)}{\partial \lambda_2} = 3 + x_1 - x_2 = 0$$
These conditions are in general not sufficient to identify an extremum, because they would
also be satisfied at a saddle point: only second order information would then
clarify whether the associated Hessian is indefinite or rather positive or negative semidefinite.
In this case, however, the problem being quadratic with linear constraints, it is a convex
program with both the objective function and the constraints convex; in such a case the
conditions are also sufficient.
b) Consider the two constraints: the Jacobian matrix is the (2, 3) matrix with the gradients of
$h_1(x)$, $h_2(x)$ (written as in the Lagrangian above, i.e. $2 - 2x_1 - x_3$ and $3 - x_2 + x_1$) as rows:
$$D\mathbf{h}(\mathbf{x}) = \begin{bmatrix} \frac{\partial h_1}{\partial x_1} & \frac{\partial h_1}{\partial x_2} & \frac{\partial h_1}{\partial x_3} \\ \frac{\partial h_2}{\partial x_1} & \frac{\partial h_2}{\partial x_2} & \frac{\partial h_2}{\partial x_3} \end{bmatrix} = \begin{bmatrix} -2 & 0 & -1 \\ 1 & -1 & 0 \end{bmatrix}$$
We see that the two gradients (rows of the Jacobian) are linearly independent and, the constraints being
linear, do not depend on x. Then any $x \in \mathbb{R}^3$ is regular: accordingly, the gradient of
the objective function can be expressed as a linear combination of the two constraints'
gradients, which is the Lagrange theorem. Indeed:
$$D f(\mathbf{x}^*) + \boldsymbol{\lambda}^{*T} D\mathbf{h}(\mathbf{x}^*) = \mathbf{0}^T, \quad \text{thus} \quad \nabla f(\mathbf{x}^*) = -D\mathbf{h}(\mathbf{x}^*)^T \boldsymbol{\lambda}^*,$$
which requires the rows of the Jacobian to be linearly independent.
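A one-line numerical confirmation of the regularity argument (illustrative sketch): the Jacobian is constant, and its rank is 2 for every $\mathbf{x}$.

```python
import numpy as np

# Constant Jacobian of the two linear constraints (rows as in the Lagrangian above).
Dh = np.array([[-2.0,  0.0, -1.0],
               [ 1.0, -1.0,  0.0]])
print(np.linalg.matrix_rank(Dh))   # 2 -> the constraint gradients are linearly independent
```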
c) The problem has a unique solution that can be derived directly from the FONC, once you see
that they form a linear system $Az = b$ in the 5 unknowns $z = (x_1, x_2, x_3, \lambda_1, \lambda_2)^T$, where (reading off the five equations in 2.a):
$$A = \begin{bmatrix} 2 & 0 & 0 & -2 & 1 \\ 0 & 6 & 0 & 0 & -1 \\ 0 & 0 & 2 & -1 & 0 \\ 2 & 0 & 1 & 0 & 0 \\ -1 & 1 & 0 & 0 & 0 \end{bmatrix}, \qquad b = \begin{bmatrix} 3 \\ 0 \\ 0 \\ 2 \\ 3 \end{bmatrix}.$$
Then
$$z^* = A^{-1} b = (-0.4375,\ 2.5625,\ 2.875,\ 5.75,\ 15.375)^T,$$
i.e. $\mathbf{x}^* = (-0.4375,\ 2.5625,\ 2.875)^T$ with the two optimal multipliers $\lambda_1^* = 5.75$, $\lambda_2^* = 15.375$, and
optimal objective $f(\mathbf{x}^*) = 29.47$.

Since I gave you the optimal coordinates $\mathbf{x}^* = (x_1^*, x_2^*, x_3^*)$, the optimal multipliers can be
recovered directly from the first order conditions written down in question 2.a.

To verify the optimality, it is then sufficient to check that the gradient of f can indeed be
expressed as a linear combination of the gradients of the constraints, with coefficients given
by $\lambda_1^*, \lambda_2^*$.

Indeed:
$$\nabla f \left( \begin{bmatrix} -0.4375 \\ 2.5625 \\ 2.875 \end{bmatrix} \right) = \begin{bmatrix} -3.875 \\ 15.375 \\ 5.75 \end{bmatrix} = 5.75 \begin{bmatrix} 2 \\ 0 \\ 1 \end{bmatrix} + 15.375 \begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}$$
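The whole computation in 2.c can be checked with a few lines of code (a sketch; the 5×5 matrix is a transcription of the FONC of 2.a, not something spelled out on the quiz sheet):

```python
import numpy as np

# FONC of the Lagrangian as a linear system A z = b in z = (x1, x2, x3, l1, l2).
A = np.array([
    [ 2, 0, 0, -2,  1],   # dL/dx1 = 2x1 - 3 - 2*l1 + l2 = 0
    [ 0, 6, 0,  0, -1],   # dL/dx2 = 6x2 - l2 = 0
    [ 0, 0, 2, -1,  0],   # dL/dx3 = 2x3 - l1 = 0
    [ 2, 0, 1,  0,  0],   # h1: 2x1 + x3 = 2
    [-1, 1, 0,  0,  0],   # h2: x2 - x1 = 3
], dtype=float)
b = np.array([3.0, 0.0, 0.0, 2.0, 3.0])

z = np.linalg.solve(A, b)
x, lam = z[:3], z[3:]
f_star = x[0]**2 + 3*x[1]**2 + x[2]**2 - 3*x[0]
print(x, lam, f_star)   # [-0.4375  2.5625  2.875 ] [ 5.75  15.375] ≈ 29.47
```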
d) Here we have only the objective and the FONC simplify to:
$$\frac{\partial f(x)}{\partial x_1} = 2x_1 - 3 = 0, \qquad \frac{\partial f(x)}{\partial x_2} = 6x_2 = 0, \qquad \frac{\partial f(x)}{\partial x_3} = 2x_3 = 0,$$
with stationary point $\mathbf{x}^* = (1.5,\ 0,\ 0)^T$, which is unique since the function is convex, leading to
an optimal value $f(\mathbf{x}^*) = -2.25$.
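A tiny sanity check of the unconstrained case (illustrative):

```python
import numpy as np

# Gradient (2x1 - 3, 6x2, 2x3) vanishes at (1.5, 0, 0), where f = -2.25.
x = np.array([1.5, 0.0, 0.0])
grad = np.array([2*x[0] - 3, 6*x[1], 2*x[2]])
print(grad, x[0]**2 + 3*x[1]**2 + x[2]**2 - 3*x[0])   # [0. 0. 0.] -2.25
```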
e) In the unconstrained case there is a global minimum identified over the
entire $\mathbb{R}^3$. This feasible region is heavily restricted once we introduce the two linear equality
constraints, which specify two planes: the first, $2x_1 + x_3 = 2$, involves only the coordinates
$(x_1, x_3)$, and the second, $x_2 - x_1 = 3$, involves only $(x_1, x_2)$. Now the Lagrange multipliers
tell us what the impact on the optimal value of a violation of a constraint would be. We see that
in this case the second constraint has the more relevant impact on the optimal value.
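The shadow-price reading can also be checked numerically (a sketch reusing the transcription of the KKT system from 2.c): perturbing the right-hand side of constraint $i$ by a small $\varepsilon$ changes the optimal value by approximately $\lambda_i^* \varepsilon$.

```python
import numpy as np

# KKT matrix of the FONC in z = (x1, x2, x3, l1, l2); the last two RHS entries
# are the constraint levels b1 = 2 and b2 = 3.
A = np.array([[ 2, 0, 0, -2,  1],
              [ 0, 6, 0,  0, -1],
              [ 0, 0, 2, -1,  0],
              [ 2, 0, 1,  0,  0],
              [-1, 1, 0,  0,  0]], dtype=float)

def optimal_value(b1, b2):
    z = np.linalg.solve(A, np.array([3.0, 0.0, 0.0, b1, b2]))
    x, lam = z[:3], z[3:]
    return x[0]**2 + 3*x[1]**2 + x[2]**2 - 3*x[0], lam

eps = 1e-4
f0, lam = optimal_value(2.0, 3.0)
f1, _ = optimal_value(2.0 + eps, 3.0)   # relax h1 only
f2, _ = optimal_value(2.0, 3.0 + eps)   # relax h2 only
print((f1 - f0) / eps, (f2 - f0) / eps, lam)   # ≈ 5.75, ≈ 15.375 -> matches lambda*
```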

& & &
