
G0A63A: Optimization and Numerical Methods

Stefan Velev r0924289

2023-01-09

Exercise 1

We consider a simple multinomial setting where

Complete-data cells and probabilities:

  Z11: 3θ1/8    Z12: θ2/4    Z21: 1/2 − θ1/2    Z22: 1/2 − θ2/2    Z3: θ1/8    Z4: θ2/4

Observed-data cells, probabilities and counts:

  Y1 = Z11 + Z12:  3θ1/8 + θ2/4               count 1124
  Y2 = Z21 + Z22:  1/2 − θ1/2 + 1/2 − θ2/2    count 8023
  Y3 = Z3:         θ1/8                       count 131
  Y4 = Z4:         θ2/4                       count 722

a) Derive analytically the complete-data log-likelihood function.

ℓ_c(θ1, θ2) = Σ_{j=1}^{6} Z_j log[π_j^c(θ1, θ2)]
            = Z11 log(3θ1/8) + Z12 log(θ2/4)
              + Z21 log(1/2 − θ1/2) + Z22 log(1/2 − θ2/2)
              + 131 log(θ1/8) + 722 log(θ2/4),

where Z11 and Z12 denote the unobserved split of the Y1 = 1124 observations and Z21 and Z22 the unobserved split of the Y2 = 8023 observations.

b) Derive analytically the observed-data log-likelihood function.

ℓ(θ1, θ2) = Σ_{j=1}^{4} Y_j log[π_j(θ1, θ2)]
          = 1124 log(3θ1/8 + θ2/4) + 8023 log(1/2 − θ1/2 + 1/2 − θ2/2)
            + 131 log(θ1/8) + 722 log(θ2/4)

c) Derive analytically the closed-form solution, if available.

First let us derive the partial derivative with respect to θ1 and set it to 0

∂ℓ/∂θ1 = 3y1/(3θ1 + 2θ2) + y2/(θ1 + θ2 − 2) + y3/θ1 = 0

Plugging in the values:

∂ℓ/∂θ1 = 3372/(3θ1 + 2θ2) + 8023/(θ1 + θ2 − 2) + 131/θ1 = 0

Now let us derive the partial derivative with respect to θ2 and set it to 0

∂ℓ/∂θ2 = 2y1/(3θ1 + 2θ2) + y2/(θ1 + θ2 − 2) + y4/θ2 = 0

Plugging in the values:

∂ℓ/∂θ2 = 2248/(3θ1 + 2θ2) + 8023/(θ1 + θ2 − 2) + 722/θ2 = 0
Now solving these two equations as a system we get two possibilities:

θ1 = (−451 − 17√7873)/10000 ≈ −0.19594087642280523424
θ2 = (4405 + 17√7873)/10000 ≈ 0.59134087642280523424

OR

θ1 = (−451 + 17√7873)/10000 ≈ 0.10574087642280523424
θ2 = (4405 − 17√7873)/10000 ≈ 0.28965912357719476576

Naturally we choose the second pair of results: first, because if θ1 were negative we would get an undefined result for the 131 log(θ1/8) term in the log-likelihood function; second, because a numerical reality check confirms that the maximum of the log-likelihood is indeed at θ1 ≈ 0.1057 and θ2 ≈ 0.2896, with a value of −6689.525.
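A quick way to run such a reality check (a sketch, not part of the analytic derivation; obs_loglik is my own helper name) is to maximise the observed-data log-likelihood from b) directly with optim():

#Observed-data log-likelihood from part b)
obs_loglik <- function(par) {
  t1 <- par[1]; t2 <- par[2]
  1124*log(3*t1/8 + t2/4) + 8023*log(1/2 - t1/2 + 1/2 - t2/2) +
    131*log(t1/8) + 722*log(t2/4)
}
#Maximise over (0,1) x (0,1); fnscale = -1 turns optim() into a maximiser
optim(par = c(0.2, 0.2), fn = obs_loglik, method = "L-BFGS-B",
      lower = c(1e-6, 1e-6), upper = c(1 - 1e-6, 1 - 1e-6),
      control = list(fnscale = -1))
#the optimum is at roughly (0.1057, 0.2897) with log-likelihood about -6689.5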

d) Derive analytically the expressions for Newton-Raphson optimization.

Now that we have both a θ1 and a θ2, we have to use a more general version of the Newton-Raphson method (as opposed to the single-parameter one). In essence, what we want is to solve a system of n = 2 equations in n = 2 unknowns. The system we have is
f1 = ∂ℓ/∂θ1 = 3372/(3θ1 + 2θ2) + 8023/(θ1 + θ2 − 2) + 131/θ1 = 0
f2 = ∂ℓ/∂θ2 = 2248/(3θ1 + 2θ2) + 8023/(θ1 + θ2 − 2) + 722/θ2 = 0

and from here we define a map f : R² → R²,

f(θ) = (f1(θ1, θ2), f2(θ1, θ2))
     = ( 3372/(3θ1 + 2θ2) + 8023/(θ1 + θ2 − 2) + 131/θ1 ,  2248/(3θ1 + 2θ2) + 8023/(θ1 + θ2 − 2) + 722/θ2 )
From here we need to find the Jacobian of this map, which consists of the second (partial) derivatives of the log-likelihood function, since the equations above are its first derivatives. Specifically (note that 2 · 3372 = 3 · 2248 = 6744, so the two off-diagonal entries are equal and the Hessian is symmetric, as it must be) we get

J_f(θ1, θ2) =
  [ −10116/(3θ1 + 2θ2)² − 8023/(θ1 + θ2 − 2)² − 131/θ1²     −6744/(3θ1 + 2θ2)² − 8023/(θ1 + θ2 − 2)²            ]
  [ −6744/(3θ1 + 2θ2)² − 8023/(θ1 + θ2 − 2)²                −4496/(3θ1 + 2θ2)² − 8023/(θ1 + θ2 − 2)² − 722/θ2²  ]

From here we would want to calculate the inverse of this matrix, i.e. [J_f(θ1, θ2)]⁻¹. However this is a tedious process and the result does not fit on the page; you can see it in closed form in the appendix. Nevertheless, once we have the inverse and starting values (θ1^(0), θ2^(0)), we can specify the Newton-Raphson iteration as follows:

(θ1^(k+1), θ2^(k+1))ᵀ = (θ1^(k), θ2^(k))ᵀ − [J_f(θ1^(k), θ2^(k))]⁻¹ f(θ1^(k), θ2^(k))

                      = (θ1^(k), θ2^(k))ᵀ − [A B; C D] ( 3372/(3θ1^(k) + 2θ2^(k)) + 8023/(θ1^(k) + θ2^(k) − 2) + 131/θ1^(k) ,
                                                         2248/(3θ1^(k) + 2θ2^(k)) + 8023/(θ1^(k) + θ2^(k) − 2) + 722/θ2^(k) )ᵀ

where [A B; C D] is the inverse Jacobian and can be seen in the appendix.
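In practice the closed-form inverse is not needed for the computation itself: a single Newton-Raphson update can be carried out with solve(), which applies [J_f]⁻¹ to f without forming the inverse explicitly. A minimal sketch (grad_l and hess_l are my own helper names; the formulas are exactly the first and second derivatives written out above):

#Gradient of the observed-data log-likelihood (the system f1 = f2 = 0 above)
grad_l <- function(t) {
  c(3372/(3*t[1]+2*t[2]) + 8023/(t[1]+t[2]-2) + 131/t[1],
    2248/(3*t[1]+2*t[2]) + 8023/(t[1]+t[2]-2) + 722/t[2])
}
#Hessian of the log-likelihood (the Jacobian of the gradient)
hess_l <- function(t) {
  d <- (3*t[1]+2*t[2])^2; s <- (t[1]+t[2]-2)^2
  matrix(c(-10116/d - 8023/s - 131/t[1]^2, -6744/d - 8023/s,
           -6744/d - 8023/s,              -4496/d - 8023/s - 722/t[2]^2),
         nrow = 2, byrow = TRUE)
}
theta <- c(0.2, 0.2)
theta - solve(hess_l(theta), grad_l(theta)) #one Newton-Raphson step, approx. (0.097, 0.273)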

e) Derive analytically the expressions for the EM algorithm

First let us define the Expectation stage. We have the following:

E(Z11 | θ1^(t), θ2^(t), Y1) = 1124 · (3θ1^(t)/8) / (3θ1^(t)/8 + θ2^(t)/4) = 3372θ1^(t) / (3θ1^(t) + 2θ2^(t))

E(Z12 | θ1^(t), θ2^(t), Y1) = 1124 − 3372θ1^(t) / (3θ1^(t) + 2θ2^(t)) = 2248θ2^(t) / (3θ1^(t) + 2θ2^(t))

E(Z21 | θ1^(t), θ2^(t), Y2) = 8023 · (1/2 − θ1^(t)/2) / (1/2 − θ1^(t)/2 + 1/2 − θ2^(t)/2) = 8023(θ1^(t) − 1) / (θ1^(t) + θ2^(t) − 2)

E(Z22 | θ1^(t), θ2^(t), Y2) = 8023 − 8023(θ1^(t) − 1) / (θ1^(t) + θ2^(t) − 2) = 8023(θ2^(t) − 1) / (θ1^(t) + θ2^(t) − 2)

First taking a look at the complete log-likelihood function from above we can plug in the expectations and
derive the objective function as follows:

Q(θ1, θ2 | θ1^(t), θ2^(t)) = E(Z11 | ·) log(3θ1/8) + E(Z12 | ·) log(θ2/4)
                           + E(Z21 | ·) log(1/2 − θ1/2) + E(Z22 | ·) log(1/2 − θ2/2)
                           + 131 log(θ1/8) + 722 log(θ2/4)

  = [3372θ1^(t) / (3θ1^(t) + 2θ2^(t))] log(3θ1/8) + [2248θ2^(t) / (3θ1^(t) + 2θ2^(t))] log(θ2/4)
    + [8023(θ1^(t) − 1) / (θ1^(t) + θ2^(t) − 2)] log(1/2 − θ1/2) + [8023(θ2^(t) − 1) / (θ1^(t) + θ2^(t) − 2)] log(1/2 − θ2/2)
    + 131 log(θ1/8) + 722 log(θ2/4)

From here we can take the first derivatives with respect to θ1 and θ2 respectively and set them to zero.

∂Q(θ1, θ2 | θ1^(t), θ2^(t)) / ∂θ1 = [θ1(Z11 + Z21 + 131) − Z11 − 131] / [(θ1 − 1)θ1]
  = 3372θ1^(t) / [θ1(3θ1^(t) + 2θ2^(t))] − 8023(θ1^(t) − 1) / [2(1/2 − θ1/2)(θ1^(t) + θ2^(t) − 2)] + 131/θ1 = 0

∂Q(θ1, θ2 | θ1^(t), θ2^(t)) / ∂θ2 = [θ2(Z12 + Z22 + 722) − Z12 − 722] / [(θ2 − 1)θ2]
  = 2248θ2^(t) / [θ2(3θ1^(t) + 2θ2^(t))] − 8023(θ2^(t) − 1) / [2(1/2 − θ2/2)(θ1^(t) + θ2^(t) − 2)] + 722/θ2 = 0

where Z11, Z12, Z21, Z22 are shorthand for the conditional expectations from the E-step. From here we can find an expression for θ1 and θ2 respectively. They are

θ1 = (Z11 + 131) / (Z11 + Z21 + 131)
   = [3765(θ1^(t))² + θ1^(t)(4027θ2^(t) − 7530) + 262(θ2^(t) − 2)θ2^(t)]
     / [27834(θ1^(t))² + 3θ1^(t)(6691θ2^(t) − 10533) + 2θ2^(t)(131θ2^(t) − 8285)]

θ2 = (Z12 + 722) / (Z12 + Z22 + 722)
   = [2166(θ1^(t))² + θ1^(t)(5858θ2^(t) − 4332) + 3692(θ2^(t) − 2)θ2^(t)]
     / [2166(θ1^(t))² + θ1^(t)(29927θ2^(t) − 28401) + 142θ2^(t)(139θ2^(t) − 165)]
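For example, starting from θ1^(0) = θ2^(0) = 0.2 (the starting values used in the implementation below), the E-step gives E(Z11) = 3372 · 0.2 / (3 · 0.2 + 2 · 0.2) = 674.4 and E(Z21) = 8023(0.2 − 1)/(0.2 + 0.2 − 2) = 4011.5, so the first update is θ1^(1) = (674.4 + 131)/(674.4 + 4011.5 + 131) ≈ 0.1672; likewise E(Z12) = 449.6 and E(Z22) = 4011.5 give θ2^(1) = (449.6 + 722)/(449.6 + 4011.5 + 722) ≈ 0.2260, matching the first EM step reported below.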

a1) Implement your Newton-Raphson optimization algorithm

Here I will implement my Newton-Raphson optimization. We have to be a bit careful with the starting values: if we start with (θ1^(0), θ2^(0)) = (0.5, 0.5) we end up at the other solution (see above). Because I want to show direct from-scratch code based on the calculations above, and given that I know the answer, I will start with (θ1^(0), θ2^(0)) = (0.2, 0.2), where the algorithm converges within 6 steps to 7 decimals. Please see the steps below for the result. The code is available in the appendix.

## Theta1 Theta2
## 1 0.20000000 0.2000000
## 2 0.09673894 0.2725738
## 3 0.10505359 0.2889658
## 4 0.10573591 0.2896598
## 5 0.10574088 0.2896591
## 6 0.10574088 0.2896591
## 7 0.10574088 0.2896591
## 8 0.10574088 0.2896591
## 9 0.10574088 0.2896591
## 10 0.10574088 0.2896591

b1) Implement your EM Algorithm

Here I will implement my EM algorithm. This algorithm takes a little bit longer to converge than Newton-Raphson, at about 45-46 steps, with starting values (θ1^(0), θ2^(0)) = (0.2, 0.2). The code is available in the appendix.

## Teta1 Teta2
## 1 0.2000000 0.2000000
## 2 0.1672030 0.2260423
## 3 0.1479708 0.2451093
## 4 0.1352150 0.2581961
## 5 0.1265147 0.2673104

## [1] "..."

## Teta1 Teta2
## 40 0.1057411 0.2896589
## 41 0.1057410 0.2896590
## 42 0.1057410 0.2896590
## 43 0.1057410 0.2896590
## 44 0.1057409 0.2896591
## 45 0.1057409 0.2896591

Exercise 2

a) Find the normalizing constant of q(θ) using two methods for numerical integration.

First we know from Bayes theorem that


Posterior = (likelihood × prior) / probability(data), i.e. p(θ|y) = p(y|θ)p(θ) / p(y).
However the question does not explicitly state if it wants the normalizing constant for this Bayesian problem
or the normalizing constant which makes q(θ) integrate to 1 when we divide by it. Therefore I will show
both. First the normalizing constant which makes
∫₀¹ q(θ)/constant dθ = 1

constant = ∫₀¹ q(θ) dθ = ∫₀¹ exp(−4(θ − 0.5)² − 0.1 cos²(12πθ)) dθ = 0.7108697
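As a quick cross-check of this constant (in addition to the two required numerical methods implemented below), R's built-in integrate() gives the same value; a minimal sketch:

#Cross-check of the normalizing constant with the built-in integrate()
q_theta <- function(x) exp(-4*((x-0.5)^2) - 0.1*cos(12*pi*x)^2)
integrate(q_theta, 0, 1)$value #approximately 0.7108697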

In the code below I will implement Riemann and trapezoidal integral approximation, along with my personal choice, stochastic (Monte Carlo) integration, to calculate the integral above. The code is available in the appendix.

## [1] "RIEMAN"

## [1] 0.07333786

## [1] "TRAPEZOID"

## [1] 0.07333786

## [1] "STOCHASTIC"

## [1] 0.073371

If we instead assume that we are looking for the normalizing constant for the Bayes problem, then we need to calculate it from

p(θ|y) ∝ p(y|θ)p(θ)

Now, given that p(θ|y) is a probability density, its integral over all values of θ must equal 1, which in turn means that

p(θ|y) = p(y|θ)p(θ) / p(y) = p(y|θ)p(θ) / ∫₀¹ p(y|θ)p(θ) dθ = p(y|θ)q(θ) / ∫₀¹ p(y|θ)q(θ) dθ
From here we have that the normalizing constant is the integral ∫₀¹ p(y|θ)p(θ) dθ. Now, the question gives us 10 Bernoulli observations and we know that 7 of those were successful. So in essence the question is asking us to use two numerical methods to find

∫₀¹ (10 choose 7) θ⁷ (1 − θ)^(10−7) · [exp(−4(θ − 0.5)² − 0.1 cos²(12πθ)) / 0.7108697] dθ,

where the first part is the binomial likelihood and the second part is the prior. In the code below (available in the appendix) I will calculate this integral again using the Riemann, trapezoidal and stochastic integration methods.
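As before, the built-in integrate() can serve as a quick cross-check of the value obtained below (post_kernel is my own name for the integrand):

#Cross-check with integrate(): binomial likelihood times the normalised prior
post_kernel <- function(x) {
  choose(10,7)*(x^7)*((1-x)^3) *
    exp(-4*((x-0.5)^2) - 0.1*cos(12*pi*x)^2)/0.7108697
}
integrate(post_kernel, 0, 1)$value #approximately 0.1032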

## [1] "RIEMAN"

## [1] 0.1031664

## [1] "TRAPEZOID"

## [1] 0.1031664

## [1] "STOCHASTIC"

## [1] 0.1031175

b) Implement the probability integral transform method to simulate M = 10000 draws from
p(θ).

What we have is that θ is a random variable with PDF

p(θ) = exp(−4(θ − 0.5)² − 0.1 cos²(12πθ)) / 0.7108697,   for 0 < θ < 1

If we use the integration from above we can plot the CDF of this random variable (code for the graph is available in the appendix).

[Figure: CDF of p(theta), plotted over Theta on (0, 1), with p(theta) on the y-axis.]
Here I will assume that by "probability integral transform method" I am meant to use the inverse CDF method. In order to invert this CDF I will use the built-in integrate function rather than the numerical integration functions from above (for computational performance reasons); I will then use the uniroot function to find the root of this function and thereby invert the CDF. Finally I will generate a loop which draws from a uniform random variable and plugs the value into this inverted CDF function. For the inverse CDF method we know that if we have a random variable X with CDF F_X(x) = P(X ≤ x), then the random variable F_X⁻¹(U), where U ∼ Unif(0, 1), has the same law as X. So below I will draw 10000 samples from p(θ) and show the histogram. Unfortunately this method is a bit slow (hence the use of the built-in integrate) and 10000 draws are not enough to show the nice wavy shape of the distribution. Please check the accept-reject histogram, which uses more draws and has a higher resolution. The code is available in the appendix.
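One way this could be sped up (a sketch of an alternative, not the method used in the appendix code) is to tabulate the CDF once on a fine grid and invert it by linear interpolation with approx(), instead of calling integrate() and uniroot() for every single draw:

#Tabulate the CDF once, then invert it by interpolation
p_theta <- function(x) exp(-4*((x-0.5)^2) - 0.1*cos(12*pi*x)^2)/0.7108697
grid <- seq(0, 1, length.out = 2001)
h <- grid[2] - grid[1]
#trapezoid-rule CDF table, rescaled so that it ends exactly at 1
cdf_grid <- c(0, cumsum((p_theta(grid[-1]) + p_theta(grid[-length(grid)]))/2)*h)
cdf_grid <- cdf_grid/max(cdf_grid)
u <- runif(10000)
draws <- approx(x = cdf_grid, y = grid, xout = u)$y #interpolated inverse CDF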

[Figure: histogram of the observations drawn from p(theta) using the inverse CDF, with p_theta on the x-axis and density on the y-axis.]
c) Implement a rejection algorithm to simulate M = 10000 draws from p(θ).

Below I will implement the accept-reject sampling method. The histogram shows draws from the same density as above but with many more draws, so that the resolution is better.

[Figure: histogram of the observations drawn from p(theta) using accept-reject, with p_theta on the x-axis and density on the y-axis.]
ci) Write the (closed form) formula of the envelope.

For the envelope I have used a normal distribution with µ = 0.5 and σ² = 0.25. It was scaled by M ≈ 1.756, which I obtained as the maximum value of p(θ)/ϕ(θ), where ϕ(θ) is the pdf of that normal distribution. The closed form is

ϕ(θ) = 1/(0.5√(2π)) · exp(−½ ((θ − 0.5)/0.5)²),   Envelope(θ) = 1.756 × ϕ(θ)

Below you can see how the two distributions look. I have extended the x-axis so that the plot looks better, but for the accept-reject sampling I limit the draws to the interval (0, 1).
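The constant M can also be recovered numerically; a small sketch, equivalent in spirit to the appendix code, which maximises the ratio over a grid:

#Numerical recovery of the scaling constant M (roughly 1.756)
p_theta <- function(x) exp(-4*((x-0.5)^2) - 0.1*cos(12*pi*x)^2)/0.7108697
phi <- function(x) dnorm(x, mean = 0.5, sd = 0.5)
theta_grid <- seq(0, 1, length.out = 10000)
max(p_theta(theta_grid)/phi(theta_grid)) #the scaling constant M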

[Figure: envelope visualisation, showing the target density p(theta) and the scaled normal envelope over Theta, with density on the y-axis.]
cii) Estimate the acceptance probability.

We can derive the acceptance probability as follows:

P(U ≤ p(Θ)/(M g(Θ))) = E[ 1{U ≤ p(Θ)/(M g(Θ))} ]
                     = E[ E( 1{U ≤ p(Θ)/(M g(Θ))} | Θ ) ]      (by the tower property)
                     = E[ P( U ≤ p(Θ)/(M g(Θ)) | Θ ) ]
                     = E[ p(Θ)/(M g(Θ)) ]                      (because P(U ≤ u) = u when U is uniform on (0, 1))
                     = ∫_{θ: g(θ)>0} [p(θ)/(M g(θ))] g(θ) dθ
                     = (1/M) ∫_{θ: g(θ)>0} p(θ) dθ
                     = 1/M

And indeed, for the histogram above we drew 1000000 samples from p(θ) out of a total of 1757819 candidates, which gives exactly the ratio

1000000/1757819 ≈ 1/1.7578 = 1/M
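Using the objects created by the accept-reject code in the appendix (f_x holds the accepted draws and count the total number of proposed candidates), this rate can be checked directly:

#Empirical acceptance rate from the accept-reject run in the appendix
length(f_x)/count #roughly 1/M = 1/1.7578, i.e. about 0.57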

ciii) Compare the estimated acceptance probability to the theoretical value
We are given that the theoretical acceptance probability is A ≈ 0.0492 (stated in the exercise in terms of the modified Bessel function I₀(1/20)), which is a generally poor acceptance probability compared to the ≈ 57% which we have when we use the normal distribution as an envelope. Additionally, as far as I understand the accept-reject method, the theoretical acceptance probability depends entirely on the envelope, and the envelope is not specified in the question and I assume can be chosen freely; nor could I find anything in the lectures or the notes which would indicate there is a fixed theoretical answer. My intuition says that if such a value exists it is some kind of Taylor approximation, but I could not find any mention of a theoretical acceptance rate.
If we nevertheless had to find a density and an M such that we have an acceptance probability of A = 0.0492, we could just use the same normal distribution as in our methodology but use M = 29.77225. Alternatively, we could use a uniform distribution on (0, 14.54649) with M = 20.32 and then run the accept-reject sampling; we would then get that acceptance probability.

d) Derive the posterior p(θ|y).

First we know from Bayes' theorem that

Posterior = (likelihood × prior) / probability(data), i.e. p(θ|y) = p(y|θ)p(θ) / p(y),

which, if we look at the calculations we did in section a), gives the following:

p(θ|y) = p(y|θ)p(θ) / p(y)
       = p(y|θ)p(θ) / ∫₀¹ p(y|θ)p(θ) dθ
       = [ (10 choose 7) θ⁷ (1 − θ)^(10−7) · exp(−4(θ − 0.5)² − 0.1 cos²(12πθ)) / 0.7108697 ]
         / ∫₀¹ (10 choose 7) θ⁷ (1 − θ)^(10−7) · [exp(−4(θ − 0.5)² − 0.1 cos²(12πθ)) / 0.7108697] dθ

where I assume that for y I only have the 10 Bernoulli draws, of which 7 were successful. Obviously prof. Wierdo has been doing this experiment for longer and has more observations, but nevertheless we will limit this analysis to just the 10 mentioned in the question. We get the following density.

[Figure: the Bernoulli (binomial) likelihood p_bern(θ), plotted over the range (0, 1).]

As we can see, the maximum is at 0.7, which makes sense. Now if we take a look below we can see what the density of p(θ|y) looks like (code in the appendix).

[Figure: the density of p(theta|y), plotted over the range (0, 1).]

## [1] "Check if it integrates to 1"

## 1 with absolute error < 6.7e-07

e) Apply the Metropolis sampler to sample from p(θ|y). Simulate M = 10000 draws.

Now what we need to do is use the Metropolis-Hastings algorithm to draw 10000 samples from p(θ|y). The MH algorithm is a powerful way of approximating a distribution using Markov chain Monte Carlo. Here we can use the fact that all MH needs is an expression proportional to the density we want to sample from; even though we could use the non-normalized density, I have already derived the normalized one and shall therefore use it. In the code below the proposal is a normal random walk centred at the current draw (sd = 1, as in the appendix code). The code is available in the appendix.
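A couple of simple checks can be run on the resulting chain; this is a sketch assuming the mh_sample object produced by the appendix code, not something reported above:

#Simple diagnostics on the Metropolis-Hastings chain
accept_rate <- mean(diff(mh_sample) != 0) #rejected proposals repeat the previous value exactly
mh_kept <- mh_sample[-(1:1000)]           #drop the first 1000 draws as burn-in before plotting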

[Figure: histogram of the observations drawn from p(theta|y) using Metropolis-Hastings, with p_theta_y on the x-axis and density on the y-axis.]

APPENDIX

EX 1 d) The inverse of the Jacobian

We calculate the inverse of the Jacobian with the help of https://matrixcalc.org/ which kindly provides
the results in Latex
 
                  [ A  B ]
[J_f(θ1, θ2)]⁻¹ = [ C  D ]

where A = N_A/Den, B = N_B/Den, C = N_C/Den, D = N_D/Den, with common denominator

Den = 7661142θ1⁶ − 30644568θ1⁵ + 30644568θ1⁴ + 177488456θ2⁶ + 931869784θ1θ2⁵ − 131527616θ2⁵ + 1889147546θ1²θ2⁴ − 546698048θ1θ2⁴ + 131527616θ2⁴ + 1785372552θ1³θ2³ − 781027488θ1²θ2³ + 415170432θ1θ2³ + 694317969θ1⁴θ2² − 447575904θ1³θ2² + 365857056θ1²θ2² + 35751996θ1⁵θ2 − 112363416θ1⁴θ2 + 81718848θ1³θ2

and numerators

N_A = −157904θ2⁸ − 952544θ1θ2⁷ + 118144θ2⁷ − 2205448θ1²θ2⁶ + 611200θ1θ2⁶ − 118144θ2⁶ − 2405016θ1³θ2⁵ + 1278720θ1²θ2⁵ − 493056θ1θ2⁵ − 1216665θ1⁴θ2⁴ + 1409472θ1³θ2⁴ − 785664θ1²θ2⁴ − 272916θ1⁵θ2³ + 857736θ1⁴θ2³ − 623808θ1³θ2³ − 58482θ1⁶θ2² + 233928θ1⁵θ2² − 233928θ1⁴θ2²

N_B = 155344θ2⁸ + 905088θ1θ2⁷ − 107904θ2⁷ + 1982496θ1²θ2⁶ − 431616θ1θ2⁶ + 107904θ2⁶ + 1935288θ1³θ2⁵ − 566496θ1²θ2⁵ + 323712θ1θ2⁵ + 710559θ1⁴θ2⁴ − 242784θ1³θ2⁴ + 242784θ1²θ2⁴

N_C = 155464θ2⁸ + 905688θ1θ2⁷ − 108384θ2⁷ + 1983606θ1²θ2⁶ − 433536θ1θ2⁶ + 108384θ2⁶ + 1936188θ1³θ2⁵ − 569016θ1²θ2⁵ + 325152θ1θ2⁵ + 710829θ1⁴θ2⁴ − 243864θ1³θ2⁴ + 243864θ1²θ2⁴

N_D = −170928θ2⁸ − 989296θ1θ2⁷ + 170240θ2⁷ − 2162804θ1²θ2⁶ + 706112θ1θ2⁶ − 170240θ2⁶ − 2133912θ1³θ2⁵ + 1013232θ1²θ2⁵ − 535872θ1θ2⁵ − 836406θ1⁴θ2⁴ + 590544θ1³θ2⁴ − 477360θ1²θ2⁴ − 49518θ1⁵θ2³ + 155628θ1⁴θ2³ − 113184θ1³θ2³ − 10611θ1⁶θ2² + 42444θ1⁵θ2² − 42444θ1⁴θ2²

a1) Implement your Newton-Raphson optimization algorithm Code

###NEWTON-RAPHSON IMPLEMENTATION
#Package for inverting the Jacobian, I could use the
#formula in the appendix if packages are not allowed
library(matlib)
jacob<-function(x,y) {
  a<- -(10116/(3*x+2*y)^2)-(8023/(x+y-2)^2)-(131/(x^2)) #second partial derivative of log-likelihood w.r.t. theta1
  b<- -(6744/(3*x+2*y)^2)-(8023/(x+y-2)^2) #cross partial derivative (2*3372 = 6744)
  c<- -(6744/(3*x+2*y)^2)-(8023/(x+y-2)^2) #cross partial derivative (3*2248 = 6744; the Hessian is symmetric)
  d<- -(4496/(3*x+2*y)^2)-(8023/(x+y-2)^2)-(722/y^2) #second partial derivative of log-likelihood w.r.t. theta2
  #invert the Jacobian, I could use the formula in the appendix but this is more elegant
  IA<-inv(matrix(data = c(a,b,c,d),ncol = 2,nrow = 2))

  e<-(3372/(3*x+2*y))+(8023/(x+y-2))+(131/x) #first derivative of log-likelihood w.r.t. theta1
  f<-(2248/(3*x+2*y))+(8023/(x+y-2))+(722/y) #first derivative of log-likelihood w.r.t. theta2
  #Multiply the inverse of the Jacobian times the first derivative
  g<-IA %*% matrix(data = c(e,f),
                   ncol = 1,nrow = 2)

  return(g)
}

#create a dataframe as a placeholder for the iterates
grid<-data.frame("Theta1"=rep(NA,10),"Theta2"=rep(NA,10))
#Set initial values; different values may lead to the other solution. Set only valid initial values
grid[1,]<-c(0.2,0.2)

for (i in 1:(nrow(grid)-1)) { #Loop for the Newton-Raphson
  #Value in the next step is the value in this step minus the output from the jacob function
  grid[(i+1),]<-grid[i,]-t(jacob(grid[i,1],grid[i,2]))
}

#grid #print the results
b1) Implement your EM Algorithm

grid<-data.frame("Teta1"=rep(NA,15),"Teta2"=rep(NA,15))
grid[1,]<-c(0.01,0.01)

EMfunc1<- function(params) {
a<-params[1]
b<-params[2]
c<-(3372*a/(3*a+2*b))
d<-(2248*b)/(3*a+2*b)
e<-(8023*(a-1))/(a+b-2)
f<-(8023*(b-1))/(a+b-2)
#if we do not want to calculate the expectation every time we can use this form
#x<- (3765*aˆ2+a*(4027*b-7530)+262*(b-2)*b)/(27834*aˆ2+3*a*(6691*b-10533)+2*b*(131*b-8285))
#y<- (2166*aˆ2+a*(5858*b-4332)+3692*(b-2)*b)/(2166*aˆ2+a*(29927*b-28401)+142*b*(139*b-165))
x<-(c+131)/(c+e+131)
y<- (d+722)/(d+f+722)
z<-c(x,y)
return(z)

for (i in 1:(nrow(grid)-1)) {
grid[(i+1),]<-EMfunc1(c(grid[i,1],grid[i,2]))

#grid

2 a1) Normalizing Constant which makes p(θ) integrate to 1


#Define the function


int_q_theta<-function(x){
  y<-(exp(-4*((x-0.5)^2)-0.1*cos(12*pi*x)^2))
  return(y)}

#Riemann integration Approximation


rie<-function(n,a,b){

h=(b-a)/n
y=rep(0,length(n))

for (j in 1:length(n)){
grid1=seq(a,b-h[j],h[j])
grid_int=int_q_theta(grid1)
y[j]=sum(grid_int)*h[j]}
return(y)}

#trapezoid integration Approximation


trp<-function(n,a,b){

h=(b-a)/n
y=rep(0,length(n))

for (j in 1:length(n)){
grid1=seq(a,b,h[j])
grid_int=int_q_theta(grid1)
y[j]=(sum(grid_int)-0.5*(grid_int[1]+grid_int[length(grid1)]))*h[j]}
return(y)}

#Stochastic Integration Approximation


#Note differs slightly from the other two due to its random nature
stch<-function(n,a,b){
rn=runif(n,min = a,max = b)
y=rep(NA,n)
for (j in 1:n){
y[j]<-int_q_theta(rn[j])
}
return(mean(y))

}
print("RIEMAN")

## [1] "RIEMAN"

rie(10000000,0,1) #RIEMAN

## [1] 0.7108697

print("TRAPEZOID")

## [1] "TRAPEZOID"

trp(10000000,0,1) #TRAPEZOID

## [1] 0.7108697

print("STOCHASTIC")

## [1] "STOCHASTIC"

stch(10000000,0,1) #STOCHASTIC

## [1] 0.7107604

2 a2) Bernoulli×prior normalizing constant


#Define the function
int_q_theta<-function(x){
  y<-(choose(10,7)*((x)^7)*((1-x)^3))*((exp(-4*((x-0.5)^2)-0.1*cos(12*pi*x)^2))/0.7108697)
  return(y)}

#Riemann integration Approximation


rie<-function(n,a,b){

h=(b-a)/n
y=rep(0,length(n))

for (j in 1:length(n)){
grid1=seq(a,b-h[j],h[j])
grid_int=int_q_theta(grid1)
y[j]=sum(grid_int)*h[j]}
return(y)}

#trapezoid integration Approximation


trp<-function(n,a,b){

h=(b-a)/n
y=rep(0,length(n))

for (j in 1:length(n)){
grid1=seq(a,b,h[j])
grid_int=int_q_theta(grid1)
y[j]=(sum(grid_int)-0.5*(grid_int[1]+grid_int[length(grid1)]))*h[j]}
return(y)}

#Stochastic Integration Approximation


#Note differs slightly from the other two due to its random nature
stch<-function(n,a,b){
rn=runif(n,min = a,max = b)
y=rep(NA,n)
for (j in 1:n){
y[j]<-int_q_theta(rn[j])
}
return(mean(y))

}
print("RIEMAN")

## [1] "RIEMAN"

rie(10000000,0,1) #RIEMAN

## [1] 0.1031664

print("TRAPEZOID")

## [1] "TRAPEZOID"

trp(10000000,0,1) #TRAPEZOID

## [1] 0.1031664

print("STOCHASTIC")

## [1] "STOCHASTIC"

stch(10000000,0,1) #STOCHASTIC

## [1] 0.1031679

2 b) Inverted CDF sampling

CDF plot code

library(ggplot2)
p_theta<-function(x){
  y<-exp(-4*((x-0.5)^2)-0.1*cos(12*pi*x)^2)/0.7108697
  return(y)}
u<-seq(from=0, to=1,length.out=100000)
rie_p<-function(n,a,b){

h=(b-a)/n
y=rep(0,length(n))

for (j in 1:length(n)){
grid1=seq(a,b-h[j],h[j])
grid_int=p_theta(grid1)
y[j]=sum(grid_int)*h[j]}
return(y)}

values<-c()
for (i in 1:length(u)) {
  values[i]<-rie_p(1000,0,u[i])
}

df<-data.frame("Values"=values,"Sequence"=u)

cdf_p_theta<-ggplot(data = df,aes(x=Sequence,y=Values))+
geom_line()+labs(title = "CDF of p(theta)",x="Theta",y="p(theta)")+theme_bw()
#cdf_p_theta

Inverse CDF Sampling

# Get the cdf by numeric integration


cdf <- function(x){
integrate(p_theta, 0, x)$value
}

# Use a root finding function to invert the cdf
invcdf <- function(q){
uniroot(function(x){cdf(x) - q}, range(0,1))$root
}

p_theta_draws<-rep(0,10000)
for (i in 1:length(p_theta_draws)) {
u<-runif(1,0,1)
p_theta_draws[i]<-invcdf(u)
}

df2<-data.frame("p_theta"=p_theta_draws)

p_draws<-ggplot(df2, aes(x = p_theta)) +


geom_histogram(aes(y = ..density..),
colour = 1, fill = "lightblue") +
geom_density(lwd = 1.2,
linetype = 2,
colour = "blue")+
labs(title = "Observations drawn from p(theta) using inverse CDF")+
scale_x_continuous(breaks=c(seq(from=0,to=max(df2$p_theta)+0.1,by=0.1)))+
theme_bw()

p_draws

## ‘stat_bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.

[Figure: histogram of observations drawn from p(theta) using the inverse CDF (appendix rendering of the plot shown in the main text).]

2 c) Accept reject Sampling

Code below for drawing from p(θ) using accept reject

g<- function(x) {
y<-dnorm(x,mean=0.5,sd=0.5)
return(y)
}
range<-seq(-2,2,length.out=1000)
M<-max(p_theta(range)/g(range))
f_x<-rep(0,1000000)
count<-0

for (i in 1:length(f_x)) {
  repeat {
    count<-count+1
    x<-rnorm(1,mean=0.5,sd=0.5)
    u<-runif(1,0,1)
    if(u< p_theta(x)/(M*g(x)) & 0<x & x < 1) {break}
  }
  f_x[i]<-x
}

df3<-data.frame("p_theta"=f_x)
acc_rej<-ggplot(df3, aes(x = p_theta)) +
geom_histogram(bins=100,aes(y = ..density..),
colour = 1, fill = "lightblue") +
geom_density(lwd = 1.2,
linetype = 2,
colour = "pink")+
labs(title = "Observations drawn from p(theta) using accept reject")+
scale_x_continuous(breaks=c(seq(from=0,to=max(df2$p_theta)+0.1,by=0.1)))+
theme_bw()

#acc_rej

range<-seq(-2,2,length.out=1000)
M<-max(p_theta(range)/g(range))

df_dist<-data.frame("p_theta"=p_theta(range),"phi_theta"=M*g(range),"range"=range)

#ggplot(df_dist) +
#geom_line(aes(x=range,y = p_theta),color="red")+
#geom_line(aes(x=range,y = phi_theta),color="blue")+
#labs(title = "Envoelope Visualisation",y="Density",x="Theta")+
#theme_bw()

2 d) Derive the posterior p(θ|y).

Bernoulli distribution plot code

p_bern<-function(x) {
  y<-(choose(10,7)*((x)^7)*((1-x)^(10-7)))
  return(y)
}

range<-seq(0,1,length.out=1000)
#plot(range,p_bern(range),type = "l",main = "Bernoulli likelihood")

code to plot the density of p(θ|y)

int_q_theta<-function(x){
  y<-(choose(10,7)*((x)^7)*((1-x)^3))*((exp(-4*((x-0.5)^2)-0.1*cos(12*pi*x)^2))/0.7108697)
  return(y)}

p_theta_y<-function(x) {
  #posterior density: the kernel above divided by its normalizing integral
  y<-((choose(10,7)*((x)^7)*((1-x)^3))*((exp(-4*((x-0.5)^2)-0.1*cos(12*pi*x)^2))/0.7108697))/integrate(int_q_theta, 0, 1)$value
  return(y)
}
range<-seq(0,1,length.out=1000)
#plot(range,p_theta_y(range),type = "l",main = "Density of p(theta|y)")
#print("Check if it integrates to 1")
#integrate(p_theta_y, 0, 1)

2 e) Apply the Metropolis sampler to sample from p(θ|y). Simulate M = 10000 draws.

Here is the code for the Metropolis Hastings sampler

mh_sampler <- function(dens, start = 0, nreps = 10000, prop_sd = 1, ...){


theta <- numeric(nreps)
theta[1] <- start

for (i in 2:nreps){
theta_star <- rnorm(1, mean = theta[i - 1], sd = prop_sd)
alpha = dens(theta_star, ...) / dens(theta[i - 1], ...)

if (runif(1) < alpha) {theta[i] <- theta_star}


else {theta[i] <- theta[i - 1]}
}

return(theta)
}

mh_sample<-mh_sampler(p_theta_y, nreps = 300000, start = 0.5)

df4<-data.frame("p_theta_y"=mh_sample)

p_y_draws<-ggplot(df4, aes(x = p_theta_y)) +


geom_histogram(bins=100,aes(y = ..density..),
colour = 1, fill = "darkgreen") +
geom_density(lwd = 1.2,
linetype = 2,
colour = "green")+
labs(title = "Observations drawn from p(theta|y) using Metropolis-Hastings")+
scale_x_continuous(breaks=c(seq(from=0,to=max(df2$p_theta)+0.1,by=0.1)))+
theme_bw()

#p_y_draws

