

Homework 1

ELEN E6885: Introduction to Reinforcement Learning


September 21, 2021

Problem 1 (2-Armed Bandit, 20 Points)


Consider the following 2-armed bandit problem: the first arm has a fixed reward 0.3 and
the second arm has a 0-1 reward following a Bernoulli distribution with probability 0.6, i.e.,
arm 2 yields reward 1 with probability 0.6. Assume we selected arm 1 at t = 1, and arm
2 four times at t = 2, 3, 4, 5 with reward 0, 1, 0, 0, respectively. We use the sample-average
technique to estimate the action-value, and then use it to guide our choices starting from
t = 6.

1. [5 pts] Which arm will be played at t = 6, 7, respectively, if the greedy method is used
to select actions?

2. [10 pts] What is the probability of playing arm 2 at t = 6, 7, respectively, if the ε-greedy
method is used to select actions (ε = 0.1)?

3. [5 pts] Why could the greedy method perform significantly worse than the ε-greedy
method in the long run?

Solution. We use Q^i_t to represent the estimated expected payoff of arm i (i = 1, 2) after
time t.

1. Q^1_5 = 0.3 and Q^2_5 = (0 + 1 + 0 + 0)/4 = 0.25. Therefore, arm 1 will be played at both
t = 6 and t = 7 if the greedy method is used.

2. At t = 6, the probability of playing arm 2 using the ε-greedy method is ε/2 = 0.05. For t = 7,
consider the following three cases:

a. Arm 1 is played at t = 6: its probability is 1 − ε + ε/2 = 0.95. We have Q^1_6 = 0.3
and Q^2_6 = 0.25. Therefore, the probability of playing arm 2 at t = 7 is ε/2 = 0.05.
b. Arm 2 is played at t = 6 and the reward is 0: its probability is ε/2 × 0.4 = 0.02.
We have Q^1_6 = 0.3 and Q^2_6 = (0 + 1 + 0 + 0 + 0)/5 = 0.2. Therefore, the probability
of playing arm 2 at t = 7 is ε/2 = 0.05.
c. Arm 2 is played at t = 6 and the reward is 1: its probability is ε/2 × 0.6 = 0.03.
We have Q^1_6 = 0.3 and Q^2_6 = (0 + 1 + 0 + 0 + 1)/5 = 0.4. Therefore, the probability
of playing arm 2 at t = 7 is 1 − ε + ε/2 = 0.95.

By combining all the cases above, the probability of playing arm 2 at t = 7 using the
ε-greedy method is

0.95 × 0.05 + 0.02 × 0.05 + 0.03 × 0.95 = 0.077.

3. The greedy method may get stuck on a suboptimal action due to its lack of exploration: here it keeps playing arm 1 (reward 0.3) even though arm 2 has the higher expected reward of 0.6, whereas ε-greedy keeps exploring and eventually corrects the estimate.
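
As a quick illustration, the following is a minimal Python sketch of sample-average ε-greedy on this bandit (arm parameters from the problem statement; the 1000-step horizon and the tie-breaking rule are arbitrary choices, not part of the problem):

import random

# Minimal sketch: sample-average epsilon-greedy on the 2-armed bandit of
# Problem 1 (arm 1: fixed reward 0.3; arm 2: Bernoulli reward with p = 0.6).
def pull(arm):
    return 0.3 if arm == 0 else (1.0 if random.random() < 0.6 else 0.0)

def run(steps=1000, eps=0.1):
    q = [0.0, 0.0]   # sample-average action-value estimates
    n = [0, 0]       # pull counts
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(2)       # explore: pick an arm uniformly
        else:
            arm = 0 if q[0] >= q[1] else 1  # exploit: pick the greedy arm
        r = pull(arm)
        n[arm] += 1
        q[arm] += (r - q[arm]) / n[arm]     # incremental sample-average update
    return q

print(run())  # with continued exploration, q[1] tends toward 0.6 > 0.3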

Problem 2 (Softmax, 15 Points)


For the softmax action selection, show the following.

1. [5 pts] In the limit as the temperature τ → 0, softmax action selection becomes the
same as greedy action selection.

2. [5 pts] In the limit as τ → ∞, softmax action selection yields equiprobable selection
among all actions.

3. [5 pts] In the case of two actions, the softmax operation using the Gibbs distribution
becomes the logistic (or sigmoid) function commonly used in artificial neural networks.

Solution.

1. Let the actions be a_1, a_2, ..., a_n and let the action with the maximum Q-value (i.e., the
greedy action) be a_m. Then we have Q(a_m) > Q(a_i) for all i ≠ m. The probability of
choosing action a_i under softmax action selection is

p_i = exp(Q(a_i)/τ) / Σ_{j=1}^{n} exp(Q(a_j)/τ)
    = exp((Q(a_i) − Q(a_m))/τ) / (1 + Σ_{j≠m} exp((Q(a_j) − Q(a_m))/τ)).     (1)

Since Q(a_m) > Q(a_j) for all j ≠ m, it follows that for every j ≠ m,

lim_{τ→0} exp((Q(a_j) − Q(a_m))/τ) = 0.

Therefore, (1) reduces to

p_i = { 1, if i = m,
      { 0, if i ≠ m.
This is exactly the greedy action selection policy.

2. When τ → ∞, we have, for all j,

lim_{τ→∞} exp((Q(a_j) − Q(a_m))/τ) = 1.

Substituting this into (1) gives p_i = 1/n for all i = 1, 2, ..., n. Therefore, all
actions are selected with equal probability.

3. When there are only two actions a_1 and a_2, the probability of choosing action a_1 is

p_1 = exp(Q(a_1)/τ) / (exp(Q(a_1)/τ) + exp(Q(a_2)/τ))
    = 1 / (1 + exp(−(Q(a_1) − Q(a_2))/τ)).

This is the logistic (sigmoid) function f(x) = 1/(1 + e^{−x}) evaluated at x = (Q(a_1) − Q(a_2))/τ.
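
The three limits can also be observed numerically. Below is a minimal Python sketch of Gibbs/softmax action selection (the Q-values are arbitrary example numbers):

import math

# Sketch: Gibbs/softmax action probabilities at several temperatures.
def softmax(q, tau):
    m = max(q)                                   # subtract max for numerical stability
    w = [math.exp((qi - m) / tau) for qi in q]   # same form as Eq. (1)
    z = sum(w)
    return [wi / z for wi in w]

q = [0.25, 0.30, 0.10]   # arbitrary example Q-values
for tau in (100.0, 1.0, 0.1, 0.01):
    print(tau, [round(p, 3) for p in softmax(q, tau)])
# As tau -> 0 the probability mass concentrates on the greedy action;
# as tau -> infinity the distribution approaches the uniform 1/n.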

Problem 3 (Incremental Implementation, 15 Points)


Suppose we have a sequence of returns G_1, G_2, ..., G_{n−1}, all starting in the same state and
each with a corresponding random weight W_i, i = 1, 2, ..., n − 1. We wish to form the
estimate

V_n = (Σ_{k=1}^{n−1} W_k G_k) / (Σ_{k=1}^{n−1} W_k),   n ≥ 2,

and keep it up-to-date as we obtain an additional return G_n. In addition to keeping track
of V_n, we must maintain for each state the cumulative sum C_n of the weights given to the
first n returns. Show that the update rule for V_{n+1}, n ≥ 1, is

V_{n+1} = V_n + (W_n / C_n) [G_n − V_n],

and

C_{n+1} = C_n + W_{n+1},

where C_0 = 0 (and V_1 is arbitrary and thus need not be specified).

Solution. By definition, C_n is the cumulative sum of weights, i.e., C_n = Σ_{k=1}^{n} W_k. Thus, we
have

C_{n+1} = C_n + W_{n+1}.

And for n ≥ 1,

V_{n+1} = (Σ_{k=1}^{n} W_k G_k) / (Σ_{k=1}^{n} W_k)
        = (Σ_{k=1}^{n−1} W_k G_k + W_n G_n) / C_n
        = (C_{n−1} V_n + W_n G_n) / C_n
        = ((C_n − W_n) V_n + W_n G_n) / C_n
        = (C_n V_n + W_n (G_n − V_n)) / C_n
        = V_n + (W_n / C_n)(G_n − V_n).
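
A quick numerical check of this recursion, with made-up returns and weights (a minimal Python sketch):

import random

# Sketch: verify V_{n+1} = V_n + (W_n / C_n)(G_n - V_n) against the direct
# weighted average sum(W_k G_k) / sum(W_k), with C_0 = 0 and V_1 arbitrary.
random.seed(0)
G = [random.uniform(-1.0, 1.0) for _ in range(10)]   # made-up returns
W = [random.uniform(0.1, 2.0) for _ in range(10)]    # made-up positive weights

V, C = 0.0, 0.0
for g, w in zip(G, W):
    C += w                  # C_n = C_{n-1} + W_n
    V += (w / C) * (g - V)  # incremental update

direct = sum(w * g for w, g in zip(W, G)) / sum(W)
print(V, direct)            # identical up to floating-point rounding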

Problem 4 (MDP: State-Value Function, 15 Points)


Compute the state value of S_t in each of the MDPs shown in Figures 1–3. The decimal numbers
above the edges denote the probabilities of choosing the corresponding actions, and r denotes
the reward, which can be deterministic or stochastic. Assume γ = 1 for all questions, and that
all terminal states (i.e., states with no successors in the graph) have value zero.

Solution.

1. We have v(S_{t+2}) = 0. By the Bellman expectation equation,

v(S_{t+1}) = Σ_a π(a|S_{t+1}) (r + γ v(S_{t+2})) = 0.2 × 6 + 0.3 × 8 + 0.5 × 10 = 8.6,

v(S_t) = r + γ v(S_{t+1}) = 4 + 8.6 = 12.6.

2. We have v(S_{t+2}) = 0, and v(S_{t+1}) = 0 for the successors S_{t+1} above S_t and to the
bottom-right of S_t. By the Bellman expectation equation,

v(S_{t+1}) = Σ_a π(a|S_{t+1}) (r + γ v(S_{t+2})) = 0.2 × 6 + 0.3 × 8 + 0.5 × 10 = 8.6

for the successor S_{t+1} to the top-right of S_t, and

v(S_t) = Σ_a π(a|S_t) Σ_{S_{t+1}} Σ_r p(S_{t+1}, r | S_t, a) (r + γ v(S_{t+1}))
       = 0.5 × (4 + 0) + 0.5 × [0.4 × (4 + 8.6) + 0.6 × (4 + 0)]
       = 2 + 0.5 × (0.4 × 12.6 + 2.4)
       = 5.72.

3. We have v(S_{t+1}) = 0. By the Bellman expectation equation,

v(S_t) = Σ_a π(a|S_t) Σ_{S_{t+1}} Σ_r p(S_{t+1}, r | S_t, a) (r + γ v(S_{t+1}))
       = 0.5 × (4 + 0) + 0.5 × (0.4 μ_{r_{11}} + 0.6 μ_{r_{12}})
       = 2 + 0.5 × (0.4 × 0.5 + 0.6 × 5)
       = 3.6,

where μ_{r_{11}} = 0.5 and μ_{r_{12}} = 5 are the expected values of the stochastic rewards in Figure 3.
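
For a cross-check, the three computations above can be retraced directly in Python from the probabilities and rewards used in the solution (a minimal sketch; it does not require the figures):

# Sketch: retrace the three Bellman expectation computations above (gamma = 1).

# Part 1: deterministic transitions.
v_next = 0.2 * 6 + 0.3 * 8 + 0.5 * 10                                # v(S_{t+1}) = 8.6
print(4 + v_next)                                                    # v(S_t) = 12.6

# Part 2: stochastic transitions; only the top-right successor has value 8.6.
print(0.5 * (4 + 0) + 0.5 * (0.4 * (4 + v_next) + 0.6 * (4 + 0)))    # v(S_t) = 5.72

# Part 3: stochastic rewards with means 0.5 and 5.
print(0.5 * (4 + 0) + 0.5 * (0.4 * 0.5 + 0.6 * 5))                   # v(S_t) = 3.6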

Problem 5 (MDP: Optimal Policy, 10 Points)


Given an arbitrary MDP with reward function R^a_s and constants α and β > 0, prove that
the following modified MDPs have the same optimal policy as the original MDP.

1. [5 pts] Everything remains the same as in the original MDP, except that it has a new reward
function α + R^a_s. Assume that there is no terminal state and that the discount factor satisfies γ < 1.

2. [5 pts] Everything remains the same as in the original MDP, except that it has a new reward
function β · R^a_s.

Solution. Let Q_π(s, a) and Q'_π(s, a) denote the action-value functions of the original MDP
and the modified MDP under the same policy π, respectively. We will prove that
Q'_π(s, a) = Q_π(s, a) + α/(1 − γ) for 1) and Q'_π(s, a) = β Q_π(s, a) for 2).

1. For the original MDP,

Q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
          = E_π[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | S_t = s, A_t = a].

Similarly, for the modified MDP,

Q'_π(s, a) = E_π[G'_t | S_t = s, A_t = a]
           = E_π[R'_{t+1} + γ R'_{t+2} + γ^2 R'_{t+3} + ... | S_t = s, A_t = a]
           = E_π[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... + α + γα + γ^2 α + ... | S_t = s, A_t = a].

Now, as there is no terminal state and γ < 1,

Q'_π(s, a) = E_π[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | S_t = s, A_t = a] + α/(1 − γ)
           = Q_π(s, a) + α/(1 − γ).

2. Similar to part 1, we see that

Q'_π(s, a) = E_π[G'_t | S_t = s, A_t = a]
           = E_π[R'_{t+1} + γ R'_{t+2} + γ^2 R'_{t+3} + ... | S_t = s, A_t = a]
           = E_π[β R_{t+1} + γ β R_{t+2} + γ^2 β R_{t+3} + ... | S_t = s, A_t = a]
           = β E_π[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | S_t = s, A_t = a]
           = β Q_π(s, a).

Now for 1), we have

Q'_*(s, a) = max_π Q'_π(s, a) = max_π Q_π(s, a) + α/(1 − γ) = Q_*(s, a) + α/(1 − γ),

and for 2), we have

Q'_*(s, a) = max_π Q'_π(s, a) = β max_π Q_π(s, a) = β Q_*(s, a).

In both cases the maximizing policy, and hence the greedy policy with respect to Q'_*, is the same as for Q_*, so the modified MDP has the same (deterministic) optimal policy as the original MDP.
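
This invariance can also be checked empirically with value iteration on a small randomly generated MDP; the sketch below is illustrative only, with a made-up MDP, α = 2, β = 3, γ = 0.9 < 1, and no terminal states, matching the assumptions of part 1:

import random

random.seed(1)
S, A, gamma = 4, 3, 0.9

# Made-up random transition kernel P[s][a][s'] and reward table R[s][a].
P = [[[random.random() for _ in range(S)] for _ in range(A)] for _ in range(S)]
for s in range(S):
    for a in range(A):
        z = sum(P[s][a])
        P[s][a] = [p / z for p in P[s][a]]
R = [[random.uniform(-1.0, 1.0) for _ in range(A)] for _ in range(S)]

def greedy_policy(R):
    # Value iteration followed by greedy policy extraction.
    V = [0.0] * S
    for _ in range(1000):
        V = [max(R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(S))
                 for a in range(A)) for s in range(S)]
    return [max(range(A), key=lambda a: R[s][a] +
                gamma * sum(P[s][a][t] * V[t] for t in range(S)))
            for s in range(S)]

alpha, beta = 2.0, 3.0
R_shift = [[alpha + R[s][a] for a in range(A)] for s in range(S)]
R_scale = [[beta * R[s][a] for a in range(A)] for s in range(S)]
print(greedy_policy(R), greedy_policy(R_shift), greedy_policy(R_scale))  # identical policies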

Problem 6 (MDP: Simple Card Game, 25 Points)


In a card game, you repeatedly draw a card (with replacement) that is equally likely to be
a 2 or 3. You can either Draw or Stop if the total score of the cards you have drawn is
less than 6. Otherwise, you must Stop. When you Stop, your reward is equal to your total
score (up to 5), or zero if you get a total of 6 or higher. When you Draw, you receive no
reward. Assume there is no discount (γ = 1). We formulate this problem as an MDP with
the following states: 0, 2, 3, 4, 5, and a Done state for when the game ends.
1. [10 pts] What is the state transition function and the reward function for this MDP?
2. [12 pts] What is the optimal state-value function and optimal action-value function for
this MDP? (Hint: Solve Bellman optimality equation starting from states 4 and 5.)
3. [3 pts] What is the optimal policy for this MDP?

Solution.
1. Transition function:

P^{Stop}_{s,Done} = 1,  s ∈ {0, 2, 3, 4, 5};
P^{Draw}_{0,s'} = 0.5,  s' ∈ {2, 3};
P^{Draw}_{2,s'} = 0.5,  s' ∈ {4, 5};
P^{Draw}_{3,s'} = 0.5,  s' ∈ {5, Done};
P^{Draw}_{s,Done} = 1,  s ∈ {4, 5}.

Reward function:

R^{Stop}_s = s,  s ∈ {0, 2, 3, 4, 5};
R^a_s = 0,  otherwise.

2. Optimal action-value function:

Q_*(5, Draw) = 0,  Q_*(5, Stop) = 5;
Q_*(4, Draw) = 0,  Q_*(4, Stop) = 4;
Q_*(3, Draw) = 2.5,  Q_*(3, Stop) = 3;
Q_*(2, Draw) = 4.5,  Q_*(2, Stop) = 2;
Q_*(0, Draw) = 3.75,  Q_*(0, Stop) = 0.

Optimal state-value function:

V_*(5) = 5,  V_*(4) = 4,  V_*(3) = 3,  V_*(2) = 4.5,  V_*(0) = 3.75.

3. Optimal policy:

π_*(a|s) = { 1, if a = Draw, s ∈ {0, 2} or a = Stop, s ∈ {3, 4, 5},
           { 0, otherwise.
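
The values in part 2 can be reproduced by backward induction on the Bellman optimality equations, using the transition and reward functions from part 1 (a minimal Python sketch):

# Sketch: backward induction for the card-game MDP (gamma = 1), using the
# transition and reward functions from part 1 of the solution.
V = {"Done": 0.0}
Q = {}
for s in [5, 4, 3, 2, 0]:               # order so that successors are already solved
    Q[(s, "Stop")] = s                  # Stop pays the current total score
    if s in (4, 5):
        Q[(s, "Draw")] = V["Done"]      # any card pushes the total to 6 or more
    elif s == 3:
        Q[(s, "Draw")] = 0.5 * V[5] + 0.5 * V["Done"]
    elif s == 2:
        Q[(s, "Draw")] = 0.5 * V[4] + 0.5 * V[5]
    else:                               # s == 0
        Q[(s, "Draw")] = 0.5 * V[2] + 0.5 * V[3]
    V[s] = max(Q[(s, "Stop")], Q[(s, "Draw")])

print(Q)  # matches Q*(s, a) above
print(V)  # V*(5) = 5, V*(4) = 4, V*(3) = 3, V*(2) = 4.5, V*(0) = 3.75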

Figure 1: MDP with deterministic transitions

Figure 2: MDP with stochastic transitions

Figure 3: MDP with stochastic rewards
