Homework 1: ELEN E6885: Introduction to Reinforcement Learning, September 21, 2021
1. [5 pts] Which arm will be played at t = 6, 7, respectively, if the greedy method is used to select actions?
2. [10 pts] What is the probability of playing arm 2 at t = 6, 7, respectively, if the ε-greedy method is used to select actions (ε = 0.1)?
3. [5 pts] Why could the greedy method perform significantly worse than the ε-greedy method in the long run?
Solution. We use Q_t^i to denote the estimated expected payoff for arm i (i = 1, 2) after time t.
1. Q_5^1 = 0.3 and Q_5^2 = (0 + 1 + 0 + 0)/4 = 0.25. Therefore, arm 1 will be played at both t = 6 and t = 7 if the greedy method is used.
2. At t = 6, the probability of playing arm 2 using the ε-greedy method is ε/2 = 0.05. For t = 7, consider the following three cases:
By combining all of the above possibilities, the probability of playing arm 2 at t = 7 using the ε-greedy method is
3. The greedy method may get stuck in a suboptimal action due to its lack of exploration, whereas the ε-greedy method keeps exploring and can eventually correct its value estimates; see the sketch below.
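As a concrete illustration, here is a minimal Python sketch of ε-greedy selection. The two-arm setup and the estimates Q_5^1 = 0.3, Q_5^2 = 0.25 come from part 1; the seed and trial count are arbitrary.
\begin{verbatim}
import random

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon, explore uniformly over all arms;
    # otherwise exploit the arm with the highest estimated payoff.
    # The greedy arm can also be picked while exploring, so a given
    # non-greedy arm is chosen with probability epsilon / n_arms.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

# After t = 5: Q_5^1 = 0.3, Q_5^2 = 0.25, so arm 2 (index 1) should be
# selected with probability epsilon / 2 = 0.05 at t = 6.
random.seed(0)
q = [0.3, 0.25]
picks = sum(epsilon_greedy(q) == 1 for _ in range(100_000))
print(picks / 100_000)  # ~0.05
\end{verbatim}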
1. [5 pts] In the limit as the temperature τ → 0, softmax action selection becomes the same as greedy action selection.
2. [5 pts] In the limit as the temperature τ → ∞, softmax action selection selects all actions with equal probability.
3. [5 pts] In the case of two actions, the softmax operation using the Gibbs distribution becomes the logistic (or sigmoid) function commonly used in artificial neural networks.
Solution.
1. Let the actions be a_1, a_2, ..., a_n and the action with the maximum Q-value (i.e., the greedy action) be a_m. Then we have Q(a_m) > Q(a_i) for all i ≠ m. The probability of choosing action a_i under softmax action selection is
$$p_i = \frac{\exp(Q(a_i)/\tau)}{\sum_{j=1}^{n} \exp(Q(a_j)/\tau)} = \frac{\exp\left(\frac{Q(a_i)-Q(a_m)}{\tau}\right)}{1 + \sum_{j \neq m} \exp\left(\frac{Q(a_j)-Q(a_m)}{\tau}\right)}. \tag{1}$$
As τ → 0⁺, the exponent (Q(a_i) − Q(a_m))/τ → −∞ for every i ≠ m, so p_i → 0 for all i ≠ m and p_m → 1. That is, softmax selection reduces to greedy selection.
2. When τ → ∞, we have, for all j,
$$\lim_{\tau \to \infty} \exp\left(\frac{Q(a_j) - Q(a_m)}{\tau}\right) = 1.$$
Substituting this fact into (1), we have p_i = 1/n for all i = 1, 2, ..., n. Therefore, all actions are selected with equal probability.
3. When there are only two actions a_1 and a_2, the probability of choosing action a_1 is
$$p_1 = \frac{\exp(Q(a_1)/\tau)}{\exp(Q(a_1)/\tau) + \exp(Q(a_2)/\tau)} = \frac{1}{1 + \exp\left(-\frac{Q(a_1) - Q(a_2)}{\tau}\right)}.$$
This is the form of the logistic (sigmoid) function $f(x) = \frac{1}{1+e^{-x}}$, with x = (Q(a_1) − Q(a_2))/τ.
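The limiting behavior in parts 1 and 2 is easy to verify numerically; a small sketch, where the Q-values are arbitrary placeholders:
\begin{verbatim}
import math

def softmax_probs(q_values, tau):
    # Gibbs/Boltzmann distribution at temperature tau. Subtracting
    # max(q) before exponentiating is the same normalization used in
    # (1) and keeps the computation numerically stable.
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 0.5, 0.2]            # arbitrary Q-values for illustration
print(softmax_probs(q, 0.01))  # ~[1, 0, 0]: greedy as tau -> 0
print(softmax_probs(q, 1e6))   # ~[1/3, 1/3, 1/3]: uniform as tau -> inf
\end{verbatim}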
Solution. By definition, C_n is the cumulative sum of the weights, i.e., $C_n = \sum_{k=1}^{n} W_k$. Thus, we have
$$C_{n+1} = C_n + W_{n+1}.$$
For n ≥ 1,
$$\begin{aligned}
V_{n+1} &= \frac{\sum_{k=1}^{n} W_k G_k}{\sum_{k=1}^{n} W_k} \\
&= \frac{\sum_{k=1}^{n-1} W_k G_k + W_n G_n}{C_n} \\
&= \frac{C_{n-1} V_n + W_n G_n}{C_n} \\
&= \frac{(C_n - W_n) V_n + W_n G_n}{C_n} \\
&= \frac{C_n V_n + W_n (G_n - V_n)}{C_n} \\
&= V_n + \frac{W_n}{C_n} (G_n - V_n).
\end{aligned}$$
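The last line is exactly the incremental update rule; a quick Python check against the batch weighted average, with arbitrary weights and returns:
\begin{verbatim}
def incremental_weighted_average(pairs):
    # Maintain C_n and V_n with the update derived above:
    # C_n = C_{n-1} + W_n, then V_{n+1} = V_n + (W_n / C_n)(G_n - V_n).
    v, c = 0.0, 0.0
    for w, g in pairs:
        c += w
        v += (w / c) * (g - v)
    return v

pairs = [(1.0, 3.0), (0.5, 7.0), (2.0, 1.0)]  # arbitrary (W_k, G_k)
batch = sum(w * g for w, g in pairs) / sum(w for w, _ in pairs)
print(incremental_weighted_average(pairs), batch)  # both ~2.4286
\end{verbatim}
Note that C is updated before V: the derivation divides by C_n, which already includes W_n.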
Solution.
2. We have v(S_{t+2}) = 0, and v(S_{t+1}) = 0 for the S_{t+1} on the top of S_t and at the bottom-right of S_t. By the Bellman expectation equation,
$$v(S_{t+1}) = \sum_{a} \pi(a \mid S_{t+1}) \left(r + \gamma v(S_{t+2})\right) = 0.2 \times 6 + 0.3 \times 8 + 0.5 \times 10 = 8.6.$$
3. We have v(S_{t+1}) = 0. By the Bellman expectation equation,
$$v(S_t) = \sum_{a} \pi(a \mid S_t) \sum_{S_{t+1}} \sum_{r} p(S_{t+1}, r \mid S_t, a) \left(r + \gamma v(S_{t+1})\right).$$
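A one-step Bellman expectation backup is straightforward to code. Here is a sketch, assuming a hypothetical encoding of part 2's numbers as three actions with probabilities 0.2/0.3/0.5, rewards 6/8/10, and a successor value of 0:
\begin{verbatim}
def bellman_backup(pi, dynamics, gamma, v):
    # v(s) = sum_a pi(a) * sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))
    # pi: {action: prob}; dynamics: {action: [(prob, next_state, reward)]}
    return sum(
        pi[a] * sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[a])
        for a in pi
    )

# Part 2's numbers: deterministic transitions and v(S_{t+2}) = 0, so
# the value of gamma does not affect the result here.
pi = {"a1": 0.2, "a2": 0.3, "a3": 0.5}
dynamics = {a: [(1.0, "next", r)] for a, r in [("a1", 6), ("a2", 8), ("a3", 10)]}
print(bellman_backup(pi, dynamics, gamma=0.9, v={"next": 0.0}))  # 8.6
\end{verbatim}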
1. [5 pts] Everything remains the same as the original MDP, except it has a new reward function α + R_s^a. Assume that there is no terminal state and the discount factor γ < 1.
2. [5 pts] Everything remains the same as the original MDP, except it has a new reward function β · R_s^a.
Solution. Let Q_π(s, a) and Q'_π(s, a) denote the action-value functions of the original MDP and the modified MDP under the same policy π, respectively. We will prove that Q'_π(s, a) = Q_π(s, a) + α/(1 − γ) for 1) and Q'_π(s, a) = βQ_π(s, a) for 2).
1. By definition,
$$\begin{aligned}
Q'_\pi(s, a) &= \mathbb{E}_\pi[(\alpha + R_{t+1}) + \gamma(\alpha + R_{t+2}) + \gamma^2(\alpha + R_{t+3}) + \cdots \mid S_t = s, A_t = a] \\
&= \alpha(1 + \gamma + \gamma^2 + \cdots) + \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s, A_t = a] \\
&= \frac{\alpha}{1-\gamma} + Q_\pi(s, a).
\end{aligned}$$
Since the shift α/(1 − γ) is the same constant for every state-action pair, it does not change the ordering of the actions, and the modified MDP has the same optimal policy as the original one.
2. Similar to 1, we see that
$$\begin{aligned}
Q'_\pi(s, a) &= \mathbb{E}_\pi[G'_t \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R'_{t+1} + \gamma R'_{t+2} + \gamma^2 R'_{t+3} + \cdots \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[\beta R_{t+1} + \gamma \beta R_{t+2} + \gamma^2 \beta R_{t+3} + \cdots \mid S_t = s, A_t = a] \\
&= \beta\, \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s, A_t = a] \\
&= \beta Q_\pi(s, a).
\end{aligned}$$
Therefore, for β > 0 the scaling preserves the ordering of the action values, and the modified MDP has the same (deterministic) optimal policy as the original MDP.
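Both identities can be sanity-checked numerically on a single reward sequence, truncated at a horizon where γ^T is negligible; taking expectations over such sample paths gives the Q statements above. A sketch with arbitrary rewards:
\begin{verbatim}
import random

def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

random.seed(0)
gamma, alpha, beta = 0.9, 2.0, 3.0
rewards = [random.uniform(-1, 1) for _ in range(500)]  # arbitrary path

g = discounted_return(rewards, gamma)
g_shifted = discounted_return([alpha + r for r in rewards], gamma)
g_scaled = discounted_return([beta * r for r in rewards], gamma)

print(g_shifted - g, alpha / (1 - gamma))  # both ~20.0 (part 1)
print(g_scaled, beta * g)  # equal up to floating-point error (part 2)
\end{verbatim}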
Solution.
1. Transition function:
$$\begin{aligned}
&P_{s,\text{Done}}^{\text{Stop}} = 1, \quad s \in \{0, 2, 3, 4, 5\}; \\
&P_{0,s'}^{\text{Draw}} = 0.5, \quad s' \in \{2, 3\}; \\
&P_{2,s'}^{\text{Draw}} = 0.5, \quad s' \in \{4, 5\}; \\
&P_{3,s'}^{\text{Draw}} = 0.5, \quad s' \in \{5, \text{Done}\}; \\
&P_{s,\text{Done}}^{\text{Draw}} = 1, \quad s \in \{4, 5\}.
\end{aligned}$$
Reward function:
$$R_s^{\text{Stop}} = s, \quad s \in \{0, 2, 3, 4, 5\}; \qquad R_s^a = 0, \text{ otherwise.}$$
2. Optimal action-value function:
$$\begin{aligned}
&Q_*(5, \text{Draw}) = 0, \quad Q_*(5, \text{Stop}) = 5; \\
&Q_*(4, \text{Draw}) = 0, \quad Q_*(4, \text{Stop}) = 4; \\
&Q_*(3, \text{Draw}) = 2.5, \quad Q_*(3, \text{Stop}) = 3; \\
&Q_*(2, \text{Draw}) = 4.5, \quad Q_*(2, \text{Stop}) = 2; \\
&Q_*(0, \text{Draw}) = 3.75, \quad Q_*(0, \text{Stop}) = 0.
\end{aligned}$$
Optimal state-value function:
$$V_*(5) = 5, \quad V_*(4) = 4, \quad V_*(3) = 3, \quad V_*(2) = 4.5, \quad V_*(0) = 3.75.$$
3. Optimal policy:
$$\pi_*(a \mid s) = \begin{cases} 1, & \text{if } a = \text{Draw}, \ s \in \{0, 2\}, \text{ or } a = \text{Stop}, \ s \in \{3, 4, 5\}, \\ 0, & \text{otherwise.} \end{cases}$$
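Since the transition and reward functions in part 1 fully specify the MDP, the numbers in part 2 can be double-checked with a short value-iteration sketch (γ = 1; Done is terminal):
\begin{verbatim}
# Value iteration on the MDP above: Stop from state s pays reward s
# and terminates; Draw pays 0 and moves according to P_DRAW.
STATES = [0, 2, 3, 4, 5]
P_DRAW = {
    0: [(0.5, 2), (0.5, 3)],
    2: [(0.5, 4), (0.5, 5)],
    3: [(0.5, 5), (0.5, "Done")],
    4: [(1.0, "Done")],
    5: [(1.0, "Done")],
}

v = {s: 0.0 for s in STATES}
v["Done"] = 0.0  # terminal state has value 0
for _ in range(10):  # more than enough sweeps for this tiny MDP
    for s in STATES:
        q_stop = float(s)
        q_draw = sum(p * v[s2] for p, s2 in P_DRAW[s])
        v[s] = max(q_stop, q_draw)

print({s: v[s] for s in STATES})
# {0: 3.75, 2: 4.5, 3: 3.0, 4: 4.0, 5: 5.0}
\end{verbatim}
The fixed point matches V_* above, and taking the argmax over the two action values at each state recovers π_*.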
[Figure 2: MDP with stochastic transitions]