
COL333: Artificial Intelligence

Practice Problem Set

Instructor: Mausam
Subject: MDPs, POMDPs

Problem 1
Most techniques for Markov Decision Processes focus on calculating V*(s), the maximum expected utility of state s, generally using the following Bellman Backup Equation:

V^*(s) = \max_a \sum_{s'} T(s, a, s') \, [R(s, a, s') + \gamma V^*(s')].
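As a point of reference (this is not part of the original problem), a minimal Python sketch of value iteration built around this backup might look as follows; the nested-dictionary layout T[s][a][s2] and R[s][a][s2], the state list, and the parameter defaults are all illustrative assumptions.

# Minimal value-iteration sketch around the Bellman backup above.
# Assumed (not given in the problem): T[s][a][s2] and R[s][a][s2] are nested
# dicts holding transition probabilities and rewards; `states` lists all states.
def value_iteration(states, T, R, gamma=0.9, n_iters=100):
    V = {s: 0.0 for s in states}          # start from V_0(s) = 0
    for _ in range(n_iters):
        V_new = {}
        for s in states:
            # Bellman backup: best expected one-step return over actions
            V_new[s] = max(
                sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in T[s][a])
                for a in T[s]
            )
        V = V_new
    return V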

In this question, instead of measuring the quality of a policy by its expected utility, we will consider the worst-case utility as our measure of quality. Concretely, L^π(s) is the minimum utility it is possible to attain over all state-action sequences that can result from executing the policy π, starting from state s. The value of the optimal worst-case policy is then L*(s) = max_π L^π(s). In other words, L*(s) is the greatest lower bound on the utility of state s.
Let C(s, a) be the set of all states that the agent has a non-zero probability of transitioning to from state s using action a. Formally, C(s, a) = {s' | T(s, a, s') > 0}.
(a) Express L∗ (s) in a recursive form similar to the Bellman Backup Equation.
(b) Recall that the Bellman update for value iteration is:

V_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \, [R(s, a, s') + \gamma V_i(s')]

Formally define an update for calculating L_{i+1}(s) using L_i, where L_i(s) is the i-th estimate of the worst-case utility of being in state s.
(c) Assume that rewards are a function of state, i.e. R(s, a, s') = R(s), and that R(s) ≥ 0 for all s. With these assumptions, the Bellman Backup for the Q-function is

Q^*(s, a) = R(s) + \sum_{s'} T(s, a, s') \, [\gamma \max_{a'} Q^*(s', a')]

Let M(s, a) be the greatest lower bound on the utility of state s when taking action a. Informally, M is to L as
Q is to V. Formally define M*(s, a) in a recursive form, similar to how Q* is defined.
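Under the same assumption R(s, a, s') = R(s), a single application of the Q backup above can be sketched as below (again purely illustrative, reusing the assumed dictionary layout of the value-iteration sketch, with a state-indexed reward dict R[s]).

# One Q backup under R(s, a, s') = R(s). Assumed layout: T[s][a][s2] as above,
# R[s] is the state reward, Q[s][a] holds the previous iterate.
def q_backup(s, a, T, R, Q, gamma=0.9):
    return R[s] + sum(T[s][a][s2] * gamma * max(Q[s2].values()) for s2 in T[s][a])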

Problem 2
Assume you are asked to act in a given MDP (S, A, T, R, γ, s_0). However, rather than being able to freely choose
your actions, at each time step you must start by flipping a coin. If the coin lands heads, then you can freely choose
your action. If the coin lands tails, however, you don't get to choose an action; instead, an action is chosen for you
uniformly at random from the available actions. Can you specify a modified MDP (S', A', T', R', γ', s'_0) for which
the optimal policy maximises the expected discounted sum of rewards under the specified restrictions on your ability
to choose actions?
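To make the restriction concrete, the sketch below simulates a single time step under the coin-flip rule (the coin is assumed fair here, and the helper names, dictionary layout, and policy interface are illustrative assumptions, not part of the problem).

import random

# Simulate one step of the restricted decision process.
# Assumed: `actions` lists the available actions, `policy(s)` returns the
# agent's intended action, and T[s][a] maps next states to probabilities.
def step_with_coin(s, actions, policy, T):
    if random.random() < 0.5:      # heads: the agent chooses freely
        a = policy(s)
    else:                          # tails: a uniformly random action is forced
        a = random.choice(actions)
    next_states, probs = zip(*T[s][a].items())
    s_next = random.choices(next_states, weights=probs)[0]
    return a, s_next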

Problem 3
Consider the undiscounted MDP with states S = {A, B}, actions A = {1, 2, 3}, and transition and reward functions
as given in Tables 1 and 2 respectively. Fill in the values for V_0^*, V_1^*, V_2^*, Q_1^*, Q_2^* in the graph given below (Figure 1). You
can assume that V_0^*(A) = 0 and V_0^*(B) = 0.


s   a        s' = A    s' = B
A   1        0         1
A   2        1         0
A   3        0.5       0.5
B   1        0.5       0.5
B   2        1         0
B   3        0.5       0.5

Table 1 – Transition Function

s   a        s' = A    s' = B
A   1        0         0
A   2        1         0
A   3        0         0
B   1        10        0
B   2        0         0
B   3        2         4

Table 2 – Reward Function

Figure 1 – Value Iteration
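If you want to check your hand calculations, Tables 1 and 2 can be encoded as nested dictionaries in the layout assumed by the value-iteration sketch from Problem 1 (this encoding is a convenience, not part of the problem statement).

# Tables 1 and 2 in the assumed T[s][a][s2] / R[s][a][s2] layout.
T = {
    'A': {1: {'A': 0.0, 'B': 1.0}, 2: {'A': 1.0, 'B': 0.0}, 3: {'A': 0.5, 'B': 0.5}},
    'B': {1: {'A': 0.5, 'B': 0.5}, 2: {'A': 1.0, 'B': 0.0}, 3: {'A': 0.5, 'B': 0.5}},
}
R = {
    'A': {1: {'A': 0.0, 'B': 0.0}, 2: {'A': 1.0, 'B': 0.0}, 3: {'A': 0.0, 'B': 0.0}},
    'B': {1: {'A': 10.0, 'B': 0.0}, 2: {'A': 0.0, 'B': 0.0}, 3: {'A': 2.0, 'B': 4.0}},
}
# The MDP is undiscounted, so use gamma = 1; two sweeps give V_1 and V_2:
# V = value_iteration(['A', 'B'], T, R, gamma=1.0, n_iters=2)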

Problem 4
In this question, we consider the famous problem of a tiger behind two closed doors. The agent has to make a
potentially life-threatening decision of choosing between two doors. Behind one door is a tiger, whereas behind the
other are some samosas. At each time instant, the agent must perform one of the following actions: open the door
on the left, open the door on the right, or simultaneously listen to the sounds coming from behind both doors. After
opening a door (and dealing with the repercussions), the game resets: both doors are closed, and the tiger is placed
behind either door with equal likelihood. The goal of the problem is to maximise the utility of the agent, which in
this case is its chances of survival. For this scenario, a POMDP is defined in the following manner:
— State space S = {s_l, s_r}, where in s_l the tiger is behind the left door, and in s_r it is behind the right door.
— Actions A = {l, r, listen}. l and r open the left and right doors respectively, while listen allows the agent to
simultaneously listen to the sounds coming from behind both doors; its purpose is to listen for growls.
— If the agent chooses the wrong door, then a penalty of −100 is incurred, whereas a reward of +10 is obtained
for opening the correct door. Listening has a fixed cost of −1. More formally, the reward function R(s, a) is
defined as:

s      a = l    a = r    a = listen
s_l    -100     10       -1
s_r    10       -100     -1

— The transition function T(s, a, s') and the observation function O(s, a) are defined in Tables 3 and 4 below.
Here, observation TL corresponds to hearing growling behind the left door, and TR corresponds to hearing
growling behind the right door.

s      a         s' = s_l   s' = s_r
s_l    l         0.5        0.5
s_l    r         0.5        0.5
s_l    listen    1          0
s_r    l         0.5        0.5
s_r    r         0.5        0.5
s_r    listen    0          1

Table 3 – Transition Function

s      a         TL      TR
s_l    l         0.5     0.5
s_l    r         0.5     0.5
s_l    listen    0.85    0.15
s_r    l         0.5     0.5
s_r    r         0.5     0.5
s_r    listen    0.15    0.85

Table 4 – Observation Function

Note that the transition probabilities are uniform after taking action l or r in either state: this corresponds to the
game being reset after a door is opened. Also, both observations TL and TR are equiprobable after actions l and r,
since opening a door is an uninformed action and the tiger is then equally likely to be behind either door.
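Before attempting the questions, it may help to see how a belief over {s_l, s_r} changes after listening. The sketch below implements the standard POMDP belief update b'(s') ∝ O(s', a)(obs) · Σ_s T(s, a, s') b(s), specialised to the listen action with the numbers from Tables 3 and 4; reading the s column of Table 4 as the resulting state, and the dictionary layout itself, are interpretive assumptions.

# Belief update for the tiger POMDP after the `listen` action.
T = {('sl', 'listen'): {'sl': 1.0, 'sr': 0.0},
     ('sr', 'listen'): {'sl': 0.0, 'sr': 1.0}}
O = {('sl', 'listen'): {'TL': 0.85, 'TR': 0.15},
     ('sr', 'listen'): {'TL': 0.15, 'TR': 0.85}}

def belief_update(b, a, obs):
    # Unnormalised posterior over resulting states, then normalise.
    post = {s2: O[(s2, a)][obs] * sum(T[(s, a)][s2] * b[s] for s in b) for s2 in b}
    z = sum(post.values())               # = P(obs | b, a)
    return {s: p / z for s, p in post.items()}

# Example: uniform initial belief, then a growl is heard behind the left door.
print(belief_update({'sl': 0.5, 'sr': 0.5}, 'listen', 'TL'))   # ≈ {'sl': 0.85, 'sr': 0.15}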

(a) Find the optimal policy for this POMDP at time t = 1 and at time t = 2.
(b) Find the expected utility of each action at time t = 1 and at time t = 2.
