Markov Decision Processes (MDP) : Sudeshna Sarkar
Sudeshna Sarkar
Department of Computer Science & Engineering
IIT Kharagpur
6-7 Sep 2017
How would you get to the airport in the
least amount of time?
Metro
Uber
Taxi
Airport Express
Uncertainty in the real world
Randomness shows up in many places.
• It can be caused by limitations of the robot's sensors and actuators.
• It can be caused by market forces or nature, over which we have no control.
[Figure: taking action a in state s can lead to any of several successor states s1’, s2’, …]
Applications
Robotics: decide where to move, but actuators can fail, the robot can
hit unseen obstacles, etc.
Volcano crossing
Dice Game
For each round r = 1, 2, …
You choose stay or quit.
If quit, you get $10 and we end the game.
If stay, you get $4 and then I roll a 6-sided die.
If the die shows 1 or 2, we end the game.
Otherwise, you continue to the next round.
MDP for Dice Game
For each round r = 1, 2, …
You choose stay or quit.
If quit, you get $10 and we end the game.
If stay, you get $4 and then I roll a 6-sided die.
If the die shows 1 or 2, we end the game.
Otherwise, you continue to the next round.
MDP
Markov Decision Processes
MDP Model
An MDP is a tuple <S, A, T, R>:
• State set S
• Action set A
• Markov transition function T(s,a,s’) = Pr(s’|s,a)
• Bounded real-valued reward function R(s)
  • Can be generalized to include action costs: R(s,a)
  • Can be generalized to be a stochastic function
[Figure: agent-environment loop with labels State, Reward, Action; timeline of states s0 to s3, actions a0 to a2, and rewards r0 to r2]
Process:
• Observe state s_t in S
• Choose action a_t in A_t
• Receive immediate reward r_t
• State changes to s_{t+1}
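The tuple and the four-step process above can be sketched in Python. This is a minimal illustration rather than anything taken from the slides: the `MDP` class, its field names, and `run_episode` are hypothetical, the reward uses the generalized form R(s,a), and the dice game from the earlier slide serves as the example instance.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    """Hypothetical container for <S, A, T, R>; all names are illustrative."""
    states: set
    actions: dict  # state -> list of actions available in that state
    T: dict        # (s, a) -> {s': Pr(s' | s, a)}
    R: dict        # (s, a) -> immediate reward (the R(s,a) generalization)


# Dice game: "in" means still playing, "end" means the game is over.
dice = MDP(
    states={"in", "end"},
    actions={"in": ["stay", "quit"], "end": []},
    T={("in", "quit"): {"end": 1.0},
       ("in", "stay"): {"in": 2 / 3, "end": 1 / 3}},
    R={("in", "quit"): 10, ("in", "stay"): 4},
)


def run_episode(mdp, choose, s, rng, max_steps=1000):
    """Observe s_t, choose a_t, receive r_t, move to s_{t+1}; repeat."""
    trajectory = []
    for _ in range(max_steps):
        if not mdp.actions[s]:      # no available actions: terminal state
            break
        a = choose(s)               # choose action a_t
        r = mdp.R[(s, a)]           # receive immediate reward r_t
        successors = mdp.T[(s, a)]
        s_next = rng.choices(list(successors),
                             weights=list(successors.values()))[0]
        trajectory.append((s, a, r))
        s = s_next                  # state changes to s_{t+1}
    return trajectory, s


trajectory, final_state = run_episode(dice, lambda s: "quit", "in",
                                      random.Random(0))
```

Here `choose` is any function from a state to an action; passing `lambda s: "stay"` instead produces a path of random length.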
Similarities of MDP with Search?
Transitions
The transition probabilities T(s, a, s’) specify the
probability of ending up in state s’ when action a is taken in
state s.
s     a     s’    T(s,a,s’)
in    quit  end   1
in    stay  in    2/3
in    stay  end   1/3
For each state s and action a:
$\sum_{s' \in S} T(s, a, s') = 1$
Successors: $s'$ such that $T(s, a, s') > 0$
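The normalization condition and the definition of successors can be checked mechanically for the dice-game table. A small sketch: the dictionary layout and the helper names `is_normalized` and `successors` are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict
from fractions import Fraction

# Rows of the transition table: (s, a, s') -> T(s, a, s')
T = {
    ("in", "quit", "end"): Fraction(1),
    ("in", "stay", "in"): Fraction(2, 3),
    ("in", "stay", "end"): Fraction(1, 3),
}


def is_normalized(T):
    """True if, for every (s, a) pair, the probabilities over s' sum to 1."""
    totals = defaultdict(Fraction)  # Fraction() is 0
    for (s, a, _s_next), p in T.items():
        totals[(s, a)] += p
    return all(total == 1 for total in totals.values())


def successors(T, s, a):
    """Sorted list of states s' with T(s, a, s') > 0."""
    return sorted(s2 for (s1, a1, s2), p in T.items()
                  if (s1, a1) == (s, a) and p > 0)
```

Exact fractions are used so that 2/3 + 1/3 compares equal to 1 without floating-point tolerance.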
Exercise: Transportation problem
Street with blocks numbered 1 to n.
Walking from s to s + 1 takes 1 minute.
Taking a magic tram from s to 2s takes 2 minutes.
How to travel from 1 to n in the least time?
Tram fails with probability 0.5.
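One possible way to write this exercise down as an MDP is sketched below. The exercise does not say what happens when the tram fails; the sketch assumes the traveler stays at block s and still loses the 2 minutes. Rewards are negative travel times, so maximizing reward minimizes time. The function names and action labels are illustrative.

```python
def actions(s, n):
    """Actions available at block s on a street with blocks 1..n."""
    result = []
    if s + 1 <= n:
        result.append("walk")
    if 2 * s <= n:
        result.append("tram")
    return result


def succ_prob_reward(s, a, n, fail_prob=0.5):
    """Return a list of (s', probability, reward) triples for action a in s.

    Assumption (not stated in the exercise): a failed tram leaves the
    traveler at block s and still costs 2 minutes.
    """
    if a not in actions(s, n):
        raise ValueError(f"action {a!r} is not available at block {s}")
    if a == "walk":
        return [(s + 1, 1.0, -1.0)]
    return [(2 * s, 1.0 - fail_prob, -2.0), (s, fail_prob, -2.0)]
```

Block n has no available actions, which makes it the terminal state.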
What is a solution?
Search problem: path (sequence of actions)
MDP: ??
MDP: Policy
A policy 𝜋 is a mapping from each state s ∈ States to
an action 𝑎 ∈ Actions(𝑠)
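For a finite MDP, a policy can be stored as a plain lookup table. A sketch for the dice game, where the dictionary names and the `is_valid_policy` helper are illustrative assumptions:

```python
# Actions(s) for the dice game; the terminal state "end" has no actions.
ACTIONS = {"in": ["stay", "quit"], "end": []}

# Two deterministic policies: each maps every non-terminal state to one action.
always_stay = {"in": "stay"}
always_quit = {"in": "quit"}


def is_valid_policy(policy, actions):
    """Check that the policy picks an allowed action in every state that has one."""
    return all(
        policy.get(s) in allowed
        for s, allowed in actions.items()
        if allowed
    )
```

Unlike a search solution, which is a fixed sequence of actions, this table says what to do in whichever state the random transitions lead to.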
Evaluating a policy
Following a policy yields a random path.
The utility of a policy is the (discounted) sum of the
rewards on the path (this is a random quantity).
Path                                                          Utility
[in; stay, 4, end]                                            4
[in; stay, 4, in; stay, 4, in; stay, 4, end]                  12
[in; stay, 4, in; stay, 4, end]                               8
[in; stay, 4, in; stay, 4, in; stay, 4, in; stay, 4, end]     16
...
The value of a policy is the expected utility.
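Because the utility is random, the value can be estimated by averaging the utilities of many sampled paths. Below is a Monte Carlo sketch for the dice game with no discounting (discount factor 1); the function names, the seed, and the number of episodes are arbitrary choices made for illustration.

```python
import random


def sample_utility(policy, rng, discount=1.0):
    """Follow `policy` from state "in" until the game ends; return the path's utility."""
    utility, weight, state = 0.0, 1.0, "in"
    while state == "in":
        if policy[state] == "quit":
            utility += weight * 10
            state = "end"
        else:  # stay
            utility += weight * 4
            # The game ends if the die shows 1 or 2 (probability 1/3).
            state = "end" if rng.randint(1, 6) <= 2 else "in"
        weight *= discount
    return utility


def estimate_value(policy, episodes=100_000, seed=0):
    """Average utility over many sampled paths: an estimate of the policy's value."""
    rng = random.Random(seed)
    total = sum(sample_utility(policy, rng) for _ in range(episodes))
    return total / episodes


stay_value = estimate_value({"in": "stay"})
quit_value = estimate_value({"in": "quit"})
```

For the always-stay policy the expectation can also be computed by hand: a path with k stays has probability (2/3)^(k-1) (1/3) and utility 4k, and the expected number of stays is 3, so the value is 4 × 3 = 12. Quitting immediately is worth exactly 10, and the estimate for staying should land close to 12.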