
Markov Decision Processes (MDP)

Sudeshna Sarkar
Department of Computer Science & Engineering
IIT Kharagpur
6-7 Sep 2017
How would you get to the airport in the
least amount of time?
 Metro
 Uber
 Taxi
 Airport Express

2
Uncertainty in the real world
 Randomness shows up in many places.
 Could be caused by limitations of the sensors and actuators of the
robot
 Could be caused by market forces or nature, which we have no
control over.

 Taking action a in state s can lead to different successor
states s1’, s2’, …

 How can we hope to act optimally in the face of randomness?


 We certainly can't just have a single deterministic plan, and
talking about a minimum-cost path no longer makes sense.

3
Applications
 Robotics: decide where to move, but actuators can
fail, hit unseen obstacles, etc.

 Resource allocation: decide what to produce, but don't
know the customer demand for various products

 Agriculture: decide what to plant, but don't know the
weather and thus the crop yield

4
Volcano crossing

5
Dice Game
For each round r = 1, 2, …
 You choose stay or quit.
 If quit, you get $10 and we end the game.
 If stay, you get $4 and then I roll a 6-sided die.
 If the die comes up 1 or 2, we end the game.
 Otherwise, you continue to the next round.

6
MDP for Dice Game
For each round r = 1, 2, …
 You choose stay or quit.
 If quit, you get $10 and we end the game.
 If stay, you get $4 and then I roll a 6-sided die.
 If the die comes up 1 or 2, we end the game.
 Otherwise, you continue to the next round.

7
MDP
Markov Decision Processes

Decision Theoretic Planning

 Markov Property: The transition probabilities depend
only on the current state, not on the previous history
(how that state was reached)

8
MDP Model
MDP Model <S, A, T, R>
[Diagram: agent-environment loop; the agent observes the state and reward and sends an action to the environment. Example trajectory: s0, a0, r0, s1, a1, r1, s2, a2, r2, s3]
 State set S
 Action set A
 Markov transition function T(s,a,s’) = Pr(s’|s,a)
 Bounded real-valued reward function R(s)
• Can be generalized to include action costs: R(s,a)
• Can be generalized to be a stochastic function
Process:
• Observe state st in S
• Choose action at in A
• Receive immediate reward rt
• State changes to st+1
9
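The model and process above can be written down compactly in code. The sketch below is only illustrative: the class name MDP, the function run_episode, and the representation of T(s, a, ·) as a list of (s’, probability) pairs are my own assumptions, not anything from the slides; rewards here use the R(s, a) generalization mentioned on the slide.

```python
import random

# Minimal sketch of the <S, A, T, R> model and the observe/act/reward loop.
class MDP:
    def __init__(self, states, actions, transition, reward):
        self.states = states          # state set S
        self.actions = actions        # actions(s): available actions in state s
        self.transition = transition  # transition(s, a): list of (s', Pr(s'|s,a))
        self.reward = reward          # reward(s, a): real-valued reward

def run_episode(mdp, policy, start_state, max_steps=100):
    """Observe s_t, choose a_t, receive r_t, move to s_{t+1}; return total reward."""
    s, total = start_state, 0.0
    for _ in range(max_steps):
        if not mdp.actions(s):                 # no actions available: terminal state
            break
        a = policy(s)                          # choose action a_t
        total += mdp.reward(s, a)              # receive immediate reward r_t
        succ = mdp.transition(s, a)            # [(s', probability), ...]
        s = random.choices([sp for sp, _ in succ],
                           weights=[p for _, p in succ])[0]
    return total
```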
Similarities of MDP with Search?

10
Transitions
 The transition probabilities T(s, a, s’) specify the
probability of ending up in state s’ if action a is taken
in state s.
s a s’ T(s,a,s’)
in quit end 1
in stay in 2/3
in stay end 1/3
 For each state s and action a:
∑s’∈S T(s, a, s’) = 1
 Successors: s’ such that T(s, a, s’) > 0
11
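The dice-game transition table on this slide can be checked directly in code. The dictionary encoding below is just one convenient choice of my own, not the slides' notation.

```python
# Transition table keyed by (s, a); values map successor s' to T(s, a, s').
T = {
    ("in", "quit"): {"end": 1.0},
    ("in", "stay"): {"in": 2/3, "end": 1/3},
}

# For each state s and action a, the probabilities over successors sum to 1.
for (s, a), successors in T.items():
    assert abs(sum(successors.values()) - 1.0) < 1e-9, (s, a)

# Successors of (s, a): the states s' with T(s, a, s') > 0.
successors_of = {sa: [sp for sp, p in succ.items() if p > 0]
                 for sa, succ in T.items()}
print(successors_of)   # {('in', 'quit'): ['end'], ('in', 'stay'): ['in', 'end']}
```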
Exercise: Transportation problem
 Street with blocks numbered 1 to n.
 Walking from s to s + 1 takes 1 minute.
 Taking a magic tram from s to 2s takes 2 minutes.
 How to travel from 1 to n in the least time?
 Tram fails with probability 0.5.

12
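One way to formalize this exercise as an MDP is sketched below. It rests on my own assumptions, not on a solution given in the slides: blocks 1..n are the states, minutes are counted as negative rewards, and a failed tram ride still costs 2 minutes and leaves you at block s.

```python
def actions(s, n):
    """Actions available at block s on a street of n blocks."""
    acts = []
    if s + 1 <= n:
        acts.append("walk")   # deterministic: s -> s + 1, takes 1 minute
    if 2 * s <= n:
        acts.append("tram")   # stochastic: s -> 2s, takes 2 minutes, fails w.p. 0.5
    return acts

def transition(s, a):
    """List of (s', probability) pairs."""
    if a == "walk":
        return [(s + 1, 1.0)]
    return [(2 * s, 0.5), (s, 0.5)]   # assumed: tram failure leaves you at s

def reward(s, a):
    """Minutes spent, expressed as a negative reward."""
    return -1.0 if a == "walk" else -2.0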
What is a solution?
 Search problem: path (sequence of actions)
 MDP: ??
 MDP: Policy
 A policy π is a mapping from each state s ∈ States to
an action a ∈ Actions(s)

13
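For the dice game, the only non-terminal state is "in", so a policy is a one-entry table. The dictionary representation below is just one illustrative choice.

```python
# A policy maps each non-terminal state to an action.
policy_stay = {"in": "stay"}
policy_quit = {"in": "quit"}
```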
Evaluating a policy
 Following a policy yields a random path.
 The utility of a policy is the (discounted) sum of the
rewards on the path (this is a random quantity).
Path Utility
[in; stay, 4, end] 4
[in; stay, 4, in; stay, 4, in; stay, 4, end] 12
[in; stay, 4, in; stay, 4, end] 8
[in; stay, 4, in; stay, 4, in; stay, 4, in; stay, 4, end] 16
...
The value of a policy is the expected utility.

14
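To make "expected utility" concrete, here is a small simulation sketch of my own for the "stay" policy of the dice game, assuming no discounting: averaging the utilities of many random paths approximates the value, which analytically satisfies V = 4 + (2/3)V, i.e. V = 12.

```python
import random

def one_path_utility():
    """Utility of one random path under the 'stay' policy of the dice game."""
    utility = 0
    while True:
        utility += 4                    # reward for choosing stay
        if random.randint(1, 6) <= 2:   # die comes up 1 or 2: game ends
            return utility

# Estimate the value (expected utility) of the 'stay' policy by averaging.
n = 100_000
print(sum(one_path_utility() for _ in range(n)) / n)   # close to 12
```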
