Markov Decision Process Tutorial

Intro to AI 096210

Erez Karpas
Faculty of Industrial Engineering & Management
Technion

December 22, 2011


Markov Decision Process

- A Markov Decision Process (MDP) is a stochastic planning problem
- Stationary Markovian dynamics: the rewards and transitions depend only on the current state
- Fully observable: we might not know where we're going, but we always know where we are
- Decision-theoretic planning: we want to maximize expected reward


Markov Decision Process Formal Definition

- An MDP consists of a tuple $\langle S, A, R, T \rangle$:
  - $S$ is a finite set of states
  - $A$ is a finite set of actions
  - $R : S \to [0, r_{\max}]$ is the reward function (rewards are bounded)
  - $T : S \times A \times S \to [0, 1]$ is the transition function
- The probability of going from $s$ to $s'$ after applying $a$ is $T(s, a, s')$
- Where is the initial state?


Markov Decision Process Example

Shamelessly stolen from Andrew Moore


You run a startup company. In every decision period, you must choose between Saving money or Advertising.

- $S = \{Poor\&Unknown,\ Poor\&Famous,\ Rich\&Unknown,\ Rich\&Famous\}$
- $A = \{Save, Advertise\}$
- $R(s) = \begin{cases} 0 & \text{if } s = Poor\&X \\ 10 & \text{if } s = Rich\&X \end{cases}$
- $T$: see next slide


Markov Decision Process Graphic Example

[Figure: transition diagram for the startup example, giving the transition function $T$]
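Since the diagram itself is not reproduced here, the following sketch encodes the example as plain Python dictionaries. The transition probabilities are reconstructed from Andrew Moore's original slides (credited above as the source of this example) and are consistent with the value-iteration numbers later in the tutorial; treat them as an assumption rather than part of this document.

```python
STATES = ["PU", "PF", "RU", "RF"]   # Poor/Rich x Unknown/Famous
ACTIONS = ["S", "A"]                # Save, Advertise

R = {"PU": 0, "PF": 0, "RU": 10, "RF": 10}   # R(s): 0 if poor, 10 if rich

# T[(s, a)][s2] = probability of moving from s to s2 under action a
# (reconstructed from Andrew Moore's example, not from this extraction)
T = {
    ("PU", "S"): {"PU": 1.0},
    ("PU", "A"): {"PU": 0.5, "PF": 0.5},
    ("PF", "S"): {"PU": 0.5, "RF": 0.5},
    ("PF", "A"): {"PF": 1.0},
    ("RU", "S"): {"RU": 0.5, "PU": 0.5},
    ("RU", "A"): {"PU": 0.5, "PF": 0.5},
    ("RF", "S"): {"RF": 0.5, "RU": 0.5},
    ("RF", "A"): {"PF": 1.0},
}
```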


Markov Decision Process Solution

- How do we solve an MDP? What does a solution to an MDP look like?
- A solution to an MDP is a policy $\pi : S \to A$
  - Given that I'm in state $s$, I should apply action $\pi(s)$
  - This is why we need full observability
- What is an optimal policy?


Markov Decision Process Policy

- We can compute the expected value of following a fixed policy $\pi$ at state $s$:

  $$V^\pi(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s')\, V^\pi(s')$$

- $\gamma$ is a discount factor
  - It makes sure the infinite sum converges
  - It can also be explained by interest rates, mortality, ...
- Value is immediate reward plus discounted expected future reward
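As a concrete companion to the equation above, here is a minimal sketch of iterative policy evaluation, assuming the dictionary encoding (`STATES`, `R`, `T`) from the startup example earlier; solving the linear system directly would work just as well.

```python
def evaluate_policy(policy, gamma=0.9, eps=1e-6):
    """Iterate the fixed-policy Bellman equation until convergence.

    `policy` maps each state s to the action pi(s)."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {s: R[s] + gamma * sum(p * V[s2]
                                       for s2, p in T[(s, policy[s])].items())
                 for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < eps:
            return V_new
        V = V_new
```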


Markov Decision Process Optimal Policy Value

- An optimal policy $\pi^*$ maximizes $V^\pi(s)$ for all states
- Is the optimal policy unique? No
- Is the value of an optimal policy unique? Yes
- We denote the value of an optimal policy at state $s$ by $V^*(s)$
  - $V^*(s)$ is unique


Markov Decision Process Using V*

- If we know $V^*$, we can simply choose the best action for each state
- The best action maximizes:

  $$R(s) + \gamma \sum_{s'} T(s, a, s')\, V^*(s')$$

- So we want to find $V^*$
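As a sketch over the same dictionary encoding, the argmax above is a direct `max` with a key function; `V` here is any value table, e.g. the $V^*$ we are about to compute:

```python
def greedy_action(s, V, gamma=0.9):
    """Best action at s given a value function V."""
    return max(ACTIONS, key=lambda a: R[s] + gamma * sum(
        p * V[s2] for s2, p in T[(s, a)].items()))
```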


Markov Decision Process Value Iteration

- How do we find $V^*$? Value iteration:

  $$V^0(s) = R(s)$$
  $$V^t(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, V^{t-1}(s')$$

- Converges: $V^t \to V^*$
- Stop when $\max_s |V^t(s) - V^{t-1}(s)| < \epsilon$
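A minimal sketch of value iteration over the same encoding; the function name and the stopping threshold `eps` are choices made for this sketch, not part of the original slides.

```python
def value_iteration(gamma=0.9, eps=1e-6):
    V = {s: float(R[s]) for s in STATES}          # V^0(s) = R(s)
    while True:
        V_new = {s: R[s] + gamma * max(
                     sum(p * V[s2] for s2, p in T[(s, a)].items())
                     for a in ACTIONS)
                 for s in STATES}
        # Stop when max_s |V^t(s) - V^{t-1}(s)| < eps
        if max(abs(V_new[s] - V[s]) for s in STATES) < eps:
            return V_new
        V = V_new
```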


Markov Decision Process Value Iteration Example


$$V^0(s) = R(s)$$
$$V^t(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, V^{t-1}(s')$$

$\gamma = 0.9$

| t   | V^t(PU) | V^t(PF) | V^t(RU) | V^t(RF) |
|-----|---------|---------|---------|---------|
| 0   | 0       | 0       | 10      | 10      |
| 1   | 0       | 4.5     | 14.5    | 19      |
| 2   | 2.03    | 8.55    | 16.525  | 25.08   |
| 3   | 4.75    | 12.2    | 18.34   | 28.72   |
| 4   | 7.62    | 15.07   | 20.39   | 31.18   |
| 5   | 10.21   | 17.46   | 22.61   | 33.2    |
| ... | ...     | ...     | ...     | ...     |
| 100 | 31.58   | 38.6    | 44.02   | 54.2    |
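Running the value-iteration sketch from the previous slide on the reconstructed model reproduces the bottom row of this table, up to rounding in the last digit:

```python
V_star = value_iteration(gamma=0.9)
print({s: round(v, 2) for s, v in V_star.items()})
# approx. {'PU': 31.59, 'PF': 38.6, 'RU': 44.02, 'RF': 54.2}
```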


Markov Decision Process Policy from Values

$$\pi(s) := \arg\max_a \left[ R(s) + \gamma \sum_{s'} T(s, a, s')\, V^*(s') \right]$$

| s  | V*(s) | π(s) |
|----|-------|------|
| PU | 31.58 | A    |
| PF | 38.6  | S    |
| RU | 44.02 | S    |
| RF | 54.2  | S    |

Markov Decision Process Policy Iteration

Another algorithm to find an optimal policy:

1. Initialize a policy $\pi$ arbitrarily
2. Evaluate $V^\pi(s)$ for all states $s \in S$
3. $\pi'(s) := \arg\max_a \sum_{s'} T(s, a, s')\, V^\pi(s')$
4. If $\pi \neq \pi'$: set $\pi := \pi'$ and go to step 2
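A minimal sketch of the four steps above, reusing `evaluate_policy` from the policy slide; starting from all-Save is an arbitrary choice, as step 1 allows.

```python
def policy_iteration(gamma=0.9):
    policy = {s: "S" for s in STATES}             # 1. arbitrary initial policy
    while True:
        V = evaluate_policy(policy, gamma)        # 2. evaluate V^pi
        improved = {s: max(ACTIONS,               # 3. greedy improvement
                           key=lambda a: sum(p * V[s2]
                                             for s2, p in T[(s, a)].items()))
                    for s in STATES}
        if improved == policy:                    # 4. stop when policy is stable
            return policy, V
        policy = improved
```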


Markov Decision Process Policy Iteration Example

| t | π_t(PU) | π_t(PF) | π_t(RU) | π_t(RF) | V^t(PU) | V^t(PF) | V^t(RU) | V^t(RF) |
|---|---------|---------|---------|---------|---------|---------|---------|---------|
| 0 | A       | A       | A       | A       | 0       | 0       | 10      | 10      |
| 1 | A       | S       | S       | S       | 31.58   | 38.6    | 44.02   | 54.2    |
| 2 | A       | S       | S       | S       | Done    |         |         |         |

At t = 2 the policy is unchanged, so the algorithm terminates.

Value Iteration vs. Policy Iteration

- Which is better? It depends
- VI takes more iterations than PI, but PI requires more time per iteration
  - Lots of actions? PI
  - Already got a fair policy? PI
  - Few actions, acyclic? VI
- It is also possible to mix the two


Solving an MDP without the Model

- What if we do not have access to the model?
  - We don't know the transition probabilities $T$
  - We don't know the reward function $R$
- Then we can't compute a policy offline
  - We must choose an action online


Reinforcement Learning

- The model:
  - At every time step, the agent sees the current state $s$ and the applicable actions at $s$
  - After choosing an action to execute, the agent receives a reward
- There are many RL algorithms
  - We will focus on Q-learning


Q-Learning

- We define $Q : S \times A \to [0, r_{\max}]$
  - $Q(s, a)$ is the best value we can expect after taking action $a$ in state $s$:

  $$Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')$$

- $Q(s, a)$ is immediate reward plus discounted expected future reward if we choose the best action in the next state
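Just to make the recursion concrete, here is the definition written directly against the model (a sketch over the same encoding; the point of the next slides is that in RL we do not have $T$ and $R$ to compute this):

```python
def q_backup(s, a, Q, gamma=0.9):
    """One Bellman backup of Q(s, a) from the current estimates in Q."""
    return R[s] + gamma * sum(p * max(Q[(s2, a2)] for a2 in ACTIONS)
                              for s2, p in T[(s, a)].items())
```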


Learning Q

- Suppose our agent performed action $a$ in state $s$
  - It moved to some state $s'$, and got some reward $R(s)$
- We can update $Q(s, a)$:

  $$Q(s, a) := Q(s, a) + \alpha \left( R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$$

- $\alpha$ is the learning rate: how much weight to give new vs. past knowledge
- Under some (realistic?) assumptions, Q-learning will converge to the optimal $Q$
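A minimal tabular Q-learning sketch. The environment is simulated here with the example's `T` and `R` so the snippet is self-contained, but the learner itself only ever sees the sampled next state and reward; the uniformly random action choice is a crude placeholder for the exploration policies discussed next.

```python
import random

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def env_step(s, a):
    """Simulated environment: sample s' ~ T(s, a, .) and emit reward R(s)."""
    succ, probs = zip(*T[(s, a)].items())
    return random.choices(succ, probs)[0], R[s]

def q_learn(steps=100_000, alpha=0.1, gamma=0.9):
    s = "PU"
    for _ in range(steps):
        a = random.choice(ACTIONS)                # crude exploration
        s_next, reward = env_step(s, a)
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next
```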


Q-Learning: Exploration/Exploitation

- Suppose we're in the middle of Q-learning
  - We're at state $s$
  - We have some estimate of $Q(s, a)$ for every applicable action $a$
- Which action should we choose?
  - We can choose an action greedily: the one which maximizes $Q(s, a)$
  - But we might not know about the best action yet, and miss out
- We want a policy that is greedy in the limit of infinite exploration (GLIE)


GLIE Policies

- Need to make exploitation more likely as more knowledge is gained
- One of the most popular GLIE policies is Boltzmann exploration
  - Choose action $a$ with probability proportional to $e^{Q(s,a)/T}$
  - $T$ is the temperature, which decreases with time