Markov Decision Process Tutorial

Intro to AI 096210

Erez Karpas
Faculty of Industrial Engineering & Management
Technion

December 22, 2011


Markov Decision Process

- A Markov Decision Process (MDP) is a stochastic planning problem
- Stationary Markovian dynamics: the rewards and transitions depend only on the current state
- Fully observable: we might not know where we're going, but we always know where we are
- Decision-theoretic planning: we want to maximize expected reward


Markov Decision Process Formal Definition

- An MDP consists of a tuple $\langle S, A, R, T \rangle$:
  - $S$ is a finite set of states
  - $A$ is a finite set of actions
  - $R : S \to [0, r_{\max}]$ is the reward function (rewards are bounded)
  - $T : S \times A \times S \to [0, 1]$ is the transition function
- The probability of going from $s$ to $s'$ after applying $a$ is $T(s, a, s')$
- Where is the initial state?


Markov Decision Process Example

Shamelessly stolen from Andrew Moore


You run a startup company. In every decision period, you must choose between Saving money or Advertising.

- $S = \{Poor\&Unknown,\ Poor\&Famous,\ Rich\&Unknown,\ Rich\&Famous\}$
- $A = \{Save, Advertise\}$
- $R(s) = \begin{cases} 0 & \text{if } s = Poor\&X \\ 10 & \text{if } s = Rich\&X \end{cases}$
- $T$: see next slide


Markov Decision Process Graphic Example

[Figure: transition diagram for the startup example, giving the transition function $T$]
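Since the diagram itself is not reproduced here, the following sketch encodes the example as plain Python dictionaries. The transition probabilities are reconstructed from Andrew Moore's original slides (credited above as the source of this example) and are consistent with the value-iteration numbers later in the tutorial; treat them as an assumption rather than part of this document.

```python
STATES = ["PU", "PF", "RU", "RF"]   # Poor/Rich x Unknown/Famous
ACTIONS = ["S", "A"]                # Save, Advertise

R = {"PU": 0, "PF": 0, "RU": 10, "RF": 10}   # R(s): 0 if poor, 10 if rich

# T[(s, a)][s2] = probability of moving from s to s2 under action a
# (reconstructed from Andrew Moore's example, not from this extraction)
T = {
    ("PU", "S"): {"PU": 1.0},
    ("PU", "A"): {"PU": 0.5, "PF": 0.5},
    ("PF", "S"): {"PU": 0.5, "RF": 0.5},
    ("PF", "A"): {"PF": 1.0},
    ("RU", "S"): {"RU": 0.5, "PU": 0.5},
    ("RU", "A"): {"PU": 0.5, "PF": 0.5},
    ("RF", "S"): {"RF": 0.5, "RU": 0.5},
    ("RF", "A"): {"PF": 1.0},
}
```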


Markov Decision Process Solution

- How do we solve an MDP? What does a solution to an MDP look like?
- A solution to an MDP is a policy $\pi : S \to A$
  - Given that I'm in state $s$, I should apply action $\pi(s)$
  - This is why we need full observability
- What is an optimal policy?


Markov Decision Process Policy

- We can compute the expected value of following a fixed policy $\pi$ at state $s$:

  $$V^\pi(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s')\, V^\pi(s')$$

- $\gamma$ is a discount factor
  - It makes sure the infinite sum converges
  - It can also be explained by interest rates, mortality, ...
- Value is immediate reward plus discounted expected future reward
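As a concrete companion to the equation above, here is a minimal sketch of iterative policy evaluation, assuming the dictionary encoding (`STATES`, `R`, `T`) from the startup example earlier; solving the linear system directly would work just as well.

```python
def evaluate_policy(policy, gamma=0.9, eps=1e-6):
    """Iterate the fixed-policy Bellman equation until convergence.

    `policy` maps each state s to the action pi(s)."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {s: R[s] + gamma * sum(p * V[s2]
                                       for s2, p in T[(s, policy[s])].items())
                 for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < eps:
            return V_new
        V = V_new
```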


Markov Decision Process Optimal Policy Value

- An optimal policy $\pi^*$ maximizes $V^\pi(s)$ for all states
- Is the optimal policy unique? No
- Is the value of an optimal policy unique? Yes
- We denote the value of an optimal policy at state $s$ by $V^*(s)$
  - $V^*(s)$ is unique


Markov Decision Process Using V*

- If we know $V^*$, we can simply choose the best action for each state
- The best action maximizes:

  $$R(s) + \gamma \sum_{s'} T(s, a, s')\, V^*(s')$$

- So we want to find $V^*$
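As a sketch over the same dictionary encoding, the argmax above is a direct `max` with a key function; `V` here is any value table, e.g. the $V^*$ we are about to compute:

```python
def greedy_action(s, V, gamma=0.9):
    """Best action at s given a value function V."""
    return max(ACTIONS, key=lambda a: R[s] + gamma * sum(
        p * V[s2] for s2, p in T[(s, a)].items()))
```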


Markov Decision Process Value Iteration

- How do we find $V^*$? Value iteration:

  $$V^0(s) = R(s)$$
  $$V^t(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, V^{t-1}(s')$$

- Converges: $V^t \to V^*$
- Stop when $\max_s |V^t(s) - V^{t-1}(s)| < \epsilon$
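A minimal sketch of value iteration over the same encoding; the function name and the stopping threshold `eps` are choices made for this sketch, not part of the original slides.

```python
def value_iteration(gamma=0.9, eps=1e-6):
    V = {s: float(R[s]) for s in STATES}          # V^0(s) = R(s)
    while True:
        V_new = {s: R[s] + gamma * max(
                     sum(p * V[s2] for s2, p in T[(s, a)].items())
                     for a in ACTIONS)
                 for s in STATES}
        # Stop when max_s |V^t(s) - V^{t-1}(s)| < eps
        if max(abs(V_new[s] - V[s]) for s in STATES) < eps:
            return V_new
        V = V_new
```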


Markov Decision Process Value Iteration Example


$$V^0(s) = R(s)$$
$$V^t(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, V^{t-1}(s')$$

$\gamma = 0.9$

| t   | V^t(PU) | V^t(PF) | V^t(RU) | V^t(RF) |
|-----|---------|---------|---------|---------|
| 0   | 0       | 0       | 10      | 10      |
| 1   | 0       | 4.5     | 14.5    | 19      |
| 2   | 2.03    | 8.55    | 16.525  | 25.08   |
| 3   | 4.75    | 12.2    | 18.34   | 28.72   |
| 4   | 7.62    | 15.07   | 20.39   | 31.18   |
| 5   | 10.21   | 17.46   | 22.61   | 33.2    |
| ... | ...     | ...     | ...     | ...     |
| 100 | 31.58   | 38.6    | 44.02   | 54.2    |
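Running the value-iteration sketch from the previous slide on the reconstructed model reproduces the bottom row of this table, up to rounding in the last digit:

```python
V_star = value_iteration(gamma=0.9)
print({s: round(v, 2) for s, v in V_star.items()})
# approx. {'PU': 31.59, 'PF': 38.6, 'RU': 44.02, 'RF': 54.2}
```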


Markov Decision Process Policy from Values

$$\pi(s) := \arg\max_a \left[ R(s) + \gamma \sum_{s'} T(s, a, s')\, V^*(s') \right]$$

| s  | V*(s) | π(s) |
|----|-------|------|
| PU | 31.58 | A    |
| PF | 38.6  | S    |
| RU | 44.02 | S    |
| RF | 54.2  | S    |

Markov Decision Process Policy Iteration

Another algorithm to find an optimal policy:

1. Initialize a policy $\pi$ arbitrarily
2. Evaluate $V^\pi(s)$ for all states $s \in S$
3. $\pi'(s) := \arg\max_a \sum_{s'} T(s, a, s')\, V^\pi(s')$
4. If $\pi \neq \pi'$: set $\pi := \pi'$ and go to step 2
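A minimal sketch of the four steps above, reusing `evaluate_policy` from the policy slide; starting from all-Save is an arbitrary choice, as step 1 allows.

```python
def policy_iteration(gamma=0.9):
    policy = {s: "S" for s in STATES}             # 1. arbitrary initial policy
    while True:
        V = evaluate_policy(policy, gamma)        # 2. evaluate V^pi
        improved = {s: max(ACTIONS,               # 3. greedy improvement
                           key=lambda a: sum(p * V[s2]
                                             for s2, p in T[(s, a)].items()))
                    for s in STATES}
        if improved == policy:                    # 4. stop when policy is stable
            return policy, V
        policy = improved
```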


Markov Decision Process Policy Iteration Example

| t | π_t(PU) | π_t(PF) | π_t(RU) | π_t(RF) | V^t(PU) | V^t(PF) | V^t(RU) | V^t(RF) |
|---|---------|---------|---------|---------|---------|---------|---------|---------|
| 0 | A       | A       | A       | A       | 0       | 0       | 10      | 10      |
| 1 | A       | S       | S       | S       | 31.58   | 38.6    | 44.02   | 54.2    |
| 2 | A       | S       | S       | S       | Done    |         |         |         |

At t = 2 the policy is unchanged, so the algorithm terminates.

Value Iteration vs. Policy Iteration

- Which is better? It depends
- VI takes more iterations than PI, but PI requires more time per iteration
  - Lots of actions? PI
  - Already got a fair policy? PI
  - Few actions, acyclic? VI
- It is also possible to mix the two


Solving an MDP without the Model

- What if we do not have access to the model?
  - We don't know the transition probabilities $T$
  - We don't know the reward function $R$
- Then we can't compute a policy offline
  - We must choose an action online


Reinforcement Learning

- The model:
  - At every time step, the agent sees the current state $s$ and the applicable actions at $s$
  - After choosing an action to execute, the agent receives a reward
- There are many RL algorithms
  - We will focus on Q-learning


Q-Learning

- We define $Q : S \times A \to [0, r_{\max}]$
  - $Q(s, a)$ is the best value we can expect after taking action $a$ in state $s$:

  $$Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')$$

- $Q(s, a)$ is immediate reward plus discounted expected future reward if we choose the best action in the next state
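Just to make the recursion concrete, here is the definition written directly against the model (a sketch over the same encoding; the point of the next slides is that in RL we do not have $T$ and $R$ to compute this):

```python
def q_backup(s, a, Q, gamma=0.9):
    """One Bellman backup of Q(s, a) from the current estimates in Q."""
    return R[s] + gamma * sum(p * max(Q[(s2, a2)] for a2 in ACTIONS)
                              for s2, p in T[(s, a)].items())
```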


Learning Q

- Suppose our agent performed action $a$ in state $s$
  - It moved to some state $s'$, and got some reward $R(s)$
- We can update $Q(s, a)$:

  $$Q(s, a) := Q(s, a) + \alpha \left( R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$$

- $\alpha$ is the learning rate: how much weight to give new vs. past knowledge
- Under some (realistic?) assumptions, Q-learning will converge to the optimal $Q$
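A minimal tabular Q-learning sketch. The environment is simulated here with the example's `T` and `R` so the snippet is self-contained, but the learner itself only ever sees the sampled next state and reward; the uniformly random action choice is a crude placeholder for the exploration policies discussed next.

```python
import random

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def env_step(s, a):
    """Simulated environment: sample s' ~ T(s, a, .) and emit reward R(s)."""
    succ, probs = zip(*T[(s, a)].items())
    return random.choices(succ, probs)[0], R[s]

def q_learn(steps=100_000, alpha=0.1, gamma=0.9):
    s = "PU"
    for _ in range(steps):
        a = random.choice(ACTIONS)                # crude exploration
        s_next, reward = env_step(s, a)
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next
```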


Q-Learning: Exploration/Exploitation

- Suppose we're in the middle of Q-learning
  - We're at state $s$
  - We have some estimate of $Q(s, a)$ for every applicable action $a$
- Which action should we choose?
  - We can choose an action greedily: the one which maximizes $Q(s, a)$
  - But we might not know about the best action yet, and miss out
- We want a policy that is greedy in the limit of infinite exploration (GLIE)


GLIE Policies

- Need to make exploitation more likely as more knowledge is gained
- One of the most popular GLIE policies is Boltzmann exploration
  - Choose action $a$ with probability proportional to $e^{Q(s,a)/T}$
  - $T$ is the temperature, which decreases with time