Markov Decision Process
Intro to AI 096210
Erez Karpas
Faculty of Industrial Engineering & Management
Technion
Fully observable
We might not know where we're going, but we always know where we are
An MDP consists of ⟨S, A, R, T⟩
S is a finite set of states
A is a finite set of actions
R : S → [0, r_max] is the reward function
Rewards are bounded
T : S × A × S → [0, 1] is the transition function: T(s, a, s') is the probability of reaching s' when executing a in s
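As a concrete (hypothetical) encoding, an MDP of this shape can be held in plain Python dictionaries; the states, actions, rewards, and probabilities below are illustrative only, not the lecture's actual model:

```python
# Minimal MDP encoding as plain dictionaries. The two-state "poor"/"rich"
# example and all probabilities here are hypothetical placeholders.
S = ["poor", "rich"]
A = ["advertise", "save"]

# R : S -> [0, r_max]; rewards depend only on the state and are bounded.
R = {"poor": 0.0, "rich": 10.0}

# T(s, a, s') = probability of reaching s' when executing a in s.
T = {
    ("poor", "advertise"): {"poor": 0.5, "rich": 0.5},
    ("poor", "save"):      {"poor": 1.0, "rich": 0.0},
    ("rich", "advertise"): {"poor": 0.5, "rich": 0.5},
    ("rich", "save"):      {"poor": 0.0, "rich": 1.0},
}

# Sanity check: every transition distribution sums to 1.
for dist in T.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```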
R(s) = 0   if s = Poor & X
       10  if s = Rich & X
γ is a discount factor, 0 ≤ γ < 1
It makes sure the infinite sum converges
It can also be explained by interest rates, mortality, . . .
V(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') V(s')
So we want to find V
V^0(s) = R(s)
V^t(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') V^{t-1}(s')
Converges: V^t → V
Stop when max_s |V^t(s) - V^{t-1}(s)| < ε
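The update and stopping rule above can be sketched in code as follows; the two-state MDP is hypothetical, used only to exercise the algorithm:

```python
# Value iteration sketch for a dictionary-encoded MDP. The two-state MDP
# below is a hypothetical placeholder, not the lecture's actual model.
GAMMA, EPS = 0.9, 1e-6

S = ["poor", "rich"]
A = ["advertise", "save"]
R = {"poor": 0.0, "rich": 10.0}
T = {
    ("poor", "advertise"): {"poor": 0.5, "rich": 0.5},
    ("poor", "save"):      {"poor": 1.0, "rich": 0.0},
    ("rich", "advertise"): {"poor": 0.5, "rich": 0.5},
    ("rich", "save"):      {"poor": 0.0, "rich": 1.0},
}

def value_iteration(S, A, R, T, gamma=GAMMA, eps=EPS):
    V = {s: R[s] for s in S}  # V^0(s) = R(s)
    while True:
        # V^t(s) = R(s) + gamma * max_a sum_{s'} T(s, a, s') * V^{t-1}(s')
        V_new = {
            s: R[s] + gamma * max(
                sum(p * V[s2] for s2, p in T[(s, a)].items()) for a in A
            )
            for s in S
        }
        # Stop when max_s |V^t(s) - V^{t-1}(s)| < eps
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new

V = value_iteration(S, A, R, T)
```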
γ = 0.9
(The first column is V^0(s) = R(s); each later column applies
V^t(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') V^{t-1}(s').)

t       |  0 |    1 |      2 |     3 |     4 |     5 | ... |   100
--------|----|------|--------|-------|-------|-------|-----|------
V^t(PU) |  0 |    0 |   2.03 |  4.75 |  7.62 | 10.21 | ... | 31.58
V^t(PF) |  0 |  4.5 |   8.55 | 12.2  | 15.07 | 17.46 | ... | 38.6
V^t(RU) | 10 | 14.5 | 16.525 | 18.34 | 20.39 | 22.61 | ... | 44.02
V^t(RF) | 10 | 19   |  25.08 | 28.72 | 31.18 | 33.2  | ... | 54.2
π(s) := argmax_a Σ_{s'} T(s, a, s') V(s')

s  | V     | π(s)
---|-------|-----
PU | 31.58 | A
PF | 38.6  | S
RU | 44.02 | S
RF | 54.2  | S
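Extracting the greedy policy from a value function can be sketched as follows; the MDP and value function here are hypothetical placeholders, not the lecture's model:

```python
# Greedy policy extraction: pi(s) = argmax_a sum_{s'} T(s, a, s') * V(s').
# The transition function and V below are illustrative only.
S = ["poor", "rich"]
A = ["advertise", "save"]
T = {
    ("poor", "advertise"): {"poor": 0.5, "rich": 0.5},
    ("poor", "save"):      {"poor": 1.0, "rich": 0.0},
    ("rich", "advertise"): {"poor": 0.5, "rich": 0.5},
    ("rich", "save"):      {"poor": 0.0, "rich": 1.0},
}
V = {"poor": 0.0, "rich": 10.0}

def extract_policy(S, A, T, V):
    # For each state, pick the action with the best expected next-state value.
    return {
        s: max(A, key=lambda a: sum(p * V[s2] for s2, p in T[(s, a)].items()))
        for s in S
    }

pi = extract_policy(S, A, T, V)
```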
t       | 0 | 1 | 2
--------|---|---|---
π^t(PU) | A | A | A
π^t(PF) | A | S | S
π^t(RU) | A | S | S
π^t(RF) | A | S | S
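A sequence of policies that stabilizes after a few rounds, like the one above, is what policy iteration produces: evaluate the current policy, improve it greedily, repeat until it stops changing. A generic sketch, again over a hypothetical two-state MDP:

```python
# Policy iteration sketch. The two-state MDP is a hypothetical placeholder.
GAMMA, EPS = 0.9, 1e-9
S = ["poor", "rich"]
A = ["advertise", "save"]
R = {"poor": 0.0, "rich": 10.0}
T = {
    ("poor", "advertise"): {"poor": 0.5, "rich": 0.5},
    ("poor", "save"):      {"poor": 1.0, "rich": 0.0},
    ("rich", "advertise"): {"poor": 0.5, "rich": 0.5},
    ("rich", "save"):      {"poor": 0.0, "rich": 1.0},
}

def evaluate(pi):
    # Iterative policy evaluation: V(s) = R(s) + gamma * sum T(s, pi(s), s') V(s')
    V = {s: R[s] for s in S}
    while True:
        V_new = {s: R[s] + GAMMA * sum(p * V[s2] for s2, p in T[(s, pi[s])].items())
                 for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < EPS:
            return V_new
        V = V_new

def policy_iteration():
    pi = {s: A[0] for s in S}          # start from an arbitrary policy
    while True:
        V = evaluate(pi)
        # Policy improvement: greedy with respect to the evaluated V.
        pi_new = {s: max(A, key=lambda a: sum(p * V[s2] for s2, p in T[(s, a)].items()))
                  for s in S}
        if pi_new == pi:               # done: the policy is stable
            return pi, V
        pi = pi_new

pi, V = policy_iteration()
```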
V^t(PU): 0  → ... → 31.58
V^t(PF): 0  → ... → 38.6
V^t(RU): 10 → ... → 44.02
V^t(RF): 10 → ... → 54.2
Done
Reinforcement Learning
The model
At every time step, the agent sees the current state s and the applicable actions at s
After choosing an action to execute, the agent receives a reward
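The interaction loop above can be sketched as follows; the `Env` class and its dynamics are hypothetical stand-ins, and the point is that the agent sees only states, applicable actions, and rewards, never T or R directly:

```python
import random

# Sketch of the RL interaction loop over a hypothetical environment.
class Env:
    def __init__(self):
        self.state = "poor"

    def actions(self, s):
        # Applicable actions at s (here the same in every state).
        return ["advertise", "save"]

    def step(self, a):
        # Hypothetical dynamics: advertising sometimes makes us rich.
        if a == "advertise" and random.random() < 0.5:
            self.state = "rich"
        reward = 10.0 if self.state == "rich" else 0.0
        return reward, self.state

env = Env()
s = env.state
for _ in range(5):
    a = random.choice(env.actions(s))  # agent picks an applicable action
    r, s = env.step(a)                 # ...and receives a reward + next state
```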
Q-Learning
Q(s, a) = R(s) + γ Σ_{s'} T(s, a, s') max_{a'} Q(s', a')
Learning Q
Q(s, a) ← (1 - α) Q(s, a) + α (r + γ max_{a'} Q(s', a'))
α is the learning rate: how much weight to give new vs. past knowledge
Under some (realistic?) assumptions, Q-learning will converge to the optimal Q
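A minimal sketch of this update, assuming the standard Q-learning form with learning rate α; the states, actions, and the experience tuple are hypothetical:

```python
# Standard Q-learning update:
# Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_{a'} Q(s',a'))
GAMMA, ALPHA = 0.9, 0.1
S = ["poor", "rich"]
A = ["advertise", "save"]
Q = {(s, a): 0.0 for s in S for a in A}  # initialize all Q-values to 0

def q_update(Q, s, a, r, s2):
    target = r + GAMMA * max(Q[(s2, a2)] for a2 in A)  # bootstrapped target
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target

# One hypothetical experience tuple (s, a, r, s').
q_update(Q, "poor", "advertise", 10.0, "rich")
```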
Q-Learning: Exploration/Exploitation
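A common way to balance exploration and exploitation is ε-greedy action selection; a minimal sketch, with illustrative Q-values:

```python
import random

# Epsilon-greedy: with probability eps take a random action (explore),
# otherwise take the best-known action (exploit).
def epsilon_greedy(Q, s, actions, eps):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Illustrative Q-values for a single state.
Q = {("poor", "advertise"): 1.0, ("poor", "save"): 0.0}
a = epsilon_greedy(Q, "poor", ["advertise", "save"], eps=0.0)  # eps=0: purely greedy
```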
GLIE (Greedy in the Limit with Infinite Exploration) Policies
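GLIE requires that every action is tried infinitely often in every state while the policy becomes greedy in the limit; ε-greedy with a decaying rate such as ε_t = 1/t is a standard way to satisfy this. A sketch, with illustrative Q-values:

```python
import random

# GLIE via epsilon-greedy with eps_t = 1/t: every action keeps positive
# probability at every finite t (infinite exploration), while eps_t -> 0
# makes the policy greedy in the limit.
def glie_action(Q, s, actions, t):
    eps = 1.0 / t                      # decaying exploration rate
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Q = {("poor", "advertise"): 1.0, ("poor", "save"): 0.0}  # illustrative values
a = glie_action(Q, "poor", ["advertise", "save"], t=3)
```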