New CZ3005 Module 4 - Markov Decision Process
Artificial Intelligence
https://personal.ntu.edu.sg/hanwangzhang/
Email: hanwangzhang@ntu.edu.sg
Office: N4-02c-87
Lesson Outline
• Introduction
• Markov Decision Process
• Two methods for solving MDP
• Value iteration
• Policy iteration
Introduction
Transition model: P(s′ | s, a)
• The return is the discounted sum of future rewards: G_t = Σ_{k=0}^∞ γ^k R_{t+k+1}
• We define the value of a state under policy π as
V^π(s) = E_π[ G_t | S_t = s ]
• Similarly, we define the action-value of a state-action pair as
Q^π(s, a) = E_π[ G_t | S_t = s, A_t = a ]
• The relationship between Q and V:
V^π(s) = Σ_{a∈A} Q^π(s, a) p(a | s)
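To make the V–Q relationship concrete, here is a minimal Python sketch; the state, actions, Q-values, and policy probabilities below are made up for illustration, not from the module's example:

```python
# Hypothetical action-values Q(s, a) and a stochastic policy p(a | s) for one state.
Q = {("s0", "a0"): 1.0, ("s0", "a1"): 3.0}
policy = {("s0", "a0"): 0.25, ("s0", "a1"): 0.75}

# V(s) = sum over a in A of Q(s, a) * p(a | s)
V_s0 = sum(policy[("s0", a)] * Q[("s0", a)] for a in ("a0", "a1"))
print(V_s0)  # 0.25 * 1.0 + 0.75 * 3.0 = 2.5
```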
Recursive Mechanism
Finding the Value Function of States
[Figure: a one-step lookahead tree for state s: a max node over the available actions, a chance node per action whose branches are the successor states s′, with branches labeled P(s′ | s, a), rewards r(s, a, s′), and leaves labeled V(s′)]
• What is the expected value of taking action a in state s?
Q(s, a) = Σ_{s′} P(s′ | s, a) [ r(s, a, s′) + γ V(s′) ]
• How do we choose the optimal action?
π*(s) = argmax_{a∈A} Σ_{s′} P(s′ | s, a) [ r(s, a, s′) + γ V(s′) ]
• What is the recursive expression for V(s) in terms of the utilities of its successor states?
V(s) = max_{a∈A} Σ_{s′} P(s′ | s, a) [ r(s, a, s′) + γ V(s′) ]
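The three formulas above translate almost line-for-line into code. A sketch, assuming a tabular model where P[(s, a)] maps each successor s2 to its probability and r[(s, a, s2)] gives the reward; these dictionary layouts and function names are illustrative conventions, not from the slides:

```python
# Expected value of taking action a in state s (one-step lookahead):
# Q(s, a) = sum over s' of P(s'|s, a) * (r(s, a, s') + gamma * V(s'))
def q_value(s, a, P, r, V, gamma):
    return sum(p * (r[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items())

# Optimal action: pi*(s) = argmax over a in A of Q(s, a)
def best_action(s, actions, P, r, V, gamma):
    return max(actions, key=lambda a: q_value(s, a, P, r, V, gamma))

# Recursive expression for V(s): V(s) = max over a in A of Q(s, a)
def state_value(s, actions, P, r, V, gamma):
    return max(q_value(s, a, P, r, V, gamma) for a in actions)
```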
The Bellman Equation
• Recursive relationship between the accumulated rewards of successive states:
V(s) = max_{a∈A} Σ_{s′} P(s′ | s, a) [ r(s, a, s′) + γ V(s′) ]
• We can instead calculate Q(s, a) values for each state s and action a, and get the best V(s) from them:
Q(s, a) = Σ_{s′} P(s′ | s, a) [ r(s, a, s′) + γ max_{a′} Q(s′, a′) ]
• Finally: V(s) = max_{a∈A} Q(s, a)
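Applying this recursive relationship repeatedly until the values stop changing is exactly value iteration, which the next section works through. A minimal sketch under the same assumed tabular layout as above; the stopping threshold theta is a standard convergence convention, not from the slides:

```python
def value_iteration(states, actions, P, r, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}  # V0(s) = 0 for all states
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: V(s) = max_a sum_s' P(s'|s,a) [ r(s,a,s') + gamma V(s') ]
            new_v = max(sum(p * (r[(s, a, s2)] + gamma * V[s2])
                            for s2, p in P[(s, a)].items())
                        for a in actions)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:  # stop when no state changes by more than theta
            return V
```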
Value Iteration: Example (cont’d)
• We use the previous example and solve it using VI
• Construct a Q table, Q(s, a)
– 3 rows (3 states) and 2 columns (2 actions)
        Q0(a0)  Q0(a1)  V0    Q1(a0)  Q1(a1)  V1
s0        0       0      0      0       0      0
s1        0       0      0     3.5      0     3.5
s2        0       0      0      0     -0.3     0

Q1(s1, a0) = 0.7·[5 + 0] + 0.1·[0 + 0] + 0.2·[0 + 0] = 3.5
The initial (Q0, V0) and first-iteration (Q1, V1) tables
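To verify the worked entry, a tiny sketch that recomputes Q1(s1, a0) from the numbers on the slide. Which successor state receives which probability, and the reward of 5 on the 0.7-branch, are read off the computation above; the state names are assumptions:

```python
# Assumed transitions for (s1, a0): successor -> (probability, reward)
transitions = {"s0": (0.7, 5.0), "s1": (0.1, 0.0), "s2": (0.2, 0.0)}
V0 = {"s0": 0.0, "s1": 0.0, "s2": 0.0}  # initial value table, all zeros
gamma = 0.9  # any gamma gives the same Q1 here, because V0 is all zeros

q1 = sum(p * (rew + gamma * V0[s2]) for s2, (p, rew) in transitions.items())
print(q1)  # 0.7*(5+0) + 0.1*(0+0) + 0.2*(0+0) = 3.5
```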
Policy Iteration: Policy Evaluation
• Evaluate the current policy π_i by computing, for every state s,
V^{π_i}(s) = Σ_{s′} P(s′ | s, π_i(s)) [ R(s, π_i(s), s′) + γ V^{π_i}(s′) ]
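Solving this system by repeated sweeps gives iterative policy evaluation. A minimal sketch for a deterministic policy π_i, under the same assumed tabular layout as the earlier sketches:

```python
def policy_evaluation(states, policy, P, r, gamma=0.9, theta=1e-6):
    # Sweep V(s) = sum_s' P(s'|s, pi_i(s)) [ R(s, pi_i(s), s') + gamma V(s') ]
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]  # action chosen by the fixed policy pi_i
            new_v = sum(p * (r[(s, a, s2)] + gamma * V[s2])
                        for s2, p in P[(s, a)].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V
```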
Policy Iteration: Policy Evaluation (cont'd)
• Use the same example as in VI
• Start at iteration i = 0: initialize a random policy π_0, and set V(s) = 0 for states s0, s1, and s2
V(s0) = 0
V(s1) = 0.7 × 5.0 = 3.5
• The values converge to 8.03, 11.2, and 8.9 for the three states.
– From these values we can easily calculate the optimal policy π*. Can you try it?