CS 188: Artificial Intelligence: MDPs and RL Outline

Many slides over the course adapted from Dan Klein, Stuart Russell, Andrew Moore

§ Markov Decision Processes (MDPs)
  § Formalism
  § Planning
    § Value iteration
    § Policy Evaluation and Policy Iteration
§ Reinforcement Learning --- MDP with T and/or R unknown
  § Model-based Learning
  § Model-free Learning
    § Direct Evaluation [performs policy evaluation]
    § Temporal Difference Learning [performs policy evaluation]
    § Q-Learning [learns optimal state-action value function Q*]
    § Policy search [learns optimal policy from subset of all policies]
  § Exploration vs. exploitation
  § Large state spaces: feature-based representations
[Demo: Andrew Ng]
[Demo: Tedrake, Zhang and Seung, 2005]
Reinforcement Learning (in your Project 3)
§ Still assume a Markov decision process (MDP):
  § A set of states s ∈ S
  § A set of actions (per state) A
  § A model T(s,a,s′)
  § A reward function R(s,a,s′)
§ Still looking for a policy π(s)
§ New twist: T and/or R are unknown, and must be learned from experience
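To make the formalism concrete, here is a minimal sketch of how the pieces of an MDP might be represented in Python. The two-state world and the names (states, actions, T, R, policy) are illustrative inventions, not the course's project code.

```python
# A tiny illustrative MDP: two states plus a terminal "exit".
# T maps (s, a) to a list of (next_state, probability);
# R maps (s, a, s') to the immediate reward for that transition.
states = ["A", "B", "exit"]
actions = {"A": ["stay", "go"], "B": ["go"], "exit": []}

T = {
    ("A", "stay"): [("A", 0.9), ("B", 0.1)],   # noisy dynamics
    ("A", "go"):   [("B", 1.0)],
    ("B", "go"):   [("exit", 1.0)],
}

R = {
    ("A", "stay", "A"): 0.0,
    ("A", "stay", "B"): 0.0,
    ("A", "go", "B"):   0.0,
    ("B", "go", "exit"): 10.0,                 # reward collected on exiting
}

# A policy is just a map from states to actions.
policy = {"A": "go", "B": "go"}
```

In RL the dictionaries T and R above are exactly what the agent does not get to see; it only observes sampled transitions.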
Learning the Model in Model-Based Learning
§ Estimate P(x) from samples
  § Samples: x1, ..., xN
  § Estimate: P(x) ≈ count(x) / N

Model-based vs. Model-free
§ Model-based RL
  § First act in MDP and learn T, R
  § Then value iteration or policy iteration with learned T, R
  § Advantage: efficient use of data
  § Disadvantage: requires building a model for T, R
§ Model-free RL
  § Learn values (or a policy) directly from experience, without estimating T, R
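A rough sketch of the model-based recipe above, assuming episodes arrive as lists of (s, a, s′, r) transitions; the function name and episode format are assumptions for illustration. The estimator just normalizes transition counts and averages observed rewards.

```python
from collections import defaultdict

def estimate_model(episodes):
    """Estimate T(s,a,s') and R(s,a,s') from observed transitions."""
    counts = defaultdict(lambda: defaultdict(int))   # (s,a) -> {s': count}
    rewards = defaultdict(list)                      # (s,a,s') -> observed rewards

    for episode in episodes:
        for (s, a, s2, r) in episode:
            counts[(s, a)][s2] += 1
            rewards[(s, a, s2)].append(r)

    # T(s,a,s') ≈ count(s,a,s') / count(s,a)  -- the empirical frequency
    T = {}
    for (s, a), nexts in counts.items():
        total = sum(nexts.values())
        for s2, c in nexts.items():
            T[(s, a, s2)] = c / total

    # R(s,a,s') ≈ average reward observed for that transition
    R = {k: sum(v) / len(v) for k, v in rewards.items()}
    return T, R

# Afterwards, run value iteration or policy iteration on (T, R) as usual.
```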
Direct evaluation example: V(3,3) ≈ (99 + 97 − 102) / 3 ≈ 31.3
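The V(3,3) number above is just an average of the returns observed from that state. A sketch of direct evaluation under that reading; the episode format (a list of (state, reward) pairs) and the backward accumulation are assumptions of this sketch.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Estimate V(s) as the average (discounted) return observed from s onward."""
    returns = defaultdict(list)   # state -> list of sampled returns
    for episode in episodes:      # episode: list of (state, reward) pairs
        G = 0.0
        # Walk backwards so G holds the return accumulated from each state onward.
        for (s, r) in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# E.g., three episodes passing through (3,3) with observed returns
# 99, 97, and -102 give V(3,3) = (99 + 97 - 102) / 3 ≈ 31.3, as above.
```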
Temporal Difference Value Learning
§ Update to V(s), given a sample transition (s, π(s), s′) with reward R(s, π(s), s′):
  sample = R(s, π(s), s′) + γ V(s′)
  V(s) ← (1 − α) V(s) + α · sample
§ Same update, written as a running average:
  V(s) ← V(s) + α (sample − V(s))
§ Decreasing learning rate α can give converging averages
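A sketch of the TD update exactly as written above, applied once per observed transition. The 1/visits learning-rate schedule is one common choice for a decreasing α, not something the slides prescribe.

```python
from collections import defaultdict

def td_policy_evaluation(transitions, gamma=0.9):
    """TD(0) over a stream of (s, r, s') transitions generated by
    following a fixed policy pi."""
    V = defaultdict(float)
    visits = defaultdict(int)
    for (s, r, s2) in transitions:
        visits[s] += 1
        alpha = 1.0 / visits[s]          # decreasing learning rate -> converging averages
        sample = r + gamma * V[s2]       # sample = R(s, pi(s), s') + gamma * V(s')
        V[s] += alpha * (sample - V[s])  # V(s) <- V(s) + alpha * (sample - V(s))
    return dict(V)
```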
Policy Evaluation when T (and R) unknown --- recap
§ Model-based:
  § Learn the model empirically through experience
  § Solve for values as if the learned model were correct
§ Model-free:
  § Direct evaluation: V(s) = sample estimate of sum of rewards accumulated from state s onwards
  § Temporal difference learning: update V(s) each time a transition is experienced

Problems with TD Value Learning
§ TD value learning is a model-free way to do policy evaluation
§ However, if we want to turn values into a (new) policy, we're sunk:
  π(s) = argmax_a Q(s,a), where Q(s,a) = Σ_s′ T(s,a,s′) [R(s,a,s′) + γ V(s′)]
§ This one-step lookahead needs T and R, which are exactly what we don't know
§ Idea: learn Q-values directly, which makes action selection model-free too (see the sketch below)
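To see concretely why values alone leave us "sunk", compare the two ways of extracting a greedy policy sketched below: the one-step lookahead from V needs T and R, while the argmax over Q needs neither. Function names and the T/R representations are the illustrative ones from earlier.

```python
def policy_from_values(V, states, actions, T, R, gamma=0.9):
    """Greedy policy from V -- requires the model T, R."""
    pi = {}
    for s in states:
        pi[s] = max(actions[s], key=lambda a: sum(
            p * (R[(s, a, s2)] + gamma * V[s2])   # one-step lookahead
            for (s2, p) in T[(s, a)]))
    return pi

def policy_from_q_values(Q, states, actions):
    """Greedy policy from Q -- completely model-free."""
    return {s: max(actions[s], key=lambda a: Q[(s, a)]) for s in states}
```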
Q-Learning
§ Q-learning produces tables of q-values, updated one sample at a time:
  Q(s,a) ← (1 − α) Q(s,a) + α [r + γ max_a′ Q(s′,a′)]

The Story So Far: MDPs and RL
Things we know how to do:
§ If we know the MDP
  § Compute V*, Q*, π* exactly
  § Evaluate a fixed policy π
Techniques:
§ Model-based DPs
  § Value Iteration
  § Policy evaluation
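A sketch of tabular Q-learning implementing the update above. Feeding it a pre-collected stream of (s, a, r, s′) samples is an assumption made to keep the sketch self-contained; in practice the samples come from the agent acting (and exploring) in the environment, and Q-learning is off-policy, so the behavior generating them need not be greedy.

```python
from collections import defaultdict

def q_learning(samples, actions, gamma=0.9, alpha=0.1):
    """Tabular Q-learning over a stream of (s, a, r, s') samples.
    actions: dict mapping each state to its legal actions."""
    Q = defaultdict(float)
    for (s, a, r, s2) in samples:
        # Sample estimate of the optimal value obtainable from s':
        best_next = max((Q[(s2, a2)] for a2 in actions[s2]), default=0.0)
        sample = r + gamma * best_next
        # Running-average blend, exactly the (1 - alpha)/alpha update above:
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return dict(Q)
```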
Exact Q's vs. Approximate Q's
[Figure: gridworld tables of Q-values comparing exact Q-learning with a feature-based approximation]
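Where exact Q-learning keeps one table entry per (s, a) pair, approximate Q-learning represents Q(s,a) as a weighted sum of features, Q(s,a) = Σ_i w_i f_i(s,a), and updates the weights instead of table cells. A sketch of one such update step; the feature function f and the legal_actions helper are whatever the designer supplies, and are assumptions here.

```python
def approx_q_update(w, f, s, a, r, s2, legal_actions, gamma=0.9, alpha=0.05):
    """One approximate Q-learning step.
    w: list of weights; f(s, a): list of feature values, same length as w."""
    def Q(state, action):
        return sum(wi * fi for wi, fi in zip(w, f(state, action)))

    best_next = max((Q(s2, a2) for a2 in legal_actions(s2)), default=0.0)
    difference = (r + gamma * best_next) - Q(s, a)   # the TD error
    # Each weight moves in proportion to its feature's value:
    return [wi + alpha * difference * fi for wi, fi in zip(w, f(s, a))]
```

The payoff is generalization: states that share feature values share Q-value estimates, so one may never need to visit most of a large state space.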
Regression: given examples, predict given a new point
[Figure: example points and the resulting predictions]
[Figure: the error or residual is the vertical gap between each observation and the prediction]
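The "predict given a new point" picture is ordinary least-squares regression: choose weights that minimize the sum of squared residuals. A minimal numpy sketch; the data points are made up for illustration.

```python
import numpy as np

# Made-up 1-D examples: observations y at inputs x.
x = np.array([0., 1., 2., 3., 4.])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Fit y ~ w0 + w1 * x by least squares (minimize sum of squared residuals).
A = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict at a new point, and inspect residuals (observation - prediction).
x_new = 2.5
prediction = w[0] + w[1] * x_new
residuals = y - A @ w
```

This is the same machinery behind approximate Q-learning: the features play the role of x, and the weight update nudges w against the current residual.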
Overfitting
[Figure: degree 15 polynomial fit to the data; y-axis −15 to 30, x-axis 0 to 20]

MDPs and RL Outline
§ Markov Decision Processes (MDPs)
  § Formalism
  § Planning
    § Value iteration
    § Policy Evaluation and Policy Iteration
§ Reinforcement Learning --- MDP with T and/or R unknown
  § Model-based Learning
  § Model-free Learning
    § Direct Evaluation [performs policy evaluation]
    § Temporal Difference Learning [performs policy evaluation]
    § Q-Learning [learns optimal state-action value function Q*]
    § Policy search [learns optimal policy from subset of all policies]
  § Exploration vs. exploitation [DEMO]
  § Large state spaces: feature-based representations
Policy Search
§ Problems:
  § How do we tell the policy got better?
  § Need to run many sample episodes!
  § If there are a lot of features, this can be impractical
§ It turns out you can efficiently approximate the derivative of the returns with respect to the parameters w (details in the book, but you don't have to know them); a simple version is sketched below
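One concrete (and deliberately naive) way to approximate that derivative is finite differences: nudge each parameter, re-run sample episodes, and compare average returns. This stands in for the more efficient methods the slide alludes to, and it shows exactly why "need to run many sample episodes" becomes impractical with many features: each parameter costs a full batch of episodes per step.

```python
def policy_search_step(w, average_return, step=1e-2, lr=0.1):
    """One finite-difference ascent step on the policy parameters w.
    average_return(w): runs many sample episodes with the policy induced
    by w and returns their mean return (the expensive part!)."""
    base = average_return(w)
    grad = []
    for i in range(len(w)):              # one re-evaluation per parameter
        w_nudged = list(w)
        w_nudged[i] += step
        grad.append((average_return(w_nudged) - base) / step)
    return [wi + lr * gi for wi, gi in zip(w, grad)]
```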