Levine Deep RL Lecture


Deep Reinforcement Learning via Imitation Learning
Sergey Levine
(figure: the sensorimotor loop - perception leads to an action, e.g. "run away")
End-to-end vision

standard computer vision: features (e.g. HOG) → mid-level features (e.g. DPM) → classifier (e.g. SVM)   [Felzenszwalb '08]

deep learning: trained end to end   [Krizhevsky '12]

End-to-end control

standard robotic control: observations → state estimation (e.g. vision) → modeling & prediction → motion planning → low-level controller (e.g. PD) → motor torques

deep sensorimotor learning: observations → motor torques

indirect supervision
actions have consequences
Contents

Imitation learning

Imitation without a human

Research frontiers
Terminology & notation

1. run away
2. ignore
3. pet
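
The symbols on this slide did not survive extraction; as a rough guide (an assumption based on the notation used elsewhere in Levine's lectures, not recovered from the slide itself):

o_t : observation (e.g. the camera image of the tiger)
x_t (or s_t) : underlying state
u_t (or a_t) : action (run away / ignore / pet)
\pi_\theta(u_t \mid o_t) : policy, a distribution over actions given the observation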

a bit of history…

control (управление)   Lev Pontryagin   Richard Bellman


Contents

Imitation learning

Imitation without a human

Research frontiers
Imitation Learning

training data → supervised learning

Images: Bojarski et al. ‘16, NVIDIA
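
The slide frames imitation learning (behavior cloning) as ordinary supervised learning on demonstration data. A minimal sketch of that idea in PyTorch; the network shape, dataset format, and names here are illustrative assumptions, not the NVIDIA architecture:

import torch
import torch.nn as nn

# Behavior cloning: fit a policy pi_theta(a | o) to (observation, action)
# pairs collected from a human demonstrator.
class Policy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def behavior_cloning(policy, demos, epochs=10, lr=1e-3):
    # demos: iterable of (obs, act) tensor batches from the expert
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # continuous actions; use cross-entropy for discrete actions
    for _ in range(epochs):
        for obs, act in demos:
            loss = loss_fn(policy(obs), act)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy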


Does it work? No!
Does it work? Yes!

Video: Bojarski et al. ‘16, NVIDIA


Why did that work?

Bojarski et al. ‘16, NVIDIA


Can we make it work more often?
cost, stability

Learning from a stabilizing controller

(more on this later)


Can we make it work more often?

DAgger: Dataset Aggregation

Ross et al. ‘11
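
The DAgger procedure (Ross et al. '11) alternates between running the current policy and relabeling the visited observations with expert actions. A rough pseudocode sketch; the train, rollout, and expert_label callables are hypothetical stand-ins for the learner and the human/expert interface:

def dagger(train, rollout, expert_label, initial_demos, n_iters=10):
    """DAgger (Ross et al. '11), schematically.

    train(dataset)    -> policy fit to (obs, act) pairs by supervised learning
    rollout(policy)   -> list of observations visited when running the policy
    expert_label(obs) -> expert action for a given observation
    initial_demos     -> list of (obs, act) pairs from the expert
    """
    dataset = list(initial_demos)
    policy = train(dataset)                                  # 1. behavior cloning
    for _ in range(n_iters):
        observations = rollout(policy)                       # 2. run the learner
        labeled = [(o, expert_label(o)) for o in observations]  # 3. expert labels
        dataset.extend(labeled)                              # 4. aggregate
        policy = train(dataset)                              # retrain on the aggregate
    return policy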


DAgger Example

Ross et al. ‘11


What’s the problem?

Ross et al. ‘11


Imitation learning: recap
training data → supervised learning

• Usually (but not always) insufficient by itself
– Distribution mismatch problem
• Sometimes works well
– Hacks (e.g. left/right images)
– Samples from a stable trajectory distribution
– Add more on-policy data, e.g. using DAgger
Contents

Imitation learning

Imitation without a human

Research frontiers
Terminology & notation
1. run away
2. ignore
3. pet
Trajectory optimization
Probabilistic version
Probabilistic version (in pictures)
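
The math on these three slides did not survive extraction. As a hedged reconstruction of what they cover, using the standard formulation from this lecture series: trajectory optimization minimizes total cost under known deterministic dynamics, and the probabilistic version replaces the dynamics constraint with a distribution over next states.

\min_{u_1,\ldots,u_T} \sum_{t=1}^{T} c(x_t, u_t) \quad \text{s.t.} \quad x_{t+1} = f(x_t, u_t)

probabilistic version (stochastic dynamics):

p(x_{t+1} \mid x_t, u_t), \qquad \min_{u_1,\ldots,u_T} \; E\!\left[\sum_{t=1}^{T} c(x_t, u_t)\right]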
DAgger without Humans

Ross et al. ‘11


Another problem
PLATO: Policy Learning with Adaptive Trajectory Optimization

Kahn, Zhang, Levine, Abbeel '16

(animation frames from the same slide: path replanned! ... avoids high cost!)

input substitution trick: need state at training time, but not at test time!
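
A hedged sketch of the PLATO teacher objective as I read the Kahn et al. '16 paper (not verbatim from the slides): at each step the adaptive MPC teacher picks actions that trade off task cost against staying close to what the learner policy would do, so the learner is trained on data near its own state distribution.

\hat{\pi}(u_t \mid x_t) \;=\; \arg\min_{\pi}\; E_{\pi}\!\left[\sum_{t'=t}^{T} c(x_{t'}, u_{t'})\right] \;+\; \lambda\, D_{\mathrm{KL}}\!\big(\pi(u_t \mid x_t)\,\big\|\, \pi_\theta(u_t \mid o_t)\big)

The learner \pi_\theta(u_t \mid o_t) is then trained with supervised learning on the teacher's actions, using only observations o_t (the input substitution trick: the teacher sees the state x_t, the learner does not).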


Beyond driving & flying
Trajectory Optimization with Unknown Dynamics

[L. et al. NIPS ‘14]


Trajectory Optimization with Unknown Dynamics

(figure: new vs. old trajectory distributions)
[L. et al. NIPS ‘14]
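
A hedged summary of the update these slides describe, as I understand L. et al. NIPS '14: fit time-varying linear-Gaussian dynamics to the latest rollouts, then improve the trajectory distribution with an LQR-style backward pass while bounding how far the new distribution moves from the old one.

p(x_{t+1} \mid x_t, u_t) \approx \mathcal{N}\big(f_{x,t} x_t + f_{u,t} u_t + f_{c,t},\; F_t\big)

\min_{p_{\mathrm{new}}}\; E_{p_{\mathrm{new}}}\!\left[\sum_t c(x_t, u_t)\right] \quad \text{s.t.} \quad D_{\mathrm{KL}}\big(p_{\mathrm{new}}(\tau)\,\|\, p_{\mathrm{old}}(\tau)\big) \le \epsilon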
Learning on PR2

[L. et al. ICRA ‘15]


Combining with Policy Learning
(equation labels: expectation under current policy; trajectory distribution(s); Lagrange multiplier)

L. et al. ICML '14 (dual descent); can also use BADMM (L. et al. '15)
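
The labels above annotate an equation that did not survive extraction. A hedged reconstruction of the constrained objective used in guided policy search, to which those labels refer:

\min_{\theta,\, p}\; E_{p(\tau)}\!\left[\sum_t c(x_t, u_t)\right] \quad \text{s.t.} \quad p(u_t \mid x_t) = \pi_\theta(u_t \mid o_t) \;\; \forall t

The constraint is enforced with Lagrange multipliers (dual descent) or BADMM, alternating between trajectory-centric RL on p and supervised learning on \pi_\theta.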
Guided Policy Search

alternates between trajectory-centric RL and supervised learning


[see L. et al. NIPS ‘14 for details]
training time vs. test time

L.*, Finn*, Darrell, Abbeel ‘16


~92,000 parameters
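
The ~92,000-parameter policy is the visuomotor CNN of L.*, Finn*, Darrell, Abbeel '16. A hedged PyTorch sketch of that architecture; the layer sizes are from my recollection of the paper and should be treated as approximate, and the image/robot-state dimensions are placeholders:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSoftmax(nn.Module):
    # Converts each feature map into the expected (x, y) image location of its
    # activation ("feature points").
    def forward(self, feat):                                   # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        softmax = F.softmax(feat.view(b, c, -1), dim=-1).view(b, c, h, w)
        ys = torch.linspace(-1, 1, h, device=feat.device)
        xs = torch.linspace(-1, 1, w, device=feat.device)
        ex = (softmax.sum(dim=2) * xs).sum(dim=-1)             # (B, C)
        ey = (softmax.sum(dim=3) * ys).sum(dim=-1)             # (B, C)
        return torch.cat([ex, ey], dim=1)                      # (B, 2C) feature points

class VisuomotorPolicy(nn.Module):
    def __init__(self, robot_state_dim=7, action_dim=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2), nn.ReLU(),
            nn.Conv2d(64, 32, 5), nn.ReLU(),
            nn.Conv2d(32, 32, 5), nn.ReLU(),
        )
        self.points = SpatialSoftmax()
        self.fc = nn.Sequential(
            nn.Linear(2 * 32 + robot_state_dim, 40), nn.ReLU(),
            nn.Linear(40, 40), nn.ReLU(),
            nn.Linear(40, action_dim),                         # motor torques
        )

    def forward(self, image, robot_state):
        feats = self.points(self.conv(image))
        return self.fc(torch.cat([feats, robot_state], dim=1))

With these (assumed) layer sizes the parameter count comes out to roughly 91,000, consistent with the figure on the slide.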
Experimental Tasks
Generalization Experiments
Comparisons

• end-to-end training
• pose prediction (trained on pose only)
• pose features (trained on pose only)
Comparisons

task                  pose prediction   pose features   end-to-end training
coat hanger           55.6%             88.9%           100%
shape sorting cube    0%                70.4%           96.3%
toy claw hammer       8.9%              62.2%           91.1%
bottle cap            n/a               55.6%           88.9%

(success rates; slide annotations: "2 cm" next to the shape sorting cube result, hammer image credit: Meeussen et al., Willow Garage)
Guided Policy Search Applications

• manipulation (with N. Wagener and P. Abbeel)
• dexterous hands (with V. Kumar and E. Todorov)
• soft hands (with A. Gupta, C. Eppner, P. Abbeel)
• locomotion (with V. Koltun)
• aerial vehicles (with G. Kahn, T. Zhang, P. Abbeel)
A note about terminology…
the “R” word
a bit of history…
reinforcement learning (the problem statement)

reinforcement learning without using the model (the method)

Lev Pontryagin, Richard Bellman, Andrew Barto, Richard Sutton


Contents

Imitation learning

Imitation without a human

Research frontiers
ingredients for success in learning:

supervised learning: computation, algorithms, data
learning sensorimotor skills: computation, algorithms, data ~?

L., Pastor, Krizhevsky, Quillen ‘16


Grasping with Learned Hand-Eye Coordination

• 800,000 grasp attempts for training (3,000 robot-hours)
• monocular camera (no depth)
• 2-5 Hz update
• no prior knowledge

(setup: monocular RGB camera, 7 DoF arm, 2-finger gripper, object bin)

L., Pastor, Krizhevsky, Quillen ‘16


Using Grasp Success Prediction

(figure: training and testing setups)

L., Pastor, Krizhevsky, Quillen ‘16
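
The system trains a CNN to predict whether a candidate motion command will lead to a successful grasp, and servoing then repeatedly picks commands with high predicted success. A rough sketch of that servoing loop; grasp_cnn, the command parameterization, and the cross-entropy-method style resampling are my reading of the paper, not taken verbatim from the slides:

import numpy as np

def servo_step(grasp_cnn, image, current_pose, n_samples=64, n_elite=6, iters=3):
    """One closed-loop servoing step: choose the motor command whose predicted
    grasp-success probability grasp_cnn(image, pose, command) is highest,
    refining the candidate distribution CEM-style."""
    mean = np.zeros(3)                 # candidate gripper displacement (x, y, z), illustrative
    std = np.ones(3) * 0.05
    for _ in range(iters):
        commands = np.random.randn(n_samples, 3) * std + mean
        scores = np.array([grasp_cnn(image, current_pose, c) for c in commands])
        elite = commands[np.argsort(scores)[-n_elite:]]       # keep the best candidates
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean                        # command to execute at 2-5 Hz before re-observing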


Open-Loop vs. Closed-Loop Grasping
open-loop grasping: failure rate 33.7%
depth + segmentation: failure rate 35%
closed-loop grasping: failure rate 17.5%

Pinto & Gupta, 2015


Grasping Experiments

L., Pastor, Krizhevsky, Quillen ‘16


Continuous Learning in the Real World

• breadth and diversity of data


• learning new tasks quickly
• leveraging prior data
• task success supervision
Learning from Prior Experience

with J. Fu
Learning from Prior Experience
Learning what Success Means

can we learn the cost with visual features?
Learning what Success Means

with C. Finn, P. Abbeel


Learning what Success Means
Challenges & Frontiers
• Algorithms
– Sample complexity
– Safety
– Scalability
• Supervision
– Automatically evaluate success
– Learn cost functions
• Transfer from prior experience
Acknowledgements

Greg Kahn Tianhao Zhang Chelsea Finn Trevor Darrell Pieter Abbeel

Justin Fu Peter Pastor Alex Krizhevsky Deirdre Quillen
