Levine Deep RL Lecture

Deep Reinforcement Learning via Imitation Learning
Sergey Levine

(run away)

sensorimotor loop

End-to-end vision
features (e.g. HOG) → mid-level features (e.g. DPM) → classifier (e.g. SVM)
Felzenszwalb ‘08

Krizhevsky ‘12

End-to-end control
standard robotic control: observations → state estimation (e.g. vision) → modeling & prediction → motion planning → low-level control (e.g. PD) → motor commands

sensorimotor observations
indirect supervision
actions have consequences

Imitation learning

Imitation without a human

Research frontiers
Terminology & notation

1. run away
2. ignore
3. pet

a bit of history…

управление (Russian: "control") Lev Pontryagin Richard Bellman


Imitation learning

Imitation without a human

Research frontiers
Imitation Learning

training data → supervised learning

Images: Bojarski et al. ‘16, NVIDIA

Does it work? No!
Does it work? Yes!

Video: Bojarski et al. ‘16, NVIDIA

Why did that work?

Bojarski et al. ‘16, NVIDIA
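
One reason the left/right-image trick helps: side-camera frames show the road as if the car had drifted off-center, so they can be labeled with a steering command that corrects back toward the center. A minimal sketch of that labeling step follows. The dictionary keys, the sign convention (positive steering = steer right), and the `correction` value are assumptions for illustration, not values taken from Bojarski et al. ‘16.

```python
def augment_with_side_cameras(samples, correction=0.2):
    """Turn each three-camera sample into three (image, steering) training pairs.

    Assumptions (hypothetical, for illustration only):
      - each sample is a dict with keys "center", "left", "right", "steering"
      - positive steering means "steer right"
      - `correction` is a hand-picked offset, not a value from the paper
    """
    pairs = []
    for sample in samples:
        steering = sample["steering"]
        # Center camera: keep the demonstrated steering command.
        pairs.append((sample["center"], steering))
        # Left camera looks like the car drifted left, so steer further right.
        pairs.append((sample["left"], steering + correction))
        # Right camera looks like the car drifted right, so steer further left.
        pairs.append((sample["right"], steering - correction))
    return pairs


# Usage: one recorded frame triple becomes three supervised examples.
example = [{"center": "c0.png", "left": "l0.png", "right": "r0.png", "steering": 0.1}]
training_pairs = augment_with_side_cameras(example)
```

The extra pairs give the supervised learner examples of off-center states together with recovering actions, which plain demonstrations rarely contain.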

Can we make it work more often?

Learning from a stabilizing

(more on this later)


DAgger: Dataset Aggregation

Ross et al. ‘11
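
The DAgger loop can be sketched as: run the learned policy, have the expert label the states it visited, add those labels to the dataset, and retrain. The toy below is a simplified illustration under loud assumptions: a 1-D state, dynamics s' = s + a + noise, a linear policy a = w * s fit by least squares, and no mixing with the expert during rollouts. All function names are hypothetical.

```python
import random


def fit_linear(dataset):
    """Least-squares fit of a = w * s to (state, action) pairs."""
    num = sum(s * a for s, a in dataset)
    den = sum(s * s for s, _ in dataset)
    return num / den if den > 0 else 0.0


def rollout(policy, s0, horizon, rng):
    """Run `policy` in toy dynamics s' = s + a + noise; return visited states."""
    states, s = [], s0
    for _ in range(horizon):
        states.append(s)
        s = s + policy(s) + rng.gauss(0.0, 0.05)
    return states


def dagger(expert, n_iters=5, horizon=20, seed=0):
    rng = random.Random(seed)
    # Initial dataset: states visited by the expert, labeled by the expert.
    dataset = [(s, expert(s)) for s in rollout(expert, 1.0, horizon, rng)]
    w = fit_linear(dataset)
    for _ in range(n_iters):
        # 1. Run the current learned policy to see which states it visits.
        states = rollout(lambda s: w * s, 1.0, horizon, rng)
        # 2. Ask the expert for the correct action in each of those states.
        # 3. Aggregate the newly labeled data with everything collected so far.
        dataset.extend((s, expert(s)) for s in states)
        # 4. Retrain the policy on the aggregated dataset.
        w = fit_linear(dataset)
    return w, dataset


# Usage: the (hypothetical) expert pushes the state halfway to zero each step.
w, data = dagger(lambda s: -0.5 * s)
```

Because the training states now come from the learned policy's own rollouts, the dataset gradually covers the states the policy actually encounters, which is the point of aggregating on-policy data.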

DAgger Example

Ross et al. ‘11

What’s the problem?

Ross et al. ‘11

Imitation learning: recap
training data → supervised learning

• Usually (but not always) insufficient by itself

– Distribution mismatch problem
• Sometimes works well
– Hacks (e.g. left/right images)
– Samples from a stable trajectory distribution
– Add more on-policy data, e.g. using DAgger

Imitation learning

Imitation without a human

Research frontiers
Trajectory optimization
Probabilistic version
Probabilistic version (in pictures)
DAgger without Humans

Ross et al. ‘11

Another problem
PLATO: Policy Learning with
Adaptive Trajectory Optimization

Kahn, Zhang, Levine, Abbeel ‘16

path replanned!

avoids high cost!

input substitution trick

need state at training time, but not at test time!

Beyond driving & flying
Trajectory Optimization with Unknown Dynamics

[L. et al. NIPS ‘14]

new old
[L. et al. NIPS ‘14]
Learning on PR2

[L. et al. ICRA ‘15]

Combining with Policy Learning
expectation under current policy

trajectory distribution(s)

L. et al. ICML ’14 (dual descent)

can also use BADMM (L. et al.’15)
Lagrange multiplier
Guided Policy Search

trajectory-centric RL supervised learning

[see L. et al. NIPS ‘14 for details]
training time test time

L.*, Finn*, Darrell, Abbeel ‘16

~ 92,000
Experimental Tasks
Generalization Experiments

• end-to-end training
• pose prediction (trained on pose only)
• pose features (trained on pose only)

success rate by task:

task                  pose prediction   pose features   end-to-end training
coat hanger           55.6%             88.9%           100%
shape sorting cube    0%                70.4%           96.3%
toy claw hammer       8.9%              62.2%           91.1%
bottle cap            n/a               55.6%           88.9%

(figure labels: 2 cm; Meeussen et al. (Willow Garage))
Guided Policy Search Applications
• manipulation (with N. Wagener and P. Abbeel)
• dexterous hands (with V. Kumar and E. Todorov)
• soft hands (with A. Gupta, C. Eppner, P. Abbeel)
• locomotion (with V. Koltun)
• aerial vehicles (with G. Kahn, T. Zhang, P. Abbeel)
A note about terminology…
the “R” word
a bit of history…
reinforcement learning (the problem statement)

reinforcement learning without using the model (the method)

Lev Pontryagin Richard Bellman Andrew Barto Richard Sutton


Imitation learning

Imitation without a human

Research frontiers
ingredients for success in learning:
supervised learning: learning sensorimotor skills:
computation computation
~? algorithms

L., Pastor, Krizhevsky, Quillen ‘16

Grasping with Learned Hand-Eye Coordination

• 800,000 grasp attempts for training (3,000 robot-hours)
• monocular camera (no depth)
• 2-5 Hz update
• no prior knowledge

(figure labels: monocular RGB camera, 7 DoF arm, 2-finger gripper, bin)

L., Pastor, Krizhevsky, Quillen ‘16

Using Grasp Success Prediction

training testing

L., Pastor, Krizhevsky, Quillen ‘16
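
At test time, a grasp success predictor can be used for control by searching for the motor command it scores highest. The sketch below uses the cross-entropy method (CEM) for that search. It is an illustration under stated assumptions, not the authors' implementation: `toy_predictor` is a hypothetical stand-in for a trained network that would score a command given the current camera image, and the sample counts are arbitrary.

```python
import math
import random


def cem_select_command(predict_success, dim=2, n_samples=100, n_elite=10,
                       n_iters=5, seed=0):
    """Pick the command with the highest predicted success using CEM.

    `predict_success(command)` returns a score (higher = more likely to succeed);
    the current image is assumed to be baked into that function.
    """
    rng = random.Random(seed)
    mean, std = [0.0] * dim, [1.0] * dim
    for _ in range(n_iters):
        # Sample candidate commands from the current Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(n_samples)]
        # Keep the candidates the predictor likes best.
        samples.sort(key=predict_success, reverse=True)
        elite = samples[:n_elite]
        # Refit the Gaussian to the elite set (with a floor on the spread).
        mean = [sum(v[i] for v in elite) / n_elite for i in range(dim)]
        std = [max(1e-3, math.sqrt(sum((v[i] - mean[i]) ** 2 for v in elite) / n_elite))
               for i in range(dim)]
    return mean


def toy_predictor(command):
    """Hypothetical predictor: success is most likely near command (0.5, -0.3)."""
    return math.exp(-((command[0] - 0.5) ** 2 + (command[1] + 0.3) ** 2))


best_command = cem_select_command(toy_predictor)
```

Calling such a selection step again at every control update, each time with the newest image, is what makes the resulting behavior closed-loop rather than open-loop.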

Open-Loop vs. Closed-Loop Grasping
• open-loop grasping: failure rate 33.7%
• closed-loop grasping: failure rate 17.5%
• depth + segmentation: failure rate 35%

Pinto & Gupta, 2015

Grasping Experiments

L., Pastor, Krizhevsky, Quillen ‘16

Continuous Learning in the Real World

• breadth and diversity of data

• learning new tasks quickly
• leveraging prior data
• task success supervision
Learning from Prior Experience

with J. Fu
Learning what Success Means

can we learn the cost with visual features?
with C. Finn, P. Abbeel

Challenges & Frontiers
• Algorithms
– Sample complexity
– Safety
– Scalability
• Supervision
– Automatically evaluate success
– Learn cost functions
• Transfer from prior experience

Greg Kahn Tianhao Zhang Chelsea Finn Trevor Darrell Pieter Abbeel

Justin Fu Peter Pastor Alex Krizhevsky Deirdre Quillen
