Levine Deep RL Lecture


Deep Reinforcement Learning via Imitation Learning
Sergey Levine
(figure: the sensorimotor loop - perception leads to an action, e.g. "run away")
End-to-end vision

standard computer vision: features (e.g. HOG) → mid-level features (e.g. DPM) → classifier (e.g. SVM)   [Felzenszwalb '08]

deep learning: trained end to end   [Krizhevsky '12]

End-to-end control

standard robotic control: observations → state estimation (e.g. vision) → modeling & prediction → motion planning → low-level controller (e.g. PD) → motor torques

deep sensorimotor learning: observations → motor torques

indirect supervision
actions have consequences
Contents

Imitation learning

Imitation without a human

Research frontiers
Terminology & notation

1. run away
2. ignore
3. pet
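
The symbols on this slide did not survive extraction; as a rough guide (an assumption based on the notation used elsewhere in Levine's lectures, not recovered from the slide itself):

o_t : observation (e.g. the camera image of the tiger)
x_t (or s_t) : underlying state
u_t (or a_t) : action (run away / ignore / pet)
\pi_\theta(u_t \mid o_t) : policy, a distribution over actions given the observation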

a bit of history…

control (управление)   Lev Pontryagin   Richard Bellman


Contents

Imitation learning

Imitation without a human

Research frontiers
Imitation Learning

training data → supervised learning

Images: Bojarski et al. ‘16, NVIDIA
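
The slide frames imitation learning (behavior cloning) as ordinary supervised learning on demonstration data. A minimal sketch of that idea in PyTorch; the network shape, dataset format, and names here are illustrative assumptions, not the NVIDIA architecture:

import torch
import torch.nn as nn

# Behavior cloning: fit a policy pi_theta(a | o) to (observation, action)
# pairs collected from a human demonstrator.
class Policy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def behavior_cloning(policy, demos, epochs=10, lr=1e-3):
    # demos: iterable of (obs, act) tensor batches from the expert
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # continuous actions; use cross-entropy for discrete actions
    for _ in range(epochs):
        for obs, act in demos:
            loss = loss_fn(policy(obs), act)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy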


Does it work? No!
Does it work? Yes!

Video: Bojarski et al. ‘16, NVIDIA


Why did that work?

Bojarski et al. ‘16, NVIDIA


Can we make it work more often?
cost, stability

Learning from a stabilizing controller

(more on this later)


Can we make it work more often?

DAgger: Dataset Aggregation

Ross et al. ‘11
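
The DAgger procedure (Ross et al. '11) alternates between running the current policy and relabeling the visited observations with expert actions. A rough pseudocode sketch; the train, rollout, and expert_label callables are hypothetical stand-ins for the learner and the human/expert interface:

def dagger(train, rollout, expert_label, initial_demos, n_iters=10):
    """DAgger (Ross et al. '11), schematically.

    train(dataset)    -> policy fit to (obs, act) pairs by supervised learning
    rollout(policy)   -> list of observations visited when running the policy
    expert_label(obs) -> expert action for a given observation
    initial_demos     -> list of (obs, act) pairs from the expert
    """
    dataset = list(initial_demos)
    policy = train(dataset)                                  # 1. behavior cloning
    for _ in range(n_iters):
        observations = rollout(policy)                       # 2. run the learner
        labeled = [(o, expert_label(o)) for o in observations]  # 3. expert labels
        dataset.extend(labeled)                              # 4. aggregate
        policy = train(dataset)                              # retrain on the aggregate
    return policy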


DAgger Example

Ross et al. ‘11


What’s the problem?

Ross et al. ‘11


Imitation learning: recap
training data → supervised learning

• Usually (but not always) insufficient by itself
– Distribution mismatch problem
• Sometimes works well
– Hacks (e.g. left/right images)
– Samples from a stable trajectory distribution
– Add more on-policy data, e.g. using DAgger
Contents

Imitation learning

Imitation without a human

Research frontiers
Terminology & notation
1. run away
2. ignore
3. pet
Trajectory optimization
Probabilistic version
Probabilistic version (in pictures)
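
The math on these three slides did not survive extraction. As a hedged reconstruction of what they cover, using the standard formulation from this lecture series: trajectory optimization minimizes total cost under known deterministic dynamics, and the probabilistic version replaces the dynamics constraint with a distribution over next states.

\min_{u_1,\ldots,u_T} \sum_{t=1}^{T} c(x_t, u_t) \quad \text{s.t.} \quad x_{t+1} = f(x_t, u_t)

probabilistic version (stochastic dynamics):

p(x_{t+1} \mid x_t, u_t), \qquad \min_{u_1,\ldots,u_T} \; E\!\left[\sum_{t=1}^{T} c(x_t, u_t)\right]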
DAgger without Humans

Ross et al. ‘11


Another problem
PLATO: Policy Learning with Adaptive Trajectory Optimization

Kahn, Zhang, Levine, Abbeel '16

(animation frames from the same slide: path replanned! ... avoids high cost!)

input substitution trick: need state at training time, but not at test time!
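
A hedged sketch of the PLATO teacher objective as I read the Kahn et al. '16 paper (not verbatim from the slides): at each step the adaptive MPC teacher picks actions that trade off task cost against staying close to what the learner policy would do, so the learner is trained on data near its own state distribution.

\hat{\pi}(u_t \mid x_t) \;=\; \arg\min_{\pi}\; E_{\pi}\!\left[\sum_{t'=t}^{T} c(x_{t'}, u_{t'})\right] \;+\; \lambda\, D_{\mathrm{KL}}\!\big(\pi(u_t \mid x_t)\,\big\|\, \pi_\theta(u_t \mid o_t)\big)

The learner \pi_\theta(u_t \mid o_t) is then trained with supervised learning on the teacher's actions, using only observations o_t (the input substitution trick: the teacher sees the state x_t, the learner does not).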


Beyond driving & flying
Trajectory Optimization with Unknown Dynamics

[L. et al. NIPS ‘14]


Trajectory Optimization with Unknown Dynamics

(figure: new vs. old trajectory distributions)
[L. et al. NIPS ‘14]
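
A hedged summary of the update these slides describe, as I understand L. et al. NIPS '14: fit time-varying linear-Gaussian dynamics to the latest rollouts, then improve the trajectory distribution with an LQR-style backward pass while bounding how far the new distribution moves from the old one.

p(x_{t+1} \mid x_t, u_t) \approx \mathcal{N}\big(f_{x,t} x_t + f_{u,t} u_t + f_{c,t},\; F_t\big)

\min_{p_{\mathrm{new}}}\; E_{p_{\mathrm{new}}}\!\left[\sum_t c(x_t, u_t)\right] \quad \text{s.t.} \quad D_{\mathrm{KL}}\big(p_{\mathrm{new}}(\tau)\,\|\, p_{\mathrm{old}}(\tau)\big) \le \epsilon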
Learning on PR2

[L. et al. ICRA ‘15]


Combining with Policy Learning
(equation labels: expectation under current policy; trajectory distribution(s); Lagrange multiplier)

L. et al. ICML '14 (dual descent); can also use BADMM (L. et al. '15)
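
The labels above annotate an equation that did not survive extraction. A hedged reconstruction of the constrained objective used in guided policy search, to which those labels refer:

\min_{\theta,\, p}\; E_{p(\tau)}\!\left[\sum_t c(x_t, u_t)\right] \quad \text{s.t.} \quad p(u_t \mid x_t) = \pi_\theta(u_t \mid o_t) \;\; \forall t

The constraint is enforced with Lagrange multipliers (dual descent) or BADMM, alternating between trajectory-centric RL on p and supervised learning on \pi_\theta.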
Guided Policy Search

alternates between trajectory-centric RL and supervised learning


[see L. et al. NIPS ‘14 for details]
training time vs. test time

L.*, Finn*, Darrell, Abbeel ‘16


~92,000 parameters
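
The ~92,000-parameter policy is the visuomotor CNN of L.*, Finn*, Darrell, Abbeel '16. A hedged PyTorch sketch of that architecture; the layer sizes are from my recollection of the paper and should be treated as approximate, and the image/robot-state dimensions are placeholders:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSoftmax(nn.Module):
    # Converts each feature map into the expected (x, y) image location of its
    # activation ("feature points").
    def forward(self, feat):                                   # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        softmax = F.softmax(feat.view(b, c, -1), dim=-1).view(b, c, h, w)
        ys = torch.linspace(-1, 1, h, device=feat.device)
        xs = torch.linspace(-1, 1, w, device=feat.device)
        ex = (softmax.sum(dim=2) * xs).sum(dim=-1)             # (B, C)
        ey = (softmax.sum(dim=3) * ys).sum(dim=-1)             # (B, C)
        return torch.cat([ex, ey], dim=1)                      # (B, 2C) feature points

class VisuomotorPolicy(nn.Module):
    def __init__(self, robot_state_dim=7, action_dim=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2), nn.ReLU(),
            nn.Conv2d(64, 32, 5), nn.ReLU(),
            nn.Conv2d(32, 32, 5), nn.ReLU(),
        )
        self.points = SpatialSoftmax()
        self.fc = nn.Sequential(
            nn.Linear(2 * 32 + robot_state_dim, 40), nn.ReLU(),
            nn.Linear(40, 40), nn.ReLU(),
            nn.Linear(40, action_dim),                         # motor torques
        )

    def forward(self, image, robot_state):
        feats = self.points(self.conv(image))
        return self.fc(torch.cat([feats, robot_state], dim=1))

With these (assumed) layer sizes the parameter count comes out to roughly 91,000, consistent with the figure on the slide.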
Experimental Tasks
Generalization Experiments
Comparisons

• end-to-end training
• pose prediction (trained on pose only)
• pose features (trained on pose only)
Comparisons

task                  pose prediction   pose features   end-to-end training
coat hanger           55.6%             88.9%           100%
shape sorting cube    0%                70.4%           96.3%
toy claw hammer       8.9%              62.2%           91.1%
bottle cap            n/a               55.6%           88.9%

(success rates; slide annotations: "2 cm" next to the shape sorting cube result, hammer image credit: Meeussen et al., Willow Garage)
Guided Policy Search Applications

• manipulation (with N. Wagener and P. Abbeel)
• dexterous hands (with V. Kumar and E. Todorov)
• soft hands (with A. Gupta, C. Eppner, P. Abbeel)
• locomotion (with V. Koltun)
• aerial vehicles (with G. Kahn, T. Zhang, P. Abbeel)
A note about terminology…
the “R” word
a bit of history…
reinforcement learning (the problem statement)

reinforcement learning without using the model (the method)

Lev Pontryagin, Richard Bellman, Andrew Barto, Richard Sutton


Contents

Imitation learning

Imitation without a human

Research frontiers
ingredients for success in learning:

supervised learning: computation, algorithms, data
learning sensorimotor skills: computation, algorithms, data ~?

L., Pastor, Krizhevsky, Quillen ‘16


Grasping with Learned Hand-Eye Coordination

• 800,000 grasp attempts for training (3,000 robot-hours)
• monocular camera (no depth)
• 2-5 Hz update
• no prior knowledge

(setup: monocular RGB camera, 7 DoF arm, 2-finger gripper, object bin)

L., Pastor, Krizhevsky, Quillen ‘16


Using Grasp Success Prediction

(figure: training and testing setups)

L., Pastor, Krizhevsky, Quillen ‘16
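
The system trains a CNN to predict whether a candidate motion command will lead to a successful grasp, and servoing then repeatedly picks commands with high predicted success. A rough sketch of that servoing loop; grasp_cnn, the command parameterization, and the cross-entropy-method style resampling are my reading of the paper, not taken verbatim from the slides:

import numpy as np

def servo_step(grasp_cnn, image, current_pose, n_samples=64, n_elite=6, iters=3):
    """One closed-loop servoing step: choose the motor command whose predicted
    grasp-success probability grasp_cnn(image, pose, command) is highest,
    refining the candidate distribution CEM-style."""
    mean = np.zeros(3)                 # candidate gripper displacement (x, y, z), illustrative
    std = np.ones(3) * 0.05
    for _ in range(iters):
        commands = np.random.randn(n_samples, 3) * std + mean
        scores = np.array([grasp_cnn(image, current_pose, c) for c in commands])
        elite = commands[np.argsort(scores)[-n_elite:]]       # keep the best candidates
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean                        # command to execute at 2-5 Hz before re-observing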


Open-Loop vs. Closed-Loop Grasping
open-loop grasping: failure rate 33.7%
depth + segmentation: failure rate 35%
closed-loop grasping: failure rate 17.5%

Pinto & Gupta, 2015


Grasping Experiments

L., Pastor, Krizhevsky, Quillen ‘16


Continuous Learning in the Real World

• breadth and diversity of data


• learning new tasks quickly
• leveraging prior data
• task success supervision
Learning from Prior Experience

with J. Fu
Learning from Prior Experience
Learning what Success Means

can we learn the cost with visual features?
Learning what Success Means

with C. Finn, P. Abbeel


Learning what Success Means
Challenges & Frontiers
• Algorithms
– Sample complexity
– Safety
– Scalability
• Supervision
– Automatically evaluate success
– Learn cost functions
• Transfer from prior experience
Acknowledgements

Greg Kahn Tianhao Zhang Chelsea Finn Trevor Darrell Pieter Abbeel

Justin Fu Peter Pastor Alex Krizhevsky Deirdre Quillen
