Learning For Autonomous Vehicles: A Focus On Expert Demonstration
by
Dinesh Palanisamy
Summary
Self-driving vehicle technology has seen large advances in the past 30 years thanks to dramatic increases in research and development from academia and industry. The main goal of these efforts is to automate as much of the driving task as possible. Autonomous driving systems employ a varied suite of machine learning based techniques to achieve this end. This report aims to provide a limited survey of these approaches within the context of the robotics pipeline. It particularly focuses on the assimilation of expert data towards the goal of deriving a driving policy.
Contents
Summary
List of Tables
List of Figures
1. Overview
1.1 Modular Approaches
1.2 End-to-end Approaches
1.3 Expert Demonstration
2. Learning Frameworks
2.1 Problem Formulation
2.2 Reinforcement Learning Preliminaries
3. Supervised Learning
3.1 Perception Mediation
3.2 Behavioral Cloning
3.2.1 Advantages
3.2.2 Limitations
4. Imitation Learning
5. Apprenticeship Learning
5.1 Advantages
5.2 Limitations
6. Reinforcement Learning
6.1 Actor-critic Algorithms
6.2 Policy Gradient Algorithms
6.3 POMDPs
7. Challenges in Reinforcement Learning for Autonomous Vehicles
8. Conclusion
8.1 Deep Learning
8.2 Structured Prediction
List of Tables
Table 1: A table describing the RL formulation of the decision problem
List of Figures
Figure 1: An illustration of the hierarchy of decision-making processes
1. Overview
Autonomous driving systems broadly fit two categories: modular approaches and end-to-end approaches (Figure 2). Modular approaches rely on an explicit decomposition of the driving task (Figure 1). They often utilize vision systems to provide a semantic abstraction of the environment, expensive GPS, IMU, and lidar sensors for state estimation, and computationally intensive algorithms for motion planning. These solutions comprise state-of-the-art capabilities. End-to-end approaches, on the other hand, attempt to directly map raw sensor data to actions. They only require cheap camera sensors and offer a fully automatic approach to deriving a policy, but they typically do not provide a generalizable solution for changing or unknown environments. Moreover, they do not provide any guarantees for meeting safety constraints.
1.1 Modular Approaches
We can consider sensing under the modular subtask of robotic perception. Supervised
learning is frequently used in vision systems for recognition tasks like object detection and
classification. This includes finding lane markings, locating other vehicles, and detecting
pedestrians.
The vehicle must also decide the manner in which it will interact with the environment
so as to accomplish the goal of “driving well.” Specifically, we would like to map states of the
vehicle environment to available vehicle actions. This mapping which selects actions based
upon state is known as a policy. Deriving such a policy (or planning) is studied under the
framework of reinforcement learning (RL). Reinforcement learning techniques are used in several of the approaches surveyed in this report.
Figure 2: An illustration to provide a taxonomy of approaches in this report
Figure 3: An illustration describing the taxonomy of learning techniques in this report
with regard to expert demonstration
Table 1: A table describing the RL formulation of the decision problem

| Approach | State s ∈ S | Action a ∈ A | Control u ∈ U | Reward R |
|---|---|---|---|---|
| Behavioral Cloning | | | | |
| Pomerleau (1989) | Camera image [30x32]; range finder [8x32] | 45 discrete, linear turn curvature values | Direct steering command | ℓ2 loss; single hidden layer NN |
| LeCun et al. (2006) | Camera image [2@149x58] | A = [−maxLeft°, +maxRight°] | Direct steering angle | ℓ2 loss; CNN |
| Bojarski et al. (2016) | Camera image [3@66x200] | A = {1/r : r ∈ [−maxLR, maxRR]} | Direct steering command | ℓ2 loss; CNN |
| Imitation Learning | | | | |
| Ross et al. (2011) | Simulator image | A = [−1, 1] | Direct steering angle | ℓ2 loss; linear regressor |
| Apprenticeship Learning | | | | |
| Abbeel, Ng, et al. (2008) | Vehicle state ⟨x, y, θ, d⟩; road lanes | 4D kinematic parameters | Inverse kinematic control signal | R(s) = w ⋅ ϕ(s); cost-to-go in A* planner |
| Kuderer et al. (2015) | Vehicle state ⟨x, y⟩; road lanes, other vehicles | Time continuous spline parameters | Inverse kinematic control signal | R(s) = w ⋅ ϕ(s); likelihood in probabilistic trajectory planner |
| Sharifzadeh et al. (2017) | Simulated obstacle sensor [208x1] | A = [−π, π] | Direct steering angle | R(s) = w ⋅ ϕ(s); Deep Q-Network |
| POMDP | | | | |
| Ulbrich et al. (2013) | S = {LCPossible, LCInProgress, LCBeneficial} | A = {Drive, InitiateLC, AbortLC} | Derived deeper in motion planning pipeline | Engineered R(s, a); real-time belief space search |
| Brechtel et al. (2014) | Vehicle state ⟨x, y, θ⟩; other vehicles | Acc-/deceleration kinematic parameters | Inverse kinematic control signal | Engineered R(s, a); value iteration |
| Actor-Critic Algorithms | | | | |
| Lillicrap et al. (2016) | Simulator image | a = ⟨steering, acc., brake⟩ | Direct steering angle and actuator control | Engineered R(s, a); DDPG |
| Policy Gradient | | | | |
| Shalev-Shwartz et al. (2016b) | Vehicle state ⟨x, y⟩; other vehicles | A = {GiveWay, TakeWay, MaintainOffset} | Derived deeper in motion planning pipeline | Engineered R(s, a); DP optimization |
1.2 End-to-end Approaches
End-to-end approaches skip this explicit decomposition of the driving process and try to learn a policy directly from raw camera inputs. In supervised learning, a regressor can be used on expert-provided training data to map image pixels to steering angles. More sophisticated reinforcement learning based methods include techniques like Deep Q-Networks and Deep Deterministic Policy Gradients, covered later in this report.
1.3 Expert Demonstration
The recognition task in perception essentially boils down to the statistical learning problem of classification. Data is utilized in the form of an expert labeled training set over the classes of interest. Similarly, in learning from demonstration (LfD), expert data is utilized to learn a policy. Example data comprise sequences of state-action pairs collected during the demonstration of desired vehicle behavior by an expert or teacher. LfD techniques attempt to use this data to derive a policy that reproduces the demonstrated behavior. The LfD techniques used in autonomous driving systems that we will cover are behavioral cloning, imitation learning, and apprenticeship learning (Figure 3). For a survey of robot learning from demonstration, see Argall et al. (2009). Figures 3, 5, 6, and 7 also take inspiration from this survey.
2. Learning Frameworks
Learning algorithms can be considered along two axes of complexity (Figure 4). Though
a precise linear ordering of learning techniques cannot be given, we can consider that each
problem subsumes those below and to the left. This report is organized so as to discuss the techniques roughly in this order.
2.1 Problem Formulation
Both behavioral cloning and imitation learning use supervised learning techniques to learn a policy as a mapping function from states to actions. Modeling the interaction of the vehicle and environment as a Markov Decision Process (MDP) allows us to consider the problem under the framework of reinforcement learning. See Table 1 for an RL formulation of the autonomous vehicle decision problem.
2.2 Reinforcement Learning Preliminaries
Classical reinforcement learning approaches are based on the assumption that we have state transition probabilities T which capture the dynamics of the system. The goal of reinforcement learning is to find a policy that maximizes the expected accumulated reward

    J = E[ Σₕ Rₕ ],

where the sum runs over the time steps h of the task.
In the autonomous vehicle domain, and the real-world robotics domain in general, the finite-horizon model and average reward setting are often most relevant. See Section 7 for more on how autonomous vehicle decision problems differ from the traditional RL setting. Under this model,
    π*(s) = argmaxₐ Q*(s, a) .

    Q*(s, a) = R(s, a) − R̄ + Σₛ′ T(s, a, s′) V*(s′) .

    V*(s) = maxₐ* [ R(s, a*) − R̄ + Σₛ′ T(s, a*, s′) V*(s′) ] ,
where V*(s) corresponds to the long-term additional reward, beyond the average reward R̄, gained by starting in state s while taking optimal actions a* (according to the optimal policy π*). See Puterman (1994) and Mahadevan (1996) for precise exposition, formulation, and optimization of these models.
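To make the average-reward Bellman equations above concrete, the following is a minimal sketch of relative value iteration for a small tabular MDP. It assumes NumPy, a hand-specified toy R and T, and a reference-state normalization; these choices are illustrative and not drawn from the cited formulations.

```python
import numpy as np

def relative_value_iteration(R, T, n_iters=1000, tol=1e-8):
    """Relative value iteration for a tabular average-reward MDP.

    R: array of shape (S, A) with rewards R(s, a)
    T: array of shape (S, A, S) with transition probabilities T(s, a, s')
    Returns an estimate of the average reward R_bar, differential values V*,
    and a greedy policy.
    """
    n_states, _ = R.shape
    V = np.zeros(n_states)
    gain = 0.0
    for _ in range(n_iters):
        # Q(s, a) = R(s, a) + sum_s' T(s, a, s') V(s')   (gain subtracted below)
        Q = R + np.einsum("sap,p->sa", T, V)
        V_new = Q.max(axis=1)
        # Normalize by a reference state so values stay bounded; the subtracted
        # quantity approaches the optimal average reward R_bar.
        gain = V_new[0]
        V_new = V_new - gain
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)
    return gain, V, policy

# Toy 2-state, 2-action problem purely for illustration.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
print(relative_value_iteration(R, T))
```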
3. Supervised Learning
Supervised learning in autonomous driving can be broken down into two broad categories: perception mediated approaches and behavioral cloning.
3.1 Perception Mediation
In perception mediated methods, convolutional neural networks (CNNs) are used for recognition in computer vision. See Huval et al. (2015) for an empirical evaluation of deep learning on highway driving.
More generally, Janai et al. (2017) describe state-of-the-art computer vision for autonomous vehicles and Martínez-Gómez et al. (2014) provide a taxonomy of vision systems for mobile robots.
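As a rough illustration of such recognition tasks, the sketch below runs a pretrained, general-purpose torchvision detector over a placeholder camera frame. The specific model and confidence threshold are illustrative stand-ins, not components of any system cited above.

```python
import torch
import torchvision

# Off-the-shelf object detector as a stand-in for a driving-specific model
# trained on expert-labeled data (lane markings, vehicles, pedestrians, ...).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

frame = torch.rand(3, 480, 640)          # placeholder for a camera frame in [0, 1]
with torch.no_grad():
    detections = detector([frame])[0]    # dict with 'boxes', 'labels', 'scores'

confident = detections["scores"] > 0.8   # keep only confident detections
print(detections["boxes"][confident], detections["labels"][confident])
```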
3.2 Behavioral Cloning
Given a training set D = {(s(i), a(i))} where s is a camera image in pixel values and a is a steering angle, we can derive a control policy by training a CNN function approximator. This is done via backpropagation by minimizing the ℓ2 loss over training examples. Supervised learning for deriving π can be viewed as a special case of RL in which sₕ is sampled i.i.d. from some distribution over the image pixel space S and Rₕ = −ℓ2(aₕ, yₕ), where the learner observes yₕ, the (possibly noisy) value of the optimal action when viewing sₕ (Shalev-Shwartz et al., 2016a).
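A minimal sketch of this formulation is given below, assuming PyTorch. The layer sizes are illustrative and are not the ALVINN or DAVE-2 architecture, and `loader` is assumed to yield expert (image, steering angle) pairs.

```python
import torch
import torch.nn as nn

class SteeringCNN(nn.Module):
    """Maps a camera image directly to a steering angle (illustrative layer sizes)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(48, 100), nn.ReLU(),
            nn.Linear(100, 1),             # predicted steering angle
        )

    def forward(self, s):                  # s: [batch, 3, H, W] camera images
        return self.net(s).squeeze(-1)

def clone_behavior(model, loader, epochs=10, lr=1e-4):
    """Minimize the l2 loss between predicted and demonstrated steering angles."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, angles in loader:      # expert demonstration pairs (s_i, a_i)
            loss = nn.functional.mse_loss(model(images), angles)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```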
This was first done by Pomerleau (1989) with ALVINN, using what we would consider by today's standards a tiny neural network, to drive a Chevy van along a 400 meter unobstructed path in a wooded area under sunny fall conditions. Another such system, DAVE by LeCun et al. (2006), trained a 1/10-th scale RC car to avoid obstacles on off-road terrain. More recently, DAVE-2 (Bojarski et al., 2016) was capable of driving 98-100% autonomously for 10 miles on a highway with 72 hours of expert demonstration. The latter two systems achieved these (human-level in the case of DAVE-2) results in large part by leveraging CNNs and GPUs.
3.2.1 Advantages
A big advantage of the behavioral cloning approach is that it is end-to-end and fully automatic, i.e., there is no need to engineer features or consider complex models (dynamics or
otherwise). We simply employ statistical learning methods. Another benefit is that the camera
sensors these systems utilize are very cheap in comparison to the other sensors discussed in
later methods.
3.2.2 Limitations
Limitations are many. For one, we need a tremendous amount of data, which requires a teacher and may be tedious to obtain. As a consequence, these solutions for deriving a control policy may not generalize well across environments that differ drastically from training scenarios. Obviously, critical tasks like driving hold many lives in the balance and have little to no margin for error. An earlier advantage that comes back to bite us is the fact that we have no dynamics model. Left with traditional statistical learning theory, we have little in the way of ensuring safety constraints. And in terms of statistical learning, our earlier i.i.d. assumption turns out to be somewhat problematic. Since our learner's predictions affect future input states during the execution of our learned policy, mistakes can compound. Indeed, as soon as the learner makes a mistake, it may encounter totally different observations than those under the expert demonstration, leading to as many as H²ϵ mistakes in expectation over H steps, where ϵ is the classifier training error (Ross and Bagnell, 2010). Though we are able to obtain end-to-end control policies for a real-world vehicle, they remain fairly limited in their capabilities. DAVE-1 focused solely on obstacle avoidance, DAVE-2 essentially was a lane follower, and both only output steering angles, leaving throttle control to be set at a constant or controlled by a human. These policies, once deployed, also cannot be modified online.
4. Imitation Learning
Imitation learning allows us to model slightly more complexity in the sequential interaction between the vehicle and the environment. DAGGER (Dataset Aggregation) attempts to address the previously mentioned i.i.d. assumption with an online, no-regret supervised approach to imitation learning (Ross et al., 2011). It hinges on the following key observation: though more demonstration data in the behavioral cloning approach can yield marginally better results, what we really want for improving our policy is to learn from mistakes. That is, we want information on how to act in states which the optimal expert wouldn't normally encounter. We obtain the set of states our policy induces in the environment by running it online, query our expert for the optimal action in each of those states, and aggregate these pairs into our training set D for the next iteration (Figure 6). This allows for a learner whose expected number of mistakes grows only linearly with the number of time steps, an improvement over the quadratic upper bound of behavioral cloning.
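A minimal sketch of the DAGGER loop follows, under assumed interfaces: `expert(s)` returns the expert's action for any state, `learner` exposes scikit-learn-style fit/predict, and `env` exposes reset/step. The β-mixing schedule of Ross et al. (2011) is omitted for brevity.

```python
import numpy as np

def dagger(env, expert, learner, n_iters=10, horizon=500):
    """Dataset Aggregation: iteratively relabel the states the learned policy visits."""
    states, labels = [], []
    policy = expert                        # iteration 0 behaves like behavioral cloning
    for _ in range(n_iters):
        s = env.reset()
        for _ in range(horizon):
            states.append(s)
            labels.append(expert(s))       # always record the expert's action for s
            s, done = env.step(policy(s))  # but execute the current policy to induce
            if done:                       # the states it will actually encounter
                break
        learner.fit(np.array(states), np.array(labels))   # train on aggregated data
        policy = lambda x: learner.predict(np.array([x]))[0]
    return learner
```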
Figure 7: An illustration of reward learning with IRL and deriving a policy with RL
5. Apprenticeship Learning
Formulating our decision problem as an MDP allows us to consider both reward
structure and interactive/sequential complexity. If R(s, a) and T are known, we can view the
problem as one of optimal control (Powell, 2012). The motion planning problem can then be cast as optimizing a trajectory with respect to the reward.
The reward function provides a succinct and more transferable representation of our
task, and knowing R(s, a) completely determines the optimal policy (or set of policies) (Abbeel
and Ng, 2004). Determining the reward function the vehicle is optimizing from data provided by the expert is known as inverse reinforcement learning (IRL). Deriving a policy by using the learned reward function within the motion planning problem (or another RL technique) is often referred to as apprenticeship learning (Figure 7).
We can utilize the motion planning framework to aid in the reinforcement learning problem. The state can be divided into a configuration space C and additional state variables Y, where a configuration describes the position and orientation of
the vehicle with respect to an inertial reference frame. The space C is the set of all vehicle
configurations and system dynamics are invariant with respect to rotations and translations on
it (Frazzoli, 2002). In Y are the derivatives of the configuration variables as well as other state
parameters depending on the approach. Examples include distance to road edge from the
vehicle, distance to other vehicles, etc. Abbeel et al. (2008) use a 4D kinematic approach with
Δx, Δy, distance to nearest road edge, and angle of nearest road edge. Their reward is of the
form R(s) = w ⋅ ϕ(s) where w is a vector of 7 weights, each of which correspond to a feature in
feature map ϕ over S. The idea of the feature map is to cover the different desiderata in driving
that we want to trade off. Some features this implementation includes are the total length of the planned path, among others. The algorithm is to iteratively learn (using a max-margin SVM) weights w for which the cost-to-go in a motion planner produces optimal policies with J values close to J(π*). This process can be done with
O (k log k) iterations where k is the dimension of the feature space (Abbeel and Ng, 2004).
Abbeel et al. do kinodynamic motion planning using a custom A* algorithm that searches through the 4D kinematic state space with R(s) as the cost-to-go heuristic. They use it to compute paths matching
different expert demonstration styles. See Dolgov et al. (2008) for more on the path planning algorithm.
Kuderer et al. (2015) formulate the optimization problem slightly differently and use
time continuous quintic splines to compute trajectories in real-time for highway driving. Instead
of a motion planner, Sharifzadeh et al. (2017) use a Deep Q-Network for apprenticeship learning
in simulation.
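The feature-expectation-matching idea shared by these approaches can be sketched as follows. This is the projection variant of Abbeel and Ng's (2004) algorithm rather than the max-margin SVM formulation described above; `solve_mdp` and `feature_expectations` are assumed stand-ins for the planner/RL solver and for Monte Carlo rollouts.

```python
import numpy as np

def apprenticeship_learning(mu_expert, solve_mdp, feature_expectations,
                            n_iters=50, eps=1e-3):
    """Projection variant of apprenticeship learning via IRL.

    mu_expert: feature expectations of the expert demonstrations, shape (k,)
    solve_mdp(w): returns an (approximately) optimal policy for R(s) = w . phi(s)
    feature_expectations(policy): estimates mu(policy), e.g. by rollouts
    """
    w = np.random.randn(*mu_expert.shape)          # arbitrary initial reward weights
    policy = solve_mdp(w)
    mu_bar = feature_expectations(policy)
    for _ in range(n_iters):
        w = mu_expert - mu_bar                     # point toward the expert's features
        if np.linalg.norm(w) <= eps:               # feature expectations (nearly) match
            break
        policy = solve_mdp(w)                      # planner / RL step for current reward
        mu = feature_expectations(policy)
        d = mu - mu_bar                            # project mu_expert onto the segment
        step = np.clip(d @ (mu_expert - mu_bar) / (d @ d + 1e-12), 0.0, 1.0)
        mu_bar = mu_bar + step * d
    return w, policy
```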
5.1 Advantages
Apprenticeship learning can derive a driving policy at a computational cost that meets safety constraints. It benefits from the use of a motion planner, and instead of engineering the cost-to-go function, we learn it. This may be a lot less work than manually tweaking it. Once we have R, we have a generalizable solution because we can
(potentially in real-time) compute new policies in changing environments. The most highly
capable autonomous vehicle systems use motion planning techniques; see Katrakazas et al.
(2015) for analysis on state-of-the-art motion planning methods for autonomous on-road
driving.
5.2 Limitations
These capabilities are not without cost, however. We require maps of the environment for
localization and accurate state estimation relies on expensive GPS/IMU and lidar sensors.
Another issue is that the Markov assumption in the MDP formulation of our problem may not
hold. This is especially true in the presence of other (potentially adversarial) vehicles. This
seems to limit us in achieving a fully autonomous vehicle capable of operating under real-world
road conditions.
6. Reinforcement Learning
In contrast to apprenticeship learning, the techniques presented here use engineered
reward functions and may explore to gather data for deriving a policy rather than depend on
demonstration.
6.1 Actor-critic Algorithms
Lillicrap et al. (2016) derive a continuous driving policy with the Deep Deterministic Policy Gradient (DDPG) algorithm in simulation. It is worth noting that their DDPG algorithm with no dynamics model was sometimes able to achieve results equal to or better than iLQG, a motion planner with full access to the system dynamics. Exploration is performed by adding noise from an Ornstein-Uhlenbeck process to the actions. This poses a challenge for real-world implementation due to the need for learning from negative rewards. Safely evaluating reinforcement learning for real vehicles is a non-trivial challenge that differentiates it from other well-studied RL problems.
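The Ornstein-Uhlenbeck exploration noise mentioned above can be sketched as follows; the parameters are commonly used defaults, not values reported for any specific driving experiment.

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise added to continuous actions for exploration (DDPG)."""
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(dim, mu)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x

noise = OrnsteinUhlenbeckNoise(dim=3)            # e.g. steering, acceleration, brake
actor_output = np.zeros(3)                       # stand-in for the actor network's action
exploratory_action = np.clip(actor_output + noise.sample(), -1.0, 1.0)
```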
6.2 Policy Gradient Algorithms
Shalev-Shwartz et al. (2016b) address the limitations of the Markov formulation of the decision problem. They utilize a hybrid approach where a policy gradient algorithm assigns intended actions and relationships to other vehicles, and feasible trajectories are then derived deeper in the motion planning pipeline (Table 1).
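As a generic illustration of the policy gradient machinery (a plain REINFORCE update, not the specific hybrid method of Shalev-Shwartz et al.), the sketch below assumes PyTorch; the state dimension and the three maneuver options mirror Table 1 but are otherwise arbitrary.

```python
import torch
import torch.nn as nn

class ManeuverPolicy(nn.Module):
    """Stochastic policy over discrete maneuvers, e.g. give way / take way / maintain offset."""
    def __init__(self, state_dim=8, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

def reinforce_update(policy, optimizer, states, actions, returns):
    """One policy gradient step: grad J ~ E[ grad log pi(a|s) * return ]."""
    log_probs = policy(states).log_prob(actions)   # states: [T, d], actions: [T]
    loss = -(log_probs * returns).mean()           # returns: [T] rewards-to-go
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```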
6.3 POMDPs
Another way to overcome the limitations of the Markov assumption is to use a Partially
Observable Markov Decision Process (POMDP) model. Ulbrich et al. (2013) perform a real-time belief space search to decide on lane changes for fully automated driving. It is important to note this work deals solely with the behavioral layer; trajectory planning and control are left to other components of their Stadtpilot system. They most likely utilize a motion planner and some
sophisticated control system. See Paden et al. (2016) for a survey of motion planning and control
techniques for self-driving urban vehicles. Brechtel et al. (2014) find kinematically feasible acceleration and deceleration maneuvers by solving a continuous POMDP with a value iteration algorithm. It is unclear whether the state representation and computational costs of the algorithm are amenable to real-world driving.
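These POMDP approaches rest on maintaining a belief over states. A minimal Bayes-filter update for a discrete POMDP is sketched below, with T and Z as assumed tabular transition and observation models; it is an illustration of the belief-tracking step, not either cited planner.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """One Bayes filter step for a discrete POMDP belief.

    b: current belief over states, shape (S,)
    a: index of the action taken (e.g. Drive, InitiateLC, AbortLC)
    o: index of the observation received
    T: transition model, T[s, a, s'] = P(s' | s, a)
    Z: observation model, Z[a, s', o] = P(o | s', a)
    """
    b_pred = b @ T[:, a, :]          # predict: push the belief through the dynamics
    b_new = Z[a, :, o] * b_pred      # correct: weight by the observation likelihood
    return b_new / b_new.sum()       # renormalize to a proper distribution
```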
7. Challenges in Reinforcement Learning for Autonomous Vehicles
Reinforcement learning approaches for autonomous vehicles inherit the "Curse of Dimensionality" from optimal control. Since our state and action spaces are continuous, we must consider the resolution at which they are represented and how fine-grained we want the control (Kober et al., 2013). It also cannot be ignored that real-world sensing is imperfect; it is unrealistic to assume that we can observe a state completely and noise-free. Finally, engineering reward functions is more difficult. For more on reinforcement learning in the robotics domain, see Kober et al. (2013).
8. Conclusion
It seems beneficial to break down the autonomous driving decision problem and utilize different learning techniques for each sub-problem. This requires a far more complex system overall than the end-to-end approaches. However, it also yields a system that is more capable of handling changing and unknown environments.
Providing expert data can make sub-problems more tractable by narrowing the search
space and removing the need for global exploration. We essentially reduced the reinforcement
learning problem to supervised learning in all our learning from demonstration approaches.
8.1 Deep Learning
Deep learning is ubiquitous due to its function approximation abilities. We use CNNs for classification and as a policy itself. More specialized is its use in reinforcement learning; see Arulkumaran et al. (2017) for a brief survey of deep reinforcement learning.
8.2 Structured Prediction
Not covered in this report are lines of work that treat autonomous navigation as a video prediction task. They generally utilize structured prediction methods for sequential video prediction (Xu et al., 2016). See Lotter et al. (2016) for unsupervised learning and video prediction. Xu et al. (2016) do end-to-end learning of driving models from large-scale video datasets.
References
Abbeel, P., Dolgov, D., Ng, A., Thrun, S. (2008). Apprenticeship learning for motion planning
with application to parking lot navigation. In Proc. IEEE/RSJ Int. Conf. on Intelligent
Robots and Systems (IROS).
Abbeel, P., Ng, A. (2004). Apprenticeship learning via inverse reinforcement learning. In
International Conference on Machine Learning (ICML).
Argall, B. D., Chernova, S., Veloso, M., Browning, B. (2009). A survey of robot learning from
demonstration. Robotics and Autonomous Systems, 57:469–483.
Arulkumaran, K., Deisenroth, M. P., Brundage, M., Bharath, A. A. (2017). A Brief Survey of
Deep Reinforcement Learning. In IEEE Signal Processing Magazine, Special Issue on Deep
Learning for Image Understanding.
Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L., Monfort,
M., Muller, U., Zhang, J., Zhang, X., Zhao, J., Zieba, K. (2016). End to End Learning for
Self-Driving Cars. In arXiv:1604.07316.
Brechtel, S., Gindele, T., Dillmann, R. (2014). Probabilistic decision-making under uncertainty
for autonomous driving using continuous POMDPs. In Intelligent Transportation Systems
(ITSC).
Dolgov, D., Thrun, S., Montemerlo, M., Diebel, J. (2008). Path planning for autonomous driving
in unknown environments. In Proceedings of the Eleventh International Symposium on
Experimental Robotics (ISER-08).
Frazzoli, E., Dahleh, M. A., Feron, E. (2002). Real-time motion planning for agile
autonomous vehicles. Journal of Guidance, Control, and Dynamics, 25(1):116–129.
Huval, B., Wang, T., Tandon, S., Kiske, J., Song, W. Pazhayampallil, J., Andriluka, M., Rajpurkar,
P., Migimatsu, T., Cheng-Yue, R., Mujica, F., Coates, A. (2015). An Empirical Evaluation
of Deep Learning on Highway Driving. In arXiv:1504.01716.
Janai, J., Güney, F., Behl., A. (2017). Computer Vision for Autonomous Vehicles: Problems,
Datasets and State-of-the-Art. In arXiv:1704.05519.
Katrakazas, C., Quddus, M., Chen, W. H., Deka, L. (2015). Real-time motion planning methods
for autonomous on-road driving: State-of-the-art and future research directions. In
Transportation Research Part C: Emerging Technologies.
Kober, J., Bagnell, J. A., Peters, J. (2013). Reinforcement Learning in Robotics: A Survey. In
International Journal of Robotics Research (IJRR).
Kuderer, M., Gulati, S., Burgard, W. (2015). Learning driving styles for autonomous
vehicles from demonstration. In Robotics and Automation (ICRA).
LeCun, Y., Muller, U., Ben, J., Cosatto, E., Flepp, B. (2006). Off-road obstacle avoidance through
end-to-end learning. In Advances in Neural Information Processing Systems 18.
Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D. (2016).
Continuous control with deep reinforcement learning. In arXiv:1509.02971.
Lotter, W., Kreiman, G., Cox, D. (2016). Deep Predictive Coding Networks for Video Prediction
and Unsupervised Learning. In arXiv:1605.08104.
Paden, B., Cap, M., Yong, S. Z., Yershov, D., Frazzoli, E. (2016). A Survey of Motion Planning
and Control Techniques for Self-driving Urban Vehicles. In arXiv:1604.07446.
Powell, W. B. (2012). AI, OR and control theory: A rosetta stone for stochastic optimization.
Technical report, Princeton University.
Ross, S., Bagnell, J. A. (2010). Efficient reductions for imitation learning. In Proceedings of the 13th
International Conference on Artificial Intelligence and Statistics (AISTATS).
Ross, S., Gordon, G., and Bagnell, J. A. (2011). A reduction of imitation learning and structured
prediction to no-regret online learning. In International Conference on Artificial Intelligence
and Statistics (AISTATS).
Shalev-Shwartz, S., Ben-Zrihem, N., Cohen, A., Shashua, A. (2016a). Long-term Planning by
Short-term Prediction. In arXiv:1602.01580.
Sharifzadeh, S., Chiotellis, I., Triebel, R., Cremers., D. (2017). Learning to drive using inverse
reinforcement learning and deep q-networks. In arXiv:1612.03653.
Ulbrich, S., Maurer, M. (2013). Probabilistic online POMDP decision making for lane changes in
fully automated driving. In Intelligent Transportation Systems (ITSC).
Xu, H., Gao, Y., Yu, F., Darrell, T. (2016). End-to-end Learning of Driving Models from Large-
scale Video Datasets. In arXiv:1612.01079.