
Learning For Autonomous Vehicles: A Focus on Expert Demonstration

by Dinesh Palanisamy

Advanced Topics: Motion Planning & Optimal Control
Instructor: Kris Hauser
Duke University

Summary
Self-driving vehicle technology has seen large advances in the past 30 years thanks to dramatic increases in research and development from academia and industry. The main goal of these efforts is to automate as much of the driving task as possible. Autonomous driving systems employ a varied suite of machine learning-based techniques to achieve this end. This report aims to provide a limited survey of these approaches within the context of the robotics pipeline. It particularly focuses on the assimilation of expert data toward the goal of “driving well” or “driving like a human.”

Contents

Summary
List of Tables
List of Figures
1. Overview
  1.1 Modular Approaches
  1.2 End-to-end Approaches
  1.3 Expert Demonstration
2. Learning Frameworks
  2.1 Problem Formulation
  2.2 Reinforcement Learning Preliminaries
3. Supervised Learning
  3.1 Perception Mediation
  3.2 Behavioral Cloning
    3.2.1 Advantages
    3.2.2 Limitations
4. Imitation Learning
5. Apprenticeship Learning
  5.1 Advantages
  5.2 Limitations
6. Reinforcement Learning
  6.1 Actor-critic Algorithms
  6.2 Policy Gradient Algorithms
  6.3 POMDPs
7. Challenges in Reinforcement Learning for Autonomous Vehicles
8. Conclusion
  8.1 Deep Learning
  8.2 Structured Prediction

List of Tables
Table 1: A table describing the RL formulation of the decision problem

List of Figures
Figure 1: An illustration of the hierarchy of decision-making processes
Figure 2: An illustration to provide a taxonomy of approaches in this report
Figure 3: An illustration describing the taxonomy of LfD
Figure 4: An illustration of inter-relations between well-studied learning problems
Figure 5: An illustration of mapping state to action with SL
Figure 6: An illustration of control policy derivation with imitation learning
Figure 7: An illustration of apprenticeship learning

1. Overview
Autonomous driving systems broadly fit two categories—modular approaches and end-to-end approaches (Figure 2). Modular approaches rely on an explicit decomposition of the driving task (Figure 1). Oftentimes they utilize vision systems to provide a semantic abstraction of the environment, expensive GPS, IMU, and lidar sensors for state estimation, and computationally intensive algorithms for motion planning. These solutions comprise state-of-the-art capabilities. End-to-end approaches, on the other hand, attempt to directly map raw sensor data to action. They only require cheap camera sensors and offer a fully automatic approach to deriving a policy, but typically do not provide a generalizable solution for changing or unknown environments. Moreover, they do not provide any guarantees for meeting safety constraints.

1.1 Modular Approaches

We can consider sensing under the modular subtask of robotic perception. Supervised learning is frequently used in vision systems for recognition tasks like object detection and classification. This includes finding lane markings, locating other vehicles, and detecting pedestrians.

The vehicle must also decide the manner in which it will interact with the environment so as to accomplish the goal of “driving well.” Specifically, we would like to map states of the vehicle environment to available vehicle actions. This mapping, which selects actions based upon state, is known as a policy. Deriving such a policy (or planning) is studied under the framework of reinforcement learning (RL). Reinforcement learning techniques are used in the behavioral layer, as well as the motion planning layer.

Figure 1: An illustration of the hierarchy of decision-making processes (Paden et al., 2016)

Figure 2: An illustration to provide a taxonomy of approaches in this report

Figure 3: An illustration describing the taxonomy of learning techniques in this report with regard to expert demonstration

Figure 4: An illustration of inter-relations between well-studied learning problems (Langford and Zadrozny, 2005)


(Each entry lists: state s ∈ S; action a ∈ A; control u ∈ U; reward R.)

Behavioral Cloning
- Pomerleau (1989): State: camera image [30x32], range finder [8x32]. Action: 45 discrete, linear turn-curvature values. Control: direct steering command. Reward: ℓ2 loss; single-hidden-layer NN.
- LeCun et al. (2006): State: camera image [2@149x58]. Action: A = [−maxLeft°, +maxRight°]. Control: direct steering angle. Reward: ℓ2 loss; CNN.
- Bojarski et al. (2016): State: camera image [3@66x200]. Action: A = {1/r : r ∈ [−maxLR, maxRR]}. Control: direct steering command. Reward: ℓ2 loss; CNN.

Imitation Learning
- Ross et al. (2011): State: simulator image. Action: A = [−1, 1]. Control: direct steering angle. Reward: ℓ2 loss; linear regressor.

Apprenticeship Learning
- Abbeel, Ng, et al. (2008): State: vehicle state ⟨x, y, θ, d⟩, road lanes. Action: 4D kinematic parameters. Control: inverse kinematic control signal. Reward: R(s) = w ⋅ ϕ(s); cost-to-go in A* planner.
- Kuderer et al. (2015): State: vehicle state ⟨x, y⟩, road lanes, other vehicles. Action: time-continuous spline parameters. Control: inverse kinematic control signal. Reward: R(s) = w ⋅ ϕ(s); likelihood in probabilistic trajectory planner.
- Sharifzadeh et al. (2017): State: simulated obstacle sensor [208x1]. Action: A = [−π, π]. Control: direct steering angle. Reward: R(s) = w ⋅ ϕ(s); Deep Q-Network.

POMDP
- Ulbrich et al. (2013): State: S = {LCPossible, LCInProgress, LCBeneficial}. Action: A = {Drive, InitiateLC, AbortLC}. Control: derived deeper in motion planning pipeline. Reward: engineered R(s, a); real-time belief space search.
- Brechtel et al. (2014): State: vehicle state ⟨x, y, θ⟩, other vehicles. Action: acceleration/deceleration kinematic parameters. Control: inverse kinematic control signal. Reward: engineered R(s, a); value iteration.

Actor-Critic Algorithms
- Lillicrap et al. (2016): State: simulator image. Action: a = ⟨steering, acceleration, brake⟩. Control: direct steering angle and actuator control. Reward: engineered R(s, a); DDPG.

Policy Gradient
- Shalev-Shwartz et al. (2016b): State: vehicle state ⟨x, y⟩, other vehicles. Action: A = {GiveWay, TakeWay, MaintainOffset}. Control: derived deeper in motion planning pipeline. Reward: engineered R(s, a); DP optimization.

Table 1: A table describing the RL formulation of the decision problem


1.2 End-to-end Approaches

End-to-end approaches collapse the hierarchical breakdown of the decision-making process and try to learn a policy directly from raw camera inputs. In supervised learning, a regressor can be used on expert-provided training data to map image pixels to steering angles. More sophisticated reinforcement learning-based methods include techniques like Deep Deterministic Policy Gradient—an off-policy, model-free, actor-critic algorithm.
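As a toy illustration of this supervised, end-to-end idea, the sketch below fits a plain linear regressor from flattened grayscale frames to steering angles. The array names, shapes, and random data are hypothetical placeholders, not drawn from any of the systems cited in this report.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical expert dataset: N grayscale frames (66x200 pixels) and N steering angles.
N, H, W = 1000, 66, 200
frames = np.random.rand(N, H, W)             # stand-in for camera images
steering = np.random.uniform(-1, 1, size=N)  # stand-in for expert steering angles

X = frames.reshape(N, -1)                    # flatten each frame into a feature vector
model = Ridge(alpha=1.0).fit(X, steering)    # least-squares fit with L2 regularization

# At run time, each new frame is flattened and mapped directly to a steering command.
new_frame = np.random.rand(1, H, W)
predicted_angle = model.predict(new_frame.reshape(1, -1))
```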

1.3 Expert Demonstration

The recognition task in perception essentially boils down to the statistical learning problem of classification. Data is utilized in the form of an expert-labeled training set over which the model error rate is minimized.

In Learning from Demonstration (LfD), expert demonstrations provided by a teacher are utilized to learn a policy. Example data consist of sequences of state-action pairs collected during the demonstration of desired vehicle behavior by an expert or teacher. LfD techniques attempt to use this data to derive a policy that reproduces the demonstrated behavior. The LfD techniques used in autonomous driving systems we will cover are behavioral cloning, imitation learning, and apprenticeship learning (Figure 3). For a survey of robot learning from demonstration, see Argall et al. (2009). Figures 3, 5, 6, and 7 also take inspiration from this excellent synthesis of the topic.

2. Learning Frameworks
Learning algorithms can be considered along two axes of complexity (Figure 4). Though a precise linear ordering of learning techniques cannot be given, we can consider that each problem subsumes those below and to the left. This report is organized so as to discuss techniques in ascending order of complexity.

2.1 Problem Formulation

Both behavioral cloning and imitation learning use supervised learning techniques to learn a policy in the form of a mapping function. Modeling the interaction of the vehicle and environment as a Markov Decision Process (MDP) allows us to consider the problem under the framework of reinforcement learning. See Table 1 for an RL formulation of the autonomous vehicle decision problems this report covers.

2.2 Reinforcement Learning Preliminaries

Classical reinforcement learning approaches are based on the assumption that we have an MDP consisting of states s ∈ S, actions a ∈ A, reward R : S × A → ℝ, and transition probabilities T which capture the dynamics of the system. The goal of reinforcement learning is to discover an optimal stationary deterministic policy π* : S → A that maximizes the cumulative expected reward J over the next H steps (Kober et al., 2013):

J = E[ Σ_{h=0}^{H} R_h ].

In the autonomous vehicle domain, and the real-world robotics domain in general, the finite-horizon model and average reward setting are often most relevant. See Section 7 for more on how autonomous vehicle decision problems differ from the traditional RL setting. Under this model,

π*(s) = argmax_a Q*(s, a),

Q*(s, a) = R(s, a) − R̄ + Σ_{s′} V*(s′) T(s, a, s′),

V*(s) = max_{a*} [ R(s, a*) − R̄ + Σ_{s′} V*(s′) T(s, a*, s′) ],

where V*(s) corresponds to the long-term additional reward, beyond the average reward R̄, gained by starting in state s while taking optimal actions a* (according to the optimal policy π*). See Puterman (1994) and Mahadevan (1996) for precise exposition, formulation, and optimization of the finite-horizon, average reward MDP.
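To make the finite-horizon objective concrete, the following sketch runs backward-induction value iteration on a tiny tabular MDP. The two-state transition and reward arrays are invented for illustration and are not tied to any system discussed in this report.

```python
import numpy as np

# Hypothetical tabular MDP: 2 states, 2 actions, horizon H = 5.
# T[s, a, s'] is the transition probability, R[s, a] the immediate reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
H = 5

num_states, num_actions = R.shape
V = np.zeros(num_states)                # value at the end of the horizon
policy = np.zeros((H, num_states), dtype=int)

# Backward induction: V_h(s) = max_a [ R(s, a) + sum_s' T(s, a, s') V_{h+1}(s') ]
for h in reversed(range(H)):
    Q = R + T @ V                       # Q[s, a] at stage h
    policy[h] = Q.argmax(axis=1)        # greedy (deterministic) action at stage h
    V = Q.max(axis=1)

print("expected return from each start state:", V)
print("stage-0 actions:", policy[0])
```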

3. Supervised Learning
Supervised learning in autonomous driving can be broken down into two broad categories—perception mediation and behavioral cloning.

3.1 Perception Mediation

In perception mediated methods, convolutional neural networks (CNNs) are used for

recognition in computer vision. See Huval et al. (2015) for an empirical evaluation of deep

learning based mediated perception for highway driving.

More generally, Janai et al. (2017) describes state-of-the-art computer vision for

autonomous vehicles and Martínez-Gómez et al. (2014) provides a taxonomy of vision systems

for ground mobile robots.
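As one hedged example of what such a recognition component might look like in practice, the snippet below runs a generic pretrained object detector from torchvision on a single frame. This is only an illustration of CNN-based detection, not the pipeline of any cited system, and the flag for loading pretrained weights may differ by torchvision version.

```python
import torch
import torchvision

# Load a generic pretrained detector (its COCO classes include cars, pedestrians, traffic lights).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Stand-in for a camera frame: a 3-channel tensor with values in [0, 1].
frame = torch.rand(3, 480, 640)

with torch.no_grad():
    detections = model([frame])[0]

# Keep only confident detections; each has a bounding box, class label, and score.
keep = detections["scores"] > 0.8
print(detections["boxes"][keep], detections["labels"][keep])
```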


Figure 5: An illustration of mapping state to action (i.e., deriving a policy) with a supervised learning technique

3.2 Behavioral Cloning

Given a training set D = {(s^(i), a^(i))} where s is a camera image in pixel values and a is a steering angle, we can derive a control policy by training a CNN function approximator. This is done via backpropagation by minimizing the ℓ2 loss over training examples. Supervised learning for deriving π can be viewed as a special case of RL in which s_h is sampled i.i.d. from some distribution over image pixel space S and R_h = −ℓ2(a_h, y_h), where the learner observes y_h, the (possibly noisy) value of the optimal action when viewing s_h (Shalev-Shwartz et al., 2016a).
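A minimal sketch of this training setup in PyTorch is shown below; the network architecture, image size, and random dataset are invented placeholders rather than the ALVINN or DAVE-2 models.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical expert data: RGB frames (3x66x200) paired with scalar steering angles.
images = torch.rand(512, 3, 66, 200)
angles = torch.rand(512, 1) * 2 - 1
loader = DataLoader(TensorDataset(images, angles), batch_size=32, shuffle=True)

# A small CNN regressor standing in for the policy pi(s) -> steering angle.
policy = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 14 * 47, 64), nn.ReLU(),   # flattened conv output for 3x66x200 inputs
    nn.Linear(64, 1),
)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                        # the l2 loss over training examples

for epoch in range(5):
    for batch_images, batch_angles in loader:
        optimizer.zero_grad()
        loss = loss_fn(policy(batch_images), batch_angles)  # backpropagation on l2 loss
        loss.backward()
        optimizer.step()
```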

This was first done by Pomerleau (1989) with ALVINN, using what we would consider by today's standards a tiny neural network, to drive a Chevy van along a 400-meter unobstructed path in a wooded area under sunny fall conditions. Another such system, DAVE by LeCun et al. (2006), trained a 1/10-th scale RC car to avoid obstacles on off-road terrain. More recently, DAVE-2 (Bojarski et al., 2016) was capable of driving 98-100% autonomously for 10 miles on a highway with 72 hours of expert demonstration. The latter two systems achieved these (human-level, in the case of DAVE-2) results in large part by leveraging CNNs and GPUs.


3.2.1 Advantages

A big advantage of the behavioral cloning approach is that it is end-to-end and fully automatic, i.e., there is no need to engineer features or consider complex models (dynamics or otherwise). We simply employ statistical learning methods. Another benefit is that the camera sensors these systems utilize are very cheap in comparison to the other sensors discussed in later methods.

3.2.2 Limitations

Limitations are many. For one, we need a tremendous amount of data, which requires a teacher and may be tedious to obtain. As a consequence, these solutions for deriving a control policy may not generalize that well across environments that differ drastically from training scenarios. Obviously, critical tasks like driving hold many lives in the balance and have little to no margin for error. An earlier advantage that comes back to bite us is the fact that we have no dynamics model. Left with traditional statistical learning theory, we have little in the way of ensuring safety constraints. And in terms of statistical learning, our earlier i.i.d. assumption turns out to be somewhat problematic. Since our learner's prediction affects future input states during the execution of our learned policy, mistakes could lead to a compounding of errors. Indeed, as soon as the learner makes a mistake, it may encounter totally different observations than those under the expert demonstration, leading to as many as H²ϵ mistakes in expectation over H steps, where ϵ is the classifier training error (Ross and Bagnell, 2010). For example, with a 1% per-step error rate over a 100-step horizon, this bound allows on the order of 100² · 0.01 = 100 expected mistakes, rather than the 100 · 0.01 = 1 the i.i.d. view would suggest. Though we are able to obtain end-to-end control policies for a real-world vehicle, they remain fairly limited in their capabilities. DAVE-1 focused solely on obstacle avoidance, DAVE-2 was essentially a lane follower, and both only output steering angles, leaving throttle control to be set at a constant or controlled by a human. These policies, once deployed, also cannot be modified online in any fashion.

Figure 6: An illustration of control policy derivation and execution highlighting the iterative, online imitation learning step

4. Imitation Learning
Imitation learning allows us to model slightly more complexity in the sequential interaction between the vehicle and the environment. DAGGER (Dataset Aggregation) attempts to address the previously mentioned i.i.d. assumption with an online, no-regret supervised approach to imitation learning (Ross et al., 2011). It hinges on the following key observation: though more demonstration data in the behavioral cloning approach can yield marginally better results, what we really want for improving our policy is to learn from mistakes. That is, we want information on how to act over states which the optimal expert wouldn't normally encounter. We obtain the set of states our policy induces in the environment by running it online, query our expert for the optimal action, and aggregate these pairs into our training set D for the next iteration (Figure 6). This allows for a learner that makes linear mistakes in expectation with respect to time steps, which is an improvement over the quadratic upper bound guarantee mentioned in Section 3.2.2.
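The following is a minimal sketch of the DAGGER loop under simplified assumptions; the env, expert_policy, and train_supervised helpers are hypothetical placeholders standing in for a simulator, a human or reference controller, and the regression step described in Section 3.2.

```python
def dagger(env, expert_policy, train_supervised, n_iterations=10, rollout_length=100):
    """Sketch of DAGGER: grow dataset D with states the *learner* visits,
    labeled by the expert, then retrain a supervised policy on all of D."""
    dataset = []                                           # aggregated (state, expert_action) pairs
    policy = expert_policy                                 # iteration 0: roll out the expert itself
    for _ in range(n_iterations):
        state = env.reset()
        for _ in range(rollout_length):
            dataset.append((state, expert_policy(state)))  # expert labels the visited state
            state = env.step(policy(state))                # but the current learner picks the action
        policy = train_supervised(dataset)                 # supervised fit on all of D
    return policy
```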

Figure 7: An illustration of reward learning with IRL and deriving a policy with RL

5. Apprenticeship Learning
Formulating our decision problem as an MDP allows us to consider both reward structure and interactive/sequential complexity. If R(s, a) and T are known, we can view the problem as one of optimal control (Powell, 2012). Thus the motion planning problem now comes into focus.

The reward function provides a succinct and more transferable representation of our task, and knowing R(s, a) completely determines the optimal policy (or set of policies) (Abbeel and Ng, 2004). Determining the reward function the vehicle is optimizing from data provided by an expert is referred to as Inverse Reinforcement Learning (IRL). Embedding the learned reward function within the motion planning problem (or other RL technique) is often referred to as apprenticeship learning (Figure 7).

We can utilize the motion planning framework to aid in the reinforcement learning formulation of our decision problem. Let the state space be S = C × Y, where x ∈ C is the configuration of the vehicle with respect to an inertial reference frame. The space C is the set of all vehicle configurations, and the system dynamics are invariant with respect to rotations and translations on it (Frazzoli, 2002). In Y are the derivatives of the configuration variables as well as other state parameters depending on the approach. Examples include distance to the road edge from the vehicle, distance to other vehicles, etc. Abbeel et al. (2008) use a 4D kinematic approach with x = ⟨x, y, θ, d⟩ and C = ℝ² × S¹ × {0, 1}, where d denotes backwards or forwards. In Y are Δx, Δy, the distance to the nearest road edge, and the angle of the nearest road edge. Their reward is of the form R(s) = w ⋅ ϕ(s), where w is a vector of 7 weights, each of which corresponds to a feature in the feature map ϕ over S. The idea of the feature map is to cover the different desiderata in driving that we want to trade off. Some features this implementation includes are total length of trajectory, proximity of trajectory to obstacles, and a measure of trajectory alignment to directional rules in the parking lot. We are given m expert trajectories {s₀^(i), s₁^(i), …}_{i=1}^{m} recorded from expert demonstration, which we assume are generated by π*. By taking guesses at w and knowing R(s), we can iteratively learn (using a max-margin SVM) weights w for which the cost-to-go in a motion planner produces optimal policies with J values close to J(π*). This process can be done with O(k log k) iterations, where k is the dimension of the feature space (Abbeel and Ng, 2004).

Abbeel et al. do kinodynamic motion planning using a custom A* algorithm that searches through the 4D kinematic state space with R(s) as the cost-to-go heuristic. They use it to compute trajectories for an autonomous vehicle to navigate a parking lot in a manner mimicking different expert demonstration styles. See Dolgov et al. (2008) for more on the path planning algorithm utilized by this vehicle for autonomous driving in unknown environments.
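A heavily simplified sketch of the linear-reward learning idea is given below: it alternates between guessing weights w, planning with the induced reward, and nudging w toward the expert's feature expectations. The planner and feature map are hypothetical stand-ins, and this projection-style update is only a rough illustration of the spirit of the max-margin procedure of Abbeel and Ng (2004), not their exact algorithm.

```python
import numpy as np

def feature_expectations(trajectories, phi):
    """Average feature counts: mean over trajectories of sum_t phi(s_t)."""
    return np.mean([np.sum([phi(s) for s in traj], axis=0) for traj in trajectories], axis=0)

def apprenticeship_learning(expert_trajs, phi, plan_with_reward, n_iters=20, tol=1e-3):
    """plan_with_reward(w) is assumed to return trajectories that are (near-)optimal
    for the reward R(s) = w . phi(s), e.g. by running a motion planner or RL solver."""
    mu_expert = feature_expectations(expert_trajs, phi)
    w = np.random.randn(mu_expert.shape[0])          # initial guess at the reward weights
    for _ in range(n_iters):
        learner_trajs = plan_with_reward(w)          # inner planning / RL step
        mu_learner = feature_expectations(learner_trajs, phi)
        gap = mu_expert - mu_learner                 # how far we are from the expert's behavior
        if np.linalg.norm(gap) < tol:                # feature expectations (nearly) matched
            break
        w = gap / (np.linalg.norm(gap) + 1e-8)       # move w toward the unmatched features
    return w
```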

Kuderer et al. (2015) formulate the optimization problem slightly differently and use time-continuous quintic splines to compute trajectories in real time for highway driving. Instead of a motion planner, Sharifzadeh et al. (2017) use a Deep Q-Network for apprenticeship learning in simulation.

5.1 Advantages

Apprenticeship learning gives us a fairly generalizable solution, with reasonable computational cost, that meets safety constraints. It benefits from the use of a motion planner, and instead of engineering the cost-to-go function, we learn it. This may be a lot less work than manually tweaking it. Once we have R, we have a generalizable solution because we can (potentially in real time) compute new policies in changing environments. The most highly capable autonomous vehicle systems use motion planning techniques; see Katrakazas et al. (2015) for an analysis of state-of-the-art motion planning methods for autonomous on-road driving.

5.2 Limitations

These capabilities are not without cost, however. We require maps of the environment for localization, and accurate state estimation relies on expensive GPS/IMU and lidar sensors. Another issue is that the Markov assumption in the MDP formulation of our problem may not hold. This is especially true in the presence of other (potentially adversarial) vehicles. This seems to limit us in achieving a fully autonomous vehicle capable of operating under real-world road conditions.

6. Reinforcement Learning
In contrast to apprenticeship learning, the techniques presented here use engineered reward functions and may explore to gather data for deriving a policy rather than depend on demonstration.

6.1 Actor-critic Algorithms

Lillicrap et al. (2016) achieve continuous control end-to-end with an actor-critic algorithm in simulation. It is worth noting that their DDPG algorithm, with no dynamics model, was sometimes able to achieve results equal to or better than iLQG, a motion planner with full knowledge of the underlying dynamics. However, DDPG learns off-policy, using an Ornstein–Uhlenbeck process to randomly explore. This poses a challenge for real-world implementation due to the need for learning from negative rewards. Safely evaluating reinforcement learning for real vehicles is a non-trivial challenge that differentiates it from other well-studied RL problems.
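For reference, a minimal sketch of the Ornstein–Uhlenbeck exploration noise used in DDPG-style agents is shown below; the θ, σ, and dt values are illustrative defaults rather than the exact settings of Lillicrap et al.

```python
import numpy as np

class OUNoise:
    """Temporally correlated exploration noise added to the actor's continuous actions
    (e.g. steering, acceleration, brake). Mean-reverting toward mu."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(action_dim, mu)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x

# Usage: perturb a deterministic policy's action during training rollouts.
noise = OUNoise(action_dim=3)
# noisy_action = np.clip(actor(state) + noise.sample(), -1.0, 1.0)   # actor() is hypothetical
```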

6.2 Policy Gradient Algorithms

Shalev-Shwartz et al. (2016b) perform multi-vehicle planning in simulation with a non-Markov formulation of the decision problem. They utilize a hybrid approach where a policy gradient algorithm assigns intended actions and relationships to other vehicles, and feasible trajectories are generated via a dynamic programming method.
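As a generic reminder of how policy gradient methods estimate their update direction, a minimal sketch of the standard REINFORCE estimator for a softmax policy over discrete high-level actions is given below; this is not the specific construction of Shalev-Shwartz et al., and the feature dimension and action set are hypothetical.

```python
import numpy as np

def softmax_policy(theta, state_features):
    """Probabilities over discrete high-level actions (e.g. give way / take way / keep offset)."""
    logits = theta @ state_features
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_gradient(theta, episode):
    """REINFORCE: grad J ~= sum_t grad log pi(a_t | s_t) * G_t, with G_t the return from step t."""
    grad = np.zeros_like(theta)
    returns = np.cumsum([r for _, _, r in episode][::-1])[::-1]    # reward-to-go per step
    for (state_features, action, _), G in zip(episode, returns):
        probs = softmax_policy(theta, state_features)
        grad_log = -np.outer(probs, state_features)                # -pi(b|s) * x for every action b
        grad_log[action] += state_features                         # plus x for the taken action
        grad += grad_log * G
    return grad

# Usage with hypothetical shapes: one row of weights per discrete action, 8 state features.
theta = np.zeros((3, 8))
episode = [(np.random.randn(8), np.random.randint(3), 1.0) for _ in range(5)]
theta += 0.01 * reinforce_gradient(theta, episode)
```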

6.3 POMDPs

Another way to overcome the limitations of the Markov assumption is to use a Partially Observable Markov Decision Process (POMDP) model. Ulbrich et al. (2013) perform a real-time belief space search to make lane change decisions for fully automated driving. It is important to note this work deals solely with the behavioral layer; trajectory planning and control are left to other components of their Stadtpilot system, which most likely utilize a motion planner and some sophisticated control system. See Paden et al. (2016) for a survey of motion planning and control techniques for self-driving urban vehicles. Brechtel et al. (2014) find kinematically feasible policies for merging lanes in a multi-vehicle simulation environment. However, it remains unclear whether the state representation and computational costs of the algorithm are amenable to implementation in a real-world system.
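To illustrate the belief-space view concretely, the sketch below applies the standard discrete POMDP belief update, b'(s') ∝ O(o | s') Σ_s T(s, a, s') b(s), over a small set of lane-change states. The state names mirror Table 1, but the transition and observation numbers are invented for illustration.

```python
import numpy as np

states = ["LCPossible", "LCInProgress", "LCBeneficial"]

# Hypothetical model for a single action (e.g. InitiateLC):
# T[s, s'] transition probabilities, O[s', o] observation likelihoods for two observations.
T = np.array([[0.6, 0.3, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
O = np.array([[0.7, 0.3],
              [0.2, 0.8],
              [0.5, 0.5]])

def belief_update(belief, obs_index):
    """Bayes filter step: predict through T, weight by the observation likelihood, renormalize."""
    predicted = T.T @ belief                  # sum_s T(s, s') b(s)
    unnormalized = O[:, obs_index] * predicted
    return unnormalized / unnormalized.sum()

belief = np.array([1 / 3, 1 / 3, 1 / 3])      # start from a uniform belief
belief = belief_update(belief, obs_index=1)   # after observing, say, a clear adjacent lane
print(dict(zip(states, np.round(belief, 3))))
```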

7. Challenges in Reinforcement Learning for Autonomous Vehicles
Autonomous driving decision problems within the reinforcement learning domain differ considerably from traditional reinforcement learning. Most fundamentally, autonomous vehicles inherit the “Curse of Dimensionality” from optimal control. Since our state and action spaces are continuous, we must consider the resolution at which they are represented and how fine-grained we want the control (Kober et al., 2013). It also can't be ignored that real-world experience on a vehicle is costly, tedious to obtain, and hard to reproduce. Moreover, it is unrealistic to assume that we can observe a state completely and noise-free. Finally, engineering reward functions is more difficult. For more on reinforcement learning in the robotics domain, see the excellent survey by Kober et al. (2013).

8. Conclusion
It seems beneficial to break down the autonomous driving decision problem and utilize different learning techniques for each sub-problem. This requires a far more complex system overall than the end-to-end approaches. However, it also yields a system more capable of “driving like a human.”

Providing expert data can make sub-problems more tractable by narrowing the search space and removing the need for global exploration. We essentially reduced the reinforcement learning problem to supervised learning in all our learning from demonstration approaches.

8.1 Deep Learning

Deep learning is ubiquitous due to its function approximation abilities. We use CNNs for classification and as a policy itself. More specialized is its use in reinforcement learning; see Arulkumaran et al. (2017) for a survey on deep reinforcement learning.

8.2 Structured Prediction

Not covered in this report are lines of work that treat autonomous navigation as a video prediction task. They generally utilize structured prediction methods for sequential video prediction (Xu et al., 2016). See Lotter et al. (2016) for unsupervised learning and video prediction. Xu et al. (2016) do end-to-end learning of driving models from large-scale video datasets.


References

Abbeel, P., Dolgov, D., Ng, A., Thrun, S. (2008). Apprenticeship learning for motion planning with application to parking lot navigation. In Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS).

Abbeel, P., Ng, A. (2004). Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML).

Argall, B. D., Chernova, S., Veloso, M., Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57:469–483.

Arulkumaran, K., Deisenroth, M. P., Brundage, M., Bharath, A. A. (2017). A Brief Survey of Deep Reinforcement Learning. In IEEE Signal Processing Magazine, Special Issue on Deep Learning for Image Understanding.

Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., Zieba, K. (2016). End to End Learning for Self-Driving Cars. In arXiv:1604.07316.

Brechtel, S., Gindele, T., Dillmann, R. (2014). Probabilistic decision-making under uncertainty for autonomous driving using continuous POMDPs. In Intelligent Transportation Systems (ITSC).

Dolgov, D., Thrun, S., Montemerlo, M., Diebel, J. (2008). Path planning for autonomous driving in unknown environments. In Proceedings of the Eleventh International Symposium on Experimental Robotics (ISER-08).

Frazzoli, E., Dahleh, M. A., Feron, E. (2002). Real-time motion planning for agile autonomous vehicles. Journal of Guidance, Control, and Dynamics, 25(1):116–129.

Huval, B., Wang, T., Tandon, S., Kiske, J., Song, W., Pazhayampallil, J., Andriluka, M., Rajpurkar, P., Migimatsu, T., Cheng-Yue, R., Mujica, F., Coates, A. (2015). An Empirical Evaluation of Deep Learning on Highway Driving. In arXiv:1504.01716.

Janai, J., Güney, F., Behl, A. (2017). Computer Vision for Autonomous Vehicles: Problems, Datasets and State-of-the-Art. In arXiv:1704.05519.

Katrakazas, C., Quddus, M., Chen, W. H., Lipika, D. (2015). Real-time motion planning methods for autonomous on-road driving: State-of-the-art and future research directions. In Transp. Res. C Emerg. Technol.

Kober, J., Peters, J. (2013). Reinforcement Learning in Robotics: A Survey. In International Journal of Robotics Research (IJRR).

Kuderer, M., Gulati, S., Burgard, W. (2015). Learning driving styles for autonomous vehicles from demonstration. In Robotics and Automation (ICRA).

Langford, J., Zadrozny, B. (2005). Relating reinforcement learning performance to classification performance. In 22nd International Conference on Machine Learning (ICML).

LeCun, Y., Muller, U., Ben, J., Cosatto, E., Flepp, B. (2006). Off-road obstacle avoidance through end-to-end learning. In Advances in Neural Information Processing Systems 18.

Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D. (2016). Continuous control with deep reinforcement learning. In arXiv:1509.02971.

Lotter, W., Kreiman, G., Cox, D. (2016). Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. In arXiv:1605.08104.

Mahadevan, S. (1996). Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results. Machine Learning, 22:159–195.

Martinez-Gomez, J., Fernandez-Caballero, A., Garcia-Varea, I., Rodriguez, L., Romero-Gonzalez, C. (2014). A Taxonomy of Vision Systems for Ground Mobile Robots. In International Journal of Advanced Robotic Systems (IJARS).

Paden, B., Cap, M., Yong, S. Z., Yershov, D., Frazzoli, E. (2016). A Survey of Motion Planning and Control Techniques for Self-driving Urban Vehicles. In arXiv:1604.07446.

Pomerleau, D. (1989). ALVINN: An autonomous land vehicle in a neural network. In NIPS 1.

Powell, W. B. (2012). AI, OR and control theory: A rosetta stone for stochastic optimization. Technical report, Princeton University.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience.

Ross, S., Bagnell, J. A. (2010). Efficient reductions for imitation learning. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS).

Ross, S., Gordon, G., Bagnell, J. A. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Shalev-Shwartz, S., Ben-Zrihem, N., Cohen, A., Shashua, A. (2016a). Long-term Planning by Short-term Prediction. In arXiv:1602.01580.

Shalev-Shwartz, S., Shammah, S., Shashua, A. (2016b). Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving. In arXiv:1610.03295.

Sharifzadeh, S., Chiotellis, I., Triebel, R., Cremers, D. (2017). Learning to drive using inverse reinforcement learning and deep q-networks. In arXiv:1612.03653.

Ulbrich, S., Maurer, M. (2013). Probabilistic online POMDP decision making for lane changes in fully automated driving. In Intelligent Transportation Systems (ITSC).

Xu, H., Gao, Y., Yu, F., Darrell, T. (2016). End-to-end Learning of Driving Models from Large-scale Video Datasets. In arXiv:1612.01079.
