
CONVOLUTIONAL

NEURAL NETWORKS

APPLIED TO TRAFFIC SIGN DETECTION

IN GRAND THEFT AUTO V




ALEXANDER RAFAEL GARZÓN

ADVISOR: ALAIN L. KORNHAUSER






SUBMITTED IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF

BACHELOR OF SCIENCE IN ENGINEERING

DEPARTMENT OF OPERATIONS RESEARCH AND FINANCIAL ENGINEERING

PRINCETON UNIVERSITY







JUNE 2016






I hereby declare that I am the sole author of this thesis.

I authorize Princeton University to lend this thesis to other institutions or individuals
for the purpose of scholarly research.





______________________________________

Alex Garzón










I further authorize Princeton University to reproduce this thesis by photocopying or by
other means, in total or in part, at the request of other institutions or individuals for the
purpose of scholarly research.




______________________________________

Alex Garzón




Abstract

In recent years, more and more companies have continued to join the quest of
developing fully autonomously driven vehicles. With a relatively recent research
report suggesting that by 2030, the technologies of autonomous driving will have
developed into a global industry worth $87 billion, it is no wonder so
many companies are investing so heavily now in creating such technologies. Many
issues and obstacles need to be addressed and resolved though, before fully
autonomously driven vehicles can be sold to consumers and used on public streets.
Perhaps the most fundamental issues are those of giving the autonomously driven
vehicles the ability to actually drive accurately, safely, and in accordance with all
street laws. One specific obstacle is for the vehicle to recognize road signs just as a
human would normally, such as detecting stop signs, traffic lights, speed limit signs,
and warning signs. The focus of this study is on improving upon current detection
methods by boosting accuracy (fewer false detections, fewer missed detections) and
reducing image analysis time. This is done by attempting a more difficult single-step
approach to traffic sign detection, as opposed to the traditional relatively easier
two-step approach described in Chapter 2. This study first attempts to develop a
reliable traffic sign detector by constructing, training, and tuning various
Convolutional Neural Networks. Images for training are obtained both from real
world public datasets and images from the game of Grand Theft Auto V. It then
attempts to explore the advantages of using a virtual environment (in this case a
video game) to train detectors for autonomous driving. It concludes there are
distinctive, measurable advantages to training such detectors in a virtual
environment. Investments in constructing virtual environments for training and
testing autonomously driven vehicles should be seriously considered.


Acknowledgements
These past four years at Princeton have been a great and sometimes wild
journey. It was a transformative time for me with many good memories and friends
made. I have grown a lot more certain about my general path from here onwards,
yet the exact path has been made more obscure given all the fantastic opportunities
after college that I have learned about and discovered while here at Princeton. This
senior thesis was a fantastic project to end my time here, and the machine learning
and statistical techniques herein directly relate to the machine learning team I will
be working on at Google post-graduation.

This thesis would not have been possible without the invaluable guidance
and vision of Professor Kornhauser. The research ideas he pitched to me were a
fantastic senior thesis project and also greatly appealing to my interests. The classes
of Computer Vision (COS 429) and Analysis of Big Data (ORF 350) and their
teachings by Professors Jianxiong Xiao and Han Liu, respectively, were also
extraordinarily useful in developing my understanding of Convolutional Neural
Networks and other Computer Vision techniques, features, and statistical methods.
I’d also like to thank Artur Filipowicz and Chenyi Chen. Artur was invaluable in
introducing me to Script Hook V, so that I could begin hacking GTA V. Chenyi was
invaluable in helping me use his NVIDIA Tesla K40, 12 GB RAM GPU in the PAVE lab.
When CNN training code run time can be cut from 10 hours to 30 minutes, it is truly
a much-appreciated blessing.

Lastly, I would like to thank my friends and family. Shirley, you’ve been an
awesome girlfriend, thank you so much for putting up with and supporting me. Ruina,
you’re the best early morning breakfast buddy ever, thank you for all the smiles and
support. To the ORFE squad, Raina, Chris, and Matt, we survived ORFE together,
cheers to all the late night memories p-setting, I could not have done this alone. To
the PCT family, you all have been the greatest. Also Katrina, you have been a sister
who has always believed in me and pushed me to work harder, whether
you knew it or not, thank you so much, I will keep at it. Lastly, Papá, you should be
proud, you have done so much for your son, I am so thankful for all of it. I would not
be here, at Princeton, writing this thesis, had you not always been there for me
growing up and teaching me.

The end is now here. And equally so is the end the beginning. Thank you all.

Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
1 Introduction 1
1.1 Problem & Objective Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Why Grand Theft Auto V (GTA V) ? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Why Convolutional Neural Networks (CNNs) ? . . . . . . . . . . . . . . . . . . . . . 5

2 Background & Literature Review 6
2.1 Previous Traffic Sign Detection Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Previous Virtual Environment Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Previous GTA V Hacking Development . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Data Creation and Development 12
3.1 CNN Training and Testing Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 LISA-TS Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Training, Validation, and Test Set Construction . . . . . . . . . . . . . . . . . . . . 15

4 CNN Methodology & Results 19
4.1 CNN Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 CNN Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Programming with Theano. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 Stop Sign MLP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5 Contrasting MLPs with MNIST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.6 Stop Sign CNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.7 Contrasting CNNs with MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.8 Contrasting CNNs with GTSDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.9 Performance on GTA V Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5 GTA V Methodology & Results 35
5.1 GTA V – System Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 GTA V – Car Handling Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.3 GTA V – Position and Angle Structuring . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.4 Live Action Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6 Conclusion 47
6.1 CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.2 GTA V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.3 Future Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.3.1 Additional CNN Component Creation. . . . . . . . . . . . . . . . . . . . . 51
6.3.2 Additional Tracking Layer Creation. . . . . . . . . . . . . . . . . . . . . . 51
6.3.3 Additional Virtual Environment and Dataset Exploration . . . . . . . . 52

Bibliography 56

Appendix 59
A.1 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
A.1.1 Image and True-values Grouping and Processing . . . . . . . . . . . . . 59
A.1.2 Reading, Formatting, and Pickling Data . . . . . . . . . . . . . . . . . . . 60
A.2 CNN Implementation (Theano) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.2.1 CNN Architecture Definition and Training . . . . . . . . . . . . . . . . . 61
A.2.2 CNN Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.2.3 MLP Architecture Definition and Training. . . . . . . . . . . . . . . . . . 68
A.3 GTA V Hacking Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.3.1 Live Video Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70






List of Tables
3.1 Training, Validation, Testing Set Counts . . . . . . . . . . . . . . . . . . . . . . . 16
4.1 Stop Sign MLP Validation and Testing Errors . . . . . . . . . . . . . . . . . . . 27
4.2 MNIST MLP Validation and Testing Errors . . . . . . . . . . . . . . . . . . . . . 28
4.3 Stop Sign CNN Validation and Testing Errors . . . . . . . . . . . . . . . . . . . 29
4.4 MNIST CNN Validation and Testing Errors . . . . . . . . . . . . . . . . . . . . . 30
4.5 Traffic Sign CNN (GTSDB) Validation and Testing Errors . . . . . . . . . . . . 31
4.6 GTA V Image Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.1 Sample Stopping Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40



















List of Figures
1.1 Sample Image of Stop Sign Detector in GTA V . . . . . . . . . . . . . . . . . . . . 2
1.2 Sample Images of Traffic Signs in GTA V . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Image Depicting TORCS System Setup [19] . . . . . . . . . . . . . . . . . . . . . 10
3.1 Sample Image Abbreviated Annotations . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Sample Images from LISA-TS Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1 Example of a Sobel Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Example of Max-Pooling technique used . . . . . . . . . . . . . . . . . . . . . . 22
4.3 Three common activation functions . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.4 Sample FC Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Sample CNN Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.6 Sample MNIST Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.7 Sample GTSDB Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.8 Sample GTA V Image Classifications . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 System Setup Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Camera Placement Images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 GTA V Video Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4 GTA V Traffic Sign CNN Script Images . . . . . . . . . . . . . . . . . . . . . . . . 45
5.5 GTA V Lane Detection CNN and Bird’s Eye View . . . . . . . . . . . . . . . . . 46
6.1 Speed Limit Sign Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Speed Limit Sign Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54


Chapter 1

Introduction

1.1 Problem & Objective Specification



The issue of accurate sign detection is one of the most fundamental tasks that must
be resolved before fully autonomously driven vehicles can become a viable option
for general public use. If the vehicle cannot be driven autonomously in a safe
manner, then no matter how good the solutions are to all the other high-level issues
such as insurance, marketing, and safety mechanisms, the vehicles will still not be
able to hit the roads. Personally, the software engineering, computer vision, and
machine learning aspects of this problem are very interesting; it has also been a
great joy to build and tweak street sign detection algorithms and detectors, and to
rigorously train, test, and compare their performance against large sets of test
images obtained from widely used open source image libraries as well as from the
GTA V (Grand Theft Auto V) game itself.

One will never be able to achieve 100% detection success across millions of images,
but the goal is to raise that detection rate as much as possible. The approach in this
thesis is distinctly different from the typical two-step approach, which is discussed
in the literature review section. Additional goals include identifying obstacles to
relying solely on CNNs for street sign detection, such as conditions that prove
challenging for image detection algorithms: low-light environments, image blur, and
distorted, damaged, or partially occluded street signs. Lastly, analysis and discussion
are made of the measurable advantages of training and testing a CNN street sign
detector, and more broadly a fully autonomous driving system, inside a virtual
environment.

1.2 Why Grand Theft Auto V?


This study is not just about training a CNN-based street sign detector. It is about
trying to determine and measure the advantages of training such a detector, and
more generally full autonomous driving systems, inside a virtual environment. Given
that CNNs are being used, it is of utmost importance that the training images
accurately reflect and closely resemble the images on which the CNN detector is
expected to perform well in practice. Thus it is essential that such a virtual
environment be as visually representative of the real world as possible.


Figure 1.1: Sample Image of Stop Sign Detector in GTA V

Grand Theft Auto V is indeed very representative of the real world. It was the
second most expensive video game ever developed, at a cost of $137 million USD
[18]. The game very closely emulates the real world: lighting changes with the time
of day and weather, pedestrians cross roads, other vehicles drive on the roads, and
there are even occasional anomalies like animal crossings. Importantly, all traffic
signs in the game are based on American traffic signs. The CNN detector trained
here is focused on American traffic sign detection, but realistically, retraining it on
another country's traffic sign set is as trivial as changing the image/object files the
game uses to render those traffic signs. This study will demonstrate that a CNN
traffic sign detector trained on game images is at least as good a detector as one
trained on real-world images.

An additional benefit of the game is that, to a large degree, it can be modified by a
researcher to fit the needs of developing and testing a CNN traffic sign detector.
Collecting in-game images for training is the most tedious task, although it does not
take forever: a researcher can generate about 150 unique and representative images
of traffic signs per hour, semi-manually. The researcher can then combine these
images with existing real-world image datasets that contain thousands more images.
What is really valuable, though, is the wide range of publicly available game hacking
tools kept up to date by game enthusiasts, who maintain an API of functions and
variables that can be used from programming languages such as C++ and C# to
interact with and modify game behavior. For instance, a researcher can change the
time of day, the lighting, and the weather in the game. One can also move the camera
to any position one wants: one can collect images from a low-seated sports car view
and from a high-seated truck view, train the detector on each, and compare whether
a variable such as the height of the camera mounting makes a distinct difference in
performance. One can also zoom to different parts of the screen and crop out
irrelevant parts such as the sky, where a traffic sign is unlikely to be located.

There are also unique advantages to testing the system in game. For instance,
assuming a reasonably working autonomous system, the car can hypothetically just
run forever in the game. If the detector does not work well at night, the lighting can
be fixed permanently to sunny, bright conditions. If the autonomous system cannot
detect pedestrians yet, or gets confused by them, they can be deleted automatically
from the game to prevent issues. If the car crashes or drives off the road, it can be
re-spawned back to a legal starting position on the road. These are just a few
examples of the many in-game hacks that can make testing much easier than in the
real world. Simultaneously, one could also automate image collection during testing
so that the detector can be retrained afterwards and made even better.

Specific details of how the car handling script works and what hacks were taken
advantage of will be described as they arise throughout the study.

Examples of Signs in Game


Figure 1.2: Sample Images of Traffic Signs in GTA V

The signs above in order left to right then top to bottom are: (1) Avoid Median, (2)
Yield, (3) No U-Turns, (4) Pedestrian Crossing, (5) Do Not Block Intersection, (6)
One Way Road, (7) Right Turn Only, (8) Stop Sign, and (9) Do Not Enter.

There are also many more traffic signs in the game, such as traffic signals, animal
crossing signs, no parking signs, and many more. As one can see, they all very
accurately resemble their real-world counterparts in America. It is essential that the
traffic signs, or whatever objects a CNN is being trained to detect in a virtual
environment, be as similar as possible (ideally identical) to those objects' appearance
in the real world; otherwise training in the virtual environment will not be very
effective.

1.3 Why Convolutional Neural Networks (CNNs)?


The fundamental reason is that this research is intended as a small step down the
path toward developing a fully autonomous driving system. The spirit of the
research is that a computer should drive the car just as a human would: using vision
of the view around it, predominantly focused on the view in front of the car as it
travels down the road, to make important driving decisions.

Contrary to that spirit, one could imagine all sorts of additional tools that could be
developed to help the autonomous system drive the vehicle safely. For instance, one
could have the car's computer preloaded with maps of the area, and using GPS
technology the car could gauge what road it is on and roughly where on that road it
is. The problem is that GPS technology is not accurate enough for precision driving;
at best it averages out to a couple of meters of precision. Clearly such an inaccurate
tool could at best play a minor supporting role in an autonomous driving system.
Another example would be installing sensors on roads to mark traffic signs, lanes,
sharp turns, and so on. However, such an idea entails massive costs and quickly
becomes intractable.

Instead, the goal is to use computer vision to analyze the environment the car is
passing through and make driving decisions based on conclusions from that
analysis. CNNs (Convolutional Neural Networks) thus seem to be a natural choice of
technology. CNNs have proven to be an incredibly powerful tool in computer vision
for image classification and object detection, with a proven record of working very
well on exactly these kinds of problems. The exact advantages, construction, and
theory behind CNNs are explained in the methodology section.

Chapter 2

Background & Literature Review

2.1 Previous Traffic Sign Detection Work


If this research had been done 10 years ago, there would have been very little prior
work on traffic sign detection to reference. Since then, however, and especially in
the last few years, several interesting papers have been published; they are
discussed below.

Many such papers describe traffic sign detectors designed for the German Traffic
Sign Detection Benchmark (GTSDB) competitions [24]. First, though, it is important
to note that the detection task in this competition is considerably easier and more
established than the detection attempted in this study. The competition focuses on a
two-step detector. The first step is a localization process, where the detector is
trained to detect signs anywhere in the image and return the bounding boxes of
whatever signs it can find. The second and final step is a CNN classifier in charge of
classifying the image within each bounding box. Essentially, a bounding box yields a
blown-up version of a sign that appears as a large, centered object in the newly
cropped picture; the CNN then classifies that sign as a certain type.

In contrast, the CNN in this thesis is focused on putting these two steps together (a
much more challenging problem): given an image of the street view ahead, the
classifier attempts to classify the whole picture as either containing a specific sign,
in this case a stop sign, or not containing that sign. The next goal, and the reason this
thesis is titled with "detection" and not "classification," is to predict how far that
traffic sign is from the vehicle. All an autonomously driven vehicle system should
care about is whether a stop sign is being approached and how far away it is; the
exact bounding box does not matter, since the vehicle only needs to know when to
begin slowing down and eventually stop. Detecting whether there is a stop sign and
how far away it is can both be part of the prediction output of one CNN. Currently,
though, this study has focused on the first part of the prediction, namely whether or
not there is a stop sign. Adding the other part of the prediction is straightforward
once the first part works well: once a dataset with labels of true distances to stop
signs in images is available, the CNN can be retrained to also predict those distances.
The only constraint currently preventing training in that manner is the lack of such a
labeled dataset, which makes sense, since one cannot easily measure the distance to
a sign directly from an image after it is taken and one is no longer at the location. If
the images are taken in a virtual environment, however, that distance can be
obtained by grabbing game variables, computing the distance from their values, and
recording it at the time the image is captured. This is one advantage of being able to
use a virtual environment to train a CNN (assuming that training a CNN in a virtual
environment translates well to real-world usage, and vice versa, something explored
through experimentation later in this thesis). Despite current traffic sign detection
schemes using such a different technique from the one proposed in this study, such
papers are still quite relevant, as the detection tasks are related and it is important
to be familiar with other methods being developed.
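To make the idea of labeling distances at capture time concrete, here is a minimal
Python sketch. The variable names and positions are hypothetical stand-ins for
values that would be read from the game's API at the moment a screenshot is saved;
this is an illustration of the idea, not code from this thesis.

    # Sketch only: car_pos and sign_pos are assumed to be 3D world coordinates
    # (x, y, z) grabbed from the game when the screenshot is taken.
    import math

    def distance_to_sign(car_pos, sign_pos):
        """Euclidean distance between the vehicle and a traffic sign, in game units."""
        return math.sqrt(sum((s - c) ** 2 for c, s in zip(car_pos, sign_pos)))

    # Example: record this value as the true-distance label for the captured frame.
    label = distance_to_sign((120.4, -532.1, 30.2), (125.0, -540.8, 30.5))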

The most relevant paper to this study's work that could be found is entitled
Multi-Column Deep Neural Network for Traffic Sign Classification [30]. The
researchers were very successful in the GTSDB competition; they focused on
training a set of CNNs to detect traffic signs and then averaging their predictions to
produce a final detection prediction. Another important technique they used was
processing the images into multiple forms for training. In addition to the original
photo, they modified each image into four additional forms, which they called
Imadjust, Histeq, Adapthisteq, and Conorm. Briefly, Imadjust is the picture with
contrast increased up to the point where 1% of the data is saturated at the highest
and lowest intensities. Histeq increases contrast such that the histogram of pixel
intensities in the output image is close to uniform. Adapthisteq is similar to Histeq
except that the image is tiled and each tile is processed so that its intensities become
close to uniform. Lastly, Conorm is an edge enhancer; they used a difference of
Gaussians to enhance the edges, although one could also attempt a modified form of
the Sobel operator introduced later in Chapter 4. These two special modifications
gave the team the extra edge needed to win, but roughly 95% of that performance
could have come from just their single first CNN with no image processing. This
suggests that CNNs are very strong tools for this kind of problem. The team goes
into detail about the different layers in their CNN, and the intuition they give for
their decisions is quite useful for thinking about how to architect the CNN in this
study later on.
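The averaging step used by the multi-column approach in [30] is simple to
illustrate. The sketch below is not taken from [30]; it just averages the per-class
probability outputs of several independently trained CNNs before taking the
arg-max, which is the basic ensemble idea described above.

    import numpy as np

    # probs_per_model: one (num_images, num_classes) array of class
    # probabilities per CNN "column".
    def ensemble_predict(probs_per_model):
        """Average class probabilities across models; return the predicted class per image."""
        avg = np.mean(np.stack(probs_per_model, axis=0), axis=0)
        return np.argmax(avg, axis=1)

    # Toy usage: three columns voting on four images over two classes (sign / no sign),
    # with random stand-ins for the columns' probability outputs.
    preds = ensemble_predict([np.random.rand(4, 2) for _ in range(3)])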

Another interesting research team has a paper entitled Traffic Sign Recognition –
How far are we from the solution? [31]. They, too, divide the problem into two
pieces, first traffic sign localization and then classification of the localized signs;
however, they attempt to do it with methods that do not include a CNN. Although
they are quite successful, they still fall short of the team above, which used multiple
CNNs, further suggesting that a CNN is indeed quite a good solution for this problem.
The team first accomplishes traffic sign localization using integral channel features
(ChnFtrs), as created by other researchers in [2] for pedestrian detection. The
ChnFtrs detector uses ideas derived from HOGs (Histograms of Oriented Gradients);
specifically, the team in [31] uses 10 channels derived from ChnFtrs to localize signs,
and does so quite successfully. For the final classification step they use a technique
known as INNLP (Iterative Nearest Neighbors-based Linear Projections), which also
works quite well. However, for a simple classification problem where the object is
dead center in the image, even the CNN in this study can perform incredibly well
after being trained for only a few minutes, as will be shown later with the MNIST
and GTSDB datasets. Thus, while their work is very interesting, it was decided not to
use the two-step method, which already has much research behind it, and instead to
focus on the more difficult single-step classification method, so their work is not
directly relevant to the CNN research in this study. Nonetheless, it was important
and eye-opening to see the other techniques that exist for the two-step traffic sign
detection approach. It will be interesting to see whether the single-step CNN can
ever be trained to a state where it can outdo the winners of the GTSDB, who use the
relatively easier two-step method. As the results will show, although the single-step
detector turns out to be good, it is by no means perfect, yet it is nonetheless
impressive given the limited amount of data available for a CNN single-step
solution.

2.2 Previous Virtual Environment Work


The number of studies using virtual environments to train and test autonomous
driving systems is dramatically smaller than the number of studies that train and
test detectors for autonomous driving systems using real-world images and frames
from real-world recorded video, such as those discussed in Section 2.1. One study of
great interest, though, is that of ORFE PhD student Chenyi Chen at Princeton
University. In his study [19], Chenyi constructs CNNs that are trained and tested
within a virtual environment, TORCS (The Open Racing Car Simulator), an open
source simulator popular in the AI community [20]. His CNNs focus on lane
detection and on changing lanes to avoid collisions with other cars. His CNN is able
to detect how many lanes are on the road (for instance, if it widens from 2 lanes to 3
lanes), and it calculates values such as the car's displacement from the lanes and the
angle between the car's forward direction and the line parallel to the lane (the angle
the car should turn to continue within its lane and not drift out of it). He is able to
grab true values for measurements such as the lane displacement and angles for
each image by calculating them from in-game global variables and parameters; thus
he can develop a large training set over which to train his CNN and build a very
powerful CNN lane tracker/detector.


In fact, Chenyi’s entire system setup of having the game running, a car handling
script running, and a CNN running, and all three components communicating with
each other, does indeed very much mirror my system setup. Below is how Chenyi
very clearly illustrates his system setup [19]:


Figure 2.1: Image Depicting TORCS System Setup [19]

The biggest reason GTA V was chosen over TORCS is quite straightforward. The
research in this thesis centers on improving traffic sign detection, but TORCS is a
racetrack environment: there simply are no street signs. It is a professional racing
track with very clear lane markings that works perfectly for Chenyi's research but is
not useful for traffic sign detection research. GTA V, on the other hand, has traffic
signs that very accurately resemble the real world in terms of placement, size, and
design (they are all based on standard American traffic signs). Not only that, but to
assist in training and testing, as with most objects in the game, every sign is
uniquely identified by an ID number together with its location. Another Princeton
undergraduate, Artur Filipowicz, is currently working on grabbing traffic signs in
close proximity to the vehicle in the game and calculating their exact coordinates in
game images. Knowing these true values would greatly speed up CNN training for
those pursuing the two-step method with bounding boxes, since the time required
for data collection and labeling drops dramatically; the labeling process would be
instant and automated.


Currently that task is a work in progress. It is also exactly what is needed here to
train the CNN to predict distances to signs, since Artur's code would be able to
output the true distances to signs in images collected inside the game for training
the CNN.

What is used in this research is the foundational system setup, modeled on Chenyi's
setup for TORCS, that Artur created for GTA V in his independent research work
[21]. The setup here is based on his, with modifications for the particular CNN and
system used, and with optimizations to improve the speed at which game images
are communicated to the CNN, so that the CNN can quickly calculate its outputs and
send them over to the car handling script (the driving controller).

2.3 Previous GTA V Hacking Development


GTA V is one of the most popular video games in the world and has sold over 60
million copies since its launch in 2013 [22]. It is thus natural that it has a large,
enthusiastic community of gamers actively developing hacks for the game. One of
the most widely used libraries for interacting with GTA V when writing hacks and
game modification scripts is Script Hook V, created by Alexander Blade [12]. This
library is the foundation of the car handling script and all other code in this thesis
that modifies in-game behavior. The library provides access to a wide variety of
in-game variables whose values can be modified, and also a wide range of functions
that can be called and combined to manipulate the game as the programmer's needs
require. Its usage will be discussed in greater detail in the car handling script
section. In general, the wide availability of many developers' game modification
code online was very useful in debugging the game scripts written for this thesis.


Chapter 3

Data Creation and Development

3.1 CNN Training and Testing Data


CNNs require a large number of training and validation images to be tuned to a good
local optimum in terms of measured performance against the validation set and test
set. The true global optimum will likely never be found, and it is computationally
impractical to try to guarantee that it is discovered every time. It is often also not in
our interest to find the global optimum, as such a solution is likely to be an overfit
model that performs very well on the validation set, and perhaps on the test set, but
cannot perform as well on a broader class of images.

The important idea to take away is that, disregarding computational constraints,
more data (more images) is always better, assuming those images are well curated
to accurately and proportionally resemble the diverse set of scenarios under which
the detector is expected to perform. For example, suppose one is detecting stop
signs and all the training images show a stop sign 1 meter in front of the car, where
the sign takes up a large portion of the image, is brightly lit on a sunny day, and is
viewed from an angle almost exactly perpendicular to its face. Training an accurate
detector for such a sign is a much easier task, but the resulting detector will also
only work well under that exact, easy scenario. If the stop sign is seen at an angle, if
it is partially occluded, if it is under the shadow of a tree, if it is 10 meters away, et
cetera, the detector will struggle with any of these very realistic real-world
scenarios or combinations of them. Such a poor detector obviously cannot be used
by itself as the centerpiece of an autonomous driving system expected to keep
riders safe.

Training a CNN involves creating a network architecture by tuning hyper-parameters
(number of layers, types of layers, order of layers, dimensions of kernels, number of
kernels, pooling sizes, pooling padding size, dimensions of the input and output of
each layer, et cetera) and then letting the optimizer run across thousands of images
in the training set, thousands of times, to estimate optimal values for the thousands
of parameters in the layers via backpropagation and stochastic gradient descent.
The number of parameters being estimated is enormous, and the outcome depends
in large part on a well-curated selection of images for the training set. It is also
important to note that the word "thousands" can easily be replaced by "tens of
thousands," "hundreds of thousands," or "millions" and beyond. It all depends on
many factors, such as the difficulty of the detection problem, the amount of training
data available, the ease of constructing new data, and the computational
"firepower" (hardware) available to the researcher.
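For concreteness, the core of the optimization mentioned above is the stochastic
gradient descent update applied per mini-batch to every weight in the network:
each weight moves a small step against the gradient of the loss, w <- w - learning_rate * dL/dw.
A minimal numpy sketch of one such update step (an illustration only, not the
Theano training code used later):

    import numpy as np

    def sgd_update(weights, gradients, learning_rate=0.01):
        """One stochastic gradient descent step: w <- w - learning_rate * dL/dw."""
        return [w - learning_rate * g for w, g in zip(weights, gradients)]

    # Toy usage: two parameter arrays and their loss gradients for one mini-batch.
    new_weights = sgd_update([np.ones((3, 3)), np.zeros(3)],
                             [np.full((3, 3), 0.1), np.full(3, -0.2)])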

3.2 LISA-TS Dataset


The LISA-TS (Laboratory for Intelligent & Safe Automobiles – Traffic Sign) dataset,
according to its creators' website, contains 47 US sign types, with 7855 annotations
on 6610 frames and sign sizes ranging from 6x6 to 167x168 pixels. Images were
collected using several different cameras and vary in size from 640x480 to
1024x522 pixels. Each annotation of a sign includes many useful facts, such as: the
sign type; the sign position and size (a bounding box given by 4 coordinates around
the sign); an indicator of whether the sign is occluded; an indicator of whether the
sign belongs to that vehicle's road or to a road next to it; and a few more statistics
about which camera took the image, which track the image belongs to, and what
frame number the image is within that track. Note that a track is a burst of images
taken over a few seconds as the car approaches the sign. Here are some examples of
abbreviated annotations:
Filename;Annotation tag;Upper left corner X;Upper left corner Y;Lower right corner X;Lower right corner Y;Occluded
/stop_1330545910.avi_image0.png;stop;862;104;916;158;0
/stop_1330545910.avi_image0.png;speedLimitUrdbl;425;197;438;213;0
/stop_1330545910.avi_image1.png;speedLimit25;447;193;461;210;0
/speedLimit_1330545914.avi_image0.png;speedLimit25;529;183;546;203;0
/pedestrianCrossing_1330545944.avi_image3.png;pedestrianCrossing;355;179;384;207;0
/leftTurn_1330546134.avi_image2.png;turnLeft;594;179;617;203;0
/rightLaneMustTurn_1330546501.avi_image12.png;rightLaneMustTurn;928;52;1001;128;0
/signalAhead_1330546728.avi_image4.png;signalAhead;342;188;368;213;0
/laneEnds_1330547145.avi_image10.png;laneEnds;592;101;617;127;0

Figure 3.1: Sample Image Abbreviated Annotations
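Since the annotation files are simply semicolon-separated text, grouping frames by
whether they contain a stop sign can be done in a few lines. The Python sketch
below illustrates the idea (it is not the AWK/Bash scripts actually used later in
Section 3.3, and the annotation file name is a hypothetical placeholder); the column
names come from the header line shown in Figure 3.1.

    import csv

    def find_stop_sign_frames(annotation_file="annotations.csv"):
        """Return the set of image filenames whose annotations include a stop sign."""
        stop_frames = set()
        with open(annotation_file) as f:
            for row in csv.DictReader(f, delimiter=";"):
                if row["Annotation tag"] == "stop":
                    stop_frames.add(row["Filename"])
        return stop_frames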



To begin testing the CNN (whose construction, debugging, and tuning are described
in the next section), it was decided to start with detecting stop signs. The goal is to
get the CNN to detect whether the car is approaching a stop sign and, if it is, to start
triggering warning bells and eventually determine how close the vehicle is to the
stop sign, so that a gradual, normal, human-like 3-second stop can be made by the
car handling script inside GTA V.

The choice of a stop sign was made for multiple reasons. One of the big reasons is
that a stop sign is simply a very, very common sign. This is reflected in the LISA-TS
dataset: despite it containing 47 types of US signs, the image frames are not evenly
distributed among the signs; rather, 22% of the frames contain stop signs (over
1000 frames). It is also reflected in GTA V, where in some locations in the game
there are literally stop signs at every single street intersection. This is great for
collecting additional training images, and also for testing whether the CNN detector
trained on real-world images works just as well on in-game images, which is
expected, since to a human the images look incredibly similar and GTA V is very
realistic, its graphics developers having paid close attention to detail.

Additionally, stop signs, compared to other traffic signs, are among the easier class
of traffic signs to detect. Stop signs are uniquely octagonal, they are distinctly bright
red, and they carry the capitalized white word "STOP." It is this unique shape and
color that make them so distinctive and so well suited to detection by CNNs. This is
in stark contrast to American speed limit signs, which are among the most difficult
to detect. They are standard square white signs, the text formatting is not
standardized, extra text is often appended to a speed limit sign, and the shape of the
sign can be elongated. Not only does one want to recognize that there is a speed
limit sign, but one also wants to be able to determine the value of the speed limit
(25 mph, 35 mph, etc.).

Lastly, for modifying car behavior based on a stop sign, the required behavior is
quite standard and well defined. The vehicle needs to come to a gradual stop at the
correct place and remain still for a couple of seconds before reaccelerating to the
appropriate speed and continuing on its way through the intersection (assuming
nothing is blocking the intersection; determining that is another, entirely different
detection problem that needs to be solved). Developing the car handling script to do
this was another equally challenging problem, as interacting with a game that was
not purposely designed to be interacted with can at times prove challenging and
require quite a few interesting hacks or repeated attempts at getting the sought-after
behavior. This is also discussed in the next section.

3.3 Training, Validation, and Test Set Construction


It is important to remember that these sets of images need to accurately represent
the environment in which the detector is going to be used. The goal is to have the
detector drive in GTA V, and for reasons of system setup, the screenshots taken from
GTA V are 140 x 105 (width x height) pixels. This is not much of a problem, since
based on other research papers this is well within the range of dimensions of
training images used to develop similar successful detectors. Note that each image
is actually 140x105x3 (44,100 values), since the images are in color. The significance
of bringing this up is that the architecture of the CNN has to match the dimensions
of the images it is used on, so it needs to be trained, validated, and tested on real-world
images that are also 140 x 105.

Looking through the LISA-TS set, the images largely fall into one of two groups:
704 x 480 or 640 x 480. Without distorting the images by stretching, those two sizes
can be scaled down to 140 x 95 or 140 x 105, respectively. The latter is of course
greatly preferred; however, because the vast majority of stop signs fall into the first
category, there was no choice but to go with the 704 x 480 group. The GTA V script
simply disregards the bottom 10 rows of pixels, which seems reasonable, since a
sign is unlikely to be in the very bottom portion of the image anyway. Alternatively,
one could have distorted the image aspect ratios slightly, but it was decided not to
do that, as the shape of the sign would change slightly and that could affect
detection.
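A minimal sketch of the resizing and cropping described above, using the Pillow
library (this is an illustration, not the exact scripts used in this study):

    from PIL import Image

    def to_training_size(path):
        """Scale a 704x480 LISA-TS frame down to the 140x95 training size."""
        return Image.open(path).resize((140, 95))

    def crop_gta_frame(frame):
        """Drop the bottom 10 rows of a 140x105 GTA V screenshot so it matches 140x95."""
        return frame.crop((0, 0, 140, 95))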

With that issue addressed, AWK and Bash scripts were developed to automatically
go through the 7856 images and sort them into two categories: images containing
stop signs and images NOT containing stop signs (but quite likely containing other
signs). In the process, the scripts drop all the 640x480 images and keep only the
704x480 images, which they downsample to a size of 140x95. The images are then
grouped into training, validation, and test sets (where positive is defined as
containing a stop sign, and negative as no stop sign present):
                  # of Positive Images    # of Negative Images
Training Set              500                     500
Validation Set            200                     200
Test Set                  200                     200
Table 3.1: Training, Validation, Testing Set Counts
Note: The images are all grouped into tracks, where one track is a burst of images
taken in the few seconds approaching a traffic sign. Because the training, validation,
and test sets should all resemble the same diverse environment without being
nearly identical to one another, it was necessary to manually ensure that no track
was present in more than one set. Otherwise there is a big risk of overfitting.
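A short sketch of the track-aware splitting logic described in the note above. It is a
hypothetical helper, assuming each filename encodes its track as in Figure 3.1 (the
part before ".avi"), and the 60/20/20 proportions are illustrative rather than the
exact counts in Table 3.1.

    import random
    from collections import defaultdict

    def split_by_track(filenames, seed=0):
        """Assign whole tracks (never individual frames) to train/validation/test sets."""
        tracks = defaultdict(list)
        for name in filenames:
            tracks[name.split(".avi")[0]].append(name)  # track id = prefix before ".avi"
        track_ids = sorted(tracks)
        random.Random(seed).shuffle(track_ids)
        cut1, cut2 = int(0.6 * len(track_ids)), int(0.8 * len(track_ids))
        pick = lambda ids: [f for t in ids for f in tracks[t]]
        return pick(track_ids[:cut1]), pick(track_ids[cut1:cut2]), pick(track_ids[cut2:])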

Another concern is that, ideally, there would be no tracks at all. However, in the
LISA-TS dataset almost all data is grouped into tracks of 10 – 40 frames. Assuming
an average of 25 frames per track, the training set only has about 20 unique tracks
of positive images, which really is not a lot. However, all the datasets explored in
this research, such as LISA, GTSDB [24], and others, are also track based. The reason
is simply that this is how data is collected in the real world using a car and a
camera: there is no way other than driving around and taking video (which is
where the bursts of frames/images come from). Of course, 500 images from 20
tracks are still much better than 20 images (one from each track), although the ideal
case would be 500 images from 500 tracks. It would be interesting to see how much
better that would actually be; given the reduction in redundant information and the
much more diverse and representative training images, it intuitively seems it
should give a substantial boost to detector performance.

Here is the first case where data availability constraints enter the equation. The
limited data available can be a real constraint on training a CNN when there is not a
sufficient amount of it. Fortunately, this amount of data is enough, though using
GTA V to automatically collect images could be a major breakthrough in collecting a
much larger number of training images. One could also do what Chenyi [19] did and
have someone manually drive the car around for 12 hours to collect images. But he
was collecting images with lane markings, so the full 12 hours was useful; for this
research, only a small percentage of those 12 hours (say 5%) would be useful (the
time when traffic signs are present in images). It would therefore be substantially
more costly to carry out a similar procedure manually, not to mention all the data
cleanup afterwards, so it was decided not to undertake this task and instead to
focus on using the already available and annotated LISA-TS dataset of real-world
traffic sign images, specifically for stop sign detection.







Here are examples of positive images from LISA-TS:



Here are examples of negative images from LISA-TS:


Figure 3.2: Sample Images from LISA-TS Dataset


Chapter 4

CNN Methodology & Results

4.1 CNN Introduction


The choice of a CNN is not trivial. There are many alternatives that could be
explored, such as deep belief networks, deep neural networks, recurrent neural
networks, and deep Boltzmann machines. However, a CNN tends to be the most
common choice for computer vision challenges such as image recognition.

CNNs, Convolutional Neural Networks, are a type of artificial neural network.
Artificial neural networks (ANNs) are models that are biologically inspired variants
of the real neural networks within a biological brain. Essentially, an ANN is a
network of nodes (or "neurons") where every node is connected only to certain
other nodes, and all node connections are assigned weights based on how important
those connections are (just as in a real brain). It is these weights that are
optimized/learned by training the CNN on the training set. Also during training, the
validation set is used to choose some hyper-parameters (such as the number of
filters, filter shapes, max-pooling shape, et cetera) and to judge how well the weights
were optimized using the training data. After using the training and validation sets
to construct the learned weights, the test set can be used to gauge the quality of the
detector. One can NOT use the performance on the test set to further tune a model;
that is the purpose of the validation set. Tuning based on the test set defeats its
purpose.

Now the exact composition of a CNN will be broken down for you. A CNN is an ANN
composed of several types of layers. These types of layers include:



• Convolutional Layers (CONV). The CONV layer is where the CNN gets its name;
it is essentially a process where the input to the layer is "convolved"
with a set of filters, and the results/outputs of those convolutions are passed
as input to the next layer. A filter can be thought of as a sliding window
with weights (a matrix of numbers) that slides across the entire image (a
bigger matrix of numbers) and at each position takes a dot product with
the portion of the image it covers, which becomes part of the output. The
weights in these filters are what is learned during training. The filter
object in a CONV layer is 4-dimensional: (1) number of filters, (2) number of
channels (number of input matrices), (3) filter height, and (4) filter width.
Usually it is very hard for the human eye to get an idea of what these filters
are doing, especially if the filters do not come from the first convolutional
layer. This is just their nature, and there is not a better way to visualize them.
But here is a small, trivial example to give some basic intuition:

One interesting convolution kernel is the Sobel operator, which produces an
image with edges strongly emphasized. This is especially helpful in edge
detection, and consequently in detecting the edges of the road, street signs,
and many other useful artifacts.

Sobel operator example:

    SO = [ -1   0  +1 ]
         [ -2   0  +2 ]
         [ -1   0  +1 ]

Figure 4.1: Example of a Sobel Kernel

As one can see, the Sobel operator should intuitively produce a dot product
that is very large in absolute value when it is "convolved" with a 3x3 patch
that has a vertical edge through the middle of it, since such a patch has
left-hand-side and right-hand-side values that differ greatly in magnitude. If
one wanted to detect horizontal edges, one could simply rotate the above SO
by 90 degrees. One could also imagine a detector for curved or octagonal
shapes, which might be developed for detecting stop signs. Beyond that,
there are more useful but complicated operators that work very well but are
hard to make intuitive sense of.

• Down-sampling/pooling Layers (POOL). The POOL layers are largely
responsible for thinning out the information (originally the image, when it
first enters) flowing through the CNN. This is good because it can speed up
computation and remove unnecessary details or noise; however, if too much
pooling is used, it can thin out the data too quickly and impair object
detection, so finding the right balance is important. The intuition behind
pooling is that after a filter has identified a feature (and thus produced a
high dot product value), the exact location of the feature is not important;
what matters is only its location relative to other features. Thus one can
divide the current input matrix into 2x2 or 3x3 tiles and take the maximum,
average, or L2-norm of each tile's values, and have that single value replace
those 4 or 9 values in the output (max pooling was chosen here, as it has
been shown to work best in practice). The CNNs in this study all use max
pooling with a pool size of 2x2; these appear to be the most popular settings
for these two hyper-parameters and have worked well in practice. On the
next page is an illustration of the max pooling used in this CNN. Note that
ignore_border was set to true, so if the last row or column does not fit into a
tile, due to an odd number of rows or columns, it is simply dropped; this is
acceptable practice used by many researchers. (A small numerical sketch of
the convolution and pooling operations is given at the end of this section.)
Note: Pool Shape = (2,2):



Figure 4.2: Example of Max-Pooling technique used

• Rectified Linear Unit Layers (ReLU). The ReLU layer is responsible for
adding further nonlinearity to the CNN; it acts as an activation function and
helps the CNN train faster. There is debate about which function to use in
this layer; three major functions are in use:
o (1) f(x) = max(0, x), range: [0, ∞)
o (2) f(x) = tanh(x), range: (-1, 1)
o (3) f(x) = 1 / (1 + e^(-x)), range: (0, 1)
Figure 4.3: Three common activation functions

Function (1) is often preferred since it is computationally simple and has
been shown to accelerate convergence in stochastic gradient descent.
However, it can have disadvantages associated with not getting rid of large
gradient values. Function (2) is preferred over function (3) since it is
centered on zero, which gives additional useful properties. Thus it was
decided to use function (2), f(x) = tanh(x), in all the CNNs.


• Fully Connected (FC) Layer. This is typically the last layer in the CNN, after
some combination of layers of the previous three types. The FC layer is
unique in that it is the only section of the CNN where every node in layer n
has its own connection to every node in layer n+1, so long as layers n and
n+1 are part of the FC section. As one can imagine, so many weights make
this computationally expensive; thus the combinations of CONV, POOL, and
ReLU layers before this step, which greatly reduce the size of the input to
the FC layer, are very important. If the FC layer has Hidden Layers (HLs)
within it, which it usually does, then it can also be considered a Multilayer
Perceptron (MLP). Learning in the MLP/FC layer takes place via
backpropagation, which is essentially a method of using stochastic gradient
descent to minimize the error of the predictions output by the final layer of
the FC. Each layer in the FC section has a weight matrix W and a bias vector
b; these are the values that are learned. The last layer of the CNN in this
research is actually a logistic regression (LR) layer, a common technique
shown to improve performance [15]. The hidden layer (HL) essentially
transforms the input into a linearly separable space, and then the Logistic
Regression (LR) layer classifies (makes predictions from) the output of the
HL. Below is a useful visualization of an FC layer with three HLs [28]:



Figure 4.4: Sample FC Layer

The final construction of the CNN in this study was the following:
INPUT IMAGE -> [CONV -> POOL -> RELU]*3 -> HL -> RELU -> LR ->
PREDICTION (OUTPUT)
Note the "*3" notation: the 3 layers in the brackets are repeated 3 times, so the CNN
can essentially be thought of as having 12 layers between input and output.
Multiple architectures were tried; this one appeared to work best in practice and
intuitively seemed justifiable (more on that intuition below in 4.2).
To help you visualize, below is an illustration from [25] of a typical CNN
architecture:


Figure 4.5: Sample CNN Architecture
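As promised above, here is a small numerical sketch, in plain numpy/scipy rather
than Theano, of the two operations at the heart of each [CONV -> POOL -> RELU]
block: convolving a toy image with the Sobel kernel of Figure 4.1, max-pooling the
result with a 2x2 window as in Figure 4.2, and applying the tanh activation chosen
above. The toy image is invented purely for illustration.

    import numpy as np
    from scipy.signal import convolve2d

    # A toy 6x6 "image" with a vertical edge down the middle (dark left, bright right).
    image = np.hstack([np.zeros((6, 3)), np.ones((6, 3)) * 255])

    # The Sobel kernel from Figure 4.1; large responses mark vertical edges.
    sobel = np.array([[-1, 0, 1],
                      [-2, 0, 2],
                      [-1, 0, 1]])
    feature_map = convolve2d(image, sobel, mode="valid")   # CONV step

    def max_pool(x, size=2):
        """2x2 max pooling with ignore_border=True: trailing odd rows/columns are dropped."""
        h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
        return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

    pooled = max_pool(feature_map)                          # POOL step
    activated = np.tanh(pooled)                             # RELU step (tanh activation)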


4.2 CNN Construction


Deciding how to architect a CNN is not easy, in large part because at the very
beginning one struggles with how to make decisions intuitively. A beginner with
CNNs has little or no intuition about abstract questions such as: if the CNN is not
performing well, which hyper-parameters should be changed? There are many
hyper-parameters to change, such as the learning rate, the number of filters, the
dimensions of the various layers, et cetera. A user could also consider a bigger
structural change; for example, if there are currently 3 CONV layers, would it be
better to use 2 or 4? Developing this kind of intuition can only be done through a lot
of time spent training neural networks, making such changes, and seeing what
happens. It is therefore best to start with a CNN architecture similar to existing
successful models and then tweak from there. To do this, the study relied largely on
the Deep Learning Tutorial [15] and its many working examples to serve as a
foundation for this study's own CNN. The tutorials also rely largely on the Theano
Python library. Quite a few CNN libraries were researched, and Theano came across
as the most attractive for a few reasons; its advantages are detailed below.

4.3 Programming with Theano


Theano is specifically designed for performing operations on multidimensional
arrays [26]. Users can define, optimize, tweak, and write mathematical expressions
for Theano to evaluate. The speed at which Theano performs these operations is
much faster than traditional Python techniques using numpy and instead rivals the
speed of optimized C code. One big advantage of Theano is that it can compile
expressions with a C++ compiler, using either G++ or NVCC; compiled code in
general runs much faster than code interpreted on execution. Additionally, while
traditional code is run by the CPU, Theano is typically set to make use of the GPU by
installing NVIDIA's GPU-programming toolchain (CUDA). In recent years, GPUs have
emerged as the winner over their rival the CPU when tackling computer vision
problems: they are more efficient and faster. These are two good reasons that led to
using Theano to develop, train, test, and operate the CNN in this study.
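To give a flavor of the workflow, here is a minimal generic Theano example (not an
excerpt from this study's code): a symbolic expression is defined once, compiled to
fast native code, and then called like an ordinary Python function.

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.matrix("x")                      # symbolic input, e.g. a batch of flattened images
    w = theano.shared(np.random.randn(4, 2).astype(theano.config.floatX), name="w")
    b = theano.shared(np.zeros(2, dtype=theano.config.floatX), name="b")

    p_y_given_x = T.nnet.softmax(T.dot(x, w) + b)   # a logistic-regression-style output layer
    predict = theano.function([x], T.argmax(p_y_given_x, axis=1))

    print(predict(np.random.randn(3, 4).astype(theano.config.floatX)))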

There are many alternatives to Theano in the world of deep learning and CNN
construction, including Caffe, Torch, TensorFlow, and DL4J. Caffe seems to be a
promising alternative. It is a deep learning framework developed by the Berkeley
Vision and Learning Center (BVLC), it appears to be a very popular framework with
plenty of documentation, it similarly has a flag for switching from CPU to GPU usage,
and it has libraries available in several languages such as Python and C++. However,
Theano is also great, and all CNNs within this research rely on Theano.

The following hardware was used to train CNNs in this study:
• Intel CPU (on a personal MacBook Pro)
• NVIDIA Tesla K40 GPU – 12 GB SDRAM (made available in the Princeton
PAVE Lab [27])
The NVIDIA GPU increased training speed by a factor of roughly 10 to 12.

4.4 Stop Sign MLP


Research began by first creating a Multilayer Perceptron (MLP) to detect stop signs.
This seemed like a natural first step since the eventual CNN would encompass an
MLP. Without CONV layers, poor performance was expected. The results are below;
all models were trained for 1000 epochs with a learning rate of 0.01 (except the
single logistic regression, where 0.1 was used to make it converge faster):
Layer Setup        Validation Error (%)   Testing Error (%)   # Hidden Units
1 LR (not MLP)     21.5                   65.5                N/A
1 LR + 1 HL        47.5                   51.0                50
1 LR + 1 HL        46.0                   49.0                250
1 LR + 1 HL        50.0                   50.0                500
1 LR + 2 HL        39.0                   38.0                50, 50
1 LR + 2 HL        42.0                   44.0                500, 500
Table 4.1: Stop Sign MLP Validation and Testing Errors

As one can clearly see from the testing error, these results are not great. Given
that the testing set is half positives and half negatives, a random coin toss would be
expected to yield a testing error of 50%. Any model with a testing error above 50%
most likely has no predictive power at all (such as the first single logistic
regression), and the MLPs below it are not convincing either. The reason is that
MLPs do not enjoy the property of translational invariance. They work well when
stop signs appear in roughly the same spot with roughly the same size and lighting,
but this is clearly not the case when driving a vehicle, so the MLP performs poorly as
expected. This highlights why the convolutional layers of a CNN are so necessary: it
is the convolutional layers that provide translation invariance.
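
For reference, here is a minimal sketch of how one of these layer setups could be composed, assuming the HiddenLayer and LogisticRegression classes from the Deep Learning Tutorials that the appendix code also imports (module names may differ from the exact files used in this thesis):

import numpy
import theano.tensor as T
from mlp import HiddenLayer                      # Deep Learning Tutorials classes
from logistic_sgd import LogisticRegression

rng = numpy.random.RandomState(1234)
x = T.matrix('x')                                # flattened 140 x 95 x 3 images

# the "1 LR + 2 HL" setup with 50 hidden units per layer (row 5 of Table 4.1)
h1 = HiddenLayer(rng, input=x, n_in=3 * 95 * 140, n_out=50, activation=T.tanh)
h2 = HiddenLayer(rng, input=h1.output, n_in=50, n_out=50, activation=T.tanh)
lr = LogisticRegression(input=h2.output, n_in=50, n_out=2)  # stop sign vs. no stop sign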

4.5 Contrasting MLPs with MNIST


As a sanity check that this is indeed the case, the above five models were run
(technically 1 LR is not an MLP, but rather a single-layer perceptron) on the MNIST
dataset [16]. MNIST is a famous computer vision dataset containing 70,000 examples
of handwritten digits; here 50,000 images were used for training, 10,000 for
validation, and 10,000 for testing. The Deep Learning Tutorials conveniently provide
an MNIST example, and the five models above were tweaked to run on MNIST, giving
the following results (same epochs and learning rates as in 4.4):
Layer Setup        Validation Error (%)   Testing Error (%)   # Hidden Units
1 LR (not MLP)     7.50                   7.49                N/A
1 LR + 1 HL        2.65                   2.52                50
1 LR + 1 HL        2.00                   1.95                250
1 LR + 1 HL        2.42                   2.40                500
1 LR + 2 HL        2.48                   2.37                50, 50
1 LR + 2 HL        1.91                   1.99                500, 500
Table 4.2: MNIST MLP Validation and Testing Errors

As expected, the MLPs perform far better on the MNIST dataset. The thesis now
transitions to the results of the CNN; first, though, here are some example images
from the MNIST dataset:


Figure 4.6: Sample MNIST Images

4.6 Stop Sign CNN


The CNN was initially created to detect stop signs. Its architecture, as stated
previously, is: INPUT IMAGE -> [CONV -> POOL -> RELU]*3 -> HL -> RELU -> LR ->
PREDICTION (OUTPUT). Having three [CONV -> POOL -> RELU] stages at the beginning
is believed to be intuitively sound based on other research. One main variable to
tune is therefore the number of kernels/filters (K) in each stage and the dimension
(D) of those kernels. For example, K = [4,3,2,1] and D = [5,4,3,2] (not a recommended
setting, but useful for explaining the notation in the table below) would mean four
[CONV -> POOL -> RELU] stages, where the first has 4 kernels of size 5x5, the second
has 3 kernels of size 4x4, the third has 2 kernels of size 3x3, and the last has 1 kernel
of size 2x2. Other parameters considered for tuning are the number of hidden units
in the HL, the learning rate (λ), and the number of epochs (E). The table below
reports the results of trying the CNN under different settings (VE is Validation Error
and TE is Testing Error). The learning rate was experimented with, and 0.003
happened to work nicely, so from there on only its order of magnitude was varied;
the specific value 0.003 is otherwise arbitrary.

CNN Setup                                VE (%)   TE (%)   Learning Rate (λ)   # Hidden Units
K = [20,20,20], D = [9,9,9], E = 200     48.5     50.0     0.03                500
K = [20,20,20], D = [9,9,9], E = 200     22.5     45.0     0.003               500
K = [20,20,20], D = [9,9,9], E = 400     17.5     26.5     0.003               500
K = [20,20,20], D = [9,9,9], E = 400     28.5     35.5     0.003               800

K = [15,30,30], D = [16,8,5], E = 200    25.0     33.0     0.03                500
K = [15,30,30], D = [16,8,5], E = 200    27.5     34.5     0.003               500
K = [15,30,30], D = [16,8,5], E = 400    19.0     21.5     0.0003              500
K = [15,30,30], D = [16,8,5], E = 200    23.5     26.0     0.003               800

K = [15,25,30], D = [14,10,6], E = 200   42.5     40.5     0.03                500
K = [15,25,30], D = [14,10,6], E = 200   23.0     32.5     0.003               500
K = [15,25,30], D = [14,10,6], E = 200   31.5     37.5     0.0003              500
K = [15,25,30], D = [14,10,6], E = 400   31.5     37.5     0.0003              800
Table 4.3: Stop Sign CNN Validation and Testing Errors

Considering the limited amount of data for training, validation, and testing, these
results are surprisingly good. The winner is the 7th CNN tested, with a VE of 19% and
a TE of 21.5%; it is the CNN used later in Section 4.9 to test on GTA V images
collected from the game, where it also performs well.
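
As a sketch of how the K/D notation maps onto the network, the per-stage filter shapes fed to the LeNetConvPoolLayer constructor in Appendix A.2.1 can be derived as follows (the input has 3 color channels):

# derive (num filters, input maps, height, width) per [CONV -> POOL -> RELU] stage
def filter_shapes(K, D, channels_in=3):
    shapes = []
    for k, d in zip(K, D):
        shapes.append((k, channels_in, d, d))
        channels_in = k          # the next stage sees k feature maps as its input
    return shapes

print filter_shapes([15, 30, 30], [16, 8, 5])
# [(15, 3, 16, 16), (30, 15, 8, 8), (30, 30, 5, 5)] -- the winning setting above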

4.7 Contrasting CNNs with MNIST


Again, to continue the pattern of comparing against the MNIST dataset, CNNs were
built and run on MNIST under the following settings, with these results:
CNN Setup                         VE (%)   TE (%)   Learning Rate (λ)   # Hidden Units
K = [20,50], D = [5,5], E = 200   0.91     0.92     0.1                 500
K = [15,30], D = [5,5], E = 200   1.04     1.10     0.03                500
K = [15,30], D = [5,5], E = 200   0.95     1.06     0.3                 500
K = [25,50], D = [5,5], E = 200   0.88     0.85     0.3                 500
K = [25,50], D = [4,4], E = 200   0.99     0.88     0.3                 500
K = [25,50], D = [6,6], E = 200   1.07     0.98     0.3                 500
K = [25,50], D = [5,5], E = 200   0.88     1.00     0.3                 800
Table 4.4: MNIST CNN Validation and Testing Errors

One can see that the 4th row (K = [25,50], D = [5,5], E = 200, Hidden Units = 500)
performed the best, although practically speaking they are all very close. Clearly the
CNN works well, and if the Stop Sign CNN were given much more data, it would likely
perform nearly as well as the MNIST classifier. The MNIST problem is easier because
the digits are centered in the image and the total dataset is 70,000 images; in
contrast, the stop sign problem is much harder because the stop signs sit somewhere
in the background of the photo (never the central object), and the total dataset from
LISA-TS is only about 1,800 images. It seems reasonable to conclude that adding
more data would improve the Stop Sign CNN's performance; the question is where to
find more images of stop signs, or alternatively to use GTA V game images and have
someone spend a whole day driving around the game to collect them.

4.8 Contrasting CNNs with GTSDB


This time a more meaningful comparison was made: how the CNNs perform on
traffic sign classification when the traffic sign is centered in the image (this assumes
the sign has already been localized), so that the CNN runs only on a cropped image of
the sign (see examples below). To do this, part of the training data provided to
GTSDB competition participants was used. The 6 sign classes with the most data
were picked, so the CNN outputs 0, 1, 2, 3, 4, or 5 according to which sign it believes
is in the picture. The signs used are: (0) One Way Sign, (1) Warning Sign, (2) Yield
Sign, (3) 50 km/h Speed Limit Sign, (4) No Truck Passing Sign, and (5) 30 km/h
Speed Limit Sign.
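
For reference, the label encoding amounts to the mapping below, and the only structural change needed relative to the binary stop sign CNN in Appendix A.2.1 is at the output layer (a sketch, not the exact code used):

GTSDB_CLASSES = {0: "One Way", 1: "Warning", 2: "Yield",
                 3: "50 km/h Speed Limit", 4: "No Truck Passing",
                 5: "30 km/h Speed Limit"}

# binary stop sign detector (Appendix A.2.1):
#   layer4 = LogisticRegression(input=layer3.output, n_in=500, n_out=2)
# six-class GTSDB classifier:
#   layer4 = LogisticRegression(input=layer3.output, n_in=500, n_out=len(GTSDB_CLASSES))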

Here are the results of the CNN. A total of 490 images were used, in roughly equal
portions across the 6 signs above: 400 for training, 50 for validation, and 40 for
testing:

CNN Setup                          VE (%)   TE (%)   Learning Rate (λ)   # Hidden Units
K = [20,20], D = [8,6], E = 200    4.0      5.0      0.003               500
K = [20,20], D = [8,6], E = 200    18.0     5.0      0.03                500
K = [20,20], D = [8,6], E = 200    12.0     5.0      0.0003              500
K = [25,50], D = [8,6], E = 200    4.0      0.0      0.003               500
K = [25,50], D = [8,6], E = 200    6.0      2.5      0.003               800
K = [35,70], D = [8,6], E = 200    4.0      2.5      0.003               500
K = [20,20], D = [10,6], E = 200   2.0      5.0      0.003               500
K = [25,50], D = [10,6], E = 200   2.0      0.0      0.003               500
K = [25,50], D = [12,6], E = 200   2.0      7.5      0.003               500
K = [25,50], D = [12,8], E = 200   0.0      5.0      0.003               500
Table 4.5: Traffic Sign CNN (GTSDB) Validation and Testing Errors

As one can see, the CNNs perform extraordinarily well on this kind of traffic sign
classification, which is a much easier problem than the one being attempted with
GTA V in this thesis. The top-performing CNN gets 2% error on the validation set and
0% error on the testing set. Taking that as roughly 1% error on average, and
assuming the traffic sign localization step (bounding box detection) were similarly
error-free, a detection system with about 1% error on GTSDB traffic signs would
have been a very competitive performer in the GTSDB competition, especially
considering this study used no special data preprocessing or other performance-
boosting methods and tunings.

Examples of the specific 6 signs being detected:


Of course, not all the photos are that clear and well lit, for example:


Figure 4.7: Sample GTSDB Images

It is quite likely that the small errors occur on signs such as those above that are
either overexposed or underexposed, and thus look abnormal with obscured edges
and colors.

4.9 Performance on GTA V Images


Next, the vehicle was manually driven around using one of the hacking scripts to
adjust the camera so that it could take training photos. Below are some of the photos
used for testing, grouped into positive (stop sign present) and negative (not present)
sets. Also provided is an indication of how the best CNN (from 4.6) predicted on
each, along with the detection rate calculated on the full sets of images grabbed from
the game, roughly 100 positive and 100 negative images.

EXAMPLE OF POSITIVE IMAGE SET (easier images, sign is big in photo):


(1) D = YES (2) D = YES (3) D = YES



(4) D = YES (5) D = NO, NOT CORRECT (6) D = YES, CORRECT


(7) D = YES, CORRECT (8) D = YES, CORRECT (9) D = NO, NOT CORRECT
EXAMPLE OF NEGATIVE IMAGE SET:

(1) NO, CORRECT (2) NO, CORRECT (3) YES, NOT CORRECT


(4) YES, NOT CORRECT (5) NO, CORRECT (6) NO, CORRECT


(7) NO, CORRECT (8) YES, NOT CORRECT (9) NO, CORRECT
Figure 4.8: Sample GTA V Image Classifications

These are just 18 images selectively pulled from the 200 game images that were
tested. Overall, roughly 73% of the 200 images were classified correctly and 27%
incorrectly, almost 3 out of 4. It is good news that the detector works well on game
images. Admittedly, some of the game images are easier in that the stop sign is, on
average, somewhat larger in the image than in the LISA-TS images used to train,
validate, and test the CNN in Section 4.6. Nonetheless, the detector did a good job.

Looking at the photos, one can also speculate why the CNN detector predicted
wrongly in some cases. In the positive set, the detector surprisingly handled image
(4) correctly, but it missed images (5) and (9), which are also dark. It appears the
detector struggles when the sign is very dark, as in (5), or dark and blending into the
background, as in (9). Of course, more images would be needed to test this
hypothesis with confidence. It may suggest that more training with darker images is
needed to improve the CNN further, and that the image preprocessing tactics used in
[30], such as increasing contrast, could be very useful for detecting signs in darker
images. For the negative set, discerning what was falsely detected as a stop sign is
harder, but in general the detector seemed to struggle with images containing dark
shadows or simply less light, further supporting the intuition that the detector
struggles in poorly lit images.
Here is the detection rate table, showing the breakdown of error between positive
and negative testing images:

Image Set      Correctly Classified (%)   Incorrectly Classified (%)
Positive Set   79                         21
Negative Set   67                         33
Total Set      73                         27
Table 4.6: GTA V Image Classification Results

It seems the detector has a somewhat bigger problem with false positives than with
false negatives. One reason false negatives could be low (a good thing) is that the
positive image set tended to have the stop sign relatively larger in the image than in
the LISA-TS images, as noted before. As for why false positives are relatively higher,
it is not immediately clear, but one possible reason, alluded to earlier, is that the
detector falsely detects stop signs in low-light images.

An interesting note about the images: as one can see, they form a very diverse set,
taken at all times of day and under different weather conditions. If desired, one
could also drive through many different environments, such as urban, rural,
mountainous, desert, coastal, and industrial areas, since they all exist in this game.
This study concentrated on urban and suburban scenes, plus a few mountainous
images. The breadth of imagery in this game is a truly promising feature, as it
improves the odds that this single game can produce a dataset diverse enough to
reflect what an autonomously driven vehicle would encounter in the real world.
Hopefully, once data collection in the game is more automated, large new datasets
can be collected and the CNN detector's performance improved even further.

Most importantly, this shows that a CNN detector trained on real-world images
works very well on in-game images, and thus one can conclude that a CNN detector
trained on images from GTA V could also work reasonably well on real-world
images. This supports the belief that a virtual environment can be used to
successfully train CNNs, which is great news.


Chapter 5

GTA V Methodology & Results

5.1 GTA V – System Setup


To begin setting up the system, one must first complete several installations; this
brief summary should get anyone continuing this research started. After installing
GTA V on Windows, install Script Hook V [12], written in C++ by Alexander Blade,
and then Script Hook V .NET [13], an ASI plugin that allows a programmer to run
scripts written in a .NET language (such as C#) that call into Blade's Script Hook V
library and thus control elements of the game such as driving vehicles. In addition,
install all dependencies for the Theano python library.

To set up the system and write the code that connects all the parts, one first needs
to develop a conceptual architecture. Here is the system setup architecture,
developed with the help of Artur [21], shown below:

Figure 5.1: System Setup Architecture



Getting this system architecture to work well requires patience. Requesting a
screenshot via the OS can vary in reliability, especially on weaker hardware; if the
CPU, GPU, and RAM are heavily occupied running the game and Theano, screenshots
sometimes do not arrive as quickly as one would like. 12 Hz seems to be a reasonable
and sufficient frequency to aim for. If the frequency is set too high, Theano may not
keep up, depending on how deep and large the CNN is. On the other hand, if the
frequency is too low, such as 1 Hz, car handling appears choppy, since updating
parameters once per second is very infrequent; when a vehicle is traveling fast, the
environment must be evaluated and decisions made multiple times per second. Next
is a discussion of the creation of the car handling script for GTA V.
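
As a rough illustration of the pacing described above, here is a minimal sketch of a 12 Hz capture-and-classify loop; grab_screenshot, classify, and publish are hypothetical placeholders for the screenshot request, the Theano CNN forward pass, and the hand-off to the car handling script:

import time

TARGET_HZ = 12
PERIOD = 1.0 / TARGET_HZ              # roughly 83 ms per frame

def run_loop(grab_screenshot, classify, publish):
    while True:
        start = time.time()
        frame = grab_screenshot()     # OS-level screenshot of the game window
        label = classify(frame)       # 1 if a stop sign is detected, else 0
        publish(label)                # e.g. write to the CNN output text file
        # sleep away whatever is left of this frame's time budget
        time.sleep(max(0.0, PERIOD - (time.time() - start)))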


5.2 GTA V – Car Handling Script


The Script Hook V library [12] provides a vast number of global variables and
functions that can be called to modify game behavior. The challenging part is
searching through the sometimes poor documentation of all these variables and
functions to determine how to achieve the desired modification.

Quite a few behavior modifications were desired when creating the car handling
script. For instance, it was decided to delete any other vehicle that comes within 10
meters of the player's vehicle, to prevent collisions while training autonomous
vehicles in the game. To make training easier, one can also fix the time of day and
weather to always be noon and very sunny, providing ideal conditions; if training
works well under those conditions, one can then consider more difficult times and
conditions, such as cloudy, low-light scenes or twilight. If it is completely dark,
preprocessing filters that increase contrast may be needed to give the detector a
chance.

Additionally, there were some more obvious behavior modifications, such as fixing
the car speed at 20 mph and, if a stop sign is reported as detected, stopping the car at
0 mph for 3 seconds before automatically accelerating back to full speed. To get this
right, one has to time it so the car stops right at the stop sign, and of course one
would expect multiple detections, possibly 10, 20, or 30, beforehand, assuming the
detector operates at 12 Hz. A stopping system was therefore implemented in code;
however, because the CNN only trained down to about 20% error, the output was
not accurate enough to live-demo a nicely working automatic stopping system, so
stopping is instead done manually. A description of the system built follows. Note:
the stopping system had to be built this way because the CNN does not yet predict
the distance to detected signs.

The system starts from the assumption that one would expect multiple detections,
possibly 10, 20, 30 or more, before the stop sign, assuming the detector operates at
12 Hz. However, that number cannot simply be averaged and used as a trigger for
stopping, since it varies so much with road conditions. On a straight road the stop
sign can be seen from a long distance, but on a curvy road it appears suddenly; or the
road may be straight but the sign initially occluded by another object or by bad
weather. Thus it was decided to adjust the camera zoom so that only stop signs more
than 14 meters away can be seen; the camera is simply zoomed slightly ahead. A car
traveling at 20 mph moves at roughly 8.9 meters per second. Given that the system
operates at 12 Hz, or one image processed roughly every 80 milliseconds, and
assuming a near-perfect detector with <5% error, one can be quite certain after 5
missed detections (about 400 milliseconds with no detection) that the stop sign is
within 14 meters. The car then begins to stop, with roughly 1061 milliseconds (about
1 second) to come to rest 1 meter before the stop sign. Below is a table of the
settings for different speeds (15 to 50 mph, in increments of 5). It was assumed
there are no stop signs on roads of 55 mph or greater, which seems reasonable
though not perfect. The rule could still be applied at higher speeds, but the stopping
might be too abrupt unless the camera is zoomed very far out, and that brings other
problems since photo quality tends to degrade with distance. Here is the table:

Camera Zoom (meters)   Vehicle Speed (mph)   Vehicle Speed (m/s)   Time Left to Stop (ms)
10                     15                    6.7                   940
14                     20                    8.9                   1061
18                     25                    11.1                  1130
23                     30                    13.4                  1241
28                     35                    15.6                  1331
34                     40                    17.9                  1443
40                     45                    20.1                  1539
46                     50                    22.3                  1617
Table 5.1: Sample Stopping Times
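
The values in Table 5.1 follow from the assumptions stated above: 5 missed frames at roughly 80 ms each before stopping is triggered, and a target of stopping 1 meter before the sign. A small sketch of that arithmetic is below; minor differences from the table come from rounding the speeds:

MPH_TO_MPS = 0.44704
MISS_WINDOW_S = 5 * 0.080      # ~400 ms of consecutive missed detections
STOP_MARGIN_M = 1.0            # aim to stop 1 meter before the sign

def time_left_to_stop_ms(zoom_m, speed_mph):
    v = speed_mph * MPH_TO_MPS                      # vehicle speed in m/s
    traveled = v * MISS_WINDOW_S                    # distance covered while confirming the miss
    remaining = zoom_m - STOP_MARGIN_M - traveled   # distance left to the stopping point
    return 1000.0 * remaining / v

print round(time_left_to_stop_ms(14, 20))
# ~1054 ms; Table 5.1's 1061 ms comes from using the rounded 8.9 m/s speed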

In the game these values would work well, assuming a great detector and that a
missed detection really means the stop sign has dropped out of the camera's field of
vision rather than being occluded by some object. In real life, however, a car might
not be able to, and the driver would certainly not enjoy, coming to a complete stop
from 50 mph in 1617 milliseconds (1.6 seconds). To adjust for this one could extend
the camera zoom, roughly tripling the stopping time to about 5 seconds at 50 mph,
which seems much more reasonable and closer to the comfort zone. However, that
would require a zoom of around 150 meters, which is quite far, and there is a high
chance of another car occluding a sign at that distance. It was therefore concluded
that this type of stopping system would not be very effective in the real world. A
better approach is to have the CNN not only detect that there is a stop sign but also
predict its distance based on how large it appears in the photo. Unfortunately, as
mentioned before, no datasets were found that include the true distance from the
vehicle to the sign in their photo annotations, since obtaining such a value would
require someone physically measuring the distance for each photo. Without those
labels, supervised training of a CNN to predict distance to the sign is not possible.
This is where the advantages of GTA V come into play. In GTA V one can use the
hacking code and libraries to grab nearby objects, and from this obtain other
variables that help in estimating distances to those objects. Artur is currently
developing code to estimate the true distances to signs in captured game images. If
he successfully develops a script that can accurately determine this distance for all
captured images containing signs, then those images and their true values could be
used to train the currently existing CNN, which would then be able to output
predicted distance values. The only limiting constraint is first obtaining a dataset to
train the CNN on.

For now, the CNN is focused on image classification. It writes a one to the CNN
output text file when the image being processed contains a stop sign and a zero
when it does not. Thus the output vector is simply a scalar:
My Stop Sign CNN Output Vector
Stop Sign Detection:
SSP
SSP = stop sign present (1 if yes, 0 if no)

In contrast, here is Chenyi’s simplified lane detection CNN implemented for GTA V
by Artur that outputs two scalars:
Chenyi’s CNN Output Vector
Driving car portion:
DTC ATC
DTC = distance to center of road
ATC = angle of adjustment to return to center of road
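
As an illustration of these two output formats, here is a minimal sketch of how each vector could be written to its text file for the car handling script to read; the file names are hypothetical:

# stop sign CNN: a single scalar, SSP
def write_stop_sign_output(ssp, path="stop_sign_output.txt"):
    with open(path, "w") as f:
        f.write("1" if ssp else "0")

# Chenyi's lane detection CNN (as implemented by Artur): two scalars, DTC and ATC
def write_lane_output(dtc, atc, path="lane_output.txt"):
    with open(path, "w") as f:
        f.write("%f %f" % (dtc, atc))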

5.3 GTA V – Position and Angle Structuring

Another idea that arose with the game hacks available is training the CNN on images
taken from cameras at multiple positions and angles. In the game, the camera can be
placed literally anywhere in relation to the car. It is set using 5 parameters,
<x,y,z,w,v>, where <x,y,z> define an offset from where the player is in the game
(presumably in the car) to where the camera should go: x is how far in front, y how
far to the side, and z how far up. Then <w,v> control the angle of the camera, so with
these parameters the camera can be pointed at exactly whatever angle is desired.
For example, with x, y, z in meters and w, v in degrees, setting <w,v,x,y,z> =
<0,-90,0,0,10> places a camera hovering 10 meters above the car and looking down
on it, giving a bird's eye view. This could be a very interesting point of view from
which to train the car, though its practicality in the real world currently does not
seem great. A more practical approach is to assume the camera must be attached to
the car (as all major companies are doing) and experiment with where on the car is
optimal. For instance, would a camera under the front bumper, just barely above the
ground and looking forward, work better than a camera mounted much higher on
the car? The script created can very easily test this. Below are 6 settings at which the
car can currently run (note: the ground is defined as 10 cm above the pavement).
The 6 photos are ordered such that the first row is 0 and 0.5 meters off the ground,
the second row 1 and 1.5 meters, and the last row 2 and 3 meters:

Figure 5.2: Camera Placement Images
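
The <x,y,z,w,v> parameterization described above can be captured in a small helper. This sketch is purely illustrative and does not call the actual Script Hook V camera functions:

# <x, y, z>: offset from the player in meters (forward, sideways, up)
# <w, v>: camera angles in degrees
def camera_setting(x, y, z, w, v):
    return {"offset": (x, y, z), "angle": (w, v)}

# bird's eye view example from the text: 10 m above the car, pointing straight down
birds_eye = camera_setting(0, 0, 10, 0, -90)

# the six dash-cam heights used for Figure 5.2, in meters above the ground
dash_cam_heights = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]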


Collecting a sufficient number of images to run the CNN, say 500 per setting if doing
all 6, or perhaps 1000 per setting if doing only 2, could easily have taken at least 12
hours of manually playing the game and driving around. It was decided to focus first
on other research in this thesis where the data was more easily accessible.
Nonetheless, having the car ready to collect such data at different positions and
angles is a great start. If someone could drive around the game and collect images,
they could easily be fed into the CNN as a test set; or, if a sufficiently large number of
images were collected, an entirely new CNN could be trained on this dataset and
compared to training on real-world images. Additionally, the bird's eye view images
pictured in 5.4 could be a further topic of research.

5.4 Live Action Videos


A live action video demo was recorded of the hacks currently working in the game,
in relation to their usability for training and testing CNN detectors. The video below
demonstrates them:
https://www.youtube.com/watch?v=70ecHWz7kQc


Figure 5.3: GTA V Video Image

In the video provided above there are 5 sections:

(1) Clip of the Traffic Light and Stop Sign Indications responding to manually
created detections (the CNN is not yet error-free enough for a live demo).
(2) Clip of the above interacting with Artur's in-game implementation of Chenyi's
lane-detection CNN.
(3) Clip of the different altitude modes for taking training images.
(4) Clip of bird's eye view altitudes, for possible research. Their practical use is
currently questionable, but it would be interesting if a bird's eye view detector
proved effective, possibly for lane detection, turning the car at intersections, or
nearby vehicle and obstacle detection.
(5) Clip illustrating some of the testing image frames collected from the game.

In addition to the video, provided below are some still images of the car performing
1) Stop Sign Detection, 2) Traffic Sign/Light Detection (no CNN implemented yet),
3) Lane Detection (using Chenyi's CNN, implemented by Artur), and 4) some bird's
eye view images:












Figure 5.4: GTA V Traffic Sign CNN Script Images


Figure 5.5: GTA V Lane Detection CNN and Bird's Eye View


Chapter 6

Conclusion

6.1 CNN
The CNNs trained within this thesis all exceeded expectations in their performance.
CNNs are truly a powerful detection tool for computer vision tasks. The most
important CNN in this thesis is undoubtedly the best performer in Section 4.6, with a
Validation Error of 19% and a Testing Error of 21.5%. It was a single CNN with 12
layers, trained and validated on a set of only 1400 images from the LISA-TS dataset,
700 of which contained stop signs and 700 of which did not. These are real images
taken from behind the windshield of a car driving on real-world roads. That this CNN
could perform so well despite such a limited amount of data is remarkable. It shows
that the single-step approach to traffic sign detection is viable and provides an
alternative to the traditional two-step approach.

Additionally, if more training data is thrown at the problem, the results in this thesis
strongly suggest that the CNN from 4.6 would continue to improve. In all likelihood,
increasing the data by an order of magnitude would bring validation and testing
errors below 10%, and the single-step detector would come close to rivaling the
winners of the GTSDB competition. An even more important result is that, since the
single-step method works for traffic sign classification and the CNN can locate a
sign's presence, it is also likely trainable to return a prediction of the sign's distance
from the vehicle based on the sign's size. The last obstacle to testing that is dataset
creation, which will hopefully be made possible by Artur's independent work this
semester on computing the true distance values from in-game parameters. And as
shown in 4.9, a CNN trained to classify or detect in GTA V is likely to perform equally
well on real-world images, which is what we want. The outcomes of the CNN
research in this thesis are quite inspiring.

On another note, it is worth pausing to explain the intuition behind why the
single-step traffic sign detection method requires a larger training dataset. In a large
image of the road ahead, the signal-to-noise ratio is very small: there are many
objects in a vast image, and the CNN is expected to learn to detect one small object
that many images have in common. To perform well, it therefore needs a very large
training dataset to optimize over. If the dataset is too small, the CNN risks training
itself to detect noise shared among the images, such as the amount of road
pavement, trees, or other objects that the CNN architect did not notice were
coincidentally more common in the positive set than the negative set, or vice versa.
In terms of performance, there is only upside to adding more high-quality, curated
images to the training and validation sets.

Additionally, even if the single-step CNN hits a bottleneck at 5% error and cannot be
driven down to 1% error like the two-step method, that does not make it useless.
For instance, if it can predict distances to signs with only 5% error, that is excellent
performance, since two-step traffic sign detectors only compute bounding boxes and
do not directly indicate distance to the sign. Distance to the sign is hugely beneficial,
because it allows the vehicle to calibrate its behavior much more accurately by
knowing how far away the stop sign is. It also prevents disasters if the sign happens
to be temporarily occluded by another vehicle or object. If a stop sign is detected at
40, 39, ... 30, ..., 20 meters away over the span of a few seconds and then suddenly
goes missing due to occlusion, the self-driving car will still know when to stop, and
this is an essential criterion for an autonomously driven vehicle.


Additionally, the two-step method's bounding boxes are sometimes completely
irrelevant. Consider a speed limit sign: all the vehicle should care about is that the
sign exists and corresponds to the road it is driving on; whether it is 5, 10, or 15
meters away does not really matter, since the vehicle must follow that speed either
way. The two-step method is thus doing unnecessary work and most likely takes
more time to process images than the single-step method. If the single-step method
processes images in 100 ms while the two-step method takes 200 ms, that is a huge
difference for a vehicle traveling at speed on real-world roads, where mistakes can
cause serious accidents. It is clear that the single-step method wins in terms of
model simplicity, and most likely also model speed. It was a great joy developing the
single-step CNN method; the discussion now proceeds to working with GTA V.

6.2 GTA V
Working with GTA V was a great success. Whenever there was an idea, whether
setting up a certain camera view, fixing the car at a certain speed, or changing car
behavior based on information from a text file (such as a CNN output/prediction), it
was accomplished by leveraging the tools available in the hacking library, Script
Hook V (C++), and equivalently in Script Hook V .NET (C#).

Not only were these ideas accomplished to satisfaction using Script Hook V, but
along the way much was learned about the abundance of tools available in the game
that could be leveraged for future research. Below are some ideas for ways to use
GTA V in future research, some of which were briefly mentioned earlier:

One, the capability to locate nearby objects can be used to count the signs in the
vicinity, and other information about each sign, such as its game coordinates, can be
calculated. Artur is currently developing a script to compute the true distances to
signs on the upcoming stretch of road, and this would be enormously useful in
training a CNN that can predict the distance to detected stop signs, for example. No
real-world datasets have been found with these true values, which makes sense, as
going out to measure the distance for every photo is extremely costly in man-hours.

Two, the position and angle structuring code can be used to create large datasets of
traffic sign images from an essentially unlimited range of camera placements and
angles, though one would of course choose strategic, practical positions and angles.
While it would require some man-hours, a database of 10,000 images could easily be
achievable by playing the game for one week. The results would hopefully make it
possible to train well-working CNNs and to determine which placements and angles
offer what advantages in detecting certain signs. It is already likely ideal to have
multiple cameras mounted in different places on a self-driving car, as Google already
does, but more specifically one could train and test for the optimal combination of
camera locations, working toward a tuple of CNNs operating together in a system
that approaches 100% accuracy in traffic sign detection. Such research would be a
strong advancement over the current state of traffic sign detection.

Three, someone could work on locating all traffic signs in the entire GTA V game
world. The player and car could then be spawned systematically at each of those
locations, in front of and facing the signs, and the car automated to drive straight
down the road past the sign while collecting images for training. A script that
automates this data collection with minimal human oversight could increase the
amount of data collected by one and possibly two orders of magnitude: instead of
collecting 5,000 images, one could collect 500,000. Such a massive dataset is quite
reasonable and not unheard of; one of the most famous examples in object detection
and localization is the ImageNet competition [29], whose 2015 dataset contained a
whopping 1.2 million images. With GPU computational capacity increasing yearly,
large datasets are becoming much more feasible to train on, and given that greater
capacity, larger datasets should be acquired in order to train CNNs on more difficult
detection problems previously considered infeasible, or rather too difficult at the
time.

6.3 Future Extensions


6.3.1 Additional CNN Component Creation
Creating additional CNN components to increase the capabilities of the
autonomously driven car would be a further step down the path toward a fully
autonomously driven vehicle. One suggestion is a CNN that handles a wide range of
traffic sign classes, possibly designed along the lines of the following:
CNN Output Vector
Street sign prediction output:
QSSD TSSD1 DTSS1 SF1 … TSSDn DTSSn SFn

-- QSSD = Quantity of Street Signs Detected (within a certain radius of the car in the forward
direction)
-- TSSD = Type of Street Sign Detected (1 = stop sign, 2 = traffic light, 3 = speed limit sign, 4 = caution
sign, 5 = other)
-- DTSS = Distance to Street Sign
-- SF = Sign Feature (only used for certain signs, for example could be the speed limit if sign is a speed
limit sign, thus TSSD = 3, or could be 1 for green, 2 for yellow, 3 for red if the sign is a traffic light,
thus TSSD = 2)
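
A sketch of how such an output vector might be represented and parsed is shown below; the field names follow the abbreviations above, and the structure is illustrative rather than an implemented component:

from collections import namedtuple

# one detected sign: type, distance, and an optional sign-specific feature
SignDetection = namedtuple("SignDetection", ["tssd", "dtss", "sf"])

def parse_output_vector(values):
    # values = [QSSD, TSSD1, DTSS1, SF1, ..., TSSDn, DTSSn, SFn]
    qssd = int(values[0])
    signs = []
    for i in range(qssd):
        t, d, s = values[1 + 3 * i: 4 + 3 * i]
        signs.append(SignDetection(int(t), float(d), s))
    return signs

# e.g. one speed limit sign (TSSD = 3), 25 meters ahead, with a 35 mph limit (SF = 35)
print parse_output_vector([1, 3, 25.0, 35])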

6.3.2 Additional Tracking Layer Creation
In addition to simply reacting to output from the CNN, the Car Handling Script could
also store its own internal tracking layer state. Assume the tracking layer is applied
to a CNN, like the one described above, that handles a wide range of traffic sign
classes. Including a tracking layer has been shown to improve performance greatly,
especially in reducing false positives. The tracking layer algorithm can be tweaked
extensively, and some experimentation will be required to find good parameter
values. Questions include: after how many iterations of a previously detected
upcoming sign no longer being detected should it be assumed to have been a false
detection (a predicted distance to that sign could also inform this decision)? If the
system is 75% confident the speed limit is decreasing from 50 to 30 mph, but there
is a 25% chance it is increasing from 50 to 60 mph, what should the defined behavior
of the vehicle handler be? Safety tradeoffs need to be made. The tracking layer will
prove quite interesting to work on.

Several of the academic papers read for this research have indeed implemented
different tracking layers. It would be interesting in further research to test how
changing the way the tracking layer functions affects overall system performance.
An initial plan could be to create a data structure holding anywhere from 0 to 10
upcoming objects, each with a priority, a confidence level of its existence, an
estimated location, and other important information abstracted from, or calculated
on top of, the CNN's output.
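
As a sketch, the tracked-object state described above might look like the following; the field names and the confidence update rule are illustrative assumptions, not the tracking layers from the cited papers:

class TrackedSign(object):
    def __init__(self, sign_type, distance_m, confidence):
        self.sign_type = sign_type      # e.g. 1 = stop sign, 3 = speed limit sign
        self.distance_m = distance_m    # latest estimated distance to the sign
        self.confidence = confidence    # belief that the sign really exists
        self.missed_frames = 0          # consecutive frames without a detection

    def update(self, detected, distance_m=None):
        if detected:
            self.missed_frames = 0
            self.confidence = min(1.0, self.confidence + 0.1)
            if distance_m is not None:
                self.distance_m = distance_m
        else:
            self.missed_frames += 1
            self.confidence *= 0.8      # decay belief while the sign is missing

# keep at most 10 upcoming objects, highest-confidence first
def prune(tracked, max_objects=10):
    return sorted(tracked, key=lambda t: t.confidence, reverse=True)[:max_objects]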

6.3.3 Additional Virtual Environment and Dataset Exploration
In addition to GTA V, one could also begin exploring other virtual environments. Just
as TORCS worked very well for Chenyi Chen to train a lane detection CNN, and GTA V
seems similarly promising for training a good traffic sign detection CNN, other
virtual environments may have their own quirks that make them ideal for training
certain types of CNNs, detecting objects that would enhance the abilities of an
autonomously driven vehicle. For instance, in GTA V one can also pilot boats and
planes, among other vehicles. Companies like Amazon and Google are currently well
invested in designing autonomously driven drones; drones in the middle of their
journey tend to have few obstacles in their path, but the takeoff and landing portions
could be ripe areas for CNN detection to be put to good use.

In GTA V one can only pilot planes, blimps, and helicopters; no small aircraft such as
drones are available. With the hacks, however, it would likely be feasible to create
one, given a good amount of work. That being said, an autonomously driven drone
could possibly be trained in GTA V as well, though there may be other virtual
environments better suited to that task.

Returning to cars, which for now seem to be the autonomously driven vehicle with
the most monetary incentive to develop: if further research continues to show that
CNNs trained in virtual environments perform just as well as, if not better than,
those trained on real-world images and videos, then it might greatly benefit
companies such as Google not only to train their CNNs by driving real cars around
Mountain View, CA, but also to create a team of engineers to develop a virtual
environment finely tuned to the needs of developing high-quality CNN detectors for
autonomously driven vehicle systems.

Returning to datasets, this thesis employed only datasets from LISA-TS and GTA V
(and, in 4.8, a subset of GTSDB). There are plenty of other datasets worth
considering; for example, CNNs will also need to be trained on European traffic signs
so that they can function there as well. Datasets that were explored and look well
suited for such a task include the German Traffic Sign Detection Benchmark [24]
(briefly explored in 4.8), the KUL Belgium Traffic Sign Dataset [32], and the Swedish
Traffic Sign Dataset [33], all of which contain thousands of images. Note: this
research found no current virtual environment or game that uses European traffic
signs.

By training CNNs to detect both American and European traffic signs, one could see
which signs are easier for CNN detectors to recognize. If there is a big difference for
one traffic sign category, it might be worth that country redesigning the specific sign
to be more detector-friendly. The specific example in mind is the American speed
limit sign versus the European speed limit sign (see below):

AMERICAN SPEED LIMIT SIGN from [23]   EUROPEAN SPEED LIMIT SIGN from [24]
Figure 6.1: Speed Limit Sign Comparison


Which sign looks easier to detect? The European one is almost always easier: it is
circular, with a bright red border, and it is uniform throughout Europe. The
researchers in [1] confirm this. They tried using GTSDB-winning detectors to detect
American traffic signs, and by far the hardest sign to detect was the speed limit sign.
Not only is it a very generic sign, it is often elongated by extra information such as
"Radar Enforced", "Photo Enforced", "State Maximum", and many more variations. It
is also very similar to other objects.


Figure 6.2: Speed Limit Sign Image

The above picture illustrates how generic a speed limit sign is. In fact, the
researchers in [1] had a big problem with the detector identifying other objects,
such as the back of a truck (which is also white, rectangular, and can have text on it),
as a speed limit sign, which it clearly is not.

It is through training CNNs on various sets of traffic signs that researchers might
begin to hit these kinds of roadblocks and possibly request that a country or region
standardize its traffic signs into a more detectable and uniform version. Changing
street signs is relatively cheap compared to high-cost options such as placing special
detectors or sensors on top of street signs to assist autonomously driven cars. Such
sensors would also be impractical: imagine if a sensor broke and the car
consequently did not detect a stop sign. That would easily be a very costly and
unacceptable error.

Final Thoughts
Working on traffic sign detection and with GTA V was a great pleasure, and others
are strongly encouraged to further research in this area. Based on all the work done
so far, training, testing, and developing CNN detectors in virtual environments seems
to be a very tractable and effective method, and there are many new techniques that
can be tried with relative ease in a virtual environment that are not feasible in the
real world. Go get started now! Vehicular accidents are one of the leading causes of
death in America and in the world; further work in getting autonomous driving
tools onto cars on the roads could quickly save many lives. Go make a difference.



Bibliography
[1]
Andreas Møgelmose, Dongran Liu , and Mohan M. Trivedi, Traffic Sign Detection for U.S. Roads:
Remaining Challenges and a case for Tracking, Oct 2014
http://cvrr.ucsd.edu/publications/2014%5CMoegelmoseLiuTrivedi_ITSC2014.pdf

[2]
Piotr Dollár, Integral Channel Features, 2009
http://authors.library.caltech.edu/60048/1/dollarBMVC09ChnFtrs.pdf

[3]
Mohammed Boumediene , Jean-Philippe Lauffenburger, Jeremie Daniel, and Christophe Cudel,
Coupled Detection, Association and Tracking for Traffic Sign Recognition*, June 2014
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6856492

[4]
Mohammed Boumediene , Christophe Cudel, Michel Basset, Abdelaziz Ouamri, Triangular traffic signs
detection based on RSLD algorithm, Aug 2013
http://link.springer.com/article/10.1007%2Fs00138-013-0540-y

[5]
Miguel Angel Garcia-Garrido, Fast Traffic Sign Detection and Recognition Under Changing Lighting
Conditions, Sept 2006
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1706843

[5]
Andrzej Ruta, Video-based Traffic Sign Detection, Tracking and Recognition, 2009
http://www.brunel.ac.uk/~csstyyl/papers/tmp/thesis.pdf

[6]
Karla Brkic, An overview of traffic sign detection methods, 2010
https://www.fer.unizg.hr/_download/repository/BrkicQualifyingExam.pdf

[7]
Shu Wang, A New Edge Feature For Head-Shoulder Detection, 2013
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6738581

[8]
M. Liang, Traffic sign detection by ROI extraction and histogram features-based recognition, Aug
2013
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6706810&tag=1

[9]
Auranuch Lorsakul, Road Lane and Traffic Sign Detection & Tracking for Autonomous Urban Driving,
2000
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.6612&rep=rep1&type=pdf

[10]
Traffic Sign Recognition for Intelligent Vehicle/Driver Assistance System Using Neural Network on
OpenCV, 2007
http://bartlab.org/Dr.%20Jackrit's%20Papers/ney/3.KRS036_Final_Submission.pdf



[11]
Arturo de la Escalera, Road Traffic Sign Detection and Classification, Dec 1997
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=649946


[12]
Script Hook V v1.0.505.2a
By Alexander Blade, Frequently Updated
http://www.dev-c.com/gtav/scripthookv/

[13]
Script Hook V .NET v2.5.1
By Crosire, First Published: April 27, 2015, Frequently Updated
https://www.gta5-mods.com/tools/scripthookv-net

[14]
Theano Python Library
Developed by Machine Learning Group at the Université de Montréal
https://github.com/Theano/Theano

[15]
Deep Learning Tutorials, LeNet5.
http://deeplearning.net/tutorial/code/convolutional_mlp.py

[16]
The MNIST Database of handwritten digits, Yann LeCun, Corinna Cortes, and Christopher J.C. Burges
http://yann.lecun.com/exdb/mnist/

[17]
Lux Research, Self-driving Cars an $87 Billion Opportunity in 2030, Though None Reach Full
Autonomy, May 2014.
http://www.luxresearchinc.com/news-and-events/press-releases/read/self-driving-cars-87-billion-
opportunity-2030-though-none-reach

[18]
Brendan Sinclair, GTA V dev costs over $137 million, says analyst, Feb 2013.
http://www.gamesindustry.biz/articles/2013-02-01-gta-v-dev-costs-over-USD137-million-says-
analyst

[19]
C. Chen, A. Seff, A. Kornhauser, J. Xiao. DeepDriving: Learning Affordance for Direct Perception in
Autonomous Driving
http://deepdriving.cs.princeton.edu/

[20]
B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner. TORCS, The Open
Racing Car Simulator, 2014.
http://www.torcs.org

[21]
A. Filipowicz, D. Stanley, B. Zhang. TorcsNet in GTA 5, 2015.
http://devpost.com/software/pave-gtav-lane-detection - updates

[22]


S. Macy. GTA 5 Has Now Sold Over 60 Million Copies, 3 Feb 2016.
http://www.ign.com/articles/2016/02/03/gta-5-has-now-sold-over-60-million-copies

[23]
LISA-TS Dataset,
Andreas Møgelmose, Vision based Traffic Sign Detection and Analysis for Intelligent Driver
Assistance Systems: Perspectives and Survey, 2012.
http://cvrr.ucsd.edu/LISA/datasets.html

[24]
German Traffic Sign Detection Benchmark (GTSDB)
Sebastian Houben and Johannes Stallkamp and Jan Salmen and Marc Schlipsing and Christian Igel,
Detection of Traffic Signs in Real-World Images: The German Traffic Sign Detection Benchmark,
2013.
http://benchmark.ini.rub.de/

[25]
Convolutional Neural Network
https://en.wikipedia.org/wiki/Convolutional_neural_network

[26]
Theano Documentation
http://deeplearning.net/software/theano/theano.pdf

[27]
PAVE (Princeton Autonomous Vehicle Engineering) Laboratory GPU
https://www.princeton.edu/ris/projects/pave/

[28]
Michael Nielsen, Why are deep neural networks hard to train?, Jan 2016.
http://neuralnetworksanddeeplearning.com/chap5.html

[29]
ImageNet Competition, Stanford Vision Lab, Stanford University, Princeton University
http://image-net.org/

[30]
Dan Cireşan, Ueli Meier, Jonathan Masci and Jürgen Schmidhuber
Multi-Column Deep Neural Network for Traffic Sign Classification, 2012.
http://people.idsia.ch/~juergen/nn2012traffic.pdf

[31]
Markus Mathias, Radu Timofte, Rodrigo Benenson, and Luc Van Gool
Traffic Sign Recognition – How far are we from the solution?
http://rodrigob.github.io/documents/2013_ijcnn_traffic_signs.pdf

[32]
Radu Timofte , KUL Belgium Traffic Sign Dataset
http://btsd.ethz.ch/shareddata/

[33]
Swedish Traffic Sign (STS) Dataset
http://www.cvl.isy.liu.se/research/datasets/traffic-signs-dataset/


Appendix A

Code

A.1 Data Processing


There was a lot of data processing to get training, validation, and testing data sets.
Here is the code for working with the LISA-TS dataset and getting it into a format
readable by my Theano CNN:

A.1.1 Image and True-values Grouping and Processing:
AWK, grab all file annotations in LISA TS that are stop signs, output them to positivesX.csv:
#! /bin/awk -f
BEGIN{FS=";"}
$2=="stop" && (match($0,"vid0/") || match($1,"vid1/") || match($1,"vid2/") || match($1,"vid3/") ||
match($1,"vid4") || match($1,"vid5/")) {;print $1;}

AWK, grab all file annotations in LISA TS that are not stop signs, output them to negativesX.csv:
#! /bin/awk -f
BEGIN{FS=";"}
$2!="stop" && (match($0,"vid0/") || match($1,"vid1/") || match($1,"vid2/") || match($1,"vid3/") ||
match($1,"vid4") || match($1,"vid5/")) {;print $1;}

movePosFiles.sh, creates a folder of all the positive images (bash script):
#!/bin/bash
count=0
for file in $(cat ./positivesX.csv);
do
    echo $count
    cp /Volumes/exfat/LISA_TS/$file ./positiveImages/$count.png
    ((count++))
done;
sips -Z 140 ./positiveImages/*.png


moveNegFiles.sh, creates a folder of all the negative images:
#!/bin/bash
count=0
for file in $(cat ./negativesX.csv);
do
    echo $count
    cp /Volumes/exfat/LISA_TS/$file ./negativeImages/$count.png
    ((count++))
done;
sips -Z 140 ./negativeImages/*.png


Next, one can divide the positive and negative images into whatever ratio is desired
among the training, validation, and testing sets. I chose 5:1:1 as my ratio, and
initially had 700 total positive images and 700 total negative images. For other
experiments I tried different ratios and other data.

A.1.2 Reading, Formatting, and Pickling Data:
There is some work involved in packaging all the photos into the numpy data
structure that the Theano code needs in order to work with them. Here is how I did
it with the LISA-TS dataset, using Python:

from scipy import misc
import numpy
import os
import cPickle
import gzip
import theano
import theano.tensor as T

# alternate the order: even rows hold positive examples, odd rows negative examples

# 1000 training images, each 140 x 95 x 3 = 39900 values when flattened
train_set_x = numpy.empty([1000, 39900])
train_set_y = numpy.empty([1000, 1])

for i in range(0, 500):  # will do 0 through 499
    img = misc.imread('./binaryDetection140x95/pos500_training/' + str(i) + '.png')
    imgArray = img.ravel()
    train_set_x[i*2, ] = imgArray
    train_set_y[i*2, ] = 1

for i in range(0, 500):  # will do 0 through 499
    img = misc.imread('./binaryDetection140x95/neg500_training/' + str(i) + '.png')
    imgArray = img.ravel()
    train_set_x[i*2+1, ] = imgArray
    train_set_y[i*2+1, ] = 0

valid_set_x = numpy.empty([200, 39900])
valid_set_y = numpy.empty([200, 1])
test_set_x = numpy.empty([200, 39900])
test_set_y = numpy.empty([200, 1])

for i in range(0, 100):  # will do 0 through 99
    imgArrayV = misc.imread('./binaryDetection140x95/pos100_validation/' + str(i+530) + '.png').ravel()
    imgArrayT = misc.imread('./binaryDetection140x95/pos100_testing/' + str(i+660) + '.png').ravel()
    valid_set_x[i*2, ] = imgArrayV
    valid_set_y[i*2, ] = 1
    test_set_x[i*2, ] = imgArrayT
    test_set_y[i*2, ] = 1

for i in range(0, 100):  # will do 0 through 99
    imgArrayV = misc.imread('./binaryDetection140x95/neg100_validation/' + str(i+1060) + '.png').ravel()
    imgArrayT = misc.imread('./binaryDetection140x95/neg100_testing/' + str(i+800) + '.png').ravel()
    valid_set_x[i*2+1, ] = imgArrayV
    valid_set_y[i*2+1, ] = 0
    test_set_x[i*2+1, ] = imgArrayT
    test_set_y[i*2+1, ] = 0

# package as (train, validation, test) tuples of (x, y)
data = [(train_set_x, train_set_y.flatten()),
        (valid_set_x, valid_set_y.flatten()),
        (test_set_x, test_set_y.flatten())]

with gzip.open('stopsignAlter.pkl.gz', 'wb') as f:
    print "dumping"
    cPickle.dump(data, f)

print "done"

A.2 CNN Implementation (Theano)
There was a lot of work in getting to learn how Theano works in practice, and also
how a CNN works in theory, and then putting the two together to figure out how to
code up a working CNN architecture. Here is the code for different pieces of working
with CNNs and Theano, everything is done in python:

A.2.1 CNN architecture definition and training:
import os
import sys
import timeit

import numpy

import theano
import theano.tensor as T
from theano.tensor.signal import downsample
from theano.tensor.nnet import conv

from logistic_sgd_ss import LogisticRegression, load_data
from mlp import HiddenLayer


import cPickle

class LeNetConvPoolLayer(object):
    """Pool Layer of a convolutional network """

    def __init__(self, rng, input, filter_shape, image_shape, poolsize=(2, 2)):
        """
        Allocate a LeNetConvPoolLayer with shared variable internal parameters.

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.dtensor4
        :param input: symbolic image tensor, of shape image_shape

        :type filter_shape: tuple or list of length 4
        :param filter_shape: (number of filters, num input feature maps,
                              filter height, filter width)

        :type image_shape: tuple or list of length 4
        :param image_shape: (batch size, num input feature maps,
                             image height, image width)

        :type poolsize: tuple or list of length 2
        :param poolsize: the downsampling (pooling) factor (#rows, #cols)
        """

        assert image_shape[1] == filter_shape[1]
        self.input = input

        # there are "num input feature maps * filter height * filter width"
        # inputs to each hidden unit
        fan_in = numpy.prod(filter_shape[1:])
        # each unit in the lower layer receives a gradient from:
        # "num output feature maps * filter height * filter width" / pooling size
        fan_out = (filter_shape[0] * numpy.prod(filter_shape[2:]) /
                   numpy.prod(poolsize))
        # initialize weights with random weights
        W_bound = numpy.sqrt(6. / (fan_in + fan_out))
        self.W = theano.shared(
            numpy.asarray(
                rng.uniform(low=-W_bound, high=W_bound, size=filter_shape),
                dtype=theano.config.floatX
            ),
            borrow=True
        )

        # the bias is a 1D tensor -- one bias per output feature map
        b_values = numpy.zeros((filter_shape[0],), dtype=theano.config.floatX)
        self.b = theano.shared(value=b_values, borrow=True)

        # convolve input feature maps with filters
        conv_out = conv.conv2d(
            input=input,
            filters=self.W,
            filter_shape=filter_shape,
            image_shape=image_shape
        )

        # downsample each feature map individually, using maxpooling
        pooled_out = downsample.max_pool_2d(
            input=conv_out,
            ds=poolsize,
            ignore_border=True
        )

        # add the bias term. Since the bias is a vector (1D array), we first
        # reshape it to a tensor of shape (1, n_filters, 1, 1). Each bias will
        # thus be broadcasted across mini-batches and feature map width & height
        self.output = T.tanh(pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))

        # store parameters of this layer
        self.params = [self.W, self.b]

        # keep track of model input
        self.input = input


#learning rate 0.1 to 1
def evaluate_cnn(learning_rate=0.03, n_epochs=200,
dataset='stopsignAlter.pkl.gz',
nkerns=[15, 30, 30], batch_size=20):
""" Demonstrates lenet on MNIST dataset

:type learning_rate: float
:param learning_rate: learning rate used (factor for the stochastic
gradient)

:type n_epochs: int
:param n_epochs: maximal number of epochs to run the optimizer

:type dataset: string
:param dataset: path to the dataset used for training / testing (the stop sign pickle here)

:type nkerns: list of ints
:param nkerns: number of kernels on each layer
"""

rng = numpy.random.RandomState(23455)

datasets = load_data(dataset)

train_set_x, train_set_y = datasets[0]
valid_set_x, valid_set_y = datasets[1]
test_set_x, test_set_y = datasets[2]

# compute number of minibatches for training, validation and testing
n_train_batches = train_set_x.get_value(borrow=True).shape[0]
n_valid_batches = valid_set_x.get_value(borrow=True).shape[0]
n_test_batches = test_set_x.get_value(borrow=True).shape[0]
n_train_batches /= batch_size
n_valid_batches /= batch_size
n_test_batches /= batch_size



# allocate symbolic variables for the data
index = T.lscalar() # index to a [mini]batch

# start-snippet-1
x = T.matrix('x') # the data is presented as rasterized images
y = T.ivector('y') # the labels are presented as 1D vector of
# [int] labels

######################
# BUILD ACTUAL MODEL #
######################
print '... building the model'

# Reshape matrix of rasterized images of shape (batch_size, 3 * 95 * 140)
# to a 4D tensor, compatible with our LeNetConvPoolLayer
# (3, 95, 140) is the (channels, height, width) of the input images.
layer0_input = x.reshape((batch_size, 3, 95, 140))

# Construct the first convolutional pooling layer:
# filtering reduces the image size to (95-16+1 , 140-16+1) = (80, 125)
# maxpooling reduces this further to (80/2, 125/2) = (40, 62)
# 4D output tensor is thus of shape (batch_size, nkerns[0], 40, 62)
layer0 = LeNetConvPoolLayer(
rng,
input=layer0_input,
image_shape=(batch_size, 3, 95, 140),
filter_shape=(nkerns[0], 3, 16, 16),
poolsize=(2, 2)
)

# Construct the second convolutional pooling layer
# filtering reduces the image size to (40-8+1, 62-8+1) = (33, 55)
# maxpooling reduces this further to (33/2, 55/2) = (16, 27)
# 4D output tensor is thus of shape (batch_size, nkerns[1], 16, 27)
layer1 = LeNetConvPoolLayer(
rng,
input=layer0.output,
image_shape=(batch_size, nkerns[0], 40, 62),
filter_shape=(nkerns[1], nkerns[0], 8, 8),
poolsize=(2, 2)
)

# Construct the third convolutional pooling layer
# filtering reduces the image size to (16-5+1, 27-5+1) = (12, 23)
# maxpooling reduces this further to (12/2, 23/2) = (6, 11)
# 4D output tensor is thus of shape (batch_size, nkerns[2], 6, 11)
layer2 = LeNetConvPoolLayer(
rng,
input=layer1.output,
image_shape=(batch_size, nkerns[1], 16, 27),
filter_shape=(nkerns[2], nkerns[1], 5, 5),
poolsize=(2, 2)
)


# the HiddenLayer being fully-connected, it operates on 2D matrices of
# shape (batch_size, num_pixels) (i.e. a matrix of rasterized images).
# This will generate a matrix of shape (batch_size, nkerns[2] * 6 * 11),
# or (20, 30 * 6 * 11) = (20, 1980) with the default values.
layer3_input = layer2.output.flatten(2)

# construct a fully-connected sigmoidal layer
layer3 = HiddenLayer(
rng,
input=layer3_input,
n_in=nkerns[2] * 6 * 11,
n_out=500,
activation=T.tanh
)

# classify the values of the fully-connected sigmoidal layer
layer4 = LogisticRegression(input=layer3.output, n_in=500, n_out=2)

# the cost we minimize during training is the NLL of the model
cost = layer4.negative_log_likelihood(y)

# create a function to compute the mistakes that are made by the model
test_model = theano.function(
[index],
layer4.errors(y),
givens={
x: test_set_x[index * batch_size: (index + 1) * batch_size],
y: test_set_y[index * batch_size: (index + 1) * batch_size]
}
)

validate_model = theano.function(
[index],
layer4.errors(y),
givens={
x: valid_set_x[index * batch_size: (index + 1) * batch_size],
y: valid_set_y[index * batch_size: (index + 1) * batch_size]
}
)

# create a list of all model parameters to be fit by gradient descent
params = layer4.params + layer3.params + layer2.params + layer1.params + layer0.params

# create a list of gradients for all model parameters
grads = T.grad(cost, params)

# train_model is a function that updates the model parameters by
# SGD Since this model has many parameters, it would be tedious to
# manually create an update rule for each model parameter. We thus
# create the updates list by automatically looping over all
# (params[i], grads[i]) pairs.
updates = [
(param_i, param_i - learning_rate * grad_i)
for param_i, grad_i in zip(params, grads)
]



train_model = theano.function(
[index],
cost,
updates=updates,
givens={
x: train_set_x[index * batch_size: (index + 1) * batch_size],
y: train_set_y[index * batch_size: (index + 1) * batch_size]
}
)

getGrads = theano.function(
[index],
grads,
givens={
x: train_set_x[index * batch_size: (index + 1) * batch_size],
y: train_set_y[index * batch_size: (index + 1) * batch_size]
}
)
# end-snippet-1

###############
# TRAIN MODEL #
###############
print '... training'
# early-stopping parameters
patience = 10000 # look at this many examples regardless
patience_increase = 2 # wait this much longer when a new best is
# found
improvement_threshold = 0.995 # a relative improvement of this much is
# considered significant
validation_frequency = min(n_train_batches, patience / 2)
# go through this many
# minibatches before checking the network
# on the validation set; in this case we
# check every epoch

best_validation_loss = numpy.inf
best_iter = 0
test_score = 0.
start_time = timeit.default_timer()

epoch = 0
done_looping = False

while (epoch < n_epochs) and (not done_looping):
epoch = epoch + 1
for minibatch_index in xrange(n_train_batches):

iter = (epoch - 1) * n_train_batches + minibatch_index
theseGrads = getGrads(minibatch_index)

if iter % 50 == 0:
print 'training @ iter = ', iter


cost_ij = train_model(minibatch_index)

if (iter + 1) % validation_frequency == 0:

# compute zero-one loss on validation set
validation_losses = [validate_model(i) for i
in xrange(n_valid_batches)]
this_validation_loss = numpy.mean(validation_losses)
print('epoch %i, minibatch %i/%i, validation error %f %%' %
(epoch, minibatch_index + 1, n_train_batches,
this_validation_loss * 100.))

# if we got the best validation score until now
if this_validation_loss < best_validation_loss:

#improve patience if loss improvement is good enough
if this_validation_loss < best_validation_loss * \
improvement_threshold:
patience = max(patience, iter * patience_increase)

# save best validation score and iteration number
best_validation_loss = this_validation_loss
best_iter = iter

# test it on the test set
test_losses = [
test_model(i)
for i in xrange(n_test_batches)
]
test_score = numpy.mean(test_losses)
print((' epoch %i, minibatch %i/%i, test error of '
'best model %f %%') %
(epoch, minibatch_index + 1, n_train_batches,
test_score * 100.))
# save the best model

with open('best_model_c_ss_0n3.pkl', 'w') as f:
cPickle.dump(layer0, f)
f.close()
with open('best_model_c_ss_1n3.pkl', 'w') as f:
cPickle.dump(layer1, f)
f.close()
with open('best_model_c_ss_2n3.pkl', 'w') as f:
cPickle.dump(layer2, f)
f.close()
with open('best_model_c_ss_3n3.pkl', 'w') as f:
cPickle.dump(layer3, f)
f.close()
with open('best_model_c_ss_4n3.pkl', 'w') as f:
cPickle.dump(layer4, f)
f.close()

if patience <= iter:
done_looping = True
break



end_time = timeit.default_timer()
print('Optimization complete.')
print('Best validation score of %f %% obtained at iteration %i, '
'with test performance %f %%' %
(best_validation_loss * 100., best_iter + 1, test_score * 100.))
print >> sys.stderr, ('The code for file ' +
os.path.split(__file__)[1] +
' ran for %.2fm' % ((end_time - start_time) / 60.))

'''
import cPickle
save_file = open('bestTrained.p','wb')
#cPickle.dump(params.get_value(borrow=True),save_file,-1)
#cPickle.dump(params,save_file,-1)
cPickle.dump(layer3,save_file,-1)
save_file.close()
'''
if __name__ == '__main__':
evaluate_cnn()

def experiment(state, channel):
evaluate_cnn(state.learning_rate, dataset=state.dataset)

A.2.2 CNN Testing:
# the CNN can be tested on any image set, assuming it has first been formatted properly
import cPickle
import gzip
import os
import sys
import timeit

import numpy

import theano
import theano.tensor as T
from logistic_sgd import LogisticRegression, load_data
from convolutional_mlp import LeNetConvPoolLayer

def predict():
"""
An example of how to load a trained model and use it
to predict labels.
"""
# load the saved model
layer0 = cPickle.load(open('best_model_c_ss_0.pkl'))
layer1 = cPickle.load(open('best_model_c_ss_1.pkl'))
layer2 = cPickle.load(open('best_model_c_ss_2.pkl'))
layer3 = cPickle.load(open('best_model_c_ss_3.pkl'))
layer4 = cPickle.load(open('best_model_c_ss_4.pkl'))

# compile a predictor function
predict_model_4 = theano.function(


inputs=[layer4.input],
outputs=layer4.y_pred)

# We can test it on some examples from the test set
#dataset='stopsign.pkl.gz'
dataset='stopsignAlter.pkl.gz'
datasets = load_data(dataset)
test_set_x, test_set_y = datasets[2]
test_set_x = test_set_x.get_value()
#print test_set_x
#print test_set_x[1]

# compile a layer0 or layer1 CNN function
convPool_0 = theano.function(
inputs=[layer0.input],
outputs=layer0.output)

convPool_1 = theano.function(
inputs=[layer1.input],
outputs=layer1.output)

convPool_2 = theano.function(
inputs=[layer2.input],
outputs=layer2.output)

#mlp function
mlp_3 = theano.function(
inputs=[layer3.input],
outputs=layer3.output)

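# push one minibatch through the network by hand: three conv/pool layers, flatten, hidden layer, then the logistic output layer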
batchsize = 20
layer0input = test_set_x[0:batchsize,]
layer0input = layer0input.reshape(batchsize,3,95,140)
layer0output = convPool_0(layer0input)

layer1input = layer0output
layer1output = convPool_1(layer1input)

layer2input = layer1output
layer2output = convPool_2(layer2input)

layer3input = layer2output
layer3input = layer3input.reshape(batchsize,800)
layer3output = mlp_3(layer3input)

layer4input = layer3output
predicted_values = predict_model_4(layer4input)
print ("Predicted values for the first %i examples in the test set:" % batchsize)
print predicted_values[0:batchsize]
print test_set_y.eval()[0:batchsize]

print (predicted_values[0:batchsize]-test_set_y.eval()[0:batchsize])
#
#predicted_values = predict_model_3(test_set_x[0:200,])
#print ("Predicted values for the first 200 examples in test set:")


#print predicted_values
#print test_set_y.eval()[0:200]

predict()

A.2.3 MLP Architecture Definition and Training:


I decided against including this code, since it is a simpler version of the code in A.2.2. If you understand A.2.2, then the MLP should be relatively easy to implement.
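For reference, a minimal sketch of what such an MLP definition could look like is given below, assuming the same HiddenLayer and LogisticRegression helper classes imported in A.2.1 and the same 3 * 95 * 140 input size, 500 hidden units, and 0.03 learning rate. This is an illustrative sketch, not the exact code used for the experiments; the training loop would mirror the one in A.2.1.

import numpy
import theano
import theano.tensor as T
from logistic_sgd_ss import LogisticRegression
from mlp import HiddenLayer

rng = numpy.random.RandomState(23455)
x = T.matrix('x')    # rasterized 3 x 95 x 140 images
y = T.ivector('y')   # binary labels: stop sign vs. background

# a single fully-connected hidden layer working directly on the flattened pixels
layer_h = HiddenLayer(rng, input=x, n_in=3 * 95 * 140, n_out=500, activation=T.tanh)
layer_out = LogisticRegression(input=layer_h.output, n_in=500, n_out=2)
cost = layer_out.negative_log_likelihood(y)

# plain SGD updates over all parameters, as in A.2.1
params = layer_out.params + layer_h.params
grads = T.grad(cost, params)
updates = [(p, p - 0.03 * g) for p, g in zip(params, grads)]
train_model = theano.function([x, y], cost, updates=updates)
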
A.3 GTA V Hacking Code (ScriptHookV)
There was a lot of work involved in setting up ScriptHookV and ScriptHookVDotNet; since that setup can be done by following online tutorials, it is not included here. I will, however, include code that can be used to generate results similar to those in the live video (note that part of it is my code and part is Artur's, as together we are trying to combine our hacks, and eventually other people's hacks, with the goal of creating a fully autonomous system):

A.3.1 Code for live video related hacks:
using System; using System.Collections.Generic;
using System.Linq; using System.Text;
using System.Threading.Tasks; using System.IO;
using System.Drawing; using System.Drawing.Imaging;
using System.Windows.Forms; using System.Runtime.InteropServices;
using System.Diagnostics;
using GTA; using GTA.Native; using GTA.Math;

public class AutoDriveMod2 : Script
{
const float D_TO_R = (float)(Math.PI / 180.0);
Boolean enabled; Camera camera = null;
Queue<float> angles = new Queue<float>();
Queue<float> disps = new Queue<float>();
int memory = 1;
//private PictureBox pictureBox1 = new PictureBox();
//public event EventHandler RedLightFound;
Point p = new Point(550, 600);
Size s = new Size(100, 100);
int displayTime = 500;
private struct Rect {public int Left; public int Top; public int Right; public int Bottom;}
[DllImport("C:\\Windows\\System32\\user32.dll")]
private static extern IntPtr GetForegroundWindow();
[DllImport("C:\\Windows\\System32\\user32.dll")]
private static extern IntPtr GetWindowRect(IntPtr hWnd, ref Rect rect);
List<Vector2> points = new List<Vector2>();
List<Vector3> debugPoints = new List<Vector3>();
List<Vector3> debugPoints2 = new List<Vector3>();
UIText line1;UIText line2;float heightZ = 1f;


private UIContainer mContainer = null;


public AutoDriveMod2()
{
UI.Notify("loaded2");
Tick += OnTick; KeyUp += onKeyUp;
this.mContainer = new UIContainer(new Point(10, 10), new Size(200, 40),
Color.FromArgb(200, 237, 239, 241));
line1 = new UIText("Test 2", new Point(2, 0), 0.4f, Color.Black, 0, false);
this.mContainer.Items.Add(line1);
line2 = new UIText("", new Point(15, 234), 0.4f, Color.Black, 0, false);
this.mContainer.Items.Add(line2);
heightZ = 1f;
World.DestroyAllCameras();
camera = World.CreateCamera(new Vector3(), new Vector3(), 50);
camera.IsActive = true;
GTA.Native.Function.Call(Hash.RENDER_SCRIPT_CAMS, false, true, camera.Handle, true,
true);
}
int counter = 0;int counter2 = 0;float angle = 0;float disp = 0;float prevAngle = 0;
const String angleFileName =
"C:\\Users\\arg\\Documents\\thesis\\code\\currentangle.txt";
void OnTick(object sender, EventArgs e)
{
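// roughly every 30 ticks, clear nearby traffic: delete any vehicle within 10 m that is not painted ModshopBlack1 (the colour given to our own spawned car)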
if (counter % 30 == 1 && Game.Player.Character.IsInVehicle())
{
Vehicle car = Game.Player.Character.CurrentVehicle;
Vehicle[] vehicles = World.GetAllVehicles();
foreach (Vehicle v in vehicles)
{
if (v.PrimaryColor != VehicleColor.ModshopBlack1)
{
if (v.Position.DistanceTo(car.Position) < 10f){v.Delete();}
}
}
}
try
{
using (TextReader reader = File.OpenText(angleFileName))
{
string strAngle = reader.ReadLine();string strDisp = reader.ReadLine();
float angle1 = float.Parse(strAngle);float disp1 = float.Parse(strDisp);
if (angle1 != prevAngle)
{
prevAngle = angle1;
if (angles.Count >= memory){angles.Dequeue();disps.Dequeue();}
angles.Enqueue(angle1); disps.Enqueue(disp1);
angle = average(angles); disp = average(disps);
line1.Caption = "angle: " + angle.ToString("00.00") + "\ndisp: " +
disp.ToString("00.00");


counter2 = 3; // number of frames to show debug (until we might take the next picture)
}
}
}
catch (Exception exception){UI.Notify("error - python probably writing to file");}
if (counter2 > 0)
{
if (enabled){drawDebug(Game.Player.Character.CurrentVehicle);}
counter2--;
}
if (enabled){drive(angle, disp);}
if (Game.Player.Character.IsInVehicle())
{
camera.AttachTo(Game.Player.Character.CurrentVehicle, new Vector3(0f, 3f,
heightZ));
camera.Rotation = Game.Player.Character.CurrentVehicle.Rotation;
}
this.mContainer.Draw();counter++;
}
float average(Queue<float> list)
{
float sum = 0;
foreach (float f in list){sum += f;}
return sum / list.Count;
}
Vector3 vel;
float targetSpeed = 5f;
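// steering control: steer against the lateral displacement (quadratic in disp) and the heading angle,
// then blend 30% of the desired direction (at targetSpeed) into the car's current velocity each tick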
void drive(float angle, float disp)
{
Vehicle car = Game.Player.Character.CurrentVehicle;
float turn = -1 * disp * Math.Abs(disp) - angle;
float originalVelW = .7f;
float desiredVelW = 1f - originalVelW;
vel = car.ForwardVector + ((float)Math.Sin(turn / 180f * Math.PI)) * car.RightVector;
vel.Normalize(); car.Velocity = (originalVelW * car.Velocity + desiredVelW * targetSpeed
* vel);
}
const int IMAGE_HEIGHT = 210;const int IMAGE_WIDTH = 280;
void screenshot(String filename)
{
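// grab the foreground (game) window, downscale it, and save it as a 280 x 210 bitmap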
var foregroundWindowsHandle = GetForegroundWindow();
var rect = new Rect();
GetWindowRect(foregroundWindowsHandle, ref rect);
Rectangle bounds = new Rectangle(rect.Left, rect.Top, rect.Right - rect.Left, rect.Bottom
- rect.Top);
using (Bitmap bitmap = new Bitmap(bounds.Width, bounds.Height))
{
using (Graphics g = Graphics.FromImage(bitmap))
{


g.ScaleTransform(.2f, .2f);
g.CopyFromScreen(new Point(bounds.Left, bounds.Top), Point.Empty, bounds.Size);
}
Bitmap output = new Bitmap(IMAGE_WIDTH, IMAGE_HEIGHT);
using (Graphics g = Graphics.FromImage(output))
{
g.DrawImage(bitmap, 0, 0, IMAGE_WIDTH, IMAGE_HEIGHT);
}
output.Save(filename, ImageFormat.Bmp);
}
}

private void onKeyUp(object sender, KeyEventArgs e)
{
if (e.KeyCode == Keys.I)
{
if (Game.Player.Character.IsInVehicle())
{
enabled = true;
GTA.Native.Function.Call(Hash.RENDER_SCRIPT_CAMS, true, true, camera.Handle,
true, true);
}else{UI.Notify("Please enter a vehicle.");}
}
if (e.KeyCode == Keys.U){UI.Notify("red light detected");UI.DrawTexture("t_red_light.png", 1, 1, displayTime, p, s);}
if (e.KeyCode == Keys.J){UI.Notify("yellow light detected");UI.DrawTexture("t_yellow_light.png", 1, 1, displayTime, p, s);}
if (e.KeyCode == Keys.M){UI.Notify("green light detected");UI.DrawTexture("t_green_light.png", 1, 1, displayTime, p, s);}
if (e.KeyCode == Keys.K){UI.Notify("stop sign detected");UI.DrawTexture("t_stop_sign.png", 1, 1, displayTime, p, s);}
if (e.KeyCode == Keys.O)
{
enabled = false;
UI.Notify("Relinquishing control");
GTA.Native.Function.Call(Hash.RENDER_SCRIPT_CAMS, false, false, camera.Handle,
true, true);
}
if (e.KeyCode == Keys.N)
{
Vehicle vehicle = World.CreateVehicle(VehicleHash.Adder,
Game.Player.Character.Position + Game.Player.Character.ForwardVector * 3.0f,
Game.Player.Character.Heading + 90);
vehicle.CanTiresBurst = false;
vehicle.PrimaryColor = VehicleColor.ModshopBlack1;
vehicle.CustomSecondaryColor = Color.DarkOrange;
vehicle.PlaceOnGround();
vehicle.NumberPlate = " 888 ";
}
if (e.KeyCode == Keys.L)


{
enabled = !enabled;
if(enabled)
UI.Notify("activated self driving");
else
UI.Notify("deactivated self driving");
}
if (e.KeyCode == Keys.NumPad1){heightZ = 0f;}
if (e.KeyCode == Keys.NumPad2){heightZ = 0.5f;}
if (e.KeyCode == Keys.NumPad3){heightZ = 1.0f;}
if (e.KeyCode == Keys.NumPad4){heightZ = 1.5f;}
if (e.KeyCode == Keys.NumPad5){heightZ = 2.0f;}
if (e.KeyCode == Keys.NumPad9){heightZ = 3.0f;}
UI.Notify("endKeyUp");
}
void drawDebug(Vehicle car)
{
Vector3 forward = (float)Math.Cos(angle * D_TO_R) * car.ForwardVector -
(float)Math.Sin(angle * D_TO_R) * car.RightVector;
Vector3 right = -(float)Math.Sin(angle * D_TO_R) * car.ForwardVector -
(float)Math.Cos(angle * D_TO_R) * car.RightVector;
Vector3 center = car.Position + disp * right;
float r = 0.1f;
for (float i = 0; i < 15; i += .2f)
{
World.DrawMarker(MarkerType.DebugSphere, center + i * forward,
Vector3.WorldUp, new Vector3(1, 1, 1), new Vector3(r, r, r), Color.Blue);
}
if (enabled && false) // this is turned off
{
Vector3 debugV = car.Position + 15f * vel; r = .3f;
World.DrawMarker(MarkerType.DebugSphere, debugV, Vector3.WorldUp, new
Vector3(1, 1, 1), new Vector3(r, r, r), Color.Blue);
}}}
