
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Improving the Accuracy of 2D On-Road Object Detection Based on Deep Learning Techniques

YING YU

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Abstract

This paper focuses on improving the accuracy of detecting on-road objects, including cars, trucks, pedestrians, and cyclists. To meet the requirements of the embedded vision system and maintain a high detection speed in the advanced driving assistance system (ADAS) domain, the neural network model is designed to take single-channel images from a monocular camera as input.

In the past few decades, the forward collision avoidance system, a sub-system of ADAS, has been widely adopted in vehicular safety systems for its great contribution to reducing accidents. Deep neural networks, the state-of-the-art object detection technique, can be deployed in this embedded vision system with efficient computation on FPGAs and high inference speed. Aiming to detect on-road objects with high accuracy, this paper applies an advanced end-to-end neural network, the single-shot multi-box detector (SSD).

In this thesis work, several experiments are carried out on how to enhance the accuracy of SSD models with grayscale input. By adding suitable extra default boxes in high-layer feature maps and adjusting the entire scale range, the detection AP over all classes has been improved by around 20%, with the mAP of the SSD300 model increased from 45.1% to 76.8% and the mAP of the SSD512 model increased from 58.5% to 78.8% on the KITTI dataset. Besides, it has been verified that without color information, the model performance does not degrade in either speed or accuracy. Experimental results were evaluated using an Nvidia Tesla P100 GPU on the KITTI Vision Benchmark Suite, the Udacity annotated dataset and a short video recorded on a street in Stockholm.
Sammanfattning

This document focuses on improving the accuracy of detecting on-road objects, including cars, trucks, pedestrians and cyclists. To meet the requirements of the embedded vision system and maintain a high detection speed in the ADAS (advanced driving assistance system) domain, the neural network model is designed to take single-channel images from a monocular camera as input.

In recent decades, the forward collision avoidance system, a sub-system of ADAS, has been widely adopted in vehicular safety systems for its great contribution to reducing accidents. Deep neural networks, the state-of-the-art technique for object detection, can be realized in this embedded vision system with efficient computation on FPGAs and high inference speed. Aiming to detect on-road objects with high accuracy, we apply an advanced neural network, the single-shot multi-box detector (SSD).

In this thesis work, several experiments are carried out on how to improve the accuracy of the SSD models with grayscale input. By adding suitable extra default boxes in high-layer feature maps and adjusting the entire scale range, the detection AP over all classes has been improved by around 20%, with the mAP of the SSD300 model increased from 45.1% to 76.8% and the mAP of the SSD512 model on the KITTI dataset increased from 58.5% to 78.8%. Furthermore, it has been verified that without color information the model does not degrade in either speed or accuracy. Experimental results were evaluated using an Nvidia Tesla P100 GPU on the KITTI Vision Benchmark Suite, the Udacity annotated dataset and a short video recorded on a street in Stockholm.
Acknowledgment

I would like to thank my internship company Bitsim AB and its business manager, Mr. Sivard, who offered me the chance to do such an interesting and challenging thesis project. During the past five months, I have obtained a lot of industrial experience and practical knowledge which will help me a lot in my future career. My supervisor Andreas Gustafsson at Bitsim gave me much valuable advice on the right path to take to solve problems, and Hanwei Wu, my supervisor at KTH, offered me his kind help when I encountered technical doubts. Professor Markus Flierl, my examiner at KTH, has been supportive all the time and given me more faith in this project. My colleague Andrea Leopardi, who works together with me at Bitsim, has been a great listener and a helper, discussing the crucial problems with me and finding the right answers.

Besides, I really appreciate the help and support I got from my parents, my friends, and especially my boyfriend, who always motivates me to keep moving forward and working hard. I have learned a lot from him and gained more patience and passion for my work. I will keep my passion and my curiosity in my future work.
Contents

1 Introduction 1
1.1 Background and motivation . . . . . . . . . . . . . . . . . . . . . 2
1.2 Overview of the work . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Literature Review 4
2.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . 5
2.2.1 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Activation . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 2D Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Region-proposal methods . . . . . . . . . . . . . . . . . . 8
2.3.2 End-to-end learning systems . . . . . . . . . . . . . . . . 9

3 Methodology 12
3.1 Dataset preprocessing . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Grayscale image from UYVY format . . . . . . . . . . . . 12
3.1.2 Data augmentation . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Caffe framework . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.1 Modified input layer . . . . . . . . . . . . . . . . . . . . . 15
3.3.2 Multi-scale feature maps . . . . . . . . . . . . . . . . . . . 16
3.3.3 Default box selection . . . . . . . . . . . . . . . . . . . . . 18
3.3.4 Hard negative mining . . . . . . . . . . . . . . . . . . . . 19
3.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.1 Loss function . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.2 Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.3 Weight update . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.4 Regularization . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5.1 Non-Maximum Suppression . . . . . . . . . . . . . . . . . 22
3.6 Fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.7 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4 Experimental Results and Analysis 26


4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.1 KITTI 2D object detection dataset . . . . . . . . . . . . . 27
4.1.2 Udacity Annotated Datasets . . . . . . . . . . . . . . . . 27

4.2 Pre-trained model testing . . . . . . . . . . . . . . . . . . . . . . 28
4.2.1 Performance on KITTI dataset . . . . . . . . . . . . . . . 29
4.2.2 Performance on Udacity dataset . . . . . . . . . . . . . . 30
4.3 Fine-tuning based on the original design . . . . . . . . . . . . . . 32
4.3.1 Hyperparameter selection . . . . . . . . . . . . . . . . . . 32
4.3.2 Performance for both datasets . . . . . . . . . . . . . . . 32
4.4 Enhancement on detection accuracy . . . . . . . . . . . . . . . . 33
4.5 K-means evaluation on default boxes . . . . . . . . . . . . . . . . 35
4.5.1 Analysis on KITTI dataset . . . . . . . . . . . . . . . . . 35
4.5.2 Analysis on Udacity dataset . . . . . . . . . . . . . . . . . 38
4.5.3 Algorithm validation . . . . . . . . . . . . . . . . . . . . . 40
4.6 Color information analysis . . . . . . . . . . . . . . . . . . . . . . 41
4.7 Inference time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5 Conclusions and Future Work 46


5.1 Ethical and social impacts . . . . . . . . . . . . . . . . . . . . . . 46
5.2 Limitations and future work . . . . . . . . . . . . . . . . . . . . . 47
5.2.1 Small object detection . . . . . . . . . . . . . . . . . . . . 47
5.2.2 Long-time training . . . . . . . . . . . . . . . . . . . . . . 48
5.2.3 Limited generality . . . . . . . . . . . . . . . . . . . . . . 49
5.2.4 Distance measurement . . . . . . . . . . . . . . . . . . . . 50
5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

A Detection examples from KITTI 57

B Detection examples from Udacity test set 59

C Detection examples from Stockholm street video 61

Chapter 1

Introduction

The advanced driving assistance system (ADAS) has drawn growing attention in the global autonomous driving market for its great value in improving traffic efficiency and safety. The concept is an umbrella term for several high-tech sub-systems which are already widely used, such as anti-lock braking systems, parking sensors and automotive navigation systems. The collision avoidance system (pre-crash system), one promising technique among them, can dramatically reduce road fatalities through accurate detection of impending forward collisions and timely reactions. Recent advances in electronics, control systems, processors and communications now allow for the design of collision avoidance systems with increased sophistication, reduced cost and high reliability [30].

A typical collision avoidance system detects and recognizes vehicles, pedestrians, traffic signs and the like around a vehicle using radar, laser radar, GPS, cameras, etc. It provides a warning signal to the driver or takes timely action by steering and/or braking autonomously when a collision is impending. Among the above four sensors, the camera has the cheapest material cost but can interpret scenes better than the rest because it perceives color. A pair of forward-looking cameras, also called stereo vision, is able to provide 3D depth perception of the environment ahead. However, a fine-resolution camera collects a massive amount of data (millions of pixels in each frame), which requires intensive computation and complex algorithms for processing. Recent advances in hardware computation efficiency have enabled the rapid development of software algorithms, especially for embedded vision systems.

During the past few years, vision-based object detection systems have been significantly enhanced by machine learning algorithms. The Deep Neural Network (DNN) [20], a powerful branch of machine learning methods, has delivered a number of stunning achievements on many benchmarks. The concept of the DNN is inspired by the biological neural networks that constitute animal brains, which have far more complicated architectures by comparison. Currently, DNNs are the state of the art in object detection, with a promising future in improving both detection accuracy and speed. Compared to classical approaches based on hand-crafted features, such as histograms of oriented gradients (HOG), they produce more robust detections in less constrained environments. As figure 1.1 shows, the expected output is that all target objects, including cars, in a grayscale image taken in daytime light are localized and recognized with high accuracy.

Figure 1.1: Vehicle detection in a grayscale image.

1.1 Background and motivation


Collision avoidance systems have greatly decreased the rate of injury and death in accidents. With the development of both software algorithms and hardware components, cameras are becoming more precise at detecting multiple objects in real-time scenarios using DNNs. But real-world industrial cases are more demanding and complex than simple implementations in the experimental phase on a pure personal computer environment. Practical issues such as color channel transformation, power consumption, processing speed and accuracy are constantly discussed to ensure the adoption of DNNs in industrial applications.

As DNNs involve a huge number of matrix calculations and other operations which can be massively parallelized, graphics processing units (GPUs) are able to complete this task better than central processing units (CPUs), thanks to their many computational units and higher memory bandwidth. For industrial purposes, there are usually two main phases in a DNN implementation: the training phase and the inference phase. DNN model training with high throughput is carried out on one or multiple GPUs, taking from hours to weeks to complete. However, the inference of DNNs in real-world applications requires high-speed processing and power efficiency, so dedicated hardware such as FPGA-based embedded vision platforms is applied. The deployment of DNNs on embedded systems is out of our scope. This paper focuses on the accuracy of the DNN model in the driving context under the constraint of real-time processing.

In many industrial fields, object detection algorithms are required to work on grayscale images. To maintain a high speed for real-time detection, the Y pixel format has been used, which may degrade the detection results due to the loss of color information. Therefore, our research explores whether the loss of color information influences the performance of the selected DNN algorithm by reducing the precision and recall of detecting multiple objects in the test set.

1.2 Overview of the work


This chapter has briefly introduced the motivation and the thesis work. Chapter 2 provides background, including basic knowledge of DNNs and some state-of-the-art methods for 2D object detection. There are two main branches of DNN models, achieving the highest speed and the highest accuracy respectively. Among those DNN models, this paper selects the Single Shot MultiBox Detector (SSD), a one-stage learning system, as the basic research methodology and provides justifications for this choice.

Chapter 3 elaborates the preparation work, the main structure of the SSD model and the evaluation metrics in detail. It also provides reasons for these design choices and analyzes potential problems with the chosen method. Through the steps of the training and testing process, several algorithms and techniques such as weight update and regularization are explained. Notably, the models are fine-tuned from pre-trained models rather than trained from scratch.

All experimental results are presented in chapter 4, together with descriptions of the datasets used in the experiments. This chapter discusses the reasons behind the results of the original and the modified models, and then presents the K-means clustering scheme for addressing the low accuracy problem. The fine-tuned models are evaluated using the standard metrics described in chapter 3. Though this paper focuses on improving accuracy, the time latency is also listed but not discussed. Finally, the conclusion is presented in chapter 5, together with the potential ethical problems, the limitations of the work and future improvements.

Chapter 2

Literature Review

This chapter reviews the history of DNNs, recent advances in DNN approaches for object detection tasks and descriptions of other DNN models comparable to SSD. The base network models for object classification tasks are also introduced, as they are closely related to our object detection task.

2.1 Artificial Neural Networks


As introduced before, the Artificial Neural Network (ANN) is a popular machine learning algorithm consisting of a collection of connected units. Though ANNs have existed for decades, attempts at training deep ANN architectures failed until Geoffrey Hinton's breakthrough work in the mid-2000s. In addition to algorithmic achievements, the increase in computing capability using GPUs and the collection of larger datasets have all boosted the recent surge of ANN development.

An ANN usually consists of one input layer, one output layer and one or more hidden layers, each of which contains one or more units/neurons/nodes. We call them "units" in this paper. Each artificial unit processes the received signal in a certain way and passes it to the connected units, which can be seen as a simplified form of a biological neuron in an animal brain. The linear filter and the bias are generally called weights, and they can be learned from the training data.

Figure 2.1 shows an example of a feedforward 3-layer ANN model, with full connections from the units in one layer to the units in the next layer. To tackle complex problems in many fields, an ANN often contains many hidden layers to extract semantically stronger features; it is then also called a deep artificial neural network, known as a "DNN". Since the early 2000s [38], DNNs have developed rapidly and have been applied in a wide range of applications such as vision, robotics, video analytics, speech recognition, natural language processing, targeted advertising, and web search [16, 31, 49, 4, 29]. For vision-based tasks, convolutional neural network (CNN) models are commonly applied due to their strong and mostly correct assumptions about the nature of images.

Figure 2.1: Example of a three-layered ANN model.

2.2 Convolutional Neural Networks


The Convolutional Neural Network is a branch of deep, feed-forward neural networks, usually designed for visual imagery problems. Usually, when we talk about deep learning, it refers to deep convolutional neural networks rather than deep reinforcement learning. Krizhevsky et al. [29] reached a milestone in 2012 with AlexNet, an 8-layer deep convolutional neural network (CNN) that achieved 63.8% accuracy in the ImageNet Challenge ILSVRC 2012 [42], improving by 10% over the prior year's best efforts [24]. As figure 2.2 shows, as the structure of AlexNet goes deeper, the object features it can interpret become more complex. The first convolutional layer is only able to distinguish colors and simple textures, while the fifth layer is able to extract complex and key features such as the shape of a sunflower.

With the success of AlexNet, a growing number of computer vision and machine learning researchers focused on using successively more intricate CNNs to solve image classification and other problems. Many new model structures have been proposed and verified, updating the accuracy records of several challenging recognition competitions, such as ImageNet [42] and MS COCO [35].

Figure 2.2: Extracted features of the first 5 convolutional layers of AlexNet, which classifies 1000 classes of images (ImageNet).

There are many operations in a CNN, such as convolution, activation, pooling and fully connected layers, that form the hidden layers and process the image input to produce the expected output. A convolution and an activation function usually compose a convolutional layer, and a pooling function composes a pooling layer. We will not go into detail about fully connected layers, since our chosen model is fully convolutional.

2.2.1 Convolution
The convolution function is the core building block of a CNN. A one-dimensional convolution can be denoted as equation 2.1. It is a discrete operation between a one-dimensional array x and a one-dimensional filter w. With 2D images as input signals, the convolutional filter (kernel) is also two-dimensional and slides along the width and height axes of the input, as presented in equation 2.2.

Besides width and height, images have another dimension, "depth", with value 1 for grayscale images and 3 for RGB images. If we denote the size of an input as W_I × H_I × D_I and the size of the convolutional filter as W_f × H_f × D_f, then for each layer D_f must equal D_I. The output of sliding the filter across the input is often called a feature map, which represents the cross-correlation between the pattern within the filter and local features of the input. Thanks to this translation-invariant property, a CNN layer can detect the same features in different parts of the image. The depth of a feature map depends only on the number of filters in the layer that produced it. The width and height of the feature map are determined by the width and height of the input, the stride of the filter and the number of padding zeroes.
s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a) \, w(t - a)    (2.1)

S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(i - m, j - n) \, K(m, n)    (2.2)
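As a concrete illustration of equation 2.2 and the shape rules above, the following minimal sketch (plain NumPy, not taken from the thesis code) applies a single 3 × 3 filter to a grayscale image with a given stride and zero padding; the output side length follows (W_I − W_f + 2P)/stride + 1.

```python
import numpy as np

def conv2d_single(image, kernel, stride=1, padding=0):
    """2D cross-correlation of one grayscale image with one filter."""
    h_i, w_i = image.shape
    h_f, w_f = kernel.shape
    padded = np.pad(image, padding, mode="constant")   # zero-pad all sides
    h_o = (h_i - h_f + 2 * padding) // stride + 1
    w_o = (w_i - w_f + 2 * padding) // stride + 1
    out = np.zeros((h_o, w_o))
    for i in range(h_o):
        for j in range(w_o):
            patch = padded[i * stride:i * stride + h_f,
                           j * stride:j * stride + w_f]
            out[i, j] = np.sum(patch * kernel)          # elementwise product, then sum
    return out

img = np.random.rand(300, 300)           # grayscale input, depth 1
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])           # an example 3x3 filter
fmap = conv2d_single(img, kernel, stride=1, padding=1)
print(fmap.shape)                         # (300, 300): "same" padding keeps the size
```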

Besides, Levine and Shefner (1991) defined a receptive field (RF) as "an area in which stimulation leads to a response of a particular sensory neuron" [33]. For an object detection CNN, the RF can be explained as the region in the input that a CNN feature, after one or more convolutional layers, is looking at. As depicted in figure 2.3, after cascading two 3 × 3 convolutional filters, each feature in the second layer has a receptive field of size 5 × 5 in the input space.

The concept of the RF is important to object detection tasks since it provides insight into which region of the input a default box in a certain layer is targeting. In general, the RF l_k of layer k is defined by equation 2.3, where l_{k-1} is the RF of layer k−1, f_k is the filter size (the height and the width are equal) and s_i is the stride of layer i. Equation 2.3 computes the RF from bottom to top.
l_k = l_{k-1} + (f_k - 1) \prod_{i=1}^{k-1} s_i    (2.3)
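A small helper (an illustrative sketch, not from the thesis code) can evaluate equation 2.3 layer by layer; for two stacked 3 × 3, stride-1 convolutions it reproduces the 5 × 5 receptive field of figure 2.3.

```python
def receptive_field(layers):
    """layers: list of (filter_size, stride) tuples, bottom to top.
    Returns the receptive field in input space after each layer."""
    rf = 1          # a single input pixel sees itself
    jump = 1        # product of the strides of all layers below
    fields = []
    for f, s in layers:
        rf = rf + (f - 1) * jump    # equation 2.3
        jump *= s
        fields.append(rf)
    return fields

print(receptive_field([(3, 1), (3, 1)]))          # [3, 5]
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # conv, 2x2 pool, conv
```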

Figure 2.3: Receptive field example.

2.2.2 Activation
The activation function is a critical part of a CNN. It introduces a non-linear property into the CNN, which is necessary for learning complex functional mappings from input data, especially unstructured data such as images, videos and speech. Without the activation component, the neural network would become a linear regression model with limited power in describing complex features.

In modern neural networks, the popular recommendation for the activation function is the Rectified Linear Unit (ReLU) [25], defined as g(z) = max{0, z}. Though it yields a non-linear transformation, ReLU remains linear on each of the two ranges z < 0 and z ≥ 0, and thus still preserves the capability of generalizing linear models [14].

2.2.3 Pooling
Max-pooling layers of size 2 × 2 with a stride of 2 have been used in many popular network architectures, such as the VGG16 network depicted in figure 2.4. They extract the maximum value in each 2 × 2 block of the output of the previous convolutional layer. In this case, even if there is a small translation of the input, the pooling output stays the same or changes only slightly, so the final output is not affected. Pooling can also improve the statistical efficiency of the network model [14].

2.3 2D Object Detection


As CNNs represent the state of the art, we apply a deep, feed-forward CNN model to our 2D multiple object detection task. CNN models are designed to imitate the behavior of the visual cortex. Compared to other object detection methods, a CNN needs little image preprocessing, which saves a lot of time and effort on the extraction of hand-crafted features, and it is easier to train since it has far fewer parameters than a fully connected network with the same number of hidden units.

In the object detection research field, there are mainly two model structures: region-proposal methods with two-stage learning and end-to-end learning systems.

Method                  data      mAP    FPS
Fast R-CNN              07++12    68.4   0.5
Faster R-CNN VGG-16     07++12    70.4   7
YOLO                    07++12    57.9   45
YOLOv2 544              07++12    73.4   40
Faster R-CNN ResNet     07++12    73.8   5
SSD300                  07++12    72.4   46
SSD512                  07++12    74.9   19

Table 2.1: VOC2012 test results for different detection frameworks

Table 2.1 presents some of the state-of-the-art DNNs for object detection tasks; the results are measured on the VOC2012 test set.

2.3.1 Region-proposal methods


Region-proposal methods usually contain two stages: region proposal generation and object classification. After the achievements in the image classification area, it is not difficult to extend classification to detection by applying the classifier to sliding windows that vary in size and location within one image. But such an exhaustive search for the target objects is computationally costly and inefficient. Thus, researchers came up with region proposal generators, such as selective search [48] or edge boxes [51], which suggest promising windows that are more likely to contain an object.

In 2013, NYU published the Overfeat algorithm for using deep learning in object detection, which won the localization task of ILSVRC2013 [43]. It introduced a novel method for object localization and classification by accumulating predicted bounding boxes, integrated with a single CNN. Quickly after Overfeat, regions with CNN features (R-CNN) [13] was published, bringing an almost 50% improvement on the object detection challenge. It combines selective search, a CNN and SVMs as a three-stage method. However, R-CNN runs the CNN on each object proposal independently without sharing computation, leading to expensive training in space and time and resulting in low-speed detection as well. Later on, the spatial pyramid pooling network (SPP-Net), introduced by He et al. [19], was proposed to speed up R-CNN by sharing computation. It generates a convolutional feature map over the whole input image and classifies each object proposal by extracting feature vectors from this map, thus avoiding repeated evaluation of the DNN on each proposal.

However, these methods all have multi-stage pipelines [12]. SPPNet reduced the training time of R-CNN by a factor of three thanks to its faster proposal feature extraction, but its convolutional feature map cannot be updated during training, leading to limited accuracy. Inspired by these two networks, Ross Girshick [12] developed a more effective object classification method called Fast R-CNN. Instead of using SVM classifiers, it applies fully connected layers on top of Region of Interest (RoI) Pooling over the feature map, with a feed-forward network for classification and bounding box regression. Excluding the time for generating region proposals, this improves both accuracy and speed by a large extent. After that, the classification part can run in nearly real time, but the proposals are relatively time-consuming and still remain the computational bottleneck in object detection tasks.

After Simonyan et al. created the successful DNN VGGNet [44], using only 3 × 3 convolutions with up to 19 layers, Ren et al. introduced Faster R-CNN [41], upgraded from Fast R-CNN and based on a VGG-16 architecture. The most impressive step they made was applying a region proposal network (RPN) instead of selective search. It achieves a frame rate of 5 fps (including all steps) on a GPU. The RPN is a large breakthrough in the speed of detecting regions of interest.

In summary, there are various ways of multi-scale object localization. One of them is to generate image/feature pyramids of multiple scales, as in CNN-based methods like Overfeat, SPPNet and Fast R-CNN, which turns out to be effective but computationally costly. Another method is to use multi-scale sliding windows on the feature map, which effectively changes the size of the filter. For instance, Faster R-CNN fixes a single scale for the feature map and a single size for the sliding window, and uses a novel pyramid of anchors. The bounding box regression and classification rely only on these anchors of various scales and aspect ratios, thereby dramatically reducing the number of parameters in the output layer.

2.3.2 End-to-end learning systems


Another branch of methods are end-to-end learning algorithms using a single neural network. Famous examples are YOLO (You Only Look Once) [39], YOLOv2 [40] and SSD (Single Shot MultiBox Detector) [36]. The YOLO algorithm divides the input image into a 7 × 7 grid of cells and performs localization and classification in each cell at the same time. It dramatically reduces the detection time by reframing detection as a regression problem, from an input image to bounding box coordinates and class probabilities. However, it only predicts rough object locations, so it achieved a breakthrough in test speed but not in accuracy, which still lagged behind the state of the art at the time. Later on, YOLOv2, which removes the two fully connected layers at the end of YOLO and uses anchor boxes, outperformed the state of the art and reached among the highest speed and accuracy of all detection models, offering performance comparable to the SSD model.

At the end of 2016, SSD [36] made an impressive step by refreshing the records in both accuracy and speed. It scores over 74% mAP (mean Average Precision) on VOC2007 [24] at 59 frames per second (FPS) on an Nvidia Titan X for 300 × 300 input images [36]. The key to its success is the use of multi-scale feature maps for more accurate bounding box regression, explained in section 3.3; those maps are generated from extra convolutional layers added to the truncated VGG16 model. The original structure of VGG16 is presented in figure 2.4. It consists of 16 weight layers, of which 13 are convolutional layers with filters of uniform size (3 × 3). These convolutional layers all use "same" padding, with padding 1 and stride 1. In each convolutional layer, there is one convolution followed by one ReLU activation.

SSD reuses the computation from the VGG16 model, thereby saving a lot of time. It discards the fully connected layers in VGG16, which are mainly used to produce the classification output, and appends a set of convolutional layers to the end of the truncated VGG, which enables extracting features at multiple scales and progressively decreases the size of the input to each subsequent feature map. Inspired by Szegedy's work on MultiBox [47], SSD associates default boxes varying in scale and aspect ratio with the extracted feature maps of different resolutions. Moreover, over 80% of the inference time is spent on the image classification base network (VGG16), which implies that an improvement in the base network will also boost the detection speed of SSD.

There are many improved versions of SSD, such as the Deconvolutional Single Shot Detector (DSSD) [10] and Rainbow SSD (R-SSD) [26]. DSSD applies deconvolutional layers to the multi-scale feature maps and uses a ResNet feature extractor instead of VGGNet; it improves the accuracy, especially for small objects, but increases the time latency. Other attempts such as R-SSD have optimized the SSD model on a different dataset, but have not succeeded in upgrading the overall performance by a large extent.

For these end-to-end learning systems, the selection of the base network used as a classifier is also important, as it directly affects the inference time and the classification scores. In recent years, more advanced base feature extractors have been released and image classification has gained several significant improvements, thereby boosting the speed and accuracy of object detection models as well.

As mentioned above, the VGGNet we use in our experiments is a simple architecture with only 3-by-3 filters, a stride of 1 in the convolutional layers and a stride of 2 in the max-pooling layers. Later on, GoogLeNet [46], also known as the Inception module, reduced the number of parameters from 60 million (AlexNet) to 4 million and won ILSVRC 2014. It goes deeper via parallel paths with different receptive field sizes and uses 1 × 1 convolutions for dimension reduction before costly convolutions. To get a DNN model closer to the biological neuron architecture, Kaiming He came up with the Residual Neural Network (ResNet) using skip connections, which won ILSVRC 2015 by lowering the top-5 error rate to 3.6% with 152 layers in total. Since then, there have been no breakthroughs comparable to ResNet. These achievements in image classification can lead directly to accuracy improvements in object detection. However, they also increase the inference time by a large extent, which is why we do not consider them in our real-time scenario.

Figure 2.4: VGG16 model.

Chapter 3

Methodology

3.1 Dataset preprocessing


In the ADAS context, the video stream used by the DSP (Digital Signal Processor) controller in the embedded vision system is in the UYVY pixel format, which is quite different from the RGB color format. Although it is feasible to convert UYVY to RGB in experiments, by first unpacking UYVY to YUV and then converting to RGB, the total processing time increases, so this is not the best solution for a real-time system. To maintain the speed of the DNN in this real-time scenario, this paper explores how to improve object detection accuracy on grayscale images (the Y channel).

3.1.1 Grayscale image from UYVY format


Before explaining the details of data preprocessing, this paper introduces the UYVY color format. UYVY is a packed YUV 4:2:2 format, in which the luminance component (Y) is sampled at every pixel, and the chrominance components (U and V) are sampled at every second pixel horizontally on each line [23]. YUV formats can be divided into two groups: the packed formats, where the Y, U and V components are packed together into macropixels stored in a single array, and the planar formats, where each component is stored in a separate array and the final image is a fusion of the three separate planes. The reason why YUV formats are preferred for displaying digital video signals is that they were developed to provide compatibility between color and black-and-white analog television systems and to imitate human vision. They allow reduced bandwidth for the chrominance components, so that transmission errors can be masked efficiently by human perception. Figure 3.1 shows one example image and figure 3.2 shows its separate RGB and YUV channel visualizations.
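As an illustration of how the Y channel can be pulled out of a packed UYVY stream without a full RGB conversion, the sketch below (illustrative only and not taken from the thesis code; the function and variable names are assumptions) slices every second byte of the macropixel sequence U0 Y0 V0 Y1.

```python
import numpy as np

def uyvy_to_gray(frame_bytes, width, height):
    """Extract the luminance (Y) plane from a packed UYVY 4:2:2 frame.

    UYVY stores two pixels in four bytes: U0 Y0 V0 Y1, so the Y samples
    sit at odd byte offsets 1, 3, 5, ...
    """
    raw = np.frombuffer(frame_bytes, dtype=np.uint8)
    raw = raw.reshape(height, width * 2)   # 2 bytes per pixel
    return raw[:, 1::2]                    # keep only the Y bytes

# Example: a dummy 4x2 frame (16 bytes).
frame = bytes(range(16))
gray = uyvy_to_gray(frame, width=4, height=2)
print(gray.shape)   # (2, 4): a single-channel image
```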

3.1.2 Data augmentation


This paper chooses to use the single Y component (grayscale) of images for object detection. A grayscale image, also known as a black-and-white or monochrome image, contains pixels whose values vary from 0 to 255 in digital formats. The value 0 means black with the weakest intensity and the value 255 means white with the strongest intensity. Grayscale images have been widely used in medical imaging, monitoring systems, etc.

Figure 3.1: An original image example.

When using a DNN to detect objects in RGB images, data augmentation is often applied to the original dataset to make the model more robust and enhance the detection accuracy. Obviously, RGB images contain more information, such as hue and saturation, than grayscale images. It has been shown that in object detection tasks using classical computer vision methods, the additional use of color information performs better than using shape information alone [28]. Meanwhile, some studies claim that color information is not used efficiently in deep learning models [17]; in other words, adding color information to the model input has a negligible effect on the results. To get a clearer understanding of the importance of color information, this paper explores this question in section 4.6.

Moreover, there are different conditions when people are driving vehicles, such as rain, weak lighting and fog, which lead to significant changes in the brightness and contrast of the camera view. To enhance the performance of SSD, distortions of contrast and brightness have been added to the grayscale training set, and random flipping is applied to handle the detection of objects facing different sides.

Brightness Brightness is a relative term, describing how bright an image appears compared to a reference image, based on our visual perception. For grayscale images, the mean pixel intensity can be used to change the brightness. Higher brightness corresponds to weather conditions with more reflected sunlight. In our experiments, this global attribute has been chosen as a data augmentation method for all experiments. This paper applies a brightness offset to the intensities of all pixels in each image, drawn from the range [−32, 32], with a probability of 0.5.

Figure 3.2: Separate RGB and YUV channel visualizations: (a) R, (b) Y, (c) G, (d) U, (e) B, (f) V.

Contrast Contrast is the difference in visual properties that makes an object distinguishable from other objects and the background. In visual perception of grayscale images, it is determined by the difference in brightness between the object and other objects. The human visual system is more sensitive to contrast than to absolute luminance. We have scaled the contrast of each image by a factor in the range [0.5, 1.5], with a probability of 0.5. The brightness and contrast changes can be represented as equation 3.1, where mean_value is the mean pixel intensity over all images in the dataset, not calculated per image.

f(x) = contrast_factor × (x − mean_value) + mean_value + brightness    (3.1)

Flipping We apply random horizontal flipping to the dataset with a probability of 0.5. It efficiently addresses the problem of detecting objects facing different sides, as sketched below.
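The three augmentations above can be combined into one function. The sketch below is a minimal NumPy version under the stated parameters (brightness in [−32, 32], contrast factor in [0.5, 1.5], each applied with probability 0.5, mean value 96 as in section 3.3.1); it is not the exact Caffe transformation pipeline used in the experiments, and mirroring of the ground-truth boxes under flipping is omitted.

```python
import numpy as np

def augment_gray(img, mean_value=96.0, rng=np.random):
    """Random brightness/contrast distortion and horizontal flip
    for a single grayscale image (uint8, HxW)."""
    out = img.astype(np.float32)
    if rng.rand() < 0.5:                       # brightness shift
        out += rng.uniform(-32, 32)
    if rng.rand() < 0.5:                       # contrast scaling, eq. 3.1
        factor = rng.uniform(0.5, 1.5)
        out = factor * (out - mean_value) + mean_value
    if rng.rand() < 0.5:                       # horizontal flip
        out = out[:, ::-1]
    return np.clip(out, 0, 255).astype(np.uint8)

sample = (np.random.rand(300, 300) * 255).astype(np.uint8)
augmented = augment_gray(sample)
```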

3.2 Caffe framework

Caffe (Convolutional Architecture for Fast Feature Embedding) is an open-source deep learning framework [27] made with expression, speed, and modularity in mind. It was created by Yangqing Jia as his PhD project at UC Berkeley in 2014 and is now developed by Berkeley AI Research (BAIR) and community contributors. Caffe supports seamless switching between CPU and CUDA-capable GPU by simply setting a single flag. It offers not only model definitions and optimization settings but also pre-trained weights as ".caffemodel" binaries in the Caffe model zoo. Caffe can be accessed through C++, Python and MATLAB APIs. In our experiments, Caffe is used with the Python API.

3.3 Network Architecture


Based on SSD300 and SSD512 with the VGG16 feature extractor [36], we offer a new design for fine-tuning a pre-trained model targeted at grayscale image input. Our structural change does not increase the computational complexity of the whole neural network.

3.3.1 Modified input layer


SSD often uses RGB images as input and can reach a high accuracy on perfor-
mance, so the question for us is how to modify the network structure to take
only one-channel input images and maintain the performance? A dumb method
of solving this problem is to duplicate the grayscale channel of the image twice
and merge these three channels together to get a ”3-channel” grayscale image.
This solution leads more disk space usage for storing the extended channels and
more useless computation in the input layer as well.

Figure 3.3: Different convolutional filters in conv1_1.

Another method is to change the input convolutional layer, the conv1_1 layer in the network structure. As figure 3.3 shows, a convolutional filter of size 3 × 3 slides over the input image of height 300 and width 300 (the convolution operation) to produce an output of the same spatial size. The depth denotes the number of 3 × 3 filters used in each layer, which is 64 in figure 3.3, and the number of conv1_1 output feature maps equals this depth. We note that the convolutional filter has a third dimension, also called the channel dimension, which must always equal the number of channels of the input from the previous layer. If the number of input channels changes from 3 to 1, the filter channel must also change to 1. In this way, we lose the information from the different color channels, so we expect the performance of the model might degrade to a certain degree. Previously, the mean values of an RGB image were commonly set to [104, 117, 123], based on the Pascal VOC dataset [8]. Here we apply 96 as the mean value for our grayscale images and [93, 98, 95] for our color images, which are the mean pixel values of the images in our datasets, explained in detail in section 4.1.

3.3.2 Multi-scale feature maps


There are several methods of bounding box prediction in DNN models. An earlier method uses pyramids of images and feature maps: feature maps are generated from each image scale in the image pyramid, which can take a long time. A more advanced method uses a single feature map with bounding boxes of different scales and aspect ratios in fixed grids, e.g. Overfeat [43] and YOLO [39], which is faster but loses accuracy.

SSD strikes a good compromise by adding extra convolutional layers to the end of the truncated base network and efficiently reusing the computation from the base network. The truncated base network is generated by removing all fully connected layers and dropout layers from the original VGG16 model shown in figure 2.4; the original network extracts target features for image classification purposes. Feature maps of multiple scales from the additional convolutional layers, called the "feature pyramid" in figures 3.4 and 3.5, are used for bounding box prediction, which maintains the detection speed and improves the precision at the same time. Here, we call the bounding boxes used for matching the objects in images "default boxes".

Besides, as the network goes deeper, the feature maps become smaller, but the default boxes of various scales and aspect ratios at each cell of the map represent a larger area of the original image, similar to the concept of "anchors" in Faster R-CNN [41]. These default boxes are used for predicting the location offsets to the ground truth boxes and their associated confidences. As the SSD512 model has a larger input than the SSD300 model, with more feature maps at larger scales, its capability of detecting small objects is also better.

Figure 3.4: SSD300 architecture.

Figure 3.5: SSD512 architecture.

3.3.3 Default box selection
As mentioned in section 3.3.2, aspect ratios and scales need to be manually assigned to the default boxes of each feature map, which directly determines the total number of default boxes per image. Using more default boxes directly increases the inference time for each frame/image. It is critical to set the default scales and aspect ratios well, since they directly affect the matching efficiency during training, i.e. the number of matched positive samples. The more positive samples there are in the training set, the more robust the resulting model. Below we introduce our choices and the reasons behind them.

Scales
Feature maps generated from different levels of layers have different receptive fields [50] and thereby target objects of different sizes. We assume the number of feature maps is m and the scales lie in [s_min, s_max] ([0.2, 0.9]); then the scale S_k of the default boxes of the k-th feature map is computed by:
S_k = S_{min} + \frac{S_{max} - S_{min}}{m - 1}(k - 1), \quad k \in [1, m]    (3.2)
Taking SSD300 as an example, the highest-resolution feature map, from the conv4_3 layer, has the minimal scale 0.2, the lowest-resolution feature map, from the conv9_2 layer, has the maximal scale 0.9, and the other scale values are evenly spaced in between. Thus the scales of the feature maps do not correspond to their receptive field sizes. This scale range is generally used for large datasets; however, it can be adjusted according to the distribution of object scales in the dataset.
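Equation 3.2 is straightforward to evaluate directly; the sketch below (illustrative, not from the thesis code) lists the default box scales for the six prediction layers of SSD300 with the default range [0.2, 0.9].

```python
def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Scale of the default boxes for each of the m feature maps (eq. 3.2)."""
    return [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]

print(default_box_scales(6))
# approximately [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```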

Aspect ratios
The aspect ratio describes different object shapes. In the original paper, it is assigned as a_r ∈ {1, 2, 3, 1/2, 1/3} for all feature maps. The width and height of a default box are then s_k √a_r and s_k / √a_r respectively. When the aspect ratio is 1, an additional box with scale s'_k = √(s_k s_{k+1}) is applied, so the maximal number of default boxes at each cell is 6. To obtain a higher detection speed, the first and the last two feature maps are assigned only 4 boxes, with aspect ratios 1, 2 and 1/2. With a larger input size, SSD512 is able to detect more objects of various scales. Table 3.1 lists the number of default boxes per cell of each feature map in the original SSD300 and SSD512 models. For example, the first prediction feature map, of size 38 × 38, uses 4 default boxes in each cell, so the total number of default boxes in this map is 38 × 38 × 4. Summing the counts over all feature maps, the original design uses 8732 default boxes in the SSD300 model and 23574 in the SSD512 model.

As seen in figure 3.6, there are 6 default boxes in each cell of the 10 × 10 feature map, detecting comparatively smaller objects, while there are 4 boxes per cell in the 5 × 5 feature map of figure 3.7, detecting relatively bigger objects. Each detected box outputs the confidence scores for all classes (c_1, c_2, ..., c_p) and 4 location offsets (cx, cy, w, h). Therefore, for an m × n feature map with k default boxes per cell, the total output has size (num_of_classes + 4) × k × m × n.

SSD300
feature map size:   38x38  19x19  10x10  5x5  3x3  1x1  -     total boxes
boxes per position: 4      6      6      6    4    4    -     8732

SSD512
feature map size:   64x64  32x32  10x10  8x8  3x3  1x1  1x1   total boxes
boxes per position: 4      6      6      6    6    4    4     23574

Table 3.1: The number of default boxes per cell for each classifier layer and the total number of boxes. 4 and 6 are the numbers of different default boxes per cell.
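As a quick check of the SSD300 total in table 3.1, one can sum the per-map counts; this small sketch (illustrative only) reproduces the 8732 figure.

```python
# (feature map side, default boxes per cell) for the six SSD300 prediction layers
ssd300_layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

total = sum(side * side * boxes for side, boxes in ssd300_layers)
print(total)   # 8732
```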

Figure 3.6: 10x10 feature map Figure 3.7: 5x5 feature map

Though SSD300 and SSD512 reach very high accuracy and speed at the same time, some bottlenecks still exist. For instance, it has been found that the lower a feature map is, the more complex semantic features it can interpret [37, 18]. In the higher feature maps, the default boxes are of smaller scale but lack strong semantic features, so the classification confidence decreases. This problem can be tackled by finding the optimal scales and aspect ratios. Another, more invasive method that changes the whole structure is the "Feature Pyramid Network" [34], in which a top-down architecture with lateral connections produces semantically strong feature maps at all scales. In our experiments, we optimized SSD using the first method to obtain high accuracy. The experimental results are discussed in sections 4.4 and 4.5.3.

A potential issue in selecting the optimal scales and aspect ratios is that they should be re-designed for each dataset. For on-road objects, the aspect ratios have a different distribution from the 20 object classes in the VOC dataset. If the chosen scales and aspect ratios improve the matching between default boxes and ground truth boxes, there are more positive training samples in the training set and the training loss converges to a smaller value.

3.3.4 Hard negative mining


After defining the scales and aspect ratios for the multiple feature maps, the training process starts by matching the default boxes with the ground truth boxes. We call the matched boxes "prior boxes" and all the prior boxes "positive training samples". As there are only a few objects per image/frame, the number of negative training samples would otherwise be disproportionate compared to the positive ones. As positives, we select the default boxes whose IoU is larger than 0.01. To restrict the total number of training samples, only the top 400 positive samples with the highest IoU are selected.

Then, instead of using all negative predictions, we keep the ratio of negative to positive examples at around 3:1. As there is a background class representing incorrect detections, the negative samples are selected by taking negative predictions and keeping those with the highest confidence scores. By learning from the positive and negative collections, the model becomes more robust to background interference.
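This ratio-based selection can be sketched as follows. It is a minimal NumPy illustration assuming per-box confidence scores are already available; the actual Caffe implementation performs the equivalent selection inside its MultiBox loss layer.

```python
import numpy as np

def hard_negative_mining(conf_scores, is_positive, neg_pos_ratio=3):
    """Keep all positives and only the hardest negatives (highest confidence
    of being an object while actually background), at a 3:1 neg:pos ratio."""
    pos_idx = np.flatnonzero(is_positive)
    neg_idx = np.flatnonzero(~is_positive)
    num_neg = min(len(neg_idx), neg_pos_ratio * len(pos_idx))
    # Sort negatives by descending confidence: the most confidently wrong first.
    hardest = neg_idx[np.argsort(-conf_scores[neg_idx])][:num_neg]
    return pos_idx, hardest

scores = np.random.rand(8732)          # one confidence score per default box
positives = scores > 0.995             # toy matching mask for the example
pos, neg = hard_negative_mining(scores, positives)
```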

3.4 Training
During the training phase, the DNN model abstractly learns representative features from the training data so that it generalizes and produces outputs that predict the ground truth of new data. First, we define the loss function to measure how far the results are from the ground truth. Then there are mainly two iterative steps, propagation and weight update, for learning the parameters of the neural network. Another related topic introduced here is regularization, which prevents the model from overfitting. To train the network model quickly and efficiently, fine-tuning is used in our experiments.

3.4.1 Loss function


In the deep learning scenario, the cost function or loss function describes the difference between the current network output and the expected output. For an object classification task, only the confidence loss of the object category needs to be considered, while for an object detection task both a confidence loss (conf) and a localization loss (loc) must be handled. The loss function is calculated through the forward pass of the DNN. In the original paper, a loss function inspired by [7] has been extended to fit the multi-class task, computed as:
L(x, c, l, g) = \frac{1}{N} \left( L_{conf}(x, c) + \alpha L_{loc}(x, l, g) \right)    (3.3)
where N is the number of detected default boxes. It applies a softmax loss for the confidence loss over all classes and a smooth L1 loss for the localization loss between the detected boxes and the ground truth boxes, as equation 3.4 states. "Detected" specifically means that the confidence score is more than 0.1.
L_{conf}(x, c) = - \sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \quad \text{where } \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, \mathrm{smooth}_{L1}(l_i^{m} - \hat{g}_j^{m})
    (3.4)
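For reference, the two ingredients of equation 3.4 can be written out directly. The sketch below is a NumPy illustration with toy data, not the Caffe MultiBox loss layer used in training; it shows the softmax confidence and the smooth L1 penalty applied to the localization offsets.

```python
import numpy as np

def softmax(logits):
    """Per-box class probabilities c_hat from raw scores (eq. 3.4)."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def smooth_l1(x):
    """smooth_L1: quadratic near zero, linear beyond |x| = 1."""
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

# Toy example: 3 matched boxes, 5 classes (background at index 0).
logits = np.random.randn(3, 5)
labels = np.array([2, 1, 4])                 # ground truth classes
conf_loss = -np.log(softmax(logits)[np.arange(3), labels]).sum()

offsets = np.random.randn(3, 4)              # predicted (cx, cy, w, h) offsets
targets = np.random.randn(3, 4)              # encoded ground truth offsets
loc_loss = smooth_l1(offsets - targets).sum()

total = (conf_loss + 1.0 * loc_loss) / 3     # eq. 3.3 with alpha = 1, N = 3
```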

3.4.2 Propagation
The process of passing inputs forward through the neural network is called forward propagation, and its output in this case is the class, the confidence and the bounding box coordinates. With these outputs, the loss function explained in section 3.4.1 can be computed. The question is then how to minimize this loss function.

Backpropagation is a conceptually simple and computationally efficient neural network learning algorithm [32]. It is based on the gradient descent algorithm, in which we need to compute the gradients of the loss with respect to the weight of each hidden unit in the DNN. There are many hidden layers with thousands of nodes, leading to a complex gradient computation that has to proceed backward from the output to the target node using the chain rule. Equation 3.5 shows the simplest form of the chain rule applied to two functions. Local gradients can be computed by multiplying the Jacobians of each node backward through the network until the target node is reached, and adding the products from all of the different paths backward from the output loss. If y = f(u) and u = g(w), equation 3.5 can be written as equation 3.6.

(f \circ g)'(w) = f'(g(w)) \cdot g'(w)    (3.5)

\frac{dy}{dw} = \frac{dy}{du} \cdot \frac{du}{dw}    (3.6)

3.4.3 Weight update


After getting the local gradient of each node, we apply the gradient descent algorithm to update the weights. As equation 3.7 states, the weight w_{k+1} at the current step k + 1 is obtained by changing the weight w_k of the previous step k by a certain amount. A learning rate must be chosen manually to decide the size of the change in all the weights at each iteration. It is denoted η and is often set to 10^-2 or 10^-3. The change in weights reflects the influence on L of an increase or decrease in w_k.
w_{k+1} \leftarrow w_k - \eta \frac{\partial L}{\partial w_k}    (3.7)

After training with one learning rate for a certain number of iterations, the loss may get stuck within a small range, which means the learning rate might be too large for finding the minimum. Hence, the learning rate is decreased by multiplying it by a factor gamma, often set to 0.1; the next training period then uses 0.1 × initial_learning_rate as the new learning rate. In this way the loss can escape the fluctuation and converge to a smaller value. We usually use three periods of 80000, 40000 and 20000 iterations respectively to converge the loss function. This is called the "multistep" training strategy.
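The weight update of equation 3.7 combined with the multistep schedule can be sketched as below; the step boundaries (80000 and 120000 iterations) follow the three periods mentioned above, while the base learning rate is an illustrative value rather than the exact solver setting.

```python
def multistep_lr(iteration, base_lr=1e-3, gamma=0.1, steps=(80000, 120000)):
    """Learning rate after decaying by gamma at each step boundary."""
    drops = sum(iteration >= s for s in steps)
    return base_lr * (gamma ** drops)

def sgd_update(weights, grads, lr):
    """Plain gradient descent step of equation 3.7 (no momentum or decay)."""
    return [w - lr * g for w, g in zip(weights, grads)]

print(multistep_lr(0), multistep_lr(80000), multistep_lr(120000))
# approximately 0.001, 0.0001, 1e-05
```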

3.4.4 Regularization
To prevent the DNN model from overfitting the data, two methods can be adopted: dataset augmentation and regularization. Artificial data augmentation is commonly used in the deep learning research field and has been introduced in section 3.1.2. In addition, regularization is added for optimization during the training process. One popular form of regularization is the L2 penalty, also known as weight decay. This penalty, shown in equation 3.8 and scaled by the weight decay constant, is added to the loss function. The intuition behind the L2 penalty is that weight matrices with lower, uniformly distributed values exploit all the input data better than sparse weight matrices with a few large values [15].
L_2(\omega) = \|\omega\|_2^2    (3.8)
Another form of regularization is dropout [45], which deactivates units or neurons by setting their outputs to 0 with a certain probability. By randomly cutting off neurons during the training process, it keeps the most robust features of the training set, but it also takes 2-3 times longer to train than a standard neural network of the same architecture, since each epoch effectively trains a different random architecture. Dropout is considered an effective method for improving the performance of neural nets and preventing overfitting in a wide variety of application domains. However, SSD removes the dropout layers of the VGG16 base network, so dropout is not used in this method.

3.5 Testing
During the training process, all of the weights and biases of the network are saved periodically as snapshots. When testing the network, these values are restored and applied to the input images of the test set. Following the loss described in section 3.4.1, the algorithm first calculates the confidence of each detection, represented by the product of the object confidence score and the classification score. Then the top k (set to 400) predictions with confidences above 0.01 are kept for each image. Among those, it is likely that multiple bounding boxes are assigned to the same object, so Non-Maximum Suppression (NMS), explained below [21], is applied to the detections within each image and class with a threshold of 0.5. In the end, this filtering returns the bounding boxes, confidences and classes of the final detections in each image.

3.5.1 Non-Maximum Suppression


As there might be multiple generated bounding boxes targeting at the same
object, it is necessary to prune the redundant detection and come out with
one bounding box for each object target in the image. Hence, non-maximum
suppression algorithm (NMS) is often applied. In our experiments, predicted
boxes with a confidence loss threshold of less than 0.01 are omitted and the
top 400 predictions are kept. This ensures only the most likely predictions are
retained by the network, while the noisier ones are removed. Then, the next step
is to pick the box with highest confidence as output and discard the remaining
boxes that have an IoU over 0.45 and to perform this step repeatedly until there
is no remaining boxes in this image. Hence, a few outputs would be generated

22
and their confidences are more than 0.01. An example is shown in figure 3.8.
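A compact version of this greedy procedure is sketched below (plain NumPy, illustrative only; the experiments rely on the NMS built into the SSD Caffe code):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45, top_k=400):
    """Greedy non-maximum suppression.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences."""
    order = np.argsort(-scores)[:top_k]          # highest confidence first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the best box with all remaining boxes.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        iou = inter / (area(boxes[best:best + 1]) + area(boxes[rest]) - inter)
        order = rest[iou <= iou_thresh]          # drop heavily overlapping boxes
    return keep
```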

Figure 3.8: Non-Maximum Suppression example. Top: image after detection, before NMS processing; Bottom: result image after NMS processing.

3.6 Fine-tuning
There are two main approaches to training a neural network model: training from scratch and transfer learning. By training, we mean executing backpropagation and optimizing the parameters of the neural network model to lower the loss function. As computational resources are limited and the network model is large, training the model parameters from scratch would take a long time, so transfer learning has been applied to train the models.

Transfer learning is a machine learning technique in which a model trained on one task is re-purposed on another related task. As Emilio Olivas stated (2009), "Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned." In fact, most prevalent object detection approaches have adopted a transfer learning strategy, such as Overfeat [43] and Fast R-CNN [12], which make use of AlexNet [29] (trained with ImageNet). Moreover, most object detection models partially reuse classification models. Thus, this paper adopts the transfer learning method, taking VGG16 (pretrained with the VOC2007 and VOC2012 datasets) as the pretrained model. The weights in the truncated VGG16 have already been trained to classify object classes with high accuracy on the VOC2007 and VOC2012 datasets.

Fine-tuning means restoring the pre-trained parameters as our initialization
and training them for another task. In order to fine-tune the pretrained model
on our dataset, some layers need to be re-initialized; otherwise, there may be an
inconsistency with the size of the weights in the pretrained model. We
have modified the input layer's shape from (4, 3, 300, 300) to (4, 1, 300, 300)
and created 6 and 7 extra layers for bounding box prediction in SSD300 and
SSD512 respectively.
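One way to picture the re-initialization on the input side is the following NumPy sketch, which collapses hypothetical three-channel first-layer filters of shape (64, 3, 3, 3) into single-channel filters by averaging the color channels. The shapes and the averaging rule are assumptions for illustration; the actual weight surgery in our experiments may differ in detail.

import numpy as np

# Hypothetical pre-trained conv1_1 filters of VGG16: 64 filters, 3 input
# channels, 3x3 kernels.
rgb_weights = np.random.randn(64, 3, 3, 3).astype(np.float32)

# Collapse the three color channels into one so the network accepts
# grayscale input. Averaging keeps the response magnitude similar;
# luminance weights (0.299 R + 0.587 G + 0.114 B) are another common choice.
gray_weights = rgb_weights.mean(axis=1, keepdims=True)   # shape (64, 1, 3, 3)

print(rgb_weights.shape, "->", gray_weights.shape)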

3.7 Evaluation metrics

In the object detection research field, DNN models can be trained for
different problems, in which different objects are detected and their distribution
is not uniform; thus a simple precision metric may not be comprehensive
enough for evaluation. Besides, detections are defined with respect to a certain threshold and
are associated with a confidence score, which also needs to be involved
in the evaluation metric. Therefore, the mAP (mean average precision) metric has
been widely used for evaluating object detection models and has been adopted by several
object detection competitions, such as Pascal VOC [8] and ImageNet [42].

To understand mAP, we first need to introduce the precision and recall of
a classifier. As equations 3.9 and 3.10 show, the precision score for category c is the ratio of the
number of true detections of object c to the total number of detected c objects,
while the recall is the ratio of the number of true detections of object c to
the number of ground truth boxes of object c in all examples. Both need
a threshold to define which object is considered "predicted" or
"detected". For example, there would probably be more objects detected
with a threshold of 0.2 than with a threshold of 0.8. The threshold we use here
is the Intersection over Union (IoU) threshold.

\text{Precision}_c = \frac{N(\text{true positives})_c}{N(\text{all detections})_c} \quad (3.9)

\text{Recall}_c = \frac{N(\text{true positives})_c}{N(\text{all ground truths})_c} \quad (3.10)

As shown in equation 3.11, the area(B_p ∩ B_gt) term refers to the overlap between
the ground truth box and the predicted box, and area(B_p ∪ B_gt)
is their union. IoU is the ratio of the two, ranging in [0, 1]. The Pascal VOC
challenge regards a prediction with IoU equal to or greater than 0.5 as a posi-
tive prediction. Precision and recall vary with the strictness of the classifier's
threshold: if we choose a larger IoU threshold, the precision shrinks faster at
a smaller recall rate. The maximum recall value of each curve is the
recall over all prediction results. In our experiments, we set the IoU
threshold to 0.5, the same as in Pascal VOC2007.

\text{IoU} = \frac{\text{area}(B_p \cap B_{gt})}{\text{area}(B_p \cup B_{gt})} \quad (3.11)
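Equation 3.11 corresponds directly to the following small Python sketch, included for illustration; boxes are assumed to be given as [xmin, ymin, xmax, ymax].

def iou(box_p, box_gt):
    # Intersection over Union of a predicted box and a ground truth box.
    ix1 = max(box_p[0], box_gt[0])
    iy1 = max(box_p[1], box_gt[1])
    ix2 = min(box_p[2], box_gt[2])
    iy2 = min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    return inter / (area_p + area_gt - inter)

# A detection is counted as a true positive when iou(...) >= 0.5,
# as in the Pascal VOC2007 protocol.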
Average Precision (AP) is used to describe the precision of detection for one class of
object [9]. It summarizes the shape of the precision/recall curve by sampling pre-
cision at a set of eleven equally spaced recall levels, Recall_i \in \{0, 0.1, 0.2, ..., 1.0\},
and averaging them. In our experiments, to capture the change of precision and
recall in more detail, we pick 41 equally spaced recall levels in the same range, hence
Recall_i \in \{0, 0.025, 0.050, ..., 1.0\}, and the AP is calculated following equation 3.12.

AP = \frac{1}{41} \sum_{Recall_i} \text{Precision}(Recall_i) \quad (3.12)

After defining the AP for all classes in the dataset, we can compute the mean Aver-
age Precision (mAP) to show the overall performance of the DNN model. It
is calculated by taking the average of the AP over all classes,
after filtering detections with the IoU threshold of 0.5. We use AP for single-class
detection evaluation and mAP for the overall evaluation.
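The per-class AP of equation 3.12 can be sketched as follows. The sketch assumes the interpolated precision of the Pascal VOC protocol (the highest precision at any recall greater than or equal to the sampled level); this interpolation rule is an assumption, while the 41 equally spaced recall levels follow the description above.

import numpy as np

def average_precision(recalls, precisions, num_levels=41):
    # Sample precision at equally spaced recall levels and average (eq. 3.12).
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, num_levels):        # 0.0, 0.025, ..., 1.0
        mask = recalls >= r
        # Interpolated precision: best precision at recall >= r, else 0.
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / num_levels

# mAP is then simply the mean of the per-class APs, e.g.
# mAP = np.mean([ap_car, ap_cyclist, ap_person, ap_truck])  (names illustrative)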

Chapter 4

Experimental Results and


Analysis

In this section, we introduce the implementation of SSD model training and
testing. Two datasets are used in this work, and their data analysis is pre-
sented in section 4.1. By analyzing the drawbacks of SSD models for detecting
all classes of objects in grayscale images, we have adjusted the model structure
to fit the selected dataset, thereby efficiently optimizing the models and improv-
ing the accuracy. We also implemented two trials to explore the effects of
color input on SSD model performance.

4.1 Datasets
There are many open-source datasets for autonomous driving problems in which
the images are taken on the road and contain labels for multiple objects, while
some of them target single object classes such as pedestrians
[5, 6] or cars [1, 3]. For our task, we target multiple object classes on
the road. To meet the requirements of a forward collision avoidance system, we
have chosen three datasets, two Udacity labeled datasets and the KITTI vision
benchmark suite [11]; the statistics of the object labels are listed in table 4.1.

Dataset     Images    Car      Truck    Pedestrian    Cyclist
KITTI       7481      28742    1094     4487          1627
CrowdAI     9423      62570    3819     5675          -
Autti       15000     60787    3503     9866          1676

Table 4.1: All dataset statistics.

4.1.1 KITTI 2D object detection dataset
The KITTI 2D object detection dataset was collected by the autonomous driving plat-
form Annieway and published on the KITTI Vision Benchmark Suite, a project of Karl-
sruhe Institute of Technology and Toyota Technological Institute at Chicago,
introduced in their 2012 publication [11]. The images are extracted
from videos recorded on the streets of Karlsruhe with large lighting variations
and extensive occlusions. It is unique among these datasets because of
the resolution of its images, 1240 × 375, as seen in figure 4.1. Though
it has 7481 images as the training set and 7518 images as the testing set, only
the training set is publicly available. Hence, we take the training
set as our whole dataset and select the images that contain cars, pedestrians,
cyclists, and trucks. The mean pixel value of these images is 96.2. We randomly
select 5237 images as the training set and 2244 images as the testing set.

Figure 4.1: Sample images from the KITTI 2D object detection dataset.

The KITTI dataset offers 8 classes of objects: Tram, Misc, Cyclist, Person
(sitting), Pedestrian, Truck, Car, and Van. For each object, the coordinates in the
image are also provided. Car, pedestrian, and cyclist objects
are clearly in the majority, leading to an imbalance problem between different classes. In
our research, we take 4 classes, car, pedestrian, cyclist, and truck, as our research
targets. Due to the dataset imbalance problem, we assumed that during the training
process the model would tend to decrease the loss of the larger classes more than the smaller
classes, and thus that the AP of car and pedestrian detection would be greater than the
AP of truck and cyclist. However, this assumption turned out to be wrong, as
described in section 4.3.2, and imbalance is not the main reason for poor
performance.

4.1.2 Udacity Annotated Datasets

There are two annotated datasets in the Udacity self-driving challenge, "Crow-
dAI" and "Autti", named after their annotators. The CrowdAI

Figure 4.2: Object size distributions for two categories ”cars” and ”pedestrians”
in KITTI dataset.

dataset includes three classes of objects: car, person, and truck. The images
were collected from a Point Grey research camera running at a resolution of
1920 × 1200 at 2 Hz while driving in Mountain View, California, and neigh-
boring cities in daylight conditions. For each object in the images, the dataset pro-
vides the class label and the coordinates of the corresponding ground truth
bounding box, [xmin, ymin, xmax, ymax].

The Autti dataset has the same context and the same image resolution as the CrowdAI
dataset but includes two extra classes, traffic lights and bikers. In our case, we
take bikers into consideration but not traffic lights. Since the contexts of the Autti
and CrowdAI datasets are similar, we merge the two datasets
and refer to the result as the "Udacity dataset" in this paper. At the same time, from table 4.1
we notice that the merged dataset has an imbalance problem with many cars
and few bikers, which may pose a problem for training. Enlarging the small
classes, such as the person class, may help ease the problem.

We have randomly split the dataset into 70% (17107 images) for the training set and
30% (7327 images) for the testing set. The calculated mean pixel value of the training set is
95.8. Figure 4.3 shows several example images from the Udacity dataset, which have
already been converted into grayscale. Since there are no reference objects of known
height or width in the images, we cannot compute the actual widths
or heights (in meters) of the ground truth boxes in the Udacity dataset.

4.2 Pre-trained model testing

To test the generality of pre-trained models, we evaluate an SSD300 model trained
on the ILSVRC2016 dataset for 200 classes and an SSD512 model trained on Pascal
VOC2007 and VOC2012 data for 20 classes. Neither of them takes the "truck"
or "cyclist" class into account. Though the "bicycle" and "train"
classes have somewhat similar features to the "cyclist" and "truck" classes, the
features a neural network extracts from them are quite different. So we
only consider two classes, "person" and "car". To fit the training set to the
prepared weights, we have to extend the label map file to 200 and 20 classes
respectively when preparing the training and testing sets.

Figure 4.3: Sample images from Udacity dataset.


It is notable that for different detection purposes, the data and its labels
are quite different. Though the general large-scale datasets include hundreds or
thousands of object categories, they do not fit well with
industrial applications such as on-road detection for ADAS. For on-road object
detection, it is very important to use a power-efficient system with fast and
robust detection in unconstrained environments. To reach this level of performance,
collecting a large amount of data is necessary but also
costly in manpower and resources.

4.2.1 Performance on KITTI dataset

Figure 4.4 shows that neither SSD300 nor SSD512 performs robustly on
detecting human objects. By comparison, both reach a much higher AP for
detecting cars. The underlying reason might be that cars have larger scales
than persons in the images and are detected by the default boxes in the deeper
feature maps, which have strong semantic interpretation power. It is also remarkable
that the SSD512 model reaches roughly double the AP of the SSD300 model
for both person and car detection.

The mAP presented in figure 4.5 is calculated based on the car and person classes;
the detections of the remaining classes are not used in this discussion. The drop in
accuracy is possibly caused by the different scenarios between these general datasets,
with 20 or 200 classes of objects in various environments, and KITTI, which
focuses on vehicle and pedestrian detection. Another reason is that the
aspect ratios of bounding boxes in the KITTI dataset after resizing are fairly small,
within a range of [0, 1.4], compared to the aspect ratios of boxes in other datasets,
such as Pascal VOC 2012, where they lie in the range [0, 9].
Figure 4.4: P-R graphs of pre-trained SSD300 and SSD512 models for ”car”
and ”person” classes.


Figure 4.5: Recall vs. Average of each class precision graph for pre-trained
SSD300 and SSD512 models.

4.2.2 Performance on Udacity dataset


We have also verified the generality of the SSD300 and SSD512 pre-trained models
on the Udacity dataset, as shown in figures 4.6 and 4.7. The results are even worse than
the results on the KITTI dataset. For instance, SSD300 only reaches 5.7%
AP on detecting person objects. One would assume that the aspect
ratios of bounding boxes in this dataset are in general closer to the ones in the
large pre-training datasets, so it is unclear why the detection results get worse. The
underlying reason is hard to determine, as we have not trained a model on either dataset
and hence cannot evaluate the data quality of either.
Figure 4.6: P-R graphs of pre-trained SSD300 and SSD512 models for ”car”
and ”person” classes.

Figure 4.7: Recall vs. Average of each class precision graph for pre-trained
SSD300 and SSD512 models.


4.3 Fine-tuning based on the original design

In this section, the experimental results of SSD300 and SSD512 models on
grayscale images are presented, following the original design in [36].
By exploring multiple ways of implementing the models, we gained more in-
sight into this one-stage DNN. The purpose of this part of the experiments
is to get a basic understanding of how to select hyperparameters and how many
epochs are usually required for the loss function to converge.

Since the training process for DNN models involves compute-intensive
matrix multiplications and other operations that can take advantage of a GPU's
massively parallel architecture, we use one GPU (Nvidia Tesla P100) with 16
GB of memory on an Ubuntu 16.04 system to conduct the experiments. For
testing, we use the same GPU to generate comparable results.

4.3.1 Hyperparameter selection

Most of the hyperparameters for the training and testing phases in our experi-
ments are set according to the original paper [36]. The "multistep" learning rate
policy is adopted to decrease the learning rate after a certain number
of iterations in each step, by multiplying with a gamma of 0.1 or 0.3. The step sizes in
the "multistep" policy are chosen according to the number of iterations it takes
for the loss function to start fluctuating within a small range. Our initial learning
rate is set to 0.0001. Besides, in each iteration we take a batch of images
for training. We choose the batch size among 16, 32, and 64 depending on
the size of GPU memory; a larger batch size leads to a longer time per it-
eration, as there is more computation. To speed up the learning algorithm,
we apply stochastic gradient descent with a momentum of 0.9 and choose 0.0005 as the
weight decay to specify the regularization of the neural network.
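The "multistep" policy can be summarized by the short sketch below. The step iterations and gamma are placeholder values chosen for illustration; the actual step sizes were tuned per experiment as described above.

def multistep_lr(iteration, base_lr=1e-4, gamma=0.1, steps=(80000, 100000)):
    # Multiply the base learning rate by gamma each time the iteration
    # count passes one of the step boundaries ("multistep" policy).
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= gamma
    return lr

# Example: 1e-4 before 80k iterations, 1e-5 from 80k, 1e-6 from 100k.
print(multistep_lr(50000), multistep_lr(90000), multistep_lr(110000))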

4.3.2 Performance for both datasets

We have fine-tuned the pre-trained SSD300 and SSD512 models for detecting 4
classes of objects: cars, pedestrians, cyclists, and trucks. The results
are listed in table 4.2, with less than 60% mAP for all datasets. According to the
statistics of these two datasets, there are more pedestrian instances than
truck instances, which would be expected to yield higher accuracy for detecting
persons than trucks. However, the results show that the AP for trucks is higher
than the AP for persons, indicating that data imbalance is not a correct
explanation for the low APs on detecting pedestrian and cyclist objects.

Dataset                    Method   mAP(%)   Car    Cyclist   Person   Truck
Udacity (CrowdAI, Autti)   SSD300   44.2     66.9   24.3      26.5     59.2
                           SSD512   59.9     80.5   47.7      38.1     73.4
KITTI                      SSD300   45.1     67.4   26.4      25.0     61.6
                           SSD512   58.5     77.8   45.7      38.9     71.4

Table 4.2: Model performance trained and tested on grayscale images.

Model                  38x38   19x19   10x10   5x5   3x3   1x1   -     Total boxes
SSD300                 4       6       6       6     4     4     -     8732
SSD300 (Enhanced)      6       6       6       6     4     4     -     11620
SSD300 (More boxes)    8       6       6       6     6     6     -     14528

Model                  64x64   32x32   10x10   8x8   3x3   1x1   1x1   Total boxes
SSD512                 4       6       6       6     6     4     4     23574
SSD512 (Enhanced)      6       6       6       6     6     4     4     31766

Table 4.3: Default box statistics (boxes per feature-map location) after modifying SSD300 and SSD512.

4.4 Enhancement on detection accuracy

As the performance of the SSD models is fairly poor, with overall mAP less than
60%, we experimented further with finding a better set of scales and aspect ra-
tios to enhance the performance. Here we only verify this method on the KITTI
dataset for both SSD300 and SSD512 models.

From the statistics of the KITTI dataset in figure 4.2, it is obvious that
the width and length of pedestrians are both mostly in the range [0, 1], while
the width of cars is mostly in the range [1.3, 2.2] and the length in [3.0, 5.0] (unit:
meter). To increase the performance in detecting pedestrians and cyclists, one
method is to increase the variation of default boxes in the multi-scale feature maps.
As the highest-resolution feature map, generated by conv4_3 in both SSD300 and SSD512,
detects the smallest objects, such as pedestrians, we change its aspect ratio set
from a_r ∈ {1, 2, 1/2} to a'_r ∈ {1, 2, 3, 1/2, 1/3}. Then the total number of default box
predictions per image increases compared to the numbers in section 3.3.3.
The increased versions, named SSD300 (enhanced) and SSD512 (enhanced)
respectively, are shown in table 4.3.
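The totals in table 4.3 follow directly from the feature map sizes and the number of default boxes per location; the short sketch below reproduces the SSD300 numbers.

def total_default_boxes(map_sizes, boxes_per_location):
    # Sum over feature maps of (grid cells) x (default boxes per cell).
    return sum(m * m * b for m, b in zip(map_sizes, boxes_per_location))

ssd300_maps = [38, 19, 10, 5, 3, 1]
print(total_default_boxes(ssd300_maps, [4, 6, 6, 6, 4, 4]))  # 8732
print(total_default_boxes(ssd300_maps, [6, 6, 6, 6, 4, 4]))  # 11620 (enhanced)
print(total_default_boxes(ssd300_maps, [8, 6, 6, 6, 6, 6]))  # 14528 (more boxes)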

To verify the effect of adding extra default boxes in the highest-resolution feature map, we
apply the modified SSD300 and SSD512 models to the KITTI dataset. From the re-
sults in figure 4.8, we can conclude that this efficiently improves the accuracy
for each class in the KITTI dataset. For "car" and
"truck", the SSD512 (enhanced) model reaches a high accuracy of around 90%,
while for "person" and "cyclist" it reaches 58.3% and 72.6% respectively. The
model performance for the person class is clearly still not robust. Fig-
ure 4.9 shows that the overall mAP has increased by 26.1% for the SSD300 model and
by 20.2% for the SSD512 model.

Comparing the AP for big objects such as cars and trucks with that for small objects
such as persons, we can tell that the model does not perform well on small
objects with the default settings of the original paper. By adding proper aspect
ratios for the default boxes, the performance has been improved to a large extent.

Figure 4.8: P-R graphs of SSD300, SSD512 and their enhanced models on KITTI
grayscale dataset for all classes.

Figure 4.9: Recall vs. average of each class precision graph for SSD300, SSD512,
and their corresponding enhanced models.

We further experimented with increasing the number of default boxes of the 6 feature
maps in SSD300 to [8, 6, 6, 6, 6, 6], called "SSD300 (More boxes)" in table 4.3.
We added the aspect ratios {4, 0.25} to the first feature map and assigned
{1, 2, 3, 1/2, 1/3} to the rest. However, the resulting mAP is 71.4%, which is not a dra-
matic increase compared to the 71.2% mAP of the SSD300 (enhanced) model. Therefore,
in section 4.5, we explore the choice of scales and aspect ratios for the KITTI
dataset and the Udacity dataset.

4.5 K-means evaluation on default boxes

To design an optimal set of scales and their corresponding aspect ratios, we
applied the K-means clustering algorithm. In the end, we choose the cen-
troids of the large aspect-ratio clusters with regard to all scales. The
selected aspect ratios then represent the ground truth bounding boxes in the
datasets better, but this will not enhance the generality of our models. Here
we have chosen 6 scale values, corresponding to the 6 feature maps
in the SSD300 model; the experiments for the SSD512 model are left for future
work.

4.5.1 Analysis on KITTI dataset

The experiments with SSD300, SSD512, and their enhanced models have shown
that prior knowledge of the scales and aspect ratios of the target dataset is im-
portant for improving the efficiency of the training process: more default box
matches provide more positive samples, thereby enhancing the accuracy.
To analyze the scales and aspect ratios of the ground truth bounding boxes, we ap-
ply k-means clustering on both datasets. It is an unsupervised machine learning
algorithm aimed at partitioning the data into k clusters. The goal is to find the
centroids of all clusters such that the data points in each cluster are closest to
their centroid. In this way, we can find the most representative aspect ratios for each
scale.

The scale range we chose before is [0.2, 0.9], which may not fit the dataset well.
First, we define the scale of a box as in equation 4.1. After evaluating
the KITTI dataset, we found the scale of ground truth boxes to be in the range
[0.01, 0.7]. For SSD300, the scales are evenly spaced in this range, thus defined
as [0.065, 0.17, 0.285, 0.395, 0.505, 0.615].
\text{Scale} = \sqrt{\frac{\text{width}_{bbox} \times \text{height}_{bbox}}{\text{width}_{img} \times \text{height}_{img}}} \quad (4.1)
Then, we apply k-means clustering for each scale to find the mean aspect ratios. The
clustering is one-dimensional, since we fix the scale factor and only
aim to cluster the aspect ratios. We select k by trial and error. Besides,
the aspect ratio is not based on the original image at 1242 × 375 resolution but
on the transformed 300 × 300 training sample resolution. Figure 4.10 shows
that the aspect ratios only range in [0, 1.4]; therefore, large aspect ratios such
as 2 and 3 actually have little influence on the performance, while the added 1/3 is
possibly useful, as it accounts for a large percentage of the whole dataset.
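A minimal sketch of this per-scale clustering is given below, using scikit-learn's KMeans; the use of scikit-learn, the function name, and the variable names are illustrative assumptions rather than our exact implementation. widths and heights are the ground truth box sizes in the original image, and scale_edges defines the scale intervals assigned to each feature map.

import numpy as np
from sklearn.cluster import KMeans

def cluster_aspect_ratios(widths, heights, img_w, img_h, scale_edges, ks):
    widths = np.asarray(widths, dtype=float)
    heights = np.asarray(heights, dtype=float)
    scales = np.sqrt((widths * heights) / (img_w * img_h))        # equation 4.1
    # Aspect ratio after resizing to a square input (equation 4.2): the
    # resize factors cancel, leaving relative width over relative height.
    aspect_ratios = (widths / img_w) / (heights / img_h)
    centroids = []
    for (lo, hi), k in zip(zip(scale_edges[:-1], scale_edges[1:]), ks):
        ar = aspect_ratios[(scales >= lo) & (scales < hi)]
        if len(ar) < k:
            centroids.append([])
            continue
        km = KMeans(n_clusters=k, n_init=10).fit(ar.reshape(-1, 1))
        centroids.append(sorted(km.cluster_centers_.ravel()))
    return centroids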

Scale Aspect ratio (percentage of data)
0.065 0.40 (17.61%) 0.31 (13.96%) 0.12 (11.00%) 0.52 (9.21%) 0.68 (8.44%) 0.87 (5.18%)
0.17 0.42 (7.98%) 0.16 (4.96%) 0.63 (4.78%) 0.84 (3.90%)
0.285 0.50 (4.61%) 0.29 (2.42%) 0.78 (1.50%)
0.395 0.65 (2.24%) 0.45 (1.63%)
0.505 0.38 (0.47%)
0.615 0.37 (0.13%)

Table 4.4: K-means result of aspect ratios at different scales in KITTI dataset. Run for all bounding boxes.

Table 4.4 shows the percentage of bounding boxes in each aspect ratio cluster
for the different scales. This suggests that 0.40 could be another aspect ratio
to add, as it accounts for a large percentage of the data points at both scales
0.065 and 0.17. Other aspect ratios such as 0.68 and 0.87 could also be experimented with further.

Figure 4.10: K-means result of aspect ratios at different scales in KITTI dataset.

To explore more details for each class in the KITTI dataset, we select two repre-
sentative classes, "car" and "person". Cars have flatter shapes,
so their aspect ratios are in general larger than those
of persons. As seen in figure 4.11, the scatter plots of aspect ratio versus scale of the
bounding boxes differ in both attributes for "car" and "person". Persons, in
general, occupy a smaller region of the image; their largest scale is only
0.45, while the maximal scale of cars in this dataset is 0.53. It should be noted
that the original aspect ratios in the KITTI dataset are higher, since we resize
the images from 1224 × 370 to 300 × 300; the aspect ratio is then recalculated
as in equation 4.2. However, as the width_bbox to width_img ratio and the height_bbox to
height_img ratio remain the same after resizing, the scales are unchanged.

\text{Aspect ratio} = \frac{\text{width}_{bbox} \times \frac{\text{width}_{resize}}{\text{width}_{img}}}{\text{height}_{bbox} \times \frac{\text{height}_{resize}}{\text{height}_{img}}} \quad (4.2)

Scale Aspect ratio (percentage of data)
0.065 0.40 (19.6%) 0.31 (13.4%) 0.49 (10.91%) 0.73 (8.32%) 0.60 (7.55%) 0.92 (4.56%)
0.17 0.45 (8.69%) 0.65 (5.54%) 0.85 (4.67%) 0.26 (2.22%)
0.285 0.51 (5.40%) 0.33 (2.51%) 0.80 (1.77%)
0.395 0.67 (2.32%) 0.51 (2.32%)
0.505 0.54 (0.17%)

Table 4.5: K-means result of aspect ratios at different scales in KITTI dataset. Run for "car" ground truth bounding boxes.

Scale Aspect ratio (percentage of data)


0.065 0.11 (28.6%) 0.08 (16.07%) 0.17 (5.27%) 0.13 (17.35%) 0.31 (0.91%)
0.17 0.14 (11.91%) 0.11 (9.74%) 0.17 (5.43%) 0.31 (1.04%)
0.285 0.17 (2.44%) 0.32 (0.81%)
0.395 0.31 (0.38%)

Table 4.6: K-means result of aspect ratios at different scales in KITTI dataset. Run for "person" ground truth bounding boxes.

Figure 4.11: K-means result of aspect ratios at different scales in KITTI dataset for car and person objects.

Tables 4.5 and 4.6 provide the distributions of aspect ratios in each
cluster for different scales, for the classes "car" and "person" respectively. It is re-
markable that both of them have their majority in the lowest scale interval. To
fit the data points, we assigned, by trial and error, a different
number of clusters to each scale: k is [6, 4, 3, 2, 1]
for the first 5 scale values for cars and [5, 4, 2, 1] for the first 4 scale values for
persons.

Combined with the overall K-means result, besides the aspect ratios {1, 2, 3, 1/2, 1/3}
used in the pre-defined SSD method, 0.40 ± 0.02 and 0.11 ± 0.02 can also
be added, as they represent most of the aspect ratios in the car class and the person
class respectively.

Scale Aspect ratio (percentage of data)
0.07 0.75 (22.91%) 0.58 (17.83%) 0.92 (16.32%) 0.29 (13.74%) 1.22 (8.99%) 1.77 (4.78%)
0.19 0.73 (4.14%) 0.35 (1.64%) 1.01 (3.54%) 1.66 (1.02%)
0.31 0.77 (1.98%) 0.42 (0.60%) 1.32 (0.46%)
0.43 0.79 (0.10%) 1.97 (0.05%) 0.44 (0.03%)
0.55 0.78 (0.48%) 1.88 (0.02%)
0.67 1.07 (0.008%)

Table 4.7: K-means result of aspect ratios at different scales in Udacity dataset.

4.5.2 Analysis on Udacity dataset

Unlike the KITTI dataset, the images in the Udacity dataset have a smaller
aspect ratio of 1.6 (1920/1200), so the aspect ratios of the ground
truth bounding boxes after resizing are higher, in the range [0, 6]. This is actually
closer to common camera resolutions such as 1920 × 1080. Figure 4.12
presents the redesigned scale values [0.07, 0.19, 0.31, 0.43, 0.55, 0.67] and their
corresponding aspect ratio clusters. It is also noted that the pre-defined aspect
ratio of 2 is helpful for describing the centroids of clusters in several scale
intervals in most general cases.

Figure 4.12: K-means result of aspect ratios at different scales in Udacity dataset.

After several trials, we determined k as [6, 4, 3, 3, 2, 1] for the 6 scales. We
notice that 84.57% of the data points have their scales in the range [0.01, 0.13],
with centroid scale 0.07, as presented in table 4.7. By defining the aspect ratios
according to the 6 k-means results, it is likely that the final accuracy can be enhanced.
However, for objects of small scales, it is always costly in time to iterate over all
grid cells in the highest-resolution feature map. Thus, we add one representative value,
0.75, to the pre-defined aspect ratio set.

As with the KITTI dataset, the aspect ratios of bounding boxes
for the person class are generally smaller than those for cars, as shown in
figure 4.13.
Scale Aspect ratio (percentage of data)
0.07 0.79 (24.75%) 0.64 (21.29%) 0.96 (16.19%) 1.27 (9.06%) 0.40 (7.74%) 1.80 (4.99%)
0.19 0.75 (4.42%) 1.02 (3.76%) 0.38 (1.39%) 1.63 (1.08%)
0.31 0.79 (2.13%) 0.49 (0.68%) 1.29 (0.50%)
0.43 0.75 (1.07%) 1.03 (0.22%) 0.45 (0.21%)
0.55 0.71 (0.28%) 0.94 (0.24%)
0.67 1.00 (0.01%)

Table 4.8: K-means result of aspect ratios at different scales in Udacity dataset. Run for "car" ground truth bounding boxes.

Scale Aspect ratio (percentage of data)


0.07 0.22 (24.0%) 0.27 (22.18%) 0.32 (19.26%) 0.17 (16.10%) 0.42 (11.02%)
0.19 0.14 (1.64%) 0.17 (0.92%) 0.43 (0.40%)
0.31 0.22 (0.37%) 0.35 (0.26%)
0.43 0.23 (0.13%) 0.34 (0.11%)
0.55 0.33 (0.06%)

Table 4.9: K-means result of aspect ratios at different scales in Udacity dataset. Run for "person" ground truth bounding boxes.

From table 4.9, the majority of the bounding boxes for the person class fall in the first
scale interval [0.01, 0.13], accounting for 92.56% of the data. By adding a smaller aspect
ratio of 0.22 ± 0.05, the matching between default boxes and ground truth boxes becomes
more precise, thereby enhancing the accuracy for the person class.

Another issue is that the bounding boxes for the person class mostly lie
in a small scale range, so the coarser feature maps receive very
few positive training samples for persons. In a real-world scenario, if a person ap-
pears from one side of the car and is very close to the vehicle, it needs to be
detected either by the camera or by other hardware components of the vehicle. For
the former case, more person objects of larger scale need to be added to the
training data. For the latter case, other components such as Ladar (laser radar)
or Radar should be applied.

Figure 4.13: K-means result of aspect ratios at different scales in Udacity dataset for car and person objects.

Originally, the aspect ratios in KITTI range from 0 to 4, but after trans-
forming the images to 300 × 300 they shrink dramatically into the range [0, 1.4].

4.5.3 Algorithm validation
In this section, we verify the efficiency of the K-means clustering algorithm for improving
the accuracy of the SSD model on the KITTI dataset. From the scale
and aspect ratio distribution of the images in the KITTI dataset, we set up the
scale range [0.05, 0.75] for the 6 feature maps, and the numbers of default boxes
per location are {8, 6, 6, 6, 4, 4}. Based on the SSD300 (enhanced) model, we added {10, 0.1}
as new aspect ratios for the default boxes with the smallest scale, as 0.1 describes
the shape of person objects better. We name this model the "SSD300 (enhanced
+ 0.1 AR)" model. As shown in figure 4.14, the AP of the person class gets the most
significant improvement, around 10%, greater than that of the other classes. It also
boosts the overall mAP from 71.2% to 76.8%, as shown in figure 4.15.

Figure 4.14: P-R graphs of SSD300, SSD300 (enhanced), and SSD300 (enhanced
+ 0.1 AR) model for all classes.

Furthermore, to test the generality of this more advanced model, we use the whole
Udacity dataset as the test set. It is not surprising that the mAP drops, as
the model is fitted more closely to the KITTI dataset and the distribution of as-
pect ratios differs considerably between the KITTI and Udacity datasets. For the
Udacity dataset, the mAPs of the SSD300 model, the SSD300 (enhanced) model, and the
SSD300 (enhanced + 0.1 AR) model are 44.2%, 22.7%, and 18.7% respectively.
This shows that the method is not robust for other datasets like Udacity, whose
image resolution differs from that of the KITTI dataset.
Figure 4.15: Recall vs. average of each class precision graph for SSD300, SSD300
(enhanced), and SSD300 (enhanced + 0.1 AR) model.


Due to limited time and resources, we have not trained a model on the Udacity
dataset. However, the experiments in this section have verified that a better choice
of aspect ratios and scales improves the performance of the SSD model. Training
on larger datasets such as the Udacity dataset is left for future work.

4.6 Color information analysis

To verify the impact of color information on our detection task, we imple-
mented two trials on the KITTI dataset with the SSD300 model. We trained
two models, for grayscale input and RGB input respectively, named the
grayscale model and the color model, under the condition that the training and testing sets
are the same images with different numbers of color channels and the data augmentation
strategy is the same, with contrast and brightness distortion only. The APs for each class
are evaluated and presented in figures 4.16 and 4.17.

As color distortion in hue and saturation can still be added in the
data augmentation, we trained another model on the same training and
testing sets, with data augmentation in contrast, brightness, hue, saturation, and
flipping. A hue distortion within [-18, 18] and a saturation distortion in the
range [0.5, 1.5] were added to the RGB images, each with a probability of 0.5.
The result is shown in figure 4.18. The mAP has increased
to 73.7% from 70.8%, which is also slightly greater than the mAP of the grayscale
model.
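The extra color augmentation can be pictured with the following OpenCV sketch, which applies the hue shift in [-18, 18] and the saturation factor in [0.5, 1.5], each with probability 0.5. This is an illustrative stand-in; the augmentation in our actual training pipeline may be implemented differently.

import random
import cv2
import numpy as np

def random_color_distortion(image_bgr, hue_delta=18, sat_range=(0.5, 1.5), prob=0.5):
    # image_bgr is an 8-bit BGR image (OpenCV convention).
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    if random.random() < prob:
        # OpenCV stores hue in [0, 180); shift it by a random delta.
        hsv[..., 0] = (hsv[..., 0] + random.uniform(-hue_delta, hue_delta)) % 180
    if random.random() < prob:
        hsv[..., 1] = np.clip(hsv[..., 1] * random.uniform(*sat_range), 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)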

To get more insight into what the neural network model is "looking at", figure 4.19 plots
one RGB image, its grayscale version, and their outputs after the preprocessing layer
"data" of the RGB model and the grayscale model respectively.

Figure 4.16: P-R graphs of grayscale model and color model for all classes.

Figure 4.17: Recall vs. average of each class precision graph for grayscale and
color input models.

Figure 4.18: Recall vs. average of each class precision graph for the grayscale, color, and color (more augmentation) models.

Another insight is that the resolution of the dataset, more
specifically the aspect ratio of the whole image, changes the general size
of the objects that the neural network is trained on. Therefore, the model trained
on the KITTI dataset cannot perform well on the Udacity test set, as the aspect
ratio of the same object class is considerably different.

However, after comparing the outputs of the first convolutional layer conv1_1
of the two models in figure 4.20, it is very hard to tell whether they function the same
or not; many factors influence the model optimization.
In principle, RGB images can represent more scene variation, with different hues,
saturations, etc. On our limited test set, the performance of the model with
color input and the one with grayscale input, under the same data augmentation
conditions, do not differ much.

From the results in figure 4.17, we can conclude that color information is not
a critical factor for the performance of an SSD model with the same
structure. A possible underlying reason is that in both cases the number
of convolutional filters in each layer does not change and the whole
structure remains the same, so the computational complexity is nearly the
same in both cases, and the accuracy remains similar. However, as there are many other factors
that can influence the final accuracy, applying more color distortion to the input
images yields a small increase in mAP, as shown in figure 4.18.

4.7 Inference time


The recorded inference time of the SSD300 model in the original paper is 59 FPS
on the VOC test set using an Nvidia Titan X GPU.

Figure 4.19: Comparison between color input and grayscale input for the color model and the grayscale model respectively.

Figure 4.20: Comparison of first convolutional layer output in the VGGNet.

Model                         Inference time (ms)   Frames per second (FPS)
SSD300                        19.8                  50.51
SSD300 (enhanced)             17.7                  56.50
SSD300 (enhanced + 0.1 AR)    16.4                  60.98
SSD300 (color)                18.7                  53.48
SSD512                        34.5                  28.99
SSD512 (enhanced)             32.0                  31.25

Table 4.10: Model inference time.

In our experiments, with the limited choice of GPU resources offered by the cloud server, we
finally chose to use an Nvidia Tesla P100 GPU with 16 GB of memory. Though it is more
advanced in memory and offers very fast double-precision floating-point computation,
we do not actually use these features, so the speed performance is
affected considerably, and we cannot offer results directly comparable with the Titan X
GPU. During the inference phase, a video recorded on a street in Stockholm
has been used. It has in total 404 frames at a resolution of 1920 × 1080.
The inference time is calculated as the average inference time over all video
frames.

In the inference phase, the SSD300 (enhanced + 0.1 AR) model reaches
around 0.0164 s per frame, while the SSD512 (enhanced) model reaches 0.032 s
per frame, corresponding to roughly 61 FPS and 31 FPS respectively, as depicted in table 4.10.
The enhanced models have not degraded the speed performance; instead, they get
faster during inference. Compared to the reported 59 FPS for SSD300 and 22
FPS for SSD512 in the original paper on the VOC2007 test set, this is quite promising.

The inference time is the sum of the NMS time and the forward pass time. We verified
that the change of default boxes does not influence the forward pass time;
hence, it is the NMS time that has been greatly shortened after the change of de-
fault box selection. This is probably because more correct boxes with higher
confidence are detected, which saves time when filtering out incorrect boxes
with low scores.

Besides, the trained SSD300 enhanced model with color input reaches 0.018 s per frame
on average, which is slightly longer than the model with grayscale input; this
indicates that grayscale input does not influence the inference time
significantly.
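For reference, the averaging of the per-frame inference time over the video can be sketched as below; run_ssd_inference is a placeholder for the actual forward pass plus NMS, and the function is illustrative rather than our exact measurement script.

import time
import cv2

def measure_average_inference_time(video_path, run_ssd_inference):
    # Average the per-frame inference time over all frames of a video.
    cap = cv2.VideoCapture(video_path)
    times = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        start = time.perf_counter()
        run_ssd_inference(frame)                  # forward pass + NMS
        times.append(time.perf_counter() - start)
    cap.release()
    mean_t = sum(times) / max(len(times), 1)
    return mean_t, (1.0 / mean_t if mean_t > 0 else 0.0)   # seconds, FPS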

Chapter 5

Conclusions and Future


Work

In this paper, we have shown that a better selection of default box shapes
in the multi-scale feature maps of the SSD model can boost the accuracy.
For a dataset with a particular resolution, such as the KITTI dataset, the as-
pect ratios of the default boxes in the feature maps become vital in determining the
number of positive samples during the training process. In order to find the
optimal default box shapes, we use the K-means algorithm to cluster the most
similar scales and aspect ratios. It is very effective in improving the accuracy
of the model for one specific dataset, but possibly degrades the
generality of the model as well. Moreover, we found that the accuracy of the SSD model
does not degrade with grayscale images as input.

This chapter summarizes our contributions and outlines the directions for fu-
ture work. Firstly, we list several ethical issues related to our research and social
impacts in section 5.1. Within our research, there are still some drawbacks of
the model structure and limitations that have slowed down our progress, which
will be explained in section 5.2. More findings related to our work will also be
discussed here. In the end, we conclude our contributions and propose some
future work in section 5.3.

5.1 Ethical and social impacts

ADAS with limited control over the vehicle has been widely accepted by the public.
However, as more advanced ADAS tend to participate in driving
operations, it gradually becomes a public concern that driving
behaviors will be interfered with by the ADAS systems. This issue can be addressed by
proving the safety and reliability of the system. Therefore, before the
release of a certain ADAS system, there will be plenty of tests in different
environments regarding its functionality; those tests should also be regulated and
updated. By this standard, the testing set we used in our research is far too small
for training a robust model for a collision avoidance system.

As we are using a deep learning model, which is a data-driven method, more
data of good quality in our training set can improve the robustness of the model.
However, the collection of data is quite costly and may also raise ethical issues,
such as privacy invasion. The KITTI and Udacity datasets we use are
only for research purposes, not for commercial use, under licenses
such as the "Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License".
Care should be taken that images containing people's recognizable features are
not released on public resources, or that they are processed for anonymity.

Moreover, for the companies in the automated driving market, the insufficient
legislation, standards, and guidelines for regulating the products are a problem.
The latest developed vehicles cannot be properly tested
or even released due to the slow response of policy makers, which can cause a stag-
nating market. Even after the products pass the test phase, transparency is
another problem for evaluating the ethics of ADAS: if the performance of the
system is not sufficiently robust to different errors, this should be transparent to
customers. Besides, there should be a clear boundary for information disclosure
between the different parties, including drivers, manufacturers, and system design-
ers. For the disclosed part of the technology, the company needs to protect its
intellectual property and other rights.

5.2 Limitations and future work

In this section, we list the main limitations of our work and offer corre-
sponding solutions that can be experimented with in the future.

5.2.1 Small object detection

The use of more default boxes with general aspect ratios in the SSD model
helped with general object detection of any size, but did not target the classes
containing small objects. Later, by adding the 0.1 aspect ratio, we signifi-
cantly improved the accuracy of person class detection. However, the detection
accuracy for small objects such as "person" and "cyclist" still lags behind
the accuracy for vehicles, as shown in figure 5.1. The
average scale of person objects is clearly smaller than that of the other
classes, and the resulting AP is the worst among all classes.

This still needs to be handled by optimizing the structure of the model. One
solution is to enlarge the small objects in the original dataset,
which can offer more features to be learned by the deeper feature maps in SSD. An-
other solution is to change the way SSD generates its feature maps,
as in R-SSD [26] and FPN [34]. R-SSD replaces VGGNet with ResNet,
which increases the computational overhead and thus lowers the speed, while FPN
(Feature Pyramid Network) increases the accuracy with
strong semantic feature maps at all scales. However, FPN is developed on top of
the two-stage method Faster R-CNN, where it replaces the RPN
with marginal extra cost; more future work is needed to adapt this idea to the
SSD model.

Figure 5.1: Left: P-R graph on each class for SSD512 (enhanced) model. Right:
Scale and aspect ratio distribution for four classes in KITTI dataset.

5.2.2 Long-time training

The 4 classes "car", "cyclist", "person", and "truck" in our dataset are
different from the 20 classes in the VOC 07+12 dataset and the 200 classes in
the ILSVRC2016 dataset, which requires redefining the output shapes of the extra
convolutional layers that match the default boxes. As these layers contain mil-
lions of parameters, training them from scratch is computationally costly,
and with our limited computational resources it takes a long time. Moreover,
when we need to fine-tune a new model for a different group of objects in another
dataset, the truncated VGGNet takes around 80 percent of the training time.

For example, for training an SSD300 model on the Udacity dataset, we set
the batch size to 16 and average the loss every 500 iterations, as shown in
figure 5.2. It shows that the gap between training and validation error is very
small, meaning there is no overfitting problem. As the training set contains
10897 images and each iteration takes 16 images as a batch, each epoch
consists of 681 iterations. The validation is done every 10k iterations, which is
why the validation curve looks less consistent. We spent in total around 48 hours training one
SSD300 model on a single GPU, while training the SSD512 model
takes even longer. Here we have chosen 0.001 as our initial learning rate for
80k iterations, 10^-4 for 20k iterations, and 10^-5 for 20k iterations. The learning
rate needs to be adjusted according to the training status: if the learning rate is
too high, SGD will skip over very narrow minima and the loss will fluctuate in a
certain range, but if the learning rate is too small, SGD will fall into one of these
local minima and cannot come out of it; then we can increase it to reach a
better local minimum. More research has been devoted to adaptive learning
rates to avoid manual tuning of the learning rate.

There are mainly two ways of addressing this problem. One is to use Snapshot
Ensembles [22], which means training a single neural network while collecting multiple
local minima along the way. Compared with the original training process,
there is no additional training cost, and it has been successful in obtaining a marginal
boost in accuracy and is an effective aid to convergence. Another method is to
progressively freeze layers [2]: freezing layers means using the parameters
of the first few layers of the pre-trained network without changing them during
the training process.
Figure 5.2: An example of loss in the training process

However, this method has been shown to have no effect in
speeding up the training of networks like VGGNet, as there are
no skip connections. But for other base networks, such as ResNet and DenseNets,
it can effectively save up to 20% of the training time while keeping the accuracy
unchanged or only slightly lower.

5.2.3 Limited generality

We only trained and tested our models on natural-light images, which
limits the usage scenarios of our models. Besides, the aspect ratio of the training
images in the KITTI dataset differs considerably from that of a normal image taken by
a camera with, for instance, 1920 × 1080 resolution. The SSD512 (enhanced) model
trained on the KITTI dataset has the highest mAP, so we test the generality
of this model. We used one video shot in Stockholm and the Udacity test set
for testing. For the video, we took some random screenshots, shown in
Appendix C. They show that cars can almost always be detected with rather high
confidence, while persons are not always detected and their confidence is lower
in general. For the Udacity test set, the results are presented in figure 5.3, which
show the weak generality of the SSD512 model pre-trained on the KITTI dataset;
test samples are shown in Appendix B.

This problem is mainly caused by the narrowed-down selection of scales and aspect
ratios. It should be noticed that the neural network input is always resized to
a square image, for computational simplicity. After resizing, the scales
are not changed, but the aspect ratios are changed by multiplying with the factor
image_height / image_width, according to the definition. Thus, for the same objects,
such as cars, in datasets of various resolutions, the ground truth box shapes
differ in aspect ratio.

Figure 5.3: Pre-trained Models (KITTI dataset) test on Udacity dataset

To solve this problem, one efficient way is to find optimal default boxes re-
lated to the image resolution in more general cases; the detector can then make
full use of the training samples and reach higher accuracy. Assuming the
average aspect ratio of person boxes is around 0.4 and of car boxes
around 0.75 in a resized image, we can try to change the aspect ratio set
from a_r ∈ {1, 2, 3, 1/2, 1/3} to a_r ∈ {1, 0.75, 0.4, 1.3, 2.5}. For a larger dataset, we should
analyze the typical scales and aspect ratios and assign them as the optimal set in
that context. We also found that there is a marginal decrease in speed when
more default boxes are used per image.

Another direction for future work is to enlarge the on-road dataset with various conditions,
such as different lighting, viewpoints, occlusions, etc. However, collecting data is very
costly, requiring a large amount of time for labeling and correction by hu-
mans. What we can do instead is to create data by image processing. Besides
the flipping, contrast, and brightness distortions that we have tried, other measures such as
partial occlusion, warping, and image stitching can also be experimented with.

5.2.4 Distance measurement

For a forward collision avoidance system in the industrial field, besides detect-
ing objects in the camera view, another critical issue is distance mea-
surement. Though it is not covered in this paper, it can be part of our future work.
With a monocular camera, it is often hard to estimate depth; this requires prior
knowledge of object shapes and is neither stable nor accurate.
With two cameras, it is possible to calculate depth from the two focal
points, but this is still not robust to blurriness in the images. Other
techniques that have been commercialized are Radar and Ladar (laser radar),
but they are costly and demand specialized hardware components.

In short, no single component can work alone perfectly for forward
collision avoidance. It is critical to combine information from different
algorithms and sensors to further increase the reliability of ADAS systems.

5.3 Conclusion
This paper has presented an optimal default box selection scheme that has
been verified to efficiently improve the accuracy of the SSD model on
grayscale images in a real-time scenario. We have chosen two on-road object
datasets, the KITTI Vision Benchmark Suite and the Udacity annotated dataset, con-
taining images with 4 classes of objects, for our experiments. The contributions
of the paper are as follows.

First, we have verified that the SSD models pre-trained on the ILSVRC2016 dataset
and the Pascal VOC2007+2012 dataset cannot perform well on our chosen on-road
datasets: only two classes, "person" and "car", can be detected, and only poorly.
They do not generalize to object detection in the on-road context. Therefore,
for data in different contexts, different DNN models need to be trained
to fit their special requirements.

Second, by analyzing the most typical scales and aspect ratios of the boxes in our
data with the K-means clustering algorithm, we found that a proper default box
selection can increase the number of positive samples in the training pool and thus enhance
the learned features for several classes, which improves the overall detection ac-
curacy on a given dataset. For the KITTI test set, the average precision of "car"
and "truck" eventually reached 92.2% and 92.1% respectively. Moreover,
an optimal selection of scales and aspect ratios on multiple feature maps, without in-
creasing the model size, does not influence the training and testing time, but
shortens the inference time.

Finally, we found that the model trained with grayscale images does not de-
grade compared to one trained with color images under the same data augmentation
conditions. However, if more color distortion augmentation is added
to the training set, the test accuracy increases by around 3%
and becomes slightly greater than the accuracy of the model trained with grayscale
images. In other words, the performance of SSD models does not rely heavily on
color information. Besides, using the SSD model with single-channel
input on a specific video stream, the average inference time is 17.7 ms, which
is slightly shorter than the 18.7 ms inference time of the SSD model with color
input images.

Bibliography

[1] Shivani Agarwal, Aatif Awan, and Dan Roth. Learning to detect objects
in images via a sparse, part-based representation. IEEE transactions on
pattern analysis and machine intelligence, 26(11):1475–1490, 2004.
[2] Andrew Brock, Theodore Lim, JM Ritchie, and Nick Weston. Freeze-
out: Accelerate training by progressively freezing layers. arXiv preprint
arXiv:1706.04983, 2017.
[3] Claudio Caraffi, Tomas Vojir, Jura Trefny, Jan Sochman, and Jiri Matas.
A System for Real-time Detection and Tracking of Vehicles from a Single
Car-mounted Camera. In ITS Conference, pages 975–982, Sep. 2012.

[4] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and
Ng Andrew. Deep learning with cots hpc systems. In International Con-
ference on Machine Learning, pages 1337–1345, 2013.
[5] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human
detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005.
IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE,
2005.
[6] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian
detection: An evaluation of the state of the art. PAMI, 34, 2012.

[7] Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir


Anguelov. Scalable object detection using deep neural networks. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 2147–2154, 2014.
[8] Mark Everingham, L Van Gool, Christopher KI Williams, John Winn, and
Andrew Zisserman. The pascal visual object classes challenge 2007 (voc
2007) results (2007), 2008.
[9] Mark Everingham, Andrew Zisserman, Christopher Williams, and Luc
Van Gool. The pascal visual object classes challenge 2006 (voc 2006) re-
sults. 2006.

[10] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexan-
der C Berg. Dssd: Deconvolutional single shot detector. arXiv preprint
arXiv:1701.06659, 2017.

[11] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for au-
tonomous driving? the kitti vision benchmark suite. In Conference on
Computer Vision and Pattern Recognition (CVPR), 2012.

[12] Ross Girshick. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.


[13] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich
feature hierarchies for accurate object detection and semantic segmenta-
tion. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 580–587, 2014.
[14] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT
Press, 2016. http://www.deeplearningbook.org.
[15] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, vol-
ume 1. 2016.

[16] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recog-
nition with deep recurrent neural networks. In Acoustics, speech and signal
processing (icassp), 2013 ieee international conference on, pages 6645–6649.
IEEE, 2013.

[17] Klemen Grm, Vitomir Štruc, Anais Artiges, Matthieu Caron, and Hazım K
Ekenel. Strengths and weaknesses of deep learning models for face recog-
nition against image degradations. IET Biometrics, 7(1):81–89, 2017.
[18] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik.
Hypercolumns for object segmentation and fine-grained localization. In
Proceedings of the IEEE conference on computer vision and pattern recog-
nition, pages 447–456, 2015.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid
pooling in deep convolutional networks for visual recognition. In european
conference on computer vision, pages 346–361. Springer, 2014.

[20] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimension-


ality of data with neural networks. science, 313(5786):504–507, 2006.
[21] Jan Hosang, Rodrigo Benenson, and Bernt Schiele. Learning non-maximum
suppression. arXiv preprint, 2017.

[22] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E Hopcroft, and
Kilian Q Weinberger. Snapshot ensembles: Train 1, get m for free. arXiv
preprint arXiv:1704.00109, 2017.
[23] Qifan Huang and Li Sha. Edge adaptive demosaic system and method,
February 19 2008. US Patent 7,333,678.

[24] Forrest Iandola and Kurt Keutzer. Small neural nets are beautiful: en-
abling embedded systems with small deep-neural-network architectures. In
Proceedings of the Twelfth IEEE/ACM/IFIP International Conference on
Hardware/Software Codesign and System Synthesis Companion, page 1.
ACM, 2017.

[25] Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. What is the best
multi-stage architecture for object recognition? In Computer Vision, 2009
IEEE 12th International Conference on, pages 2146–2153. IEEE, 2009.

[26] Jisoo Jeong, Hyojin Park, and Nojun Kwak. Enhancement of ssd
by concatenating feature maps for object detection. arXiv preprint
arXiv:1705.09587, 2017.
[27] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan
Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe:
Convolutional architecture for fast feature embedding. arXiv preprint
arXiv:1408.5093, 2014.
[28] Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost Van de Weijer, An-
drew D Bagdanov, Maria Vanrell, and Antonio M Lopez. Color attributes
for object detection. In Computer Vision and Pattern Recognition (CVPR),
2012 IEEE Conference on, pages 3306–3313. IEEE, 2012.
[29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet clas-
sification with deep convolutional neural networks. In Advances in neural
information processing systems, pages 1097–1105, 2012.

[30] William A Leasure Jr and August L Burgett. Nhtsa’s ivhs collision avoid-
ance research program: Strategic plan and status update. In Proc. First
World Congress on Applications of Transport Telematics and Intelligent
Vehicle-Highway Systems, pages 2216–2223, 1994.
[31] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson,
Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backprop-
agation applied to handwritten zip code recognition. Neural computation,
1(4):541–551, 1989.
[32] Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller.
Efficient backprop. In Neural networks: Tricks of the trade, pages 9–50.
Springer, 1998.
[33] Michael W Levine and Jeremy M Shefner. Fundamentals of sensation and
perception. 1991.
[34] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariha-
ran, and Serge Belongie. Feature pyramid networks for object detection.
In CVPR, volume 1, page 4, 2017.
[35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona,
Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco:
Common objects in context. In European conference on computer vision,
pages 740–755. Springer, 2014.

[36] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott
Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multi-
box detector. In European conference on computer vision, pages 21–37.
Springer, 2016.

[37] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional
networks for semantic segmentation. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 3431–3440, 2015.
[38] Seonwoo Min, Byunghan Lee, and Sungroh Yoon. Deep learning in bioin-
formatics. Briefings in bioinformatics, 18(5):851–869, 2017.
[39] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You
only look once: Unified, real-time object detection. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 779–
788, 2016.
[40] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger.
[41] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn:
Towards real-time object detection with region proposal networks. In Ad-
vances in neural information processing systems, pages 91–99, 2015.
[42] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh,
Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bern-
stein, et al. Imagenet large scale visual recognition challenge. International
Journal of Computer Vision, 115(3):211–252, 2015.
[43] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fer-
gus, and Yann LeCun. Overfeat: Integrated recognition, localization and
detection using convolutional networks. arXiv preprint arXiv:1312.6229,
2013.
[44] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[45] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks
from overfitting. The Journal of Machine Learning Research, 15(1):1929–
1958, 2014.
[46] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabi-
novich, et al. Going deeper with convolutions. Cvpr, 2015.
[47] Christian Szegedy, Scott Reed, Dumitru Erhan, Dragomir Anguelov, and
Sergey Ioffe. Scalable, high-quality object detection. arXiv preprint
arXiv:1412.1441, 2014.
[48] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM
Smeulders. Selective search for object recognition. International journal of
computer vision, 104(2):154–171, 2013.
[49] Ren Wu, Shengen Yan, Yi Shan, Qingqing Dang, and Gang Sun. Deep
image: Scaling up image recognition. arXiv preprint arXiv:1501.02876,
7(8), 2015.
[50] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio
Torralba. Object detectors emerge in deep scene cnns. arXiv preprint
arXiv:1412.6856, 2014.

[51] C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object pro-
posals from edges. In European Conference on Computer Vision, pages
391–405. Springer, 2014.

Appendix A

Detection examples from


KITTI

Figure A.1: SSD300 vs. SSD300 (enhanced). Boxes with objectness score of 0.5
or higher are shown: (a)SSD300 with 8732 prior boxes; (b)SSD300 with 11620
boxes.

Figure A.2: SSD512 vs. SSD512 (enhanced). Boxes with objectness score of 0.5
or higher are shown: (a)SSD512 with 23574 prior boxes; (b)SSD512 with 31766
boxes.

Appendix B

Detection examples from


Udacity test set

Figure B.1: SSD300 (enhanced) model detection output.

Figure B.2: SSD512 (enhanced) detection output.

Appendix C

Detection examples from


Stockholm street video

Figure C.1: SSD300 (enhanced) model detection output with an average inference time of 0.0177 s per frame (57 FPS).

Figure C.2: SSD512 (enhanced) model detection output with an average inference time of 0.032 s per frame (31 FPS).

TRITA TRITA-EECS-EX-2018:263
ISSN 1653-5146

www.kth.se
