Improving the Accuracy of 2D On-Road Object Detection Based on Deep Learning Techniques
TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018
YING YU
This paper focuses on improving the accuracy of detecting on-road objects, in-
cluding cars, trucks, pedestrians, and cyclists. To meet the requirements of the
embedded vision system and maintain a high speed of detection in the advanced
driving assistance system (ADAS) domain, the neural network model is designed
based on single channel images as input from a monocular camera.
In this thesis work, several experiments are carried out on how to enhance the
accuracy of SSD models with grayscale input. By adding suitable extra default
boxes in high-layer feature maps and adjusting the entire scale range, the
detection AP over all classes has been improved by around 20%: the mAP of the
SSD300 model increased from 45.1% to 76.8% and the mAP of the SSD512 model
increased from 58.5% to 78.8% on the KITTI dataset. Furthermore, it has been
verified that without color information, the model performance degrades in
neither speed nor accuracy. Experimental results were evaluated using an Nvidia
Tesla P100 GPU on the KITTI Vision Benchmark Suite, the Udacity annotated
dataset and a short video recorded on a street in Stockholm.
Sammanfattning
This thesis focuses on improving the accuracy of detecting on-road objects,
including cars, trucks, pedestrians, and cyclists. To meet the requirements of
the embedded vision system, and to maintain a high detection speed in the ADAS
(advanced driving assistance system) domain, the neural network model is
designed based on single-channel images as input from a monocular camera.
I also really appreciate the help and support I got from my parents, my
friends, and especially my boyfriend, who always motivates me to keep moving
forward and working hard. I have learned a lot from him and gained more
patience and passion for my work. I will keep my passion and my curiosity in
my future work.
Contents
1 Introduction 1
1.1 Background and motivation . . . . . . . . . . . . . . . . . . . . . 2
1.2 Overview of the work . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Literature Review 4
2.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . 5
2.2.1 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Activation . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 2D Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Region-proposal methods . . . . . . . . . . . . . . . . . . 8
2.3.2 End-to-end learning systems . . . . . . . . . . . . . . . . 9
3 Methodology 12
3.1 Dataset preprocessing . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Grayscale image from UYVY format . . . . . . . . . . . . 12
3.1.2 Data augmentation . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Caffe framework . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.1 Modified input layer . . . . . . . . . . . . . . . . . . . . . 15
3.3.2 Multi-scale feature maps . . . . . . . . . . . . . . . . . . . 16
3.3.3 Default box selection . . . . . . . . . . . . . . . . . . . . . 18
3.3.4 Hard negative mining . . . . . . . . . . . . . . . . . . . . 19
3.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.1 Loss function . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.2 Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.3 Weight update . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.4 Regularization . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5.1 Non-Maximum Suppression . . . . . . . . . . . . . . . . . 22
3.6 Fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.7 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Pre-trained model testing . . . . . . . . . . . . . . . . . . . . . . 28
4.2.1 Performance on KITTI dataset . . . . . . . . . . . . . . . 29
4.2.2 Performance on Udacity dataset . . . . . . . . . . . . . . 30
4.3 Fine-tuning based on the original design . . . . . . . . . . . . . . 32
4.3.1 Hyperparameter selection . . . . . . . . . . . . . . . . . . 32
4.3.2 Performance for both datasets . . . . . . . . . . . . . . . 32
4.4 Enhancement on detection accuracy . . . . . . . . . . . . . . . . 33
4.5 K-means evaluation on default boxes . . . . . . . . . . . . . . . . 35
4.5.1 Analysis on KITTI dataset . . . . . . . . . . . . . . . . . 35
4.5.2 Analysis on Udacity dataset . . . . . . . . . . . . . . . . . 38
4.5.3 Algorithm validation . . . . . . . . . . . . . . . . . . . . . 40
4.6 Color information analysis . . . . . . . . . . . . . . . . . . . . . . 41
4.7 Inference time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 1
Introduction
During the past few years, vision-based object detection systems have been
significantly enhanced by machine learning algorithms. Deep Neural Networks
(DNNs) [20], a powerful branch of machine learning methods, have delivered a
number of stunning achievements on many benchmarks. The concept of the DNN is
inspired by the biological neural networks that constitute animal brains,
which have more complicated architectures by comparison. Currently, DNNs are
the state of the art in object detection, with a promising future in improving
both detection accuracy and speed. Compared to classical approaches based on
hand-crafted features, such as Histograms of Oriented Gradients (HOG), they
produce more robust detections in less constrained environments.
As figure 1.1 shows, the expected output is that all target objects, including
cars, in a given grayscale image taken in daytime light are localized and
recognized with high accuracy.
implemented with grayscale images. To maintain a high speed for real-time
detection, the Y pixel format has been used, which may degrade detection
results due to the loss of color information. Therefore, our research explores
whether the loss of color information affects the performance of the selected
DNN algorithm by reducing the precision and recall of detecting multiple
objects in the test set.
Chapter 3 elaborates on the preparation work, the main structure of the SSD
model and the evaluation metrics in detail. It also provides reasons for these
design choices and analyzes potential problems with the chosen method. Through
the steps of the training and testing process, several advanced algorithms and
techniques, such as weight update and regularization, are explained. Notably,
the models are fine-tuned from pre-trained models rather than trained from
scratch.
Chapter 2
Literature Review
This chapter reviews the history of DNN, recent advances in DNN approaches
for object detection tasks and descriptions of other comparable DNN models
with SSD. The base network models for object classification tasks are also in-
troduced as they are closely related to our object detection task.
An ANN usually consists of one input layer, one output layer and one or more
hidden layers, each of which contains one or more units/neurons/nodes. We
call them "units" in this paper. Each artificial unit processes the received
signal in a certain way and passes it to the connected units, which can be
seen as a simplified form of the biological neurons in animal brains. The
linear filter and bias are generally called weights, and they can be learned
from the training data.
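As a rough sketch (the layer sizes and random weights below are illustrative, not taken from this thesis), a whole layer of such units computes a weighted sum plus bias followed by an activation:

```python
import numpy as np

def dense_layer(x, W, b):
    """One fully connected layer: linear filter (weights) plus bias,
    followed by a ReLU activation."""
    return np.maximum(0.0, W @ x + b)

# Hypothetical 3-layer feedforward pass with arbitrary sizes.
rng = np.random.default_rng(0)
x = rng.normal(size=4)                                      # input layer: 4 units
h = dense_layer(x, rng.normal(size=(5, 4)), np.zeros(5))    # hidden layer: 5 units
y = dense_layer(h, rng.normal(size=(2, 5)), np.zeros(2))    # output layer: 2 units
print(y.shape)  # (2,)
```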
Figure 2.1 shows an example of a feedforward 3-layer ANN model, with full
connections from the units in one layer to those in the next layer. To tackle
complex problems in many fields, an ANN often contains many hidden layers to
extract more semantically strong features; it is then also called a deep
artificial neural network, known as a "DNN". Since the early 2000s [38], DNNs
have developed rapidly and been applied in a wide range of applications such
as vision, robotics, video analytics, speech recognition, natural language
processing, targeted advertising, and web search [16, 31, 49, 4, 29]. For
vision-based tasks, convolutional neural network (CNN) models are commonly
applied due to their strong and mostly correct assumptions about the nature
of images.
Figure 2.1: Example of a three-layered ANN model.
With the success of AlexNet, a growing number of computer vision and machine
learning researchers focused on using successively more intricate CNNs to
solve image classification and other problems. Many new model structures have
been proposed and verified, updating the accuracy records of several
recognition challenges, such as ImageNet [42] and MS COCO [35].
layer. We will not go into much detail about fully connected layers since our
chosen model is fully convolutional.
2.2.1 Convolution
The convolution function is the core building block of a CNN. The
one-dimensional convolution function can be denoted as equation 2.1. It is a
discrete operation on a one-dimensional array x and a one-dimensional filter
w. With 2D images as our input signals, the convolutional filter (kernel)
should also be two-dimensional and slide along the width and height axes of
the input, as presented in equation 2.2.
Besides width and height, images have another dimension, "depth", with value
1 for grayscale images and 3 for RGB images. If we assume the size of an
input is WI × HI × DI and the size of the convolutional filter is
Wf × Hf × Df, then for each layer, Df must equal DI. The output of sliding
the filter across the input is often called a feature map, which represents
the cross-correlation between the pattern within the filter and local
features of the input. With its translation-invariant property, a CNN layer
can detect the same features in different parts of the image. The depth of a
feature map depends only on the number of filters applied. The width and
height of the feature map are determined by the WI and HI of the input
respectively, the stride of the filter and the number of padding zeroes.
s(t) = (x ∗ w)(t) = Σ_{a=−∞}^{∞} x(a) w(t − a)    (2.1)

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)    (2.2)
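The 2D operation can be sketched in a few lines of NumPy. As in most deep learning frameworks, the sliding operation below is implemented as a cross-correlation (the filter is not flipped); the function name and array sizes are illustrative:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Naive single-channel 'valid' convolution (implemented, as in most
    deep learning frameworks, as a cross-correlation): slide the kernel
    along the width and height axes and take a dot product at each step."""
    Hi, Wi = image.shape
    Hf, Wf = kernel.shape
    Ho = (Hi - Hf) // stride + 1   # output height from input, filter, stride
    Wo = (Wi - Wf) // stride + 1   # output width
    out = np.zeros((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            patch = image[i*stride:i*stride+Hf, j*stride:j*stride+Wf]
            out[i, j] = np.sum(patch * kernel)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
fmap = conv2d(img, np.ones((3, 3)))
print(fmap.shape)  # (3, 3): (5 - 3)/1 + 1 = 3 along each axis
```

The output size matches the formula mentioned above: it depends on the input size, the filter size, the stride, and (here zero) padding.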
Besides, Levine and Shefner (1991) defined a receptive field (RF) as "an area
in which stimulation leads to a response of a particular sensory neuron" [33].
For an object detection CNN, the RF can be explained as the region of the
input that a feature, after one or more convolutional layers, is looking at.
As depicted in figure 2.3, after cascading two 3 × 3 convolutional filters,
each feature in the second layer has a receptive field of size 5 × 5 in the
input space.
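The growth of the receptive field with stacked layers can be checked with a small helper (a sketch; it assumes all layers share the same kernel size and stride):

```python
def receptive_field(num_layers, kernel=3, stride=1):
    """Receptive field of one output feature after stacking `num_layers`
    convolutions of equal kernel size and stride (no pooling).
    Each layer adds (kernel - 1) * cumulative_stride to the field."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

print(receptive_field(2))  # 5: two stacked 3x3 convolutions see a 5x5 region
```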
6
Figure 2.3: Receptive field example.
2.2.2 Activation
The activation function is a critical part of a CNN. It introduces a
non-linear property into the CNN, which is necessary for learning complex
functional mappings from input data, especially unstructured data such as
images, video and speech. Without the activation component, the neural
network would become a linear regression model with limited power in
describing complex features.
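For instance, the widely used ReLU activation (employed throughout VGG16, as noted below) is simply an element-wise max with zero:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x), applied element-wise.
    Without such a non-linearity, stacked linear layers collapse
    into a single linear map."""
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 3.0])))  # [0.  0.  0.  3.]
```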
2.2.3 Pooling
Max-pooling layers of size 2 × 2 with a stride of 2 have been used in many
popular network architectures, such as the VGG16 network depicted in figure
2.4. They extract the max value in each 2 × 2 block of the output of the
previous convolutional layer. In this case, even if there is some small
translation of the input, the pooling outputs stay the same or change only
slightly, so the final output is not affected. Pooling can also improve the
statistical efficiency of the network model [14].
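The 2 × 2, stride-2 max pooling described above can be sketched as follows (the input values are arbitrary):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: keep the max of each 2x2 block,
    halving both spatial dimensions (odd remainders are cropped)."""
    H, W = x.shape
    x = x[:H - H % 2, :W - W % 2]
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).max(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [1, 1, 4, 4]], dtype=float)
print(max_pool_2x2(x))
# [[4. 8.]
#  [9. 4.]]
```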
In the object detection research field, there are mainly two model structures,
Method                 data     mAP    FPS
Fast R-CNN             07++12   68.4   0.5
Faster R-CNN VGG-16    07++12   70.4   7
YOLO                   07++12   57.9   45
YOLOv2 544             07++12   73.4   40
Faster R-CNN ResNet    07++12   73.8   5
SSD300                 07++12   72.4   46
SSD512                 07++12   74.9   19
In 2013, NYU published Overfeat algorithm for using deep learning in object
detection, which was the winner of the localization task of ILSVRC2013 [43].
They introduced a novel method for object localization and classification by
accumulating predicted bounding boxes, integrated with a single CNN. Quickly
after Overfeat, regions with CNN features (R-CNN) [13] was published, which
brought an almost 50% improvement on the object detection challenge. It
combines Selective Search, a CNN, and SVMs in a three-stage method. However,
R-CNN runs the CNN on each object proposal independently without sharing
computation, making its training expensive in space and time and its
detection slow. Later on, the spatial pyramid pooling network (SPPNet),
introduced by He et al. [19], was proposed to speed up R-CNN by sharing
computation. It generates a convolutional feature map over the whole input
image and classifies each object proposal by extracting feature vectors from
this map, thus avoiding repeated evaluation of the DNN on each proposal.
However, these methods all have multi-stage pipelines [12]. SPPNet reduced
the training time of R-CNN threefold because of its faster proposal feature
extraction. However, the convolutional layers that produce the shared feature
map cannot be updated during fine-tuning, limiting its accuracy. Inspired by
these two networks, Ross Girshick [12] developed a more effective object
detection method called Fast R-CNN. Instead of SVM classifiers, it applies
Region of Interest (RoI) pooling on the feature map followed by a
feed-forward network for classification and bounding box regression, which
improves both accuracy and speed by a large margin (excluding the time for
generating region proposals). After that, detection can run in nearly real
time, but generating proposals is relatively time-consuming and remains the
computational bottleneck in object detection tasks.
After Simonyan et al. created the successful DNN VGGNet [44], using only
3 × 3 convolutions with up to 19 layers, Ren et al. introduced Faster R-CNN
[41], upgraded from Fast R-CNN and based on the VGG-16 net architecture. The
most impressive step they made is applying a region proposal network (RPN)
instead of selective search. It achieves a frame rate of 5 fps (including all
steps) on a GPU. The RPN is a large breakthrough in the speed of detecting
regions of interest.
At the end of 2016, SSD [36] made an impressive step by refreshing the
records in both accuracy and speed. It scores over 74% mAP (mean Average
Precision) on VOC2007 [24] at 59 frames per second (FPS) on an Nvidia Titan X
for 300 × 300 input images [36]. The key to its success is that it uses
multi-scale feature maps for more accurate bounding box regression, which
will be explained in section 3.3; those maps are generated from extra
convolutional layers added to the truncated VGG16 model. The original
structure of VGG16 is presented in figure 2.4. Its convolutional layers use
filters of uniform size (3 × 3) and all apply "same padding" with padding 1
and stride 1. In each convolutional layer, one convolution function is
followed by one ReLU activation function.
SSD reuses the computation from the VGG16 model, saving a lot of time. It
discards the fully connected layers in VGG16, which are mainly used for the
classification output, and applies a set of convolutional layers to the end
of the truncated VGG, which enables extracting features at multiple scales
and progressively decreases the size of the input to each subsequent feature
map. Inspired by Szegedy's work on MultiBox [47], SSD associates default
boxes varying in scales and aspect ratios with the extracted feature maps of
different resolutions. Moreover, over 80% of the inference time is spent on
the image classification base network (VGG16), which implies that
improvements in the base network will also boost the detection speed of SSD.
There are many improved versions of SSD, such as the Deconvolutional Single
Shot Detector (DSSD) [10] and RainbowSSD (R-SSD) [26]. DSSD applies
deconvolutional layers to the multi-scale feature maps and uses a ResNet
feature extractor instead of VGGNet. It improves accuracy, especially for
small objects, but increases latency. Other attempts such as R-SSD have
optimized the SSD model on a different dataset, but have not managed to
improve the overall performance by a large extent.
For these end-to-end learning systems, the selection of the base network as a
classifier is also important, as it affects the inference time and
classification scores directly. In recent years, more advanced base feature
extractors have been released and image classification has gained several
significant improvements, boosting the speed and accuracy of object
detection models as well.
Figure 2.4: VGG16 model.
Chapter 3
Methodology
Figure 3.1: An original image example.
image, contains pixels whose values vary from 0 to 255 in digital formats.
The value 0 means black, with the weakest intensity, and the value 255 means
white, with the strongest intensity. Grayscale images have been widely used
in medical imaging, monitoring systems, etc.
When using a DNN to detect objects in RGB images, data augmentation is often
applied to the original dataset to make the model more robust and enhance
detection accuracy. Obviously, there is more information, such as hue and
saturation, included in RGB images than in grayscale images. It has been
proven that in object detection tasks using computer vision methods, the
additional use of color information performs better than using shape
information alone [28]. However, some studies claim that color information is
not efficiently used in deep learning models [17]; in other words, adding
color information to the input of the model has a negligible effect on the
results. To get a clearer understanding of the importance of color
information, this paper explores it in section 4.6.
Moreover, there are different conditions when people are driving vehicles,
such as rain, weak lighting and fog, which lead to significant changes in the
brightness and contrast of the camera view. To enhance the performance of
SSD, more distortions of contrast and brightness have been added to the
grayscale training set, and random flipping is applied to handle object
detection on different sides.
Brightness Brightness is a relative term, showing how bright the image
appears compared to another reference image, based on our visual perception.
For grayscale images, it is the mean pixel intensity that can be used to
change the brightness. Higher brightness corresponds to weather conditions
with more sunlight reflection. In our experiments, this global attribute has
been chosen as a data augmentation method for all experiments. This paper
applies a brightness change to the intensities of all pixels in each image,
drawn from the range [−32, 32], with a probability of 0.5.
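The brightness augmentation described above can be sketched as follows (the function name and uint8 handling are illustrative assumptions, not the thesis implementation):

```python
import numpy as np

def random_brightness(img, max_delta=32, p=0.5, rng=None):
    """With probability p, add a delta drawn uniformly from
    [-max_delta, max_delta] to every pixel, clipping the result
    to the valid 8-bit range [0, 255]."""
    rng = rng or np.random.default_rng()
    out = img.astype(np.int16)          # widen so the shift cannot wrap around
    if rng.random() < p:
        out = out + rng.integers(-max_delta, max_delta + 1)
    return np.clip(out, 0, 255).astype(np.uint8)

gray = np.full((4, 4), 240, dtype=np.uint8)
aug = random_brightness(gray, rng=np.random.default_rng(0))
print(aug.min() >= 0 and aug.max() <= 255)  # True
```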
(a) R channel (b) Y channel
contributors now. Caffe supports seamless switching between CPU and
CUDA-capable GPU by simply setting a single flag. It offers not only model
definitions and optimization settings but also pre-trained weights in the
format of ".caffemodel" binaries in the Caffe model zoo. Caffe can be
accessed through APIs for C++, Python, and MATLAB. Caffe with the Python API
has been chosen for our experiments.
Another method is to change the input convolutional layer, the conv1_1 layer
in the network structure. As figure 3.3 shows below, a convolutional filter
of size 3 × 3 slides over the input image, with height 300 and width 300 (the
convolution operation), to produce a same-sized output. The depth denotes the
number of 3 × 3 filters used in each layer, which is 64 in figure 3.3, and
the number of conv1_1 output channels equals this depth. Note that the
convolutional filter has a third dimension, also called channel, which must
always match the number of channels of the input from the previous layer. If
the number of input channels changes from 3 to 1, the filter channel should
also change to 1. In this way, we lose the information from the different
color channels, so we assume that the performance of the model would also
degrade to a certain degree due to this loss. Previously, the mean values of
an RGB image were commonly set to [104, 117, 123], based on the Pascal VOC
dataset [8]. We instead apply 96 as the mean value for our grayscale images
and [93, 98, 95] for our color images, which are the mean pixel values of the
images in our datasets, explained in detail in section 4.1.
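A minimal sketch of this single-channel preprocessing (the function is hypothetical; only the mean value 96 and the depth-1 input come from the text above):

```python
import numpy as np

GRAY_MEAN = 96.0  # dataset mean pixel value used for grayscale input

def preprocess_grayscale(img):
    """Prepare a grayscale image for the single-channel input layer:
    subtract the dataset mean and add a channel dimension (depth 1)."""
    x = img.astype(np.float32) - GRAY_MEAN
    return x[np.newaxis, :, :]   # shape (1, H, W): depth 1 instead of 3

img = np.full((300, 300), 96, dtype=np.uint8)
x = preprocess_grayscale(img)
print(x.shape, float(x.mean()))  # (1, 300, 300) 0.0
```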
SSD compensates well by adding extra convolutional layers to the end of the
truncated base network and efficiently reusing the computation from the base
network. The truncated base network is generated by removing all fully
connected layers and dropout layers from the original VGG16 model, shown in
figure 2.4. The original network extracts features of the targets for image
classification purposes. Multiple scales of feature maps from the additional
convolutional layers, called the "feature pyramid" in figures 3.4 and 3.5,
can be used for bounding box prediction, which maintains the detection speed
and improves the precision at the same time. Here, we call the bounding boxes
used for matching the objects in images "default boxes".
Besides, as the network model goes deeper, the feature maps become smaller in
size, but the default boxes of various scales and aspect ratios at each cell
of a map can represent a larger area of the original image, similar to the
concept of "anchors" in Faster R-CNN [41]. These default boxes are used for
predicting the location offsets to the ground truth boxes and their
associated confidences. As the SSD512 model has a larger input than the
SSD300 model, with more feature maps at larger scales, its capability of
detecting small objects is also better.
Figure 3.4: SSD300 architecture.
3.3.3 Default box selection
As mentioned in section 3.3.2, aspect ratios and scales need to be manually
assigned to each default box in the feature maps, which directly determines
the total number of default boxes in each image. Using more default boxes
directly increases the inference time per frame/image. It is critical to set
the default scales and aspect ratios well, since they directly affect the
matching efficiency during the training process, i.e., the number of matched
positive samples. The more positive samples there are in the training set,
the more robust a model we can obtain. Below we introduce our choice and the
reasons behind it.
Scales
Feature maps generated from different levels of layers have different
receptive fields [50], thereby targeting objects of different sizes. We
assume the number of feature maps is m and the scales lie in [s_min, s_max]
([0.2, 0.9]); then the scale s_k of the default boxes of the k-th feature map
is computed by:

s_k = s_min + (s_max − s_min)/(m − 1) · (k − 1),   k ∈ [1, m]    (3.2)
Taking SSD300 as an example, the highest-resolution feature map, from the
conv4_3 layer, has the minimum scale 0.2, the lowest-resolution feature map,
from the conv9_2 layer, has the maximum scale 0.9, and the other scale values
are evenly placed in this range. Thus the scales of the feature maps do not
correspond to their receptive field sizes. This scale range is generally used
for large datasets; however, it can be adjusted according to the distribution
of object scales in the dataset.
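Equation 3.2 can be checked numerically; for SSD300 with m = 6 feature maps, the scales come out evenly spaced between 0.2 and 0.9:

```python
def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Scale s_k for each of the m feature maps, per equation 3.2:
    evenly spaced between s_min and s_max."""
    step = (s_max - s_min) / (m - 1)
    return [round(s_min + step * (k - 1), 2) for k in range(1, m + 1)]

print(default_box_scales(6))  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```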
Aspect ratios
The aspect ratio is used for describing different object shapes. In the
original paper, it is assigned as a_r ∈ {1, 2, 3, 1/2, 1/3} for all feature
maps. The width and height of a default box are then s_k·√a_r and s_k/√a_r
respectively. When the aspect ratio is 1, another bounding box with scale
√(s_k · s_{k+1}) is added, so the maximal number of default boxes at each
cell is 6. To obtain a higher detection speed, the first and the last two
feature maps are assigned 4 bounding boxes, with aspect ratios 1, 2, 1/2.
With a larger input size, SSD512 is able to detect more objects of various
scales. We list the number of default boxes per cell for each feature map of
the original SSD300 and SSD512 models in table 3.1. For example, the first
convolutional feature map, of size 38 × 38, uses 4 default boxes in each
cell, so the total number of default boxes in this map is 38 × 38 × 4. We
calculate the number of boxes for the remaining feature maps and add them
together: the original design uses 8732 default boxes in the SSD300 model and
23574 in the SSD512 model.
As seen in figure 3.6, there are 6 default boxes in each cell of the 10 × 10
feature map, detecting smaller objects by comparison, while there are 4
boxes per cell in the 5 × 5 feature map in figure 3.7, detecting relatively
bigger objects. Each default box outputs the confidence scores for all
classes (c_1, c_2, ..., c_p) and 4 location offsets (cx, cy, w, h).
Therefore, for an m × n feature map with k default boxes per cell, the total
output size is (num. of classes + 4) × k × m × n.
Feature map size    38×38   19×19   10×10   5×5    3×3    1×1    -      Total boxes
SSD300 boxes/cell   4       6       6       6      4      4      -      8732
Feature map size    64×64   32×32   10×10   8×8    3×3    1×1    1×1    Total boxes
SSD512 boxes/cell   4       6       6       6      6      4      4      23574

Table 3.1: The number of default boxes per cell for each classifier layer and
the total number of boxes. 4 and 6 are the numbers of different default boxes.
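The SSD300 total in table 3.1 can be reproduced directly from the feature map sizes and per-cell box counts:

```python
# Total default boxes for SSD300: per-cell box counts from table 3.1
# applied to each square feature map.
sizes = [38, 19, 10, 5, 3, 1]
boxes = [4, 6, 6, 6, 4, 4]
total = sum(s * s * b for s, b in zip(sizes, boxes))
print(total)  # 8732
```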
Figure 3.6: 10x10 feature map Figure 3.7: 5x5 feature map
Though SSD300 and SSD512 can reach very high accuracy and speed at the same
time, there still exist some bottlenecks. For instance, it has been found
that the deeper (lower-resolution) a feature map is, the more complex
semantic features it can interpret [37, 18]. In the higher-resolution feature
maps, the default boxes have smaller scales but lack strong semantic
features, so the classification confidence decreases. This problem can be
tackled by finding optimal scales and aspect ratios. Another, novel method
that changes the whole structure is the "Feature Pyramid Network" [34], in
which a top-down architecture with lateral connections is developed for
strong semantic feature maps at all scales. In our experiments, we optimized
SSD using the first method to reach a high accuracy. The experimental results
will be discussed in sections 4.4 and 4.5.3.
A potential issue in selecting the optimal scales and aspect ratios is that
they must be re-designed for different datasets. For on-road objects, the
aspect ratios have a different distribution than those of the 20 object
classes in the VOC dataset. If the scales and aspect ratios we choose improve
the matching between default boxes and ground truth boxes, there will be more
positive samples in the training set and the training loss will converge to a
smaller value.
boxes. We call the matched boxes "prior boxes", and all the prior boxes
together the "positive training samples". As there are only a few objects per
image/frame, the number of negative training samples would be
disproportionate compared to the positive training samples. As "positive", we
select the default boxes whose IoUs are larger than 0.01. To restrict the
total number of training samples, only the top 400 positive samples with the
highest IoU are selected.
3.4 Training
During the training phase, the DNN model abstractly learns representative
features from the training data so that it can generalize and produce outputs
that predict the ground truths of new data. First, we define the loss
function to measure how far the results are from the ground truths. Then
there are mainly two steps, propagation and weight update, for learning the
parameters of the neural network iteratively. Another related problem
introduced here is regularization, which prevents the model from overfitting.
To train the network model quickly and efficiently, fine-tuning is used in
our experiments.
L_loc(x, l, g) = Σ_{i∈Pos}^{N} Σ_{m∈{cx,cy,w,h}} x_{ij}^{k} · smooth_L1(l_i^m − ĝ_j^m)    (3.4)
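The smooth L1 term used in the localization loss is, in scalar form:

```python
def smooth_l1(x):
    """Smooth L1 loss: quadratic near zero, linear for |x| >= 1,
    so large localization offsets do not dominate the loss."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

print(smooth_l1(0.5), smooth_l1(2.0))  # 0.125 1.5
```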
3.4.2 Propagation
The process of passing inputs forward through the neural network is called
forward propagation; its outputs in this case are the class, the confidence
and the bounding box coordinates. With those outputs, the loss function
explained in section 3.4.1 can be computed. The question is then how to
minimize the loss function.
After training with a given learning rate for a certain number of iterations,
the loss may get stuck within a small range, which means the learning rate
may be too large to find the minimum. Hence, the learning rate is decreased
by multiplying it by a factor gamma, often set to 0.1, and the next training
period uses 0.1 × the initial learning rate as the new learning rate. In this
way, training can escape the fluctuation and converge the loss to a smaller
value. Usually, three periods of 80000, 40000 and 20000 iterations
respectively are used to converge the loss function. This is called the
"multistep" training strategy.
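A sketch of the multistep schedule (the base learning rate of 0.001 is an illustrative assumption; the step boundaries follow the 80000- and 40000-iteration periods above):

```python
def multistep_lr(iteration, base_lr=0.001, gamma=0.1, steps=(80000, 120000)):
    """Multistep schedule: multiply the learning rate by gamma at each
    step boundary (here after 80000 and after 80000 + 40000 iterations)."""
    lr = base_lr
    for s in steps:
        if iteration >= s:
            lr *= gamma
    return lr

print(multistep_lr(0), multistep_lr(90000), multistep_lr(130000))
```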
3.4.4 Regularization
To prevent the DNN model from overfitting the data, two methods can be
adopted: dataset augmentation and regularization. Artificial data
augmentation is commonly used in the deep learning research field and was
introduced in section 3.1.2. In addition, regularization has been added for
optimization during the training process. One popular form of regularization
is the L2 loss, also known as weight decay. This loss, shown in equation 3.8,
is computed from the weights ω and added to the loss function. The
inspiration behind this L2 loss is that weight matrices with lower and
uniformly distributed values perform better at exploiting all the input data
than sparse weight matrices with higher, more concentrated values [15].
L2(ω) = (1/2) · ‖ω‖²    (3.8)
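The effect of this penalty on the update can be sketched as a plain SGD step with weight decay (the learning rate and decay constant below are illustrative assumptions, not values from the thesis):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.001, weight_decay=0.0005):
    """One SGD update where the gradient of the L2 penalty adds a term
    proportional to the weights themselves, pulling them toward zero."""
    return w - lr * (grad + weight_decay * w)

w = np.array([1.0, -2.0])
w_new = sgd_step_with_weight_decay(w, grad=np.zeros(2))
# With zero task gradient, decay alone shrinks both weights slightly.
print(np.all(np.abs(w_new) < np.abs(w)))  # True
```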
Another way of regularization is dropout [45], which deactivates units or
neurons by setting their output to 0 with a certain probability. By randomly
cutting off neurons during the training process, it keeps the most robust
features in the training set, but it also takes 2-3 times longer to train
than a standard neural network of the same architecture, as each epoch
effectively trains a different random architecture. It has been considered an
effective method for improving the performance of neural nets and preventing
overfitting in a wide variety of application domains. However, SSD has
removed the dropout layers from the VGG16 base network, so dropout is not
used in this method.
3.5 Testing
During the training process, all of the weights and biases of the network are
periodically saved to snapshots. When testing a network, these values are
restored and applied to the input images of the testing set. Following the
loss described in section 3.4.1, the algorithm first calculates the
confidence of each detection, represented by the product of the object
confidence score and the classification score. Then the top k (set to 400)
predictions with confidences above 0.01 in each image are retained. Among
those, it is likely that multiple bounding boxes are assigned to the same
object, so Non-Maximum Suppression (NMS), explained below [21], is applied to
the detections within each image and class with a threshold of 0.5. In the
end, this filtering algorithm returns the bounding boxes, confidences, and
classes for the final detections in each image.
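Greedy NMS as described above can be sketched as follows (the boxes and scores are toy values; boxes are given as corner coordinates):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, threshold=0.5):
    """Greedy Non-Maximum Suppression: repeatedly keep the highest-scoring
    box and discard any remaining box overlapping it by more than the
    IoU threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too strongly
```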
and their confidences are more than 0.01. An example is shown in figure 3.8.
3.6 Fine-tuning
There are two main approaches to training a neural network model: training
from scratch and transfer learning. By training, we mean executing
back-propagation and optimizing the parameters of the neural network model to
lower the loss function. With limited computational resources and a large
network model, training the parameters from scratch would take a long time;
hence transfer learning has been applied to train the models.
We reuse the pre-trained base network and create 6 and 7 extra layers for
bounding box prediction in SSD300 and SSD512 respectively.
To understand mAP, we first need to introduce the precision and recall graph
for a classifier. As equation 3.9 shows, the precision score for category c is the
ratio of the number of true detections of object c to the total number of detected
c objects, while the recall (equation 3.10) is the ratio of the number of true
detections of object c to the number of ground truth boxes of object c over all
examples. Both need a threshold to define which objects are considered
"predicted" or "detected"; for example, more objects would probably be detected
with a threshold of 0.2 than with a threshold of 0.8. The threshold we use here
is the Intersection over Union (IoU) threshold.
Precision_c = N(true positives)_c / N(all detections)_c    (3.9)

Recall_c = N(true positives)_c / N(all ground truths)_c    (3.10)
As shown in equation 3.11, area(B_p ∩ B_gt) refers to the overlap between the
ground truth box and the predicted box, and area(B_p ∪ B_gt) is their union.
IoU is their ratio, ranging in [0, 1]. The Pascal VOC challenge regards a
prediction with IoU equal to or greater than 0.5 as a positive prediction.
Precision and recall vary with the strictness of the classifier's threshold:
with a larger IoU threshold, precision shrinks faster at a smaller recall rate.
The maximum recall value in each graph is the recall over all prediction results.
In our experiments, we set the IoU threshold to 0.5, the same as in Pascal
VOC2007.
IoU = area(B_p ∩ B_gt) / area(B_p ∪ B_gt)    (3.11)
Average Precision (AP) is used to describe the precision of detection for one
class of objects [9]. It summarizes the shape of the precision/recall curve by
sampling precision at a set of eleven equally spaced recall levels,
Recall_i ∈ {0, 0.1, 0.2, ..., 1.0}, and averaging them. In our experiments, to
capture the change of precision and recall in more detail, we pick 41 equally
spaced recall levels in the same range, Recall_i ∈ {0, 0.025, 0.050, ..., 1.0},
and the AP is calculated following equation 3.12.
AP = (1/41) · Σ_{Recall_i} Precision(Recall_i)    (3.12)
After defining AP for all classes in the dataset, we can compute mean Aver-
age Precision (mAP) to show the overall performance of the DNN model. It
is calculated by taking the average of AP over all classes at each recall level
after filtering with the IoU threshold at 0.5. We often use AP for single-class
detection evaluation and mAP for the overall result evaluation.
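The 41-level evaluation can be sketched as follows. This sketch takes, at each recall level, the maximum precision among operating points with recall at or above that level, as in the Pascal VOC protocol; the text does not spell out its interpolation rule, so that detail is an assumption.

```python
import numpy as np

def average_precision(recalls, precisions, num_levels=41):
    # Equation 3.12: sample precision at equally spaced recall levels
    # (41 levels, step 0.025) and average. At each level we take the max
    # precision among points with recall >= that level (VOC-style
    # interpolation, assumed here).
    levels = np.linspace(0.0, 1.0, num_levels)
    ap = 0.0
    for r in levels:
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / num_levels

# A perfect detector: precision 1.0 at every recall level gives AP = 1.0.
r = np.linspace(0, 1, 11)
p = np.ones_like(r)
print(average_precision(r, p))  # 1.0
```

mAP is then simply the mean of this quantity over the four classes.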
Chapter 4
4.1 Datasets
There are many open-source datasets for autonomous driving problems, with
images taken on the road and labels for multiple objects. Some of these datasets
target a single object class, such as pedestrians [5, 6] or cars [1, 3], while
our task targets multiple object classes on the road. To meet the requirements
of a forward collision avoidance system, we have chosen three datasets: two
Udacity labeled datasets and the KITTI vision benchmark suite [11]. The
statistics of object labels are listed in table 4.1.
                          Annotations
Dataset     Images    Car      Truck    Pedestrian    Cyclist
KITTI        7481     28742     1094       4487         1627
CrowdAI      9423     62570     3819       5675            -
Autti       15000     60787     3503       9866         1676
4.1.1 KITTI 2D object detection dataset
The KITTI 2D object detection dataset was collected by the autonomous driving
platform Annieway, published on the KITTI Vision Benchmark Suite, a project of
Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago,
and introduced in their publication in 2012 [11]. The images are extracted
from videos recorded on the streets of Karlsruhe with large lighting variations
and extensive occlusions. It is unique among these datasets because of the
resolution of its images, 1240 × 375, as seen in figure 4.1. Though it has
7481 images as the training set and 7518 images as the testing set, only the
training set is available. Hence, we take the training set as our whole dataset
and select the images that contain cars, pedestrians, cyclists, and trucks.
The mean pixel value of these images is 96.2. We randomly select 5237 images
as the training set and 2244 images as the testing set.
Figure 4.1: Sample images from the KITTI 2D object detection dataset.
The KITTI dataset offers 8 classes of objects: Tram, Misc, Cyclist, Person
(sitting), Pedestrian, Truck, Car, and Van. For each object, the coordinates in
the image are also provided. Car, pedestrian, and cyclist objects are clearly
the majority, leading to an imbalance problem between different classes. In
our research, we take 4 classes, car, pedestrian, cyclist, and truck, as our
research targets. Due to the dataset imbalance problem, we assumed that during
the training process the model tends to decrease the loss of the larger classes
more than that of the smaller classes, so the AP of car and pedestrian detection
would be greater than the AP of truck and cyclist. However, this assumption has
been verified to be wrong, as described in section 4.3.2, and is not the main
reason for poor performance.
Figure 4.2: Object size distributions for two categories ”cars” and ”pedestrians”
in KITTI dataset.
The CrowdAI dataset includes three classes of objects: car, person, and truck.
The images were collected from a Point Grey research camera running at a
resolution of 1920x1200 at 2 Hz while driving in Mountain View, California and
neighboring cities in daylight conditions. For each object in the images, it
provides the class label and the corner coordinates of the corresponding ground
truth bounding box, [xmin, ymin, xmax, ymax].
The Autti dataset has the same context and image resolution as the CrowdAI
dataset but includes two extra classes, traffic lights and bikers. In our case,
we take bikers into consideration but not traffic lights. Since the contexts of
the Autti and CrowdAI datasets are similar, we merge the two and refer to the
result as the "Udacity dataset" in this paper. At the same time, from table 4.1
we notice that the merged dataset has an imbalance problem, with many cars and
few bikers, which may pose a problem for training. Enlarging the small classes,
such as the person class, may help ease the problem.
We have randomly split the dataset into 70% (17107 images) for the training set
and 30% (7327 images) for the testing set. The calculated mean pixel value of
the training set is 95.8. Figure 4.3 shows several example images from the
Udacity dataset, already converted to grayscale. Since there are no reference
objects of known height or width in the images, we cannot compute the actual
widths or heights of the ground truth boxes in the Udacity dataset.
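A hypothetical preprocessing sketch for such a frame: convert RGB to a single grayscale channel and subtract the training-set mean (95.8 for our Udacity split). The BT.601 luma weights used here are an assumption; the text does not state which conversion was used.

```python
import numpy as np

UDACITY_MEAN = 95.8  # mean pixel value of our Udacity training split

def preprocess(rgb, mean=UDACITY_MEAN):
    # Weighted sum of the R, G, B channels (ITU-R BT.601 luma weights,
    # assumed), then mean subtraction to center the input.
    gray = rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114
    return gray - mean

img = np.full((2, 2, 3), 128.0)  # a uniform gray RGB patch
print(preprocess(img)[0, 0])     # approx. 32.2
```

The output is a single-channel array, matching the monocular grayscale input the models are designed for.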
Figure 4.3: Sample images from Udacity dataset.
It is remarkable that for different detection purposes, the data and its labels
are totally different. Though the general large datasets include hundreds or
thousands of objects from different categories, they do not fit well with other
industrial applications, such as on-road detection for ADAS. For on-road object
detection, it is very important to use a power-efficient system with fast and
robust detection in unconstrained environments. To enhance performance under
this requirement, collecting a large amount of data is necessary but also
costly in manpower and resources.
The mAP presented in figure 4.5 is calculated based on the car and person
classes; the detection of the remaining classes is not used in this discussion.
The drop in accuracy is possibly caused by the different scenarios: these
general datasets contain 20 or 200 classes of objects in various environments,
while KITTI focuses on vehicle and pedestrian detection. Another reason is that
the aspect ratio of bounding boxes in the KITTI dataset after resizing is fairly
small, within a range of [0, 1.4], compared to the aspect ratios of boxes in
other datasets, such as Pascal VOC 2012, where they are in the range [0, 9].

Figure 4.4: P-R graphs of pre-trained SSD300 and SSD512 models for "car"
and "person" classes.
Figure 4.5: Recall vs. Average of each class precision graph for pre-trained
SSD300 and SSD512 models.
Figure 4.6: P-R graphs of pre-trained SSD300 and SSD512 models for ”car”
and ”person” classes.
Figure 4.7: Recall vs. Average of each class precision graph for pre-trained
SSD300 and SSD512 models.
hence we cannot evaluate the data quality of both datasets.
Since the training process for DNN models involves compute-intensive matrix
multiplications and other operations that can take advantage of a GPU's
massively parallel architecture, we use one GPU (Nvidia Tesla P100) with 16 GB
of memory on an Ubuntu 16.04 system to conduct the experiments. For testing, we
use the same GPU to generate comparable results.
Dataset            Method   mAP(%)   Car    Cyclist   Person   Truck
Udacity            SSD300    44.2    66.9    24.3      26.5     59.2
(CrowdAI, Autti)   SSD512    59.9    80.5    47.7      38.1     73.4
KITTI              SSD300    45.1    67.4    26.4      25.0     61.6
                   SSD512    58.5    77.8    45.7      38.9     71.4
Table 4.2: mAP and per-class AP (%) of the SSD300 and SSD512 models on the
Udacity and KITTI test sets.
# of box positions     38x38   19x19   10x10   5x5   3x3   1x1   -     total boxes
SSD300                   4       6       6      6     4     4    -        8732
SSD300 (Enhanced)        6       6       6      6     4     4    -       11620
SSD300 (More boxes)      8       6       6      6     6     6    -       14528

# of box positions     64x64   32x32   10x10   8x8   3x3   1x1   1x1   total boxes
SSD512                   4       6       6      6     6     4     4      23574
SSD512 (Enhanced)        6       6       6      6     6     4     4      31766

Table 4.3: The default box statistics after modifying SSD300 and SSD512.
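The totals in table 4.3 follow directly from the feature-map grid sizes and the number of boxes per grid position, which a one-line helper makes easy to check:

```python
def total_default_boxes(grid_sizes, boxes_per_pos):
    # Total default boxes = sum over feature maps of
    # (grid cells per map) x (boxes per cell).
    return sum(g * g * b for g, b in zip(grid_sizes, boxes_per_pos))

# SSD300 rows of table 4.3:
print(total_default_boxes([38, 19, 10, 5, 3, 1], [4, 6, 6, 6, 4, 4]))  # 8732
print(total_default_boxes([38, 19, 10, 5, 3, 1], [6, 6, 6, 6, 4, 4]))  # 11620
```

The same arithmetic reproduces the SSD512 totals (23574 and 31766) from its row of the table.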
To verify the effect of adding extra default boxes in the highest feature map,
we apply the modified SSD300 and SSD512 models to the KITTI dataset. From the
results in figure 4.8, we can conclude that this efficiently improves the
accuracy for each class in the KITTI dataset. For "car" and "truck", accuracy
reaches around 90% in the SSD512 (enhanced) model, while for "person" and
"cyclist" it reaches 58.3% and 72.6% respectively. The model performance for
the person class is clearly still not robust. Figure 4.9 shows that the overall
mAP has increased by 26.1% for the SSD300 model and 20.2% for the SSD512 model.
Comparing the AP for big objects such as cars and trucks with that for small
objects such as persons, we can tell that the model does not perform well on
small objects with the default settings from the original paper. By adding
proper aspect ratios for the default boxes, the performance has been improved
to a large extent.
Figure 4.8: P-R graphs of SSD300, SSD512 and their enhanced models on KITTI
grayscale dataset for all classes.
Figure 4.9: Recall vs. average of each class precision graph for SSD300, SSD512,
and their corresponding enhanced models.
{1, 2, 3, 1/2, 1/3}. However, the resulting mAP is 71.4%, which is not a
dramatic increase compared to the 71.2% mAP of the SSD300 (enhanced) model.
Therefore, in section 4.5, we will explore the choice of scales and aspect
ratios for the KITTI dataset and Udacity dataset.
The scale range we chose before is [0.2, 0.9], which may not fit the dataset
well. First, we define the scale of the default boxes as in equation 4.1. After
evaluating the KITTI dataset, we found the scale of ground truth boxes to be in
the range [0.01, 0.7]. For SSD300, the scales are evenly spaced in this range,
thus defined as [0.065, 0.17, 0.285, 0.395, 0.505, 0.615].
Scale = sqrt( (width_bbox × height_bbox) / (width_img × height_img) )    (4.1)
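Equation 4.1 translates directly to code; the box dimensions in the usage example are hypothetical:

```python
import math

def box_scale(w_bbox, h_bbox, w_img, h_img):
    # Equation 4.1: relative scale of a ground truth box in its image,
    # the square root of the box-to-image area ratio.
    return math.sqrt((w_bbox * h_bbox) / (w_img * h_img))

# A hypothetical 124x75 car box in a 1240x375 KITTI frame:
print(box_scale(124, 75, 1240, 375))  # approx. 0.141
```

Because both numerator and denominator scale by the same resize factors, this quantity is invariant to resizing the image to 300 × 300.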
Then, we apply k-means clustering at each scale to find the mean aspect ratios.
The clustering is one-dimensional, since we fix the scale factor and only aim
to cluster the aspect ratios; we select k by trial and error. Besides, the
aspect ratio is not based on the original image resolution of 1242 × 375 but on
the transformed 300 × 300 training sample resolution. Figure 4.10 shows that
the aspect ratios only range in [0, 1.4], so large aspect ratios such as 2 and
3 actually have little influence on performance, while the added 1/3 is
possibly useful, as it accounts for a large percentage of the whole dataset.
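A one-dimensional k-means over the aspect ratios within one scale bin can be sketched with plain Lloyd iterations. The text does not specify the implementation used, so this is illustrative only:

```python
import numpy as np

def kmeans_1d(values, k, iters=100, seed=0):
    # Lloyd's algorithm in 1D: assign each aspect ratio to its nearest
    # center, then move each center to the mean of its members.
    rng = np.random.default_rng(seed)
    centers = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        new_centers = np.array([values[labels == j].mean() if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return np.sort(centers)

# Two well-separated groups of aspect ratios cluster to their means:
ratios = np.array([0.10, 0.11, 0.12, 0.40, 0.41, 0.42])
print(kmeans_1d(ratios, k=2))  # approx. [0.11, 0.41]
```

Running this per scale bin, with k chosen by trial and error, yields cluster centers like those reported in table 4.4.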
Scale Aspect ratio (percentage of data)
0.065 0.40 (17.61%) 0.31 (13.96%) 0.12 (11.00%) 0.52 (9.21%) 0.68 (8.44%) 0.87 (5.18%)
0.17 0.42 (7.98%) 0.16 (4.96%) 0.63 (4.78%) 0.84 (3.90%)
0.285 0.50 (4.61%) 0.29 (2.42%) 0.78 (1.50%)
0.395 0.65 (2.24%) 0.45 (1.63%)
0.505 0.38 (0.47%)
0.615 0.37 (0.13%)
Table 4.4: K-means result of aspect ratios at di↵erent scales in KITTI dataset.
Run for all bounding boxes.
Table 4.4 shows the percentage of bounding boxes in each aspect ratio cluster
at different scales. This suggests that 0.40 could be another aspect ratio to
add, as it accounts for a large percentage of the data points at both scales
0.065 and 0.17. Other aspect ratios such as 0.68 and 0.87 could also be
experimented with further.
Figure 4.10: K-means result of aspect ratios at di↵erent scales in KITTI dataset.
To explore more details for each class in the KITTI dataset, we select two
representative classes, "car" and "person". Cars are flatter in shape, so their
aspect ratios are in general larger than those of persons. As seen in
figure 4.11, the scatter plots of aspect ratio versus scale of the bounding
boxes differ in both attributes for "car" and "person". Persons in general
occupy a smaller region of the image, so their largest scale is only 0.45,
while the maximal scale of cars in this dataset is 0.53. It should be noted
that the original aspect ratios of objects in the KITTI dataset are higher; as
we resize the images from 1224 × 370 to 300 × 300, the aspect ratio is
recalculated as in equation 4.2. However, as the width_bbox to width_img ratio
and height_bbox to height_img ratio remain the same after resizing, the scales
are unchanged.
Aspect ratio = (width_bbox × (width_resize / width_img)) /
               (height_bbox × (height_resize / height_img))    (4.2)
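Following equation 4.2, the recalculated aspect ratio is a one-liner; the box dimensions in the usage example are hypothetical:

```python
def resized_aspect_ratio(w_bbox, h_bbox, w_img, h_img,
                         w_resize=300, h_resize=300):
    # Equation 4.2: box aspect ratio after the image is resized to
    # w_resize x h_resize; the scale (equation 4.1) is unaffected.
    return (w_bbox * w_resize / w_img) / (h_bbox * h_resize / h_img)

# A square (1:1) box in a 1224x370 KITTI image becomes ~0.30 after
# the image is squashed to 300x300:
print(resized_aspect_ratio(100, 100, 1224, 370))  # approx. 0.302
```

Multiplying every aspect ratio by height_img / width_img in this way is exactly what compresses the KITTI range toward [0, 1.4].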
Scale Aspect ratio (percentage of data)
0.065 0.40 (19.6%) 0.31 (13.4%) 0.49 (10.91%) 0.73 (8.32%) 0.60 (7.55%) 0.92 (4.56%)
0.17 0.45 (8.69%) 0.65 (5.54%) 0.85 (4.67%) 0.26 (2.22%)
0.285 0.51 (5.40%) 0.33 (2.51%) 0.80 (1.77%)
0.395 0.67 (2.32%) 0.51 (2.32%)
0.505 0.54 (0.17%)
Table 4.5: K-means result of aspect ratios at di↵erent scales in KITTI dataset.
Run for ”car” ground truth bounding boxes.
Table 4.6: K-means result of aspect ratios at di↵erent scales in KITTI dataset.
Run for ”person” ground truth bounding boxes.
Figure 4.11: K-means result of aspect ratios at di↵erent scales in KITTI dataset
for car and person objects.
Table 4.5 and Table 4.6 provide the distributions of aspect ratios in each
cluster at different scales, for the classes "car" and "person" respectively.
It is remarkable that both have their majority in the lowest scale interval. To
adjust for the data points, by trial and error we have assigned different
numbers of clusters to each scale: k is defined as [6, 4, 3, 2, 1] for the
first 5 scale values for cars and [5, 4, 2, 1] for the first 4 scale values for
persons.
Combined with the overall k-means result, beyond the aspect ratios
{1, 2, 3, 1/2, 1/3} used in the pre-defined SSD method, 0.40 ± 0.02 and
0.11 ± 0.02 can also be added, as they represent most of the aspect ratios in
the car class and the person class respectively.
Scale Aspect ratio (percentage of data)
0.07 0.75 (22.91%) 0.58 (17.83%) 0.92 (16.32%) 0.29 (13.74%) 1.22 (8.99%) 1.77 (4.78%)
0.19 0.73 (4.14%) 0.35 (1.64%) 1.01 (3.54%) 1.66 (1.02%)
0.31 0.77 (1.98%) 0.42 (0.60%) 1.32 (0.46%)
0.43 0.79 (0.10%) 1.97 (0.05%) 0.44 (0.03%)
0.55 0.78 (0.48%) 1.88 (0.02%)
0.67 1.07 (0.008%)
Table 4.7: K-means result of aspect ratios at di↵erent scales in Udacity dataset.
Figure 4.12: K-means result of aspect ratios at di↵erent scales in Udacity dataset
As with the KITTI dataset, the aspect ratios of bounding boxes for the person
class are generally smaller than those for cars, as shown in figure 4.13. From
table 4.9, we can tell that the majority of aspect ratios of
Scale Aspect ratio (percentage of data)
0.07 0.79 (24.75%) 0.64 (21.29%) 0.96 (16.19%) 1.27 (9.06%) 0.40 (7.74%) 1.80 (4.99%)
0.19 0.75 (4.42%) 1.02 (3.76%) 0.38 (1.39%) 1.63 (1.08%)
0.31 0.79 (2.13%) 0.49 (0.68%) 1.29 (0.50%)
0.43 0.75 (1.07%) 1.03 (0.22%) 0.45 (0.21%)
0.55 0.71 (0.28%) 0.94 (0.24%)
0.67 1.00 (0.01%)
Table 4.8: K-means result of aspect ratios at di↵erent scales in Udacity dataset.
Run for ”car” ground truth bounding boxes.
Table 4.9: K-means result of aspect ratios at di↵erent scales in Udacity dataset.
Run for ”person” ground truth bounding boxes.
bounding boxes for the person class lie in the first scale interval
[0.01, 0.13], accounting for 92.56%. By adding a smaller aspect ratio of
0.22 ± 0.05, the matching between default boxes and ground truth boxes becomes
more precise, thereby enhancing the accuracy for the person class.
Another issue is that the scales of bounding boxes for the person class mostly
lie in a small range, and thus in the lower feature maps there are very few
true training samples for persons. In a real-world scenario, if a person
appears from one side of the car and is very close to the vehicle, they need to
be detected either by the camera or by other hardware components of the
vehicle. For the former case, more person objects of larger scale need to be
added to the training data. For the latter, other components such as Ladar
(laser radar) or Radar should be applied.
Figure 4.13: K-means result of aspect ratios at di↵erent scales in Udacity dataset
for car and person objects.
Originally the aspect ratios in the KITTI dataset range from 0 to 4, but after
transforming to 300 × 300, they shrink dramatically into the range [0, 1.4].
4.5.3 Algorithm validation
In this section, we verify the efficiency of the k-means clustering algorithm
in improving the accuracy of the SSD model on the KITTI dataset. From the scale
and aspect ratio distributions of the images in the KITTI dataset, we set up
the scale range [0.05, 0.75] for the 6 feature maps, with {8, 6, 6, 6, 4, 4}
aspect ratios respectively. Based on the SSD300 (enhanced) model, we added
{10, 0.1} as new aspect ratios for the default boxes with the smallest scale,
as 0.1 better describes the shape of person objects. We name this model the
"SSD300 (enhanced + 0.1 AR)" model. As shown in figure 4.14, the AP of the
person class gets the most significant improvement, by 10%, greater than that
of the other classes. It also boosts the overall mAP from 71.2% to 76.8%, as
shown in figure 4.15.
Figure 4.14: P-R graphs of SSD300, SSD300 (enhanced), and SSD300 (enhanced
+ 0.1 AR) model for all classes.
Further on, to test the generality of this more advanced model, we use the
whole Udacity dataset as the test set. It is not surprising that the mAP drops,
as the model is fitted more closely to the KITTI dataset, and the distribution
of aspect ratios differs greatly between the KITTI and Udacity datasets. On the
Udacity dataset, the mAPs of the SSD300 model, SSD300 (enhanced) model, and
SSD300 (enhanced + 0.1 AR) model are 44.2%, 22.7%, and 18.7% respectively. This
shows that the method is not robust for other datasets like Udacity.
Figure 4.15: Recall vs. average of each class precision graph for SSD300, SSD300
(enhanced), and SSD300 (enhanced + 0.1 AR) model.
We have not trained a model on the Udacity dataset due to our limited time and
resources. However, the experiments in this section have verified that a better
choice of aspect ratios and scales improves the performance of the SSD model.
Training on larger datasets such as the Udacity dataset is left for future work.
As there is still color distortion in hue and saturation that can be added in
the data augmentation part, we trained another model on the same training and
testing sets, with data augmentation in contrast, brightness, hue, saturation,
and flipping. Hue distortion within [-18, 18] and saturation distortion in the
range [0.5, 1.5] were applied to the RGB images, each with a probability of
0.5. The result is shown in figure 4.18. The mAP has increased from 70.8% to
73.7%, which is also slightly greater than the mAP of the grayscale model.
To get more insights about what the neural network model is ”looking at”, we
plot one RGB image, its grayscale image and their output after the preprocessing
Figure 4.16: P-R graphs of grayscale model and color model for all classes.
Figure 4.17: Recall vs. average of each class precision graph for grayscale and
color input models.
Figure 4.18: Recall vs. average of each class precision graph for the
grayscale, color, and color-with-more-augmentation models.
layer "data" of the RGB model and the grayscale model respectively in
figure 4.19.
Another insight is that the resolution of the dataset, more specifically the
aspect ratio of the whole image, changes the general size of the objects that
the neural network sees during training. Therefore, the model trained on the
KITTI dataset cannot perform well on the Udacity test set, as the aspect ratio
of the same object class is considerably different.
However, after comparing the outputs of the first convolutional layer conv1_1
of the two models in figure 4.20, it is very hard to tell whether they function
the same or not; many factors influence the model optimization. RGB images can
represent more scene variation in hue, saturation, etc., but on our limited
test set, the performance of the model with color input and the one with
grayscale input under the same data augmentation conditions do not differ much.
From the results in figure 4.9, we can conclude that color information is not a
critical factor in the performance of the SSD model with the same structure. A
possible underlying reason is that in both cases the number of convolutional
filters applied in each layer does not change and the whole structure remains
the same, so the computational complexity is nearly identical and the accuracy
remains the same as well. However, as many other factors can influence the
final accuracy, applying more color distortion to the input images yields a
small increase in mAP, as seen in figure 4.18.
Figure 4.19: Comparison between color input and grayscale input for the color
model and grayscale model respectively.
Model                          Inference time (ms)   Frames Per Second (FPS)
SSD300                                19.8                    50.51
SSD300 (enhanced)                     17.7                    56.50
SSD300 (enhanced + 0.1 AR)            16.4                    60.98
SSD300 (color)                        18.7                    53.48
SSD512                                34.5                    28.99
SSD512 (enhanced)                     32.0                    31.25
Table 4.10: Inference time and frame rate of each model.
In the inference phase, the SSD300 (enhanced + 0.1 AR) model reaches around
0.0164 s per frame and the SSD512 (enhanced) model around 0.032 s per frame,
i.e. about 61 FPS and 31 FPS respectively, as depicted in table 4.10. The
enhanced models have not degraded in speed; instead, they get faster during
inference. Compared to the reported 59 FPS for SSD300 and 22 FPS for SSD512 in
the original paper on the VOC2007 test set, this is quite promising.
The inference time is the sum of the NMS time and the forward pass time.
Testing shows that the change of default boxes does not influence the forward
pass time; hence, it is the NMS time that has been greatly shortened by the
change of default box selection. This is probably because more correct boxes
with higher confidence are detected, which saves time in filtering out
incorrect boxes with low scores.
Besides, the trained SSD300 enhanced model with color input reaches 0.018 s on
average, which is slightly longer than the one with grayscale input, indicating
that grayscale input does not significantly change the inference time.
Chapter 5
In this paper, we have shown that a better selection of default box shapes in
the multi-scale feature maps of the SSD model can boost its accuracy. For a
dataset with a particular resolution, such as the KITTI dataset, the aspect
ratios of default boxes in the feature maps become vital in determining the
number of positive samples during training. To find the optimal default box
shapes, we use the k-means algorithm to cluster the most similar scales and
aspect ratios. This is very effective in improving the accuracy of the model on
one specific dataset, but possibly degrades the generality of the model as
well. Moreover, we found that the accuracy of the SSD model does not degrade
with grayscale images as input.
This chapter summarizes our contributions and outlines the directions for fu-
ture work. Firstly, we list several ethical issues related to our research and social
impacts in section 5.1. Within our research, there are still some drawbacks of
the model structure and limitations that have slowed down our progress, which
will be explained in section 5.2. More findings related to our work will also be
discussed here. In the end, we conclude our contributions and propose some
future work in section 5.3.
As we are using a deep learning model, which is a data-driven method, more data
of good quality in our training set can improve the robustness of our model.
However, the collection of data is quite costly and may also cause ethical
issues, such as privacy invasion. The KITTI and Udacity datasets we use are for
research purposes only, not for commercial use, under licenses such as the
"Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License". Care
should be taken that images with people's recognizable features are not
released on public resources, or are processed for anonymity.
Moreover, for companies in the automated driving market, the insufficient
legislation, standards, and guidelines for regulating their products are a
problem. The latest developed vehicles cannot be properly tested or even
released due to slow policy responses, which can stagnate the market. Even
after products pass the test phase, transparency is another problem in
evaluating the ethics of ADAS: if the performance of the system is not
sufficiently robust to different errors, this should be transparent to
customers. Besides, there should be a clear boundary for information disclosure
between different stakeholders, including drivers, manufacturers, and system
designers. For the disclosed part of the technology, the company needs to
protect its intellectual property and other rights.
This still needs to be handled by optimizing the structure of the model. One
solution is to enlarge the size of the small objects in the original dataset,
which can offer more features to be learned by the higher feature maps in SSD.
Another solution is to change the way SSD generates its feature maps, as in
R-SSD [26] and FPN [34]. R-SSD replaces VGGNet with ResNet, which increases the
computational overhead and thus lowers the speed. FPN (Feature Pyramid Network)
is more advanced, increasing accuracy with strong semantic feature maps at all
scales; however, it was developed on top of the two-stage detector Faster
R-CNN, where it replaces the RPN at marginal extra cost. More future work is
needed to adapt this idea to the SSD model.
Figure 5.1: Left: P-R graph on each class for SSD512 (enhanced) model. Right:
Scale and aspect ratio distribution for four classes in KITTI dataset.
For example, for training an SSD300 model on the Udacity dataset, we set the
batch size to 16 and average the loss over every 500 iterations, as shown in
figure 5.2. The gap between training and validation error is very small,
meaning there is no overfitting problem. As the training set contains 10897
images and each iteration takes a batch of 16 images, each epoch includes 681
iterations. The validation is done every 10k iterations, which is why the
validation curve looks less consistent. We spent in total around 48 hours
training one SSD300 model on a single GPU, while training an SSD512 model takes
even longer. Here we have chosen 0.001 as our initial learning rate for 80k
iterations, then 10^-4 for 20k iterations and 10^-5 for another 20k iterations.
The learning rate needs to be changed according to the training status: if it
is too high, SGD will skip over very narrow minima and the loss will fluctuate
within a certain range; if it is too small, SGD will fall into one of these
local minima and cannot come out of it, and we can then increase it to reach a
better minimum. More research has focused on adaptive learning rates to avoid
manual tuning.
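The step schedule described above can be sketched as a simple function of the iteration count (illustrative; in practice the Caffe solver configuration expresses the same policy):

```python
def learning_rate(iteration):
    # Step schedule from the text: 1e-3 for the first 80k iterations,
    # then 1e-4 for 20k, then 1e-5 for the remaining iterations.
    if iteration < 80_000:
        return 1e-3
    if iteration < 100_000:
        return 1e-4
    return 1e-5
```

Each drop by a factor of 10 lets SGD settle into progressively narrower minima once the loss has plateaued at the previous rate.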
There are mainly two ways of addressing this problem. One is to use Snapshot
Ensembles [22], which train a single neural network while collecting multiple
local minima along the way. Compared with the original training process, there
is no additional training cost, and it has succeeded in providing a marginal
boost in accuracy and effective convergence. Another method is to progressively
freeze layers [2]. Freezing layers means using the parameters
Figure 5.2: An example of loss in the training process
of the first few layers of the pre-trained network without changing them during
the training process. However, this method has been shown to have no effect in
speeding up training for networks like VGGNet, as they have no skip
connections. For other base networks such as ResNet and DenseNet, it can save
up to 20% of the training time while keeping the accuracy unchanged or only
slightly lower.
This problem is mainly caused by the narrowed-down selection of scales and
aspect ratios. It should be noted that the neural network input is always
resized to a square image, for computational simplicity. After resizing, the
scales are unchanged, but the aspect ratios are changed by a factor of
image_width / image_height, according to the definition. Thus, for the same
objects, such as cars, in datasets of various resolutions, the ground truth box
shapes differ in aspect ratio.
Figure 5.3: Pre-trained Models (KITTI dataset) test on Udacity dataset
To solve this problem, one efficient way is to find optimal default boxes
related to the image resolution in more general cases; the detection will then
make full use of the training set samples and gain accuracy. Assuming the
average aspect ratio in a resized image is around 0.4 for person boxes and
around 0.75 for car boxes, we could change the aspect ratio set
a_r ∈ {1, 2, 3, 1/2, 1/3} to a_r ∈ {1, 0.75, 0.4, 1.3, 2.5}. For larger
datasets, we should analyze the typical scales and aspect ratios and assign
them as the optimal set in that context. We also found a marginal decrease in
speed after more default boxes are used per image.
In short, no single component can work perfectly alone for the forward
collision avoidance purpose. It is critical to combine information from
different algorithms and sensors to further increase the reliability of ADAS
systems.
5.3 Conclusion
This paper has presented an optimal default box selection scheme that has been
verified to efficiently improve the accuracy of the SSD model on grayscale
images in a real-time scenario. We have chosen two on-road object datasets, the
KITTI Vision Benchmark Suite and the Udacity annotated dataset, containing
images with 4 classes of objects, for our experiments. The contributions of
this paper are as follows.
First, we have verified that the SSD models pre-trained on the ILSVRC2016
dataset and the Pascal VOC2007+2012 dataset cannot perform well on our chosen
on-road datasets; only two classes, "person" and "car", can be detected, and
only poorly. They do not generalize to object detection in the on-road context.
Therefore, for data in different contexts, different DNN models need to be
trained to fit their special requirements.
Second, using the K-means clustering algorithm to analyze the most typical
scales and aspect ratios of the boxes in our data, we found that proper
default box selection increases the number of positive samples in the training
pool and thus enhances the learned features for several classes, which
improves the overall detection accuracy on a given dataset. On the KITTI test
set, the average precision of "car" and "truck" eventually reached 92.2% and
92.1%, respectively. Moreover, selecting optimal scales and aspect ratios on
multiple feature maps without increasing the model size does not affect the
training and testing time, but shortens the inference time.
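The clustering step can be sketched as a plain k-means over the normalized (width, height) pairs of the ground-truth boxes; this is an illustrative re-implementation, not the thesis code, and the function name and parameters are invented:

```python
import numpy as np

def kmeans_boxes(whs, k, iters=20, seed=0):
    """Naive k-means over (width, height) pairs of ground-truth boxes
    (normalized to the square network input). The centroids suggest
    typical default-box shapes for the dataset, reported in SSD's
    (scale, aspect ratio) parameterization."""
    rng = np.random.default_rng(seed)
    whs = np.asarray(whs, dtype=float)
    centers = whs[rng.choice(len(whs), size=k, replace=False)]
    for _ in range(iters):
        # assign each box to its nearest centroid (Euclidean in w-h space)
        labels = np.argmin(((whs[:, None] - centers[None]) ** 2).sum(-1),
                           axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = whs[labels == j].mean(axis=0)
    # scale = sqrt(w*h), aspect ratio = w/h
    return [(float(np.sqrt(w * h)), float(w / h)) for w, h in centers]
```

Each centroid is reported as a (scale, aspect ratio) pair, the same parameterization SSD uses for its default boxes, so the cluster centers can be plugged directly into the default box configuration.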
Finally, we found that a model trained with grayscale images does not degrade
compared to one trained with color images under the same data augmentation.
However, if more color distortion augmentation is added to the training set,
the test accuracy increases by around 3%, becoming slightly greater than that
of the model trained with grayscale images. In other words, the performance of
SSD models does not rely heavily on color information. Besides, on a specific
video stream, the SSD model with single-channel input achieves an average
inference time of 17.7 ms, slightly shorter than the 18.7 ms of the SSD model
with color input images.
Appendix A
Figure A.1: SSD300 vs. SSD300 (enhanced). Boxes with objectness score of 0.5
or higher are shown: (a) SSD300 with 8732 prior boxes; (b) SSD300 with 11620
boxes.
Figure A.2: SSD512 vs. SSD512 (enhanced). Boxes with objectness score of 0.5
or higher are shown: (a) SSD512 with 23574 prior boxes; (b) SSD512 with 31766
boxes.
Appendix B
Figure B.2: SSD512 (enhanced) detection output.
Appendix C
Figure C.1: SSD300 (enhanced) model detection output, with an average inference
time of 0.0177 s per frame (57 FPS).
Figure C.2: SSD512 (enhanced) model detection output, with an average inference
time of 0.032 s per frame (31 FPS).
TRITA-EECS-EX-2018:263
ISSN 1653-5146
www.kth.se