
mathematics

Article
Target Fusion Detection of LiDAR and Camera Based
on the Improved YOLO Algorithm
Jian Han, Yaping Liao, Junyou Zhang *, Shufeng Wang and Sixian Li
College of Transportation, Shandong University of Science and Technology, Huangdao District,
Qingdao 266590, China; hanjianzrx@163.com (J.H.); liaoyapingsk@163.com (Y.L.);
shufengwang@sdust.edu.cn (S.W.); gsypsbyw666@163.com (S.L.)
* Correspondence: junyouzhang@sdust.edu.cn; Tel.: +86-139-0532-3314

Received: 7 September 2018; Accepted: 17 October 2018; Published: 19 October 2018 

Abstract: Target detection plays a key role in the safe driving of autonomous vehicles. At present,
most studies use a single sensor to collect obstacle information, but a single sensor cannot cope with
the complex urban road environment, and its rate of missed detection is high. Therefore, this paper
presents a detection fusion system integrating LiDAR and a color camera. Based on the original
You Only Look Once (YOLO) algorithm, a secondary detection scheme is proposed to improve the
YOLO algorithm for dim targets such as non-motorized vehicles and pedestrians. Many image
samples are used to train the YOLO algorithm to obtain the relevant parameters and establish the
target detection model. Then, decision-level sensor fusion is introduced to fuse the color image and
the depth image and improve the accuracy of target detection. Finally, test samples are used to verify
the decision-level fusion. The results show that the improved YOLO algorithm and decision-level
fusion achieve high target detection accuracy, meet real-time requirements, and reduce the rate of
missed detection of dim targets such as non-motor vehicles and pedestrians. Thus, the method in this
paper, considering both accuracy and real-time performance, offers better performance and broader
application prospects.

Keywords: autonomous vehicle; target detection; multi-sensors; fusion; YOLO

1. Introduction
To improve road traffic safety, autonomous vehicles have become the mainstream of future traffic
development in the world. Target recognition is one of the fundamental parts to ensure the safe driving
of autonomous vehicles, which needs the help of various sensors. In recent years, the most popular
sensors include LiDAR and color camera, due to their excellent performance in the field of obstacle
detection and modeling.
The color cameras can capture images of real-time traffic scenes and use target detection to
find where the target is located. Compared with the traditional target detection methods, the deep
learning-based detection method can provide more accurate information, and therefore has gradually
become a research trend. In deep learning, convolutional neural networks combine artificial neural
networks with convolution operations to identify a variety of targets, and they are robust to a certain
degree of distortion and deformation [1]. You Only Look Once (YOLO) is a real-time target detection
model based on a convolutional neural network. Owing to its ability to learn from massive data, its
capability to extract point-to-point features and its good real-time recognition performance [2], YOLO has become
a benchmark in the field of target detection. Gao et al. [3] clustered the selected initial candidate boxes,
reorganized the feature maps, and expanded the number of horizontal candidate boxes to construct the
YOLO-based pedestrian (YOLO-P) detector, which reduced the missed rate for pedestrians. However,
the YOLO model was limited to static image detection, which greatly limits the detection of dynamic
pedestrian changes. Thus, based on the original YOLO, Yang et al. [4] merged it with the
detection algorithm DPM (Deformable Part Model) and R-FCN (Region-based Fully Convolutional
Network), designed an extraction algorithm that could reduce the loss of feature information, and then
used this algorithm to identify situations involving privacy in the smart home environment. However,
this algorithm divides the grid of the recognition image into 14 × 14. Although dim objects can be
extracted, the computational workload does not meet real-time requirements. Nguyen et al. [5] extracted the
information features of grayscale image and used them as the input layer of YOLO model. However,
the process of extracting information using the alternating direction multiplier method to form the
input layer takes much more time, and the application can be greatly limited.
LiDAR can obtain three-dimensional information of the driving environment, which gives it unique
advantages in detecting and tracking obstacles, measuring speed, and navigating and positioning the
vehicle. Dynamic obstacle detection and tracking is a research hotspot in the field of LiDAR.
Many scholars have conducted a lot of research on it. Azim et al. [6] proposed the ratio characteristics
method to distinguish moving obstacles. However, it only uses numerical values to judge the type of
object, which might result in a high missed rate when the regional point cloud data are sparse or the
detection region is blocked. Zhou et al. [7] used a distance-based vehicle clustering algorithm to identify
vehicles based on multi-feature information fusion after confirming the feature information, and used a
deterministic method to perform the target correlation. However, the multi-feature information fusion
is cumbersome, the rules are not clear, and the correlated methods cannot handle the appearance and
disappearance of targets. Asvadi et al. [8] proposed a 3D voxel-based representation method, and used
a discriminative analysis method to model obstacles. This method is relatively novel, and can be
used to merge the color information from images in the future to provide more robust static/moving
obstacle detection.
All of these above studies use a single sensor for target detection. The image information of color
camera will be affected by the ambient light, and LiDAR cannot give full play to its advantages in
foggy and hot weather. Thus, the performance and recognition accuracy of the single sensor is low in
the complex urban traffic environment, which cannot meet the security needs of autonomous vehicles.
To adapt to the complexity and variability of the traffic environment, some studies use color
camera and LiDAR to detect the target simultaneously on the autonomous vehicle, and then provide
sufficient environmental information for the vehicle through the fusion method. Asvadi et al. [9] used
a convolutional neural network method to extract the obstacle information based on three detectors
designed by combining the dense depth map and dense reflection map output from the 3D LiDAR
and the color images output from the camera. Xue et al. [10] proposed a vision-centered multi-sensor
fusion framework for autonomous driving in traffic environment perception and integrated sensor
information of LiDAR to achieve efficient autonomous positioning and obstacle perception through
geometric and semantic constraints, but the process and algorithm of multiple sensor fusion are too
complex to meet the requirements of real-time. In addition, references [9,10] did not consider the
existence of dimmer targets such as pedestrians and non-motor vehicles.
Based on the above analysis, this paper presents a multi-sensor (color camera and LiDAR)
and multi-modality (color image and LiDAR depth image) real-time target detection system. Firstly,
color image and depth image of the obstacle are obtained using color camera and LiDAR, respectively,
and are input into the improved YOLO detection model frame. Then, after the convolution and pooling
processing, the detection bounding box for each mode is output. Finally, the two types of detection
bounding boxes are fused on the decision-level to obtain the accurate detection target.
In particular, the contributions of this article are as follows:

(1) By incorporating the proposed secondary detection scheme into the algorithm, the YOLO target
detection model is improved to detect the targets effectively. Then, decision level fusion is
introduced to fuse the image information of LiDAR and color camera output from the YOLO
model. Thus, it can improve the target detection accuracy.
(2) The proposed fusion system has been built in related environments, and the optimal parameter
    configuration of the algorithm has been obtained through training with many samples.

2. System Method Overview

2.1. LiDAR and Color Camera

The sensors used in this paper include a color camera and LiDAR, as shown in Figure 1.

Figure 1. Installation layout of two sensors.

The LiDAR is a Velodyne (Velodyne LiDAR, San Jose, CA, USA) 64-line three-dimensional radar
system which can send a detection signal (laser beam) to a target, and then compare the received signal
reflected from the target (the echo of the target) with the transmitted signal. After proper processing,
the relevant information of the target can be obtained. The LiDAR is installed at the top center of a
vehicle and is capable of detecting environmental information through high-speed rotation scanning [11].
The LiDAR can emit 64 laser beams at the head. These laser beams are divided into four groups and
each group has 16 laser emitters [12]. The head rotation angle is 360° and the detectable distance is
120 m [13]. The 64-line LiDAR has 64 fixed laser transmitters. Through a fixed pitch angle, it can get
surrounding environmental information for each ∆t and output a series of three-dimensional coordinate
points. Then, the 64 points (p1, p2, ..., p64) acquired by the transmitters are marked, and the distance
from each point in the scene to the LiDAR is used as the pixel value to obtain a depth image.
The color camera is installed under the top LiDAR. The position of the camera is adjusted according
to the axis of the transverse and longitudinal center of the camera image and the transverse and
longitudinal orthogonal plane formed with the laser projector, so that the camcorder angle and the yaw
angle are approximately 0, and the pitch angle is approximately 0. Color images can be obtained directly
from the color camera, but the images output from LiDAR and camera must be matched in time and
space to realize the information synchronization of the two.

2.2. Image Calibration and Synchronization

To integrate information in the vehicle environment perceptual system, information calibration
and synchronization need to be completed.

2.2.1. Information Calibration

(1) The installation calibration of LiDAR: The midpoints of the front bumper and windshield
can be measured with a tape measure, and, according to these two midpoints, the straight line of the
central axis of the test vehicle can be marked by the laser thrower. Then, on the central axis, a straight
line perpendicular to the central axis is marked at a distance of 10 m from the rear axle of the test
vehicle; the longitudinal axis of the radar center can be measured by a ruler, and corrected by the
longitudinal beam perpendicular to the ground with a laser thrower, to make the longitudinal axis and
the beam coincide, and the lateral shift of the radar is approximately 0 m. The horizontal beam of the
laser thrower is coincided with the transverse axis of the radar, then the lateral shift of the radar is
approximately 0 m.
(2) The installation calibration of camera: The position of the camera is adjusted according to the
axis of the transverse and longitudinal center of the camera image and the transverse and longitudinal
orthogonal plane formed with the laser projector, so that the camcorder angle and the yaw angle are
approximated to 0. Then, the plumb line is used to adjust the pitch angle of the camera to approximately 0.

2.2.2. Information Synchronization

(1) Space matching
Space matching requires the space alignment of vehicle sensors. Assuming that the Velodyne
coordinate system is Ov-XvYvZv and the color camera coordinate system is Op-XpYpZp, the camera
coordinate system is in a translational relationship with respect to the Velodyne coordinate system.
The fixing angle between the sensors is adjusted to unify the camera coordinates to the Velodyne
coordinate system. Assuming that the vertical height difference between the LiDAR and the color
camera is ∆h, the conversion relationship of a point “M” in space is as follows:

[X_V^m, Y_V^m, Z_V^m]^T = [X_P^m, Y_P^m, Z_P^m]^T + [0, 0, ∆h]^T    (1)

(2) Time matching
The method of matching in time is to create a data collection thread for the LiDAR and the camera,
respectively. By setting the same acquisition frame rate of 30 fps, the data matching in time is achieved.
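As a concrete illustration of the space and time matching above, the following Python sketch applies
the translation of Equation (1) and pairs frames by nearest timestamp; the ∆h value, timestamp format
and tolerance are assumptions made for illustration and are not taken from the authors' implementation.

import numpy as np

DELTA_H = 0.35  # assumed vertical offset (m) between camera and LiDAR origins; value is illustrative

def camera_to_velodyne(points_p):
    # Equation (1): [X_V, Y_V, Z_V] = [X_P, Y_P, Z_P] + [0, 0, delta_h]
    points_p = np.asarray(points_p, dtype=float)        # shape (N, 3)
    return points_p + np.array([0.0, 0.0, DELTA_H])

def match_frames(lidar_stamps, camera_stamps, fps=30.0):
    # Pair each LiDAR frame with the camera frame closest in time; both collection threads
    # acquire at the same nominal rate, so half a frame period is used as the tolerance.
    camera_stamps = np.asarray(camera_stamps, dtype=float)
    max_gap = 0.5 / fps
    pairs = []
    for i, t in enumerate(lidar_stamps):
        j = int(np.argmin(np.abs(camera_stamps - t)))
        if abs(camera_stamps[j] - t) <= max_gap:
            pairs.append((i, j))
    return pairs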
2.3. The Process of Target Detection
The target detection process based on sensor fusion is shown in Figure 2. After collecting
information from the traffic scene, the LiDAR and the color camera output the depth image and the
color image, respectively, and input them into the improved YOLO algorithm (the algorithm has been
trained by many images collected by LiDAR and color camera) to construct target detection Models
1 and 2. Then, the decision-level fusion is performed to obtain the final target recognition model,
which realizes the multi-sensor information fusion.

Figure 2. The flow chart of multi-modal target detection.

3. Obstacle Detection Method

3.1. The Original YOLO Algorithm

You Only Look Once (YOLO) is a single convolution neural network to predict the bounding
boxes and the target categories from full images, which divides the input image into S × S cells and
predicts multiple bounding boxes with their class probabilities for each cell. The architecture of YOLO
is composed of input layer, convolution layer, pooling layer, fully connected layer and output layer.
The convolution layer is used to extract the image features, the full connection layer is used to predict
the position of image and the estimated probability values of target categories, and the pooling layer is
responsible for reducing the pixels of the slice.
The YOLO network architecture is shown in Figure 3 [14].

Figure 3. The YOLO network architecture. The detection network has 24 convolutional layers followed
by two fully connected layers. Alternating 1 × 1 convolutional layers reduce the features space from
preceding layers. We pre-train the convolutional layers on the ImageNet classification task at half the
resolution (224 × 224 input images) and then double the resolution for detection.

Assume that B is the number of sliding windows used for each cell to predict objects and C is the
total number of categories; then the dimension of the output layer is S × S × (B × 5 + C).
The output model of each detected border is as follows:

T = (x, y, w, h, c)    (2)

where (x, y) represents the center coordinates of the bounding box and (w, h) represents the height
and width of the detection bounding box. The above four indexes have been normalized with respect
to the width and height of the image. c is the confidence score, which reflects the probability value of
the current window containing the accuracy of the detection object, and the formula is as follows:

c = Po × PIOU    (3)

where Po indicates the probability of including the detection object in the sliding window, and PIOU
indicates the overlapping area ratio of the sliding window and the real detected object:

PIOU = Area(BBi ∩ BBg) / Area(BBi ∪ BBg)    (4)

In the formula, BBi is the detection bounding box, and BBg is the reference standard box based on
the training label.
For the regression method in YOLO, the loss function can be calculated as follows:

F(loss) = λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(xi − x̂i)² + (yi − ŷi)²]
        + λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(√ωi − √ω̂i)² + (√hi − √ĥi)²]
        + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (Ci − Ĉi)²
        + λnoobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (Ci − Ĉi)²
        + Σ_{i=0}^{S²} 1_{i}^{obj} Σ_{c∈classes} (pi(c) − p̂i(c))²    (5)

1_{i}^{obj} denotes that the grid cell i contains part of the traffic objects. 1_{ij}^{obj} represents the j-th
bounding box in grid cell i. Conversely, 1_{ij}^{noobj} represents the j-th bounding box in grid cell i which
does not contain any part of the traffic objects. The time complexity of Formula (5) is O((k + c) × S²),
which is calculated for one image.
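To make Equations (2)–(4) and the output size S × S × (B × 5 + C) concrete, the following Python sketch
computes the output dimension and the IoU-based confidence of a predicted box against a reference box;
the helper names and the default values (7 × 7 grid, 2 boxes per cell, 6 classes as used later in this paper)
are illustrative assumptions, not the authors' code.

def output_layer_size(s=7, b=2, c=6):
    # Dimension of the YOLO output layer: S x S x (B x 5 + C)
    return s * s * (b * 5 + c)

def p_iou(bb_i, bb_g):
    # Equation (4): boxes are (x_min, y_min, x_max, y_max)
    ix1, iy1 = max(bb_i[0], bb_g[0]), max(bb_i[1], bb_g[1])
    ix2, iy2 = min(bb_i[2], bb_g[2]), min(bb_i[3], bb_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_i = (bb_i[2] - bb_i[0]) * (bb_i[3] - bb_i[1])
    area_g = (bb_g[2] - bb_g[0]) * (bb_g[3] - bb_g[1])
    union = area_i + area_g - inter
    return inter / union if union > 0 else 0.0

def confidence(p_o, bb_i, bb_g):
    # Equation (3): c = Po x P_IOU
    return p_o * p_iou(bb_i, bb_g)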

3.2. The Improved YOLO Algorithm


In the application process of the original YOLO algorithm, the following issues are found:
(1) YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only
predicts two boxes and can only have one class. This spatial constraint limits the number of
nearby objects that our model can predict.
(2) The cell division of the image is set as 7 × 7 in the original YOLO model, which can only detect
large traffic objects such as buses, cars and trucks, but does not meet the requirements of cell
division of the picture for dim objects such as non-motor vehicles and pedestrians. When the
target is close to the safe distance from the autonomous vehicle and the confidence score of the
detection target is low, it is easy to ignore the existence of the target, which causes a security risk.
Based on the above deficiencies, this paper improves the original YOLO algorithm as follows:
(1) To eliminate the problem of redundant time caused by the identification of undesired targets,
and according to the size and driving characteristics of common targets in traffic scenes, the total
number of categories is set to six types, including {bus, car, truck, non-motor vehicle, pedestrian and
others}. (2) For the issue of non-motor vehicle and pedestrian detection, this paper proposes a
secondary image detection scheme. The cell division of the image is kept as 7 × 7, and the sliding
window convolution kernel is set as 3 × 3.
The whole target detection process of the improved YOLO algorithm is shown in Figure 4, and the
steps are as follows:

(1) When the target is identified, the confidence score c is higher than the maximum threshold
τ1 , indicating that the recognition accuracy is high, and the frame model of target detection is
directly output.
(2) When the recognition categories are {bus, car and truck}, and the confidence score is τ0 ≤ c < τ1
(τ0 is the minimum threshold), indicating such targets are large in size and easy to detect, and they
can be recognized at the next moment, the current border detection model can be directly output.
(3) When the recognition categories are {non-motor vehicle and pedestrian}, the confidence score is
τ0 ≤ c < τ1 . Due to the dim size and mobility of such targets, it is impossible to accurately predict
the position of the next moment. At this time, this target is marked as {others}, indicating that it
is required to be detected further. Then, the next steps need to be performed:

(3a) When the distance l between the target marked as {others} and the autonomous vehicle is
less than the safety distance l0 (the distance that does not affect decision making; if the
distance exceeds it, the target can be ignored), i.e., l ≤ l0 , the slider region divided as
{other} is marked, and the region is subdivided into 9 × 9 cells. The secondary convolution
operation is performed again. When the confidence score c of the secondary detection is
higher than the threshold τ1 , the border model of {others} is output, and the category is
changed from {others} to {non-motor vehicle} or {pedestrian}. When the confidence score
of the secondary detection is lower than the threshold τ1 , it is determined that the target
does not belong to the classification item, and the target is eliminated.
(3b) When l > l0, this target is kept as {others}. It does not require a secondary convolution
operation.

The original YOLO algorithm fails to distinguish and recognize the targets according to their
characteristics, and may lose some targets. The improved YOLO algorithm can try to detect the target
twice within a certain distance according to the dim characteristic of pedestrians and non-motor
vehicles. Thus, it can reduce the missing rate of the target, output a more comprehensive scene
model and ensure the safe driving of vehicles.
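The decision flow of this secondary detection scheme (summarized in Figure 4 below) can be sketched
as follows; the thresholds, the distance estimate and the 9 × 9 re-detection routine are placeholders
supplied by the trained model and are assumptions for illustration, not the authors' implementation.

LARGE = {"bus", "car", "truck"}
DIM = {"non-motor vehicle", "pedestrian"}

def secondary_detection(det, tau0, tau1, l0, redetect_9x9):
    # det: dict with 'category', 'confidence', 'box' and 'distance' (to the ego vehicle).
    # redetect_9x9: callable that re-runs detection on det['box'] with a 9 x 9 cell division
    # and returns (category, confidence).
    c = det["confidence"]
    if c >= tau1:                                   # step (1): confident detection, output directly
        return det
    if tau0 <= c < tau1 and det["category"] in LARGE:
        return det                                  # step (2): large target, output current border model
    if tau0 <= c < tau1 and det["category"] in DIM:
        det["category"] = "others"                  # step (3): mark for further inspection
        if det["distance"] <= l0:                   # step (3a): within the safety distance
            category, c2 = redetect_9x9(det["box"])
            if c2 >= tau1:
                det["category"], det["confidence"] = category, c2
                return det
            return None                             # not a valid class: eliminate the target
        return det                                  # step (3b): keep as {others}, no second pass
    return None                                     # below tau0 (assumed here): discard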

Figure 4. The flow chart of secondary image detection program. Object ∈ large means that targets
are {bus, car, truck}.

4. Decision-Level Fusion of the Detection Information

After inputting the depth image and color image into the improved YOLO algorithm, the detected
target frame and confidence score are output, and then the final target model is output based on the
fusion distance measurement matrix for decision level fusion.

4.1. Theory of Data Fusion

It is assumed that multiple sensors measure the same parameter, and the data measured by
the i sensor and the j sensor are Xi and Xj, and both obey the Gaussian distribution, and their pdf
(probability distribution function) curve is used as the characteristic function of the sensor and is
denoted as pi(x), pj(x). xi and xj are the observations of Xi and Xj, respectively. To reflect the deviation
between xi and xj, the confidence distance measure is introduced [15]:

dij = 2 ∫_{xi}^{xj} pi(x/xi) dx    (6)

dji = 2 ∫_{xj}^{xi} pj(x/xj) dx    (7)

Among them:

pi(x/xi) = (1/(√(2π) σi)) exp{−(1/2) [(x − xi)/σi]²}    (8)

pj(x/xj) = (1/(√(2π) σj)) exp{−(1/2) [(x − xj)/σj]²}    (9)

The value of dij is called the confidence distance measure of the i sensor and the j sensor
observation, and its value can be directly obtained by means of the error function erf(θ), namely:

dij = erf[(xj − xi)/(√2 σi)]    (10)

dji = erf[(xi − xj)/(√2 σj)]    (11)

If there are n sensors measuring the same indicator parameter, the confidence distance measure
dij (i, j = 1, 2, ..., n) constitutes the confidence distance matrix Dn of the multi-sensor data:

       | d11  d12  ···  d1n |
Dn =   | d21  d22  ···  d2n |    (12)
       | ···  ···       ··· |
       | dn1  dn2  ···  dnn |

The general fusion method is to use experience to give an upper bound βij of fusion, and then the
degree of fusion between sensors is:

rij = { 1, dij ≤ βij
      { 0, dij > βij    (13)

In this paper, there are two sensors, i.e., LiDAR and color camera, so i, j = 1, 2. Then,
taking βij = 0.5 [16], r12 is set as the degree of fusion between the two sensors. Figure 5 explains
the fusion process.

(1) When r12 = 0, it means that the two sets of border models (green and blue areas) do not completely
overlap. At this time, the overlapping area is taken as the final detection model (red area).
The fusion process is shown in Figure 5a,b.
(2) When r12 = 1, it indicates that the two border models (green and blue areas) basically coincide
with each other. At this time, all border model areas are valid and expanded to the standard
border model (red area). The fusion process is shown in Figure 5b,c.
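A minimal sketch of this decision-level fusion, under the two-sensor setting with βij = 0.5, might look
as follows; the σ values, the box format and the use of the absolute confidence distance are assumptions
for illustration, and the score average anticipates Equation (14) below.

import math

def fusion_degree(x_i, x_j, sigma_i, beta_ij=0.5):
    # Equations (10) and (13): confidence distance measure and the resulting degree of fusion
    d_ij = abs(math.erf((x_j - x_i) / (math.sqrt(2.0) * sigma_i)))
    return 1 if d_ij <= beta_ij else 0

def fuse_boxes(bb1, bb2, r12):
    # Boxes are (x_min, y_min, x_max, y_max).
    # r12 = 0: keep the overlapping area; r12 = 1: expand to the union region.
    if r12 == 0:
        return (max(bb1[0], bb2[0]), max(bb1[1], bb2[1]),
                min(bb1[2], bb2[2]), min(bb1[3], bb2[3]))
    return (min(bb1[0], bb2[0]), min(bb1[1], bb2[1]),
            max(bb1[2], bb2[2]), max(bb1[3], bb2[3]))

def fuse_confidence(c1, c2):
    # Simple average rule of Equation (14)
    return (c1 + c2) / 2.0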
Figure 5. Decision-level fusion diagram of detection model. Blue area (BB1) is the model output from
the depth image. Green area (BB2) is the model output from the color image. Red area (BB’) is the
final detection model. When r12 = 0, the fusion process is shown in (a). The models not to be fused are
shown in (b). When r12 = 1, the fusion process is shown in (c).

Simple average rules between scores are applied in confidence scores. The formula is as follows:

c = (c1 + c2) / 2    (14)

where c1 is the confidence score of target Model 1, and c2 is the confidence score of target Model 2.
In addition, it should be noted that, when there is only one bounding box, to reduce the missed
detection rate, this bounding box information is retained as the final output result. The final target
detection model can be output through decision-level fusion and confidence scores.
4.2. The Case of the Target Fusion Process

An example of the target fusion process is shown in Figure 6, and the confidence scores obtained
using different sensors can be seen in Table 1.

(1) Figure 6A is a processed depth image. It can be seen that the improved YOLO algorithm identifies
    two targets, a and b, and gives the confidence scores of 0.78 and 0.55, respectively.
(2) Figure 6B is a color image. It can be seen that three targets, a, b, and c, are identified and the
    confidence scores are given as 0.86, 0.53 and 0.51, respectively.
(3) The red box in Figure 6C is the final target model after fusion:

    (1) For target a, according to the decision-level fusion scheme, the result r12 = 0 is obtained;
        then, the overlapping area is taken as the final detection model, and the confidence score
        after fusion is 0.82, as shown in Figure 6C (a’).
    (2) For target b, according to the decision-level fusion scheme, the result r12 = 1 is obtained;
        then, the union of all regions is taken as the final detection model, and the confidence
        score after fusion is 0.54, as shown in Figure 6C (b’).
    (3) For target c, since there is no such information in Figure 6A, and Figure 6B identifies the
        pedestrian information on the right, according to the fusion rule, the bounding box
        information of c in Figure 6B is retained as the final output result, and the confidence score
        is kept as 0.51, as shown in Figure 6C (c’).
Table 1. Confidence scores obtained using different sensors.

                        Confidence Score (Detected Object from Left to Right)
Sensor                  a (a’)          b (b’)          c (c’)
LiDAR                   0.78            0.55            –
Color camera            0.86            0.53            0.51
The fusion of both      0.82            0.54            0.26

Figure 6. An example of target detection fusion process. (A) is a processed depth image. The models
detected a and b are shown with blue. (B) is a color image. The models detected a, b and c are shown
with green. (C) is the final target model after fusion. The models fused a’, b’ and c’ are shown with red.
5. Results and Discussion

5.1. Conditional Configuration

The target detection training dataset included 3000 frames of 1500 × 630 resolution images and
was divided into six different categories: bus, car, truck, non-motor vehicle, pedestrian and others.
The dataset was partitioned into three subsets: 60% as training set (1800 observations), 20% as
validation set (600 observations), and 20% as testing set (600 observations).
The autonomous vehicles collected data on and off campus. The shooting equipment included a
color camera and a Velodyne 64-line LiDAR. The camera was synchronized with a 10 Hz spinning
LiDAR. The Velodyne has 64-layer vertical resolution, 0.09° angular resolution, 2 cm of distance
accuracy, and captures 100 k points per cycle [9]. The processing platform was completed in the PC
segment, including an i5 processor (Intel Corporation, Santa Clara, CA, USA) and a GPU (NVIDIA,
Santa Clara, CA, USA). The improved YOLO algorithm was accomplished by building a Darknet
framework and using Python (Python 3.6.0) for programming.

5.2. Time Performance Testing


The whole process included the extraction of depth image and color image, and they were,
respectively, substituted into the improved YOLO algorithm and the proposed decision-level fusion
scheme as the input layer. The improved YOLO algorithm involved the image grid’s secondary
detection process and is therefore slightly slower than the normal recognition process. The amount of
computation to implement the different steps of the environment and algorithm is shown in Figure 7.
In the figure, it can be seen that the average time to process each frame is 81 ms (about 13 fps).
Considering that the operating frequency of the camera and Velodyne LiDAR is about 10 Hz, it can
meet the real-time requirements of traffic scenes.

Figure 7. Processing time for each step of the inspection system (in ms).
5.3. Training Model Parameters Analysis

The training of the model takes more time, so the setting of related parameters in the model has a
great impact on performance and accuracy. Because the YOLO model involved in this article has been
modified from the initial model, the relevant parameters in the original model need to be reconfigured
through training tests.
The training step will affect the training time and the setting of other parameters. For this
purpose, eight steps of training scale were designed. Under the learning rate of 0.001 given by
YOLO, the confidence prediction score, actual score, and recognition time of the model are statistically
analyzed. Table 2 shows the performance of the BB2 model, and Figure 8 shows the example results
of the BB2 model under D1 (green solid line), D3 (violet solid line), D7 (yellow solid line) and D8
(red solid line).
Table 2 shows that, with the increase of training steps, the confidence score for the BB2 model is
constantly increasing, and the actual confidence level is also in a rising trend. When the training step
reaches 10,000, the actual confidence score arrives at the highest value of 0.947. However, when the
training step reaches 20,000, the actual confidence score begins to fall, and the recognition time also
slightly increases, which is related to the configuration of the model and the selection of the learning rate.

Table 2. Performance of BB2 model under different steps.

Mark    Number of Steps    Estimated Confidence    Actual Confidence    Recognition Time (ms)
D1      4000               0.718                   0.739                38.42
D2      5000               0.740                   0.771                38.40
D3      6000               0.781                   0.800                38.33
D4      7000               0.825                   0.842                38.27
D5      8000               0.862                   0.885                38.20
D6      9000               0.899                   0.923                38.12
D7      10,000             0.923                   0.947                38.37
D8      20,000             0.940                   0.885                38.50

Figure 8 shows the vehicle identification with the training steps of 4000, 6000, 10,000, and 20,000.
The yellow dotted box indicates the recognition result when the number of training steps is 10,000.
Clearly, the model box basically covers the entire goal and has almost no redundant area. Based on the
above analysis, the number of steps set in this paper is 10,000.

Figure 8. Performance comparison of BB2 model under 4 kinds of training steps.

The learning rate determines the speed at which the parameters are moved to the optimal value.
To find the optimal learning rate, the model performances with the learning rates of 10−7, 10−6, 10−5,
10−4, 10−3, 10−2, 10−1 and 1 are estimated, respectively, when the training step is set to 10,000.
Table 3 shows the estimated confidence scores and final scores of the output detection models
BB1 and BB2 under different learning rates. Figure 9 shows the change trend of the confidence score.
After analyzing Table 3 and Figure 9, we can see that, with the decrease of the learning rate, both the
confidence prediction score and the actual score of the model first experienced a rising trend and then
a decreasing one. When the learning rate reaches D3 (10−2), the confidence score reaches a maximum
value, and the confidence level remains within a stable range with the change of learning rate. Based
on the above analysis, when the learning rate is 10−2, the proposed model can obtain a more accurate
recognition rate.

Table 3. Model performance under different learning rates.

Mark    Learning Rate    Estimated Confidence            Actual Confidence
                         BB1          BB2                BB1          BB2
D1      1                0.772        0.73               0.827        0.853
D2      10−1             0.881        0.864              0.911        0.938
D3      10−2             0.894        0.912              0.932        0.959
D4      10−3             0.846        0.85               0.894        0.928
D5      10−4             0.76         0.773              0.889        0.911
D6      10−5             0.665        0.68               0.874        0.892
D7      10−6             0.619        0.62               0.833        0.851
D8      10−7             0.548        0.557              0.802        0.822

Figure 9. Performance trends under different learning rates.

5.4. Evaluation of Experiment Results

The paper takes the IOU as the evaluation criteria of recognition accuracy obtained by comparing
the BBi (i = 1, 2) of the output model and the BBg of the actual target model, and defines three
evaluation grades:

(1) Low precision: Vehicle targets can be identified within the overlap area, and the identified
    effective area accounts for 60% of the model total area.
(2) Medium precision: Vehicle targets are more accurately identified in overlapping areas, and the
    identified effective area accounts for 80% of the model's total area.
(3) High precision: The vehicle is accurately identified in the overlapping area, and the identified
    effective area accounts for 90% of the model total area. Figure 10 is used to describe the definition
    of evaluation grade. The red dotted frame area is the target actual area and the black frame area
    is the area BBi output from the model.

Figure 10. The definition of evaluation grade. The yellow area is the identified effective area. The black
frame area is the model's total area. The above proportion is the ratio between the yellow area and the
black area.

To avoid the influence caused by the imbalance of all kinds of samples, the precision and recall
were introduced to evaluate the box model under the above three levels:

Precision = TP / (TP + FP)    (15)

Recall = TP / (TP + FN)    (16)

In the formula, TP, FP, and FN indicate the correctly defined examples, wrongly defined examples
and wrongly negative examples, respectively. The Precision–Recall diagram for each model BBi (i = 1, 2)
is calculated, as shown in Figure 11a,b.
When the recall is less than 0.4, all the accuracy under the three levels is high; when the recall
reaches around 0.6, only the accuracy of the level hard decreases sharply and tends to zero, while the
accuracy of the other two levels is basically maintained at a relatively high level. Therefore, when the
requirement of level for target detection is not very high, the method proposed in this paper can fully
satisfy the needs of vehicle detection under real road conditions.

Figure 11. Detection performance of the target. (A) is the performance relationship of model BB1.
(B) is the performance relationship of model BB2.

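For reference, the precision and recall used for the curves in Figure 11 follow directly from Equations (15)
and (16); a trivial sketch is given below (counting TP, FP and FN against the IOU-based grades is assumed
to happen elsewhere).

def precision(tp, fp):
    # Equation (15)
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    # Equation (16)
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0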
5.5. Method Comparison

The method proposed in this paper is compared with the current more advanced algorithms.
The indicators are mainly mAP (mean average precision) and FPS (frames per second). The results
obtained are shown in Table 4.

Table 4. Comparison of the training results of all algorithms.

Algorithms                     mAP     FPS
YOLO [17]                      63.4    45
Fast R-CNN [18]                70.0    0.5
Faster R-CNN [19]              73.2    7
Projection [20]                96.2    8
3D FCN [21]                    64.2    0.2
Vote3D [22]                    47.9    2
The improved YOLO algorithm    82.9    13

In Table 4, the recognition accuracy of the improved algorithm proposed in this paper is better
than that of the original YOLO algorithm. This is related to the fusion decision of the two images and
the proposed secondary image detection scheme. To ensure the accuracy, the detection frame number
of the improved YOLO dropped from 45 to 13, and the running time increased, but it can fully meet
the normal target detection requirements and ensure the normal driving of autonomous vehicles.
6. Conclusions

This paper presents a detection fusion system integrating LiDAR and color camera. Based on
the original YOLO algorithm, the secondary detection scheme is proposed to improve the YOLO
algorithm for dim targets such as non-motorized vehicles and pedestrians. Then, the decision level
fusion of sensors is introduced to fuse the color image of the color camera and the depth image of
LiDAR to improve the accuracy of the target detection. The final experimental results show that,
when the training step is set to 10,000 and the learning rate is 0.01, the performance of the model
proposed in this paper is optimal and the Precision–Recall performance relationship could satisfy the
target detection in most cases. In addition, in the aspect of algorithm comparison, under the
requirement of both accuracy and real-time, the method of this paper has better performance and a
relatively large research prospect.
Since the samples needed in this paper are collected from several traffic scenes, the coverage of
the traffic scenes is relatively narrow. In future research work, we will gradually expand the
complexity of the scenario and make further improvements to the YOLO algorithm. In the next
experimental session, the influence of environmental factors will be considered, because the
image-based identification method is greatly affected by light. At different distances (0–20 m, 20–50 m,
50–100 m, and >100 m), the intensity level of light is different, so how to deal with the problem of light
intensity and image resolution is the primary basis for target detection.

Author Contributions: Conceptualization, J.Z.; Data curation, J.H. and Y.L.; Formal analysis, J.H.;
Funding acquisition, S.W.; Investigation, J.H. and J.Z.; Methodology, Y.L.; Project administration, Y.L.; Resources,
J.H.; Software, J.Z.; Supervision, J.Z. and S.W.; Validation, S.W.; Visualization, Y.L.; Writing—original draft, J.H.;
and Writing—review and editing, J.Z., S.W. and S.L.
Funding: This work was financially supported by the National Natural Science Foundation of China under Grant
No. 71801144, and the Science & Technology Innovation Fund for Graduate Students of Shandong University of
Science and Technology under Grant No. SDKDYC180373.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Batch Re-normalization of Real-Time Object Detection Algorithm YOLO. Available online: http://www.
arocmag.com/article/02-2018-11-055.html (accessed on 10 November 2017).
2. Liu, Y.; Zhang, Y.; Zhang, X. Adaptive spatial pooling for image classification. Pattern Recognit. 2016, 55,
58–67. [CrossRef]
3. Gao, Z.; Li, S.B.; Chen, J.N.; Li, Z.J. Pedestrian detection method based on YOLO network. Comput. Eng.
2018, 44, 215–219, 226.
4. Improved YOLO Feature Extraction Algorithm and Its Application to Privacy Situation Detection of Social
Robots. Available online: http://kns.cnki.net/kcms/detail/11.2109.TP.20171212.0908.023.html (accessed on
12 December 2017).
5. Nguyen, V.T.; Nguyen, T.B.; Chung, S.T. ConvNets and AGMM Based Real-time Human Detection under
Fisheye Camera for Embedded Surveillance. In Proceedings of the 2016 International Conference on
Information and Communication Technology Convergence (ICTC), Jeju, South Korea, 19–21 October 2016;
pp. 840–845.
6. Azim, A.; Aycard, O. Detection, Classification and Tracking of Moving Objects in a 3D Environment.
In Proceedings of the IEEE Intelligent Vehicles Symposium, Alcala de Henares, Spain, 3–7 June 2012;
pp. 802–807.
7. Zhou, J.J.; Duan, J.M.; Yang, G.Z. A vehicle identification and tracking method based on radar ranging.
Automot. Eng. 2014, 36, 1415–1420, 1414.
8. Asvadi, A.; Premebida, C.; Peixoto, P.; Nunes, U. 3D Lidar-based static and moving obstacle detection in
driving environments: An approach based on voxels and multi-region ground planes. Robot. Auton. Syst.
2016, 83, 299–311. [CrossRef]
9. Asvadi, A.; Garrote, L.; Premebida, C.; Peixoto, P.; Nunes, U.J. Multimodal Vehicle Detection:
Fusing 3D-LIDAR and Color Camera Data. Pattern Recognit. Lett. 2017. [CrossRef]
10. Xue, J.R.; Wang, D.; Du, S.Y. A vision-centered multi-sensor fusing approach to self-localization and obstacle
perception for robotic cars. Front. Inf. Technol. Electron. Eng. 2017, 18, 122–138. [CrossRef]
11. Wang, X.Z.; Li, J.; Li, H.J.; Shang, B.X. Obstacle detection based on 3d laser scanner and range image for
intelligent vehicle. J. Jilin Univ. (Eng. Technol. Ed.) 2016, 46, 360–365.
12. Glennie, C.; Lichti, D.D. Static calibration and analysis of the Velodyne HDL-64E S2 for high accuracy mobile
scanning. Remote Sens. 2010, 2, 1610–1624. [CrossRef]
13. Yang, F.; Zhu, Z.; Gong, X.J.; Liu, J.L. Real-time dynamic obstacle detection and tracking using 3D lidar.
J. Zhejiang Univ. 2012, 46, 1565–1571.
14. Zhang, J.M.; Huang, M.T.; Jin, X.K.; Li, X.D. A real-time Chinese traffic sign detection algorithm based on
modified YOLOv2. Algorithms 2017, 10, 127. [CrossRef]
15. Han, F.; Yang, W.H.; Yuan, X.G. Multi-sensor Data Fusion Based on Correlation Function and Fuzzy Clingy
Degree. J. Proj. Rocket. Missiles Guid. 2009, 29, 227–229, 234.
16. Chen, F.Z. Multi-sensor data fusion mathematics. Math. Pract. Theory 1995, 25, 11–15.
17. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA,
27–30 June 2016; pp. 779–788.
Mathematics 2018, 6, 213 16 of 16

18. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision
(ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
19. Ren, S.; He, K.; Girshick, R. Faster R-CNN: Towards real-time object detection with region proposal networks.
Int. Conf. Neural Inf. Process. Syst. 2015, 39, 91–99. [CrossRef] [PubMed]
20. Wei, P.; Cagle, L.; Reza, T.; Ball, J.; Gafford, J. LiDAR and camera detection fusion in a real-time industrial
multi-sensor collision avoidance system. Electronics 2018, 7, 84. [CrossRef]
21. Li, B. 3D fully convolutional network for vehicle detection in point cloud. arXiv, 2017; arXiv:1611.08069.
22. Voting for Voting in Online Point Cloud Object Detection. Available online: https://www.researchgate.
net/publication/314582192_Voting_for_Voting_in_Online_Point_Cloud_Object_Detection (accessed on
13 July 2015).

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
