
Improvement of Object Detection Based on Faster R-CNN and YOLO

2021 36th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC) | 978-1-6654-3553-6/21/$31.00 ©2021 IEEE | DOI: 10.1109/ITC-CSCC52171.2021.9501480

Jiayi Fan1, JangHyeon Lee2, InSu Jung3, YongKeun Lee1


1 Graduate School of Nano IT Design Fusion, Seoul National University of Science and Technology, Seoul, Korea (fjy1510780466@163.com, yklee@seoultech.ac.kr)
2 Department of Materials Science and Engineering, Korea University, Seoul, Korea (janghyeon5@gmail.com)
3 Department of RI Application, Korea Institute of Radiological & Medical Sciences, Seoul, Korea (jis@kirams.re.kr)

Abstract

The development of artificial intelligence technology has been greatly assisted by object detection. An object detector such as you-only-look-once (YOLO) v2 can detect objects in real time with good accuracy. However, apart from its lower computation cost and faster speed, the single-stage detector YOLO v2 is not as accurate as two-stage detectors like Faster R-CNN; further improvement is needed to increase its accuracy. This paper uses the Kalman filter to fuse Faster R-CNN and YOLO v2 to obtain better detection accuracy. The results from Faster R-CNN serve as the observation due to their better accuracy, while those from YOLO v2 serve as the state variables. Experiments are carried out on video samples containing vehicle images. The results show that fusing the two algorithms with the Kalman filter provides better object detection.

Keywords: Faster R-CNN, Kalman filter, YOLO v2, object detection.

1. Introduction

Object detection is an essential part of computer vision that is actively studied. Object detection tasks include not only finding the location and scale of an object by drawing a bounding box around it, but also classifying the object, which involves assigning a class label to it [1-3]. Object detection is useful in many applications such as security surveillance, manufacturing, and robotics [4, 5].

The power of deep learning has significantly helped the advancement of object detection, where the Convolutional Neural Network (CNN) is essential. The well-known Region-Based CNN (R-CNN) developed by Ross Girshick, which achieves state-of-the-art results on the VOC 2012 and ILSVRC2013 datasets, is composed of a region proposal module, a feature extraction module, and a classifier module [6]. To overcome the limitations of R-CNN, an improved algorithm called Fast R-CNN was proposed, which substantially improves training and testing speed through several innovations such as the Region of Interest (ROI) pooling layer [7]. Then came the further improved Faster R-CNN by Shaoqing Ren, which considerably reduces the amount of computation in the time-consuming region proposal process [8].

Deep learning object detectors can be divided into two categories. The two-stage detector includes a first stage for generating region proposals that are likely to contain objects and a second stage for object classification; typical representatives of this kind are Fast R-CNN and Faster R-CNN. The other category is the single-stage algorithm, which produces region proposals and predicts class probabilities in one evaluation. You Only Look Once (YOLO) v2 and the Single Shot Detector (SSD) are among the single-stage detectors [2, 9, 10]. Single-stage algorithms aim to reduce heavy computation to achieve faster speed, while two-stage algorithms generally have better accuracy but slower speed [11]. A fusion of a single-stage and a two-stage algorithm using a Kalman filter is presented in [11] for unmanned vessel surface object detection; the fused algorithm has better mean average precision and frames per second than the traditional method.

In this paper, the fusion of the object detectors Faster R-CNN and YOLO v2 by a Kalman filter is proposed for vehicle detection. YOLO v2 is fast because it uses only a single-stage detection network, which, however, sacrifices detection accuracy. Faster R-CNN uses a two-stage structure; therefore, it has better accuracy but slower computation speed. To overcome the relatively low accuracy of YOLO v2, the Kalman filtering algorithm is utilized. Faster R-CNN is used as the observation in the

Authorized licensed use limited to: SVKM's NMIMS Mukesh Patel School of Technology Management & Engineering. Downloaded on September 01,2022 at 12:38:10 UTC from IEEE Xplore. Restrictions apply.
[Figure: Faster R-CNN block diagram — Input Image → Feature Map → RPN (Reshape/SoftMax → Proposal; classification loss and bbox regression loss computed) → ROI Pooling → Classification and Bounding Box.]
Fig. 1. Structure of Faster R-CNN.
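The ROI Pooling block in Fig. 1 can be sketched in a few lines. The snippet below is purely illustrative (a single-channel feature map and a 2x2 output grid are assumptions, not values from the paper); it shows how ROIs of different sizes are reduced to one fixed-length output.

```python
import numpy as np

def roi_max_pool(feature_map, output_size=(2, 2)):
    """Illustrative ROI Pooling: split a 2-D feature map into a fixed
    number of roughly equal regions and max-pool each region, so ROIs
    of different sizes all yield the same fixed-size output."""
    h, w = feature_map.shape
    out_h, out_w = output_size
    # Row/column boundaries dividing the map into roughly equal bins.
    row_edges = np.linspace(0, h, out_h + 1).astype(int)
    col_edges = np.linspace(0, w, out_w + 1).astype(int)
    pooled = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            region = feature_map[row_edges[i]:row_edges[i + 1],
                                 col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = region.max()
    return pooled

# Two ROIs of different sizes both map to a fixed 2x2 output.
roi_a = np.arange(36, dtype=float).reshape(6, 6)
roi_b = np.arange(35, dtype=float).reshape(5, 7)
print(roi_max_pool(roi_a).shape)  # (2, 2)
print(roi_max_pool(roi_b).shape)  # (2, 2)
```

The fixed output size is what lets the subsequent fully connected layers accept proposals of arbitrary shape.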


[Figure: YOLO block diagram — Input Image → Conv. Layer → Maxpool Layer → Conv. Layer → Maxpool Layer → … → FC → FC → Bounding Boxes, Classes.]
Fig. 2. Structure of YOLO.
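The single-stage output of the network in Fig. 2 can be illustrated by the shape of its prediction tensor. The values S = 7, B = 2, C = 20 below are the settings of the original YOLO paper [9], shown here only as an example:

```python
# For an S x S grid, each cell predicts B boxes (x, y, w, h, confidence)
# plus C conditional class probabilities: an S x S x (B*5 + C) tensor.
def yolo_output_shape(S, B, C):
    return (S, S, B * 5 + C)

# Grid cell responsible for an object is the one containing its center.
def responsible_cell(cx, cy, img_w, img_h, S):
    return (int(cy / img_h * S), int(cx / img_w * S))

print(yolo_output_shape(7, 2, 20))          # (7, 7, 30)
print(responsible_cell(300, 100, 448, 448, 7))  # (1, 4)
```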


Kalman filter to improve the results from YOLO v2. The state-space model of the object is first built in order to apply the Kalman filter. Experimental results show that by fusing Faster R-CNN and YOLO v2 in the Kalman filter, the accuracy of vehicle detection from video frames can be improved.

2. Review of Object Detection Algorithms

2.1 Faster R-CNN Algorithm

Faster R-CNN is an object detection network composed of a feature extraction network followed by two trainable subnetworks: a novel Region Proposal Network (RPN) and an object classification network. The feature extraction network is typically a pretrained CNN. The first subnetwork, the RPN, is used to generate object proposals, and the second subnetwork is used to predict the class of the object. The block diagram of Faster R-CNN is shown in Fig. 1. The full-image convolutional feature maps are shared between the RPN and region-based detectors like Fast R-CNN. Thus, the marginal cost of computing proposals is small, and region proposals are generated much faster than by other region proposal methods like Selective Search (SS). The RPN can be trained end-to-end to generate region proposals for use in Fast R-CNN for detection. The region proposals of different sizes obtained from the RPN are applied to the ROI Pooling layer, which splits the input feature maps into a fixed number of roughly equal regions and applies Max Pooling to obtain a fixed-length output. The resulting proposal feature maps are then passed to the fully connected layers for classification and bounding box regression.

2.2 YOLO v2 Algorithm

The YOLO network is a new approach to object detection, which has several advantages over traditional methods, such as high speed, small computation cost, and fewer background errors. It sees the entire image during training, is highly generalizable, and is less likely to break down on unexpected inputs. The YOLO detector regards target detection as a regression problem. The YOLO object detector is very fast due to its one-stage structure. The single deep convolutional neural network used by YOLO can predict all bounding boxes and the class probabilities for those boxes simultaneously, using features from the entire image. It is suitable for end-to-end training in real time while maintaining good average precision. The image is first divided into an S x S grid; each grid cell predicts B bounding boxes, including the coordinates, width, height, and confidence score of each box, together with conditional class probabilities. The confidence score is defined as the probability that the box contains an object multiplied by the IOU between the predicted box and the ground truth. Intersection Over Union (IOU) is the most popular evaluation metric for object detection, and it is calculated by IOU = |A ∩ B| / |A ∪ B|, where A and B are the predicted bounding box and the ground truth box, respectively; |A ∩ B| is their intersection area, and |A ∪ B| is their union area. Non-Max Suppression is used if there are multiple detections of the same object: the box with the highest probability is selected. However, YOLO makes a significant number of localization errors and has relatively low recall. YOLO v2 is an improved version with better mean average precision that is still fast. It adds batch normalization layers to help regularize the model and uses anchor boxes to predict bounding boxes: the fully connected layer for predicting bounding boxes is removed and replaced by anchor-box prediction in v2. The architecture of the YOLO algorithm is shown in Fig. 2.

3. Fusion of Faster R-CNN and YOLO v2 using Kalman Filter
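As a rough illustration of the fusion developed in this section, the sketch below runs a linear Kalman filter in which the state estimate is seeded from a YOLO v2 box center and Faster R-CNN box centers act as the measurement. All matrices, noise covariances, and numeric values are illustrative assumptions, not values or code from the paper:

```python
import numpy as np

Ts = 1.0  # sampling time (assumed)

# Constant-velocity model: state = [x, y, vx, vy], measurement = [x, y].
A = np.array([[1, 0, Ts, 0],
              [0, 1, 0, Ts],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 0.01   # process noise covariance (assumed)
R = np.eye(2) * 4.0    # measurement noise covariance (assumed)

def kalman_step(x_hat, P, z):
    """One predict/update cycle: x_hat is the current state estimate,
    z is the Faster R-CNN box center used as the measurement."""
    # Predict with the motion model.
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q
    # Kalman gain, as in Eq. (8).
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    # Correct the prediction with the measurement, as in Eq. (9).
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(4) - K @ H) @ P_pred
    return x_new, P_new

# Toy run: YOLO v2 seeds the state; Faster R-CNN centers correct it.
x_hat = np.array([100.0, 50.0, 0.0, 0.0])  # from YOLO v2 (illustrative)
P = np.eye(4)
for z in [np.array([102.0, 51.0]), np.array([104.0, 52.0])]:  # Faster R-CNN
    x_hat, P = kalman_step(x_hat, P, z)
print(np.round(x_hat[:2], 2))
```

The fused estimate is pulled toward the more accurate Faster R-CNN observations while retaining the smooth motion model driven by the YOLO v2 state.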

[Figure: Kalman filter block diagram — the plant state (from YOLO) is corrected through the output matrix H by the measurement z̄ (from Faster R-CNN).]
Fig. 3. Kalman Filter.

The Kalman filtering algorithm provides an estimate of unknown variables given a series of measurements observed over time. The estimate tends to be more accurate than the measurement, especially when statistical noise and other inaccuracies are present. The Kalman filter is actively used in deep learning for object tracking applications [12, 13].

In order to take advantage of the accuracy of Faster R-CNN and the speed of YOLO v2 in object detection, the two algorithms are fused by using a Kalman filter. As the former algorithm has better accuracy, its results are used as the measurement to adjust those from YOLO v2. The block diagram of the Kalman filter is shown in Fig. 3.

In order to use the Kalman filter to track the objects, a state-space model of the object movement is first constructed. The discrete-time model is given in (1):

x_{k+1} = A x_k + B u_k + w_k
z_k = H x_k + v_k                                    (1)

where

A = [1  0  Ts  0 ;
     0  1  0   Ts;
     0  0  1   0 ;
     0  0  0   1 ]                                   (2)

B = [Ts^2/2  0     ;
     0       Ts^2/2;
     Ts      0     ;
     0       Ts    ]                                 (3)

H = [1  0  0  0;
     0  1  0  0]                                     (4)

x = [x  y  v_x  v_y]^T                               (5)

u = [a_x  a_y]^T                                     (6)

z = [x̄  ȳ]^T                                         (7)

Ts is the sampling time; x and y are the coordinates of the object along the x- and y-axis, respectively; v_x and v_y are the velocities of the object along the x- and y-axis, respectively; a_x and a_y are the accelerations of the object along the x- and y-axis, respectively; and x̄ and ȳ are the measured coordinates along the x- and y-axis, respectively. w is the Gaussian noise of the model, and v is the Gaussian noise of the measurement.

The Kalman gain can be calculated by

K_{k+1} = P_{k+1} H^T (H P_{k+1} H^T + R)^{-1}       (8)

where P is the covariance matrix and R is the measurement noise covariance matrix.

The estimate can then be updated by

x̂_{k+1} = x̂_k + K_{k+1} (z̄_{k+1} − H x̂_k)            (9)

[Figure: x and y coordinate traces over time for the three methods.]
Fig. 4. Comparison of x, y coordinates obtained from YOLO v2, Faster R-CNN, and Kalman Filter.

4. Experimental Results and Discussion

In order to verify the effectiveness of using the Kalman filter to fuse the YOLO v2 detector and the Faster R-CNN detector, an experiment is carried out in the MATLAB/Simulink environment. An annotated driving dataset is used, including frames collected from cameras while driving in cities during daylight conditions.

The 1920x1200 resolution images are first resized to 359x224 to fit the input size of YOLO v2 and Faster R-CNN. A pretrained ResNet-50 network is used for feature extraction to build the YOLO v2 and Faster R-CNN detection networks. After obtaining the coordinates and sizes of the bounding boxes from each algorithm, the results are fed to the Kalman filter.

The coordinates predicted by the Kalman filter are compared with those obtained from YOLO v2 and Faster R-CNN; the results are shown in Fig. 4. It can be seen that the results from the Kalman filter tend to follow the observation (Faster R-CNN), while the results from YOLO v2 are more distant from the observation. It is most evident at 40 s that the y coordinate from YOLO v2 has a large deviation from

the observation, while the one from the Kalman filter is in fair agreement with it.

[Figure: example detections. (a) YOLO v2, (b) Faster R-CNN, (c) Kalman Filter.]
Fig. 5. Comparison of IOU from YOLO v2, Faster R-CNN and Kalman Filter.

The IOU comparison of the two algorithms and their fusion is shown in Fig. 5. It can be seen that YOLO v2 has the lowest IOU, 0.74025, due to its one-stage structure and its sacrifice of accuracy for faster computation speed. Faster R-CNN has a relatively higher IOU, 0.75885, than YOLO v2. After the fusion of the two methods in the Kalman filter, the IOU is the highest, 0.78547, as shown in Fig. 5(c). This demonstrates that by using a fast algorithm with relatively low accuracy as the state variable and a more accurate algorithm as the observation, the results can be improved by fusing them in the Kalman filter.

5. Conclusion

In this paper, vehicle object detection by combining the results from YOLO v2 and Faster R-CNN is proposed. YOLO v2 is fast and has a lower computational cost; however, it somewhat sacrifices detection accuracy. To overcome this problem, the Kalman filter is used to fuse the two popular object detection algorithms. Due to the one-stage structure of YOLO v2 and the two-stage structure of Faster R-CNN, the former has faster speed while the latter has better accuracy. Therefore, in the Kalman filter, the results from Faster R-CNN are used as the observation. An experiment is carried out for vehicle detection, and the results show that the fusion of the two algorithms in the Kalman filter has better detection accuracy.

Acknowledgment

This work was supported by Seoul National University of Science and Technology, Seoul, South Korea.

References

[1] Q. Hu, S. Paisitkriangkrai, C. Shen, A. van den Hengel and F. Porikli, "Fast Detection of Multiple Objects in Traffic Scenes With a Common Detection Framework," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 4, pp. 1002-1014, April 2016.

[2] J. U. Kim and Y. Man Ro, "Attentive Layer Separation for Object Classification and Object Localization in Object Detection," 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 2019, pp. 3995-3999.

[3] X. Wang, H. Ma, X. Chen and S. You, "Edge Preserving and Multi-Scale Contextual Neural Network for Salient Object Detection," IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 121-134, Jan. 2018.

[4] D. Kim, D. Lee, H. Myung and H. Choi, "Object detection and tracking for autonomous underwater robots using weighted template matching," 2012 Oceans - Yeosu, Yeosu, 2012, pp. 1-5.

[5] C. R. del-Blanco, F. Jaureguizar and N. Garcia, "An efficient multiple object detection and tracking framework for automatic counting and video surveillance applications," IEEE Transactions on Consumer Electronics, vol. 58, no. 3, pp. 857-862, August 2012.

[6] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, 2014, pp. 580-587.

[7] R. Girshick, "Fast R-CNN," 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440-1448.

[8] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, June 2017.

[9] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788.

[10] W. Liu et al., "SSD: Single Shot MultiBox Detector," European Conference on Computer Vision (ECCV), Springer, Cham, 2016, pp. 21-37.

[11] X. Song, P. Jiang and H. Zhu, "Research on Unmanned Vessel Surface Object Detection Based on Fusion of SSD and Faster-RCNN," 2019 Chinese Automation Congress (CAC), Hangzhou, China, 2019, pp. 3784-3788.

[12] S. Chang, "A Deep Learning Approach for Localization Systems of High-Speed Objects," IEEE Access, vol. 7, pp. 96521-96530, 2019.

[13] G. Yang and Z. Chen, "Pedestrian Tracking Algorithm for Dense Crowd based on Deep Learning," 2019 6th International Conference on Systems and Informatics (ICSAI), Shanghai, China, 2019, pp. 568-572.
