
2020 International Conference on Computer Network, Electronic and Automation (ICCNEA)

State-of-the-art of Object Detection Model Based on YOLO

Peng Chen
Unit 32382, Beijing, China
E-mail: beimingke@163.com

Qian Zheng
Unit 32382, Beijing, China
E-mail: 1763987144@qq.com

Yuejun Shi
Unit 32382, Beijing, China
E-mail: 519752292@qq.com

Qingchun Wu
Unit 32382, Beijing, China
E-mail: 3247029516@qq.com

Abstract—Object detection has developed rapidly in recent years, and target classification and location have gradually matured. Fast R-CNN, a landmark in accuracy, is a classic two-stage model with relatively low speed. To improve detection efficiency, the YOLO (You Only Look Once) model was proposed. In this article, the research starts from the evolution of CNN for object detection, and the YOLO models are sorted out. We then compare the structures, target outputs and network training of each version. The functional layers evolved into a structure constructed from residual modules, which preserves accuracy. Finally, we discuss the system advantages and future development trends. From the discussion, we conclude that YOLO learns very general representations of objects and integrates various mature new technologies. With the improvement of hardware technology, it will be applied more widely.

Keywords—Classification and Location; Object Detection; CNN
I. INTRODUCTION

In recent years, as a core problem in computer vision, object detection based on CNN (Convolutional Neural Networks) has developed quickly, and new frameworks have emerged in large numbers. Generally, detection is implemented in two steps: one is feature extraction, and the other is classification and location. To extract a set of robust features from the input images in the first step, Haar [1], SIFT [2], HOG [3] and convolutional features [4] are the mainstream methods. Then classifiers or localizers are used to identify the objects in the feature space; they are run either in sliding-window fashion over the whole image or on some subset of regions in the image [5-7]. Many frameworks are based on this essential structure. This paper discusses the most popular one, the YOLO framework, short for You Only Look Once, which has spread fast thanks to its speed and relatively accurate results.

II. THE EVOLUTION OF CNN FOR OBJECT DETECTION

The first practical CNN was LeNet-5, constructed with two convolutional layers, two subsampling layers and two full connection layers, and designed for training on grayscale images. The first landmark framework was AlexNet, the champion of ILSVRC 2012, which is deeper than LeNet. It is constructed with five convolutional layers and three full connection layers, importing three pooling layers, abandoning the subsampling layers, and replacing large convolutions with superpositions of small multilayer convolutions. From then on, the CNN structure is no longer a simple combination of layers but an organic composition of "modules". AlexNet, from which ZF Net is derived, was the first to use the ReLU activation and CUDA, speeding up training greatly.

The second major improvement came with VGG-16 and GoogLeNet, the famous frameworks of ILSVRC 2014. VGG-16 is slightly less efficient than GoogLeNet, the champion of ILSVRC 2014, but it is simpler and faster. VGGNet explores the relationship between the depth of a convolutional neural network and its performance. By repeatedly stacking small convolutional kernels and maximum pooling layers, VGG successfully constructs 16~19-layer deep convolutional neural networks. Compared with the previous state-of-the-art network structures, the error rate of VGG-16 is greatly reduced.

The VGG-16 structure is very neat, without too many hyperparameters. It focuses on building a simple network constructed from groups of convolutional layers, each followed by one pooling layer that compresses the image size. Unlike VGG, GoogLeNet is a bolder attempt, which increases not only the depth but also the width. To avoid the curse of massive parameters, excessive depth and huge net size, GoogLeNet designs the Inception structure, which adds two auxiliary losses at different depths to counter the vanishing of the gradient propagated back from the output, and adds 1x1 convolution kernels to reduce the thickness of the feature map. VGG-16 and GoogLeNet made people aware that deepening the network is an effective way to improve the quality of models, laying the foundation for the follow-up work.
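To make this stacking pattern concrete, the following is a minimal PyTorch sketch of a VGG-style block (our own illustration, not the original VGG code; the layer sizes are arbitrary): several 3x3 convolutions with ReLU, then a 2x2 max-pooling layer that halves the spatial size.

import torch
import torch.nn as nn

def vgg_block(in_channels: int, out_channels: int, num_convs: int) -> nn.Sequential:
    """A VGG-style block: stacked 3x3 convolutions + ReLU, then 2x2 max pooling."""
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halves width and height
    return nn.Sequential(*layers)

# Example: two blocks shrink a 224x224 input to 56x56 while widening the channels.
x = torch.randn(1, 3, 224, 224)
y = vgg_block(64, 128, 2)(vgg_block(3, 64, 2)(x))
print(y.shape)  # torch.Size([1, 128, 56, 56])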
CNN frameworks such as LeNet, AlexNet, VGG and GoogLeNet effectively complete feature extraction and classification. However, another important function must be added for detection: object location. The framework that completes this task is R-CNN (Regions with CNN features) [8]. R-CNN divides a picture into S×S grids, continuously merges the small regional grids using SS (Selective Search), and then extracts the features of each candidate region. Once the feature map is obtained, border regression is used to complete target location; a border regression is usually a prediction of (x, y, w, h). The regions are derived from the concept of the "receptive field", and different receptive fields correspond to different regions. However, R-CNN is not ideal. Firstly, it consumes a lot of computing resources to extract candidate areas because of this receptive-field processing. Secondly, although the human visual system handles size normalization very cleverly, in a computer it greatly increases the amount of calculation. In addition, the large number of overlapping areas leads to double counting in the CNN processing.
Fast R-CNN then uses the ROI Pooling layer to decrease the computing load and increase the speed. In Faster R-CNN, the SS algorithm is replaced by the RPN (Region Proposal Network), a fully convolutional neural network, which increases the speed and further improves the results. In the R-CNN framework, image features are extracted twice: the target is classified first, and the location is computed afterwards. To speed up this process, the YOLO framework was proposed, combining the two steps: it abandons the RPN and uses convolutions to achieve classification and location together. By now, the YOLO framework has developed to the fourth generation, YOLOv1~YOLOv4 [9-12]. In this paper, we analyze this framework in several aspects.

III. THE STRUCTURE OF YOLO

A. Basic Structure

Like R-CNN, YOLOv1 divides the image into S×S grids, but this segmentation is not dense, since candidate boxes do not need to be merged in advance. The structure of YOLOv1 is shown in Fig. 1.

Figure 1. The structure of YOLOv1

The convolution layers at the front extract features, and the full connection layers at the back calculate the output probabilities and coordinates. The main body of the network adopts a structure of 24 convolutional units and 2 full connection layers. There is also a Fast YOLO version of YOLOv1, which cuts the 24 convolutions down to 9, with all the rest being the same.

YOLOv2 improved on the basis of YOLOv1. Its accompanying YOLO9000 model was trained jointly on the COCO detection data set and the ImageNet classification data set and can detect more than 9000 kinds of objects, improving mAP while maintaining speed. The basic classification structure of this model is darknet-19, including 19 convolutional layers, 5 maxpooling layers, one global avgpool layer, and one softmax classification activation layer.

YOLOv3 expanded the backbone to darknet53 on the basis of YOLOv2, which is relatively large. Darknet53 refers to 52 convolutional layers + 1 fully connected layer (achieved by a 1x1 convolution).

Compared with YOLOv2, YOLOv3 no longer has the maxpooling layer or the reorg layer. Residual layers are adopted instead, a multi-resolution method is used in the structure, and YOLO layers of large, medium and small scales are set.

B. Improved module

Nearly all the improvements on top of the basic structure concern the modules, that is, better combinations of the added layers.

1) Batch-normalize layer
A batch-normalize layer is added to each instance of the convolution layer. The primary use of this layer is to prevent the gradient from exploding or vanishing. The calculated error is needed to compute the gradient for stochastic gradient descent, and this error is propagated backwards during back propagation. According to the principle of back propagation, the effects are cumulative and multiplicative; if the error at each layer is not controlled, the gradient increments will gradually vanish or become infinite. Batch-normalize normalizes by batch to control this problem in the back propagation process.

In practice each layer acts like a joint probability function, and multiplied together the whole system is one huge joint probability function whose elements are the inputs of each layer. For normalization, the batch elements are put together to find the mean and variance; each element then has the mean subtracted and is divided by the standard deviation. To avoid changing the characteristics of the probability distribution, a scale and an offset are applied to the normalized results.
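The computation just described can be sketched as a simplified forward pass in numpy (our illustration; the running statistics and learned parameter updates of a real batch-normalize layer are omitted):

import numpy as np

def batch_normalize(x, gamma, beta, eps=1e-5):
    """Normalize a batch per channel, then apply a learned scale and offset.

    x: activations of shape (batch, channels); gamma/beta: per-channel parameters.
    """
    mean = x.mean(axis=0)                     # per-channel batch mean
    var = x.var(axis=0)                       # per-channel batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # scale/offset keep the distribution flexible

x = np.random.randn(8, 4) * 3.0 + 2.0
y = batch_normalize(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # approximately 0 and 1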
2) Route layer

From YOLOv2, the route layer begins to appear in the framework. A route layer takes the output of a specified layer and places it in the current layer; if several layers are specified, their outputs follow one another. For multiple layers, one of the key rules is that the dimensions of the layers must be the same in width×height, otherwise both l.w and l.h are zeroed. In YOLO, feature extraction is carried out in sequence through continuous convolution and pooling, so information from different depths ends up pasted together: as the network moves forward, the detailed information becomes coarser while the classification information becomes clearer. To balance this, the information from the front layers is taken back and mixed with the information from the back. The route layer thus allows YOLO to stop being a pure "you only look once" and instead "look back two more times", adjusting the results and improving the accuracy of the predictions.
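In effect, a route layer with several source layers is a channel-wise concatenation of feature maps whose widths and heights match. A minimal PyTorch sketch (tensor shapes chosen only for illustration):

import torch

# Two feature maps from different depths with matching spatial size (26x26).
shallow = torch.randn(1, 256, 26, 26)  # earlier, finer-detail features
deep = torch.randn(1, 128, 26, 26)     # later, more semantic features

# A route layer with multiple sources concatenates along the channel axis;
# widths and heights must match, as the rule quoted in the text requires.
routed = torch.cat([shallow, deep], dim=1)
print(routed.shape)  # torch.Size([1, 384, 26, 26])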
3) Residual layer

We all know the depth of a network cannot be increased without limit. In fact, a plain network stack begins to degenerate once the network is deep enough, because the deeper the network is, the more obvious the gradient vanishing becomes. A shallow network cannot significantly improve recognition, so the problem to be solved is how to control gradient vanishing while deepening the network for a better recognition effect. At ILSVRC 2015, the new champion ResNet imported the residual block, with which the network depth is theoretically unlimited.

From YOLOv3, residual layers were added to realize a residual network. The residual layer is also called a shortcut layer, which sounds like a bypass layer or a short-circuit layer: generally, the output of several earlier layers is routed directly to the output of the current layer. The residual model scheme is shown in Fig. 2.

Figure 2. Residual model scheme

If there is no residual layer, the input of the above module is x and its output is H(x); our optimization goal is H(x) itself. With the residual layer, H(x) = F(x) + x, so the residual is F(x) = H(x) - x, and if F(x) can be trained to a small enough value, H(x) will get closer to the desired output.
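A sketch of such a shortcut in PyTorch, loosely following the darknet residual unit (a 1x1 then a 3x3 convolution; batch normalization is omitted for brevity, so this is an illustration rather than the exact darknet code):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Shortcut unit: H(x) = F(x) + x, where F is a 1x1 then a 3x3 convolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.conv2 = nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.act(self.conv2(self.act(self.conv1(x))))  # the residual F(x)
        return x + f  # shortcut: only F(x) = H(x) - x has to be learned

x = torch.randn(1, 64, 52, 52)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 52, 52])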
4) Yolo layer

The author of the YOLO series is playful with names: the v1 output layer is called the detection layer, the v2 output layer is called the region layer, and the v3 output layer is called the YOLO layer, perhaps because he felt that v3 had finally reached maturity.

Each yolo layer corresponds to three clustered anchor boxes. The pre-determined anchor sizes are relative to the input layer of the whole network. During training, it must be determined which of the three boxes should be used, and the ratios of the width and height relative to that anchor are then predicted.
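A simplified sketch of this anchor assignment (the anchor sizes are example pixel values for the coarse-scale yolo layer; darknet compares box shapes with a center-aligned IoU, which is what best_anchor below does):

import numpy as np

def best_anchor(gt_w, gt_h, anchors):
    """Pick the anchor whose shape best matches the ground-truth box,
    using IoU between boxes aligned at a common center."""
    ious = []
    for aw, ah in anchors:
        inter = min(gt_w, aw) * min(gt_h, ah)
        union = gt_w * gt_h + aw * ah - inter
        ious.append(inter / union)
    return int(np.argmax(ious))

anchors = [(116, 90), (156, 198), (373, 326)]  # e.g. a large-scale yolo layer
i = best_anchor(140, 180, anchors)
pw, ph = anchors[i]
tw, th = np.log(140 / pw), np.log(180 / ph)    # width/height ratios to regress
print(i, round(tw, 3), round(th, 3))           # 1 -0.108 -0.095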

C. New technology in YOLOv4

In YOLOv4, CSPDarknet53 is chosen as the backbone, topped by an SPP block. YOLOv4 imports and verifies the Bag-of-Freebies and Bag-of-Specials methods of object detection, making a more efficient and powerful object detection model. It lets everyone use a single 1080 Ti or 2080 Ti GPU to train a super-fast and accurate object detector. As Alexey reported [12], YOLOv4 compares favorably with the other state-of-the-art object detectors.
IV. TARGET OUTPUT

The target output of YOLOv1 is a 7×7×30 tensor. Each position of the 7×7 output layer contains the results detectable in the corresponding region of the original input after 64× downsampling and feature extraction. Each position holds a 30-dimensional vector. The first 10 dimensions are divided into 2 groups used for detection: each group holds the confidence of a target, the coordinates of the center (x, y), and the border size (w, h). The next 20 dimensions categorize the 20 object classes. In short, YOLOv1 can detect 2 boxes over 20 classes in each region.
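The layout of the 30-dimensional cell vector can be sketched as follows (random values stand in for real network output; this illustrates the tensor layout only, not actual YOLO code):

import numpy as np

# YOLOv1 raw prediction: S x S grid, 2 boxes x (x, y, w, h, confidence) + 20 classes.
S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)          # shape (7, 7, 30)

boxes = pred[..., :B * 5].reshape(S, S, B, 5)   # per cell: 2 candidate boxes
class_probs = pred[..., B * 5:]                 # per cell: 20 class probabilities

# Class-specific confidence for each box: objectness times class probability.
scores = boxes[..., 4:5] * class_probs[:, :, None, :]
print(boxes.shape, class_probs.shape, scores.shape)
# (7, 7, 2, 5) (7, 7, 20) (7, 7, 2, 20)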
For YOLOv2, taking the COCO data set as an example, the last layer outputs 425 channels, which means that up to 80 classes can be detected and 5 targets can be detected at the same time: 5×(80+5) = 425. The output feature map is a tensor of 425 channels.

Here, anchor boxes are used as prior boxes. The K-means clustering algorithm is applied in advance to cluster the various targets into 5 categories. When the region layer carries out detection, the closest prior is selected to calculate the ratio of the target size to the prior box. The bounding boxes with dimension priors and location prediction are shown in Fig. 3.

Figure 3. Bounding boxes with dimension priors and location prediction

The output that border regression needs to process is (tx, ty, tw, th), from which the box is computed by

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th

where (cx, cy) is the offset of the grid cell and (pw, ph) is the size of the prior box.

For YOLOv3, the target output of each yolo layer is 255 channels. 255 channels are used because the system outputs 3 boxes over 80 categories at the same time. In the output result, the first 5 channels of each group are (x, y, w, h, score), and the last 80 channels are the 80 categories, a total of (4+1+80)×3 = 255. Also, according to the darknet configuration, each image can display up to 90 targets.
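These equations map directly to code. A sketch of the decoding for a single box (the input values are arbitrary illustrations):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Apply the decoding above: cell offset plus sigmoid for the center,
    prior size scaled by an exponential for the width and height."""
    bx = sigmoid(tx) + cx   # center x, in grid-cell units
    by = sigmoid(ty) + cy   # center y, in grid-cell units
    bw = pw * np.exp(tw)    # width relative to the prior box
    bh = ph * np.exp(th)    # height relative to the prior box
    return bx, by, bw, bh

# One grid cell at (cx, cy) = (3, 5) with a (pw, ph) = (116, 90) prior:
print(decode_box(0.2, -0.1, 0.3, 0.1, 3, 5, 116, 90))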

V. NETWORK TRAINING

The model training of the YOLO series is actually a two-step training. The initial parameters of the convolutional layers are pre-trained on the ImageNet 1000-class data set, using a model made of the first 20 convolutional layers + an average pooling layer + a full connection layer. The newly added weights are all randomly generated. Users can adapt the YOLO network to their actual data set and train starting from YOLO's pre-trained weight file.
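A minimal, hypothetical PyTorch sketch of this two-step scheme (the layer sizes, the 255-channel head and the weight file name are placeholders, not the actual darknet configuration):

import torch
import torch.nn as nn

# Step 1 produced a backbone pre-trained for classification; the detection
# head is new and starts from random weights.
backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.LeakyReLU(0.1))
head = nn.Conv2d(32, 255, kernel_size=1)  # randomly initialized detection layers

# backbone.load_state_dict(torch.load("pretrained_classifier.pt"))  # step 1 result
model = nn.Sequential(backbone, head)     # step 2: fine-tune end to end
out = model(torch.randn(1, 3, 416, 416))
print(out.shape)  # torch.Size([1, 255, 416, 416])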
For YOLOv2 and YOLOv3, the method is the same as for YOLOv1; the difference is the LOSS function. Here we take YOLOv3 as an example. Binary cross entropy loss is used: unlike the earlier versions, the background is no longer distinguished, and the multi-classification loss function based on softmax is no longer used. Instead of specifically configuring the lambda coefficients in the configuration file, they all take the constant 1.

A controversial point is that the MSE loss function was used to replace the cross entropy loss function. One explanation is that for logistic regression, the derivative forms of the variance loss and the cross entropy loss are exactly the same. In darknet, the cross entropy loss is replaced by the variance loss because the numerical trend is the same.

The border information loss is multiplied by a (2 - wi×hi) scaling factor, where wi and hi are the width and height of the ground truth, normalized to (0, 1]. This is a little trick to improve prediction accuracy for small objects: since wi, hi < 1, the smaller wi and hi are, the greater the weight of this term becomes, so the impact of the error is amplified. In some experiments on YOLOv3, if wi×hi is not subtracted, the AP is significantly lower; if the factor is scaled up further, for example to (2 - wi×hi)×1.5, the overall AP still rises slightly (on both the validation and test sets).
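A toy sketch of how this factor behaves (assuming, as in darknet, that wi and hi are normalized to the input size):

def box_loss_weight(w, h):
    """Scale factor (2 - w*h) for the coordinate loss; w, h are the
    ground-truth width and height normalized to (0, 1]."""
    return 2.0 - w * h

# Small objects get their localization error amplified relative to large ones.
print(box_loss_weight(0.1, 0.1))  # 1.99 for a tiny box
print(box_loss_weight(0.9, 0.9))  # 1.19 for a near-image-sized box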
VI. SYSTEM ADVANTAGE

Here YOLOv3 and YOLOv4 are compared with the popular frameworks, contrasting them against the other state-of-the-art object detectors. YOLOv4 runs twice as fast as other top frameworks with comparable performance, and it improves YOLOv3's AP and FPS by 10% and 12%, respectively, as shown in Fig. 4 [12].

Figure 4. Comparison with other frameworks

VII. CONCLUSION

This article starts from the evolution of CNN for object detection, and YOLOv1~YOLOv3 are sorted out in terms of structure, target output and network training. YOLOv4 essentially inherits the structure of YOLOv3 while adding more of the current tricks; it is aimed mainly at industrial applications and has been implemented in TensorFlow, PyTorch, Caffe and other frameworks. The different implementations of YOLOv4 are essentially the same. It has been reported that YOLOv3 has also been implemented in PaddlePaddle, which was independently developed by Baidu and whose popularity has skyrocketed over the last two years. With the improvement of hardware technology, YOLO will be applied more widely.

REFERENCES

[1] C. P. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection[C] // IEEE International Conference on Computer Vision, 1998: 555–562.
[2] D. G. Lowe. Object recognition from local scale-invariant features[C] // Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999: 1150–1157.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection[C] // IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005: 886–893.
[4] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition [J]. arXiv preprint arXiv:1310.1531, 2013.
[5] S. Gould, T. Gao, and D. Koller. Region-based segmentation and object detection [J]. Advances in Neural Information Processing Systems, 2009: 655–663.
[6] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition [J]. International Journal of Computer Vision, 2013, 104(2): 154–171.
[7] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges[C] // Computer Vision – ECCV 2014: 391–405.
[8] R. Girshick. Fast R-CNN[C] // International Conference on Computer Vision, 2015: 1440–1448.
[9] J. Redmon, S. K. Divvala, R. Girshick, et al. You Only Look Once: Unified, real-time object detection[C] // Computer Vision and Pattern Recognition, 2016: 779–788.
[10] J. Redmon, A. Farhadi. YOLO9000: Better, faster, stronger[C] // Computer Vision and Pattern Recognition, 2017: 6517–6525.
[11] J. Redmon, A. Farhadi. YOLOv3: An incremental improvement [J]. arXiv preprint arXiv:1804.02767, 2018.
[12] A. Bochkovskiy, C. Y. Wang, H. Y. Mark Liao. YOLOv4: Optimal speed and accuracy of object detection [J]. arXiv preprint arXiv:2004.10934, 2020.
