Exyolo: A Small Object Detector Based On Yolov3 Object Detector Exyolo: A Small Object Detector Based On Yolov3 Object Detector

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

Available online at www.sciencedirect.

com
Available online at www.sciencedirect.com
ScienceDirect
ScienceDirect
Procedia
Available Computer
online Science 00 (2019) 000–000
at www.sciencedirect.com
Procedia Computer Science 00 (2019) 000–000 www.elsevier.com/locate/procedia
www.elsevier.com/locate/procedia
ScienceDirect
Procedia Computer Science 188 (2021) 18–25

CQVIP Conference on Data Driven Intelligence and Innovation


CQVIP Conference on Data Driven Intelligence and Innovation
exYOLO: A small object detector based on YOLOv3 Object
exYOLO: A small object detector based on YOLOv3 Object
Detector
Detector
Jian Xiao*
Jian Xiao*
Hunan International Business Vocational College, Wangcheng District, Changsha and 410201, China
Hunan International Business Vocational College, Wangcheng District, Changsha and 410201, China

Abstract
Abstract
Object detection is a hot and popular research area in computer vision. Although object detection algorithm will be greatly speeding
Object
up withdetection is a hot
Deep Neural and popular
Networks, like research
YOLOv3area in computer
Object Detector,vision. Although object
its performance detection
in accuracy andalgorithm
precisionwill
wasbenot
greatly speeding
satisfactory in
up with
small Deepdetection
object Neural Networks,
which haslikelessYOLOv3
than 1% Object
area inDetector,
one image. its This
performance in accuracy
paper proposes andobject
a small precision was not
detector, satisfactory
exYOLO, in
which
small object
enhances the detection whichinformation
shallow feature has less thanby1% area in one the
superimposing image.
2-foldThis paper proposes
down-sampled a small
feature mapobject
from the detector,
originalexYOLO, which
network model.
enhances
The author theattempted
shallow feature
the setinformation by superimposing
of experiments with a view tothe 2-fold down-sampled
demonstrating accuracyfeature map from
and precision in the original
small objectnetwork model.
detection, the
The author
results of theattempted theindicate
experiment set of experiments
that exYOLO with
is aa better
view to demonstrating
performance accuracy
in accuracy and
and precision in small object detection, the
precision.
results of the experiment indicate that exYOLO is a better performance in accuracy and precision.
© 2021 The Authors. Published by ELSEVIER B.V.
©
© 2021
2021
This The
The
is an Authors.
accessPublished
Authors.
open Published by
by Elsevier
article under ELSEVIER B.V.B.V.
the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
This
This is an open
is an open access
access article under
article under the CC BY-NC-ND
CC BY-NC-ND license
license (https://creativecommons.org/licenses/by-nc-nd/4.0)
(https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the CQVIP Conference on Data Driven Intelligence and
Peer-review under responsibility of the scientific committee of the CQVIP Conference on Data Driven Intelligence and
Peer-review
Innovation under responsibility
(CCDDII2021) of the scientific committee of the CQVIP Conference on Data Driven Intelligence and
Innovation (CCDDII2021)
Innovation (CCDDII2021)
Keywords: YOLOv3; Small Object Detection; Deep learning; Shallow Feature Information
Keywords: YOLOv3; Small Object Detection; Deep learning; Shallow Feature Information

1. Introduction
1. Introduction
With the continuous improvement of computer computing ability, deep learning model is widely used in object
With the
detection. continuous
Since 2012, thisimprovement
kind of methodof computer computing
has gradually becomeability, deep learning
the mainstream model
of object is widely
detection. Theused
deepin object
learning
detection. Sinceextracts
model directly 2012, this kind offrom
features method has number
a large graduallyofbecome
originalthe mainstream
images of objectthe
by simulating detection. The deep learning
visual perception systems
model directlybrain
of the human extracts
andfeatures
transmitsfrom
thema large
layer number
by layeroftooriginal images
obtain the by simulating information
high-dimensional the visual perception systems
of the images. At
of the human
present, braindeep
excellent andlearning
transmitsmodels
them layer
can bebydivided
layer tointo
obtain
two the high-dimensional
categories: information
The first model dividesofobject
the images. At
detection
present, excellent
into candidate boxdeep learning
selection models
stage can be
and object divided intostage
classification two categories:
[1], such asThe first model dividesNeutral
Region-Convolution object Networks
detection
into candidate box selection stage and object classification stage [1], such as Region-Convolution Neutral Networks
* Corresponding author. Tel.: +86-136-3747-9303;
* E-mail
Corresponding
address:author. Tel.: +86-136-3747-9303;
13453215@qq.com
E-mail address: 13453215@qq.com
1877-0509 © 2021 The Authors. Published by ELSEVIER B.V.
This is an open
1877-0509 access
© 2021 Thearticle under
Authors. the CC BY-NC-ND
Published by ELSEVIER license
B.V.(https://creativecommons.org/licenses/by-nc-nd/4.0)
This is an open
Peer-review access
under article under
responsibility CC BY-NC-ND
of the scientific license
committee (https://creativecommons.org/licenses/by-nc-nd/4.0)
of the CQVIP Conference on Data Driven Intelligence and Innovation
(CCDDII2021)
Peer-review under responsibility of the scientific committee of the CQVIP Conference on Data Driven Intelligence and Innovation
(CCDDII2021)

1877-0509 © 2021 The Authors. Published by Elsevier B.V.


This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the CQVIP Conference on Data Driven Intelligence and Innovation
(CCDDII2021)
10.1016/j.procs.2021.05.048
Jian Xiao / Procedia Computer Science 188 (2021) 18–25 19
Jian Xiao/ Procedia Computer Science 00 (2019) 000–000

series [2] and various variations based on it. The second category is a model of a stage that treats classification and
fixed positioning as regression tasks, such as used more Solid-State Drive [3], You Only Look Once [4]. However,
for the detection of small objects, the effect of these two algorithms is not satisfactory [5].
Under the definition of small object, if the area of an object occupies less than or equal to 0.12 percent of 256
pixel*256 pixel image, the object is a small object [6]. When the picture contains small objects, it is often cannot be
accurately detected. The reason is that the feature information about the small objects extracted by the network from
the picture is much less than that of the large objects, which makes the training ability of the model develop not very
well. This paper studies how to enhance the feature information of small object and improve the YOLOv3 network as
follows: first of all, feature enhancement is carried out in the feature extraction module Receptive Fields Block, and
then the superficial feature extraction is enhanced by dense connections in the backbone network. The next step is that
the loss function based on the Film Input Output Unit with more generalization ability is used. Finally, the accuracy
of small object detection is improved.

2. Relative reviews

Deep learning is developed by the traditional Artificial Neural Networks [7]. PITTS and others [8] put forward the
concept of Artificial Neural Networks and the model of neurons for the first time in their paper in 1943. After two
years of improvement and theoretical research, Rosenblatt created a machine [9] for identifying objects in 1956 and
the idea of Artificial Neural Networks was used for reference. Rosenblatt and others organized their research results
into a paper in 1958 and put forward the concept of perceptron, which was a machine that can simulate human
perception ability.
Object detection is one of the basic applications of deep learning that is very useful in many situations, such as
driver-less and safety systems. So far, object detection has been successful in many fields, relying on big data sets and
Convolution Neural Networks. At present, the algorithms used for object detection can be divided into two categories:
The first kind of algorithm can generate several candidate boxes according to the input picture, and then classify the
training data by CNN, such as Fast R-CNN, Faster R-CNN algorithm. A second type of algorithm treats detection as
a regression problem and does not need to generate candidate boxes, such as YOLO [4], SSD [3] algorithms.
YOLO belongs to a one-stage detection algorithm, which directly inputs the picture into the network and uses the
Convolution Neural Networks to extract the features of the whole picture. Finally, the whole picture is regressed to
detect the object. Although Faster R-CNN [10] algorithm reduces the calculation amount of sliding window, it is
limited to fixed size window. However, YOLO algorithm directly divides the whole picture into small pieces that do
not coincide, avoiding a large number of sliding windows, thus the detection speed is improved.
YOLO Network uses the Image Net to preliminarily train before training, and then is followed by four randomly
initialized convolution layers and two full connection layers. The pre-training model consists of 20 convolution layers
plus an average pooling layer and a full connection layer. YOLO Network predicts a S  S  B boundary frame that
is much less than thousands of sliding windows in the two-stage object detection algorithm, which greatly speeds up
the detection speed of the network, but at the same time, the detection accuracy has a slight decrease [11].

3. Improvement of YOLOv3 Algorithm Based on Feature Fusion

With the development of Deep Neural Networks, object detection models that has optimum performance all depend
on deep CNN backbone network, such as Res Net-101 and Inception.
Although strong feature representation is conducive to the improvement of performance, it brings high
computational expense. In contrast, some lightweight detection models can deal with detection problems in real time,
but it’s the decrease of accuracy. Inspired by the receptive field structure in human visual systems, Songtao Liu
proposed a novel RFB [12] module that enhances the differentiability of the feature and the robustness of the model
by simulating the relationship between the size and eccentricity of the RF. The model is shown in Fig.1.
20 Jian Xiao / Procedia Computer Science 188 (2021) 18–25
Jian Xiao/ Procedia Computer Science 00 (2019) 000–000

Fig.1 Construction of the RFB module [12]

YOLOv3 Network adds multi-label classification based on YOLO Network and fuses Fixed Pattern Noise with it
for multi-scale prediction to enhance the performance of small object detection. On the structure of the basic network,
YOLOv3 use Dark Net53 that is a new network architecture to greatly deepen the number of network layers through
the residual structure. YOLOv3 uses the anchor frame obtained by dimension clustering to predict the bounding box,
the transverse and vertical coordinates of the center point of the bounding box and the width and height of the frame,
and each bounding box is predicted to be 4 values. It uses logical regression to predict the category score of each
bounding box and uses mean square and error as a loss function. Whether the bounding box contains the object is
determined by the confidence level. When the confidence level is larger, the possibility that the bounding box contains
the object is greater. If a prior bounding box overlaps with a real object more than any other bounding box, the
confidence level is set to 1. If the priority of the bounding box is not the highest but overlaps with the real object
beyond a certain threshold, the confidence level is set to 0. YOLOv3 uses Dark Net53 Network for feature extraction,
and its network structure diagram is shown in Fig.2.
Jian Xiao / Procedia Computer Science 188 (2021) 18–25 21
Jian Xiao/ Procedia Computer Science 00 (2019) 000–000

Fig.2 The structure of YOLOv3[4]

Dark Net53 fuses with Res Net, consisting of five residual blocks, each of which is composed of different number
of residual units. Each residual unit is composed of two Distance between Lenses units and residual operation [13].
As shown in figure 3(a). Each DBL unit is composed of convolution layers, normalization [14], and Leaky Relu
activation function, as shown in Figure 3(b). The using of residual blocks can not only prevent the loss of effective
information, but also prevent the vanishing gradient problem during deep network training [15]. In addition, there is
no pooling layer in the network. It uses convolution layers with step size 2 instead of pooling operation to reduce the
size of feature graph, because there is little object related information in the picture and the object information will be
lost in the process of pooling operation, which is very bad for small objects.

(a)residual units
Jian Xiao/ Procedia Computer Science 00 (2019) 000–000
22 Jian Xiao / Procedia Computer Science 188 (2021) 18–25

(b) DBL unit


Fig.3 The structure of the residual unit [13]

YOLOv3 Network uses mean square and error as loss function, which consists of three parts that are used to predict
frame positioning error, Io U error with or without object and classification error respectively. The loss function is
shown in formula (1):

𝑠𝑠2 𝐵𝐵
𝑜𝑜𝑜𝑜𝑜𝑜
loss = 𝜆𝜆coord ∑ ∑ 𝑙𝑙𝑖𝑖𝑖𝑖 [(𝑥𝑥𝑖𝑖 − 𝑥𝑥̑ 𝑖𝑖 )2 + (𝑦𝑦𝑖𝑖 − 𝑦𝑦̑ 𝑖𝑖 )2 ]
𝑖𝑖=0 𝑗𝑗=0
𝑠𝑠2 𝐵𝐵 2
𝑜𝑜𝑜𝑜𝑜𝑜 2
+𝜆𝜆coord ∑ ∑ 𝑙𝑙𝑖𝑖𝑖𝑖 [(√𝑤𝑤𝑖𝑖 − √𝑤𝑤̑𝑖𝑖 ) + (√ℎ𝑖𝑖 − √ℎ̑𝑖𝑖 ) ]
𝑖𝑖=0 𝑗𝑗=0
𝑠𝑠2 𝐵𝐵 𝑠𝑠2 𝐵𝐵
𝑜𝑜𝑜𝑜𝑜𝑜 𝑜𝑜𝑜𝑜𝑜𝑜
+ ∑ ∑ 𝑙𝑙𝑖𝑖𝑖𝑖 (𝑐𝑐𝑖𝑖 − 𝑐𝑐̑𝑖𝑖 )2 + 𝜆𝜆𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 ∑ ∑ 𝑙𝑙𝑖𝑖𝑖𝑖 (𝑐𝑐𝑖𝑖 − 𝑐𝑐̑𝑖𝑖 )2
𝑖𝑖=0 𝑗𝑗=0 𝑖𝑖=0 𝑗𝑗=0
(1)
𝑠𝑠2
𝑜𝑜𝑜𝑜𝑜𝑜
+ ∑ 𝑙𝑙𝑖𝑖𝑖𝑖 ∑ [𝑝𝑝𝑖𝑖 (𝑐𝑐) − 𝑝𝑝̑ 𝑖𝑖 (𝑐𝑐)]2
𝑖𝑖=0 𝑐𝑐∈𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐

The first and second terms represent the positioning error of the prediction box and the weight of the center
coordinate error, which is set to 5. S represents the number of grids that is the image is divided. B represents the
number of boxes predicted by each grid. The 𝑙𝑙𝑖𝑖𝑖𝑖 represents whether the j prediction box of the i grid detects the object.
x, y, w, h represents the center coordinates, width and height of the real box respectively. The third and fourth terms
represent cross-comparison errors. c represents the score of confidence level. The last term represents the classification
error. 𝑝𝑝𝑖𝑖 (𝑐𝑐) denotes the conditional probability that the detected object belongs to the C.
However, the small object in the image itself occupies very few picture elements, and the pooling operation in the
training process will make the obtained features less obvious, which is not conducive to detect. For the characteristics
of small objects, feature fusion can be used to pass shallow information into deep convolution layers in order to use
features of different scales better.
Feature pyramid is used to enhance the detection effect in the YOLOv3 Network, the output feature map is down-
sampled by 8 times ,16 times and 32 times respectively, that is, when the detected object is less than 8 pixel *8 pixel,
it will be difficult to detect it in the output feature map at last. To make more use of small object information, this
paper proposes exYOLO algorithm, which makes the following improvements to the YOLO. First, the feature map
after one down-sampling is superposed to the input of the second and third residual blocks to add the detailed
information of deep feature map. After 52*52 feature map, the feature enhancement module is connected to strengthen
the ability of small object information extraction in the size feature map. The module replaces the convolution layer
of 5*5 in the RFB with two continuous convolution layers of 3*3, because the area of the small object in the original
Jian Xiao / Procedia Computer Science 188 (2021) 18–25 23
Jian Xiao/ Procedia Computer Science 00 (2019) 000–000

image is already small, the size of the small object becomes smaller in the feature map after multiple scale reduction.
Using two continuous convolution layers can not only satisfy the feature extraction of the small object, but also reduce
the parameter.
RFB module enhances the feature extraction function of the network by simulating the receptive field structure of
human vision. First, the feature map adopts multi-branch structure that is composed of convolution kernel of different
size, then the receptive field is increased through the cavity convolution layer. Finally, the convolution layer output
of different sizes is used to achieve the purpose of merging different features by Concat function. exYOLO network
structure is shown in Fig.4, exYOLO based on the original YOLOv3 network structure adds the fusion between feature
enhancement module and shallow cross-scale feature.

Fig.4 exYOLO network structure

4. Experiment and Verification

Pascal VOC 2007 data set is selected for training and testing. The data set contains 20 classes and is rich in image
tagging information for various tasks. VOC 2007 contains 996 pictures, training set plus verification set 5011, testing
set 4952. Tagging information of the picture in PASCAL VOC 2007 data set uses the file of XML format and is stored
in the annotations folder, corresponding to the picture one by one. According to the characteristics of the algorithm,
the configuration of the server in which the prototype system is located is shown in Table 1.

Table 1. Server parameters and environment of the prototype system


Server configuration
CPU Quad-core, Intel Core i5
Internal Storage 32GB
GPU NVIDIA Telsa T4
Hard Disk 1TB
Operating System Ubuntu 18.04
Operational Framework TensorFlow 1.14, Django 2.0, Python 3.6
24 Jian Xiao / Procedia Computer Science 188 (2021) 18–25
Jian Xiao/ Procedia Computer Science 00 (2019) 000–000

During the experiment, the trained exYOLO model is first selected as the pre-training model of this chapter. Then
the exYOLO model is compared with the YOLOv3 and a series of object detection algorithms on the VOC training
set to verify the superiority of the exYOLO model. The comparison experiment includes the comparison of the overall
average accuracy (Mean Average Precision) and the accuracy of a single category in the VOC training set.
We can see from Table 2 that the overall average accuracy (MAP) of the exYOLO that is compared with YOLOv3
and the existing object detection algorithm is greatly improved, with an increase of 7.6-2.3% or less.

Table 2. PASCAL VOC2007 testing set m AP indicator test results


Algorithm Network Pre-training m AP/%
Faster R-CNN ResNet101 Y 77.2
SSD VGGNet Y 76.8
DSOD DS/64-192-48-1 N 70.2
YOLOv2 Darknet-19 Y 73.2
YOLOv3 DarkNet53 Y 78.6
exYOLO DarkNet53 Y 81.9

In comparison experiments for individual categories, PASCAL VOC2007, as a testing data set of object detection
standard, contains small objects such as Bird, Bottle, chair, plant, and the detection accuracy are generally lower
compared with other classes. On the above testing data set, we compare the exYOLO with the YOLO, and the
experimental result is shown in Table 3. From the result, it can be seen that the detection accuracy of the exYOLO is
increased by 2.8%-4.6% or so.

Table 3.Test results for individual categories


Algorithm exYOLO YOLOv3 SSD Fster R-CNN
Bird 80.8 77.4 79.0 79.3
Bottle 61.5 58.3 46.7 52.1
Chair 66.9 65.8 59.8 60.2
Cow 87.3 85.2 79.8 77.3
Dog 87.1 86.3 86.1 79.6
Horse 88.2 86.3 86.0 87.5
Person 81.6 79.5 78.2 80.3
Plant 65.5 58.2 59.3 61.8

5. Conclusions

In the detection task based on deep learning, it is difficult to detect a small object, because very few pixels are
occupied by the small object in the picture. For the purpose of improving small object detection, the YOLOv3 network
is studied and used to improve the detection accuracy. Because the small object occupies less pixels in the image, and
the YOLOv3 network abandons the pooling layer, it is difficult to extract the effective information of the object by
using five convolution layers with a step size of 2 to reduce the size of the feature map. In this paper, the feature maps
of shallow networks are transmitted to deeper networks in order to improve the accuracy of detection.

Acknowledgements

This work is supported by the Projects of Scientific Research of Hunan Provincial Department of Education
(19C1234).
Jian Xiao / Procedia Computer Science 188 (2021) 18–25 25
Jian Xiao/ Procedia Computer Science 00 (2019) 000–000

References

[1] Shlhamer E, Long J, Darrell T. (2016) “Fully convolutional networks for semantic segmentation.” IEEE Transactions on Pattern Analysis &
Machine Intelligence 79(10):1337-1342.
[2] Girshick R, Donahue J, Darrell T, et al. (2014) “Rich feature hierarchies for accurate object detection and semantic segmentation.” IEEE
Conference on Computer Vision and Pattern Recognition, USA: IEEE 580-587.
[3] Liu W, Auguelov D, Erhan D, et al. (2016) “SSD: Single Shot MultiBox Detector.” European Conference on Computer Vision (ECCV). San
Francisco, CA, USA: IEEE Conference 6517-6525.
[4] Redmon J, Divvals S, Grishick R, et al. (2016) “You Only Look Once: unified, real time object detection.” IEEE Conference on Computer
Vision and Pattern Recognition, USA: IEEE 779-788.
[5] Zhang X Y, Ding Q H, Luo H B, et al. (2017) “Infrared dim target detection algorithm based on improved LCM.” Infrared and Laser Engineering
46(7): 0726002.
[6] Redmon J, Farhadi A. (2018) “YOLOv3: An Incremental Improvement.” IEEE Conference on Computer Vision and Pattern Recognition.
USA: IEEE 2311-2314.
[7] Erhan D, Szegedy C, Toshev A, et al. (2014) “Scalable object detection using deep neural networks.” Proceedings of the IEEE conference on
computer vision and pattern recognition.2147-2154.
[8] MCCULLOCH WS, PITTS W. (1943) “A logical calculus of the ideas immanent in nervous activity.” The bulletin of mathematical biophysics
5(4):115-133.
[9] Zeiler M D, Krishnan D, Taylor G W, et al. (2010)“Deconvolutional networks.” Computer Vision & Pattern Recognition.
[10] M. K. Pargi, B. Setiawan and Y. Kazama. (2019) “Classification of different vehicles in traffic using RGB and Depth images: A Fast RCNN
Approach.” 2019 IEEE International Conference on Imaging Systems and Techniques (IST), Abu Dhabi, United Arab Emirates 1-6.
[11] Y Li, K He, and J Sun. (2016) “R-fcn: Object detection via region based fully convolutional networks.” In Advances in Neural Information
Processing systems, 630-645.
[12] Z. Wang, Z. Cheng, H. Huang and J. Zhao, (2019) “ShuDA-RFBNet for Real-time Multi-task Traffic Scene Perception.” 2019 Chinese
Automation Congress (CAC), Hangzhou, China 305-310.
[13] Sandler M, Howard A, Zhu M, et al. (2018) “MobileNetV2: inverted residuals and linear bottlenecks.” IEEE Conference on Computer Vision
and Pattern Recognition. Salt Lake City: IEEE 4510-4520.
[14] Ioffe S, Szegedy C. (2015) “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shife.” International
Conference on Machine Learning. USA: ICML.
[15] Feng, S., Zhang, L., Xiao-Hui, H. E., & University, X. P. . . (2017) “Pedestrian detection based on Jetson TK1 and convolutional neural
network.” Information Technology.

You might also like