
International Journal of Advanced Science and Technology

Vol. 29, No. 3, (2020), pp. 3006- 3022

A Real-Time Student Face Detector in the Classroom using YoloV3 Anchor Matching Strategy on WIDER Face Dataset

Eko Cahyo Nugroho1, Gede Putra Kusuma2
Computer Science Department, BINUS Graduate Program – Master of Computer Science,
Bina Nusantara University, Jakarta, Indonesia, 11480
1eko.nugroho003@binus.ac.id, 2inegara@binus.edu

Abstract
Face detection remains an active research topic because it is a substantial part of face recognition pipelines built on deep learning networks. Without a correctly detected face region, a face recognition system becomes inefficient and wastes the valuable regions that should serve as inputs to the recognition network. Consequently, many studies in this area are evaluated on benchmark face datasets such as FDDB and WIDER Face. WIDER Face offers a challenging dataset with wide variation in pose, scale, and occlusion, along with many annotated tiny faces. On this benchmark, S3FD offers a fast detector with relatively good accuracy, inspired by the anchor-based SSD and RPN, while the most accurate detectors on WIDER Face are implemented with region-based CNNs. The region-based DSFD achieves good accuracy and is listed among the top-accuracy detectors, but it cannot be counted as a real-time detector in our tests. In this paper, we propose a faster face detector with relatively good accuracy on the WIDER Face benchmark, using 9 optimized anchor-box clusters within the YoloV3 framework. The anchor boxes are optimized by K-Means clustering on the WIDER Face training set, so that anchors match the annotated faces better during the training-validation process, especially tiny faces. This paper shows that our YoloV3 face detector is competitive in both accuracy and speed on the WIDER Face benchmark. We also test our model in a real-case situation, detecting students inside a classroom, and show that it operates as a real-time face detector with correct face detections during the real-case implementation.

Keywords: Convolutional Neural Networks, Real-time Face Detector, YoloV3, WIDER Face Dataset, Real-case Face Detection

1. Introduction
Face detection remains a promising research topic because performance gaps in accuracy and detection speed persist across variations in face scale, occlusion, and pose, especially in real-case deployments such as security cameras [17][28][29]. Advanced security cameras usually couple face detection with a face recognition stage to identify the detected faces. Without accurate face detection, the convolutional neural network used for recognition is fed wrong or incomplete feature maps, which in turn degrades recognition accuracy [33]. On the other hand, without an acceptable detection speed for real-time video, the hardware requirements of the method grow, making it harder to deploy in real cases under practical resource constraints. Face detection can therefore be considered one of the important steps in a face recognition system, and it needs a balanced trade-off between accuracy and speed.

Before being deployed in real cases, a face detection method or model needs to be tested on a benchmark face dataset. Based on our references, the WIDER Face and FDDB datasets are the most frequently used benchmarks in the face detection literature [8][9][20][30][32]. The challenging WIDER Face dataset consists of 32,203 images split into training, validation, and test sets, and is claimed to be 10 times larger than the previous largest face dataset [34]. The top-4 most accurate methods on the WIDER Face dataset are RefineFace, EXTD, AInnoFace, and RetinaFace, showing that region-based face detectors are promising, with relatively high accuracy on the Easy, Medium, and Hard test splits of the WIDER Face benchmark [35]. The price of this high accuracy, however, is a higher computation requirement to sustain the detection speed [38]. In this work, we could not run the top-4 most accurate models on the WIDER Face dataset because of limited computation resources: we received OOM (Out of Memory) errors on a GPU RTX 2070 with 8 GB GDDR5, which supports the point that higher accuracy needs more computation resources. We did succeed in running DSFD, the 11th most accurate method on WIDER Face, but its detection speed is unacceptable for real-time video: it averages 0.23 frames per second with an average process time per frame of 4.44 seconds at 1024x768 resolution. In our implementation test, DSFD achieves relatively good AP (Average Precision) of 94.15%, 93.33%, and 87.27% on the Easy, Medium, and Hard validation splits at 1024x768 resolution. We also ran S3FD, the 20th most accurate method on WIDER Face, which claims a promising detection speed of 36 frames per second at VGA resolution on a Titan X Pascal GPU [32]. On our hardware, S3FD averages only 0.64 frames per second at 1024x768, i.e., 1.56 seconds per frame, with AP of 93.25%, 90.49%, and 74.49% on the Easy, Medium, and Hard splits at the same resolution. From this analysis we conclude that S3FD, despite its claimed accuracy and detection speed, also needs substantial computation resources, since its speed drops significantly on our GPU. To make face detection widely deployable without excessive computation requirements, we believe a face detection model with acceptable accuracy and detection speed is needed, using AP (Average Precision) and process time per frame as the benchmark metrics.
YoloV3 is well known as a fast and accurate object detector. In this work, we use YoloV3 as the base framework and improve its training strategy to reach acceptable detection speed and accuracy on WIDER Face. YoloV3 is the latest version of the You Only Look Once framework, first introduced in 2016, which achieved good results on the VOC2012 dataset in terms of real-time speed and accuracy [19]. This comes from the network architecture itself, which uses a single network to predict small, medium, and large objects at once. The next version, called YOLO9000, improved small-object detection over its predecessor by introducing anchor boxes for predicting the bounding boxes [18]. The anchor boxes reduce the localization errors for closely spaced objects that had lowered the accuracy and recall rate of the previous version. YoloV3 was then announced in 2018 with several improvements, such as better accuracy without significant cost in detection speed, and a new loss function combining localization, confidence, and classification losses [11]. YoloV3 uses a backbone called Darknet-53, a deeper network than the previous version, and keeps the same single-network scheme for detecting small, medium, and large objects. The default anchor boxes of YoloV3 are configured on the MS COCO dataset, which contains 80 classes with objects of widely varying sizes; they consist of 9 clusters produced by the K-Mean clustering method, with each group of 3 anchor boxes used to detect small, medium, or large objects in its matching detection area. Based on our literature review, YoloV3 is promising for detecting a single class, namely the face.
YoloV3 has been applied in many object detection studies, such as [1][10][13][14][16][17][21][23][24][25], which improve accuracy without significantly reducing speed for real-case implementation. We believe YoloV3 can be placed among the competitive methods on the WIDER Face dataset benchmark, compared with other methods tested on the same benchmark such as DSFD and S3FD, both of which we run in this work. Before applying our strategy, we need to understand the baseline performance of YoloV3 trained and tested on WIDER Face. In our implementation, we set 100 epochs using pre-trained weights from ImageNet and an early-stopping function that halts training when no improvement is detected; training stopped at the 42nd epoch. The resulting AP (Average Precision) on the Easy, Medium, and Hard splits is 78.97%, 79.16%, and 71.40%. Compared with other methods on the same dataset, especially S3FD, which is claimed to be a real-time face detector, the default YoloV3 accuracy clearly has room for improvement. To reach that target, we adjust the anchor boxes with the same method described in the YOLO9000 and YoloV3 reports [13][18]: K-Mean clustering on the WIDER Face dataset into 9 clusters. We take this strategy because the performance of anchor-box-based methods drops significantly when an object is smaller than the default anchor box assigned to detect it [32]. The anchor matching strategy is therefore the correct step for detecting the numerous small faces in the WIDER Face dataset, whose bounding boxes are smaller than YoloV3's default anchor boxes. If a face is smaller than the anchor boxes, the network never knows that the face exists and never learns to predict such small faces, as illustrated in Figure 1; a small numeric check follows.
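As a concrete, hedged illustration (the 6x7-pixel face is our assumed example; 10x13 is the smallest default COCO anchor and 5x6 the smallest optimized anchor from Table 1 later in this paper), the following Python sketch compares a tiny face against both anchors, with the boxes aligned at the same origin as in anchor matching:

```python
def anchor_iou(a, b):
    """IoU of two (width, height) pairs aligned at the same origin."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

print(anchor_iou((6, 7), (10, 13)))  # ~0.32: below a 0.5 threshold, never matched
print(anchor_iou((6, 7), (5, 6)))    # ~0.71: matched, so the face can be learned
```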

Figure 1. Anchor Boxes to Detect Small Faces

Small faces, as shown in Figure 1, provide only a few features to feed into a deep learning network. If the anchor box is bigger than the small face, background area is included in the features fed to the network, and the region tends to be rejected as a face [33]. In this work, we also apply our YoloV3 model to a real case: detecting the faces of students and a lecturer during real lecturing sessions with various movements and activities. This implementation test aims to measure the real performance of our model for face detection in a realistic situation. We believe this implementation is also relevant to future work on face detection and face recognition in real-time video; for example, it can support student attendance detection and recognition during lecturing sessions. For such future work to succeed, face detection needs to run in real time with relatively good accuracy. We therefore run the test using detection speed (average frames per second) and average confidence score as the metrics, together with a visual assessment of the resulting frames during the real-case test. The main contributions of this work are as follows:

 To improve the average precision of YoloV3 on the WIDER Face dataset benchmark using an anchor matching strategy, evaluated on the Easy, Medium, and Hard validation data splits.
 To analyze the competitiveness of YoloV3 on the WIDER Face dataset benchmark compared with DSFD and S3FD on the Easy, Medium, and Hard validation data splits.
 To implement our improved YoloV3 model as a real-time student face detector in a real-case situation, a class lecturing session.

2. Related Works
In recent years, object detection research has achieved relatively good results using various methods and strategies, such as anchor boxes for detecting objects in images. The anchor-box method is a better idea than the sliding-window detector, because the sliding-window approach struggles to find the exact location of an object and must exhaustively search all possible object positions at a huge computation cost [33]. Anchor boxes detect objects in one shot by selecting the best match based on the Intersection over Union with the annotation and the confidence score, as implemented by SSD, YOLOv3, and S3FD [11][32][39]. The anchor-box method therefore reduces the computation cost compared with sliding windows, because the network can extract image features from the whole image at once, enabling faster object detectors [33]. In this section, we present current related work on face detection frameworks and models tested on the WIDER Face dataset benchmark using convolutional neural networks [27][34][35]. The section is split into two categories of face detector: One-stage Face Detector and Two-stage Face Detector.
One-stage Face Detector: YOLOv3 by Joseph Redmon and Ali Farhadi uses multiple scales of anchor boxes clustered into 9 clusters, with each group of 3 default anchor boxes detecting small, medium, and large objects. It achieves good accuracy and speed on the MS COCO benchmark: 57.9% mean average precision and 20 average frames per second on a Pascal Titan X GPU with a 608x608 input image shape [11]. In SSD by Wei Liu et al., the anchor boxes, or default boundary boxes, are chosen manually, with one scale of boundary boxes selected per level to detect small, medium, and large objects. This becomes a disadvantage: SSD struggles to detect small objects when the real object is smaller than the default boundary or anchor boxes, because the neural network never learns that the object exists. To mitigate this, SSD needs a higher input resolution for smaller objects, which enlarges the feature-map area and yields higher confidence scores [33]. SSD reaches 46.5% mean average precision and 19 average frames per second on the COCO benchmark with a 512x512 input image shape on a Pascal Titan X [39]. Then Shifeng Zhang et al., inspired by SSD, proposed a new method named S3FD. S3FD addresses SSD's weakness on small faces with a network architecture that uses multiple scales of anchors, one per associated layer. The method keeps the anchor density uniform across the image, so faces of various scales can be covered by roughly the same number of anchors. With this modified anchor strategy, S3FD claims a better capability to handle faces of various scales, especially small faces, and a higher recall rate. S3FD uses a classification loss implemented as a softmax loss over two classes, face and background, and applies the regression loss only to positive anchors while disabling the others. With this method, S3FD reports 93.7%, 92.4%, and 85.2% on the easy, medium, and hard test splits, respectively. At VGA resolution, S3FD reaches 36 average frames per second on a single Pascal Titan X GPU [32].
Two-stage Face Detector: A two-stage detector adds a refinement of object localization in order to reduce the false-positive rate and increase the average precision [33]. DSFD, proposed by Jian Li et al., employs an enhancement module to increase the quality of the feature map; a higher-quality feature map also increases the chance of detecting small faces. The DSFD architecture is a dual-shot network in which an FPN-like layer is enhanced by a second network at every assigned layer to detect objects of various scales. With this method, DSFD achieves 96.6%, 95.7%, and 90.4% on the easy, medium, and hard validation sets. Using a ResNet-50 backbone, DSFD reaches 22 average frames per second at VGA input resolution on an NVIDIA P40 GPU [8].

3. YoloV3 Architecture Overview


YoloV3 is the latest improved version of YOLO and YOLO9000, known as a good real-time object detector [33]. YOLO uses 24 convolutional layers and 2 fully connected layers; several convolutional layers use 1x1 convolutions to reduce the depth of the feature-map dimension. The way YOLO detects objects can be seen in Figure 2. The input image is divided into an S x S grid, where each grid cell can be associated with only one object and a fixed number B of predicted boundary boxes. Each box carries a single confidence score for the detected object located inside it. With this detection method, every bounding box contains x, y, w, h, and a box confidence score, where x, y, w, and h represent the position and size of the bounding box. The grid cell is also tied to a set of class probabilities for classifying the object according to the C class list of the model [19]. The dimensions of the tensor predicted by YOLO's single convolutional neural network are given in equation 1:

$S \times S \times (B \cdot 5 + C)$ (1)

The equation contains S x S for the number of grid cells in the image, B for the number of bounding boxes generated in every grid cell, and C for the number of classes to be predicted for the object; for example, the original YOLO with S = 7, B = 2, and C = 20 outputs a 7 x 7 x 30 tensor. To choose the right bounding box, the box with the highest IoU (Intersection over Union) against the ground truth is selected during each training iteration. The total loss combines three loss functions: the confidence loss of the area inside the bounding box, the localization loss using the sum-squared error between predictions and ground truths, and the classification loss using the squared error of the class probabilities for each class. YOLO uses non-maximal suppression to eliminate duplicated detections of the same object.
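Both the box selection during training and the duplicate removal rest on IoU. The following is a minimal sketch, assuming boxes in the (center_x, center_y, w, h) format YOLO predicts; the greedy suppression shown is the textbook NMS idea rather than any specific implementation's exact code:

```python
import numpy as np

def iou_xywh(a, b):
    """IoU of two boxes given as (center_x, center_y, width, height)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2]/2, a[1] - a[3]/2, a[0] + a[2]/2, a[1] + a[3]/2
    bx1, by1, bx2, by2 = b[0] - b[2]/2, b[1] - b[3]/2, b[0] + b[2]/2, b[1] + b[3]/2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))  # intersection height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximal suppression: keep the highest-scoring box,
    drop boxes that overlap it too much, and repeat."""
    order = list(np.argsort(scores)[::-1])  # indices, best score first
    keep = []
    while order:
        best, order = order[0], order[1:]
        keep.append(best)
        order = [i for i in order if iou_xywh(boxes[best], boxes[i]) < iou_threshold]
    return keep
```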
The second version of YOLO, named YOLO9000 or YoloV2, was announced in 2017 with several improvements aimed at competing with SSD (Single Shot Detector), which had outperformed the original YOLO in accuracy among real-time object detectors. YoloV2 brings better accuracy and process time compared with SSD [18]. It adds Batch Normalization, a technique introduced in 2015 that normalizes layer inputs by adjusting and scaling the activations [40]; this alone adds about 2% mean average precision. Another improvement is a higher-resolution classifier, moving from 224x224 to 448x448 inputs for better detection at high resolution, which raises mAP by almost 4%. YoloV2 also adds anchor boxes, which decrease mAP by 0.3% but improve the recall rate from 81% to 88%, automatically increasing the chance of covering the ground-truth objects. K-Mean clustering with an IoU (Intersection over Union) distance is used during training setup to choose the best anchor boxes.


Figure 2. Yolo Object Detection Steps

The latest version at the time of writing, named YoloV3, adds several improvements over the previous versions. It replaces the softmax function with multi-label classification: independent logistic classifiers trained with a binary cross-entropy loss for each label, which avoids the problem of overlapping labels in multi-class labeling. YoloV3 also changes the bounding-box prediction: it assigns an objectness score of 1 to the anchor that best overlaps a ground-truth box and ignores other anchors whose overlap with the ground truth exceeds a threshold, for example 0.5, which we also use in our implementation. With this method, YoloV3 assigns only one bounding box, the best match, to each ground-truth object. YoloV3 further implements multiple-scale prediction, adopting the concept of the feature pyramid network, so that boxes are predicted from features extracted at 3 different scales [11]. Because of this, the tensor equation changes to equation 2:

$S \times S \times [B \cdot (4 + 1 + C)]$ (2)

In equation 2, S x S represents the size of the grid at each scale, and B is the number of bounding boxes that can be predicted per cell of a feature map; the author's official implementation uses B = 3, trained on the COCO dataset. The 4 + 1 term decodes the bounding-box offsets and the objectness score, and C is the number of classes the convolutional neural network is trained on. With this tensor layout, YoloV3 obtains better semantic information from the up-sampled features and finer-grained information from the earlier feature maps, which improves small-object detection over the previous version. YoloV3 also uses a network backbone called Darknet-53, a CNN with 53 layers that implements residual connections inspired by the ResNet networks [41]. With these improvements, YoloV3 achieves better results in both mean average precision and speed than other object detectors such as SSD on the COCO dataset. YoloV3 uses a single network for detecting objects at 3 scales: one detection head operates on the finest feature map for small objects, one on the intermediate feature map for medium objects, and one on the coarsest feature map for large objects.
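To make equation 2 concrete for our single-class (face) setting, the sketch below computes the shapes of the three detection heads for a 416x416 input; the strides 32, 16, and 8 are YoloV3's standard down-sampling factors (an assumption, since the paper does not list them explicitly):

```python
# Per-scale output tensor shapes following equation (2): S x S x [B * (4 + 1 + C)].
input_size = 416
num_anchors_per_scale = 3  # B: anchors per detection head
num_classes = 1            # C: a single "face" class

for stride in (32, 16, 8):  # coarsest (large objects) to finest (small objects)
    grid = input_size // stride                            # S
    depth = num_anchors_per_scale * (4 + 1 + num_classes)
    print(f"stride {stride:2d}: {grid} x {grid} x {depth}")
# stride 32: 13 x 13 x 18
# stride 16: 26 x 26 x 18
# stride  8: 52 x 52 x 18
```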

Small-object detection uses the earliest, finest feature map because, as the layers go deeper, the feature map shrinks and small objects become increasingly hard to detect; detection must happen before the feature map becomes too small and the small objects disappear [33]. YoloV3 also adopts the feature pyramid network concept: the current feature map is up-sampled and merged with an earlier feature map, making the resulting map richer for detection, as shown in Figure 3.

Figure 3. YoloV3 Object Detection Architecture

4. Experiments with the Default YoloV3 and the Anchor-Matched Version on the WIDER Face Dataset
In this section, we describe the hardware and software specifications, the dataset, and the K-Mean clustering used to generate new anchor boxes. We demonstrate the default YoloV3 and the optimized-anchor version on a copyright-free video downloaded from YouTube [42], detecting faces at various scales in footage taken in New York City. We compare YoloV3 using the default anchor boxes clustered from the COCO dataset against the customized anchor boxes clustered on our dataset. The evaluation metrics are average precision on our dataset, average process time per frame, and frames per second at 608x608 and 1024x768 input image shapes.

4.1. WIDER Face Dataset


The WIDER Face dataset is one of the challenging face detection benchmarks, proposed by Yang et al. in 2015. WIDER Face contains 32,203 images with 393,703 faces covering various scales, poses, and occlusions, split into a 40% training set, a 10% validation set, and a 50% testing set [27][34]. The testing set is released without annotation information; results must be submitted to obtain an evaluation. In our experiments, we evaluate on the validation set using the official WIDER Face evaluation method, which reports average precision split by difficulty level: Easy, Medium, and Hard [34]. The evaluation program code is taken from [43], the official Python version of the WIDER Face evaluation code. During training, we use the training and validation sets, and we apply data augmentation that randomly varies scale, rotation, and image saturation to improve training quality [44].

4.2. Hardware and Software


We implement our YoloV3 in the Python 3.6 programming language on Windows 10 64-bit, managed with an Anaconda environment in which TensorFlow GPU, Keras GPU, and CUDA 10 are installed via the conda command. We run YoloV3 with the default backbone, Darknet-53, without any changes to the network size, and with the default YoloV3 loss function represented in Figure 4.

Figure 4. YoloV3 Loss Function

The A, B, and C terms in Figure 4 compute the localization loss, confidence loss, and classification loss respectively, which are summed into the total loss. During training, we ignore predicted boxes whose IoU with a ground truth is >= 0.5 but which are not the best match, so the network does not learn from them. We train with batch size 5; for the first 5 epochs we freeze most of the network and train only a subset of layers (bottleneck training), starting from the pre-trained ImageNet weights provided by the author of YoloV3. From epoch 5 up to 300, we train all layers with a 416x416 input size. The initial learning rate is 0.0001 and is reduced automatically whenever no improvement is detected for 6 epochs; we also apply early stopping when the validation loss does not improve for 10 epochs. With the default anchor boxes, training stops at the 42nd epoch with a final learning rate of 1x10-9 and a final validation loss of 26.87. We do not continue training the default YoloV3 version, since we consider this its maximum performance on the WIDER Face training and validation sets; it serves as the baseline against which we compare our optimized anchor boxes and training adjustments. For the optimized-anchor version, we train with the same process, but we then continue by retraining the latest model from the last epoch with the kernel regularizer adjusted from the default 0.0005 to 0.0001; this training stops at the 123rd epoch with a final validation loss of 25.38. The training curves are plotted in Figure 5 and Figure 6. We format the ground-truth labels in PASCAL VOC style, i.e., x1, y1, x2, y2, instead of the official YoloV3 style of x, y, w, and h. A sketch of this training schedule follows the hardware list below. The hardware we used is:
 CPU: AMD Ryzen 3 2200G, overclocked to 3.9 GHz, quad-core
 RAM: 8 GB DDR4 2400 MHz
 GPU: NVIDIA RTX 2070 8 GB GDDR5
 Webcam: Microsoft LifeCam HD1080
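Below is a minimal Keras sketch of the training schedule described above; `yolo_model`, `yolo_loss`, `train_gen`, and `val_gen` are placeholders for the compiled YoloV3 network, its loss, and the WIDER Face data generators, so this is an outline under those assumptions rather than a complete trainer:

```python
from keras.callbacks import ReduceLROnPlateau, EarlyStopping
from keras.optimizers import Adam

# Stage 2 of training: all layers unfrozen, epochs 5..300.
# Batch size 5 is configured inside the data generators.
yolo_model.compile(optimizer=Adam(lr=1e-4), loss=yolo_loss)  # initial lr 0.0001

callbacks = [
    # Reduce the learning rate when validation loss stalls for 6 epochs.
    ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=6, verbose=1),
    # Stop training after 10 epochs without validation improvement.
    EarlyStopping(monitor='val_loss', patience=10, verbose=1),
]

yolo_model.fit_generator(train_gen, validation_data=val_gen,
                         initial_epoch=5, epochs=300, callbacks=callbacks)
```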

Figure 5. YoloV3 Training Using Default Anchor Boxes

Figure 6. YoloV3 Training Using Optimized Anchor Boxes on WIDER Face

4.3. Generating Anchor Boxes
We generate anchor boxes that match the sizes of the bounding boxes annotated in the WIDER Face dataset for small, medium, and large objects. With this approach, the network can learn precisely which small, medium, and large face sizes actually occur in the dataset. If we keep the default anchor boxes, the network cannot predict the smallest faces, even though WIDER Face annotates them: for every face smaller than all anchor boxes, the network cannot sense the object inside the box. We generate the anchor boxes by K-Mean clustering over the width and height of each ground-truth box, producing 9 clusters for detecting small, medium, and large objects. Standard K-Mean clustering uses the Euclidean distance to allocate each data point to the nearest centroid [45], but the Euclidean distance would produce larger errors for larger ground-truth boxes. To avoid these errors, we use an IoU (Intersection over Union) based metric to allocate each box to its nearest centroid. At initialization, we choose k random boxes as the initial means, or centroids, a_i. Then each bounding box x is assigned to a cluster C_i according to equation 3:

$C_i = \{\, x : d(x, a_i) \le d(x, a_j)\ \ \forall j,\ 1 \le j \le k \,\}$ (3)

applied with the distance $d(x, a) = 1 - \mathrm{IoU}(x, a)$. The last step of the clustering recomputes the mean of all boundary boxes in C_i, sets it as the new centroid a_i, and repeats until convergence. For a faster implementation, we use the provided Python code for generating the new anchor-box clusters [43]. The new anchor boxes are shown in Table 1; after the K-Mean iterations ended, they reach an accuracy (average IoU between ground-truth boxes and their best-matching anchors) of 78.31% on the WIDER Face training split.
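A minimal NumPy sketch of this clustering, under the assumption that all ground-truth widths and heights have been collected into an (N, 2) array; the code in [43] that we actually used follows the same idea:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """Pairwise IoU between (w, h) arrays, boxes aligned at a common corner."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    areas = boxes[:, 0] * boxes[:, 1]
    c_areas = centroids[:, 0] * centroids[:, 1]
    return inter / (areas[:, None] + c_areas[None, :] - inter)  # shape (N, k)

def kmeans_anchors(boxes, k=9, seed=0):
    """K-Mean clustering of ground-truth (w, h) pairs with d = 1 - IoU."""
    boxes = boxes.astype(float)
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    assignment = None
    while True:
        # Equation 3: assign each box to the centroid with the smallest 1 - IoU.
        new_assignment = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # converged: no box changed cluster
        assignment = new_assignment
        # Recompute each centroid as the mean of its member boxes.
        for i in range(k):
            members = boxes[assignment == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)
    # "Accuracy": mean best IoU between each box and its closest anchor.
    accuracy = iou_wh(boxes, centroids).max(axis=1).mean()
    order = np.argsort(centroids.prod(axis=1))  # sort anchors by area
    return centroids[order], accuracy
```

Feeding the widths and heights of all WIDER Face training annotations into `kmeans_anchors` should produce anchors close to the right-hand column of Table 1, with `accuracy` near the 78.31% reported above.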

Table 1. New Anchor Boxes Generated from WIDER Face Dataset


Cluster   Default Anchor Boxes (w, h)   New Anchor Boxes (w, h)
1         10, 13                        5, 6
2         16, 30                        8, 10
3         33, 23                        11, 14
4         30, 61                        15, 19
5         62, 45                        20, 25
6         59, 119                       28, 36
7         116, 90                       42, 53
8         156, 198                      73, 96
9         373, 326                      168, 224
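Keras-based YoloV3 implementations commonly read their anchors from a comma-separated text file; assuming that convention (the exact format a given codebase expects may differ), the new anchors from Table 1 can be exported like this:

```python
# Export the optimized anchors from Table 1 as a comma-separated "w,h" list,
# the format commonly read by Keras YoloV3 implementations (an assumption;
# adapt to the format your codebase expects).
new_anchors = [(5, 6), (8, 10), (11, 14), (15, 19), (20, 25),
               (28, 36), (42, 53), (73, 96), (168, 224)]
with open('wider_face_anchors.txt', 'w') as f:
    f.write(', '.join(f'{w},{h}' for w, h in new_anchors))
```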

4.4. Result and Comparison


We use the average precision metric, evaluated with the official WIDER Face method and generated using the provided Python code. We run the test on the validation split, which produces a predicted-result text file for each image in the format:
<image name>
<detected faces number>
<face 1> <x1>, <y1>, <w>, <h>
….
<face n> <x1>, <y1>, <w>, <h>

This run also produces detection images containing all predicted boxes (faces) with IoU (Intersection over Union) >= 0.5 and a prediction threshold >= 0.1, as shown in Figure 7 and Figure 8. We run the evaluation at two input sizes, 608x608 and 1024x768, in order to measure how much the average precision changes with the input size. These are also the resolutions used when our model is deployed in the real-case classroom situation, so the average precision rates can be compared at the same resolution. During the evaluation with the official WIDER Face method [34], the text files of predicted results are scored against the ground-truth boxes of the validation split, giving the average precision values shown in Table 2.

Table 2. Evaluation Results of YoloV3 Using Different Anchor Boxes


YOLOV3              IMAGE SIZE   EASY     MEDIUM   HARD
Default Anchors     608x608      82.03%   81.35%   67.55%
Default Anchors     1024x768     78.97%   79.16%   71.40%
Optimized Anchors   608x608      89.53%   86.89%   72.63%
Optimized Anchors   1024x768     87.58%   85.73%   78.04%

Figure 7. Samples of Detection Results Using Default Anchor Boxes

Figure 8. Samples of Detection Results Using Optimized Anchor Boxes

Table 2 shows the improvement from the optimized anchor boxes clustered on the WIDER Face dataset at the easy, medium, and hard face detection levels. At the 608x608 input image shape, average precision improves by 7.50 percentage points on easy, 5.53 on medium, and 5.08 on hard compared with the default anchor boxes; at 1024x768, it improves by 8.61 on easy, 6.57 on medium, and 6.65 on hard. Figures 7 and 8 show the improvement on small faces visually. On the easy-level sample, the default and optimized anchor boxes give roughly the same detection performance, while on the medium and hard samples the optimized anchor boxes yield noticeably more face detections. On the hard sample image, our model with optimized anchor boxes (Figure 8) detects 305 faces, whereas YoloV3 with default anchor boxes (Figure 7) detects 230. This shows that, in our implementation, the optimized anchor boxes increase the chance of detecting small faces compared with the default anchor boxes.

5. Comparison of Optimized Anchor Boxes YoloV3 With DSFD and S3FD
In this section, we compare our YoloV3 model with optimized anchor boxes against two of the 31 methods tested on the WIDER Face dataset benchmark. We choose DSFD, ranked 4th in terms of AP (Average Precision) on the Hard set, and S3FD, ranked 11th on the Hard set but with better FPS performance. The comparison metrics are average precision, average process time per frame, and frames per second. We measure the speed metrics by running the models on the downloaded copyright-free YouTube video [42] and detecting the faces it contains. We run DSFD from the Python code in [36] with the provided pre-trained weights and no retraining, ignoring boxes with IoU >= 0.5 and using a confidence-score threshold of 0.5. We likewise run S3FD from the Python code in [37] with pre-trained weights and no retraining, with the same IoU >= 0.5 suppression and 0.5 confidence-score threshold. The input image shape for DSFD and S3FD is 1024x768. The tests are run on both GPU and CPU, in order to see whether these methods remain relatively acceptable on very limited computation resources. The results are shown in Table 3, Table 4, and Table 5; a minimal sketch of the timing procedure follows.
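The sketch below illustrates the timing procedure with OpenCV, assuming the video has been saved locally and `detect_faces` stands in for a model's inference call (both names are placeholders):

```python
import time
import cv2

cap = cv2.VideoCapture('test_video.mp4')  # the downloaded video [42]; local filename assumed
frames = 0
start = time.time()
while time.time() - start < 500:  # cap the run at 500 seconds
    ok, frame = cap.read()
    if not ok:
        break
    boxes = detect_faces(frame)  # placeholder: YoloV3 / DSFD / S3FD inference
    frames += 1
cap.release()

total = time.time() - start
print(f"avg process time/frame: {total / frames:.2f} s, FPS: {frames / total:.2f}")
```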

Table 3. Average Precision Easy, Medium, and Hard Comparison Results

MODEL                      IMAGE SIZE   EASY     MEDIUM   HARD
YoloV3 Default Anchors     1024x768     78.97%   79.16%   71.40%
YoloV3 Optimized Anchors   1024x768     87.58%   85.73%   78.04%
DSFD                       1024x768     94.15%   93.33%   87.27%
S3FD                       1024x768     93.25%   90.49%   74.49%

Table 4. Average Process Time per Frame and Frame per Second Results on GPU

MODEL                      IMAGE SIZE   TOTAL   TOTAL PROCESS   AVG. PROCESS       FPS
                                        FRAME   TIME (s)        TIME / FRAME (s)
YoloV3 Optimized Anchors   608x608      5064    500.06          0.10               10.12
YoloV3 Optimized Anchors   1024x768     3224    512.63          0.16               6.29
DSFD                       1024x768     113     501.70          4.44               0.23
S3FD                       1024x768     321     500.00          1.56               0.64

Table 5. Average Process Time per Frame and Frame per Second Results on CPU

MODEL                      IMAGE SIZE   TOTAL   TOTAL PROCESS   AVG. PROCESS       FPS
                                        FRAME   TIME (s)        TIME / FRAME (s)
YoloV3 Optimized Anchors   608x608      140     502.35          3.58               0.28
YoloV3 Optimized Anchors   1024x768     65      507.38          7.80               0.12
S3FD                       1024x768     10      540.53          54.05              0.02


The average precision results in Table 3 show that our YoloV3 model with anchor boxes optimized on the WIDER Face dataset underperforms DSFD by 6.57, 7.60, and 9.23 percentage points on the easy, medium, and hard validation splits. Our model also trails S3FD, one of the fast face detector models on WIDER Face, by 5.67 and 4.76 points on the easy and medium splits; however, in our implementation it outperforms S3FD by 3.56 points on the hard split. The average process time per frame and frames per second show that our YoloV3 model is 28x faster than DSFD and 10x faster than S3FD on GPU at 1024x768 resolution. On CPU, our YoloV3 model at 608x608 is about 15x faster than S3FD at 1024x768; DSFD fails to run on our CPU at all, showing an out-of-memory error. These results show that our YoloV3 model is better than DSFD and S3FD in detection speed and fully acceptable for real-time video on GPU. On CPU it is also the fastest in detection time, and it would be relatively more acceptable than the others for non-real-time detection systems. DSFD is slower than our model because of its VGG16 backbone and the additional layers that enhance the feature map, which demand more computation resources in exchange for higher accuracy, compared with our Darknet-53 backbone. The S3FD implementation we use is also based on VGG16, but in a modified version designed for better speed and better accuracy on small faces; even so, our YoloV3 model is more accurate than S3FD on the hard face images, which contain numerous tiny faces.
For the processing-time test, we run each model for 500 seconds (about 8 minutes) as the maximum running time on our chosen video. Figure 9 and Figure 10 show that, within those 500 seconds, our YoloV3 model processes and detects a higher number of frames than DSFD and S3FD on both GPU and CPU. A better CPU would bring better detection speed.

Figure 9. Process Time Graph Run on GPU

Figure 10. Process Time Graph Run on CPU

6. Implementation on Class Lecturing Session
To extend the contribution of our research on a real-time face detector using YoloV3 with anchor boxes optimized for better detection of tiny faces, we deploy our YoloV3 model in a real situation: students attending an actual lecturing session. The deployment takes place at a business school in Jakarta, Indonesia, in a classroom with a total of 15 people, including the lecturer. The students and lecturer are active, with various poses, movements, and face scales. The implementation is split into two sessions: the first runs with a 608x608 input image shape, the second with 1024x768. We set the confidence-score threshold to >= 0.3 to increase the chance of detecting more faces, considering the high movement activity of the students and the lecturer.

Figure 11. Implementation Performance on Class Lecturing Session in 608x608 Input Image Shape

Figure 12. Implementation Performance on Class Lecturing Session in 1024x768 Input Image Shape

Figure 13. Example of Real-Time Implementation on Class Lecturing Session

Figure 11 shows the robustness of our YoloV3 model at the 608x608 input image shape: it is real-time capable while maintaining a visually acceptable average confidence score. At this input size, our model runs at 9 to 10 frames per second during an active lecturing session in which the lecturer also appears in front of the camera, as in Figure 13. The model also runs in real time at 1024x768 resolution with a better average confidence score, as shown in Figure 12. The reduced frame rate, 6 to 7 frames per second, is the cost of the increased resolution and better detection performance, but at 1024x768 this is still relatively acceptable for a real-time face detector. A better GPU, or multiple GPUs, would bring better detection speed for this implementation.

7. Conclusions
In this paper, we have improved the YoloV3 model with anchor boxes optimized by K-Mean clustering on the WIDER Face training set. Our research shows that anchor boxes well matched to the target objects improve average precision, as seen in the WIDER Face evaluation at the easy, medium, and hard detection levels in Table 2. As the S3FD work notes, when anchor boxes are bigger than the objects, detection accuracy drops dramatically because the network ignores objects smaller than the anchor boxes [32]; moreover, smaller faces carry few features, in line with the resolution of the input image shape. We also measured detection speed against other models tested on the WIDER Face dataset: our YoloV3 model is 28 times faster than DSFD and 10 times faster than S3FD at 1024x768 resolution on an NVIDIA RTX 2070 GPU, with a relatively acceptable accuracy gap, and it is 3.56 percentage points more accurate than S3FD at the hard detection level. This shows that our model is better at small-face detection while having a faster detection speed. We then deployed our YoloV3 model in a real situation, detecting the faces of students attending a lecturing session inside a classroom. This real-situation implementation shows the robustness of our model in both accuracy and detection speed. At 608x608 resolution, the confidence-score graph in Figure 11 shows acceptable face detection in every processed frame, and the process-time graph shows the model running at 9 to 10 frames per second. At 1024x768 resolution, the confidence-score graph shows better detection scores (Figure 12), with a reduced detection speed of 6 to 7 frames per second as the trade-off. In this real lecturing session, all faces were detected successfully, as shown in Figure 13.

Acknowledgments
The authors wish to thank Bina Nusantara University for supporting this publication, and the School of Business and Management, Institut Teknologi Bandung, Jakarta Campus, for allowing us to run our experiment inside a classroom during a lecturing session.

References
[1] Benjdira, B., Khursheed, T., Koubaa, A., Ammar, A., & Ouni, K., “Car Detection using
Unmanned Aerial Vehicles: Comparison between Faster R-CNN and YOLOv3”, In 2019 1st
International Conference on Unmanned Vehicle Systems-Oman (UVS). IEEE, (2019).
[2] Chollet, F, “Deep learning with Python”, Shelter Island, NY: Manning Publications Co.,
(2018).
[3] Faen Zhang and Xinyu Fan and Guo Ai and Jianfei Song and Yongqiang Qin and Jiahong Wu,
“Accurate Face Detection for High Performance”, Computer Vision and Pattern Recognition,
(2019).
[4] Gulli, A., “Deep learning with Keras: implement neural networks with Keras on Theano and
TensorFlow”, Birmingham, UK: Packt Publishing, (2017).
[5] He, K., Zhang, X., Ren, S., & Sun, J., “Spatial Pyramid Pooling in Deep Convolutional
Networks for Visual Recognition”, In Computer Vision – ECCV Springer International,
(2014), pp. 346–361.

[6] Huang, R., Pedoeem, J., & Chen, C., “YOLO-LITE: A Real-Time Object Detection Algorithm
Optimized for Non-GPU Computers”, In 2018 IEEE International Conference on Big Data
(Big Data), (2018).
[7] Jensen, M. B., Nasrollahi, K., & Moeslund, T. B., “Evaluating State-of-the-Art Object
Detector on Challenging Traffic Light Data”, In 2017 IEEE Conference on Computer Vision
and Pattern Recognition Workshops (CVPRW). IEEE, (2017).
[8] Jian Li, Yabiao Wang, Changan Wang, Ying Tai, Jianjun Qian, Jian Yang, Chengjie Wang,
Jilin Li, Feiyue Huang, “DSFD: Dual Shot Face Detector”, Computer Vision and Pattern
Recognition, (2019).
[9] Jiankang Deng and Jia Guo and Yuxiang Zhou and Jinke Yu and Irene Kotsia and Stefanos
Zafeiriou, “RetinaFace: Single-stage Dense Face Localisation in the Wild”, Vision and
Pattern Recognition, (2019).
[10] Jiwoong Choi, Dayoung Chun, Hyun Kim, Hyuk-Jae Lee, “Gaussian YOLOv3: An Accurate
and Fast Object Detector Using Localization Uncertainty for Autonomous Driving”,
Computer Vision and Pattern Recognition, (2019).
[11] Joseph Redmon, Ali Farhadi, “YOLOv3: An Incremental Improvement”, Computer Vision
and Pattern Recognition, (2018).
[12] Karen Simonyan and Andrew Zisserman, “Very Deep Convolutional Networks for Large-
Scale Image Recognition”, Computer Vision and Pattern Recognition, (2014).
[13] Kim, K.-J., Kim, P.-K., Chung, Y.-S., & Choi, D.-H, “Performance Enhancement of YOLOv3
by Adding Prediction Layers with Spatial Pyramid Pooling for Vehicle Detection”, In 2018
15th IEEE International Conference on Advanced Video and Signal Based Surveillance
(AVSS). IEEE, (2018).
[14] Li, C., Wang, R., Li, J., & Fei, L., “Face Detection Based on YOLOv3”, In Recent Trends in
Intelligent Computing, Communication and Devices, Springer Singapore, (2019), pp. 277–
284.
[15] Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., & Belongie, S., “Feature Pyramid
Networks for Object Detection”, In 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), (2017).
[16] Liu, C., Guo, Y., Li, S., & Chang, F., “ACF Based Region Proposal Extraction for YOLOv3
Network Towards High-Performance Cyclist Detection in High Resolution Images”, Sensors,
19(12), (2019).
[17] Qu, H., Yuan, T., Sheng, Z., & Zhang, Y., “A Pedestrian Detection Method Based on
YOLOv3 Model and Image Enhanced by Retinex”, In 2018 11th International Congress on
Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). IEEE,
(2018).
[18] Redmon, J., & Farhadi, A., “YOLO9000: Better, Faster, Stronger”, In 2017 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), (2017).
[19] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A., “You Only Look Once: Unified, Real-
Time Object Detection”, In 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). IEEE, (2016).
[20] Shifeng Zhang and Cheng Chi and Zhen Lei and Stan Z. Li., “RefineFace: Refinement Neural
Network for High Performance Face Detection”, Computer Vision and Pattern Recognition,
(2019).
[21] Tian, Y., Yang, G., Wang, Z., Wang, H., Li, E., & Liang, Z., “Apple detection during different
growth stages in orchards using the improved YOLO-V3 model”, Computers and Electronics
in Agriculture, (2019), pp 417–426.
[22] Viola, P., & Jones, M., “Rapid object detection using a boosted cascade of simple
features”, In Proceedings of the 2001 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition. CVPR, (2001).
[23] Won, J.-H., Lee, D.-H., Lee, K.-M., & Lin, C.-H., “An Improved YOLOv3-based Neural
Network for De-identification Technology”, In 2019 34th International Technical Conference
on Circuits/Systems, Computers and Communications (ITC-CSCC). IEEE, (2019).
[24] Won, J.-H., Lee, D.-H., Lee, K.-M., & Lin, C.-H., “An Improved YOLOv3-based Neural
Network for De-identification Technology”, In 2019 34th International Technical Conference
on Circuits/Systems, Computers and Communications (ITC-CSCC). IEEE, (2019).
[25] Wu, F., Jin, G., Gao, M., HE, Z., & Yang, Y., “Helmet Detection Based On Improved YOLO
V3 Deep Model”, In 2019 IEEE 16th International Conference on Networking, Sensing and
Control (ICNSC), (2019).

[26] Xiao, D., Shan, F., Li, Z., Le, B. T., Liu, X., & Li, X., “A Target Detection Model Based on
Improved Tiny-Yolov3 Under the Environment of Mining Truck”, IEEE Access, (2019), pp
123757–123764.
[27] Yang, S., Luo, P., Loy, C. C., & Tang, X., “WIDER FACE: A Face Detection Benchmark”, In
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, (2016).
[28] Yang, W., & Jiachun, Z., “Real-time face detection based on YOLO”, In 2018 1st IEEE
International Conference on Knowledge Innovation and Invention (ICKII). IEEE, (2018).
[29] Yi, Z., Yongliang, S., & Jun, Z., “An improved tiny-yolov3 pedestrian detection algorithm”,
Optik, vol. 183, (2019), pp. 17–23.
[30] YoungJoon Yoo and Dongyoon Han and Sangdoo Yun, “EXTD: Extremely Tiny Face
Detector via Iterative Filter Reuse”, Computer Vision and Pattern Recognition, (2019).
[31] Zhang, C., Chang, C., & Jamshidi, M., “Concrete bridge surface damage detection using a
single-stage detector”, Computer-Aided Civil and Infrastructure Engineering, (2019).
[32] Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., & Li, S. Z., “S^3FD: Single Shot Scale-
Invariant Face Detector”, In 2017 IEEE International Conference on Computer Vision
(ICCV). IEEE, (2017).
[33] Zhao, Z.-Q., Zheng, P., Xu, S.-T., & Wu, X., “Object Detection With Deep Learning: A
Review”, IEEE Transactions on Neural Networks and Learning Systems, (2019), pp 1–21.
[34] Shuoyang1213.me, “WIDER FACE: A Face Detection Benchmark”, [online] Available at:
http://shuoyang1213.me/WIDERFACE/ [Accessed 19 Oct. 2019]
[35] Paperswithcode.com, “State-of-the-art table for Face Detection on WIDER Face (Hard)”,
[online] Available at: https://paperswithcode.com/sota/face-detection-on-wider-face-hard
[Accessed 19 Oct. 2019]
[36] GitHub, “610265158/DSFD-tensorflow”, [online] Available at:
https://github.com/610265158/DSFD-tensorflow [Accessed 19 Oct. 2019]
[37] GitHub, “cs-giung/face-detection-pytorch”, [online] Available at: https://github.com/cs-
giung/face-detection-pytorch [Accessed 19 Oct. 2019]
[38] Z. Wojna, Y. S. Song, S. Guadarrama, and K. Murphy, “Speed/accuracy trade-offs for modern
convolutional object detectors,” in CVPR, (2017).
[39] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C., “SSD:
Single Shot MultiBox Detector”, In Computer Vision ECCV 2016, Springer International
Publishing, (2016), pp 21–37
[40] Ioffe, S. & Szegedy, C., “Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift”, CoRR, (2015).
[41] He, K., Zhang, X., Ren, S., & Sun, J., “Deep Residual Learning for Image Recognition”, In
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, (2016).
[42] YouTube video, retrieved from https://www.youtube.com/watch?v=DIv-kECn7wA
[43] Xiyinmsu, “xiyinmsu/python-wider_eval”, Retrieved from
https://github.com/xiyinmsu/python-wider_eval
[44] Mikolajczyk, A., & Grochowski, M., “Data augmentation for improving deep learning in
image classification problem”, In 2018 International Interdisciplinary PhD Workshop
(IIPhDW). IEEE. (2018).
[45] AbdAllah, L., & Shimshoni, I., “K-Means over Incomplete Datasets Using Mean Euclidean
Distance”, In Machine Learning and Data Mining in Pattern Recognition, Springer
International Publishing, (2016), pp. 113–127
