BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs
Valentin Bazarevsky Yury Kartynnik Andrey Vakunov Karthik Raveendran Matthias Grundmann
Google Research
1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA
{valik, kartynnik, vakunov, krav, grundman}@google.com
arXiv:1907.05047v2 [cs.CV] 14 Jul 2019
[Figure: single BlazeBlock (left) and double BlazeBlock (right). Each block applies a 5×5 depthwise convolution followed by a 1×1 projection convolution, with a residual branch that optionally max-pools and channel-pads the input, and an activation after each addition.]
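The single BlazeBlock pictured above can be sketched as a minimal NumPy forward pass. This is an illustrative reconstruction, not the authors' implementation: the helper names, the zero weights in the test below, and the choice of ReLU as the activation are assumptions; the kernel shapes follow Table 4 in Appendix A.

```python
import numpy as np

def depthwise_conv(x, kernel, stride=1):
    """'Same'-padded depthwise convolution: x is (H, W, C), kernel is (kh, kw, C)."""
    kh, kw, c = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw), (0, 0)))
    out_h, out_w = -(-x.shape[0] // stride), -(-x.shape[1] // stride)
    out = np.zeros((out_h, out_w, c))
    for i in range(out_h):
        for j in range(out_w):
            patch = xp[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = (patch * kernel).sum(axis=(0, 1))
    return out

def blaze_block(x, dw_kernel, pw_weights, stride=1):
    """Single BlazeBlock: 5x5 depthwise conv -> 1x1 projection, plus a
    residual branch that is 2x2 max-pooled (when stride is 2) and
    zero-padded in channels whenever the main branch widens the tensor."""
    y = np.matmul(depthwise_conv(x, dw_kernel, stride), pw_weights)
    r = x
    if stride == 2:
        h, w, c = r.shape
        r = r.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))  # 2x2 max pool
    pad = y.shape[-1] - r.shape[-1]
    if pad > 0:
        r = np.pad(r, ((0, 0), (0, 0), (0, pad)))  # channel pad
    return np.maximum(y + r, 0.0)  # activation after the residual addition
```

A double BlazeBlock would chain two such depthwise stages with an intermediate low-channel projection, following the layout of Table 4.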
Feature extractor. For a specific example, we focus on the feature extractor for the front-facing camera model. It has to account for a smaller range of object scales and therefore has lower computational demands. The extractor takes an RGB input of 128×128 pixels and consists of a 2D convolution followed by 5 single BlazeBlocks and 6 double BlazeBlocks (see Table 4 in Appendix A for the full layout). The highest tensor depth (channel resolution) is 96, while the lowest spatial resolution is 8×8 (in contrast to SSD, which reduces the resolution all the way down to 1×1).

Anchor scheme. SSD-like object detection models rely on pre-defined fixed-size base bounding boxes called priors, or anchors in Faster R-CNN [8] terminology. A set

Post-processing. As our feature extractor does not reduce the resolution below 8×8, the number of anchors overlapping a given object increases significantly with the object size. In a typical non-maximum suppression scenario, only one of the anchors "wins" and is used as the final algorithm outcome. When such a model is applied to subsequent video frames, the predictions tend to fluctuate between different anchors and exhibit temporal jitter (human-perceptible noise).

To minimize this phenomenon, we replace the suppression algorithm with a blending strategy that estimates the regression parameters of a bounding box as a weighted mean of the overlapping predictions. This incurs virtually no additional cost over the original NMS algorithm. For our face detection task, this adjustment resulted in a 10% increase in accuracy.

We quantify the amount of jitter by passing several slightly offset versions of the same input image into the network and observing how the model outcomes (adjusted to account for the translation) are affected. After the described tie resolution strategy modification, the jitter metric, defined as the root mean squared difference between the predictions for the original and displaced inputs, dropped by 40% on our frontal camera dataset and by 30% on a rear-facing camera dataset containing smaller faces.

Model             Regression error   Jitter metric
MobileNetV2-SSD   7.4%               3.6%
Ours              10.4%              5.3%

Table 3. Regression parameters prediction quality
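The blending replacement for non-maximum suppression described above can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not the released implementation: the IoU threshold value and the use of confidence scores as the blending weights are assumptions not spelled out in the text.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one [x1, y1, x2, y2] box against an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def blending_nms(boxes, scores, iou_thresh=0.3):
    """Replace winner-takes-all suppression with a score-weighted mean
    of the overlapping boxes' regression parameters."""
    remaining = np.argsort(scores)[::-1]          # highest score first
    out_boxes, out_scores = [], []
    while remaining.size:
        top = remaining[0]
        ious = iou(boxes[top], boxes[remaining])
        group = remaining[ious >= iou_thresh]     # predictions to blend
        w = scores[group]
        out_boxes.append((boxes[group] * w[:, None]).sum(axis=0) / w.sum())
        out_scores.append(scores[top])
        remaining = remaining[ious < iou_thresh]  # drop the blended group
    return np.array(out_boxes), np.array(out_scores)
```

Compared with standard NMS, only the `out_boxes.append` line changes: instead of keeping `boxes[top]` verbatim, the overlapping group is averaged, which stabilizes the box across frames.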
4. Experiments

We trained our model on a dataset of 66K images. For evaluation, we used a private geographically diverse dataset consisting of 2K images. For the frontal camera model, only faces that occupy more than 20% of the image area were considered due to the intended use case (the threshold for the rear-facing camera model was 5%).

The regression parameter errors were normalized by the inter-ocular distance (IOD) for scale invariance, and the median absolute error was measured to be 7.4% of IOD. The jitter metric evaluated by the procedure mentioned above was 3% of IOD.

Table 1 shows the average precision (AP) metric [5] (with a standard 0.5 intersection-over-union bounding box match threshold) and the mobile GPU inference time for the proposed frontal face detection network, and compares it to a MobileNetV2-based object detector with the same anchor coding scheme (MobileNetV2-SSD). We used TensorFlow Lite GPU [2] in 16-bit floating point mode as the framework for inference time evaluation.

Model             Average Precision   Inference Time, ms (iPhone XS)
MobileNetV2-SSD   97.95%              2.1
Ours              98.61%              0.6

Table 1. Frontal camera face detection performance

Table 2 gives a perspective on the GPU inference speed for the two network models across more flagship devices.

Device                           MobileNetV2-SSD, ms   Ours, ms
Apple iPhone 7                   4.2                   1.8
Apple iPhone XS                  2.1                   0.6
Google Pixel 3                   7.2                   3.4
Huawei P20                       21.3                  5.8
Samsung Galaxy S9+ (SM-G965U1)   7.2                   3.7

Table 2. Inference speed across several mobile devices

Table 3 shows the amount of degradation in the regression parameter prediction quality that is caused by the smaller model size. As explored in the following section, this does not necessarily incur a proportional degradation of the whole AR pipeline quality.

5. Applications

The proposed model, operating on the full image or a video frame, can serve as the first step of virtually any face-related computer vision application, such as 2D/3D facial keypoints, contour, or surface geometry estimation, facial features or expression classification, and face region segmentation. The subsequent task in the computer vision pipeline can thus be defined in terms of a proper facial crop. Combined with the few facial keypoint estimates provided by BlazeFace, this crop can also be rotated so that the face inside is centered, scale-normalized, and has a roll angle close to zero. This removes the requirement of significant translation and rotation invariance from the task-specific model, allowing for better computational resource allocation.

We illustrate this pipelined approach with a specific example of face contour estimation. In Figure 3, we show how the output of BlazeFace, i.e. the predicted bounding box and the 6 keypoints of the face (red), are further refined by a more complex face contour estimation model applied to a slightly expanded crop. The detailed keypoints yield a finer bounding box estimate (green) that can be reused for tracking in the subsequent frame without running the face detector. To detect failures of this computation-saving strategy, the contours model can also detect whether the face is indeed present and reasonably aligned in the provided rectangular crop. Whenever that condition is violated, the BlazeFace face detector is run on the whole video frame again.

Figure 3. Pipeline example (best viewed in color). Red: BlazeFace output. Green: task-specific model output.

The technology described in this paper is driving major AR self-expression applications and AR developer APIs on mobile phones.
References

[1] Metal performance shaders. https://developer.apple.com/documentation/metalperformanceshaders. [Online; accessed April 19, 2019].
[2] TFLite on GPU. https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/gpu. [Online; accessed April 19, 2019].
[3] Andrew Howard et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[4] Wei Liu et al. SSD: Single shot MultiBox detector. In European Conference on Computer Vision, pages 21–37, 2016.
[5] Mark Everingham, Luc Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vision, 88(2):303–338, 2010.
[6] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[7] Pengchong Jin, Vivek Rathod, and Xiangxin Zhu. Pooling pyramid network for object detection. arXiv preprint arXiv:1807.03284, 2018.
[8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[9] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

Appendix A. Feature extraction network architecture

Layer/block         Input size    Conv. kernel sizes
Convolution         128×128×3     5×5×3×24 (stride 2)
Single BlazeBlock   64×64×24      5×5×24×1, 1×1×24×24
Single BlazeBlock   64×64×24      5×5×24×1, 1×1×24×24
Single BlazeBlock   64×64×24      5×5×24×1 (stride 2), 1×1×24×48
Single BlazeBlock   32×32×48      5×5×48×1, 1×1×48×48
Single BlazeBlock   32×32×48      5×5×48×1, 1×1×48×48
Double BlazeBlock   32×32×48      5×5×48×1 (stride 2), 1×1×48×24, 5×5×24×1, 1×1×24×96
Double BlazeBlock   16×16×96      5×5×96×1, 1×1×96×24, 5×5×24×1, 1×1×24×96
Double BlazeBlock   16×16×96      5×5×96×1, 1×1×96×24, 5×5×24×1, 1×1×24×96
Double BlazeBlock   16×16×96      5×5×96×1 (stride 2), 1×1×96×24, 5×5×24×1, 1×1×24×96
Double BlazeBlock   8×8×96        5×5×96×1, 1×1×96×24, 5×5×24×1, 1×1×24×96
Double BlazeBlock   8×8×96        5×5×96×1, 1×1×96×24, 5×5×24×1, 1×1×24×96

Table 4. BlazeFace feature extraction network architecture
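As a sanity check on the layout in Table 4, a few lines of Python can walk the table and confirm the figures quoted in the text: 5 single and 6 double BlazeBlocks, a final 8×8 spatial resolution, and a highest channel depth of 96. The `layers` list below transcribes only each row's block kind, stride, and output channel count.

```python
# (kind, stride, output channels) for each row of Table 4
layers = [
    ("conv", 2, 24),
    ("single", 1, 24), ("single", 1, 24), ("single", 2, 48),
    ("single", 1, 48), ("single", 1, 48),
    ("double", 2, 96), ("double", 1, 96), ("double", 1, 96),
    ("double", 2, 96), ("double", 1, 96), ("double", 1, 96),
]

def output_shape(h=128, w=128, c=3):
    """Track the spatial resolution and channel depth through the network."""
    for _, stride, out_ch in layers:
        h, w, c = h // stride, w // stride, out_ch
    return h, w, c
```

Running `output_shape()` reproduces the 8×8×96 bottleneck that the anchor scheme and post-processing discussion rely on.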