MTCNN 1

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 1

IEEE SIGNAL PROCESSING LETTERS, VOL. 23, NO.

10, OCTOBER 2016 1499

Joint Face Detection and Alignment Using Multitask


Cascaded Convolutional Networks
Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, Senior Member, IEEE, and Yu Qiao, Senior Member, IEEE

Abstract—Face detection and alignment in unconstrained envi- cade structure, Mathias et al. [5]–[7] introduce deformable part
ronment are challenging due to various poses, illuminations, and models for face detection and achieve remarkable performance.
occlusions. Recent studies show that deep learning approaches can However, they are computationally expensive and may usually
achieve impressive performance on these two tasks. In this letter, require expensive annotation in the training stage. Recently,
we propose a deep cascaded multitask framework that exploits
the inherent correlation between detection and alignment to boost
convolutional neural networks (CNNs) achieve remarkable pro-
up their performance. In particular, our framework leverages a gresses in a variety of computer vision tasks, such as image
cascaded architecture with three stages of carefully designed deep classification [9] and face recognition [10]. Inspired by the sig-
convolutional networks to predict face and landmark location in a nificant successes of deep learning methods in computer vi-
coarse-to-fine manner. In addition, we propose a new online hard sion tasks, several studies utilize deep CNNs for face detection.
sample mining strategy that further improves the performance in Yang et al. [11] train deep CNNs for facial attribute recog-
practice. Our method achieves superior accuracy over the state- nition to obtain high response in face regions, which further
of-the-art techniques on the challenging face detection dataset and yield candidate windows of faces. However, due to its com-
benchmark and WIDER FACE benchmarks for face detection,
plex CNN structure, this approach is time costly in practice.
and annotated facial landmarks in the wild benchmark for face
alignment, while keeps real-time performance. Li et al. [19] use cascaded CNNs for face detection, but it re-
quires bounding box calibration from face detection with ex-
Index Terms—Cascaded convolutional neural network (CNN), tra computational expense and ignores the inherent correla-
face alignment, face detection. tion between facial landmarks localization and bounding box
regression.
Face alignment also attracts extensive research interests. Re-
I. INTRODUCTION
search works in this area can be roughly divided into two cate-
ACE detection and alignment are essential to many face
F applications, such as face recognition and facial expression
analysis. However, the large visual variations of faces, such as
gories, regression-based methods [12], [13], [16], and template
fitting approaches [7], [14], [15]. Recently, Zhang et al. [22]
proposed to use facial attribute recognition as an auxiliary task
occlusions, large pose variations, and extreme lightings, impose to enhance face alignment performance using deep CNN.
great challenges for these tasks in real-world applications. However, most of previous face detection and face alignment
The cascade face detector proposed by Viola and Jones [2] methods ignore the inherent correlation between these two tasks.
utilizes Haar-Like features and AdaBoost to train cascaded clas- Though several existing works attempt to jointly solve them,
sifiers, which achieves good performance with real-time effi- there are still limitations in these works. For example, Chen
ciency. However, quite a few works [1], [3], [4] indicate that et al. [18] jointly conduct alignment and detection with random
this kind of detector may degrade significantly in real-world forest using features of pixel value difference. But, these hand-
applications with larger visual variations of human faces even craft features limit its performance a lot. Zhang et al. [20] use
with more advanced features and classifiers. Besides the cas- multitask CNN to improve the accuracy of multiview face de-
tection, but the detection recall is limited by the initial detection
Manuscript received April 7, 2016; revised June 12, 2016 and July 31,
window produced by a weak face detector.
2016; accepted August 10, 2016. Date of publication August 26, 2016; On the other hand, mining hard samples in training is critical
date of current version September 9, 2016. This work was supported to strengthen the power of detector. However, traditional hard
in part by External Cooperation Program of BIC, in part by Chinese sample mining usually performs in an offline manner, which
Academy of Sciences (172644KYSB20160033, 172644KYSB20150019), in
part by Shenzhen Research Program under Grant KQCX2015033117354153,
significantly increases the manual operations. It is desirable to
Grant JSGG20150925164740726, Grant CXZZ20150930104115529, Grant design an online hard sample mining method for face detection,
CYJ20150925163005055, and Grant JCYJ201 60510154736343, in part by which is adaptive to the current training status automatically.
Guangdong Research Program under Grant 2014B050505017 and Grant In this letter, we propose a new framework to integrate these
2015B010129013, in part by the Natural Science Foundation of Guangdong
Province under Grant 2014A030313688, and in part by the Key Laboratory of
two tasks using unified cascaded CNNs by multitask learning.
Human Machine Intelligence-Synergy Systems through the Chinese Academy The proposed CNNs consist of three stages. In the first stage, it
of Sciences. The associate editor coordinating the review of this manuscript and produces candidate windows quickly through a shallow CNN.
approving it for publication was Dr. Alexandre X. Falcao. Then, it refines the windows by rejecting a large number of
K. Zhang, Z. Li, and Y. Qiao are with Shenzhen Institutes of Advanced
Technology, Chinese Academy of Sciences, Shenzhen 518055, China (e-mail:
nonfaces windows through a more complex CNN. Finally, it
kp.zhang@siat.ac.cn; zhifeng.li@siat.ac.cn; yu.qiao@siat.ac.cn). uses a more powerful CNN to refine the result again and output
Z. Zhang is with the Department of Information Engineering, The Chinese five facial landmarks positions. Thanks to this multitask learning
University of Hong Kong, Hong Kong (e-mail: zz013@ie.cuhk.edu.hk). framework, the performance of the algorithm can be notably
Color versions of one or more of the figures in this letter are available online
at http://ieeexplore.ieee.org.
improved. The major contributions of this letter are summarized
Digital Object Identifier 10.1109/LSP.2016.2603342 as follows:

1070-9908 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications standards/publications/rights/index.html for more information.

You might also like