
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 24, NO. 2, FEBRUARY 2014

A Framework for Making Face Detection Benchmark Databases

Gee-Sern Hsu, Member, IEEE, and Tsu-Ying Chu

Abstract—The images in face detection benchmark databases are mostly taken by consumer cameras, and thus are constrained by popular preferences, including a frontal pose and balanced lighting conditions. A good face detector should consider beyond such constraints and work well for other types of images, for example, those captured by a surveillance camera. To overcome such constraints, a framework is proposed to transform a mother database, originally made for benchmarking face recognition, to daughter datasets that are good for benchmarking face detection. The daughter datasets can be customized to meet the requirements of various performance criteria; therefore, a face detector can be better evaluated on desired datasets. The framework is composed of two phases: 1) intrinsic parametrization and 2) extrinsic parametrization. The former parametrizes the intrinsic variables that affect the appearance of a face, and the latter parametrizes the extrinsic variables that determine how faces appear on an image. Experiments reveal that the proposed framework can generate not just data that are similar to those available from popular benchmark databases, but also those that are hardly available from existing databases. The datasets generated by the proposed framework offer the following advantages: 1) they can define the performance specification of a face detector in terms of the detection rates on variables with different variation scopes; 2) they can benchmark the performance on one single or multiple variables, which can be difficult to collect; and 3) their ground truth is available when the datasets are generated, avoiding the time-consuming manual annotation.

Index Terms—Face database, face detection, performance evaluation.

Manuscript received July 17, 2012; revised March 6, 2013; accepted April 15, 2013. Date of publication May 31, 2013; date of current version February 4, 2014. This paper was recommended by Associate Editor C. Shan. G.-S. Hsu is with the Artificial Vision Laboratory, National Taiwan University of Science and Technology, Taipei 106, Taiwan (e-mail: jison@mail.ntust.edu.tw). T.-Y. Chu is with Vols Taipei (e-mail: vacation5588@hotmail.com). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSVT.2013.2265571

I. Introduction

ALMOST all face detection algorithms are evaluated using datasets with images collected manually from various sources. Each collected image has one or a few faces in it with ground truth manually annotated. The performance is measured by the difference between the ground truth and those determined by the algorithm [1]. It is commonly acknowledged that an ideal face detector should be able to detect faces of any sizes, orientations, poses, expressions, under various illumination conditions, and with locations anywhere in the image. Some even demand a certain level of robustness against occlusion. It will be extremely exhausting, if not impossible, to collect image samples that are good enough to encompass all or most of the aforementioned variables with a sufficiently large scope of variation in each variable.

After the works done by Sung and Poggio [2], Rowley et al. [3], and Schneiderman and Kanade [4], their collections have been considered as benchmark datasets. The dataset collected by Sung and Rowley is commonly used for the assessment of detection of nearly frontal faces, and Schneiderman’s dataset is used for the detection of nearly profile faces. Both datasets are often known as the MIT+CMU dataset as a whole. An extensive review of the works published before 2002 by Yang et al. [5] reveals that Rowley’s dataset was the most popular one for performance evaluation by that time. We have also completed a review that covers 42 face detection methods published between 2002 and 2011 (details reported in the next section), of which 29 (69.1%) use the MIT+CMU dataset. Although many variables, including pose, orientation, illumination, and size, are sampled in the MIT+CMU dataset, only a limited scope of variation in each variable is revealed by the samples. Take illumination as an example. Rowley’s dataset has 130 images with 507 faces in total: 70% of the faces have nearly uniform illumination, 9.8% are lit on the right side, 14.2% on the left, 3.8% on the top, and 2.2% on the bottom. Because the performance of a face detector strongly depends on the training set, if the training set contains only samples with uniform illumination and samples lit on the right or left side, the resultant face detector will perform poorly on detecting faces lit on the top or bottom. However, such a face detector can attain a detection rate of 94% on Rowley’s dataset, which only has 6% of its samples lit on the top or bottom. The above observations raise the following issues with many existing benchmark databases.

1) The images captured by consumer cameras are mostly with general preferences, such as frontal pose, smiling, and uniform illumination; the variables contained in such a collection are mostly bandlimited, where the band refers to the variation scope in each variable. These databases may be good enough for evaluating the face detector needed for consumer cameras. However, it can hardly be justified that they are appropriate for other applications. The face detector required by an outdoor surveillance camera spotting a street corner would be better evaluated on samples with limited poses and sizes but with a large variation in illumination from dawn to dusk.

2) Many face detectors report detection rates on, for example, Rowley’s dataset, including 96.7% in [6], 90% in [7], and 92.3% in [8]. However, only a few report the characteristics of the samples that make them fail to be detected. Among the reported cases in [8]–[10], the causes include unbalanced illumination, partial occlusion, low contrast, and low image qualities. Because the samples with these characteristics of variables are far less than those with generally preferred characteristics, e.g., balanced lighting and frontal pose, the reported detection rates can be considered valid only for the applications with similar characteristics of variations. If unbalanced illumination, for example, is considered to be the most common in certain applications, the performance of the aforementioned face detectors is better reevaluated using a dataset with unbalanced illumination samples.

The aforementioned issues motivated the creation of the proposed framework that can generate application-oriented and variable-specifiable datasets. The framework takes in faces from a mother database, which is originally made for benchmarking face recognition, and generates daughter datasets with attributes that meet the requirements of a desired evaluation criterion. The mother database offers facial samples characterized by intrinsic variables, and each intrinsic variable covers a large scope of variation. The intrinsic variables refer to those that are able to change the appearance of a face, such as illumination and pose. Each daughter dataset is generated or synthesized with required settings on the intrinsic and extrinsic variables. The extrinsic variables refer to those that determine how faces appear on an image, such as the number and size range of the faces. Extended and improved from our preliminary work in [11], which lacks a justification on the validity of the synthesized daughter datasets and does not offer a guideline on how to use the daughter datasets, this paper verifies the validity of the daughter datasets by an experimental comparison with popular benchmark databases, and presents a case study on how to design daughter datasets to characterize the performance of a face detector.

The contributions of this paper include the following.

1) Different from existing benchmark datasets with variables of fixed scopes of variation, the synthesized daughter datasets can be made with a desired scope of variables that meets the requirements of various test criteria.
2) The performance specification of a face detector can be defined via the daughter datasets with special settings on the intrinsic and extrinsic variables. The specification is given in terms of the detection rate on each variable with a wide range of variation. To the best of our knowledge, this is the first time that a face detector can be associated with its performance specification.

Some other advantages deserve to be noted. Unlike manually collected datasets that require time-consuming manual annotation of the ground truth, the ground truth of the daughter datasets is given when they are generated. Furthermore, the proposed framework can be applied to other object recognition databases so that they can be transformed to datasets that are good for evaluating object detection.

The rest of this paper is organized as follows. Section II reviews the datasets used by many for performance evaluation and highlights their limitations. The proposed framework is presented in Section III, with two phases in it: 1) the intrinsic parametrization (IP) and 2) extrinsic parametrization (EP). To demonstrate that the proposed framework is able to generate daughter datasets for different criteria, Section IV presents experimental verifications following two protocols. In Protocol 1, the framework synthesizes two daughter datasets with attributes that are similar to those in Rowley’s dataset [3] and the new benchmark FDDB dataset [12]. A few face detectors are tested on both types of datasets to compare the performance, justifying the validity of the daughter datasets. In Protocol 2, the performance specification of a face detector is defined via the daughter datasets with specially designed attributes, offering a case study on the application of the proposed framework. This paper is concluded in Section V.

II. Issues With Existing Benchmarks

The extensive survey in [5] shows that Rowley’s dataset was the most popular one for benchmarking by 2002. Our review, which covers 42 face detection methods [6]–[10] and [13]–[49] published between 2002 and 2011, reveals that many use two or more datasets for performance evaluation. The following gives a summary of this review.

1) 69.1% of the reviewed methods select the MIT+CMU as one of the test sets.
2) 26.2% select other face detection databases, such as BioID [9], [10], UCD [45], and Champion [14].
3) 40.5% select face recognition databases, including FERET [9], [17], [21], [41], Yale [19], [44], BANCA [50], CBCL [19], [51], and PIE [25], [44], [52]. Note that these databases were selected for offering data with large variations in illumination and pose.
4) 9.5% also use personal collections of the authors.

The aforementioned databases can be categorized into the following two types.

1) Type 1: The datasets are composed of images collected from various sources, e.g., the internet and personal albums. The variables often include illumination, size, pose, expression, occlusion, age, race, and background. Well-known ones include the MIT+CMU [2]–[4] and the BioID dataset [53]. The former has 130 gray-scaled images with 507 faces, and the latter has 1521 gray-scaled images with one face per image. One of the latest collections, the FDDB dataset [12], has 2845 color images with 5171 faces.
2) Type 2: These are meant for benchmarking face recognition instead of face detection. They offer faces with many variables, and each variable covers a wide scope of variation. For example, the PIE database offers 68 individuals with 13 poses, 43 illumination conditions, and 3 expressions. Just the poses and illumination conditions alone give 559 different combinations for one individual’s face, more than the overall 507 faces offered by Rowley’s dataset.

TABLE I
Benchmark Databases Segmented Into Different Scopes of Variation in Pose and Illumination

Rowley’s dataset (a MIT+CMU subset), 507 faces
  Pose: Frontal 438 (86.4%); <45° left-sided 21 (4.1%); <45° right-sided 19 (3.7%); Chin-up 14 (2.8%); Chin-down 11 (2.2%); Other 4 (0.8%)
  Illumination: Uniform 355 (70.0%); Right lit 50 (9.8%); Left lit 72 (14.2%); Top lit 19 (3.8%); Bottom lit 11 (2.2%)

Schneiderman’s dataset (a MIT+CMU subset), 411 faces
  Pose: <30° to either side 38 (9.3%); 30°–60° left-sided 25 (6.0%); 30°–60° right-sided 47 (11.5%); 60°–100° left-sided 50 (12.0%); 60°–100° right-sided 219 (53.2%); Chin-up to either side 16 (4.0%); Chin-down to either side 15 (3.7%); Other 1 (0.3%)
  Illumination: Uniform 261 (63.5%); Right lit 63 (15.3%); Left lit 40 (9.7%); Top lit 32 (7.8%); Bottom lit 4 (1.0%); Other 11 (2.7%)

FDDB dataset, 5171 faces
  Pose: Frontal 2028 (39.22%); <30° to either side 964 (18.64%); 30°–60° left-sided 279 (5.40%); 30°–60° right-sided 254 (4.91%); 60°–100° left-sided 481 (9.30%); 60°–100° right-sided 475 (9.19%); Chin-up to either side 329 (6.36%); Chin-down to either side 152 (2.94%); Other 209 (4.04%)
  Illumination: Uniform 3075 (59.5%); Right lit 455 (8.8%); Left lit 783 (15.2%); Top lit 378 (7.3%); Bottom lit 68 (1.3%); Partially shadowed 152 (2.9%); Other 260 (5.0%)

The images in a Type 1 database are mostly taken by consumer cameras; the variables are affected by general preferences, for example, nearly frontal pose and balanced illumination. To diversify the contents, many Type 1 datasets also include challenging samples characterized by variables with special variations, for example, strong shadow, large rotation, partial occlusion, and cluttered backgrounds. However, the percentage of the samples with such special variations is often far less than that with general preferences. We segmented the MIT+CMU and FDDB databases into subsets with similar variations in illumination and pose in each subset. The segmentation result is shown in Table I, where the MIT+CMU database is split into Rowley’s and Schneiderman’s subsets [3], [4]. Although it is claimed to be appropriate for evaluating frontal face detection, Rowley’s dataset has 69 samples, i.e., 13.6% of the whole set, with nonfrontal poses. Considering illumination conditions, 355 faces (70%) are with nearly uniform illumination versus 152 (30%) with nonuniform illumination. Furthermore, it is not shown in the table that 474 (93.5%) are with neutral or smiling expressions versus 33 (6.5%) with other expressions. Schneiderman’s dataset has 411 faces in 208 images and is generally considered a benchmark for the detection of profile faces. The partition in Table I shows that 65.2% of its samples are nearly profile with a panned angle between 60° and 100°, but 9.3% are closer to frontal with a panned angle < 30°. Besides, as many as 264 samples (64.2%) are with uniform illumination, but merely 15.3% are lit on the right, 9.7% on the left, and 3.0% are with other conditions. The recently released FDDB database [12] offers a large collection with a wide range of difficulties, including occlusion and large poses. However, the challenging samples are significantly outnumbered by the regular ones: 47.8% are with nearly frontal pose, i.e., < 30°, and 73.6% are with nearly balanced illumination.

The UCD database [54] is a Type 1, and it is full of challenging samples, without a good portion of regular samples. Very few use it for performance evaluation [45]. Its difficulty seems to deter many from attempting, not to mention reporting, performance. A better solution seems to be to split challenging samples from regular ones, and report performance separately. If we can define challenges by variables of different variation scopes, we can better evaluate an algorithm with data of different difficulty levels.

Type 2 databases are primarily used for the evaluation of face recognition. The faces in these databases are similar in size, appear at one or a few particular locations in an image, and come with one face per image. These conditions make them inappropriate for the evaluation of face detection. However, a few select a Type 2 as one of their test sets, because it offers samples with variables of specific variation scopes, for example, a 30° viewpoint with illumination from the top.

The proposed framework takes advantage of the differences between both types. It selects a Type 2 database as a mother database for offering a wide range of well-organized intrinsic variables, and generates daughter datasets, which are similar to Type 1 but with variables of controllable variation scopes.

III. From Mother Database to Daughter Datasets

The variables that affect the performance of a face detection algorithm can be split into two categories: 1) intrinsic and 2) extrinsic. The intrinsic variables are those that can alter the appearance of a face, for example, pose, illumination, gender, age, expression, and accessories such as glasses, caps, and masks. The extrinsic variables are those that determine how faces appear in an image, for example, the size, number, and spatial distribution of the faces across a background. The proposed framework is designed with two phases in it: 1) Intrinsic Parameterization (IP) and 2) Extrinsic Parameterization (EP). The former parametrizes the intrinsic variables and the latter parametrizes the extrinsic variables.
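The two categories can be thought of as two parameter records attached to every generated sample. The sketch below is only one possible way to write that split down in code; the field names and value encodings are assumptions made for illustration, although the gender and expression codes mirror the values used later in Section III-A4.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class IntrinsicParams:
    """Variables that alter the appearance of a single face (illustrative field names)."""
    pitch_deg: float = 0.0        # phi: chin-up/down angle
    yaw_deg: float = 0.0          # theta: left/right rotation
    illumination_id: int = 3      # index of a clustered illumination pattern (Section III-A2)
    orientation_deg: float = 0.0  # in-plane rotation
    gender: int = 1               # 1 = male, 0 = female (as in Section III-A4)
    expression: int = 0           # 1 = smile, 0 = neutral, -1 = blinking
    occlusion: float = 0.0        # fraction of the face that is covered

@dataclass
class ExtrinsicParams:
    """Variables that control how faces are laid out on a background image."""
    num_faces: int = 10                         # Nf
    size_range: Tuple[int, int] = (24, 120)     # [smin, smax] in pixels
    size_density: str = "gaussian"              # SDD: "uniform", "gaussian", ...
    blur_sigma: float = 0.0                     # global resolution control
    backgrounds: List[str] = field(default_factory=list)
```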

In the IP phase, we collect faces from a mother database and parametrize the faces with intrinsic parameters, constituting a parametrized face dataset (PFD). When a performance evaluation criterion is given, the EP phase first transforms the criterion to the required intrinsic and extrinsic parameters. The intrinsic parameters are used to select or synthesize the required faces from the PFD. The extrinsic parameters determine the number of faces, their size range, and how they are distributed over a set of cluttered backgrounds.

A thorough discussion on the appropriateness and advantages of the proposed framework is given at the end of this section, in Section III-C. It highlights the issues of the performance gap between human vision and an artificial face detector, and how to reduce this gap using the framework.

A. IP

Ideal candidates for mother databases must include many variables, and each variable must cover a wide range of variation. Poses and illumination conditions are often considered to be the two major challenging variables for face detection. If the numbers of individuals and expressions are also taken into account, the CMU PIE [52], Multi-PIE [55], and CAS-PEAL [56] can be good candidates. The PIE database is taken as a sample mother database for presenting the proposed framework and our experimental study. The procedure can be repeated for other mother databases.

The IP phase is composed of the following four modules:
1) Face segmentation and pose parametrization.
2) Illumination clustering with DCT decomposition.
3) Scope expansion using the illumination cone [57].
4) Parametrization of other intrinsic parameters.

1) Face Segmentation and Pose Parametrization: The faces in the mother database are first cropped along the facial contour of each face using an automatic segmentation method that combines a state-of-the-art cross-pose face detector [58] and the iterated graph cuts (IGC) [59]. The former is for locating the face in an image (each image in the mother database has one face only), and the latter is for cropping the face. The cross-pose face detector [58] applies a mixture of trees with a shared pool of features extracted at fiducial points to capture the cross-pose topological changes of a face. The tree-structured models are shown to be effective in detecting faces of various poses using a considerably small training set, compared to many previous approaches. Because of the satisfactory performance in locating eyes, mouths, and facial contours in our tests using the author-supplied source codes, the face detector is exploited to segment the foreground, i.e., the face, from the background. The identified foreground and background regions can automatically initialize the IGC, which otherwise requires manual annotation at initialization.

The IGC [59] is an iterative version of the graph cuts [60], which formulates the segmentation problem as an energy minimization via trimaps and probabilistic color models and solves it by optimization. At initialization, one only has to specify the background in the trimap using a bounding box. The region outside of the bounding box is taken as background, the region inside is taken as the foreground, and the boundary region is taken as an intermediate region. Assuming Gaussian mixture models (GMMs) for the foreground and the background, one can classify each pixel in the boundary region by maximizing the likelihood of the GMMs and update the parameters in the GMMs using the classified pixels. The segmentation is then obtained by minimizing the energy computed on the likelihood and a coherence term. This process can go on until the segmentation converges, giving a more precise and smooth result than that of graph cuts. We replace the manual initialization by the cross-pose face detector [58], and make the face segmentation automatic. The segmentation accuracy can be further improved considering that each subject in the mother database is actually aligned across different illumination conditions when the pose is fixed. Therefore, one can run the aforementioned algorithm across multiple images and take the majority of votes to determine the best segmented facial region.
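The automatic initialization described above can be reproduced with off-the-shelf components. The sketch below is a simplified stand-in, not the authors' implementation: OpenCV's Haar cascade replaces the cross-pose detector of [58], and OpenCV's grabCut, which is an iterated-graph-cuts segmentation, replaces the IGC of [59]. Running the same routine across the aligned images of one subject and voting per pixel would give the majority-vote refinement mentioned above.

```python
import cv2
import numpy as np

def segment_face(image_bgr, iterations=5):
    """Crop a face along its contour: detector bounding box -> iterated graph cuts."""
    # Stand-in for the cross-pose detector of [58]: OpenCV's frontal Haar cascade.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        return None
    x, y, w, h = boxes[0]                      # one face per mother-database image

    # GrabCut: outside the box = background, inside = probable foreground;
    # the foreground/background GMMs are re-estimated at every iteration.
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, (x, y, w, h), bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_RECT)

    # Keep definite and probable foreground as the segmented facial region.
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
    return image_bgr * fg[:, :, None]
```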

Fig. 1. Variation of the cost JL in (1) with different numbers of clusters, with λc = 8 and 4 DCT components.

Fig. 2. Nine illumination patterns, I1–I4 on the top and I5–I9 on the bottom, extracted from 68 subjects with frontal pose and 43 illumination conditions.

The faces in the mother database are all associated with nominal poses. Take PIE as a sample, which offers 13 poses approximately 22.5° apart [52]. Denote the frontal pose as (φ0 = 0, θ0 = 0), or simply (0, 0), where φ is the pitch angle and θ is the yaw angle. Those in horizontal rotations are parametrized into (0, θj), where j = ±1, ±2, ±3, ±4 denote θj = ±22.5°, ±45°, ±67.5°, and ±90°, respectively. Those with 22.5° chin-up and chin-down from (0, 0) are denoted as (φ1 = 22.5°, 0) and (φ−1 = −22.5°, 0), respectively. The mixed ones, c25 and c31, are parametrized as (φ−1 = −22.5°, θ−3 = −67.5°) and (φ−1 = −22.5°, θ3 = 67.5°), respectively. To normalize the size, those with (φi, 0) can be normalized to Dh, the horizontal baseline between both eyes; those with (0, θj) can be normalized to Dv, the vertical baseline between the eyes and mouth. c25 and c31 are normalized with reduced Dh and Dv. As Dh is zero at profile, Dv alone allows the profile to align with other poses. This step leads to a set of segmented and normalized faces parametrized in specific poses.

The parametrized poses are limited by the poses available from the mother database. Taking the aforementioned PIE mother database as an example, only 13 pairs of (φi, θj) are available. Although it is not a central concern of this paper to generate poses that are different from those available from the mother database, it is desired to have more poses to offer. This issue by nature is similar to the generation of additional illumination conditions, because both require some reconstruction of a 3-D face from 2-D images. The generation of additional poses and illumination conditions can be solved by the illumination cone, proposed by Georghiades et al. [57] and briefly presented in Section III-A3, which can reconstruct the 3-D face given three aligned facial images of the same pose but different illumination.

2) Illumination Clustering and Parametrization: Different mother databases offer different illumination patterns. The extended Yale database [57], [61] has 64 illumination conditions on nine poses. PIE offers 43 illumination conditions with 13 poses. While it is possible to come up with 43 features to characterize the 43 illumination conditions for each pose, we reduce this number by clustering those with similar illumination patterns.

Given a set of faces with a normalized pose (φi∗, θj∗), the issue here is to cluster the 43 illumination patterns given by 2924 (= 43 × 68) face samples. Assuming that the illumination pattern shown on each face can be approximated to some extent by its low spatial frequency components, the discrete cosine transform (DCT) can be applied to extract the approximated illumination pattern from each aligned face sample by inverting its DCT coefficients of low frequencies. A similar application of the DCT can be found in [62] for illumination normalization. To solve this clustering problem, one must determine the number of clusters to decompose the 43 illumination patterns cast on 68 different faces. Our experiments show that a good scheme to determine the cluster number and simultaneously cluster the 2924 face samples is to minimize the following cost function JL:

    J_L(\{\tilde{x}_i\}, N_c, \lambda_c) = \frac{\sum_{j=1}^{N_c} \sum_{\tilde{x}_i \in C_j} \| \tilde{x}_i - m_j \|}{\sum_{j=1}^{N_c} n_j \| m_j - m_0 \|} + \lambda_c \log(N_c)    (1)

where x̃i is the DCT-based illumination feature vector extracted from the face xi; Nc is the number of clusters; nj is the size of the cluster Cj, i.e., the number of x̃i’s clustered to Cj; mj is the sample mean of Cj; and m0 is the mean of the overall {x̃i}. λc is a weight that controls the penalty on a large Nc. The first part of (1) measures the ratio of the scattering of the illumination patterns within the same clusters to the scattering of the illumination patterns across different clusters. When this ratio is minimized, the within-cluster scatter of the illumination patterns is at a relative minimum, and simultaneously the between-cluster scatter is at a relative maximum. Because a large Nc can reduce this ratio substantially, the second part of (1) penalizes the model complexity induced by a large Nc with the weight λc.

Assuming that the illumination patterns extracted from a set of faces with pose (φi∗, θj∗) form a GMM, expectation maximization can be applied to cluster the extracted illumination samples. To extract illumination from a facial image, the DCT coefficients of low frequencies are selected for clustering, and each cluster is checked for whether it collects samples of similar illumination. We found that the DCT coefficients chosen between the second and the ninth lowest frequencies from a face 64×64 in size preserve most of the illumination. Given the illumination patterns extracted from the 68 subjects with frontal poses, (φi∗ = 0, θj∗ = 0), and 43 illumination conditions, Fig. 1 shows the cost JL varying with Nc, where λc = 8. The minimum occurs at Nc = 9, and the means of the nine clustered illumination patterns are shown in Fig. 2.

The illumination clustering can be repeated for all poses; thus, each face can be parametrized in [φi, θj, Ik]. For the sample PIE mother database, i ∈ {−1, 0, 1}, j ∈ {0, ±1, ±2, ±3, ±4}, and k ∈ {1, 2, . . . , 9}.

Fig. 3. Three faces on the left are directly available from the PIE mother database; the right five faces are synthesized by an illumination cone.

3) Sample Add-On Using an Illumination Cone: Several approaches, for example, the illumination cone [57] and the 3-D morphable model [63], can use a few sample faces to generate faces with different poses and illumination conditions. The illumination cone [57] is selected for its relatively inexpensive storage and computation, and because its requirement can be readily satisfied by the samples from the mother database. The illumination cone exploits the fact that the set of facial images with the same pose, but taken under different illumination conditions, is a convex cone in the image space. Using a few training images of each face taken under different illumination conditions, the shape and albedo of the face can be reconstructed using the generalized bas-relief (GBR) transformation. This reconstruction leads to a generative model that is able to synthesize the images of the face under novel poses and illumination conditions. The requirement on training images of the same pose but different illumination can be readily met by the samples from the PIE mother database.

Given k aligned images of a face from the mother database, denoted as x1, x2, . . . , xk with xi ∈ R^n, one can form a matrix X = [x1, x2, . . . , xk] with each column formed by one facial image. With the Lambertian assumption, the illumination cone algorithm recursively searches for two matrices, B ∈ R^{n×3} and S ∈ R^{3×k}, that minimize the difference between X and B · S, that is

    (B^*, S^*) = \arg\min_{B, S} \| X - B \cdot S \|    (2)

where B = [b1 b2 · · · bn]^T. Each row of B is the product of the albedo with the unit normal at a point on the face projecting to a particular pixel in the image, that is

    b_j^T = \alpha_j \, \frac{[\, z_x(j) \;\; z_y(j) \;\; -1 \,]}{\sqrt{z_x^2(j) + z_y^2(j) + 1}}, \quad j = 1, \ldots, n    (3)

where αj is the albedo at the point of the facial surface projecting to the jth pixel, and zx(j) and zy(j) are the gradients of the facial surface along the horizontal and vertical directions, respectively. Equation (3) can be written into the following form for better interpretation, with the pixel at (x, y) on the 2-D facial image

    b^T(x, y) = \alpha(x, y) \, \frac{[\, z_x(x, y) \;\; z_y(x, y) \;\; -1 \,]}{\sqrt{z_x^2(x, y) + z_y^2(x, y) + 1}}    (4)

where z(x, y) is the objective facial surface to estimate, and zx(x, y) and zy(x, y) are its directional derivatives. In (2), S = [s1, s2, . . . , sk], where each si ∈ R^3 represents the ith light source at infinity with intensity ||si|| and direction si/||si||.

To evaluate z(x, y), the algorithm first obtains S0, an initial estimate of S, by performing SVD on X and extracting the three primary basis vectors that best span the row space of X. The iterative part of the algorithm starts with the evaluation of B0, an initial estimate of B, using (2) with X and S0. An initial estimate of the albedo α0(x, y) can be given by taking the average of x1, . . . , xk; the gradient fields zx(x, y) and zy(x, y) can then be evaluated using (4). To overcome a possible integrability issue with zx(x, y) and zy(x, y) obtained this way, z(x, y) is assumed to be approximable by the DCT, that is

    z(x, y) = \sum_{w} c(w) \, \psi(x, y; w)    (5)

where w defines a local window over which the sum is performed, and {c(w)} are the DCT coefficients. Because the basis set {ψ(x, y; w)} is integrable, the gradient fields zx(x, y) and zy(x, y) approximated by ψx(x, y; w) and ψy(x, y; w), respectively, are also integrable. Given ψx(x, y; w), ψy(x, y; w), and the estimated S, the albedo α(x, y) can be updated by least-squares minimization, and then B can be recomputed using (4). The direction and magnitude of each light source si are then updated independently using the recomputed B. The aforementioned steps of updating S, then α(x, y), and then B go on until these estimates converge. Because ψx(x, y; w) and ψy(x, y; w) are made integrable using the DCT approximation, the GBR surface z̄(x, y) can be obtained by taking an inverse DCT on the final coefficients {c̄(w)}.

Fig. 3 shows five synthesized cases on the right, different from the three on the left directly taken from the PIE database.
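A minimal sketch of the alternating least-squares idea behind (2) follows, assuming the Lambertian model only; the GBR/integrability refinement of (4) and (5), the independent per-source updates, and attached-shadow handling are all omitted, so this illustrates the factorization rather than the full algorithm of [57].

```python
import numpy as np

def fit_illumination_cone(X, iters=10):
    """Alternate between light sources S (3 x k) and surface/albedo matrix B (n x 3)
    so that X ~ B @ S, as in (2). X is n x k: one column per aligned image."""
    X = np.asarray(X, np.float64)
    # Initial S0: three principal directions of the row space of X, via SVD.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    S = np.diag(s[:3]) @ Vt[:3]                    # 3 x k
    for _ in range(iters):
        B = X @ np.linalg.pinv(S)                  # least-squares update of B (n x 3)
        S = np.linalg.pinv(B) @ X                  # least-squares update of S (3 x k)
    return B, S

def relight(B, direction, intensity=1.0):
    """Render the face under a novel distant light source with the Lambertian model."""
    s = intensity * np.asarray(direction, np.float64)
    return np.clip(B @ s, 0, None)                 # max(b_j . s, 0): shadows clipped to zero
```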

Fig. 4. Low-resolution samples made by applying scale space with the Gaussian kernel [65].

4) Parametrization of Other Intrinsic Variables: Although pose and illumination are the two major intrinsic variables, the importance of other intrinsic variables cannot be overemphasized. For the diversity of the daughter datasets, we also consider expression, orientation (in-plane rotation), gender, and accessories such as eyeglasses and hats. Some of these variables are easy to parametrize, but some are constrained by the mother database, as stated in the following.

1) Limited by the three expressions offered by the sample PIE mother database, the expression parameters in the PFD can only be given 1 for smile, 0 for neutral, and −1 for blinking.
2) Gender can take 1 for male and 0 for female.
3) Orientation, i.e., in-plane rotation, can take −180° to 180°, if the upright pose is considered as 0° and clockwise turns as positive.
4) Partial occlusion is handled according to two different types: the first is mutual occlusion, measured by the percentage of a face occluded by another face; and the second is caused by accessories, such as sunglasses, masks, and hats. A few accessory templates, collected from various sources, including the AR database [64], are made available as an add-on to the faces in the PFD.

Although the generation of different expressions is beyond the scope of this paper, it highlights the fact that the intrinsic parameters available from the proposed framework are constrained by what the mother database can offer. On the other hand, the framework would be able to offer intrinsic parameters with wider scopes of variation if the techniques for generating novel appearances become available.

In summary, the sample mother database, PIE, contributes 41 368 facial images with different poses, illumination conditions, expressions, and genders. Given a few facial images aligned to the same pose but taken under different illumination, the illumination cone [57] is applied to generate faces with novel poses and illumination conditions, extending the scopes of the intrinsic parameters directly available from the mother database.

To this end, the IP phase leads to a PFD with each face parametrized in p = [φi, θj, ik, ol, g, em, cn], where φi and θj specify pose-(i, j), ik stands for illumination-k, ol for orientation-l, g for gender, em for expression-m, and cn for occlusion. Given a specified p∗, a subset of faces can be selected or generated from the PFD.

B. EP

This phase handles how faces of different intrinsic and extrinsic parameters appear across a set of cluttered background images. The extrinsic parameters include the size density distribution (SDD), the total number of faces needed, the overlap between faces, the spatial distribution of the faces, and the resolution. The settings of each extrinsic parameter are given in the following.

1) The SDD is given by the number of faces at each different size over the size range [smin, smax]. It is then normalized by the total number of faces to make it a valid distribution. It can be assigned as uniform, single or multiple Gaussian, or other patterns for different scenarios. It can also be defined on disjoint segments within [smin, smax].
2) Nf is the total number of faces needed. Multiplication with the SDD gives the exact number of faces at each different size.
3) The aforementioned parameters can render a daughter dataset if the faces are merged into a set of cluttered background images. The faces can be distributed arbitrarily or according to a preselected distribution. However, for certain applications, the 2-D spatial distribution of the faces must reflect the distribution of 3-D faces in a 3-D scene. With a chosen focal length, we can specify the number of faces located at different distances from the camera. Perspective modeling with a normal head model for each face is used here to obtain its projection onto a 2-D image.
4) R is the resolution, or more precisely the spatial resolution, which determines how closely lines can be resolved in an image. Low-resolution images pose challenges to face detection. Because the scale-space framework has been successfully applied to capturing the resolution variation caused by different scales [65], we apply the scale space with a Gaussian kernel to make low-resolution images. This technique is applied to the whole image, instead of on the faces only, to reflect the fact that resolution is a global extrinsic parameter. For demonstration purposes, Fig. 4 shows a sample face with resolution reduced by applying the Gaussian kernel multiple times.

Fig. 5. Artificial faces in the MIT+CMU database, which are easy for humans to detect, and an ideal face detector is expected to do the same.

Fig. 6. Performance comparison on Rowley’s and FDDB datasets and their synthesized counterparts.

C. Application Scope of the Framework

A few artificial faces were collected in the MIT+CMU database, including hand-drawn ones and a few from poker cards, cartoons, masks, and advertisements, as shown in Fig. 5. Our eyes can easily discern faces and even face-like objects from others, no matter whether they are real or artificial ones, and no matter whether they have body parts or even head contours. An ideal face detector or man-made human-like vision is expected to do the same. This highlights the following properties about the application scope of the proposed framework.

1) Best for evaluating scanning-window-based face detectors: Almost all face detectors use sliding windows of various sizes scanning through a given image, and determine whether the region enclosed by each window is a face or nonface. Because face detection algorithms do not consider backgrounds and other body parts, the daughter datasets generated by the proposed framework serve well for benchmarking such scanning-window-based algorithms.
2) Offering a suboptimal test set with synthesized data: Given a criterion, the optimal test sets are presumed to be the real data collected in the field that meet the requirements. However, such optimal real test sets can be extremely difficult to obtain for some specific test criteria; for example, pose between 45° and profile 90° with right-sided illumination. While it is believed that a benchmark on real data reveals the performance one can expect in the field, the collection of the right real data must be ensured. The daughter datasets generated by the framework offer a suboptimal solution to this difficult-to-collect issue.
3) Offering a novel perspective to benchmark face detection: Although the daughter datasets are not made to contain hand-drawn or cartoon faces like those available in the MIT+CMU dataset, such faces can be easily included to show some extensions of intrinsic and extrinsic parameters that are rarely seen in real life. However, is face detection expected to work for real-life samples only? Because human vision can tell faces from others regardless of whether the faces are real or artificial, there is no reason to ask an ideal face detector not to do the same. As artificial images are made for various purposes in this era of multimedia, and face detection in artificial images is important for content-based image retrieval and categorization, the proposed framework can be the first one that is able to offer a good scope of synthesized images for benchmarking the desired face detectors.
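Before moving to the experiments, the sketch below illustrates the EP composition step of Section III-B under simplifying assumptions: faces are pasted by random placement only (no perspective model and no overlap control), sizes follow a truncated Gaussian SDD, and resolution is controlled by a global Gaussian blur. The helper and its parameters are hypothetical; the point is that the ground-truth boxes come for free when the image is generated.

```python
import numpy as np
import cv2

def compose_daughter_image(background, faces, num_faces, size_range=(24, 96),
                           blur_sigma=0.0, seed=0):
    """Paste segmented faces onto a cluttered background; return the image and
    its ground-truth boxes. Assumes the background is larger than smax."""
    rng = np.random.default_rng(seed)
    canvas = background.copy()
    h, w = canvas.shape[:2]
    smin, smax = size_range
    boxes = []
    for _ in range(num_faces):
        face = faces[rng.integers(len(faces))]
        # Truncated Gaussian size density distribution (SDD).
        size = int(np.clip(rng.normal((smin + smax) / 2, (smax - smin) / 6), smin, smax))
        patch = cv2.resize(face, (size, size))
        x = int(rng.integers(0, max(w - size, 1)))
        y = int(rng.integers(0, max(h - size, 1)))
        mask = patch.sum(axis=2) > 0              # segmented faces: black = background
        roi = canvas[y:y + size, x:x + size]
        roi[mask] = patch[mask]
        boxes.append((x, y, size, size))          # ground truth recorded at generation time
    if blur_sigma > 0:                            # global resolution control (scale space)
        canvas = cv2.GaussianBlur(canvas, (0, 0), blur_sigma)
    return canvas, boxes
```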

Fig. 7. Original images are labeled with numbers, and those with faces replaced are labeled with the same numbers followed by an “r.” (a) Rowley’s (a MIT+CMU subset). (b) FDDB.

IV. Experimental Validation

Two protocols were designed to validate the proposed framework. The first one was designed to justify the validity of the daughter datasets synthesized by the proposed framework, and the second was designed to show how to use the framework to define the performance specifications of a face detector. In the first protocol, we compared the performance of a few state-of-the-art face detectors tested on two benchmark datasets, namely the most popular Rowley’s dataset [3] and the recently released FDDB [12], against that tested on their synthesized counterparts. The latter were made from the former with faces replaced by samples with similar intrinsic parameters from the PFD. In the second protocol, the three face detectors were tested on several daughter datasets with different intrinsic and extrinsic parameters so that their performance specifications and limitations would be clearly drawn. The datasets used in both protocols are available at https://sites.google.com/site/facedetectiondatasets/home.

The selected face detectors include the latest one by Zhu and Ramanan [58], one by Kalal et al. [66], and one advanced from Viola and Jones [67], which are, respectively, called Zhu’s, Kalal’s, and Viola’s face detectors in this paper. Zhu’s and Kalal’s detectors are available from the resources provided by the authors, and Viola’s detector is available from the OpenCV open source library [68].

A. Performance Comparison on Real and Synthesized Datasets

The samples from the original databases are called real data, and the real data with faces replaced by similar ones from the PFD are called synthesized data in the following. To replace the faces in the FDDB, the scheme presented in Section III-A1 was applied, which not only segments faces but also estimates their poses and local features. From the samples with the same poses in the PFD, those with similar illumination, judged by low-frequency DCT components, were selected to replace the segmented faces. Out of the total 5171 faces, 4808 or 93% were replaced, and 363 or 7% were not replaceable due to extreme poses, expressions, occlusion, and sizes less than 20×20. Because the resolution of many samples in Rowley’s dataset was too low to be processed by Zhu’s detector, we replaced it with Viola’s face detector and kept the segmentation scheme the same. Out of the total 507 faces in Rowley’s dataset, 478 or 94.3% were replaced by similar ones from the PFD and 29 were not replaceable.

Fig. 6 shows the detection rates of the three detectors on the real and synthesized data with increasing false-positive rates (FPRs). The detection rates on the synthesized data are generally better than those on the real data, but the gaps in between are all within 5% for FPR > 0.03, and decrease to within 3% for FPR > 0.08. Such small gaps demonstrate the similarity of the synthesized data to the real data. To better understand the causes of the performance gaps in Fig. 6, one needs to examine the results on corresponding samples across the real and synthesized datasets. The following observations are from the tests using Viola’s detector, and similar observations can also be obtained from the other two detectors. Fig. 7 shows a few samples before and after the replacement, along with the detection results. Each original image is tagged by a number, and the one with faces replaced is tagged by the same number but with an “r” at the end. Except for a few cases where the real face fails to be detected but its synthesized counterpart can be detected, e.g., samples 1 and 1r in Rowley’s set in Fig. 7, most faces detected in the real data remain detected when they are replaced in the synthesized data, and those missed in the real data mostly remain missed in the synthesized data. Some false positives without overlap with faces also remain after the replacement, e.g., samples 2 and 2r in Rowley’s set.
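A sketch of how a detection rate and a false-positive count can be read off the generated ground truth with the usual intersection-over-union matching is given below; the 0.5 threshold and the greedy matching are assumptions of this illustration, not necessarily the paper's exact protocol.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def detection_rate(gt_boxes, det_boxes, iou_thresh=0.5):
    """Fraction of ground-truth faces matched by a detection, plus the false-positive count."""
    matched = set()
    false_pos = 0
    for d in det_boxes:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gt_boxes):
            if j not in matched and iou(d, g) > best_iou:
                best_j, best_iou = j, iou(d, g)
        if best_iou >= iou_thresh:
            matched.add(best_j)
        else:
            false_pos += 1
    return len(matched) / max(len(gt_boxes), 1), false_pos
```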

Fig. 8. Cases showing that shadows and accessories, e.g., hats, would result in different performance (columns a, c, and d), and that false positives overlapped with faces may be removed in the synthesized data (column b). The top row is real and the rest are synthesized.

Fig. 9. Typical samples from each daughter dataset generated for performance evaluation against a specific intrinsic parameter. (a) Pose. (b) Illumination-frontal. (c) Illumination-downward. (d) Occlusion. (e) Orientation. (f) Perspective model for the layout in (b).

Although the overall performance on both the real and synthesized data seems similar, some differences caught our attention. Several special cases are revealed in Fig. 8. Column a shows a case in which the real face shadowed by the hat fails to be detected (a1); it becomes detected when replaced by a face without shadow (a2), and then fails again when the synthesized face is cast with a shadow. Column b shows that a false positive overlapped with the real face is removed when the face is replaced (b1 and b2). A face behind the main one in b1 fails to be detected because of partial occlusion, and remains undetected after the replacement by faces with different fractions of occlusion in b2 and b3. Column c is similar to column a with hat-cast shadows; however, c2 and c3 show that the synthesized face can be detected even with the eyeglasses, hat, and hat-cast shadow added in, possibly because the synthesized shadow may not be strong enough or the shadowed pattern was collected in the training set. Column d shows a case in which a real face with eyeglasses and a hat-cast shadow fails to be detected, but can be detected when replaced by a face without glasses and with a weakened shadow. The cases in Fig. 8 reveal that the proposed framework allows a variety of conditions and additional parameters to be included in the daughter datasets to expand the application scopes.
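A minimal sketch of the replacement behind Protocol 1 follows: the face inside a ground-truth box is swapped with the PFD sample whose intrinsic parameters are closest to the target ones. The nearest-neighbor matching, the simple resize-and-paste blending, and all names here are illustrative stand-ins for the pose and low-frequency-DCT matching described above.

```python
import numpy as np
import cv2

def replace_face(image, face_box, pfd_faces, pfd_params, target_params):
    """Swap the face inside face_box with the PFD sample whose intrinsic-parameter
    vector is closest to target_params (a plain numeric vector in this sketch)."""
    x, y, w, h = face_box
    dists = [np.linalg.norm(np.asarray(p, float) - np.asarray(target_params, float))
             for p in pfd_params]
    donor = pfd_faces[int(np.argmin(dists))]      # closest PFD face (pose, illumination, ...)
    patch = cv2.resize(donor, (w, h))
    mask = patch.sum(axis=2) > 0                  # segmented donor: black = background
    out = image.copy()
    roi = out[y:y + h, x:x + w]
    roi[mask] = patch[mask]
    return out
```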

Fig. 10. Specifications of three face detectors, Zhu’s (green), Viola’s (red), and Kalal’s (blue), in terms of the detection rates on five daughter sets, each
with an intrinsic parameter as the only variable. (a) Pose. (b) Illumination-frontal. (c) Illumination-downward. (d) Occlusion. (e) Orientation.

B. Daughter Sets for Performance Specification

To determine the performance specifications of the three selected face detectors, a few daughter datasets were designed with the following intrinsic parameters (the labels, such as c02, c07, and c27, follow those used in the PIE database).

1) Pose set: Eight subsets, each with one or a pair of specific poses, namely (in the PIE pose tags [52]), c27 (frontal, 0°), c05/c29 (22.5° to the right and left), c07 (22.5° up), c09 (22.5° down), c11/c37 (45° to both sides), c02/c14 (67.5° to both sides), c25/c31 (67.5° to both sides and 22.5° down), and c22/c34 (90° to both sides). The remaining intrinsic parameters were the same for each subset: illumination was I3 (shown in Fig. 2) for its balance across the face, in-plane orientation was less than 10°, and there was no occlusion.
2) Illumination frontal set: Six subsets with six illumination patterns, namely, I3, I5, I6, I7, I8, and I9, and all with frontal pose (c27). The rest of the intrinsic parameters were the same as the above for the pose set.
3) Illumination downward set: Six subsets with the same looking-downward pose but viewed from different yaw angles, including c09 (22.5° down), c25 (22.5° down and 67.5° to the right), and c31 (22.5° down and 67.5° to the left), and each with a different illumination pattern as for the illumination frontal set. The rest of the intrinsic parameters were the same as in the previous cases.
4) Occlusion set: Four subsets with some percentage of a face covered by another face; the percentage was made to be 5%, 10%, 20%, and 30%. All were with pose c27, illumination I3, and orientation less than 10°.
5) Orientation set: Six subsets with different degrees of in-plane rotation, namely ±10°, ±20°, and ±30° clockwise and counterclockwise. All were with pose c27, illumination I3, and without occlusion.

Both random pasting and perspective modeling were used to generate the aforementioned sets. The number of faces in each set varied from 192 to 400, and the size of each face was within 0.1–0.3 of the height of the background image. The size density distribution (SDD) was assumed to be a bandlimited Gaussian. Fig. 9 shows typical samples from each set, and the perspective model in Fig. 9(f) is the one considered for the layout in Fig. 9(b).

The test results on these intrinsic-parameter-specific daughter sets are shown in Fig. 10. Zhu’s detector performs the best in the pose set, with exceptional detection rates on large poses, i.e., 82% on c22 and 92% on c02, far better than the other two. All three handle the nearly frontal poses well, but Kalal’s and Viola’s performances drop sharply when the pose goes beyond c37 (45°). All three perform well on the illumination frontal set, except that Kalal’s detector seems to be unable to detect many faces lit on the right side, although it performs the best for detecting those lit on the left side. Such an asymmetric performance is also observed in the illumination downward set, where both Kalal’s and Zhu’s detectors perform better on one side of illumination than the other. Viola’s shows the most consistent performance across all lighting conditions. As for the capability of handling occlusion, Zhu’s again outperforms the others, as shown in the occlusion set, with a detection rate of 94% on 30% occlusion. However, as shown in the orientation set, Zhu’s detector seems unable to handle orientation well; it degrades substantially when detecting faces of ±10° in orientation. Both Kalal’s and Viola’s detectors show some asymmetric results, with better performance handling orientations of 20°–30° than −20° to −30°, but Zhu’s detector shows a more symmetric performance.

To the best of our knowledge, the charts in Fig. 10 show for the first time that the performance of face detectors can be specified quantitatively with various intrinsic variables and compared from different aspects of performance needs.
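The per-variable charts of Fig. 10 amount to one detection rate per daughter subset. One way such a specification could be tabulated is sketched below; detector and evaluate are caller-supplied functions (for example, the IoU-based routine sketched in Section IV-A), and the subset naming and data layout are illustrative only.

```python
def performance_specification(detector, daughter_sets, evaluate):
    """Detection rate per daughter subset, each subset varying one intrinsic parameter.

    detector(image) -> list of detected boxes
    evaluate(gt_boxes, det_boxes) -> (detection_rate, false_positives)
    daughter_sets: {"pose/c22": [(image, gt_boxes), ...], "illum/I5": [...], ...}
    """
    spec = {}
    for name, samples in daughter_sets.items():
        rates = [evaluate(gt, detector(img))[0] for img, gt in samples]
        spec[name] = sum(rates) / max(len(rates), 1)
    return spec

# Example: performance_specification(my_detector, sets, detection_rate) might yield
# {"pose/c22": 0.82, "pose/c02": 0.92, ...}, the kind of figures plotted in Fig. 10(a).
```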

V. Conclusion

A framework was proposed to generate datasets that are good for benchmarking face detection using a database meant for benchmarking face recognition. The framework was composed of the IP and EP phases. The former categorized the facial samples from the mother database according to intrinsic parameters, and the latter generated the daughter datasets according to desired settings on the intrinsic and extrinsic parameters. Our experiments showed that the daughter datasets can be made similar to popular benchmarks and that the performance specification of a face detector can be defined using daughter datasets with parameters of special settings.

A few questions are yet to be answered and are considered in the continuing phase of this paper. For example, how should we define a minimum number of daughter datasets so that the performance specification of a face detector can be efficiently defined? Should a generic daughter dataset be applied first to guide the settings for more specific daughter datasets applied later when evaluating a face detector? Is a face detector better trained on limited poses and various illumination or vice versa? How does its performance vary with training sets of different variations? These questions and others will be answered in the near future.

References

[1] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang, “Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 319–336, Feb. 2009.
[2] K.-K. Sung and T. Poggio, “Example-based learning for view-based human face detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 1, pp. 39–51, Jan. 1998.
[3] H. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 1, pp. 23–38, Jan. 1998.
[4] H. Schneiderman and T. Kanade, “A statistical method for 3D object detection applied to faces and cars,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 1, Jun. 2000, pp. 746–751.
[5] M.-H. Yang, D. Kriegman, and N. Ahuja, “Detecting faces in images: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 1, pp. 34–58, Jan. 2002.
[6] C. A. Waring and X. Liu, “Face detection using spectral histograms and SVMs,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 35, no. 3, pp. 467–476, Jun. 2005.
[7] S. Kadoury and M. D. Levine, “Face detection in gray scale images using locally linear embeddings,” Comput. Vis. Image Understanding, vol. 105, no. 1, pp. 1–20, 2007.
[8] W.-K. Tsao, A. J. T. Lee, Y.-H. Liu, T.-W. Chang, and H.-H. Lin, “A data mining approach to face detection,” Pattern Recognit., vol. 43, pp. 1039–1049, Sep. 2009.
[9] J. Wu and Z.-H. Zhou, “Efficient face candidates selector for face detection,” Pattern Recognit., vol. 36, no. 5, pp. 1175–1186, 2003.
[10] A. Kouzani, “Locating human faces within images,” Comput. Vis. Image Understanding, vol. 91, pp. 247–279, Sep. 2003.
[11] G.-S. Hsu, T. H. Tran, and S.-L. Chung, “Benchmark face detection using a face recognition database,” in Proc. 17th IEEE Int. Conf. Image Process., Sep. 2010, pp. 3821–3824.
[12] V. Jain and E. Learned-Miller, “FDDB: A benchmark for face detection in unconstrained settings,” Univ. Massachusetts, Amherst, MA, USA, Tech. Rep. UM-CS-2010-009, 2010.
[13] L.-L. Huang and A. Shimizu, “A multi-expert approach for robust face detection,” Pattern Recognit., vol. 39, no. 9, pp. 1695–1703, 2003.
[14] R.-L. Hsu, M. Abdel-Mottaleb, and A. K. Jain, “Face detection in color images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 696–706, May 2002.
[15] O. Ayinde and Y.-H. Yang, “Region-based face detection,” Pattern Recognit., vol. 35, no. 10, pp. 2095–2107, 2002.
[16] L.-L. Huang, A. Shimizu, Y. Hagihara, and H. Kobateke, “Gradient feature extraction for classification-based face detection,” Pattern Recognit., vol. 36, no. 11, pp. 2501–2511, 2003.
[17] C. Liu, “A Bayesian discriminating features method for face detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 6, pp. 725–740, Jun. 2003.
[18] C. Huang, B. Wu, H. Ai, and S. Lao, “Omni-directional face detection based on real adaboost,” in Proc. IEEE Int. Conf. Image Process., vol. 1, Oct. 2004, pp. 593–596.
[19] Y.-Y. Lin and T.-L. Liu, “Robust face detection with multi-class boosting,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 1, Jun. 2005, pp. 680–687.
[20] H. Zhang, W. Gao, X. Chen, S. Shan, and D. Zhao, “Robust multi-view face detection using error correcting output codes,” in Proc. 9th Eur. Conf. Comput. Vis., vol. 3954, 2006, pp. 1–12.
[21] P. Shih and C. Liu, “Face detection using discriminating feature analysis and support vector machine,” Pattern Recognit., vol. 39, no. 2, pp. 260–276, 2006.
[22] D. Nguyen, D. Halupka, P. Aarabi, and A. Sheikholeslami, “Real-time face detection and lip feature extraction using field-programmable gate arrays,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 36, no. 4, pp. 902–912, Aug. 2006.
[23] J. Wu, S. C. Brubaker, M. D. Mullin, and J. M. Rehg, “Fast asymmetric learning for cascade face detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 3, pp. 369–382, Mar. 2008.
[24] R. Wang, J. Chen, S. Shan, and W. Gao, “Enhancing training set for face detection,” in Proc. IEEE 18th Int. Conf. Pattern Recognit., vol. 3, Aug. 2006, pp. 477–480.
[25] J. Meynet, V. Popovici, and J.-P. Thiran, “Face detection with boosted Gaussian features,” Pattern Recognit., vol. 40, pp. 2283–2291, Aug. 2007.
[26] C. Huang, H. Ai, Y. Li, and S. Lao, “High performance rotation invariant multiview face detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 4, pp. 671–686, Apr. 2007.
[27] P. Wang and Q. Ji, “Multi-view face and eye detection using discriminant features,” Comput. Vis. Image Understanding, vol. 105, no. 2, pp. 99–111, 2007.
[28] R. Xiao, H. Zhu, H. Sun, and X. Tang, “Dynamic cascades for face detection,” in Proc. IEEE 11th Int. Conf. Comput. Vis., Oct. 2007, pp. 1–8.
[29] C. Shen, S. Paisitkriangkrai, and J. Zhang, “Face detection from few training examples,” in Proc. IEEE Int. Conf. Image Process., Oct. 2008, pp. 2764–2767.
[30] S. Yan, S. Shan, X. Chen, and W. Gao, “Locally assembled binary (LAB) feature with feature-centric cascade for fast and accurate face detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–7.
[31] M. Toews and T. Arbel, “Detection, localization, and sex classification of faces from arbitrary viewpoints and under occlusions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 9, pp. 1567–1581, Sep. 2009.
[32] K. Hotta, “View independent face detection based on horizontal rectangular features and accuracy improvement using combination kernel of various sizes,” Pattern Recognit., vol. 42, no. 3, pp. 437–444, 2009.
[33] J. Chen, X. Chen, J. Yang, S. Shan, and W. Gao, “Optimization of a training set for more robust face detection,” Pattern Recognit., vol. 42, no. 11, pp. 2828–2840, 2009.
[34] S. Yan, S. Shan, X. Chen, and W. Gao, “FEA-Accu cascade for face detection,” in Proc. 16th IEEE Int. Conf. Image Process., Nov. 2009, pp. 1217–1220.
[35] M.-T. Pham, Y. Gao, V.-D. D. Hoang, and T.-J. Chen, “Fast polygonal integration and its application in extending haar-like features to improve object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 942–949.
[36] V. Subburaman and S. Marcel, “An alternative scanning strategy to detect faces,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Mar. 2010, pp. 2122–2125.
[37] J.-M. Guo and M.-F. Wu, “Pixel-based hierarchical-feature face detection,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Mar. 2010, pp. 1638–1641.
[38] J. Shen, W. Yang, and C. Sun, “Learning discriminative features based on distribution,” in Proc. 20th Int. Conf. Pattern Recognit., Aug. 2010, pp. 1401–1404.
[39] L. Ding and A. Martinez, “Features versus context: An approach for precise and detailed detection and delineation of faces and facial features,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 11, pp. 2022–2038, Nov. 2010.
[40] J. Chen, S. Shan, C. He, G. Zhao, M. Pietikäinen, X. Chen, and W. Gao, “WLD: A robust local image descriptor,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 8, pp. 1705–1720, Sep. 2010.
[41] S. Stein and G. A. Fink, “A new method for combined face detection and identification using interest point descriptors,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit. Workshops, Mar. 2011, pp. 519–524.
[42] S. Paisitkriangkrai, C. Shen, and J. Zhang, “Incremental training of a detector using online sparse eigendecomposition,” IEEE Trans. Image Process., vol. 20, no. 1, pp. 213–226, Jan. 2011.
[43] W. Louis and K. Plataniotis, "Frontal face detection for surveillance purposes using dual local binary patterns features," in Proc. 17th IEEE Int. Conf. Image Process., Sep. 2010, pp. 3809–3812.
[44] P. Phothisane, E. Bigorgne, L. Collot, and L. Prevost, "A robust composite metric for head pose tracking using an accurate face model," in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit. Workshops, Mar. 2011, pp. 694–699.
[45] U. Yang, M. Kang, K.-A. Toh, and K. Sohn, "An illumination invariant skin-color model for face detection," in Proc. 4th IEEE Int. Conf. Biometrics Theory Appl. Syst., Sep. 2010, pp. 1–6.
[46] Z. Kalal, K. Mikolajczyk, and J. Matas, "Face-TLD: Tracking-learning-detection applied to faces," in Proc. 17th IEEE Int. Conf. Image Process., Sep. 2010, pp. 3789–3792.
[47] C. Whitelam, Z. Jafri, and T. Bourlai, "Multispectral eye detection: A preliminary study," in Proc. 20th IEEE Int. Conf. Pattern Recognit., Aug. 2010, pp. 209–212.
[48] Q. Yuan, A. Thangali, V. Ablavsky, and S. Sclaroff, "Learning a family of detectors via multiplicative kernels," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 3, pp. 514–530, Mar. 2011.
[49] K. Ali, F. Fleuret, D. Hasler, and P. Fua, "A real-time deformable detector," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 2, pp. 225–239, Feb. 2012.
[50] E. Bailly-Bailliere, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariethoz, J. Matas, K. Messer, V. Popovici, F. Poree, B. Ruiz, and J.-P. Thiran, "The BANCA database and evaluation protocol," in Proc. 4th Int. Conf. Audio- and Video-Based Biometric Person Authenticat., 2003, pp. 625–638.
[51] B. Weyrauch, J. Huang, B. Heisele, and V. Blanz, "Component-based face recognition with 3D morphable models," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2004, p. 85.
[52] T. Sim, S. Baker, and M. Bsat, "The CMU pose, illumination, and expression database," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 12, pp. 1615–1618, Dec. 2003.
[53] O. Jesorsky, K. Kirchberg, and R. Frischholz, "Robust face detection using the Hausdorff distance," in Proc. 3rd Int. Conf. Audio- and Video-Based Biometric Person Authenticat., vol. 2091, Jun. 2001, pp. 90–95.
[54] P. Sharma and R. Reilly, "A color face image database for benchmark of automatic facial detection algorithms," in Proc. 4th Eur. Conf. Video/Image Multimedia Commun., Jul. 2003, pp. 423–428.
[55] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, "Multi-PIE," Image Vis. Comput., vol. 28, pp. 807–813, May 2010.
[56] W. Gao, B. Cao, S. Shan, X. Chen, D. Zhou, X. Zhang, and D. Zhao, "The CAS-PEAL large-scale Chinese face database and baseline evaluations," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 38, no. 1, pp. 149–161, Jan. 2008.
[57] A. Georghiades, P. Belhumeur, and D. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 643–660, Jun. 2001.
[58] X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2879–2886.
[59] C. Rother, V. Kolmogorov, and A. Blake, "'GrabCut': Interactive foreground extraction using iterated graph cuts," ACM Trans. Graph., vol. 23, no. 3, pp. 309–314, 2004.
[60] Y. Boykov and M.-P. Jolly, "Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images," in Proc. 8th IEEE Int. Conf. Comput. Vis., Jul. 2001, pp. 105–112.
[61] P. N. Belhumeur and D. J. Kriegman, "What is the set of images of an object under all possible illumination conditions," Int. J. Comput. Vis., vol. 28, no. 3, pp. 1–16, 1998.
[62] W. Chen, M. J. Er, and S. Wu, "Illumination compensation and normalization for robust face recognition using discrete cosine transform in logarithm domain," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 36, no. 2, pp. 458–466, Apr. 2006.
[63] V. Blanz and T. Vetter, "Face recognition based on fitting a 3D morphable model," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 9, pp. 1063–1074, Sep. 2003.
[64] A. Martinez and R. Benavente, "The AR face database," Purdue Univ., West Lafayette, IN, USA, CVC Tech. Rep. 24, 1998.
[65] D. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[66] Z. Kalal, J. Matas, and K. Mikolajczyk, "Weighted sampling for large-scale boosting," in Proc. Brit. Mach. Vis. Conf., 2008, pp. 42.1–42.10.
[67] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 1, Dec. 2001, pp. I-511–I-518.
[68] OpenCV. (2012). OpenCV 2.4.4 [Online]. Available: http://docs.opencv.org/

Gee-Sern Hsu (M'09) received the dual M.S. degree in electrical and mechanical engineering and the Ph.D. degree in mechanical engineering from the University of Michigan, Ann Arbor, MI, USA, in 1993 and 1995, respectively.

From 1995 to 1996, he was a Post-Doctoral Fellow with the University of Michigan. From 1997 to 2000, he was a Senior Research Staff member with the National University of Singapore, Singapore. In 2001, he joined Penpower Technology, where he led research on face recognition and intelligent video surveillance. In 2007, he joined the Department of Mechanical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, as an Assistant Professor. His current research interests include object recognition, particularly face recognition, pedestrian detection, and vehicle license plate recognition.

Dr. Hsu, along with his team at Penpower Technology, was a recipient of the Best Innovation Award at the SecuTech Expo during 2005–2007.

Tsu-Ying Chu received the B.E. degree in electromechanical engineering from the National Ilan University, I-Lan, Taiwan, in 2010, and the M.S. degree in mechanical engineering from the National Taiwan University of Science and Technology, Taipei, Taiwan, in 2012.

She is an Image Processing Engineer at Vols Taipei.