Fuerst Leonardis MIPRO2014


Feature Selection for Object Detection:

The Best Group vs. the Group of Best


Luka Fürst∗ and Aleš Leonardis∗†
∗ University of Ljubljana, Faculty of Computer and Information Science, Tržaška cesta 25, SI-1000 Ljubljana, Slovenia
† Intelligent Robotics Laboratory, Centre for Computational Neuroscience and Cognitive Robotics,
School of Computer Science, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
Email: {luka.fuerst,ales.leonardis}@fri.uni-lj.si

Abstract—The problem of visual object detection, the goal of which is to predict the locations and sizes of all objects of a given visual category (e.g., cars) in a given set of images, is often based on a possibly large set of local features, only a few of which might actually be useful for the given detection setup. Feature selection is concerned with finding a 'useful' subset of features. In this paper, we compare two approaches to feature selection in a visual object detection setup. One of them selects features based on their individual utility scores alone, regardless of possible interdependence with other features. The other approach employs the AdaBoost framework and hence implicitly deals with interdependence. Using two feature extraction methods and several image datasets, we experimentally confirm the significance of feature interdependence: features that perform well individually do not necessarily perform well as a group.

I. INTRODUCTION

Many object detection methods begin with extracting a potentially large set of local features from a given set of training images. The extracted features may be immediately used as building blocks in the detection model of the given visual category [1]. However, it is often beneficial to reduce the number of features before constructing the detection model. Besides the computational reasons, a properly selected set of features might lead to a better detection performance than the original set, since the selection process may be able to remove the 'useless' or 'harmful' features.

A viable feature selection method has to assess the utility of individual features for the given object detection problem. A feature is 'useful' if it is both distinctive (it often occurs in images depicting objects of the given category and rarely in other images) and predictive (it consistently occurs at similar relative locations within images of objects). For the 'Cars' category, for instance, an example of a 'useful' feature is one that captures a tire, since tires tend to be visually distinctive for cars and to occur at similar relative positions in a majority of car images.

The primary contribution of this paper is the comparison of two methods for selecting a favorable (in terms of detection performance) subset of T features out of the initial set of N features. Both methods use the same measure for the utility of individual features. However, while one approach simply selects the T best features according to the individual utility scores, the other embeds the scoring mechanism into an AdaBoost framework [2], thereby considering interdependence of features [3] in addition to their individual scores. Informally, AdaBoost seeks features that work well as a group rather than just individually. We experimentally show that feature interdependence has to be considered in a feature selection process. As the secondary contribution, this paper compares two feature extraction and representation methods, namely that of Fidler et al. [4] and that of Ferrari et al. [5]. Both comparisons are evaluated on five real-world image datasets.

In this paper, we use the feature selection approach due to Fürst et al. [6]. Viola and Jones [7] proposed a highly successful AdaBoost-based method for selecting features for face detection. Both approaches formulate detection as a classification problem. However, while Viola and Jones use a similar classification approach in both the learning and the test stage, Fürst et al. use classification only to select features in the learning stage. Many other AdaBoost-based object detection approaches have been proposed [8], [9], [10].

The rest of this paper is structured as follows: In Sect. II, we present both feature selection approaches. Section III describes the detection algorithm used for both feature selection and experimental validation. Section IV gives a brief overview of both feature extraction methods. Section V presents and discusses the results of experimental validation. Section VI concludes the paper.
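Cover's observation [3] — that the two individually best measurements need not form the best pair — is the crux of the distinction drawn above. A purely hypothetical numeric sketch (the outcome vectors below are invented for illustration, not data from this paper) makes it concrete:

```python
# Hypothetical outcome vectors for three features over six training images
# (1 = image handled correctly, 0 = not); invented for illustration only.
F1 = [1, 1, 1, 1, 0, 0]   # individually best: 4/6
F2 = [1, 1, 1, 1, 0, 0]   # equally good alone, but fails on the same images
F3 = [0, 0, 1, 1, 1, 0]   # weaker alone (3/6), yet complementary to F1

def accuracy(outcomes):
    """Individual utility: fraction of images a feature handles correctly."""
    return sum(outcomes) / len(outcomes)

def group_coverage(*features):
    """Fraction of images handled by at least one feature in the group."""
    return sum(any(col) for col in zip(*features)) / len(features[0])

# Picking the two individually best features gives {F1, F2}: coverage 4/6.
# The individually 'worse' pair {F1, F3} covers 5/6 of the images.
```

Selecting by individual score alone would pick the redundant pair {F1, F2}; a group-aware criterion prefers the complementary pair {F1, F3}.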
II. FEATURE SELECTION

In this section, we present both feature selection approaches. For the sake of clarity, we first present the AdaBoost-based approach, since the approach that searches for the individually best features is actually its simplification.

A. AdaBoost-based Approach

The AdaBoost-based feature selection approach proposed by Fürst et al. [6] takes into account both the properties of individual features (distinctiveness and predictiveness) and their interdependence. Originally, AdaBoost [2] is an approach for constructing classifiers in binary classification problems. Given a set of 'weak', threshold-based classifiers, a set of training samples, and a set of weights assigned to individual samples, each iteration of AdaBoost selects the classifier that yields the highest classification accuracy on the sample set with respect to the current values of the weights. After selecting the current classifier, the weights of the samples misclassified by that classifier are enlarged so that the next iteration will be likely to select a classifier complementary to the one just selected. In this way, AdaBoost will implicitly deal with interdependence in the set of weak classifiers.

We shall now present how to adapt the AdaBoost framework to the problem of selecting features for object detection. Let us assume the existence of an object detection algorithm that accepts an image I and a set of features G and returns a set of detection hypotheses predicting the locations and sizes of individual objects in the image I, with the additional restriction that I can be represented only by the features from G. Each returned hypothesis H is assumed to be equipped with a positive number s(H) designating its strength. An example of such a detection algorithm is described in Sect. III, but the presented selection approach would work with any algorithm fulfilling the stated requirements.

The AdaBoost-based feature selection process due to Fürst et al. requires a set of images that depict various objects of the given visual category on a uniform background (positive training images, I+) and a set of images depicting anything but objects of the given category (negative training images, I−). Let I = I+ ∪ I− denote the entire set of training images, and let w = (wI)I∈I denote a vector of weights assigned to individual images such that Σ_{I∈I} wI = 1. At the beginning of the AdaBoost algorithm, we have w = w[1], where wI[1] = 1/(2|I+|) for all I ∈ I+ and wI[1] = 1/(2|I−|) for all I ∈ I−. During the algorithm, the weights are iteratively updated.

To apply the classification-oriented AdaBoost framework to our detection-oriented feature selection setup, the problem of detecting objects in the training images is transformed into that of classifying training images themselves, and a threshold-based classifier is constructed for every feature. Let us first define a function S : F × I → R as

          ⎧ s(H∗(I | F))  if I ∈ I+ and H∗(I | F) is correct;
S(F, I) = ⎨ s(H∗(I | F))  if I ∈ I−;                               (1)
          ⎩ 0             otherwise,

where H∗(I | F) denotes the strongest detection hypothesis on the image I provided that I is represented solely by the feature F. For a feature F and a threshold θ ∈ R, let us define a classifier dF,θ : I → {0, 1} that classifies a given training image as positive (output 1) or negative (output 0) according to the following rule:

          ⎧ 1  if S(F, I) ≥ θ;
dF,θ(I) = ⎨                        (2)
          ⎩ 0  if S(F, I) < θ.

If yI denotes the actual class of the image I (yI = 1 for all I ∈ I+ and yI = 0 for all I ∈ I−), the classifier's error on the entire training set can be computed as

ε(F, θ; w) = Σ_{I∈I} wI |dF,θ(I) − yI|.     (3)

Let θ∗(F; w) be a threshold at which the classifier's error is minimal:

θ∗(F; w) = argmin_θ ε(F, θ; w).     (4)

The classifier d∗F = dF,θ∗(F;w) attains the smallest classification error and will thus be called the optimal classifier. The utility of a feature F is defined as

u(F; w) = 1 − ε(F, θ∗(F; w); w).     (5)

For a feature F that is both distinctive and predictive, S(F, I) is large for many positive training images and small for many negative ones. Under the initial weight vector, the optimal classifier for such a feature F attains a low classification error, since a threshold θ∗ can be found such that most of the values {S(F, I)}_{I∈I+} are above θ∗ and most of the values {S(F, I)}_{I∈I−} are below θ∗. The feature F has therefore a high utility. Conversely, if a feature F is non-distinctive, the values of S(F, I) tend to be random, whereas if F is non-predictive, S(F, I) is zero for many positive images. In both cases, even the optimal classifier produces a high classification error, and so F has a low utility.

The AdaBoost-based feature selection method is shown as Algorithm 1. In each iteration, the algorithm selects the feature having the highest utility with respect to the current weight vector and decreases the weights of the training images correctly classified by the optimal classifier for the selected feature. In the next iteration, the misclassified images have higher relative weights, which encourages the selection of a feature complementary to the feature selected in the previous iteration.

B. Static Approach

In the static approach, features are selected by decreasing values of u(F; w[1]). Features are thus selected by individual utility, but their interdependence is not considered.
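As a minimal, hedged sketch of Eqs. (1)-(8) and of the static variant: we assume the detection scores S(F, I) have already been computed by the detector and are stored per feature as a list aligned with the image labels. All names and numbers are illustrative; this is not the authors' implementation.

```python
# A minimal sketch of the selection machinery of Eqs. (1)-(8). The scores
# S(F, I) are assumed precomputed; data layouts are illustrative assumptions.

def optimal_threshold(scores, labels, weights):
    """Eq. (4): the threshold theta* minimizing the weighted error of Eq. (3)."""
    best_theta, best_err = None, float("inf")
    for theta in sorted(set(scores)):
        err = sum(w for s, y, w in zip(scores, labels, weights)
                  if (1 if s >= theta else 0) != y)           # Eqs. (2), (3)
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta, best_err

def utility(scores, labels, weights):
    """Eq. (5): u(F; w) = 1 - eps(F, theta*(F; w); w)."""
    return 1.0 - optimal_threshold(scores, labels, weights)[1]

def adaboost_select(S, labels, T):
    """Algorithm 1: S maps each feature name to its score list over all images."""
    n_pos = sum(labels); n_neg = len(labels) - n_pos
    w = [1 / (2 * n_pos) if y else 1 / (2 * n_neg) for y in labels]
    remaining, selected = set(S), []
    for _ in range(T):
        total = sum(w)
        w = [wi / total for wi in w]                                 # Eq. (6)
        f = max(remaining, key=lambda g: utility(S[g], labels, w))  # Eq. (7)
        theta, eps = optimal_threshold(S[f], labels, w)
        remaining.discard(f); selected.append(f)
        beta = eps / (1 - eps) if 0 < eps < 1 else 1.0
        w = [wi * beta if (1 if s >= theta else 0) == y else wi     # Eq. (8)
             for wi, s, y in zip(w, S[f], labels)]
    return selected

def static_select(S, labels, T):
    """Sect. II-B: simply the T features with the highest u(F; w[1])."""
    n_pos = sum(labels); n_neg = len(labels) - n_pos
    w1 = [1 / (2 * n_pos) if y else 1 / (2 * n_neg) for y in labels]
    return sorted(S, key=lambda g: utility(S[g], labels, w1), reverse=True)[:T]
```

The only difference between the two selectors is that the static one never revisits the weights, which is exactly why it cannot account for interdependence.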
Algorithm 1 The AdaBoost-based feature selection method.
Require:
  (a) A set of training images I with class membership labels yI (1: positive; 0: negative)
  (b) A set of features, F.
  (c) The number of iterations, T.
Ensure: A sequence of selected features ⟨F[1], . . . , F[T]⟩.
Process:
 1: F′ := F;  // F′: the set of not-yet-selected features
 2: wI[1] := 1/(2|I+|) for all I ∈ I+
 3: wI[1] := 1/(2|I−|) for all I ∈ I−
 4: for t := 1 to T do
 5:   Normalize the weights such that their sum becomes 1:
        wI[t] := wI[t] / Σ_{J∈I} wJ[t]  for all I ∈ I.     (6)
 6:   Select the feature with the highest score u(F; w[t]) with respect to the current weights:
        F[t] = argmax_{F∈F′} u(F; w[t]).     (7)
 7:   Let θ[t] = θ∗(F[t]; w[t]), d[t] = dF[t],θ[t], and ε[t] = ε(F[t], θ[t]; w[t]) = 1 − u(F[t]; w[t]).
 8:   F′ := F′ \ {F[t]}.
 9:   Update the weights:
        wI[t+1] = ⎧ (ε[t]/(1 − ε[t])) wI[t]  if d[t](I) = yI;     (8)
                  ⎩ wI[t]                    otherwise.
10: end for

III. OBJECT DETECTION

We employed an object detection approach due to Leibe et al. [1]. For now, let us assume that all objects are approximately the same size; an extension to arbitrary sizes will be briefly described later. In the training stage, the method of Leibe et al. learns the so-called Implicit Shape Model (ISM) from a set of training images depicting different objects of the given visual category on a uniform background. In the test stage, the learned model is used to form detection hypotheses for a given real-world image depicting an unknown number of objects of the given category.

To learn the ISM for a given category, every training image is first represented by the given set of features. For a feature F located at the coordinates y in an image I, the learning algorithm stores a pair (F, x − y), where x denotes the location of the object's centroid in I. Informally, the ISM is thus a collection of the relative locations of individual features with respect to the object centroid. The test stage begins with the representation of a given test image by the feature set. Each feature then 'votes' for all locations of the object centroid based on its ISM entries. Specifically, if a feature F is located at the coordinates q in the test image and if the ISM entries for F are (F, r1), . . . , (F, rk), then F votes for the coordinates q + r1, . . . , q + rk.
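The voting step just described can be sketched as follows. The feature names, offsets, and vote weights below are invented for illustration; the actual method weights its votes probabilistically.

```python
# A sketch of ISM voting: each occurrence of a feature at position q casts a
# vote at q + r for every stored offset r. All data here are invented.
from collections import defaultdict

def cast_votes(occurrences, ism):
    """occurrences: list of (feature, (qx, qy)) found in the test image;
    ism: feature -> list of ((rx, ry), weight) entries learned in training."""
    votes = defaultdict(float)
    for feature, (qx, qy) in occurrences:
        for (rx, ry), weight in ism.get(feature, []):
            votes[(qx + rx, qy + ry)] += weight   # vote for centroid q + r
    return votes

# Two occurrences of a hypothetical 'tire' feature agree on one centroid:
ism = {"tire": [((40, -10), 0.6), ((-40, -10), 0.4)]}
occ = [("tire", (100, 200)), ("tire", (180, 200))]
votes = cast_votes(occ, ism)
# (100+40, 200-10) and (180-40, 200-10) both fall on (140, 190), which
# accumulates 0.6 + 0.4 and dominates the two stray votes.
```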
The votes are weighted: the method of Leibe et al. considers both the possibility that a given local structure is represented by several features, each with a certain probability, and the possibility that a given feature votes for several object centroid candidates, each with a certain probability.

The output of the voting algorithm is a probability map that, for each pixel of the test image, specifies the probability that the object centroid is located at the coordinates of that pixel. To obtain a small set of robust detection hypotheses, the probability map is smoothed by a Gaussian filter. The locations of the maxima in the smoothed map are taken as individual detection hypotheses. The strength of a hypothesis H, s(H), is defined to be the value of the corresponding maximum.

A single feature may cast votes for many different object centroids, often giving rise to strong secondary hypotheses. Such hypotheses may be suppressed by a greedy MDL-based algorithm [11]. In each iteration, this algorithm selects the currently strongest hypothesis and greedily assigns to it all of its support, i.e., all probabilistic votes that it gathered in the voting process. The competing hypotheses thereby lose part of their support and hence strength. This scheme has been experimentally shown to significantly improve detection performance.

The described training and detection process can also be adapted to multiscale detection. At least two approaches exist: (1) single-scale learning, multiscale detection; (2) multiscale learning, single-scale detection. In the first approach, the training stage is the same as in the single-scale case, but the test-stage voting algorithm is applied to several rescaled versions of the test image. The second approach only works with scale-covariant features (such a feature can be detected at multiple scales, and each of its occurrences provides information about the scale of the corresponding local structure). The learned ISM contains both the offset and the scale of each feature, relative to the position and the size of the object. In the test stage, the features cast votes in a 3-dimensional space. Each feature votes not only for candidate object centroid locations but also for candidate object sizes. The detection algorithm thus produces hypotheses about both the centroids and the sizes of individual objects.
IV. FEATURE EXTRACTION AND REPRESENTATION

We tested two feature extraction and representation methods: that of Fidler et al. [4] and that of Ferrari et al. [5].

Features of Fidler et al. are arranged into layers according to the 'parts composed of parts' paradigm. The first layer contains only a single feature; this is the kernel of the Gabor filter in the orientation 0°. Each feature of a layer n > 1 is a composition of rotated copies of features of layer n − 1. In particular, a layer-n feature F_i^n can be represented as {(F_{j1}^{n−1}, α1, µ1, Σ1), . . . , (F_{js}^{n−1}, αs, µs, Σs)}, where for each k ∈ {1, . . . , s}, αk is the discretized orientation of the k-th component of feature F_i^n (relative to the orientation of the first component), whereas µk and Σk denote the mean and the covariance matrix of the Gaussian distribution that models the variability of the position of the k-th component (relative to the position of the first component). To represent an image with a set of features, the features are matched against the Gabor-filter representation of the image in a layer-by-layer fashion. The result of the representation is a set of triples (F_i^n, (x, y), β), where F_i^n is a layer-n feature, (x, y) is the position of an occurrence of F_i^n in the image, and β is the discretized orientation in which the feature occurs at that position.

Features of Ferrari et al. [5] are obtained by a multi-step procedure. First, an edge detector is used to obtain edge elements, i.e., image points at directed intensity transitions. After that, each group of edge elements that form an approximately straight line is taken as a separate 'segment'. In the next step, the extraction procedure builds groups of k segments that touch or almost touch each other. If k = 2, for example, such groups include segment pairs in the shape of the letters L and T. Groups of k adjacent segments are then clustered by their visual similarity. Finally, each cluster is taken as a Ferrari et al. feature. Features are described in a rotation- and scale-invariant manner. To represent a given image by a given set of features, the image is first processed so as to obtain a set of groups of k adjacent segments. After that, the features are matched against the groups in the image, giving a set of triples (r(G), t(G), R(G)), where G is a k-adjacent-segments group, r(G) and t(G) are its location and size, and R(G) = {(F, r(F, G)), . . .} is a set of features that can be matched against the group G, equipped with an estimate of the reliability of the match.

V. EXPERIMENTAL RESULTS

TABLE I: TEST IMAGE DATASETS.

Dataset          | Num. of images | Total num. of depicted objects
Apple logos [5]ᵃ | 40             | 44
Bottles [5]ᵃ     | 48             | 55
Cars [12]ᵇ       | 108            | 139
Horses [5]ᶜ      | 170            | 180
Mugs [5]ᵃ        | 48             | 66

ᵃ http://www.vision.ee.ethz.ch/datasets/index.en.html
ᵇ http://cogcomp.cs.illinois.edu/Data/Car/
ᶜ http://pascal.inrialpes.fr/data/horses/

TABLE II: THE NUMBER OF EXTRACTED FEATURES.

Dataset     | Fidler et al. | Ferrari et al.
Apple logos | 592           | 168
Bottles     | 896           | 422
Cars        | 920           | 386
Horses      | 896           | 398
Mugs        | 912           | 314

Before presenting the results, let us describe how individual experiments were evaluated. The output of the detection algorithm for a given image is a set of detection hypotheses together with their strengths. A hypothesis is a prediction of the bounding box of an object in the image. Let (xH, yH, wH, hH) denote the image coordinates of the centroid (xH and yH) and the width and the height (wH and hH) of the bounding box predicted by a hypothesis H, and let (xO, yO, wO, hO) define the true bounding box of a depicted object. The hypothesis H is proclaimed to be correct if the following three conditions are met [1]:

1) √((2(xH − xO)/wO)² + (2(yH − yO)/hO)²) ≤ 0.5.
2) The predicted bounding box covers at least 50% of the area of the true bounding box.
3) The true bounding box covers at least 50% of the area of the predicted bounding box.

If the above conditions are fulfilled by multiple hypotheses for the same object, only one of those hypotheses is considered correct; all others are regarded as false positives.

The detection performance on a given image set is estimated by the following measure:

Fmax = max_τ [2 × recall(τ) × precision(τ)] / [recall(τ) + precision(τ)].     (9)

The Fmax measure ranges from 0 to 1, with 0 being the worst and 1 being the best possible detection performance. The recall and precision at a given threshold τ are computed as follows:

recall(τ) = TP(τ)/OBJ;     (10)
precision(τ) = TP(τ)/(TP(τ) + FP(τ)).     (11)

Here, OBJ denotes the number of all depictions of objects of the given visual category in the image set, TP(τ) ('true positives') is the number of correct hypotheses H with s(H) ≥ τ, and FP(τ) ('false positives') is the number of incorrect hypotheses H with s(H) ≥ τ.
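This evaluation protocol can be sketched as follows, assuming axis-aligned bounding boxes given as (centroid x, centroid y, width, height); the rule that at most one hypothesis per object counts as correct is omitted for brevity, and all numbers in the usage below are invented.

```python
# A sketch of conditions 1)-3) and Eqs. (9)-(11); simplifying assumptions noted above.
import math

def box_intersection(a, b):
    """Intersection area of two (x, y, w, h) boxes with centroid coordinates."""
    (xa, ya, wa, ha), (xb, yb, wb, hb) = a, b
    iw = max(0.0, min(xa + wa / 2, xb + wb / 2) - max(xa - wa / 2, xb - wb / 2))
    ih = max(0.0, min(ya + ha / 2, yb + hb / 2) - max(ya - ha / 2, yb - hb / 2))
    return iw * ih

def is_correct(hyp, obj):
    """Conditions 1)-3) above."""
    (xh, yh, wh, hh), (xo, yo, wo, ho) = hyp, obj
    dist = math.sqrt((2 * (xh - xo) / wo) ** 2 + (2 * (yh - yo) / ho) ** 2)
    inter = box_intersection(hyp, obj)
    return dist <= 0.5 and inter >= 0.5 * wo * ho and inter >= 0.5 * wh * hh

def f_max(correct, strengths, num_objects):
    """Eqs. (9)-(11): the best F-score over all strength thresholds tau."""
    best = 0.0
    for tau in set(strengths):
        tp = sum(1 for c, s in zip(correct, strengths) if c and s >= tau)
        fp = sum(1 for c, s in zip(correct, strengths) if not c and s >= tau)
        recall = tp / num_objects                          # Eq. (10)
        precision = tp / (tp + fp) if tp + fp else 0.0     # Eq. (11)
        if recall + precision > 0:
            best = max(best, 2 * recall * precision / (recall + precision))
    return best
```

Only the strengths that actually occur need to be tried as thresholds, since recall and precision change only at those values.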
Each combination of feature extraction and feature selection method was tested on five test image datasets, some information about which is provided in Table I. To extract and select features and to learn the detection model for individual categories, we used category-specific training image sets obtained from various sources. All training sets were completely disjoint from the corresponding test sets.

For each combination of feature extraction method, feature selection method, and image dataset, we extracted all features (their number is given in Table II), selected the first T features (where T varied from 1 to 50), and learned and tested the detection model using the T selected features. Fig. 1 shows the Fmax values for all experiments.

The AdaBoost-based approach significantly outperforms the static feature selection approach. The reason for such a behavior is the redundancy of the extracted feature set. Many extracted features react to similar local structures and hence attain similar utility scores. Consequently, a feature set selected solely by the utility of individual features is likely to contain many similar features. By contrast, AdaBoost favors features that correctly classify the images misclassified by the already selected features. Tables III and IV list the first 10 selected features for the categories 'Cars' and 'Mugs'. The feature sets selected by the static approach are clearly redundant. As shown in Fig. 1, a much greater number of features selected by the static approach is required to attain a detection performance comparable to AdaBoost-selected features.

For each of the first 10 AdaBoost-selected Fidler et al. features for the 'Mugs' category, Table V shows its rank in the decreasing sequence of u(F; w[1]) values. The third selected feature, for instance, appears at a rank as low as 156 out of 912, but it represents the best complement to the first two selected features, at least with respect to the employed training image set.

TABLE III: THE FIRST 10 SELECTIONS FOR 'CARS'.

Extraction     | Selection | The first 10 selected features
Fidler et al.  | AdaBoost  | (images)
Fidler et al.  | static    | (images)
Ferrari et al. | AdaBoost  | (images)
Ferrari et al. | static    | (images)
[Five panels: Apple logos, Bottles, Cars, Horses, and Mugs. Each panel plots Fmax (0 to 1) against the number of selected features (0 to 50) for four curves: Fidler et al. & AdaBoost, Fidler et al. & static, Ferrari et al. & AdaBoost, and Ferrari et al. & static.]
Fig. 1: Detection performance w.r.t. the number of selected features.


TABLE IV: THE FIRST 10 SELECTIONS FOR 'MUGS'.

Extraction     | Selection | The first 10 selected features
Fidler et al.  | AdaBoost  | (images)
Fidler et al.  | static    | (images)
Ferrari et al. | AdaBoost  | (images)
Ferrari et al. | static    | (images)

TABLE V: THE RANKINGS OF THE FIRST 10 ADABOOST-SELECTED FEATURES FOR 'MUGS' IN THE SEQUENCE OF THE FEATURES SELECTED BY THE STATIC APPROACH.

AdaBoost | 1 | 2 | 3   | 4 | 5  | 6   | 7  | 8 | 9   | 10
Static   | 1 | 7 | 156 | 3 | 28 | 109 | 30 | 4 | 142 | 49

The features obtained by the method of Fidler et al. led to substantially more favorable detection results than those of Ferrari et al. For instance, the most distinctive and predictive features for the 'Cars' category are those that capture tires, a few examples of which are shown in Table III. While the method of Fidler et al. is able to extract such features, this is not the case with the method of Ferrari et al., which builds features from straight segments. These features are therefore unable to capture small circular or semi-circular structures. In addition, although Fidler et al. features tend to be fairly indistinctive (with the possible exception of wheel-like features), they occur on equivalent local structures much more consistently than Ferrari et al. ones. Fig. 2 shows the first AdaBoost-selected Fidler et al. features for the categories 'Cars' and 'Mugs', their respective ISM probability maps (the probability of the object centroid location relative to the location of the feature), and a few occurrences in test images.

[Two panels: (a) cars; (b) mugs.]
Fig. 2: The first AdaBoost-selected Fidler et al. feature, its ISM probability map, and a few occurrences in test images.

VI. CONCLUSION

We have compared an AdaBoost-based feature selection approach with a simplified, 'static' approach that does not pay regard to feature interdependence. Using two feature extraction methods and five real-world image datasets, we showed that for a fixed target number of selected features, the AdaBoost-based method considerably outperforms the simplified selection method. The main reason is that many extracted features tend to be visually similar and hence attain similar utility scores. The resulting feature sets therefore contain much redundancy. AdaBoost, by contrast, selects features so that they complement each other and thus tends to achieve a comparable detection performance using far fewer features than the static approach.

We assumed that the number of selected features is fixed and known in advance. A natural extension of the presented selection approaches is therefore the automatic determination of the most appropriate number of features with respect to both detection performance and computational burden. Another interesting research direction is feature selection for the multiclass detection setting, where the goal is to find all objects of several categories simultaneously [10].

REFERENCES

[1] B. Leibe, A. Leonardis, and B. Schiele, "Combined object categorization and segmentation with an implicit shape model," in ECCV'04 Workshop on Statistical Learning in Computer Vision, 2004, pp. 17–32.
[2] Y. Freund and R. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[3] T. Cover, "The best two independent measurements are not the two best," IEEE Transactions on Systems, Man, and Cybernetics, vol. 4, no. 1, pp. 116–117, 1974.
[4] S. Fidler and A. Leonardis, "Towards scalable representations of object categories: Learning a hierarchy of parts," in IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[5] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid, "Groups of adjacent contour segments for object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 1, pp. 36–51, 2008.
[6] L. Fürst, S. Fidler, and A. Leonardis, "Selecting features for object detection using an AdaBoost-compatible evaluation function," Pattern Recognition Letters, vol. 29, no. 11, pp. 1603–1612, 2008.
[7] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[8] H. Grabner, P. Roth, and H. Bischof, "Eigenboosting: Combining discriminative and generative information," in IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[9] J. Shotton, A. Blake, and R. Cipolla, "Contour-based learning for object detection," in International Conference on Computer Vision, 2005, pp. 503–510.
[10] A. Torralba, K. Murphy, and W. Freeman, "Sharing features: efficient boosting procedures for multiclass object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2004, pp. 762–769.
[11] A. Leonardis, H. Bischof, and J. Maver, "Multiple eigenspaces," Pattern Recognition, vol. 35, no. 11, pp. 2613–2627, 2002.
[12] S. Agarwal, A. Awan, and D. Roth, "Learning to detect objects in images via a sparse, part-based representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1475–1490, 2004.
