
Int J Comput Vis (2010) 88: 303–338

DOI 10.1007/s11263-009-0275-4

The PASCAL Visual Object Classes (VOC) Challenge


Mark Everingham · Luc Van Gool · Christopher K. I. Williams · John Winn · Andrew Zisserman

Received: 30 July 2008 / Accepted: 16 July 2009 / Published online: 9 September 2009
© Springer Science+Business Media, LLC 2009

Abstract  The PASCAL Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection.

This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.

Keywords  Database · Benchmark · Object recognition · Object detection

M. Everingham, University of Leeds, Leeds, UK (e-mail: m.everingham@leeds.ac.uk)
L. Van Gool, KU Leuven, Leuven, Belgium
C.K.I. Williams, University of Edinburgh, Edinburgh, UK
J. Winn, Microsoft Research, Cambridge, UK
A. Zisserman, University of Oxford, Oxford, UK

1 Introduction

The PASCAL¹ Visual Object Classes (VOC) Challenge consists of two components: (i) a publicly available dataset of images and annotation, together with standardised evaluation software; and (ii) an annual competition and workshop. The VOC2007 dataset consists of annotated consumer photographs collected from the flickr² photo-sharing web-site. A new dataset with ground truth annotation has been released each year since 2006. There are two principal challenges: classification ("does the image contain any instances of a particular object class?", where the object classes include cars, people, dogs, etc.) and detection ("where are the instances of a particular object class in the image, if any?"). In addition, there are two subsidiary challenges ("tasters"): pixel-level segmentation (assign each pixel a class label) and "person layout" (localise the head, hands and feet of people in the image). The challenges are issued with deadlines each year, and a workshop is held to compare and discuss that year's results and methods. The datasets and associated annotation and software are subsequently released and available for use at any time.

The objectives of the VOC challenge are twofold: first, to provide challenging images and high quality annotation, together with a standard evaluation methodology (a "plug and play" training and testing harness so that the performance of algorithms can be compared): the dataset component; and second, to measure the state of the art each year: the competition component.

¹ PASCAL stands for pattern analysis, statistical modelling and computational learning. It is an EU Network of Excellence funded under the IST Programme of the European Union.
² http://www.flickr.com/

The purpose of this paper is to describe the challenge: what it is, and the reasons for the way it is. We also describe the methods, results and evaluation of the challenge, and so in that respect are describing the state-of-the-art in object recognition (at least as measured for these challenges and by those who entered). We focus mainly on the 2007 challenge, as this is the most recent at the time of writing, but also discuss significant changes since earlier challenges and why these were made.

1.1 Relation to Other Datasets

Challenge datasets are important in many areas of research in order to set goals for methods, and to allow comparison of their performance. Similar datasets and evaluation methodologies are sprouting in other areas of computer vision and machine learning, e.g. the Middlebury datasets for stereo, MRF optimisation, and optical flow comparison (Scharstein and Szeliski 2002).

In addition to organised challenges, there are several datasets contributed by the vision community which are related to that collected for the VOC challenges.

The Caltech 101 Dataset (Fei-Fei et al. 2006) contains images of 101 categories of object, and is relatively widely used within the community for evaluating object recognition. Each image contains only a single object. A principal aim of the Caltech datasets is to evaluate multi-category object recognition, as a function of the (relatively small) number of training images. This is complementary to the aims of the VOC challenge, which measures performance on a smaller number of classes and without such constraints on the amount of training data available. A common criticism of this dataset, addressed by the VOC challenge, is that the images are largely without clutter, variation in pose is limited, and the images have been manually aligned to reduce the variability in appearance. These factors make the dataset less applicable to "real world" evaluation than the images provided for the VOC challenge.

The Caltech 256 Dataset (Griffin et al. 2007) corrected some of the deficiencies of Caltech 101: there is more variability in size and localisation, and obvious artifacts have been removed. The number of classes is increased (from 101 to 256) and the aim is still to investigate multi-category object recognition with a limited number of training images. For the most part there is only a single object per image, as is required to support the 1-of-m evaluation adopted ("which one of m classes does this image contain?").

The LabelMe Dataset (Russell et al. 2008) at MIT is most similar to the VOC challenge dataset in that it contains more-or-less general photographs containing multiple objects. LabelMe has been ground-breaking in providing a web-based annotation interface, encouraging casual and professional users alike to contribute and share annotation. Many object categories are labelled, with annotation consisting of a bounding polygon and category, with some objects additionally being labelled with pose and object parts. For the most part the dataset is incompletely labelled: volunteers are free to choose which objects to annotate, and which to omit. This means that, while a very valuable resource for training images, the dataset is unsuitable for testing in the manner of the VOC challenge since precision and recall cannot accurately be estimated. Recently the LabelMe organisers have proposed subsets of the database to use for training and testing, which are completely annotated with a set of seven "object" (person, car) and "stuff" (building, sky, etc.) classes. However, no evaluation protocol is specified.

The TREC Video Retrieval Evaluation (TRECVID,³ Smeaton et al. 2006) is also similar to the VOC challenge in that there is a new dataset and competition each year, though the dataset is only available to participants and is not publicly distributed. TRECVID includes several tasks, but the one most related to VOC is termed "high-level feature extraction", and involves returning a ranked list of video shots for specified "features". For the 2008 competition these features include scene categories (such as classroom, cityscape or harbour), object categories (such as dog, aeroplane flying or telephone) and actions/events (such as a demonstration/protest). Annotation is not provided by the organisers, but some is usually distributed amongst the participants. The submissions are scored by their Average Precision (AP). The evaluation of the ranked lists is carried out by the organisers using a mixture of ground truth labelling and "inferred ground truth" (Yilmaz and Aslam 2006) obtained from highly ranked results returned by the participants' methods.

The Lotus Hill Dataset (Yao et al. 2007) is a large, recently produced dataset, a small part of which is made freely available to researchers. It contains 8 data subsets with a range of annotation. Particularly we highlight (a) annotations providing a hierarchical decomposition of individual objects, e.g. vehicles (9 classes, 209 images⁴), other man-made objects (75 classes, 750 images) and animals (40 classes, 400 images); and (b) segmentation labelling of scenes to a pixel level (444 images). As this dataset has only recently been released there has not yet been a lot of work reported on it. The datasets look to have a useful level of annotation (especially with regard to hierarchical decompositions, which have not been attempted elsewhere), but are somewhat limited by the number of images that are freely available.

³ http://www-nlpir.nist.gov/projects/trecvid/
⁴ The number of images quoted is the number that are freely available.

1.2 Paper Layout

This paper is organised as follows: we start with a summary of the four challenges in Sect. 2, then describe in more detail in Sect. 3 the datasets: their method of collection; the classes included and the motivation for including them; and their annotation and statistics. Section 4 describes the evaluation procedure and why this procedure was chosen. Section 5 overviews the main methods used in the 2007 challenge for classification and detection, and Sect. 6 reports and discusses the results. This discussion includes an analysis of the statistical significance of the performances of the different methods, and also of which object classes and images the methods find easy or difficult. We conclude with a discussion of the merits, and otherwise, of the VOC challenge and possible options for the future.

2 Challenge Tasks

This section gives an overview of the two principal challenge tasks on classification and detection, and of the two subsidiary tasks (tasters) on pixel-level segmentation and person layout.

2.1 Classification

For each of twenty object classes, predict the presence/absence of at least one object of that class in a test image. Participants are required to provide a real-valued confidence of the object's presence for each test image so that a precision/recall curve can be drawn. Participants may choose to tackle all, or any subset of object classes, for example "cars only" or "motorbikes and cars".

Two competitions are defined according to the choice of training data: (1) taken from the VOC training/validation data provided, or (2) from any source excluding the VOC test data. In the first competition, any annotation provided in the VOC training/validation data may be used for training, for example bounding boxes or particular views e.g. "frontal" or "left". Participants are not permitted to perform additional manual annotation of either training or test data. In the second competition, any source of training data may be used except the provided test images. The second competition is aimed at researchers who have pre-built systems trained on other data, and is a measure of the state-of-the-art.

2.2 Detection

For each of the twenty classes, predict the bounding boxes of each object of that class in a test image (if any), with associated real-valued confidence. Participants may choose to tackle all, or any subset of object classes. Two competitions are defined in a similar manner to the classification challenge.

2.3 Segmentation Taster

For each test image, predict the object class of each pixel, or "background" if the object does not belong to one of the twenty specified classes. Unlike the classification and detection challenges there is only one competition, where training data is restricted to that provided by the challenge.

2.4 Person Layout Taster

For each "person" object in a test image (if any), detect the person, predicting the bounding box of the person, the presence/absence of parts (head/hands/feet), and the bounding boxes of those parts. Each person detection should be output with an associated real-valued confidence. Two competitions are defined in a similar manner to the classification challenge.

3 Datasets

The goal of the VOC challenge is to investigate the performance of recognition methods on a wide spectrum of natural images. To this end, it is required that the VOC datasets contain significant variability in terms of object size, orientation, pose, illumination, position and occlusion. It is also important that the datasets do not exhibit systematic bias, for example, favouring images with centred objects or good illumination. Similarly, to ensure accurate training and evaluation, it is necessary for the image annotations to be consistent, accurate and exhaustive for the specified classes. This section describes the processes used for collecting and annotating the VOC2007 datasets, which were designed to achieve these aims.

3.1 Image Collection Procedure

For the 2007 challenge, all images were collected from the flickr photo-sharing web-site. The use of personal photos which were not taken by, or selected by, vision/machine learning researchers results in a very "unbiased" dataset, in the sense that the photos are not taken with a particular purpose in mind, i.e. object recognition research. Qualitatively the images contain a very wide range of viewing conditions (pose, lighting, etc.) and images where there is little bias toward images being "of" a particular object, e.g. there are images of motorcycles in a street scene, rather than solely images where a motorcycle is the focus of the picture. The annotation guidelines (Winn and Everingham 2007) provided guidance to annotators on which images to annotate: essentially everything which could be annotated with confidence.
The use of a single source of "consumer" images addressed problems encountered in previous challenges, such as in VOC2006 where images from the Microsoft Research Cambridge database (Shotton et al. 2006) were included. The MSR Cambridge images were taken with the purpose of capturing particular object classes, so that the object instances tend to be large, well-illuminated and central. The use of an automated collection method also prevented any selection bias being introduced by a researcher manually performing image selection. The "person" category provides a vivid example of how the adopted collection methodology leads to high variability; in previous datasets "person" was essentially synonymous with "pedestrian", whereas in the VOC dataset we have images of people engaged in a wide range of activities such as walking, riding horses, sitting on buses, etc. (see Fig. 1).

Fig. 1 Example images from the VOC2007 dataset. For each of the 20 classes annotated, two examples are shown. Bounding boxes indicate all instances of the corresponding class in the image which are marked as non-difficult (see Sect. 3.3); bounding boxes for the other classes are available in the annotation but not shown. Note the wide range of pose, scale, clutter, occlusion and imaging conditions.

In total, 500,000 images were retrieved from flickr. For each of the 20 object classes to be annotated (see Fig. 1), images were retrieved by querying flickr with a number of related keywords (Table 1). No other query criteria, e.g. date of capture, photographer's name, etc. were specified; we return to this point below.

For a given query, flickr is asked for 100,000 matching images (flickr organises search results as "pages", i.e. 100 pages of 1,000 matches). An image is chosen at random from the returned set and downloaded along with the corresponding metadata. A new query is then selected at random, and the process is repeated until sufficient images have been downloaded. Images were downloaded for each class in parallel using a python interface to the flickr API, with no restriction on the number of images per class or query. Thanks to flickr's fast servers, downloading the entire image set took just a few hours on a single machine. A minimal sketch of this collection loop is given below, after Table 1.

Table 1 lists the queries used for each of the classes, produced by "free association" from the target classes. It might appear that the use of keyword queries would bias the images to "pictures of an object", however the wide range of keywords used reduces this likelihood; for example the query "living room" can be expected to return scenes containing chairs, sofas, tables, etc. in context, or the query "town centre" to return scenes containing cars, motorcycles, pedestrians, etc. It is worth noting, however, that without using any keyword queries the images retrieved randomly from flickr were, subjectively, found to be overwhelmingly "party" scenes containing predominantly people. We return to the problem of obtaining sufficient examples of minority object classes in Sect. 7.1.

All exact duplicate and near duplicate images were removed from the downloaded image set, using the method of Chum et al. (2007). Near duplicate images are those that are perceptually similar, but differ in their levels of compression, or by small photometric distortions or occlusions for example.

After de-duplication, random images from the set of 500,000 were presented to the annotators for annotation. During the annotation event, 44,269 images were considered for annotation, being either annotated or discarded as unsuitable for annotation, e.g. containing no instances of the 20 object classes according to the annotation guidelines (Winn and Everingham 2007), or being impossible to annotate correctly and completely with confidence.

One small bias was discovered in the VOC2007 dataset due to the image collection procedure: flickr returns query results ranked by "recency" such that if a given query is satisfied by many images, more recent images are returned first. Since the images were collected in January 2007, this led to an above-average number of Christmas/winter images containing, for example, large numbers of Christmas trees. To avoid such bias in VOC2008,⁵ images have been retrieved using queries comprising a random date in addition to keywords.

Table 1 Queries used to retrieve images from flickr. The targeted class begins each entry; the remaining terms are the additional queries. Note that the query terms are quite general, including the class name, synonyms and scenes or situations where the class is likely to occur.

aeroplane: airplane, plane, biplane, monoplane, aviator, bomber, hydroplane, airliner, aircraft, fighter, airport, hangar, jet, boeing, fuselage, wing, propellor, flying
bicycle: bike, cycle, cyclist, pedal, tandem, saddle, wheel, cycling, ride, wheelie
bird: birdie, birdwatching, nest, sea, aviary, birdcage, bird feeder, bird table
boat: ship, barge, ferry, canoe, boating, craft, liner, cruise, sailing, rowing, watercraft, regatta, racing, marina, beach, water, canal, river, stream, lake, yacht
bottle: cork, wine, beer, champagne, ketchup, squash, soda, coke, lemonade, dinner, lunch, breakfast
bus: omnibus, coach, shuttle, jitney, double-decker, motorbus, school bus, depot, terminal, station, terminus, passenger, route
car: automobile, cruiser, motorcar, vehicle, hatchback, saloon, convertible, limousine, motor, race, traffic, trip, rally, city, street, road, lane, village, town, centre, shopping, downtown, suburban
cat: feline, pussy, mew, kitten, tabby, tortoiseshell, ginger, stray
chair: seat, rocker, rocking, deck, swivel, camp, chaise, office, studio, armchair, recliner, sitting, lounge, living room, sitting room
cow: beef, heifer, moo, dairy, milk, milking, farm
dog: hound, bark, kennel, heel, bitch, canine, puppy, hunter, collar, leash
horse: gallop, jump, buck, equine, foal, cavalry, saddle, canter, buggy, mare, neigh, dressage, trial, racehorse, steeplechase, thoroughbred, cart, equestrian, paddock, stable, farrier
motorbike: motorcycle, minibike, moped, dirt, pillion, biker, trials, motorcycling, motorcyclist, engine, motocross, scramble, sidecar, scooter, trail
person: people, family, father, mother, brother, sister, aunt, uncle, grandmother, grandma, grandfather, grandpa, grandson, granddaughter, niece, nephew, cousin
sheep: ram, fold, fleece, shear, baa, bleat, lamb, ewe, wool, flock
sofa: chesterfield, settee, divan, couch, bolster
table: dining, cafe, restaurant, kitchen, banquet, party, meal
potted plant: pot plant, plant, patio, windowsill, window sill, yard, greenhouse, glass house, basket, cutting, pot, cooking, grow
train: express, locomotive, freight, commuter, platform, subway, underground, steam, railway, railroad, rail, tube, underground, track, carriage, coach, metro, sleeper, railcar, buffet, cabin, level crossing
tv/monitor: television, plasma, flatscreen, flat screen, lcd, crt, watching, dvd, desktop, computer, computer monitor, PC, console, game

⁵ http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/
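To make the collection procedure concrete, the following is a minimal Python sketch of the random-query, random-image download loop described above. It is an illustration only: the `flickr` client object and its `search`, `fetch_image` and `fetch_metadata` methods are hypothetical stand-ins for the python flickr API interface actually used, the query list is a placeholder for the full keyword lists of Table 1, and de-duplication (and the VOC2008 random-date refinement) are omitted.

```python
import random

# Hypothetical placeholders: QUERIES would hold the full keyword lists of
# Table 1, and `flickr` a client object exposing search/fetch methods.
QUERIES = ["aeroplane", "airport", "bicycle", "living room", "town centre"]

def collect_images(flickr, n_images, matches_per_query=100_000):
    """Pick a random query, then a random match from the returned set, and
    repeat until enough images (with their metadata) have been downloaded."""
    downloaded = []
    while len(downloaded) < n_images:
        query = random.choice(QUERIES)
        matches = flickr.search(text=query, max_results=matches_per_query)
        if not matches:
            continue
        photo = random.choice(matches)
        downloaded.append((flickr.fetch_image(photo),
                           flickr.fetch_metadata(photo)))
    return downloaded
```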

3.2 Choice of Classes

Figure 2 shows the 20 classes selected for annotation in the VOC2007 dataset. As shown, the classes can be considered in a taxonomy with four main branches: vehicles, animals, household objects and people.⁶ The figure also shows the year of the challenge in which a particular class was included. In the original VOC2005 challenge (Everingham et al. 2006a), which used existing annotated datasets, four classes were annotated (car, motorbike, bicycle and person). This number was increased to 10 in VOC2006, and 20 in VOC2007.

Fig. 2 VOC2007 Classes. Leaf nodes correspond to the 20 classes. The year of inclusion of each class in the challenge is indicated by superscripts: 2005¹, 2006², 2007³. The classes can be considered in a notional taxonomy, with successive challenges adding new branches (increasing the domain) and leaves (increasing detail).

Over successive challenges the set of classes has been expanded in two ways. First, finer-grain sub-classes have been added, e.g. "bus". The choice of sub-classes has been motivated by (i) increasing the semantic specificity of the output required of systems, for example recognising different types of vehicle e.g. car/motorbike (which may not be visually similar); (ii) increasing the difficulty of the discrimination task by inclusion of objects which might be considered visually similar, e.g. "cat" vs. "dog". Second, additional branches of the notional taxonomy have been added, e.g. animals (VOC2006) and household objects (VOC2007). The motivations are twofold: (i) increasing the domain of the challenge in terms of the semantic range of objects covered; (ii) encouraging research on object classes not widely addressed because of visual properties which are challenging for current methods, e.g. animals, which might be considered to lack highly distinctive parts (c.f. car wheels), and chairs, which are defined functionally, rather than visually, and also tend to be highly occluded in the dataset.

The choice of object classes, which can be considered a sub-tree of a taxonomy defined in terms of both semantic and visual similarity, also supports research in two areas which show promise in solving the scaling of object recognition to many thousands of classes: (i) exploiting visual properties common to classes, e.g. vehicle wheels, for example in the form of feature sharing (Torralba et al. 2007); (ii) exploiting external semantic information about the relations between object classes, e.g. WordNet (Fellbaum 1998), for example by learning a hierarchy of classifiers (Marszalek and Schmid 2007). The availability of a class hierarchy may also prove essential in future evaluation efforts if the number of classes increases to the extent that there is implicit ambiguity in the classes, allowing individual objects to be annotated at different levels of the hierarchy, e.g. hatchback/car/vehicle. We return to this point in Sect. 7.3.

⁶ These branches are also found in the Caltech 256 (Griffin et al. 2007) taxonomy as transportation, animal, household & everyday, and human, though the Caltech 256 taxonomy has many other branches.

3.3 Annotated Attributes

In order to evaluate the classification and detection challenges, the image annotation includes the following attributes for every object in the target set of object classes:

- class: one of: aeroplane, bird, bicycle, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, tv/monitor.
- bounding box: an axis-aligned bounding box surrounding the extent of the object visible in the image.

The choice of an axis-aligned bounding box for the annotation is a compromise: for some object classes it fits quite well (e.g. to a horizontal bus or train) with only a small proportion of non-class pixels; however, for other classes it can be a poor fit either because they are not box shaped (e.g. a person with their arms outstretched, a chair) or/and because they are not axis-aligned (e.g. an aeroplane taking off). The advantage though is that they are relatively quick to annotate. We return to this point when discussing pixel level annotation in Sect. 3.6.1.

In addition, since VOC2006, further annotations were introduced which could be used during training but which were not required for evaluation:

- viewpoint: one of: front, rear, left, right, unspecified. This annotation supports methods which treat different viewpoints differently during training, such as using separate detectors for each viewpoint.
- truncation: an object is said to be "truncated" when the bounding box in the image does not correspond to the full extent of the object. This may occur for two reasons: (a) the object extends outside the image, e.g. an image of a person from the waist up; (b) the boundary of the object is occluded, e.g. a person standing behind a wall. The aim of including this annotation was to support recognition methods which require images of an entire object as training data, for example assuming that the bounding boxes of the objects can be aligned.

For the VOC2008 challenge, objects are additionally annotated as "occluded" if a high level of occlusion is present. This overcomes a limitation of the VOC2007 dataset that "clean" training examples without occlusion cannot automatically be identified from the available annotation.

- difficult: labels objects which are particularly difficult to detect due to small size, illumination, image quality or the need to use significant contextual information. In the challenge evaluation, such objects are discarded, although no penalty is incurred for detecting them. The aim of this annotation is to maintain a reasonable level of difficulty while not contaminating the evaluation with many near-unrecognisable examples.

Fig. 3 Example of the "difficult" annotation. Objects shown dashed have been marked difficult, and are excluded from the evaluation. Note that the judgement of difficulty is not solely by object size: the distant car on the right of the image is included in the evaluation.

Figure 3 shows an example of the "difficult" annotation. The criteria used to judge an object difficult included confidence in the class label (e.g. is it certain that all the animals in Fig. 3 are cows? Sometimes we see sheep in the same field), object size, level of occlusion, imaging factors e.g. motion blur, and requirement for significant context to enable recognition. Note that by marking difficult examples, rather than discarding them, the data should remain useful as methods able to cope with such examples are developed. Furthermore, as noted, any current methods able to detect difficult objects are not penalised for doing so.

3.4 Image Annotation Procedure

The VOC2007 annotation procedure was designed to be:

- consistent, so that the annotation of the images is consistent, in terms of the definition of the classes, how bounding boxes are placed, and how viewpoints and truncation are defined.
- accurate, so that there are as few annotation errors as possible.
- exhaustive, so that all object instances are labelled.

Consistency was achieved by having all annotation take place at a single annotation "party" at the University of Leeds, following a set of annotation guidelines which were discussed in detail with the annotators. The guidelines covered aspects including: what to label; how to label pose and bounding box; how to treat occlusion; acceptable image quality; how to label clothing/mud/snow, transparency, mirrors, and pictures. The full guidelines (Winn and Everingham 2007) are available on the WWW. In addition, during the annotation process, annotators were periodically observed to ensure that the guidelines were being followed. Several current annotation projects rely on untrained annotators or have annotators geographically distributed, e.g. LabelMe (Russell et al. 2008), or even ignorant of their task, e.g. the ESP Game (von Ahn and Dabbish 2004). It is very difficult to maintain consistency of annotation in these circumstances, unlike when all annotators are trained, monitored and co-located.

Fig. 4 Summary of the entire VOC2007 dataset. Histogram by class of the number of objects and images containing at least one object of the corresponding class. Note the log scale.

Table 2 Statistics of the VOC2007 dataset. The data is divided into two main subsets: training/validation data (trainval), and test data (test), with the trainval data further divided into suggested training (train) and validation (val) sets. For each subset and class, the number of images (containing at least one object of the corresponding class) and number of object instances are shown. Note that because images may contain objects of several classes, the totals shown in the image columns are not simply the sum of the corresponding column.

train val trainval test


images objects images objects images objects images objects

Aeroplane 112 151 126 155 238 306 204 285


Bicycle 116 176 127 177 243 353 239 337
Bird 180 243 150 243 330 486 282 459
Boat 81 140 100 150 181 290 172 263
Bottle 139 253 105 252 244 505 212 469
Bus 97 115 89 114 186 229 174 213
Car 376 625 337 625 713 1,250 721 1,201
Cat 163 186 174 190 337 376 322 358
Chair 224 400 221 398 445 798 417 756
Cow 69 136 72 123 141 259 127 244
Dining table 97 103 103 112 200 215 190 206
Dog 203 253 218 257 421 510 418 489
Horse 139 182 148 180 287 362 274 348
Motorbike 120 167 125 172 245 339 222 325
Person 1,025 2,358 983 2,332 2,008 4,690 2,007 4,528
Potted plant 133 248 112 266 245 514 224 480
Sheep 48 130 48 127 96 257 97 242
Sofa 111 124 118 124 229 248 223 239
Train 127 145 134 152 261 297 259 282
Tv/monitor 128 166 128 158 256 324 229 308

Total 2,501 6,301 2,510 6,307 5,011 12,608 4,952 12,032

Following the annotation party, the accuracy of each annotation was checked by one of the organisers, including checking for omitted objects to ensure exhaustive labelling. To date, only one error has been reported on the VOC2007 dataset, which was a viewpoint marked as "unspecified" rather than "frontal". During the checking process, the "difficult" annotation was applied to objects judged as difficult to recognise. As checking the annotation is an extremely time-consuming process, for VOC2008 this has been incorporated into the annotation party, with each image checked for completeness and each object checked for accuracy, by one of the annotators. As in previous years, the "difficult" annotation was applied by one of the organisers to ensure consistency. We return to the question of the expense, in terms of person hours, of annotation and checking, in Sect. 7.3.

3.5 Dataset Statistics

Table 2 summarises the statistics of the VOC2007 dataset. For the purposes of the challenge, the data is divided into two main subsets: training/validation data (trainval), and test data (test). For participants' convenience, the trainval data is further divided into suggested training (train) and validation (val) sets, however participants are free to use any data in the trainval set for training, for example if a given method does not require a separate validation set. The total number of annotated images is 9,963, roughly double the 5,304 images annotated for VOC2006. The number of annotated objects similarly rose from 9,507 to 24,640. Since the number of classes doubled from 10 to 20, the average number of objects of each class increased only slightly from 951 to 1,232, dominated by a quadrupling of the number of annotated people.

Figure 4 shows a histogram of the number of images and objects in the entire dataset for each class. Note that these counts are shown on a log scale. The person class is by far the most frequent, with 9,218 object instances vs. 421 (dining table) to 2,421 (car) for the other classes. This is a natural consequence of requiring each image to be completely annotated: most flickr images can be characterised as "snapshots", e.g. family holidays, birthdays, parties, etc., and so many objects appear only incidentally in images where people are the subject of the photograph.

While the properties of objects in the dataset such as size and location in the image can be considered representative of flickr as a whole, the same cannot be said about the frequency of occurrence of each object class. In order to provide a reasonable minimum number of images/objects per class to participants, both for training and evaluation, certain minority classes, e.g. "sheep", were targeted toward the end of the annotation party to increase their numbers: annotators were instructed to discard all images not containing one of the minority classes.
Examples of certain classes, e.g. "sheep" and "bus", proved difficult to collect, due either to lack of relevant keyword annotation by flickr users, or lack of photographs containing these classes.

Fig. 5 Example images and annotation for the taster competitions. (a) Segmentation taster annotation showing object and class segmentation. Border regions are marked with the "void" label indicating that they may be object or background. Difficult objects are excluded by masking with the void label. (b) Person Layout taster annotation showing bounding boxes for head, hands and feet.

3.6 Taster Competitions

Annotation was also provided for the newly introduced segmentation and person layout taster competitions. The idea behind these competitions is to allow systems to demonstrate a more detailed understanding of the image, such that objects can be localised down to the pixel level, or an object's parts (e.g. a person's head, hands and feet) can be localised within the object. As for the main competitions, the emphasis was on consistent, accurate and exhaustive annotation.

3.6.1 Segmentation

For the segmentation competition, a subset of images from each of the main datasets was annotated with pixel-level segmentations of the visible region of all contained objects. These segmentations act as a refinement of the bounding box, giving more precise shape and localisation information. In deciding how to provide pixel annotation, it was necessary to consider the trade-off between accuracy and annotation time: providing pixel-perfect annotation is extremely time intensive. To give high accuracy but to keep the annotation time short enough to provide a large image set, a border area of 5 pixels width was allowed around each object where the pixels were labelled neither object nor background (these were marked "void" in the data, see Fig. 5a). Annotators were also provided with detailed guidelines to ensure consistent segmentation (Winn and Everingham 2007). In keeping with the main competitions, difficult examples of objects were removed from both training and test sets by masking these objects with the void label.

The object segmentations, where each pixel is labelled with the identifier of a particular object, were used to create class segmentations (see Fig. 5a for examples) where each pixel is assigned a class label. These were provided to encourage participation from class-based methods, which output a class label per pixel but which do not output an object identifier, e.g. do not segment adjacent objects of the same class. Participants' results were submitted in the form of class segmentations, where the aim is to predict the correct class label for every pixel not labelled in the ground truth as void.

Table 3 summarises the statistics of the segmentation dataset. In total, 422 images containing 1,215 segmented objects were provided in the combined training/validation set. The test set contained 210 images and 607 objects.

3.6.2 Person Layout

For the person layout competition, a subset of "person" objects in each of the main datasets was annotated with information about the 2-D pose or layout of the person. For each person, three types of part were annotated with bounding boxes: the head, hands, and feet, see Fig. 5b. These parts were chosen to give a good approximation of the overall pose of a person, and because they can be annotated with relative speed and accuracy compared to e.g. annotation of a "skeleton" structure, where uncertainty in the position of the limbs and joints is hard to avoid. Annotators selected images to annotate which were of sufficient size such that there was no uncertainty in the position of the parts, and where the head and at least one other part were visible; no other criteria were used to filter suitable images. Figure 5b shows some example images, including partial occlusion (upper-left), challenging lighting (upper-right), and non-standard pose (lower-left).

Table 3 Statistics of the VOC2007 segmentation dataset. The data is divided into two main subsets: training/validation data (trainval), and test data (test), with the trainval data further divided into suggested training (train) and validation (val) sets. For each subset and class, the number of images (containing at least one object of the corresponding class) and number of object instances are shown. Note that because images may contain objects of several classes, the totals shown in the image columns are not simply the sum of the corresponding column. All objects in each image are segmented, with every pixel of the image being labelled as one of the object classes, "background" (not one of the annotated classes) or "void" (uncertain, i.e. near an object boundary).

train val trainval test


images objects images objects images objects images objects

Aeroplane 12 17 13 16 25 33 15 15
Bicycle 11 16 10 16 21 32 11 15
Bird 13 15 13 20 26 35 12 15
Boat 11 15 9 29 20 44 13 16
Bottle 17 30 13 28 30 58 13 20
Bus 14 16 11 15 25 31 12 17
Car 14 34 17 36 31 70 24 58
Cat 15 15 15 18 30 33 14 17
Chair 26 52 20 48 46 100 21 49
Cow 11 27 10 16 21 43 10 26
Diningtable 14 15 17 17 31 32 14 15
Dog 17 20 14 19 31 39 13 18
Horse 15 18 17 19 32 37 11 16
Motorbike 11 15 15 16 26 31 13 19
Person 92 194 79 154 171 348 92 179
Pottedplant 17 33 17 45 34 78 11 25
Sheep 8 41 13 22 21 63 10 27
Sofa 17 22 13 15 30 37 15 16
Train 8 14 15 17 23 31 16 17
Tvmonitor 20 24 13 16 33 40 17 27

Total 209 633 213 582 422 1,215 210 607

In total, the training/validation set contained 439 annotated people in 322 images, and the test set 441 annotated people in 441 images.

4 Submission and Evaluation

The submission and evaluation procedures for the VOC2007 challenge competitions were designed to be fair, to prevent over-fitting, and to demonstrate clearly the differences in accuracy between different methods.

4.1 Submission of Results

The running of the VOC2007 challenge consisted of two phases. At the start of the challenge, participants were issued a development kit comprising training/validation images with annotation, and MATLAB⁷ software to access the annotation (stored in an XML format compatible with LabelMe (Russell et al. 2008)), to compute the evaluation measures, and including simple baseline implementations for each competition. In the second phase, un-annotated test images were distributed. Participants were then required to run their methods on the test data and submit results as defined in Sect. 4.2. The test data was available for approximately three months before submission of results; this allowed substantial time for processing, and aimed to not penalise computationally expensive methods, or groups with access to only limited computational resources.

⁷ MATLAB is a registered trademark of The MathWorks, Inc.

Withholding the annotation of the test data until completion of the challenge played a significant part in preventing over-fitting of the parameters of classification or detection methods. In the VOC2005 challenge, test annotation was released and this led to some optimistic reported results, where a number of parameter settings had been run on the test set, and only the best reported. This danger emerges in any evaluation initiative where ground truth is publicly available.

Because the test data is in the form of images, it is also theoretically possible for participants to hand-label the test data, or "eyeball" test results; this is in contrast to e.g. machine learning benchmarks where the test data may be sufficiently "abstract" such that it cannot easily be labelled by a non-specialist. We rely on the participants' honesty, and the limited time available between release of the test data and submission of results, to minimise the possibility of manual labelling. The possibility could be avoided by requiring participants to submit code for their methods, and never release the test images. However, this makes the evaluation task difficult for both participants and organisers, since methods may use a mixture of MATLAB/C code, proprietary libraries, require significant computational resources, etc. It is worth noting, however, that results submitted to the VOC challenge, rather than afterward using the released annotation data, might appropriately be accorded higher status since participants have limited opportunity to experiment with the test data.

In addition to withholding the test data annotation, it was also required that participants submit only a single result per method, such that the organisers were not asked to choose the best result for them. Participants were not required to provide classification or detection results for all 20 classes, to encourage participation from groups having particular expertise in e.g. person or vehicle detection.

4.2 Evaluation of Results

Evaluation of results on multi-class datasets such as VOC2007 poses several problems: (i) for the classification task, images contain instances of multiple classes, so a "forced choice" paradigm such as that adopted by Caltech 256 (Griffin et al. 2007), "which one of m classes does this image contain?", cannot be used; (ii) the prior distribution over classes is significantly nonuniform, so a simple accuracy measure (percentage of correctly classified examples) is not appropriate. This is particularly salient in the detection task, where sliding window methods will encounter many thousands of negative (non-class) examples for every positive example. In the absence of information about the cost or risk of misclassifications, it is necessary to evaluate the trade-off between different types of classification error; (iii) evaluation measures need to be algorithm-independent, for example in the detection task participants have adopted a variety of methods, e.g. sliding window classification, segmentation-based, constellation models, etc. This prevents the use of some previous evaluation measures such as the Detection Error Tradeoff (DET) commonly used for evaluating pedestrian detectors (Dalal and Triggs 2005), since this is applicable only to sliding window methods constrained to a specified window extraction scheme, and to data with cropped positive test examples.

Both the classification and detection tasks were evaluated as a set of 20 independent two-class tasks: e.g. for classification "is there a car in the image?", and for detection "where are the cars in the image (if any)?". A separate "score" is computed for each of the classes. For the classification task, participants submitted results in the form of a confidence level for each image and for each class, with larger values indicating greater confidence that the image contains the object of interest. For the detection task, participants submitted a bounding box for each detection, with a confidence level for each bounding box. The provision of a confidence level allows results to be ranked such that the trade-off between false positives and false negatives can be evaluated, without defining arbitrary costs on each type of classification error.

Average Precision (AP)  For the VOC2007 challenge, the interpolated average precision (Salton and McGill 1986) was used to evaluate both classification and detection. For a given task and class, the precision/recall curve is computed from a method's ranked output. Recall is defined as the proportion of all positive examples ranked above a given rank. Precision is the proportion of all examples above that rank which are from the positive class. The AP summarises the shape of the precision/recall curve, and is defined as the mean precision at a set of eleven equally spaced recall levels [0, 0.1, ..., 1]:

    AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p_{\mathrm{interp}}(r)    (1)

The precision at each recall level r is interpolated by taking the maximum precision measured for a method for which the corresponding recall exceeds r:

    p_{\mathrm{interp}}(r) = \max_{\tilde{r} : \tilde{r} \ge r} p(\tilde{r})    (2)

where p(r̃) is the measured precision at recall r̃.

The intention in interpolating the precision/recall curve in this way is to reduce the impact of the "wiggles" in the precision/recall curve, caused by small variations in the ranking of examples. It should be noted that to obtain a high score, a method must have precision at all levels of recall; this penalises methods which retrieve only a subset of examples with high precision (e.g. side views of cars).

The use of precision/recall and AP replaced the "area under curve" (AUC) measure of the ROC curve used in VOC2006 for the classification task. This change was made to improve the sensitivity of the metric (in VOC2006 many methods were achieving greater than 95% AUC), to improve interpretability (especially for image retrieval applications), to give increased visibility to performance at low recall, and to unify the evaluation of the two main competitions. A comparison of the two measures on VOC2006 showed that the ranking of participants was generally in agreement but that the AP measure highlighted differences between methods to a greater extent.

Bounding Box Evaluation  As noted, for the detection task, participants submitted a list of bounding boxes with associated confidence (rank). Detections were assigned to ground truth objects and judged to be true/false positives by measuring bounding box overlap. To be considered a correct detection, the overlap ratio a_o between the predicted bounding box B_p and ground truth bounding box B_gt must exceed 0.5 (50%) by the formula

    a_o = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}    (3)

where B_p ∩ B_gt denotes the intersection of the predicted and ground truth bounding boxes and B_p ∪ B_gt their union.

The threshold of 50% was set deliberately low to account for inaccuracies in bounding boxes in the ground truth data; for example, defining the bounding box for a highly non-convex object, e.g. a person with arms and legs spread, is somewhat subjective. Section 6.2.3 evaluates the effect of this threshold on the measured average precision. We return to the question of the suitability of bounding box annotation in Sect. 7.3.

Detections output by a method were assigned to ground truth objects satisfying the overlap criterion in order ranked by the (decreasing) confidence output. Multiple detections of the same object in an image were considered false detections, e.g. 5 detections of a single object counted as 1 correct detection and 4 false detections; it was the responsibility of the participant's system to filter multiple detections from its output.
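As an illustration, the overlap test of (3) and the greedy assignment just described can be sketched as follows (boxes as (xmin, ymin, xmax, ymax) tuples). This is a simplified sketch rather than the official evaluation code; in particular it ignores the "difficult" objects which the real evaluation discards.

```python
def overlap_ratio(bp, bgt):
    """Overlap ratio a_o of (3): intersection area over union area."""
    iw = max(0.0, min(bp[2], bgt[2]) - max(bp[0], bgt[0]))
    ih = max(0.0, min(bp[3], bgt[3]) - max(bp[1], bgt[1]))
    inter = iw * ih
    union = ((bp[2] - bp[0]) * (bp[3] - bp[1])
             + (bgt[2] - bgt[0]) * (bgt[3] - bgt[1]) - inter)
    return inter / union if union > 0 else 0.0

def assign_detections(detections, ground_truth, threshold=0.5):
    """Greedy assignment in order of decreasing confidence: each ground-truth box
    may be matched at most once; further detections of it are false positives.
    detections: list of (confidence, box); ground_truth: list of boxes.
    Returns a list of (confidence, is_true_positive) pairs."""
    matched = [False] * len(ground_truth)
    outcomes = []
    for conf, box in sorted(detections, key=lambda d: -d[0]):
        overlaps = [overlap_ratio(box, gt) for gt in ground_truth]
        best = max(range(len(ground_truth)), key=overlaps.__getitem__, default=None)
        if best is not None and overlaps[best] >= threshold and not matched[best]:
            matched[best] = True
            outcomes.append((conf, True))
        else:
            outcomes.append((conf, False))
    return outcomes
```

The resulting list of (confidence, true/false positive) pairs is exactly the ranked output that the AP computation sketched above consumes.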
4.2.1 Evaluation of the Segmentation Taster

A common measure used to evaluate segmentation methods is the percentage of pixels correctly labelled. For the VOC2007 segmentation taster, this measure was used per class by considering only pixels labelled with that class in the ground truth annotation. Reporting a per-class accuracy in this way allowed participants to enter segmentation methods which handled only a subset of the classes. However, this evaluation scheme can be misleading: for example, labelling all pixels "car" leads to a perfect score on the car class (though not the other classes). Biases in different methods can hence lead to misleading high or low accuracies on individual classes. To rectify this problem, the VOC2008 segmentation challenge will be assessed on a modified per-class measure based on the intersection of the inferred segmentation and the ground truth, divided by the union:

    \mathrm{seg.\ accuracy} = \frac{\mathrm{true\ pos.}}{\mathrm{true\ pos.} + \mathrm{false\ pos.} + \mathrm{false\ neg.}}    (4)

Pixels marked "void" in the ground truth are excluded from this measure. Compared to VOC2007, the measure penalises methods which have high false positive rates (i.e. that incorrectly mark non-class pixels as belonging to the target class). The per-class measure should hence give a more interpretable evaluation of the performance of individual methods.
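A minimal sketch of the revised per-class measure in (4), assuming integer label maps for prediction and ground truth, and (as an assumed convention for this sketch) a reserved index for "void" pixels:

```python
import numpy as np

def per_class_segmentation_accuracy(pred, gt, class_index, void_index=255):
    """Per-class accuracy of (4): true positives over the union of predicted and
    ground-truth pixels of the class, with void pixels excluded entirely.
    pred, gt: integer label maps of identical shape."""
    valid = gt != void_index
    pred_c = (pred == class_index) & valid
    gt_c = (gt == class_index) & valid
    tp = np.logical_and(pred_c, gt_c).sum()
    fp = np.logical_and(pred_c, ~gt_c).sum()
    fn = np.logical_and(~pred_c, gt_c).sum()
    denom = tp + fp + fn
    return tp / denom if denom > 0 else 0.0
```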
4.2.2 Evaluation of the Person Layout Taster

The person layout taster was treated as an extended detection task. Methods were evaluated using the same AP measure used for the main detection competition. The criterion for a correct detection, however, was extended to require correct prediction of (i) the set of visible parts (head/hands/feet); (ii) correct bounding boxes for all parts, using the standard overlap threshold of 50%.

As reported in Sect. 6.4 this evaluation criterion proved extremely challenging. In the VOC2008 challenge, the evaluation has been relaxed by providing person bounding boxes for the test data (disjoint from the main challenge test set), so that methods are not required to complete the detection part of the task, but only estimate part identity and location.

5 Methods

Table 4 summarises the participation in the VOC2007 challenge. A total of 16 institutions submitted results (c.f. 16 in 2006 and 9 in 2005). Taking into account multiple groups in an institution and multiple methods per group, there were a total of 28 methods submitted (c.f. 25 in 2006, 13 in 2005).

5.1 Classification Methods

There were 17 entries for the classification task in 2007, compared to 14 in 2006 and 9 in 2005.

Many of the submissions used variations on the basic bag-of-visual-words method (Csurka et al. 2004; Sivic and Zisserman 2003) that was so successful in VOC2006, see Zhang et al. (2007): local features are computed (for example SIFT descriptors); vector quantised (often by using k-means) into a visual vocabulary or codebook; and each image is then represented by a histogram of how often the local features are assigned to each visual word. The representation is known as a bag-of-visual-words in analogy with the bag-of-words (BOW) text representation, where the frequency, but not the position, of words is used to represent text documents. It is also known as a bag-of-keypoints or bag-of-features. The classifier is typically a support vector machine (SVM) with a χ2 or Earth Mover's Distance (EMD) kernel.
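The basic pipeline is easy to sketch. The following illustrative Python code (using scikit-learn; it is not any particular entry's implementation) assumes local descriptors such as SIFT have already been extracted for each image:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def build_vocabulary(training_descriptor_sets, n_words=1000):
    """Vector quantisation: cluster pooled local descriptors into a visual
    vocabulary (codebook) of n_words centres using k-means."""
    return KMeans(n_clusters=n_words).fit(np.vstack(training_descriptor_sets))

def bow_histogram(descriptors, vocabulary):
    """Represent one image as a normalised histogram of visual-word counts."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_bow_classifier(histograms, labels):
    """Binary presence/absence classifier: an SVM on a chi-squared kernel
    between BOW histograms (one such classifier per object class)."""
    K = chi2_kernel(histograms)            # precomputed kernel matrix
    return SVC(kernel="precomputed").fit(K, labels)
```

At test time the kernel between test and training histograms, chi2_kernel(test_hists, train_hists), is passed to the trained SVM's decision_function to obtain the per-image confidence required by the classification task.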

Table 4 Participation in the VOC2007 challenge. Each method is assigned an abbreviation used in the text, and identified as a classification method (Cls) or detection method (Det). The contributors to each method are listed with references to publications describing the method, where available.

Darmstadt: Mario Fritz and Bernt Schiele, TU Darmstadt. References: Fritz and Schiele (2008)

INRIA_Flat, INRIA_Genetic: Marcin Marszalek, Cordelia Schmid, Hedi Harzallah and Joost Van-de-Weijer, INRIA Rhone-Alpes. References: Zhang et al. (2007); van de Weijer and Schmid (2006); Ferrari et al. (2008)

INRIA_Larlus: Diane Larlus and Frederic Jurie, INRIA Rhone-Alpes

INRIA_Normal, INRIA_PlusClass: Hedi Harzallah, Cordelia Schmid, Marcin Marszalek, Vittorio Ferrari, Y-Lan Boureau, Jean Ponce and Frederic Jurie, INRIA Rhone-Alpes. References: Ferrari et al. (2008); van de Weijer and Schmid (2006); Zhang et al. (2007)

IRISA: Ivan Laptev, IRISA/INRIA Rennes and Evgeniy Tarassov, TT-Solutions. References: Laptev (2006)

MPI_BOW, MPI_Center, MPI_ESSOL: Christoph Lampert and Matthew Blaschko, MPI Tuebingen. References: Lampert et al. (2008)

Oxford: Ondrej Chum and Andrew Zisserman, University of Oxford. References: Chum and Zisserman (2007)

PRIPUVA: Julian Stottinger and Allan Hanbury, Vienna University of Technology; Nicu Sebe and Theo Gevers, University of Amsterdam. References: Stoettinger et al. (2007)

QMUL_HSLS, QMUL_LSPCH: Jianguo Zhang, Queen Mary University of London. References: Zhang et al. (2007)

TKK: Ville Viitaniemi and Jorma Laaksonen, Helsinki University of Technology. References: Viitaniemi and Laaksonen (2008)

ToshCam_rdf, ToshCam_svm: Jamie Shotton, Toshiba Corporate R&D Center, Japan & Matthew Johnson, University of Cambridge

Tsinghua: Dong Wang, Xiaobing Liu, Cailiang Liu, Zhang Bo and Jianmin Li, Tsinghua University. References: Wang et al. (2006); Liu et al. (2007)

UoCTTI: Pedro Felzenszwalb, University of Chicago; David McAllester and Deva Ramanan, Toyota Technological Institute, Chicago. References: Felzenszwalb et al. (2008)

UVA_Bigrams, UVA_FuseAll, UVA_MCIP, UVA_SFS, UVA_WGT: Koen van de Sande, Jan van Gemert and Jasper Uijlings, University of Amsterdam. References: van de Sande et al. (2008); van Gemert et al. (2006); Geusebroek (2006); Snoek et al. (2006, 2005)

XRCE: Florent Perronnin, Yan Liu and Gabriela Csurka, Xerox Research Centre Europe. References: Perronnin and Dance (2007)

Within this approach, submissions varied tremendously in the features used, with regard to both their type and their density. Sparse local features were detected using the Harris interest point operator and/or the SIFT detector (Lowe 2004), and then represented by the SIFT descriptor. There was some attention to exploring different colour spaces (such as HSI) in the detection for greater immunity to photometric effects such as shadows (PRIP-UvA).

INRIA_Larlus) computed descriptors on a dense grid, and The majority of the VOC2007 entries used a sliding
one submission (MPI) combined both sparse and dense de- window approach to the detection task or variants thereof.
scriptors. In addition to SIFT, other descriptors included lo- In the basic sliding window method a rectangular window
cal colour, pairs of adjacent segments (PAS) (Ferrari et al. of the image is taken, features are extracted from this win-
2008), and Sobel edge histograms. dow, and it is then classified as either containing an instance
The BOW representation was still very common, where of a given class or not. This classifier is then run exhaus-
spatial information, such as the position of the descriptors tively over the image at varying location and scale. In order
is disregarded. However, several participants provided ad- to deal with multiple nearby detections a non-maximum
ditional representations (channels) for each image where as suppression stage is then usually applied. Prominent exam-
well as the BOW, spatial information was included by var- ples of this method include the Viola and Jones (2004) face
ious tilings of the image (INRIA_Genetic, INRIA_Flat), or detector and the Dalal and Triggs (2005) pedestrian detec-
using a spatial pyramid (TKK). tor.
While most submissions used a kernel SVM as the clas- The entries Darmstadt, INRIA_Normal, INRIA_Plus
sifier (with kernels including 2 and EMD), XRCE used lo- Class and IRISA were essentially sliding window meth-
gistic regression with a Fisher kernel (Perronnin and Dance ods, with the enhancements that INRIA_PlusClass also
2007), and ToshCam used a random forest classifier. utilised the output of a whole image classifier, and that
Where there was greatest diversity was in the meth- IRISA also trained separate detectors for person-on-X where
ods for combining the multiple representations (channels). X was horse, bicycle, or motorbike. Two variations on
Some methods investigated late fusion where a classifier the sliding window method avoided dense sampling of
is trained on each channel independently, and then a second the test image: The Oxford entry used interest point de-
classifier combines the results. For example TKK used this
tection to select candidate windows, and then applied an
approach, for details see Viitaniemi and Laaksonen (2008).
SVM classifier; see Chum and Zisserman (2007) for de-
Tsinghua combined the individual classifiers using Rank-
tails. The MPI_ESSOL entry (Lampert et al. 2008) used a
Boost. INRIA entered two methods using the same chan-
branch-and-bound scheme to efficiently maximise the clas-
nels, but differing in the manner in which they were com-
sifier function (based on a BOW representation, or pyra-
bined: INRIA_Flat uses uniform weighting on each feature
mid match kernel (Grauman and Darrell 2005) determined
(following Zhang et al. 2007); INRIA_Genetic uses a differ-
on a per-class basis at training time) over all possible win-
ent class-dependent weight for each feature, learnt from the
dows.
validation data by a genetic algorithm search.
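As an illustration of this "late fusion" strategy, the following is a minimal sketch, not the code of any submission: the per-channel χ² kernel SVM, the second-stage logistic regression, and all function names are assumptions made for the example.

```python
# Minimal late-fusion sketch: one SVM per feature channel, then a second-stage
# classifier trained on the per-channel scores. Channel layout is assumed to be
# a dict of BOW histogram arrays over the same images.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import chi2_kernel

def train_late_fusion(channels_train, y_train):
    """channels_train: dict mapping channel name -> (n_images, dim) histogram array."""
    per_channel = {}
    scores = []
    for name, X in channels_train.items():
        K = chi2_kernel(X, X)                      # chi-squared kernel on BOW histograms
        clf = SVC(kernel="precomputed").fit(K, y_train)
        per_channel[name] = (clf, X)
        scores.append(clf.decision_function(K))    # per-channel confidence on training images
    fusion = LogisticRegression().fit(np.column_stack(scores), y_train)
    return per_channel, fusion

def predict_late_fusion(per_channel, fusion, channels_test):
    scores = []
    for name, (clf, X_train) in per_channel.items():
        K = chi2_kernel(channels_test[name], X_train)
        scores.append(clf.decision_function(K))
    return fusion.decision_function(np.column_stack(scores))   # fused confidence per image
```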
In 2006, several of the submissions tackled the classification task as detection: "there is a car here, so the image contains a car". This approach is perhaps more in line with human intuition about the task, in comparison to the global classification methods which establish the presence of an object without localising it in the image. However, in 2007 no submissions used this approach.

The VOC challenge invites submission of results from "off-the-shelf" systems or methods trained on data other than that provided for the challenge (see Sect. 2.1), to be evaluated separately from those using only the provided data. No results were submitted to VOC2007 in this category. This is disappointing, since it prevents answering the question as to how well current methods perform given unlimited training data, or more detailed annotation of training data. It is an open question how to encourage submission of results from e.g. commercial systems.
5.2 Detection Methods

There were 9 entries for the detection task in 2007, compared to 9 in 2006 and 5 in 2005. As for the classification task, all submitted methods were trained only on the provided training data.

The majority of the VOC2007 entries used a "sliding window" approach to the detection task or variants thereof. In the basic sliding window method a rectangular window of the image is taken, features are extracted from this window, and it is then classified as either containing an instance of a given class or not. This classifier is then run exhaustively over the image at varying location and scale. In order to deal with multiple nearby detections a non-maximum suppression stage is then usually applied. Prominent examples of this method include the Viola and Jones (2004) face detector and the Dalal and Triggs (2005) pedestrian detector.

The entries Darmstadt, INRIA_Normal, INRIA_PlusClass and IRISA were essentially sliding window methods, with the enhancements that INRIA_PlusClass also utilised the output of a whole image classifier, and that IRISA also trained separate detectors for "person-on-X", where X was horse, bicycle, or motorbike. Two variations on the sliding window method avoided dense sampling of the test image: the Oxford entry used interest point detection to select candidate windows, and then applied an SVM classifier; see Chum and Zisserman (2007) for details. The MPI_ESSOL entry (Lampert et al. 2008) used a branch-and-bound scheme to efficiently maximise the classifier function (based on a BOW representation, or pyramid match kernel (Grauman and Darrell 2005), determined on a per-class basis at training time) over all possible windows.

The UoCTTI entry used a more complex variant of the sliding window method; see Felzenszwalb et al. (2008) for details. It combines the outputs of a coarse window and several higher-resolution part windows which can move relative to the coarse window; inference over location of the parts is performed for each coarse image window. Note that improved results are reported in Felzenszwalb et al. (2008) relative to those in Table 6; these were achieved after the public release of the test set annotation.

The method proposed by TKK automatically segments an image to extract candidate bounding boxes and then classifies these bounding boxes; see Viitaniemi and Laaksonen (2008) for details. The MPI_Center entry was a baseline that returns exactly one object bounding box per image; the box is centred and is 51% of the total image area.

In previous VOC detection competitions there had been a greater diversity of methods used for the detection problem; see Everingham et al. (2006b) for more details. For example in VOC2006 the Cambridge entry used a classifier to predict a class label at each pixel, and then computed contiguously segmented regions; the TU Darmstadt entry made use of the Implicit Shape Model (ISM) (Leibe et al. 2004); and the MIT_Fergus entry used the constellation model (Fergus et al. 2007).
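The basic sliding-window-plus-suppression pipeline described above can be sketched as follows. This is a schematic illustration under assumed window sizes, strides and scales, not the implementation of any entry; `score_window` stands in for an arbitrary window classifier.

```python
# Sliding-window detection with greedy non-maximum suppression (illustrative only).
import numpy as np

def sliding_window_detections(image, score_window, win=(128, 64), stride=8, scales=(1.0, 0.8, 0.64)):
    h, w = image.shape[:2]
    detections = []                                   # (score, xmin, ymin, xmax, ymax)
    for s in scales:
        wh, ww = int(win[0] / s), int(win[1] / s)     # vary the window size over scales;
        for y in range(0, h - wh + 1, stride):        # score_window is assumed to resize its
            for x in range(0, w - ww + 1, stride):    # input to the classifier's canonical size
                score = score_window(image[y:y + wh, x:x + ww])
                detections.append((score, x, y, x + ww, y + wh))
    return detections

def overlap(a, b):
    """Intersection-over-union of two boxes (xmin, ymin, xmax, ymax)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_maximum_suppression(detections, thresh=0.5):
    kept = []
    for score, *box in sorted(detections, reverse=True):      # most confident first
        if all(overlap(box, k[1]) < thresh for k in kept):     # suppress nearby duplicates
            kept.append((score, box))
    return kept
```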
6 Results

This section reports and discusses the results of the VOC2007 challenge. Full results, including precision/recall curves for all classes, not all of which are shown here due to space constraints, can be found on the VOC2007 web-site (Everingham et al. 2007).

6.1 Classification

This section reports and discusses the results of the classification task. A total of 17 methods were evaluated. All participants tackled all of the 20 classes. Table 5 lists the AP for all submitted methods and classes. For each class the method obtaining the greatest AP is identified in bold, and the methods with 2nd and 3rd greatest AP in italics. Precision/recall curves for a representative sample of classes are shown in Fig. 6. Results are shown ordered by decreasing maximum AP. The left column shows all results, while the right column shows the top five results by AP. The left column also shows the chance performance, obtained by a classifier outputting a random confidence value without examining the input image: the precision/recall curve and corresponding AP measure are not invariant to the proportion of positive images, resulting in varying chance performance across classes. We discuss this further in Sect. 6.1.3.

6.1.1 Classification Results by Method

Overall the INRIA_Genetic method stands out as the most successful method, obtaining the greatest AP in 19 of the 20 classes. The related INRIA_Flat method achieves very similar performance, with AP between the two methods differing by just 1–2% for most classes. As described in Sect. 5.1 these methods use the same set of heterogeneous image features, and differ only in the way that features are fused in a generalised radial basis function (RBF) kernel: INRIA_Flat uses uniform weighting on each feature, and INRIA_Genetic learns a different weight for each feature from the validation data. The XRCE method comes third in 17 of 20 classes and first in one. This method differs from the INRIA methods in using a Fisher kernel representation of the distribution of visual features within the image, and uses a smaller feature set and logistic regression cf. the kernel SVM classifier used by the INRIA methods.

Figure 7 summarises the performance of all methods, plotting the median AP for each method computed over all classes, and ordered by decreasing median AP. Despite the overall similarity in the features used, there is quite a wide range in accuracy of 32.1–57.5%, with one method (PRIPUVA) substantially lower at 21.1%. The high performing methods all combine multiple features (channels) though, and some (INRIA_Genetic, INRIA_Flat, TKK) include spatial information as well as BOW. Software for the feature descriptors used by the UVA methods (van de Sande et al. 2008) has been made publicly available, and would form a reasonable state-of-the-art baseline for future challenges.

6.1.2 Statistical Significance of Results

A question often overlooked by the computer vision community when comparing results on a given dataset is whether the difference in performance of two methods is statistically significant. For the VOC challenge, we wanted to establish whether, for the given dataset and set of classes, one method can be considered significantly more accurate than another. Note that this question is different to that investigated e.g. in the Caltech 101 challenge, where multiple training/test folds are used to establish the variance of the measured accuracy. Whereas that approach measures robustness of a method to differing data, we wish to establish significance for the given, fixed, dataset. This is salient, for example, when a method may not involve a training phase, or to compare against a commercial system trained on proprietary data.

Little work has considered the comparison of multiple classifiers over multiple datasets. We analysed the results of the classification task using a method proposed by Demsar (2006), specifically using the Friedman test with Nemenyi post hoc analysis. This approach uses only comparisons between the rank of a method (the method achieving the greatest AP is assigned rank 1, the 2nd greatest AP rank 2, etc.), and thus requires no assumptions about the distribution of AP to be made. Each class is treated as a separate test, giving one rank measurement per method and class. The analysis then consists of two steps: (i) the null hypothesis is made that the methods are equivalent and so their ranks should be equal. The hypothesis is tested by the Friedman test (a non-parametric variant of ANOVA), which follows a χ² distribution; (ii) having rejected the null hypothesis the differences in ranks are analysed by the Nemenyi test (similar to the Tukey test for ANOVA). The difference between mean ranks (over classes) for a pair of methods follows a modified Studentised range statistic. For a confidence level of p = 0.05 and given the 17 methods tested over 20 classes, the critical difference is calculated as 4.9: the difference in mean rank between a pair of methods must exceed 4.9 for the difference to be considered statistically significant.

Figure 8 visualises the results of this analysis using the CD (critical difference) diagram proposed by Demsar (2006). The x-axis shows the mean rank over classes for each method. Methods are shown clockwise from right to left in decreasing (first to last) rank order. Groups of methods for which the difference in mean rank is not significant are connected by horizontal bars. As can be seen, there is substantial overlap between the groups, with no clear clustering into sets of equivalent methods. Of interest is that the differences between the first six ranked methods (INRIA_Genetic, INRIA_Flat, XRCE, TKK, QMUL_LSPCH, QMUL_HSLS) cannot be considered statistically significant.
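The analysis can be reproduced along the following lines. This is a sketch assuming an array of AP scores per class and method; the Nemenyi critical value q_alpha must be taken from a table (Demsar 2006), and the placeholder value shown is chosen only so that the computed critical difference matches the 4.9 quoted above.

```python
# Friedman test with Nemenyi post hoc critical difference (sketch).
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_nemenyi(ap, q_alpha=3.07):
    """ap: (n_classes, n_methods) array of AP scores. q_alpha is a table value (assumed here)."""
    n_classes, n_methods = ap.shape
    # Friedman test: are the methods' mean ranks plausibly equal?
    stat, p_value = friedmanchisquare(*[ap[:, j] for j in range(n_methods)])
    # Rank methods within each class (rank 1 = greatest AP).
    ranks = np.vstack([rankdata(-ap[i]) for i in range(n_classes)])
    mean_ranks = ranks.mean(axis=0)
    # Nemenyi critical difference: pairs whose mean ranks differ by more than cd
    # are considered significantly different.
    cd = q_alpha * np.sqrt(n_methods * (n_methods + 1) / (6.0 * n_classes))
    return p_value, mean_ranks, cd
```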
Table 5 Classification results. For each object class and submission, the AP measure (%) is shown. Bold entries in each column denote the maximum AP for the corresponding class. Italic entries
denote the results ranked second or third

aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor

INRIA_Flat 74.8 62.5 51.2 69.4 29.2 60.4 76.3 57.6 53.1 41.1 54.0 42.8 76.5 62.3 84.5 35.3 41.3 50.1 77.6 49.3
INRIA_Genetic 77.5 63.6 56.1 71.9 33.1 60.6 78.0 58.8 53.5 42.6 54.9 45.8 77.5 64.0 85.9 36.3 44.7 50.6 79.2 53.2
INRIA_Larlus 62.6 54.0 32.8 47.5 17.8 46.4 69.6 44.2 44.6 26.0 38.1 34.0 66.0 55.1 77.2 13.1 29.1 36.7 62.7 43.3
MPI_BOW 58.9 46.0 31.3 59.0 16.9 40.5 67.2 40.2 44.3 28.3 31.9 34.4 63.6 53.5 75.7 22.3 26.6 35.4 60.6 40.6
PRIPUVA 48.6 20.9 21.3 17.2 6.4 14.2 45.0 31.4 27.4 12.3 14.3 23.7 30.1 13.3 62.0 10.0 12.4 13.3 26.7 26.2
QMUL_HSLS 70.6 54.8 35.7 64.5 27.8 51.1 71.4 54.0 46.6 36.6 34.4 39.9 71.5 55.4 80.6 15.8 35.8 41.5 73.1 45.5
QMUL_LSPCH 71.6 55.0 41.1 65.5 27.2 51.1 72.2 55.1 47.4 35.9 37.4 41.5 71.5 57.9 80.8 15.6 33.3 41.9 76.5 45.9
TKK 71.4 51.7 48.5 63.4 27.3 49.9 70.1 51.2 51.7 32.3 46.3 41.5 72.6 60.2 82.2 31.7 30.1 39.2 71.1 41.0
ToshCam_rdf 59.9 36.8 29.9 40.0 23.6 33.3 60.2 33.0 41.0 17.8 33.2 33.7 63.9 53.1 77.9 29.0 27.3 31.2 50.1 37.6
ToshCam_svm 54.0 27.1 30.3 35.6 17.0 22.3 58.0 34.6 38.0 19.0 27.5 32.4 48.0 40.7 78.1 23.4 21.8 28.0 45.5 31.8
Tsinghua 62.9 42.4 33.9 49.7 23.7 40.7 62.0 35.2 42.7 21.0 38.9 34.7 65.0 48.1 76.9 16.9 30.8 32.8 58.9 33.1
UVA_Bigrams 61.2 33.2 29.4 45.0 16.5 37.6 54.6 31.3 39.9 17.2 31.4 30.6 61.6 42.4 74.6 14.5 20.9 23.5 49.9 30.0
UVA_FuseAll 67.1 48.1 43.3 58.1 19.9 46.3 61.8 41.9 48.4 27.8 41.9 38.5 69.8 51.4 79.4 32.5 31.9 36.0 66.2 40.3
UVA_MCIP 66.5 47.9 41.0 58.0 16.8 44.0 61.2 40.5 48.5 27.8 41.7 37.1 66.4 50.1 78.6 31.2 32.3 31.9 66.6 40.3
UVA_SFS 66.3 49.7 43.5 60.7 18.8 44.9 64.8 41.9 46.8 24.9 42.3 33.9 71.5 53.4 80.4 29.7 31.2 31.8 67.4 43.5
UVA_WGT 59.7 33.7 34.9 44.5 22.2 32.9 55.9 36.3 36.8 20.6 25.2 34.7 65.1 40.1 74.2 26.4 26.9 25.1 50.7 29.7
XRCE 72.3 57.5 53.2 68.9 28.5 57.5 75.4 50.3 52.2 39.0 46.8 45.3 75.7 58.5 84.0 32.6 39.7 50.9 75.1 49.5
Fig. 6 Classification results. Precision/recall curves are shown for a representative sample of classes. The left column shows all results; the right
shows the top 5 results by AP. The legend indicates the AP (%) obtained by the corresponding method
Fig. 7 Summary of the classification results by method. For each method the median AP over all classes is shown

Fig. 8 Analysis of statistically significant differences in the classification results. The mean rank over all classes is plotted on the x-axis for each method. Methods which are not significantly different (p = 0.05), in terms of mean rank, are connected

Fig. 9 Summary of classification results by class. For each class three values are shown: the maximum AP obtained by any method (max), the median AP over all methods (median) and the AP obtained by a random ranking of the images (chance)

A limitation of this analysis is that the relatively small number of observations (20 classes per method) limits the power of the test. Increasing the number of classes will make it more feasible to establish significant differences between methods in terms of their performance over a wide range of classes. As discussed in Sect. 7.1, we are also keen to highlight differences in the approach taken by methods, not solely their performance.

6.1.3 Classification Results by Class

Figure 9 summarises the results obtained by object class, plotting for each class the maximum and median AP taken over all methods. Also shown is the chance performance: the AP obtained by a classifier outputting a random confidence value without examining the input image. Results are shown ordered by decreasing maximum AP. There is substantial variation in the maximum AP as a function of object class, from 33.1% (bottle) to 85.9% (person). The median AP varies from 15.6% (potted plant) to 75.7% (person). The median results can be seen to approximately follow the ranking of results by maximum AP, suggesting that the same classes proved difficult across methods, but individual differences can be seen, for example the difference in maximum AP for the 2nd to 4th classes is very small such that the ordering is somewhat arbitrary. The high AP for the person class can be attributed in part to the high proportion of images in the dataset containing people: the chance AP for this class is 43.4%. However, as can be seen in Fig. 9, the results are substantially above chance for all classes.

While results on some classes e.g. person and train (Fig. 6a–d) are very promising, for all classes there is substantial room for improvement. For some classes e.g. bottle (Fig. 6g–h), the precision/recall curves show that the methods' precision drops greatly at moderate recall, and current methods would not give satisfactory results in the context of an image retrieval system. It is also worth noting that if one views the classification task as image retrieval, the evaluation is somewhat benign since the prior probability of an image containing an object of interest is still quite high: 2% for the least frequent class (sheep). We might expect that in a real world scenario, for example image-based web search, the prior probability for some classes would be much lower. We return to this point in Sect. 7.3.
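The dependence of the chance performance on the class prior can be seen directly from the AP computation itself. The following sketch is illustrative only; it assumes the 11-point interpolated AP used in the challenge (see Sect. 4.2) and scores a random ranking against labels with roughly the positive rate of the person class.

```python
# Chance AP roughly tracks the proportion of positive images (sketch).
import numpy as np

def interpolated_ap(confidence, labels):
    """confidence: score per image; labels: 1 for positive images, 0 otherwise."""
    order = np.argsort(-confidence)
    tp = np.cumsum(labels[order])
    recall = tp / labels.sum()
    precision = tp / np.arange(1, len(labels) + 1)
    # 11-point interpolation: mean of the maximum precision at recall >= t.
    return np.mean([precision[recall >= t].max() if np.any(recall >= t) else 0.0
                    for t in np.linspace(0, 1, 11)])

rng = np.random.default_rng(0)
labels = (rng.random(5000) < 0.43).astype(int)                    # ~43% positives
print(interpolated_ap(rng.random(5000), labels))                  # random ranking: AP near the 0.43 prior
print(interpolated_ap(labels + 0.1 * rng.random(5000), labels))   # perfect ranking: AP = 1.0
```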
6.1.4 What Are the Classification Methods Learning?

As noted, the quantitative evaluation of methods by AP gives a summary of a method's precision/recall trade-off. It is interesting to examine the success and failure modes of the methods to derive some insight into what current methods are learning, and what limitations might be addressed in development of future methods.

We first examined the kinds of images which methods found "easy" or "difficult". Five submitted methods were selected which represented somewhat different approaches rather than small variations (e.g. INRIA_Genetic vs. INRIA_Flat): INRIA_Genetic, XRCE, TKK, QMUL_LSPCH and UVA_FuseAll. Each test image was then assigned a rank by each method (using the method's confidence output for that image). An overall rank for the image was then assigned by taking the median over the ranks from the five selected methods. By looking at which images, containing the class of interest or not, are ranked first or last, we can gain insight into what properties of the images make recognition easy or difficult for current methods.
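The pooling of rankings over the selected methods is straightforward; the sketch below assumes a dictionary of per-method confidence vectors over a common image list (the names are illustrative, not taken from any submission).

```python
# Median-rank pooling across methods (sketch).
import numpy as np
from scipy.stats import rankdata

def median_image_ranks(confidences):
    """confidences: dict mapping method name -> confidence array over the same images."""
    # rank 1 = the image that method is most confident contains the class
    per_method = np.vstack([rankdata(-c, method="ordinal") for c in confidences.values()])
    return np.median(per_method, axis=0)

# Positives with the smallest median rank are the "easy" examples;
# positives with the largest median rank are the failure cases,
# and negatives with a small median rank are the confusing images.
```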
Figure 10 shows ranked images for the car class. The first row shows the five positive images (containing cars) assigned the highest rank (1st–5th): these can be considered images which are easy for current methods to recognise as containing cars. The second row shows the five positive images (containing cars) assigned the lowest rank: these are images for which current methods cannot easily establish the presence of a car. The third row shows the five negative images (not containing cars) assigned the highest rank: these are images which confuse current methods, which judge them highly likely to contain cars.

The high ranked positive images (Fig. 10a) include images where a single car dominates the image, in an uncluttered or expected background i.e. a road, and images where a number of cars are visible. The inclusion of images with multiple cars is perhaps surprising, but as discussed below, may be attributed to the reliance of current methods on textural properties of the image rather than spatial arrangement of parts. The low ranked positive images (Fig. 10b) are typical across all classes, showing the object of interest small in the image, poorly lit or heavily occluded. For methods based on global image descriptors, such as the BOW approach, these factors cause the presence of the car to contribute little to the feature vector describing the image. In the case of the car class, the high ranked negative images (Fig. 10c) show an intuitive confusion: the first five images shown all contain buses or lorries (not considered part of the car class). This may be considered a pleasing result since there is some natural fuzziness in the distinction between the classes car and bus, and the classes certainly share both semantic and visual properties. However, as discussed below, these natural confusions are not apparent for all classes.

Figure 11 shows the five highest ranked positive and negative images for the cat class. Here also the confusion appears natural, with all five of the highest ranked non-cat images containing dogs. However, it can also be seen that the composition of the images for the cat and dog classes is very similar, and this may play a significant role in the learnt classifiers. This is a bias in the content of flickr images, in that photographers appear to take many close-up images of their pets.

Figure 12 shows corresponding images for the bicycle class. The high ranked positive images show a similar pattern to the car class, containing uncluttered images of bicycles in canonical poses, and images where the scene is dominated by multiple bicycles. For this class, however, the high ranked negative images (Fig. 12b) are anything but intuitive: all of the first five negative images show scenes of birds sitting on branches, which do not resemble bicycles at all to a human observer. The reason for this confusion might be explained by the lack of informative spatial information in current methods. Examining the negative images (Fig. 12b), which are dominated by wiry or bar-like features, it seems clear that a BOW representation may closely resemble that of a bicycle, with the representation of branches matching that of the bicycle tubes and spokes. This is a limitation of BOW methods, in that information about the spatial arrangement of features is discarded, or represented only very weakly and implicitly, by features representing the conjunction of other features. For methods using tiling or spatial pyramids, the spatial information is captured only at a coarse image/scene level, and not at the level of individual objects.

Figure 13 shows images for the chair class. Results here are interesting in that none of the high ranked positive images are close-up views of isolated chairs, even though these are present in the dataset. All the high ranked negative images (Fig. 13b) show indoor scenes which might well be expected to contain chairs, but do not. Only one of the first five negative images contains a sofa, which might be considered the most easily confused class both semantically and visually. It seems likely that in this case the classifiers are learning about the scene context of chairs rather than modelling the appearance of a chair itself. Again, this is a somewhat natural consequence of a global classification approach. The use of context may be seen in both positive and negative lights: while there is much interest in the field in exploiting contextual cues to object recognition (Torralba 2003; Sudderth et al. 2008; Hoiem et al. 2006), the incorporation of context by use of a global descriptor leads to failure when objects are presented out of context, or over-reliance on context when the training set contains mostly images of scenes rather than individual objects. The question of whether current methods are learning object or scene representations is considered further below.
Fig. 10 Ranked images for the car classification task (see text for details of ranking method). (a) Five highest ranked positive images (containing cars); (b) five lowest ranked positive images (containing cars); (c) five highest ranked negative images (not containing cars). The number in each image indicates the corresponding median rank

Fig. 11 Ranked images for the cat classification task. (a) Five highest ranked positive images (containing cats); (b) five highest ranked negative images (not containing cats). The number in each image indicates the corresponding median rank

Fig. 12 Ranked images for the bicycle classification task. (a) Five highest ranked positive images (containing bicycles); (b) five highest ranked negative images (not containing bicycles). The number in each image indicates the corresponding median rank

Finally, Fig. 14 shows images for the person class. In this case, the negative images (Fig. 14b) contain (i) dining tables (3rd and 5th image), and (ii) motorbikes (1st, 2nd and 4th images). The confusion with the dining table class seems natural, in the same manner as the chair class, in that the presence of a dining table seems a good predictor for the presence of a person. Statistics of the dataset reveal that the presence of a motorbike is a similarly effective predictor: 68.9% of the images containing motorbikes also contain people (although an alternative explanation may be the elongated vertical shape of the motorbikes seen from the front or rear). These unintentional regularities in the dataset are a limiting factor in judging the effectiveness of the classification methods in terms of object recognition rather than image retrieval. The detection task, see Sect. 6.2, is a much more challenging test of object recognition.
Fig. 13 Ranked images for the chair classification task. (a) Five highest ranked positive images (containing chairs); (b) five highest ranked
negative images (not containing chairs). The number in each image indicates the corresponding median rank

Fig. 14 Ranked images for the person classification task. (a) Five highest ranked positive images (containing people); (b) five highest ranked
negative images (not containing people). The number in each image indicates the corresponding median rank

6.1.5 Effect of Object Size on Classification Accuracy

All of the methods submitted are essentially global, extracting descriptors of the entire image content. The ranking of images also suggests that the image context is used extensively by the classifiers. It is interesting therefore to examine whether methods are biased toward images where the object of interest is large, or whether conversely the presence of adequate scene context determines the accuracy of the results.

We conducted experiments to investigate the effect of object size on the submitted methods' accuracy. A series of test sets was made in which all positively-labelled images contained at least some proportion of pixels labelled as the object of interest. For example, given a threshold of 10%, only images for which at least 10% of the pixels were car were labelled as containing a car; images with some car pixels, but less than 10%, were removed from the test set, and the negative examples always had zero car pixels. The proportion of pixels belonging to a particular class was approximated by the union of the corresponding bounding boxes. Figure 15 shows results of the experiments for a representative set of classes. For each plot, the x-axis shows the threshold on the proportion of positive pixels in an image for the image to be labelled positive, defining one of the series of test sets. For each threshold, the AP was measured and is shown on the y-axis.

Fig. 15 Classification results as a function of object size. Plots show results for four representative classes. For each plot the x-axis shows the lower threshold on object size for a positive image to be included in the test set; the y-axis shows the corresponding AP. A threshold of 30% means that all positive images (e.g. containing car) which contained fewer than 30% positive (i.e. car) pixels were removed from the test set; a threshold of 0% means that no images were discarded

Figure 15a, b show classes car and motorbike, for which some positive correlation between object size and AP can be observed. As shown, the AP increases as images with fewer object pixels are discarded, and peaks at the point where the only positive images included have at least 15% object pixels. This effect can be accounted for by the use of interest point mechanisms to extract features, which break down if the object is small such that no interest points are detected on it, and in the case of dense feature extraction, by the dominance of the background or clutter in the global representation, which swamps the object descriptors. For the motorbike class, the AP is seen to decrease slightly when only images containing at least 20% of object pixels are included: this may be explained by the reduction of relevant context in such images.

For most classes, zero or negative correlation between object size and AP was observed, for example bird, shown in Fig. 15c. This is compatible with the conclusions from examining ranked images, that current methods are making substantial use of image composition or context. For some classes, e.g. chair, shown in Fig. 15d, this effect is quite dramatic: for this class the learnt classifiers are very poor at recognising images where chairs are the dominant part of the scene. These results are in agreement with the ranked images shown in Fig. 13b, suggesting a reliance on scene context.
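The construction of the size-thresholded test sets used in these experiments can be summarised as follows; this is a sketch under the stated bounding-box approximation, and the annotation format is an assumption made for the example.

```python
# Building a size-thresholded classification test set for one class (sketch).
def size_thresholded_test_set(annotations, threshold):
    """annotations: list of (image_id, is_positive, object_fraction) triples,
    where object_fraction approximates the class pixels via the union of boxes."""
    test_set = []
    for image_id, is_positive, object_fraction in annotations:
        if not is_positive:
            test_set.append((image_id, 0))          # negatives (zero object pixels) always kept
        elif object_fraction >= threshold:
            test_set.append((image_id, 1))          # positives with sufficiently large object area
        # positives below the threshold are removed from this test set
    return test_set

# AP is then re-measured on size_thresholded_test_set(annotations, t)
# for t = 0.0, 0.05, 0.10, ... to produce curves like those in Fig. 15.
```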
6.1.6 Classification Results on the VOC2006 Test Set

In order to gauge progress since the 2006 VOC challenge, participants in the 2007 challenge were asked to additionally submit results on the VOC2006 dataset. Since the training phase for many methods is computationally expensive, for example requiring multiple cross-validation runs, participants were asked to submit results trained using VOC2007 data, i.e. not requiring re-training. This allows us to answer two questions: (i) do methods which perform well on the VOC2007 test set also perform well on the VOC2006 test set? (ii) do newer methods trained on VOC2007 data outperform older methods trained on VOC2006 data?

Fig. 16 Comparison of classification results on VOC2006 vs. VOC2007 test sets. All methods were trained on the VOC2007 training/validation set. The median AP over the 10 classes common to the VOC2006/VOC2007 challenges is plotted for each method and test set. The line marked VOC2006 indicates the best result obtained in the VOC2006 challenge, trained using the VOC2006 training/validation set
Participants submitted classification results on both test sets for 9 methods. Figure 16 summarises the results. The x-axis shows the median AP over all classes on the VOC2007 test set. The y-axis shows the median AP over all classes on the VOC2006 test set. The line labelled "VOC2006" indicates the best result reported in the 2006 challenge. Note that since the median is taken over the 10 classes common to the 2006 and 2007 challenges (see Fig. 2), the ranking of methods does not match that shown in Fig. 7, for example the XRCE method outperforms the INRIA methods on the VOC2007 data for this subset of classes.

As the figure shows, there is very high correlation between the results on the VOC2006 and VOC2007 data. This suggests that methods performing well on VOC2007 are implicitly more successful, rather than obtaining good results due to excessive fitting of the statistics of the VOC2007 data. There are small differences in the ranking of methods, for example the XRCE method is 1st on VOC2007 (over the subset of 10 classes common to both challenges) but 3rd on VOC2006. The INRIA_Genetic method gives results marginally lower than XRCE on the VOC2007 data, but convincingly better results on the VOC2006 data.

However, for all methods the performance on the VOC2006 data is less than on VOC2007, by 5.0% (INRIA_Flat) to 12.7% (Tsinghua). This implies that methods have failed to generalise to the VOC2006 data to some extent. There are two possible reasons: (i) there is fundamentally insufficient variability in the VOC2007 data to generalise well to the VOC2006 data; (ii) the classifiers have over-fit some peculiarities of the VOC2007 data which do not apply to the VOC2006 data. Factors might include the difference in the time of year of data collection (see Sect. 3.1). A possible concern might be that the better results obtained on VOC2006 are due to the presence of near-duplicate images spanning the training/test sets. This possibility was minimised by removing such images when the dataset was collected (see Sect. 3.1).

Tested on the VOC2006 data, the maximum median-AP achieved by a method submitted in 2007 was 62.1% compared to 74.5% in 2006. This again suggests either that the 2007 methods over-fit some properties of the VOC2007 data, or that there were peculiarities of the VOC2006 data which methods trained on that data in 2006 were able to exploit. One such possibility is the inclusion of the MSR Cambridge images in the VOC2006 data (see Sect. 3.1) which may have provided a boost to 2006 methods learning their specific viewpoints and simple scene composition.

6.2 Detection

This section reports and discusses the results of the detection task. A total of 9 methods were evaluated. Six participants tackled all of the 20 classes, with the others submitting results on a subset of classes. Table 6 lists the AP for all submitted methods and classes. For each class the method obtaining the greatest AP is identified in bold, and the methods with 2nd and 3rd greatest AP in italics. Precision/recall curves for a representative sample of classes are shown in Fig. 17.

6.2.1 Detection Results by Method

It is difficult to judge an overall "winner" in the detection task because different participants tackled different subsets of classes (this is allowed under the rules of the challenge). Oxford won on all 6 (vehicle) classes that they entered, UoCTTI won on 6 classes, and MPI_ESSOL on 5. The Oxford method achieved the greatest AP for all of the six classes entered, with the AP substantially exceeding the second place result, by a margin of 4.0–11.4%. The UoCTTI method entered all 20 classes, and came first or second in 14. MPI_ESSOL also entered all 20 classes, but it is noticeable that on some classes the AP score for this method is poor relative to other entries.

These methods differ quite significantly in approach: Oxford used interest point detection to select candidate detection regions, and applied an SVM classifier using a spatial pyramid (Lazebnik et al. 2006) representation to the candidate region; the UoCTTI method used a sliding window approach, but with a star model of parts; and MPI_ESSOL used an SVM classifier based on a BOW or spatial pyramid representation of the candidate window.

It would have been particularly interesting to see results of the Oxford method on all classes, since it might be expected that the use of interest points and a fixed grid representation might limit the applicability to classes with limited distinguishing features or significant variability in shape, e.g. animals.

Promising results were obtained by all methods, but with results for each method varying greatly among the object classes. It seems clear that current methods are more or less suited to particular classes. An example is the failure of the UoCTTI method on the dog class (AP = 2.3) compared to the MPI_ESSOL method (AP = 16.2). While the former emphasises shape, the latter uses a BOW/spatial pyramid representation which might better capture texture, but captures shape more coarsely. Conversely, the UoCTTI method obtains good results on bottle (AP = 21.4), where it is expected that shape is a very important feature, and the MPI_ESSOL method fails (AP = 0.1). The trained detector used by the UoCTTI method (Felzenszwalb et al. 2008) has been made publicly available, and would form a reasonable state-of-the-art baseline for future challenges.

6.2.2 Detection Results by Class
Table 6 Detection results. For each object class and submission, the AP measure (%) is shown. Bold entries in each column denote the maximum AP for the corresponding class. Italic entries
denote the results ranked second or third. Note that some participants submitted results for only a subset of the 20 classes

aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor

Darmstadt 30.1
INRIA_Normal 9.2 24.6 1.2 0.2 6.8 19.7 26.5 1.8 9.7 3.9 1.7 1.6 22.5 15.3 12.1 9.3 0.2 10.2 15.7 24.2
INRIA_PlusClass 13.6 28.7 4.1 2.5 7.7 27.9 29.4 13.2 10.6 12.7 6.7 7.1 33.5 24.9 9.2 7.2 1.1 9.2 24.2 27.5
IRISA 28.1 31.8 2.6 9.7 11.9 28.9 22.7 22.1 17.5 25.3
MPI_Center 6.0 11.0 2.8 3.1 0.0 16.4 17.2 20.8 0.2 4.4 4.9 14.1 19.8 17.0 9.1 0.4 9.1 3.4 23.7 5.1
MPI_ESSOL 15.2 15.7 9.8 1.6 0.1 18.6 12.0 24.0 0.7 6.1 9.8 16.2 3.4 20.8 11.7 0.2 4.6 14.7 11.0 5.4
Oxford 26.2 40.9 39.3 43.2 37.5 33.4
TKK 18.6 7.8 4.3 7.2 0.2 11.6 18.4 5.0 2.8 10.0 8.6 12.6 18.6 13.5 6.1 1.9 3.6 5.8 6.7 9.0
UoCTTI 20.6 36.9 9.3 9.4 21.4 23.2 34.6 9.8 12.8 14.0 0.2 2.3 18.2 27.6 21.3 12.0 14.3 12.7 13.4 28.9
Fig. 17 Detection results. Precision/recall curves are shown for a representative sample of classes. The legend indicates the AP (%) obtained by the corresponding method

Fig. 18 Summary of detection results by class. For each class two values are shown: the maximum AP obtained by any method (max) and the median AP over all methods (median)

Figure 18 summarises the results obtained by object class, plotting for each class the maximum and median AP taken over all methods. Results are shown ordered by decreasing maximum AP. It should be noted that, because some participants only submitted results for some classes, the number of results available varies for each class. There is substantial variation in the maximum AP as a function of object class, from 9.4% (boat) to 43.2% (car). The median AP varies from 2.8% (boat) to 29.4% (car). The median results can be seen to approximately follow the ranking of results by maximum AP.

Results on some classes e.g. car/bicycle (Fig. 17a, b) are quite promising, with the best performing methods obtaining precision close to 100% for recall up to 15–20%. However, precision drops rapidly above this level of recall. This suggests that methods are retrieving only a subset of examples with any accuracy, perhaps the canonical views (e.g. car side, car rear). A challenge for future methods is to increase the recall. In the related domain of face detection the move from frontal-only detection to arbitrary pose has proved challenging.

It can be seen from Fig. 18 that the best results were obtained for classes which have traditionally been investigated in object detection, e.g. car, bicycle and motorbike. Such classes have quite predictable visual properties, with distinctive parts e.g. wheels, and relatively fixed spatial arrangement of parts. For classes with significant variation in shape or appearance e.g. people (Fig. 17c) and household objects (which are often significantly occluded), results are substantially worse. Results on the important person class were, however, quite promising overall. The best results in
terms of AP on this class were obtained by the IRISA and UoCTTI methods. As noted in Sect. 5.2 the IRISA method trained multiple person detectors, for example person on horse/person on bicycle. The UoCTTI method is also potentially better able to deal with varying articulation, by its approach of simultaneous pose inference and detection.

For several classes the results are somewhat counter-intuitive, for example good results are obtained on the train class (max AP = 33.4%) which might be expected to be challenging due to the great variation in appearance and aspect ratio with pose. The results for this class may be explained by the inclusion of several methods which exploited whole image classification: MPI_Center, which predicts a single detection per image of fixed size, and INRIA_PlusClass, which combines sliding window detection with a global classifier. Because trains tend to appear large in the image, these global methods prove successful on this data; however, it is notable that the Oxford method also did well for this class. For the horse class, the good results may be attributable to unwanted regularities in the dataset, which includes many images of horses taken by a single photographer at a single gymkhana event. Such regularities will be reduced in the VOC2008 dataset by distributing searches over time, as noted in Sect. 3.1. Results for the classes with low AP, for example boat (Fig. 17d), leave much scope for improvement, with precision dropping to zero by 20% recall. The VOC2007 dataset remains extremely challenging for current detection methods.

6.2.3 Evaluation of the Overlap Threshold

As noted in Sect. 4.2, detections are considered true positives if the predicted and ground truth bounding boxes overlap by 50% according to the measure defined in (3), with the threshold of 50% set low to account for uncertainty in the bounding box annotation. We evaluated the effect the overlap threshold has on the measured AP by varying the threshold. Figure 19 shows AP as a function of the overlap threshold for the class car, on which the best results (in terms of AP with overlap threshold of 50%) were obtained by the submitted methods. One caveat applies: participants were aware of the 50% threshold, and were therefore free to optimise their methods at this threshold, for example in their schemes for elimination of multiple detections.

Fig. 19 Effect of the bounding box overlap threshold on the AP measure. For the class car, on which the submitted methods gave the best results, AP is plotted as a function of the overlap threshold. The threshold adopted for the challenge is 50%

As Fig. 19 shows, the measured AP drops steeply for thresholds above 50%, indicating that none of the methods give highly accurate bounding box predictions. Reducing the threshold to 10% results in an increase in measured AP of around 7.5%. Note that for all but one pair of methods (Darmstadt and INRIA_PlusClass) the ordering of methods by AP does not change for any threshold in the range 0–50% (the AP of these two methods at a threshold of 50% differs by only 0.7%). We conclude that the measure is performing satisfactorily, capturing the proportion of objects detected without overly penalising imprecise bounding box predictions.
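The experiment of varying the overlap threshold amounts to re-running the matching step of the evaluation at different thresholds. The following sketch mirrors the protocol of Sect. 4.2 with illustrative variable names; its true/false positive flags feed the precision/recall and AP computation for each class.

```python
# Greedy matching of detections to ground truth at a given overlap threshold (sketch).
def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def true_false_positives(detections, ground_truth, thresh=0.5):
    """detections: list of (confidence, box); ground_truth: list of boxes for one image/class."""
    matched = [False] * len(ground_truth)
    flags = []                                        # (confidence, True) for TP, (confidence, False) for FP
    for conf, box in sorted(detections, key=lambda d: -d[0]):     # most confident first
        overlaps = [iou(box, g) for g in ground_truth]
        best = max(range(len(ground_truth)), key=lambda i: overlaps[i], default=None)
        if best is not None and overlaps[best] >= thresh and not matched[best]:
            matched[best] = True                      # each ground-truth box matched at most once
            flags.append((conf, True))
        else:
            flags.append((conf, False))               # duplicate or poorly localised detection
    return flags
```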
6.2.4 What Are the Detection Methods Learning?

As in the classification task it is interesting to examine the success and failure modes of the methods to derive some insight into what current methods are learning, and what limitations might be addressed in the development of future methods.

Each detection method provides a list of bounding boxes ranked by the corresponding confidence output. We present some of the highest ranked true positive (object) and false positive (non-object) detections here. Since the detection methods varied greatly in approach and success, as measured by AP, we present individual results for selected classes and methods. For a given class, we present results of the three methods giving greatest AP. The classes selected were those where results are particularly promising, or interesting observations can be made, e.g. confusion between motorbike and bicycle.

Figure 20 shows the five highest ranked true positive detections for the car class. The methods shown, obtaining greatest AP, are Oxford, UoCTTI and IRISA. As can be seen the Oxford method has a preference for large objects (Fig. 20a) which is less apparent for the other two methods (Fig. 20b, c). We analyse this bias toward large objects further in the next subsection. For this class, there is no apparent preference for a particular viewpoint or aspect: the methods all return cars with varying pose at high confidence. However, for the bicycle class shown in Fig. 21, the most successful methods (Oxford, UoCTTI and INRIA_PlusClass) all return side views of bicycles with highest confidence. For the UoCTTI method in particular there seems no preference for left or right facing bicycles, though the bias toward right facing bicycles for the Oxford method may be a statistical bias in the dataset.
Fig. 20 Highest ranked true positive detections for the car detection task. The five highest ranked true positives are shown for each of the three
methods with greatest AP. The number in each image indicates the rank of the detection

Fig. 21 Highest ranked true positive detections for the bicycle detection task. The five highest ranked true positives are shown for each of the three methods with greatest AP. The number in each image indicates the rank of the detection

We turn now to the false positive detections: bounding boxes which do not correspond to an object of interest. For each class and method we show high-ranked false positive detections. To increase the diversity of results presented we have filtered the images shown: (i) images with any (undetected) object of interest have been removed, though see discussion of the person class below; (ii) only the most confident false positive detection per image is shown. For example in Fig. 22b, multiple high confidence false positives were output for the 3rd image due to the repeated structure.

Figure 22 shows false positives for the bicycle class output by the Oxford, UoCTTI and INRIA_PlusClass methods. All methods generated some intuitive false positives on motorbikes, though in many such cases the predicted bounding box does not correspond to the full extent of the object. It is interesting to observe that several of the false positives for the Oxford method are the same images of tree branches which confused the classification methods (Fig. 12b). This may be explained by the pyramid representation of spatial information in this method or by the method learning the strong gradients of the frames (if you squint you can see a frame in the second and fourth images of Fig. 22a). For the UoCTTI method, the false positives were observed to often resemble the side-view of a bicycle as two blobs horizontally aligned (Fig. 22b, 3rd and 4th image). The most confident false positive output by this method is actually a drawing of a bicycle on a traffic sign, not labelled as bicycle according to the annotation guidelines.
Fig. 22 High ranked false positive detections for the bicycle detection task. The false positives shown are in images where no bicycles are
present. The number in each image indicates the rank of the detection. Results are shown for the three methods with greatest AP

Fig. 23 High ranked false positive detections for the motorbike detection task. The false positives shown are in images where no motorbikes
are present. The number in each image indicates the rank of the detection. Results are shown for the three methods with greatest AP

For the INRIA_PlusClass method, 4 of the 5 highest confidence false positive images contain motorcycles; however, the poor prediction of bounding boxes in these cases suggests that the scene context introduced by the incorporation of a global image classifier in this method may be a factor, rather than a natural confusion between the classes.

Figure 23 shows corresponding high ranked false positives for the motorbike class, with the same methods as for the bicycle class shown. For the UoCTTI method, 4 out of 5 false positives are bicycles, with the remaining false positive shown covering a pair of car wheels. These results suggest that this method is really capturing something about the dominant shape of the motorbike. The Oxford method outputs two bicycles in the first five false positives, and the INRIA_PlusClass method outputs one. The remaining high ranked false positives for these two methods are difficult to explain, consisting mainly of highly cluttered scenes with little discernible structure.

Figure 24 shows high ranked false positives for the person class, with the three most successful methods shown: IRISA, UoCTTI and INRIA_Normal. This class is particularly challenging because of the high variability in human pose exhibited in the VOC dataset. As can be seen, it is difficult to see any consistent property of the false positives.
Fig. 24 High ranked false positive detections for the person detection task. The false positives shown are in images where no people are present. The number in each image indicates the rank of the detection. Results are shown for the three methods with greatest AP

Fig. 25 Near miss false positives for the person detection task. The images shown contain people, but the detections do not satisfy the VOC overlap criterion. The number in each image indicates the rank of the detection. Results are shown for the three methods with greatest AP

Some bias toward elongated vertical structures can be observed, e.g. trees (Fig. 24a) and dogs in a frontal pose (Fig. 24b), and more of these were visible in lower ranked false positives not shown. However, many false positives seem to be merely cluttered windows with strong texture. The fifth false positive output by the IRISA method (Fig. 24a) is interesting (motorbike panniers occluded by another motorbike) and is most likely an artefact of that method learning separate "person on X" detectors.

Thus far the false positives shown exclude images where any object of interest is present. Figure 25 shows false positives for the person class, where people are present in the image. These represent "near miss" detections where the predicted bounding box does not meet the VOC overlap criterion of 50%. As noted, this class presents particular challenges because of the high variability in pose and articulation. As can be seen in Fig. 25a, b, the localisation of these false positives for the IRISA and UoCTTI methods is generally quite good, e.g. the top of the bounding box matches the top of the head, and the person is horizontally centred in the bounding box. The failure modes here are mainly inaccurate prediction of the vertical extent of the person (Figs. 25a and 25b, 1st and 2nd image) due to occlusion, and failure on non-frontal poses (Fig. 25b, 1st and 5th images). This is a limitation of current methods using fixed aspect ratio bounding boxes, which is a poor model of the possible imaged appearance of a person. The false positive in Fig. 25a, 1st image, is accounted for by the "person on X" approach taken by the IRISA method. The high ranked near misses output by the INRIA_Normal method (Fig. 25c) mostly cover multiple people, and might be accounted for by capturing person texture but not layout.

As noted in Sect. 4.2.2, in the 2007 challenge we introduced a "person layout" taster to further evaluate the ability of person detectors to correctly parse images of people,
and motivate research into methods giving more detailed interpretation of scenes containing people.

6.2.5 Effect of Object Size on Detection Accuracy

As in the classification task, it is interesting to investigate how object size affects the accuracy of the submitted detection methods, particularly for those such as the Oxford method which makes use of interest point detection, which may fail for small objects, and the INRIA_PlusClass method which combines sliding window detection with a whole image classifier.

We followed a corresponding procedure to that for the classification task, creating a series of test sets in which all objects smaller than a threshold area were removed from consideration. For the detection task, this was done by adding a "difficult" annotation to such objects, so that they count neither as false positives nor false negatives. Figure 26 shows results of the experiments for a representative set of classes. For each plot, the x-axis shows the threshold on object size, as a proportion of image size, for an object to be included in the evaluation. For each threshold, the AP was measured and is shown on the y-axis.

As Fig. 26 shows, all submitted methods have lower AP for very small objects below around 2% image area. For the car class (Fig. 26a), the accuracy of most methods does not increase if objects less than 5% area are removed from the evaluation, indicating limited preference for large objects. For the INRIA_Normal and IRISA methods the AP can be seen to fall slightly with increasing object size, indicating that some highly ranked correct detections are for small objects. For the motorbike class (Fig. 26b), the INRIA_PlusClass and MPI_ESSOL methods peak for objects above around 17% image area, the IRISA and UoCTTI methods show no clear preference for large objects, and AP for all other methods increases monotonically with object size.

Three methods show substantial correlation between object size and AP: MPI_Center, MPI_ESSOL and Oxford. The MPI_Center method outputs a fixed bounding box with area 51% of the image, and confidence determined by a global image classifier. This clearly biases results to images where most of the image is covered by the object of interest, and while an interesting baseline (as intended), is not a successful strategy since many of the objects in the dataset are small. The MPI_ESSOL method has two aspects which may bias it toward larger objects: (i) it combines a whole image classifier with a sliding window detector to score detections; (ii) it incorporates a log Gaussian prior on object size, fit by maximum likelihood to the training data, and this prior may have biased results toward large objects. The Oxford method relies on scale-invariant interest point operators to provide candidate detections, and the lack of interest points on small objects may explain the correlation between its accuracy and object size.

6.2.6 Detection Results on the VOC2006 Test Set

As in the classification task, participants in the detection task of the 2007 challenge were asked to additionally submit results on the VOC2006 dataset, trained using the VOC2007 data.

Participants submitted classification results on both test sets for 3 methods. Table 7 summarises the results, showing the AP (%) obtained for each method, class and test set. The final row shows the maximum AP obtained by any method in the 2006 challenge, trained on the VOC2006 data. Since participants submitted results for different subsets of classes, the results have not been summarised e.g. by median AP as in the classification task.

For all but one class the ranking of methods by AP is the same for the VOC2007 and VOC2006 datasets, suggesting that methods performing well on VOC2007 are intrinsically more successful, rather than obtaining good results due to excessive fitting of the statistics of the VOC2007 data. For the car class, the UoCTTI method comes first and the Oxford method second, reversing the order on VOC2007, but the difference in AP is small (53.5% vs. 55.5%).

Particularly encouraging is the observation that for 7 out of 10 classes a method submitted in 2007 achieves greater AP than any of the 2006 methods. The UoCTTI method exceeds the 2006 results on 7 of the 10 classes entered, the Oxford method on all four classes entered, and the IRISA method on 4 of the 8 classes entered. The improvement is substantial, e.g. 19.1% AP on the bus class (Oxford) and 9.8% on the person class (UoCTTI). While it is possible that the improvement is due to the VOC2007 training/validation data being more useful, this effect was not observed for the classification task. It therefore seems likely that the results represent measurable progress in object detection.

6.3 Segmentation

All participants in the detection challenge were automatically entered into the segmentation challenge by deriving a segmentation from the inferred bounding boxes (overlaps were resolved heuristically). In addition, only one segmentation-specific method was submitted, by Lubor Ladicky, Pushmeet Kohli and Philip Torr of Oxford Brookes University (Kohli and Ladicky 2008). Example segmentations from this team and from the TKK automatic entry are shown in Fig. 27. The best overall performance was given by one of the "automatic" participants (segmentation derived algorithmically from detection), most likely due to an unfinished segmentation-only entry. In future challenges, it is anticipated that methods which are optimised for the segmentation problem will outperform automatic detection entries. In any case, providing a challenge which directly compares detection and segmentation methods should help encourage innovation in how to combine these two types of method to best effect.
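Deriving a segmentation from detection output in this automatic manner can be done in a few lines. Since the paper does not specify how overlaps were resolved, the sketch below assumes one simple heuristic in which higher-confidence boxes overwrite lower-confidence ones; it is an illustration, not the procedure actually used.

```python
# Painting detection boxes into a per-pixel class-label map (illustrative heuristic).
import numpy as np

def segmentation_from_detections(image_shape, detections):
    """detections: list of (confidence, class_index, (xmin, ymin, xmax, ymax)),
    with integer pixel coordinates. Returns a class-index map, 0 = background."""
    seg = np.zeros(image_shape[:2], dtype=np.uint8)
    for conf, class_index, (xmin, ymin, xmax, ymax) in sorted(detections):
        # sorted() visits boxes in increasing confidence, so stronger detections
        # are painted last and overwrite weaker, overlapping ones
        seg[ymin:ymax, xmin:xmax] = class_index
    return seg
```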
Table 7 Comparison of detection results on VOC2006 vs. VOC2007 test sets. All methods were trained on the VOC2007 training/validation set. The AP measure (%) is shown for each method, class, and test set. The final row (VOC2006) lists the best result obtained in the VOC2006 challenge, trained using the VOC2006 training/validation set. Bold entries denote the maximum AP for each dataset and class. Bold entries in the final row indicate where results obtained in 2006 exceeded those obtained in 2007

bicycle bus car cat cow dog horse motorbike person sheep

Test on 2007 IRISA 28.1 31.8 2.6 11.9 28.9 22.7 22.1 17.5
Oxford 40.9 39.3 43.2 37.5
UoCTTI 36.9 23.2 34.6 9.8 14.0 2.3 18.2 27.6 21.3 14.3

Test on 2006 IRISA 35.2 48.2 9.4 20.9 18.3 33.3 21.1 26.2
Oxford 56.8 36.0 53.5 53.9
UoCTTI 56.2 23.6 55.5 10.3 21.2 9.9 17.3 43.9 26.2 22.1

VOC2006 Best 44.0 16.9 44.4 16.0 25.2 11.8 14.0 39.0 16.4 25.1

Fig. 26 Detection results as a function of object size. Plots show results for two representative classes. For each plot the x-axis shows the lower threshold on object size for an object to be included in the test set; the y-axis shows the corresponding AP. A threshold of 30% means that all objects smaller than 30% of the image area were ignored in the evaluation (contributing neither true nor false positives); a threshold of 0% means that no objects were ignored

6.4 Person Layout

Only one result was submitted for the person layout taster, by Martin Bergtholdt, Jörg Hendrik Kappes and Christoph Schnörr of the University of Mannheim (Bergtholdt et al. 2006). Figure 28 shows some example results for this method. For some images, where the person is in a canonical frontal pose, the method successfully localises the parts (Fig. 28a). For more varied poses, the method fails to predict the correct locations, or confuses hands and feet (Fig. 28b). Despite some correct results, the ranking of results by the method's confidence output is poor, such that the measured AP is zero. This raised the question of whether the evaluation criterion adopted for VOC2007 is sufficiently sensitive, and as described in Sect. 4.2.2, the requirements for the person layout taster have been relaxed for the VOC2008 challenge.
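The effect described above, where a poor confidence ranking drives the measured AP to zero even though some individual predictions are correct, can be seen from a compact sketch of the eleven-point interpolated average precision used in VOC2007. The sketch assumes that matching of predictions to ground truth has already been performed; it is an illustration under that assumption, not the official development-kit implementation.

```python
import numpy as np

def average_precision(scores, is_correct, n_positive):
    """Eleven-point interpolated average precision, as used for VOC2007.

    scores: confidence assigned to each prediction; is_correct: whether each
    prediction matches a ground-truth annotation (matching is assumed done);
    n_positive: number of ground-truth instances. Predictions are ranked by
    confidence, precision and recall are computed along the ranking, and
    precision is interpolated at recall levels 0, 0.1, ..., 1. If correct
    predictions receive low confidences they appear late in the ranking, so
    the measured AP can be close to zero even when some predictions are correct.
    """
    order = np.argsort(-np.asarray(scores))
    correct = np.asarray(is_correct, dtype=float)[order]
    tp = np.cumsum(correct)
    precision = tp / np.arange(1, len(correct) + 1)
    recall = tp / max(n_positive, 1)
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        candidates = precision[recall >= t]
        ap += (candidates.max() if candidates.size else 0.0) / 11.0
    return ap
```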

Fig. 28 (Color online) Example person layout results for the Mannheim method. For each image the ground truth bounding boxes are shown in white, and the predicted bounding boxes colour-coded: yellow for head and cyan for hand

7 Discussion and the Future

The VOC challenge has already had a significant, and we believe positive, impact in terms of providing a rich, standardised dataset for the community and an evaluation framework for comparing different methods. Participation in the challenges has increased steadily since their first introduction, as has the use of the VOC dataset beyond the challenge itself. For example, in CVPR07 the VOC dataset was referenced in 15 papers; in CVPR08 this number increased to 27, i.e. almost a year-to-year doubling of the dataset's popularity. To retain this popularity, the challenge must evolve to meet the requirements and address the criticisms of the research community. In the following sections we discuss some criticisms of the challenge, look at how the challenge is evolving through the taster competitions, and suggest directions in which the dataset and challenge can be improved and extended in the future.

7.1 Criticisms

No benchmark remains without criticism for long, and the VOC challenge has not been an exception. A common objection raised about any competition of this sort is that: "Datasets stifle innovation, because the community concentrates effort on this data to the exclusion of others." While it is difficult to avoid this effect completely, if the challenge is well ahead of capabilities then it will not necessarily stifle the types of methods used. Datasets have a shelf life, and as performance starts to saturate a new one is needed to drive research. Conversely, it is also necessary for datasets to remain consistent, so that they can be used to gauge progress made by the community. Assessing progress is difficult if the test (and training) set are different every time the challenge is run. The VOC challenge aims to meet these apparently contradictory goals of innovation and consistency by introducing separate "taster" challenges to promote research in new directions (see the next section), while retaining the existing classification and detection competitions so that progress can be consistently tracked.

Fostering innovation is also a question of the attitude of the community as a whole: it is important that we do not discourage novel approaches to object recognition simply because they do not yet achieve the greatest success as measured by our benchmarks. High methodological novelty must not be sacrificed on the altar of benchmark ranking, and this is the last thing the VOC challenge is intended to achieve. An important part of encouraging novel methods is our selection of speakers for the annual challenge workshop, where we have given time to both particularly successful, and particularly interesting methods.

A further criticism raised against the VOC series of challenges in particular is that the level of difficulty is too high, thereby obscuring the way forward. However, methods submitted in 2007 for the detection task demonstrated substantial improvements over those submitted in 2006 (see Sect. 6.2.6). We are of the opinion that providing researchers with such challenging, yet natural, data is only stimulating progress. It is the very fact of being well ahead of current capabilities which makes the dataset so useful. In contrast, datasets for which performance is saturated are likely to encourage fine tuning of implementation details rather than fundamental progress, and such progress may be unmeasurable, being lost in the noise.

A fundamental question is whether the VOC challenges are probing for the right kind of tasks. In a recent paper, Pinto et al. (2008) criticised the use of natural images altogether, arguing for the use of synthetic data (e.g. rendered 3D models) for which one has better control over the variability in the data: parameter settings can be sampled at will and annotation is not needed, as perfect ground truth is available by design. In their view, this is a much better way to generate the variability that is needed to critically test recognition performance. This issue of whether to use natural images or completely control imaging conditions is an ongoing debate in the psychophysics community. In our case, the VOC datasets have been designed to contain large variability in pose, illumination, occlusion, etc. Moreover, correlations that occur in the real world are captured, whereas synthetic datasets cannot be expected to reflect those faithfully.

7.2 Taster Competitions

The taster competitions, which make demands of methods quite far ahead of the state-of-the-art, aim to play a key part

in encouraging fundamental research progress. These were introduced in 2007 to encourage both diversity of approach and the development of more powerful methods to address these more demanding tasks. For example, the segmentation competition not only requires much more precise localisation of objects than the detection task, but it has also been set up to allow either detection-based or segmentation-based approaches to be used. The hope is that the two approaches are complementary, so that detection methods can be used to improve segmentation performance and vice-versa. This belief is justified by the similar situation which has already arisen between the classification and detection tasks, where global image classification has aided detection performance (see Sect. 5.2). By encouraging participants to blend the best aspects of different methodologies, a greater diversity of approaches will be encouraged.

It is inevitable that any challenge is very much of its time, only testing what can be thought of by current practitioners, governed by current methods and hardware, and to some extent unaware of these limitations. Through the use of the taster competitions, the VOC challenge is being updated to allow a broader range of approaches and to address more current research issues. However, it is recognised that the challenge must continue to adapt and remain agile in responding to the needs and concerns of the growing community of researchers who use the datasets and participate in the competitions.

7.3 The Future

In the area of object class recognition, a lot of progress is being made and the requirements for a benchmark evolve quickly with this evolution. Here we give a non-exhaustive list of aspects which could be improved or added in future VOC challenges.

More Object Classes A first and rather obvious extension is to increase the number of annotated object classes. A primary goal here is to put more emphasis on the issue of scalability: running as many detectors as there are object classes may not remain a viable strategy, although this is by far the dominant approach today. Different aspects of detection schemes may become important, for example the ability to share features between classes (Torralba et al. 2007), or exploit properties of multiple parent classes (Zehnder et al. 2008). Introducing more classes would also stimulate research in discrimination between more visually similar classes, and in exploiting semantic relations between classes, for example in the form of a class hierarchy. However, increasing the number of classes will also pose additional difficulties to the running of the VOC challenge: (i) it will prove more difficult to collect sufficient data per class; (ii) it raises questions of how to annotate objects accurately, for example labelling an object as "van" vs. "truck" is often subjective; (iii) evaluation of recognition must be more flexible, for example a method might assign a class from {hatchback, car, vehicle} and be assigned varying scores dependent on accuracy or level of detail.

Object Parts VOC2007 introduced annotation of body parts in order to evaluate and encourage development of methods capable of more detailed image annotation than object location alone. Such more detailed indication of the parts of objects is an important direction to pursue. Although many techniques today start from local features, these features typically have very little to do with the semantic parts of the objects. However, often the purpose of object detection and recognition is to support interaction with objects (e.g. in robotics). A good understanding of where parts are (arms, wheels, keyboards, etc.) is often essential to make such practical use of object recognition, and should be incorporated into at least a component of the evaluation scheme.

Thus far, VOC has confined itself to object classes and annotation where discrete objects can be identified. With the introduction of the segmentation taster, it is natural to also include "stuff" classes (grass, sky, etc.) and additionally consider annotation of classes which can appear as "stuff" in the distance, e.g. "person" vs. "crowd"; images containing such ambiguities are currently omitted from the VOC dataset.

Beyond Nouns Increasingly, vision researchers are forging strong links with text analysis, and are exploiting tools coming from that area such as WordNet (Fellbaum 1998). Part of this endeavour is to build vision systems that can exploit and/or generate textual descriptions of scenes. This entails bringing objects (nouns) in connection with actions (verbs) and attributes (adjectives and adverbs). As progress in this direction continues, it will be appropriate to introduce benchmarks for methods producing richer textual descriptions of a scene than the "noun + position" outputs which are currently typical. The interest in methods for exploiting textual description at training time also suggests alternative weaker forms of annotation for the dataset than bounding boxes; we discuss this further below.

Scene Dynamics Thus far, the VOC challenge has focused entirely on classifying and detecting objects in still images (also the case for VOC2008). Including video clips would expand the challenge in several ways: (i) as training data it would support learning richer object models, for example 3D or multi-aspect. Video of objects with varying viewing direction would provide relations between parts implicitly available through tracking; (ii) as test data it would enable evaluation of new tasks: object recognition from video (e.g. people), and recognition of actions. This would also bring

the VOC challenge into the domain of other benchmarks, e.g. TRECVID which includes an interactive search task with increasing emphasis on events/actions such as a door being opened or a train in motion.

Alternative Annotation Methods Manual annotation is time-consuming and therefore expensive. For example, annotation of the VOC2008 dataset required around 700 person hours. Moreover, since the VOC challenge runs annually, new test data is required each year in order to avoid participants having access to the ground truth annotations and over-fitting on the test set. Increasing the level of annotation, for example by increasing the number of classes, only makes annotation more time-consuming.

We also found that when increasing the number of classes, from 10 in 2006 to 20 in 2007, annotators made many more mistakes as they failed to hold in memory the complete set of classes to be annotated. This in turn required more time to be allocated to checking and correction to ensure high quality annotation. This raises several questions concerning: how the annotation format relates to ease-of-annotation, how much agreement there is between different human annotators e.g. on bounding box position, and how the annotation tools affect annotation quality. To date, we have not yet gathered data during the checking process that could help answer these questions and this is something we aim to rectify in future years. Annotating pixel-wise segmentations instead of bounding boxes puts even higher pressure on the sustainability of manual annotation. If object parts, attributes and especially video are to be added in future, then the method of annotation will certainly need to evolve in concert with the annotation itself. Possibilities include recruiting help from a much larger pool of volunteers (in the footsteps of LabelMe), combined with a centralised effort to check quality and make corrections. We are also investigating the use of systems like Mechanical Turk to recruit and pay for annotation (Sorokin and Forsyth 2008; Spain and Perona 2008). Alternatively, commercial annotation initiatives could be considered, like the aforementioned Lotus Hill dataset (Yao et al. 2007), in combination with sampled quality inspection.

As noted above, there has recently been considerable interest in learning recognition from weak supervision (Duygulu et al. 2002; Fergus et al. 2007). This suggests alternative forms of annotation which could be introduced, for example per-image annotation with keywords or phrases (e.g. "red car in a street scene"). Such annotation could be provided alone for some images, in addition to a set of images with more precise annotation, providing complementary supervision for training at low cost. Another possibility for lighter annotation is to collect (possibly additional) training images directly from a web search engine (such as Google image search) (Fergus et al. 2005). The additional complication here is that such data is typically noisy, in that only a subset of the images are relevant to the supplied search terms.

The Future is Bright There has been tremendous progress in object class recognition this decade. At the turn of the millennium, few would have dreamt that by now the community would have such impressive performance on both classification and detection for such varied object classes as bicycles, cars, cows, sheep, and even for the archetypal functional class of chairs. This progress has gone hand in hand with the development of image databases, which have provided both the training data necessary for researchers to work in this area; and the testing data necessary to track the improvements in performance. The VOC challenge has played a vital part in this endeavour, and we hope that it will continue to do so.

Acknowledgements The preparation and running of the VOC challenge is supported by the EU-funded PASCAL Network of Excellence on Pattern Analysis, Statistical Modelling and Computational Learning.
We gratefully acknowledge the VOC2007 annotators: Moray Allan, Patrick Buehler, Terry Herbert, Anitha Kannan, Julia Lasserre, Alain Lehmann, Mukta Prasad, Till Quack, John Quinn, Florian Schroff. We are also grateful to James Philbin, Ondra Chum, and Felix Agakov for additional assistance.
Finally we would like to thank the anonymous reviewers for their detailed and insightful comments.

References

Bergtholdt, M., Kappes, J., & Schnörr, C. (2006). Learning of graphical models and efficient inference for object class recognition. In Proceedings of the annual symposium of the German association for pattern recognition (DAGM'06) (pp. 273–283).
Chum, O., & Zisserman, A. (2007). An exemplar model for learning object classes. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Chum, O., Philbin, J., Isard, M., & Zisserman, A. (2007). Scalable near identical image and shot detection. In Proceedings of the international conference on image and video retrieval (pp. 549–556).
Csurka, G., Bray, C., Dance, C., & Fan, L. (2004). Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV (pp. 1–22).
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 886–893).
Demsar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
Duygulu, P., Barnard, K., de Freitas, N., & Forsyth, D. A. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of the European conference on computer vision (pp. 97–112).
Everingham, M., Zisserman, A., Williams, C. K. I., & Van Gool, L. (2006a). The 2005 PASCAL visual object classes challenge. In LNAI: Vol. 3944. Machine learning challenges: evaluating predictive uncertainty, visual object classification, and recognising textual entailment (pp. 117–176). Berlin: Springer.

Everingham, M., Zisserman, A., Williams, C. K. I., & Van Gool, L. (2006b). The PASCAL visual object classes challenge 2006 (VOC2006) results. http://pascal-network.org/challenges/VOC/voc2006/results.pdf.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2007). The PASCAL visual object classes challenge 2007 (VOC2007) results. http://www.pascal-network.org/challenges/VOC/voc2007/index.html.
Fei-Fei, L., Fergus, R., & Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611. http://www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html.
Fellbaum, C. (Ed.) (1998). WordNet: an electronic lexical database. Cambridge: MIT Press.
Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Fergus, R., Fei-Fei, L., Perona, P., & Zisserman, A. (2005). Learning object categories from Google's image search. In Proceedings of the international conference on computer vision.
Fergus, R., Perona, P., & Zisserman, A. (2007). Weakly supervised scale-invariant learning of models for visual recognition. International Journal of Computer Vision, 71(3), 273–303.
Ferrari, V., Fevrier, L., Jurie, F., & Schmid, C. (2008). Groups of adjacent contour segments for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(1), 36–51.
Fritz, M., & Schiele, B. (2008). Decomposition, discovery and detection of visual categories using topic models. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Geusebroek, J. (2006). Compact object descriptors from local colour invariant histograms. In Proceedings of the British machine vision conference (pp. 1029–1038).
Grauman, K., & Darrell, T. (2005). The pyramid match kernel: Discriminative classification with sets of image features. In Proceedings of the international conference on computer vision (pp. 1458–1465).
Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset (Technical Report 7694). California Institute of Technology. http://www.vision.caltech.edu/Image_Datasets/Caltech256/.
Hoiem, D., Efros, A. A., & Hebert, M. (2006). Putting objects in perspective. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2137–2144).
Kohli, P., Ladicky, L., & Torr, P. (2008). Robust higher order potentials for enforcing label consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Lampert, C. H., Blaschko, M. B., & Hofmann, T. (2008). Beyond sliding windows: Object localization by efficient subwindow search. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Laptev, I. (2006). Improvements of object detection using boosted histograms. In Proceedings of the British machine vision conference (pp. 949–958).
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2169–2178).
Leibe, B., Leonardis, A., & Schiele, B. (2004). Combined object categorization and segmentation with an implicit shape model. In ECCV2004 workshop on statistical learning in computer vision, Prague, Czech Republic (pp. 17–32).
Liu, X., Wang, D., Li, J., & Zhang, B. (2007). The feature and spatial covariant kernel: Adding implicit spatial constraints to histogram. In Proceedings of the international conference on image and video retrieval.
Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Marszalek, M., & Schmid, C. (2007). Semantic hierarchies for visual object recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Pinto, N., Cox, D., & DiCarlo, J. (2008). Why is real-world visual object recognition hard? PLoS Computational Biology, 4(1), 151–156.
Russell, B., Torralba, A., Murphy, K., & Freeman, W. T. (2008). LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1–3), 157–173. http://labelme.csail.mit.edu/.
Salton, G., & McGill, M. J. (1986). Introduction to modern information retrieval. New York: McGraw-Hill.
Scharstein, D., & Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1–3), 7–42. http://vision.middlebury.edu/stereo/.
Shotton, J., Winn, J. M., Rother, C., & Criminisi, A. (2006). TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proceedings of the European conference on computer vision (pp. 1–15).
Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In Proceedings of the international conference on computer vision (Vol. 2, pp. 1470–1477). http://www.robots.ox.ac.uk/~vgg.
Smeaton, A. F., Over, P., & Kraaij, W. (2006). Evaluation campaigns and TRECVID. In MIR '06: Proceedings of the 8th ACM international workshop on multimedia information retrieval (pp. 321–330).
Snoek, C., Worring, M., & Smeulders, A. (2005). Early versus late fusion in semantic video analysis. In Proceedings of the ACM international conference on multimedia (pp. 399–402).
Snoek, C., Worring, M., van Gemert, J., Geusebroek, J., & Smeulders, A. (2006). The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of ACM multimedia.
Sorokin, A., & Forsyth, D. (2008). Utility data annotation with Amazon Mechanical Turk. In Proceedings of the first IEEE workshop on Internet vision (at CVPR 2008).
Spain, M., & Perona, P. (2008). Some objects are more equal than others: Measuring and predicting importance. In Proceedings of the European conference on computer vision (pp. 523–536).
Stoettinger, J., Hanbury, A., Sebe, N., & Gevers, T. (2007). Do colour interest points improve image retrieval? In Proceedings of the IEEE international conference on image processing (pp. 169–172).
Sudderth, E. B., Torralba, A. B., Freeman, W. T., & Willsky, A. S. (2008). Describing visual scenes using transformed objects and parts. International Journal of Computer Vision, 77(1–3), 291–330.
Torralba, A. B. (2003). Contextual priming for object detection. International Journal of Computer Vision, 53(2), 169–191.
Torralba, A. B., Murphy, K. P., & Freeman, W. T. (2007). Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 854–869.
van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2008). Evaluation of color descriptors for object and scene recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.
van de Weijer, J., & Schmid, C. (2006). Coloring local feature extraction. In Proceedings of the European conference on computer vision.

van Gemert, J., Geusebroek, J., Veenman, C., Snoek, C., & Smeulders, A. (2006). Robust scene categorization by learning image statistics in context. In CVPR workshop on semantic learning applications in multimedia.
Viitaniemi, V., & Laaksonen, J. (2008). Evaluation of techniques for image classification, object detection and object segmentation (Technical Report TKK-ICS-R2). Department of Information and Computer Science, Helsinki University of Technology. http://www.cis.hut.fi/projects/cbir/.
Viola, P. A., & Jones, M. J. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2), 137–154.
von Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. In Proceedings of the ACM CHI (pp. 319–326).
Wang, D., Li, J., & Zhang, B. (2006). Relay boost fusion for learning rare concepts in multimedia. In Proceedings of the international conference on image and video retrieval.
Winn, J., & Everingham, M. (2007). The PASCAL visual object classes challenge 2007 (VOC2007) annotation guidelines. http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2007/guidelines.html.
Yao, B., Yang, X., & Zhu, S. C. (2007). Introduction to a large scale general purpose ground truth dataset: methodology, annotation tool, and benchmarks. In Proceedings of the 6th international conference on energy minimization methods in computer vision and pattern recognition. http://www.imageparsing.com/.
Yilmaz, E., & Aslam, J. (2006). Estimating average precision with incomplete and imperfect judgments. In Fifteenth ACM international conference on information and knowledge management (CIKM).
Zehnder, P., Koller-Meier, E., & Van Gool, L. (2008). An efficient multi-class detection cascade. In Proceedings of the British machine vision conference.
Zhang, J., Marszalek, M., Lazebnik, S., & Schmid, C. (2007). Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2), 213–238.
