Histograms of Oriented Gradients for Human Detection
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids
of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation
on performance, concluding that fine-scale gradients, fine
orientation binning, relatively coarse spatial binning, and
high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new
approach gives near-perfect separation on the original MIT
pedestrian database, so we introduce a more challenging
dataset containing over 1800 annotated human images with
a large range of pose variations and backgrounds.
1 Introduction
Detecting humans in images is a challenging task owing
to their variable appearance and the wide range of poses that
they can adopt. The first need is a robust feature set that
allows the human form to be discriminated cleanly, even in
cluttered backgrounds under difficult illumination. We study
the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) descriptors provide excellent performance relative to other existing feature sets including wavelets [17,22]. The proposed
descriptors are reminiscent of edge orientation histograms
[4,5], SIFT descriptors [12] and shape contexts [1], but they
are computed on a dense grid of uniformly spaced cells and
they use overlapping local contrast normalizations for improved performance. We make a detailed study of the effects
of various implementation choices on detector performance,
taking pedestrian detection (the detection of mostly visible
people in more or less upright poses) as a test case. For simplicity and speed, we use linear SVM as a baseline classifier
throughout the study. The new detectors give essentially perfect results on the MIT pedestrian test set [18,17], so we have
created a more challenging set containing over 1800 pedestrian images with a large range of poses and backgrounds.
Ongoing work suggests that our feature set performs equally
well for other shape-based object classes.
[Figure 1 schematic: Input image → Normalize gamma & colour → Compute gradients → Weighted vote into spatial & orientation cells → Contrast normalize over overlapping spatial blocks → Collect HOGs over detection window → Linear SVM → Person / non-person classification.]
Figure 1. An overview of our feature extraction and object detection chain. The detector window is tiled with a grid of overlapping blocks
in which Histogram of Oriented Gradient feature vectors are extracted. The combined vectors are fed to a linear SVM for object/non-object
classification. The detection window is scanned across the image at all positions and scales, and conventional non-maximum suppression
is run on the output pyramid to detect object instances, but this paper concentrates on the feature extraction process.
The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or edge directions, even without precise knowledge of the corresponding gradient or edge positions. In practice this is implemented by dividing the image window into small spatial
regions (cells), for each cell accumulating a local 1-D histogram of gradient directions or edge orientations over the
pixels of the cell. The combined histogram entries form the
representation. For better invariance to illumination, shadowing, etc., it is also useful to contrast-normalize the local
responses before using them. This can be done by accumulating a measure of local histogram energy over somewhat
larger spatial regions (blocks) and using the results to normalize all of the cells in the block. We will refer to the normalized descriptor blocks as Histogram of Oriented Gradient (HOG) descriptors. Tiling the detection window with
a dense (in fact, overlapping) grid of HOG descriptors and
using the combined feature vector in a conventional SVM
based window classifier gives our human detection chain
(see fig. 1).
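As a concrete illustration of this chain, the following is a minimal NumPy sketch of one detection window's descriptor. All function and parameter names are our own, and it hard-assigns votes to the nearest orientation bin and cell rather than interpolating, so it is a simplified reading of the pipeline, not the authors' implementation:

```python
import numpy as np

def hog_descriptor(window, cell=8, block=2, nbins=9, eps=1e-5):
    """Sketch of the fig. 1 chain for one 64x128 grayscale window.
    Hard-assigns each pixel's vote to its nearest orientation bin
    and cell (the paper interpolates bilinearly)."""
    img = np.sqrt(window.astype(np.float64))          # sqrt gamma compression
    gx = np.zeros_like(img); gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]            # centred [-1, 0, 1] masks
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0      # unsigned gradient

    h, w = img.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, nbins))
    bin_idx = np.minimum((ang / (180.0 / nbins)).astype(int), nbins - 1)
    for i in range(ch):                               # accumulate cell histograms
        for j in range(cw):
            sl = np.s_[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            np.add.at(hist[i, j], bin_idx[sl].ravel(), mag[sl].ravel())

    feats = []                                        # overlapping blocks, stride = 1 cell
    for i in range(ch - block + 1):
        for j in range(cw - block + 1):
            v = hist[i:i+block, j:j+block].ravel()
            feats.append(v / np.sqrt(np.sum(v**2) + eps**2))   # L2 block norm
    return np.concatenate(feats)
```

With the default 8×8-pixel cells and 2×2-cell blocks on a 64×128 window this yields 7×15 = 105 overlapping blocks of 36 values each, a 3780-dimensional vector to hand to the linear SVM.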
The use of orientation histograms has many precursors
[13,4,5], but it only reached maturity when combined with
local spatial histogramming and normalization in Lowe's
Scale Invariant Feature Transformation (SIFT) approach to
wide baseline image matching [12], in which it provides
the underlying image patch descriptor for matching scale-invariant keypoints. SIFT-style approaches perform remarkably well in this application [12,14]. The Shape Context
work [1] studied alternative cell and block shapes, albeit initially using only edge pixel counts without the orientation
histogramming that makes the representation so effective.
The success of these sparse feature based representations has
somewhat overshadowed the power and simplicity of HOGs
as dense image descriptors. We hope that our study will help
to rectify this. In particular, our informal experiments suggest that even the best current keypoint based approaches are
likely to have false positive rates at least 1–2 orders of magnitude higher than our dense grid approach for human detection, mainly because none of the keypoint detectors that we
are aware of detect human body structures reliably.
The HOG/SIFT representation has several advantages. It
captures edge or gradient structure that is very characteristic
of local shape, and it does so in a local representation with
an easily controllable degree of invariance to local geometric
and photometric transformations: translations or rotations
make little difference if they are much smaller than the local
spatial or orientation bin size. For human detection, rather coarse spatial sampling, fine orientation sampling and strong local photometric normalization turn out to be the best strategy, presumably because this permits limbs and body segments to change appearance and move from side to side quite a lot provided that they maintain a roughly upright orientation.
Figure 2. Some sample images from our new human detection database. The subjects are always upright, but with some partial occlusions
and a wide range of variations in pose, appearance, clothing, illumination and background.
5 Overview of Results
Before presenting our detailed implementation and performance analysis, we compare the overall performance of
our final HOG detectors with that of some other existing
methods. Detectors based on rectangular (R-HOG) or circular log-polar (C-HOG) blocks and linear or kernel SVM
are compared with our implementations of the Haar wavelet,
PCA-SIFT, and shape context approaches. Briefly, these approaches are as follows:
Generalized Haar Wavelets. This is an extended set of oriented Haar-like wavelets similar to (but better than) that used
in [17]. The features are rectified responses from 9×9 and
12×12 oriented 1st and 2nd derivative box filters at 45° intervals and the corresponding 2nd derivative xy filter.
PCA-SIFT. These descriptors are based on projecting gradient images onto a basis learned from training images using
PCA [11]. Ke & Sukthankar found that they outperformed
SIFT for key point based matching, but this is controversial
[14]. Our implementation uses 16×16 blocks with the same
derivative scale, overlap, etc., settings as our HOG descriptors. The PCA basis is calculated using positive training images.
Shape Contexts. The original Shape Contexts [1] used binary edge-presence voting into log-polar spaced bins, irrespective of edge orientation. We simulate this using our C-HOG descriptor (see below) with just 1 orientation bin. 16
angular and 3 radial intervals with inner radius 2 pixels and
outer radius 8 pixels gave the best results. Both gradient-strength and edge-presence based voting were tested.
[Figure 3 plots miss rate against false positives per window (FPPW) on log axes. Left (MIT): Lin. R-HOG, Lin. C-HOG, Lin. EC-HOG, Wavelet, PCA-SIFT, Lin. G-ShapeC, Lin. E-ShapeC, MIT best (part), MIT baseline. Right (INRIA): Ker. R-HOG, Lin. R2-HOG, Lin. R-HOG, Lin. C-HOG, Lin. EC-HOG, Wavelet, PCA-SIFT, Lin. G-ShapeC, Lin. E-ShapeC.]
Figure 3. The performance of selected detectors on (left) MIT and (right) INRIA data sets. See the text for details.
We evaluated several input pixel representations including grayscale, RGB and LAB colour spaces optionally with
power law (gamma) equalization. These normalizations have
only a modest effect on performance, perhaps because the
subsequent descriptor normalization achieves similar results.
We do use colour information when available. RGB and
LAB colour spaces give comparable results, but restricting
to grayscale reduces performance by 1.5% at 10^-4 FPPW.
Square root gamma compression of each colour channel improves performance at low FPPW (by 1% at 10^-4 FPPW)
but log compression is too strong and worsens it by 2% at
10^-4 FPPW.
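The three compression options compared above can be sketched as follows (function and mode names are ours; `log1p` is used rather than a bare log so that zero-valued pixels stay defined):

```python
import numpy as np

def gamma_compress(channel, mode="sqrt"):
    """Per-channel photometric compression variants discussed in the
    text: square-root gamma helps slightly at low FPPW, log is too
    strong, 'none' leaves the channel untouched."""
    c = channel.astype(np.float64)
    if mode == "sqrt":
        return np.sqrt(c)
    if mode == "log":
        return np.log1p(c)        # log(1 + I) avoids log(0)
    return c
```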
Detector performance is sensitive to the way in which gradients are computed, but the simplest scheme turns out to be the best. We tested gradients computed using Gaussian smoothing followed by one of several discrete derivative masks. Several smoothing scales were tested including σ=0 (none). Masks tested included various 1-D point
derivatives (uncentred [−1, 1], centred [−1, 0, 1] and cubic-corrected [1, −8, 0, 8, −1]) as well as 3×3 Sobel masks and
2×2 diagonal ones (0 1; −1 0) and (−1 0; 0 1) (the most compact centred 2-D derivative masks). Simple 1-D [−1, 0, 1] masks at
σ=0 work best. Using larger masks always seems to decrease performance, and smoothing damages it significantly:
for Gaussian derivatives, moving from σ=0 to σ=2 reduces
the recall rate from 89% to 80% at 10^-4 FPPW. At σ=0,
cubic-corrected 1-D width-5 filters are about 1% worse than
[−1, 0, 1] at 10^-4 FPPW, while the 2×2 diagonal masks are
1.5% worse. Using uncentred [−1, 1] derivative masks also
decreases performance (by 1.5% at 10^-4 FPPW), presumably because orientation estimation suffers as a result of the
x and y filters being based at different centres.
For colour images, we calculate separate gradients for
each colour channel, and take the one with the largest norm
as the pixel's gradient vector.
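A sketch of this per-channel selection, using the centred [−1, 0, 1] masks favoured above (the function name and the `np.indices` gather are our own illustration):

```python
import numpy as np

def colour_gradient(img):
    """For an H x W x C image, compute centred [-1, 0, 1] derivatives
    per channel and keep, at each pixel, the gradient of the channel
    with the largest magnitude, as described in the text."""
    img = img.astype(np.float64)
    gx = np.zeros_like(img); gy = np.zeros_like(img)
    gx[:, 1:-1, :] = img[:, 2:, :] - img[:, :-2, :]
    gy[1:-1, :, :] = img[2:, :, :] - img[:-2, :, :]
    mag = np.hypot(gx, gy)                 # per-channel gradient magnitude
    best = mag.argmax(axis=2)              # dominant channel at each pixel
    ii, jj = np.indices(best.shape)
    return gx[ii, jj, best], gy[ii, jj, best], mag[ii, jj, best]
```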
The next step is the fundamental nonlinearity of the descriptor. Each pixel calculates a weighted vote for an edge
orientation histogram channel based on the orientation of the
gradient element centred on it, and the votes are accumulated into orientation bins over local spatial regions that we
call cells. Cells can be either rectangular or radial (log-polar
sectors). The orientation bins are evenly spaced over 0°–180°
("unsigned" gradient) or 0°–360° ("signed" gradient).
To reduce aliasing, votes are interpolated bilinearly between
the neighbouring bin centres in both orientation and position. The vote is a function of the gradient magnitude at the
pixel, either the magnitude itself, its square, its square root,
or a clipped form of the magnitude representing soft presence/absence of an edge at the pixel. In practice, using the
[Figure 4 shows miss rate against false positives per window (FPPW) on log axes for panels (a)–(f).]
Figure 4. For details see the text. (a) Using fine derivative scale significantly increases the performance (c-cor is the 1-D cubic-corrected
point derivative). (b) Increasing the number of orientation bins increases performance significantly up to about 9 bins spaced over
0°–180°. (c) The effect of different block normalization schemes (see 6.4). (d) Using overlapping descriptor blocks decreases the miss rate
by around 5%. (e) Reducing the 16 pixel margin around the 64×128 detection window decreases the performance by about 3%. (f) Using
a Gaussian kernel SVM, exp(−γ||x1 − x2||^2), improves the performance by about 3%.
magnitude itself gives the best results. Taking the square root
reduces performance slightly, while using binary edge presence voting decreases it significantly (by 5% at 10^-4 FPPW).
Fine orientation coding turns out to be essential for good
performance, whereas (see below) spatial binning can be
rather coarse. As fig. 4(b) shows, increasing the number
of orientation bins improves performance significantly up to
about 9 bins, but makes little difference beyond this. This
is for bins spaced over 0°–180°, i.e. the sign of the gradient is ignored. Including signed gradients (orientation range
0°–360°, as in the original SIFT descriptor) decreases the
performance, even when the number of bins is also doubled
to preserve the original orientation resolution. For humans,
the wide range of clothing and background colours presumably makes the signs of contrasts uninformative. However,
note that including sign information does help substantially
in some other object recognition tasks, e.g. cars and motorbikes.
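The bilinear vote between neighbouring orientation bin centres can be sketched for a single pixel as follows (the spatial half of the interpolation is omitted for brevity; names are our own):

```python
import numpy as np

def orientation_vote(angle_deg, magnitude, nbins=9, signed=False):
    """Split one pixel's gradient-magnitude vote linearly between the
    two nearest orientation bin centres, with wrap-around at the
    0/180 (or 0/360) boundary. Bin centres sit at (k + 0.5) * width."""
    span = 360.0 if signed else 180.0
    width = span / nbins
    pos = (angle_deg % span) / width - 0.5
    lo = int(np.floor(pos)) % nbins        # nearer lower bin (wraps)
    hi = (lo + 1) % nbins                  # nearer upper bin (wraps)
    frac = pos - np.floor(pos)
    hist = np.zeros(nbins)
    hist[lo] += magnitude * (1.0 - frac)   # closer centre gets the larger share
    hist[hi] += magnitude * frac
    return hist
```

An angle exactly on a bin centre (e.g. 10° with 9 unsigned bins) puts the whole vote in one bin; an angle on a bin boundary (e.g. 20°) splits it evenly.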
Figure 5. The miss rate at 10^-4 FPPW as the cell and block sizes
change. The stride (block overlap) is fixed at half of the block size.
3×3 blocks of 6×6 pixel cells perform best, with 10.4% miss rate.
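The block count and descriptor length implied by a given cell/block/stride choice is simple arithmetic; a sketch (the function name is ours, and the quoted example is the commonly cited 64×128 default with 8×8-pixel cells, 2×2-cell blocks and half-block stride):

```python
def hog_dims(win_w, win_h, cell=8, block_cells=2, stride_cells=1, nbins=9):
    """Number of overlapping blocks and total descriptor length for a
    detection window (illustrative arithmetic, not the authors' code)."""
    bw = block_cells * cell                  # block side in pixels
    stride = stride_cells * cell             # half-block overlap by default
    nx = (win_w - bw) // stride + 1          # block positions across
    ny = (win_h - bw) // stride + 1          # block positions down
    n_blocks = nx * ny
    return n_blocks, n_blocks * block_cells**2 * nbins
```

With the defaults this gives 105 blocks and a 3780-dimensional window descriptor.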
C-HOG. Our circular block (C-HOG) descriptors are reminiscent of Shape Contexts [1] except that, crucially, each
spatial cell contains a stack of gradient-weighted orientation cells instead of a single orientation-independent edgepresence count. The log-polar grid was originally suggested
by the idea that it would allow fine coding of nearby structure to be combined with coarser coding of wider context,
and the fact that the transformation from the visual field to
the V1 cortex in primates is logarithmic [21]. However, small
descriptors with very few radial bins turn out to give the best
performance, so in practice there is little inhomogeneity or
context. It is probably better to think of C-HOGs simply as
an advanced form of centre-surround coding.
We evaluated two variants of the C-HOG geometry,
ones with a single circular central cell (similar to
the GLOH feature of [14]), and ones whose central cell is divided into angular sectors as in shape
contexts. We present results only for the circularcentre variants, as these have fewer spatial cells
than the divided centre ones and give the same performance in practice. A technical report will provide further details. The C-HOG layout has four parameters: the
numbers of angular and radial bins; the radius of the central
bin in pixels; and the expansion factor for subsequent radii.
At least two radial bins (a centre and a surround) and four
angular bins (quartering) are needed for good performance.
Including additional radial bins does not change the performance much, while increasing the number of angular bins
decreases performance (by 1.3% at 10^-4 FPPW when going from 4 to 12 angular bins). 4 pixels is the best radius
for the central bin, but 3 and 5 give similar results. Increasing the expansion factor from 2 to 3 leaves the performance
essentially unchanged. With these parameters, neither Gaussian spatial weighting nor inverse weighting of cell votes by
cell area changes the performance, but combining the two
reduces it slightly. These values assume fine orientation sampling. Shape contexts (1 orientation bin) require much finer
spatial subdivision to work well.
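The log-polar cell assignment just described can be sketched as a small lookup from a pixel offset to its (radial, angular) spatial cell (function and parameter names are ours; the defaults mirror the centre-plus-surround geometry found to work best):

```python
import numpy as np

def chog_cell(dx, dy, n_angular=4, n_radial=2, r0=4.0, expansion=2.0):
    """Map a pixel offset (dx, dy) from the C-HOG centre to its
    (radial, angular) spatial cell: a central bin of radius r0, with
    subsequent radial boundaries growing by the expansion factor."""
    r = np.hypot(dx, dy)
    radial, edge = 0, r0
    while r > edge and radial < n_radial - 1:   # outermost bin absorbs the rest
        radial += 1
        edge *= expansion
    theta = np.arctan2(dy, dx) % (2 * np.pi)
    angular = int(theta / (2 * np.pi / n_angular)) % n_angular
    return radial, angular
```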
Block Normalization schemes. We evaluated four different block normalization schemes for each of the above HOG
geometries. Let v be the unnormalized descriptor vector,
||v||_k be its k-norm for k = 1, 2, and ε be a small constant.
The schemes are: (a) L2-norm, v → v / sqrt(||v||_2^2 + ε^2); (b)
L2-Hys, L2-norm followed by clipping (limiting the maximum values of v to 0.2) and renormalizing, as in [12]; (c)
L1-norm, v → v / (||v||_1 + ε); and (d) L1-sqrt, L1-norm followed by square root, v → sqrt(v / (||v||_1 + ε)), which amounts
to treating the descriptor vectors as probability distributions
and using the Bhattacharyya distance between them. Fig. 4(c)
shows that L2-Hys, L2-norm and L1-sqrt all perform equally
well, while simple L1-norm reduces performance by 5%,
and omitting normalization entirely reduces it by 27%, at
10^-4 FPPW. Some regularization ε is needed as we evaluate descriptors densely, including on empty image patches, but the results are insensitive to ε's value over a large range.
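The four schemes translate directly into code; a sketch (our own function, with an illustrative ε):

```python
import numpy as np

def block_normalize(v, scheme="L2-Hys", eps=1e-3):
    """The four block normalization schemes from the text. L2-Hys
    clips the L2-normalized vector at 0.2 and renormalizes, as in [12].
    L1-sqrt assumes non-negative histogram entries."""
    v = np.asarray(v, dtype=np.float64)
    if scheme == "L2-norm":
        return v / np.sqrt(np.sum(v**2) + eps**2)
    if scheme == "L2-Hys":
        u = v / np.sqrt(np.sum(v**2) + eps**2)
        u = np.minimum(u, 0.2)                      # hysteresis-style clipping
        return u / np.sqrt(np.sum(u**2) + eps**2)   # renormalize
    if scheme == "L1-norm":
        return v / (np.sum(np.abs(v)) + eps)
    if scheme == "L1-sqrt":
        return np.sqrt(v / (np.sum(np.abs(v)) + eps))
    raise ValueError(f"unknown scheme: {scheme}")
```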
Figure 6. Our HOG detectors cue mainly on silhouette contours (especially the head, shoulders and feet). The most active blocks are
centred on the image background just outside the contour. (a) The average gradient image over the training examples. (b) Each pixel
shows the maximum positive SVM weight in the block centred on the pixel. (c) Likewise for the negative SVM weights. (d) A test image.
(e) Its computed R-HOG descriptor. (f,g) The R-HOG descriptor weighted by respectively the positive and the negative SVM weights.
References
[1] S. Belongie, J. Malik, and J. Puzicha. Matching shapes. 8th ICCV, Vancouver, Canada, pages 454–461, 2001.
[2] V. de Poortere, J. Cant, B. Van den Bosch, J. de Prins, F. Fransens, and L. Van Gool. Efficient pedestrian detection: a test case for SVM based categorization. Workshop on Cognitive Vision, 2002. Available online: http://www.vision.ethz.ch/cogvis02/.
[3] P. Felzenszwalb and D. Huttenlocher. Efficient matching of pictorial structures. CVPR, Hilton Head Island, South Carolina, USA, pages 66–75, 2000.
[4] W. T. Freeman and M. Roth. Orientation histograms for hand gesture recognition. Intl. Workshop on Automatic Face- and Gesture-Recognition, IEEE Computer Society, Zurich, Switzerland, pages 296–301, June 1995.
[5] W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. Computer vision for computer games. 2nd International Conference on Automatic Face and Gesture Recognition, Killington, VT, USA, pages 100–105, October 1996.
[6] D. M. Gavrila. The visual analysis of human movement: A survey. CVIU, 73(1):82–98, 1999.
[7] D. M. Gavrila, J. Giebel, and S. Munder. Vision-based pedestrian detection: the PROTECTOR+ system. Proc. of the IEEE Intelligent Vehicles Symposium, Parma, Italy, 2004.
[8] D. M. Gavrila and V. Philomin. Real-time object detection for smart vehicles. CVPR, Fort Collins, Colorado, USA, pages 87–93, 1999.
[9] S. Ioffe and D. A. Forsyth. Probabilistic methods for finding people. IJCV, 43(1):45–68, 2001.