Professional Documents
Culture Documents
Region Proposing Network1
Region Proposing Network1
with R-CNN
Peilei Dong, Wenmin Wang ∗
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Lishui Road 2199, Nanshan District, Shenzhen, Guangdong Province, China 518055
pldong@pku.edu.cn, wangwm@ece.pku.edu.cn
Abstract—Region-based convolutional neural networks (R- sults, if we use region proposals generated only for pedestrian
CNNs) have achieved great success in object detection recently. class in pedestrian detection.
These deep models depend on region proposal algorithms to Our motivation derives from the observation that pedestrian
hypothesize object locations. In this paper, we combine a spe-
cial region proposal algorithm with R-CNN, and apply it to detection algorithms can also be used to generate region
pedestrian detection. The special algorithm is used to generate proposals. In this paper, the region proposal algorithm adopted
region proposals only for pedestrian class. It is different from the is based on a pedestrian detection method, and generates
popular region proposal algorithm selective search that detects region proposals only for pedestrian class. We combine the
generic object locations. The experimental results prove that special region proposal algorithm with R-CNN, and apply it
region proposals generated by our method are more applicable
than selective search for pedestrian detection. Our method to pedestrian detection. This combined method relies on two
performs faster training and testing than the deep model based critical steps: (1) we use the region proposal algorithm to
on, and it achieves competitive performance compared to the generate a set of region proposals as efficiently as possible,
state-of-the-arts in pedestrian detection. and (2) pass the raw image data with this set of candidate
region proposals to the convolutional neural network. The
Index Terms—Pedestrian Detection, R-CNN, Region Proposal network includes two sibling output layers. One of them
Algorithm, ACF, Better Region Proposals produces softmax probability estimates over pedestrian class.
And the other outputs four real-valued numbers representing
the bounding-box positons of pedestrian class.
I. I NTRODUCTION Extensive experiments demonstrate that our method yields
The pedestrian detection task aims to detect pedestrian competitive results compared to the state-of-the-arts in pedes-
instances and locate their positions in images. In the past few trian detection. It achieves 14.1% miss-rate on the INRIA
years, there have been a lot of research works on pedestrian dataset, 15.3% miss-rate on the PKU dataset, and 45.6% miss-
detection. Accurate pedestrian detection could be applied to rate on the ETH dataset, respectively. The region proposals
many fields, including automotive safety, video surveillance generated through our method perform better than selective
and robotics. search in pedestrian detection. Besides, it reduces the number
Recently, R-CNNs have been widely used to object detec- of region proposals, and costs less time in the training and
tion. For instance, [1] proposes a fast R-CNN method achiev- testing of the convolutional neural network.
ing advanced object detection accuracy over 21 classes. This The paper is organized as follows. Section 2 briefly intro-
method takes the popular region proposal method selective duces related works. Section 3 describes details of the region
search [2] to predict object locations first, then refines them proposal algorithm and the architecture of the network. The
through the convolutional neural network. Inspired by the following section presents the results of our experiments on
great success of R-CNN in generic object detection, we try to three datasets. Finally, Section 5 concludes this paper.
apply it to pedestrian detection. However, the method selective
II. RELATED WORK
search generates candidate region proposals not for a single
class, but possible locations of all kinds of objects. Thus the A. The Categories of Pedestrian Detection Algorithms
region proposals generated by it may degrade the quality of Current methods for pedestrian detection can be generally
the detector, and contain many redundant cases if used with R- grouped into two categories. The first category is known as
CNN in pedestrian detection. Moreover, redundant proposals conventional approaches, which extract hand-crafted features
cost more time in training and testing of the convolutional from images to train SVM or boosting classifiers. These fea-
neural network. We speculate that it may produce decent re- tures such as Harr [3], HOG [4] and HOG-LBP [5], achieves
progressive performance in pedestrian detection. Deformable
part models (DPM) [6] consider the appearance of each
This work was supported by Shenzhen Peacock Plan. part and the deformation among parts for detection. Chen
978-1-5090-5316-2/16/$31.00
c 2016 IEEE VCIP 2016, Nov. 27 – 30, 2016, Chengdu, China
et al. [7] models the context information in a multi-order
manner. Moreover, the Integral Channel Features (ICF) [8]
and Aggregated Channel Features (ACF) [9], both of which
consist of gradient histogram, gradients, and LUV, can be
efficiently used to detection. Nam et al. [10] introduces an
efficient feature transform that removes correlations in local
image neighbourhoods.
The other category is deep models. Deep learning methods
improve the performance of pedestrian detection owing to
Fig. 1. Region-based convolutional neural network architecture. An image
learning features from raw pixels. Ouyang et al. [11] learns and multiple proposals are inputted into the network. Each proposal is
features and the visibility of different body parts to handle pooled into a fixed-size feature map by RoI pooling and then fed into fully
occlusion. ConvNet [12] employs convolutional sparse coding connected layers (fc). The network has two output vectors per proposal:
softmax probabilities and bounding-box regression offsets. The architecture
to unsupervised pre-train CNN for pedestrian detection. Tian is trained end-to-end with a multi-task loss.
et al. [13] optimizes pedestrian detection with semantic tasks.
miss rate
miss rate
miss rate
.20 .20
.20
.10 72% VJ
.10 90% VJ
46% HOG .10 64% HOG
39% HogLbp 55% HogLbp
20% ConvNet 67% DPM 51% ACF
.05 17% ACF 46% ACF .05 50% ConvNet
14% OURs 15% OURs 46% OURs
.05
10 -3 10 -2 10 -1 10 0 10 1 10 -2 10 -1 10 0 10 1 10 -3 10 -2 10 -1 10 0 10 1
false positives per image false positives per image false positives per image
Fig. 2. The comparison of ours for pedestrian detection with recent state-of-the-art methods on (a): INRIA dataset, (b): PKU dataset and (c): ETH dataset
1 1 1
.80 .80 .80
.64 .64 .64
.50 .50
.50
.40 .40
.40
.30 .30
.30
miss rate
miss rate
miss rate
.20 .20
.20
.10 .10
.10
Fig. 3. The results between using selective search (SS) and our method to predict proposals on (a): INRIA dataset, (b): PKU dataset and (c): ETH dataset
TABLE I
C OMPARISON OF T RAINING AND T ESTING TIME R EFERENCES
[1] R. Girshick, “Fast r-cnn,” in ICCV, 2015, pp. 1440–1448.
INRIA PKU ETH [2] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders,
SS OURs SS OURs SS OURs “Selective search for object recognition,” IJCV, vol. 104, no. 2, pp. 154–
train (h) 1.7 1.5 1.8 1.5 1.6 1.3 171, 2013.
test (s/im) 0.22 0.07 0.28 0.12 0.14 0.05 [3] P. Viola, M. J. Jones, and D. Snow, “Detecting pedestrians using patterns
of motion and appearance,” IJCV, vol. 63, no. 2, pp. 153–161, 2005.
[4] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in CVPR, 2005, pp. 886–893.
using region proposals generated by selective search. It shows [5] X. Wang, T. X. Han, and S. Yan, “An hog-lbp human detector with
partial occlusion handling,” in ICCV, 2009, pp. 32–39.
our method reduces redundant region proposals and speeds [6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan,
up the training and testing of the network. We conclude the “Object detection with discriminatively trained part-based models,”
testing presents more prominent improvements than training. PAMI, vol. 32, no. 9, pp. 1627–1645, 2010.
[7] G. Chen, Y. Ding, J. Xiao, and T. Han, “Detection evolution with multi-
We infer the reason of this phenomenon is that the testing is order contextual co-occurrence,” in CVPR, 2013, pp. 1798–1805.
just a forward propagation which region proposals react on. [8] P. Dollár, Z. Tu, P. Perona, and S. Belongie, “Integral channel features,”
in BMVC, 2009.
V. C ONCLUSION [9] P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids
for object detection,” PAMI, vol. 36, no. 8, pp. 1532–1545, 2014.
In this work, we present a method that combine a special [10] W. Nam, P. Dollár, and J. H. Han, “Local decorrelation for improved
pedestrian detection,” in NIPS, 2014, pp. 424–432.
region proposal algorithm with R-CNN. The region proposal [11] W. Ouyang and X. Wang, “A discriminative deep model for pedestrian
algorithm based on ACF aims to predict region proposals detection with occlusion handling,” in CVPR, 2012, pp. 3258–3265.
used in R-CNN. Different from the popular region proposal [12] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, “Pedestrian
detection with unsupervised multi-stage feature learning,” in CVPR,
algorithm selective search that detects generic object locations, 2013, pp. 3626–3633.
the algorithm can be used to generate region proposals only [13] Y. Tian, P. Luo, X. Wang, and X. Tang, “Pedestrian detection aided by
for pedestrian class. We illustrate the details of our region deep learning semantic tasks,” in CVPR, 2015, pp. 5079–5087.
[14] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr, “Bing: Binarized
proposal algorithm and the architecture of the network. The normed gradients for objectness estimation at 300fps,” in CVPR, 2014,
experimental results demonstrate our algorithm improves the pp. 3286–3293.
quality of region proposals. Our combined method achieves [15] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals
from edges,” in ECCV, 2014, pp. 391–405.
competitive results compared to the state-of-the-arts in pedes- [16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
trian detection, and it consumes less execution time in training S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
and testing of R-CNN. fast feature embedding,” in ACM MM, 2014, pp. 675–678.