Bird Region Detection in Images With Multi-Scale HOG Features and SVM Scoring
1 Introduction
In recent years, fine-grained visual classification (FGVC) has come to be recognized
as a challenging task for computer vision. The problem involves classifying
different types of objects of the same family. An application area of the FGVC
task is the identification of species of the same kind of organism (e.g. plants [1],
insects [2], birds [3–6]), which is important for ecological or environmental studies.
Among these, the bird image classification task has been explored in many recent works.
The popularity of the bird image classification task is perhaps due to the associated
challenge of considering images acquired ‘in the wild’, as is the case with the well-
known CUB bird image data set [7]. A related problem (arguably, beneficial to the
classification task) involving such images captured in completely natural settings
is that of detecting birds in the image. The problem involves finding a region of
interest (ROI) which encompasses most of the pixels corresponding to the bird, while
eliminating a large part of the background.
Indeed, the bird ROI detection problem has often been considered within the
overall FGVC framework for bird image data (e.g. [3, 4, 6]), which, as we discuss
in the next subsection, typically involves sophisticated frameworks. However, the
detection task, by itself, can be a primary part of the overall FGVC pipeline. This
is because finding the image ROI locating the bird while eliminating most of the
background can be useful for better localized feature extraction, or as an initial
estimate for pixel-level segmentation [3, 8], or for constraining the image region
for detecting individual parts [5]. Notwithstanding the variety in bird appearance and
background, to benefit the overall processing pipeline, the bird ROI detection
task should, arguably, be simpler than subsequent tasks such as
segmentation, part localization and the overall classification. Thus, it is useful if the
bird ROI detection task is handled by a simple and efficient approach.
Hence, unlike the works on overall FGVC, in this work we demonstrate that
bird ROI detection can indeed be achieved in a straightforward manner. Our proposed
method involves the traditional Histogram of Oriented Gradients (HOG) features,
used in a multi-scale manner, and an SVM classifier on local image regions. The
SVM classifier output is considered as a real-valued weight rather than as the
conventional binary decision. We demonstrate on image examples
containing a variety of birds and backgrounds that the proposed strategy yields
reasonably good quality bird ROI detection, which fulfils the above-mentioned
role of the ROI detection task.
As mentioned above, the task of ROI detection in bird images has been considered as
a part of approaches to FGVC. However, the region detection task has been reported
primarily at the level of bird parts, while complete bird detection or segmentation
is considered in relatively fewer works. As noted earlier, the latter
can be useful in further stages of FGVC such as segmentation, feature extraction or
part detection.
With that in mind, we discuss below some prominent works on part detection,
whole-bird segmentation, and bird detection, as these can be considered different
flavours of the detection task. As mentioned above, in most approaches, bird
ROI detection is considered as a part of the overall FGVC task. Hence, in the
discussion below, we focus only on the ROI detection component considered in such
FGVC methods.
A deformable part descriptor based on the deformable parts model (DPM) is used
for part detection in [4]. The model parameters are learnt using a latent SVM trained
on filtered windows given by the part annotations. The underlying feature space
over which the DPM is defined consists of gradient, local binary patterns and colour
features. There are some FGVC approaches where the part detection is not considered
independently but is intrinsically related to the classification problem [5, 9, 10]. In
[5], given a bounding box around the bird region, a segmentation is performed so
as to divide the region into multiple segments, with the assumption that each segment
contains a semantically meaningful bird part. A latent conditional random field (CRF)
is then employed to learn (and detect) the parts which are more discriminative with
respect to the overall classification objective. The CRF is learnt via expectation
maximization, and some of the parameters are learnt as weights of an SVM. The
work in [9] also intricately links the detection problem with that of classification.
Here, a set of regions which are common across classes are learnt in an unsupervised
manner. Following this, a multi-kernel learning framework using SVM is employed
to learn weights of the features from such regions. The weight of a feature is
based on how discriminative the underlying region is. Thus, in essence, the approach
detects important regions which yield better discrimination. The feature space used
in this work consists of encoded HOG features. Another work following a similar
philosophy is that of [10], wherein image patches which are more discriminative
for classification are learnt using random forests. The features extracted from image
patches involve SIFT and related codewords from a Bag of words framework.
In recent years, some deep learning-based works have also been reported for FGVC,
wherein whole-bird detection or part detection is also considered. For instance,
the region-based CNN (R-CNN) [6] uses the weights from the CNNs trained on the
complete bird as well as bird parts, in an SVM classifier. Moreover, the detection
is further refined using geometric constraints relating the location of parts and the
complete bird region. Another application of CNNs for part detection is reported
in [11], wherein a fully convolutional network (FCN) is learnt to directly compute
part annotations, unlike the R-CNN approach. The work reported in [12] devises a
strategy to select filters from the CNN which correspond to semantically meaningful
parts. These are then used in the computation of part-saliency maps which can serve
as approximate part localizations.
In [13], an approach for coarse segmentation of the overall bird region follows
a supervised Laplacian label propagation framework which involves minimizing a
cost with a smoothness constraint on the labels. The constraint involves some prior
knowledge about labels for some pixels, which is represented using the output of
a supervised SVM classifier. The SVM classifier is trained on encoded HOG and
colour features computed on super-pixels of the original image. Some relatively
simpler coarse segmentation strategies are also followed in [14, 15], wherein colour
information is learnt from regions near the image borders, and is used to represent the
background information. Based on this, a pixel is labelled in the bird or background
class based on the similarity with respect to this representation. While these are
2 Proposed Approach
In this section, we briefly describe the HOG feature, SVM classification and the
multi-scale processing, in the context of their application in this work. We then
summarize the overall approach in a stepwise manner.
Fig. 1 a, c Examples of bird images, and b, d corresponding HOG feature maps for one of the
colour channels
test sample which essentially represents the distance of that test sample from the
hyperplane. The higher this score, the more confidence one can have that the classified
sample belongs to its assigned class. This implies that a higher-scored test
sample better represents the characteristics of the class that it is assigned to.
In this work, we employ the SVM score for the purposes of bird ROI detection.
This is because many image windows which contain some part of the bird region can
be labelled as belonging to the bird class. Hence, one needs to look for the window
that most appropriately locates the bird region. The intuition is that such an image
window should contain a large part of the bird region, and hence a large
number of features from the bird region pixels. Thus, among the various choices of
windows, it is natural to pick the one which best represents the bird features, or in other
words, which has the largest SVM score. In the results section, we show some
choices of selected windows to convey this point visually.
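To make the scoring step concrete, the following is a minimal sketch, and not the implementation used in this work, of how a linear SVM score can be computed as the signed distance of a feature vector from the hyperplane and then used to select a window. The function names svm_score and best_window, as well as the toy weights and feature vectors, are hypothetical and serve only as an illustration.

```python
import math


def svm_score(w, b, x):
    """Signed distance of feature vector x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def best_window(w, b, window_features):
    """Return (index, score) of the window with the largest SVM score."""
    scores = [svm_score(w, b, x) for x in window_features]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]


# Toy data (hypothetical): a 2-D "feature space" instead of real HOG vectors.
w = [3.0, 4.0]   # hyperplane normal of an already trained linear SVM
b = -5.0         # bias term
features = [[1.0, 1.0], [3.0, 4.0], [0.0, 0.0]]  # one vector per window

idx, score = best_window(w, b, features)
print(idx, score)
```

Many SVM libraries report the unnormalized value w.x + b rather than the geometric distance. Since the norm of w is the same for every window, dividing by it does not change which window is ranked highest.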
We note that the ROIs containing birds in the images can vary in size.
Hence, a single window size may not fit the ROI in all the images. To
address this, we follow a multi-scale strategy for the detection problem
in this work.
Such a strategy essentially involves extracting features by considering different
window sizes relative to the image. We keep the window size constant but consider
four image scales. This ensures that the HOG feature aggregation for each window
is carried out at different scales. The four scales that we choose are free parameters.
In this work, these are based on observation of the maximum and minimum areas
of the regions spanned by the bird pixels in the data set. In general, the image can
be downscaled until one of the image dimensions becomes smaller than
the window size (which, in our case, happens in most cases at the lowest of the
four scales).
We treat the HOG feature vectors extracted for each window, at each of the four
levels, as independent samples for the SVM classifier. Thus, for each test image, the
number of test samples for the SVM equals N × L, where N is the number of
windows and L denotes the number of scales (4, in this case).
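The enumeration of test samples across scales can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used in this work: the stride of 16 pixels is an assumed value, the 128 × 192 window is interpreted here as width × height, and the names windows_at_scale and all_windows are hypothetical. The window size and the four scaling factors are the ones used in our experiments.

```python
WINDOW_W, WINDOW_H = 128, 192      # window size; width x height is an assumption
SCALES = (1.0, 0.85, 0.72, 0.62)   # the four image scales
STRIDE = 16                        # assumed sliding step in pixels


def windows_at_scale(img_w, img_h, scale,
                     win_w=WINDOW_W, win_h=WINDOW_H, stride=STRIDE):
    """List (x, y, scale) for every window position in the rescaled image."""
    w, h = int(round(img_w * scale)), int(round(img_h * scale))
    if w < win_w or h < win_h:
        # The rescaled image is smaller than the window: no samples here.
        return []
    return [(x, y, scale)
            for y in range(0, h - win_h + 1, stride)
            for x in range(0, w - win_w + 1, stride)]


def all_windows(img_w, img_h, scales=SCALES):
    """Collect window positions over all scales; each one is an SVM sample."""
    samples = []
    for s in scales:
        samples.extend(windows_at_scale(img_w, img_h, s))
    return samples


samples = all_windows(500, 375)    # a hypothetical 500 x 375 test image
print(len(samples))
```

With a stride that is fixed in pixels, the number of window positions shrinks as the image is downscaled, so the total number of samples is the sum of the per-scale counts. It equals N × L exactly only when every scale yields the same number N of windows.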
Here, we provide the steps of our approach. For the training process, we use the
ground-truth bounding box annotations, which are provided with the data set that we
consider in this work, to differentiate between bird and background regions.
• Training:
– From each training image, randomly extract various image regions which con-
tain bird and background. Note that the bird regions may include some back-
ground parts as well, as we do not use exact segmentations, but only bounding
box annotations.
– Extract HOG features from all bird and background subimages and create train-
ing data set.
– Train a linear SVM model.
• Testing
– Given a test image, move a sliding window across the image and extract the
HOG features from each window at different positions.
– Perform the same at downscaled versions of the image. Downscale the image
until one of the image dimensions becomes smaller than the window
size.
– Calculate the SVM weight for the HOG feature for each sliding window.
– To avoid very closely spaced windows in a local region getting most of the high
weights, apply a non-maximum suppression on local groups of windows, and
select only a single window from such local groups.
– Pick the highest weighted window from the remaining ones, which is the final
detection region for the bird in the image.
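The last two testing steps can be illustrated with the sketch below. It assumes a greedy, overlap-based form of non-maximum suppression with an intersection-over-union threshold of 0.5; both the grouping rule and the threshold are assumptions made for illustration, as are the function names iou, to_original, non_max_suppression and detect, and the toy boxes and weights. It should be read as one possible realization, not as the exact procedure used in this work.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def to_original(x, y, scale, win_w=128, win_h=192):
    """Map a window found at a given image scale back to original pixels."""
    return (x / scale, y / scale, (x + win_w) / scale, (y + win_h) / scale)


def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep a box only if it does not overlap a better kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep  # indices sorted by decreasing score


def detect(boxes, scores, iou_thresh=0.5):
    """Return the highest weighted box that survives suppression."""
    keep = non_max_suppression(boxes, scores, iou_thresh)
    return boxes[keep[0]], scores[keep[0]], keep


# Toy example (hypothetical boxes and weights, in original image coordinates).
boxes = [(0, 0, 128, 192), (8, 0, 136, 192), (300, 100, 428, 292)]
scores = [0.9, 1.4, 0.2]
best_box, best_score, kept = detect(boxes, scores)
print(best_box, best_score, kept)
```

In this toy example, the first two boxes overlap heavily, so only the higher weighted of the two survives, while the distant third box is kept as a separate candidate. Windows found at a downscaled image would first be mapped back with to_original so that overlaps are measured in a common coordinate frame.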
3 Experimental Results
In this section, we first comment briefly on the data set used in this work and the
training–testing protocol. We then discuss some results of the proposed bird detection
approach.
We validate the proposed detection algorithm on a variety of images from the
well-known Caltech-UCSD Birds image data set [7]. The data set contains images
belonging to 200 different classes of birds, in a variety of backgrounds (water, trees,
relatively plain, etc.). To validate our approach, we consider different types of birds
and a variety of such backgrounds.
In this work, we have used 20 bird classes, which include the full variety of background
scenes in the data set. We manually select bird classes which
cover enough variation in terms of their visual characteristics (e.g. very different
species of birds). As we focus on the detection problem (a two-class problem), we
believe that such a validation provides a proof of concept for the proposed approach,
as enough variety of foreground and background data is being considered. The SVM
training was carried out with 50% of the examples from each class (about 30 images in
each class), and the testing was carried out with the rest. Below, we depict and discuss
some of our results on the test images.
The sliding window size chosen in this work is 128 × 192. The four
scaling factors for the multi-scale processing are 1, 0.85, 0.72 and 0.62. The
HOG feature parameters are the same as those in [21]. As mentioned above, we train a
linear SVM model.
3.2 Results
We start with a relatively simple case where the background is rather plain (Fig. 2).
The two bounding boxes in Fig. 2a represent the two detected regions which were
assigned the top two weights by the SVM. Figure 2b depicts the bird image when
retaining only the bounding box corresponding to the highest weight. Note that the
selected bounding box indeed covers most of the bird region, while
keeping the background to a minimum. While the second-best detection also
covers most of the bird pixels, it also includes some part of the background. Also
note that the bird is correctly detected even though the colour of the upper back
and the underside of the bird is similar to that of the branch beneath it. However, the
gradient structure is different, which the HOG feature, arguably, seems to capture.
Fig. 2 a An example bird image with the top two weighted windows, and b the detection when
retaining the top weighted window
Figure 3 shows three examples with somewhat varying backgrounds, each
with the top-ranked bounding box in terms of the SVM weight. In all the cases, the
blue bounding box shows the ground-truth. Figure 3b shows two red bounding boxes
whose weights were quite close. However, the one with the highest weight is the
one covering the beak of the bird. Note that the detection in Fig. 3a, b does miss
some parts near the wing, which could be because the texture of the wing
is similar to that of the background, and the missed part is the one connected
with the background. However, we can still observe that enough of the bird region is
detected to extract appropriate features for a classification task. In Fig. 3c, the
scene is somewhat more challenging, and yet the detection is quite good. Indeed, the
only part missing in the detection is a small part of the tail, which is disconnected
from the main body due to the occlusion caused by the branch. Other than this, we
can observe that the detection is very close to the ground-truth, as is also the case
with the other images.
Finally, in Fig. 4, we show some more detection results, some of which involve
more complex variations. Figure 4b shows a bird with a texture very similar to that of the
bottom half of the background. Similarly, in Fig. 4c the bird itself is dark with little
texture variation. However, in all the cases, the approach yields very encouraging
detection performance, wherein almost the complete bird is included in the bounding
box (except for a small portion of the tail in Fig. 4c). Thus, it is apparent that such a
simple approach is able to perform quite well, even on scene images with varying
backgrounds.
Fig. 3 a, b, c Examples of bird detection with top 2 or 3 weighted windows. The blue window
depicts the highest weighted window which is finally selected
Fig. 4 a–d Some more examples of bird detection with different background
4 Conclusion
In this work, we argue that, as the ROI detection process is a primary step in FGVC,
it ought to be handled in a simple manner while providing good performance
across the variety of birds and backgrounds. Thus, we propose and demonstrate a
rather simple but effective approach for bird ROI detection in images: a framework
which involves a traditional feature representation, applied in a multi-scale manner,
and an SVM classifier used to weigh the image windows in order to find the one which
most appropriately locates the bird ROI. Our qualitative results demonstrate that the
proposed approach is able to provide good quality bird region detections. Since such
an approach is able to eliminate most of the background, it can be an efficient option
for use as a primary step in bird FGVC.
References
9. A. Angelova and A. Niculescu-Mizil, Feature combination with multi-kernel learning for fine-
grained visual classification, IEEE Winter Conference on Applications of Computer Vision
(WACV), 2014.
10. B. Yao, A. Khosla and F. Li, Combining randomization and discrimination for fine-grained
image categorization, IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2011.
11. S. Huang, Z. Xu, D. Tao, and Y. Zhang, Part-Stacked CNN for Fine-Grained Visual Catego-
rization, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
12. X. Zhang, H. Xiong, W. Zhou, W. Lin, Q. Tian, Picking deep filter responses for fine-grained
image recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2016.
13. A. Angelova and S. Zhu, Efficient object detection and segmentation for fine-grained recogni-
tion, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
14. M. Das and R. Manmatha, Automatic segmentation and indexing in a database of bird images,
IEEE Int. Conference on Computer Vision (ICCV), 2001.
15. I. Lillo, J. Niebles, and A. Soto, Bird species classification based on color features, IEEE Int.
Conference on Systems, Man, and Cybernetics (SMC), 2013.
16. R. Yoshihashi, R. Kawakami, M. Iida, and T. Naemura, Evaluation of bird detection using
time-lapse images around a wind farm, EWEA 2015 Annual Event, 2015.
17. A. Takeki, T. Tuan Trinh, R. Yoshihashi, R. Kawakami, M. Iida, and T. Naemura, Detection
of small birds in large images by combining a deep detector with semantic segmentation, Int.
Conference on Image Processing (ICIP), 2016.
18. T. Lin, A. Roy Chowdhury, and S. Maji, Bilinear CNN models for fine-grained visual
recognition, International Conference on Computer Vision (ICCV), 2015.
19. J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and F. Li, The
unreasonable effectiveness of noisy data for fine-grained recognition, European Conference
on Computer Vision (ECCV), 2016.
20. Z. Ge, A. Bewley, C. McCool, B. Upcroft, P. Corke, and C. Sanderson, Fine-grained clas-
sification via mixture of deep convolutional neural networks, IEEE Winter Conference on
Applications of Computer Vision (WACV), 2016.
21. N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2005.
22. Dlib C++ library, http://blog.dlib.net/2014/02/dlib-186-released-make-your-own-object.html
23. C. Chang and C. Lin, LIBSVM: A library for support vector machines, ACM Transactions on
Intelligent Systems and Technology, vol. 2, no. 3, 2011, pp. 1–27.
24. H. Lin, http://www.work.caltech.edu/~htlin/program/libsvm/