Download as pdf or txt
Download as pdf or txt
You are on page 1of 9

Journal of Intelligent & Fuzzy Systems 31 (2016) 2543–2551 2543

DOI:10.3233/JIFS-169095
IOS Press

A hierarchical model to learn object


proposals and its applications
Liang Peng∗ and Xiaojun Qi
Department of Computer Science, Utah State University, Logan, UT, USA

PY
Abstract. Generating class-agnostic object proposals followed by classification has recently become a common paradigm

CO
for object detection. Current state-of-the-art approaches typically generate generic objects, which serve as candidates for
object classification. Since these object proposals are generic whereas the categories for classification are domain specific,
there is a gap between the generation of object proposals and the classification of object proposals. In this paper, by taking
advantages of the intrinsic structure and the complexity of each category of objects, we propose a novel tree-based hierarchical
model to learn object proposals, from top proposals produced by the existing object proposals generation methods. First,
we develop a tree-structured representation for each object to capture its hierarchical structure feature. Second, we propose
a 23D compact feature vector to represent objects’ visual features. Third, we formulate a learning schema which evaluates
OR
the objectness of each proposal. Experiments demonstrate the significant improvement of the proposed approach over the
state-of-the-art method in terms of object detection rate. An application is proposed based on this approach to help children
learn and recognize objects by their visual appearances and their sub-parts structures.

Keywords: Hierarchical tree model, object proposals, object detection, learning


TH

1. Introduction the number of potential candidate windows in each


image is extremely large; Second, to detect objects
AU

Object detection, as the task of locating and rec- from n categories, n detectors need to be ran sepa-
ognizing object categories in images and videos, is a rately on all windows of each image, which makes the
major research field in computer vision. Multi-class computational cost grow linearly with the number of
object detection is challenging since it needs to not categories. Haar-cascades with boosting techniques
only find the locations of the objects but also spec- (e.g., AdaBoost) [10, 18] significantly boosted the
ify the category of the object at each location. A lot speed with integral images and quickly eliminated
of research work has been done in object detection a large number of non-object windows. These tech-
and recognition and achieved a lot of progress in niques have achieved successful detection accuracy
recent two decades. In earlier work, the problem was on certain types of objects such as cars and human
treated as a binary classification problem for each faces. Segmentation-based [9, 16] and saliency-
object category versus non-object category on a huge based [1, 11] techniques were proposed to reduce the
number of windows (i.e., sliding windows) [7, 19] search space for possible windows. These “filtered”
in different sizes and aspect ratios scanned over each windows are also called proposals. In recent years,
image. This approach has two major drawbacks: First, the class-agnostic approaches [3, 17] were introduced
to generate proposals which are likely to contain
∗ Corresponding author. Liang Peng, Department of Com-
objects of interest. A (n + 1)-way classification with
puter Science, Utah State University, Logan, UT, USA. Tel.:
+(785)236 1972; Fax: +(435)797 3265; E-mail: liang.peng@
n object categories and 1 background category is then
aggiemail.usu.edu. followed to classify each object. The ever-growing

1064-1246/16/$35.00 © 2016 – IOS Press and the authors. All rights reserved
2544 L. Peng and X. Qi / A hierarchical model to learn object proposals and its applications

due to the limited labeled training data sets. For exam-


ple, the VOC 07 data set [4] has 20 object categories
for training and the number of object categories in
generated object proposals is significantly more than
20. This difference means that the object propos-
als usually contain additional categories of objects,
which are not in the labeled categories for classifi-
cation. Figure 1 shows 20 generic object proposals
generated by an existing object proposals generation
method. These generic object proposals include the
aeroplane itself, a “propeller” part, and other parts of
the aeroplane. However, the labeled categories in the

PY
training data contain the “aeroplane” category instead
of the “propeller” category and other parts categories.
Fig. 1. An example of 20 generic object proposals generated by Similarly, a “bicycle wheel” might be detected as a
EdgeBox. proposal whereas the classification labels only con-
tain the “bicycle” category; A “train window” might

CO
labeled data size and computing power (e.g., GPU be detected as a proposal but the classification labels
computing) made the aforementioned approaches only contain the “train” category. The false posi-
work with the deep neural network-based architec- tives caused by these gaps have significant negative
tures [5, 6, 8] (e.g., Convolution Neural Network impact on object detection in terms of both recall and
(CNN)). However, this type of approaches rely on precision. Consequently, it will further affect object
the deep architecture in both steps. The class-agnostic recognition accuracy in the object proposals classi-
generation of proposals is expensive and slow. fication stage. By reducing these false positives, we
OR
Hence, object proposals generation remains a could improve precision and retrieve more appropri-
bottleneck in object detection. Due to the rapid devel- ate candidate proposals to improve recall as well.
opment of mobile computing and cloud computing, In the classification stage, all generated object pro-
some applications of object detection needs partial posals, which are not in the labeled categories, are
computing to be done on mobile devices [13–15]. treated as the “background” category. However, some
TH

Quickly and accurately generating object propos- of these “background” objects are in out-of-domain
als has been of a great topic of interest. A new categories (i.e, object categories are not in the set of
BING-feature-based objectiveness measure [2] was object categories in the domain of problems being
proposed to quickly generate object proposals (at addressed) and are therefore not truly “background”.
300 frames per second) with a high detection rate Hence, reducing these false positives in the object
AU

(i.e., 90% with 50 proposals and 96.2% with 1000 proposals generation stage will also further improve
proposals on the VOC 07 data set [4]). Another the classification accuracy.
approach, named EdgeBox [20], used edges as a We observe each object category differs from oth-
sparse yet informative representation, and the number ers not only by its visual appearance as a whole
of contours that are wholly contained in a bound- part, but also by its intrinsic structure such as their
ing box as a measure to compute the objectness. It sub-parts and the complexity of its structures. For
achieved the state-of-the-art object recall rate (e.g., example, a bicycle differs from a train when being
75% recall at Intersection over Union (IoU) of 0.7 looked as a whole part, which counts for visual
using 1000 proposals on the VOC 07 data set [4]) with appearance as a whole. In addition, a bicycle con-
the speed of approximate 0.25 seconds per image. tains two “wheels” and one “frame” whereas a train
Existing methods for object proposals (e.g., BING contains “multiple rectangular windows”. A bound-
[2] and EdgeBox [20]) typically use measures of ing box of the blue “sky” object may not contain any
likelihood of a bounding box containing objects to sub-parts as an object since it’s relatively pure and
retrieve top n bounding boxes as candidate proposals. it’s true background. Hence, in additional to visual
Due to the class-agnostic nature of these methods, all appearance, sub-part structures and the complexity
generated object proposals are generic objects. On the of an object could be employed to construct richer
other hand, during the classification stage, the object objectness to better represent the objects to reduce
categories are usually bounded in a certain domain the aforementioned two types of false positives and
L. Peng and X. Qi / A hierarchical model to learn object proposals and its applications 2545

improve the recall rate for any domain specific object a tree. Section 2.3 formalizes the proposed learn-
detection task. ing schema to compute a confidence score of each
Driven by this intuition, we develop a model which object using the compact features and the tree-based
takes an object’s visual appearance as a whole and its hierarchical model.
parts-based hierarchical structure into account. There
are four major novelties for the proposed work. First, 2.1. Compact feature representation
we develop a compact 23D feature vector as the visual
appearance representation of an object; Second, we Even though approaches based on deep neural net-
develop an algorithm to map each object to a hier- works could learn feature representations with high
archical tree structure representation, which captures discriminative power, it is quite costly in terms of
the relationships among sub-parts and the complexity time. For object proposals generation, we want to
of the object structure. Third, we formalize a learn- develop a fast learning schema to eventually rank

PY
ing schema to measure each proposal’s objectness by object proposals from the category-specific dataset
comparing visual features and tree structures using higher than object proposals from other categories and
the nearest neighbour method. Last, we introduce a background. Hence, the constructed compact feature
practical application utilizing our approach to help should have high discriminative power to distinguish
children learn and recognize objects via automati- objects from non-objects, and moderate discrimina-

CO
cally generated graph patches. tive power across different categories of objects.
The paper is organized as follows: Section 2 We observe that object and non-object propos-
introduces the proposed method in detail. Section 3 als differ a lot in terms of their properties. First,
presents the experimental settings together with the non-object proposals usually contain background and
results. Section 4 describes a real-word application object proposals usually have one dominate object
utilizing our approach to help children learn and rec- present within the bounding box. Second, non-object
ognize objects. Section 5 concludes the work with proposals are generally more evenly distributed and
OR
summaries, discussions, and future work. have less variations in terms of their color pixels. In
contrast, object proposals are generally more skewly
distributed and have more variations due to the few
2. The proposed method dominate colors in the main object and a small back-
ground region outside of the object boundary. Third,
TH

The proposed method aims to select a few number from the spatial point of view, background proposals
of proposals to achieve the same recall as obtained are usually more evenly distributed among different
by significantly more proposals produced by existing color pixels inside the bounding box while object pro-
object proposal generation method. As a high-level posals would have a few dominant colors occupying
view, we first employ the existing object proposal the majority region of the bounding box. Hence, the
AU

generator on each image to produce top n object pro- variation-related and spatial-related properties could
posals. Since many of these proposals include generic be used to distinguish object proposals from non-
objects and backgrounds, and only a small portion object proposals.
of them include the objects that belong to domain- To this end, we define the first category of fea-
specific categories in the labeled training data, we tures to measure the color variations by computing
develop a learning schema to compute a confidence variances in red, green, and blue channels (three val-
score (i.e., objectness) for each object and re-rank top ues) and entropy of the luminance-based histograms
n objects in each image. This learning schema aims within the proposal (one value). We then define the
to rank category-specific objects towards the top to second category of features to measure the spatial
improve the recall rate for different levels of n. In evenness of the color distribution. Specifically, we
other words, we will select a few number of proposals convert the proposals into a histogram in red, green,
from the top n object proposals to achieve the same and blue color channels. We empirically choose
recall as obtained by n proposals produced by the the top five bins with the highest frequencies and
state-of-the-art methods. The subsections below are compute their relative normalized frequencies. This
organized as follows: Section 2.1 introduces the pro- would produce 15 features (five relative normalized
posed compact features for object’s appearance as a frequencies for each of the three channels).
whole. Section 2.2 describes the methodology to map Finally, we observe that the positions of the object
each object to a hierarchical structure represented by proposals are not evenly distributed within an image.
2546 L. Peng and X. Qi / A hierarchical model to learn object proposals and its applications

For example, the probability for object proposals near For the second category of the proposed features,
the four corners or the edges of the image is far we use 64-bin histogram on each of the RGB channels
less than the probability for object proposals near to compute the five highest normalized frequencies.
the center of an image. Some aspect ratios of the Firstly, we compute the histogram based on the pixel
object proposals may occur much more frequent than counts of 64 bins on each of RGB channels. Secondly,
other aspect ratios due to the limited shapes of objects for each channel, we normalize the frequency of each
in the entire set of object categories. Following this bin by dividing the total pixel counts in the channel.
intuition, we define the third category of features by Lastly, we rank relative frequencies and select top 5
incorporating the distribution of the relative locations relative frequencies in each channel to produce a total
and aspect ratios of proposals into extracted features of 15 values.
to better represent each proposal. Since the image For the third category of the proposed features, we
dimension (i.e., the width and the height) may vary compute the proposed relative horizontal coordinate

PY
among different images, we use the relative positions (Xr), the relative vertical coordinate (Yr), the relative
and dimensions of the proposals within an image width (Wr), and the relative height (Hr) as follows:
to define four additional features: the relative width x
(ratio of the width of the proposal to the width of the Xr = (2)
W
image), the relative height (ratio of the height of the
y

CO
proposal to the height of the image), the relative hori- Yr = (3)
zontal coordinate, and the relative vertical coordinate, H
w
which are the coordinates of the upper-left corner of Wr = (4)
the proposal divided by the width and the height of W
the image, respectively. These four features would h
Hr = (5)
capture the properties regarding the relative position H
and the aspect ratio of a proposal. In all, the compact
OR
feature of the length of 23 is extracted for a proposal
and will be used to classify object and non-object 2.2. Hierarchical tree structure
proposals later on.
To clearly explain the proposed features, we pro- A total of n object proposal bounding boxes may
vide the mathematical definition for some of these be retrieved by the proposal generator for an image.
TH

features. For a given image I with width W and height Since each object or non-object bounding box could
H and a proposal B in the image with x and y repre- include smaller bounding boxes corresponding to
senting the horizontal and vertical coordinates of the sub-parts of an object, we could convert all bounding
upper-left corner of B, respectively, let w and h repre- boxes on each image to several trees in the follow-
sent the width and height of B and Pi,j,c represent the ing way: each node represents a bounding box; If
AU

pixel value at the horizontal and vertical coordinates a bounding box Ba is mostly included in another
(i, j) of I in the channel c of the RGB color space, bounding box Bb (defined by the ratio of the overlap
where c ∈ {R, G, B}.
For the first category of the proposed features, Bounding boxes on images Converted trees
the variance of B in a red color channel could be
expressed as:
y+h x+w
y+h x+w Pi,j,c=R
j=y i=x (Pi,j,c − j=y i=x
w×h )2
(1)
(w × h − 1)

The variances of B in green and blue channels are


similarly computed by substituting c with G and c
with B in equation (1), respectively.
To compute the entropy, we compute the histogram
based on the pixel counts for each of 768 bins (i.e.,
256 possible values times 3 channels), and apply the
standard entropy formula on values stored in 768 bins. Fig. 2. Examples of bounding boxes and their corresponding trees.
L. Peng and X. Qi / A hierarchical model to learn object proposals and its applications 2547

between Ba and Bb to Ba ), the corresponding tree tree. If the computed ratio is smaller than a pre-set
node Ta should be a child node of bounding box Bb ’s threshold for all existing trees, the bounding box
tree node Tb . This condition will be applied to all cannot be inserted into any of the existing trees, and
bounding boxes in an image. Figure 2 shows two we use this bounding box to create a new tree by
examples of this type of mapping. setting it as the root node of the new tree. The detail
of the algorithm is shown in Algorithm 1.
Algorithm 1 Convert multiple bounding boxes in an
image to multiple trees Algorithm 2 Insert a bounding box into an existing
input: bbs: multiple bounding boxes; thresh: a pre- tree
defined threshold for overlap input: bb: a bounding box; treeNode: the root node
output: trees: multiple trees built based on bound- of a tree; thresh: a pre-defined threshold for over-
ing boxes lap

PY
function name: convertBbsToTrees(bbs, thresh) output: altered tree after insertion
sortedBbsDescend ← sortBySizes(bbs) /*sort bbs pre-condition: the ratio of the overlap between
by sizes in descending order*/ treeNode.rootBb and bb to bb > thresh
trees ← new List /*initialize a new list to place all function name: insert(treeNode, bb, thresh)
resultant trees*/ if treeNode.children is NULL /*base case*/ then

CO
for bb in sortedBbsDescend do newNode ← createNewTreeNode(bb)
insertSuccess ← False treeNode.addChild(newNode)
for tree in trees do return
if the ratio of the overlap between tree.rootBb end if
and bb to bb > thresh then for each child c in treeNode do
insert(tree, bb, thresh) /*see Algorithm 2 if the ratio of the overlap between c.rootBb and
OR
for details of this function*/ bb to bb > thresh then
insertSuccess ← True insert(c, bb, thresh) /*recursive call*/
break end if
end if end for
end for newNode ← createNewTreeNode(bb) /* all chil-
if insertSuccess is True then dren of treeNode do not have sufficient overlap
TH

continue /* process next bouding box */ with bb */


else treeNode.addChild(newNode) /* Add newNode
newTree ← createNewTreeNode(bb) as a child of treeNode */
trees.append(newTree) return
end if
AU

end for
The detail of the tree insertion function is summa-
return trees
rized in Algorithm 2, where the inputs are a bounding
box and a tree (i.e., treeNode) and the output is the
We then propose a method to build multiple trees altered tree after the insertion. It should be noted
by considering all bounding boxes in an image. First, that the input bounding box is mostly included in
we sort all bounding boxes by their areas (i.e., width the bounding box corresponding to the root of the
* height) in the descending order. Second, we use input tree based on the pre-condition in Algorithm 1.
the first bounding box (i.e., the biggest bounding Therefore, calling this function will insert a bounding
box) to create a root node as the initial tree and box at the right location of the tree, where its parent
sequentially process the remaining sorted bounding corresponds to the smallest bounding box which has
boxes one-by-one. For each bounding box, we check the sufficient overlap with the inserted bounding box.
if it can be inserted in any of the existing trees We use recursion to implement this function as sum-
by calculating the ratio of the overlap between the marized in Algorithm 2. Let’s denote the function by
bounding box of the root of the tree and the candidate insert(treeNode, bb, thresh). The based case is that
bounding box to the candidate bounding box. If the the treeNode has no children. In this case, we simply
computed ratio is larger than a pre-set threshold, we create a new node for bb and add this new node as a
insert the bounding box into the currently examined child of treeNode. If the treeNode has children, we
2548 L. Peng and X. Qi / A hierarchical model to learn object proposals and its applications

will check if bb is mostly included in the bounding in the testing set, similar to the nearest neighbour
box corresponding to any of the children. If yes, we algorithm, the computation complexity of the pro-
make a recursive call by passing this child, bb and posed algorithm to compute the confidence scores of
thresh as parameters. If bb is not mostly included in all testing bounding boxes is O(MN).
the bounding box correspondings to any of the chil-
dren, we simply create a new node using bb and add
this new node as the child of treeNode. The details 3. Experimental results
of this algorithm are shown in Algorithm 2.
The proposed method has been applied to the
2.3. Learning schema to compute confidence Youtube-Object Dataset V2.2 [12] since we want
scores to test the proposed method on the video dataset.
The dataset contains a total of 720,000 frames with

PY
Based on Sections 2.1 and 2.2, each object proposal 6,975 annotated bounding boxes from 10 categories
bounding box has a tree associated with it. If the root of objects. The annotated frames are divided into
node of a tree Ti corresponds to a bounding box Bi , training and testing sets. Specifically, one object was
all children of Ti correspond to all bounding boxes annotated per image in the training set and multiple
inside of Bi where i ranges from 1 to the total num- objects were annotated per image in the testing set.

CO
ber of trees. We compute feature Fi from box Bi as There are 4306 annotated frames with 4306 bounding
the root node feature. Hence, for each bounding box box annotations in the training set. There are 1781
Bi,j within Bi , Fi,j represents its visual feature and annotated frames with 2669 bounding box annota-
tree Ti,j represents its hierarchical structure, where j tions in the testing set.
ranges from 1 to the total number of nodes in the tree. Due to the nature of the video dataset, the frame
Since each bounding box maps to a tree, we can sequence within the same shot changes gradually and
compare two bounding boxes by comparing their cor- the differences between frames within the same shot
OR
responding trees and their visual features. We define are small. We select the first labeled image within
the distance between two bounding boxes as: each shot as the representative image from that shot.
Some shots do not have any labeled image, we sim-
Dist(Bi , Bj ) ply skip them. As a result, all images we selected are
= w × FeatureDist(Fi , Fj ) from different shots. We call these labeled images
TH

as shot representative images. Among these shot


+(1 − w) × TreeDist(Ti , Tj ) (6) representative images, 870 images were from the
training set and 334 images were from the testing
where Bx , Fx , and Tx represent xth Bounding Box,
set.
Feature, and Tree, respectively. FeatureDist(Fi , Fj )
We apply EdgeBox [20] on every shot represen-
is a distance measure between two feature vectors and
AU

tative image and generate top n bounding boxes


TreeDist(Ti , Tj ) is a distance measure between two
per image. Since this is the video dataset and the
general trees.
sizes of objects is relatively large in each image,
Here, we use the Euclidean distance to compare
top 50 bounding boxes per image from EdgeBox
two visual features and use the edit distance to com-
would detect nearly 70% of true objects. We use top
pare two trees.
n = 50 in our experiment. We empirically set the
For each object Bi , we compute its distance to each
parameters for EdgeBox as follows: the step size
Bk in the training data using Dist(Bi , Bk ) and find the
of sliding window search is 0.8; the non-maximum
n closest Bk s. Among n Bk s, there may be p objects
suppression threshold is set to be 0.55, and min score
and q non-objects (i.e., p + q = n). So, we define a
of boxes to detect is 0.01. As a result of this step,
confidence score as:
we generate up to 50 bounding boxes for each shot
p
ConfB = (7) representative images. We will use these bounding
n boxes as the input to produce the training and the
The higher the confidence score, the more likely testing sets for the proposed method.
the proposal being an object. Last, on each image, we To produce the training set for the proposed
order all object proposals by their confidence scores method, we compare the generated bounding boxes
in the descending order. Suppose there are M bound- with the ground truth annotations from the shot
ing boxes in the training set and N bounding boxes representative images in the training set. We label
L. Peng and X. Qi / A hierarchical model to learn object proposals and its applications 2549

Table 1
Comparison of the recall rates of the proposed method and EdgeBox for top n proposals when n = 1, 2, ..., 50
n 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
EdgeBox .10 .18 .22 .27 .29 .31 .33 .35 .36 .37 .39 .41 .42 .43 .43 .44
Proposed .11 .17 .22 .26 .29 .32 .37 .41 .44 .46 .47 .49 .51 .53 .54 .54
n 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
EdgeBox .46 .48 .49 .50 .51 .51 .52 .53 .55 .55 .57 .58 .58 .59 .59 .60
Proposed .54 .55 .55 .55 .57 .58 .58 .59 .59 .60 .60 .60 .61 .62 .62 .63
n 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
EdgeBox .60 .61 .61 .62 .62 .62 .63 .63 .64 .64 .64 .65 .65 .65 .65 .66
Proposed .63 .64 .64 .64 .64 .64 .65 .66 .66 .66 .66 .66 .67 .67 .67 .67
n 49 50
EdgeBox .66 .67

PY
Proposed .67 .67
The bold values indicates the higher recall rate for two methods.

0.5 and 0.7, we simply ignore them. We then apply


the proposed method described in Section 2 on shot

CO
representative images in the testing set to compute a
new confidence score for every proposal and re-order
them in the descending order. Same labeling process
has been applied to compute the recall.
To evaluate the performance of the proposed
method, we compute the recall rate of top n propos-
als ranked by EdgeBox and the proposed method.
OR
Table 1 shows the comparison of the recall rates
between EdgeBox and the proposed method with n
ranging from 1 to 50. Figure 3 shows the comparison
in curves. We clearly see that the proposed method
outperforms EdgeBox for most values of n. The con-
TH

vergence occurs at n = 50 is due to the fact we use


Fig. 3. Comparison of the recall curves of the proposed method all 50 boxes from EdgeBox to re-order proposals.
and EdgeBox for top n proposals when n = 1, 2,...,50.
It shows that the proposed method could use much
fewer top proposals (e.g., 40 proposals) to achieve the
a bounding boxes as positive (i.e., contains object) similar recall as EdgeBox, using more top proposals
AU

if Intersection over Union (IoU) is greater than 0.7 (e.g., 50 proposals). Figure 4 shows the qualitative
and as negative (i.e., does not contain object) if IoU results of the object proposals generated by the pro-
is less than 0.5. For boxes with IoU values between posed method.

Fig. 4. Qualitative results of the object proposals generated by the proposed method.
2550 L. Peng and X. Qi / A hierarchical model to learn object proposals and its applications

4. One potential application test the possible improvement of final object detec-
tion rate. This is one direction of our future work.
Using the method we described, we can design an Also, since the proposed method is more suited for
application to help children learn different categories video datasets where the objects appear relatively
of objects. Given a set of images and a training set, larger in each frame, we will combine the proposed
a number of object proposals can be automatically method with the tracking method to extend the object
generated by applying the proposed method on each detection from images to videos.
image using a training set. Since the proposed method There are also some limitations of the proposed
uses the hierarchical model to represent object pro- method. Since the proposed method needs to build
posals as trees, each object proposal bounding box the hierarchical structure based on the geometric rela-
may also contain a hierarchy of other bounding boxes. tionships of object proposal bounding boxes, it does
We could crop out several bounding boxes from each not perform well when too many object proposal

PY
image. For each cropped out bounding box, we can bounding boxes are generated per image and too
further crop out several bounding boxes which nat- much overlap among all bounding boxes. To address
urally corresponds to different parts of the objects. these limitations, we can fine tune parameters of exist-
We can display these parts at random locations and prompt children to fill the corresponding object proposal holes by matching each object part to its correct position. This practice could help children gradually become familiar with the concept of each category of “object” and gain a basic notion of how different parts combine to form a certain category of objects.

5. Summary and discussions

In this paper, we propose a tree-based hierarchical model to learn object proposals and design an application to teach children the concepts of objects. The proposed work has the following contributions:

– Developing a compact feature vector to represent the object’s visual appearance
– Proposing a tree-based hierarchical model to capture the object’s internal structural features
– Formalizing a new objectness measure by incorporating both visual and tree feature representations
– Designing a practical application that uses the proposed method to help children learn the concepts of objects

We demonstrate that the proposed method outperforms the state-of-the-art method (i.e., EdgeBox) in terms of detection rate. Specifically, the proposed method needs fewer object proposals to achieve a recall rate comparable to that obtained by other methods with more proposals.

Since object proposal generation usually serves as the first step before classification in an object detection task, it will be interesting to use the proposed method to generate object proposals and then apply deep-neural-network-based classification methods to classify them. It will also be worthwhile to apply the existing object proposal generation methods to detect a reasonable number of bounding boxes with good overlap. In this way, the proposed method might capture a better hierarchical structure of each object and become more generally applicable to other domains.

References

[1] A. Borji, D.N. Sihite and L. Itti, Salient object detection: A benchmark. In Computer Vision – ECCV 2012, Springer, 2012, pp. 414–429.
[2] M.-M. Cheng, Z. Zhang, W.-Y. Lin and P. Torr, BING: Binarized normed gradients for objectness estimation at 300fps. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, IEEE, 2014, pp. 3286–3293.
[3] D. Erhan, C. Szegedy, A. Toshev and D. Anguelov, Scalable object detection using deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, IEEE, 2014, pp. 2155–2162.
[4] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn and A. Zisserman, The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
[5] P.F. Felzenszwalb, R.B. Girshick, D. McAllester and D. Ramanan, Object detection with discriminatively trained part-based models, Pattern Analysis and Machine Intelligence, IEEE Transactions on 32(9) (2010), 1627–1645.
[6] R. Girshick, J. Donahue, T. Darrell and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, IEEE, 2014, pp. 580–587.
[7] C.H. Lampert, M.B. Blaschko and T. Hofmann, Beyond sliding windows: Object localization by efficient subwindow search. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, 2008, pp. 1–8.
[8] H. Lee, R. Grosse, R. Ranganath and A.Y. Ng, Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 609–616.
[9] B. Leibe, A. Leonardis and B. Schiele, Robust object detection with interleaved categorization and segmentation,
International Journal of Computer Vision 77(1-3) (2008), 259–289.
[10] R. Lienhart and J. Maydt, An extended set of Haar-like features for rapid object detection. In Image Processing. 2002. Proceedings. 2002 International Conference on, volume 1, IEEE, 2002, pp. I-900.
[11] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang and H.-Y. Shum, Learning to detect a salient object, Pattern Analysis and Machine Intelligence, IEEE Transactions on 33(2) (2011), 353–367.
[12] A. Prest, C. Leistner, J. Civera, C. Schmid and V. Ferrari, Learning object class detectors from weakly annotated video. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 3282–3289.
[13] H. Qian and D. Andresen, Jade: An efficient energy-aware computation offloading system with heterogeneous network interface bonding for ad-hoc networked mobile devices. In Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2014 15th IEEE/ACIS International Conference on, IEEE, 2014, pp. 1–8.
[14] H. Qian and D. Andresen, Emerald: Enhance scientific workflow performance with computation offloading to the cloud. In Computer and Information Science (ICIS), 2015 IEEE/ACIS 14th International Conference on, IEEE, 2015, pp. 443–448.
[15] H. Qian and D. Andresen, An energy-saving task scheduler for mobile devices. In Computer and Information Science (ICIS), 2015 IEEE/ACIS 14th International Conference on, IEEE, 2015, pp. 423–430.
[16] J. Shotton, A. Blake and R. Cipolla, Contour-based learning for object detection. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 1, IEEE, 2005, pp. 503–510.
[17] C. Szegedy, A. Toshev and D. Erhan, Deep neural networks for object detection. In Advances in Neural Information Processing Systems, 2013, pp. 2553–2561.
[18] P. Viola and M. Jones, Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, IEEE, 2001, pp. I-511.
[19] C. Wojek, G. Dorkó, A. Schulz and B. Schiele, Sliding-windows for rapid object class localization: A parallel technique. In Pattern Recognition, Springer, 2008, pp. 71–81.
[20] C.L. Zitnick and P. Dollár, Edge boxes: Locating object proposals from edges. In Computer Vision – ECCV 2014, Springer, 2014, pp. 391–405.