1774 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 31, 2022

Diverse Complementary Part Mining for Weakly Supervised Object Localization

Meng Meng, Tianzhu Zhang, Member, IEEE, Wenfei Yang, Graduate Student Member, IEEE, Jian Zhao, Member, IEEE, Yongdong Zhang, Senior Member, IEEE, and Feng Wu, Fellow, IEEE

Abstract— Weakly Supervised Object Localization (WSOL) aims to localize objects with only image-level labels, which has better scalability and practicability than fully supervised methods in the actual deployment. However, a common limitation for available techniques based on classification networks is that they only highlight the most discriminative part of the object, not the entire object. To alleviate this problem, we propose a novel end-to-end part discovery model (PDM) to learn multiple discriminative object parts in a unified network for accurate object localization and classification. The proposed PDM enjoys several merits. First, to the best of our knowledge, it is the first work to directly model diverse and robust object parts by exploiting part diversity, compactness, and importance jointly for WSOL. Second, three effective mechanisms including diversity, compactness, and importance learning mechanisms are designed to learn robust object parts. Therefore, our model can exploit complementary spatial information and local details from the learned object parts, which help to produce precise bounding boxes and discriminate different object categories. Extensive experiments on two standard benchmarks demonstrate that our PDM performs favorably against state-of-the-art WSOL approaches.

Index Terms— Part discovery, diversity learning mechanism, compactness learning mechanism, importance learning mechanism, weakly supervised object localization.

Manuscript received August 13, 2020; revised July 13, 2021 and November 29, 2021; accepted January 7, 2022. Date of publication February 1, 2022; date of current version February 8, 2022. This work was supported in part by the National Defense Basic Scientific Research Program under Grant JCKY2020903B002; in part by the Strategic Priority Research Program of Chinese Academy of Sciences (CAS) under Grant XDC02050500; in part by the National Natural Science Foundation of China under Grant 62022078, Grant 62121002, and Grant 62006244; in part by the Youth Innovation Promotion Association, Chinese Academy of Sciences, under Grant 2018166; and in part by the Young Elite Scientist Sponsorship Program of China Association for Science and Technology under Grant YESS20200140. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Hichem Sahbi. (Corresponding author: Tianzhu Zhang.)
Meng Meng, Tianzhu Zhang, and Wenfei Yang are with the Department of Automation, School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China (e-mail: meng18@mail.ustc.edu.cn; tzzhang@ustc.edu.cn; yangwf@mail.ustc.edu.cn).
Jian Zhao is with the Institute of North Electronic Equipment, Beijing 100000, China (e-mail: zhaojian90@u.nus.edu).
Yongdong Zhang and Feng Wu are with the Department of Electronic Engineering and Information Science, School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China (e-mail: zhyd73@ustc.edu.cn; fengwu@ustc.edu.cn).
Digital Object Identifier 10.1109/TIP.2022.3145238
1941-0042 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Zhengzhou University. Downloaded on February 20,2023 at 05:31:10 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

Object localization aims to recognize objects and identify their locations in the given images [1]. Because of its broad applications such as autonomous driving [2], [3], face recognition [4], [5], and person re-identification [6], object localization has attracted increasing attention in the research community. However, most existing methods tackle this task in a fully supervised setting [7]–[9] by using precise bounding box annotations. Thus, their scalability and practicability are limited in real-world application scenarios because it is expensive and time-consuming to gather massive fine-grained labeled data.

To overcome the above limitations, several recent methods have been proposed by using weakly supervised learning models [10]–[14] that require only image-level annotations indicating the presence or absence of one class in images. Without box-level annotations, it is challenging to identify the location of an object. Recently, class-discriminative localization methods [10], [13], [15] have been proposed to handle this challenging task. The main idea is that classification networks inherently have the ability to localize discriminative image regions. Unfortunately, these methods tend to be biased toward the most discriminative object part to increase the classification accuracy while ignoring the less discriminative parts, which leads to localization accuracy degradation. In pursuit of highlighting the whole object, several techniques have been proposed, which can be mainly categorized into the following two categories. The first category [16]–[18] aims to expand the range of the most discriminative part by exploring context information. These methods have achieved remarkable progress not only in weakly supervised object localization (WSOL) but also in weakly supervised semantic segmentation tasks. However, since there are large differences among object parts, it is hard to expand the range of the most discriminative part to other object parts, resulting in incomplete bounding boxes. The other category [14], [19] adopts adversarial erasing, which erases the most discriminative part and forces the model to discover other relevant parts. Although promising results have been reported, it is hard to decide when to stop seeking object parts. Too many or too few steps could degrade the localization performance.

To the best of our knowledge, most existing methods for WSOL over-rely on class-discriminative activation maps to discover discriminative parts and lack explicit supervisory signals to model diverse and robust object parts, failing to localize the whole object densely and accurately within an image. As shown in Figure 1, an object consists of multiple discriminative object parts. For example, a bird consists of three major parts, including the head, body, and legs. Therefore, we exploit simple yet effective mechanisms to model robust object parts for WSOL. The basic idea is
that we can learn complementary spatial information and local details from object parts, which help to produce precise bounding boxes and discriminate different object classes. To achieve this goal, we mainly consider three critical factors to learn comprehensive and robust object parts for WSOL. (1) The diversity of object parts is essential for activating the full object extent. Without the diversity constraint, the WSOL model tends to learn only the most discriminative part (e.g., head), which results in poor localization performance. (2) The compactness of object parts should be modeled. Intuitively, a part corresponds to a group of gathered pixels, so the appearances of pixels from a specific part should be similar, which helps to learn a strong part detector. In other words, the embedding features of pixels from one part are encouraged to be closer and form a compact distribution in the embedding space. (3) The importance of object parts should be considered when multiple parts are fused into the full extent of an object. The observation is that important object parts could include more characteristics of objects and produce more reliable results. During training, multiple part-aware category predictions should be merged to perform the classification task, and multiple part-aware activation maps should be fused to perform the localization task during testing. Therefore, the importance of object parts should be considered in both localization and classification tasks.

Fig. 1. The illustration of the main idea of our PDM. Intuitively, an object consists of multiple object parts, and our PDM can learn robust object parts by jointly modeling part diversity, compactness, and importance for weakly supervised object localization and classification.

Motivated by the above discussions, we propose a novel end-to-end part discovery model (PDM) by jointly exploiting part diversity, compactness, and importance in a unified model. Specifically, we design three effective learning mechanisms to obtain robust and discriminative object parts for WSOL, including diversity, compactness, and importance learning mechanisms. We first introduce a part-aware attention module to generate part-aware activation maps. Each part-aware activation map denotes the spatial distribution of one specific part. That is, the part-aware activation map has high response values at the pixels belonging to the corresponding part. To relieve the background interference, we introduce a sparsity loss on part-aware activation maps to constrain the network to focus on task-relevant regions. In the diversity learning mechanism, part-aware features can be obtained by attention pooling from the feature map, where multiple part-aware activation maps are treated as different spatial attention. Then, we introduce a diversity loss to expand the discrepancy among part-aware features, which encourages part-aware activation maps to generate different spatial responses. As a result, part-aware activation maps are trained to localize different object parts. In the compactness learning mechanism, since each part-aware activation map denotes the confidences of pixels belonging to the corresponding part, we assign a pseudo label to each pixel based on part-aware activation maps. Therefore, each pixel has an identifier indicating which part it belongs to. A triplet loss is adopted to pull the features of pixels from one part closer and push the features of pixels from different parts away in the embedding space. In the importance learning mechanism, we utilize an importance prediction module to learn the importance of each object part, so that the fused object localization map and category prediction can be adaptively obtained for accurate localization and classification.

The major contributions of this work can be summarized as follows: (1) We propose a novel end-to-end Part Discovery Model (PDM) for accurate object localization and classification by modeling robust object parts, which can help learn complementary spatial information and local details for WSOL. (2) Three effective mechanisms including diversity, compactness, and importance learning mechanisms are designed to learn discriminative and robust object parts in a unified model. To the best of our knowledge, this is the first work to directly model robust object parts by exploiting part diversity, compactness, and importance jointly for WSOL. (3) Extensive experimental results on two standard benchmarks demonstrate that the proposed PDM performs favorably against state-of-the-art WSOL methods.

II. RELATED WORK

In this section, we briefly overview methods related to part-aware fine-grained object categorization, weakly supervised object detection, and weakly supervised object localization.

A. Part-Aware Fine-Grained Object Categorization

Fine-grained object categorization aims at capturing marginal visual differences within subordinate categories, which are located at some local parts of object instances [20]. Therefore, many methods have been proposed that discover discriminative object parts for this task. Generally, existing methods can be roughly categorized into two categories, including proposal-based methods [20]–[23] and attention-based methods [23]–[26]. For proposal-based methods, Xiao et al. [21] and Peng et al. [23] both utilize a selective search algorithm [27] on handcrafted features to mine candidate part proposals. Some methods simply use sub-regions of object proposals as part proposals [22]. Besides, Simon et al. [28] and Zhang et al. [20] introduce part proposals anchored at salient locations as candidates of discriminative local parts. While proposal-based methods are very simple, they fail to produce regions of different shapes. To activate flexible object parts, several attention-based techniques have been explored not only for fine-grained object categorization but also for person re-identification [29]. Recasens et al. [26] adopt a spatial
transformer to discover and zoom in on local parts that are particularly informative to the task. Ding et al. [25] learn sparse attention from class peak responses to produce informative object parts. To represent the subtle differences within subordinate categories, the above-mentioned methods usually amplify local parts and feed them into the pipeline of part-CNNs [30]. However, most of these methods do not focus on capturing the integral extent of objects, and thereby they cannot be directly used to achieve robust WSOL. In this paper, we argue that a diverse set of object parts not only can help to discriminate different classes but also can be merged together to generate accurate bounding boxes.

B. Weakly Supervised Object Detection

Weakly Supervised Object Detection (WSOD) aims to infer object locations and categories simultaneously with only image-level labels during training. WSOD considers a scheme where multiple objects of one specific category may exist in the given image. Under this challenging circumstance, most WSOD methods [31]–[36] adopt a Multiple Instance Learning (MIL) framework where input images contain a bag of instances (object proposals). The model is trained with a classification loss to select the most confident positive proposals. WSDDN [37] is the first end-to-end architecture to select proposals by parallel detection and classification branches.

In practice, MIL solutions are found to converge easily to discriminative parts of objects. This is because the loss function of MIL is non-convex, and thus MIL solutions are usually stuck in local minima [38]. To address this issue, Tang et al. [31] propose multi-stage instance classifiers to help the network see larger parts of objects during training. Moreover, building on [31], the work of [32] further introduces the proposal cluster learning and uses the proposal clusters as supervision that indicates the rough locations where objects most likely appear. In [39], Wan et al. try to reduce the randomness of localization during learning. The work in [40] combines curriculum learning with the MIL framework. From the perspective of optimization, Wan et al. [41] relax the original MIL loss function with a set of smoothed loss functions. In [42], Gao et al. make use of the instability of MIL-based detectors and design a multi-branch network with orthogonal initialization.

Besides, there are many attempts [43] to improve the localization accuracy of the weakly supervised detectors from other perspectives. Arun et al. [44] obtain much better performance by employing a probabilistic objective to model the uncertainty in the location of objects. In [45], a detection-segmentation cyclic collaborative framework is designed to supervise the learning of object detection by utilizing the segmentation maps as prior information. In [46], Zhang et al. propose to mine accurate pseudo ground truths from a well-trained MIL-based network to train a fully supervised object detector. In contrast, the work of [47] integrates WSOD and Fast-RCNN re-training into a single network that can jointly optimize the regression and classification tasks. However, most of these methods need to generate candidate proposals by using selective search [27] or edge boxes [48], which may be computationally expensive and time-consuming. Furthermore, current WSOD methods utilize high-resolution inputs for accurate object locations, which could also bring a heavy computational burden. Thus, these methods are difficult to apply to large-scale datasets for WSOL.

C. Weakly Supervised Object Localization

Weakly Supervised Object Localization (WSOL) can predict object positions and categories only using image-level supervision. Different from WSOD, WSOL has the assumption that there is only one object of the specific category in the whole image [49]. The work in [1] is the first end-to-end approach for WSOL. However, the localization is limited to a point rather than the full extent of the object. Later, Zhou et al. [10] propose Class Activation Maps (CAM) with a global average pooling layer and a final fully connected layer (weights of the classifier) to produce localization maps. To remove the reliance on specific network architectures, Gradient-weighted Class Activation Mapping (Grad-CAM) [13] is proposed. Recently, CCAM [15] is proposed to combine the activation maps of all classes for robust localization. Despite the simplicity and effectiveness of CAM-based methods, they tend to be biased toward the most discriminative object part [50].

To mitigate this problem, several methods explore object context information to expand the range of the most discriminative part, and the context information can come from different spatial positions [16], [18] or different layers [17]. For example, Wei et al. [16] employ a dilated convolution to consider spatial contexts at various ratios. FickleNet [18] explores context information by randomly combining units. Differently, some other methods adopt an erasing strategy [14], [19], [51]–[54], which aims to erase the most discriminative part so that the model needs to seek the relevant object parts from what remains. In [14], the model is designed to randomly hide grid-like patches during training. To erase the most discriminative part effectively, Zhang et al. [50] learn parallel adversarial classifiers to find complementary parts for target objects, and more sophisticated erasing strategies are designed in later works [11], [55].

Apart from the above works, DANet [56] applies a divergent activation to learn complementary and discriminative visual patterns for WSOL. PSOL [49] achieves the classification and localization task by two separate networks, and the localization network is trained by pseudo bounding boxes that are generated by a class-agnostic co-localization method DDT [57]. GCNet [58] takes geometric shape into account and designs a multi-task loss function. However, most WSOL methods can only discover the most discriminative parts for the classification task, failing to activate the whole object. Different from the above approaches, our method can directly model robust object parts by exploiting part diversity, compactness, and importance jointly in a unified model. By learning complementary spatial information and local details from object parts, our model can localize the whole object densely and accurately.
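As a concrete reference point for the CAM-based methods surveyed in this section, the following minimal sketch computes a class activation map as the channel-wise sum of the last feature map weighted by the classifier weights of one class, following the description of CAM [10] given here. It uses plain Python lists instead of a tensor library, and the function names and toy values are illustrative assumptions rather than code from any of the cited works.

```python
def class_activation_map(features, fc_weights, cls):
    """Return the H x W activation map of class `cls`.

    features:   H x W x L nested lists (last convolutional feature map)
    fc_weights: C x L nested lists (weights of the final classifier)
    """
    w = fc_weights[cls]
    return [[sum(w_l * f_l for w_l, f_l in zip(w, pixel)) for pixel in row]
            for row in features]


def gap_logits(features, fc_weights):
    """Class scores from global average pooling followed by the same classifier."""
    height, width, channels = len(features), len(features[0]), len(features[0][0])
    pooled = [
        sum(features[i][j][l] for i in range(height) for j in range(width))
        / (height * width)
        for l in range(channels)
    ]
    return [sum(w_l * p_l for w_l, p_l in zip(w, pooled)) for w in fc_weights]


# Toy input (hypothetical values): a 1 x 2 feature map, L = 2 channels, C = 2 classes.
features = [[[1.0, 0.0], [0.0, 2.0]]]
fc_weights = [[1.0, 0.0], [0.0, 1.0]]
cam0 = class_activation_map(features, fc_weights, 0)
cam1 = class_activation_map(features, fc_weights, 1)
logits = gap_logits(features, fc_weights)
```

With average pooling and no bias term, the spatial mean of each map equals the corresponding class score, so a classifier can reach a high score from a small, highly activated region. This is one way to read the bias toward the most discriminative part noted above.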


Fig. 2. The architecture of our PDM. Given the feature map $F$, we generate $K$ part-aware activation maps through the part-aware attention module $N_{pa}$. A sparsity loss ($L_{spa}$) is introduced on part-aware activation maps to constrain the network to focus on task-relevant regions. (1) In the diversity learning module, we multiply each part-aware activation map with the feature map and employ a global average pooling (GAP) layer [59] to acquire part-aware features, where the diversity loss ($L_{div}$) is introduced to expand the discrepancy among part-aware features. (2) In the compactness learning module, we assign a pseudo label to each pixel based on part-aware activation maps. Then, the triplet loss ($L_{tri}$) utilizes a set of triplets to make the features of pixels from one part move closer to the corresponding part-aware feature. For example, the first part-aware feature ($P^{1}$), the feature of the positive pixel with label 1 ($F^{m}$), and the feature of the negative pixel with label 3 ($F^{n}$) form a triplet. (3) In the importance learning module, we introduce the part-specific classification module $N_{cls}$ and the importance prediction module $N_{IP}$ to generate part-aware category predictions and importance weights. The classification and localization results are acquired by a weighted sum strategy.
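The three modules in Fig. 2 can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain Python lists, stores the part maps as K x H x W, takes the 0.2 background threshold from the example value in the text, uses a hypothetical margin of 0.5, treats only foreground pixels with a different part label as negatives, and feeds hypothetical importance weights and class predictions to the weighted sum. All function names are illustrative.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity <u, v> / (||u|| ||v||); eps guards against zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def attention_pool(features, part_map):
    """Part-aware feature: spatial mean of the feature map weighted by one part map."""
    h, w, channels = len(features), len(features[0]), len(features[0][0])
    return [
        sum(features[i][j][l] * part_map[i][j] for i in range(h) for j in range(w))
        / (h * w)
        for l in range(channels)
    ]


def diversity_loss(part_feats):
    """Mean cosine similarity over all ordered pairs of distinct part features."""
    k = len(part_feats)
    total = sum(cosine(part_feats[m], part_feats[n])
                for m in range(k) for n in range(k) if n != m)
    return total / (k * (k - 1))


def pseudo_labels(part_maps, threshold=0.2):
    """Label 0 marks background; otherwise the 1-based index of the strongest part."""
    k, h, w = len(part_maps), len(part_maps[0]), len(part_maps[0][0])
    labels = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            best = max(range(k), key=lambda p: part_maps[p][i][j])
            if part_maps[best][i][j] > threshold:
                labels[i][j] = best + 1
    return labels


def triplet_loss(features, part_feats, labels, margin=0.5):
    """Hardest-positive / hardest-negative triplet loss with part features as anchors."""
    pixels = [(features[i][j], labels[i][j])
              for i in range(len(features)) for j in range(len(features[0]))]
    loss = 0.0
    for part, anchor in enumerate(part_feats, start=1):
        pos = [1.0 - cosine(anchor, f) for f, z in pixels if z == part]
        # Assumption: negatives are foreground pixels assigned to another part.
        neg = [1.0 - cosine(anchor, f) for f, z in pixels if z not in (0, part)]
        if pos and neg:  # skip parts that have no positive or no negative pixel
            loss += max(max(pos) - min(neg) + margin, 0.0)
    return loss


def fuse_predictions(part_preds, importance):
    """Importance-weighted sum of the K part-aware class predictions."""
    num_classes = len(part_preds[0])
    return [sum(t * p[c] for t, p in zip(importance, part_preds))
            for c in range(num_classes)]


# Toy input (hypothetical values): 1 x 3 feature map, L = 2 channels, K = 2 parts.
features = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
part_maps = [[[0.9, 0.1, 0.1]], [[0.1, 0.8, 0.1]]]
part_feats = [attention_pool(features, m) for m in part_maps]
labels = pseudo_labels(part_maps)          # [[1, 2, 0]]: last pixel is background
l_div = diversity_loss(part_feats)
l_tri = triplet_loss(features, part_feats, labels)
fused = fuse_predictions([[0.7, 0.3], [0.2, 0.8]], [0.75, 0.25])
```

In this toy case each part feature already lies much closer to its own pixel than to the other part's pixel, so the triplet term vanishes at the chosen margin. A larger margin makes it positive again.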

III. METHOD

In this section, we provide details of the proposed model for weakly supervised object localization and classification.

A. Notations and Preliminaries

In weakly supervised object localization, given an image $X$, let $F \in \mathbb{R}^{H \times W \times L}$ denote the feature map from a pre-trained feature extraction network (VGGnet [60] or ResNet50 [61]), where $H$, $W$, and $L$ denote the height, width, and channel number of the feature map, respectively. During training, each image is associated with a ground truth label $y \in \mathbb{R}^{C}$, where $C$ refers to the number of categories. At test time, given an image, the outputs are a predicted category label $\tilde{y}$ and the corresponding bounding box $B = (x_o, y_o, h_o, w_o)$, where $(x_o, y_o)$ represents the center coordinate, and $(h_o, w_o)$ represents the height and width.

B. Overview

Existing WSOL methods over-rely on class-discriminative activation maps that can only discover the most discriminative parts, leading to incomplete object localization. To pursue the whole object, we propose to model robust and diverse object parts, from which we can learn complementary spatial information and local details to produce precise bounding boxes and discriminate different object classes. To achieve this goal, we propose a novel part discovery model (PDM) by jointly exploiting part diversity, compactness, and importance in a unified deep model. The details are as follows.

C. Part Discovery Model

The architecture of our PDM is described in Figure 2. Based on the feature map $F$, we generate part-aware activation maps $a \in \mathbb{R}^{H \times W \times K}$ to mine $K$ object parts through a Part-aware Attention Module, denoted as $N_{pa}$. To discover object parts in a simple and effective way, this module is implemented as a convolution layer followed by a sigmoid function to change the channel size of the feature map to $K$.

$$a = N_{pa}(F), \tag{1}$$

where $a = \{a^{k}\}_{k=1}^{K}$ denotes a set of part-aware activation maps, and $a^{k} \in \mathbb{R}^{H \times W}$ corresponds to the $k$-th part-aware activation map. Each part-aware activation map denotes the spatial distribution of one specific part and the confidences of pixels belonging to this part. However, guided only by the classification loss, the part-aware activation maps may highlight background regions. To mitigate this issue, we add a sparsity constraint on the part-aware activation maps for background suppression, as defined in Eq. (2):

$$L_{spa} = \frac{1}{KHW} \sum_{k=1}^{K} \sum_{i=1}^{H} \sum_{j=1}^{W} |a^{k}_{ij}|, \tag{2}$$

The intuition behind this idea is simple. The classification loss requires that object-related regions are highlighted to classify the image correctly, but the sparsity loss requires part-aware activation maps to activate as few pixels as possible. Thus, these two loss terms together can suppress background regions and highlight object-related regions. To further make the learned part-aware activation maps robust for weakly supervised object localization and classification, we design three effective mechanisms, including diversity, compactness, and importance learning mechanisms.

1) Diversity Learning Mechanism: Based on the feature map $F$ and $a = \{a^{k}\}_{k=1}^{K}$, we generate a set of part-aware features $P = \{P^{k}\}_{k=1}^{K}$ by the attention weighted pooling.
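Eq. (1) and Eq. (2) can be illustrated with a small, dependency-free sketch. It assumes that the convolution in $N_{pa}$ is a 1x1 convolution (the kernel size is not stated in the text), stores the maps as K x H x W rather than H x W x K, and uses hypothetical filter weights and feature values. The function names are illustrative only.

```python
import math


def part_attention(features, weights, biases):
    """Sketch of N_pa as a 1x1 convolution plus sigmoid (kernel size is an assumption).

    features: H x W x L nested lists; weights: K x L; biases: length K.
    Returns K maps, each H x W, with values in (0, 1).
    """
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    return [
        [[sigmoid(sum(w_l * f_l for w_l, f_l in zip(w_k, pixel)) + b_k)
          for pixel in row]
         for row in features]
        for w_k, b_k in zip(weights, biases)
    ]


def sparsity_loss(maps):
    """Eq. (2): mean absolute activation over all K maps and H x W pixels."""
    k, h, w = len(maps), len(maps[0]), len(maps[0][0])
    return sum(abs(v) for part in maps for row in part for v in row) / (k * h * w)


# Toy input (hypothetical values): H = 1, W = 2, L = 2 channels, K = 2 filters.
features = [[[0.0, 0.0], [2.0, -2.0]]]
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
maps = part_attention(features, weights, biases)
loss = sparsity_loss(maps)
```

Because the sigmoid output is always positive, the absolute value in Eq. (2) acts as a plain mean here, and minimizing it pushes every activation toward zero unless the classification loss needs that pixel.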


The $k$-th part-aware feature $P^{k} = [P^{k}_{1}, P^{k}_{2}, \ldots, P^{k}_{L}]$ is given as

$$P^{k}_{l} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{ijl} \cdot a^{k}_{ij}, \tag{3}$$

where $l = 1, 2, \ldots, L$. Without part-level supervision, all part-aware activation maps tend to focus on the same discriminative part for image classification, failing to localize the full object extent. To avoid such a degenerate case, we impose a diversity loss to expand the discrepancy among part-aware features $\{P^{k}\}_{k=1}^{K}$:

$$L_{div} = \frac{1}{K(K-1)} \sum_{m=1}^{K} \sum_{n=1, n \neq m}^{K} \frac{\langle P^{m}, P^{n} \rangle}{\|P^{m}\|_{2} \|P^{n}\|_{2}}. \tag{4}$$

The intuition behind this loss is simple: if the $m$-th and the $n$-th part-aware features are identical, they have similar spatial responses and give a high attention weight to the same object part. In that case $L_{div}$ is large and prompts the model to discover different part-aware activation maps.

2) Compactness Learning Mechanism: To learn compact object parts, we introduce a triplet loss [62] to pull the features of pixels from the same part together and push the features of pixels from different parts far away in the embedding space. Based on the feature map $F \in \mathbb{R}^{H \times W \times L}$, we can acquire $H \times W$ pixels and their embedding features. Then, the part-aware activation maps $a = \{a^{k}\}_{k=1}^{K}$ can be used to classify which part the pixels belong to. For background pixels, all part-aware activation maps have low responses on these pixels, and they can be easily distinguished with a threshold (e.g., 0.2). For foreground pixels, a pseudo part-level label is generated by selecting the highest value in the part-aware activation maps at each pixel. We assign a pseudo label $z_{ij}$ to each pixel as defined in Eq. (5).

$$z_{ij} = \begin{cases} 0, & \text{if } \forall k,\ a^{k}_{ij} \le 0.2, \\ \arg\max_{k} a^{k}_{ij}, & \text{if } \exists k,\ a^{k}_{ij} > 0.2, \end{cases} \tag{5}$$

where $i = 1, 2, \ldots, H$ and $j = 1, 2, \ldots, W$, and $z_{ij}$ denotes the label of the pixel at position $(i, j)$.

Since part-aware features are obtained by attention weighting on all pixel features where multiple part-aware activation maps are treated as different spatial attention, each part-aware feature can be seen as a representative feature of one specific part. Then, we adopt a triplet loss to make features of pixels from one part move closer to the corresponding part-aware feature in the embedding space. The triplet loss is trained on a series of triplets, and each triplet consists of an anchor with the label $k$, a positive pixel with the same label, and a negative pixel with a different label. In this scheme, we set part-aware features as anchors and adopt the hard negative mining strategy [63] to find the hardest positive pairs and the hardest negative pairs. As shown in Figure 3, the triplet loss can reduce the distances between the hardest positive pairs and increase the distances between the hardest negative pairs. Therefore, the features of pixels from one part are pushed to be closer to their representative feature and form a compact distribution in the embedding space.

Fig. 3. The Triplet Loss minimizes the distance between the $k$-th part-aware feature and the feature of the hardest positive pixel [63] with the label $k$ in the embedding space, and maximizes the distance between the $k$-th part-aware feature and the feature of the hardest negative pixel with other labels.

Assuming there are $M_k$ positive pixels with the label $k$ and $N_k$ negative pixels with other labels in the image, we calculate the distances between the $k$-th part-aware feature and the $M_k$ positive pixels. The distances of negative pairs are computed in the same way. Let $F^{m}, F^{n} \in \mathbb{R}^{L}$ denote the embedding features of a positive pixel $x^{m}$ and a negative pixel $x^{n}$, both of which are selected from the feature map. Then, we define the distances as follows.

$$\begin{aligned} d_{+}(k, m) &= 1 - \frac{\langle P^{k}, F^{m} \rangle}{\|P^{k}\|_{2} \|F^{m}\|_{2}}, \\ d_{-}(k, n) &= 1 - \frac{\langle P^{k}, F^{n} \rangle}{\|P^{k}\|_{2} \|F^{n}\|_{2}}, \\ k &= 1, 2, \ldots, K; \quad m = 1, 2, \ldots, M_k; \quad n = 1, 2, \ldots, N_k, \end{aligned} \tag{6}$$

where $d_{+}(k, m)$ measures the distance between the $k$-th part-aware feature and the $m$-th positive pixel and $d_{-}(k, n)$ measures the distance between the negative pair. Following the setting in [63], we use the hard negative mining strategy for each part-aware feature by selecting the largest distance of positive pairs and the smallest distance of negative pairs, as defined in Eq. (7).

$$\begin{aligned} g_{+}(k) &= \max_{m} d_{+}(k, m), \\ g_{-}(k) &= \min_{n} d_{-}(k, n), \\ m &= 1, 2, \ldots, M_k; \quad n = 1, 2, \ldots, N_k. \end{aligned} \tag{7}$$

The triplet loss function is adopted to reduce the distances of the hardest positive pairs and increase the distances of the hardest negative pairs for all part-aware features as follows:

$$L_{tri} = \sum_{k=1}^{K} \left[ g_{+}(k) - g_{-}(k) + margin \right]_{+}, \quad k = 1, 2, \ldots, K, \tag{8}$$

where $margin$ is the margin between positive and negative pairs, and $[b]_{+} = \max(b, 0)$.

3) Importance Learning Mechanism: To utilize part-level discriminative information for accurate recognition, part-aware features $P = \{P^{k}\}_{k=1}^{K}$ are connected with a Part-specific Classification Module, denoted as $N_{cls}$, to generate part-aware category predictions $\tilde{y}^{*} = \{\tilde{y}^{1}, \tilde{y}^{2}, \ldots, \tilde{y}^{K}\}$, and $\tilde{y}^{k} \in \mathbb{R}^{C}$ corresponds to the $k$-th part-aware category prediction. This module consists of $K$ parallel fully connected layers, each of which is denoted as $N^{k}_{cls}$. The formulation is


given by Eq. (9).

$$\tilde{y}^{k} = N^{k}_{cls}(P^{k}), \quad k = 1, 2, \ldots, K. \tag{9}$$

The final classification results and integral localization maps can be obtained by merging multiple part-aware category predictions and part-aware activation maps. To achieve this goal, a simple way is to fuse them by averaging. However, this strategy may degrade the performance since important parts could include more characteristics of the objects and can produce more reliable results. Therefore, we introduce the Importance Prediction Module, denoted as $N_{IP}$, to evaluate the importance of each part. This module is implemented as a shared fully connected layer. Then, a weighted sum strategy is adopted to merge multiple part-aware results, so that object locations and categories can be identified in a self-adaptive way. Given part-aware features $P = \{P^{k}\}_{k=1}^{K}$, part-aware importance weights $t = \{t^{1}, t^{2}, \ldots, t^{K}\}$ are calculated by Eq. (10).

$$t^{k} = N_{IP}(P^{k}), \quad k = 1, 2, \ldots, K. \tag{10}$$

The final category prediction $\tilde{y} \in \mathbb{R}^{C}$ is obtained by weighting the reliable part-aware category predictions according to Eq. (9) and Eq. (10), and is defined as follows:

$$\tilde{y} = \sum_{k=1}^{K} t^{k} \tilde{y}^{k}. \tag{11}$$

The classification loss is given by Eq. (12):

$$L_{cla}(y, \tilde{y}) = -\sum_{c=1}^{C} y_{c} \cdot \log \tilde{y}_{c}, \tag{12}$$

where the cross-entropy loss is employed between the category prediction $\tilde{y}$ and the ground truth label $y$, and $C$ denotes the number of categories. $\tilde{y}_{c}$ and $y_{c}$ are the $c$-th elements of $\tilde{y}$ and $y$, respectively.

Given the part-aware attention maps $a = \{a^{k}\}_{k=1}^{K}$ and the part-aware importance weights $w = \{w^{k}\}_{k=1}^{K}$, our final activation map is defined in Eq. (13). Given the final activation map, bilinear interpolation is used to upsample it to the original image size. We identify the discriminative regions by a hard threshold, which is determined by a grid search method [17]. The detection bounding box is the coverage

trained jointly to discover diverse and robust object parts for accurate object localization and classification. (1) The classification loss requires that object-related regions are activated so as to classify the image correctly. (2) The sparsity loss requires the part-aware activation maps to focus on as few pixels as possible. As a result, these two loss terms together can suppress background regions and highlight object-related regions. (3) The diversity and compactness losses can further guide part-aware activation maps to be distinct and different so as to target diverse and robust object parts for accurate object localization and classification.

E. Discussions

In this section, we show the differences among our model and four relevant methods including CAM [10], ACoL [50], DANet [56] and UPDM [28]. (1) CAM [10] treats an object as one piece, so that it tends to focus only on the most discriminative part to increase the classification accuracy while ignoring the less discriminative parts. Different from this work, our key observation is that an object consists of multiple discriminative object parts. By modeling multiple diverse and robust object parts, we can capture the whole object instead of the most discriminative part. (2) In ACoL [50], Zhang et al. adopt the adversarial erasing strategy and design two separate branches to generate part-level response maps, and obtain the final localization map by using the element-wise max operator between the two part-level response maps. However, this strategy may overly rely on the results of one branch, so that even response maps generated by the unreliable branch could dominate the final localization map. Different from this work, we utilize part-aware importance weights to merge multiple part-aware activation maps, which is more robust and adaptive. Besides, since an object consists of multiple object parts, two parallel branches may not be enough to capture the whole object. (3) DANet [56] introduces a discrepant divergent activation approach, which leverages the semantic discrepancy to discover complementary object parts. However, it only focuses on increasing the diversity among object parts while ignoring part compactness and part importance, failing to learn robust object parts. Moreover, DANet merges all
of the largest connected component obtained by using the object parts before classification, thus hard to learn local
threshold truncation on the activation map [10]. details from each part for accurate recognition. (4) UPDM [28]
is designed for image classification, while our task is to

K
jointly optimize object localization and classification. Thus,
A= wk a k . (13) it only aims to find subtle details in local parts for better
k=1
classification, leaving the incomplete object localization prob-
lem unexplored. Besides, UPDM extracts fixed-shape image
D. Joint Training patches at predicted part locations for feature extraction, and
For weakly supervised object localization, with only the thus fails to produce regions of different shapes, while our
image-level category label, our PDM is trained by minimizing method can activate flexible object parts. Different from the
the overall objective, as shown in Eq.(14). above-mentioned methods, our approach can jointly model
part diversity, compactness, and importance to learn robust
L f inal = L cla + λspa L spa + λdiv L div + λtri L tri , (14)
and diverse object parts. In this way, our model can obtain
where λspa , λdiv and λtri are balance parameters. The four the complementary spatial information and local details from
loss items, including the classification loss L cla , the sparsity these parts, which help to produce precise bounding boxes and
loss L spa , the diversity loss L div and the triplet loss L tri are discriminate different classes.
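The importance-weighted merging that distinguishes PDM from the element-wise max of ACoL can be illustrated with a small NumPy sketch of Eq. (11) and Eq. (13), followed by the box extraction from the largest connected component. This is only a sketch under stated assumptions, not the authors' implementation: the function names `fuse_parts` and `largest_component_box` are hypothetical, the same weight vector is used for both the predictions and the maps, 4-connectivity is assumed, and the bilinear upsampling step is omitted.

```python
from collections import deque

import numpy as np


def fuse_parts(part_preds, part_maps, importance):
    """Weighted sum of K part-aware outputs (sketch of Eq. (11) and Eq. (13)).

    part_preds: (K, C) part-aware category predictions
    part_maps:  (K, H, W) part-aware attention maps
    importance: (K,) importance weights (assumed here to sum to 1)
    """
    part_preds = np.asarray(part_preds, dtype=float)
    part_maps = np.asarray(part_maps, dtype=float)
    importance = np.asarray(importance, dtype=float)
    y = np.tensordot(importance, part_preds, axes=1)  # (C,)
    A = np.tensordot(importance, part_maps, axes=1)   # (H, W)
    return y, A


def largest_component_box(A, threshold):
    """Box (x0, y0, x1, y1) around the largest 4-connected region with A >= threshold."""
    mask = np.asarray(A) >= threshold
    H, W = mask.shape
    seen = np.zeros((H, W), dtype=bool)
    best = []
    for i in range(H):
        for j in range(W):
            if not mask[i, j] or seen[i, j]:
                continue
            # Breadth-first flood fill of one connected component.
            comp, queue = [], deque([(i, j)])
            seen[i, j] = True
            while queue:
                r, c = queue.popleft()
                comp.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < H and 0 <= cc < W and mask[rr, cc] and not seen[rr, cc]:
                        seen[rr, cc] = True
                        queue.append((rr, cc))
            if len(comp) > len(best):
                best = comp
    if not best:
        return None  # nothing exceeds the threshold
    rows, cols = zip(*best)
    return min(cols), min(rows), max(cols), max(rows)
```

Replacing the weighted sum with a plain mean over the K parts gives the averaging strategy that the importance weights are meant to improve on.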

Authorized licensed use limited to: Zhengzhou University. Downloaded on February 20,2023 at 05:31:10 UTC from IEEE Xplore. Restrictions apply.

TABLE I
COMPARISON OF THE LOCALIZATION PERFORMANCE BETWEEN PDM AND STATE-OF-THE-ART METHODS ON THE CUB-200-2011 TEST SET

IV. EXPERIMENTAL RESULTS

A. Datasets and Evaluation Metrics
1) Datasets: We evaluate our model on two benchmark datasets including CUB-200-2011 [64] and ILSVRC 2016 [65], [66]. CUB-200-2011 includes 200 categories of birds, and contains 11,788 images. Following the protocol in previous methods [10], [11], [13]–[15], [17], [50], for this dataset, we evaluate the performance with the test set. The ILSVRC 2016 is a large-scale dataset, consisting of over 1.2 million images of 1,000 categories. For this dataset, we evaluate the performance with the validation set.
2) Evaluation Metrics: As suggested by [10], [11], [67], we use four different metrics to evaluate our proposed model. The first metric is Localization Accuracy, which is the ratio of images with both a correct classification and a correct object localization (IoU with the ground truth no less than 50%). The second metric is GT-known Localization Accuracy. Unlike Localization Accuracy, the ground truth class label is given, which eliminates the influence of the classification accuracy when evaluating the localization accuracy. The third metric is Classification Accuracy, which represents the ratio of correct classification predictions. The last metric is MaxBoxAccV2, which averages the GT-known localization accuracy across IoU ∈ {0.3, 0.5, 0.7} to address diverse demands for localization granularity. Here, the bounding boxes with the optimal score map threshold are extracted to perform the best match with the ground truth boxes when evaluating the GT-known metric [67].
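The four metrics above can be made concrete with a short NumPy sketch. It is a simplified illustration rather than the official evaluation code: the names `iou`, `localization_metrics` and `max_box_acc_v2` are hypothetical, exactly one predicted box and one ground truth box per image are assumed, boxes use inclusive pixel coordinates, and the search over the optimal score map threshold used by MaxBoxAccV2 [67] is left out.

```python
import numpy as np


def iou(box_a, box_b):
    """IoU of two boxes given as (x0, y0, x1, y1) with inclusive pixel coordinates."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(min(ax1, bx1) - max(ax0, bx0) + 1, 0)
    inter_h = max(min(ay1, by1) - max(ay0, by0) + 1, 0)
    inter = inter_w * inter_h
    area_a = (ax1 - ax0 + 1) * (ay1 - ay0 + 1)
    area_b = (bx1 - bx0 + 1) * (by1 - by0 + 1)
    return inter / (area_a + area_b - inter)


def localization_metrics(pred_boxes, gt_boxes, pred_labels, gt_labels, iou_thr=0.5):
    """Localization, GT-known localization and classification accuracy."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    box_ok = ious >= iou_thr                      # IoU no less than the threshold
    cls_ok = np.asarray(pred_labels) == np.asarray(gt_labels)
    return {
        "loc_acc": float((box_ok & cls_ok).mean()),   # box and class both correct
        "gt_known_loc_acc": float(box_ok.mean()),     # class label assumed given
        "cls_acc": float(cls_ok.mean()),
    }


def max_box_acc_v2(pred_boxes, gt_boxes, iou_thresholds=(0.3, 0.5, 0.7)):
    """Mean GT-known localization accuracy over several IoU thresholds."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return float(np.mean([(ious >= t).mean() for t in iou_thresholds]))
```

The error rates reported in the tables correspond to one minus these accuracies.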

B. Implementation Details
We implement the proposed algorithm based on four popular backbone networks including VGGnet [60], MobileNetV1 [68], ResNet50 [69] and InceptionV3 [70]. Specifically, the VGGnet and ResNet50 architectures are truncated at conv5-3, and the classifiers of MobileNetV1 and InceptionV3 are removed. The model is fine-tuned on the pre-trained weights of ILSVRC [65]. Following the guidance in [50], the input images are first resized to 256 × 256 pixels, and then randomly cropped to 224 × 224 pixels. Our model is trained with a mini-batch of 28 using the SGD optimizer with a learning rate of 0.001 for 40 epochs on the CUB-200-2011 and 10 epochs on the ILSVRC 2016 dataset. There are three kinds of modules in our network including the Part-aware Attention Module $N_{pa}$, the Part-specific Classification Module $N_{cls}$ and the Importance Prediction Module $N_{IP}$. $N_{pa}$ is implemented as a convolution operation with the kernel size 1 × 1 and $K$ output channels, followed by a sigmoid function. $N_{cls}$ consists of $K$ parallel fully connected layers with the kernel size 3 × 3 and $C$ output channels. $N_{IP}$ is implemented as a shared fully connected layer with the kernel size 3 × 3 and one output channel followed by a softmax layer. Empirically, the weights $\lambda_{spa}$ for the sparsity loss, $\lambda_{div}$ for the diversity loss and $\lambda_{tri}$ for the triplet loss are set to 0.04, 0.02 and 0.01, respectively.

C. Comparison With the State-of-the-Arts
In this section, we compare our PDM with various state-of-the-art methods mainly including CAM [10], Grad-CAM [13], CCAM [15], Hide-and-Seek [14], ACoL [50], SPG [17], ADL [11], DANet [56], EIL [55], PSOL [49] and GCNet [58] on the CUB-200-2011 test set and the ILSVRC 2016 validation set.
1) Localization Performance: Table I illustrates the localization error of existing weakly supervised methods with different backbones on the CUB-200-2011 test set. The inter-class variation of the CUB-200-2011 set is small, because all classes of this dataset belong to birds. In this case, the extent of the most discriminative region may be quite small. Therefore, it is very challenging to locate the entire area of the bird. On this fine-grained dataset, we consistently observe that our PDM outperforms all baseline methods. Besides, ResNet50-PDM sets a new state-of-the-art performance in terms of Top-1 localization error. Specifically, our VGGnet-PDM achieves 32.7% and 17.8% of Top-1 and Top-5 localization error while ResNet50-PDM reports 28.8% and 16.4% of Top-1 and Top-5 localization error, which indicates that our method can give more precise results.
We illustrate the localization errors on the ILSVRC 2016 validation dataset in Table II. On ILSVRC 2016, which is a larger-scale dataset and includes a wide variety of classes, we observe that our VGGnet-PDM achieves 48.9% and 36.2% of Top-1 and Top-5 localization error. When ResNet50 is used as a backbone, the PDM further reduces the Top-1/Top-5 localization error to 45.6%/34.5%. On this dataset, the proposed VGGnet-PDM and ResNet50-PDM both achieve better localization accuracy than the existing state-of-the-art methods. These results show that our method performs


well on both fine-grained and large-scale datasets, which substantially verifies the effectiveness of our method.

TABLE II
COMPARISON OF THE LOCALIZATION PERFORMANCE BETWEEN PDM AND STATE-OF-THE-ART METHODS ON THE ILSVRC 2016 VALIDATION SET

For detailed analysis, based on the results, we have the following observations. (1) It is important to learn complementary and diverse object parts. Compared with CAM and Grad-CAM, HaS adopts a random erasing strategy to discover less discriminative parts. ADL, ACoL, and EIL utilize an attention-based erasing mechanism, while DANet implements the divergent activation for the same goal. All these methods achieve promising localization performances compared with their baseline methods. This shows that it is helpful to explore complementary and discriminative object parts for improving localization performance. Our method can model diverse and discriminative object parts for capturing the entire object, and achieves favorable results. (2) In addition to part diversity, it is also important to explore part compactness and importance for WSOL. For example, VGGnet-PDM obtains a much lower Top-1 localization error (about 15%) than VGGnet-DANet on the CUB-200-2011 dataset, which leverages the semantic discrepancy to discover complementary object parts but fails to learn robust object parts. Furthermore, our method consistently performs better than adversarial erasing methods on two datasets. The improvement of the PDM is attributed to the three effective mechanisms (diversity, compactness, and importance learning mechanisms) that can model diverse and robust object parts. (3) Our PDM is superior to the current state-of-the-art PSOL [49]. PSOL first generates pseudo bounding boxes, and then borrows the detection models in fully supervised works to perform object localization with these generated boxes. Compared with PSOL, which has a complicated training process with a two-step design, the proposed PDM conducts object localization and classification in a unified model, which can be trained end-to-end. Besides, our VGGnet-PDM achieves 1.0% and 0.2% performance gains over VGGnet-PSOL in terms of Top-1 localization accuracy on these two datasets, which demonstrates the superiority of our method. (4) The backbone network has an important impact on performance, and a stronger backbone usually yields better performance. For example, PDM improves Top-1 and Top-5 localization accuracy by 3.9% and 1.4% when a stronger backbone is adopted on the CUB-200-2011 set. As reported in [77], the stronger backbones usually transfer better on various vision tasks.
2) GT-Known Localization: Localization performance is restricted by the classification accuracy, because the localization overlap is only calculated on images whose image-level labels are predicted correctly. To alleviate this problem, we evaluate the localization performance with ground truth class labels, denoted as the GT-known localization. The GT-known localization error on the CUB-200-2011 test set is shown in Table I. We observe that VGGnet-PDM outperforms VGGnet-GCNet by 1.1% in the GT-known localization error, and our method exceeds the performance of all other methods by a large margin. Besides, as shown in Table II, the proposed PDM outperforms the other methods on the ILSVRC 2016 validation set. The GT-known localization error of VGGnet-PDM is 30.7%, which is better than other baseline approaches with the same backbone. With the assistance of a stronger backbone, the localization error with ground truth labels further reduces to 30.4%. These results reveal the superiority of the localization maps generated by our method.
3) Classification: Table IV summarizes the benchmark approaches for classification with or without (w/o) bounding box annotations on the CUB-200-2011 dataset. Our VGGnet-PDM and ResNet50-PDM achieve 23.1% and 18.7% of Top-1 classification error, outperforming all previous WSOL methods, even the GoogLeNet-CAM with bounding box. It is not surprising since the fine-grained recognition needs local part information and our model is designed for this goal. Table V shows the Top-1 and Top-5 classification error on the ILSVRC 2016 validation set. The VGGnet-PDM reports 31.3%/11.4% of Top-1/Top-5 classification error while ResNet50-PDM reports 24.4%/8.4% of Top-1/Top-5 classification error, which is comparable with state-of-the-art methods. Although VGGnet-EIL and ResNet50-SE-ADL perform slightly better (about 1.6% and 0.2%) in Top-1 classification accuracy, our method outperforms them in localization accuracy on both datasets. In particular, the Top-1 localization accuracy of our method achieves 9.8% and 8.9% gains compared with VGGnet-EIL and ResNet50-SE-ADL on the CUB-200-2011 dataset.
4) MaxBoxAccV2: We also report the recent metric MaxBoxAccV2 in Table III. We follow the standard setting


in [67], [78]. The results show that our method outperforms all existing methods in terms of MaxBoxAccV2 (Mean) with all backbones.

TABLE III
MAXBOXACCV2 [67] COMPARISON WITH THE STATE-OF-THE-ART METHODS. THE RESULTS FOR EACH BACKBONE REPRESENT THE AVERAGE OF THE THREE IoU THRESHOLDS 0.3, 0.5, AND 0.7. VGG: VGGNET [60]. INC: INCEPTIONV3 [70]. RES: RESNET50 [69]

TABLE IV
COMPARISON OF THE CLASSIFICATION PERFORMANCE BETWEEN PDM AND STATE-OF-THE-ART METHODS ON THE CUB-200-2011 TEST SET

TABLE V
COMPARISON OF THE CLASSIFICATION PERFORMANCE BETWEEN PDM AND STATE-OF-THE-ART METHODS ON THE ILSVRC 2016 VALIDATION SET

Figure 4 visualizes the localization results generated on CUB-200-2011 and ILSVRC 2016 for qualitative evaluation. From the results, we consistently observe that our PDM captures the entire object, which is better than CAM [10]. For example, as seen from the left-most samples in Figure 4, the heatmaps and bounding boxes extracted from CAM only highlight the face or body of the birds. In contrast, our method covers nearly the entire area of the birds. Figure 5 visualizes the part-aware activation maps, which are successful in discovering diverse and complementary target parts. Multiple part-aware activation maps are fused to generate the final localization map, which can capture nearly the full object extent.

D. Ablation Studies
To analyze the effectiveness of each module, we conduct a set of ablation studies by deactivating the different learning modules on the CUB-200-2011 dataset with ResNet50 as the backbone. We remove all three modules and only keep the basic loss functions (the classification loss and the sparsity loss) as our baseline. In detail, we change the output channel of the part-aware attention module $N_{pa}$ from $K$ to 1. The diversity loss $L_{div}$, the triplet loss $L_{tri}$, and the importance prediction module $N_{IP}$ are all removed in this setting.

TABLE VI
ABLATION STUDY RESULTS ON THE CUB-200-2011 TEST SET

We first perform ablation studies to evaluate the impact of the sparsity loss. As shown in Table VI, the sparsity loss can improve localization/classification accuracy by 6.7%/0.5%, since this loss can relieve background interferences. Then, experiments are conducted with the following configurations: (a) Baseline with only the diversity learning module (DLM). (b) Baseline with the diversity learning module (DLM) and the compactness learning module (CLM). (c) Baseline with the diversity learning module (DLM), the compactness learning module (CLM), and the importance learning module (ILM), which is our PDM. When the importance learning module (ILM) is deactivated, the outputs of multiple parts are fused by average.
1) Effectiveness of the Diversity Learning Module (DLM): In the baseline setting, our localization map is the output of the


part-aware attention module with a single branch. As shown in Table VI, the baseline has a mediocre ability for localization. This is because, with only the basic loss functions, the model pays attention to the most discriminative part of the object. CAM [10] obtains localization maps by projecting back the weights of the output layer onto the feature map, which highlights class-specific image regions. We observe that our baseline can perform better than CAM [10] with a lower Top-1 localization error. This supports that accurate localization maps can be conveniently obtained by extracting spatial attention on the feature map. By adding the DLM, the part-aware attention module is guided to discover multiple diverse object parts and generate part-specific response maps. Multiple part-aware activation maps are fused by average to generate final localization maps. This setting achieves 36.0% of Top-1 localization error and 19.5% of Top-1 classification error. Compared with the baseline method, the performance gain is 19.6% and 4.2% in terms of Top-1 localization and classification error, respectively. Figure 6 visualizes the localization maps without and with DLM. As shown in Figure 6, the tails of the birds in the first column are totally suppressed, resulting in incomplete bounding boxes. The suppressed tails are highlighted when the DLM is employed. Both statistical and visualization results verify that modeling diverse object parts can largely improve localization performance.

Fig. 4. Visualization comparison with the CAM [10] method. The predicted bounding boxes are in red and the ground truth boxes are in yellow. Our method can highlight nearly the entire object and produce precise bounding boxes for the images on the CUB-200-2011 and ILSVRC 2016 datasets.

2) Effectiveness of the Compactness Learning Module (CLM): As shown in Table VI, there is further gain when CLM is introduced. The localization accuracy is increased by 2.6% and the classification accuracy is increased by 0.2%. This reveals that the proposed loss is effective in improving the localization performance. As shown in Figure 7, the detected parts are more robust with CLM. For example, a more accurate and finer bird head is recognized with CLM in the second line. Both statistical and visualization results verify the effectiveness of the CLM.
3) Effectiveness of the Importance Learning Module (ILM): We further introduce the ILM, where a self-adaptive strategy takes the place of a simple average strategy. As shown in Table VI, the performances are improved by 4.6% in Top-1 localization accuracy and 0.6% in classification accuracy. It is noted that important parts could include more characteristics of the object and produce more reliable results. For example, as shown in Figure 5, the samples in the second row pay more attention to the tail of the bird than the body part. Therefore, it is necessary to introduce an importance prediction module that measures the importance of object parts and the


results verify the effectiveness of the ILM. To summarize, the proposed method can enable the networks to achieve better performance than the baseline. We attribute it to the three effective learning mechanisms, which guide the network to discover diverse and robust parts so as to obtain accurate localization maps and category predictions.

Fig. 5. The visualization of different part-aware activation maps. The final activation map (last column) is obtained by fusing multiple part-aware activation maps (1st to 4th columns).

Fig. 6. The visualization of the final activation maps without and with the diversity learning module (DLM).

4) Robustness About the Threshold Value That Distinguishes Foreground From Background Pixels: The localization results are not sensitive to the labeling provided in Eq. (5), as shown in Table VII. When the threshold value ranges from 0.1 to 0.3, the performance is similar.

TABLE VII
THE ROBUSTNESS ABOUT THE THRESHOLD VALUE THAT DISTINGUISHES FOREGROUND FROM BACKGROUND PIXELS. THE RESULTS SHOW THAT THE LOCALIZATION ERROR IS NOT SENSITIVE TO DIFFERENT THRESHOLD VALUES

5) Statistical Analysis: In Figure 8, we show the statistical analysis of “GT-known bounding boxes” which indicates correct classification with over 50% IoU with the ground-truth boxes on the CUB-200-2011 dataset following DANet [56]. It can be seen that the proposed PDM enhances the quality of GT-known bounding boxes by improving the IoU rates to 68% and 71%. Note that the IoU rate of correct bounding boxes reported in DANet [56] with VGGnet is 64.6%. The IoU rate of VGGnet-PDM is increased by 3.4% compared to VGGnet-DANet.

E. Hyperparameter Evaluations
We evaluate how $\lambda_{spa}$, $\lambda_{div}$ and $\lambda_{tri}$ affect our model learning. Here, $\lambda_{spa}$ controls the relative importance of the sparsity of the part-aware activation maps, $\lambda_{div}$ controls the relative importance of the diversity learning module, and $\lambda_{tri}$ controls the relative importance of the compactness learning module. We perform the grid search method to discover the best values of all hyperparameters [17], which is a variable-controlling approach. Then, we evaluate the influence of each hyperparameter, where we keep the other hyperparameters at their best choice when one hyperparameter varies. As shown in Figure 9, our model achieves much better performance when $\lambda_{spa} = 0.04$, $\lambda_{div} = 0.02$ and $\lambda_{tri} = 0.01$.
We also evaluate the influence of the number of object parts $K$ in Figure 10. With too few part-aware activation maps, it is difficult to produce sufficient object parts. With too many part-aware activation maps, redundant information and the number of parameters increase significantly. The performance continues to grow until $K = 7$ for CUB-200-2011 and $K = 9$ for ILSVRC 2016, which means that 7 object parts for CUB-200-2011 and 9 object parts for ILSVRC 2016 are enough to capture the whole object.

F. Limitations
Figure 11 gives some failure examples. These results show that localization maps generated by our PDM can capture


the whole object better than CAM [10]. However, our PDM cannot reliably distinguish the object from background that frequently appears with it. In the case of the snowmobile class, the target object often co-occurs with snow. Thus, the snow has a certain level of discriminative power and may be highlighted. We believe that this problem might be general for all WSOL methods [11], and will address this issue in our future work.

Fig. 7. The visualization of the detected discriminative parts without and with the compactness learning module (CLM).

Fig. 8. Statistical analysis of GT-known bounding boxes on the CUB-200-2011 dataset.

Fig. 9. Evaluation of the important hyperparameters $\lambda_{spa}$, $\lambda_{div}$, $\lambda_{tri}$ on the CUB-200-2011 dataset.

Fig. 10. Evaluation of the part number $K$ on the CUB-200-2011 dataset and the ILSVRC 2016 dataset.

Fig. 11. The failure cases on ILSVRC 2016 experiments. The target class is the snowmobile. The proposed PDM can capture the whole object better than CAM [10]. However, our PDM captures not only the snowmobile, but also the snow that often co-occurs with the target object.

TABLE VIII
THE RESULTS OF THE PROPOSED PDM WITH AN ADAPTIVE THRESHOLD, WHICH SHOW THE EFFECTIVENESS OF THE ADAPTIVE THRESHOLD. IN THIS TABLE, “AT” DENOTES THE ADAPTIVE THRESHOLD

G. Broader Experiments
To further improve our method, inspired by the work [79], we propose an adaptive threshold to extract bounding boxes by using the percentile of the activation values, which considers the distribution of the activation values and is calculated as follows:

$$\tau_{loc} = \lambda \cdot \mathrm{per}_{i}(A), \tag{15}$$

where $A$ denotes the activation map and $\mathrm{per}_{i}$ is the $i$-th percentile of the activation values. $\lambda$ represents the coefficient. Following the work [79], we perform the grid search method [17] to discover the best values for $i$ and $\lambda$. The results in Table VIII show that the adaptive threshold can achieve better localization performance.
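As an illustration of Eq. (15), the adaptive threshold amounts to scaling a percentile of the activation values. The sketch below assumes NumPy; the function names and the default values `i=90.0` and `lam=0.5` are placeholders chosen for the example, since the paper selects $i$ and $\lambda$ by grid search and does not state the chosen values in this section.

```python
import numpy as np


def adaptive_threshold(A, i=90.0, lam=0.5):
    """tau_loc = lam * (i-th percentile of the activation values), as in Eq. (15).

    The defaults for i and lam are illustrative placeholders only.
    """
    return lam * np.percentile(np.asarray(A, dtype=float), i)


def foreground_mask(A, i=90.0, lam=0.5):
    """Binary mask of the pixels whose activation reaches the adaptive threshold."""
    A = np.asarray(A, dtype=float)
    return A >= adaptive_threshold(A, i, lam)
```

Because the threshold follows the distribution of each activation map, a map with uniformly low activations still yields a foreground region, whereas a fixed hard threshold could leave it empty.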


V. CONCLUSION
In this paper, we propose a novel end-to-end part discovery model to explore robust and diverse object parts in a unified model for weakly supervised object localization. In the proposed model, we have designed three effective mechanisms including diversity, compactness, and importance learning mechanisms to learn diverse and robust object parts. Besides, our model can exploit complementary spatial information and local details from the learned object parts, which help to produce precise bounding boxes and discriminate different object classes. Experimental results on two standard benchmarks demonstrate the effectiveness of the proposed model.

REFERENCES
[1] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free?—Weakly-supervised learning with convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 685–694.
[2] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “DeepDriving: Learning affordance for direct perception in autonomous driving,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2722–2730.
[3] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object detection network for autonomous driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1907–1915.
[4] R. Weng, J. Lu, and Y.-P. Tan, “Robust point set matching for partial face recognition,” IEEE Trans. Image Process., vol. 25, no. 3, pp. 1163–1176, Mar. 2016.
[5] X. Lin, Y. Liang, J. Wan, C. Lin, and S. Z. Li, “Region-based context enhanced network for robust multiple face alignment,” IEEE Trans. Multimedia, vol. 21, no. 12, pp. 3053–3067, Dec. 2019.
[6] J. García, N. Martinel, A. Gardel, I. Bravo, G. L. Foresti, and C. Micheloni, “Discriminant context information analysis for post-ranking person re-identification,” IEEE Trans. Image Process., vol. 26, no. 4, pp. 1650–1665, Apr. 2017.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[8] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[9] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
[10] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2921–2929.
[11] J. Choe and H. Shim, “Attention-based dropout layer for weakly supervised object localization,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2219–2228.
[12] C. Cao et al., “Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2956–2964.
[13] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 618–626.
[14] K. K. Singh and Y. J. Lee, “Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 3544–3553.
[15] S. Yang, Y. Kim, Y. Kim, and C. Kim, “Combinational class activation maps for weakly supervised object localization,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2020, pp. 2941–2949.
[16] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang, “Revisiting dilated convolution: A simple approach for weakly- and semi-supervised semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7268–7277.
[19] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan, “Object region mining with adversarial erasing: A simple classification to semantic segmentation approach,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1568–1576.
[20] Y. Zhang, K. Jia, and Z. Wang, “Part-aware fine-grained object categorization using weakly supervised part detection network,” IEEE Trans. Multimedia, vol. 22, no. 5, pp. 1345–1357, May 2020.
[21] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, “The application of two-level attention models in deep convolutional neural network for fine-grained image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 842–850.
[22] Y. Zhang et al., “Weakly supervised fine-grained categorization with part-based image representation,” IEEE Trans. Image Process., vol. 25, no. 4, pp. 1713–1725, Apr. 2016.
[23] Y. Peng, X. He, and J. Zhao, “Object-part attention model for fine-grained image classification,” IEEE Trans. Image Process., vol. 27, no. 3, pp. 1487–1500, Mar. 2018.
[24] H. Zheng, J. Fu, T. Mei, and J. Luo, “Learning multi-attention convolutional neural network for fine-grained image recognition,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5209–5217.
[25] Y. Ding, Y. Zhou, Y. Zhu, Q. Ye, and J. Jiao, “Selective sparse sampling for fine-grained image recognition,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 6599–6608.
[26] A. Recasens, P. Kellnhofer, S. Stent, W. Matusik, and A. Torralba, “Learning to zoom: A saliency-based sampling layer for neural networks,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 51–66.
[27] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, 2013.
[28] M. Simon and E. Rodner, “Neural activation constellations: Unsupervised part model discovery with convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1143–1151.
[29] C. Wan, Y. Wu, X. Tian, J. Huang, and X.-S. Hua, “Concentrated local part discovery with fine-grained part representation for person re-identification,” IEEE Trans. Multimedia, vol. 22, no. 6, pp. 1605–1618, Jun. 2020.
[30] S. Branson, G. Van Horn, S. Belongie, and P. Perona, “Bird species categorization using pose normalized deep convolutional nets,” 2014, arXiv:1406.2952.
[31] P. Tang, X. Wang, X. Bai, and W. Liu, “Multiple instance detection network with online instance classifier refinement,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2843–2851.
[32] P. Tang et al., “PCL: Proposal cluster learning for weakly supervised object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 1, pp. 176–191, Jan. 2020.
[33] Z. Ren et al., “Instance-aware, context-focused, and memory-efficient weakly supervised object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10598–10607.
[34] M. Xu et al., “Missing labels in object detection,” in Proc. Conf. Comput. Vis. Pattern Recognit. Workshops, 2019, pp. 1–10.
[35] Y. Shen et al., “Noise-aware fully webly supervised object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 11326–11335.
[36] L. Fang, H. Xu, Z. Liu, S. Parisot, and Z. Li, “EHSOD: CAM-guided end-to-end hybrid-supervised object detection with cascade refinement,” in Proc. AAAI Conf. Artif. Intell., 2020, vol. 34, no. 7, pp. 10778–10785.
[37] H. Bilen and A. Vedaldi, “Weakly supervised deep detection networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2846–2854.
[38] Z. Chen, Z. Fu, R. Jiang, Y. Chen, and X.-S. Hua, “SLV: Spatial likelihood voting for weakly supervised object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 12995–13004.
[39] F. Wan, P. Wei, J. Jiao, Z. Han, and Q. Ye, “Min-entropy latent model for weakly supervised object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Oct. 2018, pp. 1297–1306.
[40] X. Zhang, J. Feng, H. Xiong, and Q. Tian, “Zigzag learning for weakly supervised object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern
[17] X. Zhang, Y. Wei, G. Kang, Y. Yang, and T. Huang, “Self-produced Recognit., Jun. 2018, pp. 4262–4270.
guidance for weakly-supervised object localization,” in Proc. Eur. Conf. [41] F. Wan, C. Liu, W. Ke, X. Ji, J. Jiao, and Q. Ye, “C-MIL: Contin-
Comput. Vis., 2018, pp. 597–613. uation multiple instance learning for weakly supervised object detec-
[18] J. Lee, E. Kim, S. Lee, J. Lee, and S. Yoon, “FickleNet: Weakly and tion,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2019,
semi-supervised semantic image segmentation using stochastic infer- pp. 2199–2208.
ence,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), [42] Y. Gao et al., “Utilizing the instability in weakly supervised object
Jun. 2019, pp. 5267–5276. detection,” 2019, arXiv:1906.06023.


[43] Y. Shen, R. Ji, Y. Wang, Y. Wu, and L. Cao, “Cyclic guidance for weakly supervised joint detection and segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 697–707.
[44] A. Arun, C. V. Jawahar, and M. P. Kumar, “Dissimilarity coefficient based weakly supervised object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 9432–9441.
[45] X. Li, M. Kan, S. Shan, and X. Chen, “Weakly supervised object detection with segmentation collaboration,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9735–9744.
[46] Y. Zhang, Y. Bai, M. Ding, Y. Li, and B. Ghanem, “W2F: A weakly-supervised to fully-supervised framework for object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 928–936.
[47] K. Yang, D. Li, and Y. Dou, “Towards precise end-to-end weakly supervised object detection network,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8372–8381.
[48] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 391–405.
[49] C.-L. Zhang, Y.-H. Cao, and J. Wu, “Rethinking the route towards weakly supervised object localization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 13460–13469.
[50] X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. S. Huang, “Adversarial complementary learning for weakly supervised object localization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1325–1334.
[51] D. Kim, D. Cho, D. Yoo, and I. S. Kweon, “Two-phase learning for weakly supervised object localization,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 3534–3543.
[52] X. Wang, A. Shrivastava, and A. Gupta, “A-fast-RCNN: Hard positive generation via adversary for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2606–2615.
[53] K. Li, Z. Wu, K.-C. Peng, J. Ernst, and Y. Fu, “Tell me where to look: Guided attention inference network,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9215–9223.
[54] Q. Hou, P. Jiang, Y. Wei, and M.-M. Cheng, “Self-erasing network for integral object attention,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 549–559.
[55] J. Mai, M. Yang, and W. Luo, “Erasing integrated learning: A simple yet effective approach for weakly supervised object localization,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2020, pp. 8766–8775.
[56] H. Xue, C. Liu, F. Wan, J. Jiao, X. Ji, and Q. Ye, “DANet: Divergent activation for weakly supervised object localization,” in Proc. Int. Conf. Comput. Vis., 2019, pp. 6589–6598.
[57] X.-S. Wei, C.-L. Zhang, J. Wu, C. Shen, and Z.-H. Zhou, “Unsupervised object discovery and co-localization by deep descriptor transformation,” Pattern Recognit., vol. 88, pp. 113–126, Apr. 2019.
[58] W. Lu, X. Jia, W. Xie, L. Shen, Y. Zhou, and J. Duan, “Geometry constrained weakly supervised object localization,” 2020, arXiv:2007.09727.
[59] M. Lin, Q. Chen, and S. Yan, “Network in network,” 2013, arXiv:1312.4400.
[60] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556.
[61] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[62] Z. Zhong, L. Zheng, S. Li, and Y. Yang, “Generalizing a person retrieval model hetero- and homogeneously,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 172–188.
[63] A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” 2017, arXiv:1703.07737.
[64] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD birds-200-2011 dataset,” California Inst. Technol., Pasadena, CA, USA, Tech. Rep., 2011.
[65] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jan. 2009, pp. 248–255.
[66] O. Russakovsky et al., “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[67] J. Choe, S. J. Oh, S. Lee, S. Chun, Z. Akata, and H. Shim, “Evaluating weakly supervised object localization methods right,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2020, pp. 3133–3142.
[68] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” 2017, arXiv:1704.04861.
[69] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2016, pp. 770–778.
[70] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826.
[71] C. Lei, L. Jiang, J. Ji, W. Zhong, and H. Xiong, “Weakly supervised learning of object-part attention model for fine-grained image classification,” in Proc. IEEE 18th Int. Conf. Commun. Technol. (ICCT), Oct. 2018, pp. 1222–1226.
[72] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “CutMix: Regularization strategy to train strong classifiers with localizable features,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2019, pp. 6023–6032.
[73] Z. Xu, D. Tao, S. Huang, and Y. Zhang, “Friend or foe: Fine-grained categorization with weak supervision,” IEEE Trans. Image Process., vol. 26, no. 1, pp. 135–146, Jan. 2017.
[74] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff, “Top-down neural attention by excitation backprop,” Int. J. Comput. Vis., vol. 126, no. 10, pp. 1084–1102, 2018.
[75] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” 2013, arXiv:1312.6034.
[76] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 818–833.
[77] S. Kornblith, J. Shlens, and Q. V. Le, “Do better imagenet models transfer better?” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 2661–2671.
[78] M. Ki, Y. Uh, W. Lee, and H. Byun, “In-sample contrastive learning and consistent attention for weakly supervised object localization,” in Proc. Asian Conf. Comput. Vis., 2020, pp. 1–6.
[79] W. Bae, J. Noh, and G. Kim, “Rethinking class activation mapping for weakly supervised object localization,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 618–634.

Meng Meng received the bachelor’s degree in remote sensing science and technology from Xidian University, Xi’an, Shaanxi, China, in 2018. She is currently pursuing the Ph.D. degree in control science and engineering with the University of Science and Technology of China, Hefei, Anhui, China. Her research interests include computer vision and machine learning, especially weakly supervised object localization, weakly supervised semantic segmentation, and object tracking.

Tianzhu Zhang (Member, IEEE) received the bachelor’s degree in communications and information technology from the Beijing Institute of Technology, Beijing, China, in 2006, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2011. He is currently a Professor with the Department of Automation, School of Information Science and Technology, University of Science and Technology of China. His current research interests include computer vision and multimedia. He served/serves as the Area Chair for CVPR 2020, ECCV 2020, ICCV 2019, ACM MM 2019, WACV 2018, ICPR 2018, and MVA 2017, the Associate Editor for IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT) and Neurocomputing.

Wenfei Yang (Graduate Student Member, IEEE) received the bachelor’s degree in electronic engineering and information science from the University of Science and Technology of China, Hefei, China, in 2017, where he is currently pursuing the Ph.D. degree in control science and engineering. His current research interests include computer vision and machine learning, especially temporal action detection, video object detection, and temporal sentence grounding.


Jian Zhao (Member, IEEE) received the B.S. degree from Beihang University in 2012, the master’s degree from the National University of Defense Technology in 2014, and the Ph.D. degree from the National University of Singapore in 2019. He is currently an Assistant Professor with the Institute of North Electronic Equipment, Beijing, China, and the Rhino-Bird Visiting Scholar with the Tencent AI Laboratory, Shenzhen, China. His main research interests include machine learning, pattern recognition, and computer vision.

Yongdong Zhang (Senior Member, IEEE) received the Ph.D. degree in electronic engineering from Tianjin University, Tianjin, China, in 2002. He is currently a Professor with the University of Science and Technology of China. He has authored more than 100 refereed journal and conference papers. His current research interests include multimedia content analysis and understanding, multimedia content security, video encoding, and streaming media technology. He serves as an Editorial Board Member of Multimedia Systems journal and Neurocomputing. He was a recipient of the Best Paper Award in PCM 2013, ICIMCS 2013, and ICME 2010, and the Best Paper Candidate in ICME 2011.

Feng Wu (Fellow, IEEE) received the B.S. degree in electrical engineering from Xidian University in 1992, and the M.S. and Ph.D. degrees in computer science from the Harbin Institute of Technology in 1996 and 1999, respectively. He is currently a Professor with the University of Science and Technology of China and the Dean of the School of Information Science and Technology. Before that, he was a Principal Researcher and a Research Manager with Microsoft Research Asia. His research interests include image and video compression, media communication, and media analysis and synthesis. He has authored or coauthored over 200 high-quality papers (including several dozen IEEE Transactions papers) and top conference papers at MOBICOM, SIGIR, CVPR, and ACM MM. He has 77 granted U.S. patents. Fifteen of his techniques have been adopted into international video coding standards. As a coauthor, he received the Best Paper Award in IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT) in 2009, PCM 2008, and SPIE VCIP 2007. He serves as an Associate Editor for IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Multimedia, and several other international journals. He received the IEEE Circuits and Systems Society 2012 Best Associate Editor Award. He also served as a TPC Chair in MMSP 2011, VCIP 2010, and PCM 2009, and the Special Sessions Chair in ICME 2010 and ISCAS 2013.
