
The Devil is in the Points: Weakly Semi-Supervised Instance Segmentation

via Point-Guided Mask Representation

Beomyoung Kim1,2 Joonhyun Jeong1,2 Dongyoon Han3 Sung Ju Hwang2

NAVER Cloud, ImageVision1 KAIST2 NAVER AI Lab3


arXiv:2303.15062v1 [cs.CV] 27 Mar 2023

Abstract

In this paper, we introduce a novel learning scheme named weakly semi-supervised instance segmentation (WSSIS) with point labels for budget-efficient and high-performance instance segmentation. Namely, we consider a dataset setting consisting of a few fully-labeled images and a lot of point-labeled images. Motivated by the observation that the main challenge of semi-supervised approaches derives from the trade-off between false-negative and false-positive instance proposals, we propose a method for WSSIS that can effectively leverage the budget-friendly point labels as a powerful weak supervision source to resolve the challenge. Furthermore, to deal with the hard case where the amount of fully-labeled data is extremely limited, we propose a MaskRefineNet that refines noise in rough masks. We conduct extensive experiments on the COCO and BDD100K datasets, and the proposed method achieves promising results comparable to those of the fully-supervised model, even with 50% of the fully labeled COCO data (38.8% vs. 39.7%). Moreover, when using as little as 5% of the fully labeled COCO data, our method shows significantly superior performance over the state-of-the-art semi-supervised learning method (33.7% vs. 24.9%). The code is available at https://github.com/clovaai/PointWSSIS.

Figure 1. Proposals and instance masks. The absence of a proposal leads to a missing (false-negative) mask, even though the mask could be generated if given the correct proposal (zebra). Also, a noise proposal often leads to a noisy (false-positive) mask. Our motivation stems from the bottleneck in the proposal branch, and this paper shows that economical point labels can be leveraged to resolve it.

1. Introduction

Recently proposed instance segmentation methods [5, 8, 9, 13, 15, 16, 24, 33, 43, 47] have achieved remarkable performance owing to the availability of abundant segmentation labels for training. However, compared to other label types (e.g., bounding box or point), segmentation labels necessitate delicate pixel-level annotations, demanding much more monetary cost and human effort. Consequently, weakly-supervised instance segmentation (WSIS) and semi-supervised instance segmentation (SSIS) approaches have gained attention as ways to reduce annotation costs. WSIS approaches alternatively utilize inexpensive weak labels such as image-level labels [1, 27, 55], point labels [11, 27, 28], or bounding box labels [19, 32, 44]. Besides, SSIS approaches [49, 54] employ a small amount of pixel-level (fully) labeled data and a massive amount of unlabeled data. Although they have shown potential for budget-efficient instance segmentation, there still exists a large performance gap between their results and those of fully-supervised learning methods.

Specifically, SSIS approaches often adopt the following training pipeline: (1) train a base network with fully labeled data, (2) generate pseudo instance masks for unlabeled images using the base network, and (3) train a target network using both full and pseudo labels. The major challenge of SSIS approaches comes from the trade-off between the number of missing (i.e., false-negative) and noise (i.e., false-positive) samples in the pseudo labels. Namely, strategies for reducing false-negatives, which is equivalent to increasing true-positives, often end up increasing false-positives accordingly; an abundance of false-negatives or false-positives in the pseudo labels impedes stable convergence of the target network. However, optimally reducing false-negatives and false-positives while increasing true-positives is quite challenging and remains a significant open problem for SSIS.

To address this challenge, we first revisit the fundamental behavior of the instance segmentation framework. Most existing instance segmentation methods adopt a two-step inference process, as shown in Figure 1: (1) generate instance proposals, where an instance is represented as a box [9, 16, 22, 33] or a point [43, 45, 47, 48], in the proposal branch, and (2) produce an instance mask for each instance proposal in the mask branch. As shown in Figure 1, if the network fails to obtain an instance proposal (i.e., a false-negative proposal), it cannot produce the corresponding instance mask. Although the network could represent the instance mask in the mask branch, the absence of the proposal becomes the bottleneck for producing the instance mask. From this behavior of the network, we suppose that addressing the bottleneck in the proposals is a shortcut to the success of SSIS.

Motivated by the above observations, we rethink the potential of using point labels as weak supervision. The point label contains only a one-pixel categorical instance cue but is budget-friendly, as it is almost as easy for human annotators to provide as image-level labels [3]. We note that the point label can be leveraged as an effective source to (i) resolve the performance bottleneck of the instance segmentation network and (ii) optimally balance the trade-off between false-negative and false-positive proposals. Thus, we formulate a new practical training scheme, Weakly Semi-Supervised Instance Segmentation (WSSIS) with point labels. In the WSSIS task, we utilize a small amount of fully labeled data and a massive amount of point labeled data for budget-efficient and high-performance instance segmentation.

Under the WSSIS setting, we filter the proposals to keep only true-positive proposals using the point labels. Then, given the true-positive proposals, we exploit the mask representation of the network learned from fully labeled data to produce high-quality pseudo instance masks. For properly leveraging point labels, we consider the characteristics of the feature pyramid network (FPN) [35], which consists of multi-level feature maps for multi-scale instance recognition. Each pyramid level is trained to recognize instances of particular sizes, and extracting instance masks from unfit levels often causes inaccurate predictions, as shown in Figure 4. However, since point labels carry no instance size information, we handle this with an effective strategy named Adaptive Pyramid-Level Selection. We estimate which level is the best fit based on the reliability of the network (i.e., the confidence score) and then adaptively produce an instance mask at the selected level.

Meanwhile, with an extremely limited amount of fully labeled data, the network often fails to sufficiently represent the instance mask in the mask branch, resulting in rough and noisy mask outputs. In other words, a true-positive proposal does not always lead to a true-positive instance mask in this case. To cope with this limitation, we propose a MaskRefineNet to refine the rough instance mask. The MaskRefineNet takes three input sources, i.e., image, rough mask, and point; the image provides visual information about the target instance, the rough mask is used as the prior knowledge to be refined, and the point information explicitly guides the target instance. Using these richer instructive input sources, MaskRefineNet can be stably trained even with a limited amount of fully labeled data.

To demonstrate the effectiveness of our method, we conduct extensive experiments on the COCO [36] and BDD100K [51] datasets. When training with half of the images fully labeled and the rest point labeled on the COCO dataset (i.e., 50% COCO), we achieve performance competitive with the fully-supervised result (38.8% vs. 39.7%). In addition, when using a small amount of fully labeled data, e.g., 5% of the COCO data, the proposed method shows far superior performance to the state-of-the-art SSIS method [49] (33.7% vs. 24.9%).

In summary, the contributions of our paper are:

• We show that point labels can be leveraged as effective weak supervision for budget-efficient and high-performance instance segmentation. Further, based on this observation, we establish a new training protocol named Weakly Semi-Supervised Instance Segmentation (WSSIS) with point labels.

• To further boost the quality of the pseudo instance masks when the amount of fully labeled data is extremely limited, we propose the MaskRefineNet, which refines noisy parts of the rough instance masks.

• Extensive experimental results show that the proposed method can achieve performance competitive with that of the fully-supervised models while significantly outperforming the semi-supervised methods.

2. Related Work

2.1. Instance Segmentation

Mask R-CNN [16] is the most widely used method for instance segmentation. It represents an instance as a bounding box and produces the instance mask after pooling each box region. These box-based approaches have many variants, such as [9, 22, 33, 41], and have shown state-of-the-art results. Meanwhile, there is a different type of approach, named point-based approaches [43, 45, 47, 48]. They represent an instance as a point and generate the instance mask using a point-encoded mask representation. For example, SOLOv2 [47] extracts point-encoded kernel parameters and generates instance masks with a dynamic convolution scheme. We note that the inference pipeline of these two approaches is the same, as shown in Figure 1; they generate proposals in the form of either bounding boxes or points and then produce an instance mask for each proposal. Here, the proposal is indispensable for producing the instance mask.
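The point-encoded dynamic-convolution scheme described above can be sketched as follows. This is our own illustrative NumPy sketch, not the SOLOv2 authors' code; the tensor shapes and names are assumptions for exposition. A proposal point indexes a dynamic 1×1-convolution kernel that is applied to a shared mask feature map:

```python
import numpy as np

def point_encoded_mask(kernel_map, mask_feats, y, x):
    """Generate one instance mask from a point proposal, SOLOv2-style.

    kernel_map: (E, S, S) grid of point-encoded 1x1-conv kernel parameters
    mask_feats: (E, H, W) shared mask feature map
    (y, x):     grid cell at which the instance proposal fired
    """
    kernel = kernel_map[:, y, x]                       # (E,) dynamic kernel
    logits = np.tensordot(kernel, mask_feats, axes=1)  # (H, W): a 1x1 conv
    return 1.0 / (1.0 + np.exp(-logits))               # sigmoid -> soft mask

# Toy usage: E=8 kernel channels, 40x40 proposal grid, 64x64 mask features.
rng = np.random.default_rng(0)
kernel_map = rng.normal(size=(8, 40, 40))
mask_feats = rng.normal(size=(8, 64, 64))
mask = point_encoded_mask(kernel_map, mask_feats, y=12, x=25)
print(mask.shape)  # (64, 64)
```

This makes the bottleneck discussed above concrete: if no proposal fires at any grid cell for an instance, no kernel is ever indexed and no mask can be produced, regardless of how good `mask_feats` is.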

(a) Confidence Threshold=0.1 (b) Confidence Threshold=0.5 (c) With Point Labels

Figure 2. Qualitative results of pseudo instance masks. (a) and (b): the quality of the pseudo masks is largely affected by the confidence score of the proposal due to the trade-off between false-negative and false-positive instance proposals. (c): our point-driven method can filter the proposals to keep only true-positive proposals, resulting in clearer pseudo instance masks.

2.2. Budget-Efficient Instance Segmentation

Instance segmentation requires a huge amount of instance-level segmentation labels. However, the annotation cost of segmentation labels is much higher than that of other labels. According to seminal works [3, 4], the annotation time measured on the VOC dataset [14] is as follows: image-level (20.0 s/img), point (23.3 s/img), bounding box (38.1 s/img), full mask (239.7 s/img). To reduce the annotation cost, weakly-supervised instance segmentation (WSIS) and semi-supervised instance segmentation (SSIS) have been actively studied. The WSIS methods exploit the activation maps generated by self-attention of a network trained with only cost-efficient labels such as image-level [1, 27, 55], point [11, 27, 28], and bounding box [19, 32, 44] labels. Meanwhile, the SSIS methods [49, 54] use a small amount of fully labeled data and an abundant amount of unlabeled data. Utilizing the knowledge of segmentation learned from the fully labeled data, they generate pseudo instance masks for the unlabeled data. Although they can reduce the annotation cost, their performance is still far behind that of the fully-supervised models.

2.3. Weakly Semi-Supervised Object Detection

There exist previous attempts to tackle the weakly semi-supervised object detection (WSSOD) problem using point labels [10, 52]. Namely, they use a few box-labeled data and a lot of point-labeled data. Leveraging the point labels, they show improved detection performance compared to the semi-supervised setting. Object detection and instance segmentation share a similar goal: both are object-level recognition tasks. However, we point out that the motivation for leveraging point labels is different. We focus on the fundamental drawback of the instance segmentation network in handling the trade-off between false-negative and false-positive proposals. In contrast, PointDETR [10] leverages the point labels as input queries for single-level feature map inference in the DETR [6] architecture, and Group R-CNN [52] employs the point labels to filter and augment proposals with improved positive sample assignments. In addition, we propose the MaskRefineNet for high-fidelity mask refinement to handle the distinct challenge of instance segmentation, which is a pixel-level recognition task.

3. Proposed Method

3.1. Motivation

Existing instance segmentation methods typically adopt a two-step inference process: (1) generate proposals, where each instance is represented as a bounding box [9, 16, 22, 33] or a point [43, 45, 47, 48], in the proposal branch, and (2) produce a mask for each instance in the mask branch. Figure 1 provides the intuition that the performance of the instance segmentation network critically depends on the correctness of the proposals at the proposal branch. Thus, improving the proposal branch may lead to a significant performance improvement in semi-supervised instance segmentation (SSIS).

To delve deeper into the problem, we adjust the confidence threshold in the proposal branch to verify the influence of the proposal on the output instance mask, as shown in Figure 2. At a low threshold of 0.1 with a larger number of proposals, we obtain more true-positive masks but many more false-positive masks as well (see Figure 2a). The reason is that false-positive proposals (e.g., misclassified or erroneously localized proposals) often lead to noisy mask predictions. Conversely, when we increase the threshold to 0.5, we lose several true-positive masks that were detected at the lower threshold (see Figure 2b). In other words, although the mask branch could represent the instance mask, the absence of the thresholded-out proposal results in a missing instance mask. However, finding an optimal threshold per instance is impractical, and balancing between true-positive and false-positive proposals remains a challenging problem in SSIS.
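The escape from this threshold dilemma is to match proposals to annotated points instead of thresholding scores. The sketch below is our own minimal illustration of that idea, not the paper's implementation; the proposal format, `max_dist` radius, and helper names are assumptions:

```python
import math

def keep_true_positives(proposals, point_labels, max_dist=32.0):
    """Point-guided proposal filtering: for each annotated point, keep the
    best-scoring proposal of the same class near it; no global confidence
    threshold is needed. `proposals` are dicts with x, y, cls, score;
    `point_labels` are (x, y, cls) tuples."""
    kept = []
    for px, py, pc in point_labels:
        candidates = [
            p for p in proposals
            if p["cls"] == pc and math.hypot(p["x"] - px, p["y"] - py) <= max_dist
        ]
        if candidates:  # a missing proposal stays missing (false-negative)
            kept.append(max(candidates, key=lambda p: p["score"]))
    return kept

proposals = [
    {"x": 10, "y": 12, "cls": "zebra", "score": 0.35},   # true positive, low score
    {"x": 200, "y": 40, "cls": "zebra", "score": 0.80},  # false positive, no point
]
points = [(11, 11, "zebra")]
print([p["score"] for p in keep_true_positives(proposals, points)])  # [0.35]
```

Note that a global threshold of 0.5 would make exactly the wrong choice on this toy input: it would discard the 0.35 true positive and keep the 0.80 false positive, while the point-guided filter keeps only the annotated instance.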

Figure 3. Overview of the proposed method. Top: (step 1) training the teacher network and MaskRefineNet with fully labeled data. Bottom: (step 2) under the point label guidance, pseudo labels are generated through the teacher network and further refined using MaskRefineNet. Then, the student network is trained on both the pseudo-labeled data and the fully labeled data.

Figure 4. Adaptive strategy with FPN and qualitative results. Top: illustration of Adaptive Pyramid-Level Selection. Bottom: qualitative results from each pyramid level.

3.2. Weakly Semi-Supervised Instance Segmentation using Point Labels

From the above observations, we expect that obtaining correct instance proposals will yield accurate mask representations and thereby improve an SSIS network. To this end, we revisit the point label, which is a one-pixel categorical instance representation cue. The annotation budget for point labels is known to be cost-efficient in the literature [3, 4].

Task definition. We propose a new training protocol named Weakly Semi-Supervised Instance Segmentation (WSSIS) using point labels and verify that the budget-friendly point labels can provide effective guidance. The training protocol employs point-labeled data together with a small amount of fully labeled data, which yields reduced annotation costs.

Training baseline. Figure 3 shows our proposed baseline, a two-step learning procedure for WSSIS: (1) train a teacher network using only the full labels; (2) train a student network using both the full labels and the pseudo labels generated by the teacher network under the guidance of the point labels. Generating high-quality pseudo labels is crucial for WSSIS, so we employ the point labels as guidance for filtering the proposals to retain only true-positive proposals. Then, given the filtered proposals, we generate instance masks by exploiting the mask representation of the teacher network. Note that the proposed architecture is a baseline for the proposed task, so one can explore more advanced training schemes.

FPN head. Most existing instance segmentation approaches [9, 16, 43, 47] adopt the Feature Pyramid Network (FPN) [35] architecture for multi-scale instance prediction. Namely, SOLOv2 [47] employs a 5-level feature pyramid (P2∼P6), and each pyramid level recognizes instances of particular sizes. When using point labels for sampling proposals, care is demanded in choosing the level from which to extract proposals based on the size of the instance. Otherwise, the generated instance masks are often noisy, as shown in Figure 4 (bottom).

Strategy of using pyramid levels adaptively. Since points do not contain instance size information, we estimate which pyramid level is proper for each point. To this end, we propose a strategy named Adaptive Pyramid-Level Selection, which adaptively selects the pyramid level that is expected to produce the most appropriate instance mask based on the reliability of the network. Namely, we rescale the coordinates of the point labels according to the resolution of each level and extract confidence scores for all levels. Then, we generate an instance mask only from the pyramid level with the maximum confidence score, as illustrated in Figure 4. Formally, there are N proposal branches {F_p^i}_{i=1}^N, and we follow the configuration of FPN [35] with N = 5. For each point label (x, y, c), where c denotes the category id, we extract an instance proposal and its confidence score, (P_i, s_i) = F_p^i(x, y, c). Regarding the confidence score as the reliability of the prediction, we adaptively select the pyramid level k with the maximum score, k = argmax_{k ∈ {1, 2, ..., N}} s_k. Finally, at the mask branch F_m, we generate a pseudo instance mask M = σ(F_m(P_k)), where σ is the sigmoid function.

Figure 5. Effect of MaskRefineNet. Qualitative results under the 10% COCO fully labeled data condition. When the teacher network fails to disentangle objects in a rough mask, MaskRefineNet can separate each representation owing to the given point label (1st row). Our MaskRefineNet further enriches the resultant mask representation (2nd row) and removes noisy parts (3rd row).

3.3. Mask Refinement Network

With a sufficient amount of fully labeled data (e.g., using 50% of the images), the teacher network can afford to generate reasonable pseudo instance masks given true-positive proposals. However, when the amount of fully labeled data is extremely small (e.g., using only 1% of the images), the mask representation of the network produces rough instance masks; that is, a true-positive proposal cannot ensure that the resulting instance mask is a true-positive.

To handle such a challenging case, we propose a simple yet effective post-hoc mask refinement method named MaskRefineNet. Figure 3 shows that MaskRefineNet refines the rough mask output of the teacher network based on three input sources: the input image, the rough mask, and the point information. Specifically, we loosely crop each instance region in the input image, rough mask, and point information, resize the crops to 256×256, and then concatenate them together into an input tensor. For the point information, we transform the point label into a heatmap in which each point is encoded as a 2D Gaussian kernel with a sigma of 6. The effectiveness of the MaskRefineNet can be attributed to two reasons: (1) it leverages the prior knowledge of the teacher network; since MaskRefineNet takes the rough mask predictions of the teacher network as input, it learns how to calibrate common errors in the teacher network's predictions; (2) it takes guidance from the input point, which is likely to provide an accurate target instance cue for recognizing overlapping instances and falsely predicted pixels. Consequently, MaskRefineNet refines the missing and noisy parts and disentangles crowded target instances in the rough mask with the help of the point guidance, as shown in Figure 5.

4. Experiments

4.1. Datasets

We evaluate our method on the COCO 2017 dataset [36], which contains 118,287 training samples and 5,000 validation samples covering 80 common object categories. To validate our method under the WSSIS regime, we randomly sample subsets containing 1%, 2%, 5%, 10%, and 20%∼50% of the COCO training dataset. COCO 10% means using 10% of the images with full labels and the remaining 90% with point labels. We use the centroid point of an instance mask label as the point label. In addition, we conduct experiments on the BDD100K dataset [51], a large-scale driving scene dataset with diverse scene types and 8 classes. The BDD100K dataset contains 7k mask-labeled images and 67k box-labeled images, and we use the center of the box as the point label for this dataset.

4.2. Implementation Details

We adopt SOLOv2 [47] as the baseline instance segmentation network since it is a straightforward point-based and box-free method. For both the teacher and student networks, we use the same ResNet-101 [17] backbone network and follow the default training recipe and network settings as in [47]. For the MaskRefineNet, we adopt the ResNet-101 FPN [35] architecture and produce the output only from the highest-resolution pyramid level, P2. We train the MaskRefineNet with a batch size of 16, a learning rate of 1e-4 with cosine decay scheduling, the dice loss [38], and an input size of 256×256. After training the teacher network, the MaskRefineNet is trained by taking the rough mask outputs of the teacher network. We implement the proposed method using PyTorch [40] and train on 8 V100 GPUs.

Following the labeling budget calculation in [3, 4], we estimate the labeling budget for the COCO train set as follows: full mask (645.9 s/img), bounding box (127.5 s/img), point (87.9 s/img), image-level (80 s/img). The detailed calculation method is described in our supplementary material.

4.3. Experimental Results

We compare the proposed method against two baselines with the same network architecture and optimization strategy. The first is trained with only the fully labeled data, and the second is trained with the fully labeled and unlabeled data, i.e., a semi-supervised setting. For the second baseline, we generate pseudo instance masks for the unlabeled data with the teacher network, without any weak labels. As shown in Figure 6, our method achieves remarkable performance on all COCO subsets. Especially, the performance gap between ours and the baselines is notably larger when we use smaller fully labeled subsets, e.g., COCO 1% or 5%. Compared to the fully-supervised setting (COCO 100%), our method with COCO 50% shows

a highly competitive result (38.8% vs. 39.7%). Moreover, the qualitative results in Figure 7 show that ours with COCO 5% can properly segment instances of various sizes. These results demonstrate that the cost-efficient point labels can be leveraged as an effective source for instance segmentation.

We also compare against other methods that use various weak labels based on the labeling budget in Table 1. According to the type and the number of labels, we calculate the labeling budget following the aforementioned costs. All methods use the same total number of training images, so the labeling time of unlabeled data is treated as zero. Compared to the state-of-the-art semi-supervised method, NB [49], ours shows superior performance when using the same amount of fully labeled data, especially when using 5% of fully labeled data (33.7% vs. 25.6%). We also achieve higher performance with a lower labeling budget (35.8% with a budget of 196.7 vs. 35.5% with a budget of 265.2). In addition, compared to the state-of-the-art box-supervised method, BoxInst [44], our efficiency is better (33.7% with a budget of 158.5 vs. 33.2% with a budget of 174.5). These results demonstrate the effectiveness of budget-friendly point labels with the proposed method. Furthermore, we emphasize the potential of our method for further improvement when using a larger labeling budget.

We also conduct experiments on the BDD100K dataset. As we increase the amount of point labeled data with a fixed amount of 7k fully labeled data, the performance gradually improves, as shown in Table 6. Especially, when leveraging all available point labels (67k), ours achieves a significant performance improvement compared to using only the 7k fully labeled data (22.1% → 27.9%).

Figure 6. Performance trend comparison under supervisions. We visualize the AP scores when varying the number of fully labeled images in the COCO test-dev. MRN means applying our point-guided MaskRefineNet.

Table 1. Performance trade-off of annotation budgets and AP. We compare the methods on the COCO test-dev under various supervisions: U (unlabeled data), I (image-level label), P (point label), B (box label), F (full label). All methods use the same backbone network of R-101 [17].

Method         Label Types      Budget (days) ↓   AP (%) ↑
Weakly-supervised models
BESTIE [27]    I 100%           109.5             14.3
BESTIE [27]    P 100%           120.3             17.7
BBAM [32]      B 100%           174.5             26.0
BoxInst [44]   B 100%           174.5             33.2
Semi-supervised models
NB [49]        F 5% + U 95%     44.2              25.6
NB [49]        F 10% + U 90%    88.4              30.3
NB [49]        F 30% + U 70%    265.2             35.5
NB [49]        F 50% + U 50%    442.1             36.8
Weakly semi-supervised models
Ours           F 5% + P 95%     158.5             33.7
Ours           F 10% + P 90%    196.7             35.8
Ours           F 30% + P 70%    349.5             38.0
Ours           F 50% + P 50%    502.3             38.8
Fully supervised models
MRCNN [16]     F 100%           884.2             38.8
SOLOv2 [47]    F 100%           884.2             39.6

4.4. Ablation Study

We conduct an ablation study of our method in the COCO 10% setting. Unless otherwise specified, we measure the quality of the pseudo labels generated by the teacher network using 5,000 images randomly sampled from the remaining 90% of the COCO data, which we name COCO train5K.

Effect of Point Labels. In Table 5, we verify the effectiveness of each weak label candidate (i.e., unlabeled, image-level, and point label) for instance segmentation. For this analysis, we measure the quality of the pseudo labels and the performance of the student network on the COCO 2017 validation set. When unlabeled data is leveraged as a weak label, we must carefully tune the confidence threshold to balance false-negative and false-positive proposals; the average recall (AR100) and precision (AP) vary largely according to the confidence threshold. This implies that human effort for tuning the threshold is required for each target dataset, and this global threshold may not be optimal for every instance. Leveraging the image-level label as a weak label can eliminate misclassified proposals, boosting the performance from 25.9% to 29.5%. However, the performance gap with the fully-supervised setting is still significant (29.5% vs. 39.0%). When we leverage the point label as a weak label, we filter the proposals to keep only true-positive proposals, deprecating the requirement of a confidence threshold. This makes a more straightforward and effective pipeline, resulting in 32.2%. Compared to the annotation cost of the image-level label (80 s/img), the point label is still budget-friendly (87.9 s/img) and gives a noticeable performance improvement (32.2% vs. 29.5%). Moreover, our point-guided MaskRefineNet further reduces the performance gap with the fully-supervised setting (35.5% vs. 39.0%). This result demonstrates that our method can effectively leverage the point label for cost-efficient and high-performance instance segmentation.

Table 5. Impact of using point labels. Notation: U (unlabeled data), I (image-level label), P (point label), and F (full label). We use COCO train5K to measure the quality of the pseudo labels and COCO val to evaluate the baseline network trained with the pseudo labels. τ is the confidence threshold in the proposal branch. † means applying our point-guided MaskRefineNet.

               COCO train5K               COCO val
Label Types    AP ↑   AP50 ↑   AR100 ↑    AP ↑   AP50 ↑
U (τ=0.1)      6.0    11.3     33.8       20.8   33.5
U (τ=0.3)      13.1   22.9     23.4       25.9   41.5
U (τ=0.5)      12.2   19.3     15.6       24.3   38.2
I (τ=0.3)      19.5   33.1     24.1       29.5   48.9
P              28.6   56.7     42.6       32.2   52.3
P†             39.1   65.3     52.0       35.5   56.0

Furthermore, we test the robustness of our method to the position of the point label. We originally use the centroid point of each instance as the point label. For this analysis, we instead randomly choose one pixel in an instance mask as the point label, repeat this five times, and measure the average quality of the pseudo labels. As shown in Table 4, the performance gap between the center point and a random point is marginal. The reason is that all pixels included in the instance region are trained to generate instance proposals in the proposal branch, as in [47]. This result demonstrates the robustness of our method to the position of the point labels, which gives us more opportunity to reduce the annotation effort.

Table 4. Robustness to point sources. We compare APs obtained with different point locations: center and random points.

Point    AP ↑   AP50 ↑
Center   28.6   56.7
Random   28.8   57.0

Effect of Adaptive Pyramid-Level Selection. We quantitatively analyze the behavior of the FPN in Table 2. When we produce pseudo instance masks from a single-layer feature map, i.e., without FPN, we achieve an unsatisfactory pseudo label quality of 23.4%, as shown in the first row of Table 2. When we generate pseudo masks from all pyramid feature maps (P2∼P6), we achieve an inferior quality of 10.3% because the outputs from unfit pyramid levels are quite noisy, as shown in Figure 4. Using our Adaptive Pyramid-Level Selection strategy, we choose one appropriate pyramid level based on the reliability of the network, achieving an improved quality of 28.6%. This result demonstrates that the proposed strategy is highly effective in leveraging the behavior of the FPN structure for generating high-quality pseudo labels. Also, the result of 30.9% when using ground-truth instance size information shows that room for improvement remains.

Table 2. Impact of choosing features adaptively. Pn denotes the n-th feature pyramid level; s_k is the confidence score of the proposal at the k-th pyramid level.

FPN      Proposal                       AP ↑   AP50 ↑   AR100 ↑
P2       P2                             23.4   48.3     37.6
P2∼P6    P2∼P6                          10.3   20.8     44.7
P2∼P6    argmax_{k∈{2,3,...,6}} s_k     28.6   56.7     42.6
P2∼P6    w/ ground-truth size           30.9   62.0     44.9

Effect of MaskRefineNet. In Figure 6, we conduct experiments on various subsets without the MaskRefineNet. When the teacher network has enough mask representation ability, as in the COCO 50% setting, the improvement from the MaskRefineNet is marginal (38.3% vs. 38.8%). However, the MaskRefineNet yields a considerable performance improvement especially in the settings with a limited number of fully labeled images, e.g., COCO 1% (14.3% → 24.0%) and COCO 5% (29.0% → 33.7%). This demonstrates that MaskRefineNet is a remarkably effective way to improve the quality of the pseudo labels when the quantity of fully labeled data is limited.

In addition, we analyze the effect of the input sources of the MaskRefineNet in Table 3. Before applying the MaskRefineNet, the quality of the pseudo labels is measured as 28.6%. When the MaskRefineNet takes only an image as input, the accuracy of the pseudo labels drastically drops to 14.8% since the network fails to converge due to the absence of prior knowledge. When taking the rough mask as an input source, the quality of the pseudo labels improves from 14.8% to 29.7% because the MaskRefineNet inherits the knowledge of the teacher network for fast and stable convergence. However, the improvement is still minor compared to the model without MaskRefineNet. When additionally taking the point information as an input source, the quality of the pseudo labels dramatically improves to 39.1%. The reason is that the point information serves as a guidance seed for the target instance, helping to segment occluded instances more accurately and to refine the missing predictions in the rough mask, as shown in Figure 5.

Table 3. Impact of the input sources for MaskRefineNet. All rows additionally take the image as input.

Rough mask   Point   AP ↑   AP50 ↑
w/o MaskRefineNet    28.6   56.7
–            –       14.8   30.2
✓            –       29.7   54.4
–            ✓       30.9   52.9
✓            ✓       39.1   65.3

Table 6. Quantitative results on the BDD100K validation set. We report the AP scores of different training regimes with respect to the number of point labels.

Label Types     AP ↑   AP50 ↑   AP75 ↑
F 7k            22.1   40.2     21.2
F 7k + P 20k    26.7   44.4     27.8
F 7k + P 40k    27.3   44.5     28.9
F 7k + P 67k    27.9   44.8     29.2

Figure 7. Qualitative results on the various subsets of COCO data. We observe that training with 5% of fully labeled data appropriately localizes all the object instance masks owing to our proposed point guidance method along with MaskRefineNet.

5. Conclusion and Limitation

In this paper, we introduced a novel learning scheme named weakly semi-supervised instance segmentation (WSSIS), showing that budget-friendly point labels can be leveraged as effective weak supervision to resolve the bottleneck in the proposal branch. Moreover, we presented the MaskRefineNet to deal with hard learning scenarios where the amount of fully labeled data is extremely limited. Owing to the effectiveness of the proposed method, we can generate high-quality pseudo instance masks, achieving promising instance segmentation results. Despite our remarkable results driven by cost-efficient point labels, a limitation is that we cannot straightforwardly exploit tremendous unlabeled image pools, such as web-crawled images without any annotations. Our future direction may involve incorporating unlabeled images in our framework to extend its application to semi-supervised learning scenarios.

Acknowledgment. We thank Youngjoon Yoo and ImageVision team members for valuable discussions and
In this paper, we proposed a novel and practical weakly feedback. This work was supported by the Engineering
semi-supervised instance segmentation scheme leveraging Research Center Program through the National Research
point labels as weak supervision for cost-efficient and high- Foundation of Korea (NRF) funded by the Korean Gov-
performance instance segmentation. We motivated that the ernment MSIT (NRF-2018R1A5A1059921) and Institute
main performance bottleneck of modern instance segmen- of Information & communications Technology Planning &
tation frameworks arises from the instance proposal extrac- Evaluation (IITP) grant funded by the Korea government
tion. To this end, we proposed a method that can effec- (MSIT) (No.2019-0-00075, Artificial Intelligence Graduate
tively exploit the budget-friendly point labels as weak su- School Program (KAIST)).
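The point-position robustness analysis above replaces each centroid point label with a pixel sampled uniformly from the instance mask. A minimal NumPy sketch of that sampling step (the function name and interface are ours for illustration, not from the released code):

```python
import numpy as np

def sample_point_label(instance_mask: np.ndarray, rng: np.random.Generator):
    """Sample one pixel (y, x) uniformly from a binary instance mask.

    Mirrors the robustness check above, where the centroid point label
    is replaced by a random in-mask pixel.
    """
    ys, xs = np.nonzero(instance_mask)
    if len(ys) == 0:
        raise ValueError("empty instance mask")
    i = rng.integers(len(ys))
    return int(ys[i]), int(xs[i])

# Example: a 5x5 mask with a 2x2 instance; every sampled point lies inside it.
mask = np.zeros((5, 5), dtype=np.uint8)
mask[1:3, 2:4] = 1
rng = np.random.default_rng(0)
points = [sample_point_label(mask, rng) for _ in range(5)]
assert all(mask[y, x] == 1 for y, x in points)
```

Repeating this sampling five times and averaging the resulting pseudo-label quality reproduces the protocol of the robustness analysis.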

Appendix: Additional Experimental Details

Labeling Budget Calculation. Seminal works [3, 4] offered the annotation time of various labeling sources (e.g., full mask, bounding box, point, and image-level labels) on the Pascal VOC dataset [14]. Since the COCO dataset [36] we used has more categories and instances per image than the VOC dataset, we estimate the labeling budget for the COCO dataset following their budget calculation method. The COCO 2017 trainset has a total of 80 categories and contains 118,287 images and 860,001 instances. Also, it has an average of 7.2 instances and 2.9 categories per image. Considering these statistics of the COCO dataset, we calculate the labeling budget as follows:

• Full mask: 77.1 classes/img × 1 s/class + 7.2 inst/img × 79 s/mask = 645.9 s/img.

• Bounding box: 77.1 classes/img × 1 s/class + 7.2 inst/img × 7 s/bbox = 127.5 s/img.

• Point: 77.1 classes/img × 1 s/class + 2.9 classes/img × 2.4 s/point + (7.2 inst/img − 2.9 classes/img) × 0.9 s/point = 87.9 s/img.

• Image-level: 80 classes/img × 1 s/class = 80 s/img.

Input of MaskRefineNet. In this section, we further provide the details about the input sources for the MaskRefineNet. After training the teacher network using the fully labeled data, we generate instance mask outputs for the point-guided filtered proposals (i.e., true-positive proposals) using the trained teacher network. We treat the mask outputs as rough masks to be used as the input source of the MaskRefineNet. For each rough mask, we loosely crop each instance region in the input image, rough mask, and point heatmap. Specifically, after obtaining the bounding box from the rough mask using the min-max operations, we re-scale the size of the box to double, and then we use this box region as the cropping region. In addition, for the point heatmap, we encode each point to a 2-dimensional Gaussian kernel with a sigma of 6, as done in [48, 53]. We concatenate the three input sources (i.e., cropped input image R^{H×W×3}, cropped rough mask R^{H×W×1}, and cropped point heatmap R^{H×W×C}) to be the input tensor R^{H×W×(3+1+C)} of the MaskRefineNet, where C is the number of classes.

Input Size   AP    AP50   AP75
128×128      34.1  53.4   36.1
256×256      35.5  56.0   37.8
384×384      35.5  55.9   37.7

Table 7. Effect of the input size of MaskRefineNet. The APs are measured on the COCO 2017 validation set.

Iterative   1%    2%    5%    10%   30%   50%   100%
            23.9  25.1  33.4  35.5  37.4  38.3  39.0
✓           25.6  26.0  34.5  35.9  37.6  38.3  39.0

Table 8. Effect of the iterative training strategy. The APs are measured on the COCO 2017 validation set according to COCO subsets.

Method             Label Types     Budget (days) ↓   AP (%) ↑
Weakly-supervised Models
BBTP [20]          B 100%          174.5             21.1
BBAM [32]          B 100%          174.5             25.7
BoxInst [44]       B 100%          174.5             33.2
BoxLevelSet [34]   B 100%          174.5             33.4
BoxTeacher [12]    B 100%          174.5             35.4
Point-sup [11]     P10 100%        263.2             37.7
Weakly Semi-supervised Models
Ours               F 5% + P 95%    158.5             33.7
Ours               F 10% + P 90%   196.7             35.8
Ours               F 20% + P 80%   273.1             37.1
Ours               F 30% + P 70%   349.5             38.0
Ours               F 50% + P 50%   502.3             38.8
Fully Supervised Models
MRCNN [16]         F 100%          884.2             38.8
CondInst [43]      F 100%          884.2             39.1
SOLOv2 [47]        F 100%          884.2             39.7

Table 9. Additional comparisons with weakly-supervised methods in terms of labeling budget and accuracy. We compare the methods on the COCO test-dev under various supervisions; B (box label), P10 (10-points label), P (single-point label), F (full mask label). All methods use the same backbone network of ResNet-101 [17].

Appendix: Additional Analysis

Effect of the input size of MaskRefineNet. We originally set the input size of the MaskRefineNet to 256×256. Here, we change the input size to verify its effect on the WSSIS result in Table 7. For this, we train the MaskRefineNet using an input size of 128×128 or 384×384. We measure the AP result of the student network trained with the pseudo and full labels on the COCO 2017 validation set. Consequently, the 256×256 size yields the best AP score of 35.5%, but its performance gap with the 384×384 size is marginal (35.5% vs. 35.4%).

Effect of the iterative training strategy. Some weakly-supervised methods [2, 25, 46] utilize an iterative training strategy; after training the target network, they generate pseudo labels using the target network, and then they newly train the target network using the pseudo labels. This strategy can give additional performance improvement but demands a more complex training pipeline. In this work, we suffer from the insufficient mask representation of the network when the amount of fully labeled data is extremely limited (e.g., COCO 1%). Although we can alleviate the problem with the proposed MaskRefineNet, we additionally try to adopt this strategy since we assume that the trained student network may have a stronger mask representation ability than the teacher network. For this, after training the student network, we newly generate pseudo instance masks for the point-labeled images. Using both the full labels and the new pseudo labels, we train a new student network. As the results in Table 8 show, the iterative training strategy yields meaningful improvements in tiny fully labeled data conditions (COCO 1%: 23.9%→25.6%). However, there is no significant performance improvement for subsets above COCO 30%. This result demonstrates that (1) the iterative training strategy is helpful only when the amount of fully labeled data is extremely limited, and (2) in more generous conditions such as COCO 30% and 50%, our MaskRefineNet is enough to replenish the mask representation of the network.

Additional comparison with a weakly-supervised method: Point-sup [11] introduced a new type of weak supervision source, multiple (10) points. They achieved remarkable instance segmentation results with a highly reduced annotation cost. To compare with them, we estimate the annotation time for 10 points according to the literature; they labeled 10 points in the bounding box region.

• 10 points: 77.1 classes/img × 1 s/class + 7.2 inst/img × (7 s/bbox + 10 points × 0.9 s/point) = 192.3 s/img.

In Table 9, we provide the results for the weakly-supervised methods and ours on COCO test-dev in terms of accuracy and labeling budget. Although Point-sup shows a slightly better efficiency than ours (37.7% with a budget of 263.2 days vs. 37.1% with a budget of 273.1 days), we argue that our training setting is more applicable to current dataset conditions because theirs requires newly annotating 10 points. Also, we show the possibility of further performance improvement up to 38.8%, which is highly close to the result of the fully-supervised setting. Furthermore, this suggests a new future direction: incorporating 10-point and single-point labels without any mask labels.

Method             5%    10%   20%   30%   40%   50%
Point DETR [10]    26.2  30.3  33.3  34.8  35.4  35.8
Group R-CNN [52]   30.1  32.6  34.4  35.4  35.9  36.1
Ours               32.4  34.3  35.6  36.9  37.0  37.6

Table 10. Quantitative comparisons on the COCO test-dev object detection benchmark. All methods used the ResNet-50 backbone.

Comparison with weakly semi-supervised object detection methods: In our main paper, we discussed the weakly semi-supervised object detection (WSSOD) methods [10, 52], which used box labels as strong labels and point labels as weak labels. Since instance segmentation covers object detection, we measure our performance on the COCO test-dev object detection benchmark. For this, we use the min-max points from the instance mask output as our bounding box output. Even though our strong label is different from theirs (full mask vs. bounding box), the results in Table 10 show that ours can surpass the state-of-the-art WSSOD performance. We note that all methods use the same ResNet-50 [17] backbone network and the same amount of total strong and weak labels.

Qualitative analysis for the effect of input sources of MaskRefineNet: In Table 2 of our main paper, we provided the quantitative analysis of the effect of the input sources of MaskRefineNet. Here, we supplement our analysis with the qualitative results according to the input sources of the MaskRefineNet in Figure 8. When given all three informative input sources, the MaskRefineNet can produce high-quality refined masks by separating overlapping instances and removing noisy pixels.

Qualitative comparison of baselines and our WSSIS method. In Figure 6 of our main paper, we provided the AP evolution of two baselines and our WSSIS method according to the COCO subsets. In Figure 9, we provide the qualitative results of the two baselines and our method under the COCO 10% setting. There are four types of methods: (a) training with fully labeled data only, (b) training with fully labeled data and unlabeled data, (c) training with fully labeled data and point-labeled data, and (d) training with fully labeled data and point-labeled data along with our point-guided MaskRefineNet. The results demonstrate that the network trained with our method can be guided with higher-quality pseudo labels, resulting in fewer false-positive and false-negative outputs.

Additional qualitative results on the COCO dataset. In Figure 10, we provide additional qualitative results of our method trained with the 5%, 20%, and 50% COCO subsets.

Qualitative results on the BDD100K dataset. We qualitatively analyze the effect of leveraging point labels for the instance segmentation model using the BDD100K dataset [51]. There are two types of networks: the first is the network trained with only 7K fully labeled data, and the second is the network trained with 7K fully labeled data and 67K point-labeled data. As shown in Figure 11, due to our effective leveraging of the point labels, the second network is much more robust to large and small instances and occluded instances.
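The labeling-budget bullets above reduce to simple arithmetic over the COCO statistics (80 classes, 7.2 instances and 2.9 categories per image); a short script reproducing the per-image costs:

```python
# Reproduces the per-image labeling-budget arithmetic from the appendix.
CLASSES, INST_PER_IMG, CATS_PER_IMG = 80, 7.2, 2.9
ABSENT = CLASSES - CATS_PER_IMG  # 77.1 absent classes verified at 1 s each

full_mask  = ABSENT * 1 + INST_PER_IMG * 79                                    # mask: 79 s each
bbox       = ABSENT * 1 + INST_PER_IMG * 7                                     # box: 7 s each
point      = ABSENT * 1 + CATS_PER_IMG * 2.4 + (INST_PER_IMG - CATS_PER_IMG) * 0.9
image_lvl  = CLASSES * 1                                                       # verify every class
ten_points = ABSENT * 1 + INST_PER_IMG * (7 + 10 * 0.9)                        # box + 10 points

print(round(full_mask, 1), round(bbox, 1), round(point, 1), image_lvl, round(ten_points, 1))
# → 645.9 127.5 87.9 80 192.3
```

Multiplying each per-image cost by the 118,287 training images and converting to days gives the "Budget (days)" column of Table 9.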

Figure 8. Qualitative analysis of the effect of input sources of MaskRefineNet. Object instances cannot be distinguished when the point label is not given to the MaskRefineNet (3rd col). Meanwhile, mask representations are inaccurate in the absence of the prior rough masks (4th col). Given these low-cost priors, we can obtain sophisticated masks per object instance (5th col).
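The input construction described in "Input of MaskRefineNet" (min-max box from the rough mask, doubled crop region, sigma-6 Gaussian point heatmap, channel-wise concatenation) can be sketched as follows. The helper names and shapes are our illustration, not the released code, and the real pipeline additionally resizes the crop to the 256×256 network input:

```python
import numpy as np

def box_from_mask(mask):
    """Min-max bounding box (y0, x0, y1, x1) of a binary rough mask."""
    ys, xs = np.nonzero(mask)
    return ys.min(), xs.min(), ys.max() + 1, xs.max() + 1

def double_box(box, h, w):
    """Re-scale the box to double its size around its center, clipped to the image."""
    y0, x0, y1, x1 = box
    cy, cx = (y0 + y1) / 2, (x0 + x1) / 2
    ey, ex = (y1 - y0), (x1 - x0)  # doubled box: half-extent = original extent
    return (max(int(cy - ey), 0), max(int(cx - ex), 0),
            min(int(cy + ey), h), min(int(cx + ex), w))

def point_heatmap(h, w, y, x, cls, num_classes, sigma=6.0):
    """Encode one class-wise point as a 2D Gaussian with sigma 6."""
    yy, xx = np.mgrid[0:h, 0:w]
    hm = np.zeros((h, w, num_classes), dtype=np.float32)
    hm[..., cls] = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma ** 2))
    return hm

def refine_input(image, rough_mask, y, x, cls, num_classes):
    """Crop image / rough mask / point heatmap and concatenate to H×W×(3+1+C)."""
    h, w = rough_mask.shape
    y0, x0, y1, x1 = double_box(box_from_mask(rough_mask), h, w)
    hm = point_heatmap(h, w, y, x, cls, num_classes)
    crop = lambda a: a[y0:y1, x0:x1]
    return np.concatenate(
        [crop(image), crop(rough_mask)[..., None].astype(np.float32), crop(hm)],
        axis=-1)

# Example: one 64×64 image with a single rough mask and its point label (class 2).
img = np.zeros((64, 64, 3), dtype=np.float32)
m = np.zeros((64, 64), dtype=np.uint8)
m[20:40, 24:44] = 1
x_in = refine_input(img, m, 30, 34, cls=2, num_classes=80)
assert x_in.shape[-1] == 3 + 1 + 80
```

With C = 80 COCO classes, the resulting tensor has 84 channels, matching the R^{H×W×(3+1+C)} input described in the appendix.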

Figure 9. Qualitative comparison of models trained with different types of supervision on the COCO 10% setting. The semi-supervised setting detects instance masks for all objects but is vulnerable to misclassification (e.g., cat, person, bear, airplane, toilet). Meanwhile, our point-guided model produces accurate class predictions, and our MaskRefineNet further refines the mask representation.

Figure 10. Additional qualitative results according to the various subsets in COCO data. Owing to our point guidance along with MaskRefineNet, leveraging only 5% of fully labeled data is sufficient to localize all the instances with elaborate mask representations.

Figure 11. Qualitative comparison of leveraging point labels on BDD100K. Training with point labels clearly enriches the mask representation and removes the noise incurred by visually hard samples (e.g., the low-light condition in the first row).

References

[1] Jiwoon Ahn, Sunghyun Cho, and Suha Kwak. Weakly supervised learning of instance segmentation with inter-pixel relations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2209–2218, 2019.
[2] Aditya Arun, CV Jawahar, and M Pawan Kumar. Weakly supervised instance segmentation by learning annotation consistent instances. In European Conference on Computer Vision, pages 254–270. Springer, 2020.
[3] Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What's the point: Semantic segmentation with point supervision. In European Conference on Computer Vision, pages 549–565. Springer, 2016.
[4] Míriam Bellver Bueno, Amaia Salvador Aguilera, Jordi Torres Viñals, and Xavier Giró Nieto. Budget-aware semi-supervised semantic and instance segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 93–102, 2019.
[5] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9157–9166, 2019.
[6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[7] Sungmin Cha, Beomyoung Kim, YoungJoon Yoo, and Taesup Moon. Ssul: Semantic segmentation with unknown label for exemplar-based class-incremental learning. Advances in Neural Information Processing Systems, 34:10919–10930, 2021.
[8] Hao Chen, Kunyang Sun, Zhi Tian, Chunhua Shen, Yongming Huang, and Youliang Yan. Blendmask: Top-down meets bottom-up for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8573–8581, 2020.
[9] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019.
[10] Liangyu Chen, Tong Yang, Xiangyu Zhang, Wei Zhang, and Jian Sun. Points as queries: Weakly semi-supervised object detection by points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8823–8832, 2021.
[11] Bowen Cheng, Omkar Parkhi, and Alexander Kirillov. Pointly-supervised instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2617–2626, 2022.
[12] Tianheng Cheng, Xinggang Wang, Shaoyu Chen, Qian Zhang, and Wenyu Liu. Boxteacher: Exploring high-quality pseudo labels for weakly supervised instance segmentation. arXiv preprint arXiv:2210.05174, 2022.
[13] Bin Dong, Fangao Zeng, Tiancai Wang, Xiangyu Zhang, and Yichen Wei. Solq: Segmenting objects by learning queries. Advances in Neural Information Processing Systems, 34:21898–21909, 2021.
[14] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[15] Yuxin Fang, Shusheng Yang, Xinggang Wang, Yu Li, Chen Fang, Ying Shan, Bin Feng, and Wenyu Liu. Instances as queries. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6910–6919, 2021.
[16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] Ruifei He, Jihan Yang, and Xiaojuan Qi. Re-distributing biased pseudo labels for semi-supervised semantic segmentation: A baseline investigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6930–6940, 2021.
[19] Cheng-Chun Hsu, Kuang-Jui Hsu, Chung-Chi Tsai, Yen-Yu Lin, and Yung-Yu Chuang. Weakly supervised instance segmentation using the bounding box tightness prior. Advances in Neural Information Processing Systems, 32, 2019.
[20] Cheng-Chun Hsu, Kuang-Jui Hsu, Chung-Chi Tsai, Yen-Yu Lin, and Yung-Yu Chuang. Weakly supervised instance segmentation using the bounding box tightness prior. Advances in Neural Information Processing Systems, 32, 2019.
[21] Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, and Ross Girshick. Learning to segment every thing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4233–4241, 2018.
[22] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring r-cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6409–6418, 2019.
[23] Jisoo Jeong, Seungeui Lee, Jeesoo Kim, and Nojun Kwak. Consistency-based semi-supervised learning for object detection. Advances in Neural Information Processing Systems, 32, 2019.
[24] Lei Ke, Martin Danelljan, Xia Li, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Mask transfiner for high-quality instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4412–4421, 2022.
[25] Anna Khoreva, Rodrigo Benenson, Jan Hosang, Matthias Hein, and Bernt Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 876–885, 2017.
[26] Beomyoung Kim, Sangeun Han, and Junmo Kim. Discriminative region suppression for weakly-supervised semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1754–1761, 2021.
[27] Beomyoung Kim, Youngjoon Yoo, Chae Eun Rhee, and Junmo Kim. Beyond semantic to instance segmentation: Weakly-supervised instance segmentation via semantic knowledge transfer and self-refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4278–4287, 2022.
[28] Issam H Laradji, Negar Rostamzadeh, Pedro O Pinheiro, David Vazquez, and Mark Schmidt. Proposal-based instance segmentation with point supervision. In 2020 IEEE International Conference on Image Processing (ICIP), pages 2126–2130. IEEE, 2020.
[29] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 896, 2013.
[30] Jungbeom Lee, Eunji Kim, Sungmin Lee, Jangho Lee, and Sungroh Yoon. Ficklenet: Weakly and semi-supervised semantic image segmentation using stochastic inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5267–5276, 2019.
[31] Jungbeom Lee, Eunji Kim, and Sungroh Yoon. Anti-adversarially manipulated attributions for weakly and semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4071–4080, 2021.
[32] Jungbeom Lee, Jihun Yi, Chaehun Shin, and Sungroh Yoon. Bbam: Bounding box attribution map for weakly supervised semantic and instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2643–2652, 2021.
[33] Youngwan Lee and Jongyoul Park. Centermask: Real-time anchor-free instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13906–13915, 2020.
[34] Wentong Li, Wenyu Liu, Jianke Zhu, Miaomiao Cui, Xian-Sheng Hua, and Lei Zhang. Box-supervised instance segmentation with level set evolution. In European Conference on Computer Vision, pages 1–18. Springer, 2022.
[35] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[36] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[37] Yun Liu, Yu-Huan Wu, Pei-Song Wen, Yu-Jun Shi, Yu Qiu, and Ming-Ming Cheng. Leveraging instance-, image- and dataset-level information for weakly supervised instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[38] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571. IEEE, 2016.
[39] Yassine Ouali, Céline Hudelot, and Myriam Tami. Semi-supervised semantic segmentation with cross-consistency training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12674–12684, 2020.
[40] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[41] Siyuan Qiao, Liang-Chieh Chen, and Alan Yuille. Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10213–10224, 2021.
[42] Yunhang Shen, Rongrong Ji, Yan Wang, Yongjian Wu, and Liujuan Cao. Cyclic guidance for weakly supervised joint detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 697–707, 2019.
[43] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. In European Conference on Computer Vision, pages 282–298. Springer, 2020.
[44] Zhi Tian, Chunhua Shen, Xinlong Wang, and Hao Chen. Boxinst: High-performance instance segmentation with box annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5443–5452, 2021.
[45] Xinlong Wang, Tao Kong, Chunhua Shen, Yuning Jiang, and Lei Li. Solo: Segmenting objects by locations. In European Conference on Computer Vision, pages 649–665. Springer, 2020.
[46] Xiang Wang, Shaodi You, Xi Li, and Huimin Ma. Weakly-supervised semantic segmentation by iteratively mining common object features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1354–1362, 2018.
[47] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. Solov2: Dynamic and fast instance segmentation. Advances in Neural Information Processing Systems, 33:17721–17732, 2020.
[48] Yuqing Wang, Zhaoliang Xu, Hao Shen, Baoshan Cheng, and Lirong Yang. Centermask: Single shot instance segmentation with point representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9313–9321, 2020.
[49] Zhenyu Wang, Yali Li, and Shengjin Wang. Noisy boundaries: Lemon or lemonade for semi-supervised instance segmentation? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16826–16835, 2022.
[50] Ziang Yan, Jian Liang, Weishen Pan, Jin Li, and Changshui Zhang. Weakly- and semi-supervised object detection with expectation-maximization algorithm. arXiv preprint arXiv:1702.08740, 2017.
[51] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2636–2645, 2020.
[52] Shilong Zhang, Zhuoran Yu, Liyang Liu, Xinjiang Wang, Aojun Zhou, and Kai Chen. Group r-cnn for weakly semi-supervised object detection with points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9417–9426, 2022.
[53] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[54] Yanzhao Zhou, Xin Wang, Jianbin Jiao, Trevor Darrell, and Fisher Yu. Learning saliency propagation for semi-supervised instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10307–10316, 2020.
[55] Yanzhao Zhou, Yi Zhu, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Weakly supervised instance segmentation using class peak response. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3791–3800, 2018.

