

CFCG: Semi-Supervised Semantic Segmentation via Cross-Fusion and Contour Guidance Supervision

Shuo Li*, Yue He*, Weiming Zhang*, Wei Zhang†, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang

Baidu Inc
{lishuo16, heyue04, zhangweiming, zhangwei99, tanxiao, hanjunyu, dingerrui, wangjingdong}@baidu.com

* Co-first author. † Corresponding author. This work was done when Shuo Li was an intern at Baidu Inc.

Abstract

Current state-of-the-art semi-supervised semantic segmentation (SSSS) methods typically adopt pseudo labeling and consistency regularization between multiple learners with different perturbations. Although the performance is desirable, several issues remain: (1) supervision from a single learner tends to be noisy, which causes unreliable consistency regularization; (2) existing pixel-wise confidence-score-based reliability measurement causes potential error accumulation as training proceeds. In this paper, we propose a novel SSSS framework, called CFCG, which combines cross-fusion and contour guidance supervision to tackle these issues. Concretely, we adopt both image-level and feature-level perturbations to expand the feature distribution, thus pushing the potential limits of consistency regularization. Then, two modules are proposed to enable effective semi-supervised learning under heavy coherent perturbations. Firstly, the Cross-Fusion Supervision (CFS) mechanism leverages multiple learners to enhance the quality of pseudo labels. Secondly, we introduce an Adaptive Contour Guidance Module (ACGM) to effectively identify unreliable spatial regions in pseudo labels. Finally, our proposed CFCG achieves mIoU gains of +1.40% and +0.89% with a single learner, and +1.85% and +1.33% with fusion inference, on PASCAL VOC 2012 and Cityscapes respectively under 1/8 protocols, clearly surpassing previous methods and reaching the state of the art.

1. Introduction

Semantic segmentation, an essential pixel-wise classification task, has been remarkably successful with the development of deep learning. However, training for this task remains challenging owing to costly and laborious pixel-wise manual labeling [18]. To alleviate this problem, semi-supervised semantic segmentation (SSSS), which combines scarce labeled data with large amounts of unlabeled data, is needed to reduce labeling effort while preserving accuracy. Under this setting, how to adequately leverage unlabeled data becomes critical.

Recently, approaches based on the combination of consistency regularization and pseudo labeling dominate SSSS research [27, 23, 6, 11]. They encourage high similarity between the predictions for differently perturbed versions of the same input, and expand the confidence prediction map of an unlabeled image into a one-hot pseudo-label map. However, some problems remain: (1) supervision from a single learner tends to be noisy, which causes unreliable consistency regularization; (2) existing pixel-wise confidence-score-based reliability measurement causes potential error accumulation as training proceeds.

In this paper, we propose a novel SSSS framework, called CFCG, which combines cross-fusion and contour guidance supervision to tackle the above issues. First, we conduct a detailed analysis of consistency regularization. Weak-to-strong consistency regularization (WS) [27], one of the representative image-level consistency regularization methods, pairs weakly and strongly augmented images. Specifically, weak augmentation includes only flip-and-shift data augmentation, while strong augmentation produces heavily distorted versions of a given image through noise, blur, and erasure operations. Its success comes from the fact that the weak branch can produce high-quality pseudo labels while the strong branch makes training harder by injecting noise. At the feature level, various types of perturbations for consistency have been designed, such as VAT [25], random dropout, and random noise. We find that image-level strong perturbations, reflected in pixel noise, and feature-level perturbations, reflected in feature noise, exhibit similar characteristics. We therefore place both image-level strong augmentation and feature-level perturbations in the same branch to cause heavy Coherent Perturbation (CP), pushing the potential limits of consistency regularization.

Figure 1. A comparison between typical SSSS frameworks. (a) A single learner with weak-to-strong consistency (WS) strategy [27, 40]. (b) Multiple individual learners with WS strategy [23, 16, 39]. (c) Multiple learners with symmetric WS strategy [11]. (d) Our proposed Cross-Fusion Supervision (CFS) applied to multiple learners.

Figure 2. The confidence score distributions of pseudo labels on the VOC dataset. The horizontal axis represents the confidence intervals, and the vertical axis represents the number of correct and erroneous pseudo labels in each interval. We illustrate the problems of conventional confidence-score-based reliability measurement: (a) input image; (b) confidence map of the model prediction; (c) the difference between ground truth and pseudo label; (d) the predicted pseudo label; (e) the semantic contour of the pseudo label generated from (d); (f) the weight map adopted in our proposed ACGM. We can observe that, from (b) and (c), the confidence map is noisy and unreliable, as highlighted by the red box; from (f) and (c), our proposed ACGM effectively identifies unreliable spatial regions.
Fig. 1 provides an extensive comparison between WS-based SSSS frameworks. Typically, as shown in Fig. 1 (a), a single learner with WS was first proposed in [40]. Then, with the advantage of the mean-teacher, the SSSS framework evolved to Fig. 1 (b), in which the WS strategy is applied together with multiple learners [23, 16, 39]. Fig. 1 (c) further introduces the symmetric dual-student method, where both strongly and weakly augmented images flow into bipartite learners; the pseudo label obtained from one network supervises the other network and vice versa [11]. Different from the above works, we propose the Cross-Fusion Supervision (CFS) mechanism, which enables information transfer between the multiple learners to enhance the quality of pseudo labels, thus benefiting semi-supervised training. In the inference stage, a single model, without relying on cross-fusion, is able to generate high-quality results.

Next, we delve into the shortcomings of conventional confidence-score-based reliability measurement in SSSS. Fig. 2 illustrates the confidence score distributions of pseudo labels on the VOC dataset. We can observe that it is almost impossible to identify unreliable pixel-wise pseudo labels via pixel-wise confidence score thresholds, which are widely adopted in semi-supervised tasks [27, 41, 23, 11, 16]. Furthermore, from (b) and (c) of Fig. 2, the confidence map is usually noisy and unreliable, as highlighted by the red box. This issue leads to error accumulation as training proceeds, seriously affecting the performance of semi-supervised learning in segmentation tasks, so previous pixel-wise confidence-score-based methods struggle with very limited performance gains. To address this issue, we draw inspiration from the similarity between (c) and (f) in Fig. 2, where most unreliable spatial regions can be identified by the contour map. Therefore, we propose an Adaptive Contour Guidance Module (ACGM) that gradually applies a contour-guided weight map to re-weight the training loss. In this way, the reliability of each pixel-wise pseudo label incorporates contextual cues. Intuitively, as shown in Fig. 2, our weight map (f) is acquired by softening the semantic contour of the pseudo label (e), and its visualization proves that our ACGM's weight map can cover the error regions more exactly.
Figure 3. Illustration of our CFCG method on unlabeled data: (a) network architecture; (b) Adaptive Contour Guidance Module (Laplacian contour extraction, Gaussian blurring, and normalization); (c) Cross-Fusion Module (sum fusion, average pooling, FC layers, softmax, and dot product). CFM denotes the Cross-Fusion Module, one component of Cross-Fusion Supervision; how the fusion is implemented is shown in (c), and the details of ACGM are shown in (b). CP consists of input image augmentation and encoder feature augmentation, which further enhances the learner's learning ability. ACGM iteratively updates reliable pseudo labels to guide semi-supervised learning. CFM fuses the weak-flow outputs and strong-flow outputs of the two learners separately, then leverages the fused pseudo label of the weak flow to supervise the strong flow's fusion. For labeled data, images are fed through the weak flow into the two networks, and both the networks and the CFM are supervised by the corresponding ground truth.
To summarize, our contributions are:

• We propose a novel SSSS framework, CFCG, mainly comprising Cross-Fusion Supervision (CFS) and the Adaptive Contour Guidance Module (ACGM). Our CFS, which neatly blends CP, taps into underutilized knowledge to enhance the quality of pseudo labels during the training stage and helps enhance expressive power during inference.

• The proposed ACGM introduces the semantic contour as guidance, encouraging the establishment of positional relations, to effectively identify unreliable spatial regions in pseudo labels and accurately mitigate the problem of confirmation bias.

• Experimental results show that our CFCG achieves new state-of-the-art results on two commonly used benchmarks, yielding mIoU of 77.10% and 78.49% with no additional computation, and 77.55% and 78.93% with fusion inference, on PASCAL VOC 2012 and Cityscapes respectively under 1/8 protocols.

2. Related Work

2.1. Semantic Segmentation

Semantic segmentation is a fundamental topic in the computer vision field. Recent years have witnessed remarkable progress, from FCN to Transformer-like networks. Specifically, FCN [24] is a milestone that builds fully convolutional networks for pixel-wise prediction. Since then, extensions based on FCN [5, 2, 36] have been validated as invaluable for capturing long-range context dependencies. DeepLabv3 [5] designs an atrous spatial pyramid pooling module to enlarge the receptive field. SegNet [2] strengthens context cues through its encoder-decoder structure. PSPNet [36] incorporates a pyramid pooling module to embed multi-scale context features into the FCN-based architecture. Lately, Transformer-based solutions have attracted more and more attention since the Vision Transformer [9] was introduced. These methods continue to extend transformer techniques, including extracting features from the input image [31, 37], learning class embeddings [28], or formulating segmentation as a simple mask classification task [7].

2.2. Semi-supervised Learning

Semi-supervised learning (SSL) has been an active research topic, and the related literature contains a large variety of methods [19, 4, 3, 32, 27, 34, 29]. They can be mainly categorized into self-training-based and consistency-regularization-based approaches. The former typically predicts pseudo labels for unlabeled data, and then the ground truth and pseudo labels are used together as the supervisory signal in training. The latter enforces similar predictions by the network under different forms of perturbation. Specifically, Dual Student [19] replaces the teacher of the mean-teacher framework with another student. UDA [32] and FixMatch [27] utilize confidence-based thresholding techniques to ensure the quality of pseudo labels. FlexMatch improves this strategy by considering the different learning difficulties of different classes [34]. A self-adaptive confidence threshold is proposed in FreeMatch [29].
2.3. Semi-supervised Semantic Segmentation

SSSS is more challenging than SSL due to the need for dense prediction. Inspired by SSL, common SSSS mostly follows the same track, where the self-training and consistency regularization approaches are still validated to be very powerful. Specifically, CCT [26] has a shared encoder and multiple decoders with a series of feature perturbations. CPS [6] achieves cross pseudo supervision for the same input via two networks with different initializations. The class imbalance problem has then been addressed successively with the adaptive equalization learning framework [16], the uncertainty-guided cross-head co-training framework [11], and a pixel-level contrastive learning scheme [1]. Based on the mean-teacher model, PS-MT [23] creatively designs a new auxiliary teacher and a stricter confidence-weighted cross entropy loss. U2PL [30] chooses a unique way of using unreliable pseudo labels. GCT [18] targets diverse pixel-wise tasks and proposes a flaw detector to locate the noise in pseudo labels. Note that GCT [18], ELN [21], and similar models design extra networks to predict flaw regions. Self-supervised learning has also been employed for this task [38, 1, 22, 35]. For self-training, ST++ [33] improves the self-training pipeline and gets better performance in an offline way. Different from them, we find the potential information and use it effectively, avoiding extra trained models and extra training time.

3. Method

In this section, we present an overview of the proposed framework in Section 3.1. Then we describe our CP in Section 3.2. Additionally, we describe the details of CFS in Section 3.3. Finally, the ACGM is introduced in Section 3.4.

3.1. Overview

As shown in Fig. 3, our semi-supervised framework consists of two learners with the same structure, in which each learner contains a weak and a strong flow. For unlabeled images, we perform three steps to realize our semi-supervised learning. Firstly, an unlabeled image is fed into the weak and strong flow of each learner, where the CP strategy further enhances the learner's ability. Secondly, we use CFS to fuse the weak-flow outputs and strong-flow outputs of the two learners separately, and then leverage the fused pseudo label of the weak flow to supervise the strong flow's fusion output. Finally, we use the pseudo label generated by the weak flow of one learner to supervise the strong-flow output of the other learner. During this supervision, ACGM is used to iteratively update reliable pseudo labels to guide semi-supervised learning.

In general, given limited labeled images D_l = {(x_i^l, y_i^l); i ∈ (1, ..., N_l)} and a large amount of unlabeled images D_u = {(x_i^u); i ∈ (1, ..., N_u)}, the unsupervised loss for unlabeled images can be written as:

L_u = λ_1 (L_u1 + L_u2) + L_uf,   (1)

where (L_u1 + L_u2) represents the cross pseudo supervision loss from the two learners and L_uf represents the CFS loss. We use λ_1 as the trade-off weight to balance the cross pseudo supervision loss and the CFS loss.

For labeled images, following previous work [11], images are passed through the weak flow of the two learners simultaneously, and both the networks and the CFM are supervised by the corresponding ground truth; the CFS and ACGM techniques are also employed here. The supervised loss on labeled data is:

L_l = L_lgt + L_lf,   (2)

where L_lgt represents the loss between predictions and ground truth, and L_lf represents the CFS loss for labeled images, obtained by supervising the fused logits of CFS with the ground truth. The details of the losses are described in Section 3.3. In the end, the total loss L_total is:

L_total = L_l + λ_2 L_u.   (3)

Here, we use λ_2 to control the balance between the loss on labeled data and the loss on unlabeled data.
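To make the loss composition concrete, here is a minimal PyTorch-style sketch of Eqs. (1)-(3); the function name and the default values of lambda1/lambda2 are illustrative assumptions, not the released implementation.

```python
import torch

def cfcg_total_loss(l_gt: torch.Tensor, l_lf: torch.Tensor,
                    l_u1: torch.Tensor, l_u2: torch.Tensor, l_uf: torch.Tensor,
                    lambda1: float = 1.0, lambda2: float = 1.0) -> torch.Tensor:
    """Combine the CFCG losses following Eqs. (1)-(3).

    l_gt: supervised loss of the two learners on labeled data
    l_lf: CFS loss on labeled data (fused logits vs. ground truth)
    l_u1, l_u2: cross pseudo supervision losses on unlabeled data
    l_uf: CFS loss on unlabeled data (fused strong logits vs. fused weak pseudo labels)
    """
    loss_u = lambda1 * (l_u1 + l_u2) + l_uf   # Eq. (1)
    loss_l = l_gt + l_lf                      # Eq. (2)
    return loss_l + lambda2 * loss_u          # Eq. (3)
```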
3.2. Weak and Strong Coherent-Perturbation

Classic weak-to-strong consistency regularization [27] can be formulated as:

f^w = E(α(x^u)),   (4)

f^s = E(A(α(x^u))),   (5)

where x^u represents an unlabeled image, E represents the encoder, and f^w and f^s represent the encoder embedding features generated under the weak image augmentation α(·) and the strong image augmentation A(·), respectively.
To push the potential limits of consistency regularization, we propose the CP strategy, which applies a strong feature perturbation to the encoder output of the strongly augmented unlabeled image, producing heavily perturbed features. Thus f^s is rewritten as f^{s+}:

f^{s+} = B(f^s),   (6)

where B represents the feature perturbation. In the strong flow, both the image-level and feature-level perturbations include noise, blur, and erasure operations.
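As a rough sketch of Eqs. (4)-(6), the snippet below pairs a weak branch with a coherently perturbed strong branch. The concrete augmentations and the dropout-based stand-in for the feature perturbation B(·) are our assumptions; the paper only names noise, blur, and erasure as the operation family.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

weak_aug = T.Compose([T.RandomHorizontalFlip(),
                      T.RandomCrop(512, pad_if_needed=True)])  # α(·): flip-and-shift
strong_aug = T.Compose([T.ColorJitter(0.4, 0.4, 0.4),
                        T.GaussianBlur(kernel_size=5),
                        T.RandomErasing()])                    # A(·): noise / blur / erasure
feature_perturb = nn.Dropout2d(p=0.5)                          # assumed stand-in for B(·)

def coherent_perturbation(encoder: nn.Module, x_u: torch.Tensor):
    x_w = weak_aug(x_u)              # weak view of the unlabeled image
    x_s = strong_aug(x_w)            # strong view built on top of the weak view
    f_w = encoder(x_w)               # Eq. (4)
    f_s = encoder(x_s)               # Eq. (5)
    f_s_plus = feature_perturb(f_s)  # Eq. (6): image- and feature-level noise in one branch
    return f_w, f_s_plus
```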
3.3. Cross-Fusion Supervision

Sub-figures (a) and (c) in Fig. 3 illustrate our proposed Cross-Fusion Supervision (CFS) mechanism. In general, CFS first performs channel-wise attention and fusion for the weakly and the strongly perturbed learners, respectively. Then consistency regularization between the two learners is calculated based on the fused weak logits. Compared with previous SSSS works, in which consistency regularization is calculated between the outputs of single learners, our proposed CFS leverages both learners to enhance the quality of pseudo labels, thus benefiting semi-supervised learning.

Specifically, our proposed CFS contains three parts: cross-learner attention, weighted fusion, and cross-supervision. In cross-learner attention, we perform an element-wise sum of p_1^w, p_2^w ∈ R^{C×H×W}, and average pooling is used to encode the spatial representation of the different semantic channels into a vector a ∈ R^C:

a = (1 / (H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} (p_1^w(i, j) + p_2^w(i, j)).   (7)

Next, v_i = FC_i(a) for i ∈ {1, 2}: a is fed into two different FC layers to generate two attention vectors v_1, v_2 ∈ R^C.

In the weighted fusion part, the generated attention vectors v_1, v_2 are used to perform channel-weighted fusion of p_1^w, p_2^w. The fused map p_f^w is obtained as follows:

p_f^w = v_1 · p_1^w + v_2 · p_2^w,   (8)

where p_f^w ∈ R^{C×H×W}. In the same way, p_f^s is acquired by the same operations from p_1^s, p_2^s.

Finally, the argmax function chooses the class c ∈ {1, ..., C} with the maximal probability in p_f^w; the result is denoted y_f^w. The fusion loss can then be written as:

L_uf = (1 / |D_u|) Σ_{x∈D_u} (1 / (W×H)) η_f · ℓ_ce(p_f^s, y_f^w),   (9)

where the cross entropy loss ℓ_ce minimizes the discrepancy between the two probability distributions, |D_u| is the batch size of unlabeled images, and η_f, a loss coefficient, is explained in Section 3.4.

Finally, we introduce cross-supervision, which calculates the consistency loss between the weak and strong flows across learners. As shown in Fig. 3, the pseudo labels generated by the weak CP flow of one learner are used to supervise the predictions from the other learner's strong CP flow. The losses L_u1, L_u2 of CFS can be formulated as:

L_u1 = (1 / |D_u|) Σ_{x∈D_u} (1 / (W×H)) η_1 · ℓ_ce(p_2^s, y_1^w),   (10)

L_u2 = (1 / |D_u|) Σ_{x∈D_u} (1 / (W×H)) η_2 · ℓ_ce(p_1^s, y_2^w),   (11)

where η_1 and η_2 represent the loss coefficients for the two learners.

Taking advantage of our proposed CFS, each learner is able to receive better supervision from the other learner. During the inference stage, a single learner without fusion inference already generates state-of-the-art SSSS results, while a learner with fusion inference further improves the results at the cost of additional computation.
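The channel-attention fusion of Eqs. (7)-(8) amounts to a small module like the one below. This is a minimal sketch under our reading of Fig. 3 (c); the layout of the two FC heads and the softmax normalization of the paired attention weights are assumptions where the text does not pin them down.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFusionModule(nn.Module):
    """Sketch of CFM (Eqs. 7-8): channel attention over two learners' outputs."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.fc1 = nn.Linear(num_classes, num_classes)  # one FC head per learner
        self.fc2 = nn.Linear(num_classes, num_classes)

    def forward(self, p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
        # p1, p2: (B, C, H, W) probability maps from the two learners
        a = (p1 + p2).mean(dim=(2, 3))            # Eq. (7): sum fusion + average pooling
        v1, v2 = self.fc1(a), self.fc2(a)         # two channel-attention vectors
        w = F.softmax(torch.stack([v1, v2]), 0)   # normalize the pair per channel (assumed)
        w1 = w[0].unsqueeze(-1).unsqueeze(-1)     # (B, C, 1, 1), broadcast over H, W
        w2 = w[1].unsqueeze(-1).unsqueeze(-1)
        return w1 * p1 + w2 * p2                  # Eq. (8): weighted fusion
```

The fused weak map p_f^w is then arg-maxed into the pseudo label y_f^w, which supervises the fused strong map p_f^s through Eq. (9).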
3.4. Adaptive Contour Guidance Module

In most previous works, a confidence-based thresholding strategy is used to measure the quality of pseudo labels: if the confidence is higher than a given threshold, the corresponding pseudo label is considered reliable and retained. However, as Fig. 2 shows, a large number of wrong pseudo labels located in the high-confidence interval are considered reliable, and vice versa. Based on this observation, we formulate the problem as a re-weighting task and propose ACGM, which uses contours to guide the network, aiming to detect the noise in pseudo labels by relying on spatial information.

Contour-based Weight Map. The contour-based weight map M ∈ R^{H×W} is generated in two steps: extract and soften. In the first step, as shown in Fig. 3 (b), semantic contour maps are extracted from the pseudo labels by a group of Laplacian operators. Specifically, to capture multi-scale information, we apply Laplacian kernels with different strides, upsample the resulting contour maps to the original size, and fuse them with a 1 × 1 convolution; the result is then binarized into the semantic contour map. In the second step, image processing operations turn this semantic contour map into the final weight map: (1) a Gaussian kernel blurs the semantic contour map, transforming it into a dense probability map; (2) all pixels of the dense probability map are normalized to the range [0, 1] to generate the final weight map.
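A sketch of this two-step construction under our reading of the text, using a single 3 × 3 Laplacian rather than the full multi-scale group; the blur kernel size and the min-max normalization are likewise assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def contour_weight_map(pseudo_label: torch.Tensor, num_classes: int,
                       blur_kernel: int = 65) -> torch.Tensor:
    """pseudo_label: (B, H, W) int64 class map -> (B, H, W) weight map M in [0, 1].

    blur_kernel=65 is an odd-sized stand-in for the paper's k = 64
    (torchvision requires odd Gaussian kernels).
    """
    one_hot = F.one_hot(pseudo_label, num_classes).permute(0, 3, 1, 2).float()
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                       device=pseudo_label.device).view(1, 1, 3, 3)
    edges = F.conv2d(one_hot, lap.repeat(num_classes, 1, 1, 1),
                     padding=1, groups=num_classes)            # per-class Laplacian
    contour = (edges.abs().sum(1, keepdim=True) > 0).float()   # binary semantic contour
    soft = TF.gaussian_blur(contour, kernel_size=blur_kernel)  # soften into a dense map
    lo = soft.amin(dim=(2, 3), keepdim=True)                   # min-max normalize to [0, 1]
    hi = soft.amax(dim=(2, 3), keepdim=True)
    return ((soft - lo) / (hi - lo + 1e-6)).squeeze(1)
```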

Adaptive Loss Re-weight. As mentioned above, we use the contour-based weight map M as a coefficient of the cross entropy loss to guide the model to leverage spatial information and distinguish the noise in pseudo labels. Rather than multiplying by M directly, we take into account that the learning ability of the network in the early training stage is limited, so we adopt a progressive strategy η to gradually introduce the knowledge of the semantic contour:

η = (1 − γ) · I + γ · M,   (12)

γ = exp(a_e · t) − 1,   a_e = ln(2) / max_epoch_iter,   (13)

where η ∈ R^{H×W} and I ∈ R^{H×W} is the all-ones matrix. In Eq. 12, γ gradually assigns a higher weight to M because the network may converge in the wrong direction in the early stage of training, while in the later stage the contour, and hence the pseudo label, is more reliable and instructive. The variable t denotes the current training epoch, and the constant a_e is chosen so that γ ranges from 0 to 1 by Eq. 13.
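The schedule of Eqs. (12)-(13) reduces to a few lines; `max_epoch` below stands in for the paper's max_epoch_iter.

```python
import math
import torch

def acgm_weight(M: torch.Tensor, epoch: int, max_epoch: int) -> torch.Tensor:
    """eta = (1 - gamma) * I + gamma * M (Eqs. 12-13), with gamma ramping 0 -> 1."""
    a_e = math.log(2.0) / max_epoch       # guarantees gamma(max_epoch) = exp(ln 2) - 1 = 1
    gamma = math.exp(a_e * epoch) - 1.0   # gamma = 0 at epoch 0
    return (1.0 - gamma) + gamma * M      # the all-ones map I broadcasts as the scalar 1
```

The returned η is then used pixel-wise as the coefficient of ℓ_ce in Eqs. (9)-(11).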
der different partition protocols, using the same architecture
4. Experiments and partition protocols for fairness.
PASCAL VOC 2012. Table 1 shows the comparison re-
4.1. Experimental Setup
sults on PASCAL VOC 2012. We can see that over all
Datasets. We evaluate our framework on two different the partitions from 1/16 to 1/2, with both ResNet-50 and
datasets: PASCAL VOC 2012 [10] and Cityscapes [8]. The ResNet-101, our method w/o fusion inference and w/ fu-
PASCAL VOC 2012 dataset contains 1464 and 1449 im- sion inference consistently outperforms the other methods.
ages used for training and validation respectively. Later, it For example, compared to CPS[6], which can be considered
is expanded by extra relatively coarse manual annotations as our baseline, CFCG w/o fusion inference improves by
from the SBD dataset [13], resulting in a total of 10582 2.00%-3.43% under all partition protocols with ResNet-50.
training images. We follow the previous work [14] to use While w/ fusion inference improves by 2.50%-3.88% under
the augmented set as our full training set. We further pro- all partition protocols with ResNet-50. Additionally, our
vide results for the Cityscapes dataset, which consists of method is superior to the other state-of-art methods in vari-
2975 training and 500 validation images with 19 individual ous settings. To be specific, based on ResNet101, it outper-
classes. forms previous state-of-art UCC [11] by 2.29% and 2.04%
We rigorously follow the partition protocols of Guided under the 1/8 partitions with ResNet-50 and ResNet-101 re-
Collaborative Training (GCT) [18], and it divides the whole spectively. The result demonstrates our CFCG’s robustness
training set into two groups via randomly sub-sampling 1/2, for this SSSS task.
1/4, 1/8, and 1/16 of the whole set as the labeled set and Based on the w/o fusion inference, we find the test with
regards the remaining images as the unlabeled set. w/ fusion inference brings a significant improvement in per-
Evaluation. Following previous methods [12, 16, 26, 38, formance. Our approach with fusion inference outperforms
6], the images are center cropped into a fixed resolution for the approach without fusion inference by 0.50% and 0.59%
PASCAL VOC 2012. For Cityscapes, previous methods ap- under 1/2 partition protocol with ResNet-50 and ResNet-
ply slide window evaluation, and so do we. Then we adopt 101 separately.
the mean of Intersection over Union (mIoU) as the metric to Cityscapes. Table 2 illustrates the comparison results on
evaluate these cropped images. All results are measured on the Cityscapes val set. In comparison to other state-of-
the val set on both PASCAL VOC 2012 [10] and Cityscapes the-art(SOTA) methods, our model achieves higher per-
[8]. Ablation studies are conducted on the blender PAS- formance among all partition protocols with both ResNet-
CAL VOC 2012 [10] val set under 1/8 partition protocol. 50 and ResNet-101 backbones. For example, our method
During the test, one network(w/o fusion inference) and two w/o fusion inference outperforms previous state-of-art PS-
networks(w/ fusion inference) are all analyzed in our ap- MT[23] by 2.73% and 2.06% under the 1/8 and 1/4 parti-
proach. tions with ResNet-50 separately.

Method                      | ResNet-50: 1/16(662) 1/8(1323) 1/4(2646) 1/2(5291) | ResNet-101: 1/16(662) 1/8(1323) 1/4(2646) 1/2(5291)
CPS (w/ CutMix) [6]         | 71.98  73.67  74.90  76.15 | 74.48  76.44  77.68  78.64
U2PL (w/ CutMix) [30]†      | -      -      -      -     | 74.43  77.60  78.70  79.94
UCC [11]                    | 74.05  74.81  76.38  76.53 | 76.49  77.06  79.07  79.54
PS-MT [23]                  | 72.83  75.70  76.43  77.88 | 75.50  78.20  78.72  79.76
ELN [21]                    | -      73.20  74.63  -     | -      75.10  76.58  -
ST++ [33]                   | 72.60  74.40  75.40  -     | 74.50  76.30  76.60  -
Ours (w/o fusion inference) | 75.00  77.10  77.72  78.15 | 76.82  79.10  79.96  80.18
Ours (w/ fusion inference)  | 75.58  77.55  78.34  78.65 | 77.39  79.40  80.42  80.77

Table 1. Comparison with state-of-the-art methods on the PASCAL VOC 2012 dataset. Labeled images are sampled from the blended training set, which is augmented by the SBD dataset. "w/o fusion inference" denotes directly using the output predicted by one model without any CF operation; "w/ fusion inference" denotes using the output predicted by the two models with the CF operation. Results of U2PL marked with † are acquired through the open-source code repository, whose partition is the same as [18].

Method                      | ResNet-50: 1/16(186) 1/8(372) 1/4(744) 1/2(1488) | ResNet-101: 1/16(186) 1/8(372) 1/4(744) 1/2(1488)
CPS (w/ CutMix) [6]         | 74.47  76.61  77.83  78.77 | 74.72  77.62  79.21  80.21
U2PL (w/ CutMix) [30]†      | -      -      -      -     | 70.30  74.37  76.47  79.05
U2PL (w/ AEL) [30]†         | -      -      -      -     | 74.90  76.48  78.51  79.12
UCC [11]                    | 76.02  77.60  78.28  79.54 | 77.17  78.71  79.59  80.57
PS-MT [23]                  | -      75.76  76.92  77.64 | -      76.89  77.60  79.09
ELN [21]                    | -      -      -      -     | -      70.33  73.52  75.33
ST++ [33]                   | -      72.70  73.80  -     | -      -      -      -
Ours (w/o fusion inference) | 76.13  78.49  78.98  79.76 | 77.28  79.09  80.07  80.59
Ours (w/ fusion inference)  | 76.14  78.93  79.28  80.13 | 77.76  79.60  80.36  80.92

Table 2. Comparison with state-of-the-art methods on the Cityscapes dataset. "w/o fusion inference" denotes directly using the output predicted by one model without any CF operation; "w/ fusion inference" denotes using the output predicted by the two models with the CF operation. Results of U2PL marked with † follow the paper, whose partition is different from [18]; considering that the partition's impact is not significant on the Cityscapes dataset, we include them in this table for reference.

Overall, our architecture adapts to various settings on both the PASCAL VOC 2012 and Cityscapes datasets and achieves stable and impressive results.

4.3. Ablation Studies

To explore the effects of the different modules in depth, we conduct all ablation experiments on PASCAL VOC 2012 under the 1/8 labeled ratio with ResNet-50, and we use DeepLabv3+ to evaluate our results.

CP | CFS (w/o fusion inference) | CFS (w/ fusion inference) | ACGM | mIoU
   |                            |                           |      | 73.08
√  |                            |                           |      | 73.44
√  |                            |                           | √    | 75.23
√  | √                          |                           |      | 75.22
√  | √                          |                           | √    | 77.10
√  |                            | √                         | √    | 77.55

Table 3. Ablation study using the 1/8 labeled ratio on PASCAL VOC 2012 under the DeepLabv3+ architecture.

The effect of CP, CFS, and ACGM is verified in Table 3, where we use cross pseudo supervision [6] trained with input image augmentation as the baseline. The CP strategy alone provides a slight improvement of 0.36%, showing that a certain amount of noise still exists in the pseudo labels. To deal with this problem, we introduce ACGM and CFS: ACGM and CFS (w/o fusion inference) improve performance by 1.79% and 1.78% respectively, which verifies that ACGM can generate better pseudo labels for the SSSS task and that CFS can fully fuse the knowledge of different learners to give the model strong learning ability. Further, we achieve a significant improvement of 3.66% by integrating ACGM and CFS (w/o fusion inference). Finally, compared with the baseline, we achieve a performance improvement of 4.47%, which indicates that our CFCG framework is efficient and friendly for SSSS.

Cross-fusion Supervision. To verify the effectiveness of the fusion operation, we also run inference without CFS and simply test the model using the average fused logits of the two learners.
As shown in Table 4, the performance of average fusion is an obvious 0.78% higher, which shows that fusion is indeed working. Moreover, comparing CFS (w/o fusion inference), tested on a single learner, with average fusion, the former is significantly boosted by 1.11%. This proves that the fusion ability of CFS is powerful enough to learn the knowledge distribution of the different learners well while testing without extra computation. Finally, testing with CFS (w/ fusion inference) further improves performance by 0.45%, which shows that the fusion strategy brings a valuable improvement at the cost of additional computation. In short, our CFS strategy is necessary for improving performance, as it greatly integrates the learning abilities of the two branch models.

Fusion Strategy            | mIoU
w/o CFS                    | 75.23
Average fusion             | 76.01
CFS (w/o fusion inference) | 77.10
CFS (w/ fusion inference)  | 77.55

Table 4. Comparison of different fusion modes in the inference stage. The first row is our method without the CFS strategy; the second row is average model fusion; the third row is our single model trained with CFS but tested without fusion inference; the last row is fusion inference.

Adaptive Contour Guidance Module. In Section 3.4, we add a parameter γ that is initialized at 0 and gradually assigns a higher weight, up to 1. Several tendencies are examined in the experiments. As shown in Table 5, compared with not using the weight map (γ = 0), the ACGM strategy clearly improves performance from 75.22% to 77.10%. Conversely, a constant γ = 1 means the loss depends entirely on the contour weight map. The log, linear, and exp schedules all increase γ over the training iterations; among them, exp performs best, reaching 77.10%. We also investigate the influence of the kernel size k of the Gaussian blur used to soften the binary contour maps (Section 3.4). From Table 5, k = 64 performs best on PASCAL VOC 2012; for Cityscapes we likewise use k = img_size/8, which is 100, to generate the weight map. Note that k is tuned with γ following the exp schedule.

Param | Value      | Detail                          | mIoU
γ     | constant 0 | γ = 0                           | 75.22
γ     | constant 1 | γ = 1                           | 75.95
γ     | linear     | γ = a_k x, 0 ≤ γ < 1            | 76.31
γ     | exp        | γ = exp(a_e x) − 1, 0 ≤ γ < 1   | 77.10
γ     | log        | γ = log(a_l x + 1), 0 ≤ γ < 1   | 75.56
k     | k = 32     | k = img_size/16                 | 75.86
k     | k = 64     | k = img_size/8                  | 77.10
k     | k = 128    | k = img_size/4                  | 75.88

Table 5. Performance of different trends of γ and of different kernel sizes k. All results are evaluated under the 1/8 partition protocol with ResNet-50 on PASCAL VOC 2012. Note that a_e is calculated by Eq. 13 (a_k and a_l analogously), and the results are generated w/o fusion inference.

Generally, one could adopt a contour-weakened strategy that dynamically gives higher weight to reliable pseudo labels while suppressing unreliable ones. Instead, we adopt the contour-strengthened strategy, in which the unreliable regions acquire higher weight, making the learner pay more attention to them and shrinking these regions until the model becomes confident. To prove this, we conduct contrast experiments, shown in Table 6: the contour-weakened strategy is 2.09% lower than the contour-strengthened strategy, and weakening the contour regions brings no benefit even compared with not using ACGM at all. In fact, since unreliable pseudo labels tend to lie in the regions where the predictions of the two learners differ, the unreliable regions should indeed be emphasized; emphasizing these difference regions also promotes the performance of consistency regularization.

Contour Strategy | mIoU
w/o ACGM         | 75.22
Weakened         | 75.01
Strengthened     | 77.10

Table 6. Comparison of how to handle unreliable regions in the contour strategy. Note that the results are generated w/o fusion inference.

5. Conclusion

In this paper, we propose a novel SSSS framework called CFCG, combining CFS and ACGM. Specifically, our CFS creatively fuses information between the weak flows and between the strong flows of the two learners through a channel-wise attention mechanism, tapping into underutilized knowledge; at test time, both with and without fusion inference achieve consistent performance gains. Our ACGM, in turn, guides the learners to adaptively and effectively identify unreliable spatial regions by relying on spatial contour information. Experimental results show that CFCG achieves new state-of-the-art results on two commonly used benchmarks, yielding mIoU of 77.10% and 78.49% with no additional computation, and 77.55% and 78.93% with fusion inference, on PASCAL VOC 2012 and Cityscapes respectively under 1/8 protocols. These comparisons show that our CFCG method performs favorably against state-of-the-art approaches for the semi-supervised semantic segmentation task.
References

[1] Inigo Alonso, Alberto Sabater, David Ferstl, Luis Montesano, and Ana C Murillo. Semi-supervised semantic segmentation with pixel-level contrastive learning from a class-wise memory bank. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8219–8228, 2021.

[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.

[3] David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. ReMixMatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785, 2019.

[4] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. MixMatch: A holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems, 32, 2019.

[5] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

[6] Xiaokang Chen, Yuhui Yuan, Gang Zeng, and Jingdong Wang. Semi-supervised semantic segmentation with cross pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2613–2622, 2021.

[7] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021.

[8] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.

[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[10] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111:98–136, 2015.

[11] Jiashuo Fan, Bin Gao, Huan Jin, and Lihui Jiang. UCC: Uncertainty guided cross-head co-training for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9947–9956, 2022.

[12] Geoff French, Samuli Laine, Timo Aila, Michal Mackiewicz, and Graham Finlayson. Semi-supervised semantic segmentation needs strong, varied perturbations. arXiv preprint arXiv:1906.01916, 2019.

[13] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pages 991–998. IEEE, 2011.

[14] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pages 991–998. IEEE, 2011.

[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[16] Hanzhe Hu, Fangyun Wei, Han Hu, Qiwei Ye, Jinshi Cui, and Liwei Wang. Semi-supervised semantic segmentation via adaptive equalization learning. Advances in Neural Information Processing Systems, 34:22106–22118, 2021.

[17] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.

[18] Zhanghan Ke, Di Qiu, Kaican Li, Qiong Yan, and Rynson WH Lau. Guided collaborative training for pixel-wise semi-supervised learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII, pages 429–445. Springer, 2020.

[19] Zhanghan Ke, Daoye Wang, Qiong Yan, Jimmy Ren, and Rynson WH Lau. Dual student: Breaking the limits of the teacher in semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6728–6736, 2019.

[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.

[21] Donghyeon Kwon and Suha Kwak. Semi-supervised semantic segmentation with error localization network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9957–9967, 2022.

[22] Xin Lai, Zhuotao Tian, Li Jiang, Shu Liu, Hengshuang Zhao, Liwei Wang, and Jiaya Jia. Semi-supervised semantic segmentation with directional context-aware consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1205–1214, 2021.

[23] Yuyuan Liu, Yu Tian, Yuanhong Chen, Fengbei Liu, Vasileios Belagiannis, and Gustavo Carneiro. Perturbed and strict mean teachers for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4258–4267, 2022.

[24] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[25] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993, 2018.

[26] Yassine Ouali, Céline Hudelot, and Myriam Tami. Semi-supervised semantic segmentation with cross-consistency training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12674–12684, 2020.

[27] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. FixMatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33:596–608, 2020.

[28] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7262–7272, 2021.

[29] Yidong Wang, Hao Chen, Qiang Heng, Wenxin Hou, Marios Savvides, Takahiro Shinozaki, Bhiksha Raj, Zhen Wu, and Jindong Wang. FreeMatch: Self-adaptive thresholding for semi-supervised learning. arXiv preprint arXiv:2205.07246, 2022.

[30] Yuchao Wang, Haochen Wang, Yujun Shen, Jingjing Fei, Wei Li, Guoqiang Jin, Liwei Wu, Rui Zhao, and Xinyi Le. Semi-supervised semantic segmentation using unreliable pseudo-labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4248–4257, 2022.

[31] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.

[32] Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems, 33:6256–6268, 2020.

[33] Lihe Yang, Wei Zhuo, Lei Qi, Yinghuan Shi, and Yang Gao. ST++: Make self-training work better for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4268–4277, 2022.

[34] Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. FlexMatch: Boosting semi-supervised learning with curriculum pseudo labeling. Advances in Neural Information Processing Systems, 34:18408–18419, 2021.

[35] Jianrong Zhang, Tianyi Wu, Chuanghao Ding, Hongwei Zhao, and Guodong Guo. Region-level contrastive and consistency learning for semi-supervised semantic segmentation. arXiv preprint arXiv:2204.13314, 2022.

[36] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.

[37] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6881–6890, 2021.

[38] Yuanyi Zhong, Bodi Yuan, Hong Wu, Zhiqiang Yuan, Jian Peng, and Yu-Xiong Wang. Pixel contrastive-consistent semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7273–7282, 2021.

[39] Yanning Zhou, Hang Xu, Wei Zhang, Bin Gao, and Pheng-Ann Heng. C3-SemiSeg: Contrastive semi-supervised segmentation via cross-set learning and dynamic class-balancing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7036–7045, 2021.

[40] Yuliang Zou, Zizhao Zhang, Han Zhang, Chun-Liang Li, Xiao Bian, Jia-Bin Huang, and Tomas Pfister. PseudoSeg: Designing pseudo labels for semantic segmentation. arXiv preprint arXiv:2010.09713, 2020.

[41] Simiao Zuo, Yue Yu, Chen Liang, Haoming Jiang, Siawpeng Er, Chao Zhang, Tuo Zhao, and Hongyuan Zha. Self-training with differentiable teacher. arXiv preprint arXiv:2109.07049, 2021.
