IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 31, 2022
Unsupervised Foggy Scene Understanding via Self Spatial-Temporal Label Diffusion
Liang Liao, Member, IEEE, Wenyi Chen, Jing Xiao, Member, IEEE, Zheng Wang, Member, IEEE, Chia-Wen Lin, Fellow, IEEE, and Shin'ichi Satoh, Senior Member, IEEE

Abstract—Understanding foggy image sequences in driving scenes is critical for autonomous driving, but it remains a challenging task due to the difficulty in collecting and annotating real-world images of adverse weather. Recently, the self-training strategy has been considered a powerful solution for unsupervised domain adaptation, which iteratively adapts the model from the source domain to the target domain by generating target pseudo labels and re-training the model. However, the selection of confident pseudo labels inevitably suffers from the conflict between sparsity and accuracy, both of which lead to suboptimal models. To tackle this problem, we exploit the characteristics of the foggy image sequence of driving scenes to densify the confident pseudo labels. Specifically, based on the two discoveries of local spatial similarity and adjacent temporal correspondence of the sequential image data, we propose a novel Target-Domain driven pseudo label Diffusion (TDo-Dif) scheme. It employs superpixels and optical flows to identify the spatial similarity and temporal correspondence, respectively, and then diffuses the confident but sparse pseudo labels within a superpixel or a temporal corresponding pair linked by the flow. Moreover, to ensure the feature similarity of the diffused pixels, we introduce a local spatial similarity loss and a temporal contrastive loss in the model re-training stage. Experimental results show that our TDo-Dif scheme helps the adaptive model achieve 51.92% and 53.84% mean intersection-over-union (mIoU) on two publicly available natural foggy datasets (Foggy Zurich and Foggy Driving), which exceeds the state-of-the-art unsupervised domain adaptive semantic segmentation methods. The proposed method can also be applied to non-sequential images in the target domain by considering only spatial similarity.

Index Terms—Natural foggy scene, semantic segmentation, unsupervised domain adaptation, self-training, label diffusion.

Manuscript received September 12, 2021; revised April 1, 2022; accepted April 21, 2022. Date of publication May 9, 2022; date of current version May 18, 2022. This work was supported in part by the National Natural Science Foundation of China under Grant 61825103 and Grant 62171325; in part by the National Key Research and Development Project under Grant 2021YFC3320301; and in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 111-2634-F-007-002. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Abdesselam Salim Bouzerdoum. (Liang Liao and Wenyi Chen are co-first authors.) (Corresponding author: Jing Xiao.)

Liang Liao and Shin'ichi Satoh are with the Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo 101-8430, Japan (e-mail: liang@nii.ac.jp; satoh@nii.ac.jp).

Wenyi Chen, Jing Xiao, and Zheng Wang are with the School of Computer Science, Wuhan University, Wuhan 430072, China (e-mail: wenyichen@whu.edu.cn; jing@whu.edu.cn; wangzwhu@whu.edu.cn).

Chia-Wen Lin is with the Department of Electrical Engineering and the Institute of Communications Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: cwlin@ee.nthu.edu.tw).

Digital Object Identifier 10.1109/TIP.2022.3172208

I. INTRODUCTION

SEMANTIC segmentation refers to the task of assigning semantic labels to each pixel of an input image [1]–[3]. This task has been active in the field of computer vision for decades due to its broad range of downstream applications, such as autonomous driving [4], [5], medical image analysis [6], [7], and image restoration [8]. Recently, convolutional neural networks (CNNs) have achieved great success in semantic scene understanding and high accuracy on standard vision benchmarks [9]–[11]. However, most learning-based algorithms are developed under clear visibility and often encounter challenges in real-life scenarios, especially in outdoor applications under "bad" weather conditions [12]–[14]. In this paper, we focus on the understanding of foggy weather scenarios.

One of the issues that hinder the performance of learning-based algorithms under foggy weather is the difficulty in collecting and annotating real-world images of adverse weather. The reason is that manual annotation is hardly scalable to so many scenarios, and it is much harder to provide precise manual annotations due to poor visibility. An alternative way is to simulate real foggy scenarios using synthetic datasets, which render the available clear-weather images with physical models (e.g., the atmospheric scattering model) [15]–[17] or by generative adversarial network (GAN) based image-to-image translation [18]. Nevertheless, it is difficult to approximate fog in complex natural environments because existing physical models for fog synthesis tend to assume that the fog is homogeneous. In addition, the subtle differences in texture, color and illumination conditions between clear-weather and foggy-weather images pose a challenge for performing image translation. As a result, there remains a domain gap between the synthetic images and the real foggy images, resulting in a significant performance degradation when applying the model trained on synthetic data to real scenes.

Deep self-training approaches emerge as powerful solutions for unsupervised domain adaptation and have been commonly used in cross-domain semantic segmentation [19]–[24]. They typically adopt an iterative two-step pipeline: 1) predicting the pseudo labels of the target domain; and 2) re-training the segmentation network by using the source labels and the target pseudo labels. However, the pseudo labels predicted by models trained on the source domain (i.e., clear-weather images) usually suffer from inaccurate predictions for the target domain (i.e., foggy-weather images). Although some works set confidence thresholds to neglect the low-confidence predictions, this often leads to very sparse pseudo labels, which may further result in sub-optimal models [19], [25]. The problem of sparse labels is exacerbated in foggy scenarios, as both the scene and the weather conditions change.
Fig. 1. (a) The unsupervised self-training framework for domain adaptation using the proposed TDo-Dif scheme. The spatial and temporal pseudo label diffusion modules are proposed by exploiting the spatial similarity and temporal correspondences in the target domain data. In addition to the initial pseudo labels generated by knowledge transfer from the source domain, labels are diffused to the unknown pixels by exploiting the knowledge from the target domain. All pseudo labels are then used to re-train the segmentation model, helping the model better adapt to the distribution of the target domain. In (b) and (c), the initial pseudo labels are overlaid on top of the input foggy images, and the arrows indicate the directions of diffusion.

In this paper, we address domain adaptive semantic segmentation for foggy driving scenarios based on self-training from a novel perspective, i.e., by casting it as a pseudo label diffusion problem from the target domain. In the driving scenario, sequence data is collected from the forward-looking dash-board camera of a car (examples of an image sequence are shown in Fig. 1), and an object in a later image is closer to the camera than in the previous image. By exploiting the sequential data in the foggy target domain, we have made two discoveries. Discovery 1 (Spatial Similarity): As shown in Fig. 1(b), only a small portion of a category is assigned confident pseudo labels, but the texture of adjacent pixels within the category looks similar. Thus, we can diffuse the pseudo labels to their adjacent similar pixels. Discovery 2 (Temporal Correspondence): As shown in Fig. 1(c), when sequential data is available, the pseudo labels of an object are more confident in the near-view image than in the corresponding far-view image (the near-view image is captured later than the far-view image), probably due to higher resolution and lighter fog. Taking the Traffic Sign as an example, it is neglected in the pseudo labels of the far-view image, but labeled in the near-view image. Thus we can diffuse its pseudo labels from the near-view to the far-view image.

Based on these discoveries, we propose a self-training framework based on Target Domain driven pseudo label Diffusion (TDo-Dif), in which we first exploit the spatial similarity in terms of color and structures to propagate the sparsely labeled pixels to their similar neighbours. Since sequential data is available, we also exploit the temporal correspondence to propagate labels from near-view images to far-view images. In particular, we develop spatial and temporal pseudo label diffusion modules to propagate the initial sparse pseudo labels, and integrate them into the self-training framework (Fig. 1). In spatial diffusion, we employ a superpixel-based approach to discard the use of deep features from the source model, since the deep features are error-prone under domain gaps. We propagate pseudo labels in these small clusters containing confident pixels under the assumption that each superpixel tends to share the same semantic label. In temporal diffusion, we establish temporal correspondence between near-far image pairs and propagate reliable labels from the near-view image to the corresponding pixels in the far-view image. In this way, pseudo labels can be diffused by adaptively combining the initial predictions with spatially and temporally diffused labels.

To further boost the accuracy of the re-training stage using the diffused pseudo labels, we propose two spatial and temporal constraint losses to drive the proximity of deep features of the same class. The spatial loss measures local spatial similarity to encourage identical features between pixels within a superpixel, while the temporal loss encourages the features of corresponding pixels in a near-far image pair to be identical through a contrastive way. By imposing these two newly proposed losses, we are able to strengthen the pixel-level similarity in an unsupervised manner, encouraging the model to better identify pixel similarity in the target domain. Please note that if the images do not form a sequence in the target domain, we can independently implement the spatial label diffusion and loss for self-training.

In summary, the main contributions of this paper are:

1) We are the first to exploit spatial and temporal knowledge of the target domain for unsupervised domain adaptation, and propose a pseudo label diffusion method named TDo-Dif that propagates the highly confident pseudo labels to the unlabeled pixels by establishing spatial similarity and temporal correspondence.

2) We also propose two losses in the re-training stage to encourage the learned features to be similar within the same semantic category, which can effectively help the model to be better adapted to the target domain.

3) Experimental results demonstrate that TDo-Dif can effectively balance the density and accuracy of diffused pseudo labels, and outperforms the state-of-the-art methods on real foggy scene understanding.

In the rest of the paper, we present the related work in Section II and the proposed pseudo label diffusion scheme and the self-training framework in Section III. In Section IV,
we elaborate the experimental settings and extensive experimental results. Finally, conclusions are drawn in Section V.

II. RELATED WORK

In this section, we briefly review related work in each of the three sub-fields: semantic segmentation, semantic foggy scene understanding, and unsupervised domain adaptive semantic segmentation.

A. Semantic Segmentation

As a widely studied means of inferring the semantic content of images, semantic segmentation can predict labels at the pixel level. After typical works such as FCN [26] and SegNet [27] that employ CNNs to boost semantic segmentation performance by extracting deep semantic features, most of the top-performing methods are built on CNNs. Subsequent works attempt to aggregate scene context to improve the performance of semantic segmentation, for example, by assembling multi-scale visual clues or convolutional features of different sizes [28], [29], adopting atrous spatial pyramid pooling [30], employing neural attention to exchange context between paired pixels [31], [32], and iteratively optimizing the results using a Markov decision process [33]. For a comprehensive overview of semantic segmentation algorithms, we point the reader to Taghanaki et al. [7] or Minaee et al. [3].

Although impressive, most semantic segmentation models are trained with fully labeled and clear-weather datasets. Some general datasets include the PASCAL VOC 2012 Challenge [34] and the MS COCO Challenge [10] for visual object segmentation, and Cityscapes [11] for urban scene understanding. However, the performance of current vision algorithms, even the best performing ones, undergoes a severe degradation under adverse conditions, which are crucial for outdoor applications. Therefore, this work focuses on semantic foggy scene understanding, which is introduced below.

B. Semantic Foggy Scene Understanding

An early work on semantic foggy scene understanding is SFSU [15]. It builds a synthetic foggy dataset by rendering the Cityscapes dataset [11] with an atmospheric scattering model [35], and a follow-up semi-supervised learning approach based on RefineNet [36] is proposed. To improve the quality of synthetic fog, Hahner et al. [37] propose to utilize pure synthetic data with segmentation labels, which is unrestricted in dataset size and employs accurate depth maps for fog synthesis. Similarly, Sakaridis et al. [16] propose a curriculum model adaptation method called CMAda with two adaptation steps to learn from both synthetic and unlabeled real foggy images, and later extend this idea to accommodate multiple adaptation steps to gradually deal with light and dense fog [17]. Although these methods improve model adaptation by generating pseudo labels in an easy-to-hard manner, their direct use of the noisy predictions from models trained on the source domain as pseudo labels leads to degraded performance for re-training.

C. Unsupervised Domain Adaptive Semantic Segmentation

Unsupervised domain adaptation (UDA) has been extensively investigated in computer vision tasks [38]–[40]. The main idea of domain adaptation is to alleviate the gap between the source and target domains. For the semantic segmentation task, we expect to use domain adaptation to improve the segmentation performance on real images by models trained on synthetic images. To deal with this problem, adversarial training methods have received significant attention. These methods attempt to mitigate the domain gap at the image level, such as adversarial training based image-to-image translation [41], [42], or at the representation level, such as making features or network predictions indistinguishable between domains through a series of adversarial losses [43], [44]. The focus of such approaches is on learning and aligning the shared knowledge between the source and target domains, but domain-specific information is usually ignored.

Self-training based UDA methods tend to improve segmentation performance in a semi-supervised learning manner [19], [20]. These works iteratively train the model by using both labeled source domain data and the generated pseudo labels from the target domain to achieve alignment between the source and target domains. PyCDA [20] constructs a pyramidal curriculum to provide various properties about the target domain to guide the training. To deal with the imbalance of pseudo labels between easy and hard classes, the CBST [19] scheme is proposed, showing domain adaptation performance comparable to the adversarial training based methods. Later, CRST [21] is proposed to make self-training less sensitive to noisy pseudo labels by using soft pseudo labels and smoothing network predictions. It also integrates a variety of confidence regularizers into CBST. Furthermore, some works [45], [46] resort to self-training as the second stage of training to further boost their adaptive performance. However, the current self-training strategy relies heavily on the quality of the segmentation maps predicted by the pre-trained model, which is limited by the domain gap between the source and target domains.

III. PROPOSED METHOD

In this section, we describe the proposed pseudo label diffusion method for semantic foggy scene understanding. Below we first review the general self-training method and then provide an overview of our framework. Next, we dissect the individual modules and the associated objective functions for model re-training.

A. Formulation of Self-Training for UDA

Following the common self-training setting in UDA, there are two datasets: a labeled source-domain dataset X_S = {x_i^s}_{i=1}^{N_s} with its segmentation labels Y_S = {y_i^s}_{i=1}^{N_s}, and an unlabeled target-domain dataset X_T = {x_i^t}_{i=1}^{N_t} without access to its labels Y_T, where N_s and N_t indicate the total number of images in the source and target domains, respectively. y_i^s(p) ∈ {1, ..., C} is the label of pixel p of image x_i^s and C is the total number of classes. Unsupervised domain adaptive semantic segmentation assumes that the target domain
shares the same C classes with the source domain, and its aim is to train a segmentation network that transfers knowledge from the source dataset and achieves high performance on the target dataset.

Typically, a model trained on a source dataset with supervised labels can achieve satisfactory performance on test data from the same dataset, but it does not generalize well to the target data due to domain gaps. To transfer knowledge, self-training techniques use a pre-trained source model Φ to test on the target dataset and infer target pseudo labels, Ŷ_T = {ŷ_i^t}_{i=1}^{N_t}. Then they re-train and optimize the model Φ under the supervision of the source labels Y_S and the target pseudo labels Ŷ_T:

\min_{w} \mathcal{L}_{St} = \frac{1}{N_s}\sum_{x_s\in X_S}\mathcal{L}_{seg}(\Phi(x_s), y_s) + \alpha_t\,\frac{1}{N_t}\sum_{x_t\in X_T}\mathcal{L}_{seg}(\Phi(x_t), \hat{y}_t), \qquad (1)

where α_t is a hyper-parameter balancing the contributions of the two datasets, w is the model weights, and L_seg denotes the loss function for semantic segmentation.

Pseudo labels are typically generated from the model predictions by selecting the most confident labels. The original prediction ỹ_t on the target image x_t is generated by:

\tilde{y}_t = \arg\max_{c}\, p(c \mid x_t, \Phi), \qquad (2)

where p(c | x_t, Φ) is the probability of class c in the model prediction. Only the most confident predictions are kept as pseudo labels, which are finally obtained by:

\hat{y}_t = \begin{cases} \tilde{y}_t & \text{if } p(c \mid x_t, \Phi) > \lambda_c \\ 0 & \text{otherwise} \end{cases}, \qquad (3)

where λ_c denotes the confidence threshold for class c. This means that if the segmentation probability of class c is greater than λ_c, these pixels will be considered as confident regions (pseudo-labeled regions) and the rest will be ignored. λ_c = 0 means that the whole predicted segmentation map is used as pseudo labels. For each class c, we determine λ_c by the confidence value chosen from the most confident p percentage of the predictions of class c in the entire target set [19]. To ensure that the pseudo labels are mostly reliable, the hyper-parameter p is usually set to a very low value, i.e., 0.2.
Fig. 2. Illustration of the superpixel-based spatial diffusion, which is conducted in the “pseudo-label generation” step during each round of re-training.
Superpixels are first generated from the target foggy image, and then spatial diffusion is performed in a superpixel-wise order in the initial pseudo label
map ŷt . In Case 1, if there is only one category in a superpixel, we copy all labels of that category within the superpixel from the initial predicted segmentation
map ỹt to the diffused superpixel map ŷtsd . This strategy is in line with the assumption that superpixel tends to cluster pixels of same category together, but
also considers the possible clustering errors in superpixels. In Case 2, if there are more categories in a superpixel, we copy the labels of all existing categories
from ỹt to ŷtsd , considering that small objects tend to be mixed with large objects due to the set size of superpixels.
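To make the class-wise selection in Eqs. (2)–(3) concrete, the following is a minimal PyTorch-style sketch of generating the initial sparse pseudo labels. The function names, the ignore-label convention, and the use of an ignore index instead of 0 are illustrative assumptions rather than the authors' released code.

```python
import torch

def classwise_thresholds(prob_maps, num_classes, p=0.2):
    # prob_maps: list of softmax outputs, each of shape (C, H, W), over the target set.
    # For every class c, lambda_c is the confidence of the p-quantile most confident
    # pixel among pixels predicted as c (Eq. (3), following [19]).
    thresholds = torch.zeros(num_classes)
    for c in range(num_classes):
        confs = []
        for prob in prob_maps:
            conf, pred = prob.max(dim=0)           # Eq. (2): most likely class per pixel
            confs.append(conf[pred == c].flatten())
        confs = torch.cat(confs)
        if confs.numel() > 0:
            k = max(1, int(p * confs.numel()))     # keep the top-p fraction of class-c pixels
            thresholds[c] = torch.topk(confs, k).values.min()
    return thresholds

def initial_pseudo_labels(prob, thresholds, ignore_index=255):
    # prob: softmax output (C, H, W) for one target image.
    conf, pred = prob.max(dim=0)
    pseudo = pred.clone()
    pseudo[conf <= thresholds[pred]] = ignore_index   # drop low-confidence pixels
    return pseudo
```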

B. Framework Overview

In this paper, we attempt to exploit the properties of the target domain data and introduce target domain knowledge into the self-training framework for UDA, which has been rarely exploited in current methods. In the driving scene, the cameras are usually mounted at the front window of a vehicle, facing forward. Along the travel path, the vehicle takes a sequence of images. Thus, the objects along the road will appear in multiple images, and an object in a later image will be closer to the camera than in the previous images. In the foggy scene, the deep features of the pixels are changed by the fog, leading to very sparse and discrete pseudo labels. Therefore, we exploit the spatial and temporal relations between pixels in the target domain images to diffuse the pseudo labels to unknown pixels so as to densify the pseudo labels.

The framework is shown in Fig. 1. The self-training process iteratively alternates the pseudo-label generation and model re-training steps in each self-training round.

1) Step 1: Pseudo-Label Generation: The pseudo labels of the target domain are initialized by inferring segmentation labels using the pre-trained model and selecting highly confident labels from the predicted segmentation maps. Then, the initialized sparse pseudo labels are densified by label diffusion. We propose a novel label-diffusion method, i.e., TDo-Dif, to diffuse pseudo labels from spatial and temporal perspectives for improving the performance of self-training-based domain adaptive semantic segmentation. In spatial diffusion, we adopt a superpixel-based clustering method to divide each image into small clusters. Under the assumption that the pixels within a superpixel are similar, the reliable pseudo labels are expanded within a superpixel according to the presence of any reliable label in the cluster. In temporal diffusion, we propose to register temporally adjacent frames by optical flow. Based on the property of the foggy data and the established pixel-level correspondence, pseudo labels in the far-view images can be refined by their corresponding pixel labels in the near-view images.

2) Step 2: Model Re-Training: After the pseudo label diffusion, the diffused pseudo labels are used to re-train the semantic segmentation model. In this step, based on the spatial similarity and temporal correspondence, two new loss functions, a local spatial similarity loss and a contrastive loss, are designed to constrain the feature similarity of the target domain to assist the unsupervised domain adaptive semantic segmentation.

In the following, we address each part of our framework in detail.

C. Spatial Diffusion of Pseudo Labels

In this section, we attempt to increase the number of pseudo labels by using spatial similarity within the target image. Considering that the learned model is not optimized for the target foggy scene, the discrimination of deep features cannot precisely reveal the semantic similarities in the foggy images. Therefore, we employ a superpixel-based clustering method to exploit the local spatial similarity present in target images.

1) Spatial Clustering by Superpixels: Superpixels [47] are defined by clustering a set of pixels with similar visual properties, e.g., pixel intensity, and have been used as pre-processing for many applications such as image stitching, image colorization and view synthesis [48]–[50]. In this paper, we utilize the visual similarity property within a superpixel to extend reliable pseudo labels. Specifically, we adopt the well-known SLIC algorithm [51], which generates superpixels by a k-means clustering strategy. SLIC measures the color and spatial difference between each pixel and its superpixel center, defined as:

D(p_{sp_i}, sp_i^{ctr}) = \sqrt{\left(\frac{d_c}{M_c}\right)^2 + \left(\frac{d_s}{M_s}\right)^2}, \qquad (4)

where d_c and d_s denote the color distance and spatial distance between pixel p_{sp_i} and the superpixel center sp_i^{ctr} of the i-th superpixel sp_i, respectively. M_c is the maximum color distance, which is usually set as a constant. M_s = \sqrt{M/K}, where M denotes the total number of pixels of the image and K is a hyper-parameter indicating the number of superpixels, which is the only parameter we need to set.

After the spatial clustering, the target image x_t is divided into a superpixel set SP = {sp_i}_{i=1}^{S}, where S is the total number of superpixels.

2) Superpixel-Based Spatial Label Diffusion: The spatial label diffusion process starts with the initial pseudo labels, which are collected from the highly confident labels of the original predicted segmentation map ỹ_t. The superpixels containing initial pseudo labels in the set SP are picked out for further spatial label diffusion. We then densify the labels in each picked superpixel by diffusing the labels from the original predicted segmentation map, under the assumption that visually clustered pixels within a superpixel have a high probability of belonging to the same class.

The diffusion process for each superpixel sp_i is presented in Fig. 2, and can be formulated as follows:

\hat{y}^{sd}_{sp_i}(p) = \begin{cases} \tilde{y}_{sp_i}(p) & \text{if } \exists\, q:\ \hat{y}_{sp_i}(q) = \tilde{y}_{sp_i}(p) \\ 0 & \text{otherwise} \end{cases} \qquad (5)

where ỹ_{sp_i} and ŷ_{sp_i} are the predicted labels and the initial pseudo labels of superpixel sp_i, and p and q denote pixels within sp_i of these maps, respectively. All the pixels in the target superpixel sp_i are traversed to check whether the label of a pixel p in ỹ_{sp_i} is the same as the label of any pixel q in the initial pseudo labels ŷ_{sp_i}. If yes, we copy the label of ỹ_{sp_i}(p) to ŷ^{sd}_{sp_i}(p) so as to diffuse the initial pseudo label on q to p in ŷ^{sd}_{sp_i}. The densified pseudo labels of all superpixels ŷ^{sd}_{sp_i} form the diffused pseudo label map ŷ^{sd}.

Considering that there could still be some clustering errors in a superpixel, especially in the case of small objects, we do not directly diffuse the class of high-confidence pixels to the entire superpixel, but only pass the same class present in the predicted segmentation map ỹ_{sp_i} to the pseudo labels ŷ^{sd}_{sp_i} (Fig. 2, Case 1). In case there exist pseudo labels for multiple classes in one superpixel, we diffuse all existing classes. This is especially designed to handle small objects, which are usually smaller than the size of a superpixel (Fig. 2, Case 2).
Fig. 3. Illustration of the optical flow-based temporal diffusion. The near-view image is taken as the reference image to propagate its warped pseudo labels to optimize the pseudo labels of the target image, under the assumption that the near-view image is less affected by fog. (a) shows the overall process of the temporal diffusion, and (b) shows the label diffusion on each pixel.
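Before turning to temporal diffusion, here is a minimal sketch of the superpixel-based spatial diffusion of Section III-C (Eqs. (4)–(5)), using scikit-image's SLIC implementation. The function name, array layouts, and the ignore-label convention are illustrative assumptions, not the authors' code.

```python
import numpy as np
from skimage.segmentation import slic

def spatial_diffusion(image, pred, pseudo, n_segments=500, ignore_index=255):
    # image: HxWx3 RGB array; pred: HxW argmax prediction (y_tilde);
    # pseudo: HxW initial sparse pseudo labels (y_hat), ignore_index where unlabeled.
    segments = slic(image, n_segments=n_segments, start_label=0)  # SLIC clustering (Eq. (4))
    diffused = np.full_like(pseudo, ignore_index)
    for sp in np.unique(segments):
        mask = segments == sp
        seeds = pseudo[mask]
        seed_classes = np.unique(seeds[seeds != ignore_index])
        if seed_classes.size == 0:
            continue  # no confident seed in this superpixel, nothing to diffuse
        # Eq. (5): copy a predicted label only if the same class already has a confident
        # seed inside this superpixel (covers both Case 1 and Case 2 of Fig. 2).
        local_pred = pred[mask]
        local_out = np.full(local_pred.shape, ignore_index, dtype=pseudo.dtype)
        for c in seed_classes:
            local_out[local_pred == c] = c
        diffused[mask] = local_out
    # keep the original confident labels where they already exist
    diffused[pseudo != ignore_index] = pseudo[pseudo != ignore_index]
    return diffused
```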

D. Temporal Diffusion of Pseudo Labels

We exploit the temporal correspondence between neighboring images to densify pseudo labels. It is motivated by the assumption that the closer an object is to the camera, the higher its resolution and the less it is affected by fog, thus resulting in a more reliable prediction. On this basis, we diffuse the labels of an object from a closer view to a more distant view.

1) Temporal Correspondence by Optical Flow: We take a near-far view pair for the temporal diffusion, where the far-view image is the target image, denoted as x_t, and the close-view image is the diffusion reference, denoted as x_t^r. The first step of temporal diffusion is building the temporal correspondence between the image pair. We employ one of the latest optical flow estimation methods, PDC-Net [52], to estimate the dense flow field together with a coupled pixel-wise confidence map that indicates the reliability and accuracy of the flow field prediction. The optical flow estimation process can be simplified to the following form:

F, P_F = \phi(x_t^r, x_t), \qquad (6)

M_F(p) = \begin{cases} 1 & \text{if } P_F(p) > T \\ 0 & \text{otherwise} \end{cases}, \qquad (7)

where F = (u, v) ∈ R^2 is the predicted flow vector from x_t^r to x_t. P_F and M_F are the confidence map and the confident binary map of the true flow, respectively, and p is the pixel index. The threshold T is set to 0.5, the same as in [52].

2) Flow-Based Temporal Label Diffusion: The illustration of flow-based temporal diffusion is shown in Fig. 3. After generating the temporal correspondence, we warp the pseudo labels ŷ_t^r and the predicted segmentation probability map p_t^r(x_t^r, Φ) of x_t^r under the guidance of the estimated flow F and its confident binary map M_F:

\hat{y}^r_{warp} = M_F \odot warp(\hat{y}^r_t, F), \qquad (8)

p^r_{warp}(x_t^r, \Phi) = M_F \odot warp(p^r_t(x_t^r, \Phi), F), \qquad (9)

where warp(·) denotes a warping function and ⊙ denotes element-wise multiplication. M_F acts as a binary mask.

After aligning the pseudo labels of the reference image to the target image, we handle the temporal diffusion according to the following situations:

1) If the warped pseudo label in ŷ^r_{warp} falls on a pixel p that already has a label in ŷ_t, we update the pseudo label for ŷ_t(p) by fusing the soft segmentation predictions and re-selecting the class label with the highest confidence score:

\hat{y}^{td}_t = \arg\max_{c}\big(p(c \mid x_t, \Phi) + p^r_{warp}(c \mid x_t^r, \Phi)\big). \qquad (10)

2) If the warped pseudo label in ŷ^r_{warp} falls on a pixel p that has no specified pseudo label in ŷ_t, we copy the label from ŷ^r_{warp}(p) to ŷ_t(p), which increases the number of pseudo labels:

\hat{y}^{td}_t = \hat{y}^r_{warp}. \qquad (11)

Algorithm 1: Training and Testing Step of TDo-Dif (Temporal Diffusion Followed by Spatial Diffusion).
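A minimal sketch of the flow-based temporal diffusion in Eqs. (6)–(11) is given below. The flow and confidence map are assumed to come from an external estimator such as PDC-Net, the warping uses nearest-neighbour forward mapping so that label indices stay valid, and all names are illustrative assumptions.

```python
import numpy as np

def warp_forward(src, flow, conf_mask, fill):
    # Forward-warp src from the near-view (reference) frame to the target frame along
    # flow (HxWx2), keeping only pixels where the flow is confident (M_F in Eq. (7)).
    h, w = conf_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    ok = conf_mask > 0
    out = np.full_like(src, fill)
    if src.ndim == 2:                      # HxW label map
        out[yt[ok], xt[ok]] = src[ok]
    else:                                  # CxHxW probability map
        out[:, yt[ok], xt[ok]] = src[:, ok]
    return out

def temporal_diffusion(pseudo_t, prob_t, pseudo_r, prob_r, flow, conf_mask, ignore_index=255):
    # Eqs. (8)-(9): warp the reference pseudo labels and probabilities onto the target frame.
    warped_label = warp_forward(pseudo_r, flow, conf_mask, fill=ignore_index)
    warped_prob = warp_forward(prob_r, flow, conf_mask, fill=0)
    out = pseudo_t.copy()
    has_warp = warped_label != ignore_index
    overlap = has_warp & (pseudo_t != ignore_index)
    new_only = has_warp & (pseudo_t == ignore_index)
    fused = (prob_t + warped_prob).argmax(axis=0)   # Eq. (10): fuse soft predictions
    out[overlap] = fused[overlap]
    out[new_only] = warped_label[new_only]          # Eq. (11): copy newly covered labels
    return out
```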

E. Re-Training Strategy

After obtaining the diffused pseudo labels, we re-train the segmentation model to adapt it to the target domain. To further utilize the knowledge of the target domain, we propose two new losses for the model re-training.

We first propose a local spatial similarity loss based on the assumption that a superpixel should belong to a single object. In particular, we encourage the distances between the pixel features in each superpixel to be small. Let f(x_t) represent the deep feature map of image x_t, and let sp_i↓ be the downsampled version of the superpixel with the same spatial resolution as f(x_t); sp_i↓ is used to indicate the corresponding locations of the feature map for the superpixel sp_i. We define the feature distances in a superpixel as follows:

D_{sp_i\downarrow} = \frac{1}{|sp_i\downarrow|}\sum_{p\in sp_i\downarrow} D_c\big(f(p), \eta_{sp_i\downarrow}\big) = \frac{1}{|sp_i\downarrow|}\sum_{p\in sp_i\downarrow}\left\langle \frac{f(p)}{\|f(p)\|}, \frac{\eta_{sp_i\downarrow}}{\|\eta_{sp_i\downarrow}\|}\right\rangle, \qquad (12)

\eta_{sp_i\downarrow} = \frac{1}{|sp_i\downarrow|}\sum_{p\in sp_i\downarrow} f(p), \qquad (13)

where f(p) is the deep feature at pixel p, and η_{sp_i↓} is the feature centroid of sp_i↓. We adopt D_c, the cosine similarity between features, to measure the pixel-wise similarity, so that D_{sp_i↓} is the average feature distance. Then we define the local spatial similarity loss as:

\mathcal{L}_{Spa} = \frac{1}{S}\sum_{i=1}^{S} D_{sp_i\downarrow}. \qquad (14)

Temporally, we try to maximize the similarity of associated pixels via a contrastive loss. The objective is to distinguish between the tight correspondence (different views of the same pixel) and incompatible ones (different views of different pixels). For each pixel p which has found a correspondence p', we construct a set of random negative pixels N. Then the temporal loss function for pixel p and N can be written as:

\mathcal{L}_{Tem} = -\frac{1}{\|P\|}\sum_{p\in P}\log\frac{e^{D_c(f(p), f(p'))}}{e^{D_c(f(p), f(p'))} + \sum_{n\in N} e^{D_c(f(p), f(n))}}, \qquad (15)

where P is the set of positive pixels.

Therefore, the overall training loss for model re-training is defined as the weighted sum of the segmentation loss L_St, the spatial loss L_Spa and the temporal loss L_Tem:

\mathcal{L}_{Final} = \mathcal{L}_{St} + \alpha_{spa}\mathcal{L}_{Spa} + \alpha_{tem}\mathcal{L}_{Tem}, \qquad (16)

where α_spa and α_tem are the weights for the local spatial similarity loss and the temporal contrastive loss, respectively. The details of the overall training and testing procedure can be seen in Algorithm 1.

IV. EXPERIMENTAL RESULTS

A. Experimental Settings

1) Datasets: We validate our method on real foggy datasets, Foggy Zurich [17] and Foggy Driving [15], as the target domain datasets. The source domain for the pre-training is the synthetic dataset Foggy Cityscapes [15], which is derived from Cityscapes [11] with fine segmentation labels.

Foggy Cityscapes [15]. It is synthesized with the atmospheric scattering model [35] for fog simulation based on Cityscapes. Three distinct fog densities are generated by controlling a constant simulated attenuation coefficient β ∈ {0.005, 0.01, 0.02}, where a higher β corresponds to denser fog. Each version shares the same semantic labels with Cityscapes, which contains 5000 images (2048 × 1024) belonging to 20 categories, with 2975 images for training, 500 for validation, and 1525 for testing.

Foggy Zurich [17]. It is a realistic foggy scene dataset consisting of 3808 images with a resolution of 1920 × 1080 collected in the city of Zurich and its suburbs. It was recorded as four large video sequences, and sequential images were extracted at a rate of one frame per second. It provides pixel-level annotations for 40 images with dense fog. We use the 40 annotated images for testing and the rest without annotations for training.

Foggy Driving [15]. It consists of 101 color images depicting real-world foggy driving scenes. 51 of these images were taken with a cell phone camera under foggy conditions in different areas of Zurich, and the remaining 50 images were collected from the web. Image resolutions in the dataset range from 500 × 339 to 1280 × 960.

2) Baseline Methods: We compare our method with the state-of-the-art foggy scene understanding methods and domain adaptive segmentation methods. Among them, SFSU [15] is a supervised learning-based approach to foggy scene understanding that trains the segmentation model with the labeled source dataset Foggy Cityscapes. CMAda [17] re-trains the segmentation model with the labeled source dataset Foggy Cityscapes and the unlabeled real foggy weather dataset Foggy Zurich, gradually adapting the model from light fog to dense fog in multiple steps. AdSegNet [53] is a representative adversarial learning strategy for domain adaptive semantic segmentation. CBST [19] and CRST [25] are two representative self-training strategies for domain adaptive semantic segmentation. Recent methods CuDA-Net [54], FIFO [55] and CMDIT [56] focus on bridging the domain gap between clear images and foggy images to improve the performance of foggy scene segmentation. FogAdapt+ [57] combines scale-invariance and uncertainty to minimize the domain shift in foggy scene segmentation.

We directly use the released models of SFSU1 and CMAda2 and re-train the segmentation models of the three domain adaptive training strategies, including AdSegNet, CBST and CRST, on our dataset. We collect the results of CuDA-Net from its authors and the results of the last three methods from their papers.

1 http://people.ee.ethz.ch/~csakarid/SFSU_synthetic/
2 http://people.ee.ethz.ch/~csakarid/Model_adaptation_SFSU_dense/

B. Implementation Details

Similar to SFSU and CMAda, we adopt RefineNet [28] with ResNet-101 as the backbone in all experiments and initialize it with the weights pre-trained on Foggy Cityscapes. We implement our method using the PyTorch toolbox and optimize it using the Adam algorithm with β1 = 0.5, β2 = 0.999, and a learning rate of 0.0001 following [15]. In all experiments, we use a batch size of 2 and set the self-training round number to 4 and the training epochs in each round to 10. The loss weights αt, αspa, and αtem are set to 1, 0.1 and 5, respectively. Similar to CBST and CRST, we set the hyper-parameter p of the confident portion to 0.2 so as to pick the top 20% of high-confidence predictions as pseudo labels.

TABLE I
QUANTITATIVE COMPARISON ON FOGGY ZURICH DATASET. NOTICE THAT WE EXCLUDE THE STATISTICS OF CLASS Train BECAUSE IT IS NOT INCLUDED IN THE TEST SET, AND WE CALCULATE CLASS Truck ALTHOUGH IT CANNOT BE DETECTED BY ANY METHOD. THE NUMBERS IN RED AND BLUE REPRESENT THE BEST AND SECOND BEST SCORES

TABLE II
Q UANTITATIVE C OMPARISON ON F OGGY D RIVING DATASET . T HE N UMBERS IN R ED AND B LUE R EPRESENT THE B EST AND S ECOND B EST S CORES
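As a recap of the re-training objective of Section III-E, the sketch below shows how the local spatial similarity loss (Eqs. (12)–(14)) and the temporal contrastive loss (Eq. (15)) could be computed in PyTorch with the weights listed in Section IV-B. It is a simplified illustration under assumed tensor layouts, not the authors' implementation; the per-superpixel distance is taken as one minus the cosine similarity so that minimizing the loss pulls features within a superpixel together.

```python
import torch
import torch.nn.functional as F

def spatial_similarity_loss(features, superpixels):
    # features: (C, h, w) feature map f(x_t); superpixels: (h, w) downsampled SLIC ids.
    c = features.shape[0]
    feats = features.reshape(c, -1).t()                    # (h*w, C)
    ids = superpixels.reshape(-1)
    loss, count = features.new_zeros(()), 0
    for sp in ids.unique():
        sp_feats = feats[ids == sp]
        centroid = sp_feats.mean(dim=0, keepdim=True)      # Eq. (13)
        dist = 1.0 - F.cosine_similarity(sp_feats, centroid)  # reading of D_c in Eq. (12)
        loss = loss + dist.mean()
        count += 1
    return loss / max(count, 1)                            # Eq. (14)

def temporal_contrastive_loss(feat_t, feat_r, pos_t, pos_r, negatives):
    # feat_t, feat_r: (C, h, w) features of the target and reference frames.
    # pos_t, pos_r: (P, 2) coordinates of corresponding (positive) pixel pairs.
    # negatives: (P, K, 2) coordinates of random negative pixels in the reference frame.
    f_p = feat_t[:, pos_t[:, 0], pos_t[:, 1]].t()                  # (P, C)
    f_pp = feat_r[:, pos_r[:, 0], pos_r[:, 1]].t()                 # (P, C)
    pos_sim = F.cosine_similarity(f_p, f_pp).unsqueeze(1)          # (P, 1)
    neg_sims = []
    for k in range(negatives.shape[1]):
        f_n = feat_r[:, negatives[:, k, 0], negatives[:, k, 1]].t()
        neg_sims.append(F.cosine_similarity(f_p, f_n).unsqueeze(1))
    logits = torch.cat([pos_sim] + neg_sims, dim=1)                # (P, 1+K)
    target = torch.zeros(len(logits), dtype=torch.long, device=logits.device)
    # Eq. (15): softmax cross-entropy with the positive pair as the correct class.
    return F.cross_entropy(logits, target)

# L_final = L_st + 0.1 * spatial_similarity_loss(...) + 5 * temporal_contrastive_loss(...)  (Eq. (16))
```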

We set the number of superpixels K in each image to 500. The threshold T for reliable dense matching is set to 0.5, the same as in [52]. The number of positive samples used for computing the temporal contrastive loss is set to 20, which are randomly selected from pixels with temporal correspondence. The number of negative samples for each positive pixel is set to 1, chosen from pixels of different predicted categories.

C. Comparisons With State-of-the-Art Methods

In this section, we compare the proposed method with the baselines on the semantic segmentation of foggy scenes, from both quantitative and qualitative aspects.

1) Quantitative Comparisons: We first focus on the testing of the proposed four components based on two core motivations: superpixel-based spatial label diffusion (SD) and local spatial similarity loss (SL) for spatial similarity, and flow-based temporal label diffusion (TD) and temporal contrastive loss (TL) for temporal correspondence. In general, we design two groups of model variants to sort out the most optimal combination of the proposed diffusion schemes, and the details of the variants are listed in Table III.

TABLE III
SETTINGS OF DIFFERENT VARIANTS

The comparison results with the baselines on Foggy Zurich and Foggy Driving are shown in Table I and Table II, respectively.

2) Comparison With SOTA: In general, the best variant of TDo-Dif with label diffusion outperforms all baselines on Foggy Zurich and Foggy Driving, owing to the introduction of target-domain knowledge for domain adaptation. On Foggy Zurich, the best variant of TDo-Dif reaches 51.92% mIoU, surpassing both SFSU [15] and the curriculum learning model CMAda [17], which are specifically designed for foggy scene understanding, with significant gains of 18.51% and 7.65%, respectively. It indicates that while the curriculum learning strategy is effective in adapting the model to foggy data, the large number of false pseudo labels in the whole segmentation predictions leads to a degradation in performance.
Fig. 4. Subjective quality comparison of test results on image samples from Foggy Zurich [17].

Although CBST [19] and CRST [25] adopt a small portion of pseudo labels for model re-training to avoid false labels, our method obtains additional gains of 6.13% and 5.45% mIoU, respectively, with a significant positive impact on some large objects (e.g., Building, Fence and Sky) and small objects (e.g., Rider, Motor and Bike). In contrast to methods that attempt to bridge the gap between the source and target domains, i.e., CuDA-Net, FIFO, CMDIT, and FogAdapt+, our proposed TDo-Dif explicitly exploits the self-similarity present in the target data to increase the correct labels for these classes and therefore achieves the best performance on foggy scene segmentation.

3) Comparison Among Variants of TDo-Dif: By comparing the performance among variants, we evaluate the effectiveness of each component on semantic segmentation performance. The variants in the first group show that all components have a positive impact on the understanding of foggy scenes. Comparing TDo-Dif (SD + SL) with TDo-Dif (TD + TL), spatial similarity shows a higher performance improvement than temporal correspondence, with a gain of 2.87% mIoU. This can be attributed to the fact that the spatially diffused area is much larger than the temporal one. The local spatial similarity loss and temporal contrastive loss also contribute an additional 0.25% and 0.88% mIoU, respectively, through comparing TDo-Dif (SD + SL) with TDo-Dif (SD) and TDo-Dif (TD + TL) with TDo-Dif (TD).

Fig. 5. Subjective quality comparison of test results on image samples from Foggy Driving [15].

Fig. 6. Confusion matrices for CMAda (left) and Ours (TDo-Dif) (right) for the semantic segmentation on Foggy Zurich.
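The scores above are mean intersection-over-union values, and Fig. 6 compares class confusion matrices. For reference, a small sketch of how per-class IoU, mIoU and a confusion matrix can be computed from predicted and ground-truth label maps is given below; this is standard evaluation code assumed for illustration, not taken from the paper.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    # pred, gt: flat integer label arrays of equal length.
    valid = gt != ignore_index
    idx = gt[valid] * num_classes + pred[valid]
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_from_confusion(conf):
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = np.where(union > 0, tp / np.maximum(union, 1), np.nan)
    return np.nanmean(iou), iou
```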

Fig. 7. Statistical comparison of the effect of diffused pseudo label quality on segmentation performance. The bar and blue line show the average percentage of pixels and the average accuracy of pseudo labels at each diffused stage. The red line shows the segmentation performance of the model trained with the corresponding diffused pseudo labels.

We also investigate the impact of different spatial and temporal diffusion orders through the variants of the second group. The two strategies are comparable in terms of segmentation performance and selection of pseudo labels as well as the mIoU of pseudo labels.

4) Model Generalization: Since the adaptive models with both spatial and temporal diffusion achieve the best performance on Foggy Zurich, we test them on the Foggy Driving dataset. As shown in Table II, our model achieves 50.65% mIoU, which exceeds most state-of-the-art methods and demonstrates the good generality of our model.

Moreover, since TDo-Dif (TD→SD + SL + TL) reaches the highest scores on both datasets, we use this model as our final model and denote it as TDo-Dif in the following experiments.

5) Qualitative Comparisons: Fig. 4 and Fig. 5 give some examples of qualitative comparison on Foggy Zurich and Foggy Driving, respectively. Trained on a synthetic foggy dataset, SFSU [16] has limited generalization ability on real-world foggy scenarios, e.g., Sky and Building are incorrectly predicted as Road and Vegetation. The performance of CMAda [17] is relatively good thanks to re-training on the target domain data, but its results also contain many segmentation errors, such as for the categories Sky, Wall and Sidewalk. CRST [25], trained with sparse and confident pseudo labels, presents better results than CMAda in terms of semantic mis-recognition. CuDA-Net achieves excellent performance on some small objects, such as the Pole and Traffic Sign, but its performance on large objects still needs to be improved, especially Building and Road. Benefiting from the spatial and temporal pseudo label diffusion, the proposed method achieves the best results, especially for large objects, e.g., Sky and Vegetation. According to the underlying characteristics of fog, Sky is most affected by fog as it lies at the farthest distance from the camera, which explains the poor performance of all baselines on this class, but our method performs well on it owing to the label diffusion.

6) Confusion Between Classes: In this part, we provide more informative analysis by comparing the confusion matrices of the baseline and ours in Fig. 6. Considering the space limitation, we only show the comparison between CMAda [17] (the one with the best performance among all baselines) and ours. An obvious observation from the confusion matrices of CMAda is that most categories tend to be misclassified into Building, especially the classes of Truck, Fence, Person, and Bus, which is consistent with the visualization in Fig. 4.

Fig. 8. Comparison of subjective results for various pseudo labels on image samples from Cityscapes. The black solid line and red dashed line represent the pixel number distribution of every category, and the blue line represents the value of mIoU of every pixel.

This may be due to the fact that noisy pseudo labels from the source model are used to re-train the model for the foggy scenes. This issue is greatly alleviated by our method, by which most of the classes are accurately classified with a much smaller probability of being assigned to other classes. Some classes such as Bike, Truck, and Person are still difficult to classify correctly, probably due to their small share of pixels in the overall dataset.

D. Ablation Study

1) Quality of Diffused Labels and Its Impact on Segmentation: The success of self-training-based domain adaptation relies heavily on the quality of pseudo labels (i.e., the percentage of valid pixels and the accuracy of pseudo labels). In this section, we assess the quality of diffused labels and how it affects the final performance of semantic segmentation.

a) Assessment on overall statistics: The statistics for the quality assessment of pseudo labels are calculated from all the 40 test images of Foggy Zurich. As shown in Fig. 7, we adopt three pseudo-label generation strategies: 1) initial sparse labels without diffusion, 2) diffused labels from four variants of our method, and 3) the entire predicted label maps with noise. All variants in this evaluation are used with their corresponding losses, but the loss notations are omitted in the figure for brevity.

The quality of the pseudo labels is measured by two terms: the average percentage of pseudo labels in the image (blue bar) and the mIoU of pseudo labels (blue line). In general, temporal diffusion (TD) results in a small increase in the number of labeled pixels (percentage increases by 0.5%-1.2%) and an improvement in label accuracy (mIoU increases by 1.1%-1.5%), while spatial diffusion (SD) results in a large growth of the pseudo labels (percentage increases by 25%-27%) but a drop in label accuracy (mIoU decreases by 12%-13%). The original predicted labels cover the entire image, but their accuracy is nearly half that of the selected pseudo labels, namely more than half of the labels are wrongly predicted. Note that the percentage of initial pseudo labels in the test set (i.e., 0.109), which are captured in dense fog, is typically much lower than 0.2 since their confidence scores are much lower than those of light foggy images.

The impact of pseudo label quality on the final semantic segmentation is measured by the mIoU on the test data (red line). The accuracy of the final segmentation is the result of the combined effect of the number and accuracy of pseudo labels. As shown in the figure, within the proposed label diffusion strategies, the final performance improves with the percentage increment of labels, but the performance gains are slowed down by the accuracy drop. The extremely noisy labels from the original prediction lead to a dramatic decrease in segmentation accuracy. This proves that although our label diffusion strategy may introduce some noise into the labels, the positive effects from the newly added correct labels play a dominant role in the final performance.

b) Assessment on sample images: Two samples demonstrating the detailed assessment of the quality of pseudo label diffusion are shown in Fig. 8, whose statistical patterns are quite consistent with the overall statistics. The initial pseudo labels are quite accurate (the number distributions of ground-truth and pseudo labels almost overlap in the distribution diagram), but they are quite sparse. The temporal diffusion (TD) brings a slight increment in the valid pixel labels and maintains a high accuracy, bringing additional reliable labels for further amplification by spatial diffusion. Compared to the initial labels, the spatially diffused (SD) labels are much denser.
Fig. 10. Statistical comparisons of the number of superpixels.

TABLE IV
QUANTITATIVE COMPARISONS OF THE RANGE OF CONFIDENT FLOWS
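The coverage and accuracy numbers reported in this ablation (Figs. 7 and 8, Table IV) boil down to two statistics per pseudo-label map; a small assumed sketch of computing them against the ground truth is given below, not the authors' evaluation script.

```python
import numpy as np

def pseudo_label_stats(pseudo, gt, num_classes, ignore_index=255):
    # Percentage of pixels that received a pseudo label, and mIoU of the
    # pseudo labels against the ground truth over the labeled pixels only.
    labeled = pseudo != ignore_index
    coverage = float(labeled.mean())
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pseudo == c, gt == c).sum()
        union = np.logical_or(np.logical_and(labeled, gt == c), pseudo == c).sum()
        if union > 0:
            ious.append(inter / union)
    miou = float(np.mean(ious)) if ious else 0.0
    return coverage, miou
```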

Fig. 9. Visual comparisons on spatial diffused pseudo labels between superpixels and deep feature.

The number of pixels per class and their accuracy shown in the distribution diagram indicate that the valid labels are largely expanded, while the accuracy is still kept to some extent. The distribution of pixel counts in the pseudo labels is almost identical to that in the ground-truth labels. For the entire original prediction, the accuracy is too low to be used for re-training, although the number of valid labels is large.

2) Superpixel Versus Deep Feature for Spatial Diffusion: The superpixel-based spatial diffusion is performed under the assumption that local spatial similarity in the target domain can be measured more reliably by superpixels than by the deep features extracted by segmentation models. Here we attempt to justify this assumption by visually comparing the pseudo labels in Fig. 9. The original predictions represent the semantic clustering from the deep features. In order to compare with the result from superpixels, we select 50% of the best predictions as the result from deep features. We can notice that there are obvious incorrect labels in the deep-feature result, which are largely avoided by the superpixel-based spatial diffusion. Another advantage of adopting superpixels is that the extent of diffusion is adaptively adjusted according to the difficulty of the scene. For example, the diffused labels are fewer for the heavy fog in the first column than for the light fog in the third column, whereas a fixed selection ratio from deep features will inevitably introduce wrongly predicted labels.

3) Impact of the Number of Superpixels: The success of spatial label diffusion largely relies on the clustering of superpixels. Here we conduct experiments to validate the impact of the hyper-parameter K, the number of superpixels, on pseudo label diffusion. With experiments ranging from 100 to 800, we plot the normalized distribution of the accuracy of pseudo labels, the accuracy of superpixels (i.e., the percentage of superpixels with a unique segmentation label), and the computational cost of spatial diffusion and spatial loss in Fig. 10. The plot shows that the proportion of good-fidelity labels and superpixels generally increases with the number of superpixels, while the computational complexity for diffusing labels and calculating losses also increases. In our experiments, we adopt 500 as the number of superpixels per image, considering that the increment of accuracy approaches saturation at 500, while the computational cost keeps growing linearly over the full range of K.

4) Impact of the Range of Confident Flows: For a sanity check, we study the impact of the threshold of confident flows by changing T from 0.1 to 0.8. The percentages of pixels and the mIoU of pseudo labels are shown in Table IV, which shows that the increment of valid pixels and the degradation of mIoU are not significant because the initial pseudo labels are very sparse. In our experiment, we adopt the threshold T of 0.5, the same as in [52].

Fig. 11. Effectiveness of the number of target domain samples. The total number of target domain samples is 3808.
Fig. 12. Visualization of embedded features via t-SNE from two image samples of Foggy Zurich test set. The features (from left to right) are extracted
from the pre-trained model on source domain and the adaptive models after each self-training round. Features are colored according to class labels.
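Fig. 12 visualizes per-pixel features with t-SNE [58]. A minimal sketch of producing such an embedding with scikit-learn is shown below; the pixel sampling and the upstream feature extraction are assumptions for illustration only.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_features(features, labels, max_points=2000, seed=0):
    # features: (N, C) per-pixel deep features; labels: (N,) class ids used for coloring.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(features), size=min(max_points, len(features)), replace=False)
    emb = TSNE(n_components=2, init="pca", random_state=seed).fit_transform(features[idx])
    return emb, labels[idx]   # scatter-plot emb colored by the returned labels
```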

randomly selected target sample numbers, which are 0%, 20%, confidence of the diffused labels. In addition, two new losses
40%, 60%, 80%, 100% of all target domain samples. The are proposed to restrain the spatial and temporal consistency
results in Fig. 11 shows that increasing the target domain of the features in the re-training stage to help the model better
samples will highly improve the segmentation performance. adapt to the characteristics of the target domain. Experiments
With only 761 sample images from target domain (20% of on real foggy images show that the proposed TDo-Dif method
the total number), the performance in mIoU increases from significantly densify the pseudo labels and further boost the
40.02% to 49.00%, and it continues to increase with the performance of semantic segmentation beyond the state-of-
number increment of target domain samples. But when the the-art counterparts.
number of samples exceeds 60% (around 2284 target domain
samples), the performance almost reaches the saturation and
R EFERENCES
fluctuates between 51.50% to 52.00%.
6) Feature Visualization: We use t-SNE [58] to visualize the feature representations along the domain adaptation process in Fig. 12. Since we re-train the model for four rounds, we show the features at five stages, i.e., the model pre-trained on the source domain and the four adaptive models after each self-training round. It can be observed that the classes containing a large number of pixels, such as Road, Sky, and Building, are not well clustered by the pre-trained model. Since our method selects the most reliable pseudo labels, the feature distributions are gradually refined as the number of self-training rounds increases. The feature representation becomes much clearer after round 4, revealing that our TDo-Dif method can provide correct supervision labels for the target domain data.
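Such a visualization can be reproduced, under mild assumptions, with scikit-learn's t-SNE: per-pixel features are subsampled class by class, projected to two dimensions, and colored by their class labels. The snippet below is only a sketch of this procedure; features (an N x D array of pixel embeddings) and labels (the corresponding N class indices) are hypothetical inputs, not artifacts released with the paper.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, samples_per_class=500, seed=0):
    """Project per-pixel features to 2-D with t-SNE and color them by class (sketch)."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # Subsample each class so that large classes such as Road or Sky do not dominate.
        keep.append(rng.choice(idx, size=min(samples_per_class, idx.size), replace=False))
    keep = np.concatenate(keep)
    emb = TSNE(n_components=2, init="pca", perplexity=30,
               random_state=seed).fit_transform(features[keep])
    plt.scatter(emb[:, 0], emb[:, 1], c=labels[keep], s=2, cmap="tab20")
    plt.axis("off")
    plt.show()

Running such a routine once per adaptation round on the same images would produce panels comparable to those shown in Fig. 12.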
V. CONCLUSION

In this paper, we have shown the benefits of using target domain knowledge in a self-training framework to improve the performance of state-of-the-art domain adaptive semantic segmentation models in real foggy scenes. To achieve this, we propose a superpixel-based spatial diffusion and an optical flow-based temporal diffusion scheme that exploit the characteristics of the target data to generate reliable and dense pseudo labels; the former is better at increasing the number of pseudo labels, while the latter is better at maintaining the confidence of the diffused labels. In addition, two new losses are proposed to enforce the spatial and temporal consistency of the features in the re-training stage, helping the model better adapt to the characteristics of the target domain. Experiments on real foggy images show that the proposed TDo-Dif method significantly densifies the pseudo labels and further boosts the performance of semantic segmentation beyond state-of-the-art counterparts.
number of pseudo labels and the latter better maintain the Jan. 2015.
[10] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis. New York City, NY, USA: Springer, 2014, pp. 740–755.
[11] M. Cordts et al., “The cityscapes dataset for semantic urban scene understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 3213–3223.
[12] O. Erkent and C. Laugier, “Semantic segmentation with unsupervised domain adaptation under varying weather conditions for autonomous vehicles,” IEEE Robot. Autom. Lett., vol. 5, no. 2, pp. 3580–3587, Apr. 2020.
[13] F. Tung, J. Chen, L. Meng, and J. J. Little, “The raincouver scene parsing benchmark for self-driving in adverse weather and at night,” IEEE Robot. Autom. Lett., vol. 2, no. 4, pp. 2188–2193, Oct. 2017.
[14] S. Zang, M. Ding, D. Smith, P. Tyler, T. Rakotoarivelo, and M. A. Kaafar, “The impact of adverse weather conditions on autonomous vehicles: How rain, snow, fog, and hail affect the performance of a self-driving car,” IEEE Veh. Technol. Mag., vol. 14, no. 2, pp. 103–111, Jun. 2019.
[15] C. Sakaridis, D. Dai, and L. Van Gool, “Semantic foggy scene understanding with synthetic data,” Int. J. Comput. Vis., vol. 126, no. 9, pp. 973–992, 2018.
[16] C. Sakaridis, D. Dai, S. Hecker, and L. V. Gool, “Model adaptation with synthetic and real data for semantic dense foggy scene understanding,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 707–724.
[17] D. Dai, C. Sakaridis, S. Hecker, and L. Van Gool, “Curriculum model adaptation with synthetic and real data for semantic foggy scene understanding,” Int. J. Comput. Vis., vol. 128, no. 5, pp. 1182–1204, 2020.
[18] R. Gong, D. Dai, Y. Chen, W. Li, and L. Van Gool, “Analogical image translation for fog generation,” in Proc. AAAI Conf. Artif. Intell., 2021, pp. 1433–1441.
[19] Y. Zou, Z. Yu, B. V. K. V. Kumar, and J. Wang, “Unsupervised domain adaptation for semantic segmentation via class-balanced self-training,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 297–313.
[20] Q. Lian, L. Duan, F. Lv, and B. Gong, “Constructing self-motivated pyramid curriculums for cross-domain semantic segmentation: A non-adversarial approach,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 6757–6766.
[21] Y. Zou, Z. Yu, X. Liu, B. V. K. V. Kumar, and J. Wang, “Confidence regularized self-training,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5981–5990.
[22] Y. Zhang, P. David, H. Foroosh, and B. Gong, “A curriculum domain adaptation approach to the semantic segmentation of urban scenes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 8, pp. 1823–1841, Aug. 2020.
[23] K. Mei, C. Zhu, J. Zou, and S. Zhang, “Instance adaptive self-training for unsupervised domain adaptation,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 415–430.
[24] I. Shin, S. Woo, F. Pan, and I. S. Kweon, “Two-phase pseudo label densification for self-training based domain adaptation,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 532–548.
[25] Y. Zou, Z. Yu, X. Liu, B. V. K. V. Kumar, and J. Wang, “Confidence regularized self-training,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5982–5991.
[26] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, Apr. 2017.
[27] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[28] G. Lin, F. Liu, A. Milan, C. Shen, and I. Reid, “RefineNet: Multi-path refinement networks for dense prediction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 5, pp. 1228–1242, May 2020.
[29] A. Mustafa, H. Kim, and A. Hilton, “MSFD: Multi-scale segmentation-based feature detection for wide-baseline scene reconstruction,” IEEE Trans. Image Process., vol. 28, no. 3, pp. 1118–1132, Mar. 2018.
[30] L. C. Chen, G. Papandreou, and I. Kokkinos, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Jun. 2017.
[31] Z. Zhong et al., “Squeeze-and-attention networks for semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13062–13071.
[32] T. Zhang, G. Lin, J. Cai, T. Shen, C. Shen, and A. C. Kot, “Decoupled spatial neural attention for weakly supervised semantic segmentation,” IEEE Trans. Multimedia, vol. 21, no. 11, pp. 2930–2941, Nov. 2019.
[33] Y. Zhou, X. Sun, Z.-J. Zha, and W. Zeng, “Context-reinforced semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 4046–4055.
[34] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Sep. 2010.
[35] K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 12, pp. 2341–2353, Sep. 2011.
[36] G. Lin, A. Milan, C. Shen, and I. Reid, “RefineNet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1925–1934.
[37] M. Hahner, D. Dai, C. Sakaridis, J. Zaech, and L. V. Gool, “Semantic understanding of foggy scenes with purely synthetic data,” in Proc. IEEE Intell. Transp. Syst. Conf., Oct. 2019, pp. 3675–3681.
[38] W. Chen and H. Hu, “Generative attention adversarial classification network for unsupervised domain adaptation,” Pattern Recognit., vol. 107, Nov. 2020, Art. no. 107440.
[39] A. RoyChowdhury et al., “Automatic adaptation of object detectors to new domains using self-training,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 780–790.
[40] Y. Li, D. Huang, D. Qin, L. Wang, and B. Gong, “Improving object detection with selective self-supervised self-training,” in Proc. Eur. Conf. Comput. Vis. New York City, NY, USA: Springer, 2020, pp. 589–607.
[41] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, “Unsupervised pixel-level domain adaptation with generative adversarial networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 95–104.
[42] Z. Murez, S. Kolouri, D. J. Kriegman, R. Ramamoorthi, and K. Kim, “Image to image translation for domain adaptation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4500–4509.
[43] G. Kang, L. Jiang, Y. Yang, and A. G. Hauptmann, “Contrastive adaptation network for unsupervised domain adaptation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4893–4902.
[44] M. Long, J. Wang, and M. I. Jordan, “Deep transfer learning with joint adaptation networks,” in Proc. ICML, 2017, pp. 2208–2217.
[45] Y. Li, L. Yuan, and N. Vasconcelos, “Bidirectional learning for domain adaptation of semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 6936–6945.
[46] M. Kim and H. Byun, “Learning texture invariant representation for domain adaptation of semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 12975–12984.
[47] P. Neubert and P. Protzel, “Superpixel benchmark and comparison,” in Proc. Forum Bildverarbeitung, vol. 6, 2012, pp. 1–12.
[48] A. Q. de Oliveira, T. L. T. D. Silveira, M. Walter, and C. R. Jung, “A hierarchical superpixel-based approach for DIBR view synthesis,” IEEE Trans. Image Process., vol. 30, pp. 6408–6419, 2021.
[49] Y. Yuan, F. Fang, and G. Zhang, “Superpixel-based seamless image stitching for UAV images,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 2, pp. 1565–1576, Feb. 2021.
[50] F. Fang, T. Wang, T. Zeng, and G. Zhang, “A superpixel-based variational model for image colorization,” IEEE Trans. Vis. Comput. Graphics, vol. 26, no. 10, pp. 2931–2943, Oct. 2020.
[51] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012.
[52] P. Truong, M. Danelljan, L. Van Gool, and R. Timofte, “Learning accurate dense correspondences and when to trust them,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 5714–5724.
[53] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker, “Learning to adapt structured output space for semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7472–7481.
[54] X. Ma et al., “Both style and fog matter: Cumulative domain adaptation for semantic foggy scene understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022.
[55] S. Lee, S. Taeyoung, and K. Suha, “FIFO: Learning fog-invariant features for foggy scene segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022.
[56] V. Vinod, K. R. Prabhakar, R. V. Babu, and A. Chakraborty, “Multi-domain conditional image translation: Translating driving datasets from clear-weather to adverse conditions,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 1571–1582.
[57] J. Iqbal, R. Hafiz, and M. Ali, “Combining scale-invariance and uncertainty for self-supervised domain adaptation of foggy scenes segmentation,” 2022, arXiv:2201.02588.
[58] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.

Liang Liao (Member, IEEE) received the B.S. degree from the International School of Software, Wuhan University, Wuhan, China, in 2013, and the Ph.D. degree from the National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, in 2019. He is currently a Researcher at the Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo, Japan. His research interests include computer vision, image processing, and transmission.

Wenyi Chen received the B.S. degree in software engineering from the School of Computer Science, Wuhan University, Wuhan, China, in 2020, where he is currently pursuing the M.S. degree with the National Engineering Research Center for Multimedia Software. His research interests include computer vision and image/video processing.

Jing Xiao (Member, IEEE) received the B.S. and M.S. degrees from Wuhan University in 2006 and 2008, respectively, and the Ph.D. degree from the Institute of Geo-Information Science and Earth Observation, Twente University, The Netherlands, in 2013. She was a Project Researcher with the National Institute of Informatics, Japan, from 2019 to 2020. She is currently an Associate Professor with the National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University. Her research interests include image/video processing, compression, and analysis.

Zheng Wang (Member, IEEE) received the B.S. and M.S. degrees from Wuhan University, Wuhan, China, in 2006 and 2008, respectively, and the Ph.D. degree from the National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, in 2017. He was a JSPS Fellowship Researcher at Shin’ichi Satoh’s Laboratory, National Institute of Informatics, Japan, and a Project Assistant Professor at The University of Tokyo, Japan. He is currently a Professor with the National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University. His research interests focus on person re-identification and instance search. He has received the Best Paper Award at the 15th Pacific-Rim Conference on Multimedia (PCM 2014) and the 2017 ACM Wuhan Doctoral Dissertation Award.

Chia-Wen Lin (Fellow, IEEE) received the Ph.D. degree in electrical engineering from the National Tsing Hua University (NTHU), Hsinchu, Taiwan, in 2000. He was with the Department of Computer Science and Information Engineering, National Chung Cheng University, Taiwan, from 2000 to 2007. He is currently a Professor with the Department of Electrical Engineering and the Institute of Communications Engineering, NTHU. He is also the Deputy Director of the AI Research Center, NTHU. Prior to joining academia, he worked at the Information and Communications Research Laboratories, Industrial Technology Research Institute, Hsinchu, from 1992 to 2000. His research interests include image and video processing, computer vision, and video networking.
Dr. Lin has served as a Steering Committee Member for IEEE Transactions on Multimedia from 2014 to 2015. He has been serving as the President of the Chinese Image Processing and Pattern Recognition Association, Taiwan, since 2019. His articles received the Best Paper Award of IEEE VCIP 2015, the Top 10% Paper Awards of IEEE MMSP 2013, and the Young Investigator Award of VCIP 2005. He has received the Outstanding Electrical Professor Award presented by the Chinese Institute of Electrical Engineering in 2019 and the Young Investigator Award presented by the Ministry of Science and Technology, Taiwan, in 2006. He has served as the Chair of the Multimedia Systems and Applications Technical Committee of the IEEE Circuits and Systems Society from 2013 to 2015. He has served as the Technical Program Co-Chair for IEEE ICME 2010, the General Co-Chair for IEEE VCIP 2018, and the Technical Program Co-Chair for IEEE ICIP 2019. He is the Chair of the Steering Committee of IEEE ICME. He has served as an Associate Editor for IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Multimedia, IEEE MultiMedia, and the Journal of Visual Communication and Image Representation. He served as a Distinguished Lecturer for the IEEE Circuits and Systems Society from 2018 to 2019.

Shin’ichi Satoh (Senior Member, IEEE) received the B.E. degree in electronics engineering and the M.E. and Ph.D. degrees in information engineering from The University of Tokyo, Tokyo, Japan, in 1987, 1989, and 1992, respectively. He was a Visiting Scientist with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA, from 1995 to 1997. He has been a Full Professor with the National Institute of Informatics, Tokyo, since 2004. His current research interests include image processing, video content analysis, and multimedia databases.