
Pattern Recognition 136 (2023) 109228


An effective CNN and Transformer complementary network for medical image segmentation

Feiniu Yuan a,c,d,∗, Zhengxiao Zhang a,c,d, Zhijun Fang b

a College of Information, Mechanical and Electrical Engineering, Shanghai Normal University (SHNU), Shanghai 201418, China
b School of Computer Science and Technology, Donghua University, Shanghai 201620, China
c Research Base of Online Education for Shanghai Middle and Primary Schools, Shanghai Normal University, Shanghai 201418, China
d Shanghai Engineering Research Center of Intelligent Education and Bigdata, Shanghai Normal University, Shanghai 200234, China

∗ Corresponding author. E-mail address: yfn@ustc.edu (F. Yuan).

Article history: Received 21 August 2022; Revised 5 November 2022; Accepted 29 November 2022; Available online 30 November 2022.

Keywords: Transformer; Medical image segmentation; Feature complementary; Cross-domain fusion; Convolutional Neural Network

Abstract

The Transformer network was originally proposed for natural language processing. Due to its powerful representation ability for long-range dependency, it has been extended for vision tasks in recent years. To fully utilize the advantages of Transformers and Convolutional Neural Networks (CNNs), we propose a CNN and Transformer Complementary Network (CTC–Net) for medical image segmentation. We first design two encoders by Swin Transformers and Residual CNNs to produce complementary features in Transformer and CNN domains, respectively. Then we cross-wisely concatenate these complementary features to propose a Cross-domain Fusion Block (CFB) for effectively blending them. In addition, we compute the correlation between features from the CNN and Transformer domains, and apply channel attention to the self-attention features by Transformers for capturing dual attention information. We incorporate cross-domain fusion, feature correlation and dual attention together to propose a Feature Complementary Module (FCM) for improving the representation ability of features. Finally, we design a Swin Transformer decoder to further improve the representation ability of long-range dependencies, and propose to use skip connections between the Transformer decoded features and the complementary features for extracting spatial details, contextual semantics and long-range information. Skip connections are performed in different levels for enhancing multi-scale invariance. Experimental results show that our CTC–Net significantly surpasses the state-of-the-art image segmentation models based on CNNs, Transformers, and even Transformer and CNN combined models designed for medical image segmentation. It achieves superior performance on different medical applications, including multi-organ segmentation and cardiac segmentation.

© 2022 Elsevier Ltd. All rights reserved.

1. Introduction

Medical images have different modalities [1] that reflect the internal structures of human bodies and are widely used for modern medical diagnosis. To better assist disease diagnosis professionals, medical image segmentation methods have been proposed to separate specific organs from others. Segmented organs play an important role in computer-aided clinical diagnosis. Medical image segmentation involves many clinical applications, such as multi-organ segmentation and cardiac segmentation. Accurate pixel-level classification of medical images for locating lesions is of great significance to clinical treatments, and it already serves as an important auxiliary diagnostic tool.

With the constant innovation of computing power and the rapid development of deep learning, Convolutional Neural Networks (CNNs) [2] have become the predominant backbone for vision models. In recent years, many CNN based methods have been proposed for medical image processing. Most methods adopt a general U-shaped architecture [3], which consists of an encoder and a decoder for medical image segmentation [4,5]. The encoder usually captures detailed texture information and contextual features through consecutive down-samplings, convolutions and normalizations. With the deepening of networks, receptive fields gradually enlarge and more semantic information is extracted. The decoder is responsible for gradually up-sampling feature maps to generate the output mask. Spatial details are inevitably lost during down-sampling, and the lost information can be partially restored by using skip connections. Despite the great successes of CNN based methods, long-range dependency information has not been modelled well in most of them.

https://doi.org/10.1016/j.patcog.2022.109228

As an emerging model that first thrived and has been widely used in various tasks of Natural Language Processing (NLP), the Transformer model [6] has achieved huge progress and success in the deep learning community. In recent years, more and more Transformer based methods [7,8] have been proposed for computer vision tasks. Compared to CNNs, Transformers make fuller use of self-attention mechanisms, which can compensate for CNNs' inherent limitations on long-range dependencies. Alexey et al. [9] proposed a Vision Transformer (ViT) for image classification, which is one of the most influential events in the vision research field. ViT [9] perfectly bridges the gap between natural language processing and computer vision. It is the first time that researchers reported that Transformers can be used in vision tasks and achieve state-of-the-art performance for vision problems. Carion et al. [10] proposed a DEtection TRansformer (DETR) by utilizing an elegant design based on Transformers to build the first fully end-to-end object detection model. To improve image segmentation performance, Liu et al. [11] proposed a hierarchical vision Transformer using shifted windows (Swin Transformer). It not only applies the inductive bias of CNNs to a network structure with Transformers, but also exploits the advantages of the self-attention mechanisms embedded in Transformers.

CNNs and Transformers focus on different aspects. On one hand, CNNs heavily adopt convolutions with strong inductive biases, leading to locality and translation invariance. This property allows CNNs to preferably extract local contextual information, but it inevitably brings a non-negligible difficulty, namely a limited receptive field. Many solutions have been proposed to deal with this problem [12], such as atrous convolution [13], enlarged kernel sizes [14], pyramid pooling [15] and non-local operations [16]. These methods alleviate the problem, but do not solve it completely. On the other hand, Transformers inherently adopt self-attention mechanisms for perfectly extracting global and long-range dependencies, but do not capture locality and translation invariance very well.

According to the above-mentioned analyses, Transformers and CNNs are naturally complementary to each other. From this perspective, we believe that combining these two kinds of models can overcome the weaknesses of both and strengthen their advantages simultaneously. To achieve this purpose, we propose a CNN and Transformer Complementary Network (CTC–Net) for medical image segmentation. In our CTC–Net, we first design a CNN based encoder branch by ResNet34 [17], mainly for extracting contextual features, and another Transformer based encoder branch by Swin Transformer blocks [11], mainly for capturing long-range dependency information. Then, we specifically design a feature complementary module for cross-wisely enhancing features from two different domains. We compute the correlation between features from the CNN and Transformer domains for further improving performance. In addition, we conduct channel attention on the Transformer self-attention features for capturing dual attention information. The main contributions of this paper are summarized as follows:

1) We design dual encoding paths, namely CNN and Transformer encoders, for producing complementary features. The CNN encoder implemented by ResNet34 [17] mainly focuses on extracting spatial and contextual features, while the Transformer one implemented by the Swin Transformer [11] is mainly responsible for capturing long-range dependencies.
2) We propose an effective Feature Complementary Module (FCM) by cross-wisely fusing features from the transformation domains of CNN and Transformer. Our FCM adopts a cross-domain fusion manner for effectively combining CNN and Transformer features. In fact, we propose cross-domain correlation between CNN and Transformer features, and channel attention on the Transformer path.
3) We propose multi-scale skip connections of complementary features to effectively enhance the representation ability of the Swin Transformer decoder. Combining Transformer decoded features and complementary ones can jointly extract contextual and long-range information. In this way, we propose a CNN and Transformer Complementary Network (CTC–Net) for medical image segmentation.

The remainder of this paper is organized as follows. In Section 2, we review related work on semantic segmentation by CNNs, Transformers and their combinations. Section 3 describes our method. Section 4 presents experiments and analysis. Finally, we conclude this paper in Section 5.

2. Related work

2.1. CNN based methods

Long et al. [2] proposed a Fully Convolutional Network (FCN) for image segmentation, which has become one of the mainstream networks for semantic segmentation. Inspired by the success of FCNs, more improvements have been made by adopting deeper and wider networks or more effective structures, such as VGG [18], ResNet [17], DenseNet [19], HRNet [20] and GoogLeNet [21]. Besides these classic networks, Ronneberger et al. [3] proposed a U-shaped Network (U-Net) for biomedical image segmentation, which is an encoder-decoder structure. After that, a set of U-Net based methods have been proposed, and they have demonstrated outstanding performance beyond other kinds of models for medical image segmentation.

Although the U-shaped structure is very simple, it exhibits a powerful representation ability. Res-UNet [22] replaces each sub-module of the standard U-Net with residual blocks and dense networks. Yuan et al. [23] stacked several encoders and decoders to form a Wave-shaped Network (W-Net), which specially uses skip connections between wave crests and troughs for improving segmentation performance. To obtain deep smoke segmentation, Yuan et al. [24] designed a two-path U-shaped architecture, where one deeper network is responsible for extracting global contexts and the other shallower one for obtaining fine-grained spatial details. Oktay et al. [25] proposed an attention U-Net by generating a gating signal that emphasizes the attention of features at different spatial locations. In addition, U-shaped structures can also be seen in the field of 3D medical image segmentation, such as 3D U-Net [26] and V-Net [27]. To improve the accuracy of smoke segmentation, Yuan et al. [28] proposed cubic-cross convolutional and count prior attentions. The count prior attention globally supervises the overall classification errors of all pixels.

2.2. Transformer based methods

The Transformer model [29] was originally designed for the task of machine translation. It uses a self-attention mechanism, layer normalization, a feed-forward network and residual structures to achieve excellent performance, and it does not employ any convolutions. The elegant and powerful self-attention mechanism quickly allowed Transformer based models to achieve state-of-the-art performance in various tasks of natural language processing. Due to their powerful representation ability for long-range dependencies, Transformer based models have achieved amazing accuracy in many NLP tasks.


Recently, many researchers have tried to introduce Transformers into the computer vision field to utilize the advantage of long-range dependency. Alexey et al. [9] divided an input image into patches to replace the words of NLP tasks, thus first applying the standard Transformer to vision tasks. Experimental results validate that Transformers have the potential to become a backbone network for vision tasks, and they have outperformed many existing CNN based models when trained on large datasets, such as ImageNet-22K and JFT-300M. Touvron et al. [30] used attention to enhance the applicability of vision Transformers by training data-efficient image transformers on smaller datasets, such as ImageNet-1k. They also used data augmentation and regularization strategies [31] for further improving performance. Srinivas et al. [32] proposed a Bottleneck Transformer Network (BoTNet) for visual recognition by only replacing 3 × 3 convolutions with multi-head self-attention blocks, which achieves surprisingly good performance on ImageNet.

The Swin Transformer [11] mainly adopts patch merging, patch expanding and self-attention blocks to tackle image segmentation tasks that are often dominated by CNNs. It can not only introduce the inductive biases of convolutions into networks that use Transformers as the backbone of encoders and decoders, but also fully exploit the advantages of self-attention mechanisms. The method implements local attention in each window instead of global attention over the whole image, which is roughly equivalent to convolutions and reduces the computational complexity of Transformers from quadratic to linear. Shifted windows allow information to flow across the originally fixed windows, thus obtaining global interaction between different local windows. Through patch merging, the Swin Transformer obtains multi-scale features as convolutions do and further expands receptive fields. Inspired by the great success of Swin Transformers, we adopt the Swin Transformer in both encoders and decoders.

2.3. CNN and Transformer combined methods

Combinative methods can keep the individual benefits of different algorithms for final purposes [33]. Different from inserting self-attention mechanism blocks into CNN models [34,35], several works mainly attempt to combine CNNs and Transformers for improving performance. Carion et al. [35] used CNNs to initially extract the preliminary features of objects, and then adopted Transformers to continue dealing with the extracted features. Valanarasu et al. [36] introduced an additional gating mechanism in the Transformer layer to reduce computational complexity and improve the segmentation capability of the model. There are also various combinations of these two structures in multi-modal brain tumor segmentation [37,38] and 3D medical image segmentation [12,39].

Fig. 1 shows the simplified structures of five typical methods for main framework comparisons, including two pure CNN methods and three CNN and Transformer combined methods. For the sake of simplification and clearness, we do not draw some non-dominant structures, such as skip connections, attention structures and loss functions. Note that the CNN encoders and decoders, the Transformer encoders and decoders, and the fusion modules in Fig. 1 are totally different from each other.

As shown in Fig. 1a, the famous U-Net structure consists of a CNN encoder and a CNN decoder, and it has achieved excellent performance on medical images. Fig. 1b is our previous work [23] with two sets of CNN encoders and decoders for boosting soft segmentation accuracy. In this method, the first encoder is followed by the first decoder, then the results are fed into the second encoder for further improving the feature representation, and finally the second decoder recovers the second encoded features to the input size. Thus, a Wave-shaped Network (W-Net [23]) is formed.

Fig. 1c shows TransUnet [40], which is the first method to adopt Transformers for medical image segmentation. Its core idea is rather simple: a Transformer block is directly inserted between a CNN encoder and a CNN decoder. The CNN encoder extracts high-resolution spatial details and contextual information, the Transformer block is responsible for further modeling long-range dependencies, and the CNN decoder recovers the features to the input size. However, this method fails to fully exploit the advantages brought by Transformers and neglects introducing Transformers into each scale of the feature maps.

As shown in Fig. 1d, TransFuse [41] has a CNN encoder, two CNN decoders, a Transformer decoder and a feature fusion module. The two encoders extract contextual and long-range dependency features, and the first decoder is used for up-sampling features to be fused with the Transformer features. Then the fused features are fed into the second decoder for generating the final output. However, this method does not use Transformers to decode features for further improving long-range dependencies and fusing low-level spatial details.

To solve the above-mentioned problems, we propose a CNN and Transformer Complementary Network (CTC–Net) for medical image segmentation, which has a CNN encoder, a Transformer encoder, a Transformer decoder and an effective feature fusion module, as shown in Fig. 1e. The CNN encoder is mainly responsible for capturing spatial contextual information, and the Transformer encoder focuses on extracting long-range dependencies. These features lie in two domains and are complementary to each other. Therefore, we propose a cross-domain fusion module for enhancing them. To further model long-range dependencies at different levels, we also propose a Transformer decoder with skip connections for multi-scale fusion and decoding.
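To make the overall data flow of Fig. 1e concrete, the following PyTorch sketch composes the four parts with lightweight stand-in stages. The module names, channel widths and the simple concatenation-based fusion are our assumptions for illustration only; they do not reproduce the authors' actual ResNet34 stages, Swin blocks or FCM.

```python
# Minimal structural sketch of the CTC-Net data flow (dual encoders, per-level fusion,
# single decoder with skip connections). Stand-in blocks keep the sketch runnable.
import torch
import torch.nn as nn

class StandInBlock(nn.Module):
    """Placeholder for a ResNet/Swin stage: one strided conv halving the resolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                                  nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.body(x)

class CTCNetSketch(nn.Module):
    def __init__(self, num_classes=9, base_c=48):
        super().__init__()
        chans = [base_c, 2 * base_c, 4 * base_c]        # C, 2C, 4C as in Section 3
        self.cnn_enc  = nn.ModuleList([StandInBlock(ci, co)
                                       for ci, co in zip([3] + chans[:-1], chans)])
        self.swin_enc = nn.ModuleList([StandInBlock(ci, co)
                                       for ci, co in zip([3] + chans[:-1], chans)])
        # Stand-in for the FCM: a 1x1 conv over concatenated features (the real FCM of
        # Section 3.4 uses cross-domain fusion, correlation and dual attention).
        self.fcm = nn.ModuleList([nn.Conv2d(2 * c, c, 1) for c in chans])
        self.dec = nn.ModuleList([nn.ConvTranspose2d(c, cp, 2, stride=2)
                                  for c, cp in zip(chans[::-1], chans[-2::-1] + [base_c])])
        self.head = nn.Conv2d(base_c, num_classes, 1)

    def forward(self, x):
        f, g, skips = x, x, []
        for cnn_stage, swin_stage, fuse in zip(self.cnn_enc, self.swin_enc, self.fcm):
            f, g = cnn_stage(f), swin_stage(g)           # complementary features per level
            skips.append(fuse(torch.cat([f, g], dim=1))) # cross-domain fusion (stand-in)
        y = self.dec[0](skips[-1])                       # start decoding from the deepest level
        for dec_stage, skip in zip(self.dec[1:], skips[-2::-1]):
            y = dec_stage(y + skip)                      # skip connection from the fused features
        return self.head(y)                              # (B, num_classes, H, W)

out = CTCNetSketch()(torch.randn(1, 3, 224, 224))
print(out.shape)   # torch.Size([1, 9, 224, 224])
```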

Fig. 1. Comparisons of simplified frameworks. (a) U-Net with a CNN encoder and a CNN decoder. (b) W-Net with two CNN encoders and two CNN decoders. (c) TransUnet
with a CNN encoder, a Transformer block and a CNN decoder. (d) TransFuse with a CNN encoder, two CNN decoders, a Transformer decoder and a fusion module. (e) Our
CTC–Net with a CNN encoder, a Transformer encoder, a Transformer decoder and an effective feature fusion module.


3. The proposed methods

3.1. Architecture overview

Inspired by the powerful representation abilities of CNNs and Transformers, we propose a CNN and Transformer Complementary Network (CTC–Net). As shown in Fig. 2, our CTC–Net consists of four main branches, including a CNN encoder, a Transformer encoder, a Feature Complementary Module (FCM) and a Transformer decoder. The main differences of our method from existing architectures are that we design two different encoders to produce complementary features, and we propose a cross-domain complementary fusion module.

Our motivation for using both CNN and Transformer encoders is that CNN encoders are mutually complementary to Transformer encoders. On the one hand, there are relatively small or thin organs in medical images, so deeper CNN architectures may result in severe loss of small or thin objects. It is very difficult for decoders to recover lost small and narrow objects, leading to failures on them. On the other hand, Transformer encoders can capture more dependencies between long and narrow organs in human bodies, as in TransUnet [40]. Long-range information is very useful for improving the segmentation accuracy of long and narrow objects. Therefore, we specially propose the multi-scale Feature Complementary Module (FCM) for fusing features from both CNN and Transformer encoders.

Finally, we design the Transformer decoder, which directly accepts the feature maps from the Transformer encoder for progressively recovering feature maps. In addition, our FCM produces different levels of complementary information from the CNN and Transformer encoders, which is also fed into the Transformer decoder by skip connections. The complementary features are very important for restoring the small and narrow objects commonly existing in human bodies.

3.2. The Transformer encoder

Alexey et al. [9] proposed the Vision Transformer (ViT), which is the first Transformer model for vision tasks. In a standard Transformer, pixel values in each patch are concatenated to produce a token feature vector, which is required to calculate the attention with all other tokens in the whole image, directly leading to a quadratic computational complexity with respect to the image size. To solve this problem, the Swin Transformer [11] performs self-attention only in each local window, thus the computational complexity is linearly proportional to the image size.
Fig. 2. The overall architecture of our CTC–Net. (a) The CNN encoder. (b) The multi-scale Feature Complementary Module (FCM). (c) The Transformer encoder. (d) The Transformer decoder. (e) Swin Transformer Block (STB).


We stack Swin Transformer Blocks (STB) and patch operations [11] to construct the main feature encoding path, named the Transformer encoder (Fig. 2c). Each Swin Transformer block (Fig. 2e) is composed of two successive sub-blocks. The first sub-block consists of Layer Normalization (LN), Window based Multi-head Self Attention (W-MSA), a Multi-Layer Perceptron (MLP) and residual additions. The second one has almost the same operations, but replaces W-MSA with Shifted Window based MSA (SW-MSA) [11]. We also use patch merging [11] to down-sample feature maps. Common down-sampling methods used in CNNs are pooling and strided convolutions. Similarly, Transformer based methods also need to perform down-sampling operations for aggregating contextual features. To obtain the same ability, the Swin Transformer [11] merges adjacent 2 × 2 patches into a large patch by concatenating these 4 small patches along the channel direction. This procedure is called patch merging.

According to spatial resolutions, the Transformer encoder can be divided into four levels, as shown in Fig. 2c. The first level has a patch embedding layer and two Swin Transformer blocks for feature encoding. Each of the second to fourth levels has a patch merging layer for down-sampling, and Swin Transformer blocks for extracting long-range dependencies (the per-level depths are listed in Table 1).

Suppose that the input RGB image x has the size of H × W × 3, and the output y of our CTC–Net is H × W × N, where N is the category number for segmentation. The 2D outputs of the Transformer encoder in the four levels are denoted by g_1, g_2, g_3 and g_4, which have the sizes of (H/4 × W/4) × C, (H/8 × W/8) × 2C, (H/16 × W/16) × 4C and (H/32 × W/32) × 8C, respectively. According to the Swin Transformer [11], an RGB image patch with size of 4 × 4 is regarded as a token, so the feature dimension C of each token is equal to 4 × 4 × 3 = 48, i.e. C = 48.

3.3. The CNN encoder

To obtain contextual features and maintain certain spatial details by convolutional neural networks, we use the four encoding blocks of ResNet34 [17] to build a CNN encoder, as shown in Fig. 2a. The four blocks of ResNet34 [17] are denoted by Conv1x, Conv2x, Conv3x and Conv4x, each of which performs down-sampling by a rate of 2.

To make the feature map sizes of the CNN encoder exactly consistent with those of the Transformer encoder, we adopt Conv1x and Conv2x to down-sample feature maps twice in Level 1. For the sake of consistency, the channel number C of the 3D feature map f_1 for Level 1 is set to 48. Hence, the CNN encoder generates an output feature map f_1 with size of H/4 × W/4 × C for the first level, which has the same pixel number as the first level output of the Transformer encoder. For the second level, the Conv3x block is used to process f_1 to generate another 3D feature map f_2, which has the size of H/8 × W/8 × 2C. Next, we use Conv4x to filter f_2 to obtain the third 3D feature map f_3 with size of H/16 × W/16 × 4C for Level 3. Our CNN encoder has only three levels to produce three feature maps. These three maps, i.e. f_1, f_2 and f_3, contain abundant spatial details and contextual semantics for improving the representation of the Transformer decoder.

3.4. Feature complementary module

Transformer based methods originally proposed for NLP are different from CNN based ones for vision tasks. These two kinds of methods have totally different feature extraction manners, and also have completely diverse purposes for applications. The features produced by the Transformer encoder and the CNN encoder are generated in different domains. To obtain mutually complementary information, we propose a feature complementary module consisting of four blocks, as shown in Fig. 3.

The first block is called the Cross-domain Fusion Block (CFB). Our CFB is responsible for cross-wisely fusing and enhancing features from the two different domains of the Transformer and CNN encoders. Specifically, the feature maps from the Transformer and CNN encoders are denoted by g_i and f_i, respectively. Suppose that the 2D feature map g_i has the size of (h × w) × c, and the 3D CNN feature map f_i has the size of h × w × c. To implement cross-domain fusion, we first apply Global Average Pooling (GAP) to these two maps to generate two feature vectors with size of (1 × 1) × c. Then, we concatenate the Transformer input g_i with the globally pooled feature vector of the CNN input f_i along the first axis, producing a larger 2D feature map g_i^1 with size of (h × w + 1) × c. The concatenated map is fed into a Swin Transformer Block (STB) for feature fusion. Thus we obtain a powerfully fused 2D feature map g_i^2 with size of (h × w) × c, which is reshaped into its 3D version g_i^3 with size of h × w × c. On the other hand, we also concatenate the CNN input f_i with the pooled feature vector of the Transformer input g_i along the first axis, producing another larger 2D feature map f_i^1 with size of (h × w + 1) × c. Similarly, we use a Swin Transformer Block to process the concatenated feature map to produce another cross-domain fused feature map f_i^2, and reshape it to obtain a 3D feature map f_i^3. Finally, we concatenate the two cross-domain fused 3D feature maps and use a 1 × 1 convolution to generate a cross-domain fusion feature map s_i with size of h × w × c. The processing of our CFB is formulated as follows:

g_i^1 = cat(GAP(f_i), g_i),        (1)
f_i^1 = cat(GAP(g_i), f_i),        (2)
g_i^3 = reshape(STB(g_i^1)),       (3)
f_i^3 = reshape(STB(f_i^1)),       (4)
s_i = conv(cat(g_i^3, f_i^3)),     (5)

where GAP, cat, STB, reshape and conv stand for Global Average Pooling, concatenation, Swin Transformer Block, reshaping and convolution, respectively.

Eqs. (1) and (2) perform the intensive cross-domain fusion of features from two different domains. In addition, the Swin Transformer Blocks in Eqs. (3) and (4) further enhance the feature representation abilities of long-range dependencies. Eq. (5) finally fuses the features from the two cross-wise paths.

The second block is the Correlation Enhancement Block (CEB). Our CEB is proposed to model the cross-domain correlation between features from the two transformation domains of the Transformer and CNN encoders. We first reshape the 2D Transformer feature map g_i to obtain its 3D version g_i^0, and then point-wisely multiply g_i^0 by f_i to produce a cross-domain correlation feature map e_i with size of h × w × c. In fact, our CEB is a special kind of attention mechanism, which can enhance important information and suppress unremarkable features in the two feature maps. By using the CEB, we extract mutually salient features in both CNN and Transformer branches for further improving accuracy.

The third one is the Channel Attention Block (CAB). The original Swin Transformer block has a built-in self-attention mechanism for modeling long-range dependency. To further enhance attention features, we apply a channel attention [42] commonly used by CNNs to the Transformer features. In this way, we efficiently implement a mixture of channel and self-attention to obtain a dual attention feature map a_i with size of h × w × c. In other words, our CAB is in fact a mixed attention mechanism.
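As a concrete reading of Eqs. (1)-(5), the sketch below implements the CFB with a generic transformer encoder layer standing in for the Swin Transformer Block (no shifted windows), and it simply drops the pooled token after fusion, which is an assumption; class and argument names are ours, not the authors' released code.

```python
# Sketch of the Cross-domain Fusion Block (Eqs. (1)-(5)) for a level with c channels
# on an h x w grid. nn.TransformerEncoderLayer is only a stand-in for the real STB.
import torch
import torch.nn as nn

class CrossDomainFusionBlock(nn.Module):
    def __init__(self, channels, num_heads):
        super().__init__()
        self.stb_g = nn.TransformerEncoderLayer(channels, num_heads, batch_first=True)
        self.stb_f = nn.TransformerEncoderLayer(channels, num_heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)     # Eq. (5)

    def forward(self, g, f):
        # g: Transformer tokens, shape (B, h*w, c); f: CNN feature map, shape (B, c, h, w)
        b, c, h, w = f.shape
        gap_f = f.mean(dim=(2, 3)).unsqueeze(1)                          # (B, 1, c)
        gap_g = g.mean(dim=1, keepdim=True)                              # (B, 1, c)
        f_tok = f.flatten(2).transpose(1, 2)                             # (B, h*w, c)

        g1 = torch.cat([gap_f, g], dim=1)                                # Eq. (1): (B, h*w+1, c)
        f1 = torch.cat([gap_g, f_tok], dim=1)                            # Eq. (2)

        # Eqs. (3)-(4): fuse with the (stand-in) STB, drop the pooled token, reshape to 3D.
        g3 = self.stb_g(g1)[:, 1:].transpose(1, 2).reshape(b, c, h, w)
        f3 = self.stb_f(f1)[:, 1:].transpose(1, 2).reshape(b, c, h, w)

        return self.fuse(torch.cat([g3, f3], dim=1))                     # s_i: (B, c, h, w)

cfb = CrossDomainFusionBlock(channels=96, num_heads=6)
s = cfb(torch.randn(2, 28 * 28, 96), torch.randn(2, 96, 28, 28))
print(s.shape)   # torch.Size([2, 96, 28, 28])
```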


Fig. 3. The architecture of Feature Complementary Module (FCM). Our FCM consists of four blocks, which are Cross-domain Fusion Block (CFB), Correlation Enhancement
Block (CEB), Channel Attention Block (CAB) and Feature Fusion Block (FFB). Yellow cuboids or rectangles represent the output features from the CNN encoder, while blue
ones denote the output features from the Transformer encoder.
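The CEB and CAB described before Fig. 3 reduce to a few tensor operations. In the sketch below, an SE-style gate is an assumed instantiation of the channel attention of [42], and all names are illustrative rather than the authors' implementation.

```python
# Sketch of the Correlation Enhancement Block (element-wise cross-domain product) and
# the Channel Attention Block (channel attention applied to the Transformer features).
import torch
import torch.nn as nn

class CorrelationEnhancementBlock(nn.Module):
    def forward(self, g, f):
        # g: Transformer tokens (B, h*w, c); f: CNN map (B, c, h, w)
        b, c, h, w = f.shape
        g0 = g.transpose(1, 2).reshape(b, c, h, w)   # reshape to the 3D version g_i^0
        return g0 * f                                # e_i: mutually salient responses

class ChannelAttentionBlock(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(                   # assumed SE-style channel gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, g):
        # g: Transformer tokens (B, h*w, c) -> dual attention map a_i (B, c, h, w)
        b, n, c = g.shape
        h = w = int(n ** 0.5)
        x = g.transpose(1, 2).reshape(b, c, h, w)
        return x * self.gate(x)

g, f = torch.randn(2, 28 * 28, 96), torch.randn(2, 96, 28, 28)
e = CorrelationEnhancementBlock()(g, f)
a = ChannelAttentionBlock(96)(g)
print(e.shape, a.shape)   # torch.Size([2, 96, 28, 28]) twice
```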

The last block is the Feature Fusion Block (FFB). We concatenate the cross-domain feature map s_i, the correlation feature map e_i and the dual attention feature map a_i to obtain a feature map m_i^1 with size of h × w × 3c, and use residual and reshaping operations to generate the output feature map m_i with size of h × w × c for our FCM, formulated as:

m_i^1 = cat(s_i, e_i, a_i),                  (6)
m_i = reshape(conv(m_i^1) + CBR(m_i^1)),     (7)

where CBR is a block with convolution (Conv), batch normalization (BN) and a rectified linear unit (ReLU) that fuses the concatenated features and reduces the number of parameters at the same time.

3.5. The Transformer decoder

Swin Transformer blocks have been proven quite qualified for serving as either encoders or decoders [8]. Contrary to patch merging, patch expanding [11] is often used to up-sample feature maps. Following the idea of [11], we stack Swin Transformer blocks and patch expanding operations to create a four-level decoding path. To fuse features from both the Transformer and CNN encoders, we also feed the cross-domain fused features from the two encoders into the Swin Transformer blocks. As shown in Fig. 2d, our Transformer decoder has four levels. In each level, reinforced short connections between the Transformer decoder and the two encoders are proposed for compensating lost spatial details and long-range dependency information.

The fourth decoding level only adopts a patch expanding operation to up-sample feature maps at a rate of 2. In the third and second decoding levels, we first adopt two Swin Transformer blocks to fully fuse the cross-domain enhanced feature map from the corresponding feature complementary module and the up-sampled features from the adjacent higher level, and then use patch expanding to up-sample the fused feature map. In the first decoding level, we also adopt two Swin Transformer blocks for feature fusion and extraction of long-range dependencies. Besides, we use a final patch expanding block to up-sample the feature map for generating the output mask with the same size as the input image. In the final patch expanding block, we use a patch expanding with a rate of 4 for recovering the size of the 2D feature map, a 1 × 1 convolution for adjusting its channel number to the category number N, and a reshaping operation to convert the 2D map into a 3D feature map that is just the output of our CTC–Net. According to the description of Fig. 2d, the data processing in the Transformer decoder can be briefly formulated as follows:

v_k = STB(STB(u_k, m_k)),                    (8)
u_{k-1} = PE(v_k),                           (9)

where k is the level index, and STB and PE denote Swin Transformer and patch expanding blocks, respectively.

4. Experiments and discussion

4.1. Datasets

To evaluate the performance of our method for medical image segmentation, we compared our method with existing state-of-the-art networks on two widely used medical image datasets, which are the Synapse dataset (Synapse) and the Automatic Cardiac Diagnosis Challenge (ACDC) dataset. The Synapse and ACDC datasets are available via https://www.synapse.org/#!Synapse:syn3193805/wiki/217789 and https://www.creatis.insa-lyon.fr/Challenge/acdc/, respectively. More details about the two datasets are described as follows:

4.1.1. Synapse
Synapse includes 30 CT scans of abdominal organs for multi-organ segmentation. Following TransUnet [40], we selected 18 cases as a training set, and regarded the remaining 12 cases as a test set. We report the average Dice Similarity Coefficient (DSC) and the average Hausdorff Distance (HD) on 8 categories of 2211 2D slices extracted from the 3D volumes. The 8 classes are aorta, gallbladder, spleen, left kidney, right kidney, liver, pancreas, and stomach.

4.1.2. ACDC
ACDC aims to evaluate the segmentation performance of the left ventricle (LV), right ventricle (RV) and myocardium (MYO) for automated cardiac diagnosis. The dataset includes MRI images of 100 different patients. We divided the dataset into a training set with 70 samples, a validation set with 10 samples and a test set with 20 samples. We report the average DSC on the 3 classes mentioned above.
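For reference, the FFB of Eqs. (6)-(7) and one decoding level of Eqs. (8)-(9) in Sections 3.4-3.5 above translate almost directly into code. In the sketch below, generic transformer layers replace Swin blocks, PixelShuffle is an assumed realisation of patch expanding, and the additive injection of the skip feature is our guess at how STB(u_k, m_k) combines its inputs.

```python
# Sketch of the Feature Fusion Block (Eqs. (6)-(7)) and of one decoder level (Eqs. (8)-(9)).
import torch
import torch.nn as nn

class FeatureFusionBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv2d(3 * c, c, 1)                                  # plain conv branch of Eq. (7)
        self.cbr = nn.Sequential(nn.Conv2d(3 * c, c, 3, padding=1),
                                 nn.BatchNorm2d(c), nn.ReLU(inplace=True))  # CBR branch of Eq. (7)

    def forward(self, s, e, a):
        m1 = torch.cat([s, e, a], dim=1)                                    # Eq. (6): (B, 3c, h, w)
        return self.conv(m1) + self.cbr(m1)                                 # m_i: (B, c, h, w)

class DecoderLevel(nn.Module):
    def __init__(self, c, heads):
        super().__init__()
        self.stb1 = nn.TransformerEncoderLayer(c, heads, batch_first=True)  # stand-ins for Swin blocks
        self.stb2 = nn.TransformerEncoderLayer(c, heads, batch_first=True)
        self.expand = nn.Sequential(nn.Conv2d(c, 2 * c, 1), nn.PixelShuffle(2))  # assumed patch expanding

    def forward(self, u, m):
        # u: decoder tokens (B, h*w, c); m: FCM skip feature (B, c, h, w)
        b, c, h, w = m.shape
        tokens = u + m.flatten(2).transpose(1, 2)        # inject the skip feature (fusion form assumed)
        v = self.stb2(self.stb1(tokens))                 # Eq. (8)
        v_map = v.transpose(1, 2).reshape(b, c, h, w)
        u_prev = self.expand(v_map)                      # Eq. (9): (B, c/2, 2h, 2w)
        return u_prev.flatten(2).transpose(1, 2)

m = FeatureFusionBlock(96)(torch.randn(1, 96, 28, 28),
                           torch.randn(1, 96, 28, 28),
                           torch.randn(1, 96, 28, 28))
u_prev = DecoderLevel(96, 6)(torch.randn(1, 28 * 28, 96), m)
print(m.shape, u_prev.shape)   # (1, 96, 28, 28) and (1, 3136, 48)
```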


4.2. Implementation details

Our CTC–Net was implemented using Python 3.8 and PyTorch 1.7.1. All experiments were conducted on an Intel i9 PC with an Nvidia GTX 3090 with 24 GB of memory. We used the pre-trained weights of the Swin Transformer on ImageNet to initialize the Transformer encoder and decoder of our CTC–Net, and adopted a pre-trained ResNet34 to initialize the parameters of our CNN encoder. The batch size is set to 24, the maximum iteration number is set to 13,950, and the optimizer is SGD with a basic learning rate of 0.01, momentum of 0.99 and weight decay of 3e-5. The decay strategy of the learning rate lr can be described as follows:

lr = base_lr · (1 − iter_num / max_iterations)^0.9,     (10)

where base_lr is the basic learning rate, max_iterations is the maximum iteration number, and iter_num is the iteration index.

The overall loss of our model is defined as the weighted sum of a cross entropy loss and a Dice loss. The two loss functions and the weight ratio between them can be described as follows:

L = (1 − α) L_ce + α L_dice,     (11)

where L_ce denotes the cross entropy loss, L_dice stands for the Dice loss, and α is a relative importance weight empirically set to 0.6.

Human organs often have very smooth surfaces. To prevent the output results from being noisy, we apply a post-processing step to the segmentation results of our CTC–Net. There are several post-processing methods that can be adopted for removing noise, such as morphological operators and median filtering. For the sake of simplicity and computational efficiency, we use median filtering to obtain smoother results. Subsequent experiments also validate that the results processed by median filtering are more accurate than the original results of our network. The reason may be that human organs have inherently smooth surfaces.

The two evaluation metrics are the average Dice Similarity Coefficient (DSC) and the average Hausdorff Distance (HD). They both indicate the similarity between a predicted segmentation and its ground truth. DSC is used to evaluate the overlapping degree between a segmentation prediction P and its corresponding ground truth G, and HD measures the overlapping quality of segmentation boundaries. The two metrics are defined as follows:

DSC = 2|P ∩ G| / (|P| + |G|),                 (12)
HD(P, G) = max[D(P, G), D(G, P)],             (13)
D(P, G) = max_{p∈P} min_{g∈G} ||p − g||,      (14)

where ∩ stands for the intersection operator of two sets, p and g are coordinate vectors of two pixels, |S| is the pixel number of a set S, ||v|| is the l2 norm of a vector v, and P and G denote the coordinate sets of the segmentation prediction and the ground truth, respectively. A larger DSC or a smaller HD means a better segmentation.

Other detailed parameter settings of our CTC–Net are summarized in Table 1. In Table 1, Depth_encoder and Depth_decoder denote the depth of each Swin Transformer layer in the Transformer encoder and decoder, Num_heads stands for the number of attention heads in the Transformer encoder and decoder, and Num_heads_FCM is the number of attention heads in the FCM.

Table 1
Network configuration of CTC–Net.

Parameters       Level 1    Level 2    Level 3    Level 4
Input size       224 × 224
Resolution       56 × 56    28 × 28    14 × 14    7 × 7
Depth_encoder    2          2          18         2
Depth_decoder    1          2          2          2
Num_heads        3          6          12         24
Num_heads_FCM    3          6          12         N/A

4.3. Experiments on Synapse

Table 2 lists the results of our method and twenty state-of-the-art semantic segmentation models on the Synapse dataset. We compared our CTC–Net with these 20 methods, including pure Transformer based models, pure CNN based ones, and CNN and Transformer combined ones. We used the average Dice Similarity Coefficient (DSC) as our main evaluation metric.

According to Table 2, the experimental results demonstrate that our proposed CTC–Net achieves the highest average DSC of 78.41% among all the compared methods. Compared with any of the above models, our CTC–Net outperforms it on at least half of the eight categories. All compared methods achieve comparatively low accuracies on pancreas segmentation due to the particularly large deformation of the pancreas from case to case and its blurred boundary. Thanks to the efficient combination of local details and global interactions in a cross-domain manner, our CTC–Net produces the best results on the pancreas among all the methods. As for Kidney(R) and Kidney(L), our CTC–Net achieves the highest DSC and the second highest DSC among the twenty-one methods, respectively. These existing methods have achieved state-of-the-art results, so this validates that our method is powerful.

For the sake of simplicity, Table 3 presents the average Hausdorff Distances (HDs) achieved by some of the excellent models in Table 2. Experimental results show that our CTC–Net achieves accurate segmentations on both large organs and long, narrow ones, such as the kidney and pancreas. Its CNN and Transformer encoders provide complementary features that are helpful for medical image segmentation. Our FCM can effectively fuse these two kinds of features at different scales. Due to the effective fusion of complementary features, our CTC–Net has a powerful ability to extract robust feature representations for large, narrow or long-shaped organs, such as the kidney, liver and pancreas. The long-range dependency of the Transformer enables our network to segment large organs well, such as the kidney and liver. The local details provided by CNNs make our network produce more accurate boundaries. The combination of CNN and Transformer features boosts the overall segmentation performance. Although our method only surpasses TransUNet by 1% in terms of DSC, it improves the HD metric by nearly 10 and is well ahead of the other models. According to Table 3, our CTC–Net obtains the best result among them. Furthermore, we even achieve an HD of 19.19 in our ablation studies, which is much better than our current architecture. Fig. 4 shows the visual comparisons between the results of our CTC–Net and the compared methods. Our method achieves very pleasing segmentation results.

4.4. Experiments on ACDC

To evaluate the generalization and robustness of our CNN and Transformer complementary model, we also performed related experiments on the ACDC dataset, which contains entirely different modalities and body parts. Table 4 presents the average DSC metric, and our method also achieves the highest average DSC on the ACDC dataset among these state-of-the-art methods.
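For completeness, the polynomial learning-rate decay of Eq. (10) and the weighted loss of Eq. (11) from Section 4.2 can be written in a few lines. The soft Dice formulation and the class-wise averaging below are our assumptions, since the paper does not spell them out.

```python
# Sketch of the training objective and learning-rate schedule of Section 4.2.
import torch
import torch.nn.functional as F

BASE_LR, MAX_ITERS, ALPHA = 0.01, 13950, 0.6   # values reported in the paper

def poly_lr(iter_num: int) -> float:
    """Eq. (10): lr = base_lr * (1 - iter_num / max_iterations) ** 0.9."""
    return BASE_LR * (1.0 - iter_num / MAX_ITERS) ** 0.9

def soft_dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """One-minus-Dice averaged over classes; logits: (B, N, H, W), target: (B, H, W)."""
    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def total_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Eq. (11): L = (1 - alpha) * L_ce + alpha * L_dice, with alpha = 0.6."""
    return (1.0 - ALPHA) * F.cross_entropy(logits, target) + ALPHA * soft_dice_loss(logits, target)

logits, target = torch.randn(2, 9, 224, 224), torch.randint(0, 9, (2, 224, 224))
print(round(poly_lr(0), 4), round(poly_lr(6975), 4), total_loss(logits, target).item())
```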


Table 2
Experiments on Synapse (mean Dice Similarity Coefficients in%).

Methods Average Aorta Gallbladder Kidney(L) Kidney(R) Liver Pancreas Spleen Stomach

TransClaw U-Net [43] 78.09 85.87 61.38 84.83 79.36 94.28 57.65 87.74 73.55
R50 U-Net [3] 74.68 87.74 63.66 80.60 78.19 93.74 56.90 85.87 74.16
U-Net [3] 76.85 89.07 69.72 77.77 68.60 93.43 53.98 86.67 75.58
DARR [44] 69.77 74.74 53.77 72.31 73.24 94.08 54.18 89.90 45.96
VNet [45] 68.81 75.34 51.87 77.10 80.75 87.84 40.04 80.56 56.98
ENet [46] 77.63 85.13 64.91 81.10 77.26 93.37 57.83 87.03 74.41
Att-UNet [25] 77.77 89.55 68.88 77.98 71.11 93.57 58.04 87.30 75.75
R50-DeeplabV3+ [47] 75.73 86.18 60.42 81.18 75.27 92.86 51.06 88.69 70.19
ContextNet [48] 71.17 79.92 51.17 77.58 72.04 91.74 43.78 86.65 66.51
FSSNet [49] 74.59 82.87 64.06 78.03 69.63 92.52 53.10 85.65 70.86
R50 Att-Unet [34] 75.57 55.92 63.91 79.20 72.71 93.56 49.37 87.19 74.95
DABNet [50] 74.91 85.01 56.89 77.84 72.45 93.05 54.39 88.23 71.45
EDANet [51] 75.43 84.35 62.31 76.16 71.65 93.20 53.19 85.47 77.12
FPENet [52] 68.67 78.98 56.35 74.54 64.36 90.86 40.60 78.30 65.35
FastSCNN [53] 70.53 77.79 55.96 73.61 67.38 91.68 44.54 84.51 68.76
VIT None [9] 61.50 44.38 39.59 67.46 62.94 89.21 43.14 75.45 68.78
VIT CUP [9] 67.86 70.19 45.10 74.70 67.40 91.32 42.00 81.75 70.44
R50 VIT CUP [9] 71.29 73.73 55.13 75.80 72.20 91.51 45.99 81.99 73.95
CGNET [54] 75.08 83.48 65.32 77.91 72.04 91.92 57.37 85.47 77.12
TransUNet [40] 77.48 87.23 63.16 81.87 77.02 94.08 55.86 85.08 75.62
CTC–Net(Ours) 78.41 86.46 63.53 83.71 80.79 93.78 59.73 86.87 72.39

Fig. 4. The visual comparison of different methods on the Synapse dataset.
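The DSC and HD values reported in the tables, together with the median-filter post-processing of Section 4.2 applied before visualization, can be sketched with standard NumPy/SciPy tools as below; the 3 × 3 filter size and the per-class binarization are assumptions not stated in the paper.

```python
# Sketch of the evaluation metrics of Eqs. (12)-(14) and of the median-filter post-processing.
import numpy as np
from scipy.ndimage import median_filter
from scipy.spatial.distance import directed_hausdorff

def dsc(pred: np.ndarray, gt: np.ndarray) -> float:
    """Eq. (12): 2|P ∩ G| / (|P| + |G|) for binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def hausdorff(pred: np.ndarray, gt: np.ndarray) -> float:
    """Eqs. (13)-(14): symmetric Hausdorff distance between mask coordinate sets."""
    p, g = np.argwhere(pred), np.argwhere(gt)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])

def smooth_labels(label_map: np.ndarray, size: int = 3) -> np.ndarray:
    """Median filtering of the predicted label map, as used for post-processing."""
    return median_filter(label_map, size=size)

pred = smooth_labels(np.random.randint(0, 9, (224, 224)))
gt = np.random.randint(0, 9, (224, 224))
scores = [dsc(pred == c, gt == c) for c in range(1, 9)]   # per-class DSC, background excluded
print(np.mean(scores), hausdorff(pred == 1, gt == 1))
```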


Table 3
Experiments on the Synapse dataset (mean HD).

Methods              HD↓
R50 U-Net [3]        36.87
U-Net [3]            39.70
Att-UNet [25]        36.02
R50 Att-Unet [34]    36.97
R50 VIT CUP [9]      32.87
TransUNet [40]       31.69
CTC–Net (Ours)       22.52

Table 4
Experiments on ACDC (mean DSC in %).

Methods              Average   RV      MYO     LV
R50 U-Net [3]        87.55     87.10   80.63   94.92
R50 Att-Unet [34]    86.75     87.58   79.20   93.47
VIT CUP [9]          81.45     81.46   70.71   92.18
R50 VIT CUP [9]      87.57     86.07   81.88   94.75
TransUNet [40]       89.71     88.86   84.54   95.73
SwinUNet [6]         90.00     88.55   85.62   95.83
CTC–Net (Ours)       90.77     90.09   85.52   96.72

In addition, our method surpasses all the compared methods on the two classes of right ventricle (RV) and left ventricle (LV). As for the category of myocardium (MYO), our method obtains the second highest DSC of 85.52%. These experiments effectively validate that our method outperforms the state-of-the-art methods of medical image segmentation.

4.5. Ablation studies

In order to prove the rationality of our CTC–Net and the interpretability of each module, we conducted ablation studies on the Synapse dataset.

4.5.1. Evaluation of FCM

To validate the effectiveness of our FCM, we obtain several variants of our CTC–Net by replacing it with other existing modules or removing certain key blocks of the FCM.

One straightforward idea is to simply concatenate the two feature maps from the CNN and Transformer encoders, and adopt 1 × 1 convolutions to fuse them. In this way, we obtain the first variant of our CTC–Net for validating the importance of our FCM, denoted by "concat+conv".

Inspired by the success of [40], we replace the proposed FCM with a transformer decoder that performs cross attention between the features from the CNN and Transformer encoders. In one aspect, the query matrix of the cross attention is from the CNN encoder, and the matrices of the key and value pair are generated based on the Transformer encoder. In the other aspect, the query matrix and the matrices of the key and value pair are reversely from the Transformer and CNN encoders, respectively. The final output is obtained by convolving the outputs of the two aspects. Thus, the second variant is generated, and it is named "cross attention".

The purpose of the Channel Attention Block (CAB) is to emphasize channel related information for improving feature robustness. To validate the importance of the CAB on the CNN branch, we add channel attention blocks in both CNN and Transformer paths to produce the third variant of our network, named "Dual CAB". It seems reasonable that we would achieve better results by adding channel attention in both CNN and Transformer branches. However, enabling channel attention in both branches does not produce better results, as shown in Table 5.

In addition, we also performed three conventional ablation experiments to observe the performance of some blocks in our FCM. The fourth variant of our method is obtained by deleting the Channel Attention Block (CAB), and it is named "without CAB". We remove the Cross-domain Fusion Block (CFB) from the FCM to produce the fifth variant of our method, named "without CFB". Furthermore, we remove the Correlation Enhancement Block (CEB) from our FCM to verify its effectiveness, named "without CEB".

The experimental results of the ablation studies are listed in Table 5. According to the average DSCs, our method outperforms the five variants by very large margins. In addition, our method achieves the highest DSCs on five categories among all variants. Although the second variant uses a cross attention method for cross-wisely fusing features from Transformers and CNNs, its DSC is far lower than that of our method. This proves that the FCM plays a very key role in our network.

4.5.2. Evaluation of encoders

Our CTC–Net has two important encoders, which are the Transformer and CNN ones. The Transformer encoder is the major branch for extracting long-range dependencies. The CNN encoder serves as an auxiliary branch to compensate for contextual features and spatial details. To explore the importance of the CNN encoder, we remove it to obtain a variant of our method, which is a pure Transformer architecture. The variant has an encoder and a decoder, and both of them are composed of Swin Transformer blocks. The experimental results are shown in Table 6. Our method achieves an average DSC of 78.41%, which is significantly better than the 76.38% of the variant. In addition, our method also obtains better results on six out of eight categories than the variant. This proves that the CNN encoder is a key branch in our network.

4.5.3. Evaluation of decoders

Most existing U-shaped networks have approximately symmetric structures. We know by intuition that symmetric networks may achieve excellent results. However, our CTC–Net has two encoders but one decoder, so it is obviously asymmetric. Why not design two decoders for our method?

To answer the above question, we need to evaluate the symmetric variant of our method with two decoders, denoted by "CTC–Net with two decoders". We add a traditional CNN decoder widely used in U-shaped networks to our CTC–Net. The CNN decoder directly accepts feature maps from the CNN encoder of the original CTC–Net as the input, and adopts de-convolutions for gradually up-sampling the feature maps in a learnable manner.
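The "concat+conv" and "cross attention" replacements of the FCM described in Section 4.5.1 correspond to standard fusion operators. The sketch below shows one possible form of each, with nn.MultiheadAttention standing in for the transformer-decoder cross attention and all shapes and names chosen by us for illustration only.

```python
# Sketch of the two FCM replacements used in the ablation study.
import torch
import torch.nn as nn

class ConcatConvFusion(nn.Module):
    """Variant 1: concatenate CNN and Transformer features, fuse with a 1 x 1 convolution."""
    def __init__(self, c):
        super().__init__()
        self.fuse = nn.Conv2d(2 * c, c, kernel_size=1)

    def forward(self, f, g_map):              # both (B, c, h, w)
        return self.fuse(torch.cat([f, g_map], dim=1))

class CrossAttentionFusion(nn.Module):
    """Variant 2: queries from one domain attend to keys/values of the other, in both directions."""
    def __init__(self, c, heads):
        super().__init__()
        self.cnn_to_trans = nn.MultiheadAttention(c, heads, batch_first=True)
        self.trans_to_cnn = nn.MultiheadAttention(c, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * c, c, kernel_size=1)

    def forward(self, f, g):                  # f: (B, c, h, w), g: (B, h*w, c)
        b, c, h, w = f.shape
        f_tok = f.flatten(2).transpose(1, 2)              # (B, h*w, c)
        a1, _ = self.cnn_to_trans(f_tok, g, g)            # Q from CNN, K/V from Transformer
        a2, _ = self.trans_to_cnn(g, f_tok, f_tok)        # Q from Transformer, K/V from CNN
        to_map = lambda t: t.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([to_map(a1), to_map(a2)], dim=1))

f, g = torch.randn(1, 96, 28, 28), torch.randn(1, 28 * 28, 96)
print(ConcatConvFusion(96)(f, g.transpose(1, 2).reshape(1, 96, 28, 28)).shape,
      CrossAttentionFusion(96, 6)(f, g).shape)   # both torch.Size([1, 96, 28, 28])
```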

Table 5
Ablation experiment on FCM (mean Dice Similarity Coefficients in%).

Variants Average Aorta Gallbladder Kidney(L) Kidney(R) Liver Pancreas Spleen Stomach

concat+conv 75.52 85.58 60.46 78.86 73.88 93.23 51.24 86.68 74.25
cross attention 72.65 83.56 55.10 81.67 68.66 92.22 44.09 87.17 68.76
Dual CAB 72.70 82.78 53.78 76.86 69.08 91.79 51.68 85.15 70.48
without CAB 76.87 85.36 62.60 79.87 77.66 93.19 54.96 88.59 72.77
without CFB 75.83 85.91 61.11 85.76 79.57 93.51 48.17 86.67 65.99
without CEB 75.13 83.46 60.38 82.40 73.27 92.61 53.49 85.62 69.84
CTC–Net (ours) 78.41 86.46 63.53 83.71 80.79 93.78 59.73 86.87 72.39


Table 6
Ablation experiments for evaluating the importance of the CNN encoder (mean Dice Similarity Coefficients in%).

Variants Average Aorta Gallbladder Kidney(L) Kidney(R) Liver Pancreas Spleen Stomach

CTC–Net without CNNs 76.38 83.54 63.93 80.73 76.98 93.27 55.71 84.54 72.32
CTC–Net (ours) 78.41 86.46 63.53 83.71 80.79 93.78 59.73 86.87 72.39

Table 7
Ablation experiments for evaluating decoders (mean Dice Similarity Coefficients in%).

Variants Average Aorta Gallbladder Kidney(L) Kidney(R) Liver Pancreas Spleen Stomach

CTC–Net with two decoders 69.68 73.81 56.85 73.71 66.74 89.55 47.44 83.02 66.35
CTC–Net with cross attention 76.73 85.46 60.36 83.91 77.41 93.23 52.96 86.35 74.39
CTC–Net 78.41 86.46 63.53 83.71 80.79 93.78 59.73 86.87 72.39

To improve spatial details, we also use skip connections between different levels of the encoders and decoders. The variant produces two outputs by the Transformer and CNN decoders. Finally, we fuse the two outputs of the variant. As shown in Table 7, the symmetric variant cannot achieve better results than our asymmetric CTC–Net with only one decoder. Obviously, the results are contrary to our intuition. There may be two main reasons explaining these results. The first one is that adding a CNN decoder greatly increases the number of network parameters, leading to possible overfitting. The other is that the two decoders recover feature maps independently and lack adequate information interchange.

In addition, we replace the Swin Transformer Block with our self-designed Swin Transformer decoder to produce another variant, named "CTC–Net with cross attention". By the way, the cross attention [40] has achieved significant improvements for image segmentation. The variant differs from our CTC–Net only in the attention calculation. We apply cross attention to the features from the skip connections and the up-sampled features at each up-sampling stage, where the query matrix is from the skip connection and the matrices of the key and value pair are from the up-sampled features. The experimental results are shown in Table 7. The variant with cross attention achieves far better results than the variant with two decoders, but it cannot surpass our CTC–Net.

5. Conclusions

Since the Transformer has achieved great improvements in NLP, it has been widely used in a variety of NLP tasks. In recent years, researchers have tried to exploit Transformer based methods for solving vision tasks, because Transformers have a very powerful representation capability for long-range dependencies. Traditional Convolutional Neural Networks (CNNs) can effectively extract contextual information and spatial details, and CNNs have been widely used for vision tasks. However, it is more difficult for CNNs to extract long-range dependencies than for Transformers. On the contrary, Transformers are not good at extracting spatially related information and maintaining spatial details. To an extent, these two kinds of features by CNNs and Transformers are complementary to each other.

To efficiently compensate for the shortcomings of CNNs and Transformers, we propose a CNN and Transformer Complementary Network (CTC–Net) for medical image segmentation. Our CTC–Net has two encoders, a CNN one and a Transformer one, a feature fusion module, and a Transformer decoder. The CNN encoder is constructed by ResNet34 [17] for extracting features in the CNN domain, while the Transformer one is created by Swin Transformer blocks [11] for capturing long-range dependent features. The feature fusion module uses cross-domain concatenation, feature correlation and dual attention methods to effectively combine these features from the CNN and Transformer domains. The Transformer decoder is created by Swin Transformer blocks for further improving the long-range representation and recovering feature maps to the input size. Experiments show that our method achieves very good results, and it consistently outperforms the state-of-the-art segmentation networks for medical image segmentation. Although our method achieves pleasing segmentation results, its limitation lies in the extraction of boundary details. The main reason may be that the recovery of feature maps starts from the 4× down-sampled feature maps, which already lose detailed spatial information. In the future, we will explore novel networks without down-sampling feature maps for maintaining high resolutions and abundant details.

Declaration of Competing Interest

The authors declare that they have no conflicts of interest regarding this work. We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the submitted work.

Data availability

Data will be made available on request.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (62272308), the Joint Key Fund of the National Natural Science Foundation of China (U2033218), and the Major Project of New Generation Artificial Intelligence for Scientific and Technological Innovation 2030 (2020AAA0109300).

References

[1] X. Fang, P. Yan, Multi-organ segmentation over partially labeled datasets with multi-scale feature abstraction, IEEE Trans. Med. Imaging 39 (11) (2020) 3619–3629.
[2] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.
[3] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, 2015, pp. 234–241.
[4] J. Zhang, Y. Xie, Y. Wang, Y. Xia, Inter-slice context residual learning for 3D medical image segmentation, IEEE Trans. Med. Imaging 40 (2) (2021) 661–672.
[5] Z. Zhou, M.M.R. Siddiquee, N. Tajbakhsh, J. Liang, UNet++: redesigning skip connections to exploit multiscale features in image segmentation, IEEE Trans. Med. Imaging 39 (6) (2020) 1856–1867.
[6] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, M. Wang, Swin-Unet: Unet-like pure transformer for medical image segmentation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021 June 19th to 25th.
features from CNN and Transformer domains. The Transformer de- 25th.


[7] D. Karimi, S.D. Vasylechko, A. Gholipour, Convolution-free medical image segmentation using transformers, in: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 12901, Springer, Cham, 2021.
[8] A. Lin, B. Chen, J. Xu, Z. Zhang, G. Lu, DS-TransUNet: dual Swin Transformer U-Net for medical image segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[9] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, An image is worth 16x16 words: transformers for image recognition at scale, in: 9th International Conference on Learning Representations (ICLR), 2021.
[10] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: European Conference on Computer Vision (ECCV), Springer, 2020, pp. 213–229.
[11] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: hierarchical vision transformer using shifted windows, in: IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992–10002.
[12] Y. Xie, J. Zhang, C. Shen, Y. Xia, CoTr: efficiently bridging CNN and Transformer for 3D medical image segmentation, in: Medical Image Computing and Computer Assisted Intervention (MICCAI), 12903, Springer, Cham, 2021, pp. 171–180.
[13] F. Yu, V. Koltun, Multi-scale context aggregation by dilated convolutions, in: International Conference on Learning Representations (ICLR), 2016.
[14] C. Peng, X. Zhang, G. Yu, G. Luo, J. Sun, Large kernel matters — improve semantic segmentation by global convolutional network, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1743–1751.
[15] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6230–6239.
[16] X. Wang, R. Girshick, A. Gupta, K. He, Non-local neural networks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7794–7803.
[17] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[18] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations (ICLR), 2015.
[19] G. Huang, Z. Liu, L. van der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269.
[20] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, B. Xiao, Deep high-resolution representation learning for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 43 (10) (2021) 3349–3364.
[21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[22] X. Xiao, S. Lian, Z. Luo, S. Li, Weighted Res-UNet for high-quality retina vessel segmentation, in: 9th International Conference on Information Technology in Medicine and Education (ITME), 2018, pp. 327–331.
[23] F. Yuan, L. Zhang, X. Xia, Q. Huang, X. Li, A wave-shaped deep neural network for smoke density estimation, IEEE Trans. Image Process. 29 (2020) 2301–2313.
[24] F. Yuan, L. Zhang, X. Xia, B. Wan, Q. Huang, X. Li, Deep smoke segmentation, Neurocomputing 357 (2019) 248–260.
[25] O. Oktay, J. Schlemper, L.L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N.Y. Hammerla, B. Kainz, Attention U-Net: learning where to look for the pancreas, in: Medical Imaging with Deep Learning (MIDL), 2018.
[26] Ö. Çiçek, A. Abdulkadir, S.S. Lienkamp, T. Brox, O. Ronneberger, 3D U-Net: learning dense volumetric segmentation from sparse annotation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 9901, Springer, Cham, 2016, pp. 424–432.
[27] F. Milletari, N. Navab, S. Ahmadi, V-Net: fully convolutional neural networks for volumetric medical image segmentation, in: Fourth International Conference on 3D Vision (3DV), 2016, pp. 565–571.
[28] F. Yuan, Z. Dong, L. Zhang, X. Xia, J. Shi, Cubic-cross convolutional attention and count prior embedding for smoke segmentation, Pattern Recognit. 131 (2022) 108902.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, Adv. Neural Inf. Process. Syst. 30 (2017).
[30] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jegou, Training data-efficient image transformers & distillation through attention, in: International Conference on Machine Learning (ICML), PMLR, 2021, pp. 10347–10357.
[31] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, A survey on vision Transformer, IEEE Trans. Pattern Anal. Mach. Intell. (2022).
[32] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, A. Vaswani, Bottleneck Transformers for visual recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 16514–16524.
[33] F. Yuan, Y. Zhou, X. Xia, X. Qian, J. Huang, A confidence prior for image dehazing, Pattern Recognit. 119 (2021) 1–16.
[34] J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, D. Rueckert, Attention gated networks: learning to leverage salient regions in medical images, Med. Image Anal. 53 (2019) 197–207.
[35] F. Yuan, L. Zhang, X. Xia, Q. Huang, X. Li, A gated recurrent network with dual classification assistance for smoke semantic segmentation, IEEE Trans. Image Process. 30 (2021) 4409–4422.
[36] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: European Conference on Computer Vision (ECCV), Springer, 2020, pp. 213–229.
[37] J. Valanarasu, P. Oza, I. Hacihaliloglu, V. Patel, Medical Transformer: gated axial-attention for medical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, Cham, 2021, pp. 36–46.
[38] W. Wang, C. Chen, M. Ding, H. Yu, S. Zha, J. Li, TransBTS: multimodal brain tumor segmentation using transformer, in: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, Cham, 2021, pp. 109–119.
[39] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Landman, H. Roth, D. Xu, UNETR: transformers for 3D medical image segmentation, in: IEEE Winter Conference on Applications of Computer Vision (WACV), 2022, pp. 574–584.
[40] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. Yuille, Y. Zhou, TransUNet: transformers make strong encoders for medical image segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[41] Y. Zhang, H. Liu, Q. Hu, TransFuse: fusing transformers and CNNs for medical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, Cham, 2021, pp. 14–24.
[42] S. Woo, J. Park, J. Lee, I. Kweon, CBAM: convolutional block attention module, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
[43] C. Yao, M. Hu, G. Zhai, X. Zhang, TransClaw U-Net: claw U-Net with transformers for medical image segmentation, arXiv preprint arXiv:2107.05188, 2021.
[44] S. Fu, Y. Lu, Y. Wang, Y. Zhou, W. Shen, E. Fishman, A. Yuille, Domain adaptive relational reasoning for 3D multi-organ segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, Cham, 2020.
[45] F. Milletari, N. Navab, S. Ahmadi, V-Net: fully convolutional neural networks for volumetric medical image segmentation, in: Fourth International Conference on 3D Vision (3DV), 2016, pp. 565–571.
[46] A. Paszke, A. Chaurasia, S. Kim, E. Culurciello, ENet: a deep neural network architecture for real-time semantic segmentation, in: 5th International Conference on Learning Representations (ICLR), Toulon, France, 2017.
[47] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. Yuille, DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell. 40 (4) (2018) 834–848.
[48] R. Poudel, U. Bonde, S. Liwicki, C. Zach, ContextNet: exploring context and detail for semantic segmentation in real-time, in: Proceedings of the British Machine Vision Conference (BMVC), 2018, pp. 146.1–146.12.
[49] X. Zhang, Z. Chen, Q.M.J. Wu, L. Cai, D. Lu, X. Li, Fast semantic segmentation for scene perception, IEEE Trans. Ind. Informat. 15 (2) (2019) 1183–1192.
[50] G. Li, J. Kim, DABNet: depth-wise asymmetric bottleneck for real-time semantic segmentation, in: Proceedings of the British Machine Vision Conference (BMVC), Cardiff, Wales, UK, 2019, pp. 186.1–186.12.
[51] S. Lo, H. Hang, S. Chan, J. Lin, Efficient dense modules of asymmetric convolution for real-time semantic segmentation, in: Proceedings of the ACM Multimedia Asia, 2019, pp. 1–6.
[52] M. Liu, H. Yin, Feature pyramid encoding network for real-time semantic segmentation, in: Proceedings of the British Machine Vision Conference (BMVC), Cardiff, Wales, UK, 2019, pp. 203.1–203.13.
[53] R. Poudel, S. Liwicki, R. Cipolla, Fast-SCNN: fast semantic segmentation network, in: Proceedings of the British Machine Vision Conference (BMVC), Cardiff, Wales, UK, 2019, pp. 187.1–187.12.
[54] T. Wu, S. Tang, R. Zhang, J. Cao, Y. Zhang, CGNet: a light-weight context guided network for semantic segmentation, IEEE Trans. Image Process. 30 (2021) 1169–1179.

Feiniu Yuan received his B.Eng. and M.E. degrees in mechanical engineering from Hefei University of Technology, Hefei, China, in 1998 and 2001, respectively, and his Ph.D. degree in pattern recognition and intelligent systems from the University of Science and Technology of China (USTC), Hefei, in 2004. From 2004 to 2006, he worked as a postdoctoral researcher with the State Key Lab of Fire Science, USTC. From 2010 to 2012, he was a Senior Research Fellow with the Singapore Bioimaging Consortium, Agency for Science, Technology and Research, Singapore. He is currently a professor, a Ph.D. supervisor and a vice dean with the College of Information, Mechanical and Electrical Engineering, Shanghai Normal University (SHNU), China. He is also with the Key Innovation Group of Digital Humanities Resource and Research, SHNU, China. He is a senior member of IEEE and CCF. His research interests include deep learning, image segmentation, pattern recognition and 3D modeling.

Zhengxiao Zhang, born in 1997, received his B.Eng. degree in communication engineering from Shanghai Normal University, Shanghai, China, in 2020. He is currently an M.E. candidate with the College of Information, Mechanical and Electrical Engineering, Shanghai Normal University. From 2020 to 2023, he studied in the Laboratory of Artificial Intelligence and Visual Perception at Shanghai Normal University. His research interests include deep learning and medical image segmentation.

Zhijun Fang is a professor and the dean of the School of Computer Science and Technology, Donghua University. He obtained his Ph.D. degree from Shanghai Jiao Tong University and was a visiting scholar at the University of Washington. He is a senior member of IEEE/ACM/CCF/CAAI/CSIG. His current research interests include image and video processing, machine vision, and intelligent data analysis.
