Medical Image Segmentation Using Squeeze-and-Expansion Transformers
Shaohua Li1∗, Xiuchao Sui1, Xiangde Luo2, Xinxing Xu1, Yong Liu1, Rick Goh1
1 Institute of High Performance Computing, A*STAR, Singapore
2 University of Electronic Science and Technology of China, Chengdu, China
{shaohua, xiuchao.sui}@gmail.com, xiangde.luo@std.uestc.edu.cn, {xuxinx, liuyong, gohsm}@ihpc.a-star.edu.sg
arXiv:2105.09511v3 [eess.IV] 2 Jun 2021
∗ Corresponding Author.

Abstract

Medical image segmentation is important for computer-aided diagnosis. Good segmentation demands that the model see the big picture and fine details simultaneously, i.e., learn image features that incorporate large context while keeping high spatial resolutions. To approach this goal, the most widely used methods – U-Net and its variants – extract and fuse multi-scale features. However, the fused features still have small effective receptive fields with a focus on local image cues, limiting their performance. In this work, we propose Segtran, an alternative segmentation framework based on transformers, which have unlimited effective receptive fields even at high feature resolutions. The core of Segtran is a novel Squeeze-and-Expansion transformer: a squeezed attention block regularizes the self-attention of transformers, and an expansion block learns diversified representations. Additionally, we propose a new positional encoding scheme for transformers, imposing a continuity inductive bias for images. Experiments were performed on 2D and 3D medical image segmentation tasks: optic disc/cup segmentation in fundus images (REFUGE'20 challenge), polyp segmentation in colonoscopy images, and brain tumor segmentation in MRI scans (BraTS'19 challenge). Compared with representative existing methods, Segtran consistently achieved the highest segmentation accuracy and exhibited good cross-domain generalization capabilities. The source code of Segtran is released at https://github.com/askerlee/segtran.

1 Introduction

Automated medical image segmentation, i.e., the automated delineation of anatomical structures and other regions of interest (ROIs), is an important step in computer-aided diagnosis; for example, it is used to quantify tissue volumes, extract key quantitative measurements, and localize pathology [Schlemper et al., 2019; Orlando et al., 2020]. Good segmentation demands that the model see the big picture and the fine details at the same time, i.e., learn image features that incorporate large context while keeping high spatial resolutions to output fine-grained segmentation masks. However, these two demands pose a dilemma for CNNs, as CNNs often incorporate larger context at the cost of reduced feature resolution. A good measure of how large an area a model "sees" is the effective receptive field (effective RF) [Luo et al., 2016], i.e., the input areas that have non-negligible impacts on the model output.

Since the advent of U-Net [Ronneberger et al., 2015], it has shown excellent performance across medical image segmentation tasks. A U-Net consists of an encoder and a decoder: the encoder progressively downsamples the features and generates coarse features that focus on contextual patterns, and the decoder progressively upsamples the contextual features and fuses them with fine-grained local visual features. The integration of multi-scale features enlarges the RF of U-Net, accounting for its good performance. However, as the convolutional layers deepen, the impact from far-away pixels decays quickly. As a result, the effective RF of a U-Net is much smaller than its theoretical RF. As shown in Fig. 2, the effective RFs of a standard U-Net and DeepLabV3+ are merely around 90 pixels. This implies that they make decisions mainly based on individual small patches, and have difficulty modeling larger context. However, in many tasks, the heights/widths of the ROIs are greater than 200 pixels, far beyond these effective RFs. Without "seeing the bigger picture", U-Net and other models may be misled by local visual cues and make segmentation errors.

Many improvements of U-Net have been proposed. A few typical variants include: U-Net++ [Zhou et al., 2018] and U-Net 3+ [Huang et al., 2020], in which more complicated skip connections are added to better utilize multi-scale contextual features; attention U-Net [Schlemper et al., 2019], which employs attention gates to focus on foreground regions; 3D U-Net [Çiçek et al., 2016] and V-Net [Milletari et al., 2016], which extend U-Net to 3D images, such as MRI volumes; and Eff-UNet [Baheti et al., 2020], which replaces the encoder of U-Net with a pretrained EfficientNet [Tan and Le, 2019].

Transformers [Vaswani et al., 2017] are increasingly popular in computer vision tasks. A transformer calculates the pairwise interactions ("self-attention") between all input units, combines their features, and generates contextualized features. The contextualization brought by a transformer is analogous to the upsampling path in a U-Net, except that
[Figure 1 diagram omitted; its component labels include the input image, pixel coordinates (x_i, y_j), a CNN backbone encoder, input and output FPNs, learnable sinusoidal positional encoding, spatial flattening of visual features into local features, Squeeze-and-Expansion transformer layers producing contextualized features, reshaping to the original shape, and a segmentation head.]

Figure 1: Segtran architecture. It extracts visual features with a CNN backbone, combines them with positional encodings of the pixel coordinates, and flattens them into a sequence of local feature vectors. The local features are contextualized by a few Squeeze-and-Expansion transformer layers. To increase spatial resolution, an input FPN and an output FPN upsample the features before and after the transformers.
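The data flow described in the Figure 1 caption — backbone features combined with positional encodings, flattened into tokens, contextualized by attention, and reshaped back to a grid — can be sketched in a few lines. This is an illustrative NumPy mock-up only: the toy feature-map shape, the random projection matrices, and the plain single-head self-attention layer are assumptions standing in for the actual Squeeze-and-Expansion transformer and FPNs of the released implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, rng):
    # Generic single-head self-attention with random projections,
    # used here only to check shapes and illustrate contextualization.
    n, c = tokens.shape
    wq, wk, wv = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(c))   # (n, n): every token attends to all tokens
    return attn @ v                        # contextualized features, (n, c)

rng = np.random.default_rng(0)
H, W, C = 12, 12, 32                        # assumed toy feature-map size from the backbone
feat = rng.standard_normal((H, W, C))       # CNN visual features
pos = rng.standard_normal((H, W, C)) * 0.02 # stand-in for learnable positional encodings

tokens = (feat + pos).reshape(H * W, C)     # spatial flattening into a token sequence
ctx = self_attention(tokens, rng)           # transformer-style contextualization
out = ctx.reshape(H, W, C)                  # reshaping to the original spatial shape
print(out.shape)                            # (12, 12, 32)
```

Because the attention matrix couples every pair of the H*W positions, each output feature can depend on the entire input grid, which is the "unlimited effective receptive field" property contrasted with CNNs above.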
[Figure 2 panels omitted: (a) Fundus image, (b) Ground truth, (c) U-Net, (d) DeepLabV3+, (e) Segtran.]

Figure 2: Effective receptive fields of 3 models, indicated by non-negligible gradients in blue blobs and light-colored dots. Gradients are back-propagated from the center of the image. Segtran has non-negligible gradients dispersed across the whole image (light-colored dots). U-Net and DeepLabV3+ have concentrated gradients. Input image: 576 × 576.

it has an unlimited effective receptive field and is good at capturing long-range correlations. Thus, it is natural to adopt transformers for image segmentation. In this work, we present Segtran, an alternative segmentation architecture based on transformers. A straightforward incorporation of transformers into segmentation only yields moderate performance. As transformers were originally designed for Natural Language Processing (NLP) tasks, there are several aspects that could be improved to better suit image applications. To this end, we propose a novel transformer design, the Squeeze-and-Expansion Transformer, in which a squeezed attention block helps regularize the huge attention matrix, and an expansion block learns diversified representations. In addition, we propose a learnable sinusoidal positional encoding that imposes a continuity inductive bias on the transformer. Experiments demonstrate that they lead to improved segmentation performance.

We evaluated Segtran on two 2D medical image segmentation tasks: optic disc/cup segmentation in fundus images of the REFUGE'20 challenge, and polyp segmentation in colonoscopy images. Additionally, we also evaluated it on a 3D image segmentation task: brain tumor segmentation in MRI scans of the BraTS'19 challenge. Segtran has consistently shown better performance than U-Net and its variants (UNet++, UNet3+, PraNet, and nnU-Net), as well as DeepLabV3+ [Chen et al., 2018].

2 Related Work

Our work is largely inspired by DETR [Carion et al., 2020]. DETR uses transformer layers to generate contextualized features that represent objects, and learns a set of object queries to extract the positions and classes of objects in an image. Although DETR has also been explored for panoptic segmentation [Kirillov et al., 2019], it adopts a two-stage approach which is not applicable to medical image segmentation. A follow-up work of DETR, Cell-DETR [Prangemeier et al., 2020], also employs transformers for biomedical image segmentation, but its architecture is just a simplified DETR, lacking novel components like our Squeeze-and-Expansion transformer. Most recently, SETR [Zheng et al., 2021] and TransU-Net [Chen et al., 2021] were released concurrently with or after our paper submission. Both of them employ a Vision Transformer (ViT) [Dosovitskiy et al., 2021] as the encoder to extract image features, which already contain global contextual information. A few convolutional layers are used as the decoder to generate the segmentation mask. In contrast, in Segtran, the transformer layers build global context on top of the local image features extracted from a CNN backbone, and a Feature Pyramid Network (FPN) generates the segmentation mask.

[Murase et al., 2020] extends CNNs with positional encoding channels and evaluates them on segmentation tasks. Mixed results were observed. In contrast, we verified through an ablation study that positional encodings indeed help Segtran to do segmentation to a certain degree.
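To make the positional-encoding idea concrete, here is a minimal sketch of a sinusoidal encoding of 2D pixel coordinates whose frequency vector could be made learnable. The geometric frequency initialization and the concatenation layout are assumptions for illustration (in PyTorch, `freqs` would be an `nn.Parameter`); Segtran's exact parameterization is defined in its released code. Because the code is a smooth function of (x, y), nearby pixels receive similar encodings, which is the continuity inductive bias discussed above.

```python
import numpy as np

def sinusoidal_pe(xs, ys, freqs):
    # xs, ys: (N,) pixel coordinates; freqs: (C/4,) frequencies (learnable in training).
    ax = np.outer(xs, freqs)   # (N, C/4) phases for the x coordinate
    ay = np.outer(ys, freqs)   # (N, C/4) phases for the y coordinate
    return np.concatenate([np.sin(ax), np.cos(ax),
                           np.sin(ay), np.cos(ay)], axis=1)  # (N, C)

C = 32
# Assumed geometric initialization, as in the fixed NLP sinusoidal scheme.
freqs = 1.0 / (100.0 ** (np.arange(C // 4) / (C // 4)))
xs, ys = np.meshgrid(np.arange(12), np.arange(12))
pe = sinusoidal_pe(xs.ravel(), ys.ravel(), freqs)  # one 32-dim code per pixel

# Continuity check: adjacent pixels have closer codes than distant ones.
near = np.linalg.norm(pe[0] - pe[1])    # pixels (0,0) and (1,0)
far = np.linalg.norm(pe[0] - pe[100])   # pixels (0,0) and (4,8)
print(near < far)                       # True
```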
Receptive fields of U-Nets may be enlarged by adding more downsampling layers. However, this increases the number of parameters and adds the risk of overfitting. Another way of increasing receptive fields is using larger stride sizes in the convolutions of the downsampling path. Doing so,