
Medical Image Analysis 91 (2024) 103000

Contents lists available at ScienceDirect

Medical Image Analysis


journal homepage: www.elsevier.com/locate/media

Survey paper

Advances in medical image analysis with vision Transformers: A comprehensive review
Reza Azad a, Amirhossein Kazerouni b, Moein Heidari b, Ehsan Khodapanah Aghdam c, Amirali Molaei d, Yiwei Jia a, Abin Jose a, Rijo Roy a, Dorit Merhof e,f,∗
a Faculty of Electrical Engineering and Information Technology, RWTH Aachen University, Aachen, Germany
b School of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran
c Department of Electrical Engineering, Shahid Beheshti University, Tehran, Iran
d School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
e Faculty of Informatics and Data Science, University of Regensburg, Regensburg, Germany
f Fraunhofer Institute for Digital Medicine MEVIS, Bremen, Germany

ARTICLE INFO

Keywords: Transformers, Medical image analysis, Vision transformers, Deep neural networks

ABSTRACT

The remarkable performance of the Transformer architecture in natural language processing has recently also triggered broad interest in Computer Vision. Among other merits, Transformers are witnessed as capable of learning long-range dependencies and spatial correlations, which is a clear advantage over convolutional neural networks (CNNs), which have been the de facto standard in Computer Vision problems so far. Thus, Transformers have become an integral part of modern medical image analysis. In this review, we provide an encyclopedic review of the applications of Transformers in medical imaging. Specifically, we present a systematic and thorough review of relevant recent Transformer literature for different medical image analysis tasks, including classification, segmentation, detection, registration, synthesis, and clinical report generation. For each of these applications, we investigate the novelty, strengths and weaknesses of the different proposed strategies and develop taxonomies highlighting key properties and contributions. Further, if applicable, we outline current benchmarks on different datasets. Finally, we summarize key challenges and discuss different future research directions. In addition, we have provided cited papers with their corresponding implementations in https://github.com/mindflow-institue/Awesome-Transformer.

1. Introduction

Convolutional neural networks (CNNs) have been an integral part of research in the field of medical image analysis for many years. By virtue of convolutional filters whose primary function is to learn and extract necessary features from medical images, a wealth of research has been dedicated to CNNs, ranging from tumor detection and classification (Arevalo et al., 2016) and detection of skin lesions (Karimijafarbigloo et al., 2023a; Azad et al., 2019) to brain tumor segmentation (Azad et al., 2022e), to name only a few. CNNs have also contributed significantly to the analysis of different imaging modalities in clinical medicine, including X-ray radiography, computed tomography (CT), magnetic resonance imaging (MRI), ultrasound (US), and digital pathology. Despite their outstanding performance, CNNs suffer from conceptual limitations and are innately unable to model explicit long-distance dependencies due to the limited receptive field of convolution kernels. Moreover, the convolutional operator suffers from the fact that at inference time, it applies fixed weights regardless of any changes to the visual input. To mitigate the aforementioned problems, there have been great research efforts to integrate attention mechanisms, which can be regarded as a dynamic weight adjustment process based on input features, into the seminal CNN-based structures to improve their non-local modeling capability (Ramachandran et al., 2019; Bello et al., 2019; Karimijafarbigloo et al., 2023b; Vaswani et al., 2021).
To this end, Wang et al. (2018b) designed a non-local flexible building block, which can be plugged into multiple intermediate convolution layers. SENet (Hu et al., 2018) suggested a channel attention squeeze-and-excitation (SE) block, which collects global information in order to recalibrate each channel accordingly to create a more robust representation. Inspired by this line of research, there has been an overwhelming influx of models with attention variants proposed in the medical imaging field (Al-Shabi et al., 2022; Sang et al., 2021; Yao et al., 2021).

∗ Corresponding author at: Faculty of Informatics and Data Science, University of Regensburg, Regensburg, Germany.
E-mail address: dorit.merhof@ur.de (D. Merhof).

https://doi.org/10.1016/j.media.2023.103000
Received 9 February 2023; Received in revised form 30 September 2023; Accepted 11 October 2023
Available online 19 October 2023
1361-8415/© 2023 Elsevier B.V. All rights reserved.

Fig. 1. Overview of the applications covered in this review.

Although these attention mechanisms allow the modeling of full image contextual information, as the computational complexity of these approaches typically grows quadratically with respect to spatial size, they imply an intensive computational burden, thus making them inefficient in the case of medical images that are dense in pixel resolution (Gonçalves et al., 2022). Moreover, despite the fact that the combination of the attention mechanism with the convolutional operation leads to systematic performance gains, these models inevitably suffer from constraints in learning long-range interactions.
Transformers (Vaswani et al., 2017) have demonstrated exemplary performance on a broad range of natural language processing (NLP) tasks, e.g., machine translation, text classification, and question answering. Inspired by the eminent success of Transformer architectures in the field of NLP, they have become a widely applied technique in modern Computer Vision (CV) models. Since the establishment of Vision Transformers (ViTs) (Dosovitskiy et al., 2020), Transformers proved to be valid alternatives to CNNs in diverse tasks ranging from image recognition (Dosovitskiy et al., 2020), object detection (Zhu et al., 2021), and image segmentation (Chen et al., 2021b; Azad et al., 2023a) to video understanding (Arnab et al., 2021) and image super-resolution (Chen et al., 2023). As a central piece of the Transformer, the self-attention mechanism comes with the ability to model relationships between elements of a sequence, thereby learning long-range interactions. Moreover, Transformers allow for large-scale pre-training for specific downstream tasks and applications and are capable of dealing with variable-length inputs. The immense interest in Transformers has also spurred research into medical imaging applications (see Fig. 1). With Transformers being dominant in reputable top-tier medical imaging conferences and journals, it is extremely challenging for researchers and practitioners to keep up with the rate of innovation. The rapid adoption of Transformers in the medical imaging field necessitates a comprehensive summary and outlook, which is the main scope of this review. Specifically, this review provides a holistic overview of the Transformer models developed for medical imaging and image analysis applications. We provide a taxonomy of the network design, highlight the major strengths and deficiencies of the existing approaches, and introduce the current benchmarks in each task. We inspect several key technologies that arise from the various medical imaging applications, including medical image segmentation, medical image registration, medical image reconstruction, and medical image classification. So far, review papers related to Transformers do not concentrate on applications of Transformers in the medical imaging and image analysis domain (Khan et al., 2022). The few literature reviews that do focus on the medical domain (Shamshad et al., 2023; He et al., 2022b), despite being very comprehensive, do not necessarily discuss the drawbacks and merits of each method. In our work, we explicitly cover this aspect and also provide a taxonomy that comprises the imaging modality, organ of interest, and type of training procedure each paper has selected. More specifically, in Section 3 (Medical Image Classification), we comprehensively elaborate on the most promising networks along with their key ideas, limitations, the number of parameters, and the specific classification task they are addressing. In Section 4 (Medical Image Segmentation), we analyze network architectures in terms of their design choice and propose a detailed taxonomy to categorize each network, providing insight for the reader to understand the current limitations and progress in segmentation networks based on the Transformer architecture. In Section 5 (Medical Image Reconstruction), we take a different perspective and categorize networks based on their network structure and the imaging modality they are built upon. We categorize the synthesis methods in Section 6 based on their objective (intra-modality or inter-modality) and then provide detailed information regarding the network architecture, parameters, motivations, and highlights. In the sections related to detection (Section 7), registration (Section 8), and report generation (Section 9), we review in detail the state-of-the-art (SOTA) networks and provide detailed information regarding the network architectures, advantages, and drawbacks. Moreover, due to the swift development of the field, we believe that the community requires a more recent overview of the literature.
We hope this work will point out new research options, provide a guideline for researchers, and initiate further interest in the vision community to leverage the potential of Transformer models in the medical domain. Our major contributions are as follows:

• We systematically and comprehensively review the applications of Transformers in the medical imaging domain and provide a comparison and analysis of SOTA approaches for each task. Specifically, more than 200 papers are covered in a hierarchical and structured manner.
• Our work provides a taxonomized (Fig. 1), in-depth analysis (e.g. task-specific research progress and limitations), as well as a discussion of various aspects.
• Finally, we discuss challenges and open issues and also identify new trends, raise open questions and identify future directions.

Search Strategy. We conducted a thorough search using DBLP, Google Scholar, and Arxiv Sanity Preserver, utilizing customized search queries that allowed us to obtain lists of scholarly publications. These publications included peer-reviewed journal papers, conference or workshop papers, non-peer-reviewed papers, and preprints.


Fig. 2. Architecture of the Vision Transformer as proposed in Dosovitskiy et al. (2020) and the detailed structure of the Vision Transformer encoder block. In the Vision Transformer,
sequential image patches are used as the input and processed using a transformer encoder to produce the final classification output.

Our search queries consisted of the keywords (transformer* | deep* | medical* | {Task}*), (transformer | medical*), (transformer* | medical* | image* | model*), and (convolution* | vision* | transformer* | medical*), where {Task} refers to one application covered in this review (see Fig. 1). To ensure the selection of relevant papers, we meticulously evaluated their novelty, contribution, and significance, and prioritized those that were the first of their kind in the field of medical imaging. Following these criteria, we chose the papers with the highest rankings for further examination. It is worth noting that our review may have excluded other significant papers in the field, but our goal was to provide a comprehensive overview of the most important and impactful ones.
Paper Organization. The remaining sections of the paper are organized as follows. In Section 2, we provide an overview of the key components of the well-established Transformer architecture and its clinical importance. Moreover, this section clarifies the categorization of neural network variants in terms of the position where the Transformer is located. Sections 3 to 9 comprehensively review the applications of Transformers in diverse medical imaging tasks as depicted in Fig. 1. For each task, we propose a taxonomy to characterize technical innovations and major use cases. Section 10 presents open challenges and future perspectives of the field as a whole, and finally, Section 11 concludes this work.

2. Background

In this section, we first provide an overview of the Transformer module and the key ideas behind its design. Then, we outline a general taxonomy of Transformer-based models, characterized by how they use Transformers, i.e., whether they are purely Transformer-based, or whether the Transformer module is used in the encoder, decoder, bottleneck, or skip connections.

2.1. Transformers

The original Transformer (Vaswani et al., 2017) was first applied to the task of machine translation as a new attention-driven building block. The vanilla Transformer consists of an encoder and a decoder, each of which is a stack of L consecutive identical blocks. The Transformer module is convolution-free and relies solely on the self-attention mechanism, or attention mechanism in short. Specifically, these attention blocks are neural network layers that relate different positions of a single sequence to compute the sequence's representation. Since the establishment of Transformer models, they have attained remarkable performance in diverse natural language processing tasks (Kalyan et al., 2021). Inspired by this, Dosovitskiy et al. proposed the Vision Transformer (ViT) (Dosovitskiy et al., 2020) model as illustrated in Fig. 2. When trained on large datasets, for instance JFT-300M, ViT outperforms the then state-of-the-art, namely ResNet-based models like BiT (Kolesnikov et al., 2020). In their approach, an image is turned into fixed-sized patches before being flattened into vectors. These vectors are then passed through a trainable linear projection layer that maps them into 𝑁 vectors of dimensionality 𝐷, where 𝑁 is the number of patches. The outputs of this stage are referred to as patch embeddings. To preserve the positional information present within each patch, positional embeddings are added to the patch embeddings. In addition, a trainable class embedding is appended to the patch embeddings before going through the Transformer encoder. The Transformer encoder is comprised of multiple Transformer encoder blocks, each containing one multi-head self-attention (MSA) block and an MLP block. The activations are first normalized using LayerNorm (LN) before going into these blocks, and skip connections before the LN add a copy of these activations to the corresponding MSA or MLP block outputs. In the end, an MLP block is used as a classification head that maps the output to class predictions. The self-attention mechanism is a key defining characteristic of Transformer models. Hence, we start by introducing the core principle of the attention mechanism.

2.1.1. Self-attention
In a self-attention layer (Fig. 3(a)), the input vector is first transformed into three separate vectors, i.e., the query vector q, the key vector k, and the value vector v, each with a fixed dimension.


Fig. 3. (a) The process of self-attention. (b) Multi-head attention. The MSA consists of multiple SA blocks (heads) concatenated together channel-wise as proposed in Vaswani
et al. (2017).

These vectors are then packed together into three different weight matrices, namely $W^Q$, $W^K$, and $W^V$. A common form of $Q$, $K$, and $V$ can be formulated as Eq. (1) for an input $X$:

$$K = W^K X, \quad Q = W^Q X, \quad V = W^V X, \tag{1}$$

where $W^K$, $W^Q$, and $W^V$ refer to the learnable parameters. The scaled dot-product attention mechanism is then formulated as

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V, \tag{2}$$

where $\sqrt{d_k}$ is a scaling factor, and a softmax operation is applied to the generated attention weights to translate them into a normalized distribution.

2.1.2. Multi-head self-attention
The multi-head self-attention (MHSA) mechanism (Fig. 3(b)) has been proposed (Vaswani et al., 2017) to model the complex relationships of token entities from different aspects. Specifically, the MHSA block helps the model to jointly attend to information from multiple representation sub-spaces, as the modeling capability of a single-head attention block is quite coarse. The process of MHSA can be formulated as

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{O}, \tag{3}$$

where $\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\right)$, and $W^{O}$ indicates a linear mapping function that combines the multi-head representations. Note that $h$ is a hyper-parameter set to $h = 8$ in the original paper.

2.2. Transformer modes

While the Transformer was originally introduced with an encoder–decoder pipeline, many modern architectures exploit the Transformer architecture in different fashions, which generally depend on the target application. The usage of Transformers in vision tasks can broadly be classified into pure and hybrid designs.

2.2.1. Pure Transformers
Due to the deficiency of CNN-based architectures in learning global and long-range semantic information interactions, which stems from the locality of the convolution operation, a cohort of studies has investigated purely Transformer-based models without any convolution layer. These models usually consist of encoder, bottleneck, decoder, and skip connections directly built upon the ViT or its variants. In such designs, there are usually multiple multi-head self-attention modules in both the encoding and decoding sections, which allow the decoder to utilize information from the encoder. Examples of such methods are the Swin-Unet (Cao et al., 2022) and TransDeepLab (Azad et al., 2022d) networks which, as their names suggest, model the seminal U-Net (Ronneberger et al., 2015) and DeepLab (Chen et al., 2018) architectures.

2.2.2. Transformer: Hybrid
The hybrid Transformer models usually modify the base CNN structure by replacing the encoder or decoder modules.
Encoder: Encoder-only models such as the seminal BERT (Devlin et al., 2018) are designed to make a single prediction per input or a single prediction for an entire input sequence. In the computer vision era, these models are applicable for classification tasks. Moreover, as utilizing a pure Transformer can result in limited localization capacity stemming from inadequate low-level features, many studies try to combine CNN and Transformer in the encoding section (Chen et al., 2021b). Such a design can enhance finer details by recovering localized spatial information.
Decoder: Transformers can also be used in a decoding fashion. Such a causal model is typically used for generation tasks such as language modeling. Besides that, the modification can apply to the skip connections of the decoder module. The skip connection is a widely-used technique to improve the performance and the convergence of deep neural networks. It can also serve as a modulating mechanism between the encoder and the decoder. To effectively provide low-level spatial information for the decoding path, the idea of exploiting Transformers in designing skip connections has emerged. This notable idea can lead to finer feature fusion and recalibration while guaranteeing the aggregation of both high-level and low-level features (Chen et al., 2021b; Heidari et al., 2023).

2.3. Clinical importance

In practice, medical professionals conduct medical image analysis qualitatively in real-world scenarios. This can lead to different understandings and levels of precision due to variations in the reader's expertise or differences in image quality, and it is also time- and labor-expensive. Therefore, deep learning techniques have gained extensive attention in medical image analysis, aiming to reduce inter-reader variation and decrease the expenses associated with time and workforce (Shamshad et al., 2023).
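To make the preceding formulation concrete, the following is a minimal, self-contained PyTorch sketch of the scaled dot-product attention of Eqs. (1)–(2), the multi-head mechanism of Eq. (3), and the ViT tokenization pipeline (patch embedding, class token, positional embedding, pre-norm encoder blocks) described in Section 2.1. All hyper-parameters (patch size, embedding dimension, depth, number of heads) are illustrative assumptions rather than the settings of any particular cited model.

```python
# Illustrative sketch of Eqs. (1)-(3) and the ViT tokenization pipeline of Section 2.1.
# Hyper-parameters are arbitrary demonstration choices, not those of any cited model.
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # W^Q, W^K, W^V of Eq. (1), fused into one projection for convenience.
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)  # W^O of Eq. (3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                    # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # QK^T / sqrt(d_k), Eq. (2)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)        # concatenate heads, Eq. (3)
        return self.proj(out)


class EncoderBlock(nn.Module):
    """Pre-norm encoder block: LN -> MSA -> residual, then LN -> MLP -> residual."""
    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = MultiHeadSelfAttention(dim, num_heads)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x


class TinyViT(nn.Module):
    """Patchify -> linear projection -> [class] token + positional embedding -> encoder."""
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, num_classes=2):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        self.blocks = nn.Sequential(*[EncoderBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img):
        x = self.to_patches(img).flatten(2).transpose(1, 2)   # (b, N, D) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.blocks(x)
        return self.head(x[:, 0])                             # classify from the class token


if __name__ == "__main__":
    logits = TinyViT()(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 2])
```

In this sketch, the class token plays the role of the trainable class embedding whose final state is passed to the classification head, mirroring the description above.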


Fig. 4. Taxonomy of ViT-based approaches in medical image classification. Methods are categorized based on their proposed architecture into pure and hybrid methods, in which
they adopt the vanilla ViT or present a new type of vision Transformer in medical image classification. Notably, we utilize the prefix numbers in the paper’s name in ascending
order and denote the reference for each study as follows: 1. Gao et al. (2021), 2. Yu et al. (2021), 3. Mondal et al. (2021), 4. Park et al. (2021a), 5. Matsoukas et al. (2021),
6. Gheflati and Rivaz (2022), 7. Shome et al. (2021), 8. Yan et al. (2023), 9. Perera et al. (2021), 10. Bhattacharya et al. (2022), 11. Liu and Yin (2021), 12. Dai et al. (2021),
13. Wang et al. (2021c), 14. Tanzi et al. (2022), 15. Park et al. (2021b), 16. Sun et al. (2021), 17. Li et al. (2021b), 18. Shao et al. (2021), 19. Mehta et al. (2022), 20. Zheng
et al. (2022), 21. Huo et al. (2022), 22. Manzari et al. (2023).

The fundamental question that emerges is: What are the motivations behind using Transformers in the medical domain?
Advances in adversarial attacks create an unavoidable danger in which potential attackers might attempt to gain benefits by exploiting vulnerabilities in the healthcare system. While there is substantial literature available concerning the robustness of CNNs in the field of medical imaging, it is still a challenging direction to explore. Recent studies (Benz et al., 2021) prove that ViTs are more robust to adversarial attacks than CNNs. Specifically, ViT is significantly more robust than CNN in a wide range of white-box attacks. A similar trend is also observed in query-based and transfer-based black-box attacks, which can be attributed to the fact that ViTs are more reliant on low-frequency (robust) features while CNNs are more sensitive to high-frequency features.
Besides, due to strict privacy rules, low occurrence rates of certain diseases, concerns about data ownership, and a limited number of patients, providing suitable data has always been a paramount concern in the medical domain. Recently, the idea of exploiting the inherent structure of ViT in distributed medical imaging applications has emerged. Owing to the inherent structure of ViT (as opposed to CNN), it can be split into a shared body and task-specific heads, which demonstrates that a ViT body with sufficient capacity can be shared across relevant tasks by leveraging different strategies such as multi-task learning (Park et al., 2021a). This way, ViTs can become a paramount component in decentralized medical imaging solutions.
Moreover, Transformers excel at capturing long-range dependencies in data, making them well-suited for analyzing complex medical images where contextual information plays a crucial role. Transformers offer greater interpretability than CNNs, as they provide attention maps that highlight the regions of an image that contribute most to the model's decision-making. This transparency is crucial in medical settings where understanding the reasoning behind a diagnosis is important. The flexible design of ViTs can be used on edge devices to speed up diagnosis, improve clinician workflows, enhance patient privacy, and save time. Below, we provide recent examples of clinical use cases of Transformers in the medical domain.
Med-PaLM 2 (Singhal et al., 2023), a Transformer-based model by Google, demonstrated a feasible comparison between physicians and the application of Transformers in medical question answering. The answers provided by Med-PaLM 2 are quite impressive, and the authors provided a metric to assess model performance by exposing the clinician answers and the model's output to other physicians and lay-persons, who rated them in nine categories such as factuality, medical reasoning capability, etc. As a result, the quantitative results endorse that 72.9% of the time, answers provided by Med-PaLM 2 are preferable with regard to reporting the medical consensus.
Seenivasan et al. (2023) addressed the steep learning curve of knowledge transfer from surgical experts to medical trainees and patients and proposed SurgicalGPT, a Transformer-based multi-modal visual question-answering framework that combines large language models with visual cues. SurgicalGPT demonstrated outperforming results over several (robotic) surgical datasets in terms of accuracy compared to uni-modality text generation models.
Consequently, the clinical use cases of Med-PaLM 2 and SurgicalGPT have demonstrated the feasibility and benefits of employing Transformers in the medical domain, showing their effectiveness and potential applications.

3. Medical image classification

Image classification is still one of the challenging problems in computer vision, which aids in segregating extensive quantities of data into meaningful categories (Aminimehr et al., 2023). Vision Transformers (ViTs) have recently demonstrated outstanding results in various image classification tasks and offer significant advantages over conventional CNNs (Liu et al., 2021d; Xia et al., 2022; Fayyaz et al., 2022; Dong et al., 2022; Li et al., 2022c; Yao et al., 2023). These advantages include long-range relationships, adaptive modeling, and attention maps that yield intuition on what the model deems more important inside an image (Matsoukas et al., 2021). Due to these alluring advantages, there is rising interest in building Transformer-based models for medical image classification, and highly precise classification is becoming increasingly vital for facilitating clinical care.
In this section, we exhaustively examine ViTs in medical image classification. As illustrated in Fig. 4, we have broadly classified these methods based on the role ViT plays in their architecture. These categories include pure Transformers and hybrid models. Generally, a vision Transformer-based classification architecture consists of three modules: (1) a backbone for capturing input features, (2) an encoder for modeling the information, and (3) a classification head for generating output based on the specified task.


Therefore, the Transformer can be adopted in each module. However, some works, including the Lesion Aware Transformer (LAT) (Sun et al., 2021) and the Deformable Transformer for Multi-Instance Learning (DT-MIL) (Li et al., 2021b), take a different approach and utilize encoder–decoder structures. LAT proposes a unified encoder–decoder system for Diabetic Retinopathy (DR) grading, and DT-MIL introduces a Transformer-based encoder–decoder architecture for classifying histopathological images, where a deformable Transformer is embraced for the encoder part. In the following, we will go into great depth on both hybrid and pure models.

3.1. Pure Transformers

Since the emergence of Transformers, there has been a growing debate regarding whether it is time to switch entirely from CNNs to Transformers. Matsoukas et al. (2021) conduct a series of experiments to answer this critical question. They take the ResNet50 (He et al., 2016) and DeiT-S (Touvron et al., 2021) models to represent CNN and ViT models, respectively. They train each of these two models in three different fashions: (a) randomly initialized weights, (b) pre-trained on ImageNet (transfer learning), and (c) pre-training on the target dataset in a self-supervised learning (SSL) scheme using DINO (Caron et al., 2021). Their findings show that when utilizing random initialization, ViTs are inferior to CNNs. In the case of transfer learning, the results are similar for both models, with ViT being superior for two out of three datasets. Additionally, ViT performs better when self-supervision on the target data is applied. They conclude that Vision Transformers, indeed, are suitable replacements for CNNs.
Transformers have had a profound effect on medical development. Researchers have thoroughly investigated adopting the ViT in medical image classification tasks since its introduction. However, the limited number of medical images has hindered Transformers from replicating their success in medical image classification. ViT-BUS (Gheflati and Rivaz, 2022) studies the use of ViTs in medical ultrasound (US) image classification for the first time. They propose to transfer pre-trained ViT models to breast US datasets to compensate for the data-hunger of ViTs. Evaluated results on the B (Yap et al., 2017), BUSI (Al-Dhabyani et al., 2020), and B+BUSI datasets indicate the predominance of attention-based ViT models over CNNs on US datasets. Likewise, COVID-Transformer (Shome et al., 2021) utilizes ViT–L/16 to detect COVID from non-COVID cases based on CXR images. Due to the limitation of sufficient data, they introduce a balanced dataset containing 30K chest X-ray images for multi-class classification and 20K images for binary classification. The published dataset is created by merging existing datasets (Qi et al., 2021; El-Shafai and Abd El-Samie, 2020; Sait et al., 2020). They fine-tune the model on this dataset with a custom MLP block on top of ViT to classify chest X-ray (CXR) images. Moreover, COVID-Transformer exploits the GradCAM map (Selvaraju et al., 2017) to visualize affected lung areas that are significant for disease prediction and progression, in order to display the model's interpretability. Similarly, Mondal et al. (2021) present xViTCOS for detecting COVID-19 in CTs and CXRs. xViTCOS employs a model that has been pre-trained on ImageNet-21k (Deng, 2009). Nevertheless, the training data capacity might overshadow the generalization performance of the pre-trained ViT when transferring the knowledge from the learned domain to the target domain. By training the model on the COVIDx-CT-2A dataset (Gunraj, 2021), a moderately-sized dataset, xViTCOS overcomes this problem. However, due to the insufficient amount of CXR images, the pre-trained ViT model is fine-tuned using the CheXpert dataset (Irvin et al., 2019). In addition, xViTCOS leverages the Gradient Attention Rollout algorithm (Chefer et al., 2021) to visually demonstrate the model's prediction on the input image for clinically interpretable and explainable visualization. In experiments using COVIDx-CT-2A and their custom-collected chest X-ray dataset, xViTCOS significantly outperforms conventional COVID-19 detection approaches. MIL-VT (Yu et al., 2021) similarly suggests pre-training the Transformer on a large fundus image dataset beforehand, initialized by the pre-trained weights of ImageNet, and then fine-tuning it on the downstream retinal disease classification task in order to encourage the model to learn global information and achieve generalization. Unlike previous approaches, they apply some modifications to the vanilla ViT structure. In the classic ViT, the embedded features are neglected for classification; instead, only the class token, which retains a summarization of the embedded features' information, is used. Yu et al. (2021) propose a novel multiple-instance learning (MIL) head module to exploit those embedded features to complement the class token. This head comprises three submodules that attach to the ViT in a plug-and-play manner: (1) the MIL embedding submodule that maps the feature embeddings to a low-dimensional embedding vector, (2) the attention aggregation submodule that outputs a spatial weight matrix for the low-dimensional patch embeddings; this weight matrix is then applied to the low-dimensional embeddings to ascertain each instance's importance, and (3) the MIL classifier submodule that determines the probability of each class through the aggregated features. In the downstream task, both the MLP and MIL heads use the weighted cross-entropy loss function for training. The outputs of both heads are then weight-averaged at inference time. Results indicate the effectiveness of the proposed training strategy and the MIL-head module by dramatically boosting the performance over the APTOS2019 (Asia Pacific Tele-Ophthalmology Society, 2019) and RFMiD2020 (Pachade et al., 2021) datasets when compared to CNN-based baselines. In contrast to the previous 2D-based methods that employ transfer learning, COVID-ViT (Gao et al., 2021) proposes training ViT to classify COVID and non-COVID cases using 3D CT lung images. Given that a COVID volume may contain non-COVID 2D slices, COVID-ViT applies a slice voting mechanism after the ViT classification, in which the subject is categorized as having COVID if more than a certain percentage of slices (e.g., 25%) are predicted to be COVID. The findings reported for the MIA-COVID19 competition (Kollias et al., 2021) confirm that ViT outperforms CNN-based approaches such as DenseNet (Iandola et al., 2014) in identifying COVID from CT images.
Besides the remarkable accuracy of Transformers compared to CNNs, one of their major drawbacks is their high computational cost, thereby making them less effective for real-world applications, such as detecting COVID-19 in real time. In light of the prevalence of COVID-19, rapid diagnosis is beneficial for starting the proper course of medical treatment. CXR and lung CT scans are the most common imaging techniques employed. However, CT imaging is a time-consuming process, and using CXR images is unreliable in identifying COVID-19 at an early stage. In addition, vision Transformers are computationally expensive to deploy on mobile devices for real-time COVID-19 classification. Therefore, Perera et al. (2021) present a lightweight Point-of-Care Transformer (POCFormer). The compactness of POCFormer allows for real-time diagnosis of COVID-19 utilizing commercially accessible POC ultrasound devices. POCFormer reduces the complexity of the vanilla ViT self-attention mechanism from quadratic to linear using Linformer (Wang et al., 2020b). The results display the superiority of POCFormer in the real-time detection of COVID-19 over the CNN-based SOTAs on the POCUS dataset (Born et al., 2020).
In addition, despite the great potential shown by ViTs in ImageNet classification, their performance is still lower than the latest SOTA CNNs without additional data. These Transformers mainly focus on a coarse level by adopting a self-attention mechanism to establish global dependency between input tokens. However, relying only on a coarse level restricts the Transformer's ability to achieve higher performance. Thus, Liu and Yin (2021) leverage a pre-trained version of VOLO for X-ray COVID-19 classification. VOLO first encodes fine-level information into the token representations through the proposed outlook attention, alleviating the limitation of Transformers that require a large amount of data for training, and then aggregates the global features via self-attention at the coarse level.
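As a concrete illustration of the MIL head that MIL-VT attaches to the ViT patch embeddings (the MIL embedding, attention aggregation, and MIL classifier submodules, with the MIL-head and class-token-head outputs fused at inference), the following is a minimal sketch. The layer sizes, the form of the scoring network, and the 0.5/0.5 fusion weight are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of an MIL-VT-style MIL head on top of ViT patch embeddings
# (Yu et al., 2021): embed -> attention-based aggregation -> classify, with the
# MIL-head and class-token-head outputs averaged at inference. Dimensions and
# the 0.5/0.5 weighting are illustrative assumptions.
import torch
import torch.nn as nn


class MILHead(nn.Module):
    def __init__(self, dim: int = 768, mil_dim: int = 128, num_classes: int = 5):
        super().__init__()
        self.embed = nn.Linear(dim, mil_dim)               # MIL embedding submodule
        self.score = nn.Sequential(                        # attention aggregation submodule
            nn.Linear(mil_dim, mil_dim), nn.Tanh(), nn.Linear(mil_dim, 1))
        self.classifier = nn.Linear(mil_dim, num_classes)  # MIL classifier submodule

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        z = self.embed(patch_tokens)                       # (b, N, mil_dim)
        w = self.score(z).softmax(dim=1)                   # instance importance weights (b, N, 1)
        bag = (w * z).sum(dim=1)                           # weighted aggregation over instances
        return self.classifier(bag)


def fused_prediction(cls_logits, mil_logits, alpha: float = 0.5):
    """Weight-average the class-token head and the MIL head at inference time."""
    return alpha * cls_logits + (1.0 - alpha) * mil_logits


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 768)    # ViT patch embeddings (class token excluded)
    cls_logits = torch.randn(2, 5)       # output of the standard MLP head
    print(fused_prediction(cls_logits, MILHead()(tokens)).shape)  # torch.Size([2, 5])
```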


Fig. 5. Overview of the FESTA framework (Park et al., 2021a), which utilizes ViT for multi-task learning of COVID-19 CXR classification, detection, and segmentation. (a) FESTA leverages ViT's decomposable modular design to train heads and tails via clients while sharing the server-side Transformer body between clients to integrate the retrieved features. Final predictions are then derived by feeding the embedded features to their task-specific tails on the client side. (b) illustrates the single-task learning scheme, and (c) the two-step multi-task learning scheme. The former is trained for 12,000 rounds, while the latter undergoes two training steps: first, all parts are trained jointly for 6000 rounds; then, freezing the weights of the Transformer body, the heads and tails are fine-tuned for 6000 steps on the desired specific task.
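The following is a minimal sketch of the split forward pass depicted in Fig. 5: a task-specific head and tail reside on the client, while a shared, task-agnostic Transformer body resides on the server. Module shapes are toy assumptions, and the client–server communication, weight aggregation, and federated training loop of the actual FESTA framework are omitted.

```python
# Minimal sketch of a FESTA-style split forward pass (Fig. 5): a task-specific
# head and tail live on the client, while a shared Transformer body lives on the
# server. Shapes and modules are toy assumptions; communication, client
# aggregation, and the federated training loop are omitted.
import torch
import torch.nn as nn

embed_dim, n_classes = 256, 2

# Client side: task-specific head (feature extractor) and tail (predictor).
client_head = nn.Sequential(nn.Conv2d(1, embed_dim, kernel_size=16, stride=16),
                            nn.Flatten(2))                 # CXR -> token features
client_tail = nn.Linear(embed_dim, n_classes)              # classification tail

# Server side: shared, task-agnostic Transformer body used by all clients.
server_body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=4)


def festa_forward(cxr_batch: torch.Tensor) -> torch.Tensor:
    tokens = client_head(cxr_batch).transpose(1, 2)   # (b, N, D), sent to the server
    shared = server_body(tokens)                      # processed by the shared body
    return client_tail(shared.mean(dim=1))            # returned to the client-side tail


if __name__ == "__main__":
    print(festa_forward(torch.randn(2, 1, 224, 224)).shape)  # torch.Size([2, 2])
```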

Through the outlook attention mechanism, VOLO dynamically combines fine-level features by treating each spatial coordinate (𝑖, 𝑗) as the center of a 𝐾 × 𝐾 local window and calculating its similarity with all its neighbors. The findings indicate that fine-tuning VOLO on Dataset-1 (Chowdhury et al., 2020) leads to 99.67% top-1 accuracy on Dataset-1 test cases and 98.98% top-1 accuracy on the unseen Dataset-2 (Cohen et al., 2020), which demonstrates the generality of the approach.
Furthermore, the availability of labeled images has considerably influenced research on the use of Transformers to diagnose COVID-19. Considering the shortage of labeled data, data sharing between hospitals is needed so as to create a viable centralized dataset. However, such collaboration is challenging due to privacy concerns and patient permission. Motivated by Federated Learning (FL) and Split Learning (SL), Park et al. (2021a) present a Federated Split Task-Agnostic (FESTA) framework that uses ViT for multi-task learning of classification, detection, and segmentation of COVID-19 CXR images. FESTA benefits from the decomposable modular design of ViT to train heads and tails via clients and share the server-side Transformer body across clients to aggregate the extracted features and process each task. The embedded features from the body Transformer are then passed to their task-specific tail on the client side to produce the final prediction (Fig. 5(a)). Fig. 5(b) illustrates the single-task learning scheme and (c) the multi-task learning scheme. In multi-task learning, heads, tails, and a task-agnostic Transformer body are first jointly trained for 6000 rounds (see Fig. 5(c)). Then, heads and tails are fine-tuned according to the desired specific task while freezing the weights of the Transformer body. FESTA benefits from 220,000 decentralized CXR images and attains competitive results compared to data-centralized training approaches. The experimental results also demonstrate the stable generalization performance of FESTA, where multi-task learning enhances the performance of the individual tasks through their mutual effect during training.
Most attention-based networks utilized for detection and classification rely on the neural network to learn the necessary regions of interest. Bhattacharya et al. (2022) in RadioTransformer argue that in certain applications, utilizing experts' opinions can prove beneficial. Specifically, they apply this notion to leverage radiologists' gaze patterns while diagnosing different diseases on medical images; then, using a teacher–student architecture, they teach a model to pay attention to regions of an image that a specialist is most likely to examine. As illustrated in Fig. 6, the teacher and the student networks consist of two main components: global and focal. The global component learns coarse representations while the focal module works on low-level features, and both these segments are comprised of Transformer blocks with shifting windows. In addition, the global and focal components are interconnected using two-way lateral connections to form the global-focal module; this is to address the inherent attention gap between the two. The teacher network is first directly pre-trained on human visual attention maps. Then, the entire model is trained for different downstream tasks, e.g., object detection and classification. Furthermore, the authors propose a self-supervised Visual Attention Loss (VAL) that incorporates both GIoU and MSE losses. The student network is trained to predict probability values for different classes and attention regions. These attention regions are then compared to those obtained from the teacher model, and the weights are optimized using VAL.

3.2. Hybrid models

In spite of the vision Transformers' ability to model global contextual representations, the self-attention mechanism undermines the representation of low-level details. CNN–Transformer hybrid approaches have been proposed to ameliorate the problem above by encoding both global and local features, using the locality of CNNs and the long-range dependency of Transformers.
TransMed (Dai et al., 2021) proposes a hybrid CNN–Transformer network that leverages the locality of CNNs and the long-range dependency character of Transformers for parotid gland tumor and knee injury classification. Multimodal medical images primarily have long-range interdependencies, and improving performance requires an effective fusion strategy. TransMed proposes a novel image fusion strategy: firstly, three neighboring 2D slices of a multimodal image are overlaid to create three-channel images. Then, each image is partitioned into 𝐾 × 𝐾 patches. This fusion approach allows the following network to learn mutual information from images of different modalities. Patch tokens are fed into a CNN network to capture their low-level features and generate patch embeddings. The classic ViT is then used to determine the relationship between the patch sequences.
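A minimal sketch of the fusion step just described — stacking three neighboring slices as the channels of a single 2D image and partitioning it into patches — is given below; slice selection, the CNN embedding network, and all TransMed hyper-parameters are simplified assumptions.

```python
# Minimal sketch of the TransMed-style fusion described above: three neighboring
# 2D slices of a (multimodal) volume are stacked as image channels and the result
# is partitioned into K x K non-overlapping patches before CNN/Transformer
# processing. Slice selection and all hyper-parameters are simplified assumptions.
import torch


def fuse_neighboring_slices(volume: torch.Tensor, index: int) -> torch.Tensor:
    """volume: (D, H, W); returns a 3-channel image from slices index-1..index+1."""
    d = volume.shape[0]
    idx = [max(index - 1, 0), index, min(index + 1, d - 1)]
    return volume[idx]                                    # (3, H, W)


def to_patches(image: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Partition a (C, H, W) image into non-overlapping k x k patch tokens."""
    c, h, w = image.shape
    patches = image.unfold(1, k, k).unfold(2, k, k)       # (C, H/k, W/k, k, k)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * k * k)
    return patches                                        # (num_patches, C*k*k)


if __name__ == "__main__":
    vol = torch.randn(32, 224, 224)                       # one MRI sequence with 32 slices
    fused = fuse_neighboring_slices(vol, index=10)
    print(to_patches(fused).shape)                        # torch.Size([196, 768])
```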


Fig. 6. Overview of RadioTransformer (Bhattacharya et al., 2022). The Human Visual Attention Training (HVAT) block first uses radiologists' visual observations of chest radiographs to train a global-focal teacher network. The pre-trained teacher network is then utilized to distill the teacher's knowledge to a global-focal student network through a visual attention loss, enabling the student to learn visual information. Following the teacher–student strategy and incorporating radiologist visual examinations leads to an improvement in the classification of disease on chest radiographs.

TransMed's final results verify the effectiveness of hybrid models in classifying multimodal medical images by outperforming all its counterparts by a large margin. TransMed-S enhances average accuracy on the PGT dataset by about 10.1% over its nearest counterpart, BoTNet (Srinivas et al., 2021), while requiring fewer parameters and a lower FLOP count. Comparably, Tanzi et al. (2022) develop a new CAD system (Femur-ViT) based on Vision Transformers for diagnosing femoral fractures. First, YOLOv3 (Redmon and Farhadi, 2018) is utilized to detect and crop the left and right femur regions. Afterward, a CNN (InceptionV3 (Szegedy et al., 2016)) and a hierarchical CNN (different InceptionV3 networks in cascade) (Tanzi et al., 2020) are applied to the dataset, and the results serve as baselines for the classification. Then, they use a modified ViT to classify seven different fracture types. Finally, a clustering approach is proposed as an evaluation technique for the ViT encoder. This study highlights the power of using ViT models for medical image classification and the ability of the proposed CAD system to significantly increase clinicians' diagnostic accuracy. 3DMeT (Wang et al., 2021c) proposes applying a 3D medical image Transformer for assessing knee cartilage defects in three grades: grade 0 (no defect), grade 1 (mild defect), and grade 2 (severe defect). Primarily, using medical 3D volumes as an input to the Transformer is computationally expensive, thereby making it impractical. 3DMeT resolves the high computational cost problem by replacing the conventional linear embedding with 3D convolutional layers. The weights of the convolutional layers are adopted using a teacher–student training strategy: 3DMeT takes an exponential moving average of the first one/few layer(s) of the CNN teacher's weights and uses it as the convolutional layers' weights. This method enables Transformers to be compatible with small medical datasets and to benefit from CNNs' spatial inductive biases. Lastly, the Transformer and CNN teacher's outputs are combined in order to derive the classification results.
Operating Transformers over Whole Slide Images (WSIs) is computationally challenging since a WSI is a gigapixel image that retains the original structure of the tissue. MIL and CNN backbones have proven to be practical tools for acting on WSIs. MIL is a weakly supervised learning approach that enables deep learning methods to train on high-resolution images like WSIs. Since annotating such images at the pixel level is impractical, MIL proposes to divide an input WSI into a bag of instances and assign a single label to the bag of each image based on the pathology diagnosis. The bag has a positive label if it contains at least one positive instance, and it is considered negative if all the instances in the bag are negative. CNN backbones are then employed to down-sample and extract the features of each instance, allowing Transformers to operate on the generated feature maps with currently available hardware. Therefore, DT-MIL (Li et al., 2021b) proposes to compress WSIs into compact feature images by embedding each patch of the original WSI into a super-pixel at its corresponding position using EfficientNet-B0 (Tan and Le, 2019). The resulting thumbnail image is fed into a 1 × 1 convolution for feature reduction, followed by a deformable Transformer encoder that aggregates instance representations globally. A similar approach is adopted by the Holistic ATtention Network (HATNet) (Mehta et al., 2022), which first divides an input image into 𝑛 non-overlapping bags, each broken down into 𝑚 non-overlapping words (or patches). The 𝑛 × 𝑚 words are fed into the CNN encoder to obtain word-level representations for each bag. HATNet aims to develop a computer-aided diagnosis system to help pathologists reduce breast cancer detection errors. According to the World Health Organization (WHO), breast cancer is the most frequent non-skin cancer in women, accounting for one out of every four new female cancers annually (World-Health-Organization, 2021). As illustrated in Fig. 7, HATNet follows a bottom-up decoding strategy such that it first applies multi-head attention to words in a word-to-word attention block, then considers the relationship between words and bags in word-to-bag attention, followed by bag-to-bag attention to attain inter-bag representations. The acquired bag features are then aggregated in bag-to-image attention to build image-level representations. A linear classifier is ultimately applied to achieve the final results. Furthermore, unlike most MIL methods that treat all the instances in each bag as independent and identically distributed (Lu et al., 2021; Sharma et al., 2021; Naik et al., 2020), TransMIL (Shao et al., 2021) suggests that it is essential to consider the correlation between different instances and to explore both morphological and spatial information. Two Transformer layers address the morphological information, and a conditional position encoding layer named the Pyramid Position Encoding Generator (PPEG) addresses the spatial information. The proposed PPEG module has two merits: (1) it handles positional encoding of sequences with a varying number of instances by using group convolution over the 2D reshaped patch tokens, and (2) it enriches the features of tokens by capturing more context information through convolutions. In contrast to conventional IID-based MIL methods requiring many epochs to converge, TransMIL converges two to three times faster by using morphological and spatial information. TransMIL also outperforms all the latest MIL methods (Ilse et al., 2018; Li et al., 2021a; Campanella et al., 2019; Li et al., 2019b; Lu et al., 2021) in terms of accuracy and AUC by a significant margin in binary and multi-class classification tasks, exhibiting the superiority of taking the correlation between different instances into account and considering both morphological and spatial information.
Previous methods mainly rely on weakly supervised learning or on dividing WSIs into image patches and using supervised learning to assess the overall disease grade. Nevertheless, these approaches overlook WSI contextual information. Thus, Zheng et al. (2022) propose a Graph-based Vision Transformer (GTP) framework for predicting disease grade using both morphological and spatial information of the WSIs. The graph component allows for the representation of the entire WSI, and the Transformer component allows for computationally efficient WSI-level analysis. The input WSI is first divided into patches, and those that contain more than 50% background are eliminated and not considered for further processing. The selected patches are fed forward through a contrastive learning-based patch embedding module for feature extraction. A graph is then built via a graph construction module utilizing the patch embeddings as nodes of the graph.

Fig. 7. The overall architecture of Mehta et al. (2020). HATNet hierarchically divides an input image into 𝑛 × 𝑚 words, which are then fed into the CNN encoder to provide
word-level representations for each bag. Then by performing a bottom-up decoding strategy and applying a linear classifier, breast biopsy classification results are obtained. Notably,
bag-to-image attention has the same procedure as word-to-bag attention, shown in (c).
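Returning to TransMIL's conditional positional encoding discussed above, the following is a minimal sketch of a PPEG-style module: the instance tokens of a bag are reshaped to a 2D map, enriched with group (depthwise) convolutions of several kernel sizes, and added back to the token sequence. The kernel sizes and the omission of the class token are assumptions for illustration.

```python
# Minimal sketch of a PPEG-style conditional positional encoding (TransMIL, Shao
# et al., 2021): sequence tokens are reshaped to a 2D map, enriched with group
# (depthwise) convolutions of several kernel sizes, and added back to the tokens.
# The exact kernel sizes and the handling of the class token are assumptions.
import math
import torch
import torch.nn as nn


class PPEG(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in (7, 5, 3)])

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (b, N, dim) instance embeddings, with N assumed to be a square number."""
        b, n, d = tokens.shape
        side = int(math.sqrt(n))
        feat = tokens.transpose(1, 2).reshape(b, d, side, side)  # sequence -> 2D map
        for conv in self.convs:
            feat = feat + conv(feat)                             # add positional context via conv
        return feat.flatten(2).transpose(1, 2)                   # 2D map -> sequence


if __name__ == "__main__":
    x = torch.randn(1, 256, 512)          # 256 instance embeddings from one WSI bag
    print(PPEG()(x).shape)                # torch.Size([1, 256, 512])
```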

In the graph Transformer section, a graph convolution layer followed by a mincut pooling layer (Bianchi et al., 2020) is applied, first to learn and enrich the node embeddings and then to lessen the number of Transformer input tokens. Since the graph adjacency matrix contains the spatial information of the nodes, by adding the adjacency matrix to the node features, GTP obviates the need for adding extra learnable positional embeddings to the nodes. The final Transformer layer predicts the WSI-level class label for three lung tumor classes: Normal, LUAD, and LSCC. GTP also introduces a graph-based class activation mapping (GraphCAM) technique that highlights the class-specific regions. GraphCAM exploits attention maps from the multi-head self-attention (MHSA) blocks in the Transformer layer and maps them to the graph space to create a heatmap for the predicted class. The experiments show that GTP performs as a superior interpretable and efficient framework for classifying WSI images while considering morphological and spatial information.

Fig. 8. LAT (Sun et al., 2021) vs. CAM (Zhou et al., 2016) visual comparison. The ground truth consists of microaneurysms, hemorrhages, soft exudates, and hard exudates, which are colored as green, yellow, green, and blue dots, respectively (zoom in on the ground truth image for better clarity).

Diabetic Retinopathy (DR) is an eye disorder that can cause impaired vision and sightlessness by damaging blood vessels in the retina. Most deep-learning approaches view lesion discovery and DR grading as independent tasks, which may produce suboptimal results. In contrast to conventional methods, LAT (Sun et al., 2021) proposes a unified encoder–decoder structure that comprises a pixel relation-based encoder to capture the image context information and a lesion filter-based decoder to discover lesion locations, where the whole network is jointly optimized and the two parts complement each other during training. The encoder is particularly in charge of modeling the pixel correlations, and the Transformer-based decoder part is formulated as a weakly supervised localization problem to detect lesion regions and categories with only DR severity level labels. In addition, LAT proposes two novel mechanisms to improve the effectiveness of the lesion-aware filters: (1) a lesion region importance mechanism, 𝑔(⋅|𝛷), to determine the contribution of each lesion-aware feature, and (2) a lesion region diversity mechanism to diversify and compact the lesion-aware features. The former is a linear layer followed by a sigmoid activation function that generates importance weights for the lesion-aware features, and the latter adopts a triplet loss (Zhong et al., 2018) to encourage the lesion filters to find diverse lesion regions. In the DR grading branch, LAT presents a DR grading classification module that calculates a global consistency loss based on the lesion-aware features, indicated as ℎ(⋅|𝜎). Eventually, the final DR grading prediction is achieved by calculating the cross-entropy loss between the predicted labels obtained from the fusion of 𝑔(⋅|𝛷) and ℎ(⋅|𝜎) and the ground truth. The total loss is the aggregation of the cross-entropy loss, the global consistency loss, and the triplet loss. Visual results of LAT regarding lesion discovery are depicted in Fig. 8.
The current success of deep learning in medical image interpretation relies heavily on supervised learning, which requires large labeled datasets. However, annotating medical imaging data is costly and time-consuming. Additionally, models struggle to generalize across external institutions or different tasks. Self-supervised learning offers a solution by leveraging unlabeled medical data to develop robust models without costly annotations (Huang et al., 2023). Chen et al. (2021d) propose a valuable empirical study of training self-supervised vision Transformers, examining their effectiveness and performance. According to their findings, self-supervised Transformers demonstrate strong performance when trained using contrastive learning, outperforming ImageNet-supervised ViT models as their size increases. For example, self-supervised pre-training surpasses supervised pre-training in certain cases, particularly with the very large ViT-Large. Furthermore, self-supervised ViT models achieve competitive results compared to prominent convolutional ResNets in prior studies (Chen et al., 2020b; Grill et al., 2020), showcasing the potential of ViT with relatively fewer inductive biases (Dosovitskiy et al., 2020). Notably, they discovered that removing the position embedding in ViT has only a slight negative impact on accuracy, indicating that self-supervised ViT can learn powerful representations without relying heavily on positional information, but also suggesting that the positional information may not have been fully utilized. Therefore, these findings indicate that the combination of self-supervised learning and vision Transformers can be beneficial for different medical image analysis tasks. For example, Yan et al. (2023) propose a robust self-supervised federated learning framework, Self MedFed, for medical image classification. Their approach combines masked image modeling with Transformers to pre-train models directly on decentralized target task datasets.
consuming. Additionally, models struggle to generalize across external bines masked image modeling with Transformers to pre-train models

9
R. Azad et al. Medical Image Analysis 91 (2024) 103000

Table 1
An overview of the reviewed Transformer-based medical image classification models.
Method  Modality  Organ  Type  Pre-trained module: Type  Datasets  Metrics  Year
Pure
1
ViT-vs-CNN (Matsoukas et al., Fundus Eye 2D ViT: Self-supervised APTOS-2019 (Asia Pacific Tele-Ophthalmology Kappa 2021
2021) Dermoscopy Skin & Supervised Society, 2019) Recall
2
X-ray Breast ISIC-2019 (Combalia et al., 2019) ROC-AUC
3
CBIS-DDSM (Lee et al., 2017)
1
ViT-BUS (Gheflati and Rivaz, Ultrasound Breast 2D ViT: Supervised B (Yap et al., 2017) ACC 2022
2
2022) BUSI (Al-Dhabyani et al., 2020) AUC
POCFormer (Perera et al., 2021) Ultrasound Chest 2D ✗ POCUS (Born et al., 2020) Recall, F1 2021
SP, SE, ACC
1
MIL-VT (Yu et al., 2021) Fundus Eye 2D ViT: Supervised Private Dataset Recall, F1 2021
2 APTOS-2019 (Asia Pacific Tele-Ophthalmology ACC, AUC
Society, 2019) Precision
3 RFMiD-2020 (Pachade et al., 2021)

COVID-VIT (Gao et al., 2021) CT Chest 3D ✗ MIA-COV19 (Kollias et al., 2021) ACC, F1 2021
1
xViTCOS (Mondal et al., 2021) X-ray Chest 2D ViT: Supervised COVID CT-2A (Gunraj, 2021) Recall, F1 2021
2
CT CheXpert (Irvin et al., 2019) Precision
SP, NPV
FESTA (Park et al., 2021a) X-ray Chest 2D ViT: Supervised 1 Four Private Datasets Recall, F1 2021
2
CheXpert (Irvin et al., 2019), 3 BIMCV (Vayá SP, SE, AUC
et al., 2020)
4
Brixia (Signoroni et al., 2021), 5 NIH (Wang
et al., 2017)
6
SIIM-ACR (SIIM-ACR, 2019), 7 RSNA (RSNA,
2018)
1
COVID-Transformer (Shome et al., X-ray Chest 2D ViT: Supervised Qi et al. (2021), 2 Sait et al. (2020) Recall, F1 2021
3
2021) El-Shafai and Abd El-Samie (2020) ACC, AUC
Precision
COVID-VOLO (Liu and Yin, 2021) X-ray Chest 2D ViT: Supervised 1 Chowdhury et al. (2020) ACC 2021
2
Cohen et al. (2020)
RadioTransformer (Bhattacharya X-ray Chest 2D ViT: Supervised 1 RSNA (Radiological Society of North America, Recall, F1 2022
et al., 2022) 2018), 2 Cell Pneumonia (Kermany et al., 2018) ACC, AUC
3 COVID-19 Radiography (Chowdhury et al., Precision
2020; Rahman et al., 2021)
4
NIH (Wang et al., 2017), 5 VinBigData (Nguyen
et al., 2022b)
6
SIIM-FISABIO-RSNA (Lakhani et al., 2021)
7
RSNA-MIDRC (Tsai et al., 2021b,a)
8 TCIA-SBU COVID-19 (Clark et al., 2013; Saltz

et al., 2021)
1
Self MedFed (Yan et al., 2023) X-ray Chest 2D ViT: Self-supervised EyePACKS (EyePACKS, 2015) ACC 2023
2
Dermoscopy Skin ISIC 2017 (Codella et al., 2018), 3 ISIC 2018
Fundus Eye (Codella et al., 2019)
4 ISIC 2020 (Rotemberg et al., 2021)
5 COVID-FL (Yan et al., 2023)

Hybrid
TransMIL (Shao et al., 2021) Microscopy Multi- 2D CNN: Supervised 1 Camelyon16 (Bejnordi et al., 2017) ACC 2021
2
organ TCGA-NSCLC (Albertina et al., 2016; Kirk AUC
et al., 2016)
3
TCGA-RCC (Shinagare et al., 2014)
LAT (Sun et al., 2021) Fundus Eye 2D CNN: Supervised 1 Messidor-1(Decencière et al., 2014) AUC & Kappa 2021
2 Messidor-2(Krause et al., 2018)
3
EyePACKS (EyePACKS, 2015)
TransMed (Dai et al., 2021) MRI Ear 3D ViT: Supervised 1 PGT (Dai et al., 2021) Precision 2021
Knee CNN: Supervised 2 MRNET (Bien et al., 2018) ACC
3DMeT (Wang et al., 2021c) MRI Knee 3D CNN: Supervised Private dataset ACC, Recall, 2021
F1
Hybrid-COVID-ViT (Park et al., X-ray Chest 2D CNN: Supervised CheXpert (Irvin et al., 2019) AUC, ACC 2021
2021b) SP, SE
Femur-ViT (Tanzi et al., 2022) X-ray Femur 2D ViT: Supervised Private dataset Recall, F1 2022
CNN: Unsupervised Precision, ACC
1
DT-MIL (Li et al., 2021b) Microscopy Lung 2D CNN: Supervised CPTAC-LUAD (Clark et al., 2013) Recall, F1 2021
2
Breast BREAST-LNM (Li et al., 2021b) AUC, precision

(continued on next page)

10
R. Azad et al. Medical Image Analysis 91 (2024) 103000

Table 1 (continued).
Method Modality Organ Type Pre-trained module: Datasets Metrics Year
Type
GTP (Zheng et al., 2022) Microscopy Lung 2D CNN: Self-supervised 1 NLST(Team, 2011), 2 CPTAC (Edwards et al., Precision, 2022
2015) Recall
3
TCGA (National Institutes of Health et al., SP, SE, ACC,
2019) AUC
HATNet (Mehta et al., 2022) Microscopy Breast 2D CNN: Supervised Breast Biopsy WSI Dataset (Elmore et al., 2015) ACC, ROC-AUC 2022
F1, SP, SE
1
HiFuse (Huo et al., 2022) CT Skin 2D ✗ ISIC 2018 (Codella et al., 2019) Recall, ACC 2022
2
Dermoscopy Chest COVID19-CT (He et al., 2020) F1, Precision
Endoscopy Stomach 3 Kvasir (Pogorelov et al., 2017)

MedViT (Manzari et al., 2023) CT Multi- 2D ✗ MedMNIST (Yang et al., 2021b) AUC, ACC 2023
X-ray organ Top-1
ultrasound
OCT

Their approach combines masked image modeling with Transformers to pre-train models directly on decentralized target task datasets. This enables more robust representation learning on heterogeneous data and effective knowledge transfer. Experimental results on public medical imaging datasets (EyePACKS, 2015; Codella et al., 2018, 2019) and their created COVID-FL dataset demonstrate the framework's effectiveness in improving model robustness and learning visual representations across non-IID clients. The research addresses data heterogeneity and label deficiency, showcasing the broad applicability of their self-supervised federated learning framework in medical image analysis.

3.3. Discussion and conclusion

Section 3 thoroughly outlines 22 distinctive Transformer-based models in medical image classification. We have categorized the introduced models based on their architectures into hybrid and pure. These approaches differ according to whether they adhere to the original structure of the vanilla ViT or provide a new variant of the vision Transformer that can be applied to medical applications. In addition, we have presented details on the studied classification methods regarding their architecture type, modality, organ, pre-training strategy, datasets, metrics, and the year of publication in Table 1. Additional descriptions of the methods, including their model size, contributions, and highlights, are provided in Table 2. Table 3 presents a comparison of Transformer-based models across four distinct medical image classification datasets. The evaluation primarily focuses on the accuracies achieved by the various models, as well as their FLOP counts and numbers of parameters.
Additionally, inference time and GPU usage are pivotal factors when evaluating method performance. Inference time is a crucial metric for real-time or time-critical applications, and efficient GPU utilization ensures optimal resource allocation, preventing undue stress on the hardware. We conducted experiments using different MedViT models (small, base, and large) on an NVIDIA Tesla T4 GPU, assessing GPU usage and inference times for input images of size 3 × 224 × 224. We performed 1000 model inferences after a warm-up period of 50 iterations to gauge performance. For a batch size of 8, MedViT-S consumes 3.92 GB of GPU memory, while MedViT-B and MedViT-L require 5.04 GB and 6.06 GB, respectively. In terms of inference time (batch size 1), MedViT-S, -B, and -L averaged 49.60 ms, 66.89 ms, and 87.95 ms, respectively. These insights provide valuable information for choosing the right model for specific applications, balancing performance with GPU resource constraints, and enabling informed comparisons with other models when considering factors like the number of parameters and FLOPs. By examining these metrics, researchers can gain insights into the performance and efficiency of different models in the context of medical image classification tasks.
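To make the measurement protocol described above easier to reproduce, the following is a minimal sketch of such a latency and memory benchmark, assuming PyTorch and a CUDA-capable GPU; the `model` argument is a placeholder for any classifier under test (e.g., a MedViT variant), and the helper is illustrative rather than the exact script behind the reported numbers.

```python
# Minimal sketch of a warm-up + timed-inference benchmark (assumes PyTorch/CUDA).
import torch


@torch.no_grad()
def benchmark(model, input_size=(3, 224, 224), batch_size=1,
              warmup=50, iters=1000, device="cuda"):
    model = model.eval().to(device)
    x = torch.randn(batch_size, *input_size, device=device)

    for _ in range(warmup):                      # warm-up: stabilise clocks and caches
        model(x)
    torch.cuda.synchronize()

    torch.cuda.reset_peak_memory_stats(device)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    for _ in range(iters):                       # timed forward passes
        model(x)
    end.record()
    torch.cuda.synchronize()

    latency_ms = start.elapsed_time(end) / iters          # average per-inference latency
    peak_mem_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    return latency_ms, peak_mem_gb
```

CUDA events are used here because GPU kernels execute asynchronously; synchronizing before reading the timer avoids under-reporting the latency.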
As is evident in the storyline of this section, we have discussed the methods in each paragraph with regard to the underlying problems in medical image classification and introduced solutions and how they address such issues. However, more research on these problems is crucial to making such approaches widely applicable.
Data availability in the medical domain is one of the most challenging aspects of developing Transformer-based models, since Transformer models are known for being data-hungry to generalize. Reasons for data scarcity in the medical field include the privacy concerns of patients, the time-consuming and costly process of annotation, and the need for expert staff. To this end, the use of generative models (Pinaya et al., 2022; Moghadam et al., 2022; Kazerouni et al., 2023) and their integration with Transformer models can become prominent, since they are capable of creating synthetic data that is comparable to genuine data. Another way to attack this problem is by utilizing federated learning, such as Park et al. (2021a). Nevertheless, there is still room for improvement when it comes to privacy concerns since, in federated learning, communication between the client and server is required.
Despite their SOTA performance, Transformer-based networks still face challenges in real-world deployment due to computational limitations. As shown in Table 2, most approaches have a high number of parameters, which poses a serious problem. Different novel approaches have been introduced to reduce the quadratic complexity of self-attention, which can be leveraged in the medical domain. Furthermore, though ViTs have shown impressive capabilities in ImageNet classification, their performance is still lower than the latest SOTA CNNs without additional data (Liu and Yin, 2021). Hence, existing methods mostly follow pre-training strategies on the ImageNet dataset to build the pre-trained weights for the subsequent downstream tasks. However, despite this enhancement, the domain of natural images is significantly different from medical data, thereby restricting further improvement. Therefore, we believe efficient Transformers will considerably influence the future research of Transformer-based models.


Table 2
A brief description of the reviewed Transformer-based medical image classification models. An unreported number of parameters indicates that the value was not mentioned in the paper and the code was unavailable.

Pure

ViT-vs-CNN (Matsoukas et al., 2021) | # Params: 22M
Contributions: ∙ They investigate three different weight initialization approaches on three medical datasets: (1) random initialization, (2) transfer learning using supervised ImageNet pre-trained weights, and (3) self-supervised pretraining on the target dataset using DINO (Caron et al., 2021). Their final verdict is that ViTs can replace CNNs.
Highlights: ∙ Utilize three different training schemes on three different datasets to conclude whether ViT can replace CNN. ∙ They repeat their processes five times to be certain of the outcome. ∙ Comparing only two models, DeiT-S and ResNet-50, on only three datasets cannot generalize the conclusion about the superiority of either the Transformer or the CNN.

ViT-BUS (Gheflati and Rivaz, 2022) | # Params: ViT-Ti/16: 5.53M, ViT-S/32: 22.87M, ViT-B/32: 87.44M
Contributions: ∙ Proposes the use of ViT for the classification of breast ultrasound images for the first time.
Highlights: ∙ Transferring pre-trained ViT models based on small ultrasound datasets yields much higher accuracy than CNN models.

POCFormer (Perera et al., 2021) | # Params: Binary CLS: 2.8M, Multiclass CLS: 6.9M
Contributions: ∙ Proposes a lightweight Transformer architecture that uses lung ultrasound images for real-time detection of COVID-19.
Highlights: ∙ POCFormer can perform in real-time and be deployed to portable devices. ∙ POCFormer can be used for rapid mass testing of COVID-19 due to its compactness.

MIL-VT (Yu et al., 2021) | # Params: 22.12M
Contributions: ∙ Proposes to first pre-train the Vision Transformer on a large fundus image dataset and then fine-tune it on the downstream task of retinal disease classification. ∙ Introduces the MIL-VT framework with a novel Multiple Instance Learning (MIL) head to effectively utilize embedding features to improve the ViT performance.
Highlights: ∙ The MIL head can significantly enhance the performance by easily attaching to the Transformer in a plug-and-play manner. ∙ MIL-VT efficiently exploits the embedding features overlooked in the final prediction of the ViT.

COVID-VIT (Gao et al., 2021) | # Params: 52.81M
Contributions: ∙ Offers to utilize ViT to classify COVID and non-COVID patients using 3D CT lung images.
Highlights: ∙ COVID-ViT performs better in classifying COVID from non-COVID compared to DenseNet. ∙ The reported result is not enough to conclude; they only compare ViT with DenseNet.

xViTCOS (Mondal et al., 2021) | # Params: 85.99M
Contributions: ∙ Proposes and explores using ViT for detecting COVID-19 from CXR and CT images.
Highlights: ∙ Uses a heatmap plot to demonstrate the model's explainability. ∙ xViTCOS makes use of the Gradient Attention Rollout algorithm (Chefer et al., 2021) for visualization and clinical interpretability of the output.

FESTA (Park et al., 2021a) | # Params: Body: 66.37M; (CLS) Head: 13.31M, Tail: 2k; (SEG) Head: 15.04M, Tail: 7.39M; (DET) Head: 25.09M, Tail: 19.77M
Contributions: ∙ Proposes a Federated Split Task-Agnostic (FESTA) framework that leverages ViT to benefit from federated learning and split learning. ∙ They use multi-task learning to classify, detect, and segment COVID-19 CXR images.
Highlights: ∙ The proposed FESTA Transformer improved individual task performance when combined with multi-task learning. ∙ Experimental results demonstrate stable generalization and SOTA performance of FESTA on the external test dataset even under non-independent and non-identically distributed (non-IID) settings. ∙ FESTA eliminates the need for data exchange between health centers while maintaining data privacy. ∙ Using FESTA in industry may not be safe because it may encounter external attacks on the server that may lose the parameters of the entire network; in addition, using this method may jeopardize patient information through privacy attacks. ∙ The authors did not evaluate the robustness of their approach to difficulties arising from communication, stragglers, and fault tolerance.

COVID-Transformer (Shome et al., 2021) | # Params: ViT-L/16: 307M
Contributions: ∙ COVID-Transformer investigates using ViT for detecting COVID-19 from CXR images. ∙ COVID-Transformer introduces a new balanced chest X-ray dataset containing 30K images for multi-class classification and 20K for binary classification.
Highlights: ∙ Uses a heatmap plot to demonstrate the model's explainability.

COVID-VOLO (Liu and Yin, 2021) | # Params: 86.3M
Contributions: ∙ Proposes fine-tuning the pre-trained VOLO model for COVID-19 diagnosis.
Highlights: ∙ Using VOLO enables capturing both fine-level and coarse-level features, resulting in higher performance in COVID-19 binary classification.

RadioTransformer (Bhattacharya et al., 2022) | # Params: 3.93M
Contributions: ∙ Presents a novel global-focal RadioTransformer architecture, including Transformer blocks with shifting windows, to improve diagnosis accuracy by leveraging the knowledge of experts. ∙ Introduces an innovative technique for training student networks by utilizing visual attention regions generated by teacher networks.
Highlights: ∙ Outperforms counterpart backbones on multiple datasets. ∙ Model explainability.

Self MedFed (Yan et al., 2023) | # Params: 85.2M
Contributions: ∙ Self MedFed is a novel privacy-preserving federated self-supervised pre-training framework designed to learn visual representations from decentralized data by leveraging masked image modeling. ∙ Introduces a benchmark dataset called COVID-FL, consisting of federated chest X-ray data from eight different medical datasets.
Highlights: ∙ The proposed network simultaneously tackles the dual challenges of data heterogeneity and label deficiency. ∙ Experimental results on various medical datasets validate the superiority of the proposed method compared to ImageNet-supervised baselines and existing federated learning algorithms in terms of label efficiency and robustness to non-IID data.

Hybrid

TransMIL (Shao et al., 2021) | # Params: 2.67M
Contributions: ∙ Presents a Transformer-based Multiple Instance Learning (MIL) approach that uses both morphological and spatial information for weakly supervised WSI classification. ∙ Proposes to consider the correlation between different instances of a WSI instead of assuming them independently and identically distributed. ∙ Proposes a CNN-based PPEG module for conditional position encoding, which is adaptive to the number of tokens in the corresponding sequence.
Highlights: ∙ Converges two to three times faster than SOTA MIL methods. ∙ The proposed method can be employed for unbalanced/balanced and binary/multiple classification with great visualization and interpretability. ∙ TransMIL is adaptive for positional encoding as the token number in the sequence changes. ∙ It needs further improvement to handle magnifications higher than ×20 for WSIs: higher magnification means longer sequences, which in turn require more memory and computational cost to process.

LAT (Sun et al., 2021) | # Params: not reported
Contributions: ∙ Proposes a unified Transformer-based encoder–decoder structure capable of DR grading and lesion detection simultaneously. ∙ Proposes a Transformer-based decoder to formulate lesion discovery as a weakly supervised lesion localization problem. ∙ Proposes a lesion region importance mechanism to determine the importance of lesion-aware features. ∙ Proposes a lesion region diversity mechanism to diversify and compact lesion-aware features.
Highlights: ∙ Unlike most approaches that confront lesion discovery and diabetic retinopathy grading independently, which may generate suboptimal results, the proposed encoder–decoder structure is jointly optimized for both tasks. ∙ Unlike existing methods that only perform well for discovering explicit lesion regions, LAT can also detect less dense lesion areas. ∙ The proposed LAT is capable of identifying Grades 0 and 1, which are hard to distinguish.

TransMed (Dai et al., 2021) | # Params: TransMed-Tu: 17M, TransMed-S: 43M, TransMed-B: 110M, TransMed-L: 145M
Contributions: ∙ Proposes a hybrid CNN-Transformer network for multimodal medical image classification. ∙ Proposes a novel image fusion strategy for 3D MRI data.
Highlights: ∙ TransMed achieves much higher accuracy in classifying parotid tumors and knee injuries than CNN models. ∙ Requires fewer computational resources compared to SOTA CNNs.

3DMeT (Wang et al., 2021c) | # Params: not reported
Contributions: ∙ Replaces conventional linear embedding with 3D convolution layers to reduce the computational cost of using 3D volumes as the Transformer's inputs. ∙ Obtains weights for the 3D convolution layers by using a teacher–student training strategy.
Highlights: ∙ The proposed method makes Transformers capable of using 3D medical images as input. ∙ 3DMeT uses significantly fewer computational resources. ∙ Adopting a CNN as a teacher assists in inheriting the CNN's spatial inductive biases.

Hybrid-COVID-ViT (Park et al., 2021b) | # Params: not reported
Contributions: ∙ Proposes a vision Transformer that embeds features for high-level COVID-19 diagnosis classification using a backbone trained to spot low-level abnormalities in CXR images.
Highlights: ∙ Different from SOTA models, the proposed model does not use the ImageNet pre-trained weights while achieving significantly better results. ∙ They examine the interpretability of the proposed model.

Femur-ViT (Tanzi et al., 2022) | # Params: not reported
Contributions: ∙ Investigates using ViT for classifying femur fractures. ∙ Proposes using unsupervised learning to evaluate the ViT results.
Highlights: ∙ Achieves SOTA results compared to the CNN models.

GTP (Zheng et al., 2022) | # Params: not reported
Contributions: ∙ Proposes a graph-based vision Transformer (GTP) framework for predicting disease grade using both morphological and spatial information of the WSIs. ∙ Proposes a graph-based class activation mapping (GraphCAM) method that captures regional and contextual information and highlights the class-specific regions. ∙ They use a self-supervised contrastive learning approach to extract more robust and richer patch features. ∙ They exploit a mincut pooling layer (Bianchi et al., 2020) before the vision Transformer layer to lessen the number of Transformer input tokens and reduce the model complexity.
Highlights: ∙ In contrast to SOTA approaches, the proposed GTP can operate on the entire WSI by taking advantage of graph representations; in addition, GTP can efficiently classify disease grade by leveraging a vision Transformer. ∙ The proposed GTP is interpretable, so it can identify salient WSI areas associated with the Transformer output class. ∙ GTP obviates the need for adding extra learnable positional embeddings to nodes by using the graph adjacency matrix, which diminishes the complexity of the model. ∙ The proposed GTP takes both morphological and spatial information into account.

DT-MIL (Li et al., 2021b) | # Params: 10.88M
Contributions: ∙ Presents a novel embedded-space MIL approach incorporated with an encoder–decoder Transformer for histopathological image analysis; encoding is done with a deformable Transformer, and decoding with a classic ViT. ∙ An efficient method to render a huge WSI is proposed, which encodes the WSI into a position-encoded feature image. ∙ The proposed method selects the most discriminative instances simultaneously by utilizing associated attention weights and calibrating instance features using the deformable self-attention.
Highlights: ∙ The proposed method efficiently embeds instances' position relationships and context information into the bag embedding. ∙ An extensive analysis of four different bag-embedding modules is presented on two datasets.

HATNet (Mehta et al., 2022) | # Params: (w/ MobileNetv2): 5.59M, (w/ ESPNetv2): 5.58M, (w/ MNASNet): 5.47M
Contributions: ∙ Presents a novel end-to-end hybrid method for classifying histopathological images.
Highlights: ∙ HATNet surpasses the bag-of-words models by following a bottom-up strategy and taking into account inter-word, word-to-bag, inter-bag, and bag-to-image representations, respectively (word → bag → image).

HiFuse (Huo et al., 2022) | # Params: HiFuse-Ti: 82.49M, HiFuse-S: 93.82M, HiFuse-B: 127.80M
Contributions: ∙ A parallel framework is developed to effectively capture local spatial context features and global semantic information representation at various scales. ∙ An adaptive hierarchical feature fusion (HFF) block is created to selectively blend semantic information from various scale features of each branch using spatial attention, channel attention, residual inverted MLP, and shortcut connections.
Highlights: ∙ The HiFuse model yields favorable outcomes on the three different ISIC 2018 (Codella et al., 2018), COVID19-CT (He et al., 2020), and Kvasir (Pogorelov et al., 2017) datasets. ∙ HiFuse incorporates a modular design that offers rich scalability and linear computational complexity.

MedViT (Manzari et al., 2023) | # Params: MedViT-Ti: 10.2M, MedViT-S: 23M, MedViT-B: 45M
Contributions: ∙ Proposes a hybrid model that combines the locality of CNNs and the global connectivity of vision Transformers, aiming to improve the reliability of deep medical diagnosis systems. ∙ By utilizing an efficient convolution operation, the proposed attention mechanism enables joint attention to the information in various representation subspaces, thereby reducing computational complexity while maintaining performance. ∙ Introduces modifying the shape information in the high-level feature space by permuting the feature mean and variance within mini-batches to improve the model's resilience against adversarial attacks.
Highlights: ∙ The proposed attention mechanism is efficient. ∙ MedViT enhances the robustness of the Transformer models against adversarial attacks.

Table 3
Comparison of Transformer-based models on different medical image classification datasets.

ISIC 2018ᵃ (Codella et al., 2019)
Method | Params (M) | FLOPs (G) | Accuracy % | F1 % | Precision % | Recall %
HiFuse-S (Huo et al., 2022) | 93.82 | 8.84 | 83.59 | 72.70 | 72.70 | 73.14
Conformer-B/16 (Peng et al., 2021) | 83.29 | 22.89 | 82.66 | 72.44 | 73.31 | 71.66
Swin-B (Liu et al., 2021d) | 87.77 | 15.14 | 79.79 | 63.95 | 65.09 | 63.65
ViT-B/32 (Dosovitskiy et al., 2020) | 88.30 | 8.56 | 77.92 | 57.52 | 58.74 | 56.90

COVID19-CTᵃ (He et al., 2020)
Method | Params (M) | FLOPs (G) | Accuracy % | F1 % | Precision % | Recall %
HiFuse-S (Huo et al., 2022) | 93.82 | 8.84 | 76.88 | 76.31 | 77.78 | 76.19
Conformer-B/16 (Peng et al., 2021) | 83.29 | 22.89 | 75.81 | 75.60 | 76.81 | 77.81
ViT-B/32 (Dosovitskiy et al., 2020) | 88.30 | 8.56 | 61.83 | 60.59 | 61.89 | 60.94
Swin-B (Liu et al., 2021d) | 87.77 | 15.14 | 60.75 | 56.36 | 63.20 | 58.95

Kvasirᵃ (Pogorelov et al., 2017)
Method | Params (M) | FLOPs (G) | Accuracy % | F1 % | Precision % | Recall %
HiFuse-S (Huo et al., 2022) | 93.82 | 8.84 | 85.00 | 84.96 | 85.08 | 85.00
Conformer-B/16 (Peng et al., 2021) | 83.29 | 22.89 | 84.25 | 84.27 | 84.45 | 84.37
Swin-B (Liu et al., 2021d) | 87.77 | 15.14 | 77.30 | 77.29 | 77.74 | 77.44
ViT-B/32 (Dosovitskiy et al., 2020) | 88.30 | 8.56 | 73.80 | 73.50 | 74.24 | 73.72

TissueMNISTᵇ (Ljosa et al., 2012)
Method | Params (M) | FLOPs (G) | Top-1 %
MedViT-S (Manzari et al., 2023) | 23.6 | 4.9 | 73.1
Twins-SVT-S (Chu et al., 2021) | 24.0 | 2.9 | 72.1
PoolFormer-S36 (Yu et al., 2022) | 31.2 | 5.0 | 71.8
Swin-T (Liu et al., 2021d) | 29.0 | 4.5 | 71.7
CvT-13 (Wu et al., 2021) | 20.1 | 4.5 | 71.6
RVT-S (Mao et al., 2022) | 22.1 | 4.7 | 71.2
MedViT-T (Manzari et al., 2023) | 10.8 | 1.3 | 70.3
MedViT-L (Manzari et al., 2023) | 45.8 | 13.4 | 69.9
RVT-Ti (Mao et al., 2022) | 8.6 | 1.3 | 69.6
CoaT Tiny (Xu et al., 2021) | 5.5 | 4.4 | 69.3
RVT-B (Mao et al., 2022) | 86.2 | 17.7 | 69.3
DeiT-S (Touvron et al., 2021) | 22.0 | 4.6 | 67.0
PiT-S (Heo et al., 2021) | 23.5 | 2.9 | 66.9
PVT-S (Wang et al., 2021e) | 25.4 | 4.0 | 66.7

ᵃ Results are adopted from Huo et al. (2022).
ᵇ Results are adopted from Manzari et al. (2023).

4. Medical image segmentation

Medical segmentation is a significant sub-field of image segmentation in digital image processing. It aims to extract features from a set of regions partitioned from the entire image and to segment the key organs simultaneously, which can assist physicians in making an accurate diagnosis in practice. X-ray, positron emission tomography (PET), computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound are common imaging modalities used to collect data. The CNN-based U-Net (Ronneberger et al., 2015; Azad et al., 2022a) has been the main choice in this field due to its effective performance and high accuracy. Nevertheless, it cannot extract long-range dependencies in high-dimensional and high-resolution medical images (Karimijafarbigloo et al., 2023; Aghdam et al., 2022). Therefore, the flexible combination of the U-Net structure with Transformers has become a prevalent solution to the segmentation problem at present. Taking the multi-organ segmentation task as an example, some networks can achieve state-of-the-art multi-organ segmentation performance on the Synapse dataset (as shown in Fig. 9) for abdominal images.
In this section, we present the application of ViTs in segmentation tasks. First, we divide the approaches into two categories: pure Transformers and hybrid Transformers, where a pure Transformer denotes the use of multiple multi-head self-attention modules in both the encoder and decoder. Hybrid architecture-based approaches fuse ViTs with convolution modules in the encoder, bottleneck, decoder, or skip connection part to leverage information about both the global context and local details. Furthermore, we review some methods with other architectures that propose several novel manners of self-supervised learning. Fig. 10 demonstrates the different directions of the methods employing Transformers in the U-Net architecture.

4.1. Pure Transformers

In this section, we review several networks referred to as pure Transformers, which employ Transformer blocks in both the encoding and the decoding paths. Despite the great success of CNN-based approaches in medical segmentation tasks, these models still have limitations in learning long-range semantic information of medical images.


Fig. 9. Transformer-based models can perform image segmentation on medical image datasets. Figures a and c illustrate two 2D slices of raw images with the labels from the Synapse dataset (Landman et al., 2015). Figures b and d show the 3D visualization of the labeled organs from different angles. These images were generated with MITK Workbench (Nolden et al., 2013).

To this end, Cao et al. (2022) proposed Swin-Unet, a symmetric Encoder–Decoder architecture motivated by the hierarchical Swin Transformer (Liu et al., 2021d), to improve segmentation accuracy and robust generalization capability. In contrast to the closest approaches (Valanarasu et al., 2021; Zhang et al., 2021c; Wang et al., 2021d; Azad et al., 2022b), which use integrations of CNN with Transformer, Swin-Unet explores the possibility of a pure Transformer applied to medical image segmentation.
As shown in Fig. 11, Swin-Unet consists of an encoder, bottleneck, decoder, and skip connections, utilizing the Swin Transformer block with shifted windows as the basic unit. For the encoder, the sequence of embeddings transformed from image patches is fed into multiple Swin Transformer blocks and patch merging layers, with Swin Transformer blocks performing feature learning and patch merging layers downsampling the feature resolution and unifying the feature dimension. The designed bottleneck comprises two consecutive Swin Transformer blocks to learn the hierarchical representation from the encoder with feature resolution and dimension unchanged.
Swin Transformer blocks and patch-expanding layers construct the symmetric Transformer-based decoder. In contrast to the patch merging layers in the encoder, each patch expanding layer is responsible for upsampling the feature maps to double the resolution and halving the corresponding feature dimension. The final reshaped feature maps pass through a linear projection to produce the pixel-wise segmentation outputs. Inspired by the U-Net, the framework also employs skip connections to combine multi-scale features with the upsampled features at various resolution levels to reduce the loss of fine-grained contextual information caused by down-sampling.
In contrast to the CNN-based methods showing over-segmentation issues, the proposed U-shape pure Transformer presents better segmentation performance resulting from learning both local and long-range dependencies. Compared to the previous methods (Oktay et al., 2018; Chen et al., 2021b), the HD evaluation metric of Swin-Unet shows an improvement in accuracy for better edge prediction. The experiments on the Synapse multi-organ CT dataset and the ACDC dataset from MRI scanners also demonstrate the robustness and generalization ability of the method.
Compared to Swin-Unet and DS-TransUNet, nnFormer (Zhou et al., 2021) preserves the superior performance of convolution layers for local detail extraction and employs a hierarchical structure to model multi-scale features. It utilizes volume-based multi-head self-attention (V-MSA) and its shifted version (SV-MSA) in the Transformer blocks instead of processing 2D slices of the volume. The overall architecture of nnFormer is composed of an encoder and a decoder. Each stage in the encoder and decoder consists of a Transformer block applying V-MSA and SV-MSA and a successive upsampling or downsampling block built upon convolution layers, which is referred to as the interleaved architecture. V-MSA conducts self-attention within 3D local volumes instead of 2D local windows to reduce the computational complexity by approximately 98% and 99.5% on the Synapse and ACDC datasets, respectively. nnFormer is first pre-trained on the ImageNet dataset (Deng, 2009) and utilizes symmetrical initialization to reuse the pre-trained weights of the encoder in the decoder. It employs deep supervision as its training strategy. Specifically, the output feature maps of all resolutions in the decoder are passed to the corresponding expanding block, and cross-entropy loss and dice loss are applied to evaluate the loss at each stage. Accordingly, the resolution of the ground truth is downsampled to match the predictions at all resolutions. The weighted losses of all three stages comprise the final training objective function. The results of experiments that compare nnFormer with prior Transformer-based (Chen et al., 2021b; Cao et al., 2022) and CNN-based arts (Isensee et al., 2020) illustrate that nnFormer makes significant progress on the segmentation task.
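To illustrate the deep-supervision objective sketched above, the snippet below shows one way to combine cross-entropy and Dice losses over several decoder resolutions with a downsampled ground truth. The stage weights, the 2D setting, and the Dice formulation are illustrative assumptions rather than nnFormer's exact configuration.

```python
# Minimal sketch (assumes PyTorch) of a multi-resolution deep-supervision loss:
# the ground truth is downsampled to each decoder resolution, a CE + Dice loss
# is computed per stage, and stage losses are combined with decreasing weights.
import torch
import torch.nn.functional as F


def dice_loss(logits, target, eps=1e-5):
    """Soft Dice loss averaged over classes; target holds class indices."""
    probs = logits.softmax(dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()


def deep_supervision_loss(stage_logits, target, weights=(1.0, 0.5, 0.25)):
    """stage_logits: list of (B, C, H_i, W_i) predictions, full resolution first."""
    total = 0.0
    for logits, w in zip(stage_logits, weights):
        # Downsample the ground truth to match this stage's prediction.
        gt = F.interpolate(target[:, None].float(), size=logits.shape[2:],
                           mode="nearest").squeeze(1).long()
        total = total + w * (F.cross_entropy(logits, gt) + dice_loss(logits, gt))
    return total / sum(weights)


if __name__ == "__main__":
    preds = [torch.randn(2, 4, s, s) for s in (128, 64, 32)]   # three decoder stages
    gt = torch.randint(0, 4, (2, 128, 128))
    print(deep_supervision_loss(preds, gt))
```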

Fig. 10. An overview of ViTs in medical image segmentation. Methods are classified into the pure Transformer, hybrid Transformer, and other architectures according to the
positions of the Transformers in the entire architecture. The prefix numbers of the methods denote 1. Cao et al. (2022), 2. Zhou et al. (2021), 3. Huang et al. (2022b), 4. Azad et al.
(2022d), 5. Li et al. (2021c), 6. Chen et al. (2021b), 7. Wang et al. (2021d), 8. Zhang et al. (2021c), 9. Valanarasu et al. (2021), 10. Hatamizadeh et al. (2022b), 11. Hatamizadeh
et al. (2022a), 12. Tang et al. (2022), 13. Xie et al. (2021), 14. Heidari et al. (2023), 15. Yang et al. (2021a), 16. Luo et al. (2022), 17. Zhou et al. (2022).


Fig. 11. The architecture of the Swin-Unet (Cao et al., 2022) which follows the U-Shape structure. It contains the encoder, the bottleneck, and the decoder part, which are built
based on the Swin Transformer block. The encoder and the decoder are connected with skip connections.
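As a rough illustration of the resolution-changing blocks used in such U-shaped Swin architectures (Fig. 11), the sketch below implements a generic patch merging (encoder) and patch expanding (decoder) pair, assuming PyTorch and channel-last feature maps; it is not the authors' implementation.

```python
# Minimal sketch of Swin-Unet-style downsampling/upsampling blocks (assumes PyTorch).
import torch
import torch.nn as nn


class PatchMerging(nn.Module):
    """Encoder downsampling: 2x2 neighbouring tokens are concatenated,
    halving the resolution and doubling the channel dimension."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]   # top-left token of every 2x2 window
        x1 = x[:, 1::2, 0::2, :]   # bottom-left
        x2 = x[:, 0::2, 1::2, :]   # top-right
        x3 = x[:, 1::2, 1::2, :]   # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)        # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))            # (B, H/2, W/2, 2C)


class PatchExpanding(nn.Module):
    """Decoder upsampling: each token is projected and redistributed over a
    2x2 neighbourhood, doubling the resolution and halving the channels."""

    def __init__(self, dim: int):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim, bias=False)
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, H, W, C)
        b, h, w, c = x.shape
        x = self.expand(x)                               # (B, H, W, 2C)
        x = x.view(b, h, w, 2, 2, c // 2)                # split into a 2x2 grid
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, 2 * h, 2 * w, c // 2)
        return self.norm(x)                              # (B, 2H, 2W, C/2)


if __name__ == "__main__":
    feats = torch.randn(1, 56, 56, 96)
    down = PatchMerging(96)(feats)      # -> (1, 28, 28, 192)
    up = PatchExpanding(192)(down)      # -> (1, 56, 56, 96)
    print(down.shape, up.shape)
```

Patch merging concatenates each 2 × 2 neighborhood and projects 4C channels down to 2C, while patch expanding reverses the process, which is what allows the decoder in Fig. 11 to mirror the encoder stage by stage.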

Although the recent Transformer-based methods address the problem that CNN methods cannot capture long-range dependencies, they remain limited in their capability of modeling local details. Some methods directly embed convolution layers between the fully connected layers of the feed-forward network. Such a structure supplements the low-level information but limits the discrimination of features.
Huang et al. propose MISSFormer (Huang et al., 2022b), a hierarchical encoder–decoder network, which employs a Transformer block named the Enhanced Transformer Block and is equipped with an Enhanced Transformer Context Bridge.
The Enhanced Transformer Block utilizes a novel efficient self-attention module that illustrates the effectiveness of spatial reduction for better usage of the high-resolution map. The original multi-head self-attention can be formulated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{head}}}\right)V, \qquad (4)$$

where $Q$, $K$, and $V$ refer to query, key, and value, respectively, and have the same shape of $N \times C$, and $d_{head}$ denotes the number of heads. The computational complexity is $\mathcal{O}(N^{2})$. In efficient self-attention, $K$ and $V$ are reshaped by a spatial reduction ratio $R$. Take $K$, for example:

$$new\_K = \mathrm{Reshape}\!\left(\frac{N}{R},\, C \cdot R\right) W(C \cdot R,\, C), \qquad (5)$$

where $K$ is first resized from $N \times C$ to $\frac{N}{R} \times (C \cdot R)$ and then projected linearly to restore the channel depth from $C \cdot R$ to $C$. The computational cost accordingly reduces to $\mathcal{O}\!\left(\frac{N^{2}}{R}\right)$. Furthermore, the structure of the Enhanced Mix Feed-forward network (Mix-FFN), extended from Liu et al. (2021a), introduces recursive skip connections to make the model more expressive and consistent with each recursive step.
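The following is a minimal sketch of the spatial-reduction idea in Eq. (5), assuming PyTorch; the module and parameter names (e.g., `reduction_ratio`) are illustrative, and the actual MISSFormer block differs in details such as how the reduction is implemented.

```python
# Minimal sketch of efficient (spatial-reduction) self-attention (assumes PyTorch).
import torch
import torch.nn as nn


class EfficientSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, reduction_ratio: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.r = reduction_ratio
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # Linear projection W(C*R, C) restoring the channel depth after the
        # Reshape(N/R, C*R) step of Eq. (5).
        self.sr = nn.Linear(dim * reduction_ratio, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C), N divisible by R
        b, n, c = x.shape
        q = self.q(x).view(b, n, self.h, self.d).transpose(1, 2)     # (B, h, N, d)

        # Spatial reduction: (B, N, C) -> (B, N/R, C*R) -> (B, N/R, C)
        x_r = self.sr(x.reshape(b, n // self.r, c * self.r))
        k, v = self.kv(x_r).chunk(2, dim=-1)
        k = k.view(b, -1, self.h, self.d).transpose(1, 2)            # (B, h, N/R, d)
        v = v.view(b, -1, self.h, self.d).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) / self.d**0.5               # (B, h, N, N/R)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)


if __name__ == "__main__":
    tokens = torch.randn(2, 1024, 64)                 # e.g. a flattened 32x32 feature map
    print(EfficientSelfAttention(64)(tokens).shape)   # torch.Size([2, 1024, 64])
```

Because keys and values are reduced from N to N/R tokens, the attention map has shape N × N/R, which is what brings the cost down from O(N²) to O(N²/R).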
The U-shaped architecture of MISSFormer contains an encoder and decoder built on the Enhanced Transformer blocks and connected with an Enhanced Transformer Context Bridge. Multi-scale features produced from the encoder are flattened, concatenated together, and passed through the Enhanced Transformer Context Bridge. The pipeline of the Enhanced Transformer Context Bridge is based on the Enhanced Transformer Block to fuse the hierarchical features. The output of the bridge is split and recovered to each original spatial dimension to pass through the corresponding stage in the decoder. The results of experiments show a robust capacity of the method to capture more discriminative details in medical image segmentation. It is worth mentioning that MISSFormer trained from scratch even outperforms state-of-the-art methods pre-trained on ImageNet.

Fig. 12. A visual comparison with the state-of-the-art approaches on the Synapse dataset. Above the red line are the successful segmentation cases, and below the red line are the failed cases with relatively large errors (Huang et al., 2022b).

The results in Fig. 12 show that the performance of MISSFormer for the prediction and segmentation of edges in pure Transformer network structures is more accurate compared to TransUNet and Swin-Unet. Comparing MISSFormer and MISSFormer-S (MISSFormer without
more discriminative details in medical image segmentation. It is worth Unet. Comparing MISSFormer and MISSFormer-S (MISSFormer without

16
R. Azad et al. Medical Image Analysis 91 (2024) 103000

Fig. 13. The overview architecture of TransDeepLab, which comprises encoder and decoder built on Swin-Transformer blocks. It is the pure Transformer-based extension of
DeepLabv3++ (Azad et al., 2022d).

bridge), MISSFormer has fewer segmentation errors because the bridge 4.2.1. Transformer: Encoder
is effective for integrating multi-scale information. Starting with TransUNet (Chen et al., 2021b), multiple methods in
Inspired by the notable DeepLabv3 (Chen et al., 2017b) which the medical image segmentation field adopt the self-attention mecha-
utilizes the Atrous Spatial Pyramid Pooling (ASPP) to learn multi-scale nism in the encoder.
feature representations and depth-wise separable convolution to reduce Transformers have developed as an alternative architecture for mod-
the computational burden, the authors propose TransDeepLab (Azad eling global context that exclusively relies on attention mechanisms in-
et al., 2022d) to combine the DeepLab network with the Swin- stead of convolution operators. However, its inner global self-attention
Transformer blocks. Applying the Swin-Transformer module with win- mechanism induces missing low-level details. Direct upsampling cannot
dows of multiple sizes enables the fusion of multi-scale information retrieve the local information, which results in inaccurate segmentation
with a lightweight model. results. The authors propose the TransUNet architecture, a hybrid
approach that integrates CNN-Transformer hybrid as the encoder and
TransDeepLab is a pure Transformer-based DeepLabv3+ architec-
cascaded upsampler as the decoder, combining the advantages of Trans-
ture, as shown in Fig. 13. The model builds a hierarchical architecture
former and U-Net to boost the segmentation performance by recovering
based on the Swin-Transformer modules. TransDeepLab first employs 𝑁
localized spatial information.
stacked Swin-Transformer blocks to model the input embedded images
The framework of the TransUNet is illustrated in Fig. 14. The
into a deep-level feature space. 2D medical images are first to split
proposed encoder initially employs CNN as a feature extractor to build
into non-overlapping patches of dimension 𝐶 and size 4 × 4. The
a feature map for the Transformer input layer, rather than the Trans-
ensuing Swin-Transformer block learns local semantic information and former directly projecting the raw tokenized picture patches to latent
global contextual dependencies of the sequence of patches. Then, the embedding space. In this way, the intermediate CNN feature maps of
authors introduce windows with different sizes to process the output different resolutions can be saved and utilized in the following process.
of the Swin-Transformer and fuse the resulting multi-scale feature For the decoder, the Cascaded Upsampler (CUP) is proposed to
layers, which are then passed through Cross Contextual attention. This replace naive bilinear upsampling, applying several upsampling blocks
design, referred to as Spatial Pyramid Pooling (SSPP) block, replaces to decode the hidden feature and output the final segmentation re-
the original Atrous Spatial Pyramid Pooling (ASPP) module exploited sult. Finally, the hybrid encoder and the CUP constitute the overall
in DeepLabV3. A cross-contextual attention mechanism is utilized to architecture with skip connections to facilitate feature aggregation at
explore the multi-scale representation after fusion. This attention mod- different resolution levels. This strategy can compensate for the loss of
ule applies channel attention and spatial attention to the output from local fine-grained details caused by the Transformer encoder and merge
windows of each size (from each layer of the spatial pyramid). Finally, the encoded global information with the local information contained in
in the decoder part, the low-level features from the encoder are con- intermediate CNN feature maps.
catenated with the multi-scale features extracted by Cross Contextual The experiments show that TransUNet significantly outperforms
Attention after bilinear upsampling. The last two Swin-Transformer the model consisting of pure Transformer encoder and naive upsam-
blocks and the patch expanding module generate the final prediction pling, as well as the ViT-hybrid model without skip connections (Chen
masks. et al., 2021b). Comparisons with prior work (Ronneberger et al., 2015)
also demonstrate the superiority of TransUNet over competing CNN-
based approaches in terms of both qualitative visualization and the
4.2. Hybrid models
quantitative evaluation criteria (i.e.average DSC and HD). TransUNet
integrates the benefits of both high-level global contextual information
Hybrid Transformers concatenate Transformer blocks with convo- and low-level details as an alternative approach for medical image
lution layers to extract local details and long-range dependencies. We segmentation.
further classify this category into Transformer: Encoder, Transformer: Wang et al. (2021d) propose the encoder–decoder architecture,
Decoder and Transformer: skip connection according to the position of TransBTS, which leverages Transformer on learning global contextual
the combined module in the U-Net architecture. information and merits the 3D CNN for modeling local details. In

17
R. Azad et al. Medical Image Analysis 91 (2024) 103000

Fig. 14. The overview architecture of the TransUNet (Chen et al., 2021b). The Transformer layers are employed in the encoder part. The schematic of the Transformer is shown
on the left.

contrast to the concurrent Transformer-based model (Chen et al., Transformer-based models to learn local information. In this study, the
2021b), which analyzes 3D medical volumetric data in a slice-by-slice authors propose a new strategy-TransFuse which consists of the CNN-
manner, TransBTS also explores the local features along the depth based encoder branch and the Transformer-based branch in parallel
dimension by processing all the image slices at once. fused with the proposed BiFusion module, thus further exploring the
The network encoder initially employs a 3D CNN to capture vol- benefits of CNNs and Transformers. The construction of the Trans-
umetric spatial features, simultaneously downsampling the input 3D former branch is designed in the typical encoder–decoder manner. The
images, yielding compact volumetric feature maps. Each feature map is input images are first split into non-overlapped patches. The linear
projected into a token and fed into the Transformer encoder to inves- embedding layer then projects the flattened patches into the raw em-
tigate the global relationships. The full-resolution segmentation maps bedding sequence which is added to the learnable position embedding
are generated by the 3D CNN decoder after the progressive upsampling of the same dimension. The obtained embeddings are fed into the
while using the feature embedding from the Transformer. For the Transformer encoder, which comprises 𝐿 layers of MSA and MLP. The
encoder part, TransBTS first utilizes the 3 × 3 × 3 convolution blocks output of the last layer of the Transformer encoder is passed through
with downsampling to process the 3D input medical image data, which layer normalization to obtain the encoded sequence.
boosts the effective embedding of rich local 3D features a cross spatial The decoder part utilizes the same progressive upsampling (PUP)
and depth dimensions into the low-resolution feature representation 𝐹 . approach as SETR (Zheng et al., 2021). The encoded sequence is first
They apply a linear projection to the feature representation 𝐹 to obtain reshaped back to a sequence of 2D feature maps. Then they employ
the sequence 𝑓 , which is then integrated with position embeddings, two stacked upsampling-convolution layers to restore the feature scales.
as the input for the Transformer encoder. The Transformer encoder The feature maps with different spatial resolutions generated by each
consists of multiple Transformer layers, each of which comprises a upsampling-convolution layer are retained for the subsequent fusion
Multi-Head Attention(MHA) block and a Feed-Forward Network(FFN). operation. For the CNN branch, the approach discards the last layer
The output sequence of the Transformer encoder passes through the of the traditional CNNs architecture and combines the information
feature mapping module to be reshaped to a 4D feature map 𝑍 of the extracted from the CNNs with the global contextual features obtained
same dimension as 𝐹 . The approach employs cascaded upsampling and from the Transformer branch. A shallower model is yielded as a result
convolution blocks to progressively restore the segmentation predic- of this design, avoiding the requirement for extremely deep models
tions at the original resolution. Furthermore, skip connections combine that exhaust resources to get long-range dependencies. For instance,
the fine-grained details of local information with the decoder modules, there are five blocks in a typical ResNet-based network where only
resulting in more accurate segmentation masks. the outputs of the 4th, 3rd, and 2nd layers are saved for the following
The authors conduct comparisons between the proposed TransBTS fusion with the feature maps from the Transformer branch.
and the closest method TransUNet (Chen et al., 2021b). TransUNet The BiFusion module is proposed to fuse the features extracted from
essentially processes 3D medical images slice by slice, while TransBTS the two branches mentioned above to predict the segmentation results
is a 3D model that explores the continuous interaction through the of medical images. The global features from the Transformer branch
depth dimension by processing a 3D medical image in a single pass. In are boosted by the channel attention proposed in SE-Block (Hu et al.,
contrast to TransUNet, which adopts pre-trained ViT models on other 2018). Meanwhile, the feature maps from the CNN branch are filtered
large-scale datasets, TransBTS is trained on the dataset for the specified by the spatial attention which is adopted in the CBAM (Woo et al.,
task without relying on pre-trained weights. 2018) block to suppress the irrelevant and noisy part and highlight
The framework is evaluated on the Brain Tumor Segmentation local interaction. Then the Hadamard product is applied to the features
(BraTS) 2019 challenge and 2020 challenge. Compared to the 3D U-Net from the two branches to learn the interaction between them. They
baseline, TransBTS achieves a significant enhancement in segmenta- concatenate the interaction feature 𝑏𝑖 with attended features 𝑡̂𝑖 and 𝑔̂ 𝑖
tion. The prediction results indicate the improved accuracy and the and feed the results through a residual block to produce the feature
superiority of modeling long-range dependencies. 𝑓 𝑖 , which successfully models both the global and local features at
Previous approaches (Zhang et al., 2021c) primarily focus on re- the original resolution. Finally, the segmentation prediction is gener-
placing convolution operation with Transformer layers or consecu- ated by integrating the 𝑓 𝑖 from different BiFusion modules via the
tively stacking the two together to address the inherent lack of pure attention-gated (AG) (Schlemper et al., 2019) skip connection.

18
R. Azad et al. Medical Image Analysis 91 (2024) 103000

They evaluate the performance of three variants of TransFuse on volume in a non-overlapping manner. The flattened input patches are
four segmentation tasks with different imaging modalities and target then passed through a linear projection layer to yield 𝐾 dimensional
sizes. TransFuse-S is constructed with ResNet-34 (R34) and 8-layer patch embeddings. They attach a 1D learnable positional embedding
DeiT-Small (DeiT-S) (Touvron et al., 2021). Besides, TransFuse-L is to each patch embedding taking into account the spatial information
composed of Res2Net-50 and 10-layer DeiT-Base (DeiT-B). TransFuse- of the extracted patches. After the embedding layer, the global multi-
L* is implemented based on ResNetV2-50 and ViT-B (Dosovitskiy et al., scale representation is captured using Transformer blocks composed
2020). For polyp segmentation, Transfuse-S/L outperforms significantly of multi-head self-attention modules and multilayer perceptron layers.
the CNN baseline models with fewer parameters and faster running They resize and project the sequence representation extracted from the
time. TransFuse-L* also achieves the best performance among the Transformer at different resolutions for use in the decoder in order to
previous SOTA Transformer-based methods with a faster speed for retrieve spatial information of the low-level details.
inference. It runs at 45.3 FPS and about 12% faster than TransUNet. In the expanding pattern of the framework, the proposed CNN-
The experiments for other segmentation tasks also show the superiority based decoder combines the output feature of different resolutions from
of the segmentation performance. the Transformer with upsampled feature maps to properly predict the
Despite the powerful results of applying Transformers to segmen- voxel-wise segmentation mask at the original input resolution.
tation tasks (Wang et al., 2020a; Zheng et al., 2021), the dilemma The paper claims UNETR achieves new state-of-the-art performance
is that properly training existing Transformer-based models requires on all organs compared against CNN (Isensee et al., 2021b; Chen
large-scale datasets, whereas the number of images and labels available et al., 2018; Tang et al., 2021; Zhou et al., 2019) and competing for
for medical image segmentation is relatively limited. To overcome the Transformer-based (Xie et al., 2021; Chen et al., 2021b; Zheng et al.,
difficulty, MedT (Valanarasu et al., 2021) proposes a gated position- 2021) baselines on BTCV dataset, with significant improvement in
sensitive axial attention mechanism where the introduced gates are performance on small organs in particular. In addition, it outperforms
learnable parameters to enable the model to be applied to a dataset of the closest methodologies on brain tumor and spleen segmentation
arbitrary size. Furthermore, they suggested a Local-Global(LoGo) train- tasks in the MSD dataset. UNETR shows the superiority of learning both
ing strategy to improve the segmentation performance by operating on global dependencies and fine-grained local relationships in medical
images.
both the original image and the local patches.
Fig. 16 presents qualitative segmentation comparisons for brain
The main architecture of MedT, as shown in Fig. 15(a), is composed
tumor segmentation on the MSD dataset between UNETR (Hatamizadeh
of 2 branches: a shallow global branch that works on the original
et al., 2022b), TransBTS (Wang et al., 2021d), CoTr (Xie et al., 2021)
resolution of the entire image, and a deep local branch that acts
and U-Net (Ronneberger et al., 2015). It can be seen that the details
on the image patches. Two encoder blocks and two decoder blocks
of the brain tumor are captured well by UNETR (Hatamizadeh et al.,
comprise the global branch, which is sufficient to model long-range
2022b).
dependencies. In the local branch, the original image is partitioned into
As opposed to other methods that attempted to utilize the Trans-
16 patches and each patch is feed-forwarded through the network. The
former module as an additional block beside the CNN-based compo-
output feature maps are re-sampled based on their locations to obtain
nents in the architectures, UNETR (Hatamizadeh et al., 2022b) lever-
the output feature maps of the branch. Then the results generated from
ages the Transformer as the encoder instead of the CNN-based encoder.
both branches are added and fed into a 1 × 1 convolution layer to
The Swin Transformer (Liu et al., 2021d) is a hierarchical visual
produce the output segmentation mask. The LoGo training strategy
Transformer featuring an efficient shift-window partitioning scheme for
enables the global branch to concentrate on high-level information and
computing self-attention. Inspired by these two approaches, a novel
allows the local branch to learn the finer interactions between pixels model termed Swin Unet Transformer (Swin UNETR) (Hatamizadeh
within the patch, resulting in improved segmentation performance. et al., 2022a) is proposed for brain tumor segmentation in this work.
Fig. 15(b) and (c) illustrate the gated axial Transformer layer, which The proposed framework applies a U-shape architecture with the
is used as the main building block in MedT, and the feed-forward Swin Transformers as the encoder and a CNN-based module as the
structure in it. They introduced four learnable gates 𝐺𝑉 1 , 𝐺𝑉 2 , 𝐺𝑄 , 𝐺𝐾 ∈ decoder connected to the encoder via skip connections at different
R that control the amount of information the positional embeddings resolution levels. The model initially converts 3D MRI images with four
supply to key, query, and value. Based on whether a relative positional channels to non-overlapping patches and creates windows of a specific
encoding is learned accurately or not, the gate parameters will be size with a patch partition layer.
assigned weights either converging to 1 or some lower value. The gated The Swin UNETR encoder is composed of 4 stages. Each stage
mechanism can control the impact of relative positional encodings on comprises 2 Transformer blocks and a patch merging layer. In the
the encoding of non-local context and allows the model to work well Transformer blocks, the self-attention is computed with a shifted win-
on any dataset regardless of size. dowing mechanism. Swin UNETR employs the windows of size 𝑀 ×
Unlike the fully-attended baseline (Wang et al., 2020a), MedT 𝑀 × 𝑀 to partition the 3D token with resolution of 𝐻 ′ × 𝑊 ′ × 𝐷′ into
′ ′ ′
trained on even smaller datasets outperforms the convolutional baseline regions of ⌈ 𝐻
𝑀
×𝑊𝑀
×𝐷𝑀
⌉ at layer 𝑙. The partitioned window regions
and other Transformer-based methods. In addition, improvements in are then shifted by (⌊ 2 ⌋, ⌊ 𝑀
𝑀
⌋, ⌊ 𝑀 ⌋) voxels at the following 𝑙 + 1 layer.
2 2
medical segmentation are also observed since the proposed method The patch merging layer after the Transformer components reduces the
takes into account pixel-level dependencies. resolution of feature maps by a factor of two and concatenates them
In contrast to multiple proposed methods (Zheng et al., 2021; Chen to form a feature embedding with the doubled dimensionality of the
et al., 2021b; Valanarasu et al., 2021; Zhang et al., 2021c) that investi- embedding space.
gate the task of 2D medical image segmentation, UNETR (Hatamizadeh For the decoder of the architecture, the output feature representa-
et al., 2022b) proposes a novel Transformer-based architecture for 3D tions of the bottleneck are reshaped and passed through the residual
segmentation which employs the Transformer as the encoder to learn block containing two convolutional layers. The subsequent deconvolu-
global contextual information from the volumetric data. In addition, tional layer increases the resolution of feature maps by a factor of 2.
unlike the previous frameworks proposed for 3D medical image seg- The outputs are then concatenated with the outputs of the previous
mentation (Xie et al., 2021; Wang et al., 2021d), the encoded feature stage and fed into another residual block. After the resolutions of the
from the Transformer of this proposed model is directly connected feature maps are restored to the original 𝐻 ′ ×𝑊 ′ ×𝐷′ , a head is utilized
to a CNN-based decoder via skip connections at different resolution to generate the final segmentation predictions.
levels. The U-shaped UNETR comprises a stack of Transformers as the The authors conduct the experiments to compare Swin UNETR
encoder and a decoder coupling with it by skip connections. They begin against the previous methodologies SegResNet (Myronenko, 2018), nn-
by generating the 1D sequence of patches by splitting the 3D input UNet (Isensee et al., 2020)and TransBTS (Wang et al., 2021d) in this

19
R. Azad et al. Medical Image Analysis 91 (2024) 103000

Fig. 15. Overview of the MedT (Valanarasu et al., 2021) architecture. The network uses the LoGo strategy for training. The upper global branch utilizes the first fewer blocks of
the Transformer layers to encode the long-range dependency of the original image. In the local branch, the images are converted into small patches and then fed into the network
to model the local details within each patch. The output of the local branch is re-sampled relying on the location information. Finally, a 1 × 1 convolution layer fuses the output
feature maps from the two branches to generate the final segmentation mask.

work. The results demonstrate that the proposed model ranks among the top approaches in the BraTS 2021 challenge. This is due to the better capability of Swin Transformers to learn multi-scale contextual information and model long-range dependencies, in comparison to regular Transformers with a fixed window resolution.
Apart from the architecture of Swin UNETR (Hatamizadeh et al., 2022a), the authors also proposed a hierarchical encoder for pre-training the encoder of Swin UNETR (Tang et al., 2022), called Swin UNETR++. The pre-training tasks are tailored to learning the underlying patterns of human anatomy. Various proxy tasks such as image inpainting, 3D rotation prediction, and contrastive learning are leveraged to extract features from the naturally consistent contextual information in 3D radiographic images such as CT. Specifically, inpainting enables learning the texture, structure, and relationship of masked regions to their surrounding context. Contrastive learning is utilized to recognize separate regions of interest (ROIs) of different body compositions. The rotation task creates a variety of sub-volumes that can be used for contrastive learning. To validate the effectiveness of the pre-trained encoder, they fine-tuned the pre-trained Swin UNETR on the downstream segmentation task and achieved SOTA performance on the BTCV dataset (Landman et al., 2015).

Fig. 16. Comparison of visualization of brain tumor segmentation on the MSD dataset. The whole tumor (WT) includes a combination of red, blue, and green regions. The union of red and blue regions demonstrates the tumor core (TC). The green regions indicate the enhanced tumor core (ET) (Hatamizadeh et al., 2022b).

4.2.2. Transformer: Decoder

Another direction is to modify the decoder of the U-shaped structure to aggregate Transformer-CNN-based modules.
In the Segtran framework (Li et al., 2021c), as illustrated in Fig. 17, a Squeeze-and-Expansion Transformer is proposed to ''squeeze'' the attention matrix and aggregate multiple sets of contextualized features from the output. A novel Learnable Sinusoidal Positional Encoding is also employed to impose a continuity inductive bias for images. Segtran consists of five components: (1) a CNN backbone to extract image features, (2) input/output feature pyramids for upsampling, (3) the Learnable Sinusoidal Positional Encoding, (4) Squeeze-and-Expansion Transformer layers to contextualize features, and (5) a segmentation head. The pretrained CNN backbone is first utilized to learn feature maps from the input medical images. Since the input features to the Transformer are of a low spatial resolution, the authors increase their spatial resolution with an input Feature Pyramid Network (FPN) (Liu et al., 2018a), which upsamples the feature maps by bilinear interpolation. Then the proposed Learnable Sinusoidal Positional Encoding is added to the visual features to inject spatial information. In contrast to the previous two mainstream PE schemes (Carion et al., 2020; Dosovitskiy et al., 2020), the new positional embedding vector, a combination of sine


and cosine functions of linear transformations of $(x, y)$, brings in the continuity bias with adaptability. The encoding varies gradually with the pixel coordinates; thus, close units receive similar positional encodings, increasing the attention weights between them towards higher values. The encoding vectors generated from the addition of positional encodings and visual features are then fed into the Transformer.
The novel Transformer architecture combines a Squeezed Attention Block (SAB) (Lee et al., 2019) with an Expanded Attention Block. The method employs the Induced Set Attention Block (ISAB) proposed by Lee et al. (2019) as the squeezed attention block. The Squeezed Attention Block computes attention between the input and inducing points and compresses the attention matrices to lower-rank matrices, reducing noise and overfitting. The Expanded Attention Block (EAB), a mixture-of-experts model, outputs $N_m$ sets of complete contextualized features from $N_m$ modes. Each mode is an individual single-head Transformer and shares the same feature space with the others, as opposed to multi-head attention, in which each head outputs an exclusive feature subset. All features are then aggregated into one set using dynamic mode attention, which is obtained by applying a linear transformation to each mode feature and taking a softmax over all modes.
Compared with representative existing methods in the experiments, Segtran consistently achieved the highest segmentation accuracy and exhibited good cross-domain generalization capabilities.

Fig. 17. Segtran network extracts image features using a CNN backbone and combines the features with the position encoding of pixels flattened into a series of local feature vectors. Multiple squeezed and expanded Transformer layers are stacked to process the local feature vectors. Finally, an output FPN after the Transformer upsamples the features to generate the final prediction (Li et al., 2021c).

4.2.3. Transformer: Skip connection

In this section, Transformer blocks are incorporated into the skip connections to facilitate the transmission of detailed information from the encoder to the decoder.
Although Transformer-based methods overcome the limitation of capturing long-range dependency, they present extreme computational and spatial complexity in analyzing high-resolution volumetric image data. Some studies (Carion et al., 2020; Chen et al., 2021b) employ hybrid structures, fusing CNNs with Transformers in an attempt to reduce the training requirement of huge datasets. The recent approach, TransUNet (Chen et al., 2021b), shows good performance. However, it is difficult to optimize the model due to the inner self-attention mechanism of the vanilla Transformer. First, it takes a long time to train the attention, which is caused by initially distributing attention uniformly to each pixel within the salient regions (Vaswani et al., 2017). Second, a vanilla Transformer struggles to handle multi-scale and high-resolution feature maps due to its high computational cost.
Motivated by this, Xie et al. (2021) propose a novel encoder–decoder framework, CoTr, which bridges CNN and Transformer. The architecture exploits a CNN to learn feature representations. An efficient deformable self-attention mechanism in the Transformer is designed to model the global context from the extracted feature maps, which reduces the computational complexity and enables the model to process high-resolution features. The final segmentation results are generated by the decoder.
As shown in Fig. 18, the DeTrans-encoder consists of an input-to-sequence layer and multiple DeTrans layers. The input-to-sequence layer first flattens the feature maps at different resolutions extracted from the CNN-encoder into 1D sequences $\{f_l\}_{l=1}^{L}$. Then the corresponding 3D positional encoding sequence $p_l$ is added to each of the flattened sequences $f_l$ to complement the spatial information. The combined sequence is fed as input into the DeTrans layers. Each DeTrans layer is a composition of an MS-DMSA and a Feed-Forward Network (FFN). In contrast to the self-attention mechanism, which casts attention to all possible locations, the proposed MS-DMSA layer only attends to a small set of key sampling locations around a reference location. As a result, it achieves faster convergence and lower computational complexity. A skip connection is utilized after each DeTrans layer to preserve the low-level details of local information. The output of the DeTrans-encoder is successively upsampled by the pure CNN decoder to restore the original resolution. Besides, they apply skip connections and a deep supervision strategy to add fine-grained details and auxiliary losses to the prediction outputs.
The results of the experiments indicate that CoTr with the hybrid architecture has superior performance over models with a pure CNN encoder or a pure Transformer encoder. It also outperforms other hybrid methods like TransUNet (Chen et al., 2021b) in processing multi-scale 3D medical images with reduced parameters and complexity.
HiFormer (Heidari et al., 2023) is proposed to aggregate a fusion module in the skip connections to learn richer representations. Fig. 19 demonstrates the end-to-end network structure of the strategy, which incorporates the global dependencies learned with the Swin Transformer and the detailed local features extracted by the CNN modules. The encoder is composed of two hierarchical CNN and Swin Transformer modules and the novel Double-Level Fusion module (DLF module). First, medical images are fed into a CNN module to obtain a local fine-grained semantic representation. After the CNN layer captures the shallow feature layers, HiFormer introduces the Swin Transformer modules to complement the global feature information. The Swin Transformer module employs windows of different sizes to learn the dependencies between multiple scales. To reuse the shallow and deep multi-scale feature information of the encoder, HiFormer designs a novel skip connection module, the DLF module. The deep-level semantic and shallow-level localization information are fed into the DLF module and fused by a cross-attention mechanism. Finally, both generated feature maps are passed into the decoder to produce the final segmentation prediction. The experiments conducted on the Synapse dataset (Landman et al., 2015), SegPC (Gupta et al., 2023), and the ISIC 2017 dataset (Codella et al., 2018) demonstrate the superior learning ability of HiFormer. Moreover, the lightweight model with fewer parameters also exceeds CNN-based methods and previous Transformer-based approaches with lower computational complexity.

4.3. Other architectures

Most ViT-based models rely on pre-training on large natural image datasets to obtain pre-trained weights and then solve downstream tasks by transfer learning. Several works explore training in a self-supervised or semi-supervised manner to efficiently utilize medical image datasets of limited size or datasets without manual labels. Furthermore, some approaches apply Transformers to the design of architectures that perform medical image segmentation, instead of using the Transformers to act directly on the input image.


Fig. 18. Overview of the CoTr (Xie et al., 2021) architecture. It is composed of a CNN-encoder, a DeTrans-encoder and a decoder. The CNN-encoder models the local information
of the input images and provides the outputs at each stage. The outputs of different resolutions are flattened, fused and passed through the Deformable Transformer Layers along
with positional encoding. The decoder reshapes the processed sequences from the DeTrans-encoder and produces the final predictions.
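The central idea of the deformable attention used in CoTr's DeTrans layers is that each query attends to only a small set of sampled locations, predicted as offsets around a reference point, rather than to the full feature map. The following is a simplified single-head, single-scale PyTorch sketch of this idea; it is illustrative only and not the authors' MS-DMSA implementation (module and parameter names are our own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Each query attends to n_points learned sampling locations instead of
    the whole feature map (single head, single scale, for illustration)."""
    def __init__(self, dim, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offset_proj = nn.Linear(dim, 2 * n_points)  # (dx, dy) per sampling point
        self.weight_proj = nn.Linear(dim, n_points)      # attention weight per point
        self.value_proj = nn.Conv2d(dim, dim, 1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, feat, ref_xy):
        # feat: (B, C, H, W) feature map; ref_xy: (B, N, 2) reference points in [-1, 1]
        B, C, H, W = feat.shape
        N = ref_xy.shape[1]
        # query features sampled at the reference locations
        q = F.grid_sample(feat, ref_xy.view(B, N, 1, 2), align_corners=False)  # (B, C, N, 1)
        q = q.squeeze(-1).transpose(1, 2)                                       # (B, N, C)
        offsets = self.offset_proj(q).view(B, N, self.n_points, 2) * 0.1        # small offsets
        weights = self.weight_proj(q).softmax(dim=-1)                           # (B, N, K)
        sample_xy = (ref_xy.unsqueeze(2) + offsets).clamp(-1, 1)                # (B, N, K, 2)
        v = self.value_proj(feat)
        sampled = F.grid_sample(v, sample_xy, align_corners=False)              # (B, C, N, K)
        out = (sampled * weights.unsqueeze(1)).sum(-1).transpose(1, 2)          # (B, N, C)
        return self.out_proj(out)
```

Because only a handful of locations are sampled per query, the cost grows with the number of queries and sampling points rather than with the full spatial resolution, which is the property CoTr exploits for multi-scale 3D feature maps.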

Fig. 19. HiFormer comprises the CNN-Transformer encoder, the CNN-based decoder and the Double-Level Fusion Module (DLF). The feature layers of the shallowest level 𝑝𝑙
and of the deepest level 𝑝𝑠 are fed into the DLF module for the fusion of hierarchical information. Blue blocks and orange blocks refer to Swin Transformer and CNN modules,
respectively (Heidari et al., 2023).
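HiFormer's DLF module fuses deep-level semantic tokens with shallow-level localization tokens via cross-attention. As a rough sketch of such a fusion block (a generic cross-attention layer with assumed shapes and names, not the actual DLF design):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Toy sketch: deep (semantic) tokens query shallow (localization) tokens,
    so the fused output mixes information from both levels."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, deep_tokens, shallow_tokens):
        # deep_tokens: (B, Nd, C), shallow_tokens: (B, Ns, C)
        q = self.norm_q(deep_tokens)
        kv = self.norm_kv(shallow_tokens)
        fused, _ = self.attn(q, kv, kv)   # queries from the deep level, keys/values from the shallow level
        fused = deep_tokens + fused        # residual connection
        return fused + self.mlp(fused)
```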

Unlike the previously proposed methods that employ the Transformers to act directly on the medical image for feature extraction, this method (Yang et al., 2021a) adopts AutoML to automatically design the network architecture without much human heuristics or assumptions, where the Transformer is applied to encode the embedding vector regarding the architecture configurations. The approach reduces the workload of algorithm design by automatically estimating ''almost'' all components of the framework instead of manually designing the network and training strategies, which improves segmentation performance at the same time.
The proposed Transformer-based T-AutoML, inspired by SpineNet (Du et al., 2020), leverages neural architecture search (NAS) with a larger search space to optimize the selection of the network connections. This framework can connect the feature maps at different spatial levels of the network with one another arbitrarily, in contrast to previous methods that only search for encoder–decoder U-shaped networks (Liu et al., 2019; Bae et al., 2019; Kim et al., 2019). The candidate blocks in the network consist of 3D residual blocks, 3D bottleneck blocks, and 3D axial-attention blocks. The residual blocks and bottleneck blocks are effective in alleviating vanishing gradients, and the axial-attention blocks are applied to model the long-range dependency in 2D medical images. An additional upsampling layer (linear interpolation) is utilized at the end of the architecture to produce feature maps at the original volume size.
To search for the optimal architecture and training configuration, the authors first encode the necessary components of the search space into a one-dimensional vector $v$. The search space contains candidates of different configurations with regard to data augmentation, learning rates, learning rate schedulers, loss functions, the optimizer, the number and spatial resolution of blocks, and block types.
After obtaining the encoding vector $v$, the proposed predictor estimates the binary relation between the validation accuracy values of $v_i$ and $v_j$. The predictor employs a Transformer encoder to encode vectors $v$ of varying lengths into feature maps of a fixed resolution. The feature maps are then passed through multiple FC layers to generate the binary relation predictions, denoted as $GT_{v_i,v_j}$. Since the predictor is designed to rank the vectors with respect to their accuracy values and to estimate the relations, the actual values of the predicted accuracy do not need to be calculated for each vector. Thus, the new predictor requires less overall training time.
The experiments indicate that the proposed method achieves the state of the art (SOTA) in lesion segmentation tasks and shows superiority in transferring to different datasets.
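To illustrate the idea of a Transformer-based predictor that ranks candidate configurations by pairwise comparison rather than by regressing absolute accuracies, a minimal sketch could look as follows (the tokenization of $v$, the dimensions, and the head design are illustrative assumptions, not the T-AutoML implementation):

```python
import torch
import torch.nn as nn

class PairwiseRankingPredictor(nn.Module):
    """Sketch of a Transformer-based predictor estimating P(acc(v_i) > acc(v_j))
    for two encoded configurations v_i and v_j of possibly different lengths."""
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)  # each scalar entry of v becomes a token
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, 1))

    def encode(self, v):
        # v: (B, L) configuration vector (padded per batch); length L may vary
        tokens = self.embed(v.unsqueeze(-1))   # (B, L, d_model)
        feats = self.encoder(tokens)           # (B, L, d_model)
        return feats.mean(dim=1)               # fixed-size summary regardless of L

    def forward(self, v_i, v_j):
        zi, zj = self.encode(v_i), self.encode(v_j)
        return torch.sigmoid(self.head(torch.cat([zi, zj], dim=-1)))  # (B, 1)
```

Training such a predictor only needs the relative ordering of validation accuracies for sampled configuration pairs, which is why the absolute accuracy of each candidate never has to be regressed.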


Fig. 20. The model performs the semi-supervised medical image segmentation task. The regularization scheme between CNN and Transformer is referred to as Cross Teaching. L
denotes the labeled data and U denotes the unlabeled data. The cross-teaching employs a bidirectional loss function: one path is from the CNN branch to the Transformer branch,
and the other is from the Transformer to the CNN. A Transformer is applied for complementary training instead of prediction generation (Luo et al., 2022).

Despite the promising results achieved by the CNN- and Transformer-based methods with large-scale images, these approaches require expert labeling at the pixel/voxel level. Expensive time costs and manual annotation limit the size of medical image datasets. Due to this dilemma, the proposed semi-supervised segmentation (Luo et al., 2022) provides a low-cost and practical scheme, called Cross Teaching between CNN and Transformer, to train effective models using a small amount of correctly labeled data and a large amount of unlabeled or coarsely labeled data.
Inspired by existing works for semi-supervised learning (Qiao et al., 2018; Han et al., 2018; Chen et al., 2021e), which introduce perturbation at different levels and encourage predictions to be consistent during the training stage, the designed cross teaching introduces perturbation at both the learning-paradigm level and the output level. As shown in Fig. 20, each image within the training set, containing labeled and unlabeled images, is fed into two different learning paradigms: a CNN and a Transformer, respectively. For the unlabeled raw images, the cross-teaching scheme allows cross supervision between a CNN ($f_{\phi_c}(\cdot)$) and a Transformer ($f_{\phi_t}(\cdot)$), which aims at integrating, at the output level, the Transformer's ability to model long-range dependency and the CNN's ability to learn local information.
The unlabeled data initially passes through the CNN and the Transformer, respectively, to generate the predictions $p_i^c$ and $p_i^t$:
$$p_i^c = f_{\phi_c}(x_i); \quad p_i^t = f_{\phi_t}(x_i) \qquad (6)$$
Then the pseudo labels $pl_i^c$ and $pl_i^t$ are produced in this manner:
$$pl_i^c = \arg\max(p_i^t); \quad pl_i^t = \arg\max(p_i^c) \qquad (7)$$
The pseudo label $pl_i^c$ used for the CNN training is generated by the Transformer. Similarly, the CNN model provides pseudo labels for the Transformer training. The cross-teaching loss for the unlabeled data is defined as follows:
$$L_{ctl} = \underbrace{L_{Dice}(p_i^c, pl_i^c)}_{\text{supervision for CNNs}} + \underbrace{L_{Dice}(p_i^t, pl_i^t)}_{\text{supervision for Transformers}} \qquad (8)$$
which is a bidirectional loss function. One direction of the data stream is from the CNN to the Transformer, and the other direction is from the Transformer to the CNN. For the labeled data, the CNN and Transformer are supervised by the ground truth. The commonly used supervised loss functions, i.e., the cross-entropy loss and the Dice loss, are employed to update the model parameters:
$$L_{sup} = L_{ce}(p_i, y_i) + L_{Dice}(p_i, y_i) \qquad (9)$$
where $p_i$, $y_i$ represent the prediction and the label of image $x_i$. The overall objective combining the cross-teaching branch and the supervised branch is defined as:
$$L_{total} = L_{sup} + \lambda L_{ctl} \qquad (10)$$
where $\lambda$ is a weight factor defined by a time-dependent Gaussian warm-up function (Yu et al., 2019; Luo et al., 2022):
$$\lambda(t) = 0.1 \cdot e^{-5\left(1 - \frac{t_i}{t_{total}}\right)^2} \qquad (11)$$
The results of the ablation study indicate that the combination of CNN and Transformer in a cross-teaching way shows superiority over existing semi-supervised methods. Furthermore, the novel method has the potential to reduce the labeling cost by learning from limited labeled data and large-scale unlabeled data. However, it is observed that achieving SOTA via semi-supervised approaches remains a significant challenge.
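As a minimal PyTorch-style sketch of the objective in Eqs. (6)–(11), the function below computes the supervised and cross-teaching terms for one batch (the networks, tensor shapes, and the Dice-loss helper are illustrative assumptions rather than the authors' code):

```python
import math
import torch
import torch.nn.functional as F

def dice_loss(pred_soft, target_onehot, eps=1e-5):
    # pred_soft, target_onehot: (B, C, H, W)
    inter = (pred_soft * target_onehot).sum(dim=(2, 3))
    union = pred_soft.sum(dim=(2, 3)) + target_onehot.sum(dim=(2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def lam(t, t_total):
    # time-dependent Gaussian warm-up weight, Eq. (11)
    return 0.1 * math.exp(-5.0 * (1.0 - t / t_total) ** 2)

def cross_teaching_step(cnn, transformer, x_lab, y_lab, x_unlab, num_classes, t, t_total):
    # y_lab: (B, H, W) integer ground-truth labels
    y_onehot = F.one_hot(y_lab, num_classes).permute(0, 3, 1, 2).float()
    # supervised branch, Eq. (9): both networks are trained on the labeled batch
    loss_sup = 0.0
    for model in (cnn, transformer):
        logits = model(x_lab)
        loss_sup = loss_sup + F.cross_entropy(logits, y_lab) \
                            + dice_loss(logits.softmax(1), y_onehot)
    # cross-teaching branch, Eqs. (6)-(8): pseudo labels come from the other network
    p_c = cnn(x_unlab).softmax(1)
    p_t = transformer(x_unlab).softmax(1)
    pl_c = F.one_hot(p_t.argmax(1), num_classes).permute(0, 3, 1, 2).float()  # supervises the CNN
    pl_t = F.one_hot(p_c.argmax(1), num_classes).permute(0, 3, 1, 2).float()  # supervises the Transformer
    loss_ctl = dice_loss(p_c, pl_c) + dice_loss(p_t, pl_t)
    # total objective, Eq. (10)
    return loss_sup + lam(t, t_total) * loss_ctl
```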


Fig. 21. Illustration of MAE self pre-training. First, the ViT encoder is pre-trained with MAE: it is fed a random subset of patches, and the decoder of the
Transformer reconstructs the complete image as shown on the left. Then, the pre-trained ViT weights are transferred to the initialized segmentation encoder, as shown on the
right. Finally, the whole segmentation network, such as UNETR, is fine-tuned to perform the segmentation task (Zhou et al., 2022).

Zhou et al. (2022) hypothesize that the ability to aggregate contextual information is imperative to improve the performance of medical image analysis. Nonetheless, there is no ImageNet-scale medical image dataset for pre-training. Therefore, they investigate a novel self-pre-training paradigm based on the Masked Autoencoder (MAE), called MAE self pre-training, for medical image analysis; MAE is one of the masked image modeling (MIM) frameworks (Bao et al., 2022; Xie et al., 2022; El-Nouby et al., 2021; He et al., 2022a). MIM encourages the framework to restore a masked target by integrating information from the context; the main idea of MIM is masking and reconstructing: masking a set of image patches before they are input into the Transformer and reconstructing these masked patches at the output.
The pipeline for segmentation with MAE self-pre-training contains two stages. In the first stage (as shown on the left of Fig. 21), a ViT is pre-trained with MAE as the encoder. The input patches are divided into visible ones and masked ones, and the ViT encoder only acts on the visible patches. Compared to other MIM methods, MAE does not employ mask tokens in the encoder, which saves time and allows for faster pre-training. A lightweight Transformer decoder is appended to reconstruct the full image. The decoder is only an auxiliary part used for pre-training and is not applied in downstream tasks.
In the second stage (as shown on the right of Fig. 21), the pre-trained ViT weights are transferred to initialize the segmentation encoder. Then, task-specific heads are appended to perform downstream tasks. The whole segmentation network, e.g., UNETR, is fine-tuned to perform the segmentation task.
The experiments, including the MAE pre-training and the downstream task, are conducted to evaluate the performance of the proposed method. The results show that MAE can recover the lost information in the masked input patches. MAE pre-training enables the model to improve its classification and segmentation performance on medical image analysis tasks, surpassing the ImageNet pre-trained model and reaching SOTA.

4.4. Discussion and conclusion

This section comprehensively reviewed around 17 Transformer-based models for medical image segmentation, presented in Sections 4.1 to 4.3. Some representative self-supervised strategies (Tang et al., 2022; Zhou et al., 2021) for training Transformer modules were also introduced. We provide information on the reviewed segmentation approaches regarding the architecture type, modality, organ, input size, pre-training manner, datasets, metrics, and year in Table 4. Table 5 also lists the methods along with their number of parameters, contributions, and highlights. The comparison results of some SOTA methods on the Synapse dataset are presented in Table 6, showcasing their respective performance metrics. Additionally, the evaluation of SOTA approaches on the ISIC 2017, ISIC 2018, and PH2 datasets is demonstrated in Table 7. The tables serve as valuable references for evaluating the relative performance and suitability of different approaches in medical image segmentation. Moreover, inference time and GPU usage are crucial performance factors for model selection. DAE-Former excels with a fast 46.01 ms inference time and 6.42 GB of GPU memory. MERIT, while effective, takes longer at 93.13 ms and uses 7.78 GB. D-LKA, with a 113.18 ms inference time, balances GPU memory well at 5.22 GB. TransCASCADE offers a compromise with a 53.18 ms inference time and 3.75 GB of GPU memory. These metrics, alongside the parameter numbers and FLOP counts, guide model selection for efficient medical image segmentation. It is important to note that these assessments were conducted using the same test setup and configurations as those employed in the image classification experiments in Section 3.3.
ViT-based works offer solutions in a broad range of multimodal 2D and 3D tasks. Most of the approaches demonstrate superior results over CNN-based segmentation models on benchmark medical datasets. Despite the SOTA performance Transformer-based networks have achieved, there are some challenges in deploying Transformer-based models at present. The first challenge is the high computational burden due to the relatively large number of parameters of Transformer-based models (Azad et al., 2022c). The reason is that the time and space complexity of the attention mechanism is quadratic in the sequence length. For example, CNN-based models such as U-Net (Ronneberger et al., 2015) require 3.7M parameters (Valanarasu et al., 2021) to reach a Dice score of 74.68 (Chen et al., 2021b), whereas TransUNet, which achieves a Dice score of 77.48, needs 96.07M parameters (Hatamizadeh et al., 2022b). The researchers have to meet the resulting high demand for GPU resources. Thus, several novel approaches, such as the Swin Transformer employed in Swin-Unet (Cao et al., 2022), the volume-based Transformer utilized in nnFormer (Zhou et al., 2021), and the efficient self-attention module in MISSFormer (Huang et al., 2022b), are proposed to simplify the computation of Transformer models. Facilitating the efficiency of models will play a crucial role in future research. We also note that most existing methods require pre-training on the ImageNet dataset to obtain pre-trained weights for the following downstream tasks. However, natural image datasets and medical datasets differ dramatically from one another, which may impact the final performance of extracting medical features. Meanwhile, pre-training leads to high computational costs, which hinders the training of models in practice. Multiple segmentation networks that can be trained from scratch on the medical dataset are suggested as solutions, such as MISSFormer (Huang et al., 2022b). We expect more approaches to explore more efficient pre-training strategies or to avoid pre-training altogether. Furthermore, considering the limited size of some medical datasets, some approaches propose semi-supervised technologies or self-pre-training paradigms to reduce the dataset burden of training or pre-training. Nevertheless, their performance is still not comparable to that of fully-supervised models. Designing semi-supervised models with improved accuracy in this direction requires more attention.

5. Medical image reconstruction

3D medical imaging is a clinical breakthrough and is very popular in medical diagnosis and follow-up after treatment. In Computed Tomography (CT), Single Photon Emission Computed Tomography (SPECT), and Positron Emission Tomography (PET), the imaging process relies on ionizing radiation (Seeram, 2015; Mathews et al., 2017), which implies a potential risk for the patient (Brenner and Hall, 2007). A non-invasive 3D imaging technique is Magnetic Resonance Imaging (MRI), which does not rely on ionizing radiation. However, image acquisition may take longer and confines the patient in a discomforting narrow tube (Hyun et al., 2018). Medical image reconstruction, which recovers 3D volumetric datasets from the acquired data, is one of the essential components of 3D medical imaging. The primary objective of 3D image reconstruction is to generate high-quality volumetric images for clinical


Table 4
An overview of the reviewed Transformer-based medical image Segmentation models.
Method Modality Organ Type Pre-trained Datasets Metrics Year
module: Type
Pure
1
Swin-Unet (Cao et al., 2022) CT Multi- 2D ViT: Synapse (Landman et al., 2015) Dice 2021
2
organ Supervised ACDC (Bernard et al., 2018)
nnFormer (Zhou et al., 2021) CT Multi- 3D ViT: 1 Synapse (Landman et al., 2015) Dice 2021
2
MRI organ Supervised ACDC (Bernard et al., 2018)
MISSFormer (Huang et al., 2022b) CT Multi- 2D ✗ 1 Synapse (Landman et al., 2015) Dice 2021
MRI organ 2 ACDC (Bernard et al., 2018) Hausdorff
distance
1
TransDeepLab (Azad et al., 2022d) CT Multi- 2D ViT: Synapse (Landman et al., 2015) Dice 2022
2
Dermoscopy organ Supervised ISIC 2017, 2018 (Codella et al., Hausdorff
Skin 2018, 2019) distance
3 PH2 (Mendonça et al., 2013)

Encoder
TransUNet (Chen et al., 2021b) CT Multi- 2D ViT: 1 Synapse (Landman et al., 2015) Dice 2021
2
MRI organ Supervised ACDC (Bernard et al., 2018) Hausdorff
distance
TransBTS (Wang et al., 2021d) MRI Brain 3D ✗ BraTS 19-20 (Menze et al., 2014; Dice 2021
Bakas et al., 2017, 2018) Hausdorff
distance
1
TransFuse (Zhang et al., 2021c) Colonoscopy Multi- 2D ViT: Synapse (Landman et al., 2015) Dice 2021
organ 3D Supervised 2 ACDC (Bernard et al., 2018) Hausdorff
distance
1
MedT (Valanarasu et al., 2021) Microscopy Multi- 2D ✗ Brain US (Private) F1 2021
Ultrasound organ 2 GLAS (Sirinukunwattana et al.,
2017)
3 MoNuSeg (Kumar et al., 2017)

1
UNETR (Hatamizadeh et al., 2022b) CT Brain, 3D ✗ Synapse (Landman et al., 2015) Dice 2021
2
MRI Spleen MSD (Simpson et al., 2019a) Hausdorff
distance
Swin UNETR (Hatamizadeh et al., 2022a) MRI Brain 3D ViT: BraTS 21 (Baid et al., 2021) Dice 2022
Supervised Hausdorff
distance
Swin UNETR++ (Tang et al., 2022) CT Multi- 3D ViT: Self- 1 BTCV (Landman et al., 2015) Dice 2022
2
organ supervised MSD (Simpson et al., 2019a) NSD
Skip connection
CoTr (Xie et al., 2021) CT Multi- 3D ✗ Synapse (Landman et al., 2015) Dice 2021
organ
HiFormer (Heidari et al., 2023) MRI Multi- 2D ViT: 1 Synapse (Landman et al., 2015) Dice 2022
2
Dermoscopy organ Supervised ISIC 2017, 2018 (Codella et al., Hausdorff
Microscopic Skin CNN: 2018, 2019) distance
3
Cells Supervised SegPC 2021 (Gupta et al., 2018,
2020; Gehlot et al., 2020)
Decoder
SegTran (Li et al., 2021c) Fundus Multi- 2D CNN: REFUGE 20 (Orlando et al., 2020) Dice 2021
1
MRI organ 3D Supervised BraTS 19 (Menze et al., 2014;
X-Colonoscopy Bakas et al., 2017, 2018)
2 X-CVC (Fan et al., 2020)
3 KVASIR (Jha et al., 2020)

Other architectures
T-AutoML (Yang et al., 2021a) CT Liver 3D ✗ MSD 2019 (Simpson et al., Dice 2021
and 2019b) NSD
lung
tumor
Cross teaching (Luo et al., 2022) MRI Multi- 2D ✗ ACDC (Bernard et al., 2018) Dice 2022
organ Hausdorff
distance
Self-pretraining with MAE (Zhou et al., 2022) CT Lung 3D ViT: 1 ChestX-ray14
(Wang et al., 2017) Dice 2022
MRI Brain supervised 2 Synapse
(Simpson et al., 2019b) Hausdorff
3
X-ray Multi- MSD 2019 (Simpson et al., distance
organ 2019a)


Table 5
A brief description of the reviewed Transformer-based medical image segmentation models. The unreported number of parameters indicates that the value was not mentioned in
the paper, and the code was unavailable.
Method # Params Contributions Highlights
Pure
Swin-Unet (Cao et al., 2022) – ∙ Builds a pure Transformer model with symmetric ∙ The results of extensive experiments on multi-organ
Encoder–Decoder architecture based on the and multi-modal datasets show the good
Swin-Transformer block connected via skip generalization ability of the model.
connections. ∙ Pre-trained on ImageNet rather than medical image
∙ Proposes patch merging layers and patch expanding data, which may result in sub-optimal performance.
layers to perform downsampling and upsampling
without convolution or interpolation operation.
nnFormer (Zhou et al., 2021) 158.92M ∙ Proposes a powerful segmentation model with an ∙ Volume-based operations help to reduce the
interleaved architecture (stem) based on the empirical computational complexity.
combination of self-attention and convolution. ∙ Pre-trained on ImageNet rather than medical image
∙ Proposes a volume-based multi-head self-attention data, which may result in sub-optimal performance.
(V-MSA) to reduce computational complexity.
MISSFormer (Huang et al., 2022b) – ∙ Proposes the Enhanced Transformer Block based on ∙ The model can be trained from scratch without the
the Enhanced Mix-FFN and the Efficient Self-attention pretraining step on ImageNet.
module. ∙ Less computational burden due to the novel design
∙ Proposes the Enhanced Transformer Context Bridge of the Efficient Self-attention module.
built on the Enhanced Transformer Block to model
both the local and global feature representation and
fuse multi-scale features.
TransDeepLab (Azad et al., 2022d) 21.14M ∙ Proposes the encoder–decoder DeepLabv3+ ∙ The first attempt to combine the Swin-Transformer
architecture based on Swin-Transformer. with DeepLab architecture for medical image
∙ Proposes the cross-contextual attention to adaptively segmentation.
fuse multi-scale representation. ∙ A lightweight model with only 21.14M parameters
compared with Swin-Unet (Cao et al., 2022), the
original DeepLab model (Chen et al., 2017a) and
TransUNet (Chen et al., 2021b).
Encoder
TransUNet (Chen et al., 2021b) 96.07M ∙ Proposes the first CNN-Transformer hybrid network ∙ TransUNet fully exploits the strong global context
for medical image segmentation, which establishes encoded from the Transformer and local semantics
self-attention mechanisms from the perspective of from the CNN module.
sequence-to-sequence prediction. ∙ It presents the generalization ability on
∙ Proposes a cascaded upsampler (CUP) which multi-modalities.
comprises several upsampling blocks to generate the ∙ The approach allows the segmentation of 2D and 3D
prediction results. medical images.
TransBTS (Wang et al., 2021d) Moderate ∙ Proposes a novel encoder–decoder framework ∙ The method can model the long-range dependencies
TransBTS: TransBTS that integrates Transformer with 3D CNN not only in spatial but also in the depth dimension for
32.99M for MRI Brain Tumor Segmentation. 3D volumetric segmentation.
Lightweight ∙ TransBTS can be trained on the task-specific dataset
TransBTS: without the dependence on pre-trained weights.
15.14M ∙ TransBTS is a moderate-size model that outperforms
in terms of model complexity with 32.99M parameters
and 33G FLOPs. Furthermore, the vanilla Transformer
can be replaced with other variants to reduce the
computation complexity.
TransFuse (Zhang et al., 2021c) TransFuse-S: ∙ Proposes the first parallel-in-branch architecture — ∙ The architecture does not require very deep nets,
26.3M TransFuse to capture both low-level global features which alleviates gradient vanishing and feature
and high-level fine-grained semantic details. diminishing reuse problems.
∙ Proposes BiFusion module in order to fuse the ∙ It improves performance by reducing parameters and
feature representation from the Transformer branch increasing inference speed, allowing deployment on
with the CNN branch. both the cloud and the edge.
∙ The CNN branch is flexible to use any off-the-shelf
CNN network.
4
It can be applied to both 2D and 3D medical image
segmentation.
MedT (Valanarasu et al., 2021) 1.4M ∙ Proposes a gated axial-attention model that ∙ The proposed method does not require pre-training
introduces an additional control mechanism to the on large-scale datasets compared to other
self-attention module. transform-based models.
∙ Proposes a LoGo (Local-Global) training strategy for ∙ The results of predictions are more precise compared
boosting segmentation performance by simultaneously to the full attention model.
training a shallow global branch and a deep local
branch.

(continued on next page)


Table 5 (continued).
Method # Params Contributions Highlights
UNETR (Hatamizadeh et al., 2022b) 92.58M ∙ Proposes a novel architecture to address the task of ∙ UNETR shows moderate model complexity while
3D volumetric medical image segmentation. outperforming these Transformer-based and
∙ Proposes a new architecture where the CNN-based models.
Transformer-based encoder learns long-range ∙ The inference time of UNETR is significantly faster
dependencies and the CNN-based decoder utilizes skip than Transformer-based models.
connections to merge the outputs of a Transformer at ∙ They did not use any pre-trained weights for the
each resolution level with the upsampling part. Transformer backbone.
Swin UNETR (Hatamizadeh et al., 2022a) 61.98M ∙ Proposes a novel segmentation model, Swin UNETR, ∙ The FLOPs of Swin UNETR significantly grow
based on the design of UNETR and Swin Transformers. compared to that of UNETR and TransBTS.
∙ The Swin Transformer is suitable for the
downstream tasks.
Swin UNETR++ (Hatamizadeh et al., 2022a) 61.98M ∙ Presents a new self-supervised learning approach ∙ Fine-tuning the pretrained Swin UNETR model leads
that capitalizes on the Swin UNETR model, harnessing to improved accuracy, faster convergence, and
its multi-resolution encoder for efficient pre-training. reduced annotation effort compared to training from
∙ Introduces a pre-trained robust model using the randomly initialized weights.
proposed encoder and proxy tasks on a dataset of ∙ Researchers can benefit from the pre-trained encoder
5050 publicly available CT images, resulting in a in transfer learning for various medical imaging
powerful feature representation for various medical analysis tasks, including classification and detection.
image analysis tasks.
Skip connection
CoTr (Xie et al., 2021) 46.51M ∙ Proposes a hybrid framework that bridges a ∙ The deformable mechanism in CoTr reduces
convolutional neural network and a Transformer for computational and spatial complexities, allowing the
accurate 3D medical image segmentation. network to model high-resolution and multi-scale
∙ Proposes the deformable Transformer (DeTrans) that feature maps.
employs the multi-scale deformable self-attention
mechanism (MS-DMSA) to model the long-range
dependency efficiently.
HiFormer (Heidari et al., 2023) HiFormer-S: ∙ Proposes an encoder–decoder architecture that ∙ Fewer parameters and lower computational cost.
23.25M bridges a CNN and a Transformer for medical image
HiFormer-B: segmentation.
25.51M ∙ Proposes a Double-level Fusion module in the skip
HiFormer-L: connection.
29.52M
Decoder
SegTran (Li et al., 2021c) 86.03M ∙ Proposes a novel Transformer design, ∙ Compared to U-Net and DeepLabV3+, Segtran has
Squeeze-and-Expansion Transformer, in which a the least performance degradation, showing the best
squeezed attention block helps regularize the huge cross-domain generalization when evaluated on
attention matrix, and an expansion block learns datasets of drastically different characteristics.
diversified representations.
∙ Proposes a learnable sinusoidal positional encoding
that imposes a continuity inductive bias for the
Transformer.
Other architectures
T-AutoML (Yang et al., 2021a) 16.96M ∙ Proposes the first automated machine learning ∙ The method is effectively transferable to different
algorithm, T-AutoML, which automatically estimates datasets.
‘‘almost’’ all components of a deep learning solution ∙ The applied AutoML alleviates the need for the
for lesion segmentation in 3D medical images. manual design of network structures.
∙ Proposes a new predictor-based search method in a ∙ The intrinsic limitations of AutoML.
new search space that searches for the best neural
architecture and the best combination of
hyperparameters and data augmentation strategies
simultaneously.
Cross teaching (Luo et al., 2022) – ∙ Proposes an efficient regularization scheme for ∙ The training process requires less data cost by
semi-supervised medical image segmentation where semi-supervision.
the prediction of a network serves as the pseudo label ∙ The framework contains components with low
to supervise the other network. complexity and simple training strategies compared to
∙ The proposed method is the first attempt to apply other semi-supervised learning methods.
the Transformer to perform the semi-supervised ∙ The proposed semi-supervised segmentation still
medical segmentation utilizing the unlabeled data. cannot achieve the state-of-the-art (SOTA) compared
with the fully-supervised approaches.
Self-pretraining with MAE (Zhou et al., 2022) – ∙ Proposes a self-pre-training paradigm with MAE for ∙ The proposed paradigm demonstrates its
medical images where the pre-training process of the effectiveness in limited data scenarios.
model uses the same data as the target dataset.

usage at minimal cost and radiation exposure, whilst also addressing potential artifacts inherent to the physical acquisition process. Image reconstruction solves an inverse problem that is generally challenging due to its large-scale and ill-posed nature (Zhang and Dong, 2020). In medical imaging, there are ongoing research efforts to reduce the acquisition time (i.e., to reduce cost and potential movement artifacts) as well as the radiation dose. However, lowering the radiation dose results in higher noise levels and reduced contrast, which poses a challenge for 3D image reconstruction.
Vision Transformers (ViTs) have effectively demonstrated possible solutions to address these challenges. We categorize the literature in this domain into low dose enhancement, sparse-view reconstruction, undersampled reconstruction, and super-resolution reconstruction. This section will overview some of the SOTA Transformer-based studies that fit


Table 6
Comparison results of some segmentation methods on the Synapse (Landman et al., 2015) dataset. The methods are ordered based on their dice score.
Methods Params (M) FLOPs (G) Dice ↑ HD95 ↓ Aorta Gallbladder Kidney(L) Kidney(R) Liver Pancreas Spleen Stomach
TransUNet (Chen et al., 2021b) 96.07 88.91 77.49 31.69 87.23 63.16 81.87 77.02 94.08 55.86 85.08 75.62
Swin-Unet (Cao et al., 2022) 27.17 6.16 79.13 21.55 85.47 66.53 83.28 79.61 94.29 56.58 90.66 76.60
TransDeepLab (Azad et al., 2022d) 21.14 16.31 80.16 21.25 86.04 69.16 84.08 79.88 93.53 61.19 89.00 78.40
HiFormer (Heidari et al., 2023) 25.51 8.05 80.39 14.70 86.21 65.23 85.23 79.77 94.61 59.52 90.99 81.08
PVT-CASCADE (Rahman and Marculescu, 2023a) 35.28 6.40 81.06 20.23 83.01 70.59 82.23 80.37 94.08 64.43 90.10 83.69
MISSFormer (Huang et al., 2022b) 42.46 9.89 81.96 18.20 86.99 68.65 85.21 82.00 94.41 65.67 91.92 80.81
DAEFormer (Azad et al., 2022c) 48.07 27.89 82.63 16.39 87.84 71.65 87.66 82.39 95.08 63.93 91.82 80.77
TransCASCADE (Rahman and Marculescu, 2023a) 123.49 – 82.68 17.34 86.63 68.48 87.66 84.56 94.43 65.33 90.79 83.52
ScaleFormer (Huang et al., 2022a) 111.60 48.93 82.86 16.81 88.73 74.97 86.36 83.31 95.12 64.85 89.40 80.14
D-LKA Net (Azad et al., 2023b) 101.64 19.92 84.27 20.04 88.34 73.79 88.38 84.92 94.88 67.71 91.22 84.94
MERIT (Rahman and Marculescu, 2023b) 147.86 33.31 84.90 13.22 87.71 74.40 87.79 84.85 95.26 71.81 92.01 85.38

Table 7
Performance comparison of several Transformer-based segmentation approaches on ISIC 2017 (Codella et al., 2018), ISIC 2018 (Codella et al., 2019), and PH2 (Mendonça et al.,
2013).
Methods Params (M) FLOPs (G) ISIC 2017 ISIC 2018 PH2
DSC SE SP ACC DSC SE SP ACC DSC SE SP ACC
TransUNet (Chen et al., 2021b) 96.07 88.91 0.8123 0.8263 0.9577 0.9207 0.8499 0.8578 0.9653 0.9452 0.8840 0.9063 0.9427 0.9200
TransNorm (Azad et al., 2022b) 117.98 33.41 0.8933 0.8532 0.9859 0.9582 0.8951 0.8750 0.9790 0.9580 0.9437 0.9438 0.9810 0.9723
Swin-Unet (Cao et al., 2022) 27.17 6.16 0.9183 0.9142 0.9798 0.9701 0.8946 0.9056 0.9798 0.9645 0.9449 0.9410 0.9564 0.9678
D-LKA Net (Azad et al., 2023b) 101.64 19.92 0.9254 0.9327 0.9793 0.9705 0.9177 0.9164 0.9773 0.9647 0.9490 0.9430 0.9775 0.9659
HiFormer (Heidari et al., 2023) 25.51 8.05 0.9253 0.9155 0.9840 0.9702 0.9102 0.9119 0.9755 0.9621 0.9460 0.9420 0.9772 0.9661

Fig. 22. An overview of medical image reconstruction taxonomies, categorized either by the task or by the location at which the Transformer is used within an architecture.

into our taxonomy. Figs. 22(a) and 22(b) demonstrate our proposed taxonomy for this field of study. Fig. 22(a) indicates the diversity of our taxonomy based on the medical imaging modalities we studied in this research. Fig. 22(b) shows where the Transformer is employed within the overviewed studies' pipelines.

5.1. Low dose enhancement

In a study named TransCT, Zhang et al. (2021e) used a very general intuition about image denoising: the noisy image is composed of high-frequency and low-frequency counterparts, $X = X_H + X_L$. Zhang et al. (2021e) claim that the noisy image's low-frequency counterpart contains two sub-components, the main image content and weakened image textures, which are entirely noise-free. They applied a Gaussian filter to the input image to decompose it into a high-frequency sub-band and a low-frequency sub-band. After this, they extracted content features $X_{L_c}$ and latent texture features $X_{L_t}$ by applying two shallow CNNs to the low-frequency counterpart of the input image. Simultaneously, they applied a sub-pixel layer to the high-frequency counterpart to transform it into a low-resolution image and extracted embedding features ($X_{H_f}$) by applying a shallow CNN. Then the resultant latent texture features ($X_{L_t}$) and the corresponding high-frequency representation are fed to the Transformer to remove noise from the high-frequency representation. Ultimately, they reconstruct the high-quality image piecewise. They showed that the latent texture features are beneficial for screening noise from the high-frequency domain.
In contrast to TransCT (Zhang et al., 2021e), Wang et al. proposed a convolution-free Token-to-Token vision Transformer-based Encoder–decoder Dilation network (TED-net) for CT image denoising (Wang et al., 2021b). Their approach is based on a U-Net encoder–decoder scheme enriched by different modules, i.e., a Basic Vision Transformer, Token-to-Token Dilation (T2TD), and (Inverse) Cyclic Shift blocks. Consider $y \in \mathbb{R}^{N\times N}$ a clean natural-dose CT image, $x \in \mathbb{R}^{N\times N}$ a noisy low-dose CT image, and $T: \mathbb{R}^{N\times N} \rightarrow \mathbb{R}^{N\times N}$ a Transformer-based denoising model. According to Fig. 23, after tokenizing $x$ and passing it through the Vision Transformer block to capture long-range dependencies, and to alleviate the absence of local inductive bias in Transformers, they employed Token-to-Token serialization (Yuan


Fig. 23. An overview of TED-net (Wang et al., 2021b). Tokenize and DeToken blocks are invertible operations that apply the process of patch embedding and converting patches
again to image, respectively. TB represents a standard Transformer block. (I)CSB denotes the (inverse) Cyclic Shift block to modify the feature map, nonetheless, the reverse
operation avoids pixel shifts in the final result. T2T block represents the Token-to-Token process (Yuan et al., 2021) to improve the spatial inductive bias of Transformers by
merging the neighboring tokens. The Dilated T2T (T2TD) block is used to refine contextual information further.

et al., 2021). Also, they utilized feature re-assembling with a Cyclic Shift block (CSB) to integrate more information. As is apparent from Fig. 23, all of these blocks are replicated in a symmetric decoder path, but instead of the CSB, the Inverse Cyclic Shift block (ICSB) is implemented to avoid pixel shifts in the final denoising results ($y = x + T(x)$). They reached SOTA results compared to CNN-based methods and a competitive benchmark with regard to TransCT (Zhang et al., 2021e).
Luthra et al. (2021) proposed a Transformer-based network, Eformer, to deal with low-dose CT images while concurrently using an edge enhancement paradigm to deliver more accurate and realistic denoised representations. Their architecture builds upon the LeWin (Locally-enhanced Window) Transformer block (Wang et al., 2022), which is accompanied by an edge enhancement module. The success of the Swin Transformer (Liu et al., 2021d) in capturing long-range dependencies with the window-based self-attention technique makes it a cornerstone in designing new Transformer blocks due to its linear computational complexity. The LeWin Transformer is one of these blocks: it captures global contextual information and, due to the presence of a depth-wise block in its structure, can also capture local context. Eformer's first step is the Sobel edge enhancement filter. In every encoder–decoder stage, convolutional features pass through the LeWin Transformer block, and downsampling and upsampling are done by convolution and deconvolution layers. Eformer's learning paradigm is a residual learning scheme, meaning it learns the noise representation rather than the denoised image, owing to the ease of optimizing a residual mapping.
Akin to Low-Dose CT (LDCT), Low-Dose PET (LDPET) is preferable to avoid the radiation risk, especially for cancer patients with a weakened immune system who require multiple PET scans during their treatment, at the cost of sacrificing the diagnostic accuracy of Standard-Dose PET (SDPET). Luo et al. (2021) proposed an end-to-end Generative Adversarial Network (GAN) based method integrated with a Transformer block, namely 3D Transformer-GAN, to reconstruct SDPET images from the corresponding LDPET images. To alleviate the inter-slice discontinuity problem of existing 2D methods, they designed their network to work with 3D PET data. Analogous to any GAN network, they used a generator network, an encoder–decoder, with a Transformer placed in the bottleneck of the generator to capture contextual information. Due to the computational overhead of Transformers, they did not build their method solely on them; instead, they placed a Transformer counterpart across the CNN layers of the generator to guarantee the extraction of both low-level spatial features and global semantic dependencies. They also introduced an adversarial loss term, in addition to their voxel-wise estimation error, to produce more realistic images.
In contrast with other works, Zhang et al. (2022) proposed leveraging PET/MRI data simultaneously for denoising low-count PET images, which is a crucial assessment for cancer treatment. PET is an emission computed tomography modality operating on positron annihilation radiation. Due to the foundation and requirements of PET scans, there is a severe risk of inducing secondary cancer through radiotracers. To reduce the side effects of this imaging process, there are two potential methods: reducing the radiotracer dose and shortening the patient's bed time. Both approaches, without a doubt, affect the image quality, with a decreased contrast-to-noise ratio and texture bias. Traditional low-count PET denoising approaches are based on Non-Local Means (NLM) (Buades et al., 2005), Block Matching 3D (BM3D) (Dabov et al., 2006), iterative methods (Wang et al., 2014), etc., which are firmly bound to hyperparameter tuning for new data or result in unnatural smoothing of the denoised images. Zhang et al. (2022) testify that simultaneous PET/MRI could boost one modality in terms of correct attenuation, motion, and partial volume effects; also, due to the high contrast among soft tissues in MRI, the denoising process of PET images becomes more straightforward. STFNet (Zhang et al., 2022) is a U-Net based structure with several modifications. They proposed a new Siamese encoder comprising a dual input flow for each modality in the encoding path. To obtain sufficient features from the different modalities, they used the Spatial Adaptive (SA) block, a dual path in each block with a residual block design, which consists of consecutive convolutional blocks and deformable convolution with fusion modulation. This module aims to learn more contextual features from each modality. To leverage global attention, they used a Transformer to produce pixel-to-pixel interactions between the PET and the MRI modality. After this integration, the fused features are input to two branches based on residual convolution blocks for PET denoising.


Fig. 24. An overview of CTformer (Wang et al., 2023a). This structure is analogous to the TED-net (Wang et al., 2021b) structure, the previous study by the same authors.

Table 8
Comparison result on NIH-AAPM-Mayo (McCollough et al., 2017) dataset in low dose enhancement task. LDE indicates the Low Dose
Enhancement task.
Methods Task Dataset SSIM ↑ RMSE ↓
Eformer (Luthra et al., 2021) LDE NIH-AAPM-Mayo (McCollough et al., 2017) 0.9861 0.0067
TransCT (Zhang et al., 2021e) LDE NIH-AAPM-Mayo (McCollough et al., 2017) 0.923 22.123
CTformer (Wang et al., 2023a) LDE NIH-AAPM-Mayo (McCollough et al., 2017) 0.9121 9.0233

Wang et al. (2023a) proposed an enhancement of their previous work, TED-net (Wang et al., 2021b): a convolution-free, solely Transformer-based network, namely CTformer. From Fig. 24, it is apparent that their network is an unsupervised residual-learning, U-Net-like encoder–decoder structure, rather than a direct mapping from LDCT to Normal-Dose CT (NDCT). The CTformer tries to compensate for the Transformers' deficiency in capturing inter-patch boundary information and spatial inductive bias with token rearrangement, T2T (Yuan et al., 2021). To do so, analogously to TED-net, they used dilation and cyclic shift blocks in the Token2Token block to broaden the receptive field, capturing more contextual information without increasing the computational cost.
Yang et al. (2022) were inspired by how the sinogram works and proposed the Sinogram Inner-Structure Transformer (SIST) (Fig. 25). The inner structure of the sinogram contains the unique characteristics of the sinogram domain. To exploit it, they mimic the global and local characteristics of the sinogram in a loss function based on the sinogram inner-structure, namely the Sinogram Inner-Structure Loss (SISL). The global inner-structure loss utilizes conjugate sampling pairs in CT, and the local inner-structure loss considers the second-order sparsity of sinograms. The amalgamation of these two terms is beneficial for reconstructing NDCT images while suppressing the noise. Due to the CT imaging mechanism, each row of the sinogram representation denotes the projection at a certain view. Naturally, this makes the sinogram suitable for a Transformer block that models the interaction between projections from diverse angles and captures contextual information. Therefore, the SIST module is applied to the raw sinogram input and captures structural information. Afterward, the unified network reconstructs the high-quality images in a residual policy with the image reconstruction module.
Table 8 presents the benchmark results for the LDCT task on the NIH-AAPM-Mayo (McCollough et al., 2017) dataset with respect to the SSIM and RMSE metrics for the methods overviewed in this study. For clarification, TED-net (Wang et al., 2021b) achieved better results than CTformer (Wang et al., 2023a), but since the two studies originate from the same authors and the architectures closely resemble each other, we preferred to include CTformer in the comparison table. This result endorses the capability of the pure Transformer-based Eformer (Luthra et al., 2021) method in reconstructing natural-dose CT images.

5.2. Sparse-view reconstruction

Due to the customary usage of CT images in medical diagnosis, another policy to lessen the side effects of X-ray radiation is acquiring fewer projections, known as sparse-view CT, which is a feasible and effective alternative to manipulating the standard radiation dose (Bian et al., 2010; Han and Ye, 2018). However, the resultant images from this method suffer from severe artifacts, and decreasing the number of projections demands profound techniques to reconstruct high-quality images. Wang et al. (2021a), with DuDoTrans, is the first paper that inspected the usage of Transformers in this field, and it was quite successful. Their intuition was to shed light on the global nature of the sinogram sampling process, which previous CNN architectures neglected. DuDoTrans, unlike the conventional iterative methods in this literature, does not produce blocky effects in reconstructed images. This method simultaneously benefits from enhanced and raw sinogram streams to restore informative sinograms via long-range dependency modeling in a supervised policy. As shown in Fig. 26, DuDoTrans is built on three main modules, namely the Sinogram Restoration Transformer (SRT), the DuDo Consistency layer, and the Residual Image Reconstruction Module (RIRM). The SRT block consists of successive hybrid Swin Transformer modules and convolutional layers to model local semantic features and inherent global contextual information in the sinogram and produce the enhanced sinogram.
Buchholz and Jug (2022) presented the Fourier Image Transformer (FIT), which operates on the image frequency representation, specifically the Fourier description of the image, known in their study as the Fourier Domain Encoding (FDE), which encodes the entire image at a lower resolution. The intuition of their idea lies in the physics of the CT acquisition process. CT utilizes a rotating 1D detector array around the patient body to calculate the Radon transform (Kak and Slaney, 2001) of a 2D object, which leads to a sequence of density measurements at different projection angles, namely a sinogram: a 2D image in which each column corresponds to one 1D measurement. Filtered Back Projection (FBP) (Pan et al., 2009; Kak and Slaney, 2001) is a reconstruction method to map sinograms to tangible


Fig. 25. The overall architecture of the SIST (Yang et al., 2022) pipeline. $S_{ld}$ and $I_{ld}$ are the LDCT sinogram and image, $\hat{S}$ and $\hat{I}$ denote the output sinogram and image, and $\hat{S}_{noise}$ and $\hat{I}_{noise}$ are the sinogram noise and image noise. First, the LDCT sinogram is fed to the Transformer for sinogram-domain denoising; then the denoised sinogram $\hat{S}$ is input to the image reconstruction module for image-domain denoising. Within the image reconstruction module, the sinogram noise $\hat{S}_{noise}$ is converted into the image-domain noise $\hat{I}_{noise}$ using a residual CNN block. The NDCT output $\hat{I}$ results from applying refinement steps to $I_{ld}$ minus $\hat{I}_{noise}$.

Fig. 26. DuDoTrans (Wang et al., 2021a) framework for sparse-view CT image reconstruction. First, the sparse-view sinogram $\mathbf{Y}$ is mapped to a low-quality image $\tilde{\mathbf{X}}_1$, and a second estimate $\tilde{\mathbf{X}}_2$ is generated from the SRT module's enhanced sinogram output $\tilde{\mathbf{Y}}$ followed by the DuDo Consistency Layer. Lastly, the predicted estimates are concatenated and fed to the RIRM module, which outputs the CT image $\tilde{\mathbf{X}}$ in a supervised manner.

CT images. FBP is based on the Fourier slice theorem; hence, computing the 1D Fourier transform of each 1D projection and rearranging them by their projection angle in Fourier space, followed by an inverse Fourier transformation, results in a reconstructed 2D CT image slice. Limiting the number of projections leads to missing Fourier measurements, which ultimately leads to reconstruction artifacts. FIT is the first study in the sparse-view CT reconstruction literature that uses a Transformer to query arbitrary Fourier coefficients and fill in the unobserved ones in order to conceal or avoid probable reconstruction artifacts. From Fig. 27, this procedure starts with calculating the FDE of the raw sinogram. To do so, first, the discrete Fourier transform (DFT) of the sinogram is calculated. Secondly, after dropping half of the coefficients on the Fourier rings of the resulting Fourier representation, the lower-frequency counterparts are preserved to recover a lower-resolution version of the raw sinogram. Afterward, the complex coefficients are converted into 1D sequences by unrolling the Fourier rings. These complex values are converted to normalized amplitudes and phases. Therefore, each complex coefficient has its own polar representation, which is a normalized real-valued matrix with $N \times 2$ entries ($N$ is equal to half of the number of DFT coefficients). A linear layer is applied to this tensor to upsample the feature dimensionality to $F/2$. Finally, a 2D positional encoding is concatenated to this tensor, producing a 2D FDE image of size $N \times F$.
Shi et al. (2022a) presented a CT reconstruction network with Transformers (CTTR) for sparse-view CT reconstruction. In contrast to DuDoTrans (Wang et al., 2021a), CTTR enhances low-quality reconstructions directly from raw sinograms and focuses on global features in a simple policy within an end-to-end architecture. CTTR contains four parts: two CNN-based residual blocks extracting local features from FBP (Pan et al., 2009) image reconstructions and sinograms, an encoder–decoder Transformer for modeling long-range dependencies and contextual information between features, and a CNN block to map features to a high-quality reconstruction.
Cone-Beam Computed Tomography (CBCT) is a conventional way of dental and maxillofacial imaging; due to its fast 3D imaging qualifications, its popularity has extended to lung imaging. However, studies


Fig. 27. The FIT (Buchholz and Jug, 2022) framework for sparse-view CT reconstruction. The FDE representation of the sinogram is calculated and serves as the input to the Transformer encoder. The decoder predicts Fourier coefficients from the encoder's latent space; the Fourier coefficients obtained by applying the FBP (Pan et al., 2009) algorithm to the sinogram are fed into the Transformer decoder to enrich the Fourier query representation. A shallow CNN block is applied after the inverse FFT to suppress frequency oscillations.
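To make the FDE construction described above concrete, a minimal sketch is given below. The coefficient selection and ring-unrolling order are only approximated by a low-frequency crop and a flattening, and a standard 1D sinusoidal positional encoding stands in for FIT's 2D variant, so this is an illustration under stated assumptions rather than the official implementation; `feat_dim` plays the role of 𝐹.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(n, d):
    """Standard sinusoidal positional encoding (illustrative; FIT uses a 2D variant)."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(d, dtype=torch.float32).unsqueeze(0)
    angle = pos / torch.pow(10000.0, (2 * (i // 2)) / d)
    return torch.where(i.long() % 2 == 0, torch.sin(angle), torch.cos(angle))

def fourier_domain_encoding(sino, proj, feat_dim=256):
    """Hedged sketch of an FDE: DFT -> keep low frequencies -> amplitude/phase tokens
    -> linear up-projection to feat_dim/2 -> concatenate a positional encoding.
    `sino`: (H, W) sinogram; `proj`: nn.Linear(2, feat_dim // 2); returns (N, feat_dim)."""
    coeffs = torch.fft.rfft2(sino)                 # complex DFT coefficients
    h, w = coeffs.shape
    coeffs = coeffs[: h // 2, : w // 2]            # crude stand-in for keeping the low-frequency rings
    coeffs = coeffs.flatten()                      # stand-in for unrolling the Fourier rings into a 1D sequence
    amp = coeffs.abs()
    amp = amp / (amp.max() + 1e-8)                 # normalized amplitude
    phase = coeffs.angle() / math.pi               # normalized phase in [-1, 1]
    tokens = torch.stack([amp, phase], dim=-1)     # (N, 2) polar representation
    tokens = proj(tokens)                          # (N, feat_dim // 2)
    pos = positional_encoding(tokens.shape[0], feat_dim // 2)
    return torch.cat([tokens, pos], dim=-1)        # (N, feat_dim)

# usage: proj = nn.Linear(2, 128); fde = fourier_domain_encoding(torch.randn(360, 512), proj, 256)
```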

Fig. 28. (a) The multi-loss untrained network (ARMLUT) for sparse-view CBCT reconstruction. (b) The architecture of UNETR (Hatamizadeh et al., 2022b), used as the Transformer module in ARMLUT.
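To give a flavor of the multi-level loss behind Fig. 28(a), the following is a hedged sketch of one DIP-style optimization step combining MSE and perceptual loss with adaptive re-weighting. The re-weighting rule, module names, and the 2D application of the perceptual network are assumptions, not the authors' exact scheme.

```python
import torch
import torch.nn.functional as F

def armlut_style_step(generator, feature_extractor, z, fdk_recon, optimizer, w_mse, w_pl):
    """One DIP-style step: fit the generator output to the FDK reconstruction with an
    adaptively re-weighted MSE + perceptual loss.
    `z`: fixed noise input; `fdk_recon`: FDK reconstruction from the M-view measurements;
    `feature_extractor`: frozen pre-trained network (e.g., VGG-11 features applied to 2D slices)."""
    optimizer.zero_grad()
    pred = generator(z)                                    # regularized image from the noise input
    mse = F.mse_loss(pred, fdk_recon)
    with torch.no_grad():
        ref_feat = feature_extractor(fdk_recon)
    pl = F.mse_loss(feature_extractor(pred), ref_feat)     # perceptual loss in feature space
    loss = w_mse * mse + w_pl * pl
    loss.backward()
    optimizer.step()
    # simple adaptive re-weighting so that neither term dominates (illustrative rule only)
    w_mse = 1.0 / (mse.detach().item() + 1e-8)
    w_pl = 1.0 / (pl.detach().item() + 1e-8)
    return loss.item(), w_mse, w_pl
```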

However, studies have confirmed that its radiation dose is higher than that of plain radiographs (Patel et al., 2019); hence, sparse-view CBCT could be a suitable way to lower the radiation dose. Wu et al. (2022) proposed a novel untrained 3D Transformer-based architecture, namely ARMLUT, with a multi-level loss function for CBCT reconstruction, in which the Transformer module, specifically UNETR (Hatamizadeh et al., 2022b) in this study, captures long-range contextual information and enhances the resulting image. The intuition behind this strategy comes from the success of Deep Image Prior (DIP) (Ulyanov et al., 2018) in the reconstruction field. As shown in Fig. 28(a), ARMLUT iteratively optimizes between the Image Reconstructor module and the Image Generator module to fit a CBCT inverse solver without a large amount of data or ground-truth images. The multi-level loss function comprises a Mean Squared Error (MSE) and a Perceptual Loss (PL) (Johnson et al., 2016) to reconstruct smooth and streak-artifact-free outputs. The entire framework (Fig. 28) has three main counterparts: the Image Reconstructor, the Image Generator, and the Feature Extractor. The Image Reconstructor uses the Feldkamp–Davis–Kress (FDK) algorithm (Feldkamp et al., 1984) to produce an initial reconstruction from the 𝑀-view measurements, and the Image Generator module maps noisy voxel inputs to a regularized image. The Feature Extractor module applies a pre-trained VGG-11 network to the two representations to form the PL term. To minimize the distance between these two reconstructions, ARMLUT utilizes an adaptively re-weighted multi-loss technique to stabilize the convergence of the Transformer during optimization.

5.3. Undersampled reconstruction

Magnetic Resonance Imaging (MRI) is a dominant technique for assistive diagnosis. However, due to the physics behind its operation, the scanning time can be long and tedious, affecting the patient experience and leading to inevitable artifacts in the images (Plenge et al., 2012). Hence, reducing the number of MRI measurements can result in faster scan times and fewer artifacts caused by patient movement, at the cost of aliasing artifacts in the image (Hyun et al., 2018).

Lin and Heckel (2021) proposed a comprehensive analytical study to investigate the use of ViTs in a pure (CNN-free) and most straightforward Transformer design. This study is evidence of the prominent effect of ViTs in medical image reconstruction. For this work, they adopted the original ViT (Dosovitskiy et al., 2020) for image reconstruction by discarding the classification token and replacing the classification head with a reconstruction head, composed of successive Norm and Linear layers, that maps the Transformer output to a visual image. They performed a complete ablation study over different ViT settings, from the number of stacked Transformers to the embedding dimension and the number of heads in Multi-Head Self-Attention (MHSA). Their results were quite effective and proved that a ViT trained on sufficient data, whether natural images such as ImageNet or medical data, can perform better than or on par with CNN baselines such as U-Net (Ronneberger et al., 2015).
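A minimal sketch of such a reconstruction head, LayerNorm and Linear layers that map each output token back to a pixel patch followed by patch re-assembly, is shown below; layer sizes and the re-assembly details are illustrative rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ViTReconstructionHead(nn.Module):
    """Maps ViT output tokens back to image patches: LayerNorm + Linear per token,
    then patch re-assembly (shapes follow the usual ViT patch layout, an assumption)."""
    def __init__(self, embed_dim=768, patch_size=16, out_ch=1):
        super().__init__()
        self.patch_size = patch_size
        self.out_ch = out_ch
        self.head = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, patch_size * patch_size * out_ch),
        )

    def forward(self, tokens, img_hw):
        # tokens: (B, N, D) patch tokens with the class token already discarded
        b, n, _ = tokens.shape
        h, w = img_hw
        patches = self.head(tokens)                                # (B, N, p*p*C)
        patches = patches.view(b, h // self.patch_size, w // self.patch_size,
                               self.patch_size, self.patch_size, self.out_ch)
        img = patches.permute(0, 5, 1, 3, 2, 4).contiguous()       # (B, C, H/p, p, W/p, p)
        return img.view(b, self.out_ch, h, w)
```

For example, for a 320 × 320 slice with 16 × 16 patches, the encoder's (B, 400, 768) patch tokens are mapped back to a (B, 1, 320, 320) image.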


Fig. 29. An overview of T2 Net (Feng et al., 2021b). (a) Multi-Task T2 Net pipeline and (b) Task Transformer module—T2 module.
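The task Transformer module in Fig. 29(b), described in more detail below, draws its queries from the super-resolution branch and its keys and values from the reconstruction branch. A minimal cross-attention sketch of this interaction is given here; module and parameter names are illustrative, and the relevance-embedding / transfer-attention / soft-attention decomposition of the actual T2 module is not reproduced.

```python
import torch
import torch.nn as nn

class CrossBranchAttention(nn.Module):
    """Cross-attention with queries from the super-resolution branch and keys/values from
    the reconstruction branch (a simplified sketch of a task-Transformer-style interaction)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, sr_feat, rec_feat):
        # sr_feat, rec_feat: (B, N, dim) token sequences from the two branches
        fused, _ = self.attn(query=sr_feat, key=rec_feat, value=rec_feat)
        return self.norm(sr_feat + fused)  # inject reconstruction-branch context into the SR branch
```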

The proposed design's distinguishing power, assessed with the mean attention distance metric (d'Ascoli et al., 2021), shows that it effectively mimics convolutional receptive fields and can concurrently capture local and global dependencies. In addition, they showed that the ViT benefits from two times faster inference and lower memory requirements compared to the U-Net.

Feng et al. (2021a) address a particular issue in this domain by designing an end-to-end multi-task learning paradigm, named the Task Transformer Network (T2 Net) and shown in Fig. 29, to boost feature learning between two highly overlapping sub-tasks, MRI reconstruction and super-resolution. Their network consists of two branches, one for each task. T2 Net utilizes a Transformer between the two branches to share feature representations and their transmission, and it applies a convolution layer and an EDSR (Lim et al., 2017) backbone to extract task-specific features in each task branch. To share information between the two branches and benefit from the interactions of the two task-specific features, exploiting the global nature of the Transformer, T2 Net uses a unique Transformer design to learn a generalized representation. Since the reconstruction branch has a higher capacity for artifact removal than the super-resolution branch, the task Transformer module guides the super-resolution branch toward the high-quality representation of the reconstruction branch. The Transformer module takes its query (𝑄: from the super-resolution branch), key (𝐾: from the reconstruction branch), and value (𝑉: from the reconstruction branch) from the outputs of the two branches at each scale. It builds on three main concepts, which differ from the original Transformer blocks: relevance embedding, transfer attention, and soft attention. Relevance embedding encodes the features of the reconstruction branch that are correlated with the super-resolution branch. Transfer attention aims to transmit anatomical and global features between the two branches, and, last but not least, soft attention amalgamates the features from the previous two steps. Ultimately, this module lets the whole network transfer and synthesize representative and anatomical features to produce a high-quality, artifact-free representation from highly undersampled measurements. The experimental results on two datasets demonstrated the high potential of this approach compared to conventional algorithms.

5.4. Super-resolution reconstruction

Improving the resolution of images leads to a more detailed delineation of objects, and increasing medical image resolution plays a crucial role in computer-aided diagnosis due to its rich anatomical and textural representation. However, given the MRI acquisition physics, obtaining high-resolution images requires the patient to lie in the MRI tube for a long time; consequently, a lower signal-to-noise ratio and smaller spatial coverage are inevitable drawbacks (Plenge et al., 2012). Therefore, in this section we investigate Transformer-based algorithms that try to alleviate this problem. Of note, due to the analogy between MRI reconstruction and super-resolution, some studies investigate these two tasks in conjunction with each other.

Mahapatra and Ge (2021) proposed a GAN-based model in which structural and textural information is preserved through multiple loss terms. Their pipeline includes two pre-trained modules: a feature disentanglement module, which is a conventional autoencoder, and a Transformer-based feature encoder, UNETR (Hatamizadeh et al., 2022b). UNETR captures the global and local context of the original low-resolution image and induces the high-resolution image to preserve these contexts as well. The two modules are fine-tuned on a different medical dataset, and afterward the low-resolution input plus the intermediate image produced by the generator are fed to these modules. The disentanglement network contains two autoencoders that learn two latent spaces, a structural and a textural one, from the fed medical images. In an end-to-end setting, these two pre-trained assistive modules help generate more realistic, structure- and texture-preserving high-resolution images by imposing module-related loss terms, such as an adversarial loss that constrains the output to be realistic and a cosine similarity loss for each mentioned module. Results on the IXI dataset showed that Mahapatra and Ge's (2021) network outperformed several CNN-based attention networks and T2 Net (Feng et al., 2021a). Maintaining structural information while acquiring high-resolution images plays a crucial role, and this structural information is embedded in an image's high-frequency counterpart, such as the image gradients. In addition, since MR T1WI (T1-weighted images) are less time-consuming to obtain, it is wise to use them as an inter-modality context prior for producing a high-resolution image. Accordingly, Fang et al. (2022) devised a network (see Fig. 30) that leverages these two observations in their super-resolution pipeline: the Cross-Modality High-Frequency Transformer (Cohf-T). The network is divided into two streams: the first stream is applied to the low-resolution T2WI, and the second one manipulates the T2WI's gradient and the high-resolution T1WI. The Cohf-T module interacts between the two streams to embed the prior knowledge into the super-resolution stream's features. The Cohf-T module consists of three different attention mechanisms: short-distance and long-distance window attention and inter-modality attention. The first two attention mechanisms help to model intra-modality dependency.


Fig. 30. The pipeline of Cohf-T (Fang et al., 2022) consists of three main branches with the corresponding input modalities: 𝐈𝑖𝑛, 𝐑𝑠, and 𝐑𝑐 denote the low-resolution T2WI, the gradient of the low-resolution T2WI, and the high-resolution T1WI, respectively. The branches are a fully convolutional branch for density-domain super-resolution, a Transformer-based branch for restoring high-frequency signals in the gradient domain, and a guidance branch for extracting priors from the T1 modality. Conv, RRDB, and MLP denote a 3 × 3 convolution, a residual-in-residual dense block, and a multi-layer perceptron, respectively.

To be precise, the short-distance window helps recover local discontinuities at boundaries with the help of the surrounding structure information, and the long-distance window captures textural and structural patterns for enhanced results. Due to the discrepancy in intensity levels between T1WI and T2WI, it is vital to align these two domains, and Fang et al. (2022) presented a Feature Alignment (FA) module to reduce the cross-modality representation gap. They compared their results with T2 Net (Feng et al., 2021a) and MTrans (Feng et al., 2022) and outperformed both approaches by ∼1% in terms of PSNR.

5.5. Self-supervised reconstruction

Recently, self-supervised reconstruction methods requiring only undersampled k-space data have been proposed for single-contrast MRI reconstruction, enabling the autonomous learning of high-quality image reconstructions without the need for fully sampled data or manual annotations. SLATER (Korkmaz et al., 2022) is introduced as a self-supervised MRI reconstruction method that combines an unconditional adversarial network with cross-attention transformers. It learns a high-quality MRI prior during pre-training and performs zero-shot reconstruction by optimizing the prior to match the undersampled data. SLATER improves contextual feature capture and outperforms existing methods on brain MRI. It has potential for other anatomies and imaging modalities, offering reduced supervision requirements and subject-specific adaptation for accelerated MRI. Likewise, DSFormer (Zhou et al., 2023a) is presented for accelerated multi-contrast MRI reconstruction. It uses a deep conditional cascade transformer (DCCT) with Swin transformer blocks and is trained using self-supervised learning in both the k-space and image domains. It exploits redundancy between contrasts and models long-range interactions, and it also reduces the need for paired and fully sampled data, making it more cost-effective.

5.6. Discussion and conclusion

In this section, we outlined different Transformer-based approaches for medical image reconstruction and presented a detailed taxonomy of reconstruction approaches. We reviewed 17 studies that profit from the Transformer design to compensate for the deficiency of CNNs' limited receptive field, two of which incorporate self-supervised strategies (Korkmaz et al., 2022; Zhou et al., 2023a) for training the transformer modules. We investigated each study in depth: Table 9 provides detailed information about the datasets, utilized metrics, modality, and objective tasks, and Table 10 presents the main contribution and the prominent highlights of each method. In Table 8, we present a comparison of various methods for the low-dose enhancement task on the NIH-AAPM-Mayo (McCollough et al., 2017) benchmark medical reconstruction dataset, which shows that Eformer (Luthra et al., 2021) is superior to TransCT (Zhang et al., 2021e) and CTformer (Wang et al., 2023a), considering its higher SSIM and lower RMSE scores.

Most of the studies in this domain use the original Transformer as a plug-and-play module in their design, and only a limited number of studies utilize hierarchical and efficient Transformers. However, multi-scale and hierarchical architectures are generally important for dense prediction tasks, e.g., image reconstruction, and should be considered further. Another direction for future research could be to investigate the influence of pre-trained weights on Transformers, given their need for large amounts of data to converge well, which contradicts the nature of the medical domain with its scarcity of annotated data.

In addition, we noticed that most of the studies focus on MRI and CT image reconstruction tasks, so there is a need to evaluate the applicability of these methods to other modalities, too.

6. Medical image synthesis

In this section, we will overview several instances of Transformers in the medical image synthesis task. The scarcity of medical data and the high cost of acquisition processes make this task very valuable in the medical field. Some studies aim to synthesize missing slices from MRI and CT sequences. In addition, some methods target capturing the structural information across diverse modalities, e.g., CT-to-MRI image-to-image translation and vice versa. Fig. 31 shows our taxonomy of the image synthesis methods.

6.1. Intra-modality

The main objective of intra-modality methods is to synthesize high-quality images using low-quality samples of the same modality. In this respect, several Transformer-based approaches have been presented that formulate the synthesis task as a sequence-to-sequence matching problem to generate fine-grained features. In this section, we briefly present some recent examples (Zhang et al., 2021b).

Brain development monitoring is a de facto standard in predicting later risks; hence, it is critical to screen brain biomarkers via the available imaging tools from the early stages of life. Due to this concern, the nature of MRI, and infants' restlessness, it is not straightforward to acquire all the MR modalities during MRI acquisition. Zhang et al. (2021b) proposed a Pyramid Transformer Net (PTNet) as a tool to reconstruct realistic T1WI images from T2WI.


Table 9
Medical image reconstruction. LDE, SVR, USR, and SRR stand for Low dose enhancement, Sparse-view reconstruction, Undersampled reconstruction, and Super resolution
reconstruction, respectively.
Method Task(s) Modality Type Pre-trained Dataset(s) Metrics Year
module: Type
Pure
TED-net (Wang et al., 2021b) LDE CT 2D ✗ NIH-AAPM-Mayo Clinical SSIM 2021
LDCT (McCollough et al., RMSE
2017)
Eformer (Luthra et al., 2021) LDE CT 2D ✗a NIH-AAPM-Mayo Clinical PSNR, SSIM 2021
LDCT (McCollough et al., RMSE
2017)
CTformer (Wang et al., 2023a) LDE CT 2D ✗ NIH-AAPM-Mayo Clinical SSIM 2023
LDCT (McCollough et al., RMSE
2017)
FIT (Buchholz and Jug, 2022) SVR CT 2D ✗ LoDoPaB (Leuschner et al., PSNR 2021
2019)
ViT-Rec (Lin and Heckel, 2021) USR MRI 2D Supervised fastMRI (Zbontar et al., SSIM 2021
2018)
Encoder
TransCT (Zhang et al., 2021e) LDE CT 2D ✗ 1 NIH-AAPM-Mayo Clinical RMSE 2021
LDCT (McCollough et al., SSIM
2017) VIF
2
Private clinical pig head
CBCT
STFNet (Zhang et al., 2022) LDE PET 2D ✗ Private Dataset RMSE, PSNR 2022
MRI SSIM, PCC
1
SIST (Yang et al., 2022) LDE CT 2D ✗ LDCT Dataset (Moen PSNR, SSIM 2022
et al., 2021) RMSE
2
Private dataset
DuDoTrans (Wang et al., 2021a) SVR CT 2D ✗ NIH-AAPM-Mayo Clinical PSNR, SSIM 2021
LDCT (McCollough et al., RMSE
2017)
CTTR (Shi et al., 2022a) SVR CT 2D ✗ LIDC-IDRI (Armato et al., RMSE, PSNR 2022
2011) SSIM
Skip connection
1
Cohf-T (Fang et al., 2022) SRR MRI 2D ✗ BraTS2018 (Bakas et al., PSNR 2022
2018) SSIM
2
IXI (Group, 2023)
Other architectures
3D T-GAN (Luo et al., 2021) LDE PET 3D ✗ Private Dataset PSNR, SSIM 2021
NMSE
ARMLUT (Wu et al., 2022) SVR CT 3D ViT: Superviseda 1
SPARE Challenge Dataset PSNR, SSIM 2022
(Shieh et al., 2019)
2
Walnut dataset
(Der Sarkissian et al.,
2019)
T2 Net (Feng et al., 2021a) USR MRI 2D ✗ 1
IXI (Group, 2023) PSNR, SSIM 2021
2
SRR Private Dataset NMSE
1
DisCNN-ViT (Mahapatra and Ge, 2021) SRR MRI 3D ViT: fastMRI (Zbontar et al., PSNR, SSIM 2021
Self-Supervised 2018) NMSE
2
IXI (Group, 2023)
SLATER (Korkmaz et al., 2022) USR MRI 2D ViT: 1 fastMRI (Zbontar et al., PSNR, SSIM 2022
Self-Supervised 2018)
2
IXI (Group, 2023)
DSFormer (Zhou et al., 2023a) USR MRI 2D ViT: IXI (Group, 2023) PSNR, SSIM 2023
Self-Supervised
a Indicates that this network uses a pre-trained perceptual loss (loss network).

This pipeline is an end-to-end, Transformer-based, U-Net-like, multi-resolution network that utilizes an efficient Transformer, the Performer (Choromanski et al., 2021), in its encoder (PE) and decoder (PD). Analogously to the original U-Net (Ronneberger et al., 2015), they used skip connection paths to preserve fine-grained features and accurate localization information for reconstruction. Moreover, the paradigm's two-level pyramidal design helps the network capture local and global information in a multi-resolution fashion. They achieved SOTA results on the dHCP (Makropoulos et al., 2018) dataset compared with the flagship GAN-based image generation methods pix2pix(HD) (Isola et al., 2017; Wang et al., 2018a).

Dalmaz et al. (2022) introduced a conditional generative adversarial network based on the cooperation of Transformer and CNN operators, namely ResViT. This paradigm addresses the issue of needing to rebuild separate synthesis models for varying source–target modality settings and represents a unified framework as a single model, elevating its practicality. ResViT (Fig. 32) pervasively refers to the generator of its pipeline, which leverages a hybrid design of residual convolutional operations and Transformer blocks that enables the effective aggregation of local and long-range dependencies. The discriminator is based on a conditional PatchGAN framework (Isola et al., 2017).


Table 10
A brief description of the reviewed Transformer cooperated in the medical image reconstruction field.
Method Contributions Highlights
Low dose enhancement
TransCT (Zhang et al., 2021e) ∙ The proposed prototype was the first successful ∙ The Transformer effectively could learn the embedded texture
implementation of a Transformer complement to CNN in the representation from the noisy counterpart.
Low Dose CT reconstruction domain by exploring its revenue ∙ This paradigm is not convolution-free and uses Transformer as
within high-frequency and low-frequency counterparts. a complement.
TED-net (Wang et al., 2021b) ∙ Convolution-free U-Net based Transformer model. ∙ Introduced Dilated Token-to-Token-based token serialization
for an improved receptive field in Transformers.
∙ Using Cyclic Shift block for feature refinement in tokenization.
Eformer (Luthra et al., 2021) ∙ Incorporate the learnable Sobel filters into the network for ∙ Successfully imposed the Sobel–Feldman generated low-level
preserving edge reconstruction and improving the overall edge features with intermediate network layers for better
performance of the network. performance.
∙ Conduct diverse experiments to validate that the residual ∙ To guarantee the convergence of the network, they used the
learning paradigm is much more effective than other learning Multi-scale Perceptual (MSP) loss alongside Mean Squared error
techniques, such as deterministic learning approaches. (MSE) to hinder the generation of disfavored artifacts.
3D T-GAN (Luo et al., 2021) ∙ It is a 3D-based method rather than conventional 2D methods. ∙ To produce more reliable images with a generator they used
∙ First LDPET enhancement study that leveraged from an adversarial loss to make the data distribution the same as
Transformer to model long-range contextual information. real data.
STFNet (Zhang et al., 2022) ∙ Proposed a dual input U-Net-based denoising structure for ∙ In comparison with the U-Net and residual U-Net structures
low-count PET images with excessive MRI modality contribution. due to the different training strategy which is roughly named
Siamese structure has a low computational burden and
∙ Used a Transformer block as a hybrid add-on for feature fusion simplified network.
to make a pixel-to-pixel translation of PET and MRI modalities. ∙ This network successfully handled the disparity and
nonuniformity of shape and modality of PET and MRI.
∙ The visual results of denoised images testify that the proposed
method could recover the detail of texture more clearly than
other networks.
CTformer (Wang et al., 2023a) ∙ Convolution-free, computational efficient design. ∙ Alleviate receptive field deficiency with the token
∙ Introduce a new inference mechanism to address the boundary rearrangement.
artifacts.
∙ Proposed interpretability method to follow each path resultant
attention map through the model to understand how the model
is denoising.
SIST (Yang et al., 2022) ∙ Proposed inner-structure loss to mimic the physics of the ∙ Utilizing the image reconstruction module to alleviate the
functionality of sinogram processing by CT devices to restrain artifacts that could happen in sinogram domain denoising by
the noise. transferring the sinogram noise into the image domain.
∙ Extracting the long-range dependencies between distinct ∙ Image domain loss back-propagates into the sinogram domain
sinogram angles of views via Transformer. for complementary optimization.
Sparse-view reconstruction
DuDoTrans (Wang et al., 2021a) ∙ To cope with the global nature of the sinogram sampling ∙ Utilizing a residual learning paradigm for image-domain
process introduced, the SRT module, a hybrid Transformer-CNN reconstruction.
to capture long-range dependencies. ∙ Fewer parameters in comparison with other structures, e.g.,
∙ Utilizing a dual domain model to simultaneously enrich raw DuDoNet (Lin et al., 2019) with better performance.
sinograms and reconstruct CT images with both enhanced and
raw sinograms.
∙ To compensate for the drift error between raw and enhanced
sinogram representation employs DuDo Consistency Layer.
FIT (Buchholz and Jug, 2022) ∙ Introduced the Fourier Domain Encoding to encode the image ∙ Introduced the Fourier coefficient loss as a multiplicative
to a lower resolution representation for feeding to the combination of amplitude loss and phase loss in the complex
encoder–decoder Transformer for reconstructing the sparse-view domain.
CT measurements.
CTTR (Shi et al., 2022a) ∙ Introduced the encoder–decoder Transformer pipeline to utilize ∙ In contrast to DuDoTrans (Wang et al., 2021a), CTTR directly
dual-domain information, raw sinogram, and primary utilizes raw sinograms to enhance reconstruction performance.
reconstruction of CT via FBP (Pan et al., 2009) algorithm for
sparse-view CT measurement reconstruction.
ARMLUT (Wu et al., 2022) ∙ Proposed a paradigm for CT image reconstruction in a ∙ Optimizing the large-scale 3D Transformer with only one
non-trainable manner. reference data in an unsupervised manner.
∙ Extending the most DIP research on 2D to 3D medical imaging ∙ Stabilizing the iterative optimization of multi-loss untrained
scenario. Transformer via re-weighting technique.
Undersampled reconstruction
ViT-Rec (Lin and Heckel, 2021) ∙ This study investigated the advantage of pure Transformer ∙ Utilizing pre-training weights, e.g., ImageNet, extensively
framework, ViT, in fastMRI reconstruction problems in improves the performance of ViT in the low-data regime for
comparison with the baseline U-Net. fastMRI reconstruction, a widespread concept in the medical
∙ ViT benefits from less inference time and memory domain.
consumption compared to the U-Net. ∙ ViTs that accompany pre-training weights demonstrate more
robust performance towards anatomy shifts.

T2 Net (Feng et al., 2021a) ∙ Introduce the first Transformer utilized multi-task learning ∙ Used the same backbone for feature extraction in branches,
network in the literature. however, the purpose of the branches is diverse.
∙ Designed the task Transformer for maintaining and feature
transforming between branches in the network.
∙ Outperformed the sequentially designed networks for
simultaneous MRI reconstruction and super-resolution with
T2 Net.
Super resolution reconstruction
DisCNN-ViT (Mahapatra and Ge, 2021) ∙ Using a Transformer-based network to capture global ∙ Multi prerequisite steps are required to train the experiments;
contextual cues and amalgamate them with CNN’s local however, the super-resolution step is a straightforward
information results in the superior quality of high-resolution end-to-end network.
images in super-resolution literature. ∙ Need fine-tuning steps for two disentanglement and UNETR
∙ Creating realistic images is just not a burden on an adversarial networks.
loss function, in addition, multiple loss functions incorporate ∙ The computational burden of UNETR is high and could use
extra constraints that preserve anatomical and textural new efficient transformer-designed networks.
information in the begotten image.
Cohf-T (Fang et al., 2022) ∙ Leverage the high-resolution T1WI due to its rich structural ∙ Assess prior knowledge into super-resolution paradigm
information for super-resolving T2-weighted MR images. successfully.
∙ Introduced the high-frequency structure prior and ∙ Competitive number of FLOPS in reaching SOTA PSNR results
intra-modality and inter-modality attention paradigms within the in comparison with other attention-based networks.
Cohf-T framework. ∙ End-to-end pipeline for training the network.
Self-supervised reconstruction
SLATER (Korkmaz et al., 2022) ∙ A Self-supervised MRI reconstruction framework that uses ∙ SLATER outperforms CNN and self-attention GAN models,
unconditional adversarial network and cross-attention showing better invertibility and improved reconstruction
transformers for high-quality zero-shot reconstruction. performance with cross-attention transformer blocks and
∙ The first work that proposes an adversarial vision transformer unsupervised pretraining.
framework designed for the task of MRI reconstruction. ∙ Prior adaptation is computationally burdensome but improves
across-domain reconstructions.
DSFormer (Zhou et al., 2023a) ∙ Proposes a deep learning model for fast MRI reconstruction ∙ Both k-space and image-domain self-supervision independently
that uses a deep conditional cascade transformer and produce strong reconstruction quality, showing the benefits of
self-supervised learning to exploit contrast redundancy and self-supervision in both domains.
reduce data requirements.

Fig. 31. An overview of ViTs in medical image synthesis. Methods are categorized by target and source modality. The prefix numbers in the papers' names, in ascending order, denote the reference for each study as follows: 1. Zhang et al. (2021b), 2. Dalmaz et al. (2022), 3. Liu et al. (2022b), 4. Ristea et al. (2021), 5. Kamran et al. (2021), 6. Li et al. (2022d).

Utilizing standalone Transformer architectures (e.g., PTNet, Zhang et al., 2021b) in pixel-to-pixel tasks is quite challenging due to their quadratic complexity, which limits their usage to fixed-size patches and hampers their effectiveness. As shown in Fig. 32, residual Transformer blocks, known as aggregated residual Transformer (ART) blocks, are stacked successively in the bottleneck of the generator's encoder–decoder design to extract the hidden contextual information of the input features. The primary motivation of ART blocks is to learn an integrated representation that combines contextual, local, and hybrid local–contextual features from the input flow. A Channel Compression (CC) module recalibrates the features concatenated from the previous ART block and the Transformer module to select the most discriminant representations. Due to the cascade of Transformers in the design, and to decrease the model complexity and computational burden, ResViT utilizes a weight-sharing strategy among the projection tensors for the query, key, value, and attention heads, as well as the weight matrices of the multi-layer perceptron operation. The superiority of this method has been demonstrated on several MRI datasets in multi-contrast MRI synthesis and MRI-to-CT experiments, with higher PSNR and SSIM metrics than conventional SOTA methods, e.g., pGAN (Dar et al., 2019), SAGAN (Zhang et al., 2019), pix2pix (Isola et al., 2017), and PTNet (Zhang et al., 2021b).

Likewise, Liu et al. (2022b) addressed the issue of missing contrasts in MRI imaging and proposed a multi-contrast multi-scale Transformer (MMT) framework that handles the unavailability of this information by synthesizing the missing contrasts from the existing ones. To achieve efficient contrast synthesis, the task is considered a seq-to-seq problem, in which the model learns to generate missing contrasts by leveraging the existing contrasts in the following manner: a Swin multi-contrast Transformer encoder creates hierarchical representations from the input MRI images; then, a Swin Transformer-based decoder decodes the provided representation at multiple scales to perform medical image synthesis. Both the encoder and decoder are composed of two sequential Swin blocks that capture contrast dependencies effectively. Experiments conducted on the IXI (Group, 2023) and BraTS (Baid et al., 2021) datasets demonstrated MMT's advantage compared to previous methods.

6.2. Inter-modality

Unlike the intra-modality strategies, the inter-modality methods are designed to learn the mapping function between two different modalities. This approach allows the network to convert samples from the base modality into a new modality and leverage the generated samples in the training process for the sake of performance gain. In this section, we will elaborate on two Transformer-based strategies (Ristea et al., 2021; Kamran et al., 2021).


Fig. 32. The ResViT (Dalmaz et al., 2022) framework for multi-modal medical image synthesis. The bottleneck of this encoder–decoder comprises successive residual Transformer and residual convolution layers that synergistically capture fine-grained global and local context.
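As an illustration of the weight-sharing strategy mentioned for ResViT, the following is a minimal sketch in which a single Transformer encoder layer (and hence its attention projections and MLP weights) is reused across a stack of bottleneck blocks; the surrounding residual-CNN parts of an ART block are reduced to a single convolution here, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class SharedBottleneck(nn.Module):
    """Bottleneck stack that reuses one Transformer encoder layer at every depth, so the
    Q/K/V projections and MLP weights are shared across the stacked blocks (a sketch of a
    ResViT-style weight-sharing strategy, not the reference implementation)."""
    def __init__(self, dim=256, heads=8, depth=4):
        super().__init__()
        self.shared_transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True)
        self.convs = nn.ModuleList([nn.Conv2d(dim, dim, 3, padding=1) for _ in range(depth)])

    def forward(self, x):
        # x: (B, C, H, W) bottleneck feature map
        b, c, h, w = x.shape
        for conv in self.convs:
            tokens = x.flatten(2).transpose(1, 2)                   # (B, H*W, C)
            tokens = self.shared_transformer(tokens)                # same weights reused at every depth
            x = x + conv(tokens.transpose(1, 2).view(b, c, h, w))   # residual local refinement
        return x
```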

Several medical conditions may prevent patients from receiving intravenous contrast agents during CT screening. However, the contrast agent is crucial in assisting medical professionals in identifying certain lesions. Therefore, CyTran (Ristea et al., 2021) is proposed as an unsupervised generative adversarial convolutional Transformer for translating between contrast and non-contrast CT scans and for aligning contrast-enhanced CT scans to non-enhanced ones; its unsupervised nature derives from its cyclic loss. CyTran is composed of three main modules: (I) a downsampling CNN-based module designed for handling high-resolution images, (II) a convolutional Transformer module tailored to incorporate both local and global features, and (III) an upsampling module that reverts the transformation of the downsampling block through transposed convolutions. Additionally, the authors introduce a new dataset, Coltea-Lung-CT-100W, comprised of 100 3D anonymized triphasic lung CT scans of female patients.

Furthermore, Kamran et al. (2021) trained a ViT-based generative adversarial network (VTGAN) in a semi-supervised fashion on the Fundus Fluorescein Angiography (FFA) dataset provided by Hajeb Mohammad Alipour et al. (2012) by incorporating multiple modules, including residual blocks as generators for coarse and fine image generation, and two ViT architectures consisting of identical transformer encoder blocks for concurrent retinal abnormality classification and FA image synthesis.

To embed self-supervised training in medical image synthesis, SLMT-Net (Li et al., 2022d) is presented for generating missing modalities in magnetic resonance (MR) images. It tackles the limited-paired-data challenge by combining paired and unpaired data. The method involves edge-aware pre-training using an Edge-preserving Masked Auto-Encoder (Edge-MAE) model that predicts missing information and estimates the edge map, with a patch-wise loss enhancing the performance. In the multi-scale fine-tuning stage, a DSF module is integrated to synthesize missing-modality images using multi-scale features, and the pre-trained encoder extracts high-level features to ensure consistency between synthesized and ground-truth images during training.

6.3. Discussion and conclusion

This section covers the adoption of ViT architectures in medical image synthesis applications. We explored the proposed methods based on two synthesis approaches: (I) inter-modality, in which the target modality is synthesized in a way that encapsulates crucial diagnostic features from different source images; and (II) intra-modality, with the objective of yielding target images of better quality by integrating information from lower-resolution source images. To demonstrate their effectiveness, these approaches usually rely on SSIM, PSNR, and LPIPS as evaluation metrics, since they are designed to measure the similarity between images. We also reviewed a ViT-based synthesis model (Kamran et al., 2021) that operates in a decoder fashion for the task of fundus-to-angiogram translation with different evaluation measurements, including Fréchet Inception Distance (FID) and Kernel Inception Distance (KID). A combination of self-supervised learning with Transformer-based medical synthesis is further reviewed to emphasize the cost-saving benefits of reduced labeling (Li et al., 2022d). We have additionally provided the architectural type, modality, input size, training setting, datasets, metrics, and year for every medical synthesis technique analyzed in Table 12. Furthermore, Table 13 lists the contributions and highlights of the proposed works. To facilitate comparison, Table 11 includes a performance evaluation of multiple methods on the BraTS dataset (Baid et al., 2021). In terms of SSIM and PSNR scores, MMT (Liu et al., 2022b) typically achieves superior performance, while PTNet (Li et al., 2022d) falls slightly behind when compared to the other two methods. Moreover, we have provided information about the model size and computational cost of three distinct approaches: PTNet, MMT, and ResViT. PTNet, with 27.69 million parameters and 233.1 GFLOPs, strikes a balance between model size and computational demand. Meanwhile, MMT offers a larger model with 138.2 million parameters but lower computational requirements at 221.5 GFLOPs, emphasizing efficiency.


Table 11
Comparison results on BraTS (Baid et al., 2021) dataset for medical image synthesis, focusing on one-to-one and many-to-one tasks with
different input to output combinations of T1, T2, and FLAIR modalities. The GFLOPs of all methods are calculated with an input image size of
256 × 256.
Methods Design Task SSIM ↑ PSNR ↑ Params (M) FLOPs (G)
PTNeta (Li et al., 2022d) Hybrid T1, T2 → FLAIR 0.851 23.78 27.69 233.1
T1, FLAIR → T2 0.905 25.09
T2, FLAIR → T1 0.920 22.19
T2 → FLAIR 0.851 23.01
FLAIR → T2 0.894 24.78
MMT (Liu et al., 2022b) Pure T1, T2 → FLAIR 0.909 26.20 138.2 221.5
T1, FLAIR → T2 0.934 26.74
T2, FLAIR → T1 0.922 25.43
T2 → FLAIR 0.891 25.00
FLAIR → T2 0.902 24.84
ResViT (Dalmaz et al., 2022) Bottleneck T1, T2 → FLAIR 0.886 25.84 123.4 486.1
T1, FLAIR → T2 0.938 26.90
T2, FLAIR → T1 0.924 26.20
T2 → FLAIR 0.870 24.97
FLAIR → T2 0.908 25.78
a Indicates that the values are taken from Li et al. (2022d).

Table 12
An overview of the reviewed transformer-based medical image synthesizing approaches.
Method Concept(s) Modality Type Pre-trained module: Type Dataset(s) Metrics Year
Pure
PTNet (Zhang et al., Intra-modality MRI 2D ✗ dHCP dataset (Makropoulos et al., SSIM 2021
2021b) 2018) PSNR
1
MMT (Liu et al., Intra-modality MRI 2D ✗ IXI (Group, 2023) SSIM 2022
2
2022b) Inter-modality BraTS (Baid et al., 2021) PSNR
LPIPS
Bottleneck
ResViT (Dalmaz Intra-modality CT 2D ViT: Supervised 1 IXI (Group, 2023) PSNR 2022
et al., 2022) Inter-modality MRI 2 BraTS (Menze et al., 2014; SSIM
Bakas et al., 2017, 2018)
3
Multi-modal pelvic MRI-CT
(Nyholm et al., 2018)
CyTran (Ristea Inter-modality CT 2D 3D ✗ Coltea-Lung-CT-100W (Ristea MAE 2022
et al., 2021) et al., 2021) RMSE
SSIM
Encoder
SLMT-Net (Li et al., Inter-modality MRI 2D ViT: Self-supervised BraTS (Menze et al., 2014; Bakas SSIM 2022
2022d) et al., 2017, 2018) 2 ISLES (Maier PSNR
et al., 2017) NMSE
Decoder
VTGAN (Kamran Inter-modality Angiogra- 2D ✗ Fundus Fluorescein Angiography Fréchet inception 2021
et al., 2021) phy (Hajeb Mohammad Alipour et al., distance
2012) Kernel Inception
distance

On the other hand, ResViT provides a mid-sized model with 123.4 million parameters but a notably higher computational load of 486.1 GFLOPs. The choice among these methods should be guided by the specific task requirements, weighing the trade-off between model size and computational cost to achieve optimal performance. However, it is important to note that all of these models incur a significant computational cost.

Given the scarcity of works with ViT implementations and the recent advancement of the medical synthesis field with Transformer models, we believe that these systems require more research effort. For example, Transformers have much room for improvement in generating more realistic and higher-quality synthesized medical images. One way to achieve this is by incorporating more detailed anatomical and physiological features using more efficient and effective attention mechanisms. Additionally, while much of the current research in this area has focused on 2D medical images and the CT and MRI modalities, there is potential to apply these techniques to other types of medical images, including 3D and microscopy images.

7. Medical image detection

Object detection remains one of the challenging problems in computer vision, and detection in the medical image domain has its own particular challenges. Current SOTA architectures for 2D natural images use Vision Transformers. The Vision Transformers used in the detection task can be classified into two categories, Transformer backbones and detection Transformers; in addition, the Transformer module can be used in a hybrid manner. Detection Transformers generally represent an end-to-end detection pipeline with an encoder–decoder structure, while the Transformer backbone solely utilizes the Transformer encoder for feature refinement. In order to increase detection performance, object detectors combine variants of Vision Transformers with classical convolutional neural networks (CNNs).

Quite recently, Carion et al. (2020) introduced the concept of DETR, which forms the foundation for detection Transformers. DETR uses a ResNet backbone to create a lower-resolution representation of the input images.


Table 13
A brief description of the reviewed Transformer-based medical image synthesizing models.
Method Contributions Highlights
Pure
PTNet (Zhang et al., ∙ Introduced the pure Transformer-based network with linear ∙ Practical inference time around 30 image/s
2021b) computational complexity for image-synthesizing context
MMT (Liu et al., ∙ Proposed a pure Transformer-based architecture that incorporates ∙ Since the attention mechanism can be utilized to pinpoint
2022b) Swin Transformer blocks to perform missing data imputation by influential features in the model’s reasoning and
leveraging the existing MRI contrasts. decision-making, the attention scores of the Transformer
∙ Conducted experiments on the IXI and BraTS datasets to perform decoder in MMT make it interpretable by capturing
qualitative and quantitative analysis and confirm their model’s information in different contrasts that play an important role
efficiency. in generating the output sequence.
∙ The framework can be applied to a variety of medical
analysis tasks, including image segmentation and
cross-modality synthesis.
Bottleneck
ResViT (Dalmaz et al., ∙ First conditional adversarial model for medical image-to-image ∙ Utilized weight sharing strategy among Transformers to
2022) translation with hybrid CNN-Transformer generator hinder the computational overhead and lessen the model
∙ Introduced a new module, ART block, for simultaneously complexity
capturing localization and contextual information ∙ An end-to-end design for the synthesized model that
generalizes through multiple settings of source-target
modalities, e.g., one-to-one and many-to-one tasks
CyTran (Ristea et al., ∙ Proposing a generative adversarial convolutional Transformer for ∙ The presented method can handle high-resolution images
2021) two tasks of image translation and image registration. due to its hybrid structure
∙ Introducing a new dataset, named Coltea-Lung-CT-100W, ∙ Utilized style transfer techniques to improve alignment
comprised of 100 3D anonymized triphasic lung CT scans of female between contrast and non-contrast CT scans
patients.
Encoder
SLMT-Net (Li et al., ∙ Proposed a self-supervised method for cross-modality MR image ∙ Despite using only 70% paired data, MT-Net’s performance
2022d) synthesis to generate missing modalities based on available remains on par with comparable models.
modalities. ∙ Yields Enhanced PSNR scores when utilized for MR image
∙ Addresses the challenge of limited paired data by combining both synthesis.
paired and unpaired data. ∙ The combination of edge-aware pre-training and multi-scale
fine-tuning stages contributes to the success of SLMT-Net.
Decoder
VTGAN (Kamran et al., ∙ Proposed a synthesis model for the task of fundus-to-angiogram ∙ Has the potential to be adopted as a tool for tracking
2021) that incorporates ViT architecture in the decoder section of the disease progression.
system to concurrently classify retinal abnormalities and synthesize ∙ The system is designed to operate on non-invasive and
FA images. low-cost fundus data to generate FA images.
∙ Prepared experimental data based on quantitative and qualitative
metrics regarding the model’s generalization ability under the
influence of spatial and radial transformations.

Fig. 33. An overview of Transformers in medical image detection. Methods are classified into the backbone, neck, and head according to the positions of the Transformers in
their architecture. The prefix numbers in the paper’s name in ascending order denote the reference for each study as follows: 1. Ma et al. (2021), 2. Jiang et al. (2021), 3. Wagner
and Rohr (2022), 4. Nguyen et al. (2022a), 5. Shen et al. (2021a), 6. Kong et al. (2021), 7. Tao and Zheng (2021), 8. Wittmann et al. (2023).

Even though this approach achieves very good 2D detection results, comparable to the R-CNN family, its high computational complexity is a downside. The deformable DETR (Zhu et al., 2021) approach improved DETR's detection performance while overcoming the problem of high computational complexity. Many recent approaches have tried to improve DETR's detection concept over time: Efficient DETR (Yao et al., 2021a) eliminated DETR's requirement for iterative refinement, Conditional DETR (Criminisi et al., 2009) introduced the concept of a conditional cross-attention module, DN-DETR (Li et al., 2022a) introduced a denoising strategy, and DINO (Zhang et al., 2023) improved many further aspects, such as denoising training.
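Since several of the reviewed detectors build on this encoder–decoder detection paradigm, a minimal sketch of a DETR-style decoding head is given below, assuming flattened backbone features as input. Query count, dimensions, and heads are illustrative, and this is not the DETR reference implementation.

```python
import torch
import torch.nn as nn

class DETRStyleHead(nn.Module):
    """Minimal DETR-style decoding head: learned object queries attend to the encoded image
    features, and class / box heads predict one (class, box) pair per query."""
    def __init__(self, dim=256, num_queries=100, num_classes=10, depth=6, heads=8):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for the "no object" class
        self.box_head = nn.Linear(dim, 4)                  # normalized (cx, cy, w, h)

    def forward(self, memory):
        # memory: (B, N, dim) flattened backbone/encoder features
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        hs = self.decoder(q, memory)
        return self.class_head(hs), self.box_head(hs).sigmoid()
```

In a full pipeline, the predicted (class, box) pairs would be matched to ground-truth objects with a set-based loss, as in the end-to-end detection Transformers discussed above.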


Table 14
An overview of the reviewed transformer-based medical image detection approaches.
Method Modality Organ Type Pre-trained Datasets Metrics Year
module: Type
Backbone
TR-Net (Ma et al., MPR Heart (coronary 3D Supervised Private dataset Accuracy, 2021
2021) artery) Sensitivity,
Specificity,
PPV, NPV,
F1-score
RDFNet (Jiang et al., Dental caries Teeth 2D Supervised Private dataset Precision, 2021
2021) Recall,
mAP@0:5
1
CellCentroidFormer Microscopy Cells 2D Supervised Fluo-N2DL-HeLa (HeLa) Mean-IoU, 2022
(Wagner and Rohr, (Ulman et al., 2017), SSIM
2
2022) Fluo-N2DH-SIM+ (SIM+)
(Ulman et al., 2017),
3
Fluo-N2DH-GOWT1
(GOWT1) (Ulman et al.,
2017),
4 PhC-C2DH-U373 (U373)

(Ulman et al., 2017).


Joint-2D3D-SSL X-ray Chest 2D 3D Self-supervised VinDr-CXR (Nguyen et al., mAP@0.5 2022
(Nguyen et al., 2022a) 2022b)
Head
Focused decoder CT Multi-organ 3D Semi-Supervised 1 VISCERAL anatomy mAP, 2023
(Wittmann et al., 2023) benchmark (Jimenez-del Toro AP50,
et al., 2016), AP75
2
AMOS22 challenge (Ji et al.,
2022).
Neck
1
COTR (Carion et al., Colonoscopy Colon 2D Supervised CVC-ClinicDB (Bernal et al., Precision, 2021
2020) 2015), Sensitivity,
2 ETIS-LARIB (Silva et al., F1-score
2014),
3
CVC-ColonDB (Bernal et al.,
2012).
CT-CAD (Kong et al., X-ray Chest 2D Supervised 1 Vinbig Chest X-ray dataset AP50 2021
2021) (Nguyen et al., 2021b)
2 ChestXDet-10 dataset (Liu

et al., 2020b)
Spine-transformer (Tao CT Vertebra 3D Supervised 1 VerSe 2019 (Sekuboyina Id-Rate, 2021
and Zheng, 2021) et al., 2020), L-Error
2 MICCAI-CSI 2014 (Glocker

et al., 2013),
3
Private dataset

Recently, some studies have performed experiments on 2D medical data, such as Prangemeier et al. (2020) and Shen et al. (2021a). However, only very few attempts have tried to adapt these models to 3D. The Spine-Transformer was proposed by Tao and Zheng (2021) for sphere-based vertebrae detection. Another approach in 3D detection was proposed by Ma et al. (2021), which introduced a novel Transformer that combines convolutional layers and Transformer encoders for automatically detecting coronary artery stenosis in Coronary CT Angiography (CCTA). An approach to better extract complex tooth decay features was proposed by Jiang et al. (2021). For end-to-end polyp detection, Shen et al. (2021a) proposed an approach based on the DETR model, and Kong et al. proposed CT-CAD (Kong et al., 2021), context-aware Transformers for end-to-end chest abnormality detection. As illustrated in Fig. 33, we have broadly classified these methods based on the role ViT plays in the commonly accepted detection pipeline. Table 14 provides details on modalities, organs, datasets, metrics, etc., and the highlights of the different approaches are summarized in Table 15. Some of the aforementioned detection papers in the medical image domain are reviewed in detail in this section.

7.1. Backbone

This section explains Transformer networks that use only Transformer encoder layers for object detection. The work proposed by Ma et al. (2021) uses a Transformer network (TR-Net) to identify stenosis. A leading threat to the lives of cardiovascular disease patients globally is Coronary Artery Disease (CAD); hence, the automatic detection of CAD is significant and is considered a challenging task in clinical medicine. The complexity of the coronary artery plaques that result in CAD makes the detection of coronary artery stenosis in Coronary CT Angiography (CCTA) challenging.

The architecture combines the feature extraction capabilities of convolutional layers and Transformer encoders. TR-Net can easily analyze the semantic information of the sequences and can model the relationship between the image information at each position of a multiplanar reformatted (MPR) image, so it can effectively detect stenosis based on both local and global features: the CNN easily extracts the local semantic information from the images, and the Transformer more easily captures global semantic details. A 3D-CNN is employed to capture the local semantic features from each position in an MPR image.


Table 15
A brief description of the reviewed Transformer-based medical image detection models. The unreported number of parameters indicates that the value was not mentioned in the
paper.
Method # params Contributions Highlights
Backbone
TR-Net (Ma et al., 2021) – ∙ This work is the first attempt to detect coronary ∙ While detecting significant stenosis, the TR-Net
artery stenosis more accurately by employing architecture is capable of combining the
Transformers. information of local areas near stenoses and the
∙ To detect significant stenosis, local and global global information of coronary artery branches.
features are effectively integrated into this ∙ Compared to state-of-the-art methods, the TR-Net
approach, which has resulted in more accurate model has better results on multiple indicators.
results. ∙ The shallow CNN layer prevents the overfitting
of semantic information and improves the overall
efficiency.
∙ The gain in performance comes with a trade-off
in the number of parameters, which affects the
computational complexity.

RDFNet (Jiang et al., – ∙ An image dataset of caries is created, which is ∙ Compared with existing approaches, the accuracy
2021) annotated by professional dentists. and speed of caries detection are better.
∙ For better extraction of the complex features of ∙ Method is applicable to portable devices.
dental caries, the Transformer mechanism is ∙ The method does not work really well when the
incorporated. illumination of the oral image is insufficient.
∙ In order to increase the inference speed ∙ Even though detection accuracy and speed are
significantly, the FReLU activation function is improved compared to the original approach, the
adopted. detection speed is not the fastest.

CellCentroidFormer 11.5M ∙ A novel deep learning approach that combines ∙ Pseudocoloring in combination with pre-trained
(Wagner and Rohr, 2022) the self-attention of Transformers and the backbones shows improved cell detection
convolution operation of convolutional neural performance.
networks is proposed. ∙ The model outperforms other state-of-the-art
∙ A centroid-based cell detection method, denoting fully convolutional one-stage detectors on four
the cells as ellipses is proposed. microscopy datasets, despite having a lower
number of parameters.
∙ Larger output strides worsen the performance.
Joint-2D3D-SSL (Nguyen 31.16M ∙ Introduces a self-supervised learning framework ∙ Overcomes limitations in fine-tuning pre-trained
et al., 2022a) capable of leveraging both 2D and 3D data for models with different dimensionality.
downstream applications. ∙ Produces versatile pre-trained weights for both
∙ Proposes a deformable self-Attention mechanism 2D and 3D applications.
that captures flexible correlations between 2D
slices, leading to powerful global feature
representations.
∙ Extends the SSL tasks by proposing a novel 3D
agreement clustering and masking embedding
prediction
Head
Focused Decoder VISCERAL Dataset - ∙ First detection Transformer model for 3D ∙ Better results compared to existing detection
(Wittmann et al., 2023) 41.8M anatomical structure detection. models using a Transformer network like DETR
AMOS22 Dataset - ∙ Introduced a focused decoder to focus the (Carion et al., 2020) and deformable DETR (Zhu
42.6M predictions on RoI. et al., 2021).
∙ Comparable results to the RetinaNet (Schoppe
et al., 2020).
∙ Varying anatomical fields of view (FoVs) can
affect the robustness of the model.
Neck
COTR (Xie et al., 2021) – ∙ Proposed a convolution layer embedded into the ∙ COTR has comparable results with state-of-the-art
Transformer encoder for better feature methods like Mask R-CNN (Qadir et al., 2019) and
reconstruction and faster convergence compared to MDeNet-plus (Qadir et al., 2021).
DETR. ∙ This approach produces low confidence for a
particular type of lesion.

CT-CAD (Kong et al., – ∙ Proposed a context-aware feature extractor, ∙ CT-CAD outperforms the existing methods in
2021) which enhances the receptive fields to encode Cascade R-CNN (Cai and Vasconcelos, 2018), YoLo
multi-scale context-relevant information. (Redmon et al., 2016), and DETR (Zhu et al.,
∙ Proposed a deformable Transformer detector that 2021).
attends to a small set of key sampling locations ∙ CT-CAD is capable to detect hard cases, such as
and then the Transformers can focus to feature nodules that are ignored by Faster R-CNN.
subspace and accelerate the convergence speed. ∙ Compared to the ChestXDet-10 dataset, this
model has a lower performance on the Vinbig
Chest X-ray dataset which has higher categories of
abnormalities with more complex patterns.

Method # params Contributions Highlights
Spine-Transformers (Tao – ∙ Proposed a 3D object detection model based on ∙ Obtained better results for all datasets compared
and Zheng, 2021) the Transformer’s architecture. to state-of-the-art methods.
∙ Proposed a one-to-one set global loss that ∙ The model has a higher Id-Rate on both the
enforces unique prediction and preserves the datasets, but a higher L-Error compared to the
sequential order of vertebrae. benchmark by Sekuboyina et al. (2020).
∙ Proposed a Sphere-based bounding box to
enforce rotational invariance.

Fig. 34. Proposed architecture of TR-Net model (Ma et al., 2021).
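A rough sketch of the 3D-CNN plus stacked-Transformer-encoder design in Fig. 34, as described in the text below, is given here. The kernel, pooling, filter, and depth settings follow the description, while the cube size, token width, and per-position tokenization are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TRNetSketch(nn.Module):
    """Hedged sketch of a TR-Net-style stenosis detector: a shallow 3D-CNN extracts a local
    feature vector for every position (cube) along the MPR sequence, a stack of Transformer
    encoders models dependencies between positions, and a per-position classifier predicts
    significant stenosis."""
    def __init__(self, num_classes=2, depth=12, heads=8):
        super().__init__()
        chans = [1, 16, 32, 64, 128]
        blocks = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv3d(cin, cout, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool3d(2)]
        self.cnn = nn.Sequential(*blocks)              # four conv/ReLU/pool stages, 16 -> 128 filters
        self.embed = nn.LazyLinear(128)                # flatten each local 3D feature map into a token
        layer = nn.TransformerEncoderLayer(d_model=128, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)   # 12 stacked encoders
        self.classifier = nn.Linear(128, num_classes)  # stenosis logits (softmax via CrossEntropyLoss)

    def forward(self, cubes):
        # cubes: (B, L, 1, D, H, W), one small 3D cube per position along the MPR image
        b, l = cubes.shape[:2]
        feats = self.cnn(cubes.flatten(0, 1))          # (B*L, 128, d, h, w) local features
        tokens = self.embed(feats.flatten(1)).view(b, l, -1)
        tokens = self.encoder(tokens)                  # relate local features across positions
        return self.classifier(tokens)                 # per-position logits
```

For instance, with 32 × 32 × 32 cubes the four pooling stages reduce each cube to a 2 × 2 × 2 feature map with 128 channels before tokenization.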

position in an MPR image. After this step, the Transformer encoders are mainly used to analyze feature sequences. The main advantage here is that this helps in mining the dependency of local stenosis on each position of the coronary artery. The architecture of the TR-Net is given in Fig. 34. One part of the figure indicates the 3D-CNN. This module extracts the local features. The other part indicates the Transformer encoder structure. This module associates the local feature maps of each position and helps in analyzing the dependency between different positions, which in turn is helpful for classifying the significant stenosis at each position. The CNN part has two main advantages: it prevents the overfitting of semantic information and improves the model's efficiency. The input to the network architecture is the coronary artery MPR image.

The 3D-CNN module has four sequentially connected substructures, which consist of a convolutional kernel of size 3 × 3 × 3, a non-linear ReLU layer, and a 2 × 2 × 2 max-pooling layer. The number of filters is 16 in the first part, and in subsequent parts the number of filters is double that of the previous part. Since Transformers take 1D vector sequences as input, the feature maps are flattened. The Transformer in the proposed architecture consists of 12 Transformer encoders. Each Transformer encoder mainly consists of two sub-blocks - multi-head self-attention (MSA) and the feed-forward network (FFN) - which are connected sequentially. Layer normalization (LN) and residual connections are employed before and after the two sub-blocks. In order to ensure the consistency of the encoders, the size of the input is the same as the size of the output. The output of the previous encoder is given as input to the next encoder. In the final layer, the embeddings are fed into softmax classifiers to detect significant stenosis.
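To make this pipeline concrete, the following minimal PyTorch-style sketch stacks a small 3D-CNN feature extractor, a Transformer encoder over the per-position tokens, and a per-position softmax classifier. The tensor sizes, channel counts, and class count are hypothetical and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TRNetSketch(nn.Module):
    """Illustrative only: per-position 3D-CNN features, a Transformer encoder
    stack over the resulting sequence, and a per-position stenosis classifier."""
    def __init__(self, d_model=128, n_heads=8, n_layers=12, n_classes=3):
        super().__init__()
        blocks, in_ch = [], 1
        for out_ch in (16, 32, 64, 128):   # filters double in each of the four sub-structures
            blocks += [nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2)]
            in_ch = out_ch
        self.cnn = nn.Sequential(*blocks)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, cubes):                       # cubes: (B, N, 1, 32, 32, 32) MPR sub-volumes
        b, n = cubes.shape[:2]
        feats = self.cnn(cubes.flatten(0, 1))       # local features for each artery position
        feats = F.adaptive_avg_pool3d(feats, 1).flatten(1).view(b, n, -1)
        tokens = self.encoder(feats)                # model dependencies between positions
        return self.classifier(tokens).softmax(-1)  # per-position class probabilities

probs = TRNetSketch()(torch.randn(2, 20, 1, 32, 32, 32))   # -> (2, 20, 3)
```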
The RDFNet approach proposed by Jiang et al. (2021) incorporates the Transformer mechanism in order to better extract complex tooth decay features. The incorporation of the Transformer has improved the detection accuracy. The three main modules of the network are the backbone, neck, and prediction modules. The backbone module is mainly used to extract the features from caries images. In the backbone module, the focus operation is a slicing operation that can replace the convolution operation and reduce the loss of feature information. The C3Modified layer is a convolution module activated by the FReLU function, which extracts complex visual-spatial information of the caries images. The SPP (He et al., 2015) module has a spatial pyramid structure that can expand the perceptual field, fusing the local and global features and enhancing the feature maps. After the SPP structure, RDFNet appends an improved Transformer-encoder module to improve the feature extraction capability. The main functionality of the neck module is to fuse the feature maps of different sizes and extract high-level semantic structures. This module mainly uses the structure of the feature pyramid network (FPN) proposed in Lin et al. (2017a) and the path aggregation network (PAN) proposed in Liu et al. (2018b). The FPN approach is employed in a top-down fashion, and PAN is performed in a bottom-up fashion to generate the feature pyramids. In order to prevent information loss, feature fusion is performed using both bottom-up and top-down approaches. An improved C3Modified convolutional module is adopted into the neck module to better extract the semantic features of caries images. The high-level features generated by the neck module are used by the prediction module, which in turn classifies and regresses the location and class of the objects. To overcome the low detection accuracy typical of single-stage detection methods, the prediction module has three detection heads for detecting large, medium, and small objects. As Transformers have proven to have strong feature extraction capability, the authors utilized the Transformer model to extract complex features. To better extract the features, three Transformer encoders were stacked together. To simplify the model, the authors removed the original normalization layer from the Transformer encoder. In order to extract the deep features, the feature map was fed into this structure. For each head, the attention values were calculated independently and later concatenated.

Wagner and Rohr (2022) proposed a novel hybrid cell detection approach (CellCentroidFormer) in microscopic images that combines the advantages of vision Transformers (ViTs) and convolutional neural networks (CNNs). The authors show that the combined use of convolutional and Transformer layers is advantageous, as the convolutional layers can focus on the local information (cell centroids), and the Transformer


Fig. 35. Proposed architecture of CellCentroidFormer model (Wagner and Rohr, 2022).

layers can focus on the global information (overall shapes of a cell). The proposed centroid-based approach represents the cells as ellipses and is trainable in an end-to-end fashion. Four different 2D microscopic datasets were used for experimental evaluations, and the results outperformed the fully convolutional architecture-based methods. Fig. 35 shows the architecture. The encoder output is then folded into a 3D tensor, which is afterward concatenated with the input tensor. The MobileViT block is a lightweight alternative to the actual encoder–decoder approach using a Transformer (Vaswani et al., 2017). Due to the multi-head self-attention layers, the MobileViT block causes a much higher computational complexity than convolutional layers. To avoid increasing the computational complexity excessively, the MobileViT blocks are employed only in the neck part of the proposed model. Layer normalization is added for regularization and to allow higher learning rates. The backbone module of the proposed model is the EfficientNetV2S (Tan and Le, 2021) CNN model. This block mainly consists of six high-level blocks, out of which five blocks are used to extract image features. To use the advantage of transfer learning, the backbone module is initialized with weights learned from training on ImageNet. This, in turn, reduces the amount of required training data. The EfficientNetV2S (Tan and Le, 2021) CNN models are generally optimized for a fixed input size. Therefore, the input images need to be resized to this input size. The cells are represented mainly by the centroid, width, and height parameters. Two fully convolutional heads are used to predict these cell parameters in the paper. These heads contain 2D convolution, batch normalization, and bilinear upsampling layers. More MobileViT blocks are not used, as this would increase the computational complexity. Later convolutional layers have a bigger receptive field, which helps in capturing the global information (Raghu et al., 2021) effectively. The first convolutional head predicts a heatmap for detecting the cell centroids, and the second head is used for predicting the cell dimensions. The output dimensions of this model are 384 × 384. The authors use one decoder of the Dual U-Net to predict the centroid heatmap, and the second branch predicts the dimensions of the detected cells. The shapes of the cells are focused on by the Transformer layers in the network.
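The following simplified sketch illustrates such a two-headed prediction stage: one fully convolutional head producing a centroid heatmap and one regressing per-pixel cell dimensions. The channel counts, layer depths, and resolutions are hypothetical; the actual CellCentroidFormer heads are configured differently.

```python
import torch
import torch.nn as nn

def conv_head(in_ch: int, out_ch: int) -> nn.Sequential:
    """Fully convolutional head: 2D convolution, batch norm, bilinear upsampling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, kernel_size=3, padding=1),
        nn.BatchNorm2d(32),
        nn.ReLU(),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(32, out_ch, kernel_size=1),
    )

neck_features = torch.randn(1, 64, 192, 192)            # hypothetical neck output
centroid_head = conv_head(64, 1)                         # 1-channel centroid heatmap
dimension_head = conv_head(64, 2)                        # per-pixel cell width and height
heatmap = torch.sigmoid(centroid_head(neck_features))    # (1, 1, 384, 384)
cell_sizes = dimension_head(neck_features)               # (1, 2, 384, 384)
```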
Nguyen et al. (2022a) present a novel self-supervised learning (SSL) framework, Joint-2D3D-SSL, designed to address the scarcity of labeled training data for joint learning on 2D and 3D medical data modalities. The framework constructs an SSL task based on a 2D contrastive clustering problem for distinct classes using 2D images or 2D slices extracted from 3D volumes. In particular, the framework leverages 3D volumes by computing feature embeddings for each slice and then constructing a comprehensive feature representation through a deformable self-attention mechanism. This approach enables the capture of correlations among local slices, resulting in a holistic understanding of the data. The global embedded features derived from this Transformer are subsequently utilized to define an agreement clustering for 3D volumes and a masked encoding feature prediction. By employing this methodology, the framework enables the learning of feature extractors at both the local and global levels, ensuring consistency and enhancing performance in downstream tasks. Experimental results across various downstream tasks, including abnormal chest X-ray detection, lung nodule classification, 3D brain segmentation, and 3D heart structure segmentation, illustrate the efficacy of this joint 2D and 3D SSL approach, surpassing plain 2D and 3D SSL approaches, as well as improving upon SOTA baselines. Additionally, this method overcomes limitations associated with fine-tuning pre-trained models of different dimensionality, providing versatile pre-trained weights suitable for both 2D and 3D applications.

7.2. Head

Detection Transformers based on a Transformer encoder–decoder architecture require a large amount of training data to deliver the highest performance. However, this is not feasible in the medical domain, where access to labeled data is limited. To address this problem, for the detection of 3D anatomical structures in the human body, Wittmann et al. (2023) proposed a detection Transformer network with a focused decoder. This network considers the relative position of the anatomical structures and thus requires less training data. The focused decoder uses an anatomical region atlas to deploy query anchors that focus on the relevant anatomical structures. The proposed network omits the Transformer encoder network and consists of only Transformer decoder blocks. The authors show that in 3D datasets, avoiding the encoder can reduce the complexity of modeling relations with a self-attention module.

The model architecture contains a backbone network for feature extraction, a focused decoder network for providing well-defined detection results, a classification network to predict the classes, and a bounding box regression network to output the best possible bounding box. The feature extraction backbone network is a feature pyramid network (FPN) inspired by RetinaNet (Schoppe et al., 2020). Features from the second layer (P2) are flattened before being given as input to the focused decoder. A specific anatomical region atlas (Hohne et al., 1992)


containing regions of interest (RoI) is determined for each dataset. Then, uniformly spaced query anchors are placed in each RoI, and a dedicated object query is assigned to each. Such an object query restricts the focused decoder network to predict solely within its respective RoI.

The focused decoder network contains a self-attention module, a focused cross-attention module, and a feed-forward network (FFN). The self-attention module encodes strong positional inter-dependencies among object queries. The focused cross-attention module matches the input sequence to object queries to regulate the individual feature map for prediction via attention. The FFN then enables a richer feature representation. Also, residual skip connections and normalizations are used to increase gradient flow. The classification network consists of a single fully-connected layer, and the bounding box regression network consists of three layers. The bounding box predictions are combined with query anchors to obtain the bounding box together with class-specific confidence scores. The network is trained to predict 27 candidate predictions per class. Dynamic labels are created during training with the help of the generalized intersection over union (GIoU) to obtain these 27 predictions. During inference, the prediction with the highest confidence score indicates the best candidate. The model is trained end-to-end with the above GIoU loss, a binary cross-entropy loss for the classification network, and an L1 loss for the bounding box predictions.
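A compact sketch of such a combined detection objective is given below. It mixes an L1 box loss, a GIoU term, and a binary cross-entropy classification loss; the loss weights and box format are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def detection_loss(pred_boxes, pred_logits, gt_boxes, gt_labels,
                   w_l1=5.0, w_giou=2.0, w_cls=1.0):
    """Sketch of a combined loss: L1 + GIoU for boxes, BCE for classification.
    Boxes are (x1, y1, x2, y2); the weights are illustrative, not published values."""
    l1 = F.l1_loss(pred_boxes, gt_boxes)
    giou = 1.0 - torch.diag(generalized_box_iou(pred_boxes, gt_boxes)).mean()
    cls = F.binary_cross_entropy_with_logits(pred_logits, gt_labels)
    return w_l1 * l1 + w_giou * giou + w_cls * cls

loss = detection_loss(torch.tensor([[10., 10., 50., 60.]]),   # predicted box
                      torch.tensor([[2.0]]),                  # predicted class logit
                      torch.tensor([[12.,  8., 52., 58.]]),   # ground-truth box
                      torch.tensor([[1.0]]))                  # ground-truth label
```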
7.3. Neck

Detection methods using region-based approaches need to generate anchor boxes to encode their prior knowledge and use non-maximum suppression to filter the resulting bounding boxes after prediction. These pre- and post-processing steps remarkably reduce the detection performance. To bypass these surrogate tasks, Carion et al. (2020) proposed the Detection Transformer (DETR), which views the object detection task as a direct set prediction problem using a Transformer-based encoder–decoder architecture. The self-attention mechanism of the Transformers, which explicitly models all pairwise interactions between elements in a sequence, helps to predict the set of detections with absolute prediction boxes directly from the image rather than using an anchor. For the end-to-end detection of polyp lesions, Shen et al. (2021a) proposed a convolution-in-Transformer (COTR) network based on the DETR model. COTR consists of four main layers: (1) a CNN backbone network used for extracting features, (2) Transformer encoder layers embedded with convolutional layers used for feature encoding and reconstruction, (3) Transformer decoder layers used for object querying, and (4) a feed-forward network used for detection prediction. Embedding convolutional layers into the Transformer encoder layer leads to convergence acceleration compared to the slow convergence of the DETR model.
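The direct set prediction used by DETR-style detectors relies on a one-to-one bipartite matching between object queries and ground-truth objects. The sketch below illustrates the matching step with the Hungarian algorithm; the cost function (class probability plus an L1 box distance) and its weighting are simplified assumptions and omit the GIoU term used in practice.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bipartite_match(pred_boxes, pred_probs, gt_boxes, gt_labels, l1_weight=5.0):
    """Match N predicted queries to M ground-truth objects one-to-one.
    Unmatched queries are later supervised as the 'no object' class."""
    cls_cost = -pred_probs[:, gt_labels]                                       # (N, M)
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)   # pairwise L1
    cost = cls_cost + l1_weight * box_cost
    query_idx, gt_idx = linear_sum_assignment(cost)                            # Hungarian algorithm
    return list(zip(query_idx.tolist(), gt_idx.tolist()))

matches = bipartite_match(
    pred_boxes=np.random.rand(5, 4), pred_probs=np.random.rand(5, 3),
    gt_boxes=np.random.rand(2, 4), gt_labels=np.array([0, 2]))
```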
The CNN backbone uses a pre-trained model with a ResNet18 (He et al., 2016) architecture for feature extraction. This layer converts input medical images to a high-level feature map. The authors then use a 1 × 1 convolution to reduce the channel dimensions. In the Transformer encoder layers, they used six convolution-in-Transformer encoders to collapse this spatial structure into a sequence. Then they use a convolution layer to reconstruct the sequential layer back to the spatial one. In the encoder layer, each Transformer has a standard architecture with a multi-head self-attention module and a feed-forward network. A positional embedding (Vaswani et al., 2017) is also added to the input of each attention layer. In the Transformer decoder layers, they used six decoders, which follow the standard architecture of the Transformer except that they also decode object queries in parallel. Each object query corresponds to a particular object in the image. The decoders take these object queries with position embeddings as well as output embeddings from the encoder network and convert them into decoded embeddings. Then they used a feed-forward network with two fully connected layers for converting these decoded embeddings into object predictions. The first fully connected layer is a box regression layer to predict object location, and the second one is a box-classification layer to predict object scores. Therefore, the object queries are independently decoded into box coordinates and classes by the feed-forward network, which results in final predictions, including object and no-object (background) predictions. This model transforms the object detection problem into a direct set prediction problem by training end-to-end and calculating a bipartite matching loss (Hungarian algorithm) between predictions and ground truth for each query. If the number of queries exceeds the number of objects in the image, the remaining boxes are annotated as the no-object class. Thus, the model is trained to predict an output for each query as an object or no-object detection. For the class prediction, they used a negative log-likelihood loss, and for the bounding box localization, they used an L1 loss with a generalized intersection over union (GIoU) (Rezatofighi et al., 2019) loss. The experiments demonstrated that the proposed model achieved comparable performance against state-of-the-art methods.

Many deep learning detection methods fail to use context-relevant information for improved accuracy, and they also generally suffer from slower convergence issues and high computational costs. The proposed CT-CAD (Kong et al., 2021), context-aware Transformers for end-to-end chest abnormality detection, addresses these problems. The model consists of two main modules: (1) a context-aware feature extractor module for enhancing the features, and (2) a deformable Transformer detector module for detection prediction and to accelerate the convergence speed. The context-aware feature extractor network uses a ResNet50 backbone, dilated context encoding (DCE) blocks, and a positional encoding structure. The deformable Transformer detector contains a Transformer encoder–decoder architecture and a feed-forward network. The proposed design of the context-aware feature extractor is inspired by the feature fusion scheme from DetectoRS (Qiao et al., 2021), which is based on the Feature Pyramid Networks (FPN) (Lin et al., 2017b). The feature fusion scheme iteratively enhances the features of the FPN to powerful feature representations. Likewise, the DCE blocks enhance the features extracted from the ResNet50 backbone by expanding the receptive fields to fuse multiscale context information using dilated convolution filters of different sizes. This powerful feature map benefits detecting objects across various scales. Inspired by YOLOF (Chen et al., 2021c), the DCE block uses dilated convolution and skip connections to achieve a larger receptive field and acquire more local context information. Finally, all the features from different DCE blocks computed at different scales are summed up to get the output feature map.

The proposed design of the deformable Transformer detector contains single-scale and multi-head attention properties. The deformable attention block attends to a small set of key sampling points, thus allowing the Transformer to focus on the feature space and accelerate the convergence. The authors used six encoder and decoder layers with positional encoding to obtain the decoder outputs. The outputs from the decoder are the number of abnormalities detected and the dimension of the decoder layers. Finally, a feed-forward network is used to output the category classification and location regression results. The model is trained end-to-end with a combination of a bounding box loss and a classification (cross-entropy) loss. The authors adopted GIoU (Zheng et al., 2020) to balance the loss between large and small object bounding boxes.
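The core idea of deformable attention, attending to a handful of sampled locations around a reference point instead of to every spatial position, can be sketched as follows. This is a toy single-scale, single-head version with hypothetical dimensions, not the CT-CAD implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    """Each query attends to K sampled locations around its reference point."""
    def __init__(self, dim=64, n_points=4):
        super().__init__()
        self.offsets = nn.Linear(dim, n_points * 2)   # predicted (dx, dy) per sampling point
        self.weights = nn.Linear(dim, n_points)       # attention weight per sampling point

    def forward(self, query, feat_map, ref_xy):
        # query: (B, C), feat_map: (B, C, H, W), ref_xy: (B, 2) normalized to [-1, 1]
        b = feat_map.shape[0]
        offs = self.offsets(query).view(b, -1, 2)                        # (B, K, 2)
        locs = (ref_xy[:, None, :] + 0.1 * torch.tanh(offs)).view(b, 1, -1, 2)
        sampled = F.grid_sample(feat_map, locs, align_corners=False)     # (B, C, 1, K)
        w = self.weights(query).softmax(-1)                              # (B, K)
        return (sampled.squeeze(2) * w[:, None, :]).sum(-1)              # (B, C)

out = DeformableSampling()(torch.randn(2, 64), torch.randn(2, 64, 32, 32), torch.zeros(2, 2))
```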
The attention module in detection Transformers computes similarity scores between the elements of the input data to identify complex dependencies within these data. Calculating similarities of all possible positional pairs in the input data scales quadratically with the number of positions and thus becomes computationally very expensive. For this reason, Transformer-based object detection models had not previously been applied to 3D images. Tao and Zheng (2021) proposed a novel Transformer-based 3D object detection model, formulated as a one-to-one set prediction problem, for the automatic detection of vertebrae in arbitrary Field-Of-View (FOV) scans, called the Spine-Transformers. Here the authors used a one-to-one set-based global loss that compels a


unique prediction while preserving the sequential order of different levels of vertebrae, and eliminated bipartite matching between ground truth and prediction. The main modules of the Spine-Transformer are (1) a backbone network to extract features, (2) a light-weighted Transformer encoder–decoder network using positional embeddings and a skip connection, and (3) two feed-forward networks for detection prediction. The authors used a ResNet50 (He et al., 2016) architecture without the last SoftMax layer as the backbone network to extract high-level features. These features are passed through a 1 × 1 × 1 convolutional layer to reduce the channel dimensions and then flattened to obtain a feature sequence that is fed as the input to the Transformer network. The light-weighted Transformer encoder–decoder network contains only a two-layer encoder and a two-layer decoder to balance feature resolution and memory constraints. In both the encoder and decoder layers of the network, learnable positional embeddings are used. The authors found that using a skip connection across the Transformer encoder–decoder network helps the propagation of context and gradient information during training and thus improves performance. The two feed-forward networks are then used to predict the existence of the objects and regress their coordinates. The authors also proposed a sphere-based bounding box detector, called the InSphere detector, which replaces the rectangular bounding box and introduces rotational invariance. The Spine-Transformer is trained end-to-end with fixed-size patch images to predict all the vertebrae objects in parallel by forcing one-to-one matching. A binary cross-entropy loss is used as the classification loss, and to enforce the order of the predicted vertebrae objects, an edge loss is introduced, which is an L1 distance loss between the centers of the top and bottom neighboring vertebrae objects. For better localization accuracy of the bounding sphere detection, the authors used a generalized intersection-over-union (GIoU) (Rezatofighi et al., 2019) loss. The results of this model were superior to all the state-of-the-art methods. The authors also claim that by using a 3D CNN-based landmark regression (Çiçek et al., 2016), the localization accuracy can be further improved.
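One plausible reading of the edge loss described above is an L1 penalty on the offsets between neighboring predicted vertebra centers versus the corresponding ground-truth offsets, which encourages the predicted sequence to keep its top-to-bottom order. The following sketch encodes that interpretation and is not the authors' exact formulation.

```python
import torch

def edge_loss(pred_centers: torch.Tensor, gt_centers: torch.Tensor) -> torch.Tensor:
    """pred_centers / gt_centers: (N, 3) vertebra centers ordered from top to bottom."""
    pred_edges = pred_centers[1:] - pred_centers[:-1]   # vectors between neighboring vertebrae
    gt_edges = gt_centers[1:] - gt_centers[:-1]
    return (pred_edges - gt_edges).abs().mean()         # L1 distance on neighboring offsets

loss = edge_loss(torch.rand(5, 3), torch.rand(5, 3))
```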
7.4. Discussion and conclusion

In this chapter, several well-known Transformer architectures are analyzed to address the automatic detection challenge. Based on the Transformer model's contribution to the network structure, we grouped the literature into backbone, neck, or head strategies, and for each category we provided sample works. We also described the details regarding self-supervised learning in Nguyen et al. (2022a). In this respect, the core idea behind each network design along with the pros and cons of the strategies are highlighted in the summary tables. Vision Transformers have been shown to make more accurate diagnoses compared to traditional methods of analyzing medical images. These deep learning models can be trained on large datasets, such as ImageNet, and fine-tuned on medical image datasets to improve their performance in detecting abnormalities in X-rays, CT scans, and MRIs. By incorporating information from multiple modalities, Transformers can further enhance their ability to identify and detect rare or subtle abnormalities in medical images. Many medical images are taken over time, and incorporating temporal information into the model can improve its performance. For example, the model can be designed to take into account the temporal evolution of diseases or conditions. Overall, Transformers have demonstrated their capabilities to significantly improve the accuracy and efficiency of medical image analysis, leading to advances in healthcare.

8. Medical image registration

Medical image registration is the task of transforming a set of two or more images of an organ or a biological process taken with different poses, time stamps, or modalities (e.g., CT and MRI) into a geometrically aligned and spatially corresponding image that can be utilized for medical analysis. The transformation can be discovered by solving an optimization problem that maximizes the similarity between the images to be registered (Haskins et al., 2020). A pair-wise registration of two MRI brain scans is shown in Fig. 36 for illustration.

Fig. 36. An example of pair-wise medical image registration. The goal of image registration is to geometrically align the moving image with the target or fixed image by performing the spatial transformation.

Despite remarkable advancements in the quality of medical imaging techniques that aid professionals in better visualization and analysis of image data, a prominent challenge prevails in developing a system capable of effective integration of visual data that captures useful information from original images with high precision. Most registration procedures take the whole image as input by utilizing global information for spatial transformation, which leads to inefficient and slow integration of data. Furthermore, the collection process of medical images for training is slow and toilsome, performance degrades due to the presence of outliers, and local maxima entail negative effects on performance during optimization (Alam et al., 2018; Alam and Rahman, 2019). The emergence of deep learning methods alleviated these problems by automatic extraction of features utilizing convolutional neural networks (CNNs), optimizing a global function, and improving registration accuracy. For instance, Balakrishnan et al. (2018) utilized a CNN to achieve unsupervised deformable registration by treating it as a parametric function to be optimized during training. Furthermore, Chen et al. (2020a) presented an unsupervised CNN-based registration algorithm to produce anthropomorphic phantoms. However, there are still limitations in capturing long-range spatial correspondence in CNN-based frameworks (Chen and Yu, 2021; Matsoukas et al., 2021). Fueled by the strong ability of Transformers to model long-range dependencies and detect global information (Lin et al., 2021; Khan et al., 2022; Xu et al., 2022), they have gained the attention of researchers in the medical image registration domain in recent years. In this section, we review Transformer-based methods in medical image registration that ameliorate the aforementioned shortcomings of previous systems by utilizing the self-attention mechanism. We have organized the relevant approaches based on their type of registration:

(a) Deformable registration, which employs an optimization algorithm to tune the transformation model in a way that maximizes the similarity measure function for the images of interest (Rong et al., 2021);
(b) Rigid registration, which achieves correspondence by maintaining the relative distance between each pair of points of the patient's anatomy images (Rong et al., 2021);
(c) Affine registration, which contains the same operations as rigid registration plus non-isometric scaling (see Fig. 37).


Fig. 37. Taxonomy of Transformer-based image registration based on their transformation type. We use the prefix numbers in the figure in ascending order and reference the corresponding paper as follows: 1. Xu et al. (2022), 2. Chen et al. (2021a), 3. Chen et al. (2022a), 4. Zhang et al. (2021d), 5. Shi et al. (2022b), 6. Miao et al. (2022), 7. Mok and Chung (2022).

8.1. Deformable registration

Most existing Transformer-based algorithms focus on deformable transformation to perform medical image registration. ViT-V-Net (Chen et al., 2021a) is the earliest work that incorporates Transformers to perform medical image registration in a self-supervised fashion. It is inspired by the integration of vision Transformer-based segmentation methods with convolutional neural networks to enhance the localization information recovered from the images. Unlike previous research that employed 2D images for spatial correspondence, ViT-V-Net stepped towards utilizing ViT (Dosovitskiy et al., 2020) as the first study for volumetric medical image registration (i.e., 3D image registration). As illustrated in Fig. 38, the images are first encoded into high-level feature representations by applying multiple convolution blocks; then, these features are split into P patches in the ViT block. Next, the patches are mapped to a D-dimensional embedding space to provide patch embeddings, which are then integrated with learnable positional encodings to retain positional information. Next, these patches are passed into the encoder block of the Transformer, followed by multiple skip connections to retain localization information, and then decoded employing a V-Net style decoder (Milletari et al., 2016). Finally, a spatial Transformer (Jaderberg et al., 2015) warps the moving image by utilizing the final output of the network. TransMorph (Chen et al., 2022a) extended ViT-V-Net and proposed a hybrid Transformer-ConvNet framework that utilizes the Swin Transformer (Liu et al., 2021d) as the encoder and a ConvNet as the decoder to provide a dense displacement field. Like ViT-V-Net, it employed long skip connections to retain the flow of localization information that may enhance registration accuracy. The output of the network, which is a nonlinear warping function, gets applied to the moving image with the deformation field utilizing the spatial transformation function proposed in Jaderberg et al. (2015). An affine transformation Transformer network is incorporated to align the moving image with the fixed image before feeding it to the deformable registration network. This work also proposed two variants of TransMorph: diffeomorphic TransMorph (TransMorph-diff) to facilitate topology-preserving deformations and Bayesian TransMorph (TransMorph-Bayes) to promote a well-calibrated registration uncertainty estimate.
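The warping step shared by these frameworks can be written as a differentiable resampling of the moving volume at locations given by an identity grid plus the predicted displacement field, in the spirit of Jaderberg et al. (2015). The sketch below is a minimal version of that operator and is not the ViT-V-Net or TransMorph code.

```python
import torch
import torch.nn.functional as F

def warp_moving_image(moving: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """moving: (B, 1, D, H, W); flow: (B, 3, D, H, W) displacements in voxels."""
    b, _, d, h, w = moving.shape
    zz, yy, xx = torch.meshgrid(torch.arange(d), torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xx, yy, zz), dim=-1).float()           # identity grid, (D, H, W, 3)
    new_locs = grid[None] + flow.permute(0, 2, 3, 4, 1)        # add predicted displacements
    for i, size in enumerate((w, h, d)):                       # normalize to [-1, 1] (x, y, z order)
        new_locs[..., i] = 2.0 * new_locs[..., i] / (size - 1) - 1.0
    return F.grid_sample(moving, new_locs, align_corners=True)

warped = warp_moving_image(torch.rand(1, 1, 16, 16, 16), torch.zeros(1, 3, 16, 16, 16))
```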
Likewise, Zhang et al. (2021d) introduced the dual Transformer network (DTN) framework to perform diffeomorphic registration. It is composed of a CNN-based 3D U-Net encoder (Çiçek et al., 2016) for the embedding of separate and concatenated volumetric images and a dual Transformer to capture the cross-volume dependencies. One of the Transformers is responsible for modeling the inter- and intra-image dependencies, and the other one handles the modeling of the global dependencies by employing the self-attention mechanism. The concatenation of the features generated by these Transformers results in enhanced feature embeddings, which are utilized by the CNN-based decoder to provide a diffeomorphic deformation field. The evaluation of the framework was conducted on the brain MRI scans of the OASIS dataset (Marcus et al., 2007), which substantiates their improvements in diffeomorphic registration compared to the existing deep-learning-based approaches.

Furthermore, XMorpher (Shi et al., 2022b) put emphasis on the significance of backbone architectures in the feature extraction and matching of pair-wise images, and proposed a novel full Transformer network as the backbone, which consists of two parallel U-Net structures (Ronneberger et al., 2015) as the sub-networks, with their convolutions replaced by the introduced Cross Attention Transformer for feature extraction of moving and fixed images, and cross-attention-based fusion modules that utilize these features for generating the feature representation of moving–fixed correspondence and fine-grained multi-level semantic information that contributes to a fine registration.

The latest developments in self-supervised representation learning have primarily aimed at eliminating inductive biases in training processes. Nonetheless, these biases can still serve a purpose in scenarios with scarce data or when they offer further understanding of the underlying data distribution. To mitigate this problem, Miao et al. (2022) introduced spatial prior attention (SPAN), a framework that leverages the consistent spatial and semantic structure found in unlabeled image datasets to guide the attention mechanism of vision Transformers. SPAN integrates domain expertise to enhance the effectiveness and interpretability of self-supervised pretraining specifically for medical images using image registration and alignment methodologies. Specifically, they exploit image registration to align the attention maps with an inductive bias corresponding to a salient region, so only a single representative sample is required. When utilizing deformable image registration templates, SPAN achieves the highest mAUC score, followed by the predicted global templates and the triangular spatial heuristic, resulting in improved performance in downstream tasks.

8.2. Affine registration

To perform affine medical image registration with Transformers, Mok and Chung (2022) proposed C2FViT, a coarse-to-fine vision Transformer that performs affine registration, a geometric transformation that preserves points, straight lines, and planes while registering 3D medical images. Former studies have relied on CNN-based affine registration that focuses on local misalignment or global orientation (De Vos et al., 2019; Zhao et al., 2019), which limits the modeling of long-range dependencies and hinders high generalizability. C2FViT, as the first work that takes into account the non-local dependencies between medical images, leverages vision Transformers instead of CNNs for 3D registration. As depicted in Fig. 39, the model is split into L stages, each containing a convolutional patch embedding layer and N_i Transformer encoder blocks (i indicates the stage number), intending to learn the optimal affine registration matrix. In each stage, the fixed and moving images are downsampled and concatenated with each other; then the new representation gets passed to the convolutional patch embedding layer to produce image patch embeddings. Next, the Transformer receives the embeddings and produces the feature embedding of the input. Conducted experiments on OASIS (Marcus et al., 2007) and LPBA (Shattuck et al., 2008) demonstrated their superior performance compared to existing CNN-based affine registration techniques in terms of registration accuracy, robustness, and generalization ability.

8.3. Rigid registration

SVoRT (Xu et al., 2022) addressed the necessity of slice-to-volume registration before volumetric reconstruction for the task of volumetric fetal brain reconstruction, and employed a Transformer network trained on artificially sampled 2D MR slices that learns to predict slice


Fig. 38. Overview of ViT-V-Net. Multiple convolution blocks encode images into high-level features, which the Vit block splits into patches. These patches are then mapped to
D-dimensional patch embeddings that get integrated with learnable positional encodings to retain positional information. Next, these patches are passed into the Transformer
encoder block, followed by multiple skip connections to retain localization information, and decoded using a V-Net style decoder. Using the network’s final output, a spatial
Transformer warps the moving image.
Source: Figure taken from Chen et al. (2021a).

Table 16
An overview of the reviewed Transformer-based medical image registration approaches.
Method | Modality | Organ | Type | Datasets | Metrics | Year
Deformable
ViT-V-Net (Chen et al., 2021a) | MRI | Brain | 3D | Private dataset | Dice | 2021
TransMorph (Chen et al., 2022a) | MRI, CT | Brain, Chest–Abdomen–Pelvis region | 3D | (1) IXI (Group, 2023); (2) T1-weighted brain MRI scans from Johns Hopkins University; (3) Chest–Abdomen–Pelvis CT (Segars et al., 2013) | Dice, % of |J_Φ| ≤ 0, SSIM | 2022
DTN (Zhang et al., 2021d) | MRI | Brain | 3D | OASIS (Marcus et al., 2007) | Dice, |J_Φ| ≤ 0 | 2021
XMorpher (Shi et al., 2022b) | CT, MRI | Heart | 3D | (1) MM-WHS 2017 (Zhuang and Shen, 2016); (2) ASOCA (Gharleghi et al., 2020) | Dice, % of |J_Φ| ≤ 0 | 2022
SPAN (Miao et al., 2022) | CT, MRI | Lung | 3D | CheXpert (Irvin et al., 2019), JSRT (Shiraishi et al., 2000) | mAUC | 2022
Affine
C2FViT (Mok and Chung, 2022) | MRI | Brain | 3D | (1) OASIS (Marcus et al., 2007); (2) LPBA (Shattuck et al., 2008) | Dice, Hausdorff distance | 2022
Rigid
SVoRT (Xu et al., 2022) | MRI | Brain | 3D | FeTA (Payette et al., 2021) | PSNR, SSIM | 2022
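The |J_Φ| ≤ 0 metric listed in Table 16 counts the fraction of voxels where the Jacobian determinant of the deformation is non-positive, i.e. where the mapping folds. A simple NumPy sketch of this evaluation, assuming the field is stored as a dense voxel displacement, is shown below.

```python
import numpy as np

def percent_folding(flow: np.ndarray) -> float:
    """flow: displacement field of shape (3, D, H, W) in voxel units."""
    grads = [np.gradient(flow[i], axis=(0, 1, 2)) for i in range(3)]   # d(u_i)/d(axis_j)
    jac = np.stack([np.stack(g, axis=0) for g in grads], axis=0)       # (3, 3, D, H, W)
    jac += np.eye(3)[..., None, None, None]                            # J = I + grad(u)
    det = np.linalg.det(np.moveaxis(jac, (0, 1), (-2, -1)))            # (D, H, W)
    return 100.0 * float((det <= 0).mean())                            # % of folded voxels

print(percent_folding(np.zeros((3, 8, 8, 8))))   # identity field -> 0.0
```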

Fig. 39. The model has L stages with convolutional patch embedding layers and 𝑁 Transformer encoder blocks to learn the optimal affine registration matrix. In each stage,
fixed and moving images are downsampled and concatenated, then passed to the convolutional patch embedding layer to produce image patch embeddings. The Transformer then
produces the input feature embedding from the embeddings (Mok and Chung, 2022).

transformation based on the information gained from other slices. The model also estimates the underlying 3D volume from the input slices to promote higher accuracy in transformation prediction. The superiority of their proposed method in terms of registration accuracy and reconstruction, based on the evaluation of synthetic data and their experiments on real-world MRI scans, demonstrated the ability of the model in high-quality volumetric reconstruction.

8.4. Discussion and conclusion

According to the research discussed in this section, vision Transformers are prominent tools in image registration tasks due to their training capability on large-scale data, which is made feasible by parallel computing and self-attention mechanisms. Leveraging Transformers to encourage better global dependency identification improves


Table 17
A brief description of the reviewed Transformer-based medical image registration techniques.
Deformable
ViT-V-Net (Chen et al., 2021a)
Contributions: ∙ Contributed to the medical image registration domain as the first work to exploit ViTs to develop a volumetric (3D) registration. ∙ Integrated Transformers with CNNs to build a hybrid architecture for self-supervised brain MRI registration.
Highlights: ∙ Employed a hybrid architecture to incorporate long-range and local information in the registration process. ∙ Attempted to preserve the localization data with the help of long skip connections between the encoder and decoder stages.
TransMorph (Chen et al., 2022a)
Contributions: ∙ Proposed a Transformer-based unsupervised registration approach for affine and deformable objectives. ∙ Conducted experiments on two brain MRI datasets and in a phantom-to-CT registration task to demonstrate their superior performance compared to traditional approaches.
Highlights: ∙ They additionally proposed two distinguishable versions of their model: a diffeomorphic variant to facilitate topology-preserving deformations and a Bayesian variant to promote a well-calibrated registration uncertainty estimate. ∙ Studied the effect of receptive fields by comparing TransMorph with CNNs and addressed that while the receptive field of ConvNets only increases with the layer depth, their presented model takes the whole image into account at each layer due to the self-attention mechanism.
DTN (Zhang et al., 2021d)
Contributions: ∙ Proposed a dual Transformer architecture to capture semantic correspondence of anatomical structures. ∙ The suggested DTN demonstrated remarkable results in diffeomorphic registration and atlas-based segmentation of multi-class anatomical structures.
Highlights: ∙ The dual Transformer is capable of reducing the negative Jacobian determinant while preserving the atlas-based registration quality. ∙ The qualitative and quantitative analysis of their method on the OASIS dataset indicates that the diffeomorphic registration fields are effective.
XMorpher (Shi et al., 2022b)
Contributions: ∙ Devised a deformable registration system consisting of dual parallel feature extraction networks which facilitate the association of representative features between moving and fixed images. ∙ Proposed a cross-attention Transformer that establishes spatial correspondences through computation of bilateral information in the attention mechanism.
Highlights: ∙ Promotes visual superiority by presenting fine-grained visual results in terms of boundary smoothness, adjacent regions' resolution quality, and deformation grid polishness. ∙ Demonstrated the model's great diagnostic potential by conducting experiments with different training regimes.
SPAN (Miao et al., 2022)
Contributions: ∙ Developed a self-supervised Transformer-based registration method that leverages consistent spatial and semantic structure in unlabeled image datasets to guide the attention mechanism of vision Transformers.
Highlights: ∙ Integrates domain expertise and uses image registration and alignment methodologies to enhance the effectiveness and interpretability of self-supervised pretraining, particularly for medical images.
Affine
C2FViT (Mok and Chung, 2022)
Contributions: ∙ Presented a method in order to learn the global affine registration by taking advantage of the strong long-range dependency recognition and locality of the hybrid Transformer and the multi-resolution strategy.
Highlights: ∙ The suggested training framework can be extended to a number of parametric-based registration approaches by removing or scaling the geometrical transformation matrices.
Rigid
SVoRT (Xu et al., 2022)
Contributions: ∙ Devised an approach for the task of volumetric reconstruction of fetal brains based on Transformer architectures. ∙ Employed a Transformer network trained on artificially sampled 2D MR slices that estimates the underlying 3D volume from the input slices to more accurately predict transformation.
Highlights: ∙ Experimental procedures on the FeTA dataset (Payette et al., 2021) demonstrated the model's ability in high-quality volumetric reconstruction. ∙ The volumetric reconstruction associated with the transformations of the proposed method displays higher visual quality.

registration in terms of Dice scores and Jacobian matrix determinants compared to CNNs.

To mitigate the burden of quadratic complexity when processing images at high resolution and modeling local relationships, the reviewed studies usually employ CNNs to provide feature maps or dense displacement fields (Chen et al., 2021a, 2022a; Zhang et al., 2021d). C2FViT (Mok and Chung, 2022) disregarded convolutional networks and implemented convolutional patch embeddings to promote locality. However, in deformably registering medical content, XMorpher recently demonstrated the power of cross-attention in better capturing spatial relevancy without a CNN implementation (Shi et al., 2022b), and SVoRT purely utilized Transformers to perform rigid registration (Xu et al., 2022). Regarding the challenges caused by the data-hungry nature of deep neural networks and the difficulties in gathering annotated training data, studies in medical image registration have also focused on integrating self-supervised learning with Transformer models (Miao et al., 2022).

The notable experimental attempts on brain MRI scan data, such as OASIS (Marcus et al., 2007) and FeTA (Payette et al., 2021), show the importance of accurate automatic registration for neuroimaging data. One particular work (Shi et al., 2022b) proposed to evaluate their registration on images of cardiac region datasets, including MM-WHS 2017 (Zhuang and Shen, 2016) and ASOCA (Gharleghi et al., 2020). To further clarify the modality type used in the aforementioned proposed methods, all works conducted their evaluations on 3D or volumetric imaging modalities.

Based on this brief review of Transformer-based medical image registration research, we believe that other regions of interest (ROI), such as neurons, the retina, and the neck area, are worth exploring to facilitate diagnostic operations in different domains with more precise registration models.

We have also specified the architectural type, modality, organ, data size, training paradigm, datasets, metrics, and year for each medical registration technique reviewed in Table 16. Furthermore, Table 17 provides a list of the contributions and highlights of the proposed works. We also elaborate on the use cases of self-supervision in medical image registration.

9. Medical report generation

Medical report generation focuses on producing comprehensive captions and descriptions pivoting on medical images for diagnostic purposes. Designing automatic methods capable of performing this task can alleviate tedious and time-consuming work in producing medical reports and promote medical automation (Jing et al., 2017). Recently, advancements in deep learning have brought the attention of researchers to employing intelligent systems capable of understanding the visual content of an image and describing its comprehension in natural language format (Stefanini et al., 2022). Research efforts in improving this area can be employed in medical imaging by implementing systems capable of providing descriptions and captions (i.e., generating medical reports) concerning medical images. These captioning systems usually


Fig. 40. Taxonomy of Transformer-based medical report generation approaches based on the mechanism by which they generate clinical reports. We reference the papers in
ascending order corresponding to their prefix number: 1. Xiong et al. (2019), 2. Zhang et al. (2021a), 3. Chen et al. (2020c), 4. You et al. (2021b), 5. Nooralahzadeh et al. (2021),
6. Yan et al. (2021). 7. Chen et al. (2022b), 8. Li et al. (2019a), 9. Liu et al. (2021b), 10. Lovelace and Mortazavi (2020), 11. Liu et al. (2021c), 12. Alfarghaly et al. (2021),
13. Nguyen et al. (2021a), 14. Wang et al. (2021g), 15. Endo et al. (2021), 16. Bannur et al. (2023), 17. Wang et al. (2023b).

utilize encoder–decoder models that encode medical images and decode 9.1. Reinforcement learning-based systems
their understandings to provide diagnostic information in a natural
language format. The first work to implement a Transformer architecture for medical
Despite the success of deep learning, limitations including relia- report generation is RTMIC (Xiong et al., 2019). It used the reinforce-
bility on an immense amount of data, unbalanced data in radiology ment learning strategy in training to mitigate the problem of exposure
datasets (e.g., IU X-ray chest X-ray (Demner-Fushman et al., 2016)), bias prevailing in Seq2Seq models (Zhang et al., 2020a). In their ap-
and the black box nature of DL models entail challenges in medical proach, the original images are fed into a DenseNet (Huang et al., 2017)
report generation (Monshi et al., 2020). The success of Transformer as the region detector to extract bottom-up visual features. These fea-
models in many vision-and-language tasks has drawn the attention of tures are then passed into a visual encoder to generate visual represen-
researchers in the medical report generation domain to the employ- tations from the detected regions, which the captioning detector then
ment of this architecture. In this section, we discuss approaches that utilizes to generate captions for the specified regions. The proposed
utilize Transformers to promote effective capture of long-range context method was experimented on the IU X-ray dataset (Demner-Fushman
dependencies and better report generation. As illustrated in Fig. 40, the et al., 2016) and achieved state-of-the-art results. Integration of RL
following is our taxonomy of these systems according to the mechanism and Transformers was also applied in surgical instruction generation
by which they produce accurate and reliable clinical reports: since the joint understanding of surgical activity along with modeling
relations linking visual and textual data is a challenging task. Zhang
(I) Reinforcement Learning-based. The ultimate goal of a medical et al. (2021a) employed a Transformer-backboned encoder–decoder
report generation system is to provide clinically accurate and architecture and applied the self-critical reinforcement learning (Rennie
reliable reports. In reinforcement learning, the MRG system is et al., 2017) approach to optimize the CIDEr score (Vedantam et al.,
considered an agent with the objective of maximizing clini- 2015) as the reward. Their approach surpasses existing models in per-
cal accuracy based on the feedback given by the reward sig- formance on the DAISI dataset (Rojas-Muñoz et al., 2020) with caption
nal, which is directly calculated by the evaluation metric score evaluation metrics applied to the model. This work’s key difference
(e.g., CIDEr Vedantam et al., 2015). from others is that their model is proposed to generate instructions
(II) Graph-based. Radiology reports are typically composed of a long instead of descriptions.
finding section with multiple sentences that make report gen-
eration a challenging task. Therefore, the inclusion of prior 9.2. Graph-based systems
information is beneficial for facilitating the generation of long
narratives from visual data. Knowledge graphs, which are pow- In graph-based medical report generation, Li et al. (2019a) proposed
erful models that can capture domain-specific information in a KERP, a Graph Transformer implementation to generate robust graph
structured manner, can be used to exploit prior information for structures from visual features that are extracted by a DenseNet (Huang
medical report generation (Li et al., 2018, 2019a; Zhang et al., et al., 2017) backbone. This approach is composed of three modules:
2020b). Encode, Retrieve and Paraphrase. First, it constructs an abnormality
(III) Memory-based. Memory is a resource through which important graph by converting the visual features extracted from the medical im-
information is recorded. In designing a proper MRG system, ages via an encoder module. Next, a sequence of templates is retrieved
it is crucial to store vital and diagnostic information that can considering the detected abnormalities by utilizing a retrieve module.
benefit the generation process by incorporating prior knowl- Subsequently, the terms of the produced templates are paraphrased into
edge and experience. Hence, configuring a memory mechanism a report by employing the paraphrase module. The KERP’s workflow is
with Transformers as a report generation framework facilitates illustrated in Fig. 41.
longer and more coherent text generation by sharing information Additionally, Liu et al. (2021b) addressed the visual and textual
gained through the process (Chen et al., 2020c, 2022b). data biases and their consequences in generating radiology reports and
(IV) Other Systems. Systems that introduce different ideas from previ- proposed the PPKED framework to alleviate these challenges. Their
ous categories to improve clinical accuracy, such as curriculum work introduced three modules to perform report generation: (1) Prior
learning, contrastive learning, and alternate learning, belong to Knowledge Explorer (PrKE), which obtains relevant prior information
this group. for the input images; (2) Posterior Knowledge Explorer (PoKE), which

50
R. Azad et al. Medical Image Analysis 91 (2024) 103000

Fig. 41. Using an encoder module, KERP creates an abnormality graph from the extracted visual features. Then, a retrieval module retrieves a sequence of templates based on
detected abnormalities. Next, the paraphrase module paraphrases the templates’ terms into a report (Li et al., 2019a).

extracts the posterior information, including the abnormal regions of In MDT-WCL (Yan et al., 2021), the problem is approached with
the medical image; and (3) Multi-domain Knowledge Distiller (MKD), a weakly supervised contrastive loss, which lends more weight to the
which distills the obtained information from the previous modules reports that are semantically close to the target reports, and a memory-
to perform the final report generation. PPKED then formulated the driven Transformer is adopted as the backbone model to store key
problem by employing the presented modules in the following manner: information in its memory module. To aid the contrastive learning
PoKE first extracts the image features corresponding to the relevant during training, after clustering the reports into groups with the K-
disease topics by taking the visual features extracted by ResNet-152 (He Means algorithm, each report is assigned a label corresponding to its
et al., 2016) from the input image and abnormal topic word embed- cluster, and the semantically closed ones are considered to be in the
dings as the input. Next, the PrKE module filters the prior knowledge same cluster.
from the introduced prior working experience (a BERT encoder) and Although previous approaches have achieved promising results,
prior medical knowledge component that is relevant to the abnormal they lack the ability to generate mappings between images and texts
regions of the input image by utilizing the output of the PoKE module. to align visual–textual information and assist medical diagnosis. In
Next, the MKD module generates the final medical report by using this order to facilitate visual–textual alignment, the Cross-modal Memory
obtained information, which is implemented based on the decoder part Network (CMN) (Chen et al., 2022b) extended encoder–decoder meth-
of the Transformers equipped with Adaptive Distilling Attention. ods by utilizing a shared memory for better alignment of information
between images and texts. It uses a pre-trained ResNet (He et al.,
9.3. Memory-based systems 2016) as the visual extractor to output visual features, then passes
them to the cross-modal memory network that utilizes a matrix to store
Concerning the development of systems that rely on a memory information where each row represents the embedding of information
mechanism to generate medical reports, Chen et al. (2020c) presented linking images and texts. To access the stored information aligning the
a Memory-Driven Transformer (MDT), a model suitable for the gen- modalities, memory querying and responding are implemented in a
eration of long informative reports and one of the first works on the multi-threaded manner.
MIMIC-CXR dataset (Johnson et al., 2019). MDT employs a relational
memory to exploit characteristics prevailing in reports of similar im- 9.4. Other systems
ages, and then the memory is incorporated into the decoder section of
the Transformer by implementing a memory-driven conditional layer Other MRG systems focus on solving the problem with different
normalization (MCLN). ideas. Lovelace and Mortazavi (2020) proposed a generation framework
Likewise, Nooralahzadeh et al. (2021) introduced M2 TR. progres- composed of two stages. In the first stage, a Transformer model is
sive, a report generation approach that utilizes curriculum learning, adopted to map the input image features extracted by a DenseNet-
which is a strategy of training machine learning models by starting with 121 (Huang et al., 2017) to contextual annotations and learn report
easy samples and gradually increasing the samples’ difficulty (Wang generation. In the second stage, a procedure is introduced to differen-
et al., 2021f). Instead of directly generating full reports from medical tiably sample a clinical report from the Transformer decoder and obtain
images, their work formulates the problem into two steps: first, the observational clinical information from the sample. This differentiabil-
Meshed-Memory Transformer (M2 TR.) (Cornia et al., 2020), as a ity is further employed to fine-tune the model for improving clinical
powerful image captioning model, receives the visual features extracted coherence by applying their differentiable CheXpert to the sampled
by a DenseNet (Huang et al., 2017) backbone and generates high-level reports. Fueled by recent progress in explainable artificial intelligence
global context. Second, BART (Lewis et al., 2019), as a Transformer- and the introduction of algorithms that attempt to provide interpretable
based architecture, encodes these contexts with a bidirectional encoder prediction in DL-based systems, Likewise, in CDGPT2 (Alfarghaly et al.,
and decodes its output using a left-to-right decoder into coherent 2021), the medical image is passed into a Chexnet (Rajpurkar et al.,
reports. The overview of the process is depicted in Fig. 42. 2017) to provide localizations of 14 types of diseases from the images as
Additionally, You et al. (2021b) proposed AlignTransformer, a visual features. To implement better semantic features, the model was
framework composed of two modules: Align Hierarchical Attention fine-tuned as a multi-label classification problem to extract manual tags
(AHA) and Multi-Grained Transformer (MGT). In their approach, first from the IU-Xray dataset (Demner-Fushman et al., 2016) by replacing
visual features and disease tags are extracted from the medical image the final layer of the model with a layer containing 105 neurons to
by an image encoder, then they get aligned hierarchically to obtain produce 105 tags. The vector representation of the tags is then fed
multi-grained disease-grounded visual features in the AHA module. The into a pre-trained distilGPT2 (Radford et al., 2019) as the decoder to
obtained grounded features are capable of tackling the data bias prob- generate medical reports. Moreover, Wang et al. (2021g) presented
lem by promoting a better representation of abnormal sections. Next, a confidence-guided report generation (CGRG) approach to support
these grounded visual features are exploited by an adaptive exploiting reliability in report generation by quantifying visual and textual uncer-
attention (AEA) (Cornia et al., 2020) mechanism in the MGT module for tainties. It is comprised of an auto-encoder that reconstructs images,
the generation of the medical reports. They also justified their model’s a Transformer encoder that encodes the input visual feature extracted
efficiency through the manual evaluation of clinical radiologists. by ResNet-101 (He et al., 2016), and a Transformer decoder for report


Fig. 42. Workflow of the 𝑀 2 Tr. Progressive framework. The task is accomplished in two stages: First, the Meshed-Memory Transformer (M2 TR.) receives visual features extracted
by a DenseNet (Huang et al., 2017) backbone and generates high-level global context. Second, the BART (Lewis et al., 2019) architecture encodes contexts with a bidirectional
encoder and decodes them with a left-to-right decoder to produce coherent reports (Nooralahzadeh et al., 2021).

… prediction in DL-based systems. Likewise, in CDGPT2 (Alfarghaly et al., 2021), the medical image is passed into a CheXNet (Rajpurkar et al., 2017) to provide localizations of 14 types of diseases from the images as visual features. To implement better semantic features, the model was fine-tuned as a multi-label classification problem to extract manual tags from the IU-Xray dataset (Demner-Fushman et al., 2016) by replacing the final layer of the model with a layer containing 105 neurons to produce 105 tags. The vector representation of the tags is then fed into a pre-trained distilGPT2 (Radford et al., 2019) as the decoder to generate medical reports. Moreover, Wang et al. (2021g) presented a confidence-guided report generation (CGRG) approach to support reliability in report generation by quantifying visual and textual uncertainties. It comprises an auto-encoder that reconstructs images, a Transformer encoder that encodes the input visual feature extracted by ResNet-101 (He et al., 2016), and a Transformer decoder for report generation. Visual uncertainty is obtained by the auto-encoder, which acts as a guide for the visual feature extractor, and textual uncertainty is quantified based on the introduced Sentence Matched Adjusted Semantic Similarity (SMAS), which captures the similarity between the generated reports. These uncertainties are further utilized to aid the model optimization process.

The recent outbreak of COVID-19, one of the deadliest pandemics, has influenced the research community to alleviate the tedious and time-consuming work of producing medical reports. VL-BERT (Su et al., 2020), as an extension of BERT (Devlin et al., 2018), can be employed as an intelligent medical report generation system to expedite the diagnosis process. Medical-VLBERT (Liu et al., 2021c) introduced VL-BERT to the medical report generation domain. It defines the problem as a two-step procedure: First, it utilizes two distinct VL-BERTs as terminology encoders to produce terminology-related features (textual and visual), and then these features are fed into a shared language decoder to produce medical textbooks and reports. The proposed method takes into account predefined terminology word embeddings that represent medical domain knowledge. These embeddings are paired distinctly with two other embeddings as an input to the encoders: textbook embeddings, which are generated by employing a lookup table, and spatial feature embeddings (termed ''visual context'') that are extracted from medical images by implementing DenseNet-121 (Huang et al., 2017). The encoders then integrate this pairwise information separately to produce textual and visual terminological features. Subsequently, a shared language decoder is trained by utilizing an alternate approach to properly exchange the knowledge captured by the encoders.

Furthermore, in the work of Nguyen et al. (2021a), a classification, generation, and interpretation framework (CGI) is proposed to address clinical accuracy. Each term of the framework's name represents a different module to perform the task. The classification module learns how to discover diseases and generate their embeddings; it consists of an image and a text encoder to extract the global visual features from medical images and obtain text-summarized embeddings from clinical documents. The generation module is a Transformer model that takes the disease embeddings as input and generates medical reports from them. The interpretation module then takes these reports for evaluation and fine-tuning.

Deep learning methods in medical report analysis require extensive amounts of annotated training data, whose collection demands a significant amount of time and effort. To this end, self-supervised report generation approaches use a large amount of unlabeled text data without explicit supervision. For instance, CXR-RePaiR (Endo et al., 2021) utilizes self-supervised contrastive language-image pre-training. It encodes reports with the CheXbert transformer network (Smit et al., 2020), and leverages the X-ray-report pair representations to retrieve unstructured radiology reports with a contrastive scoring function trained on chest X-ray-report pairs to rank the similarity between a test dataset of X-rays and a large corpus of reports. Likewise, BioViL-T (Bannur et al., 2023) is a hybrid design that improves the alignment of X-ray report pairs by incorporating temporal information and prior knowledge to support complementary self-supervision and exploit the strengths of each modality. It compares prior images with given reports to exploit temporal correlations and enhance representation learning. BioViL-T uses a multi-image encoder to handle the absence of prior images and spatial misalignment, benefiting both image and text models. Image representations are extracted utilizing a hybrid CNN and transformer encoder, which are then matched with corresponding text representations obtained with CXR-BERT (Ramesh et al., 2022) by training with contrastive objectives.
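Both CXR-RePaiR and BioViL-T build on a CLIP-style contrastive objective that pulls matching image-report pairs together in a shared embedding space and pushes mismatched pairs apart. The snippet below is a schematic of the symmetric InfoNCE loss commonly used for such training, under simplified assumptions (pre-computed embeddings, a single temperature); it is not the exact training code of either method.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings.

    img_emb, txt_emb: (B, D) embeddings of images and their matching reports.
    The i-th image and i-th report are treated as the only positive pair.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(img_emb.size(0))           # positives lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> report direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # report -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```

At inference time, the same similarity matrix can be used to rank a report corpus against a query image, which is essentially how retrieval-based generation such as CXR-RePaiR selects reports.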
Recently, large language models (LLMs), which are AI systems that learn the patterns of human language from massive amounts of text data, have revolutionized human-like text generation and natural language understanding. Consequently, they hold great potential in clinical applications, such as understanding medical texts and report generation (Zhou et al., 2023b; Ma et al., 2023). One of the successful large language models based on the GPT-3.5 architecture is ChatGPT (OpenAI, 2023), with extensive knowledge and abilities to perform a wide range of tasks such as question answering, text summarization, and generating textual instructions. Wang et al. (2023b) presented a scheme called ChatCAD that integrates LLMs such as GPT-3 (Brown et al., 2020) and ChatGPT with Computer-Aided Diagnosis (CAD) models to perform medical report generation. CAD systems are computer programs that assist doctors in interpreting medical images, such as X-rays or MRI scans. To generate reports via ChatCAD, the medical image is passed to a segmentation network, an image classification network, and a report generation network. Since the outputs of the segmentation and classification networks are a mask and a vector, respectively, they are transformed into textual format to be understandable by LLMs. These textual representations are then concatenated and presented to the LLM, which then summarizes the results from all the CAD networks and exploits them to correct the errors in the generated report. Their experiments indicate that LLMs can be more effective than conventional CAD systems in enhancing medical report quality.
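The key step in ChatCAD is purely textual: the mask and probability vector produced by the CAD networks are verbalized and concatenated into a prompt that an LLM can reason over. The sketch below mimics that conversion; the prompt wording, the probability threshold, and the label set are assumptions for illustration, not the published implementation, and the actual chat-completion call is omitted.

```python
import numpy as np

def cad_outputs_to_prompt(class_probs, seg_mask, draft_report,
                          labels=("Atelectasis", "Effusion", "Pneumonia")):
    """Verbalize CAD network outputs so an LLM can combine and correct them."""
    findings = [f"{name}: probability {p:.2f}"
                for name, p in zip(labels, class_probs) if p > 0.5]
    lesion_area = int(seg_mask.sum())  # crude size descriptor from the binary mask
    lines = [
        "You are assisting with a chest X-ray report.",
        "Classifier findings: " + ("; ".join(findings) if findings else "no positive finding"),
        f"Segmentation network: lesion region of roughly {lesion_area} pixels.",
        f"Draft report from the captioning network: {draft_report}",
        "Summarize these results and correct any inconsistency in the draft report.",
    ]
    return "\n".join(lines)

prompt = cad_outputs_to_prompt(
    class_probs=np.array([0.12, 0.81, 0.66]),
    seg_mask=np.zeros((224, 224), dtype=np.uint8),
    draft_report="Mild cardiomegaly, no effusion.",
)
# The prompt would then be sent to an LLM (e.g., ChatGPT) through whatever
# chat-completion interface is available; that call is intentionally omitted here.
```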
9.5. Discussion and conclusion

This section offers a systematic review of the Transformer architectures configured for medical report generation. Compared to previous sections that reviewed ViT-based frameworks to tackle different medical tasks and problems, this section focuses mostly on using standard Transformers as the core of a medical report generation system. A common theme prevailing in these systems is to solve the problem with an encoder–decoder architecture supported by a CNN-based visual backbone. As mentioned in previous sections, the self-attention mechanism undermines the representation of low-level details. On the other hand, since medical reports consist of long and multiple sentences, Transformers are of great significance to model long-term dependencies, which assists clinically accurate report generation (Lin et al., 2022; Nguyen et al., 2021a). To exploit the power of both CNNs and Transformers simultaneously, state-of-the-art MRG systems usually embed CNNs along with Transformers in their frameworks (Xiong et al.,
Table 18
An overview of the reviewed transformer-based medical report generation approaches.
Method Modality Organ Type Visual backbone Datasets Metrics Year
Reinforcement learning
RTMIC (Xiong et al., X-ray Lung 2D DenseNet-121 (Huang IU Chest X-ray (Demner-Fushman BLEU (Papineni et al., 2019
2019) et al., 2017) et al., 2016) 2002)
CIDEr (Vedantam et al.,
2015)
SIG (Zhang et al., Ultrasound Multi-organ 3D ResNet-101 (He et al., DAISI (Rojas-Muñoz et al., 2020) BLUE (Papineni et al., 2021
2021a) Colonoscopy 2016) 2002)
Meteor (Banerjee and
Lavie, 2005)
CIDEr (Vedantam et al.,
2015)
ROUGE (Lin, 2004)
SPICE (Anderson et al.,
2016)
Graph
1
KERP (Li et al., 2019a) X-ray Lung 2D DenseNet-121 (Huang IU Chest X-ray BLEU (Papineni et al., 2019
et al., 2017) (Demner-Fushman et al., 2016) 2002)
2
CX-CHR (private dataset) CIDEr (Vedantam et al.,
2015)
ROUGE (Lin, 2004)
1
PPKED (Liu et al., X-ray Lung 2D ResNet-152 (He et al., IU Chest X-ray BLUE (Papineni et al., 2021
2021b) 2016) (Demner-Fushman et al., 2016) 2002)
2
MIMIC-CXR (Johnson et al., Meteor (Banerjee and
2019) Lavie, 2005)
CIDEr (Vedantam et al.,
2015)
ROUGE (Lin, 2004)
Memory
1
MDT (Chen et al., X-ray Lung 2D ResNet-121 (He et al., IU Chest X-ray BLEU (Papineni et al., 2020
2020c) 2016) (Demner-Fushman et al., 2016) 2002)
2
MIMIC-CXR (Johnson et al., Meteor (Banerjee and
2019) Lavie, 2005)
ROUGE (Lin, 2004)
1
AlignTransformer (You X-ray Lung 2D ResNet-50 (He et al., IU Chest X-ray BLEU (Papineni et al., 2021
et al., 2021b) 2016) (Demner-Fushman et al., 2016) 2002)
2
MIMIC-CXR (Johnson et al., Meteor (Banerjee and
2019) Lavie, 2005)
ROUGE (Lin, 2004)
M2 TR. progressive X-ray Lung 2D DenseNet-121 (Huang 1
IU Chest X-ray BLEU (Papineni et al., 2021
(Nooralahzadeh et al., et al., 2017) (Demner-Fushman et al., 2016) 2002)
2
2021) MIMIC-CXR (Johnson et al., Meteor (Banerjee and
2019) Lavie, 2005)
ROUGE (Lin, 2004)
1
MDT-WCL (Yan et al., X-ray Lung 2D ResNet (He et al., MIMIC-ABN (Ni et al., 2020) BLUE (Papineni et al., 2021
2
2021) 2016) MIMIC-CXR (Johnson et al., 2002)
2019) Meteor (Banerjee and
Lavie, 2005)
ROUGE (Lin, 2004)
1
CMN (Chen et al., X-ray Lung 2D ResNet-101 (He et al., IU Chest X-ray BLEU (Papineni et al., 2022
2022b) 2016) (Demner-Fushman et al., 2016) 2002)
2
MIMIC-CXR (Johnson et al., Meteor (Banerjee and
2019) Lavie, 2005)
ROUGE (Lin, 2004)
Other
CRG (Lovelace and X-ray Lung 2D DenseNet-121 (Huang MIMIC-CXR (Johnson et al., BLUE (Papineni et al., 2020
Mortazavi, 2020) et al., 2017) 2019) 2002)
Meteor (Banerjee and
Lavie, 2005)
CIDEr (Vedantam et al.,
2015)
ROUGE (Lin, 2004)
1
Medical-VLBERT (Liu CT Lung 2D DenseNet-121 (Huang Chinese Covid-19 CT (Liu et al., BLUE (Papineni et al., 2021
et al., 2021c) X-ray et al., 2017) 2020) 2002)
2
CX-CHR (private dataset) CIDEr (Vedantam et al.,
2015)
ROUGE (Lin, 2004)

CDGPT2 (Alfarghaly X-ray Lung 2D DenseNet-121 (Huang IU chest X-ray (Demner-Fushman BLUE (Papineni et al., 2021
et al., 2021) et al., 2017) et al., 2016) 2002)
Meteor (Banerjee and
Lavie, 2005)
CIDEr (Vedantam et al.,
2015)
ROUGE (Lin, 2004)
1
CGI (Nguyen et al., X-ray Lung 2D DenseNet-121 (Huang MIMIC-CXR (Johnson et al., BLUE (Papineni et al., 2021
2021a) et al., 2017) 2019) 2002)
2
IU chest X-ray Meteor (Banerjee and
(Demner-Fushman et al., 2016) Lavie, 2005)
ROUGE (Lin, 2004)
1
CGRG (Wang et al., X-ray Lung 2D ResNet-101 (He et al., IU Chest X-ray BLUE (Papineni et al., 2021
2021g) 2016) (Demner-Fushman et al., 2016) 2002)
2
COV-CTR (Li et al., 2022b) Meteor (Banerjee and
Lavie, 2005)
ROUGE (Lin, 2004)
CXR-RePaiR (Endo X-ray Lung 2D ✗ MIMIC-CXR (Johnson et al., BLUE (Papineni et al., 2021
et al., 2021) 2019) 2002)
F1
BioViL-T (Bannur et al., X-ray Lung 2D ResNet-50 (He et al., 1 MIMIC-CXR (Johnson et al., BLUE (Papineni et al., 2023
2023) 2016) 2019) 2002)
2
MS-CXR-T (Bannur et al., 2023) ROUGE (Lin, 2004)
CHEXBERT (Smit et al.,
2020)
ChatCAD (Wang et al., X-ray Lung 2D ✗ MIMIC-CXR (Johnson et al., Precision, Recall, F1 2023
2023b) 2019)

2019; Alfarghaly et al., 2021; Wang et al., 2021g). We have provided information in Table 18 on the reviewed report generation methods concerning their architectural type, modality, organ, pre-trained strategy, datasets, metrics, and year. Table 19 contains summarized information about the methodologies, including their contributions and highlights. In addition, it should be noted that several survey publications have been published in this field of medicine (Monshi et al., 2020; Messina et al., 2022; Pavlopoulos et al., 2022), and the most recent one provided a technical overview of Transformer-based clinical report generation (Shamshad et al., 2023). We approach our review differently by distinguishing the proposed methods based on the mechanism they used to support the prevailing concerns such as long and coherent text generation, reliability, and visual–textual biases.

The ultimate goal of these frameworks is to increase clinical accuracy to expedite the diagnosis process and reduce the workloads in radiology professions (Lovelace and Mortazavi, 2020; Nguyen et al., 2021a). Numerous works have attempted to facilitate diagnostic decision-making by aligning correlated sections of the medical image and textual report that provide valuable information for detecting abnormalities (You et al., 2021b; Chen et al., 2022b). Also, multiple studies emphasized the importance of universal knowledge and designed a system to incorporate prior information for detecting disease (Li et al., 2019a; Chen et al., 2020c). Some research effort was also put into better representation learning by contrasting normal and abnormal samples against each other in representation space by utilizing a contrastive loss as the objective (Yan et al., 2021). One recent work was inspired by curriculum learning to imitate the order of the human learning process (Nooralahzadeh et al., 2021).

As for the cumbersomeness of accessing annotated textual training data, research in MRG also focused on consolidating self-supervised learning and transformer architectures for report generation, which are discussed in Section 9.4 through the analysis of two specific works (Meng et al., 2020; Endo et al., 2021). ChatCAD (Wang et al., 2023b) was also introduced recently to integrate MRG with large language models like ChatGPT to improve diagnosis accuracy, but it achieves a lower BLEU score as it generates reports that are less similar to human-written text.

Table 20 includes performance values of BLEU, METEOR, ROUGE, and CIDEr for each method. Among the methods evaluated on the IU Chest X-ray, KERP and CGRG perform the best across most metrics, while RTMIC and CDGPT2 are the weakest. MDT has the highest METEOR score, CGRG achieves the highest ROUGE score, and PPKED demonstrates the highest CIDEr score. In contrast, for the MIMIC-CXR dataset, CRG and CGI demonstrate the best performance, while MDT shows the weakest overall performance.

Additionally, Table 20 compares the performance of the reviewed methods in terms of Natural Language Generation (NLG) metrics. The methods are presented in separate sections based on the dataset they were evaluated on, with one section dedicated to the IU Chest X-ray (Demner-Fushman et al., 2016), and the other to the MIMIC-CXR (Johnson et al., 2019).
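For reference, the word-overlap metrics listed in Table 20 can be computed with only a few lines of code. The function below is a simplified, single-pair ROUGE-L F-score based on the longest common subsequence, meant only to illustrate how a generated report is scored against a reference; published results rely on the official tokenizers and multi-reference implementations of BLEU, METEOR, ROUGE, and CIDEr.

```python
def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """ROUGE-L F-score between one generated report and one reference report."""
    c, r = candidate.lower().split(), reference.lower().split()
    # Longest common subsequence via dynamic programming.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c, 1):
        for j, rw in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cw == rw else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

score = rouge_l("no acute cardiopulmonary abnormality seen",
                "no acute cardiopulmonary abnormality is identified")
```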
Overall, we believe that MRG systems need further research and development to be robustly incorporated into practical settings.

10. Open challenges and future perspectives

So far, we discussed the application of Transformers (especially vision Transformers) and reviewed state-of-the-art models in medical image analysis. Even though their effectiveness is exemplified in previous sections by delicately presenting their ideas and analyzing the significant aspects that were addressed in their proposed methods, there is still room for improvement in many areas to devise a more practical and medically accurate system by leveraging Transformers. Consequently, we discuss the challenges and future directions hoping to help researchers gain insight into the limitations and develop more convenient automatic medical systems based on Transformers.

10.1. Explainability

Fueled by recent progress in XAI (explainable artificial intelligence) and the introduction of algorithms that attempt to provide interpretable prediction in DL-based systems, researchers are putting effort into incorporating XAI methods into constructing Transformer-based models to promote a more reliable and understandable system in different areas, including medical analysis (Alicioglu and Sun, 2022; Singh et al., 2020). Existing approaches usually highlight important regions of the medical image that contribute to the model prediction by employing attention maps (Hou et al., 2021; Mondal et al., 2021). Furthermore,
Table 19
A brief summary of the reviewed Transformer-based medical report generation methods.
Method Contributions Highlights
Reinforcement learning
RTMIC (Xiong ∙ Presented a novel Hierarchical Reinforced Transformer for ∙ Utilized reinforcement learning to ameliorate the exposure bias
et al., 2019) producing comprehensible, informative medical reports by training problem.
through reinforcement learning-based training. ∙ Enhanced clinical report coherence by employing Transformers to
∙ The initial attempt at incorporating Transformers to develop a capture long-range dependencies.
medical report generation system. ∙ The selected metric (CIDEr) as a reward signal is not designed for
the medical domain (Messina et al., 2022).
SIG (Zhang ∙ Generated surgical instructions from multiple clinical domains by ∙ The proposed method is able to produce multimodal
et al., 2021a) utilizing a Transformer-based Encoder–decoder architecture. dependencies, form pixel-wise patterns, and develop textual
associations for the masked self-attention decoder.
∙ Utilizing self-critical reinforcement learning to perform
optimization increased the performance of surgical instruction
generation.
∙ The selected metric (CIDEr) as a reward signal is not designed for
the medical domain (Messina et al., 2022).
Graph
KERP (Li et al., ∙ Developed an MRG system using a hybrid retrieval-generation ∙ Aligning the generated reports with abnormality attention maps
2019a) technique that unifies standard retrieval-based and recent visual by providing location reference facilitates medical diagnosis.
text generation methods. ∙ Since KERP is designed based on abnormality detection, it may
∙ Introduced Graph Transformer (GTR) as the first research to disregard other valuable information (Nguyen et al., 2021a).
employ an attention mechanism to convert different data types
formulated as a graph.
PPKED (Liu ∙ Proposed a three-module system that mimics the working habits ∙ Provides abnormal descriptions and locations to facilitate medical
et al., 2021b) of radiologists by extracting abnormal regions, encoding prior diagnosis.
information, and distilling the useful knowledge to generate ∙ Capable of extracting relevant information from the explored
accurate reports. posterior and prior multi-domain knowledge.
∙ Some mistakes, such as duplicate reports and inaccurate
descriptions, are present in the generated reports. Liu et al. (2020a)
Memory
MDT (Chen ∙ Introduced Memory-Driven Transformer for radiology report ∙ To facilitate medical diagnosis, visual–textual attention mappings
et al., 2020c) generation were incorporated to capture correspondence with essential medical
∙ Developed a relational memory to retain essential knowledge terms.
gathered through the previous generations. ∙ Dataset imbalance with dominating normal findings hinders the
model’s generalizability.
AlignTrans- ∙ Introduced an MRG framework that mitigates the problem of data ∙ Conducted experiments on MIMIC-CXR and IU-Xray datasets and
former (You bias by hierarchically aligning visual abnormality regions and demonstrated the capability of the model in ameliorating the data
et al., 2021b) illness tags in an iterative fashion. bias problem
M2 TR. ∙ Developed a progressive text generation model for medical report ∙ The division of report generation into two steps enhanced the
progressive generation by incorporating high-level concepts into the process of performance in terms of language generation and clinical efficacy
(Nooralahzadeh generation. metrics.
et al., 2021) ∙ The progressive generation process increases the false positive rate
by including abnormality mentions in negation mode. Messina et al.
(2022).
MDT-WCL (Yan ∙ Introduced the contrastive learning technique into chest X-ray ∙ Optimization with contrastive loss facilitates generalizability in
et al., 2021) report generation by proposing a weakly supervised approach that comparison to contrastive retrieval-based methods.
contrasts report samples against each other to better identify
abnormal findings.
CMN (Chen ∙ Cross-modal memory networks were introduced to improve report ∙ Capable of properly aligning data from radiological images and
et al., 2022b) generation based on encoder–decoder architectures by incorporating texts to aid in the preparation of more precise reports in terms of
a shared memory to capture multi-modal alignment. clinical accuracy.
Other
CRG (Lovelace ∙ Formulated the problem in two steps: (1) a report generation ∙ Transformers’ ability to provide more coherent and fluent reports
and Mortazavi, phase incorporating a standard language generation objective to was demonstrated.
2020) train a Transformer model, and (2) a sampling phase that includes ∙ Due to the biased nature of the dataset caused by dominant
sampling a report from the model and extracting clinical normal findings, the algorithm tends to generate reports that lack
observations from it. essential descriptions of abnormal sections (Liu et al., 2022a).
Medical-VLBERT ∙ Proposed a framework as the first work that generates medical ∙ Alleviated the shortage of COVID-19 data by employing the
(Liu et al., reports for the COVID-19 CT scans transfer learning strategy.
2021c) ∙ Devised an alternate learning strategy to minimize the ∙ Capable of effective terminology prediction
inconsistencies between the visual and textual data. ∙ Overreliance on predetermined terminologies undermines
robustness and generalizability.

CDGPT2 ∙ Presented a conditioning mechanism to improve radiology report ∙ Conditioning mechanism tackled punctuations, vocabulary
(Alfarghaly generation in terms of word-overlap metrics and time complexity. collection, and reduced training duration.
et al., 2021) ∙ Utilized a pre-trained GPT2 conditioned on visual and weighted ∙ The architecture does not require modification to be trained on
semantic features to promote faster training, eliminate vocabulary distinct data sets.
selection, and handle punctuation. ∙ Incorporating semantic similarity in addition to word overlap
∙ The first study to employ semantic similarity metrics to metrics improved medical report evaluation.
4
quantitatively analyze medical report generation results. The model’s generalization ability and robustness against
over-fitting are both hindered when the size of the dataset is small.
CGI (Nguyen ∙ Provides cohesive and precise X-ray reports in a fully ∙ Flexibility in processing additional input data, such as clinical
et al., 2021a) differentiable manner by dividing the report generation system into documents and extra scans, which also contributes to performance
a classifier, generator, and interpreter. improvement.
∙ Their conducted experiments revealed that incorporating ∙ The model does not provide vital information, such as illness
additional scans besides clinical history can be beneficial in orientation and time-series correlations, which facilitates more
providing higher-quality X-ray reports. reliable reports.
CGRG (Wang ∙ Presented a Transformer-based method that estimates report ∙ Assessing visual and textual uncertainties leads to more reliable
et al., 2021g) uncertainty to develop a more reliable MRG system and facilitate reports in medical diagnosis.
diagnostic decision-making. ∙ Measuring uncertainties can properly provide correlated
∙ Introduced the Sentence Matched Adjusted Semantic Similarity confidence between various reports, which is beneficial to aiding
(SMAS) to capture vital and relevant features in radiology report radiologists in clinical report generation (Liu et al., 2022a).
generation more effectively.
CXR-RePaiR ∙ Employs contrastive language image pre-training (CLIP) to ∙ Accuracy is evaluated by comparing it with the ground truth
(Endo et al., generate chest X-ray reports. using a labeler, and performance scores are calculated accordingly.
2021) ∙ The utilization of representations pairs of X-ray reports aids the ∙ Their experiments indicate that despite generating precise
process of retrieving unstructured radiology reports written in diagnostic explanations, it may not consistently employ identical
natural language. wording as the initial report.
BioViL-T ∙ Provides a hybrid design that aligns X-ray report pairs using ∙ Prior report is more crucial for optimal performance compared to
(Bannur et al., temporal information and prior knowledge. prior image as it outlines the image content and enhances the
2023) ∙ Combines CNN and transformer encoders to extract image signal clarity.
representations and matches them with text representations using a ∙ A significant factor in effective language supervision during the
chest X-ray domain-specific transformer model. pre-training phase is the utilization of contrastive loss.
∙ Introduced MS-CXR-T, a benchmark dataset that facilitates the
evaluation of vision-language representations in capturing temporal
semantics.
ChatCAD (Wang ∙ Integrates MRG with large language models such as ChatGPT to ∙ The experiments show that model size and complexity have a
et al., 2023b) improve diagnosis accuracy significant impact on generation accuracy, as larger models
∙ The LLM summarizes and corrects errors in the generated reports generally result in improved F1-scores.
via the concatenation of textual representations from CAD networks. ∙ Due to the reports being less similar to human-written text, the
system’s BLEU score is lower compared to previous methods.

Table 20
Comparison of Transformer-based medical report generation systems in terms of NLG performance metrics. The methods are ordered by BL-4 score, which captures more precise
phrase and sentence structure.
IU chest X-ray (Demner-Fushman et al., 2016)

Method BL-1 BL-2 BL-3 BL-4 METEOR ROUGE CIDEr

RTMIC (Xiong et al., 2019) 0.350 0.234 0.143 0.096 – – 0.323


CDGPT2 Alfarghaly et al. (2021) 0.387 0.245 0.166 0.111 0.164 0.289 0.257
KERP (Li et al., 2019a) 0.482 0.325 0.226 0.162 – 0.339 0.280
MDT (Chen et al., 2020c) 0.470 0.304 0.219 0.165 0.187 0.371 –
PPKED Liu et al. (2021b) 0.483 0.315 0.224 0.168 0.190 0.376 0.351
CMN Chen et al. (2022b) 0.475 0.309 0.222 0.170 0.191 0.375 –
AlignTransformer (You et al., 2021b) 0.484 0.313 0.225 0.173 0.204 0.379 –
M2 TR. progressive (Nooralahzadeh et al., 2021) 0.486 0.317 0.232 0.173 0.192 0.390 –
CGRG (Wang et al., 2021g) 0.497 0.357 0.279 0.225 0.217 0.408 –

MIMIC-CXR (Johnson et al., 2019)

Method BL-1 BL-2 BL-3 BL-4 METEOR ROUGE CIDEr

MDT (Chen et al., 2020c) 0.353 0.218 0.145 0.103 0.142 0.277 –
PPKED Liu et al. (2021b) 0.360 0.224 0.149 0.106 0.149 0.284 0.237
CMN (Chen et al., 2022b) 0.353 0.218 0.148 0.106 0.142 0.278 –
M2 TR. progressive Nooralahzadeh et al. (2021) 0.378 0.232 0.154 0.107 0.145 0.272 –
AlignTransformer (You et al., 2021b) 0.378 0.235 0.156 0.112 0.158 0.283 –
CRG Lovelace and Mortazavi (2020) 0.415 0.272 0.193 0.146 0.159 0.318 –
CGI (Nguyen et al., 2021a) 0.495 0.360 0.278 0.224 0.222 0.390 –

Vision Transformers (ViTs) have the ability to provide attention maps that indicate the relevant correlations between the regions of the input and the prediction. However, the challenge of numerical instabilities in using propagation-based XAI methods such as LRP (Binder et al., 2016) and the vagueness of the attention maps, which leads to inaccurate token associations (Chefer et al., 2021; Kim et al., 2022), makes interpretable ViTs an open research opportunity in computer vision, especially in medical image analysis. We believe that including
interpretable vision Transformers, such as ViT-NeT (Kim et al., 2022), in various medical applications can promote user-friendly predictions and facilitate decision-making in the diagnosis of medical conditions, and is a promising direction in medical research problems.
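A widely used way to obtain such attention-based explanations from a ViT is attention rollout, which multiplies the head-averaged attention matrices of successive layers, with the identity added to account for residual connections. The sketch below assumes the per-layer attention maps have already been collected from the model; it is meant only to illustrate the computation, not to reproduce any specific interpretability library.

```python
import torch

def attention_rollout(attentions):
    """Aggregate per-layer ViT attentions into a single relevance map.

    attentions: list of tensors of shape (B, heads, N, N) collected from each layer,
                where token 0 is the [CLS] token and the rest are image patches.
    Returns: (B, N-1) relevance of each patch for the [CLS] prediction.
    """
    result = None
    for attn in attentions:
        a = attn.mean(dim=1)                          # average over heads -> (B, N, N)
        a = a + torch.eye(a.size(-1))                 # account for residual connections
        a = a / a.sum(dim=-1, keepdim=True)           # re-normalize rows
        result = a if result is None else a @ result  # compose layer after layer
    return result[:, 0, 1:]                           # CLS-to-patch relevance

maps = [torch.rand(1, 8, 197, 197).softmax(-1) for _ in range(12)]  # e.g. a ViT-B/16 setup
relevance = attention_rollout(maps).reshape(1, 14, 14)              # patch-grid heat map
```

The resulting patch-grid map can be upsampled and overlaid on the input image to highlight the regions that drove the prediction.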
10.2. Richer feature representation

An effective and suitable representation space is substantially influential in building medical analysis systems. Transformers have demonstrated their efficiency in obtaining global information and capturing long-term dependencies in many areas, such as Natural Language Processing (NLP), Computer Vision, and Speech Recognition (Lin et al., 2021), and CNNs have proven to be effective in extracting local context from visual data (Li et al., 2021d). However, this locality enables such networks to capture rich local texture representations while limiting their ability to model global dependencies. As a result, many approaches stack Transformers along with CNNs to leverage both local and global information simultaneously in clinical applications (e.g., medical report generation) (You et al., 2021b; Lovelace and Mortazavi, 2020; Tanzi et al., 2022). Recent studies stated that the single-scale representation of ViTs hinders improvement in dense prediction tasks, so multi-scale feature representations have been adopted, which achieve better performance in computer vision tasks, including image classification, object detection, and image segmentation (Gu et al., 2022; Lee et al., 2022). Generalizing this idea to medical applications of ViTs to facilitate devising a clinically suitable system can be considered as future work.
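The CNN-Transformer stacking pattern mentioned above can be summarized in a few lines: convolutional layers supply local texture features, which are then flattened into tokens and refined by a Transformer encoder that adds global context. The block below is a minimal sketch of this idea; all layer sizes are arbitrary and purely illustrative.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Local CNN features refined by global self-attention (illustrative sizes)."""
    def __init__(self, in_ch=1, dim=256, heads=4, depth=2):
        super().__init__()
        self.local = nn.Sequential(                       # CNN stem: local texture
            nn.Conv2d(in_ch, dim, 3, stride=4, padding=1),
            nn.BatchNorm2d(dim), nn.GELU(),
        )
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.globl = nn.TransformerEncoder(enc_layer, num_layers=depth)  # global context

    def forward(self, x):
        f = self.local(x)                                  # (B, dim, H/4, W/4)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)              # (B, H*W/16, dim) tokens
        tokens = self.globl(tokens)                        # long-range interactions
        return tokens.transpose(1, 2).reshape(b, c, h, w)  # back to a feature map

out = HybridBlock()(torch.randn(2, 1, 64, 64))             # -> (2, 256, 16, 16)
```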
10.3. Video-based analysis

There has been an increasing interest in the vision community in extending ViT architectures to video recognition tasks. Recently, a handful of papers have integrated standard Transformers with their models in AI-assisted dynamic clinical tasks (Reynaud et al., 2021; Long et al., 2021; Czempiel et al., 2021; Zhao et al., 2022). However, the scarcity of the proposed approaches leaves video-based medical analysis at an early stage and open for future investigations. Another potential research direction is to explore the power of video vision Transformer variants, such as Video Swin Transformer (Liu et al., 2022c), in clinical video understanding and to facilitate automatic robotic surgery.
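Video vision Transformers such as ViViT and Video Swin first turn a clip into spatio-temporal tokens, for example by embedding non-overlapping ''tubelets'' with a 3D convolution; the rest of the architecture is then ordinary attention over those tokens. A minimal, illustrative tubelet-embedding layer is sketched below; the tubelet size and dimensions are assumptions, not the configuration of any specific model.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Turn a video clip into spatio-temporal tokens via a 3D convolution."""
    def __init__(self, in_ch=3, dim=192, tubelet=(2, 16, 16)):
        super().__init__()
        # Non-overlapping 3D patches: kernel size equals stride.
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, clip):
        # clip: (B, C, T, H, W), e.g. a short ultrasound or endoscopy sequence.
        x = self.proj(clip)                      # (B, dim, T/2, H/16, W/16)
        return x.flatten(2).transpose(1, 2)      # (B, num_tokens, dim) for a Transformer

tokens = TubeletEmbedding()(torch.randn(1, 3, 16, 224, 224))   # -> (1, 8*14*14, 192)
```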
10.4. High computational complexity

The robustness of Transformer models in layouts that implement large numbers of parameters is one of their strengths. While this is a beneficial trait that makes it possible to train models of enormous scale, it leads to the requirement of large resources for training and inferencing (Khan et al., 2022). Particularly disadvantageous to medical image analysis is that expanding the use of ViTs for pretraining in new tasks and datasets comes with substantial expenses and burdens. Additionally, gathering medical samples can be difficult and the dataset scale is often limited. For instance, according to empirical studies in Dosovitskiy et al. (2020), pretraining a ViT-L/16 model on the large-scale ImageNet dataset takes approximately 30 days employing a standard cloud TPUv3 with 8 cores. As a result, a notable number of papers utilized the pre-trained weights of ViT models to exploit the transfer learning strategy to alleviate the training load (Shome et al., 2021; Chen et al., 2021b; Gheflati and Rivaz, 2022). However, in some cases, such as dealing with volumetric medical images, where transfer learning does not demonstrate any improvements (Hatamizadeh et al., 2022b; Chen et al., 2021a), the pretraining process is necessary to capture domain-specific features for generalization and better performance.

In addition, the standard self-attention mechanism suffers from a significant drawback: its computational complexity grows quadratically with the number of tokens (n), making it impractical for tasks where input images can be millions of pixels in size or when dealing with volumetric data. To address this issue, several approaches have been proposed to reduce the computational burden of self-attention. They either reduce the number of tokens by windowing (Ding et al., 2022; Liu et al., 2021d), shift the calculation to the channel dimension (Ali et al., 2021; Maaz et al., 2022), or change the order of multiplying query, key, and value (Shen et al., 2021b). In the context of 3D biomedical image segmentation, Zhang and Bagci (2022) introduced a dynamic linear Transformer algorithm that achieves linear complexity by focusing computations only on the region of interest (ROI), which significantly reduces the overall computational requirements. Shen et al. (2021b) proposed an efficient attention mechanism that avoids the pairwise similarity computations of dot-product attention. Instead, it normalizes the keys and queries first, performs multiplication between the keys and values, and then multiplies the resulting global context vectors with the queries. This approach reduces the computational complexity of self-attention from O(dn²) to O(d²n), where d is the embedding dimension. XCiT (Ali et al., 2021), another approach, addresses the complexity challenge by introducing cross-covariance attention, which has a complexity of O(d²n/h), where h is the number of attention heads. EdgeNeXt, an architecture proposed by Maaz et al. (2022) for edge devices, optimizes the number of Multiplication-Addition (MAdd) operations required. It employs a variation of the attention mechanism used in XCiT called split depth-wise transpose attention (SDTA). Huang et al. (2022b) introduce efficient self-attention (ESA), specifically for spatial reduction in 3D data. ESA reduces the number of tokens by a spatial reduction ratio R while expanding the channel dimension by the same ratio. As a result, the complexity of ESA is reduced to O(n²/R).
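The reordering trick of Shen et al. (2021b) can be written in a few lines: queries and keys are softmax-normalized separately, the keys are first contracted with the values into a small global context matrix, and the queries are then multiplied with that matrix, so the cost scales linearly in the number of tokens. The snippet below is a schematic, single-head version under those simplifying assumptions rather than the authors' released code.

```python
import torch

def efficient_attention(q, k, v):
    """Linear-complexity attention in the spirit of Shen et al. (2021b).

    q, k, v: (B, N, d). Instead of the (N x N) similarity matrix of dot-product
    attention, a (d x d) global context is built first, giving O(d^2 * n) cost.
    """
    q = q.softmax(dim=-1)                 # normalize queries over the channel dimension
    k = k.softmax(dim=1)                  # normalize keys over the token dimension
    context = k.transpose(1, 2) @ v       # (B, d, d) global context vectors
    return q @ context                    # (B, N, d), no N x N matrix is ever formed

x = torch.randn(1, 16384, 64)             # e.g. a 128 x 128 feature map as tokens
out = efficient_attention(x, x, x)
```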
Despite progress in making self-attention more efficient for computer vision tasks, further work is needed to develop efficient Transformer models tailored to the medical domain, particularly for handling challenging 3D data or large-scale images (e.g., WSI). Moreover, the ultimate goal is to deploy these models on low-cost edge devices or in resource-constrained environments. Therefore, the focus should be on designing Transformer systems that strike a balance between computational complexity and clinical accuracy while ensuring robustness. This area of research holds great promise and should be pursued further.

10.5. Transformer-based registration

As reviewed in Section 8, the idea of employing Transformers to support efficient medical image registration has become popular in recent years. The self-attention mechanism assists the learning of long-range visual correlations, since its unlimited receptive field promotes a more accurate understanding of the spatial relationship between moving and fixed images (Chen et al., 2021a, 2022a). However, registration systems composed of Transformer architectures are still in their infancy and require more research effort to be put into them.
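Unsupervised registration networks of this kind typically couple the Transformer encoder with a differentiable spatial-transformer warping step: the network predicts a dense displacement field, and the moving image is resampled accordingly so that a similarity loss against the fixed image can be back-propagated. The snippet below shows only that 2D resampling step under illustrative shapes; the Transformer that predicts the flow is omitted.

```python
import torch
import torch.nn.functional as F

def warp(moving, flow):
    """Warp a moving image with a dense displacement field (2D).

    moving: (B, 1, H, W) image;  flow: (B, 2, H, W) displacements in pixels (dx, dy).
    """
    b, _, h, w = moving.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = (xs + flow[:, 0]) / (w - 1) * 2 - 1           # normalize to [-1, 1]
    grid_y = (ys + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)            # (B, H, W, 2) sampling grid
    return F.grid_sample(moving, grid, align_corners=True)

moved = warp(torch.randn(1, 1, 160, 192), torch.zeros(1, 2, 160, 192))
# With zero flow the output equals the input; training optimizes a similarity loss
# between `moved` and the fixed image plus a smoothness penalty on `flow`.
```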
10.6. Data-driven predictions

With supervised learning as a popular fashion in building intelligent systems, the model learns features based on the provided annotations that are suitable to accomplish a specific task, which hinders generalizability. In other words, supervised learning modifies the bias–variance trade-off in favor of the strong inductive biases that lead to making assumptions as a means to aid the model in learning a particular task quicker and with higher sample efficiency. However, these hard assumptions sacrifice adaptability to other settings and unseen datasets, and the model learns to accomplish its task without having an innate understanding of the data. To tackle this issue, unsupervised regimes enable the algorithms to act as general descriptors and capture features that will assist them in performing efficiently in a wide range of tasks. Similarly, in medical image analysis, adopting Transformer networks with unsupervised learning algorithms promotes robustness and generalizability to other datasets and tasks.
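One concrete way to pair Transformers with such label-free objectives is masked patch modeling, in the spirit of BEiT/MAE-style pre-training, where a random subset of image patches is hidden and the network is trained to reconstruct them. The sketch below shows only the patchification, masking, and reconstruction loss; the encoder is a stand-in and every size is an assumption for illustration.

```python
import torch
import torch.nn as nn

def masked_patch_loss(images, encoder, patch=16, mask_ratio=0.5):
    """Hide random patches and penalize their reconstruction (schematic)."""
    b, c, h, w = images.shape
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)       # (B,C,h/p,w/p,p,p)
    patches = patches.reshape(b, c, -1, patch * patch).permute(0, 2, 1, 3) # (B,N,C,p*p)
    patches = patches.flatten(2)                                           # (B, N, C*p*p)
    mask = torch.rand(b, patches.size(1)) < mask_ratio                     # True = hidden
    visible = patches.masked_fill(mask.unsqueeze(-1), 0.0)                 # zero out targets
    pred = encoder(visible)                                                # (B, N, C*p*p)
    return ((pred - patches) ** 2)[mask].mean()                            # loss on masked only

toy_encoder = nn.Sequential(nn.Linear(256, 512), nn.GELU(), nn.Linear(512, 256))
loss = masked_patch_loss(torch.randn(4, 1, 128, 128), toy_encoder)
```

After such pre-training, the encoder can be fine-tuned on small annotated medical datasets, which is precisely the regime where labeled data is scarce.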

10.7. Medical software ecosystems

A future direction for advancing automatic medical analysis is to provide an open-source environment that contains libraries suitable for solving multiple medical tasks and challenges with Transformer architectures. Developers can further contribute to the ecosystem by updating and adding additional tasks, bringing novelty, and proposing ideas to enhance performance and accuracy (Kazerouni et al., 2023). Companies and organizations can support the system by preparing the necessary computational resources and hardware requirements. Examples of software prototypes in this direction are nnU-Net (Isensee et al., 2021a), Ivadomed (Gros et al., 2020), and preliminary works such as Azad et al. (2022a), which provides an end-to-end pipeline for implementing deep models on medical data.

11. Discussion and conclusion

In this paper, we presented a comprehensive encyclopedic review of the applications of Transformers in medical imaging. First, we provided preliminary information regarding the Transformer structures and the idea behind the self-attention mechanism in the introduction and background sections. Starting from Section 3, we reviewed the literature on Transformer architecture in diverse medical imaging tasks, namely, classification, segmentation, detection, reconstruction, synthesis, registration, and clinical report generation. For each application, we provided a taxonomy and high-level abstraction of the core techniques employed in these models along with the SOTA approaches. We also provided comparison tables to highlight the pros and cons, network parameters, type of imaging modality they are considering, organ, and the metrics they are using. Finally, we outlined possible avenues for future research directions.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

Acknowledgments

This work was funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) under project number 191948804. We thank Johannes Stegmaier for his contribution to the proofreading of this document.

References

Aghdam, E.K., Azad, R., Zarvani, M., Merhof, D., 2022. Attention swin u-net: Cross-contextual attention mechanism for skin lesion segmentation. arXiv preprint arXiv:2210.16898.
Al-Dhabyani, W., Gomaa, M., Khaled, H., Fahmy, A., 2020. Dataset of breast ultrasound images. Data Brief 28, 104863.
Al-Shabi, M., Shak, K., Tan, M., 2022. ProCAN: Progressive growing channel attentive non-local network for lung nodule classification. Pattern Recognit. 122, 108309.
Alam, F., Rahman, S.U., 2019. Challenges and solutions in multimodal medical image subregion detection and registration. J. Med. Imaging Radiat. Sci. 50 (1), 24–30.
Alam, F., Rahman, S.U., Ullah, S., Gulati, K., 2018. Medical image registration in image guided surgery: Issues, challenges and research opportunities. Biocybern. Biomed. Eng. 38 (1), 71–89.
Albertina, B., Watson, M., Holback, C., Jarosz, R., Kirk, S., Lee, Y., Rieger-Christ, K., Lemmerman, J., 2016. The cancer genome atlas lung adenocarcinoma collection (TCGA-LUAD) (version 4) [data set]. http://dx.doi.org/10.7937/K9/TCIA.2016.JGNIHEP5.
Alfarghaly, O., Khaled, R., Elkorany, A., Helal, M., Fahmy, A., 2021. Automated radiology report generation using conditioned transformers. Inform. Med. Unlocked 24, 100557.
Ali, A., Touvron, H., Caron, M., Bojanowski, P., Douze, M., Joulin, A., Laptev, I., Neverova, N., Synnaeve, G., Verbeek, J., et al., 2021. Xcit: Cross-covariance image transformers. Adv. Neural Inf. Process. Syst. 34, 20014–20027.
Alicioglu, G., Sun, B., 2022. A survey of visual analytics for explainable artificial intelligence methods. Comput. Graph. 102, 502–520.
Aminimehr, A., Molaei, A., Cambria, E., 2023. Entri: Ensemble learning with tri-level representations for explainable scene recognition. arXiv preprint arXiv:2307.12442.
Anderson, P., Fernando, B., Johnson, M., Gould, S., 2016. Spice: Semantic propositional image caption evaluation. In: European Conference on Computer Vision. Springer, pp. 382–398.
Arevalo, J., González, F.A., Ramos-Pollán, R., Oliveira, J.L., Lopez, M.A.G., 2016. Representation learning for mammography mass lesion classification with convolutional neural networks. Comput. Methods Programs Biomed. 127, 248–257.
Armato, III, S.G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., Reeves, A.P., Zhao, B., Aberle, D.R., Henschke, C.I., Hoffman, E.A., et al., 2011. The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Phys. 38 (2), 915–931.
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C., 2021. Vivit: A video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6836–6846.
Asia Pacific Tele-Ophthalmology Society, 2019. Aptos 2019 blindness detection: Detect diabetic retinopathy to stop blindness before it's too late. https://www.kaggle.com/c/aptos2019-blindness-detection/.
Azad, R., Aghdam, E.K., Rauland, A., Jia, Y., Avval, A.H., Bozorgpour, A., Karimijafarbigloo, S., Cohen, J.P., Adeli, E., Merhof, D., 2022a. Medical image segmentation review: The success of U-Net. arXiv preprint arXiv:2211.14830.
Azad, R., Al-Antary, M.T., Heidari, M., Merhof, D., 2022b. Transnorm: Transformer provides a strong spatial normalization mechanism for a deep segmentation model. IEEE Access 10, 108205–108215.
Azad, R., Arimond, R., Aghdam, E.K., Kazerouni, A., Merhof, D., 2022c. DAE-Former: Dual attention-guided efficient transformer for medical image segmentation. arXiv preprint arXiv:2212.13504.
Azad, R., Asadi-Aghbolaghi, M., Fathy, M., Escalera, S., 2019. Bi-directional ConvLSTM U-Net with densley connected convolutions. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). pp. 406–415.
Azad, R., Heidari, M., Shariatnia, M., Aghdam, E.K., Karimijafarbigloo, S., Adeli, E., Merhof, D., 2022d. TransDeepLab: Convolution-free transformer-based DeepLab v3+ for medical image segmentation. arXiv preprint arXiv:2208.00713.
Azad, R., Kazerouni, A., Azad, B., Aghdam, E.K., Velichko, Y., Bagci, U., Merhof, D., 2023a. Laplacian-former: Overcoming the limitations of vision transformers in local texture detection. In: MICCAI 2023.
Azad, R., Khosravi, N., Merhof, D., 2022e. SMU-Net: Style matching U-Net for brain tumor segmentation with missing modalities. In: International Conference on Medical Imaging with Deep Learning. PMLR, pp. 48–62.
Azad, R., Niggemeier, L., Huttemann, M., Kazerouni, A., Aghdam, E.K., Velichko, Y., Bagci, U., Merhof, D., 2023b. Beyond self-attention: Deformable large kernel attention for medical image segmentation. arXiv:2309.00121.
Bae, W., Lee, S., Lee, Y., Park, B., Chung, M., Jung, K.-H., 2019. Resource optimized neural architecture search for 3D medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 228–236.
Baid, U., Ghodasara, S., Mohan, S., Bilello, M., Calabrese, E., Colak, E., Farahani, K., Kalpathy-Cramer, J., Kitamura, F.C., Pati, S., et al., 2021. The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314.
Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J.S., Freymann, J.B., Farahani, K., Davatzikos, C., 2017. Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 4 (1), 1–13.
Bakas, S., Reyes, M., Jakab, A., Bauer, S., Rempfler, M., Crimi, A., Shinohara, R.T., Berger, C., Ha, S.M., Rozycki, M., et al., 2018. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv preprint arXiv:1811.02629.
Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V., 2018. An unsupervised learning model for deformable medical image registration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9252–9260.
Banerjee, S., Lavie, A., 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the Acl Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/Or Summarization. pp. 65–72.
Bannur, S., Hyland, S., Liu, Q., Perez-Garcia, F., Ilse, M., Castro, D.C., Boecking, B., Sharma, H., Bouzid, K., Thieme, A., et al., 2023. Learning to exploit temporal structure for biomedical vision-language processing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15016–15027.
Bao, H., Dong, L., Piao, S., Wei, F., 2022. BEiT: BERT pre-training of image transformers. In: International Conference on Learning Representations. URL: https://openreview.net/forum?id=p-BhZSz59o4.
Bejnordi, B.E., Veta, M., Van Diest, P.J., Van Ginneken, B., Karssemeijer, N., Litjens, G., Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2017a. Deeplab:
Van Der Laak, J.A., Hermsen, M., Manson, Q.F., Balkenhol, M., et al., 2017. Semantic image segmentation with deep convolutional nets, atrous convolution,
Diagnostic assessment of deep learning algorithms for detection of lymph node and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), 834–848.
metastases in women with breast cancer. JAMA 318 (22), 2199–2210. Chen, L.-C., Papandreou, G., Schroff, F., Adam, H., 2017b. Rethinking atrous
Bello, I., Zoph, B., Vaswani, A., Shlens, J., Le, Q.V., 2019. Attention augmented convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
convolutional networks. In: Proceedings of the IEEE/CVF International Conference Chen, Z., Shen, Y., Song, Y., Wan, X., 2022b. Cross-modal memory networks for
on Computer Vision. pp. 3286–3295. radiology report generation. arXiv preprint arXiv:2204.13258.
Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S., 2021. Adversarial robustness Chen, Z., Song, Y., Chang, T.-H., Wan, X., 2020c. Generating radiology reports via
comparison of vision transformer and mlp-mixer to cnns. arXiv preprint arXiv: memory-driven transformer. In: Proceedings of the 2020 Conference on Empirical
2110.02797. Methods in Natural Language Processing (EMNLP). pp. 1439–1449.
Bernal, J., Sánchez, F.J., Fernández-Esparrach, G., Gil, D., Rodríguez, C., Vilariño, F., Chen, Q., Wang, Y., Yang, T., Zhang, X., Cheng, J., Sun, J., 2021c. You only look
2015. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation one-level feature. In: Proceedings of the IEEE/CVF Conference on Computer Vision
vs. saliency maps from physicians. Comput. Med. Imaging Graph. 43, 99–111. and Pattern Recognition. pp. 13039–13048.
Bernal, J., Sánchez, J., Vilarino, F., 2012. Towards automatic polyp detection with a Chen, X., Wang, X., Zhou, J., Qiao, Y., Dong, C., 2023. Activating more pixels in
polyp appearance model. Pattern Recognit. 45 (9), 3166–3182. image super-resolution transformer. In: Proceedings of the IEEE/CVF Conference
Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.-A., Cetin, I., on Computer Vision and Pattern Recognition. pp. 22367–22377.
Lekadir, K., Camara, O., Ballester, M.A.G., et al., 2018. Deep learning techniques for Chen, X., Xie, S., He, K., 2021d. An empirical study of training self-supervised
automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem vision transformers. In: Proceedings of the IEEE/CVF International Conference on
solved? IEEE Trans. Med. Imaging 37 (11), 2514–2525. Computer Vision. pp. 9640–9649.
Bhattacharya, M., Jain, S., Prasanna, P., 2022. RadioTransformer: a cascaded global- Chen, L., Yu, Q.T., 2021. Transformers make strong encoders for medical image
focal transformer for visual attention–guided disease classification. In: European segmentation. arXiv 2021. arXiv preprint arXiv:2102.04306.
Conference on Computer Vision. Springer, pp. 679–698. Chen, X., Yuan, Y., Zeng, G., Wang, J., 2021e. Semi-supervised semantic segmentation
Bian, J., Siewerdsen, J.H., Han, X., Sidky, E.Y., Prince, J.L., Pelizzari, C.A., Pan, X., with cross pseudo supervision. In: Proceedings of the IEEE/CVF Conference on
2010. Evaluation of sparse-view reconstruction from flat-panel-detector cone-beam Computer Vision and Pattern Recognition. pp. 2613–2622.
CT. Phys. Med. Biol. 55 (22), 6575. Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H., 2018. Encoder-decoder with
Bianchi, F.M., Grattarola, D., Alippi, C., 2020. Spectral clustering with graph neural atrous separable convolution for semantic image segmentation. In: Proceedings of
networks for graph pooling. In: International Conference on Machine Learning. the European Conference on Computer Vision (ECCV). pp. 801–818.
PMLR, pp. 874–883. Choromanski, K.M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T.,
Bien, N., Rajpurkar, P., Ball, R.L., Irvin, J., Park, A., Jones, E., Bereket, M., Patel, B.N., Hawkins, P., Davis, J.Q., Mohiuddin, A., Kaiser, L., Belanger, D.B., Colwell, L.J.,
Yeom, K.W., Shpanskaya, K., et al., 2018. Deep-learning-assisted diagnosis for knee Weller, A., 2021. Rethinking attention with performers. In: International Con-
magnetic resonance imaging: development and retrospective validation of MRNet. ference on Learning Representations. URL: https://openreview.net/forum?id=
PLoS Med. 15 (11), e1002699. Ua6zuk0WRH.
Binder, A., Montavon, G., Lapuschkin, S., Müller, K.-R., Samek, W., 2016. Layer-wise Chowdhury, M.E., Rahman, T., Khandakar, A., Mazhar, R., Kadir, M.A., Mahbub, Z.B.,
relevance propagation for neural networks with local renormalization layers. In: Islam, K.R., Khan, M.S., Iqbal, A., Al Emadi, N., et al., 2020. Can AI help in
International Conference on Artificial Neural Networks. Springer, pp. 63–71. screening viral and COVID-19 pneumonia? IEEE Access 8, 132665–132676.
Born, J., Brändle, G., Cossio, M., Disdier, M., Goulet, J., Roulin, J., Wiedemann, N., Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., Shen, C., 2021. Twins:
2020. POCOVID-Net: automatic detection of COVID-19 from a new lung ultrasound Revisiting the design of spatial attention in vision transformers. Adv. Neural Inf.
imaging dataset (POCUS). arXiv preprint arXiv:2004.12084. Process. Syst. 34, 9355–9366.
Brenner, D.J., Hall, E.J., 2007. Computed tomography—an increasing source of Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O., 2016. 3D U-Net:
radiation exposure. New Engl. J. Med. 357 (22), 2277–2284. learning dense volumetric segmentation from sparse annotation. In: International
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakan- Conference on Medical Image Computing and Computer-Assisted Intervention.
tan, A., Shyam, P., Sastry, G., Askell, A., et al., 2020. Language models are few-shot Springer, pp. 424–432.
learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901. Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S.,
Buades, A., Coll, B., Morel, J.-M., 2005. A non-local algorithm for image denoising. Phillips, S., Maffitt, D., Pringle, M., et al., 2013. The cancer imaging archive (TCIA):
In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern maintaining and operating a public information repository. J. Digit. Imaging 26 (6),
Recognition (CVPR’05), Vol. 2. IEEE, pp. 60–65. 1045–1057.
Buchholz, T.-O., Jug, F., 2022. Fourier image transformer. In: Proceedings of the Codella, N.C., Gutman, D., Celebi, M.E., Helba, B., Marchetti, M.A., Dusza, S.W.,
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1846–1854. Kalloo, A., Liopyris, K., Mishra, N., Kittler, H., et al., 2018. Skin lesion analysis
Cai, Z., Vasconcelos, N., 2018. Cascade r-cnn: Delving into high quality object toward melanoma detection: A challenge at the 2017 international symposium on
detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern biomedical imaging (isbi), hosted by the international skin imaging collaboration
Recognition. pp. 6154–6162. (isic). In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI
Campanella, G., Hanna, M.G., Geneslaw, L., Miraflor, A., Werneck Krauss Silva, V., 2018). IEEE, pp. 168–172.
Busam, K.J., Brogi, E., Reuter, V.E., Klimstra, D.S., Fuchs, T.J., 2019. Clinical- Codella, N., Rotemberg, V., Tschandl, P., Celebi, M.E., Dusza, S., Gutman, D., Helba, B.,
grade computational pathology using weakly supervised deep learning on whole Kalloo, A., Liopyris, K., Marchetti, M., et al., 2019. Skin lesion analysis toward
slide images. Nat. Med. 25 (8), 1301–1309. melanoma detection 2018: A challenge hosted by the international skin imaging
Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M., 2022. Swin-unet: collaboration (isic). arXiv preprint arXiv:1902.03368.
Unet-like pure transformer for medical image segmentation. In: Proceedings of the Cohen, J.P., Morrison, P., Dao, L., Roth, K., Duong, T., Ghassem, M., 2020. COVID-
European Conference on Computer Vision Workshops(ECCVW). 19 image data collection: Prospective predictions are the future. Mach. Learn.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S., 2020. End- Biomed. Imaging 1, 1–38. http://dx.doi.org/10.59275/j.melba.2020-48g7, URL:
to-end object detection with transformers. In: European Conference on Computer https://melba-journal.org/2020:002.
Vision. Springer, pp. 213–229. Combalia, M., Codella, N.C., Rotemberg, V., Helba, B., Vilaplana, V., Reiter, O., Car-
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A., rera, C., Barreiro, A., Halpern, A.C., Puig, S., et al., 2019. Bcn20000: Dermoscopic
2021. Emerging properties in self-supervised vision transformers. In: Proceedings lesions in the wild. arXiv preprint arXiv:1908.02288.
of the IEEE/CVF International Conference on Computer Vision. pp. 9650–9660. Cornia, M., Stefanini, M., Baraldi, L., Cucchiara, R., 2020. Meshed-memory transformer
Chefer, H., Gur, S., Wolf, L., 2021. Transformer interpretability beyond attention for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer
visualization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Vision and Pattern Recognition. pp. 10578–10587.
Pattern Recognition. pp. 782–791. Criminisi, A., Shotton, J., Bucciarelli, S., 2009. Decision forests with long-range spatial
Chen, J., Frey, E.C., He, Y., Segars, W.P., Li, Y., Du, Y., 2022a. Transmorph: Transformer context for organ localization in CT volumes. In: Medical Image Computing and
for unsupervised medical image registration. Med. Image Anal. 82, 102615. Computer-Assisted Intervention (MICCAI). Citeseer, pp. 69–80.
Chen, J., He, Y., Frey, E., Li, Y., Du, Y., 2021a. ViT-V-Net: Vision transformer for Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N., 2021. Opera:
unsupervised volumetric medical image registration. In: Medical Imaging with Deep Attention-regularized transformers for surgical phase recognition. In: International
Learning. URL: https://openreview.net/forum?id=h3HC1EU7AEz. Conference on Medical Image Computing and Computer-Assisted Intervention.
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.E., 2020b. Big self- Springer, pp. 604–614.
supervised models are strong semi-supervised learners. Adv. Neural Inf. Process. Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K., 2006. Image denoising with block-
Syst. 33, 22243–22255. matching and 3D filtering. In: Image Processing: Algorithms and Systems, Neural
Chen, J., Li, Y., Du, Y., Frey, E.C., 2020a. Generating anthropomorphic phantoms Networks, and Machine Learning, Vol. 6064. SPIE, pp. 354–365.
using fully unsupervised deformable image registration with convolutional neural Dai, Y., Gao, Y., Liu, F., 2021. Transmed: Transformers advance multi-modal medical
networks. Med. Phys. 47 (12), 6366–6380. image classification. Diagnostics 11 (8), 1384.
Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y., 2021b. Dalmaz, O., Yurt, M., Çukur, T., 2022. ResViT: Residual vision transformers for mul-
Transunet: Transformers make strong encoders for medical image segmentation. timodal medical image synthesis. IEEE Trans. Med. Imaging 41 (10), 2598–2614.
arXiv preprint arXiv:2102.04306. http://dx.doi.org/10.1109/TMI.2022.3167808.

59
R. Azad et al. Medical Image Analysis 91 (2024) 103000

arXiv:1804.02767. Full transformer for deformable medical image registration via cross attention.
Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., Goel, V., 2017. Self-critical sequence In: International Conference on Medical Image Computing and Computer-Assisted
training for image captioning. In: Proceedings of the IEEE Conference on Computer Intervention. Springer, pp. 217–226.
Vision and Pattern Recognition. pp. 7008–7024. Shi, C., Xiao, Y., Chen, Z., 2022a. Dual-domain sparse-view CT reconstruction with
Reynaud, H., Vlontzos, A., Hou, B., Beqiri, A., Leeson, P., Kainz, B., 2021. Ultrasound transformers. Phys. Med. 101, 1–7.
video transformers for cardiac ejection fraction estimation. In: International Confer- Shieh, C.-C., Gonzalez, Y., Li, B., Jia, X., Rit, S., Mory, C., Riblett, M., Hugo, G.,
ence on Medical Image Computing and Computer-Assisted Intervention. Springer, Zhang, Y., Jiang, Z., et al., 2019. SPARE: Sparse-view reconstruction challenge
pp. 495–505. for 4D cone-beam CT from a 1-min scan. Med. Phys. 46 (9), 3799–3811.
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S., 2019. General- Shinagare, A.B., Vikram, R., Jaffe, C., Akin, O., Kirby, J., Huang, E., Freymann, J.,
ized intersection over union: A metric and a loss for bounding box regression. Sainani, N.I., Sadow, C.A., Bathala, T.K., Rubin, D.L., Oto, A., Heller, M.T.,
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Surabhi, V.R., Katabathina, V., Silverman, S.G., 2014. Radiogenomics of clear cell
Recognition. pp. 658–666. renal cell carcinoma: Preliminary findings of the cancer genome atlas-renal cell
Ristea, N.-C., Miron, A.-I., Savencu, O., Georgescu, M.-I., Verga, N., Khan, F.S., carcinoma (TCGA-RCC) research group. http://dx.doi.org/10.7937/K9/TCIA.2014.
Ionescu, R.T., 2021. CyTran: Cycle-consistent transformers for non-contrast to K6M61GDW.
contrast CT translation. arXiv preprint arXiv:2110.06400. Shiraishi, J., Katsuragawa, S., Ikezoe, J., Matsumoto, T., Kobayashi, T., Komatsu, K.-
Rojas-Muñoz, E., Couperus, K., Wachs, J., 2020. DAISI: Database for AI surgical i., Matsui, M., Fujita, H., Kodera, Y., Doi, K., 2000. Development of a digital
instruction. arXiv preprint arXiv:2004.02809. image database for chest radiographs with and without a lung nodule: receiver
operating characteristic analysis of radiologists’ detection of pulmonary nodules.
Rong, Y., Rosu-Bubulac, M., Benedict, S.H., Cui, Y., Ruo, R., Connell, T., Kashani, R.,
Am. J. Roentgenol. 174 (1), 71–74.
Latifi, K., Chen, Q., Geng, H., et al., 2021. Rigid and deformable image registration
Shome, D., Kar, T., Mohanty, S.N., Tiwari, P., Muhammad, K., AlTameem, A., Zhang, Y.,
for radiation therapy: a self-study evaluation guide for NRG oncology clinical trial
Saudagar, A.K.J., 2021. Covid-transformer: Interpretable COVID-19 detection using
participation. Pract. Radiat. Oncol. 11 (4), 282–298.
vision transformer for healthcare. Int. J. Environ. Res. Public Health 18 (21), 11086.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for
Signoroni, A., Savardi, M., Benini, S., Adami, N., Leonardi, R., Gibellini, P., Vaccher, F.,
biomedical image segmentation. In: International Conference on Medical Image
Ravanelli, M., Borghesi, A., Maroldi, R., et al., 2021. BS-Net: Learning COVID-19
Computing and Computer-Assisted Intervention. Springer, pp. 234–241.
pneumonia severity on a large chest X-ray dataset. Med. Image Anal. 71, 102046.
Rotemberg, V., Kurtansky, N., Betz-Stablein, B., Caffery, L., Chousakos, E., Codella, N.,
SIIM-ACR, 2019. SIIM-ACR pneumothorax segmentation. https://www.kaggle.com/c/
Combalia, M., Dusza, S., Guitera, P., Gutman, D., et al., 2021. A patient-centric
siim-acr-pneumothorax-segmentation.
dataset of images and metadata for identifying melanomas using clinical context. Silva, J., Histace, A., Romain, O., Dray, X., Granado, B., 2014. Toward embedded
Sci. Data 8 (1), 34. detection of polyps in wce images for early diagnosis of colorectal cancer. Int. J.
RSNA, 2018. RSNA pneumonia detection challenge. https://www.kaggle.com/c/rsna- Comput. Assist. Radiol. Surg. 9 (2), 283–293.
pneumonia-detection-challenge. Simpson, A., Antonelli, M., Bakas, S., Bilello, M., Farahani, K., Van Ginneken, B.,
Sait, U., Lal, K., Prajapati, S., Bhaumik, R., Kumar, T., Sanjana, S., Bhalla, K., 2020. Kopp-Schneider, A., Landman, B., Litjens, G., Menze, B., et al., 2019a. A large an-
Curated dataset for COVID-19 posterior-anterior chest radiography images (X-Rays). notated medical image dataset for the development and evaluation of segmentation
Mendeley Data 1. algorithms. arXiv 2019. arXiv preprint arXiv:1902.09063.
Saltz, J., Saltz, M., Prasanna, P., Moffitt, R., Hajagos, J., Bremer, E., Balsamo, J., Simpson, A.L., Antonelli, M., Bakas, S., Bilello, M., Farahani, K., Van Ginneken, B.,
Kurc, T., 2021. Stony brook university COVID-19 positive cases [data set]. http: Kopp-Schneider, A., Landman, B.A., Litjens, G., Menze, B., et al., 2019b. A
//dx.doi.org/10.7937/TCIA.BBAG-2923. large annotated medical image dataset for the development and evaluation of
Sang, D.V., Chung, T.Q., Lan, P.N., Hang, D.V., Van Long, D., Thuy, N.T., 2021. Ag- segmentation algorithms. arXiv preprint arXiv:1902.09063.
curesnest: A novel method for colon polyp segmentation. arXiv preprint arXiv: Singh, A., Sengupta, S., Lakshminarayanan, V., 2020. Explainable deep learning models
2105.00402. in medical image analysis. J. Imaging 6 (6), 52.
Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., Rueckert, D., Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Hou, L., Clark, K., Pfohl, S.,
2019. Attention gated networks: Learning to leverage salient regions in medical Cole-Lewis, H., Neal, D., et al., 2023. Towards expert-level medical question
images. Med. Image Anal. 53, 197–207. answering with large language models. arXiv preprint arXiv:2305.09617.
Schoppe, O., Pan, C., Coronel, J., Mai, H., Rong, Z., Todorov, M.I., Müskes, A., Sirinukunwattana, K., Pluim, J.P., Chen, H., Qi, X., Heng, P.-A., Guo, Y.B., Wang, L.Y.,
Navarro, F., Li, H., Ertürk, A., et al., 2020. Deep learning-enabled multi-organ Matuszewski, B.J., Bruni, E., Sanchez, U., et al., 2017. Gland segmentation in colon
segmentation in whole-body mouse scans. Nat. Commun. 11 (1), 1–14. histology images: The glas challenge contest. Med. Image Anal. 35, 489–502.
Seenivasan, L., Islam, M., Kannan, G., Ren, H., 2023. SurgicalGPT: End-to-end language- Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A.Y., Lungren, M.P., 2020. CheXbert:
vision GPT for visual question answering in surgery. arXiv preprint arXiv:2304. combining automatic labelers and expert annotations for accurate radiology report
09974. labeling using BERT. arXiv preprint arXiv:2004.09167.
Seeram, E., 2015. Computed Tomography-E-Book: Physical Principles, Clinical Srinivas, A., Lin, T.-Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A., 2021. Bottleneck
Applications, and Quality Control. Elsevier Health Sciences. transformers for visual recognition. In: Proceedings of the IEEE/CVF Conference on
Segars, W., Bond, J., Frush, J., Hon, S., Eckersley, C., Williams, C.H., Feng, J., Computer Vision and Pattern Recognition. pp. 16519–16529.
Tward, D.J., Ratnanather, J., Miller, M., et al., 2013. Population of anatomically Stefanini, M., Cornia, M., Baraldi, L., Cascianelli, S., Fiameni, G., Cucchiara, R., 2022.
variable 4D XCAT adult phantoms for imaging research and optimization. Med. From show to tell: a survey on deep learning-based image captioning. IEEE Trans.
Phys. 40 (4), 043701. Pattern Anal. Mach. Intell..
Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., Dai, J., 2020. VL-BERT: Pre-training of
Sekuboyina, A., Bayat, A., Husseini, M.E., Löffler, M., Rempfler, M., Kukačka, J.,
generic visual-linguistic representations. In: International Conference on Learning
Tetteh, G., Valentinitsch, A., Payer, C., Urschler, M., et al., 2020. Verse: a vertebrae
Representations. URL: https://openreview.net/forum?id=SygXPaEYvH.
labelling and segmentation benchmark. arXiv. org e-Print archive.
Sun, R., Li, Y., Zhang, T., Mao, Z., Wu, F., Zhang, Y., 2021. Lesion-aware transformers
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., 2017. Grad-
for diabetic retinopathy grading. In: Proceedings of the IEEE/CVF Conference on
cam: Visual explanations from deep networks via gradient-based localization. In:
Computer Vision and Pattern Recognition. pp. 10938–10947.
Proceedings of the IEEE International Conference on Computer Vision. pp. 618–626.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2016. Rethinking the
Shamshad, F., Khan, S., Zamir, S.W., Khan, M.H., Hayat, M., Khan, F.S., Fu, H., 2023.
inception architecture for computer vision. In: Proceedings of the IEEE Conference
Transformers in medical imaging: A survey. Med. Image Anal. 102802.
on Computer Vision and Pattern Recognition. pp. 2818–2826.
Shao, Z., Bian, H., Chen, Y., Wang, Y., Zhang, J., Ji, X., et al., 2021. Transmil: Tan, M., Le, Q., 2019. Efficientnet: Rethinking model scaling for convolutional neural
Transformer based correlated multiple instance learning for whole slide image networks. In: International Conference on Machine Learning. PMLR, pp. 6105–6114.
classification. Adv. Neural Inf. Process. Syst. 34. Tan, M., Le, Q., 2021. Efficientnetv2: Smaller models and faster training. In:
Sharma, Y., Shrivastava, A., Ehsan, L., Moskaluk, C.A., Syed, S., Brown, D., 2021. International Conference on Machine Learning. PMLR, pp. 10096–10106.
Cluster-to-conquer: A framework for end-to-end multi-instance learning for whole Tang, Y., Gao, R., Lee, H.H., Han, S., Chen, Y., Gao, D., Nath, V., Bermudez, C.,
slide image classification. In: Medical Imaging with Deep Learning. PMLR, pp. Savona, M.R., Abramson, R.G., et al., 2021. High-resolution 3D abdominal
682–698. segmentation with random patch network fusion. Med. Image Anal. 69, 101894.
Shattuck, D.W., Mirza, M., Adisetiyo, V., Hojatkashani, C., Salamon, G., Narr, K.L., Tang, Y., Yang, D., Li, W., Roth, H.R., Landman, B., Xu, D., Nath, V., Hatamizadeh, A.,
Poldrack, R.A., Bilder, R.M., Toga, A.W., 2008. Construction of a 3D probabilistic 2022. Self-supervised pre-training of swin transformers for 3D medical image
atlas of human cortical structures. Neuroimage 39 (3), 1064–1080. analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Shen, Z., Fu, R., Lin, C., Zheng, S., 2021a. COTR: Convolution in transformer network Pattern Recognition. pp. 20730–20740.
for end to end polyp detection. In: 2021 7th International Conference on Computer Tanzi, L., Audisio, A., Cirrincione, G., Aprato, A., Vezzetti, E., 2022. Vision transformer
and Communications (ICCC). IEEE, pp. 1757–1761. for femur fracture classification. Injury.
Shen, Z., Zhang, M., Zhao, H., Yi, S., Li, H., 2021b. Efficient attention: Attention Tanzi, L., Vezzetti, E., Moreno, R., Aprato, A., Audisio, A., Massè, A., 2020. Hierarchical
with linear complexities. In: Proceedings of the IEEE/CVF Winter Conference on fracture classification of proximal femur X-Ray images using a multistage deep
Applications of Computer Vision. pp. 3531–3539. learning approach. Eur. J. Radiol. 133, 109373.
Tao, R., Zheng, G., 2021. Spine-transformers: Vertebra detection and localization in arbitrary field-of-view spine ct with transformers. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 93–103.
Team, N.L.S.T.R., 2011. Reduced lung-cancer mortality with low-dose computed tomographic screening. N. Engl. J. Med. 365 (5), 395–409.
Jimenez-del Toro, O., Müller, H., Krenn, M., Gruenberg, K., Taha, A.A., Winterstein, M., Eggel, I., Foncubierta-Rodríguez, A., Goksel, O., Jakab, A., et al., 2016. Cloud-based evaluation of anatomical structure segmentation and landmark detection algorithms: VISCERAL anatomy benchmarks. IEEE Trans. Med. Imaging 35 (11), 2459–2475.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H., 2021. Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning. PMLR, pp. 10347–10357.
Tsai, E.B., Simpson, S., Lungren, M.P., Hershman, M., Roshkovan, L., Colak, E., Erickson, B.J., Shih, G., Stein, A., Kalpathy-Cramer, J., et al., 2021a. Data from medical imaging data resource center (MIDRC) - RSNA international COVID radiology database (RICORD) release 1c - chest X-ray, covid+ (MIDRC-RICORD-1c). Cancer Imaging Arch..
Tsai, E.B., Simpson, S., Lungren, M.P., Hershman, M., Roshkovan, L., Colak, E., Erickson, B.J., Shih, G., Stein, A., Kalpathy-Cramer, J., et al., 2021b. The RSNA international COVID-19 open radiology database (RICORD). Radiology 299 (1), E204–E213.
Ulman, V., Maška, M., Magnusson, K.E., Ronneberger, O., Haubold, C., Harder, N., Matula, P., Matula, P., Svoboda, D., Radojevic, M., et al., 2017. An objective comparison of cell-tracking algorithms. Nat. Methods 14 (12), 1141–1152.
Ulyanov, D., Vedaldi, A., Lempitsky, V., 2018. Deep image prior. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9446–9454.
Valanarasu, J.M.J., Oza, P., Hacihaliloglu, I., Patel, V.M., 2021. Medical transformer: Gated axial-attention for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 36–46.
Vaswani, A., Ramachandran, P., Srinivas, A., Parmar, N., Hechtman, B., Shlens, J., 2021. Scaling local self-attention for parameter efficient visual backbones. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12894–12904.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. Adv. Neural Inf. Process. Syst. 30.
Vayá, M.D.L.I., Saborit, J.M., Montell, J.A., Pertusa, A., Bustos, A., Cazorla, M., Galant, J., Barber, X., Orozco-Beltrán, D., García-García, F., et al., 2020. BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients. arXiv preprint arXiv:2006.01174.
Vedantam, R., Lawrence Zitnick, C., Parikh, D., 2015. Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4566–4575.
Wagner, R., Rohr, K., 2022. Cellcentroidformer: Combining self-attention and convolution for cell detection. In: Annual Conference on Medical Image Understanding and Analysis. Springer, pp. 212–222.
Wang, W., Chen, C., Ding, M., Yu, H., Zha, S., Li, J., 2021d. Transbts: Multimodal brain tumor segmentation using transformer. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 109–119.
Wang, X., Chen, Y., Zhu, W., 2021f. A survey on curriculum learning. IEEE Trans. Pattern Anal. Mach. Intell..
Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., Li, H., 2022. Uformer: A general u-shaped transformer for image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17683–17693.
Wang, D., Fan, F., Wu, Z., Liu, R., Wang, F., Yu, H., 2023a. CTformer: convolution-free Token2Token dilated vision transformer for low-dose CT denoising. Phys. Med. Biol. 68 (6), 065012.
Wang, X., Girshick, R., Gupta, A., He, K., 2018b. Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7794–7803.
Wang, C., Hu, Z., Shi, P., Liu, H., 2014. Low dose PET reconstruction with total variation regularization. In: 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE, pp. 1917–1920.
Wang, S., Li, B.Z., Khabsa, M., Fang, H., Ma, H., 2020b. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
Wang, Y., Lin, Z., Tian, J., Shi, Z., Zhang, Y., Fan, J., He, Z., 2021g. Confidence-guided radiology report generation. arXiv preprint arXiv:2106.10887.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., Catanzaro, B., 2018a. High-resolution image synthesis and semantic manipulation with conditional gans. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8798–8807.
Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M., 2017. Chestx-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2097–2106.
Wang, C., Shang, K., Zhang, H., Li, Q., Hui, Y., Zhou, S.K., 2021a. Dudotrans: Dual-domain transformer provides more attention for sinogram restoration in sparse-view CT reconstruction. arXiv preprint arXiv:2111.10790.
Wang, D., Wu, Z., Yu, H., 2021b. TED-net: Convolution-free T2T vision transformer-based encoder-decoder dilation network for low-dose CT denoising. In: Machine Learning in Medical Imaging: 12th International Workshop, MLMI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, September 27, 2021, Proceedings 12. Springer, pp. 416–425.
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L., 2021e. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 568–578.
Wang, S., Zhao, Z., Ouyang, X., Wang, Q., Shen, D., 2023b. Chatcad: Interactive computer-aided diagnosis on medical image using large language models. arXiv preprint arXiv:2302.07257.
Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.-C., 2020a. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In: European Conference on Computer Vision. Springer, pp. 108–126.
Wang, S., Zhuang, Z., Xuan, K., Qian, D., Xue, Z., Xu, J., Liu, Y., Chai, Y., Zhang, L., Wang, Q., et al., 2021c. 3DMET: 3D medical image transformer for knee cartilage defect assessment. In: International Workshop on Machine Learning in Medical Imaging. Springer, pp. 347–355.
Wittmann, B., Navarro, F., Shit, S., Menze, B., et al., 2023. Focused decoding enables 3D anatomical detection by transformers. Mach. Learn. Biomed. Imaging 2 (February 2023 issue), 72–95.
Woo, S., Park, J., Lee, J.-Y., Kweon, I.S., 2018. Cbam: Convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 3–19.
World-Health-Organization, 2021. Breast cancer. https://www.who.int/news-room/fact-sheets/detail/breast-cancer.
Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., Zhang, L., 2021. Cvt: Introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22–31.
Wu, M., Xu, Y., Xu, Y., Wu, G., Chen, Q., Lin, H., 2022. Adaptively re-weighting multi-loss untrained transformer for sparse-view cone-beam CT reconstruction. arXiv preprint arXiv:2203.12476.
Xia, Z., Pan, X., Song, S., Li, L.E., Huang, G., 2022. Vision transformer with deformable attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4794–4803.
Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., Hu, H., 2022. Simmim: A simple framework for masked image modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9653–9663.
Xie, Y., Zhang, J., Shen, C., Xia, Y., 2021. CoTr: Efficiently bridging CNN and transformer for 3D medical image segmentation. arXiv preprint arXiv:2103.03024.
Xiong, Y., Du, B., Yan, P., 2019. Reinforced transformer for medical image captioning. In: International Workshop on Machine Learning in Medical Imaging. Springer, pp. 673–680.
Xu, J., Moyer, D., Grant, P.E., Golland, P., Iglesias, J.E., Adalsteinsson, E., 2022. SVoRT: iterative transformer for slice-to-volume registration in fetal brain MRI. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 3–13.
Xu, W., Xu, Y., Chang, T., Tu, Z., 2021. Co-scale conv-attentional image transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9981–9990.
Yan, A., He, Z., Lu, X., Du, J., Chang, E., Gentili, A., McAuley, J., Hsu, C.-n., 2021. Weakly supervised contrastive learning for chest X-Ray report generation. In: Findings of the Association for Computational Linguistics: EMNLP 2021. pp. 4009–4015.
Yan, R., Qu, L., Wei, Q., Huang, S.-C., Shen, L., Rubin, D., Xing, L., Zhou, Y., 2023. Label-efficient self-supervised federated learning for tackling data heterogeneity in medical imaging. IEEE Trans. Med. Imaging.
Yang, D., Myronenko, A., Wang, X., Xu, Z., Roth, H.R., Xu, D., 2021a. T-AutoML: Automated machine learning for lesion segmentation using transformers in 3D medical imaging. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3962–3974.
Yang, J., Shi, R., Ni, B., 2021b. Medmnist classification decathlon: A lightweight automl benchmark for medical image analysis. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). IEEE, pp. 191–195.
Yang, L., Zhang, D., et al., 2022. Low-dose CT denoising via sinogram inner-structure transformer. arXiv preprint arXiv:2204.03163.
Yao, Z., Ai, J., Li, B., Zhang, C., 2021a. Efficient detr: improving end-to-end object detector with dense prior. arXiv preprint arXiv:2104.01318.
Yao, T., Li, Y., Pan, Y., Wang, Y., Zhang, X.-P., Mei, T., 2023. Dual vision transformer. IEEE Trans. Pattern Anal. Mach. Intell..
Yao, C., Tang, J., Hu, M., Wu, Y., Guo, W., Li, Q., Zhang, X.-P., 2021. Claw U-Net: A UNet variant network with deep feature concatenation for scleral blood vessel segmentation. In: CAAI International Conference on Artificial Intelligence. Springer, pp. 67–78.
Yap, M.H., Pons, G., Martí, J., Ganau, S., Sentis, M., Zwiggelaar, R., Davison, A.K., Marti, R., 2017. Automated breast ultrasound lesions detection using convolutional neural networks. IEEE J. Biomed. Health Inform. 22 (4), 1218–1226.
You, D., Liu, F., Ge, S., Xie, X., Zhang, J., Wu, X., 2021b. Aligntransformer: Hierarchical alignment of visual regions and disease tags for medical report generation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 72–82.
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S., 2022. Metaformer is actually what you need for vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10819–10829.
Yu, S., Ma, K., Bi, Q., Bian, C., Ning, M., He, N., Li, Y., Liu, H., Zheng, Y., 2021. Mil-vt: Multiple instance learning enhanced vision transformer for fundus image classification. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 45–54.
Yu, L., Wang, S., Li, X., Fu, C.-W., Heng, P.-A., 2019. Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 605–613.
Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.-H., Tay, F.E., Feng, J., Yan, S., 2021. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 558–567.
Zbontar, J., Knoll, F., Sriram, A., Murrell, T., Huang, Z., Muckley, M.J., Defazio, A., Stern, R., Johnson, P., Bruno, M., Parente, M., Geras, K.J., Katsnelson, J., Chandarana, H., Zhang, Z., Drozdzal, M., Romero, A., Rabbat, M., Vincent, P., Yakubova, N., Pinkerton, J., Wang, D., Owens, E., Zitnick, C.L., Recht, M.P., Sodickson, D.K., Lui, Y.W., 2018. fastMRI: An open dataset and benchmarks for accelerated MRI. arXiv:1811.08839.
Zhang, Z., Bagci, U., 2022. Dynamic linear transformer for 3D biomedical image segmentation. In: Machine Learning in Medical Imaging: 13th International Workshop, MLMI 2022, Held in Conjunction with MICCAI 2022, Singapore, September 18, 2022, Proceedings. Springer, pp. 171–180.
Zhang, H.-M., Dong, B., 2020. A review on deep learning in medical image reconstruction. J. Oper. Res. Soc. China 8 (2), 311–340.
Zhang, H., Goodfellow, I., Metaxas, D., Odena, A., 2019. Self-attention generative adversarial networks. In: International Conference on Machine Learning. PMLR, pp. 7354–7363.
Zhang, X., He, X., Guo, J., Ettehadi, N., Aw, N., Semanek, D., Posner, J., Laine, A., Wang, Y., 2021b. PTNet: a high-resolution infant MRI synthesizer based on transformer. arXiv preprint arXiv:2105.13993.
Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L., Shum, H.-Y., 2023. DINO: DETR with improved DeNoising anchor boxes for end-to-end object detection. In: The Eleventh International Conference on Learning Representations. URL: https://openreview.net/forum?id=3mRwyG5one.
Zhang, R.H., Liu, Q., Fan, A.X., Ji, H., Zeng, D., Cheng, F., Kawahara, D., Kurohashi, S., 2020a. Minimize exposure bias of Seq2Seq models in joint entity and relation extraction. arXiv preprint arXiv:2009.07503.
Zhang, Y., Liu, H., Hu, Q., 2021c. Transfuse: Fusing transformers and cnns for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 14–24.
Zhang, J., Nie, Y., Chang, J., Zhang, J.J., 2021a. Surgical instruction generation with transformers. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 290–299.
Zhang, Y., Pei, Y., Zha, H., 2021d. Learning dual transformer network for diffeomorphic registration. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 129–138.
Zhang, Y., Wang, X., Xu, Z., Yu, Q., Yuille, A., Xu, D., 2020b. When radiology report generation meets knowledge graph. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. pp. 12910–12917.
Zhang, L., Xiao, Z., Zhou, C., Yuan, J., He, Q., Yang, Y., Liu, X., Liang, D., Zheng, H., Fan, W., et al., 2022. Spatial adaptive and transformer fusion network (STFNet) for low-count PET blind denoising with MRI. Med. Phys. 49 (1), 343–356.
Zhang, Z., Yu, L., Liang, X., Zhao, W., Xing, L., 2021e. TransCT: dual-path transformer for low dose computed tomography. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VI 24. Springer, pp. 55–64.
Zhao, Z., Jin, Y., Heng, P.-A., 2022. TraSeTR: track-to-segment transformer with contrastive query for instance-level instrument segmentation in robotic surgery. In: 2022 International Conference on Robotics and Automation (ICRA). IEEE, pp. 11186–11193.
Zhao, S., Lau, T., Luo, J., Eric, I., Chang, C., Xu, Y., 2019. Unsupervised 3D end-to-end medical image registration with volume tweening network. IEEE J. Biomed. Health Inform. 24 (5), 1394–1404.
Zheng, Y., Gindra, R.H., Green, E.J., Burks, E.J., Betke, M., Beane, J.E., Kolachalama, V.B., 2022. A graph-transformer for whole slide image classification. IEEE Trans. Med. Imaging 41 (11), 3003–3015.
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., et al., 2021. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6881–6890.
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., Ren, D., 2020. Distance-iou loss: Faster and better learning for bounding box regression. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. pp. 12993–13000.
Zhong, Z., Zheng, L., Li, S., Yang, Y., 2018. Generalizing a person retrieval model hetero- and homogeneously. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 172–188.
Zhou, B., Dey, N., Schlemper, J., Salehi, S.S.M., Liu, C., Duncan, J.S., Sofka, M., 2023a. DSFormer: a dual-domain self-supervised transformer for accelerated multi-contrast MRI reconstruction. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 4966–4975.
Zhou, H.-Y., Guo, J., Zhang, Y., Yu, L., Wang, L., Yu, Y., 2021. nnformer: Interleaved transformer for volumetric segmentation. arXiv preprint arXiv:2109.03201.
Zhou, J., He, X., Sun, L., Xu, J., Chen, X., Chu, Y., Zhou, L., Liao, X., Zhang, B., Gao, X., 2023b. SkinGPT-4: An interactive dermatology diagnostic system with visual large language model. medRxiv, 2023-2006.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A., 2016. Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2921–2929.
Zhou, Y., Li, Z., Bai, S., Wang, C., Chen, X., Han, M., Fishman, E., Yuille, A.L., 2019. Prior-aware neural network for partially-supervised multi-organ segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10672–10681.
Zhou, L., Liu, H., Bae, J., He, J., Samaras, D., Prasanna, P., 2022. Self pre-training with masked autoencoders for medical image analysis. arXiv preprint arXiv:2203.05573.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J., 2021. Deformable {detr}: Deformable transformers for end-to-end object detection. In: International Conference on Learning Representations. URL: https://openreview.net/forum?id=gZ9hCDWe6ke.
Zhuang, X., Shen, J., 2016. Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Med. Image Anal. 31, 77–87.