

Expert Systems With Applications 224 (2023) 119997


ETAM: Ensemble transformer with attention modules for detection of small objects

Jiangnan Zhang a,b, Kewen Xia a,∗, Zhiyi Huang b, Sijie Wang a, Romoke Grace Akindele a

a School of Electronics and Information Engineering, Hebei University of Technology, Tianjin 300401, China
b Department of Computer Science, University of Otago, Dunedin 9054, New Zealand

ARTICLE INFO

Keywords:
Small object detection
Transformer
Visual attention module
Ensemble learning

ABSTRACT

Detecting small objects is critical to many applications, such as autonomous driving and lung nodule detection. However, small object detection is challenging with low-resolution features. Therefore, the linchpin of small object detection is to design an effective encoder that can extract subtle features. In this paper, we present a powerful encoder, called the Ensemble Transformer with Attention Modules (ETAM) encoder, for extracting the subtle small object features without sacrificing the capability of larger object detection. In ETAM, a Magnifying Glass (MG) module is proposed to focus on representative features of small objects. Then, the Quadruple Attention (QA) is designed to enrich the small object features with width and height in addition to channel and position. To accommodate both small and large objects, we use ensemble learning in our ETAM encoder, which has two branches. Experimental results show that ETAM significantly improves small object detection on PASCAL VOC, MS-COCO, VisDrone2019, and LIDC-IDRI. With ETAM, the 𝑚AP for small objects is improved up to 91.7% based on the four datasets.

1. Introduction

Small object detection is critical to many applications, e.g., autonomous driving and the detection of lung nodules. For example, it is critical to distinguish distant pedestrians or cyclists in autonomous driving, as in Fig. 1(1). Similarly, in CT image processing, the lung nodules in Fig. 1(2) are indicators of lung cancer. They are very difficult for human eyes to identify. Accurate detection of these small objects with machine learning is essential for automatic lung cancer diagnosis.

However, detecting small objects is challenging because there is more interference from the background, while the small object features are few. Recent object detection methods, be they two-stage models like Faster RCNN (Ren, He, Girshick, & Sun, 2015), Mask RCNN (He, Gkioxari, Dollár, & Girshick, 2017), and Cascade RCNN (Cai & Vasconcelos, 2018) or one-stage methods like Yolo (Redmon, Divvala, Girshick, & Farhadi, 2016; Redmon & Farhadi, 2017, 2018), SSD (Liu et al., 2016), FCOS (Tian, Shen, Chen, & He, 2019), and CenterNet (Zhou, Wang, & Krähenbühl, 2019), use a convolutional neural network (CNN) as their backbone. But CNNs extract higher-layer features by increasing the number of convolutional layers, which results in a loss of local context-aware representations. So CNNs are good at detecting medium and large objects, but incapable of detecting small objects with few features (He, Zhang, Ren, & Sun, 2016).

The key is to design an effective encoder that can extract the subtle features of small objects. Transformer models are compelling for learning local context-aware representations (Sheng et al., 2021), which is very beneficial for detecting small objects. Therefore, more and more researchers have turned to the Transformer encoder and made significant breakthroughs in object detection.

However, the original Transformer does not work well in detecting small targets owing to background interference and its attention limitations. For example, Fig. 2 shows failure cases of Cascade RCNN (Swin Transformer; Liu et al., 2021), where the little bird and the distant cars are not detected as objects.

To build a robust encoder for feature extraction in small object detection, we propose an Ensemble Transformer with Attention Modules (ETAM) encoder, which can detect small objects accurately but does not affect the Transformer's original performance on detecting larger objects.

In ETAM, a Magnifying Glass (MG) module acts as a high-quality region proposal network. MG predicts the coarse positions of small targets on shallow features. Accordingly, we can obtain high-quality feature maps and prevent ineffective calculations on background regions. Then, a Quadruple Attention (QA) module is designed to enrich the small object features with width and height in addition to channel and position.

∗ Corresponding author.
E-mail addresses: zhangjn1353@163.com (J. Zhang), kwxia@hebut.edu.cn (K. Xia), zhiyi.huang@otago.ac.nz (Z. Huang), jessie_wsj@sina.cn (S. Wang),
201940000019@stu.hebut.edu.cn (R.G. Akindele).

https://doi.org/10.1016/j.eswa.2023.119997
Received 17 January 2023; Received in revised form 15 March 2023; Accepted 27 March 2023
Available online 31 March 2023
0957-4174/© 2023 Elsevier Ltd. All rights reserved.
By enhancing the data dependencies on channel, position, width, and height, QA can extract enough features of small objects. To accommodate both small objects and large objects, we use ensemble learning in our ETAM encoder, which has two branches. One branch, called ETAM-S, uses the MG and QA modules, which are designed to detect small targets. The other branch, called ETAM-N, is used for detecting larger targets.

Overall, this paper has five main contributions.

(1) An ensemble Transformer encoder, ETAM, is proposed for detecting small objects. It addresses the problem of limited features of small objects and reduces interference from background features.

(2) To reduce interference from background features, for the first time, a Magnifying Glass (MG) module is designed specifically for small object detection. By extracting features of small objects, our ETAM encoder increases the amount of information about foreground objects while reducing interference from background features.

(3) We also propose a Quadruple Attention (QA) module, which extends the attention to two extra dimensions: height and width. Its purpose is to improve the feature extraction ability compared to double attention modules that only have channel and position.

(4) To improve small object detection without compromising the accuracy of detecting large objects, we adopt an ensemble learning approach to building our ETAM encoder with a two-branch structure. The ETAM-S branch focuses on detecting small objects, while the ETAM-N branch takes care of larger object detection.

(5) Finally, we conduct comprehensive experiments using the datasets Pascal VOC (Everingham, Van Gool, Williams, Winn, & Zisserman, 2010), MS COCO (Chen et al., 2015), VisDrone2019 (Zhu et al., 2018), and LIDC-IDRI (Armato III et al., 2011). Ablation studies are carried out on the benchmark dataset Pascal VOC, and comparative experiments are conducted on the most widely used dataset MS-COCO, the small object detection dataset VisDrone2019, and the lung nodule detection dataset LIDC-IDRI. Experimental results show that, with ETAM, the mean average precisions are improved up to 91.7% on the four datasets.

The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 elaborates the ETAM encoder, and Section 4 provides the experimental results and discussion. Finally, Section 5 summarizes the conclusions.

Fig. 1. Limited features and attention of small objects are necessary to recognize the person and the nodule in these pictures.

Fig. 2. Failure cases of Cascade RCNN (Swin Transformer) in detecting small objects. Ground-truth bounding boxes are red and predicted bounding boxes are green.

2. Related work

Detection methods for small objects can be sorted into CNN-based, Transformer-based, and Attention Module-based methods.

2.1. CNN-based small object detection

A lot of CNN-based networks have been proposed (Cai & Vasconcelos, 2018; Lin, Goyal, Girshick, He and Dollár, 2017; Redmon & Farhadi, 2018; Yang, Fan, Chu, Blasch, & Ling, 2019) to detect small targets. Moreover, many methods to improve small object detection have been designed. They can be mainly divided into four classes: (1) multi-scale feature learning (Gong et al., 2021; Wang et al., 2022; Yang, Huang, & Wang, 2022; Zeng et al., 2022); (2) super-resolution (Noh, Bae, Lee, Seo, & Kim, 2019); (3) incorporating context-based information (Akyon, Altinuc, & Temizel, 2022; Fu, Liu, Ranga, Tyagi, & Berg, 2017; Lim, Astrid, Yoon, & Lee, 2021; Misra, Nalamada, Arasanipalai, & Hou, 2021; Wang, Xu, Yang, & Yu, 2021); and (4) scale-aware training (Singh & Davis, 2018; Xu et al., 2022). Both two-stage and one-stage networks use a CNN as the image encoder. To get sufficient features for small objects, the CNN relies on expanding the size of filters and the number of convolutional layers. However, raising these hyper-parameters incurs losing local context-aware representations (He et al., 2016). Therefore, CNN-based detection is incapable of detecting small objects with few features, though it is good at detecting medium and large targets.

2.2. Transformer-based small object detection

A new paradigm for object detection has recently emerged due to the success of the Transformer-based encoder in many computer vision fields (Liu et al., 2021; Ma et al., 2023; Sheng et al., 2021; Üzen, Türkoğlu, Yanikoglu, & Hanbay, 2022; Wan et al., 2022; Yang & Yang, 2023). Yu, Li, Yu, and Huang (2019) proved that using a Transformer as an image encoder can achieve superior performance over a CNN. By combining the strengths of the Transformer and the CNN, Swin Transformer (Liu et al., 2021) displays powerful feature expression capability. A variant of Swin Transformer (Üzen et al., 2022) further developed spatial and channel attention modules to achieve inter-level cross-modality fusion. This powerful encoder is also suitable for detecting small objects (Carion et al., 2020). For small object detection, recent Transformers (Ma et al., 2023; Sheng et al., 2021; Yang & Yang, 2023) also explored using attention modules and high-quality region proposal networks. Nevertheless, there is still much room to improve the Transformer for small object detection.

2.3. Attention module-based small object detection

In image processing, some researchers used optimization algorithms to deal with image segmentation (Houssein, Abdelkareem, Emam, Hameed, & Younan, 2022) and image steganography (Hassaballah, Hameed, Awad, & Muhammad, 2021) problems. However, most studies on small object detection used the attention module to improve model performance. In deep learning, the attention module can incorporate context information for small objects (Fu et al., 2017). It can be broadly understood as concentrating on the local input relevant to a particular task instead of seeing the global input (Lim et al., 2021). A single attention module was designed in SENet, but it only used one dimension (i.e., channel). It improved the classification performance of the network (Hu, Shen, & Sun, 2018). A dual attention module designed in DANet (Fu et al., 2019) used two dimensions: channel and position. However, all these attention modules ignore the information on the object area, such as width and height, which is vital for small object detection. Differently from past works, we expand the attention module to include the object area, i.e., the dimensions of height and width, for a better, compact feature representation of small objects.


Fig. 3. The architectures of our ETAM encoder.

Fig. 4. The architecture of Magnifying Glass (MG) module.

Fig. 5. An overview of the Quadruple Attention (QA) module.


3. Architecture of ETAM

This section discusses the structure of our ETAM encoder. We use ETAM as the encoder of a two-stage object detection network for small objects. We first give an overview of ETAM. Then we describe each part in detail.

3.1. Overview

The construction of the ETAM encoder is demonstrated in Fig. 3. It is composed of two branches: the feature extraction branch for small objects (ETAM-S branch) and the branch for normal objects (ETAM-N branch). ETAM-S is designed to detect small objects, while ETAM-N is used to detect larger objects. The two branches take each input image for feature extraction and generate a set of feature maps. These are concatenated by the feature fusion module. Then, the decoder adopts the concatenated feature maps as the input and generates the bounding boxes and the classification results. We will present each part in detail below.
3.2. Magnifying Glass module

As shown in Fig. 4, we design a simple Magnifying Glass (MG) module to select key feature regions of images at the pixel level. Through the feature selection of MG, representative features are sampled for later feature extraction in the Transformer. It can reduce the excessive interference of background noise involved in multi-scale features. It also reduces computational complexity and facilitates feature learning of small objects.

The basis of the MG structure is a stacked network. First, a feature map $F \in \mathbb{R}^{H \times W \times C}$ from the FPN with distinct scales (such as 1/4, 1/8, 1/16) is constructed:

$F = \{ f_k \in \mathbb{R}^{H_k \times W_k \times C} \mid k = 0, 1, 2, 3 \}$.  (1)

Then, to preserve the location information of the representative regions, we keep the location $C \in \mathbb{R}^{H \times W \times 2}$ and embed it in $F$:

$\hat{f}_k = f_k \oplus C_k$  (2)

where $\oplus$ stands for the concatenation operation.

Finally, a confidence score network consisting of convolutional layers and a Sigmoid function is constructed to generate a score map $S \in \mathbb{R}^{H \times W}$ for the predictions of representative regions at all scales:

$S = \{ s_k \in \mathbb{R}^{H_k \times W_k} \mid s_k = \mathcal{S}(\hat{f}_k),\ k = 0, 1, 2, 3 \}$.  (3)

As a result, the region of representative feature concentration and its coordinates are obtained, which can significantly reduce the number of redundant features of small objects.
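For concreteness, the score-map computation of Eqs. (1)-(3) can be sketched in PyTorch as below. This is a minimal sketch under stated assumptions: the two-layer convolutional head, its channel sizes, and the use of normalized (x, y) coordinates for the embedded location map are illustrative choices rather than the exact ETAM configuration; only the overall flow (concatenate the coordinate map to each FPN level, then predict a per-pixel confidence score with convolutions and a Sigmoid) follows the text.

import torch
import torch.nn as nn

class MGScoreHead(nn.Module):
    """Hypothetical confidence-score network for the MG module (Eqs. (1)-(3))."""

    def __init__(self, in_channels=256, hidden=64):
        super().__init__()
        # +2 input channels for the embedded coordinate map C of Eq. (2).
        self.score_net = nn.Sequential(
            nn.Conv2d(in_channels + 2, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
            nn.Sigmoid(),  # per-pixel confidence in [0, 1], Eq. (3)
        )

    @staticmethod
    def coord_map(h, w, device):
        # Normalized pixel coordinates, stored as a (2, H, W) tensor.
        ys = torch.linspace(0.0, 1.0, h, device=device)
        xs = torch.linspace(0.0, 1.0, w, device=device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        return torch.stack((gx, gy), dim=0)

    def forward(self, fpn_feats):
        # fpn_feats: list of FPN levels f_k with shape (N, C, H_k, W_k), Eq. (1).
        scores = []
        for f in fpn_feats:
            n, _, h, w = f.shape
            c = self.coord_map(h, w, f.device).unsqueeze(0).expand(n, -1, -1, -1)
            f_hat = torch.cat((f, c), dim=1)       # Eq. (2): concatenate location info
            scores.append(self.score_net(f_hat))   # Eq. (3): score map s_k
        return scores

# Example with dummy FPN levels at scales 1/4, 1/8, 1/16 of a 256x256 input:
levels = [torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32), torch.randn(1, 256, 16, 16)]
score_maps = MGScoreHead()(levels)

High-scoring regions of each score map can then be used to crop, or "magnify", the corresponding areas of the input before later feature extraction, in line with the behavior described for MG.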


Fig. 6. The details of each module in QA.

3.3. Quadruple Attention module

As shown in Fig. 5, the Quadruple Attention (QA) is designed to enrich the small object features with width and height in addition to channel and position. A visual attention module can model the rich contextual dependencies of local features and thus focus on the dimensions of the image features. Long-range information dependencies on the width and height dimensions are also crucial for small object detection. Therefore, we propose and design a QA module using the channel, position, width, and height dimensions. More details are shown in Fig. 6 and explained below.

The QA is designed as a three-branch, two-stage module. Given an input tensor $X \in \mathbb{R}^{C \times H \times W}$, we first pass it to the first stage of the QA. Then, the average value of the three branches is passed to the second stage.

In the first branch of the first stage, we establish a channel attention module (CAM) to explicitly model interdependencies between channels, as shown in Fig. 6(a).

The input tensor $A \in \mathbb{R}^{C \times H \times W}$ is reshaped as $B \in \mathbb{R}^{C \times P}$, where $P = H \times W$. Then the matrix product of $B^{T}$ and $B$ is passed to a Softmax layer to get $C \in \mathbb{R}^{C \times C}$ as below:

$c_{ji} = \dfrac{\exp(B_i \cdot B_j^{T})}{\sum_{i=1}^{C} \exp(B_i \cdot B_j^{T})}$  (4)

where $c_{ji}$ estimates the effect of the $i$th channel on the $j$th channel. Then we reshape the matrix product of $C^{T}$ and $B$ back to $\mathbb{R}^{C \times H \times W}$. Finally, applying a weighted sum yields the final tensor $D \in \mathbb{R}^{C \times H \times W}$:

$D_j = \beta \sum_{i=1}^{C} \left( c_{ij}^{T} B_i \right) + B_j$  (5)

where $\beta$ represents a weight that is learned gradually from 0. Eq. (5) shows that each channel's ultimate feature is a weighted sum of all the channel features and its original features. It models the long-range semantic dependencies to enhance feature discriminability.
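A minimal PyTorch sketch of the channel attention of Eqs. (4)-(5) is given below. It follows the DANet-style formulation described in the text; the exact layer arrangement inside ETAM is an assumption.

import torch
import torch.nn as nn

class ChannelAttentionModule(nn.Module):
    """Channel attention (CAM) sketch following Eqs. (4)-(5)."""

    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # learned gradually from 0

    def forward(self, a):
        n, c, h, w = a.shape
        b = a.view(n, c, -1)                                            # B in R^{C x P}, P = H * W
        attn = torch.softmax(torch.bmm(b, b.transpose(1, 2)), dim=-1)   # Eq. (4): C x C affinities
        d = torch.bmm(attn, b).view(n, c, h, w)                         # product reshaped back to C x H x W
        return self.beta * d + a                                        # Eq. (5): weighted sum plus original features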
In the second branch of the first stage, we construct the height attention module (HAM) to build the height dimension dependencies, as in Fig. 6(b). For this purpose, the input tensor $A$ is rotated 90° counterclockwise along the $W$ axis to get $\mathbb{R}^{H \times C \times W}$. Similarly, the rotated matrix $A$ is reshaped into $B \in \mathbb{R}^{H \times M}$ ($M = C \times W$). Then $H \in \mathbb{R}^{H \times H}$ is obtained through the transposed multiplication and Softmax layers. The final tensor $D \in \mathbb{R}^{C \times H \times W}$ is obtained by weighting and summing.

As shown in Fig. 6(c), similarly, in the last branch of the first stage, $A$ is rotated 90° counterclockwise along the $H$ axis to $\mathbb{R}^{W \times H \times C}$. It is then reshaped into $\mathbb{R}^{W \times N}$, where $N = H \times C$. In the same way as above, the output tensor $D$ is finally obtained using transposed multiplication, a Softmax layer, and weighted sums.

To build strong contextual relationships on local features, a position attention module (PAM) is built in the second stage, as shown in Fig. 6(d). However, unlike the attention modules described above, $P = H \times W$ is retained after the reshaping, and $B^{T}$ and $B$ are multiplied together and passed through the Softmax layer to acquire $P \in \mathbb{R}^{P \times P}$. The final output tensor $D$, which finally undergoes the PAM action, is obtained by a weighted sum.

In brief, an output tensor $Y$ is obtained using QA for an input tensor $X \in \mathbb{R}^{C \times H \times W}$. The QA module can be described by the equations below:

$A = \dfrac{1}{3}\left( CAM(X_1) + HAM(X_2) + WAM(X_3) \right)$  (6)

$Y = A + PAM(A)$  (7)

where $X_2$ and $X_3$ represent the input tensor rotated 90° counterclockwise along the corresponding axes.
𝑌 = 𝐴 + 𝑃 𝐴𝑀 (𝐴) (7)
and 𝐵𝑖 represent the classification results and bounding box predictions
where 𝑋 represents the input tensor with a 90◦ counterclockwise at the 𝑖𝑡ℎ stage. It increases the interaction between each stage and
rotation. improves the accuracy of the bounding box.


Fig. 7. Overview of the ETAM framework with ETAM encoder for small object detection.

Fig. 8. Structure of Multi-stage ROI Pooling module.

Fig. 9. Structure of RCNN head module.

3.5.2. RCNN head module

To generate a more precise bounding box, we design an RCNN head module to classify the proposals, as illustrated in Fig. 9. The proposals are downsampled through four 3 × 3 convolutional layers to obtain two branch results: the bounding box edges and the classification results. The classification results are further passed to a Softmax layer for category likelihood calculation. Finally, the final classification result and detection border are obtained, as shown in Fig. 9. Note that, in the figure, $P$ indicates a proposal from the Multi-stage ROI Pooling, and $Conv3$ denotes the 3 × 3 convolutional layer.
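A rough PyTorch sketch of this head is shown below; the number of channels, the use of stride-2 convolutions for the downsampling, and the global pooling before the two branches are assumptions made for illustration.

import torch
import torch.nn as nn

class RCNNHead(nn.Module):
    """Sketch of the RCNN head in Fig. 9 (channel sizes and strides are assumptions)."""

    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        # Four 3x3 convolutions progressively downsample the pooled proposal features.
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(in_channels, in_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)
        self.cls_branch = nn.Linear(in_channels, num_classes + 1)  # +1 for the background class
        self.box_branch = nn.Linear(in_channels, 4)                # bounding box edges

    def forward(self, p):
        # p: pooled proposal features, e.g. shape (num_rois, 256, 14, 14).
        x = self.convs(p).flatten(2).mean(-1)                  # collapse the remaining spatial extent
        cls_prob = torch.softmax(self.cls_branch(x), dim=-1)   # category likelihoods via Softmax
        return cls_prob, self.box_branch(x)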
4. Experiments

To assess ETAM and related work, we execute comprehensive experiments using the datasets Pascal VOC (Everingham et al., 2010), MS COCO (Chen et al., 2015), VisDrone2019 (Zhu et al., 2018), and LIDC-IDRI (Armato III et al., 2011). The experimental results testify to the extraordinary potency of our ETAM encoder on all four datasets.

4.1. Datasets and experimental setup

4.1.1. Datasets and data preparation

PASCAL VOC is available in two versions. PASCAL VOC 2007 has 2501 images, 2510 images, and 4952 images for training, validation, and testing. The training, validation, and testing subsets of PASCAL VOC 2012 contain 11,297 images, 5828 images, and 5138 images, including 20 foreground object classes. Based on previous works (Liu et al., 2016; Redmon et al., 2016; Redmon & Farhadi, 2017; Ren et al., 2015) using these datasets, the model training in this paper used a combination of the two PASCAL VOC datasets. We used the training-validation sets of 2007 and 2012 for model training, with 22,136 images. The model is validated with the 2007 and 2012 validation subsets, respectively. The 2007 test subset is used for testing.

MS COCO has 118,287 training images, 5000 validation images, and 40,670 testing images. We use the most frequent 80 classes in our experiments.

VisDrone2019 is a dataset with 10 classes dedicated to small object image detection. It contains 6471 images for training and 548 for validation.

LIDC-IDRI contains 1018 cases, 243,958 lung CT slices, and 7371 nodule annotations. Since there is no medical concept of a non-nodule, the non-nodule portion was excluded. Further, we manually screened out images that were unclear or showed significant discrepancies from the physicians' labeling. The final dataset consists of 3701 solid nodules smaller than 3 mm and solid nodules between 3 and 30 mm. The remaining dataset was labeled with LabelImg. Finally, a dataset of 6738 images was obtained. Among them, 4781 are used for training, 957 for validation, and 1000 for testing.

4.1.2. Experimental setup

We implement ETAM based on PyTorch. The ETAM framework is implemented using MMDetection (Chen et al., 2019), an open-source object-detection framework developed by Open MMLab. The experiments are executed on a personal computer, which has an Intel(R) i5-12400F CPU and an NVIDIA GeForce RTX 3060 GPU with 12 GB of memory. The computer has 16 GB of RAM and runs 64-bit Ubuntu.

The base learning rate is set to 10e−5 with the CosineAnnealing schedule, with the learning rate decreasing over the first 80k iterations. The batch size is set to 8 for MS-COCO and VisDrone2019 and 16 for the other datasets. For data augmentation, DETR's data augmentation method is applied to train ETAM. All of the test results are obtained with the testing dataset. Based on previous object detection works (Cai & Vasconcelos, 2018; He et al., 2017; Ren et al., 2015), all the results are evaluated following the MS-COCO evaluation protocol (Chen et al., 2015), including 𝑚AP, AP50, AP75, and the size-wise AP𝑆, AP𝑀, and AP𝐿 scores. In addition, the Pascal VOC dataset follows the MS-COCO metric regarding the size of objects, in which small objects are smaller than 32 × 32 pixels and large objects are larger than 96 × 96 pixels.
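The optimizer and schedule part of this setup can be reproduced with standard PyTorch components, for example as below. The optimizer type is an assumption (the paper only fixes the base learning rate, the cosine schedule, and the iteration budget), and the real experiments are driven by MMDetection configuration files.

import torch

model = torch.nn.Conv2d(3, 96, 3)  # placeholder module standing in for the ETAM detector
# Optimizer type is an assumption; the paper only specifies the base learning rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=10e-5)
# Cosine-annealed learning rate decayed over the first 80k iterations;
# scheduler.step() is called once per training iteration.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80_000)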

4.2. Ablation studies and analysis

We execute widespread ablation studies on PASCAL VOC as an example and survey the different modules of our ETAM encoder.

4.2.1. Effect of ETAM encoder

As part of our experiments, we train our ETAM network using different backbones to assess the importance of the ETAM encoder. Among them, ResNet (He et al., 2016) is the most practical and most commonly used CNN encoder. Moreover, Swin Transformer (Liu et al., 2021)


Fig. 10. Qualitative results comparison between baseline and our ETAM framework. Ground-truth bounding boxes are red and predicted bounding boxes are green.

is the latest proposed and most widely used Transformer encoder for object detection. The different settings of the ablation experiments are shown in Table 1.

Table 1
Backbone ablation study on voc2007 val-set.
Backbone 𝑚AP AP50 AP75 AP𝑆 AP𝑀 AP𝐿
ResNet101 85.1 93.4 89.1 24.2 81.5 96.2
Swin transformer 87.8 93.6 89.0 24.5 81.9 96.4
ETAM 90.2 94.1 90.5 31.1 86.4 98.0

Table 1 illustrates that our ETAM encoder improves object detection, especially small object detection. Compared with the model with ResNet101 as the backbone, the 𝑚AP results improve by 5.1%. Meanwhile, the 𝑚AP results improve by 2.4% compared to the Swin Transformer backbone-based model. In particular, AP𝑆 is enhanced by 6.6%. As shown by these results, ETAM brings a tremendous improvement in detecting small objects.

Fig. 10 compares the baseline Swin Transformer (Swin-T) encoder and our ETAM encoder qualitatively, where the baseline fails to detect small objects while the ETAM encoder succeeds. From Figs. 10(a), (b), and (c), we can see that the ETAM backbone effectively addresses the problem of small objects being missed by the baseline backbone. Moreover, Figs. 10(b) and (d) show that both models can detect all the small objects. But the confidence score of the baseline detecting the small object as an airplane only reaches about 50%, whereas our model gives a confidence score of 90%. It means that ETAM can set a higher prediction threshold, and it will be more accurate when applied in practice.

4.2.2. Effect of each module

To test the importance of each ETAM component, we report the results with or without the following modules. T-B: using the ETAM-N branch, which forms a two-branch structure. Table 2 shows the different settings of the ablation experiments, where the Swin Transformer (Swin-T) encoder is used as the baseline for comparison.

Table 2
Module ablation study on voc2012 val-set.
Encoder MG PAM CAM HAM WAM T-B 𝑚AP AP𝑆 AP𝑀 AP𝐿
Swin-T – – – – – – 86.8 24.5 81.9 96.4
Swin-T ✓ – – – – – 84.9 28.8 73.2 93.1
Swin-T – ✓ ✓ – – – 88.3 26.5 83.3 96.5
Swin-T – ✓ ✓ ✓ ✓ – 89.7 25.9 85.7 97.6
Swin-T ✓ ✓ ✓ ✓ ✓ – 87.3 31.2 77.5 92.9
Swin-T ✓ ✓ ✓ ✓ ✓ ✓ 90.2 31.1 86.4 98.0

Fig. 11. Double attention (DA) module.

As demonstrated in Table 2, the MG and the attention modules remarkably improve the detection capability. Compared with the baseline, utilizing PAM and CAM yields an 𝑚AP of 88.3%, a 1.5% improvement. Integrating the four attention modules (PAM, CAM, HAM, and WAM), i.e., the QA module, outperforms the baseline by 2.9%. Although the MG module does not show better overall performance due to its loss of some information about large objects, it brings a significant improvement in small object detection (with 28.8% in AP𝑆). Additionally, with the two-branch ensemble network, ETAM improves overall performance and excels in both small and larger object detection.

4.2.3. Visualization of attention module

To gain a deeper comprehension of the attention module, we visualize the attention masks from ETAM (Swin-T) with a Dual Attention (DA) module and ETAM (Swin-T) with our Quadruple Attention (QA) module, respectively. The DA module we compare against combines PAM and CAM with the structure shown in Fig. 11. The visualization is done by superimposing the color-coded gradients of the input images, together with the network predictions and the corresponding confidence scores, as shown in Fig. 12; the more focused the attention visualization is, the better.

The visualizations support our understanding of the intrinsic capability of QA to capture a more noticeable response to specific semantics. Fig. 12 shows that the QA module can focus on the main details of objects compared to the DA module. Comparing the attention visualization results in stage 2, we can see that the attention generated by QA is more focused on the object area than that of DA, especially for small targets. Additionally, with QA, the confidence score is significantly improved.

In addition, to better understand the role of the Magnifying Glass (MG) module, we have also visualized the feature space before Swin-T, as shown in Fig. 12. The dark (red) areas represent regions of representative feature concentration. Taking the airplane image as an example, it can be seen that the MG module can identify three dark regions, which are the approximate locations of the three objects.


Fig. 12. Visualization results of attention modules on voc2007 val-set.

Table 3
Comparison with the state-of-the-art models on voc2007 test-dev.
Method 𝑚AP AP𝑆 AP𝑀 AP𝐿
One-stage methods
YOLO-v1 (GoogleNet) (Redmon et al., 2016) 63.4 – – –
YOLO-v1 (VGG-16) (Redmon et al., 2016) 66.4 – – –
YOLO-v2 (DarkNet-19) (Redmon & Farhadi, 2017) 76.8 – – –
SSD (Lim et al., 2021) 77.5 20.7 62.0 83.3
Two-stage methods
Faster RCNN (VGG-16) (Ren et al., 2015) 73.2 – – –
Faster RCNN (ResNet) (Huang et al., 2017) 76.4 – – –
Cascade RCNN (ResNet) (Cai & Vasconcelos, 2018) 88.1 22.7 77.8 94.6
Small object methods
FA-SSD (Lim et al., 2021) 78.1 28.5 61.0 83.6
SR(ResNet101) (Noh et al., 2019) 80.6 11.1 48.9 82.7
ETAM 91.7 30.8 86.7 96.6

Then, this image can be magnified according to the coordinates of these dark regions, which can significantly reduce the number of redundant object features.

4.3. Comparison with related work

In this part, we contrast ETAM with the state-of-the-art models on PASCAL VOC, MS-COCO, VisDrone2019, and LIDC-IDRI.

4.3.1. Results on PASCAL VOC

We compare ETAM with other models using the voc2007 testing subset, as shown in Table 3. ETAM exceeds the state-of-the-art models to a great extent. Table 3 gives the detection accuracy on the voc2007 testing subset for small, medium, and large targets. The results in Table 3 demonstrate that ETAM substantially outperforms the other two-stage object detectors. Furthermore, our ETAM obtains the largest AP𝑆, AP𝑀, and AP𝐿. It shows that ETAM improves small object detection, enhancing it by 2.3%. ETAM also performs well for medium and large object detection. What is more, the FA-SSD (Lim et al., 2021) and SR (ResNet101) (Noh et al., 2019) results show that ETAM outpaces the latest small object models.

4.3.2. Results on MS COCO

The experimental results on MS-COCO are shown in Table 4, which shows that ETAM performs well in terms of AP𝑆. It is worth noting that ETAM obtains the highest AP𝑆. The reason is that, with MG, QA, and the two-branch structure, ETAM enhances the performance on detecting small objects. In addition, ETAM achieves an 𝑚AP of 45.2%, which exceeds the other models. Compared with the models that use convolutional neural networks as the backbone, the performance gain of ETAM is more evident on the MS-COCO dataset. It is because training with a vast dataset can make the Transformer perform much better.

4.3.3. Results on VisDrone2019

Table 5 presents the capability of ETAM and typical small object methods, i.e., Faster RCNN (Ren et al., 2015), RetinaNet (Lin, Goyal et al., 2017), and FCOS (Tian et al., 2019). We observe that ETAM vastly outperforms most of the latest detectors. In particular, ETAM also obtains the largest AP50 and AP75. It indicates that ETAM has the highest accuracy with accurate detections and the highest recall with few missed detections. But we notice that ETAM is only comparable to ClusDet (ResNeXt101) (Yang et al., 2019) and has no apparent advantage. The reason is that the VisDrone2019 training subset is tiny, while the advantages of the Transformer are more evident on large datasets.

4.3.4. Results on LIDC-IDRI

Lung nodule detection is a typical small object detection problem. We use the lung nodules in LIDC-IDRI as the small objects, while the other parts of the CT images are treated as the background. From Table 6, we can see that ETAM improves the performance of lung nodule detection, similar to the previous sections.

Different from the 𝑚AP used as the performance metric for the preceding datasets, the commonly used performance metrics for the LIDC-IDRI dataset are accuracy ($ACC$), sensitivity ($SEN$), and specificity ($SPE$), which are expressed with the following Eqs. (8), (9), and (10), respectively:

$ACC = (TP + TN)/(TP + FP + TN + FN)$  (8)

$SEN = TP/(TP + FN)$  (9)

$SPE = TN/(FP + TN)$  (10)

where $TP$, $TN$, $FP$, and $FN$ represent the true positives, true negatives, false positives, and false negatives, all of which are indicators produced by the binary classification task.
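These three metrics follow directly from the confusion-matrix counts, as in the small Python helper below (the counts in the usage comment are synthetic and only illustrate the call):

def lidc_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity as defined in Eqs. (8)-(10)."""
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # Eq. (8)
        "SEN": tp / (tp + fn),                   # Eq. (9)
        "SPE": tn / (fp + tn),                   # Eq. (10)
    }

# lidc_metrics(tp=90, tn=80, fp=10, fn=20) -> {'ACC': 0.85, 'SEN': 0.818..., 'SPE': 0.888...}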


Table 4
Object detection results on MS-COCO test-dev.
Method 𝑚AP AP50 AP75 AP𝑆 AP𝑀 AP𝐿
One-stage methods
YOLO-v2 (DarkNet-19) (Redmon & Farhadi, 2017) 21.6 44.0 19.2 5.0 22.4 35.5
YOLO-v3 (DarkNet-53) (Redmon & Farhadi, 2018) 33.0 57.9 34.4 18.3 35.4 41.9
SSD513 (ResNet-101-SSD) (Fu et al., 2017; Liu et al., 2016) 31.2 50.4 33.3 10.2 34.5 49.8
RetinaNet (ResNet-101-FPN) (Lin, Goyal et al., 2017) 39.1 59.1 42.3 21.8 44.2 51.2
FCOS (ResNet-101-FPN) (Tian et al., 2019) 41.5 60.7 45.0 24.4 44.8 51.6
CenterNet (DLA-34) (Zhou et al., 2019) 41.6 60.3 45.1 21.5 43.9 56.0
CenterNet (Hourglass-104) (Zhou et al., 2019) 45.1 63.9 49.3 26.6 47.1 57.7
Two-stage methods
Faster RCNN (ResNet-101-FPN) (Lin, Dollár et al., 2017) 36.2 59.1 39.0 18.2 39.0 48.2
Faster RCNN (ResNet-v2) (Huang et al., 2017) 34.7 55.5 36.7 13.5 38.1 52.0
Faster RCNN (ResNet-v2-TDM) (Shrivastava, Sukthankar, Malik, & Gupta, 2016) 36.8 57.7 39.2 16.2 39.8 52.1
Mask RCNN (ResNet-101-FPN) (He et al., 2017) 38.2 60.3 41.7 20.1 41.1 50.2
Cascade RCNN (ResNet-101-FPN) (Cai & Vasconcelos, 2018) 41.5 60.7 45.0 24.4 44.8 51.6
Small object methods
RetinaNet with S-𝛼 (Gong et al., 2021) 14.9 – – 28.3 – –
SR (ResNet101) (Noh et al., 2019) 34.2 – – 16.2 35.7 48.1
SNIP (Singh & Davis, 2018) 37.8 – – 21.4 40.4 50.1
ETAM 45.2 71.6 47.6 28.7 49.4 59.8

Table 5
Small object detection results on VisDrone2019 val-set.
Method 𝑚AP AP50 AP75
One-stage methods
RetinaNet (ResNeXt101) (Lin, Goyal et al., 2017) 14.40 24.10 15.50
RetinaNet (ResNet101) (Yang et al., 2022) 26.21 44.90 27.10
FCOS (Tian et al., 2019) 14.10 25.50 –
Two-stage methods
Faster R-CNN (ResNet101) (Xu et al., 2022) 22.30 38.00 –
Faster RCNN (ResNeXt101) (Ren et al., 2015) 28.70 51.80 27.70
Small object methods
QueryDet (Yang et al., 2022) 28.32 48.14 28.75
Cascade R-CNN+NWD (Wang et al., 2021) – 40.30 –
TOOD+SF+SAHI+FI+PO (Akyon et al., 2022) – 43.50 –
ClusDet (ResNeXt101) (Yang et al., 2019) 32.40 56.20 31.60
DetectoRS+RFLA (Xu et al., 2022) 27.40 45.30 –
ETAM 33.00 60.31 37.25

Table 6
Lung nodule detection results on the LIDC-IDRI.
Method ACC (%) SEN (%) SPE (%) AUC
Multi-scale CNN (Lyu & Ling, 2018) 84.10 – – –
Fuse-TSD (Xie, Zhang, Xia, Fulham, and Zhang, 2018) 89.53 84.19 92.02 0.9665
MV-KBC (Xie, Xia et al., 2018) 91.60 86.52 94.00 0.9570
QIF 3D-CNN (Causey et al., 2018) 93.20 87.90 98.50 0.9710
RBF 3D-CNN (Polat & Danaei Mehr, 2019) 91.81 88.53 94.23 –
DL-CAD (Zheng et al., 2020) – 90.00 – –
Multi-scale and multi-task 3D-CNN (Zhao, Zhang, Li, & Niu, 2020) 93.92 92.60 96.25 0.9790
3D-CAD (Kuo, Barman, Hsieh, & Hsu, 2021) – 93.13 – –
ETAM 96.14 94.58 97.10 0.9896

Table 6 shows these metrics for all the compared models. From the table, we know that ETAM has the highest detection accuracy.

In Fig. 13, the test results are visualized, where CT images with pulmonary nodules are adopted from the testing subset for detection. In the figure, the right corner of each image is a magnified view of the lung nodule region. The accuracy below each image represents the probability that the object is detected as a nodule by ETAM. The red boxes are marked by radiologists with annotations, and the green boxes and green text are the nodule detection results and their confidence scores.


Fig. 13. Comparison of pulmonary nodules test results and actual locations. Ground-truth bounding boxes are red and predicted bounding boxes are green.

From Fig. 13, we can see that the detected nodules are approximately the same as the actual locations, and each bounding box's size is close to its actual size.

5. Conclusion

In this paper, we propose ETAM to address the challenges in small object detection. In ETAM, a Magnifying Glass (MG) module is designed to solve the problem of limited small object features and to eliminate redundant background information. A Quadruple Attention (QA) module is devised to enrich the small object features. We adopt Ensemble Learning to create a two-branch structure to accommodate both small objects and larger objects. Ablation studies show that the MG and QA modules can effectively focus on the small object area and improve small object detection. With Ensemble Learning, ETAM can effectively balance the ability to detect small, medium, and large objects. Experiments show that our ETAM encoder consistently achieves significant improvement for small object detection on the four datasets PASCAL VOC, MS-COCO, VisDrone2019, and LIDC-IDRI. In future work, we will probe how to reduce the computational complexity and enhance the robustness of our network, which are important issues for real-time applications. We also plan to apply our network to remote sensing images like DOTA (Xia et al., 2018) and DIOR (Li, Wan, Cheng, Meng, & Han, 2020).

CRediT authorship contribution statement

Jiangnan Zhang: Conceptualization, Methodology, Software, Investigation, Formal analysis, Visualization, Writing – original draft. Kewen Xia: Conceptualization, Validation, Funding acquisition, Resources, Supervision, Writing – review & editing, Project administration. Zhiyi Huang: Conceptualization, Supervision, Writing – review & editing, Project administration. Sijie Wang: Data curation, Visualization, Investigation. Romoke Grace Akindele: Visualization, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 42075129), the Hebei Province Natural Science Foundation (No. E2021202179), and Key Research and Development Projects from Hebei Province (No. 19210404D, No. 20351802D, No. 21351803D).

References

Akyon, F. C., Altinuc, S. O., & Temizel, A. (2022). Slicing aided hyper inference and fine-tuning for small object detection. arXiv preprint arXiv:2202.06934.
Armato III, S. G., McLennan, G., Bidaut, L., McNitt-Gray, M. F., Meyer, C. R., Reeves, A. P., et al. (2011). The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical Physics, 38(2), 915–931.
Cai, Z., & Vasconcelos, N. (2018). Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6154–6162).
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European conference on computer vision (pp. 213–229). Springer.
Causey, J. L., Zhang, J., Ma, S., Jiang, B., Qualls, J. A., Politte, D. G., et al. (2018). Highly accurate model for prediction of lung nodule malignancy with CT scans. Scientific Reports, 8(1), 1–12.
Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., et al. (2015). Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325. https://arxiv.org/abs/1504.00325.
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., et al. (2019). MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. https://arxiv.org/abs/1906.07155.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.
Fu, C.-Y., Liu, W., Ranga, A., Tyagi, A., & Berg, A. C. (2017). Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.
Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., et al. (2019). Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3146–3154).
Gong, Y., Yu, X., Ding, Y., Peng, X., Zhao, J., & Han, Z. (2021). Effective fusion factor in FPN for tiny object detection. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 1160–1168).
Hassaballah, M., Hameed, M. A., Awad, A. I., & Muhammad, K. (2021). A novel image steganography method for industrial internet of things security. IEEE Transactions on Industrial Informatics, 17(11), 7743–7751.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
Houssein, E. H., Abdelkareem, D. A., Emam, M. M., Hameed, M. A., & Younan, M. (2022). An efficient image segmentation method for skin cancer imaging using improved golden jackal optimization algorithm. Computers in Biology and Medicine, 149, Article 106075.
Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132–7141).
Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., et al. (2017). Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7310–7311).
Kuo, C.-F. J., Barman, J., Hsieh, C. W., & Hsu, H.-H. (2021). Fast fully automatic detection, classification and 3D reconstruction of pulmonary nodules in CT images by local image feature analysis. Biomedical Signal Processing and Control, 68, Article 102790.
Li, Y., Chen, Y., Wang, N., & Zhang, Z. (2019). Scale-aware trident networks for object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6054–6063).
Li, K., Wan, G., Cheng, G., Meng, L., & Han, J. (2020). Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing, 159, 296–307.

Lim, J.-S., Astrid, M., Yoon, H.-J., & Lee, S.-I. (2021). Small object detection using context and attention. In 2021 international conference on artificial intelligence in information and communication (pp. 181–186). IEEE.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2117–2125).
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980–2988).
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., et al. (2016). Ssd: Single shot multibox detector. In European conference on computer vision (pp. 21–37). Springer.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., et al. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012–10022).
Lyu, J., & Ling, S. H. (2018). Using multi-level convolutional neural network for classification of lung nodules on CT images. In 2018 40th annual international conference of the IEEE engineering in medicine and biology society (pp. 686–689). IEEE.
Ma, X., Zhang, S., Wang, Y., Li, R., Chen, X., & Yu, D. (2023). ASCAM-Former: Blind image quality assessment based on adaptive spatial & channel attention merging transformer and image to patch weights sharing. Expert Systems with Applications, 215, Article 119268.
Misra, D., Nalamada, T., Arasanipalai, A. U., & Hou, Q. (2021). Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 3139–3148).
Noh, J., Bae, W., Lee, W., Seo, J., & Kim, G. (2019). Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9725–9734).
Polat, H., & Danaei Mehr, H. (2019). Classification of pulmonary CT images by using hybrid 3D-deep convolutional neural network architecture. Applied Sciences, 9(5), 940.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779–788).
Redmon, J., & Farhadi, A. (2017). YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7263–7271).
Redmon, J., & Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767. https://arxiv.org/abs/1804.02767.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 91–99.
Sheng, H., Cai, S., Liu, Y., Deng, B., Huang, J., Hua, X.-S., et al. (2021). Improving 3d object detection with channel-wise transformer. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 2743–2752).
Shrivastava, A., Sukthankar, R., Malik, J., & Gupta, A. (2016). Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851.
Singh, B., & Davis, L. S. (2018). An analysis of scale invariance in object detection – SNIP. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3578–3587).
Tian, Z., Shen, C., Chen, H., & He, T. (2019). Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9627–9636).
Üzen, H., Türkoğlu, M., Yanikoglu, B., & Hanbay, D. (2022). Swin-MFINet: Swin transformer based multi-feature integration network for detection of pixel-level surface defects. Expert Systems with Applications, 209, Article 118269.
Wan, H., Gao, L., Yuan, Z., Qu, H., Sun, Q., Cheng, H., et al. (2022). A novel transformer model for surface damage detection and cognition of concrete bridges. Expert Systems with Applications, Article 119019.
Wang, J., Xu, C., Yang, W., & Yu, L. (2021). A normalized Gaussian wasserstein distance for tiny object detection. arXiv preprint arXiv:2110.13389.
Wang, X., Zhao, Q., Jiang, P., Zheng, Y., Yuan, L., & Yuan, P. (2022). LDS-YOLO: A lightweight small object detection method for dead trees from shelter forest. Computers and Electronics in Agriculture, 198, Article 107035.
Xia, G.-S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., et al. (2018). DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3974–3983).
Xie, Y., Xia, Y., Zhang, J., Song, Y., Feng, D., Fulham, M., et al. (2018). Knowledge-based collaborative deep learning for benign-malignant lung nodule classification on chest CT. IEEE Transactions on Medical Imaging, 38(4), 991–1004.
Xie, Y., Zhang, J., Xia, Y., Fulham, M., & Zhang, Y. (2018). Fusing texture, shape and deep model-learned information at decision level for automated classification of lung nodules on chest CT. Information Fusion, 42, 102–110.
Xu, C., Wang, J., Yang, W., Yu, H., Yu, L., & Xia, G.-S. (2022). RFLA: Gaussian receptive field based label assignment for tiny object detection. In European conference on computer vision (pp. 526–543). Springer.
Yang, F., Fan, H., Chu, P., Blasch, E., & Ling, H. (2019). Clustered object detection in aerial images. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 8311–8320).
Yang, C., Huang, Z., & Wang, N. (2022). QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13668–13677).
Yang, H., & Yang, D. (2023). Cswin-PNet: A CNN-Swin transformer combined pyramid network for breast lesion segmentation in ultrasound images. Expert Systems with Applications, 213, Article 119024.
Yu, J., Li, J., Yu, Z., & Huang, Q. (2019). Multimodal transformer with multi-view visual representation for image captioning. IEEE Transactions on Circuits and Systems for Video Technology, 30(12), 4467–4480.
Zeng, N., Wu, P., Wang, Z., Li, H., Liu, W., & Liu, X. (2022). A small-sized object detection oriented multi-scale feature fusion approach with application to defect detection. IEEE Transactions on Instrumentation and Measurement, 71, 1–14.
Zhao, J., Zhang, C., Li, D., & Niu, J. (2020). Combining multi-scale feature fusion with multi-attribute grading, a CNN model for benign and malignant classification of pulmonary nodules. Journal of Digital Imaging, 33(4), 869–878.
Zheng, S., Cui, X., Vonder, M., Veldhuis, R. N., Ye, Z., Vliegenthart, R., et al. (2020). Deep learning-based pulmonary nodule detection: Effect of slab thickness in maximum intensity projections at the nodule candidate detection stage. Computer Methods and Programs in Biomedicine, 196, Article 105620.
Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. arXiv preprint arXiv:1904.07850. https://arxiv.org/abs/1904.07850.
Zhu, P., Wen, L., Du, D., Bian, X., Ling, H., Hu, Q., et al. (2018). Visdrone-det2018: The vision meets drone object detection in image challenge results. In Proceedings of the European conference on computer vision (ECCV) workshops.
