Combining Transformer and CNN For Object Detection in UAV Imagery

Available online at www.sciencedirect.com
ScienceDirect
ICT Express xxx (xxxx) xxx
www.elsevier.com/locate/icte

Willy Fitra Hendria, Quang Thinh Phan, Fikriansyah Adzaka, Cheol Jeong ∗
Department of Convergence Engineering for Intelligent Drone, Sejong University, Seoul, South Korea
Received 17 September 2021; received in revised form 29 November 2021; accepted 17 December 2021
Available online xxxx
Abstract
Combining multiple models is a well-known technique to improve predictive performance in challenging tasks such as object detection
in UAV imagery. In this paper, we propose fusion of transformer-based and convolutional neural network-based (CNN) models with two
approaches. First, we ensemble Swin Transformer and DetectoRS with ResNet backbone, and conduct performance comparison on four
typical methods for combining predictions of multiple object detection models. Second, we design a hybrid architecture by combining Swin
Transformer backbone with a neck of DetectoRS. We show that the fusion of the transformer and the CNN-based models performs better
compared to the respective baseline model.
© 2021 The Author(s). Published by Elsevier B.V. on behalf of The Korean Institute of Communications and Information Sciences. This is an open
access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Keywords: Convolutional neural network; Object detection; Transformer; UAV imagery
∗ Corresponding author.
E-mail addresses: willyfitrahendria@sju.ac.kr (W.F. Hendria), thinhphan@sju.ac.kr (Q.T. Phan), fadzaka@sju.ac.kr (F. Adzaka), cheol.jeong@ieee.org (C. Jeong).
Peer review under responsibility of The Korean Institute of Communications and Information Sciences (KICS).
https://doi.org/10.1016/j.icte.2021.12.006
Please cite this article as: W.F. Hendria, Q.T. Phan, F. Adzaka et al., Combining transformer and CNN for object detection in UAV imagery, ICT Express (2022), https://doi.org/10.1016/j.icte.2021.12.006.

1. Introduction

Detecting objects in unmanned aerial vehicle (UAV) imagery is often difficult because the objects usually vary widely in scale, orientation, and density. After the convolutional neural network (CNN) [1] was introduced for the object detection task, many researchers developed CNN-based algorithms to detect objects in UAV imagery [2–4]. Recently, the transformer [5], which has been widely used for many natural language processing (NLP) tasks, has also attracted considerable research interest for object detection [6–8].

One main difference between a CNN-based model and a transformer-based model is the size of the receptive field: the latter is better at capturing long-distance pixel relations due to the self-attention mechanism [9]. However, the transformer does not have a good mechanism to capture spatial information inside each patch, which means it can ignore important local spatial patterns, such as texture. This is not the case with a CNN, which recognizes objects based on texture rather than shape [10]. To offset the weaknesses while exploiting the strengths of each model, one popular approach is to combine multiple machine learning models, i.e., ensemble and hybrid techniques [11], to improve the overall predictive performance.

In ensemble techniques, multiple models are trained, and the prediction results are combined by an algorithm. Hybrid techniques, in contrast, integrate multiple different machine learning approaches wholly or partially to build an improved standalone model. Previous works on the VisDrone-DET Challenge [12–14] showed that ensemble and hybrid techniques could improve the performance of object detection in UAV imagery. For the ensemble technique in particular, the diversity of models is a key factor that affects the performance of ensembles [15]. Using the CNN-based and the transformer-based models, which have different properties, can be a good combination for applying ensemble learning. Likewise, integrating the state-of-the-art techniques of the two models can be a good combination for building a hybrid model.

In this paper, to investigate the effect of combining transformer and CNN models, we select one of the state-of-the-art transformer-based models, Swin Transformer [16], and one of the state-of-the-art CNN-based models, DetectoRS [17]. We then fuse both models with two approaches, i.e., ensemble and hybrid techniques. We evaluate our experiments on the test-dev set of VisDrone-DET2021. In summary, our contributions are as follows. (1) We propose an ensemble model of Swin Transformer and DetectoRS to perform object detection in UAV imagery by exploiting the diversity of the transformer-based and the CNN-based models. (2) We design a hybrid architecture by combining the Swin Transformer backbone with a neck of DetectoRS, i.e., the recursive feature pyramid (RFP), to take advantage of both algorithms. (3) We conduct a performance comparison of our proposed models to provide a reference for future research. To the best of our knowledge, this is the first time that the fusion of these two state-of-the-art models is explored for object detection in UAV imagery. The experimental results on UAV imagery show that our proposed models outperform the baseline models which use either Swin Transformer or DetectoRS, and our ensemble model also outperforms the previous winning team of the VisDrone-DET2020 Challenge, i.e., DroneEye2020.

Fig. 1. The block diagram of our ensemble experiment.

2. Related work

2.1. Ensemble learning for object detection

Ensemble learning combines predictions produced by multiple models that run independently of each other. In object detection for UAV-specific data, SyNet [18] combines a multi-stage detector and a single-stage detector, and makes a prediction by combining the individual predictions of each algorithm through a fusion stage.

In this paper, we ensemble two state-of-the-art transformer-based and CNN-based models, Swin Transformer and DetectoRS, using four different methods which are typically used for combining predictions of multiple object detection models, i.e., non-maximum suppression (NMS) [19], soft-NMS [20], non-maximum weighted (NMW) [21], and weighted boxes fusion (WBF) [22].

2.2. Hybrid method for object detection

Different from ensemble learning, a hybrid method combines several completely different models to build a new standalone model. In object detection for UAV-specific data, FSHNN [23] combines unsupervised Spike Time-Dependent Plasticity (STDP) learning with backpropagation (STBP) learning methods and also uses Monte Carlo Dropout to get an estimate of the uncertainty error.

In this paper, we design a hybrid architecture based on two state-of-the-art transformer-based and CNN-based models: the Swin Transformer backbone with a neck of DetectoRS, i.e., RFP. We term this architecture Swin-RFP.

3. Method

3.1. Ensemble model

Two variants of Swin Transformer, i.e., Swin-B and Swin-L, and DetectoRS with a ResNet-50 [24] backbone are adopted for our ensemble model experiments. Swin Transformer uses the self-attention technique within local windows, whose representation is computed with shifted windows. In addition, Swin Transformer builds hierarchical feature maps by successively merging small patches as the network depth increases. On the other hand, inspired by the mechanism of looking and thinking twice, DetectoRS uses the RFP and switchable atrous convolution (SAC). The RFP adds feedback connections that bring the output of the feature pyramid network (FPN) back to each stage of the backbone. The SAC convolves the features with different atrous rates. To connect the output of the RFP to the backbone, DetectoRS uses atrous spatial pyramid pooling (ASPP) [25], which consists of four parallel branches of convolutional layers.

We train each single model using identical training data, and then ensemble the prediction results of the trained models with four different ensemble combinations of the three models, as shown in Fig. 1. Extensive experiments are conducted to evaluate the performance of our ensemble model by applying four typical methods for combining predictions of multiple object detection models, i.e., the NMS, the Soft-NMS, the NMW, and the WBF. The hyperparameters of each method are selected carefully through experiments. The architecture hyperparameters of the Swin Transformer variants used for this experiment are shown in Table 1. For the ASPP in DetectoRS, we used the following configuration: kernel size = [1, 3, 3, 1], atrous rate = [1, 3, 6, 1], and padding = [0, 3, 6, 0].

Table 1
Hyperparameters of Swin-B and Swin-L variants. The symbol C denotes the channel number of the hidden layers in the first stage and W denotes the window size. The number of layers and the number of attention heads are given for four stages of the Swin Transformer architecture.
Model    C    W   # layers      # attention heads
Swin-B   128  12  2, 2, 18, 2   4, 8, 16, 32
Swin-L   192  12  2, 2, 18, 2   6, 12, 24, 48

3.2. Hybrid model

Our proposed hybrid model consists of the Swin Transformer backbone with a neck of DetectoRS, i.e., RFP, as illustrated in Fig. 2. The RFP was proposed as an improvement of the FPN neck, which can look recursively at the images to generate more powerful representations. The Swin Transformer backbone has been shown to significantly outperform existing backbone models, including ResNet-50, by extracting the powerful representation of the transformer hierarchically. Integrating the best mechanism of each structure into a hybrid model can result in better performance. Concretely, we add feedback connections from the top-down RFP neck to the bottom-up Swin Transformer backbone.
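
The feedback idea can be illustrated with a deliberately tiny sketch. It is not the Swin-RFP implementation: features are plain Python lists with one number per channel, and `project`, `backbone`, `neck`, and `recursive_forward` are hypothetical stand-ins for the 1 × 1 convolution, the Swin Transformer stages, the RFP neck, and the recursive loop. Two details follow the paper's description, namely that no feedback enters the first stage and that the fed-back feature is concatenated along the channel dimension. Everything else, including the doubling used as the "stage computation", is an assumption made only so that the loop runs.

```python
from typing import List, Optional

Feature = List[float]  # toy stand-in for a feature map: one number per channel


def project(feedback: Feature, channels: int, weight: float = 0.5) -> Feature:
    """Stand-in for a 1x1 convolution: map a 1-channel feature to `channels`."""
    return [weight * feedback[0]] * channels


def backbone(image: Feature, feedback: Optional[List[Feature]],
             num_stages: int = 4) -> List[Feature]:
    """Bottom-up pass. Stages after the first also take the neck feedback."""
    outputs, x = [], image
    for stage in range(num_stages):
        x = [2.0 * v for v in x]  # hypothetical stage computation
        if feedback is not None and stage > 0:
            # Concatenate the stage output with the projected neck feature.
            x = x + project(feedback[stage], len(x))
        outputs.append(x)
    return outputs


def neck(features: List[Feature]) -> List[Feature]:
    """Top-down pass: each level adds the mean of its input to the level above."""
    pyramid: List[Feature] = [[0.0] for _ in features]
    carry = 0.0
    for level in reversed(range(len(features))):
        carry += sum(features[level]) / len(features[level])
        pyramid[level] = [carry]
    return pyramid


def recursive_forward(image: Feature, iterations: int = 2) -> List[Feature]:
    """Run backbone and neck `iterations` (>= 1) times, feeding the neck back."""
    feedback = None
    for _ in range(iterations):
        feedback = neck(backbone(image, feedback))
    return feedback


first_pass = recursive_forward([1.0], iterations=1)
second_pass = recursive_forward([1.0], iterations=2)
```

With `iterations=1` the loop reduces to an ordinary backbone plus top-down neck. With `iterations=2` the second backbone pass sees the neck output of the first pass, which is the "looking twice" behaviour the RFP is meant to provide.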

The output of Swin Transformer at each stage is computed based on the output of Swin Transformer at the previous stage and the output of the RFP neck at the corresponding stage. Following the original DetectoRS, there is no feedback connection from the neck to the first stage of the backbone. The input to the first stage of Swin Transformer is only the image patches obtained by partitioning the input image as required by Swin Transformer. The number of loop iterations through the RFP is a hyperparameter to be specified.

We modify the forward function of the Swin Transformer backbone to accept both the bottom-up features and the top-down features as the input. In Swin Transformer, the bottom-up features are first transformed by a Patch Merging module to merge the neighboring image patches. We convolve the RFP features with a 1 × 1 convolution layer to transform the channel dimension to the same dimension as the output of the Swin Transformer block. The final output of each stage is a channel-wise concatenation of the output of the Swin Transformer block and the convolved RFP feature, as illustrated in Fig. 3. Please refer to [16] for further details on Swin Transformer.

Fig. 2. Our proposed Swin-RFP architecture.

Fig. 3. Connection of Swin Transformer and RFP for each stage.

4. Experiments and results

For training our models, we use the MMDetection toolbox on Ubuntu 20.04 LTS with Python 3.7 and PyTorch 1.7. The VisDrone-DET2021 toolkit in a MATLAB environment is used to evaluate the performance of the models. Our models are trained on a computer with 2× NVIDIA GeForce RTX 3090 GPUs and a 3.70 GHz Intel® Core™ i9-10900K CPU.

4.1. Datasets

The dataset of VisDrone-DET2021 is the same dataset which was used in the previous years of the VisDrone-DET Challenge. The dataset is split into 6471, 548, and 1580 images for the train, val, and test-dev sets, respectively. The images are captured by different drone platforms across 14 different cities and under various conditions. Each bounding box is annotated with a label from ten object categories, i.e., Person, Pedestrian (standing or walking person), Car, Van, Bus, Truck, Motor, Bicycle, Awning-tricycle, and Tricycle.

4.2. Data augmentation

Offline augmentation. For training the three single models of Swin-B, Swin-L, and DetectoRS, we enlarge the dataset by applying this offline augmentation. We resize all 6471 images in the train set to (3200, 2100) scale and split each image into four (1600, 1050) sub-images. On each sub-image, we apply horizontal flip augmentation. Our new train set after applying this augmentation has 51,768 images in total (6471 × 4 sub-images, each kept together with its flipped copy). There are 48,016 images which have at least one bounding box, with 767,648 bounding boxes in total.

Online augmentation. Instead of the offline augmentation, we apply this online augmentation for the experiment of the hybrid model. Random horizontal flip with probability of 0.5 is applied during the training process. We then resize the images into multiple scales with an aspect ratio of around 16:9, i.e., (1920, 1080), (2220, 1305), (2520, 1474), (2820, 1642), (3120, 1755), (3420, 1924), and (3720, 2092).

Test-time augmentation. For all of the experiments in this paper, we use test-time augmentation (TTA) to apply multiscale augmentation to the test set at testing time. Different TTA configurations are used for the offline and the online augmentation models. The offline augmentation models use 7 different scales with 32:21 aspect ratio, i.e., (3000, 1969), (3200, 2100), (3400, 2231), (3600, 2362), (3800, 2494), (4000, 2625), and (4200, 2756). The online augmentation models use TTA with 5 different scales with an aspect ratio of around 16:9, i.e., (2520, 1474), (2820, 1642), (3120, 1755), (3420, 1924), and (3720, 2092).

4.3. Models

For all of the single models, we adopt the Cascade R-CNN head, and use the Soft-NMS algorithm to filter the bounding boxes. We set the maximum number of bounding boxes to 500 and the confidence score threshold to 0. Besides, we use models pretrained on ImageNet [26] and the MS COCO dataset [27] for Swin Transformer and DetectoRS, respectively. For the optimization algorithm, we use adaptive moment estimation (Adam) for Swin Transformer, and stochastic gradient descent (SGD) for DetectoRS.

Swin-B and Swin-L. For these two models, we use a model pretrained on the ImageNet dataset with (384, 384) resolution. The initial learning rate of Adam is set to 0.0001 with 0.05 weight decay. This learning rate is then reduced to 0.00001 and 0.000001 at the 8th epoch and the 11th epoch, respectively. The model with the highest average precision (AP) during the training epochs is selected.

DetectoRS. A model of DetectoRS with ResNet-50 and Cascade R-CNN pretrained on the MS COCO dataset is used for the training. For the RFP, we set the number of iterations to 2. Besides, we set the initial learning rate of SGD to 0.02.
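
The step schedules quoted in this subsection can be written down as a small helper. This is a sketch under stated assumptions rather than the training configuration actually used: `step_lr` is a hypothetical function, epochs are assumed to be counted from 1, a reduction "at the 8th epoch" is assumed to take effect from epoch 8 onward, and the 12-epoch horizon in the demo is an assumption, since the total number of epochs for these models is not stated.

```python
from typing import Sequence


def step_lr(base_lr: float, epoch: int, milestones: Sequence[int],
            gamma: float = 0.1) -> float:
    """Learning rate in effect during `epoch` (counted from 1).

    The rate is multiplied by `gamma` once for every milestone reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Swin-B / Swin-L setting from the text: 1e-4, reduced at epochs 8 and 11.
# The 12-epoch range is an assumption made for this demo only.
swin_schedule = [step_lr(1e-4, e, (8, 11)) for e in range(1, 13)]
for epoch, lr in enumerate(swin_schedule, start=1):
    print(f"epoch {epoch:2d}: lr = {lr:.0e}")
```

The same helper reproduces any schedule of this form once the base rate and milestones are swapped in, for example the SGD base rate of 0.02 used for DetectoRS.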

For DetectoRS, the SGD weight decay is set to 0.0001, and the learning rate is reduced to 0.002 and 0.0002 at the 8th epoch and the 11th epoch, respectively. Similarly, we also select the best model during the training epochs based on the AP metric.

Swin-RFP. The Swin-L variant, which is pretrained on ImageNet, is used for this hybrid model. The initial learning rate is set to 0.0001, then it is decayed by a factor of 0.1 at the 27th and the 33rd epoch. The reported result of this model is based on the model trained at the 50th epoch. We also use these settings for the baseline of this hybrid model, which is the original Swin Transformer with a feature pyramid network (FPN) [28] neck.

4.4. Quantitative results

In this subsection, we show the performance of our ensemble and hybrid models using the Average Precision (AP) and Average Recall (AR) evaluation metrics. Through experiments, the Intersection over Union (IoU) thresholds of 0.7, 0.6, 0.65, and 0.6 are selected for the NMS, the Soft-NMS, the NMW, and the WBF, respectively. Besides, the parameter that controls the weighting of the confidence score reduction in the Soft-NMS was selected as 0.0008. As shown in Table 2, we evaluate the ensemble performance of Swin-B and DetectoRS on four typical methods for combining predictions of multiple object detection models. We found that the NMW method achieved the best performance compared to the other methods for ensembling bounding boxes on this VisDrone-DET2021 dataset.

Table 2
Ensemble performance of Swin-B and DetectoRS on four typical methods for combining predictions of multiple object detection models.
Method     AP[%]  AP50[%]  AP75[%]  AR1[%]  AR10[%]  AR100[%]  AR500[%]
NMS        35.84  61.15    36.94    1.54    12.95    44.3      53.91
WBF        35.41  59.18    36.65    1.41    12.78    45.93     54.81
Soft-NMS   34.86  59.84    35.84    1.54    12.87    43.4      52.51
NMW        37.54  61.97    39.16    1.57    13.32    46.19     55.42

Using the NMW method, the ensemble of two different models performs better than each single model in the AP metric, as reported in Table 3. We can see that the ensemble of Swin-B and DetectoRS achieves the best AP performance among the ensembles of two different models. Although Swin-B and Swin-L are the two single models with the highest AP, the ensemble performance of these two models is still lower on the AP metric than the ensemble performance of either Swin-B or Swin-L combined with DetectoRS. This may be because the ensemble of Swin Transformer and DetectoRS has higher diversity, which is a key factor for a good ensemble model. In order to further improve the performance, we ensemble Swin-B, Swin-L, and DetectoRS, and also include the validation set for training the models. Overall, our ensemble model of Swin Transformer and DetectoRS achieves better results on all evaluation metrics compared to the previous winning team (DroneEye2020).

Table 3
Performance of ensemble combinations using the NMW, each single model, and DroneEye2020 on the test-dev set. We provide the comparison with DroneEye2020 only because the performance of the other conventional algorithms in the VisDrone Challenge was published based on the test-challenge set only, which is a private dataset for the challenge [14]. However, we could retrieve the performance of DroneEye2020 from the VisDrone-DET2020 Winner Talk's presentation [29].
Model                                            AP[%]  AP50[%]  AP75[%]  AR1[%]  AR10[%]  AR100[%]  AR500[%]
DroneEye2020 (Baseline)                          37.33  62.33    38.94    0.41    2.37     8.69      56.09
Swin-B                                           34.88  60.04    35.78    1.54    12.8     42.85     53.77
Swin-L                                           35.32  60.83    36.34    1.22    12.5     43.94     54.3
DetectoRS                                        32.67  57.05    33.22    1.38    11.52    41.48     51.92
Swin-B + Swin-L                                  37.33  62.33    38.89    1.56    13.29    46.29     54.90
Swin-B + DetectoRS                               37.54  61.97    39.16    1.57    13.32    46.19     55.42
Swin-L + DetectoRS                               37.48  62.05    39.09    1.36    12.64    46.64     55.56
Swin-B + Swin-L + DetectoRS                      38.06  62.55    39.84    1.57    13.37    46.98     56.37
Swin-B + Swin-L + DetectoRS (with the val set)   38.30  63.29    39.89    1.42    12.87    47.11     56.43

We also evaluate our hybrid model of the Swin Transformer backbone with the RFP neck. As reported in Table 4, the performance of this hybrid model is also better than the baseline model in all AP metrics. These results suggest that our hybrid model has more accurate predictions than the baseline model.

Table 4
Performance of our hybrid model compared to the baseline on the test-dev set. The Swin-L variant is used for these models.
Method                        AP[%]  AP50[%]  AP75[%]  AR1[%]  AR10[%]  AR100[%]  AR500[%]
Swin Transformer (Baseline)   34.23  58.58    35.25    1.18    11.2     43.31     54.16
Swin-RFP                      34.49  59.00    35.4     1.02    10.99    43.67     54.30

4.5. Qualitative results

Our ensemble and hybrid models are able to detect difficult objects, i.e., small, dense, and overlapped objects, as shown in Fig. 4. Our models can detect most of the objects defined on the ground truth correctly. Besides, our models can also detect objects under various conditions, i.e., different backgrounds, different scales, different lighting environments, and different angles of view. However, when there are not many objects on the ground truth, as shown in Fig. 5, our models produce many false positive predictions. Nevertheless, the two objects defined on the ground truth are still detected.
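
To make the notion of a false positive concrete, the sketch below matches predictions to ground-truth boxes by IoU and counts true and false positives. It is illustrative only: the boxes are made-up numbers, the 0.5 threshold is an assumption (it corresponds to the AP50 setting), and the greedy, score-ordered matching is one common convention that may differ in detail from the VisDrone evaluation toolkit. The names `iou` and `count_tp_fp` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp(preds: List[Tuple[Box, float]], gts: List[Box],
                thr: float = 0.5) -> Tuple[int, int]:
    """Greedy matching in descending score order; each GT box matches once."""
    matched = set()
    tp = fp = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = -1, thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1  # no unmatched ground-truth box overlaps enough
    return tp, fp


# Made-up scene with two ground-truth objects and five predictions.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((0, 0, 10, 10), 0.9), ((21, 21, 31, 31), 0.8),
         ((50, 50, 60, 60), 0.7), ((1, 1, 9, 9), 0.6),
         ((80, 80, 90, 90), 0.3)]
tp, fp = count_tp_fp(preds, gts)
print(tp, fp)  # 2 3
```

In this toy scene both ground-truth objects are found, yet three of the five predictions count as false positives, including a duplicate detection of an object that was already matched.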

Both our ensemble and hybrid models detect the two ground-truth objects in Fig. 5. Although our proposed models achieve fairly good performance in detecting many difficult objects, there is still room for improvement.

Fig. 4. The best qualitative results on the test-dev set of (a) our ensemble (Swin-B + DetectoRS) and (b) hybrid model, and (c) the corresponding ground truth.

Fig. 5. The worst qualitative results on the test-dev set of (a) our ensemble (Swin-B + DetectoRS) and (b) hybrid model, and (c) the corresponding ground truth.

5. Conclusion

In this paper, we proposed the fusion of transformer-based and CNN-based models for object detection in UAV imagery with two approaches, i.e., ensemble and hybrid techniques. To this end, we ensembled Swin Transformer and DetectoRS with the ResNet-50 backbone, and also integrated the Swin Transformer backbone with the RFP neck of DetectoRS. By combining these two different models, our experiments showed that our proposed ensemble and hybrid models outperformed the respective baseline models. We also showed that the NMW method achieved the best performance compared to the other methods for combining predictions. Finally, the qualitative results were shown to demonstrate the detection results of our proposed models. We showed that our models were able to detect challenging objects under various conditions.

CRediT authorship contribution statement

Willy Fitra Hendria: Conceptualization, Methodology, Software, Writing – original draft. Quang Thinh Phan: Software, Writing – review & editing. Fikriansyah Adzaka: Software, Writing – review & editing. Cheol Jeong: Supervision, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02067).

References

[1] R. Girshick, et al., Rich feature hierarchies for accurate object detection and semantic segmentation, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2014, pp. 580–587.
[2] C. Kyrkou, et al., DroNet: Efficient convolutional neural network detector for real-time UAV applications, in: 2018 Design, Automation, and Test in Europe Conference Exhibition, DATE, 2018, pp. 967–972.
[3] G. Plastiras, C. Kyrkou, T. Theocharides, Efficient ConvNet-based object detection for unmanned aerial vehicles by selective tile processing, in: 12th International Conference on Distributed Smart Cameras, ICDCS, 2018, pp. 3:1–3:6.
[4] B. Kellenberger, M. Volpi, D. Tuia, Fast animal detection in UAV images using convolutional neural networks, in: 2017 IEEE International Geoscience and Remote Sensing Symposium, IGARSS, 2017, pp. 866–869.
[5] A. Vaswani, et al., Attention is all you need, in: 2017 Advances in Neural Information Processing Systems, NIPS, 2017, pp. 6000–6010.
[6] N. Carion, et al., End-to-end object detection with transformers, in: 2020 European Conference on Computer Vision, ECCV, 2020, pp. 213–229.
[7] X. Dai, et al., Dynamic head: Unifying object detection heads with attentions, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2021, pp. 7373–7382.
[8] J. Yang, et al., Focal self-attention for local-global interactions in vision transformers, in: 2021 Advances in Neural Information Processing Systems, NeurIPS, 2021.
[9] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: 2015 International Conference on Learning Representations, ICLR, 2015.
[10] R. Geirhos, et al., ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, in: 2019 International Conference on Learning Representations, ICLR, 2019.
[11] S. Ardabili, et al., Advances in machine learning modeling reviewing hybrid and ensemble methods, in: A.R. Várkonyi-Kóczy (Ed.), Engineering for Sustainable Future, Springer International Publishing, Cham, 2020, pp. 215–227.
[12] P. Zhu, et al., VisDrone-DET2018: The vision meets drone object detection in image challenge results, in: 2018 European Conference on Computer Vision, ECCV Workshops, 2018, pp. 437–468.
[13] D. Du, et al., VisDrone-DET2019: The vision meets drone object detection in image challenge results, in: 2019 IEEE/CVF International Conference on Computer Vision, ICCV Workshops, 2019, pp. 213–226.
[14] D. Du, et al., VisDrone-DET2020: The vision meets drone object detection in image challenge results, in: 2020 European Conference on Computer Vision (ECCV) Workshops, 2020, pp. 692–712.
[15] A. Chandra, H. Chen, X. Yao, Trade-off between diversity and accuracy in ensemble generation, in: Multi-Objective Machine Learning, 2006, pp. 429–464.
[16] Z. Liu, et al., Swin transformer: Hierarchical vision transformer using shifted windows, in: 2021 IEEE/CVF International Conference on Computer Vision, ICCV, 2021, pp. 10012–10022.
[17] S. Qiao, et al., DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2021, pp. 10213–10224.
[18] B.M. Albaba, S. Ozer, SyNet: An ensemble network for object detection in UAV images, in: 25th International Conference on Pattern Recognition, ICPR, 2020, pp. 10227–10234.
[19] A. Neubeck, L. Van Gool, Efficient non-maximum suppression, in: 18th International Conference on Pattern Recognition, ICPR, 2006, pp. 850–855.
[20] N. Bodla, et al., Soft-NMS — Improving object detection with one line of code, in: 2017 IEEE International Conference on Computer Vision, ICCV, 2017, pp. 5562–5570.
[21] H. Zhou, et al., CAD: Scale invariant framework for real-time object detection, in: 2017 IEEE International Conference on Computer Vision, ICCV Workshops, 2017, pp. 760–768.
[22] R. Solovyev, W. Wang, T. Gabruseva, Weighted boxes fusion: Ensembling boxes from different object detection models, Image Vis. Comput. 107 (2021) 104117.
[23] B. Chakraborty, et al., A fully spiking hybrid neural network for energy-efficient object detection, IEEE Trans. Image Process. 30 (2021) 9014–9029.
[24] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016, pp. 770–778.
[25] L.-C. Chen, et al., Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell. 40 (4) (2018) 834–848.
[26] J. Deng, et al., ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2009, pp. 248–255.
[27] T.-Y. Lin, et al., Microsoft COCO: Common objects in context, in: 2014 European Conference on Computer Vision, ECCV, 2014, pp. 740–755.
[28] T.-Y. Lin, et al., Feature pyramid networks for object detection, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017, pp. 936–944.
[29] S. Moon, et al., VisDrone 2020 winner talk - Detection, 2020.
