Combining Transformer and CNN For Object Detection in UAV Imagery
ICT Express xxx (xxxx) xxx
www.elsevier.com/locate/icte
Abstract
Combining multiple models is a well-known technique to improve predictive performance in challenging tasks such as object detection
in UAV imagery. In this paper, we propose the fusion of transformer-based and convolutional neural network (CNN)-based models using two
approaches. First, we ensemble Swin Transformer and DetectoRS with a ResNet backbone, and compare the performance of four
typical methods for combining predictions of multiple object detection models. Second, we design a hybrid architecture by combining the Swin
Transformer backbone with the neck of DetectoRS. We show that each fusion of the transformer-based and CNN-based models outperforms
its respective baseline model.
© 2021 The Author(s). Published by Elsevier B.V. on behalf of The Korean Institute of Communications and Information Sciences. This is an open
access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Keywords: Convolutional neural network; Object detection; Transformer; UAV imagery
Please cite this article as: W.F. Hendria, Q.T. Phan, F. Adzaka et al., Combining transformer and CNN for object detection in UAV imagery, ICT Express (2022), https://doi.org/10.1016/j.icte.2021.12.006.
W.F. Hendria, Q.T. Phan, F. Adzaka et al. ICT Express xxx (xxxx) xxx
Ensemble learning combines predictions produced by multiple models that run independently of each other. In object
detection for UAV-specific data, SyNet [18] combines a multi-stage detector and a single-stage detector, and makes a
prediction by combining the individual predictions of each algorithm through a fusion stage.
In this paper, we ensemble two state-of-the-art transformer-based and CNN-based models, Swin Transformer and
DetectoRS, using four methods that are typically used for combining predictions of multiple object detection models,
i.e., non-maximum suppression (NMS) [19], Soft-NMS [20], non-maximum weighted (NMW) [21], and weighted boxes
fusion (WBF) [22].
2.2. Hybrid method for object detection
Different from ensemble learning, a hybrid method combines several completely different models to build a new
standalone model. In object detection for UAV-specific data, FSHNN [23] combines unsupervised Spike Time-Dependent
Plasticity (STDP) learning with backpropagation (STBP) learning methods and also uses Monte Carlo Dropout to get
an estimate of the uncertainty error.
In this paper, we design a hybrid architecture based on two state-of-the-art transformer-based and CNN-based
models, the Swin Transformer backbone with the neck of DetectoRS, i.e., RFP, which we term Swin-RFP.
3. Method
3.1. Ensemble model
Two variants of Swin Transformer, i.e., Swin-B and Swin-L, and DetectoRS with a ResNet-50 [24] backbone are adopted
for our ensemble model experiments. Swin Transformer uses the self-attention technique within local windows, whose
representation is computed with shifted windows. In addition, Swin Transformer builds hierarchical feature maps by
successively merging small patches as the network depth increases. On the other hand, inspired by the mechanism of
looking and thinking twice, DetectoRS uses the RFP and switchable atrous convolution (SAC). The RFP adds
feedback connections to bring back the output of the FPN to each stage of the backbone. The SAC convolves the features
with different atrous rates. To connect the output of RFP to the backbone, DetectoRS uses atrous spatial pyramid
pooling (ASPP) [25], which consists of four parallel branches of convolutional layers.
We train each single model using identical training data, and then ensemble the prediction results of the trained models
with four different ensemble combinations of the three models, as shown in Fig. 1. Extensive experiments are conducted to
evaluate the performance of our ensemble model by applying four typical methods for combining predictions of multiple
object detection models, i.e., the NMS, the Soft-NMS, the NMW, and the WBF. The hyperparameters of each method
are selected carefully through experiments. The architecture hyperparameters of the Swin Transformer variants used
in this experiment are shown in Table 1. For the ASPP in DetectoRS, we used the following configuration:
kernel size = [1, 3, 3, 1], atrous rate = [1, 3, 6, 1], and padding = [0, 3, 6, 0].
Table 1
Architecture hyperparameters of the Swin Transformer variants.
Model    C     W    # layers      # attention heads
Swin-B   128   12   2, 2, 18, 2   4, 8, 16, 32
Swin-L   192   12   2, 2, 18, 2   6, 12, 24, 48
3.2. Hybrid model
Our proposed hybrid model consists of the Swin Transformer backbone with the neck of DetectoRS, i.e., RFP, as
illustrated in Fig. 2. The RFP was proposed as an improvement of the FPN neck, which can look recursively
at the images to generate more powerful representations. The Swin Transformer backbone was shown to significantly
outperform existing backbone models, including ResNet-50, by extracting the powerful representation of the transformer
hierarchically. Integrating the best mechanism of each structure into a hybrid model can result in better performance.
Concretely, we add feedback connections from the top-down RFP neck to the bottom-up Swin Transformer backbone.
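The ASPP configuration quoted in Section 3.1 (kernel size, atrous rate, and padding per branch) can be checked with a little arithmetic. The sketch below is not the authors' code: it assumes a stride of 1 and treats every branch as a plain dilated convolution, and the helper names `conv_output_size` and `aspp_branch_sizes` are hypothetical. It applies the standard convolution output-size formula to show that each branch would keep the spatial size unchanged, which is what allows the parallel branches to be combined.

```python
# Branch settings quoted in Section 3.1 for the ASPP module.
KERNEL_SIZES = [1, 3, 3, 1]
ATROUS_RATES = [1, 3, 6, 1]
PADDINGS = [0, 3, 6, 0]


def conv_output_size(size, kernel, dilation, padding, stride=1):
    """Output length of a convolution along one spatial axis.

    stride=1 is an assumption; the text does not state the stride.
    """
    # A dilated kernel covers dilation * (kernel - 1) + 1 input positions.
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def aspp_branch_sizes(size):
    """Output length of each branch for an input of the given length."""
    return [
        conv_output_size(size, k, d, p)
        for k, d, p in zip(KERNEL_SIZES, ATROUS_RATES, PADDINGS)
    ]


print(aspp_branch_sizes(64))  # [64, 64, 64, 64]
```

For a 3 x 3 kernel, choosing the padding equal to the atrous rate is exactly what cancels the growth of the effective kernel, so the listed padding values follow directly from the listed rates.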
Table 2
Ensemble performance of Swin-B and DetectoRS on four typical methods for combining predictions of multiple object detection models.
Method     AP[%]   AP50[%]   AP75[%]   AR1[%]   AR10[%]   AR100[%]   AR500[%]
NMS        35.84   61.15     36.94     1.54     12.95     44.3       53.91
WBF        35.41   59.18     36.65     1.41     12.78     45.93      54.81
Soft-NMS   34.86   59.84     35.84     1.54     12.87     43.4       52.51
NMW        37.54   61.97     39.16     1.57     13.32     46.19      55.42
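As a point of reference for the simplest method in Table 2, the following is a minimal sketch of how NMS can combine the outputs of two detectors: the predictions are pooled, sorted by confidence, and a box is dropped when it overlaps an already kept box by more than the IoU threshold. It is an illustration under stated assumptions, not the authors' implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples of a single class, the function names `iou` and `nms_ensemble` and the sample detections are hypothetical, and the default threshold of 0.7 is the NMS value reported in the paper.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms_ensemble(predictions, iou_thr=0.7):
    """Pool (box, score) detections from several models and apply greedy NMS.

    predictions: one list of (box, score) pairs per model, same class assumed.
    """
    pooled = sorted(
        (det for model in predictions for det in model),
        key=lambda det: det[1],
        reverse=True,
    )
    kept = []
    for box, score in pooled:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, kept_box) <= iou_thr for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Hypothetical detections from two models for one image.
model_a = [((0, 0, 10, 10), 0.9), ((20, 20, 30, 30), 0.6)]
model_b = [((0, 0, 10, 9), 0.8), ((50, 50, 60, 60), 0.7)]
for box, score in nms_ensemble([model_a, model_b]):
    print(box, score)
```

In this toy input the two boxes near the origin overlap with an IoU of 0.9, so only the higher-scoring one survives; NMS discards the second model's localization entirely, which is the behavior that the weighted methods in Table 2 try to improve on.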
Table 3
Performance of ensemble combinations using the NMW, each single model, and DroneEye2020 on the test-dev set. We provide the comparison
with the DroneEye2020 only because the performance of the other conventional algorithms in the VisDrone Challenge was published based on the
test-challenge set only, which is a private dataset for the challenge [14]. However, we could retrieve the performance of DroneEye2020 from the
VisDrone-DET2020 Winner Talk’s presentation [29].
Model                                            AP[%]   AP50[%]   AP75[%]   AR1[%]   AR10[%]   AR100[%]   AR500[%]
DroneEye2020 (Baseline)                          37.33   62.33     38.94     0.41     2.37      8.69       56.09
Swin-B                                           34.88   60.04     35.78     1.54     12.8      42.85      53.77
Swin-L                                           35.32   60.83     36.34     1.22     12.5      43.94      54.3
DetectoRS                                        32.67   57.05     33.22     1.38     11.52     41.48      51.92
Swin-B + Swin-L                                  37.33   62.33     38.89     1.56     13.29     46.29      54.90
Swin-B + DetectoRS                               37.54   61.97     39.16     1.57     13.32     46.19      55.42
Swin-L + DetectoRS                               37.48   62.05     39.09     1.36     12.64     46.64      55.56
Swin-B + Swin-L + DetectoRS                      38.06   62.55     39.84     1.57     13.37     46.98      56.37
Swin-B + Swin-L + DetectoRS (with the val set)   38.30   63.29     39.89     1.42     12.87     47.11      56.43
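The ensembles in Table 3 are all fused with the NMW. The sketch below shows one common formulation of that idea, offered as an assumption rather than the authors' exact variant: overlapping boxes are grouped around the highest-scoring box, each box is weighted by its confidence multiplied by its IoU with that top box, the fused coordinates are the weighted average, and the top box's confidence is kept. The names `iou` and `nmw_ensemble`, the `(x1, y1, x2, y2)` box format, and the sample detections are hypothetical; the default threshold of 0.65 is the NMW value reported in the paper.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nmw_ensemble(predictions, iou_thr=0.65):
    """Fuse (box, score) detections from several models, NMW style.

    predictions: one list of (box, score) pairs per model, same class assumed.
    """
    pooled = sorted(
        (det for model in predictions for det in model),
        key=lambda det: det[1],
        reverse=True,
    )
    used = [False] * len(pooled)
    fused = []
    for i, (top_box, top_score) in enumerate(pooled):
        if used[i]:
            continue
        weight_sum = 0.0
        acc = [0.0, 0.0, 0.0, 0.0]
        for j in range(i, len(pooled)):
            if used[j]:
                continue
            box, score = pooled[j]
            # The top box always belongs to its own cluster with overlap 1.
            overlap = 1.0 if j == i else iou(box, top_box)
            if j == i or overlap > iou_thr:
                used[j] = True
                weight = score * overlap
                weight_sum += weight
                acc = [a + weight * c for a, c in zip(acc, box)]
        if weight_sum > 0:
            merged = tuple(a / weight_sum for a in acc)
        else:
            merged = tuple(float(c) for c in top_box)  # all weights were zero
        fused.append((merged, top_score))
    return fused


# Hypothetical detections of the same object from two models.
model_a = [((0, 0, 10, 10), 0.9)]
model_b = [((0, 0, 10, 8), 0.6)]
print(nmw_ensemble([model_a, model_b]))
```

Unlike NMS, the lower-scoring box still pulls the fused coordinates toward its own estimate, so detections from a second, differently biased model contribute to localization instead of being discarded.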
weight decay 0.0001. The learning rate is reduced to 0.002 and 0.0002 at the 8th epoch and the 11th epoch, respectively.
Similarly, we also select the best model during the training epochs based on the AP metric.
Swin-RFP. The Swin-L variant, which is pretrained on ImageNet, is used for this hybrid model. The initial learning
rate is set to 0.0001, and it is then decayed by a factor of 0.1 at the 27th and the 33rd epoch. The reported result of this
model is based on the model trained at the 50th epoch. We also use these settings for the baseline of this hybrid model,
which is the original Swin Transformer with a feature pyramid network (FPN) [28] neck.
4.4. Quantitative results
In this subsection, we show the performance of our ensemble and hybrid models using the Average Precision (AP) and
Average Recall (AR) evaluation metrics. Through experiments, Intersection over Union (IoU) thresholds of 0.7, 0.6, 0.65, and
0.6 are selected for the NMS, the Soft-NMS, the NMW, and the WBF, respectively. In addition, the parameter that
controls the weighting of the confidence score reduction in the Soft-NMS was set to 0.0008. As shown
in Table 2, we evaluate the ensemble performance of Swin-B and DetectoRS on four typical methods for combining
predictions of multiple object detection models. We found that the NMW method achieved the best performance compared
to the other methods for ensembling bounding boxes on the VisDrone-DET2021 dataset.
Using the NMW method, every ensemble of two different models performs better than each single model
on the AP metric, as reported in Table 3. We can see that the ensemble of Swin-B and DetectoRS achieves the best AP
among the ensembles of two different models. Although Swin-B and Swin-L are the two single models with
the highest AP, the ensemble of these two models still has a lower AP than the ensemble of either Swin-B or Swin-L
with DetectoRS. This may be because the ensemble of Swin Transformer and DetectoRS has higher
diversity, which is a key factor for a good ensemble model.
To further improve the performance, we ensemble Swin-B, Swin-L, and DetectoRS, and also include the validation
set for training the models. Overall, our ensemble model of Swin Transformer and DetectoRS achieves better results on
all evaluation metrics compared to the previous winning team (DroneEye2020).
We also evaluate our hybrid model of the Swin Transformer backbone with the RFP neck. As reported in Table 4,
the performance of this hybrid model is also better than the baseline model in all AP metrics. These results suggest
that our hybrid model makes more accurate predictions than the baseline model.
4.5. Qualitative results
Our ensemble and hybrid models are able to detect difficult objects, i.e., small, dense, and overlapping objects, as
shown in Fig. 4. Our models correctly detect most of the objects defined in the ground truth. In addition, our models
can detect objects under various conditions, i.e., different backgrounds, different scales, different lighting environments,
and different angles of view. However, when there are not many objects in the ground truth, as shown in Fig. 5, our
models produce many false-positive bounding boxes. Nevertheless, one can see that the two
objects defined in the ground truth are detected by
both our ensemble and hybrid models. Although our proposed models achieve fairly good performance in detecting
many difficult objects, there is still room for improvement.
Table 4
Performance of our hybrid model compared to the baseline on the test-dev set. The Swin-L variant is used for these models.
Method                        AP[%]   AP50[%]   AP75[%]   AR1[%]   AR10[%]   AR100[%]   AR500[%]
Swin Transformer (Baseline)   34.23   58.58     35.25     1.18     11.2      43.31      54.16
Swin-RFP                      34.49   59.00     35.4      1.02     10.99     43.67      54.30
Fig. 4. The best qualitative results on the test-dev set of (a) our ensemble (Swin-B + DetectoRS) and (b) hybrid model, and (c) the corresponding ground truth.
Fig. 5. The worst qualitative results on the test-dev set of (a) our ensemble (Swin-B + DetectoRS) and (b) hybrid model, and (c) the corresponding ground truth.
5. Conclusion
In this paper, we proposed the fusion of transformer-based and CNN-based models for object detection in UAV imagery
with two approaches, i.e., ensemble and hybrid techniques. To this end, we ensembled Swin Transformer and DetectoRS
with the ResNet-50 backbone, and also integrated the Swin Transformer backbone with the RFP neck of DetectoRS.
By combining these two different models, our experiments showed that our proposed ensemble and hybrid models
outperformed the respective baseline models. We also showed that the NMW method achieved the best performance compared
to the other methods for combining predictions. Finally, the qualitative results were shown to demonstrate the detection
results of our proposed models. We showed that our models were able to detect challenging objects under various conditions.
CRediT authorship contribution statement
Willy Fitra Hendria: Conceptualization, Methodology, Software, Writing – original draft. Quang Thinh Phan: Software,
Writing – review & editing. Fikriansyah Adzaka: Software, Writing – review & editing. Cheol Jeong: Supervision,
Writing – review & editing.
Acknowledgment
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP)
grant funded by the Korea government (MSIT) (No. 2021-0-02067).
References
[1] R. Girshick, et al., Rich feature hierarchies for accurate object detection and semantic segmentation, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2014, pp. 580–587.
[2] C. Kyrkou, et al., DroNet: Efficient convolutional neural network detector for real-time UAV applications, in: 2018 Design, Automation, and Test in Europe Conference Exhibition, DATE, 2018, pp. 967–972.
[3] G. Plastiras, C. Kyrkou, T. Theocharides, Efficient ConvNet-based object detection for unmanned aerial vehicles by selective tile processing, in: 12th International Conference on Distributed Smart Cameras, ICDSC, 2018, pp. 3:1–3:6.
[4] B. Kellenberger, M. Volpi, D. Tuia, Fast animal detection in UAV images using convolutional neural networks, in: 2017 IEEE International Geoscience and Remote Sensing Symposium, IGARSS, 2017, pp. 866–869.
[5] A. Vaswani, et al., Attention is all you need, in: 2017 Advances in Neural Information Processing Systems, NIPS, 2017, pp. 6000–6010.
[6] N. Carion, et al., End-to-end object detection with transformers, in: 2020 European Conference on Computer Vision, ECCV, 2020, pp. 213–229.
[7] X. Dai, et al., Dynamic head: Unifying object detection heads with attentions, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2021, pp. 7373–7382.
[8] J. Yang, et al., Focal self-attention for local-global interactions in vision transformers, in: 2021 Advances in Neural Information Processing Systems, NeurIPS, 2021.
[9] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: 2015 International Conference on Learning Representations, ICLR, 2015.
[10] R. Geirhos, et al., ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, in: 2019 International Conference on Learning Representations, ICLR, 2019.
[11] S. Ardabili, et al., Advances in machine learning modeling reviewing hybrid and ensemble methods, in: A.R. Várkonyi-Kóczy (Ed.), Engineering for Sustainable Future, Springer International Publishing, Cham, 2020, pp. 215–227.
[12] P. Zhu, et al., VisDrone-DET2018: The vision meets drone object detection in image challenge results, in: 2018 European Conference on Computer Vision, ECCV Workshops, 2018, pp. 437–468.
[13] D. Du, et al., VisDrone-DET2019: The vision meets drone object detection in image challenge results, in: 2019 IEEE/CVF International Conference on Computer Vision, ICCV Workshops, 2019, pp. 213–226.
[14] D. Du, et al., VisDrone-DET2020: The vision meets drone object detection in image challenge results, in: 2020 European Conference on Computer Vision (ECCV) Workshops, 2020, pp. 692–712.
[15] A. Chandra, H. Chen, X. Yao, Trade-off between diversity and accuracy in ensemble generation, in: Multi-Objective Machine Learning, 2006, pp. 429–464.
[16] Z. Liu, et al., Swin transformer: Hierarchical vision transformer using shifted windows, in: 2021 IEEE/CVF International Conference on Computer Vision, ICCV, 2021, pp. 10012–10022.
[17] S. Qiao, et al., DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2021, pp. 10213–10224.
[18] B.M. Albaba, S. Ozer, SyNet: An ensemble network for object detection in UAV images, in: 25th International Conference on Pattern Recognition, ICPR, 2020, pp. 10227–10234.
[19] A. Neubeck, L. Van Gool, Efficient non-maximum suppression, in: 18th International Conference on Pattern Recognition, ICPR, 2006, pp. 850–855.
[20] N. Bodla, et al., Soft-NMS -- Improving object detection with one line of code, in: 2017 IEEE International Conference on Computer Vision, ICCV, 2017, pp. 5562–5570.
[21] H. Zhou, et al., CAD: Scale invariant framework for real-time object detection, in: 2017 IEEE International Conference on Computer Vision, ICCV Workshops, 2017, pp. 760–768.
[22] R. Solovyev, W. Wang, T. Gabruseva, Weighted boxes fusion: Ensembling boxes from different object detection models, Image Vis. Comput. 107 (2021) 104117.
[23] B. Chakraborty, et al., A fully spiking hybrid neural network for energy-efficient object detection, IEEE Trans. Image Process. 30 (2021) 9014–9029.
[24] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016, pp. 770–778.
[25] L.-C. Chen, et al., Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell. 40 (4) (2018) 834–848.
[26] J. Deng, et al., ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2009, pp. 248–255.
[27] T.-Y. Lin, et al., Microsoft COCO: Common objects in context, in: 2014 European Conference on Computer Vision, ECCV, 2014, pp. 740–755.
[28] T.-Y. Lin, et al., Feature pyramid networks for object detection, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017, pp. 936–944.
[29] S. Moon, et al., VisDrone 2020 winner talk - Detection, 2020.