Article
A Human-Detection Method Based on YOLOv5 and Transfer
Learning Using Thermal Image Data from UAV Perspective for
Surveillance System
Aprinaldi Jasa Mantau 1, * , Irawan Widi Widayat 1 , Jenq-Shiou Leu 2 and Mario Köppen 1
1 Department of Computer Science and System Engineering (CSSE), Graduate School of Computer Science and
System Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka-shi, Fukuoka 820-8502, Japan
2 Department of Electronic and Computer Engineering (ECE), National Taiwan University of Science and
Technology, Taipei City 106, Taiwan
* Correspondence: mantau.aprinaldi@ieee.org
Abstract: Many illegal activities are currently being carried out, such as illegal mining,
hunting, logging, and forest burning, and they can have a substantial negative impact on the
environment. These activities are increasingly rampant because of the limited number of
officers and the high cost of monitoring. One possible solution is a surveillance
system that uses artificial intelligence to monitor the area. Unmanned aerial vehicles (UAVs) and
NVIDIA Jetson modules (general-purpose GPUs) can be inexpensive and efficient because they use
few resources. The difficulty with object detection from the drone's perspective is that
the objects are relatively small compared to the observation space, and there are also illumination
and environmental challenges. In this study, we demonstrate the use of the state-of-the-art
object-detection method you only look once (YOLO) v5 on a dataset of visual images taken from a
UAV (RGB images), along with thermal infrared (TIR) information, to find poachers. We employed
seven training scenarios with RGB and thermal infrared data to find the best model, which we
will later deploy on the Jetson Nano module. The experimental results show that transfer
learning from a model pre-trained on the MS COCO dataset improves YOLOv5's detection of
humans in the RGBT image dataset.
Keywords: human detection; Jetson; surveillance; thermal imaging; UAV; YOLO
Citation: Mantau, A.J.; Widayat, I.W.; Leu, J.-S.; Köppen, M. A Human-Detection Method Based on
YOLOv5 and Transfer Learning Using Thermal Image Data from UAV Perspective for Surveillance
System. Drones 2022, 6, 290. https://doi.org/10.3390/drones6100290
UAV technology is currently very advanced and is the most realistic solution today
because it is flexible, fast, relatively inexpensive, lightweight, and easy to use [5]. In several
fields of study, UAVs have been employed as tools for area and target coverage, path and
trajectory planning, image analysis and vision-based techniques, networking, and flight
control [6]. Despite the massive use of UAVs in these fields, many challenges remain
to be solved, including weather conditions, shadows, illumination,
and other variations. To overcome these challenges, RGBT images, that is,
red, green, and blue images with thermal infrared information, are utilized.
Conceptually, a thermal infrared (TIR) image captures information outside the spectrum
visible to the human eye, that is, wavelengths beyond the visible light
spectrum, as shown in Figure 2. This helps TIR imaging to cope with changes in
light intensity that affect the color captured by the human eye. However, TIR also has
weaknesses: it is sensitive to temperature changes and does not contain the detailed information
that visual RGB images do [7].
Because humans appear small in UAV videos, the UAV itself is moving, and the resolution
is low, detecting poachers in UAV video, particularly thermal infrared
footage, is a challenging and important topic of research. In the present study, several
scenarios have been
used to enhance the you only look once (YOLO) [8] object-detection method, which focuses
on small human–object detection from a UAV perspective. The target presents a harder
challenge for object detection because of its varied shapes and dense crowds. Therefore,
the YOLOv5 model was trained on the RGB and TIR image datasets in order to evaluate
how well it identifies humans in aerial-perspective data.
The main contributions of this paper are as follows:
• Optimizing the YOLOv5s algorithm for a small human–object detection dataset via the
transfer learning method.
• Developing a method to handle different environmental issues, including illumination
and mobility change, using thermal infrared (TIR) images in addition to RGB (RGBT)
images.
• Manually annotating the original dataset to be YOLO-format-compatible; the annotation
will be made available to the public.
• Proposing a surveillance system for wildlife conservation using the NVIDIA Jetson Nano
module.
This paper is organized as follows: Section 2 describes object detection for surveillance
and provides a brief overview of the NVIDIA Jetson modules. Section 3 presents
the methodology and necessary background information, as well as the evaluation method.
Sections 4 and 5 present the experimental results and the conclusion, respectively.
2. Related Work
Object detection in UAVs or drones has been developed for use in a
variety of contexts, including aerial image analysis, monitoring agents, delivery routing
agents, intelligent surveillance, and air force security. Hengstler et al. [9] introduced
MeshEye, a distributed surveillance-camera model in which a low-resolution
stereo camera is used to determine the position, range, and dimension
of objects in the captured images. Widiyanto et al. [10] introduced a particle swarm
optimization (PSO) algorithm for the odor-source localization model of automatic robotic
movement by reconstructing two different points of robotic sensing. Zhao et al. [11] proposed
Mixed YOLOv3-LITE, a lightweight detector designed to balance detection precision and speed,
which can be used on non-GPU computing systems such as mobile or portable devices.
Several studies have been conducted in the field of object detection, especially given the
availability of large online datasets and increasing computing power, which have enabled
extraordinary achievements in the field of computer vision [12]. Object detection has been
shown to solve both general and specific problems. Two examples of
single-stage detectors are you only look once (YOLO) and the single-shot multi-box detector
(SSD) [13]. Meanwhile, the RCNN family, which includes RCNN [14], Fast RCNN [15],
and Faster RCNN [16], consists of two-stage detectors. These two
categories of deep-learning-based detectors differ in accuracy and processing
time.
YOLO version 2 (YOLOv2) replaced the original architecture with a 19-layer feature extractor
called Darknet-19 [17]. In the third version (YOLOv3), the network architecture was updated
again to a deeper architecture known as Darknet-53 [18]. Furthermore, YOLO
version 4 (YOLOv4) used CSPDarknet-53 as its backbone, that is, the same Darknet-53
architecture with additional cross stage partial (CSP) connections [19]. YOLOv4
was released in 2020 with several additional features that were shown to enhance accuracy.
Technical Specifications
The NVIDIA Jetson Nano uses the compute unified device architecture (CUDA) as its
parallel computing platform. CUDA is a development and execution
platform designed by NVIDIA for general-purpose computing on graphics
processing units (GPUs) [22]. It lets tasks that are independent of one another, and therefore
do not need to be executed sequentially, run in parallel on the GPU. Furthermore, it supports
many programming languages, such as C, C++, Fortran, and Python. CUDA is useful in domains
that require a lot of computing power or in situations where parallelization is possible and
high performance is required. NVIDIA Jetson modules have been widely used in computer vision
research because these general-purpose GPUs have become a viable
platform for the efficient execution of some computational models [23].
In the current study, the NVIDIA Jetson Nano was used to detect human appearance
from a UAV perspective for the surveillance system. Additionally, the best YOLOv5 model
trained on the RGBT dataset was deployed on the Jetson Nano. An overview of the Jetson
Nano design used is shown in Figure 3.
3. Methodology
3.1. Object Detector
YOLOv5 [24] is the latest major version of YOLO to date. Jocher released YOLOv5
publicly on 9 June 2020, and it is still being updated. The release of YOLOv5 includes four
main model sizes: YOLOv5s (the smallest), YOLOv5m (medium),
YOLOv5l (large), and YOLOv5x (the largest). When it was released, YOLOv5 was initially
intended only for an image size of 640 pixels, but it now also offers 1280 pixels.
Furthermore, the architecture of YOLOv5 has a cross stage partial (CSP) connection
backbone and a PANet neck, just like YOLOv4. However, YOLOv5 is implemented in PyTorch
instead of the original Darknet framework. The significant improvements in YOLOv5 include
mosaic data augmentation and auto-learning bounding box anchors. The architecture of
YOLOv5 is shown in Figure 4.
Figure 4. YOLOv5 architecture. Backbone: CSPD; neck: PANet; and head: YOLO layer detection
results (class, score, location, and size).
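To make the idea of mosaic data augmentation more concrete, the following is a minimal sketch in plain Python. It is deliberately simplified and is not the YOLOv5 implementation: it assumes four equally sized tiles placed on a fixed 2 x 2 grid, with no random mosaic center, scaling, or cropping, and it represents an image as a nested list of pixel values. The function name `mosaic4` is hypothetical. The point it illustrates is that each tile's bounding boxes must be shifted by the tile's offset in the combined canvas.

```python
def mosaic4(images, boxes_per_image):
    """Tile four equally sized images into a 2x2 mosaic and shift their boxes.

    images: list of 4 images, each a list of H rows with W pixel values.
    boxes_per_image: list of 4 lists of (x1, y1, x2, y2) pixel boxes.
    Returns the combined canvas and the boxes in canvas coordinates.
    """
    if len(images) != 4 or len(boxes_per_image) != 4:
        raise ValueError("mosaic4 expects exactly four images and four box lists")
    h, w = len(images[0]), len(images[0][0])
    if any(len(img) != h or any(len(row) != w for row in img) for img in images):
        raise ValueError("all tiles must share the same height and width")

    # (x offset, y offset) per tile: top-left, top-right, bottom-left, bottom-right.
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]
    canvas = [[0] * (2 * w) for _ in range(2 * h)]
    merged_boxes = []
    for img, boxes, (dx, dy) in zip(images, boxes_per_image, offsets):
        # Copy the tile's pixels into its quadrant of the canvas.
        for y in range(h):
            canvas[dy + y][dx:dx + w] = img[y]
        # Shift every box by the same offset so it still covers its object.
        for x1, y1, x2, y2 in boxes:
            merged_boxes.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return canvas, merged_boxes
```

Because four scenes are merged into one training sample, objects appear at a smaller relative scale and in more varied contexts, which is one plausible reason the technique is useful for small-object data such as UAV footage.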
3.2. Dataset
During the experimental design, the VisDrone 2021 RGBT dataset was used [25]. This
dataset was originally part of the VisDrone 2021 Crowd Counting Challenge, which is a
challenge for counting people in each frame. This challenge aims to estimate the number of
people in an image. VisDrone 2021 provides a dataset with pairs of RGB and TIR images. It
is important to note that the VisDrone 2021 RGBT dataset was collected by the AISKYEYE
team from the Lab of Machine Learning and Data Mining at Tianjin University, China.
These data consist of 1807 pairs of RGB and TIR images; an example pair
can be seen in Figure 5. The team collected the data from an actual UAV under several
different scenarios as well as various lighting and weather conditions. The ground truth
of the dataset is the object's target point in XML format. Before using these data in
the experiment, some data preprocessing was performed to make them compatible with the
YOLO format. In this study, the data were divided into training and test sets in the ratio of
80:20, respectively.
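The preprocessing and split described above can be sketched as follows. This is an illustrative sketch, not the authors' script: it assumes that each annotated human is available as a pixel bounding box `(x1, y1, x2, y2)`, that the class index for humans is 0, and that the split is made over image-pair identifiers so that the RGB and TIR images of one scene stay in the same subset. The helper names `to_yolo_line` and `split_train_test`, the seed, and the example image size are all hypothetical. The only fixed convention is the YOLO label format itself: one line per object with the class index followed by the box center, width, and height, each normalized by the image dimensions.

```python
import random


def to_yolo_line(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel box to a YOLO label line: class cx cy w h (normalized)."""
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def split_train_test(items, train_ratio=0.8, seed=0):
    """Shuffle reproducibly and split into training and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Hypothetical 640 x 480 image with one annotated person.
    print(to_yolo_line(0, 100, 200, 120, 240, 640, 480))
    # 1807 image pairs, identified here simply by index.
    train_ids, test_ids = split_train_test(range(1807))
    print(len(train_ids), len(test_ids))
```

With 1807 pairs and an 80:20 ratio, truncating the cut point gives 1445 training pairs and 362 test pairs; the exact counts used in the study may differ depending on how the rounding was handled.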
• VisDrone RGB and TIR Image + transfer learning YOLOv5s model (YOLO-RGBT-TL);
• VisDrone RGB and TIR image data (YOLO-RGBT).
The seven aforementioned scenarios are intended to investigate the impact of
combining the transfer-learning approach with the dataset used, so that the best scenario may
be selected and applied to the Jetson Nano device.
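As an illustration of how a scenario with and without transfer learning might be launched, the sketch below builds the command line for the `train.py` script of the YOLOv5 repository [24]. The flags used (`--img`, `--batch-size`, `--epochs`, `--data`, `--weights`, `--cfg`, `--name`) are options of that script; everything else is an assumption made for the example, including the dataset file name `visdrone_rgbt.yaml`, the epoch count, and the batch size, none of which are taken from the experiments. Passing the pre-trained `yolov5s.pt` checkpoint (trained on MS COCO) corresponds to the transfer-learning case, while an empty `--weights` value together with `--cfg yolov5s.yaml` trains the same architecture from randomly initialized weights.

```python
def build_train_command(run_name, data_yaml, transfer_learning,
                        img_size=640, epochs=100, batch_size=16):
    """Build an argument list for YOLOv5's train.py (values are placeholders)."""
    cmd = [
        "python", "train.py",
        "--img", str(img_size),
        "--batch-size", str(batch_size),
        "--epochs", str(epochs),
        "--data", data_yaml,
        "--name", run_name,
    ]
    if transfer_learning:
        # Start from the MS COCO pre-trained YOLOv5s checkpoint.
        cmd += ["--weights", "yolov5s.pt"]
    else:
        # Train from scratch: no checkpoint, only the model definition.
        cmd += ["--weights", "", "--cfg", "yolov5s.yaml"]
    return cmd


if __name__ == "__main__":
    # Hypothetical dataset description file for the combined RGB + TIR data.
    for name, use_tl in [("YOLO-RGBT", False), ("YOLO-RGBT-TL", True)]:
        print(" ".join(build_train_command(name, "visdrone_rgbt.yaml", use_tl)))
```

The returned list can be passed to `subprocess.run` from inside a checkout of the repository; keeping the two cases in one function makes it easy to confirm that the only difference between paired scenarios is the weight initialization.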
3.4. Evaluation
The training scenarios for VisDrone RGB, TIR, and RGBT images were evaluated
in both RGB and TIR test sets. The evaluation measurements utilized include precision
(P), recall (R), and average precision (AP). The AP measures a combination of recall and
precision for ranked retrieval results and is the average precision at various recall values [26].
The formulas for calculating R and P are as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (1)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$
where:
• TP denotes true positive;
• FP denotes false positive;
• FN denotes false negative.
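The evaluation measures can be computed as in the following sketch. It implements Equations (1) and (2) directly, and it computes AP in the ranked-retrieval sense described above: detections are sorted by confidence, and the precision values at each rank where a true positive occurs are averaged over the number of ground-truth objects. This is an assumption about the exact AP variant; the YOLOv5 evaluation code derives AP from an interpolated precision-recall curve, so its values may differ slightly. The function names and the input format, a list of `(confidence, is_true_positive)` pairs, are hypothetical.

```python
def precision(tp, fp):
    """Equation (2): fraction of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation (1): fraction of ground-truth objects that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(detections, num_ground_truth):
    """Non-interpolated AP over detections given as (confidence, is_tp) pairs."""
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            tp += 1
            # Precision at this rank, recorded each time recall increases.
            precision_sum += tp / rank
    return precision_sum / num_ground_truth


if __name__ == "__main__":
    print(precision(tp=8, fp=2))   # 8 correct out of 10 detections
    print(recall(tp=8, fn=8))      # 8 found out of 16 ground-truth objects
    print(average_precision([(0.9, True), (0.8, False), (0.7, True)], 2))
```

In the last call, the two true positives occur at ranks 1 and 3, giving precisions of 1/1 and 2/3, so the AP is their mean, 5/6.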
Table 3 shows the comparison results for each of the seven training scenarios and the
original YOLOv5 model when applied to the RGB image test set. It was observed that
all of the trained models performed better than the original
YOLOv5.
The best model in this scenario was the YOLO-RGB-TL model, with an average precision
of 79.8%; meanwhile, the YOLO-TIR model failed on the RGB image test set, as it produced
a lower performance value. Table 3 also shows that the performance of both YOLO-RGB
and YOLO-RGBT improved when transfer learning from weights pre-trained on the MS
COCO dataset was employed: the average precision increased from
70% to 79.8% for YOLO-RGB and from 71.4% to 79.1% for YOLO-RGBT.
Furthermore, Table 4 shows the comparison results for each of the seven training
scenarios and the original YOLOv5 model when applied to the TIR image test set. It was
observed that YOLO-TIR and YOLO-RGBT with transfer learning weights achieved an
AP of 88.8% on the TIR image test set. In Table 4, both YOLO-RGB and YOLO-RGB-TL did not
match the results of the YOLO-TIR and YOLO-RGBT models because the information
in the TIR image is not as detailed as that in the RGB image. This limited information
makes it difficult for these models, which were not trained with TIR images, to detect the
objects. The performance results of each scenario for the RGB and TIR images are shown
in Figures 7–10.
Figure 7. YOLOv5s original model detection result. (a) TIR Image. (b) RGB Image.
Figure 8. YOLOv5-RGB model detection result. (a) TIR Image. (b) RGB Image.
Figure 9. YOLOv5-TIR model detection result. (a) TIR Image. (b) RGB Image.
Figure 10. YOLOv5-RGBT detection result. (a) TIR Image. (b) RGB Image.
5. Conclusions
The detection ability of the state-of-the-art deep-learning-based algorithm
you only look once (YOLO) has been investigated for small human–object
detection from an unmanned aerial vehicle perspective using NVIDIA Jetson modules. In
the model search, the YOLOv5 model trained with RGB and thermal infrared images
produced good results on the small-object-detection problem. The RGB and TIR
image dataset from VisDrone boosted the performance of the YOLOv5 model in
detecting small objects from a UAV perspective, with AP values of up to 79.8% and
88.8% for RGB and TIR images, respectively. Future studies should consider more complex
methods for the training process, including exploring new YOLO architectures
and the most effective way to combine RGB and thermal infrared
image datasets. Finally, a complex surveillance system can be implemented in a multi-agent
UAV with an edge AI concept using the NVIDIA Jetson module in order to investigate the
cost performance of this solution.
Author Contributions: Conceptualization, A.J.M., I.W.W., M.K., and J.-S.L.; data curation, A.J.M.;
formal analysis, A.J.M.; investigation, A.J.M. and I.W.W.; resources, A.J.M.; supervision, M.K. and
J.-S.L.; visualization, A.J.M.; writing—original draft, A.J.M. and I.W.W.; and writing—review and
editing, A.J.M., I.W.W., M.K., and J.-S.L. All authors read and agreed to the published version of the
manuscript.
Funding: This study was supported by a collaborative research project between the Kyushu Institute
of Technology (Kyutech) and the National Taiwan University of Science and Technology (Taiwan-
Tech).
References
1. United Nations. International Day of Forests, 21 March. Available online: https://www.un.org/en/observances/forests-and-
trees-day (accessed on 1 September 2021).
2. Food and Agriculture Organization of the United Nations. Global Forest Resources Assessment 2020: Main Report; FAO: Rome, Italy,
2020. https://doi.org/10.4060/ca9825en.
3. Assifa, F. Setiap Tahun, HUTAN INDONESIA HILANG 684.000 Hektar. Available online: https://regional.kompas.com/read/
2016/08/30/15362721/setiap.tahun.hutan.indonesia.hilang.684.000.hektar (accessed on 24 April 2021).
4. Nugroho, W.; Eko Prasetyo, M.S. Forest Management and Environmental Law Enforcement Policy against Illegal Logging in
Indonesia. Int. J. Manag. 2019, 10, 317–323.
5. Mantau, A.J.; Widayat, I.W.; Köppen, M. A Genetic Algorithm for Parallel Unmanned Aerial Vehicle Scheduling: A Cost
Minimization Approach. In Proceedings of the International Conference on Intelligent Networking and Collaborative Systems; Springer:
Cham, Switzerland, 2021; pp. 125–135. https://doi.org/10.1007/978-3-030-84910-8_14.
6. Shakeri, R.; Al-Garadi, M.A.; Badawy, A.; Mohamed, A.; Khattab, T.; Al-Ali, A.; Harras, K.A.; Guizani, M. Design Chal-
lenges of Multi-UAV Systems in Cyber-Physical Applications: A Comprehensive Survey, and Future Directions. arXiv 2018,
arXiv:1810.09729. https://doi.org/10.48550/ARXIV.1810.09729.
7. Bokolonga, E.; Hauhana, M.; Rollings, N.; Aitchison, D.; Assaf, M.H.; Das, S.R.; Biswas, S.N.; Groza, V.; Petriu, E.M. A compact
multispectral image capture unit for deployment on drones. In Proceedings of the 2016 IEEE International Instrumentation and
Measurement Technology Conference Proceedings, Taipei, Taiwan, 23–26 May 2016; pp. 1–5.
8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
https://doi.org/10.48550/arXiv.1506.02640.
9. Hengstler, S.; Prashanth, D.; Fong, S.; Aghajan, H. Mesheye: A hybrid-resolution smart camera mote for applications in
distributed intelligent surveillance. In Proceedings of the 6th International Conference on Information Processing in Sensor
Networks, Cambridge, MA, USA, 25–27 April 2007; pp. 360–369.
10. Widiyanto, D.; Purnomo, D.; Jati, G.; Mantau, A.; Jatmiko, W. Modification of particle swarm optimization by reforming global best
term to accelerate the searching of odor sources. Int. J. Smart Sens. Intell. Syst. 2016, 9, 1410–1430. https://doi.org/10.21307/ijssis-
2017-924.
11. Zhao, H.; Zhou, Y.; Zhang, L.; Peng, Y.; Hu, X.; Peng, H.; Cai, X. Mixed YOLOv3-LITE: A lightweight real-time object-detection
method. Sensors 2020, 20, 1861.
12. Liu, M.; Wang, X.; Zhou, A.; Fu, X.; Ma, Y.; Piao, C. UAV-YOLO: Small object detection on unmanned aerial vehicle perspective.
Sensors 2020, 20, 2238.
13. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the
European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 21–37.
14. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmenta-
tion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June
2014; pp. 580–587. https://doi.org/10.48550/arXiv.1311.2524.
15. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December
2015; pp. 1440–1448.
16. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural
Inf. Process. Syst. 2015, 28, 91–99.
17. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
18. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
19. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
20. Cass, S. Nvidia makes it easy to embed AI: The Jetson nano packs a lot of machine-learning power into DIY projects—[Hands on].
IEEE Spectr. 2020, 57, 14–16. https://doi.org/10.1109/MSPEC.2020.9126102.
21. Jetson Modules, 2021. Available online: https://developer.nvidia.com/embedded/jetson-modules (accessed on 12 January
2022).
22. Kirk, D. NVIDIA Cuda Software and Gpu Parallel Computing Architecture. In Proceedings of the 6th International Symposium
on Memory Management, ISMM’07, Montreal, QC, Canada, 21–22 October; Association for Computing Machinery: New York,
NY, USA, 2007; pp. 103–104. https://doi.org/10.1145/1296907.1296909.
23. Krömer, P.; Nowaková, J. Medical Image Analysis with NVIDIA Jetson GPU Modules. In Proceedings of the Advances in Intelligent
Networking and Collaborative Systems; Barolli, L., Chen, H.C., Miwa, H., Eds.; Springer International Publishing: Cham, Switzerland,
2022; pp. 233–242. https://doi.org/10.1007/978-3-030-84910-8_25.
24. Jocher, G. yolov5. 2021. Available online: https://github.com/ultralytics/yolov5 (accessed on 31 January 2022).
25. University, T. Crowd Counting. Available online: http://aiskyeye.com/download/crowd-counting_/ (accessed on 6 June 2022).
26. Zhang, E.; Zhang, Y. Average Precision. In Encyclopedia of Database Systems; Liu, L., Özsu, M.T., Eds.; Springer: Boston, MA, USA,
2009; pp. 192–193. https://doi.org/10.1007/978-0-387-39940-9_482.