IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 15, 2022

YOLOv5-Tassel: Detecting Tassels in RGB UAV Imagery With Improved YOLOv5 Based on Transfer Learning

Wei Liu, Graduate Student Member, IEEE, Karoll Quijano, Graduate Student Member, IEEE, and Melba M. Crawford, Life Fellow, IEEE

Abstract—Unmanned aerial vehicles (UAVs) equipped with lightweight sensors, such as RGB cameras and LiDAR, have significant potential in precision agriculture, including object detection. Tassel detection in maize is an essential trait given its relevance as the beginning of the reproductive stage of growth and development of the plants. However, compared with general object detection, tassel detection based on RGB imagery acquired by UAVs is more challenging due to the small size, time-dependent variable shape, and complexity of the objects of interest. A novel algorithm referred to as YOLOv5-tassel is proposed to detect tassels in UAV-based RGB imagery. A bidirectional feature pyramid network is adopted for the path-aggregation neck to effectively fuse cross-scale features. The robust attention module SimAM is introduced to extract the features of interest before each detection head. An additional detection head is also introduced to improve small-size tassel detection relative to the original YOLOv5. Annotation is performed with guidance from center points derived from CenterNet to improve the selection of the bounding boxes for tassels. Finally, to address the issue of limited reference data, transfer learning based on the VisDrone dataset is adopted. Testing results for the proposed YOLOv5-tassel method achieved an mAP value of 44.7%, which is better than well-known object detection approaches such as FCOS, RetinaNet, and YOLOv5.

Index Terms—CenterNet, SimAM attention module, small tassel detection, transfer learning, YOLOv5.

Manuscript received 30 June 2022; revised 25 August 2022; accepted 28 August 2022. Date of publication 13 September 2022; date of current version 23 September 2022. This work was supported in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Grant DE-AR0000593, and in part by the National Science Foundation (NSF) under NSF Award EEC-1941529. (Corresponding author: Melba Crawford.)
Wei Liu is with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907 USA (e-mail: liu3044@purdue.edu).
Karoll Quijano is with the Department of Environmental and Ecological Engineering, Purdue University, West Lafayette, IN 47907 USA (e-mail: kquijano@purdue.edu).
Melba M. Crawford is with the Lyles School of Civil Engineering and the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907 USA (e-mail: mcrawford@purdue.edu).
Digital Object Identifier 10.1109/JSTARS.2022.3206399
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION

IN RECENT years, both the fundamentals and applications of artificial intelligence have developed rapidly [1]. Object detection, which has been the focus of many studies, has achieved success in many applications, including autonomous driving [2], crowd counting [3], and precision agriculture [4]. With the development of UAV platforms and the introduction of sensors, including LiDAR [5], RGB and multispectral cameras [6], [7], [8], and GNSS/INS solutions [9], [10], [11], the UAV has become a popular and cost-effective technology in precision agriculture. UAV RGB imagery has been demonstrated to be particularly useful for high-throughput phenotyping in plant breeding applications, including information on plant counting, flowering date, and yield prediction [4]. Thus, the application and extension of advanced detection algorithms to cropping systems is a current topic of great interest for researchers.

For maize, flowering is an important trait to monitor, as it defines the end of the vegetative stages and the beginning of the reproductive stages. The importance of tracking tassel development relates to the determination of the starting point for grain filling. A late flowering time typically indicates that the filling and senescence periods would not be adequate for harvesting. In addition, environmental or biological stressors may have a negative impact, thereby reducing the final grain yield. Plant breeders usually consider flowering time variation as one of the physiological traits to assess different varieties [12]. Evaluating the performance of different genotypes in multiple environments or under different management practices has been part of numerous studies in agriculture [13]. In the field, flowering is traditionally monitored manually. This practice is prone to errors, as it is a subjective evaluation that is typically time-consuming and labor-intensive. Automatic detection of the tassels at all stages of development (from early to late) using UAV RGB imagery can potentially improve the evaluation of flowering time variation in maize.

Currently, most object detectors are designed for general object detection, and they perform well on datasets such as VOC, COCO, and ImageNet [14]. In the COCO dataset, objects are categorized as small objects (area < 32²), medium objects (32² < area < 96²), and large objects (area > 96²). However, as more than half of the objects in these three widely used datasets are medium and large-sized, most current detection approaches have struggled with detecting small objects that are densely distributed [15]. Because of these limitations, object detection in UAV imagery has been explored using datasets such as MOR-UAV, UAV123, and VisDrone [16]. However, compared with traditional UAV imagery object detection, tassel detection based on UAV imagery encounters several challenges, including:
1) tassels, which are initially quite small, emerge over a period of time, and their emergence is variety-dependent;
2) the size of each tassel changes over the flowering period and is small relative to the area of the plant, which is dominated by the leaves in the UAV RGB imagery at all stages of development;
3) the size and shape of the tassels vary significantly through the tasseling period;
4) there is a large overlap between neighboring tassels in the later stage of development.

Examples of objects corresponding to the three different datasets are shown in Fig. 1.

Fig. 1. Examples of objects in (a) COCO dataset, (b) VisDrone dataset, and (c) Tassel dataset.

A tassel dataset with accurate annotation and a robust object detection algorithm is required to achieve adequate tassel detection in UAV RGB imagery. A novel modified YOLOv5 architecture, denoted as YOLOv5-tassel, is proposed to detect tassels. The network combines CSPDarknet53 and BiFPN to extract the features effectively. In addition, the SimAM attention mechanism is introduced in the neck module to extract the features of interest before each detection head; four detection heads are utilized to enhance the detection of small tassels. Considering the difficulty of tassel annotation, due to the small size in the early stage and the significant overlap in the later stages, a tassel reference dataset was designed based on guidance from center points predicted by CenterNet. Based on the testing results, the proposed algorithm improved the mAP metric by 2.1% compared with the original YOLOv5 method with transfer learning based on the VisDrone dataset.

In summary, the main contributions of this article are the following.
1) CenterNet-predicted center points [17] provide effective guidance to improve the accuracy of tassel dataset annotation.
2) A bidirectional feature pyramid network (BiFPN) is adopted to fuse feature information across four scales.
3) The SimAM attention mechanism is introduced to extract distinguishable features before the four detection heads, including the one dedicated to quite small objects.
4) Transfer learning based on the VisDrone dataset is utilized to enhance model generalization for a dataset with only a small number of observations.

II. RELATED WORK

General object detection: With the rapid development of deep learning based on labeled datasets, current mainstream object detection algorithms adopt neural networks to obtain detection results. Based on whether a prior anchor is designed in advance, current deep learning object detection algorithms can be categorized as either anchor-free or anchor-based methods. In the past three years, to remove the complexity of the hyperparameters in the anchor design, which increases the generalization difficulty of the model, some researchers have begun to directly use key point regression to obtain the predicted bounding box for object detection, e.g., CornerNet [18], ExtremeNet [19], CenterNet [17], ATSS [20], and YOLOX [21]. In contrast, the anchor-based algorithms use pre-designed anchors as prior information and gradually regress to the target bounding box through the adjustment of parameters obtained by training, such as the YOLO series (YOLOv3 [22], YOLOv4 [23], YOLOF [24], and YOLOv5 [25]), the RCNN series (RCNN [26], Fast R-CNN [27], Faster R-CNN [28], Cascade R-CNN [29], and Mask R-CNN [30]), FCOS [31], and RetinaNet [32]. In general, anchor-based methods have been dominant in the field of object detection over the past several years. Considering model speed and performance, YOLOv5 is adopted as the baseline model in this research.

UAV imagery object detection: With the rapid development of UAV platforms, diverse applications in areas such as precision agriculture, crowd counting, and surveillance are advancing rapidly. However, object detection in UAV RGB imagery is more challenging than general object detection due to variation in viewpoint and scale, lighting conditions, and the high density of the objects [16]. In [33], based on YOLOv5, a transformer block is introduced in both the backbone and the heads to enhance UAV RGB image feature extraction, and another head is added to improve small object detection. In [34], multihead self-attention (MHSA) is embedded into the CSPDarknet block of YOLOv4 to achieve global attention for a 2D feature map. To address the issue of missed detections, [35] proposes a dual neural network. In [36], an ORSIm detector is proposed to efficiently handle image deformations, particularly object scaling and rotation. Furthermore, [37] provides a comprehensive survey on the research progress and prospects of UAV object detection with deep learning.

Object detection in agriculture: With the introduction of UAVs in agriculture, near real-time high-resolution monitoring of crop growth, plant detection and counting, and forecasting of crop yield have become more prevalent. In [38], CenterNet, DetectoRS, and TSD are used to detect tassels in a study of flowering in maize. CenterNet [4] and YOLOv3 are utilized in [39] for counting maize and cotton plants, respectively. In [40], RetinaNet, YOLOv5, and Faster R-CNN are adopted for panicle counting, and a modified version of YOLOv4 is used to detect wheat heads in [41]. In [42], a bag of tricks is explored for wheat head detection based on networks such as YOLOv5x, YOLOv3, EfficientDet-D5, and Faster R-CNN, among others. In addition, studies have investigated deep learning for detecting leaf-level disease [43], [44]. Applying deep learning to UAV RGB imagery in agriculture has advanced both research and operational applications.

Attention mechanism: Inspired by the human visual system, attention mechanisms have been successfully introduced into computer vision to assist in understanding useful features while de-emphasizing nonrelevant information in complex scenes [45], [46], [47]. In image object detection, attention mechanisms can be categorized as 1) channel attention, 2) spatial attention, 3) branch attention, and 4) combined channel and spatial attention. The channel attention module generates the channel attention map by exploiting the interchannel relationships of features (SENet [48] and ECANet [49]). Spatial attention methods, including deformable convolutional networks [50] and transformers [51], [52], [53], extract features of interest based on the relationships between spatial features. Similarly, dynamic branch selection can be used to extract interesting branches based on branch attention; dynamic convolution, which combines different convolution kernels, is the most well-known work [54]. To fully exploit the advantages of channel attention and spatial attention, multiple researchers have proposed fusing channel and spatial attention from different perspectives. Approaches such as CBAM [55] decompose the process and learn channel and spatial attention separately. Three-dimensional attention maps have also been explored to compute 3D weights, as in SimAM [56] and SCNet [57]. The SimAM attention module, which is parameter-free and directly generates a 3D attention map, is used in the proposed network to extract useful features.

III. PROPOSED METHOD

The overall network architecture of the proposed YOLOv5-tassel is shown in Fig. 2. First, the CSPDarknet53 backbone from YOLOv5 is adopted to extract the tassel feature maps. The aggregation of the backbone's feature maps is performed with a BiFPN, rather than the original PANet, in the neck. Next, the SimAM attention module is integrated into the neck. Finally, considering the small size of tassels, four detection heads are utilized to enhance the ability to detect small tassels.

Fig. 2. Network architecture of YOLOv5-tassel. A patch of a UAV RGB image is input to the detection backbone.

For the input image, YOLOv5 scales the original patch to 1280 × 1280. After five 2× down-sampling stages (a total stride of 32), the output feature map size is 40 × 40, denoted as P5. Moreover, the feature maps P2 (320 × 320), P3 (160 × 160), and P4 (80 × 80) are also combined with P5 as the input of the neck module.

A. Feature Fusion Based on BiFPN

Fig. 3. Examples of tassels' development from early stage (left) to late stage (right) for (a) an inbred and (b) a hybrid variety.

The size and shape of tassels vary over the period of flowering. In the early stage, the tassels are small compared with the middle and later stages of development (see Fig. 3). To detect objects at different scales, the feature pyramid network is a popular primary component in detectors [58]. It consists of a bottom-up pathway, a top-down pathway, and lateral connections. PANet [59] introduces another information flow to shorten the information path between the lower and upper layers to boost the fusion of feature maps at different scales. However, with different layer features at different resolutions, the respective contributions to
the fused features should be unequal. To resolve this issue, [60] proposes the BiFPN. Considering this advantage, the proposed method integrates the BiFPN into YOLOv5 to enhance the capability of feature fusion for the four detection layers. Fig. 4 shows the architectural designs of the three different neck styles.

Fig. 4. Different neck architecture designs [60]: FPN, PANet, and BiFPN.

Unlike the three detection heads in YOLOv5, four detection heads are proposed to detect tassels. The additional head improves the ability to detect extremely small tassels, especially those in the early stage. As a result, levels 2–5 comprise the input features P^in = (P_2^in, P_3^in, P_4^in, P_5^in). In FPN, a transformation that aggregates the four different features is designed to output the four-scale fused features P^out = (P_2^out, P_3^out, P_4^out, P_5^out), as follows:

\[
\begin{aligned}
P_5^{out} &= \mathrm{Conv}\left(P_5^{in}\right)\\
P_4^{out} &= \mathrm{Conv}\left(P_4^{in} + \mathrm{Resize}\left(P_5^{out}\right)\right)\\
P_3^{out} &= \mathrm{Conv}\left(P_3^{in} + \mathrm{Resize}\left(P_4^{out}\right)\right)\\
P_2^{out} &= \mathrm{Conv}\left(P_2^{in} + \mathrm{Resize}\left(P_3^{out}\right)\right)
\end{aligned}
\tag{1}
\]

where Resize represents an upsampling or downsampling operation for matching the four different-scale feature maps, and Conv is the convolutional operation for feature extraction.

As BiFPN combines bidirectional cross-scale connections and fast normalized fusion, the feature map output for each layer is as follows:

\[
\begin{aligned}
P_i^{td} &= \mathrm{Conv}\left(\frac{w_1 \cdot P_i^{in} + w_2 \cdot \mathrm{Resize}\left(P_{i+1}^{in}\right)}{w_1 + w_2 + \epsilon}\right)\\
P_i^{out} &= \mathrm{Conv}\left(\frac{w_1' \cdot P_i^{in} + w_2' \cdot P_i^{td} + w_3' \cdot \mathrm{Resize}\left(P_{i-1}^{out}\right)}{w_1' + w_2' + w_3' + \epsilon}\right)
\end{aligned}
\tag{2}
\]

where i = 2, 3, 4, P_i^td is the intermediate feature at level i on the top-down pathway, w and w' are learnable fusion weights, and ε is a small constant. The output feature P_5^out can be described as follows:

\[
P_5^{out} = \mathrm{Conv}\left(\frac{w_1 \cdot P_5^{in} + w_2 \cdot \mathrm{Resize}\left(P_4^{out}\right)}{w_1 + w_2 + \epsilon}\right)
\tag{3}
\]

Based on these operations, the output feature map integrates the input feature map and the intermediate feature map at different scales. Thus, it enhances the feature fusion in the neck module.
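To make the fusion in (2) and (3) concrete, the following is a minimal PyTorch sketch of one BiFPN-style weighted fusion node, assuming the inputs have already been projected to a common channel count; the module name, the ReLU constraint on the weights, and the single 3 × 3 convolution are illustrative choices rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion of same-sized feature maps, as in (2)/(3)."""
    def __init__(self, n_inputs: int, channels: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))  # learnable fusion weights w_i
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feats):
        # feats: list of tensors already resized to a common spatial size
        w = F.relu(self.w)                    # keep the weights non-negative
        w = w / (w.sum() + self.eps)          # fast normalized fusion
        fused = sum(wi * f for wi, f in zip(w, feats))
        return self.conv(fused)

# Example for the top-down step at level 4: fuse P4_in with the upsampled P5.
fuse = WeightedFusion(n_inputs=2, channels=256)
p4_in = torch.randn(1, 256, 80, 80)
p5_in = torch.randn(1, 256, 40, 40)
p4_td = fuse([p4_in, F.interpolate(p5_in, scale_factor=2, mode="nearest")])
```

One such node would be instantiated for every intermediate and output level of the four-scale neck described above.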
B. SimAM Attention Module

Fig. 5. SimAM with full 3D weights for attention [56].

Currently, attention modules are widely used in deep learning to enhance feature extraction and achieve good performance, independent of the network architecture. However, most current attention modules have two drawbacks. First, they typically refine feature maps along the channel and spatial dimensions separately, and learning the attention weights simultaneously over channels and space is challenging. Second, some attention modules rely on hyperparameters that require rich expert knowledge to ensure performance. To circumvent these issues, SimAM, which is parameter-free [56], is proposed with 3D attention weights. Fig. 5 shows the attention mechanism. In this paper, SimAM is embedded into the proposed modified YOLOv5 model to enhance the performance of tassel detection. SimAM is derived from neuroscience theory; it extracts the essential features based on an energy function. The energy function for each neuron is as follows:

\[
e_t(w_t, b_t, \mathbf{y}, x_i) = \frac{1}{M-1}\sum_{i=1}^{M-1}\left(-1-\left(w_t x_i + b_t\right)\right)^2 + \left(1-\left(w_t t + b_t\right)\right)^2 + \lambda w_t^2
\tag{4}
\]

where t and x_i are the target neuron and the other neurons in a channel of the input feature X ∈ R^(C×H×W), and i indexes the spatial dimension. M = H × W is the number of neurons in each channel. The transformation weight and bias, w_t and b_t, are expressed as follows:

\[
\begin{aligned}
w_t &= -\frac{2\left(t-\mu_t\right)}{\left(t-\mu_t\right)^2 + \frac{2}{M-1}\sum_{i=1}^{M-1}\left(x_i-\mu_t\right)^2 + 2\lambda}\\
b_t &= -\frac{1}{2}\left(t+\mu_t\right)w_t\\
\mu_t &= \frac{1}{M-1}\sum_{i=1}^{M-1} x_i
\end{aligned}
\tag{5}
\]

where the mean μ_t is calculated over all the neurons in the channel except t.
Based on (5), the minimal energy can be obtained as follows:

\[
e_t^{*} = \frac{4\left(\frac{1}{M}\sum_{i=1}^{M}\left(x_i-\hat{\mu}\right)^2 + \lambda\right)}{\left(t-\hat{\mu}\right)^2 + \frac{2}{M}\sum_{i=1}^{M}\left(x_i-\hat{\mu}\right)^2 + 2\lambda},
\qquad
\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} x_i
\tag{6}
\]

assuming all the pixels in a channel follow the same distribution (which saves computation). Based on (6), the importance weight for each neuron is 1/e_t^*. Consequently, the SimAM attention module can be described as follows:

\[
\widetilde{X} = \mathrm{sigmoid}\left(\frac{1}{E}\right) \odot X
\tag{7}
\]

where E groups all e_t^* over the channel and spatial dimensions. The sigmoid function is used to avoid a weight value that is too large.
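As a reference for (4)–(7), the following is a short PyTorch sketch of the parameter-free SimAM weighting, written directly from the published closed-form solution; the function name and the default value of λ are illustrative.

```python
import torch

def simam(x: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Parameter-free SimAM attention: weight each neuron by sigmoid(1/e_t*)."""
    _, _, h, w = x.shape
    n = h * w - 1                                  # M - 1 neurons besides the target
    mu = x.mean(dim=(2, 3), keepdim=True)          # per-channel spatial mean
    d = (x - mu) ** 2                              # (x_i - mu)^2 for every neuron
    var = d.sum(dim=(2, 3), keepdim=True) / n      # per-channel variance estimate
    inv_energy = d / (4 * (var + lam)) + 0.5       # proportional to 1/e_t* per neuron
    return x * torch.sigmoid(inv_energy)           # Eq. (7): sigmoid(1/E) applied to X

# Usage: refine a neck feature map before one of the four detection heads.
feat = torch.randn(1, 256, 160, 160)
refined = simam(feat)
```

Because the weighting has no learnable parameters, the same function can be inserted before each of the four detection heads without changing the training procedure.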
IV. EXPERIMENTAL RESULTS

A. Description of the Experiment

Fig. 6. UAV platform for image acquisition: (A) Velodyne VLP-16 Lite, (B) Applanix APX-15v3, (C) Headwall Nano Hyperspec (VNIR), and (D) Sony Alpha 7RIII.

Fig. 7. Experiment location and layout: HIPS 2021 at ACRE.

Data acquisition: The UAV platform used to acquire data for these experiments was a DJI Matrice M600 Pro, as shown in Fig. 6. It was equipped with a Velodyne VLP-16 Lite, an Applanix APX-15v3, a Headwall Nano Hyperspec (VNIR), and a Sony Alpha 7RIII camera. The images were collected from the high-intensity phenotype sites (HIPS) experiment at Purdue University's Agronomy Center for Research and Education (ACRE) in Indiana, USA (see Fig. 7). The data were collected during the 2020 and 2021 growing seasons. Two replications of both inbred and hybrid varieties were planted in a two-row plot layout with a plant population of 30,000 plants per acre. The UAV was flown at a height of 20 m to capture the images, and the RGB imagery was processed to a 0.25 cm pixel resolution orthophoto. The method in [61] was used to generate the orthomosaics to reduce pixelization and distortion of the tassels. The experiment's location, layout, and details for the 2020 data are described in [38]. For the 2021 growing season, the HIPS experiment was planted on May 24th, and imagery from three dates during tasseling in the early stage of flowering was collected for annotation (July 19th, July 21st, and July 23rd). For the 2020 HIPS experiment, the data were annotated in the middle and later stages of tassel development. Compared with the 2020 HIPS annotation data, the tassels were much smaller in 2021, as they were mainly in the early stage. The large percentage of small-size tassels was useful for verifying the performance of the proposed algorithm.

Fig. 8. Information transformation from plot into patch.

Data annotation: Object detection based on deep learning is a data-driven approach in which the network relies heavily on the dataset used to train the model. Creating the dataset accurately and efficiently is not a trivial task, particularly in a complex agricultural setting. Initially, the orthophoto for the HIPS experiment was a large file with 0.25 cm spatial resolution, making it difficult to train the model directly. Thus, the orthophoto was split into small patches prior to training. The row segments were extracted using the COPE method [62], thereby providing plot boundary information. Fig. 8 shows the bottom-left coordinates (x0, y0), top-right coordinates (x1, y1), plot identification number (plot_ID), and row in plot (row_in_plot) for each row segment. For the training model, two row segments with the same plot_ID are set as one input image patch. However, as the sizes of the row segments differ, splicing two row segments with the same plot_ID often results in the shape of the generated patch not being a rectangle. To create a final rectangular patch, the plot coordinates of the row and column pairs are averaged, and the patches of the original HIPS orthophoto are generated with an approximate size of 620 × 2100.
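The following is one plausible reading of the patch-generation step above, sketched in Python with NumPy: the two row segments sharing a plot_ID are given a common along-row extent by averaging their endpoint coordinates so that the spliced patch is rectangular. The field names and the pixel-coordinate convention are assumptions, not the released pipeline.

```python
import numpy as np

def plot_patch(ortho: np.ndarray, seg_a: dict, seg_b: dict) -> np.ndarray:
    """Crop one rectangular plot patch from the orthophoto given two row segments.

    Each segment is assumed to provide pixel corners, bottom-left (x0, y0) and
    top-right (x1, y1), and to share the same plot_ID (checked by the caller).
    """
    # Average the along-row extents so both rows share one rectangular window.
    y0 = int(round((seg_a["y0"] + seg_b["y0"]) / 2))
    y1 = int(round((seg_a["y1"] + seg_b["y1"]) / 2))
    # Span both rows of the plot in the across-row direction.
    x0 = min(seg_a["x0"], seg_b["x0"])
    x1 = max(seg_a["x1"], seg_b["x1"])
    return ortho[y0:y1, x0:x1]
```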
Compared with general object annotation, tassel annotation is complicated, as tassels are present at small sizes in a high-density environment with a large amount of occlusion. In addition, multiple annotators, even when trained, have different perspectives when defining a given bounding box. In the process of dataset labeling, minimizing labeling inconsistency is crucial. To the best of our knowledge, a satisfactory method to address these issues does not exist in the area of plant annotation. Here,
the center points predicted by CenterNet are used as a preprocessing stage to identify tassels and guide the tassel bounding box annotation. Annotators can also identify missing or mistaken tassels designated by CenterNet. Karami et al. [38] describe the corresponding training parameter settings and the details of the point annotations. The tassels' center points for each image patch are predicted based on the pretrained weights. With the supervision of the center point information for each patch, the bounding boxes of each patch are annotated using LabelMe [63]. After the annotation, the tassel dataset is reviewed multiple times to ensure a high-quality dataset.
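A minimal sketch of how the CenterNet guidance could seed the LabelMe annotation is given below: each predicted center is expanded to a small square box that annotators then adjust, delete, or supplement. The output keys are a reduced subset of the LabelMe JSON schema, and the initial box size is an arbitrary placeholder rather than a value from the paper.

```python
import json

def centers_to_labelme(centers, image_path, image_hw, init_size=40):
    """Turn predicted tassel centers into editable rectangle shapes for LabelMe."""
    h, w = image_hw
    shapes = []
    for cx, cy in centers:                        # centers in pixel coordinates
        half = init_size / 2.0
        x0, y0 = max(cx - half, 0), max(cy - half, 0)
        x1, y1 = min(cx + half, w), min(cy + half, h)
        shapes.append({"label": "tassel",
                       "shape_type": "rectangle",
                       "points": [[x0, y0], [x1, y1]]})
    return {"imagePath": image_path, "imageHeight": h,
            "imageWidth": w, "shapes": shapes}

# Example (pred_centers comes from the CenterNet inference step):
# with open("plot_0001.json", "w") as f:
#     json.dump(centers_to_labelme(pred_centers, "plot_0001.png", (2100, 620)), f)
```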
Implementation details: The dataset was split randomly into three parts (training, validation, and test sets) with a ratio of 60%:20%:20%, considering the small size of the dataset. All experiments were conducted on the annotated tassel dataset. YOLOv5 uses AP@0.5 and mAP@0.5:0.95 to select the best-trained weights, with corresponding weights of 0.1 and 0.9, respectively; as a result, mAP@0.5:0.95 is considered the most critical metric for tassel detection. The proposed method was implemented in PyTorch, and the network was trained based on transfer learning from the VisDrone dataset. The network was trained on two NVIDIA Quadro RTX 6000 GPUs, each with 22 GB of RAM. The corresponding versions of PyTorch and CUDA were 1.10.1 and 10.2, respectively. The dimensions of the input image of each patch were 1280 × 1280. The Adam optimizer was utilized with a learning rate of 1e-4. The total training loss is described as follows:

\[
\mathrm{Loss} = \alpha\,\mathrm{Loss}_{box} + \beta\,\mathrm{Loss}_{obj} + \gamma\,\mathrm{Loss}_{cls}
\tag{8}
\]

where Loss_box is the localization loss, Loss_obj is the confidence loss, and Loss_cls is the classification loss. The loss weights α, β, and γ are set to 0.05, 1.0, and 0.18, respectively. Loss_obj and Loss_cls are calculated with binary cross-entropy loss, while Loss_box is calculated with the CIoU loss [64].
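The composite loss in (8) can be written as the short PyTorch sketch below; the plain IoU term is a simplified stand-in for the CIoU loss of [64], and the matched prediction/target tensors are assumed to follow the YOLOv5 convention.

```python
import torch
import torch.nn as nn

def box_iou_xyxy(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Plain IoU between matched (x1, y1, x2, y2) boxes; a simplified stand-in
    for the CIoU term of [64] used in the actual training loss."""
    inter_w = (torch.min(a[:, 2], b[:, 2]) - torch.max(a[:, 0], b[:, 0])).clamp(min=0)
    inter_h = (torch.min(a[:, 3], b[:, 3]) - torch.max(a[:, 1], b[:, 1])).clamp(min=0)
    inter = inter_w * inter_h
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-7)

bce = nn.BCEWithLogitsLoss()

def total_loss(pred_box, tgt_box, pred_obj, tgt_obj, pred_cls, tgt_cls,
               alpha=0.05, beta=1.0, gamma=0.18):
    """Weighted sum of box, objectness, and class losses, as in Eq. (8)."""
    loss_box = (1.0 - box_iou_xyxy(pred_box, tgt_box)).mean()  # localization term
    loss_obj = bce(pred_obj, tgt_obj)                          # confidence term
    loss_cls = bce(pred_cls, tgt_cls)                          # classification term
    return alpha * loss_box + beta * loss_obj + gamma * loss_cls
```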
TABLE I. HYPERPARAMETER OF ANCHOR SIZE

YOLOv5 obtains the anchor hyperparameters using the autoanchor calculation module, which is based on K-means clustering with a genetic algorithm. The autoanchor module is widely used for object detection with custom datasets; however, as noted previously, the tassels at all growing stages are small relative to the general objects in other custom datasets, and after testing, the automated anchor box generator's performance on the modified YOLOv5-tassel network declined. Through extensive experiments, the anchor sizes of the four detection heads were finalized as shown in Table I. Among the five submodels included in YOLOv5 (YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x), the yaml file is nearly the same, except for the hyperparameters depth_multiple and width_multiple. After comprehensively investigating the models' accuracy and inference speed, YOLOv5l was chosen as the baseline, with both depth_multiple and width_multiple set to 1.
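The K-means idea behind the autoanchor step can be illustrated as follows; this sketch clusters labeled box sizes only and omits YOLOv5's genetic refinement as well as the hand-tuned sizes the paper ultimately adopts in Table I.

```python
import numpy as np

def kmeans_anchors(wh: np.ndarray, k: int = 12, iters: int = 100, seed: int = 0) -> np.ndarray:
    """Cluster labeled box (width, height) pairs into k anchor sizes (3 per head here)."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(wh[:, None, :] - anchors[None, :, :], axis=-1)
        assign = d.argmin(axis=1)                      # nearest anchor for each box
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]   # sort by area: small to large heads
```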

TABLE II. TASSEL DETECTION PERFORMANCE COMPARISON WITH DIFFERENT PRETRAINED DATASETS

The number of images in the annotated tassel dataset was 742, and the total number of tassels in these images was 10,249. This size could be limiting for deep learning, as it represents 0.24% of the MS COCO dataset and 7.23% of the VisDrone dataset. Transfer learning was therefore implemented to train the proposed model with information from the VisDrone dataset, which was selected given that both datasets were acquired by UAVs. Previous research on object detection in crops based on UAV imagery ignores the difference between the pretraining datasets [4], [38] and those being analyzed. Table II compares the performance of tassel detection achieved using the COCO pretrained weights and the VisDrone pretrained weights. Even though the AP50 metric declined by 0.3, the mAP metric improved by 1.8. As previously mentioned, the weight of mAP is much higher than that of AP50, which demonstrates that pretrained weights from the VisDrone dataset enhanced the performance of tassel detection compared with those from the COCO dataset.
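A hedged sketch of this transfer-learning setup is shown below: a VisDrone-pretrained YOLOv5 checkpoint is loaded with non-strict matching so that layers whose shapes differ (e.g., class-specific outputs) fall back to random initialization before fine-tuning on the tassel data. The checkpoint path, key layout, and the model object are assumptions, not artifacts released with the paper.

```python
import torch

# `model` is assumed to be the YOLOv5-tassel network defined elsewhere.
ckpt = torch.load("yolov5l_visdrone.pt", map_location="cpu")        # hypothetical path
state = ckpt["model"].float().state_dict() if "model" in ckpt else ckpt
missing, unexpected = model.load_state_dict(state, strict=False)    # skip mismatched layers
print(f"{len(missing)} missing / {len(unexpected)} unexpected tensors")

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)           # Adam, lr = 1e-4 (Sec. IV-A)
```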

B. Comparison to State-of-the-Art

The state-of-the-art methods used to evaluate performance on the tassel dataset were implemented with the MMDetection toolbox [65]. Table III contains the results, in which the proposed algorithm shows outstanding performance in the mAP metric, which is nearly double that of most methods tested.

TABLE III. COMPARISON OF RESULTS OF BASELINE METHODS ON THE TASSEL DATASET

C. Ablation Study

The performance of the proposed YOLOv5-tassel algorithm was evaluated with a thorough ablation study (see Table IV). An additional detection head embedded into the original YOLOv5 improved the mAP from 42.6% to 43.6%, with the number of parameters increasing by only 2.6%. The introduction of BiFPN and SimAM boosted the mAP metric by 0.2% and 0.9%, respectively. In total, the proposed YOLOv5-tassel improved the mAP by 2.1%, with the size of the parameter set increasing by 12.4%. These results show that the addition of the attention mechanism significantly influences the detection of small tassels, as it enhances AP50 and mAP by 0.5% and 0.9%, respectively. To evaluate the model's complexity, the floating-point operations (FLOPs) were determined. The FLOPs value of the original YOLOv5l was 107.8, while that of the proposed model was 143.4, an increase of nearly 33%.

TABLE IV. ABLATION TEST RESULTS FOR YOLOV5-TASSEL

The success of the transformer encoder block, with its MHSA mechanism, has motivated researchers to embed it in the backbone to enhance feature map extraction (see Fig. 9). In [33], [34], the transformer block is embedded into the end of the CSPDarknet53 for UAV imagery object detection, improving the mAP metric. Thus, a comparison experiment was conducted to evaluate the effectiveness of the transformer encoder block in tassel detection based on YOLOv5-tassel. As shown in Table IV, compared with the proposed method, the number of parameters increased by 7.2 million after embedding the transformer block, while the AP50 and mAP metrics decreased by 0.6% and 0.4%, respectively. There are two limitations to the detection performance of a transformer block on the tassel dataset. First, transformer blocks usually perform better on larger datasets, as their global attention mechanism requires more parameters, whereas the tassel dataset is relatively small; in addition, at the end of the backbone the small-sized tassels are down-sampled to 1/32 of the input resolution in the feature map, making global attention less effective. Second, the transformer block lacks image-specific inductive biases, namely the two-dimensional neighborhood structure and translation equivariance.

Fig. 9. Transformer encoder block.

Table V compares the detection performance with different attention modules, including shuffle attention, CBAM, SELayer, ECALayer, and SimAM. It shows that the SimAM attention module achieves the best AP50 and mAP metrics.

TABLE V. DETECTION PERFORMANCE COMPARISON WITH DIFFERENT ATTENTION MODULES

D. Visualization of the Prediction Results

Besides the quantitative comparison of the proposed YOLOv5-tassel through the AP50 and mAP metrics, examples of the detection results are shown in Fig. 10. From (a) to (f), the size and shape of the tassels vary over the period of flowering. Based on (a) and (b), the tassels can be detected precisely even though they are small. In the middle stage, the proposed detection algorithm also performs very well, as shown in (c) and (d). In the later stage, shown in (e) and (f), although there is overlap between neighboring tassels at high density, nearly all tassels are still detected successfully. Overall, the precision and robustness of the proposed algorithm are clearly illustrated.

Fig. 10. Examples of visualized detection results on the test data.
V. CONCLUSION

A novel algorithm referred to as YOLOv5-tassel is developed to improve tassel detection. Four detection heads and a BiFPN are adopted to enhance feature fusion for small tassel detection. In addition, the SimAM attention mechanism is introduced to extract the interesting parts of the feature map. Remarkably, the mAP of YOLOv5 is boosted to 44.7% on the tassel dataset, an improvement of 2.1%. The results also demonstrate that transfer learning based on the VisDrone dataset, compared to the traditional COCO data, can enhance tassel detection based on UAV RGB imagery. In addition, CenterNet is utilized to provide a reference for the tassel dataset annotation. This study's results will help provide a foundation for further development of the detection of objects of interest in agriculture based on RGB imagery acquired by UAVs.

Further work is merited in the following areas: introducing deformable convolutional networks to address the problem of tassel shape variation, combining the benefits of CNNs with the transformer in the network architecture to further enhance tassel detection, and adopting unsupervised learning with domain adaptation to detect tassels with only unlabeled data in the target domain.

ACKNOWLEDGMENT

The authors would like to thank Taojun Wang, Claudia Aviles, An-te Huang, Purnima Jayaraj, and Franciele Marques Tolentino for their contributions to data collection and annotation.

REFERENCES

[1] Q. Zhou, D. Zhao, B. Shuai, Y. Li, H. Williams, and H. Xu, "Knowledge implementation and transfer with an adaptive learning network for real-time power management of the plug-in hybrid vehicle," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 12, pp. 5298–5308, Dec. 2021.
[2] R. Xu, H. Xiang, Z. Tu, X. Xia, M.-H. Yang, and J. Ma, "V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer," in Proc. Eur. Conf. Comput. Vis., 2022.
[3] X. Chen, H. Yan, T. Li, J. Xu, and F. Zhu, "Adversarial scale-adaptive neural network for crowd counting," Neurocomputing, vol. 450, pp. 14–24, 2021.
[4] A. Karami, M. Crawford, and E. J. Delp, "Automatic plant counting and location based on a few-shot learning technique," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, pp. 5872–5886, 2020.
[5] F. Lu et al., "HRegNet: A hierarchical network for large-scale outdoor LiDAR point cloud registration," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 15994–16003.
[6] C. Papaioannidis, I. Mademlis, and I. Pitas, "Autonomous UAV safety by visual human crowd detection using multi-task deep neural networks," in Proc. IEEE Int. Conf. Robot. Automat., 2021, pp. 11074–11080.
[7] W. Liu, L. Xiong, X. Xia, Y. Lu, L. Gao, and S. Song, "Vision-aided intelligent vehicle sideslip angle estimation based on a dynamic model," IET Intell. Transport Syst., vol. 14, no. 10, pp. 1183–1189, 2020.
[8] D. Hong, N. Yokoya, J. Chanussot, and X. X. Zhu, "An augmented linear mixing model to address spectral variability for hyperspectral unmixing," IEEE Trans. Image Process., vol. 28, no. 4, pp. 1923–1938, Apr. 2019.
[9] L. Ruan et al., "Cooperative relative localization for UAV swarm in GNSS-denied environment: A coalition formation game approach," IEEE Internet Things J., vol. 9, no. 13, pp. 11560–11577, Jul. 2022.
[10] W. Liu, X. Xia, L. Xiong, Y. Lu, L. Gao, and Z. Yu, "Automated vehicle sideslip angle estimation considering signal measurement characteristic," IEEE Sensors J., vol. 21, no. 19, pp. 21675–21687, Oct. 2021.
[11] L. Xiong et al., "IMU-based automated vehicle body sideslip angle and attitude estimation aided by GNSS using parallel adaptive Kalman filters," IEEE Trans. Veh. Technol., vol. 69, no. 10, pp. 10668–10680, Oct. 2020.
[12] E. Durand et al., "Flowering time in maize: Linkage and epistasis at a major effect locus," Genetics, vol. 190, no. 4, pp. 1547–1562, 2012.
[13] M. L. Buchaillot et al., "Evaluating maize genotype performance under low nitrogen conditions using RGB UAV phenotyping techniques," Sensors, vol. 19, no. 8, 2019, Art. no. 1815.
[14] L. Liu et al., "Deep learning for generic object detection: A survey," Int. J. Comput. Vis., vol. 128, no. 2, pp. 261–318, 2020.
[15] G. Chen et al., "A survey of the four pillars for small object detection: Multiscale representation, contextual information, super-resolution, and region proposal," IEEE Trans. Syst., Man, Cybern.: Syst., vol. 52, no. 2, pp. 936–953, Feb. 2020.
[16] P. Zhu et al., "Detection and tracking meet drones challenge," 2020, arXiv:2001.06303.
[17] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "CenterNet: Keypoint triplets for object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 6569–6578.
[18] H. Law and J. Deng, "CornerNet: Detecting objects as paired keypoints," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 734–750.
[19] X. Zhou, J. Zhuo, and P. Krahenbuhl, "Bottom-up object detection by grouping extreme and center points," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 850–859.
[20] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, "Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9756–9765.
[21] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, "YOLOX: Exceeding YOLO series in 2021," 2021, arXiv:2107.08430.
[22] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[23] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[24] Q. Chen, Y. Wang, T. Yang, X. Zhang, J. Cheng, and J. Sun, "You only look one-level feature," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 13034–13043.
[25] G. Jocher, "YOLOv5," 2020. [Online]. Available: https://github.com/ultralytics/yolov5
[26] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, 2013.
[27] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1440–1448.
[28] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, vol. 28, pp. 1137–1149.
[29] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in Proc. IEEE Int. Conf. Comput. Vis., 2018, pp. 6154–6162.
[30] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2980–2988.
[31] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 9626–9635.
[32] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2980–2988.
[33] X. Zhu, S. Lyu, X. Wang, and Q. Zhao, "TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 2778–2788.
[34] Z. Zhang, X. Lu, G. Cao, Y. Yang, L. Jiao, and F. Liu, "ViT-YOLO: Transformer-based YOLO for object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 2799–2808.
[35] G. Tian, J. Liu, and W. Yang, "A dual neural network for object detection in UAV images," Neurocomputing, vol. 443, pp. 292–301, 2021.
[36] X. Wu, D. Hong, J. Tian, J. Chanussot, W. Li, and R. Tao, "ORSIm detector: A novel object detection framework in optical remote sensing imagery using spatial-frequency channel features," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 7, pp. 5146–5158, Jul. 2019.
[37] X. Wu, W. Li, D. Hong, R. Tao, and Q. Du, "Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey," IEEE Geosci. Remote Sens. Mag., vol. 10, no. 1, pp. 91–124, Mar. 2022.
[38] A. Karami, K. Quijano, and M. Crawford, "Advancing tassel detection and counting: Annotation and algorithms," Remote Sens., vol. 13, no. 15, 2021, Art. no. 2881.
[39] S. Oh et al., "Plant counting of cotton from UAS imagery using deep learning-based object detection framework," Remote Sens., vol. 12, no. 18, 2020, Art. no. 2981.
[40] E. Cai, S. Baireddy, C. Yang, E. J. Delp, and M. Crawford, "Panicle counting in UAV images for estimating flowering time in sorghum," in Proc. IEEE Int. Geosci. Remote Sens. Symp., 2021, pp. 6280–6283.
[41] B. Gong, D. Ergu, Y. Cai, and B. Ma, "Real-time detection for wheat head applying deep neural network," Sensors, vol. 21, no. 1, 2020, Art. no. 191.
[42] Y. Wu, Y. Hu, and L. Li, "BTWD: Bag of tricks for wheat detection," in Proc. Eur. Conf. Comput. Vis., Springer, 2020, pp. 450–460.
[43] E. C. Tetila et al., "Automatic recognition of soybean leaf diseases using UAV images and deep convolutional neural networks," IEEE Geosci. Remote Sens. Lett., vol. 17, no. 5, pp. 903–907, May 2020.
[44] M. Bhandari et al., "Assessing winter wheat foliage disease severity using aerial imagery acquired from small unmanned aerial vehicle," Comput. Electron. Agriculture, vol. 176, 2020, Art. no. 105665.
[45] M.-H. Guo et al., "Attention mechanisms in computer vision: A survey," Comput. Visual Media, vol. 8, pp. 331–368, 2022.
[46] H. Cao, G. Chen, J. Xia, G. Zhuang, and A. Knoll, "Fusion-based feature attention gate component for vehicle detection based on event camera," IEEE Sensors J., vol. 21, no. 21, pp. 24540–24548, Nov. 2021.
[47] G. Chen, H. Cao, J. Conradt, H. Tang, F. Rohrbein, and A. Knoll, "Event-based neuromorphic vision for autonomous driving: A paradigm shift for bio-inspired visual sensing and perception," IEEE Signal Process. Mag., vol. 37, no. 4, pp. 34–49, Jul. 2020.
[48] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[49] H. Xue, M. Sun, and Y. Liang, "ECANet: Explicit cyclic attention-based network for video saliency prediction," Neurocomputing, vol. 468, pp. 233–244, 2022.
[50] J. Dai et al., "Deformable convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 764–773.
[51] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in Proc. Int. Conf. Learn. Representations, 2021. [Online]. Available: https://openreview.net/forum?id=YicbFdNTTy
[52] R. Xu, Z. Tu, H. Xiang, W. Shao, B. Zhou, and J. Ma, "CoBEVT: Cooperative bird's eye view semantic segmentation with sparse transformers," 2022, arXiv:2207.02202.
[53] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 10012–10022.
[54] Y. Chen, X. Dai, M. Liu, D. Chen, L. Yuan, and Z. Liu, "Dynamic convolution: Attention over convolution kernels," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 11030–11039.
[55] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 3–19.
[56] L. Yang, R.-Y. Zhang, L. Li, and X. Xie, "SimAM: A simple, parameter-free attention module for convolutional neural networks," in Proc. Int. Conf. Mach. Learn., 2021, pp. 11863–11874.
[57] K. Han et al., "SCNet: Learning semantic correspondence," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1849–1858.
[58] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2017, pp. 936–944.
[59] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8759–8768.
[60] M. Tan, R. Pang, and Q. V. Le, "EfficientDet: Scalable and efficient object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10778–10787.
[61] Y.-C. Lin, T. Zhou, T. Wang, M. Crawford, and A. Habib, "New orthophoto generation strategies from UAV and ground remote sensing platforms for high-throughput phenotyping," Remote Sens., vol. 13, no. 5, 2021, Art. no. 860.
[62] C. Yang, S. Baireddy, E. Cai, M. Crawford, and E. J. Delp, "Field-based plot extraction using UAV RGB images," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 1390–1398.
[63] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A database and web-based tool for image annotation," Int. J. Comput. Vis., vol. 77, no. 1, pp. 157–173, 2008.
[64] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, "Distance-IoU loss: Faster and better learning for bounding box regression," in Proc. AAAI Conf. Artif. Intell., 2020, vol. 34, no. 07, pp. 12993–13000.
[65] K. Chen et al., "MMDetection: Open MMLab detection toolbox and benchmark," 2019, arXiv:1906.07155.
[66] Q.-L. Zhang and Y.-B. Yang, "SA-Net: Shuffle attention for deep convolutional neural networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2021, pp. 2235–2239.

Wei Liu (Graduate Student Member, IEEE) received the B.E. degree in vehicle engineering from the Wuhan University of Technology, Wuhan, China, and the M.E. degree in vehicle engineering from Tongji University, Shanghai, China. He is currently working toward the Ph.D. degree in electrical and computer engineering with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA. His research interests include deep learning, computer vision, autonomous driving, and UAV remote sensing.

Karoll Quijano (Graduate Student Member, IEEE) received the B.S. degree in environmental engineering from the Universidad Distrital Francisco José de Caldas, Bogota, Colombia, in 2016, and the M.S. degree in environmental and ecological engineering from Purdue University, West Lafayette, IN, USA, in 2020. She is currently working toward the Ph.D. degree in environmental and ecological engineering with Purdue University. Her research interests include remote sensing for agriculture, UAV-based hyperspectral imagery and LiDAR, precision agriculture, and crop growth modeling.

Melba M. Crawford (Life Fellow, IEEE) is a Nancy Uridil and Francis Bossu Professor of Civil Engineering with Purdue University, West Lafayette, IN, USA, where she is also a Professor in the Schools of Electrical and Computer Engineering and the Department of Agronomy. Previously, she was an Engineering Foundation Endowed Professor in Mechanical Engineering with the University of Texas at Austin, Austin, TX, USA, where she founded an interdisciplinary research and applications development program in space-based and airborne remote sensing. She has authored or coauthored more than 200 publications in scientific journals, conference proceedings, book chapters, and technical reports. Her research focuses on the development of machine learning-based algorithms for classification and prediction, and applications of these methods to hyperspectral and LiDAR remotely sensed data. Dr. Crawford is a Fellow and Life Member of the IEEE, Past President of the IEEE Geoscience and Remote Sensing Society (GRSS), an IEEE GRSS Distinguished Lecturer, and past Treasurer of the IEEE Technical Activities Board. She received the GRSS Outstanding Service Award in 2020 and the IEEE GRSS David Landgrebe Research Award in 2021.