
agronomy

Article
A Cucumber Leaf Disease Severity Grading Method in Natural
Environment Based on the Fusion of TRNet and U-Net
Hui Yao 1,2, Chunshan Wang 1,2,3, Lijie Zhang 4,*, Jiuxi Li 4, Bo Liu 1,2 and Fangfang Liang 1,2

1 School of Information Science and Technology, Hebei Agricultural University, Baoding 071001, China;
gshanhui@163.com (H.Y.); chunshan9701@163.com (C.W.); boliu@hebau.edu.cn (B.L.);
liangfangfang@hebau.edu.cn (F.L.)
2 Hebei Key Laboratory of Agricultural Big Data, Baoding 071001, China
3 National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
4 College of Mechanical and Electrical Engineering, Hebei Agricultural University, Baoding 071001, China;
lijiuxi@163.com
* Correspondence: ljzhang@ysu.edu.cn

Abstract: Disease severity grading is the primary decision-making basis for the amount of pesticide usage in vegetable disease prevention and control. Based on deep learning, this paper proposed an integrated framework, which automatically segments the target leaf and disease spots in cucumber images using different semantic segmentation networks and then calculates the area of disease spots and the target leaf for disease severity grading. Two independent datasets of leaves and lesions were constructed, which served as the training set for the first-stage diseased leaf segmentation and the second-stage lesion segmentation models. The leaf dataset contains 1140 images, and the lesion dataset contains 405 images. The proposed TRNet was composed of a convolutional network and a Transformer network and achieved an accuracy of 93.94% by fusing local features and global features for leaf segmentation. In the second stage, U-Net (ResNet50 as the feature network) was used for lesion segmentation, and a Dice coefficient of 68.14% was obtained. After integrating TRNet and U-Net, a Dice coefficient of 68.83% was obtained. Overall, the two-stage segmentation network achieved an average accuracy of 94.49% and 94.43% in the severity grading of cucumber downy mildew and cucumber anthracnose, respectively. Compared with DUNet and BLSNet, the average accuracy of TUNet in cucumber downy mildew and cucumber anthracnose severity classification increased by 4.71% and 8.08%, respectively. The proposed model showed a strong capability in segmenting cucumber leaves and disease spots at the pixel level, providing a feasible method for evaluating the severity of cucumber downy mildew and anthracnose.

Keywords: cucumber disease; disease spot; fusion of TRNet and U-Net; two-stage segmentation framework; disease severity grading

Citation: Yao, H.; Wang, C.; Zhang, L.; Li, J.; Liu, B.; Liang, F. A Cucumber Leaf Disease Severity Grading Method in Natural Environment Based on the Fusion of TRNet and U-Net. Agronomy 2024, 14, 72. https://doi.org/10.3390/agronomy14010072

Academic Editor: Yanbo Huang

Received: 2 December 2023
Revised: 24 December 2023
Accepted: 26 December 2023
Published: 27 December 2023

1. Introduction
The cucumber is a kind of low-calorie vegetable containing a variety of vitamins and minerals and is widely grown all over the world. According to statistics from the Food and Agriculture Organization, the global cucumber planting area in 2020 was approximately 2.25 million hectares, with a yield of about 90.35 million tons, making it the third largest vegetable crop in the world [1]. The cucumber has a long growth cycle and may be affected by a number of diseases. As reported, there are 25 common cucumber diseases, among which 18 (>70%) are leaf diseases.
It can be seen that the prevention and control of cucumber diseases are fraught with enormous challenges. First of all, they must address three key questions: What is the type of disease? Where are the lesions in infected leaves? How severe is the disease? The requirements raised by these three aspects perfectly align with the classification, detection, and segmentation tasks in computer vision. Specifically, “what” corresponds to
the classification task: after inputting a disease image into the model, it is expected to output
the category label to which the image belongs based on the detected features. “Where”
corresponds to the object detection and localization task: the model is not only required
to identify the type of disease present in the image but also to indicate the location of the
abnormality. “How” corresponds to the semantic segmentation task: through semantic
segmentation, the model is expected to output a series of useful information, such as the
size and location of disease spots, in order to comprehensively evaluate the disease severity
and guide subsequent pesticide usage.
Among the three questions above, there have been a number of studies focusing on
the classification and detection tasks related to the prevention and control of crop diseases,
and fruitful results have been achieved. For example, Yang et al. [2] optimized GoogLeNet
for rice disease detection and achieved an accuracy of 99.58%. Muhammad et al. [3,4]
constructed multiple convolutional neural network (CNN) structures and reported that
the Xception and DenseNet architectures delivered better performance in multi-label plant
disease classification. Zhang et al. [5] proposed to use residual paths instead of the original
MU-Net skip connection in U-Net to segment diseased leaves of corn and cucumber, and
the segmentation accuracy was significantly improved. Bhagat et al. [6,7] constructed the
Eff-UNet++ model, which uses EfficientNet-B4 as the encoder and a modified UNet++ as
the decoder. This model achieved a Dice coefficient of 83.44% in the segmentation of leaves
in the KOMATSUNA dataset.
However, there are only a few studies on the severity grading of diseases. Assessing and
grading disease severity has important practical significance because it directly affects
the formulation of control plans and the prediction of crop losses and provides a basis for
variable spraying. The existing disease severity grading methods can be divided into two
categories. The first category of methods is to directly construct a classification model
to identify the type and severity of the disease. For example, Esgario et al. [8] used two
CNNs in parallel to classify the disease type and severity of coffee leaves and achieved
an accuracy of 95.24% and 86.51%, respectively. Liang et al. [9] proposed the PD2SE-Net
for category recognition, disease classification, and severity estimation, and the accuracy
of disease severity estimation reached 91%. Hu et al. [10] improved the Faster R-CNN
model for detecting tea tree leaves and then used the VGG16 network to classify the disease
severity. Pan et al. [11] employed the Faster R-CNN model with VGG16 as the feature
network to extract strawberry leaf spots to form a new dataset and then used the Siamese
model to estimate the severity of strawberry burning. Dhiman et al. [12] classified the
severity of citrus diseases as high, medium, low, and healthy and used the optimized
VGGNet16 to classify the severity of diseased fruits, which achieved an accuracy of 97%.
The core of this type of method is to regard disease severity grading as a classification task
in order to establish a relationship between the disease severity and the samples using
an appropriate classification model. The advantage of such a method lies in the ease of
implementation, while the disadvantage is that the disease severity data in the dataset
is manually labelled, which involves a high degree of subjectivity and lacks a stringent
quantitative standard [13].
The second category of methods is to segment the diseased regions through semantic
segmentation first, and then calculate the ratio of the area of diseased regions to the total
area in order to estimate the disease severity. The nature of semantic segmentation is to
classify images pixel by pixel. Wspanialy et al. [14] used the improved U-Net to segment
nine tomato diseases in the PlantVillage tomato dataset and estimated the disease severity,
which had an error rate of 11.8%. Zhang et al. [15] constructed a CNN model, which
regarded cucumber downy mildew leaves with the background removed as the input, to
estimate the severity of cucumber downy mildew and achieved a high level of accuracy
(R2 = 0.9190). Gonçalves et al. [16] applied the multi-semantic segmentation method in
laboratory-acquired images to evaluate the disease severity. The results indicated that
DeepLab V3+ delivered the best performance in disease severity estimation. Lin et al. [17]
proposed a semantic segmentation model based on CNN, which was used to segment
cucumber powdery mildew images at the pixel level. The model achieved an average pixel
accuracy of 96.08%, an intersection over union (IoU) of 72.11%, and a Dice coefficient of 83.45% using
20 test samples. The advantage of this category of methods is that the classification criteria
are usually objective and clear, while the disadvantage is that the complexity of the image
background can seriously affect the segmentation accuracy.
In order to reduce the impact of complex backgrounds, Tassisa et al. [18] proposed
to use Mask R-CNN to identify the positions of coffee leaves in real production environ-
ments first, and then apply U-Net/PSPNet to segment the coffee leaves and disease spots
simultaneously. Li et al. [19] further improved the segmentation accuracy of the model
using a mixed attention mechanism that combined spatial attention and channel attention,
with support from transfer learning. The model was used to automatically estimate the
severity of cucumber leaf diseases under field conditions and achieved an R2 of 0.9578.
The above studies have, to some extent, reduced the interference of complex backgrounds
and achieved satisfactory results. However, segmenting leaves and disease spots simul-
taneously may affect the segmentation outcome of disease spots due to pixel imbalance,
leading to the omission of disease spots. In addition, none of these methods analyzed
the overlap of leaves in the collected images in actual production environments. In the
process of image collection, the target leaf often overlaps with other leaves, and similar
backgrounds can easily lead to the over-segmentation problem. In light of that, the grading
of disease severity requires the precise segmentation of leaves and disease spots; thus, we
proposed a two-stage segmentation method in this paper as follows:
(1) An image dataset for cucumber leaf segmentation was constructed, which contained
1140 diseased or healthy cucumber leaf images in complex backgrounds.
(2) In order to improve the accuracy of disease classification and disease severity grading,
we proposed a two-stage model. In the first stage, the diseased leaves were separated
from the background, and in the second stage, the diseased spots were separated from
the diseased leaves.
(3) In order to reduce the interference from overlapping leaves and minimize the phe-
nomenon of over-segmentation, the convolutional structure and Transformer were
used simultaneously for feature extraction in order to fuse the global and local fea-
tures. Thus, the information loss caused by down-sampling could be compensated to
optimize the effect of leaf edge segmentation.

2. Materials and Methods


2.1. Dataset
This study mainly focused on the task of cucumber disease segmentation in complex
environments. The image data were acquired from a self-collected dataset of the Xiaotang-
shan National Precision Agriculture Research Demonstration Base. Cucumber leaf images
were captured via a mobile phone (Huawei P20). Considering the diversity of lighting
conditions in practical applications, data were collected at three different time periods every
day during the planting season: morning (8:00–10:00), noon (12:00–14:00), and afternoon
(15:00–17:00). As shown in Figure 1 and Table 1, the collected images included five cate-
gories, namely healthy cucumber leaves, cucumber downy mildew, cucumber anthracnose,
cucumber powdery mildew, and cucumber virus disease. The dataset used to train leaf
segmentation consists of 1140 images. It is randomly divided into a training set and a test
set in a ratio of 8:2, with 912 images used for training and 227 images used for testing. It
should be noted that after a leaf is infected with powdery mildew or viral disease, the
boundary between the diseased area of the leaf and the healthy area of the leaf is blurred,
which makes it extremely difficult to label the boundary of the diseased spot. Therefore,
in the training stage of the lesion segmentation model, the dataset we used consists of
405 images, and the disease types include only downy mildew and anthracnose. Cucumber
leaf images were cropped from the original size of 2976 × 2976 pixels to 512 × 512 pixels.
The Labelme-5.0.0 software was used to label the leaves and disease spots at pixel level
and to generate mask images, as shown in Figure 2.
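The 8:2 split described above can be reproduced with a simple file-level partition; the sketch below is only illustrative, and the folder layout, file extension, and random seed are assumptions rather than the authors' actual setup.

```python
# Illustrative 8:2 train/test split of the leaf-segmentation images.
# The folder layout ("leaf_dataset/images"), the .jpg extension, and the seed
# are assumptions, not the authors' actual configuration.
import random
from pathlib import Path

random.seed(0)  # fixed seed so the split is reproducible

image_paths = sorted(Path("leaf_dataset/images").glob("*.jpg"))
random.shuffle(image_paths)

split = int(0.8 * len(image_paths))                      # 8:2 ratio as in the paper
train_imgs, test_imgs = image_paths[:split], image_paths[split:]
print(f"{len(train_imgs)} training images, {len(test_imgs)} test images")
```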
Figure 1. Samples of cucumber leaf images (cucumber downy mildew, cucumber anthracnose, cucumber powdery mildew, cucumber virus disease, and healthy leaf).

Table 1. Number of disease species.

Type of Disease              Quantity
Cucumber downy mildew        290
Cucumber anthracnose         115
Cucumber powdery mildew      185
Cucumber virus disease       336
Healthy leaf                 214
Total                        1140

Figure 2. Image segmentation labels.

2.2. Two-Stage Model
Due to interference from complex backgrounds and significant differences in the scale of disease spots, it is difficult to achieve the accurate segmentation of cucumber leaves and disease spots using a one-stage segmentation model. Therefore, a two-stage model was designed in this study to decompose a complex task into two simple subtasks. The proposed two-stage model, namely TUNet, consisted of TRNet and U-Net. In the first stage, TRNet was used to segment the target leaf from complex backgrounds. In the second stage, U-Net was used to further segment disease spots from the obtained target leaf. The advantage of two-stage segmentation is that the model only needs to focus on one type of target at each stage (leaf target in the first stage and disease spot target in the second stage). For the two different targets, semantic segmentation models with different structures were selected according to the specific needs of each target, so as to combine the advantages of the two models to improve the segmentation accuracy. The framework of the proposed two-stage model is shown in Figure 3. The structure and key algorithms of the two models used will be described in detail in the following sections.

Figure 3. TUNet framework diagram.
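To make the two-stage flow concrete, the following sketch shows how a first-stage leaf mask can be used to blank out the background before the second-stage lesion segmentation. The model objects (`leaf_model`, `lesion_model`) are placeholders for trained networks that return per-pixel class logits; this is an illustration of the pipeline, not the authors' implementation.

```python
# Minimal sketch of TUNet-style two-stage inference. Both models are assumed to
# output two-class (background vs. target) logits of shape (1, 2, H, W).
import torch

def two_stage_segment(image, leaf_model, lesion_model):
    """image: normalized float tensor of shape (1, 3, H, W)."""
    with torch.no_grad():
        # Stage 1: segment the target leaf from the complex background.
        leaf_mask = leaf_model(image).argmax(dim=1, keepdim=True).float()      # 1 = leaf pixel

        # Keep only the target leaf so stage 2 sees a simple background.
        leaf_only = image * leaf_mask

        # Stage 2: segment disease spots on the isolated leaf.
        lesion_mask = lesion_model(leaf_only).argmax(dim=1, keepdim=True).float()
    return leaf_mask, lesion_mask
```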
2.3. First-Stage Segmentation Model TRNet
In 2015, Long et al. [20] proposed the fully convolutional network (FCN) by removing the fully connected layer. This was the first time that a CNN realized pixel-by-pixel classification on images. The semantic segmentation networks that emerged later, such as the U-Net and DeepLab series, were all based on the FCN method [21–25]. CNN has a strong advantage in obtaining local features. It can forcibly capture this kind of local structure by utilizing local receptive fields, shared weights, and spatial subsampling [26]. In addition, the hierarchical structure of a convolutional kernel takes into account different levels of complexity in local spatial contexts, from simple low-level edges and textures to higher-order semantic patterns [27]. These strengths enable the CNN structure to effectively extract the local features of cucumber leaves in complex backgrounds. However, the CNN has significant limitations in extracting global features due to the loss of spatial resolution in multiple processes of down-sampling. In contrast, the Transformer, which can reflect complex spatial transformations and long-distance feature dependencies through a self-attention mechanism, has not only profoundly changed the NLP field but also provided new possibilities for the image field [28]. The emergence of the vision Transformer has greatly inspired semantic segmentation, and semantic segmentation models such as SETR have been proposed one after another [29,30]. The SETR model regards the segmentation task as a sequence-to-sequence prediction task, which uses a Transformer as the encoder. In each layer of the encoder, global context modeling is conducted so that the limitations of the CNN in long-distance relationship learning are resolved. Nonetheless, in the field of semantic segmentation, a pure Transformer is not flawless. During the process of using the Transformer for feature extraction, it is necessary to compress two-dimensional image patches into one-dimensional sequences. Unfortunately, the adjacent pixels in space are usually highly correlated, which may lead to the loss of structural information in the patches. Consequently, in the decoding stage, the detailed information cannot be effectively restored through up-sampling, resulting in poor segmentation results.
Therefore, we chose the TRNet model, which combines a Transformer and convolutional structures, to extract features in parallel. This method not only improved the extraction of global features but also maintained an excellent grasp of low-level details. Thus, it could effectively reduce the interference caused by the overlap of leaves and improve the accuracy of leaf segmentation. The encoder part of TRNet was composed of ResNet50 and a Transformer, as shown in Figure 4.

Figure 4. TRNet structure diagram.
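As a rough, simplified rendering of this parallel design (not the authors' exact TRNet), the sketch below runs a standard torchvision ResNet50 trunk and a small Transformer branch side by side and concatenates their feature maps at 1/16 resolution. Note that the standard ResNet50 stage used here yields 1024 channels at that scale, whereas the paper reports a 2048-channel CNN output (implying modified strides); the channel sizes, Transformer depth, and omission of position encoding are all simplifications.

```python
# Sketch of a parallel CNN + Transformer encoder in the spirit of TRNet.
# Channel counts, Transformer depth, and the missing position encoding are
# illustrative simplifications, not the paper's exact configuration.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ParallelEncoder(nn.Module):
    def __init__(self, embed_dim=768, depth=4, num_heads=8, patch=16):
        super().__init__()
        cnn = resnet50(weights=None)
        # Keep conv1 ... layer3 (output stride 16 in the standard ResNet50).
        self.cnn = nn.Sequential(*list(cnn.children())[:-3])        # -> (B, 1024, H/16, W/16)
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        b, _, h, w = x.shape
        local_feat = self.cnn(x)                                     # local features (CNN branch)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)      # (B, N, embed_dim) patch tokens
        tokens = self.transformer(tokens)                            # global features (Transformer branch)
        global_feat = tokens.transpose(1, 2).reshape(b, -1, h // 16, w // 16)
        return torch.cat([local_feat, global_feat], dim=1)           # channel-wise fusion

feats = ParallelEncoder()(torch.randn(1, 3, 512, 512))               # -> (1, 1792, 32, 32)
```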

2.3.1. ResNet50
In this paper, after taking into account the network performance and model size, ResNet50 was chosen as the network for extracting local features [31]. ResNet50 is an architecture based on multi-layer convolution and identity mapping, as shown in Figure 3. For a given image as the input, ResNet50 first conducts a convolution operation and a maximum pooling operation on this image. The subsequent operations consist of four stages, namely Stage 1, Stage 2, Stage 3, and Stage 4, which all start with a Conv Block, followed by different numbers of Identity Blocks. From Figures 4 and 5, it can be seen that each block contains three layers of convolution. The difference between Conv Block and Identity Block lies in that Conv Block uses a convolution kernel for dimensionality reduction at the residual jump, which can be expressed as:

H(x) = F(x) + x    (1)

H(x) = F(x) + W(x)    (2)

After each stage, the size of the image is reduced by half, while the number of channels doubles. The final output is X ∈ R^{(H/16) × (W/16) × 2048}.

Figure 5. Two types of residual modules: (a) Identity block and (b) Conv block.
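A minimal PyTorch rendering of the two bottleneck variants behind Equations (1) and (2) is given below; the layer widths and strides are illustrative and are not tied to a specific ResNet50 stage.

```python
# Bottleneck residual blocks corresponding to Equations (1) and (2):
# the Identity block adds the input x directly, while the Conv block applies a
# 1x1 projection W(x) on the shortcut so that the shapes match.
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, stride=1, project=False):
        super().__init__()
        self.residual = nn.Sequential(                                # F(x): three convolution layers
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        # Conv block: shortcut carries a 1x1 convolution W(x); Identity block: shortcut is x itself.
        self.shortcut = (
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False), nn.BatchNorm2d(out_ch))
            if project else nn.Identity()
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.residual(x) + self.shortcut(x))        # H(x) = F(x) + x  or  F(x) + W(x)

conv_block = Bottleneck(256, 128, 512, stride=2, project=True)       # Equation (2)
identity_block = Bottleneck(512, 128, 512)                           # Equation (1)
y = identity_block(conv_block(torch.randn(1, 256, 64, 64)))          # -> (1, 512, 32, 32)
```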

2.3.2. Transformer
As shown in Figure 3, since the Transformer module does not require three-dimensional image data, the input image first needs to be transformed into a vector sequence through an embedding layer. Considering that ResNet50 down-samples an input image 16 times, the input sequence length of the Transformer is designed to be (H/16) × (W/16) × C. The input image is divided into 1024 patches with a size of 16 × 16. Then, each patch is mapped into a one-dimensional vector through linear mapping, which is further processed using position coding. Subsequently, the obtained vector sequence is inputted into the Transformer Encoder for feature learning. From Figure 6, it can be seen that the Transformer Encoder mainly consists of an L-layer Multi-Head Attention (MSA) module and a multi-layer perceptron (MLP) module. As shown in Equations (3) and (4), after the input sequence passes through the l-th Transformer layer, the output can be obtained as follows:

Z′_l = Z_{l−1} + MSA(LN(Z_{l−1}))    (3)

Z_l = Z′_l + MLP(LN(Z′_l))    (4)

where the MSA operation is realized by projecting the concatenation of m SA operations, as shown in Equations (5)–(7):

MSA(Z_{l−1}) = Concat(SA_1(Z_{l−1}), SA_2(Z_{l−1}), ..., SA_m(Z_{l−1})) W^O    (5)

SA(Z_{l−1}) = (QK^T / √d) V    (6)

Q = Z_{l−1} W^Q, K = Z_{l−1} W^K, V = Z_{l−1} W^V    (7)

where W^O ∈ R^{md×C} and W^Q, W^K, W^V ∈ R^{C×d} are learnable parameters; d is the dimension of K. Finally, the features of the Transformer are projected onto the dimension of the number of categories, and the output is X ∈ R^{(H/16) × (W/16) × 768}.

Figure 6. Structure of Transformer Encoder.
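A compact PyTorch rendering of one such encoder layer is given below. It follows the residual updates of Equations (3) and (4); the attention module computes the usual softmax-scaled dot-product form of Equations (5)–(7) internally. The head count, MLP ratio, and layer count are illustrative choices, not the paper's settings.

```python
# Sketch of one Transformer encoder layer implementing Equations (3)-(7):
# Z'_l = Z_{l-1} + MSA(LN(Z_{l-1})),  Z_l = Z'_l + MLP(LN(Z'_l)).
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, dim=768, heads=8, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)   # Eq. (5)-(7) internally
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, z):                                    # z: (B, N, dim) token sequence
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]     # Eq. (3)
        z = z + self.mlp(self.ln2(z))                        # Eq. (4)
        return z

tokens = torch.randn(1, 1024, 768)                           # 1024 patches of a 512x512 image
out = nn.Sequential(*[EncoderLayer() for _ in range(4)])(tokens)   # L = 4 layers (illustrative)
```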

2.3.3. Decoder
The feature maps of the same scale but with different channels that are outputted
by ResNet50 and the Transformer are concatenated and then inputted into the decoder
part. The decoder adopts the naive structure in SETR, which consists of two layers of 1 × 1
convolution + BatchNorm + 1 × 1 convolution. The last 1 × 1 convolution maps each
component feature vector to the required number of categories. Then, bilinear interpolation
up-sampling is performed directly to obtain the output with the same resolution as the
original image, that is, X ∈ R^{H × W × num_cls}.
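A minimal version of this naive head might look as follows. The input channel count assumes the 2048-channel CNN features concatenated with the 768-channel Transformer features, the hidden width is a placeholder, and the ReLU between the two 1 × 1 convolutions is a common addition rather than something stated in the text.

```python
# Sketch of the SETR-style "naive" decoder head: 1x1 convolution + BatchNorm,
# a second 1x1 convolution to the number of classes, then direct bilinear
# up-sampling to the input resolution. Channel counts are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveHead(nn.Module):
    def __init__(self, in_ch=2816, hidden=256, num_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_classes, kernel_size=1),    # map to per-class scores
        )

    def forward(self, feats, out_size):
        logits = self.head(feats)                             # (B, num_classes, H/16, W/16)
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)

logits = NaiveHead()(torch.randn(1, 2816, 32, 32), out_size=(512, 512))   # -> (1, 2, 512, 512)
```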

2.4. Second-Stage Segmentation Model U-Net


In the second stage, considering the multi-scale nature of the disease spots and a rela-
tively small sample size of the dataset, we chose the U-Net structure, which uses ResNet50
as the backbone, to segment disease spots (see Figure 7 for the network structure). U-Net is
a model based on an encoder-decoder structure, consisting of two parts, where ResNet50
serves as the encoder (details in Section 2.3.1). The structure of U-Net is symmetric. After
down-sampling an image 32 times, up-sampling is conducted. After each round of up-
sampling, the sample is fused at the same scale as the number of channels corresponding
to the feature extraction part. Since the feature map at the top level of the network has a
smaller down-sampling factor, more details can be retained, resulting in a more detailed
feature map. On the contrary, the feature map at the bottom of the network loses a lot
of information during the down-sampling process due to a larger down-sampling factor,
resulting in a significant spatial loss. However, this also results in a high concentration
of information, which is conducive to the determination of the target region in order to
effectively retain the detailed information in the image.
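The fusion described here (up-sample, then merge with the encoder feature map of the same scale) is the standard U-Net decoder step; a minimal sketch with illustrative channel sizes is shown below.

```python
# One U-Net decoder step as described above: up-sample the deeper feature map,
# concatenate it with the encoder (skip) feature map of the same spatial scale,
# then fuse with convolutions. Channel sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([x, skip], dim=1))         # channel-wise fusion with the skip

deep = torch.randn(1, 2048, 16, 16)      # e.g., deepest encoder output for a 512x512 input
skip = torch.randn(1, 1024, 32, 32)      # same-scale encoder feature map
out = UpBlock(2048, 1024, 512)(deep, skip)   # -> (1, 512, 32, 32)
```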

Figure 7. U-Net network structure diagram.

2.5. Disease Severity Grading Method
Steps for Disease Severity Grading. Cucumber diseases of different severity levels require different prevention and control methods and different amounts of pesticide usage in order to avoid affecting the ecological environment and food safety due to excessive pesticide spraying or ineffective disease control due to insufficient pesticide usage. In this study, we calculated the ratio of the total area of disease spots on each leaf to the area of the entire leaf by segmenting the target leaf and the disease spots separately, which was used as the basis of disease severity grading. The specific steps are as follows:
Step 1: The leaf and complex backgrounds in the image are considered as the targets.
Then, the complex backgrounds in the manually labeled mask image are removed to obtain
a complete leaf.
Step 2: The mask image obtained in Step 1 is taken as the input of the second stage,
which is segmented to obtain disease spots.
Step 3: The ratio of the total area of disease spots to the area of the entire leaf is
calculated according to Equation (8). Then, this ratio is compared with the disease severity
grading standard to derive the final grading result.

p = (S_Disease / S_Leaf) × 100%    (8)

where S_Leaf refers to the area of the leaf after segmentation; S_Disease refers to the total area
of disease spots after segmentation; and p refers to the ratio of the total area of disease spots
to the area of the entire leaf.
Disease Severity Grading Standard. Referring to the relevant disease severity grading
standards and suggestions from plant protection experts, the severity of cucumber diseases
was classified into five levels in this study, as shown in Table 2.

Table 2. Disease severity grading standard.

Disease Grade Proportion of Disease Spots


Level 0 p = 0%
Level 1 0 < p ≤ 5%
Level 2 5% < p ≤ 10%
Level 3 10% < p ≤ 25%
Level 4 25% < p ≤ 50%
Level 5 50% < p ≤ 100%
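Equation (8) and the thresholds of Table 2 translate directly into a small post-processing routine. The sketch below assumes binary NumPy masks in which 1 marks leaf pixels (first stage) and lesion pixels (second stage), respectively; it is an illustration, not the authors' code.

```python
# Disease severity grading from the two binary masks, following Equation (8)
# and the thresholds of Table 2.
import numpy as np

def severity_level(leaf_mask: np.ndarray, lesion_mask: np.ndarray):
    s_leaf = leaf_mask.sum()                       # S_Leaf: leaf pixel area
    s_disease = lesion_mask.sum()                  # S_Disease: total lesion pixel area
    p = 100.0 * s_disease / max(s_leaf, 1)         # Equation (8), in percent
    thresholds = [(0, 0), (5, 1), (10, 2), (25, 3), (50, 4), (100, 5)]   # Table 2
    for upper, level in thresholds:
        if p <= upper:
            return level, p
    return 5, p

level, ratio = severity_level(np.ones((512, 512)), np.zeros((512, 512)))
print(level, ratio)    # -> 0 0.0 (a healthy leaf is Level 0)
```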

3. Experiment and Analysis


The hardware configuration for training and testing was as follows: Intel Xeon (R)
Gold 6248 CPU @ 3.00 GHz × 96; 256 GB Memory; and NVIDIA GeForce RTX 3090
Graphics Card. The software configuration was as follows: 64-bit Ubuntu 20.04.5 LTS
operating system; CUDA Version 11.4; and Pytorch Version 1.13.0. In order to ensure a
fair comparison, the hyperparameters of each network were uniformly configured. After
repeated trial and error, the hyperparameters were determined as shown in Table 3.

Table 3. Configuration of hyperparameters.

Learning rate    1 × 10−4
Epoch            200
Batch size       4
Optimizer        Adam
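For reference, the hyperparameters in Table 3 correspond to a conventional PyTorch training loop such as the one sketched below; the model, dataset, and loss here are placeholders, and only the optimizer, learning rate, batch size, and epoch count follow the table.

```python
# Training loop matching Table 3 (Adam, learning rate 1e-4, batch size 4, 200 epochs).
# The one-layer "model", dummy dataset, and cross-entropy loss are placeholders.
import torch
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv2d(3, 2, kernel_size=1).to(device)        # placeholder segmentation model
dataset = [(torch.randn(3, 512, 512), torch.zeros(512, 512, dtype=torch.long))] * 8
loader = DataLoader(dataset, batch_size=4, shuffle=True)        # Table 3: batch size 4
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)       # Table 3: Adam, lr = 1e-4
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(200):                                        # Table 3: 200 epochs
    for images, masks in loader:
        optimizer.zero_grad()
        loss = criterion(model(images.to(device)), masks.to(device))
        loss.backward()
        optimizer.step()
```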

3.1. Evaluation Indicators


The performance of the model is evaluated using four evaluation metrics: Pixel
Accuracy (PA), IoU, Dice, and Recall. Pixel accuracy represents the proportion of correctly
predicted pixels to the total pixels. IoU is used to calculate the ratio of the intersection and
Agronomy 2024, 14, 72 10 of 19

union of the two sets of true values and predicted values for each category. The calculation
of PA and IoU is as follows:
PA = (∑_{i=0}^{k} P_ii) / (∑_{i=0}^{k} ∑_{j=0}^{k} P_ij)    (9)

IoU = P_ii / (∑_{j=0}^{k} P_ij + ∑_{j=0}^{k} P_ji − P_ii)    (10)

where P_ij refers to the total number of i pixels predicted as j pixels; P_ii refers to the total
number of i pixels predicted as i pixels, i.e., the total number of correctly classified pixels.
The k value for each stage in the two-stage model is 1. Specifically, in the first stage of the
two-stage model, k = 1 represents leaf, while in the second stage, it represents lesion.
Dice is usually used to calculate the similarity between two samples, and the value
range is [0, 1]. A dice value close to 1 indicates that the set similarity is high, that is, the
segmentation effect between the target and the background is better. A dice value close to 0
indicates that the target cannot be effectively segmented from the background. Recall is the
ratio between the number of samples correctly predicted as positive classes and the total
number of positive classes. Dice and Recall are calculated as follows:

Dice = (2 × TP) / (FN + FP + 2 × TP)    (11)

Recall = TP / (FN + TP)    (12)
where TP represents the true positive example, FP represents the false positive example,
and FN represents the false negative example.
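For the binary case used in each stage (k = 1), the four metrics can be computed directly from the predicted and ground-truth masks, as in the sketch below (NumPy masks with 1 = target class and 0 = background are assumed).

```python
# Binary-case computation of PA, IoU, Dice, and Recall from Equations (9)-(12).
import numpy as np

def metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    tp = np.sum((pred == 1) & (gt == 1))          # true positives
    fp = np.sum((pred == 1) & (gt == 0))          # false positives
    fn = np.sum((pred == 0) & (gt == 1))          # false negatives
    tn = np.sum((pred == 0) & (gt == 0))          # true negatives
    return {
        "PA": (tp + tn) / pred.size,              # Eq. (9): correctly predicted pixels / all pixels
        "IoU": tp / (tp + fp + fn + 1e-8),        # Eq. (10), for the target class
        "Dice": 2 * tp / (2 * tp + fp + fn + 1e-8),   # Eq. (11)
        "Recall": tp / (tp + fn + 1e-8),          # Eq. (12)
    }

print(metrics(np.array([[1, 0], [1, 1]]), np.array([[1, 1], [0, 1]])))
```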

3.2. Comparison of Different Segmentation Models


To verify the effectiveness of TRNet and U-Net(ResNet50), the U-Net, U-Net(MobileNet),
DeepLabV3+(ResNet50), DeepLabV3+(MobileNet), SETR, and PSPNet(ResNet50) were
chosen as control models for the first stage and second stage in this study, and comparisons
of the results are shown in Tables 4 and 5 [32,33]. All the above models were implemented
on the created dataset. The weight file with the best training effect was saved and used for
testing, and the mask image acquired from the test was extracted and put onto the original
image to obtain the segmentation result. The quantitative results were tabulated in a table
format and were visualized in the form of renderings.

Table 4. Comparison of first stage model.

First Stage Model PA% IoU% Dice% Recall%


DeepLabV3+(ResNet50) 92.90 92.48 95.63 96.02
DeepLabV3+(MobileNet) 92.78 92.22 95.50 95.67
U-Net 91.91 89.35 93.67 95.38
U-Net(ResNet50) 92.86 92.23 95.50 95.67
U-Net(MobileNet) 92.94 89.35 93.69 95.38
PSPNet(ResNet50) 92.88 92.36 95.48 95.90
TRNet(ours) 93.94 94.20 96.98 96.91
SETR 91.56 89.65 93.99 95.65

Disease images collected in a production environment have problems with overlap-


ping leaves and complex backgrounds, which makes it difficult to separate leaves from the
background. In order for the model to accurately segment the target leaf, the model must
take into account the global features while paying attention to the local features. TRNet
combines the advantages of the Transformer and the convolutional neural network. The
Transformer’s ability to control the global features allows the model to better focus on the
entire image, improves the attention weight of the target leaves, and reduces segmentation
errors caused by complex backgrounds. At the same time, the focus on local features makes
TRNet equally sensitive to detailed features in the target leaves. Therefore, the TRNet
network achieves the best segmentation performance of leaves and lesions, with a PA of
93.94%, an IoU of 96.86%, a Dice coefficient of 72.25%, and a Recall of 98.60%. Compared
with the SETR model using the Transformer as the encoder, the PA was improved by 2.38%,
the IoU was improved by 4.25%, the Dice coefficient was improved by 1.13%, and the Recall
was improved by 2.46%. Among the segmentation networks using convolutional networks
as encoders, DeepLabV3+(ResNet50) achieved the highest metrics, which were 92.90%,
95.49%, 71.65%, and 97.42% for the PA, IoU, Dice coefficient, and Recall, respectively. The
PA, IoU, Dice coefficient, and Recall for TRNet increased by 1.04%, 1.37%, 0.6% and 1.18%,
respectively, compared to DeepLabV3+(ResNet50). It can be seen that the segmentation
performance of TRNet was significantly improved. It further shows that the combination
of the Transformer and the CNN was effective.

Table 5. Comparison of second stage model.

Second Stage Model IoU% Dice% Recall%


DeepLabV3+(ResNet50) 37.94 53.36 71.68
DeepLabV3+(MobileNet) 45.21 57.36 55.07
U-Net 49.64 65.00 66.01
U-Net(ResNet50) 52.51 68.14 73.46
U-Net(MobileNet) 48.40 64.16 69.20
PSPNet(ResNet50) 50.81 66.44 65.87
TRNet(ours) 52.33 67.87 68.44
SETR 44.47 60.26 58.83

In the second-stage task, the model needed to extract complete disease spots from the
target leaf, which required the model to extract finer features. Since ResNet50 is deeper
and wider than the original U-Net encoder, it can extract more comprehensive disease
spot information. Therefore, in the fine segmentation of lesions, U-Net, using ResNet50
as the feature extraction network, achieved an optimal performance, with the IoU, Dice
coefficient, and Recall reaching 52.52%, 68.14%, and 73.46% respectively, which are better
results than those obtained with the original U-Net. The improvements in the IoU, Dice
coefficient, and Recall were 2.87%, 3.14%, and 7.45%, respectively, which were 8.04%, 7.88%,
and 14.63% higher than the Transformer-based SETR network. The proposed TRNet model
had a slight negative impact on the fine segmentation of lesions because the Transformer
branch extracted global features, so the indicators of this model were slightly lower than
U-Net (ResNet50).
To further demonstrate the superiority of TRNet and U-Net(ResNet50), we visualized
the first-stage and second-stage segmentation results, as shown in Figure 8. It can be seen
that, in the first stage, models based on the CNN could completely segment the target leaf
but were inevitably affected by complex backgrounds, resulting in over-segmentation, more
or less. The SETR model, which is purely based on the Transformer as the feature extractor,
was obviously less affected by overlapping leaves. This is largely because that Transformer
mainly focuses on global features. On the other hand, the SETR model was significantly
weaker than CNN-based models in extracting local features of the cucumber leaf. TRNet,
which combines the advantages of both, could more completely segment the target leaf
from complex backgrounds and had less interference from environmental factors.
In the second stage, the image containing disease spots has a simple background with-
out external interference such that the attention to local features becomes more important.
Except for the original U-Net and U-Net(ResNet50), all the other models mistakenly seg-
mented the connection between two adjusted disease spots, while U-Net also ignored some
minor disease spots. It can be seen that the U-Net model had a significant advantage in
fusing multi-scale features for the segmentation of small disease spots. Moreover, ResNet50,
as a feature extractor, could provide the precise extraction of the local features. Overall, it
was found that TRNet and U-Net(ResNet50) achieved the best performance on the test set
compared with the control models. Therefore, the latter part of this paper focuses on the
fusion of these two models.

Figure 8. Visualization of model segmentation (panels: original image, TRNet, DeepLabv3+(ResNet50), DeepLabv3+(MobileNet), U-Net, U-Net(ResNet50), U-Net(MobileNet), PSPNet(ResNet50), SETR).

3.3. Comparison of Model Fusion Methods


In this paper, the method we proposed consisted of the segmentation of the complete
leaf from the complex backgrounds first, followed by the segmentation of disease spots from
a target leaf that is in a simple background to eventually achieve disease severity grading.
The intention of two-stage segmentation was not only to remove complex interference
factors but also to utilize the complementary advantages of different models to improve
the segmentation accuracy. Therefore, the fusion of appropriate models was crucial. In this
study, TRNet and U-Net(ResNet50), which delivered the best performance in the first and
second stages, respectively, were used for leaf segmentation, and the extracted mask map
was further processed to segment the disease spots. To verify the advantage of the TUNet
model, we also chose the models delivering the second-best performance in the first stage
and second stage, i.e., DeepLabV3+(ResNet50) and TRNet, and fused them with the best
performers. In the end, four combination schemes were formed for comparative analysis.
As shown in Table 6, Scheme 1 used TRNet for the segmentation in both stages.
Scheme 2 used TRNet in the first stage and U-Net(ResNet50) in the second stage. Scheme 3
used DeepLabV3+(ResNet50) in the first stage and TRNet in the second stage. Scheme 4
used DeepLabV3+(ResNet50) in the first stage and U-Net(ResNet50) in the second stage.

Table 6. Model fusion schemes.

Scheme First-Stage Model Second-Stage Model


1 TRNet TRNet
2 TRNet U-Net(ResNet50)
3 DeepLabV3+(ResNet50) TRNet
4 DeepLabV3+(ResNet50) U-Net(ResNet50)

A comparison of the results is shown in Table 7. It can be seen that the indicators
of the fusion model on these two categories were similar, which is because the lesions
of cucumber downy mildew and cucumber anthracnose were similar. Among the two
diseases, the performance of Scheme 1 was slightly better than Scheme 3 and Scheme 4.
This is because TRNet was used as the first-stage model, so the leaf segmentation was more accurate.
Scheme 2 outperformed all other fusion schemes and performed better on all metrics (PA,
IoU, Dice coefficient, and Recall). It was also noted that all the indicators of Scheme 1, Scheme
3, and Scheme 4 were lower than they were before fusion, and only Scheme 2 yielded
higher values for all the indicators after fusion compared to before. In contrast to the declining
indicators of the other combinations, Scheme 2 fully reflected the integrated advantages of the
two models.

Table 7. Results of model fusion schemes.

Scheme    Cucumber Downy Mildew (IoU% / Dice% / Recall%)    Cucumber Anthracnose (IoU% / Dice% / Recall%)
1         51.24 / 67.34 / 67.34                             52.08 / 67.98 / 69.44
2         54.12 / 68.79 / 75.97                             54.44 / 68.89 / 74.34
3         51.03 / 67.11 / 67.06                             51.94 / 67.32 / 69.28
4         52.33 / 67.97 / 73.95                             52.02 / 67.64 / 72.37

The segmentation results of the various fusion schemes are shown in Figure 9. It can
be seen that Scheme 3 and Scheme 4, which used DeepLabV3+ for segmentation in the first
stage, mistakenly segmented some leaves with similar colors as the target leaf, resulting in
the segmentation of disease spots from non-target leaves in the second stage. Therefore, the
final accuracy was reduced. For Scheme 1 and Scheme 2, the TRNet model performed well
in the first stage and fully segmented the contour of the target leaf. However, for disease
spots of varying sizes, the multi-scale segmentation of U-Net apparently outperformed
other schemes. Based on the advantages and disadvantages of the four schemes and
the actual production needs, Scheme 2 was ultimately chosen as the cucumber disease
segmentation model in this study.

Figure 9. The results of the fusion schemes are visualized (panels: ground truth, Scheme 1, Scheme 2, Scheme 3, Scheme 4).

3.4. Two-Stage Model
Considering that segmenting leaves and disease spots from complex backgrounds simultaneously with a one-stage model is extremely challenging, we proposed a two-stage segmentation method in this paper. Specifically, the purpose of the first stage was to remove complex backgrounds, and the purpose of the second stage was to segment the disease spots under a simple background. In order to verify the improvement of the proposed two-stage model as compared to one-stage segmentation, we chose U-Net(ResNet50), which delivered the best performance for disease spot segmentation, to extract disease spots from complex and simple backgrounds.
The segmentation results are shown in Figure 10. It can be seen that the results obtained from two-stage segmentation were far better than those obtained from one-stage segmentation. In the three images shown in Figure 10, some disease spots on non-target leaves were mistakenly segmented using one-stage segmentation. This is because one-stage segmentation does not remove confounding factors, such as overlapping leaves, before disease spot segmentation, leading to poor segmentation results. Therefore, in this study, a two-stage model was proposed for disease severity grading in order to guarantee a high classification accuracy.

Figure 10. Comparison of segmentation results between one-stage and two-stage models (columns: original images, one-stage, two-stage).

3.5. Disease Severity Grading
At present, there is no unified standard for the severity grading of cucumber downy mildew. According to the relevant literature, commonly used methods for the severity grading of cucumber downy mildew are mainly based on (1) the ratio of the total area of disease spots to the area of the entire leaf and (2) the number of disease spots per unit leaf area. In this study, the first method was adopted. The disease severity was divided into five levels, as detailed in Section 3.4. Figure 11 shows the images of cucumber downy mildew and cucumber anthracnose from severity Level 1 to Level 5.

Figure 11. Example plot of downy mildew and anthracnose severity ratings in cucumbers (rows: cucumber downy mildew, cucumber anthracnose; columns: Level 1 to Level 5).

We used TRNet and U-Net to segment the target leaf and disease spots, respectively, and calculated the ratio of the pixel area of disease spots to the pixel area of the leaf. Then, the severity of cucumber downy mildew and cucumber anthracnose was graded according to the specified grading standard. In this study, 90 cucumber downy mildew images and 94 cucumber anthracnose images were selected as test objects, and the predicted disease severity was compared with the manually labelled severity to evaluate the classification accuracy of the model. The experimental results are shown in Tables 8 and 9. It can be seen from Table 8 that the classification accuracy of cucumber downy mildew from
Levels 1, 2, 3, 4, and 5 was 100.00%, 100.00%, 94.44%, 92.31%, and 85.71%, respectively,
with an average accuracy of 94.49%. According to Table 9, the classification accuracy of
cucumber anthracnose from Levels 1, 2, 3, 4, and 5 was 100%, 96%, 100.00%, 92.85% and
83.33%, respectively, with an average accuracy of 94.43%. In general, the model had a high
prediction accuracy for disease severity for Levels 1 to 3 but performed suboptimally for
Levels 4 and 5. This is because the edges of leaves with Level 4–5 cucumber downy mildew
or cucumber anthracnose were mostly withered, and the model might recognize such edges
as background factors in the first-stage segmentation, resulting in a lower accuracy.

Table 8. Disease severity grading results of cucumber downy mildew.

Disease Grade Number of Datasets Correct Grading Accuracy/%


Level 1 23 23 100.00
Level 2 22 22 100.00
Level 3 18 17 94.44
Level 4 13 12 92.31
Level 5 14 12 85.71

Table 9. Disease severity grading results of cucumber anthracnose.

Disease Grade Number of Datasets Correct Grading Accuracy/%


Level 1 22 22 100.00
Level 2 25 24 96.00
Level 3 21 21 100.00
Level 4 14 13 92.85
Level 5 12 10 83.33

A comparison of the results of the proposed model TUNet and the existing models is
shown in Table 10. Ref. [34] uses the two-stage method DUNet to segment diseased leaves
and lesions, and Ref. [13] uses an improved U-Net model to simultaneously segment leaves
and lesions. As can be seen in Table 10, TUNet has a higher accuracy in disease severity
grading compared to Ref. [34]. The one-stage model in Ref. [13] has a speed advantage, but
the accuracy is much lower than the two-stage model.

Table 10. Comparison of results of cucumber downy mildew and anthracnose grading.

Paper Level 1 Level 2 Level 3 Level 4 Level 5


DUNet [34] 91.10% 95.72% 92.06% 89.01% 80.95%
BLSNet [13] 88.93% 87.45% 89.68% 81.31% 73.21%
TUNet 100% 98% 97.22% 92.58% 84.52%

As can be seen in Figure 12, both Refs. [13,34] have problems with over-segmentation, that is, the lesions on the edge of the leaves are classified as background, resulting in an incorrect classification of disease severity. DUNet failed to segment lesions due to the incorrect segmentation of leaves in the first stage, resulting in an incorrect input in the second stage, which illustrates the importance of the first-stage model in the two-stage method. Our method adds global features to the first-stage model for context modeling so that it can correctly determine whether the edge lesion is part of the leaf, thus avoiding the over-segmentation problem. However, TUNet still has shortcomings in the segmentation of small lesions, which needs further improvement.

Figure 12. A comparison of the results of segmentation (ground truth: proportion of lesion area 7.10%, severity level 2; TUNet: 6.94%, level 2; DUNet: 0.045%, level 1; BLSNet: 0.072%, level 1).

4. Conclusions
This paper proposed a two-stage model, namely TUNet, for grading the severity of
cucumber leaf diseases. The proposed model consisted of two segmentation networks,
TRNet and U-Net. In the first stage, we chose TRNet to extract the target cucumber leaf
from the image. The TRNet network uses both a convolutional structure and a Transformer
to extract image features, so it can compensate for the loss of global information caused by down-sampling
in the convolutional structure. The combination of global and local information not only
improved the segmentation accuracy of the target leaf but also effectively reduced the
impact of complex backgrounds on the segmentation task. Then, the segmented leaf image
with a simple background was used as the input of the second-stage segmentation, and
U-Net, which uses ResNet50 as the backbone network, was chosen to extract disease spots
from the image. We found that when ResNet50 was used as the backbone network, it
could accurately detect and segment very small objects, which is conducive to disease
spot segmentation. Further, we compared these two models with several classic models.
The experimental results showed that these two networks outperformed other models in
leaf segmentation and disease spot segmentation, and the fusion of the two yielded more
effective results. Finally, the cucumber disease severity was graded by calculating the ratio
of the total area of disease spots to the area of the entire leaf. The results showed that the
two-stage model proposed in this study performed well with regard to the grading of the
severity of cucumber downy mildew and cucumber anthracnose under real production
environments. It is worth noting that our approach also has limitations. First, the proposed
TRNet model has a relatively long inference time, which cannot be ignored. Therefore, future
research should focus on lightweight model structures in order to shorten the segmentation
time. Second, in addition to the proportion of the lesion area, disease severity classification
should also consider the color of the lesions and whether the diseased leaves are perforated.
A more accurate classification of disease severity therefore requires a comprehensive
consideration of all of these factors.
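
As a concrete illustration of the ratio-based grading described above, the short sketch below maps
the lesion-to-leaf area ratio to a severity level. The threshold values (5%, 10%, 25%, and 50%)
are placeholders chosen only for illustration and are not the grading standard adopted in
this study.

```python
def grade_severity(lesion_ratio, thresholds=(0.05, 0.10, 0.25, 0.50)):
    """Map the lesion-to-leaf area ratio to a severity level (1-5).

    NOTE: the thresholds are illustrative placeholders, not the grading
    standard used in this study.
    """
    for level, upper in enumerate(thresholds, start=1):
        if lesion_ratio <= upper:
            return level
    return len(thresholds) + 1  # ratio above the last threshold -> level 5


# Example: 6.94% lesion coverage (cf. the TUNet panel in Figure 12)
print(grade_severity(0.0694))  # -> 2 under these placeholder thresholds
```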

Author Contributions: H.Y.: Writing—Original draft preparation; C.W.: Methodology; L.Z.:
Writing—Reviewing and Editing; J.L.: Investigation; B.L.: Software; F.L.: Data curation. All
authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China under
grant number 62106065 and, in part, by the Natural Science Foundation of Hebei province under
grant number F2022204004.
Data Availability Statement: The data presented in this study are available from the corresponding author upon request.
Acknowledgments: We are grateful to our colleagues at the Hebei Key Laboratory of Agricultural
Big Data, the National Engineering Research Center for Information Technology in Agriculture, and
the Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, for their help
and input, without which this study would not have been possible.
Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Food and Agriculture Organization of the United Nations. Food and Agriculture Data. Available online: http://www.fao.org/
faostat/en/#home (accessed on 15 July 2021).
2. Yang, L.; Yu, X.; Zhang, S.; Long, H.; Zhang, H.; Xu, S.; Liao, Y. GoogLeNet based on residual network and attention mechanism
identification of rice leaf diseases. Comput. Electron. Agric. 2023, 204, 107543. [CrossRef]
3. Gulzar, Y. Fruit image classification model based on MobileNetV2 with deep transfer learning technique. Sustainability 2023,
15, 1906. [CrossRef]
4. Kabir, M.M.; Ohi, A.Q.; Mridha, M.F. A Multi-Plant Disease Diagnosis Method Using Convolutional Neural Network. arXiv 2020,
arXiv:2011.05151.
5. Zhang, S.; Zhang, C. Modified U-Net for plant diseased leaf image segmentation. Comput. Electron. Agric. 2023, 204, 107511.
[CrossRef]
6. Bhagat, S.; Kokare, M.; Haswani, V.; Hambarde, P.; Kamble, R. Eff-UNet++: A novel architecture for plant leaf segmentation and
counting. Ecol. Inform. 2022, 68, 101583. [CrossRef]
7. Gulzar, Y.; Ünal, Z.; Aktaş, H.; Mir, M.S. Harnessing the Power of Transfer Learning in Sunflower Disease Detection: A
Comparative Study. Agriculture 2023, 13, 1479. [CrossRef]
8. Esgario, J.G.M.; Krohling, R.A.; Ventura, J.A. Deep learning for classification and severity estimation of coffee leaf biotic stress.
Comput. Electron. Agric. 2020, 169, 105162. [CrossRef]
9. Liang, Q.; Xiang, S.; Hu, Y.; Coppola, G.; Zhang, D.; Sun, W. PD2SE-Net: Computer-assisted plant disease diagnosis and severity
estimation network. Comput. Electron. Agric. 2019, 157, 518–529. [CrossRef]
10. Hu, G.; Wang, H.; Zhang, Y.; Wan, M. Detection and severity analysis of tea leaf blight based on deep learning. Comput. Electr.
Eng. 2021, 90, 107023. [CrossRef]
11. Pan, J.; Xia, L.; Wu, Q.; Guo, Y.; Chen, Y.; Tian, X. Automatic strawberry leaf scorch severity estimation via faster R-CNN and
few-shot learning. Ecol. Inform. 2022, 70, 101706. [CrossRef]
12. Dhiman, P.; Kukreja, V.; Manoharan, P.; Kaur, A.; Kamruzzaman, M.M.; Dhaou, I.B.; Iwendi, C. A Novel Deep Learning Model for
Detection of Severity Level of the Disease in Citrus Fruits. Electronics 2022, 11, 495. [CrossRef]
13. Chen, S.; Zhang, K.; Zhao, Y.; Sun, Y.; Ban, W.; Chen, Y.; Zhuang, H.; Zhang, X.; Liu, J.; Yang, T. An Approach for Rice Bacterial
Leaf Streak Disease Segmentation and Disease Severity Estimation. Agriculture 2021, 11, 420. [CrossRef]
14. Wspanialy, P.; Moussa, M. A detection and severity estimation system for generic diseases of tomato greenhouse plants. Comput.
Electron. Agric. 2020, 178, 105701. [CrossRef]
15. Zhang, L.-x.; Tian, X.; Li, Y.-x.; Chen, Y.-q.; Chen, Y.-y.; Ma, J.-c. Estimation of Disease Severity for Downy Mildew of Greenhouse
Cucumber Based on Visible Spectral and Machine Learning. Spectrosc. Spectr. Anal. 2020, 40, 227–232.
16. Gonçalves, J.P.; Pinto, F.A.C.; Queiroz, D.M.; Villar, F.M.M.; Barbedo, J.G.A.; Del Ponte, E.M. Deep learning architectures for
semantic segmentation and automatic estimation of severity of foliar symptoms caused by diseases or pests. Biosyst. Eng. 2021,
210, 129–142. [CrossRef]
17. Lin, K.; Gong, L.; Huang, Y.; Liu, C.; Pan, J. Deep Learning-Based Segmentation and Quantification of Cucumber Powdery
Mildew Using Convolutional Neural Network. Front. Plant Sci. 2019, 10, 155. [CrossRef] [PubMed]
18. Tassis, L.M.; de Souza, J.E.T.; Krohling, R.A. A deep learning approach combining instance and semantic segmentation to identify
diseases and pests of coffee leaves from in-field images. Comput. Electron. Agric. 2022, 193, 106732. [CrossRef]
19. Li, K.; Zhang, L.; Li, B.; Li, S.; Ma, J. Attention-optimized DeepLab V3 + for automatic estimation of cucumber disease severity.
Plant Methods 2022, 18, 109. [CrossRef]
20. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. arXiv 2015, arXiv:1411.4038.
21. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International
Conference on Medical Image Computing and Computer-Assisted Intervention, Proceedings of the Medical Image Computing and Computer-
Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland; Volume 9351, pp. 234–241.
22. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets
and Fully Connected CRFs. arXiv 2016, arXiv:1412.7062.
23. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep
Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv 2017, arXiv:1606.00915. [CrossRef] [PubMed]
24. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017,
arXiv:1706.05587.
25. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic
Image Segmentation. arXiv 2018, arXiv:1802.02611.
26. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998,
86, 2278–2324. [CrossRef]
27. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing Convolutions to Vision Transformers. arXiv
2021, arXiv:2103.15808.
28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need.
arXiv 2017, arXiv:1706.03762.
29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
30. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic
Segmentation from a Sequence-to-Sequence Perspective with Transformers. arXiv 2020, arXiv:2012.15840.
31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
32. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient
Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
33. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. arXiv 2017, arXiv:1612.01105.
34. Wang, C.; Du, P.; Wu, H.; Li, J.; Zhao, C.; Zhu, H. A cucumber leaf disease severity classification method based on the fusion of
DeepLabV3+ and U-Net. Comput. Electron. Agric. 2021, 189, 106373. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
